Automatic conversation systems require large quantities of data to learn task-specific language patterns and underlying conversation policies. Such data either come from human-to-human conversation logs Lowe et al. (2015); Hardalov et al. (2018) or are collected in crowd-sourced environments, where two or more crowd-workers play specific roles under some guidelines Zhang et al. (2018); Budzianowski et al. (2018). Since real human-to-human conversation logs are scarce, many datasets have been created using the latter approach. However, crowd-sourced conversation data collection is time-consuming and costly, and it presents multiple challenges in ensuring data quality Kang et al. (2018).
Conversation summarization is an emerging research area that remains understudied due to the lack of large-scale datasets. Most existing public datasets in this domain are small; for example, the AMI meeting corpus McCowan et al. (2005) contains summary transcripts. CRD3 Rameshkumar and Bailey (2020) is a spoken conversation dataset that consists of conversations and summaries. Samsum Gliwa et al. (2019), the only large-scale dataset for conversation summarization, contains over open-domain conversations and summaries created artificially by humans.
Large-scale pre-trained language models (PLMs) Lewis et al. (2020); Brown et al. (2020); Raffel et al. (2020) have been used in various text generation tasks Budzianowski and Vulić (2019); Min et al. (2020); Cachola et al. (2020). In recent studies, PLMs are used to generate training data for natural language processing (NLP) applications. For example, Anaby-Tavor et al. (2020); Yang et al. (2020) use PLMs to create paraphrases for intent classifiers in conversation systems, and show that, when the original datasets are augmented with the generated data, performance improves. More recently, Mohapatra et al. (2020) generated entire conversations grounded on instructions that are provided to crowd-workers using a modular approach, where different PLMs are trained for different roles.
We investigate how PLMs can be utilized to generate entire conversations that are grounded on a given summary. We explore three approaches: (1) Supervised Learning (SL) based conversation generation (SL-Gen), where a PLM is trained to generate an entire conversation, taking the summary of a conversation as input; (2) Reinforcement Learning (RL) based conversation generation (RL-Gen), where we further improve the SL-Gen method using the quality of the generated conversations as a reward; and (3) Controlled turn-by-turn conversation generation (CN-Gen), which allows us to generate conversations turn-by-turn, constrained on the summary and a set of pre-defined control parameters. We evaluate the quality of the generated conversations by conducting automatic and human evaluation. We also show that once a conversation summarization dataset is augmented with the generated conversations, the performance of the downstream summarization task is improved.
2 Summary grounded conversation generation
In the conversation summarization task, a model takes a conversation as input and learns to generate a summary. We study the inverse of that problem, where the input to our model is a summary and the model generates a conversation. In this section, we propose three models for this task; the hyper-parameters used in training the models are available in Section A of the appendix.
2.1 SL based generation (SL-Gen)
A seq2seq model can be trained for this task by providing a summary as the input and generating a conversation token-by-token. As PLMs have shown significant improvement over the traditional seq2seq architecture for text generation, we use a GPT-2 model and fine-tune it to generate a conversation given a summary as the input. Our input to the model has the following format: <bos> summary text <dialog> conversation text <eos>. We also use different token-type-ids to indicate the summary and the conversation text. The model is trained to optimize the cross-entropy loss.
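The input layout described above can be illustrated with a minimal sketch. It assumes the special tokens are written `<bos>`, `<dialog>` and `<eos>`, uses whitespace splitting as a stand-in for the GPT-2 tokenizer, and introduces a hypothetical helper `build_sl_gen_input`; none of these names come from the original implementation.

```python
from typing import List, Tuple

# Assumed spellings of the special tokens; the real vocabulary may differ.
BOS, DIALOG, EOS = "<bos>", "<dialog>", "<eos>"
# Segment ids used to tell the summary apart from the conversation.
SUMMARY_TYPE, DIALOG_TYPE = 0, 1


def build_sl_gen_input(summary: str, conversation: str) -> Tuple[List[str], List[int]]:
    """Return (tokens, token_type_ids) for one training example.

    Whitespace splitting stands in for sub-word tokenization.
    """
    summary_part = [BOS] + summary.split()
    dialog_part = [DIALOG] + conversation.split() + [EOS]
    tokens = summary_part + dialog_part
    token_type_ids = (
        [SUMMARY_TYPE] * len(summary_part) + [DIALOG_TYPE] * len(dialog_part)
    )
    return tokens, token_type_ids


if __name__ == "__main__":
    toks, types = build_sl_gen_input(
        "person0 will be late.",
        "person0: running late person1: ok",
    )
    for tok, typ in zip(toks, types):
        print(typ, tok)
```

At inference time only the summary segment followed by the dialog marker would be given, and the model would continue the sequence until it emits the end token.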
2.2 RL based generation (RL-Gen)
Many studies train text generation models with RL Paulus et al. (2018); Li et al. (2016), where the generator network is optimized with a task specific reward. We investigate how the quality of the generated conversation can be used as a reward to improve the generation network. To this end, we train a summary generator network, which generates a summary, given a conversation. We measure the quality of the generated conversation by identifying the similarity between the summary of the generated conversation (generated, in turn, by the summary generator network) and the ground truth summary. The similarity score is used as a reward to train the conversation generation model. Our RL based generation framework is shown in Figure 1, and the critical components are described below.
Conversation Generator: A trained SL-Gen model is used as the conversation generator, which, given a summary, can generate a conversation.
Summary Generator: We use a lightweight variant of BART Lewis et al. (2019), named DistilBART, which is fine-tuned on the Extreme summarization task Narayan et al. (2018). We further fine-tune this instance on the conversation summarization data by providing the conversations as the input and training the model to output summaries.
Reward Model: Once the Summary Generator generates an output summary for the generated conversation, the reward model compares it with the ground truth summary, which was used to ground the conversation generation. Following Paulus et al. (2018), we use the ROUGE-2 F1-score as the reward.
Policy training: We use proximal policy optimization Schulman et al. (2017) as the optimizer for the policy training as it prevents the generator from deviating far away from the pretrained LM Wu et al. (2020).
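The reward computation can be sketched as follows. This is a simplified ROUGE-2 F1 (lowercased whitespace tokens, no stemming), not the official ROUGE toolkit, and `summarize` is a hypothetical callable standing in for the fine-tuned Summary Generator.

```python
from collections import Counter
from typing import Callable


def bigrams(text: str) -> Counter:
    """Multiset of adjacent token pairs in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-2 F1: harmonic mean of bigram precision and recall."""
    cand, ref = bigrams(candidate), bigrams(reference)
    overlap = sum((cand & ref).values())  # clipped bigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def reward(
    generated_conversation: str,
    ground_truth_summary: str,
    summarize: Callable[[str], str],
) -> float:
    """Summarize the generated conversation and score it against the grounding summary."""
    return rouge2_f1(summarize(generated_conversation), ground_truth_summary)
```

The scalar returned by `reward` is what a policy-gradient optimizer such as PPO would receive for each sampled conversation.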
2.3 Controlled conversation generation
We propose another approach (CN-Gen) for conversation generation, which grants more control over the properties of the generated conversations. Here, we generate one utterance of the conversation at a time, as opposed to RL-Gen, where we generate the whole conversation at once. The properties of the generated conversations are controlled by adding several components to the input sequence of the model. The following three variables were used as the control parameters: (1) Number of remaining turns to generate in the conversation (Num turns): During the generation of a turn, we indicate the remaining number of turns in the conversation. In generating a turn conversation, this starts with for the first turn and reduces by after the generation of each turn; (2) The speaker of the next turn (Speaker): This indicates to the model the speaker of the next turn; and (3) The length of the next turn (Turn length): We define three categories of lengths: Short ( 3 tokens), Long ( 10 tokens) and Medium (otherwise).
We use the following input representation to fine-tune a GPT-2 model: <bos> summary text <context> dialog context <turns_to_go> Num turns <speaker> speaker <turn_length> turn length <turn> utterance <eos>. Changing these parameters allows us to generate different variants of conversations which are grounded on the same summary. During training, we obtain the values for the control parameters from the ground truth conversations, and at inference we randomly select the next speaker, the number of turns of the conversation to be generated (in a range of 4-15 turns), and the next turn length. In Table 1 we show conversations of different lengths that were generated by the CN-Gen approach grounded on the same summary by changing the control parameters.
|Summary: person0 will be late. person1 will order pasta with salmon and basil for her.|
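The turn-by-turn input construction and the random choice of control parameters at inference can be sketched as below. Several details are assumptions made for illustration only: the special-token spellings, the two-speaker set, a countdown that starts at the total number of turns and decreases by one per turn, and the length thresholds (Short as at most 3 tokens, Long as more than 10), since the exact comparison operators are not recoverable from the text.

```python
import random
from typing import Dict, List, Union

SPEAKERS = ["person0", "person1"]            # assumed two-party conversation
LENGTH_CATEGORIES = ["short", "medium", "long"]


def turn_length_category(num_tokens: int) -> str:
    """Map a turn's token count to a category (thresholds are assumptions)."""
    if num_tokens <= 3:
        return "short"
    if num_tokens > 10:
        return "long"
    return "medium"


def build_cn_gen_prefix(
    summary: str,
    context: List[str],
    turns_to_go: int,
    speaker: str,
    turn_length: str,
) -> str:
    """Assemble the model input up to the point where the next utterance starts."""
    return (
        f"<bos> {summary} <context> {' '.join(context)} "
        f"<turns_to_go> {turns_to_go} <speaker> {speaker} "
        f"<turn_length> {turn_length} <turn>"
    )


def sample_controls(
    rng: random.Random, min_turns: int = 4, max_turns: int = 15
) -> List[Dict[str, Union[int, str]]]:
    """Randomly pick the conversation length, then per-turn speaker and length."""
    num_turns = rng.randint(min_turns, max_turns)
    return [
        {
            "turns_to_go": num_turns - i,  # assumed countdown convention
            "speaker": rng.choice(SPEAKERS),
            "turn_length": rng.choice(LENGTH_CATEGORIES),
        }
        for i in range(num_turns)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    context: List[str] = []
    for ctrl in sample_controls(rng):
        prefix = build_cn_gen_prefix(
            "person0 will be late.", context, **ctrl
        )
        print(prefix)
        # A real system would decode the utterance from the model here.
        context.append(f"{ctrl['speaker']}: ...")
```

Re-running `sample_controls` with a different seed yields a different set of control values, which is how several distinct conversations can be produced from one summary.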
We experiment on the Samsum Gliwa et al. (2019) dataset, which, to the best of our knowledge, is the only public large-scale conversation summarization dataset. We pre-process the dataset by replacing the personal names (e.g., John) with unique tags (e.g., person_). First, we evaluate the quality of the generated conversations using automatic measures and human judgments, and then assess the performance of the generated conversations in a downstream summarization task after augmentation.
3.1 Quality of the generated conversations
We evaluate the quality of the conversations generated by the three approaches that were introduced in Section 2. In Table 2 we show the properties of the generated conversations and the ground truth conversations in the test set of the Samsum dataset.
|Model||Ave. Turns||Ave. Tokens/Turn|
Automatic Evaluation: We trained the conversation generation models on the Samsum training set and generated conversations on the test set. We compare the generated conversations with the ground truth conversations using the measures used by Sharma et al. (2017) to evaluate conversation system responses. The results shown in Table 4 suggest that CN-Gen outperforms SL-Gen and RL-Gen on all measures.
We also compare the summaries of the generated conversations (generated by the Summary Generator) with the ground truth summaries, and the results are shown in Table 4. We believe that this is a semantic evaluation of the conversations, as the summaries capture the crux of the conversations. According to the results, CN-Gen outperforms the other two methods. This, along with the previous result, suggests that the conversations produced by CN-Gen are the most similar to the ground truth conversations.
Human Evaluation: To evaluate the quality of generated conversations, we randomly selected summaries from the Samsum test dataset and generated conversations using the three models. Three NLP experts were then asked to read the ground truth summary and rate the four conversations (3 generated and the ground truth conversation) on a [1-5] scale according to Grammaticality, Coherency, and Informativeness, with respect to the ground truth summary. Results are shown in Table 6. As expected, the ground-truth conversations obtained the highest scores on all three aspects and can be considered as an upper bound for this task. RL-Gen and CN-Gen obtained higher scores than SL-Gen and relatively good scores compared to the ground truth conversations. This corroborates the assumption that our proposed models generate high quality conversations. The Welch Two Sample t-test Welch (1947) shows that both RL-Gen and CN-Gen models outperform the SL-Gen model statistically significantly with . However, there is no statistically significant difference between the results obtained from RL-Gen and CN-Gen. We report in Table 6 the average quadratic Cohen’s Kappa calculated over the three possible combinations of two judges Toledo et al. (2019).
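For readers unfamiliar with the Welch test mentioned above, the statistic and its degrees of freedom can be computed with the standard library alone. The sketch below uses made-up rating lists purely to exercise the function; they are not the ratings collected in this study, and the helper name `welch_t` is hypothetical.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Welch's two-sample t statistic and Welch-Satterthwaite degrees of freedom.

    Unlike Student's t-test, the two samples may have unequal variances.
    """
    va = variance(a) / len(a)  # squared standard error of sample a
    vb = variance(b) / len(b)  # squared standard error of sample b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


if __name__ == "__main__":
    # Illustrative toy ratings only.
    system_a = [4.0, 5.0, 4.0, 5.0, 4.0]
    system_b = [2.0, 3.0, 2.0, 3.0, 2.0]
    t_stat, dof = welch_t(system_a, system_b)
    print(f"t = {t_stat:.4f}, df = {dof:.2f}")
```

Turning the statistic into a p-value requires the CDF of the t distribution, which the standard library does not provide; a statistics package such as SciPy offers this through `scipy.stats.ttest_ind` with `equal_var=False`.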
CN-Gen obtained the best scores in the automatic evaluation, while RL-Gen obtained the best scores in the human evaluation. The CN-Gen conversations are longer than the RL-Gen conversations by turns on average (see Table 2), and hence would contain more word overlap with the ground truth. This results in better automatic evaluation scores for CN-Gen, while the human judges prefer the short, targeted conversations generated by RL-Gen.
3.2 Evaluation on the summarization task
To further evaluate the quality of the generated conversations, we augmented a conversation summarization dataset with generated conversations and evaluated the summarization model. We used the following process: (1) We randomly selected x% of the summaries of the dataset and trained our conversation generation models; (2) The trained models were applied to the other y% (y = 100 - x) of the summaries to generate conversations; (3) Those generated conversations, along with the original summaries, were added to the data. Using this approach, we can add an extra y% of (summary, conversation) pairs to the training data; (4) The conversation summarization model (discussed in Section 2 under ‘Summary Generator’) was trained on the augmented data. We compare the performance of the conversation summarization model on the original dataset and with augmentation.
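The four-step augmentation procedure can be sketched as a single function. The names `augment` and `train_generator` are hypothetical: `train_generator` stands in for fitting any of the three conversation generation models on the retained portion of the data and returns a callable that maps a summary to a generated conversation.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (summary, conversation)


def augment(
    dataset: List[Pair],
    augment_fraction: float,
    train_generator: Callable[[List[Pair]], Callable[[str], str]],
    seed: int = 0,
) -> List[Pair]:
    """Return the original data plus generated conversations for a held-out share.

    augment_fraction plays the role of y/100: that share of summaries is held
    out, and the generator is trained on the remaining x/100 of the pairs.
    """
    rng = random.Random(seed)
    shuffled = dataset[:]
    rng.shuffle(shuffled)
    n_aug = round(len(shuffled) * augment_fraction)
    held_out, generator_train = shuffled[:n_aug], shuffled[n_aug:]

    generate = train_generator(generator_train)             # steps (1)
    synthetic = [(s, generate(s)) for s, _ in held_out]      # step (2)
    return dataset + synthetic                               # step (3)


if __name__ == "__main__":
    data = [(f"summary {i}", f"conversation {i}") for i in range(10)]
    # Dummy generator used only to make the sketch executable.
    dummy = lambda train_pairs: (lambda summary: "generated for: " + summary)
    augmented = augment(data, 0.3, dummy)
    print(len(data), "->", len(augmented))
```

Step (4), training the summarization model, would then consume the returned list. The sketch also makes the trade-off visible: a larger `augment_fraction` leaves fewer pairs in `generator_train`.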
Automatic Evaluation: We compare the three conversation generation methods at different augmentation percentages, and the results are shown in Table 7.
At all augmentation levels, the summarization models trained with augmented data outperform the summarization model trained on the original dataset (without augmentation). CN-Gen based augmentation produces the best accuracy compared to the other two methods. One prevalent pattern is that, as the amount of augmentation data increases, the accuracy increases up to a certain point and then starts to decrease. The best accuracies were found around 30% data augmentation. We believe that further augmentation leads to a drop in performance for the following reason. To augment with more data, we are left with less data to train the conversation generation model (for 10% augmentation, the conversation generation models are trained on 90% of the data, while for 50% augmentation, the models are trained on only 50% of the data). Therefore, as the augmentation increases, the quality of the generated conversations goes down. This leads to smaller overall gains in the summarization task with increased augmentation after some point. To neutralize the effect of increasing the number of data points during augmentation, we experimented with a baseline which over-samples the original training data at different percentages to obtain the same number of training instances as the augmented datasets. While the ROUGE-2 obtained with the original training data is 30.98, oversampling at 10%, 20%, 30%, 40% and 50% only changes the ROUGE-2 to 30.55, 30.38, 30.74, 30.99 and 30.27, respectively. Hence, oversampling hardly changes the ROUGE scores obtained by training with the original dataset, while augmentation with our methods shows significantly improved scores (as shown in Table 7).
Human Evaluation: We recruited 3 NLP experts to evaluate 50 instances of summaries generated with data augmentation (RL-Gen, CN-Gen), and the respective summaries generated without augmentation (No-Aug). Here we consider two aspects with respect to a ground-truth summary: Coherency (whether the summary is easy to read) and Focus (whether the summary represents the ground-truth summary). Following Amplayo and Lapata (2020), we use the Best-Worst Scaling method. The score of each system is computed as the percentage of times it was chosen as the Best system minus the percentage of times it was chosen as the Worst. On the Coherency question, RL-Gen, CN-Gen and No-Aug obtained scores of 12.6, 6.6 and -4.0, respectively. On the Focus question, RL-Gen, CN-Gen and No-Aug obtained scores of 14.6, 6.0 and -2.6, respectively. These results confirm that the use of augmentation improves the quality of the summaries.
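The Best-Worst Scaling score described above reduces to a simple count. In the sketch below, each judgment is a (best system, worst system) pair; the judgments in the usage example are invented to exercise the function and are not the annotations collected in this study.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def best_worst_scores(
    judgments: Iterable[Tuple[str, str]], systems: Sequence[str]
) -> Dict[str, float]:
    """Percentage of times chosen Best minus percentage of times chosen Worst."""
    judgments = list(judgments)
    best = Counter(b for b, _ in judgments)
    worst = Counter(w for _, w in judgments)
    total = len(judgments)
    return {s: 100.0 * (best[s] - worst[s]) / total for s in systems}


if __name__ == "__main__":
    # Illustrative toy judgments only.
    toy = [
        ("RL-Gen", "No-Aug"),
        ("RL-Gen", "CN-Gen"),
        ("CN-Gen", "No-Aug"),
        ("No-Aug", "CN-Gen"),
    ]
    print(best_worst_scores(toy, ["RL-Gen", "CN-Gen", "No-Aug"]))
```

Scores range from -100 (always chosen Worst) to 100 (always chosen Best), and across all systems they sum to zero because every judgment contributes one Best and one Worst vote.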
We investigated how the PLMs can be utilized to generate entire conversations that are grounded on a summary. We propose three approaches for conversation generation: SL-Gen, RL-Gen and CN-Gen and conducted multiple automatic and human evaluations to assess the quality of the generated conversations. Both automatic and human evaluations show that when compared to the ground truth conversations, RL-Gen and CN-Gen obtain high scores, suggesting that the proposed models generate high quality conversations. When a conversation summarization dataset is augmented with the generated conversations, the performance of conversation summarization is improved (over to 7% improvement in ROUGE-2 F-1), which also suggests that the proposed methods generate high quality conversations.
We have used the publicly available Samsum dataset (https://huggingface.co/datasets/samsum). For the human evaluation of both conversations and summaries, we recruited 3 NLP researchers who hold graduate degrees in NLP and Machine Learning. The annotation task itself was executed on the Appen.com platform. Before the official annotation, we sampled 10 tasks to get an estimate of the duration of the task and to make sure the instructions were clear enough.
- Amplayo and Lapata (2020) Reinald Kim Amplayo and Mirella Lapata. 2020. Unsupervised opinion summarization with noising and denoising. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1934–1945.
- Anaby-Tavor et al. (2020) Ateret Anaby-Tavor, Boaz Carmeli, Esther Goldbraich, Amir Kantor, George Kour, Segev Shlomov, Naama Tepper, and Naama Zwerdling. 2020. Do not have enough data? deep learning to the rescue! In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7383–7390.
- Brown et al. (2020) Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
- Budzianowski and Vulić (2019) Paweł Budzianowski and Ivan Vulić. 2019. Hello, it’s gpt-2-how can i help you? towards the use of pretrained language models for task-oriented dialogue systems. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 15–22.
- Budzianowski et al. (2018) Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. 2018. Multiwoz-a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5016–5026.
- Cachola et al. (2020) Isabel Cachola, Kyle Lo, Arman Cohan, and Daniel S Weld. 2020. Tldr: Extreme summarization of scientific documents. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 4766–4777.
- Gliwa et al. (2019) Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. 2019. Samsum corpus: A human-annotated dialogue dataset for abstractive summarization. EMNLP-IJCNLP 2019, page 70.
- Hardalov et al. (2018) Momchil Hardalov, Ivan Koychev, and Preslav Nakov. 2018. Towards automated customer support. In International Conference on Artificial Intelligence: Methodology, Systems, and Applications, pages 48–59. Springer.
- Kang et al. (2018) Yiping Kang, Yunqi Zhang, Jonathan K Kummerfeld, Lingjia Tang, and Jason Mars. 2018. Data collection for dialogue system: A startup perspective. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pages 33–40.
- Lewis et al. (2019) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
- Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880.
- Li et al. (2016) Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 2016. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202.
- Lowe et al. (2015) Ryan Lowe, Nissan Pow, Iulian Vlad Serban, and Joelle Pineau. 2015. The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 285–294.
- McCowan et al. (2005) Iain McCowan, Jean Carletta, Wessel Kraaij, Simone Ashby, S Bourban, M Flynn, M Guillemot, Thomas Hain, J Kadlec, Vasilis Karaiskos, et al. 2005. The ami meeting corpus. In Proceedings of the 5th International Conference on Methods and Techniques in Behavioral Research, volume 88, page 100. Citeseer.
- Min et al. (2020) Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2020. Ambigqa: Answering ambiguous open-domain questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5783–5797.
- Mohapatra et al. (2020) Biswesh Mohapatra, Gaurav Pandey, Danish Contractor, and Sachindra Joshi. 2020. Simulated chats for task-oriented dialog: Learning to generate conversations from instructions. arXiv preprint arXiv:2010.10216.
- Narayan et al. (2018) Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807.
- Paulus et al. (2018) Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In International Conference on Learning Representations.
- Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
- Rameshkumar and Bailey (2020) Revanth Rameshkumar and Peter Bailey. 2020. Storytelling with dialogue: A critical role dungeons and dragons dataset. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5121–5134.
- Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Sharma et al. (2017) Shikhar Sharma, Layla El Asri, Hannes Schulz, and Jeremie Zumer. 2017. Relevance of unsupervised metrics in task-oriented dialogue for evaluating natural language generation. CoRR, abs/1706.09799.
- Toledo et al. (2019) Assaf Toledo, Shai Gretz, Edo Cohen-Karlik, Roni Friedman, Elad Venezian, Dan Lahav, Michal Jacovi, Ranit Aharonov, and Noam Slonim. 2019. Automatic argument quality assessment-new datasets and methods. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5629–5639.
- Welch (1947) Bernard L Welch. 1947. The generalization of ‘Student’s’ problem when several different population variances are involved. Biometrika, 34(1/2):28–35.
- Wolf et al. (2019) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. Huggingface’s transformers: State-of-the-art natural language processing. ArXiv, pages arXiv–1910.
- Wu et al. (2020) Qingyang Wu, Lei Li, and Zhou Yu. 2020. Textgail: Generative adversarial imitation learning for text generation. arXiv preprint arXiv:2004.13796.
- Yang et al. (2020) Yiben Yang, Chaitanya Malaviya, Jared Fernandez, Swabha Swayamdipta, Ronan Le Bras, Ji-Ping Wang, Chandra Bhagavatula, Yejin Choi, and Doug Downey. 2020. G-daug: Generative data augmentation for commonsense reasoning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 1008–1025.
- Zhang et al. (2018) Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213.
Appendix A Model Training and Hyperparameter Details
A.1 Supervised Conversation Generation (SL-Conv-Gen)
We fine-tune a GPT-2 language model using the implementation available at HuggingFace Wolf et al. (2019). The hyper-parameters used during training and inference are shown below. The model takes around 6 hours to train on 2 V100 GPUs (single machine).
model_name_or_path: gpt2
per_gpu_train_batch_size: 4
per_gpu_eval_batch_size: 4
gradient_accumulation_steps: 4
learning_rate: 6.25e-5
adam_epsilon: 1e-8
max_grad_norm: 1.0
num_train_epochs: 10
warmup_steps: 500
min_length: 20
max_length: 512
top_k: 0
top_p: 0.95
A.2 Summary Generator
We use a DistilBART instance (https://huggingface.co/sshleifer/distilbart-cnn-12-6) fine-tuned on the extreme summarization (XSum) task, and we fine-tune this model further on the Samsum dataset. The model takes around 12 hours to train on 2 V100 GPUs (single machine).
The hyperparameters used for training the DistilBART model are as follows:
train_batch_size: 4
eval_batch_size: 4
num_train_epochs: 10
model_name_or_path: sshleifer/distilbart-xsum-12-6
learning_rate: 3e-5
val_check_interval: 0.1
max_source_length: 512
max_target_length: 80
A.3 Reinforcement Learning based conversation generation (RL-Conv-Gen)
To train the RL based conversation generation model, we adapted a publicly available Proximal Policy Optimization (PPO) implementation (https://github.com/lvwerra/trl). The model takes around 12 hours to train on 2 V100 GPUs (single machine). The following hyper-parameters were used to train the model.
steps: 10000
batch_size: 16
forward_batch_size: 4
learning_rate: 1.41e-5
init_kl_coef: 0.2
target: 6
horizon: 10000
gamma: 1
lam: 0.95
cliprange: 0.2
cliprange_value: 0.2
vf_coef: 0.1
Appendix B Sample summaries with corresponding ground-truth
Figure 3 shows some samples of dialogs with their corresponding summaries, both ground-truth and automatically generated ones.
|Summary: Person0 closed some deals today. Person1 didn’t manage to do it.|
|Summary: Person0 bought a table, six chairs, a vase and a pile of clothes and the second hand shop downtown. She paid 70 euros for everything.|
|Summary: Person1 is not at home. Person0 wants Person1 to keep her pasta in the microwave.|
|Summary: Person0 needs Person1’s help as he cannot get the application running.|
|Summary: Person0 and Person1 will meet the new person in an hour.|