Humans converse every day with other humans, and more recently with machines, both online and offline, for example, when attending meetings or contacting customer service. Textual dialogues are therefore an essential component of the interactions between users and user agents.
This abundance of personal and public conversations represents a valuable source of information, but analyzing such an immense amount of data to meet specific information needs can lead to information overload problems jones2004information.
Recently, dialogue summarization has emerged as a means to resolve this issue. Dialogue summarization is the task of distilling the highlights of an instance of a dialogue and rewriting them into an abridged version. This task is closely associated with abstractive text summarization shi2021neural; gehrmann-etal-2018-bottom; nallapati2016abstractive.
However, most of the existing investigation efforts on abstractive text summarization have been concentrated on single-speaker documents, such as news articles paulus2017deep; see2017get and scientific documents cachola-etal-2020-tldr; erera-etal-2019-summarization. Research has largely focused on the composition of strictly formatted paragraphs (e.g., introduction-method-conclusion) or is simply dependent on locational biases (i.e., the tendency towards a particular location such as the lead or tail of the text) kim2019abstractive.
Unlike these strictly structured formats, dialogues between multiple interlocutors are often informal. Dialogues comprise a variety of utterances, including expressions of personal perspectives and opinions, markers of certainty and doubt, speaker interruptions precht2008sex; sacks1978simplest, colloquial representations, and even textual uses of emojis. Moreover, important information is scattered throughout a dialogue rather than restricted to predictable areas. These features motivate a focus on informative utterances. Figure 1 presents the different speaker styles for informal and formal utterances that we attempt to address in this paper.
Thus, the critical challenges of dialogue summarization task are: 1) Multiple speakers and the different textual styles; the essential pieces of information in a conversation are scattered across the utterances of interlocutors through their different textual styles koppel2009computational. 2) Informal structure; dialogues consist of an informal structure, including slang and colloquial language, free of the locational biases found in formal structures chen2020multi.
To address these challenges, we investigated the relationship between textual styles and representative attributes of utterances. kubler2010adding proposed that the types (e.g., intent or role of a speaker) of sentences from speakers are associated with different syntactic structures (i.e., linguistic information), such as part-of-speech (POS) tagging. This is derived from the fact that different speaker roles are characterized by different syntactic structures.
Research in dialogue summarization has benefited from related fields such as speaker recognition. Speaker recognition identifies an individual using identity information (i.e., a voiceprint) from the human voice. The speaker's speech signal is treated as a set of hidden features that can distinguish that individual's identity, a common premise in speaker recognition research guo2021speaker; liu2018gmm. In essence, the text uttered by each speaker has a unique representation, like a voiceprint. Based on this prior research, we began our study with the assumption that, because syntactic structures tend to be representative of the sentences a speaker utters, these structures would help distinguish different styles of utterances. This assumption is also consistent with previous research: zhu2020hierarchical proposed a hierarchical structure to handle long meeting transcripts and adopted a role vector to represent individual speakers in conversation summarization.
Inspired by the previous works, we propose a novel abstractive dialogue summarization model for use in a daily conversation setting, which is characterized by an informal style of text, including emoticons and abbreviations of chat terms. Furthermore, we explore the locational biases in dialogue structures. Although dialogues show independent locational biases different from that of formal documentation, we evaluated different simple baselines based on locational biases motivated by gliwa2019samsum; kim2019abstractive (see details in Section 3.3). The main contributions of this paper are fourfold.
First, we propose a novel approach for the abstractive dialogue summarization task. Specifically, the multi-task learning model is proposed to learn abstractive dialogue summarization and perform sequence labeling tasks simultaneously, to reflect the syntactic features on the dialogue summarization model.
Second, to the best of our knowledge, this is the first study to perform multi-task learning on the dialogue summarization task using the SAMSum corpus and, specifically, to integrate these tasks using linguistic information.
Third, we propose a novel input type training method, rather than using the traditional method of truncating the input, to investigate locational biases.
Finally, the proposed method outperformed the base models for all ROUGE scores lin2004rouge.
2 Proposed Method
2.1 Why Part-of-speech-tagging?
Part-of-speech (POS) tagging is a valuable resource for syntax-aware analysis. arifin2018sentence conducted multi-document summarization to find representative sentences, considering not only the sentence distribution used to select the most important sentences but also how informative each term in a sentence is. They used POS tagging information to this end and obtained improved performance. This approach is characterized by its use of the grammatical information carried by POS labels and of the presence or absence of informative content in a sentence.
We further considered that incorporating the linguistic information from POS tagging could help alleviate structure/context (i.e., formal and informal) dependency issues for text summarization, and also in dialogue summarization. Simultaneously learning the syntax-aware approach using linguistic information and language generation allows the sharing of grammatical information that constrains next word generation.
Also, it is possible to address the first challenge by applying syntax-awareness to all utterances of the dialogue, because this recognizes the linguistic information from the speakers and intrinsically represents their textual styles. The model thereby obtains a built-in ability to distinguish text styles.
2.2 Problem Formulation
We formalize the problem of dialogue summarization as follows. The input consists of dialogues $\mathcal{D}$ and dialogue speakers $\mathcal{S}$. Assume there are $N$ dialogues in total, $\mathcal{D} = \{D_1, \dots, D_N\}$. Each dialogue consists of multiple turns, where each turn is the utterance of a speaker. Therefore, $D_i = \{(s_1, u_1), \dots, (s_{T_i}, u_{T_i})\}$, where $s_j \in \mathcal{S}$ is a speaker and $u_j$ is the tokenized utterance from $s_j$. The human-annotated summary for dialogue $D_i$, denoted by $Y_i$, is also a sequence of tokens. In the end, the aim of the task is to generate a dialogue summary $\hat{Y}_i$ given the dialogues $\mathcal{D}$ and the reference summaries $\mathcal{Y} = \{Y_1, \dots, Y_N\}$.
The purpose of the sequence labeling task is to predict the sequence of labels $P = \{p_1, \dots, p_K\}$, where $p_k$ is the POS tag of the $k$-th token and $K$ is the number of tags in an utterance.
To summarize, the final goal of dialogue summarization is to maximize the conditional probability of the dialogue summary $p(\mathcal{Y} \mid \mathcal{D}; \theta)$, given the dialogues $\mathcal{D}$ and model parameters $\theta$.
2.3 Preprocessing for Syntax-Aware SAMSum
We automatically annotated sequence labels for all the utterances as these are not included in the SAMSum corpus. We used these labels for training syntax-aware information using the steps below.
Tokenization for data labeling
To better recognize syntax-aware information, we used Twokenizer (https://github.com/myleott/ark-twokenize-py; note that Twokenizer is used for data labeling, not for model training) owoputi2013improved before annotating the sequence labels. Twokenizer was designed for tweet text, which, like daily chat, is online conversational text containing many nonstandard lexical items and syntactic patterns, such as emojis and emoticons. Emoticons (e.g., :), XD) are faces or icons built from traditional alphabetic or punctuation symbols, whereas emojis are small pictures used as symbols. Twokenizer accurately recognizes emoticons, for example, as ":)" rather than ": )."
We obtained tokenized utterances using the above method, which improves the POS tagger's ability to recognize each token. For the sequence labeling, we adopted the CMU-Twitter-POS-tagger (http://www.cs.cmu.edu/~ark/TweetNLP/) owoputi2013improved; gimpel2010part, which addresses the problem of POS tagging for English data from the popular micro-blogging service Twitter.
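The effect of emoticon-aware tokenization can be illustrated with a small regex sketch. This is a simplified stand-in for Twokenizer, not its actual implementation; the emoticon pattern below covers only a few common cases.

```python
import re

# Simplified stand-in for Twokenizer: emoticons are matched as single
# tokens before falling back to word/punctuation splitting.
# (Illustrative only; the real Twokenizer covers many more patterns.)
EMOTICON = r"(?:[:;=8][\-o\*']?[\)\]\(\[dDpP/\\}{@\|]|<3|[xX][dD])"
TOKEN = re.compile(EMOTICON + r"|\w+|[^\w\s]")

def twokenize_lite(text: str) -> list:
    return TOKEN.findall(text)

print(twokenize_lite("no problem :) btw order for me XD"))
```

A naive punctuation split would break ":)" into ":" and ")", which would then be mislabeled by the POS tagger; matching the emoticon branch first keeps it whole.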
2.4 Model Overview
Regarding the multi-task learning for BART backbone, we address two different tasks simultaneously: token classification (i.e., sequence labeling) and language modeling (i.e., generation). BART consists of a bidirectional encoder and an autoregressive decoder. Therefore, we conducted the token classification task in the encoder (i.e., syntax-aware encoder) and the language model task in the decoder (i.e., conversational decoder). As illustrated in Figure 2, task-specific linear heads were trained through multi-task learning, which performs the main task as a dialogue summarization task and the POS sequence labeling task as an auxiliary task.
2.5 Syntax-Aware Encoder
We sought to address the application of syntax-aware information to a dialogue summarization model through the sharable encoder.
Each utterance was terminated with a special [EOU] token, as was done in previous work gliwa2019samsum, to consider each utterance separately. In general, the input sequence $X = \{x_1, \dots, x_n\}$ is fed into the bottom encoder of BART. Given the hidden outputs of the encoder's last layer $L$, $H^{L} = \{h^{L}_1, \dots, h^{L}_n\}$, the output layer for the sequence labeling task was a linear classifier mapping each $h^{L}_i \in \mathbb{R}^{d}$ to a simplex over the POS tags, where $d$ denotes the dimension of the hidden layer and $|T|$ is the number of POS tags. The probability that the word $x_i$ aligns with the $t$-th POS tag is computed using softmax: $p(t \mid x_i) = \mathrm{softmax}(W_{pos} h^{L}_i)$, where $W_{pos}$ is a parameter to be learned.
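The classifier head described above amounts to a linear projection of each hidden state followed by a softmax over the tag set. A minimal pure-Python sketch follows; the dimensions and random weights are made up for illustration and are not the real BART sizes.

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def pos_head(hidden, weight):
    """Linear classifier over POS tags: logits[t] = sum_d weight[d][t] * hidden[d]."""
    n_tags = len(weight[0])
    logits = [sum(weight[d][t] * hidden[d] for d in range(len(hidden)))
              for t in range(n_tags)]
    return softmax(logits)

random.seed(0)
d_model, n_tags = 8, 5  # toy sizes, not the real model dimensions
weight = [[random.gauss(0, 0.1) for _ in range(n_tags)] for _ in range(d_model)]
hidden = [random.gauss(0, 1.0) for _ in range(d_model)]
probs = pos_head(hidden, weight)
print(len(probs), round(sum(probs), 6))  # 5 1.0
```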
2.6 Conversational Decoder
To integrate the syntax-aware encoder with the decoder, the dialogue summarization model combines the shared syntax-aware encoder (Equation 2) with the conversational decoder.
Shared Syntax-Aware Encoder
We used the same encoder, across all layers, from Section 2.5 to apply the syntax-aware information to the dialogue summarization model; the encoder is shared between the two tasks.
The syntactic information could provide different conversational aspects for the models to learn and further determine which set of utterances deserve more attention to generate better dialogue summaries.
The input to the decoder included the previously generated tokens $y_{<t}$. We fed these tokens to the conversational decoder $\mathrm{Dec}$, and the $t$-th token $y_t$ was predicted as follows: $p(y_t \mid y_{<t}, X) = \mathrm{softmax}(W_{o}\,\mathrm{Dec}(y_{<t}, H^{L}))$, where $W_{o}$ is a parameter to be learned.
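At inference time the decoder produces the summary token by token. The paper uses beam search (beam size 4), but the basic mechanics can be sketched with a greedy loop over a toy next-token distribution; the toy_lm table below is purely illustrative and stands in for the actual decoder-plus-softmax computation.

```python
def greedy_decode(next_token_probs, max_len=10, eos="<eos>"):
    """Autoregressive decoding: repeatedly pick the most probable next token."""
    tokens = ["<bos>"]
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        nxt = max(probs, key=probs.get)
        if nxt == eos:
            break
        tokens.append(nxt)
    return tokens[1:]

def toy_lm(tokens):
    # Stand-in for softmax(W_o * Dec(y_<t, H)); keyed on prefix length only.
    table = {
        1: {"maya": 0.9, "<eos>": 0.1},
        2: {"will": 0.8, "<eos>": 0.2},
        3: {"<eos>": 0.9, "buy": 0.1},
    }
    return table[len(tokens)]

print(greedy_decode(toy_lm))  # ['maya', 'will']
```

Beam search generalizes this by keeping the four highest-scoring partial sequences at each step instead of only the single best one.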
2.7 Syntax-Aware Multi-task Learning
We trained the two tasks jointly using multi-task learning. We considered the dialogue summarization task as the main task and the sequence labeling task as an auxiliary task.
During training, the two learning objectives were combined using the cross-entropy loss for each task, and we sought to minimize the combined loss. Thus, the final loss of our model is: $\mathcal{L} = \mathcal{L}_{sum} + \lambda\,\mathcal{L}_{seq}$, where $\mathcal{L}_{sum}$ and $\mathcal{L}_{seq}$ are the loss of the dialogue summarization model and the sequence labeling model, respectively, and $\lambda$ denotes the parameter controlling the strength of each task.
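As a sketch, the joint objective amounts to adding a scaled auxiliary loss to the main loss. The exact weighting scheme below (main loss plus a lambda-weighted auxiliary loss) is our reading of the text and should be treated as an assumption.

```python
def multitask_loss(loss_sum, loss_seq, lam=0.1):
    """Combined loss: summarization loss plus lam-weighted sequence-labeling loss.
    The additive form is an assumption about the paper's weighting scheme."""
    return loss_sum + lam * loss_seq

# With lam = 0.1 (the value used in the final model), the auxiliary task
# contributes only a small fraction of the total training signal.
print(multitask_loss(2.5, 1.0, lam=0.1))
```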
Finally, the model can activate linguistic information to enhance its ability to distinguish a speaker’s utterance style.
2.8 Speaker Styles of Utterance for Ad-hoc Analysis
In order to represent the uttering styles of speakers, we treat the list of POS tags extracted from each speaker's utterances as that speaker's style, as in Equation 10: $S_i = \{t_1, \dots, t_K\}$, where $i$ denotes the index of the speaker and $K$ is the number of tags. It has the same form as a document made up of words.
With these style documents, we computed tf-idf, a commonly used method to weight the importance of each term in a document. The tf-idf formula is as follows: $\mathrm{tfidf}_{i,j} = tf_{i,j} \times \log\frac{N}{df_j}$ (Equation 11), where $tf_{i,j}$ represents the term frequency of the $j$-th tag in the $i$-th speaker style, $df_j$ denotes the number of speaker styles in which the $j$-th tag appears, and $N$ is the total number of speaker styles. We then employed K-means clustering to group the speaker styles. Consequently, this ad-hoc analysis was used to show whether our proposed strategies worked as intended in the trained models.
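The style weighting above can be sketched in a few lines. The speaker names and POS tags below are invented for illustration, and the tf-idf variant (raw tag count times log inverse style frequency) is an assumption about the exact formula used.

```python
import math
from collections import Counter

# Hypothetical "style documents": lists of POS tags per speaker.
styles = {
    "robert": ["V", "T", "N", "V", "D"],
    "cynthia": ["V", "N", "A", "N", "D"],
    "iris": ["V", "E", "G", "E", "D"],
}

def tfidf_styles(styles):
    n = len(styles)  # total number of speaker styles
    # df[t]: number of styles in which tag t appears
    df = Counter(tag for tags in styles.values() for tag in set(tags))
    return {spk: {t: c * math.log(n / df[t]) for t, c in Counter(tags).items()}
            for spk, tags in styles.items()}

weights = tfidf_styles(styles)
# "V" and "D" occur in every style, so their weights collapse to zero,
# while style-specific tags such as "T" (verb particle) stand out.
print(weights["robert"]["V"], weights["robert"]["T"] > 0)
```

The resulting per-speaker weight vectors are what K-means then clusters into style groups.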
3 Experimental Setup
3.1 Dataset and Baselines
We trained and evaluated our model on SAMSum gliwa2019samsum, a large-scale dialogue summarization dataset. SAMSum is the first daily chat corpus for dialogue summarization over truly informal conversation. The subject of each conversation is open domain, and the conversation style is informal. Data statistics for the SAMSum corpus are shown in Table 1.
| Split | # Conv | S.L (avg [min, max]) | # Speakers (avg [min, max]) | # Turns (avg [min, max]) |
|-------|--------|----------------------|-----------------------------|---------------------------|
| Train | 14732 | 23.44 [2, 73] | 2.40 [1, 14] | 11.17 [1, 46] |
| Dev | 818 | 23.42 [4, 68] | 2.39 [2, 12] | 10.83 [3, 30] |
| Test | 819 | 23.12 [4, 71] | 2.36 [2, 11] | 11.25 [3, 30] |
To better understand the characteristic of this corpus, we explored the density of the number of utterances.
As shown in Figure 5 (see Appendix C), the density of the number of utterances falls off sharply above ten, indicating that dialogues in the training set generally contain fewer than ten utterances. This result motivates input sequence types based on locational biases, such as LEAD-3 see2017get, which takes the three leading sentences of the source text as the summary (we discuss this in Section 3.3).
We evaluated the model’s performance with the following summarization models, based upon previous works gliwa2019samsum.
Pointer Generator see2017get This model follows gliwa2019samsum, wherein separators are added between utterances and the result is used as input for the pointer generator model.
DynamicConv + GPT-2 wu2019pay Based on gliwa2019samsum, this model uses GPT-2 to initialize token embeddings radford2019language.
Fast Abs RL Enhanced chen2018fast adopts a hybrid method that selects salient sentences and then paraphrases them as abstractive sentences through sentence-level policy gradient methods.
BART lewis-etal-2020-bart We used BART as the vanilla baseline in the following setting, adding a separator after each utterance. The default parameter setting was BART-base (https://huggingface.co/transformers/model_doc/bart.html). Additionally, we fed the input with the LONGEST-10 setting, as described in Section 3.3.
3.2 Evaluation Metrics
We utilized different evaluation metrics, including several recently introduced methods used in text summarization and generation tasks.
ROUGE-N (https://github.com/pltrdy/rouge; note that different packages may generate different ROUGE scores) lin2004rouge is the most widely used evaluation metric for the text summarization task. We calculated ROUGE-1, ROUGE-2, and ROUGE-L.
BERTScore zhang2019bertscore calculates similarity scores by aligning the generated and reference summaries at the token level using BERT.
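For intuition, a toy unigram-overlap ROUGE-1 F1 can be computed as below. This is a deliberate simplification: real packages differ in tokenization, stemming, and multi-reference handling, which is one reason scores vary across packages.

```python
from collections import Counter

def rouge1_f(candidate: str, reference: str) -> float:
    """Toy ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f("lilly will be late", "lilly will be late today"), 3))  # 0.889
```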
3.3 Implementation Details
We tested different input type modes in our proposed model. The traditional method for handling long sequences in a pretrained language model is to truncate the sequence, leaving utterances incomplete, owing to limitations in system memory. To alleviate this issue, we propose a novel input type method that retains the utterance format. Inspired by previous work gliwa2019samsum; see2017get, we defined the input types as LEAD-n, MIDDLE-n, and LONGEST-n. The underlying assumption of these input types is that locational biases kim2019abstractive place the essential information at the head (i.e., the beginning of the lead) or middle of lengthy conversations. Additionally, this method preserves whole utterance sequences without breaking up sentence information. To support the above assumptions, we computed the data statistics described in the previous section. We trained the model according to the following settings: LEAD-n takes the leading utterances of the dialogue, MIDDLE-n takes utterances from the middle of the dialogue, and LONGEST-n takes the n longest utterances of the dialogue.
We used the BART-base model to initialize the backbone of the encoder/decoder frame and followed the default settings. The learning rate was set to 3e-4. We trained the model for 20 epochs and set $\lambda$ to 0.1 in the final model. Training was conducted on a single RTX 8000 GPU with 48 GB of memory. We trained the model with the Adam optimizer kingma2014adam and early stopping on validation-set ROUGE-1. During inference, the beam size was 4, including for the baseline models.
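The three input types can be sketched as simple utterance selectors. How LONGEST-n orders the selected utterances is not specified in the text, so the sketch assumes the original dialogue order is preserved.

```python
def lead_n(utts, n):
    """LEAD-n: the n leading utterances."""
    return utts[:n]

def middle_n(utts, n):
    """MIDDLE-n: n utterances centered on the middle of the dialogue."""
    start = max(0, (len(utts) - n) // 2)
    return utts[start:start + n]

def longest_n(utts, n):
    """LONGEST-n: the n longest utterances, kept in dialogue order (assumption)."""
    keep = set(sorted(range(len(utts)), key=lambda i: -len(utts[i]))[:n])
    return [u for i, u in enumerate(utts) if i in keep]

utts = ["honey", "are you still in the pharmacy?", "yes",
        "buy me some earplugs please", "how many pairs?", "4 or 5 packs"]
print(longest_n(utts, 2))
```

Each selector returns complete utterances, unlike hard truncation, which can cut a sequence mid-utterance.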
4 Main Results
4.1 Quantitative Results
| Model | Type | avg # words |
|-------|------|-------------|
| Syntax-aware BART (λ=0.1) | LEAD-10 | 19.95 |
Internal model verification
In Table 2, we explore the influence of locational biases in dialogue on performance. The input type settings are compared according to the different measures, based on the F1 scores of both ROUGE and BERTScore. We set $\lambda$ to 0.5 and 0.1 and compared $n$ of 10 and 20 for each setting (note that we did not run LONGEST-20 owing to limitations in computing power). With $\lambda$ set to 0.1, the LONGEST-10 model showed the best performance across every measure. However, performance with $\lambda$ set to 0.5 was generally lower than with $\lambda$ at 0.1. Although the BERTScore differences were subtle, LONGEST-10 also showed the highest performance on this measure. We interpret this result as indicating that lengthy utterances are valuable when generating summaries and that the key topics are located within the ten longest utterances of a dialogue.
Comparison of the generated length
We investigated the length of the generated summary, which varies with the input type. As shown in Table 3, we examined the average number of words in the generated summaries at each inference step. The BART base model (i.e., BART) used 22.25 words on average when generating summaries. By comparison, our best performing model (i.e., Syntax-aware BART (LONG-10)) used only 21.95 words, fewer than the baseline model. This observation reveals that our proposed model often favors generating slightly shorter summaries than the BART baseline, leading to more concise summaries while still capturing the important information. The input type, through its input length, influences the average number of generated words.
External model verification
In Table 4, we present the ROUGE-1, ROUGE-2, and ROUGE-L scores for our model and the comparison models. First, our proposed model outperformed the other baselines with respect to F1 for all ROUGE scores. As hypothesized, our experiments demonstrate that using linguistic information is worthwhile for enhancing model performance.
Fast Abs RL Enhanced achieved slightly better scores than Pointer Generator and DynamicConv + GPT-2, indicating that using reinforcement learning to first select important sentences is beneficial. The key factor in the overall lower performance of the baseline models appears to be that they are not built on a pretrained language model; DynamicConv with GPT-2 embeddings only uses pretrained embeddings from the language model GPT-2, which is trained on a large corpus.
| Model | Type | R-1 F | R-1 P | R-1 R | R-2 F | R-2 P | R-2 R | R-L F | R-L P | R-L R |
|-------|------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| Pointer Generator see2017get* | - | 0.401 | - | - | 0.153 | - | - | 0.366 | - | - |
| DynamicConv + GPT-2 wu2019pay* | - | 0.418 | - | - | 0.164 | - | - | 0.376 | - | - |
| Fast Abs RL Enhanced chen2018fast* | - | 0.420 | - | - | 0.181 | - | - | 0.392 | - | - |
| Syntax-aware BART (λ=0.1) | LONG-10 | 0.431 | 0.486 | 0.426 | 0.189 | 0.216 | 0.186 | 0.420 | 0.460 | 0.418 |

The last row corresponds to our proposed model, which shows the best performance (LONGEST-10). Note that F, P, and R indicate F1, precision, and recall scores, respectively.
Dialogue 1:
1. lilly: sorry, I'm gonna be late
2. lilly: don't wait for me and order the food
3. gabriel: no problem, shall we also order something for you?
4. gabriel: so that you get it as soon as you get to us?
5. lilly: good idea
6. lilly: pasta with salmon and basil as always very tasty there

REF: lilly will be late. gabriel will order pasta with salmon and basil for her.
FE: lilly will be late. lilly and gabriel are going to pasta with salmon and basil is always tasty. [62/46/68]
B: lilly will be late. gabriel and lilly will order food for lilly and gabriel. [72/39/63]
SB: lilly is going to be late. gabriel will order food for her. lilly will get pasta with salmon and basil and she will get it as soon as she arrives at them. [68/23/68]

Dialogue 2:
1. randolph: honey
2. randolph: are you still in the pharmacy?
3. maya: yes
4. randolph: buy me some earplugs please
5. maya: how many pairs?
6. randolph: 4 or 5 packs
7. maya: i'll get you 5
8. randolph: thanks darling

REF: maya will buy 5 packs of earplugs for randolph at the pharmacy.
FE: randolph is in the pharmacy. randolph will buy some earplugs for randolph. maya will get 5. [64/38/71]
B: laurie is in the pharmacy. maya will buy 4 or 5 pairs of earphones for him. [51/23/51]
SB: maya will buy 4 or 5 pairs of earplugs for raymond at the pharmacy. [63/34/63]

Speaker Style Utterance (abbreviated):
(1) Style A vs B
…Robert: …The Swedes didn't even bother to find out… they started laying them off… (B)
…Cynthia: …i'd like us to go to this new bistro i discovered… (A)
(2) Style C
…Iris: <file other> My husband is famous… Haha. You don't even realize what this…
…Dan: <photo file>…But its not working any more and it hurts :(…
…Simon: BTW it's so annoying that people can't see that such immigration policy reduces…
4.2 Qualitative Results
We compared the examples generated by several baselines and our proposed model in terms of their ROUGE scores. We present the qualitative results in Table 5. We conducted error analysis using the following major error types: (i) Incorrect reasoning: the model came to an incorrect conclusion, which occurred when the generated summary reasoned about relations in the dialogue incorrectly. (ii) Incorrect reference: the association of locations or actions with an incorrect speaker, regardless of the original context, in the generated summary. (iii) Redundancy: content in the generated summary that is not mentioned in the reference. (iv) Missing information: content existing in the reference that is absent from the generated summary.
In dialogue 1, FE (i.e., Fast Abs RL Enhanced) and SB (i.e., Syntax-aware BART) performed well at capturing the meaning of the reference summary, despite it being slightly lengthy. The B model (i.e., vanilla BART) showed the highest score but contained an instance of (i) incorrect reasoning (gabriel and lilly). Our SB model, however, highlighted (i.e., lime-colored) content influenced by the lengthy utterance at line 6; it appears the model was affected by the lengthy input type. There is also a case of (iii) redundancy in SB: the related content appeared in the dialogue but was absent from the reference.
In dialogue 2, the SB model showed the highest performance. FE mismatched (ii) the information of who was acting (i.e., the subject) and who received the action (i.e., the object). This observation is also true of the B (laurie) and SB (raymond) models. Additionally, there was a case of (iii) redundancy in the SB (4) model, despite the content being present in the dialogue. In sum, according to the observations listed in Table 5, our proposed model captured the lengthy utterance well, in line with our objective of using locational information. Nonetheless, this study has limitations, such as the inclusion of incorrect references, which we aim to address in future work.
| Description | discourse/file marker | verb particle | emoticon | abbreviation | coordinating conjunction | predeterminer |
|-------------|-----------------------|---------------|----------|--------------|---------------------------|---------------|
| Example | @user:hello | up, out | ;-), :b | btw (by the way) | and, but | both, all, half |
Speaker utterance style (Vitamin)
We discovered the speaker styles in the test set by following Section 2.8 for the ad-hoc analysis. In detail, our research question (group characteristics are given in Appendix B) was "what are the differences between the speakers?". Figure 3 depicts the distributional characteristics through PCA (principal component analysis) projection of each speaker style with K-means clustering (K=3). Styles 'A' and 'B' have some intersection, which is also shown in Figure 4, whereas 'C' is relatively distant from the other groups.
In Figure 4, we illustrate the top-6 ranked POS features that distinguish the groups (detailed values are given in Appendix B). Style 'B' is marked by T, which represents the verb particle, and style 'C' mainly consists of tags such as E and G as factors that differ from the other groups. Style 'A', by contrast, shows relatively flat values below 1.0 (Figure 4 and Appendix B). In the right part of Table 5, we compare the speaker styles: (1) style 'B' used a verb particle (find out), whereas 'A' expressed the same meaning in a different way (discovered), and (2) style 'C' mostly tends to represent intentions using pictures or other content. In the end, we found that the ability to distinguish these speaker styles was reflected in our model.
In this study, we proposed a novel syntax-aware sequence-to-sequence model that leverages syntactic information (i.e., POS tagging), accounts for the constraints of informal daily chat structure, and distinguishes the different textual styles of multiple speakers for abstractive dialogue summarization. To strategically combine syntactic information with the dialogue summarization task, we adopted multi-task learning over both syntactic information and dialogue summarization. Furthermore, we presented a novel input type for training the model to explore locational biases in dialogue structures. We benchmarked the experiments on the SAMSum corpus, and the experimental results demonstrate that the proposed method outperforms the comparison models for all ROUGE scores.
There are promising future directions regarding this research. It would be worthwhile to apply the traditional truncation method with our proposed model to deeply compare performance differences.
Appendix A Related Work
A.1 Dialogue summarization
Most previous work focused on summarizing conversations from meetings mccowan2005ami, in part owing to the lack of a corpus of daily conversation summaries. Hence SAMSum, a corpus of daily conversation summaries recently announced by gliwa2019samsum, has been receiving much attention.
In the standard dialogue summarization paradigm, a pointer-generator see2017get, a hybrid of the typical sequence-to-sequence attention model nallapati2016abstractive and a pointer network vinyals2015pointer, is used as the abstractive dialogue summarization model. This framework encodes the source sequence and generates the target sequence with the decoder for abstractive dialogue summarization yuan2019abstractive; goo2018abstractive.
Recently, the standard paradigm has shifted to combining pretraining on a much larger external text corpus (e.g., Wikipedia, books) with a transformer-based sequence model. This strategy has led to remarkable performance improvements when fine-tuned for both text generation and natural language understanding tasks, as with BART lewis-etal-2020-bart.
A.2 Syntax-aware text summarization
Syntax representation of text can be applied to text summarization to leverage linguistic information because it assists in information filtering to obtain highlighted context from a source document bouras2008improving, and yet the importance of this syntax has previously been underestimated zopf2018s. When linguistic information is used for text summarization, it finds the relationships between terms in the document through sequence labeling (POS tagging al2018generating; dobreva2020improving), grammar analysis lu2019attributed, and thesaurus usage (e.g., WordNet) pal2014approach, and then extracts the salient context.
Previous research has investigated the use of linguistic information such as POS tagging for text summarization. al2018generating applied selective POS tagging for words such as nouns, verbs, and adjectives to extract sentence summaries. bouras2008improving and afsharizadeh2018query also attempted to extract keywords by retrieving nouns through POS tagging. liu2017pos reported the utilization of POS tagging to distill keywords for extractive summarization in Korean.
However, these studies only considered extractive summarization tasks and applied linguistic information to extract the key sentences as features using the scoring function. By contrast, we propose to have a pretrained model learn linguistic information implicitly through the sequence labeling task to function in an abstractive way.
Appendix B Speaker Style Cluster Result
Table 7 shows POS features gimpel2010part by speaker style. We calculated the average tf-idf value for each group. Note that extremely common terms yield tf-idf values of zero. The standard deviation shows the between-group deviation for each POS feature.
| feature | A style | B style | C style | std |
|---------|---------|---------|---------|-----|