Who says like a style of Vitamin: Towards Syntax-Aware Dialogue Summarization using Multi-task Learning

09/29/2021 ∙ by Seolhwa Lee, et al. ∙ NYU college Korea University

Abstractive dialogue summarization is a challenging task for several reasons. First, most of the important pieces of information in a conversation are scattered across utterances through multi-party interactions with different textual styles. Second, dialogues are often informal structures, wherein different individuals express personal perspectives, unlike text summarization tasks, which usually target formal documents such as news articles. To address these issues, we focused on the association between utterances from individual speakers and unique syntactic structures. Speakers have unique textual styles that can carry linguistic information, much like a voiceprint. Therefore, we constructed a syntax-aware model by leveraging linguistic information (i.e., POS tagging), which alleviates the above issues by inherently distinguishing sentences uttered by individual speakers. We employed multi-task learning of both syntax-aware information and dialogue summarization. To the best of our knowledge, our approach is the first method to apply multi-task learning to the dialogue summarization task. Experiments on the SAMSum corpus (a large-scale dialogue summarization corpus) demonstrated that our method improved upon the vanilla model. We further analyze the costs and benefits of our approach relative to baseline models.




1 Introduction

Humans converse every day with other humans, online or offline (for example, when attending meetings or using customer service), and more recently with machines. Therefore, textual dialogues are an essential component of the interactions between users and user agents.

This abundance of personal and public conversations represents a valuable source of information, but analyzing such an immense amount of data to meet specific information needs can lead to information overload problems jones2004information.

Figure 1: Example utterance of formal and informal sentences of the same meaning from different speakers, with the different parts-of-speech labeled. The histogram shows the different individual textual styles.

Recently, dialogue summarization has emerged as a means to resolve this issue. Dialogue summarization is the task of distilling the highlights of an instance of a dialogue and rewriting them into an abridged version. This task is closely associated with abstractive text summarization shi2021neural; gehrmann-etal-2018-bottom; nallapati2016abstractive.

However, most of the existing investigation efforts on abstractive text summarization have been concentrated on single-speaker documents, such as news articles paulus2017deep; see2017get and scientific documents cachola-etal-2020-tldr; erera-etal-2019-summarization. Research has largely focused on the composition of strictly formatted paragraphs (e.g., introduction-method-conclusion) or is simply dependent on locational biases (i.e., the tendency towards a particular location such as the lead or tail of the text) kim2019abstractive.

Unlike these strictly structured formats, the format of dialogues between multiple interlocutors is often informal. Dialogues are represented through a variety of utterances, including the expression of personal perspectives, opinions, markers of certainty and doubt, speaker interruptions precht2008sex; sacks1978simplest, colloquial representations, and even the use of emojis in a textual way, and important information is scattered throughout dialogue, not necessarily restricted to predictable areas. These features lead to a focus on informative utterances. Figure 1 presents the representation of different speaker styles for informal and formal utterances that we attempt to address in this paper.

Thus, the critical challenges of the dialogue summarization task are: 1) Multiple speakers and different textual styles: the essential pieces of information in a conversation are scattered across the utterances of interlocutors through their different textual styles koppel2009computational. 2) Informal structure: dialogues have an informal structure, including slang and colloquial language, free of the locational biases found in formal structures chen2020multi.

To address these challenges, we investigated the relationship between textual styles and representative attributes of utterances. kubler2010adding proposed that the types (e.g., intent or role of a speaker) of sentences from speakers are associated with different syntactic structures (i.e., linguistic information), such as part-of-speech (POS) tagging. This is derived from the fact that different speaker roles are characterized by different syntactic structures.

Research in dialogue summarization benefits from research in related fields, such as speaker recognition. Speaker recognition is the process of recognizing identity from the human voice, specifically using identity information (i.e., a voiceprint). The speaker's speech signal is treated as a set of hidden features that distinguish the identity of that individual, and it is commonly used in speaker recognition research guo2021speaker; liu2018gmm. In essence, the text uttered by each speaker has a unique representation, like a voiceprint. Based on this prior research, we began our study with the assumption that, because syntactic structures tend to be representative of the sentences uttered by a speaker, these structures would help distinguish the different styles of utterances. This assumption was also maintained in previous research: zhu2020hierarchical proposed a hierarchical structure to handle the transcripts of long meetings and adopted a role vector to represent the individual speakers in conversation summarizations.

Inspired by the previous works, we propose a novel abstractive dialogue summarization model for use in a daily conversation setting, which is characterized by an informal style of text, including emoticons and abbreviations of chat terms. Furthermore, we explore the locational biases in dialogue structures. Although dialogues show independent locational biases different from that of formal documentation, we evaluated different simple baselines based on locational biases motivated by gliwa2019samsum; kim2019abstractive (see details in Section 3.3). The main contributions of this paper are fourfold.

  • First, we propose a novel approach for the abstractive dialogue summarization task. Specifically, the multi-task learning model is proposed to learn abstractive dialogue summarization and perform sequence labeling tasks simultaneously, to reflect the syntactic features on the dialogue summarization model.

  • Second, to the best of our knowledge, this is the first study to perform multi-task learning on the dialogue summarization task using the SAMSum corpus and, specifically, to integrate these tasks using linguistic information.

  • Third, we propose a novel input type training method, rather than using the traditional method of truncating the input, to investigate locational biases.

  • Finally, the proposed method outperformed the base models for all ROUGE scores lin2004rouge.

2 Proposed Method

2.1 Why Part-of-speech-tagging?

Part-of-speech-tagging is a valuable resource for analysis in the syntax-aware approach. arifin2018sentence conducted multi-document summarization to find representative sentences, not only by sentence distribution to select the most important sentences but also by how informative a term is in a sentence. They used part-of-speech (POS) tagging information to resolve this and obtain improved performance. This approach is characterized by its use of grammatical information, which is carried by POS labels, and the presence or absence of informative content in a sentence.

We further considered that incorporating the linguistic information from POS tagging could help alleviate structure/context (i.e., formal and informal) dependency issues for text summarization, and also in dialogue summarization. Simultaneously learning the syntax-aware approach using linguistic information and language generation allows the sharing of grammatical information that constrains next word generation.

Also, it is possible to deal with the first challenge by applying syntax-awareness to the entirety of the utterances from the dialogue because this will recognize the linguistic information from the speakers and also intrinsically represent the textual styles. Therefore, the model obtains the built-in ability to distinguish text styles.

2.2 Problem Formulation

We formalize the problem of dialogue summarization as follows. The input consists of dialogues $\mathcal{D}$ and dialogue speakers $\mathcal{S}$. Assume there are $N$ dialogues in total. The dialogues are $\mathcal{D} = \{D_1, \dots, D_N\}$. Each dialogue consists of multiple turns, where each turn is the utterance of a speaker. Therefore, $D_i = \{(s_1, u_1), \dots, (s_{T_i}, u_{T_i})\}$, where $1 \le j \le T_i$, $s_j \in \mathcal{S}$ is a speaker, and $u_j$ is the tokenized utterance from $s_j$. The human-annotated summary for dialogue $D_i$, denoted by $Y_i$, is also a sequence of tokens. In the end, the aim of the task is to generate a dialogue summary $\hat{Y}_i$ given the dialogues $\mathcal{D}$ and the reference summaries $Y = \{Y_1, \dots, Y_N\}$.

The purpose of the sequence labeling task is to predict the sequence label $z$, where $z = (z_1, \dots, z_m)$, and $m$ is the number of tags in an utterance.

To summarize, the final goal of dialogue summarization is to maximize the conditional probability $P(Y \mid \mathcal{D}; \theta)$ of the dialogue summary $Y$, given dialogues $\mathcal{D}$ and model parameters $\theta$.

2.3 Preprocessing for Syntax-Aware SAMSum

We automatically annotated sequence labels for all the utterances as these are not included in the SAMSum corpus. We used these labels for training syntax-aware information using the steps below.

Tokenization for data labeling

To better recognize syntax-aware information, we used Twokenizer owoputi2013improved (https://github.com/myleott/ark-twokenize-py; note that Twokenizer is used for data labeling, not for model training) before annotating the sequence labels. The Twokenizer was designed for tweet text in order to conduct part-of-speech (POS) tagging. Tweet text consists of online conversational text that contains many nonstandard lexical items and syntactic patterns, such as emojis and emoticons, and daily chat contains similar items. Emoticons (e.g., :), XD) are faces or icons built from traditional alphabetic or punctuation symbols, whereas emojis are small pictures used as symbols. The Twokenizer accurately recognizes emoticons as, for example, ":)" rather than ": )".
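To make the emoticon point concrete, the following is a minimal sketch of emoticon-aware tokenization. It is not the Twokenizer and does not reproduce its rules: the pattern `EMOTICON` and the function `tokenize` are hypothetical names, and the regular expression covers only a handful of common emoticons for illustration.

```python
import re
from typing import List

# Hypothetical, heavily simplified emoticon pattern (an assumption for
# illustration; the real Twokenizer uses far more elaborate rules).
# The lookarounds keep the pattern from firing inside ordinary words.
EMOTICON = r"(?<!\w)(?:[:;=][\-']?[\)\(\]\[dDpPbB/\\|]|[xX]D)(?!\w)"

# Try emoticons first, then words (with an optional apostrophe part),
# then any single remaining punctuation character.
TOKEN_RE = re.compile(rf"{EMOTICON}|\w+(?:'\w+)?|[^\w\s]")


def tokenize(utterance: str) -> List[str]:
    """Split an utterance into tokens, keeping emoticons in one piece."""
    return TOKEN_RE.findall(utterance)


print(tokenize("sorry :) see you XD"))
```

A whitespace-and-punctuation tokenizer would split ":)" into two punctuation tokens, which a POS tagger could no longer label as a single emoticon; trying the emoticon pattern first avoids that.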

Part-of-speech tagging

We obtained tokenized utterances using the above method. This process improves the POS tagger's ability to recognize each token. For the sequence labeling, we adopted the CMU-Twitter-POS-tagger (http://www.cs.cmu.edu/~ark/TweetNLP/) owoputi2013improved; gimpel2010part, which addresses the problem of POS tagging for English data from the popular micro-blogging service Twitter.

Figure 2: Overview of the model architecture. The syntax-aware encoder with a task-specific linear head learns the sequence labeling task given the dialogue utterances in a bidirectional encoder setting from the BART encoder. The conversation decoder (i.e., autoregressive decoder from the BART decoder) learns the dialogue summarization task through the linear head.

2.4 Model Overview

Regarding the multi-task learning for BART backbone, we address two different tasks simultaneously: token classification (i.e., sequence labeling) and language modeling (i.e., generation). BART consists of a bidirectional encoder and an autoregressive decoder. Therefore, we conducted the token classification task in the encoder (i.e., syntax-aware encoder) and the language model task in the decoder (i.e., conversational decoder). As illustrated in Figure 2, task-specific linear heads were trained through multi-task learning, which performs the main task as a dialogue summarization task and the POS sequence labeling task as an auxiliary task.

2.5 Syntax-Aware Encoder

We sought to address the application of syntax-aware information to a dialogue summarization model through the sharable encoder.

Each utterance was delimited by a special [EOU] token, as was done in previous work gliwa2019samsum, so that each utterance is considered separately. In general, the input sequence, $X = (x_1, \dots, x_n)$, is fed into the bottom encoder of BART. Given the hidden outputs of the encoder's last layer, $H^L = (h^L_1, \dots, h^L_n)$, the output layer for the sequence labeling task was a linear classifier $f: \mathbb{R}^d \rightarrow \Delta^K$, where $d$ denotes the dimension of the hidden layer and $\Delta^K$ is a simplex, where $K$ is the number of POS tags.

$$H^L = \mathrm{Encoder}(X)$$

, where $L$ is the last layer. The probability that the word $x_i$ aligns with the $k$-th POS tag is computed using softmax:

$$P(z_i = k \mid X) = \mathrm{softmax}(W h^L_i)_k$$

, where $W$ is a parameter to be learned.
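The token-classification head described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `softmax` and `tag_probabilities` are hypothetical, the hidden states are plain Python lists standing in for the encoder's last-layer outputs, and the weight matrix has one row per POS tag with no bias term.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def tag_probabilities(hidden: List[List[float]],
                      weight: List[List[float]]) -> List[List[float]]:
    """Map n hidden vectors (n x d) to n distributions over K tags.

    `weight` is an assumed K x d matrix: one learned row per POS tag.
    """
    probs = []
    for h in hidden:
        # Linear head: one dot product per tag, then softmax.
        scores = [sum(w * x for w, x in zip(row, h)) for row in weight]
        probs.append(softmax(scores))
    return probs


# Two toy tokens with d = 2 and K = 3 tags.
hidden_states = [[1.0, 0.0], [0.0, 1.0]]
head_weight = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
for dist in tag_probabilities(hidden_states, head_weight):
    print([round(p, 3) for p in dist])
```

Each output row is a point on the probability simplex over the tag set, which is what the linear classifier is described as producing.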

2.6 Conversational Decoder

To integrate the syntax-aware encoder with the decoder, the dialogue summarization model combines the shared syntax-aware encoder (Equation 2) with the conversational decoder.

Shared Syntax-Aware Encoder

We used the same encoder, across all of its layers, as in Section 2.5 to apply the syntax-aware information to the dialogue summarization model. The encoder was shared between the two tasks.

The syntactic information could provide different conversational aspects for the models to learn and further determine which set of utterances deserve more attention to generate better dialogue summaries.

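The input to the decoder included the previously generated tokens $y_{<t} = (y_1, \dots, y_{t-1})$. We fed these tokens, together with the encoder outputs $H^L$, to the conversation decoder $\mathrm{Dec}$, and the $t$-th token ($y_t$) was predicted as follows:

$$P(y_t \mid y_{<t}, X) = \mathrm{softmax}(W_d \, \mathrm{Dec}(y_{<t}, H^L))$$

, where $W_d$ is a parameter to be learned.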

2.7 Syntax-Aware Multi-task Learning

We trained the two tasks jointly using multi-task learning. We considered the dialogue summarization task as the main task and the sequence labeling task as an auxiliary task.

Joint training

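During training, each of the two objectives was trained with a cross-entropy loss, and we sought to minimize the losses as follows:

$$\mathcal{L}_{sum} = -\sum_{t} \log P(y_t \mid y_{<t}, X), \qquad \mathcal{L}_{seq} = -\sum_{i} \log P(z_i \mid X)$$

, where $X$ is the input dialogue, $y_t$ is the $t$-th summary token, and $z_i$ is the POS tag of the $i$-th input token. The final loss of our model combines $\mathcal{L}_{sum}$ and $\mathcal{L}_{seq}$, the losses of the dialogue summarization model and the sequence labeling model, respectively, weighted by $\lambda$, which denotes the parameter of strength in each task.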

Finally, the model can activate linguistic information to enhance its ability to distinguish a speaker’s utterance style.
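The joint objective can be illustrated with a short sketch. The exact way the two losses are combined is not recoverable from the text, so the code assumes the common form "summarization loss plus a weighted sequence-labeling loss"; that form, the names `cross_entropy` and `joint_loss`, and the default weight of 0.1 (taken from the final-model setting in Section 3.3) are assumptions for illustration only.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[Sequence[float]],
                  targets: Sequence[int]) -> float:
    """Summed negative log-likelihood of the target indices."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets))


def joint_loss(summary_probs: Sequence[Sequence[float]],
               summary_targets: Sequence[int],
               tag_probs: Sequence[Sequence[float]],
               tag_targets: Sequence[int],
               strength: float = 0.1) -> float:
    """Combine the main and auxiliary losses.

    ASSUMPTION: loss = L_sum + strength * L_seq. The paper only states
    that a strength parameter balances the two tasks.
    """
    loss_sum = cross_entropy(summary_probs, summary_targets)
    loss_seq = cross_entropy(tag_probs, tag_targets)
    return loss_sum + strength * loss_seq


# Two summary tokens and one tagged input token, all with probability 0.5.
value = joint_loss([[0.5, 0.5], [0.5, 0.5]], [0, 1], [[0.5, 0.5]], [1])
print(round(value, 4))
```

Under this assumed form, a smaller strength value keeps the auxiliary POS task from dominating the gradient of the main summarization task.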

2.8 Speaker Styles of Utterance for Ad-hoc Analysis

To represent the uttering styles of speakers, we treated the list of POS tags extracted from the utterances of each speaker as that speaker's style, as in Equation 10, where $i$ denotes the index of the speaker and $m$ is the number of tags. A style thus has the same form as a document made up of words.

$$\mathrm{style}_i = (z_{i,1}, z_{i,2}, \dots, z_{i,m}) \qquad (10)$$

With the style documents, we computed tf-idf, a commonly used method to weight the importance of each term in a document. The tf-idf formula is as follows:

$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \log \frac{N}{\mathrm{df}_j} \qquad (11)$$

In Equation 11, $\mathrm{tf}_{i,j}$ represents the term frequency of the $j$-th tag in the $i$-th speaker style, $\mathrm{df}_j$ denotes the number of speaker styles in which the $j$-th tag appears, and $N$ is the total number of speaker styles. We then employed K-means clustering to group the speaker styles. This ad-hoc analysis was used to show whether our proposed strategies worked as intended in the trained models.
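The style-document weighting can be sketched as follows. This is a minimal illustration, assuming raw counts for the term frequency and an unsmoothed logarithmic inverse document frequency; the function name `style_tfidf` and the toy tag sequences are hypothetical, and the clustering step is omitted.

```python
import math
from collections import Counter
from typing import Dict, List


def style_tfidf(styles: List[List[str]]) -> List[Dict[str, float]]:
    """Compute tf-idf per speaker style, where a style is a list of POS tags.

    ASSUMPTIONS: tf is the raw tag count and idf is log(N / df),
    with no smoothing or normalisation.
    """
    n_styles = len(styles)
    # df: in how many styles does each tag appear at least once?
    df = Counter(tag for style in styles for tag in set(style))
    scores = []
    for style in styles:
        tf = Counter(style)
        scores.append({tag: count * math.log(n_styles / df[tag])
                       for tag, count in tf.items()})
    return scores


# Three toy speakers; N and V occur in every style, T and E do not.
toy_styles = [["N", "V", "T"], ["N", "V", "E", "E"], ["N", "V"]]
for weights in style_tfidf(toy_styles):
    print({tag: round(w, 3) for tag, w in weights.items()})
```

With this form of idf, a tag that appears in every speaker style receives a weight of zero, so only tags that separate speakers contribute to the clustering.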

3 Experimental Setup

3.1 Dataset and Baselines

We trained and evaluated our model on a large-scale dialogue summary dataset SAMSum gliwa2019samsum. SAMSum is the first daily chat corpus for use in dialogue summarization and truly informal conversation. The subject of each conversation is open domain, and the conversation type is informal. The details on data statistics in SAMSum corpus are shown below, including in Table 1.

| Split | # Conv | S.L Mean | S.L Range | # Speakers Mean | # Speakers Range | # Turns Mean | # Turns Range |
|---|---|---|---|---|---|---|---|
| Train | 14732 | 23.44 | [2, 73] | 2.40 | [1, 14] | 11.17 | [1, 46] |
| Dev | 818 | 23.42 | [4, 68] | 2.39 | [2, 12] | 10.83 | [3, 30] |
| Test | 819 | 23.12 | [4, 71] | 2.36 | [2, 11] | 11.25 | [3, 30] |

Table 1: Data statistics of the SAMSum corpus. S.L denotes summary length. Range indicates the minimum and maximum values.

Data Statistics

To better understand the characteristic of this corpus, we explored the density of the number of utterances.

As shown in Figure 5 (see Appendix C), the density of the number of utterances is concentrated below ten, which indicates that dialogues in the training set generally contain fewer than ten utterances. This result motivates input sequence types based on locational biases, such as LEAD-3 see2017get, which takes the three leading sentences of the source text as the summary (we discuss this in Section 3.3).


We evaluated the model’s performance with the following summarization models, based upon previous works gliwa2019samsum.

  • Pointer Generator see2017get This model followed gliwa2019samsum, wherein separators are added between each utterance and utilized as input for the pointer generator model.

  • DynamicConv + GPT-2 wu2019pay Based on gliwa2019samsum, this model uses GPT-2 to initialize token embeddings.

  • Fast Abs RL Enhanced chen2018fast adopts a hybrid method that selects salient sentences and then paraphrases them as abstractive sentences through sentence-level policy gradient methods.

  • BART lewis-etal-2020-bart We used BART as the vanilla model in the following setting and added a separator after each utterance. The default parameter setting was BART-base (https://huggingface.co/transformers/model_doc/bart.html). Additionally, we fed the input with the LONGEST-10 setting, as described in Section 3.3.

3.2 Evaluation Metrics

We utilized different evaluation metrics, including several recently introduced methods used in text summarization and generation tasks.

ROUGE-N lin2004rouge (https://github.com/pltrdy/rouge; note that different packages may generate different ROUGE scores) is the most widely used evaluation metric for the text summarization task. We calculated ROUGE-1, ROUGE-2, and ROUGE-L.

BertScore zhang2019bertscore calculates alignment-based similarity scores between the generated and reference summaries at the token level using BERT.
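For readers unfamiliar with how n-gram overlap turns into precision, recall, and F1, here is a minimal sketch of ROUGE-N on whitespace tokens. It is a simplification and not the package used in the paper: there is no stemming or other preprocessing, the function names `ngrams` and `rouge_n` are hypothetical, and real packages may therefore give different scores.

```python
from collections import Counter
from typing import Tuple


def ngrams(tokens, n: int) -> Counter:
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: str, reference: str, n: int = 1) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of clipped n-gram overlap."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    # Counter intersection clips each n-gram to its minimum count.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f = rouge_n("maya will buy earplugs",
                  "maya will buy 5 packs of earplugs", n=1)
print(round(p, 3), round(r, 3), round(f, 3))
```

A short candidate can reach perfect precision while missing reference content, which is why the tables report precision and recall alongside F1.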

3.3 Implementation Details

We tested different input type modes in our proposed model. The traditional method for handling long sequences in a pretrained language model is to truncate the sequence at an arbitrary point, which can cut an utterance in the middle, owing to limitations in system memory. To alleviate this issue, we propose a novel input type method that retains the utterance format. Inspired by previous work gliwa2019samsum; see2017get, we defined the input types as LEAD-n, MIDDLE-n, and LONGEST-n. The underlying assumption of these input types is that, owing to locational biases kim2019abstractive, the essential information is contained at the head (i.e., the lead) or in the middle of lengthy conversations. Additionally, this method preserves entire utterance sequences without breaking up sentence information. To support the above assumptions, we computed the data statistics described in the above section. We trained the model according to the following settings: LEAD-n takes the n leading utterances of the dialogue, MIDDLE-n takes n utterances from the middle of the dialogue, and LONGEST-n takes the n longest utterances of the dialogue. We used the BART-base model to initialize the backbone of the encoder/decoder frame and followed the default settings. The learning rate was set to 3e-4. We trained the model for 20 epochs. We set the task-strength parameter $\lambda$ to 0.1 in the final model. The training was conducted on a single RTX 8000 GPU with 48 GB memory. We trained the model with the Adam optimizer kingma2014adam and early stopping on validation set ROUGE-1. During inference, the beam size was 4, including for the baseline models.
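The three input types can be sketched as simple selection functions over a list of utterances. This is an illustration under assumptions rather than the authors' code: the function names are hypothetical, MIDDLE-n is assumed to take a centred window, and LONGEST-n is assumed to rank utterances by whitespace token count and then restore dialogue order.

```python
from typing import List


def lead_n(utterances: List[str], n: int) -> List[str]:
    """Keep the first n utterances."""
    return utterances[:n]


def middle_n(utterances: List[str], n: int) -> List[str]:
    """Keep n utterances centred in the dialogue (assumed windowing)."""
    if len(utterances) <= n:
        return list(utterances)
    start = (len(utterances) - n) // 2
    return utterances[start:start + n]


def longest_n(utterances: List[str], n: int) -> List[str]:
    """Keep the n longest utterances, in their original order.

    ASSUMPTION: length is the number of whitespace-separated tokens.
    """
    ranked = sorted(range(len(utterances)),
                    key=lambda i: len(utterances[i].split()),
                    reverse=True)[:n]
    return [utterances[i] for i in sorted(ranked)]


dialogue = ["a", "b b b", "c c", "d d d d", "e"]
print(lead_n(dialogue, 2))
print(middle_n(dialogue, 3))
print(longest_n(dialogue, 2))
```

All three functions return whole utterances, so no utterance is cut in the middle, which is the property that distinguishes these input types from plain truncation.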

4 Main Results

4.1 Quantitative Results

We evaluated the models across the different settings with ROUGE-1, ROUGE-2, ROUGE-L, and BertScore on the SAMSum test set. The experimental results are shown in Tables 2 and 4.

| Type | $\lambda$ | n | ROUGE-1 | ROUGE-2 | ROUGE-L | BertScore |
|---|---|---|---|---|---|---|
| LEAD | 0.5 | 10 | 0.409 | 0.170 | 0.390 | 0.909 |
| MIDDLE | 0.5 | 10 | 0.403 | 0.167 | 0.382 | 0.908 |
| LONGEST | 0.5 | 10 | 0.425 | 0.183 | 0.405 | 0.909 |
| LEAD | 0.5 | 20 | 0.425 | 0.188 | 0.409 | 0.910 |
| MIDDLE | 0.5 | 20 | 0.414 | 0.181 | 0.404 | 0.908 |
| LEAD | 0.1 | 10 | 0.426 | 0.188 | 0.414 | 0.910 |
| MIDDLE | 0.1 | 10 | 0.428 | 0.192 | 0.414 | 0.910 |
| LONGEST | 0.1 | 10 | 0.431 | 0.189 | 0.420 | 0.910 |
| LEAD | 0.1 | 20 | 0.424 | 0.187 | 0.415 | 0.909 |
| MIDDLE | 0.1 | 20 | 0.425 | 0.189 | 0.416 | 0.909 |

Table 2: Performance comparison according to the different input type settings for training. $\lambda$ and n indicate the strength of the task ability and the number of utterances in dialogue, respectively.

| Model | Type | avg # words |
|---|---|---|
| Ground summary | - | 23.12 |
| BART | LONG-10 | 22.25 |
| Syntax-aware BART | LEAD-10 | 19.95 |
| Syntax-aware BART | MIDDLE-10 | 18.25 |
| Syntax-aware BART | LONG-10 | 21.95 |

Table 3: The average number of words of the generated summaries at inference.

Internal model verification

In Table 2, we explore the influence that locational biases in dialogue have on performance. The input type settings are compared on the F1 scores of ROUGE and on BertScore. We set the task-strength parameter $\lambda$ to 0.5 and 0.1 and compared n of 10 and 20 for each setting (we did not run LONGEST-20 due to limitations in computing power). With $\lambda$ set to 0.1, the LONGEST-10 model showed the best ROUGE-1 and ROUGE-L scores, although MIDDLE-10 scored slightly higher on ROUGE-2 (0.192 vs. 0.189). In general, performance with $\lambda$ set to 0.5 was lower than with $\lambda$ set to 0.1. BertScore differed only marginally across settings, but it was also at its highest value (0.910) in this configuration. We interpret this result as indicating that lengthy utterances are valuable when generating summaries, and that the key topics can be captured within 10 utterances of a dialogue.

Comparison of the generated length

We investigated the length of the generative summary, which varies with the input type. As shown in Table 3, we examined the average of words according to the generated summaries at each inference step. The BART base model (i.e., BART) used 22.25 words when it generated the summaries. Moreover, the average words generated by our proposed model according to the input type shows that our best performance model (i.e., Syntax-aware BART (LONG-10)) used only 21.95 words, thus requiring fewer words than the baseline model. This observation reveals that our proposed model often favors generating slightly shorter summaries than does the BART baseline model, which leads to more concise summaries while still capturing the important information. According to input types, input length influences the average number of words.

External model verification

In Table 4, we present the ROUGE-1, ROUGE-2, and ROUGE-L scores between our model and other, comparison models. First, our proposed model outperformed the other baselines with respect to F1 for all ROUGE scores. As hypothesized previously, our experiments demonstrate that the usage of linguistic information is worthwhile to enhance the model performance.

Fast Abs RL Enhanced achieved slightly better scores than Pointer Generator and DynamicConv+GPT2. This indicates that the use of reinforcement learning to first select important sentences is beneficial. The key factor related to the overall lower performance of the baseline models seems to be that the baseline models fundamentally are not based on the language model; however, the DynamicConv model with the GPT-2 embeddings is based on the usage of pretrained embeddings from the language model GPT-2, which is trained on a large corpus.

| Model | Type | ROUGE-1 F | ROUGE-1 P | ROUGE-1 R | ROUGE-2 F | ROUGE-2 P | ROUGE-2 R | ROUGE-L F | ROUGE-L P | ROUGE-L R |
|---|---|---|---|---|---|---|---|---|---|---|
| Pointer Generator see2017get* | - | 0.401 | - | - | 0.153 | - | - | 0.366 | - | - |
| DynamicConv + GPT-2 wu2019pay* | - | 0.418 | - | - | 0.164 | - | - | 0.376 | - | - |
| Fast Abs RL Enhanced chen2018fast* | - | 0.420 | - | - | 0.181 | - | - | 0.392 | - | - |
| BART | LONG-10 | 0.426 | 0.488 | 0.419 | 0.188 | 0.220 | 0.184 | 0.419 | 0.464 | 0.415 |
| Syntax-aware BART ($\lambda$ = 0.1) | LONG-10 | 0.431 | 0.486 | 0.426 | 0.189 | 0.216 | 0.186 | 0.420 | 0.460 | 0.418 |

Table 4: Performance comparison of the proposed method with different models on the test set. * denotes the results from chen2020multi, and Syntax-aware BART ($\lambda$ = 0.1) corresponds to our proposed model, which shows the best performance (LONGEST-10). Note that F, P, and R indicate F1, precision, and recall scores, respectively.

Dialogue 1
1. lilly: sorry, I’m gonna be late
2. lilly: don’t wait for me and order the food
3. gabriel: no problem, shall we also order something for you?
4. gabriel: so that you get it as soon as you get to us?
5. lilly: good idea
6. lilly: pasta with salmon and basil as always very tasty there
REF: lilly will be late. gabriel will order pasta with salmon and basil for her.
FE: lilly will be late. lilly and gabriel are going to pasta with salmon and basil is always tasty. [62/46/68]
B: lilly will be late. gabriel and lilly will order food for lilly and gabriel. [72/39/63]
SB: lilly is going to be late. gabriel will order food for her. lilly will get pasta with salmon and basil and she will get it as soon as she arrives at them. [68/23/68]

Dialogue 2
1. randolph: honey
2. randolph: are you still in the pharmacy?
3. maya: yes
4. randolph: buy me some earplugs please
5. maya: how many pairs?
6. randolph: 4 or 5 packs
7. maya: i’ll get you 5
8. randolph: thanks darling
REF: maya will buy 5 packs of earplugs for randolph at the pharmacy.
FE: randolph is in the pharmacy. randolph will buy some earplugs for randolph. maya will get 5. [64/38/71]
B: laurie is in the pharmacy. maya will buy 4 or 5 pairs of earphones for him. [51/23/51]
SB: maya will buy 4 or 5 pairs of earplugs for raymond at the pharmacy. [63/34/63]

Speaker Style Utterance (abbreviated)
(1) Style A vs B
…Robert: …The Swedes didn’t even bother to find out they started laying them off… (B)
Cynthia: …i’d like us to go to this new bistro i discovered… (A)
(2) Style C
…Iris : <file other> My husband is famous… Haha. You don’t even realize what this…
…Dan : <photo file>…But its not working any more and it hurts :(
…Simon : BTW it’s so annoying that people can’t see that such immigration policy reduces…

Table 5: Examples of dialogues from each model. REF – reference summary, FE – Fast Abs RL Enhanced, B – vanilla BART, and SB – Syntax-aware BART (Ours). [R-1/R-2/R-L] indicates F1 score from ROUGE-n. The error consists of the following factors (i), (ii), and (iii); otherwise, the accurate case is colored lime.

4.2 Qualitative Results

We compared the examples generated by several baselines and by our proposed model in terms of their ROUGE scores. We present the qualitative results in Table 5. We conducted the error analysis using the following major error types. (i) Incorrect reasoning: the model came to an incorrect conclusion, which occurred when the generated summary inferred relations in the dialogue incorrectly. (ii) Incorrect reference: the generated summary associates a person's location or actions with an incorrect speaker, regardless of the original context. (iii) Redundancy: the generated summary contains content that is not mentioned in the reference. (iv) Missing information: content present in the reference is absent from the generated summary.

Error analysis

In dialogue 1, FE (i.e., Fast Abs RL Enhanced) and SB (i.e., Syntax-aware BART) performed well at capturing the meaning of the reference summary, despite being slightly lengthy. The B model (i.e., vanilla BART) showed the highest ROUGE-1 score but contained an instance of (i) incorrect reasoning (gabriel and lilly). In contrast, our SB model reproduced (highlighted in lime) the content of the lengthy utterance at line 6, which suggests that the model was affected by the lengthy input type. There is also a case of (iii) redundancy in SB: the related content appeared in the dialogue but was absent from the reference.

Figure 3: Two-dimensional PCA projection of each speaker style: A (magenta), B (blue), and C (purple). The legend indicates the center point of each cluster.

Figure 4: Average tf-idf score on top six ranked POS features by standard deviation (std) according to the speaker styles (A, B, and C).

In dialogue 2, the SB model showed the highest performance. FE mismatched (ii) the information of who was acting (i.e., the subject) and who received the action (i.e., the object). This observation was also true of the B (laurie) and SB (raymond) models. Additionally, there was a case of (iii) redundancy in the SB output (4), even though this content is present in the dialogue. In sum, according to the observations listed in Table 5, our proposed model captured the lengthy utterance well, in line with our objective of using locational information. Nonetheless, there are limitations to this study, such as the inclusion of incorrect references. This is an area that we aim to improve in future work.

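| Tag | ~ | T | E | G | & | X |
|---|---|---|---|---|---|---|
| Description | discourse/file marker | verb particle | emoticon | abbreviations, foreign words | coordinating conjunction | existential there, predeterminers |
| Example | @user:hello | up, out | ;-), :b | btw (by the way) | and, but | both, all, half |

Table 6: POS tag description for the top-6 ranked.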

Speaker utterance style (Vitamin)

We derived the speaker styles from the test set by following Section 2.8 for the ad-hoc analysis. In detail, our research question was "what are the differences between the speakers?" (the group characteristics are reported in Appendix B). Figure 3 depicts the distributional characteristics of each speaker style through a PCA (principal component analysis) projection and K-means clustering (K=3). Styles 'A' and 'B' overlap to some extent, which is also visible in Figure 4, and 'C' is relatively distant from the other groups.

In Figure 4, we illustrate the top-6 ranked POS features that distinguish the groups (see the detailed values in Appendix B). Style 'B' is characterized by T, which represents the verb particle, and style 'C' is distinguished from the other groups mainly by ~, E, and G. In contrast, style 'A' shows relatively flat scores, all below 1.0, as shown in Figure 4 and Appendix B. In the right part of Table 5, we compare the speaker styles: (1) style 'B' used a verb particle (find out), whereas 'A' expressed the same meaning in a different way (discovered), and (2) style 'C' mostly tended to express intention using pictures or file contents. In the end, we found that the ability to distinguish these speaker styles was reflected in our model.

5 Conclusion

In this study, we proposed a novel syntax-aware sequence-to-sequence model that leverages syntactic information (i.e., POS tagging), takes the constraints of informal daily chat structure into account, and distinguishes the different textual styles of multiple speakers for abstractive dialogue summarization. To combine syntactic information with the dialogue summarization task, we adopted multi-task learning to learn both syntactic information and dialogue summarization. Furthermore, we presented a novel input type to train the model to explore locational biases in dialogue structures. We benchmarked the experiments using the SAMSum corpus, and the experimental results demonstrate that the proposed method improves upon the comparison models for all ROUGE scores.

There are promising future directions regarding this research. It would be worthwhile to apply the traditional truncation method with our proposed model to deeply compare performance differences.


Appendix A Related Work

A.1 Dialogue summarization

Most of the previous works focused on summarizing conversations from meetings mccowan2005ami, in part due to the lack of a corpus for daily conversation summaries. Hence, SAMSum, which is a corpus of daily conversation summaries, has been receiving much attention as announced by gliwa2019samsum recently.

In the standard dialogue summarization paradigm, a pointer-generator see2017get, which is a hybrid of the typical sequence-to-sequence attention model nallapati2016abstractive and a pointer network vinyals2015pointer, is used as the abstractive dialogue summarization model. This framework encodes the source sequence and generates the target sequence with the decoder for abstractive dialogue summarization yuan2019abstractive; goo2018abstractive.

Recently, the standard paradigm has shifted to using a combination of a pretraining method with much larger external text corpus (e.g., Wikipedia, books) and a transformer-based sequence model. This strategy has led to a remarkable improvement in performance when fine-tuned for both text generation tasks and natural language understanding like BART lewis-etal-2020-bart.

A.2 Syntax-aware text summarization

Syntax representation of text can be applied to text summarization to leverage linguistic information because it assists in information filtering to obtain highlighted context from a source document bouras2008improving, and yet the importance of this syntax has been previously underestimated zopf2018s. When linguistic information is used to perform text summarization, it finds the relationships between terms in the document through sequence labeling (POS tagging al2018generating, named entity recognition dobreva2020improving), grammar analysis lu2019attributed, and thesaurus usage (e.g., Wordnet) pal2014approach, and then extracts the salient context.

Previous research has investigated the use of linguistic information such as POS tagging for text summarization. al2018generating approached selective POS tagging for words such as nouns, verbs, and adjectives to extract sentence summaries. bouras2008improving and afsharizadeh2018query also attempted to extract keywords by retrieving nouns through POS tagging. liu2017pos reported the utilization of POS tagging to distil keywords for extractive summarization in Korean.

However, these studies only considered extractive summarization tasks and applied linguistic information to extract the key sentences as features using the scoring function. By contrast, we propose to have a pretrained model learn linguistic information implicitly through the sequence labeling task to function in an abstractive way.

Appendix B Speaker Style Cluster Result

Table 7 shows the POS features gimpel2010part by speaker style. We calculated the average tf-idf value for each group. Note that extremely common tags have a tf-idf value of zero. The standard deviation shows how much each POS feature varies between groups.

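| feature | A style | B style | C style | std |
|---|---|---|---|---|
| ~ | 0 | 0 | 2.58 | 1.49 |
| T | 0 | 1.75 | 0.36 | 0.92 |
| E | 0.3 | 0.4 | 0.96 | 0.35 |
| G | 0.14 | 0.14 | 0.39 | 0.15 |
| & | 0.32 | 0.51 | 0.45 | 0.1 |
| X | 0.13 | 0.31 | 0.24 | 0.09 |
| $ | 0.32 | 0.49 | 0.44 | 0.09 |
| Z | 0.09 | 0.18 | 0.22 | 0.07 |
| L | 0.29 | 0.4 | 0.33 | 0.06 |
| ! | 0.27 | 0.37 | 0.31 | 0.05 |
| U | 0 | 0 | 0.05 | 0.03 |
| S | 0.06 | 0.09 | 0.06 | 0.02 |
| D | 0.19 | 0.22 | 0.22 | 0.02 |
| R | 0.19 | 0.22 | 0.2 | 0.02 |
| A | 0.19 | 0.23 | 0.21 | 0.02 |
| P | 0.17 | 0.2 | 0.18 | 0.02 |
| @ | 0.01 | 0.02 | 0 | 0.01 |
| O | 0.1 | 0.12 | 0.11 | 0.01 |
| N | 0.09 | 0.1 | 0.1 | 0.01 |
| V | 0.07 | 0.08 | 0.07 | 0 |
| Y | 0 | 0 | 0 | 0 |
| ^ | 0 | 0 | 0 | 0 |
| , | 0 | 0 | 0 | 0 |

Table 7: Average tf-idf score of each speaker style, ordered by standard deviation (std).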

Appendix C Density of the number of utterances

Figure 5: Data distribution of the number of utterances in the SAMSum corpus (training set).