Conditional Neural Generation using Sub-Aspect Functions for Extractive News Summarization

04/29/2020 · by Zhengyuan Liu, et al.

Much progress has been made in text summarization, fueled by neural architectures using large-scale training corpora. However, reference summaries tend to be position-biased and constructed in an under-constrained fashion, especially for benchmark datasets in the news domain. We propose a neural framework that can flexibly control which sub-aspect functions (i.e. importance, diversity, position) to focus on during summary generation. We demonstrate that automatically extracted summaries with minimal position bias can achieve performance at least equivalent to standard models that take advantage of position bias. We also show that news summaries generated with a focus on diversity tend to be preferred by human raters. These results suggest that a more flexible neural summarization framework can provide more control options to tailor to different application needs. This framework is useful because it is often difficult to know or articulate a priori what the user preferences of certain applications are.




I Introduction

I-A Motivation

Text summarization is a core task in natural language processing, which aims to automatically generate a shorter version of the source content while retaining the most important information. As a straightforward and effective method, extractive summarization creates a summary by selecting and subsequently concatenating the most salient semantic units in a document. Recently, neural networks, which can be trained in an end-to-end manner, have achieved favorable improvements on various large-scale benchmarks [1, 2, 3].

Despite renewed interest and avid development in extractive summarization, there are still long-standing, unresolved challenges. One major problem is position bias, which is especially common in the news domain, where the majority of summarization research is conducted. In news articles, sentences appearing earlier tend to be more important for summarization tasks [4], and this preference is reflected in many reference summaries of public datasets. However, while there is a tendency for important sentences to be presented at the very beginning of a news article, news articles can be written in various ways beyond this classic textbook style of the “inverted pyramid” [5]. Other journalism writing styles include the anecdotal lead, question-and-answer format, and chronological organization [6]. Therefore, salient information could also be scattered across the entire article, instead of being concentrated in the first few sentences, depending on the journalist's chosen writing style.

In addition to journalistic writing style variations, more subjective variability is injected into text summarization tasks at the ground-truth construction stage. According to [7]: “Content selection is not a deterministic process [8, 9, 10]. Different people choose different sentences to include in a summary, and even the same person can select different sentences at different times [11]. Such observations lead to concerns about the advisability of using a single human model …”

Such observations suggest that, without explicit instructions and targeted applications in mind, ground-truth construction for summarization can easily become an under-constrained assignment for human annotators, making it difficult for current machine learning models to reach their full potential.


Therefore, in this work, we propose a flexible neural summarization framework that is able to provide more explicit control options when automatically generating summaries (see Figure 1). We follow the spirit of sub-aspect theory and adopt control codes on sub-aspects to condition summary generation. The advantages of this framework are two-fold: (1) It provides a systematic approach to investigate and analyze how one might minimize position bias in neural extractive news summarization. Most, if not all, previous work (e.g. [12, 13]) only focuses on analyzing the degree and prevalence of position bias. In this work, we take one step further and propose a research methodology to disentangle position bias from important and non-redundant summary content. (2) Text summarization needs are often domain or application specific, and it is difficult to articulate a priori what the user preferences are, so potential iterations are needed to adapt and refine. However, human ground-truth construction for summarization is time-consuming and labor-intensive. With a more flexible summary generation framework, we can cut down on manual labor and generate useful summaries more efficiently.

Figure 1: Proposed conditional generation framework exploiting sub-aspect functions.

I-B Generation Framework Overview

In neural approaches, maximum likelihood estimation is commonly applied for model optimization, maximizing the probability P(y|x; θ), where x is the input, y is the target, and θ is the set of trainable parameters. This setup results in neural models latching on to the features that correlate most with the output, which are often position-related features in the case of extractive news summarization. The model can easily overfit and select the first-k sentences as the best candidates regardless of the full context, resulting in sub-optimal models with sophisticated neural architectures that do not generalize well to other domains [14].

To this end, we postulate that sentence selection can benefit from finer-constrained conditional learning. Since summarization has been regarded as a combination of sub-aspect functions (e.g. information, layout) [13, 15, 16], we can use these sub-aspects to condition summary generation. Therefore, we transform the learning objective from P(y|x; θ) to P(y|x, c; θ), where c is an auxiliary conditional vector. In our framework, c is the control code, an integral part of the input that helps guide the model to focus on different sub-aspect features. We expect that such control measures can reduce position bias and provide more extractive news summarization options for downstream applications.
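To make the conditioning concrete, here is a minimal sketch (ours, not the authors' released code) of how a 3-dimensional control code c can be appended to every sentence embedding before the selection layers; the function name and array shapes are illustrative assumptions:

```python
import numpy as np

def condition_inputs(sent_embs, control_code):
    """Append the same sub-aspect control code to every sentence embedding.

    sent_embs: (num_sentences, dim) array of sentence representations.
    control_code: length-3 sequence in the form [importance, diversity, position].
    Returns a (num_sentences, dim + 3) array to feed the selection model.
    """
    code = np.asarray(control_code, dtype=sent_embs.dtype)
    tiled = np.tile(code, (sent_embs.shape[0], 1))  # repeat the code per sentence
    return np.concatenate([sent_embs, tiled], axis=1)
```

The same document can thus be paired with different codes at inference time to produce summaries of different styles without retraining.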

A suitable set of sub-aspect control codes is important for such conditional generation. An ideal set should characterize different aspects of summarization well in a comprehensive manner but at the same time possess a relatively clear boundary between one another to minimize the set size [17]. To achieve this, we adopt the sub-aspects defined in [13]: IMPORTANCE, DIVERSITY, and POSITION, and assess their characterization capability on the CNN/Daily Mail news corpus [18] via quantitative analyses and unsupervised clustering.

We utilize control codes based on these three sub-aspect functions to label the training data and implement our conditional generation approach with a neural selector model. Experiment results show that given different control codes, the model can generate output summaries of alternative styles while maintaining performance comparable to the current state-of-the-art models.

II In Relation to Other Work

Most benchmark datasets in text summarization focus on the news domain, such as New York Times [19] and CNN/Daily Mail [18], where the human-written summaries can be utilized in both abstractive and extractive paradigms. While abstractive approaches focus on writing a summary automatically by exploring comprehension and generative methods [20, 21], extractive approaches score and select semantic units such as phrases and sentences from the source content, and concatenate them into a summary with better fluency and readability.

To improve the performance of extractive summarization, non-neural approaches exploit various linguistic and statistical features such as lexical characteristics [22], explicit or latent topic information of the source content [23], document-level rhetorical discourse analysis [24], and graph-based lexical [25] and structural [26] modeling. On the other side, neural approaches learn features in a data-driven manner and have been significantly improved by large-scale corpora and the development of neural architectures. Various sophisticated designs and components equip neural models with powerful learning capability: word embedding methods like Word2Vec [27] provide feature-rich semantic vector representations, and sequential modeling architectures like recurrent networks [28] help to obtain contextual comprehension, which can be further enhanced by the self-attention mechanism [29]. Based on recurrent neural networks, SummaRuNNer [1] is one of the earliest neural models for extractive summarization. Much further development has been made by applying reinforcement learning [30], jointly learning sentence scoring and ranking [31], exploiting multi-level segmentation [32], and utilizing deep contextual language models [3].

Despite much development in recent neural approaches, there are still challenges such as corpus bias and system bias [13] in the summary, which often stem from position bias in the gold ground-truth, conceivably resulting from the “inverted pyramid” writing style in journalism [33]. However, to date only analysis work has been done to characterize the position-bias problem and its ramifications, such as the inability to generalize across corpora or domains [12, 13, 14]. Few, if any, have attempted to resolve this long-standing problem of position bias using neural approaches. In this work, we take a first stab at disentangling three sub-aspects that are commonly used to characterize summarization during the summary generation process: POSITION for choosing sentences by their position, IMPORTANCE for choosing relevant and repeating content across the document, and DIVERSITY for ensuring minimal redundancy between summary sentences [13]. In particular, we use these three sub-aspects as control codes for conditional training. To the best of our knowledge, this is the first work applying auxiliary conditional codes for extractive summary generation.

In other research areas such as computer vision and voice conversion, there has been work on including auxiliary condition signals as input to obtain finer-constrained outputs. In image style transfer, codes specifying color or texture are used to train conditional generative adversarial networks and variational autoencoders [17]. In natural language processing, topic information can be categorized and imported as conditional signals, which has been applied in dialogue response generation [35] and pre-training of large-scale language models [36], and sentiment polarity is used in text style transfer [37].

III Extractive Oracle Construction

III-A Similarity Metric: Semantic Affinity vs. Lexical Overlap

For widely adopted benchmark corpora, e.g. CNN/Daily Mail [18], there are only gold abstractive summaries written by humans, with no corresponding extractive oracle summaries. To convert the human-written abstracts to extractive oracle summaries, most previous work used the ROUGE score [38], which counts contiguous n-gram overlap, as the similarity criterion to rank and select sentences from the source content. Since ROUGE only conducts simple lexical matching via word-overlap counting, salient source sentences that were paraphrased by human editors could be overlooked because their ROUGE scores would be low, while sentences with a high count of common words could receive an inflated ROUGE score.
To tackle this drawback of using ROUGE as the scoring criterion, we propose to apply the semantic similarity metric BertScore [39] to rank candidate sentences, for the following reasons. First, BertScore has shown better performance than ROUGE and BLEU in sentence-level semantic similarity assessment [39]. Moreover, BertScore includes a recall measure between reference and candidate tokens, which is more suitable than distance-based similarity measures [40, 41] for summarization-related tasks, where there is an asymmetrical relationship between the reference and the generated text.

III-B Oracle Construction and Evaluation

To build oracles with semantic similarity, we first conduct sentence segmentation on the source documents and the human-written gold summaries. Then we convert the text to a semantically rich distributed vector space. For each sentence in a gold summary, we use BertScore to calculate its semantic similarity with candidates from the source content, and the sentence with the highest recall score is chosen. We filter out candidates whose recall score falls below a threshold to further streamline the selection process.
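The selection procedure above can be sketched as a greedy loop. This is an illustrative reimplementation, not the authors' code: the recall scorer is injected as a parameter (in the paper this would be BertScore recall), and the threshold value is a hypothetical placeholder since the source elides the exact setting.

```python
def build_oracle(gold_sents, source_sents, recall_fn, threshold=0.5):
    """Greedy extractive-oracle construction.

    For each gold-summary sentence, pick the source sentence with the
    highest recall-style similarity, skipping matches below `threshold`
    (placeholder value -- the paper does not state its setting).
    recall_fn(reference, candidate) is pluggable, e.g. BertScore recall.
    Returns the selected source-sentence indices in document order.
    """
    selected = set()
    for gold in gold_sents:
        # score every source sentence against this gold sentence
        scored = [(recall_fn(gold, cand), i) for i, cand in enumerate(source_sents)]
        score, idx = max(scored)
        if score >= threshold:
            selected.add(idx)
    return sorted(selected)
```

For a quick sanity check, a simple token-overlap recall can stand in for BertScore when experimenting with this loop.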

Figure 2: Cumulative position distribution of oracles built on ROUGE (blue) and BertScore (orange). X axis is the ratio of article length. Y axis is the cumulative percentage of summary sentences.

We observed that the oracle summaries generated through semantic similarity differ from those chosen via traditional n-gram overlap. As shown in Figure 2, the positional distributions of the two schemes are different: early-sentence bias is less pronounced for the BertScore scheme. To further evaluate the effectiveness of this oracle construction approach, we conducted two assessments. First, we calculated the ROUGE scores of both oracles against the gold summaries. As shown in Table I, the oracle summaries derived from BertScore are comparable to, though slightly lower than, those from ROUGE, which is not unexpected given that the former is more mismatched with the ROUGE metric criteria.

Next, we conducted two human evaluation experiments. The first was to rank the candidate summary pairs of 50 news samples based on their similarity to the human-written gold summaries [2]. Four linguistic analysts were asked to consider two aspects: informativeness and coherence [42]. The evaluation score represents the likelihood of a higher ranking, normalized to the range [0, 1]. Second, we adopted the question-answering paradigm as in [3] to evaluate the selected candidates of 30 samples. For each sentence of a gold summary, questions were constructed about key information such as events and named entities. Human annotators were then asked to answer these questions given an oracle summary, and the accuracy of their answers is regarded as the evaluation score. Moreover, we constructed some extended questions that can only be answered with comprehension of the full summary. The extractive summaries constructed with BertScore score significantly higher in all human-evaluation experiments (see Table I).

                         ROUGE-1 F1   ROUGE-2 F1
ROUGE Oracle                51.84        31.08
BertScore Oracle            50.56        29.41

Similarity Evaluation       Score
Gold Summaries                -
ROUGE Candidates            0.70
BertScore Candidates        0.84

QA Paradigm Evaluation      Accuracy
Entity and Event Questions:
Gold Summaries              0.95
ROUGE Candidates            0.54
BertScore Candidates        0.72
Extended Questions:
Gold Summaries              0.87
ROUGE Candidates            0.52
BertScore Candidates        0.70
Table I: ROUGE and Human evaluation scores of oracle summaries built on BertScore and ROUGE.

IV Sub-Aspect Control Codes

IV-A Sub-Aspect Features in News Summarization

Conditional generation approaches often use control codes, in the form of an auxiliary vector, to adjust the pre-defined style features. Some classic examples include polarity in sentiment style transferring [37], topics in task-oriented dialogue systems [35], and physical attributes (e.g. texture, color) in image generation [17]. However, for extractive news summarization, it is more challenging to pinpoint such intuitive and well-defined features, as the writing style could vary according to genre, topic, or editor preference.

In this work, we adopt position, importance and diversity as a set of sub-function features to characterize extractive news summarization [13]. Our considerations are: (1) the “inverted pyramid” writing style is common in news articles, making layout or position a salient sub-aspect for summarization; (2) the importance sub-aspect reflects the assumption that repeatedly occurring content in the source document carries more important information; (3) the diversity sub-aspect suggests that the selected salient sentences should maximize the semantic volume in a distributed semantic space [16, 43].

IV-B Summary-Level Quantitative Analysis

Next, we apply two methods to evaluate the compatibility and effectiveness of the sub-aspects we choose for extractive news summarization. First, we conduct a quantitative analysis on the CNN/Daily Mail corpus, based on the assumption that the writing-style variability of summaries can be characterized through different combinations of sub-aspects [16].

Figure 3: Sample-level distribution of sub-aspect functions of the BertScore oracle. Values are percentages of the categorized samples, which add up to 60.03% of the CNN/Daily Mail training set. The remaining 39.97% do not belong to any of these 3 sub-aspects.

For each source document, we segmented and converted all sentences to vector representations with the pre-trained contextual language model BERT [44]. For each sentence, we averaged the hidden states of all tokens as the sentence embedding. Similar to [13], to obtain the subset of sentences corresponding to the importance sub-aspect, we adopted an N-Nearest method, which calculates the averaged Pearson correlation between each sentence vector and all the others, and collected the first-N candidates with the highest scores (N equals the oracle summary length). To obtain the subset corresponding to the diversity sub-aspect, we used an implementation of the QuickHull algorithm [45] to find vertices, which can be regarded as the sentences that maximize the volume in a projected semantic space. For the subset corresponding to the position sub-aspect, the first 4 sentences of the source document were chosen.
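The N-Nearest scoring step can be sketched in a few lines; this is our illustration, assuming the sentence vectors are stacked row-wise in a NumPy matrix, not the released code:

```python
import numpy as np

def importance_subset(sent_vecs, n):
    """N-Nearest importance sub-aspect: score each sentence by its average
    Pearson correlation with every other sentence vector, then keep the
    top-n indices (n = oracle summary length in the paper's setup)."""
    corr = np.corrcoef(sent_vecs)      # pairwise Pearson correlation matrix
    np.fill_diagonal(corr, 0.0)        # exclude self-correlation
    avg = corr.sum(axis=1) / (len(sent_vecs) - 1)
    return sorted(np.argsort(avg)[-n:].tolist())
```

Intuitively, a sentence whose embedding correlates with many other sentences summarizes content that recurs across the document.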

With these three subsets, we quantified the distribution of the sub-aspects over the extractive oracle constructed in Section III. An oracle summary is mapped to the importance sub-aspect when at least two of its sentences are in the importance subset, and likewise for the other sub-aspects. For oracle summaries shorter than 3 sentences (19% of the oracle), only one sentence was required to determine the mapping. Note that the mapping is many-to-many; i.e., each summary can be mapped to more than one sub-aspect. Figure 3 displays the distribution of the three sub-aspect functions over the oracle summaries, where position occupies the largest area. This visualization shows that the three sub-aspects represent distinct linguistic attributes but can overlap with one another, which agrees with the linguistic characteristics of news writing. Additionally, the fact that the three sub-aspects overlap with each other supports our assumption that a summary can be viewed as a combination of different sub-aspects. Together, the sample-level quantitative analysis demonstrates that the three sub-aspects we chose are reasonable.
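The many-to-many mapping rule above can be sketched as follows; the subset names and contents are hypothetical stand-ins for the sentence-index sets computed per document:

```python
def map_summary(oracle_idx, subsets, min_overlap=2):
    """Map an oracle summary to sub-aspects (many-to-many).

    oracle_idx: sentence indices of one oracle summary.
    subsets: dict like {"importance": set, "diversity": set, "position": set}
             of sentence indices per sub-aspect.
    A summary maps to a sub-aspect when at least `min_overlap` of its
    sentences fall in that sub-aspect's subset; summaries shorter than
    3 sentences only need one overlapping sentence, per the paper.
    """
    need = 1 if len(oracle_idx) < 3 else min_overlap
    return {name for name, subset in subsets.items()
            if len(set(oracle_idx) & subset) >= need}
```

Summaries whose returned set is empty are the "unmapped" ones analyzed at the sentence level in the next section.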

Figure 4: Autoencoder with adversarial training strategy for unsupervised clustering of sentence-level distribution of sub-aspect functions.

IV-C Sentence-Level Unsupervised Analysis

According to the mapping algorithm in the previous section, 39% of summaries were not mapped to any sub-aspect. This finding motivated us to investigate the distribution of sub-aspect functions at the sentence level. Thus, we conducted unsupervised clustering, assuming that samples within one cluster are most similar to each other and can be represented by the cluster's dominant feature.

As shown in Figure 4, we use an autoencoder architecture with adversarial training to model the correlation between document and summary sentences in the semantic space. The encoding component receives the source document representation and one summary-sentence representation as input and compresses them to a latent feature vector. The latent vector and document vector are then concatenated and fed to the decoding component to reconstruct the sentence vector. To obtain a compact yet effective latent vector representing the correlation between source and summary, we adopt an adversarial training strategy as in [37]. More specifically, the adversarial decoder we include aims to reconstruct the sentence vector directly from the latent vector alone. During training, we update the parameters of the autoencoder with an adversarial penalty. After training this autoencoder, we conduct k-means clustering on the latent representation vectors. We then analyze the clustering output using the sentence-level sub-aspect labels defined in Section IV-B. As shown in Figure 5, sentences with the position sub-aspect are distributed relatively evenly across the clusters, while importance and diversity each dominate in different clusters. Based on the clustering results, we assign the dominant sub-aspect function of each cluster to the unmapped sentences in that cluster. For instance, diversity is assigned to unmapped sentences in clusters 0 and 1, while importance is assigned to those in clusters 3 and 4. By doing this, we reduce the number of unmapped sentences, and further reduce unmapped summaries by 35% using the same criteria as in Section IV-B.

IV-D Implementation Details of the Unsupervised Analysis Model

In this section, we provide implementation details of the model in Section IV-C: an autoencoder with adversarial training strategy.

Encoding Component: Given a document representation vector v_d and a sentence representation vector v_s as input, the encoding component (two linear layers) compresses them to a lower dimension, namely the latent feature vector z. In our setting, the hidden dimensions of v_d, v_s and z are 768, 768 and 10, respectively. The hidden vector h is defined as:

h = W_1 [v_d ; v_s] + b_1

and the latent feature vector is defined as:

z = W_2 h + b_2

where W_i and b_i are the trainable parameters in each layer, and [· ; ·] denotes the concatenation operation.

Decoding Component: Given the latent feature vector z and the document representation v_d as input, the decoding component (two linear layers) aims to reconstruct the sentence representation v_s:

h' = W_3 [z ; v_d] + b_3
v̂_s = W_4 h' + b_4

where h' and v̂_s are the hidden state and the reconstruction output, respectively.

Adversarial Decoding Component: Given the latent feature vector z alone as input, the adversarial decoding component (one linear layer) aims to reconstruct the sentence representation v_s:

v̂_s^adv = W_5 z + b_5

where v̂_s^adv is the reconstruction output.

Training Procedure: In each training batch, there is a two-step parameter update:

1) Update the adversarial decoder with the mean squared error (MSE) loss between v̂_s^adv and v_s:

L_adv = MSE(v̂_s^adv, v_s)

2) Update the autoencoder with the MSE loss between v̂_s and v_s, combined with a penalty from the adversarial MSE to reduce unnecessary information leaking from v_s into the latent vector. The overall loss is defined as:

L = MSE(v̂_s, v_s) − λ · L_adv

where λ is a weighting coefficient in our training setting.
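The step-2 objective can be written compactly. This NumPy sketch is our illustration of the loss only (not the full training loop), and the λ weight is a placeholder since its value is elided in the source:

```python
import numpy as np

def adversarial_ae_loss(v_s, recon, recon_adv, lam=0.1):
    """Autoencoder objective with adversarial penalty.

    v_s: target sentence vector; recon: autoencoder reconstruction;
    recon_adv: the adversarial decoder's reconstruction from the latent alone.
    Subtracting lam * adversarial MSE pushes the latent vector to carry the
    source/summary correlation rather than the raw sentence content.
    lam is a hypothetical placeholder value.
    """
    mse = lambda a, b: float(np.mean((a - b) ** 2))
    return mse(recon, v_s) - lam * mse(recon_adv, v_s)
```

Note the sign: a *better* adversarial reconstruction (smaller adversarial MSE) makes this loss larger, penalizing latents that leak sentence content.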

Figure 5: Sentence-level clustering result labeled with sub-aspect features. X axis is the cluster index. Y axis is the proportion of sub-aspect features in each cluster.

V Conditional Neural Generation

In this section, we construct a set of control codes to specify the three sub-aspect features described in Section IV and use them to label the oracle summaries constructed in Section III. We then propose a neural extractive model with a conditional learning strategy for more flexible summary generation.

V-A Control Code Specification Scheme

The control codes are constructed in the form of [importance, diversity, position] to specify the sub-aspect features. We can flexibly indicate the ‘ON’ and ‘OFF’ state of each sub-aspect by switching its corresponding value to 1 or 0, thus enabling disentanglement of each sub-aspect function. For instance, the control code [1,0,0] tells the model to focus more on importance during sentence scoring and selection, while [0,1,1] focuses on both diversity and position. Indeed, switching the position code to 0 helps the model attain minimal position bias. Note that this does not mean the first few sentences would never be selected, as there is overlap between position, importance and diversity (shown in Figure 3). There are 8 control codes under this specification scheme, and we expect this code design to provide the model with sub-aspect conditions for generating summaries.
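Enumerating the full code set is straightforward; for illustration (our sketch, with a hypothetical constant name):

```python
from itertools import product

# All 8 control codes in [importance, diversity, position] order,
# each sub-aspect toggled ON (1) or OFF (0).
CONTROL_CODES = [list(bits) for bits in product((0, 1), repeat=3)]
```

Each of these 8 codes is fed to the model alongside the input document to select a summary style.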

V-B Neural Extractive Selector

Given a document containing sentences {s_1, s_2, …, s_n}, the content selector assigns a score to each sentence s_i, indicating its probability of being included in the summary. A neural model can be trained as an extractive selector for text summarization by contextually modeling the source content.

Here, we implemented the neural extractive selector in a sequence-labeling manner [14]. As shown in Figure 6, the model consists of three components: a contextual encoding component, a selection modeling component, and an output component. First, we use BERT in the contextual encoding component to obtain feature-rich sentence-level representations. Then, during training, we concatenate these sentence embeddings with the pre-calculated control-code vector and feed them to the next layer, which models the contextual hidden states with the conditional signals. Next, a linear layer with a Sigmoid function receives the hidden states and produces a score between 0 and 1 for each segment as its probability of extractive selection. While this architecture is straightforward, it has been shown to be competitive when combined with state-of-the-art contextual representations.

In our setting, sentences were processed by a sub-word tokenizer [46], and their embeddings were initialized with the 768-dimension “base-uncased” BERT [44] and fixed during training. Lengthy source documents were not truncated. For the selection modeling component, we experimented with a multi-layer bi-directional LSTM [28] and a Transformer network [29]; empirically, a two-layer Bi-LSTM performed best. During testing, the sentences with the top-3 selection probabilities were extracted as the output summary.
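Test-time extraction as described (top-3 probabilities, emitted in document order) can be sketched as follows; this is our illustration with a hypothetical function name:

```python
def select_top_k(probs, k=3):
    """Pick the k sentences with the highest selection probability and
    return their indices in document order, as done at test time."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return sorted(ranked[:k])
```

Returning indices in document order (the final `sorted`) preserves the reading flow of the extracted summary.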

Figure 6: Overview of the controllable neural selector architecture.

V-C Implementation Details of Neural Selector Model

In this section, we provide implementation details of the model in Section V-B: a neural sentence selector for extractive summarization.

BERT Encoding Component: Given a document containing sentences {s_1, …, s_n} as input, the encoding component produces a sentence representation v_i from each s_i, which is a list of tokens {w_1, …, w_m}. Here we use the average of the token-level hidden states in the last layer of BERT as v_i:

v_i = mean(BERT(s_i))
Selection Modeling Component: Given the specified control code c and the sentence vectors as input, this component uses a bi-directional LSTM layer to model the contextual information with sub-aspect conditioning. The forward and backward hidden states are concatenated as output:

h_i = [LSTM_fwd([v_i ; c]) ; LSTM_bwd([v_i ; c])]

where the hidden dimension is 768, the control-code dimension is 3, and [· ; ·] denotes the concatenation operation.

Output Component: A linear layer with a Sigmoid function produces an output p_i for each sentence, as the probability of being included in the generated summary:

p_i = σ(W_o h_i + b_o)
Training Setting: Binary cross-entropy (BCE) is used to measure the loss between the predictions and the ground-truth labels over all time steps:

L = −Σ_i [ y_i log(p_i) + (1 − y_i) log(1 − p_i) ]

The Adam optimizer [47] was used, with a batch size of 64. Drop-out [48] was applied in the modeling layer and the output linear layer. BERT parameters were fixed during training, and lengthy documents were not truncated.
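A NumPy sketch of the BCE objective over sentence-level predictions (an illustration of the loss above, not the training code):

```python
import numpy as np

def bce_loss(probs, labels, eps=1e-12):
    """Binary cross-entropy between predicted selection probabilities and
    0/1 oracle labels, averaged over all sentence positions.

    The mean of -[y*log(p) + (1-y)*log(1-p)]; eps clipping guards the logs.
    """
    p = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    y = np.asarray(labels, dtype=float)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))
```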

Figure 7: Position distribution of generated summaries from the strong baseline model BertEXT and our conditional summarization model with the position code set to 0 (3 implementations). X axis is the position ratio. Y axis is the sentence-level proportion.
Figure 8: Sub-aspect mapping of generated summary with importance-focus code [1,0,0]. Left panel: one sentence in the summary belongs to importance sub-aspect. Right panel: two sentences in the summary belong to importance sub-aspect. Contour lines denote the number of generated summaries.
Figure 9: Sub-aspect mapping of generated summary with diversity-focus code [0,1,0]. Left panel: one sentence in the summary belongs to diversity sub-aspect. Right panel: two sentences in the summary belong to diversity sub-aspect. Contour lines denote the number of generated summaries.

VI Experimental Results and Analysis

In this section, we conduct quantitative analysis, automatic and human evaluations to assess the performance of our proposed conditional neural summarization framework.

VI-A Baselines

We compare the proposed model with the following baseline systems:

LEAD-3 Since news articles tend to present important information at the beginning, selecting the leading three sentences is a strong and commonly used baseline.

SummaRuNNer A neural extractive model proposed in [1], which fuses interpretable lexical features to enhance an RNN-based sentence-scoring network.

TransformerEXT An end-to-end extractive model used as a strong baseline in [3], with the Transformer [29] as its base neural architecture.

BertEXT A state-of-the-art model [3] with fine-tuned BERT [44] as a strong encoding backbone; however, input documents are truncated in this model due to the position-embedding limitation.

VI-B Quantitative Analysis

To evaluate the effectiveness of applying sub-aspect functions to conditioning summary generation, we fed various control codes separately on the test set, and compared the output summaries.

To test the possibility of reducing position bias by conditioning summary generation, we switched the position code to 0 and compared the positions of selected sentences in summaries generated by our model against the state-of-the-art baseline BertEXT, which is based on fine-tuning BERT [3]. The results (Figure 7) show that BertEXT has a 50% chance of choosing sentences from the first 10% of the document. While the proposed framework still has a stronger tendency to choose sentences from the first 30% of the document, its position distribution is noticeably flatter than that of BertEXT.

We respectively switched the importance and diversity codes to 1 and categorized the generated summaries into the subset of each sub-aspect function as in Section IV-B. As shown in Figures 8 and 9, summaries in the importance and diversity subsets are weighted more heavily when the corresponding control codes are ON. Together, these results demonstrate the feasibility of our proposed framework, which can generate output summaries of alternative styles given different control codes.

Figure 10: One news article example. Oracle summary is underlined, summary generated from a baseline model is in blue while from our model with diversity-focus code is in orange, and their overlap is in purple.
Model                R-1     R-2
Oracle (BertScore)   50.56   29.41
LEAD-3               40.42   17.62
SummaRuNNer*         39.60   16.20
TransformerEXT*      40.90   18.02
BertEXT*             43.23   20.24
Code [0,0,0]         39.44   17.37
Code [0,0,1]         40.21   18.25
Code [0,1,0]         39.18   17.11
Code [0,1,1]         40.70   18.42
Code [1,0,0]         36.72   14.74
Code [1,0,1]         40.33   17.90
Code [1,1,0]         37.59   15.68
Code [1,1,1]         40.87   18.50
Table II: ROUGE F1 score evaluation with various control codes, in the form of [importance, diversity, position]. * denotes the results from corresponding paper.

VI-C Automatic Evaluation

We calculated the F1 scores of ROUGE-1 and ROUGE-2 for the summaries generated under the 8 control codes, and compared them with the BertScore oracle (see Section III), the LEAD-3 baseline, and several strong extractive baseline models. From Table II we observe that: (1) Summaries generated from code [0,0,1] are similar to LEAD-3 but can dynamically learn positional features not limited to the first 3 sentences, while isolating out the diversity and importance features. (2) Focusing only on the importance sub-aspect leads to the worst performance, but performance improves when other sub-aspects are also considered. (3) Focusing on the diversity sub-aspect alone (i.e. code [0,1,0]) generates results comparable to strong baselines such as SummaRuNNer.

Vi-D Human Evaluation

In addition to automatic evaluation using ROUGE, we also evaluated output summaries by eliciting human judgments. The study was conducted with experienced linguistic annotators using Best-Worst Scaling [49]. Participants were given 50 news articles randomly chosen from the CNN/Daily Mail test set, together with the corresponding summaries from five systems: the oracle, BertEXT, and three codes that disable the position sub-aspect. They were then asked to pick the best and the worst summary for each document in terms of informativeness and coherence [2, 42]. We collected judgments from 5 human evaluators for each comparison. For every evaluator, the order of the documents was randomized, as was the order of summaries within each document. The score of a model was calculated as the percentage of times it was labeled best minus the percentage of times it was labeled worst, which ranges from -1 to 1. Since every best label is paired with a worst label, the evaluation scores of all summary types sum to zero. As shown in Table III, summaries under the diversity code are preferred over those under the importance code, and their combination produces even better results. These findings resonate with those from the automatic evaluation, suggesting that whether the evaluation metric is lexical overlap (ROUGE) or human judgment, the diversity sub-aspect plays a more salient role than importance.
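The Best-Worst Scaling score described above can be sketched as follows (an illustrative helper, not the paper's code):

```python
# Sketch of the Best-Worst Scaling score: fraction of judgments where a system
# was picked best minus the fraction where it was picked worst.
def bws_scores(judgments, systems):
    """judgments: list of (best_system, worst_system) pairs, one per rating.
    Returns a per-system score in [-1, 1]; scores sum to zero by construction."""
    n = len(judgments)
    best = {s: 0 for s in systems}
    worst = {s: 0 for s in systems}
    for b, w in judgments:
        best[b] += 1
        worst[w] += 1
    return {s: (best[s] - worst[s]) / n for s in systems}

scores = bws_scores([("A", "B"), ("A", "C"), ("B", "C")], ["A", "B", "C"])
# A: 2/3, B: 0.0, C: -2/3 — note the scores sum to zero
```

Because each judgment contributes exactly one best and one worst label, the zero-sum property noted in the text falls out directly from the counting.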

Both automatic and human evaluations show that summarizing with sub-aspect condition codes yields reasonable summaries that can be tailored to different styles. Figure 10 shows an example in which the generated summary is not position-biased but still preserves key information from the source content.

Model          Evaluation Score
Oracle          0.0458
BertEXT         0.0332
Code [1,0,0]   -0.062
Code [0,1,0]    0.0198
Code [0,0,1]   -0.071
Code [1,1,0]    0.0350
Table III: Human evaluation on samples from the baselines and our model with control codes, in the form [importance, diversity, position].
Model          R-1 F1          R-2 F1
Code [1,0,0]   33.94 (-2.78)   13.04 (-1.70)
Code [0,1,0]   36.59 (-2.59)   14.33 (-2.78)
Code [0,0,1]   30.34 (-9.87)    8.90 (-9.35)
Table IV: Inference scores on samples with shuffled sentences. Control codes are in the form [importance, diversity, position]. Values in brackets are the absolute decrease from the scores on the original, in-order samples.

Vi-E Inference on Samples with Shuffled Sentences

To further assess the extent of decoupling between the sub-aspect signals and the position information learned by the model, we conducted an experiment on samples with shuffled sentences, similar to the document-shuffling experiment in [14]. In our setting, we introduced the shuffling only in the model inference phase: we shuffled the sentences of the test samples used in Section VI-C, then applied the trained model to generate predicted summaries. As shown in Table IV, outputs under the position sub-aspect suffer a significant drop in performance when the sentence order is shuffled. By comparison, there is far less decrease between the shuffled and in-order samples under the diversity and importance control codes, demonstrating that the latent features of these two semantics-related sub-aspects rely less on position information. This suggests that applying semantic sub-aspects in the training process can reduce the systematic bias that the model learns from a position-biased corpus.
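The shuffle probe can be sketched as follows; the wrapper and the toy "model" below are illustrative, not the paper's implementation:

```python
# Sketch of the inference-time shuffle probe: permute sentence order only at
# test time, run the unchanged model, then map selections back to the original
# sentence order.
import random

def shuffled_inference(sentences, select_fn, seed=0):
    """select_fn: model wrapper returning indices of selected sentences.
    Returns the selected sentences in their original-document order."""
    rng = random.Random(seed)
    order = list(range(len(sentences)))
    rng.shuffle(order)                       # permutation applied at test time only
    shuffled = [sentences[i] for i in order]
    picked = select_fn(shuffled)             # model sees the shuffled document
    original_ids = sorted(order[i] for i in picked)
    return [sentences[i] for i in original_ids]

# A purely position-based "model" (always picks the first two sentences) now
# returns arbitrary content, illustrating why position-coded outputs degrade
# under shuffling while semantics-driven selection is largely unaffected.
lead2 = lambda sents: [0, 1]
out = shuffled_inference(["s0", "s1", "s2", "s3"], lead2)
```

A selector that scores sentence content rather than position would pick the same sentences regardless of the permutation, which is the behavior observed for the importance and diversity codes in Table IV.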

Model          R-1 F1   R-2 F1   R-2 Recall
Oracle         -        -        8.70*
Baseline       -        -        6.10*
Code [1,0,0]   34.81    6.23     6.34
Code [0,1,0]   31.79    5.32     4.62
Code [0,0,1]   29.67    3.98     3.47
Table V: Inference scores on the AMI corpus from the baselines and our model with control codes, in the form [importance, diversity, position]. * denotes results reported in the corresponding paper.

Vi-F Inference on AMI Corpus

We also conducted an inference experiment on a less position-biased corpus. The AMI corpus [50] is a collection of meetings annotated with text transcriptions and human-written summaries. Unlike news summaries, meeting summaries are usually more abstractive, with keywords drawn from the whole conversation. Unlike the previous comparison in [14], we did not train the model from scratch on the AMI training set; instead, we applied the model pre-trained in Section VI to summarize its test set (20 meeting transcript-summary pairs). As shown in Table V, summaries under the importance code obtain the highest ROUGE-1 and ROUGE-2 scores, outperforming the best-reported model in [14]. Not surprisingly, summaries under the position code do not perform well, since the corpus is positionally more balanced. This shows the effectiveness of the semantics-related control codes and the generality of our model. Together with the result in Section VI-C, where news summaries under the diversity code obtained higher performance, it suggests that semantics-related sub-aspects can be favored differently across domains.

Vii Conclusion

In this paper, we proposed a neural framework for conditional extractive news summarization, in which the sub-aspect functions of importance, diversity, and position are used to condition summary generation. This framework enables us to reduce position bias, a long-standing problem in news summarization, in generated summaries while preserving performance comparable to standard models. Moreover, our results suggest that with conditional learning, summaries can be more efficiently tailored to different user preferences and application needs.


  • [1] R. Nallapati, F. Zhai, and B. Zhou, “SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • [2] S. Narayan, S. B. Cohen, and M. Lapata, “Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.   Brussels, Belgium: Association for Computational Linguistics, Oct.-Nov. 2018, pp. 1797–1807.
  • [3] Y. Liu and M. Lapata, “Text summarization with pretrained encoders,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).   Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 3721–3731.
  • [4] K. Hong and A. Nenkova, “Improving the estimation of word importance for news multi-document summarization,” in Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics.   Gothenburg, Sweden: Association for Computational Linguistics, Apr. 2014, pp. 712–721.
  • [5] C. Scanlan, Reporting and writing: basics for the 21st century.   Oxford University Press, 1999.
  • [6] J. G. Stovall, Writing for the mass media.   Prentice-Hall, 1985.
  • [7] A. Nenkova, R. Passonneau, and K. McKeown, “The pyramid method: Incorporating human content selection variation in summarization evaluation,” ACM Trans. Speech Lang. Process., vol. 4, no. 2, May 2007.
  • [8] G. Salton, A. Singhal, M. Mitra, and C. Buckley, “Automatic text structuring and summarization,” Information processing & management, vol. 33, no. 2, pp. 193–207, 1997.
  • [9] D. Marcu, “From discourse structures to text summaries,” in Intelligent Scalable Text Summarization, 1997.
  • [10] I. Mani, “Summarization evaluation: An overview.” in ACL/EACL-97 summarization workshop, 2001.
  • [11] G. Rath, A. Resnick, and T. Savage, “The formation of abstracts by the selection of sentences. part i. sentence selection by men and machines,” American Documentation, vol. 12, no. 2, pp. 139–141, 1961.
  • [12] W. Kryscinski, N. S. Keskar, B. McCann, C. Xiong, and R. Socher, “Neural text summarization: A critical evaluation,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).   Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 540–551.
  • [13] T. Jung, D. Kang, L. Mentch, and E. Hovy, “Earlier isn’t always better: Sub-aspect analysis on corpus and system biases in summarization,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).   Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 3315–3326.
  • [14] C. Kedzie, K. McKeown, and H. Daume III, “Content selection in deep learning models of summarization,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.   Brussels, Belgium: Association for Computational Linguistics, Oct.-Nov. 2018, pp. 1818–1828.
  • [15] J. Carbonell and J. Goldstein, “The use of MMR, diversity-based reranking for reordering documents and producing summaries,” in Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR ’98.   New York, NY, USA: ACM, 1998, pp. 335–336.
  • [16] H. Lin and J. Bilmes, “Learning mixtures of submodular shells with application to document summarization,” in Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, ser. UAI’12.   Arlington, Virginia, United States: AUAI Press, 2012, pp. 479–490.
  • [17] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, “beta-vae: Learning basic visual concepts with a constrained variational framework.” ICLR, vol. 2, no. 5, p. 6, 2017.
  • [18] K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom, “Teaching machines to read and comprehend,” in Advances in neural information processing systems, 2015, pp. 1693–1701.
  • [19] E. Sandhaus, “The new york times annotated corpus,” Linguistic Data Consortium, Philadelphia, vol. 6, no. 12, p. e26752, 2008.
  • [20] C. A. Colmenares, M. Litvak, A. Mantrach, and F. Silvestri, “HEADS: Headline generation as sequence prediction using an abstract feature-rich space,” in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.   Denver, Colorado: Association for Computational Linguistics, May–Jun. 2015, pp. 133–142.
  • [21] S. Gehrmann, Y. Deng, and A. Rush, “Bottom-up abstractive summarization,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.   Brussels, Belgium: Association for Computational Linguistics, Oct.-Nov. 2018, pp. 4098–4109.
  • [22] J. Kupiec, J. Pedersen, and F. Chen, “A trainable document summarizer,” in Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR ’95.   New York, NY, USA: Association for Computing Machinery, 1995, pp. 68–73.
  • [23] Ying-Lang Chang and J. Chien, “Latent dirichlet learning for document summarization,” in 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, 2009, pp. 1689–1692.
  • [24] T. Hirao, M. Nishino, Y. Yoshida, J. Suzuki, N. Yasuda, and M. Nagata, “Summarizing a document by trimming the discourse tree,” IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 23, no. 11, pp. 2081–2092, Nov. 2015.
  • [25] G. Erkan and D. R. Radev, “Lexrank: Graph-based lexical centrality as salience in text summarization,” Journal of artificial intelligence research, vol. 22, pp. 457–479, 2004.
  • [26] R. Mihalcea and P. Tarau, “TextRank: Bringing order into text,” in Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.   Barcelona, Spain: Association for Computational Linguistics, Jul. 2004, pp. 404–411.
  • [27] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, 2013, pp. 3111–3119.
  • [28] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
  • [29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
  • [30] S. Narayan, S. B. Cohen, and M. Lapata, “Ranking sentences for extractive summarization with reinforcement learning,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers).   New Orleans, Louisiana: Association for Computational Linguistics, Jun. 2018, pp. 1747–1759.
  • [31] Q. Zhou, N. Yang, F. Wei, S. Huang, M. Zhou, and T. Zhao, “Neural document summarization by jointly learning to score and select sentences,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).   Melbourne, Australia: Association for Computational Linguistics, Jul. 2018, pp. 654–663.
  • [32] Z. Liu and N. Chen, “Exploiting discourse-level segmentation for extractive summarization,” in Proceedings of the 2nd Workshop on New Frontiers in Summarization.   Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 116–121.
  • [33] C.-Y. Lin and E. Hovy, “Identifying topics by position,” in Fifth Conference on Applied Natural Language Processing.   Washington, DC, USA: Association for Computational Linguistics, Mar. 1997, pp. 283–290.
  • [34] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
  • [35] C. Xing, W. Wu, Y. Wu, J. Liu, Y. Huang, M. Zhou, and W.-Y. Ma, “Topic aware neural response generation,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • [36] N. S. Keskar, B. McCann, L. R. Varshney, C. Xiong, and R. Socher, “Ctrl: A conditional transformer language model for controllable generation,” arXiv preprint arXiv:1909.05858, 2019.
  • [37] V. John, L. Mou, H. Bahuleyan, and O. Vechtomova, “Disentangled representation learning for non-parallel text style transfer,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.   Florence, Italy: Association for Computational Linguistics, Jul. 2019, pp. 424–434.
  • [38] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text Summarization Branches Out: Proceedings of the ACL-04 Workshop.   Barcelona, Spain: Association for Computational Linguistics, Jul. 2004, pp. 74–81.
  • [39] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “BERTScore: Evaluating text generation with BERT,” arXiv preprint arXiv:1904.09675, 2019.
  • [40] J. Wieting, K. Gimpel, G. Neubig, and T. Berg-Kirkpatrick, “Simple and effective paraphrastic similarity from parallel translations,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.   Florence, Italy: Association for Computational Linguistics, Jul. 2019, pp. 4602–4608.
  • [41] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using Siamese BERT-networks,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).   Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 3980–3990.
  • [42] D. R. Radev, E. Hovy, and K. McKeown, “Introduction to the special issue on summarization,” Computational Linguistics, vol. 28, no. 4, pp. 399–408, 2002.
  • [43] D. Yogatama, F. Liu, and N. A. Smith, “Extractive summarization by maximizing semantic volume,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.   Lisbon, Portugal: Association for Computational Linguistics, Sep. 2015, pp. 1961–1966.
  • [44] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).   Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 4171–4186.
  • [45] C. B. Barber, D. P. Dobkin, and H. Huhdanpaa, “The quickhull algorithm for convex hulls,” ACM Transactions on Mathematical Software (TOMS), vol. 22, no. 4, pp. 469–483, 1996.
  • [46] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey et al., “Google’s neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016.
  • [47] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proceedings of the 3rd International Conference for Learning Representations, 2015.
  • [48] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [49] J. J. Louviere, T. N. Flynn, and A. A. J. Marley, Best-worst scaling: Theory, methods and applications.   Cambridge University Press, 2015.
  • [50] J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal et al., “The ami meeting corpus: A pre-announcement,” in International workshop on machine learning for multimodal interaction.   Springer, 2005, pp. 28–39.