Neural network-based models have achieved great success on summarization tasks see2017get; celikyilmaz2018deep; jadhav2018extractive. Current studies on summarization explore the optimization of network structures zhou2018neural; chen2018fast; gehrmann2018bottom, improvements to training schemas wangexploring2019; narayan2018ranking; wu2018learning; chen2018fast, or information fusion with large pre-trained knowledge peters2018deep; devlin2018bert; DBLP:journals/corr/abs-1903-10318; dong2019unified. More recently, zhong-etal-2019-searching conducts a comprehensive analysis of why existing summarization systems perform so well from the above three aspects. Despite this success, a relatively neglected topic is analyzing and understanding, from a dataset perspective, what affects the models’ generalization ability. (Footnote 1: Concurrent with our work, jungearlier2019 makes a similar analysis of dataset biases and presents three factors that matter for the text summarization task.) With the emergence of more and more summarization datasets sandhaus2008new; nallapati2016abstractive; cohan2018discourse; grusky2018newsroom, the time is ripe to bridge the gap between the insufficient understanding of the nature of the datasets themselves and the steady improvement of learning methods.
In this paper, we take a step towards addressing this challenge by using neural extractive summarization models as an interpretable testbed and investigating how to quantify the characteristics of datasets. This allows us to explain the behaviour of our models and to design new ones. Specifically, we seek to answer two main questions:
Q1: In the summarization task, different datasets present diverse characteristics, so what bias do these dataset choices introduce, and how does it influence the model’s generalization ability? We explore two types of factors, constituent factors and style factors, and analyze how each affects the generalization of neural summarization models. These factors can help us diagnose the weaknesses of existing models.
Q2: How do different properties of datasets influence the choices of model structure design and training schemas? We propose several measures and examine their ability to explain how different model architectures, training schemas, and pre-training strategies react to various properties of datasets.
|Factors of Datasets||Measures||Model designs|
|Constituent [4.1]||Positional coverage rate [4.1.1]||Architecture designs [6.2]|
| ||Content coverage rate [4.1.2]||Pre-trained strategies [6.2]|
|Style [4.2]||Density [4.2.1]||Training schemas [6.1]|
Our contributions can be summarized as follows:
1) For the summarization task itself, we diagnose the weaknesses of existing learning methods in terms of network structures, training schemas, and pre-trained knowledge. Some of our observations could guide future researchers toward new state-of-the-art performance.
2) We show that a comprehensive understanding of a dataset’s properties guides us to design a more reasonable model. We hope to encourage future research on how the characteristics of datasets influence the behavior of neural networks.
We summarize our observations as follows:
1) Existing models under-utilize the nature of the training data. We demonstrate that a simple training method on CNN/DM (dividing the training set based on domain) can achieve significant improvement.
2) BERT is not a panacea and will fail in some situations. The improvement brought by BERT is related to the style factors defined in this paper.
3) It is difficult to handle the hard cases (defined by style factors) via architecture design and pre-trained knowledge under the extractive framework.
4) Based on a sufficient understanding of the nature of datasets, a more reasonable data partitioning method (based on constituent factors) can be found.
2 Related Work
We briefly outline the connections and differences to the following related lines of research.
Neural Extractive Summarization
Recently, neural network-based models have achieved great success in extractive summarization celikyilmaz2018deep; jadhav2018extractive; DBLP:journals/corr/abs-1903-10318. Existing works on text summarization roughly fall into three classes: exploring network structures with suitable bias cheng2016neural; nallapati2017summarunner; zhou2018neural, introducing new training schemas narayan2018ranking; wu2018learning; chen2018fast, and incorporating large pre-trained knowledge peters2018deep; devlin2018bert; DBLP:journals/corr/abs-1903-10318; dong2019unified. Instead of pursuing a new state of the art along one of the above three lines, in this paper we aim to bridge the gap between the lack of understanding of the characteristics of datasets and the increasing development of the above three learning methods.
Concurrent with our work, jungearlier2019 conducts a quite similar analysis of dataset biases and proposes three factors that matter for the text summarization task. One major difference between the two works is that we additionally focus on how dataset biases influence the design of models.
Understanding the Generalization Ability of Neural Networks
Although neural networks have shown superior generalization ability, it remains largely unexplained. Recently, some researchers have begun to take steps towards understanding the generalization behaviour of neural networks from the perspective of network architectures or optimization procedures schmidt2018adversarially; baluja2017adversarial; zhang2016understanding; arpit2017closer. Different from these works, in this paper we claim that interpreting the generalization ability of neural networks should build on a good understanding of the characteristics of the data.
3 Learning Methods and Datasets
3.1 Learning Methods
Generally, given a dataset, different learning methods try to explain the data in diverse ways and thus show different generalization behaviours. Existing learning methods for extractive summarization systems vary in architecture designs, pre-trained strategies and training schemas.
Architecturally speaking, most existing extractive summarization systems consist of three major modules: a sentence encoder, a document encoder and a decoder.
In this paper, our architectural choices vary between two types of document encoders: LSTM hochreiter1997long (Footnote 2: We use the implementation of he2017deep.) and Transformer vaswani2017attention, while we keep the sentence encoder (convolutional neural networks) and decoder (sequence labeling) unchanged (Footnote 3: They do not show significant influence in the experiments we explore.). The base model in all experiments refers to the Transformer equipped with sequence labeling.
To explore how different pre-trained strategies influence the model, we take two types of pre-trained knowledge into consideration: we choose Word2vec mikolov2013efficient as an investigated exemplar for non-contextualized word embeddings and adopt BERT as a contextualized word pre-trainer devlin2018bert.
In general, we train a monolithic model to fit the dataset, but in particular, when the data itself has some special properties, we can introduce different training methods to fully exploit all the information contained in the data.
The basic idea of multi-domain learning in this paper is to introduce a domain tag as a low-dimensional vector that augments the learned representations. A domain-aware model makes it possible to learn domain-specific features.
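A minimal sketch of the tag idea, under stated assumptions: the domain tag is looked up in a small embedding table and concatenated to every sentence representation. The names `DOMAIN_EMB` and `add_domain_tag`, the dimensions, and the choice of concatenation (rather than addition) are illustrative and are not specified in the text.

```python
import random

random.seed(0)

NUM_DOMAINS = 2   # e.g. CNN and DailyMail (assumed)
TAG_DIM = 4       # size of the low-dimensional tag vector (assumed)

# One randomly initialized vector per domain; in a real model this table would be trained.
DOMAIN_EMB = [[random.gauss(0.0, 0.1) for _ in range(TAG_DIM)]
              for _ in range(NUM_DOMAINS)]


def add_domain_tag(sent_reprs, domain_id):
    """Append the domain vector to each sentence representation of one document."""
    tag = DOMAIN_EMB[domain_id]
    return [list(vec) + tag for vec in sent_reprs]


# Five sentences with 16-dimensional (dummy) representations from domain 1.
doc = [[0.0] * 16 for _ in range(5)]
augmented = add_domain_tag(doc, domain_id=1)
```

Because the tag is shared by all sentences of a document, the downstream layers can condition their decisions on the domain without any change to the encoder itself.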
Meta-learning: We also try to make models aware of different distributions through meta-learning, following wangexploring2019. Specifically, in each iteration we sample several domains as meta-train and the others as meta-test. The meta-test gradients are combined with the meta-train gradients to update the model.
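The update described above can be sketched on a toy problem. This is a first-order illustration under assumptions, not the procedure of wangexploring2019: the one-parameter model, the learning rates, the weight `beta`, and the rule of holding out exactly one domain as meta-test are all hypothetical choices.

```python
import random


def grad_mse(theta, batch):
    """Gradient of the mean squared error for the 1-D model y = theta * x."""
    return sum(2 * (theta * x - y) * x for x, y in batch) / len(batch)


def meta_step(theta, domains, inner_lr=0.05, outer_lr=0.05, beta=1.0, rng=random):
    """One iteration: split domains, take a virtual step, combine both gradients."""
    names = list(domains)
    rng.shuffle(names)
    meta_test, meta_train = names[0], names[1:]
    train_batch = [ex for d in meta_train for ex in domains[d]]
    g_train = grad_mse(theta, train_batch)
    # Virtual update on meta-train, then measure the gradient on meta-test.
    theta_virtual = theta - inner_lr * g_train
    g_test = grad_mse(theta_virtual, domains[meta_test])
    # Combine meta-train and meta-test gradients for the real update.
    return theta - outer_lr * (g_train + beta * g_test)


# Three toy "domains" whose true slopes differ slightly (1.9 to 2.2).
domains = {
    "a": [(1.0, 2.0), (2.0, 4.0)],
    "b": [(1.0, 2.2), (3.0, 6.6)],
    "c": [(2.0, 3.8), (1.0, 1.9)],
}
rng = random.Random(0)
theta = 0.0
for _ in range(200):
    theta = meta_step(theta, domains, rng=rng)
```

The parameter ends up at a value that fits all domains reasonably well, since every update is evaluated on a domain that was held out in that iteration.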
3.2 Datasets
We explore four mainstream news article summarization datasets (CNN/DM, Newsroom, NYT50 and DUC2002), which vary in their source publications. We also modify two large-scale scientific paper datasets (arXiv and PubMed) to investigate the characteristics of different domains. Detailed statistics are shown in Table 2.
4 Quantifying Characteristics of Text Summarization Datasets
In this paper, we present four measures to quantify the characteristics of summarization datasets, which can be grouped into two types: constituent factors and style factors.
4.1 Constituent Factors
When a neural summarization model determines whether a sentence should be extracted, the representation of the sentence consists of two components: the position representation, which indicates the position of the sentence in the document, and the content representation, which contains the semantic information of the sentence. (Footnote 4: The position representation is obtained from the model structure in LSTM and from the positional embedding in Transformer.)
Therefore, we define the position and content information of the sentence as constituent factors, aiming to explore how the selected sentences in the test set relate to the training set in terms of position and content information.
4.1.1 Positional Information
Positional Value (P-Value)
Given a document , for each sentence with label , we introduce the notion of positional value , whose value is the output of the mapping function .
Positional Coverage Rate (PCR)
Taking positional value over a dataset ,
where denotes the number of sentences with and represents the number of sentences with in dataset .
Based on the above definition, for any two datasets and , we can quantify the proximity of their positional value distributions
where denotes the -divergence function. and represent the two positional value distributions over the two datasets. Datasets with similar positional value distributions usually have a large PCR.
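The idea of comparing positional value distributions can be illustrated with a small sketch. Several choices here are assumptions, not the paper's definitions: positional values are taken to be integer bins, the divergence is taken to be Jensen-Shannon (base 2), and proximity is reported as one minus the divergence so that similar distributions score higher. The function names are hypothetical.

```python
import math
from collections import Counter


def positional_distribution(p_values, num_bins):
    """Fraction of selected (ground-truth) sentences that fall into each positional value."""
    counts = Counter(p_values)
    total = len(p_values)
    return [counts.get(k, 0) / total for k in range(num_bins)]


def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence, bounded in [0, 1] with base-2 logarithms."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def proximity(p, q):
    """Assumed proximity score: larger when the two distributions are closer."""
    return 1.0 - js_divergence(p, q)


# Toy positional values of ground-truth sentences in two datasets.
news_like = positional_distribution([0, 0, 0, 1, 2], num_bins=3)
paper_like = positional_distribution([0, 1, 2, 2, 2], num_bins=3)
same = proximity(news_like, news_like)
cross = proximity(news_like, paper_like)
```

With this convention a dataset compared with itself reaches the maximum score, and the score decreases as the selected sentences of the two datasets concentrate in different positions.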
4.1.2 Content Information
Content Value (C-Value)
Given a dataset , we want to find the patterns that appear most frequently in the ground truth of and score them. (Footnote 5: Ground truth is extracted by the greedy algorithm in nallapati2017summarunner.) For each sentence in the ground truth, we remove the stop words and punctuation, replace all numbers with “0”, and perform lemmatization on each token. After the pre-processing, we treat -gram () as the pattern in and calculate the score for each pattern as follows:
where denotes the number of occurrences of the -th pattern.
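The pre-processing and pattern counting can be sketched as follows. This is an illustration under assumptions: the stop-word list is a toy one, lemmatization is omitted to keep the example dependency-free, bigrams and trigrams are used as patterns, and each pattern is scored by its relative frequency, which stands in for the scoring formula.

```python
import re
from collections import Counter

# Toy stop-word list (assumed); a real run would use a full list and a lemmatizer.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "was", "on"}


def preprocess(sentence):
    """Lowercase, drop punctuation and stop words, map every number to '0'."""
    tokens = re.findall(r"[a-z]+|\d+(?:\.\d+)?", sentence.lower())
    tokens = ["0" if t[0].isdigit() else t for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def pattern_scores(sentences, orders=(2, 3)):
    """Score each n-gram pattern by its share of all pattern occurrences."""
    counts = Counter()
    for sentence in sentences:
        tokens = preprocess(sentence)
        for n in orders:
            counts.update(ngrams(tokens, n))
    total = sum(counts.values())
    return {pattern: c / total for pattern, c in counts.items()}


def top_patterns(scores, k):
    """The k highest-scoring patterns."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


gold = [
    "Police said 12 people were injured.",
    "Police said 3 people died in the crash.",
]
scores = pattern_scores(gold)
```

Mapping all numbers to "0" lets "said 12 people" and "said 3 people" count as the same pattern, which is the point of that normalization step.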
Content Coverage Rate (CCR)
We introduce the notion of CCR to measure the degree of content overlap between the ground-truth sentences of the training set and those of the test set.
where denotes the set of patterns that help pick out ground-truth sentences. (Footnote 6: We choose 100 bigrams and trigrams as the set.) measures the similarity of two patterns, and represent the training set and test set of respectively.
4.2 Style Factors
Different from constituent factors, style factors influence the generalization ability of summarization models by adjusting the learning difficulty of samples’ features.
For this type of factor, we do not propose a new measure but adopt the indicators density and compression proposed by grusky2018newsroom. (Footnote 7: Density and compression were originally used to describe the diversity between datasets in the construction of new datasets.) Our contribution here is to focus on understanding these metrics and to explore why they affect the performance of summarization models, which is missing from previous work. More importantly, only when we understand how these metrics affect model performance can we use them to explain some of the differences in model generalization.
Density is used to qualitatively measure the degree to which a summary is derivative of a document grusky2018newsroom. Specifically, given a document D and its corresponding summary S, Density(D,S) measures the percentage of words in the summary that are from the document.
where denotes the number of words and is a set of extractive fragments, which characterize the longest shared token sequences.
Compression is used to characterize the word ratio between the document and summary grusky2018newsroom.
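A minimal sketch of both style measures, assuming whitespace tokenization and a greedy search for the longest shared token sequences. `density` follows the wording above (the share of summary words covered by extractive fragments); other length weightings of the fragments are possible. `compression` is taken to be the document-to-summary word ratio. Function names are illustrative.

```python
def extractive_fragments(doc_tokens, summ_tokens):
    """Greedily match the longest token sequences shared by summary and document."""
    fragments = []
    i = 0
    while i < len(summ_tokens):
        best = 0
        for j in range(len(doc_tokens)):
            k = 0
            while (i + k < len(summ_tokens) and j + k < len(doc_tokens)
                   and summ_tokens[i + k] == doc_tokens[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(summ_tokens[i:i + best])
            i += best
        else:
            i += 1  # summary word does not occur in the document
    return fragments


def density(doc_tokens, summ_tokens):
    """Share of summary words that lie inside an extractive fragment."""
    fragments = extractive_fragments(doc_tokens, summ_tokens)
    return sum(len(f) for f in fragments) / len(summ_tokens)


def compression(doc_tokens, summ_tokens):
    """Word ratio between the document and the summary."""
    return len(doc_tokens) / len(summ_tokens)


doc = "the cat sat on the mat while the dog slept".split()
summ = "the cat napped on the mat".split()
frags = extractive_fragments(doc, summ)
d = density(doc, summ)
c = compression(doc, summ)
```

In this toy pair, five of the six summary words are copied from the document in two fragments, while "napped" is novel.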
5 Investigating Influence of Proposed Factors on Summarization Models
5.1 Constituent Factors
5.1.1 Exp-I: Breaking Down the Test set
For the P-Value, the threshold set can be denoted as . We calculate for each sentence :
and define if . The considers both the absolute and relative position of the sentence in the document. In the experiment, we make and choose for the threshold set.
For the C-Value, we calculate the score for each sentence based on the pattern score from training set.
where denotes a sentence in the ground truth of the test set. The score indicates the degree of overlap between the sentence and the important patterns of the training set. We then sort all the sentences in ascending order by score and divide them into five intervals with the same number of sentences.
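The equal-size split into five intervals can be sketched as below; the function name and the toy scores are illustrative, and ties are broken by the original order.

```python
def equal_size_bins(scores, num_bins=5):
    """Assign each item a bin index (0 = lowest scores) with equal-sized bins."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    bins = [0] * len(scores)
    for rank, idx in enumerate(order):
        bins[idx] = rank * num_bins // len(scores)
    return bins


# Ten toy sentence scores, two sentences per interval.
sentence_scores = [0.3, 0.9, 0.1, 0.5, 0.7, 0.2, 0.8, 0.4, 0.6, 1.0]
bins = equal_size_bins(sentence_scores)
```

Extraction accuracy can then be reported per bin, which is what makes the trend across intervals visible.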
As shown in Figure 1, when a sentence is near the front of the document or contains more salient patterns, the model extracts it more accurately. This means that our proposed P-Value and C-Value reflect the position distribution and content information of a specific dataset to a certain extent, and that the model does learn constituent factors and uses them to decide whether a sentence is selected.
5.1.2 Exp-II: Cross-dataset Generalization
From the above experiments, we can see that P-Value and C-Value are sufficient to characterize some attributes in a specific dataset, but beyond that, we seek to understand the differences between mainstream datasets through PCR and CCR.
We calculate the PCR/CCR scores and measure the performance of the base model by ROUGE-2 on five datasets. Table 3 shows that the training and test sets of the same dataset always have the highest PCR/CCR score, which indicates that their distributions are the closest in terms of constituent factors. Model performance follows the same trend. On the one hand, this consistency illustrates that there are significant shifts between different datasets, which result in performance differences in the cross-dataset setting; on the other hand, it reflects that position distribution and content information are key factors of such dataset shift.
After verifying the validity of PCR and CCR, we use them to estimate the distance between the real distributions of datasets. For instance, between the news article datasets (CNN/DM, NYT50 and Newsroom) and the scientific paper datasets (arXiv and PubMed), both metrics yield lower scores; that is, there is a larger shift between the two groups, which is in line with our prior knowledge. Based on this estimation, we can understand more deeply the impact of different datasets on the generalization ability of various neural extractive summarization models.
5.2 Style Factors
We merge the training, validation and test sets into one set and divide it into three parts according to the density or compression of each article, named “low”, “medium” and “high”. For example, the articles in “density, high” are those with the highest density in the entire dataset. Based on this operation, we break down the test set and analyze how style factors influence model performance.
Exploration of Density
Density represents the overlap between the summary and the original text, so samples with high density are friendlier to extractive models. Consequently, it is easy to understand why a higher density comes with a higher ROUGE score in Table 4. However, the value of prediction is also positively correlated with the density, which means that density is closely related to the learning difficulty.
In order to understand this correlation, we conduct the following experiment. Given an article and summary pair, we assign a score to each sentence in the article to indicate how much salient information the sentence contains.
where denotes the longest common subsequence length (not counting stop words and punctuation) of the sentence and summary. We calculate the percentage of to and present the results of the three highest-scoring sentences in Table 5. Obviously, in samples with high density, the salient information is more concentrated in a few sentences, making it easier for the model to extract correct sentences.
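The sentence-level score can be sketched with a standard longest-common-subsequence routine. Two points are assumptions: the inputs are already tokenized with stop words and punctuation removed, and the reported percentage is each sentence's LCS length divided by the sum over all sentences of the article, which stands in for the normalization used in the experiment.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def salient_shares(sentences, summary, top_k=3):
    """Shares of total LCS mass held by the top_k highest-scoring sentences."""
    scores = [lcs_length(s, summary) for s in sentences]
    total = sum(scores)
    if total == 0:
        return []
    return sorted((s / total for s in scores), reverse=True)[:top_k]


summary = ["storm", "hit", "coast", "monday"]
article = [
    ["storm", "hit", "coast", "monday", "night"],
    ["residents", "left", "coast"],
    ["officials", "spoke"],
]
shares = salient_shares(article, summary)
```

When most of the mass sits in one or two sentences, as in this toy article, the salient information is concentrated and the extraction decision is comparatively easy.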
Therefore, for datasets with high density, we can try to introduce external knowledge into the model, which helps the model better understand the semantic information and thus more easily capture sentences with salient patterns. In addition, models with external knowledge should have better generalization ability when transferred to high-density datasets. These inferences will be verified in Sections 6.1 and 6.2.1.
Exploration of Compression
Documents with high compression tend to have more sentences, because summaries usually have a similar length within the same dataset. The compression results in Table 4 are therefore in line with our expectations: how a model should represent long documents to perform well on the text summarization task remains a challenge celikyilmaz2018deep.
Unlike the exploration of density, we attempt to understand how the model extracts sentences when faced with samples of different compression. We utilize an attribution technique called Integrated Gradients (IG) sundararajan2017axiomatic to separate the position and content information of each sentence. The setting of input x and baseline x’ in this paper is close to mudrakarta2018did (Footnote 8: Using empty documents, i.e. a sequence of word embeddings corresponding to the padding value, as baseline x’.), but it is worth noting that our base model adds a positional embedding to each sentence, so input x and baseline x’ both have positional information.
We tend to think that denotes the attribution of positional information, and - denotes the attribution of content information when the model makes decisions, where represents a deep network. Figure 2 illustrates that as compression increases, the help provided by positional information gradually decreases and content information becomes more important to the model. In other words, the model can perceive the compression and decide whether to pay more attention to positional information or to important patterns. This observation is helpful for designing models and for studying their generalization ability in Section 6.2.1.
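For readers unfamiliar with Integrated Gradients, the following toy sketch shows the attribution rule and its completeness property (attributions sum to the difference between the output at the input and at the baseline). The sigmoid scorer is a hypothetical stand-in for the deep network, the all-zero baseline stands in for the padding baseline, and the path integral is approximated with a midpoint Riemann sum.

```python
import math


def model(x, w):
    """Toy differentiable scorer: sigmoid of a dot product."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def grad(x, w):
    """Analytic gradient of the toy scorer with respect to the input x."""
    s = model(x, w)
    return [s * (1.0 - s) * wi for wi in w]


def integrated_gradients(x, baseline, w, steps=200):
    """Average the gradient along the straight path baseline -> x, then scale by (x - baseline)."""
    avg_grad = [0.0] * len(x)
    for step in range(steps):
        alpha = (step + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad(point, w)
        for i in range(len(x)):
            avg_grad[i] += g[i] / steps
    return [(xi - b) * a for xi, b, a in zip(x, baseline, avg_grad)]


w = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 0.5]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline, w)
output_gap = model(x, w) - model(baseline, w)
```

Because the attributions account exactly for the gap between the output on the input and the output on the baseline, whatever the baseline already contains (here, nothing; in the setting above, positional information) is excluded from them.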
6 Bridge the Gap between Dataset Bias and Model Design Prior
In this section we investigate how different properties of datasets influence the choices of model structures, pre-trained strategies, and training schemas.
Idea of Experiment Design
Following the analysis in Section 4, the constituent factors reflect the relationship between diverse data distributions, while style factors directly affect the learning difficulty of samples’ features. Based on the different attributes of these two types of factors, we design the following investigation: for the style factors, we not only investigate the influence of different model architectures and pre-trained strategies, but also use them to explain the generalization behaviour of the models. For the constituent factors, we discuss their effects on different training strategies, such as multi-domain learning and meta-learning, because these learning modes are all about how to better model various types of distributions.
6.1 Style Factors Bias
In this section, we study whether samples with the different learning difficulties described by the style factors can be handled well by improving the architecture or introducing pre-trained knowledge, or whether we need to extend our model in other ways.
Table 6 shows the breakdown of performance on CNN/DM by density and compression. We observe the following:
1) LSTM performs better than Transformer as samples become harder to learn (low density and high compression). For instance, LSTM performs worse than Transformer on the subset with high density, while it surpasses Transformer as the density of test examples becomes lower.
2) Generally, introducing pre-trained word vectors improves the overall results of the models. However, we find that increasing the learning difficulty of samples weakens the benefits brought by pre-trained embeddings.
3) The prospects for further gains on the hard cases described by style factors from novel architecture design and knowledge pre-training seem quite limited, suggesting that we should perhaps explore other ways, such as generating summaries instead of extracting them.
6.2 Constituent Factors Bias
We design our experiments to answer the following two main questions.
6.2.1 Exp-I: How do dataset properties influence the choices of training schemas?
When the training set itself contains multiple domains grouped by the constituent factors, how can we make full use of the dataset’s characteristics and find the most suitable training schema? For example, CNN/DailyMail, one of the most popular datasets, consists of two sub-datasets. To answer this question, both the dataset shift discussed in Section 5.1.2 and the learning difficulties of the dataset should be taken into consideration.
Choices of Training Schemas:
We compare four training schemas: joint training, multi-domain learning with explicit information (tag embedding), implicit information (BERT) and meta-learning. (Footnote 9: We view CNN and DailyMail in CNN/DM as two different domains.)
In order to more comprehensively reflect the generalization ability of different models, we conducted zero-shot transfer evaluation. Specifically, each of our models is trained on CNN/DM while evaluated both on CNN/DM (in-dataset) and other datasets (cross-dataset).
Table 7 shows the results of the four models under two evaluation settings, in-dataset and cross-dataset, and we have the following findings:
1) For the in-dataset setting, comparing the Tag model with the basic model, we find that a very simple method that assigns each sample a domain tag achieves improvement. We attribute this to the domain-aware model making full use of the nature of the dataset.
2) For the multi-domain and meta-learning models, we attempt an explanation from the perspective of data distribution. Although meta-learning performs worse in the in-dataset setting, it achieves impressive performance in the cross-dataset setting. Concretely, the meta-learning model surpasses the Tag model on three datasets, DUC2002, NYT50 and Newsroom, whose distributions are closer to CNN/DM according to the constituent factors in Table 3. In contrast, the Tag model uses a randomly initialized embedding for zero-shot transfer, and we suspect that this perturbation unexpectedly generalizes well on some far-distributed datasets (arXiv and PubMed).
3) BERT shows superior performance and outperforms nearly all competitors. However, relative to its improvement in the in-dataset setting, BERT generalizes poorly to arXiv, PubMed and DUC2002. In contrast, BERT generalizes well when transferring to datasets with high density and compression (NYT50 and Newsroom). As discussed in Section 5.2, samples with high style factors require the model to capture salient patterns, which is exactly what the external knowledge from BERT improves.
6.2.2 Exp-II: Searching for a Good Domain
The second question we study is: what makes a good domain? To answer this question, we define the concept of domain not solely by the dataset, but by dividing the training set directly according to the constituent factors. Specifically, we explore the following settings:
1) Random tag: Each sample is assigned a random “pseudo-domain” tag.
2) Domain: Divide training samples according to the domain (CNN or DM) they belong to.
3) P- and C-Value: Each sentence is assigned a tag by its corresponding P-Value and C-Value scores.
|+ random tag||41.19||18.52||37.57|
|+ domain tag||41.41||18.71||37.74|
|+ P-Value tag||41.38||18.71||37.67|
|+ C-Value tag||41.39||18.73||37.71|
|+ P-Value & C-Value tag||41.41||18.74||37.74|
|BERT (our implementation)||42.59||19.92||38.94|
|+ domain tag||42.72||19.91||39.05|
|+ P-Value & C-Value tag||42.77||19.98||39.10|
We experiment with tags on our base model and the current state-of-the-art model DBLP:journals/corr/abs-1903-10318. The results are presented in Table 8, from which we obtain the following observations:
1) Random partitioning is not meaningful and does not improve performance. Conversely, partitions based on the constituent factors do bring benefits.
2) The simple method of dividing the training set by domain shows considerable benefit, which is complementary to the improvement brought by BERT.
3) The division based on the constituent factors (P-Value & C-Value) achieves the best result in the context of BERT, which implies that, for the summarization task, mining the characteristics of the dataset itself plays an important role.
7 Conclusion
In this paper, we develop a data-dependent understanding of neural extractive summarization models, exploring how different factors of datasets influence these models and how to make full use of the nature of a dataset so as to design a more powerful model. Experiments with in-depth analyses diagnose the weaknesses of existing models and provide guidelines for future research.