Advances of Transformer-Based Models for News Headline Generation

07/09/2020 ∙ by Alexey Bukhtiyarov, et al. ∙ Moscow Institute of Physics and Technology

Pretrained language models based on the Transformer architecture are the reason for recent breakthroughs in many areas of NLP, including sentiment analysis, question answering, and named entity recognition. Headline generation is a special kind of text summarization task. To succeed in it, models need strong natural language understanding that goes beyond the meaning of individual words and sentences, as well as the ability to distinguish essential information. In this paper, we fine-tune two pretrained Transformer-based models (mBART and BertSumAbs) for that task and achieve new state-of-the-art results on the RIA and Lenta datasets of Russian news. BertSumAbs increases ROUGE on average by 2.9 and 2.0 points respectively over the previous best scores, achieved by Phrase-Based Attentional Transformer and CopyNet.




1 Introduction

Text summarization aims to condense vital information from a text into a shorter, coherent form that retains the main ideas. Two main approaches are distinguished: extractive, which organizes words and phrases extracted from the text into a summary, and abstractive, which requires the ability to generate novel phrases not featured in the source text while preserving the meaning and essential information.

Headline generation falls within the automatic text generation area, so the two approaches above are also the conventional approaches to this task. Because headlines are usually shorter than summaries, the model has to be good at distinguishing the most salient theme and compressing it in a syntactically correct way. The task is vital for news agencies and especially news aggregators [1]. The right solution can be beneficial both for them and for the readers. The headline is the most widely read part of any article, and because it summarizes the content, it can help readers decide whether a particular article is worth their time.

An essential property of the headline generation task is data abundance. It is much easier to collect a dataset of articles with headlines in any language than a dataset of articles with summaries, because articles usually have a headline by default. Headline generation can therefore serve as a pretraining phase for other problems, such as text classification or news clustering. This two-stage fine-tuning approach has been shown to be effective [2]. That is why it is essential to investigate the performance of different models on this particular task.

In this paper, we explore the effectiveness of applying pretrained Transformer-based models to the task of headline generation. Concretely, we fine-tune the mBART and BertSumAbs models and analyze their performance. We obtain results that validate the applicability of these models to the headline generation task.

2 Related Work

Previous advances in abstractive text summarization have been made using RNNs with an attention mechanism [3].

Another important technique for improving RNN encoder-decoder models is the copying mechanism [15], which increases the model's ability to copy tokens from the input. An LSTM-based CopyNet on byte-pair-encoded tokens achieved the previous state-of-the-art results on the Lenta dataset [12]. The Pointer-Generator network (PGN) developed this idea further by introducing a coverage mechanism [16] to keep track of what has been summarized.

The emergence of pretrained models based on the Transformer architecture [4] led to further improvements. The categorization, history, and applications of these models are comprehensively described in the survey [5]. A Phrase-Based Attentional Transformer (PBATrans) achieved the latest state-of-the-art results on the RIA dataset, which we also consider [10].

3 Models Description

Applying Transformer-based models usually follows two steps. First, during unsupervised pretraining, the model learns a universal representation of language. Then it is fine-tuned on a downstream task. Below we briefly describe the pretrained models we focus on in this work.

The first model is mBART [6]. It is a standard Transformer-based model consisting of an encoder and an autoregressive decoder. It is trained by reconstructing a document from a noisy version of that document. Document corruption strategies include randomly shuffling the order of the original sentences, randomly rotating the document so that it starts from a different token, and text infilling schemes with different span lengths. mBART is pretrained once for all languages on a subset of 25 languages. It has 12 encoder layers and 12 decoder layers with a total of 680M parameters, and its vocabulary size is 250K.
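The corruption strategies listed above can be illustrated with a minimal sketch. It is not the mBART pretraining code: the function names, the `<mask>` symbol, and the fixed span position are assumptions made for illustration, and the sketch assumes that infilling replaces a whole span with a single mask token.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol used only in this sketch


def permute_sentences(sentences, rng):
    """Return a copy of the sentence list in random order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def rotate_document(tokens, rng):
    """Rotate the token list so that it starts from a randomly chosen token."""
    if not tokens:
        return []
    start = rng.randrange(len(tokens))
    return tokens[start:] + tokens[:start]


def infill_span(tokens, start, length):
    """Replace tokens[start:start + length] with one mask token (assumed scheme)."""
    return tokens[:start] + [MASK] + tokens[start + length:]


rng = random.Random(0)
sentences = [["a", "b", "."], ["c", "d", "."], ["e", "."]]

shuffled = permute_sentences(sentences, rng)
tokens = [tok for sent in shuffled for tok in sent]
rotated = rotate_document(tokens, rng)
noisy = infill_span(rotated, start=1, length=2)
print(noisy)
```

The model would then be trained to reconstruct the original token sequence from `noisy`.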

The second model we examine is BertSumAbs [7]. It uses pretrained Bidirectional Encoder Representations from Transformers (BERT) [8] as a building block for the encoder. The encoder itself consists of 6 stacked BERT layers. To obtain multi-sentence representations, [CLS] tokens are added between sentences, and interval segmentation embeddings are used to distinguish multiple sentences within a document. The decoder is a randomly initialized 6-layer Transformer. However, because the encoder is pretrained and the decoder is trained from scratch, fine-tuning may be unstable. Following [7], we use a fine-tuning schedule that separates the optimizers to handle this mismatch.
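A schedule with separate optimizers can be sketched as two learning-rate curves, one per component. The sketch assumes the common form of linear warmup followed by inverse-square-root decay, with a smaller peak rate and longer warmup for the pretrained encoder. The function name and all constants below are illustrative assumptions, not settings reported in this paper.

```python
def warmup_lr(step, base_lr, warmup_steps):
    """Linear warmup up to warmup_steps, then inverse-square-root decay."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return base_lr * min(step ** -0.5, step * warmup_steps ** -1.5)


# Illustrative values only (assumptions): the pretrained encoder is updated
# more cautiously than the randomly initialized decoder.
ENCODER = {"base_lr": 2e-3, "warmup_steps": 20_000}
DECODER = {"base_lr": 0.1, "warmup_steps": 10_000}

for step in (1_000, 10_000, 20_000, 40_000):
    lr_enc = warmup_lr(step, **ENCODER)
    lr_dec = warmup_lr(step, **DECODER)
    print(f"step={step:>6}  encoder lr={lr_enc:.2e}  decoder lr={lr_dec:.2e}")
```

In practice, each curve would drive its own optimizer, so that the decoder can take larger steps early on while the encoder weights stay close to their pretrained values.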

As the pretrained BERT for BertSumAbs, we use RuBERT, trained on the Russian part of Wikipedia and news data [9]. A vocabulary of Russian subtokens of size 120K was built from this data as well. BertSumAbs has 320M parameters.

Scripts and model checkpoints for both models are available (footnote 1).

4 Datasets

The RIA dataset consists of around 1 million news articles with associated headlines [11]. These documents were published on the website "RIA Novosti (RIA news)" from January 2010 to December 2014. For training purposes, we split the dataset into train, validation, and test parts in a proportion of 90:5:5.
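A 90:5:5 split can be produced as in the following minimal sketch. The random shuffle, the seed, and the function name are assumptions; the exact splitting procedure used for the RIA dataset is not described here.

```python
import random


def split_dataset(examples, parts=(90, 5, 5), seed=42):
    """Shuffle examples and split them into train/validation/test parts.

    `parts` holds integer percentages; the test part receives the remainder.
    """
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * parts[0] // 100
    n_val = len(items) * parts[1] // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))
```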

The Lenta dataset (footnote 2) contains about 800 thousand news articles with titles, published from 1999 to 2018. We evaluate on this dataset to measure the models' ability to generate headlines for articles with a different structure, time period, and writing style.

Neither dataset contains timestamps, so time-based splits are unavailable. This can cause some bias, because models tend to perform better on texts with entities that they saw during training.

5 Experiments

5.1 Evaluation

To evaluate the models, we use the ROUGE metric [13], reporting scores for unigram and bigram overlap (ROUGE-1, ROUGE-2) and the longest common subsequence (ROUGE-L). We use the mean of these three metrics as the primary metric (R-mean). To balance this with a precision-based measure, we report BLEU [14].
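The three ROUGE variants and R-mean can be illustrated with a simplified sketch. It assumes whitespace tokenization, no stemming, and the F1 variant of each score; it is not the official ROUGE toolkit, and all function names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, ref_total, hyp_total):
    """F1 from an overlap count and the reference and hypothesis totals."""
    if overlap == 0 or ref_total == 0 or hyp_total == 0:
        return 0.0
    precision = overlap / hyp_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(reference, hypothesis, n):
    """ROUGE-N F1 based on clipped n-gram overlap."""
    ref, hyp = ngrams(reference, n), ngrams(hypothesis, n)
    overlap = sum((ref & hyp).values())
    return f1(overlap, sum(ref.values()), sum(hyp.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, hypothesis):
    """ROUGE-L F1 based on the longest common subsequence."""
    return f1(lcs_length(reference, hypothesis), len(reference), len(hypothesis))


def r_mean(reference, hypothesis):
    """Mean of ROUGE-1, ROUGE-2, and ROUGE-L."""
    scores = (
        rouge_n(reference, hypothesis, 1),
        rouge_n(reference, hypothesis, 2),
        rouge_l(reference, hypothesis),
    )
    return sum(scores) / 3


reference = "the dollar rate increased at the opening".split()
hypothesis = "the ruble strengthened at the opening".split()
print(round(100 * r_mean(reference, hypothesis), 1))
```

Reported scores are usually multiplied by 100, as in the print statement above.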

We also report the proportion of novel n-grams that appear in the summaries but not in the source texts. The higher this number, the less copying was done to generate a summary, which means a more abstractive model.
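The proportion of novel n-grams can be computed as in the sketch below. It assumes whitespace tokenization and counts unique n-grams; both choices, the function names, and the example texts are assumptions for illustration.

```python
def ngram_set(tokens, n):
    """Set of unique n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_ratio(source, summary, n):
    """Share of the summary's unique n-grams that never occur in the source."""
    summary_ngrams = ngram_set(summary, n)
    if not summary_ngrams:
        return 0.0
    novel = summary_ngrams - ngram_set(source, n)
    return len(novel) / len(summary_ngrams)


# Hypothetical example texts
source = "doctors in china removed a knife blade from the head of a patient".split()
summary = "doctors found a knife blade in the head".split()

for n in (1, 2, 3):
    print(n, round(novel_ngram_ratio(source, summary, n), 3))
```

A fully extractive headline would score 0 for every n, while a headline that shares no n-grams with the article would score 1.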

As baseline models, we use the first sentence and PGN. The first sentence is a strong baseline for news articles because the most valuable information is often at the start, while further sentences provide details and background information. The PGN approach on byte-pair-encoded tokens was described in [12]. The implementation and parameters for PGN are taken from an existing implementation (footnote 3).

5.2 Training dynamics

In Fig. 1 we present the training dynamics of the BertSumAbs model. It takes about 3 days to train for 45K steps on a GeForce GTX 1080 with a batch size of 8 and gradient accumulation every 95 steps. We use the 40K-step checkpoint as the final model because it has a better loss on the validation set.

Figure 1: BertSumAbs training dynamics on train and validation parts

| Model | R1 | R2 | RL | R-mean | BLEU |
|---|---|---|---|---|---|
| First sentence | 23.8 | 10.5 | 16.6 | 16.9 | 21.8 |
| CopyNet [12] | 41.6 | 24.5 | 38.9 | 35.0 | 53.8 |
| PBATrans [10] | 43.0 | 25.4 | 40.0 | 36.1 | - |
| PGN | 42.3 | 25.1 | 39.6 | 35.7 | 54.2 |
| mBART | 42.8 | 25.5 | 39.9 | 36.1 | 55.1 |
| BertSumAbs | 46.0 | 28.0 | 43.1 | 39.0 | 57.6 |

Table 1: RIA dataset evaluation

6 Results

Table 1 shows the evaluation results on the RIA dataset. BertSumAbs improves significantly on all considered metrics, whereas mBART performs at the level of the previous state of the art. Table 2 presents results on the Lenta dataset for models trained on the RIA dataset. Both models improve on previous results for all metrics. On both datasets, BertSumAbs performs better than mBART. A possible reason is that BertSumAbs uses RuBERT, which was trained specifically on a Russian corpus, whereas Russian is only one of 25 languages for mBART.

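| Model | R1 | R2 | RL | R-mean | BLEU |
|---|---|---|---|---|---|
| First sentence | 24.0 | 10.6 | 18.3 | 17.6 | 24.9 |
| CopyNet [12] | 28.3 | 14.0 | 25.8 | 22.7 | 40.4 |
| PGN | 26.4 | 12.3 | 24.0 | 20.9 | 39.8 |
| mBART | 30.3 | 14.5 | 27.1 | 24.0 | 43.2 |
| BertSumAbs | 31.0 | 14.9 | 28.1 | 24.7 | 45.1 |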

Table 2: Lenta dataset evaluation using model trained on RIA dataset
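As a consistency check, the average ROUGE gains quoted in the abstract (2.9 and 2.0 points) can be recomputed from the R1, R2, and RL columns of Tables 1 and 2. The snippet below is a small sketch of that arithmetic; the variable and function names are illustrative.

```python
# (R1, R2, RL) values copied from Table 1 (RIA) and Table 2 (Lenta)
RIA = {
    "PBATrans": (43.0, 25.4, 40.0),
    "BertSumAbs": (46.0, 28.0, 43.1),
}
LENTA = {
    "CopyNet": (28.3, 14.0, 25.8),
    "BertSumAbs": (31.0, 14.9, 28.1),
}


def r_mean(scores):
    """Mean of ROUGE-1, ROUGE-2, and ROUGE-L."""
    return sum(scores) / len(scores)


ria_gain = r_mean(RIA["BertSumAbs"]) - r_mean(RIA["PBATrans"])
lenta_gain = r_mean(LENTA["BertSumAbs"]) - r_mean(LENTA["CopyNet"])
print(round(ria_gain, 1), round(lenta_gain, 1))  # prints 2.9 2.0
```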

Fig. 2 presents the proportion of novel n-grams for both models in comparison with the true headlines, designated here and below as Reference. The results show that BertSumAbs is more abstractive on both datasets. A manual inspection confirms that mBART is more prone to copying.

To measure the impact of time bias, we scraped about 2.5K articles and associated headlines from the RIA website (footnote 4). All articles were published in 2020. The results are shown in Table 3. As expected, the metrics decrease, which is explained by new entities in the articles. BertSumAbs suffers a greater decrease than mBART, which may be due to the more diverse pretraining of the latter.

Figure 2: Proportion of novel n-grams in headlines

| Model | R1 | R2 | RL | R-mean | BLEU |
|---|---|---|---|---|---|
| PGN | 39.6 | 20.8 | 35.2 | 31.9 | 52.1 |
| mBART | 41.7 | 22.7 | 37.2 | 33.9 | 53.2 |
| BertSumAbs | 41.9 | 22.5 | 37.3 | 33.9 | 54.2 |

Table 3: Evaluation on RIA articles from 2020

1. Reference: курс доллара подрос на открытии в среду на 1 коп - до 28,04 руб
   en: the dollar rate increased at the opening on wednesday by 1 kopeck - up to 28,04 rubles
1. mBART: рубль укрепился на открытии на открытии на открытии на одну копейку
   en: ruble strengthened at the opening at the opening at the opening by 1 kopeck
1. BertSumAbs: курс доллара подрос на открытии в среду на 1 коп - до 28,04 руб
   en: the dollar rate increased at the opening on wednesday by 1 kopeck - up to 28,04 rubles

2. Reference: 7 дней в гаване: неделя в кубе
   en: 7 days in havana: a week in cuba
2. mBART: "экстранхерос" или "виахерос"
   en: "extranheros" or "viaheros"
2. BertSumAbs: кубинская мама, или как таксисты застряли на таможне
   en: cuban mom, or how taxi drivers got stuck at customs

3. Reference: женщина без рук стирает ногами, а зубами развешивает белье
   en: a woman without arms does laundry with her feet and hangs clothes with her teeth
3. mBART: безрукой женщине, живущей в калужской области, говорят: "имела бы ты руки"
   en: an armless woman living in the kaluga region is told: "if only you had arms"
3. BertSumAbs: безрукая женщина в калужской области родила безрукую женщину
   en: armless woman in the kaluga region gave birth to an armless woman

4. Reference: китаец четыре года жил с ножом в голове, не зная об этом
   en: for four years a chinese man lived with a knife in his head, not knowing about it
4. mBART: врачи обнаружили лезвие ножа в голове своего пациента
   en: doctors found a knife blade in their patient's head
4. BertSumAbs: врачи обнаружили в голове жителя китая десятисантиметровое ружье
   en: doctors found in the head of a resident of china a ten-centimeter gun

Table 4: Bad examples

6.1 Human Evaluation

To understand model performance beyond automatic metrics, we conducted a human evaluation of the results. We randomly sampled 1110 articles and showed them, together with both the true and the predicted headlines, to human annotators. We used Yandex.Toloka (footnote 5) to manage the process of distributing samples and recruiting annotators. The task was to decide which headline is better: the one generated by BertSumAbs or the one written by a human. There was also a draw option for cases where the annotator could not decide. There were 9 annotators for each example, and they did not know the origin of the headlines. As a result, we got a draw in 8% of samples and a BertSumAbs win in 49% of samples. Analysing the aggregate statistics, we found that BertSumAbs headlines were chosen by 5 or more annotators in 32% of samples, whereas human-written headlines were chosen by 5 or more annotators in 28%.
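One way to aggregate the 9 votes per example is sketched below. The exact aggregation rule is not spelled out above, so this is a hypothetical reading: the winner is the side with more votes, and a separate flag marks cases where one side was chosen by 5 or more annotators. The label strings and the function name are assumptions.

```python
from collections import Counter


def aggregate(votes, majority=5):
    """Return (winner, has_majority) for one example.

    votes: list of "model", "human", or "draw" labels, one per annotator.
    """
    counts = Counter(votes)
    model, human = counts["model"], counts["human"]
    if model > human:
        winner = "model"
    elif human > model:
        winner = "human"
    else:
        winner = "draw"
    has_majority = max(model, human) >= majority
    return winner, has_majority


example = ["model"] * 5 + ["human"] * 3 + ["draw"]
print(aggregate(example))
```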

6.2 Error Analysis

Although the models' output is in general fluent and grammatically correct, some common mistakes are intrinsic to both models. In Table 4, we provide several such examples. In the first example, there is a repeated phrase in the mBART output, while BertSumAbs is very accurate. The second example is so hard for both models that the mBART prediction contains imaginary words. This is a common problem for models operating on a subword level. The third example confirms that some articles are hard for both models. One more type of mistake is factual errors, as in the fourth example, where BertSumAbs reports the wrong object (a gun instead of a knife). This type of error is the worst because it is the most difficult to detect.
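Repetitions such as the one in the first example can be flagged automatically. The following minimal sketch checks whether any n-gram occurs several times back to back; the function name and the whitespace tokenization are assumptions, and this check is not part of the evaluation described here.

```python
def has_repeated_ngram(tokens, n, min_repeats=2):
    """True if some n-gram occurs at least min_repeats times in a row."""
    for start in range(len(tokens) - n * min_repeats + 1):
        gram = tokens[start:start + n]
        if all(
            tokens[start + k * n:start + (k + 1) * n] == gram
            for k in range(1, min_repeats)
        ):
            return True
    return False


# Model outputs for the first example in Table 4
mbart_output = "рубль укрепился на открытии на открытии на открытии на одну копейку".split()
bertsum_output = "курс доллара подрос на открытии в среду на 1 коп - до 28,04 руб".split()

print(has_repeated_ngram(mbart_output, n=2))
print(has_repeated_ngram(bertsum_output, n=2))
```

Factual errors like the one in the fourth example cannot be caught this way, which is part of why they are harder to detect.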

7 Conclusion and Future Work

In this paper, we showcase the effectiveness of fine-tuning pretrained Transformer-based models for the task of abstractive headline generation and achieve new state-of-the-art results on Russian datasets. We showed that BertSumAbs, which uses a language-specific encoder, achieves better results than mBART. Moreover, human evaluation confirms that BertSumAbs is capable of generating headlines indistinguishable from human-created ones.

In future work, we plan to use the headline generation task as the first step in a two-stage fine-tuning strategy to transfer knowledge to other tasks, such as news clustering and classification.