1 Introduction
| | |
|---|---|
| Article | トヨタ自動車は18日、エンジン車だけの車種を2025年ごろまでにゼロにすると発表した。…ハイブリッド車 (HV)やプラグインハイブリッド車 (PHV)、燃料電池車 (FCV)も加えた「電動車」を、すべての車種に設定する。… |
| | (On the 18th, Toyota announced that it will reduce the number of engine-only models to zero by around 2025. … "Electric vehicles", which include hybrid vehicles (HV), plug-in hybrid vehicles (PHV), and fuel cell vehicles (FCV), will be offered for all models. …) |
| Headline for print media | トヨタ、全車種に電動車 25年ごろまでに HVやFCV含め |
| | (All of Toyota's models will include electric vehicles, including HVs and FCVs, by around 2025.) |
| Multi-length headlines for digital media | |

Table 1: Example of headlines of three different lengths written by professional editors for the same news article.
News media publish newspapers in print and in electronic form. In electronic form, articles may be read on various types of devices with various applications; thus, news media companies have an increasing need to produce multiple headlines for the same news article, depending on what is most appropriate and most compelling on each device. Every device and application used for viewing articles imposes a strict upper bound on the number of characters because of the limited space in which the headline appears. Automatic headline generation has the potential to contribute greatly to this domain, and the problem of news headline generation has motivated a wide range of research (Wang et al., 2018; Chen et al., 2018; Li et al., 2018; Song et al., 2018; Kiyono et al., 2018; Zhou et al., 2018).
Table 1 shows sample headlines of three different lengths written by professional editors of a media company for the same news article: the first headline for digital media is restricted to 10 characters, the second to 13 characters, and the third to 26 characters. From a practical perspective, headlines must be generated under a rigid length constraint. However, few studies have been conducted under this assumption.
The first study to consider the length of system outputs in the context of encoder-decoder language generation was Rush et al. (2015). This study controlled the length of an output sequence by reducing the score of the end-of-sentence token to −∞ until the method had generated the desired number of words. Subsequently, Kikuchi et al. (2016) and Fan et al. (2018) proposed mechanisms for length control; however, these studies produced summaries of 30, 50, and 75 bytes and evaluated them using reference summaries of a single length (approximately 75 bytes) in DUC 2004 (https://duc.nist.gov/duc2004/). Thus, several questions can be posed: (1) Can references of a longer length adequately evaluate system outputs shorter than the reference? (2) How do words that are not included in shorter references but are included in longer references affect the evaluation? (3) What types of tasks are involved at each length limit? and (4) How well do the existing length-control methods manage those tasks? In this study, we present novel corpora to investigate these research questions. The contributions of this study are threefold.
- We release the Japanese News Corpus (JNC) (https://cl.asahi.com/api_data/jnc-jamul-en.html), which includes 1.83 million pairs of headlines and the lead three sentences of Japanese news articles. We expect this corpus to provide common supervision data for headline generation.
- We build the JApanese MUlti-Length Headline Corpus (JAMUL), available at the same URL, for the evaluation of headlines of different lengths. In this novel dataset, each news article is associated with multiple headlines of three different lengths.
- We report new findings on the JAMUL; for example, although the longest references appear to be able to evaluate shorter system outputs, we also found a problem with this evaluation setting. Additionally, we clarify what types of tasks the existing methods solve depending on the target length.
2 JNC and JAMUL
Figure 1: Distribution of headline lengths in (a) the JNC and (b) the JAMUL.
2.1 Headlines composed by a media company
Before describing the JNC and JAMUL in detail, we explain the process by which a media company composes headlines for a news article. First, reporters write an article and submit it to the editorial department for publication in the newspaper. The editorial department writes a headline dedicated to print media for the article. We call these headlines print headlines or length-insensitive headlines hereafter.
In addition to print headlines, digital media editors, who are typically not the same editors as those for print, select the articles they want to distribute on digital media from those submitted for print and compose three different headlines. The first type of headline, for digital signage and audio media, has a limit of 10 characters. This type of headline is appended to the beginning of a concise summary of the article so that readers can understand the news at a glance. The second type of headline is produced for mobile phones with small LCDs and for small areas of the news site (e.g., the access ranking); the upper limit is 13 characters. The third type of headline is produced for PC news websites, and the upper limit is 26 characters; this limit is derived from the layout of the news site. We refer to the three types of headlines as 10char-ref, 13char-ref, and 26char-ref (see Table 1 for an example) and collectively call them length-sensitive headlines.
Table 1 presents an example of headlines written by professional editors for an article. The JNC and JAMUL are extracted from this news-production process of a trusted, professional source, maintained in time-stamped databases; therefore, they can be considered representative of contemporary editorial practice.
2.2 JNC
The JNC is a collection of 1,829,231 pairs of the three lead sentences of articles and their print headlines published from 2007 to 2016. Figure 1 (a) depicts the distribution of headline lengths in the JNC. The lengths of headlines in the JNC are diverse because of various factors related to publishing newspapers (e.g., space limitations and the importance of the news); in general, more important articles tend to be assigned longer print headlines.
The JNC is useful for training headline generation models because it has many training instances. Furthermore, the corpus is suitable for training a model for variable-length headline generation because of the variety of the headline lengths.
2.3 JAMUL
The JAMUL is a corpus containing 1,524 news articles and their length-sensitive headlines of 10, 13, and 26 characters for digital media. All the articles and headlines were published between September 2017 and March 2018. The volume of news articles may be insufficient for training a headline generation model. However, as Figure 1 (b) shows, the JAMUL includes length-sensitive headlines that strictly observe the length requirements. This characteristic makes the JAMUL suitable as a test set for headline generation. There is no overlap of articles between the JNC and the JAMUL.
2.4 Comparing headlines with article bodies
| System | Reference | Precision | Recall |
|---|---|---|---|
| Article | Print headline | 3.67 | 87.34 |
| Article | 10char-ref | 1.47 | 88.77 |
| Article | 13char-ref | 1.94 | 89.82 |
| Article | 26char-ref | 3.85 | 90.14 |

Table 2: Word-level precision and recall when comparing articles with length-insensitive and length-sensitive headlines.
What types of operations did the editors perform to create the length-sensitive and length-insensitive headlines in the JAMUL? To answer this question, we analyzed the proportions of extractive and abstractive operations. Specifically, we report word-level precision and recall scores in Table 2, assuming that articles are “system” summaries and that the print, 10char-ref, 13char-ref, and 26char-ref headlines are “reference” summaries. Notably, we removed blank spaces, which were the most common token in longer headlines. The relatively high recall scores indicate that the operations most often required to generate headlines are extractive; abstractive operations account for about 10% of the total.
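To make this analysis concrete, here is a minimal Python sketch of how such word-level precision and recall might be computed from tokenized texts. The function name, the clipped-count formulation, and the example token lists are our assumptions for illustration, not details taken from the paper.

```python
from collections import Counter

def word_level_pr(system_tokens, reference_tokens):
    """Word-level precision/recall between a tokenized 'system' text
    (here, an article) and a 'reference' text (a headline).
    Counts are clipped as in standard overlap metrics; whether the
    paper clips counts or uses word types is an assumption."""
    sys_counts = Counter(system_tokens)
    ref_counts = Counter(reference_tokens)
    overlap = sum((sys_counts & ref_counts).values())
    precision = overlap / sum(sys_counts.values()) if system_tokens else 0.0
    recall = overlap / sum(ref_counts.values()) if reference_tokens else 0.0
    return precision, recall

# Placeholder token lists (blank spaces removed beforehand).
article_tokens = ["トヨタ", "自動車", "は", "18", "日", "発表", "した"]
headline_tokens = ["トヨタ", "全", "車種", "に", "電動", "車"]
print(word_level_pr(article_tokens, headline_tokens))
```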
2.5 Comparing length-sensitive headlines with print headlines
| System | Reference | Precision | Recall |
|---|---|---|---|
| Print headline | 10char-ref | 24.66 | 64.11 |
| Print headline | 13char-ref | 33.24 | 66.30 |
| Print headline | 26char-ref | 56.36 | 55.75 |

Table 3: Word-level precision and recall when comparing print headlines with length-sensitive headlines.
| | |
|---|---|
| Article | 米フェイスブック(FB)は1日、2017年7〜9月期決算を発表し、モバイル広告の伸びなどで売上高、純利益ともに四半期として過去最高を記録した。…FB上で偽ニュースの拡散防止など、安全確保のための要員を約2万人に倍増させることを明らかにしている。 |
| | (On the 1st, Facebook (FB) in the U.S. announced its financial results for July to September 2017 and achieved record quarterly sales and net income thanks to its mobile advertising growth and other factors. … FB also revealed that it will double the number of personnel responsible for ensuring safety, such as preventing the spread of fake news on the platform, to about 20,000.) |
| Headline for print media | フェイスブック、四半期で最高益 モバイル広告好調 |
| | (Facebook achieved a record quarterly profit thanks to its strong mobile advertising business.) |
| Multi-length headlines for digital media | |

Table 4: Example of the typical differences between the length-insensitive (print) headline and the length-sensitive headlines.
How similar are the headlines used for training (length-insensitive) and those used for evaluation (length-sensitive)? We estimated the appropriateness of length-insensitive headlines as a “seed” for producing length-sensitive headlines. More concretely, we report word-level precision and recall scores in Table 3, assuming that length-insensitive headlines are “system” summaries and that 10char-ref, 13char-ref, and 26char-ref headlines are “reference” summaries. The relatively high recall scores indicate that the training and evaluation data are not very distant. Additionally, we found that the editors use a moderate number of words that do not appear in print headlines when composing length-sensitive headlines. Table 4 shows an example of the typical differences between length-insensitive and length-sensitive headlines. Comparing the 26-character headline with the print headline, the choices of content differ; for example, whereas the print headline reports the reason for the record profit, the 26-character headline describes the increase in the number of personnel. Next, comparing the 7-character (10char-ref) headline with the print headline, we observe that the word choices differ; the print headline uses “Facebook”, which is changed to “FB” in the 7-character headline.

2.6 Comparing length-sensitive headlines
| System | Reference | P | R |
|---|---|---|---|
| 26char-ref | 10char-ref | 28.77 | 78.55 |
| 26char-ref | 13char-ref | 42.53 | 88.75 |
| First 10 chars in 26char-ref | 10char-ref | 38.40 | 41.65 |
| First 13 chars in 26char-ref | 13char-ref | 60.31 | 65.58 |
| Last 10 chars in 26char-ref | 10char-ref | 14.55 | 17.05 |
| Last 13 chars in 26char-ref | 13char-ref | 23.13 | 26.56 |

Table 5: Word-level precision (P) and recall (R) when comparing 26char-ref headlines (and their first/last 10 or 13 characters) with 10char-ref and 13char-ref headlines.
How similar are the headlines of different lengths composed for the same news article? How good are 26char-ref headlines as “seeds” for generating 10char-ref or 13char-ref headlines? Is the simple strategy of trimming 26char-ref headlines to 10 or 13 characters sufficient? To answer these questions, we computed word-level precision and recall scores, assuming that 26char-ref headlines are “system” summaries and that 10char-ref and 13char-ref headlines are “reference” summaries.
The first and second rows of Table 5 represent the situation in which we used 26char-ref headlines as they are, without observing the length constraint. Although this setting is unrealistic, it allows us to estimate the upper bound when composing a shorter headline from a 26char-ref headline. The high recall scores indicate that 26char-ref headlines mostly cover the words included in 10char-ref and 13char-ref headlines. The third and fourth rows of Table 5 correspond to the strategy of generating headlines of 10 and 13 characters from the first 10 and 13 characters of 26char-ref headlines. This strategy achieved moderate success for headlines of 13 characters but did not work well for headlines of 10 characters. In other words, we observed large differences between 10char-ref and 26char-ref headlines. The fifth and sixth rows of Table 5 correspond to the strategy of generating headlines of 10 and 13 characters from the last 10 and 13 characters of 26char-ref headlines. Extracting the latter part of a 26char-ref headline is probably not a good idea because the precision and recall scores are much worse than those for the first 10 and 13 characters. On the other hand, these results also indicate that some of the words included in 10char-ref and 13char-ref headlines appear in the latter part of 26char-ref headlines.
In sum, we found similarities among the headlines of different lengths in the JAMUL. However, the simple strategy of trimming a longer headline into a shorter one is insufficient (except for shrinking 26char-ref headlines into 13char-ref headlines). Table 1 shows an example of the typical differences among length-sensitive headlines. There is little word overlap between the longer and shorter headlines because the 9- and 13-character headlines use shorter phrases that have nearly the same meaning as the 24-character headline. Focusing on “車種” (models), this word appears in the latter half of the 24-character headline, confirming that important keywords are not always placed at the beginning of headlines.
3 Comparing headline generation methods on JAMUL
In this section, we explore a question about evaluation: how reliable is the conventional evaluation method, which uses references of a single length, for measuring the quality of summaries of different lengths? To answer this question, we generate summaries of multiple lengths using existing methods and measure the correlation between the scores computed by the conventional evaluation method and those computed on the JAMUL.
3.1 Headline generation methods with the mechanism to control output length
In this study, we explored four methods for headline generation that can control the output length. The first two methods, LenEmb and LenInit, were proposed by Kikuchi et al. (2016). LenEmb provides the decoder with output-length information in the form of a length embedding, whereas LenInit controls the output length by multiplying the initial state of the decoder's memory cell by the desired length.
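To illustrate the LenInit idea, the following is a minimal PyTorch sketch in which the decoder's initial memory cell is the desired length times a trainable vector. The class name, tensor shapes, and initialization are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class LenInitState(nn.Module):
    """Sketch of LenInit-style length control: the decoder LSTM's
    initial memory cell c_0 is the desired output length times a
    trainable vector, so the state can encode 'how much is left'."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.length_vector = nn.Parameter(torch.randn(hidden_size) * 0.01)

    def forward(self, encoder_final: torch.Tensor, desired_length: torch.Tensor):
        # encoder_final:  (batch, hidden_size), used as the decoder's h_0
        # desired_length: (batch,), target length (e.g., in characters)
        h0 = encoder_final
        c0 = desired_length.unsqueeze(1).float() * self.length_vector
        return h0, c0

# Usage with dummy tensors: two inputs asking for 10- and 26-character outputs.
state = LenInitState(hidden_size=512)
h0, c0 = state(torch.zeros(2, 512), torch.tensor([10, 26]))
```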
Fan et al. (2018) also proposed a length-controllable method for a convolutional sequence-to-sequence (ConvS2S) model (Gehring et al., 2017). Their method adds special tokens indicating the range of the output length at the beginning of an input sequence. In our experiment, we used a special token to specify an output length (Fan et al. (2018) also included special tokens for entities, but we did not use them) and call this method SP-token.
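The SP-token idea can be sketched as simply prepending a length-indicating special token to the tokenized input sequence; the `<len_N>` token format below is a hypothetical convention for illustration.

```python
def prepend_length_token(source_tokens, desired_char_length):
    """Prepend a special token encoding the desired output length
    (in characters) to the input sequence, so that the decoder can
    learn to associate the token with headlines of that length.
    The '<len_N>' naming is a hypothetical convention."""
    return [f"<len_{desired_char_length}>"] + list(source_tokens)

# e.g., request a 13-character headline for this (truncated) input.
tokens = prepend_length_token(["トヨタ", "自動車", "は", "…"], 13)
```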
We also considered LC (Liu et al., 2018), which extends ConvS2S and multiplies the initial state of the residual connection (He et al., 2016) by the desired number of output tokens. In our experiment, we specified the desired number of characters instead of tokens.

In addition to the four methods described above, we applied SP-token not only to ConvS2S but also to Seq2Seq and Transformer (Vaswani et al., 2017). Eventually, we examined six combinations in total: (1) Seq2Seq + LenEmb, (2) Seq2Seq + LenInit, (3) Seq2Seq + SP-token, (4) ConvS2S + SP-token, (5) ConvS2S + LC, and (6) Transformer + SP-token.
3.2 Datasets and evaluation protocol
We trained the six methods for headline generation on the JNC. We removed instances that were duplicated or unsuitable for training a headline generation model (the filtering script is available at https://github.com/asahi-research/Gingo). The filtering step yielded 1,568,360 pairs of newspaper articles and headlines. We randomly selected 1% of the instances (15,546 pairs) as a validation set and used the remainder (1,523,469 pairs) as a training set. We used Byte Pair Encoding (BPE) (Sennrich et al., 2016; https://github.com/rsennrich/subword-nmt) for tokenization. We set the number of merge operations to 8,000 and pretokenized all the data with MeCab. Finally, we obtained a vocabulary of 11,257 tokens for both the source and target sides.
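A minimal sketch of the pretokenization step with MeCab (via the mecab-python3 binding) is shown below; applying BPE with subword-nmt on the pretokenized text is the subsequent step and is not shown here.

```python
import MeCab  # mecab-python3

# "-Owakati" makes MeCab emit space-separated surface forms.
tagger = MeCab.Tagger("-Owakati")

def pretokenize(text):
    """Split raw Japanese text into morphemes before BPE is applied."""
    return tagger.parse(text).strip().split()

print(pretokenize("トヨタ、全車種に電動車 25年ごろまでに"))
# BPE (8,000 merge operations, learned with subword-nmt) would then be
# applied to these pretokenized sequences on both the source and target sides.
```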
When training a model, we provided the model with the length of each reference headline. When generating headlines for evaluation, we set the output lengths to 10, 13, and 26 characters; each output was evaluated against the reference of the same length in the JAMUL. We evaluated all models with three variants of the ROUGE (Lin, 2004) recall metric: ROUGE-1, ROUGE-2, and ROUGE-L (we used MeCab (Kudo et al., 2004) to tokenize the system outputs). Headlines exceeding the length limits were trimmed for fairness of evaluation.
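The evaluation protocol can be sketched as trimming each generated headline to its character limit and scoring it with ROUGE-N recall against the same-length reference. The functions below are a simplified stand-in for the actual ROUGE toolkit, with names of our own choosing.

```python
from collections import Counter

def trim_to_limit(headline, char_limit):
    """Trim a generated headline that exceeds the character limit."""
    return headline[:char_limit]

def rouge_n_recall(system_tokens, reference_tokens, n=1):
    """ROUGE-N recall: overlapping n-grams divided by the number of
    n-grams in the reference (tokens obtained with MeCab)."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    sys_ng, ref_ng = ngrams(system_tokens), ngrams(reference_tokens)
    total = sum(ref_ng.values())
    return sum((sys_ng & ref_ng).values()) / total if total else 0.0
```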
3.3 Implementation
We employed OpenNMT (Klein et al., 2017; https://github.com/OpenNMT/OpenNMT-py) for Seq2Seq and fairseq (https://github.com/pytorch/fairseq) for ConvS2S and Transformer. We extended these implementations to realize LenEmb, LenInit, and LC. We set the dimensions of the token and length embeddings to 512, the dimensions of the hidden states to 512, and the beam width to 5. These parameters are common to all the models; Table 6 summarizes other parameters specific to each sequence-to-sequence model. We used Nesterov's accelerated gradient method (NAG) (Sutskever et al., 2013) with a momentum of 0.99 for ConvS2S. For Transformer, we set the number of attention heads to 8, the dimensions of the feed-forward network to 2048, Adam's β2 to 0.98, the warm-up steps to 4,000, and the label smoothing to 0.1.
| | Seq2Seq | ConvS2S | Transformer |
|---|---|---|---|
| Num of Layers | 2 | 8 | 6 |
| Dropout Rate | 0.3 | 0.1 | 0.3 |
| Grad Clipping | [-5.0, 5.0] | [-0.1, 0.1] | - |
| Learning Rate | 0.001 | 0.2 | 0.001 |
| Optimizer | Adam | NAG | Adam |

Table 6: Hyperparameters specific to each sequence-to-sequence model.
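For illustration, the Transformer-specific settings above and in Table 6 could be instantiated in PyTorch roughly as follows; the model object, the inverse-square-root warm-up schedule, and the way the base learning rate is combined with it are our assumptions, not the authors' training code.

```python
import torch
import torch.nn as nn

d_model, warmup_steps = 512, 4000

# Hypothetical model with the layer counts, heads, FFN size, and dropout
# reported in the paper (the actual experiments used fairseq).
model = nn.Transformer(d_model=d_model, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       dim_feedforward=2048, dropout=0.3)

# Label smoothing 0.1 (requires a recent PyTorch version).
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# Adam with beta_2 = 0.98 and the base learning rate from Table 6.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.98))

# Standard inverse-square-root warm-up from Vaswani et al. (2017);
# whether the authors used exactly this schedule is an assumption.
def lr_scale(step):
    step = max(step, 1)
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_scale)
```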
| | 10 characters | | | 13 characters | | | 26 characters | | |
|---|---|---|---|---|---|---|---|---|---|
| Models | R-1 | R-2 | R-L | R-1 | R-2 | R-L | R-1 | R-2 | R-L |
| (1) Seq2Seq + LenEmb | 34.66 | 15.29 | 33.56 | 40.66 | 19.40 | 38.23 | 44.82 | 20.75 | 36.62 |
| (2) Seq2Seq + LenInit | 36.50 | 16.75 | 35.54 | 41.40 | 19.49 | 38.83 | 46.77 | 22.06 | 38.29 |
| (3) Seq2Seq + SP-token | 38.09 | 17.43 | 36.67 | 42.51 | 19.79 | 39.76 | 47.33 | 22.12 | 38.59 |
| (4) ConvS2S + SP-token | 38.90 | 17.84 | 37.53 | 43.32 | 20.35 | 40.31 | 47.10 | 21.51 | 37.86 |
| (5) ConvS2S + LC | 37.71 | 16.89 | 36.50 | 42.60 | 20.11 | 39.97 | 45.76 | 21.93 | 37.91 |
| (6) Transformer + SP-token | 42.85 | 19.84 | 41.40 | 46.92 | 22.85 | 44.09 | 51.57 | 24.52 | 41.05 |

Table 7: ROUGE recall scores of each method on the JAMUL, where each output is evaluated against the reference of the same length.
| | 10 characters | | | 13 characters | | |
|---|---|---|---|---|---|---|
| Models | R-1 | R-2 | R-L | R-1 | R-2 | R-L |
| (1) | 21.22 | 8.56 | 19.29 | 27.42 | 12.20 | 24.23 |
| (2) | 21.72 | 9.03 | 19.70 | 28.34 | 12.83 | 25.01 |
| (3) | 22.02 | 9.08 | 20.03 | 28.26 | 12.09 | 24.73 |
| (4) | 22.50 | 9.25 | 20.41 | 29.23 | 12.82 | 25.58 |
| (5) | 22.64 | 9.27 | 20.57 | 28.94 | 12.53 | 25.21 |
| (6) | 24.36 | 10.32 | 21.97 | 31.22 | 13.99 | 27.25 |

Table 8: ROUGE recall scores for system outputs of 10 and 13 characters evaluated against the 26char-ref headlines (method numbers as in Table 7).
| | R-1 | R-2 | R-L |
|---|---|---|---|
| 10char/26char | 0.867 | 0.999 | 0.867 |
| 10char/26char-trim | 0.733 | 0.600 | 0.733 |
| 13char/26char | 0.733 | 0.600 | 0.867 |
| 13char/26char-trim | 0.867 | 0.690 | 0.867 |

Table 9: Kendall's τ between the method rankings obtained with same-length references and those obtained with 26char-ref (or its trimmed prefix, 26char-trim) as the reference.
3.4 Evaluating multi-length headlines generated by methods on the JAMUL
Table 7 presents the ROUGE scores of each method on the JAMUL. Transformer + SP-token was the clear winner for all lengths and evaluation metrics on this dataset. Additionally, the three methods with SP-token outperformed the others except for R-2 and R-L at 26 characters (where Seq2Seq + LenInit was better than ConvS2S + SP-token).

What if we do not have multiple headlines of different lengths to evaluate the methods? To answer this question, we followed the evaluation setup of the previous studies on DUC 2004: the reference summaries of 75 bytes were used even when evaluating summaries of 30 and 50 bytes. Table 8 reports ROUGE scores for the system outputs of 10 and 13 characters evaluated against the 26char-ref headlines. This evaluation setup reduced the performance differences between the methods. Although Transformer + SP-token remained the clear winner, the performance of Seq2Seq + SP-token at 13 characters was now lower than that of Seq2Seq + LenInit.
Thus, we computed rank correlation coefficients (Kendall's τ) to assess the discrepancy between the rankings of the methods in Tables 7 and 8. Additionally, we computed Kendall's τ when using the first 10 or 13 characters of 26char-ref headlines as the reference (26char-trim). Table 9 reveals that the rank correlation is not perfect (lower than one) but moderate: the relative order of two methods may flip depending on the evaluation setup. This result is similar to that of Shapira et al. (2018), who examined the validity of using a single-length reference for evaluation in multi-document summarization.
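Kendall's τ between two such method rankings can be computed directly from the per-method scores, for example with SciPy; the score lists below are placeholders rather than the values reported in Tables 7-9.

```python
from scipy.stats import kendalltau

# Placeholder ROUGE-1 scores for the six methods under two evaluation
# setups (same-length references vs. 26char-ref as the reference).
scores_same_length = [0.35, 0.37, 0.38, 0.39, 0.38, 0.43]
scores_26char_ref = [0.21, 0.22, 0.22, 0.23, 0.23, 0.24]

tau, _ = kendalltau(scores_same_length, scores_26char_ref)
print(f"Kendall's tau = {tau:.3f}")
```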
| | 10 characters | | 13 characters | |
|---|---|---|---|---|
| | P | R | P | R |
| (1) | 3.80 | 10.55 | 36.80 | 10.09 |
| (2) | 4.65 | 13.70 | 36.45 | 9.02 |
| (3) | 4.48 | 13.13 | 35.87 | 10.03 |
| (4) | 4.32 | 13.44 | 35.90 | 10.53 |
| (5) | 4.26 | 13.32 | 36.18 | 10.09 |
| (6) | 4.57 | 14.59 | 39.30 | 10.84 |

Table 10: Word-level precision (P) and recall (R) of the system outputs against the words that appear in 10char-ref or 13char-ref but not in 26char-ref headlines.
4 Analysis
| | 10 characters | | 13 characters | | 26 characters | |
|---|---|---|---|---|---|---|
| | P | R | P | R | P | R |
| (1) | 34.04 | 40.74 | 40.45 | 47.71 | 45.34 | 57.59 |
| (2) | 35.23 | 44.87 | 39.12 | 48.85 | 45.18 | 60.07 |
| (3) | 36.16 | 44.94 | 39.58 | 49.11 | 45.07 | 60.05 |
| (4) | 35.81 | 45.09 | 39.37 | 49.41 | 44.03 | 59.59 |
| (5) | 35.04 | 44.40 | 39.51 | 49.60 | 45.16 | 59.72 |
| (6) | 38.15 | 48.33 | 41.71 | 52.46 | 45.31 | 60.81 |

Table 11: Word-level precision (P) and recall (R) of the system outputs with respect to extractive operations (Section 4.2).
| | 10 characters | | 13 characters | | 26 characters | |
|---|---|---|---|---|---|---|
| | P | R | P | R | P | R |
| (1) | 1.91 | 6.73 | 2.32 | 10.52 | 8.00 | 37.69 |
| (2) | 2.50 | 10.08 | 2.51 | 11.83 | 7.92 | 38.90 |
| (3) | 2.22 | 8.74 | 2.40 | 11.62 | 7.97 | 38.88 |
| (4) | 2.29 | 9.24 | 2.61 | 12.83 | 7.89 | 39.26 |
| (5) | 2.25 | 9.13 | 2.36 | 11.26 | 8.08 | 39.07 |
| (6) | 3.05 | 12.55 | 3.10 | 15.08 | 8.83 | 42.85 |

Table 12: Word-level precision (P) and recall (R) of the system outputs with respect to abstractive operations, i.e., against the words included in the reference headlines but not in the article (Section 4.2).
4.1 Performance of word selection according to output length
How well do the existing methods change their word selection depending on the output length? As shown in the first and second rows of Table 5, 10char-ref and 13char-ref headlines contain words that are not included in 26char-ref headlines. In other words, the selection of words in a generated headline should change in response to the length restriction. To examine this, we computed word-level precision and recall scores for the system outputs generated by each method, assuming that the words included in 10char-ref or 13char-ref but not in 26char-ref headlines are the “reference” summaries. For instance, the red words in Table 1 are the “reference” summaries in this experiment.
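The construction of this length-specific reference can be sketched as follows; restricting the comparison to word types (sets) is a simplifying assumption on our part, and the function names are hypothetical.

```python
def length_specific_reference(short_ref_tokens, long_ref_tokens):
    """Words in the 10char-ref or 13char-ref headline that do not
    appear in the 26char-ref headline (the 'red words' of Table 1)."""
    long_vocab = set(long_ref_tokens)
    return [w for w in short_ref_tokens if w not in long_vocab]

def recall_against(system_tokens, reference_tokens):
    """Recall of a system output against the reduced reference
    (computed over word types; a simplification)."""
    if not reference_tokens:
        return 0.0
    sys_vocab = set(system_tokens)
    return sum(w in sys_vocab for w in reference_tokens) / len(reference_tokens)
```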
We report the results in Table 10. The low recall scores indicate that none of the systems selects words tailored to the length constraint. The differences in precision among the models are also small. We infer that there is almost no difference between the existing methods in terms of word selection specialized for the length constraint.
4.2 Performance of managing extractive and abstractive tasks
In Table 2, we reported the proportions of extractive and abstractive operations in the JAMUL. Here, we analyze how well the existing methods reflect extractive and abstractive operations when generating headlines.
First, to observe extractive operations, we computed word-level precision and recall scores for the system outputs generated by each method, adopting the words that overlap between an article and its 10char-ref, 13char-ref, and 26char-ref headlines as the “reference” summaries. Table 11 reports the results. The relatively high recall scores indicate that the length-control methods succeed in managing extractive operations.
Next, we examined whether the length-control methods could perform abstractive operations. We adopted the words included in 10char-ref, 13char-ref, or 26char-ref headlines but not in the article as the “reference” summaries and computed precision and recall scores for the system outputs (Table 12). For the outputs targeting 26 characters, the recall scores of around 40% imply that each model can manage abstractive operations to some extent. In contrast, the low recall scores for the outputs targeting 10 and 13 characters reveal that none of the length-control methods performs well on abstractive operations under a severe length constraint.
4.3 How do length control mechanisms work?
Figure 2: ROUGE-1 recall scores under the three evaluation configurations (a), (b), and (c) described below.
One may suspect that a method that can control the output length would nevertheless produce similar headlines for the same news article even when targeting different lengths. To examine this suspicion, we report ROUGE-1 recall scores in Figure 2 under three configurations: (a) evaluating the first 13 characters of headlines generated to be 26 characters long against 13char-ref headlines (yellow); (b) evaluating headlines generated to be 13 characters long against 13char-ref headlines (green); and (c) the same as (a) but evaluated against the headlines generated to be 13 characters long (blue).
Setting (a) corresponds to the strategy of trimming headlines of a different length to 13 characters. This setting was worse than setting (b), in which a method tailors headlines to the desired length. However, the difference in ROUGE scores between (a) and (b) was not large, indicating that the existing methods do not drastically change the content between the 13-character and 26-character settings. This tendency was also confirmed by setting (c), which assessed how much the first 13 characters of headlines generated to be 26 characters long covered the content of those generated to be 13 characters long. These facts suggest that future work should explore methods that are not only trained on generic supervision data (print headlines) but also tuned for the desired length.
5 Related Work
Rush et al. (2015) proposed the first approach to neural abstractive summarization. They generated a headline from the first sentence of a news article in the Annotated English Gigaword corpus (Napoles et al., 2012), which contains an enormous number of pairs of headlines and articles. Since then, a number of researchers have addressed this task: for example, Chopra et al. (2016) used the encoder-decoder framework (Sutskever et al., 2014; Bahdanau et al., 2015), and Nallapati et al. (2016) incorporated additional features, such as part-of-speech tags and named entities, into the model. Suzuki and Nagata (2017) proposed word-frequency estimation to reduce the generation of repeated phrases. Zhou et al. (2017) proposed a gating mechanism (sGate) to ensure that important information is selected at each decoding step.
Unfortunately, attempts to control the output length in neural abstractive summarization have been limited. Shi et al. (2016) reported that the hidden states of recurrent neural networks in the encoder-decoder framework can implicitly model the length of output sequences. Kikuchi et al. (2016) were the first to propose controlling the output length in the encoder-decoder framework; their approach feeds embeddings for the output length into the decoder. Additionally, Fan et al. (2018) reported that output lengths can be controlled by embeddings of special tokens added to an input sequence. These two studies used DUC 2004 (Over et al., 2007), which comprises only 75-byte summaries, to evaluate outputs of multiple lengths. Liu et al. (2018) also proposed a method to control the number of output words in the ConvS2S model. However, no previous work has built a dataset for evaluating headlines of multiple lengths or reported an in-depth perspective on this task along the process of news production in the real world. On the other hand, Shapira et al. (2018) reported that a single-length reference can appropriately evaluate summaries of multiple lengths in multi-document summarization; they computed the correlation of ROUGE-1 scores between evaluations using a single-length reference and those using (gold) references of multiple lengths. Our research differs in that we examine why a strong correlation occurs and study the headline generation domain, which requires stricter keyword selection.

6 Conclusion
In this paper, we presented two new corpora: the JNC, which contains a large number of pairs of news articles and their headlines, and the JAMUL, which includes headlines of three different lengths (10, 13, and 26 characters) written by professional editors. This study is the first to analyze the characteristics of multiple headlines of different lengths and to evaluate existing approaches for length control against reference headlines composed for different lengths. We found that the Transformer model with a special length token (SP-token) outperformed the other methods on the JAMUL. Additionally, although we confirmed that references of a single (the longest) length can evaluate system outputs of multiple lengths to some extent, the existing methods cannot take into account word selection according to the length constraint. We also found it difficult to evaluate methods for controlling the output length because headlines of different lengths are written with different goals in mind and because the training data do not necessarily reflect the goal of headlines of a specific length. In future work, we plan to explore approaches for adapting a model trained on print headlines to headlines dedicated to a specific length.
References
- Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR 2015).
- Chen et al. (2018) Wenhu Chen, Guanlin Li, Shuo Ren, Shujie Liu, Zhirui Zhang, Mu Li, and Ming Zhou. 2018. Generative bridging network for neural sequence prediction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2018), pages 1706–1715.
- Chopra et al. (2016) Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2016), pages 93–98.
- Fan et al. (2018) Angela Fan, David Grangier, and Michael Auli. 2018. Controllable abstractive summarization. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation (WNMT 2018), pages 45–54.
- Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), pages 1243–1252.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), pages 770–778.
- Kikuchi et al. (2016) Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura. 2016. Controlling output length in neural encoder-decoders. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), pages 1328–1338.
- Kiyono et al. (2018) Shun Kiyono, Sho Takase, Jun Suzuki, Naoaki Okazaki, Kentaro Inui, and Masaaki Nagata. 2018. Unsupervised token-wise alignment to improve interpretation of encoder-decoder models. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 74–81.
- Klein et al. (2017) Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations (ACL 2017), pages 67–72.
- Kudo et al. (2004) Taku Kudo, Kaoru Yamamoto, and Yuji Matsumoto. 2004. Applying conditional random fields to Japanese morphological analysis. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP 2004), pages 230–237.
- Li et al. (2018) Haoran Li, Junnan Zhu, Jiajun Zhang, and Chengqing Zong. 2018. Ensure the correctness of the summary: Incorporate entailment knowledge into abstractive sentence summarization. In Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018), pages 1430–1441.
- Lin (2004) Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81.
- Liu et al. (2018) Yizhu Liu, Zhiyi Luo, and Kenny Zhu. 2018. Controlling length in abstractive summarization using a convolutional neural network. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), pages 4110–4119.
- Nallapati et al. (2016) Ramesh Nallapati, Bowen Zhou, Cícero Nogueira dos Santos, Çaglar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence–to–sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL 2016), pages 280–290.
- Napoles et al. (2012) Courtney Napoles, Matthew Gormley, and Benjamin Van Durme. 2012. Annotated Gigaword. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX 2012), pages 95–100.
- Over et al. (2007) Paul Over, Hoa Dang, and Donna Harman. 2007. DUC in context. Information Processing and Management, 43(6):1506–1520.
- Rush et al. (2015) Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), pages 379–389.
- Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), pages 1715–1725.
- Shapira et al. (2018) Ori Shapira, David Gabay, Hadar Ronen, Judit Bar-Ilan, Yael Amsterdamer, Ani Nenkova, and Ido Dagan. 2018. Evaluating multiple system summary lengths: A case study. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), pages 774–778. Association for Computational Linguistics.
- Shi et al. (2016) Xing Shi, Kevin Knight, and Deniz Yuret. 2016. Why neural translations are the right length. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), pages 2278–2282.
- Song et al. (2018) Kaiqiang Song, Lin Zhao, and Fei Liu. 2018. Structure-infused copy mechanisms for abstractive summarization. In Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018), pages 1717–1729.
- Sutskever et al. (2013) Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. 2013. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML 2013), pages 1139–1147.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS 2014), pages 3104–3112.
- Suzuki and Nagata (2017) Jun Suzuki and Masaaki Nagata. 2017. Cutting-off redundant repeating generations for neural abstractive summarization. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017), pages 291–297.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017 (NIPS 2017), pages 6000–6010.
- Wang et al. (2018) Li Wang, Junlin Yao, Yunzhe Tao, Li Zhong, Wei Liu, and Qiang Du. 2018. A reinforced topic-aware convolutional sequence-to-sequence model for abstractive text summarization. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI 2018), pages 4453–4460.
- Zhou et al. (2017) Qingyu Zhou, Nan Yang, Furu Wei, and Ming Zhou. 2017. Selective encoding for abstractive sentence summarization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), pages 1095–1104.
- Zhou et al. (2018) Qingyu Zhou, Nan Yang, Furu Wei, and Ming Zhou. 2018. Sequential copying networks. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), pages 4987–4995.