A Large-Scale Multi-Length Headline Corpus for Improving Length-Constrained Headline Generation Model Evaluation

03/28/2019 ∙ by Yuta Hitomi, et al. ∙ Tohoku University, Retrieva, The Asahi Shimbun Company

Browsing news articles on multiple devices is now possible. The lengths of news article headlines have precise upper bounds, dictated by the size of the display of the relevant device or interface. Therefore, controlling the length of headlines is essential when applying the task of headline generation to news production. However, because there is no corpus of headlines of multiple lengths for a given article, prior research on controlling output length in headline generation has not discussed whether an evaluation setting that uses a single-length reference can appropriately evaluate outputs of multiple lengths. In this paper, we introduce two corpora (JNC and JAMUL) to confirm the validity of prior experimental settings and to provide for the next step toward the goal of controlling output length in headline generation. The JNC provides common supervision data for headline generation. The JAMUL is a large-scale evaluation dataset for headlines of three different lengths composed by professional editors. We report new findings on these corpora; for example, while the longest reference summary can appropriately evaluate the existing methods for controlling output length, these methods do not adapt their selection of words to the length constraint.




1 Introduction

Article: トヨタ自動車は18日、エンジン車だけの車種を2025年ごろまでにゼロにすると発表した。…ハイブリッド車 (HV)やプラグインハイブリッド車 (PHV)、燃料電池車 (FCV)も加えた「電動車」を、すべての車種に設定する。…
  (On the 18th, Toyota announced that it will reduce the number of models offered only as engine cars to zero by about 2025. … It will offer “electric vehicles,” which include hybrid vehicles (HV), plug-in hybrid vehicles (PHV), and fuel cell vehicles (FCV), for all models. …)
Headline for print media: トヨタ、全車種に電動車 25年ごろまでに HVやFCV含め
  (All Toyota models will include electric vehicles, including HV and FCV, by about 2025.)
Multi-length headlines for digital media:
  10char-ref (9 chars): 車種に「電動車」
    (“Electric cars” for all models)
  13char-ref (13 chars): トヨタ、車種に「電動車」
    (“Electric cars” for all Toyota models)
  26char-ref (24 chars): トヨタ、エンジン車だけの車種ゼロへ 2025年ごろ
    (Toyota will reduce the number of models with only engine cars to zero by about 2025.)
Table 1: An example of four headlines for the same article that were created by professional editors. In this example, ‘電動車’ (electric cars) and ‘全’ (all) are shown in red letters and are not included in the 24-character headline. These tokens cannot be evaluated by the 24-character headline. The blue tokens are not included in the 9- and 13-character headlines. These tokens should not be included in shorter headlines.

The news media publish newspapers in print and in electronic forms. In electronic form, articles might be read on various types of devices using any application; thus, news media companies have an increasing need to produce multiple headlines for the same news article on the basis of what would be most appropriate and most compelling on an array of devices. Every device and application used for viewing articles has a strict upper bound on the number of characters allowed because of limitations in the space where the headline appears. The technology of automatic headline generation has the potential to contribute greatly to this domain, and the problems of news headline generation have motivated a wide range of research (Wang et al., 2018; Chen et al., 2018; Li et al., 2018; Song et al., 2018; Kiyono et al., 2018; Zhou et al., 2018).

Table 1 shows sample headlines in three different lengths written by professional editors of a media company for the same news article: the first headline for digital media is restricted to 10 characters, the second to 13 characters, and the third to 26 characters. From a practical perspective, headlines must be generated under a rigid length constraint. However, few studies have been performed based on this assumption.

The first study to consider the length of system outputs in the context of encoder-decoder language generation was Rush et al. (2015). This study controlled the length of an output sequence by reducing the score of the end-of-sentence token until the method generated the desired number of words. Subsequently, Kikuchi et al. (2016) and Fan et al. (2018) proposed mechanisms for length control; however, these studies produced summaries of 30, 50, and 75 bytes and evaluated them by using reference summaries of a single length (approximately 75 bytes long) in DUC 2004 (https://duc.nist.gov/duc2004/). Thus, some questions can be posed: (1) Can longer references adequately evaluate system outputs that are shorter than the reference? (2) How do the words not included in shorter references but included in longer references affect the evaluation? (3) What types of tasks are involved under each length limit? and (4) How do the existing length control methods manage those tasks? In this study, we present novel corpora to investigate these research questions. The contributions of this study are threefold.

  1. We release the Japanese News Corpus (JNC; https://cl.asahi.com/api_data/jnc-jamul-en.html), which includes 1.83 million pairs of headlines and the lead three sentences of Japanese news articles. We expect this corpus to provide common supervision data for headline generation.

  2. We build the JApanese MUlti-Length Headline Corpus (JAMUL; https://cl.asahi.com/api_data/jnc-jamul-en.html) for the evaluation of headlines of different lengths. In this novel dataset, each news article is associated with multiple headlines of three different lengths.

  3. We report new findings on the JAMUL; for example, although the longer reference seems to be able to evaluate the short system output, we also found a problem with this evaluation setting. Additionally, we clarified what type of tasks the existing method solves according to the length.


Figure 1: Length distributions of headlines in JNC (a) and JAMUL (b).

2.1 Headlines composed by a media company

Before describing the JNC and JAMUL in detail, we explain the process where a media company composes headlines for a news article. First, reporters write an article and submit it to the editorial department to be published in the newspaper. The editorial department writes a headline for the article dedicated to print media. We call these headlines print headlines or length-insensitive headlines hereafter.

In addition to print headlines, digital media editors, who are typically not the same editors as those for print, select the articles they want to distribute on digital media from those submitted for print and compose three different headlines for each. The first headline, for digital signage and audio media, has a limit of 10 characters. This type of headline is appended to the beginning of a concise summary of the article so that readers can understand the news at first glance. The second type of headline is produced for mobile phones with small LCDs and for small areas on the news site (e.g., the access ranking); the upper limit of the number of characters is 13. The third type of headline is produced for PC news websites, and the upper limit of the number of characters is 26. This limit is derived from the layout of the news site. We refer to the three types of headlines as 10char-ref, 13char-ref, and 26char-ref (see Table 1 for an example). We collectively call these headlines length-sensitive headlines.

Table 1 presents an example of headlines written for an article by the professional editors. We extract the JNC and JAMUL from the process of news production of trusted and professional sources maintained in databases with time series; therefore, they can be considered representative of contemporary editorial practice.

2.2 JNC

The JNC is a collection of 1,829,231 pairs of the three lead sentences of articles and their print headlines published from 2007 to 2016. Figure 1 (a) depicts the distribution of lengths of the headlines in the JNC. Lengths of headlines in the JNC are diverse because of various factors related to publishing newspapers (e.g., space limitation, importance of the news). Important articles tend to be assigned longer print headlines.

The JNC is useful for training headline generation models because it has many training instances. Furthermore, the corpus is suitable for training a model for variable-length headline generation because of the variety of the headline lengths.

2.3 JAMUL

The JAMUL is a corpus containing 1,524 news articles and their length-sensitive headlines of 10 characters, 13 characters, and 26 characters for digital media. All the articles and headlines were published between September 2017 and March 2018. The volume of the news articles may be insufficient for training a headline generation model. However, as Figure 1 (b) shows, the JAMUL includes length-sensitive headlines that strictly preserve the length requirements. This characteristic makes the JAMUL suitable as a test set for headline generation. There is no overlap of articles between the JNC and JAMUL.

2.4 Comparing headlines with article bodies

System    Reference        Precision   Recall
Article   Print headline   3.67        87.34
Article   10char-ref       1.47        88.77
Article   13char-ref       1.94        89.82
Article   26char-ref       3.85        90.14
Table 2: Word-level precision and recall when comparing article and length-insensitive/sensitive headlines.

What types of operations did the editors perform to create length-sensitive and length-insensitive headlines in the JAMUL? To answer this question, we analyzed the proportions of extractive and abstractive operations. Specifically, we report the word-level precision and recall scores in Table 2, assuming that articles are “system” summaries and that print, 10char-ref, 13char-ref, and 26char-ref headlines are “reference” summaries. Notably, we removed blank spaces, which were the most common token in longer headlines. The relatively high recall scores indicate that the operations most often required to generate headlines are extractive, and that abstractive operations account for roughly 10% of the total.
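The word-level scores used in this analysis can be illustrated with a minimal sketch. It assumes that the texts are already tokenized and that blank tokens have been removed, and it counts overlap with clipped word counts; the exact counting scheme behind Table 2 is not stated in the text, so this choice, the function name `word_precision_recall`, and the toy tokens are assumptions for illustration only.

```python
from collections import Counter


def word_precision_recall(system_tokens, reference_tokens):
    """Return (precision, recall) in percent, using clipped word counts."""
    system_counts = Counter(system_tokens)
    reference_counts = Counter(reference_tokens)
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((system_counts & reference_counts).values())
    precision = 100.0 * overlap / max(len(system_tokens), 1)
    recall = 100.0 * overlap / max(len(reference_tokens), 1)
    return precision, recall


# Toy, hand-tokenized example (hypothetical, not taken from the corpus).
article = ["トヨタ", "は", "電動車", "を", "すべて", "の", "車種", "に", "設定", "する"]
headline = ["トヨタ", "車種", "に", "電動車"]
precision, recall = word_precision_recall(article, headline)
```

Because an article is much longer than a headline, precision is low by construction, whereas recall shows how many headline words can be copied from the article.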

2.5 Comparing length-sensitive headlines with print headlines

System           Reference    Precision   Recall
Print headline   10char-ref   24.66       64.11
Print headline   13char-ref   33.24       66.30
Print headline   26char-ref   56.36       55.75
Table 3: Word-level precision and recall when comparing length-insensitive/sensitive headlines.
Article: 米フェイスブック(FB)は1日、2017年7~9月期決算を発表し、モバイル広告の伸びなどで売上高、純利益ともに四半期として過去最高を記録した。…FB上で偽ニュースの拡散防止など、安全確保のための要員を約2万人に倍増させることを明らかにしている。
  (On the 1st, U.S.-based Facebook (FB) announced its financial results for July to September 2017 and achieved record quarterly sales and net income thanks to growth in mobile advertising and other factors. … FB has revealed that it will double the number of personnel engaged in ensuring safety on FB, such as preventing the spread of fake news, to about 20,000.)
Headline for print media: フェイスブック、四半期で最高益 モバイル広告好調
  (Facebook achieved a record quarterly profit thanks to its mobile advertising business.)
Multi-length headlines for digital media:
  10char-ref (7 chars): 米FBが最高益
    (U.S. FB achieved a record profit.)
  13char-ref (12 chars): 米フェイスブックが最高益
    (U.S. Facebook achieved a record profit.)
  26char-ref (26 chars): 米フェイスブックが最高益 偽ニュース対策で要員倍増も
    (U.S. Facebook achieved a record profit and will also double its personnel for countermeasures against fake news.)
Table 4: A typical example for comparing length-insensitive/sensitive headlines.

How similar are the headlines used for training (length-insensitive) and for evaluation (length-sensitive)? We estimated the appropriateness of length-insensitive headlines as a “seed” for producing length-sensitive headlines. More concretely, we report word-level precision and recall scores in Table 3, assuming that length-insensitive headlines are “system” summaries and that 10char-ref, 13char-ref and 26char-ref headlines are “reference” summaries. The relatively high recall scores indicate that the training and evaluation data are not so distant. Additionally, we found that the editors use a moderate number of words that do not appear in print headlines when composing length-sensitive headlines. Table 4 is an example of the typical differences between the length-insensitive and length-sensitive headlines. Comparing the 26-character headline with the print headline, the choices of content differ; for example, while the print headline reports the reason for the record profit, the 26-character headline describes the increase in the number of personnel. Next, comparing the 7-character (10char-ref) headline with the print headline, we observe that the choices of words differ; the print headline uses “Facebook”, which is changed to “FB” in the 7-character headline.

2.6 Comparing length-sensitive headlines

System                         Reference    P       R
26char-ref                     10char-ref   28.77   78.55
26char-ref                     13char-ref   42.53   88.75
First 10 chars in 26char-ref   10char-ref   38.40   41.65
First 13 chars in 26char-ref   13char-ref   60.31   65.58
Last 10 chars in 26char-ref    10char-ref   14.55   17.05
Last 13 chars in 26char-ref    13char-ref   23.13   26.56
Table 5: Difference between 26char-ref headlines and shorter headlines. P and R denote precision and recall, respectively.

How similar is the composition of headlines for a news article of different lengths? How good are 26char-ref headlines as “seeds” for generating 10char-ref or 13char-ref headlines? Is the simple strategy of trimming 26char-ref headlines to 10 or 13 characters sufficient? To answer these questions, we computed word-level precision and recall scores, assuming that 26char-ref headlines are “system” summaries and that 10char-ref and 13char-ref headlines are “reference” summaries.

The first and second rows of Table 5 represent the situation when we used 26char-ref headlines as they are and without preserving the length constraint. Although this setting was unrealistic, we could estimate the upper bound when we composed a shorter headline from a 26char-ref. The high recall scores indicate that 26char-ref headlines mostly cover the words included in 10char-ref and 13char-ref headlines. The third and fourth rows of Table 5 correspond to the strategy where we generated headlines in 10 and 13 characters from the first 10 and 13 characters of 26char-ref headlines. This strategy achieved moderate success for generating headlines in 13 characters but did not work well for headlines in 10 characters. In other words, we observed large differences between 10char-ref and 26char-ref headlines. The fifth and sixth rows of Table 5 correspond to the strategy where we generated headlines in 10 and 13 characters from the last 10 and 13 characters of 26char-ref headlines. Extracting the latter part of a 26char-ref headline was probably not a good idea because the precision and recall scores were much worse than those for the first 10 and 13 characters. On the other hand, these results also indicate that the words included in 10char-ref and 13char-ref are observed in the latter part of 26char-ref.

In sum, we found similarities in headlines of different lengths in the JAMUL. However, the simple strategy of trimming a longer headline into a shorter headline is insufficient (except for shrinking 26char-ref headlines into 13char-ref headlines). Table 1 is an example of the typical differences among length-sensitive headlines. There is little overlap between the longer and shorter headlines because the 9- and 13-character headlines use shorter phrases that have nearly the same meaning as the 24-character headline. Focusing on “車種” (models), the word is in the latter half of the 24-character headline, which confirms that important keywords are not always included at the beginning of a headline.
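The two trimming strategies compared in Table 5 can be sketched as simple character slicing. The helper names `trim_first` and `trim_last` are hypothetical, and counting the blank space as one character is an assumption; the headline string is the 26char-ref from Table 4.

```python
def trim_first(headline: str, limit: int) -> str:
    """Keep only the first `limit` characters of the headline."""
    return headline[:limit]


def trim_last(headline: str, limit: int) -> str:
    """Keep only the last `limit` characters of the headline."""
    return headline[-limit:] if limit > 0 else ""


# 26char-ref headline from Table 4 (the space counts as one character here).
long_ref = "米フェイスブックが最高益 偽ニュース対策で要員倍増も"

first_13 = trim_first(long_ref, 13).strip()  # drop the trailing space
last_10 = trim_last(long_ref, 10)
```

In this particular example, the first 13 characters happen to coincide with the 13char-ref headline in Table 4, whereas the last 10 characters start in the middle of a word, which illustrates why trimming from the end tends to score poorly.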

3 Comparing headline generation methods on JAMUL

In this section, we explore a question about evaluation: how reliable is the conventional evaluation method using a single length summary for measuring the quality of summaries of different lengths? In order to answer this question, we generate multiple summaries of different lengths by using the existing methods, and measure the correlation between the performance values computed by the conventional evaluation method and those computed on JAMUL.

3.1 Headline generation methods with the mechanism to control output length

In this study, we explored four methods for headline generation that can control the output length. The first two methods, LenEmb and LenInit, were proposed by Kikuchi et al. (2016).

LenEmb provides the decoder with output length information in the form of the length embedding. LenInit controls the output length by multiplying the initial state of the decoder’s memory cell by the desired length.

Fan et al. (2018) also proposed a length-controllable method for a convolutional sequence-to-sequence (ConvS2S) model (Gehring et al., 2017). Their method added special tokens indicating the range of the output length at the beginning of an input sequence. In our experiment, we used a special token to specify an output length and called this method SP-token. (Fan et al. (2018) also included special tokens for entities, but we did not use them in the experiments.)
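The SP-token idea amounts to a preprocessing step on the input sequence, which can be sketched as follows. The token format `<len13>` and the helper names are hypothetical; the text does not specify how the special tokens are spelled or added to the vocabulary.

```python
def length_token(desired_length: int) -> str:
    """Build a special token that encodes the desired output length."""
    # Hypothetical spelling; any reserved symbol outside the normal vocabulary works.
    return f"<len{desired_length}>"


def add_length_token(source_tokens, desired_length: int):
    """Prepend the length token to the (already tokenized) input sequence."""
    return [length_token(desired_length)] + list(source_tokens)


# Toy input: the same article tokens paired with two different target lengths.
source = ["トヨタ", "は", "電動車", "を", "設定", "する"]
input_for_10 = add_length_token(source, 10)
input_for_26 = add_length_token(source, 26)
```

Because the length information lives entirely in the input, the same preprocessing can be paired with any sequence-to-sequence architecture without changing the model itself.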

We also considered the LC method (Liu et al., 2018), which extends ConvS2S and multiplies the initial state of the residual connection (He et al., 2016) by the desired number of output tokens. In the experiment, we set the desired number of characters instead of that of tokens.

In addition to the four methods mentioned above, we applied SP-token not only to ConvS2S but also to Seq2Seq and Transformer (Vaswani et al., 2017). In total, we examined six combinations: (1) Seq2Seq LenEmb, (2) Seq2Seq LenInit, (3) Seq2Seq SP-token, (4) ConvS2S SP-token, (5) ConvS2S LC, and (6) Transformer SP-token.

3.2 Datasets and evaluation protocol

We trained the six methods for headline generation on the JNC. We removed instances that were duplicated or unsuitable for training a headline generation model (a filtering script is available). The filtering step yielded 1,568,360 pairs of newspaper articles and headlines. We randomly selected 1% of the instances (15,546 pairs) as a validation set and used the remainder (1,523,469 pairs) as a training set. We used Byte Pair Encoding (BPE; https://github.com/rsennrich/subword-nmt) (Sennrich et al., 2016) for tokenization. We set the number of merge operations to 8,000 and pretokenized all the data with MeCab. Finally, we obtained a vocabulary of 11,257 tokens for both sides. When training a model, we gave the model the length of each reference headline as the desired length. When generating headlines in the evaluation, we set the output lengths to 10, 13 and 26 characters; each output was evaluated against the reference of the same length in the JAMUL. We evaluated all models by using three variants of the ROUGE (Lin, 2004) recall metric: ROUGE-1, ROUGE-2, and ROUGE-L. We used MeCab (Kudo et al., 2004) to tokenize the system outputs. Headlines exceeding the length limits were trimmed for the fairness of the evaluation.
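The evaluation protocol described above (trim to the length limit, then score by n-gram recall) can be illustrated with a minimal sketch. It is not the ROUGE toolkit itself: the functions `enforce_limit` and `rouge_n_recall` are hypothetical names, the inputs are assumed to be tokenized beforehand (MeCab is not called here), and ROUGE-L is omitted.

```python
from collections import Counter


def enforce_limit(headline: str, limit: int) -> str:
    """Trim a generated headline that exceeds the character limit."""
    return headline[:limit]


def ngrams(tokens, n):
    """List all contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n_recall(system_tokens, reference_tokens, n):
    """Recall of reference n-grams covered by the system output (clipped counts)."""
    reference_counts = Counter(ngrams(reference_tokens, n))
    if not reference_counts:
        return 0.0
    system_counts = Counter(ngrams(system_tokens, n))
    overlap = sum((system_counts & reference_counts).values())
    return overlap / sum(reference_counts.values())


# Toy, hand-tokenized example (hypothetical, not taken from the corpus).
system = ["トヨタ", "全", "車種", "に", "電動車"]
reference = ["トヨタ", "車種", "に", "電動車"]
r1 = rouge_n_recall(system, reference, 1)
r2 = rouge_n_recall(system, reference, 2)
trimmed = enforce_limit("トヨタ、全車種に電動車を設定", 10)
```

In this toy case every reference word is covered, so the unigram recall is 1.0, while only two of the three reference bigrams survive the inserted word.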

3.3 Implementation

We employed OpenNMT (https://github.com/OpenNMT/OpenNMT-py) (Klein et al., 2017) for Seq2Seq and fairseq (https://github.com/pytorch/fairseq) for ConvS2S and Transformer. We extended the implementations to realize LenEmb, LenInit and LC. We set the dimensions for token and length embeddings to 512, those for hidden states to 512, and the beam width to 5. These parameters are common to all the models; Table 6 summarizes other parameters specific to each sequence-to-sequence model. We used Nesterov’s accelerated gradient method (NAG) (Sutskever et al., 2013) with a momentum of 0.99 in ConvS2S. In Transformer, we set the number of attention heads to 8, the dimensions for the feed-forward network to 2048, Adam’s β2 to 0.98, warm-up steps to 4000, and label smoothing to 0.1.

                Seq2Seq       ConvS2S       Transformer
Num of Layers   2             8             6
Dropout Rate    0.3           0.1           0.3
Grad Clipping   [-5.0, 5.0]   [-0.1, 0.1]   -
Learning Rate   0.001         0.2           0.001
Optimizer       Adam          NAG           Adam
Table 6: Parameters of each encoder-decoder model.
                             10 characters          13 characters          26 characters
Models                       R-1    R-2    R-L      R-1    R-2    R-L      R-1    R-2    R-L
(1) Seq2Seq + LenEmb         34.66  15.29  33.56    40.66  19.40  38.23    44.82  20.75  36.62
(2) Seq2Seq + LenInit        36.50  16.75  35.54    41.40  19.49  38.83    46.77  22.06  38.29
(3) Seq2Seq + SP-token       38.09  17.43  36.67    42.51  19.79  39.76    47.33  22.12  38.59
(4) ConvS2S + SP-token       38.90  17.84  37.53    43.32  20.35  40.31    47.10  21.51  37.86
(5) ConvS2S + LC             37.71  16.89  36.50    42.60  20.11  39.97    45.76  21.93  37.91
(6) Transformer + SP-token   42.85  19.84  41.40    46.92  22.85  44.09    51.57  24.52  41.05
Table 7: ROUGE scores of each model on the JAMUL. Specified lengths are 10, 13 and 26 characters. R-1, R-2, and R-L represent ROUGE-1, ROUGE-2, and ROUGE-L, respectively. Note that (1) to (6) in Table 8 and subsequent tables refer to models (1) to (6) in this table.
       10 characters          13 characters
       R-1    R-2    R-L      R-1    R-2    R-L
(1)    21.22  8.56   19.29    27.42  12.20  24.23
(2)    21.72  9.03   19.70    28.34  12.83  25.01
(3)    22.02  9.08   20.03    28.26  12.09  24.73
(4)    22.50  9.25   20.41    29.23  12.82  25.58
(5)    22.64  9.27   20.57    28.94  12.53  25.21
(6)    24.36  10.32  21.97    31.22  13.99  27.25
Table 8: ROUGE scores of the system outputs in 10 and 13 characters evaluated by 26char-ref headlines as the references.
                     R-1     R-2     R-L
10char/26char        0.867   0.999   0.867
10char/26char-trim   0.733   0.600   0.733
13char/26char        0.733   0.600   0.867
13char/26char-trim   0.867   0.690   0.867
Table 9: Kendall’s rank correlation coefficients (τ) between Table 7 and Table 8. The part left of the slash represents the specified output length, and the part right of the slash represents the reference headlines.

3.4 Evaluating multi-length headlines generated by methods on the JAMUL

Table 7 presents ROUGE scores of each method on the JAMUL. Transformer + SP-token was the clear winner in all lengths and evaluation metrics on this dataset. Additionally, the three methods with SP-token outperformed the others except for R-2 and R-L on 26 characters (Seq2Seq + LenInit was better than ConvS2S + SP-token).

What if we do not have multiple headlines of different lengths to evaluate the methods? To answer this question, we followed the evaluation setup of the previous studies on DUC 2004: the reference summaries of 75 bytes were used even when evaluating summaries of 30 and 50 bytes. Table 8 reports ROUGE scores for the system outputs in 10 and 13 characters evaluated based on the 26char-ref headlines. This evaluation setup reduced the performance differences between the methods. Although Transformer + SP-token remained the clear winner, the performance of Seq2Seq + SP-token in 13 characters was now lower than that of Seq2Seq + LenInit.

Thus, we computed rank correlation coefficients (Kendall’s τ) to assess the discrepancy in the ranking among the methods presented by Tables 7 and 8. Additionally, we computed Kendall’s τ when using the first 10 or 13 characters in 26char-ref headlines as a reference (26char-trim). Table 9 reveals that the rank correlation is not perfect (lower than one) but moderate: the order of the scores of two methods may flip depending on the evaluation setup. This result is similar to that of Shapira et al. (2018), who examined the validity of an evaluation setting that uses a single-length reference in multi-document summarization.
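For reference, Kendall’s τ between two score lists can be computed by counting concordant and discordant pairs, as in the sketch below. This is the tie-free variant (τ-a); whether the values in Table 9 were computed with this variant is an assumption, and the scores in the example are hypothetical rather than taken from Tables 7 and 8.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        product = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical scores of six systems under two evaluation setups.
same_length_refs = [30.0, 31.0, 32.0, 33.0, 34.0, 35.0]
single_length_ref = [20.0, 21.0, 23.0, 22.0, 24.0, 25.0]  # systems 3 and 4 swap
tau = kendall_tau(same_length_refs, single_length_ref)
```

With six systems there are 15 pairs, so a single swapped pair yields τ = 13/15 ≈ 0.867 and two swapped pairs yield 11/15 ≈ 0.733; each flip in the ranking lowers τ by 2/15.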

       10 characters    13 characters
       P      R         P      R
(1)    3.80   10.55     36.80  10.09
(2)    4.65   13.70     36.45  9.02
(3)    4.48   13.13     35.87  10.03
(4)    4.32   13.44     35.90  10.53
(5)    4.26   13.32     36.18  10.09
(6)    4.57   14.59     39.30  10.84
Table 10: Word-level precision (P) and recall (R) scores when comparing system outputs and the groups of the words included in 10char-ref or 13char-ref but not included in 26char-ref.

4 Analysis

       10 characters    13 characters    26 characters
       P      R         P      R         P      R
(1)    34.04  40.74     40.45  47.71     45.34  57.59
(2)    35.23  44.87     39.12  48.85     45.18  60.07
(3)    36.16  44.94     39.58  49.11     45.07  60.05
(4)    35.81  45.09     39.37  49.41     44.03  59.59
(5)    35.04  44.40     39.51  49.60     45.16  59.72
(6)    38.15  48.33     41.71  52.46     45.31  60.81
Table 11: Word-level precision (P) and recall (R) scores comparing the system outputs and the overlap words between the words in an article and the words in 10char-ref, 13char-ref, and 26char-ref.
       10 characters    13 characters    26 characters
       P      R         P      R         P      R
(1)    1.91   6.73      2.32   10.52     8.00   37.69
(2)    2.50   10.08     2.51   11.83     7.92   38.90
(3)    2.22   8.74      2.40   11.62     7.97   38.88
(4)    2.29   9.24      2.61   12.83     7.89   39.26
(5)    2.25   9.13      2.36   11.26     8.08   39.07
(6)    3.05   12.55     3.10   15.08     8.83   42.85
Table 12: Word-level precision (P) and recall (R) scores comparing the system outputs and the words included in 10char-ref, 13char-ref, or 26char-ref but not included in an article.

4.1 Performance of word selection according to output length

How well do the existing methods change the word selection depending on the output length? As shown in the first and second rows of Table 5, 10char-ref and 13char-ref headlines contain words that are not included in 26char-ref headlines. In other words, the selection of words in the generated headline should change in response to the length restriction. To answer this question, we computed word-level precision and recall scores for the system outputs generated by each method, using the groups of words included in 10char-ref or 13char-ref but not in 26char-ref headlines as the “reference” summaries. For instance, the red words in Table 1 are the “reference” summaries in this experiment.

We report this result in Table 10. The low recall score indicates that each system cannot select the words tailored to the length constraints. The difference of the precision scores between the models is small. We infer that there is almost no difference between the existing methods in terms of the word selection specialized for the length constraint.
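The construction of these length-specific “references” can be sketched as a set difference followed by a recall computation. The function names and the hand-tokenized example are hypothetical, and treating the system output as a set of words is a simplifying assumption.

```python
def length_specific_words(short_ref_tokens, long_ref_tokens):
    """Words that appear in the shorter reference but not in the longer one."""
    long_vocab = set(long_ref_tokens)
    return [token for token in short_ref_tokens if token not in long_vocab]


def recall_on_targets(system_tokens, target_tokens):
    """Fraction of target words that the system output contains."""
    if not target_tokens:
        return 0.0
    system_vocab = set(system_tokens)
    hits = sum(1 for token in target_tokens if token in system_vocab)
    return hits / len(target_tokens)


# Toy, hand-tokenized references and system output (hypothetical).
short_ref = ["トヨタ", "車種", "に", "電動車"]
long_ref = ["トヨタ", "エンジン車", "だけ", "の", "車種", "ゼロ", "へ"]
system_output = ["トヨタ", "車種", "電動車"]

targets = length_specific_words(short_ref, long_ref)
recall = recall_on_targets(system_output, targets)
```

A system that simply shortens its long headline would rarely produce such target words, which is the behavior the low recall scores point to.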

4.2 Performance of managing extractive and abstractive tasks

In Table 2, we reported the proportion of the number of extractive and abstractive operations in JAMUL. We analyze how the existing methods can reflect extractive and abstractive operations in generating summaries.

First, to observe extractive operations, we computed word-level precision and recall scores for the outputs generated by each system, using as “reference” summaries the words that appear both in an article and in the 10char-ref, 13char-ref, or 26char-ref headline. Table 11 reports the result. The relatively high recall scores indicate that the length control methods succeed in managing extractive operations.

Next, we examined whether the length control methods could perform abstractive operations. We adopted the words included in 10char-ref, 13char-ref, or 26char-ref headlines but not included in an article as “reference” summaries, and computed the precision and recall scores for the system outputs (Table 12). For the outputs targeting 26 characters, the recall scores of around 40% imply that each model can manage abstractive operations to some extent. In contrast, the low recall scores for the outputs targeting 10 and 13 characters reveal that none of the length control methods performed well on abstractive operations under the severe length constraint.

4.3 How do length control mechanisms work?

Figure 2: Recall-oriented ROUGE-1 scores to assess the similarity of headlines generated for different lengths.

We suspected that a method that can control the output length might produce similar headlines for the same news article even when the specified lengths differ. To examine this suspicion, we report ROUGE-1 recall scores in Figure 2 with three different configurations: (a) evaluating the first 13 characters of headlines generated to be 26 characters long on 13char-ref headlines (yellow); (b) evaluating headlines generated to be 13 characters long on 13char-ref headlines (green); and (c) the same as (a) but evaluated on the headlines generated to be 13 characters long (blue).

Setting (a) corresponds to the strategy where we trimmed headlines of different lengths to 13 characters long. This setting was worse than setting (b), where a method tailored headlines to the desired length. However, the difference in ROUGE scores between (a) and (b) was not so large, indicating that the existing methods do not drastically change the content for 13 characters long and 26 characters long. This tendency was also verified by setting (c), which assessed how much the first 13 characters of headlines generated to be 26 characters long covered the content of those generated to be 13 characters long. These facts suggest that we should explore a method not only trained by generic supervision data (print headlines) but also tuned for the desired length in further research.

5 Related Work

Rush et al. (2015) created the first approach to neural abstractive summarization. They generated a headline from the first sentence of a news article in the Annotated English Gigaword corpus (Napoles et al., 2012), which contains an enormous number of pairs of headlines and articles. After their study, a number of researchers addressed this task: for example, Chopra et al. (2016) used the encoder-decoder framework (Sutskever et al., 2014; Bahdanau et al., 2015) and Nallapati et al. (2016) incorporated additional features into the model, such as part-of-speech tags and named entities. Suzuki and Nagata (2017) proposed word-frequency estimation to reduce the generation of repeated phrases. Zhou et al. (2017) proposed a gating mechanism (sGate) to ensure that important information is selected at each decoding step.

Unfortunately, attempts to control the output length in neural abstractive summarization have been limited. Shi et al. (2016) reported that hidden states in recurrent neural networks in the encoder-decoder framework could implicitly model the length of output sequences. Kikuchi et al. (2016) were the first to propose the idea of controlling the output length in the encoder-decoder framework. Their approach inserts embeddings for the output length into the decoder. Additionally, Fan et al. (2018) reported that output lengths could be controlled by embeddings of special tokens given to an input sequence. These two studies used DUC 2004 (Over et al., 2007), which comprises only 75-byte summaries, to evaluate the outputs in multiple lengths. Liu et al. (2018) also proposed a method to control the number of output words in the ConvS2S model. However, no previous work built a dataset for evaluating headlines of multiple lengths or reported an in-depth perspective on this task in line with the process of news production in the real world. On the other hand, Shapira et al. (2018) reported that a single-length reference could appropriately evaluate summaries of multiple lengths in multi-document summarization. In that study, they confirmed the correlation between ROUGE-1 scores computed with a single-length reference and those computed with multiple (gold) length references. Our research differs in that we examine why a strong correlation occurs and study the headline generation domain, which requires stricter keyword selection.

6 Conclusion

In this paper, we presented two new corpora: the JNC contains a large number of pairs of news articles and their headlines, and the JAMUL includes headlines of three different lengths (10, 13, and 26 characters long) written by professional editors. This study is the first to analyze the characteristics of multiple headlines of different lengths and to evaluate existing approaches for length control based on the reference headlines composed for different lengths. We found that the Transformer model with a special length token (SP-token) outperformed the other methods on the JAMUL. Additionally, while we confirmed that single-length (the longest) references could adequately evaluate system outputs of multiple lengths, the existing methods cannot adapt their word selection to the length constraint. We also found it difficult to evaluate methods for controlling output length because headlines of different lengths are written based on different goals and because the training data does not necessarily reflect the goal of the headlines of a specific length. In the future, we plan to explore an approach to adapt a model trained on print headlines to headlines dedicated to a specific length.