Log In Sign Up

A Character-Level Length-Control Algorithm for Non-Autoregressive Sentence Summarization

Sentence summarization aims at compressing a long sentence into a short one that keeps the main gist, and has extensive real-world applications such as headline generation. In previous work, researchers have developed various approaches to improve the ROUGE score, which is the main evaluation metric for summarization, whereas controlling the summary length has not drawn much attention. In our work, we address a new problem of explicit character-level length control for summarization, and propose a dynamic programming algorithm based on the Connectionist Temporal Classification (CTC) model. Results show that our approach not only achieves higher ROUGE scores but also yields more complete sentences.


page 1

page 2

page 3

page 4


Learning Non-Autoregressive Models from Search for Unsupervised Sentence Summarization

Text summarization aims to generate a short summary for an input text. I...

Discrete Optimization for Unsupervised Sentence Summarization with Word-Level Extraction

Automatic sentence summarization produces a shorter version of a sentenc...

Experiments in Extractive Summarization: Integer Linear Programming, Term/Sentence Scoring, and Title-driven Models

In this paper, we revisit the challenging problem of unsupervised single...

Ontology-based Graph Visualization for Summarized View

Data summarization that presents a small subset of a dataset to users ha...

A Unified Model for Extractive and Abstractive Summarization using Inconsistency Loss

We propose a unified model combining the strength of extractive and abst...

A Two-Phase Approach for Abstractive Podcast Summarization

Podcast summarization is different from summarization of other data form...

Faithful or Extractive? On Mitigating the Faithfulness-Abstractiveness Trade-off in Abstractive Summarization

Despite recent progress in abstractive summarization, systems still suff...

1 Introduction

Sentence summarization aims at compressing a long sentence into a short one that keeps the main gist of the input. It is an established setting of summarization, and has extensive real-world applications such as headline generation [nenkova2011automatic, Rush_2015, filippova-etal-2015-sentence]. In previous work, researchers have developed various approaches to improve the ROUGE score, which is the main evaluation metric for summarization [lin2004rouge], whereas controlling the summary length has not drawn much attention.

Recently, Liu et al. [liu-etal-2018-controlling] show that length control is the key to summarization, since it is typically required by real-world applications. Moreover, the ROUGE score is found to be sensitive to the summary length [schumann-etal-2020-discrete, saito2020length], and summarization systems can achieve higher scores by simply generating longer output. In previous work, a typical setting of length control is to set a word budget for summarization [schumann-etal-2020-discrete, liu-etal-2022-learning].

In this paper, we address explicit length control in the character level for summarization, which is a different setting from previous work. In other words, we constrain the summary length by the number of characters, such as letters, punctuation marks, and spaces. We observe that this is a more realistic setting in real-world applications than word-level length control. For example, the headline shown in a mobile app or web page is constrained by the screen width (roughly speaking, the number of characters), rather than the number of words.

We further observe that controlling the summary length by characters cannot be easily addressed by previous approaches. For example, truncating is able to explicitly control the length, but the resulting summary is incomplete; Takase and Okazaki [takase-okazaki-2019-positional] feed length embeddings into the model as input, but such an approach cannot control the summary length in an explicit manner; and Schumann et al. [schumann-etal-2020-discrete] perform constrained discrete optimization by selecting a certain number of words from the source text as the output, but their generated summaries may vary to a large extent in terms of the number of characters.

To this end, we propose NACC, a Non-Autoregressive summarization model with Character-level length Control. We adopt a non-autoregressive approach because it generates all tokens in parallel and is much faster than autoregressive models. Moreover, we observe that non-autoregressive models predict the output words independently. Such predicted probabilities are thus local, which provides us with the unique opportunity to design a dynamic programming algorithm to constrain the summary length. Specifically, we formulate length control as a knapsack-like problem, where the weight is the number of characters in a token and the value is the predicted probability of the token. In this way, we are able to explicitly control the summary length at the character level, while retaining the completeness of the output text.

We evaluated our NACC model on the Gigaword headline generation 

[graff2003english] and DUC2004 [duc2004] datasets in both supervised and unsupervised settings. In the latter setting, NACC learns from the pseudo-reference given by an unsupervised word-extraction method based on discrete search [schumann-etal-2020-discrete]. Experiments show that NACC establishes the state-of-the-art performance of non-autoregressive summarization under various target lengths in both supervised and unsupervised settings; NACC even outperforms autoregressive Transformers [attentionisallyouneed] in the unsupervised setting, where the input and output have stronger correspondence. These all confirm the effectiveness of our length-control algorithm. Regarding inference efficiency, we show that non-autoregressive models without length control are 10 times faster than autoregressive ones; even with our length-control dynamic programming, NACC is still several times more efficient. Further, our NACC is capable of length-transfer generation, i.e., generating summaries of different lengths from the training targets.

2 Related Work

Non-Autoregressive Generation. Non-autoregressive (NAR) models [gu2017non] predict target tokens in parallel and thus enjoy a much higher inference efficiency than autoregressive (AR) ones, which generate output tokens sequentially. NAR generation is more difficult than AR, because there lacks dependencies among the outputs. Previous work addresses the dependency issue by iterative refinement [leedeterministic, miao2019cgmh, chi-etal-2021-align] or structured decoding with the conditional random field (CRF) [sun2019fast, sunon, deng2020cascaded].

The Connectionist Temporal Classification (CTC) algorithm [10.1145/1143844.1143891] addresses a common problem in NAR generation, namely, token repetition, by merging consecutive identical tokens (unless separated by an empty token). Our work follows Liu et al. [liu-etal-2022-learning], who adopt CTC for NAR summarization, allowing empty tokens scattering over the entire sentence to generate a short summary.

NAR models are traditionally thought to generate worse-quality text than AR models. However, we have the unique insight that NAR also brings new opportunities for designing length-control algorithms: since the model predicts output independently, the decoding problem can be divided into shared sub-problems for dynamic programming.

Summarization Systems. Summarization systems can be generally categorized into two types: extractive and abstractive. Extractive methods output a summary by extracting important sentences or clauses from the source text [dong-etal-2018-banditsum, mihalcea2004textrank, kaageback2014extractive], while abstractive methods generate summaries with new expressions [nallapati2016abstractive, paulus2018deep, gehrmann2018bottom].

Depending on the availability of training data, we may categorize summarization systems into supervised and unsupervised ones. Supervised methods typically follow a sequence-to-sequence (seq2seq) paradigm [zhang2020pegasus, liurefsum]. Recently, unsupervised summarization is drawing increasing attention because it does not require parallel training data. Yang et al. [yang2020ted] use the lead approach (selecting the first several words or sentences) as pseudo-groundtruth for seq2seq training. Alternatively, cycle consistency is adopted as the training objective for unsupervised summarizaiton [wang-lee-2018-learning, baziotis-etal-2019-seq]. Schumann et al. [schumann-etal-2020-discrete]

generate a summary by maximizing a heuristically defined scoring function (involving fluency and semantic similarity) with word-level extraction. We consider both settings in our work.

Controlling the output length is crucial for deploying summarization systems to real-world applications. In early extractive summarization research, truncating is adopted for fair comparison 

[mihalcea2004textrank, murray2005extractive, kaageback2014extractive], but the resulting summary may not be complete sentences. As pointed out by [saito2020length, schumann-etal-2020-discrete], however, the length-control problem is not adequately addressed for abstractive summarization in the neural era, probably because researchers do not want to hurt the completeness. For example, length information [liu-etal-2018-controlling, takase-okazaki-2019-positional] and soft penalty [sunon] are adopted to encourage short summaries, but they cannot control the length in an explicit way. Very recently, Schumann et al. [schumann-etal-2020-discrete] address the length-control problem by extracting a certain number of source tokens as the summary. Liu et al. [liu-etal-2022-learning] design a CTC-based algorithm that controls the output words explicitly. However, these approaches cannot be applied to the character level.

Our work follows Liu et al. [liu-etal-2022-learning] and adopts CTC as the training objective. However, we propose a character-level length-control algorithm that can explicitly constrain the number of characters within a given budget, which is different from all previous work.

3 Approach

In this section, we first introduce the CTC-trained NAR model for summarization (§3.1). Then, we propose a dynamic programming algorithm to explicitly control the number of characters (§3.2).

3.1 A Non-Autoregressive Model for Summarization

In the summarization task, the goal is to compress a source text into a shorter text , while preserving the key information.

Encoder-Only Architecture. Our work follows recent non-autoregressive summarization methods [sunon, liu-etal-2022-learning], utilizing the encoder-only Transformer [attentionisallyouneed] as the base model (Figure 1a). It is essentially a stack of Transformer layers, the last of which predicts all words simultaneously; thus, it has a much higher inference efficiency than autoregressive models.

Formally, let be the contextual representation of the input at the th Transformer layer, where is the number of tokens in and is the dimension. Specially, is the word and positional embeddings of input tokens. Each layer is a Transformer block [attentionisallyouneed] based on its predecessor, which can be denoted by , for .

At the last layer , a function is applied to predict the probability at every prediction slot independently, given by for input slot , where

is a column vector transposing the

th row of matrix , and is a weight matrix.

Such an encoder-only architecture is able to capture the strong correspondence between the input and output in the summarization task and outperforms encoder–decoder non-autoregressive models, shown by Liu et al. [liu-etal-2022-learning].

CTC Training.

Training the above encoder-only architecture requires padding the target with empty tokens 

; otherwise, the output will have the same length as the input, and cannot be a summary. Gu et al. [gu2017non] pad s at the end of the target, but this hinders the input–output correspondence.

Instead, we follow Liu et al. [liu-etal-2022-learning] and train our model with the Connectionist Temporal Classification (CTC) [10.1145/1143844.1143891], allowing s to scatter over the entire target summary as appropriate. During inference, the CTC approach removes s to generate a shorter text as the summary. In addition, CTC merges consecutive identical tokens that are not separated by , because non-autoregressive generation suffers from the word-repetition problem [gu2017non, saharia-etal-2020-non]. We denote the CTC operation as , for example, .

For training, CTC maximizes the marginal likelihood of token sequences (including ) that can be reduced to the target text :


where is the probability of generating a token sequence , and is given by the non-autoregressive model. Although a brute force summation in Eqn. (1) is intractable, it can be efficiently computed by dynamic programming. We refer interested readers to Alex et al. [10.1145/1143844.1143891] for the details of the above marginalization in CTC training.

3.2 Proposed Algorithm for Character-Level Length-Control Decoding

Figure 1: Overview of our NACC approach. Dashed yellow arrows refer to transitions that do not increase the summary length, while solid blue arrows refer to the increase of length. Thick arrows and blocks refer to the selected path by CTC. Due to the space limit, is omitted in the predicted sentence, and we use to denote a sequence of s. The number demonstrates the value (i.e., log-probability) of a word.

One of our main contributions is to propose a dynamic programming (DP) algorithm for character-level length control. mentioned in §1, controlling the number of characters in the output is important for summarization, since real-world applications usually involve character-level length constraints.

Our insight is that a non-autoregressive model predicts probabilities independently, so the length-control problem can be divided into shared sub-problems, making DP algorithms feasible. We formulate character-level length control as a knapsack-like problem. We treat the number of characters in a word (plus one) as the weight,222For the purposes of research, we assume every word is appended with another character, namely, a white space. In real applications, our algorithm can be applied with any measure of length, such as the display width of a word in some font. denoted by for the word , and the predicted log-probability as the value for the prediction slot . Our goal of character-level length-control summarization can be formulated as


where is the total length budget. Here, the value is the sum of the log-probability of every generation slot including , whereas the length is said in terms of the words of the CTC-reduced sequence .

We observe that handling every possible integer weight (i.e., length) as in a standard knapsack algorithm may slow down the inference. Thus, we divide the lengths into buckets for efficient inference. Formally, let the th bucket cover the length ranging from to characters, where

is a hyperparameter controlling the bucket size. We denote by

the most probable -token sequence that is reduced to a summary in the th length bucket. Specially, we let mean that the reduced summary has zero words.

The initialization of fills in the DP table for and .

  • For , we must have (-many), because means no non- word has been generated. (First row of Figure 1b.)

  • For , we have


    Here, is the same as the previous bullet item. For , we select the most probable word for each length bucket according to the value , i.e., the predicted log-probability of the first generation slot. (First column of Figure 1b.)

The DP recursion is to compute based on a newly predicted token , assuming its top-left sub-table is filled. This involves three scenarios:

  • Case 1: . In this case, the new word is . Thus, the index for generation slots increases from to , but the summary length does not change. We denote as the set containing the candidate sequence, given by


    where denotes string concatenation. (See yellow dash arrows in Figure 1b.)

  • Case 2: . In other words, the candidate non- word for the th slot is identical to the last token of . Since repeated tokens are merged during CTC decoding, the output length index is unchanged. We include this sequence in a set:


    (Also see yellow dash arrows in Figure 1b.)

  • Case 3: . That is, is neither nor repetition, and thus the summary length will be increased from bucket to . We denote this candidate set by


    (See blue arrows in Figure 1b.)

Then, our DP finds the most probable sequence at each recursion step:


where is the th token of a sequence from the three candidate sets above.

Theorem 1.

(1) If the bucket size and consecutive repetitions are not merged, then is the most probable sentence of characters given by the prediction slots. (2) If or repeating tokens are merged, our algorithm may not be exact. See the appendix for the proof.

Discussion. Our DP algorithm is inspired by the standard 0-1 knapsack problem [andonov2000unbounded], but also differs in several significant ways. First, we merge consecutive tokens during CTC decoding; this establishes some dependencies among different generation slots, and thus exact inference with DP is not possible. Second, our value function is non-stationary, as it changes over time. We require that every slot should select a token, either or a word. In both cases, the token’s value is added to the total value. Therefore, our algorithm is compatible with negative values, namely, log-probabilities in our application, because only the relative difference matters for the value function.

4 Experiments

4.1 Setup

Datasets. Our model is evaluated on the Gigaword headline generation [Rush_2015] and DUC2004 datasets [duc2004]. The Gigaword dataset comprises pairs of news articles and titles; we follow the standard setting [Rush_2015, schumann-etal-2020-discrete, liu-etal-2022-learning] and adopt the first sentence of each news article as the input and consider the news title as the summary. In total, the dataset contains 3.8M, 198K, and 1951 samples for training, validation, and test, respectively. On the other hand, the DUC2004 dataset has 500 paired samples and is designed to be test-only; its performance is obtained by models trained on Gigaword.

Metrics. We follow previous work [schumann-etal-2020-discrete, Rush_2015, sunon] and evaluate the output by ROUGE scores [lin2004rouge]: ROUGE- evaluates -gram overalpping, whereas ROUGE- evaluates the longest common sequence. We follow the convention and adopt ROUGE F1 for the Gigaword dataset [wang-lee-2018-learning, schumann-etal-2020-discrete, liu-etal-2022-learning] and ROUGE Recall for the DUC2004 dataset [dorr-etal-2003-hedge, west-etal-2019-bottlesum].

Implementation Details. We use a Transformer encoder as the base model, which has 6 layers and 8 attention heads for each layer, following the settings in [attentionisallyouneed]. The dimensions are 512 and 2048 for the attention and feed-forward modules, respectively. Each training batch contains samples amounting to 4K tokens. The learning rate is chosen from {1e-4, 5e-4} by validation, and we ran 100K gradient updates for the unsupervised setting, but 400K updates for the supervised setting. Note that, in the unsupervised setting, our model learns from an extractive search-based method [schumann-etal-2020-discrete], and thus its training is relatively easy. By contrast, the supervised setting takes human-written sentences as the training targets, which are more abstractive and difficult to train with. All experiments were run on an i9-9940X CPU and an RTX6000 GPU.

For our length-control algorithm, we adopt a bucket size of 4, and only consider the most probable 20 words for every generation slot (cf.  in Eqn. 6) due to efficiency concerns.

4.2 Results and Analyses

Results on Gigaword Headline Generation. Table 1

presents the performance on the Gigaword test set in both supervised and unsupervised settings. For the supervised setting, we train with the groundtruth references, which contain 51.29 characters on average. In the unsupervised setting, the training targets of machine learning models (Rows 9–13) are the output summaries given by an unsupervised search-based method 

[schumann-etal-2020-discrete]; specifically, we adopt the standard 8-word setting, which counts to 48.75 characters on average. Based on these, we set our length constraint to be 50 characters. We observe that our proposed algorithm is the only machine learning-based approach that can perform explicit character-level length control. For fair comparison, we truncate the output of other methods to satisfy the length constraint.

As seen, NACC even with truncating outperforms all other non-autoregressive (NAR) models [sunon, pmlr-v139-qi21a, yang-etal-2021-pos] in both settings. Specifically, Su et al. [sunon] emit multiple end-of-sequence tokens at the end of the output sequence to generate a shorter summary than the source text (Rows 1 & 9); Qi et al. [pmlr-v139-qi21a] propose to pretrain a summarization system in an autoregressive manner, and gradually adapt it to the NAR setting (Rows 5 & 10); Yang et al. [yang-etal-2021-pos]333Yang et al. [yang-etal-2021-pos] only provided execution commands in their GitHub repo, but no training code. We emailed the authors, but have not obtained the code either. The reported results are based on our best replication. propose a two-step strategy of autoregressive part-of-speech (POS) prediction and non-autoregressive summary generation (Rows 6 & 11). Our NACC trained by CTC is able to surpass all these methods, demonstrating the effectiveness of CTC training for summarization. Besides, NACC outperforms its search-based teacher model [schumann-etal-2020-discrete], showing that learning can smooth out the search noise.

Equipped with the length-control algorithm, our NACC model has a significant boost in terms of ROUGE scores. In a fair comparison with the same base architecture, the length-control algorithm alone improves NACC (truncate) by ~3 points. Our full model achieves an improvement of more than 4 total ROUGE points compared with previous state-of-the-art NAR methods.

Regarding the inference efficiency, NACC with truncating is faster than previous NAR models. Even with the length-control algorithm, our NACC is still in the same ballpark, being ~500x faster than the search-based method.

Additionally, we design a variant of the unsupervised summarization method based on [schumann-etal-2020-discrete], where we directly impose the character-level length constraint during each search step (Row 8). We find this approach outperforms truncating word-constrained search (Row 7), but is much worse than our machine learning-based NACC with the length-control algorithm.

Setting # Approach Len ROUGE F1 Time
R-1 R-2 R-L R
Su et al. [sunon] (truncate) 38.43 32.28 14.21 30.56 0 0.016
2 Qi et al. [pmlr-v139-qi21a] (truncate) 27.98 31.69 12.52 30.05 -2.79 0.019
3 Yang et al. [yang-etal-2021-pos] (truncate) 35.37 28.85 6.45 27.00 -14.75
4 NACC (truncate) 34.15 33.12 13.93 31.34 1.34 0.011
5 NACC (length control) 34.40 33.66 13.73 31.79 4.74 0.017
6 Baseline Lead-50 chars 49.03 20.66 7.08 19.30 -9.23
7 Search Schumann et al. [schumann-etal-2020-discrete] (truncate) 45.45 24.98 9.08 23.18 0.97 9.573
8 Char constrained search 44.05 25.30 9.25 23.43 1.71 17.324
Su et al. [sunon] (truncate) 45.24 24.65 8.64 22.98 0 0.017
10 Qi et al. [pmlr-v139-qi21a] (truncate) 44.54 24.31 7.66 22.48 -1.82 0.019
11 Yang et al. [yang-etal-2021-pos] (truncate) 49.37 21.70 4.60 20.13 -9.84
12 NACC (truncate) 47.77 25.79 8.94 23.75 2,21 0.012
13 NACC (length control) 47.03 27.45 8.87 25.14 5.19 0.025
Table 1: Performance on the Gigaword headline generation test set, where NAR stands for non-autoregressive. Len: Average number of characters in the predicted summaries. R-1, R-2, R-L: ROUGE-1, ROUGE-2, ROUGE-L. R: The difference of total ROUGE (sum of R-1, R-2, and R-L) in comparison with the (previous) state-of-the-art NAR summarization system [sunon]. Time: Average inference time in seconds for one sample on an i9-9940X CPU and an RTX6000 GPU.

Results on DUC2004. Table 2 shows the performance of NACC on DUC2004, where we constrain the summary length to be at most 75 characters, following previous work [west-etal-2019-bottlesum, baziotis-etal-2019-seq, schumann-etal-2020-discrete]. Since the Gigaword training summaries have very different lengths from DUC2004 summaries, it is not straightforward to evaluate NACC in the supervised setting, as this involves length transfer. Therefore, we consider the unsupervised setting for DUC2004, where we use 13-word summaries (~80 characters) from [schumann-etal-2020-discrete] as our training targets. (We will have length-transfer experiments in later analysis.)

# Approach ROUGE Recall Time
R-1 R-2 R-L R
1 Baseline Lead-75 chars 22.52 6.50 19.74 -4.97
2 Search Schumann et al. [schumann-etal-2020-discrete] (truncate) 26.09 8.03 22.86 3.25 30.362
3 Char-constrained search 26.30 7.95 22.78 3.30 31.540
Su et al. [sunon] (truncate) 24.67 7.25 21.81 0 0.017
5 Qi et al. [pmlr-v139-qi21a] (truncate) 22.79 5.91 20.05 -4.98 0.018
6 NACC (truncate) 26.43 7.86 22.66 3.22 0.012
7 NACC (length control) 28.37 7.74 24.30 6.68 0.030
Table 2: Results on DUC 2004 dataset.

Experiments show that NACC with length control again largely outperforms previous NAR models, while retaining high inference efficiency. Results are consistent with the Gigaword experiment.

Decoding Wins Ties Loses -value
Overall quality Truncate 18% 44% 38% 0.0001
Length control 38% 44% 18%
& fluency
Truncate 22% 36% 42% 0.0002
Length control 42% 36% 22%
Table 3: Human evaluation comparing truncating and length-control decoding of our NACC approach on 150 samples selected from the Gigaword headline generation dataset in the unsupervised setting. The -value is given by a two-sided binomial test.

Human Evaluation. We additionally conduct human evaluations to investigate the effect of our length-control algorithm in terms of the completeness of summaries, which may not be captured by automatic metrics such as ROUGE scores. For controlled comparison, we consider NACC with truncating and NACC with the proposed length-control algorithm.

We invite three human annotators to compare the output summaries on each of 150 randomly selected samples from the Gigaword test set. We adopt the unsupervised setting in Table 1, where models are trained with 8-word summaries generated by [schumann-etal-2020-discrete]. The two systems’ outputs for a sample are blindly given to annotators in a random order; annotators then vote for the better summary or a tie, regarding the overall quality and completeness/fluency separately. We count the votes for each decoding method on the two evaluation criteria, and show the percentage of wins/ties/loses on all samples in Table 3

As seen, our length-control decoding has a dominant advantage over the truncating method in terms of both the overall quality and completeness/fluency. Further, this result is statistically significant based on a two-sided binomial test (ignoring ties), verifying that our length-control algorithm indeed improves the quality and completeness of the predicted summaries.

Setting # Approach Len ROUGE F1 Time
R-1 R-2 R-L
1 AR Transformer (truncate) 40.89 35.12 16.61 32.55 0.093
2 Transformer (len control) 39.73 34.10 15.65 31.64 0.095
3 NAR NACC (truncate) 34.15 33.12 13.93 31.34 0.011
4 NACC (len control) 34.40 33.66 13.73 31.79 0.017
5 AR Transformer (truncate) 46.62 26.31 9.33 24.29 0.092
6 Transformer (length control) 45.23 25.33 9.03 23.44 0.095
7 NAR NACC (truncate) 47.77 25.79 8.94 23.75 0.012
8 NACC (len control) 47.03 27.45 8.87 25.14 0.025
Table 4: Comparing autoregressive (AR) and non-autoregressive (NAR) models on the Gigaword headline generation test set. Our length-control algorithm requires the predicted probabilities to be independent, and thus is not compatible with AR models.

Comparison with Autoregressive Models. We are curious about how our non-autoregressive NACC is compared with autoregressive (AR) methods. Thus, we train a standard AR Transformer with truncating and length-control decodings, and show results in Table 4.

As seen, our length-control algorithm is not compatible with the AR Transformer and hurts the ROUGE scores (Rows 2 & 6). This is because our algorithm is based on dynamic programming and requires model outputs to be local, so that the length-control problem can be divided into shared sub-problems; however, the predicted probabilities of the AR Transformer depend on partial generation at previous time steps. Note that this is not a disadvantage of our approach, but shows that NAR generation provides unique opportunities for length control.

Admittedly, NACC has slightly lower ROUGE scores than AR Transformers in the supervised setting (Rows 1–4), because human-written summaries are flexible, causing difficulties for NAR training. Nevertheless, our NACC achieves much higher inference efficiency, and outperforms all previous NAR systems (recall Table 1).

In the unsupervised setting, our approach achieves higher performance than the AR Transformer, which is a very strong result. This is because we learn from a search-based method that extracts source words as the summary [schumann-etal-2020-discrete], and our CTC training—with blank tokens scattering over the whole output sentence as appropriate—can capture such strong correspondence between input and output. Moreover, the proposed length-control algorithm is able to maintain the summary completeness given the length constraint, achieving better ROUGE scores than NACC with truncating.

Figure 2: Performance of NACC with different bucket sizes.
Figure 3: Length-transfer performance of NACC and Su et al. [sunon].

Analysis of the Length Bucket. Our dynamic programming is an approximate algorithm with a -sized length bucket (see Figure 1b and §3.2). Here, we investigate the effect of the bucket size in terms of ROUGE scores (the arithmetic mean of R-1, R-2, and R-L) and inference efficiency in the unsupervised setting where training targets are 8-word summaries from [schumann-etal-2020-discrete], following Table 1.

As seen in Figure 3, the ROUGE score continuously decreases with a larger bucket size (thick orange curve). This not only confirms the inexactness of our algorithm, but also shows that a small bucket size does not hurt the performance much. On the other hand, the inference time decreases drastically at the beginning (thin blue curve) because we have fewer dynamic programming steps; as the bucket size increases, the inference time convergences to NACC without length control. Based on this analysis, we set the bucket size to be 4 in our experiments.

Length-Transfer Generation. Our NACC model is capable of length-transfer generation, that is, generating summaries of different lengths from the training targets. Such generation is important to real-world applications where summaries of various lengths are needed for the same input, e.g., fitting different screen widths. Although generating a short enough summary may satisfy all possible length constraints, a longer summary that better utilizes the length budget can preserve more information; this is also reflected by ROUGE scores, which prefer longer summaries, as shown in [schumann-etal-2020-discrete].

Figure 3 compares the performance of NACC with Su et al. [sunon] when learning from 8-word summaries in the unsupervised setting. When the inference length budget is less than training (-axis ), the ROUGE score of NACC decreases almost linearly with a decreasing length budget, but Su et al. [sunon]’s approach degrades faster than ours. For -axis , we find the soft penalty in [sunon] is unable to generate longer summaries than trained (shown by the dashed orange line), whereas our approach is able to utilize the increased length budget and achieve higher ROUGE scores.

Input: singapore airline and delta air lines announced two differing strategies to upgrade their long-haul in-flight
service for business travelers .
Reference: business travel : competing strategies ; crowded skies
Supervised Setting:
AR Transformer (no control):
delta airlines upgrade service for business travel (50 characters)
NACC (no control):
delta to upgrade long-haul in-flight (36 characters)
NACC (length control):
delta air to upgrade long-haul in-flight service (48 characters)
Unsupervised Setting:
AR Transformer (no control):
delta air lines differing strategies to upgrade their in-flight service for business travelers
NACC (no control):
delta air lines differing strategies to upgrade long-haul in-flight service for business travelers
NACC (length control):
delta air lines upgrade service business travelers
Table 5: Example summaries for Gigaword headline generation, where gray words are truncated for fair comparison.

Case Study. Table 5 shows example summaries generated by NACC and the AR Transformer on the Gigaword test set.

As seen, a model without length control may generate a summary that happens to have the desired length (AR Transformer in the supervised setting), a shorter summary (NACC in the supervised setting), or a longer summary (both AR Transformer and NACC in the unsupervised setting). A shorter summary fails to utilize the full length budget, leading to less information conveyed, whereas a longer summary requires truncating for explicit length control. Both cases are undesired.

By contrast, our proposed algorithm is able to generate a summary whose length is close to but less than the length budget. The resulting summary is more complete than truncating, and better keeps the key information.

5 Conclusion

In this work, we propose a Non-Autoregressive Summariation model with Character-level length Control (NACC), which can explicitly control the number of characters in the predicted summaries. Experiments show that our NACC approach not only achieves the state-of-the-art non-autoregressive (NAR) performance on Gigaword and DUC2004 datasets under length constraints. but is several times faster than autoregressive (AR) models, and even outperforms an AR Transformer in the unsupervised setting. Moreover, our approach is able to perform length-transfer generation, that is, generating summaries of different lengths from the training set.

Limitation and Future Work. Our proposed length-control algorithm only works with non-autoregressive models, which is a limitation (but not a disadvantage) of our work. We have clearly stated this throughout the paper.

Our NACC approach—although largely outperforming previous NAR models and retaining high inference efficiency—achieves lower ROUGE scores than AR models. This is the limitation of all NAR approaches in general, and can be addressed by future work. We nevertheless outperform AR models in the unsupervised setting.


Appendix A Proof of Theorem 1

Theorem 1. (1) If the bucket size and consecutive repetitions are not merged, then is the most probable sentence of characters given by the prediction slots. (2) If or repeating tokens are merged, our algorithm may not be exact.


[Part (1)] Our NACC is trained by the Connectionist Temporal Classification (CTC) algorithm [10.1145/1143844.1143891], which merges repeated consecutive tokens and removes s in the output sequence. Since the merging operation establishes dependencies between tokens in the output sequence, our length-control algorithm is inexact.

In this part, we consider a variant of the CTC algorithm that does not merge repeated tokens but only removes s; we denote this modified reduction operation by . For example, . Our thus revised algorithm works as follows.

We denote as the recursion variable, being the most probable -token sequence that is reduced to a summary of length .

The initialization of is the same as the original length-control algorithm (§3.2), since the merging operation is not involved here. However, the recursion involves only two cases:

  • Case 1: . The recursion of this case is also the same (see Eqn. 4):

  • Case 2: . We have a set of candidate sequences:


    This is analogous to Eqn. (6), where (due to our theorem assumption). Also, the condition in Eqn. (6) is dropped here because this algorithm variant does not merge repeated tokens.

Then, the algorithm chooses the most probable candidate sequence as , given by


Now we will prove that the algorithm is exact: suppose is the log probability of , we have


In other words, is the most probable -token sequence that is reduced to length . This is proved by mathematical induction as follows.

Base Cases. For , the variable can only be -many s. The optimality in Eqn. (11) holds trivially.

For but , the algorithm chooses . Therefore, , showing that Eqn. (11) is also satisfied with only one term in the summation.

Induction Step. The induction hypothesis assumes for every . We will show that the algorithm finds the sequence with .

According to Eqn. (10), the variable is the most probable sequence in . Thus, we have


Here, (13) separates the operation over and ; (14) is due to the induction hypothesis; (15) holds because the two terms in (14) are independent given , and thus the summations can be grouped; and (16) further groups the two operations with eliminated. The last two lines are original proved in [jelinek1998statistical] and also used in [1323262].

[Part (2)] We now prove our algorithm may be inexact if or repeated tokens are merged. We show these by counterexamples.444To make our counterexample intuitive, we work with probabilities, rather than log-probabilities.

Suppose and in particular we assume . We further assume repeated tokens are not merged. Consider the example shown in Table 6. The length-control algorithm finds , and then with the probability of , as the first bucket covers the length range and second . Here, we notice that two words are separated by a white space, which also counts as a character). However, the optimum should be , which has a probability of .

Now suppose repeated tokens are merged, and we further assume the length bucket in this counterexample. Again, this can be shown by Table 6: the algorithm finds and , based on which we have with probability . However, the optimum should be with probability .

I 0.3 0.1
am 0.4 0.6
a 0.2 0.05
0.1 0.25
Table 6: A counterexample showing that our algorithm may be inexact if or repeated tokens are merged. Here, we set the vocabulary to be three words plus a blank token .

The above theoretical analysis helps us understand when our algorithm is exact (or inexact). Empirically, our approach works well as an approximate inference algorithm.