1 Introduction
Sentence summarization aims at compressing a long sentence into a short one that keeps the main gist of the input. It is an established setting of summarization with extensive real-world applications, such as headline generation [nenkova2011automatic, Rush_2015, filippovaetal2015sentence]. Previous work has developed various approaches to improve the ROUGE score, the main evaluation metric for summarization [lin2004rouge], whereas controlling the summary length has not drawn much attention.
Recently, Liu et al. [liuetal2018controlling] showed that length control is key to summarization, since it is typically required by real-world applications. Moreover, the ROUGE score is known to be sensitive to the summary length [schumannetal2020discrete, saito2020length], and summarization systems can achieve higher scores simply by generating longer output. In previous work, a typical length-control setting is to set a word budget for the summary [schumannetal2020discrete, liuetal2022learning].
In this paper, we address explicit length control at the character level for summarization, a setting different from previous work. In other words, we constrain the summary length by the number of characters, including letters, punctuation marks, and spaces. We observe that this is a more realistic setting in real-world applications than word-level length control. For example, a headline shown in a mobile app or web page is constrained by the screen width (roughly speaking, the number of characters), rather than the number of words.
We further observe that controlling the summary length by characters cannot be easily addressed by previous approaches. For example, truncating can explicitly control the length, but the resulting summary is incomplete; Takase and Okazaki [takaseokazaki2019positional] feed length embeddings into the model as input, but such an approach cannot control the summary length in an explicit manner; and Schumann et al. [schumannetal2020discrete] perform constrained discrete optimization by selecting a certain number of words from the source text as the output, but their generated summaries may vary considerably in the number of characters.
To this end, we propose NACC, a Non-Autoregressive summarization model with Character-level length Control. We adopt a non-autoregressive approach because it generates all tokens in parallel and is much faster than autoregressive models. Moreover, we observe that non-autoregressive models predict the output words independently. Such predicted probabilities are thus local, which provides us with the unique opportunity to design a dynamic programming algorithm to constrain the summary length. Specifically, we formulate length control as a knapsack-like problem, where the weight is the number of characters in a token and the value is the predicted probability of the token. In this way, we are able to explicitly control the summary length at the character level, while retaining the completeness of the output text.
We evaluated our NACC model on the Gigaword headline generation [graff2003english] and DUC2004 [duc2004] datasets in both supervised and unsupervised settings. In the latter setting, NACC learns from the pseudo-references given by an unsupervised word-extraction method based on discrete search [schumannetal2020discrete]. Experiments show that NACC establishes the state-of-the-art performance of non-autoregressive summarization under various target lengths in both supervised and unsupervised settings; NACC even outperforms autoregressive Transformers [attentionisallyouneed] in the unsupervised setting, where the input and output have stronger correspondence. These results confirm the effectiveness of our length-control algorithm. Regarding inference efficiency, we show that non-autoregressive models without length control are 10 times faster than autoregressive ones; even with our length-control dynamic programming, NACC is still several times more efficient. Further, NACC is capable of length-transfer generation, i.e., generating summaries of lengths different from those of the training targets.
2 Related Work
Non-Autoregressive Generation. Non-autoregressive (NAR) models [gu2017non] predict target tokens in parallel and thus enjoy much higher inference efficiency than autoregressive (AR) ones, which generate output tokens sequentially. NAR generation is more difficult than AR generation because it lacks dependencies among the output tokens. Previous work addresses this dependency issue by iterative refinement [leedeterministic, miao2019cgmh, chietal2021align] or structured decoding with the conditional random field (CRF) [sun2019fast, sunon, deng2020cascaded].
The Connectionist Temporal Classification (CTC) algorithm [10.1145/1143844.1143891] addresses a common problem in NAR generation, namely token repetition, by merging consecutive identical tokens (unless separated by an empty token). Our work follows Liu et al. [liuetal2022learning], who adopt CTC for NAR summarization, allowing empty tokens to scatter over the entire sentence so as to generate a short summary.
NAR models are traditionally thought to generate worse-quality text than AR models. However, we have the unique insight that NAR models also bring new opportunities for designing length-control algorithms: since the model predicts output tokens independently, the decoding problem can be divided into shared subproblems, making dynamic programming feasible.
Summarization Systems. Summarization systems can be generally categorized into two types: extractive and abstractive. Extractive methods output a summary by extracting important sentences or clauses from the source text [dongetal2018banditsum, mihalcea2004textrank, kaageback2014extractive], while abstractive methods generate summaries with new expressions [nallapati2016abstractive, paulus2018deep, gehrmann2018bottom].
Depending on the availability of training data, summarization systems can also be categorized into supervised and unsupervised ones. Supervised methods typically follow the sequence-to-sequence (seq2seq) paradigm [zhang2020pegasus, liurefsum]. Recently, unsupervised summarization has been drawing increasing attention because it does not require parallel training data. Yang et al. [yang2020ted] use the lead approach (selecting the first several words or sentences) as pseudo-groundtruth for seq2seq training. Alternatively, cycle consistency can be adopted as the training objective for unsupervised summarization [wanglee2018learning, baziotisetal2019seq]. Schumann et al. [schumannetal2020discrete] generate a summary by maximizing a heuristically defined scoring function (involving fluency and semantic similarity) with word-level extraction. We consider both settings in our work.
Controlling the output length is crucial for deploying summarization systems in real-world applications. In early extractive summarization research, truncating was adopted for fair comparison [mihalcea2004textrank, murray2005extractive, kaageback2014extractive], but the resulting summaries may not be complete sentences. As pointed out by [saito2020length, schumannetal2020discrete], however, the length-control problem has not been adequately addressed for abstractive summarization in the neural era, probably because researchers do not want to hurt completeness. For example, length information [liuetal2018controlling, takaseokazaki2019positional] and a soft penalty [sunon] have been adopted to encourage short summaries, but they cannot control the length in an explicit way. Very recently, Schumann et al. [schumannetal2020discrete] addressed the length-control problem by extracting a certain number of source tokens as the summary, and Liu et al. [liuetal2022learning] designed a CTC-based algorithm that explicitly controls the number of output words. However, these approaches cannot be applied at the character level.
Our work follows Liu et al. [liuetal2022learning] and adopts CTC as the training objective. However, we propose a character-level length-control algorithm that can explicitly constrain the number of characters within a given budget, which differs from all previous work.
3 Approach
In this section, we first introduce the CTC-trained NAR model for summarization (§3.1). Then, we propose a dynamic programming algorithm to explicitly control the number of characters (§3.2).
3.1 A Non-Autoregressive Model for Summarization
In the summarization task, the goal is to compress a source text $\mathbf{x} = (x_1, \dots, x_n)$ into a shorter text $\mathbf{y} = (y_1, \dots, y_m)$, where $m < n$, while preserving the key information.
Encoder-Only Architecture. Our work follows recent non-autoregressive summarization methods [sunon, liuetal2022learning], utilizing the encoder-only Transformer [attentionisallyouneed] as the base model (Figure 1a). It is essentially a stack of Transformer layers, the last of which predicts all words simultaneously; thus, it has much higher inference efficiency than autoregressive models.
Formally, let $\mathbf{H}^{(l)} \in \mathbb{R}^{n \times d}$ be the contextual representation of the input at the $l$th Transformer layer, where $n$ is the number of tokens in $\mathbf{x}$ and $d$ is the dimension. Specially, $\mathbf{H}^{(0)}$ is given by the word and positional embeddings of the input tokens. Each layer is a Transformer block [attentionisallyouneed] computed from its predecessor, which can be denoted by $\mathbf{H}^{(l)} = \operatorname{Layer}^{(l)}(\mathbf{H}^{(l-1)})$, for $l = 1, \dots, L$.
At the last layer $L$, a softmax function is applied to predict the probability of every prediction slot independently, given by $p(w_t \mid \mathbf{x}) = \operatorname{softmax}(\mathbf{W}\mathbf{h}_t^{(L)})$ for slot $t$, where $\mathbf{h}_t^{(L)}$ is a column vector transposing the $t$th row of the matrix $\mathbf{H}^{(L)}$, and $\mathbf{W}$ is a weight matrix.
Such an encoder-only architecture is able to capture the strong correspondence between the input and output in the summarization task, and it outperforms encoder–decoder non-autoregressive models, as shown by Liu et al. [liuetal2022learning].
CTC Training.
Training the above encoder-only architecture requires padding the target with empty tokens $\epsilon$; otherwise, the output would have the same length as the input and could not be a summary. Gu et al. [gu2017non] pad $\epsilon$s at the end of the target, but this hinders the input–output correspondence.
Instead, we follow Liu et al. [liuetal2022learning] and train our model with Connectionist Temporal Classification (CTC) [10.1145/1143844.1143891], allowing $\epsilon$s to scatter over the entire target summary as appropriate. During inference, the CTC approach removes $\epsilon$s to generate a shorter text as the summary. In addition, CTC merges consecutive identical tokens that are not separated by $\epsilon$, because non-autoregressive generation suffers from the word-repetition problem [gu2017non, sahariaetal2020non]. We denote the CTC reduction operation by $\Gamma$; for example, $\Gamma(\epsilon a a \epsilon b b) = ab$.
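For concreteness, the CTC reduction described above can be sketched in a few lines of Python. This is a minimal illustration under our own naming (`ctc_reduce`, with `"ε"` standing for the empty token), not the paper's actual implementation:

```python
def ctc_reduce(tokens, blank="ε"):
    """Merge consecutive identical tokens, then remove blanks.
    Repeats separated by a blank are kept as distinct words."""
    out, prev = [], None
    for tok in tokens:
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return out
```

Note that a repeated word survives reduction when a blank sits between the copies, which is how CTC can still emit genuinely repeated words.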
For training, CTC maximizes the marginal likelihood of the token sequences $\mathbf{w} = (w_1, \dots, w_n)$ (including $\epsilon$s) that can be reduced to the target text $\mathbf{y}$:
$$p(\mathbf{y} \mid \mathbf{x}) = \sum_{\mathbf{w}:\, \Gamma(\mathbf{w}) = \mathbf{y}} p(\mathbf{w} \mid \mathbf{x}), \qquad (1)$$
where $p(\mathbf{w} \mid \mathbf{x}) = \prod_{t=1}^{n} p(w_t \mid \mathbf{x})$ is the probability of generating a token sequence $\mathbf{w}$, given by the non-autoregressive model. Although a brute-force summation in Eqn. (1) is intractable, it can be computed efficiently by dynamic programming. We refer interested readers to Graves et al. [10.1145/1143844.1143891] for the details of the above marginalization in CTC training.
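The marginalization in Eqn. (1) can be made concrete with a brute-force sketch that enumerates every length-$T$ token sequence and sums the probabilities of those reducing to the target. This is exponential in $T$ and only for illustration; actual CTC training computes the same quantity with the forward dynamic program. All names here are ours:

```python
import itertools

def ctc_reduce(tokens, blank="ε"):
    # merge consecutive repeats, then drop blanks
    out, prev = [], None
    for tok in tokens:
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return out

def marginal_likelihood(probs, target, blank="ε"):
    """Brute-force Eqn. (1): sum p(w) over all token sequences w
    with Γ(w) == target. probs is a list of per-slot {token: prob} dicts."""
    vocab = list(probs[0])
    total = 0.0
    for seq in itertools.product(vocab, repeat=len(probs)):
        if ctc_reduce(seq, blank) == target:
            p = 1.0
            for t, tok in enumerate(seq):
                p *= probs[t][tok]  # slots are predicted independently
            total += p
    return total
```

For two slots with $p(a) = p(\epsilon) = 0.5$ at each slot, three of the four sequences ($aa$, $a\epsilon$, $\epsilon a$) reduce to "$a$", so the marginal likelihood of "$a$" is 0.75.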
3.2 Proposed Algorithm for Character-Level Length-Control Decoding
One of our main contributions is a dynamic programming (DP) algorithm for character-level length control. As mentioned in §1, controlling the number of characters in the output is important for summarization, since real-world applications usually involve character-level length constraints.
Our insight is that a non-autoregressive model predicts probabilities independently, so the length-control problem can be divided into shared subproblems, making DP algorithms feasible. We formulate character-level length control as a knapsack-like problem. We treat the number of characters in a word (plus one) as the weight,^2 denoted by $c(w)$ for a word $w$, and the predicted log-probability $\log p_t(w)$ as the value for prediction slot $t$. Our goal of character-level length-control summarization can be formulated as
$$\max_{w_1, \dots, w_n} \ \sum_{t=1}^{n} \log p_t(w_t) \quad \text{s.t.} \ \sum_{w \in \Gamma(w_1 \cdots w_n)} c(w) \le C, \qquad (2)$$
where $C$ is the total length budget. Here, the value is the sum of the log-probabilities of every generation slot, including $\epsilon$s, whereas the length is counted in terms of the words of the CTC-reduced sequence $\Gamma(w_1 \cdots w_n)$.
^2 For the purposes of research, we assume every word is appended with another character, namely a white space. In real applications, our algorithm can be applied with any measure of length, such as the display width of a word in some font.
We observe that handling every possible integer weight (i.e., length) as in a standard knapsack algorithm may slow down the inference. Thus, we divide the lengths into buckets for efficient inference. Formally, let the $b$th bucket cover lengths ranging from $(b-1)B + 1$ to $bB$ characters, where $B$ is a hyperparameter controlling the bucket size. We denote by $D_{t,b}$ the most probable token sequence over the first $t$ slots that is reduced to a summary in the $b$th length bucket. Specially, we let $b = 0$ mean that the reduced summary has zero words.
The initialization of $D_{t,b}$ fills in the DP table for $b = 0$ and $t = 1$:

• For $b = 0$, we must have $D_{t,0} = (\epsilon, \dots, \epsilon)$ ($t$-many $\epsilon$s), because $b = 0$ means no non-$\epsilon$ word has been generated. (First row of Figure 1b.)

• For $t = 1$, we have
$$D_{1,b} = \operatorname*{argmax}_{(w):\, c(w) \in \text{bucket } b} \ \log p_1(w). \qquad (3)$$
Here, $D_{1,0} = (\epsilon)$ is the same as in the previous bullet item. For $b \ge 1$, we select the most probable word for each length bucket according to the value $\log p_1(w)$, i.e., the predicted log-probability of the first generation slot. (First column of Figure 1b.)
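Under the bucketing just described, mapping a character count to its bucket index is a one-line helper; a minimal sketch with our own names (`bucket_index` is not from the paper):

```python
import math

def bucket_index(num_chars, bucket_size):
    """The b-th bucket covers lengths (b-1)*B + 1 through b*B;
    bucket 0 is reserved for empty summaries (zero characters)."""
    return math.ceil(num_chars / bucket_size)
```

With a bucket size of 4, for example, lengths 1 through 4 fall into bucket 1 and lengths 5 through 8 into bucket 2.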
The DP recursion computes $D_{t,b}$ based on a newly predicted token $w_t$, assuming its top-left subtable is filled. This involves three scenarios:

• Case 1: $w_t = \epsilon$. In this case, the new token is $\epsilon$. Thus, the index for generation slots increases from $t-1$ to $t$, but the summary length does not change. We denote by $S_1$ the set containing this candidate sequence, given by
$$S_1 = \{\, D_{t-1,b} \oplus \epsilon \,\}, \qquad (4)$$
where $\oplus$ denotes string concatenation. (See yellow dashed arrows in Figure 1b.)

• Case 2: $w_t = \operatorname{last}(D_{t-1,b})$. In other words, the candidate non-$\epsilon$ word for the $t$th slot is identical to the last token of $D_{t-1,b}$. Since repeated tokens are merged during CTC decoding, the output length index is unchanged. We include this sequence in a set:
$$S_2 = \{\, D_{t-1,b} \oplus \operatorname{last}(D_{t-1,b}) \,\}. \qquad (5)$$
(Also see yellow dashed arrows in Figure 1b.)

• Case 3: $w_t \ne \epsilon$ and $w_t \ne \operatorname{last}(D_{t-1,b'})$. That is, $w_t$ is neither $\epsilon$ nor a repetition, and thus the summary length is increased from some bucket $b'$ to $b$. We denote this candidate set by
$$S_3 = \{\, D_{t-1,b'} \oplus w_t : w_t \ne \epsilon,\ w_t \ne \operatorname{last}(D_{t-1,b'}),\ \text{appending } c(w_t) \text{ characters to a bucket-}b' \text{ summary yields a bucket-}b \text{ summary} \,\}. \qquad (6)$$
(See blue arrows in Figure 1b.)
Then, our DP finds the most probable sequence at each recursion step:
$$D_{t,b} = \operatorname*{argmax}_{\mathbf{w} \in S_1 \cup S_2 \cup S_3} \ \sum_{i=1}^{t} \log p_i(w_i), \qquad (7)$$
where $w_i$ is the $i$th token of a sequence $\mathbf{w}$ from the three candidate sets above.
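Putting the initialization and the three recursion cases together, the DP can be sketched as follows. This is a simplified illustration with bucket size $B = 1$ (each cell tracks an exact character count) and our own names (`length_control_decode`, `cost`); the returned sequence is the pre-reduction token sequence, to which the CTC reduction is applied afterwards:

```python
import math

def length_control_decode(probs, cost, budget, blank="ε"):
    """Character-level length-control DP, bucket size 1.
    probs:  list of per-slot {token: prob} dicts
    cost:   {word: characters}, e.g., word length + 1 for a space
    budget: maximum characters of the reduced summary"""
    # D maps a character count b to (log-prob, best token sequence).
    D = {0: (math.log(probs[0][blank]), [blank])}
    for w, p in probs[0].items():  # initialization, t = 1
        if w != blank and cost[w] <= budget:
            b = cost[w]
            if b not in D or math.log(p) > D[b][0]:
                D[b] = (math.log(p), [w])
    for t in range(1, len(probs)):
        new_D = {}
        def consider(b, lp, seq):
            if b not in new_D or lp > new_D[b][0]:
                new_D[b] = (lp, seq)
        for b, (lp, seq) in D.items():
            for w, p in probs[t].items():
                lp2 = lp + math.log(p)
                if w == blank:
                    consider(b, lp2, seq + [w])            # Case 1: blank
                elif w == seq[-1]:
                    consider(b, lp2, seq + [w])            # Case 2: repetition (merged later)
                elif b + cost[w] <= budget:
                    consider(b + cost[w], lp2, seq + [w])  # Case 3: new word
        D = new_D
    return max(D.values())[1]  # most probable sequence within the budget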
Theorem 1.
(1) If the bucket size $B = 1$ and consecutive repetitions are not merged, then $D_{t,b}$ is the most probable sentence of $b$ characters given by the first $t$ prediction slots. (2) If $B > 1$ or repeating tokens are merged, our algorithm may not be exact. See the appendix for the proof.
Discussion. Our DP algorithm is inspired by the standard 0-1 knapsack problem [andonov2000unbounded], but differs in several significant ways. First, we merge consecutive identical tokens during CTC decoding; this establishes dependencies among different generation slots, and thus exact inference with DP is not possible. Second, our value function is non-stationary, as it changes over the generation slots. We require that every slot select a token, either $\epsilon$ or a word, and in both cases the token's value is added to the total value. Therefore, our algorithm is compatible with negative values, namely log-probabilities in our application, because only the relative difference matters for the value function.
4 Experiments
4.1 Setup
Datasets. Our model is evaluated on the Gigaword headline generation dataset [Rush_2015] and the DUC2004 dataset [duc2004]. The Gigaword dataset comprises pairs of news articles and titles; we follow the standard setting [Rush_2015, schumannetal2020discrete, liuetal2022learning], adopting the first sentence of each news article as the input and the news title as the reference summary. In total, the dataset contains 3.8M, 198K, and 1951 samples for training, validation, and test, respectively. The DUC2004 dataset has 500 paired samples and is designed to be test-only; its performance is obtained by models trained on Gigaword.
Metrics. We follow previous work [schumannetal2020discrete, Rush_2015, sunon] and evaluate the output by ROUGE scores [lin2004rouge]: ROUGE-$n$ evaluates $n$-gram overlapping, whereas ROUGE-L evaluates the longest common subsequence. We follow the convention and adopt ROUGE F1 for the Gigaword dataset [wanglee2018learning, schumannetal2020discrete, liuetal2022learning] and ROUGE Recall for the DUC2004 dataset [dorretal2003hedge, westetal2019bottlesum].
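As a concrete illustration of the ROUGE-L recall used above, the score divides the longest-common-subsequence length by the reference length. This is a simplified sketch (the official ROUGE toolkit additionally performs stemming and other preprocessing; names are ours):

```python
def lcs_length(ref, cand):
    """Longest common subsequence length via dynamic programming."""
    m, n = len(ref), len(cand)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref[i - 1] == cand[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def rouge_l_recall(ref, cand):
    """ROUGE-L recall: LCS length divided by the reference length."""
    return lcs_length(ref, cand) / len(ref)
```

For the reference "the cat sat on the mat" and the candidate "the cat on a mat", the LCS is "the cat on mat" (4 tokens), so recall is 4/6.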
Implementation Details. We use a Transformer encoder as the base model, which has 6 layers and 8 attention heads per layer, following the settings in [attentionisallyouneed]. The dimensions are 512 and 2048 for the attention and feed-forward modules, respectively. Each training batch contains samples amounting to 4K tokens. The learning rate is chosen from {1e-4, 5e-4} by validation; we ran 100K gradient updates for the unsupervised setting but 400K updates for the supervised setting. Note that, in the unsupervised setting, our model learns from an extractive search-based method [schumannetal2020discrete], and thus its training is relatively easy. By contrast, the supervised setting takes human-written sentences as the training targets, which are more abstractive and difficult to train with. All experiments were run on an i9-9940X CPU and an RTX6000 GPU.
For our length-control algorithm, we adopt a bucket size of 4 and only consider the most probable 20 words for every generation slot (cf. $w_t$ in Eqn. 6) due to efficiency concerns.
4.2 Results and Analyses
Results on Gigaword Headline Generation. Table 1 presents the performance on the Gigaword test set in both supervised and unsupervised settings. For the supervised setting, we train with the ground-truth references, which contain 51.29 characters on average. In the unsupervised setting, the training targets of machine learning models (Rows 9–13) are the output summaries given by an unsupervised search-based method [schumannetal2020discrete]; specifically, we adopt the standard 8-word setting, which amounts to 48.75 characters on average. Based on these statistics, we set our length constraint to 50 characters. Our proposed algorithm is the only machine learning-based approach that can perform explicit character-level length control. For fair comparison, we truncate the output of other methods to satisfy the length constraint.
As seen, NACC, even with truncating, outperforms all other non-autoregressive (NAR) models [sunon, pmlrv139qi21a, yangetal2021pos] in both settings. Specifically, Su et al. [sunon] emit multiple end-of-sequence tokens at the end of the output sequence to generate a summary shorter than the source text (Rows 1 & 9); Qi et al. [pmlrv139qi21a] propose to pretrain a summarization system in an autoregressive manner and gradually adapt it to the NAR setting (Rows 5 & 10); and Yang et al. [yangetal2021pos]^3 propose a two-step strategy of autoregressive part-of-speech (POS) prediction and non-autoregressive summary generation (Rows 6 & 11). Our NACC trained by CTC surpasses all these methods, demonstrating the effectiveness of CTC training for summarization. Besides, NACC outperforms its search-based teacher model [schumannetal2020discrete], showing that learning can smooth out the search noise.
^3 Yang et al. [yangetal2021pos] only provided execution commands in their GitHub repo, but no training code. We emailed the authors, but have not obtained the code either. The reported results are based on our best replication.
Equipped with the length-control algorithm, our NACC model obtains a significant boost in ROUGE scores. In a fair comparison with the same base architecture, the length-control algorithm alone improves NACC (truncate) by ~3 total ROUGE points. Our full model achieves an improvement of more than 4 total ROUGE points over previous state-of-the-art NAR methods.
Regarding the inference efficiency, NACC with truncating is faster than previous NAR models. Even with the lengthcontrol algorithm, our NACC is still in the same ballpark, being ~500x faster than the searchbased method.
Additionally, we design a variant of the unsupervised summarization method based on [schumannetal2020discrete], where we directly impose the character-level length constraint during each search step (Row 8). We find that this approach outperforms truncating word-constrained search (Row 7), but it is much worse than our machine learning-based NACC with the length-control algorithm.
Table 1: Results on the Gigaword headline generation test set (ROUGE F1; ΔR is the total-ROUGE difference against Su et al. [sunon]; time in seconds per sample).

| Setting | # | Approach | Len | R-1 | R-2 | R-L | ΔR | Time |
|---|---|---|---|---|---|---|---|---|
| Supervised | 1 | Su et al. [sunon] (truncate) | 38.43 | 32.28 | 14.21 | 30.56 | 0 | 0.016 |
| | 2 | Qi et al. [pmlrv139qi21a] (truncate) | 27.98 | 31.69 | 12.52 | 30.05 | 2.79 | 0.019 |
| | 3 | Yang et al. [yangetal2021pos] (truncate) | 35.37 | 28.85 | 6.45 | 27.00 | 14.75 | – |
| | 4 | NACC (truncate) | 34.15 | 33.12 | 13.93 | 31.34 | 1.34 | 0.011 |
| | 5 | NACC (length control) | 34.40 | 33.66 | 13.73 | 31.79 | 4.74 | 0.017 |
| Unsupervised | 6 | Baseline: Lead-50 chars | 49.03 | 20.66 | 7.08 | 19.30 | 9.23 | – |
| | 7 | Search: Schumann et al. [schumannetal2020discrete] (truncate) | 45.45 | 24.98 | 9.08 | 23.18 | 0.97 | 9.573 |
| | 8 | Char-constrained search | 44.05 | 25.30 | 9.25 | 23.43 | 1.71 | 17.324 |
| | 9 | Su et al. [sunon] (truncate) | 45.24 | 24.65 | 8.64 | 22.98 | 0 | 0.017 |
| | 10 | Qi et al. [pmlrv139qi21a] (truncate) | 44.54 | 24.31 | 7.66 | 22.48 | 1.82 | 0.019 |
| | 11 | Yang et al. [yangetal2021pos] (truncate) | 49.37 | 21.70 | 4.60 | 20.13 | 9.84 | – |
| | 12 | NACC (truncate) | 47.77 | 25.79 | 8.94 | 23.75 | 2.21 | 0.012 |
| | 13 | NACC (length control) | 47.03 | 27.45 | 8.87 | 25.14 | 5.19 | 0.025 |
Results on DUC2004. Table 2 shows the performance of NACC on DUC2004, where we constrain the summary length to at most 75 characters, following previous work [westetal2019bottlesum, baziotisetal2019seq, schumannetal2020discrete]. Since the Gigaword training summaries have very different lengths from DUC2004 summaries, it is not straightforward to evaluate NACC in the supervised setting, as this involves length transfer. Therefore, we consider the unsupervised setting for DUC2004, where we use the 13-word summaries (~80 characters) from [schumannetal2020discrete] as our training targets. (We present length-transfer experiments in a later analysis.)
Table 2: Results on DUC2004 (ROUGE Recall; ΔR is the total-ROUGE difference against Su et al. [sunon]; time in seconds per sample).

| # | Approach | R-1 | R-2 | R-L | ΔR | Time |
|---|---|---|---|---|---|---|
| 1 | Baseline: Lead-75 chars | 22.52 | 6.50 | 19.74 | 4.97 | – |
| 2 | Search: Schumann et al. [schumannetal2020discrete] (truncate) | 26.09 | 8.03 | 22.86 | 3.25 | 30.362 |
| 3 | Char-constrained search | 26.30 | 7.95 | 22.78 | 3.30 | 31.540 |
| 4 | Su et al. [sunon] (truncate) | 24.67 | 7.25 | 21.81 | 0 | 0.017 |
| 5 | Qi et al. [pmlrv139qi21a] (truncate) | 22.79 | 5.91 | 20.05 | 4.98 | 0.018 |
| 6 | NACC (truncate) | 26.43 | 7.86 | 22.66 | 3.22 | 0.012 |
| 7 | NACC (length control) | 28.37 | 7.74 | 24.30 | 6.68 | 0.030 |
Experiments show that NACC with length control again largely outperforms previous NAR models, while retaining high inference efficiency. Results are consistent with the Gigaword experiment.
Table 3: Human evaluation results (percentages of votes; p-values from a two-sided binomial test).

| Criterion | Decoding | Wins | Ties | Loses | p-value |
|---|---|---|---|---|---|
| Overall quality | Truncate | 18% | 44% | 38% | 0.0001 |
| | Length control | 38% | 44% | 18% | |
| Completeness/fluency | Truncate | 22% | 36% | 42% | 0.0002 |
| | Length control | 42% | 36% | 22% | |
Human Evaluation. We additionally conduct a human evaluation to investigate the effect of our length-control algorithm on the completeness of summaries, which may not be captured by automatic metrics such as ROUGE scores. For a controlled comparison, we consider NACC with truncating and NACC with the proposed length-control algorithm.
We invite three human annotators to compare the output summaries on each of 150 randomly selected samples from the Gigaword test set. We adopt the unsupervised setting in Table 1, where models are trained with 8-word summaries generated by [schumannetal2020discrete]. The two systems' outputs for a sample are blindly presented to annotators in a random order; annotators then vote for the better summary or a tie, regarding overall quality and completeness/fluency separately. We count the votes for each decoding method on the two evaluation criteria and show the percentages of wins/ties/loses over all samples in Table 3.
As seen, our length-control decoding has a dominant advantage over the truncating method in terms of both overall quality and completeness/fluency. Further, this result is statistically significant based on a two-sided binomial test (ignoring ties), verifying that our length-control algorithm indeed improves the quality and completeness of the predicted summaries.
Table 4: Comparing NACC with autoregressive models on Gigaword (ROUGE F1; time in seconds per sample).

| Setting | # | Type | Approach | Len | R-1 | R-2 | R-L | Time |
|---|---|---|---|---|---|---|---|---|
| Supervised | 1 | AR | Transformer (truncate) | 40.89 | 35.12 | 16.61 | 32.55 | 0.093 |
| | 2 | AR | Transformer (length control) | 39.73 | 34.10 | 15.65 | 31.64 | 0.095 |
| | 3 | NAR | NACC (truncate) | 34.15 | 33.12 | 13.93 | 31.34 | 0.011 |
| | 4 | NAR | NACC (length control) | 34.40 | 33.66 | 13.73 | 31.79 | 0.017 |
| Unsupervised | 5 | AR | Transformer (truncate) | 46.62 | 26.31 | 9.33 | 24.29 | 0.092 |
| | 6 | AR | Transformer (length control) | 45.23 | 25.33 | 9.03 | 23.44 | 0.095 |
| | 7 | NAR | NACC (truncate) | 47.77 | 25.79 | 8.94 | 23.75 | 0.012 |
| | 8 | NAR | NACC (length control) | 47.03 | 27.45 | 8.87 | 25.14 | 0.025 |
Comparison with Autoregressive Models. We are curious how our non-autoregressive NACC compares with autoregressive (AR) methods. Thus, we train a standard AR Transformer with truncating and length-control decoding, and show the results in Table 4.
As seen, our length-control algorithm is not compatible with the AR Transformer and hurts its ROUGE scores (Rows 2 & 6). This is because our algorithm is based on dynamic programming and requires the model's predicted probabilities to be local, so that the length-control problem can be divided into shared subproblems; however, the predicted probabilities of the AR Transformer depend on the partial generation at previous time steps. Note that this is not a disadvantage of our approach; rather, it shows that NAR generation provides unique opportunities for length control.
Admittedly, NACC has slightly lower ROUGE scores than the AR Transformer in the supervised setting (Rows 1–4), because human-written summaries are flexible, causing difficulties for NAR training. Nevertheless, our NACC achieves much higher inference efficiency and outperforms all previous NAR systems (recall Table 1).
In the unsupervised setting, our approach achieves higher performance than the AR Transformer, which is a very strong result. This is because we learn from a search-based method that extracts source words as the summary [schumannetal2020discrete], and our CTC training (with blank tokens scattering over the whole output sentence as appropriate) can capture such strong correspondence between the input and output. Moreover, the proposed length-control algorithm is able to maintain summary completeness under the length constraint, achieving better ROUGE scores than NACC with truncating.
Analysis of the Length Bucket. Our dynamic programming is an approximate algorithm when the length-bucket size $B > 1$ (see Figure 1b and §3.2). Here, we investigate the effect of the bucket size in terms of ROUGE scores (the arithmetic mean of R-1, R-2, and R-L) and inference efficiency in the unsupervised setting, where the training targets are the 8-word summaries from [schumannetal2020discrete], following Table 1.
As seen in Figure 3, the ROUGE score continuously decreases with a larger bucket size (thick orange curve). This not only confirms the inexactness of our algorithm for larger buckets, but also shows that a small bucket size does not hurt the performance much. On the other hand, the inference time decreases drastically at the beginning (thin blue curve) because we have fewer dynamic programming steps; as the bucket size increases, the inference time converges to that of NACC without length control. Based on this analysis, we set the bucket size to 4 in our experiments.
Length-Transfer Generation. Our NACC model is capable of length-transfer generation, that is, generating summaries of lengths different from those of the training targets. Such generation is important for real-world applications where summaries of various lengths are needed for the same input, e.g., fitting different screen widths. Although generating a short enough summary may satisfy all possible length constraints, a longer summary that better utilizes the length budget can preserve more information; this is also reflected by ROUGE scores, which prefer longer summaries, as shown in [schumannetal2020discrete].
Figure 3 compares the performance of NACC with that of Su et al. [sunon] when learning from 8-word summaries in the unsupervised setting. When the inference length budget is smaller than that of training, the ROUGE score of NACC decreases almost linearly with a decreasing length budget, but Su et al. [sunon]'s approach degrades faster than ours. When the inference length budget is larger than that of training, we find that the soft penalty in [sunon] is unable to generate summaries longer than those seen in training (shown by the dashed orange line), whereas our approach is able to utilize the increased length budget and achieve higher ROUGE scores.







[Table 5: example summaries generated by the AR Transformer (no control), NACC (no control), and NACC (length control) for two Gigaword test samples, in the supervised and unsupervised settings.]
Case Study. Table 5 shows example summaries generated by NACC and the AR Transformer on the Gigaword test set.
As seen, a model without length control may generate a summary that happens to have the desired length (AR Transformer in the supervised setting), a shorter summary (NACC in the supervised setting), or a longer summary (both AR Transformer and NACC in the unsupervised setting). A shorter summary fails to utilize the full length budget, leading to less information conveyed, whereas a longer summary requires truncating for explicit length control. Both cases are undesired.
By contrast, our proposed algorithm is able to generate a summary whose length is close to but less than the length budget. The resulting summary is more complete than truncating, and better keeps the key information.
5 Conclusion
In this work, we propose a Non-Autoregressive Summarization model with Character-level length Control (NACC), which can explicitly control the number of characters in the predicted summaries. Experiments show that our NACC approach not only achieves state-of-the-art non-autoregressive (NAR) performance on the Gigaword and DUC2004 datasets under length constraints, but is also several times faster than autoregressive (AR) models; it even outperforms an AR Transformer in the unsupervised setting. Moreover, our approach is able to perform length-transfer generation, that is, generating summaries of lengths different from those of the training targets.
Limitation and Future Work. Our proposed length-control algorithm only works with non-autoregressive models, which is a limitation (but not a disadvantage) of our work. We have clearly stated this throughout the paper.
Our NACC approach, although largely outperforming previous NAR models and retaining high inference efficiency, achieves lower ROUGE scores than AR models in the supervised setting. This is a limitation of NAR approaches in general and can be addressed by future work. We nevertheless outperform AR models in the unsupervised setting.
Appendix A Proof of Theorem 1
Theorem 1. (1) If the bucket size $B = 1$ and consecutive repetitions are not merged, then $D_{t,b}$ is the most probable sentence of $b$ characters given by the first $t$ prediction slots. (2) If $B > 1$ or repeating tokens are merged, our algorithm may not be exact.
Proof.
[Part (1)] Our NACC is trained by the Connectionist Temporal Classification (CTC) algorithm [10.1145/1143844.1143891], which merges repeated consecutive tokens and removes $\epsilon$s in the output sequence. Since the merging operation establishes dependencies between tokens in the output sequence, our length-control algorithm is inexact.
In this part, we consider a variant of the CTC algorithm that does not merge repeated tokens but only removes $\epsilon$s; we denote this modified reduction operation by $\Gamma'$. For example, $\Gamma'(\epsilon a a \epsilon b b) = aabb$. Our thus revised algorithm works as follows.
We denote by $\widetilde{D}_{t,b}$ the recursion variable, being the most probable sequence of the first $t$ tokens that is reduced by $\Gamma'$ to a summary of $b$ characters.
The initialization of $\widetilde{D}_{t,b}$ is the same as in the original length-control algorithm (§3.2), since the merging operation is not involved here. However, the recursion involves only two cases:

• Case 1: $w_t = \epsilon$. The recursion of this case is also the same (see Eqn. 4):
$$S_1 = \{\, \widetilde{D}_{t-1,b} \oplus \epsilon \,\}. \qquad (8)$$

• Case 2: $w_t \ne \epsilon$. Since repetitions are not merged by $\Gamma'$, every non-$\epsilon$ token increases the summary length, so Cases 2 and 3 of §3.2 collapse into
$$S_2 = \{\, \widetilde{D}_{t-1,\,b-c(w)} \oplus w : w \ne \epsilon \,\}. \qquad (9)$$

Then, the algorithm chooses the most probable candidate sequence as $\widetilde{D}_{t,b}$, given by
$$\widetilde{D}_{t,b} = \operatorname*{argmax}_{\mathbf{w} \in S_1 \cup S_2} \ \sum_{i=1}^{t} \log p_i(w_i). \qquad (10)$$
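The modified reduction $\Gamma'$ used in this part of the proof removes $\epsilon$s without merging repeats; a one-line sketch with our own name (`reduce_no_merge`):

```python
def reduce_no_merge(tokens, blank="ε"):
    """Γ' from Part (1) of the proof: remove blanks only, keep repeats."""
    return [tok for tok in tokens if tok != blank]
```

Unlike the full CTC reduction, repeated consecutive tokens survive, which removes the inter-slot dependency caused by merging.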
Now we will prove that the algorithm is exact. Let $\operatorname{lp}(\mathbf{w})$ denote the log-probability of a token sequence $\mathbf{w}$, and let $c(\cdot)$ also denote the total character count of a reduced sequence. We claim that
$$\operatorname{lp}(\widetilde{D}_{t,b}) = \max_{\mathbf{w}_{1:t}:\; c(\Gamma'(\mathbf{w}_{1:t})) = b} \ \sum_{i=1}^{t} \log p_i(w_i). \qquad (11)$$
In other words, $\widetilde{D}_{t,b}$ is the most probable token sequence that is reduced to length $b$. This is proved by mathematical induction as follows.
Base Cases. For $b = 0$, the variable $\widetilde{D}_{t,0}$ can only be $t$-many $\epsilon$s. The optimality in Eqn. (11) holds trivially.
For $t = 1$ but $b \ge 1$, the algorithm chooses $\widetilde{D}_{1,b} = \operatorname{argmax}_{w:\, c(w) = b} \log p_1(w)$. Therefore, $\operatorname{lp}(\widetilde{D}_{1,b}) = \max_{w:\, c(w) = b} \log p_1(w)$, showing that Eqn. (11) is also satisfied with only one term in the summation.
Induction Step. The induction hypothesis assumes that Eqn. (11) holds for $t - 1$ and every length $b'$. We will show that the algorithm finds the sequence $\widetilde{D}_{t,b}$ satisfying Eqn. (11).
According to Eqn. (10), the variable $\widetilde{D}_{t,b}$ is the most probable sequence in $S_1 \cup S_2$. Thus, we have (with $c(\epsilon) = 0$)
$$\begin{aligned}
\operatorname{lp}(\widetilde{D}_{t,b}) &= \max_{w_t} \ \operatorname{lp}\big(\widetilde{D}_{t-1,\,b-c(w_t)} \oplus w_t\big) && (12) \\
&= \max_{w_t} \Big[ \operatorname{lp}\big(\widetilde{D}_{t-1,\,b-c(w_t)}\big) + \log p_t(w_t) \Big] && (13) \\
&= \max_{w_t} \Big[ \max_{\mathbf{w}_{1:t-1}:\; c(\Gamma'(\mathbf{w}_{1:t-1})) = b - c(w_t)} \sum_{i=1}^{t-1} \log p_i(w_i) + \log p_t(w_t) \Big] && (14) \\
&= \max_{w_t} \ \max_{\mathbf{w}_{1:t-1}:\; c(\Gamma'(\mathbf{w}_{1:t-1})) = b - c(w_t)} \ \sum_{i=1}^{t} \log p_i(w_i) && (15) \\
&= \max_{\mathbf{w}_{1:t}:\; c(\Gamma'(\mathbf{w}_{1:t})) = b} \ \sum_{i=1}^{t} \log p_i(w_i). && (16)
\end{aligned}$$
Here, (13) separates the contribution of the prefix $\mathbf{w}_{1:t-1}$ and that of $w_t$; (14) is due to the induction hypothesis; (15) holds because the two terms in (14) are independent given $w_t$, and thus the summations can be grouped; and (16) further groups the two max operations with $w_t$ eliminated. The last two lines are originally proved in [jelinek1998statistical] and also used in [1323262].
[Part (2)] We now prove that our algorithm may be inexact if $B > 1$ or repeated tokens are merged. We show this by counterexamples.^4
^4 To make our counterexamples intuitive, we work with probabilities rather than log-probabilities.
Suppose $B > 1$, and in particular we assume $B = 2$, so that the first bucket covers the length range 1–2 and the second 3–4. We further assume repeated tokens are not merged. Consider the example shown in Table 6. The length-control algorithm finds $D_{1,1} = $ "I" and may then place $D_{2,2} = $ "I am" into the second bucket with a probability of $0.3 \times 0.6 = 0.18$. (Here, we notice that the two words are separated by a white space, which also counts as a character, so "I am" in fact has 5 characters, exceeding the second bucket's range.) However, the optimum for the second bucket should be "am $\epsilon$", which has a probability of $0.4 \times 0.25 = 0.1$.
Now suppose repeated tokens are merged, and we further assume the bucket size $B = 1$ in this counterexample. Again, this can be shown by Table 6: the algorithm finds $D_{1,2} = $ "I" and $D_{1,3} = $ "am", based on which we have $D_{2,4} = $ "I a" with probability $0.3 \times 0.05 = 0.015$; extending "I" with another "I" is excluded because the repetition would be merged. However, the optimum should be "a I" with probability $0.2 \times 0.1 = 0.02$, whose prefix "a" was already pruned when computing $D_{1,2}$.
Table 6: Predicted probabilities used in the counterexamples.

| Word | $p_1(\cdot)$ | $p_2(\cdot)$ |
|---|---|---|
| I | 0.3 | 0.1 |
| am | 0.4 | 0.6 |
| a | 0.2 | 0.05 |
| $\epsilon$ | 0.1 | 0.25 |
∎
The above theoretical analysis helps us understand when our algorithm is exact (or inexact). Empirically, our approach works well as an approximate inference algorithm.