1 Introduction
Text summarization is an important natural language processing (NLP) task, aiming at generating concise summaries for given texts while preserving the key information. It has extensive real-world applications such as headline generation
nenkova2011automatic. In this paper, we focus on the setting of sentence summarization Rush_2015; filippovaetal2015sentence. State-of-the-art text summarization models are typically trained in a supervised way with large training corpora, comprising pairs of long texts and their summaries zhang2020pegasus; aghajanyan2020better; aghajanyan2021muppet. However, such parallel data are expensive to obtain, hindering application to less popular domains and less spoken languages.
Unsupervised text generation has been attracting increasing interest, because it does not require parallel data for training. One widely used approach is to compress a long text into a short one, and then to reconstruct the long text with a cycle-consistency loss miao2016language; wanglee2018learning; baziotisetal2019seq. Due to the non-differentiability of the compressed sentence space, such an approach requires reinforcement learning (or its variants), which makes the training difficult kreutzeretal2021offline.
Recently, schumannetal2020discrete propose an edit-based approach for unsupervised summarization. Their model maximizes a heuristically defined scoring function that evaluates the quality (fluency and semantics) of the generated summary, achieving higher performance than cycle-consistency methods. However, the search approach is slow in inference, because hundreds of search steps are needed for each data sample. Moreover, their approach can only select words from the input sentence with the word order preserved. Thus, it is restricted and may generate noisy summaries due to the local optimality of search algorithms.
To address the above drawbacks, we propose a Non-Autoregressive approach to Unsupervised Summarization (NAUS). The idea is to perform search as in schumannetal2020discrete and, inspired by NEURIPS2020_7a677bb4, to train a machine learning model to smooth out such noise and to speed up inference. Different from NEURIPS2020_7a677bb4, we propose to utilize
non-autoregressive decoders, which generate all output tokens in parallel, due to the following observations:
- Non-autoregressive models are several times faster than autoregressive generation, which is important when the system is deployed.
- The input and output of the summarization task have a strong correspondence. Non-autoregressive generation supports encoder-only architectures, which can better utilize such input–output correspondence and even outperform autoregressive models for summarization.
- For non-autoregressive models, we can design a length-control algorithm based on dynamic programming to satisfy the constraint on output length, which is typical in summarization applications but cannot be easily achieved with autoregressive models.
We conducted experiments on the Gigaword headline generation graff2003english and DUC2004 duc2004 datasets. Experiments show that our NAUS achieves state-of-the-art performance on unsupervised summarization; especially, it outperforms its teacher (i.e., the search approach), confirming that NAUS can indeed smooth out the search noise. Regarding inference efficiency, our NAUS with truncating is more than 1000 times as efficient as the search approach; even with dynamic programming for length control, NAUS is still more than 100 times as efficient as search, and several times more efficient than autoregressive models. Our NAUS is also able to perform length-transfer summary generation, i.e., generating summaries whose lengths differ from those of the training targets.

2 Approach
In our approach, we first follow schumannetal2020discrete and obtain a summary by discrete search towards a heuristically defined objective function (§2.1). Then, we propose a non-autoregressive model for the summarization task (§2.2). We present the training strategy and the proposed length-control algorithm in §2.3.
2.1 Search-Based Summarization
Consider a given source text $\mathbf x = (x_1, \cdots, x_n)$. The goal of summarization is to find a shorter text $\mathbf y = (y_1, \cdots, y_m)$, with $m < n$, as the summary.
Our work on unsupervised summarization follows the recent progress of search-based text generation liuetal2020unsupervised; liu2021simulated; kumar2020iterative. schumannetal2020discrete formulate summarization as word-level extraction (with order preserved), and apply edit-based discrete local search to maximize a heuristically designed objective.
Specifically, the objective function considers two aspects: (1) a language fluency score $f_{\text{LM}}(\mathbf y)$, given by the reciprocal of a language model's perplexity; and (2) a semantic similarity score $f_{\text{sim}}(\mathbf x, \mathbf y)$, given by the cosine similarity of embeddings. The overall objective combines the two aspects as

$$f(\mathbf y; \mathbf x) = f_{\text{LM}}(\mathbf y) \cdot f_{\text{sim}}(\mathbf x, \mathbf y)^{\gamma}, \quad (1)$$

where $\gamma$ is a weighting hyperparameter. Interested readers are referred to schumannetal2020discrete for the details of the scoring function.
Further, the desired summary length can be specified as a hard constraint, achieved by searching only among sentences of the correct length. Suppose the desired summary length is $m$; the approach first selects $m$ random words from the input, and then maximizes the scoring function (1) by swapping the selection status of a selected word and an unselected word.
A greedy hill-climbing algorithm determines whether the change is accepted or not: a change is accepted if the score improves, and rejected otherwise. Such a process continues until a (possibly local) optimum is found.
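This hill-climbing procedure can be sketched as follows. Note that `hill_climb_summary` and `toy_score` are illustrative names, and the scoring function here is a toy stand-in (preferring longer words), not the actual fluency-plus-similarity objective of schumannetal2020discrete:

```python
import random

def hill_climb_summary(words, target_len, score, steps=200, seed=0):
    """Edit-based local search: keep exactly `target_len` source words
    (order preserved) and greedily swap a selected word for an
    unselected one whenever the objective improves."""
    rng = random.Random(seed)
    n = len(words)
    selected = sorted(rng.sample(range(n), target_len))
    best = score([words[i] for i in selected])
    for _ in range(steps):
        i = rng.choice(selected)                                  # a selected word
        j = rng.choice([k for k in range(n) if k not in selected])  # an unselected one
        candidate = sorted(set(selected) - {i} | {j})
        s = score([words[k] for k in candidate])
        if s > best:  # accept only improving swaps (greedy hill climbing)
            selected, best = candidate, s
    return [words[i] for i in selected]

# Toy objective: prefer longer (more "contentful") words.
toy_score = lambda ws: sum(len(w) for w in ws)
src = "the quick brown fox jumps over the lazy dog".split()
summary = hill_climb_summary(src, 3, toy_score)
```

Because words are indexed by position and the index set stays sorted, every output is an order-preserving extraction from the input, exactly as in the search setting described above.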
A pilot analysis in schumannetal2020discrete shows that words largely overlap between a source text and its reference summary. This explains the high performance of such a word-extraction approach, which is a state-of-the-art unsupervised summarization system, outperforming strong competitors, e.g., cycle consistency wanglee2018learning; baziotisetal2019seq.
2.2 Non-Autoregressive Model for Summarization
Despite the high performance, such edit-based search has several drawbacks. First, the search process is slow, because hundreds of local search steps are needed to obtain a high-quality summary. Second, the approach only extracts original words with order preserved. Therefore, the generated summary is restricted and may be noisy.
To this end, we propose a Non-Autoregressive approach to Unsupervised Summarization (NAUS) that learns from the search results. In this way, the machine learning model can smooth out the search noise and is much faster, largely alleviating the drawbacks of search-based summarization. Compared with training an autoregressive model from search NEURIPS2020_7a677bb4, non-autoregressive generation predicts all the words in parallel, further improving inference efficiency by several times.
Moreover, a non-autoregressive model enables us to design an encoder-only architecture, which is better suited to the summarization task due to the strong correspondence between input and output; such correspondence cannot be fully utilized by encoder–decoder models, especially autoregressive ones.
Specifically, we propose to use a multi-layer Transformer attentionisallyouneed as the non-autoregressive architecture for summarization. Each Transformer layer is composed of a multi-head attention sublayer and a feed-forward sublayer. Additionally, there is a residual connection around each sublayer, followed by layer normalization.
Let $\mathbf H^{(n)} \in \mathbb R^{L \times d}$ be the representation at the $n$th layer, where $L$ is the number of words and $d$ is the dimension. Specially, the input layer $\mathbf H^{(0)}$ is given by the embeddings of the input words. Suppose we have $K$ attention heads. The output of the $k$th head in the $n$th attention sublayer is

$$\operatorname{head}_k = \operatorname{softmax}\!\Big(\tfrac{\mathbf Q_k \mathbf K_k^\top}{\sqrt{d_k}}\Big)\,\mathbf V_k,$$

where $\mathbf Q_k$, $\mathbf K_k$, and $\mathbf V_k$ are matrices calculated by three distinct multi-layer perceptrons (MLPs) from $\mathbf H^{(n-1)}$; $d_k$ is the attention dimension. Multiple attention heads are then concatenated:

$$\operatorname{MultiHead}\big(\mathbf H^{(n-1)}\big) = \operatorname{Concat}\big(\operatorname{head}_1, \cdots, \operatorname{head}_K\big)\,\mathbf W^{O},$$

where $\mathbf W^{O}$ is a weight matrix.
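As a concrete illustration of one attention head, the scaled dot-product computation can be written in plain Python; the 2×2 matrices below are arbitrary toy values (in the model, the queries, keys, and values would come from MLPs applied to the previous layer's representation):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V, d_k):
    """One head: softmax(Q K^T / sqrt(d_k)) V, computed row by row."""
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy values: two positions, dimension 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
O = attention(Q, K, V, d_k=2)
```

Each output row is a convex combination of the rows of V, which is why attention preserves the positional correspondence the encoder-only design relies on.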
Then, we have a residual connection and layer normalization by

$$\mathbf A^{(n)} = \operatorname{LayerNorm}\big(\mathbf H^{(n-1)} + \operatorname{MultiHead}(\mathbf H^{(n-1)})\big). \quad (2)$$

Further, an MLP sublayer processes $\mathbf A^{(n)}$, followed by another residual connection and layer normalization, yielding the $n$th layer's representation

$$\mathbf H^{(n)} = \operatorname{LayerNorm}\big(\mathbf A^{(n)} + \operatorname{MLP}(\mathbf A^{(n)})\big). \quad (3)$$
The last Transformer layer $\mathbf H^{(N)}$ is fed to a softmax layer to predict the words of the summary in a non-autoregressive manner, that is, the probability at the $t$th step is given by $p_t(\cdot|\mathbf x) = \operatorname{softmax}\big(\mathbf h^{(N)}_t \mathbf W_{\text{out}}\big)$, where $\mathbf h^{(N)}_t$ is the $t$th row of the matrix $\mathbf H^{(N)}$ and $\mathbf W_{\text{out}}$ is the output weight matrix. It is emphasized that, in the vocabulary, we include a special blank token $\epsilon$, which is handled by dynamic programming during both training and inference (§2.3). This enables us to generate a summary shorter than the input with such a multi-layer Transformer.
Our model can be thought of as an encoder-only architecture, differing from a typical encoder–decoder model with cross attention attentionisallyouneed; baziotisetal2019seq; zhourush2019simple. Previously, sunon propose a seemingly similar model to ours, but put multiple end-of-sequence (EOS) tokens at the end of the generation; thus, they are unable to maintain the correspondence between input and output. Instead, we allow blank tokens $\epsilon$ scattered over the entire sentence; the residual connections in Eqns. (2) and (3) can better utilize such input–output correspondence for summarization.
2.3 Training and Inference
In this section, we first introduce the Connectionist Temporal Classification (CTC) training. Then, we propose a length-control decoding approach for summary generation.
CTC Training. The Connectionist Temporal Classification (CTC, 10.1145/1143844.1143891) algorithm allows a special blank token $\epsilon$ in the vocabulary, and uses dynamic programming to marginalize out such blank tokens, known as latent alignment sahariaetal2020non. In addition, non-autoregressive generation suffers from a common problem that words may be repeated in consecutive steps gu2017non; leedeterministic; thus, CTC merges repeated words unless separated by $\epsilon$. For example, the token sequence $\mathbf w = (\text{a}, \text{a}, \epsilon, \text{a}, \text{b})$ is reduced to the text $\mathbf y = (\text{a}, \text{a}, \text{b})$, denoted by $\Gamma(\mathbf w) = \mathbf y$.
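The reduction operation Γ (merge consecutive repeats, then drop blanks) can be sketched in a few lines of Python; here the blank is the literal string "ε":

```python
def ctc_reduce(tokens, blank="ε"):
    """Gamma: merge consecutive repeated tokens, then drop blank tokens."""
    out = []
    prev = None
    for t in tokens:
        if t != prev and t != blank:  # repeats merge unless separated by the blank
            out.append(t)
        prev = t
    return out
```

For instance, `ctc_reduce(["a", "a", "ε", "a", "b"])` yields `["a", "a", "b"]`: the first two a's merge, while the blank keeps the third "a" as a separate word.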
Concretely, the predicted likelihood is marginalized over all possible fillings of $\epsilon$, i.e., all possible token sequences that are reduced to the ground-truth text:

$$p(\mathbf y|\mathbf x) = \sum_{\mathbf w \in \Gamma^{-1}(\mathbf y)} p(\mathbf w|\mathbf x), \quad (4)$$

where $p(\mathbf w|\mathbf x) = \prod_t p_t(w_t|\mathbf x)$ is the probability of generating a sequence of tokens $\mathbf w$. Although enumerating every candidate in $\Gamma^{-1}(\mathbf y)$ is intractable, such marginalization fortunately can be computed by dynamic programming in an efficient way.
Let $\alpha_{t,s}$ be the marginal probability that the tokens $\mathbf w_{1:t}$ generated up to the $t$th decoding slot are reduced to the words $\mathbf y_{1:s}$. Moreover, $\alpha_{t,0}$ is defined to be the probability that $\mathbf w_{1:t}$ is all $\epsilon$, thus not having matched any word in $\mathbf y$. The variable $\alpha_{t,s}$ can be further decomposed into two terms $\alpha_{t,s} = \alpha^{\epsilon}_{t,s} + \alpha^{\neg\epsilon}_{t,s}$, where the first term is such probability with $w_t = \epsilon$, and the second term with $w_t \neq \epsilon$. Apparently, the initialization of the variables is

$$\alpha^{\epsilon}_{1,0} = p_1(\epsilon), \quad (5)$$
$$\alpha^{\neg\epsilon}_{1,0} = 0, \quad (6)$$
$$\alpha^{\epsilon}_{1,1} = 0, \quad (7)$$
$$\alpha^{\neg\epsilon}_{1,1} = p_1(y_1), \quad (8)$$

where $p_t(\cdot)$ is the model's predicted probability at slot $t$. Eqn. (7) is because, at the first prediction slot, the empty token $\epsilon$ does not match any target word; Eqn. (8) is because a predicted non-$\epsilon$ first token must match exactly the first target word.
The recursion formula for $\alpha^{\epsilon}_{t,s}$ is

$$\alpha^{\epsilon}_{t,s} = p_t(\epsilon)\,\big(\alpha^{\epsilon}_{t-1,s} + \alpha^{\neg\epsilon}_{t-1,s}\big),$$

since the newly predicted token $\epsilon$ with probability $p_t(\epsilon)$ does not match any target word, inheriting $\alpha_{t-1,s}$.
The recursion formula for $\alpha^{\neg\epsilon}_{t,s}$ is

$$\alpha^{\neg\epsilon}_{t,s} = \begin{cases} p_t(y_s)\,\big(\alpha^{\epsilon}_{t-1,s-1} + \alpha^{\neg\epsilon}_{t-1,s}\big), & \text{if } y_s = y_{s-1},\\[2pt] p_t(y_s)\,\big(\alpha^{\epsilon}_{t-1,s-1} + \alpha^{\neg\epsilon}_{t-1,s-1} + \alpha^{\neg\epsilon}_{t-1,s}\big), & \text{if } y_s \neq y_{s-1}. \end{cases}$$

Here, $w_t$ is not $\epsilon$, so we must have $w_t = y_s$, having the predicted probability $p_t(y_s)$.
If $y_s = y_{s-1}$, then we have two subcases: first, $\mathbf w_{1:t-1}$ is reduced to $\mathbf y_{1:s-1}$ with $w_{t-1} = \epsilon$ separating the two repeating words in $\mathbf y$, having probability $\alpha^{\epsilon}_{t-1,s-1}$; or second, $\mathbf w_{1:t-1}$ is reduced to $\mathbf y_{1:s}$ with $w_{t-1} = y_s$, having probability $\alpha^{\neg\epsilon}_{t-1,s}$, which implies we are merging $w_{t-1}$ and $w_t$.
If $y_s \neq y_{s-1}$, $\mathbf w_{1:t-1}$ is reduced to either $\mathbf y_{1:s-1}$ or $\mathbf y_{1:s}$. In the first case, $w_{t-1}$ can be either $\epsilon$ or non-$\epsilon$, given by $\alpha^{\epsilon}_{t-1,s-1} + \alpha^{\neg\epsilon}_{t-1,s-1}$. In the second case, we must have $w_{t-1} = y_s$ so that $w_t$ merges into it, which has a probability of $\alpha^{\neg\epsilon}_{t-1,s}$.
Finally, $\alpha_{T,S} = \alpha^{\epsilon}_{T,S} + \alpha^{\neg\epsilon}_{T,S}$ (for $T$ decoding slots and $S$ target words) is the marginal probability in Eqn. (4), as it is the probability that the entire generated sequence matches the entire target text.
The CTC maximum likelihood estimation is to maximize the marginal probability, which is equivalent to minimizing the loss $\mathcal L = -\log p(\mathbf y|\mathbf x)$. Since the dynamic programming formulas are differentiable, the entire model can be trained by backpropagation in an end-to-end manner with auto-differentiation tools (such as PyTorch).
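The α recursion above can be sketched and sanity-checked in plain Python. The slot-wise distributions below are arbitrary toy numbers (with the blank at index 0), not model outputs:

```python
from itertools import product

def reduce_(w, blank=0):
    """Gamma: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for t in w:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return tuple(out)

def ctc_marginal(probs, y, blank=0):
    """p(y) = sum of p(w) over all token sequences w with reduce_(w) == y,
    computed with the alpha recursion (Eqns. 5-8 and the two recursions).
    probs[t][v] is the (non-autoregressive) probability of token v at slot t+1."""
    T, S = len(probs), len(y)
    a_eps = [[0.0] * (S + 1) for _ in range(T + 1)]  # last token is the blank
    a_tok = [[0.0] * (S + 1) for _ in range(T + 1)]  # last token is a word
    a_eps[1][0] = probs[0][blank]                    # Eqn (5); Eqn (6) stays 0
    if S >= 1:
        a_tok[1][1] = probs[0][y[0]]                 # Eqn (8); Eqn (7) stays 0
    for t in range(2, T + 1):
        for s in range(S + 1):
            a_eps[t][s] = probs[t - 1][blank] * (a_eps[t - 1][s] + a_tok[t - 1][s])
            if s >= 1:
                if s >= 2 and y[s - 1] == y[s - 2]:   # repeated target word
                    inherit = a_eps[t - 1][s - 1] + a_tok[t - 1][s]
                else:                                 # distinct target word
                    inherit = (a_eps[t - 1][s - 1] + a_tok[t - 1][s - 1]
                               + a_tok[t - 1][s])
                a_tok[t][s] = probs[t - 1][y[s - 1]] * inherit
    return a_eps[T][S] + a_tok[T][S]

# Toy slot-wise distributions over vocabulary {0: blank, 1: "a", 2: "b"}.
probs = [[0.2, 0.5, 0.3],
         [0.6, 0.1, 0.3],
         [0.1, 0.7, 0.2]]
marginal = ctc_marginal(probs, (1, 2))   # p(reduce(w) == "a b")
```

With three slots and a three-token vocabulary, the marginal can also be verified by brute-force enumeration of all 27 token sequences, which is exactly what the DP avoids at scale.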
Length-Control Inference.
Controlling the output length is inherent to the summarization task, for example, when displaying a short news headline on a mobile device. Moreover, schumannetal2020discrete show that the main evaluation metric ROUGE
lin2004rouge is sensitive to the summary length: longer summaries tend to achieve higher ROUGE scores. Thus, it is crucial to control the summary length for fair comparison. We propose a length-control algorithm based on dynamic programming (DP), following the nature of CTC training. However, our DP is an approximate algorithm because of the dependencies introduced by merging consecutive repeated tokens. Thus, we equip our DP with a beam search mechanism.
We define $\mathcal B_{t,s}$ to be a set of top-$B$ sequences with $t$ predicted tokens that are reduced to $s$ words. $\mathcal B_{t,s}$ is constructed from three scenarios.
First, the blank token $\epsilon$ is predicted for the $t$th generation slot, and thus the summary length remains the same, shown by the blue arrow in Figure 2. This yields a set of candidates

$$\mathcal B^{\epsilon}_{t,s} = \big\{ \mathbf w_{1:t-1} \oplus \epsilon : \mathbf w_{1:t-1} \in \mathcal B_{t-1,s} \big\}, \quad (9)$$

where $\oplus$ refers to string/token concatenation.
Second, a repeated word is predicted for the $t$th generation slot, i.e., $w_t = w_{t-1}$ for a partial sequence $\mathbf w_{1:t}$. In this case, the summary length also remains the same, also shown by the blue arrow in Figure 2. This gives a candidate set

$$\mathcal B^{\text{rep}}_{t,s} = \big\{ \mathbf w_{1:t-1} \oplus w_{t-1} : \mathbf w_{1:t-1} \in \mathcal B_{t-1,s},\ w_{t-1} \neq \epsilon \big\}. \quad (10)$$

Third, a non-$\epsilon$, non-repeating word $w$ is generated, increasing the summary length from $s-1$ to $s$, shown by the red arrow in Figure 2. This gives

$$\mathcal B^{\text{new}}_{t,s} = \big\{ \mathbf w_{1:t-1} \oplus w : \mathbf w_{1:t-1} \in \mathcal B_{t-1,s-1},\ w \in \operatorname{top}_B \big\}, \quad (11)$$

where $\operatorname{top}_B$ selects the $B$ best non-$\epsilon$, non-repeating words by the predicted probability $p_t(\cdot)$.
Based on the three candidate sets, we select the top-$B$ sequences to keep the beam size fixed:

$$\mathcal B_{t,s} = \operatorname{top-}B\big(\mathcal B^{\epsilon}_{t,s} \cup \mathcal B^{\text{rep}}_{t,s} \cup \mathcal B^{\text{new}}_{t,s}\big), \quad (12)$$

where the sequences are ranked by their predicted joint probabilities.
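A minimal sketch of this length-control beam search follows, with the beam table stored as a dict from word count s to a list of (probability, token sequence, last token) triples; the vocabulary and slot distributions are toy assumptions:

```python
import heapq

def length_control_decode(probs, target_len, beam=5, blank=0):
    """Approximate DP of Eqns. (9)-(12): after each slot, keep the top-`beam`
    partial token sequences for every reduced length s."""
    B = {0: [(1.0, (), None)]}          # s -> [(prob, tokens, last token)]
    for p in probs:
        nB = {}
        for s, entries in B.items():
            for prob, seq, last in entries:
                # (9) emit the blank: reduced length stays s
                nB.setdefault(s, []).append((prob * p[blank], seq + (blank,), blank))
                # (10) repeat the last non-blank token (merged): length stays s
                if last is not None and last != blank:
                    nB.setdefault(s, []).append((prob * p[last], seq + (last,), last))
                # (11) emit a new non-blank, non-repeating token: length s + 1
                for v in range(len(p)):
                    if v != blank and v != last:
                        nB.setdefault(s + 1, []).append((prob * p[v], seq + (v,), v))
        # (12) prune each beam to the top-`beam` sequences by probability
        B = {s: heapq.nlargest(beam, entries) for s, entries in nB.items()}
    return max(B.get(target_len, []), default=None)

# Toy distributions over vocabulary {0: blank, 1: "a", 2: "b"} for 3 slots.
probs = [[0.1, 0.6, 0.3], [0.5, 0.2, 0.3], [0.2, 0.3, 0.5]]
best = length_control_decode(probs, target_len=2, beam=30)
```

With a beam this large the search is exhaustive for the toy problem, and the winner is the sequence ("a", blank, "b"): the blank separates two words and the result reduces to exactly the desired two-word length.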
Theorem 1.
(1) If repeating tokens are not merged, the proposed length-control algorithm with beam size $B=1$ finds the exact optimum, i.e., the most probable length-$S$ sentence given $T$ prediction slots. (2) If we merge repeating tokens predicted by CTC-trained models, the above algorithm may not be exact.
Appendix A presents the proof of the theorem and provides a more detailed analysis, showing that our length-control algorithm, despite being approximate inference, can generate a summary of the desired length properly. Compared with truncating an over-length output, our approach is able to generate more fluent and complete sentences. Also, our length-control algorithm is different from conventional beam search, as shown in Appendix C.
3 Experiments
3.1 Setup
Datasets. We evaluated our NAUS model on Gigaword headline generation and DUC2004 datasets.
The headline generation dataset Rush_2015 is constructed from the Gigaword news corpus graff2003english, where the first sentence of a news article is considered as input text and the news title is considered as the summary. The dataset contains 3.8M/198K/1951 samples for training/validation/test. Based on the analysis of the training size in Appendix B, we used 3M samples for training NAUS.
It should be emphasized that, when NAUS learns from search, we only use the input of the training corpus: we perform search schumannetal2020discrete for each input, and train our NAUS from the search results. Therefore, we do not utilize any labeled parallel data, and our approach is unsupervised.
Moreover, we considered two settings with desired summary lengths of 8 and 10, following schumannetal2020discrete. Our NAUS is trained from respective search results.
The DUC2004 dataset duc2004 is designed for testing only and contains 500 samples, where we also take the first sentence of an article as the input text. Our NAUS is transferred from the above headline generation corpus. Based on the length of DUC2004 summaries, we trained NAUS on search results with 13 words, also following schumannetal2020discrete for fair comparison.
Evaluation Metrics. We evaluated the quality of predicted summaries by ROUGE scores lin2004rouge (computed with https://github.com/tagucci/pythonrouge), which are the most widely used metrics in previous work wanglee2018learning; baziotisetal2019seq; zhourush2019simple. Specifically, ROUGE-n evaluates n-gram overlap between a predicted summary and its reference summary; ROUGE-L, instead, measures the longest common subsequence between the predicted and reference summaries.
Different ROUGE variants are adopted in previous work, depending on the dataset. We followed the standard evaluation scripts and evaluated headline generation by ROUGE F1 wanglee2018learning; baziotisetal2019seq; schumannetal2020discrete and DUC2004 by Truncate ROUGE Recall dorretal2003hedge; westetal2019bottlesum.
In addition to summary quality, we also evaluated the inference efficiency of different methods, as it is important for the deployment of deep learning models in real-time applications. We report the average inference time in seconds for each data sample, and compare the speedup with schumannetal2020discrete's search approach, which achieves the (previous) state-of-the-art ROUGE scores. Our experiments were conducted on an i9-9940X CPU and an RTX 6000 graphics card. Appendix B presents additional implementation details.

Table 1: Results on the Gigaword headline test set. Len: average summary length; ΔR: total ROUGE (R-1 + R-2 + R-L) difference from our replicated search; Inf.Time: average inference time per sample (seconds).

Group | # | Approach | Len | R-1 | R-2 | R-L | ΔR | Inf.Time | Speedup
A | 1 | Lead baseline | 7.9 | 21.39 | 7.42 | 20.03 | −11.12 | – | –
A | 2 | Search schumannetal2020discrete | 7.9 | 26.32 | 9.63 | 24.19 | +0.18 | – | –
A | 3 | Search (our replication) | 7.9 | 26.17 | 9.69 | 24.10 | 0 | 6.846 | 1x
A | 4 | sunon | 7.7 | 26.88 | 9.37 | 24.54 | +0.83 | 0.017 | 403x
A | 5 | NAUS (truncate) | 7.8 | 27.27 | 9.49 | 24.96 | +1.76 | 0.005 | 1369x
A | 6 | NAUS (length control) | 7.8 | 27.94 | 9.24 | 25.51 | +2.73 | 0.041 | 167x
B | 7 | Lead baseline | 9.8 | 23.03 | 7.95 | 21.29 | −10.20 | – | –
B | 8 | wanglee2018learning | 10.8 | 27.29 | 10.01 | 24.59 | −0.58 | – | –
B | 9 | zhourush2019simple | 9.3 | 26.48 | 10.05 | 24.41 | −1.53 | – | –
B | 10 | Search schumannetal2020discrete | 9.8 | 27.52 | 10.27 | 24.91 | +0.23 | – | –
B | 11 | Search (our replication) | 9.8 | 27.35 | 10.25 | 24.87 | 0 | 9.217 | 1x
B | 12 | sunon | 9.4 | 27.86 | 9.88 | 25.51 | +0.78 | 0.020 | 461x
B | 13 | NAUS (truncate) | 9.8 | 28.24 | 10.04 | 25.40 | +1.21 | 0.005 | 1843x
B | 14 | NAUS (length control) | 9.8 | 28.55 | 9.97 | 25.78 | +1.83 | 0.044 | 210x
3.2 Results and Analyses
Main Results. Table 1 presents the performance of our model and baselines on the Gigaword headline test set. For a fair comparison, we categorize all approaches by average summary lengths of ~8 and ~10 into Groups A and B, respectively.
The Lead baseline extracts the first several words of the input sentence. Despite its simplicity, the Lead approach is a strong summarization baseline adopted in most previous work fevryphang2018unsupervised; baziotisetal2019seq.
Table 2: Results on DUC2004 (Truncate ROUGE Recall). ΔR: total ROUGE difference from our replicated search; Time: average inference time per sample (seconds).

Model | R-1 | R-2 | R-L | ΔR | Time | Speedup
Lead baseline | 22.50 | 6.49 | 19.72 | −8.34 | – | –
zajic2004bbn | 25.12 | 6.46 | 20.12 | −5.35 | – | –
baziotisetal2019seq | 22.13 | 6.18 | 19.30 | −9.44 | – | –
westetal2019bottlesum | 22.85 | 5.71 | 19.87 | −8.62 | – | –
Search schumannetal2020discrete | 26.04 | 8.06 | 22.90 | −0.05 | – | –
Search (our replication) | 26.14 | 8.03 | 22.88 | 0 | 12.314 | 1x
sunon | 26.25 | 7.66 | 22.83 | −0.31 | 0.022 | 559x
NAUS (truncate) | 26.52 | 7.88 | 22.91 | +0.26 | 0.005 | 2463x
NAUS (length control) | 26.71 | 7.68 | 23.06 | +0.40 | 0.048 | 257x
wanglee2018learning utilize cycle consistency miao2016language for unsupervised summarization; the performance is relatively low, because the cycle-consistency loss cannot ensure the generated text is a valid summary. zhourush2019simple perform beam search towards a step-by-step decomposable score of fluency and contextual matching. Both are unable to explicitly control the summary length: in a fair comparison of length 10 (Group B, Table 1), their performance is worse than the (previous) state-of-the-art approach schumannetal2020discrete, which performs edit-based local search. (schumannetal2020discrete also present a few variants that use additional datasets for training language models in an unsupervised way; in our study, we focus on the setting without data augmentation, i.e., the language model is trained on the non-parallel Gigaword corpus.)
Our NAUS approach follows schumannetal2020discrete, but trains a nonautoregressive model from search results. We consider two settings for controlling the summary length: truncating longer summaries and decoding with our proposed lengthcontrol algorithm. Both of our variants outperform schumannetal2020discrete by 1.21–2.73 in terms of the total ROUGE score (Rows 5–6 & 13–14, Table 1). As mentioned, schumannetal2020discrete only extract original words with order preserved, yielding noisy sentences. Our NAUS, as a student, learns from the searchbased teacher model and is able to smooth out its noise. This is a compelling result, as our student model outperforms its teacher.
Regarding inference efficiency, our NAUS method with truncating is more than 1300 times faster than schumannetal2020discrete, because we do not need iterative search. Even with dynamic programming and beam search for length control, NAUS is still over 100 times faster. This shows our NAUS is extremely efficient in inference, which is important for realtime applications.
Although the efficiency of wanglee2018learning and zhourush2019simple is not available, we still expect our approach to be a few times faster (despite our higher ROUGE scores) because their models are autoregressive. By contrast, our NAUS is nonautoregressive, meaning that it predicts all words simultaneously. We will provide a controlled comparison between autoregressive and nonautoregressive models in Table 3.
Table 2 shows the results on the DUC2004 dataset. The cycle-consistency approach baziotisetal2019seq; westetal2019bottlesum does not perform well on this dataset, outperformed by an early rule-based syntax tree trimming approach zajic2004bbn and the state-of-the-art edit-based search schumannetal2020discrete.
The performance of our NAUS model is consistent with Table 1, outperforming all previous methods in terms of the total ROUGE score, and being 100–1000 times faster than the search approach schumannetal2020discrete.
In general, the proposed NAUS not only achieves stateoftheart ROUGE scores for unsupervised summarization, but also is more efficient when deployed. Results are consistent on both datasets, demonstrating the generality of our NAUS.
In-Depth Analyses. We conduct in-depth analyses of the proposed NAUS model in Table 3. Due to the limitation of time and space, we chose Gigaword headline generation as our testbed. All the autoregressive (AR) and non-autoregressive (NAR) variants learn from the search output of our replication (Rows 2 & 11), where we achieve very close results to those reported in schumannetal2020discrete.
Table 3: In-depth analyses on Gigaword headline generation (ROUGE F1). T: truncate; LC: length control; ΔR: total ROUGE difference from our replicated search.

# | Type | Approach | R-1 | R-2 | R-L | ΔR | Speedup
Group A (desired length 8)
1 | Search | schumannetal2020discrete | 26.32 | 9.63 | 24.19 | +0.18 | –
2 | Search | Our replication | 26.17 | 9.69 | 24.10 | 0 | 1x
3 | AR | Transformer (T) | 26.65 | 9.51 | 24.67 | +0.87 | 58x
4 | NAR (enc–dec) | Vanilla | 24.87 | 8.33 | 22.74 | −4.02 | 571x
5 | NAR (enc–dec) | CTC (T) | 27.30 | 9.20 | 24.96 | +1.5 | 571x
6 | NAR (enc–dec) | CTC (LC) | 27.76 | 9.13 | 25.33 | +2.26 | 149x
7 | NAR (enc-only) | sunon | 26.88 | 9.37 | 24.54 | +0.83 | 403x
8 | NAR (enc-only) | Our NAUS (T) | 27.27 | 9.49 | 24.96 | +1.76 | 1396x
9 | NAR (enc-only) | Our NAUS (LC) | 27.94 | 9.24 | 25.51 | +2.73 | 167x
Group B (desired length 10)
10 | Search | schumannetal2020discrete | 27.52 | 10.27 | 24.91 | +0.23 | –
11 | Search | Our replication | 27.35 | 10.25 | 24.87 | 0 | 1x
12 | AR | Transformer (T) | 27.06 | 9.63 | 24.55 | −1.23 | 66x
13 | NAR (enc–dec) | Vanilla | 25.77 | 8.69 | 23.52 | −4.49 | 709x
14 | NAR (enc–dec) | CTC (T) | 28.14 | 10.07 | 25.37 | +1.11 | 709x
15 | NAR (enc–dec) | CTC (LC) | 28.45 | 9.81 | 25.63 | +1.42 | 192x
16 | NAR (enc-only) | sunon | 27.86 | 9.88 | 25.51 | +0.78 | 461x
17 | NAR (enc-only) | Our NAUS (T) | 28.24 | 10.04 | 25.40 | +1.21 | 1843x
18 | NAR (enc-only) | Our NAUS (LC) | 28.55 | 9.97 | 25.78 | +1.83 | 210x
We first tried vanilla encoder–decoder NAR Transformer (Rows 4 & 13, gu2017non), where we set the number of decoding slots as the desired summary length; thus, the blank token and the lengthcontrol algorithm are not needed. As seen, a vanilla NAR model does not perform well, and CTC largely outperforms vanilla NAR in both groups (Rows 5–6 & 14–15). Such results are highly consistent with the translation literature sahariaetal2020non; imputer; gukong2021fully; qian2020glancing; huang2021non.
The proposed encoder-only NAUS model outperforms encoder–decoder ones in both groups in terms of the total ROUGE score, when the summary length is controlled by either truncating or length-control decoding (Rows 8–9 & 17–18). Profoundly, our non-autoregressive NAUS is even better than the autoregressive Transformer (Rows 3 & 12). We also experimented with previous non-autoregressive work for supervised summarization sunon in our learning-from-search setting. (To the best of our knowledge, the other two non-autoregressive supervised summarization models are yangetal2021pos and pmlrv139qi21a; their code and pretrained models are not available, making replication difficult.) Although their approach appears to be encoder-only, it adds end-of-sequence (EOS) tokens at the end of the generation, and thus is unable to utilize the input–output correspondence. Their performance is higher than vanilla NAR models, but lower than ours. By contrast, NAUS is able to capture such correspondence with the residual connections, i.e., Eqns. (2) and (3), in its encoder-only architecture.
Generally, encoder-only NAR inference (without length-control decoding) is ~2 times faster than encoder–decoder NAR and ~20 times faster than the AR Transformer. (A standard minimal encoder–decoder NAR model has 6 layers for the encoder and another 6 layers for the decoder attentionisallyouneed, whereas our NAUS only has a 6-layer encoder; our pilot study shows that more layers do not further improve performance in our encoder-only architecture.)
Further, our lengthcontrol decoding improves the total ROUGE score, compared with truncating, for both encoder–decoder CTC and encoderonly NAUS models (Rows 6, 9, 15, & 18), although its dynamic programming is slower. Nevertheless, our nonautoregressive NAUS with length control is ~200 times faster than search and ~3 times faster than the AR Transformer.
Additional Results. We present additional results in our appendices:
C. Analysis of Beam Search
D. Case Study
E. Human Evaluation
F. LengthTransfer Summarization
4 Related Work
Summarization systems can be generally categorized into two paradigms: extractive and abstractive. Extractive systems extract certain sentences and clauses from the input, for example, based on salient features zhourush2019simple or feature construction he2012document. Abstractive systems generate new utterances as the summary, e.g., by sequence-to-sequence models trained in a supervised way zhang2020pegasus; liurefsum.
Recently, unsupervised abstractive summarization has been attracting increasing attention. yangetal2020ted propose to use the Lead baseline (the first several sentences) as the pseudo-groundtruth. However, such an approach only works with well-structured articles (such as CNN/DailyMail). wanglee2018learning and baziotisetal2019seq use cycle consistency for unsupervised summarization. zhourush2019simple propose a step-by-step decomposable scoring function and perform beam search for summary generation. schumannetal2020discrete propose an edit-based local search approach, which allows a more comprehensive scoring function and outperforms cycle consistency and beam search.
Our paper follows schumannetal2020discrete, but trains a machine learning model to improve efficiency and smooth out search noise. Previously, NEURIPS2020_7a677bb4 fine-tune a GPT2 model based on search results for unsupervised paraphrasing; jolly2021search adopt the search-and-learning framework to improve the semantic coverage for few-shot data-to-text generation. We extend previous work in a nontrivial way by designing a non-autoregressive generator and further proposing a length-control decoding algorithm.
The importance of controlling the output length has recently been recognized in the summarization community. baziotisetal2019seq and sunon adopt a soft penalty to encourage shorter sentences; yangetal2021pos and pmlrv139qi21a control the summary length through POS-tag and EOS predictions. None of these studies can control the length explicitly. songetal2021new are able to precisely control the length by progressively filling a predetermined number of decoding slots, analogous to the vanilla NAR model in our non-autoregressive setting.
Non-autoregressive generation is originally proposed for machine translation gu2017non; guo2020fine; sahariaetal2020non, and is later extended to other text generation tasks. wisemanetal2018learning address the table-to-text generation task, and model output segments by a hidden semi-Markov model ostendorf1996hmm, simultaneously generating tokens for all segments. jia2021flexible apply non-autoregressive models to extractive document-level summarization. sunon stack a non-autoregressive BERT model with a conditional random field (CRF) for abstractive summarization; since the summary is shorter than the input text, their approach puts multiple end-of-sequence (EOS) tokens at the end of the sentence, and thus is unable to utilize the strong input–output correspondence in the summarization task. yangetal2021pos apply an auxiliary part-of-speech (POS) loss, and pmlrv139qi21a explore pretraining strategies for encoder–decoder non-autoregressive summarization. All these studies concern supervised summarization, while our paper focuses on unsupervised summarization. We adopt CTC training in our encoder-only architecture, allowing blank tokens to better align input and output words, which is more appropriate for summarization.

5 Conclusion
In this work, we propose a non-autoregressive unsupervised summarization model (NAUS), for which we further propose a length-control decoding algorithm based on dynamic programming. Experiments show that NAUS not only achieves state-of-the-art unsupervised performance on the Gigaword headline generation and DUC2004 datasets, but also is much more efficient than search methods and autoregressive models. Appendices present additional analyses and length-transfer experiments.
Limitation and Future Work. Our paper focuses on unsupervised summarization due to the importance of low-data applications. One limitation is that we have not obtained rigorous empirical results for supervised summarization, where the developed model may also work. This is because previous supervised summarization studies lack explicit categorization of summary lengths yangetal2020ted; pmlrv139qi21a, making comparisons unfair and problematic schumannetal2020discrete. Such an observation is also evidenced by sunon, where the same model may differ by a few ROUGE points when generating summaries of different lengths. Nevertheless, we have compared with sunon in our setting and shown the superiority of NAUS under fair comparison. We plan to explore supervised summarization in future work after we establish a rigorous experimental setup, which is beyond the scope of this paper.
6 Acknowledgments
We thank Raphael Schumann for providing valuable suggestions on the work. We also thank the Action Editor and reviewers for their comments during ACL Rolling Review. The research is supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC) under grant No. RGPIN-2020-04465, the Amii Fellow Program, the Canada CIFAR AI Chair Program, a UAHJIC project, a donation from DeepMind, and Compute Canada (www.computecanada.ca).
References
Appendix A Proof of Theorem 1
Theorem 1. (1) If repeating tokens are not merged, the proposed length-control algorithm with beam size $B=1$ finds the exact optimum, i.e., the most probable length-$S$ sentence given $T$ prediction slots. (2) If we merge repeating tokens predicted by CTC-trained models, the above algorithm may not be exact.
Proof.
[Part (1)] This part concerns a variant of our decoding algorithm, which only removes the blank token $\epsilon$ but does not merge consecutive repeated tokens into a single word, i.e., Eqn. (10) is removed. We denote this reduction by $\Gamma'$; for example, $\Gamma'(\text{a}, \text{a}, \epsilon, \text{a}, \text{b}) = (\text{a}, \text{a}, \text{a}, \text{b})$, as opposed to $\Gamma(\text{a}, \text{a}, \epsilon, \text{a}, \text{b}) = (\text{a}, \text{a}, \text{b})$ in our algorithm. We now show that, based on $\Gamma'$, our dynamic programming algorithm in §2.3 with beam size $B=1$ is an exact inference algorithm.
We define $D_{t,s} = \max_{\mathbf w_{1:t}:\, |\Gamma'(\mathbf w_{1:t})| = s} p(\mathbf w_{1:t}|\mathbf x)$, where $|\cdot|$ denotes the length of a sequence. In other words, $D_{t,s}$ is the maximum probability of $t$ tokens that are reduced to $s$ words.
According to the definition, we have

$$D_{1,0} = p_1(\epsilon), \quad (13)$$
$$D_{1,1} = \max_{w \neq \epsilon} p_1(w), \quad (14)$$
$$D_{t,s} = 0 \ \text{ for } t < s. \quad (15)$$

In (13), $D_{1,0}$ refers to the probability of one token that is reduced to zero words, in which case the first predicted token can only be the blank token $\epsilon$, corresponding to Eqn. (9) with $t=1$ and $s=0$. Likewise, $D_{1,1}$ is the maximum probability of one token that is reduced to one word. Thus, it is the probability of the most probable non-$\epsilon$ token, corresponding to Eqn. (11) with $t=1$ and $s=1$. Eqn. (15) asserts that fewer tokens cannot be reduced to more words; it is used for mathematical derivations, but need not be explicitly implemented in our algorithm in §2.3.
The recursion variable α_{s,t} is computed by
(16) α_{s,t} = max{ α_{s−1,t} · P(w_s = ε),  α_{s−1,t−1} · max_{w ≠ ε} P(w_s = w) }
In other words, the variable α_{s,t} can inherit α_{s−1,t} with a predicted blank token ε, corresponding to Eqn. (9); or it can inherit α_{s−1,t−1} with a predicted non-ε token, corresponding to Eqn. (11). In particular, if t = 0, then the second term involves the undefined α_{s−1,−1}, and thus is ignored in the max operation.
We need the max operator to take the higher probability of the two cases, since α_{s,t} is the maximum probability of s tokens being reduced to t words. This corresponds to Eqn. (12) with beam size B = 1.
To sum up, our inductive calculation guarantees that α_{S,T} is the exact maximum probability of generating a summary of the desired length T with S generation slots; our algorithm (if not merging repeating tokens) gives the corresponding token sequence w_{1:S} achieving this maximum under the same constraints, concluding the proof of Part (1).
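The recursion of Part (1) can be sketched as a short dynamic program. This is our own illustration under the stated assumptions (slot-wise independent log-probabilities, T ≤ S, beam size 1), not the paper's released implementation; the variable names `alpha` and `back` are ours:

```python
import math

def length_control_exact(log_probs, blank, T):
    """Exact length-control decoding for Part (1), where repeats are NOT
    merged: find the most probable token sequence over S slots whose
    blank-removed reduction has exactly T words. Assumes T <= S and
    slot-wise independent probabilities (log domain)."""
    S = len(log_probs)
    NEG = float("-inf")
    # alpha[s][t]: max log-prob of the first s tokens reducing to t words
    alpha = [[NEG] * (T + 1) for _ in range(S + 1)]
    back = {}
    alpha[0][0] = 0.0
    for s in range(1, S + 1):
        row = log_probs[s - 1]
        # most probable non-blank token in slot s (the Eqn. (11) case)
        best_tok = max((i for i in range(len(row)) if i != blank),
                       key=lambda i: row[i])
        for t in range(0, min(s, T) + 1):
            # case 1: slot s emits the blank token; word count unchanged
            score, choice = alpha[s - 1][t] + row[blank], blank
            # case 2: slot s emits a non-blank token; word count grows by one
            if t > 0 and alpha[s - 1][t - 1] + row[best_tok] > score:
                score, choice = alpha[s - 1][t - 1] + row[best_tok], best_tok
            alpha[s][t] = score
            back[(s, t)] = choice
    # follow back-pointers to recover the argmax token sequence
    tokens, t = [], T
    for s in range(S, 0, -1):
        tokens.append(back[(s, t)])
        if tokens[-1] != blank:
            t -= 1
    return alpha[S][T], tokens[::-1]
```

Because the slots are independent and repeats are not merged, keeping a single best candidate per (s, t) cell suffices, which is exactly the B = 1 claim of the theorem.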
[Part (2)] CTC training merges consecutive repeated tokens into a single word, unless they are separated by the blank token ε 10.1145/1143844.1143891. Since our model is trained by CTC, we should adopt this rule in inference as well. We show in this part that our algorithm, with beam size B = 1, may not yield the exact optimum, using the example in Table 4.
Word | P(w_1 = ·) | P(w_2 = ·)
I | 0.39 | 0.1
like | 0.4 | 0.9
coding | 0.1 | 0
ε | 0.11 | 0
We consider generating a sentence of two words from the two prediction slots, i.e., S = T = 2. Apparently, the optimal sequence is "I like" with probability 0.39 × 0.9 = 0.351. However, the algorithm would predict w_1 = like, because "like" is the most probable token in the first slot. Then, our algorithm will give w_2 = I, because it has to select a non-repeating token given w_1 = like, yielding the non-optimal solution "like I" with probability 0.4 × 0.1 = 0.04.
∎
It is noted that, if we do not merge repeating tokens as in Γ', our algorithm will give the exact optimum "like like" (probability 0.4 × 0.9 = 0.36) in the above example. This shows that merging consecutive repeated tokens requires the decoding algorithm to correct early predictions, and thus our dynamic programming becomes an approximate inference. Nevertheless, our algorithm is able to generate a sequence of the desired length properly; its approximation happens only when the algorithm compares more repetitions with fewer εs versus more εs with fewer repetitions. Such approximation is further alleviated by beam search in our dynamic programming. Therefore, the proposed length-control algorithm is better than truncating a longer sentence; in particular, our approach generates more fluent and complete sentences.
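The counterexample can be verified mechanically. The probabilities below are copied from Table 4, with the string "eps" standing in for the blank token; the helper names are ours:

```python
from itertools import product

# per-slot probabilities from the counterexample ("eps" = blank token)
p1 = {"I": 0.39, "like": 0.4, "coding": 0.1, "eps": 0.11}
p2 = {"I": 0.1, "like": 0.9, "coding": 0.0, "eps": 0.0}

def reduce_merge(a, b):
    """CTC reduction with merging: collapse a repeated pair, drop blanks."""
    words = [a] if a == b else [a, b]
    return [w for w in words if w != "eps"]

# exhaustive search: best token pair that reduces to exactly two words
best = max(
    (pair for pair in product(p1, p2) if len(reduce_merge(*pair)) == 2),
    key=lambda pair: p1[pair[0]] * p2[pair[1]],
)
# greedy slot-1 choice, then the best non-repeating, non-blank slot-2 token
g1 = max(p1, key=p1.get)
g2 = max((w for w in p2 if w not in (g1, "eps")), key=p2.get)
```

Here `best` is ("I", "like") with probability 0.351, while the greedy path yields ("like", "I") with only 0.04; without merging, "like like" would score 0.4 × 0.9 = 0.36.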
Appendix B Implementation Details
Our NAUS had a Transformer encoder as the basic structure, generally following the settings in attentionisallyouneed: 6 encoder layers, each having 8 attention heads. The dimension was 512 for attention modules and 2048 for feed-forward modules.
Our training used a batch size of 4K tokens, with a maximum of 200K updates. We used Adam as the optimizer. In general, the learning rate warmed up to 5e-4 in the first 10K steps, and then decayed to 1e-9 with the inverse square-root schedule, except that we found a maximum learning rate of 1e-4 worked better for headline generation with the summary length of 8. We set the weight decay to 0.01. Our length-control decoding algorithm had a beam size of 6. More details can be found in our repository (Footnote 1).
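The learning-rate schedule described above can be sketched as follows. This is a simplified illustration under our own assumptions about the warmup and floor; the exact fairseq-style implementation may differ in details:

```python
import math

def inverse_sqrt_lr(step, peak=5e-4, warmup=10_000, floor=1e-9):
    """Linear warmup to `peak` over `warmup` steps, then inverse
    square-root decay, never dropping below `floor`."""
    if step < warmup:
        return peak * step / warmup
    return max(peak * math.sqrt(warmup / step), floor)
```

For instance, the rate peaks at 5e-4 at step 10K and halves by step 40K, since sqrt(10000/40000) = 0.5.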
Our NAUS training is based on schumannetal2020discrete's predictions on the input of the Gigaword headline generation training set. We show performance against the number of training samples in Figure 3. As seen, NAUS outperforms its search teacher even with a small set of 0.1 million samples. The performance saturates as the number of samples increases. Based on this analysis, we used 3 million samples from the 3.8 million Gigaword training set to train our NAUS models.
Appendix C Analysis of Beam Search
As mentioned, our length-control decoding algorithm involves beam search within its dynamic programming, because the algorithm does not find the exact optimum when it merges repeating words. We analyze the effect of the beam size B in our length-control algorithm.
In addition, we compare our approach with CTC beam search 10.1145/1143844.1143891.⁶ Typically, a CTC-trained non-autoregressive model can be decoded either greedily or by beam search. The greedy decoding finds the most probable token at each step, i.e., ŵ_s = argmax_w P(w_s = w), and reduces the tokens to a sentence by y = Γ(ŵ_{1:S}), where S is the number of decoding steps. The CTC beam search algorithm searches for the most likely sentence by marginalizing over all token sequences that are reduced to it, i.e., ŷ = argmax_y Σ_{w_{1:S}: Γ(w_{1:S}) = y} P(w_{1:S}).
⁶ Our implementation of CTC beam search is based on https://github.com/parlance/ctcdecode
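The greedy variant described above can be sketched in a few lines (our own illustration over a per-slot log-probability matrix, with the blank given as an index):

```python
def ctc_greedy_decode(log_probs, blank):
    """Pick the argmax token per slot, then apply the CTC reduction:
    merge consecutive repeats, drop blank tokens."""
    path = [max(range(len(row)), key=row.__getitem__) for row in log_probs]
    out, prev = [], None
    for tok in path:
        # keep a token only if it differs from the previous raw token
        # and is not the blank; repeats across a blank are kept
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return out
```

Note that neither greedy decoding nor CTC beam search controls the output length, which is why we truncate their summaries for comparison.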
We show results in Figure 4, where we chose 10-word Gigaword headline generation as the testbed with our NAUS model (Group B, Table 1). Notice that CTC beam search does not control the output length, and for fair comparison, we truncated its generated summaries. This also shows that our novel decoding approach and CTC beam search are distinct algorithms.
As seen in Figure 4a, beam search does play a role in our length-control algorithm. When the beam size enlarges from 1 to 6, the performance (orange solid line) increases by 1.2 points in ∆R, the difference of total ROUGE in comparison with schumannetal2020discrete under our replication (Row 10, Table 1). However, further increasing the beam size does not yield additional performance gain. This is consistent with previous literature on autoregressive generation meister2020if, which also suggests a beam size of 5–7 is the best in their applications. In terms of efficiency (Figure 4b), a larger beam size monotonically increases the inference time. However, the overhead of beam search is relatively small in our dynamic programming, and thus we chose a beam size of 6 in our experiments.
Our length-control algorithm significantly outperforms CTC beam search (dashed blue lines) in terms of both ∆R and efficiency. In particular, CTC beam search is three times slower, and degrades more significantly than our length-control decoding when the beam size increases.
Appendix D Case Study
We show in Table 6 example summaries generated by our NAUS with truncating and length-control decoding, as well as by the previous state-of-the-art method schumannetal2020discrete. We observe that NAUS without length control generates slightly longer summaries, and if truncated, the output may be incomplete; by contrast, our length-control algorithm can generate a fluent and complete sentence of the desired length by dynamic programming. Compared with schumannetal2020discrete, our NAUS (length control) generates a more informative summary that includes the main clause (united nations condemned), which also appears in the reference summary.
Appendix E Human Evaluation
Aspect | Decoding | Wins | Ties | Loses | p-value
Overall quality | Truncate | 18.67% | 40.67% | 40.67% | 0.0004
Overall quality | Length control | 40.67% | 40.67% | 18.67% |
Fluency & completeness | Truncate | 24.67% | 26.67% | 48.67% | 0.0005
Fluency & completeness | Length control | 48.67% | 26.67% | 24.67% |
We conducted human evaluation with a focus on truncating and length-control decodings. This is because truncating may generate incomplete sentences, which cannot be adequately evaluated by automatic metrics, as their ROUGE scores are close.
Specifically, we invited three human annotators to compare the two decoding algorithms for NAUS on 50 randomly selected samples, in the setting of Group B, Table 1 (Gigaword headline generation with a target length of 10). The annotation was conducted in a pairwise manner in terms of overall quality and fluency/completeness; average results (wins/loses/ties) are shown in Table 4. It should be mentioned that our annotation was strictly blind: the samples of two systems were presented in random order and annotators did not know which system generated a sample.
As seen, our length-control decoding algorithm largely outperforms the truncating approach in terms of both overall quality and fluency/completeness. The results are statistically significant (p-values < 0.001) in a one-sided binomial test. This verifies that length-control decoding is important for summarization, as truncating yields incomplete sentences, which are inadequately reflected by ROUGE scores.
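For reference, a one-sided binomial (sign) test on pairwise preferences can be computed as below. This is a generic sketch with ties excluded, not the exact script used for the evaluation:

```python
from math import comb

def one_sided_binomial_p(wins, losses):
    """P(X >= wins) under X ~ Binomial(wins + losses, 0.5): how likely
    this many wins would be if the two systems were equally good."""
    n = wins + losses
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n
```

With hypothetical counts on the order of those in the table (e.g., 61 wins vs. 28 losses after dropping ties), the resulting p-value is well below 0.001.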
Table 7:
Group | # | Approach | Len | R-1 | R-2 | R-L | ∆R | Inf.Time (s) | Speedup
A (length 8) | 1 | Baseline | 7.9 | 21.39 | 7.42 | 20.03 | -11.12 | – | –
A | 2 | Search | 7.9 | 26.32 | 9.63 | 24.19 | +0.18 | – | –
A | 3 | Our replication | 7.9 | 26.17 | 9.69 | 24.10 | 0 | 6.846 | 1x
A | 4 | sunon | 7.7 | 26.88 | 9.37 | 24.54 | +0.83 | 0.017 | 403x
A | 5 | sunon | 8.4 | 25.71 | 8.94 | 23.65 | -1.84 | 0.018 | 380x
A | 6 | NAUS (truncate) | 7.8 | 27.27 | 9.49 | 24.96 | +1.76 | 0.005 | 1369x
A | 7 | NAUS | 7.8 | 27.94 | 9.24 | 25.50 | +2.73 | 0.041 | 167x
A | 8 | NAUS (length transfer) | 7.9 | 27.12 | 9.08 | 24.86 | +1.10 | – | –
B (length 10) | 9 | Baseline | 9.8 | 23.03 | 7.95 | 21.29 | -10.2 | – | –
B | 10 |  | 10.8 | 27.29 | 10.01 | 24.59 | -0.58 | – | –
B | 11 |  | 9.3 | 26.48 | 10.05 | 24.41 | -1.53 | – | –
B | 12 | Search | 9.8 | 27.52 | 10.27 | 24.91 | +0.23 | – | –
B | 13 | Our replication | 9.8 | 27.35 | 10.25 | 24.87 | 0 | 9.217 | 1x
B | 14 | sunon | – | – | – | – | – | – | –
B | 15 | sunon | 9.4 | 27.86 | 9.88 | 25.51 | +0.78 | 0.020 | 461x
B | 16 | NAUS (truncate) | 9.8 | 28.24 | 10.04 | 25.40 | +1.21 | 0.005 | 1843x
B | 17 | NAUS | 9.9 | 28.32 | 9.58 | 25.46 | +0.89 | 0.044 | 210x
B | 18 | NAUS (length transfer) | 9.8 | 28.55 | 9.97 | 25.78 | +1.83 | – | –
C (50% of input) | 19 | Baseline | 14.6 | 24.97 | 8.65 | 22.43 | -4.58 | – | –
C | 20 |  | 14.8 | 23.16 | 5.93 | 20.11 | -11.43 | – | –
C | 21 |  | 15.1 | 24.70 | 7.97 | 22.41 | -5.55 | – | –
C | 22 | Search | 14.9 | 27.05 | 9.75 | 23.89 | +0.06 | – | –
C | 23 | Our replication | 14.9 | 27.03 | 9.81 | 23.79 | 0 | 17.462 | 1x
C | 24 | sunon | – | – | – | – | – | – | –
C | 25 | sunon | – | – | – | – | – | – | –
C | 26 | NAUS (length transfer) | 14.9 | 28.39 | 9.78 | 24.94 | +2.48 | 0.052 | 336x
C | 27 | NAUS (length transfer) | 14.9 | 28.53 | 9.88 | 25.10 | +2.88 | – | –
Appendix F Length-Transfer Summary Generation
In the main paper, we present results where our NAUS is trained on search outputs schumannetal2020discrete that have the same length as the inference target. This follows the common assumption in machine learning that training and test samples are independently and identically distributed.
In this appendix, we show the performance of length-transfer summary generation, where the prediction has a different length from that of training. We denote such a model by NAUS_{x→y}, referring to a model trained with length-x summaries and tested with a target length of y words.
As seen in Groups A & B in Table 7, NAUS with length transfer is slightly worse than NAUS trained on the correct length, which is understandable. Nevertheless, lengthtransfer decoding still outperforms the search teacher and other baselines.
Moreover, we consider the third setting in schumannetal2020discrete, where the target length is 50% of the input. Since it takes time to obtain pseudo-groundtruths given by the edit-based search, we directly transfer already trained NAUS models to this setting by our length-control decoding. Results are shown in Group C, Table 7. We observe that the NAUS model trained with length-10 summaries is better than the one trained with length-8 summaries, which makes much sense because the latter has a larger length gap during transfer. Remarkably, both transferred NAUS models outperform schumannetal2020discrete and other baselines, achieving new state-of-the-art unsupervised performance on this setting as well.
We further compare with sunon, who use a length penalty to encourage short summaries. However, their length control only works in the statistical sense and may fail for individual samples. Moreover, such a soft length penalty cannot generate summaries longer than those seen in training. Even in this setting, their generated summaries are slightly longer than required, while their performance degrades much more considerably than NAUS.
These results show that our novel length-control decoding algorithm is not only effective when generating summaries of similar length to the training targets, but also generalizes well to different desired summary lengths without retraining. In general, our NAUS is an effective and efficient unsupervised summarization system with the ability of explicit length control.