
Learning Non-Autoregressive Models from Search for Unsupervised Sentence Summarization

Text summarization aims to generate a short summary for an input text. In this work, we propose a Non-Autoregressive Unsupervised Summarization (NAUS) approach, which does not require parallel data for training. Our NAUS first performs edit-based search towards a heuristically defined score, and generates a summary as pseudo-groundtruth. Then, we train an encoder-only non-autoregressive Transformer based on the search result. We also propose a dynamic programming approach for length-control decoding, which is important for the summarization task. Experiments on two datasets show that NAUS achieves state-of-the-art performance for unsupervised summarization, while largely improving inference efficiency. Further, our algorithm is able to perform explicit length-transfer summary generation.





1 Introduction

Text summarization is an important natural language processing (NLP) task, aiming at generating concise summaries for given texts while preserving the key information. It has extensive real-world applications such as headline generation nenkova2011automatic. In this paper, we focus on the setting of sentence summarization Rush_2015; filippova-etal-2015-sentence.

State-of-the-art text summarization models are typically trained in a supervised way with large training corpora, comprising pairs of long texts and their summaries zhang2020pegasus; aghajanyan2020better; aghajanyan2021muppet. However, such parallel data are expensive to obtain, hindering application to less popular domains and less spoken languages.

Unsupervised text generation has been attracting increasing interest, because it does not require parallel data for training. One widely used approach is to compress a long text into a short one, and then reconstruct the long text with a cycle-consistency loss miao2016language; wang-lee-2018-learning; baziotis-etal-2019-seq. Due to the non-differentiability of the compressed sentence space, such an approach requires reinforcement learning (or its variants), which makes the training difficult.


Recently, schumann-etal-2020-discrete propose an edit-based approach for unsupervised summarization. Their model maximizes a heuristically defined scoring function that evaluates the quality (fluency and semantics) of the generated summary, achieving higher performance than cycle-consistency methods. However, the search approach is slow in inference because hundreds of search steps are needed for each data sample. Moreover, their approach can only select words from the input sentence with the word order preserved. Thus, it is restricted and may generate noisy summaries due to the local optimality of search algorithms.

To address the above drawbacks, we propose a Non-Autoregressive approach to Unsupervised Summarization (NAUS). The idea is to perform search as in schumann-etal-2020-discrete and, inspired by NEURIPS2020_7a677bb4, to train a machine learning model to smooth out such noise and to speed up the inference process. Different from NEURIPS2020_7a677bb4, we propose to utilize non-autoregressive decoders, which generate all output tokens in parallel, based on the following observations:

Non-autoregressive models are several times faster than autoregressive ones, which is important when the system is deployed.

The input and output of the summarization task have a strong correspondence. Non-autoregressive generation supports encoder-only architectures, which can better utilize such input–output correspondence and even outperform autoregressive models for summarization.

For non-autoregressive models, we can design a length-control algorithm based on dynamic programming to satisfy the constraint of output lengths, which is typical in summarization applications but cannot be easily achieved with autoregressive models.

We conducted experiments on the Gigaword headline generation graff2003english and DUC2004 duc2004 datasets. Experiments show that our NAUS achieves state-of-the-art performance on unsupervised summarization; especially, it outperforms its teacher (i.e., the search approach), confirming that NAUS can indeed smooth out the search noise. Regarding inference efficiency, our NAUS with truncating is 1000 times more efficient than the search approach; even with dynamic programming for length control, NAUS is still 100 times more efficient than search and several times more efficient than autoregressive models. Our NAUS is also able to perform length-transfer summary generation, i.e., generating summaries of different lengths from training.

2 Approach

Figure 1: The overview of our NAUS approach. In each search step, input words corresponding to grey cells are selected. The blue arrow refers to the training process, and the green arrow refers to inference.

In our approach, we first follow schumann-etal-2020-discrete and obtain a summary by discrete search towards a heuristically defined objective function (§2.1). Then, we propose a non-autoregressive model for the summarization task (§2.2). We present the training strategy and the proposed length-control algorithm in §2.3.

2.1 Search-Based Summarization

Consider a given source text $\mathbf{x} = (x_1, \dots, x_n)$. The goal of summarization is to find a shorter text $\mathbf{y} = (y_1, \dots, y_T)$, with $T < n$, as the summary.

Our work on unsupervised summarization follows the recent progress of search-based text generation liu-etal-2020-unsupervised; liu2021simulated; kumar2020iterative. schumann-etal-2020-discrete formulate summarization as word-level extraction (with order preserved), and apply edit-based discrete local search to maximize a heuristically designed objective.

Specifically, the objective function considers two aspects: (1) a language fluency score $f_{\text{LM}}(\mathbf{y})$, given by the reciprocal of a language model's perplexity; and (2) a semantic similarity score $f_{\text{sim}}(\mathbf{x}, \mathbf{y})$, given by the cosine similarity of sentence embeddings. The overall objective combines the two aspects as

$f(\mathbf{y}) = f_{\text{LM}}(\mathbf{y})^{\gamma} \cdot f_{\text{sim}}(\mathbf{x}, \mathbf{y})$   (1)

where $\gamma$ is a weighting hyperparameter. Interested readers are referred to schumann-etal-2020-discrete for the details of the scoring function.

Further, the desired summary length can be specified as a hard constraint, achieved by searching only among sentences of the correct length. Suppose the desired summary length is $T$; the approach selects $T$ random words from the input as the initial summary, and maximizes the scoring function (1) by swapping the selection status of a selected word and an unselected word.

A greedy hill-climbing algorithm determines whether the change is accepted or not. In other words, a change is accepted if the score improves, or rejected otherwise. Such a process continues until a (possibly local) optimum is found.

A pilot analysis in schumann-etal-2020-discrete shows that words largely overlap between a source text and its reference summary. This explains the high performance of such a word extraction approach, being a state-of-the-art unsupervised summarization system and outperforming strong competitors, e.g., cycle consistency wang-lee-2018-learning; baziotis-etal-2019-seq.
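For concreteness, the hill-climbing search above can be sketched in a few lines of Python. Here, `hill_climb_summary` and `score_fn` are hypothetical names, and the scoring function is a stand-in for the fluency-and-similarity objective of Eqn. (1); this is a toy sketch, not the authors' implementation:

```python
import random

def hill_climb_summary(words, target_len, score_fn, max_steps=500, seed=0):
    """Edit-based hill climbing for word extraction (order preserved).

    `score_fn` stands in for the heuristic objective; only improving
    swaps are accepted, until a (possibly local) optimum is reached.
    """
    rng = random.Random(seed)
    n = len(words)
    selected = sorted(rng.sample(range(n), target_len))  # random initial summary
    best = score_fn([words[i] for i in selected])
    for _ in range(max_steps):
        # propose swapping one selected word with one unselected word
        drop = rng.choice(selected)
        add = rng.choice([i for i in range(n) if i not in selected])
        candidate = sorted((set(selected) - {drop}) | {add})
        score = score_fn([words[i] for i in candidate])
        if score > best:  # greedy acceptance: keep only improvements
            selected, best = candidate, score
    return [words[i] for i in selected]
```

With a real language-model fluency score and embedding similarity plugged into `score_fn`, this reproduces the structure of the search loop; the paper's version runs hundreds of such steps per sample.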

2.2 Non-Autoregressive Model for Summarization

Despite the high performance, such edit-based search has several drawbacks. First, the search process is slow because hundreds of local search steps are needed to obtain a high-quality summary. Second, their approach only extracts the original words with order preserved. Therefore, the generated summary is restricted and may be noisy.

To this end, we propose a Non-Autoregressive approach to Unsupervised Summarization (NAUS) by learning from the search results. In this way, the machine learning model can smooth out the search noise and is much faster, largely alleviating the drawbacks of search-based summarization. Compared with training an autoregressive model from search NEURIPS2020_7a677bb4, non-autoregressive generation predicts all the words in parallel, further improving inference efficiency by several times.

Moreover, a non-autoregressive model enables us to design an encoder-only architecture, which is better suited to the summarization task because of the strong correspondence between input and output; such correspondence cannot be fully utilized by encoder–decoder models, especially autoregressive ones.

Specifically, we propose to use a multi-layer Transformer attentionisallyouneed as the non-autoregressive architecture for summarization. Each Transformer layer is composed of a multi-head attention sublayer and a feed-forward sublayer; each sublayer has a residual connection, followed by layer normalization.

Let $\mathbf{H}^{(l)} \in \mathbb{R}^{n \times d}$ be the representation at the $l$th layer, where $n$ is the number of words and $d$ is the dimension. Specially, the input layer $\mathbf{H}^{(0)}$ is the embeddings of words. Suppose we have $h$ attention heads. The output of the $i$th head in the $l$th attention sublayer is $\mathbf{A}_i = \operatorname{softmax}\!\left(\frac{\mathbf{Q}_i \mathbf{K}_i^{\top}}{\sqrt{d_k}}\right)\mathbf{V}_i$, where $\mathbf{Q}_i$, $\mathbf{K}_i$, and $\mathbf{V}_i$ are matrices calculated by three distinct multi-layer perceptrons (MLPs) from $\mathbf{H}^{(l-1)}$; $d_k$ is the attention dimension.

Multiple attention heads are then concatenated: $\operatorname{MultiHead}(\mathbf{H}^{(l-1)}) = \operatorname{concat}(\mathbf{A}_1, \dots, \mathbf{A}_h)\,\mathbf{W}^{O}$, where $\mathbf{W}^{O}$ is a weight matrix.

Then, we have a residual connection and layer normalization by

$\mathbf{C}^{(l)} = \operatorname{LayerNorm}\big(\mathbf{H}^{(l-1)} + \operatorname{MultiHead}(\mathbf{H}^{(l-1)})\big)$   (2)

Further, an MLP sublayer processes $\mathbf{C}^{(l)}$, followed by another residual connection and layer normalization, yielding the $l$th layer's representation

$\mathbf{H}^{(l)} = \operatorname{LayerNorm}\big(\mathbf{C}^{(l)} + \operatorname{MLP}(\mathbf{C}^{(l)})\big)$   (3)
The last Transformer layer $\mathbf{H}^{(L)}$ is fed to a softmax layer to predict the words of the summary in a non-autoregressive manner, that is, the probability at the $s$th step is given by $p(w_s \mid \mathbf{x}) = \operatorname{softmax}\big(\mathbf{h}^{(L)}_s \mathbf{W}^{\text{out}}\big)$, where $\mathbf{h}^{(L)}_s$ is the $s$th row of the matrix $\mathbf{H}^{(L)}$ and $\mathbf{W}^{\text{out}}$ is the output weight matrix.

It is emphasized that, in the vocabulary, we include a special blank token $\epsilon$, which is handled by dynamic programming during both training and inference (§2.3). This enables us to generate a summary shorter than the input with such a multi-layer Transformer.
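For intuition, the scaled dot-product attention inside each head can be sketched in plain Python (a toy single-head version with plain lists; the actual model uses batched tensor operations):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, d_k):
    """softmax(Q K^T / sqrt(d_k)) V for one head; rows are lists."""
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)  # attention distribution over positions
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

Each output row is a convex combination of the value rows, weighted by how strongly the query matches each key.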

Our model can be thought of as an encoder-only architecture, differing from a typical encoder–decoder model with cross-attention attentionisallyouneed; baziotis-etal-2019-seq; zhou-rush-2019-simple. Previously, sunon propose a seemingly similar model to ours, but put multiple end-of-sequence (EOS) tokens at the end of the generation; thus, they are unable to maintain the correspondence between input and output. Instead, we allow blank tokens scattered over the entire sentence; the residual connections in Eqns. (2) and (3) can better utilize such input–output correspondence for summarization.

2.3 Training and Inference

In this section, we first introduce the Connectionist Temporal Classification (CTC) training. Then, we propose a length-control decoding approach for summary generation.

CTC Training. The Connectionist Temporal Classification (CTC, 10.1145/1143844.1143891) algorithm allows a special blank token $\epsilon$ in the vocabulary, and uses dynamic programming to marginalize out such blank tokens, known as latent alignment saharia-etal-2020-non. In addition, non-autoregressive generation suffers from a common problem that words may be repeated in consecutive steps gu2017non; leedeterministic; thus, CTC merges repeated words unless separated by $\epsilon$. For example, the token sequence $\mathbf{w} = (\text{A}, \epsilon, \epsilon, \text{B}, \text{B}, \epsilon, \text{B})$ is reduced to the text "A B B", denoted by $\Gamma(\mathbf{w})$.
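The reduction operator (written $\Gamma$ here) is straightforward to implement; a minimal sketch, with `ctc_reduce` as a hypothetical name:

```python
def ctc_reduce(tokens, blank="ε"):
    """CTC's reduction: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for tok in tokens:
        if tok != prev and tok != blank:  # keep first of a run, skip blanks
            out.append(tok)
        prev = tok
    return out
```

Note that a blank between two identical tokens prevents them from being merged, which is exactly what lets the model emit genuinely repeated words.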

Concretely, the predicted likelihood is marginalized over all possible fillings of $\epsilon$, i.e., all possible token sequences that are reduced to the groundtruth text:

$p(\mathbf{y} \mid \mathbf{x}) = \sum_{\mathbf{w} \in \Gamma^{-1}(\mathbf{y})} p(\mathbf{w} \mid \mathbf{x})$   (4)

where $p(\mathbf{w} \mid \mathbf{x}) = \prod_{s=1}^{S} p(w_s \mid \mathbf{x})$ is the probability of generating a sequence of tokens $\mathbf{w} = (w_1, \dots, w_S)$. Although enumerating every candidate in $\Gamma^{-1}(\mathbf{y})$ is intractable, such marginalization fortunately can be computed by dynamic programming in an efficient way.

Let $\alpha_{s,t}$ be the marginal probability of generating the first $t$ target words $\mathbf{y}_{1:t}$ up to the $s$th decoding slot. Moreover, $\alpha_{s,0}$ is defined to be the probability that $\mathbf{w}_{1:s}$ is all $\epsilon$, thus not having matched any word in $\mathbf{y}$. The variable can be further decomposed into two terms $\alpha_{s,t} = \alpha^{\epsilon}_{s,t} + \alpha^{\lnot\epsilon}_{s,t}$, where the first term is such probability with $w_s = \epsilon$, and the second term with $w_s \neq \epsilon$. The initialization of the variables is

$\alpha^{\epsilon}_{1,0} = p_1(\epsilon), \qquad \alpha^{\lnot\epsilon}_{1,0} = 0$   (7)

$\alpha^{\epsilon}_{1,1} = 0, \qquad \alpha^{\lnot\epsilon}_{1,1} = p_1(y_1)$   (8)
Eqn. (7) is because, at the first prediction slot, an $\epsilon$ token does not match any target word; Eqn. (8) is because a predicted non-$\epsilon$ first token must match exactly the first target word $y_1$.

The recursion formula for $\alpha^{\epsilon}_{s,t}$ is

$\alpha^{\epsilon}_{s,t} = p_s(\epsilon)\,\alpha_{s-1,t}$   (9)

since the newly predicted token $\epsilon$, with probability $p_s(\epsilon)$, does not match any target word, inheriting the probability $\alpha_{s-1,t}$.

The recursion formula for $\alpha^{\lnot\epsilon}_{s,t}$ is

$\alpha^{\lnot\epsilon}_{s,t} = \begin{cases} p_s(y_t)\,\big(\alpha^{\epsilon}_{s-1,t-1} + \alpha^{\lnot\epsilon}_{s-1,t}\big), & \text{if } y_t = y_{t-1} \\ p_s(y_t)\,\big(\alpha_{s-1,t-1} + \alpha^{\lnot\epsilon}_{s-1,t}\big), & \text{if } y_t \neq y_{t-1} \end{cases}$   (10)

Here, $w_s$ is not $\epsilon$, so we must have $w_s = y_t$, having the predicted probability $p_s(y_t)$.

If $y_t = y_{t-1}$, then we have two sub-cases: first, $\mathbf{w}_{1:s-1}$ is reduced to $\mathbf{y}_{1:t-1}$ with $w_{s-1} = \epsilon$ separating the two repeating words in $\mathbf{y}$, having probability $\alpha^{\epsilon}_{s-1,t-1}$; or second, $\mathbf{w}_{1:s-1}$ is reduced to $\mathbf{y}_{1:t}$ with $w_{s-1} = y_t$, having probability $\alpha^{\lnot\epsilon}_{s-1,t}$, which implies we are merging $w_{s-1}$ and $w_s$.

If $y_t \neq y_{t-1}$, $\mathbf{w}_{1:s-1}$ is reduced to either $\mathbf{y}_{1:t-1}$ or $\mathbf{y}_{1:t}$. In the first case, $w_{s-1}$ can be either $\epsilon$ or non-$\epsilon$, given by $\alpha_{s-1,t-1}$. In the second case, we must have $w_{s-1} = y_t$ so that $w_s$ is merged with it, which has a probability of $\alpha^{\lnot\epsilon}_{s-1,t}$.

Finally, $\alpha_{S,T}$ is the marginal probability in Eqn. (4), as it is the probability that the entire generated sequence $\mathbf{w}_{1:S}$ matches the entire target text $\mathbf{y}_{1:T}$.

The CTC maximum likelihood estimation is to maximize the marginal probability, which is equivalent to minimizing the loss $\mathcal{L} = -\log p(\mathbf{y} \mid \mathbf{x}) = -\log \alpha_{S,T}$. Since the dynamic programming formulas are differentiable, the entire model can be trained end-to-end by backpropagation with auto-differentiation tools (such as PyTorch).

Length-Control Inference.

Controlling the output length is in the nature of the summarization task, for example, when displaying a short news headline on a mobile device. Moreover, schumann-etal-2020-discrete show that the main evaluation metric ROUGE lin2004rouge is sensitive to the summary length: longer summaries tend to achieve higher ROUGE scores. Thus, it is crucial to control the summary length for fair comparison.

We propose a length-control algorithm by dynamic programming (DP), following the nature of CTC training. However, our DP is an approximate algorithm because of the dependencies introduced by removing consecutive repeated tokens. Thus, we equip our DP with a beam search mechanism.

We define $\mathcal{B}_{s,t}$ to be a set of top-$B$ sequences of $s$ predicted tokens that are reduced to $t$ words. $\mathcal{B}_{s,t}$ is constructed by three scenarios.

Figure 2: Illustration of our length-control algorithm.

First, the blank token $\epsilon$ is predicted for the $s$th generation slot, and thus the summary length remains the same, shown by the blue arrow in Figure 2. This yields a set of candidates

$\mathcal{B}^{(1)}_{s,t} = \{\mathbf{w} \oplus \epsilon : \mathbf{w} \in \mathcal{B}_{s-1,t}\}$

where $\oplus$ refers to string/token concatenation.

Second, a repeated word is predicted for the $s$th generation slot, i.e., $w_s = w_{s-1}$ for a subsequence $\mathbf{w} = (w_1, \dots, w_{s-1})$ of length $s-1$. In this case, the summary length also remains the same, also shown by the blue arrow in Figure 2. This gives a candidate set

$\mathcal{B}^{(2)}_{s,t} = \{\mathbf{w} \oplus w_{s-1} : \mathbf{w} \in \mathcal{B}_{s-1,t},\ w_{s-1} \neq \epsilon\}$
Third, a non-$\epsilon$, non-repeating word $w$ is generated, increasing the summary length from $t-1$ to $t$, shown by the red arrow in Figure 2. This gives

$\mathcal{B}^{(3)}_{s,t} = \operatorname{top-}\!B\,\{\mathbf{w} \oplus w : \mathbf{w} \in \mathcal{B}_{s-1,t-1},\ w \neq \epsilon,\ w \neq w_{s-1}\}$

where $\operatorname{top-}\!B$ selects the best $B$ elements by the probability $p(\mathbf{w} \oplus w \mid \mathbf{x})$.

Based on the three candidate sets, we select the top-$B$ sequences to keep the beam size fixed:

$\mathcal{B}_{s,t} = \operatorname{top-}\!B\big(\mathcal{B}^{(1)}_{s,t} \cup \mathcal{B}^{(2)}_{s,t} \cup \mathcal{B}^{(3)}_{s,t}\big)$

where the sequences are ranked by their predicted joint probabilities.

Theorem 1.

(1) If repeating tokens are not merged, then the proposed length-control algorithm with beam size $B$ finds the exact optimum, i.e., the most probable sentence of the desired length $T$ given the $S$ prediction slots. (2) If we merge repeating tokens predicted by CTC-trained models, the above algorithm may not be exact.

Appendix A presents the proof of the theorem and provides a more detailed analysis, showing that our length-control algorithm, although being approximate inference, can generate a summary of the desired length properly. Compared with truncating an overlength output, our approach is able to generate more fluent and complete sentences. Also, our length-control algorithm is different from conventional beam search, shown in Appendix C.
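A minimal re-implementation of the length-control beam decoding, again with per-slot probability dictionaries. `length_control_decode` is a hypothetical name, and this toy version enumerates the full vocabulary per slot, which is fine for illustration but not how a tensorized decoder would operate:

```python
def length_control_decode(probs, target_len, beam=4, blank="ε"):
    """Find a high-probability token sequence whose CTC reduction has
    exactly `target_len` words (approximate, beam-pruned).

    probs[s][tok] is p_{s+1}(tok); a sketch of the algorithm in §2.3,
    not the paper's implementation.
    """
    # beams[t]: top candidates (probability, token tuple) with t reduced words
    beams = {0: [(1.0, ())]}
    for slot in probs:
        grown = {}
        for t, candidates in beams.items():
            for p, seq in candidates:
                last = seq[-1] if seq else None
                for tok, q in slot.items():
                    if tok == blank or tok == last:
                        new_t = t       # blank or merged repeat: same length
                    else:
                        new_t = t + 1   # a new word is appended
                    if new_t > target_len:
                        continue
                    grown.setdefault(new_t, []).append((p * q, seq + (tok,)))
        # keep only the top-B candidates per reduced length
        beams = {t: sorted(c, key=lambda x: -x[0])[:beam]
                 for t, c in grown.items()}
    finals = beams.get(target_len, [])
    if not finals:
        return [], 0.0
    p, seq = max(finals, key=lambda x: x[0])
    return list(seq), p
```

Unlike truncating, every returned sequence reduces to exactly the desired number of words, so no trailing words are cut off mid-sentence.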

3 Experiments

3.1 Setup

Datasets. We evaluated our NAUS model on Gigaword headline generation and DUC2004 datasets.

The headline generation dataset Rush_2015 is constructed from the Gigaword news corpus graff2003english, where the first sentence of a news article is considered as input text and the news title is considered as the summary. The dataset contains 3.8M/198K/1951 samples for training/validation/test. Based on the analysis of the training size in Appendix B, we used 3M samples for training NAUS.

It should be emphasized that, when NAUS learns from search, we only use the input of the training corpus: we perform search schumann-etal-2020-discrete for each input, and train our NAUS from the search results. Therefore, we do not utilize any labeled parallel data, and our approach is unsupervised.

Moreover, we considered two settings with desired summary lengths of 8 and 10, following schumann-etal-2020-discrete. Our NAUS is trained from respective search results.

The DUC2004 dataset duc2004 is designed for testing only, with 500 samples; we also take the first sentence of each article as the input text. Our NAUS is transferred from the above headline generation corpus. Based on the length of DUC2004 summaries, we trained NAUS from search results with 13 words, also following schumann-etal-2020-discrete for fair comparison.

Evaluation Metrics. We evaluated the quality of predicted summaries by ROUGE scores lin2004rouge, which are the most widely used metrics in previous work wang-lee-2018-learning; baziotis-etal-2019-seq; zhou-rush-2019-simple. Specifically, ROUGE-$n$ evaluates $n$-gram overlap between a predicted summary and its reference summary; ROUGE-L, instead, measures the longest common subsequence between the predicted and reference summaries.

Different ROUGE variants are adopted in previous work, depending on the dataset. We followed the standard evaluation scripts and evaluated headline generation by ROUGE F1 wang-lee-2018-learning; baziotis-etal-2019-seq; schumann-etal-2020-discrete and DUC2004 by Truncate ROUGE Recall dorr-etal-2003-hedge; west-etal-2019-bottlesum.
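For intuition, ROUGE-$n$ F1 against a single reference can be sketched as follows (simplified: whitespace tokenization, no stemming or stopword handling, unlike the official ROUGE script; `rouge_n_f1` is a hypothetical name):

```python
from collections import Counter

def rouge_n_f1(pred, ref, n=1):
    """ROUGE-n F1: n-gram overlap between predicted and reference texts."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    p, r = ngrams(pred.split()), ngrams(ref.split())
    overlap = sum((p & r).values())   # clipped n-gram matches
    if overlap == 0:
        return 0.0
    prec = overlap / sum(p.values())
    rec = overlap / sum(r.values())
    return 2 * prec * rec / (prec + rec)
```

The recall-oriented variant used for DUC2004 keeps only `rec` (after truncating the prediction to the reference length) instead of the harmonic mean.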

In addition to summary quality, we also evaluated the inference efficiency of different methods, as it is important for the deployment of deep learning models in real-time applications. We report the average inference time in seconds for each data sample, and compare the speedup with schumann-etal-2020-discrete's search approach, which achieves (previous) state-of-the-art ROUGE scores. Our experiments were conducted on an i9-9940X CPU and an RTX6000 graphics card. Appendix B presents additional implementation details.

Group A (desired length 8)
#  Approach                              Len  R-1    R-2    R-L    ΔR      Inf.Time  Speedup
1  Lead baseline                         7.9  21.39  7.42   20.03  -11.12
2  Search schumann-etal-2020-discrete    7.9  26.32  9.63   24.19  0.18
3    Our replication                     7.9  26.17  9.69   24.10  0       6.846     1x
   Learn from search:
4    sunon                               7.7  26.88  9.37   24.54  0.83    0.017     403x
5    NAUS (truncate)                     7.8  27.27  9.49   24.96  1.76    0.005     1369x
6    NAUS (length control)               7.8  27.94  9.24   25.51  2.73    0.041     167x

Group B (desired length 10)
7  Lead baseline                         9.8  23.03  7.95   21.29  -10.2
8  wang-lee-2018-learning                10.8 27.29  10.01  24.59  -0.58
9  zhou-rush-2019-simple                 9.3  26.48  10.05  24.41  -1.53
10 Search schumann-etal-2020-discrete    9.8  27.52  10.27  24.91  0.23
11   Our replication                     9.8  27.35  10.25  24.87  0       9.217     1x
   Learn from search:
12   sunon                               9.4  27.86  9.88   25.51  0.78    0.020     461x
13   NAUS (truncate)                     9.8  28.24  10.04  25.40  1.21    0.005     1843x
14   NAUS (length control)               9.8  28.55  9.97   25.78  1.83    0.044     210x

Table 1: Results on the Gigaword headline generation test set. Len: average length of predicted summaries. R-1, R-2, R-L: ROUGE-1, ROUGE-2, ROUGE-L (F1). ΔR: the difference in total ROUGE (sum of R-1, R-2, and R-L) in comparison with the (previous) state-of-the-art search method under our replication. Inf.Time: average inference time in seconds for one sample on an i9-9940X CPU and an RTX6000 GPU. Speedup: relative to schumann-etal-2020-discrete. Rows without timing results are quoted from previous papers; the others are given by our experiments.

3.2 Results and Analyses

Main Results. Table 1 presents the performance of our model and baselines on the Gigaword headline test set. For a fair comparison, we categorize all approaches by average summary lengths of ~8 and ~10 into Groups A and B, respectively.

The Lead baseline extracts the first several words of the input sentence. Despite its simplicity, the Lead approach is a strong summarization baseline adopted in most previous work fevry-phang-2018-unsupervised; baziotis-etal-2019-seq.
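The Lead baseline is essentially a one-liner; a sketch (with `lead_baseline` as a hypothetical name):

```python
def lead_baseline(text, k=8):
    """Lead baseline: keep the first k words of the input sentence."""
    return " ".join(text.split()[:k])
```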

Model                                  R-1    R-2   R-L    ΔR     Time    Speedup
Lead baseline                          22.50  6.49  19.72  -8.34
zajic2004bbn                           25.12  6.46  20.12  -5.35
baziotis-etal-2019-seq                 22.13  6.18  19.30  -9.44
west-etal-2019-bottlesum               22.85  5.71  19.87  -8.62
Search schumann-etal-2020-discrete     26.04  8.06  22.90  -0.05
Our replication                        26.14  8.03  22.88  0      12.314  1x
sunon                                  26.25  7.66  22.83  -0.31  0.022   559x
NAUS (truncate)                        26.52  7.88  22.91  0.26   0.005   2463x
NAUS (length control)                  26.71  7.68  23.06  0.40   0.048   257x

Table 2: Results on the DUC2004 dataset (Truncate ROUGE Recall). Rows without timing results are quoted from previous papers.

wang-lee-2018-learning utilize cycle consistency miao2016language for unsupervised summarization; the performance is relatively low, because the cycle consistency loss cannot ensure the generated text is a valid summary. zhou-rush-2019-simple perform beam search towards a step-by-step decomposable score of fluency and contextual matching. Both are unable to explicitly control the summary length: in a fair comparison of length 10 (Group B, Table 1), their performance is worse than the (previous) state-of-the-art approach schumann-etal-2020-discrete, which performs edit-based local search. (schumann-etal-2020-discrete present a few variants that use additional datasets for training language models in an unsupervised way; in our study, we focus on the setting without data augmentation, i.e., the language model is trained on the non-parallel Gigaword corpus.)

Our NAUS approach follows schumann-etal-2020-discrete, but trains a non-autoregressive model from search results. We consider two settings for controlling the summary length: truncating longer summaries and decoding with our proposed length-control algorithm. Both of our variants outperform schumann-etal-2020-discrete by 1.21–2.73 in terms of the total ROUGE score (Rows 5–6 & 13–14, Table 1). As mentioned, schumann-etal-2020-discrete only extract original words with order preserved, yielding noisy sentences. Our NAUS, as a student, learns from the search-based teacher model and is able to smooth out its noise. This is a compelling result, as our student model outperforms its teacher.

Regarding inference efficiency, our NAUS method with truncating is more than 1300 times faster than schumann-etal-2020-discrete, because we do not need iterative search. Even with dynamic programming and beam search for length control, NAUS is still over 100 times faster. This shows our NAUS is extremely efficient in inference, which is important for real-time applications.

Although the efficiency of wang-lee-2018-learning and zhou-rush-2019-simple is not available, we still expect our approach to be a few times faster (despite our higher ROUGE scores) because their models are autoregressive. By contrast, our NAUS is non-autoregressive, meaning that it predicts all words simultaneously. We will provide a controlled comparison between autoregressive and non-autoregressive models in Table 3.

Table 2 shows the results on the DUC2004 dataset. The cycle-consistency approach baziotis-etal-2019-seq; west-etal-2019-bottlesum does not perform well on this dataset, outperformed by an early rule-based syntax tree trimming approach zajic2004bbn and the state-of-the-art edit-based search schumann-etal-2020-discrete.

The performance of our NAUS model is consistent with Table 1, outperforming all previous methods in terms of the total ROUGE score, and being 100–1000 times faster than the search approach schumann-etal-2020-discrete.

In general, the proposed NAUS not only achieves state-of-the-art ROUGE scores for unsupervised summarization, but also is more efficient when deployed. Results are consistent on both datasets, demonstrating the generality of our NAUS.

In-Depth Analyses. We conduct in-depth analyses on the proposed NAUS model in Table 3. Due to the limit of time and space, we chose the Gigaword headline generation as our testbed. All the autoregressive (AR) and non-autoregressive (NAR) variants learn from the search output of our replication (Rows 2 & 11), where we achieve very close results to those reported in schumann-etal-2020-discrete.

#  Approach                              R-1    R-2    R-L    ΔR     Speedup
Group A (desired length 8)
1  Search schumann-etal-2020-discrete    26.32  9.63   24.19  0.18
2  Our replication                       26.17  9.69   24.10  0      1x
3  AR Transformer (T)                    26.65  9.51   24.67  0.87   58x
4  NAR enc-dec: Vanilla                  24.87  8.33   22.74  -4.02  571x
5  NAR enc-dec: CTC (T)                  27.30  9.20   24.96  1.5    571x
6  NAR enc-dec: CTC (LC)                 27.76  9.13   25.33  2.26   149x
7  NAR enc-only: sunon                   26.88  9.37   24.54  0.83   403x
8  NAR enc-only: Our NAUS (T)            27.27  9.49   24.96  1.76   1396x
9  NAR enc-only: Our NAUS (LC)           27.94  9.24   25.51  2.73   167x
Group B (desired length 10)
10 Search schumann-etal-2020-discrete    27.52  10.27  24.91  0.23
11 Our replication                       27.35  10.25  24.87  0      1x
12 AR Transformer (T)                    27.06  9.63   24.55  -1.23  66x
13 NAR enc-dec: Vanilla                  25.77  8.69   23.52  -4.49  709x
14 NAR enc-dec: CTC (T)                  28.14  10.07  25.37  1.11   709x
15 NAR enc-dec: CTC (LC)                 28.45  9.81   25.63  1.42   192x
16 NAR enc-only: sunon                   27.86  9.88   25.51  0.78   461x
17 NAR enc-only: Our NAUS (T)            28.24  10.04  25.40  1.21   1843x
18 NAR enc-only: Our NAUS (LC)           28.55  9.97   25.78  1.83   210x

Table 3: Model analysis on headline generation (ROUGE F1). AR: autoregressive models. NAR enc-dec: non-autoregressive encoder–decoder. NAR enc-only: non-autoregressive encoder-only. T: truncating. LC: length control. All AR and NAR models use the Transformer architecture.

We first tried vanilla encoder–decoder NAR Transformer (Rows 4 & 13, gu2017non), where we set the number of decoding slots as the desired summary length; thus, the blank token and the length-control algorithm are not needed. As seen, a vanilla NAR model does not perform well, and CTC largely outperforms vanilla NAR in both groups (Rows 5–6 & 14–15). Such results are highly consistent with the translation literature saharia-etal-2020-non; imputer; gu-kong-2021-fully; qian2020glancing; huang2021non.

The proposed encoder-only NAUS model outperforms encoder–decoder ones in both groups in terms of the total ROUGE score, when the summary length is controlled by either truncating or length-control decoding (Rows 8–9 & 17–18). Remarkably, our non-autoregressive NAUS is even better than the autoregressive Transformer (Rows 3 & 12). We also experimented with previous non-autoregressive work for supervised summarization sunon (to the best of our knowledge, the other two non-autoregressive supervised summarization models are yang-etal-2021-pos and pmlr-v139-qi21a, whose code and pretrained models are not available, making replication difficult) in our learning-from-search setting. Although their approach appears to be encoder-only, it adds end-of-sequence (EOS) tokens at the end of the generation, and thus is unable to utilize the input–output correspondence. Their performance is higher than vanilla NAR models, but lower than ours. By contrast, NAUS is able to capture such correspondence with the residual connections, i.e., Eqns. (2) and (3), in its encoder-only architecture.

Generally, encoder-only NAR inference (without length-control decoding) is ~2 times faster than encoder–decoder NAR and ~20 times faster than the AR Transformer. (The standard encoder–decoder NAR model has 6 layers for the encoder and another 6 layers for the decoder attentionisallyouneed; our NAUS only has a 6-layer encoder, and our pilot study shows that more layers do not further improve performance in our encoder-only architecture.)

Further, our length-control decoding improves the total ROUGE score, compared with truncating, for both encoder–decoder CTC and encoder-only NAUS models (Rows 6, 9, 15, & 18), although its dynamic programming is slower. Nevertheless, our non-autoregressive NAUS with length control is ~200 times faster than search and ~3 times faster than the AR Transformer.

Additional Results. We present additional results in our appendices:

C. Analysis of Beam Search

D. Case Study

E. Human Evaluation

F. Length-Transfer Summarization

4 Related Work

Summarization systems can be generally categorized into two paradigms: extractive and abstractive. Extractive systems extract certain sentences and clauses from the input, for example, based on salient features zhou-rush-2019-simple or feature construction he2012document. Abstractive systems generate new utterances as the summary, e.g., by sequence-to-sequence models trained in a supervised way zhang2020pegasus; liurefsum.

Recently, unsupervised abstractive summarization is attracting increasing attention. yang-etal-2020-ted propose to use the Lead baseline (first several sentences) as the pseudo-groundtruth. However, such an approach only works with well-structured articles (such as CNN/DailyMail). wang-lee-2018-learning and baziotis-etal-2019-seq use cycle consistency for unsupervised summarization. zhou-rush-2019-simple propose a step-by-step decomposable scoring function and perform beam search for summary generation. schumann-etal-2020-discrete propose an edit-based local search approach, which allows a more comprehensive scoring function and outperforms cycle consistency and beam search.

Our paper follows schumann-etal-2020-discrete but trains a machine learning model to improve efficiency and smooth out search noise. Previously, NEURIPS2020_7a677bb4 fine-tune a GPT-2 model based on search results for unsupervised paraphrasing; jolly2021search adopt the search-and-learning framework to improve the semantic coverage for few-shot data-to-text generation. We extend previous work in a non-trivial way by designing a non-autoregressive generator and further proposing a length-control decoding algorithm.

The importance of controlling the output length has recently been recognized in the summarization community. baziotis-etal-2019-seq and sunon adopt a soft penalty to encourage shorter sentences; yang-etal-2021-pos and pmlr-v139-qi21a control the summary length through POS-tag and EOS predictions. None of these studies can control the length explicitly. song-etal-2021-new is able to precisely control the length by progressively filling a pre-determined number of decoding slots, analogous to the vanilla NAR model in our non-autoregressive setting.

Non-autoregressive generation was originally proposed for machine translation gu2017non; guo2020fine; saharia-etal-2020-non, and was later extended to other text generation tasks. wiseman-etal-2018-learning address the table-to-text generation task, modeling output segments by a hidden semi-Markov model ostendorf1996hmm and simultaneously generating tokens for all segments. jia2021flexible apply non-autoregressive models to extractive document-level summarization. sunon stack a non-autoregressive BERT model with a conditional random field (CRF) for abstractive summarization; since the summary is shorter than the input text, their approach puts multiple end-of-sequence (EOS) tokens at the end of the sentence, and thus is unable to utilize the strong input–output correspondence in the summarization task. yang-etal-2021-pos apply an auxiliary part-of-speech (POS) loss, and pmlr-v139-qi21a explore pretraining strategies for encoder–decoder non-autoregressive summarization. All these studies concern supervised summarization, while our paper focuses on unsupervised summarization. We adopt CTC training in our encoder-only architecture, allowing blank tokens to better align input and output words, which is more appropriate for summarization.

5 Conclusion

In this work, we propose a non-autoregressive unsupervised summarization model (NAUS), where we further propose a length-control decoding algorithm based on dynamic programming. Experiments show that NAUS not only achieves state-of-the-art unsupervised performance on the Gigaword headline generation and DUC2004 datasets, but is also much more efficient than search methods and autoregressive models. Appendices present additional analyses and length-transfer experiments.

Limitation and Future Work. Our paper focuses on unsupervised summarization due to the importance of low-data applications. One limitation is that we have not obtained rigorous empirical results for supervised summarization, where the developed model may also work. This is because previous supervised summarization studies lack explicit categorization of summary lengths yang-etal-2020-ted; pmlr-v139-qi21a, making comparisons unfair and problematic schumann-etal-2020-discrete. This observation is also evidenced by sunon, where the same model may differ by a few ROUGE points when generating summaries of different lengths. Nevertheless, we have compared with sunon in our setting and shown the superiority of NAUS under fair comparison. We plan to explore supervised summarization in future work after we establish a rigorous experimental setup, which is beyond the scope of this paper.

6 Acknowledgments

We thank Raphael Schumann for providing valuable suggestions on the work. We also thank the Action Editor and reviewers for their comments during ACL Rolling Review. The research is supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC) under grant No. RGPIN2020-04465, the Amii Fellow Program, the Canada CIFAR AI Chair Program, a UAHJIC project, a donation from DeepMind, and Compute Canada.


Appendix A Proof of Theorem 1

Theorem 1. (1) If repeating tokens are not merged, then the proposed length-control algorithm with beam size 1 finds the exact optimum, i.e., the most probable length-$T$ sentence given the $S$ prediction slots. (2) If we merge repeating tokens predicted by CTC-trained models, the above algorithm may not be exact.


[Part (1)] This part concerns a variant of our decoding algorithm, which only removes the blank token $\epsilon$ but does not merge consecutive repeated tokens into a single word, i.e., Eqn. (10) is removed. We denote this reduction by $\Gamma'$; for example, $\Gamma'(\text{aa}\epsilon\text{ab}) = \text{aaab}$, as opposed to $\Gamma(\text{aa}\epsilon\text{ab}) = \text{aab}$ in our algorithm. We now show that, based on $\Gamma'$, our dynamic programming algorithm in §2.3 with beam size 1 is an exact inference algorithm.

We define $\tilde\alpha_{s,t} = \max_{\mathbf{w}_{1:s}:\, |\Gamma'(\mathbf{w}_{1:s})| = t} P(\mathbf{w}_{1:s})$, where $|\cdot|$ denotes the length of a sequence. In other words, $\tilde\alpha_{s,t}$ is the maximum probability of $s$ tokens that are reduced to $t$ words.

According to the definition, we have

$\tilde\alpha_{1,0} = p_1(\epsilon)$  (13)
$\tilde\alpha_{1,1} = \max_{w \neq \epsilon} p_1(w)$  (14)
$\tilde\alpha_{s,t} = 0, \ \text{for } s < t$  (15)

In (13), $\tilde\alpha_{1,0}$ refers to the probability of one token that is reduced to zero words, in which case the first predicted token can only be the blank token $\epsilon$, corresponding to Eqn. (9) with $s = 1$ and $t = 0$. Likewise, $\tilde\alpha_{1,1}$ in (14) is the maximum probability of one token that is reduced to one word. Thus, it is the probability of the most probable non-$\epsilon$ token, corresponding to Eqn. (11) with $s = 1$ and $t = 1$. Eqn. (15) asserts that fewer tokens cannot be reduced to more words; it is used for mathematical derivations, but need not be explicitly implemented in our algorithm in §2.3.

The recursion variable is computed by

$\tilde\alpha_{s,t} = \max\{\, \tilde\alpha_{s-1,t} \cdot p_s(\epsilon),\ \ \tilde\alpha_{s-1,t-1} \cdot \max_{w \neq \epsilon} p_s(w) \,\}$  (16)

In other words, the variable $\tilde\alpha_{s,t}$ can inherit $\tilde\alpha_{s-1,t}$ with a predicted blank token $\epsilon$, corresponding to Eqn. (9); or it can inherit $\tilde\alpha_{s-1,t-1}$ with a predicted non-$\epsilon$ token, corresponding to Eqn. (11). Specially, if $t = 0$, then the second term involves the undefined $\tilde\alpha_{s-1,-1}$, and is thus ignored in the $\max$ operation.

We need the $\max$ operator to take the higher probability of the two cases, since $\tilde\alpha_{s,t}$ is the maximum probability of $s$ tokens being reduced to $t$ words. This corresponds to Eqn. (12) with beam size 1.

To sum up, our inductive calculation guarantees that $\tilde\alpha_{S,T}$ is the exact maximum probability for the desired length $T$ given $S$ generation slots; our algorithm (if not merging repeating tokens) gives the corresponding most probable sentence under the same constraints, concluding the proof of Part (1).
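The dynamic program above (the non-merging variant) can be sketched in Python. This is a minimal illustration, not the authors' implementation: `probs` is assumed to be a list of per-slot word-probability dictionaries containing a `"<b>"` blank entry, and probabilities are combined in log space with backpointers to recover the most probable sentence of the desired length.

```python
import math

def length_control_decode(probs, target_len, blank="<b>"):
    """Exact length-control decoding when repeats are NOT merged.

    alpha[s][t] is the max log-probability of the first s tokens reducing
    (blanks removed, no merging) to exactly t words; target_len <= len(probs).
    """
    S = len(probs)
    alpha = [[-math.inf] * (target_len + 1) for _ in range(S + 1)]
    back = {}
    alpha[0][0] = 0.0  # log-probability of the empty prefix
    for s in range(1, S + 1):
        dist = probs[s - 1]
        # Without merging, each slot may independently emit its best non-blank word.
        best_word = max((w for w in dist if w != blank), key=lambda w: dist[w])
        for t in range(0, min(s, target_len) + 1):
            # Case 1: slot s emits the blank token (word count stays at t).
            stay = alpha[s - 1][t] + math.log(dist[blank]) if dist[blank] > 0 else -math.inf
            # Case 2: slot s emits the best non-blank word (word count grows to t).
            grow = alpha[s - 1][t - 1] + math.log(dist[best_word]) if t > 0 else -math.inf
            if grow > stay:
                alpha[s][t], back[(s, t)] = grow, best_word
            else:
                alpha[s][t], back[(s, t)] = stay, blank
    # Trace back the argmax sentence of exactly target_len words.
    words, t = [], target_len
    for s in range(S, 0, -1):
        w = back[(s, t)]
        if w != blank:
            words.append(w)
            t -= 1
    return list(reversed(words)), alpha[S][target_len]
```

On the Table 4 probabilities, this variant returns "like like" with probability 0.36, matching the discussion of $\Gamma'$ in Part (2) below; a beam of size 1 per cell suffices because each cell keeps only the maximum.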

[Part (2)] CTC training merges consecutive repeated tokens into a single word, unless separated by the blank token $\epsilon$ 10.1145/1143844.1143891. Since our model is trained by CTC, we should adopt this rule in inference as well. We show in this part that our algorithm, with a beam size of 1, may not yield the exact optimum, using an example in Table 4.

Word              Slot 1   Slot 2
I                 0.39     0.1
like              0.4      0.9
coding            0.1      0
$\epsilon$ (blank)  0.11     0
Table 4: An example of predicted probabilities over two generation slots, where we have a vocabulary of three words and the blank token $\epsilon$.

We consider generating a sentence of two words from the two prediction slots, i.e., $S = T = 2$. Apparently, the optimal sequence is "I like" with probability $0.39 \times 0.9 = 0.351$. However, the algorithm with beam size 1 would predict "like" for the first slot because "like" is the most probable token there. Then, our algorithm will give "like I" with probability $0.4 \times 0.1 = 0.04$, because it has to select a non-repeating token based on the first prediction, yielding a non-optimal solution.

It is noted that, if we do not merge repeating tokens as in $\Gamma'$, our algorithm will give the exact optimum "like like" in the above example. This shows that merging consecutive repeated tokens requires the decoding algorithm to correct early predictions, and thus our dynamic programming becomes an approximate inference. Nevertheless, our algorithm is able to generate a sequence of the desired length properly; its approximation happens only when the algorithm compares more repetitions with fewer $\epsilon$s versus more $\epsilon$s with fewer repetitions. Such approximation is further alleviated by beam search in our dynamic programming. Therefore, the proposed length-control algorithm is better than truncating a longer sentence; in particular, our approach generates more fluent and complete sentences.
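The counterexample can be verified by brute force. The sketch below (our own illustration, using the probabilities of Table 4 with `"<b>"` as the blank symbol) enumerates every token sequence over the two slots, applies the CTC merge-and-drop rule, and confirms that the best two-word output is "I like" even though "like" dominates the first slot:

```python
from itertools import product

probs = [
    {"I": 0.39, "like": 0.4, "coding": 0.1, "<b>": 0.11},  # slot 1 (Table 4)
    {"I": 0.1, "like": 0.9, "coding": 0.0, "<b>": 0.0},    # slot 2 (Table 4)
]

def collapse(seq, blank="<b>"):
    """CTC reduction: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for tok in seq:
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return tuple(out)

# Exhaustively search all token sequences that reduce to exactly two words.
best = max(
    (seq for seq in product(*probs) if len(collapse(seq)) == 2),
    key=lambda seq: probs[0][seq[0]] * probs[1][seq[1]],
)
print(collapse(best))  # ('I', 'like'), probability 0.39 * 0.9 = 0.351
```

A greedy length-control pass that commits to "like" at slot 1 can reach at most "like I" with probability 0.04, which is why merging makes the dynamic program approximate.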

Appendix B Implementation Details

Our NAUS had a Transformer encoder as the basic structure, generally following the settings in attentionisallyouneed: 6 encoder layers, each having 8 attention heads. The dimension was 512 for attention and 2048 for feed-forward modules.

Our training used a batch size of 4K tokens, with a maximum of 200K updates. We used the Adam optimizer. In general, the learning rate warmed up to 5e-4 in the first 10K steps, and then decayed to 1e-9 with the inverse square-root schedule, except that we found a maximum learning rate of 1e-4 worked better for headline generation with a summary length of 8. We set the weight decay to 0.01. Our length-control decoding algorithm had a beam size of 6. More details can be found in our repository (Footnote 1).

Our NAUS training is based on schumann-etal-2020-discrete's predictions on the inputs of the Gigaword headline generation training set. We show performance against the number of training samples in Figure 3. As seen, NAUS outperforms its search teacher even with a small set of 0.1 million samples. The performance saturates as the number of samples increases. Based on this analysis, we used 3 million samples from the 3.8-million-sample Gigaword training set to train our NAUS models.

Figure 3: Performance versus the number of training samples in the setting of Group B, Table 1. Notice that NAUS is trained by pseudo-groundtruth given by unsupervised edit-based search schumann-etal-2020-discrete. Thus, our approach is indeed unsupervised.

Each reported number in Tables 1–3 was averaged over 10 independent runs, whereas the results in Table 7 (Appendix F) were based on a single run due to time constraints.

Appendix C Analysis of Beam Search

As mentioned, our length-control decoding algorithm involves beam search within its dynamic programming, because the algorithm does not find the exact optimum when it merges repeating words. We analyze the effect of the beam size in our length-control algorithm.

In addition, we compare our approach with CTC beam search 10.1145/1143844.1143891. Typically, a CTC-trained non-autoregressive model can be decoded either greedily or by beam search. Greedy decoding finds the most probable token at each step, i.e., $\hat{w}_s = \operatorname{argmax}_w p_s(w)$, and reduces the tokens to a sentence by $\mathbf{y} = \Gamma(\hat{w}_1 \cdots \hat{w}_S)$, where $S$ is the number of decoding steps. The CTC beam search algorithm instead searches for the most likely sentence by marginalizing over all token sequences that are reduced to it, i.e., $\hat{\mathbf{y}} = \operatorname{argmax}_{\mathbf{y}} \sum_{\mathbf{w}: \Gamma(\mathbf{w}) = \mathbf{y}} P(\mathbf{w})$.
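Greedy CTC decoding can be sketched in a few lines (a minimal illustration, not the paper's implementation; `probs` is assumed to be a list of per-slot word-probability dictionaries with a `"<b>"` blank entry). Note that, unlike our length-control algorithm, this gives no control over the output length:

```python
def ctc_greedy_decode(probs, blank="<b>"):
    """Pick the argmax token per slot, then apply the CTC reduction rule."""
    tokens = [max(dist, key=dist.get) for dist in probs]
    out, prev = [], None
    for tok in tokens:
        # Merge consecutive repeats and drop blanks.
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return out
```

Full CTC beam search additionally sums the probabilities of all token sequences collapsing to the same sentence, which is what makes it slower than our dynamic program.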

We show results in Figure 4, where we chose 10-word Gigaword headline generation as the testbed with our NAUS model (Group B, Table 1). Notice that CTC beam search does not control the output length, and for fair comparison, we truncated its generated summaries. This also shows that our novel decoding approach and CTC beam search are distinct algorithms.

Figure 4: Comparing our length-control NAUS and the truncated CTC beam search on the Gigaword headline generation test set.

As seen in Figure 4a, beam search does play a role in our length-control algorithm. When the beam enlarges from 1 to 6, the performance (orange solid line) increases by 1.2 points in ∆R, the difference of total ROUGE in comparison with schumann-etal-2020-discrete under our replication (Row 10, Table 1). However, further increasing the beam size does not yield additional performance gain. This is consistent with previous literature on autoregressive generation meister2020if, which also suggests that a beam size of 5–7 works best in their applications. In terms of efficiency (Figure 4b), a larger beam size monotonically increases the inference time. However, the overhead of beam search is relatively small in our dynamic programming, and thus we chose a beam size of 6 in our experiments.

Our length-control algorithm significantly outperforms CTC beam search (dashed blue lines) in terms of both ∆R and efficiency. In particular, CTC beam search is three times slower, and its performance degrades more significantly than our length-control decoding as the beam size increases.

Appendix D Case Study

We show in Table 6 example summaries generated by our NAUS with truncating and length-control decoding, as well as the previous state-of-the-art method schumann-etal-2020-discrete. We observe that NAUS without length control generates slightly longer summaries, and if truncated, the output may be incomplete; by contrast, our length-control algorithm can generate a fluent and complete sentence of the desired length by dynamic programming. Compared with schumann-etal-2020-discrete, our NAUS (length control) generates a more informative summary that includes the main clause (united nations condemned), which also appears in the reference summary.

Appendix E Human Evaluation

Criterion                Decoding         Wins     Ties     Loses    p-value
Overall quality          Truncate         18.67%   40.67%   40.67%   0.0004
                         Length control   40.67%   40.67%   18.67%
Completeness & fluency   Truncate         24.67%   26.67%   48.67%   0.0005
                         Length control   48.67%   26.67%   24.67%
Table 5: Human evaluation comparing truncating and length control for our NAUS model on 50 samples in the Gigaword headline generation task. The results are statistically significant, where the p-value is given by a one-sided binomial test.

We conducted human evaluation with a focus on truncating and length-control decodings. This is because truncating may generate incomplete sentences, which cannot be adequately evaluated by automatic metrics as their ROUGE scores are close.

Specifically, we invited three human annotators to compare the two decoding algorithms for NAUS on 50 randomly selected samples, in the setting of Group B, Table 1 (Gigaword headline generation with a target length of 10). The annotation was conducted in a pairwise manner in terms of overall quality and fluency/completeness; average results (wins/ties/loses) are shown in Table 5. It should be mentioned that our annotation was strictly blind: the samples of the two systems were presented in random order, and annotators did not know which system generated a sample.

As seen, our length-control decoding algorithm largely outperforms the truncating approach in terms of both overall quality and fluency/completeness. The results are statistically significant (p-values of 0.0004 and 0.0005) under a one-sided binomial test. This verifies that length-control decoding is important for summarization, as truncating yields incomplete sentences, which are inadequately reflected by ROUGE scores.
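The one-sided binomial (sign) test used above can be sketched with the standard library alone. Under the null hypothesis, each non-tie comparison is a fair coin flip, and the p-value is the probability of observing at least the reported number of wins; the counts in the comment are illustrative, not the paper's raw annotation counts:

```python
from math import comb

def one_sided_binomial_p(wins, losses):
    """P(X >= wins) for X ~ Binomial(wins + losses, 0.5), ignoring ties."""
    n = wins + losses
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

# Example with hypothetical counts: 8 wins vs 2 losses out of 10 non-ties
# gives p = (C(10,8) + C(10,9) + C(10,10)) / 2**10 = 56/1024.
```

Applying this to the per-sample win/loss counts behind Table 5 yields the reported p-values.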

Input: the united nations condemned saturday an attack on russian embassy employees in baghdad that claimed the life of one russian and resulted in the kidnapping of four others
Reference: un condemns murder of russians in iraq with annan comment
schumann-etal-2020-discrete: attack on russian embassy in baghdad claimed one in four
NAUS (truncate): an attack on russian embassy employees in baghdad claimed in kidnapping of four others
NAUS (length control): united nations condemned attack on russian embassy employees in baghdad
Table 6: Example summaries for Gigaword headline generation. The gray words are truncated for fair comparison.
 #   Approach           Len    R-1     R-2     R-L     ∆R       Inf.Time   Speedup

Group A (desired length 8)
 1   Baseline           7.9    21.39   7.42    20.03   -11.12
 2   Search             7.9    26.32   9.63    24.19     0.18
 3   Our replication    7.9    26.17   9.69    24.10     0       6.846      1x
 Learn from search:
 4   sunon              7.7    26.88   9.37    24.54     0.83    0.017      403x
 5   sunon              8.4    25.71   8.94    23.65    -1.84    0.018      380x
 6   NAUS (truncate)    7.8    27.27   9.49    24.96     1.76    0.005      1369x
 7   NAUS               7.8    27.94   9.24    25.50     2.73    0.041      167x
 8   NAUS               7.9    27.12   9.08    24.86     1.10

Group B (desired length 10)
 9   Baseline           9.8    23.03   7.95    21.29   -10.2
 10                     10.8   27.29   10.01   24.59    -0.58
 11                     9.3    26.48   10.05   24.41    -1.53
 12  Search             9.8    27.52   10.27   24.91     0.23
 13  Our replication    9.8    27.35   10.25   24.87     0       9.217      1x
 Learn from search:
 15  sunon              9.4    27.86   9.88    25.51     0.78    0.020      461x
 16  NAUS (truncate)    9.8    28.24   10.04   25.40     1.21    0.005      1843x
 17  NAUS               9.9    28.32   9.58    25.46     0.89    0.044      210x
 18  NAUS               9.8    28.55   9.97    25.78     1.83

Group C (desired length 50% of the input)
 19  Baseline           14.6   24.97   8.65    22.43    -4.58
 20                     14.8   23.16   5.93    20.11   -11.43
 21                     15.1   24.70   7.97    22.41    -5.55
 22  Search             14.9   27.05   9.75    23.89     0.06
 23  Our replication    14.9   27.03   9.81    23.79     0       17.462     1x
 Learn from search:
 25  sunon
 26  NAUS               14.9   28.39   9.78    24.94     2.48    0.052      336x
 27  NAUS               14.9   28.53   9.88    25.10     2.88

Table 7: Analysis of length-transfer summary generation. A subscript m→n refers to a model trained with m-word summaries and tested for n-word summaries. Len: Average length of predicted summaries. R-1, R-2, R-L: ROUGE-1, ROUGE-2, ROUGE-L. ∆R: The difference of total ROUGE (sum of R-1, R-2, and R-L) in comparison with the (previous) state-of-the-art model schumann-etal-2020-discrete under our replication. Inf.Time: Average inference time in seconds for one sample on an i9-9940X CPU and an RTX6000 GPU. Speedup: Relative to schumann-etal-2020-discrete. Some results are quoted from previous papers; the others are given by our experiments. sunon's approach has a soft length penalty to encourage short output, but cannot generate longer summaries than trained.

Appendix F Length-Transfer Summary Generation

In the main paper, we present results where our NAUS is trained on search outputs schumann-etal-2020-discrete that have the same length as the inference target. This follows the common assumption in machine learning that training and test samples are independently and identically distributed.

In this appendix, we show the performance of length-transfer summary generation, where the prediction has a different length from that of training. We denote such a model by NAUS_{m→n}, referring to training with m-word summaries and testing for n-word summaries.

As seen in Groups A & B in Table 7, NAUS with length transfer is slightly worse than NAUS trained on the correct length, which is understandable. Nevertheless, length-transfer decoding still outperforms the search teacher and other baselines.

Moreover, we consider the third setting in schumann-etal-2020-discrete, where the target length is 50% of the input. Since it takes time to obtain pseudo-groundtruths by edit-based search, we directly transferred already trained NAUS models to this setting with our length-control decoding. Results are shown in Group C, Table 7. We observe that the NAUS model trained with 10-word summaries transfers better than the one trained with 8-word summaries, which makes sense because the latter has a larger length gap during transfer. Remarkably, both transferred NAUS models outperform schumann-etal-2020-discrete and other baselines, achieving new state-of-the-art unsupervised performance in this setting as well.

We further compare with sunon, who use a length penalty to encourage short summaries. However, their length control works only in the statistical sense and may fail for individual samples. Moreover, such a soft length penalty cannot generate summaries longer than those seen in training. Even when transferring to shorter summaries, their generated summaries are slightly longer than required (8.4 words on average in Group A), while the performance degrades much more considerably than NAUS.

These results show that our novel length-control decoding algorithm is not only effective when generating summaries of similar length to the training targets, but also generalizes well to different desired summary lengths without re-training. In general, our NAUS is an effective and efficient unsupervised summarization system with the ability of explicit length control.