Unsupervised Abstractive Sentence Summarization using Length Controlled Variational Autoencoder

09/14/2018 ∙ by Raphael Schumann, et al. ∙ University of Heidelberg

In this work we present an unsupervised approach to summarize sentences in an abstractive way using a Variational Autoencoder (VAE). VAEs are known to learn a semantically rich latent variable that represents the high-dimensional input. They are trained by learning to reconstruct the input from the probabilistic latent variable. Explicitly providing information about the output length during training discourages the VAE from encoding this information in the latent variable, so the length can be manipulated during inference. Instructing the decoder to produce a shorter output sequence leads to expressing the input sentence with fewer words. We show on different summarization data sets that these shorter sentences cannot beat a simple baseline but yield higher ROUGE scores than trying to reconstruct the whole sentence.




1 Introduction

The increasing amount of text data in the digital age calls for methods to reduce reading time while maintaining information content. Summarization achieves this by deleting, generalizing or paraphrasing fragments of the input text. Summarization methods can be categorized into single- or multi-document and extractive or abstractive approaches. In contrast to the single-document setup Rush et al. (2015), the multi-document setup can leverage the fact that in some domains, like news articles, different sources describe the same event Banerjee et al. (2016); Haghighi and Vanderwende (2009). Extractive methods rely solely on the words of the input and, for example, extract whole sentences Erkan and Radev (2004); Parveen and Strube (2015) or recombine phrases on the sentence level Banerjee et al. (2016). Abstractive approaches, on the other hand, are rarely bound by such constraints and have gained a lot of traction due to recent advances in machine translation, like the encoder-decoder framework Sutskever et al. (2014); Paulus et al. (2017) or the attention mechanism Bahdanau et al. (2014); Rush et al. (2015); Paulus et al. (2017).

Another, more general distinction is the need for supervision. Supervised methods require training pairs of input text and output summary Paulus et al. (2017); Rush et al. (2015), whereas unsupervised methods exploit inherent properties of the input, like the frequency of phrases Banerjee et al. (2016) or centrality Erkan and Radev (2004).

In this work we use a Variational Autoencoder (VAE) Kingma and Welling (2013); Bowman et al. (2016) and control the decoding length Kikuchi et al. (2016) to obtain a shortened version of an input sentence. VAEs are trained without supervision, and decoding makes use of the whole available vocabulary. The paper is organized as follows. Section 2 gives background on the technologies and concepts used. Section 3 describes the architecture of our model. The data used for the experiments in section 5 is outlined in section 4. Finally, we report the results in section 6.

2 Background

2.1 Variational Autoencoder

The Variational Autoencoder (VAE) is a generative model first introduced by Kingma and Welling (2013). Like regular autoencoders, VAEs learn a mapping $q(z|x)$ from a high-dimensional input $x$ to a low-dimensional latent variable $z$. Instead of doing this in a deterministic way, the VAE imposes a prior distribution on $z$, e.g. a standard Gaussian:

$$p(z) = \mathcal{N}(0, I) \qquad (1)$$

The desired effect is that each area in the $z$ space gets a semantic meaning, so that samples from $p(z)$ can be decoded in a meaningful way. The decoder $p(x|z)$ is trained to reconstruct the input $x$ based on the latent variable $z$. In order to train the model via gradient descent, the reparameterization trick Kingma and Welling (2013) was introduced. This trick allows the gradient to flow through the sampling of $z$ (Formula 1) by moving the stochastic operation into an auxiliary noise variable $\epsilon$. Let $\mu$ and $\sigma$ be deterministic outputs of the encoder for the input $x$:

$$z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \qquad (2)$$

where $\odot$ is the element-wise product. To prevent the model from pushing $\sigma$ close to $0$ and thus falling back to a regular autoencoder, the objective is extended by the Kullback-Leibler (KL) divergence between the prior $p(z)$ and $q(z|x)$:

$$\mathcal{L}(x) = -\mathrm{KL}\left(q(z|x) \,\|\, p(z)\right) + \mathbb{E}_{q(z|x)}\left[\log p(x|z)\right] \qquad (3)$$

The goal is to have a non-zero, but not out-of-control, KL term while maintaining a reasonable reconstruction loss. This encourages a semantically rich latent variable and good generation ability.
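
The sampling step and the KL term can be illustrated with a minimal Python sketch. It assumes a diagonal Gaussian posterior and a standard Gaussian prior, for which the KL divergence has a well-known closed form. The function names and the example values are hypothetical and not taken from the paper.

```python
import math
import random


def reparameterize(mu, sigma, rng):
    """Sample z = mu + sigma * eps element-wise, with eps ~ N(0, 1).

    The randomness lives only in eps, so z stays a deterministic
    (and differentiable) function of mu and sigma.
    """
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(
        m * m + s * s - 1.0 - math.log(s * s) for m, s in zip(mu, sigma)
    )


# Hypothetical encoder outputs for a 2-dimensional latent variable.
rng = random.Random(0)
mu, sigma = [0.5, -1.0], [1.0, 0.5]
z = reparameterize(mu, sigma, rng)
kl = kl_to_standard_normal(mu, sigma)
```

The KL term is zero exactly when the posterior equals the prior (all means 0, all standard deviations 1), which is the degenerate case the training procedure tries to avoid.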

2.2 Controlling Output Length

There are different methods for controlling the output length in an encoder-decoder model. One of them is LenEmb Kikuchi et al. (2016), where the decoder is fed information about the remaining length $l_t$ at every decoding step $t$. This information is encoded in an embedding matrix that is indexed by $l_t$ and learned during training. Instead of measuring the remaining length in bytes, we use a more straightforward approach and count whole words. At each decoding step, the length embedding is concatenated to the input and chosen as follows:

$$l_1 = L, \qquad l_{t+1} = \max(l_t - 1,\ 0) \qquad (4)$$

where $L$ is the desired length. This encourages the decoder to fit the remaining information into the remaining words. Kikuchi et al. (2016) show in a supervised summarization setup that setting $L$ to the desired number of output bytes, conveniently the 75 bytes of the references, yields better performance during evaluation.
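
The word-level countdown described above can be sketched in a few lines of Python. This is an illustration under assumptions: the remaining length starts at the desired length, decreases by one per emitted word and is floored at zero. The function names and the tiny embedding table are hypothetical; in the model the table is learned during training.

```python
def remaining_lengths(desired_length, num_steps):
    """Return the remaining word budget fed to the decoder at each step."""
    lengths = []
    remaining = desired_length
    for _ in range(num_steps):
        lengths.append(remaining)
        # One word is emitted per step; the budget never goes below zero.
        remaining = max(remaining - 1, 0)
    return lengths


def with_length_embedding(step_input, length_table, remaining):
    """Concatenate the embedding row for `remaining` to the decoder input."""
    return step_input + length_table[remaining]


# Placeholder table with one row per possible remaining length (0, 1, 2).
length_table = [[0.0], [0.5], [0.9]]
schedule = remaining_lengths(desired_length=2, num_steps=4)
first_input = with_length_embedding([1.0, 2.0], length_table, schedule[0])
```

At inference time only `desired_length` changes; the decoder weights and the embedding table stay fixed.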

3 Model

In order to apply the VAE principle to text data, Bowman et al. (2016) employ RNNs as encoder and decoder. The vectors $\mu$ and $\sigma$ are constructed from the last hidden state of the encoder, and the first cell state of the decoder is initialized with $z$. Since then, many improvements to this basic architecture have been published, and several are adopted in this work. First of all, we use a bidirectional encoder which reads forward and backward through the input sequence $x$. At each encoding step $t$, the forward and backward hidden states $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ are concatenated to $h_t$. Vani and Birodkar (2016) then calculate $\mu$ and $\sigma$ from the mean of all hidden states $h_t$, arguing that this produces a better sequence representation and that the gradient reaches every input vector more easily. This is depicted in Figure 1.

Besides the reconstruction loss of the input sequence, Zhao et al. (2017) introduce a so-called bag-of-words loss. A $|V|$-dimensional vector is predicted by a feed-forward layer which takes $z$ as input, where $|V|$ is the vocabulary size. This vector is compared against a label that marks which vocabulary words occur in the input sentence (its bag-of-words representation). This forces the model to put more general information into the latent variable instead of encoding only the start of a sentence and deriving the rest by memorizing word order in the decoder. As seen in Figure 2, the multi-layer RNN is fed the latent variable at every decoding step, which again gives the gradient an easier path to flow back. Additionally, the last emitted word and the length embedding (see section 2.2) are concatenated to the input. To speed up training, sampled softmax Jean et al. (2015) is used to estimate the softmax function at each decoding output.
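
Two of the components above, the mean over encoder states and the bag-of-words target, are simple enough to sketch in plain Python. The loss shown here (the negative log-likelihood of the sentence's words under a single softmax over the vocabulary) is an assumption about the exact form; the text only states that the predicted vector is compared against the label. All names are hypothetical.

```python
import math


def mean_pool(hidden_states):
    """Average the per-step hidden states into one sentence vector."""
    n = len(hidden_states)
    return [sum(column) / n for column in zip(*hidden_states)]


def bag_of_words_target(token_ids, vocab_size):
    """Mark every vocabulary entry that occurs in the sentence with 1."""
    target = [0.0] * vocab_size
    for idx in set(token_ids):
        target[idx] = 1.0
    return target


def bag_of_words_loss(logits, token_ids):
    """Assumed loss: sum of -log softmax(logits)[w] over the sentence words."""
    max_logit = max(logits)  # subtract the maximum for numerical stability
    log_norm = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return -sum(logits[i] - log_norm for i in token_ids)


# Two encoder steps with 2-dimensional (already concatenated) states.
sentence_vector = mean_pool([[1.0, 3.0], [3.0, 5.0]])
target = bag_of_words_target([1, 1, 3], vocab_size=5)
loss = bag_of_words_loss([0.0, 0.0, 0.0, 0.0], [0, 2])
```

Because the bag-of-words prediction depends on the latent variable only, and not on the decoder state, it cannot be satisfied by memorizing word order.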

Figure 1: VAE Encoder with bidirectional RNN and mean representation of the input
Figure 2: VAE Decoder with bag-of-word loss and LenEmb

4 Data

The data setup is similar to Rush et al. (2015). For training, they use 4 million pairs of title and first sentence of the article from the Gigaword data set Graff et al. (2003). As we do not need supervision, we remove the titles and, due to resource limitations, also remove all sentences with more than 30 words. The remaining 1.8 million training sentences are preprocessed by lower-casing and tokenizing all words. Additionally, numbers are replaced by # and words not among the 40,000 most frequent are replaced by an UNK token. For evaluation we use the roughly 2,000 held-out article-title pairs from Gigaword as well as the DUC-2004 set Over et al. (2007). The latter consists of 500 news articles from the New York Times and Associated Press Wire services and comes with 4 different human-written reference summaries (capped at 75 bytes).
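
The preprocessing steps can be sketched as follows. This is a simplified illustration, not the pipeline used for the experiments: it assumes whitespace tokenization, replaces every digit character with #, and builds the vocabulary from the sentences that survive the length filter. All function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lower-case, split on whitespace and mask each digit with '#'."""
    return [re.sub(r"\d", "#", token) for token in sentence.lower().split()]


def build_vocab(sentences, max_size):
    """Keep the `max_size` most frequent tokens."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, _ in counts.most_common(max_size)}


def preprocess(raw_sentences, max_len=30, vocab_size=40000):
    """Tokenize, drop long sentences and replace rare words by UNK."""
    tokenized = [tokenize(s) for s in raw_sentences]
    kept = [s for s in tokenized if len(s) <= max_len]
    vocab = build_vocab(kept, vocab_size)
    return [[tok if tok in vocab else "UNK" for tok in s] for s in kept]


corpus = preprocess(["The cat sat 12 times", "The cat ran"], vocab_size=2)
```

With a vocabulary of only two words, everything except "the" and "cat" is mapped to UNK in this toy corpus.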

5 Experiment

We train the proposed model on the data presented above by maximizing the objective in Formula 3. To obtain a shortened version of the input sentence during testing, we set the initial remaining length of the length embedding (section 2.2) to the desired length. Our assumption is that the decoder tries to fit all the information present in the latent variable into the limited number of output words, either by skipping meaningless words or by rephrasing semantic content with fewer tokens, while the implicit language model ensures a grammatically correct sentence.

5.1 Baseline

We use Prefix as baseline, which takes the first 75 characters of the input sentence as the summary. This simple baseline shows to what extent our model is able to pass the information of the input sentence through the low-dimensional latent variable.
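
For reference, the Prefix baseline amounts to a single slice. The sketch below assumes that characters and bytes coincide, which holds for ASCII text; the function name is hypothetical.

```python
def prefix_baseline(sentence, max_chars=75):
    """Return the first `max_chars` characters of the input as the summary."""
    return sentence[:max_chars]


summary = prefix_baseline("x" * 100)
```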

5.2 Training Details

Similar to Bowman et al. (2016), a weight for the KL term in the objective function is annealed from 0 to 1 during training. This prevents the model from taking the easy way out and setting the KL term to $0$ by letting $q(z|x)$ be equal to $p(z)$. In that case no information would be encoded in $z$ and the VAE would degenerate to a regular language model. Another technique to overcome this is to drop the previously emitted word during decoding, which makes the decoder rely more heavily on the latent variable.
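
Both techniques can be sketched in Python. The linear annealing shape, the replacement of dropped words by a placeholder id, and all names below are assumptions made for illustration; the text only states that the weight moves from 0 to 1 and that previous words are dropped.

```python
import random


def kl_weight(step, anneal_steps):
    """Assumed linear schedule: 0 at step 0, 1 from `anneal_steps` onward."""
    return min(1.0, step / anneal_steps)


def annealed_objective(reconstruction_log_lik, kl, step, anneal_steps):
    """Objective to maximize: reconstruction term minus the weighted KL term."""
    return reconstruction_log_lik - kl_weight(step, anneal_steps) * kl


def drop_words(token_ids, keep_prob, placeholder_id, rng):
    """Keep each previous-word input with `keep_prob`, else use a placeholder."""
    return [
        tok if rng.random() < keep_prob else placeholder_id for tok in token_ids
    ]


rng = random.Random(0)
halfway = annealed_objective(-10.0, 4.0, step=50, anneal_steps=100)
decoder_inputs = drop_words([5, 6, 7], keep_prob=0.2, placeholder_id=0, rng=rng)
```

Early in training the KL term is almost free, so the encoder can start placing information in the latent variable before the penalty reaches full strength.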

The LSTM cell Hochreiter and Schmidhuber (1997) is used as the basic RNN unit. Optimization is done with Adam Kingma and Ba (2014), and sampled softmax draws 1000 words. Beam search size is set to 100 and batch size to 512. The number of desired output words is set to 20 to reliably reach the 75 bytes of the reference summaries. All other hyperparameters are searched by Bayesian optimization (https://scikit-optimize.github.io/). The encoder and decoder RNN cell size is 243. The word embedding size is 254 and the latent variable has 124 dimensions. A 236-wide hidden layer predicts the bag-of-words vector. The best size for the length embeddings is found to be 50. During decoding, the previous word is kept (not dropped) with a probability of 0.20, and the output layer of the RNN cells is regularized by a dropout keep rate of 0.87.

Figure 3: The annealing of the KL term weight during training steps and the reaction of KL term value

6 Results

                    DUC-2004                 Gigaword
                R-1     R-2     R-L      R-1     R-2     R-L     Ext. %
Prefix         22.43    6.49   19.65    23.14    8.25   21.73     100
no len limit   14.49    2.06   12.28    19.91    4.14   18.02      51
LenEmb 20      16.38    2.56   14.19    22.19    4.56   19.88      60
Table 1: ROUGE-1, ROUGE-2 and ROUGE-L on the DUC-2004 and Gigaword evaluation sets. no len limit decodes the input sentence without modifying the length. LenEmb 20 sets the desired length to 20 output words. Ext. % reports the percentage of words extracted from the input.
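
One plausible way to compute such an extraction percentage is sketched below. The exact definition is not spelled out in the text, so this version, which counts the share of output tokens that also occur in the input, is an assumption; the function name is hypothetical.

```python
def extraction_percentage(input_tokens, output_tokens):
    """Percentage of output tokens that also appear in the input sentence."""
    if not output_tokens:
        return 0.0
    source = set(input_tokens)
    copied = sum(1 for token in output_tokens if token in source)
    return 100.0 * copied / len(output_tokens)


source_sentence = "the cat sat on the mat".split()
copied_share = extraction_percentage(source_sentence, "the cat slept".split())
```

Under this definition a prefix of the input scores 100 whenever it is cut at a word boundary, which matches the Prefix row.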

6.1 Evaluation Metric

ROUGE Lin (2004) is an n-gram based evaluation metric to quantify the quality of a summary relative to given references. We report results on ROUGE-1 and ROUGE-2, which count the unigram and bigram overlap, respectively. Furthermore, the ROUGE-L score is based on the longest common subsequence (LCS) between the given texts. ROUGE is only an indicator of whether an automatically generated summary is as good as a human-written reference and should be handled with caution.
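
The overlap counts behind these scores can be illustrated with a simplified, recall-only Python sketch for a single reference. It is not the official ROUGE toolkit, which offers further options such as stemming and multiple references; the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n):
    """Share of reference n-grams that also occur in the candidate."""
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """LCS length relative to the reference length."""
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0


candidate = "the cat sat on the mat".split()
reference = "the cat was on the mat".split()
r1 = rouge_n_recall(candidate, reference, 1)
r2 = rouge_n_recall(candidate, reference, 2)
rl = rouge_l_recall(candidate, reference)
```

In this toy pair, five of six reference unigrams and three of five reference bigrams are matched, and the longest common subsequence has five words.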

6.2 Quantitative Evaluation

Before discussing the summarization results, we take a look at how LenEmb affects the model. Figures 4 and 5 show the output length of the model without length restrictions and of the one with a desired length of 20 words. The distribution in Figure 4 is about the same as that of the input sentences. Figure 5 shows that we are able to reduce the output length to near the desired 75 characters. In fact, 20 words are chosen so that the majority of outputs lie slightly above 75 characters, in order not to waste space during ROUGE evaluation.

We perform another analysis to study the effect of LenEmb. We train one model that is explicitly given the sentence length via LenEmb and one without this extension. The latter has to somehow encode the length information in the latent variable to reproduce the input sentence with minimal loss. Table 2 shows the results of a Linear Regression (LR) trained on the latent variables of both models with the objective of predicting the length of the encoded sentence. For the model without explicit length information, LR can better predict the length of the encoded sentence from the latent variable alone. With less length information stored in the latent variable, it should be easier to influence the model to produce a certain output length.
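
Such a probing experiment can be sketched with an ordinary least-squares fit in plain Python. The sketch makes several assumptions: the regression includes a bias term, it is solved through the normal equations, and the fit is scored with the coefficient of determination (R²). The score reported in Table 2 is not named in the text, so R² is only a stand-in, and the toy data and function names are hypothetical.

```python
def fit_linear_regression(features, targets):
    """Least-squares weights (bias last) from the normal equations."""
    rows = [list(x) + [1.0] for x in features]
    dim = len(rows[0])
    # Augmented system [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(dim)]
        + [sum(r[i] * y for r, y in zip(rows, targets))]
        for i in range(dim)
    ]
    # Gauss-Jordan elimination with partial pivoting
    # (raises ZeroDivisionError if X^T X is singular).
    for col in range(dim):
        pivot = max(range(col, dim), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(dim):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][dim] / aug[i][i] for i in range(dim)]


def predict(weights, x):
    """Linear prediction; the last weight is the bias."""
    return sum(w * v for w, v in zip(weights, x)) + weights[-1]


def r_squared(weights, features, targets):
    """Coefficient of determination of the fit."""
    mean = sum(targets) / len(targets)
    ss_res = sum((y - predict(weights, x)) ** 2 for x, y in zip(features, targets))
    ss_tot = sum((y - mean) ** 2 for y in targets)
    return 1.0 - ss_res / ss_tot


# Toy "latent variables" whose length is exactly 2*z0 + 3*z1 + 5.
latents = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
lengths = [5.0, 7.0, 8.0, 10.0, 12.0]
weights = fit_linear_regression(latents, lengths)
score = r_squared(weights, latents, lengths)
```

A high score means the sentence length is linearly recoverable from the latent variable; in practice the probe would be evaluated on held-out sentences rather than on the data it was fitted to.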

The ROUGE scores are found in Table 1. Our model is not able to beat the Prefix baseline. This, however, could be an effect of the VAE not being able to restore the correct input sentence. We check this by testing a vanilla VAE model solely on reconstructing the input sentence and see that it makes a lot of mistakes. One reason is the lack of an attention mechanism, which cannot be used in a VAE setting, to 'copy' rare words from the input. Our LenEmb model, however, is consistently better than the vanilla VAE, which shows that reducing the output length fits more information into the first 75 characters. If the vanilla VAE could be improved to reproduce the input sentence with fewer mistakes, and the LenEmb model maintained its performance gain over the vanilla VAE, we could beat the Prefix baseline. The grammatical quality of the generated sentences was not evaluated.

Figure 4: Frequency of output characters without limiting the length
Figure 5: Frequency of output characters with setting desired length to 20
               DUC-2004   Gigaword
with LenEmb      0.41       0.54
w/o LenEmb       0.59       0.72
Table 2: Linear Regression prediction of sentence length with and w/o LenEmb

7 Conclusion

We extended a VAE with LenEmb to control the length of the produced sentences. The hypothesis that prompting the decoder to produce shorter outputs results in more information being expressed in fewer words was supported by a summarization experiment. However, a simple baseline could not be beaten with this approach. A subject for further research is how the vanilla VAE can be improved to better reconstruct the input sentence and how this influences the LenEmb-extended model. A Linear Regression experiment demonstrated that the length of the input sentence is encoded in the latent variable, and more strongly so when the length is not provided explicitly. All in all, this is a reasonable approach to constructing an unsupervised abstractive sentence summarization model and worth further investigation.