Prior Attention for Style-aware Sequence-to-Sequence Models

06/25/2018 ∙ by Lucas Sterckx, et al. ∙ Ghent University

We extend sequence-to-sequence models with the possibility to control the characteristics or style of the generated output, via attention that is generated a priori (before decoding) from a latent code vector. After training an initial attention-based sequence-to-sequence model, we use a variational auto-encoder conditioned on representations of input sequences and a latent code vector space to generate attention matrices. By sampling the code vector from specific regions of this latent space during decoding and imposing prior attention generated from it in the seq2seq model, output can be steered towards having certain attributes. This is demonstrated for the task of sentence simplification, where the latent code vector allows control over output length and lexical simplification, and enables fine-tuning to optimize for different evaluation metrics.




Section 1 Introduction

Apart from its application to machine translation, the encoder-decoder or sequence-to-sequence (seq2seq) paradigm has been successfully applied to monolingual text-to-text tasks including simplification Nisioi et al. (2017), paraphrasing Mallinson et al. (2017), style transfer Jhamtani et al. (2017), sarcasm interpretation Peled and Reichart (2017), automated lyric annotation Sterckx et al. (2017) and dialogue systems Serban et al. (2016).

Figure 1: Training of a conditional variational autoencoder applied to attention matrices. The seq2seq model translates training sentences from the source to a target domain while generating attention matrices. These matrices are concatenated with a representation of the source sentence and encoded to a low dimensional latent vector space.

A sequence of input tokens is encoded to a series of hidden states using an encoder network and decoded to a target domain by a decoder network. During decoding, an attention mechanism indicates which input tokens are relevant at each step. This attention component is computed as an intermediate part of the model and is trained jointly with the rest of the model. Alongside being crucial for effective translation, attention, while not necessarily correlated with human attention, brings interpretability to seq2seq models by visualizing how individual input elements contribute to the model’s decisions.

Attention values typically match up well with the word alignments used in traditional statistical machine translation, obtained with tools such as GIZA++ Och and Ney (2000) or fast-align Dyer et al. (2013). Several works have therefore included prior alignments from such dedicated alignment software Alkhouli et al. (2016); Mi et al. (2016); Liu et al. (2016). In particular, Mi et al. (2016) showed that the distance between the attention-infused alignments and the ones learned by an independent alignment model can be added to the network’s training objective, resulting in improved translation and alignment quality. Further, Gulcehre et al. (2017) demonstrated that the alignment between a given input sentence and the generated output can be planned ahead as part of a seq2seq model: their model makes a plan of future alignments using an alignment-plan matrix and decides when to follow this plan by learning a separate commitment vector.

In the standard seq2seq model, where attention is calculated at each time step, such overall alignment or focus is only apparent after decoding and is thus neither carefully planned nor controlled. We hypothesize that many text-to-text operations have varying levels of alignment and focus. To enable control over these aspects, we propose to pre-compute alignments and use this prior attention to determine the structure or focus before decoding, in order to steer output towards having specific attributes, such as length or level of compression. We facilitate this control through an input represented in a latent vector space (rather than, e.g., explicit ‘style’ attributes).

After training of the initial seq2seq model (with standard attention) on a parallel text corpus, a conditional variational autoencoder Sohn et al. (2015) learns to reconstruct matrices of alignment scores or attention matrices from a latent vector space and the input sentence encoding. At translation time, we are able to efficiently generate specific attention by sampling from regions in the latent vector space, resulting in output having specific stylistic attributes. We apply this method on a sentence simplification corpus, showing that we can control length and compression of output while producing realistic output and allowing fine-tuning for optimal evaluation scores.

Section 2 Generation of Prior Attention

This section describes our proposed method, sketched in Figure 1, with emphasis on the generation of prior attention matrices.

An encoder recurrent neural network computes a sequence of representations over the source sequence, i.e., its hidden states $h^s_1, \ldots, h^s_n$ (with $n$ the length of the source sequence). In attention-based models, an alignment vector $a_j = [\alpha_{j1}, \ldots, \alpha_{jn}]$ is obtained by comparing the current target hidden state $h^t_j$ with each source hidden state $h^s_i$. A global context vector $c_j$ is then computed as the weighted average, according to the alignment weights of $a_j$, over all the source states $h^s_i$ at time step $j$ (for $j = 1, \ldots, m$ over $m$ decoding steps). After decoding, these alignment vectors form a matrix of attention vectors, $A = [a_1; a_2; \ldots; a_m]$, capturing the alignment between source and target sequence.
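The attention computation described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `attention_matrix`, the random toy states and the bilinear weight matrix `W` are hypothetical, and all target states are given at once, whereas a real decoder produces them step by step. Each row of the resulting matrix is one alignment vector over the source positions, and each context vector is the corresponding weighted average of source states.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_matrix(source_states, target_states, W):
    """Bilinear attention between target and source hidden states.

    source_states: (n, d) encoder states, one per source token
    target_states: (m, d) decoder states, one per decoding step
    W:             (d, d) bilinear scoring matrix (hypothetical, untrained)
    Returns the (m, n) attention matrix and the (m, d) context vectors.
    """
    scores = target_states @ W @ source_states.T   # (m, n) alignment scores
    A = softmax(scores, axis=1)                    # each row sums to 1
    contexts = A @ source_states                   # weighted average of source states
    return A, contexts


# Toy example: 6 source tokens, 4 decoding steps, 8-dimensional states.
rng = np.random.default_rng(0)
n, m, d = 6, 4, 8
H_s = rng.normal(size=(n, d))
H_t = rng.normal(size=(m, d))
W = rng.normal(size=(d, d)) * 0.1
A, C = attention_matrix(H_s, H_t, W)
```

Stacking the rows in this way is what allows the attention of a whole decoded sentence to be treated as a single matrix.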

The wave traveled across the Atlantic , and organized into a tropical depression off the northern coast of Haiti on September 13 .
The wave traveled across the Atlantic , and organized into a tropical depression off the northern coast of the country on September 13 .
+ The wave traveled across the Atlantic Ocean into a tropical depression off the northern coast of Haiti on September 13 .
+ The wave traveled across the Atlantic Ocean and the Pacific Ocean to the south , and the Pacific Ocean to the south , and the Atlantic Ocean to the west .
+ + The storm was the second largest in the Atlantic Ocean .
Below are some useful links to facilitate your involvement .
Below are some useful links to facilitate your involvement .
+ Below are some useful links to help your involvement .
+ Below are some useful to be able to help help develop to help develop .
+ + Below is a software program that is used to talk about what is now .
Table 1: Output excerpts for prior attention matrices sampled from a 2D latent vector space. Samples are drawn from outer regions, with + indicating large positive values and − indicating negative values.

Inspired by the field of image generation, we treat alignment matrices as grayscale images and use generative models to create previously unseen attention. Generative models have been applied to a variety of problems giving state-of-the-art results in image generation, text-to-speech synthesis, and image captioning. One of the most prominent models is the variational autoencoder (VAE) proposed by Kingma and Welling (2013). Given an observed variable $x$, the VAE introduces a continuous latent variable $z$, and assumes $x$ to be generated from $z$, i.e., $p_\theta(x, z) = p_\theta(x \mid z)\, p_\theta(z)$, with $p_\theta(z)$ being a prior over the latent variables. $p_\theta(x \mid z)$ is the conditional distribution that models the generation procedure, parameterized by a decoder network with parameters $\theta$. For a given $x$, an encoder network $q_\phi(z \mid x)$ outputs a variational approximation of the true posterior over the latent values, $p(z \mid x)$. The parameters $\theta$ and $\phi$ are learned using stochastic variational inference to maximize a lower bound for the marginal likelihood of each observation in the training data. In our setting, $x$ represents the attention matrix.
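For reference, the lower bound mentioned above can be written out in its standard form from Kingma and Welling (2013). The notation is an assumption of this sketch: $x$ is the observation (here the attention matrix), $z$ the latent code, $q_\phi$ the encoder and $p_\theta$ the decoder. The standard normal prior in the second line is the usual choice and is likewise an assumption, since the text does not state the prior explicitly.

```latex
\log p_\theta(x) \;\ge\; \mathcal{L}(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z)\big),
\qquad p_\theta(z) = \mathcal{N}(0, I).
```

The first term rewards accurate reconstruction of the attention matrix, while the KL term keeps the encoder's posterior close to the prior, which is what makes sampling codes from the latent space meaningful after training.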

Next to control over stylistic features, we want attention matrices to be relevant for a specific source sentence. In the Conditional Variational Autoencoder (CVAE) Yan et al. (2016); Sohn et al. (2015), the standard VAE is conditioned on additional variables which can be used to generate diverse images conditioned on certain attributes, e.g., generating different human faces given a sentiment. We view the source contexts as the added conditional attributes and use the CVAE to generate diverse attention matrices instead of images. This context vector is represented by the source sentence encoding $h^s$. The CVAE encoder is conditioned on two variables, the attention matrix $x$ and the sentence encoding $h^s$, i.e., $q_\phi(z \mid x, h^s)$. Analogously, for the decoder, the likelihood is now conditioned on two variables, a latent code $z$ and again the source sentence encoding, $p_\theta(x \mid z, h^s)$. This training procedure of the CVAE is visualized in Figure 1. At test time, the attention scores from the attention matrix, pre-generated from a latent code sample and the source sentence encoding, are used instead of the standard seq2seq model’s attention mechanism.
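The conditioning described above can be illustrated with a small NumPy forward pass. This is a hedged sketch, not the authors' implementation: the class `AttentionCVAE`, the ReLU activations, the row-wise softmax on the decoder output and the 512-dimensional sentence encoding are assumptions made for illustration, and the weights are random rather than trained. The two hidden layers of 128 units, the two-dimensional latent code and the 50-token limit follow the settings reported in the experiments.

```python
import numpy as np

MAX_LEN = 50       # attention matrices are 50 x 50 after padding or truncation
LATENT_DIM = 2     # two-dimensional latent code
HIDDEN = 128       # width of each fully connected layer
ENC_DIM = 512      # assumed size of the source sentence encoding


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_layer(rng, n_in, n_out):
    """Random (untrained) weights and zero bias for one dense layer."""
    return rng.normal(scale=0.05, size=(n_in, n_out)), np.zeros(n_out)


def dense(x, layer, activation=None):
    w, b = layer
    y = x @ w + b
    return np.maximum(y, 0.0) if activation == "relu" else y


class AttentionCVAE:
    def __init__(self, rng):
        n_attn = MAX_LEN * MAX_LEN
        # Encoder q(z | attention, h_src): two hidden layers, then mean and log-variance.
        self.enc = [init_layer(rng, n_attn + ENC_DIM, HIDDEN),
                    init_layer(rng, HIDDEN, HIDDEN)]
        self.to_mu = init_layer(rng, HIDDEN, LATENT_DIM)
        self.to_logvar = init_layer(rng, HIDDEN, LATENT_DIM)
        # Decoder p(attention | z, h_src): two hidden layers, then attention logits.
        self.dec = [init_layer(rng, LATENT_DIM + ENC_DIM, HIDDEN),
                    init_layer(rng, HIDDEN, HIDDEN)]
        self.to_attn = init_layer(rng, HIDDEN, n_attn)

    def encode(self, attention, h_src):
        h = np.concatenate([attention.ravel(), h_src])
        for layer in self.enc:
            h = dense(h, layer, "relu")
        return dense(h, self.to_mu), dense(h, self.to_logvar)

    def decode(self, z, h_src):
        h = np.concatenate([z, h_src])
        for layer in self.dec:
            h = dense(h, layer, "relu")
        logits = dense(h, self.to_attn).reshape(MAX_LEN, MAX_LEN)
        # Assumption: normalize each target step over the source positions.
        return softmax(logits, axis=1)


rng = np.random.default_rng(0)
cvae = AttentionCVAE(rng)
h_src = rng.normal(size=ENC_DIM)                                   # toy sentence encoding
A_obs = softmax(rng.normal(size=(MAX_LEN, MAX_LEN)), axis=1)       # toy observed attention

# Training-time path: encode, sample z with the reparameterization trick, reconstruct.
mu, logvar = cvae.encode(A_obs, h_src)
z = mu + np.exp(0.5 * logvar) * rng.normal(size=LATENT_DIM)
A_rec = cvae.decode(z, h_src)

# Test-time path: pick a latent code directly and generate prior attention.
A_prior = cvae.decode(np.array([2.0, -2.0]), h_src)
```

At test time only `decode` is needed: the chosen code and the sentence encoding yield a full attention matrix before any token is generated, and its rows replace the attention weights the seq2seq decoder would otherwise compute at each step.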

Figure 2: (a) Attention matrices for a single source sentence encoding and a two-dimensional latent vector space. By conditioning the autoencoder on the source sentence, the decoder recognizes the length of the source and reduces attention beyond the last source token. (b) Score distributions for different regions of the latent vector space.

Section 3 Experiments

3.1 Prior Attention for Text Simplification

While our model is essentially task-agnostic, we demonstrate prior attention for the task of sentence simplification. The goal of sentence simplification is to reduce the linguistic complexity of text, while still retaining its original information and meaning. It has been suggested that sentence simplification can be defined by three major types of operations: splitting, deletion, and paraphrasing Shardlow (2014). We hypothesize that these operations occur at varying frequencies in the training data. We apply our model in an attempt to capture these operations in attention matrices and the latent vector space, and thus control the form and degree of simplification through sampling from that space. We train on the Wikilarge collection used by Zhu et al. (2010). Wikilarge is a collection of 296,402 automatically aligned complex and simple sentences from the ordinary and simple English Wikipedia corpora, used extensively in previous work Wubben et al. (2012); Woodsend and Lapata (2011); Zhang and Lapata (2017); Nisioi et al. (2017). The training data includes 2,000 development and 359 test instances created by Xu et al. (2016). These are complex sentences paired with simplifications provided by Amazon Mechanical Turk workers and provide a more reliable evaluation of the task.

Model                        BLEU    SARI    Length ratio   FKGL
Wubben et al. (2012)         67.74   35.34   0.90           10.0
Zhang and Lapata (2017)      90.00   37.62   0.95           10.4
Nisioi et al. (2017)         88.16   33.86   0.91           10.1
Seq2seq + online attention   89.92   33.06   0.91           10.3
Seq2seq + CVAE               90.14   38.30   0.97           10.5
Table 2: Quantitative evaluation of existing baselines from previous work and seq2seq with prior attention from the CVAE when choosing an optimal sample for BLEU scores.

3.2 Hyperparameters and Optimization

We extend the OpenNMT Klein et al. (2017) framework with functions for attention generation and release our code as a submodule. We use an architecture similar to that of Zhu et al. (2010) and Nisioi et al. (2017): 2 layers of stacked unidirectional LSTMs with bi-linear global attention as proposed by Luong et al. (2015), with hidden states of 512 dimensions. The vocabulary is reduced to the 50,000 most frequent tokens and embedded in a shared 500-dimensional space. We train using SGD with batches of 64 samples for 13 epochs, after which the autoencoder is trained by translating sequences from training data. Both the encoder and decoder of the CVAE comprise 2 fully connected layers of 128 nodes. Weights are optimized using Adam Kingma and Ba (2014). We visualize and evaluate using a two-dimensional latent vector space. Source and target sequences are both padded or reduced to 50 tokens. The integration of the CVAE is analogous across the family of attention-based seq2seq models, i.e., our approach can be applied more generally with different models or training data.

3.3 Discussion

To study the influence of sampling from different regions in the latent vector space, we visualize the resulting attention matrices and measure simplification quality using automated metrics. Figure 2(a) shows the two-dimensional latent space for a single source sentence encoding, using 64 samples spread across a range of latent values. Next to the target-to-source length ratio, we apply automated measures commonly used to evaluate simplification systems Woodsend and Lapata (2011); Zhang and Lapata (2017): BLEU, SARI Xu et al. (2016), and FKGL, the Flesch-Kincaid Grade Level index Kincaid (1975). Automated evaluation metrics for matrices originating from samples from different regions of latent codes are shown in Figure 2(b).

Inclusion of an attention mechanism was instrumental to match existing baselines. Our standard seq2seq model with attention, without prior attention, obtains a score of 89.92 BLEU points, which is close to scores obtained by similar models used in existing work on neural text simplification Zhang and Lapata (2017); Nisioi et al. (2017). In Table 2, we compare this model with the seq2seq model that uses prior attention from the CVAE. With prior attention, a BLEU value of 90.14 is found for a latent code sample that was tuned on a development set. For the same sample, a SARI value of 38.30 was reached. For comparison, we include the SMT-based model by Wubben et al. (2012), the NTS model by Nisioi et al. (2017) and the EncDecA by Zhang and Lapata (2017).

For decreasing values of the first latent dimension $z_1$, we observe that attention becomes situated at the diagonal, thus keeping closer to the structure of the source sentence and having one-to-one word alignments. For increasing values of $z_1$, attention becomes more vertical and focused on single encoder states. This type of attention gives more control to the language model, as exemplified by output samples shown in Table 1. Output from this region is far longer and less related to the source sentence.

The influence of the second latent variable $z_2$ is less apparent from the attention matrices. However, sampling across this dimension shows large effects on evaluation metrics. For decreasing values, output becomes more similar to the source, with higher BLEU as a result. Sampling these values along the zero-axis results in the overall highest BLEU and SARI scores, trading similarity for simplification and readability.
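The procedure of sampling latent codes from different regions and tuning the choice on a development set can be sketched as a simple grid search. Everything concrete here is an assumption for illustration: the helper names, the 8 x 8 grid (chosen only because it gives 64 codes), the range of -3 to 3, and the toy scoring function, which stands in for decoding the development set with prior attention and computing BLEU or SARI.

```python
import itertools
import numpy as np


def latent_grid(low, high, steps):
    """All 2D latent codes on a regular steps x steps grid (assumed layout)."""
    axis = np.linspace(low, high, steps)
    return [np.array(point) for point in itertools.product(axis, axis)]


def length_ratio(source_tokens, output_tokens):
    """Target-to-source length ratio of one sentence pair."""
    return len(output_tokens) / len(source_tokens)


def pick_best_code(codes, score_fn):
    """Return the latent code with the highest score and that score."""
    scores = [score_fn(z) for z in codes]
    best = int(np.argmax(scores))
    return codes[best], scores[best]


# 64 candidate codes; the range is a placeholder, not a value from the paper.
codes = latent_grid(-3.0, 3.0, 8)


def toy_score(z):
    # Stand-in for a development-set metric; peaks near z = (-1.0, 0.1).
    return -float(np.sum((z - np.array([-1.0, 0.1])) ** 2))


best_z, best_score = pick_best_code(codes, toy_score)
```

In practice `score_fn` would generate prior attention for each development sentence from the candidate code, decode with it, and return the metric being optimized, so that different codes can be selected for different metrics.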

Section 4 Conclusion

We introduced a method to control the decoding process in sequence-to-sequence models using attention, in terms of stylistic characteristics of the output. This means that the trained model is able to produce output with custom stylistic properties, given a well-chosen style input vector by the user at prediction time. Given the input sequence and an additional code vector to influence decoding characteristics, a variational autoencoder generates an attention matrix, which is used by the decoder to generate the output sequence according to the alignment style directed by the code vector. We demonstrated the resulting variations in output for the task of text simplification. Yet, our method can be applied to any form of parallel text: we expect different types of training collections, such as translation or style transfer, to give rise to different characteristics or mappings in the latent space.


  • Alkhouli et al. (2016) Tamer Alkhouli, Gabriel Bretschner, Jan-Thorsten Peter, Mohammed Hethnawi, Andreas Guta, and Hermann Ney. 2016. Alignment-based neural machine translation.
  • Dyer et al. (2013) Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648. Association for Computational Linguistics.
  • Gulcehre et al. (2017) Caglar Gulcehre, Francis Dutil, Adam Trischler, and Yoshua Bengio. 2017. Plan, attend, generate: Planning for sequence-to-sequence models. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5474–5483. Curran Associates, Inc.
  • Jhamtani et al. (2017) Harsh Jhamtani, Varun Gangal, Eduard Hovy, and Eric Nyberg. 2017. Shakespearizing modern language using copy-enriched sequence to sequence models. In Proceedings of the Workshop on Stylistic Variation, pages 10–19, Copenhagen, Denmark. Association for Computational Linguistics.
  • Kincaid (1975) J.P. Kincaid. 1975. Derivation of New Readability Formulas: (automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel. Research Branch report. Chief of Naval Technical Training, Naval Air Station Memphis.
  • Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
  • Kingma and Welling (2013) Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational bayes. CoRR, abs/1312.6114.
  • Klein et al. (2017) G. Klein, Y. Kim, Y. Deng, J. Senellart, and A.M. Rush. 2017. OpenNMT: Open-Source Toolkit for Neural Machine Translation. ArXiv e-prints.
  • Liu et al. (2016) Lemao Liu, Masao Utiyama, Andrew Finch, and Eiichiro Sumita. 2016. Neural machine translation with supervised attention. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3093–3102. The COLING 2016 Organizing Committee.
  • Luong et al. (2015) Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2015. Multi-task sequence to sequence learning. CoRR, abs/1511.06114.
  • Mallinson et al. (2017) Jonathan Mallinson, Rico Sennrich, and Mirella Lapata. 2017. Paraphrasing revisited with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 881–893, Valencia, Spain. Association for Computational Linguistics.
  • Mi et al. (2016) Haitao Mi, Zhiguo Wang, and Abe Ittycheriah. 2016. Supervised attentions for neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2283–2288. Association for Computational Linguistics.
  • Nisioi et al. (2017) Sergiu Nisioi, Sanja Štajner, Simone Paolo Ponzetto, and Liviu P. Dinu. 2017. Exploring neural text simplification models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 85–91. Association for Computational Linguistics.
  • Och and Ney (2000) Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics.
  • Peled and Reichart (2017) Lotem Peled and Roi Reichart. 2017. Sarcasm sign: Interpreting sarcasm with sentiment based monolingual machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1690–1700, Vancouver, Canada. Association for Computational Linguistics.
  • Serban et al. (2016) Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, volume 16, pages 3776–3784.
  • Shardlow (2014) Matthew Shardlow. 2014. A survey of automated text simplification. International Journal of Advanced Computer Science and Applications.
  • Sohn et al. (2015) Kihyuk Sohn, Honglak Lee, and Xinchen Yan. 2015. Learning structured output representation using deep conditional generative models. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3483–3491. Curran Associates, Inc.
  • Sterckx et al. (2017) Lucas Sterckx, Jason Naradowsky, Bill Byrne, Thomas Demeester, and Chris Develder. 2017. Break it down for me: A study in automated lyric annotation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2074–2080, Copenhagen, Denmark. Association for Computational Linguistics.
  • Woodsend and Lapata (2011) Kristian Woodsend and Mirella Lapata. 2011. Learning to simplify sentences with quasi-synchronous grammar and integer programming. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 409–420. Association for Computational Linguistics.
  • Wubben et al. (2012) Sander Wubben, Antal van den Bosch, and Emiel Krahmer. 2012. Sentence simplification by monolingual machine translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1015–1024. Association for Computational Linguistics.
  • Xu et al. (2016) Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. 2016. Optimizing statistical machine translation for text simplification. Transactions of the Association of Computational Linguistics, 4:401–415.
  • Yan et al. (2016) Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee. 2016. Attribute2image: Conditional image generation from visual attributes. In ECCV (4), volume 9908 of Lecture Notes in Computer Science, pages 776–791. Springer.
  • Zhang and Lapata (2017) Xingxing Zhang and Mirella Lapata. 2017. Sentence simplification with deep reinforcement learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 584–594. Association for Computational Linguistics.
  • Zhu et al. (2010) Zhemin Zhu, Delphine Bernhard, and Iryna Gurevych. 2010. A monolingual tree-based translation model for sentence simplification. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 1353–1361. Coling 2010 Organizing Committee.