Self-Attention-Based Message-Relevant Response Generation for Neural Conversation Model

05/23/2018 ∙ by Jonggu Kim, et al. ∙ POSTECH

Using a sequence-to-sequence framework, many neural conversation models for chit-chat succeed in producing natural responses. Nevertheless, these models tend to give generic responses that are not specific to the given messages, and this remains a challenge. To alleviate the tendency, we propose a method that promotes message-relevant and diverse responses for neural conversation models by using self-attention, which is time-efficient as well as effective. Furthermore, we investigate why and how self-attention is effective through an in-depth comparison with standard dialogue generation. The experimental results show that the proposed method improves on standard dialogue generation in various evaluation metrics.


1 Introduction

Dialogue systems are designed to have a conversation with a user. According to the objective of the conversation, dialogue systems are classified into task-oriented dialogue systems, which conduct specific tasks such as booking and ordering, and non-task-oriented dialogue systems (chatbots), which are constructed for chit-chat. While the components of task-oriented dialogue systems are constructed separately and then pipelined, chatbots are usually constructed in an end-to-end way that is similar to neural machine translation models based on the sequence-to-sequence architecture. Even though such chatbots have achieved great success in producing natural, human-like responses, they still face a challenge called the generic response problem. The generic response problem means that the produced response is not informative or specific to the given message, but generic, such as “I see.” or “I don’t know”.

Even though much recent research on the problem has been conducted, the problem has not been easily resolved; some methods are not effective enough, and the others are complex and time-inefficient.

In this paper, we present an empirical analysis of the structural reason why sequence-to-sequence models generate such responses and provide supporting clues. Based on the analysis, we propose a decoding method using self-attention to promote message-relevant and diverse responses for standard sequence-to-sequence models, which does not require a new model architecture. Then, we present a variety of experimental results to verify the proposed method. The experimental results demonstrate that the proposed method generates more interesting responses than the standard dialogue generation.

In Section 2, we introduce previous methods to alleviate the problem as related work. In Section 3, we introduce our motivation and the proposed method in detail. We then show the experimental settings and results in Section 4. We discuss the results in detail in Section 5. Finally, we conclude our work in Section 6.

2 Related Work

Non-task-oriented dialogue systems often use a framework of machine translation Ritter et al. (2011). Recently, the framework of neural machine translation, which is a sequence-to-sequence framework based on neural networks, has been applied to dialogue systems. In terms of natural responses, such dialogue systems have been very successful. However, in terms of informative responses, they still have a challenge to overcome. To provide more informative responses that are diverse, message-specific or contextual, much research has been conducted.

Li et al. (2016a) propose new objective functions based on maximum mutual information (MMI) for neural conversation models to generate an informative and relevant response to a given message. Li et al. (2016b) propose to model the personalities of a speaker and an addressee in a sequence-to-sequence model as embedding vectors. The model can be driven by the characteristics of a speaker and an addressee. Mou et al. (2016) propose a forward and backward directional model. They also propose to use pointwise mutual information to introduce content to the model explicitly. As another strategy, a data distillation method that removes from the training dataset the examples most similar to generic responses is proposed by Li et al. (2017). Shao et al. (2017) propose a sequence-to-sequence model that attends not only to the input, but also to the target words that are already generated. To promote diversity of responses, Serban et al. (2017b) propose to model a latent variable in the sequence-to-sequence model by using the method of variational auto-encoders. The latent variable is first sampled and is then expected to make responses diverse. In a similar way, Shen et al. (2017b) propose a framework dealing with specific attributes like personal emoticons. Serban et al. (2017a) propose a model that covers multiple levels of abstraction of the input to capture more important information.

Research on utilizing a dialogue history has also been conducted. Sordoni et al. (2015) propose a context-sensitive response generation model. The model incorporates contextual information as continuous vector representations to generate contextually coherent responses. Dusek and Jurcicek (2016) propose a sequence-to-sequence model that incorporates a context encoder so that it is also context-aware. Tian et al. (2017) study how to make context information more useful. They used several variants of RNN structures and found empirically which models achieve the best performance. Mei et al. (2017) propose an attention-based language model for coherent conversation with a user. They propose to apply the attention mechanism to a dialogue history for a coherent dialogue.

3 The Proposed Method

Our method is to use self-attention in response generation. To introduce our method in detail, we first introduce the standard response generation model and present an investigation of the standard response generation that shows why generic responses could be generated in terms of an architecture. Then, we present our motivation and introduce our method in detail. We also present clues supporting our self-attention-based response generation in this section.

Message: We can guess that the author and his publisher
  Standard: We ’re not
  By 1st:   We ’re not
  By 5th:   I don ’t know what you’re ?

Message: She just has to sleep it off rest a couple of days
  Standard: I don ’t want to go
  By 1st:   She ’s a good idea
  By 5th:   I don ’t want to go

Message: And i feel very good about not leaving the hospital
  Standard: I don ’t know what you ’re talking about ?
  By 1st:   And i don ’t know you’re gonna
  By 5th:   But i don ’t know what you’re talking about ?

Message: You are aware that it is difficult to make even ends meet
  Standard: I don ’t know what you ’re talking about ?
  By 1st:   You ’re going to do you know ?
  By 5th:   You ’re going to do you know ?

Message: Is my blouse ready honey ?
  Standard: I don ’t know what about it ?
  By 1st:   What do you mean ?
  By 5th:   I don ’t know what about it ?

Table 1: Example responses generated by the standard method and by an imperative selection of the encoder hidden vector at an absolute position (1st or 5th) as the first context vector. The tokens in red indicate tokens in the 1st position, and the tokens in blue indicate tokens in the 5th position. The underlined tokens indicate tokens generated by the selections.

3.1 Response Generation Model

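The response generation model is based on the attention-based RNN sequence-to-sequence (encoder-decoder) structure Bahdanau et al. (2014). In the model, long short-term memory (LSTM) is applied to both the encoder and the decoder. LSTM is designed to retain information that is far from the current step through a gating mechanism that uses more trainable parameters than a basic RNN model. The parameters $W$, $U$, and $b$ are used to organize a cell vector, which consists of three gates called an input gate $i_t$, a forget gate $f_t$ and an output gate $o_t$, and one more vector $g_t$ for the hypothesis at each time step $t$. In detail, given an input vector $x_t$ and the previous cell output $h_{t-1}$, the current cell output $h_t$ is computed as:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \quad (1)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \quad (2)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \quad (3)$$
$$g_t = \tanh(W_g x_t + U_g h_{t-1} + b_g) \quad (4)$$
$$s_t = f_t \odot s_{t-1} + i_t \odot g_t \quad (5)$$
$$h_t = o_t \odot \tanh(s_t) \quad (6)$$

where $\sigma$ is a logistic sigmoid function, $\odot$ is element-wise multiplication, $s_t$ is the cell state, and $W_*$, $U_*$ and $b_*$ are model parameters.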
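The gating computation described above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes the standard LSTM formulation; the stacked parameter layout, the gate order and the names `lstm_step`, `W`, `U`, `b` are choices made for this sketch, not details taken from the paper.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, s_prev, W, U, b):
    """One LSTM step (assumed standard formulation).

    W: (4d, n), U: (4d, d), b: (4d,), stacked in the order
    input gate, forget gate, output gate, candidate (hypothesis).
    """
    d = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i_t = sigmoid(z[0:d])            # input gate
    f_t = sigmoid(z[d:2 * d])        # forget gate
    o_t = sigmoid(z[2 * d:3 * d])    # output gate
    g_t = np.tanh(z[3 * d:4 * d])    # candidate (hypothesis)
    s_t = f_t * s_prev + i_t * g_t   # new cell state
    h_t = o_t * np.tanh(s_t)         # new cell output
    return h_t, s_t

rng = np.random.default_rng(0)
n, d = 3, 4  # toy input and hidden sizes
W = rng.normal(size=(4 * d, n))
U = rng.normal(size=(4 * d, d))
b = np.zeros(4 * d)
h, s = lstm_step(rng.normal(size=n), np.zeros(d), np.zeros(d), W, U, b)
```

Because the cell output is an output gate in (0, 1) multiplied by a tanh, every component of `h` stays within [-1, 1].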

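The model objective is to generate a sequence of words $Y = (y_1, \dots, y_M)$ given a message $X = (x_1, \dots, x_N)$. That can be represented as the conditional probability:

$$p(Y \mid X) = \prod_{t=1}^{M} p(y_t \mid y_0, \dots, y_{t-1}, X) \quad (7)$$

where $y_0$ is a synthetic symbol which represents the beginning of a sequence.

Given $X$, the encoder of the model generates a sequence of cell outputs $H^{enc} = (h^{enc}_1, \dots, h^{enc}_N)$ in turn as:

$$h^{enc}_i = \mathrm{LSTM}(v_i, h^{enc}_{i-1}) \quad (8)$$

where $v_i$ is a continuous real-valued vector to which $x_i$ is transformed by the word vector lookup table.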

The decoder of the model generates a cell output at each time step as:

(9)

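where

$$e_{t,i} = (h^{dec}_{t-1})^\top h^{enc}_i \quad (10)$$
$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k=1}^{N} \exp(e_{t,k})} \quad (11)$$
$$c_t = \sum_{i=1}^{N} \alpha_{t,i} \, h^{enc}_i \quad (12)$$

$c_t$ is a context vector calculated by a weighted sum of the encoder cell outputs $h^{enc}_1, \dots, h^{enc}_N$, and the weights are calculated by an inner product of the previous decoder cell output $h^{dec}_{t-1}$ and each $h^{enc}_i$.

Then, the decoder cell output $h^{dec}_t$ is fed to a feed-forward layer to produce the last vector $\hat{y}_t$.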
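The context vector described above, a weighted sum of encoder outputs with weights derived from inner products, can be sketched as follows. The softmax normalization of the weights is an assumption of this sketch (it is the usual choice for soft attention), and the function and variable names are illustrative only.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    shifted = scores - np.max(scores)
    weights = np.exp(shifted)
    return weights / weights.sum()

def soft_attention_context(query, enc_states):
    """Context vector as a weighted sum of encoder states.

    query:      (d,) decoder-side vector used to score the encoder states
    enc_states: (N, d) encoder cell outputs, one row per message token
    """
    scores = enc_states @ query     # inner product with each encoder state
    weights = softmax(scores)       # assumed normalization
    context = weights @ enc_states  # weighted sum, shape (d,)
    return context, weights

enc_states = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [1.0, 1.0]])
query = np.array([2.0, 0.0])
context, weights = soft_attention_context(query, enc_states)
```

In this toy case the first and third encoder states have the same inner product with the query, so they receive equal weight and dominate the context vector.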

The loss function is a categorical cross entropy between the predicted word distribution $\hat{y}_t$ and the one-hot target word vector $y_t$. The loss is calculated as:

$$L_t = -\log \hat{y}_{t,k} \quad (13)$$

where $k$ is the index corresponding to the target word in $y_t$.
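With a one-hot target, categorical cross entropy reduces to the negative log-probability that the model assigns to the target word. A small sketch, assuming the feed-forward layer outputs unnormalized scores (logits) that are turned into a distribution by a softmax; the name `word_cross_entropy` is hypothetical.

```python
import numpy as np

def word_cross_entropy(logits, target_index):
    """Cross entropy between softmax(logits) and a one-hot target.

    Only the target word's probability contributes to the loss.
    """
    shifted = logits - np.max(logits)                       # for stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))   # log-softmax
    return -log_probs[target_index]

logits = np.array([2.0, 0.5, -1.0, 0.0])   # toy scores over a 4-word vocabulary
loss_likely = word_cross_entropy(logits, 0)    # target is the top-scoring word
loss_unlikely = word_cross_entropy(logits, 2)  # target is the lowest-scoring word
uniform_loss = word_cross_entropy(np.zeros(4), 1)
```

With uniform scores over four words the loss equals log 4, and the loss grows as the target word becomes less probable under the model.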

3.2 Investigation of Standard Response Generation

The standard response generation is to generate a response in the same way as explained in the previous subsection. Given a message, the decoder generates a response word by word, starting with the beginning symbol. Specifically, in the first decoding step, a hidden vector (the first context vector) is constructed by soft attention of the initial decoder query over the encoder hidden vectors, which is expected to select the encoder hidden vector most related to that query. However, we have often seen safe responses that start with a general word like “I” under this attention. We ascribe them to the selection of vectors similar to the query in the first decoding step. Such a selection is likely to be safe but uninformative about the message, and it becomes the seed of a safe response. After a safe first word like “I” is generated from the selection, the decoder can behave as a plain language generator, with the message and the encoder having little influence on it. As a result, selecting a safe hidden vector as the first context vector could result in a generic response. We can also think that an uninformative context vector is constructed with a high probability because the initial query is uninformative by itself.

To support the intuition, Table 1 shows example responses generated by the standard decoding method and by an imperative selection at an absolute position (1st or 5th) from the encoder hidden vectors for the first context vector. In Table 1, the responses are clearly different even though the only difference in decoding is the first context vector. Also, we can see that the first generated word depends on the word selected among the hidden vectors of the encoder, and the two words are often identical. For example, the words in red in most message/response pairs are the same: We/We, She/She, And/And and You/You (footnote 1: On the other hand, a hidden vector is not guaranteed to yield the same word as the current word, because hidden vectors accumulate information up to the current step).

In conclusion, the selection of the first context vector is crucial to the generation of the whole response, so we take it into account for message-relevant and diverse responses.

3.3 Motivation

As we see in Subsection 3.2, the choice of the first context vector is crucial to generating the whole response. To generate an informative response, our objective is to select the most informative context vector from the encoder hidden vectors, escaping the uninformative vector produced by soft attention in the first decoding step.

In training, each hidden vector of the encoder becomes similar to the previous hidden vector of the decoder according to the model structure. This means that if two hidden vectors of the encoder indirectly become similar to each other through training, they also have similar meanings. In addition, a vector to be selected as the first context vector should be informative rather than dull. In other words, the vector should indicate a prominent part of the sentence.

A variety of methods for meaningful sentence representation have been proposed. One of the methods is self-attention which learns relationships of every vector pair in a single set of vectors. Self-attention is based on the inner product of two vectors. We believe that such a method can be used to find abstracted vector representation which has the whole meaning of a sentence, or at least indicates a prominent part of a sentence by comparing every pair.

Based on the idea, we propose a self-attention-based response generation method that is introduced in the next subsection.

3.4 Self-Attention-Based Response Generation

Self-attention is a special case of the attention mechanism, which is modeled to learn dependencies in a word sequence Vaswani et al. (2017); Shen et al. (2017a). Such self-attention is usually used for sentence representation, which abstracts sentence-level meaning. Specifically, multiplicative self-attention is an attention mechanism that builds a context vector $c_j$ from the inner product of an input vector $x_i$ and the given query, which is another input vector $x_j$:

$$e_{ij} = x_i^\top x_j \quad (14)$$
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{kj})} \quad (15)$$
$$c_j = \sum_{i} \alpha_{ij} \, x_i \quad (16)$$

where $e_{ij}$ and $\alpha_{ij}$ are scalars.
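As a rough illustration of multiplicative self-attention, the sketch below scores every pair of input vectors by their inner product, normalizes the scores for each query with a softmax (an assumed normalization), and builds one context vector per query. No trainable transformation is applied, and all names are illustrative.

```python
import numpy as np

def self_attention(X):
    """Multiplicative self-attention over the rows of X.

    X: (N, d) input vectors; every row acts both as a query and as an input.
    Returns the (N, d) context vectors and the (N, N) attention weights,
    where weights[j, i] is the weight of input i for query j.
    """
    scores = X @ X.T                                     # pairwise inner products
    scores = scores - scores.max(axis=1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    contexts = weights @ X                               # one weighted sum per query
    return contexts, weights

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
contexts, weights = self_attention(X)
```

Each row of `weights` sums to one, so every context vector is a convex combination of the input vectors.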

Before the input vector and the query are used, they are first transformed by a feed-forward network so that the attention is trainable. However, contrary to such multiplicative self-attention models, we do not directly model or train self-attention by placing or stacking trainable parameters around it. Instead, we just expect the standard sequence-to-sequence model to indirectly learn similarities or dependencies between hidden vectors of the encoder, as a decoder hidden vector and an encoder hidden vector with a similar meaning become similar within its own architecture.

In other words, we consider the similarity between hidden vectors of the encoder trained in the standard sequence-to-sequence way. Hidden vectors of the encoder are either informative or dull as the first context vector. We use the multiplicative self-attention mechanism on the hidden vectors to select a message-relevant and diverse one that is supported by other hidden vectors according to similarity. The first context vector is then expected to carry the representative meaning of the message.

For use as a compact encoding of a sentence, we slightly modify the process above. Specifically, we modify Equations (15) and (16) to select a message-abstracted vector by hard attention that follows the greatest weight. Then, the first context vector is computed as:

(17)
(18)
(19)

Context vectors at other time steps ($t > 1$) are computed as usual.
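The selection step can be sketched as below. The exact way the pairwise self-attention scores are aggregated into one weight per encoder vector is not recoverable from the text here, so this sketch makes an explicit assumption: each encoder vector is scored by the sum of its inner products with the other encoder vectors (its "support"), the supports are normalized by a softmax, and the vector with the greatest weight is taken as the first context vector. The `mode="min"` branch mirrors the opposite choice. The function name and the toy values are hypothetical.

```python
import numpy as np

def select_first_context(enc_states, mode="max"):
    """Hard selection of the first context vector from encoder states.

    ASSUMPTION: a vector's weight is the softmax of the sum of its inner
    products with all other encoder vectors; this aggregation is a guess
    made for illustration, not the paper's stated formula.
    """
    scores = enc_states @ enc_states.T   # pairwise inner products
    np.fill_diagonal(scores, 0.0)        # ignore self-similarity (assumption)
    support = scores.sum(axis=1)         # how much the others support each vector
    shifted = np.exp(support - support.max())
    probs = shifted / shifted.sum()
    index = int(np.argmax(probs)) if mode == "max" else int(np.argmin(probs))
    return enc_states[index], index, probs

enc_states = np.array([[1.0, 0.0],
                       [0.9, 0.1],
                       [0.0, 1.0]])
c1_max, idx_max, probs = select_first_context(enc_states, mode="max")
c1_min, idx_min, _ = select_first_context(enc_states, mode="min")
```

In the toy example the second vector is the most supported by the others and is chosen in the maximum case, while the third vector, which is nearly orthogonal to the rest, is chosen in the minimum case.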

Method                            BLEU   distinct-1   distinct-2
Seq2Seq                           0.97   0.008        0.062
Seq2Seq & Hard-Attention          1.18   0.009        0.064
Random Hard-Attention             1.15   0.008        0.064
Self-Attention & Min              1.12   0.009        0.071
Self-Attention & Max              1.26   0.009        0.076
Seq2Seq using MMI                 3.38   0.010        0.119
Self-Attention & Max using MMI    2.67   0.012        0.171
Table 2: Automatic evaluation result

Method                            Good (1)   Mediocre (2)   Bad (3)   Average
Seq2Seq                           7          126            66        2.296
Seq2Seq & Hard-Attention          6          126            67        2.307
Random Hard-Attention             5          127            67        2.312
Self-Attention & Min              10         124            65        2.276
Self-Attention & Max              13         123            63        2.251
Seq2Seq using MMI                 10         18             171       2.809
Self-Attention & Max using MMI    28         27             144       2.583
Table 3: Human evaluation result

4 Experiments

To verify our method, we train a standard sequence-to-sequence model on open-domain dialogues. In the following subsections, we introduce the experimental conditions and show the experimental results.

4.1 Dataset and Settings

We used the OpenSubtitles dataset Tiedemann (2009), a large and noisy open-domain dataset of lines spoken by movie characters, for the experiments. We extracted unique input/output pairs of the dialogues from the dataset and, to reduce the training time, filtered them by dialogue length, which was set to 6. As a result, we obtained about 0.6 M dialogues, which contain 5.4 M unique input/output pairs. Then, we shuffled the data and divided it into training, test and validation sets in the ratio 0.85, 0.1 and 0.05, respectively. The vocabulary size used for the dataset is 25,000.
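The shuffle-and-split step can be sketched as follows. Only the three ratios come from the text; the function name, the fixed seed and the toy pairs are assumptions made for illustration.

```python
import random

def split_pairs(pairs, ratios=(0.85, 0.10, 0.05), seed=0):
    """Shuffle input/output pairs and split them into train/test/validation."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)       # seeded for reproducibility
    n_train = round(len(shuffled) * ratios[0])
    n_test = round(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    valid = shuffled[n_train + n_test:]         # remainder goes to validation
    return train, test, valid

# Toy stand-in for the extracted message/response pairs.
pairs = [(f"message {i}", f"response {i}") for i in range(1000)]
train, test, valid = split_pairs(pairs)
```

For 1,000 toy pairs this yields 850 training, 100 test and 50 validation pairs, with no pair appearing in more than one split.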

To verify our method, we also compared representative generation methods and variants of our method. The methods are described as follows:

  • Attention-based sequence-to-sequence model (Seq2Seq): The standard beam search decoding of the attention-based sequence-to-sequence model.

  • Selection based on hard-attention by attention-based sequence-to-sequence model (Seq2Seq & Hard-Attention): For the first hidden vector in the decoder, the method uses hard-attention instead of soft-attention. We expect this method to show an effect of hard-attention itself and the difference between Seq2Seq and the proposed method.

  • Random selection based on hard-attention by attention-based sequence-to-sequence model (Random Hard-Attention): For the first hidden vector in the decoder, the method randomly selects the context vector in the hard-attention way. The random method will be used to show the effectiveness of the self-attention-based method by the comparison.

  • Self-attention-based response generation selecting the minimum probability (Self-Attention & Min): This method chooses the context vector whose probability under self-attention is the minimum, instead of the context vector constructed by soft-attention in the first decoding step. It constructs the first context vector in the opposite way to the proposed method and is expected to select the most distinct vector among the hidden vectors of the encoder.

  • Self-attention-based response generation selecting the maximum probability (Self-Attention & Max): This method chooses the context vector whose probability under self-attention is the maximum, instead of the context vector constructed by soft-attention in the first decoding step.

  • Attention-based sequence-to-sequence model using maximum mutual information (Seq2Seq using MMI) Li et al. (2016a): Dialogue generation using two distinct sequence-to-sequence models trained on the dataset in the order of message/response pairs and response/message pairs, respectively. After the standard attention-based sequence-to-sequence model generates the N-best candidates (N being the beam size), the other model rescores the candidates to produce a final response (footnote 2: The same as the bidi method in Li et al. (2016a)).

  • Self-attention-based response generation selecting the maximum probability using maximum mutual information (Self-Attention & Max using MMI): Like Seq2Seq using MMI, the other standard model trained on the dataset of reverse-ordered pairs rescores the N-best candidates that were generated by Self-Attention & Max.
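The rescoring used by the two MMI variants above can be sketched as follows: a forward model proposes N-best candidates with their log-probabilities, and a backward model, trained on response/message pairs, scores how well each candidate predicts the message. Combining the two scores as a weighted sum with weight `lam` is an assumption of this sketch, and `toy_backward_log_prob` is a hypothetical stand-in for a trained backward model.

```python
def rescore_nbest(message, candidates, backward_log_prob, lam=0.5):
    """Pick the best candidate after rescoring with a backward model.

    candidates:        list of (response, forward_log_prob) pairs
    backward_log_prob: callable (message, response) -> log p(message | response)
    lam:               weight of the backward score (assumed hyperparameter)
    """
    best_response, best_score = None, float("-inf")
    for response, forward_lp in candidates:
        score = forward_lp + lam * backward_log_prob(message, response)
        if score > best_score:
            best_response, best_score = response, score
    return best_response, best_score

def toy_backward_log_prob(message, response):
    """Hypothetical scorer: rewards word overlap with the message."""
    overlap = len(set(message.split()) & set(response.split()))
    return -5.0 + 2.0 * overlap

message = "they talk about him"
candidates = [("i don 't know", -1.0), ("he talks about him", -2.5)]
best, score = rescore_nbest(message, candidates, toy_backward_log_prob)
```

Here the generic candidate has the higher forward score, but the message-specific candidate wins after rescoring because the backward scorer can recover more of the message from it.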

For training the standard attention-based sequence-to-sequence model, we used AdaDelta Zeiler (2012) as the optimizer and set the learning rate to 0.2. We used a batch size of 128 and dropout with a rate of 0.2. We set the maximum number of epochs to 10 and applied early stopping, selecting the best model parameters on the validation dataset at the end of each epoch for comparison. For every decoding method, we used the beam search algorithm, which may mitigate drastic responses, with the beam size set to 10. We set the maximum length of the response to 50.

Note that the settings were common in all the models for fair comparison.

4.2 Evaluation Metric

To verify our model, we used two automatic evaluation metrics as well as human evaluation. The automatic evaluation metrics we used for the experiment are described as follows:

  • BLEU Papineni et al. (2002): We used BLEU which is widely used as a metric in machine translation and dialogue generation. BLEU is a metric of similarity between the response and the reference.

  • distinct-1 and distinct-2: We used distinct-1 and distinct-2 which are widely used in dialogue generation to check a diversity of responses of a model. distinct-1 and distinct-2 represent the number of unique unigrams and unique bigrams scaled by the number of all the generated unigrams and bigrams, respectively.
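The distinct-n metric described above can be computed directly from the generated responses. A minimal sketch, assuming whitespace tokenization; the function name and the toy responses are illustrative.

```python
def distinct_n(responses, n):
    """Number of unique n-grams divided by the total number of generated n-grams."""
    total = 0
    unique = set()
    for response in responses:
        tokens = response.split()  # assumed whitespace tokenization
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

responses = ["i don 't know", "i don 't want to go", "what do you mean ?"]
d1 = distinct_n(responses, 1)  # 12 unique unigrams out of 15
d2 = distinct_n(responses, 2)  # 10 unique bigrams out of 12
```

A model that keeps repeating the same generic phrases produces few unique n-grams relative to the total, which lowers both scores.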

We also had two human evaluators evaluate the models. We randomly sampled 200 responses among the different responses generated by the models and rated them on a 3-point scale of Good, Mediocre and Bad, then took the average rating to compare the models. For fair evaluation, we used a simple rule to guide the ratings: a response is Good when it is acceptable and specific to the given message; Mediocre when it is acceptable but not specific to the message; and Bad when it is not acceptable at all or is not a complete sentence.

Message: Mexico was definitely the hottest
  Seq2Seq:              But I didn ’t want to do you know that ?
  Self-Attention & Max: I don ’t know it was

Message: We all like you here you know it
  Seq2Seq:              I don ’t know what you ’re ?
  Self-Attention & Max: You ’re all right ?

Message: They talk about him while he ’s standing right there
  Seq2Seq:              I don ’t know what about it ?
  Self-Attention & Max: He doesn ’t want to talk about him

Message: Yeah we have to go now be quick
  Seq2Seq:              We ’re going to do you know ?
  Self-Attention & Max: I don ’t want to go

Message: Are you keeping up the pace my sweets ?
  Seq2Seq:              I don ’t know what you ’re doing ?
  Self-Attention & Max: You ’re not

Message: Their false history is written in the blood of those who might remember and of those who seek the truth
  Seq2Seq:              I don ’t know what about it ?
  Self-Attention & Max: What do you think ?

Table 4: Sample responses by the baseline and the proposed method

4.3 Result

The responses of the baseline methods and our methods are evaluated in distinct-1, distinct-2 and BLEU (Table 2). Without MMI, Self-Attention & Max achieved a BLEU of 1.26, distinct-1 of 0.009 and distinct-2 of 0.076, which are the best scores in all the metrics (distinct-1 is tied with Seq2Seq & Hard-Attention and Self-Attention & Min). With MMI, Self-Attention & Max achieved a BLEU of 2.67, distinct-1 of 0.012 and distinct-2 of 0.171, while Seq2Seq using MMI achieved a BLEU of 3.38, distinct-1 of 0.010 and distinct-2 of 0.119.

We also present the result of human evaluation (Table 3) (footnote 3: We sampled 200 messages, but one message was not a complete sentence, so it was excluded from the evaluation). Averages are calculated after mapping Good, Mediocre and Bad to the values 1, 2 and 3, respectively. Thus, a lower average score is better. Without MMI, Self-Attention & Max achieved 2.251, which is the best among the methods. With MMI, Self-Attention & Max achieved 2.583, which is better than the 2.809 of Seq2Seq using MMI.
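The averages in Table 3 follow directly from the rating counts under the 1/2/3 mapping. The short check below recomputes them from the counts reported in the table; the dictionary layout and function name are illustrative.

```python
# Rating counts (Good, Mediocre, Bad) as reported in Table 3.
counts = {
    "Seq2Seq": (7, 126, 66),
    "Seq2Seq & Hard-Attention": (6, 126, 67),
    "Random Hard-Attention": (5, 127, 67),
    "Self-Attention & Min": (10, 124, 65),
    "Self-Attention & Max": (13, 123, 63),
    "Seq2Seq using MMI": (10, 18, 171),
    "Self-Attention & Max using MMI": (28, 27, 144),
}

def average_rating(good, mediocre, bad):
    """Mean rating with Good=1, Mediocre=2, Bad=3 (lower is better)."""
    return (1 * good + 2 * mediocre + 3 * bad) / (good + mediocre + bad)

averages = {name: average_rating(*c) for name, c in counts.items()}
best_without_mmi = min(
    (name for name in counts if "MMI" not in name), key=averages.get
)
```

For example, Seq2Seq gives (7 + 2·126 + 3·66) / 199 = 457 / 199 ≈ 2.296, and every row sums to 199 rated responses, consistent with one of the 200 sampled messages being excluded.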

We present sample responses to show diverse responses of the proposed method compared to the baseline (Table 4).

5 Discussion

All the hard-attention methods achieved scores equal to or better than those of Seq2Seq in all the automatic evaluation metrics. Among them, Self-Attention & Max was prominent and achieved the best scores in the metrics. On the other hand, the other hard-attention methods did not sufficiently improve the distinct-1 and distinct-2 scores, contrary to the proposed method. Although Random Hard-Attention in particular was expected to make responses diverse, the result did not meet the expectation. The two hard-attention baselines did not guarantee a good seed of diversity for generating a response. With MMI, Seq2Seq achieved a higher BLEU score than Self-Attention & Max (3.38 vs. 2.67). However, Self-Attention & Max was clearly better than Seq2Seq in terms of diversity (distinct-2 of 0.171 vs. 0.119).

In the human evaluation, all the methods using MMI had significantly fewer Mediocre responses than the methods not using MMI (footnote 4: The poor result with MMI runs counter to the result reported by Li et al. (2016a). We think that the main reason could be the difference in datasets). This means that using MMI tends to avoid safe responses. However, the avoidance of safe responses did not always succeed, and it often led to Bad responses. Seq2Seq using MMI in particular had this tendency, while Self-Attention & Max using MMI sometimes turned Mediocre responses into Good responses. Without MMI, there were no significant differences between the methods. We think this result is likely due to characteristics of the dataset. Alternatively, our human evaluation metric may not be appropriate for evaluating methods on this dataset.

While Seq2Seq generates safe responses, the self-attention-based methods tend to generate either Good responses or Bad responses. This possibly indicates that self-attention-based methods tend to avoid safe responses at some risk. As a result, both self-attention-based methods achieved slightly better average scores than Seq2Seq.

6 Conclusion

In this paper, we proposed a self-attention-based message-relevant response generation method for neural conversation models. The method is based on self-attention, which was originally modeled to learn dependencies within a given sequence and is usually used for sentence encoding. In our work, we use self-attention to select the most informative vector in the encoder, based on similarity.

To verify the proposed method, we conducted experiments showing that it is simple but effective. The experimental results show that our methods generated responses that are more diverse and message-specific than those of the baseline methods. This indicates that our self-attention tends to select an important vector among the hidden vectors as a seed.

References

  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • Dusek and Jurcicek (2016) Ondrej Dusek and Filip Jurcicek. 2016. A context-aware natural language generator for dialogue systems. In Proceedings of the SIGDIAL 2016 Conference. pages 185–190.
  • Li et al. (2016a) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016a. A diversity-promoting objective function for neural conversation models. In Proceedings of NAACL-HLT 2016. pages 110–119.
  • Li et al. (2016b) Jiwei Li, Michel Galley, Chris Brockett, Georgios P. Spithourakis, Jianfeng Gao, and Bill Dolan. 2016b. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. pages 994–1003.
  • Li et al. (2017) Jiwei Li, Will Monroe, and Dan Jurafsky. 2017. Data distillation for controlling specificity in dialogue generation. arXiv preprint arXiv:1702.06703.
  • Mei et al. (2017) Hongyuan Mei, Mohit Bansal, and Matthew R. Walter. 2017. Coherent dialogue with attention-based language models. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17). pages 3252–3258.
  • Mou et al. (2016) Lili Mou, Yiping Song, Rui Yan, Ge Li, Lu Zhang, and Zhi Jin. 2016. Sequence to backward and forward sequences: A content-introducing approach to generative short-text conversation. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. pages 3349–3358.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. pages 311–318.
  • Ritter et al. (2011) Alan Ritter, Colin Cherry, and William B. Dolan. 2011. Data-driven response generation in social media. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing.
  • Serban et al. (2017a) Iulian Vlad Serban, Tim Klinger, Gerald Tesauro, Kartik Talamadupula, Bowen Zhou, Yoshua Bengio, and Aaron Courville. 2017a. Multiresolution recurrent neural networks: An application to dialogue response generation. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17). pages 3288–3294.
  • Serban et al. (2017b) Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017b. A hierarchical latent variable encoder-decoder model for generating dialogues. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17). pages 3295–3301.
  • Shao et al. (2017) Louis Shao, Stephan Gouws, Denny Britz, Anna Goldie, Brian Strope, and Ray Kurzweil. 2017. Generating high-quality and informative conversation responses with sequence-to-sequence models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pages 2210–2219.
  • Shen et al. (2017a) Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang. 2017a. Disan: Directional self-attention network for rnn/cnn-free language understanding. arXiv preprint arXiv:1709.04696.
  • Shen et al. (2017b) Xiaoyu Shen, Hui Su, Yanran Li, Wenjie Li, Shuzi Niu, Yang Zhao, Akiko Aizawa, and Guoping Long. 2017b. A conditional variational framework for dialog generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Short Papers). pages 504–509.
  • Sordoni et al. (2015) Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. In Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL. pages 196–205.
  • Tian et al. (2017) Zhiliang Tian, Rui Yan, Lili Mou, Yiping Song, Yansong Feng, and Dongyan Zhao. 2017. How to make context more useful? an empirical study on context-aware neural conversational models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Short Papers). pages 231–236.
  • Tiedemann (2009) Jorg Tiedemann. 2009. News from OPUS-A Collection Of Multilingual Parallel Corpora with Tools and Interfaces, volume 5. Recent Advances in Natural Language Processing.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30.
  • Zeiler (2012) Matthew D Zeiler. 2012. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.