1 Introduction
Adaptive gradient-based optimizers such as AdaGrad [9] and Adam [14] are among the de facto methods of choice in modern machine learning. These methods adaptively tune the learning rate of each parameter during the optimization process, using cumulative second-order statistics. Often offering superior convergence properties, these methods are very attractive in large-scale applications due to their moderate time and space requirements, which are linear in the number of parameters.
However, in extremely large-scale applications, even this modest memory overhead imposes grave limitations on the quality of the trained model. For example, recent advances in machine translation hinge on inflating the number of parameters in the trained language model to hundreds of millions. In such applications, the memory overhead of the optimizer severely restricts the size of the model that can be used as well as the number of examples in each mini-batch, both of which have been shown to have a dramatic effect on the accuracy of the model.
Motivated by these challenges, we describe an adaptive optimization method that retains the benefits of standard per-parameter adaptivity while significantly reducing its memory costs. Our construction is general and flexible, yet remarkably simple and almost trivial to implement. We give simple convergence guarantees in the convex (stochastic and online) optimization setting, which show our method to be most effective when the gradients have a natural activation pattern; namely, the parameters can be subdivided into (not necessarily disjoint) sets such that the gradient entries within each set are correlated with each other and tend to share a similar order of magnitude. For example, in deep networks the incoming or outgoing edges of a neuron are jointly activated and, loosely speaking, their associated gradients exhibit similar statistical characteristics. That said, we do not assume that the activation pattern is fully prescribed to the optimization algorithm before its run.

Large-scale experiments show that our algorithm achieves comparable, and at times superior, rates of convergence to those obtained by standard, linear-space adaptive methods using the same batch size. Focusing primarily on language modeling tasks, which are notorious for their huge models, we further demonstrate that the reduction in memory footprint can be utilized for a substantial increase in the batch size, which greatly speeds up convergence. As a byproduct of the diminished memory costs, our method also exhibits improved (wall-clock) runtime, which could be attributed to the reduced frequency of memory access.
1.1 Related work
Adaptive learning rates in online and stochastic optimization date back at least to [3] and were popularized in [9, 15], the former of which introduced the well-known AdaGrad algorithm. Several variants of AdaGrad have since been proposed in the optimization and machine learning literature (see [17] and the references therein), the most notable of which is the Adam algorithm [14]. All of these methods require (at least) linear space for maintaining various per-parameter statistics during their execution.
One notable exception, which is directly related to our work, is the Adafactor algorithm [21] that was proposed as a way to reduce the memory costs of Adam, primarily for training large language models. While the memory requirements of our construction are similar to Adafactor’s, the applicability as well as the convergence properties of the two algorithms are quite different. We discuss the connections and disparities in more detail in Section 3 and show an empirical comparison of the algorithms in Section 5.
Another closely related method is the Shampoo [10] algorithm for optimization over tensor structures. The goal of Shampoo is very different, and perhaps more ambitious, than ours: going beyond entrywise learning rates and employing full-matrix regularization in a computationally efficient way. Nonetheless, Shampoo can also be seen as a method to substantially reduce the memory footprint of full-matrix preconditioned algorithms (specifically, full-matrix AdaGrad). In a sense, our algorithms are analogous to a diagonalized version of the Shampoo algorithm.

Yet another recent adaptive optimization method is the GGT algorithm [2]. Similarly to Shampoo, the goal of GGT is to reduce the computation cost of full-matrix preconditioning in order to make it practical in large-scale settings. However, GGT stores multiple copies of the gradient over the course of its execution, and as a result, its space requirements are far from being sublinear in the size of the model.
2 Preliminaries
We begin by establishing some basic notation. For a vector $v \in \mathbb{R}^n$ and a scalar $\alpha > 0$, we use $v^\alpha$ to refer to the vector obtained by raising each of the entries of $v$ to the power $\alpha$. We use $\mathrm{diag}(v)$ to denote the square matrix whose diagonal elements are the entries of $v$ (and whose off-diagonal entries are zeros). We use $[n]$ to denote the set $\{1, \ldots, n\}$. Finally, $\mathbf{1}$ is the $n$-dimensional vector whose entries are all $1$.

2.1 Optimization setup
We henceforth assume the general online optimization setting (see [20, 11]).¹ Optimization takes place in rounds $t = 1, \ldots, T$, where in each round the algorithm has to choose a parameter vector $w_t \in \mathbb{R}^n$. After making the choice on round $t$, the algorithm receives a loss function $\ell_t : \mathbb{R}^n \to \mathbb{R}$ which is used to perform an update to the parameters; often, and as will be the case in this paper, this update is determined by the gradient $g_t = \nabla \ell_t(w_t)$ of the instantaneous loss $\ell_t$ at the current iterate $w_t$. The algorithm is measured by its $T$-round regret, defined as the quantity $\sum_{t=1}^{T} \ell_t(w_t) - \min_{w} \sum_{t=1}^{T} \ell_t(w)$; an algorithm is convergent if its regret is $o(T)$, i.e., if its average regret approaches zero as the number of rounds $T$ grows.

¹ For our analysis, we will assume the online convex optimization setup, in which the loss functions are convex.

The above setup includes stochastic (possibly mini-batched) optimization as a special case. In the latter, one desires to minimize a population loss $F(w) = \mathbb{E}_z[\ell(w, z)]$ based on samples of $z$, where $\ell(w, z)$ defines the loss of parameters $w$ on a batch $z$. The online loss function $\ell_t(w) = \ell(w, z_t)$ is then the average loss over a mini-batch $z_t$ received on iteration $t$, and the stochastic gradient $g_t = \nabla \ell_t(w_t)$ is a conditionally unbiased estimate of the gradient of $F$ at the current parameter vector $w_t$. Under convexity assumptions, an online algorithm with vanishing average regret can be converted to a stochastic optimization algorithm for minimizing the population loss $F$ [4].

2.2 Adaptive methods
For the sake of self-containment, we give a brief description of the AdaGrad algorithm [9]. AdaGrad maintains at every step $t$ the following parameter-wise accumulated statistics, computed based on the previously obtained gradients $g_1, \ldots, g_t$:
$$\gamma_t(i) \;=\; \sum_{s=1}^{t} g_s^2(i)\,, \qquad \forall\, i \in [n]\,.$$
Relying on these statistics, the update rule of the algorithm on step $t$ takes the form:
$$w_{t+1}(i) \;=\; w_t(i) - \eta\, \frac{g_t(i)}{\sqrt{\gamma_t(i)}}\,, \qquad \forall\, i \in [n]\,,$$
where $\eta > 0$ is an external learning rate parameter. AdaGrad has been shown to be particularly effective in training sparse models, where the effective learning rates decay in a moderate way for rare (yet possibly informative) features. In these cases, AdaGrad can potentially lead to huge gains in terms of convergence; see the discussion in [9].
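As a concrete illustration, the accumulation and update above can be sketched in a few lines of NumPy. This is a minimal sketch on a toy quadratic; the function and variable names are ours, not from any particular library:

```python
import numpy as np

def adagrad_step(w, g, gamma, eta=0.1, eps=1e-8):
    """One AdaGrad step: accumulate per-parameter squared gradients and
    scale the update by the inverse square root of the accumulator."""
    gamma = gamma + g ** 2          # gamma_t(i) = sum_{s<=t} g_s(i)^2
    w = w - eta * g / (np.sqrt(gamma) + eps)
    return w, gamma

# Toy quadratic f(w) = ||w||^2 / 2, whose gradient is w itself.
w = np.array([1.0, -2.0, 0.5])
gamma = np.zeros_like(w)
for _ in range(100):
    w, gamma = adagrad_step(w, w, gamma)
```

Note how a coordinate that receives large gradients accumulates a large $\gamma(i)$ and therefore gets a smaller effective learning rate.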
2.3 Activation patterns and covers
While the theoretical analysis of AdaGrad and related algorithms does not make any assumptions on gradient values, in practice we often observe that certain entries of the gradients have similar values, and exhibit what we call an activation pattern. For example, in embedding layers of deep networks, an entire column is either zero or nonzero. Similarly, in layers with ReLU activations it is often observed that all gradients corresponding to the same unit are jointly either zero or nonzero, and in the latter case, their absolute values share a similar order of magnitude.

In both examples, for each parameter $i$ there is a certain set $S$ of indices such that for all gradients $g_t$ we expect that $g_t^2(i) \approx g_t^2(j)$ for all $j \in S$. We do not attempt to formalize this notion further, and the analysis of our algorithm does not rely on a definition of an activation pattern. Rather, we leave it as an intuitive concept that serves as a motivation for our use of a cover.
Definition.
A cover of a set of parameters $[n]$ is a collection of nonempty sets $\{S_r\}_{r=1}^{k}$, such that $S_r \subseteq [n]$ and $\bigcup_{r=1}^{k} S_r = [n]$. In particular, each index $i \in [n]$ may be contained in multiple sets $S_r$. We say that $k$ is the size of the cover.
Specific covers of interest include:

(i) Singletons: $S_i = \{i\}$ for all $i \in [n]$; this is a degenerate case which does not model any correlations between parameters.

(ii) Matrix rows/columns: parameters are organized as an $m \times n$ matrix, and each $S_r$ is the set of indices corresponding to a row or a column of this matrix.

(iii) Tensor slices: parameters are organized as a tensor of dimension $n_1 \times \cdots \times n_p$, and each $S_r$ is the set of indices corresponding to a $(p-1)$-dimensional slice of the tensor.

(iv) Multiple tensors: parameters are organized in multiple tensors, each of which has its own cover. The cover of the entire set of parameters is then the union of the individual covers.
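For instance, the row/column cover of case (ii) can be written down explicitly over flat parameter indices. The helper below is a hypothetical construction of our own, for illustration only:

```python
import numpy as np

def row_col_cover(m, n):
    """Build the row/column cover of an m-by-n matrix parameter,
    expressed as index sets over the flat indices 0..m*n-1."""
    idx = np.arange(m * n).reshape(m, n)
    rows = [set(idx[i, :].tolist()) for i in range(m)]
    cols = [set(idx[:, j].tolist()) for j in range(n)]
    return rows + cols  # a cover of size k = m + n

cover = row_col_cover(3, 4)
```

Every flat index belongs to exactly two sets here (its row and its column), so the cover is overlapping, as the definition allows.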
Our algorithm is provided with a prescribed cover as input, and its convergence is characterized in terms of the cover. We further argue, though only informally, that when a cover is “consistent” with the natural activation pattern of the parameters, we can expect the convergence of our algorithm to be significantly better.
3 The SM3 algorithm
The idea behind our algorithm is to keep a single variable for each set $S_r$ in the cover. Thus, the additional space it requires is $O(k)$ rather than $O(n)$; typically $k$ is substantially smaller than $n$, which yields tangible savings in memory. Concretely, for each set $S_r$, the algorithm maintains a running sum, $\mu_t(r)$, of the maximal variance (squared gradient entry) over all indices $j \in S_r$. Next, for each parameter $i$, we take the minimum over all variables $\mu_t(r)$ associated with sets which cover $i$, denoted $\nu_t(i)$. Thereafter, the learning rate corresponding to the $i$'th gradient entry is determined by taking the square root of this minimum. In summary, on step $t$ the algorithm computes, for all $r \in [k]$ and $i \in [n]$,
$$\mu_t(r) = \mu_{t-1}(r) + \max_{j \in S_r} g_t^2(j)\,, \qquad \nu_t(i) = \min_{r:\, S_r \ni i} \mu_t(r)\,, \qquad w_{t+1}(i) = w_t(i) - \eta\, \frac{g_t(i)}{\sqrt{\nu_t(i)}}\,.$$
Accordingly, we name our algorithm the Square-root of Minima of Sums of Maxima of Squared-gradients Method, or in short, SM3. See Algorithm 1 (SM3-I) for its pseudocode.
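The step just described (sums of per-set maxima, then minima over covering sets, then a square root) can be sketched as follows. This is our own NumPy rendering, not the released implementation; `cover` is a list of index sets, and the loop-based min/max are chosen for clarity rather than efficiency:

```python
import numpy as np

def sm3_step(w, g, mu, cover, eta=0.1, eps=1e-8):
    """One SM3-I step: accumulate per-set maxima of squared gradients,
    then scale each entry by the min over its covering sets."""
    for r, S in enumerate(cover):
        mu[r] += max(g[j] ** 2 for j in S)             # mu_t(r)
    nu = np.array([min(mu[r] for r, S in enumerate(cover) if i in S)
                   for i in range(len(w))])            # nu_t(i)
    w = w - eta * g / (np.sqrt(nu) + eps)
    return w, mu, nu

# With the singleton cover of case (i), nu_t(i) is exactly the AdaGrad
# accumulator sum_s g_s(i)^2:
cover = [{0}, {1}, {2}]
w, mu = np.zeros(3), np.zeros(len(cover))
for g in (np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 3.0])):
    w, mu, nu = sm3_step(w, g, mu, cover)
```

Only the $k$ accumulators in `mu` persist across steps; `nu` is recomputed from them, which is where the memory saving comes from.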
In case (i) above, where there is a set $S_i = \{i\}$ for each $i$, the algorithm reduces to the AdaGrad algorithm [9]. The more interesting cases are where $k \ll n$ and each index is covered by multiple sets. In such settings, the memory overhead of the algorithm is sublinear in $n$. In particular, in setting (ii) the memory footprint reduces from $O(mn)$ to $O(m + n)$, which can be quite substantial at large scale. In setting (iii) the improvement is more pronounced, as the space requirement drops from $O(n_1 \cdots n_p)$ to $O(n_1 + \cdots + n_p)$.
The time per iteration of Algorithm 1 is $O(\sum_{r=1}^{k} |S_r|)$. To see this, consider a bipartite graph defined over $n + k$ vertices. Nodes on one side of the graph correspond to indices $i \in [n]$, while nodes on the other side correspond to indices $r \in [k]$. The edges of the graph are all pairs $(i, r)$ such that $i \in S_r$. The complexity of each of the inner for-loops of the algorithm scales with the number of edges in this graph, which is equal to $\sum_{r=1}^{k} |S_r|$. (Applying the update to the weights takes $O(n)$ time, but this is always dominated by the former quantity.)
As a final remark, notice that the update rule of Algorithm 1 seems to involve a division by zero when $\nu_t(i) = 0$. However, whenever $\nu_t(i) = 0$ then necessarily also $g_t(i) = 0$. (This is a direct consequence of Claim 1 below.) In other words, whenever the denominator in the update rule is zero, the corresponding entry has zero gradient and thus need not be updated.
3.1 Analysis
We now prove convergence guarantees for Algorithm 1. We first show two elementary properties of the step sizes the algorithm computes.
Claim 1.
For any $i \in [n]$, the sequence $\nu_1(i), \nu_2(i), \ldots$ is monotonically increasing and, for all $t$,
$$\sum_{s=1}^{t} g_s^2(i) \;\leq\; \nu_t(i)\,.$$
Proof.
The monotonicity is immediate, as for any $r$ the variable $\mu_t(r)$ is increasing in $t$ by definition; thus $\nu_t(i) = \min_{r:\, S_r \ni i} \mu_t(r)$ is also increasing in $t$ for all $i$.
Next, since $g_s^2(i) \leq \max_{j \in S_r} g_s^2(j)$ for any set $S_r$ that contains $i$, we have
$$\sum_{s=1}^{t} g_s^2(i) \;\leq\; \sum_{s=1}^{t} \max_{j \in S_r} g_s^2(j) \;=\; \mu_t(r)\,.$$
Hence,
$$\sum_{s=1}^{t} g_s^2(i) \;\leq\; \min_{r:\, S_r \ni i} \mu_t(r)\,.$$
The claim now follows since $\nu_t(i) = \min_{r:\, S_r \ni i} \mu_t(r)$. ∎
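Claim 1 is also easy to verify numerically. The following self-contained check (our own construction, with an arbitrary overlapping cover) confirms both the monotonicity of $\nu_t(i)$ and the lower bound on the accumulated squared gradients:

```python
import numpy as np

# Sanity check of Claim 1 on a random overlapping cover: the SM3
# statistic nu_t(i) always upper-bounds sum_s g_s(i)^2 and is monotone.
rng = np.random.default_rng(0)
n = 6
cover = [{0, 1, 2}, {2, 3, 4}, {4, 5, 0}]   # overlapping, covers [n]
mu = np.zeros(len(cover))
sq_sum = np.zeros(n)                         # sum_s g_s(i)^2
nu_prev = np.zeros(n)
monotone = True
for _ in range(50):
    g = rng.normal(size=n)
    sq_sum += g ** 2
    for r, S in enumerate(cover):
        mu[r] += max(g[j] ** 2 for j in S)   # mu_t(r)
    nu = np.array([min(mu[r] for r, S in enumerate(cover) if i in S)
                   for i in range(n)])       # nu_t(i)
    monotone &= bool(np.all(nu >= nu_prev - 1e-12))
    nu_prev = nu
```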
Proposition 2.
Assume that the loss functions $\ell_1, \ell_2, \ldots$ are convex, and let $w_1, w_2, \ldots$ be the iterates generated by Algorithm 1. Then, for any $w^\star \in \mathbb{R}^n$,
$$\sum_{t=1}^{T} \big( \ell_t(w_t) - \ell_t(w^\star) \big) \;\leq\; 2D \sum_{i=1}^{n} \sqrt{\min_{r:\, S_r \ni i} \, \sum_{t=1}^{T} \max_{j \in S_r} g_t^2(j)}\,,$$
where $D = \max_{t} \|w_t - w^\star\|_\infty$, and choosing $\eta = D$.
In particular, if the functions $\ell_t$ are stochastic samples with $\mathbb{E}[\ell_t] = F$, e.g., each $\ell_t$ is the loss function over a batch of i.i.d. examples, then the above bound translates using standard arguments to a convergence guarantee for the average iterate $\bar{w}_T = \frac{1}{T} \sum_{t=1}^{T} w_t$ of the form
$$\mathbb{E}\big[ F(\bar{w}_T) \big] - \min_{w} F(w) \;=\; O\!\left( \frac{D}{T}\, \mathbb{E}\!\left[ \sum_{i=1}^{n} \sqrt{\min_{r:\, S_r \ni i} \sum_{t=1}^{T} \max_{j \in S_r} g_t^2(j)} \right] \right).$$
In the above proposition we implicitly assume that the iterates of Algorithm 1 remain bounded and that $D$ is a constant. This can be enforced by projecting the iterates onto a bounded set of our choice. We avoid introducing projections explicitly, as they are rarely used in practice.
Proof of Proposition 2.
Let us first assume that $g_1(i) \neq 0$ for all $i \in [n]$, so that $\nu_t(i) > 0$ for all $t$ and $i$ due to Claim 1. The starting point of the analysis is the simple observation that Algorithm 1 performs Online Mirror Descent updates, where the step on round $t$ uses the positive definite diagonal matrix $H_t = \mathrm{diag}(\nu_t^{1/2})$ for regularization. Then, employing a standard regret bound for the Online Mirror Descent algorithm with time-dependent regularization (see for instance [9, Proposition 3]), the regret of the algorithm is bounded by
$$\sum_{t=1}^{T} \big( \ell_t(w_t) - \ell_t(w^\star) \big) \;\leq\; \frac{1}{2\eta} \sum_{t=1}^{T} \Big( \|w_t - w^\star\|_{H_t}^2 - \|w_t - w^\star\|_{H_{t-1}}^2 \Big) + \frac{\eta}{2} \sum_{t=1}^{T} \big( \|g_t\|_{H_t}^{*} \big)^2\,.$$
Here, $\|x\|_{H} = \sqrt{x^{\mathsf{T}} H x}$ and $\|x\|_{H}^{*} = \sqrt{x^{\mathsf{T}} H^{-1} x}$ is the corresponding dual norm (with $H_0 = 0$).

Henceforth, for notational convenience we set $\gamma_t(i) = \sum_{s=1}^{t} g_s^2(i)$. Simplifying the first sum above using the facts that the $H_t$ are diagonal matrices and that their diagonal entries are monotonically increasing (Claim 1), we have
$$\sum_{t=1}^{T} \Big( \|w_t - w^\star\|_{H_t}^2 - \|w_t - w^\star\|_{H_{t-1}}^2 \Big) \;\leq\; D^2 \sum_{i=1}^{n} \sqrt{\nu_T(i)} \;=\; D^2\, \mathrm{tr}(H_T)\,.$$
Now, consider the positive definite diagonal matrix $M_t = \mathrm{diag}(\gamma_t^{1/2})$. From [10, Lemma 2] with $M_t$, we have
$$\sum_{t=1}^{T} \sum_{i=1}^{n} \frac{g_t^2(i)}{\sqrt{\gamma_t(i)}} \;\leq\; 2 \sum_{i=1}^{n} \sqrt{\gamma_T(i)}\,.$$
Also, from Claim 1 we know that for all $t$ and $i$, $\gamma_t(i) \leq \nu_t(i)$, thus
$$\sum_{t=1}^{T} \big( \|g_t\|_{H_t}^{*} \big)^2 \;=\; \sum_{t=1}^{T} \sum_{i=1}^{n} \frac{g_t^2(i)}{\sqrt{\nu_t(i)}} \;\leq\; \sum_{t=1}^{T} \sum_{i=1}^{n} \frac{g_t^2(i)}{\sqrt{\gamma_t(i)}} \;\leq\; 2 \sum_{i=1}^{n} \sqrt{\gamma_T(i)} \;\leq\; 2\, \mathrm{tr}(H_T)\,.$$
In summary, we have established that
$$\sum_{t=1}^{T} \big( \ell_t(w_t) - \ell_t(w^\star) \big) \;\leq\; \frac{D^2}{2\eta}\, \mathrm{tr}(H_T) + \eta\, \mathrm{tr}(H_T)\,.$$
Plugging in $\eta = D$ and the expression for the diagonal elements of $H_T$, we obtain the claim.

For the degenerate case, where the matrices $H_t$ may not be strictly positive definite, a careful yet technical inspection of the proof above reveals that our arguments apply to this case as well by replacing inverses with pseudo-inverses. The rest of the proof remains intact, as the algorithm does not update parameter $i$ on step $t$ if the corresponding diagonal entry in $H_t$ is zero. ∎
3.2 Discussion
Notice that adding more sets to the cover used by SM3 improves its convergence bound, but results in a worse space complexity and a higher runtime per step. Therefore, it makes sense in practice to include in the cover only the sets for which we can quickly compute the max and min operations as required by the algorithm. We discuss this point from a practical perspective in Section 4.
As we mentioned above, when $k = n$ and $S_i = \{i\}$ for all $i \in [n]$, Algorithm 1 reduces to the AdaGrad algorithm. The regret bound in Proposition 2 then precisely recovers the bound attained by AdaGrad (see [9, Eq. 6]),
$$\sum_{t=1}^{T} \big( \ell_t(w_t) - \ell_t(w^\star) \big) \;\leq\; 2D \sum_{i=1}^{n} \sqrt{\sum_{t=1}^{T} g_t^2(i)}\,.$$
In the general case, we have
$$\sum_{t=1}^{T} g_t^2(i) \;\leq\; \min_{r:\, S_r \ni i} \sum_{t=1}^{T} \max_{j \in S_r} g_t^2(j)\,,$$
as follows from Claim 1. Thus, as can be expected from a space-restricted scheme, our bound is never superior to AdaGrad's regret bound.

Nevertheless, the two bounds above are of similar order of magnitude when the cover is consistent with the activation pattern of the gradients $g_1, \ldots, g_T$. Indeed, if for an entry $i$ there is a set $S_r$ that covers it such that $g_t^2(j) \approx g_t^2(i)$ for all $j \in S_r$ and all $t$, then $\sum_{t=1}^{T} \max_{j \in S_r} g_t^2(j) \approx \sum_{t=1}^{T} g_t^2(i)$, and thus $\nu_T(i) \approx \sum_{t=1}^{T} g_t^2(i)$.
Therefore, in these scenarios we inherit the convergence properties of AdaGrad while using sublinear memory. In particular, if in addition the gradients are sparse, we can obtain an improved dependence on the dimension as discussed in Duchi et al. [9].
It is also worthwhile to compare our algorithm to Adafactor [21]. The two algorithms differ in a number of important ways. First, Adafactor is only defined for matrix-shaped parameter sets, while SM3 applies to tensors of arbitrary dimensions and, even more generally, to any predefined cover of the parameters. Second, Adafactor is essentially a fixed step-size algorithm and often requires an external step-size decay schedule for ensuring convergence; SM3, in contrast, decays its learning rates automatically, similarly to AdaGrad. Finally, SM3 has the benefit of entertaining rigorous, albeit elementary, convergence guarantees in the convex case.
3.3 SM3-II
We now discuss a slightly more efficient variant of SM3, which we describe in Algorithm 2 (SM3-II). It is very similar to Algorithm 1: instead of the accumulation $\mu_t(r) = \mu_{t-1}(r) + \max_{j \in S_r} g_t^2(j)$, SM3-II maintains the statistics
$$\nu'_t(i) \;=\; g_t^2(i) + \min_{r:\, S_r \ni i} \mu'_{t-1}(r)\,, \qquad \mu'_t(r) \;=\; \max_{i \in S_r} \nu'_t(i)\,,$$
and improves on Algorithm 1 in the following sense.
Proposition 3.
For any $i \in [n]$, the sequence $\nu'_1(i), \nu'_2(i), \ldots$ is monotonically increasing. Further, fixing a sequence of gradient vectors $g_1, \ldots, g_T$, we have for all $t$ and $i$ that
$$\sum_{s=1}^{t} g_s^2(i) \;\leq\; \nu'_t(i) \;\leq\; \nu_t(i)\,,$$
where $\nu_1(i), \nu_2(i), \ldots$ is the sequence produced by Algorithm 1 upon receiving the gradient vectors $g_1, \ldots, g_T$.
In other words, Algorithm 2 provides a tighter upper bound on the cumulative gradient squares than Algorithm 1. Consequently, we can show, along similar lines to the proof of Proposition 2, a slightly better convergence bound for Algorithm 2 that scales with the quantity $\sum_{i=1}^{n} \sqrt{\nu'_T(i)}$, which is always smaller than the corresponding quantity in the bound of Algorithm 1.
Proof of Proposition 3.
First, to establish monotonicity, note that the algorithm maintains $\mu'_t(r) = \max_{j \in S_r} \nu'_t(j) \geq \nu'_t(i)$ for any $r$ and $i \in S_r$. Hence, for any $t$ and $i$ we have
$$\nu'_{t+1}(i) \;=\; g_{t+1}^2(i) + \min_{r:\, S_r \ni i} \mu'_t(r) \;\geq\; g_{t+1}^2(i) + \nu'_t(i) \;\geq\; \nu'_t(i)\,,$$
which, by induction, also yields the lower bound $\sum_{s=1}^{t} g_s^2(i) \leq \nu'_t(i)$.

Let $\mu_t(r)$ denote the statistics maintained by Algorithm 1. We next prove by induction that $\mu'_t(r) \leq \mu_t(r)$ for all $t$ and $r$. For $t = 0$ this is true as $\mu'_0(r) = \mu_0(r) = 0$ for all $r$. For the induction step, assume that $\mu'_{t-1}(r) \leq \mu_{t-1}(r)$ for all $r$, and write
$$\mu'_t(r) \;=\; \max_{i \in S_r} \Big( g_t^2(i) + \min_{r':\, S_{r'} \ni i} \mu'_{t-1}(r') \Big) \;\leq\; \max_{i \in S_r} g_t^2(i) + \max_{i \in S_r} \min_{r':\, S_{r'} \ni i} \mu'_{t-1}(r')\,.$$
On the other hand, we have
$$\mu_t(r) \;=\; \mu_{t-1}(r) + \max_{i \in S_r} g_t^2(i)\,,$$
so that $\mu'_t(r) \leq \mu_t(r)$, where the final inequality follows from the fact that, for all $i \in S_r$, one has
$$\min_{r':\, S_{r'} \ni i} \mu'_{t-1}(r') \;\leq\; \mu'_{t-1}(r) \;\leq\; \mu_{t-1}(r)\,.$$
Finally, for any $r$ such that $i \in S_r$, this gives $\nu'_t(i) = g_t^2(i) + \min_{r':\, S_{r'} \ni i} \mu'_{t-1}(r') \leq \max_{j \in S_r} g_t^2(j) + \mu_{t-1}(r) = \mu_t(r)$; taking the minimum over all such $r$ yields $\nu'_t(i) \leq \nu_t(i)$. ∎
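The two accumulation schemes can also be compared numerically. The sketch below implements both recursions as we read them (the function names and the particular cover are ours) and checks the sandwich inequality of Proposition 3 on random gradients:

```python
import numpy as np

def sm3_i_stats(grads, cover, n):
    """nu_T(i) from Algorithm 1: sums of per-set maxima, then min."""
    mu = np.zeros(len(cover))
    for g in grads:
        for r, S in enumerate(cover):
            mu[r] += max(g[j] ** 2 for j in S)
    return np.array([min(mu[r] for r, S in enumerate(cover) if i in S)
                     for i in range(n)])

def sm3_ii_stats(grads, cover, n):
    """nu'_T(i) from Algorithm 2 (our reading): each entry refines the
    set estimate with its exact squared gradient before the per-set max."""
    mu = np.zeros(len(cover))
    nu = np.zeros(n)
    for g in grads:
        nu = np.array([g[i] ** 2 +
                       min(mu[r] for r, S in enumerate(cover) if i in S)
                       for i in range(n)])
        mu = np.array([max(nu[i] for i in S) for S in cover])
    return nu

rng = np.random.default_rng(1)
n, T = 6, 30
cover = [{0, 1, 2}, {2, 3, 4}, {4, 5, 0}]
grads = [rng.normal(size=n) for _ in range(T)]
nu_i = sm3_i_stats(grads, cover, n)
nu_ii = sm3_ii_stats(grads, cover, n)
sq = sum(g ** 2 for g in grads)
```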
4 Implementation details
We implemented SM3 as an optimizer in TensorFlow [1]. Our implementation follows the pseudocode of Algorithm 2 (SM3-II), as it performed slightly yet consistently better than Algorithm 1 in our experiments (as predicted by our bounds). The implementation of the SM3-II optimizer will be released soon as open-source code.

Default covers.
Our implementation employs covers induced by rows and columns of matrices, and more generally, by slices of higher-order tensors (e.g., in convolutional layers). These covers allow us to exploit the highly efficient tensor operations provided by GPUs and TPUs for computing max and min over the sets.
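To see why these covers are cheap on accelerators: for a matrix-shaped parameter, the per-set maxima of the row/column cover are just axis-wise reductions, and the per-entry minima combine two small vectors. A NumPy illustration of our own:

```python
import numpy as np

# For an m-by-n matrix parameter, the row/column cover's per-set maxima
# are plain axis reductions, which map directly onto efficient
# accelerator primitives.
g = np.random.default_rng(2).normal(size=(4, 5))
row_acc = (g ** 2).max(axis=1)      # one accumulator per row    (m values)
col_acc = (g ** 2).max(axis=0)      # one accumulator per column (n values)
# nu(i, j) = min of the two accumulators covering entry (i, j):
nu = np.minimum.outer(row_acc, col_acc)
```

Only `row_acc` and `col_acc` (here, $4 + 5 = 9$ values) would be stored between steps, while `nu` is materialized transiently for the update of all $20$ entries.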
Momentum.
Our optimizer can be used in conjunction with momentum for improved performance. We found that momentum, set at 0.9, adds stability and allows the use of larger learning rates for all of the optimizers we compared.
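A generic sketch of how momentum composes with a preconditioned update, in heavy-ball form with $\beta = 0.9$; this is an illustration of ours, not the exact TensorFlow implementation:

```python
def momentum_step(w, precond_grad, velocity, eta=0.1, beta=0.9):
    """Heavy-ball momentum on top of an (already preconditioned)
    gradient: accumulate a running velocity and step along it."""
    velocity = beta * velocity + precond_grad
    return w - eta * velocity, velocity

# Toy quadratic f(w) = w^2 / 2, whose gradient is w:
w, v = 1.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, w, v)
```

The velocity buffer adds one extra scalar per parameter, which is the usual linear-memory cost of momentum and is independent of the cover.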
Hyperparameters and learning rate.
An important feature of SM3, compared to other widespread optimizers, is that it has only a single hyperparameter that requires tuning: the learning rate $\eta$. Concretely, SM3 does not rely on a learning-rate decay schedule, which is often difficult to tune. The experiments reported in Table 1 of Section 5 verify this empirically. This aspect of SM3 makes it particularly appealing for training large-scale models, where the training time is too long to allow for exhaustive hyperparameter tuning.
Learning-rate ramp-up.
Having said the above, we often find in deep learning tasks that a high learning rate in the early stages of optimization causes instability and might result in a failure to converge. Therefore, while SM3 does not require an external learning-rate decay schedule, it is often helpful to gradually increase the parameter $\eta$ from zero to its maximal value, typically over the course of the first few thousand updates. While we used this ad hoc safeguard in our experiments, we plan to replace it in the future with norm constraints on the cover sets.

5 Experiments
We demonstrate the practical benefits of SM3 on several machine learning tasks, using published state-of-the-art architectures and algorithms as baselines. We performed experiments on the following three tasks:

- Machine translation on two standard datasets from WMT'14: English to French (en→fr) with 36.3M sentence pairs, and English to German (en→de) with 4.5M sentence pairs.
- Language modeling, training the BERT-Large model on the combined Wikipedia and BooksCorpus datasets (Section 5.2).
- Image classification, training AmoebaNet-D on ImageNet (Section 5.3).
Table 1: Baseline optimizers and their learning-rate decay rules.

Experiment     | Optimizer | Decay rule
WMT'14 en→de   | Adafactor |
WMT'14 en→fr   | Adafactor |
BERT-Large     | Adam      | linear
AmoebaNet-D    | RMSProp   | staircase
               | SM3       | None
5.1 Machine translation
We first ran our experiments using the Transformer model [23] on the smaller WMT'14 en→de dataset. We trained the models using the Lingvo [22] sequence modeling framework, available in TensorFlow. We compared SM3 with Adafactor, which has similar space requirements. Results are provided in Figure 1 and Table 2. SM3 performed slightly better than Adafactor in both test perplexity and the BLEU score of the trained models.
We then moved on to the larger WMT'14 en→fr dataset, using the larger Transformer-Big architecture from [5]. Our results are shown in Figure 2 and Table 2. We see a significant improvement in convergence rate, which further translates into a substantial improvement in BLEU score.
We trained both models on a Cloud TPU-v2 [13]. A configuration has 32 cores, each with 8GB of memory. The Transformer model for WMT'14 en→de was trained with batches of size 1536 for 700k steps. The Transformer-Big model for WMT'14 en→fr was trained with the maximal batch size that could fit on each core, yielding an effective batch size of 768, for 1M steps. The Transformer-Big model consists of 6 layers each for its encoder and decoder; each layer has 1024 model dimensions, 8192 hidden dimensions, and 16 attention heads. In total, Transformer-Big has 375.4M parameters (1.432GB) and uses a significant fraction of the overall memory, thus making SM3 more effective there.
All experiments were run with synchronous (stochastic) gradient updates. The models used 32K wordpieces [19] for each language pair. We computed BLEU scores on the Newstest 2014 sets for evaluation. We also disabled checkpoint averaging in order to underscore the improved convergence rate of SM3. Our BLEU scores are not directly comparable to those of [23]; instead, we followed the experimental protocol described in [5]. BLEU scores were computed on tokenized, true-case outputs, without manual post-processing of the text, similarly to [24].
Table 2: BLEU scores of the trained models on Newstest 2014.

Dataset | Model           | Optimizer | BLEU
en→de   | Transformer     | Adafactor | 26.88
en→de   | Transformer     | SM3       | 27.32
en→fr   | Transformer-Big | Adafactor | 39.67
en→fr   | Transformer-Big | SM3       | 40.49
5.2 Language modeling
We trained a BERT-Large language model from [8] on the combined Wikipedia and BooksCorpus [25] datasets. BERT-Large is a large bidirectional transformer model containing 24 transformer blocks with 1024 hidden dimensions and 16 self-attention heads. It has 340M parameters (1.297 GiB), and is set up to optimize two losses jointly: (a) a masked language model (Masked-LM) loss, where the task is to predict masked tokens based on the surrounding context, and (b) a next sentence prediction (NSP) loss, where the task is to predict whether one sentence follows another, with negative sentences selected randomly from the corpus.
We ran all our experiments using the open-sourced code from [8] on a Cloud TPU-v2 configuration with 128 cores. The baseline was the Adam optimizer with the learning rate and moment hyperparameters from [8]. The learning rate was warmed up over the first 10,000 steps, followed by a linear decay. SM3 used the same warm-up as a safety mechanism, with no further tinkering. Momentum was set to 0.9. We trained all models for 500K steps, and split the dataset into train and test sets.
Our results are presented in Figure 3. We see that SM3 works as well as Adam at the same batch size. However, SM3 lets us train with a much larger batch size while using a similar amount of memory to Adam. We were able to substantially increase the number of examples in each batch, yielding quality improvements and faster convergence.
5.3 AmoebaNet-D on ImageNet
We trained the AmoebaNet-D model described in [16], which was originally constructed to have a low training cost, on the ImageNet dataset [18]. We used the open-source code available from [6], where we changed the optimizer to SM3 and removed the learning-rate decay. The model was trained on a Cloud TPU-v2 configuration. The baseline used RMSProp [12] with Nesterov momentum and a staircase learning-rate decay schedule, and the model was trained with a batch size of 1024, as recommended in [6]. Our results in Figure 4 indicate that SM3 performed very well on this task, resulting in improved top-1 (77.95%) and top-5 (93.89%) accuracies.

6 Conclusions
We presented SM3, a simple and effective adaptive optimization algorithm for stochastic optimization in settings where memory during training is severely limited. In these settings, the memory overhead of adaptive methods such as AdaGrad and Adam is prohibitively large, and thus limits the size of the models that can be trained as well as the number of samples in each mini-batch. We demonstrated empirically that SM3 can be used effectively in such settings and dramatically decreases the memory overhead. Our experiments further show that utilizing the freed memory to increase the batch size can lead to significant gains in performance.
In future work we will focus on extending and strengthening our theoretical guarantees, improving the robustness of SM3, and further experimentation with various covers for additional domains. In particular, we plan to evaluate SM3 on training recurrent networks for speech recognition and audio generation.
Acknowledgements
We would like to thank Luke Metz, Kunal Talwar, and Yonghui Wu for many discussions and helpful suggestions. Special thanks go to Samy Bengio, who made it possible for us to conduct large-scale experiments on a tight schedule. We would also like to thank Zhifeng Chen for coming up with the shorthand 'SM3'.
References
 Abadi et al. [2016] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
 Agarwal et al. [2018] N. Agarwal, B. Bullins, X. Chen, E. Hazan, K. Singh, C. Zhang, and Y. Zhang. The case for fullmatrix adaptive regularization. CoRR, abs/1806.02958, 2018.
 Auer et al. [2002] P. Auer, N. Cesa-Bianchi, and C. Gentile. Adaptive and self-confident online learning algorithms. Journal of Computer and System Sciences, 64(1):48–75, 2002.
 Cesa-Bianchi et al. [2004] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of online learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.

 Chen et al. [2018] M. X. Chen, O. Firat, A. Bapna, M. Johnson, W. Macherey, G. Foster, L. Jones, M. Schuster, N. Shazeer, N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, Z. Chen, Y. Wu, and M. Hughes. The best of both worlds: Combining recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, pages 76–86, 2018.
 CloudTPU [2018] CloudTPU. Training AmoebaNet-D on Cloud TPU. https://cloud.google.com/tpu/docs/tutorials/amoebanet, 2018.
 Coleman et al. [2018] C. Coleman, D. Kang, D. Narayanan, L. Nardi, T. Zhao, J. Zhang, P. Bailis, K. Olukotun, C. Re, and M. Zaharia. Analysis of DAWNBench, a time-to-accuracy machine learning performance benchmark. arXiv preprint arXiv:1806.01427, 2018.
 Devlin et al. [2018] J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018. URL http://arxiv.org/abs/1810.04805.
 Duchi et al. [2011] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
 Gupta et al. [2018] V. Gupta, T. Koren, and Y. Singer. Shampoo: Preconditioned stochastic tensor optimization. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 1842–1850, 2018.
 Hazan [2016] E. Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(34):157–325, 2016.
 Hinton et al. [2012] G. Hinton, N. Srivastava, and K. Swersky. Neural networks for machine learning, Lecture 6a: Overview of mini-batch gradient descent, 2012.
 Jouppi et al. [2017] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al. Indatacenter performance analysis of a tensor processing unit. In Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on, pages 1–12. IEEE, 2017.
 Kingma and Ba [2014] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 McMahan and Streeter [2010] H. B. McMahan and M. Streeter. Adaptive bound optimization for online convex optimization. COLT 2010, page 244, 2010.
 Real et al. [2018] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.
 Reddi et al. [2018] S. J. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.

 Russakovsky et al. [2015] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 Schuster and Nakajima [2012] M. Schuster and K. Nakajima. Japanese and Korean voice search. In ICASSP, pages 5149–5152. IEEE, 2012.
 ShalevShwartz [2012] S. ShalevShwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.
 Shazeer and Stern [2018] N. Shazeer and M. Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, pages 4603–4611, 2018.
 [22] J. Shen, P. Nguyen, Y. Wu, Z. Chen, et al. Lingvo. https://github.com/tensorflow/lingvo.
 Vaswani et al. [2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
 Wu et al. [2016] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
 Zhu et al. [2015] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27, 2015.