1 Introduction
Over the last few years, scaling neural networks has tremendously improved the quality of models on several machine learning tasks. State-of-the-art Natural Language Processing models have billions of parameters, especially for tasks like Machine Translation (Shazeer et al., 2018; Huang et al., 2019), Language Modeling (Radford et al.) and Natural Language Understanding (Devlin et al., 2018; Raffel et al., 2019). While training these models is feasible given the dramatic increase in the efficiency of training hardware (Jouppi et al., 2017) and research into efficient model-parallelism (Shazeer et al., 2018; Huang et al., 2019), the amount of computation that can be expended at inference is often limited. However, these huge networks are usually inflexible and offer little control over the amount of computation used on any example, independent of the complexity of the input or the available computation budget for inference.
Conditional Computation based approaches allow training networks where certain subnetworks can be conditionally executed, based on discrete decisions (optionally) trained with the model (Spall et al., 1992; Bengio et al., 2013). These methods also offer the potential for more control over the computation expended by the model during inference, conditioned on example difficulty or the available computation budget.
Our approach for adapting models for conditional computation: During training, subnetwork outputs are gated by noised continuous outputs from control networks trained end-to-end with the model. During inference, subnetworks are conditionally executed depending on discrete outputs from control networks. Outputs are optionally short-circuited with residual connections.
Training a model with discrete intermediate outputs requires backpropagating through discrete random variables, which hinders model trainability. While several approaches have been suggested to alleviate this problem, including the use of gradient estimators (Bengio et al., 2013; Jang et al., 2016; Bengio et al., 2015), training neural networks with conditionally executable subnetworks is still rare. As a consequence, most work involving conditional computation is restricted to very specific applications.
In this work we present a general framework to adapt models for conditional computation and control the amount of computation used at inference. We make three major contributions: (i) We provide a simple approach to adapt models for conditional computation by adding control networks trained end-to-end with the model. These control networks produce continuous outputs during training, which allows for backpropagation. During inference these networks act like binary stochastic units that control the execution of their respective subnetworks. (ii) We propose a multitask training approach to train a single model at different computation budgets. This allows controlling the amount of compute expended by the model on any example at inference. (iii) We adapt the Transformer architecture for conditional computation and demonstrate the efficacy of our approach on two large-scale sequence modeling tasks: WMT’14 En-Fr Translation and Representation learning with BERT.
2 Method
2.1 Adapting models for conditional computation
We adapt neural sequence models for conditional computation by allowing the model to selectively execute certain subnetworks of the computation graph, conditioned on the outputs of small control networks learned jointly with the model.
Let the input to a layer be a sequence of length . Let the output of this operation be , given by . In the presence of residual connections this can be rewritten as .
We now introduce a control network to control the execution of layer . While it is possible to train neural networks with backpropagation in the presence of discrete outputs, we avoid this problem by training in expectation. At training time, instead of sampling a discrete decision from , we compute the expected, . We define the operation of by , where can be any function mapping and is the logistic sigmoid function. As a result for any during training.
The gated version of layer can be written as:
(1) 
At inference, this operation simplifies to
(2) 
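The two modes of operation described above can be illustrated with a minimal sketch for a single token. It assumes that the continuous gate scales the subnetwork output during training and that a fixed threshold of 0.5 turns the gate into a discrete decision at inference; the threshold and the names `gated_residual`, `layer_fn` and `control_logit_fn` are hypothetical and not taken from the text.

```python
import math


def sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gated_residual(x, layer_fn, control_logit_fn, training, threshold=0.5):
    """Apply a residual subnetwork gated by a control network, for one token.

    x: list of floats (token representation).
    layer_fn: hypothetical subnetwork mapping a vector to a vector of equal size.
    control_logit_fn: hypothetical control network returning a scalar logit.
    threshold: assumed inference-time decision rule (not specified in the text).
    """
    gate = sigmoid(control_logit_fn(x))
    if training:
        # Soft decision: scale the subnetwork output by the continuous gate.
        out = layer_fn(x)
        return [xi + gate * oi for xi, oi in zip(x, out)]
    if gate > threshold:
        # Hard decision "on": execute the subnetwork and add the residual.
        out = layer_fn(x)
        return [xi + oi for xi, oi in zip(x, out)]
    # Hard decision "off": the subnetwork is skipped entirely.
    return list(x)


def double(v):
    return [2.0 * a for a in v]


# The logit is sum(x) = -2.0, so the gate is below 0.5 and the layer is skipped.
print(gated_residual([1.0, -3.0], double, sum, training=False))  # -> [1.0, -3.0]
```

At inference the skipped branch never calls `layer_fn`, which is where the computational saving would come from.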
This, however, introduces a discrepancy between training and inference, with the former using soft decisions and the latter selectively executing layers based on discrete decisions. To bridge these two modes of operation we encourage to become more discrete as training progresses. We follow the approach used in previous work for training binary stochastic neurons for monotonic attention mechanisms (Raffel et al., 2017; Chiu & Raffel, 2017; Arivazhagan et al., 2019). We add zero-mean Gaussian noise to the output of during training, as shown below:
(3)
where increases linearly during the training process. This increasing schedule carries the pre-activation towards the saturation range of the gating function and, in turn, forces the output of closer to the boundaries of . While could have any possible parameterization, for the purpose of this work we restrict it to single-hidden-layer feedforward networks for simplicity.
(4) 
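As a rough illustration of the noisy gate and its schedule, the sketch below adds zero-mean Gaussian noise to the pre-activation of a single-hidden-layer control network and increases the noise scale linearly with the training step. Several details are assumptions rather than the paper's specification: the ReLU hidden activation, treating the scheduled quantity as a standard deviation, and the start and end values, which are left as parameters because they are not given in the text.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def noise_scale(step, total_steps, start, end):
    """Linearly interpolate the noise scale from `start` to `end`, then hold."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


def control_logit(x, w1, b1, w2, b2):
    """Single-hidden-layer feedforward control network (ReLU is an assumption).

    w1: list of hidden-unit weight rows, b1: hidden biases,
    w2: output weights, b2: output bias. Returns a scalar pre-activation.
    """
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def noisy_gate(logit, scale, rng):
    """Training-time gate: sigmoid of the logit plus zero-mean Gaussian noise."""
    return sigmoid(logit + rng.gauss(0.0, scale))
```

With a larger noise scale, a logit close to zero produces an unreliable gate, so the model is pushed to move logits far from zero, where the sigmoid saturates and the gate behaves almost like a binary decision.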
2.2 Modulating the Inference Budget
In the absence of any other training signal, we would expect the training loss to pull the model towards using all (or most) of its computation in order to maximize performance. To control the amount of computation utilized by the model we impose a computational budget loss in addition to the training objective.
For any layer , the expected cost of computation utilized by the model on any token, , can be given by , where is the cost of applying layer to one token. For the purpose of this work we define the cost of a layer as its computational cost in FLOPs.
Given a batch of sequences with timesteps, let be the th token of the th sequence. We define the computational budget as a fraction, , of the maximum computation available for the batch. Then the computational budget on the given batch of tokens is:
(5) 
The expected computation used by the proposed conditional computation model is determined by the activations of the control networks on individual tokens of the batch:
(6) 
Then we define the computational budget loss on the given batch of tokens to be:
(7) 
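The batch-level budget loss can be sketched as follows. The text states that the budget is a fraction of the maximum computation for the batch, that expected computation is the cost-weighted sum of control-network activations, and (in Section 6) that the loss is two-sided. The exact functional form is not recoverable here, so the absolute difference normalized by the maximum computation is an assumption, as are all names.

```python
def budget_loss(gates, layer_costs, budget_fraction):
    """Two-sided batch-level budget loss (assumed form: normalized |used - target|).

    gates[i][t][l]: control-network output in [0, 1] for sequence i, token t, layer l.
    layer_costs[l]: cost of applying layer l to one token, e.g. in FLOPs.
    budget_fraction: target fraction of the maximum computation for the batch.
    """
    num_tokens = sum(len(seq) for seq in gates)
    # Maximum computation: every layer applied to every token in the batch.
    max_compute = num_tokens * sum(layer_costs)
    target = budget_fraction * max_compute
    # Expected computation: each layer's cost weighted by its gate activation.
    expected = sum(
        cost * gate
        for seq in gates
        for token in seq
        for cost, gate in zip(layer_costs, token)
    )
    return abs(expected - target) / max_compute


# Two sequences (2 tokens and 1 token), two layers with costs 2 and 6.
example_gates = [[[1.0, 0.0], [1.0, 1.0]], [[0.0, 0.0]]]
print(budget_loss(example_gates, [2.0, 6.0], 0.5))
```

Because the sum runs over the whole batch, one token may use every layer while another uses none, as long as the total stays near the target.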
We impose a constraint on the total computation used for a batch, instead of the compute used for a single sequence or token. This looser constraint allows the model to allocate more computation for ‘difficult’ examples by using less compute for ‘easy’ examples. Empirically we find that using the batchlevel constraint performs better, especially at lower computation budgets.
Training a conditional computation model with the above loss allows operating that model at a single computation budget, . Given a set of desired computational budgets, , that we want the model to operate at, we utilize a simple multitask training approach. We define a set of control symbols, , which can be fed as additional inputs to the model. We associate each budget, , with a control input . Given a batch of training sequences, , we assign each sequence uniformly at random to a budget in the set . Let the batch of sequences assigned to budget be . The corresponding control symbol is then fed to the model as an additional input when training on sequences in . By associating specific control inputs with different computational budgets, we train a single model to operate at specific levels of computation controlled by these external inputs. This is similar to approaches used for training multilingual Machine Translation models (Johnson et al., 2017) and other multitask models.
The total budget loss function in this multitask training setup is then:
(8) 
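The multitask setup can be sketched as below: each sequence in a batch is assigned uniformly at random to one budget, and a budget loss is computed separately over the sequences assigned to each budget. Summing the per-budget losses, the normalized absolute-difference form, and every function name are assumptions made for illustration.

```python
import random


def assign_budgets(num_sequences, num_budgets, rng):
    """Assign each sequence index uniformly at random to one budget index."""
    groups = [[] for _ in range(num_budgets)]
    for i in range(num_sequences):
        groups[rng.randrange(num_budgets)].append(i)
    return groups


def group_budget_loss(gates, layer_costs, budget_fraction):
    """Assumed per-group loss: normalized |expected compute - target compute|."""
    num_tokens = sum(len(seq) for seq in gates)
    max_compute = num_tokens * sum(layer_costs)
    expected = sum(
        cost * gate
        for seq in gates
        for token in seq
        for cost, gate in zip(layer_costs, token)
    )
    return abs(expected - budget_fraction * max_compute) / max_compute


def multitask_budget_loss(gates, layer_costs, budgets, groups):
    """Sum the budget loss over the sub-batches assigned to each budget."""
    total = 0.0
    for budget, members in zip(budgets, groups):
        if not members:
            continue  # no sequence was assigned to this budget in this batch
        sub_batch = [gates[i] for i in members]
        total += group_budget_loss(sub_batch, layer_costs, budget)
    return total
```

In a full training loop, the index of the group a sequence falls into would also select the control symbol fed to the model for that sequence.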
In certain cases it might be desirable to control the amount of computation spent on different subnetworks of the model independently. For example, in autoregressive seq2seq models there is an inherent difference in the mode of operation of encoder and decoder subnetworks. To control the budgets for subnetworks independently, our multi-budget formulation can be extended to allow . Each symbol then maps to a tuple of budgets, , specifying the desired budget for each subnetwork.
The generalized budget loss function can be described as:
(9) 
where is the th budget loss for subnetwork .
Given a model adapted for conditional computation following the approaches described above, controlling the inference time computation just requires feeding the right control input, , corresponding to the desired budget .
3 Conditional Computation Transformer
We now apply our approach to the Transformer architecture (Vaswani et al., 2017). We follow the new transformer layout where layer normalization (LN) is applied to the input instead of the output (footnote 1).
Footnote 1: Please refer to the ‘nda’ layout as implemented in the Tensor2Tensor library (Vaswani et al., 2018).
3.1 Conditional Attention Layer
Given a vector and a sequence of vectors, , the transformer attention layer can be described by the following sequence of operations. The set of vectors to be attended is first projected to keys () and corresponding values ():
(10)
The projected input queries are then used to attend the keys to summarize the set to be attended.
(11)  
We introduce two control networks to control the execution of the operations defined by Equations 10 and 11 respectively. The first control network, , controls the execution of the key-value projections. The second network, , controls the execution of the query projection, multi-headed attention and the attention post-projection. We also introduce additional normalization layers to stabilize training in the presence of discrete operations. During training we implement these changes as:
(12)  
and
(13)  
The above modifications are applied to all self-attention and cross-attention layers.
3.2 Conditional Feedforward Layer
Given a vector, , the transformer feedforward layer can be described by
(14) 
where and . Then the output of the layer incorporating residual connections is given by .
Adding conditional execution for this layer, our output can be written as .
While it’s straightforward to add conditional execution for the entire feedforward layer, we can optionally decompose the large feedforward layer into independently controlled smaller layers to provide more granular control over feedforward layer capacity (footnote 2):
Footnote 2: This decomposition makes our feedforward layer similar to the Sparsely Gated Mixture-of-Experts layer (Shazeer et al., 2017). However, in our approach the number of experts applied per input is a function of the input.
(15) 
where , and maps the input to .
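The decomposition into independently gated smaller feedforward layers can be sketched for one token as a residual sum of gated sub-layer outputs. This sketch omits biases and the layer normalization the paper applies, uses ReLU as the hidden activation, and takes the gate values as given; all of these, and the function names, are simplifying assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def small_ffn(x, w_in, w_out):
    """One small feedforward sub-layer: linear, ReLU, linear (biases omitted)."""
    hidden = [max(0.0, h) for h in matvec(w_in, x)]
    return matvec(w_out, hidden)


def conditional_ffn(x, sub_layers, gates):
    """Residual output x + sum_i gate_i * FFN_i(x) for one token.

    sub_layers: list of (w_in, w_out) pairs, one per small feedforward layer.
    gates: one value per sub-layer; soft values in training, 0 or 1 at inference.
    """
    y = list(x)
    for (w_in, w_out), gate in zip(sub_layers, gates):
        if gate == 0.0:
            continue  # a switched-off sub-layer is never computed
        out = small_ffn(x, w_in, w_out)
        y = [yi + gate * oi for yi, oi in zip(y, out)]
    return y


# Two sub-layers with a single hidden unit each, acting on a 2-dimensional token.
layers = [([[1.0, 0.0]], [[1.0], [0.0]]), ([[0.0, 1.0]], [[0.0], [1.0]])]
print(conditional_ffn([1.0, 2.0], layers, [1.0, 0.0]))  # -> [2.0, 2.0]
```

Unlike a fixed top-k mixture of experts, the number of active sub-layers here is simply the number of gates that are switched on for the token.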
3.3 Feeding Control Input
For training with multiple computation budgets, as described in Section 2.2, we need to feed an additional control input, , with every input sequence. Given an input sequence, , input to transformer layers is a sequence of embeddings corresponding to each input symbol summed with the corresponding position embedding. In addition to the position embeddings we learn an additional input embedding of control symbols, . The embedding of the symbol is then added to the embedding of each token before feeding into the model.
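Feeding the control input amounts to one extra embedding lookup. The sketch below adds the same control-symbol embedding to the sum of token and position embeddings at every position; the table layout (lists of vectors indexed by id) and the function name are hypothetical.

```python
def embed_inputs(token_ids, control_id, token_emb, pos_emb, control_emb):
    """Return one input vector per position: token + position + control embedding.

    token_emb, pos_emb, control_emb: embedding tables as lists of equal-length vectors.
    control_id: index of the control symbol selecting the computation budget.
    """
    ctrl = control_emb[control_id]
    return [
        [t + p + c for t, p, c in zip(token_emb[tok], pos_emb[pos], ctrl)]
        for pos, tok in enumerate(token_ids)
    ]


token_table = [[0.0, 0.0], [1.0, 1.0]]
position_table = [[0.1, 0.1], [0.2, 0.2]]
control_table = [[10.0, 0.0], [0.0, 10.0]]
# The same control embedding (index 1) is added at both positions.
print(embed_inputs([1, 0], 1, token_table, position_table, control_table))
```

Switching budgets at inference then only requires changing `control_id`; the model weights stay the same.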
4 Experiments on Machine Translation
Most machine translation models fall within the sequence-to-sequence paradigm (Sutskever et al., 2014; Bahdanau et al., 2015), with an encoder that learns representations of the source sequence and a decoder to generate the target sequence, trained on the cross-entropy loss . Since there is a difference in the inference-time operation of the encoder and decoder (the encoder processes all source tokens simultaneously, while the decoder processes each token one at a time), we allow controlling their respective computation budgets separately. To elaborate, we permit using a set of computation budgets . Here the first budget of every tuple, , corresponds to the desired encoder budget while the second budget, , corresponds to the desired decoder budget. The control symbol, , is fed as an embedding added to every source and target token. We train the model end-to-end on
We now evaluate our approach on the WMT’14 English-French translation task (footnote 3). We use newstest13 for validation and newstest14 for test. BLEU scores are computed with tokenized truecased output and references with Moses multi-bleu.perl (footnote 4).
Footnote 3: statmt.org/wmt14/translation-task.html
Footnote 4: github.com/moses-smt
We train a Transformer Big (Vaswani et al., 2017) model as our baseline. For smaller budget baselines, we reduce the capacity of our Transformer following two approaches: (i) Reducing the model depth by reducing the number of layers and (ii) Reducing the model width by reducing the hidden dimension of the feedforward layers. We compare these baselines against a single CCT model operating at different computation budgets, with a maximum capacity equivalent to Transformer Big. Since decoder computation is typically the bottleneck for Transformer inference, we also train additional baselines where we only reduce the capacity of the decoder, while using a full Transformer Big encoder. These baselines are compared against the same CCT model from the previous comparison, but use full encoder computation while varying the decoder budget.
We use a Transformer learning rate schedule (Vaswani et al., 2017) of (3.0, 40K) (footnote 5) and all dropout probabilities are set to 0.1. For all our models, we use a shared vocabulary SentencePiece model (Kudo & Richardson, 2018) for subword tokenization, with a vocabulary size of 32000 tokens. We train each model for 300k steps with batches of 250k tokens. The CCT is trained with the same set of hyperparameters. In addition to the above hyperparameters, we set and use a set of computation budgets, , set to . This results in control tasks, one corresponding to each tuple from the above cross-product (footnote 6). The noise factor, in Equation 3, is linearly increased during the training process, from at the first step to at 300k steps. For these experiments, we break the feedforward layer into smaller layers i.e. in Equation 15. All our models are trained on 32 Cloud TPUv3 chips and evaluated at 300k steps.
Footnote 5: The (3.0, 40K) schedule is shorthand for a learning rate of 3.0 with 40K warmup steps, decayed with the inverse square root of the number of training steps after warmup.
Footnote 6: We empirically find that allowing half of the control tasks to use their entire computational budgets strikes a good balance between properly training all parameters and learning to operate at reduced budgets.
Figure 4 compares a single CCT against individual Transformer models with different amounts of encoder and decoder capacity. Our results suggest that CCT is competitive with Transformer Big even when operating at half its computation budget. At smaller computation budgets CCT improves over smaller baseline Transformer models by up to 1-1.5 BLEU. Figure 5 depicts the results of our second experiment, comparing CCT against baseline Transformers when using a large encoder (equivalent to Transformer Big) and controlling for decoder computation. We observe a similar trend, with CCT being competitive with Transformers at higher computation budgets, while improving over baselines by almost 1 BLEU at reduced budgets.
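The (3.0, 40K) learning rate schedule used in this section can be sketched with the warmup-then-inverse-square-root formula of Vaswani et al. (2017). The text only states the base rate, the warmup length and the inverse square root decay, so the linear warmup shape, the scaling by the model dimension, and the default `model_dim=1024` are assumptions.

```python
def transformer_lr(step, base_lr=3.0, warmup_steps=40000, model_dim=1024):
    """Learning rate at a given training step (step >= 1).

    Assumed form: base_lr * model_dim**-0.5 * min(step * warmup**-1.5, step**-0.5),
    i.e. linear warmup followed by inverse-square-root decay.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    warmup_term = step * warmup_steps ** -1.5
    decay_term = step ** -0.5
    return base_lr * model_dim ** -0.5 * min(warmup_term, decay_term)


for s in (1, 20000, 40000, 160000, 300000):
    print(s, transformer_lr(s))
```

Under these assumptions the rate peaks at the end of warmup and halves each time the step count quadruples afterwards.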
5 Experiments with BERT
BERT (Devlin et al., 2018) uses a Masked Language Modeling objective, , in order to learn token-level representations of text using a Transformer architecture. Following the pre-training stage, the model is fine-tuned on individual tasks by training on (smaller) task-specific datasets and objectives. To control the amount of computation used by BERT for generating representations of text, we replace the Transformer model in the original BERT implementation with our CCT from Section 3. We train this model following the multi-computation budget recipe described in Section 2.2. The objective function used for training this model is . When fine-tuning on a downstream task we use a different scaled to the new objective.
We train a BERT-Large (Devlin et al., 2018) model as our baseline. For smaller budget baselines, we reduce the capacity of BERT following two approaches: (i) Reducing the model depth by reducing the number of layers and (ii) Reducing the width by reducing the model dimension and hidden dimension of the feedforward layers, maintaining a ratio of 4 between the model dimension and feedforward hidden dimension. We compare these baselines against a single CCT model operating at different computation budgets, with maximum capacity equivalent to BERT-Large.
We use the same pre-training process used in Devlin et al. (2018), except for one difference: we train on sequences of length 512 for 1M steps with a batch size of 1024, instead of training on shorter sequences for 900k steps and fine-tuning with longer sequences. The CCT is trained with the same set of hyperparameters. In addition to the above hyperparameters, we set and use a set of computation budgets . is linearly increased during the training process, going from at the first step to at 300k steps and capping at that value. For these experiments, we break the feedforward layer into smaller layers i.e. in Equation 15. All our models are trained on 64 Cloud TPUv3 chips.
When fine-tuning BERT baselines on downstream tasks, we search over the same grid used in Devlin et al. (2018). We reuse the same fine-tuning parameters as BERT-Large for fine-tuning CCT. The value of used for fine-tuning CCT is different from pre-training, to scale to the downstream task loss. We report validation performance over 3 runs on three GLUE benchmark (Wang et al., 2019) tasks, MNLI, SST-2 and CoLA, and on SQuAD. A comparison of CCT with comparable baselines on MNLI, SST-2, CoLA and SQuAD is depicted in Figures 6, 7, 8 and 9 respectively. On MNLI and SST-2 we see a trend similar to translation: the performance of CCT is close to the performance of BERT-Large at the highest computation setting, while improving significantly over baselines at smaller computation budgets. On CoLA we see the reverse trend: CCT improves by a significant margin at the highest computation setting while losing to baselines at smaller computation budgets.
The performance of CCT on SQuAD is worse than baselines at all computation budgets. It is worth noting that SQuAD is the only benchmark task that uses token-level outputs from the pretrained representations, while all other tasks act on a pooled representation of the entire sequence. The weak performance on SQuAD suggests that token-level representations extracted from CCT-BERT, without pooling, might not perform as well as those from a static architecture that uses the same set of operations for every token.
6 Additional Insights
In order to shed more light on the training dynamics and the factors affecting the quality and performance of the proposed CCT approach, we conducted further analysis probing various aspects.
Layer usage under a limited budget
To understand the distribution of computation across layers at different budgets, we look at the fraction of times is active for different layers for our machine translation model. Figure 10 depicts the fraction of feedforward operations applied per token in different encoder layers at different budgets. Our results suggest that encoders utilize more feedforward computation at higher levels of the stack. We observe the same trend for the decoder. From analyzing the outputs of control networks for different subnetworks, we also find that lower layers tend to be either active or inactive for most tokens for a given inference budget, while upper layers have different outputs for different tokens. This behaviour could have two possible explanations: (i) Lower level layers perform more general operations applicable to all tokens, while higher level layers apply more input-specific operations, or (ii) Control networks of layers lower in the stack might be difficult to train, resulting in learning the trivial solution i.e. ‘on’ for high computation budgets and ‘off’ for smaller budgets.
Importance of the Noise Schedule
We next attempt to understand the role of gradually increasing noise variance when training with discrete decisions. We compare a CCT model trained with linearly increasing (Equation 3) against one where it is set to its highest value (here ) from the beginning of training. From Figure 11, we notice that the two models are within 0.3 BLEU of each other at , with the difference increasing to 0.5 for and . At , the performance of the discrete model deteriorates much faster with the difference growing to more than 1 BLEU. This suggests that the quality of control network training has a larger effect on model performance at smaller computation budgets.
Importance of Parallel Sub-Networks
Deciding how to divide the model’s computation graph into subnetworks controlled by different control networks can have a significant impact on model quality. For example, for most of our experiments we split each feedforward layer into 4 smaller, independently controlled feedforward layers ( in Equation 15). We compare the effect of splitting the feedforward subnetwork at different granularities (setting ). Our results from Figure 12 suggest that having more control on how network computation is utilized, by having control networks for smaller subnetworks, significantly impacts model quality especially at lower computation budgets.
Tricks of the trade
We list some tricks and observations that were empirically found to be useful during the course of this work.
(i) Careful normalization was critical for stable training and good model quality. This includes additional layer normalization applied to every gated subnetwork output and using separate layer normalization for the input of every independently gated feedforward subnetwork (i.e. separate layer normalization for each of the feedforward layers).
(ii) The range of budgets ( from Section 2.2) used during training affected model quality. We observed significant quality deterioration when one of the values in was too low (). For BERT experiments setting resulted in worse performance on the MLM loss at a budget of while performance at other budgets was not severely impacted. We suspect this is caused by the special role of the MASK token during pretraining.
(iii) Even with all the stabilization approaches, approximately of our runs deteriorated in performance on training further beyond convergence.
(iv) Varying control network capacity did not have a huge effect on model quality within the range of hidden dimensions evaluated by us ().
(v) The proposed computational budget loss (Equation 7) is two-sided and also penalizes the model for underutilization. While this is counterintuitive, in practice we found that not penalizing the model for using less computation resulted in undertraining certain subnetworks, resulting in suboptimal downstream quality.
7 Related Work
Activating a subnetwork depending on the particular input example has been the focus of conditional computation approaches (Bengio et al., 2013; Davis & Arel, 2014). Following this line of research, Cho & Bengio (2014) studied increasing the capacity of neural networks without increasing required computation by exploiting the bit patterns associated with hidden units. As a majority of conditional computation approaches make use of stochastic binary units that pose trainability challenges, Bengio et al. (2015) cast the problem as a reinforcement learning problem and proposed a policy that maps the activations of layers to Bernoulli masks. Graves (2016) proposed the first application of conditional computation to neural sequence models, called Adaptive Computation Time (ACT), where a recurrent neural network is trained to learn the lag between reading an input and generating the output.
The recently introduced Transformer architecture (Vaswani et al., 2017) has allowed researchers to train neural networks with billions of parameters, reaffirming the need for more efficient and adaptive models. Universal transformer (Dehghani et al., 2018) addressed the parameter inefficiency problem of Transformers by tying the weights of consecutive layers and utilizing ACT to decide the halting of such recurrence. The recently proposed depth-adaptive Transformer (DAT) (Elbayad et al., 2019) is perhaps the most similar to our approach. In DAT, decoder layers are equipped with halting classifiers that decide to exit and predict the output or continue processing, extending the ACT framework. DAT requires explicit supervision from oracles (or implicit supervision from multiple softmax computations) to train halting classifiers, restricting their approach to specific applications (like decoders in sequence-to-sequence models). Our approach trains control networks end-to-end with the rest of the model, allowing us to extend it to a wider range of subnetworks not directly connected with the final classifier (for example, key-value projections in self-attention layers or encoder layers in seq2seq models).
Fan et al. (2019) propose another approach to control the inference time computation budget. Their method applies structured pruning (in the form of layer dropout) to yield a varying number of shallower networks that can be used at inference time. Our approach, however, results in a single network that can simultaneously adapt to the difficulty of the input example and the computation budget available at inference time.
8 Conclusion
In this work we present a general framework to adapt neural sequence models (Transformer) for conditional computation and control the amount of computation used at inference. Our proposed approach injects simple control networks into the core computation graph, in order to modulate the information flow through the network. The incorporated control networks are trained end-to-end simultaneously with the model, simulating the binary decisions to be made at inference time. We also introduce a novel multitask objective that allows the network to operate at multiple computation budgets at inference time efficiently, addressing the need for on-demand computation requirements of large networks. Experiments on large-scale machine translation (WMT’14 English-French) and unsupervised representation learning (BERT) demonstrate that our proposed approach is competitive with baseline Transformer models at the same computation budget, and significantly better at smaller computational budgets compared to computationally equivalent baselines.
We believe more analysis is needed to understand the decisions made by control networks and the behavior of the final subnetwork on inputs with different levels of complexity.
Acknowledgements
We would like to thank the Google Translate and Tensorflow Lingvo teams for foundational contributions to the project.
References
 Arivazhagan et al. (2019) Arivazhagan, N., Cherry, C., Macherey, W., Chiu, C.-C., Yavuz, S., Pang, R., Li, W., and Raffel, C. Monotonic infinite lookback attention for simultaneous machine translation. arXiv preprint arXiv:1906.05218, 2019.
 Bahdanau et al. (2015) Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In Bengio, Y. and LeCun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1409.0473.
 Bengio et al. (2015) Bengio, E., Bacon, P.-L., Pineau, J., and Precup, D. Conditional computation in neural networks for faster models. arXiv preprint arXiv:1511.06297, 2015.
 Bengio et al. (2013) Bengio, Y., Léonard, N., and Courville, A. C. Estimating or propagating gradients through stochastic neurons for conditional computation. ArXiv, abs/1308.3432, 2013.
 Chiu & Raffel (2017) Chiu, C.-C. and Raffel, C. Monotonic chunkwise attention. arXiv preprint arXiv:1712.05382, 2017.
 Cho & Bengio (2014) Cho, K. and Bengio, Y. Exponentially increasing the capacity-to-computation ratio for conditional computation in deep learning, 2014.
 Davis & Arel (2014) Davis, A. S. and Arel, I. Low-rank approximations for conditional feedforward computation in deep neural networks. In Bengio, Y. and LeCun, Y. (eds.), 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Workshop Track Proceedings, 2014. URL http://arxiv.org/abs/1312.4461.
 Dehghani et al. (2018) Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J., and Kaiser, L. Universal transformers. CoRR, abs/1807.03819, 2018. URL http://arxiv.org/abs/1807.03819.
 Devlin et al. (2018) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
 Elbayad et al. (2019) Elbayad, M., Gu, J., Grave, E., and Auli, M. Depth-adaptive transformer, 2019.
 Fan et al. (2019) Fan, A., Grave, E., and Joulin, A. Reducing transformer depth on demand with structured dropout, 2019.
 Graves (2016) Graves, A. Adaptive computation time for recurrent neural networks, 2016.
 Huang et al. (2019) Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, D., Chen, M., Lee, H., Ngiam, J., Le, Q. V., Wu, Y., et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems, pp. 103–112, 2019.
 Jang et al. (2016) Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
 Johnson et al. (2017) Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F., Wattenberg, M., Corrado, G., et al. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351, 2017.
 Jouppi et al. (2017) Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12, 2017.
 Kudo & Richardson (2018) Kudo, T. and Richardson, J. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 66–71, Brussels, Belgium, November 2018. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/D18-2012.
 Masoudnia & Ebrahimpour (2014) Masoudnia, S. and Ebrahimpour, R. Mixture of experts: A literature survey. Artif. Intell. Rev., 42(2):275–293, August 2014. ISSN 0269-2821. doi: 10.1007/s10462-012-9338-y. URL https://doi.org/10.1007/s10462-012-9338-y.
 (19) Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding by generative pretraining.
 Raffel et al. (2017) Raffel, C., Luong, M.-T., Liu, P. J., Weiss, R. J., and Eck, D. Online and linear-time attention by enforcing monotonic alignments. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 2837–2846, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/raffel17a.html.
 Raffel et al. (2019) Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
 Shazeer et al. (2017) Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
 Shazeer et al. (2018) Shazeer, N., Cheng, Y., Parmar, N., Tran, D., Vaswani, A., Koanantakool, P., Hawkins, P., Lee, H., Hong, M., Young, C., et al. Mesh-TensorFlow: Deep learning for supercomputers. In Neurips, pp. 10414–10423, 2018.
 Shen et al. (2019) Shen, T., Ott, M., Auli, M., and Ranzato, M. Mixture models for diverse machine translation: Tricks of the trade, 2019.
 Spall et al. (1992) Spall, J. C. et al. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE transactions on automatic control, 37(3):332–341, 1992.
 Sutskever et al. (2014) Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. CoRR, abs/1409.3215, 2014. URL http://arxiv.org/abs/1409.3215.
 Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Neurips, pp. 5998–6008, 2017.
 Vaswani et al. (2018) Vaswani, A., Bengio, S., Brevdo, E., Chollet, F., Gomez, A. N., Gouws, S., Jones, L., Kaiser, L., Kalchbrenner, N., Parmar, N., Sepassi, R., Shazeer, N., and Uszkoreit, J. Tensor2tensor for neural machine translation. CoRR, abs/1803.07416, 2018. URL http://arxiv.org/abs/1803.07416.
 Wang et al. (2019) Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of ICLR 2019, 2019.