1 Introduction
Neural networks are commonly employed to address many complex tasks such as machine translation [6], image classification [26] or speech recognition [15]. As more and more data becomes available for training, these networks are becoming increasingly large [18]. For instance, recent models both in vision [27] and in natural language processing [19, 34] have more than a billion parameters. The higher capacity enables better modeling of data like natural text or images, and it also improves generalization [40, 31]. Unfortunately, increasing capacity has led to a dramatic increase in computational complexity, both at training and inference time [19].
There is a growing interest in developing architectures with reasonable computational complexity. Recently, there have been some efforts to develop high-capacity architectures that operate on a limited computational budget [38, 17]. This is well illustrated by the “On-device Visual Intelligence Challenge” [1], which specifically focuses on the complexity/accuracy trade-off for image classification.
Some researchers have attempted to increase the capacity of a network without increasing its computational complexity. Most notably, Rae et al. [35] incorporate fast nearest neighbor search within a neural network architecture to leverage large key-value layers with sparse reads and writes. Their approach relies on an external indexing structure [30], which is approximate and needs to be relearned regularly while training the neural network to avoid a catastrophic drift.
In this work, we propose a key-value memory layer that can scale to very large sizes while keeping exact search on the key space. This layer dramatically increases the capacity of the overall system for a negligible computational overhead. Unlike existing models based on key-value memories (see Figure 1), we define keys as the concatenation of two sub-keys, in the spirit of product quantization [20]. As shown in more detail in Figure 2, this structure implicitly defines a very large set of keys, each being associated with a value memory slot. The set of value vectors introduces the bulk of the parameters, as it scales quadratically with the number of sub-keys. Despite the large number of memory slots, finding the exact closest keys to the input is very efficient, typically requiring $O(\sqrt{|\mathcal{K}|})$ vector comparisons, where $|\mathcal{K}|$ is the total number of memory slots. All the memory parameters are trainable, yet only a handful of memory slots are updated for each input at training time. Sparsity of key selection and parameter updates makes both training and inference very efficient.
Our layer allows us to tackle problems where current architectures underfit given the vast amount of available data, or when they are too slow to work in practice. We thus focus on the language modeling task, integrating our memory within the popular transformer architecture [42]. This choice is motivated by the success of BERT [11] and GPT-2 [34], which demonstrated that increasing the capacity of large models directly translates to large improvements in language modeling, which in turn translates to better performance in both language understanding tasks [11, 44] and text generation [34]. Overall, our paper makes the following contributions:
We introduce a new layer that provides a large capacity to a neural network for only a slight computational overhead both at train and test time.

Our fast indexing strategy offers exact nearest neighbor search by construction, and avoids the pitfall of relying on an indexing structure that needs to be relearned during training.

We demonstrate our method within a large state-of-the-art transformer, composed of 24 layers of dimension 1600. Our method with 1 memory and 12 layers outperforms a 24-layer transformer while being twice as fast at inference time. We show that adding more memory layers to transformers of various complexities provides systematic and significant improvements on our target task.
2 Related work
Different approaches have been proposed to increase the capacity of neural networks without excessively increasing the computational complexity. For instance, conditional computation models aim at routing inputs into very large neural networks such that only a subset of connections and/or layers are used to process each input. Different methods have been developed, like large mixtures of experts [39], gating techniques [4, 12, 6] or even reinforcement learning-based approaches [10].
Another line of research is the development of memory-augmented neural networks. For instance, memory-based neural layers [45, 41] are an efficient way to represent variable-length inputs for complex problems such as question answering [46]. Such memories can also operate in feature space and have various reading and writing mechanisms [22, 16]. Unfortunately, these approaches scale linearly with the size of the memory, which is prohibitive for very large memories. Neural cache models [14] suffer from the same scaling issues.
Discretization techniques have been intensively studied for compressing network weights [8, 36] and/or activations [7, 36] or to accelerate inference. For instance, Gerald et al. [13] propose to map an input to a low-dimensional binary code, each code being associated with one category, thus reducing the complexity of inference by avoiding the use of a final large linear layer. Another model is proposed in [43], where the authors develop a fast locality-sensitive hashing technique to approximate the dot product between large matrices and vectors in neural networks. However, exploiting binary codes or approximate techniques at training time raises several challenges in terms of optimization, because approximate indexes are not accurate in high-dimensional spaces. In our paper, we borrow some ideas from product quantization (PQ) [20]. This is an approximate search technique that maps database vectors into compact codes. However, our goal is different: we do not build an approximate index, but rather we exploit the idea to represent a large set of key vectors by a drastically smaller number of vectors, which we update by regular back-propagation. As discussed later, the selection of the closest keys is exact and inherits the fast neighbor search of PQ.
Our model is also related to sparsity models, which have been mainly studied in the unsupervised learning setting [32, 23]. For instance, the k-sparse autoencoder [28] only keeps the k largest values in the latent representation of an autoencoder, similar to our memory layer but without the product keys component. In winner-take-all autoencoders [29], sparsity is induced by using mini-batch statistics, while the sparse access memory [35] achieves some speedup both by thresholding the memory to a sparse subset and by using efficient data structures for content-based read operations. Unfortunately, the fast access to memories relies on an approximate external indexing structure [30] that has to be relearned periodically. Our work solves this issue by fully incorporating the key selection mechanism as a network component.
3 Learnable product key memories
We consider the design of a function $m: \mathbb{R}^d \to \mathbb{R}^n$ that will act as a layer in a neural network. The purpose of $m$ is to offer a large capacity within a neural network.
3.1 Memory design
High-level structure.
The overall structure of our memory is illustrated by Figures 1 and 2. The memory is composed of three components: a query network, a key selection module containing two sets of sub-keys, and a value lookup table. It first computes a query that is compared to the set of product keys. For each product key, it computes a score and selects the $k$ product keys with the highest scores. The scores are then used to produce an output via a weighted sum over the values associated with the selected keys. All the parameters of the memory are trainable, yet only $k$ memory slots are updated for each input. The sparse selection and parameter update make both training and inference very efficient.
Query generation: preprocessing network.
The function $q: x \mapsto q(x) \in \mathbb{R}^{d_q}$, referred to as the query network, maps the $d$-dimensional input to a latent space of dimensionality $d_q$. Typically, $q$ is a linear mapping or a multi-layer perceptron that reduces the dimensionality from $d$ to $d_q$. As keys are randomly initialized, they occupy the space relatively uniformly. Adding a batch normalization layer on top of the query network helps increase key coverage during training. This insight is confirmed by our ablation experiments in Section 4.5.
Standard key assignment and weighting.
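Let $q(x)$ be a query and $\mathcal{T}_k$ denote the top-k operator (if the permutation $(i_1, \dots, i_n)$ sorts numbers $(t_1, \dots, t_n)$ as $t_{i_1} \geq t_{i_2} \geq \dots \geq t_{i_n}$, the top-k indices are $\mathcal{T}_k(t_1, \dots, t_n) = \{i_1, \dots, i_k\}$). Given a set of keys $\mathcal{K} = \{k_1, \dots, k_{|\mathcal{K}|}\}$ composed of $|\mathcal{K}|$ $d_q$-dimensional vectors, each key $k_i$ being associated with a value vector $v_i$, and an input $x$, we select the top $k$ keys maximizing the inner product with the query $q(x)$:
$$\mathcal{I} = \mathcal{T}_k\left(q(x)^\top k_i\right) \qquad \text{\# Get k nearest neighbors} \qquad (1)$$
$$w = \mathrm{Softmax}\left(\left(q(x)^\top k_i\right)_{i \in \mathcal{I}}\right) \qquad \text{\# Normalize top-k scores} \qquad (2)$$
$$m(x) = \sum_{i \in \mathcal{I}} w_i v_i \qquad \text{\# Aggregate selected values} \qquad (3)$$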
Here $\mathcal{I}$ denotes the indices of the $k$ most similar keys (where the similarity measure is the inner product), and $w$ is the vector that represents the normalized scores associated with the selected keys. All these operations can be implemented using auto-differentiation mechanisms, making our layer pluggable at any location in a neural network.
Operations (2) and (3) only depend on the top-k indices and are therefore computationally efficient. In contrast, the exhaustive comparison of Equation (1) is not efficient for large memories since it involves computing $|\mathcal{K}|$ inner products. To circumvent this issue, we resort to a structured set of keys, which we refer to as product keys.
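To make Equations (1) to (3) concrete, here is a minimal pure-Python sketch of a lookup in a flat (non-product) key-value memory. The function name `flat_memory_lookup`, the list-of-lists representation and the toy keys and values are illustrative assumptions; an actual implementation would be batched and run on GPU.

```python
import math


def dot(a, b):
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def flat_memory_lookup(query, keys, values, k):
    """Illustrative flat memory read: returns (indices, weights, output)."""
    # (1) Score every key against the query and keep the k best indices.
    #     This is the exhaustive step that costs one inner product per key.
    scores = [dot(query, key) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # (2) Normalize the scores of the selected keys only.
    weights = softmax([scores[i] for i in top])
    # (3) Weighted sum of the values attached to the selected keys.
    dim = len(values[0])
    output = [sum(w * values[i][d] for w, i in zip(weights, top))
              for d in range(dim)]
    return top, weights, output


# Toy memory with 4 slots, 2-dimensional keys and 3-dimensional values.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.7, 0.7]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
indices, weights, output = flat_memory_lookup([2.0, 0.5], keys, values, k=2)
```

Only the `k` selected rows of `values` contribute to the output, which is why steps (2) and (3) stay cheap regardless of the memory size, while step (1) grows linearly with the number of keys.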
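The product key set is defined as the outer product, with respect to the vector concatenation operator, of two vector codebooks $\mathcal{C}$ and $\mathcal{C}'$:
$$\mathcal{K} = \{(c, c') \mid c \in \mathcal{C},\ c' \in \mathcal{C}'\}$$
The total number of keys induced by this Cartesian product construction is $|\mathcal{K}| = |\mathcal{C}| \times |\mathcal{C}'|$. The sets $\mathcal{C}$ and $\mathcal{C}'$ both comprise a set of sub-keys of dimension $d_q/2$, where $d_q$ is the dimension of the query. We exploit this structure to compute the closest keys efficiently. First, we split the query $q(x)$ into two sub-queries $q_1$ and $q_2$. We then compute the $k$ sub-keys in $\mathcal{C}$ (resp. $\mathcal{C}'$) closest to the sub-query $q_1$ (resp. $q_2$):
$$\mathcal{I}_{\mathcal{C}} = \mathcal{T}_k\left(\left(q_1(x)^\top c_i\right)_{i \in \{1, \dots, |\mathcal{C}|\}}\right), \qquad \mathcal{I}_{\mathcal{C}'} = \mathcal{T}_k\left(\left(q_2(x)^\top c'_j\right)_{j \in \{1, \dots, |\mathcal{C}'|\}}\right) \qquad (4)$$
We are guaranteed that the $k$ most similar keys in $\mathcal{K}$ are of the form $\{(c_i, c'_j) \mid i \in \mathcal{I}_{\mathcal{C}},\ j \in \mathcal{I}_{\mathcal{C}'}\}$. Indeed, the score of a product key is the sum $q_1(x)^\top c_i + q_2(x)^\top c'_j$ of its two sub-key scores, so a key whose sub-key falls outside the top $k$ on either half is dominated by at least $k$ other keys. An example of product keys with the key selection process is shown in Figure 2.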
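The two-stage selection can be sketched as follows in pure Python, alongside an exhaustive search used only to check that both return the same keys. The function names, the random toy codebooks and the flat slot index `i * len(subkeys_2) + j` are assumptions made for this illustration; they are not taken from the paper's code.

```python
import heapq
import random


def dot(a, b):
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_indices(scores, k):
    """Indices of the k largest scores."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)


def product_key_search(query, subkeys_1, subkeys_2, k):
    """Exact top-k over all concatenated keys, scoring only the sub-keys."""
    half = len(query) // 2
    # Score each half of the query against its own codebook.
    s1 = [dot(query[:half], c) for c in subkeys_1]
    s2 = [dot(query[half:], c) for c in subkeys_2]
    best_1 = top_k_indices(s1, k)
    best_2 = top_k_indices(s2, k)
    # The score of key (i, j) is s1[i] + s2[j]; only k * k candidates remain.
    candidates = [(s1[i] + s2[j], i * len(subkeys_2) + j)
                  for i in best_1 for j in best_2]
    return heapq.nlargest(k, candidates)


def exhaustive_search(query, subkeys_1, subkeys_2, k):
    """Reference: score every one of the |C| * |C'| concatenated keys."""
    half = len(query) // 2
    scored = [(dot(query[:half], c1) + dot(query[half:], c2),
               i * len(subkeys_2) + j)
              for i, c1 in enumerate(subkeys_1)
              for j, c2 in enumerate(subkeys_2)]
    return heapq.nlargest(k, scored)


rng = random.Random(0)
d_q, num_subkeys, k = 8, 16, 4
subkeys_1 = [[rng.gauss(0, 1) for _ in range(d_q // 2)] for _ in range(num_subkeys)]
subkeys_2 = [[rng.gauss(0, 1) for _ in range(d_q // 2)] for _ in range(num_subkeys)]
query = [rng.gauss(0, 1) for _ in range(d_q)]

fast = product_key_search(query, subkeys_1, subkeys_2, k)
slow = exhaustive_search(query, subkeys_1, subkeys_2, k)
```

Here `fast` inspects 2 × 16 sub-keys plus 16 candidate pairs, whereas `slow` scores all 256 keys; barring exact ties in the scores, both lists contain the same (score, slot) pairs.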
3.2 Complexity
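Searching for the top-k most similar keys when the keys have a flat representation requires $|\mathcal{K}|$ comparisons of vectors of size $d_q$, i.e. $O(|\mathcal{K}| \times d_q)$ operations.
For product keys, we consider the setup where $|\mathcal{C}| = |\mathcal{C}'|$, i.e. the configuration that maximizes $|\mathcal{C}| \times |\mathcal{C}'|$ for a fixed number of sub-keys $|\mathcal{C}| + |\mathcal{C}'|$. Since $|\mathcal{K}| = |\mathcal{C}| \times |\mathcal{C}'|$, we have $|\mathcal{C}| = \sqrt{|\mathcal{K}|}$. We only need to compare the two sub-queries with $|\mathcal{C}|$ and $|\mathcal{C}'|$ sub-keys of size $d_q/2$, which amounts to $O(|\mathcal{C}| \times d_q/2 + |\mathcal{C}'| \times d_q/2) = O(|\mathcal{C}| \times d_q) = O(\sqrt{|\mathcal{K}|} \times d_q)$ operations.
Then, we need to search for the top-k keys in $\{(c_i, c'_j) \mid i \in \mathcal{I}_{\mathcal{C}},\ j \in \mathcal{I}_{\mathcal{C}'}\}$, which is a set composed of $k^2$ keys of dimension $d_q$. This can be done in $O(k^2 \times d_q)$ operations (in practice, this could be done in $O(k \log k)$ scalar operations with a priority list [2], but this choice is less compliant with GPU architectures). As a result, the overall complexity is:
$$O\left(\left(\sqrt{|\mathcal{K}|} + k^2\right) \times d_q\right)$$
For small values of $k$, and a memory of size $|\mathcal{K}| = 1024^2$, retrieving the nearest product keys requires about $10^3$ times fewer operations than an exhaustive search. As shown later in our ablation study, product keys also lead to better performance than a set composed of flat keys.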
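As a rough check of these orders of magnitude, the operation counts can be evaluated directly. The helper names below are hypothetical, constant factors are ignored, and the query dimension of 512 is an arbitrary assumption that cancels out in the ratio.

```python
import math


def flat_search_ops(num_keys, d_q):
    """Exhaustive search: one d_q-sized comparison per key."""
    return num_keys * d_q


def product_search_ops(num_keys, d_q, k):
    """Product keys with |C| = |C'| = sqrt(num_keys): (sqrt(|K|) + k^2) * d_q."""
    num_subkeys = math.isqrt(num_keys)
    if num_subkeys * num_subkeys != num_keys:
        raise ValueError("num_keys must be a perfect square")
    return (num_subkeys + k * k) * d_q


num_keys, d_q = 1024 ** 2, 512
for k in (1, 4, 32):
    ratio = flat_search_ops(num_keys, d_q) / product_search_ops(num_keys, d_q, k)
    print(f"k={k:2d}: exhaustive search costs about {ratio:.0f}x more")
```

The gain is close to $\sqrt{|\mathcal{K}|}$ while $k^2$ stays small compared to the number of sub-keys, and it shrinks once $k^2$ becomes comparable to it, which is why the statement above is restricted to small values of $k$.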
3.3 Multi-head memory attention
We make the model more expressive with a multi-head mechanism, where each head independently computes a query used to select $k$ keys from the memory. The memory simply sums the output $m_i(x)$ of each head $i$: $m(x) = \sum_{i=1}^{H} m_i(x)$, where $H$ is the number of heads.
Each head has its own query network and its own set of sub-keys, but all heads share the same values. This is similar to the multi-head attention used in transformers, except that we do not split the query into $H$ heads, but instead create $H$ queries. As the query networks are independent from each other and randomly initialized, they often map the same input to very different values of the memory. In practice, for the same input we observe very little overlap between the keys selected by two different heads. This method lets us increase key usage and generally improves performance. The impact of the multi-head attention mechanism is discussed in Section 4.5.
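A minimal sketch of this multi-head read is given below. The `MemoryHead` class, the bias-free linear query map (no batch normalization) and all sizes are simplifying assumptions chosen for illustration; only the structure (per-head queries and sub-keys, shared values, summed outputs) follows the description above.

```python
import heapq
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class MemoryHead:
    """One head: its own (hypothetical) linear query map and two sub-key sets."""

    def __init__(self, rng, d_in, d_q, num_subkeys):
        half = d_q // 2
        self.proj = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_q)]
        self.sub1 = [[rng.gauss(0, 1) for _ in range(half)] for _ in range(num_subkeys)]
        self.sub2 = [[rng.gauss(0, 1) for _ in range(half)] for _ in range(num_subkeys)]

    def select(self, x, k):
        """Return (slot indices, softmax weights) of the k best product keys."""
        q = [dot(row, x) for row in self.proj]
        half = len(q) // 2
        s1 = [dot(q[:half], c) for c in self.sub1]
        s2 = [dot(q[half:], c) for c in self.sub2]
        best1 = heapq.nlargest(k, range(len(s1)), key=s1.__getitem__)
        best2 = heapq.nlargest(k, range(len(s2)), key=s2.__getitem__)
        candidates = [(s1[i] + s2[j], i * len(self.sub2) + j)
                      for i in best1 for j in best2]
        best = heapq.nlargest(k, candidates)
        weights = softmax([score for score, _ in best])
        return [slot for _, slot in best], weights


def multi_head_memory(x, heads, values, k):
    """Sum the outputs of all heads; `values` is shared by every head."""
    output = [0.0] * len(values[0])
    for head in heads:
        slots, weights = head.select(x, k)
        for slot, weight in zip(slots, weights):
            for d, v in enumerate(values[slot]):
                output[d] += weight * v
    return output


rng = random.Random(0)
d_in, d_q, num_subkeys, num_heads, k = 6, 4, 8, 4, 3
heads = [MemoryHead(rng, d_in, d_q, num_subkeys) for _ in range(num_heads)]
values = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(num_subkeys ** 2)]
x = [rng.gauss(0, 1) for _ in range(d_in)]
output = multi_head_memory(x, heads, values, k)
```

Because each head normalizes its own $k$ weights to sum to one, the summed output is a combination of at most `num_heads * k` value vectors whose weights add up to the number of heads.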
4 Experiments
We report results on large-scale experiments for transformer models equipped with a memory, followed by an ablation study that shows the impact of different memory components on the model performance and memory usage. We propose to replace the FFN block of some transformer layers by a memory, as presented in Figure 3. In that setting, the memory is integrated with a residual connection in the network, and the input $x$ to the memory layer becomes $x \leftarrow x + \mathrm{PKM}(x)$ instead of $x \leftarrow x + \mathrm{FFN}(x)$, where $\mathrm{PKM}$ denotes the product-key memory layer. In practice, we could also keep the FFN layer and simply interleave the memory between some transformer layers.
4.1 Dataset
We evaluate the impact of our memory in a large scale language modeling task, where traditional models are known to underfit. The largest publicly available language modeling dataset is the One Billion Word corpus [5]. As noted in prior work [3, 9, 34], obtaining a good performance on this dataset requires tedious regularization as it is now too small for standard architectures. In our experiments, we encountered the same issues, and observed that even a small model was enough to overfit: on this dataset, for a 16 layers model with a dimensionality of 1024, we obtain a test perplexity of when the validation perplexity starts to increase. The train perplexity is then equal to and keeps improving while the validation perplexity deteriorates.
We therefore evaluate the benefit of our approach on a corpus that is 30 times larger and extracted from the public Common Crawl. The training set is composed of 28 billion words (140 GB of data) extracted from about 40 million English news articles indexed by Common Crawl corpora. The validation and test sets are both composed of 5000 news articles removed from the training set.
Unlike in the One Billion Word corpus, we did not shuffle sentences, allowing the model to learn long range dependencies. On this dataset, we did not observe any overfitting, and increasing the model capacity systematically led to a better performance on the validation set. We tokenized the data using the tokenizer provided by the Moses toolkit [25]. To reduce the vocabulary size, we use Byte Pair Encoding (BPE) [37], with 60k BPE splits.
4.2 Evaluation metrics
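We measure the performance of our models by reporting the perplexity on the test set. For models with memories, we report two different metrics to evaluate the usage:
The memory usage, which represents the fraction of accessed values: $\#\{z_i \neq 0\} / |\mathcal{K}|$
The KL divergence between $z$ and the uniform distribution: $\log(|\mathcal{K}|) + \sum_i z_i \log(z_i)$
where $z = z' / \|z'\|_1$, and $z' \in \mathbb{R}^{|\mathcal{K}|}$ is defined as $z'_i = \sum_x w(x)_i$, where $w(x)$ represents the weights of the keys accessed in the memory when the network is fed with an input $x$ from the test set (i.e., the $w(x)$ are sparse with at most $H \times k$ non-zero elements, $H$ being the number of memory heads and $k$ the number of keys selected per head).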
At test time, we expect the model to access as many keys as possible, i.e. to have a usage near 100%; a lower usage means that part of the capacity is not exploited at all. The KL divergence reflects imbalance in the access patterns to the memory: if the model attends to the same key for every query (while giving a tiny weight to the remaining keys), it would give a perfect usage but a very high KL, showing that the same performance could be achieved with just one value.
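The two usage metrics can be sketched as follows, taking usage to be the fraction of slots that receive a non-zero accumulated weight and the KL divergence to be measured against the uniform distribution over slots. The function name and the representation of each sparse weight vector as a `{slot_index: weight}` dictionary are assumptions made for this example.

```python
import math


def memory_usage_and_kl(access_weights, num_slots):
    """Return (usage, KL to uniform) from per-input sparse key weights.

    access_weights: iterable of dicts {slot_index: weight}, one per test input.
    """
    # Accumulate the weight received by every slot over the whole test set.
    z_prime = [0.0] * num_slots
    for weights in access_weights:
        for slot, weight in weights.items():
            z_prime[slot] += weight
    # Normalize to a distribution over slots.
    total = sum(z_prime)
    z = [v / total for v in z_prime]
    # Usage: fraction of slots accessed at least once.
    usage = sum(1 for v in z if v > 0) / num_slots
    # KL(z || uniform) = log(num_slots) + sum_i z_i log z_i, with 0 log 0 = 0.
    kl = math.log(num_slots) + sum(v * math.log(v) for v in z if v > 0)
    return usage, kl


# Every slot accessed equally: full usage, KL close to zero.
balanced = memory_usage_and_kl([{0: 1.0}, {1: 1.0}, {2: 1.0}, {3: 1.0}], 4)
# A single slot always accessed: low usage, KL equal to log(4).
collapsed = memory_usage_and_kl([{0: 1.0}, {0: 1.0}, {0: 1.0}], 4)
# All slots touched, but one dominates: full usage with a clearly positive KL.
skewed = memory_usage_and_kl([{0: 0.97, 1: 0.01, 2: 0.01, 3: 0.01}], 4)
```

The `skewed` case corresponds to the situation described above, where usage alone looks perfect and only the KL term reveals the imbalance.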
4.3 Training details
We use a transformer architecture with 16 attention heads and learned positional embeddings. We consider models with 12, 16 or 24 layers, with either 1024 or 1600 dimensions. We train our models with the Adam optimizer [24], with a learning rate of , with , , following the learning rate schedule of Vaswani et al. [42]. In the memory, the keys and the query network are learned with the same optimizer and learning rate as the rest of the network. Since the memory values are learned with sparse updates, we found it beneficial to learn them with a higher Adam learning rate of
. We implement our models with PyTorch [33], and train them on 32 Volta GPUs. We use float16 operations to speed up training and to reduce the GPU memory usage of our models. To retrieve key indices efficiently, we perform the search over sub-keys with a fast nearest neighbors implementation by Johnson et al. [21].
For a transformer model with layers and memories, we interspersed the memories at regular intervals. For instance, for and , we replace the FFN of layers 6 and 12. This way, the network can leverage information at different levels of the architecture. The impact of the memory position within the network is studied in Section 4.5. In our main experiments, we use memory heads, we select keys per head, and use memory slots.
4.4 Results
Table 1 and Figure 4 show the perplexity of different models on the test set of the CC-News corpus. We observe that increasing either the dimensionality or the number of layers leads to significant perplexity improvements in all the models. However, adding a memory to the model is more beneficial than increasing the number of layers; for instance, a model with a single memory and 12 layers outperforms a memoryless model with the same hidden dimension and 24 layers, both when the number of hidden units is 1024 and 1600. Adding 2 or 3 memory layers further improves performance.
Figure 4 also shows speed as measured in words per second, for different model configurations. In particular, when the internal hidden state has 1600 dimensions, a model with 12 layers and a memory obtains a better perplexity than a model with 24 layers (same configuration as BERT large), and it is almost twice as fast. When adding memory to large models that have an internal dimensionality equal to 1600, inference time barely increases.
4.5 Ablation Study
In this section we study the impact of the different components of the memory layer, and measure how they affect the model performance and the memory usage. For all experiments, we consider a transformer network with 6 layers and 8 heads. Unless specified otherwise, we consider a memory of $512^2 \approx 262$k slots, with 4 memory heads, $k = 32$ selected keys, and we insert it at layer 5.
Memory size.
We train transformer models with memories of size $|\mathcal{K}| = |\mathcal{C}| \times |\mathcal{C}'|$, with $|\mathcal{C}'| = |\mathcal{C}|$ and $|\mathcal{C}| \in \{128, 256, 384, 512, 768, 1024\}$, i.e. from 16k to 1M slots. Table 2 shows that test perplexity decreases as the memory becomes larger. A model with a memory size of 16k obtains a perplexity of . Increasing the size to 1M decreases the perplexity down to while leaving the inference time unchanged. The dominant factor for inference time is the number of accessed memory values, which is governed by the number of memory heads and the parameter k, but not the memory size.
Query Batch Normalization.
Table 2 presents results with and without batch normalization in the query network. We observe that for small memories the usage is always close to 100%, but for a memory of size 1M, the batch normalization layer improves usage from % to %, with a consequent perplexity decrease from down to . For comparison, a model that does not use any memory obtains a perplexity of , which is on par with a memory of size 16k.
Finally, we observe a correlation between the number of used keys and the model performance. In particular, a model with a memory of size 1M that does not use batch normalization uses about % of the memory values (i.e. roughly 250k values), and obtains a perplexity of , which is on par with the model using a memory of size 262k that uses batch normalization, and that has a nearly optimal memory usage of 100%.
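Table 2: Perplexity and memory usage for different memory sizes, with and without batch normalization in the query network.
Memory size  16k  16k  65k  65k  147k  147k  262k  262k  590k  590k  1M  1M
BatchNorm  No  Yes  No  Yes  No  Yes  No  Yes  No  Yes  No  Yes
Perplexity
Usage (%)
KL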
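Table 4: Comparison of flat keys and product keys for different memory sizes.
Memory size  16k  16k  65k  65k  147k  147k  262k  262k  590k  590k  1M  1M
Product Keys  No  Yes  No  Yes  No  Yes  No  Yes  No  Yes  No  Yes
Perplexity
Usage (%)
KL
Speed (w/s)  35.0k  35.8k  28.5k  36.7k  13.9k  36.4k  7.7k  36.3k  4.7k  36.2k  1.2k  35.7k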
Memory position.
In this experiment we insert the memory at different levels in the transformer, to see where it is the most beneficial. In Table 3 we observe that the model benefits the most from the memory when it replaces the FFN of layer 4 or 5 in the transformer. Putting the memory at layer 1 (after the input token embeddings) gives the worst performance. When the memory is inserted in layer 6, it is located right before the softmax output, so the model has only one linear layer to process the information read from the memory. The best position to insert the memory is at an intermediate layer. We surmise that effective use of the memory requires operating in a more abstract feature space than the input, and that it is important to have some layers on top of the memory to further process and aggregate information from every location.
Number of heads / kNN.
Figure 5 shows that increasing the number of heads or the number of kNN improves both the perplexity of the model and the memory usage. We also note that models with identical $H \times k$ ($H$ being the number of heads and $k$ the number of nearest neighbors) have a similar memory usage, i.e. models with all have a memory usage around 70%, and a perplexity around 20.5. Adding more heads overall improves the performance, but also increases the computation time. Overall, we found that using 4 heads and 32 kNN strikes a good trade-off between speed and performance.
Product keys vs. flat keys.
Product keys presented in Figure 2 enable finding the nearest neighbors in a matrix of size $(|\mathcal{C}|^2, d_q)$ with the same time/compute complexity as a search over two matrices of size $(|\mathcal{C}|, d_q/2)$. As a result, product keys contain $|\mathcal{C}|$ times fewer parameters than keys represented by a full matrix. Table 4 and Figure 6 compare product keys to the default regular flat keys. In the second case, searching the nearest keys boils down to a linear index search at each iteration, which is computationally very expensive. As a result, we only report results for memories of size 16k, 65k, 147k, as experiments with a flat index on larger memories take an unreasonable amount of time to converge. We can see that models with product keys are not only faster but also have a much better memory usage, and consequently obtain a better perplexity.
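The parameter saving follows from counting the entries of the key matrices. As a short worked derivation, writing $|\mathcal{C}|$ for the number of sub-keys per codebook and $d_q$ for the key dimension (both codebooks are assumed to have the same size, as in Section 3.2):

```latex
\frac{\text{flat key parameters}}{\text{product key parameters}}
  = \frac{|\mathcal{C}|^2 \cdot d_q}{2 \cdot |\mathcal{C}| \cdot \frac{d_q}{2}}
  = \frac{|\mathcal{C}|^2 \cdot d_q}{|\mathcal{C}| \cdot d_q}
  = |\mathcal{C}|
```

For example, two codebooks of 512 sub-keys address $512^2 = 262{,}144$ memory slots while storing 512 times fewer key parameters than a flat key matrix of the same size. The value table is unaffected, since every slot still owns its own value vector.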
5 Conclusion
This paper introduces a memory layer that drastically increases the capacity of a neural network with a negligible computational overhead. The efficiency of our layer relies on two key ingredients: the factorization of keys as a product set, and the sparse read/write accesses to the memory values. Our layer is integrated into an existing neural network architecture. We show experimentally that it provides important gains on large-scale language modeling, reaching with 12 layers the performance of a 24-layer BERT-large model with half the running time.
References
[1] The on-device visual intelligence challenge. https://ai.googleblog.com/2018/04/introducing-cvpr-2018-on-device-visual.html. Accessed: 2019-05-20.
[2] Artem Babenko and Victor Lempitsky. The inverted multi-index. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(6):1247–1260, 2014.
[3] Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. arXiv preprint arXiv:1809.10853, 2018.
[4] Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR, abs/1308.3432, 2013. URL http://arxiv.org/abs/1308.3432.
[5] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013.
[6] Kyunghyun Cho and Yoshua Bengio. Exponentially increasing the capacity-to-computation ratio for conditional computation in deep learning. CoRR, abs/1406.7362, 2014.
[7] Matthieu Courbariaux and Yoshua Bengio. Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1. CoRR, 2016.
[8] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. CoRR, 2015.
[9] Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.
[10] Ludovic Denoyer and Patrick Gallinari. Deep sequential neural network. CoRR, abs/1410.0510, 2014.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[12] D. Eigen, I. Sutskever, and M. Ranzato. Learning factored representations in a deep mixture of experts. In Workshop at the International Conference on Learning Representations, 2014.
[13] Thomas Gerald, Nicolas Baskiotis, and Ludovic Denoyer. Binary stochastic representations for large multi-class classification. In Neural Information Processing - 24th International Conference, ICONIP 2017, Guangzhou, China, November 14-18, 2017, Proceedings, Part I, pages 155–165, 2017.
[14] Edouard Grave, Armand Joulin, and Nicolas Usunier. Improving neural language models with a continuous cache. In International Conference on Learning Representations, 2017.
[15] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649. IEEE, 2013.
[16] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. CoRR, abs/1410.5401, 2014.
[17] Sam Gross, Marc’Aurelio Ranzato, and Arthur Szlam. Hard mixtures of experts for large scale weakly supervised vision. In Conference on Computer Vision and Pattern Recognition, 2017.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition, 2016.
[19] Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, and Zhifeng Chen. Gpipe: Efficient training of giant neural networks using pipeline parallelism. CoRR, abs/1811.06965, 2018. URL http://arxiv.org/abs/1811.06965.
[20] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
[21] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734, 2017.
[22] Armand Joulin and Tomas Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets. In Advances in Neural Information Processing Systems, 2015.
[23] Koray Kavukcuoglu, Marc’Aurelio Ranzato, and Yann LeCun. Fast inference in sparse coding algorithms with applications to object recognition. CoRR, abs/1010.3467, 2010. URL http://arxiv.org/abs/1010.3467.
[24] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[25] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics, 2007.
[26] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
[27] Dhruv Mahajan, Ross B. Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In European Conference on Computer Vision, 2018.
[28] Alireza Makhzani and Brendan J. Frey. k-sparse autoencoders. In ICLR, 2014.
[29] Alireza Makhzani and Brendan J Frey. Winner-take-all autoencoders. In Advances in Neural Information Processing Systems, 2015.
[30] Marius Muja and David G. Lowe. Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36, 2014.
[31] Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. Towards understanding the role of over-parametrization in generalization of neural networks. In International Conference on Learning Representations, 2019.
[32] Bruno A. Olshausen and David J. Field. Sparse coding with an overcomplete basis set, a strategy employed by v1? Vision Research, 37:3311–3325, 1997.
[33] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. NIPS 2017 Autodiff Workshop, 2017.
[34] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
[35] Jack Rae, Jonathan J Hunt, Ivo Danihelka, Timothy Harley, Andrew W Senior, Gregory Wayne, Alex Graves, and Timothy Lillicrap. Scaling memory-augmented neural networks with sparse reads and writes. In Advances in Neural Information Processing Systems, 2016.
[36] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, 2016.
[37] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1715–1725, 2015.
[38] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
[39] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.
[40] Stefano Spigler, Mario Geiger, Stéphane d’Ascoli, Levent Sagun, Giulio Biroli, and Matthieu Wyart. A jamming transition from under- to over-parametrization affects loss landscape and generalization. CoRR, abs/1810.09665, 2018. URL http://arxiv.org/abs/1810.09665.
[41] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In Advances in Neural Information Processing Systems 28, 2015.
[42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[43] Sudheendra Vijayanarasimhan, Jonathon Shlens, Rajat Monga, and Jay Yagnik. Deep networks with large output spaces. In ICLR (Workshop), 2015.
[44] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
[45] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. In ICLR, 2015.
[46] Jason Weston, Antoine Bordes, Sumit Chopra, and Tomas Mikolov. Towards AI-complete question answering: A set of prerequisite toy tasks. In ICLR, 2016.