Large Memory Layers with Product Keys

07/10/2019
by Guillaume Lample, et al.

This paper introduces a structured memory that can be easily integrated into a neural network. The memory is very large by design and therefore significantly increases the capacity of the architecture, by up to a billion parameters with negligible computational overhead. Its design and access pattern are based on product keys, which enable fast and exact nearest neighbor search. The ability to increase the number of parameters while keeping the same computational budget lets the overall system strike a better trade-off between prediction accuracy and computational efficiency at both training and test time. This memory layer allows us to tackle very large-scale language modeling tasks. In our experiments we consider a dataset with up to 30 billion words, and we plug our memory layer into a state-of-the-art transformer-based architecture. In particular, we find that a memory-augmented model with only 12 layers outperforms a baseline transformer model with 24 layers, while being twice as fast at inference time. We release our code for reproducibility purposes.
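
To make the product-key lookup concrete, the sketch below shows one way the mechanism can be implemented in PyTorch: the query is split into two halves, each half is scored against its own sub-key table, and the exact top-k over the implicit Cartesian product of keys is recovered by merging the two per-half top-k lists. This is a minimal sketch, not the authors' released code; names such as ProductKeyMemory, n_sub_keys, and knn are illustrative, and details like multi-head memory access, the query network, and sparse handling of the value table are omitted.

```python
import torch
import torch.nn.functional as F


class ProductKeyMemory(torch.nn.Module):
    """Minimal sketch of a product-key memory lookup (illustrative names,
    not the paper's notation or the released implementation)."""

    def __init__(self, dim=512, n_sub_keys=512, knn=32, v_dim=512):
        super().__init__()
        half = dim // 2
        # Two sub-key tables; the full key set is their Cartesian product,
        # so the memory holds n_sub_keys ** 2 slots (e.g. 512 ** 2 = 262,144).
        self.sub_keys1 = torch.nn.Parameter(torch.randn(n_sub_keys, half))
        self.sub_keys2 = torch.nn.Parameter(torch.randn(n_sub_keys, half))
        self.values = torch.nn.Embedding(n_sub_keys ** 2, v_dim)
        self.n_sub_keys, self.knn = n_sub_keys, knn

    def forward(self, query):                    # query: (batch, dim)
        q1, q2 = query.chunk(2, dim=-1)          # split the query into two halves
        s1 = q1 @ self.sub_keys1.t()             # (batch, n_sub_keys)
        s2 = q2 @ self.sub_keys2.t()             # (batch, n_sub_keys)
        # Exact top-k over the product key set only needs the top-k of each
        # half, followed by a merge over knn * knn candidate pairs.
        top1, idx1 = s1.topk(self.knn, dim=-1)
        top2, idx2 = s2.topk(self.knn, dim=-1)
        cand = (top1.unsqueeze(-1) + top2.unsqueeze(-2)).flatten(1)  # (batch, knn*knn)
        best, pos = cand.topk(self.knn, dim=-1)
        # Map each winning pair back to its slot index i * n_sub_keys + j.
        i = idx1.gather(1, torch.div(pos, self.knn, rounding_mode="floor"))
        j = idx2.gather(1, pos % self.knn)
        slots = i * self.n_sub_keys + j
        # Sparse read: softmax-weighted sum of the knn selected values.
        w = F.softmax(best, dim=-1).unsqueeze(-1)                    # (batch, knn, 1)
        return (w * self.values(slots)).sum(dim=1)                   # (batch, v_dim)
```

The point of the factorization is that an exact search over n_sub_keys squared slots costs only two searches over n_sub_keys sub-keys plus a merge of knn by knn candidates, so only knn rows of a very large value table are touched per query, which is what keeps the computational overhead negligible as the parameter count grows.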

Related research

07/07/2021 - Differentiable Random Access Memory using Lattices
We introduce a differentiable random access memory module with O(1) perf...

01/27/2021 - CNN with large memory layers
This work is centred around the recently proposed product key memory str...

06/15/2021 - PairConnect: A Compute-Efficient MLP Alternative to Attention
Transformer models have demonstrated superior performance in natural lan...

05/21/2018 - A Simple Cache Model for Image Recognition
Training large-scale image recognition models is computationally expensi...

07/30/2015 - Multilinear Map Layer: Prediction Regularization by Structural Constraint
In this paper we propose and study a technique to impose structural cons...

11/06/2019 - Fast Transformer Decoding: One Write-Head is All You Need
Multi-head attention layers, as used in the Transformer neural sequence ...

08/11/2021 - DEMix Layers: Disentangling Domains for Modular Language Modeling
We introduce a new domain expert mixture (DEMix) layer that enables cond...
