1 Introduction
Memory plays an important role in both artificial and biological learning systems [4]. Various forms of external memory have been used to augment neural networks [5, 14, 25, 29, 30, 31]. Most of these approaches use attention-based reading mechanisms that compute a weighted average of memory contents. These mechanisms typically retrieve items in a single step and are fixed after training. While external memory offers the potential of quickly adapting to new data after training, it is unclear whether these previously proposed attention-based mechanisms can fully exploit this potential. For example, when inputs are corrupted by noise that is unseen during training, are such one-step attention processes always optimal?
In contrast, experimental and theoretical studies of neural systems suggest memory retrieval is a dynamic and iterative process: memories are retrieved over a potentially varying period of time, rather than in a single step, during which information can be continuously integrated [3, 7, 20]. In particular, attractor dynamics are hypothesised to support the robust performance of various forms of memory via their self-stabilising property [8, 12, 16, 28, 32]. For example, systems with point attractors eventually converge to a set of fixed points even from noisy initial states. Memories stored at such fixed points can thus be retrieved robustly. To our knowledge, only the Kanerva Machine (KM) incorporates iterative reconstruction of a retrieved pattern within a modern deep learning model, but it does not have any guarantee of convergence [31].
Incorporating attractor dynamics into modern neural networks is not straightforward. Although recurrent neural networks can in principle learn any dynamics, they face the problem of vanishing gradients. This problem is aggravated when directly training for attractor dynamics, which by definition imply vanishing gradients [23] (see also Section 2.2). In this work, we avoid vanishing gradients by constructing our model to dynamically optimise a variational lower bound. After training, the stored patterns serve as attractive fixed points to which even random patterns will converge. Thanks to the underlying probabilistic model, we do not need to simulate the attractor dynamics during training, thus avoiding the vanishing gradient problem. We applied our approach to a generative distributed memory. In this context we focus on demonstrating high capacity and robustness, though the framework may be used for any other memory model with a well-defined likelihood.

To confirm that the emerging attractor dynamics help memory retrieval, we experiment with the Omniglot dataset [22] and images from DMLab [6], showing that the attractor dynamics consistently improve images corrupted by noise unseen during training, as well as low-quality prior samples. The improvement in sampling quality tracks the decrease of an energy that we define based on the variational lower bound.
2 Background and Notation
All vectors are assumed to be column vectors. Samples from a dataset, as well as other variables, are indexed with the subscript $t$ when the temporal order is specified. We use the shorthand subscripts $<t$ and $\leq t$ to indicate all elements with indexes less than and less than or equal to $t$, respectively. $\mathbb{E}_{p(x)}\left[f(x)\right]$ is used to denote the expectation of the function $f(x)$ over the distribution $p(x)$.

2.1 Kanerva Machines
Our model shares the same essential structure as the Kanerva Machine (figure 1, left) [31], which views memory as a global latent variable in a generative model. Underlying the inference process is the assumption of exchangeability of the observations: i.e., an episode of observations $x_1, \dots, x_T$ is exchangeable if shuffling the indices within the episode does not affect its probability [2]. This ensures that a pattern can be retrieved regardless of the order in which it was stored in the memory; there is no forgetting of earlier patterns. Formally, exchangeability implies that all the patterns in an episode are conditionally independent given the memory: $p(x_1, \dots, x_T \,|\, M) = \prod_{t=1}^{T} p(x_t \,|\, M)$.

More specifically, $p(M)$ defines the distribution over the memory matrix $M \in \mathbb{R}^{K \times C}$, where $K$ is the number of rows and $C$ is the code size used by the memory. The statistical structure of the memory is summarised in its mean and covariance through the parameters $R$ and $U$. Intuitively, while the mean provides materials for the memory to synthesise observations, the covariance coordinates memory reads and writes. $R$ is the mean matrix of $M$, which has the same shape.
$M$'s columns are independent, with the same variance for all elements in a given row. The covariance between rows of $M$ is encoded in the covariance matrix $U$. The vectorised form of $M$ has the multivariate Gaussian distribution $\mathrm{vec}(M) \sim \mathcal{N}\big(\mathrm{vec}(R),\, U \otimes I\big)$, where $\mathrm{vec}(\cdot)$ is the vectorisation operator and $\otimes$ denotes the Kronecker product. Equivalently, the memory can be summarised by the matrix-variate normal distribution $p(M) = \mathcal{MN}(R, U, I)$. Reading from memory is achieved via a weighted sum over the rows of $M$, weighted by the addressing weights $w$:

$$ z = \sum_{k=1}^{K} w_k \, M_{k,\cdot} + \xi \qquad (1) $$

where $k$ indexes the elements of $w$ and the rows of $M$. $\xi$ is observation noise with fixed variance $\sigma_\xi^2$ to ensure the model's likelihood is well defined (Appendix A). The memory interfaces with data inputs via neural network encoders and decoders.
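The Kronecker structure of the memory distribution can be checked numerically. The sketch below (dimensions and parameter values are illustrative, not the paper's) draws one memory sample, performs the weighted read of eq. 1, and verifies that the row-major vectorisation of $M$ has covariance $U \otimes I$:

```python
import numpy as np

rng = np.random.default_rng(0)
K, C = 4, 6                       # illustrative: K memory rows, code size C

# Memory parameters: mean R and row covariance U = L L^T (L is a factor of U).
R = rng.normal(size=(K, C))
L = rng.normal(size=(K, K))
U = L @ L.T

# One sample from the matrix-variate normal: M = R + L Z, Z ~ iid N(0, 1).
Z = rng.normal(size=(K, C))
M = R + L @ Z

# Reading (eq. 1): a weighted sum over the rows of M, plus observation noise.
w = rng.normal(size=K)
z = M.T @ w + 0.1 * rng.normal(size=C)

# Row-major vectorisation is the linear map vec(M) = vec(R) + kron(L, I) vec(Z),
# so the covariance of vec(M) is kron(L, I) kron(L, I)^T = kron(U, I_C):
# rows of M are correlated through U, while its columns stay independent.
T_map = np.kron(L, np.eye(C))
assert np.allclose(T_map @ T_map.T, np.kron(U, np.eye(C)))
```

The check is exact up to floating point: correlating the rows through $U$ while keeping columns independent is what lets the covariance coordinate reads and writes per memory row.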
Since the memory is a linear Gaussian model, its posterior distribution is analytically tractable and online Bayesian inference can be performed efficiently. Wu et al. [31] interpreted inferring the posterior of the memory as a writing process that optimally balances previously stored patterns and new patterns. To infer the addressing weights, however, the KM uses an amortised inference model, similar to the encoder of a variational autoencoder (VAE) [19, 27], which does not access the memory. Although it can distil information about the memory into its parameters during training, such parameterised information cannot easily be adapted to test-time data. This can damage performance during testing, for example, when the memory is loaded with different numbers of patterns, as we shall demonstrate in experiments.

Figure 1 (caption, partially recovered). Left: [...] does not affect the graphical model. Right: schematic structure of our model. The memory is a Gaussian random matrix.
2.2 Attractor Dynamics
A theoretically well-founded approach for robust memory retrieval is to employ attractor dynamics [3, 15, 18, 32]. In this paper, we focus on point attractors, although other types of attractor may also support memory systems [12]. For a discrete-time dynamical system with state $x$ and dynamics specified by the function $f$, the states evolve as $x_{n+1} = f(x_n)$. A fixed point $x^*$ satisfies the condition $f(x^*) = x^*$, so that the state remains unchanged once it reaches $x^*$. A fixed point is attractive if, for any point near $x^*$, iterative application of $f$ converges to $x^*$. A more formal definition of a point attractor is given in Appendix E, along with a proof of attractor dynamics for our model. Gradient-based training of attractors with parametrised models $f_\theta$, such as neural networks, is difficult: for any loss function $\ell$ that depends on the $n$'th state $x_n$, the gradient

$$ \frac{\partial \ell}{\partial x_0} = \frac{\partial \ell}{\partial x_n} \prod_{i=1}^{n} \frac{\partial x_i}{\partial x_{i-1}} \qquad (2) $$

vanishes when $x_n$ approaches a fixed point, since near an attractive fixed point the Jacobians $\partial x_i / \partial x_{i-1}$ have spectral radius smaller than 1, so their product shrinks towards zero according to the fixed-point condition. This is the "vanishing gradients" problem, which makes backpropagating gradients through the attractor settling dynamics difficult [23, 24].

3 Dynamic Kanerva Machines
We call our model the Dynamic Kanerva Machine (DKM), because it optimises addressing weights at each step via dynamic addressing. We depart from both Kanerva's original sparse distributed memory [18] and the KM by removing the static addresses that are fixed after training. The DKM is illustrated in figure 1 (right). Following the KM, we use a Gaussian random matrix for the memory, and approximate samples of $M$ using its mean $R$. We use subscripts for $M_t$, $R_t$ and $U_t$ to distinguish the memory or its parameters after the online update at the $t$'th step when necessary.
We use a neural network encoder to deterministically map an external input $x$ to an embedding $z$. To obtain a valid likelihood function, the decoder is a parametrised distribution that transforms an embedding into a distribution in the input space, similar to the decoder in the VAE. Together the pair forms an autoencoder.
Similar to eq. 1, we construct $z$ from the memory and the addressing weights via $z = M^\top w$. Since both the encoder and decoder mappings are deterministic, we hereafter omit all dependencies of distributions on the embeddings for brevity. For a Bayesian treatment of the addressing weights, we assume they have the Gaussian prior $p(w) = \mathcal{N}(0, I)$. The posterior distribution $q(w_t)$ has a variance that is trained as a parameter and a mean that is optimised analytically at each step (Section 3.1). All parameters of the model and their initialisations are summarised in Appendix B.
To train the model in a maximum-likelihood setting, we update the model parameters to maximise the log-likelihood of episodes sampled from the training set (summarised in Algorithm 1). As is common for latent-variable models, we achieve this by maximising a variational lower bound of the likelihood. To avoid cluttered notation we assume all training episodes have the same length $T$; nothing in our algorithm depends on this assumption. Given an approximated memory distribution $q(M)$, the log-likelihood of an episode can be decomposed as (see full derivation in Appendix C):
$$ \ln p(x_{\leq T}) = \mathcal{L}_T + \mathrm{D_{KL}}\big(q(M)\,\|\,p(M \,|\, x_{\leq T})\big) + \sum_{t=1}^{T} \mathbb{E}_{q(M)}\Big[\mathrm{D_{KL}}\big(q(w_t)\,\|\,p(w_t \,|\, x_t, M)\big)\Big] \qquad (3) $$
with its variational lower bound:
$$ \mathcal{L}_T = \sum_{t=1}^{T} \mathbb{E}_{q(M)\,q(w_t)}\big[\ln p(x_t \,|\, w_t, M)\big] - \sum_{t=1}^{T} \mathrm{D_{KL}}\big(q(w_t)\,\|\,p(w_t)\big) - \mathrm{D_{KL}}\big(q(M)\,\|\,p(M)\big) \qquad (4) $$
For consistency, we write $\mathcal{L}_T$ for the lower bound of an episode of length $T$. From the perspective of the EM algorithm [11], the lower bound can be maximised in two ways: 1. By tightening the bound while keeping the likelihood unchanged. This can be achieved by minimising the KL-divergences in eq. 3, so that $q(w_t)$ and $q(M)$ approximate the corresponding posterior distributions. 2. By directly maximising the lower bound as an evidence lower-bound objective (ELBO) by, for example, gradient ascent on the model parameters.¹ This may both improve the quality of the posterior approximation by squeezing the bound, and maximise the likelihood of the generative model.

¹ This differs from the original EM algorithm, which fixes the approximated posterior in the M step.
We develop an algorithm analogous to the two-step EM algorithm: it first analytically tightens the lower bound by minimising the KL-divergence terms in eq. 3 via inference of tractable parameters, and then maximises the lower bound by slowly updating the remaining model parameters via backpropagation. The analytic inference in the first step is quick and does not require training, allowing the model to adapt to new data at test time.
3.1 Dynamic Addressing
Recall that the approximate posterior distribution of the addressing weights has the form $q(w_t) = \mathcal{N}(w_t;\, \bar{w}_t,\, \sigma_w^2 I)$. While the variance parameter $\sigma_w^2$ is trained using gradient-based updates, dynamic addressing is used to find the mean $\bar{w}_t$ that minimises $\mathrm{D_{KL}}\big(q(w_t)\,\|\,p(w_t \,|\, x_t, M)\big)$. Dropping the subscript $t$ when it applies to any given $x$ and $w$, it can be shown that the KL-divergence can be approximated by the following quadratic form (see Appendix D for the derivation):
$$ \mathrm{D_{KL}}\big(q(w)\,\|\,p(w \,|\, x, M)\big) \approx \frac{1}{2\sigma_\xi^2}\, \big\lVert z - R^\top \bar{w} \big\rVert^2 + \frac{1}{2}\, \lVert \bar{w} \rVert^2 \qquad (5) $$
where the terms that are independent of $\bar{w}$ are omitted. Then, the optimal $\bar{w}$ can be found by solving the (regularised) least-squares problem:
$$ \bar{w} = \operatorname*{argmin}_{w} \; \big\lVert z - R^\top w \big\rVert^2 + \sigma_\xi^2 \, \lVert w \rVert^2 \qquad (6) $$
This operation can be implemented efficiently via an off-the-shelf least-squares solver, such as TensorFlow's matrix_solve_ls function, which we used in our experiments. Intuitively, dynamic addressing finds the combination of memory rows that minimises the squared error between the read-out and the embedding $z$, subject to the constraint from the prior $p(w)$.
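Dynamic addressing thus reduces to ridge regression with the closed form $\bar{w} = (R R^\top + \lambda I)^{-1} R z$. A minimal NumPy sketch (the function name, dimensions, and regulariser value are illustrative assumptions; TensorFlow's matrix_solve_ls plays this role in the paper):

```python
import numpy as np

def dynamic_address(R, z, lam):
    """Ridge solution of eq. 6: w* = argmin_w ||z - R^T w||^2 + lam ||w||^2,
    i.e. w* = (R R^T + lam I)^{-1} R z."""
    K = R.shape[0]
    return np.linalg.solve(R @ R.T + lam * np.eye(K), R @ z)

rng = np.random.default_rng(1)
K, C = 8, 16
R = rng.normal(size=(K, C))          # memory mean

# An embedding that is exactly a combination of memory rows is recovered
# (up to the small shrinkage introduced by the prior term).
w_true = rng.normal(size=K)
z = R.T @ w_true
w_hat = dynamic_address(R, z, lam=1e-6)
assert np.allclose(R.T @ w_hat, z, atol=1e-4)
```

Because the solve is over a $K \times K$ system, addressing stays cheap even when the code size $C$ is large, and it involves no trained parameters.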
3.2 Bayesian Memory Update
We now turn to the more challenging problem of minimising $\mathrm{D_{KL}}\big(q(M)\,\|\,p(M \,|\, x_{\leq T})\big)$. We tackle this minimisation via a sequential update algorithm. To motivate this algorithm we begin by considering an episode of length $T = 1$. In this case, eq. 3 can be simplified to:
$$ \ln p(x_1) = \mathcal{L}_1 + \mathrm{D_{KL}}\big(q(M)\,\|\,p(M \,|\, x_1)\big) + \mathbb{E}_{q(M)}\Big[\mathrm{D_{KL}}\big(q(w_1)\,\|\,p(w_1 \,|\, x_1, M)\big)\Big] \qquad (7) $$
While it is still unclear how to minimise $\mathrm{D_{KL}}\big(q(M)\,\|\,p(M \,|\, x_1)\big)$ directly, if a suitable weight distribution $q(w_1)$ were given, a slightly different term, $\mathrm{D_{KL}}\big(q(M)\,\|\,p(M \,|\, x_1, w_1)\big)$, can be minimised to zero. To achieve this, we can set $q(M) = p(M \,|\, x_1, w_1)$ by updating the parameters of $q(M)$ using the same Bayesian update rule as in the KM (Appendix A). We may then marginalise out $w_1$ to obtain
$$ q(M_1) = \int p(M \,|\, x_1, w_1)\, q(w_1)\, \mathrm{d}w_1 \qquad (8) $$
A reasonable guess of $q(w_1)$ can be obtained by solving

$$ \bar{w}_1 = \operatorname*{argmin}_{w} \; \big\lVert z_1 - R_0^\top w \big\rVert^2 + \sigma_\xi^2 \, \lVert w \rVert^2 \qquad (9) $$
as in section 3.1, but using the prior memory mean $R_0$. To continue, we treat the current posterior as the next prior, and compute $q(M_2)$ from $x_2$ following the same procedure, until we obtain $q(M_T)$ using all of $x_1, \dots, x_T$.
More formally, Appendix C shows that this heuristic online update procedure maximises another lower bound of the log-likelihood. In addition, the marginalisation in eq. 8 can be approximated by using the mean $\bar{w}_t$ instead of sampling $w_t$ for each memory update:

$$ q(M_t) \approx p(M \,|\, x_t, \bar{w}_t, M_{t-1}) \qquad (10) $$

Although this lower bound is looser than $\mathcal{L}_T$ (eq. 4), Appendix C suggests it can be tightened by iteratively using the updated memory for addressing (e.g., replacing $R_0$ in eq. 9 by the updated memory mean, the "optional" step in Algorithm 1) and updating the memory with the refined weights. We found that extra iterations yielded only marginal improvements in our setting, so we did not use them in our experiments.
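The write phase of Algorithm 1 alternates dynamic addressing (Section 3.1) with the Bayesian memory update (Appendix A). A toy NumPy sketch of one episode (the dimensions, random prior mean, and regulariser are illustrative assumptions, not trained values):

```python
import numpy as np

def address(R, z, lam):
    """Dynamic addressing (eq. 6): ridge-regress z onto the rows of R."""
    K = R.shape[0]
    return np.linalg.solve(R @ R.T + lam * np.eye(K), R @ z)

def bayes_update(R, U, w, z, sigma_xi2):
    """Linear-Gaussian memory update (Appendix A) for one (z, w) pair."""
    delta = z - R.T @ w                  # prediction error, shape (C,)
    var_z = w @ U @ w + sigma_xi2        # scalar read-out variance
    gain = U @ w / var_z                 # Kalman-style gain, shape (K,)
    return R + np.outer(gain, delta), U - np.outer(gain, U @ w)

rng = np.random.default_rng(2)
K, C, sigma_xi2 = 8, 16, 1.0
R, U = rng.normal(size=(K, C)), np.eye(K)    # illustrative prior memory

for z in rng.normal(size=(5, C)):            # one episode of embeddings
    w = address(R, z, lam=sigma_xi2)
    err_before = np.linalg.norm(R.T @ w - z)
    R, U = bayes_update(R, U, w, z, sigma_xi2)
    err_after = np.linalg.norm(R.T @ w - z)
    # each update shrinks the read-out error for the pattern just written,
    # by a factor sigma^2 / (w^T U w + sigma^2)
    assert err_after < err_before
```

Each step costs one $K \times K$ solve plus rank-one updates to $R$ and $U$, which is what keeps sequential writing cheap at test time.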
3.3 GradientBased Training
Having inferred the addressing weights and the memory, we now focus on gradient-based optimisation of the lower bound (eq. 4). To ensure the likelihood in eq. 4 can be produced from the likelihood given by the memory read-out, we ideally need a bijective pair of encoder and decoder (see Appendix D for more discussion). This is difficult to guarantee, but we can approximate this condition by maximising the autoencoder log-likelihood:

$$ \mathcal{L}_{\mathrm{AE}} = \sum_{t=1}^{T} \ln p\big(x_t \,|\, z_t = e(x_t)\big) \qquad (11) $$

where $e(\cdot)$ denotes the encoder.
Taken together, we maximise the following joint objective using backpropagation:

$$ \mathcal{J} = \mathcal{L}_T + \mathcal{L}_{\mathrm{AE}} \qquad (12) $$
We note that dynamic addressing during online memory updates introduces order dependence, since $w_t$ always depends on the previous memory. This violates the model's exchangeable structure (order independence). Nevertheless, gradient ascent on the joint objective mitigates this effect by adjusting the model so that the KL-divergence remains close to a minimum even for previously inferred weights. Appendix C explains this in more detail.
3.4 Prediction / Reading
The predictive distribution of our model is the posterior distribution of a pattern $x$ given a query $x_q$ and memory $M$: $p(x \,|\, x_q, M) = \int p(x \,|\, w, M)\, p(w \,|\, x_q, M)\, \mathrm{d}w$. This posterior distribution does not have an analytic form in general (unless the likelihood is Gaussian). We therefore approximate the integral using the maximum a posteriori (MAP) estimator of $w$:

$$ p(x \,|\, x_q, M) \approx p(x \,|\, w^*, M), \qquad w^* = \operatorname*{argmax}_{w}\; p(w \,|\, x_q, M) \qquad (13) $$

Thus, $w^*$ can be computed by solving the same least-squares problem as in eq. 6 and choosing the resulting mean (see Appendix D for details).
3.5 Attractor Dynamics
To understand the model's attractor dynamics, we define the energy of a configuration $(x, w)$ with a given memory $M$ as:

$$ E(x, w) = -\,\mathbb{E}_{q(w)}\big[\ln p(x \,|\, w, M)\big] + \mathrm{D_{KL}}\big(q(w)\,\|\,p(w)\big) \qquad (14) $$

where $q(w)$ is the Gaussian posterior with mean $w$.
For a well-trained model, with $x$ fixed, $E$ is at a minimum with respect to $w$ after minimising eq. 6. To see this, note that the negative of $E$ consists of just those terms in $\mathcal{L}_T$ in eq. 4 that depend on a specific $x_t$ and $w_t$, which are maximised during training. Now we can minimise $E$ further by fixing $w$ and optimising $x$. Since only the first term depends on $x$, $E$ is further minimised by choosing the mode of the likelihood function $p(x \,|\, w, M)$. For example, we take the mean for the Gaussian likelihood, and round the sigmoid outputs for the Bernoulli likelihood. Each step can be viewed as coordinate descent over the energy $E$, as illustrated in figure 2 (left).
The step of optimising $w$ followed by taking the mode of the likelihood is exactly the same as taking the mode of the predictive distribution (eq. 13). Therefore, we can simulate the attractor dynamics by repeatedly feeding back the predictive mode as the next query: $x^{(0)} \to x^{(1)} \to \dots \to x^{(n)}$. This sequence converges to a stored pattern in the memory, because each iteration minimises the energy, so that $E\big(x^{(n+1)}, w^{(n+1)}\big) \leq E\big(x^{(n)}, w^{(n)}\big)$, unless the sequence has already converged at step $n$. Therefore, the sequence will converge to a local minimum in the energy landscape, which in a well-trained memory model corresponds to a stored pattern.
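The retrieval loop can be illustrated with a toy linear memory. In this sketch the stored patterns sit directly in the memory mean, the encoder and decoder are identities, and np.sign plays the role of taking the mode of a Bernoulli-style likelihood; these are simplifying assumptions, not the paper's trained model:

```python
import numpy as np

rng = np.random.default_rng(3)
C, n_pat, lam = 64, 2, 1.0

# Stored patterns in +/-1 coding, placed directly in the memory mean
# (a toy simplification of a trained memory).
P = rng.choice([-1.0, 1.0], size=(n_pat, C))

def retrieve_step(z):
    """One attractor iteration: dynamic addressing (ridge regression),
    a linear read-out, then the likelihood mode (here: the sign)."""
    w = np.linalg.solve(P @ P.T + lam * np.eye(n_pat), P @ z)
    return np.sign(P.T @ w)

# Corrupt a stored pattern by flipping 6 of its 64 bits ...
z = P[0].copy()
z[:6] *= -1.0
# ... then iterate; the state settles back onto the stored pattern.
for _ in range(5):
    z = retrieve_step(z)
assert np.array_equal(z, P[0])
```

Once the state equals a stored pattern, another application of retrieve_step leaves it unchanged, which is the fixed-point behaviour the section describes.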
Viewing this iteration as a dynamical system, the stored patterns correspond to point attractors of the system. See Appendix E for a formal treatment. In this work we employed deterministic dynamics in our experiments to simplify analysis. Alternatively, sampling from $q(w)$ and the predictive distribution would give stochastic dynamics that simulate Markov chain Monte Carlo (MCMC). We leave this direction for future investigation.
4 Experiments
We tested our model on Omniglot [22] and frames from DMLab tasks [6]. Both datasets contain images from a large number of classes, well suited to testing fast-adapting external memory: 1200 different characters in Omniglot, and infinitely many procedurally generated 3D maze environments in DMLab. We treat Omniglot as binary data, while DMLab has larger real-valued colour images. We demonstrate that the same model structure with identical hyperparameters (except for the number of filters, the predictive distribution, and the memory size) can readily handle these different types of data.
To compare with the KM, we followed [31] to prepare the Omniglot dataset, and employed the same convolutional encoder and decoder structure. We trained all models using the Adam optimiser. We used 16 filters in the convnet for Omniglot, and 256 filters for DMLab, with a dataset-specific memory size in each case. We used the Bernoulli likelihood function for Omniglot, and the Gaussian likelihood function for DMLab data. Uniform noise was added to the DMLab data to prevent the Gaussian likelihood from collapsing.
Following [31], we report the lower bound on the conditional log-likelihood, obtained by removing the memory KL-divergence term from the lower bound (eq. 4). This is the negative energy, and we obtained the per-image bound (i.e., the conditional ELBO) by dividing it by the episode size. The test conditional ELBO of our trained Omniglot model was worse than that reported for the KM [31]; however, we show that the DKM generalises much better to unseen long episodes. For the DMLab model we report the test conditional ELBO in bits per pixel. After training, we used the same testing protocol as [31], first computing the posterior distribution of memory (writing) given an episode, and then performing tasks using the memory's posterior mean. For reference, our implementation of the memory module is provided at https://github.com/deepmind/dynamickanervamachines.
Capacity
We investigated memory capacity using the Omniglot dataset, and compared our model with the KM and the DNC. To account for the additional cost of the proposed dynamic addressing, our model in the Omniglot experiments used a significantly smaller number of memory parameters than the DNC, and less than half of that used for the KM in [31]. Moreover, our model does not have additional parametrised structure, such as the memory controllers in the DNC or the amortised addressing module in the KM. As in [31], we trained our model using episodes with 32 patterns randomly sampled from all classes, and tested it using episodes with lengths ranging from 10 to 200, drawn from 2, 4, or 8 classes of characters (i.e., varying the redundancy of the observed data). We report retrieval error as the negative of the conditional ELBO. The results are shown in figure 2 (right), with results for the KM and DNC adapted from [31].
The capacity curves for our model are strikingly flat compared with both the DNC and the KM; we believe this is because the parameter-free addressing (section 3.1) generalises to longer episodes much better than the parametrised addressing modules in the DNC or the KM. The errors are larger than the KM's for small numbers of patterns (approximately fewer than 60), possibly because the KM overfits to shorter episodes that were more similar to training episodes.
Attractor Dynamics: Denoising and Sampling
We next verified the attractor dynamics through denoising and sampling tasks. These tasks demonstrate how low-quality patterns, arising either from noise corruption or from imperfect priors, can be corrected using the attractor dynamics.
Figure 3 (a) and Figure 4 (a) show the results of denoising. We added salt-and-pepper noise to Omniglot images by randomly flipping a fraction of the bits, and independent Gaussian noise to all pixels in DMLab images. Such noise was never presented during training. We ran the attractor dynamics (section 3.5) for 15 iterations starting from the noise-corrupted images. Despite the significant corruption of the images by different types of noise, image quality improved steadily for both datasets. Interestingly, the denoised Omniglot patterns are even cleaner and smoother than the original patterns. The trajectories of the energy during denoising for 20 examples (including those we plotted as images) are shown in Figure 3 (c) and Figure 4 (c), demonstrating that the system states were attracted to points with lower energy.
Sampling from the model's prior distribution provides another application of the attractor dynamics. Generative models trained with stochastic variational inference usually suffer from low sample quality, because the asymmetric KL-divergence they minimise usually results in priors broader than the posterior that is used to train the decoders. While different approaches exist to improve sample quality, including more elaborate posteriors [26] and different training objectives [13], our model addresses this problem by moving samples to regions with higher likelihood via the attractor dynamics. As illustrated in Figure 3 (c) and Figure 4 (c), the initial samples have relatively low quality, but they improved steadily over iterations. This improvement is correlated with the decrease of energy. We do observe fluctuations in energy in all experiments, especially for DMLab. This may be caused by saddle points, which are more common in larger models [9]. While the presence of saddle points violates our assumption of local minima (section 3.5), our model still worked well and the energy generally dropped after temporarily rising.
5 Discussion
Here we have presented a novel approach to robust attractor dynamics inside a generative distributed memory. Other than the neural network encoder and decoder, our model has only a small number of statistically well-defined parameters. Despite its simplicity, we have demonstrated its high capacity by efficiently compressing episodes online, and have shown its robustness in retrieving patterns corrupted by unseen noise.
Our model can trade increased computation for higher-precision retrieval by running the attractor dynamics for more iterations. The idea of using attractors for memory retrieval and clean-up dates back to Hopfield nets [15] and Kanerva's sparse distributed memory [18]. Zemel and Mozer [32] proposed a generative model for memory that pioneered the use of variational free energy to construct attractors for memory. By restricting themselves to a localist representation, their model is easy to train without backpropagation, though this choice constrains its capacity. On the other hand, Boltzmann machines [1] are high-capacity generative models with distributed representations that obey stochastic attractor dynamics. However, writing memories into the weights of Boltzmann machines is typically slow and difficult. In comparison, the DKM trains quickly via a low-variance gradient estimator and allows fast memory writing as inference.
As a principled probabilistic model, the linear Gaussian memory of the DKM can be seen as a special case of the Kalman filter (KF) [17] without the drift-diffusion dynamics of the latent state. This more stable structure captures the statistics of entire episodes during sequential updates with minimal interference. The idea of using the latent state of the KF as memory is closely related to the heteroassociative novelty filter suggested in [10]. The DKM can also be contrasted with recently proposed nonlinear generalisations of the KF, such as [21], in that we preserve the higher-level linearity for efficient analytic inference over a very large latent state. By combining deep neural networks and variational inference, our model can store associations between a large number of patterns, and generalise to large-scale non-Gaussian datasets.

References

 (1) David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for Boltzmann machines. In Readings in Computer Vision, pages 522–533. Elsevier, 1987.
 (2) David J Aldous. Exchangeability and related topics. In École d'Été de Probabilités de Saint-Flour XIII—1983, pages 1–198. Springer, 1985.
 (3) Daniel J Amit. Modeling brain function: The world of attractor neural networks. Cambridge university press, 1992.
 (4) John Robert Anderson. Learning and memory: An integrated approach. John Wiley & Sons Inc, 2000.
 (5) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
 (6) Charles Beattie, Joel Z Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, et al. Deepmind lab. arXiv preprint arXiv:1612.03801, 2016.
 (7) Jonathan D Cohen, William M Perlstein, Todd S Braver, Leigh E Nystrom, Douglas C Noll, John Jonides, and Edward E Smith. Temporal dynamics of brain activation during a working memory task. Nature, 386(6625):604, 1997.
 (8) John Conklin and Chris Eliasmith. A controlled attractor network model of path integration in the rat. Journal of computational neuroscience, 18(2):183–203, 2005.
 (9) Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in highdimensional nonconvex optimization. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2933–2941. Curran Associates, Inc., 2014.
 (10) Peter Dayan and Sham Kakade. Explaining away in weight space. In Advances in neural information processing systems, pages 451–457, 2001.
 (11) Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society. Series B (methodological), pages 1–38, 1977.
 (12) Surya Ganguli, Dongsung Huh, and Haim Sompolinsky. Memory traces in dynamical systems. Proceedings of the National Academy of Sciences, 105(48):18970–18975, 2008.
 (13) Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
 (14) Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka GrabskaBarwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471, 2016.
 (15) John J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8):2554–2558, 1982.
 (16) Auke Jan Ijspeert, Jun Nakanishi, Heiko Hoffmann, Peter Pastor, and Stefan Schaal. Dynamical movement primitives: learning attractor models for motor behaviors. Neural computation, 25(2):328–373, 2013.
 (17) Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering, 82(Series D):35–45, 1960.
 (18) Pentti Kanerva. Sparse distributed memory. MIT press, 1988.
 (19) Diederik P Kingma and Max Welling. Autoencoding variational bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2013.
 (20) Konrad P Kording, Joshua B Tenenbaum, and Reza Shadmehr. The dynamics of memory as a consequence of optimal adaptation to a changing body. Nature neuroscience, 10(6):779, 2007.
 (21) Rahul G Krishnan, Uri Shalit, and David Sontag. Deep kalman filters. arXiv preprint arXiv:1511.05121, 2015.
 (22) Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Humanlevel concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.

 (23) Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.
 (24) Barak A Pearlmutter. Learning state space trajectories in recurrent neural networks. Neural Computation, 1(2):263–269, 1989.
 (25) Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adrià Puigdomènech, Oriol Vinyals, Demis Hassabis, Daan Wierstra, and Charles Blundell. Neural episodic control. arXiv preprint arXiv:1703.01988, 2017.
 (26) Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
 (27) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In The 31st International Conference on Machine Learning (ICML), 2014.
 (28) Alexei Samsonovich and Bruce L McNaughton. Path integration and cognitive mapping in a continuous attractor neural network model. Journal of Neuroscience, 17(15):5900–5920, 1997.
 (29) Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Oneshot learning with memoryaugmented neural networks. arXiv preprint arXiv:1605.06065, 2016.
 (30) Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.
 (31) Yan Wu, Greg Wayne, Alex Graves, and Timothy Lillicrap. The kanerva machine: A generative distributed memory. In International Conference on Learning Representations, 2018.
 (32) Richard S Zemel and Michael C Mozer. A generative model for attractor dynamics. In Advances in neural information processing systems, pages 80–88, 2000.
Appendix
Appendix A The Bayesian update rule for the Kanerva Machine
Here we reproduce the exact Bayesian update rule used in the Kanerva Machine.
$$ p(z \,|\, w, M) = \mathcal{N}\big(z;\; M^\top w,\; \sigma_\xi^2 I\big) \qquad (15) $$

A memory with mean $R$, row covariance $U$ and observational noise variance $\sigma_\xi^2$ is updated given a newly observed sample $z$ and its addressing weight $w$ by:

$$ \Delta = z - R^\top w \qquad (16) $$

$$ \sigma_z^2 = w^\top U w + \sigma_\xi^2 \qquad (17) $$

$$ b = U w / \sigma_z^2 \qquad (18) $$

$$ R \leftarrow R + b\, \Delta^\top \qquad (19) $$

$$ U \leftarrow U - b\, w^\top U \qquad (20) $$
Note that the new covariance along the addressed direction would collapse to zero if the noise variance $\sigma_\xi^2 \to 0$.
The graphical model assumes an observational noise $\xi$, which results in the read-out distribution $p(z \,|\, w, M) = \mathcal{N}\big(M^\top w, \sigma_\xi^2 I\big)$ given the addressing weights (eq. 1 in matrix notation). We ignore the observation noise when reading the memory and directly take $z = M^\top w$. This simplification reduces variance in training and is justified by the fact that the fixed observation noise does not convey any information. A strong-enough decoder will learn to remove such noise through training.
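Two properties of this update rule can be checked numerically: the posterior read-out moves toward the observed sample, and the variance along the addressed direction shrinks by exactly $\sigma_\xi^2 / (w^\top U w + \sigma_\xi^2)$, collapsing to zero as $\sigma_\xi^2 \to 0$. A sanity-check sketch (the dimensions below are illustrative):

```python
import numpy as np

def km_update(R, U, w, z, sigma2):
    """One Bayesian memory update (the rule reproduced in this appendix)."""
    delta = z - R.T @ w               # prediction error
    var_z = w @ U @ w + sigma2        # scalar read-out variance
    gain = U @ w / var_z              # Kalman-style gain
    return R + np.outer(gain, delta), U - np.outer(gain, U @ w)

rng = np.random.default_rng(4)
K, C = 8, 16
R, U = rng.normal(size=(K, C)), np.eye(K)
w, z = rng.normal(size=K), rng.normal(size=C)

R1, U1 = km_update(R, U, w, z, sigma2=1.0)
# The posterior read-out moves toward z ...
assert np.linalg.norm(R1.T @ w - z) < np.linalg.norm(R.T @ w - z)
# ... and the variance along w shrinks by sigma^2 / (w^T U w + sigma^2).
assert np.isclose(w @ U1 @ w, (w @ U @ w) / (w @ U @ w + 1.0))
# With sigma^2 = 0 that variance collapses exactly to zero.
_, U0 = km_update(R, U, w, z, sigma2=0.0)
assert np.isclose(w @ U0 @ w, 0.0)
```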
Appendix B Parameters and Initialisation
Here we enumerate parameters of the model and their initialisations. In our experiments, the memory parameters are insensitive to initial values.
Parameter | Description | Initial value
$R$ | memory prior mean matrix |
$U$ | memory prior covariance matrix |
$\sigma_w^2$ | addressing-weight posterior variance | 0.3
$\sigma_\xi^2$ | memory observation-noise variance | 1.0
encoder/decoder weights | neural network weights of encoder and decoder | Glorot initialisation
Appendix C Sequential Variational Inference for Memory
The log-likelihood for any $t \leq T$ can be decomposed as a sum of a variational lower bound and KL-divergences as:

$$ \ln p(x_{\leq t}) = \mathcal{L}_t + \mathrm{D_{KL}}\big(q(M)\,\|\,p(M \,|\, x_{\leq t})\big) + \sum_{\tau=1}^{t} \mathbb{E}_{q(M)}\Big[\mathrm{D_{KL}}\big(q(w_\tau)\,\|\,p(w_\tau \,|\, x_\tau, M)\big)\Big] \qquad (21) $$

$$ \mathcal{L}_t = \sum_{\tau=1}^{t} \mathbb{E}_{q(M)\,q(w_\tau)}\big[\ln p(x_\tau \,|\, w_\tau, M)\big] - \sum_{\tau=1}^{t} \mathrm{D_{KL}}\big(q(w_\tau)\,\|\,p(w_\tau)\big) - \mathrm{D_{KL}}\big(q(M)\,\|\,p(M)\big) \qquad (22) $$
However, as we noted in the main text, it is hard to maximise $\mathcal{L}_t$ directly, since we cannot compute the posterior $p(M \,|\, x_{\leq t})$ directly.
To derive a sequential update rule for the memory, we consider updating the memory at step $t$ of an episode. This assumes the memory $q(M_{t-1})$ from the previous update is given, so that we can decompose $\ln p(x_{\leq t})$ conditioned on $M_{t-1}$:

$$ \ln p(x_{\leq t}) \;\geq\; \ln p(x_{<t}) + \mathbb{E}_{q(M_{t-1})}\big[\ln p(x_t \,|\, M_{t-1})\big] - \mathrm{D_{KL}}\big(q(M_{t-1})\,\|\,p(M \,|\, x_{<t})\big) \qquad (23) $$

where we have a likelihood lower bound, which has the same form as $\mathcal{L}_t$. This lower bound is tight when $q(M_{t-1}) = p(M \,|\, x_{\leq t})$. This suggests, ideally, that the memory at step $t-1$ needs to be predictive of the next observation $x_t$, in addition to accumulating information from the $x_{<t}$ that are already used in computing $q(M_{t-1})$.
As illustrated in Figure 5, we assume a deterministic transition from M_{t−1} to M_t, so the prior over M_t simplifies to the previous posterior q(M_{t−1}). We can then recursively expand the likelihood term in eq. 23, similar to eq. 21 and eq. 22 (omitting the expectation over q(M_{t−1})):
(24) 
(25) 
The above is easier to maximise, since it only depends on q(M_t) and on q(M_{t−1}), which we assume we know. We can minimise the gap between the lower-bound and the log-likelihood by minimising the memory KL-divergence using the Bayes' update rule (Appendix A), and minimising the addressing KL-divergence using dynamic addressing (Section 3.1).
We can tighten L_t by allowing further updating iterations, as shown by the optional step in Algorithm 1. This is likely to give a tighter lower-bound, since the KL-divergence generally decreases after incorporating information from x_t. This process can be repeated until the KL-divergence reaches its minimum.
From the above equations, we have the inequality
(26) 
Therefore, we can maximise ln p(X) by maximising the lower-bound L_t. Naively, in eq. 24, all the KL terms up to step t need to be minimised at step t. This would result in O(t) cost in inferring both the addressing weights and the memory. To reduce the computational cost, we keep the previously inferred weights fixed, and only infer the weights for the current step, resulting in only O(1) cost. The trade-off is a looser lower-bound and therefore a looser L, since some of the KL terms for earlier steps may not be minimised.
Once q(M_t) is computed, we can compute the marginal for the next step as:
(27)  q(M_{t+1}) = ∫ p(M_{t+1} | M_t, x_{t+1}) q(M_t) dM_t
Memory updating is non-linear, so the integral is not analytically tractable. A simple approximation is to use the mode of q(M_t), which is its mean R_t. At this point, we carry forward the approximation of q(M_t) to the next step, where it can be used for the step-(t+1) update. This procedure can start from t = 1 and continue until t = T; we thus obtain an approximate memory posterior by maximising the lower-bound at each step. Thus, the sequential update of memory, as summarised in Algorithm 1, maximises a lower-bound of the episode log-likelihood.
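The sequential procedure can be sketched end to end. The following is our simplified, single-pass rendering of the write phase of Algorithm 1 (one addressing solve per step, carrying forward the memory mean as the mode approximation); variable names and the ridge form of the addressing solve are ours:

```python
import numpy as np

def write_episode(codes, R, U, noise_var=1.0, w_prior_var=0.3):
    """Single-pass sequential memory write, approximating q(M_t) by its
    mode (the mean R) at every step, as described above.

    codes: (T, C) array of encoded observations z_1..z_T.
    R: (K, C) memory mean, U: (K, K) row covariance.
    """
    K = R.shape[0]
    for z in codes:
        # dynamic addressing: posterior mean of w given the current memory
        # mean, a ridge-regularised least-squares solve (see Appendix D)
        A = R @ R.T / noise_var + np.eye(K) / w_prior_var
        w = np.linalg.solve(A, R @ z / noise_var)
        # online Bayesian update of the memory (Appendix A)
        delta = z - R.T @ w
        sigma_c = U @ w
        sigma_z = w @ U @ w + noise_var
        R = R + np.outer(sigma_c, delta) / sigma_z
        U = U - np.outer(sigma_c, sigma_c) / sigma_z
    return R, U
```

Each step shrinks the memory uncertainty U while moving the mean R toward explaining the newly written code.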
Appendix D The Least-Squares Problem in Inference and Prediction
This section shows that the same least-squares problem is involved in both of the following:

minimising the KL-divergence between the approximate and exact posteriors over the addressing weights during inference (Section 3.1),

approximating the predictive distribution (Section 3.3).
We first rewrite the KL-divergence using its definition:
(28)  KL[q(w) ‖ p(w | z, M)] = −H[q(w)] − E_q(w)[ln p(w | z, M)]
where the first entropy term H[q(w)] is a constant that depends only on the fixed variance σ_w^2. Therefore, minimising this KL-divergence is equivalent to maximising E_q(w)[ln p(w | z, M)].
This posterior distribution over w can be expanded using Bayes' rule:
(29) 
We omitted terms that do not depend on w, including various normalising constants. In addition, the last line used the encoding projection to transform the distribution over x into one over z. When the encoder is invertible, the Jacobian factor resulting from this transformation is well-defined and can be omitted, since it does not depend on w. However, the assumption of bijectivity is unlikely to be strictly satisfied by the neural network encoder/decoder pair, so the relation is approximate.
Taking the expectation of the above quadratic form over the Gaussian distribution q(w) results in the same quadratic form:
(30)
where the last two terms do not depend on w. Therefore, both inference and prediction involve solving the same least-squares problem.
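Under the Gaussian assumptions, this shared least-squares problem has a closed-form ridge solution. The sketch below uses our own variable names; the point is that the identical solve serves both inference (query = encoded observation) and prediction (query = a possibly corrupted cue):

```python
import numpy as np

def solve_weights(R, z, noise_var=1.0, w_prior_var=0.3):
    """Posterior mean of the addressing weights w: the minimiser of
    ||z - R^T w||^2 / noise_var + ||w||^2 / w_prior_var,
    i.e. a ridge-regularised least-squares solution.

    R: (K, C) memory mean, z: (C,) query code.
    """
    K = R.shape[0]
    A = R @ R.T / noise_var + np.eye(K) / w_prior_var
    return np.linalg.solve(A, R @ z / noise_var)
```

The prior term 1/w_prior_var shrinks the solution toward zero, so the inferred weights are slightly conservative relative to the exact (unregularised) least-squares fit.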
Appendix E Proof of Attractor Dynamics
Here we show that in a well-trained model, a pattern x* stored in the memory is asymptotically stable under the dynamics, so that a state near x* will converge to it. By "a well-trained model", we assume that the pattern x* is a local maximum of the ELBO (eq. 4 in the main text):
(31) 
When the ELBO is at its maximum, the energy we defined in eq. 14 (copied below) is at a local minimum. This follows since the negative energy is just the first two terms of the ELBO, without the KL-divergence term, which is a constant when the memory is fixed.
(32) 
Section 3.5 of the main text shows that the energy is non-increasing under the predictive dynamics. Therefore, denoting the stored pattern by x*, we can construct a Lyapunov function candidate as:
(33)  V(x) = E(x) − E(x*)
which satisfies:
(34)  V(x*) = 0
(35)  V(x) > 0 for all x ≠ x* in a neighbourhood of x*
(36)  dV/dt = dE/dt ≤ 0, with equality only at x*
Therefore, according to Lyapunov stability theory, the stored pattern is asymptotically stable and serves as a point attractor of the system.
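The descent property underlying the Lyapunov argument can be illustrated numerically with a toy quadratic surrogate of the energy. This is our own construction, not the learned model's dynamics: alternating the weight inference and the read-out is coordinate descent, so the energy trace is non-increasing, mirroring the condition in eq. 36.

```python
import numpy as np

def energy(z, w, R, noise_var=1.0, w_prior_var=0.3):
    """Toy quadratic surrogate energy: read-out error plus weight prior."""
    return (np.sum((z - R.T @ w) ** 2) / (2 * noise_var)
            + np.sum(w ** 2) / (2 * w_prior_var))

def retrieve(z0, R, n_steps=10, noise_var=1.0, w_prior_var=0.3):
    """Iterative retrieval by coordinate descent on the energy: alternately
    re-infer the addressing weights and re-read the memory. Returns the
    final state and the (non-increasing) energy trace."""
    K = R.shape[0]
    z, trace = z0.copy(), []
    for _ in range(n_steps):
        A = R @ R.T / noise_var + np.eye(K) / w_prior_var
        w = np.linalg.solve(A, R @ z / noise_var)  # optimal w for current z
        trace.append(energy(z, w, R, noise_var, w_prior_var))
        z = R.T @ w                                # optimal z for current w
        trace.append(energy(z, w, R, noise_var, w_prior_var))
    return z, trace
```

Because each half-step exactly minimises the energy in one block of variables, the trace can only decrease or stay constant, which is the self-stabilising behaviour a point attractor requires.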
Appendix F Other Practical Considerations
For readers interested in applying the DKM, a few variants of Algorithm 1 may be worth considering. First, instead of using the bound in eq. 22 (eq. 3 in the main text) as the objective, an alternative objective is the lower-bound at the final step of the episode. Although this lower-bound tends to be less tight, it is also cheaper to compute. It may be particularly useful in online settings, since we only need to run through an episode once to compute it. This bound can be further tightened by: 1. using the optional step in Algorithm 1 (we recommend starting with 2 or 3 steps); 2. minimising a few other intermediate lower-bounds for earlier steps, which may be helpful in the case of long episodes wherein gradient propagation through the entire episode is infeasible.