Kanerva++: Extending the Kanerva Machine with Differentiable, Locally Block Allocated Latent Memory

02/20/2021 ∙ by Jason Ramapuram, et al. ∙ Google ∙ Apple Inc.

Episodic and semantic memory are critical components of the human memory model. The theory of complementary learning systems (McClelland et al., 1995) suggests that the compressed representation produced by a serial event (episodic memory) is later restructured to build a more generalized form of reusable knowledge (semantic memory). In this work we develop a new principled Bayesian memory allocation scheme that bridges the gap between episodic and semantic memory via a hierarchical latent variable model. We take inspiration from traditional heap allocation and extend the idea of locally contiguous memory to the Kanerva Machine, enabling a novel differentiable block allocated latent memory. In contrast to the Kanerva Machine, we simplify the process of memory writing by treating it as a fully feed-forward deterministic process, relying on the stochasticity of the read key distribution to disperse information within the memory. We demonstrate that this allocation scheme improves performance in memory conditional image generation, resulting in new state-of-the-art conditional likelihood values on binarized MNIST (≤41.58 nats/image) and binarized Omniglot (≤66.24 nats/image), as well as competitive performance on CIFAR10, DMLab Mazes, Celeb-A and ImageNet32x32.


1 Introduction

Memory is a central tenet in the model of human intelligence and is crucial to long-term reasoning and planning. Of particular interest is the theory of complementary learning systems McClelland et al. (1995), which proposes that the brain employs two complementary systems to support the acquisition of complex behaviours: a hippocampal fast-learning system that records events as episodic memory, and a neocortical slow-learning system that learns statistics across events as semantic memory. While the functional dichotomy of the complementary systems is well-established McClelland et al. (1995); Kumaran et al. (2016), it remains unclear whether they are bounded by different computational principles. In this work we introduce a model that bridges this gap by showing that the same statistical learning principle can be applied to the fast-learning system through the construction of a hierarchical Bayesian memory.

Figure 1: Example final state of a traditional heap allocator (Marlow et al., 2008) (Left) vs. K++ (Right); each final state is created by the sequential operations listed on the left. K++ uses a key distribution to stochastically point to a memory sub-region, while Marlow et al. (2008) uses a direct pointer. Traditional heap allocated memory affords O(1) free / malloc computational complexity and serves as inspiration for K++, which uses differentiable neural proxies.

While recent work has shown that memory augmented neural networks can drastically improve the performance of generative models (Wu et al., 2018a, b), language models (Weston et al., 2015), meta-learning (Santoro et al., 2016), long-term planning (Graves et al., 2014, 2016) and sample efficiency in reinforcement learning (Zhu et al., 2019), no model has been proposed to exploit the inherent multi-dimensionality of biological memory (Reimann et al., 2017). Inspired by the traditional (computer-science) memory model of heap allocation (Figure 1-Left), we propose a novel differentiable memory allocation scheme called Kanerva++ (K++) that learns to compress an episode of samples, referred to by the set of pointers in Figure 1, into a latent multi-dimensional memory (Figure 1-Right). The K++ model infers a key distribution as a proxy to the pointers (Marlow et al., 2008) and is able to embed similar samples into an overlapping latent representation space, enabling it to compress input distributions more efficiently. In this work, we focus on applying this novel memory allocation scheme to latent variable generative models, where we improve upon the memory model of the Kanerva Machine (Wu et al., 2018a, b).

2 Related Work

Variational Autoencoders: Variational autoencoders (VAEs) (Kingma and Welling, 2014) are a fundamental part of the modern machine learning toolbox and have wide-ranging applications, from generative modeling (Kingma and Welling, 2014; Kingma et al., 2016; Burda et al., 2016) and learning graphs (Kipf and Welling, 2016) to medical applications (Sedai et al., 2017; Zhao et al., 2019) and video analysis (Fan et al., 2020). As a latent variable model, VAEs infer an approximate posterior over a latent representation and can be used in downstream tasks such as control in reinforcement learning (Nair et al., 2018; Pritzel et al., 2017). VAEs maximize an evidence lower bound (ELBO), $\mathcal{L}_{\text{ELBO}}(x)$, on the log-marginal likelihood, $\log p_\theta(x)$. The produced variational approximation, $q_\phi(z \mid x)$, is typically called the encoder, while $p_\theta(x \mid z)$ comes from the decoder. Methods that aim to improve these latent variable generative models typically fall into two different paradigms: learning more informative priors or leveraging novel decoders. While improved decoder models such as PixelVAE (Gulrajani et al., 2017) and PixelVAE++ (Sadeghi et al., 2019) drastically improve the performance of the marginal likelihood, they suffer from a phenomenon called posterior collapse (Lucas et al., 2019), where the decoder can become almost independent of the posterior sample but still retains the ability to reconstruct the original sample by relying on its auto-regressive property (Goyal et al., 2017a).
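For reference, the bound referred to above takes the standard single-sample VAE form (a textbook statement included for completeness, not quoted from this paper):

```latex
\log p_\theta(x) \;\geq\; \mathcal{L}_{\text{ELBO}}(x)
  = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  - D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
```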

In contrast, VampPrior (Tomczak and Welling, 2018), Associative Compression Networks (ACN) (Graves et al., 2018), VAE-nCRP (Goyal et al., 2017b) and VLAE (Chen et al., 2017) tighten the variational bound by learning more informed priors. VLAE, for example, uses a powerful auto-regressive prior; VAE-nCRP learns a non-parametric Chinese restaurant process prior; and VampPrior learns a Gaussian mixture prior representing prototypical virtual samples. ACN, on the other hand, takes a two-stage approach: it clusters real samples in the space of the posterior and uses these related samples as inputs to a learned prior, providing an information-theoretic alternative to improved code transmission. Our work falls into this latter paradigm: we parameterize a learned prior by reading from a common memory, built through a transformation of an episode of input samples.

Memory Models: Inspired by the associative nature of biological memory, the Hopfield network (Hopfield, 1982) introduced the notion of content-addressable memory, defined by a set of binary neurons coupled with a Hamiltonian and a dynamical update rule. Iterating the update rule minimizes the Hamiltonian, resulting in patterns being stored at different configurations (Hopfield, 1982; Krotov and Hopfield, 2016). Writing in a Hopfield network thus corresponds to finding weight configurations such that stored patterns become attractors via Hebbian rules (Hebb, 1949). This concept of memory was extended to a distributed, continuous setting by Kanerva (1988) and to a complex-valued, holographic convolutional binding mechanism by Plate (1995). The central difference between associative memory models (Hopfield, 1982; Kanerva, 1988) and holographic memory (Plate, 1995) is that the latter decouples the size of the memory from the input word size.
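To make the classical associative-memory picture concrete, the sketch below implements Hebbian storage and the asynchronous update rule of a binary Hopfield network; it is a minimal textbook illustration, not code from this paper or the cited works.

```python
import numpy as np

def hopfield_store(patterns):
    """Hebbian outer-product rule: W = (1/P) * sum_p x_p x_p^T, with zero diagonal."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for x in patterns:                      # each pattern is a +/-1 vector
        W += np.outer(x, x)
    np.fill_diagonal(W, 0.0)
    return W / len(patterns)

def hopfield_recall(W, probe, steps=20):
    """Asynchronous updates descend the energy E(s) = -0.5 * s^T W s."""
    s = probe.copy()
    for _ in range(steps):
        for i in np.random.permutation(len(s)):
            s[i] = 1.0 if W[i] @ s >= 0 else -1.0
    return s

rng = np.random.default_rng(0)
patterns = rng.choice([-1.0, 1.0], size=(2, 64))                      # store two patterns
W = hopfield_store(patterns)
noisy = patterns[0] * rng.choice([1.0, -1.0], p=[0.9, 0.1], size=64)  # flip roughly 10% of bits
recovered = hopfield_recall(W, noisy)
print("bits recovered:", int((recovered == patterns[0]).sum()), "/ 64")
```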

Most recent work with memory augmented neural networks treats memory in a slot-based manner (closer to the associative memory paradigm), where each column of a memory matrix, $M$, represents a single slot. Reading memory traces, $z$, entails using a vector of addressing weights, $w$, to attend to the appropriate columns of $M$: $z = Mw$. This paradigm of memory includes models such as the Neural Turing Machine (NTM) (Graves et al., 2014), the Differentiable Neural Computer (DNC) (Graves et al., 2016) (while DNC is slot based, it should be noted that it reads rows rather than columns), Memory Networks (Weston et al., 2015), Generative Temporal Models with Memory (GTMM) (Fraccaro et al., 2018), the Variational Memory Encoder-Decoder (VMED) (Le et al., 2018), and Variational Memory Addressing (VMA) (Bornschein et al., 2017). VMA differs from GTMM, VMED, DNC, NTM and Memory Networks by taking a stochastic approach to discrete key-addressing, instead of the deterministic approach of the latter models.
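A minimal sketch of the slot-based read described above (our own illustration, not the implementation of any cited model): content-based addressing produces a weight vector over slots, and the read trace is the corresponding weighted combination of memory columns.

```python
import torch
import torch.nn.functional as F

def slot_read(M, query, beta=1.0):
    """M: (feature_dim, num_slots) memory matrix; query: (feature_dim,) read key.
    Returns z = M @ w, where w is a softmax over slot/key similarities."""
    sim = F.cosine_similarity(M.t(), query.unsqueeze(0), dim=-1)  # (num_slots,)
    w = torch.softmax(beta * sim, dim=0)                          # addressing weights
    return M @ w                                                  # read trace, (feature_dim,)

M = torch.randn(32, 8)                    # 8 slots, each a 32-dimensional trace
z = slot_read(M, query=torch.randn(32))
```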

Recently, the Kanerva Machine (KM) (Wu et al., 2018a) and its extension, the Dynamic Kanerva Machine (DKM) (Wu et al., 2018b), interpreted memory writes and reads as inference in a generative model, wherein memory is treated as a distribution, $p(M)$. Under this framework, memory reads and writes are recast as sampling from or updating the memory posterior. The DKM model differs from the KM model by introducing a dynamical addressing rule that can be used throughout training. While providing an intuitive and theoretically sound bound on the data likelihood, the DKM model requires an inner optimization loop which entails solving an ordinary least squares (OLS) problem. Typical OLS solutions require a (cubic-cost) matrix inversion, preventing the model from scaling to large memory sizes. More recent work has focused on employing a product of smaller Kanerva memories (Marblestone et al., 2020) in an effort to minimize the computational cost of the matrix inversion. In contrast, we propose a simplified view of memory creation by treating memory writes as a deterministic process in a fully feed-forward setting. Crucially, we also modify the read operation such that it uses localized sub-regions of the memory, providing an extra dimension of operation in comparison with the KM and DKM models. While the removal of memory stochasticity might be interpreted as reducing the representational power of the model, we empirically demonstrate through our experiments that our model performs better, trains more quickly and is simpler to optimize. The choice of a deterministic memory is further reinforced by research in psychology, where human visual memory has been shown to change deterministically (Gold et al., 2005; Spencer and Hund, 2002; Hollingworth et al., 2013).

3 Model

To better understand the K++ model, we examine each of its components and their roles within the complete generative model. We begin by stating the preliminaries and the optimization objective (Section 3.1), then derive a conditional variational lower bound and its probabilistic assumptions (Section 3.2). We then describe the write operation (Section 3.3), the generative process (Section 3.4) and finally the read and iterative read operations (Section 3.5).

3.1 Preliminaries

K++ operates over an exchangeable episode (Aldous, 1985) of samples, $X = \{x_1, \ldots, x_T\}$, drawn from a dataset, $\mathcal{D}$, as in the Kanerva Machine. Therefore, the ordering of the samples within the episode does not matter. This enables factoring the conditional, $p_\theta(X \mid M)$, over the individual samples, $p_\theta(X \mid M) = \prod_{t=1}^{T} p_\theta(x_t \mid M)$, given the memory, $M$. Our objective in this work is to maximize the expected conditional log-likelihood, as in (Bornschein et al., 2017; Wu et al., 2018a):

$\operatorname*{arg\,max}_{\theta} \; \mathbb{E}_{X \sim \mathcal{D}}\big[\log p_\theta(X \mid M)\big] = \operatorname*{arg\,max}_{\theta} \; \mathbb{E}_{X \sim \mathcal{D}}\Big[\textstyle\sum_{t=1}^{T} \log p_\theta(x_t \mid M)\Big] \qquad (1)$

As alluded to in Barber and Agakov (2004) and Wu et al. (2018a), this objective can be interpreted as maximizing the mutual information, $I(X; M)$, between the memory, $M$, and the episode, $X$, since $I(X; M) = H(X) - H(X \mid M)$ and the entropy of the data, $H(X)$, is constant. In order to actualize Equation 1 we rely on a variational bound, which we derive in the following section.

3.2 Variational Lower Bound

To efficiently read from the memory, $M$, we introduce a set of latent variables corresponding to the addressing read heads, $Y = \{y_1, \ldots, y_T\}$, and a set of latent variables corresponding to the readout from the memory, $Z = \{z_1, \ldots, z_T\}$. Given these latent variables, we can decompose the conditional, $p_\theta(X \mid M)$, using the product rule and introduce variational approximations, $q_\phi(Z \mid X)$ (we use $q_\phi(Z \mid X)$ as our variational approximation, rather than the memory-conditioned readout distribution used in DKM, as it presents a more stable objective; we discuss this in more detail in Section 3.5) and $q_\phi(Y \mid X)$, via a multiply-by-one trick:

(2)
(3)

Equation 2 assumes that $Y$ is independent of $M$: $p(Y \mid M) = p(Y)$. This decomposition results in Equation 3, which includes two KL divergences against the true (unknown) posteriors, $p(Y \mid X, M)$ and $p(Z \mid X, M)$. We can then train the model by maximizing the evidence lower bound (ELBO), $\mathcal{L}(X, M)$, on the true conditional, $\log p_\theta(X \mid M)$:

$\log p_\theta(X \mid M) \;\geq\; \mathcal{L}(X, M) = \mathbb{E}_{q_\phi(Z \mid X)}\big[\log p_\theta(X \mid Z)\big] - \mathbb{E}_{q_\phi(Y \mid X)}\Big[D_{KL}\big(q_\phi(Z \mid X) \,\|\, p_\theta(Z \mid Y, M)\big)\Big] - D_{KL}\big(q_\phi(Y \mid X) \,\|\, p(Y)\big) \qquad (4)$

The bound in Equation 4 is tight if $q_\phi(Y \mid X) = p(Y \mid X, M)$ and $q_\phi(Z \mid X) = p(Z \mid X, M)$; however, it involves inferring the entire memory, $M$. This prevents us from decoupling the size of the memory from inference and scales the computational complexity with the size of the memory. To alleviate this constraint, we assume a purely deterministic memory, $M$, built by transforming the input episode, $X$, via a deterministic encoder and memory transformation model. We also assume that the regions of memory which are useful in reconstructing a sample, $x_t$, can be summarized by a set of localized, contiguous memory sub-blocks, as described in Equation 5 below. The intuition here is that similar samples might occupy a disjoint part of the representation space and the decoder would need to read multiple regions to properly handle sample reconstruction. For example, the digit “3” might share part of the representation space with a “2” and another part with a “5”.

$p_\theta(Z \mid Y, M) \;\approx\; \hat{Z} = \big\{\mathrm{ST}\big(M, y^{(k)}\big)\big\}_{k=1}^{K} \qquad (5)$

$\hat{Z}$ in Equation 5 represents a set of Dirac-delta memory sub-regions, determined by the addressing keys, $Y$, and a spatial transformer (ST) network (Jaderberg et al., 2015); we provide a brief review of spatial transformers in Appendix A. Our final optimization objective is attained by approximating $p_\theta(Z \mid Y, M)$ from Equation 4 with $\hat{Z}$ (Equation 5) and is summarized by the graphical model in Figure 2 below.
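To make the resulting objective concrete, the sketch below assembles a single-sample estimate of the bound in Equation 4 under the approximation of Equation 5. All module names (encoder, memory_writer, key_posterior, readout_posterior, readout_prior, decoder, st_read) are placeholders of our own, the distributions are simplified to diagonal Gaussians and Bernoullis, and the key prior is assumed to be a standard Normal; this is a sketch of the structure, not the paper's implementation.

```python
import torch
import torch.distributions as D

def kpp_negative_elbo(x_episode, encoder, memory_writer, key_posterior,
                      readout_posterior, readout_prior, decoder, st_read):
    """x_episode: (T, C, H, W) binarized images. Returns a scalar negative-ELBO estimate."""
    e = encoder(x_episode)                           # (T, E) per-sample embeddings
    M = memory_writer(e.mean(dim=0, keepdim=True))   # deterministic memory from the pooled embedding

    q_y = key_posterior(e)                           # Normal over keys, batch shape (T, key_dim)
    y = q_y.rsample()                                # reparameterized keys
    z_hat = st_read(M, y)                            # localized memory sub-blocks (Eq. 5)

    q_z = readout_posterior(e)                       # amortized Normal q(Z|X)
    p_z = readout_prior(z_hat)                       # learned prior p(Z|Y,M) from memory traces
    z = q_z.rsample()

    logits = decoder(z)                              # Bernoulli logits with the shape of x_episode
    log_px = D.Bernoulli(logits=logits).log_prob(x_episode).sum(dim=[1, 2, 3])

    kl_z = D.kl_divergence(q_z, p_z).sum(dim=-1)                   # memory read-out KL
    kl_y = D.kl_divergence(q_y, D.Normal(0.0, 1.0)).sum(dim=-1)    # key KL against N(0, I)
    return -(log_px - kl_z - kl_y).mean()
```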

Figure 2: (a): Generative model (§3.4). (b): Read inference model (§3.5). (c): Iterative read inference model (§3.5). (d): Write inference model (§3.3). Dashed lines represent approximate inference, while solid lines represent computation of a conditional distribution. The double-sided arrow in (c) represents the KL divergence between the approximate posterior and the learned prior from Equation 4. Squares represent deterministic nodes. Standard plate notation is used to depict repeated operations.

3.3 Write Model

sample episode: $X = \{x_1, \ldots, x_T\} \sim \mathcal{D}$;
compute embedding: $E = \{e_1, \ldots, e_T\}$, $e_t = \mathrm{Enc}_\phi(x_t)$;
infer keys: $Y \sim q_\phi(Y \mid X)$ (Equation 6);
write memory: $M = \mathrm{Write}_\theta(\mathrm{pool}(E))$
Figure 3: Left: Write model. Right: Write operation.

Writing to memory in the K++ model (Figure 3) entails encoding the input episode, $X$, through the encoder, pooling the representation over the episode, and transforming the pooled representation with the memory writer. In this work, we employ a Temporal Shift Module (TSM) (Lin et al., 2019) applied to a ResNet18 (He et al., 2016). TSM works by shifting the feature maps of a two-dimensional vision model along the temporal dimension in order to build richer representations of contextual features. In the case of K++, this allows the encoder to build a better representation of the memory by leveraging intermediary, episode-specific features. Using a TSM encoder over a standard convolutional stack improves the performance of both K++ and DKM, where the latter observes an improvement of 6.32 nats/image over the reported test conditional variational lower bound of 77.2 nats/image (Wu et al., 2018b) for the binarized Omniglot dataset. As far as we are aware, the application of a TSM encoder to memory models has not been explored previously and is a contribution of this work.
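The temporal shift itself is a parameter-free tensor operation; the snippet below sketches the standard TSM shift over the episode dimension with the 0.125 channel fraction mentioned above (our simplified rendering of the published module, not the exact implementation used here).

```python
import torch

def temporal_shift(x, shift_fraction=0.125):
    """x: (N, T, C, H, W). Shift a fraction of channels by +/- one step along T,
    zero-padding at the episode boundaries (standard TSM behaviour)."""
    n, t, c, h, w = x.shape
    fold = int(c * shift_fraction)
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                    # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]    # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]               # remaining channels untouched
    return out

x = torch.randn(2, 8, 64, 16, 16)   # two episodes of length T=8
assert temporal_shift(x).shape == x.shape
```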

The memory writer model in Figure 3 allows K++ to non-linearly transform the pooled embedding to better summarize the episode. In addition to inferring the deterministic memory, $M$, we also project the non-pooled embedding, $E$, through a key model to produce the approximate key posterior, $q_\phi(Y \mid X)$:

$y_t = \mu_\phi(e_t) + \sigma_\phi(e_t) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \qquad (6)$

The reparameterized keys are used to read sample-specific memory traces, $\hat{Z}$, from the full memory, $M$. The memory traces are used in training through the learned prior, $p_\theta(Z \mid Y, M)$, from Equation 4 via the KL divergence, $D_{KL}\big(q_\phi(Z \mid X) \,\|\, p_\theta(Z \mid Y, M)\big)$. This KL divergence constrains the optimization objective to keep the representation of the amortized approximate posterior, $q_\phi(Z \mid X)$, (probabilistically) close to the memory readout representation of the learned prior, $p_\theta(Z \mid Y, M)$. In the generative setting, this constraint enables memory traces to be routed from the learned prior to the decoder, $p_\theta(X \mid Z)$, in a similar manner to standard VAEs. We detail this process in the following section.
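A minimal sketch of the reparameterized key head of Equation 6 follows (a standard Gaussian reparameterization; the layer names, and the choice of a 3-dimensional key matching the three spatial-transformer parameters reviewed in Appendix A, are our own assumptions).

```python
import torch
import torch.nn as nn

class KeyModel(nn.Module):
    """Maps a per-sample embedding e_t to a reparameterized key y_t (Eq. 6)."""
    def __init__(self, embed_dim, key_dim=3):
        super().__init__()
        self.mu = nn.Linear(embed_dim, key_dim)
        self.log_sigma = nn.Linear(embed_dim, key_dim)

    def forward(self, e):
        mu, sigma = self.mu(e), self.log_sigma(e).exp()
        eps = torch.randn_like(mu)
        return mu + sigma * eps, (mu, sigma)     # y_t = mu + sigma * eps

keys, _ = KeyModel(embed_dim=512)(torch.randn(32, 512))   # an episode of 32 embeddings -> 32 keys
```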

3.4 Sample Generation

given memory: $M$;
sample keys: $Y \sim p(Y)$;
extract regions: $\hat{Z} = \{\mathrm{ST}(M, y^{(k)})\}_{k=1}^{K}$;
infer latent: $Z \sim p_\theta(Z \mid Y, M)$;
decode: $\hat{X} \sim p_\theta(X \mid Z)$
Figure 4: Left: Generative model. Right: Generative operation.

The Kanerva++ model, like the original KM and DKM models, enables sample generation given an existing memory or set of memories. Samples from the prior key distribution, $p(Y)$, are used to parameterize the spatial transformer, $\mathrm{ST}$, which indexes the deterministic memory, $M$. The result of this differentiable indexing is a set of memory sub-regions, $\hat{Z}$, which are used by the decoder, $p_\theta(X \mid Z)$, to generate synthetic samples. Reading samples in this manner forces the encoder to utilize memory sub-regions that are useful for reconstruction, as non-read memory regions receive zero gradients during backpropagation. This insight allows us to use the simple feed-forward write process described in Section 3.3, while still retaining the ability to produce locally contiguous block allocated memory.
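A sketch of the generative procedure in Figure 4, reusing the placeholder modules of the earlier loss sketch: keys are sampled from the prior, the spatial transformer extracts memory sub-regions, and the decoder maps the resulting latent to a sample.

```python
import torch

@torch.no_grad()
def generate(M, key_prior, st_read, readout_prior, decoder, num_samples=16):
    """M: a previously written memory. Returns samples conditioned on M."""
    y = key_prior.sample((num_samples,))      # random keys, e.g. from N(0, I)
    z_hat = st_read(M, y)                     # differentiable indexing of memory sub-regions
    z = readout_prior(z_hat).sample()         # latent read-out from the learned prior
    return torch.sigmoid(decoder(z))          # decoded images (Bernoulli means)
```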

3.5 Read / Iterative Read Model

given embedding: $E = \{e_1, \ldots, e_T\}$;
if training then
    infer latent: $Z \sim q_\phi(Z \mid X)$;
    infer prior: $p_\theta(Z \mid Y, M)$ with $Y \sim q_\phi(Y \mid X)$
else
    infer latent: $Z \sim p_\theta(Z \mid Y, M)$
end if
decode: $\hat{X} \sim p_\theta(X \mid Z)$
Figure 5: Left: Read model; the bottom branch from the embedding is used during iterative reading and prior evaluation, while the stable top branch is used to infer $Z$ during training. Right: Read operation.

K++ involves two forms of reading (Figure 5): iterative reading, and a simpler and more stable read model used for training. During training we actualize $q_\phi(Z \mid X)$ from Equation 4 using an amortized isotropic-Gaussian posterior that directly transforms the embedding of the episode, $E$, using a learned neural network (Figure 2-b). As mentioned in Section 3.3, the readout of the memory traces, $\hat{Z}$, is encouraged to learn a meaningful, structured representation through the memory read-out KL divergence, $D_{KL}\big(q_\phi(Z \mid X) \,\|\, p_\theta(Z \mid Y, M)\big)$, which attempts to minimize the (probabilistic) distance between $q_\phi(Z \mid X)$ and $p_\theta(Z \mid Y, M)$.

Kanerva memory models also possess the ability to gradually improve a sample through iterative inference (Figure 2-c), whereby noisy samples can be improved by leveraging contextual information stored in memory. This can be interpreted as approximating the posterior, $p_\theta(Z \mid X, M)$, by marginalizing over the approximate key distribution:

$p_\theta(Z \mid X, M) \;\approx\; \int_{Y} p_\theta(Z \mid Y, M)\, q_\phi(Y \mid \hat{X})\, dY \qquad (7)$

where the expectation over $q_\phi(Y \mid \hat{X})$ in Equation 7 uses a single-sample Monte Carlo estimate, evaluated by re-inferring the previous reconstruction, $\hat{X}$, through the approximate key posterior. Each subsequent memory readout, $\hat{Z}$, improves upon its previous representation by absorbing additional information from the memory.
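The iterative read of Equation 7 can be sketched as repeatedly re-encoding the current reconstruction and re-reading the fixed memory; module names are again our placeholders.

```python
import torch

@torch.no_grad()
def iterative_read(x_noisy, M, encoder, key_posterior, st_read,
                   readout_prior, decoder, steps=10):
    """Clean up x_noisy by re-inferring keys from the current reconstruction and
    re-reading the fixed memory M (single-sample estimate of Eq. 7)."""
    x = x_noisy
    for _ in range(steps):
        e = encoder(x)
        y = key_posterior(e).sample()              # re-infer keys from the current reconstruction
        z = readout_prior(st_read(M, y)).sample()  # re-read the fixed memory
        x = torch.sigmoid(decoder(z))              # next reconstruction absorbs memory context
    return x
```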

4 Experiments

We contrast K++ against state-of-the-art memory conditional vision models and present empirical results in Table 1. Binarized datasets assume Bernoulli output distributions, while continuous values are modeled by a discretized mixture of logistics (Salimans et al., 2017). As is standard in the literature (Burda et al., 2016; Sadeghi et al., 2019; Ma et al., 2018; Chen et al., 2017), we provide results for binarized MNIST and binarized Omniglot in nats/image and rescale the corresponding results to bits/dim for all other datasets. We describe the model architecture, the optimization procedure and the memory creation protocol in Appendices E and E.1. Finally, extra Celeb-A generations and test image reconstructions for all experiments are provided in Appendix B and Appendix D, respectively.

Method | Binarized MNIST (nats/image) | Binarized Omniglot (nats/image) | Fashion MNIST (bits/dim) | CIFAR10 (bits/dim) | DMLab Mazes (bits/dim)
VAE (Kingma and Welling, 2014) | 87.86 | 104.75 | 5.84 | 6.3 | -
IWAE (Burda et al., 2016) | 85.32 | 103.38 | - | - | -
Improved decoders:
PixelVAE++ (Sadeghi et al., 2019) | 78.00 | - | - | 2.90 | -
MAE (Ma et al., 2018) | 77.98 | 89.09 | - | 2.95 | -
DRAW (Gregor et al., 2015) | 87.4 | 96.5 | - | 3.58 | -
MatNet (Bachman, 2016) | 78.5 | 89.5 | - | 3.24 | -
Richer priors:
Ordered ACN (Graves et al., 2018) | 73.9 | - | - | 3.07 | -
VLAE (Chen et al., 2017) | 78.53 | 102.11 | - | 2.95 | -
VampPrior (Tomczak and Welling, 2018) | 78.45 | 89.76 | - | - | -
Memory conditioned models:
VMA (Bornschein et al., 2017) | - | 103.6 | - | - | -
KM (Wu et al., 2018a) | - | 68.3 | - | 4.37 | -
DNC (Graves et al., 2016) | - | 100 | - | - | -
DKM (Wu et al., 2018b) | 75.3 | 77.2 | - | 4.79 | 2.75
DKM w/TSM (our impl) | 51.84 | 70.88 | 4.15 | 4.31 | 2.92
Kanerva++ (ours) | 41.58 | 66.24 | 3.40 | 3.28 | 2.88
Table 1: Negative test likelihood and conditional test likelihood values (lower is better). Some baseline values were graciously provided by the original authors; one value is estimated from Wu et al. (2018a), Appendix Figure 12; DMLab Mazes results exhibit variadic performance due to online generation of DMLab samples.

K++ presents state-of-the-art results for memory conditioned binarized MNIST and binarized Omniglot, and presents competitive performance for Fashion MNIST, CIFAR10 and DMLab Mazes. The performance gap on the continuous-valued datasets can be explained by our use of a simple convolutional decoder, rather than the autoregressive decoders used in models such as PixelVAE++ (Sadeghi et al., 2019). We leave the exploration of more powerful decoder models to future work and note that our model can be integrated with autoregressive decoders.

4.1 Iterative inference

Figure 6: Left: The first (left-most) column visualizes an initial random-key generation; the following columns are created by re-inferring the previous sample through K++. Right: Denoising of salt & pepper (top), speckle (middle) and Poisson noise (bottom).

One of the benefits of K++ is that it uses the memory to learn a more informed prior by condensing the information from an episode of samples. One might suspect that, depending on the dimensionality of the memory and the size of the read traces, the memory might only learn prototypical patterns, rather than a full amalgamation of the input episode. This presents a problem for generation, as described in Section 3.4, and can be observed in the first column of Figure 6-Left, where the first generation from a random key appears blurry. To overcome this limitation, we rely on the iterative inference of Kanerva memory models (Wu et al., 2018a, b). By holding the memory, $M$, fixed and repeatedly inferring the latents, we are able to clean up the pattern by leveraging the contextual information contained within the memory (Section 3.5). This is visualized in the subsequent columns of Figure 6-Left, where we observe a slow but clear improvement in generation quality. This property of iterative inference is one of the central benefits of using a memory model over a traditional solution like a VAE. We also present results of iterative inference on more classical image noise distributions such as salt-and-pepper, speckle and Poisson noise in Figure 6-Right. For each original noisy pattern (top rows) we provide the resultant final reconstruction after ten steps of clean-up. The proposed K++ is robust to input noise and is able to clean up most of the patterns in a semantically meaningful way.

4.1.1 Image Generations

Figure 7: Key perturbed generations. Left: DMLab mazes. Center: Omniglot. Right: Celeb-A 64x64.
Figure 8: Random key generations. Left: DMLab mazes. Center: Omniglot. Right: Celeb-A 64x64.

Typical VAEs use high-dimensional isotropic Gaussian latent variables (Burda et al., 2016; Kingma and Welling, 2014). A well known property of high-dimensional Gaussian distributions is that most of their mass is concentrated on the surface of a high-dimensional ball. Perturbations to a sample in an area of valid density can easily move it to an invalid density region (Arvanitidis et al., 2018; White, 2016), causing blurry or irregular generations. In the case of K++, since the key distribution, $p(Y)$, lies in a low-dimensional space, local perturbations, $y + \epsilon$, are likely to remain in regions of high probability density. We visualize this form of generation in Figure 7 for DMLab Mazes, Omniglot and Celeb-A 64x64, as well as the more traditional random key generations, $y \sim p(Y)$, in Figure 8.

Interestingly, local key perturbations of a trained DMLab Maze K++ model induce generations that provide a natural traversal of the maze, as observed by scanning Figure 7-Left row by row, from left to right. In contrast, the random generations for the same task (Figure 8-Left) present a more discontinuous set of generations. We see a similar effect for the Omniglot and Celeb-A datasets, but observe that the locality is instead tied to character or facial structure, as shown in Figure 7-Center and Figure 7-Right. Finally, in contrast to VAE generations, K++ is able to generate sharper images of ImageNet32x32, as shown in Appendix C. Future work will investigate this form of locally perturbed generation through an MCMC lens.
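Concretely, the perturbed generations in Figure 7 amount to adding small noise to a single reference key and decoding each perturbed key with the generative pipeline sketched in Section 3.4 (the perturbation scale below is an illustrative choice of ours, not the paper's setting).

```python
import torch

def perturbed_keys(y0, num=16, scale=0.05):
    """Return num local perturbations of a reference key y0."""
    return y0.unsqueeze(0) + scale * torch.randn(num, *y0.shape)

y0 = torch.randn(3)          # a single low-dimensional key (e.g. scale + translation)
ys = perturbed_keys(y0)      # decode each row as in the Section 3.4 sketch
```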

4.2 Ablation: Is Block Allocated Spatial Memory Useful?

Figure 9: Left: Simplified write model that directly produces the readout prior from Equation 4 by projecting the embedding via a learned network. Right: Test negative variational lower bound (mean ± 1 std).

While Figure 7 demonstrates the advantage of having a low-dimensional sampling distribution and Figure 6 demonstrates the benefit of iterative inference, it is unclear whether the performance benefit in Table 1 is achieved from the episodic training, model structure, optimization procedure or memory allocation scheme. To isolate the cause of the performance benefit, we simplify the write architecture from Section 3.3 as shown in Figure 9-Left. In this scenario, we produce the learned memory readout via an equivalently sized dense model that projects the embedding, while keeping all other aspects the same. We train both models five times with the exact same TSM-ResNet18 encoder, decoder, optimizer and learning rate scheduler. As shown in Figure 9-Right, the test conditional variational lower bound of the K++ model is 20.6 nats/image better than that of the baseline model for the evaluated binarized Omniglot dataset. This confirms that the spatial, block allocated latent memory model proposed in this work is useful when working with image distributions. Future work will explore this dimension for other modalities such as audio and text.

4.3 Ablation: episode length (T) and memory read steps (K).

Figure 10: Binarized MNIST. Left: Episode length (T) ablation showing the negative test conditional variational lower bound (mean ± std). Right: Memory read steps (K) ablation showing the test KL divergence.

To further explore K++, we evaluate the sensitivity of the model to varying episode lengths (T) in Figure 10-left and memory read steps (K) in Figure 10-right using the binarized MNIST dataset. We train K++ five times (each) for episode lengths ranging from 5 to 64 and observe that the model performs within margin of error for increasing episode lengths, producing negative test conditional variational bounds within one standard deviation of each other. This suggests that, for the specific memory dimensionality used in this experiment, K++ was able to successfully capture the semantics of the binarized MNIST distribution. We suspect that for larger datasets this relationship might not necessarily hold and that the dimensionality of the memory should scale with the size of the dataset, but leave the prospect of such a capacity analysis for future research.

While ablating the number of memory reads (K) in Figure 10-right, we observe that the total test KL divergence varies by a 1-std of 0.041 nats/image for a range of memory reads from 1 to 64. A lower KL divergence implies that the model is able to better fit the approximate posteriors, $q_\phi(Z \mid X)$ and $q_\phi(Y \mid X)$, to their corresponding priors in Equation 4. It should, however, be noted that a lower KL divergence does not necessarily imply a better generative model (Theis et al., 2016). While qualitatively inspecting the generated samples, we observed that K++ produced more semantically sound generations at lower memory read steps. We suspect that the difficulty of generating realistic samples increases with the number of disjoint reads and found that a small number of reads produces high-quality results; we use this setting for all experiments in this work.

5 Conclusion

In this work, we propose a novel block allocated memory in a generative model framework and demonstrate its state-of-the-art performance on several memory conditional image generation tasks. We also show that stochasticity in low-dimensional spaces produces higher quality samples in comparison to high-dimensional latents typically used in VAEs. Furthermore, perturbations to the low-dimensional key generate samples with high variations. Nonetheless, there are still many unanswered questions: would a hard attention based solution to differentiable indexing prove to be better than a spatial transformer? What is the optimal upper bound of window read regions based on the input distribution? Future work will hopefully be able to address these lingering issues and further improve generative memory models.

References

  • D. J. Aldous (1985) Exchangeability and related topics. In École d’Été de Probabilités de Saint-Flour XIII—1983, pp. 1–198. Cited by: §3.1.
  • G. Arvanitidis, L. K. Hansen, and S. Hauberg (2018) Latent space oddity: on the curvature of deep generative models. In 6th International Conference on Learning Representations, ICLR 2018, Cited by: §4.1.1.
  • P. Bachman (2016) An architecture for deep, hierarchical generative models. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, D. D. Lee, M. Sugiyama, U. von Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 4826–4834. External Links: Link Cited by: Table 1.
  • D. Barber and F. V. Agakov (2004) Information maximization in noisy channels: a variational approach. In Advances in Neural Information Processing Systems, pp. 201–208. Cited by: §3.1.
  • J. Bornschein, A. Mnih, D. Zoran, and D. J. Rezende (2017) Variational memory addressing in generative models. In Advances in Neural Information Processing Systems, pp. 3920–3929. Cited by: §2, §3.1, Table 1.
  • Y. Burda, R. B. Grosse, and R. Salakhutdinov (2016) Importance weighted autoencoders. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §2, §4.1.1, Table 1, §4.
  • X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel (2017) Variational lossy autoencoder. ICLR. Cited by: §2, Table 1, §4.
  • Y. Fan, G. Wen, D. Li, S. Qiu, M. D. Levine, and F. Xiao (2020) Video anomaly detection and localization via gaussian mixture fully convolutional variational autoencoder. Computer Vision and Image Understanding, pp. 102920. Cited by: §2.
  • M. Fraccaro, D. J. Rezende, Y. Zwols, A. Pritzel, S. M. A. Eslami, and F. Viola (2018) Generative temporal models with spatial memory for partially observed environments. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, J. G. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 1544–1553. External Links: Link Cited by: §2.
  • J. M. Gold, R. F. Murray, A. B. Sekuler, P. J. Bennett, and R. Sekuler (2005) Visual memory decay is deterministic. Psychological Science 16 (10), pp. 769–774. Cited by: §2.
  • A. G. A. P. Goyal, A. Sordoni, M. Côté, N. R. Ke, and Y. Bengio (2017a) Z-forcing: training stochastic recurrent networks. In Advances in neural information processing systems, pp. 6713–6723. Cited by: §2.
  • P. Goyal, Z. Hu, X. Liang, C. Wang, and E. P. Xing (2017b) Nonparametric variational auto-encoders for hierarchical representation learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5094–5102. Cited by: §2.
  • P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He (2017c) Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR abs/1706.02677. External Links: Link, 1706.02677 Cited by: Appendix E.
  • A. Graves, J. Menick, and A. v. d. Oord (2018) Associative compression networks for representation learning. arXiv preprint arXiv:1804.02476. Cited by: §2, Table 1.
  • A. Graves, G. Wayne, and I. Danihelka (2014) Neural turing machines. arXiv preprint arXiv:1410.5401. Cited by: §1, §2.
  • A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, et al. (2016) Hybrid computing using a neural network with dynamic external memory. Nature 538 (7626), pp. 471–476. Cited by: §1, §2, Table 1.
  • K. Gregor, I. Danihelka, A. Graves, D. Rezende, and D. Wierstra (2015) DRAW: a recurrent neural network for image generation. In International Conference on Machine Learning, pp. 1462–1471. Cited by: Table 1.
  • I. Gulrajani, K. Kumar, F. Ahmed, A. A. Taïga, F. Visin, D. Vázquez, and A. C. Courville (2017) PixelVAE: A latent variable model for natural images. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, External Links: Link Cited by: §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §3.3.
  • D. O. Hebb (1949) The organization of behavior: a neuropsychological theory. J. Wiley; Chapman & Hall. Cited by: §2.
  • A. Hollingworth, M. Matsukura, and S. J. Luck (2013) Visual working memory modulates rapid eye movements to simple onset targets. Psychological science 24 (5), pp. 790–796. Cited by: §2.
  • J. J. Hopfield (1982) Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences 79 (8), pp. 2554–2558. Cited by: §2.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, F. R. Bach and D. M. Blei (Eds.), JMLR Workshop and Conference Proceedings, Vol. 37, pp. 448–456. External Links: Link Cited by: Appendix E.
  • M. Jaderberg, K. Simonyan, A. Zisserman, et al. (2015) Spatial transformer networks. In Advances in neural information processing systems, pp. 2017–2025. Cited by: Appendix A, §3.2.
  • P. Kanerva (1988) Sparse distributed memory. MIT press. Cited by: §2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), Cited by: Appendix E.
  • D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling (2016) Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, pp. 4743–4751. Cited by: §2.
  • D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. ICLR. Cited by: §2, §4.1.1, Table 1.
  • T. N. Kipf and M. Welling (2016) Variational graph auto-encoders. CoRR abs/1611.07308. External Links: Link, 1611.07308 Cited by: §2.
  • D. Krotov and J. J. Hopfield (2016) Dense associative memory for pattern recognition. In Advances in neural information processing systems, pp. 1172–1180. Cited by: §2.
  • D. Kumaran, D. Hassabis, and J. L. McClelland (2016) What learning systems do intelligent agents need? complementary learning systems theory updated. Trends in cognitive sciences 20 (7), pp. 512–534. Cited by: §1.
  • H. Le, T. Tran, T. Nguyen, and S. Venkatesh (2018) Variational memory encoder-decoder. In Advances in Neural Information Processing Systems, pp. 1508–1518. Cited by: §2.
  • J. Lin, C. Gan, and S. Han (2019) Tsm: temporal shift module for efficient video understanding. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7083–7093. Cited by: §3.3.
  • H. Liu, A. Brock, K. Simonyan, and Q. V. Le (2020) Evolving normalization-activation layers. CoRR abs/2004.02967. External Links: Link, 2004.02967 Cited by: Appendix E.
  • I. Loshchilov and F. Hutter (2017) SGDR: stochastic gradient descent with warm restarts. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, External Links: Link Cited by: Appendix E.
  • J. Lucas, G. Tucker, R. B. Grosse, and M. Norouzi (2019) Understanding posterior collapse in generative latent variable models. In Deep Generative Models for Highly Structured Data, ICLR 2019 Workshop, New Orleans, Louisiana, United States, May 6, 2019, External Links: Link Cited by: §2.
  • X. Ma, C. Zhou, and E. Hovy (2018) MAE: mutual posterior-divergence regularization for variational autoencoders. In International Conference on Learning Representations, Cited by: Table 1, §4.
  • A. Marblestone, Y. Wu, and G. Wayne (2020) Product kanerva machines: factorized bayesian memory. Bridging AI and Cognitive Science Workshop. ICLR. Cited by: §2.
  • S. Marlow, T. Harris, R. P. James, and S. Peyton Jones (2008) Parallel generational-copying garbage collection with a block-structured heap. In Proceedings of the 7th international symposium on Memory management, pp. 11–20. Cited by: Figure 1, §1.
  • J. L. McClelland, B. L. McNaughton, and R. C. O’Reilly (1995) Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory.. Psychological review 102 (3), pp. 419. Cited by: Kanerva++: Extending the Kanerva Machine With Differentiable, Locally Block Allocated Latent Memory, §1.
  • T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, External Links: Link Cited by: Appendix E.
  • A. Nair, V. Pong, M. Dalal, S. Bahl, S. Lin, and S. Levine (2018) Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 9209–9220. External Links: Link Cited by: §2.
  • V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, J. Fürnkranz and T. Joachims (Eds.), pp. 807–814. External Links: Link Cited by: Appendix E.
  • T. A. Plate (1995) Holographic reduced representations. IEEE Transactions on Neural networks 6 (3), pp. 623–641. Cited by: §2.
  • A. Pritzel, B. Uria, S. Srinivasan, A. P. Badia, O. Vinyals, D. Hassabis, D. Wierstra, and C. Blundell (2017) Neural episodic control. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, pp. 2827–2836. External Links: Link Cited by: §2.
  • P. Ramachandran, B. Zoph, and Q. V. Le (2018) Searching for activation functions. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Workshop Track Proceedings, External Links: Link Cited by: Appendix E.
  • M. W. Reimann, M. Nolte, M. Scolamiero, K. Turner, R. Perin, G. Chindemi, P. Dłotko, R. Levi, K. Hess, and H. Markram (2017) Cliques of neurons bound into cavities provide a missing link between structure and function. Frontiers in Computational Neuroscience 11, pp. 48. External Links: Link, Document, ISSN 1662-5188 Cited by: §1.
  • H. Sadeghi, E. Andriyash, W. Vinci, L. Buffoni, and M. H. Amin (2019) PixelVAE++: improved pixelvae with discrete prior. CoRR abs/1908.09948. External Links: Link, 1908.09948 Cited by: §2, Table 1, §4, §4.
  • T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma (2017) PixelCNN++: improving the pixelcnn with discretized logistic mixture likelihood and other modifications. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, External Links: Link Cited by: §4.
  • A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. P. Lillicrap (2016) Meta-learning with memory-augmented neural networks. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, M. Balcan and K. Q. Weinberger (Eds.), JMLR Workshop and Conference Proceedings, Vol. 48, pp. 1842–1850. External Links: Link Cited by: §1.
  • S. Sedai, D. Mahapatra, S. Hewavitharanage, S. Maetschke, and R. Garnavi (2017) Semi-supervised segmentation of optic cup in retinal fundus images using variational autoencoder. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 75–82. Cited by: §2.
  • L. N. Smith and N. Topin (2017) Super-convergence: very fast training of residual networks using large learning rates. CoRR abs/1708.07120. External Links: Link, 1708.07120 Cited by: Appendix E.
  • J. P. Spencer and A. M. Hund (2002) Prototypes and particulars: geometric and experience-dependent spatial categories.. Journal of Experimental Psychology: General 131 (1), pp. 16. Cited by: §2.
  • L. Theis, A. van den Oord, and M. Bethge (2016) A note on the evaluation of generative models. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §4.3.
  • J. Tomczak and M. Welling (2018) VAE with a vampprior. In International Conference on Artificial Intelligence and Statistics, pp. 1214–1223. Cited by: §2, Table 1.
  • J. Weston, S. Chopra, and A. Bordes (2015) Memory networks. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §1, §2.
  • T. White (2016) Sampling generative networks. arXiv preprint arXiv:1609.04468. Cited by: §4.1.1.
  • Y. Wu, G. Wayne, A. Graves, and T. Lillicrap (2018a) The kanerva machine: a generative distributed memory. ICLR. Cited by: §1, §2, §3.1, §3.1, §4.1, Table 1.
  • Y. Wu, G. Wayne, K. Gregor, and T. Lillicrap (2018b) Learning attractor dynamics for generative memory. In Advances in Neural Information Processing Systems, pp. 9379–9388. Cited by: §1, §2, §3.3, §4.1, Table 1.
  • Y. Wu and K. He (2020) Group normalization. Int. J. Comput. Vis. 128 (3), pp. 742–755. External Links: Link, Document Cited by: Appendix E.
  • Y. You, I. Gitman, and B. Ginsburg (2017) Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888. Cited by: Appendix E.
  • Q. Zhao, N. Honnorat, E. Adeli, A. Pfefferbaum, E. V. Sullivan, and K. M. Pohl (2019) Variational autoencoder with truncated mixture of gaussians for functional connectivity analysis. In International Conference on Information Processing in Medical Imaging, pp. 867–879. Cited by: §2.
  • G. Zhu, Z. Lin, G. Yang, and C. Zhang (2019) Episodic reinforcement learning with associative memory. In International Conference on Learning Representations, Cited by: §1.

Appendix A Spatial Transformer Review

Indexing a matrix, $M$, is typically a non-differentiable operation since it involves hard cropping around an index. Spatial transformers (Jaderberg et al., 2015) provide a solution to this problem by decomposing it into two differentiable operations:

  1. Learn an affine transformation of coordinates.

  2. Use a differentiable bilinear transformation.

The affine transformation maps target (output) coordinates, $(x^{t}, y^{t})$, to source coordinates, $(x^{s}, y^{s})$, in the input and is defined as:

$\begin{bmatrix} x^{s} \\ y^{s} \end{bmatrix} = \begin{bmatrix} s & 0 & t_x \\ 0 & s & t_y \end{bmatrix} \begin{bmatrix} x^{t} \\ y^{t} \\ 1 \end{bmatrix} \qquad (8)$

Here, the affine transform has three learnable scalars, $s$, $t_x$ and $t_y$, which define a scaling and a translation in $x$ and $y$, respectively. In the case of K++, these three scalars represent the components of the key sample, $y$, as shown in Equation 8. After transforming the coordinates (not to be confused with the actual data), spatial transformers apply a differentiable bilinear transform, which can be interpreted as a differentiable mask that is element-wise multiplied with the original data, $M$:

$V_{i} = \sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm} \, \max\!\big(0,\, 1 - |x_{i}^{s} - m|\big) \, \max\!\big(0,\, 1 - |y_{i}^{s} - n|\big) \qquad (9)$

Consider the following example parameterization of $(s, t_x, t_y)$: it differentiably extracts the region shown in Figure 11-Right from Figure 11-Left.

Figure 11: Spatial transformer example. Left: original image with region inlaid. Right: extracted grid.

The range of values for the transformed coordinates is bounded to $[-1, 1]$, where the center of the image is $(0, 0)$.
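For readers who prefer code, the crop described above can be reproduced with standard differentiable resampling primitives. Below is a minimal PyTorch sketch (our own, not the paper's implementation) that builds the scale-and-translate affine matrix of Equation 8 and extracts a sub-region via bilinear sampling as in Equation 9.

```python
import torch
import torch.nn.functional as F

def st_crop(M, s, tx, ty, out_hw=(8, 8)):
    """M: (N, C, H, W). Extract an out_hw crop scaled by s and centred at (tx, ty),
    with coordinates in [-1, 1] and (0, 0) at the image centre."""
    n, c = M.shape[0], M.shape[1]
    theta = torch.zeros(n, 2, 3)
    theta[:, 0, 0] = s          # scale in x
    theta[:, 1, 1] = s          # scale in y
    theta[:, 0, 2] = tx         # translation in x
    theta[:, 1, 2] = ty         # translation in y
    grid = F.affine_grid(theta, size=(n, c, *out_hw), align_corners=False)
    return F.grid_sample(M, grid, mode='bilinear', align_corners=False)

M = torch.randn(1, 3, 32, 32)
crop = st_crop(M, s=0.25, tx=0.5, ty=-0.5)   # quarter-scale crop from the upper-right
print(crop.shape)                            # torch.Size([1, 3, 8, 8])
```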

Appendix B Celeb-A Generations

Figure 12: Random key Celeb-A generations.

We present random key generations of Celeb-A 64x64, trained without center cropping in Figure 12.

Appendix C VAE vs. K++ ImageNet32x32 Generations

Figure 13: ImageNet32x32 generations. Left: VAE; Right: K++.

Figure 13 shows the difference in generations of a standard VAE vs. K++. In contrast to the standard VAE generation (Figure 13-Left), the K++ generations (Figure 13-Right) appear much sharper, avoiding the blurry generations observed with standard VAEs.

Appendix D Test Image Reconstructions

Figure 14: Binarized test reconstructions; top row are true samples. Left: Omniglot; Right: MNIST.
Figure 15: Test set reconstructions; top row are true samples. Left: ImageNet64x64. Right: DMLab Mazes.
Figure 16: Test set reconstructions; top row are true samples. Celeb-A 64x64.

Appendix E Model Architecture & Training

Encoder: As mentioned in Section 3.3, we use a TSM-ResNet18 encoder with batch normalization (Ioffe and Szegedy, 2015) and ReLU activations (Nair and Hinton, 2010) for all tasks. We apply a fractional shift of the feature maps by 0.125, as suggested by the authors.

Decoder: Our decoder is a simple conv-transpose network with EvoNormS0 (Liu et al., 2020) inter-spliced between each layer. EvoNormS0 is similar in spirit to GroupNorm (Wu and He, 2020) combined with the Swish activation function (Ramachandran et al., 2018).

Optimizer & LR schedule: We use LARS (You et al., 2017) coupled with Adam (Kingma and Ba, 2014) and a one-cycle (Smith and Topin, 2017) cosine learning rate schedule (Loshchilov and Hutter, 2017). A linear warm-up of 10 epochs (Goyal et al., 2017c) is also used for the schedule. Weight decay is applied to every parameter barring biases and the affine terms of batchnorm. Each task is trained for 500 or 1000 epochs depending on the size of the dataset.
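The weight-decay exclusion described above (no decay on biases or normalization affine terms) can be implemented with standard optimizer parameter groups. The sketch below uses a simple parameter-splitting heuristic and plain AdamW in place of the paper's LARS+Adam combination; the decay value is an arbitrary placeholder.

```python
import torch
from torch import nn

def split_decay_params(model: nn.Module, weight_decay: float):
    """Place biases and 1-D parameters (e.g. norm-layer affine terms) in a no-decay group."""
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        (no_decay if p.ndim <= 1 or name.endswith(".bias") else decay).append(p)
    return [{"params": decay, "weight_decay": weight_decay},
            {"params": no_decay, "weight_decay": 0.0}]

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU(),
                      nn.Flatten(), nn.Linear(16 * 30 * 30, 10))
optimizer = torch.optim.AdamW(split_decay_params(model, weight_decay=1e-2), lr=3e-4)
```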

Dense models: All dense models such as our key network are simple three layer deep linear dense models with a latent dimension of 512 coupled with spectral normalization (Miyato et al., 2018).

Memory writer: uses a deep linear conv-transpose decoder on the pooled embedding, with a base feature map projection size of 256 that is divided by 2 per layer. We use a fixed memory size for all the experiments in this work.

Learned prior: uses a convolutional encoder that stacks the read traces, $\hat{Z}$, along the channel dimension and projects them to the dimensionality of the readout latent, $Z$.

In practice, we observed that K++ trains roughly 2x faster (wall clock) than our re-implementation of DKM. We mainly attribute this to not having to solve an inner OLS optimization loop for memory inference.

E.1 Memory creation protocol

The memory creation protocol of K++ is similar in spirit to that of the DKM model, given the deterministic relaxations and addressing mechanism described in Sections 3.3 and 3.4. Each memory, $M$, is a function of an episode of samples, $X$. To efficiently optimize the conditional lower bound in Equation 4, we parallelize the learning objective over a set of minibatches, as is typical in the optimization of neural networks. As with the DKM model, K++ computes the train and test conditional evidence lower bounds in Table 1 by first inferring the memory, $M$, from the input episode, followed by the read-out procedure described in Section 3.5.