Let $X$ be a random variable in $\mathbb{R}^d$ having density $p_X$. A fundamental problem in statistics is to find an estimator of $p_X$ on the basis of $x_1, \dots, x_n$, independent and identically distributed observations drawn from $p_X$. An intuitive solution, going back at least to 1962 (Parzen, 1962), known as kernel density estimation, is to smooth the empirical measure with a convolution kernel that integrates to 1, where we consider an isotropic Gaussian kernel in $\mathbb{R}^d$. The kernel density estimator is rooted in the smoothness assumption, our belief that the "true density" is smooth. But what is the effect of the Gaussian smoothing on the data manifold in high dimensions?
The answer lies in the concentration of measure phenomenon. It is well known that the measure of Gaussian distributions in high dimensions is concentrated far away from the mode, on a thin shell of width $O(\sigma)$ around the sphere of radius $\sqrt{d}\,\sigma$ (Tao, 2012; Vershynin, 2018). In fact, as $d \to \infty$, the Gaussian measure and the uniform measure on that sphere become one and the same. This has severe consequences for the measure of a distribution after convolution with a Gaussian: e.g., the zero-dimensional manifold of a point mass transforms into a $(d-1)$-dimensional sphere far away from the point itself. The first conceptual contribution of this paper is to expose this "manifold disintegration" phenomenon with the following takeaway: in (very) high dimensions, the Gaussian convolution pushes away (all) the masses near low-dimensional manifolds.
A seemingly unrelated theory is empirical Bayes (Robbins, 1955), which we use for restoring the measure near low-dimensional manifolds. In 1955, Robbins considered a scenario of a random variable $Y$ having a probability distribution that depends in a known way on an unknown random variable $X$. He found, very remarkably, that for certain measurement kernels, given an observation $y$ one does not need to know the prior on $X$ to evaluate the least squares estimator $\hat{x}(y) = \mathbb{E}[X \mid Y = y]$. Robbins' work was later extended to Gaussian kernels by Miyasawa (1961). The estimator $\hat{x}(y)$, referred to as the nonparametric empirical Bayes least squares (NEBLS) estimator, is given for the isotropic Gaussian kernel $\mathcal{N}(0, \sigma^2 I_d)$ by:
$$\hat{x}(y) = y + \sigma^2\, \nabla_y \log p_Y(y).$$
The NEBLS estimator is derived for any $\sigma$. The term nonperturbative is used throughout the paper to signify this important fact. We also refer to this single-step nonperturbative estimation as the "Robbins jump", or jump for short. Robbins was interested in the scenario where one had no knowledge of the distribution of $X$ (neither through samples from the prior, nor through knowledge of the prior itself). However, here, we do have samples $x_i$! The NEBLS estimator is powerful in that we can use it in a learning objective, where we learn $p_Y$, the density in the $Y$ space, by minimizing $\|x - \hat{x}(y)\|^2$ on average. Therefore, the empirical Bayes estimator becomes a mechanism to both learn the density in the $Y$ space and also jump back to the $X$ space. The second conceptual contribution of this paper is this fresh new take on the empirical Bayes formalism that gives rise to a learning objective to learn smooth densities and restore low-dimensional manifolds simultaneously. This two-sided use of the NEBLS estimator, in both learning the density in the probability space and also jumping back to the $X$ space, is what we mean by unifying kernel density estimation and empirical Bayes.
For the Gaussian measurement kernel, the empirical Bayes learning objective (written compactly as $\mathcal{L} = \mathbb{E}\,\|x - \hat{x}(y)\|^2$) is the same as the deep energy estimator networks (DEEN) learning objective studied in (Saremi et al., 2018). DEEN's inception was in devising scalable algorithms for learning deep energy models built on denoising score matching (Vincent, 2011). We reinterpret its learning objective from the more general perspective of empirical Bayes. The objective is simple at first sight:
$$\mathcal{L} = \mathbb{E}_{(x,y)}\,\|x - \hat{x}(y)\|^2.$$
Given $x_i$, the samples $y_{ij}$ are drawn from the Gaussian distribution centered at $x_i$: $y_{ij} \sim \mathcal{N}(x_i, \sigma^2 I_d)$. In high dimensions, the $y_{ij}$ are (approximately) uniformly distributed on the sphere of radius $\sqrt{d}\,\sigma$ centered at $x_i$. The thin spherical shell around this sphere, where the measure of the Gaussian is concentrated, plays an important role in this work and is referred to as the "$x_i$-sphere". The $i$ in $x_i$-sphere refers to the $i$ in $x_i$, and $j$ is reserved for indexing samples in the $Y$ space. Therefore, $y_{ij}$ is the $j$th sample on the $x_i$-sphere. DEEN approximates the negative log-pdf of the smooth density $p_Y$ with a globally defined energy function $\psi(y; \theta)$, parameterized by $\theta$, the parameters of a deep neural network. The objective is to shape the global energy function (find $\theta$) such that locally, on the $x_i$-spheres, the NEBLS estimator comes as close as possible (in squared norm) to $x_i$ at the center of the $x_i$-sphere. It should be emphasized that the energy function is learned/estimated without MCMC inference (generating negative samples) or variational approximations. However, $\psi$ and its approximation of $-\log p_Y$ depend on $\sigma$ implicitly.
The distinction between the $X$ space and the $Y$ space is crucial. The density estimation happens in the $Y$ space, the probability space. Empirical Bayes is an efficient nonperturbative single-step mechanism to jump back to the $X$ space with the NEBLS estimator.
The central term in the DEEN objective is $\nabla_y \psi(y_{ij}; \theta)$, the globally defined vector field evaluated at $y_{ij}$, the $j$th sample on the $x_i$-sphere. In the literature, $\nabla_y \log p_Y(y)$ is referred to as the score function, due to its similarity in form to the Fisher score and following its definition in (Hyvärinen, 2005). Here we call it the probability flow field to imply a physical picture, as the $x_i$-spheres interact indirectly via the vector field. The indirect interaction is best seen by looking at the optimization problem, where the SGD optimizer must take all the $x_i$-spheres in the learning objective into account in finding $\theta$. In other words, the $x_i$-spheres compete/interact to shape the energy function. The size of the $x_i$-spheres, determined by $\sigma$, controls their interaction in loosely defined non-overlapping, overlapping, and highly overlapping regimes, creating a rich family of models even for the simple handwritten digit database. Our third conceptual contribution is the physical picture of the emergence of interactions between samples in a dataset. We view it as a general concept, but in this work the interaction is the result of (and mediated by) the probability flow field. We borrow from the physics lexicon the terms "interacting many-body system" and "many-body interactions" to refer to this aspect of our model.
From this angle, and due to concentration of measure in high dimensions, the kernel density mixture model is considered non-interacting, in that the mixture components interact only for overlapping $x_i$-spheres and only in their overlapping regions.
Learning the energy function in the empirical Bayes setup is powerful. We use it in a novel sampling algorithm and in defining a new notion of associative memory. But we first discuss a quantification of $x_i$-sphere overlaps, which is new by itself, and which played an important role in designing experiments:
Concentration of measure leads to a geometric picture for the kernel density as a mixture of $x_i$-spheres. We define the matrix $D$ to understand the extent to which $x_i$-spheres overlap:
$$D_{ij} = \frac{\|x_i - x_j\|}{2\sqrt{d}\,\sigma},$$
where the $x_i$-sphere and the $x_j$-sphere overlap iff $D_{ij} < 1$. Of special interest is $\sigma_c = \max_{ij} \|x_i - x_j\| / (2\sqrt{d})$ (the subscript $c$ in $\sigma_c$ stands for "critical"). For $\sigma > \sigma_c$, all $x_i$-spheres overlap, what we refer to as extreme noise (or the onset of the extreme noise regime, to be more precise).
Walk-jump sampling is a novel sampling algorithm to generate samples in the $X$ space (close to the data manifold). It starts by running discretized Langevin dynamics (Langevin MCMC walks for $t = 1, 2, \dots$) in the $Y$ space:
$$y_{t+1} = y_t - \delta\, \nabla_y \psi(y_t; \theta) + \sqrt{2\delta}\;\varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, I_d).$$
Samples in the $X$ space are then generated with the Robbins jump:
$$\hat{x}_t = y_t - \sigma^2\, \nabla_y \psi(y_t; \theta).$$
A signature of walk-jump sampling is that the Langevin walks do not depend on the Robbins jumps. Therefore, jumps to the $X$ space do not have to be synchronized with walks in the $Y$ space.
Robbins associative memory is a novel memory concept, defined à la (Hopfield, 1982) as the attractors (local minima) of the energy function $\psi(y; \theta)$ (obtained e.g. by running the Langevin updates above without the noise term). The memory is therefore stored in a deep neural network. In retrieving a memory, one flows deterministically to an attractor, where the flow itself is computed very efficiently (at the cost of evaluating $\nabla_y \psi$, the cost of backpropagation in input dimensions) by the neural network. Despite its definition based on a deterministic flow, Robbins associative memory is a probabilistic concept at its core. That is because the deterministic dynamics follows the probability flow. (This is not true for Hopfield networks!) Also, the many-body interactions that shape the energy function have a manifestation in memory interference.
This paper is organized as follows. Kernel density estimation is reviewed in Section 2. The concentration of measure and manifold disintegration are discussed in Section 3. Empirical Bayes is reviewed in Section 4. In Section 5, we discuss the neural empirical Bayes and its unification with kernel density estimation. In Section 6, we begin to paint an interacting many-body picture for the learning objective by discussing a dataset with just two samples. In Section 7, we demonstrate "extreme" denoising experiments, testing the nonperturbative nature of the objective. In Section 8, we present the walk-jump sampling scheme and results in two different regimes. In both cases sharp samples are generated in the $X$ space, but with different mixing properties. Robbins associative memory is discussed in Section 9, where memory interference is visualized in two different regimes. In Section 10, we investigate and provide the strongest evidence for the complexity of the many-body interactions, where we find, very surprisingly, the emergence of remarkably rich "creative" modes. We finish with a summary.
2 Kernel density estimation
Consider $\{x_i\}_{i=1}^{n}$, i.i.d. samples from a density $p_X$ on $\mathbb{R}^d$. We assume the true density is smooth and require a smooth estimator. A simple intuitive solution, going back at least to 1962 (Parzen, 1962), is to spread the masses of the empirical measure with a convolution kernel $k$, giving rise to the nonparametric kernel density estimator:
$$\hat{p}(y) = \frac{1}{n} \sum_{i=1}^{n} k(y - x_i),$$
where $k$ is a bounded function that integrates to 1. The estimator is a function, an infinite-dimensional vector, which is the rationale behind the term "nonparametric"; however, the kernel $k$ and the important bandwidth/resolution parameter do parameterize the estimator. The choice of $y$ in the argument of $\hat{p}$, instead of the conventional $x$, has to do with the unification scheme with empirical Bayes that becomes clear in Section 5. In this paper, we focus on the isotropic Gaussian kernel. The conventional $h$ for the bandwidth is now replaced with $\sigma$, and $k$ with the Gaussian density $\mathcal{N}(0, \sigma^2 I_d)$. The estimator, obtained by the convolution of the empirical distribution with the Gaussian, is a mixture of Gaussians centered at the samples $x_i$, each component with covariance $\sigma^2 I_d$:
$$\hat{p}(y) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{N}\big(y;\, x_i,\, \sigma^2 I_d\big).$$
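For concreteness, the Gaussian kernel density estimator can be sketched in a few lines of plain Python (a minimal sketch; the function and variable names are ours):

```python
import math

def gaussian_kde(data, sigma):
    """Kernel density estimator: a mixture of isotropic Gaussians
    N(x_i, sigma^2 I_d) centered at the samples x_i."""
    d = len(data[0])
    log_norm = -0.5 * d * math.log(2 * math.pi * sigma ** 2)

    def p_hat(y):
        # average the Gaussian kernel over all n centers
        total = 0.0
        for x in data:
            sq = sum((yi - xi) ** 2 for yi, xi in zip(y, x))
            total += math.exp(log_norm - sq / (2 * sigma ** 2))
        return total / len(data)

    return p_hat
```

With a single sample the estimator reduces to the Gaussian kernel itself, centered at that sample.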
Due to the concentration of measure phenomenon in high dimensions, each component of the Gaussian mixture is concentrated uniformly on the $x_i$-sphere, the sphere of radius $\sqrt{d}\,\sigma$ centered at $x_i$.
We will come back to this geometric view of the kernel density in Section 5. Next, we review the concentration of measure phenomenon and discuss the effect of the Gaussian "smoothing" in kernel density estimation on dissolving the measure near low-dimensional manifolds, which we referred to as manifold disintegration in the Introduction.
3 Manifold disintegration in (very) high dimensions
Our low-dimensional intuitions break down spectacularly in high dimensions. This is in large part due to the exponential growth of space in high dimensions. This abundance of space also underlies the curse of dimensionality, a well-known problem in machine learning. But our focus in this section is to develop some intuitions on the effect of Gaussian convolution in high dimensions. We start with a summary of some important results on the concentration of measure of high-dimensional random vectors. The books (Tao, 2012; Vershynin, 2018) should be consulted to fill in details. We then extend the textbook calculations and make a case for manifold disintegration.
Start with the isotropic Gaussian. In 2D, the Gaussian is visualized by a cloud of points centered around the mode (see Fig. 1(a)). Now take the isotropic Gaussian $Y \sim \mathcal{N}(0, \sigma^2 I_d)$ in dimension $d$ and ask where the random vector is likely to be located. Before stating a well-known concentration of measure result, we can build intuition by looking at the following identities for the squared norm:
$$\mathbb{E}\,\|Y\|^2 = d\sigma^2, \quad \mathrm{Var}\big(\|Y\|^2\big) = 2d\sigma^4.$$
Since the components of $Y$ are jointly independent, and $\|Y\|^2$ is a sum of $d$ i.i.d. terms, we resort to the central limit theorem, where with high confidence we have
$$\|Y\| = \sqrt{d}\,\sigma + O(\sigma).$$
This norm concentration is visualized in Fig. 1(b). In contradiction to our low-dimensional intuition, the isotropic normal distribution is not concentrated close to the mode of its density; the opposite in fact, very far away, on a thin spherical shell of width $O(\sigma)$ around the sphere of radius $\sqrt{d}\,\sigma$. The norm concentration suggests a deeper concentration of measure, where the deviation of the norm from $\sqrt{d}\,\sigma$ has a sub-Gaussian tail:
$$\mathbb{P}\Big(\big|\,\|Y\| - \sqrt{d}\,\sigma\,\big| \geq t\Big) \leq 2\exp\!\big(-c\, t^2 / \sigma^2\big).$$
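The norm concentration is easy to verify numerically (a quick simulation; the choices $d = 10^4$, $\sigma = 1$, and 200 trials are ours):

```python
import math, random

random.seed(0)
d, sigma, trials = 10_000, 1.0, 200

norms = []
for _ in range(trials):
    y = [random.gauss(0.0, sigma) for _ in range(d)]
    norms.append(math.sqrt(sum(v * v for v in y)))

mean_norm = sum(norms) / trials   # concentrates at sqrt(d)*sigma = 100
spread = max(norms) - min(norms)  # O(sigma), not O(sqrt(d)*sigma)
```

Despite living in 10,000 dimensions, every draw lands within a few $\sigma$ of the radius $\sqrt{d}\,\sigma = 100$; the shell is thin relative to its radius.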
A related result states that the norm deviates from $\sqrt{d}\,\sigma$ by only $O(\sigma)$ in expectation:
$$\mathbb{E}\,\big|\,\|Y\| - \sqrt{d}\,\sigma\,\big| \leq C \sigma.$$
The $C$ and $c$ in the above equations are absolute constants (Vershynin, 2018).
The important take-away is that in high dimensions one can approximate the normal distribution with the uniform distribution on the surface of the sphere of radius $\sqrt{d}\,\sigma$:
$$\mathcal{N}(0, \sigma^2 I_d) \approx \mathrm{Unif}\big(\sqrt{d}\,\sigma\;\mathbb{S}^{d-1}\big),$$
where the approximation turns into equality as $d \to \infty$. We will come back to this equation in Section 6 in discussing the geometry of the DEEN objective in high dimensions.
The basic intuition behind concentration results is that it is difficult for a large number of (sufficiently) independent variables to work together to simultaneously pull a sum, or a more general (sufficiently regular) nonlinear combination of them, too far away from its mean (Tao, 2012).
We finish this section with a quantitative analysis of the disintegration of low-dimensional manifolds when a density is convolved with the Gaussian kernel $\mathcal{N}(0, \sigma^2 I_d)$. We mentioned in the Introduction that for the empirical distribution, the manifold dimension expands from zero to $d - 1$. Here we consider an example, exploring this phenomenon, where the underlying density is smooth itself, making a stronger case for manifold disintegration.
Consider a high-dimensional Gaussian distribution in $\mathbb{R}^d$. We can impose a simple low-dimensional "manifold structure" of dimension $k \ll d$ by considering the following covariance matrix:
$$\Sigma = \mathrm{diag}\big(\tau_1^2, \dots, \tau_k^2, \eta^2, \dots, \eta^2\big),$$
where $\eta \ll \tau_l$, and the $k$-dimensional structure is placed on top. Now add Gaussian noise, or equivalently consider the empirical Bayes setup of Section 4 with the measurement kernel $\mathcal{N}(0, \sigma^2 I_d)$:
$$Y = X + \mathcal{N}(0, \sigma^2 I_d), \quad X \sim \mathcal{N}(0, \Sigma).$$
We repeat the norm concentration calculations that we started this section with for $Y$:
$$\mathbb{E}\,\|Y\|^2 = \sum_{l=1}^{k} \tau_l^2 + (d-k)\,\eta^2 + d\sigma^2,$$
and the variance of the squared norm is simply the sum of variances:
$$\mathrm{Var}\big(\|Y\|^2\big) = 2\sum_{l=1}^{k} \big(\tau_l^2 + \sigma^2\big)^2 + 2(d-k)\big(\eta^2 + \sigma^2\big)^2.$$
In high dimensions, assuming $\sum_l \tau_l^2 \ll d\sigma^2$ and $\eta \ll \sigma$, the "manifold terms" become negligible, and we have the following with high confidence:²
$$\|Y\| \approx \sqrt{d}\,\sigma.$$
²The simpler norm concentration calculation we started this section with is obtained in the special case of pure noise, by setting $k = 0$ and $\eta = 0$.
In summary, after the convolution with the Gaussian $\mathcal{N}(0, \sigma^2 I_d)$, the $k$-dimensional manifold, characterized by the covariance $\Sigma$, is transformed into a $(d-1)$-dimensional manifold, where the norm is concentrated at $\sqrt{d}\,\sigma$. In our terminology, the $k$-dimensional manifold has been disintegrated. The take-away of this section is that in high dimensions smoothing/convolving a distribution with the Gaussian kernel drastically changes its concentration of measure and "dissolves" the underlying manifold. In the next section, we discuss how we restore the manifold with empirical Bayes. There is a solid intuition behind manifold disintegration. As $d$ becomes large, it becomes very likely that some components of the noise are drawn from the unlikely tail of the normal distribution $\mathcal{N}(0, \sigma^2)$. In other words, events that were highly unlikely in low dimensions become very likely in high dimensions. These events are abundant (as concentration of measure shows) and dissolve the measure near low-dimensional manifolds.
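The disintegration can be simulated directly (a sketch with hypothetical numbers: a $k = 5$ dimensional "manifold" of scale $\tau = 3$ embedded in $d = 10^4$ dimensions, with noise $\sigma = 1$ and $\eta = 0$):

```python
import math, random

random.seed(1)
d, k, tau, sigma = 10_000, 5, 3.0, 1.0

def noisy_sample():
    # x varies only along its first k coordinates (the "manifold")
    x = [random.gauss(0.0, tau) for _ in range(k)] + [0.0] * (d - k)
    # y = x + N(0, sigma^2 I_d)
    return [xi + random.gauss(0.0, sigma) for xi in x]

norms = [math.sqrt(sum(v * v for v in noisy_sample())) for _ in range(50)]
mean_norm = sum(norms) / len(norms)
# the k-dimensional structure is invisible in the norm: it concentrates at
# sqrt(k*tau^2 + d*sigma^2), which is within a fraction of a percent of
# sqrt(d)*sigma = 100
```

The manifold scale $\tau$ leaves essentially no trace in where the smoothed measure sits.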
4 Empirical Bayes: a review
In his seminal work of 1955, "An Empirical Bayes Approach to Statistics", Robbins considered an intriguing scenario of a random variable $Y$ having a probability distribution that depends in a known way on an unknown random variable $X$ (Robbins, 1955). For the purpose of this paper, consider $Y$ to be a noisy observation of $X$. Observing $Y = y$, the least squares estimator of $X$ was derived to be the Bayes estimator³:
$$\hat{x}(y) = \mathbb{E}[X \mid Y = y] = \frac{\int x\, p(y \mid x)\, p_X(x)\, dx}{\int p(y \mid x)\, p_X(x)\, dx}.$$
³The least squares estimator is the minimizer of $\mathbb{E}\,\|X - \hat{x}(Y)\|^2$.
He then asked a deep question (paraphrasing here): if the prior on $X$ is known to the experimenter, then $\hat{x}(y)$ given by the above expression is a computable function; but what if the prior on $X$ is not known? "What if" was a rhetorical question of course, and his quest was to find settings where the dependency of the estimator on the prior disappears, as if there were an "abstraction barrier" between $X$ and $Y$, where the knowledge of the prior is not needed to recover $X$ (in the least squares sense). He achieved this result for Poisson, geometric, and binomial measurement kernels; the result was later extended to the univariate Gaussian kernel by Miyasawa (1961).
We repeat the calculation here for the multivariate Gaussian kernel. Take $X$ to be a random variable in $\mathbb{R}^d$ and $Y$ a noisy observation of $X$ with a known Gaussian kernel with symmetric positive-definite covariance $\Sigma$:
$$p(y \mid x) = \mathcal{N}(y;\, x,\, \Sigma).$$
From simple algebra follows:
$$\nabla_y\, p(y \mid x) = \Sigma^{-1} (x - y)\, p(y \mid x).$$
The above equation and $p_Y(y) = \int p(y \mid x)\, p_X(x)\, dx$ are combined next:
$$\nabla_y\, p_Y(y) = \Sigma^{-1} \int (x - y)\, p(y \mid x)\, p_X(x)\, dx,$$
which leads to the following expression for $\hat{x}(y) = \mathbb{E}[X \mid Y = y]$:
$$\hat{x}(y) = y + \Sigma\, \nabla_y \log p_Y(y).$$
The above equation is referred to as the nonparametric empirical Bayes least squares (NEBLS) estimator for Gaussian kernels. For the isotropic case $\Sigma = \sigma^2 I_d$, and with the definition of the energy function in mind:
$$\psi(y) = -\log p_Y(y),$$
the NEBLS estimator takes the form
$$\hat{x}(y) = y - \sigma^2\, \nabla_y \psi(y). \quad (2)$$
The above form is what we come back to throughout the paper. To sum up, the estimator is obtained in a setup where the "experimenter" has access only to the corrupted data $y$ and the measurement kernel $p(y \mid x)$, without any knowledge of the prior $p_X$. The remarkable result here is that for certain measurement kernels (we demonstrated the calculation for the Gaussian case), the dependency on $p_X$ drops out of the least squares estimator $\hat{x}(y)$.
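The result can be checked in the one case where everything is available in closed form: a conjugate Gaussian prior, for which the posterior mean is known exactly (a sketch; the numbers are arbitrary):

```python
# Prior X ~ N(mu0, tau^2), measurement Y = X + N(0, sigma^2), so that the
# marginal p_Y = N(mu0, tau^2 + sigma^2) is available in closed form.
mu0, tau, sigma = 2.0, 1.5, 0.8
y = 3.7

# NEBLS / Miyasawa: x_hat(y) = y + sigma^2 * d/dy log p_Y(y)
score = -(y - mu0) / (tau ** 2 + sigma ** 2)
x_hat = y + sigma ** 2 * score

# classical conjugate-Gaussian posterior mean E[X | Y = y]
posterior_mean = (tau ** 2 * y + sigma ** 2 * mu0) / (tau ** 2 + sigma ** 2)
```

The two expressions agree identically: the estimator recovers $\mathbb{E}[X \mid Y = y]$ while only ever touching the density of $Y$.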
Colloquially speaking, there is an "abstraction barrier" between $X$ and $Y$, where in estimating $X$ from measurements of $Y$, the prior on $X$ can be completely ignored.
Robbins' results together with Miyasawa's were further generalized and put into a unified framework by Raphan and Simoncelli (2011). Our notation $X$ and $Y$ is adapted from that paper, instead of the notation used in (Robbins, 1955) for the clean/unknown random variable and the noisy/measured random variable. There is also a deep connection between Miyasawa's least squares estimator and Stein's unbiased risk estimate (Stein, 1981), where again for Gaussian noise the risk estimate has a closed form as expectations over the measurements alone. In (Raphan and Simoncelli, 2011), Stein's unbiased risk estimate for Gaussian noise is also generalized to other noise models and unified with the empirical Bayes methodology.
5 Neural empirical Bayes
DEEN is a recently proposed algorithm for learning energy models (Saremi et al., 2018). Here, we arrive at its learning objective from the more general perspective of empirical Bayes. We start with a simple Gedankenexperiment, and introduce a learning signal and a learning objective as a result. Consider a random variable $X$ in $\mathbb{R}^d$ having density $p_X$, to which we otherwise have "noiseless" access, and let $\{x_i\}_{i=1}^{n}$ be i.i.d. samples drawn from $p_X$. Now consider the kernel density estimator with the Gaussian kernel:
$$\hat{p}(y) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{N}\big(y;\, x_i,\, \sigma^2 I_d\big).$$
The kernel density estimator is simply a Gaussian mixture with $n$ components centered at the samples $x_i$. Taking samples from it is akin to randomly selecting a center $x_i$ and adding Gaussian noise to it: $y = x_i + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_d)$. We therefore view $\hat{p}$ in the empirical Bayes setup, and this is the reason its argument is chosen to be $y$ (instead of the conventional $x$).
Now consider learning the energy function $\psi$ instead of the density itself. The power of this setup is rooted in the fact that the function $\psi$ that minimizes $\mathbb{E}\,\|x - y + \sigma^2 \nabla_y \psi(y)\|^2$ is, modulo an additive constant, the negative log-pdf (Robbins, 1955): $\psi(y) = -\log p_Y(y) + \mathrm{const}$. All this comes together in the following objective, written for finite samples $x_i$ in the $X$ space and $y_{ij}$ in the $Y$ space (with an eye towards algorithmic implementation):
$$\mathcal{L} = \sum_{ij} \big\|\, x_i - \hat{x}(y_{ij}) \,\big\|^2.$$
The term $\|x_i - \hat{x}(y_{ij})\|^2$ is the learning signal, the squared-norm discrepancy between $x_i$ (which we do have samples of, as opposed to Robbins' setup, where $X$ is completely unknown) and its NEBLS estimator. The twist to the setup in (Robbins, 1955) is that $y_{ij}$ is generated by us. We corrupt the data ourselves to learn the density in the $Y$ space. Next, we parameterize the energy function with the parameters $\theta$ of an MLP, and we arrive at the following parametric form of the objective:
$$\mathcal{L}(\theta) = \sum_{ij} \big\|\, x_i - y_{ij} + \sigma^2\, \nabla_y \psi(y_{ij}; \theta) \,\big\|^2. \quad (3)$$
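To make the learning signal concrete, the following sketch evaluates it with the exact kernel-density score standing in for the network (the optimum the network is trained toward); for a Gaussian mixture, the NEBLS estimate reduces to a softmax-weighted mean of the centers. The function names are ours:

```python
import math, random

def x_hat_kde(y, data, sigma):
    """NEBLS estimate y + sigma^2 * grad log p_hat(y) for the Gaussian KDE p_hat.
    For a mixture of N(x_i, sigma^2 I) this is the softmax-weighted mean of
    the centers x_i."""
    logits = [-sum((yi - xi) ** 2 for yi, xi in zip(y, x)) / (2 * sigma ** 2)
              for x in data]
    m = max(logits)
    w = [math.exp(l - m) for l in logits]
    s = sum(w)
    return [sum(wi * x[j] for wi, x in zip(w, data)) / s for j in range(len(y))]

def learning_signal(data, sigma, n_noisy=20, seed=0):
    """Empirical objective: average ||x_i - x_hat(y_ij)||^2 over noisy samples."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    for x in data:
        for _ in range(n_noisy):
            y = [xi + rng.gauss(0.0, sigma) for xi in x]
            xh = x_hat_kde(y, data, sigma)
            total += sum((a - b) ** 2 for a, b in zip(x, xh))
            count += 1
    return total / count
```

With a single sample the estimate is exact and the signal vanishes; with overlapping spheres the centers compete and the signal is strictly positive.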
In summary, the idea is to “encode” the log-pdf (the energy function) with a neural network and use the probabilistic machinery of empirical Bayes to “decode” signals. There is a “denoising philosophy” at work here but without a traditional decoder (more discussions at the end of Section 8 and Remark 8).
The neural empirical Bayes objective we advocate here and the denoising score-matching objective of (Vincent, 2011) used in deep energy estimator networks (Saremi et al., 2018) are the same for Gaussian noise, but this, very crucially, breaks down for other noise models — see Table 1 in (Raphan and Simoncelli, 2011) for a summary of empirical Bayes for different measurement/noise models. That being said, in this paper, we refer to our framework as DEEN, since the end result is indeed a deep energy estimator neural network.
6 Interacting high-dimensional spheres
In this section, we build some intuition for the complex interacting many-body learning dynamics of the DEEN objective. It suffices to consider two samples, and we start with a single sample $x_1$. Consider very high dimensions, where the samples $y_{1j}$ are almost uniformly distributed on the $x_1$-sphere centered at $x_1$. This problem (the functional minimization of the objective) has a closed-form solution: $\psi(y) = \|y - x_1\|^2 / (2\sigma^2)$, modulo an additive constant. As expected (by construction), this corresponds to the Gaussian distribution centered at $x_1$, and the probability flow points to the center of the $x_1$-sphere. The concentration of measure is irrelevant for a single sample, but it is used next.
Now consider a more interesting dataset with two samples $x_1$ and $x_2$ (see Fig. 2 for schematics). DEEN's optimization in learning $\psi$ for this two-sample dataset falls into (at least) three different regimes, depending on whether $\|x_1 - x_2\|$ is less than, equal to, or larger than $2\sqrt{d}\,\sigma$. Very crucially, the single-sphere solution above, say for the $x_1$-sphere, accumulates error for samples on the second sphere, and vice versa. The function $\psi$ that minimizes the DEEN objective must take both $x_1$ and $x_2$ and the samples on the two spheres into account. This effectively induces an "interaction" between the $x_1$-sphere and the $x_2$-sphere, where their "competition" shapes the energy function $\psi$. Obviously, the optimal solution depends strongly on how much the two spheres overlap. The regime $\sigma \to 0$ is the non-interacting regime, and it corresponds to the empirical distribution of the two samples. As $\sigma$ is increased, the two spheres grow in size, and their interactions increase as well. This example begins to paint an interacting many-body picture (admittedly qualitative) where all $x_i$-spheres interact (more or less).
In summary, DEEN in high dimensions is abstracted as a system of highly interacting $x_i$-spheres centered at $x_i$, $i \in \{1, \dots, n\}$. The spheres interact via the probability flow field $-\nabla_y \psi(y; \theta)$. In other words, in minimizing the objective (Eq. 3), the optimizer learns the optimal probability flow by taking all the competing terms into account.
7 “Extreme” denoising
DEEN is powered by the nonperturbative NEBLS estimator (Eq. 2), rewritten here for the neural network parametrization of the energy function:
$$\hat{x}_\theta(y) = y - \sigma^2\, \nabla_y \psi(y; \theta).$$
The estimator is called nonperturbative as it is valid for any $\sigma$. As discussed in the previous section, $\sigma$ controls how much the $x_i$-spheres overlap/interact (Fig. 2), which is crucial in this work. It is therefore important to test the denoising performance of DEEN for a large range of $\sigma$. The $x_i$-sphere overlaps of course also depend on the dataset and how the samples are placed in high dimensions. In designing experiments, and following the previous section, we define the matrix $D$:
$$D_{ij} = \frac{\|x_i - x_j\|}{2\sqrt{d}\,\sigma}.$$
We would like DEEN to be robust in the extreme noise regime $\sigma > \sigma_c$, defined as
$$\sigma_c = \max_{ij} \frac{\|x_i - x_j\|}{2\sqrt{d}}.$$
The experiments are presented for the handwritten digit database MNIST ($d = 784$), with pixel values in the range $[0, 1]$, where $\sigma_c$ was estimated from the matrix $D$ for 10000 samples (see Table 1). In Fig. 3, DEEN's single-step denoising performance is presented for a range of $\sigma$. For $\sigma > \sigma_c$, even the farthest $x_i$-spheres overlap. From this perspective, $\sigma_c$ sets the onset of "extreme noise", what we also refer to as the "critical regime". (This value of $\sigma$ is seen again in Sections 8 and 10, in the presentation of the walk-jump sampling and in the emergence of creative modes.) However, DEEN's single-step denoising was robust well beyond $\sigma_c$. An example of this (very) extreme denoising is presented in Fig. 4, in a regime where the clean samples $x_i$ are themselves inside the $x_j$-spheres for all $j$. The results in Fig. 4 are presented to show DEEN's robustness well above $\sigma_c$. This might be of interest from the perspective of signal processing, but we are not interested in this regime in the remainder of this paper.
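In code, the overlap diagnostic is a pairwise distance computation (a sketch; $D_{ij} = \|x_i - x_j\|/(2\sqrt{d}\,\sigma)$, with spheres $i$ and $j$ overlapping iff $D_{ij} < 1$, is our reading of the overlap criterion, so the exact normalization used in Table 1 may differ):

```python
import math

def overlap_matrix(data, sigma):
    """D_ij = ||x_i - x_j|| / (2 sqrt(d) sigma); spheres i, j overlap iff D_ij < 1."""
    d, n = len(data[0]), len(data)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(data[i], data[j])))
            D[i][j] = D[j][i] = dist / (2 * math.sqrt(d) * sigma)
    return D

def sigma_critical(data):
    """Smallest sigma at which every pair of x_i-spheres overlaps."""
    return max(max(row) for row in overlap_matrix(data, 1.0))  # D with sigma = 1
```

On a real dataset one would estimate $\sigma_c$ from a subsample, since the computation is quadratic in $n$.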
The denoising results are a significant improvement over (Saremi et al., 2018). This is in part due to the use of ConvNets (instead of a fully connected network) and also the choice of a smooth nonlinearity (instead of the ReLU). This nonlinearity was studied independently from different angles (and with different names) in (Elfwing et al., 2018; Hendrycks and Gimpel, 2016; Ramachandran et al., 2017). The choice of nonlinearity might be especially crucial in our case because of the double backpropagation involved in a single SGD update.
The objective (Eq. 3) is intriguing from the optimization perspective. It is "self-regularized", as $\nabla_y \psi$ is evaluated at the noisy samples $y_{ij}$ on the $x_i$-spheres, not at $x_i$. Also, what enters the objective is not the output of the neural network $\psi$, but its gradient $\nabla_y \psi$. This leads to double backpropagation: one pass with respect to $y$ to evaluate the objective, and then one with respect to $\theta$ for the SGD update.
8 Walk-jump sampling: Langevin walk, Robbins jump
Now we come to one of our main technical contributions: a new sampling scheme based on combining the empirical Bayes formalism with the geometric intuition based on $x_i$-spheres. Sampling arbitrary distributions in high dimensions is in general intractable. This is in part due to the fact that the probability mass of complex distributions is concentrated in sharp ridges that are separated by large regions of low probability, making MCMC methods ineffective. We relax the problem of sampling in the $X$ space (from the density $p_X$) to sampling from the smoother density in the $Y$ space, obtained by convolving $p_X$ with the Gaussian:
$$p_Y = p_X * \mathcal{N}(0, \sigma^2 I_d),$$
where the kernel bandwidth $\sigma$ controls the smoothness. The problem is thus transformed into the easier one of sampling the smoother density $p_Y$ compared to $p_X$. We have learned so far that we can estimate $-\log p_Y$ by parametrizing the energy function with the parameters $\theta$ of a deep neural network and optimizing for $\theta$:
$$\psi(y; \theta) \approx -\log p_Y(y).$$
Once we know $\psi(\,\cdot\,; \theta)$, we can sample from the corresponding density with Langevin MCMC. What about samples in the $X$ space? Since $Y$ was constructed as $Y = X + \mathcal{N}(0, \sigma^2 I_d)$, the NEBLS estimator gives an efficient single-step procedure (what we have referred to as the Robbins jump, or jump for short) to generate approximate samples from $p_X$ without the knowledge of the density itself:
$$\hat{x}(y) = y - \sigma^2\, \nabla_y \psi(y; \theta).$$
What emerges is a simple algorithm for generating samples in the $X$ space. We first sample in the $Y$ space with Langevin MCMC walks by discretizing the Langevin diffusion, written compactly:
$$y_{t+1} = y_t - \delta\, \nabla_y \psi(y_t; \theta) + \sqrt{2\delta}\;\varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, I_d). \quad (5)$$
Samples in the $X$ space at MCMC time $t$ are then generated with Robbins jumps:
$$\hat{x}_t = y_t - \sigma^2\, \nabla_y \psi(y_t; \theta). \quad (6)$$
Note that $t$ can be any value, as opposed to the step size $\delta$, which is small and must be chosen judiciously. This is the reason we referred to this step as a jump, in comparison to the MCMC walk controlled by $\delta$. The combined sampling (combination of Eq. 5 and Eq. 6) is referred to as walk-jump sampling. Note the separation between the $X$ space and the $Y$ space in this scheme: the Langevin diffusion is simulated in the $Y$ space, not the $X$ space, and samples $\hat{x}_t$ are generated for arbitrary times $t$.⁴
⁴To ensure i.i.d. samples, $t$ must be chosen with care, at some large values, e.g. at multiples of the autocorrelation time of the Langevin dynamics.
There are two approximations made in obtaining walk-jump samples. First, the Langevin diffusion is run not on $-\log p_Y$ but on its approximation $\psi(y; \theta)$. The second is the Robbins jump step, which is a least squares approximation itself.
In summary of walk-jump sampling: Gaussian smoothing maps the low-dimensional manifold of the data distribution in the $X$ space into a much higher dimensional manifold in the $Y$ space, where the smooth density $p_Y$ is estimated/approximated by the SGD optimization of the DEEN objective, arriving at $\psi(y; \theta)$. The Langevin MCMC walks on the estimated energy, generating samples in the $Y$ space. The low-dimensional manifold in the $X$ space is then restored (in the least-squares sense) with Robbins jumps.
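The full scheme fits in a few lines when the learned energy is replaced by an exact one (a sketch on a two-component Gaussian mixture, whose energy $\psi = -\log p_Y$ has a closed-form gradient; the step size and step count are arbitrary choices of ours):

```python
import math, random

def grad_psi(y, centers, sigma):
    """Gradient of psi(y) = -log p_Y(y) for a Gaussian mixture centered at
    `centers` (standing in for the learned psi(y; theta))."""
    logits = [-sum((yi - ci) ** 2 for yi, ci in zip(y, c)) / (2 * sigma ** 2)
              for c in centers]
    m = max(logits)
    w = [math.exp(l - m) for l in logits]
    s = sum(w)
    mean = [sum(wi * c[j] for wi, c in zip(w, centers)) / s for j in range(len(y))]
    return [(yi - mi) / sigma ** 2 for yi, mi in zip(y, mean)]

def walk_jump(centers, sigma, n_steps=500, delta=0.01, seed=0):
    rng = random.Random(seed)
    y = list(centers[0])                              # initialize the walk
    for _ in range(n_steps):                          # Langevin walk in Y space
        g = grad_psi(y, centers, sigma)
        y = [yi - delta * gi + math.sqrt(2 * delta) * rng.gauss(0.0, 1.0)
             for yi, gi in zip(y, g)]
    g = grad_psi(y, centers, sigma)
    return [yi - sigma ** 2 * gi for yi, gi in zip(y, g)]  # Robbins jump to X
```

The walk never consults the jump; the jump can be applied at any time $t$ along the chain.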
The qualitative geometric picture of DEEN as interacting $x_i$-spheres has consequences. Take MNIST, for example. Perhaps we would like to be in the sought-after regime of being able to walk from one class to all other classes, while exploring varieties of styles within a class. It is intuitively clear that this corresponds to the regime where most/all spheres overlap. We tested this for MNIST in the regime of mostly non-overlapping $x_i$-spheres and in the highly overlapping regime (see Table 1). The results are given in Fig. 5 and Fig. 6, respectively. In the former regime, samples stay within a class while the styles change during the walk-jump; in the latter, there is mixing between styles/classes.
The "denoising" aspect of empirical Bayes draws attention to denoising autoencoders. It is important to observe that our framework is not an autoencoder, as we do not have a decoder. The neural network just "encodes" the energy, and the denoising is through Robbins jumps. More technical comparisons are given below. Let us start by observing that in walk-jump sampling, the jump to the $X$ space is not required in running the Langevin dynamics. This is in contrast with the sampling scheme of generalized denoising autoencoders (Bengio et al., 2013). This distinction is deep, and it is rooted in the fact that we learn the energy directly (in the $Y$ space), as opposed to denoising autoencoders, which learn the score function indirectly/implicitly. The other comparison (and the significant advantage we have here) lies in the nonperturbative nature of our learning objective. By contrast, denoising autoencoders learn the score function only in the limit $\sigma \to 0$ (Alain and Bengio, 2014). The generalized denoising autoencoder (Bengio et al., 2013) was devised as a remedy for the limitations of denoising autoencoders, but those limitations are avoided altogether in empirical Bayes with its jump scheme. Also, in making visual comparisons, the samples in Fig. 6 are sharper and not "prototypical", compared to samples from the generalized denoising autoencoders, Fig. 4 in (Bengio et al., 2013).
The technical comparisons with denoising autoencoders aside, here we have a clear geometrical notion of what we mean by large and small $\sigma$, based on the overlapping $x_i$-spheres and the statistics of the matrix $D$ (Table 1). This picture is lacking in the (generalized) denoising autoencoder literature, where it is not clear what values of $\sigma$ should be considered.
9 Robbins associative memory
In this section, we use the DEEN methodology to define a new kind of associative memory. Associative memory (also known as content-addressable memory) is a deep subject in neural networks with a rich history, with roots in psychology and neuroscience (long-term potentiation and Hebb’s rule). The depiction of this computation is even present in the arts and literature, championed by Marcel Proust, in the form of “stream of consciousness”, and the constant back and forth between perception and memory.
In 1982, Hopfield brought together earlier attempts to formulate associative memory, and showed that the collective computation of a system of neurons with symmetric weights, in the form of asynchronous updates of McCulloch-Pitts neurons, minimizes an energy function. Associative memory was then formulated as the flow to the local minima of the energy function, what he referred to as attractors of the "phase space flow" (Hopfield, 1982). This seminal work formed a bridge between statistical mechanics, with its rich literature on Ising models and spin glasses, and neural network research. Hopfield networks were also instrumental in the development of the Boltzmann machine (Hinton and Sejnowski, 1986).
Hopfield's initial formulation, identifying the synaptic connections with a Hebb-like outer product rule and a quadratic energy function, proved to be very limiting. These limitations are discussed in depth in the literature, most recently in (Hillar et al., 2012; Krotov and Hopfield, 2016; Chaudhuri and Fiete, 2017), where the authors propose alternative solutions we highlight here: (Hillar et al., 2012) keep the second-order interaction but set the weights to be those of a fully visible Boltzmann machine (instead of the outer product rule), (Krotov and Hopfield, 2016) keep the outer-product rule but leverage non-quadratic (polynomial and rectified polynomial) energy functions, and (Chaudhuri and Fiete, 2017) explore capacity gains with RBM architectures and expander graph connectivity. It is worth noting that all these papers were limited to binary units. The associative memory we will shortly define shares all three key features we just highlighted, but it is more general, and it is not limited to binary units.
We define Robbins associative memory, à la Hopfield, as the flow to the local minima of the learned energy function f(y). The memory is therefore stored implicitly in a deep neural network. In retrieving a memory, one flows deterministically to a local minimum (attractor), where the flow itself is computed by the neural network. This computation is efficient, at the cost of evaluating ∇f(y), the cost of backpropagation in input dimensions.
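Retrieval is therefore plain gradient descent on the energy. The sketch below is a minimal illustration of that flow, assuming a toy analytic energy (the negative log of a Gaussian mixture centered on the stored patterns) as a stand-in for the trained network; the function names and parameter values are ours, chosen for illustration.

```python
import numpy as np

def energy_and_grad(y, patterns, sigma):
    """Toy smoothed energy f(y) = -log sum_i N(y; x_i, sigma^2 I),
    standing in for the learned network, and its gradient."""
    diffs = y - patterns                               # (n, d)
    sq = np.sum(diffs**2, axis=1) / (2 * sigma**2)
    w = np.exp(-(sq - sq.min()))                       # stabilized weights
    w /= w.sum()
    f = sq.min() - np.log(np.sum(np.exp(-(sq - sq.min()))))
    grad = (w[:, None] * diffs).sum(axis=0) / sigma**2
    return f, grad

def retrieve(y0, patterns, sigma, lr=0.05, steps=500):
    """Associative recall: gradient descent on f, flowing to an attractor."""
    y = y0.copy()
    for _ in range(steps):
        _, g = energy_and_grad(y, patterns, sigma)
        y -= lr * g
    return y

rng = np.random.default_rng(0)
patterns = 3.0 * np.eye(8)[:3]                 # three well-separated "memories" in R^8
probe = patterns[1] + 0.05 * rng.normal(size=8)  # corrupted version of memory 1
out = retrieve(probe, patterns, sigma=0.3)
print(np.linalg.norm(out - patterns[1]))       # tiny: the flow recovered memory 1
```

Initializing near a stored pattern, the descent flows back to (a point very near) that pattern, which is exactly the attractor picture described above.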
What is special about the local minima of f(y)? We start with the simple observation that for a single sample x, the learned energy is minimized at y = x. In the associative memory jargon, x is the attractor of the probability flow field (see Fig. 1(c)). Also, as we discussed in Section 6, for two samples f(y) has local minima close to x_1 and x_2. This picture holds in general for many samples, as the learning objective is optimized for an energy function whose gradient flow field points to the samples x_i. This simple observation motivated the definition of Robbins associative memory as attractors of the learned flow field. The local minima of f(y) shift away from x_i due to the interactions between σ√d-spheres. This shift is referred to as memory interference. There is a duality between memory interference and σ√d-sphere interactions.
The shift is, however, not pronounced in the regime of weakly overlapping σ√d-spheres, as seen in Fig. 7. The memory interference becomes pronounced above the critical regime, visualized in Fig. 8. Finally, note that in Robbins associative memory, unseen samples from the data distribution (the samples in the test set, for example) are also close to the local minima of the energy function. In other words, there is no distinction between training and test set for Robbins associative memory, assuming the learned energy is a good approximation. To highlight this important fact, the associative memory flows to attractors in Fig. 7 and Fig. 8 are shown on the test set instead.
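The scale σ√d that sets the sphere radius, and the thinness of the shell around it, can be checked numerically in a few lines. This is our own illustration, not from the paper; d = 784 mimics the MNIST dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, d, n = 0.5, 784, 10_000
noise = sigma * rng.normal(size=(n, d))   # Gaussian noise vectors in R^d
norms = np.linalg.norm(noise, axis=1)
print(norms.mean())   # ~ sigma * sqrt(d) = 14.0
print(norms.std())    # ~ sigma / sqrt(2) ≈ 0.35: a thin O(1) shell
```

The mean radius grows like σ√d while the spread stays O(σ), which is the concentration of measure picture: essentially all the Gaussian mass sits on a thin shell far from the mode.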
On the surface, Robbins associative memory follows a deterministic dynamics. However, the deterministic dynamics is derived from a probability flow field. In this respect, Robbins associative memory is a probabilistic notion of associative memory in its core.
10 Emergence of creative modes
We finish the paper with some surprising results on the emergence of creative memories/modes. To ground the concept of emergence from the angle of interacting many-body systems, we highlight the classic paper "More is different" (Anderson, 1972): an interacting system can show emergent behavior very different from the sum of its parts. Such emergent behaviors are well understood in many-body physics, in the theory of critical phenomena and, more generally, symmetry breaking phenomena:
“… the state of a really big system does not at all have to have the symmetry of the laws which govern it; in fact, it usually has less symmetry. The outstanding example of this is the crystal: Built from a substrate of atoms and space according to laws which express the perfect homogeneity of space, the crystal suddenly and unpredictably displays an entirely new and very beautiful symmetry.”
The history of physics is filled with a rich array of condensed matter phenomena where the underlying laws are simple and well understood, yet the phenomena are complex and surprising. Our results in this section can be considered a machine learning analogue of the emergent phenomena in many-body physics. The results show the richness of the interactions between the σ√d-spheres in high dimensions, even for a model trained on MNIST, the relatively simple dataset of handwritten digits.
In the previous section, we defined Robbins associative memory via the deterministic attractors of the probability flow field. The associative memory flows to a "nearby" local minimum of the energy function f(y). The memory is designed to be well behaved if initialized close to a sample x, since from empirical Bayes, samples taken from the data distribution are close to local minima of f(y). However, very crucially, memories interfere due to interactions between σ√d-spheres. Here, we explore this interacting many-body aspect of our model further by probing the phase space more widely: we initialize the memory at a random point sampled uniformly in the hypercube instead. The results are presented in Fig. 9, where the system converges to remarkably rich creative modes (except for a few memories that can be interpreted as handwritten digits). The emergence of creative memories/modes provides the strongest evidence for the complexity of the many-body interactions we have discussed throughout the paper. Intuitively, we expect this interaction to be strongest in the critical regime, where the spheres are highly overlapping.
All random initializations in the unit hypercube converged to a solution and they are all shown in Fig. 9.
We did not find such "creative" images by running the Langevin dynamics. They appear to be some (of the many) isolated modes of the learned density. Note the fine distinction in high dimensions between the modes of a distribution and where its measure is concentrated.
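The interference mechanism behind such emergent modes is already visible in one dimension. In the toy sketch below (our own construction, again using a Gaussian-mixture stand-in for the learned energy), two samples produce two attractors when the smoothing is weak, but a single emergent attractor at their midpoint, a point that is neither sample, once the smoothing is strong enough for the spheres to overlap:

```python
import numpy as np

def grad_f(y, xs, sigma):
    # gradient of f(y) = -log sum_i exp(-(y - x_i)^2 / (2 sigma^2)), 1-D
    sq = (y - xs)**2 / (2 * sigma**2)
    w = np.exp(-(sq - sq.min()))
    w /= w.sum()
    return np.sum(w * (y - xs)) / sigma**2

def descend(y, xs, sigma, lr=0.01, steps=5000):
    # deterministic flow to a local minimum of the energy
    for _ in range(steps):
        y -= lr * grad_f(y, xs, sigma)
    return y

xs = np.array([-1.0, 1.0])            # two stored samples
print(descend(0.3, xs, sigma=0.5))    # weak smoothing: attractor near x = 1
print(descend(0.3, xs, sigma=1.5))    # strong smoothing: single attractor at 0
```

For an equal mixture of two Gaussians the density is bimodal only when the separation exceeds 2σ; past that point the two minima merge into one emergent minimum that corresponds to no stored sample, the 1-D caricature of a "creative mode".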
We have a mathematical understanding of the concept of emergence in physics, but we do not have such an understanding for the concept of creativity. In psychology, creativity is defined as the production of an idea, action, or object that is new and valued (Wilson and Keil, 2001). The images in Fig. 9 are not samples from the data distribution, and in that way they are "new". However, how valued they are from the perspective of a learning agent is a question we do not have an answer to.
This work is dense with interdisciplinary concepts; here we summarize our conceptual contributions. We started by exposing the phenomenon of manifold disintegration in high dimensions. Manifold disintegration becomes very intuitive after a deep dive into the concentration of measure phenomenon, but we also considered an example to demonstrate it analytically. Our second conceptual contribution was the fresh take on empirical Bayes. Our two-sided use of the NEBLS estimator allowed for density estimation and the restoration of the manifold; we saw this dual use as a unification of kernel density estimation and empirical Bayes. Perhaps the most novel conceptual contribution here is the physical picture of the emergence of interactions between i.i.d. samples in a dataset. In this work, the interactions emerged because of the probability flow field and the competition to shape a globally defined energy function. In our view, understanding sample-sample interactions may also shed light on the topic of generalization in machine learning, but that is a topic for future research.
On the algorithmic side, we proposed a novel walk-jump sampling algorithm that combines Langevin MCMC walks and Robbins jumps. We also introduced Robbins associative memory, defined by the local minima of the energy function, the (deterministic) attractors of the probability flow field. In contrast to Hopfield networks, Robbins associative memory is probabilistic at its core. Regarding related work, we made comparisons to denoising autoencoders (Alain and Bengio, 2014; Bengio et al., 2013) and to the recent attempts to revive Hopfield networks (Hillar et al., 2012; Krotov and Hopfield, 2016; Chaudhuri and Fiete, 2017). The comparison to denoising autoencoders is highlighted by our nonperturbative formulation and the fact that we learn the energy function directly. The comparison with the recent papers on associative memory is highlighted by the fact that Robbins associative memory has all of their key features: the energy is learned to approximate the log pdf, it is highly nonlinear, and the form of the nonlinearity is adapted to the distribution of the data to be stored. In addition, Robbins associative memory is not limited to binary data.
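The walk-jump scheme can be sketched in a self-contained way, again under a toy stand-in: an analytic Gaussian-mixture energy in place of the trained network, with data points, step sizes, and iteration counts chosen by us for illustration.

```python
import numpy as np

def grad_f(y, xs, sigma):
    """Gradient of a toy smoothed energy f(y) = -log sum_i N(y; x_i, sigma^2 I),
    standing in for the gradient of the learned network."""
    diffs = y - xs                                     # (n, d)
    sq = np.sum(diffs**2, axis=1) / (2 * sigma**2)
    w = np.exp(-(sq - sq.min()))
    w /= w.sum()
    return (w[:, None] * diffs).sum(axis=0) / sigma**2

def walk_jump(xs, sigma, n_steps=2000, eps=1e-3, seed=0):
    """Walk: unadjusted Langevin MCMC in the smooth y-space.
    Jump: a single nonperturbative Robbins jump x_hat = y - sigma^2 grad f(y)."""
    rng = np.random.default_rng(seed)
    y = rng.normal(size=xs.shape[1])
    for _ in range(n_steps):
        y = y - eps * grad_f(y, xs, sigma) \
              + np.sqrt(2 * eps) * rng.normal(size=y.shape)
    return y - sigma**2 * grad_f(y, xs, sigma)

xs = 4.0 * np.eye(8)[:5]          # five well-separated "data" points in R^8
x_hat = walk_jump(xs, sigma=0.4)
print(np.min(np.linalg.norm(x_hat - xs, axis=1)))  # close to some data point
```

The walk explores the smooth density in y-space (where Langevin dynamics mixes well), and the single jump at the end maps the chain back near the data manifold, without ever needing samples from the noiseless density itself.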
The concentration of measure phenomenon was crucial in this work, and it gave us some clarity (more is needed in future work) in designing experiments. The critical regime, where the interactions between σ√d-spheres are strong (but not too strong), was of special interest. Perhaps the most surprising experimental result of this paper is the emergence of creative modes. We understood it as a signature of many-body interactions, but more work is needed to fully understand this phenomenon. A well-defined question here is how the number of local minima of the energy function scales in high dimensions. The question is easy to ask but perhaps hard to answer, as the analysis must depend on the dataset and on the configuration of the σ√d-spheres in high dimensions, but it might be possible to put meaningful bounds on the number of local minima. This is also related to the question of bounds on the capacity of Robbins associative memory.
This work was supported by the Canadian Institute for Advanced Research, the Gatsby Charitable Foundation, and the National Science Foundation through grant IIS-1718991. We especially thank Bruno Olshausen and Eero Simoncelli for discussions.
- Alain and Bengio (2014) Guillaume Alain and Yoshua Bengio. What regularized auto-encoders learn from the data-generating distribution. Journal of Machine Learning Research, 15(1):3563–3593, 2014.
- Anderson (1972) Philip W Anderson. More is different. Science, 177(4047):393–396, 1972.
- Bengio et al. (2013) Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. Generalized denoising auto-encoders as generative models. In Advances in Neural Information Processing Systems, pages 899–907, 2013.
- Chaudhuri and Fiete (2017) Rishidev Chaudhuri and Ila Fiete. Associative content-addressable networks with exponentially many robust stable states. arXiv preprint arXiv:1704.02019, 2017.
- Elfwing et al. (2018) Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 2018.
- Hendrycks and Gimpel (2016) Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. arXiv preprint arXiv:1606.08415, 2016.
- Hillar et al. (2012) Christopher Hillar, Jascha Sohl-Dickstein, and Kilian Koepsell. Efficient and optimal binary Hopfield associative memory storage using minimum probability flow. arXiv preprint arXiv:1204.2916, 2012.
- Hinton and Sejnowski (1986) Geoffrey E Hinton and Terrence J Sejnowski. Learning and relearning in Boltzmann machines. Parallel distributed processing: Explorations in the microstructure of cognition, 1(282-317):2, 1986.
- Hopfield (1982) John J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8):2554–2558, 1982.
- Hyvärinen (2005) Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695–709, 2005.
- Krotov and Hopfield (2016) Dmitry Krotov and John J Hopfield. Dense associative memory for pattern recognition. In Advances in Neural Information Processing Systems, pages 1172–1180, 2016.
- Miyasawa (1961) Koichi Miyasawa. An empirical Bayes estimator of the mean of a normal population. Bulletin of the International Statistical Institute, 38(4):181–188, 1961.
- Parzen (1962) Emanuel Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.
- Ramachandran et al. (2017) Prajit Ramachandran, Barret Zoph, and Quoc V Le. Swish: a self-gated activation function. arXiv preprint arXiv:1710.05941, 7, 2017.
- Raphan and Simoncelli (2011) Martin Raphan and Eero P Simoncelli. Least squares estimation without priors or supervision. Neural Computation, 23(2):374–420, 2011.
- Robbins (1955) Herbert Robbins. An empirical Bayes approach to statistics. In Proc. Third Berkeley Symp., volume 1, pages 157–163, 1955.
- Saremi et al. (2018) Saeed Saremi, Arash Mehrjou, Bernhard Schölkopf, and Aapo Hyvärinen. Deep energy estimator networks. arXiv preprint arXiv:1805.08306, 2018.
- Stein (1981) Charles M Stein. Estimation of the mean of a multivariate normal distribution. The annals of Statistics, pages 1135–1151, 1981.
- Tao (2012) Terence Tao. Topics in random matrix theory. American Mathematical Society, 2012.
- Vershynin (2018) Roman Vershynin. High-dimensional probability: An introduction with applications in data science. Cambridge University Press, 2018.
- Vincent (2011) Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.
- Wilson and Keil (2001) Robert Andrew Wilson and Frank C Keil. The MIT encyclopedia of the cognitive sciences. MIT press, 2001.