The traditional view of a learning system is one where an initial parameter vector $w_0$ is updated until some convergence criterion is met: $w_{t+1} = F(w_t)$, with (in theory) $t \to \infty$ and $w_\infty$ a fixed point of the updates. These updates usually maximize some objective such as the log-likelihood of the data. We can view this process as a dynamical system with a contractive map $F$ which is designed to iterate to a fixed point. The map
can be either deterministic or stochastic. For instance, batch gradient descent is an example of a deterministic map while stochastic gradient descent is an example of a stochastic map. A natural question is whether the existence of a fixed point
is important, and whether meaningful learning systems can exist that do not converge to any fixed point but traverse an attractor set. To answer this question we can draw inspiration from Markov chain Monte Carlo (MCMC) procedures, which generate samples from a posterior distribution $p(w|\mathcal{D})$ (with $\mathcal{D}$ indicating the data). MCMC also generates a sequence of parameter values, but one that does not converge to a fixed point. Rather, the samples form an attractor set with a measure (density) equal to the posterior distribution. One can make meaningful predictions with MCMC chains by making predictions for every sampled model separately and subsequently averaging the predictions. There is also evidence that learning in the brain is a dynamical process. For instance, aihara1982temporally
have described chaotic dynamics in the Hodgkin-Huxley equations for membrane dynamics and studied them experimentally in squid giant axons. Also, much evidence has now been accumulated that synapses are subject to fast dynamical processes such as postsynaptic depression and facilitation (TsodyksPawelzikMarkram98).
Herding (Welling09A) is perhaps the first learning dynamical system based on a deterministic map and with a nontrivial attractor (i.e. not a single fixed point). It emerged from taking the limit of infinite stepsize in the usual (maximum likelihood) updates for a Markov random field (MRF) model. It can be observed that in this limit the parameters will not converge to a fixed point but rather traverse a usually non-periodic trajectory in weight space. The information contained in the data is now stored in the trajectories (or the attractor) of this dynamical system, rather than in a point estimate of a collection of parameters. In fact it can be shown that this dynamical system is neither periodic (under some conditions) nor chaotic, a regime which is associated with “edge of chaos” dynamics. As illustrated in this chapter, by slowly increasing the stepsize (or equivalently lowering the temperature) we will move from a standard MRF maximum likelihood learning system with a single fixed point, through a series of period doublings, to a system on the edge of chaos. One can show that the attractor is sometimes fractal, and that the Lyapunov exponents of this system are equal to zero, implying that two nearby trajectories will eventually separate, but only polynomially fast (and not exponentially fast as with chaotic systems). Many of the dynamical properties of this system are described by the theory of “piecewise isometries” (goetz2000dynamics).
Herding can thus be viewed as a dynamical system that generates state-space samples that are highly similar to the samples that would be generated by a learned MRF model with the same features. The state-space samples satisfy the usual moment matching constraints that define an MRF and can be used for making meaningful predictions. In a way, herding combines learning and inference in one dynamical system. However, the distribution from which herding generates samples is not identical to the associated MRF: while the same moment matching constraints are satisfied, the entropy of the herding samples is usually somewhat lower than the (maximal) entropy of the MRF. The sequence of samples in state space has very interesting properties. First, it forms an infinite memory sequence, as every sample depends on all the previous samples and not just the most recent sample as in Markov sequences. It can be shown that the number of distinct subsequences of length $n$ grows subexponentially in $n$, implying that their (topological) entropy vanishes. For simple systems these sequences can be identified with “low discrepancy sequences” and Sturmian sequences (MorseHedlund40). Probably related to this is the fact that Monte Carlo averages based on these sequences converge as $\mathcal{O}(1/T)$. This should be contrasted with random independent samples from the associated MRF distribution, for which the convergence follows the usual $\mathcal{O}(1/\sqrt{T})$ rate. Herding sequences thus exhibit strong negative auto-correlations, leading to the faster convergence of Monte Carlo averages. It is conjectured that this property is related to the edge of chaos characterization of herding, and that both stochastic systems (such as samplers) and fully chaotic systems will always generate samples that can at most result in $\mathcal{O}(1/\sqrt{T})$ convergence of Monte Carlo averages.
Similar to “perturb and map” (PaYu11a), the execution of the herding map requires one to compute the maximum a posteriori (MAP) state defined by the current parameter setting. While maximization is sometimes easier than computing the expectations required to update the parameters of an MRF, for complex models maximization can also be NP-hard. A natural question is therefore whether one can relax the requirement of finding the MAP state and get away with partial maximization to, say, a local maximum instead of the global maximum. The answer to this question comes from a theorem that was proven a long time ago in the context of Rosenblatt’s perceptron (rosenblatt1958perceptron) and is known as the “perceptron cycling theorem” (PCT) (minsky1969perceptrons). This theorem states precisely which conditions need to be fulfilled by herding at every iteration in order for the algorithm to satisfy the moment constraints. The PCT therefore allows us to relax the condition of finding the MAP state at every iteration, and as a side effect also allows us to run herding in an online setting or with stochastic minibatches instead of the entire dataset. A further relaxation of the herding conditions was described in chen2014herdingbookchapter where it was shown that herding with inconsistent
moments as input (moments that cannot be generated by a single joint probability distribution) still makes sense and generates the Euclidean projections of these moments onto the marginal polytope.
Just as MRF models can be extended to models with hidden variables and to discriminative models such as conditional Markov random field (CRF) models, herding can be generalized along these same dimensions. Herding with hidden variables was described in Welling09B and shown to increase the ability of this dynamical system to represent complex dependencies. Conditional herding was described in GelfandMaatenChenWelling10 and shown to be equivalent to the voted perceptron algorithm freund1999large and to Collins’ “voted HMM” collins2002discriminative in certain special cases. The herding view allowed the extension of these discriminative models to include hidden variables.
Herding is related to (or has been connected to) a number of optimization, learning and inference methods. Herding has obvious similarities to the concept of “fast weights” introduced by TielemanHinton09. Fast weights follow a dynamics that is designed to make the Markov chain embedded in an MRF learning process mix fast. A similar idea was used in breuleux2011quickly to speed up the mixing rate of an (approximate) sampling procedure. By applying herding dynamics conditionally w.r.t. its parent states for every variable in a graphical model, yet another fast-mixing sampling algorithm was developed, called “herded Gibbs” bornn2013herded. Herding was extended in ChenSmolaWelling10 to a deterministic sampling algorithm in continuous state spaces (known as “kernel herding”). The view espoused in that paper led to an analysis of herding as a conditional gradient optimization algorithm (or Frank-Wolfe algorithm) in Bach2012herding, from which an improved convergence analysis emerged as well as generalizations to versions of herding with non-uniform weights. In related work, Huszar12 showed that an optimally weighted version of (kernel) herding is equivalent to Bayesian quadrature, again resulting in faster convergence. harvey2014near focused on the convergence rate of herding with respect to the dimensionality of the feature vector and proposed a new algorithm that scales near-optimally with the dimensionality.
Perhaps the method most closely related to herding is “perturb and map” estimation, where the parameters of an MRF model are perturbed by sampling from a Gumbel distribution, followed by maximization over the states. As with herded Gibbs, the procedure is only “exact” if exponentially many parameters are perturbed. Herding is however different from perturb and map in that the perturbations are generated sequentially and deterministically.
This chapter is built on the results reported earlier in a series of conference papers Welling09A; Welling09B; WellingChen10; ChenSmolaWelling10; GelfandMaatenChenWelling10. Our current understanding of herding is far from comprehensive but rather represents a first attempt to connect learning systems with the theory of nonlinear dynamical systems and chaos. We believe that it opens the door to many new directions of research with potentially surprising and exciting discoveries.
The chapter is organized as follows. In Section 1.2 we introduce the herding algorithm and study its statistical property as both a learning algorithm and a dynamical system. In Section 1.3 we provide a general condition for herding to satisfy the fast moment matching properties, under which the algorithm is extended for partially observed models and discriminative models. We evaluate the performance of the introduced algorithms empirically in Section 1.4. The chapter is concluded with a summary in Section 1.5 and a conclusion in Section 1.6.
1.2 Herding Model Parameters
1.2.1 The Maximum Entropy Problem and Markov Random Fields
Define $x$ to be a random variable in the domain $\mathcal{X}$, and $\phi_\alpha(x)$ to be a set of feature functions of $x$, indexed by $\alpha$. In the maximum entropy problem (MaxEnt), given a data set of observations $\mathcal{D} = \{x_i\}_{i=1}^{N}$, we want to learn a probability distribution $p(x)$ over $\mathcal{X}$ such that the expected features, a.k.a. moments, match the average value observed in the data set, denoted by $\bar{\phi}_\alpha = \frac{1}{N}\sum_{i=1}^{N}\phi_\alpha(x_i)$. For the remaining degrees of freedom in the distribution we assume maximum ignorance, which is expressed as maximum entropy. Mathematically, the problem is to find a distribution $p$ such that:
The dual form of the MaxEnt problem is known to be equivalent to finding the maximum likelihood estimate (MLE) of the parameters of a Markov Random Field (MRF) defined on $x$, with each parameter $w_\alpha$ associated with one feature $\phi_\alpha$:
where the normalization term $Z(w)$ is also called the partition function. The parameters act as Lagrange multipliers to enforce the constraints in the primal form 1.1. Since they assign different weights to the features in the dual form, we will also call them “weights” below.
It is generally intractable to obtain the MLE of the parameters because the partition function involves a sum over a potentially exponential number of states. Take the gradient descent optimization algorithm as an example. Denote the average log-likelihood per data item by
The gradient descent algorithm searches for the maximum of $\ell(w)$ with the following update step:
Notice however that the second term in the gradient, the average over the model distribution $p(x; w)$, is derived from the partition function and cannot be computed efficiently in general. A common solution is to approximate that quantity by drawing samples using Markov chain Monte Carlo (MCMC) at each gradient descent step. However, MCMC is known to suffer from slow mixing when the state distribution has multiple modes or the variables are strongly correlated (neal1993probabilistic). Furthermore, we can usually afford to run MCMC for only a few iterations in the nested loop for the sake of efficiency (Neal92; Tieleman08), which makes it even harder to obtain an accurate estimate of the gradient.
Even when the MRF is well trained, it is usually difficult to apply the model to regular tasks such as inference, density estimation, and model selection, because all of those tasks require the computation of the partition function. One has to once more resort to running MCMC or other approximate inference methods during the prediction phase to obtain an approximation.
Is there a method to speed up the inference step that exists in both the training and test phases? The herding algorithm was proposed to address the slow mixing problem of MCMC and combine the execution of MCMC in both training and prediction phases into a single process.
1.2.2 Learning MRFs with Herding
When there exist multiple local modes in a model distribution, an MCMC sampler is prone to getting stuck in local modes and it becomes difficult to explore the state space efficiently. However, that is not a serious issue at the beginning of the MRF learning procedure, as observed by, for example, TielemanHinton09. This is because the parameters keep being updated with a large learning rate at the beginning. Specifically, when the expected feature vector is approximated by a set of samples in the MCMC approach, each update in Equation 1.5 translates the parameters along a direction that tends to reduce the inner product between $w$ and the features of the current samples, and thereby reduces the state probability around the region of the current samples. This change in the state distribution helps the MCMC sampler escape local optima and mix faster.
This observation suggests that we can speed up the MCMC algorithm by updating the target distribution itself with a large learning rate. However, in order to converge to a point estimate of a model, the learning rate $\eta$ needs to be decreased using some suitable annealing schedule. But one may ask whether we are necessarily interested in a fixed value of the model parameters. As discussed in the previous subsection, for many applications one needs to compute averages over the (converged) model, which are intractable anyway. In that case, a sequence of samples to approximate these averages is all we need. It then becomes a waste of resources and time to nail down a single point estimate of the parameters by decreasing $\eta$ when a sequence of samples is already available. We can actually kill two birds with one stone by obtaining samples during the training phase and reusing them for making predictions. The idea of the herding algorithm originates from this observation.
The herding algorithm proposed in Welling09A can be considered as an algorithm that runs gradient descent with a constant learning rate on an MRF in the zero-temperature limit. Define the distribution of an MRF with a temperature $T$ by replacing $w$ with $w/T$, where $T$ is an artificial temperature variable. The log-likelihood of a model (multiplied by $T$) then becomes:
When $T$ approaches $0$, all the probability is absorbed into the most probable state, denoted as $s$, and the expectation of the feature vector, $\mathbb{E}_p[\phi]$, equals the feature vector of state $s$. The herding algorithm then consists of the iterative gradient descent updates in the limit $T \to 0$, with a constant learning rate $\eta$:
We usually set $\eta = 1$ except when mentioned explicitly, because the herding dynamics is invariant to the learning rate, as explained in Section 1.2.3. We treat the sequence of most probable states, $\{s_t\}$, as a set of “samples” for herding and use it for inference tasks. At each iteration, we find the most probable state of the current model distribution deterministically, and update the parameters by adding the average feature vector of the training data and subtracting the feature vector of the current sample. Compared to maintaining a set of random samples as in the MCMC approach (see e.g. Tieleman08), updating with a single sample state facilitates updating the distribution at an even rate.
Writing out these updates, we see that, after taking the limit $T \to 0$, herding can be interpreted as maximum likelihood learning with an infinitely large stepsize and rescaled weights. The surprising observation is that the state sequence generated by this process is still meaningful and can be interpreted as approximate samples from an MRF model with the correct moment constraints on the features $\phi$.
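As a concrete illustration, the following Python sketch runs these updates on a small finite state space; the number of states, the features, and the target distribution are made-up illustrative values, not taken from the chapter.

```python
import numpy as np

def herding(phi, phi_bar, T, w0=None):
    """Herding on a finite state space.

    phi:     (K, D) array, feature vector for each of K states.
    phi_bar: (D,) target moments (data average of the features).
    Returns the visited states and the final weight vector.
    """
    K, D = phi.shape
    w = np.zeros(D) if w0 is None else np.array(w0, dtype=float)
    states = []
    for _ in range(T):
        s = int(np.argmax(phi @ w))   # most probable state in the T -> 0 limit
        states.append(s)
        w += phi_bar - phi[s]         # herding update with eta = 1
    return states, w

# Hypothetical toy model: 4 states, 2 features, made-up target distribution.
rng = np.random.default_rng(0)
phi = rng.normal(size=(4, 2))
p_true = np.array([0.1, 0.2, 0.3, 0.4])
phi_bar = p_true @ phi
states, w = herding(phi, phi_bar, 10000)
moment_error = np.abs(phi[states].mean(axis=0) - phi_bar).max()
print(moment_error)  # small: the sample average of the features tracks phi_bar
```

Despite being fully deterministic, the state sequence visits all states with the right frequencies in the sense that the empirical feature average matches the input moments.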
One can obtain an intuitive impression of the dynamics of herding by looking at the change in the asymptotic behavior of the gradient descent algorithm as we decrease $T$ in Equation 1.9 from a large value towards $0$. Assume that we can compute the expected feature vector w.r.t. the model exactly. Given an initial value of $w$, the gradient descent update equation 1.9 with a constant learning rate is a deterministic mapping in the parameter space. When $T$ is large enough (the effective stepsize is small enough), the optimization process converges and $w$ approaches a single point, the MLE. As $T$ decreases below some threshold, the convergence condition is violated and the trajectory of $w$ moves asymptotically into an oscillation between two points, that is, the attractor set splits from a single point into two points. As $T$ decreases further, the asymptotic oscillation period doubles from two to four, four to eight, etc., and the process eventually approaches an infinite period at another temperature threshold. Figure 1.1 shows an example of this attractor bifurcation phenomenon. The example model has 4 discrete states, each associated with 2 randomly sampled real-valued features. Below the second threshold, the trajectory of $w$ is still bounded in a finite region, as shown in Section 1.3.1, but is no longer periodic. Instead, we observe that the dynamics often converges to a fractal attractor set, as shown in the right plot of Figure 1.1. The bifurcation process is observed very often in simulated models, although it is not clear to us whether it always happens for any discrete MRF. We discuss the dynamics related to this phenomenon in more detail in Section 1.2.6.
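The transition described above can be probed numerically. The sketch below iterates the finite-temperature gradient update on a small made-up model; here increasing the learning rate plays the role of lowering the temperature, and the model sizes and thresholds are illustrative assumptions.

```python
import numpy as np

def finite_T_update(w, phi, phi_bar, eta):
    """One gradient step on the finite-temperature MRF log-likelihood.
    Raising eta is equivalent to lowering the temperature T."""
    logits = phi @ w
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return w + eta * (phi_bar - p @ phi)

rng = np.random.default_rng(3)
phi = rng.normal(size=(4, 2))                    # 4 states, 2 features (made up)
phi_bar = np.array([0.1, 0.2, 0.3, 0.4]) @ phi   # made-up target moments

def final_step_size(eta, iters=20000):
    """Size of the last update step; ~0 iff the map converged to a fixed point."""
    w = np.zeros(2)
    step = 0.0
    for _ in range(iters):
        w_new = finite_T_update(w, phi, phi_bar, eta)
        step = float(np.linalg.norm(w_new - w))
        w = w_new
    return step

print(final_step_size(0.05))  # converges to the MLE: step size shrinks to ~0
print(final_step_size(50.0))  # no fixed point: the map keeps moving
```

For small stepsizes the map contracts to the MLE; for large stepsizes the fixed point becomes unstable and the weights keep traversing a bounded attractor.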
1.2.3 Tipi Function and Basic Properties of Herding
We will discuss a few distinguishing properties of the herding algorithm in this subsection. When we take the zero temperature limit in Equation 1.6, the log-likelihood function becomes
This function has a number of interesting properties that justify the name “Tipi function” (a Tipi is a traditional Native American dwelling); see Figure 1.2 (Welling09A):
$\ell_0(w)$ is continuous piecewise linear ($C^0$ but not $C^1$). It is clearly linear in $w$ as long as the maximizing state does not change. However, changing $w$ may in fact change the maximizing state, in which case the gradient changes discontinuously.
$\ell_0(w)$ is a concave, non-positive function of $w$ with a maximum at $\ell_0(0) = 0$. This is true because the first term represents an average over some distribution P, while the second term is a maximum over the same quantity. Therefore, $\ell_0(w) \leq 0$. If we furthermore assume that $\phi(x)$ is not constant on the support of P, then $\ell_0(w) < 0$ for $w \neq 0$ and the maximum at $w = 0$ is unique. Concavity follows because the first term is linear and the second, maximization term is convex.
$\ell_0(w)$ is scale free: $\ell_0(\beta w) = \beta\,\ell_0(w)$ for any $\beta \geq 0$, as can easily be checked. This means that the function has exactly the same structure at any scale of $w$.
Herding runs gradient descent optimization on this Tipi function. There is no need to search for the maximum, as $w = 0$ is the trivial solution. However, the fixed learning rate will always result in a perpetual overshooting of the maximum, and thus the sequence of weights will never converge to a fixed point. Every flat face of the Tipi function is associated with a state. An important property of herding is that the state sequence visited by the gradient descent procedure satisfies the moment matching constraints in Equation 1.1, which will be discussed in detail in Section 1.2.5. There are a few more properties of this procedure that are worth noting.
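The three properties above are easy to verify numerically. The sketch below checks non-positivity, scale freeness and concavity of $\ell_0$ on a small made-up model (the states and features are arbitrary illustrative values).

```python
import numpy as np

def ell0(w, phi, phi_bar):
    """Zero-temperature objective: average feature score minus max score."""
    return w @ phi_bar - np.max(phi @ w)

rng = np.random.default_rng(1)
phi = rng.normal(size=(5, 3))        # 5 states, 3 features (made up)
phi_bar = np.full(5, 0.2) @ phi      # moments of a uniform "data" distribution

w = rng.normal(size=3)
v = rng.normal(size=3)
nonpositive = ell0(w, phi, phi_bar)                       # <= 0 everywhere
scale_gap = ell0(2.5 * w, phi, phi_bar) - 2.5 * ell0(w, phi, phi_bar)  # ~ 0
concavity_gap = ell0(0.5 * (w + v), phi, phi_bar) \
    - 0.5 * (ell0(w, phi, phi_bar) + ell0(v, phi, phi_bar))  # >= 0
print(nonpositive, scale_gap, concavity_gap)
```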
Deterministic Nonlinear Dynamics
Herding is a deterministic nonlinear dynamical system. In contrast to the stochastic MLE learning algorithm based on MCMC, the two update steps in Equations 1.7 and 1.8 consist of a nonlinear deterministic mapping of the weights, as illustrated in Figure 1.3. In particular, it is not an MCMC procedure and it does not require random number generation.
The dynamics thus produces pseudo-samples that look random, but should not be interpreted as random samples. Although reminiscent of the Bayesian approach, the weights generated during this dynamics should not be interpreted as samples from some Bayesian posterior distribution. We will discuss the weakly chaotic behavior of the herding dynamics in detail in Section 1.2.6.
Invariance to the Learning Rate
Varying the learning rate $\eta$ does not change the behavior of the herding dynamics. The only effect is to change the scale of the invariant attractor set of the sequence $w_t$. This actually follows naturally from the scale-free property of the Tipi function. More precisely, denote with $w_t$ the standard herding sequence obtained with $\eta = 1$ and with $w'_t$ the sequence obtained with an arbitrary learning rate $\eta$. It is easy to see that if we initialize $w'_0 = \eta w_0$ and apply the respective herding updates to $w_t$ and $w'_t$ afterwards, the relation $w'_t = \eta w_t$ will remain true for all $t$. In particular, the states $s_t$ will be the same for both sequences. Therefore we simply set $\eta = 1$ in the herding algorithm.
Of course, if one initializes both sequences with arbitrary different values, then the state sequences will not be identical. However, if one accepts the conjecture that there is a unique invariant attractor set, then this difference can be interpreted as a difference in initialization which only affects the transient behavior (or “burn-in” behavior) but not the (marginal) distribution from which the states will be sampled.
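This invariance is easy to verify numerically; in the sketch below, the toy model and the learning rate $\eta = 0.3$ are made-up illustrative values.

```python
import numpy as np

rng = np.random.default_rng(2)
phi = rng.normal(size=(4, 2))                    # made-up 4-state model
phi_bar = np.array([0.4, 0.3, 0.2, 0.1]) @ phi
eta = 0.3                                        # arbitrary learning rate
w0 = rng.normal(size=2)

w, w_scaled = w0.copy(), eta * w0                # initialize w'_0 = eta * w_0
for _ in range(1000):
    s = int(np.argmax(phi @ w))
    s_scaled = int(np.argmax(phi @ w_scaled))
    assert s == s_scaled                          # identical state sequences
    w += phi_bar - phi[s]                         # eta = 1 update
    w_scaled += eta * (phi_bar - phi[s_scaled])   # eta update
    assert np.allclose(w_scaled, eta * w)         # w'_t = eta * w_t
print("invariance holds for 1000 steps")
```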
Notice however that if we assign different learning rates across the dimensions of the weight vector $w$, the distribution of the generated samples will change. While the moment matching constraints are still satisfied, the entropy of the sample distribution varies as a function of the relative learning rates. In fact, changing the relative ratio of learning rates among feature dimensions is equivalent to scaling the features by different factors in the greedy moment matching interpretation of Section 1.2.4. How to choose an optimal set of learning rates is still an open problem.
A key advantage in practice of the herding algorithm over sampling with a Markov chain is that the dynamical system mixes very rapidly over the attractor set. This is attributed to the fact that the maximizations are performed on an ever-changing model distribution, as briefly mentioned at the beginning of this subsection. Let P be the distribution of the training data $\mathcal{D}$, and $s_t$ the maximizing state at time $t$. The distribution of an MRF at time $t$ with a regular temperature is
After the weights are updated with Equation 1.8, the probability of the new model becomes
Comparing Equation 1.12 with 1.11, we see that probable states are rewarded with an extra positive term in the exponent, except the most recently sampled state $s_t$, whose feature vector is subtracted from the weights. This has the effect (after normalization) that state $s_t$ will have a smaller probability of being selected again. Imagine for instance that the sampler is stuck at a local mode. After drawing samples at that mode for a while, the weights are updated to gradually reduce that mode and help the sampler escape it. The resulting negative auto-correlation helps mitigate the notorious problem of positive auto-correlation in most MCMC methods.
We illustrate the negative auto-correlation using a synthetic MRF with a small number of discrete states, each associated with a real-valued feature vector. The parameters of the MRF model are randomly generated, the expected feature values are computed analytically from them, and these moments are fed into the herding algorithm to draw samples. We define the auto-correlation of the sample sequence of discrete variables as follows:
where $\mathbb{I}[\cdot]$ is the indicator function and $\hat{P}$ is the empirical distribution of the samples. It is easy to see that the auto-correlation is approximately zero when the samples are independently distributed, up to a small error due to the finite sample size. We run herding multiple times with different model parameters and show the mean and standard deviation of the auto-correlation in Figure 1.4. We can see that the auto-correlation is negative for neighboring samples and converges to 0 as the time lag increases. This effect persists even if we use a local optimization algorithm when a global optimum is hard or expensive to obtain. This type of “self-avoidance” is also shared with other sampling methods such as over-relaxation (young1954iterative), fast-weights PCD (TielemanHinton09) and adaptive MCMC (salakhutdinov27learning).
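A minimal illustration of this negative auto-correlation, using one-hot features so that the input moments are simply the target probabilities (the target distribution is a made-up illustrative value):

```python
import numpy as np

K = 4
p = np.array([0.1, 0.2, 0.3, 0.4])    # made-up target distribution
w = np.zeros(K)
samples = []
for _ in range(20000):
    s = int(np.argmax(w))             # one-hot features: phi(x) @ w = w[x]
    samples.append(s)
    w += p - np.eye(K)[s]             # herding update

x = np.array(samples)
p_hat = np.bincount(x, minlength=K) / len(x)

def autocorr(x, tau, p_hat):
    """c(tau) = P(x_t = x_{t+tau}) - sum_s p_hat(s)^2."""
    return np.mean(x[:-tau] == x[tau:]) - np.sum(p_hat ** 2)

print(autocorr(x, 1, p_hat))   # negative: neighboring samples avoid repeats
print(np.abs(p_hat - p).max()) # tiny: the state frequencies match the moments
```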
1.2.4 Herding as a Greedy Moment Matching Algorithm
As herding does not obtain the MLE, the distribution of the generated samples does not provide a solution to the maximum entropy problem either. However, the moment matching constraints in Equation 1.1 are still respected: when we compute the sampling average of the feature vector, it converges to the input moments. Furthermore, the negative auto-correlation in the sample sequence helps achieve a convergence rate that is faster than what one would get from drawing independent samples or running MCMC at the MLE. Before providing quantitative results, it is easier to understand herding intuitively by taking a “dual view” of its dynamics in which we remove the weights in favor of the states (ChenSmolaWelling10).
Notice that the expression for $w_T$ can be expanded recursively using the update Equation 1.8:
For ease of intuitive understanding of herding, we temporarily make two assumptions (which are not necessary for the propositions in the next subsection to hold): first, that the weights are initialized at $w_0 = \bar{\phi}$, and second, that the feature vectors have constant norm, $\|\phi(x)\| = R$ for all $x$.
The second assumption is easily achieved, e.g. by renormalizing $\phi(x)$ or by choosing a suitable feature map in the first place. Given the first assumption, Equation 1.15 becomes
Combining this with the second assumption, one can show that the herding update equation 1.16 is equivalent to greedily minimizing the squared error defined as
We therefore see that herding generates pseudo-samples that greedily minimize the distance between the input moments and the sampling average of the feature vector at every iteration (conditioned on past samples). Note that the error function is unfortunately not submodular, so the greedy procedure does not imply that the total collection of samples at iteration $T$ is jointly optimal (see Huszar12 for a detailed discussion). We also note that herding is an “infinite memory process” (as opposed to a Markov process), as illustrated in Figure 1.5, because new samples depend on the entire history of samples generated thus far.
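The equivalence between the herding update and the greedy minimization can be checked directly. In the sketch below the features are normalized to constant norm and the weights are initialized at $\bar{\phi}$, matching the two temporary assumptions; the model itself is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
phi = rng.normal(size=(6, 3))
phi /= np.linalg.norm(phi, axis=1, keepdims=True)   # constant norm ||phi(x)|| = 1
phi_bar = np.full(6, 1 / 6) @ phi                   # uniform "data" moments

w = phi_bar.copy()                                  # first assumption: w_0 = phi_bar
feat_sum = np.zeros(3)
for t in range(200):
    s_herding = int(np.argmax(phi @ w))
    # greedy choice: the state that minimizes the moment error after adding it
    errors = np.linalg.norm(phi_bar - (feat_sum + phi) / (t + 1), axis=1)
    s_greedy = int(np.argmin(errors))
    assert s_herding == s_greedy                    # the two choices coincide
    feat_sum += phi[s_herding]
    w += phi_bar - phi[s_herding]
print("herding argmax = greedy moment matching at every step")
```

The identity holds because, under the two assumptions, $w_t = (t+1)\bar{\phi} - \sum_{\tau<t}\phi(s_\tau)$, so minimizing $\|w_t - \phi(x)\|^2$ over $x$ with constant $\|\phi(x)\|$ reduces to maximizing $w_t^{\top}\phi(x)$.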
1.2.5 Moment Matching Property
With the dual view of the previous subsection, the distance between the moments and their sampling average in Equation 1.17 can be considered as the objective function that the herding algorithm minimizes. We discuss in this subsection under what conditions, and at what speed, the moment constraints are eventually satisfied.
Proposition 1.1 (Proposition 1 in Welling09A).
For any initialization $w_0$, if $\lim_{T \to \infty} \|w_T\|/T = 0$, then $\lim_{T \to \infty} \left\| \frac{1}{T}\sum_{t=1}^{T} \phi(s_t) - \bar{\phi} \right\| = 0$.
Proof. Following Equation 1.14, we have $\frac{w_T - w_0}{T} = \bar{\phi} - \frac{1}{T}\sum_{t=1}^{T} \phi(s_t)$.
Using the premise that the weights grow slower than linearly, and observing that $w_0$ is constant, we see that the left-hand term vanishes in the limit, which proves the result. ∎
What this says is that under the very general assumption that the weights do not grow linearly to infinity (note that due to the finite learning rate they cannot grow faster than linearly either), the moment constraints will be satisfied by the samples collected from the combined learning/sampling procedure. In fact, we will show later that the weights are restricted to a bounded region, which leads to a convergence rate of $\mathcal{O}(1/T)$ as stated below.
Proposition 1.2.
For any initialization $w_0$, if there exists a constant $R$ such that $\|w_t\| \leq R$ for all $t$, then $\left\| \frac{1}{T}\sum_{t=1}^{T} \phi(s_t) - \bar{\phi} \right\| \leq \frac{2R}{T}$.
Proof. The proof follows immediately from Equation 1.18. ∎
Note that if we want to estimate the expected features of a trained MRF by a Monte Carlo method, the standard deviation of the approximation error with independent and identically distributed (i.i.d.) random samples decays optimally as $\mathcal{O}(1/\sqrt{T})$, where $T$ is the number of samples. (For positively autocorrelated MCMC methods this rate can be even slower.) Samples from herding therefore achieve a faster convergence rate in estimating moments than i.i.d. samples.
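The two rates can be compared empirically. The sketch below measures the moment error of herding samples and of i.i.d. samples on a small made-up model (states, features and target distribution are illustrative values).

```python
import numpy as np

rng = np.random.default_rng(6)
phi = rng.normal(size=(5, 3))                       # 5 states, 3 features (made up)
p_true = np.array([0.1, 0.15, 0.2, 0.25, 0.3])
phi_bar = p_true @ phi

def herding_error(T):
    """Max absolute moment error after T herding steps."""
    w = np.zeros(3)
    feat_sum = np.zeros(3)
    for _ in range(T):
        s = int(np.argmax(phi @ w))
        feat_sum += phi[s]
        w += phi_bar - phi[s]
    return np.abs(feat_sum / T - phi_bar).max()

def iid_error(T):
    """Same error using T i.i.d. samples from p_true."""
    counts = rng.multinomial(T, p_true)
    return np.abs(counts @ phi / T - phi_bar).max()

for T in (100, 1000, 10000):
    print(T, herding_error(T), iid_error(T))  # herding shrinks ~1/T, i.i.d. ~1/sqrt(T)
```

Because the herding error equals $\|w_T - w_0\|/T$ exactly, bounded weights immediately give the $\mathcal{O}(1/T)$ decay visible in the printed numbers.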
Recurrence of the Weight Sequence
It is important to ensure that the herding dynamics does not diverge to infinity. Welling09A discovered an important property of herding, known as recurrence: the sequence of weights is confined to a ball in parameter space. This property satisfies the premise of both Propositions 1.1 and 1.2. It was stated as a corollary of Proposition 1.3:
Proposition 1.3 (Proposition 2 in Welling09A).
There exists a radius $R$ such that a herding update performed outside this radius will always decrease the norm $\|w\|$.
Corollary 1.4 (Corollary in Welling09A).
There exists a radius $R'$ such that a herding algorithm initialized inside a ball of that radius will never generate weights with norm $\|w_t\| > R'$.
However, there was a gap in the proof of Proposition 2 in Welling09A. We give the corrected proof below:
Proof of Proposition 1.3.
Write the herding update equation 1.8 as $w_{t+1} = w_t + \bar{\phi} - \phi(s_t)$ (set $\eta = 1$). Expanding the squared norm of $w_{t+1}$ leads to
where we define $B^2$ as an upper bound on $\|\bar{\phi} - \phi(s_t)\|^2$, introduced in Lemma 1 of Welling09A. Such a bound exists as long as the norm of the feature vector is bounded on $\mathcal{X}$. We also use the fact that $w_t^{\top}(\bar{\phi} - \phi(s_t)) = \ell_0(w_t)$, because $s_t$ is the maximizing state at time $t$.
Denote the unit hypersphere by $\mathbb{S} = \{w : \|w\| = 1\}$. Since $\ell_0$ is continuous on $\mathbb{S}$ and $\mathbb{S}$ is a bounded closed set, $\ell_0$ achieves its supremum on $\mathbb{S}$; that is, we can find a maximum point $w^* \in \mathbb{S}$ where $\ell_0(w) \leq \ell_0(w^*)$ for all $w \in \mathbb{S}$.
Combining this with the fact that $\ell_0(w) < 0$ outside the origin, we know the maximum of $\ell_0$ on $\mathbb{S}$ is negative, $\ell_0(w^*) < 0$. Now taking into account the fact that $B$ is constant (i.e. does not scale with $w$), there exists some constant $R$ for which $2R\,|\ell_0(w^*)| > B^2$. Together with the scaling property of $\ell_0$, $\ell_0(\beta w) = \beta\,\ell_0(w)$, we can show that for any $w_t$ with a norm larger than $R$, $\|w_{t+1}\|$ is smaller than $\|w_t\|$:
The proof is concluded by plugging the inequality above in Equation 1.19. ∎
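The recurrence property can also be observed empirically by tracking the weight norm over a long run; the sketch below uses one-hot features and a random made-up target distribution.

```python
import numpy as np

rng = np.random.default_rng(7)
K = 6
phi = np.eye(K)                        # one-hot features, so ||phi(x)|| = 1
p = rng.dirichlet(np.ones(K))          # random made-up target moments

w = rng.normal(size=K)                 # arbitrary initialization
max_norm = 0.0
for _ in range(50000):
    s = int(np.argmax(w))              # with one-hot features, phi @ w = w
    w += p - np.eye(K)[s]              # herding update
    max_norm = max(max_norm, float(np.linalg.norm(w)))
print(max_norm)   # stays bounded over 50000 steps: the weights are recurrent
```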
Corollary 1.4 proves the existence of a bound on $\|w_t\|$ and thereby of the constant $R$ in Proposition 1.2. harvey2014near further studied the value of this bound and proposed a variant of herding that obtains a near-optimal value w.r.t. the dimensionality of the feature vector, $D$, and the size of a finite state space, $|\mathcal{X}|$. The proposed algorithm has polynomial time complexity in $D$ and $|\mathcal{X}|$.
The Remaining Degrees of Freedom
Both the herding and the MaxEnt methods match the moments of the training data. But how does herding control the remaining degrees of freedom that are otherwise controlled by maximizing the entropy in the MaxEnt method? This is unfortunately still an open problem. Apart from some heuristics, there is currently no principled way to enforce high entropy. In practice, however, in discrete state spaces we usually observe that the sampling distribution of herding has high entropy. We illustrate the behavior of herding on the example of simulating an Ising model in the next paragraph.
An Ising model is an MRF defined on a lattice of binary nodes, $x_i \in \{-1, +1\}$, with bias and pairwise features. The probability distribution is expressed as
where $h_i$ is the bias parameter, $J_{ij}$ is the pairwise parameter and $\beta$ is the inverse temperature variable. When $h_i = 0$ and $J_{ij} = J$ for all nodes and edges, and $\beta$ is set at the inverse critical temperature, the Ising model is said to be at a critical phase, where regular sampling algorithms fail due to long-range correlations among variables. A special algorithm, the Swendsen-Wang algorithm (swendsen1987nonuniversal), was designed to draw samples efficiently in this case. In order to run herding on the Ising model, we need to know the average features, $\mathbb{E}[x_i]$ ($0$ in this case) and $\mathbb{E}[x_i x_j]$, instead of the MRF parameters. So we first run the Swendsen-Wang algorithm to obtain an estimate of the expected cross terms, $\mathbb{E}[x_i x_j]$, which are constant across all edges, and then run herding with a weight $w_i$ for every node $i$ and a weight $w_{ij}$ for every edge $(i,j)$. The update equations are:
As finding the global optimum is an NP-hard problem, we find a local maximum for $s$ by coordinate descent. (In Section 1.3.2 we show that the moment matching property still holds with a local search, as long as the found state is better than the average.) Figure 1.6 shows a sample from an Ising model on a lattice at the critical temperature. We do not observe a qualitative difference between the samples generated by the Ising model (MaxEnt) and herding, which suggests that the sample distribution of herding may be very close to the distribution of the MRF. Furthermore, Figure 1.7 shows the distribution of the size of connected components in the samples. It is known that this distribution should obey a power law at the critical temperature. We find that samples from both methods exhibit the power law distribution with an almost identical exponent.
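A minimal sketch of these updates on a small lattice, with the maximization done by a few coordinate-wise (ICM) sweeps as described above. The lattice size and the target edge moment are made-up illustrative stand-ins; the chapter's actual experiment uses moments estimated by Swendsen-Wang at the critical temperature.

```python
import numpy as np

L = 6                      # small lattice (illustrative; not the chapter's size)
m_node, m_edge = 0.0, 0.4  # assumed target moments E[x_i] and E[x_i x_j]
T = 2000

w_node = np.zeros((L, L))
w_right = np.zeros((L, L - 1))   # weights for edges (i,j)-(i,j+1)
w_down = np.zeros((L - 1, L))    # weights for edges (i,j)-(i+1,j)
x = np.ones((L, L))
node_sum = np.zeros((L, L))
edge_r_sum = np.zeros((L, L - 1))
edge_d_sum = np.zeros((L - 1, L))

for _ in range(T):
    # local maximization: a few coordinate-wise (ICM) sweeps
    for _ in range(5):
        for i in range(L):
            for j in range(L):
                f = w_node[i, j]
                if j + 1 < L: f += w_right[i, j] * x[i, j + 1]
                if j > 0:     f += w_right[i, j - 1] * x[i, j - 1]
                if i + 1 < L: f += w_down[i, j] * x[i + 1, j]
                if i > 0:     f += w_down[i - 1, j] * x[i - 1, j]
                x[i, j] = 1.0 if f >= 0 else -1.0
    er = x[:, :-1] * x[:, 1:]          # right-edge products
    ed = x[:-1, :] * x[1:, :]          # down-edge products
    w_node += m_node - x               # herding updates on nodes and edges
    w_right += m_edge - er
    w_down += m_edge - ed
    node_sum += x
    edge_r_sum += er
    edge_d_sum += ed

print(np.abs(node_sum / T).max())             # node moments near the target 0
print(np.abs(edge_r_sum / T - m_edge).max())  # edge moments near the target
```

Even with only a local search, the time averages of the node and edge features track the input moments, consistent with the PCT-based relaxation discussed in Section 1.3.2.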
1.2.6 Learning Using Weak Chaos
There are two theoretical frameworks for statistical inference: the frequentist and the Bayesian paradigm. A frequentist assumes a true objective value for some parameter and tries to estimate its value from samples. Except for the simplest models, estimation usually involves an iterative procedure where the value of the parameter is estimated with increasing precision. In information-theoretic terms, this means that more and more information from the data is accumulated in more decimal places of the estimate. With a finite dataset, this process should stop at some scale because there is not enough information in the data that can be transferred into the decimal places of the parameter. If we continue anyway, we will overfit to the dataset at hand. In a Bayesian setting we entertain a posterior distribution over parameters, the spread, or more technically speaking, entropy, of which determines the amount of information it encodes. In Bayesian estimation, the spread automatically adapts itself to the amount of available information in the data. In both cases, the learning process itself can be viewed as a dynamical system. For a frequentist this means a convergent series of parameter estimates indexed by the learning iteration $t$. For a Bayesian running an MCMC procedure this means a stochastic process converging to some equilibrium distribution. Herding introduces a third possibility by encoding all the information in a deterministic nonlinear dynamical system. We focus on studying the weakly chaotic behavior of the herding dynamics in this subsection. The sequence of weights never converges but traces out a quasi-periodic trajectory on an attractor set which is often found to be of fractal dimension. In the language of iterated maps,
$\mathbf{w}_{t+1} = F(\mathbf{w}_t)$, a (frequentist) optimization of some objective results in an attractor set that is a single point; Bayesian posterior inference results in a (posterior) probability distribution; while herding results in a (possibly fractal) attractor set, which seems harder to meaningfully interpret as a probability distribution.
Example: Herding a Single Neuron
We first study an example of the herding dynamics in its simplest form and show its equivalence to some well-studied theories in mathematics. Consider a single (artificial) neuron, which can take on two distinct states: either it fires ($s = 1$) or it does not fire ($s = 0$). Assume that we want to simulate the activity of a neuron with an irrational firing rate $\pi$, that is, the average firing frequency approaches $\pi$. We can achieve that by applying the herding algorithm with a one-dimensional feature $\phi(s) = s$ and feeding in the desired rate $\pi$ as the input moment. Applying the update equations 1.7-1.8 we get the following dynamics:
$$s_t = \mathbb{I}(w_{t-1} > 0), \qquad w_t = w_{t-1} + \pi - s_t$$
where $\mathbb{I}(\cdot)$ is the indicator function. With the moment matching property we can show immediately that the firing rate converges to the desired value $\pi$ for any initial value of $w_0$. The update equations are illustrated in Figure 1.8. This dynamics is a simple type of interval translation mapping (ITM) problem in mathematics (boshernitzan1995interval). In a general ITM problem, the invariant set of the dynamics often has a fractal dimension. But for this simple case, the invariant set is the entire interval $[\pi - 1, \pi]$ if $\pi$ is an irrational number and a finite set if it is rational. As a neuron model, one can think of $w$ as a "synaptic strength." At each iteration the synaptic strength increases by an amount $\pi$. When the synaptic strength is positive, the neuron fires, and firing depresses the synaptic strength by an amount $1$. The value of $w_0$ only has some effect on the transient behavior of the resulting sequence $s_1, s_2, \ldots$
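The one-neuron dynamics is easy to simulate; the following sketch implements exactly the two update equations above (the target rate $\pi/4$, here the mathematical constant divided by four, is an arbitrary irrational choice):

```python
import math

def herd_neuron(pi, T, w0=0.0):
    """Single-neuron herding: s_t = I(w > 0), w += pi - s_t."""
    w, fires = w0, 0
    for _ in range(T):
        s = 1 if w > 0 else 0   # fire when the synaptic strength is positive
        w += pi - s             # charge by pi; firing depresses by 1
        fires += s
    return fires / T

rate = herd_neuron(math.pi / 4, 10000)   # empirical rate within O(1/T) of pi/4
```

Since the weight stays inside an interval of length one, the empirical rate deviates from the target by at most $O(1/T)$, far faster than the $O(1/\sqrt{T})$ of independent sampling.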
It is perhaps interesting to note that by setting $\pi = 1/\varphi$, with $\varphi = (1 + \sqrt{5})/2$ the golden mean, and initializing the weights appropriately, we exactly generate the "Rabbit Sequence": a well-studied Sturmian sequence which is intimately related to Fibonacci numbers. (Imagine two types of rabbits: young rabbits ($0$) and adult rabbits ($1$). At each new generation the young rabbits grow up ($0 \to 1$) and adult rabbits produce offspring ($1 \to 10$). Recursively applying these rules we produce the rabbit sequences: $1$, $10$, $101$, $10110$, $10110101$, etc. The total number of terms of these sequences, and incidentally also the total number of $1$'s (lagged by one iteration), constitutes the Fibonacci sequence: $1, 1, 2, 3, 5, 8, \ldots$) In Figure 1.9 we plot the weights (a) and the states (b) resulting from herding with this "Fibonacci neuron" model. For a proof, please see WellingChen10.
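This connection can be checked numerically. In the sketch below we take $\pi = 1/\varphi$ and, as an assumption (the chapter's exact initialization is not reproduced here), start the weight at $w_0 = \pi - 1$; with this choice the states from the second step onward reproduce the rabbit sequence generated by the substitution rules above:

```python
import math

def herd_states(pi, T, w0):
    """Herding states s_t = I(w_{t-1} > 0), w_t = w_{t-1} + pi - s_t."""
    w, states = w0, []
    for _ in range(T):
        s = 1 if w > 0 else 0
        w += pi - s
        states.append(s)
    return states

def rabbit(n):
    """Rabbit sequence via the substitution 1 -> 10, 0 -> 1."""
    seq = [1]
    while len(seq) < n:
        seq = [b for a in seq for b in ((1, 0) if a == 1 else (1,))]
    return seq[:n]

g = (math.sqrt(5) - 1) / 2              # 1/phi, the inverse golden mean
states = herd_states(g, 51, w0=g - 1)   # w0 = pi - 1 is our assumption here
```

With this initialization one finds `states[1:]` equal to the first 50 rabbit terms, since both reduce to the Sturmian coding $s_t = \lfloor t/\varphi \rfloor - \lfloor (t-1)/\varphi \rfloor$.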
When initializing $w_0 = 0$, one may think of the synaptic strength as an error potential that keeps track of the total error so far, $w_t = t\pi - \sum_{\tau \le t} s_\tau$. One can further show that the sequence of states is a discrete low-discrepancy sequence (angel2009discrete) in the following sense:
If $w_t$ is the weight of the herding dynamics for a single binary variable with probability $\pi$, and $w_s \in [\pi - 1, \pi]$ at some step $s$, then $w_t \in [\pi - 1, \pi]$ for all $t \ge s$. Moreover, for $t > s$, we have:
$$\left| \frac{1}{t - s} \sum_{\tau = s + 1}^{t} s_\tau - \pi \right| = \frac{|w_t - w_s|}{t - s} \le \frac{1}{t - s}$$
The second inequality follows by observing that $w_t, w_s \in [\pi - 1, \pi]$, an interval of length $1$, so that $|w_t - w_s| \le 1$. ∎
As a corollary of Proposition 1.5, when we initialize $w_0 = \pi - \frac{1}{2}$, the midpoint of the invariant interval, we can improve the bound of the discrepancy by a half.
If $w_t$ is the weight of the herding dynamics in Proposition 1.5 and it is initialized at $w_0 = \pi - \frac{1}{2}$, then for $t \ge 1$, we have:
$$\left| \frac{1}{t} \sum_{\tau = 1}^{t} s_\tau - \pi \right| \le \frac{1}{2t}$$
The proof immediately follows Equation 1.29 by plugging in $s = 0$ and $w_0 = \pi - \frac{1}{2}$, noting that $|w_t - w_0| \le \frac{1}{2}$. In fact, setting $w_0 = \pi - \frac{1}{2}$ corresponds to the initialization condition in the greedy algorithm interpretation in Section 1.2.4. One can see this by constructing an equivalent herding dynamics with a feature of constant norm as:
$$\phi'(s) = 2s - 1, \qquad \bar{\phi}' = 2\pi - 1$$
When initializing the weight at the moment, $w'_0 = \bar{\phi}' = 2\pi - 1$, one can verify that this dynamics generates the same sample sequence as the original one and that their weights are the same up to a constant factor of $2$, i.e. $w'_t = 2 w_t$. The new dynamics satisfies the two assumptions in Section 1.2.4 and therefore the sample sequences in both dynamical systems greedily minimize the error of the empirical probability (up to a constant factor):
$$\mathcal{E}_t = \left| \frac{1}{t} \sum_{\tau = 1}^{t} s_\tau - \pi \right|$$
This greedy algorithm actually achieves the optimal bound one can get with herding dynamics in the 1-neuron model, which is $\frac{1}{2T}$ after $T$ steps.
Example: Herding a Discrete State Variable
The application of herding to a binary variable can be extended naturally to a discrete state variable. Let $x$ be a variable that can take one of $D$ states, $x \in \mathcal{X} = \{1, \ldots, D\}$. Given any distribution $\boldsymbol{\pi}$ over these states, we can run herding to simulate the activity of the discrete variable. The feature function, $\boldsymbol{\phi}(x)$, is defined as the 1-of-D encoding of the discrete state, that is, a vector of $D$ binary numbers, in which all the numbers are $0$ except for the element indexed by the value of $x$. For example, for a variable with 4 states, the feature function of $x = 2$ is $\boldsymbol{\phi}(2) = (0, 1, 0, 0)$. It is easy to observe that the expected value of the feature vector under the distribution is exactly equal to $\boldsymbol{\pi}$. Now, let us apply the herding update equations with the feature map $\boldsymbol{\phi}$ and input moment $\boldsymbol{\pi}$:
$$x_t = \operatorname*{argmax}_{x \in \mathcal{X}} w_{t-1, x}, \qquad \mathbf{w}_t = \mathbf{w}_{t-1} + \boldsymbol{\pi} - \boldsymbol{\phi}(x_t)$$
The weight variables act similarly to the synaptic strengths in the neuron model example. At every iteration, the state with the highest potential gets activated, and then the corresponding weight is depressed after activation. Applying Proposition 1.2, we know that the empirical distribution of the samples converges to the input distribution at a faster rate than one would get from random sampling:
$$\max_{x \in \mathcal{X}} \left| \frac{1}{T} \sum_{t=1}^{T} \mathbb{I}(x_t = x) - \pi_x \right| = O\!\left(\frac{1}{T}\right)$$
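A direct implementation of these updates is only a few lines (a minimal sketch; the target distribution below is an arbitrary example):

```python
import numpy as np

def herd_discrete(pi, T):
    """Herding a D-state variable with 1-of-D features:
    x_t = argmax_x w[x];  w += pi - onehot(x_t)."""
    w = np.array(pi, dtype=float)    # one convenient initialization: w0 = pi
    counts = np.zeros(len(pi))
    for _ in range(T):
        x = int(np.argmax(w))        # the state with the highest potential fires
        w += pi                      # every potential rises by its target rate
        w[x] -= 1.0                  # the fired state is depressed
        counts[x] += 1
    return counts / T

pi = np.array([0.5, 0.3, 0.2])
freq = herd_discrete(pi, 10000)      # empirical frequencies, close to pi
```

Because the weights stay in a bounded region, the empirical frequencies track $\boldsymbol{\pi}$ with error $O(1/T)$ rather than the $O(1/\sqrt{T})$ of i.i.d. sampling.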
The dynamics of the weight vector is more complex than in the case of a binary variable in the previous example. However, there are still some interesting observations one can make about the trajectory of the weights, which we explain in the appendix.
Weak Chaos in the Herding Dynamics
Now let us consider herding in a general setting with $D$ states, where each state $x$ is associated with a $K$-dimensional feature vector $\boldsymbol{\phi}(x)$. The update equation for the weights 1.8 can be viewed as a series of translations in the parameter space, where each discrete state corresponds to one translation vector, $\mathbf{v}_x = \bar{\boldsymbol{\phi}} - \boldsymbol{\phi}(x)$. See Figure 1.11 for an example. The parameter space is partitioned into cones emanating from the origin, each corresponding to a state $x$ according to Equation 1.7. If the current location of the weights is inside cone $x$, then one applies the translation $\mathbf{v}_x$ corresponding to that cone and moves along to the next point. This system is an example of what is known as a piecewise translation (or piecewise isometry more generally) (goetz2000dynamics).
It is clear that this system has zero Lyapunov exponents (the Lyapunov exponent of a dynamical system is a quantity that characterizes the rate of separation of infinitesimally close trajectories: two trajectories in phase space with initial separation $\delta_0$ diverge, provided that the divergence can be treated within the linearized approximation, as $|\delta(t)| \approx e^{\lambda t} |\delta_0|$, where $\lambda$ is the Lyapunov exponent) everywhere, except perhaps on the boundaries between cones, but since this is a measure-zero set we will ignore these. As the evolution of the weights will remain bounded inside some finite ball, the evolution will converge to some attractor set. Moreover, the dynamics is non-periodic in the typical case (more formally, the translation vectors must form an incommensurate (possibly over-complete) basis set; for a proof see Appendix B of WellingChen10). It can often be observed that this attractor has fractal dimension (see Figure 1.11 for an example). All these facts point to the idea that herding is on the edge between full chaos (with positive Lyapunov exponents) and regular periodic behavior (with negative Lyapunov exponents). In fact, herding is an example of what is called "weak chaos", which is usually defined through its (topological) entropy discussed below. Finally, as we have illustrated in Figure 1.1, one can construct a sequence of iterated maps of which herding is the limit and which exhibits period doubling. This is yet another characteristic of systems that are classified as "edge of chaos". Whether the attractor set is of fractal dimension in general remains an open question. For the case of the single neuron model, the attractor is the entire interval $[\pi - 1, \pi]$ if $\pi$ is irrational, but for systems with more states it remains unknown.
We will now estimate the entropy production rate of herding. This will inform us further of the properties of this system and how it processes information. From Figure 1.11 we see that the sequence $x_1, x_2, \ldots$ can be interpreted as the symbolic system of the continuous dynamical system defined for the parameters $\mathbf{w}$. A sequence of symbols (states) is sometimes referred to as an "itinerary." Every time $\mathbf{w}_t$ falls inside a cone, we record its label, which equals the state $x_t$. The topological entropy for the symbolic system can be defined by counting the total number of subsequences of length $n$, which we will call $W(n)$. One may think of this as a dynamical language where the subsequences are called "words" and the topological entropy is thus related to the number of words of length $n$. More precisely, the topological entropy is defined as
$$h = \lim_{n \to \infty} \frac{\log W(n)}{n}$$
It was rigorously proven in goetz2000dynamics that $W(n)$ grows polynomially in $n$ for general piecewise isometries, which implies that the topological entropy vanishes for herding. It is however interesting to study the growth of $W(n)$ as a function of $n$ to get a sense of how chaotic the dynamics is.
For the simplest model of a single neuron with $\pi$ being an irrational number, it turns out that $W(n) = n + 1$, which is the absolute bare minimum for sequences that are not eventually periodic. It implies that our neuron model generates Sturmian sequences for irrational values of $\pi$, which are precisely defined to be the non-eventually-periodic sequences of minimal complexity (LuWang05). (For a proof, please see WellingChen10.)
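This minimal complexity can be observed numerically by counting the distinct words in a long herding run (a small sketch; $\pi = \sqrt{2} - 1$ is an arbitrary irrational rate):

```python
import math

def herd_states(pi, T, w0=0.0):
    """Single-neuron herding states s_t = I(w > 0), w += pi - s_t."""
    w, seq = w0, []
    for _ in range(T):
        s = 1 if w > 0 else 0
        w += pi - s
        seq.append(s)
    return seq

def n_words(seq, n):
    """W(n): the number of distinct subsequences ('words') of length n."""
    return len({tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)})

seq = herd_states(math.sqrt(2) - 1, 20000)
counts = [n_words(seq, n) for n in range(1, 8)]   # Sturmian: W(n) = n + 1
```

Because Sturmian sequences are uniformly recurrent, every word of these short lengths appears early in the run, so the counts match the theoretical complexity $W(n) = n + 1$ exactly.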
To count the number of subsequences of length $n$ for a general model, we can study the $n$-step herding map that results from applying $n$ herding steps at a time. The original cones are now further subdivided into smaller convex polygons, each one labeled with the sequence that the points inside the polygon will follow during the following $n$ steps. Thus as we increase $n$, the number of these polygons will increase, and it is exactly the number of these polygons partitioning our parameter space that equals the number of possible subsequences. We first claim that every polygon, however small, will break up into smaller sub-pieces after a finite amount of time. This is proven in WellingChen10. In fact, we expect that in a typical herding system every pair of points will break up as well, which, if true, would imply that the diameter of the polygons must shrink. A partition with this property is called a generating partition. Based on some preliminary analysis and numerical simulations, we expect that the growth of $W(n)$ in the typical case (i.e. with an incommensurate translation basis, see Appendix B of WellingChen10) is a polynomial function, $W(n) \sim n^K$, where $K$ is the number of dimensions (which is equal to the number of herding parameters). Since it has been rigorously proven that any piecewise isometry has a growth rate with an exponent less than or equal to $K$ (goetz2000dynamics), this would mean that herding achieves the highest possible entropy within this class of systems, with an entropy for a sequence of length $T$ (for $T$ large enough) of:
$$H(T) = \log W(T) \approx K \log T$$
This result should be understood in comparison with regular and random sequences. In a regular (constant or periodic) sequence, the number of subsequences is constant with respect to the length, i.e. $W(n) = O(1)$ and hence $H(T) = O(1)$. In contrast, the dominant part of the Kolmogorov-Sinai entropy of a random sequence (considering, e.g., a stochastic process) or a fully chaotic sequence grows linearly in time, i.e. $H(T) \propto T$, due to the injected random noise.
1.3 Generalized Herding
The moment matching property in Propositions 1.1 and 1.2 requires only a mild condition on the norm of the dynamic weights. That grants us great flexibility in modifying the original algorithm, both for more practical implementations and for a larger spectrum of applications. GelfandMaatenChenWelling10 provided a general condition on the recurrence of the weight sequence, based on which we discuss in this section how to generalize the herding algorithm, with two specific examples. chen2014herdingbookchapter described another extension of herding that violates this condition but instead achieves the minimum matching distance in a constrained problem.
1.3.1 A General Condition for Recurrence - The Perceptron Cycling Theorem
The moment matching property of herding relies on the recurrence of the weight sequence (Corollary 1.4), whose proof in turn relies on the premise that the maximization in the herding update equation 1.7 is carried out exactly. However, the number of model states is usually exponentially large (e.g. when $\mathbf{x}$ is a vector of $m$ discrete variables, each with $D$ values, there are $D^m$ states) and it is intractable to find a global maximum in practice. A local maximizer has to be employed instead. One wonders whether the features averaged over samples will still converge to the input moments when the samples are suboptimal states. In this subsection we give a general and verifiable condition for the recurrence of the weight sequence based on the perceptron cycling theorem (minsky1969perceptrons), which consequently suggests that the moment matching property may still hold at the rate of $O(1/T)$ even with a relaxed herding algorithm.
The invention of the perceptron (rosenblatt1958perceptron) goes back to the very beginning of AI more than half a century ago. Rosenblatt's very simple, neurally plausible learning rule made it an attractive algorithm for learning relations in data: for every input $\mathbf{x}_n$, make a linear prediction about its label, $\hat{y}_n = \operatorname{sign}(\mathbf{w}^\top \mathbf{x}_n)$, and update the weights as
$$\mathbf{w}_{t+1} = \mathbf{w}_t + y_n \mathbf{x}_n \, \mathbb{I}(\hat{y}_n \neq y_n)$$
A critical evaluation by minsky1969perceptrons revealed the perceptron's limited representational power. This fact is reflected in the behavior of Rosenblatt's learning rule: if the data is linearly separable, then the learning rule converges to the correct solution in a number of iterations that can be bounded by $(R/\gamma)^2$, where $R$ represents the norm of the largest input vector and $\gamma$ represents the margin between the decision boundary and the closest data-case. However, "for data sets that are not linearly separable, the perceptron learning algorithm will never converge" (quoted from bishop2006pattern).
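For reference, Rosenblatt's rule in its entirety (a toy sketch; the data below is an invented separable example with a bias feature appended):

```python
import numpy as np

def perceptron(X, y, epochs=100):
    """Rosenblatt's rule: on a mistake, w += y_n * x_n."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xn, yn in zip(X, y):
            if yn * (w @ xn) <= 0:   # misclassified (or on the boundary)
                w += yn * xn
                mistakes += 1
        if mistakes == 0:            # linearly separated: converged
            break
    return w

# a toy separable problem (bias feature appended as the last column)
X = np.array([[1.0, 2.0, 1.0], [2.0, 1.0, 1.0],
              [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
```

On separable data the loop exits once an epoch passes with no mistakes; on inseparable data it would cycle forever, with the weights nonetheless remaining bounded.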
While the above result is true, the theorem in question has something much more powerful to say. The “perceptron cycling theorem” (PCT) (minsky1969perceptrons) states that for the inseparable case the weights remain bounded and do not diverge to infinity. The PCT was initially introduced in minsky1969perceptrons but had a gap in the proof that was fixed in block1970boundedness.
Theorem 1.7 (Boundedness Theorem).
Consider a sequence of vectors $\mathbf{w}_t$, $t = 0, 1, 2, \ldots$, generated by the iterative procedure of Algorithm 1.
Then, $\|\mathbf{w}_t\| < \|\mathbf{w}_0\| + M$ for all $t$, where $M$ is a constant depending on $V$ but not on $t$.
The theorem still holds when $V$ is a finite set in a Hilbert space. The PCT leads to the boundedness of the perceptron weights when we identify $\mathbf{v}_t = y_n \mathbf{x}_n \mathbb{I}(\hat{y}_n \neq y_n)$, a finite set $V = \{y_n \mathbf{x}_n : n = 1, \ldots, N\} \cup \{\mathbf{0}\}$, and observe that whenever an update is made,
$$\langle \mathbf{w}_t, \mathbf{v}_t \rangle = y_n \mathbf{w}_t^\top \mathbf{x}_n \le 0$$
When the data is linearly separable, Rosenblatt's learning rule will find a $\mathbf{w}$ such that $y_n \mathbf{w}^\top \mathbf{x}_n > 0$ for all $n$, and the sequence of weights converges. Otherwise, there always exists some misclassified case $(\mathbf{x}_n, y_n)$ such that $y_n \mathbf{w}_t^\top \mathbf{x}_n \le 0$, and the PCT guarantees that the weights remain bounded.
The same theorem also applies to the herding algorithm by identifying $\mathbf{v}_t = \bar{\boldsymbol{\phi}} - \boldsymbol{\phi}(\mathbf{x}_t)$ with $\mathbf{x}_t$ defined in Equation 1.7, a finite set $V = \{\bar{\boldsymbol{\phi}} - \boldsymbol{\phi}(\mathbf{x}) : \mathbf{x} \in \mathcal{X}\}$, and observing that
$$\langle \mathbf{w}_t, \mathbf{v}_t \rangle = \mathbf{w}_t^\top \bar{\boldsymbol{\phi}} - \max_{\mathbf{x}} \mathbf{w}_t^\top \boldsymbol{\phi}(\mathbf{x}) \le 0$$
It is now easy to see that, in general, herding does not converge, because under very mild conditions we can always find an $\mathbf{x}$ such that $\langle \mathbf{w}_t, \bar{\boldsymbol{\phi}} - \boldsymbol{\phi}(\mathbf{x}) \rangle < 0$. More importantly, the boundedness theorem (or PCT) provides a general condition for the recurrence property and hence the moment matching property of herding. Inequality 1.40 is easy to verify at run time and does not require $\mathbf{x}_t$ to be the global optimum.
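The practical recipe suggested by the PCT can be sketched as follows: run herding with a cheap warm-started local search, check Inequality 1.40 at run time, and fall back to a more expensive search only when it fails. Everything below (the feature set, the random target distribution, the tiny state space) is an invented toy example:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
m = 5                                   # 5 binary variables -> 32 states
states = np.array(list(itertools.product([0, 1], repeat=m)), dtype=float)

def features(x):
    """Bits plus pairwise products (a tiny fully visible Boltzmann machine)."""
    pairs = [x[i] * x[j] for i in range(m) for j in range(i + 1, m)]
    return np.concatenate([x, pairs])

Phi = np.array([features(x) for x in states])     # 32 x 15 feature table
p = rng.dirichlet(np.ones(len(states)))           # a random target distribution
phi_bar = p @ Phi                                 # target moments

w = np.zeros(Phi.shape[1])
x = states[0].copy()
avg = np.zeros_like(phi_bar)
T = 2000
for _ in range(T):
    # partial maximization: coordinate ascent warm-started at the previous state
    for _ in range(3):
        for i in range(m):
            x_flip = x.copy()
            x_flip[i] = 1.0 - x_flip[i]
            if w @ features(x_flip) > w @ features(x):
                x = x_flip
    f = features(x)
    if w @ f < w @ phi_bar:           # Inequality 1.40 violated: fall back
        x = states[np.argmax(Phi @ w)].copy()
        f = features(x)
    w += phi_bar - f                  # usual herding update
    avg += f
avg /= T                              # time-averaged features: close to phi_bar
```

The fallback always satisfies the condition, since the global maximum of $\mathbf{w}^\top \boldsymbol{\phi}$ dominates any average of feature vectors; in larger models one would replace it with a longer or restarted local search.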
1.3.2 Generalizing the Herding Algorithm
PCT ensures that the average features from the samples will match the moments at a fast convergence rate as long as the algorithm we are running satisfies the following conditions:
1. the set $V$ of possible weight updates, $\mathbf{v}_t = \bar{\boldsymbol{\phi}} - \boldsymbol{\phi}(\mathbf{x}_t)$, is finite, and
2. every applied update satisfies $\langle \mathbf{w}_t, \mathbf{v}_t \rangle \le 0$, i.e. $\mathbf{w}_t^\top \boldsymbol{\phi}(\mathbf{x}_t) \ge \mathbf{w}_t^\top \bar{\boldsymbol{\phi}}$.
This set of mild conditions allows us to generalize the original herding algorithm easily.
Firstly, the PCT provides a theoretical justification for using a local search algorithm that performs partial maximization. For example, we may start the local search from the state we ended up in during the previous iteration (a so-called persistent chain (Younes89; Neal92; Yuille04; Tieleman08)). Or, one may consider contrastive-divergence-like algorithms (Hinton02), in which the sampling or mean-field approximation is replaced by a maximization. In this case, maximizations are initialized on all data-cases and the weights are updated by the difference between the average over the data-cases and the average over the states found after (partial) maximization. The set $V$ is then given by $V = \big\{ \frac{1}{N} \sum_n \big( \boldsymbol{\phi}(\mathbf{x}_n) - \boldsymbol{\phi}(\mathbf{x}_n^*) \big) \big\}$, where $\mathbf{x}_n^*$ is the state found by initializing the maximization at data-case $\mathbf{x}_n$. For obvious reasons, it is now guaranteed that $\mathbf{w}_t^\top \boldsymbol{\phi}(\mathbf{x}_n^*) \ge \mathbf{w}_t^\top \boldsymbol{\phi}(\mathbf{x}_n)$ for every $n$, and hence $\langle \mathbf{w}_t, \mathbf{v}_t \rangle \le 0$.
Secondly, we often use mini-batches of size $n < N$ in practice instead of the full data set. In this case, the cardinality of the set $V$ is enlarged to, e.g., $\binom{N}{n}$ elements, representing the "$N$ choose $n$" ways to compute the sample mean based on a subset of $n$ data-cases. The negative term remains unaltered, so the PCT still applies and the weights remain bounded. Depending on how the mini-batches are picked, convergence onto the overall mean can be either $O(1/\sqrt{T})$ (random sampling with replacement) or $O(1/T)$ (sampling without replacement, which has picked all data-cases after $N/n$ rounds).
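A one-dimensional sketch of mini-batched herding (toy data; the rate 0.37 and the batch size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, T = 100, 10, 20000
data = (rng.random(N) < 0.37).astype(float)   # binary data-cases
full_mean = data.mean()

w, fires = 0.0, 0
for _ in range(T):
    batch = rng.choice(N, size=n, replace=False)  # a random mini-batch
    s = 1 if w > 0 else 0
    w += data[batch].mean() - s    # positive term: mini-batch mean of the feature
    fires += s
rate = fires / T                   # converges onto full_mean
```

The PCT condition still holds here (when $w > 0$ the update is negative, and vice versa), so the weight stays bounded and the empirical rate converges onto the full-data mean, at the slower $O(1/\sqrt{T})$ rate because the batches are sampled with replacement across iterations.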
Besides changing the way we compute the positive and negative terms in the weight update, generalizing the definition of the features will allow us to learn a much wider scope of models beyond fully visible MRFs, as discussed in the following sections.
1.3.3 Herding Partially Observed Random Field Models
The original herding algorithm only works for fully visible MRFs, because in order to compute the average feature vector of the training data we have to observe the state of all the variables in the model. In this subsection, we generalize herding to partially observed MRFs (POMRFs) by dynamically imputing the values of the latent variables in the training data during the run of herding. This extension allows herding to be applied to models with a higher representational capacity.
Consider an MRF with discrete random variables $(\mathbf{x}, \mathbf{z})$, where $\mathbf{x}$ will be observed and $\mathbf{z}$ will remain hidden. A set of feature functions $\{\phi_k(\mathbf{x}, \mathbf{z})\}$ is defined on $\mathbf{x}$ and $\mathbf{z}$, each associated with a weight $w_k$. Given these quantities we can write the following Gibbs distribution,
$$p(\mathbf{x}, \mathbf{z}; \mathbf{w}) = \frac{1}{Z(\mathbf{w})} \exp\big( \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}, \mathbf{z}) \big)$$
The log-likelihood function with a dataset $\mathcal{D} = \{\mathbf{x}_n\}_{n=1}^{N}$ is defined as
$$\ell(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \log \sum_{\mathbf{z}_n} \exp\big( \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n, \mathbf{z}_n) \big) - \log Z(\mathbf{w})$$
Analogous to the duality relationship between MLE and MaxEnt for fully observed MRFs, we can write the log-likelihood of a POMRF as
$$\ell(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \max_{q_n} \Big( \mathbb{E}_{q_n}\big[\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n, \mathbf{z}_n)\big] + H(q_n) \Big) - \max_{r} \Big( \mathbb{E}_{r}\big[\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}, \mathbf{z})\big] + H(r) \Big)$$
where the $q_n$ are variational distributions on $\mathbf{z}_n$ (one for each data-case), and $r$ is a variational distribution on $(\mathbf{x}, \mathbf{z})$. The dual form of MLE turns out to be a minimax problem on $(\mathbf{w}, \{q_n\}, r)$ with a set of constraints
$$\frac{1}{N} \sum_{n=1}^{N} \mathbb{E}_{q_n}\big[\boldsymbol{\phi}(\mathbf{x}_n, \mathbf{z}_n)\big] = \mathbb{E}_{r}\big[\boldsymbol{\phi}(\mathbf{x}, \mathbf{z})\big]$$
We want to achieve high entropy for the distributions $q_n$ and $r$, and meanwhile the average feature vector on the training set, with hidden variables marginalized out, should match the expected features w.r.t. the joint distribution of the model. The weights $\mathbf{w}$ act as Lagrange multipliers enforcing those constraints.
Similar to the derivation of herding for fully observed MRFs, we now introduce a temperature in Equation 1.42 by replacing $\mathbf{w}$ with $\mathbf{w}/T$. Taking the limit $T \to 0$, we see that the entropy terms vanish. For a given value of $\mathbf{w}$ and in the absence of entropy, the optimal distributions $q_n$ and $r$ are delta-peaks and the averages should be replaced with maximizations, resulting in the objective,
$$\ell_0(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \max_{\mathbf{z}_n} \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n, \mathbf{z}_n) - \max_{\mathbf{x}, \mathbf{z}} \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}, \mathbf{z})$$
where we denote by $\ell_0$ the zero-temperature limit of the log-likelihood.
Taking a gradient ascent step on $\ell_0$ with a fixed learning rate ($\eta = 1$) defines the herding algorithm on POMRFs (Welling09B):
$$\mathbf{z}_n^{*t} = \operatorname*{argmax}_{\mathbf{z}} \; \mathbf{w}_{t-1}^\top \boldsymbol{\phi}(\mathbf{x}_n, \mathbf{z}), \quad \forall n$$
$$(\mathbf{x}^{*t}, \mathbf{z}^{*t}) = \operatorname*{argmax}_{\mathbf{x}, \mathbf{z}} \; \mathbf{w}_{t-1}^\top \boldsymbol{\phi}(\mathbf{x}, \mathbf{z})$$
$$\mathbf{w}_t = \mathbf{w}_{t-1} + \frac{1}{N} \sum_{n=1}^{N} \boldsymbol{\phi}(\mathbf{x}_n, \mathbf{z}_n^{*t}) - \boldsymbol{\phi}(\mathbf{x}^{*t}, \mathbf{z}^{*t})$$
We use a superscript "$*$" to denote states obtained by maximization. These equations are similar to herding for the fully observed case, but differ in the sense that we need to impute the unobserved variables for every data-case separately through maximization. The weight update also consists of a positive "driving term," which is now a changing average over data-cases, and a negative term, which is identical to the corresponding term in the fully observed case.
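These updates are sketched below on a deliberately trivial POMRF: one visible bit $x$, one hidden bit $z$, and features $(x, z, xz)$. The data and model are invented for illustration; the point is only the shape of the updates and the resulting matching of the time-averaged positive and negative terms:

```python
import numpy as np

rng = np.random.default_rng(0)
data = (rng.random(50) < 0.7).astype(float)    # toy binary observations

def phi(x, z):
    """Features on one visible bit x and one hidden bit z."""
    return np.array([x, z, x * z])

w = rng.standard_normal(3) * 0.1
pos_sum, neg_sum = np.zeros(3), np.zeros(3)
T = 1000
for _ in range(T):
    # impute hidden variables: z_n* = argmax_z w . phi(x_n, z), per data-case
    pos = np.mean([max((phi(x, z) for z in (0.0, 1.0)), key=lambda f: w @ f)
                   for x in data], axis=0)
    # negative term: joint maximization over (x, z)
    neg = max((phi(x, z) for x in (0.0, 1.0) for z in (0.0, 1.0)),
              key=lambda f: w @ f)
    w += pos - neg                 # driving term minus negative term
    pos_sum, neg_sum = pos_sum + pos, neg_sum + neg
gap = np.abs(pos_sum / T - neg_sum / T).max()  # shrinks as O(1/T)
```

The PCT applies here as well: the negative term is a global maximum of $\mathbf{w}^\top \boldsymbol{\phi}$ and therefore dominates the driving term, so the weights stay bounded and the gap between the two time averages vanishes as $O(1/T)$.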