1 Probabilistic formulation of the generalised elastic net (GEN)
Given a collection of centroids that we express as a matrix
and a scale, or variance, parameter, consider a Gaussian-mixture density with components, where , i.e., all covariance matrices are equal and isotropic. Consider a Gaussian prior on the centroids where is a regularisation, or inverse variance, parameter and SS is a positive definite or positive semidefinite matrix—in the latter case the prior being improper. The normalisation constant of this density will not be relevant for our purposes and is given in appendix A.
We define a generalised elastic net (GEN) by . The original elastic net of Durbin and Willshaw (1987) and Durbin et al. (1989) is recovered for a particular choice of the matrix SS (section 2.6). It is also possible to define a prior over the scale , but we will not do so here, since will play the role of the temperature in a deterministic annealing algorithm.
Without the prior over centroids, the centroids could be permuted at will with no change in the model, since the variable is just an index. The prior can be used to convey the topologic (dimension and shape) and geometric (e.g. curvature) structure of a manifold implicitly defined by the centroids (as if the centroids were a sample from a continuous latent variable model; Carreira-Perpiñán, 2001). This prior can also be seen as a Gaussian process prior (e.g. Bishop et al., 1998a, p. 219), where our matrix SS is the inverse of the Gaussian process covariance matrix. The semantics of SS is not necessary to develop a learning algorithm and so its study is postponed to section 3. However, it will be convenient to keep in mind that SS will be typically derived from a discretised derivative based on a finite difference scheme (or stencil) and that SS will be a sparse matrix.
2 Parameter estimation with annealing
From a statistical learning point of view, one might wish to find the values of the parameters and that maximise the objective function
given a training set expressed as a matrix
—that is, maximum-a-posteriori (MAP) estimation. For iid data and ignoring a term independent ofand , this equation reduces to:
over alone, starting with a large (for which the tension term dominates) and tracking the minimum to a small value of (for which the fitness term dominates). This is so because (1) one can find good solutions to combinatorial optimisation problems such as the TSP (which require ) and to dimension-reduction problems such as cortical map modelling (which do not require ); and (2) if considered as a dynamical system for a continuous latent space, the evolution of the net as a function of and the iteration index may model the temporal evolution of cortical maps. To attain good generalisation to unseen data, the parameter1997) or some other means. However, in this paper we will be interested in investigating the behaviour of the model for a range of values, rather than fixing according to some criterion.
We will call the -term the fitness term, arising from the Gaussian mixture , and the -term the tension term, arising from the prior .
Eq. (3) differs from the MAP objective function of eq. (2) in an immaterial change of sign, in the deletion of a now constant term , and in the multiplication of the fitness term by , where is a positive constant. The latter reduces the influence of the fitness term with respect to the tension term as decreases. Since can be absorbed in , we will take hereafter. However, it can be useful to have a different for each training point in order to simulate a training set that overrepresents some data points over others (see section 2.5.1).
We derive three iterative algorithms for minimising (gradient descent, matrix iteration and Cholesky factorisation), all based on the gradient of :
where we define the weight matrix and the invertible diagonal matrix as
The weight is also the responsibility of centroid for generating point , and so is the total responsibility of centroid (or the average number of training points assigned to it). The matrix is then a list of average centroids. All three algorithms, particularly the matrix-iteration and Cholesky-factorisation ones, benefit considerably from the fact that the matrix SS will typically be sparse (with a banded or block-banded structure).
2.1 Gradient descent
We simply iterate with . However, this only converges if the ratio is small (for large problems, it needs to be very small). For the original elastic net this is easily seen for a net with two centroids: the gradient of the tension term at each centroid points towards the other, and if the gradient step is larger than the distance between both centroids, the algorithm diverges exponentially (as we have confirmed by simulation). This could be alleviated by taking a shorter gradient step (by taking with ), but for large the step would become very small.
2.2 Iterative matrix method
Equating the gradient of eq. (4) to zero we obtain the nonlinear system
This equation is similar to others obtained under related modelling assumptions (Yuille et al., 1996; Dayan, 1993; Bishop et al., 1998b, a; Utsugi, 1997). is a symmetric positive definite matrix (since SS is positive (semi)definite), and both (through ) and depend on . This equation is the basis for a fixed-point iteration, where we solve for with and fixed, then update and for the new , and repeat till convergence. Convergence is guaranteed since we can derive an analogous procedure as an EM algorithm that optimises over for fixed as in Yuille et al. (1994) and Utsugi (1997). Essentially, eq. (5) becomes the familiar Gaussian-mixture update for the means with . Thus, from the algorithmic point of view, it is the addition of that determines the topology.
For the class of matrices SS studied in this paper, is a large sparse matrix (of the order of ) so that explicitly computing its inverse is out of the question: besides being a numerically ill-posed operation, it is computationally very costly in time and memory, since is a large, nonsparse matrix. Instead, we can solve the system for (with and fixed) by an iterative matrix method (see e.g. Isaacson and Keller, 1966; Smith, 1985). Rearrange the system (5) as (calling the unknowns) with (diagonal lower triangular upper triangular) to give an iterative procedure , a few iterations of which will usually be enough to compute (or in eq. (4)) approximately, assuming the procedure converges. The following procedures are common:
Decomposing the equation as results in the iterative procedure . Since is diagonal, can be efficiently computed and remains sparse.
Decomposing the equation as results in the iterative procedure , which can be implemented without explicitly computing if instead of computing from , we use the already computed elements as well as the old ones to compute ; see e.g. eq. (7d).
- Successive overrelaxation (SOR)
is also possible and can be faster, but requires setting of the relaxation parameter by trial and error.
For sparse matrices, both Jacobi and Gauss-Seidel are particularly fast and respect the sparsity structure at each iteration, without introducing extra nonzero elements. Both methods are quite similar, though for the kind of sparse matrix that we have with the elastic net, Gauss-Seidel should typically be about twice as fast than Jacobi and requires keeping just one copy of in memory.
The matrix iterates can be interleaved with the updates for and .
2.3 Direct method by Cholesky factorisation
We can obtain efficiently and robustly by solving the sparse system of equations (5) by Gaussian elimination via Cholesky decomposition111David Willshaw (pers. comm.) has also developed independently the idea of the Cholesky factorisation for the original elastic net., since is symmetric and positive definite. Specifically, the procedure consists of (George and Liu, 1981; Duff et al., 1986):
Ordering: find a good permutation of the matrix .
Cholesky factorisation: factorise the permuted matrix into where is lower triangular with nonnegative diagonal elements.
Triangular system solution: solve and both by Gaussian elimination in and then set .
Step 1 is not strictly necessary but usually accelerates the procedure for sparse matrices. This is because, although the Cholesky factorisation does not add zeroes outside the bands of (and thus preserves its banded structure), it may add new zeroes inside (i.e., add new “fill”), and it is possible to reduce the number of zeroes in by reordering the rows of . However, the exact minimisation of the fill is NP-complete and so one has to use a heuristic ordering method, a number of which exist, such as minimum degree ordering (George and Liu, 1981, pp. 115–137).
2.4 Method comparison
The gradient descent method as defined earlier converges only for values smaller than a certain threshold that is extremely small for large nets; e.g. Durbin and Willshaw (1987) used for a 2D TSP problem of cities and a net with centroids; but Durbin and Mitchison (1990) used for a 4D cortical map model of centroids. For large problems, a tiny ratio means that—particularly for fast annealing—the tension term has very little influence on the final net, and as a result there is no benefit in using one matrix SS over another.
A sufficient condition for the convergence of both matrix iteration methods (Jacobi and Gauss-Seidel) is that the matrix be positive definite (which it is). Also, the Cholesky factorisation is stable without pivoting for all symmetric positive definite matrices, although pivoting is advisable for positive semidefinite matrices (Golub and van Loan, 1996). Thus, all three methods are generally appropriate. However, for high values of and if SS is positive semidefinite, then becomes numerically close to being positive semidefinite (even in later stages of training, when has large diagonal elements). In this case, the Jacobi and Gauss-Seidel methods may diverge, as we have observed in practice (and converged if was lowered). In contrast, we have never found the Cholesky factorisation method to diverge, at least for the largest problems we have tried with , , stencils of up to order and up to .
The Cholesky factorisation is a direct method, computed in a finite number of operations for dense but much less for sparse . This is unlike the Jacobi or Gauss-Seidel methods, which are iterative and in principle require an infinite number of iterations to converge (although in practice a few may be enough). With enough iterations (a few tens, for the problems we tried), the Jacobi method converges to the solution of the Cholesky method; with as few as to , it gets a reasonably good one. In general, the computation time required for the solution of eq. (5) by the Cholesky factorisation is usually about twice as high as that of the Jacobi method. The bottleneck of all elastic net learning algorithms is the computation of the weight matrix of squared distances between cities and centroids, which typically takes several times longer than solving eq. (5). Thus, using the Cholesky factorisation only means an increase of around –% of the total computation time.
Regarding the quality of the iteration, the Cholesky method (and, for enough iterations, the Jacobi and Gauss-Seidel methods) goes deeper down the energy surface at each annealing iteration; this is particularly noticeable when the net collapses into its centre of mass for large . In contrast, gradient descent takes tiny steps and requires many more iterations for a noticeable change to occur. It might then be possible to use faster annealing than with the gradient method, with considerable speedups. On the other hand, the Cholesky method is not appropriate for online learning, where training points come one at a time, because the net would change drastically from one training point to the next. However, this is not a problem practically since the stream of data can always be split into chunks of appropriate size.
In summary, the robustness and efficiency of the (sparse) Cholesky factorisation make it our method of choice and allow us to investigate the behaviour of the model for a larger range of values than has previously been possible. All the simulations in this paper use this method.
2.5 Practical extensions
Here we describe two practically convenient modifications of the basic elastic net model.
2.5.1 Weighting points of the training set
For some applications it may be convenient to define a separate for each data point (e.g. to overrepresent some data points without having to add extra copies of each point, which would make larger). In this case, the energy becomes
and all other equations remain the same by defining the weights as
that is, multiplying the old times .
Over- or underrepresenting training set points is useful in cortical map modelling to simulate deprivation conditions (e.g. monocular deprivation, by reducing for the points associated with one eye) or nonuniform feature distributions (e.g. cardinal orientations are overrrepresented in natural images). It is also possible to make dependent on , so that e.g. the overrepresentation may take place at specific times during learning; this is useful to model critical periods (Carreira-Perpiñán et al., 2003).
2.5.2 Introducing zero-valued mixing proportions
We defined the fitness term of the elastic net as a Gaussian mixture with equiprobable components. Instead, we can associate a mixing proportion with each component , subject to and . In an unsupervised learning setting, we could take as parameters and learn them from the training set, just as we do with the centroids . However, we can also use them to disable centroids from the model selectively during training (by setting the corresponding to zero). This is a computationally convenient strategy to use non-rectangular grid shapes in 2D nets. Specifically, for each component to be disabled, we need to:
Set (and renormalise all ).
If using a symmetric matrix , set to zero column of and all rows of that had a nonzero in the element corresponding to column . This is equivalent to eliminating from the tension term all linear combinations that involved . This implicitly assumes a specific type of boundary condition; other types of b.c. may be used by appropriately modifying such rows (rather than zeroing them). If SS is not symmetric, the manipulations are more complicated.
It is easy to see that the energy is now independent of the value of : no “force” is exerted on either from the fitness or the tension term, and likewise exerts no tension on any other centroid. Thus, in the training algorithm, we can simply remove it (and the appropriate parts of SS, etc.) from the update equations. We could also simply leave it there, since its gradient component is zero and its associated equation (5) is zero both in the RHS and the LHS (for all ). However, the latter option is computationally wasteful, since the operations associated with are still being carried out, and can lead to numerical instability in the iterative-matrix and Cholesky-factorisation methods, since the matrices involved become singular (although this can be easily overcome by setting to some nonzero value).
When mixing proportions and training set weights are used, the energy becomes:
and all other equations remain the same by defining the weights as
Selectively disabling centroids is useful in cortical map modelling to use a 2D net that approximates the shape of primary visual cortex and may include lesions (patches of inactive neurons in the cortex). It can also be used to train separate nets on the same training set and so force them to compete with each other (both with 1D or 2D nets); in fact, the central-difference stencil (section 5.5.2) leads to a similar situation by separating the tension term into decoupled subterms.
2.6 Comparison with the original elastic net model
The original elastic net results from using the matrix corresponding to a stencil and (see section 4). For example, for a 1D net with centroids (where for nonperiodic b.c. and for periodic b.c.):
The original elastic net papers used annealing with either gradient descent (Durbin and Willshaw, 1987; Durbin et al., 1989) or Gauss-Seidel iteration (Durbin and Mitchison, 1990). For the latter, the updates of must be done sequentially on the index .
3 Construction of the matrix SS
The definition of the model and the optimisation algorithms given depend on the matrix SS; we are left now with the search for a meaningful matrix SS that incorporates some knowledge of the problem being modelled. For example, for the original elastic net, the tension term embodies an approximation to the length of the net, which is the basis for a heuristic solution of the TSP and for a cortical map model. In this section we discuss general properties of the matrix SS and then, in the next section, concentrate on differential operators.
Firstly, note that the formulation of the tension term (ignoring the factor) as is not the most general quadratic form of the parameters . This is because the cross-terms for have weight zero and the terms have a weight independent of the dimension . Thus, the different dimensions of the net (or maps) are independent and formally identical in the tension term (though not so in the fitness term). The most general quadratic form could be represented as where is the column concatenation of all elements of and SS is now , or as where
is a 4D tensor. However, our more particular class of penalties still has a rich behaviour and we restrict ourselves to it.
Secondly, note that it is enough to consider symmetric matrices SS of the form where is arbitrary. We have that, for a nonsymmetric matrix SS:
and so using SS is equivalent to using the symmetric form . Further, since SS must be positive (semi)definite for the energy to be lower bounded, we can always write for some real matrix, without loss of generality, and so the tension term can be written where is the Frobenius matrix norm. However, the learning algorithms given earlier can be used for any SS.
The matrix has two functions. As neighbourhood relation, it specifies the strength of the tension between centroids and thus the expected metric properties of the net, such as its curvature. As adjacency matrix, it specifies what centroids are directly connected222However, the notion of adjacency becomes blurred when we consider that a sparse stencil such as can be equivalent (i.e., produce the same SS) to a nonsparse one, as we show in section 5.6. and so the topology of the net. Thus, by changing , we can turn a given collection of centroids into a line, a closed line, a plane sheet, a torus, etc. Practically, we typically concern ourselves with a fixed topology and are interested in the effects of changing the “elasticity” or “stiffness” of the net via the tension term.
Thirdly, for net topologies where the neighbourhood relations do not depend on the actual region in question of the net, we can represent the matrix in a compact form via a stencil and suitable boundary conditions (b.c.). Multiplication by then becomes convolution by the stencil. Further, in section 5.6 we will show that with periodic b.c. we can always take to be a symmetric matrix and so .
In summary, we can represent quadratic tension terms in full generality through an operator matrix , and we will concentrate on a particular but important class of matrices , where different dimensions are independent and identical in form, and where the operator is assumed translationally invariant so that results from convolving with a stencil. Before considering appropriate types of stencil, we note some important invariance properties.
3.1 Invariance of the penalty term with respect to rigid motions of the net
Consider an orthogonal matrix and an arbitrary vector . In order that the matrix represent an operator invariant to rotations and translations of the net centroids, we must have the same penalty for and for (where is the vector of ones):
Thus, any will provide invariance to rotations, but invariance to translations requires , i.e., a differential operator matrix (see later).
Another way to see this, for the circulant matrix case discussed later, is as follows: a circulant positive definite matrix can always be decomposed as with circulant positive semidefinite (verifying , or ) and . This corresponds to the product of two priors, one given by the differential operator matrix and the other one by the matrix . The effect of the latter is to penalise non-zero-mean nets, since , which we do not want in the present context, since it depends on the choice of origin for the net. Such matrices can be desirable for smoothing applications, naturally, and the training algorithms still work in this case because SS remains positive semidefinite (and so the energy function is bounded below).
3.2 Invariance of SS with respect to transformations of or the stencil
There can be different matrices that result in the same ; in fact, is a square root of SS. Any matrix will produce the same SS if is a matrix verifying . This is the same sort of unidentifiability as in factor analysis (by rotation of the factors; Bartholomew, 1987), and in our case it means that the same nonnegative quadratic form can be obtained for many different matrices . Particular cases include:
Orthogonal rotation of : is square with .
Sign reversal of any number of rows of : is diagonal with elements.
Permutation of rows of (i.e., reordering of the summands in the tension term): is a permutation matrix.
Insertion of rows of zeroes to : is the identity with intercalated rows of zeroes.
When is derived from a stencil, SS is invariant with respect to the following stencil transformations:
Sign reversal, e.g. is equivalent to .
Shifting the stencil, since this is equivalent to permuting rows of
. In particular, padding with zeroes the borders of the stencil (but not inserting zeroes), e.g. the forward differenceis equivalent to the backward difference but not to the central difference .
These invariance properties are useful in the analysis of section 5 and in the implementation of code to construct the matrix SS given the stencil.
4 Construction of the matrix from a differential stencil
To represent the matrix by a stencil consider for example the original elastic net, in which the tension term consists of summing terms of the form over the whole net. In this case, the stencil is and contains the coefficients that multiply , and . In general, a stencil represents a linear combination of which then we take the square. A given row of matrix is obtained by centring the stencil on the corresponding column, and successive rows by shifting the stencil. If the stencil is sparse, i.e., has few nonzero elements, then and SS are sparse (with a banded or block-banded structure).
It is necessary to specify boundary conditions when one of the stencil ends overspills near the net boundaries. In this paper we will consider only the simplest types of boundary conditions: periodic, which uses modular arithmetic; and nonperiodic or open, which simply discards linear combinations (rows of ) that overspill. In both cases the resulting matrix is a structured matrix: circulant for periodic b.c., since it is obtained by successively rotating the stencil (see section 5); or quasi-Toeplitz for nonperiodic b.c., being almost completely defined by its top row and left column. The following example illustrates these ideas for a stencil and a 1D net with :
These ideas generalise directly to elastic nets of two or more dimensions, but the structure of is more complicated, being circulant or quasi-Toeplitz by blocks. In the analysis of section 5 we consider only periodic b.c. for simplicity, but the examples of section 6 will include both periodic and nonperiodic b.c.
Some notational remarks. As in the earlier example, we will assume the convention that the first row of contains the stencil with its central coefficient in the first column, the coefficients to the right of the stencil occupy columns , , etc. of the matrix and the coefficients to the left occupy columns , , etc. respectively. The remaining rows of
are obtained by successively rotating the stencil. Without loss of generality, the stencil must have an odd number of elements to avoid ambiguity in the assignation of these elements to the net points. Occasionally we will index the stencil starting from the left, e.g.. We will always assume that the number of centroids in the net (the dimension of the vector ) is larger than the number of coefficients in the stencil and so will often need to pad the stencil with zeroes. Rather than explicitly notating all these circumstances, which would be cumbersome, we will make clear in each context which version of the indexing we are using. This convention will make easy various operations we will need later, such as rotating the stencil to build a circulant matrix
, convolving the stencil with the net or computing the Fourier transform of the stencil. The tension term separates additively intoterms, one for each row of , so we consider the case ; call the resulting vector. In this formulation, the vector is the discrete convolution of the net and the stencil, while the vector (which is what we use) is the discrete convolution of the net and the reversed stencil . In both cases the tension value is the same. This can be readily seen by noting that if is circulant, which implies and so ; or by noting that the stencil and the reverse stencil have the same power spectrum ().
4.1 Types of discretised operator
Turning now to the choice of stencil, there are two basic types:
- Differential, or roughening
This results when the stencil is a finite-difference approximation to a continuous differential operator, such as the forward difference approximation to the first derivative (Conte and de Boor, 1980):
where the grid constant is small. Differential operators characterise the metric properties of the function , such as its curvature, and are local, their value being given at the point in question. Consequently, the algebraic sum of the elements of a finite-difference stencil must be zero (otherwise, the operator’s value value would diverge in the continuum limit, as ). This zero-sum condition can also be expressed as , where is a vector of ones, and has the consequences that:); SS is positive semidefinite and the prior on the centroids is improper; and the tension term is invariant to rigid motions of the centroids (translations and rotations). Note the fitness term is invariant to permutations of but not to rigid motions.
- Integral, or smoothing
This results when the stencil is a Newton-Cotes quadrature formula, such as Simpson’s rule (Conte and de Boor, 1980):
Integral operators are not local, their value being given by an interval. Thus, the corresponding stencil is not zero-sum.
Integral operators depend on the choice of origin (i.e., the “DC value”) and so do not seem useful to constrain the geometric form of an elastic net, nor would they have an obvious biological interpretation (although, of course, they are useful as smoothing operators). Differential operators, in contrast, can be readily interpreted in terms of continuity, smoothness and other geometric properties of curves—which are one of the generally accepted principles that govern cortical maps, together with uniformity of coverage of the stimuli space (Swindale, 1991, 1996; Swindale et al., 2000; Carreira-Perpiñán and Goodhill, 2002). Note that the effect of the tension term is opposite to that of the operator, e.g. a differential operator will result in a penalty for nonsmooth nets.
Also, let us briefly consider a prior that is particularly simple and easy to implement: , or a tension term proportional to , resulting from a stencil
. By placing it on the parameters of a mapping, this prior has often been used to regularise mappings, as in neural nets (weight decay, or ridge regression;Bishop, 1995) or GTM (Bishop et al., 1998b). In these cases, such a prior performs well because even if the weights are forced to small magnitudes, the class of functions they represent is still large and so the data can be modeled well. In contrast, in the elastic net the prior is not on the weights of a parameterized mapping but on the values of the mapping itself, and so the effect is disastrous: first, it biases the centroids towards the origin irrespectively of the location of the training set in the coordinate system; second, there is no topology because the prior factorizes over all .
We now concentrate on differential stencils, i.e., will approximate a given derivative of a continuous function through a regularly spaced sample of . In essence, the stencil is just a finite difference scheme. Particular cases of such differential operators have been applied in related work. Dayan (1993) proposed to construct topology matrices by selecting different ways of reconstructing a net node from its neighbours (via a matrix ). This strategy is equivalent to using a matrix defined as . For the examples he used, corresponds to (the original elastic net: forward difference of order ); and corresponds to (forward difference of order ). The latter was also used by Utsugi (1997) and Pascual-Marqui et al. (2001). Here we seek a more systematic way of deriving stencils that approximate a derivative of order . Tables 1-2 give several finite-difference schemes for functions (nets) of one and two dimensions, respectively.
4.2 Truncation error of a differential stencil
Methods for obtaining finite-difference schemes such as truncated Taylor expansions (the method of undetermined coefficients), interpolating polynomials or symbolic techniques can be found in numerical analysis textbooks (see e.g.Conte and de Boor, 1980, section 7.1; Gerald and Wheatley, 1994, chapter 4 and appendix B; Godunov and Ryaben’kiĭ, 1987, § 11.6). Here we consider Taylor expansions. Given a differential stencil , we can determine what derivative it approximates and compute its truncation error by using a Taylor expansion around :
Consider a centre-aligned stencil and define over where . Then:
Call and the smallest integers such that and for and . Then, doing gives
where is “near” . That is, represents a derivative of order with a truncation error of order (note that for any differential stencil and so , the zero-sum condition). Conversely, to construct a stencil that approximates a derivative of order with truncation error of order we simply solve for the coefficients given the values .
Obviously, a derivative of order can be approximated by many different stencils (with different or the same truncation error). From a numerical analysis point of view, one seeks stencils that have a high error order (so that the approximation is more accurate) and as few nonzero coefficients as possible (so that the convolution can be efficiently computed); for example, for the first-order derivative the central-difference stencil , which has quadratic error, is preferable to the forward-difference stencil , which has linear error—in fact, the central difference is routinely used by mathematical software to approximate gradients.
However, from the point of view of the GEN, the nonuniqueness of the stencil raises an important question: can we expect different stencils of the same derivative order to behave similarly? Surprisingly, the answer is no (see section 5.5).
Before proceeding, it is important to make a clarification. It is well known that estimating derivatives from noisy data (e.g. to locate edges in an image) is an ill-posed problem. The derivatives we compute are not ill-posed because the net (given by the location of the centroids) is not a noisy function—the value of the tension term is computed exactly.