A large variety of problems arising in engineering, physics, chemistry and economics can be formulated as optimization problems, either constrained or unconstrained having continuous or discrete variables. A well developed area in optimization theory are local optimization methods including conjugate gradient, quasi-Newton, Levenberg-Marquardt, sequential quadratic programming etc. [Bertsekas, 1996; Fletcher, 1987; Gill et al., 1981]. Continuous-time optimization methods have been developed within the area of neural networks [Cichocki & Unbehauen, 1994]. Popular methods for global exploration of the search space are e.g. simulated annealing [Kirkpatrick et al.
, 1983] and genetic algorithms [Goldberg, 1989]. Within standard local optimization techniques one often applies multi-start local optimization, by trying different starting points and running the processes independently from each other and selecting the best result from all trials. On the other hand, for training of neural networks such as multilayer perceptrons (MLP) it is well-known that instead of training several MLPs for random choices of small initial weights and selecting the best of all trained networks, one better forms a committee network which is based upon all trained networks [Arbib, 1995; Bishop, 1995]. In other words, the training efforts of the less optimal networks are not entirely useless but can be employed in order to improve the estimate in view of the bias-variance trade-off [Bishop, 1995]. Unfortunately, committee networks are only applicable to static nonlinear regression and classification problems.
In this paper we propose a new methodology of coupled local minimizers (CLM) for solving continuous nonlinear optimization problems. We pose a somewhat similar challenge as for committee networks but within a different and broader context of solving differentiable optimization problems. The aim is to (on-line) combine the results from local optimizers in order to let the ensemble generate a local minimum that is better than the best result obtained from all individual local minimizers. We show how improved local minima can be obtained by having interaction and information exchange between the local search processes. This is realized through state synchronization constraints that are imposed between the local minimizers by incorporating principles of master-slave dynamics. Synchronization theory has been intensively studied within the area of chaotic systems and secure communications [Chen & Dong, 1998; Pecora & Carroll, 1990; Suykens et al., 1996, 1997, 1998; Wu & Chua, 1994]. The CLM method is related to Lagrange programming network approaches for chaos synchronization [Suykens & Vandewalle, 2000], where identical or generalized synchronization constraints are imposed on dynamical systems. CLMs also fit within the framework of Cellular Neural Networks (CNN) [Chua & Roska, 1993; Chua et al., 1995; Chua, 1998]. By considering the objective of minimizing the average cost of an ensemble of local minimizers subject to pairwise synchronization constraints, a continuous-time optimization algorithm is studied according to Lagrange programming networks [Cichocki & Unbehauen, 1994; Zhang & Constantinides, 1992]. The resulting continuous-time optimization algorithm is described by an array of coupled nonlinear cells or a one-dimensional CNN with bi-directional coupling.
We show how to obtain intelligence from CLMs. This is done by considering a problem of static nonlinear regression with MLPs from given noisy measurement data. In this case CLMs correspond to coupled backpropagation processes [Rumelhartet al., 1986; Werbos, 1990]. In order to obtain an MLP with good generalization performance (i.e. an intelligent solution) it is usually needed to consider a regularization term in addition to the original sum squared error cost function (fitting error) which is defined on the training data [Bishop, 1995; MacKay, 1992; Poggio & Shelton C. 1999; Suykens & Vandewalle, 1998]. When applying CLMs such a regularization term is not needed. It turns out that the initial distribution of the initial states of the several local minimizers can play this role, such that one only has to optimize the training set error. This is consistent with insights from Bayesian learning of neural networks where the regularization term is related to the choice of the prior on the unknown interconnection weights, which expresses that the initial weights should be chosen small (weight decay).
The role of the initial CLM state is not only important for neural networks but also with respect to potential energy surface optimization of Lennard-Jones clusters, a second class of problems that is investigated in this paper. It is considered to be an important benchmark problem in the area of protein folding [Šali et al., 1994; Neumaier, 1997; Wales & Scheraga, 1999]. We show that CLMs are able to detect the global minimum of (LJ) which possesses a double-funnel energy landscape and is known to be a challenging test-case [Wales & Doye, 1997; Wales et al., 1998; Wales & Scheraga, 1999]. The CLM method has also been successfully applied to larger scale problems. The cooperative search mechanism in CLMs is obtained by solving a linear programming problem in the unknown soft constraint penalty factors. In this way a maximal decrease in the average energy cost of the ensemble is achieved.
This paper is organized as follows. In Section 2 we introduce coupled local minimizers. In Section 3 we discuss how to obtain optimal interaction between local minimizers. In Section 4 we present new insights on coupled backpropagation processes and intelligence. In Section 5 we apply coupled local minimizers to the optimization of Lennard-Jones clusters.
2 A CNN of Coupled Local Minimizers
Consider the minimization of a twice continuously differentiable cost function with . Let us take an ensemble consisting of local minimizers. We aim at minimizing the average energy cost of this ensemble subject to pairwise state synchronization of the particles:
with particle states for and boundary conditions , . The synchronization constraints have to be achieved in an asymptotical sense, i.e. the particles have to reach the same final state. One defines the augmented Lagrangian
with Lagrange multipliers . The different terms in this Lagrangian are the objective function, the soft constraints and the hard constraint related to the pairwise state synchronization constraints. The penalty factors emphasize the importance of each of the soft synchronization constraints. From this augmented Lagrangian one obtains the Lagrange programming network [Zhang & Constantinides, 1992]
This is basically a continuous-time optimization algorithm for solving the given constrained optimization problem. One obtains the following array of coupled local minimizers (CLM)
with learning rate . This array consists of a number of coupled nonlinear cells and is considered to be a special case of cellular neural networks (CNN) [Chua, 1998].
The basic mechanism of state synchronization with coupled local minimizers is illustrated in Fig. 1 for a double potential well problem with two particles that are searching for the global minimum. The main idea here is to impose that they should reach the same final position. A simplified CLM considered in this case is
where . This is derived from the augmented Lagrangian
As a result of the constraining of the system, the two particles are enforced to take a decision about which valley to choose. When the initial states of the two particles are located in different valleys, one observes that they are always reaching the global minimum. This phenomenon is independent from the steepness and shape of the valleys. In this process, the search space is in fact duplicated and the local optimizers are exchanging information through the synchronization constraint.
Note that it is not necessarily guaranteed beforehand that the equality constraints in (1) will be exactly realized by the CLM, as is known for Lagrange programming networks. In this case one could consider as CLM goal that the state synchronization constraints should be realized in an approximate sense, i.e. bringing the states close to each other. Eventually, the equality constraints could be replaced by a set of inequality constraints which would implement this.
3 CLMs and Optimal Cooperative Search
In this Section, we discuss how to obtain an optimal interaction between the several local minimizers. This is done by making a suitable choice of the penalty weights . In this design procedure we aim at maximally decreasing the instantaneous average energy cost of the ensemble
For a given state vector and costate of the ensemble the latter expression is affine in and the interaction between the local minimizers is optimized then by solving the linear program (LP):
where are user-defined lower- and upperbounds. The resulting values are kept constant for simulation of the CLM over a certain time interval . The difference between the values in (4) causes a similar effect as is known in master-slave (drive-response) synchronization which has been studied for secure communications using chaos [Chen & Dong, 1998; Pecora & Carroll, 1990; Suykens et al., 1997, 1998; Wu & Chua, 1994]. In this way, successfully performing local optimizers impose their state to the other part of the ensemble, which leads to cooperative behaviour.
The step size is determined here by imposing a target evolution law
where is a user-defined positive real constant and is an estimate of the energy cost value at the global minimum (alternatively one adds a constant value to the cost function such that the energy cost is guaranteed to be positive everywhere). As a result the step size becomes subject to additional user-defined constraints where by definition. The values of are re-scheduled each time in between simulations of the array over user-defined time intervals .
According to the idea of small-world networks [Watts & Strogatz, 1998], the information spreading within the ensemble can be further improved, e.g. by re-numbering a part of the ensemble (state vectors together with their corresponding co-state) at random every steps where . In this way the pairwise synchronization constraints vary among the different local minimizers during optimization.
In order to obtain insight in the influence of the CLM tuning parameters let us consider the optimization of the cost function
with , , in a dimensional search space [Styblinski & Tang, 1990]. The global minimum for this problem is known and located at the origin with and is taken. The solutions from CLMs improve with respect to multi-start local optimization, even when a quasi-Newton method instead of steepest descent optimizers (as for the CLM) are taken. In these comparisons the initial states have been random uniformly generated in where the same initial states have been taken for the different algorithms and several runs of the algorithms have been done. CLMs with are reaching the global minimum in this case. From this example it follows that in order to optimize a surface over a certain region in search space, the number needed to achieve a good performance depends on the complex shape of the surface or typically on the number of local minima per volume in search space that one intends to explore. Experiments suggest that a factor is usually a good choice. In this way the energy level or cost function of the optimizers remains bundled. Otherwise the simulations could lead to excessive exploration and decreased speed of convergence. Typically, larger values of will require shorter intervals. The choice of the pressure coefficient will control the average energy decrease of the ensemble. Demanding a high pressure , however, might lead to less exploration and worse local minima. The bounds on are taken here as , . Although it is not necessary, the CLM performance can be further enhanced by applying the small-world network concept, by re-numbering at random (here 20% of the local minimizers) every steps where typically, leading to improved information spreading among the optimizers. For the simulations here and in the sequel a Runge-Kutta integration rule with adaptive step size (absolute and relative error equal to ) has been used.
4 Intelligence and Coupled Backpropagation
In this Section we apply CLMs to the training of multilayer perceptron networks
with input vector , output , interconnection weights ,
and bias vector, where denotes the number of hidden units. Let us denote the unknown parameter vector as which stacks all the weights into one single vector. Since the introduction of the backpropagation algorithm for training of neural networks [Rumelhart et al., 1986; Werbos, 1990], important insights have been obtained about regularization (weight decay) of the cost function in order to have a good generalization ability in view of the bias-variance trade-off [Bishop, 1995; MacKay, 1992; Suykens & Vandewalle, 1998]. It is well-known that instead of optimizing the cost function
are the desired target outputs (with given data setof training data), one better considers the following cost function with regularization
are additional hyperparameters which can be automatically tuned within the framework of Bayesian learning. The second term is related to a prior distribution on the weights which typically expresses that the initial weights are chosen small. The first term is associated with the likelihood. When networks are overparametrized (too many hidden units), regularization enables to work implicitly with a number of effective parameters that is less than the actual number of interconnection weights such that overfitting can be avoided.
Let us now consider a CLM by taking as state vectors () the unknown parameter vectors . Hence, the CLM consists in fact of coupled backpropagation processes with weight vector synchronization. In Fig. 2 an example of estimating a sine function from given noisy data is given where an overparametrized MLP is taken. Simulation results show that by CLMs one is able to obtain a similar smooth solution as from (13) with Bayesian learning but by optimizing the original cost function (12) without regularization. In order to obtain this intelligent solution (i.e. a good generalization performance), the initial distribution of () (at time 0) plays an important role, similar with the choice of in (13). Consider the following distribution of the initial CLM state
which assumes that the states of the local minimizers
are normally distributed. The value ofthen plays the role of a bifurcation hyperparameter. For chosen very large a solution with bad generalization is obtained, while for small (small initial weights) a good generalization performance is obtained. Hence, CLMs can avoid the overfitting phenomenon without the use of a regularization in the sum squared error cost function. In order to obtain the results with good generalization in Fig.2, the following CLM tuning parameters were taken: , , , , , , , and 20% of the local minimizers is re-numbered after with .
5 Optimization of Lennard-Jones Clusters
In this Section we discuss the application of CLMs to the optimization of Lennard-Jones clusters. In predicting the three-dimensional structure of proteins from amino acid sequences, potential energy surface (PES) minimization is often related to the native structure of the protein. Optimization of Lennard-Jones (LJ) clusters is considered to be an important benchmark problem at this point [Šali et al., 1994; Neumaier, 1997; Wales & Scheraga, 1999]. Recently, detailed studies have been reported about optimization of (LJ) with by means of basin-hopping techniques in comparison with many other approaches [Wales & Doye, 1997; Wales et al., 1998; Wales & Scheraga, 1999]. The cost function in this case is given by
where the Euclidean distance between atom and (). (LJ) which possesses a double-funnel energy landscape and is known to be an interesting test-case.
Fig. 3 shows the result of a CLM run applied to (LJ). The CLM tuning parameters in this case are , , , , , , ( with is optimized in order to ensure a positive cost function everywhere) and 10% of the local minimizers is re-numbered after with . The CLM energy bundle caused by state synchronization of the local minimizers is clearly visible. Like in the application of CLMs to the training of neural networks, the choice of in plays an important role. In Fig. 3 the differences are illustrated for versus . The choice of this prior implicitly corresponds in fact to an additional confining potential energy term, which has been explicitly considered in effective potential minimization methods [Schelstraete & Verschelde, 1997] (in a different form). However, the effective potential minimization method has problems with detecting e.g. the global minimum of (LJ), which on the other hand are easily found by CLMs. In the example of (LJ) a classical local optimization technique (quasi-Newton and scaled conjugate gradient for large scale problems) has been used for post-processing and fine-tuning (about 10-20 additional iterations) of the CLM results. The variety of generated solutions includes the global minimum -173.9284.
In the application of CLMs to larger clusters, first a shifted version of the problem has been minimized in order to avoid ill-conditioning in the LP problem and excessively large values in the initial CLM evolution. For (LJ) , has been chosen. These results have been taken then as initialization for optimization of and are, eventually, post-processed by a classical local optimization method. The CLM tuning parameters in the (LJ) case are , , , , with and with . The other tuning parameters for (LJ) are the same as for the (LJ) example. This CLM run yields a PES value of -872.91, while the best known minimum is -893.31 in this case, obtained by counting nearest-neighbour interactions for icosahedral packing schemes (Northby). However, like simulated annealing and basin-hopping approaches, our focus is here on unbiased methods which are generally applicable.
One can argue that CLMs are providing good solutions given the user-defined initial distribution (prior). In this sense, CLM results can potentially be further improved by exploiting more informative prior knowledge about the problem. The simulation results show that this becomes more important for larger clusters. Simulations have been done on a Sun Ultra 2 workstation in Matlab with cmex implementation of the PES evaluation and gradient calculations. CLM running times are roughly comparable with times applying a standard gradient based local optimization methods (conjugate gradient).
We have introduced a new optimization method of coupled local minimizers.
This is done by considering the average energy cost of the
total ensemble subject to pairwise synchronization constraints
that are asymptotically reached. The resulting
continuous-time optimization algorithm from Lagrange programming
networks is an array of coupled nonlinear cells or CNN.
For problems of MLP training in supervised learning, we have
shown that CLMs are able to produce intelligent solutions,
in the sense that one can obtain a good generalization ability
without taking a regularization term in the cost function.
The obtained insights are consistent with Bayesian learning
methods for MLPs. We have also shown how one can let
the ensemble of local minimizers cooperate in an optimal
way. This is done by solving additional linear programming
problems. The CLM method has been successfully applied to
the optimization of Lennard-Jones clusters which is an
important benchmark problem in the area of protein folding.
We expect that the proposed CLM methodology may offer new insights for
the application of continuous-time optimization algorithms to NP
complete problems in general. It also emphasizes the importance
of the role played by the initial state distribution
of the local minimizers that form the CLM ensemble.
Acknowledgements. This research work was carried out at the ESAT laboratory and the Interdisciplinary Center of Neural Networks ICNN of the Katholieke Universiteit Leuven, in the framework of the FWO projects Learning and Optimization: an Interdisciplinary Approach and Collective Behaviour and Optimization: an Interdisciplinary Approach, the Belgian Program on Interuniversity Poles of Attraction, initiated by the Belgian State, Prime Minister’s Office for Science, Technology and Culture (IUAP P4-02 & IUAP P4-24) and the Concerted Action Project MEFISTO of the Flemish Community. Johan Suykens is a postdoctoral researcher with the National Fund for Scientific Research FWO - Flanders.
Arbib, M.A. (Ed.)  The Handbook of Brain Theory and Neural Networks (MIT Press, Cambridge MA).
Bertsekas, D.P.  Constrained Optimization and Lagrange Multiplier Methods (Athena Scientific).
Bishop, C.M. 
Neural networks for pattern recognition(Oxford University Press).
Chen, G. & Dong, X.  From Chaos to Order - Perspectives, Methodologies, and Applications (World Scientific Pub. Co., Singapore).
Chua, L.O. & Roska, T.  “The CNN Paradigm,” IEEE Trans. Circuits and Systems-I, 40(3), 147-156.
Chua, L.O., Hasler, M., Moschytz, G.S. & Neirynck, J.  “Autonomous Cellular Neural Networks: a Unified Paradigm for Pattern Formation and Active Wave Propagation,” IEEE Trans. Circuits and Systems-I, 42(10), 559-577.
Chua, L.O.  CNN: a Paradigm for Complexity (World Scientific Pub. Co., Signapore).
Cichocki, A. & Unbehauen, R.  Neural Networks for Optimization and Signal Processing (Wiley, Chichester).
Fletcher, R.  Practical Methods of Optimization (Wiley, Chichester).
Gill, P.E., Murray, W. & Wright, M.H.  Practical Optimization (Academic Press, London).
Goldberg, D.E. 
Genetic Algorithms in Search, Optimization and Machine Learning(Addison-Wesley, Reading Mass.).
Kirkpatrick, S., Gelatt, C.D. & Vecchi, M.  “Optimization by simulated annealing,” Science 220, 621-680.
MacKay, D.J.C.  “Bayesian Interpolation,”Neural Computation 4(3), 415-447.
Neumaier, A.  “Molecular modeling of proteins and mathematical prediction of protein structure,” SIAM Rev. 39, 407-460.
Pecora, L.M. & Carroll, T.L.  “Synchronization in Chaotic Systems,” Phys. Rev. Lett. 64, 821-824.
Poggio, T. & Shelton, C.  “Machine Learning, Machine Vision and the Brain,” AI Magazine 20(3), 37-55.
Rumelhart, D.E., Hinton, G.E. & Williams, R.J.  “Learning representations by back-propagating errors,” Nature 323, 533-536.
Šali, A., Shakhnovich, E. & Karplus, M.  “How does a protein fold?,” Nature 369, 248-251.
Schelstraete, S. & Verschelde, H.  “Finding minimum-energy configurations of Lennard-Jones clusters using an effective potential,” J. Phys. Chem. A 101, 310-315.
Styblinski, M.A. & Tang, T.-S.  “Experiments in nonconvex optimization: stochastic approximation with function smoothing and simulated annealing,” Neural Networks 3, 467-483.
Suykens, J.A.K., Vandewalle, J.P.L. & De Moor, B.L.R.  Artificial Neural Networks for Modelling and Control of Non-Linear systems, (Kluwer Academic Publishers, Boston).
Suykens, J.A.K., Curran, P.F., Vandewalle, J. & Chua, L.O.  “Robust nonlinear H synchronization of chaotic Lur’e systems,” IEEE Trans. Circuits and Systems-I 44, 891-904.
Suykens, J.A.K. & Vandewalle, J. (Eds.)  Nonlinear Modeling: advanced black-box techniques (Kluwer Academic Publishers, Boston).
Suykens, J.A.K., Yang, T. & Chua, L.O.  “Impulsive synchronization of chaotic Lur’e systems by measurement feedback,” Int. J. Bifurcation and Chaos 8, 1371-1381.
Suykens, J.A.K. & Vandewalle, J.  “Chaos synchronization: a Lagrange programming network approach,” Int. J. Bifurcations and Chaos (special issue on chaos control and synchronization), 10(4), 797-810.
Wales, D.J. & Doye, J.P.K.  “Global optimization by basin-hopping and the lowest energy structures of Lennard-Jones clusters containing up to 110 atoms,” J. Phys. Chem. A 101, 5111-5116.
Wales, D.J., Miller, M.A. & Walsh, T.R.  “Archetypical energy landscapes,” Nature 394, 758-760.
Wales, D.J. & Scheraga, H.A.  “Global optimization of clusters, crystals, and biomolecules,” Science 285, 1368-1372.
Watts, D. J. & Strogatz, S.H.  “Collective dynamics of ’small-world’ networks,” Nature 393, 440-442.
Wu, C.W. & Chua, L.O.  “A unified framework for synchronization and control of dynamical systems,” Int. J. Bifurcation and Chaos 4, 979-989.
Werbos, P.  “Backpropagation through time: what it does and how to do it,” Proceedings of the IEEE 78(10), 1150-1560.
Zhang, S. & Constantinides, A.G.,  “Lagrange programming neural networks,” IEEE Trans. Circuits and Systems-II 39, 441-452.
Captions of Figures
Fig. 1 Illustration of the basic state synchronization
mechanism in coupled local minimizers (CLM):
(A) double well cost function
with global minimum located at ;
(B) CLM with
where (red and blue, respectively)
are the two particle states with costate (green).
The CLM corresponds to the Lagrange programming network
with cost function subject to .
Except when both initial states are positive,
the global minimum is always reached by this CLM.
Similar results are obtained for other double well problems with
broad and narrow minima.
Fig. 2 Supervised learning of multilayer perceptrons on a benchmark problem of training a sinusoidal function (green) given 20 training data points (blue circles) corrupted with Gaussian noise (zero mean, standard dev. 0.4) using a multilayer perceptron with one hidden layer consisting of 10 hidden units. Given the fact that the MLP consists of 30 unknown parameters, overfitting will occur with standard training methods, unless regularization or early stopping is applied. In (A) results from standard neural network training methods are shown: scaled conjugate gradient without early stopping leading to overfitting (red); best result by Bayesian learning with regularization of the cost function and automatic hyperparameter selection, leading to good generalization (blue). (B) shows that a CLM which optimizes a sum squared error on the training data without regularization of the cost function can produce a result (blue) which is comparable to the quality of the Bayesian learning solution in (A), provided the value in is well chosen (here ). (C) shows a result from a CLM for a bad choice of . The large variance among the resulting local minimizers of the CLM is clearly visible, in comparison with (B) where the variance is small with good generalization. In (D) the CLM evolution with of the sum squared error cost function is shown during optimization, related to the results of (B).
Fig. 3 CLM application to potential energy surface minimization of Lennard-Jones clusters: (A) CLM evolution of for on (LJ); (B) Importance of the choice of the initial distribution: (red), (blue) with post-processing by means of a quasi-Newton method resulting into (magenta), (green), respectively. Shown are the mean (dashed line), min and max (solid lines) over 7 CLM runs after sorting the energy values. The variety of solutions includes the global minimum -173.9284, visualized in (C). In (D) a CLM result with is shown for a larger cluster (LJ).