Artificial Neural Networks (ANN) are general function approximators and can be used to find a functional representation of a data set. Another point of view is that ANN's represent a form of data compression. The compression ratio depends on the number of neurons in the ANN which encodes the data: the fewer neurons at the same representation quality, the better the compression.
Given a problem, there are generally two kinds of optimization tasks for the learning process of ANN’s. The first one is to find a network topology, i.e., the optimal number of layers and the optimal number of neurons per layer. The second task is to find the parameters of the network, given a topology. In this paper, we focus on the second task and assume a predefined topology.
The estimation of the ANN-parameters is generally a computationally demanding task. The corresponding Maximum-Likelihood derived error function comprises many local optima. Therefore, local search techniques generally fail to find an optimal solution and typically converge to a suboptimal one. In addition, local search techniques are mainly sequential methods, and parallel implementations are limited. On the other hand, global optimization techniques based on Monte Carlo methods such as the Genetic Algorithm (GA) [7, 21], Covariance Matrix Adaptation Evolution Strategies (CMA-ES) [12, 11] or Differential Evolution (DE) [29, 22, 34] are generally very well parallelizable. Differential Evolution is one of the most popular and robust Monte Carlo global search methods and outperforms many other evolutionary algorithms on a wide range of problems [3, 33, 36]. DE is successfully used in various engineering problems such as multiprocessor synthesis, optimization of radio network designs, training Radial Basis Function networks, training multilayer neural networks, and many others. On the other hand, CMA-ES is a state-of-the-art evolutionary algorithm, which is also used for ANN-learning [27, 26, 8] and other engineering tasks [24, 16, 25].
Due to inherent symmetries in the parametric representation of ANN's, there are multiple global optima in the parameter space. These result from point symmetries and permutation symmetries [30, 31]. In the literature, this problem is also known as the competing conventions problem, or simply the permutation problem. In [32, 31], significant improvements are reported for different approaches to symmetry breaking for GA's. However, in both publications, the improvement is shown using only a single test case each. On the other hand, contradictory results are presented in [10, 9], where the effect of removing these symmetries on GA's is reported to be minimal and negligible, and even to lead to reduced performance.
Furthermore, crossover operators used in GA's are reported to be a source of the problems caused by symmetries. Therefore, some researchers disable crossover or apply EA's which do not have a crossover operator at all.
To the best of our knowledge, there are no reports on the impact of the ANN-symmetries on the performance of the DE and CMA-ES methods. In this paper, we show that the performance of DE and CMA-ES is highly sensitive to the presence of multiple global optima, and that symmetries are also an issue for the performance of EA's without crossover operators. We show that there are infinitely many ways of symmetry breaking, which differ in the way they partition the parameter space. Furthermore, we argue that an effective way of partitioning should depend on the location of the global optimum and its symmetric replicas. Therefore, we derive a symmetry breaking operator based on considerations about the partitioning of the ANN-parameter space, which is optimal according to a Minimum Global Optimum Proximity condition. By theoretical considerations and numerous experimental studies on offline supervised learning problems, we show that typical approaches to symmetry breaking, which are invariant to the global optimum, may lead to superior or inferior results, depending on the ANN-problem. On the other hand, we show that the proposed global optimum variant approach to symmetry breaking leads to consistent and significant improvements in the estimation of ANN-parameters.
The paper is organized as follows. In the following Section, we briefly review Artificial Feedforward Neural Networks (ANN). Section 3 defines the term 'symmetry' and introduces the types of symmetries found in the optimization of ANN-parameters. In Section 4, we discuss existing approaches to symmetry breaking. In this Section, we also reformulate the rules applied by existing approaches to prepare a more general view of the topic. In Section 5, we introduce the 'Minimum global optimum proximity' principle and propose symmetry breaking methods based on this principle. In Section 6, we present the conducted experiments and the obtained results, followed by the concluding Section, where the main contributions are emphasized.
2 Brief review of Artificial Feedforward Neural Networks
Artificial (Feedforward) Neural Networks (ANN) are used for the approximation of functions. ANN's typically have multiple layers of artificial neurons. Assuming that an ANN has multiple layers, the first and the last layer are called the input and the output layer, respectively. The remaining layers are called hidden layers.
For the -th neuron in layer
, we denote a parameter vector by
where is the weight vector of dimension equal to the number of inputs available to the neuron and is the shift scalar. The output of a tanh-type sigmoid neuron is given by
where is the output vector of layer . After all hidden layers are evaluated, the output layer component of the output vector is typically obtained in one of the following two alternative ways:
We denote the parameter vector of all neurons in a layer by , where
The vector of all the parameters in the network is given by
where , is the vector of the output layer weights for output . The function defined by the network is denoted by
where is the input vector, which is notationwise equal to the output of the input layer, so that .
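As an illustration of the layer-wise evaluation described above, the following sketch implements a tanh feedforward network with a linear output layer. The function name, the parameter layout and the 1-3-1 topology are our own, chosen only for illustration:

```python
import numpy as np

def ann_forward(params, x):
    """Evaluate a feedforward tanh network with a linear output layer.
    params is a list of (W, b) pairs, one per layer; W has shape
    (n_out, n_in) and b is the shift vector of the layer."""
    h = x
    for W, b in params[:-1]:
        h = np.tanh(W @ h + b)      # tanh-type sigmoid neurons
    W_out, b_out = params[-1]
    return W_out @ h + b_out        # linear output layer (regression case)

# tiny 1-3-1 network with random parameters
rng = np.random.default_rng(0)
params = [(rng.normal(size=(3, 1)), rng.normal(size=3)),
          (rng.normal(size=(1, 3)), rng.normal(size=1))]
y = ann_forward(params, np.array([0.5]))
```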
Assuming additive normal i.i.d. noise on the available data , the ML-estimate of the parameters can be obtained as the minimizer of the following least squares optimization problem:
For regression problems, the output layer is linear as shown in Eqn. (3). Thus, the corresponding weights can be determined by a least squares method, as described in , which we adopt in this paper. This has the advantage that global search is applied only to the non-linear part of the parameter space, which generally speeds up convergence. For classification problems, we assume that an output vector of a data-sample designating class has the following format
Although the output layer is non-linear as shown in Eqn. (4), corresponding weights can still be determined linearly in the training phase. For this, the output vectors of the training data are rescaled by factor 20, such that and . The weights of the output layer are determined by a least squares method using the rescaled data. Given the remaining parameters, Eqn. (8) is applied by using the non-rescaled data.
Consequently, the parameter vector for the global optimization can be reduced to
The important problem of how to choose the net topology is not considered in this paper. For a given net-topology, we focus on the effect of symmetry breaking on the efficiency of the optimization of the parameters in (10). In the following Section, we investigate the symmetries in the ANN-parameter space.
3 Symmetries in ANN’s
A symmetry is an operator which does not change the output of an ANN when applied to the parameter vector :
Non-reducible ANN's comprise two types of symmetries. The first type is a point symmetry on the neuron parameter level, since
The following definition of a point symmetry operator
changes the sign of the parameters of neuron and the -th weight component of all neurons in the following layer . It satisfies the symmetry condition because of Eqn. (12). In Fig. 1, an example of the application of is shown. For each layer , the point symmetry yields symmetric replicas of the parameter vector .
The second type of symmetry is a permutation symmetry by the neuron parameters and the corresponding weight parameters in the next layer. A permutation operator defined by
leaves the output invariant. Note that . In Fig. 2, the application of is illustrated. In each layer , there are symmetric replicas of the parameter vector due to permutation symmetries. Combining both symmetries, the total count of symmetric replicas per layer is . Another important property is that the length of the vector is invariant under such symmetry operators,
since the point symmetry operator only changes the sign of some components of the parameter vector, whereas the permutation symmetry operator only swaps some components.
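Both symmetries can be verified numerically on a small one-hidden-layer network. The following sketch uses our own variable names; the flipped neuron index and the permutation are chosen arbitrarily:

```python
import numpy as np

def forward(W1, b1, W2, b2, x):
    """One-hidden-layer tanh network with a linear output layer."""
    return W2 @ np.tanh(W1 @ x + b1) + b2

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)
x = rng.normal(size=2)
y = forward(W1, b1, W2, b2, x)

# point symmetry: flip the sign of neuron j and of the next-layer weights
# attached to it; tanh(-z) = -tanh(z) keeps the output unchanged
j = 2
W1p, b1p, W2p = W1.copy(), b1.copy(), W2.copy()
W1p[j] *= -1.0; b1p[j] *= -1.0; W2p[:, j] *= -1.0

# permutation symmetry: reorder hidden neurons together with the
# corresponding columns of the next-layer weight matrix
perm = [3, 0, 2, 1]
```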
Symmetry operators are linear and orthogonal operators.
The proof for the linearity of these operators is trivial and therefore omitted in this paper. The orthogonality follows from Eqn. (15):
Furthermore, applying the same point symmetry operator two times subsequently does not change the parameter vector, since switching the signs of selected components a second time reverts the first sign-change. The same holds also for the permutation symmetry operator: swapping the selected components a second time reverts the first swapping. Therefore, we can write
where is the identity operator. As a result, point symmetry, permutation symmetry, as well as joint symmetry operators correspond to rotations, and all symmetric replicas of a global optimum lie on a hypersphere. Since such symmetries multiply the count of local and global optima in the parameter space, the ultimate goal of symmetry breaking is to reduce the total number of local optima by avoiding all but one of the symmetrically equivalent space partitions.
There are infinitely many ways of symmetry breaking using the operators and , which depend on the condition upon which these operators are applied. As an example, consider a 2-D point symmetry as illustrated in Fig. 3. Limiting the search space to the upper half plane () is one possibility to break the symmetry: only one global optimum remains and the space is separated into two partitions. In this case, the point symmetry operator is to be applied only for . Another possibility is to reduce the space to the right half plane (). This is realized by applying the point symmetry operator only on the condition
. By rotating the coordinate system, we obtain infinitely many other ways to separate and reduce the space. As a result, there is a degree of freedom in the choice of a specific condition or separation. We derive similar results for the permutation symmetry. In Section 5, we argue that there is an optimal choice of a specific symmetry breaking condition (separation), based on considerations about the location of the global optimum. We exploit the degree of freedom in the choice of a specific condition by choosing a condition such that the distance of the global optimum to the separating region is maximal. In other words, we demand that the proximity of the global optimum to the separating region is minimal. This way, the influence of neighboring global optima is minimized and the symmetry breaking can be realized most effectively.
A detailed discussion about an optimal separation follows in Section 5.
4 Existing approaches to deal with symmetries
A commonly used method is to reduce the parameter space to one single symmetrically equivalent region, also called a partition. To achieve this, the following rules can be applied:
Rule-1: For each neuron, the shift parameter is ensured to be positive by flipping the signs of the neuron's parameters when required.
Rule-2: In each hidden layer, neurons are sorted according to the shift parameter.
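For a one-hidden-layer network, the two rules can be sketched as follows. This is our own sketch in our own notation: W1 and b1 are the hidden-layer weights and shifts, W2 the output-layer weights. Since both steps only apply symmetry operators, the network output is unchanged:

```python
import numpy as np

def invariant_sb(W1, b1, W2):
    """Global optimum invariant symmetry breaking for one hidden layer.
    Rule-1: make every hidden shift non-negative via point symmetry flips.
    Rule-2: sort hidden neurons by their shift parameter (permutation)."""
    W1, b1, W2 = W1.copy(), b1.copy(), W2.copy()
    neg = b1 < 0
    W1[neg] *= -1.0; b1[neg] *= -1.0; W2[:, neg] *= -1.0   # rule-1
    order = np.argsort(b1)                                  # rule-2
    return W1[order], b1[order], W2[:, order]

rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)
W2 = rng.normal(size=(1, 4))
W1c, b1c, W2c = invariant_sb(W1, b1, W2)
```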
This method and all other similar methods can be realized by applying a chain of the operators and . In the following, we show that these rules are suboptimal, and in some cases may even cause inferior performance. We show that rules for symmetry breaking should take the position of the global optimum into account in order to be effective. Therefore, we denote rule-1 and rule-2 as global optimum invariant, and rules which depend on the global optimum as global optimum variant.
4.1 Global optimum invariant point symmetry breaking
Assuming a point symmetric function , Fig. 3 shows two cases where rule-1 is applied such that all -coordinates are forced to be positive. As a consequence, all solution candidates are located in the upper half plane and the parameter space is effectively reduced. There is only one remaining global optimum . In the left plot, the global optima and are relatively far away from the -axis, whereas in the right plot, the global optima are close to the -axis, although they have the same distance to the origin in both plots. In the case of the right plot, there exists an 'artificial' local optimum due to the proximity of the hidden global optimum , to which some solution candidates may be attracted. The main problem is that after applying symmetry breaking, some solution candidates may still be closer to the hidden global optimum than to . As a result, the goal of reducing the influence of other global optima is not fully achieved. Furthermore, the introduced artificial local optimum may trap some solution candidates without them ever having a chance to reach the corresponding 'hidden' global optimum . We believe that this is the main reason why inferior performance is reported for some symmetry breaking approaches. Note that this situation depends on the location of the global optimum, which in turn depends on the problem at hand. Therefore, this issue arises on some problems, whereas on others, symmetry breaking with increased performance can be achieved by these rules.
In Fig. 3, the -axis is the region of separation
The separating region depends on the rule and divides the parameter space into partitions. As an example, an alternative rule, which would force all -coordinates to be positive, would have the -axis as the separating region. We emphasize that the distance of the global optimum to the separating region is crucial for effective symmetry breaking, and that this distance should be made as large as possible. An equivalent goal is to apply symmetry breaking such that no solution candidate is closer to the hidden global optimum than to the global optimum of the selected partition.
4.2 Global optimum invariant permutation symmetry breaking
Similar problems to those caused by rule-1 also arise from the application of rule-2. This is shown in the following example. We use a 2x2 parameter structure, i.e., two neurons with two parameters per neuron: . From the permutation symmetry it follows that
where shall be the error function. Let the global optimum be at . There are two possibilities to apply rule-2: sorting by parameter or sorting by parameter , respectively. The separating region varies for each choice. Choosing to sort by parameter yields , whereas sorting by parameter yields :
We show that each separating region has a different distance to the global optimum . The closest point on to is at , which yields the distance . On the other hand, the closest point on to is at , which yields the distance . In this example, applying rule-2 by ordering the -coordinates results in a better separation of the partitions. If the global optimum were at , the opposite case would apply. Consequently, similar to rule-1 in the previous Section 4.1, rule-2 can only be effective on some problems.
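The two hyperplane distances can be checked numerically. The optimum location below is hypothetical, chosen by us for illustration; the paper's concrete values are not reproduced here:

```python
import numpy as np

# hypothetical global optimum (a1, b1, a2, b2); the values are our own,
# chosen for illustration only
p = np.array([0.9, 0.1, -0.9, 0.2])

# distance of p to the separating hyperplane a1 = a2 (sorting by parameter a)
d_a = abs(p[0] - p[2]) / np.sqrt(2.0)
# distance of p to the separating hyperplane b1 = b2 (sorting by parameter b)
d_b = abs(p[1] - p[3]) / np.sqrt(2.0)

# for this optimum, sorting by parameter a separates the partitions
# far better than sorting by parameter b
```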
5 Minimum global optimum proximity principle
In this Section, we propose new methods for symmetry breaking to avoid the problems described in Section 4. Here, we assume that the basin, or region of influence, of the global optimum is isotropic. Although this assumption does not hold in general, it is introduced to simplify the discussion. This simplification also enables us to easily derive theoretically motivated methods, which prove to be very effective on a wide range of problems. In the presentation, we first consider the point symmetry, then the permutation symmetry, and finally the general joint symmetry as a combination of both.
5.1 Minimum global optimum proximity principle for point symmetry
The differences between possible rules to apply the point symmetry operator arise from the condition on which the operator is to be applied. Fig. 4 shows different rules with corresponding separation regions for breaking a point symmetry in relation to the global optimum.
It can be seen that the separating region with maximum distance to the global optima, i.e., with minimal proximity, enables the optimal separation or partitioning. This way, an optimal isolation between all symmetric replicas of the global optimum is achieved. As a result, the disturbing influence of neighboring global optima is reduced to a minimum, which in turn effectively maximizes the attraction of the global optimum of the selected partition.
The following Lemma provides a more general perspective for rule-1 presented in Section 4. Note that the shift parameter is the last entry in the parameter vector.
Rule-1 from Section 4 modifies a parameter vector as:
The rule-structure introduced by Lemma 5.1 can be used to formulate the following strategy to maximize the distance of the global optimum to the separating region.
The solution candidate determined by rule (28) is always closer to than to .
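One plausible reading of this rule in code: flip the sign of a neuron's parameter block exactly when the flipped block is closer to the corresponding block of the global-optimum estimate. The function and variable names are our own, and v_star stands in for the global-optimum block:

```python
import numpy as np

def variant_point_sb(v, v_star):
    """Flip the sign of the neuron parameter block v iff the flipped block
    -v is closer to the corresponding block v_star of the global-optimum
    estimate. Since ||v - v*||^2 - ||-v - v*||^2 = -4 (v . v*), the
    comparison reduces to a sign test on the inner product."""
    return v if v @ v_star >= 0 else -v

v_star = np.array([1.0, 0.5])                 # block of the optimum estimate
w = variant_point_sb(np.array([-0.9, -0.6]), v_star)
```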
5.2 Minimum global optimum proximity principle for permutation symmetry
In this Section, we introduce an optimal rule for breaking a permutation symmetry in parameter spaces with two blocks of permutation-invariant parameters. We define a parameter vector as
where the notation is used to emphasize the block structure. The permutation symmetry is given by
where is the error function and is a permutation operator defined by
The following Lemma restates rule-2 as a distance dependent rule.
Assuming the shift parameter is the last parameter in the parameter block , rule-2, presented in Section 4, can alternatively be described in a more general form by the following rule:
From Eqn. (32) follows with
Following the rule-structure introduced by Lemma 5.3, we state the following strategy in order to maximize the distance of the global optimum to the separating region.
The solution candidate determined by rule (36) is always closer to than to .
5.3 Ideal symmetry breaking
For a given ANN-optimization problem, let be the set of all possible symmetry operators. Note that a symmetry operator may be a point symmetry, a permutation symmetry or a joint symmetry operator. A joint symmetry operator is generally composed of a chain of point symmetry and permutation symmetry operators. As an example, applies a permutation symmetry followed by a point symmetry operator. The following properties of symmetry operators are relevant for the discussion below. According to Eqn. (11), a symmetry operator does not change the output of the ANN when applied to the parameter vector . According to Eqn. (15), a symmetry operator does not change the length of a parameter vector. Furthermore, according to Eqn. (18), symmetry operators are orthogonal.
Given a parameter vector , the set of all symmetric replicas of is defined by
Recall that the ultimate goal of symmetry breaking is to minimize the influence of all symmetric replicas of the selected global optimum and to concentrate the global search to the partition where the selected global optimum is located. To achieve this, we propose the following joint separation condition:
In other words, this optimization selects the closest symmetric replica of to the selected global optimum . Finding the closest symmetric replica of means finding the corresponding symmetry operator , where
In case the parameter vector is already close to , i.e., it is in the corresponding partition, the solution for is the identity operator . Note that, according to Eqn. (19), the identity operator is in . In Fig. 5, ideal symmetry breaking according to Eqn. (38) is illustrated on a hypothetical 2-D space.
The solution determined by Equation (38) ensures that no other symmetric replica of the selected global optimum is closer to than . In other words, it minimizes the influence of the symmetric replicas of the selected global optimum.
We prove this by contradiction. According to Eqn. (38), is minimal. Assume that there exists a global optimum replica with
Due to the underlying symmetry properties, each global optimum replica can be mapped to another replica by a symmetry operator, i.e., there exists a symmetry operator which satisfies
Since and therefore , it follows that . But this means that does not minimize the distance to , which contradicts Eqn. (38). ∎
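The exhaustive search behind Eqn. (38) can be sketched for a single hidden layer by enumerating all sign flips and permutations. This is our own helper, exponential in the number of hidden neurons, which is exactly the complexity issue addressed in the next Section:

```python
import numpy as np
from itertools import permutations, product

def ideal_sb_bruteforce(W1, b1, W2, W1s, b1s, W2s):
    """Find the symmetric replica of (W1, b1, W2) closest to the
    global-optimum estimate (W1s, b1s, W2s) by enumerating all
    2^n * n! joint symmetry operators of one hidden layer."""
    n = len(b1)
    best, best_d = None, np.inf
    for perm in map(list, permutations(range(n))):
        for signs in product((1.0, -1.0), repeat=n):
            s = np.array(signs)
            A, c, B = s[:, None] * W1[perm], s * b1[perm], s * W2[:, perm]
            d = (np.sum((A - W1s) ** 2) + np.sum((c - b1s) ** 2)
                 + np.sum((B - W2s) ** 2))
            if d < best_d:
                best, best_d = (A, c, B), d
    return best

# demo: a permuted, sign-flipped replica is mapped back onto the estimate
rng = np.random.default_rng(3)
W1s, b1s, W2s = rng.normal(size=(3, 2)), rng.normal(size=3), rng.normal(size=(1, 3))
p, s = [2, 0, 1], np.array([1.0, -1.0, 1.0])
W1, b1, W2 = s[:, None] * W1s[p], s * b1s[p], s * W2s[:, p]
A, c, B = ideal_sb_bruteforce(W1, b1, W2, W1s, b1s, W2s)
```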
5.4 Approximations of the ideal separation
In order to take advantage of these results, we have to address two issues. First, the global optimum is not known a priori. Second, the brute force method for finding an optimal solution to (38) has exponential complexity, but a low-complexity algorithm is desired. To circumvent the first problem, we propose to use an estimate of the global optimum, which can be determined from the population of solution candidates at each iteration of the applied Monte Carlo method. Naturally, this estimate improves with increasing iteration number. The second problem can be addressed by using an approximation of the ideal separation achieved by (38).
To describe the proposed method, for each neuron , we define a symmetry relevant parameter block as
which also includes the corresponding parameters from the next layer . Given a parameter vector and an estimate of the global optimum with corresponding parameter blocks and , Algorithm 1 describes the proposed approximation for ideal symmetry breaking.
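Algorithm 1 itself is not reproduced in this excerpt; the following greedy sketch conveys the idea of a low-complexity approximation. Per-neuron blocks are matched to the estimate's blocks, choosing the sign per match. The helper name and the matching order are our own, and the details may differ from the paper's pseudocode:

```python
import numpy as np

def approx_sb(blocks, targets):
    """blocks[i] / targets[i]: the symmetry-relevant parameter block of
    neuron i (its weights, its shift, and the attached next-layer weights)
    of the candidate and of the global-optimum estimate, respectively.
    Greedily match each target with the closest remaining block, choosing
    the sign per match: O(n^2) pairs instead of O(2^n * n!) operators."""
    n = len(blocks)
    free = list(range(n))
    out = [None] * n
    for t in range(n):
        best_j, best_s, best_d = None, 1.0, np.inf
        for j in free:
            for s in (1.0, -1.0):
                d = np.linalg.norm(s * blocks[j] - targets[t])
                if d < best_d:
                    best_j, best_s, best_d = j, s, d
        out[t] = best_s * blocks[best_j]
        free.remove(best_j)
    return out

# demo: a sign-flipped, permuted replica of the targets is recovered exactly
rng = np.random.default_rng(4)
targets = [rng.normal(size=5) for _ in range(3)]
blocks = [-targets[2], targets[0], -targets[1]]
recovered = approx_sb(blocks, targets)
```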
In Fig. 6, the effect of the several symmetry breaking approaches is demonstrated on a hypothetical 2-D parameter space.
5.4.1 DE with symmetry breaking
The DE method [29, 22] comprises a population of solution candidates , which are iteratively updated and moved towards an optimal solution. We propose to choose the centroid of the population at each iteration as an estimate for the global optimum .
The DE method extended by the global optimum invariant symmetry breaking is denoted by DE-INV-SB, DE extended by the proposed global optimum variant symmetry breaking, described by Algorithm 1, is denoted by DE-SB, and DE with global optimum variant ideal symmetry breaking using brute force search is denoted by DE-SB-BF. As shown in Fig. 7, in DE-based symmetry breaking approaches, symmetry breaking is always applied to each solution candidate right after it has been updated for the next iteration. Only in DE-SB, we apply an additional step by increasing the error yield of solution candidates which are not in the selected partition holding . This increases the probability that these solution candidates are updated and moved closer to the selected partition. This is not required for symmetry breaking approaches which map each solution candidate exactly to the selected partition, such as DE-INV-SB or DE-SB-BF. The DE-SB method is described in Algorithm 2.
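A minimal sketch of the DE-SB idea on a toy point-symmetric objective (DE/rand/1/bin in simplified form). The error-increase step of Algorithm 2 is omitted here; instead, the symmetry-breaking hook maps each candidate directly into the partition of the centroid estimate. All names and parameter values are our own:

```python
import numpy as np

def de_sb(f, dim, pop_size, iters, sb=None, F=0.7, CR=0.9, seed=0):
    """Minimal DE/rand/1/bin with an optional symmetry-breaking hook sb,
    applied to each candidate right after its update. sb(x, centroid)
    maps x towards the partition of the global-optimum estimate."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-1.0, 1.0, size=(pop_size, dim))
    err = np.array([f(x) for x in pop])
    for _ in range(iters):
        centroid = pop.mean(axis=0)          # global-optimum estimate
        for i in range(pop_size):
            a, b, c = pop[rng.choice(pop_size, 3, replace=False)]
            mask = rng.random(dim) < CR
            trial = np.where(mask, a + F * (b - c), pop[i])
            if sb is not None:
                trial = sb(trial, centroid)  # symmetry breaking after update
            e = f(trial)
            if e <= err[i]:                  # greedy selection
                pop[i], err[i] = trial, e
    return pop[np.argmin(err)], float(err.min())

# point-symmetric toy objective with two global optima at +/-(1, 1)
f = lambda x: float(min(np.sum((x - 1.0) ** 2), np.sum((x + 1.0) ** 2)))
sb = lambda x, c: x if x @ c >= 0 else -x    # break the point symmetry
x_best, e_best = de_sb(f, dim=2, pop_size=20, iters=100, sb=sb)
```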
5.4.2 CMA-ES with symmetry breaking
At each iteration, solution candidate vectors are drawn according to a Gaussian distribution with mean and covariance matrix . After sorting the population by the error each candidate vector yields, the best samples are used to update the mean, the covariance matrix and the step size for the next iteration.
In the following discussion, the CMA-ES method extended by the global optimum invariant symmetry breaking is denoted by CMA-ES-INV-SB, CMA-ES extended by the proposed global optimum variant symmetry breaking, described by Algorithm 1, is denoted by CMA-ES-SB, and CMA-ES with global optimum variant ideal symmetry breaking using brute force search is denoted by CMA-ES-SB-BF.
In CMA-ES-INV-SB, CMA-ES-SB and CMA-ES-SB-BF, symmetry breaking is applied right after the evaluation of all candidate vectors and prior to updating the parameters of the Gaussian distribution. In CMA-ES-SB, we propose to use the best candidate vector so far (yielding the smallest error) as the estimate for the global optimum, denoted by . In Fig. 8, the flowgraph for the CMA-ES-based symmetry breaking approaches is shown. For CMA-ES-SB, the update of the mean is described in Algorithm 3. In all other CMA-ES-based methods, the original update formula for the mean is applied.
In CMA-ES, applying symmetry breaking introduces a bias in the mean, which can lead to an excessive increase of the global step size and negatively affect the performance. This bias results from the rotations caused by the symmetry operators. These rotations move solution candidates to the vicinity of one partition, which typically increases the radius of the population mean, as shown in Fig. 6. In order to prevent such an increase, in all CMA-ES-based symmetry breaking methods, we modify the damping term for the update of the global step size . Let be the shift vector of the centroid of the best solution candidates induced by applying symmetry breaking. The regular update formula for
is changed to
where is the iteration number and is a term depending on the difference of the previous mean and the current mean, and several other parameters.
6 Experiments
In this Section, we present experimental results to demonstrate the performance improvements obtained by symmetry breaking. The following methods are compared using regression and classification tests. From the DE-family: Differential Evolution (DE), DE with global optimum invariant symmetry breaking (DE-INV-SB), DE with global optimum variant symmetry breaking (DE-SB) and DE with global optimum variant ideal symmetry breaking using brute force search (DE-SB-BF). From the CMA-ES-family: Covariance Matrix Adaptation Evolution Strategies (CMA-ES), CMA-ES with global optimum invariant symmetry breaking (CMA-ES-INV-SB), CMA-ES with global optimum variant symmetry breaking (CMA-ES-SB) and CMA-ES with global optimum variant ideal symmetry breaking using brute force search (CMA-ES-SB-BF). It should be noted that the purpose of this investigation is not to present the best global optimization method for ANN-learning, but to demonstrate the benefits of symmetry breaking.
For a -dimensional parameter space, all tests are performed with the following settings:
DE, DE-SB, DE-INV-SB and DE-SB-BF settings: , , initial population is randomly generated in -dim. hypercube (uniformly),
CMA-ES, CMA-ES-SB, CMA-ES-INV-SB and CMA-ES-SB-BF settings: we use the settings suggested for enhanced global search abilities in the C-code reference implementation.
In all experiments, the optimization is finished when a maximum number of ANN-function-evaluations is reached.
Given a parameter and a data set , we define the Mean Squared Error (MSE) according to Eqn. (8):
In order to limit the -dimensional parameter space to a feasible region, we apply a penalty approach. Due to the length-invariance of the symmetry operators, as shown in Eqn. (15), the feasible region is defined by a hypersphere. In case of , the error function (47) is evaluated at a rescaled parameter vector and a penalty term is added to the error .
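The penalty approach can be sketched as follows. The radius and the penalty weight are illustrative choices of ours; the paper's concrete constants are not reproduced here:

```python
import numpy as np

def penalized_error(error_fn, theta, R=10.0):
    """Hypersphere feasibility region of radius R: outside it, evaluate the
    error at the radially rescaled parameter vector and add a penalty that
    grows with the violation."""
    r = np.linalg.norm(theta)
    if r <= R:
        return error_fn(theta)
    return error_fn(theta * (R / r)) + (r - R) ** 2

quad = lambda t: float(np.sum(t ** 2))                 # toy error function
inside = penalized_error(quad, np.array([3.0, 4.0]))   # r = 5 <= R: unchanged
outside = penalized_error(quad, np.array([9.0, 12.0])) # r = 15 > R: penalized
```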
6.1 Experimental setup
In all experiments, the data is normalized to zero mean and unit variance. The population size used in DE and CMA-ES depends on the problem and the choice of the optimization method; therefore, it is adapted manually. For each problem and each optimization method, we conduct 50 independent repetitions of the optimization process and record the error over the number of ANN-evaluations. To test the statistical significance of the obtained results, the Kruskal-Wallis test for the hypothesis that all performance means are equal is applied first. In case this hypothesis is rejected, the Wilcoxon rank sum test is applied to all pairs of means to identify significantly different results. All tests are based on a significance level of . In Table 1, normalized training set errors for the regression and the autoencoding problems, and normalized test set errors for the classification problems are shown.
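The testing procedure described above can be sketched with SciPy. This is a sketch under our own assumptions: no multiple-testing correction is applied, and the paper's exact procedure may differ in detail:

```python
import numpy as np
from itertools import combinations
from scipy.stats import kruskal, ranksums

def compare_methods(results, alpha=0.05):
    """results: dict mapping method name -> array of final errors over the
    independent repetitions. First an omnibus Kruskal-Wallis test; only if
    its null hypothesis (all means equal) is rejected, pairwise Wilcoxon
    rank-sum tests are run."""
    names = list(results)
    p_kw = kruskal(*(results[n] for n in names)).pvalue
    if p_kw >= alpha:
        return p_kw, {}
    pairwise = {(a, b): ranksums(results[a], results[b]).pvalue
                for a, b in combinations(names, 2)}
    return p_kw, pairwise

# synthetic final-error samples for two hypothetical methods
rng = np.random.default_rng(5)
results = {"DE": rng.normal(1.0, 0.1, 50),
           "DE-SB": rng.normal(0.5, 0.1, 50)}
p_kw, pairwise = compare_methods(results)
```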
syn5             0.958 ± 0.079   1.000 ± 0.186   0.949 ± 0.039   1.000 ± 4.446   0.386 ± 0.469   0.093 ± 0.007
sinc             1.000 ± 0.859   0.412 ± 0.166   0.114 ± 0.008   0.459 ± 0.271   1.000 ± 0.784   0.139 ± 0.051
inc-sinc         1.000 ± 0.963   0.337 ± 0.155   0.089 ± 0.016   0.287 ± 0.336   1.000 ± 0.707   0.082 ± 0.035
sinc2d           1.000 ± 0.387   0.995 ± 0.094   0.875 ± 0.029   0.975 ± 0.139   1.000 ± 0.241   0.089 ± 0.253
sinc3d           0.622 ± 0.029   1.000 ± 0.572   0.603 ± 0.033   1.000 ± 1.401   0.090 ± 0.013   0.043 ± 0.021
autoenc-circle   0.057 ± 0.082   1.000 ± 1.850   0.020 ± 0.030   1.000 ± 0.295   0.626 ± 0.548   0.077 ± 0.164
autoenc-spiral   0.341 ± 0.545   1.000 ± 0.932   0.116 ± 0.308   0.248 ± 0.232   1.000 ± 0.882   0.030 ± 0.024
autoenc-sphere   0.554 ± 0.321   1.000 ± 0.064   0.022 ± 0.012   0.050 ± 0.012   1.000 ± 0.416   0.032 ± 0.008
two-circles      0.450 ± 0.225   1.000 ± 0.182   0.269 ± 0.074   0.635 ± 0.368   1.000 ± 0.284   0.326 ± 0.169
two-spirals      0.918 ± 0.260   1.000 ± 0.213   0.426 ± 0.197   1.000 ± 0.228   0.930 ± 0.201   0.683 ± 0.293
digits           0.325 ± 0.087   1.000 ± 0.111   0.272 ± 0.062   1.000 ± 0.352   0.805 ± 0.113   0.668 ± 0.099
6.2 Regression problems
As in , we apply learning only on a training set to compare the performance of the introduced methods. In the following, the regression problems are introduced and the corresponding results are shown.
6.2.1 Dataset syn5
The syn5 dataset is generated by the fourth-degree polynomial
with uniformly distributed random input values. We use a 1-3-1 net and 200 data samples. The population size for all DE-based methods is , and for all CMA-ES-based methods. Fig. 9 shows the resulting convergence curves and box plots for the learning process.
For the DE-family, the Kruskal-Wallis test showed no significant difference in means. For the CMA-ES-family, according to the Wilcoxon tests, the difference between the means of CMA-ES and CMA-ES-SB misses significance only by a narrow margin, with a corresponding p-value of . The other means are significantly different. All DE variants reach the same low error, where DE-SB shows the fastest decrease in error. As for the CMA-ES variants, CMA-ES fails to reach a low error in a few runs, which leads to a larger mean error on average. In contrast, CMA-ES-SB proves to be more robust and reaches a relatively low error in all runs.
6.2.2 Dataset sinc
The sinc dataset is generated by the function with uniformly distributed random input values . We use a 1-5-1 net and 200 data samples. The population size for all DE-based methods is , and for all CMA-ES-based methods. Fig. 10 shows the resulting convergence curves and box plots for the learning process.
According to the Wilcoxon tests, all pairwise differences are significant. DE-SB clearly outperforms DE and DE-INV-SB. Similarly, CMA-ES-SB is the fastest among the CMA-ES-based methods.
6.2.3 Dataset inc-sinc
The inc-sinc dataset is generated by the function with uniformly distributed random input values . We use a 1-5-1 net and 200 data samples. The population size for all DE-based methods is , and for all CMA-ES-based methods. Fig. 11 shows the resulting convergence curves and box plots for the learning process.
According to the Wilcoxon tests, all pairwise differences are significant. Interestingly, the global optimum invariant symmetry breaking approach leads to an improvement for DE (DE-INV-SB), but shows inferior performance with CMA-ES (CMA-ES-INV-SB). This indicates that symmetry breaking approaches should be specific to the selected global optimization method. Again, DE-SB and CMA-ES-SB are the fastest methods.
6.2.4 Dataset sinc2d
The sinc2d dataset is generated by the function with uniformly distributed random input values . We use a 2-3-1-3-1 net and 1000 data samples. The population size for all DE-based methods is , and for all CMA-ES-based methods. Fig. 12 shows the resulting convergence curves and box plots for the learning process.
According to the Wilcoxon tests, all pairwise differences are significant, except the difference between CMA-ES and CMA-ES-INV-SB. The proposed symmetry breaking approach shows a very clear impact on the CMA-ES-variants. While CMA-ES and CMA-ES-INV-SB fail to solve this problem completely, CMA-ES-SB successfully trains the ANN in the majority of the 50 runs.
6.2.5 Dataset sinc3d
The sinc3d dataset is generated by the function with uniformly distributed random input values . We use a 3-4-1-4-1 net and 1000 data samples. The population size for all DE-based methods is , and for all CMA-ES-based methods. Fig. 13 shows the resulting convergence curves and box plots for the learning process.
According to the Wilcoxon tests, all pairwise differences are significant. Again, DE-SB and CMA-ES-SB are the fastest methods. This time, in contrast to previous experiments, CMA-ES-INV-SB clearly outperforms CMA-ES.
6.3 Autoencoding problems
In this section, all -dimensional data samples lie on a -dimensional set, where . As a result, the data can be described, or 'encoded', by an -dimensional subset. Conversely, there is also a -D to -D mapping that 'decodes' the data. The task is to approximate both the encoding and the decoding mapping by an ANN. As with the regression problems, performance is compared only during training, using a training set.
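As an illustration of the training objective, the sketch below pushes circle samples (as in the autoenc-circle problem of Section 6.3.1) through an untrained bottleneck network and evaluates the mean squared reconstruction error, the quantity the evolutionary search minimizes. The small 2-3-1-3-2 topology and the use of tanh in every layer (including the output) are simplifying assumptions, not the paper's exact setup:

```python
import math
import random

def mlp(x, layers):
    """Forward pass through fully connected layers with tanh activation.
    layers: list of (weight_matrix, bias_vector) pairs."""
    for W, b in layers:
        x = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
             for row, bi in zip(W, b)]
    return x

def random_layer(n_in, n_out):
    """Random weights, standing in for one candidate of the population."""
    return ([[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
            [random.uniform(-1, 1) for _ in range(n_out)])

random.seed(0)
# Illustrative 2-3-1-3-2 autoencoder: 2-D input squeezed through a 1-D bottleneck.
net = [random_layer(2, 3), random_layer(3, 1), random_layer(1, 3), random_layer(3, 2)]

# 200 points on the unit circle.
samples = [(math.cos(t), math.sin(t))
           for t in (random.uniform(0.0, 2.0 * math.pi) for _ in range(200))]

# Mean squared reconstruction error of the untrained net.
mse = sum(sum((o - s) ** 2 for o, s in zip(mlp(x, net), x))
          for x in samples) / len(samples)
```

The global optimizer treats the flattened weight and bias values of `net` as its search-space vector and minimizes `mse`.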
6.3.1 Dataset autoenc-circle
In this problem, the data samples lie on a 2-D circle centered at the origin with radius one. We use a 2-5-3-2-1-2-3-5-2 net and 200 data samples to encode from 2-D to 1-D and decode back to 2-D. The population size for all DE-based methods is , and for all CMA-ES-based methods. Fig. 14 shows the resulting convergence curves and box plots for the learning process.
All pairwise differences prove to be statistically significant. The proposed symmetry breaking approach improves the training for both methods. For CMA-ES-SB, the improvement is particularly pronounced.
6.3.2 Dataset autoenc-spiral
In this problem, the data samples lie on a 3-D spiral with radius one, defined by
We use a 3-1-3-4-7-3 net and 1000 data samples to encode from 3-D to 1-D and decode back to 3-D. The population size for all DE-based methods is , and for all CMA-ES-based methods. Fig. 15 shows the resulting convergence curves and box plots for the learning process.
All pairwise differences prove to be statistically significant.
6.3.3 Dataset autoenc-sphere
In this problem, the data samples lie on a 3-D sphere centered at the origin with radius one. We use a 3-8-5-2-5-8-3 net and 1000 data samples to encode from 3-D to 2-D and decode back to 3-D. The population size for all DE-based methods is , and for all CMA-ES-based methods. Fig. 16 shows the resulting convergence curves and box plots for the learning process.
All pairwise differences prove to be statistically significant. DE-SB and CMA-ES-SB are clearly faster than the other methods.
6.4 Classification problems
In classification problems, data samples are divided into a training set, a validation set and a test set. All three sets are generated by random selection of samples. A winner-takes-all scheme is applied to distinguish different classes, i.e., given an input, the ANN-output component with the greatest value determines the class. In order to improve generalization, classification performance measures on the training and test set are updated only on each improvement of the validation set classification performance.
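The winner-takes-all scheme and the validation-gated bookkeeping described above can be sketched as follows; the function and variable names are illustrative, not taken from MCMLL:

```python
import random

def classify(outputs):
    """Winner-takes-all: the index of the largest output component is the class."""
    return max(range(len(outputs)), key=lambda i: outputs[i])

def accuracy(net, data):
    """Fraction of samples whose predicted class matches the label."""
    return sum(classify(net(x)) == label for x, label in data) / len(data)

def update_scores(net, train, valid, test, best):
    """Record training/test accuracies only when the validation accuracy
    improves, as described above. best: dict with keys 'valid', 'train', 'test'."""
    v = accuracy(net, valid)
    if v > best["valid"]:
        best.update(valid=v, train=accuracy(net, train), test=accuracy(net, test))
    return best

# Tiny demo with a hypothetical 'net' that ignores its input and always
# favors class 1; the data sets are random placeholders.
random.seed(0)
data = [((0.0, 0.0), random.randrange(3)) for _ in range(30)]
best = update_scores(lambda x: [0.2, 0.9, 0.1], data, data, data,
                     {"valid": -1.0, "train": 0.0, "test": 0.0})
```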
6.4.1 Dataset: Two-Circles
In this problem, the 2-D data domain is divided into two parts: one part is the union of the areas of two circles, and the other part is its complement. Hence, there are two classes: samples which lie inside at least one circle and samples which lie outside both circles. One circle is specified by center and radius , and the other circle by center and the same radius . We use a 2-4-2-4-2 net with 400 samples each for the training, validation and test sets, for a total of 1200 samples. The population size for all DE-based methods is , and for all CMA-ES-based methods. Fig. 17 and 18 show the resulting convergence curves and box plots for the learning process.
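A sketch of how such a labeling can be generated; since the exact circle centers and radius are not reproduced in this excerpt, the values below are placeholders for illustration only:

```python
import random

def in_union(p, centers, radius):
    """Class 1 if the point lies inside at least one circle, class 0 otherwise."""
    return int(any((p[0] - cx) ** 2 + (p[1] - cy) ** 2 <= radius ** 2
                   for cx, cy in centers))

# Placeholder geometry: two overlapping circles on the x-axis.
centers, radius = [(-0.5, 0.0), (0.5, 0.0)], 0.6

random.seed(0)
# 400 uniformly distributed sample points with their class labels.
samples = [(random.uniform(-2.0, 2.0), random.uniform(-2.0, 2.0))
           for _ in range(400)]
labeled = [(p, in_union(p, centers, radius)) for p in samples]
```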
All pairwise differences prove to be statistically significant. It can be seen that, again, DE-SB and CMA-ES-SB dominate in performance.
6.4.2 Dataset: Two-Spirals
This problem contains 2-D data samples from two spirals in the plane, both starting at the origin and winding around each other. The task is to classify each data sample by deciding which spiral it belongs to. We use a 2-8-3-1-3-8-2 net, with 114 samples for the training set, 40 samples for the validation set and 40 samples for the test set. The population size for all DE-based methods is , and for all CMA-ES-based methods. Fig. 19 and 20 show the resulting convergence curves and box plots for the learning process.
On the training and test sets, the mean results of DE and DE-INV-SB are not significantly different. Likewise, on the test set, the mean results of CMA-ES and CMA-ES-INV-SB are not significantly different. All other pairwise differences prove to be statistically significant. DE-SB and CMA-ES-SB continue to show the best results.
6.4.3 Dataset: Digits
This problem deals with the recognition of handwritten digits, a classification problem with 10 classes. The data was generated by asking several writers to write 250 digits in random order inside boxes, on a tablet with a resolution of 500 by 500 pixels. Sixteen features are extracted from the digitized data. We use a 16-8-3-10-10 net and 1000 data samples each for the training, validation and test sets.
All pairwise differences prove to be statistically significant. Again, DE-SB and CMA-ES-SB are the fastest methods.
6.5 Ideal separation
In this section, we compare the ideal separation to the proposed approximations. Since the complexity of the brute-force method for the ideal separation grows exponentially, we restrict the experiments to small networks, as used in the problems syn5, sinc and inc-sinc. It can be seen that the results are almost identical.
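The exponential cost of the brute-force ideal separation stems from the size of the symmetry group of a hidden layer: permuting its h neurons and flipping the signs of any subset of them (exploiting tanh(-x) = -tanh(x)) yields h!·2^h weight configurations that all realize the same input-output map (cf. Sussmann's uniqueness result). The sketch below canonicalizes a single tanh hidden layer by enumerating this group; it is a generic illustration of the permutation/sign-flip symmetry, not the paper's exact separation operator:

```python
from itertools import permutations, product

def equivalent_configs(in_w, out_w):
    """Enumerate all weight settings of one tanh hidden layer that realize
    the same input-output map: permutations of the h hidden neurons combined
    with per-neuron sign flips.
    in_w[i]  : incoming weight vector of hidden neuron i (a bias can be
               treated as an extra input weight)
    out_w[i] : outgoing weight of hidden neuron i
    Yields h! * 2**h configurations, hence the exponential cost."""
    h = len(in_w)
    for perm in permutations(range(h)):
        for signs in product((1, -1), repeat=h):
            yield ([[s * w for w in in_w[p]] for p, s in zip(perm, signs)],
                   [s * out_w[p] for p, s in zip(perm, signs)])

def canonical(in_w, out_w):
    """Pick a unique representative of the equivalence class
    (here simply the lexicographic minimum)."""
    return min(equivalent_configs(in_w, out_w))

# Two nets that differ only by neuron order and one sign flip map to the
# same canonical point in parameter space.
a = canonical([[1.0, 2.0], [3.0, 4.0]], [0.5, -0.5])
b = canonical([[-3.0, -4.0], [1.0, 2.0]], [0.5, 0.5])
```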
7 Conclusion
The problem of symmetries in the ANN parameter space is well known and significantly complicates the training of ANNs. However, a detailed investigation of this problem for Evolutionary Algorithms other than Genetic Algorithms is missing in the literature. Furthermore, there are contradictory results about the effect of symmetry breaking methods on the performance of the global search. We show that a possible explanation for this situation is the use of symmetry breaking methods which are invariant to the global optimum and can therefore only be effective on a limited number of problems. Furthermore, we show theoretically and illustrate experimentally that the application of global optimum invariant symmetry breaking may even lead to inferior performance. To circumvent these problems, we propose global optimum variant symmetry breaking approaches for Differential Evolution (DE) and Covariance Matrix Adaptation Evolution Strategies (CMA-ES), two popular, robust and state-of-the-art global optimization methods.
Experimental studies conducted on fixed topology feedforward neural networks indicate a significant improvement over standard DE and CMA-ES techniques in terms of global convergence speed. Further comparisons of the proposed approach with a common global optimum invariant symmetry breaking approach support our hypotheses.
Based on the obtained results, we conclude that other global optimization based methods may also benefit from the use of the proposed global optimum variant symmetry breaking. Further research is required to adapt the proposed approach to other techniques to improve their performance.
The proposed method can be tested and verified using the open-source C++ Monte Carlo Machine Learning Library (MCMLL), which is available under the GNU GPLv2 license. The website of the library can be found at mcmll.sourceforge.net; the project page is hosted at sourceforge.net/projects/mcmll.
-  E. Alpaydin and Fevzi Alimoglu. UCI machine learning repository, 1996.
-  Michael A. Arbib, editor. The Handbook of Brain Theory and Neural Networks. MIT Press, Cambridge, MA, USA, 2002.
-  Hugues Bersini, Marco Dorigo, Stefan Langerman, Gregory Seront, and Luca Maria Gambardella. Results of the first international contest on evolutionary optimisation (1st ICEO). In International Conference on Evolutionary Computation, pages 611–615, 1996.
-  Enrique Castillo, Bertha Guijarro-Berdiñas, Oscar Fontenla-Romero, and Amparo Alonso-Betanzos. A very fast learning method for neural networks based on sensitivity analysis. Journal of Machine Learning Research, 7:1159–1182, July 2006.
-  Uday K. Chakraborty. Advances in Differential Evolution. Springer Publishing Company, Incorporated, 1 edition, 2008.
-  Nicolás García-Pedrajas, Domingo Ortiz-Boyer, and César Hervás-Martínez. An alternative approach for neural network evolution with a genetic algorithm: Crossover by combinatorial optimization. Neural Netw., 19(4):514–528, 2006.
-  David E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley Professional, 1 edition, January 1989.
-  Faustino Gomez, Jürgen Schmidhuber, and Risto Miikkulainen. Accelerated neural evolution through cooperatively coevolved synapses. J. Mach. Learn. Res., 9:937–965, 2008.
-  Stefan Haflidason and Richard Neville. On the significance of the permutation problem in neuroevolution. In GECCO ’09: Proceedings of the 11th Annual conference on Genetic and evolutionary computation, pages 787–794, New York, NY, USA, 2009. ACM.
-  P. Hancock. Genetic algorithms and permutation problems: a comparison of recombination operators for neural net structure specification. In Proceedings of the International Workshop on Combinations of Genetic Algorithms and Neural Networks, 1992.
-  Nikolaus Hansen, Sibylle D. Müller, and Petros Koumoutsakos. Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (cma-es). Evol. Comput., 11(1):1–18, 2003.
-  Nikolaus Hansen and Andreas Ostermeier. Adapting arbitrary normal mutation distributions in evolution strategies: the covariance matrix adaptation. In Proc. of the 1996 IEEE Int. Conf. on Evolutionary Computation, pages 312–317, Piscataway, NJ, 1996. IEEE Service Center.
-  S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 2nd edition, July 1998.
-  Myles Hollander and Douglas A. Wolfe. Nonparametric Statistical Methods, 2nd Edition. Wiley-Interscience, 2 edition, January 1999.
-  Jarmo Ilonen, Joni-Kristian Kamarainen, and Jouni Lampinen. Differential evolution training algorithm for feed-forward neural networks. Neural Process. Lett., 17(1):93–105, 2003.
-  Fei Jiang, Hugues Berry, and Marc Schoenauer. Unsupervised learning of echo state networks: balancing the double pole. In GECCO ’08: Proceedings of the 10th annual conference on Genetic and evolutionary computation, pages 869–870, New York, NY, USA, 2008. ACM.
-  K. J. Lang and M. J. Witbrock. Learning to tell two spirals apart. In Proceedings 1988 Connectionist Models Summer School, pages 52–59, Los Altos, CA, 1988. Morgan Kaufmann.
-  Junhong Liu, Jorma Mattila, and Jouni Lampinen. Training RBF networks using a DE algorithm with adaptive control. In IEEE International Conference on Tools with Artificial Intelligence, pages 673–676, 2005.
-  T. Masters. Practical Neural Networks Recipes in C++. Academic Press, 1993.
-  Silvio Priem Mendes, Juan A. Gomez Pulido, Miguel A. Vega Rodriguez, Maria D. Jaraiz Simon, and Juan M. Sanchez Perez. A differential evolution based algorithm to optimize the radio network design problem. In E-SCIENCE ’06: Proceedings of the Second IEEE International Conference on e-Science and Grid Computing, page 119, Washington, DC, USA, 2006. IEEE Computer Society.
-  Zbigniew Michalewicz. Genetic algorithms + data structures = evolution programs (2nd, extended ed.). Springer-Verlag New York, Inc., New York, NY, USA, 1994.
-  K. V. Price. Differential evolution: a fast and simple numerical optimizer. In Biennial Conference of the North American Fuzzy Information Processing Society, NAFIPS, pages 524–527. IEEE Press, New York. ISBN: 0-7803-3225-3, June 1996.
-  Allan Rae and Sri Parameswaran. Application-specific heterogeneous multiprocessor synthesis using differential-evolution. In ISSS ’98: Proceedings of the 11th international symposium on System synthesis, pages 83–88, Washington, DC, USA, 1998. IEEE Computer Society.
-  O.M. Shir, C. Siedschlag, T. Back, and M.J.J. Vrakking. Evolutionary algorithms in the optimization of dynamic molecular alignment. In Evolutionary Computation, 2006. CEC 2006. IEEE Congress on, pages 2912–2919, 2006.
-  Nils Siebel, Gerald Sommer, and Yohannes Kassahun. Evolutionary learning of neural structures for visuo-motor control. In Arpad Kelemen, Ajith Abraham, and Yulan Liang, editors, Computational Intelligence in Medical Informatics, volume 85 of Studies in Computational Intelligence, pages 93–115. Springer Berlin / Heidelberg, 2008.
-  Nils T Siebel, Jonas Boetel, and Gerald Sommer. Efficient neural network pruning during neuro-evolution. In Proceedings of 2009 International Joint Conference on Neural Networks (IJCNN 2009), Atlanta, USA, pages 2920–2927, June 2009.
-  Nils T. Siebel, Jochen Krause, and Gerald Sommer. Efficient learning of neural networks with evolutionary algorithms. In Proceedings of the 29th DAGM conference on Pattern recognition, pages 466–475, Berlin, Heidelberg, 2007. Springer-Verlag.
-  Jirí Síma. Minimizing the quadratic training error of a sigmoid neuron is hard. In ALT ’01: Proceedings of the 12th International Conference on Algorithmic Learning Theory, pages 92–105, London, UK, 2001. Springer-Verlag.
-  R. Storn and K. Price. Differential evolution - a simple and efficient adaptive scheme for global optimization over continuous spaces. Technical Report TR-95-012, ICSI, March 1995.
-  Héctor J. Sussmann. Uniqueness of the weights for minimal feedforward nets with a given input-output map. Neural Netw., 5(4):589–593, 1992.
-  Dirk Thierens. Non-redundant genetic coding of neural networks. In In Proceedings of the 1996 IEEE International Conference on Evolutionary Computation, pages 571–575. IEEE Press, 1996.
-  Dirk Thierens, J.A.K. Suykens, J. Vandewalle, and B. De Moor. Genetic weight optimization of a feedforward neural network controller. Innsbruck, Austria, Apr. 1993.
-  Tea Tušar and Bogdan Filipič. Differential evolution versus genetic algorithms in multiobjective optimization. In EMO’07: Proceedings of the 4th international conference on Evolutionary multi-criterion optimization, pages 257–271, Berlin, Heidelberg, 2007. Springer-Verlag.
-  J. Vesterstrom and R. Thomsen. A comparative study of differential evolution, particle swarm optimization, and evolutionary algorithms on numerical benchmark problems. In Evolutionary Computation, 2004. CEC2004. Congress on, volume 2, pages 1980–1987, June 2004.
-  Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, 1945.
-  Xing Xu and Yuanxiang Li. Comparison between particle swarm optimization, differential evolution and multi-parents crossover. In Computational Intelligence and Security, International Conference on, pages 124–127, 2007.
-  Xin Yao and Yong Liu. Towards designing artificial neural networks by evolution. Applied Mathematics and Computation, 91(1):83 – 90, 1998.