Symmetry Breaking in Neuroevolution: A Technical Report

Onay Urfalioglu, et al. · 07/22/2011

Artificial Neural Networks (ANN) exhibit important symmetry properties which can influence the performance of Monte Carlo methods in Neuroevolution. The problem caused by these symmetries is also known as the competing conventions problem or simply as the permutation problem. In the literature, symmetries are mainly addressed in Genetic Algorithm based approaches. However, investigations in this direction based on other Evolutionary Algorithms (EA) are rare or missing. Furthermore, there are different and contradictory reports on the efficacy of symmetry breaking. By using a novel viewpoint, we offer a possible explanation for this issue. As a result, we show that a strategy which is invariant to the global optimum can only be successful on certain problems, whereas it must fail to improve the global convergence on others. We introduce the Minimum Global Optimum Proximity principle as a generalized and adaptive strategy for symmetry breaking, which depends on the location of the global optimum. We apply the proposed principle to Differential Evolution (DE) and Covariance Matrix Adaptation Evolution Strategies (CMA-ES), which are two popular and conceptually different global optimization methods. Using a wide range of feedforward ANN problems, we experimentally illustrate significant improvements in the global search efficiency achieved by the proposed symmetry breaking technique.


1 Introduction

Artificial Neural Networks (ANN) are general function approximators [13] and can be used to find a functional representation of a data set. Another point of view is that ANN’s represent a way of data compression [2]. The compression ratio depends on the number of neurons used in the ANN which encodes the data: the fewer neurons at the same representation quality, the better the compression.

Given a problem, there are generally two kinds of optimization tasks for the learning process of ANN’s. The first one is to find a network topology, i.e., the optimal number of layers and the optimal number of neurons per layer. The second task is to find the parameters of the network, given a topology. In this paper, we focus on the second task and assume a predefined topology.

The estimation of the ANN-parameters is generally a computationally demanding task [28]. The corresponding Maximum-Likelihood derived error function comprises many local optima. Therefore, local search techniques to find an optimal solution generally fail and typically converge to a suboptimal solution [13]. In addition, local search techniques are mainly sequential methods and parallel implementations are limited. On the other hand, global optimization techniques based on Monte Carlo methods such as the Genetic Algorithm (GA) [7, 21], Covariance Matrix Adaptation Evolution Strategies (CMA-ES) [12, 11] or Differential Evolution (DE) [29, 22, 34] are generally very well parallelizable. Differential Evolution is one of the most popular and robust Monte Carlo global search methods, which outperforms many other evolutionary algorithms on a wide range of problems [3, 33, 36]. DE is successfully used in various engineering problems such as multiprocessor synthesis [23], optimization of radio network designs [20], training Radial Basis Function networks [18], training multi-layer neural networks [15] and many others [5]. On the other hand, CMA-ES is a state-of-the-art evolutionary algorithm, which is also used for ANN-learning [27, 26, 8] and other engineering tasks [24, 16, 25].

Due to inherent symmetries in the parametric representation of ANN’s, there are multiple global optima in the parameter space. The multiple global optima result from point symmetries and permutation symmetries [30, 31]. In the literature, this problem is also known as the competing conventions problem, or simply the permutation problem. In [32, 31], significant improvements are reported by different approaches to symmetry breaking for GA’s. However, in both publications, the improvement is shown using only a single test case each. On the other hand, in [10, 9] contradictory results are presented, where the effect of removing these symmetries on GA’s is reported to be minimal and negligible, and even to lead to reduced performance.

Furthermore, crossover operators used in GA’s are reported to be a source of the problems caused by symmetries [6]. Therefore, some researchers disable crossover or apply EA’s which do not have crossover at all [37].

To the best of our knowledge, there are no reports on the impact of the ANN-symmetries on the performance of the DE and CMA-ES methods. In this paper, we show that the performance of DE and CMA-ES is highly sensitive to the presence of multiple global optima, and that symmetries are also an issue for the performance of EA’s without crossover operators. We show that there are infinitely many ways of symmetry breaking, which differ in the way they partition the parameter space. Furthermore, we argue that an effective way of partitioning should depend on the location of the global optimum and its symmetric replicas. Therefore, we derive a symmetry breaking operator based on considerations about the partitioning of the ANN-parameter space, which is optimal according to a Minimum Global Optimum Proximity condition. By theoretical considerations and numerous experimental studies on offline supervised learning problems, we show that typical approaches to symmetry breaking, which are invariant to the global optimum, may lead to superior or inferior results, depending on the ANN-problem.

On the other hand, we show that the proposed global optimum variant approach to symmetry breaking leads to consistent and significant improvements in the estimation of ANN-parameters.

The paper is organized as follows. In the following Section, we briefly review Artificial Feedforward Neural Networks (ANN). Section 3 defines the term ’symmetry’ and introduces the types of symmetries found in the optimization of ANN-parameters. In Section 4, we discuss existing approaches to symmetry breaking. In this Section, we also reformulate the rules applied by existing approaches to prepare a more general view of the topic. In Section 5, we introduce the ’Minimum global optimum proximity’ principle and propose symmetry breaking methods based on this principle. In Section 6, we present the conducted experiments and the obtained results, followed by the Conclusions, where the main contributions are emphasized.

2 Brief review of Artificial Feedforward Neural Networks

Artificial (Feedforward) Neural Networks (ANN) are used for the approximation of functions. ANN’s typically have multiple layers of artificial neurons. Assuming that an ANN has L layers, the first and the last layer are called the input and the output layer, respectively. The remaining layers are called hidden layers.

For the i-th neuron in layer l, we denote its parameter vector by

θ_{l,i} = (w_{l,i}, b_{l,i}),    (1)

where w_{l,i} is the weight vector of dimension equal to the number of inputs available to the neuron and b_{l,i} is the shift scalar. The output of a tanh-type sigmoid neuron is given by

o_{l,i} = tanh(w_{l,i}^T o_{l−1} + b_{l,i}),    (2)

where o_{l−1} is the output vector of layer l−1. After all hidden layers are evaluated, the k-th component of the output vector is typically obtained in one of the following two alternative ways:

y_k = v_k^T o_{L−1},    (3)
y_k = tanh(v_k^T o_{L−1}).    (4)

We denote the parameter vector of all neurons in layer l by θ_l, where

θ_l = (θ_{l,1}, …, θ_{l,N_l})    (5)

and N_l is the number of neurons in layer l. The vector of all the parameters in the network is given by

θ = (θ_2, …, θ_{L−1}, v_1, …, v_m),    (6)

where v_k, k = 1, …, m, is the vector of the output layer weights for output k. The function defined by the network is denoted by

y = f(x; θ),    (7)

where x is the input vector, which is notation-wise equal to the output of the input layer, so that o_1 = x.
Assuming additive normal i.i.d. noise on the available data D = {(x_j, y_j)}, j = 1, …, N, the ML-estimate of the parameters can be obtained as the minimizer of the following least squares optimization problem:

θ* = argmin_θ Σ_{j=1}^{N} ||y_j − f(x_j; θ)||².    (8)

For regression problems, the output layer is linear as shown in Eqn. (3). Thus, the corresponding weights can be determined by a least squares method, as described in [19], which we adopt in this paper. This has the advantage that the global search is applied only to the non-linear part of the parameter space, which generally speeds up convergence. For classification problems, we assume that an output vector of a data-sample designating class c has the following format

(9)

Although the output layer is non-linear as shown in Eqn. (4), the corresponding weights can still be determined linearly in the training phase. For this, the output vectors of the training data are rescaled by a factor of 20. The weights of the output layer are determined by a least squares method using the rescaled data. Given the remaining parameters, Eqn. (8) is applied using the non-rescaled data.

Consequently, the parameter vector for the global optimization can be reduced to the non-linear part

θ_g = (θ_2, …, θ_{L−1}).    (10)

The important problem of how to choose the net topology is not considered in this paper. For a given net-topology, we focus on the effect of symmetry breaking on the efficiency of the optimization of the parameters in (10). In the following Section, we investigate the symmetries in the ANN-parameter space.
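To make the setup above concrete, the following minimal numpy sketch evaluates a single hidden tanh layer, determines the linear output weights by least squares as described for the regression case, and computes the error of Eqn. (8). The 1-5-1 layout, the target function and all names are illustrative assumptions, not the exact configuration used in this paper.

import numpy as np

def hidden_output(X, W, b):
    # tanh hidden layer: one row of activations per data sample
    return np.tanh(X @ W.T + b)

def fit_output_layer(H, Y):
    # linear output layer: solve the least squares problem  min_V ||H V - Y||^2
    V, *_ = np.linalg.lstsq(H, Y, rcond=None)
    return V

def mse(Y_pred, Y):
    # mean squared error as in Eqn. (8)/(47)
    return np.mean(np.sum((Y - Y_pred) ** 2, axis=1))

# toy example: 1-5-1 net on noisy samples of a 1-D target function
rng = np.random.default_rng(0)
X = rng.uniform(-4.0, 4.0, size=(200, 1))
Y = np.sinc(X) + 0.05 * rng.normal(size=X.shape)
W = rng.normal(size=(5, 1))          # hidden weights (a global search would optimize these)
b = rng.normal(size=5)               # hidden shifts
H = hidden_output(X, W, b)
V = fit_output_layer(H, Y)           # output weights obtained linearly
print("training MSE:", mse(H @ V, Y))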

3 Symmetries in ANN’s

A symmetry is an operator S which does not change the output of an ANN when applied to the parameter vector θ:

f(x; S θ) = f(x; θ)  for all inputs x.    (11)

Non-reducible ANN’s comprise two types of symmetries [30]. The first type is a point symmetry on the neuron parameter level, since

tanh(−w_{l,i}^T o_{l−1} − b_{l,i}) = −tanh(w_{l,i}^T o_{l−1} + b_{l,i}).    (12)

The following definition of a point symmetry operator S_p^{l,i},

S_p^{l,i}:  θ_{l,i} ↦ −θ_{l,i},   w_{l+1,k,i} ↦ −w_{l+1,k,i}  for all neurons k in layer l+1,    (13)

changes the sign of the parameters of neuron i in layer l and of the i-th weight component of all neurons in the following layer l+1. It satisfies the symmetry condition because of Eqn. (12). In Fig. 1, an example for the application of S_p is shown. For each layer l, the point symmetry yields 2^{N_l} symmetric replicas of the parameter vector θ.


Figure 1: Application of the point symmetry operator S_p, which changes the signs of the parameters of one neuron in layer two and of the corresponding weight components of the neurons in layer three, respectively.

The second type of symmetry is a permutation symmetry, acting on the neuron parameters and the corresponding weight parameters in the next layer. A permutation operator defined by

S_π^{l,i,j}:  θ_{l,i} ↔ θ_{l,j},   w_{l+1,k,i} ↔ w_{l+1,k,j}  for all neurons k in layer l+1,    (14)

leaves the output invariant. In Fig. 2, the application of S_π is illustrated. In each layer l, there are N_l! symmetric replicas of the parameter vector θ due to permutation symmetries. Combining both symmetries, the total count of symmetric replicas per layer is 2^{N_l} · N_l!. Another important property is that the length of the parameter vector is invariant under such symmetry operators,

||S θ|| = ||θ||,    (15)

since the point symmetry operator only changes the sign of some components of the parameter vector, whereas the permutation symmetry operator only swaps some components.
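The two symmetries can be verified numerically. The following minimal numpy sketch checks, for a single hidden tanh layer with a linear output, that flipping the sign of one hidden neuron together with its outgoing weights, or swapping two hidden neurons together with their outgoing weights, leaves the network output unchanged; the layer layout and all names are assumptions of this example.

import numpy as np

def net(x, W1, b1, W2):
    # single hidden tanh layer followed by a linear output layer
    return W2 @ np.tanh(W1 @ x + b1)

rng = np.random.default_rng(1)
W1, b1, W2 = rng.normal(size=(4, 3)), rng.normal(size=4), rng.normal(size=(2, 4))
x = rng.normal(size=3)

# point symmetry: flip the sign of hidden neuron i and of its outgoing weights
i = 2
W1p, b1p, W2p = W1.copy(), b1.copy(), W2.copy()
W1p[i], b1p[i], W2p[:, i] = -W1[i], -b1[i], -W2[:, i]

# permutation symmetry: swap hidden neurons i and j together with their outgoing weights
j = 0
W1s, b1s, W2s = W1.copy(), b1.copy(), W2.copy()
W1s[[i, j]], b1s[[i, j]], W2s[:, [i, j]] = W1[[j, i]], b1[[j, i]], W2[:, [j, i]]

print(np.allclose(net(x, W1, b1, W2), net(x, W1p, b1p, W2p)))  # True: output unchanged
print(np.allclose(net(x, W1, b1, W2), net(x, W1s, b1s, W2s)))  # True: output unchanged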


Figure 2: Application of the permutation symmetry operator S_π, which exchanges the parameters of two neurons in layer two and the corresponding weight parameters in layer three.
Lemma 3.1.

Symmetry operators are linear and orthogonal operators.

Proof.

The proof for the linearity of these operators is trivial and therefore omitted in this paper. The orthogonality follows from Eqn. (15):

||S θ||² = ||θ||²    (16)
θ^T S^T S θ = θ^T θ    (17)
S^T S = I,    (18)

where the last step follows because Eqn. (17) holds for every θ. ∎

Furthermore, applying the same point symmetry operator twice in succession does not change the parameter vector, since switching the signs of the selected components a second time reverts the first sign-change. The same also holds for the permutation symmetry operator: swapping the selected components a second time reverts the first swap. Therefore, we can write

S_p S_p = S_π S_π = I,    (19)

where I is the identity operator. As a result, point symmetry, permutation symmetry as well as joint symmetry operators correspond to rotations, and all symmetric replicas of a global optimum lie on a hypersphere. Since such symmetries multiply the count of local and global optima in the parameter space, the ultimate goal of symmetry breaking is to reduce the total number of local optima in the parameter space by avoiding all but one of the symmetrically equivalent space partitions.

There are infinitely many ways for symmetry breaking using the operators S_p and S_π, which depend on the condition upon which these operators are applied. As an example, consider a 2-D point symmetry as illustrated in Fig. 3. Limiting the search space to the upper half plane (y ≥ 0) is one possibility to break the symmetry, where only one global optimum remains and the space is separated into two partitions. In this case, the point symmetry operator is to be applied only for y < 0. Another possibility is to reduce the space to the right half plane (x ≥ 0). This is realized by applying the point symmetry operator only on the condition x < 0. By rotating the coordinate system, we obtain infinitely many other ways to separate and reduce the space. As a result, there is a degree of freedom in the choice of a specific condition or separation. We derive similar results also for the permutation symmetry. In Section 5, we argue that there is an optimal choice for a specific symmetry breaking condition (separation) based on considerations about the location of the global optimum. We exploit the degree of freedom in the choice of a specific condition by choosing a condition such that the distance of the global optimum to the separating region is maximal. In other words, we demand that the proximity of the global optimum to the separating region is minimal. This way, the influence of neighboring global optima is minimized and the symmetry breaking can be realized most effectively.

A detailed discussion about an optimal separation follows in Section 5.

4 Existing approaches to deal with symmetries

A commonly used method is to reduce the parameter space to one single symmetrically equivalent region, also called partition. To achieve this, the following rules can be applied [31]:

rule-1: The shift parameter of all neurons is ensured to be positive by flipping the signs of the parameters when required, for each neuron.

rule-2: In each hidden layer, neurons are sorted according to the shift parameter.

This method and all other similar methods can be realized by applying a chain of the operators S_p and S_π. In the following, we show that these rules are suboptimal, and in some cases may even cause inferior performance. We show that rules for symmetry breaking should take the position of the global optimum into account in order to be effective. Therefore, we denote rule-1 and rule-2 as global optimum invariant, and rules which depend on the global optimum as global optimum variant.
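As an illustration, the following minimal numpy sketch applies rule-1 and rule-2 to a single hidden layer; the array layout (one weight row and one shift per neuron, plus the outgoing weight matrix) is an assumption made for this example.

import numpy as np

def apply_rule_1_and_2(W1, b1, W2):
    """Global optimum invariant canonicalization of one hidden layer (sketch).

    rule-1: flip the sign of every neuron whose shift parameter is negative,
            together with its outgoing weights (a point symmetry operation).
    rule-2: sort the neurons of the layer by their shift parameter and permute
            the outgoing weights accordingly (a permutation symmetry operation).
    """
    W1, b1, W2 = W1.copy(), b1.copy(), W2.copy()
    neg = b1 < 0                      # rule-1: enforce non-negative shifts
    W1[neg], b1[neg], W2[:, neg] = -W1[neg], -b1[neg], -W2[:, neg]
    order = np.argsort(b1)            # rule-2: sort neurons by shift
    return W1[order], b1[order], W2[:, order]

rng = np.random.default_rng(2)
W1, b1, W2 = rng.normal(size=(4, 3)), rng.normal(size=4), rng.normal(size=(2, 4))
W1c, b1c, W2c = apply_rule_1_and_2(W1, b1, W2)
print(b1c)  # non-negative and sorted; the represented function is unchanged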

4.1 Global optimum invariant point symmetry breaking

Assuming a point symmetric error function E, Fig. 3 shows two cases where rule-1 is applied such that all y-coordinates are forced to be positive. As a consequence, all solution candidates are located in the upper half plane and the parameter space is effectively reduced. There is only one remaining global optimum q. In the left plot, the global optima q and −q are relatively far away from the x-axis, whereas in the right plot, the global optima are close to the x-axis, although they have the same distance to the origin in both plots. In case of the right plot, there exists an ’artificial’ local optimum due to the proximity of the hidden global optimum −q, to which some solution candidates may be attracted. The main problem is that after applying symmetry breaking, some solution candidates may still be closer to the hidden global optimum −q than to q. As a result, the goal of reducing the influence of other global optima is not fully achieved. Furthermore, the introduced artificial local optimum may trap some solution candidates without their ever having a chance to reach the corresponding ’hidden’ global optimum −q. We believe that this is the main reason why an inferior performance is reported for some symmetry breaking approaches. Note that this situation depends on the location of the global optimum, which in turn depends on the problem at hand. Therefore, this issue arises on some problems, whereas on others, symmetry breaking with increased performance can be achieved by these rules.


Figure 3: Example for a point symmetry in 2-D, where E(θ) = E(−θ). It is assumed that rule-1 is applied to force all solution candidates to be in the upper half plane (y ≥ 0). As a result, the parameter space is effectively reduced and there is only one remaining global optimum q. In the left plot, the global optima q and −q are relatively far away from the x-axis, whereas in the right plot, the global optima are close to the x-axis, although they have the same distance to the origin in both cases. In case of the right plot, there exists an ’artificial’ local optimum due to the proximity of the hidden global optimum −q, to which some solution candidates may be attracted. The main problem is that after applying symmetry breaking, some solution candidates may still be closer to the hidden global optimum −q than to q.

In Fig. 3, the x-axis is the region of separation

R_sep = {(x, y) : y = 0}.    (20)

The separating region depends on the rule and divides the parameter space into partitions. As an example, an alternative rule, which would force all x-coordinates to be positive, would have the y-axis as the separating region. We repeat that the distance of the global optimum to the separating region is crucial for effective symmetry breaking, and that this distance should be made as large as possible. Another equivalent goal is to apply symmetry breaking such that no solution candidate is closer to the hidden global optimum than to the global optimum of the selected partition.

4.2 Global optimum invariant permutation symmetry breaking

Similar problems caused by rule-1 also arise from the application of rule-2. This is shown in the following example. We use a 2x2 parameter structure, i.e., two neurons with two parameters per neuron: θ = (θ_1, θ_2) with θ_i = (w_i, b_i). From the permutation symmetry it follows that

E(w_1, b_1, w_2, b_2) = E(w_2, b_2, w_1, b_1),    (21)

where E shall be the error function. Let the global optimum be at a fixed point q. There are two possibilities to apply rule-2: sorting by parameter w or sorting by parameter b, respectively. The separating region varies for each choice. Choosing to sort by parameter w yields R_w, whereas sorting by parameter b yields R_b:

R_w = {θ : w_1 = w_2},   R_b = {θ : b_1 = b_2}.    (22)

Each separation region generally has a different distance to the global optimum q: the closest point on R_w to q yields one distance, whereas the closest point on R_b to q yields another. In this example, applying rule-2 by ordering one of the coordinates results in a better separation of the partitions than ordering the other. Were the global optimum located differently, the opposite case would apply. Consequently, similar to rule-1 in the previous Section 4.1, rule-2 can only be effective on some problems.

5 Minimum global optimum proximity principle

In this Section we propose new methods for symmetry breaking to avoid the problems described in Section 4. Here, we assume that the basin, or the region of influence of the global optimum is isotropic. Although this assumption does not apply in general, it is introduced to simplify the discussion. Also, this simplification enables us to easily derive theoretically motivated methods, which prove to be very effective in a wide range of problems. In the presentation, we first consider the point symmetry, then the permutation symmetry and finally the general joint symmetry as a combination of both point and permutation symmetries.

5.1 Minimum global optimum proximity principle for point symmetry

The differences between possible rules to apply the point symmetry operator arise from the condition on which the operator is to be applied. Fig. 4 shows different rules with corresponding separation regions for breaking a point symmetry in relation to the global optimum.


Figure 4: Example for a point symmetry in 2-D, where E(θ) = E(−θ). The plots show the worst-case (left), a suboptimal (middle) and the optimal (right) separation lines for point symmetry breaking. The separating line divides the parameter space into two partitions, where each partition contains one of the global optima (q and −q).

It can be seen that the separating region with maximum distance to the global optima, i.e., with minimal proximity, enables the optimal separation or partitioning. This way, an optimal isolation between all symmetric replicas of the global optimum is achieved. As a result, the disturbing influence of neighboring global optima is reduced to a minimum, which in turn effectively maximizes the attraction of the global optimum of the selected partition.

The following Lemma provides a more general perspective for rule-1 presented in Section 4. Note that the shift parameter is the last entry in the parameter vector.

Lemma 5.1.

Rule-1 from Section 4 modifies a parameter vector θ as:

θ ← S_p θ  if  ||S_p θ − r|| < ||θ − r||,   θ ← θ  otherwise,   with the reference vector r = (0, …, 0, 1)^T.    (23)

Proof.

From the first line in Eqn. (23) follows, with S_p θ = −θ and the reference vector r = (0, …, 0, 1)^T,

||−θ − r||² < ||θ − r||²    (24)
||θ||² + 2 θ^T r + ||r||² < ||θ||² − 2 θ^T r + ||r||²    (25)
4 θ^T r < 0.    (26)

Further simplifying both sides of the equation yields

b < 0,    (27)

where b = θ^T r is the shift parameter, i.e., the last entry of θ. This means that the conditional Equation (23), which flips the sign exactly when the shift parameter is negative, is equivalent to rule-1, which demands that the shift parameters shall be positive. ∎

The rule-structure introduced by Lemma 5.1 can be used to formulate the following strategy to maximize the distance of the global optimum to the separating region:

θ ← S_p θ  if  ||S_p θ − q̂|| < ||θ − q̂||,   θ ← θ  otherwise,    (28)

where q̂ denotes the (estimated) global optimum.

Theorem 5.2.

The solution candidate determined by rule (28) is always closer to q̂ than to its point-symmetric replica S_p q̂.

We will prove Theorem 5.2 in a more general setting in Section 5.3.

5.2 Minimum global optimum proximity principle for permutation symmetry

In this Section we introduce an optimal rule for breaking a permutation symmetry for parameter spaces with two blocks of permutation-invariant parameters. We define a parameter vector as

θ = (θ_1, θ_2),    (29)

where the block notation is used to emphasize the block structure. The permutation symmetry is given by

E(S_π θ) = E(θ),    (30)

where E is the error function and S_π is a permutation operator defined by

S_π (θ_1, θ_2) = (θ_2, θ_1).    (31)

The following Lemma restates rule-2 as a distance dependent rule.

Lemma 5.3.

Assuming the shift parameter is the last parameter in each parameter block θ_i, rule-2, presented in Section 4, can alternatively be described in a more general form by the following rule:

θ ← S_π θ  if  ||S_π θ − r|| < ||θ − r||,   θ ← θ  otherwise,   with the reference vector r = (r_1, r_2), r_1 = 0, r_2 = (0, …, 0, 1)^T.    (32)

Proof.

From Eqn. (32) follows, with S_π θ = (θ_2, θ_1),

||S_π θ − r||² < ||θ − r||²    (33)
(θ_1 − θ_2)^T (r_2 − r_1) > 0    (34)
b_1 > b_2,    (35)

where b_1 and b_2 are the shift parameters of the two blocks. Hence, the blocks are swapped exactly when their shift parameters are not in ascending order, which is equivalent to rule-2. ∎

We state the following proposal in order to maximize the distance of the global optimum to the separating region, according to the rule-structure introduced by Lemma 5.3:

θ ← S_π θ  if  ||S_π θ − q̂|| < ||θ − q̂||,   θ ← θ  otherwise.    (36)
Theorem 5.4.

The solution candidate determined by rule (36) is always closer to q̂ than to its permuted replica S_π q̂.

Theorem 5.4 will be proved in a more general setting in Section 5.3.

5.3 Ideal symmetry breaking

For a given ANN-optimization problem, let 𝒮 be the set of all possible symmetry operators. Note that a symmetry operator may be a point symmetry, a permutation symmetry or a joint symmetry operator. A joint symmetry operator is generally composed of a chain of point symmetry and permutation symmetry operators. As an example, S = S_p S_π applies a permutation symmetry followed by a point symmetry operator. The following properties of symmetry operators are relevant in the following discussion. According to Eqn. (11), a symmetry operator does not change the output of the ANN when applied to the parameter vector θ. According to Eqn. (15), a symmetry operator does not change the length of a parameter vector. Furthermore, according to Eqn. (18), symmetry operators are orthogonal.

Given a parameter vector θ, the set of all symmetric replicas of θ is defined by

Θ(θ) = { S θ : S ∈ 𝒮 }.    (37)

Recall that the ultimate goal of symmetry breaking is to minimize the influence of all symmetric replicas of the selected global optimum and to concentrate the global search on the partition where the selected global optimum is located. To achieve this, we propose the following joint separation condition:

θ' = argmin_{θ̃ ∈ Θ(θ)} ||θ̃ − q||.    (38)

In other words, this optimization selects the closest symmetric replica of θ to the selected global optimum q. Finding the closest symmetric replica of θ means finding the corresponding symmetry operator S*, where

S* = argmin_{S ∈ 𝒮} ||S θ − q||.    (39)

In case the parameter vector θ is already close to q, i.e., it is in the corresponding partition, the solution for S* is the identity operator I. Note that, according to Eqn. (19), the identity operator is in 𝒮. In Fig. 5, ideal symmetry breaking according to Eqn. (38) is illustrated on a hypothetical 2-D space.
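To make the search of Eqn. (39) concrete, the following numpy sketch enumerates all point and permutation symmetries of a single hidden layer by brute force and returns the replica closest to the (estimated) global optimum. The block-wise distance and the array layout are assumptions of this illustration, and the factorial-times-exponential cost is exactly the complexity issue addressed by the approximation in Section 5.4.

import itertools
import numpy as np

def closest_replica(W1, b1, W2, W1_opt, b1_opt, W2_opt):
    """Brute-force search over all point/permutation symmetries of one hidden
    layer for the replica closest to the (estimated) global optimum."""
    n = len(b1)
    best, best_dist = None, np.inf
    for perm in itertools.permutations(range(n)):
        for signs in itertools.product([1.0, -1.0], repeat=n):
            s = np.array(signs)
            W1r = s[:, None] * W1[list(perm)]
            b1r = s * b1[list(perm)]
            W2r = s[None, :] * W2[:, list(perm)]
            dist = (np.sum((W1r - W1_opt) ** 2) + np.sum((b1r - b1_opt) ** 2)
                    + np.sum((W2r - W2_opt) ** 2))
            if dist < best_dist:
                best, best_dist = (W1r, b1r, W2r), dist
    return best, best_dist

# usage: (W1b, b1b, W2b), d = closest_replica(W1, b1, W2, W1_hat, b1_hat, W2_hat)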

Figure 5: Ideal symmetry breaking according to Eqn. (38) shown on a hypothetical 2-D space. In this example, applying a point symmetry operator followed by a permutation symmetry operator maps θ to θ', which is located in the partition of the selected global optimum q, marked by a star. Note that a symmetry operator corresponds to a rotation, which preserves lengths as well as angles.
Theorem 5.5.

The solution θ' determined by Equation (38) ensures that no other symmetric replica of the selected global optimum q is closer to θ' than q. In other words, it minimizes the influence of the symmetric replicas of the selected global optimum.

Proof.

We prove this by contradiction. According to Eqn. (38), ||θ' − q|| is minimal. Assume that there exists a global optimum replica q_r ≠ q with

||q_r − θ'|| < ||q − θ'||.    (40)

Due to the underlying symmetry properties, each global optimum replica can be mapped to another replica by a symmetry operator, i.e., there exists a symmetry operator S_r which satisfies

S_r q_r = q.    (41)

Due to the length-preserving property of symmetry operators, using Eqn. (39), the left-hand side of Relation (40) can be written as

||q_r − θ'|| = ||S_r (q_r − θ')|| = ||q − S_r θ'||.    (42)

Since S_r θ' is itself a symmetric replica of θ and therefore S_r θ' ∈ Θ(θ), it follows from (40) and (42) that ||q − S_r θ'|| < ||q − θ'||. But this means that θ' does not minimize the distance to q, which contradicts Eqn. (38). ∎

5.4 Approximations of the ideal separation

In order to take advantage of these results, we have to address two issues. First, the global optimum is not known a priori. Second, the brute force method for finding an optimal solution to (38) has exponential complexity, but a low-complexity algorithm is desired. In order to circumvent the first problem, we propose to use an estimate for the global optimum, which can be determined by the population of solution candidates at each iteration of the applied Monte Carlo method. Naturally, this estimate improves with increasing iteration number. The second problem can be addressed by using an approximation for the ideal separation achieved by (38).

To describe the proposed method, for each neuron i in a hidden layer l, we define a symmetry relevant parameter block as

β_{l,i} = (θ_{l,i}, u_{l+1,i})    (43)
      = (w_{l,i}, b_{l,i}, u_{l+1,i}),    (44)

where u_{l+1,i} collects the i-th weight components of all neurons in layer l+1, i.e., the block also includes the corresponding parameters from the next layer l+1. Given a parameter vector θ and an estimate q̂ of the global optimum with corresponding parameter blocks β_{l,i} and β̂_{l,i}, Algorithm 1 describes the proposed approximation for ideal symmetry breaking.

  [breaking point symmetry]
  for all hidden layers l do
     for all neurons i in layer l do
        // would the point symmetry operator S_p decrease the distance to the estimate q̂?
        calculate distance-square for NOT applying S_p:  d_0 = ||β_{l,i} − β̂_{l,i}||²
        calculate distance-square for applying S_p:      d_1 = ||S_p β_{l,i} − β̂_{l,i}||²
        if d_1 < d_0 then
           apply point symmetry operator S_p: set β_{l,i} ← S_p β_{l,i}
        end if
     end for
  end for
  [breaking permutation symmetry]
  for all hidden layers l do
     randomly choose two neurons i, j in hidden layer l with i ≠ j
     // would the permutation operator S_π decrease the distance to the estimate q̂?
     calculate distance-square for NOT applying S_π:  d_0 = ||(β_{l,i}, β_{l,j}) − (β̂_{l,i}, β̂_{l,j})||²
     calculate distance-square for applying S_π:      d_1 = ||(β_{l,j}, β_{l,i}) − (β̂_{l,i}, β̂_{l,j})||²
     if d_1 < d_0 then
        apply permutation symmetry operator S_π: swap β_{l,i} ↔ β_{l,j}
     end if
  end for
Algorithm 1 Proposed symmetry breaking method. A symmetry operator is only applied to the parameter vector θ when it decreases the distance to the global optimum estimate q̂, i.e., when d_1 < d_0. Algorithm input: θ and q̂. Effect: modification of the parameter vector θ when appropriate.
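A possible realization of Algorithm 1 for a single hidden layer is sketched below in numpy; the parameter blocks (incoming weights, shift and outgoing weights per neuron) follow the reconstruction of Eqns. (43)-(44) used here, and all names are illustrative.

import numpy as np

def break_symmetry(W1, b1, W2, W1_hat, b1_hat, W2_hat, rng):
    """Approximate symmetry breaking (Algorithm 1 sketch) for one hidden layer.
    A sign flip or a swap is applied only if it decreases the distance of the
    affected parameter blocks to the global optimum estimate (hat variables)."""
    W1, b1, W2 = W1.copy(), b1.copy(), W2.copy()
    n = len(b1)

    def block_dist(i):
        return (np.sum((W1[i] - W1_hat[i]) ** 2) + (b1[i] - b1_hat[i]) ** 2
                + np.sum((W2[:, i] - W2_hat[:, i]) ** 2))

    # point symmetry: flip neuron i (and its outgoing weights) if this helps
    for i in range(n):
        d0 = block_dist(i)
        W1[i], b1[i], W2[:, i] = -W1[i], -b1[i], -W2[:, i]
        if block_dist(i) >= d0:                                 # flipping did not help
            W1[i], b1[i], W2[:, i] = -W1[i], -b1[i], -W2[:, i]  # undo

    # permutation symmetry: swap one randomly chosen pair of neurons if this helps
    i, j = rng.choice(n, size=2, replace=False)
    d0 = block_dist(i) + block_dist(j)
    W1[[i, j]], b1[[i, j]], W2[:, [i, j]] = W1[[j, i]], b1[[j, i]], W2[:, [j, i]]
    if block_dist(i) + block_dist(j) >= d0:                     # swapping did not help
        W1[[i, j]], b1[[i, j]], W2[:, [i, j]] = W1[[j, i]], b1[[j, i]], W2[:, [j, i]]
    return W1, b1, W2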

In Fig. 6, the effect of the different symmetry breaking approaches is demonstrated on a hypothetical 2-D parameter space.

Figure 6: Examples of symmetry breaking methods. Given a distribution of solution candidates as shown in the upper circle, typical outcomes of three different symmetry breaking methods are shown. In the left-bottom case, all solution candidates are mapped into the selected partition, but the global optimum is not necessarily centered within the partition. As a downside, there is a relatively strong influence of the global optimum from the neighboring partition. In the center-bottom case, the selected partition is chosen such that the distances to the other symmetric replicas of the global optimum are maximized, and all solution candidates are mapped into the selected partition. The right-bottom case shows the proposed approximate global optimum variant symmetry breaking. It equals the center-bottom case, except that the solution candidates are not necessarily mapped into the selected partition, but may also land in other partitions close to the selected one.

5.4.1 DE with symmetry breaking

The DE method [29, 22] comprises a population of solution candidates θ_k, which are iteratively updated and moved towards an optimal solution. We propose to choose the centroid of the population at each iteration as the estimate q̂ for the global optimum.

The DE method extended by the global optimum invariant symmetry breaking [31] is denoted by DE-INV-SB, DE extended by the proposed global optimum variant symmetry breaking, described by Algorithm 1, is denoted by DE-SB, and DE with global optimum variant ideal symmetry breaking using brute force search is denoted by DE-SB-BF. As shown in Fig. 7, in DE-based symmetry breaking approaches, symmetry breaking is always applied to each solution candidate right after it has been updated for the next iteration. Only in DE-SB do we apply an additional step by increasing the error yield of those solution candidates which are not in the same partition as the selected partition holding q̂. This increases the probability that these solution candidates are updated and moved closer to the selected partition. This is not required for symmetry breaking approaches which map each solution candidate exactly to the selected partition, such as DE-INV-SB or DE-SB-BF. The DE-SB method is described in Algorithm 2.

  for all candidate vectors θ_k do
     apply symmetry breaking on θ_k, see Algorithm 1
     if θ_k was modified (a symmetry operator was applied) and θ_k is not in the selected partition then
        multiply the stored error yield of θ_k by a constant factor greater than one
     end if
  end for
Algorithm 2 DE-SB. Algorithm input: population of candidate vectors θ_k and the centroid of the population as the estimate q̂ for the global optimum. Effect: modification of the candidate vectors when appropriate.
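The following sketch shows how the DE-SB post-processing step could look for flat parameter vectors; the symmetry breaking routine and the partition test are passed in as callables (for instance a flat-vector wrapper around the Algorithm 1 sketch above), and the penalty factor is an assumed, illustrative value since the concrete constant is not restated here.

import numpy as np

def de_sb_step(population, errors, apply_symmetry_breaking, in_selected_partition,
               penalty_factor=1.5):
    """One DE-SB post-processing pass (sketch of Algorithm 2).

    population: (K, n) array of candidate parameter vectors
    errors:     (K,) array of stored error yields
    apply_symmetry_breaking(theta, q_hat) -> (theta_new, modified)
    in_selected_partition(theta, q_hat)   -> bool
    penalty_factor: illustrative value; the method only requires a constant > 1.
    """
    q_hat = population.mean(axis=0)               # centroid as global optimum estimate
    for k in range(len(population)):
        theta_new, modified = apply_symmetry_breaking(population[k], q_hat)
        population[k] = theta_new
        if modified and not in_selected_partition(theta_new, q_hat):
            errors[k] *= penalty_factor           # push the candidate towards the partition
    return population, errors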
Figure 7: Flowgraph for DE with symmetry breaking.

5.4.2 CMA-ES with symmetry breaking

The CMA-ES method [12, 11] adapts a global step size σ, the mean m and a covariance matrix C at each iteration. According to the Gaussian distribution with mean m and covariance matrix σ²C, solution candidate vectors are drawn. After sorting the population by the error each candidate vector yields, the best samples are used to update the mean, the covariance matrix and the step size for the next iteration.

In the following discussion, the CMA-ES method extended by the global optimum invariant symmetry breaking [31] is denoted by CMA-ES-INV-SB, CMA-ES extended by the proposed global optimum variant symmetry breaking, described by Algorithm 1, is denoted by CMA-ES-SB, and CMA-ES with global optimum variant ideal symmetry breaking using brute force search is denoted by CMA-ES-SB-BF.

In CMA-ES-INV-SB, CMA-ES-SB and CMA-ES-SB-BF, symmetry breaking is applied right after the evaluation of all candidate vectors and prior to updating the parameters of the Gaussian distribution. In CMA-ES-SB, we propose to use the best candidate vector found so far (yielding the smallest error) as the estimate for the global optimum, denoted by q̂. In Fig. 8, the flowgraph for CMA-ES-based symmetry breaking approaches is shown. For CMA-ES-SB, the update of the mean is described in Algorithm 3. In all other CMA-ES-based methods, the original update formula for the mean is applied.

In CMA-ES, applying symmetry breaking introduces a bias in the mean, which can lead to an excessive increase of the global step size and negatively affect the performance. This bias results from the rotations caused by the symmetry operators. These rotations move solution candidates to the vicinity of one partition, which typically increases the radius of the population mean, as shown in Fig. 6. In order to prevent such an increase, in all CMA-ES-based symmetry breaking methods, we modify the damping term for the update of the global step size σ. Let δ be the shift vector of the centroid of the best solution candidates induced by applying symmetry breaking. The regular update formula for σ,

(45)

is changed to

(46)

where t is the iteration number and the remaining term depends on the difference of the previous mean and the current mean, and on several other parameters.

  set mean vector m ← 0
  for all candidate vectors θ_k do
     apply symmetry breaking on θ_k, see Algorithm 1
     if θ_k was modified (a symmetry operator was applied) then
        add weighted global optimum estimate to mean vector: m ← m + ω_k q̂
     else
        add weighted candidate vector to mean vector: m ← m + ω_k θ_k
     end if
  end for
Algorithm 3 CMA-ES-SB. Algorithm input: population of candidate vectors θ_k, the estimate q̂ for the global optimum and weights ω_k. Effect: modification of the candidate vectors and of the mean when appropriate.
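A minimal sketch of the CMA-ES-SB mean update of Algorithm 3, assuming standard CMA-ES recombination weights; the symmetry breaking routine is again passed in as a callable and all names are illustrative.

import numpy as np

def cma_es_sb_mean(candidates, weights, q_hat, apply_symmetry_breaking):
    """Mean update of CMA-ES-SB (sketch of Algorithm 3).

    candidates: (mu, n) array of the selected candidates of the generation
    weights:    (mu,) recombination weights
    q_hat:      current global optimum estimate (best candidate found so far)
    apply_symmetry_breaking(x, q_hat) -> (x_new, modified)
    """
    mean = np.zeros(candidates.shape[1])
    for x, w in zip(candidates, weights):
        x_new, modified = apply_symmetry_breaking(x, q_hat)
        # a modified candidate contributes the estimate instead of itself,
        # which limits the bias that symmetry breaking introduces in the mean
        mean += w * (q_hat if modified else x_new)
    return mean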
Figure 8: Flowgraph for CMA-ES with symmetry breaking.

6 Experiments

In this section, we present the results of experiments that demonstrate the performance improvements achieved by symmetry breaking. The following methods are compared using regression and classification tests. From the DE-family: Differential Evolution (DE), DE with global optimum invariant symmetry breaking (DE-INV-SB), DE with global optimum variant symmetry breaking (DE-SB) and DE with global optimum variant ideal symmetry breaking using brute force search (DE-SB-BF). From the CMA-ES-family: Covariance Matrix Adaptation Evolution Strategies (CMA-ES), CMA-ES with global optimum invariant symmetry breaking (CMA-ES-INV-SB), CMA-ES with global optimum variant symmetry breaking (CMA-ES-SB) and CMA-ES with global optimum variant ideal symmetry breaking using brute force search (CMA-ES-SB-BF). It should be noted that the purpose of this investigation is not to present the best global optimization method for ANN-learning, but to demonstrate the benefits of symmetry breaking.
With an n-dimensional parameter space, all tests are performed with the following settings:

  • DE, DE-SB, DE-INV-SB and DE-SB-BF settings: the initial population is randomly generated (uniformly) in an n-dim. hypercube,

  • CMA-ES, CMA-ES-SB, CMA-ES-INV-SB and CMA-ES-SB-BF settings: we used suggested settings for enhanced global search abilities, mentioned in the C-code reference implementation.

  • in all experiments, the optimization is finished when a maximum number of ANN-function-evaluations is reached.

Given a parameter vector θ and a data set D = {(x_j, y_j)}, j = 1, …, N, we define the Mean Squared Error (MSE) according to Eqn. (8):

MSE(θ, D) = (1/N) Σ_{j=1}^{N} ||y_j − f(x_j; θ)||².    (47)

In order to limit the n-dimensional parameter space to a feasible region, we apply a penalty approach. Due to the length-invariance under the symmetry operators as shown in Eqn. (15), the feasible region is defined by a hypersphere. In case ||θ|| exceeds the hypersphere radius, the error function (47) is evaluated at a rescaled parameter vector and a penalty term is added to the error.
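A sketch of such a penalty approach is given below; the hypersphere radius, the penalty weight and the linear penalty form are assumptions made for illustration only.

import numpy as np

def penalized_error(theta, mse_fn, radius=10.0, penalty_weight=1.0):
    """Hypersphere feasibility penalty (sketch).

    If ||theta|| exceeds the radius, the error is evaluated at the parameter
    vector rescaled back onto the hypersphere and a penalty proportional to
    the excess length is added. Radius, penalty form and weight are
    illustrative; only the rescaled evaluation plus a penalty is prescribed."""
    norm = np.linalg.norm(theta)
    if norm <= radius:
        return mse_fn(theta)
    theta_scaled = theta * (radius / norm)
    return mse_fn(theta_scaled) + penalty_weight * (norm - radius)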

In self-generated data sets, we add normally distributed noise with zero mean and fixed variance to the function values:

y_j = g(x_j) + ε_j,    (48)

where g denotes the generating function and ε_j the noise term.

6.1 Experimental setup

In all experiments, the data is normalized such that its mean is zero and its variance is one. The population size used in DE and CMA-ES depends on the problem and the choice of the optimization method. Therefore, it is manually adapted accordingly. For each problem and each optimization method, we conduct 50 independent repetitions of the optimization process and record the error over the number of ANN-evaluations. To test for statistical significance of the obtained results, first the Kruskal-Wallis test [14] for the hypothesis that all performance means are equal is applied. In case this hypothesis is rejected, the Wilcoxon rank sum test [35] is applied to all pairs of means to identify significantly different results. All tests are based on the same significance level. In Table 1, normalized training set errors for the regression and the autoencoding problems, and normalized test set errors for the classification problems are shown.

Problem           DE            DE-INV-SB     DE-SB         CMA-ES        CMA-ES-INV-SB  CMA-ES-SB
syn5              0.958 0.079   1.000 0.186   0.949 0.039   1.000 4.446   0.386 0.469    0.093 0.007
sinc              1.000 0.859   0.412 0.166   0.114 0.008   0.459 0.271   1.000 0.784    0.139 0.051
inc-sinc          1.000 0.963   0.337 0.155   0.089 0.016   0.287 0.336   1.000 0.707    0.082 0.035
sinc2d            1.000 0.387   0.995 0.094   0.875 0.029   0.975 0.139   1.000 0.241    0.089 0.253
sinc3d            0.622 0.029   1.000 0.572   0.603 0.033   1.000 1.401   0.090 0.013    0.043 0.021

autoenc-circle    0.057 0.082   1.000 1.850   0.020 0.030   1.000 0.295   0.626 0.548    0.077 0.164
autoenc-spiral    0.341 0.545   1.000 0.932   0.116 0.308   0.248 0.232   1.000 0.882    0.030 0.024
autoenc-sphere    0.554 0.321   1.000 0.064   0.022 0.012   0.050 0.012   1.000 0.416    0.032 0.008

two-circles       0.450 0.225   1.000 0.182   0.269 0.074   0.635 0.368   1.000 0.284    0.326 0.169
two-spirals       0.918 0.260   1.000 0.213   0.426 0.197   1.000 0.228   0.930 0.201    0.683 0.293
digits            0.325 0.087   1.000 0.111   0.272 0.062   1.000 0.352   0.805 0.113    0.668 0.099

Table 1: Normalized training set errors for the regression and the autoencoding problems, and normalized test set errors for the classification problems (two values per problem and method). For each problem and method, errors are normalized by the maximum error from within the corresponding regular method, its extension by global optimum invariant symmetry breaking and its extension by global optimum variant symmetry breaking.
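The statistical testing procedure of Section 6.1 can be sketched as follows with scipy; the significance level is left as a parameter since its concrete value is not restated here, and the function interface is illustrative.

import itertools
from scipy.stats import kruskal, ranksums

def compare_methods(results, alpha):
    """results: dict mapping method name -> list of final errors over the runs.
    First test whether all methods perform equally (Kruskal-Wallis); if that
    hypothesis is rejected, compare all pairs with the Wilcoxon rank sum test."""
    _, p_all = kruskal(*results.values())
    if p_all >= alpha:
        return None                        # no significant differences detected
    pairwise = {}
    for a, b in itertools.combinations(results, 2):
        _, p = ranksums(results[a], results[b])
        pairwise[(a, b)] = (p, p < alpha)  # p-value and significance flag
    return pairwise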

6.2 Regression problems

As in [4], we apply learning only on a training set to compare the performance of the introduced methods. In the following, the regression problems are introduced and corresponding results are shown.

6.2.1 Dataset syn5

The syn5 dataset is generated by a fourth-degree polynomial with uniformly distributed random input values. We use a 1-3-1 net and 200 data samples. Separate population sizes are used for the DE-based and the CMA-ES-based methods. Fig. 9 shows the resulting convergence curves and box plots for the learning process.

Figure 9: Convergence curves for regression by DE (left) and CMA-ES (right) using the syn5 dataset.

For the DE-family, the Kruskal-Wallis test showed no significant difference in means. For the CMA-ES-family, in contrast, the Wilcoxon tests were applied; the means of CMA-ES and CMA-ES-SB are not significantly different, albeit only by a narrow margin, whereas all other means are significantly different. All DE variants reach the same low error, where DE-SB shows the fastest decrease in error. As for the CMA-ES variants, CMA-ES fails to reach a low error in a few runs, which leads to a larger mean error on average. In contrast, CMA-ES-SB proves to be more robust and reaches a relatively low error in all runs.

6.2.2 Dataset sinc

The sinc dataset is generated by the sinc function with uniformly distributed random input values. We use a 1-5-1 net and 200 data samples. Separate population sizes are used for the DE-based and the CMA-ES-based methods. Fig. 10 shows the resulting convergence curves and box plots for the learning process.

Figure 10: Convergence curves for regression by DE (left) and CMA-ES (right) using the sinc dataset.

According to the Wilcoxon tests, all pairwise differences are significant. DE-SB clearly outperforms DE and DE-INV-SB. Similarly, CMA-ES-SB is the fastest among the CMA-ES-based methods.

6.2.3 Dataset inc-sinc

The inc-sinc dataset is generated by a sinc-based function with uniformly distributed random input values. We use a 1-5-1 net and 200 data samples. Separate population sizes are used for the DE-based and the CMA-ES-based methods. Fig. 11 shows the resulting convergence curves and box plots for the learning process.

Figure 11: Convergence curves for regression by DE (left) and CMA-ES (right) using the inc-sinc dataset.

According to the Wilcoxon tests, all pairwise differences are significant. Interestingly, the global optimum invariant symmetry breaking approach leads to an improvement for DE (DE-INV-SB), but shows inferior performance on CMA-ES (CMA-ES-INV-SB). This proves that symmetry breaking approaches should be specific to the selected global optimization method. Again, DE-SB and CMA-ES-SB are the fastest methods.

6.2.4 Dataset sinc2d

The sinc2d dataset is generated by a two-dimensional sinc function with uniformly distributed random input values. We use a 2-3-1-3-1 net and 1000 data samples. Separate population sizes are used for the DE-based and the CMA-ES-based methods. Fig. 12 shows the resulting convergence curves and box plots for the learning process.

Figure 12: Convergence curves for regression by DE (left) and CMA-ES (right) using the sinc2d dataset.

According to the Wilcoxon tests, all pairwise differences are significant, except the difference between CMA-ES and CMA-ES-INV-SB. The proposed symmetry breaking approach shows a very clear impact on the CMA-ES-variants. While CMA-ES and CMA-ES-INV-SB fail to solve this problem completely, CMA-ES-SB successfully trains the ANN in the majority of the 50 runs.

6.2.5 Dataset sinc3d

The sinc3d dataset is generated by a three-dimensional sinc function with uniformly distributed random input values. We use a 3-4-1-4-1 net and 1000 data samples. Separate population sizes are used for the DE-based and the CMA-ES-based methods. Fig. 13 shows the resulting convergence curves and box plots for the learning process.

Figure 13: Convergence curves for regression by DE (left) and CMA-ES (right) using the sinc3d dataset.

According to the Wilcoxon tests, all pairwise differences are significant. Again, DE-SB and CMA-ES-SB are the fastest methods. This time, in contrast to previous experiments, CMA-ES-INV-SB clearly outperforms CMA-ES.

6.3 Autoencoding problems

In this section, all n-dimensional data samples lie on an m-dimensional set, where m < n. As a result, the data can be described, or ’encoded’, by an m-dimensional subset. On the other hand, there is also an m-D to n-D mapping to ’decode’ the data. The task is to approximate both the encoding and the decoding mapping by an ANN. As in the case of the regression problems, the performance is compared only on the training set.

6.3.1 Dataset autoenc-circle

In this problem, the data samples lie on a 2-D circle centered at the origin with radius one. We use a 2-5-3-2-1-2-3-5-2 net and 200 data samples to encode from 2-D to 1-D and decode back to 2-D. Separate population sizes are used for the DE-based and the CMA-ES-based methods. Fig. 14 shows the resulting convergence curves and box plots for the learning process.

Figure 14: Convergence curves for DE (left) and CMA-ES (right) using the autoenc-circle dataset.

All pairwise differences prove to be statistically significant. The proposed symmetry breaking approach improves the training for both methods. For CMA-ES-SB, the difference turns out to be quite substantial.

6.3.2 Dataset autoenc-spiral

In this problem, the data samples lie on a 3-D spiral with radius one. We use a 3-1-3-4-7-3 net and 1000 data samples to encode from 3-D to 1-D and decode back to 3-D. Separate population sizes are used for the DE-based and the CMA-ES-based methods. Fig. 15 shows the resulting convergence curves and box plots for the learning process.

Figure 15: Convergence curves for DE (left) and CMA-ES (right) using the autoenc-spiral dataset.

All pairwise differences prove to be statistically significant.

6.3.3 Dataset autoenc-sphere

In this problem, the data samples lie on a 3-D sphere centered at the origin with radius one. We use a 3-8-5-2-5-8-3 net and 1000 data samples to encode from 3-D to 2-D and decode back to 3-D. Separate population sizes are used for the DE-based and the CMA-ES-based methods. Fig. 16 shows the resulting convergence curves and box plots for the learning process.

Figure 16: Convergence curves for regression by DE (left) and CMA-ES (right) using the autoenc-sphere dataset.

All pairwise differences prove to be statistically significant. Clearly, DE-SB and CMA-ES-SB are significantly faster than the other methods.

6.4 Classification problems

In classification problems, data samples are divided into a training set, a validation set and a test set. All three sets are generated by random selection of samples. A winner-takes-all scheme is applied to distinguish different classes, i.e., given an input, the ANN-output component with the greatest value determines the class. In order to improve generalization, classification performance measures on the training and test set are updated only on each improvement of the validation set classification performance.
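A minimal sketch of the winner-takes-all decision and the validation-gated bookkeeping described above is given below; the interface and names are illustrative.

import numpy as np

def error_rate(outputs, labels):
    # winner-takes-all: the output component with the greatest value decides the class
    return np.mean(np.argmax(outputs, axis=1) != labels)

class ValidationGate:
    """Record training/test error rates only when the validation error improves."""
    def __init__(self):
        self.best_val = np.inf
        self.train_err = None
        self.test_err = None

    def update(self, val_err, train_err, test_err):
        if val_err < self.best_val:
            self.best_val = val_err
            self.train_err, self.test_err = train_err, test_err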

6.4.1 Dataset: Two-Circles

In this problem, the 2-D data domain is divided into two parts, where one part is given by the union of the areas of two circles and the remaining part is its complement. Hence, there are two classes: samples which lie inside at least one of the circles and samples which lie outside of both circles. The two circles have different centers and the same radius. We use a 2-4-2-4-2 net with 400 samples each for the training, validation and test sets, i.e., a total of 1200 samples. Separate population sizes are used for the DE-based and the CMA-ES-based methods. Figs. 17 and 18 show the resulting error rates for the learning process.

Figure 17: Classification error rates over ANN-evaluations on the Two-Circles dataset using the DE-variants.
Figure 18: Classification error rates over ANN-evaluations on the Two-Circles dataset using the CMA-ES-variants.

All pairwise differences prove to be statistically significant. It can be seen that, again, DE-SB and CMA-ES-SB dominate in performance.

6.4.2 Dataset: Two-Spirals

This problem [17] contains 2-D data samples from two spirals in the plane, both starting at the origin and winding around each other. The task is to classify each data sample by deciding to which spiral it belongs. We use a 2-8-3-1-3-8-2 net, 114 samples for the training set, 40 samples for the validation set and 40 samples for the test set. Separate population sizes are used for the DE-based and the CMA-ES-based methods. Figs. 19 and 20 show the resulting convergence curves and box plots for the learning process.

Figure 19: Classification error rates over ANN-evaluations on the Two-Spirals dataset using the DE-variants.
Figure 20: Classification error rates over ANN-evaluations on the Two-Spirals dataset using the CMA-ES-variants.

On the training and the test set, the mean results of DE and DE-INV-SB are not statistically significantly different. Furthermore, on the test set, the mean results of CMA-ES and CMA-ES-INV-SB are not significantly different. All other pairwise differences prove to be statistically significant. DE-SB and CMA-ES-SB continue to show the best results.

6.4.3 Dataset: Digits

This problem [1] deals with the recognition of handwritten digits, which results in a classification problem with 10 classes. The data is generated by asking several writers to write 250 digits in random order inside boxes of 500 by 500 tablet pixel resolution. There are 16 features extracted from the digitized data. We use a 16-8-3-10-10 net and 1000 data samples each for the training set, the validation set and the test set.

Figure 21: Classification error rates over ANN-evaluations on the Digits dataset using the DE-variants.
Figure 22: Classification error rates over ANN-evaluations on the Digits dataset using the CMA-ES-variants.

All pairwise differences prove to be statistically significant. Again, DE-SB and CMA-ES-SB are the fastest methods.

6.5 Ideal separation

In this Section, we compare the ideal separation to the proposed approximations. Since the complexity of the brute force method for the ideal separation is exponential, we restrict the experiments to small networks as used in the problems syn5, sinc and inc-sinc. As Figs. 23 and 24 show, the results are almost identical.

Figure 23: Comparing DE-SB with DE using ideal separation by brute force symmetry breaking (DE-SB-BF) on the syn5, sinc and inc-sinc datasets.
Figure 24: Comparing CMA-ES-SB with CMA-ES using ideal separation by brute force symmetry breaking (CMA-ES-SB-BF) on the syn5, sinc and inc-sinc datasets.

7 Conclusions

The symmetries of the ANN-parameter space are a well known problem, causing significant complications in the training of ANN’s. However, a detailed investigation of this problem for Evolutionary Algorithms other than Genetic Algorithms is missing in the literature. Furthermore, there are contradictory results about the efficacy of symmetry breaking methods regarding the performance of the global search. We show that a possible explanation for this situation is the use of symmetry breaking methods which are invariant to the global optimum and therefore can only be effective on a limited number of problems. Furthermore, we show theoretically and illustrate experimentally that the application of global optimum invariant symmetry breaking may even lead to inferior performance. To circumvent these problems, we propose global optimum variant symmetry breaking approaches for Differential Evolution (DE) and Covariance Matrix Adaptation Evolution Strategies (CMA-ES), which are two popular, robust and state-of-the-art global optimization methods.

Experimental studies conducted on fixed topology feedforward neural networks indicate a significant improvement over standard DE and CMA-ES techniques in terms of global convergence speed. Further comparisons of the proposed approach with a common global optimum invariant symmetry breaking approach support our hypotheses.

Based on the obtained results, we conclude that other global optimization based methods may also benefit from the use of the proposed global optimum variant symmetry breaking. Further research is required to adapt the proposed approach to other techniques to improve their performance.

The proposed method can be tested and verified using the open source C++ Monte Carlo Machine Learning Library (MCMLL), which is available under the GNU GPLv2 license. The website of the library can be found at mcmll.sourceforge.net; the project page is available at sourceforge.net/projects/mcmll.

References

  • [1] E. Alpaydin and Fevzi Alimoglu. UCI machine learning repository, 1996.
  • [2] Michael A. Arbib, editor. The Handbook of Brain Theory and Neural Networks. MIT Press, Cambridge, MA, USA, 2002.
  • [3] Hugues Bersini, Marco Dorigo, Stefan Langerman, Gregory Seront, and Luca Maria Gambardella. Results of the first international contest on evolutionary optimisation (1st ICEO). In International Conference on Evolutionary Computation, pages 611–615, 1996.
  • [4] Enrique Castillo, Bertha Guijarro-Berdiñas, Oscar Fontenla-Romero, and Amparo Alonso-Betanzos. A very fast learning method for neural networks based on sensitivity analysis. Journal of Machine Learning Research, 7:1159–1182, July 2006.
  • [5] Uday K. Chakraborty. Advances in Differential Evolution. Springer Publishing Company, Incorporated, 1 edition, 2008.
  • [6] Nicolás García-Pedrajas, Domingo Ortiz-Boyer, and César Hervás-Martínez. An alternative approach for neural network evolution with a genetic algorithm: Crossover by combinatorial optimization. Neural Netw., 19(4):514–528, 2006.
  • [7] David E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley Professional, 1 edition, January 1989.
  • [8] Faustino Gomez, Jürgen Schmidhuber, and Risto Miikkulainen. Accelerated neural evolution through cooperatively coevolved synapses. J. Mach. Learn. Res., 9:937–965, 2008.
  • [9] Stefan Haflidason and Richard Neville. On the significance of the permutation problem in neuroevolution. In GECCO ’09: Proceedings of the 11th Annual conference on Genetic and evolutionary computation, pages 787–794, New York, NY, USA, 2009. ACM.
  • [10] P. Hancock. Genetic algorithms and permutation problems: a comparison of recombination operators for neural net structure specification. In Proceedings of the International Workshop on Combinations of Genetic Algorithms and Neural Networks, 1992.
  • [11] Nikolaus Hansen, Sibylle D. Müller, and Petros Koumoutsakos. Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (cma-es). Evol. Comput., 11(1):1–18, 2003.
  • [12] Nikolaus Hansen and Andreas Ostermeier. Adapting arbitrary normal mutation distributions in evolution strategies: the covariance matrix adaptation. In Proc. of the 1996 IEEE Int. Conf. on Evolutionary Computation, pages 312–317, Piscataway, NJ, 1996. IEEE Service Center.
  • [13] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 2nd edition, July 1998.
  • [14] Myles Hollander and Douglas A. Wolfe. Nonparametric Statistical Methods, 2nd Edition. Wiley-Interscience, 2 edition, January 1999.
  • [15] Jarmo Ilonen, Joni-Kristian Kamarainen, and Jouni Lampinen. Differential evolution training algorithm for feed-forward neural networks. Neural Process. Lett., 17(1):93–105, 2003.
  • [16] Fei Jiang, Hugues Berry, and Marc Schoenauer. Unsupervised learning of echo state networks: balancing the double pole. In GECCO ’08: Proceedings of the 10th annual conference on Genetic and evolutionary computation, pages 869–870, New York, NY, USA, 2008. ACM.
  • [17] K. J. Lang and M. J. Witbrock. Learning to tell two spirals apart. In Proceedings 1988 Connectionist Models Summer School, pages 52–59, Los Altos, CA, 1988. Morgan Kaufmann.
  • [18] Junhong Liu, Jorma Mattila, and Jouni Lampinen. Training RBF networks using a DE algorithm with adaptive control. In Tools with Artificial Intelligence, IEEE International Conference on, pages 673–676, 2005.
  • [19] T. Masters. Practical Neural Networks Recipes in C++. Academic Press, 1993.
  • [20] Silvio Priem Mendes, Juan A. Gomez Pulido, Miguel A. Vega Rodriguez, Maria D. Jaraiz Simon, and Juan M. Sanchez Perez. A differential evolution based algorithm to optimize the radio network design problem. In E-SCIENCE ’06: Proceedings of the Second IEEE International Conference on e-Science and Grid Computing, page 119, Washington, DC, USA, 2006. IEEE Computer Society.
  • [21] Zbigniew Michalewicz. Genetic algorithms + data structures = evolution programs (2nd, extended ed.). Springer-Verlag New York, Inc., New York, NY, USA, 1994.
  • [22] K. V. Price. Differential evolution: a fast and simple numerical optimizer. In Biennial Conference of the North American Fuzzy Information Processing Society, NAFIPS, pages 524–527. IEEE Press, New York. ISBN: 0-7803-3225-3, June 1996.
  • [23] Allan Rae and Sri Parameswaran. Application-specific heterogeneous multiprocessor synthesis using differential-evolution. In ISSS ’98: Proceedings of the 11th international symposium on System synthesis, pages 83–88, Washington, DC, USA, 1998. IEEE Computer Society.
  • [24] O.M. Shir, C. Siedschlag, T. Back, and M.J.J. Vrakking. Evolutionary algorithms in the optimization of dynamic molecular alignment. In Evolutionary Computation, 2006. CEC 2006. IEEE Congress on, pages 2912–2919, 2006.
  • [25] Nils Siebel, Gerald Sommer, and Yohannes Kassahun. Evolutionary learning of neural structures for visuo-motor control. In Arpad Kelemen, Ajith Abraham, and Yulan Liang, editors, Computational Intelligence in Medical Informatics, volume 85 of Studies in Computational Intelligence, pages 93–115. Springer Berlin / Heidelberg, 2008.
  • [26] Nils T Siebel, Jonas Boetel, and Gerald Sommer. Efficient neural network pruning during neuro-evolution. In Proceedings of 2009 International Joint Conference on Neural Networks (IJCNN 2009), Atlanta, USA, pages 2920–2927, June 2009.
  • [27] Nils T. Siebel, Jochen Krause, and Gerald Sommer. Efficient learning of neural networks with evolutionary algorithms. In Proceedings of the 29th DAGM conference on Pattern recognition, pages 466–475, Berlin, Heidelberg, 2007. Springer-Verlag.
  • [28] Jirí Síma. Minimizing the quadratic training error of a sigmoid neuron is hard. In ALT ’01: Proceedings of the 12th International Conference on Algorithmic Learning Theory, pages 92–105, London, UK, 2001. Springer-Verlag.
  • [29] R. Storn and K. Price. Differential evolution - a simple and efficient adaptive scheme for global optimization over continuous spaces. Technical Report TR-95-012, ICSI, March 1995.
  • [30] Héctor J. Sussmann. Uniqueness of the weights for minimal feedforward nets with a given input-output map. Neural Netw., 5(4):589–593, 1992.
  • [31] Dirk Thierens. Non-redundant genetic coding of neural networks. In In Proceedings of the 1996 IEEE International Conference on Evolutionary Computation, pages 571–575. IEEE Press, 1996.
  • [32] Dirk Thierens, J.A.K. Suykens, J. Vandewalle, and B. De Moor. Genetic weight optimization of a feedforward neural network controller. Innsbruck, Austria, Apr. 1993.
  • [33] Tea Tušar and Bogdan Filipič. Differential evolution versus genetic algorithms in multiobjective optimization. In EMO’07: Proceedings of the 4th international conference on Evolutionary multi-criterion optimization, pages 257–271, Berlin, Heidelberg, 2007. Springer-Verlag.
  • [34] J. Vesterstrom and R. Thomsen. A comparative study of differential evolution, particle swarm optimization, and evolutionary algorithms on numerical benchmark problems. In Evolutionary Computation, 2004. CEC2004. Congress on, volume 2, pages 1980–1987, June 2004.
  • [35] Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, 1945.
  • [36] Xing Xu and Yuanxiang Li. Comparison between particle swarm optimization, differential evolution and multi-parents crossover. Computational Intelligence and Security, International Conference on, 0:124–127, 2007.
  • [37] Xin Yao and Yong Liu. Towards designing artificial neural networks by evolution. Applied Mathematics and Computation, 91(1):83 – 90, 1998.