A Frequency-Domain Encoding for Neuroevolution

12/28/2012, by Jan Koutník, et al.

Neuroevolution has yet to scale up to complex reinforcement learning tasks that require large networks. Networks with many inputs (e.g. raw video) imply a very high dimensional search space if encoded directly. Indirect methods use a more compact genotype representation that is transformed into networks of potentially arbitrary size. In this paper, we present an indirect method where networks are encoded by a set of Fourier coefficients which are transformed into network weight matrices via an inverse Fourier-type transform. Because there often exist network solutions whose weight matrices contain regularity (i.e. adjacent weights are correlated), the number of coefficients required to represent these networks in the frequency domain is much smaller than the number of weights (in the same way that natural images can be compressed by ignoring high-frequency components). This "compressed" encoding is compared to the direct approach where search is conducted in the weight space on the high-dimensional octopus arm task. The results show that representing networks in the frequency domain can reduce the search-space dimensionality by as much as two orders of magnitude, both accelerating convergence and yielding more general solutions.


1 Introduction

Training neural networks for reinforcement learning tasks (i.e. as value-function approximators) is problematic because the non-stationarity of the error gradient can lead to poor convergence, especially if the networks are recurrent: the data from which the agent learns depends on the agent's own policy, which changes over time.

An alternative to training by gradient descent is to search the space of neural network policies directly via evolutionary computation. In this neuroevolutionary framework, networks are encoded either directly or indirectly as strings of values or genes, called chromosomes, and then evolved in the standard way (genetic algorithm, evolution strategies, etc.).

Direct encoding schemes employ a one-to-one mapping from genes to network parameters (e.g. connectivity pattern, synaptic weights), so that the size of the evolved networks is proportional to the length of the chromosomes.

In indirect schemes, the mapping from chromosome to network can in principle be any computable function, allowing chromosomes of fixed size to represent networks of arbitrary complexity. The underlying motivation for this approach is to scale neuroevolution to problems requiring large networks, such as vision (Gauci and Stanley, 2007), since search can be conducted in a relatively low-dimensional gene space. Theoretically, the optimal or most compressed encoding is the one in which each possible network is represented by the shortest program that generates it, i.e. the one with the lowest Kolmogorov complexity (Li and Vitányi, 1997). The lowest Kolmogorov complexity encoding is generally not computable, but it can be approximated from above through a search in the space of network-computing programs (Schmidhuber, 1995, 1997) written in a universal programming language.

Less general but more practical encodings (Gauci and Stanley, 2007; Gruau, 1994; Buk et al., 2009; Buk, 2009) often lack continuity in the genotype-phenotype mapping, such that small changes to a genotype can cause large changes in its phenotype. For example, using cellular automata (Buk, 2009) or graph-based encodings (Kitano, 1990; Gruau, 1994) to generate connection patterns can produce large networks but violates this continuity condition. HyperNEAT (Gauci and Stanley, 2007), which evolves weight-generating networks using Neuro-Evolution of Augmenting Topologies (NEAT; Stanley and Miikkulainen 2002), provides continuity while changing weights, but adding a node or a connection to the weight-generating network causes a discontinuity in the phenotype space. These discontinuities occur frequently when, e.g., NEAT in HyperNEAT is replaced with expressions constructed by genetic programming (Buk et al., 2009). Furthermore, these representations do not provide an importance ordering on the constituent genes: in the case of graph encodings, for example, one cannot gradually cut off the less important parts of the graph (GP expression, NEAT network) that constructs the phenotype.

Here we present an indirect encoding scheme in which genes represent Fourier series coefficients, and genomes are decoded into weight matrices via an inverse Fourier-type transform. This means that the search is conducted in the frequency domain rather than the weight space (i.e. the spatio-temporal domain). Due to the equivalence between the two, this encoding is both complete and closed: all valid networks can be represented and all representations are valid networks (Kassahun et al., 2007). The encoding also provides continuity (small changes to a frequency coefficient cause small changes to the weight matrix), allows the complexity of the weight matrix to be controlled by the number of coefficients (importance ordering), and makes the size of the genome independent of the size of the network it generates.

The intuition behind this approach is that because real world tasks tend to exhibit strong regularity, the weights near each other in the weight matrix of a successful network will be correlated, and therefore can be represented in the frequency domain by relatively few, low-frequency coefficients. For example, if the input to a network is raw video, it is very likely the input weights corresponding to adjacent pixels will have a similar value. This is the same concept used in lossy image coding where high-frequency coefficients containing very little information are discarded to achieve compression.

This “compressed” encoding was first introduced by Koutník et al. (2010) where a version of practical universal search (Schaul and Schmidhuber, 2010) was used to discover minimal solutions to well-known RL benchmarks. Subsequently (Koutník et al., 2010) it was used with the CoSyNE (Gomez et al., 2008) neuroevolution algorithm where the correlation between weights was restricted to a 2D topology. In this paper, the encoding is generalized to higher dimensional correlations that can potentially better capture the inherent regularities in a given environment, so that fewer coefficients are needed to represent successful networks (i.e. higher compression). The encoding is applied to the scalable octopus arm using a variant of Natural Evolution Strategies (NES; Wierstra et al. 2008), called Separable NES (SNES; Schaul et al. 2011) which is efficient for optimizing high-dimensional problems. Our experiments show that while the task requires networks with thousands of weights, it contains a high degree of redundancy that the frequency domain encoding can exploit to reduce the dimensionality of the search dramatically.

The next section provides a short tutorial on the Fourier transform. Section 3 describes the DCT network encoding and the procedure for decoding the networks. The experimental results appear in section 4, where we show how the compressed network representation can both accelerate learning and provide more robust solutions. Section 5 discusses the main contributions of the paper and provides some ideas for future research.

2 The Fourier Transform

Any periodic function can be uniquely represented by an infinite sum of cosine and sine functions, i.e. its Fourier series:

$$f(t) = \frac{a_0}{2} + \sum_{n=1}^{\infty}\big[a_n \cos(n\omega t) + b_n \sin(n\omega t)\big], \qquad (1)$$

where $t$ is time, $\omega$ is the fundamental frequency, and $n$ indexes the harmonics. The coefficients $a_n$ and $b_n$ specify how much of the corresponding basis function is in $f$, and can be obtained by multiplying both sides of eq. (1) by the sinusoid of the desired band frequency, integrating over one period, and dividing by $\pi$. So for the coefficient, $a_k$, of the cosine with frequency $k$ (taking the period to be $2\pi$, i.e. $\omega = 1$):

$$\int_{-\pi}^{\pi} f(t)\cos(kt)\,dt = \int_{-\pi}^{\pi}\left[\frac{a_0}{2} + \sum_{n=1}^{\infty}\big(a_n\cos(nt) + b_n\sin(nt)\big)\right]\cos(kt)\,dt \qquad (2)$$
$$\int_{-\pi}^{\pi} f(t)\cos(kt)\,dt = a_k \int_{-\pi}^{\pi}\cos(kt)\cos(kt)\,dt \qquad (3)$$
$$a_k = \frac{\int_{-\pi}^{\pi} f(t)\cos(kt)\,dt}{\int_{-\pi}^{\pi}\cos^2(kt)\,dt} \qquad (4)$$
$$a_k = \frac{1}{\pi}\int_{-\pi}^{\pi} f(t)\cos(kt)\,dt \qquad (5)$$

(2) simplifies to (3) because all sinusoidal functions with different frequencies are orthogonal and therefore cancel out, $\int_{-\pi}^{\pi}\cos(nt)\cos(kt)\,dt = 0$ for $n \neq k$, leaving only the frequency of interest, and (4) simplifies to (5) because $\int_{-\pi}^{\pi}\cos^2(kt)\,dt = \pi$.
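As a quick numerical sanity check of eq. (5) (our own illustration, not part of the paper), the following Python snippet recovers the cosine coefficients of a known periodic test signal via the integral formula; the signal and names are arbitrary:

```python
import numpy as np

# Recover cosine coefficients of a 2*pi-periodic signal via eq. (5):
# a_k = (1/pi) * integral_{-pi..pi} f(t) cos(k t) dt
t = np.linspace(-np.pi, np.pi, 100001)
f = 0.5 + 3.0 * np.cos(2 * t) - 1.5 * np.cos(5 * t) + 0.7 * np.sin(3 * t)

for k in (2, 3, 5):
    a_k = np.trapz(f * np.cos(k * t), t) / np.pi
    print(k, round(a_k, 3))   # ~3.0 for k=2, ~0.0 for k=3 (orthogonality), ~-1.5 for k=5
```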

The Fourier series can be extended to complex coefficients:

$$f(t) = \sum_{n=-\infty}^{\infty} c_n\, e^{i n \omega t}. \qquad (6)$$

For a function periodic on an interval of length $L$, the equations become:

$$f(t) = \sum_{n=-\infty}^{\infty} c_n\, e^{2\pi i n t / L}, \qquad c_n = \frac{1}{L}\int_{-L/2}^{L/2} f(t)\, e^{-2\pi i n t / L}\, dt. \qquad (7)$$

The Fourier transform is then a generalization of the complex Fourier series as $L \to \infty$. The discrete $c_n$ is replaced with a continuous $F(k)\,dk$, while $n/L \to k$, and the sum is replaced with an integral:

$$f(t) = \int_{-\infty}^{\infty} F(k)\, e^{2\pi i k t}\, dk, \qquad F(k) = \int_{-\infty}^{\infty} f(t)\, e^{-2\pi i k t}\, dt. \qquad (8)$$

In the case where there are $N$ uniformly-spaced samples $f_n$ of $f$, the discrete Fourier transform (DFT)

$$F_k = \sum_{n=0}^{N-1} f_n\, e^{-2\pi i k n / N} \qquad (9)$$

and the inverse discrete Fourier transform

$$f_n = \frac{1}{N}\sum_{k=0}^{N-1} F_k\, e^{2\pi i k n / N} \qquad (10)$$

are defined.
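The following snippet (ours, for illustration only) checks eqs. (9) and (10) against numpy's FFT, which uses the same unnormalized-forward / 1/N-inverse convention:

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal(16)
N = len(f)
idx = np.arange(N)

F = np.array([np.sum(f * np.exp(-2j * np.pi * k * idx / N)) for k in range(N)])        # eq. (9)
f_rec = np.array([np.sum(F * np.exp(2j * np.pi * idx * m / N)) for m in range(N)]) / N  # eq. (10)

print(np.allclose(F, np.fft.fft(f)))   # True: matches numpy's forward DFT
print(np.allclose(f_rec, f))           # True: the inverse recovers the samples
```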


Figure 1: Decoding the compressed networks. The figure shows the three-step process involved in transforming a genome of frequency-domain coefficients into a recurrent neural network. First, the genome (left) is divided into chromosomes, one for each of the weight matrices specified by the network architecture. Each chromosome is mapped, by Algorithm 1, into a coefficient array of the dimensionality specified by the chosen decoding scheme. In this example, an RNN with two inputs and four neurons is encoded as 8 coefficients distributed over three chromosomes, one each for the input, recurrent, and bias weights. The second step is to apply the inverse DCT to each array to generate the weight values, which are mapped into the weight matrices in the last step.

The most widely used transform in image compression is the discrete cosine transform (DCT), which considers only the real part of the DFT. The DCT is an invertible function that computes a sequence of N coefficients from a sequence of N real numbers. There are four types of DCT based on how the boundary conditions are handled. In this paper, the Type III DCT, DCT(III), is used to transform coefficients into weight matrices. DCT(III) is the inverse of the standard forward DCT(II) used in, e.g., JPEG, and is defined as:

$$w_n = \frac{1}{2}c_0 + \sum_{k=1}^{N-1} c_k \cos\!\left[\frac{\pi}{N}\,k\left(n + \tfrac{1}{2}\right)\right], \qquad n = 0, \ldots, N-1, \qquad (11)$$

where $w_n$ is the $n$-th weight and $c_k$ is the $k$-th frequency coefficient.
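As a sketch of how eq. (11) relates to an off-the-shelf implementation (our own check, not the authors' code; scipy's unnormalized type-III transform equals exactly twice this form):

```python
import numpy as np
from scipy.fft import dct

c = np.array([1.0, -0.5, 0.25, 0.0, 0.1])   # frequency coefficients c_k
N = len(c)
k = np.arange(1, N)

# eq. (11): w_n = c_0/2 + sum_k c_k cos(pi * k * (n + 1/2) / N)
w_manual = np.array([0.5 * c[0] + np.sum(c[1:] * np.cos(np.pi * k * (n + 0.5) / N))
                     for n in range(N)])

# scipy's unnormalized DCT-III is twice this definition, so divide by 2
w_scipy = dct(c, type=3) / 2.0
print(np.allclose(w_manual, w_scipy))   # True
```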

The DCT can be performed on signals of arbitrary dimension by applying a one-dimensional transform along each dimension of the signal. For example, in a 2D image a 1D transform is first applied to the columns and then, a second 1D transform is applied to the rows of the coefficient matrix resulting from the first transform.
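A minimal check of this separability (not from the paper), using scipy:

```python
import numpy as np
from scipy.fft import dct, dctn

rng = np.random.default_rng(1)
img = rng.standard_normal((6, 8))

# apply the 1D transform to the columns, then to the rows ...
by_axes = dct(dct(img, type=2, axis=0), type=2, axis=1)
# ... which matches the multi-dimensional transform applied in one call
at_once = dctn(img, type=2)

print(np.allclose(by_axes, at_once))   # True
```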

When a signal, such as a natural image, is transformed into the frequency domain, the power in the upper frequencies tends to be low (i.e. the corresponding coefficients have small values) since pixel values tend to change gradually across most of the image. Compression can be achieved by discarding these coefficients, meaning fewer bits need to be stored, and replacing them with zeros during decompression. This is the idea behind the network encoding described in the next section: if a problem can be solved by a neural network with smooth weight matrices, then, in the frequency domain, the matrices can be represented using only some of the frequencies, and therefore fewer parameters compared to the number of weights in the network.
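A toy illustration of this point (our own, not the paper's code): a band-limited weight matrix, i.e. one generated from only a few low-frequency coefficients, can be stored exactly with far fewer numbers than it has weights:

```python
import numpy as np
from scipy.fft import dctn, idctn

C_true = np.zeros((32, 32))
C_true[0, 0], C_true[1, 0], C_true[0, 1], C_true[1, 1] = 3.0, -1.0, 0.5, 0.25
W = idctn(C_true, norm='ortho')        # a smooth 32x32 "weight matrix" (1024 weights)

C = dctn(W, norm='ortho')              # back to the frequency domain
kept = np.abs(C) > 1e-9                # only the 4 original coefficients survive
W_hat = idctn(np.where(kept, C, 0.0), norm='ortho')

print(int(kept.sum()), 'of', W.size, 'coefficients')   # 4 of 1024
print(bool(np.allclose(W, W_hat)))                     # True: exact reconstruction
```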

3 DCT Network Representation

Coefficient importance. The coefficients are ordered along the second diagonals in the two-dimensional case depicted here (left). Each diagonal is filled from the edges to the center, starting on the side that corresponds to the longer dimension. The complexity of the weight matrix (right) is controlled by the number of coefficients. The gray-scale levels denote the weight values (black = low, white = high). The more coefficients that are used, the more potentially complex the weight matrix.

Networks are encoded as a string or genome, consisting of substrings or chromosomes of real numbers representing DCT coefficients. The number of chromosomes is determined by the choice of network architecture and by the data structures used to decode the genome, which specify the dimensionality of the coefficient array for each chromosome. The total number of coefficients is user-specified (yielding a compression ratio of weights to coefficients), and the coefficients are distributed evenly over the chromosomes. Which frequencies should be included in the encoding is unknown. The approach taken here restricts the search space to band-limited neural networks, where the power spectrum of the weight matrices goes to zero above a specified limit frequency, and each chromosome contains all frequencies up to that limit.

Figure 1 illustrates the procedure used to decode the genomes. In this example, a fully-recurrent neural network (on the right) is represented by three weight matrices: one for the input layer weights, one for the recurrent weights, and one for the bias weights. The weights in each matrix are generated from a different chromosome, which is mapped into its own coefficient array with the same number of elements as its corresponding weight matrix; in the case shown, 3D arrays are used for both the input and recurrent matrices, and a 2D array for the bias weights.
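To make the pipeline concrete, here is a schematic decoder of our own; the shapes, the even split of coefficients, and the simple row-major fill (instead of the simplex ordering of Algorithm 1) are simplifying assumptions:

```python
import numpy as np
from scipy.fft import idctn

def decode(genome, shapes):
    """genome: 1D array of DCT coefficients; shapes: one tuple per weight array."""
    chromosomes = np.array_split(genome, len(shapes))    # one chromosome per matrix
    weights = []
    for chrom, shape in zip(chromosomes, shapes):
        coeffs = np.zeros(shape)
        coeffs.flat[:len(chrom)] = chrom                 # low frequencies first (simplified)
        weights.append(idctn(coeffs))                    # inverse DCT -> weight values
    return weights

# hypothetical RNN with 2 inputs and 4 neurons: input, recurrent, and bias arrays
W_in, W_rec, b = decode(np.random.randn(8), [(4, 2), (4, 4), (4,)])
print(W_in.shape, W_rec.shape, b.shape)                  # (4, 2) (4, 4) (4,)
```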

Figure 2: Mapping the coefficients. The cuboidal array (left) is filled with the coefficients from the chromosome one simplex at a time, according to Algorithm 1, starting at the origin and moving toward the opposite corner.

In previous work (Koutník et al., 2010), the coefficient matrices were 2D, where the simplexes are just the secondary diagonals; starting in the top-left corner, each diagonal is filled alternately starting from its corners (see figure 3). However, if the task exhibits inherent structure that cannot be captured by low frequencies in a 2D layout, more compression can potentially be gained by organizing the coefficients in higher-dimensional arrays.

Each chromosome is mapped to its coefficient array according to Algorithm 1 (figure 2), which takes a list of array dimension sizes and the chromosome, and creates a total ordering on the array elements. In the first loop, the array is partitioned into simplexes, where each simplex contains only those elements whose Cartesian coordinates sum to the same integer. The elements of each simplex are ordered in the while loop according to their distance to the corner points (i.e. those points having exactly one non-zero coordinate; see the example points for a 3D array in figure 2), which form the rows of a matrix sorted in descending order by their sole non-zero dimension size. In each loop iteration, the coordinates of the element with the smallest Euclidean distance to the selected corner are appended to the ordering and removed from the simplex; the loop terminates when the simplex is empty. After all of the simplexes have been traversed, the resulting vector holds the ordered element coordinates. In the final loop, the array is filled with the coefficients, from low to high frequency, at the positions given by this ordering; the remaining positions are filled with zeros. Finally, an inverse DCT of the appropriate dimensionality is applied to the array to generate the weight values, which are mapped to their positions in the corresponding 2D weight matrix. Once all chromosomes have been transformed, the network is complete.
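The following sketch implements one reading of Algorithm 1 (ours, since the pseudocode itself is not reproduced here): elements are grouped into simplexes by coordinate sum and, within each simplex, taken by cycling over the corner points (sorted by dimension size, descending) and greedily picking the remaining element closest to the current corner:

```python
from itertools import product, cycle

def coefficient_order(dims):
    """Return array positions ordered from low to high frequency (one reading of Algorithm 1)."""
    # corner points: one non-zero coordinate each, sorted by dimension size, descending
    corners = [tuple((d - 1) if i == j else 0 for i in range(len(dims)))
               for j, d in sorted(enumerate(dims), key=lambda jd: -jd[1])]
    elements = list(product(*[range(d) for d in dims]))
    order = []
    for s in range(sum(d - 1 for d in dims) + 1):        # simplexes: coordinate sums 0, 1, ...
        simplex = [e for e in elements if sum(e) == s]
        corner_cycle = cycle(corners)
        while simplex:                                   # fill alternately from the corners
            c = next(corner_cycle)
            nearest = min(simplex, key=lambda e: sum((a - b) ** 2 for a, b in zip(e, c)))
            order.append(nearest)
            simplex.remove(nearest)
    return order

print(coefficient_order((3, 2)))
# [(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (2, 1)] under this reading
```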

Figure 3: Fully-connected recurrent neural network representation. A single-chromosome genome is shown decoded into three different networks. The genome is first mapped into a coefficient matrix, which is transformed into a weight matrix via the 2D inverse DCT. The right column shows the resulting networks corresponding to each matrix. Note that the size of the network is independent of the genome length. The squares denote input units; the circles are neurons; arrow thickness denotes the magnitude of a connection weight and its color the polarity (black = positive, red = negative).

The DCT network representation is not restricted to a specific class of networks; most conventional perceptron-type neural networks can be represented as a special case of a fully-connected recurrent neural network (FRNN). This architecture is general enough to represent, e.g., feed-forward and Jordan/Elman networks, since they are just sub-graphs of the FRNN.
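For reference, a single time step of such an FRNN might look as follows (our sketch; the tanh activation is an assumption, not something specified in the text):

```python
import numpy as np

def frnn_step(x, h, W_in, W_rec, b):
    """x: input vector, h: previous neuron activations; returns the new activations."""
    return np.tanh(W_in @ x + W_rec @ h + b)

rng = np.random.default_rng(0)
W_in, W_rec, b = rng.standard_normal((4, 2)), rng.standard_normal((4, 4)), rng.standard_normal(4)
h = np.zeros(4)
for x in rng.standard_normal((5, 2)):      # run the network for 5 time steps
    h = frnn_step(x, h, W_in, W_rec, b)
print(h)                                    # activations of the 4 neurons
```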

4 Experiments

The compressed weight-space encoding was tested by evolving neural network controllers for the octopus arm problem, introduced by Yekutieli et al. (2005); this task has been used in past reinforcement learning competitions (http://rl-competition.org). The octopus arm was chosen because its complexity can be scaled by increasing the arm length.

4.1 Octopus-Arm Task

The octopus arm (see the figure in section 4.2) consists of c compartments floating in a 2D water environment. Each compartment has a constant volume and contains three controllable muscles (dorsal, transverse, and ventral). The state of a compartment is described by the x,y-coordinates of two of its corners plus their corresponding x and y velocities. Together with the arm base rotation, the arm has 8c + 2 state variables and 3c + 2 control variables. The goal of the task is to reach a goal position with the tip of the arm, starting from three different initial positions, by contracting the appropriate muscles at each step of simulated time. While initial positions 2 and 3 look symmetrical, they are actually quite different due to gravity.

The number of control variables is typically reduced by aggregating them into "meta"-actions: contraction of all dorsal, all transverse, or all ventral muscles in the first half of the arm (actions 1, 2, 3) or the second half (actions 4, 5, 6), plus rotation of the base in either direction (actions 7, 8). In the experiments, both meta-actions and raw actions are used.

Figure 4: Neural network architectures. The meta-action architecture consists of fully-connected neurons that control the arm through the eight meta-actions. The network is connected to 8c + 2 inputs, where c stands for the number of compartments (e.g. 82 inputs for c = 10). The network for raw actions has 3c + 2 neurons (32 in the case of 10 compartments), organized in a grid.

4.2 Neural Network Architectures

Networks were evolved to control a 10-compartment arm using two different fully-connected recurrent neural network architectures: a meta-action architecture, with 8 neurons controlling the meta-actions, and a raw-action architecture, having one neuron for each primitive, non-aggregated (raw) action (see figure 4). The meta-action architecture has an 8 x 82 input weight matrix, an 8 x 8 recurrent weight matrix, and a bias vector of length 8, for a total of 728 weights. The raw-action architecture has a 32 x 82 input weight matrix, a 32 x 32 recurrent weight matrix, and a bias vector of length 32, for a total of 3680 weights.

The following three schemes were used to map the genomes into the coefficient arrays (see figure 5):

  1. Scheme 1: the genome is mapped into a single 2D matrix, the inverse DCT is performed, and the matrix is split into an n x (8c + 2) matrix of input weights, an n x n weight matrix of recurrent connections, and a bias vector of length n, where n is the number of neurons in the network and c is the number of arm compartments.

  2. Scheme 2: the genome is partitioned into three chromosomes, mapped into three arrays: (1) a 3D array of input weights (compartments x neurons x state variables, where 8 is the number of state variables per compartment), (2) an n x n array for the recurrent weights of the neurons controlling the meta-actions, and (3) a bias vector of length n.

  3. Scheme 3: the genome is partitioned into three chromosomes, mapped into three arrays: (1) a 4D array that contains the input weights for a grid of neurons, one for each raw action, (2) a 4D recurrent weight array, and (3) a bias array. The dimension of size 3 in these arrays refers to the number of muscles per compartment.

Schemes 1 and 2 were used to generate the meta-action networks; schemes 1 and 3 were used to generate the raw-action networks. Coefficient arrays are filled using Algorithm 1, and weights for each compartment are placed next to the weights for the adjacent compartments in the physical arm.

Scheme 1 was used by Koutník et al. (2010) and is included here for the purpose of comparison. It is the simplest mapping, forcing a single set of coefficients (one chromosome) to represent all of the network weight matrices. Scheme 2 tries to capture 3D correlations between input weights, so that fewer coefficients may be required to represent the similarity between weights with similar function (i.e. affecting state variables near each other on the arm), not only within a given arm compartment (as in scheme 1), but also across compartments; the input, recurrent, and bias weights are compressed separately. Scheme 3 arranges the weights such that correlations among all four dimensions that uniquely specify a weight can be exploited. For example, this data structure places next to each other input weights affecting: muscles with the same function in adjacent compartments, muscles in the same compartment with different functions, the same muscle in adjacent compartments, etc.

Octopus arm: a flexible arm consisting of compartments, each with 3 muscles, must be controlled to touch a goal location with the arm tip from different initial positions. Initial positions 1, 2, and 3 are used for training; two additional positions were used for the generalization tests in section 4.5.1.


Figure 5: Coefficient mappings. The coefficients are mapped into the two network architectures (with 8 and 32 neurons, respectively) using three mappings: scheme 1 maps coefficients into a single 2D matrix, which is then split into the input matrix, the matrix of recurrent connections, and a bias vector. Alternatively, using scheme 2, the input array can be three-dimensional (compartments x neurons x state variables) to respect the geometrical constraints of the input space. The network that controls raw actions is decoded after the coefficients are mapped (with scheme 3) into two four-dimensional arrays, from which the input and recurrent weights are decoded. In the case of schemes 2 and 3, the coefficient arrays are larger than the number of weights, and some of the coefficients are unused, as denoted in the figure.

4.3 Setup

Indirectly encoded networks were evolved both with fixed numbers of coefficients and using the incremental procedure described below, for four configurations: the meta-action architecture decoded with schemes 1 and 2, and the raw-action architecture decoded with schemes 1 and 3. Each (compression ratio, configuration) setup consisted of 20 runs. For comparison, directly encoded networks were also evolved, in which the genomes explicitly encode the weights, for a total of 728 and 3680 genes (weights) for the meta-action and raw-action architectures, respectively.

Networks were evolved using Separable Natural Evolution Strategies (SNES; Sun et al., 2011), an efficient variant of the NES (Wierstra et al., 2008) family of black-box optimization algorithms. In each generation the algorithm samples a population of individuals, computes a Monte Carlo estimate of the fitness gradient, transforms it into the natural gradient, and updates the search distribution, parameterized by a mean vector and a covariance matrix. Adapting the full covariance matrix is costly because it requires computing the matrix exponential, which becomes intractable for large problems (e.g. more than 1000 parameters, be they network weights or DCT coefficients). SNES avoids this by restricting the class of search distributions to Gaussians with a diagonal covariance matrix, so that the search is performed in a predefined coordinate system. This restriction makes SNES scale linearly with the problem dimension (see Wierstra et al. (2008) for a full description of NES).
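A compact SNES sketch (our own, with the hyper-parameter heuristics commonly used in the NES literature rather than values quoted from this paper) is given below; `f` is the fitness function to maximize:

```python
import numpy as np

def snes(f, dim, iterations=1000, seed=0):
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)          # diagonal Gaussian search distribution
    lam = 4 + int(3 * np.log(dim))                   # population-size heuristic
    eta_mu, eta_sigma = 1.0, (3 + np.log(dim)) / (5 * np.sqrt(dim))
    ranks = np.arange(1, lam + 1)                    # rank-based utilities (fitness shaping)
    u = np.maximum(0.0, np.log(lam / 2 + 1) - np.log(ranks))
    u = u / u.sum() - 1.0 / lam
    for _ in range(iterations):
        s = rng.standard_normal((lam, dim))          # samples in normalized coordinates
        z = mu + sigma * s                           # candidate solutions
        s = s[np.argsort([-f(zi) for zi in z])]      # sort samples, best first
        grad_mu = u @ s                              # Monte Carlo natural-gradient estimates
        grad_sigma = u @ (s ** 2 - 1.0)
        mu = mu + eta_mu * sigma * grad_mu
        sigma = sigma * np.exp(0.5 * eta_sigma * grad_sigma)
    return mu

best = snes(lambda w: -np.sum((w - 3.0) ** 2), dim=10)   # toy quadratic with optimum at 3
print(np.round(best, 2))                                  # converges toward 3 in every dimension
```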

The population size is calculated from the number of coefficients being evolved, and the learning rates are set accordingly. Each SNES run is limited to a fixed number of fitness evaluations.

The fitness was computed as the average of the following score over three trials:

$$\text{score} = \frac{1}{2}\left(\frac{T - t}{T} + \frac{D - d}{D}\right), \qquad (12)$$

where $t$ is the number of time steps before the arm touches the goal, $T$ is the number of time steps in a trial, $d$ is the final distance of the arm tip to the goal, and $D$ is the initial distance of the arm tip to the goal. Each of the three trials starts with the arm in a different configuration (see the figure in section 4.2). This fitness measure differs from the one used by Woolley and Stanley (2010), because minimizing the integrated distance of the arm tip to the goal causes greedy behaviors. In the viscous fluid environment of the octopus arm, a greedy strategy using the shortest-length trajectory does not lead to the fastest movement: the arm has to be compressed first and then stretched in the appropriate direction. Our fitness function favors behaviors that reach the goal within a small number of time steps.

In all of the experiments described so far, the encoding stays fixed throughout the evolutionary run, and therefore depends on correctly guessing the best number of coefficients. In an attempt to determine the best number of coefficients automatically, a set of 20 simulations was run, using one of the configurations described above, in which the networks are initially encoded by 10 coefficients and the number of coefficients is then incremented by 10 every 6000 evaluations. If the performance does not improve after successive coefficient additions, the algorithm stops and the best number of coefficients is reported. Adding a coefficient to the network encoding means adding one dimension to the mean and covariance vectors of the SNES search distribution.

When coefficients are added, the complexity of all weight matrices increases. For example, a genome's coefficients are distributed over its chromosomes, and additional coefficients are then appended one at a time, cycling through the chromosomes starting with the shortest, until all 10 new coefficients have been added. If a chromosome reaches a length equal to the number of weights in its corresponding weight matrix, it cannot take on any more coefficients, and any additional coefficients are distributed in the same way over the other chromosomes.
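A small sketch of this bookkeeping (our own reading of the scheme; the names and the example sizes are hypothetical):

```python
def add_coefficients(chrom_lengths, matrix_sizes, n_new):
    """Append n_new coefficients, one at a time, to the shortest chromosome that still has room."""
    lengths = list(chrom_lengths)
    for _ in range(n_new):
        open_idx = [j for j, (l, m) in enumerate(zip(lengths, matrix_sizes)) if l < m]
        if not open_idx:               # every chromosome already matches its weight matrix
            break
        j = min(open_idx, key=lambda j: lengths[j])
        lengths[j] += 1
    return lengths

# three chromosomes with 4, 3 and 3 coefficients; weight matrices of 64, 16 and 8 weights
print(add_coefficients([4, 3, 3], [64, 16, 8], 10))   # -> [7, 7, 6]
```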

In most tasks, not all input or control variables can be organized in such a way (the base rotation in the octopus arm task, for example). In such cases, one can either use a separate weight array, or place the weights together in a larger array and decode them separately; some of the values that result from the inverse DCT are then simply unused.

4.4 Results


Figure 6: Performance results. The three log-log plots show the best fitness at each generation (averaged over 20 runs), for each encoding for a given configuration.


Figure 7: Performance results. The bar graph shows the number of evaluations required on average to reach a fitness of 0.75 for each set of experiments. For the meta-action networks, scheme 1 converges faster than the direct encoding, especially for the more compressed nets, while scheme 2 provides no advantage. The advantage of scheme 3 is clear in the case of raw action control, where some of the other encodings did not reach an average fitness of 0.75 even when up to 80 coefficients were used. For raw actions, networks represented by just 20 coefficients outperform the direct encoding both in terms of learning speed and final fitness.


Figure 8: Weight matrix visualization. Each group of images shows typical evolved weight matrices for each configuration. Each row consists of an input matrix (left), recurrent matrix (center), and bias vector (right). Colors indicate weight value: blue = large, positive; orange = large, negative. For the meta-action networks, high fitness can be achieved with very simple matrices in which the weight values change smoothly (are highly correlated), compared to the direct approach (bottom). The 4D arrays used by scheme 3 allow regularities inherent in raw-action control to be captured by as few as 20 coefficients.

Figure 6 summarizes the experimental results. Each of the three log-log plots shows the performance of each encoding for one of the three configurations; each curve denotes the best fitness in each generation (averaged over 20 runs). The bar graph (figure 7) shows the number of evaluations required on average for each setup to reach a fitness of 0.75.

For the meta-action architecture, controllers encoded indirectly by 40 coefficients or fewer reach high fitness more quickly than the directly encoded controllers. However, the final fitness after 6000 evaluations is similar across all encodings. Because the networks are relatively small (728 weights) when meta-actions are used, direct search in weight space is still efficient. When the meta-action architecture is decoded using scheme 2, the advantage of the indirect encoding is, surprisingly, lost: while the 3D coefficient input array would seem to offer higher compression, the number of coefficients required to properly set the weights in this structure is so close to the number of weights in the network that nothing is gained.

For raw action control, where the networks now have 3680 weights, the simple scheme 1 again works well, converging 60% faster while using only a small fraction as many parameters as the direct encoding. However, much higher compression comes from scheme 4, the 4D mapping, where correlations in all four dimensions of the arm can be captured. The direct encoding only outperforms the settings that do not offer enough complexity to represent successful weight matrices. With just 20 DCT coefficients, the compression ratio approaches 200:1, and a fitness of 0.75 is reached in far fewer evaluations, many times faster than with the direct encoding.

Figure 8 shows examples of weight matrices evolved for the two most successful configurations. Notice how regular the weight values are compared to the directly encoded networks. The evolved controllers exhibit quite natural-looking behavior (see http://www.idsia.ch/~koutnik/images/octopus.mp4 for a video demonstration). For example, when starting with the arm hanging down (initial state 3), the controller employs a whip-like motion to stretch the arm tip toward the goal and overcome gravity and the resistance of the fluid environment (figure 12).

Figure 9: Incremental coefficient search. The box-plot shows the median, maximum, minimum, and 25%-75% quantile fitness (over 20 runs) achieved for a given number of coefficients in the incremental evolution of the networks. Beyond a certain median number of coefficients, adding more coefficients does not improve the solution.

Figure 9 contains box-plots showing the median, maximum, and minimum fitness (out of 20 independent runs) found during the progress of the incremental coefficient evolution. With the initial 10 coefficients the runs reach a moderate median fitness, but with very high variance. As coefficients are added, the median improves to a peak and the variance narrows to a minimum.

4.5 Generalization

Figure 10: Generalization: different starting positions. Controllers encoded indirectly with increasing numbers of coefficients (box-plots) are compared to directly encoded controllers (horizontal lines). Data points are the median of 20 runs, the boxes indicate the lower and upper quartiles, and the bars the minimum and maximum values.

In this section the best controllers from the two most successful indirect encodings are tested in two ways, to measure both the generality of the evolved behavior and that of the underlying frequency-based representation.

4.5.1 Different Starting Positions

Controllers were re-evaluated on the task using two new starting positions, with the arm oriented in new directions, instead of the three positions used during evolution (see the figure in section 4.2). Figure 10 shows the results of this test, comparing directly and indirectly encoded controllers. Each data point is the median fitness of the best controller from each of the 20 runs for a given number of coefficients; the boxes indicate the upper/lower quartiles and the bars the min/max values. The solid straight line is the median fitness of the directly encoded controllers, and the dashed lines correspond to the upper/lower quartiles. For one of the two indirect encodings the generalization is comparable to that of the direct encoding, but with significantly lower variance, and networks produced by the other generalize better than the directly encoded nets, again with lower variance; the best configuration very consistently performs nearly as well as on the original three starting positions. The networks with lower compression better capture the general behavior required to reach the goal from new starting positions.

Figure 11: Generalization: changing arm length. The best network in the final population of each evolutionary run is tested on an arm having from 3 to 20 compartments. The surface plots show the difference between the indirect and direct encoding for each compression level and number of compartments for (a) meta-actions, and (b) raw actions. The surface elevation above the "water" indicates the degree to which the indirect encoding generalized better than the direct encoding.

4.5.2 Different Arm Lengths

In this test, the arm length is changed from the 10 compartments used in training to between 3 and 20. Different arm lengths mean different numbers of inputs, and consequently require different-sized weight matrices. For the DCT-encoded nets, the size of the network is independent of the number of coefficients, so different arm lengths can be accommodated by modifying the size of the coefficient matrix appropriately (see figure 3). However, for the directly encoded nets, there is no straightforward way to add or remove network structure meaningfully.

In order to be able to compare direct and indirect nets, the direct nets were transformed into the frequency domain by reversing the procedure depicted in figure 1. First, the network weights are mapped to the appropriate positions in the correct number of multi-dimensional arrays. The forward DCT is applied to each array, and the network is then "re-generated" at the appropriate size for the specified arm by adjusting the size of the coefficient matrix (padding with zeros if the matrix is enlarged) and applying the inverse DCT.
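A sketch of this resizing step (our own; the matrix shapes are arbitrary examples, not the task's actual dimensions):

```python
import numpy as np
from scipy.fft import dctn, idctn

def resize_weights(W, new_shape):
    """Resize a weight matrix through the frequency domain: crop or zero-pad the coefficients."""
    C = dctn(W, norm='ortho')
    C_new = np.zeros(new_shape)
    shared = tuple(slice(0, min(a, b)) for a, b in zip(W.shape, new_shape))
    C_new[shared] = C[shared]              # keep the shared low-frequency block
    return idctn(C_new, norm='ortho')

W = np.random.randn(8, 40)                 # a hypothetical 8 x 40 input weight matrix
print(resize_weights(W, (8, 24)).shape)    # shrink the input dimension -> (8, 24)
print(resize_weights(W, (8, 64)).shape)    # grow it (zero-padded)      -> (8, 64)
```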

The best network from each run was re-evaluated on each of the arm lengths (3-20). The number of time steps allowed to control arms longer than 10 compartments was increased linearly with the number of compartments. The closest position of the arm tip and the time step at which the goal was reached were used to compute the fitness. Arms that moved the arm tip further away from the goal were assigned zero fitness, because their closest position (which is in fact the initial arm position) was reached in zero time.

The results of this test are summarized in figure 11. The surface plots show the difference between the indirect and direct encoding for each compression level and number of compartments for (a) meta-actions, and (b) raw actions. The elevation of the surface above the plane indicates how much better or worse the indirect encoding is at generalizing to different arm lengths than the direct encoding (with networks resized as described above). While the convergence speed of the indirect and direct encodings was very similar for meta-actions (figure 6), the indirect encodings are less sensitive to changes in the network size. The deep trough at 10 compartments in graph (a) is due to the fact that, for this arm length (the same as used to evolve the nets), the direct encoding is slightly better on average than the indirect encoding (see the final fitness in figure 6), but cannot generalize well to even small changes to the arm length; the directly encoded solutions are overspecialized.

As with the test in section 4.5.1, the best generalization performance is obtained with a moderate number of coefficients for both architectures. For larger numbers of coefficients, the generalization declines gradually for arm lengths around 10, and more rapidly for shorter arms.

Figure 12: Octopus arm visualization. Visualization of the behavior of one of the successful controllers, encoded with 40 coefficients. The motion starts from one of the three initial states (a, b, and c). The arm base (depicted with a cross) is fixed. In the last phase, the goal is plotted as a disc. The controller uses a whip-like motion to overcome the friction of the environment. This sequence of snapshots was captured from the video available at http://www.idsia.ch/~koutnik/images/octopus.mp4.
Figure 13: Generalization visualization: changing arm length (raw actions). Behavior of the arm controlled by one of the networks trained to control the 10-compartment arm. The network scales well (except for a couple of arm lengths) and produces a smooth behavior transition for arms having from 3 to 20 compartments. The movement starts to differ for the longest arms, and the generalization performance degrades.

5 Discussion and Future Work

The experimental results revealed that searching in the “compressed” space of Fourier coefficients can improve search efficiency over the standard, direct search in weight space. The frequency domain representation exploits the correlation between weight values that are spatially proximal in the weight matrix, thereby reducing the number of parameters required to encode successful networks. Both fixed and incremental search in coefficient space discovered solutions that required an order of magnitude fewer parameters than the direct encoding for the octopus arm task, and a similar improvement in learning speed. Perhaps more importantly, it also produced controllers that were more general with respect to initial states, and more robust to changes in the environment (the arm length). This supports the idea that band-limited networks are in some sense simpler, and therefore less prone to overfitting.

The choice of encoding scheme proved decisive in determining the amount of compression attainable for the two network architectures. There are many possible ways to organize the coefficients as input to the decompressor (inverse DCT), but the fact that even the most naive approach, scheme 1, where one set of coefficients is used to represent all of the weight matrices, worked well is encouraging. The slightly more complex scheme 2 illustrates how adding higher-dimensional correlations does not necessarily lead to better compression.

So, how should a good mapping be chosen? A useful default strategy may be to first identify the high-level dimensions of the environment that partition the weights qualitatively (e.g. for input weights: the compartment from which a connection originates, the compartment where it terminates, the muscle it affects, and which of the eight state variables it is associated with), and assume that these dimensions are all correlated by arranging the coefficients in data structures with the same number of dimensions, as was done in scheme 3. This strategy, though the most complex, yielded by far the most compression, with solutions having thousands of weights being discovered by searching a space of only 20 coefficients.

It might be possible to achieve even higher compression by switching to a different basis altogether, such as Gaussian kernels (Glasmachers et al., 2011) or wavelets. One potential limitation of a Fourier-type basis is that if the frequency content needs to vary across the matrix, many coefficients will be required to represent it; this is the reason for using multiple chromosomes per genome in our experiments. In contrast, wavelets are designed to deal with this kind of spatial locality, and could therefore provide higher compression by allowing all network matrices to be represented compactly by a single set of coefficients; for example, a simple scheme like scheme 1 could possibly compress as well as scheme 3 while requiring less domain knowledge.

In the current implementation, the network topology (number of neurons) is simply specified by the user. However, given that the size of the weight matrices is independent of the number of coefficients, it may be possible to optimize the topology by decoding genomes into networks whose size is drawn from a probability mass function that is updated each generation according to the relative performance of each topology. Future work will proceed in this direction, not only to search for parsimonious representations of large networks, but also to determine their complexity.

Acknowledgment

This work was supported by the SNF grants 200020-125038/1 and 200020-140399/1.

References

  • Buk (2009) Buk, Z. (2009). High-dimensional cellular automata for neural network representation. In International Mathematica User Conference 2009, Champaign, Illinois, USA.
  • Buk et al. (2009) Buk, Z., Koutník, J., and Šnorek, M. (2009). NEAT in HyperNEAT substituted with genetic programming. In International Conference on Adaptive and Natural Computing Algorithms (ICANNGA 2009).
  • Gauci and Stanley (2007) Gauci, J. and Stanley, K. (2007). Generating large-scale neural networks through discovering geometric regularities. In Proceedings of the Conference on Genetic and Evolutionary Computation, pages 997–1004, New York, NY, USA. ACM.
  • Glasmachers et al. (2011) Glasmachers, T., Koutník, J., and Schmidhuber, J. (2011). Kernel Representations for Evolving Continuous Functions. Evolutionary Intelligence. To appear.
  • Gomez et al. (2008) Gomez, F., Schmidhuber, J., and Miikkulainen, R. (2008). Accelerated neural evolution through cooperatively coevolved synapses. Journal of Machine Learning Research, 9(May):937–965.
  • Gruau (1994) Gruau, F. (1994). Neural Network Synthesis using Cellular Encoding and the Genetic Algorithm. PhD thesis, l’Universite Claude Bernard-Lyon 1, France.
  • Kassahun et al. (2007) Kassahun, Y., Edgington, M., Metzen, J. H., Sommer, G., and Kirchner, F. (2007). A common genetic encoding for both direct and indirect encodings of networks. In Proceedings of the Conference on Genetic and Evolutionary Computation (GECCO-07), pages 1029–1036, New York, NY, USA. ACM.
  • Kitano (1990) Kitano, H. (1990). Designing neural networks using genetic algorithms with graph generation system. Complex Systems, 4:461–476.
  • Koutník et al. (2010) Koutník, J., Gomez, F., and Schmidhuber, J. (2010). Evolving neural networks in compressed weight space. In Proceedings of the Conference on Genetic and Evolutionary Computation (GECCO-10).
  • Koutník et al. (2010) Koutník, J., Gomez, F., and Schmidhuber, J. (2010). Searching for minimal neural networks in fourier space. In Proceedings of the 4th Annual Conference on Artificial General Intelligence.
  • Li and Vitányi (1997) Li, M. and Vitányi, P. M. B. (1997). An Introduction to Kolmogorov Complexity and its Applications (2nd edition). Springer.
  • Schaul et al. (2011) Schaul, T., Glasmachers, T., and Schmidhuber, J. (2011). High dimensions and heavy tails for natural evolution strategies. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2011, Dublin).
  • Schaul and Schmidhuber (2010) Schaul, T. and Schmidhuber, J. (2010). Towards practical universal search. In Proceedings of the Third Conference on Artificial General Intelligence, Lugano.
  • Schmidhuber (1995) Schmidhuber, J. (1995). Discovering solutions with low Kolmogorov complexity and high generalization capability. In Prieditis, A. and Russell, S., editors, Proceedings of the Twelfth International Conference on Machine Learning (ICML), pages 488–496. Morgan Kaufmann Publishers, San Francisco, CA.
  • Schmidhuber (1997) Schmidhuber, J. (1997). Discovering neural nets with low Kolmogorov complexity and high generalization capability. Neural Networks, 10(5):857–873.
  • Stanley and Miikkulainen (2002) Stanley, K. O. and Miikkulainen, R. (2002). Evolving neural networks through augmenting topologies. Evolutionary Computation, 10:99–127.
  • Sun et al. (2011) Sun, Y., Gomez, F., Schaul, T., and Schmidhuber, J. (2011). A linear time natural evolution strategy for non-separable functions. Technical report, arXiv:1106.1998v2.
  • Wierstra et al. (2008) Wierstra, D., Schaul, T., Peters, J., and Schmidhuber, J. (2008). Natural Evolution Strategies. In Proceedings of the Congress on Evolutionary Computation (CEC08), Hongkong. IEEE Press.
  • Woolley and Stanley (2010) Woolley, B. G. and Stanley, K. O. (2010). Evolving a single scalable controller for an octopus arm with a variable number of segments. In Schaefer, R., Cotta, C., Kolodziej, J., and Rudolph, G., editors, PPSN (2), volume 6239 of Lecture Notes in Computer Science, pages 270–279. Springer.
  • Yekutieli et al. (2005) Yekutieli, Y., Sagiv-Zohar, R., Aharonov, R., Engel, Y., Hochner, B., and Flash, T. (2005). A dynamic model of the octopus arm. I. Biomechanics of the octopus reaching movement. Journal of Neurophysiology, 94(2):1443–1458.