1 Introduction
In nature, both the morphology and the behaviour of a species crucially shape its physical interactions with the environment [3]. For example, the diversity in animal locomotion styles is an immediate result of the interplay between different body structures, e.g., different numbers, compositions and shapes of limbs, and different neuromuscular controls, e.g., different sensory-motor loops and neural periodic patterns. Adaptation of a species to new ecological opportunities often comes with changes to both body shape and control signals: morphology and behaviour are coadapted. Building upon this insight, we investigate in this paper a methodology for coadaptation of the morphology and behaviour of computational agents using deep reinforcement learning. Without loss of generality, we focus in particular on legged locomotion. The goal of legged robots in such locomotion tasks is to transform as much electric energy as possible into directional movement [19, 22, 1, 13]. To this end, two approaches exist: 1) optimization of the behavioural policy, and 2) optimization of the robot design, which affects the achievable locomotion efficiency [23, 21, 19, 18]. Policy optimization is, especially in novel or changing environments, often performed using reinforcement learning [7, 18]. Design optimization is frequently based on evolutionary or evolution-inspired algorithms, which use a population of design prototypes for this process (Fig. 0(a)) [23, 19, 5]. However, manufacturing and evaluating a large quantity of design candidates is often infeasible in the real world due to cost and time constraints, especially for larger robots. Therefore, the evaluation of designs is often restricted to simulation, which is feasible but suffers from the simulation-to-reality gap [25, 14]. Designs and control policies optimized in simulation are often not the best possible choice for the real world, especially if the robotic system is complex and the environmental parameters are hard to model. For example, in the work of Lipson and Pollack [17], designs were first optimized in simulation in an evolutionary manner and then manufactured in the real world. However, the performance of the manufactured designs in the real world was significantly lower than in simulation in all but one case (see Table 1 in [17]), even though efforts were undertaken to close the simulation-to-reality gap for the described robot.
The method proposed in this work caters to the need of roboticists for data efficiency with respect to the number of prototypes required to achieve an optimal design. We combine design optimization and reinforcement learning in such a way that the reinforcement learning process provides us with an objective function for the design optimization process (Fig. 0(b)). This eliminates the need for a population of prototypes and requires only one functioning prototype at a time.
2 Related Work
The work of Schaff et al. [21] is a relatively recent approach to combine reinforcement learning and design optimization into one framework. The common idea is to consider the design parameter as an additional input to the policy and to optimize the expected reward given the policy and design. The policy is trained such that it is able to generalize over many designs and is iteratively updated with experience collected from a population of prototypes. The algorithm maintains a distribution over designs, whose parameters are optimized to maximize the expected reward. However, this approach [21] requires the maintenance of a population of designs, which is updated periodically, and relies on the simulator to compute the fitness of designs. Similarly, the work of David Ha [9] uses the design parameters as input to the policy but uses REINFORCE [24] to update the design parameters. Again, this approach requires a population of design prototypes to compute the introduced population-based policy gradient for the design, as well as rewards collected from the simulator. The recent method introduced by Liao et al. [16] employs Batch Bayesian Optimization to improve morphology and policies. Here, the expected performance of designs is learned and inferred by a Gaussian Process (GP); a second GP is used to optimize the parameters of central pattern generators representing movement policies. The paper demonstrates the design optimization of a simulated micro-robot with three parameters defining the morphology. While the presented results use a prototype population of 5 designs, the authors mention that the proposed method can handle a single prototype as well. One drawback of [16] is, however, that the GP predicting the fitness of designs is trained with only a single value per design: the highest reward achieved for that design. Since the maximum reward is potentially affected by the initial state a robot is in, this approach has reduced applicability to tasks with noisy or random start states.
In [19], the leg lengths and controller of a quadruped robot were optimized in the real world. The controller was based on the inverse kinematics of the robot and defined by tuning eight parameters. All leg segment lengths were described by a two-dimensional design vector. Two different evolutionary algorithms were used to optimize these parameters over eight generations with a population size of eight, based on the reward received. While this experiment is an impressive demonstration of the potential of adapting behaviour and morphology in the real world, the task was simplified through the use of a reconfigurable robot which is able to adapt its leg lengths automatically. This decreases the setup time required between experiments because manufacturing of leg segments or other body parts is not necessary. All four of these approaches either rely on a population of design prototypes whose performance must be evaluated in simulation or the real world, or rely on a single reward.
3 Problem Statement
We formalize the problem of coadapting morphology and behavior as the optimization
\[ \max_{\xi,\, \pi} \; R(\xi, \pi) \tag{1} \]
of the reward $R$ w.r.t. the variables $\xi$ and $\pi$, where $\xi$ are the morphological properties of the agent and $\pi$ is the behavior. There are multiple ways to tackle this problem. One commonly used way is to decompose it as a bi-level optimization, where we iteratively optimize the morphology $\xi$ first and, after fixing it, optimize the behavior $\pi$. One advantage of this formulation is that, by decoupling the two optimizations, we can take into consideration the fact that evaluating different morphologies has an associated cost (e.g., manufacturing a physical robot) which can be substantially higher than that of evaluating different behaviors (e.g., running multiple controllers). In this paper, we frame the learning of the behaviors as an extension of the standard Markov decision process (MDP) [2] given the additional design variable $\xi$ (i.e., the context). In this model, the transition probability $p(s_{t+1} \mid s_t, a_t, \xi)$ to reach a state $s_{t+1}$ after performing action $a_t$ depends on the design properties $\xi$ of the agent. The reward function $r(s_t, a_t, \xi)$ can be dependent on the design as well. For notational clarity, we will generally use $r(s_t, a_t)$ in the remainder of the paper. The actions are generated from the policy $\pi(s_t, \xi)$ and the goal is to maximize the expected future reward given by
\[ \mathbb{E}_{\pi}\left[ \sum_{i=0}^{\infty} \gamma^{i}\, r(s_{t+i}, a_{t+i}) \right] \tag{2} \]
with $\gamma$ being a discount factor and future states $s_{t+i+1}$ produced by the transition function. Our goal is hence to maximize this objective function for both the policy $\pi$ and the design $\xi$ using deep reinforcement learning.
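As a small illustration of the discounted objective in Eq. (2), the following sketch computes the discounted sum of rewards for a single recorded rollout; averaging this quantity over many rollouts would give a Monte Carlo estimate of the expectation. The function name `discounted_return` is an illustrative assumption and is not taken from the original implementation.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Discounted sum of rewards: sum_i gamma**i * rewards[i]."""
    total = 0.0
    # Accumulate from the last reward backwards, so that each step
    # needs only one multiplication and one addition.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

In the setting of this paper, the rewards in such a rollout would depend on both the policy and the design of the agent that produced them.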
4 Optimization of Morphology and Behaviour
We now introduce our proposed framework for sample-efficient optimization of behaviour and design for robotic prototypes. We first describe our novel objective function, based on an actor and critic, which removes the dependency on prototypes and simulations during design optimization. Thereafter, we describe a method for fast behaviour adaptation by training a copy of the actor and critic primarily on experience collected with the current design prototype. We continue with an explanation of two different design exploration mechanisms, random selection and novelty search. The section closes with a description of the reinforcement learning algorithms and optimization routines used.
4.1 Using the Q-Function for Design Optimization
Optimizing the behaviour of an agent usually requires learning a value or Q-value function and a policy by means of reinforcement learning. The rationale of our approach is to extend this methodology to the evaluation of the space of designs, thereby reducing the need for large numbers of simulations or manufactured robot prototypes.
The goal of design optimization is to increase the efficiency of the agent given an optimal policy for each design. The objective function for this case can be the sum of rewards collected by evaluating the behaviour of the agent with this design, given by
\[ \max_{\xi} \; \sum_{t} r(s_t, a_t) \tag{3} \]
where the rewards are collected through the execution of a policy $\pi$ on the agent with design $\xi$ in the real world or in simulation.
To alleviate the aforementioned problems with the evaluation through executions in simulation or the real world, we instead propose to reuse the Q-function learned by a deep reinforcement learning algorithm and reformulate our objective as
\[ \max_{\xi} \; Q(s, a, \xi) \tag{4} \]
where the action $a$ is given by the policy $\pi(s, \xi)$. This creates a strong coupling between the design optimization and the reinforcement learning loop: we effectively reduce the problem of finding optimal designs to the problem of training a critic which is able to estimate the performance of a design given state and action. This means that, while optimizing a policy for a design, we also train the objective function given above. We hypothesize that, during the training process, the critic learns to distinguish and interpolate between designs due to the influence of the design on the reward of transitions.
We further reformulate Eq. 4 to optimize over the distribution of start states $p(s_0 \mid \xi)$ encountered in trajectories, which may depend on the design. The objective function then becomes the expected future reward given a design choice $\xi$. Such a dependence arises, for example, if the leg lengths of a robot are optimized and the initial position is a standing one. Here, the initial height of the robot would vary with the design choice. Thus, we reformulate the objective function in Eq. 4 such that we optimize over the distribution of start states with
\[ \max_{\xi} \; \mathbb{E}_{s_0 \sim p(s_0 \mid \xi)} \left[ \mathbb{E}_{a_0 \sim \pi(s_0, \xi)} \left[ Q(s_0, a_0, \xi) \right] \right] \tag{5} \]
The motivation for optimizing this function over the distribution of start states is to take into account potential randomness in the initial positions, or even inaccuracies when resetting the initial position of a robot. Since the distribution of start states might be unknown or even depend on the design, we approximate the expectation by drawing a random batch of start states $s_{\text{batch}}$ from a replay buffer which contains exclusively the start states seen so far. If we use a deterministic deep neural network for the policy $\pi$, Eq. 5 reduces to
\[ \max_{\xi} \; \frac{1}{n} \sum_{s \in s_{\text{batch}}} Q(s, \pi(s, \xi), \xi) \tag{6} \]
with $s_{\text{batch}}$ containing $n$ randomly chosen start states. This objective function can be optimized with classical global optimization methods such as Particle Swarm Optimization (PSO) [4, 8] or Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [11].
4.2 Design Generalization and Specialization of Actor and Critic
A naive way to input the design variable into the actor and critic networks would be to append the design vector to the state and train a single set of networks using the experience of all designs. A more promising approach is to have two sets of networks: population (pop.) actor and critic networks, which are trained on the experience from all designs, and individual (ind.) networks, which are initialized with the population networks but primarily use training experience from the current design (the individual). In practice, we found it helpful to allocate 10% of the training batch to samples from the population replay buffer when training the individual networks. Essentially, this approach allows the individual networks $\pi_{\text{ind.}}$ and $Q_{\text{ind.}}$ to specialize quickly to the current design and its nuances and thus to achieve maximum performance rapidly. In parallel, we train the population networks $\pi_{\text{pop.}}$ and $Q_{\text{pop.}}$ with experience from all designs seen so far by selecting samples from the population replay buffer. These population networks are then able to generalize better across different designs and provide initial weights for the individual networks. Hence, policies do not have to be learned from scratch for each new prototype. Instead, previously collected training data is used so that different designs can inform each other and make efficient use of all the experience collected thus far.
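The batch composition described above can be sketched as follows. This is a minimal illustration, not the original implementation: it assumes that each replay buffer is a NumPy array with one transition per row and that sampling is done uniformly with replacement. The function name `sample_mixed_batch` and its parameters are hypothetical; the batch size of 256 and the 10% population share are the values stated in the paper.

```python
import numpy as np


def sample_mixed_batch(ind_buffer, pop_buffer, batch_size=256,
                       pop_fraction=0.1, rng=None):
    """Draw a training batch for the individual networks.

    Most samples come from the buffer of the current design; a fraction
    `pop_fraction` comes from the population buffer of all designs.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_pop = int(round(batch_size * pop_fraction))
    n_ind = batch_size - n_pop
    # Uniform sampling with replacement (an assumption of this sketch).
    ind_idx = rng.integers(0, len(ind_buffer), size=n_ind)
    pop_idx = rng.integers(0, len(pop_buffer), size=n_pop)
    return np.concatenate([ind_buffer[ind_idx], pop_buffer[pop_idx]], axis=0)
```

With the default values, 26 of the 256 samples are drawn from the population buffer and the remaining 230 from the buffer of the current design.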
4.3 Exploration and Exploitation of Designs
We alternate between design exploration and exploitation to increase the diversity of explored designs, improve the generalization capabilities of the critic and avoid premature convergence to a single region of the design space. Therefore, every time we have found an optimal design during the design optimization process with the objective function (Eq. 6) and concluded the subsequent reinforcement learning process, we next choose one design using the exploration strategy. To this end, we implemented two different approaches: sampling new designs 1) randomly, and 2) using novelty search [15]. We found that random sampling as the exploration strategy outperformed novelty search (see appendix).
4.4 Fast Evolution through Actor-Critic Reinforcement Learning
The proposed algorithm, Fast Evolution through Actor-Critic Reinforcement Learning, is presented in Algorithm 1. We will now discuss the specifics of the reinforcement learning algorithm and global optimization method used. However, it is worth noting that our methodology is agnostic to the specific algorithms used for design and behaviour optimization.
Reinforcement Learning Algorithm
While in principle every reinforcement learning method can be employed to train the Q-function and policy necessary to optimize the designs, we use a deep reinforcement learning method due to the continuous state and action domains of our tasks. Specifically, we employed the Soft Actor-Critic (SAC) algorithm [10], a state-of-the-art deep reinforcement learning method based on the actor-critic architecture. All neural networks had three hidden layers with a layer size of 200. Per episode, we train the individual networks $\pi_{\text{ind.}}$ and $Q_{\text{ind.}}$ 1000 times, while the population networks $\pi_{\text{pop.}}$ and $Q_{\text{pop.}}$ are trained 250 times. The motivation was to assign more processing power to the individual networks so that they adapt quickly to a design and specialize. A batch size of 256 was used for each training update.
Optimization Algorithm
To optimize the objective function given in Eq. (6), we used the global optimization method Particle Swarm Optimization (PSO) [4, 8]. We chose PSO primarily because of its ability to search the design space exhaustively using a large number of particles. The objective function (Eq. (6)) was optimized using about 700 particles, each representing a candidate design, updated over 250 iterations. Accordingly, PSO used a total budget of 175,000 objective function evaluations to find an optimal design. To optimize the design using rollouts in simulation, we had to reduce this number to about 1,050 design candidates, i.e. 35 particles updated over 30 iterations. Although this budget is only about 0.6% of the size of the Q-function budget, it takes about two times longer to evaluate this number of designs in simulation. For example, on a system with an Intel Xeon E5-2630 v4 CPU and an NVIDIA Quadro P6000, the design optimization via simulation takes approximately 30 minutes while the optimization routine using the critic requires only 15 minutes. To put this into perspective, the reinforcement learning process on a single design requires approximately 60 minutes for 100 episodes.
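To make the optimization step more concrete, the following sketch combines a basic global-best PSO with an objective in the style of Eq. (6), i.e. the critic value averaged over a batch of start states. It is a minimal illustration under stated assumptions and not the original implementation: `toy_policy` and `toy_critic` are hypothetical stand-ins for the trained actor and critic, the PSO coefficients and the design bounds are illustrative choices, and only the default particle and iteration counts (700 and 250) come from the text.

```python
import numpy as np


def design_objective(designs, start_states, policy, critic):
    """Score each design by the mean critic value over the start-state batch."""
    scores = np.empty(len(designs))
    for i, xi in enumerate(designs):
        xi_batch = np.tile(xi, (len(start_states), 1))
        actions = policy(start_states, xi_batch)
        scores[i] = critic(start_states, actions, xi_batch).mean()
    return scores


def pso_maximize(objective, lower, upper, n_particles=700, n_iters=250,
                 inertia=0.7, c_personal=1.5, c_global=1.5, seed=0):
    """Basic global-best PSO; uses n_particles * n_iters objective evaluations."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(lower, upper, size=(n_particles, len(lower)))
    vel = np.zeros_like(pos)
    best_pos = pos.copy()
    best_val = np.full(n_particles, -np.inf)
    g_pos, g_val = pos[0].copy(), -np.inf
    for _ in range(n_iters):
        val = objective(pos)
        # Update the personal best of every particle that improved.
        improved = val > best_val
        best_pos[improved] = pos[improved]
        best_val[improved] = val[improved]
        # Update the global best design seen so far.
        g = int(np.argmax(best_val))
        if best_val[g] > g_val:
            g_pos, g_val = best_pos[g].copy(), float(best_val[g])
        # Move particles towards their personal best and the global best.
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = (inertia * vel + c_personal * r1 * (best_pos - pos)
               + c_global * r2 * (g_pos - pos))
        pos = np.clip(pos + vel, lower, upper)
    return g_pos, g_val


def toy_policy(states, designs):
    """Hypothetical stand-in for a trained deterministic actor."""
    return np.tanh(states[:, :1])


def toy_critic(states, actions, designs):
    """Hypothetical stand-in for a trained critic; peaks at design = 1.2."""
    design_term = -np.sum((designs - 1.2) ** 2, axis=1)
    return (design_term - 0.1 * np.sum(actions ** 2, axis=1)
            - 0.1 * np.sum(states ** 2, axis=1))


if __name__ == "__main__":
    start_states = np.random.default_rng(1).normal(size=(32, 3))
    lower, upper = np.full(6, 0.8), np.full(6, 2.0)  # illustrative bounds
    best_design, best_value = pso_maximize(
        lambda d: design_objective(d, start_states, toy_policy, toy_critic),
        lower, upper, n_particles=50, n_iters=50)
    print(best_design, best_value)
```

Because every evaluation is only a forward pass through the actor and critic, the number of particles can be large; replacing `design_objective` with simulated rollouts is what forces the much smaller budget discussed above.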
5 Experimental Evaluation
We now experimentally evaluate our proposed approach, with the aim of answering the following questions: 1) Can our algorithm, by relying on the learned model, achieve task performance comparable to optimizing the design through extensive trials? 2) If so, by how much can our approach reduce the number of trials? 3) Can our approach give us insight into the design space that we are trying to optimize for a specific task?
Code for reproducing the experiments, videos, and additional material is available online at
https://sites.google.com/view/drlcoadaptation.
5.1 Experimental Setting
To evaluate our algorithm, we considered the four control tasks simulated using PyBullet [6] shown in Fig. 2. The design of the agent for each task is described by a continuous design vector $\xi$. The initial five designs for each task consisted of the original design and four randomly chosen designs, which were kept the same across all experiments. All experiments were repeated five times. For the standard PyBullet tasks (Figures 1(c), 1(b) and 1(a)) we executed 300 episodes for the initial five designs and 100 episodes thereafter. The latter was increased to 200 episodes for the more complex Daisy Hexapod task (Fig. 1(d)) [12]. We will give a short description of the simulated locomotion tasks and state for each task the number of states, actions and design parameters as a triple (states, actions, design parameters). Detailed descriptions of the tasks can be found in the appendix. HalfCheetah (17, 6, 6) and Walker (17, 6, 6) are agents with two legs tasked to learn to run forward. Each agent has six leg segments whose lengths are optimized independently. The Hopper (13, 4, 5) agent has a single leg with four leg segments as well as a nose-like feature and has to learn to move forward as well. All three agents are restricted to movements in a 2D plane. The Daisy Hexapod (43, 18, 9) simulates a hexapod and is able to move in all three dimensions. Its goal is to learn to move forward without changing its orientation. The lengths of the leg segments are mirrored between the left and right side of the robot, with three leg segments per leg.
5.2 Coadaptation Performance
We compared the proposed framework, which uses actor-critic networks for design evaluation, with the classical approach, which optimizes the design through candidate evaluations in simulation, on all four locomotion tasks (Fig. 3). We can see that, especially in the HalfCheetah task, using actor-critic networks might perform worse over the first few designs but quickly reaches a comparable performance and even surpasses the baseline. We hypothesize that the better performance in later episodes is due to the ability of the critic to interpolate between designs, while the evaluation of designs in simulation suffers from noise during execution. Interestingly, using simulations to optimize the design does not seem to lead to much improvement in the case of the Walker task. This could be due to the randomized start state, which often leads to the agent being in an initial state of falling backwards or forwards, which would have an immediate effect on the episodic reward. Additionally, we compared the proposed method, using the introduced objective function for evaluating design candidates, against the method used for design optimization in [9]. Fig. 5 shows that the evolution strategy OpenAI-ES [20], using the simulator to evaluate design candidates with a population size of 256, is outperformed by our proposed method. Moreover, we verified that, for all experiments, designs selected randomly from a uniform distribution performed worse than designs selected through optimization (see Fig. 5).
Simulation Efficiency
To evaluate the suitability of the proposed method for deployment in the real world, we compared the methods based on the number of simulations required. As we can see in Fig. 4, the actor-critic approach quickly reaches a high performance with a low number of simulations. As explained above, this is because the design optimization via simulation requires 1,050 simulations to find an optimal design while the proposed method requires none.
Visualization of Reward Landscapes for Designs
A major advantage of the proposed method is the possibility to visualize the expected reward for designs. Instead of selecting a number of designs to evaluate, which would take significant effort both in the real world and computationally, we are able to query the introduced objective function (Eq. (6)) quickly. This allows us to visually inspect the reward landscape of designs and to gain insight into what makes designs perform better or worse. In Fig. 6, the first two principal components were computed based on the designs selected for learning in the HalfCheetah task. We can see, for example, that a shorter second segment of the back leg as well as a shorter first segment of the front leg seem to be desirable.
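The projection onto the first two principal components can be sketched with a plain singular value decomposition. This is an illustrative sketch only: the function names `pca_project` and `latent_to_design` are hypothetical, and the plotting and the querying of the objective function on a grid in the projected space are omitted.

```python
import numpy as np


def pca_project(designs, n_components=2):
    """Project design vectors onto their leading principal components."""
    designs = np.asarray(designs, dtype=float)
    mean = designs.mean(axis=0)
    centered = designs - mean
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    explained = singular_values[:n_components] ** 2 / np.sum(singular_values ** 2)
    return centered @ components.T, components, mean, explained


def latent_to_design(coords, components, mean):
    """Map coordinates in the principal-component space back to design vectors."""
    return mean + np.asarray(coords, dtype=float) @ components
```

A reward landscape could then be drawn by laying a grid over the two projected coordinates, mapping each grid point back with `latent_to_design`, and evaluating the objective function for the resulting design.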
6 Conclusion
In this paper, we study the problem of data-efficiently coadapting morphologies and behaviors of robots. Our contribution is a novel algorithm, based on recent advances in deep reinforcement learning, which can better exploit previous trials to estimate the performance of morphologies and behaviors before testing them. As a result, our approach can drastically reduce the number of morphology designs tested (and their eventual manufacturing time and cost). Experimental results on four simulated robots show strong performance and a drastically reduced number of design prototypes, with one robot requiring merely 50 designs compared to the 24,177 of the baseline, which is about 3 orders of magnitude less data. The data efficiency of our approach opens exciting avenues towards the real-world use of robots that can coadapt both their morphologies and their behaviors to learn more efficiently to perform the desired tasks with minimal expert knowledge. In future work, we aim to demonstrate the capabilities of this algorithm on a robot in the real world.
We thank Akshara Rai for the valuable discussions during the early stages of this research, as well as for testing the early implementations of the Daisy hexapod simulation thoroughly. Furthermore, we thank Ge Yang for his support to run additional simulations when they were needed. Finally, we thank the anonymous reviewers for their helpful comments.
References
[1] (1984) Walking and running: legs and leg movements are subtly adapted to minimize the energy costs of locomotion. American Scientist 72 (4), pp. 348–354.
[2] (1957) A Markovian decision process. Journal of Mathematics and Mechanics, pp. 679–684.
[3] (2011-07) Morphology and behaviour: functional links in development and evolution introduction. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 366, pp. 2056–68.
[4] (2017) Particle swarm optimization for single objective continuous space problems: a review. MIT Press.
[5] (2015) Novelty-based evolutionary design of morphing underwater robots. In Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, pp. 145–152.
[6] (2016–2019) PyBullet, a Python module for physics simulation for games, robotics and machine learning. Note: http://pybullet.org
[7] (2013) A survey on policy search for robotics. Foundations and Trends® in Robotics 2 (1–2), pp. 1–142.
[8] (1995-10) A new optimizer using particle swarm theory. In MHS’95. Proceedings of the Sixth International Symposium on Micro Machine and Human Science, pp. 39–43.
[9] (2018) Reinforcement learning for improving agent design. arXiv preprint arXiv:1810.03779.
[10] (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1856–1865.
[11] (1996) Adapting arbitrary normal mutation distributions in evolution strategies: the covariance matrix adaptation. In Proceedings of IEEE International Conference on Evolutionary Computation, pp. 312–317.
[12] (2019) HEBI Robotics X-series hexapod data sheet. Note: http://docs.hebi.us/resources/kits/assyInstructions/A204901_Data_Sheet.pdf Accessed: 06.07.2019
[13] (2017) Bio-inspired robot design considering load-bearing and kinematic ontogeny of chelonioidea sea turtles. In Conference on Biomimetic and Biohybrid Systems, pp. 216–229.
[14] (2012) The transferability approach: crossing the reality gap in evolutionary robotics. IEEE Transactions on Evolutionary Computation 17 (1), pp. 122–145.
[15] (2008) Exploiting open-endedness to solve problems through the search for novelty. Artificial Life 11, pp. 329.
[16] (2019) Data-efficient learning of morphology and controller for a micro-robot. In IEEE International Conference on Robotics and Automation (ICRA), pp. 2488–2494.
[17] (2000) Automatic design and manufacture of robotic lifeforms. Nature 406 (6799), pp. 974.
[18] (2017) From the lab to the desert: fast prototyping and learning of robot locomotion. In 2017 Robotics: Science and Systems, RSS 2017.
[19] (2018) Real-world evolution adapts robot morphology and control to hardware limitations. In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 125–132.
[20] (2017) Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864.
[21] (2019) Jointly learning to construct and control agents using deep reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA), pp. 9798–9805.
[22] (2014) Design principles for energy-efficient legged locomotion and implementation on the MIT Cheetah robot. IEEE/ASME Transactions on Mechatronics 20 (3), pp. 1117–1129.
[23] (1994) Evolving virtual creatures. In Proceedings of the 21st Annual Conference on Computer Graphics and Interactive Techniques, pp. 15–22.
[24] (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3–4), pp. 229–256.
[25] (2004) Back to reality: crossing the reality gap in evolutionary robotics. In IAV 2004 the 5th IFAC Symposium on Intelligent Autonomous Vehicles, Lisbon, Portugal.
7 Appendix
7.1 Simulation Environments
This section states a short description of each task simulated in PyBullet [6]:
HalfCheetah (17, 6, 6)
The HalfCheetah task has a 17-dimensional state space consisting of joint positions, joint velocities, horizontal speed, angular velocity, vertical speed and relative height. Actions have six dimensions and are accelerations of joints. The original reward function used in PyBullet was adapted to be design independent and is given by where is the horizontal speed to encourage forward motion. The continuous design vector is a scaling factor of the original leg lengths of HalfCheetah: . The dimensions of the design vector are in the interval .
Walker (17, 6, 6)
Similar to the HalfCheetah task, the state space of the Walker task is given by joint positions, joint velocities, horizontal speed, angular velocity, vertical speed and relative height and has 16 dimensions. The two legs of Walker are controlled through acceleration with a six-dimensional action. Again, the original reward was adapted to be design agnostic. The term encouraging maximum height of the torso of Walker was replaced by two terms favouring vertical orientation of the torso and reaching a minimal height of . The full reward function is given by . The design vector is a scaling factor of the leg and foot lengths of the Walker agent: . Each design dimension lies in the interval .
Hopper (13, 4, 5)
In the planar Hopper task, a one-legged agent has to learn jumping motions in order to move forward. The state space of this task has thirteen dimensions and the action space has four dimensions. We use the same reward function as for the Walker task with . In addition to the lengths of the four movable leg segments, the length of the nose-like feature of the agent is an additional design parameter, here . The full design vector is given by with being the length of each movable segment from pelvis to foot. The design parameters were bounded with for the length of the nose and for all leg lengths.
Daisy Hexapod (43, 18, 9)
For a preliminary study and to evaluate whether the proposed method is suitable for real-world applications, a simulation of the six-legged Daisy robot by HEBI Robotics [12] was created in PyBullet. Each leg of the robot has three motors and hence the action space has 18 dimensions. The state space has 43 dimensions and consists of joint positions, joint velocities, joint accelerations, the velocity of the robot in x/y/z directions and the orientation of the robot in Euler angles. The task of the robot is to learn to walk forward while keeping its orientation and thus the reward function is given by , with being the displacement along the y-axis, the direction the robot faces at initialization, and representing the angle between the original and current orientation in quaternions. The design vector consists of two parts: leg lengths, and movement range of the motors at the base of the legs. All parameters are symmetric between the left and right side of the robot. The leg lengths are in for the two leg segments of each leg. Additionally, we allowed the algorithm to optimize the movement range of the first of the three motors on each leg. The base motors are restricted in movement between radians with the design parameters .
7.2 Visualization of Design Space
Because we can query the proposed objective function from Eq. 6, we are able to visualize the cost landscape of each task. Figure 7 shows the design spaces of the three standard PyBullet tasks HalfCheetah, Walker and Hopper after 50 designs evaluated in simulation. Each plot shows the design landscape of two dimensions while the other dimensions were held fixed at the stated design vectors, as well as the location of designs chosen by the proposed method (yellow) and designs chosen randomly for exploration (black). The cost landscape of the more complex Daisy Hexapod task is shown in Figure 8.
7.3 Visualization of the Latent Design Space
For a better understanding of the cost landscape, a low-dimensional design space was computed with principal component analysis. Figure 9 shows the low-dimensional projection of the design space as well as the designs chosen by the proposed method (yellow) and randomly selected designs for exploration (black). Designs chosen by the optimization via simulation method are shown in white. We can see that the convergence rate of optimization via simulation appears to be slower than that of our method. To see which properties of the design lead to a better performance, we visualized the design along the two principal components (Fig. 10). We can see that longer legs alone do not appear to lead automatically to better performance, but shorter front legs and slightly longer back legs do.
7.4 Evolution of Walker
Figure 11 shows the evolution of designs with the proposed objective function. We can see that the start states are random and lead to different poses of Walker, sometimes falling forwards or backwards. It can be seen that while shorter legs seem desirable, the larger the foot length, the better the performance.
7.5 Using CMA-ES for Evolutionary Design Optimization
As proposed in the work of David Ha [9], we evaluated our approach against two approaches that use CMA-ES (Fig. 15) and OpenAI-ES (see main text) in an evolutionary manner for the optimization of the designs. For this experiment, we let CMA-ES create a population of design candidates and evaluated them in the simulator. We then executed exactly one update iteration of CMA-ES and used the best design found in the reinforcement learning loop. Figure 15 shows that this method is outperformed by the approach proposed in this paper. The proposed method uses the Q-function for design evaluations during the design optimization phase and executes a number of update iterations before selecting the best design for the reinforcement learning loop.
7.6 Design Exploration Strategies
We alternate between design exploration and exploitation to increase the diversity of explored designs, improve the generalization capabilities of the critic and avoid premature convergence to a single region of the design space. Therefore, every time we have found an optimal design during the design optimization process with the objective function (Eq. 6) and concluded the subsequent reinforcement learning process, we next choose one design using the exploration strategy. To this end, we implemented two different approaches: sampling new designs 1) randomly, and 2) using novelty search [15]. Novelty search is an exploration strategy in which the objective maximizes the distance to the closest neighbours. The objective function is given by
\[ \max_{\xi} \sum_{\xi_i \in \mathrm{KNN}(\xi,\, \Xi)} \lVert \xi - \xi_i \rVert \tag{7} \]
where the function $\mathrm{KNN}(\xi, \Xi)$ returns the nearest neighbors of a design $\xi$ from the set $\Xi$ of designs chosen so far. This set includes only designs which were selected for evaluation in the real world or simulation, i.e., were handed over to the reinforcement learning algorithm (Fig. 0(b)). Experiments showed that using novelty search for exploration did not yield an advantage over random selection of designs (Fig. 13).
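The novelty objective described above can be sketched as follows. Several details are assumptions of this sketch rather than facts from the paper: the distance is taken to be Euclidean, the score is the mean distance to the `k` nearest previously chosen designs, and the maximization is done by scoring uniformly sampled candidates instead of running a global optimizer. The function names and the default `k=3` are hypothetical.

```python
import numpy as np


def novelty_score(candidate, chosen_designs, k=3):
    """Mean Euclidean distance from a candidate to its k nearest chosen designs."""
    chosen = np.atleast_2d(np.asarray(chosen_designs, dtype=float))
    dists = np.linalg.norm(chosen - np.asarray(candidate, dtype=float), axis=1)
    k = min(k, len(dists))
    # Sort ascending so that the first k entries are the nearest neighbours.
    return float(np.sort(dists)[:k].mean())


def most_novel_design(chosen_designs, lower, upper, n_candidates=1000, k=3, seed=0):
    """Pick the most novel design among uniformly sampled candidates."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(lower, upper, size=(n_candidates, len(lower)))
    scores = np.array([novelty_score(c, chosen_designs, k) for c in candidates])
    return candidates[int(np.argmax(scores))]
```

Only designs that were actually handed to the reinforcement learning loop would be passed in as `chosen_designs`, matching the definition of the set of chosen designs above.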
7.7 Performance of Optimization Algorithms for Design Optimization
Since we had to reduce the number of simulations considerably during the design optimization stage, we also compared the performance of Particle Swarm Optimization (PSO) and Covariance Matrix Adaptation Evolution Strategy (CMA-ES). However, we could not find a significant difference in performance (Fig. 14).
7.8 About the Use of Batches of Start States for the Evaluation of Design Candidates
We evaluated the importance of evaluating the objective function (Eq. 6) over a batch of start states. Figure 12 compares the use of a single start state with the use of batches of 16 and 32 start states in the objective function presented in Eq. 6. The evaluation shows that averaging the objective function over a number of randomly drawn start states increases the performance of the proposed approach considerably.
7.9 Evaluating the use of Population and Individual Networks
In a preliminary evaluation, we were able to confirm that using a single set of population networks, instead of a combination of population and individual networks, decreases performance (Fig. 16). This shows that the ability of the individual networks to adapt quickly to the current design is important for the overall performance of the proposed approach.