Reinforcement learning is concerned with maximizing rewards from an environment through repeated interactions and trial and error. Such methods often rely on various approximations of the Bellman equation and include value function approximation, policy gradient methods, and more [2]. Such methods have been used to optimize the structure of neural networks for vision tasks, for instance.
Recently, Salimans et al. have shown that a particular variant of evolutionary computation methods, termed Evolution Strategies (ES), is a fast and scalable alternative to other reinforcement learning approaches, solving the difficult humanoid MuJoCo task in 10 minutes [4]. The authors argue that ES has several benefits over other reinforcement learning methods: 1) the need to backpropagate gradients through a policy is avoided, which opens up a wider class of policy parameterizations; 2) ES methods are massively parallelizable, which allows for scaling up learning to larger, more complex problems; 3) ES often finds policies which are more robust than those found by other reinforcement learning methods; and 4) ES is better at assigning credit to changes in the policy over longer timescales, which enables solving tasks with longer time horizons and sparse rewards. In this work we leverage all four of these advantages by using ES to solve a problem with: 1) a more complex and decipherable policy architecture which allows for safety considerations; 2) a large-scale simulated environment with many interacting elements; 3) multiple sources of stochasticity including variations in initial conditions, disturbances, etc.; and 4) sparse rewards which only occur at the very end of a long episode.
A common critique of evolutionary computation algorithms is a lack of convergence analysis or guarantees. Of course, for problems with non-differentiable and non-convex objective functions, analysis will always be difficult. Nevertheless, we show that the Evolution Strategies algorithm proposed by Salimans et al. [4] is a special case of a class of model-based stochastic search methods known as Gradient-Based Adaptive Stochastic Search (GASS) [8]. This class of methods generalizes many stochastic search methods such as the well-known Cross Entropy Method (CEM) [6], CMA-ES [7], etc. By casting a non-differentiable, non-convex optimization problem as a gradient descent problem, one can arrive at nice asymptotic convergence properties and known convergence rates [8].
With more confidence in the convergence of Evolution Strategies, we demonstrate how ES can be used to efficiently solve both cooperative and competitive large-scale multi-agent problems. Many approaches to solving multi-agent problems rely on hand-designed and hand-tuned algorithms (see [9] for a review). One such example, distributed Model Predictive Control, relies on independent MPC controllers on each agent with some level of coordination between them [10, 11]. These controllers require hand-designed dynamics models, cost functions, feedback gains, etc., as well as expert domain knowledge. Additionally, scaling these methods up to more complex problems continues to be an issue. Evolutionary algorithms have also been tried as a solution to multi-agent problems, usually with smaller, simpler environments and policies with low complexity [12, 13]. Recently, a hybrid approach that uses genetic algorithms to evolve the cost function of a hand-tuned MPC controller has been demonstrated for a UAV swarm combat scenario [14].
In this work we demonstrate the effectiveness of our approach on two complex multi-agent UAV swarm combat scenarios: one in which a team of fixed-wing aircraft must attack a well-defended base, and one in which two teams of agents go head to head to defeat each other. Such scenarios have been previously considered in simulated environments with less fidelity and complexity [15, 14]. We leverage the computational efficiency and flexibility of the recently developed SCRIMMAGE multi-agent simulator [16] for our experiments (Figure 1). We compare the performance of ES against the Cross Entropy Method. We also show for the competitive scenario how the policy learns over time to coordinate a strategy in response to an enemy learning to do the same. We make our code freely available for use (https://github.com/ddfan/swarm_evolve).
II Problem Formulation
We can pose our problem as the non-differentiable, non-convex optimization problem

$$\theta^* = \arg\max_{\theta \in \Theta} J(\theta) \tag{1}$$

where $\Theta \subset \mathbb{R}^n$, a nonempty compact set, is the space of solutions, and $J(\theta)$ is a non-differentiable, non-convex real-valued objective function $J: \Theta \rightarrow \mathbb{R}$. $\theta$ could be any combination of decision variables of our problem, including neural network weights, PID gains, hardware design parameters, etc., which affect the outcome of the returns $J$. For reinforcement learning problems $\theta$ usually represents the parameters of the policy and $J$ is an implicit function of the sequential application of the policy to the environment. We first review how this problem can be solved using Gradient-Based Adaptive Stochastic Search methods and then show how the ES algorithm is a special case of these methods.
II-A Gradient-Based Adaptive Stochastic Search
The goal of model-based stochastic search methods is to cast the non-differentiable optimization problem (1) as a differentiable one by specifying a probabilistic model (hence "model-based") from which to sample $\theta$. Let this model be $p(\theta \mid \omega)$, where $\omega \in \Omega$ denotes the parameters of the distribution. Then the expectation of $J(\theta)$ over the distribution $p(\theta \mid \omega)$ can never exceed the optimal value of $J$, i.e.

$$\int_{\Theta} J(\theta)\, p(\theta \mid \omega)\, d\theta \le J(\theta^*) \tag{2}$$

The idea of Gradient-Based Adaptive Stochastic Search (GASS) is that one can perform a search in the space of parameters of the distribution, $\Omega$, rather than in $\Theta$, for a distribution which maximizes the expectation in (2):

$$\omega^* = \arg\max_{\omega \in \Omega} \int_{\Theta} J(\theta)\, p(\theta \mid \omega)\, d\theta \tag{3}$$

Maximizing this expectation corresponds to finding a distribution which is maximally concentrated around the optimal $\theta$. However, unlike maximizing (1), this objective function can now be made continuous and differentiable with respect to $\omega$. With some assumptions on the form of the distribution, the gradient with respect to $\omega$ can be pushed inside the expectation.
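To illustrate why searching over distribution parameters helps, the following minimal sketch estimates the gradient of an expected return with respect to the mean of a one-dimensional Gaussian, using the log-likelihood (score-function) identity $\nabla_\mu \mathbb{E}[J(\theta)] = \mathbb{E}[J(\theta)(\theta - \mu)/\sigma^2]$. The step-function objective, the sample count, and all function names are assumptions made for illustration and are not taken from the experiments in this paper.

```python
import math
import random


def step_objective(theta):
    """Non-differentiable toy objective: 1 if theta exceeds 1, else 0."""
    return 1.0 if theta > 1.0 else 0.0


def score_function_gradient(objective, mu, sigma, n_samples, rng):
    """Monte Carlo estimate of d/dmu E[objective(theta)], theta ~ N(mu, sigma^2)."""
    total = 0.0
    for _ in range(n_samples):
        theta = rng.gauss(mu, sigma)
        # d/dmu log N(theta; mu, sigma^2) = (theta - mu) / sigma^2
        total += objective(theta) * (theta - mu) / sigma ** 2
    return total / n_samples


def exact_gradient(mu, sigma):
    """Closed form for the step objective: the Gaussian density at theta = 1."""
    z = (1.0 - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


rng = random.Random(0)
estimate = score_function_gradient(step_objective, 0.0, 1.0, 100_000, rng)
exact = exact_gradient(0.0, 1.0)
print(f"estimate={estimate:.4f} exact={exact:.4f}")
```

Although the toy objective has zero derivative almost everywhere in $\theta$, its expectation under the Gaussian is smooth in $\mu$, so the sampled estimate can be compared against a closed-form gradient.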
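The GASS algorithm presented by [8] is applicable to the exponential family of probability densities:

$$p(\theta \mid \omega) = \exp\{T(\theta)^\top \omega - \phi(\omega)\}$$

where $\phi(\omega) = \ln \int \exp(T(\theta)^\top \omega)\, d\theta$, and $T(\theta)$ is the vector of sufficient statistics. Since we are concerned with showing the connection with ES, which uses parameter perturbations sampled with Gaussian noise, we assume that $p(\theta \mid \omega)$ is Gaussian. Furthermore, since we are concerned with learning a large number of parameters (i.e. weights in a neural network), we assume an independent Gaussian distribution over each parameter. Then, $T(\theta) = [\theta, \theta^2]^\top \in \mathbb{R}^{2n}$ and $\omega = [\mu/\sigma^2, -1/(2\sigma^2)]^\top \in \mathbb{R}^{2n}$, where $\mu$ and $\sigma$ are vectors of the mean and standard deviation corresponding to the distribution of each parameter, respectively, and squaring and division are taken elementwise.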
We present the GASS algorithm for this specific set of probability models (Algorithm 1), although the analysis for convergence holds for the more general exponential family of distributions. For each iteration $k$, the GASS algorithm involves drawing $N_k$ samples of parameters $\theta_k^i \sim p(\theta \mid \omega_k)$, $i = 1, \ldots, N_k$. These parameters are then used to sample the return function $J(\theta_k^i)$. The returns are fed through a shaping function $S(\cdot)$ and then used to calculate an update on the model parameters $\omega_{k+1}$.
The shaping function $S(\cdot)$ is required to be nondecreasing and bounded from above and below for bounded inputs, with the lower bound away from 0. Additionally, the set $\{\arg\max_{\theta \in \Theta} S(J(\theta))\}$ must be a nonempty subset of the set of solutions of the original problem (1). The shaping function can be used to adjust the exploration/exploitation trade-off or help deal with outliers when sampling. The original analysis of GASS assumes a more general form of $S_k(\cdot)$, where $S$ can change at each iteration. For simplicity we assume here it is deterministic and unchanging per iteration.
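GASS can be considered a second-order gradient method and requires estimating the variance of the sampled parameters:

$$\hat{V}_k = \frac{1}{N_k - 1} \sum_{i=1}^{N_k} T(\theta_k^i)\, T(\theta_k^i)^\top - \frac{1}{N_k^2 - N_k} \left( \sum_{i=1}^{N_k} T(\theta_k^i) \right) \left( \sum_{i=1}^{N_k} T(\theta_k^i) \right)^\top$$

In practice, if the size of the parameter space $n$ is large, as is the case in neural networks, this variance matrix will be of size $2n \times 2n$ and will be costly to compute. In our work we approximate $\hat{V}_k$ with independent calculations of the variance on the parameters of each independent Gaussian. With a slight abuse of notation, consider $\tilde{\theta}_k^i$ as a scalar element of $\theta_k^i$. We then have, for each scalar element $\tilde{\theta}_k$, a $2 \times 2$ variance matrix:

$$\hat{V}_k = \frac{1}{N_k - 1} \sum_{i=1}^{N_k} T(\tilde{\theta}_k^i)\, T(\tilde{\theta}_k^i)^\top - \frac{1}{N_k^2 - N_k} \left( \sum_{i=1}^{N_k} T(\tilde{\theta}_k^i) \right) \left( \sum_{i=1}^{N_k} T(\tilde{\theta}_k^i) \right)^\top, \qquad T(\tilde{\theta}_k^i) = [\tilde{\theta}_k^i, (\tilde{\theta}_k^i)^2]^\top$$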
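The per-parameter approximation can be sketched as follows. This is a minimal illustration that assumes the sufficient statistics of a scalar Gaussian are $[\theta, \theta^2]$ and uses the standard unbiased sample covariance; the function names and the list-of-lists population layout are hypothetical.

```python
def sufficient_stats(theta):
    """Sufficient statistics of a scalar Gaussian: [theta, theta^2]."""
    return (theta, theta * theta)


def sample_covariance_2x2(samples):
    """Unbiased sample covariance of [theta, theta^2] for one scalar parameter."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    stats = [sufficient_stats(t) for t in samples]
    sums = [sum(s[a] for s in stats) for a in range(2)]
    cov = [[0.0, 0.0], [0.0, 0.0]]
    for a in range(2):
        for b in range(2):
            outer = sum(s[a] * s[b] for s in stats)
            # (1/(N-1)) * sum of outer products - (1/(N^2-N)) * product of sums
            cov[a][b] = outer / (n - 1) - sums[a] * sums[b] / (n * n - n)
    return cov


def per_parameter_covariances(population):
    """One 2x2 matrix per parameter; population is a list of parameter vectors."""
    dim = len(population[0])
    return [sample_covariance_2x2([member[j] for member in population])
            for j in range(dim)]


print(sample_covariance_2x2([1.0, 2.0, 3.0]))
```

For $n$ parameters this stores $n$ matrices of size $2 \times 2$, so memory and compute grow linearly in $n$ rather than quadratically as with a single joint matrix.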
Theorem 1 shows that GASS produces a sequence of $\omega_k$ that converges to a limit set which specifies a set of distributions that maximize (3). Distributions in this set will specify how to choose $\theta^*$ to ultimately maximize (1). As with most non-convex optimization algorithms, we are not guaranteed to arrive at the global maximum, but using probabilistic models and a careful choice of the shaping function should help avoid early convergence to a suboptimal local maximum. The proof relies on casting the update rule in the form of a generalized Robbins-Monro algorithm (see [8], Thms 1 and 2). Theorem 1 also specifies convergence rates in terms of the number of iterations $k$, the number of samples per iteration $N_k$, and the learning rate $\alpha_k$. In practice Theorem 1 implies the need to carefully balance the increase in the number of samples per iteration and the decrease in learning rate as iterations progress.
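The learning rate $\alpha_k > 0$, $\alpha_k \rightarrow 0$ as $k \rightarrow \infty$, and $\sum_{k=0}^{\infty} \alpha_k = \infty$.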
The sample size $N_k = N_0 k^{\xi}$, where $\xi > 0$; also $\alpha_k$ and $N_k$ jointly satisfy $\alpha_k / \sqrt{N_k} = O(k^{-\beta})$.
$T(\theta)$ is bounded on $\Theta$.
If $\omega^*$ is a local maximum of (3), the Hessian of the objective in (3) is continuous and symmetric negative definite in a neighborhood of $\omega^*$.
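The balance between a shrinking learning rate and a growing sample size can be sketched with simple polynomial schedules. The functional forms, the constants, and the function names below are assumptions chosen for illustration; they are not the schedules used in the experiments.

```python
import math


def learning_rate(k, alpha0=0.1, decay=0.6):
    """Decaying step size alpha0 / k^decay; with 0 < decay <= 1 its sum diverges."""
    return alpha0 / k ** decay


def sample_size(k, n0=20, growth=0.5):
    """Polynomially growing number of samples per iteration."""
    return math.ceil(n0 * k ** growth)


for k in (1, 10, 100):
    print(k, round(learning_rate(k), 4), sample_size(k))
```

Faster growth in the sample size reduces the noise in each update at the price of more simulations per iteration, which is the trade-off the convergence rates make explicit.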
II-B Evolutionary Strategies
We now review the ES algorithm proposed by [4] and show how it is a first-order approximation of the GASS algorithm. The ES algorithm consists of the same two phases as GASS: 1) randomly perturb parameters with noise sampled from a Gaussian distribution; 2) calculate returns and calculate an update to the parameters. The algorithm is outlined in Algorithm 2. Once returns are calculated, they are sent through a function $S(\cdot)$ which performs fitness shaping [17]. Salimans et al. used a rank transformation function for $S(\cdot)$, which they argue reduced the influence of outliers at each iteration and helped to avoid local optima.
It is clear that the ES algorithm is a sub-case of the GASS algorithm when the sampling distribution is a point distribution. We can also recover the ES algorithm by ignoring the variance terms in Algorithm 1. Instead of the normalizing term used by GASS, ES uses the number of samples $N$. The small constant in GASS becomes the variance term $\sigma^2$ in the ES algorithm. The update rule in Algorithm 2 involves multiplying the scaled returns by the noise, which corresponds exactly to the deviation $\theta_k^i - \mu_k$ of each sample from the mean in Algorithm 1.
We see that ES enjoys the same asymptotic convergence rates offered by the analysis of GASS. While GASS is a second-order method and ES is only a first-order method, in practice ES uses approximate second-order gradient descent methods which adapt the learning rate in order to speed up and stabilize learning. Examples of these methods include ADAM, RMSProp, SGD with momentum, etc., which have been shown to perform very well for neural networks. Therefore we can treat ES as a first-order approximation of the full second-order variance updates which GASS uses. In our experiments we use ADAM [18] to adapt the learning rate for each parameter. As similarly reported in [4], we found that adapting the variance of the sampling distribution yielded little improvement once adaptive learning rates were used. We hypothesize that a first-order method with adaptive learning rates is sufficient for achieving good performance when optimizing neural networks. For other types of policy parameterizations, however, the full second-order treatment of GASS may be more useful. It is also possible to mix and match which parameters require a full variance update and which can be updated with a first-order approximate method. We use the rank transformation function for $S(\cdot)$ and keep $\sigma$ constant.
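A minimal sketch of an ES update of the kind described above, with rank-based fitness shaping and a constant $\sigma$, is given below. It follows the general form of the update in Salimans et al. (shaped returns multiplied by the noise and averaged), but it uses a plain gradient step instead of ADAM, a toy non-differentiable objective, and hypothetical function names and hyperparameters, all of which are assumptions for illustration.

```python
import random


def centered_ranks(returns):
    """Rank transformation: map returns to evenly spaced values in [-0.5, 0.5]."""
    n = len(returns)
    order = sorted(range(n), key=lambda i: returns[i])
    shaped = [0.0] * n
    for rank, i in enumerate(order):
        shaped[i] = rank / (n - 1) - 0.5
    return shaped


def es_step(theta, objective, n_samples, sigma, lr, rng):
    """One ES iteration: perturb, evaluate, shape returns, and update theta."""
    dim = len(theta)
    noises = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_samples)]
    returns = [objective([t + sigma * e for t, e in zip(theta, eps)])
               for eps in noises]
    shaped = centered_ranks(returns)
    # Search gradient estimate: shaped returns times noise, scaled by 1/(N*sigma).
    grad = [sum(s * eps[j] for s, eps in zip(shaped, noises)) / (n_samples * sigma)
            for j in range(dim)]
    return [t + lr * g for t, g in zip(theta, grad)]


def toy_objective(theta):
    """Non-differentiable at its maximum, which lies at theta = (3, 3)."""
    return -sum(abs(x - 3.0) for x in theta)


rng = random.Random(0)
theta = [0.0, 0.0]
for _ in range(300):
    theta = es_step(theta, toy_objective, n_samples=50, sigma=0.5, lr=0.1, rng=rng)
print([round(x, 2) for x in theta])
```

Because the returns enter only through their ranks, rescaling the objective or adding a single extreme outlier leaves the update direction largely unchanged, which is the motivation for rank-based shaping.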
II-C Learning Structured Policies for Multi-Agent Problems
Now that we are more confident about the convergence of the ES/GASS method, we show how ES can be used to optimize a complex policy in a large-scale multi-agent environment. We use the SCRIMMAGE multi-agent simulation environment [16] as it allows us to quickly and in parallel simulate complex multi-agent scenarios. We populate our simulation with 6DoF fixed-wing aircraft and quadcopters with dynamics models having 10 and 12 states, respectively. These dynamics models allow for full ranges of motion within realistic operating regimes. Stochastic disturbances in the form of wind and control noise are modeled as additive Gaussian noise. Ground and mid-air collisions can occur which result in the aircraft being destroyed. We also incorporate a weapons module which allows for targeting and firing at an enemy within a fixed cone projecting from the aircraft’s nose. The probability of a hit depends on the distance to the target and the total area presented by the target to the attacker. This area is based on the wireframe model of the aircraft and its relative pose. For more details, see our code and the SCRIMMAGE simulator documentation.
We consider the case where each agent uses its own policy to compute its own controls, but where the parameters of the policies are the same for all agents. This allows each agent to control itself in a decentralized manner, while allowing for beneficial group behaviors to emerge. Furthermore, we assume that friendly agents can communicate to share states with each other (see Figure 2). Because we have a large number of agents (up to 50 per team), to keep communication costs lower we only allow agents to share information locally, i.e. agents close to each other have access to each other’s states. In our experiments we allow each agent to sense the states of the closest 5 friendly agents.
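Local information sharing of this kind can be sketched as a nearest-neighbor selection. The dictionary layout of an agent record and the function name are hypothetical, and the sketch does not use the SCRIMMAGE API; it only illustrates picking the closest 5 friendly agents by Euclidean distance.

```python
import math


def closest_agents(own_position, others, k=5):
    """Return the k agent records in `others` nearest to own_position."""
    return sorted(
        others,
        key=lambda agent: math.dist(own_position, agent["position"]),
    )[:k]


teammates = [
    {"id": 1, "position": (10.0, 0.0, 0.0)},
    {"id": 2, "position": (1.0, 0.0, 0.0)},
    {"id": 3, "position": (5.0, 0.0, 0.0)},
]
print([agent["id"] for agent in closest_agents((0.0, 0.0, 0.0), teammates, k=2)])
```

If fewer than `k` teammates remain, the slice simply returns all of them, so a fixed-size network input would need padding in that case.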
Additionally, each agent is equipped with sensors to detect enemy agents. Full state observability is not available here; instead we assume that sensors are capable of sensing an enemy’s relative position and velocity. In our experiments we assumed that each agent is able to sense the nearest 5 enemies for a total of $5 \times 7 = 35$ dimensions of enemy data (7 states = [relative xyz position, distance, and relative xyz velocities]). The sensors also provide information about home and enemy base relative headings and distances. Together with the agent’s own state, these make up the policy’s observation input. These input states are fed into the agent’s policy: a neural network with 3 fully connected layers with sizes 200, 200, and 50, which outputs 3 numbers representing a desired relative heading. Each agent’s neural network has more than 70,000 parameters. Each agent uses the same neural network parameters as its teammates, but since each agent encounters a different observation at each timestep, the output of each agent’s neural network policy will be unique. It may also be possible to learn unique policies for each agent; we leave this for future work.
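The parameter count of such a network can be checked with a short calculation. The sketch below assumes three hidden layers of widths 200, 200, and 50 followed by a 3-unit output layer, each with a bias vector, and treats the observation dimension as a free variable; both the layer interpretation and the function names are assumptions.

```python
def mlp_parameter_count(layer_sizes):
    """Weights plus biases of a fully connected network with the given widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def count_for_observation_dim(obs_dim):
    """Parameter count for the assumed architecture obs_dim-200-200-50-3."""
    return mlp_parameter_count([obs_dim, 200, 200, 50, 3])


# The part of the count that does not depend on the observation dimension.
print(count_for_observation_dim(0))
# Each additional input adds 200 weights to the first layer.
print(count_for_observation_dim(1) - count_for_observation_dim(0))
```

Under these assumptions the count is $200\,d + 50{,}603$ for an observation dimension $d$, so any $d \ge 97$ would be consistent with a network of more than 70,000 parameters.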
With safety being a large concern in UAV flight, we design the policy to take into account safety and control considerations. The relative heading output from the neural network policy is intended to be used by a PID controller to track the heading. The PID controller provides low-level control commands to the aircraft (thrust, aileron, elevator, rudder). However, to prevent cases where the neural network policy guides the aircraft into crashing into the ground or allies, etc., we override the neural network heading with an avoidance heading if the aircraft is about to collide with something. This helps to focus the learning process on how to intelligently interact with the environment and allies rather than learning how to avoid obvious mistakes. Furthermore, by designing the policy in a structured and interpretable way, it will be easier to take the learned policy directly from simulation into the real world. Since the neural network component of the policy does not produce low-level commands, it is invariant to different low-level controllers, dynamics, PID gains, etc. This aids in learning more transferrable policies for real-world applications.
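The layered structure described above (neural network heading, safety override, PID tracking) can be sketched as follows. Headings are simplified to scalars without angle wraparound, and the class, function names, and gains are hypothetical; the actual controller in the released code may differ.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Textbook PID controller acting on a scalar error."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    previous_error: float = 0.0

    def update(self, error, dt):
        self.integral += error * dt
        derivative = (error - self.previous_error) / dt
        self.previous_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def select_heading(network_heading, avoidance_heading, collision_imminent):
    """Safety override: the avoidance heading wins whenever a collision is imminent."""
    return avoidance_heading if collision_imminent else network_heading


def control_step(pid, network_heading, avoidance_heading, collision_imminent,
                 current_heading, dt):
    """Track the selected heading with the low-level PID controller."""
    target = select_heading(network_heading, avoidance_heading, collision_imminent)
    return pid.update(target - current_heading, dt)


pid = PID(kp=2.0, ki=0.0, kd=0.0)
print(control_step(pid, 0.3, -1.0, False, 0.1, dt=0.1))
```

Because the learned component only proposes a heading, the PID gains or the avoidance logic can be changed without retraining the network, which is the invariance argued for in the text.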
We consider two scenarios: a base attack scenario where a team of 50 fixed wing aircraft must attack an enemy base defended by 20 quadcopters, and a team competitive task where two teams concurrently learn to defeat each other. In both tasks we use the following reward function:
The reward function encourages air-to-air combat, as well as suicide attacks against the enemy base (e.g. a swarm of cheap, disposable drones carrying payloads). The last term encourages the aircraft to move towards the enemy during the initial phases of learning.
III-A Base Attack Task
Figure 4: Scores per iteration for the base attack task. Top: Scores earned by perturbed policies during training. Scores are on average lower because they result from policies which are parameterized by randomly perturbed values. Bottom: Scores during the course of training earned by the updated policy parameters. Red curve is the Evolution Strategies algorithm, blue is the Cross Entropy Method. Bold line is the median, shaded areas are 25/75 quartile bounds.
In this scenario a team of 50 fixed-wing aircraft must attack an enemy base defended by 20 quadcopters (Figure 3). The quadcopters use a hand-crafted policy where, in the absence of an enemy, they spread themselves out evenly to cover the base. In the presence of an enemy they target the closest enemy, match that enemy’s altitude, and fire repeatedly. We used , a time step of seconds, and total episode length of seconds. Initial positions of both teams were randomized in a fixed area at opposite ends of the arena. Training took two days with full parallelization on a machine equipped with a Xeon Phi CPU (244 threads).
We found that over the course of training the fixed-wing team learned a policy where they quickly form a V-formation and approach the base. Some aircraft suicide-attack the enemy base while others begin dog-fighting (see supplementary video at https://goo.gl/dWvQi7). We also compared our implementation of the ES method against the well-known Cross Entropy Method (CEM) [6]. CEM performs significantly worse than ES (Figure 4). We hypothesize this is because CEM throws out a significant fraction of sampled parameters and therefore obtains a worse estimate of the gradient of (3). Comparison against other full second-order methods such as CMA-ES or the full second-order GASS algorithm is unrealistic due to the large number of parameters in the neural network and the prohibitive computational cost of computing the covariances of those parameters.
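For contrast with ES, a minimal sketch of a Cross Entropy Method iteration is shown below. It uses an independent Gaussian per parameter, an elite fraction, a standard-deviation floor, and a toy quadratic objective; all of these choices and names are assumptions for illustration and do not reproduce the CEM configuration used in the comparison.

```python
import random
import statistics


def cem_step(mean, std, objective, n_samples, elite_fraction, rng, min_std=1e-3):
    """One CEM iteration: sample, keep the elites, refit mean and std to them."""
    dim = len(mean)
    samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
               for _ in range(n_samples)]
    samples.sort(key=objective, reverse=True)
    n_elite = max(2, int(n_samples * elite_fraction))
    elites = samples[:n_elite]  # all remaining samples are discarded
    new_mean = [statistics.fmean([e[j] for e in elites]) for j in range(dim)]
    new_std = [max(min_std, statistics.stdev([e[j] for e in elites]))
               for j in range(dim)]
    return new_mean, new_std


def toy_objective(theta):
    """Smooth toy objective with its maximum at theta = (3, 3)."""
    return -sum((x - 3.0) ** 2 for x in theta)


rng = random.Random(0)
mean, std = [0.0, 0.0], [2.0, 2.0]
for _ in range(30):
    mean, std = cem_step(mean, std, toy_objective, 100, 0.2, rng)
print([round(m, 2) for m in mean])
```

With an elite fraction of 0.2, four out of five evaluated samples contribute nothing to the update, whereas the ES update weights every sample by its shaped return.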
III-B Two Team Competitive Match
The second scenario we consider is where two teams, each equipped with its own unique policy for its agents, learn concurrently to defeat their opponent (Figure 5). At each iteration, simulations are spawned, each with a different random perturbation, and with each team having a different perturbation. The updates for each policy are calculated based on the scores received from playing the opponent’s perturbed policies. The result is that each team learns to defeat a wide range of opponent behaviors at each iteration. We observed that the behavior of the two teams quickly approached a Nash equilibrium where both sides try to defeat the maximum number of opponent aircraft in order to prevent higher-scoring suicide attacks (see supplementary video). The end result is a stalemate with both teams annihilating each other, ending with tied scores (Figure 6). We hypothesize that more varied behavior could be learned by having each team compete against some past enemy team behaviors or by building a library of policies from which to select, as frequently discussed by the evolutionary computation community [19].
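The concurrent learning scheme can be sketched as follows: every simulated match draws one perturbation per team, and each team is then updated from its own scores against the perturbed opponents. The toy match function, the standardized (rather than rank-based) return shaping, the plain gradient step, and all names are assumptions made to keep the sketch short.

```python
import random


def standardize(values):
    """Shift and scale scores to zero mean and unit standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5 or 1.0
    return [(v - mean) / std for v in values]


def perturb(theta, sigma, rng):
    """Return the sampled noise vector and the perturbed parameters."""
    noise = [rng.gauss(0.0, 1.0) for _ in theta]
    return noise, [t + sigma * e for t, e in zip(theta, noise)]


def es_update(theta, noises, scores, sigma, lr):
    """ES step from shaped scores and the noise that produced them."""
    shaped = standardize(scores)
    n = len(scores)
    return [t + lr * sum(s * eps[j] for s, eps in zip(shaped, noises)) / (n * sigma)
            for j, t in enumerate(theta)]


def competitive_iteration(theta_a, theta_b, play_match, n_matches, sigma, lr, rng):
    """Both teams are perturbed in every match and updated from their own scores."""
    noises_a, noises_b, scores_a, scores_b = [], [], [], []
    for _ in range(n_matches):
        eps_a, candidate_a = perturb(theta_a, sigma, rng)
        eps_b, candidate_b = perturb(theta_b, sigma, rng)
        score_a, score_b = play_match(candidate_a, candidate_b)
        noises_a.append(eps_a)
        noises_b.append(eps_b)
        scores_a.append(score_a)
        scores_b.append(score_b)
    return (es_update(theta_a, noises_a, scores_a, sigma, lr),
            es_update(theta_b, noises_b, scores_b, sigma, lr))


def toy_match(params_a, params_b):
    """Hypothetical zero-sum match: the larger first parameter wins the margin."""
    margin = params_a[0] - params_b[0]
    return margin, -margin


new_a, new_b = competitive_iteration([0.0], [0.0], toy_match, 200, 0.1, 0.05,
                                     random.Random(0))
print(round(new_a[0], 3), round(new_b[0], 3))
```

In this toy game both teams increase their parameter at the same rate, so the margin stays near zero, which loosely mirrors the tied outcome reported for the competitive scenario.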
We have shown that Evolution Strategies are applicable for learning policies with many thousands of parameters for a wide range of complex tasks in both the competitive and cooperative multi-agent setting. By showing the connection between ES and more well-understood model-based stochastic search methods, we are able to gain insight into future algorithm design. Future work will include experiments with optimizing mixed parameterizations, e.g. optimizing both neural network weights and PID gains. In this case, the second-order treatment of non-neural-network parameters may be more beneficial, since the behavior of the system may be more sensitive to perturbations of non-neural-network parameters. Another direction of investigation could be optimizing unique policies for each agent in the team. Yet another direction would be comparing other evolutionary computation strategies for training neural networks, including methods which use a more diverse population [20], or more genetic algorithm-type heuristics [21].
[1] Y. Li, “Deep Reinforcement Learning: An Overview,” ArXiv e-prints, Jan. 2017.
[2] K. Stanley and B. Bryant, “Real-time neuroevolution in the NERO video game,” IEEE transactions on, 2005. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1545941
[3] O. J. Coleman, “Evolving Neural Networks for Visual Processing,” Thesis, 2010.
[4] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever, “Evolution Strategies as a Scalable Alternative to Reinforcement Learning,” ArXiv e-prints, Mar. 2017.
[5] J. Hu, “Model-based stochastic search methods,” in Handbook of Simulation Optimization. Springer, 2015, pp. 319–340.
[6] S. Mannor, R. Rubinstein, and Y. Gat, “The cross entropy method for fast policy search,” in Machine Learning-International Workshop Then Conference-, vol. 20, no. 2, 2003, Conference Proceedings, p. 512.
[7] N. Hansen, “The CMA evolution strategy: A tutorial,” CoRR, vol. abs/1604.00772, 2016. [Online]. Available: http://arxiv.org/abs/1604.00772
[8] E. Zhou and J. Hu, “Gradient-based adaptive stochastic search for non-differentiable optimization,” IEEE Transactions on Automatic Control, vol. 59, no. 7, pp. 1818–1832, 2014.
[9] L. Panait and S. Luke, “Cooperative multi-agent learning: The state of the art,” Autonomous Agents and Multi-Agent Systems, vol. 11, no. 3, pp. 387–434, 2005. [Online]. Available: http://link.springer.com/10.1007/s10458-005-2631-2
[10] J. B. Rawlings and B. T. Stewart, “Coordinating multiple optimization-based controllers: New opportunities and challenges,” Journal of Process Control, vol. 18, no. 9, pp. 839–845, 2008.
[11] W. Al-Gherwi, H. Budman, and A. Elkamel, “Robust distributed model predictive control: A review and recent developments,” The Canadian Journal of Chemical Engineering, vol. 89, no. 5, pp. 1176–1190, 2011. [Online]. Available: http://doi.wiley.com/10.1002/cjce.20555
[12] G. B. Lamont, J. N. Slear, and K. Melendez, “UAV swarm mission planning and routing using multi-objective evolutionary algorithms,” in IEEE Symposium Computational Intelligence in Multicriteria Decision Making, no. Mcdm, 2007, Conference Proceedings, pp. 10–20.
[13] A. R. Yu, B. B. Thompson, and R. J. Marks, “Competitive evolution of tactical multiswarm dynamics,” IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans, vol. 43, no. 3, pp. 563–569, 2013.
[14] D. D. Fan, E. Theodorou, and J. Reeder, “Evolving cost functions for model predictive control of multi-agent UAV combat swarms,” in Proceedings of the Genetic and Evolutionary Computation Conference Companion, ser. GECCO ’17. New York, NY, USA: ACM, 2017, pp. 55–56. [Online]. Available: http://doi.acm.org/10.1145/3067695.3076019
[15] U. Gaerther, “UAV swarm tactics: an agent-based simulation and Markov process analysis,” 2015. [Online]. Available: https://calhoun.nps.edu/handle/10945/34665
[16] K. J. DeMarco. (2018) SCRIMMAGE multi-agent robotics simulator. [Online]. Available: http://www.scrimmagesim.org/
[17] D. Wierstra, T. Schaul, T. Glasmachers, Y. Sun, J. Peters, and J. Schmidhuber, “Natural evolution strategies,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 949–980, 2014.
[18] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[19] K. O. Stanley and R. Miikkulainen, “Competitive coevolution through evolutionary complexification,” Journal of Artificial Intelligence Research, vol. 21, pp. 63–100, 2004.
[20] E. Conti, V. Madhavan, F. Petroski Such, J. Lehman, K. O. Stanley, and J. Clune, “Improving Exploration in Evolution Strategies for Deep Reinforcement Learning via a Population of Novelty-Seeking Agents,” ArXiv e-prints, Dec. 2017.
[21] F. Petroski Such, V. Madhavan, E. Conti, J. Lehman, K. O. Stanley, and J. Clune, “Deep Neuroevolution: Genetic Algorithms Are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning,” ArXiv e-prints, Dec. 2017.