1 Introduction
Gliding is an intrinsically efficient motion that relies on the body shape to extract momentum from the air flow while performing minimal mechanical work to control attitude. The sheer diversity of animal and plant species that have independently evolved the ability to glide is a testament to the efficiency and usefulness of this mode of transport. Well-known examples include birds that soar on thermal winds, fish that employ burst-and-coast swimming, and plant seeds, such as the samara, that spread by gliding. Furthermore, arboreal animals that live in forest canopies often employ gliding to avoid earthbound predators, forage across long distances, chase prey, and safely recover from falls. Characteristic of gliding mammals is the membrane (patagium) that develops between legs and arms. When extended, the patagium transforms the entire body into a wing, allowing the mammal to stay airborne for extended periods of time (Jackson, 2000). Analogous body adaptations have developed in species of lizards (Mori & Hikida, 1994) and frogs (McCay, 2001).
Most surprisingly, gliding has developed in animal species characterized by blunt bodies lacking specialized lift-generating appendages. Snakes of the genus Chrysopelea launch themselves from trees, flatten and camber their bodies to form a concave cross-section, and perform sustained aerial undulations to generate enough lift to match the gliding performance of mammalian gliders (Socha, 2002). Wingless insects such as tropical arboreal ants (Yanoviak et al., 2005) and bristletails (Yanoviak et al., 2009) are able to glide when falling from the canopy, in order to avoid the possibly flooded or otherwise hazardous forest understory. During descent these canopy-dwelling insects identify the target tree trunk using visual cues (Yanoviak & Dudley, 2006) and orient their horizontal trajectory accordingly.
Most bird species alternate active flapping with gliding in order to reduce physical effort during long-range flight (Rayner, 1985). Similarly, gliding is an attractive solution for extending the range of micro air vehicles (MAVs). MAV designs often rely on arrays of rotors (i.e., quadcopters) due to their simple structure and to the existence of simplified models that capture the main aspects of the underlying fluid dynamics. The combination of these two features enables precise control techniques (Gurdan et al., 2007; Lupashin et al., 2010) for performing complex flight maneuvers (Müller et al., 2011; Mellinger et al., 2013). However, the main drawback of rotor-propelled MAVs is their limited flight time, which restricts real-world applications. Several solutions for extending the range of MAVs have been proposed, including precise perching maneuvers (Thomas et al., 2016) and mimicking flying animals by designing fliers capable of gliding (Abas et al., 2016).
Here we study the ability of falling blunt-shaped bodies, lacking any specialized lift-generating feature, to learn gliding strategies through Reinforcement Learning (Bertsekas, 1995; Kaelbling et al., 1996; Sutton & Barto, 1998). The goal of the RL agent is to control its descent towards a set target landing position and perching angle. The agent is modeled by a simple dynamical system describing the passive, planar, gravity-driven descent of a cylindrical object in a quiescent fluid. The simplicity of the model stems from a parameterized description of the fluid forces that has been developed through simulations and experimental studies (Wang et al., 2004; Andersen et al., 2005a,b). Following the work of Paoletti & Mahadevan (2011), we augment the original, passive dynamical system with active control. We identify optimal control policies through Reinforcement Learning, a semi-supervised learning framework that has been employed successfully in a number of flow control problems (Gazzola et al., 2014, 2016; Reddy et al., 2016; Novati et al., 2017; Colabrese et al., 2017). We employ recent advances in coupling RL with deep neural networks (Mnih et al., 2015; Wang et al., 2016; Novati & Koumoutsakos, 2018). These so-called Deep Reinforcement Learning algorithms have been shown, in several problems, to match and even surpass the performance of control policies obtained by classical approaches.

The paper is organised as follows: we describe the model of an active falling body in section 2 and frame the problem in terms of Reinforcement Learning in section 3. In section 3.1 we present a high-level description of the RL algorithm, and in section 3.2 we describe the reward shaping that combines the time/energy cost with kinematic constraints. We explore the effects of the weight and shape of the agent's body on the optimal gliding strategies in section 4. In sections 5 and 6 we compare the optimal RL policies to the optimal control (OC) trajectories of Paoletti & Mahadevan (2011) and to other RL algorithms, respectively.
2 Model
We model the glider as an ellipse (see figure 1) with semi-axes $a \ge b$ and density $\rho_b$ in a quiescent, incompressible fluid of density $\rho_f$. Under the assumption of planar motion, we can model the system with a set of ordinary differential equations (ODEs) (Lamb, 1932). The dimensionless form of the ODEs for the ellipse's translational and rotational degrees of freedom can be written as
(Andersen et al., 2005a; Paoletti & Mahadevan, 2011):

$$\left(\rho^* + \beta\right)\dot{u} = \left(\rho^* + \beta^{-1}\right) v\,\omega - \Gamma v - \sin\theta - F_u, \qquad (1)$$

$$\left(\rho^* + \beta^{-1}\right)\dot{v} = -\left(\rho^* + \beta\right) u\,\omega + \Gamma u - \cos\theta - F_v, \qquad (2)$$

$$\left[\tfrac{1}{4}\rho^*\left(1+\beta^2\right) + \tfrac{1}{8}\beta^{-1}\left(1-\beta^2\right)^2\right]\dot{\omega} = \left(\beta - \beta^{-1}\right) u\,v + T - \tau, \qquad (3)$$

$$\dot{x} = u\cos\theta - v\sin\theta, \qquad (4)$$

$$\dot{y} = u\sin\theta + v\cos\theta, \qquad (5)$$

$$\dot{\theta} = \omega. \qquad (6)$$
Here $u$ and $v$ denote the projections of the velocity along the ellipse's semi-axes, $\theta$ is the angle between the major semi-axis and the horizontal direction, $\omega = \dot{\theta}$ is the angular velocity, and $(x, y)$ is the position of the center of mass. Closure of the above system requires expressions for the forces $F_u$ and $F_v$, the dissipative torque $\tau$, and the circulation $\Gamma$. Here, we approximate them in terms of a parametric model that has been developed through numerical and experimental studies by Wang et al. (2004) and Andersen et al. (2005a,b):

$$\left(F_u, F_v\right) = \frac{1}{\pi}\left[A - B\,\frac{u^2 - v^2}{u^2 + v^2}\right]\sqrt{u^2 + v^2}\,\left(u, v\right), \qquad (7)$$

$$\tau = \frac{1}{\pi}\left[\mu_1 + \mu_2\,|\omega|\right]\omega, \qquad (8)$$

$$\Gamma = \frac{2}{\pi}\left[-C_T\,\frac{u\,v}{\sqrt{u^2 + v^2}} + C_R\,\omega\right]. \qquad (9)$$
Furthermore, the numerical constants are selected to be valid at intermediate Reynolds numbers (based on the semi-major axis) consistent with those of gliding ants (Yanoviak et al., 2005), with $A = 1.4$, $B = 1.0$, $\mu_1 = \mu_2 = 0.2$, $C_T = 1.2$, and $C_R = \pi$. Given the closure for the fluid forces and torques, the dynamics of the system are characterized by the non-dimensional parameters $\beta = b/a$ and $\rho^* = \rho_b/\rho_f$, where $\rho^*$ is the density ratio.
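The passive descent described by Eqs. (1)-(9) can be sketched as a small initial-value problem. The implementation below is illustrative rather than a reproduction of the authors' code: the coefficient grouping follows a standard falling-card closure, and the constants and initial conditions are assumed values.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Illustrative closure constants from the falling-card literature (assumed)
A, B, MU1, MU2, CT, CR = 1.4, 1.0, 0.2, 0.2, 1.2, np.pi

def glider_rhs(t, state, beta, rho, torque=0.0):
    """Right-hand side of the dimensionless ODEs for the falling ellipse."""
    u, v, w, x, y, th = state
    speed = np.sqrt(u**2 + v**2) + 1e-12           # avoid division by zero
    gamma = (2.0 / np.pi) * (-CT * u * v / speed + CR * w)  # circulation, Eq. 9
    F = (1.0 / np.pi) * (A - B * (u**2 - v**2) / speed**2) * speed
    Fu, Fv = F * u, F * v                          # dissipative fluid forces, Eq. 7
    tau = (1.0 / np.pi) * (MU1 + MU2 * abs(w)) * w # dissipative torque, Eq. 8
    m_u, m_v = rho + beta, rho + 1.0 / beta        # effective (added) masses
    I_eff = 0.25 * rho * (1 + beta**2) + 0.125 * (1 - beta**2)**2 / beta
    du = (m_v * v * w - gamma * v - np.sin(th) - Fu) / m_u
    dv = (-m_u * u * w + gamma * u - np.cos(th) - Fv) / m_v
    dw = ((beta - 1.0 / beta) * u * v + torque - tau) / I_eff
    return [du, dv, dw,
            u * np.cos(th) - v * np.sin(th),       # lab-frame x velocity
            u * np.sin(th) + v * np.cos(th),       # lab-frame y velocity
            w]

# Passive (zero control torque) descent from near rest at a small angle
sol = solve_ivp(glider_rhs, (0.0, 50.0), [0.1, 0.0, 0.0, 0.0, 0.0, 0.1],
                args=(0.1, 200.0), max_step=0.1)
print(sol.y[4, -1])   # final altitude (negative: the body has fallen)
```

With zero control torque the trajectory settles into one of the passive regimes (fluttering, tumbling or steady descent) depending on $\beta$ and $\rho^*$; the active torque argument is the handle used for control in the following sections.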
In active gliding, we assume that the gravity-driven descent can be modified by the agent by modulating the torque $T$ in equation 3. This torque could be achieved by deforming the body in order to move its center of mass, by deflecting the incoming flow, or by extending and rotating limbs; it was introduced by Paoletti & Mahadevan (2011) as a minimal representation of the ability of gliding ants to guide their fall by rotating their hind legs (Yanoviak et al., 2010). This leads to a natural question: how should the active torque be varied in time for the ant to achieve a particular task, such as landing and perching at a particular location with a particular orientation, subject to some constraints, e.g. optimizing time, minimizing power consumption, or maximizing accuracy? This problem was considered by Paoletti & Mahadevan (2011) in an optimal control framework. Here we use an alternative approach, inspired by how organisms might learn: reinforcement learning (Sutton & Barto, 1998).
3 Reinforcement Learning for landing and perching
The tasks of landing and perching are achieved by the falling body by employing a Reinforcement Learning (RL) framework (Bertsekas, 1995; Kaelbling et al., 1996; Sutton & Barto, 1998) to identify its control actions.
In the following we provide a brief overview of the RL framework in the context of flow control and outline the algorithms used in the present study. Reinforcement Learning is a semi-supervised learning framework with a broad range of applications, from robotics (Levine et al., 2016) and games (Mnih et al., 2015; Silver et al., 2016) to flow control (Gazzola et al., 2014). In RL, the control actor (termed 'agent') interacts with its environment by sampling its states ($s$), performing actions ($a$) and receiving rewards ($r$). At each time step $t$ the agent performs the action $a_t$ and the system is advanced in time for an interval $\Delta t$, before the agent observes its new state $s_{t+1}$, receives a scalar reward $r_{t+1}$, and chooses a new action $a_{t+1}$. The agent infers a policy $\pi(a \mid s)$ through its repeated interactions with the environment so as to maximize its long-term rewards. The optimal policy is found by maximizing the expected utility:

$$\pi^* = \arg\max_{\pi}\; \mathbb{E}_{\pi}\left[\sum_{t \ge 0} r_{t+1}\right]. \qquad (10)$$

Once the optimal policy has been inferred, the agent can interact autonomously with the environment.
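The interaction cycle described above can be sketched generically. The environment and policy below are stand-ins (a one-dimensional "descent" toy problem invented for illustration), not the glider model; the point is the observe-act-reward loop itself.

```python
import random

def run_episode(env_step, env_reset, policy, max_steps=1000):
    """Generic RL interaction loop: observe state, sample action, collect reward."""
    s = env_reset()
    trajectory, total_reward = [], 0.0
    for t in range(max_steps):
        a = policy(s)                      # sample action from stochastic policy
        s_next, r, done = env_step(s, a)   # advance the system by one time step
        trajectory.append((s, a, r))
        total_reward += r
        s = s_next
        if done:
            break
    return trajectory, total_reward

# Stand-in 1D environment: state = altitude, action = a small control nudge
def env_reset():
    return 10.0

def env_step(s, a):
    s_next = s - 1.0 + a                   # constant fall plus control input
    done = s_next <= 0.0                   # episode ends on touching the ground
    return s_next, -1.0, done              # reward of -1 per step (a time cost)

policy = lambda s: random.gauss(0.0, 0.1)  # Gaussian policy, fixed mean and std

traj, R = run_episode(env_step, env_reset, policy)
```

With a per-step reward of $-1$, maximizing the return is equivalent to minimizing the number of steps to the ground, a miniature version of the time-optimal objective used later.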
When the tasks of the agent satisfy the Markov property within finite state and action spaces, they are called finite Markov decision processes (MDPs). Following Sutton & Barto (1998), a finite MDP is defined, for each current state and action pair $(s, a)$, by the one-step dynamics of the environment, which are in turn described by the probability $\mathcal{P}^a_{ss'} = \Pr\left\{s_{t+1} = s' \mid s_t = s,\, a_t = a\right\}$ of any possible next state $s'$ and by the expected value of the next reward, defined as:

$$\mathcal{R}^a_{ss'} = \mathbb{E}\left[r_{t+1} \mid s_t = s,\, a_t = a,\, s_{t+1} = s'\right]. \qquad (11)$$
The value function for each state $s$ provides an estimate of future rewards given a certain policy $\pi$. For an MDP we define the state-value function as:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k\, r_{t+k+1} \,\middle|\, s_t = s\right], \qquad (12)$$
where $\mathbb{E}_{\pi}$ denotes the expected value for the agent when it performs the policy $\pi$, and $\gamma$ is a discount factor ($0 \le \gamma \le 1$) for future rewards. The action-value function, satisfying the celebrated Bellman equation (Bellman, 1952), can be written as:

$$Q^{\pi}(s, a) = \mathbb{E}\left[r_{t+1} + \gamma \sum_{a'} \pi(a' \mid s_{t+1})\, Q^{\pi}(s_{t+1}, a') \,\middle|\, s_t = s,\, a_t = a\right]. \qquad (13)$$
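The Bellman recursion can be made concrete on a toy problem. The two-state MDP below is entirely hypothetical, and the backup applied is the standard optimality variant (a max over actions) rather than a fixed-policy evaluation:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[s][a] -> list of (prob, next_state, reward)
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9

def value_iteration(P, gamma, iters=500):
    """Repeatedly apply the Bellman optimality backup to estimate V*."""
    V = np.zeros(len(P))
    for _ in range(iters):
        for s in P:
            V[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                       for a in P[s])
    return V

V = value_iteration(P, gamma)
# In state 1, repeating action 1 yields 2 + 2*gamma + ... = 2/(1 - gamma) = 20
```

Since the backup is a contraction in the sup norm with factor $\gamma$, the iterates converge geometrically to the unique fixed point, which is why a few hundred sweeps suffice here.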
The state-value function is the expected action-value under the policy $\pi$: $V^{\pi}(s) = \mathbb{E}_{a \sim \pi}\left[Q^{\pi}(s, a)\right]$. We note that $V^{\pi}$ and $Q^{\pi}$ may be described by tables that reflect discrete sets of states and actions. Such approaches have been used with some success in fluid mechanics applications (Gazzola et al., 2014; Colabrese et al., 2017), but they often lead to poor training and are susceptible to noisy flow environments. In turn, continuous approximations of these functions, such as those employed herein, have been shown to lead to robust and efficient learning policies (Verma et al., 2018).
We remark that the value functions and the Bellman equation are also inherent to Dynamic Programming (DP) (Bertsekas, 1995). RL is inherently linked to DP and is often referred to as approximate DP. The key difference between RL and dynamic programming is that RL does not require an accurate model of the environment and infers the optimal policy by sampling. Moreover, RL can be applied to problems that are not MDPs. As such, RL is in general more computationally intensive than DP and optimal control, but at the same time it can handle black-box problems and is robust to noisy and stochastic environments. With the advancement of computational capabilities, we believe that RL is becoming a valid complement to optimal control and other machine learning strategies
(Duriez et al., 2017) for fluid mechanics problems.

In the perching and landing context, we consider an agent, initially located at $(x_0, y_0)$, that has the objective of landing at a target location $x_{\mathrm{goal}}$ with perching angle $\varphi$. By describing the trajectory with the model outlined in section 2, the state of the agent is completely defined at every time step by the state vector $s = (x, y, u, v, \theta, \omega)$. With a finite time interval $\Delta t$ between decisions, the agent observes its state and, based on that state, samples a stochastic control policy to select an action. In the case of perching and landing, an episode is terminated when, at some terminal time, the agent touches the ground. Because the gravitational force acting on the glider ensures that each trajectory lasts a finite number of steps, we can avoid discounting and set $\gamma = 1$. We consider continuous-valued controls defined by Gaussian policies, which allow for fine-grained corrections (in contrast to the usually employed discretized controls, e.g. Novati et al. (2017)). The action determines the constant control torque $T$ exerted by the agent between times $t$ and $t + \Delta t$.

3.1 Off-policy actor-critic
We solve the RL problem with a novel off-policy actor-critic algorithm named Racer (Novati & Koumoutsakos, 2018). The algorithm relies on training a neural network (NN), defined by weights $w$, to obtain a continuous approximation of the policy $\pi^w$, the state value $V^w$ and the action value $Q^w$. The network receives the state $s$ as input and produces as output a set of parameters that are further explained below. The policy for each state is approximated with a Gaussian having a mean $\mu^w(s)$ and a standard deviation $\sigma^w(s)$:

$$\pi^w(a \mid s) = \frac{1}{\sqrt{2\pi}\,\sigma^w(s)}\,\exp\left[-\frac{\left(a - \mu^w(s)\right)^2}{2\,\sigma^w(s)^2}\right]. \qquad (14)$$
The standard deviation is initially wide enough to adequately explore the dynamics of the system. In turn, we also require a continuous estimate of the state-action value function $Q^w(s, a)$. Here, rather than having a specialized network that includes the action $a$ in its input, we compute the estimate by combining the network's state-value estimate with a quadratic term with vertex at the mean of the policy:

$$A^w(s, a) = -\,P^w(s)\left(a - \mu^w(s)\right)^2, \qquad (15)$$

$$Q^w(s, a) = V^w(s) + A^w(s, a). \qquad (16)$$
This definition ensures that $\max_a Q^w(s, a) = Q^w\!\left(s, \mu^w(s)\right) = V^w(s)$. Here $P^w(s) \ge 0$ is an output of the network describing the rate at which the action value decreases for actions farther away from $\mu^w(s)$. This parameterization relies on the assumption that, for any given state, $Q^w$ is maximal at the mean of the policy. Since the dynamics of the system are described by a small number of ordinary differential equations that can be solved at each instant to determine the state of the system, in contrast with the need to solve the full Navier-Stokes equations (Novati et al., 2017), we can use a continuous action space, and further use a multi-layer perceptron (Sutton et al., 2000) rather than recurrent neural networks as policy approximator.
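The parameterization of equations (14)-(16) can be sketched for a single state and a scalar action. The numerical outputs below stand in for what the trained network would produce; only the functional form follows the text:

```python
import math

def policy_head(outputs, a):
    """Given network outputs (mu, sigma, V, P) for one state, evaluate
    the Gaussian policy density (Eq. 14) and the quadratic action value (Eq. 16)."""
    mu, sigma, V, P = outputs
    log_pi = (-0.5 * ((a - mu) / sigma) ** 2
              - math.log(sigma) - 0.5 * math.log(2 * math.pi))
    advantage = -P * (a - mu) ** 2        # Eq. (15): parabola with vertex at a = mu
    Q = V + advantage                     # Eq. (16): Q is maximal at the mean
    return math.exp(log_pi), Q

# Hypothetical outputs for some state: mean, std, state value, curvature
pi_a, Q_a = policy_head((0.5, 1.0, 3.0, 2.0), a=0.5)
# At a = mu the density equals 1/sqrt(2*pi) and Q reduces to V
```

Because $P^w(s) \ge 0$, any action away from $\mu^w(s)$ is assigned a strictly lower value than the mean, which is exactly the assumption stated above.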
The learning process advances by iteratively sampling the dynamical system in order to assemble a set of training trajectories $\mathcal{D}$. A trajectory is a sequence of observations. An observation is defined as the collection of all information available to the agent at time $t$: the state $s_t$, the reward $r_t$, the policy used at time $t$, and the sampled action $a_t$. Here we make a distinction between the policy executed at time $t$ and the current one because, by the time the data is used for training, the weights of the NN may have changed, causing the current policy for state $s_t$ to differ. For each new observation from the environment, a number of observations are sampled from the dataset $\mathcal{D}$. Finally, the network weights are updated through back-propagation of the policy and value-function gradients.
The policy parameters $\mu^w$ and $\sigma^w$ are improved through the off-policy policy gradient estimator (Degris et al., 2012):

$$g_{\pi}(w) = \mathbb{E}_{s_t, a_t \sim \mathcal{D}}\left[\frac{\pi^w(a_t \mid s_t)}{\pi_{\mathrm{old}}(a_t \mid s_t)}\, \hat{Q}(s_t, a_t)\, \nabla_w \log \pi^w(a_t \mid s_t)\right], \qquad (17)$$

where $\hat{Q}$ is an estimator of the action value. A key insight from policy-gradient based algorithms is that the parameterized $Q^w$ cannot safely be used to approximate on-policy returns, due to its inaccuracy during training (Sutton et al., 2000). On the other hand, obtaining $\hat{Q}$ through Monte Carlo sampling is often computationally prohibitive. Hence, we approximate $\hat{Q}$ with the Retrace algorithm (Munos et al., 2016), which can be written recursively as:

$$\hat{Q}^{\mathrm{ret}}_t = r_{t+1} + \gamma\, V^w(s_{t+1}) + \gamma \min\left\{1, \rho_{t+1}\right\}\left[\hat{Q}^{\mathrm{ret}}_{t+1} - Q^w(s_{t+1}, a_{t+1})\right]. \qquad (18)$$
The importance weight $\rho_t = \pi^w(a_t \mid s_t)/\pi_{\mathrm{old}}(a_t \mid s_t)$ is the ratio of the probabilities of sampling the action $a_t$ from state $s_t$ with the current policy and with the old policy.
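The recursion of Eq. 18 is naturally computed by sweeping a stored trajectory backwards from the terminal state. The sketch below assumes per-step lists of rewards, value estimates, action values and importance weights, and a terminal state of zero value; variable names are illustrative:

```python
def retrace_targets(rewards, values, q_values, rhos, gamma=1.0):
    """Compute Retrace action-value targets (cf. Eq. 18) by a backward sweep.
    Per-step inputs: rewards[t] = r_{t+1}, values[t] = V(s_t),
    q_values[t] = Q(s_t, a_t), rhos[t] = importance weight at step t.
    The terminal state is assumed to have zero value."""
    T = len(rewards)
    q_ret = [0.0] * T
    next_ret, next_v, next_q, next_rho = 0.0, 0.0, 0.0, 1.0
    for t in reversed(range(T)):
        q_ret[t] = (rewards[t] + gamma * next_v
                    + gamma * min(1.0, next_rho) * (next_ret - next_q))
        next_ret, next_v = q_ret[t], values[t]
        next_q, next_rho = q_values[t], rhos[t]
    return q_ret

# On-policy data (all rhos = 1) with Q == V: the correction term collapses
# and the targets reduce to the undiscounted returns-to-go
targets = retrace_targets([1.0, 1.0, 1.0], [0.5, 0.5, 0.5],
                          [0.5, 0.5, 0.5], [1.0, 1.0, 1.0], gamma=1.0)
```

The truncation $\min\{1, \rho\}$ is what keeps the estimator stable off-policy: actions that have become unlikely under the current policy contribute with a bounded weight instead of an exploding one.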
The state value $V^w$ and the action-value coefficient $P^w$ are trained with the importance-sampled gradient of the distance of $Q^w$ from $\hat{Q}^{\mathrm{ret}}$:

$$g_Q(w) = \mathbb{E}_{s_t, a_t \sim \mathcal{D}}\left[\min\left\{1, \rho_t\right\}\left(\hat{Q}^{\mathrm{ret}}_t - Q^w(s_t, a_t)\right)\nabla_w Q^w(s_t, a_t)\right]. \qquad (19)$$
Further implementation details of the algorithm can be found in Novati & Koumoutsakos (2018).
3.2 Reward formulation
We wish to identify energy-optimal and time-optimal control policies by varying the aspect and density ratios that define the system of ODEs. In the optimal control setting, boundary conditions, such as the initial and terminal positions of the ellipse, and constraints, such as bounds on the power or torque, can be included directly in the problem formulation, as in Paoletti & Mahadevan (2011). In RL, boundary conditions can only be included in the reward formulation. The agent is discouraged from violating optimization constraints by introducing a condition for termination of a simulation, accompanied by negative terminal rewards. For example, here we inform the agent about the landing target by composing the reward as:

$$r_t = \left|x_{t-1} - x_{\mathrm{goal}}\right| - \left|x_t - x_{\mathrm{goal}}\right| - c_t, \qquad (20)$$

where $c_t$ is the optimal control cost function, which can either be $c_t = \Delta t$ for learning time-optimal policies, or $c_t = T^2 \Delta t$ for energy-optimal policies. The control cost is used as a proxy for the energy cost, as in Paoletti & Mahadevan (2011). Note that for a trajectory monotonically approaching $x_{\mathrm{goal}}$, the position terms of the cumulative reward telescope, so that the RL and optimal control cost functions differ only by $\left|x_0 - x_{\mathrm{goal}}\right| - \left|x_T - x_{\mathrm{goal}}\right|$. If the exact target location is reached at the terminal state, the discrepancy between the two formulations is the constant baseline $\left|x_0 - x_{\mathrm{goal}}\right|$, which can be proved not to affect the policy (Ng et al., 1999). Therefore, an RL agent that maximizes cumulative rewards also minimizes either the time or the energy cost.
The episodes are terminated when the ellipse touches the ground. In order to allow the agent to explore diverse perching maneuvers, such as phugoid motions, the ground is recessed between the starting position and the landing target. For both time-optimal and energy-optimal training, the terminal reward is given by:

$$r_T = \eta\left(\exp\left[-\left(x_T - x_{\mathrm{goal}}\right)^2\right] + \exp\left[-\left(\theta_T - \varphi\right)^2\right]\right), \qquad (21)$$

where the scaling constant $\eta$ takes different values when training for energy-optimal or time-optimal policies. The second exponential term of Eq. 21 is added only if the glider lands within a tolerance of $x_{\mathrm{goal}}$, in order to prevent the policy from landing away from the target by relying on the perching-angle bonus while minimizing time/energy costs.
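The reward shaping above can be sketched as two small functions. The scale `eta` and the tolerance `tol` are illustrative placeholders, not the values used in the experiments:

```python
import math

def step_reward(x_prev, x_curr, x_goal, dt, torque=None):
    """Shaped step reward (cf. Eq. 20): progress toward the target minus the
    running cost (dt for time-optimal, torque^2 * dt for energy-optimal)."""
    cost = dt if torque is None else torque**2 * dt
    return abs(x_prev - x_goal) - abs(x_curr - x_goal) - cost

def terminal_reward(x_T, theta_T, x_goal, phi, eta=10.0, tol=0.5):
    """Terminal bonus (cf. Eq. 21): the perching-angle term is granted only
    when the glider lands close enough to the target (eta, tol assumed)."""
    r = eta * math.exp(-(x_T - x_goal) ** 2)
    if abs(x_T - x_goal) < tol:
        r += eta * math.exp(-(theta_T - phi) ** 2)
    return r

# Summing Eq. 20 along a trajectory telescopes: the position terms add up to
# |x_0 - x_goal| - |x_T - x_goal|, a baseline that leaves the policy unchanged
rs = [step_reward(x0, x1, 10.0, 0.1) for x0, x1 in [(0, 2), (2, 5), (5, 9)]]
```

The conditional angle bonus implements the safeguard described above: a policy cannot harvest the perching-angle term while landing far from the target.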
4 Results
We explore the gliding strategies of RL agents that aim to minimize either time-to-target or energy expenditure, by varying the aspect ratio and density ratio of the falling ellipse. These two optimization objectives may be seen as corresponding to the biological scenarios of foraging and escaping from predators. Figure 2 shows the two prevailing flight patterns learned by the RL agent, which we refer to as 'bounding' and 'tumbling' flight (Paoletti & Mahadevan, 2011). The name 'bounding' flight derives from an energy-saving flight strategy first analyzed by Rayner (1977) and Lighthill (1977) with simplified models of intermittently flapping fliers.
In the present model, bounding flight is characterized by alternating phases of gliding and tumbling. During gliding, the agent exerts negative torque to maintain a small angle of attack (represented by the blue snapshots of the glider in Fig. 2), deflecting momentum from the air flow, which slows down the descent. During the tumbling phase, the agent applies a rapid burst of positive torque (red snapshots of the glider in Fig. 2) to generate lift and, after a full rotation, recovers into a gliding attitude.
The trajectory on the $(u, v)$ plane (Fig. 2) highlights that the sign of the control torque is correlated with whether $u$ and $v$ have the same sign. This behavior is consistent with the goal of maintaining upward lift. In fact, the vertical component of the lift applied to the falling ellipse is proportional to $\Gamma \dot{x}$, with $\dot{x} > 0$ because the target position is to the right of the starting position. From Eq. 9 of our ODE-based model, the lift is positive if $u$ and $v$ have opposite signs or if $\omega$ is positive. Therefore, in order to create upward lift, the agent can either exert a positive torque $T$ to generate positive angular velocity or, if $u$ and $v$ have opposite signs, exert a negative $T$ to reduce its angular velocity (Eq. 3) and maintain the current orientation. The grayed-out trajectory shows what would happen during the gliding phase without active negative torque: the ellipse would increase its angle of attack, lose momentum and, eventually, fall vertically.
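The sign argument above can be made concrete with the circulation closure of Eq. 9; for a rightward glide the vertical lift has the sign of the circulation. The helper below is illustrative, with the closure constants assumed from the falling-card literature:

```python
import math

def lift_sign(u, v, w, CT=1.2, CR=math.pi):
    """Sign of the vertical lift (~ Gamma * xdot) for a rightward glide
    (xdot > 0): positive circulation means upward lift (cf. Eq. 9)."""
    speed = math.hypot(u, v)
    gamma = (2.0 / math.pi) * (-CT * u * v / speed + CR * w)
    return 1 if gamma > 0 else -1

# u and v with opposite signs give upward lift even without rotation
assert lift_sign(u=1.0, v=-0.2, w=0.0) == 1
# same signs and no spin: circulation is negative, lift points down
assert lift_sign(u=1.0, v=0.2, w=0.0) == -1
# a sufficiently positive angular velocity recovers upward lift
assert lift_sign(u=1.0, v=0.2, w=0.5) == 1
```

The three cases mirror the two options available to the agent: spin up (positive torque) or hold an attitude in which $u$ and $v$ have opposite signs (negative torque).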
Tumbling flight, visualized in figure 2, is a much simpler pattern obtained by applying an almost constant torque that causes the ellipse to rotate steadily along its trajectory, thereby generating lift. The constant rotation is generally slowed down in the landing phase in order to descend and accurately perch at the target.
In figure 3 we report the effect of the ellipse's shape and weight on the optimal strategies. The system of ODEs described in section 2 is characterized by the non-dimensional parameters $\beta$ and $\rho^*$. Here we independently vary the density ratio and the aspect ratio over a range of values. For each set of dimensionless parameters we train an RL agent to find both the energy-optimal and time-optimal policies. The flight strategies employed by the RL agents can be clearly classified as either bounding or tumbling flight only in the time-optimal setting, while energy-optimal strategies tend to employ elements of both flight patterns. In figure 3, time-optimal policies that employ bounding flight are marked by a triangle, while those that use tumbling flight are marked by a square. We find that lighter and more elongated bodies employ bounding flight, while heavy and thick bodies employ tumbling flight. Only one policy alternated between the two patterns and is marked by a star. These results indicate that a simple linear relation (outlined by a black dashed line in figure 3) approximately describes the boundary between the regions of the phase space where one flight pattern is preferred over the other. In figure 3 we also report the optimal time costs and optimal energy costs for all the combinations of non-dimensional parameters.
Once the RL training terminates, the agent has obtained a set of opaque rules, parameterized by a neural network, to select actions. These rules are approximately optimal only for the states encountered during training, but can also be applied to new conditions. In fact, we find that the policies obtained through RL are remarkably robust. In figure 4 we apply a time-optimal policy to a new set of initial conditions along the $y$ coordinate. Despite the agent never having encountered these positions during training, it always manages to reach the perching target. Similarly, in figure 4 we test the robustness with respect to changes in the parameters of the ODE model. At the beginning of a trajectory, we perturb each model parameter by a multiplicative factor sampled from a log-normal distribution with mean 1 and standard deviation $\sigma_\varepsilon$. The color contours of figure 4 represent the envelopes of trajectories for three increasing values of $\sigma_\varepsilon$: the smallest (blue), 0.2 (green), and 0.4 (orange). Surprisingly, even when the parameters are substantially different from those of the original model, the RL agent always finds its bearings and manages to land in the neighborhood of the target position.

5 Comparison with Optimal Control
Having obtained approximately optimal policies with RL, we now compare them with the trajectories derived from optimal control (OC) by Paoletti & Mahadevan (2011). In figure 5 we show the energy-optimal trajectories, and in figure 6 the time-optimal trajectories. In both cases, we find that the RL agent surpasses the performance of the OC solution: the final energy cost is approximately 2% lower for RL, and the time cost is approximately 4% lower than that of OC. While in principle OC should find locally optimal trajectories, OC solvers (in this case GPOPS, see Paoletti & Mahadevan (2011)) convert the problem into a set of finite-dimensional sub-problems by discretizing time. Therefore the (locally) optimal trajectory is found only up to a finite precision, in some cases allowing RL, which employs a different time-discretization, to achieve better performance.
The RL and OC solutions qualitatively find the same control strategy. The energy-optimal trajectories consist of finding a constant, minimal torque that generates enough lift to reach the target by steady tumbling flight. The time-optimal controller follows a 'bang-bang' pattern that alternately reaches the two bounds of the action space as the glider switches between gliding and tumbling flight. However, the main drawback of RL is having only the reward signal to nudge the system towards satisfying the constraints. We can impose arbitrary initial conditions and bounds on the action space (Sec. 3), but we cannot directly control the terminal state of the glider. Only through expert shaping of the reward function, as outlined in section 3.2, can we train policies that reliably land at the target (within a small tolerance) with the prescribed perching angle.
One of the advantages of RL relative to optimal control, besides not requiring a precise model of the environment, is that RL learns closed-loop control strategies. While OC has to compute de novo an open-loop policy after any perturbation that drives the system away from the planned path, the RL agent selects actions contextually and robustly based on the current state. This suggests that RL policies from simplified, inexpensive models can be transferred to related, more accurate simulations (Verma et al., 2018) or robotic experiments (see, for example, Geng et al. (2016)).
6 Comparison of learning algorithms
The RL agent starts the training by performing actions haphazardly: the control policy is initialized with small random weights and is therefore only weakly affected by the state in which the agent finds itself. Since the desired landing location is encoded in the reward, the agent's trajectories gradually shift towards landing closer to the target.
In order to have a fair comparison with the trajectories obtained through optimal control, the RL agents should be able to land precisely and reliably at the prescribed target position. In general, the behaviors learned through RL are appealing; however, depending on the problem, it can be hard to obtain quantitatively precise control policies. This issue may be observed in figure 7, where we show the time evolution of the distribution of terminal $x$ coordinates during training for three state-of-the-art RL algorithms. Racer manages to reliably land in the proximity of the target after the first 1000 observed trajectories, with a mean squared error of order one. The precision of the distribution of landing locations, obtained here by sampling the stochastic policy during training, can be increased when evaluating a trained policy by deterministically choosing, at every step, the action corresponding to the policy mean $\mu^w(s)$.
The Normalized Advantage Function algorithm (NAF, Gu et al. (2016)) is an off-policy value-iteration algorithm which learns a quadratic parameterization of the action value, similar to the one defined in equation 16. One of the main differences with respect to Racer is that the mean of the policy is not trained with the policy gradient (Eq. 17) but with the critic gradient (Eq. 19). While the accuracy of the parameterized action value might increase during training, this does not necessarily correspond to better actions, which leads to the erratic distribution of landing positions in figure 7.
Proximal Policy Optimization (PPO, Schulman et al. (2017)) is an on-policy actor-critic algorithm. Its main difference with respect to Racer is that only the most recent (on-policy) trajectories are used to update the policy. This allows estimating $\hat{Q}$ directly from on-policy rewards (Schulman et al., 2015) rather than with an off-policy estimator (here we used Retrace, Eq. 18), and it bypasses the need for learning a parametric action value. While PPO has led to many state-of-the-art results in benchmark test cases, here it does not succeed in centering the distribution of landing positions around the target. This could be attributed to an unfavorable formulation of the reward, or to the high variance of the on-policy estimator for $\hat{Q}$.

7 Conclusion
We have demonstrated that Reinforcement Learning can be used to develop gliding agents that execute complex and precise control patterns, using a simple model of the controlled gravity-driven descent of an elliptical object. We show that RL agents learn a variety of optimal flight patterns and perching maneuvers that minimize either time-to-target or energy cost. The RL agents were able to match and even surpass the performance of trajectories found through Optimal Control. We also show that the RL agents can generalize their behavior, allowing them to select adequate actions even after perturbations of the system. Finally, we examined the effects of the ellipse's density and aspect ratio and found that the optimal policies lead to either bounding flight or tumbling flight. Bounding flight is characterized by alternating phases of gliding with a small angle of attack and rapid rotations to generate lift. Tumbling flight is characterized by continual rotation, propelled by a minimal, almost constant torque. Ongoing work aims to extend the present algorithms to three-dimensional Direct Numerical Simulations of gliders.
Acknowledgments
We thank Siddhartha Verma for helpful discussions and feedback on this manuscript. This work was supported by European Research Council Advanced Investigator Award 341117. Computational resources were provided by Swiss National Supercomputing Centre (CSCS) Project s658.
References
 Abas et al. (2016) Abas, MF Bin, Rafie, ASBM, Yusoff, HB & Ahmad, KAB 2016 Flapping wing micro-aerial-vehicle: Kinematics, membranes, and flapping mechanisms of ornithopter and insect flight. Chinese Journal of Aeronautics 29 (5), 1159–1177.
 Andersen et al. (2005a) Andersen, A, Pesavento, U & Wang, ZJ 2005a Analysis of transitions between fluttering, tumbling and steady descent of falling cards. Journal of Fluid Mechanics 541, 91–104.
 Andersen et al. (2005b) Andersen, A, Pesavento, U & Wang, ZJ 2005b Unsteady aerodynamics of fluttering and tumbling plates. Journal of Fluid Mechanics 541, 65–90.
 Bellman (1952) Bellman, Richard 1952 On the theory of dynamic programming. Proceedings of the National Academy of Sciences 38 (8), 716–719.
 Bertsekas (1995) Bertsekas, Dimitri P 1995 Dynamic programming and optimal control, vol. 1. Athena Scientific, Belmont, MA.
 Colabrese et al. (2017) Colabrese, S, Gustavsson, K, Celani, A & Biferale, L 2017 Flow navigation by smart microswimmers via reinforcement learning. Physical Review Letters 118 (15), 158004.
 Degris et al. (2012) Degris, T, White, M & Sutton, R S 2012 Offpolicy actorcritic. arXiv preprint arXiv:1205.4839 .
 Duriez et al. (2017) Duriez, Thomas, Brunton, Steven L & Noack, Bernd R 2017 Machine Learning Control: Taming Nonlinear Dynamics and Turbulence. Springer.
 Gazzola et al. (2014) Gazzola, M, Hejazialhosseini, B & Koumoutsakos, P 2014 Reinforcement learning and wavelet adapted vortex methods for simulations of selfpropelled swimmers. SIAM Journal on Scientific Computing 36 (3), B622–B639.
 Gazzola et al. (2016) Gazzola, M, Tchieu, AA, Alexeev, D, de Brauer, A & Koumoutsakos, P 2016 Learning to school in the presence of hydrodynamic interactions. Journal of Fluid Mechanics 789, 726–749.
 Geng et al. (2016) Geng, X, Zhang, M, Bruce, J, Caluwaerts, K, Vespignani, M, Sun, SV, Abbeel, P & Levine, S 2016 Deep reinforcement learning for tensegrity robot locomotion. arXiv preprint arXiv:1609.09049 .
 Gu et al. (2016) Gu, S, Lillicrap, T, Sutskever, I & Levine, S 2016 Continuous deep qlearning with modelbased acceleration. In International Conference on Machine Learning, pp. 2829–2838.
 Gurdan et al. (2007) Gurdan, D, Stumpf, J, Achtelik, M, Doth, KM, Hirzinger, G & Rus, D 2007 Energy-efficient autonomous four-rotor flying robot controlled at 1 kHz. In Robotics and Automation, 2007 IEEE International Conference on, pp. 361–366. IEEE.
 Jackson (2000) Jackson, SM 2000 Glide angle in the genus petaurus and a review of gliding in mammals. Mammal Review 30 (1), 9–30.

 Kaelbling et al. (1996) Kaelbling, Leslie Pack, Littman, Michael L & Moore, Andrew W 1996 Reinforcement learning: A survey. Journal of Artificial Intelligence Research 4, 237–285.
 Lamb (1932) Lamb, H 1932 Hydrodynamics. Cambridge University Press.
 Levine et al. (2016) Levine, Sergey, Finn, Chelsea, Darrell, Trevor & Abbeel, Pieter 2016 Endtoend training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), 1334–1373.
 Lighthill (1977) Lighthill, MJ 1977 Introduction to the scaling of aerial locomotion. Scale effects in animal locomotion pp. 365–404.
 Lupashin et al. (2010) Lupashin, S, Schöllig, A, Sherback, M & D’Andrea, R 2010 A simple learning strategy for highspeed quadrocopter multiflips. In Robotics and Automation (ICRA), 2010 IEEE International Conference on, pp. 1642–1648. IEEE.
 McCay (2001) McCay, MG 2001 Aerodynamic stability and maneuverability of the gliding frog Polypedates dennysi. Journal of Experimental Biology 204 (16), 2817–2826.
 Mellinger et al. (2013) Mellinger, D, Shomin, M, Michael, N & Kumar, V 2013 Cooperative grasping and transport using multiple quadrotors. In Distributed autonomous robotic systems, pp. 545–558. Springer.
 Mnih et al. (2015) Mnih, V, Kavukcuoglu, K, Silver, D, Rusu, A, Veness, J, Bellemare, MG, Graves, A, Riedmiller, M, Fidjeland, AK, Ostrovski, G & others 2015 Human-level control through deep reinforcement learning. Nature 518 (7540), 529–533.
 Mori & Hikida (1994) Mori, A & Hikida, T 1994 Field observations on the social behavior of the flying lizard, Draco volans sumatranus, in Borneo. Copeia pp. 124–130.
 Müller et al. (2011) Müller, M, Lupashin, S & D’Andrea, R 2011 Quadrocopter ball juggling. In Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on, pp. 5113–5120. IEEE.
 Munos et al. (2016) Munos, R, Stepleton, T, Harutyunyan, A & Bellemare, M 2016 Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1054–1062.
 Ng et al. (1999) Ng, AY, Harada, D & Russell, S 1999 Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, vol. 99, pp. 278–287.
 Novati & Koumoutsakos (2018) Novati, G & Koumoutsakos, P 2018 Remember and forget for experience replay. In Advances in Neural Information Processing Systems (submitted).
 Novati et al. (2017) Novati, G, Verma, S, Alexeev, D, Rossinelli, D, van Rees, WM & Koumoutsakos, P 2017 Synchronisation through learning for two self-propelled swimmers. Bioinspiration & Biomimetics 12 (3), 036001.
 Paoletti & Mahadevan (2011) Paoletti, P & Mahadevan, L 2011 Planar controlled gliding, tumbling and descent. Journal of Fluid Mechanics 689, 489–516.
 Rayner (1977) Rayner, JMV 1977 The intermittent flight of birds. Scale effects in animal locomotion pp. 437–443.
 Rayner (1985) Rayner, JMV 1985 Bounding and undulating flight in birds. Journal of Theoretical Biology 117 (1), 47–77.
 Reddy et al. (2016) Reddy, G, Celani, A, Sejnowski, TJ & Vergassola, M 2016 Learning to soar in turbulent environments. Proceedings of the National Academy of Sciences p. 201606075.
 Schulman et al. (2015) Schulman, J, Moritz, P, Levine, S, Jordan, M & Abbeel, P 2015 High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438 .
 Schulman et al. (2017) Schulman, J, Wolski, F, Dhariwal, P, Radford, A & Klimov, O 2017 Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 .
 Silver et al. (2016) Silver, D, Huang, A, Maddison, CJ, Guez, A, Sifre, L, Van Den Driessche, G, Schrittwieser, J, Antonoglou, I, Panneershelvam, V, Lanctot, M & others 2016 Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), 484–489.
 Socha (2002) Socha, JJ 2002 Kinematics: Gliding flight in the paradise tree snake. Nature 418 (6898), 603–604.
 Sutton et al. (2000) Sutton, RS, McAllester, DA, Singh, SP & Mansour, Y 2000 Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063.
 Sutton & Barto (1998) Sutton, RS & Barto, AG 1998 Reinforcement learning: An introduction, vol. 1. MIT Press, Cambridge.
 Thomas et al. (2016) Thomas, J, Pope, M, Loianno, G, Hawkes, E W, Estrada, M A, Jiang, H, Cutkosky, MR & Kumar, V 2016 Aggressive flight with quadrotors for perching on inclined surfaces. Journal of Mechanisms and Robotics 8 (5), 051007.
 Verma et al. (2018) Verma, S, Novati, G & Koumoutsakos, P 2018 Efficient collective swimming by harnessing vortices through deep reinforcement learning. Proceedings of the National Academy of Sciences p. 201800923.
 Wang et al. (2016) Wang, Z, Bapst, V, Heess, N, Mnih, V, Munos, R, Kavukcuoglu, K & de Freitas, N 2016 Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224 .
 Wang et al. (2004) Wang, ZJ, Birch, JM & Dickinson, MH 2004 Unsteady forces and flows in low Reynolds number hovering flight: two-dimensional computations vs robotic wing experiments. Journal of Experimental Biology 207 (3), 449–460.
 Yanoviak & Dudley (2006) Yanoviak, SP & Dudley, R 2006 The role of visual cues in directed aerial descent of Cephalotes atratus workers (Hymenoptera: Formicidae). Journal of Experimental Biology 209 (9), 1777–1783.
 Yanoviak et al. (2005) Yanoviak, SP, Dudley, R & Kaspari, M 2005 Directed aerial descent in canopy ants. Nature 433 (7026), 624–626.
 Yanoviak et al. (2009) Yanoviak, SP, Kaspari, M & Dudley, R 2009 Gliding hexapods and the origins of insect aerial behaviour. Biology Letters 5 (4), 510–512.
 Yanoviak et al. (2010) Yanoviak, SP, Munk, Y, Kaspari, M & Dudley, R 2010 Aerial manoeuvrability in wingless gliding ants (Cephalotes atratus). Proceedings of the Royal Society of London B: Biological Sciences p. rspb20100170.