Reinforcement learning for autonomous swimmers
Reinforcement learning [29] has been introduced to identify navigation policies in several model systems of vortex dipoles, soaring birds and microswimmers [30, 31, 32]. Here, we expand on our earlier work [22, 33] combining Reinforcement Learning with Direct Numerical Simulations of the Navier-Stokes equations for two self-propelled and autonomous swimmers. We first investigate two-dimensional swimmers in a tandem configuration and analyse their kinematics for the cases of and (Fig. 2). In both cases, the swimmer trails a leader representing an adult zebrafish of length , swimming steadily at a velocity (Reynolds number ). We employ deep Reinforcement Learning (see Methods section for details), and after training we observe that is able to maintain its position behind the leader quite effectively (, Fig. 2), in accordance with its reward (). Surprisingly, with a reward function proportional to swimming-efficiency (), also settles close to the center of the leader’s wake (Fig. 2 and Supplementary Movie S2), although it receives no reward associated with its relative position. Both and maintain a distance of from their respective leaders (Fig. 2). shows a greater proclivity to maintain this separation and intercepts the periodically shed wake-vortices just after they have fully formed and detached from the leader’s tail. In addition to , there is an additional point of stability at (Fig. 2). The difference matches the distance between vortices in the wake of the leader. In both positions the lateral motion of the follower’s head is synchronized with the flow-velocity in the leader’s wake, thus inducing minimal disturbance on the oncoming flow-field. We note that a similar synchronization has been observed when trout minimize muscle usage by interacting with vortex-columns in a cylinder’s wake [14]. undergoes relatively minor body-deformation while manoeuvring (Fig. 2), whereas executes aggressive turns involving large body-curvature.
Trout interacting with cylinder-wakes exhibit increased body-curvature [28], which is contrary to the behaviour displayed by . The difference may be ascribed to the widely-spaced vortex columns generated by large-diameter cylinders used in the experimental study. Weaving in and out of comparatively smaller vortices generated by like-sized fish encountered in a school (Fig. 1) would entail excessive energy consumption. We note that maintaining requires significant effort by (Supplementary Fig. S2) since its reward () is insensitive to energy expenditure. A previous study [33] suggested that minimizing lateral displacement led to enhanced swimming-efficiency (compared to the leader), albeit with noticeable deviation from . In the current study, recurrent neural networks augmented with ‘Long Short-Term Memory’ cells (Supplementary Fig. S3) help to encode time-dependencies in the value function, and enable far more robust smart-swimmers. Thus, stringent attempts by to correct for oscillations about (Fig. 2) give rise to increased costs (Supplementary Fig. S2).
Figure caption: (fig:dY) Lateral displacement of the smart followers. (fig:histogram) Histogram showing the probability density function (pdf, left vertical axis) of swimmer ’s preferred center-of-mass location during training. In the early stages of training (first 10000 transitions, green bars), the swimmer does not show a strong preference for maintaining any particular separation distance. Towards the end of training (last 10000 transitions, lilac bars), the swimmer displays a strong preference for maintaining a separation-distance of either or . The solid black line in the figure depicts the correlation-coefficient, with peaks in the black curve signifying locations where the smart-follower’s head-movement would be synchronized with the flow-velocity in an undisturbed wake (please see Supplementary Information for relevant details). (fig:contortion) Comparison of body-deformation for swimmers (top) and (bottom), from to . Their respective trajectories are shown with the dash-dot lines, whereas the dashed gray line represents the trajectory of the leader (not shown). A quantitative comparison of body-curvature for the two swimmers may be found in Supplementary Fig. S1.
Intercepting vortices for efficient swimming
To determine the impact of wake-induced interactions on swimming-performance, we compare energetics data for and (Fig. 3). The swimming-efficiency of is significantly higher than that of (Fig. 3), whereas the Cost of Transport (CoT), which represents energy spent for traversing a unit distance, is lower (Fig. 3). Over a duration of 10 tail-beat periods (from to , Supplementary Fig. S2) experiences a increase in average speed compared to , a increase in average swimming-efficiency, and a decrease in CoT. The benefit for results from both a reduction in effort required for deforming its body against flow-induced forces (), and a increase in average thrust-power (). Performance-differences between and exist solely due to the presence/absence of a preceding wake, since both swimmers undergo identical body-undulations throughout the simulations. Comparing the swimming-efficiency and power values of four distinct swimmers (Supplementary Fig. S2 and Supplementary Table 1), we confirm that and are considerably more energetically efficient than either or , thus verifying the hydrodynamic benefits of coordinated swimming.
The efficient swimming of (e.g., point in Fig. 3) is attributed to the synchronized motion of its head with the lateral flow-velocity generated by the wake-vortices of the leader (see panel ‘v’ in Supplementary Movie S2). This mechanism is evidenced by the correlation-curve shown in Fig. 2, and by the co-alignment of velocity vectors close to the head in Figs. 4 and 4. As shown in Supplementary Movie S4, intercepts the oncoming vortices in a slightly skewed manner, splitting each vortex into a stronger (, Fig. 4) and a weaker fragment (). The vortices interact with the swimmer’s own boundary layer to generate ‘lifted-vortices’ (), which in turn generate secondary-vorticity () close to the body. Meanwhile, the wake and lifted-vortices created during the previous half-period, , , and , have travelled downstream along the body. This sequence of events alternates periodically between the upper (right-lateral) and lower (left-lateral) surfaces, as seen in Supplementary Movie S4. Interactions of with the flow-field at points and in Fig. 3 are analyzed separately in Supplementary Figs. S4 and S5.


Figure caption (fragment): The envelope signifies the standard deviation among the 10 snapshots. (fig:pDefLEtaMaxA) Deformation-power and (fig:pThrustLEtaMaxA) thrust-power on the lower (left-lateral) surface of the swimmer.
We observe that the swimmer’s upper surface is covered in a layer of negative vorticity (and vice versa for the lower surface) (Fig. 4, top panel) owing to the no-slip boundary condition. The wake or the lifted-vortices weaken this distribution by generating vorticity of opposite sign (e.g., secondary-vorticity visible in narrow regions between the fish-surface and vortices , , , and ), and create high-speed areas visible as bright spots in Fig. 4 (lower panel). The resulting low-pressure region exerts a suction-force on the surface of the swimmer (Fig. 4, upper panel), which assists body-undulations when the force-vectors coincide with the deformation-velocity (Fig. 4 lower panel), or increases the effort required when they are counter-aligned. The detailed impact of these interactions is demonstrated in Figs. 4 to 4. On the lower surface, generates a suction-force oriented in the same direction as the deformation-velocity ( in Fig. 4), resulting in negative (Fig. 4) and favourable (Fig. 4). On the upper surface, the lifted-vortex increases the effort required for deforming the body (positive peak in Fig. 4 at ), but is beneficial in terms of producing large positive thrust-power (Fig. 4). Moreover, as progresses along the body, it results in a prominent reduction in over the next half-period, similar to the negative peak produced by the lifted-vortex ( in Fig. 4). The average on both the upper and lower surfaces is predominantly negative (i.e., beneficial), in contrast to the minimum swimming-efficiency instance , where a mostly positive distribution signifies substantial effort required for deforming the body (Supplementary Fig. S4). We observe noticeable drag on the upper surface close to (Fig. 4 top panel and Fig. 4), attributed to a high-pressure region forming in front of the swimmer’s head. Forces induced by are both beneficial and detrimental in terms of generating thrust-power ( in Fig. 4), whereas forces induced by primarily increase drag but assist in body-deformation (Fig. 4). The tail-section ( to ) does not contribute noticeably to either thrust or deformation-power at the instant of maximum swimming-efficiency.
Energy-saving mechanisms in coordinated swimming
The most discernible behaviour of is the synchronization of its head-movement with the wake-flow. However, the most prominent reduction in deformation-power occurs near the midsection of the body ( in Figs. 4 and 4). This indicates that the technique devised by is markedly different from energy-conserving mechanisms implied in previous theoretical [6, 34] and computational [21] work, namely, drag-reduction attributed to reduced relative-velocity in the flow, and thrust-increase owing to the ‘channelling effect’. In fact, the predominant energetics-gain (i.e., negative ) occurs in areas of high relative-velocity, for instance near the high-velocity spot generated by vortex (Fig. 4). This dependence of swimming-efficiency on a complex interplay between wake-vortices and body-deformation aligns closely with experimental findings [14, 28].
We remark that the majority of the results presented here were obtained with a steadily-swimming leader. However, with no additional training, is able to extract an energetic-benefit even when exposed to an erratic leader (as seen in Supplementary Movie S3), where it deliberately chooses to interact with the unsteady wake. Moreover, given the head-synchronization tendency of the 2D smart-swimmer, we identify suitable locations behind a 3D leader where the flow velocity would match a follower’s head motion (Supplementary Fig. S6). A feedback controller is used to regulate the undulations of two followers to maintain these target coordinates on either branch of the diverging wake, as shown in Fig. 1 and Supplementary Movie S1. The controlled motion yields an increase in average swimming-efficiency for each of the followers (Fig. 5), and a reduction in each of their Cost of Transport. Overall, the group experiences a increase in efficiency when compared to three isolated non-interacting swimmers. The mechanism of energy-savings closely resembles that observed for the 2D swimmer; an oncoming wake-vortex ring (WR, Fig. 5) interacts with the deforming body to generate a ‘lifted-vortex’ ring (LR, Fig. 5). As this new ring proceeds along the length of the body, it modulates the follower’s swimming-efficiency as observed in Fig. 5. Remarkably, the positioning of the lifted-ring at the instants of minimum and maximum swimming-efficiency resembles the corresponding positioning of lifted-vortices in the 2D case; a slight dip in efficiency corresponds to lifted-vortices interacting with the anterior section of the body (Fig. 5 and Supplementary Fig. S4), whereas an increase occurs upon their interaction with the midsection (Fig. 5 and Fig. 4).
These results showcase the remarkable capability of machine learning, and deep RL in particular, for discovering effective solutions that may not have been envisaged by humans, either owing to pre-existing biases, or due to the difficulty of anticipating the effects of delayed reactions by swimmers in complex flows. Finally, this study demonstrates that deep reinforcement learning can produce navigation algorithms for complex flow-fields, with promising implications for energy savings in autonomous robotic swarms.
Methods
We perform two- and three-dimensional simulations of multiple self-propelled swimmers using wavelet adapted vortex methods [36] to discretise the velocity-vorticity form of the Navier-Stokes (NS) equations (in 2D), and their velocity-pressure form along with the pressure-projection [37] method (in 3D) using finite differences on a uniform computational grid. The body-geometry of the self-propelled swimmers is based on simplified models of a zebrafish. The swimmers adapt their motion using deep reinforcement learning. The learning process was greatly accelerated by employing recurrent neural networks with long short-term memory (RL-LSTM) [38] as a surrogate of the value function for the smart-swimmer. Additional details regarding the simulation methods and the reinforcement learning algorithm are provided in the Supporting Information.
Acknowledgements
This work was supported by the European Research Council Advanced Investigator Award (Fluid Mechanics of Collective Behavior, Grant: 341117), and the Swiss National Science Foundation Sinergia Award (CRSII3 147675). The authors are grateful to the Swiss National Supercomputing Center (CSCS) for providing access to computational resources (project ‘s658’).
References
 [1] Schmidt J (1923) Breeding places and migrations of the eel. Nature 111:51–54.
 [2] Lang TG, Pryor K (1966) Hydrodynamic performance of porpoises (stenella attenuata). Science 152:531–533.
 [3] Aleyev YG (1977) Nekton. (Springer Netherlands).
 [4] Triantafyllou MS, Weymouth GD, Miao J (2016) Biomimetic Survival Hydrodynamics and Flow Sensing. Annu. Rev. Fluid Mech. 48:1–24.
 [5] Breder CM (1965) Vortices and fish schools. Zoologica-N.Y. 50:97–114.
 [6] Weihs D (1973) Hydromechanics of fish schooling. Nature 241:290–291.
 [7] Shaw E (1978) Schooling Fishes: The school, a truly egalitarian form of organization in which all members of the group are alike in influence, offers substantial benefits to its participants. Am. Sci. 66:166–175.
 [8] Pavlov DS, Kasumyan AO (2000) Patterns and mechanisms of schooling behavior in fish: A review. J. Ichthyol. 40:163–231.
 [9] Burgerhout E, et al. (2013) Schooling reduces energy consumption in swimming male European eels, Anguilla anguilla L. J. Exp. Mar. Biol. Ecol. 448:66 – 71.
 [10] Whittlesey RW, Liska S, Dabiri JO (2010) Fish schooling as a basis for vertical axis wind turbine farm design. Bioinspir. Biomim. 5(3):035005.
 [11] Chapman JW, et al. (2011) Animal orientation strategies for movement in flows. Curr. Biol. 21:R861 – R870.
 [12] Montgomery JC, Baker CF, Carton AG (1997) The lateral line can mediate rheotaxis in fish. Nature 389:960–963.
 [13] Lyon EP (1904) On rheotropism. I. — Rheotropism in fishes. Am. J. Physiol. 12:149–161.
 [14] Liao JC, Beal DN, Lauder GV, Triantafyllou MS (2003) Fish exploiting vortices decrease muscle activity. Science 302:1566–1569.
 [15] Oteiza P, Odstrcil I, Lauder G, Portugues R, Engert F (2017) A novel mechanism for mechanosensory-based rheotaxis in larval zebrafish. Nature 547:445–448.
 [16] Herskin J, Steffensen JF (1998) Energy savings in sea bass swimming in a school: measurements of tail beat frequency and oxygen consumption at different swimming speeds. J. Fish Biol. 53:366–376.
 [17] Killen SS, Marras S, Steffensen JF, McKenzie DJ (2012) Aerobic capacity influences the spatial position of individuals within fish schools. Proc. Biol. Sci. 279:357–364.
 [18] Ashraf I, et al. (2017) Simple phalanx pattern leads to energy saving in cohesive fish schooling. Proc. Natl. Acad. Sci. U.S.A.
 [19] Pitcher TJ (1986) Functions of shoaling behaviour in teleosts in The Behaviour of Teleost Fishes, ed. Pitcher TJ. (Springer US, Boston, MA), pp. 294–337.
 [20] Lopez U, Gautrais J, Couzin ID, Theraulaz G (2012) From behavioural analyses to models of collective motion in fish schools. Interface Focus 2:693–707.
 [21] Daghooghi M, Borazjani I (2015) The hydrodynamic advantages of synchronized swimming in a rectangular pattern. Bioinspir. Biomim. 10:056018.
 [22] Gazzola M, Hejazialhosseini B, Koumoutsakos P (2014) Reinforcement learning and wavelet adapted vortex methods for simulations of self-propelled swimmers. SIAM J. Sci. Comput. 36:B622–B639.
 [23] Maertens AP, Gao A, Triantafyllou MS (2017) Optimal undulatory swimming for a single fish-like body and for a pair of interacting swimmers. J. Fluid Mech. 813:301–345.
 [24] Mnih V, et al. (2015) Human-level control through deep reinforcement learning. Nature 518:529–533.
 [25] Müller UK, Smit J, Stamhuis EJ, Videler JJ (2001) How the body contributes to the wake in undulatory fish swimming. J. Exp. Biol. 204:2751–2762.
 [26] Kern S, Koumoutsakos P (2006) Simulations of optimized anguilliform swimming. J. Exp. Biol. 209:4841–4857.
 [27] Borazjani I, Sotiropoulos F (2008) Numerical investigation of the hydrodynamics of carangiform swimming in the transitional and inertial flow regimes. J. Exp. Biol. 211:1541–1558.
 [28] Liao JC, Beal DN, Lauder GV, Triantafyllou MS (2003) The Kármán gait: novel body kinematics of rainbow trout swimming in a vortex street. J. Exp. Biol. 206:1059–1073.
 [29] Sutton RS, Barto AG (1998) Reinforcement learning: An introduction. (MIT press, Cambridge, MA, USA).
 [30] Gazzola M, Tchieu AA, Alexeev D, de Brauer A, Koumoutsakos P (2016) Learning to school in the presence of hydrodynamic interactions. J. Fluid Mech. 789:726–749.
 [31] Reddy G, Celani A, Sejnowski TJ, Vergassola M (2016) Learning to soar in turbulent environments. Proc. Natl. Acad. Sci. U.S.A. 113(33):E4877–E4884.
 [32] Colabrese S, Gustavsson K, Celani A, Biferale L (2017) Flow navigation by smart microswimmers via reinforcement learning. Phys. Rev. Lett. 118(15):158004.
 [33] Novati G, et al. (2017) Synchronisation through learning for two self-propelled swimmers. Bioinspir. Biomim. 12:036001.
 [34] Weihs D (1975) Swimming and Flying in Nature: Volume 2, eds. Wu TYT, Brokaw CJ, Brennen C. (Springer US, Boston, MA), pp. 703–718.
 [35] Bertsekas DP (1995) Dynamic programming and optimal control. (Athena Scientific, Belmont, MA) Vol. 1.
 [36] Rossinelli D, et al. (2015) MRAG-I2D: Multi-resolution adapted grids for remeshed vortex methods on multicore architectures. J. Comput. Phys. 288:1–18.
 [37] Chorin AJ (1968) Numerical solution of the Navier-Stokes equations. Math. Comp. 22:745–762.
 [38] Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput. 9:1735–1780.
 [39] Coquerelle M, Cottet GH (2008) A vortex level set method for the two-way coupling of an incompressible fluid with colliding rigid bodies. J. Comput. Phys. 227:9121–9137.
 [40] Verma S, Abbati G, Novati G, Koumoutsakos P (2017) Computing the force distribution on the surface of complex, deforming geometries using vortex methods and Brinkman penalization. Int. J. Numer. Meth. Fluids 85(8):484–501.
 [41] Greengard L, Rokhlin V (1987) A fast algorithm for particle simulations. J. Comput. Phys. 73:325–348.
 [42] Gholami A, Hill J, Malhotra D, Biros G (2015) AccFFT: A library for distributed-memory FFT on CPU and GPU architectures. arXiv preprint arXiv:1506.07933.
 [43] Tytell ED, Lauder GV (2004) The hydrodynamics of eel swimming. J. Exp. Biol. 207:1825–1841.
 [44] van Rees WM, Gazzola M, Koumoutsakos P (2013) Optimal shapes for anguilliform swimmers at intermediate Reynolds numbers. J. Fluid Mech. 722:R3 1–12.
 [45] Bellman RE (2010) Dynamic Programming. (Princeton University Press, Princeton, NJ, USA).
 [46] van Hasselt H, Guez A, Silver D (2015) Deep reinforcement learning with double Q-learning. CoRR, abs/1509.06461.
 [47] Mnih V, et al. (2015) Human-level control through deep reinforcement learning. Nature 518:529–533.
 [48] Riedmiller M (2005) Neural fitted Q iteration – First experiences with a data efficient neural reinforcement learning method in Machine Learning: ECML 2005: Lecture Notes in Computer Science, vol 3720, eds. Gama J, Camacho R, Brazdil PB, Jorge AM, Torgo L. (Springer Berlin Heidelberg, Berlin, Heidelberg), pp. 317–328.
 [49] Gers FA, Schmidhuber J, Cummins F (2000) Learning to forget: Continual prediction with LSTM. Neural Comput. 12(10):2451–2471.
 [50] Lin LJ (1992) Ph.D. thesis (Carnegie Mellon University, Pittsburgh, PA, USA).
 [51] Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 [cs.LG].
 [52] Graves A, Schmidhuber J (2005) Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 18:602–610.
 [53] Hunt JCR, Wray AA, Moin P (1988) Eddies, streams, and convergence zones in turbulent flows in Studying Turbulence Using Numerical Simulation Databases, 2. Report CTR-S88. pp. 193–208.
Supporting Information: Methods
Simulation details.
The simulations presented here are based on the incompressible Navier-Stokes (NS) equations:
$$\nabla \cdot \mathbf{u} = 0 \qquad (1)$$
$$\frac{\partial \mathbf{u}}{\partial t} + \mathbf{u} \cdot \nabla \mathbf{u} = -\frac{\nabla P}{\rho} + \nu \nabla^2 \mathbf{u} + \lambda \chi \left(\mathbf{u}_s - \mathbf{u}\right) \qquad (2)$$
Each swimmer is represented on the computational grid via the characteristic function $\chi$, and interacts with the fluid by means of the penalty [39] term $\lambda \chi (\mathbf{u}_s - \mathbf{u})$, with . $\mathbf{u}_s$ denotes the swimmer’s combined translational, rotational, and deformation velocity, whereas $\mathbf{u}$ and $\nu$ correspond to the fluid velocity and viscosity, respectively. $P$ represents the pressure, and the fluid density is denoted by $\rho$.
The vorticity form of the NS equations was used for the two-dimensional simulations. A wavelet adaptive grid [36] with an effective resolution of points was used to discretize a unit square domain. A lower effective resolution of points was used for the training-simulations to minimize computational cost. The pressure-Poisson equation (), necessary for estimating the distribution of flow-induced forces on the swimmers’ bodies, was solved using the Fast Multipole Method [41, 40].
The three-dimensional simulations employed the pressure-projection method for solving the NS equations [37]. The simulations were parallelized via the CUBISM framework [36], and used a uniform grid consisting of points in a domain of size . The non-divergence-free deformation of the self-propelled swimmers was incorporated into the pressure-Poisson equation as follows:
(3) 
where represents the intermediate velocity from the convection-diffusion-penalization fractional steps. Equation 3 was solved using a distributed Fast Fourier Transform library (AccFFT [42]).
Flow-induced forces, and energetics variables.
The pressure-induced and viscous forces acting on the swimmers are computed as follows [40]:
(4)  
(5) 
Here, represents the pressure acting on the swimmer’s surface, is the strain-rate tensor on the surface, and denotes the infinitesimal surface area. Since self-propelled swimmers generate zero net average thrust (and drag) during steady swimming, we determine the instantaneous thrust as follows:
(6) 
where . Similarly, the instantaneous drag may be determined as:
(7) 
Using these quantities, the thrust, drag, and deformation-power are computed as:
(8)  
(9)  
(10) 
where represents the deformation-velocity of the swimmer’s body. The double-integrals in these equations represent surface-integration over the swimmer’s body, and yield measurements for time-series analysis. On the other hand, only the integrand is evaluated when surface-distributions of thrust, drag, or deformation-power are required (as in Figs. 4 to 4).
The instantaneous swimming-efficiency is based on a modified form of the Froude efficiency proposed in ref. [43]:
(11) 
To compute both and the Cost of Transport (CoT), we neglect negative values of , which can result from beneficial interactions of the smart-swimmer with the leader’s wake:
(12) 
This restriction accounts for the fact that the elastically rigid swimmer may not store energy furnished by the flow, and yields a conservative estimate of potential savings in the CoT. We note that percentage changes in , reported in the main text and the supplementary section, have been computed using this bounded value to avoid overstating any potential benefits.
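The bounding step described above can be illustrated with a short sketch. Since Eqs. 11 and 12 did not survive in this copy, the ratio form of the efficiency and the energy-per-distance form of the CoT used below are assumptions made purely for illustration, and all function and variable names are hypothetical. Only the clipping of negative deformation-power follows directly from the text.

```python
import numpy as np

def bounded_deformation_power(p_def):
    """Neglect negative deformation-power (energy furnished by the flow)."""
    return np.maximum(np.asarray(p_def, dtype=float), 0.0)

def swimming_efficiency(p_thrust, p_def):
    # ASSUMPTION: efficiency taken as P_thrust / (P_thrust + max(P_def, 0)).
    p_thrust = np.asarray(p_thrust, dtype=float)
    return p_thrust / (p_thrust + bounded_deformation_power(p_def))

def cost_of_transport(p_def, speed, dt):
    # ASSUMPTION: energy spent over a window divided by distance travelled.
    energy = np.sum(bounded_deformation_power(p_def)) * dt
    distance = np.sum(np.abs(np.asarray(speed, dtype=float))) * dt
    return energy / distance

# Illustrative (made-up) samples: the second one has negative deformation-power.
p_thrust = np.array([1.0, 1.0, 1.0])
p_def = np.array([1.0, -0.5, 3.0])
speed = np.array([2.0, 2.0, 2.0])

eta = swimming_efficiency(p_thrust, p_def)
cot = cost_of_transport(p_def, speed, dt=0.1)
```

With the bound in place, the sample with negative deformation-power contributes an efficiency of exactly one rather than a value above one, which is what makes the reported savings conservative.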
Swimmer shape and kinematics.
The Reynolds number of the self-propelled swimmers is computed as . The body-geometry is based on a simplified model of a zebrafish [44]. The half-width of the 2D profile is described as follows:
(13) 
where is the arc-length along the midline of the geometry, is the body length, , , and . For 3D simulations, the geometry comprises elliptical cross sections, with the half-width and half-height described via cubic B-splines [44]. Six control-points define the half-width: ; whereas eight control-points define the half-height: . The length was set to , which keeps the grid-resolution, i.e., the number of points along the fish midline, comparable to the 2D simulations. Body-undulations for both 2D and 3D simulations were generated as a travelling-wave defining the curvature along the midline:
(14) 
Here is the curvature amplitude and varies linearly from to .
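A travelling curvature wave with a linearly varying amplitude can be sketched as follows. The sinusoidal form, the phase convention, and the amplitude endpoints `a_head` and `a_tail` are assumptions chosen for illustration, since Eq. 14 and the amplitude limits are not legible in this copy. All names are hypothetical.

```python
import numpy as np

def curvature_amplitude(s, length, a_head, a_tail):
    """Amplitude varying linearly along the midline from head to tail."""
    return a_head + (a_tail - a_head) * np.asarray(s, dtype=float) / length

def midline_curvature(s, t, length, period, a_head, a_tail, phase=0.0):
    # ASSUMPTION: sinusoidal wave travelling from head (s = 0) to tail (s = length).
    amplitude = curvature_amplitude(s, length, a_head, a_tail)
    return amplitude * np.sin(2.0 * np.pi * (t / period - np.asarray(s) / length) + phase)

# Hypothetical, non-dimensional values used only to exercise the functions.
s = np.linspace(0.0, 1.0, 5)
k = midline_curvature(s, t=0.25, length=1.0, period=1.0, a_head=0.5, a_tail=2.0)
k_next_period = midline_curvature(s, t=1.25, length=1.0, period=1.0, a_head=0.5, a_tail=2.0)
```

The curvature repeats after one tail-beat period, and the growing amplitude towards the tail reflects the linear variation stated in the text.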
Reinforcement Learning.
Reinforcement learning (RL) [29] is a process by which an agent (in this case, the smart-swimmer) learns to earn rewards through trial-and-error interaction with its environment. At each turn, the agent observes the state of the environment and performs an action , which influences both the transition to the next state and the reward received . The agent’s goal is to learn the optimal control policy which maximises the action value , defined as the sum of discounted future rewards:
(15) 
Here, denotes the terminal state of a training-simulation, and the discount factor is set to 0.9. The optimal action-value function is a fixed point of the Bellman equation: [45]. We approximate using a neural network [46, 47, 48] with weights , which are updated iteratively to minimize the temporal difference error:
(16) 
Here, is a set of target weights, and is the best action in state computed with the current weights (). The target weights are updated towards the current weights as , where is an under-relaxation factor used to stabilize the algorithm [47].
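The quantities described in this subsection can be sketched in a few lines. The discount factor of 0.9 comes from the text. The selection of the best action with the current weights and its evaluation with the target weights follows the double Q-learning idea of ref. [46]. The convex-combination form of the target-weight update is an assumption, as the update rule itself is not legible in this copy. Function names are hypothetical.

```python
import numpy as np

GAMMA = 0.9  # discount factor stated in the text

def discounted_return(rewards, gamma=GAMMA):
    """Sum of discounted future rewards for one episode, accumulated backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

def double_q_target(reward, q_next_current, q_next_target, terminal, gamma=GAMMA):
    """Temporal-difference target: the action is picked with the current
    weights and its value is read from the target weights."""
    if terminal:
        return reward
    best_action = int(np.argmax(q_next_current))
    return reward + gamma * q_next_target[best_action]

def soft_update(target_weights, current_weights, alpha):
    # ASSUMPTION: under-relaxed update w_target <- (1 - alpha) w_target + alpha w.
    return (1.0 - alpha) * target_weights + alpha * current_weights

episode_return = discounted_return([1.0, 1.0, 1.0])
target = double_q_target(1.0, np.array([0.2, 0.8]), np.array([5.0, 3.0]), terminal=False)
new_target_weights = soft_update(np.zeros(2), np.ones(2), alpha=0.1)
```

In the example, the current weights prefer the second action, so the target uses the target-network value 3.0 rather than the larger value 5.0, which is the mechanism that limits over-estimation in double Q-learning.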
States and actions.
The six observed-state variables perceived by the learning agent include , , , the two most recent actions taken by the agent, and the current tail-beat ‘stage’ . The permissible range of the observed-state variables is limited to: ; (boundary depicted by in Supplementary Fig. S7); and . If the agent exceeds any of these thresholds, the training-simulation terminates and the agent receives a terminal reward .
The smart-swimmer (or agent) is capable of manoeuvring by actively manipulating the curvature-wave travelling down the body. This is accomplished by linearly superimposing a piecewise function on the baseline curvature (equation 14):
(17) 
The curve is composed of 3 distinct segments:
(18) 
The curve is a clamped cubic spline with , , and , . represents the time-instance when action is taken, whereas represents the corresponding control-amplitude, which may take five discrete values: , , and .
Neural network architecture.
One of the assumptions in RL is that the transition probability to a new state is independent of the previous transitions, given and , i.e.:
(19) 
This assumption is invalidated whenever the agent has a limited perception of the environment. In most realistic cases the agent receives an observation rather than the complete state of the environment . Therefore, past observations carry information relevant for future transitions (i.e., ), and should be taken into account in order to make optimal decisions. This operation can be approximated by a Recurrent Neural Network (RNN), which can learn to compute and remember important features in past observations. In this work we approximate the action-value function with an LSTM-RNN [49] composed of three layers of 24 fully connected LSTM cells each, and terminating in a linear layer (Supplementary Fig. S3). The last layer computes a vector of action-values with one component for each possible action available to the agent ( represents the activation of the network at the previous turn).
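The architecture described above can be sketched as a forward pass in plain NumPy. The layer sizes follow the text: six observed-state variables as input, three layers of 24 LSTM cells, and a linear output with one action-value per each of the five discrete actions. The gate layout (a standard LSTM with a forget gate and no peephole connections), the random initialisation, and all class and method names are assumptions for illustration. Training is not shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMLayer:
    """One layer of LSTM cells with input, forget, output and candidate gates."""
    def __init__(self, n_in, n_cells, rng):
        self.n_cells = n_cells
        self.W = rng.normal(0.0, 0.1, size=(4 * n_cells, n_in + n_cells))
        self.b = np.zeros(4 * n_cells)

    def initial_state(self):
        return np.zeros(self.n_cells), np.zeros(self.n_cells)

    def step(self, x, state):
        h_prev, c_prev = state
        z = self.W @ np.concatenate([x, h_prev]) + self.b
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)  # cell memory
        h = sigmoid(o) * np.tanh(c)                        # layer output
        return h, (h, c)

class RecurrentQNetwork:
    """Stacked LSTM layers followed by a linear layer of action-values."""
    def __init__(self, n_obs=6, n_cells=24, n_layers=3, n_actions=5, seed=0):
        rng = np.random.default_rng(seed)
        sizes = [n_obs] + [n_cells] * n_layers
        self.layers = [LSTMLayer(sizes[k], n_cells, rng) for k in range(n_layers)]
        self.W_out = rng.normal(0.0, 0.1, size=(n_actions, n_cells))
        self.b_out = np.zeros(n_actions)

    def initial_state(self):
        return [layer.initial_state() for layer in self.layers]

    def step(self, obs, states):
        x = np.asarray(obs, dtype=float)
        new_states = []
        for layer, state in zip(self.layers, states):
            x, updated = layer.step(x, state)
            new_states.append(updated)
        return self.W_out @ x + self.b_out, new_states

net = RecurrentQNetwork()
states = net.initial_state()
obs = np.linspace(-1.0, 1.0, 6)      # placeholder observation
q1, states = net.step(obs, states)   # action-values at the first turn
q2, states = net.step(obs, states)   # same observation, different memory
```

Feeding the same observation twice yields different action-values because the recurrent state carries information from the previous turn, which is the property that compensates for the agent's partial view of the environment.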
Training procedure.
During training, both the leader and the follower (learning agent) start from rest. The leader swims steadily along a straight line, whereas the follower manoeuvres according to the actions supplied to it. Multiple independent simulations run simultaneously, with each of these sending the current observed-state of the agent to a central processor, and in turn receiving the next action to be performed. The central processor computes using an greedy policy (with gradually annealed from to ) from the most recently updated function. Once a training-simulation reaches a terminal state (e.g., the follower hits the boundary labelled in Supplementary Fig. S7), all the messages exchanged between the simulation and the central processor are appended to a training set of sequences [50]. In the meantime, the network is continually updated by sampling sequences from the set , according to algorithm 1.
Proportional-Integral feedback controller.
The PI controller modulates the 3D follower’s body-kinematics, which allows it to maintain a specific position (, , ) relative to the leader:
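The action-selection step performed by the central processor can be sketched as an annealed exploration policy. The linear annealing schedule and the start and end values of the exploration rate used below are placeholders, since the actual limits are not legible in this copy. Function names are hypothetical.

```python
import numpy as np

def annealed_epsilon(step, eps_start, eps_end, anneal_steps):
    # ASSUMPTION: linear annealing between the two limits.
    fraction = min(max(step / anneal_steps, 0.0), 1.0)
    return eps_start + fraction * (eps_end - eps_start)

def epsilon_greedy(q_values, epsilon, rng):
    """Random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.7, 0.3, -0.2, 0.0])  # one value per discrete action

eps_first = annealed_epsilon(0, eps_start=1.0, eps_end=0.1, anneal_steps=100)
eps_last = annealed_epsilon(500, eps_start=1.0, eps_end=0.1, anneal_steps=100)
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)
explored_action = epsilon_greedy(q_values, epsilon=1.0, rng=rng)
```

Early in training most actions are random, which populates the training set with varied sequences, whereas later actions increasingly follow the learned action-values.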
(20) 
The factor modifies the undulation envelope, and controls the acceleration or deceleration of the follower based on its streamwise distance from the target position:
(21) 
The term adds a baseline curvature to the follower’s midline to correct for lateral deviations:
(22) 
Here, represents the follower’s yaw angle about the axis, and is its exponential moving average: . The swimmers’ positions remain fixed at , as out-of-plane motion is not permitted. The controller-coefficients were selected to have a minimal impact on regular swimming kinematics, which allows for a direct comparison of the follower’s efficiency to that of the leader:
(23)  
(24)  
(25) 
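The two ingredients named in this subsection, a proportional-integral law and an exponential moving average, can be sketched generically as follows. This is not the controller of Eqs. 20 to 25, whose terms and coefficients are not legible in this copy. The gains, the time step, the averaging weight, and all names below are hypothetical placeholders.

```python
class PIController:
    """Generic proportional-integral law: output = kp * e + ki * integral(e dt)."""
    def __init__(self, kp, ki, dt):
        self.kp = kp
        self.ki = ki
        self.dt = dt
        self.integral = 0.0

    def update(self, error):
        self.integral += error * self.dt
        return self.kp * error + self.ki * self.integral

def exponential_moving_average(previous_average, sample, weight):
    """Blend a new sample into a running average."""
    return (1.0 - weight) * previous_average + weight * sample

# Hypothetical use: a constant streamwise position error of 1.0.
streamwise = PIController(kp=2.0, ki=0.5, dt=0.1)
first = streamwise.update(1.0)
second = streamwise.update(1.0)

# Hypothetical use: smoothing a yaw-angle sample.
yaw_average = exponential_moving_average(0.0, 1.0, weight=0.1)
```

A persistent error makes the integral term grow from one update to the next, so the correction strengthens until the follower returns to its target position, while the moving average filters out the periodic yaw oscillation that accompanies each tail-beat.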
Supporting Information: Supplementary Text, Figures, and Movies
Body-deformation during autonomous manoeuvres.
The extent of body-bending that swimmers and undergo when manoeuvring is compared quantitatively in Supplementary Fig. S1. A qualitative comparison was presented in Fig. 2.
We observe that the body-deformation of is noticeably higher than that of a steady swimmer (with relative curvature ), which implies a tendency to take aggressive turns. The deformation for swimmer is markedly lower, which plays an instrumental role in reducing the power required for undulating the body against flow-induced forces.
Comparison of four different swimmers.
The performance metrics for four different swimmers are compared in Supplementary Fig. S2.
Interacting swimmer occasionally attains higher speed than (Supplementary Fig. S2), but at the cost of much higher energy expenditure (Supplementary Fig. S2 and Table 1).
1.0  0.76  0.77  0.66  
CoT  1.0  1.56  3.96  3.86 
1.0  1.41  3.90  3.28  
1.0  0.66  2.33  1.48 
Moreover, the speeds of solitary swimmers and are lower than those of either interacting swimmer ( and ), which suggests that wake-interactions may benefit a follower regardless of the goal being pursued. In Supplementary Fig. S2 attains negative values only for , which is indicative of maximum benefit extracted from flow-induced forces. Both and are capable of generating significantly higher thrust-power than , but suffer from larger deformation-power, and consequently, lower swimming-efficiency. Comparing the columns for and in Supplementary Table 1, we note that interacting with a preceding wake has a measurable impact on swimming-performance; is approximately more efficient than , spends less energy per unit distance travelled, requires less power for body-undulations, and generates higher thrust-power. Wake-interactions yield energetics benefits even for the swimmer actively minimizing lateral displacement from the leader, primarily by increasing thrust-power, as can be surmised by comparing the data for and in Supplementary Table 1.
Uncovering underlying time-dependencies.
While it is relatively straightforward to maintain a particular tandem formation via feedback control (when the follower strays too far to one side, a feedback controller can relay instructions to veer in the opposite direction), the same is not true for maximizing swimming-efficiency. It is difficult to formulate a simple set of a priori rules for maximizing efficiency, especially in dynamically evolving conditions. This happens because: 1) the swimmer perceives only a limited representation of its environment (Fig. 1); and 2) there may be measurable delay between an action and its impact on the reward received over the long term. These traits make deep RL ideal for determining the optimal policy when maximizing swimming-efficiency, especially when augmented with recurrent neural networks (Supplementary Fig. S3). These network architectures are adept at discovering and exploiting long-term time-dependencies.
Flow-interactions at the instant of minimum swimming-efficiency.
The instant when swimmer attains the lowest efficiency during each half-period ( in Fig. 3) is examined in Supplementary Fig. S4.


The mean curve is mostly positive on both the lower and upper surfaces, with large positive peaks generated by interaction with the wake and lifted-vortices. This increase in effort is not offset sufficiently by an increase in , resulting in low swimming-efficiency. Compared to the instance of maximum efficiency (Fig. 4), increased effort is required in the head region, along with an increase in thrust-production by the tail section .
Slight deviations impact performance.
To examine the impact of small deviations in ’s trajectory on its performance, we compare two different time-instances (at the same tail-beat stage) in Supplementary Fig. S5.




At , deviates slightly to the left of its steady trajectory (Supplementary Movie S4), which throws it out of synchronization with the oncoming wake-vortices. The resulting reduction in efficiency at indicates that even slight deviations are capable of impacting performance, and that there may be a measurable delay between actions and consequences. However, the smart-swimmer autonomously corrects for such deviations, and is able to quickly recover its optimal behaviour.
Correlation with the flow-field.
The correlation-coefficient curve shown in Fig. 2, and the correlation map shown in Supplementary Fig. S6, were computed as follows:
(26) 
Here, was recorded in the wake of a solitary swimmer, whereas was recorded at the swimmer’s head. Maxima in provide an estimate for the coordinates where a follower’s head-movements would exhibit long-term synchronization with an undisturbed wake.
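As an illustration of how such a map can be evaluated, the sketch below computes a standard Pearson correlation coefficient between a head-velocity time series and the flow velocity sampled at every point of a grid. This is an assumption about the form of the coefficient, not a reproduction of Eq. (26), and the function name, array layout and synthetic signals are hypothetical.

```python
import numpy as np


def correlation_map(wake_velocity, head_velocity):
    """Pearson correlation of a head signal with the flow at each grid point.

    wake_velocity: array of shape (T, ny, nx), flow velocity over time.
    head_velocity: array of shape (T,), velocity recorded at the head.
    Returns an (ny, nx) array of coefficients in [-1, 1].
    """
    w = wake_velocity - wake_velocity.mean(axis=0)
    h = head_velocity - head_velocity.mean()
    # Covariance between the head signal and every grid-point signal.
    cov = np.tensordot(h, w, axes=(0, 0)) / len(h)
    denom = w.std(axis=0) * h.std()
    # Leave the coefficient at zero where the flow signal is constant.
    return np.divide(cov, denom, out=np.zeros_like(cov), where=denom > 0)


# Synthetic check: the wake signal at column k lags the head by phases[k],
# so the coefficient should peak where the lag is zero.
t = np.linspace(0.0, 4.0 * np.pi, 200, endpoint=False)
phases = np.linspace(0.0, np.pi, 8)
head = np.sin(t)
wake = np.sin(t[:, None, None] - phases[None, None, :])  # shape (200, 1, 8)
rho = correlation_map(wake, head)
best = np.unravel_index(np.argmax(rho), rho.shape)
```

In this synthetic case the maximum lies at the grid point whose signal is in phase with the head, which mirrors the way maxima of the map are interpreted above.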
Limiting the exploration space.
During training, the range of values that a smart-follower’s states can take is constrained, as mentioned previously. This prevents excessive exploration of regions that involve no wake-interactions, and helps to minimize the computational cost of training-simulations. The limits of the bounding box (shown in Supplementary Fig. S7) are kept sufficiently large to provide the follower ample room to swim clear of the unsteady wake, if it determines that interacting with the wake is unfavourable.
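One way to impose such a constraint is sketched below, assuming that a training episode is terminated with a fixed penalty as soon as the follower's position relative to the leader leaves the box. The class, the function, the box limits and the penalty value are hypothetical placeholders rather than the settings used in the training-simulations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Allowed range of the follower's position relative to the leader."""
    x_min: float
    x_max: float
    y_min: float
    y_max: float

    def contains(self, dx, dy):
        return self.x_min <= dx <= self.x_max and self.y_min <= dy <= self.y_max


def step_outcome(box, dx, dy, reward, exit_penalty=-1.0):
    """Return (reward, done) for one step of a training episode.

    Inside the box the reward is passed through unchanged; outside it the
    episode ends with exit_penalty (an assumed value).
    """
    if box.contains(dx, dy):
        return reward, False
    return exit_penalty, True


# Hypothetical limits, expressed in body lengths.
box = BoundingBox(x_min=1.0, x_max=3.0, y_min=-0.5, y_max=0.5)
inside = step_outcome(box, 2.0, 0.0, reward=0.8)
outside = step_outcome(box, 2.0, 0.9, reward=0.8)
```

Keeping the lateral limits wide relative to the wake leaves the follower free to abandon the wake without immediately ending the episode.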
Power distribution in the presence/absence of a preceding wake.
To determine the extent to which wake-induced interactions alter the distribution of and , both of which influence overall swimming-efficiency, we compare these quantities for and in Supplementary Fig. S8.
A similar comparison for and is shown in Supplementary Fig. S9.
For , a greater variation in and is observed (broad envelopes in Supplementary Figs. S8 and S8), compared to the solitary swimmer (Supplementary Figs. S8 and S8). This is caused by ’s interactions with the unsteady wake, which is absent for . The average for shows distinct negative troughs near the head (, Supplementary Fig. S8) and at . A lack of similar troughs for (Supplementary Fig. S8) implies that these benefits originate exclusively from wake-induced interactions. There is no apparent difference in drag between and in the pressure-dominated region close to the head (). However, wake-induced interactions provide a pronounced increase in thrust-power generated by the midsection for (compare Supplementary Figs. S8 and S8, ). Among the four swimmers compared, only shows a distinct negative region close to the head (), which further supports the occurrence of head-motion synchronization with flow-induced forces when efficiency is maximized. Comparing the deformation- and thrust-power distribution for and in Supplementary Fig. S9 provides additional evidence that wake-interactions have a marked impact on swimming-energetics.
Supplementary Movie S1.
3D simulation of three non-autonomous swimmers, in which the leader swims steadily, and the two followers maintain specified relative positions such that they interact favourably with the leader’s wake. The flow-structures have been visualized using isosurfaces of the Q-criterion [53].
Supplementary Movie S2.
2D simulation of a pair of swimmers, in which the leader swims steadily, and the follower () takes autonomous decisions to interact favourably with the wake. The upper panel (labelled ‘’) shows the vorticity field generated by the swimmers, whereas the second panel (labelled ‘v’) shows the lateral flow-velocity. The smart-swimmer appears to synchronize the motion of its head with the lateral flow-velocity, which allows it to increase its swimming-efficiency. The lower panels show the energetic metrics, namely, the swimming-efficiency , the thrust-power , the deformation-power , and the Cost of Transport (CoT).
Supplementary Movie S3.
2D simulation of a pair of swimmers, where the leader performs random actions, and the follower takes autonomous decisions to benefit from the flow-field. The smart-follower, which was trained with a steadily-swimming leader, is able to adapt to the erratic leader’s behaviour without any further training. Remarkably, the follower chooses to interact deliberately with the wake in order to maximize its long-term swimming-efficiency, even though it has the option to swim clear of the unsteady flow-field.
Supplementary Movie S4.
Detailed view of the flow-field around smart-swimmer . The top panel shows the vorticity field in colour and velocity vectors as black arrows. The middle panels show the swimming-efficiency and the deformation-power. The distributions of thrust-power and deformation-power along the swimmer’s left- (‘lower’) and right-lateral (‘upper’) surfaces are shown in the lower panels, and depict how these quantities depend on wake-interactions.
Supplementary Movie S5.
3D simulation of two non-autonomous swimmers, in which the leader swims steadily, and the follower maintains a specified relative position to interact favourably with the wake. The energetic-benefit for the follower is similar to that of each of the followers in Supplementary Movie S1.
Supplementary Movie S6.
3D simulation of three non-autonomous swimmers, in which the leaders use a feedback controller to maintain formation abreast of each other, and the follower holds a specified position relative to the leaders. The energetic-benefit for the follower is double that of the followers in Supplementary Movies 1 and 2, as it now interacts profitably with wake-rings generated by both the leaders.