Reinforcement learning for autonomous swimmers
Reinforcement learning has been introduced to identify navigation policies in several model systems of vortex dipoles, soaring birds and micro-swimmers [30, 31, 32]. Here, we expand on our earlier work [22, 33] combining Reinforcement Learning with Direct Numerical Simulations of the Navier-Stokes equations for two self-propelled and autonomous swimmers. We first investigate two-dimensional swimmers in a tandem configuration and analyse their kinematics for the cases of and (Fig. 2). In both cases, the swimmer trails a leader representing an adult zebrafish of length , swimming steadily at a velocity (Reynolds number ). We employ deep Reinforcement Learning (see Methods section for details), and after training we observe that is able to maintain its position behind the leader quite effectively (, Fig. 2), in accordance with its reward (). Surprisingly, with a reward function proportional to swimming-efficiency (), also settles close to the center of the leader’s wake (Fig. 2 and Supplementary Movie S2), although it receives no reward associated with its relative position. Both and maintain a distance of from their respective leaders (Figure 2). shows a greater proclivity to maintain this separation and intercepts the periodically shed wake-vortices just after they have been fully formed and detach from the leader’s tail. In addition to , there is a second point of stability at (Fig. 2). The difference matches the distance between vortices in the wake of the leader. In both positions the lateral motion of the follower’s head is synchronized with the flow-velocity in the leader’s wake, thus inducing minimal disturbance to the oncoming flow-field. We note that a similar synchronization has been observed when trout minimize muscle usage by interacting with vortex-columns in a cylinder’s wake. undergoes relatively minor body-deformation while manoeuvring (Figure 2), whereas executes aggressive turns involving large body-curvature.
Trout interacting with cylinder-wakes exhibit increased body-curvature, which is contrary to the behaviour displayed by . The difference may be ascribed to the widely-spaced vortex columns generated by the large-diameter cylinders used in the experimental study. Weaving in and out of the comparatively smaller vortices generated by like-sized fish encountered in a school (Fig. 1) would entail excessive energy consumption. We note that maintaining requires significant effort by (Supplementary Fig. S2), since its reward () is insensitive to energy expenditure. A previous study suggested that minimizing lateral displacement led to enhanced swimming-efficiency (compared to the leader), albeit with noticeable deviation from . In the current study, recurrent neural networks augmented with ‘Long Short-Term Memory’ cells (Supplementary Fig. S3) help to encode time-dependencies in the value function, and enable far more robust smart-swimmers. Thus, stringent attempts by to correct for oscillations about (Fig. 2) give rise to increased costs (Supplementary Fig. S2).
Lateral displacement of the smart followers. Histogram showing the probability density function (pdf, left vertical axis) of the swimmer’s preferred center-of-mass location during training. In the early stages of training (first 10000 transitions, green bars), the swimmer does not show a strong preference for maintaining any particular separation distance. Towards the end of training (last 10000 transitions, lilac bars), the swimmer displays a strong preference for maintaining a separation-distance of either or . The solid black line depicts the correlation-coefficient, with peaks signifying locations where the smart-follower’s head-movement would be synchronized with the flow-velocity in an undisturbed wake (see Supplementary Information for details). Comparison of body-deformation for swimmers (top) and (bottom), from to . Their respective trajectories are shown with dash-dot lines, whereas the dashed gray line represents the trajectory of the leader (not shown). A quantitative comparison of body-curvature for the two swimmers may be found in Supplementary Fig. S1.
Intercepting vortices for efficient swimming
To determine the impact of wake-induced interactions on swimming-performance, we compare energetics data for and (Fig. 3). The swimming-efficiency of is significantly higher than that of (Fig. 3), whereas the Cost of Transport (CoT), which represents energy spent for traversing a unit distance, is lower (Fig. 3). Over a duration of 10 tail-beat periods (from to , Supplementary Fig. S2) experiences a increase in average speed compared to , a increase in average swimming-efficiency, and a decrease in CoT. The benefit for results from both a reduction in effort required for deforming its body against flow-induced forces (), and a increase in average thrust-power (). Performance-differences between and exist solely due to the presence/absence of a preceding wake, since both swimmers undergo identical body-undulations throughout the simulations. Comparing the swimming-efficiency and power values of four distinct swimmers (Supplementary Fig. S2 and Supplementary Table 1), we confirm that and are considerably more energetically efficient than either or , thus verifying the hydrodynamic benefits of coordinated swimming.
The efficient swimming of (e.g., point in Fig. 3) is attributed to the synchronized motion of its head with the lateral flow-velocity generated by the wake-vortices of the leader (see panel ‘v’ in Supplementary Movie S2). This mechanism is evidenced by the correlation-curve shown in Fig. 2, and by the co-alignment of velocity vectors close to the head in Figs. 4 and 4. As shown in Supplementary Movie S4, intercepts the oncoming vortices in a slightly skewed manner, splitting each vortex into a stronger (, Fig. 4) and a weaker fragment (). The vortices interact with the swimmer’s own boundary layer to generate ‘lifted-vortices’ (), which in turn generate secondary-vorticity () close to the body. Meanwhile, the wake- and lifted-vortices created during the previous half-period, , , and , have travelled downstream along the body. This sequence of events alternates periodically between the upper (right-lateral) and lower (left-lateral) surfaces, as seen in Supplementary Movie S4. Interactions of with the flow-field at points and in Fig. 3 are analyzed separately in Supplementary Figs. S4 and S5.
The envelope signifies the standard deviation among the 10 snapshots. Deformation-power and thrust-power on the lower (left-lateral) surface of the swimmer.
We observe that, owing to the no-slip boundary condition, the swimmer’s upper surface is covered in a layer of negative vorticity and the lower surface in a layer of positive vorticity (Fig. 4, top panel). The wake- or the lifted-vortices weaken this distribution by generating vorticity of opposite sign (e.g., secondary-vorticity visible in narrow regions between the fish-surface and vortices , , , and ), and create high-speed areas visible as bright spots in Fig. 4 (lower panel). The resulting low-pressure region exerts a suction-force on the surface of the swimmer (Fig. 4, upper panel), which assists body-undulations when the force-vectors coincide with the deformation-velocity (Fig. 4, lower panel), or increases the effort required when they are counter-aligned. The detailed impact of these interactions is demonstrated in Figs. 4 to 4. On the lower surface, generates a suction-force oriented in the same direction as the deformation-velocity ( in Fig. 4), resulting in negative (Fig. 4) and favourable (Fig. 4). On the upper surface, the lifted-vortex increases the effort required for deforming the body (positive peak in Fig. 4 at ), but is beneficial in terms of producing large positive thrust-power (Fig. 4). Moreover, as progresses along the body, it results in a prominent reduction in over the next half-period, similar to the negative peak produced by the lifted-vortex ( in Fig. 4). The average on both the upper and lower surfaces is predominantly negative (i.e., beneficial), in contrast to the minimum swimming-efficiency instance , where a mostly positive distribution signifies substantial effort required for deforming the body (Supplementary Fig. S4). We observe noticeable drag on the upper surface close to (Fig. 4, top panel, and Fig. 4), attributed to a high-pressure region forming in front of the swimmer’s head. Forces induced by are both beneficial and detrimental in terms of generating thrust-power ( in Fig. 4), whereas forces induced by primarily increase drag but assist in body-deformation (Fig. 4). The tail-section ( to ) does not contribute noticeably to either thrust- or deformation-power at the instant of maximum swimming-efficiency.
Energy-saving mechanisms in coordinated swimming
The most discernible behaviour of is the synchronization of its head-movement with the wake-flow. However, the most prominent reduction in deformation-power occurs near the midsection of the body ( in Figs. 4 and 4). This indicates that the technique devised by is markedly different from the energy-conserving mechanisms implied in previous theoretical [6, 34] and computational work, namely, drag-reduction attributed to reduced relative-velocity in the flow, and thrust-increase owing to the ‘channelling effect’. In fact, the predominant energetics-gain (i.e., negative ) occurs in areas of high relative-velocity, for instance near the high-velocity spot generated by vortex (Fig. 4). This dependence of swimming-efficiency on a complex interplay between wake-vortices and body-deformation aligns closely with experimental findings [14, 28].
We remark that the majority of the results presented here were obtained with a steadily-swimming leader. However, with no additional training, is able to extract an energetic-benefit even when exposed to an erratic leader (as seen in Supplementary Movie S3), where it deliberately chooses to interact with the unsteady wake. Moreover, given the head-synchronization tendency of the 2D smart-swimmer, we identify suitable locations behind a 3D leader where the flow velocity would match a follower’s head motion (Supplementary Fig. S6). A feedback controller is used to regulate the undulations of two followers to maintain these target coordinates on either branch of the diverging wake, as shown in Fig. 1 and Supplementary Movie S1. The controlled motion yields an increase in average swimming-efficiency for each of the followers (Fig. 5), and a reduction in the Cost of Transport of each. Overall, the group experiences a increase in efficiency when compared to three isolated non-interacting swimmers. The mechanism of energy-savings closely resembles that observed for the 2D swimmer; an oncoming wake-vortex ring (WR, Fig. 5) interacts with the deforming body to generate a ‘lifted-vortex’ ring (LR, Fig. 5). As this new ring proceeds along the length of the body, it modulates the follower’s swimming-efficiency as observed in Fig. 5. Remarkably, the positioning of the lifted-ring at the instants of minimum and maximum swimming-efficiency resembles the corresponding positioning of lifted-vortices in the 2D case; a slight dip in efficiency corresponds to lifted-vortices interacting with the anterior section of the body (Fig. 5 and Supplementary Fig. S4), whereas an increase occurs upon their interaction with the midsection (Fig. 5 and Fig. 4).
These results showcase the remarkable capability of machine learning, and deep RL in particular, for discovering effective solutions that may not have been envisaged by humans, either owing to pre-existing biases, or due to the difficulty of anticipating the effects of delayed reactions by swimmers in complex flows. Finally, this study demonstrates that deep reinforcement learning can produce navigation algorithms for complex flow-fields, with promising implications for energy savings in autonomous robotic swarms.
Methods
We perform two- and three-dimensional simulations of multiple self-propelled swimmers. In 2D, we use wavelet adapted vortex methods to discretise the velocity-vorticity form of the Navier-Stokes (NS) equations; in 3D, we discretise their velocity-pressure form with the pressure-projection method, using finite differences on a uniform computational grid. The body-geometry of the self-propelled swimmers is based on simplified models of a zebrafish. The swimmers adapt their motion using deep reinforcement learning. The learning process was greatly accelerated by employing recurrent neural networks with long short-term memory (RL-LSTM) as a surrogate of the value function for the smart-swimmer. Additional details regarding the simulation methods and the reinforcement learning algorithm are provided in the Supporting Information.
Acknowledgements
This work was supported by the European Research Council Advanced Investigator Award (Fluid Mechanics of Collective Behavior, Grant: 341117), and the Swiss National Science Foundation Sinergia Award (CRSII3 147675). The authors are grateful to the Swiss National Supercomputing Center (CSCS) for providing access to computational resources (project ‘s658’).
-  Schmidt J (1923) Breeding places and migrations of the eel. Nature 111:51–54.
-  Lang TG, Pryor K (1966) Hydrodynamic performance of porpoises (Stenella attenuata). Science 152:531–533.
-  Aleyev YG (1977) Nekton. (Springer Netherlands).
-  Triantafyllou MS, Weymouth GD, Miao J (2016) Biomimetic Survival Hydrodynamics and Flow Sensing. Annu. Rev. Fluid Mech. 48:1–24.
-  Breder CM (1965) Vortices and fish schools. Zoologica-N.Y. 50:97–114.
-  Weihs D (1973) Hydromechanics of fish schooling. Nature 241:290–291.
-  Shaw E (1978) Schooling Fishes: The school, a truly egalitarian form of organization in which all members of the group are alike in influence, offers substantial benefits to its participants. Am. Sci. 66:166–175.
-  Pavlov DS, Kasumyan AO (2000) Patterns and mechanisms of schooling behavior in fish: A review. J. Ichthyol. 40:163–231.
-  Burgerhout E, et al. (2013) Schooling reduces energy consumption in swimming male European eels, Anguilla anguilla L. J. Exp. Mar. Biol. Ecol. 448:66 – 71.
-  Whittlesey RW, Liska S, Dabiri JO (2010) Fish schooling as a basis for vertical axis wind turbine farm design. Bioinspir. Biomim. 5(3):035005.
-  Chapman JW, et al. (2011) Animal orientation strategies for movement in flows. Curr. Biol. 21:R861 – R870.
-  Montgomery JC, Baker CF, Carton AG (1997) The lateral line can mediate rheotaxis in fish. Nature 389:960–963.
-  Lyon EP (1904) On rheotropism. I. — Rheotropism in fishes. Am. J. Physiol. 12:149–161.
-  Liao JC, Beal DN, Lauder GV, Triantafyllou MS (2003) Fish exploiting vortices decrease muscle activity. Science 302:1566–1569.
-  Oteiza P, Odstrcil I, Lauder G, Portugues R, Engert F (2017) A novel mechanism for mechanosensory-based rheotaxis in larval zebrafish. Nature 547:445–448.
-  Herskin J, Steffensen JF (1998) Energy savings in sea bass swimming in a school: measurements of tail beat frequency and oxygen consumption at different swimming speeds. J. Fish Biol. 53:366–376.
-  Killen SS, Marras S, Steffensen JF, McKenzie DJ (2012) Aerobic capacity influences the spatial position of individuals within fish schools. Proc. Biol. Sci. 279:357–364.
-  Ashraf I, et al. (2017) Simple phalanx pattern leads to energy saving in cohesive fish schooling. Proc. Natl. Acad. Sci. U.S.A.
-  Pitcher TJ (1986) Functions of shoaling behaviour in teleosts in The Behaviour of Teleost Fishes, ed. Pitcher TJ. (Springer US, Boston, MA), pp. 294–337.
-  Lopez U, Gautrais J, Couzin ID, Theraulaz G (2012) From behavioural analyses to models of collective motion in fish schools. Interface Focus 2:693–707.
-  Daghooghi M, Borazjani I (2015) The hydrodynamic advantages of synchronized swimming in a rectangular pattern. Bioinspir. Biomim. 10:056018.
-  Gazzola M, Hejazialhosseini B, Koumoutsakos P (2014) Reinforcement learning and wavelet adapted vortex methods for simulations of self-propelled swimmers. SIAM J. Sci. Comput. 36:B622–B639.
-  Maertens AP, Gao A, Triantafyllou MS (2017) Optimal undulatory swimming for a single fish-like body and for a pair of interacting swimmers. J. Fluid Mech 813:301–345.
-  Mnih V, et al. (2015) Human-level control through deep reinforcement learning. Nature 518:529–533.
-  Müller UK, Smit J, Stamhuis EJ, Videler JJ (2001) How the body contributes to the wake in undulatory fish swimming. J. Exp. Biol. 204:2751–2762.
-  Kern S, Koumoutsakos P (2006) Simulations of optimized anguilliform swimming. J. Exp. Biol. 209:4841–4857.
-  Borazjani I, Sotiropoulos F (2008) Numerical investigation of the hydrodynamics of carangiform swimming in the transitional and inertial flow regimes. J. Exp. Biol. 211:1541–1558.
-  Liao JC, Beal DN, Lauder GV, Triantafyllou MS (2003) The Kármán gait: novel body kinematics of rainbow trout swimming in a vortex street. J. Exp. Biol. 206:1059–1073.
-  Sutton RS, Barto AG (1998) Reinforcement learning: An introduction. (MIT press, Cambridge, MA, USA).
-  Gazzola M, Tchieu AA, Alexeev D, de Brauer A, Koumoutsakos P (2016) Learning to school in the presence of hydrodynamic interactions. J. Fluid Mech. 789:726–749.
-  Reddy G, Celani A, Sejnowski TJ, Vergassola M (2016) Learning to soar in turbulent environments. Proc. Natl. Acad. Sci. U.S.A. 113(33):E4877–E4884.
-  Colabrese S, Gustavsson K, Celani A, Biferale L (2017) Flow navigation by smart microswimmers via reinforcement learning. Phys. Rev. Lett. 118(15):158004.
-  Novati G, et al. (2017) Synchronisation through learning for two self-propelled swimmers. Bioinspir. Biomim. 12:036001.
-  Weihs D (1975) Swimming and Flying in Nature: Volume 2, eds. Wu TYT, Brokaw CJ, Brennen C. (Springer US, Boston, MA), pp. 703–718.
-  Bertsekas DP (1995) Dynamic programming and optimal control. (Athena Scientific, Belmont, MA) Vol. 1.
-  Rossinelli D, et al. (2015) MRAG-I2D: Multi-resolution adapted grids for remeshed vortex methods on multicore architectures. J. Comput. Phys. 288:1–18.
-  Chorin AJ (1968) Numerical solution of the Navier-Stokes equations. Math. Comp. 22:745–762.
-  Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput. 9:1735–1780.
-  Coquerelle M, Cottet GH (2008) A vortex level set method for the two-way coupling of an incompressible fluid with colliding rigid bodies. J. Comput. Phys. 227:9121–9137.
-  Verma S, Abbati G, Novati G, Koumoutsakos P (2017) Computing the force distribution on the surface of complex, deforming geometries using vortex methods and Brinkman penalization. Int. J. Numer. Meth. Fluids 85(8):484–501.
-  Greengard L, Rokhlin V (1987) A fast algorithm for particle simulations. J. Comput. Phys. 73:325–348.
-  Gholami A, Hill J, Malhotra D, Biros G (2015) AccFFT: A library for distributed-memory FFT on CPU and GPU architectures. arXiv preprint arXiv:1506.07933.
-  Tytell ED, Lauder GV (2004) The hydrodynamics of eel swimming. J. Exp. Biol. 207:1825–1841.
-  van Rees WM, Gazzola M, Koumoutsakos P (2013) Optimal shapes for anguilliform swimmers at intermediate Reynolds numbers. J. Fluid Mech. 722:R3 1–12.
-  Bellman RE (2010) Dynamic Programming. (Princeton University Press, Princeton, NJ, USA).
-  van Hasselt H, Guez A, Silver D (2015) Deep reinforcement learning with double Q-learning. CoRR, abs/1509.06461.
-  Mnih V, et al. (2015) Human-level control through deep reinforcement learning. Nature 518:529–533.
-  Riedmiller M (2005) Neural fitted Q iteration – First experiences with a data efficient neural reinforcement learning method in Machine Learning: ECML 2005: Lecture Notes in Computer Science, vol 3720, eds. Gama J, Camacho R, Brazdil PB, Jorge AM, Torgo L. (Springer Berlin Heidelberg, Berlin, Heidelberg), pp. 317–328.
-  Gers FA, Schmidhuber J, Cummins F (2000) Learning to forget: Continual prediction with LSTM. Neural Comput. 12(10):2451–2471.
-  Lin LJ (1992) Ph.D. thesis (Carnegie Mellon University, Pittsburgh, PA, USA).
-  Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 [cs.LG].
-  Graves A, Schmidhuber J (2005) Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 18:602–610.
-  Hunt JCR, Wray AA, Moin P (1988) Eddies, streams, and convergence zones in turbulent flows in Studying Turbulence Using Numerical Simulation Databases, 2. Report CTR-S88. pp. 193–208.
Supporting Information - Methods
The simulations presented here are based on the incompressible Navier-Stokes (NS) equations:
Each swimmer is represented on the computational grid via the characteristic function, and interacts with the fluid by means of the penalty term , with . denotes the swimmer’s combined translational, rotational, and deformation velocity, whereas and correspond to the fluid velocity and viscosity, respectively. represents the pressure, and the fluid density is denoted by .
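The displayed equations did not survive extraction. For orientation, a standard form of the incompressible Navier-Stokes equations with a Brinkman penalization term, written with the quantities defined above, is given below. The symbols (chi for the characteristic function, u_s for the swimmer velocity, lambda for the penalization factor) are our notation and an assumption, not necessarily the authors' exact typesetting.

```latex
\nabla\cdot\mathbf{u} = 0, \qquad
\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\,\mathbf{u}
  = -\frac{1}{\rho}\,\nabla p + \nu\,\nabla^{2}\mathbf{u}
  + \lambda\,\chi\,(\mathbf{u}_s - \mathbf{u})
```

Inside the body (chi = 1) a large lambda drives the fluid velocity towards the swimmer velocity, while outside (chi = 0) the unpenalized equations are recovered.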
The vorticity form of the NS equations was used for the two-dimensional simulations. A wavelet adaptive grid with an effective resolution of points was used to discretize a unit square domain. A lower effective resolution of points was used for the training-simulations to minimize computational cost. The pressure-Poisson equation, necessary for estimating the distribution of flow-induced forces on the swimmers’ bodies, was solved using the Fast Multipole Method [41, 40].
The three-dimensional simulations employed the pressure-projection method for solving the NS equations. The simulations were parallelized via the CUBISM framework, and used a uniform grid consisting of points in a domain of size . The non-divergence-free deformation of the self-propelled swimmers was incorporated into the pressure-Poisson equation as follows:
where represents the intermediate velocity from the convection-diffusion-penalization fractional steps. Equation 3 was solved using a distributed Fast Fourier Transform library (AccFFT).
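A minimal sketch of an FFT-based pressure projection that enforces a prescribed, non-zero velocity divergence (as needed for a deforming body) is shown below. It assumes a 2D periodic square domain and uses `numpy.fft` in place of AccFFT; the source term rho/dt (div u* minus div_target) is an assumed form consistent with the description, not the authors' exact Equation 3, and all function names are hypothetical.

```python
import numpy as np


def _wavenumbers(n, length):
    # Angular wavenumbers for a periodic grid of n points over [0, length)
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=length / n)
    kx, ky = np.meshgrid(k, k, indexing="ij")
    return kx, ky, kx**2 + ky**2


def solve_poisson_periodic(rhs, length):
    """Return the zero-mean p with lap(p) = rhs on a periodic square."""
    _, _, k2 = _wavenumbers(rhs.shape[0], length)
    rhs_hat = np.fft.fft2(rhs)
    # In Fourier space lap -> -k^2; the k = 0 (mean) mode is set to zero.
    p_hat = np.where(k2 > 0, -rhs_hat / np.where(k2 > 0, k2, 1.0), 0.0)
    return np.real(np.fft.ifft2(p_hat))


def spectral_divergence(u, v, length):
    kx, ky, _ = _wavenumbers(u.shape[0], length)
    return np.real(np.fft.ifft2(1j * kx * np.fft.fft2(u) + 1j * ky * np.fft.fft2(v)))


def project(u_star, v_star, div_target, dt, length, rho=1.0):
    """Correct (u*, v*) so that div(u) equals div_target (assumed zero-mean)."""
    kx, ky, _ = _wavenumbers(u_star.shape[0], length)
    rhs = rho / dt * (spectral_divergence(u_star, v_star, length) - div_target)
    p_hat = np.fft.fft2(solve_poisson_periodic(rhs, length))
    # u = u* - (dt / rho) grad p, with the gradient taken spectrally
    u = np.real(np.fft.ifft2(np.fft.fft2(u_star) - dt / rho * 1j * kx * p_hat))
    v = np.real(np.fft.ifft2(np.fft.fft2(v_star) - dt / rho * 1j * ky * p_hat))
    return u, v, np.real(np.fft.ifft2(p_hat))
```

Using spectral derivatives for the divergence, the Laplacian, and the gradient keeps the discrete operators consistent, so the projected field reproduces `div_target` to round-off for resolved modes.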
Flow-induced forces, and energetics variables.
The pressure-induced and viscous forces acting on the swimmers are computed as follows:
Here, represents the pressure acting on the swimmer’s surface, is the strain-rate tensor on the surface, and denotes the infinitesimal surface area. Since the net force on a self-propelled swimmer averages to zero during steady swimming (thrust balances drag), we determine the instantaneous thrust as follows:
where . Similarly, the instantaneous drag may be determined as:
Using these quantities, the thrust-, drag-, and deformation-power are computed as:
where represents the deformation-velocity of the swimmer’s body. The double-integrals in these equations represent surface-integration over the swimmer’s body, and yield measurements for time-series analysis. On the other hand, only the integrand is evaluated when surface-distributions of thrust-, drag-, or deformation-power are required (as in Figs. 4 to 4).
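The steps above (surface forces, thrust/drag split, and the three power measures) can be sketched on a discretized surface as follows. This is an illustration under stated assumptions: the surface is a set of panels with outward normals (pointing from the body into the fluid) and areas; the traction is the Newtonian form -p n + 2 mu D n; thrust and drag are taken as the positive and negative parts of the panel forces projected on the centre-of-mass velocity direction; and the deformation power is minus the integral of force times deformation-velocity, so that forces aligned with the deformation-velocity give negative (beneficial) values, as described in the main text. Function names are hypothetical.

```python
import numpy as np


def surface_traction(p, normals, strain, mu):
    """Force per unit area on each panel: pressure part -p n plus viscous part 2 mu D n."""
    return -p[:, None] * normals + 2.0 * mu * np.einsum("kij,kj->ki", strain, normals)


def power_budget(traction, area, u_def, u_cm):
    """Return (P_thrust, P_drag, P_def, per-panel P_def integrand).

    Assumed conventions: P_drag is <= 0 (sum of negative projections), and
    P_def > 0 means the swimmer spends power deforming against the flow.
    """
    speed = np.linalg.norm(u_cm)
    u_hat = u_cm / speed
    f_par = (traction @ u_hat) * area          # panel force along the swimming direction
    thrust = np.sum(np.maximum(f_par, 0.0))
    drag = np.sum(np.minimum(f_par, 0.0))
    p_def_density = -np.einsum("ki,ki->k", traction, u_def)  # integrand, for surface maps
    p_def = np.sum(p_def_density * area)       # surface integral, for time series
    return thrust * speed, drag * speed, p_def, p_def_density
```

Returning the per-panel integrand alongside the integrals mirrors the two uses described above: time-series analysis and surface distributions.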
The instantaneous swimming-efficiency is based on a modified form of the Froude efficiency proposed in ref. :
To compute both and the Cost of Transport (CoT), we neglect negative values of , which can result from beneficial interactions of the smart-swimmer with the leader’s wake:
This restriction accounts for the fact that the elastically rigid swimmer may not store energy furnished by the flow, and yields a conservative estimate of potential savings in the CoT. We note that percentage-changes in , reported in the main text and the supplementary section, have been computed using this bounded value to avoid overstating any potential benefits.
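The bounded efficiency and Cost of Transport can be sketched as below. The exact expressions were lost in extraction, so the forms used here are assumptions: efficiency as P_thrust / (P_thrust + max(P_def, 0)), and CoT as the time-integral of max(P_def, 0) divided by the net distance travelled by the centre of mass (no mass normalization). Only the clipping of negative deformation power is taken directly from the text.

```python
import numpy as np


def swimming_efficiency(p_thrust, p_def):
    """Assumed modified Froude efficiency; P_thrust is assumed strictly positive."""
    # Negative deformation power (energy supplied by the flow) is not credited,
    # because the swimmer is assumed unable to store it.
    p_def_pos = np.maximum(p_def, 0.0)
    return p_thrust / (p_thrust + p_def_pos)


def cost_of_transport(t, p_def, x_cm):
    """Assumed CoT: clipped deformation energy per unit net distance travelled."""
    t = np.asarray(t, dtype=float)
    p = np.maximum(np.asarray(p_def, dtype=float), 0.0)
    energy = np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(t))  # trapezoidal rule
    distance = abs(x_cm[-1] - x_cm[0])
    return energy / distance
```

Clipping before integrating is what makes the reported savings conservative: a beneficial instant can lower the cost to zero at most, never below it.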
Swimmer shape and kinematics.
The Reynolds number of the self-propelled swimmers is computed as . The body-geometry is based on a simplified model of a zebrafish . The half-width of the 2D profile is described as follows:
where is the arc-length along the midline of the geometry, is the body length, , , and . For 3D simulations, the geometry comprises elliptical cross sections, with the half-width and half-height described via cubic B-splines. Six control-points define the half-width: , whereas eight control-points define the half-height: . The length was set to , which keeps the grid-resolution, i.e., the number of points along the fish midline, comparable to the 2D simulations. Body-undulations for both 2D and 3D simulations were generated as a travelling-wave defining the curvature along the midline:
Here is the curvature amplitude and varies linearly from to .
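A minimal sketch of the travelling curvature wave and of how a midline shape follows from it is given below. The wave form A(s) sin(2 pi (t/T - s/L)), i.e. a wavelength equal to the body length, and the amplitude values are assumptions, since the original equation and constants did not survive extraction; only the linear variation of the amplitude along the body is taken from the text. The midline is recovered by integrating theta' = kappa, x' = cos(theta), y' = sin(theta), ignoring the rigid-body translation and rotation that the full simulation adds.

```python
import numpy as np


def midline_curvature(s, t, length, period, amp_head, amp_tail):
    """Assumed travelling wave; amp_head and amp_tail are hypothetical parameters."""
    amp = amp_head + (amp_tail - amp_head) * s / length  # linear amplitude envelope
    return amp * np.sin(2.0 * np.pi * (t / period - s / length))


def midline_from_curvature(s, kappa):
    """Integrate theta' = kappa, x' = cos(theta), y' = sin(theta) (trapezoidal rule)."""
    ds = np.diff(s)

    def cumtrapz(f):
        return np.concatenate([[0.0], np.cumsum(0.5 * (f[1:] + f[:-1]) * ds)])

    theta = cumtrapz(kappa)
    return cumtrapz(np.cos(theta)), cumtrapz(np.sin(theta))
```

For example, `midline_from_curvature(s, midline_curvature(s, t, 1.0, 1.0, a0, a1))` with `s = np.linspace(0, 1, 201)` gives the body centreline at time `t` in body-fixed coordinates.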
Reinforcement learning (RL) is a process by which an agent (in this case, the smart-swimmer) learns to earn rewards through trial-and-error interaction with its environment. At each turn, the agent observes the state of the environment and performs an action , which influences both the transition to the next state and the reward received . The agent’s goal is to learn the optimal control policy which maximises the action value , defined as the sum of discounted future rewards:
Here, denotes the terminal state of a training-simulation, and the discount factor is set to 0.9. The optimal action-value function is a fixed point of the Bellman equation: . We approximate using a neural network [46, 47, 48] with weights , which are updated iteratively to minimize the temporal difference error:
Here, is a set of target weights, and is the best action in state computed with the current weights (). The target weights are updated towards the current weights as , where is an under-relaxation factor used to stabilize the algorithm.
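The quantities in the update above can be sketched as plain functions: the discounted return with the stated discount factor of 0.9, a bootstrapped target in which the next action is selected with the current weights but evaluated with the target weights (the double Q-learning construction this description matches), the temporal-difference error, and the under-relaxed target-weight update. This is an illustrative sketch with hypothetical names, operating on precomputed Q-value arrays rather than on the authors' network; in training, the squared TD error would typically serve as the loss.

```python
import numpy as np

GAMMA = 0.9  # discount factor stated in the text


def discounted_return(rewards, gamma=GAMMA):
    """Sum of discounted future rewards r_1 + gamma r_2 + gamma^2 r_3 + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def double_q_target(reward, q_next_current, q_next_target, terminal, gamma=GAMMA):
    """Select the next action with the current weights, evaluate it with the target weights."""
    if terminal:
        return reward  # no bootstrapping beyond a terminal state
    a_best = int(np.argmax(q_next_current))
    return reward + gamma * float(q_next_target[a_best])


def td_error(q_sa, target):
    return target - q_sa


def soft_update(target_weights, weights, alpha):
    """Under-relaxed update: w_target <- (1 - alpha) * w_target + alpha * w."""
    return [(1.0 - alpha) * wt + alpha * w for wt, w in zip(target_weights, weights)]
```

Decoupling action selection from action evaluation reduces the overestimation bias of plain Q-learning, while the small under-relaxation factor keeps the target from chasing the current weights too quickly.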
States and actions.
The six observed-state variables perceived by the learning agent include , , , the two most recent actions taken by the agent, and the current tail-beat ‘stage’ . The permissible range of the observed-state variables is limited to: ; (boundary depicted by in Supplementary Fig. S7); and . If the agent exceeds any of these thresholds, the training-simulation terminates and the agent receives a terminal reward .
The smart-swimmer (or agent) is capable of manoeuvring by actively manipulating the curvature-wave travelling down the body. This is accomplished by linearly superimposing a piecewise function on the baseline curvature (equation 14):
The curve is composed of 3 distinct segments:
The curve is a clamped cubic spline with , , and , . represents the time-instance when action is taken, whereas represents the corresponding control-amplitude, which may take five discrete values: , , and .
Neural network architecture.
One of the assumptions in RL is that the transition probability to a new state is independent of the previous transitions, given and , i.e.:
This assumption is invalidated whenever the agent has a limited perception of the environment. In most realistic cases the agent receives an observation rather than the complete state of the environment . Therefore, past observations carry information relevant for future transitions (i.e., ), and should be taken into account in order to make optimal decisions. This operation can be approximated by a Recurrent Neural Network (RNN), which can learn to compute and remember important features in past observations. In this work we approximate the action-value function with an LSTM-RNN composed of three layers of 24 fully connected LSTM cells each, and terminating in a linear layer (Supplementary Fig. S3). The last layer computes a vector of action-values with one component for each possible action available to the agent ( represents the activation of the network at the previous turn).
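A forward pass through the architecture described above (three layers of 24 LSTM cells followed by a linear output layer) can be sketched in NumPy as follows. Assumptions: six inputs (the six observed-state variables listed under "States and actions"), five outputs (the five discrete control amplitudes), standard LSTM gates with a forget gate and no peephole connections, and random initial weights; training is omitted and all class names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class LSTMLayer:
    """One layer of standard LSTM cells (input, forget, output gates; no peepholes)."""

    def __init__(self, n_in, n_cells, rng):
        scale = 1.0 / np.sqrt(n_in + n_cells)
        # Stacked weights for the four gate pre-activations [i, f, o, g]
        self.W = rng.normal(0.0, scale, size=(4 * n_cells, n_in + n_cells))
        self.b = np.zeros(4 * n_cells)
        self.n_cells = n_cells

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        n = self.n_cells
        i = sigmoid(z[:n])            # input gate
        f = sigmoid(z[n:2 * n])       # forget gate
        o = sigmoid(z[2 * n:3 * n])   # output gate
        g = np.tanh(z[3 * n:])        # candidate cell update
        c = f * c + i * g             # cell state carries memory across turns
        h = o * np.tanh(c)            # hidden state passed to the next layer
        return h, c


class RecurrentQNetwork:
    """Stacked LSTM layers, then a linear layer with one Q-value per action."""

    def __init__(self, n_obs=6, n_cells=24, n_layers=3, n_actions=5, seed=0):
        rng = np.random.default_rng(seed)
        sizes = [n_obs] + [n_cells] * n_layers
        self.layers = [LSTMLayer(sizes[k], sizes[k + 1], rng) for k in range(n_layers)]
        self.W_out = rng.normal(0.0, 1.0 / np.sqrt(n_cells), size=(n_actions, n_cells))
        self.b_out = np.zeros(n_actions)

    def initial_state(self):
        return [(np.zeros(layer.n_cells), np.zeros(layer.n_cells)) for layer in self.layers]

    def step(self, obs, state):
        """Map the current observation and the previous recurrent state to Q-values."""
        x = np.asarray(obs, dtype=float)
        new_state = []
        for layer, (h, c) in zip(self.layers, state):
            h, c = layer.step(x, h, c)
            new_state.append((h, c))
            x = h
        return self.W_out @ x + self.b_out, new_state
```

Because the recurrent state is carried from one turn to the next, the same observation can yield different action-values depending on the history, which is exactly what the partial-observability argument above requires.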
During training, both the leader and the follower (learning agent) start from rest. The leader swims steadily along a straight line, whereas the follower manoeuvres according to the actions supplied to it. Multiple independent simulations run simultaneously, with each of these sending the current observed-state of the agent to a central processor, and in turn receiving the next action to be performed. The central processor computes using an ε-greedy policy (with ε gradually annealed from to ) from the most recently updated function. Once a training-simulation reaches a terminal state (e.g., the follower hits the boundary labelled in Supplementary Fig. S7), all the messages exchanged between the simulation and the central processor are appended to a training set of sequences . In the meantime, the network is continually updated by sampling sequences from the set , according to algorithm 1.
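The action selection performed by the central processor can be sketched as an epsilon-greedy rule with an annealed exploration rate. The annealing endpoints were lost in extraction, so the linear schedule and any start/end values passed to it are hypothetical; only the epsilon-greedy structure is taken from the text.

```python
import numpy as np


def epsilon_schedule(step, eps_start, eps_end, n_anneal):
    """Hypothetical linear annealing from eps_start to eps_end over n_anneal steps."""
    frac = min(step / n_anneal, 1.0)
    return eps_start + frac * (eps_end - eps_start)


def epsilon_greedy(q_values, eps, rng):
    """Random action with probability eps, otherwise the greedy (highest Q) action."""
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```

Early in training the agent explores almost uniformly; as epsilon decays it increasingly exploits the learned action-values.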
Proportional-Integral feedback controller.
The PI controller modulates the 3D follower’s body-kinematics, which allows it to maintain a specific position (, , ) relative to the leader:
The factor modifies the undulation envelope, and controls the acceleration or deceleration of the follower based on its streamwise distance from the target position:
The term adds a baseline curvature to the follower’s midline to correct for lateral deviations:
Here, represents the follower’s yaw angle about the -axis, and is its exponential moving average: . The swimmers’ -positions remain fixed at , as out-of-plane motion is not permitted. The controller-coefficients were selected to have a minimal impact on regular swimming kinematics, which allows for a direct comparison of the follower’s efficiency to that of the leader:
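The structure of the follower controller can be sketched as below. The controller equations and coefficients did not survive extraction, so this is not the authors' law: it swaps in a textbook proportional-integral rule on the lateral offset, a proportional streamwise rule on the undulation envelope, and damping by the exponential moving average of the yaw angle. All gains, the +x swimming direction, and the sign convention (positive curvature turns towards +y) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FollowerPI:
    x_target: float
    y_target: float
    k_x: float = 1.0       # hypothetical streamwise gain
    k_p: float = 1.0       # hypothetical proportional gain on lateral offset
    k_i: float = 0.1       # hypothetical integral gain on lateral offset
    k_theta: float = 1.0   # hypothetical gain on the averaged yaw angle
    beta: float = 0.1      # hypothetical EMA weight for the yaw angle
    theta_avg: float = 0.0
    integral_y: float = 0.0

    def update(self, x, y, theta, dt):
        """Return (envelope factor, baseline curvature) for the current position."""
        # Streamwise: enlarge the undulation envelope when lagging behind the
        # target (swimming direction assumed to be +x), shrink it when ahead.
        envelope = max(0.0, 1.0 + self.k_x * (self.x_target - x))
        # Exponential moving average of the yaw angle.
        self.theta_avg = (1.0 - self.beta) * self.theta_avg + self.beta * theta
        # Lateral PI law on the offset, damped by the averaged heading.
        err_y = self.y_target - y
        self.integral_y += err_y * dt
        curvature = (self.k_p * err_y + self.k_i * self.integral_y
                     - self.k_theta * self.theta_avg)
        return envelope, curvature
```

At the target position with zero yaw the sketch returns an envelope factor of 1 and zero baseline curvature, i.e. unmodified kinematics, in line with the stated aim of minimally perturbing regular swimming.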
Supporting Information - Supplementary Text, Figures, and Movies
Body-deformation during autonomous manoeuvres.
We observe that the body-deformation of is noticeably higher than that of a steady swimmer (with relative curvature ), which implies a tendency to take aggressive turns. The deformation for swimmer is markedly lower, which plays an instrumental role in reducing the power required for undulating the body against flow-induced forces.
Comparison of four different swimmers.
The performance metrics for four different swimmers are compared in Supplementary Fig. S2.
Moreover, the speeds of solitary swimmers and are lower than those of either interacting swimmer ( and ), which suggests that wake-interactions may benefit a follower regardless of the goal being pursued. In Supplementary Fig. S2 attains negative values only for , which is indicative of maximum benefit extracted from flow-induced forces. Both and are capable of generating significantly higher thrust-power than , but suffer from larger deformation-power, and consequently, lower swimming-efficiency. Comparing the columns for and in Table S1, we note that interacting with a preceding wake has a measurable impact on swimming-performance; is approximately more efficient than , spends less energy per unit distance travelled, requires less power for body-undulations, and generates higher thrust-power. Wake-interactions yield energetics benefits even for the swimmer actively minimizing lateral displacement from the leader, primarily by increasing thrust-power, as can be surmised by comparing the data for and in Supplementary Table 1.
Uncovering underlying time-dependencies.
While it is relatively straightforward to maintain a particular tandem formation via feedback control (when the follower strays too far to one side, a feedback controller can relay instructions to veer in the opposite direction), the same is not true for maximizing swimming-efficiency. It is difficult to formulate a simple set of a priori rules for maximizing efficiency, especially in dynamically evolving conditions, because: 1) the swimmer perceives only a limited representation of its environment (Fig. 1); and 2) there may be a measurable delay between an action and its impact on the reward received over the long term. These traits make deep RL ideal for determining the optimal policy when maximizing swimming-efficiency, especially when augmented with recurrent neural networks (Supplementary Fig. S3). These network architectures are adept at discovering and exploiting long-term time-dependencies.
Flow-interactions at the instant of minimum swimming-efficiency.
The mean curve is mostly positive on both the lower and upper surfaces, with large positive peaks generated by interaction with the wake- and lifted-vortices. This increase in effort is not sufficiently offset by an increase in , resulting in low swimming-efficiency. Compared to the instance of maximum efficiency (Fig. 4), increased effort is required in the head region, along with an increase in thrust-production by the tail section.
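The surface distributions discussed here can be reduced to a single power value by integrating along the body. This is a minimal sketch, assuming the power density is sampled at the same arclength stations `s` on the lower and upper surfaces; the trapezoidal rule and the sample values are illustrative choices, not the authors' post-processing.

```python
from typing import List


def trapezoid(y: List[float], x: List[float]) -> float:
    """Trapezoidal-rule integral of samples y over (possibly non-uniform) x."""
    return sum(0.5 * (y[i] + y[i + 1]) * (x[i + 1] - x[i]) for i in range(len(x) - 1))


def total_surface_power(s: List[float], p_lower: List[float], p_upper: List[float]) -> float:
    """Sum of the power integrated over the lower and upper body surfaces."""
    return trapezoid(p_lower, s) + trapezoid(p_upper, s)


# Hypothetical distributions along a body of unit length: a positive peak
# on the upper surface raises the total effort.
s = [0.0, 0.5, 1.0]
print(total_surface_power(s, p_lower=[1.0, 1.0, 1.0], p_upper=[0.0, 2.0, 0.0]))
```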
Slight deviations impact performance.
To examine the impact of small deviations in ’s trajectory on its performance, we compare two different time-instances (at the same tail-beat stage) in Supplementary Fig. S5.
At , deviates slightly to the left of its steady trajectory (Supplementary Movie S4), which throws it out of synchronization with the oncoming wake-vortices. The resulting reduction in efficiency at indicates that even slight deviations are capable of impacting performance, and that there may be a measurable delay between actions and consequences. However, the smart-swimmer autonomously corrects for such deviations, and is able to quickly recover its optimal behaviour.
Correlation with the flow-field.
Here, was recorded in the wake of a solitary swimmer, whereas was recorded at the swimmer’s head. Maxima in provide an estimate for the coordinates where a follower’s head-movements would exhibit long-term synchronization with an undisturbed wake.
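One way to locate such maxima is to correlate a head-motion signal with the wake's lateral velocity recorded at candidate points and keep the point with the highest coefficient. The sketch below assumes a Pearson correlation between equally sampled time series; the exact correlation measure, the function names, and the synthetic sinusoidal signals are all hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def pearson(a: List[float], b: List[float]) -> float:
    """Pearson correlation coefficient of two equally sampled time series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0.0 or vb == 0.0:
        return 0.0  # a constant signal carries no phase information
    return cov / math.sqrt(va * vb)


def best_synchronization_point(
    head: List[float], wake: Dict[Point, List[float]]
) -> Tuple[Point, Dict[Point, float]]:
    """Return the candidate point whose wake signal best matches the head signal."""
    scores = {pt: pearson(head, sig) for pt, sig in wake.items()}
    return max(scores, key=scores.get), scores


# Synthetic usage: ten full periods of a unit-period oscillation; the wake
# signal at (1.0, 0.0) is in phase with the head, the others are shifted.
t = [0.05 * k for k in range(200)]
head = [math.sin(2 * math.pi * ti) for ti in t]
wake = {
    (1.0, 0.0): [math.sin(2 * math.pi * ti) for ti in t],
    (1.5, 0.0): [math.sin(2 * math.pi * ti + math.pi / 2) for ti in t],
    (2.0, 0.0): [math.sin(2 * math.pi * ti + math.pi) for ti in t],
}
best, scores = best_synchronization_point(head, wake)
print(best)
```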
Limiting the exploration space.
During training, the range of values that a smart-follower's states can take is constrained, as mentioned previously. This prevents excessive exploration of regions that involve no wake-interactions, and helps to minimize the computational cost of training-simulations. The limits of the bounding box (shown in Supplementary Fig. S7) are kept sufficiently large to give the follower ample room to swim clear of the unsteady wake, if it determines that interacting with the wake is unfavourable.
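A constraint of this kind can be enforced with a simple check on the follower's position relative to the leader at every decision step. In this sketch the box limits, the choice to end the episode on exit, and the terminal penalty of -1 are all hypothetical; the actual limits are those shown in Supplementary Fig. S7.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BoundingBox:
    # Hypothetical limits on the follower's position relative to the leader,
    # expressed in body lengths.
    x_min: float
    x_max: float
    y_min: float
    y_max: float

    def contains(self, dx: float, dy: float) -> bool:
        return self.x_min <= dx <= self.x_max and self.y_min <= dy <= self.y_max


def step_outcome(box: BoundingBox, dx: float, dy: float, reward: float) -> Tuple[float, bool]:
    """Return (reward, done); leaving the box ends the episode (assumed rule)."""
    if box.contains(dx, dy):
        return reward, False
    return -1.0, True  # hypothetical terminal penalty


box = BoundingBox(x_min=1.0, x_max=3.0, y_min=-0.5, y_max=0.5)
print(step_outcome(box, dx=2.0, dy=0.1, reward=0.3))
print(step_outcome(box, dx=3.5, dy=0.1, reward=0.3))
```

Ending the episode on exit keeps training time focused on states where wake-interactions occur, which is the stated purpose of the constraint.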
Power distribution in the presence/absence of a preceding wake.
To determine the extent to which wake-induced interactions alter the distribution of and , both of which influence overall swimming-efficiency, we compare these quantities for and in Supplementary Fig. S8.
A similar comparison for and is shown in Supplementary Fig. S9.
For , a greater variation in and is observed (broad envelopes in Supplementary Figs. S8 and S8) than for the solitary swimmer (Supplementary Figs. S8 and S8). This is caused by ’s interactions with the unsteady wake, which is absent for . The average for shows distinct negative troughs near the head (, Supplementary Fig. S8) and at . The lack of similar troughs for (Supplementary Fig. S8) implies that these benefits originate exclusively from wake-induced interactions. There is no apparent difference in drag between and in the pressure-dominated region close to the head (). However, wake-induced interactions provide a pronounced increase in the thrust-power generated by the midsection for (compare Supplementary Figs. S8 and S8, ). Among the four swimmers compared, only shows a distinct negative region close to the head (), which further supports the occurrence of head-motion synchronization with flow-induced forces when efficiency is maximized. Comparing the deformation- and thrust-power distributions for and in Supplementary Fig. S9 provides additional evidence that wake-interactions have a marked impact on swimming-energetics.
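The mean curves, envelopes, and negative troughs compared above can be extracted from a time series of power distributions sampled along the body. This is a minimal sketch, assuming each snapshot lists the power density at the same body stations; the function names and sample values are hypothetical.

```python
from typing import List, Tuple


def mean_and_envelope(snapshots: List[List[float]]) -> Tuple[List[float], List[float], List[float]]:
    """Per-station mean, minimum and maximum across time.

    snapshots[t][i] is the power density at body station i and time-sample t.
    """
    stations = list(zip(*snapshots))  # regroup samples by body station
    mean = [sum(col) / len(col) for col in stations]
    lower = [min(col) for col in stations]
    upper = [max(col) for col in stations]
    return mean, lower, upper


def negative_troughs(s: List[float], mean: List[float]) -> List[float]:
    """Body coordinates where the time-averaged power density is negative."""
    return [si for si, m in zip(s, mean) if m < 0.0]


# Hypothetical data: two snapshots at three stations along a unit-length body.
s = [0.0, 0.5, 1.0]
mean, lower, upper = mean_and_envelope([[1.0, -2.0, 0.5], [3.0, 0.0, 0.5]])
print(mean, lower, upper)
print(negative_troughs(s, mean))
```

A wide gap between `lower` and `upper` corresponds to the broad envelopes attributed to unsteady wake-interactions, while a narrow gap indicates a nearly steady distribution.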
Supplementary Movie S1.
3D simulation of three nonautonomous swimmers, in which the leader swims steadily, and the two followers maintain specified relative positions such that they interact favourably with the leader’s wake. The flow-structures have been visualized using isosurfaces of the Q-criterion.
Supplementary Movie S2.
2D simulation of a pair of swimmers, in which the leader swims steadily, and the follower () takes autonomous decisions to interact favourably with the wake. The upper panel (labelled ‘’) shows the vorticity field generated by the swimmers, whereas the second panel (labelled ‘v’) shows the lateral flow-velocity. The smart-swimmer appears to synchronize the motion of its head with the lateral flow-velocity, which allows it to increase its swimming-efficiency. The lower panels show the energetic metrics, namely, the swimming-efficiency , the thrust-power , the deformation-power , and the Cost of Transport (CoT).
Supplementary Movie S3.
2D simulation of a pair of swimmers, where the leader performs random actions, and the follower takes autonomous decisions to benefit from the flow-field. The smart-follower, which was trained with a steadily-swimming leader, is able to adapt to the leader’s erratic behaviour without any further training. Remarkably, the follower chooses to interact deliberately with the wake in order to maximize its long-term swimming-efficiency, even though it has the option to swim clear of the unsteady flow-field.
Supplementary Movie S4.
Detailed view of the flow-field around smart-swimmer . The top panel shows the vorticity field in colour and velocity vectors as black arrows. The middle panels show the swimming-efficiency and the deformation-power. The distribution of thrust-power and deformation-power along the swimmer’s left- (‘lower’) and right-lateral (‘upper’) surfaces are shown in the lower panels, and depict how these quantities depend on wake-interactions.
Supplementary Movie S5.
3D simulation of two nonautonomous swimmers, in which the leader swims steadily, and the follower maintains a specified relative position to interact favourably with the wake. The energetic-benefit for the follower is similar to that of each of the followers in Supplementary Movie S1.
Supplementary Movie S6.
3D simulation of three nonautonomous swimmers, in which the leaders use a feedback controller to maintain formation abreast of each other, and the follower holds a specified position relative to the leaders. The energetic-benefit for the follower is double that of the followers in Supplementary Movies S1 and S2, as it now interacts profitably with wake-rings generated by both leaders.