Zermelo's problem: Optimal point-to-point navigation in 2D turbulent flows using Reinforcement Learning

07/17/2019 ∙ by Luca Biferale, et al. ∙ 4

Finding the path that minimizes the time to navigate between two given points in a fluid flow is known as Zermelo's problem. Here, we investigate it using a Reinforcement Learning (RL) approach for the case of a vessel that has a slip velocity with fixed intensity, V_s, but variable direction, and that navigates in a 2D turbulent sea. We use an Actor-Critic RL algorithm and compare the results with strategies obtained analytically from continuous Optimal Navigation (ON) protocols. We show that for our application, ON solutions are unstable for the typical duration of the navigation process and are therefore not useful in practice. On the other hand, RL solutions are much more robust with respect to small changes in the initial conditions and to external noise, and are able to find optimal trajectories even when V_s is much smaller than the maximum flow velocity. Furthermore, we show how the RL approach is able to take advantage of the flow properties in order to reach the target, especially when the steering speed is small.




I Introduction

Figure 1: Panel (a): Image of the turbulent snapshot used as the advecting flow, with the starting and ending points of the two considered problems. Problem P1 goes from (red) and problem P2 goes from (green). For each case, we also show an illustrative trajectory . The flow is obtained from a fully periodic snapshot of a 2D turbulent configuration in the inverse energy cascade regime with a multi-scale power-law Fourier spectrum, , shown in panel (b). For RL optimization, the initial conditions are taken randomly inside a circle of radius centered around . Similarly, the final targets are the circles of radius centered around for each problem. The flow area is covered by tiles with and of size which identify the state-space for the RL protocol. Every time interval , the agent selects one of the 8 possible actions with (the steering directions depicted in panel (c)) according to the policy (see Sec. III.1), where is the probability distribution of the action given the current state of the agent at that time. The large-scale periodicity of the underlying flow is , and we fixed .

Path planning for small autonomous marine vehicles Petres et al. (2007); Witt and Dunbabin (2008) such as wave and current gliders Kraus (2012); Smith et al. (2011), active drifters Lumpkin and Pazos (2007); Niiler (2001), buoyant underwater explorers, and small swimming drones is key for many geophysical Lermusiaux et al. (2017) and engineering Bechinger et al. (2016); Kurzthaler et al. (2018); Popescu, Tasinkevych, and Dietrich (2011); Baraban et al. (2012) applications. In nature, these vessels are affected by environmental disturbances like wind, waves and ocean currents, often in competition, and often characterized by unpredictable (chaotic) evolutions. This is problematic when one wants to send probes to specific locations, for example when trying to optimize data-assimilation for environmental applications Lermusiaux et al. (2017); Carrassi et al. (2018); Lakshmivarahan and Lewis (2013); Clark Di Leoni, Mazzino, and Biferale (2018, 2019). Most of the time, a dense set of fixed platforms or manned vessels is not an economically viable solution. As a result, scientists rely on networks of moving sensors, e.g. near-surface current drifters Centurioni (2018) or buoyant explorers Roemmich et al. (2009). In both cases, the platforms move with the surface flow (or with a depth current) and are either fully passive Centurioni (2018) or inflatable/deflatable with some predetermined scheduled protocol Roemmich et al. (2009). The main drawback is that they might be distributed in a non-optimal way, as they might accumulate in uninteresting regions, or disperse away from key points. Besides this applied motivation, the problem of (time) optimal point-to-point navigation in a flow, known as Zermelo’s problem Zermelo (1931), is interesting per se in the framework of Optimal Control Theory Bryson and Ho (1975); Ben-Asher (2010); Liebchen and Löwen (2019); Hays et al. (2014).

In this paper, we tackle Zermelo’s problem for a specific but important application, the case of a two-dimensional fully turbulent flow Xia et al. (2011); Boffetta and Ecke (2012); Alexakis and Biferale (2018) with an entangled distribution of complex spatial features, such as recirculating eddies or shear regions, and with multi-scale spectral properties (see Fig. 1 for a graphical summary of the problem). In such conditions, trivial or naive navigation policies can be extremely inefficient and even ineffective. To overcome this, we have implemented an approach based on Reinforcement Learning (RL) Sutton and Barto (2018) in order to find a robust quasi-optimal policy that accomplishes the task. Furthermore, we compare RL with an approach based on Optimal Navigation (ON) theory Rugh and Rugh (1996); Pontryagin (2018). To the best of our knowledge, only simple advecting flows have been studied so far for both ON Techy (2011); Yoo and Kim (2016); Liebchen and Löwen (2019) and RL Yoo and Kim (2016).

Promising results have been obtained when applying RL algorithms to similar problems, such as the training of smart inertial particles or swimming particles navigating intense vortex regions Colabrese et al. (2018), Taylor-Green flows Colabrese et al. (2017) and ABC flows Gustavsson et al. (2017). RL has also been successfully implemented to reproduce the schooling of fish Gazzola et al. (2016); Verma, Novati, and Koumoutsakos (2018), the soaring of birds in turbulent environments Reddy et al. (2016, 2018) and in many other applications Muinos-Landin, Ghazi-Zahedi, and Cichos (2018); Novati, Mahadevan, and Koumoutsakos (2018); Tsang et al. (2018).

In this paper, we show that for the case of vessels that have a slip velocity with fixed intensity but variable direction, RL can find a set of quasi-optimal paths to efficiently navigate the flow. Moreover, RL, unlike ON, can provide a set of highly stable solutions, which are insensitive to small disturbances in the initial condition and successful even when the slip velocity is much smaller than the guiding flow. We also show how the RL protocol is able to take advantage of different features of the underlying flow in order to achieve its task, indicating that the information it learns is non-trivial.

The paper is organized as follows. In Sec. II we present the general set-up of the problem, write the equations of motion of the vessels, and give details on the underlying flow and tasks. In Sec. III we first present an overview of the RL algorithm used in this paper and then show the results obtained using it, while in Sec. IV we do the same for the ON case. In Sec. V we compare the results obtained from the two approaches. Finally, we give our conclusions in Sec. VI.

II Problem set-up

For our analysis we use one static snapshot from a numerical realization of 2D turbulence, and try to learn the optimal path connecting two different sets of starting and ending points; we call these problems P1 and P2, respectively. In Fig. 1 we show a sketch of the set-up (see the caption of the figure for further details on the turbulent realization used and how it was generated). Our goal is to find (if they exist) trajectories that join the region close to with a target close to (problem P1) and with (problem P2) in the shortest possible time assuming that the vessels obey the following equations of motion:


where is the stationary velocity of the underlying 2D advecting flow, and

is the control slip velocity of the vessel with fixed intensity and varying steering direction :

where the angle is evaluated along the trajectory, . We introduce a dimensionless slip velocity by dividing with the maximum velocity of the underlying flow:


In this framework, Zermelo’s problem reduces to optimizing the time-space dependency of in order to reach the target Zermelo (1931). A general solution, given by optimal control theory, can be found in Refs. Techy (2011); Mannarini et al. (2016). In particular, assuming that the angle is controlled continuously in time, one can prove that, if there exists an optimal trajectory that joins with with a given initial angle , the optimal steering angle must satisfy the following time-evolution:


where is evaluated along the agent trajectory obtained from Eq. (1). The set of equations (1) together with (3) forms a three-dimensional dynamical system, which may result in chaotic dynamics even though the fluid velocity is 2D and time-independent (tracer particles cannot exhibit chaotic dynamics in such a flow). Due to the sensitivity to small errors in chaotic systems, the ON approach might become useless for many practical applications. Moreover, even in the presence of a global non-positive maximal Lyapunov exponent, where the long-time evolution of a generic trajectory is attracted toward fixed points or periodic orbits for almost all initial conditions, the finite-time Lyapunov exponents (FTLE) Ott (2002); Vulpiani, Cecconi, and Cencini (2009) can be positive for particular initial conditions and for a time longer than the typical navigation time. In this case, a navigation protocol based on (3) would be unstable for all practical purposes. This is most likely the reason why previous works on ON have dealt mainly with simple advecting flow configurations Techy (2011); Yoo and Kim (2016); Liebchen and Löwen (2019).
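To make the structure of this three-dimensional system concrete, the following is a minimal sketch of its right-hand side. It assumes the kinematics are "flow velocity plus slip velocity of fixed magnitude" and uses the classical form of Zermelo's steering law, with the velocity gradient written as a_ij = d u_i / d x_j. The function names, the argument layout and the linear shear flow used as a stand-in for the turbulent snapshot are illustrative choices, not taken from the paper.

```python
import math


def zermelo_rhs(state, flow, flow_grad, v_s):
    """Time derivative of (x, y, theta) for a vessel with slip speed v_s.

    flow(x, y)      -> (u_x, u_y), the stationary advecting velocity
    flow_grad(x, y) -> ((a11, a12), (a21, a22)) with a_ij = d u_i / d x_j
    """
    x, y, theta = state
    ux, uy = flow(x, y)
    (a11, a12), (a21, a22) = flow_grad(x, y)
    c, s = math.cos(theta), math.sin(theta)
    dx = ux + v_s * c  # advection plus slip, x component
    dy = uy + v_s * s  # advection plus slip, y component
    # Classical Zermelo steering law for time-optimal navigation
    dtheta = a21 * s * s - a12 * c * c + (a11 - a22) * c * s
    return (dx, dy, dtheta)


def dimensionless_slip(v_s, u_max):
    """Slip speed normalized by the maximum flow speed."""
    return v_s / u_max


# Hypothetical stand-in flow: linear shear u = (y, 0)
def shear_flow(x, y):
    return (y, 0.0)


def shear_grad(x, y):
    return ((0.0, 1.0), (0.0, 0.0))
```

In a uniform flow all gradient components vanish and the steering angle stays constant, so the optimal path is a straight line in the frame moving with the flow. Any spatial variation of the flow couples the angle to the position, which is where the sensitivity discussed above can enter.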

III The Reinforcement Learning approach

III.1 Methods

RL applications Sutton and Barto (2018) are based on the idea that an optimal solution for certain complex problems can be obtained by learning from continuous interactions of an agent with its environment. The agent interacts with the environment by sampling its states , performing actions and collecting rewards . In the approach used here, actions are chosen randomly with a probability that is given by the policy function, , given the current state of the surrounding environment. The goal is to find the optimal policy that maximizes the total reward,


accumulated along one episode, i.e. one trial. To accomplish this goal, RL works in an iterative fashion. Different attempts, or episodes, are performed and the policy is updated to improve the total reward. The initial policy of each episode coincides with the final policy of the previous episode. During the training phase optimality is approached as the total reward for each episode converges (as a function of the number of episodes) to a fixed value (up to stochastic fluctuations).

In our case the vessel acts as the agent and the two-dimensional flow as the environment. As shown in Fig. 1, we define the states by covering the flow domain with square tiles with of size . Here and , where is the large-scale periodicity of the flow. In other words, we suppose that the agent is able to identify its absolute position in the flow within a given approximation determined by the tile size . Furthermore, to be realistic, we allow the agent to sample states and change action only at given time intervals , where is the characteristic flow time, . The possible actions (steering directions) correspond to the eight angles shown in Fig. 1, namely with . Each episode is defined as one attempt to reach the target, where we make sure that the sum in Eq. (4) is always finite by imposing a maximum time (chosen to be of the order of times the typical navigation time) after which we terminate the episode. To identify a time-optimal trajectory we use a potential-based reward shaping Andrew, Harada, and Russelt (1999) at each time during the learning process:


where is the center of the final target region. The first term on the RHS of Eq. (5) is a contribution that accumulates to a large penalty if it takes a long time for the agent to reach the end point. The second and third terms give the relative improvement in the distance-from-target potential during the training episode, which is known to preserve the optimal policy and help the algorithm to converge faster Andrew, Harada, and Russelt (1999). An episode is finalized when the trajectory reaches the circle of radius around the target, is roughly times smaller than the distance between the target and the starting position, see Fig. 1. If an agent does not reach the target within time , or gets as far as from the target, the episode is ended and a new episode begins. In the latter case the agent receives an extra negative reward equal to , in order to strongly penalize these failures. By summation of Eq. (5) over the entire duration of the episode, the total reward (4) becomes


Eq. (6) is approximately equal to the difference between the time to reach the target without a flow and the actual time taken by the trajectory: , where the free-flight time is defined as
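As a hedged illustration of the shaped reward and of the free-flight time, the sketch below assumes that each step pays a time penalty equal to the decision interval and is credited with the decrease in distance to the target divided by the slip speed, so that the potential has units of time. The function names and the sample path are hypothetical, and the extra penalty for leaving the domain is not included.

```python
import math


def distance(p, q):
    """Euclidean distance between two points given as (x, y) tuples."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def shaped_reward(pos_prev, pos_now, target, v_s, dt):
    """One-step reward: time penalty plus the gain in the distance-to-target potential."""
    gain = (distance(pos_prev, target) - distance(pos_now, target)) / v_s
    return -dt + gain


def free_flight_time(start, target, v_s):
    """Time needed to cover the start-to-target distance at slip speed v_s with no flow."""
    return distance(start, target) / v_s


def total_reward(path, target, v_s, dt):
    """Sum of the shaped rewards along a path sampled every dt."""
    return sum(shaped_reward(a, b, target, v_s, dt) for a, b in zip(path, path[1:]))
```

Because the potential terms telescope, the sum over an episode that ends on the target equals the free-flight time minus the elapsed time, whatever the intermediate positions were. This is the property that lets the shaping speed up learning without changing which policy is optimal.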


In order to converge to policies that are robust against small perturbations of the initial condition, which is an important property in the presence of chaos, each episode is started with a uniformly random position within a given radius from the starting point . Following from our action-state space discretization, with states and actions, a natural choice for the policy parametrization is the softmax distribution defined as:


where is a linear combination of a matrix, , of free parameters. Here we adopt the simplest choice of the feature matrix : a perfect non-overlapping tiling of the action-state space, . Unless the matrix of coefficients converges to a singular distribution for each state , the softmax expression (8) leads to stochastic dynamics even for the optimal policy.
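The state discretization and the tabular softmax policy described above can be sketched as follows. This is only an illustration: the names `box_len` and `n_tiles`, the row-major state index and the use of one parameter per state-action pair are assumptions consistent with a non-overlapping tiling, not the authors' code.

```python
import math
import random

N_ACTIONS = 8  # eight steering directions, as in the text


def tile_index(x, y, box_len, n_tiles):
    """Map a position in a periodic square box to a single state index."""
    size = box_len / n_tiles
    # The trailing modulo guards against x % box_len rounding up to box_len
    i = int((x % box_len) // size) % n_tiles
    j = int((y % box_len) // size) % n_tiles
    return i + n_tiles * j


def softmax_policy(q_row):
    """Action probabilities for one state from its row of free parameters."""
    m = max(q_row)  # subtract the maximum for numerical stability
    weights = [math.exp(q - m) for q in q_row]
    z = sum(weights)
    return [w / z for w in weights]


def sample_action(q_row, rng=random):
    """Draw one action index according to the softmax probabilities."""
    probs = softmax_policy(q_row)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With all parameters equal, every direction is equally likely. The policy becomes deterministic in a given state only in the limit where one parameter dominates the others in that row.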

During the training phase of the RL protocol, one needs to estimate the expected total future reward (4). In this paper, we follow the one-step actor-critic method Sutton and Barto (2018) based on a gradient ascent in the policy parametrization. The critic approach circumvents the need to generate a large number of trial episodes by introducing the estimation of the state-value function, ;


where and are a set of free parameters (similar to ). The expression is used to estimate the future expected reward, , in the gradient ascent algorithm:


Finally, the parameterizations of the policy and the state-value functions are updated every time the state-space is sampled as:


where are the state-action pairs that are explored at time during the episode, while is the future expected reward minus the state-value function, now used as baseline

The main appeal of the one-step actor-critic algorithm is that replacing the total reward with the one-step return plus the learned state-value function leads to a fully local (in time) evolution of the gradient ascent. The learning rates , , follow the Adam algorithm Kingma and Ba (2014) to improve the convergence performance over standard stochastic gradient descent. Both gradients in Eqs. (11) can be computed explicitly and the one-step actor-critic algorithm becomes
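For a tabular parametrization, one update of this kind can be sketched as below, following the generic one-step actor-critic scheme of Sutton and Barto. Several simplifications are assumptions made for brevity: rewards are undiscounted, the learning rates `lr_q` and `lr_v` are fixed constants rather than Adam-adapted, and the tables `q` (policy parameters) and `v` (state values) are plain Python lists.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of policy parameters."""
    m = max(row)
    weights = [math.exp(x - m) for x in row]
    z = sum(weights)
    return [w / z for w in weights]


def actor_critic_step(q, v, s, a, reward, s_next, terminal, lr_q=0.1, lr_v=0.1):
    """Apply one tabular one-step actor-critic update in place; return the TD error."""
    v_next = 0.0 if terminal else v[s_next]
    # One-step return minus the baseline v[s] (no discounting assumed)
    delta = reward + v_next - v[s]
    probs = softmax(q[s])  # policy in state s before the update
    # Critic: move the value estimate of s toward the one-step return
    v[s] += lr_v * delta
    # Actor: gradient of log pi(a|s) for a tabular softmax is 1{b == a} - pi(b|s)
    for b in range(len(q[s])):
        grad_log = (1.0 if b == a else 0.0) - probs[b]
        q[s][b] += lr_q * delta * grad_log
    return delta
```

A positive TD error raises the probability of the action just taken and lowers the others in that state; a negative one does the opposite. Only the visited state is touched, which is what makes the update local in time.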

Figure 2: Evolution of the total reward averaged over a window of consecutive episodes versus the number of training episodes for . The shaded area around the main curve indicates the standard deviation inside each averaged window. Inset: evaluation of the policies obtained during the training at three different stages, after 200 episodes obtaining a total reward (green), after 500 episodes (yellow) and for a policy already converged to the maximum reward after 2500 episodes with the total reward increased up to (red).
Figure 3: Examples of different trajectories generated using the final policy resulting from the RL protocol, for different and for problems P1 (left) and P2 (right). For each case we plot ten trajectories starting randomly in the circle around and .
Figure 4: Superposition of the trajectories of Fig. 3 with the Okubo-Weiss parameter (color coded in simulation units) as defined in Eq. (12) for P1 (left) and P2 (right).
Figure 5: (Left column) Color map of the density of visited states along the training phase for problem P1 and three different . (Right column) Degree of greediness of the best action for the softmax final optimal policy obtained by the RL for three different and for the P1 point-to-point case. Similar results are obtained for P2 (not shown).
Figure 6: Comparison between the arrival time histograms, Pr(), using the Reinforcement Learning (RL) (blue bars) and the Trivial Policy (TP) (green bars) with different values of and for P1 (top row) and P2 (bottom row). The bar on the right end of each panel denotes the probability of failure, i.e. that the target is not reached.
Figure 7: 2D map of the best actions selected by the RL for (from left to right) and for P1 (top row) and P2 (bottom row). Next to each panel we show the histograms of the angle mismatch, , between the greedy action selected by the RL and the action selected following the TP.

III.2 Results

In Fig. 2 we show the evolution of the total reward as a function of the episode number for problem P1 using . As one can see, the system reaches a stationary state and stable maximum reward, indicating that the RL protocol has converged to a certain policy. In the inset of Fig. 2, we show the trajectories of the vessel following the policies extracted from three different stages during the learning process. This illustrates that the policy evolves slowly toward one that generates stable and short paths joining and . This is the first result supporting the RL approach.

In Fig. 3 we show examples of trajectories generated with the final policies for both problems P1 and P2 and for different values of . For large slip velocities, or , the optimal paths are very close (but not identical) to a straight line connecting the start and end points. For the case with small slip velocity, , the vessel must make use of the underlying flow to reach the target. This is particularly clear for problem P2, where it navigates through very intense flow regions, as can be seen by looking at the correlations between the trajectories in red and the underlying flow intensity in the right panel of Fig. 3. To further illustrate this, we superpose in Fig. 4 the example trajectories of Fig. 3 with the Okubo-Weiss Okubo (1970); Weiss (1991) parameter of the flow (defined as the discriminant of the eigenvalues of the fluid-gradient matrix):



The sign of this parameter determines if the flow is straining (positive) or rotating (negative) and the magnitude determines the degree of strain or rotation. When the slip velocity is small, (red curves), the vessel tends to get attracted to the vicinity of the rotating regions where it exploits the coherent head wind to reach the target quickly.
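A small sketch of this diagnostic, assuming the parameter is the discriminant of the characteristic polynomial of the 2x2 velocity-gradient matrix, (a11 - a22)^2 + 4 a12 a21 with a_ij = d u_i / d x_j. The centered finite-difference helper and its step `h` are illustrative choices for evaluating the gradient of a flow given as a Python function.

```python
def okubo_weiss(grad):
    """Okubo-Weiss parameter from a 2x2 velocity gradient ((a11, a12), (a21, a22))."""
    (a11, a12), (a21, a22) = grad
    # trace^2 - 4*det, which reduces to this form for any 2x2 matrix
    return (a11 - a22) ** 2 + 4.0 * a12 * a21


def velocity_gradient(flow, x, y, h=1e-5):
    """Centered finite-difference estimate of a_ij = d u_i / d x_j at (x, y)."""
    uxp, uyp = flow(x + h, y)
    uxm, uym = flow(x - h, y)
    uxu, uyu = flow(x, y + h)
    uxd, uyd = flow(x, y - h)
    a11 = (uxp - uxm) / (2.0 * h)
    a21 = (uyp - uym) / (2.0 * h)
    a12 = (uxu - uxd) / (2.0 * h)
    a22 = (uyu - uyd) / (2.0 * h)
    return ((a11, a12), (a21, a22))
```

For solid-body rotation, u = (-y, x), the parameter is negative everywhere, while for a pure strain field, u = (x, -y), it is positive, in line with the sign convention stated above.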

One of the main results of this paper is connected to the high robustness of the RL solution, especially if compared with the ON (see later in Sec. V). Here we want to show that this property is connected to the fact that the RL optimal policy is the result of a systematic sampling of all regions inside the flow, with information not restricted to the few states that are visited by the shortest trajectories. In the left column of Fig. 5, we show a color-coded map for the density of visited states for problem P1 (similar results are obtained for P2). As one can see, while there is obviously a high density close to the optimal trajectory, the system has also explored large regions around it, allowing it to also store non-trivial information about neighbouring regions of the optimal trajectory. Similarly, in the right column of Fig. 5 we plot the degree of greediness, defined as the probability of the optimal action for each state:


in order to have a direct assessment of the randomness in the policy. Close to the optimal trajectory the policy is almost fully deterministic, becoming more and more random as the distance to the optimal trajectory increases.

Finally, we compared the optimal policies found with RL to a trivial policy (TP), where the angle selected at each is given by the action that points most directly towards the target among the different possibilities. In Fig. 6 we show the trajectories optimized with RL and the ones following the TP at different for problems P1 and P2, together with the probabilities of arrival times, . The TP is able to perform well only when the navigating slip velocity is large, see for P1. Conversely, for the more interesting case when is small, the TP produces many failed attempts (as illustrated by the bars at the right end of the histograms), and the arrival times tend to be much longer or infinite because the agents get trapped in recirculating regions from where it is difficult or impossible to exit. The results of the TP are even more bleak in P2, where TP is only successful when and for the other cases the vessels always get trapped in the flow. In order to quantify the local differences between RL and TP along the optimal trajectories, we show in Fig. 7 the greedy solutions (solutions using the action with the highest probability) selected by the RL in the whole domain for all studied cases together with the probability of the angle mismatch between the greedy RL and TP, , (shown in the small boxes). As one can see, there is always a certain mismatch between the RL and the TP policies, confirming the difficulty of guessing a priori the quasi-optimal solutions discovered by the RL.
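The trivial policy and the angle mismatch can be sketched as follows. The sketch assumes that the eight steering directions are equally spaced, starting from the positive x axis, and that the straight-line bearing to the target is used without accounting for the periodicity of the domain. Both are assumptions, and the function names are hypothetical.

```python
import math


def action_angles(n_actions=8):
    """Equally spaced steering angles (assumed layout of the discrete actions)."""
    return [2.0 * math.pi * k / n_actions for k in range(n_actions)]


def trivial_policy(pos, target, n_actions=8):
    """Index of the discrete steering direction best aligned with the target bearing."""
    bearing = math.atan2(target[1] - pos[1], target[0] - pos[0])
    angles = action_angles(n_actions)
    # Maximizing the cosine of the angular difference picks the closest direction
    return max(range(n_actions), key=lambda k: math.cos(angles[k] - bearing))


def angle_mismatch(k_rl, k_tp, n_actions=8):
    """Signed angular difference between two action indices, in radians."""
    diff = (k_rl - k_tp) % n_actions
    if diff > n_actions // 2:
        diff -= n_actions
    return 2.0 * math.pi * diff / n_actions
```

A histogram of `angle_mismatch` over all states, with the greedy RL action as the first argument, would give a distribution of the kind described for the small boxes of Fig. 7.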

IV The Optimal Navigation approach

IV.1 Methods

Eq. (3) gives the evolution of the steering angle that minimizes the time it takes to navigate from to provided that the system starts at the optimal initial angle, as was first derived by Zermelo Zermelo (1931). Following Bryson and Ho (1975), this control strategy can be obtained by mapping the problem onto a classical mechanics problem with a Hamiltonian

where are the generalized momenta of . The corresponding Hamiltonian dynamics become


where is the fluid gradient matrix with components evaluated along . By construction, using the principle of least action, solutions from at to at of this dynamics are extreme points of the action evaluated along a trajectory (1):

Thus, the trajectory with the optimal time to navigate to the target satisfies the dynamics (14)-(16). Eq. (16) gives and it follows that the time-optimal control is obtained by solving the joint equations (14) and (15) using . The corresponding angular dynamics, , is identical to Eq. (3). We remark that the dynamics (15) tend to orient the vessels transverse to the maximal stretching directions of the underlying flow. As a consequence, vessels avoid high strain regions and accumulate in vortex regions.

Solutions of Zermelo’s problem deliver the quickest trajectory joining the starting point to the target point for each initial condition. The challenge lies in finding the initial direction that hits the target point in the shortest time. In the set-up described in Sec. II, the target is not a single point but instead an area. We view this as a continuous family of Zermelo’s problems, where each target point corresponds to a point in the target area with an optimal initial angle and corresponding optimal time. Thus, the optimal path corresponds to the member of this family that has the quickest trajectory.

As we shall see, it is not straightforward to find the optimal trajectory by refining the angle for the initial condition. The complication is that Zermelo’s dynamics are often unstable in a non-linear flow. This implies that even initial conditions very close to each other may end up at different locations in the flow. As a consequence, it is hard to find the optimal strategy: local refinements of tend to result in local minima rather than the global one.

IV.2 Results

Figure 8: (a): visualization of trajectories with starting angle uniformly selected on a grid between and and constant velocity . Trajectories are distinguished using different colors. (b): The subsample of trajectories that reaches the target (out of ) are shown. The color map indicates the time when the target is reached. The quickest trajectory, , out of 10000 is colored red and defines the optimal initial angle, for and P2. (c): Evolution of for target-reaching trajectories starting close to . Trajectories are color coded according to their initial angle and terminated when the target is reached. Out of 10000 trajectories with uniformly equispaced initial angles in with , reach the target and are displayed. (d): Time to reach the target as a function of for the trajectories in panel (c). Zero time corresponds to trajectories not reaching the target. The time is plotted discretely (blue points) and continuously (red line) to indicate transitions between successes and failures.
Figure 9: FTLE for the quickest ON trajectories for target P2 (starting at with initial angle ) for three different navigation speeds, . The vertical dashed line denotes the time the target is reached.

To study this problem, we integrate Eqs. (3) and (1) numerically using a fourth-order Runge-Kutta scheme with a small time step, for 100 time units. We have explicitly checked that reducing the time step to does not change vessel trajectories significantly on the time scales considered in this work. We consider the same targets and values of as for the RL. In Fig. 8a we show the results for ON running 100 trajectories with uniformly gridded initial values of between and and (for target P2). As one can see, the trajectories wander around the flow, all missing the target. Fig. 8b shows a repetition of the experiment for a larger number of trajectories (10000) and only the few trajectories that reach the target are shown (and terminated at the target). In order to empirically understand the stability of the protocol, we identified the optimal initial angle as that of the trajectory in Fig. 8b that reaches the target in the shortest time (red) and ran another 10000 trajectories with initial angles in an interval of length around (this is the interval of so far unexplored initial conditions around ). Fig. 8c shows the evolution of for those trajectories that reach the target. We observe that there is a wide variability in the time to reach the target and that the trajectories are well mixed in the long run: the order of does not reflect the initial ordering at the time scale when the target is reached. Fig. 8d summarizes which initial angles reach the target and which fail. We observe that successes depend intermittently on the initial angle: continuous bands of successful initial conditions are interspersed with regions of failures. We also observe that within this sample there exist trajectories that reach the target more quickly than the trajectory starting at . However, due to the intermittent nature of successes, it is not possible to continuously change , starting at , to find the best sampled trajectory without passing regions of failure. This highlights the problem in refining the initial condition to find the globally optimal trajectory when applying the ON protocol to the non-linear system considered here.
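The numerical procedure, fourth-order Runge-Kutta integration of the coupled position and angle equations followed by a sweep over initial angles, can be sketched as below. Since the turbulent snapshot is not available here, the sketch uses solid-body rotation, u = omega * (-y, x), as a stand-in flow. For that flow Zermelo's steering law reduces to a constant turning rate equal to omega, which gives an exact solution to check against. All names and parameter values are illustrative.

```python
import math


def rk4_step(f, y, dt):
    """One classical fourth-order Runge-Kutta step for dy/dt = f(y)."""
    k1 = f(y)
    k2 = f([yi + 0.5 * dt * ki for yi, ki in zip(y, k1)])
    k3 = f([yi + 0.5 * dt * ki for yi, ki in zip(y, k2)])
    k4 = f([yi + dt * ki for yi, ki in zip(y, k3)])
    return [yi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]


def make_rhs(omega, v_s):
    """Zermelo system (x, y, theta) for the stand-in flow u = omega * (-y, x)."""
    def rhs(state):
        x, y, theta = state
        return [-omega * y + v_s * math.cos(theta),
                omega * x + v_s * math.sin(theta),
                omega]  # steering law evaluated for solid-body rotation
    return rhs


def integrate(rhs, state, dt, n_steps):
    """Advance the state by n_steps RK4 steps of size dt."""
    for _ in range(n_steps):
        state = rk4_step(rhs, state, dt)
    return state


def sweep_initial_angles(omega, v_s, start, n_angles, dt, n_steps):
    """Final states for n_angles initial steering angles on a uniform grid."""
    rhs = make_rhs(omega, v_s)
    results = []
    for k in range(n_angles):
        theta0 = 2.0 * math.pi * k / n_angles
        results.append((theta0, integrate(rhs, [start[0], start[1], theta0], dt, n_steps)))
    return results
```

In this stand-in flow the vessel moves in a straight line in the co-rotating frame, so neighbouring initial angles stay ordered and the sweep is well behaved. The intermittent pattern of successes and failures reported above is a property of the turbulent flow, not of the integrator.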

As shown in Fig. 8, ON is able to produce trajectories that join the starting and ending points. We will compare their times with the ones coming from the RL methods in the next section. Here we instead make a more quantitative analysis of the stability of the ON solutions. In order to do this, we evaluate the FTLE along a phase-space trajectory governed by the system of Eqs. (1) and (3). The FTLE are the local stretching rates, for , of a small phase-space separation . By polar decomposition, we can write , where and are rotation and positive definite stretching matrices. The FTLE are defined from the eigenvalues , and of . If the maximal FTLE is positive, the ON trajectory is unstable and small deviations from the trajectory are exponentially amplified with time. In Fig. 9, we show results of the FTLE for the quickest trajectories reaching the target using ON, i.e. trajectories starting at as defined in Fig. 8b. We find that the maximal FTLE is positive for the time it takes to reach the target for all velocities considered. On the other hand, for times large enough, many trajectories approach fixed-point attractors, leading to a smooth decay towards negative FTLE. This effect becomes more evident the smaller the navigating velocity is. For the system of equations (1) decouples from Eq. (3) and the dynamics cease to be sensitive to small perturbations.

In conclusion, Fig. 9 shows that the system is unstable to small perturbations on the time scales needed to reach the target. This explains the observed behaviour of the trajectories in Fig. 8 and why the ON approach is intractable for practical applications.
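As an illustration of how a maximal FTLE can be extracted, the sketch below takes the deformation gradient `jac`, which maps a small initial phase-space separation to its value at time `t`, and returns the logarithm of its largest stretching factor divided by `t`. The stretching factors are the eigenvalues of the positive definite factor in the polar decomposition, or equivalently the square roots of the eigenvalues of the transpose of `jac` times `jac`. Using power iteration for the largest eigenvalue is a choice made here to keep the example dependency-free, and it assumes the starting vector is not orthogonal to the dominant direction.

```python
import math


def max_ftle(jac, t, n_iter=200):
    """Largest finite-time Lyapunov exponent from an n x n deformation gradient over time t."""
    n = len(jac)
    # Symmetric matrix M = J^T J; its eigenvalues are the squared stretching factors
    m = [[sum(jac[k][i] * jac[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    v = [1.0 / math.sqrt(n)] * n
    lam = 0.0
    for _ in range(n_iter):
        w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("deformation gradient is singular along the start vector")
        v = [x / norm for x in w]
        # Rayleigh quotient of the normalized iterate estimates the top eigenvalue of M
        mv = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(vi * mvi for vi, mvi in zip(v, mv))
    # log(sqrt(lam)) / t
    return math.log(lam) / (2.0 * t)
```

A positive return value means that some small perturbation of the trajectory has grown over the interval, which is the criterion for instability used in the text. In practice `jac` would be obtained by integrating the linearized equations alongside the trajectory.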

V Comparison between RL and ON

In this section we make a side-by-side comparison of the RL and ON approaches. To better highlight the RL stability compared to the ON solution we need to specify how the two sets of simulations are initialized. While the RL initial conditions are chosen in a circle of radius centered at with a small spread in the initial angles (typically the probability of the non-greedy actions, which is in the initial state of the order of ), we initialize the ON simulations at the exact spatial starting point, , and with uniformly distributed initial angles in an interval of length around as defined in Fig. 8b. The ON approach could not be initialized starting in a box of side length because its unstable dynamics prevents it from working if the range of initial conditions is too wide.

We find that the minimum time taken by the best trajectory to reach the target is of the same order for the two methods. The main difference between RL and ON lies instead in their robustness. We illustrate this by plotting the spatial density of trajectories in the left column of Fig. 10 for the optimal policies of ON and RL with three values of and initialized as described above. We observe that the RL trajectories (blue colour area) follow a coherent region in space, while the ON trajectories (red colour area) fill space almost uniformly, especially for large values of . Moreover, for small navigation velocities, many trajectories in the ON system approach regular attractors, as visible by the high-concentration regions. Similar results are found for the trajectories following problem P1 (not shown).

Figure 10: Left column: spatial concentrations of trajectories for three values of . The flow region is color coded proportionally to the time the trajectories spend in each pixel area for both ON (red) and RL (blue). Light colors refer to low occupation and bright to high occupation. The green-dashed line shows the best ON out of the trajectories. Right column: arrival time histograms for ON (red) and RL (blue). Probability of not reaching the target within the upper time limit is plotted in the Fail bar.

The right column of Fig. 10 shows a comparison between the probability of arrival times for the trajectories illustrated in the left column. This provides a quantitative estimation of the better robustness of RL compared to ON: even though the ON best time is comparable to or, sometimes, even slightly smaller than the RL minimum time, the ON probability has a much wider tail towards large arrival times and it is always characterized by a much larger number of failures. All these results highlight a strong instability of the ON approach.

VI Conclusions

We have presented a first systematic investigation of Zermelo’s time-optimal navigation problem in a realistic 2D turbulent flow. We have developed a RL approach based on an actor-critic algorithm, which is able to find quasi-optimal discretized policies with strong robustness against changing of the initial condition. In particular, we have considered constant navigation speeds, , not exceeding the maximum flow velocity , down to values of and for all cases we successfully identified quick navigation paths and close to optimal policies that are strongly different from the trivial choice to navigate towards the target at all times. We have also implemented a few attempts with an additive noise in the equations for the vessel evolution and found that RL is able to reach a solution even for this case (results will be reported elsewhere). Furthermore, we investigated the relation between the optimal paths and the underlying topological properties of the flow, identifying the role played by coherent structures to guide the vessel towards the target. Finally, we have compared RL with the Optimal navigation approach showing that the latter exhibits a strong sensitivity on the initial conditions and is thus inadequate for real-world applications. Many potential applications can be envisaged. In particular, it is key to probe the efficiency of the different approaches considered in this work for time-dependent flows and 3D geometries. In both cases, already the simple uncontrolled tracer dynamics, , can be chaotic Dombre et al. (1986); Bohr et al. (2005), opening new challenges for the optimization problem. Moreover, similar optimal navigation problems can be reformulated for inertial particles Toschi and Bodenschatz (2009); Gibert, Xu, and Bodenschatz (2012); Bec et al. (2007); Mordant et al. (2001), where the control is moved to the acceleration with important potential applications to buoyant geophysical probes Roemmich et al. (2009). 
Work in these directions is in progress and will be reported elsewhere.
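As a point of reference for the trivial strategy mentioned above (always steering towards the target), the following minimal sketch integrates a vessel advected by a flow plus a slip velocity of fixed magnitude `v_s`, with the heading restricted to 8 directions. Everything specific in it is an assumption made for illustration: the analytic cellular flow `toy_flow` stands in for the turbulent snapshot, the explicit Euler step and the parameter values are arbitrary, and the heading is re-selected at every step. It is neither the RL policy nor the ON protocol studied in this work.

```python
import math

# 8 steering directions, multiples of 45 degrees.
ACTIONS = [k * math.pi / 4 for k in range(8)]


def toy_flow(x, y):
    """Stationary cellular flow (illustrative stand-in, max speed 1)."""
    return math.sin(x) * math.cos(y), -math.cos(x) * math.sin(y)


def trivial_action(x, y, target):
    """Discrete heading closest to the straight line towards the target."""
    angle = math.atan2(target[1] - y, target[0] - x) % (2 * math.pi)
    return ACTIONS[round(angle / (math.pi / 4)) % 8]


def navigate(start, target, v_s, radius=0.2, dt=0.01, t_max=50.0,
             flow=toy_flow):
    """Euler integration of dX/dt = u(X) + v_s * (cos(theta), sin(theta)).

    Returns the arrival time, or None if the target disc of the given
    radius is not reached before t_max (counted as a failure).
    """
    x, y = start
    t = 0.0
    while t < t_max:
        if math.hypot(target[0] - x, target[1] - y) <= radius:
            return t
        theta = trivial_action(x, y, target)
        ux, uy = flow(x, y)
        x += dt * (ux + v_s * math.cos(theta))
        y += dt * (uy + v_s * math.sin(theta))
        t += dt
    return None


if __name__ == "__main__":
    # Two nearby starting points, slip speed below the maximum flow speed.
    for start in [(0.5, 0.5), (0.5, 0.51)]:
        print(start, navigate(start, target=(5.0, 4.0), v_s=0.5))
```

Running the loop from several nearby starting points and collecting the returned times (or failures) gives a simple way to probe how sensitive a given steering rule is to the initial condition.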
We thank A. Celani for useful comments. L.B., M.B. and P.C.d.L. acknowledge funding from the European Union Programme (FP7/2007-2013) grant No. 339032. K.G. acknowledges funding from the Knut and Alice Wallenberg Foundation, Dnr. KAW 2014.0048.


  • Petres et al. (2007) C. Petres, Y. Pailhas, P. Patron, Y. Petillot, J. Evans,  and D. Lane, “Path planning for autonomous underwater vehicles,” IEEE Transactions on Robotics 23, 331–341 (2007).
  • Witt and Dunbabin (2008) J. Witt and M. Dunbabin, “Go with the flow: Optimal auv path planning in coastal environments,” in Australian Conference on Robotics and Automation, Vol. 2008 (2008).
  • Kraus (2012) N. D. Kraus, Wave glider dynamic modeling, parameter identification and simulation, Ph.D. thesis, University of Hawaii at Manoa, Honolulu (2012).
  • Smith et al. (2011) R. N. Smith, J. Das, G. Hine, W. Anderson,  and G. S. Sukhatme, “Predicting wave glider speed from environmental measurements,” in OCEANS’11 MTS/IEEE KONA (IEEE, 2011) pp. 1–8.
  • Lumpkin and Pazos (2007) R. Lumpkin and M. Pazos, “Measuring surface currents with surface velocity program drifters: the instrument, its data, and some recent results,” Lagrangian analysis and prediction of coastal and ocean dynamics, 39–67 (2007).
  • Niiler (2001) P. Niiler, “The world ocean surface circulation,” in International Geophysics, Vol. 77 (Elsevier, 2001) pp. 193–204.
  • Lermusiaux et al. (2017) P. F. Lermusiaux, D. Subramani, J. Lin, C. Kulkarni, A. Gupta, A. Dutt, T. Lolla, P. Haley, W. Ali, C. Mirabito, et al., “A future for intelligent autonomous ocean observing systems,” Journal of Marine Research 75, 765–813 (2017).
  • Bechinger et al. (2016) C. Bechinger, R. Di Leonardo, H. Löwen, C. Reichhardt, G. Volpe,  and G. Volpe, “Active particles in complex and crowded environments,” Reviews of Modern Physics 88, 045006 (2016).
  • Kurzthaler et al. (2018) C. Kurzthaler, C. Devailly, J. Arlt, T. Franosch, W. C. Poon, V. A. Martinez,  and A. T. Brown, “Probing the spatiotemporal dynamics of catalytic janus particles with single-particle tracking and differential dynamic microscopy,” Physical review letters 121, 078001 (2018).
  • Popescu, Tasinkevych, and Dietrich (2011) M. Popescu, M. Tasinkevych,  and S. Dietrich, “Pulling and pushing a cargo with a catalytically active carrier,” EPL (Europhysics Letters) 95, 28004 (2011).
  • Baraban et al. (2012) L. Baraban, M. Tasinkevych, M. Popescu, S. Sanchez, S. Dietrich,  and O. Schmidt, “Transport of cargo by catalytic janus micro-motors,” Soft Matter 8, 48–52 (2012).
  • Carrassi et al. (2018) A. Carrassi, M. Bocquet, L. Bertino,  and G. Evensen, “Data assimilation in the geosciences: An overview of methods, issues, and perspectives,” Wiley Interdisciplinary Reviews: Climate Change 9, e535 (2018).
  • Lakshmivarahan and Lewis (2013) S. Lakshmivarahan and J. M. Lewis, “Nudging methods: a critical overview,” in Data Assimilation for Atmospheric, Oceanic and Hydrologic Applications (Vol. II) (Springer, 2013) pp. 27–57.
  • Clark Di Leoni, Mazzino, and Biferale (2018) P. Clark Di Leoni, A. Mazzino,  and L. Biferale, “Unraveling turbulence via physics-informed data-assimilation and spectral nudging,” arXiv preprint arXiv:1804.07680  (2018).
  • Clark Di Leoni, Mazzino, and Biferale (2019) P. Clark Di Leoni, A. Mazzino,  and L. Biferale, “Synchronization to big-data: nudging the navier-stokes equations for data assimilation of turbulent flows,” arXiv preprint arXiv:1905.05860  (2019).
  • Centurioni (2018) L. R. Centurioni, “Drifter technology and impacts for sea surface temperature, sea-level pressure, and ocean circulation studies,” in Observing the Oceans in Real Time (Springer, 2018) pp. 37–57.
  • Roemmich et al. (2009) D. Roemmich, G. C. Johnson, S. Riser, R. Davis, J. Gilson, W. B. Owens, S. L. Garzoli, C. Schmid,  and M. Ignaszewski, “The argo program: Observing the global ocean with profiling floats,” Oceanography 22, 34–43 (2009).
  • Zermelo (1931) E. Zermelo, “Über das navigationsproblem bei ruhender oder veränderlicher windverteilung,” ZAMM-Journal of Applied Mathematics and Mechanics/Zeitschrift für Angewandte Mathematik und Mechanik 11, 114–124 (1931).
  • Bryson and Ho (1975) A. E. Bryson and Y. Ho, Applied optimal control: optimization, estimation and control (New York: Routledge, 1975).
  • Ben-Asher (2010) J. Z. Ben-Asher, Optimal control theory with aerospace applications (American institute of aeronautics and astronautics, 2010).
  • Liebchen and Löwen (2019) B. Liebchen and H. Löwen, “Optimal control strategies for active particle navigation,” arXiv preprint arXiv:1901.08382  (2019).
  • Hays et al. (2014) G. C. Hays, A. Christensen, S. Fossette, G. Schofield, J. Talbot, and P. Mariani, “Route optimisation and solving Zermelo’s navigation problem during long distance migration in cross flows,” Ecology letters 17, 137–143 (2014).
  • Xia et al. (2011) H. Xia, D. Byrne, G. Falkovich,  and M. Shats, “Upscale energy transfer in thick turbulent fluid layers,” Nature Physics 7, 321 (2011).
  • Boffetta and Ecke (2012) G. Boffetta and R. E. Ecke, “Two-dimensional turbulence,” Annual Review of Fluid Mechanics 44, 427–451 (2012).
  • Alexakis and Biferale (2018) A. Alexakis and L. Biferale, “Cascades and transitions in turbulent flows,” Physics Reports 767-769, 1 – 101 (2018).
  • Sutton and Barto (2018) R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction (MIT press, 2018).
  • Rugh and Rugh (1996) W. J. Rugh, Linear system theory, Vol. 2 (Prentice Hall, Upper Saddle River, NJ, 1996).
  • Pontryagin (2018) L. S. Pontryagin, Mathematical theory of optimal processes (Routledge, 2018).
  • Techy (2011) L. Techy, “Optimal navigation in planar time-varying flow: Zermelo’s problem revisited,” Intelligent Service Robotics 4, 271–283 (2011).
  • Yoo and Kim (2016) B. Yoo and J. Kim, “Path optimization for marine vehicles in ocean currents using reinforcement learning,” Journal of Marine Science and Technology 21, 334–343 (2016).
  • Colabrese et al. (2018) S. Colabrese, K. Gustavsson, A. Celani,  and L. Biferale, “Smart inertial particles,” Physical Review Fluids 3, 084301 (2018).
  • Colabrese et al. (2017) S. Colabrese, K. Gustavsson, A. Celani,  and L. Biferale, “Flow navigation by smart microswimmers via reinforcement learning,” Physical review letters 118, 158004 (2017).
  • Gustavsson et al. (2017) K. Gustavsson, L. Biferale, A. Celani,  and S. Colabrese, “Finding efficient swimming strategies in a three-dimensional chaotic flow by reinforcement learning,” The European Physical Journal E 40, 110 (2017).
  • Gazzola et al. (2016) M. Gazzola, A. A. Tchieu, D. Alexeev, A. de Brauer,  and P. Koumoutsakos, “Learning to school in the presence of hydrodynamic interactions,” Journal of Fluid Mechanics 789, 726–749 (2016).
  • Verma, Novati, and Koumoutsakos (2018) S. Verma, G. Novati,  and P. Koumoutsakos, “Efficient collective swimming by harnessing vortices through deep reinforcement learning,” Proceedings of the National Academy of Sciences 115, 5849–5854 (2018).
  • Reddy et al. (2016) G. Reddy, A. Celani, T. J. Sejnowski,  and M. Vergassola, “Learning to soar in turbulent environments,” Proceedings of the National Academy of Sciences 113, E4877–E4884 (2016).
  • Reddy et al. (2018) G. Reddy, J. Wong-Ng, A. Celani, T. J. Sejnowski,  and M. Vergassola, “Glider soaring via reinforcement learning in the field,” Nature 562, 236 (2018).
  • Muinos-Landin, Ghazi-Zahedi, and Cichos (2018) S. Muinos-Landin, K. Ghazi-Zahedi,  and F. Cichos, “Reinforcement learning of artificial microswimmers,” arXiv preprint arXiv:1803.06425  (2018).
  • Novati, Mahadevan, and Koumoutsakos (2018) G. Novati, L. Mahadevan,  and P. Koumoutsakos, “Deep-reinforcement-learning for gliding and perching bodies,” arXiv preprint arXiv:1807.03671  (2018).
  • Tsang et al. (2018) A. C. H. Tsang, P. W. Tong, S. Nallan,  and O. S. Pak, “Self-learning how to swim at low reynolds number,” arXiv preprint arXiv:1808.07639  (2018).
  • Mannarini et al. (2016) G. Mannarini, N. Pinardi, G. Coppini, P. Oddo,  and A. Iafrati, “Visir-i: small vessels–least-time nautical routes using wave forecasts,” Geoscientific Model Development 9, 1597–1625 (2016).
  • Ott (2002) E. Ott, Chaos in dynamical systems (Cambridge university press, 2002).
  • Vulpiani, Cecconi, and Cencini (2009) A. Vulpiani, F. Cecconi,  and M. Cencini, Chaos: From Simple Models To Complex Systems, Vol. 17 (World Scientific, 2009).
  • Andrew, Harada, and Russelt (1999) A. Y. Ng, D. Harada, and S. Russell, “Policy invariance under reward transformations: Theory and application to reward shaping,” ICML 99, 278 (1999).
  • Kingma and Ba (2014) D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980  (2014).
  • Okubo (1970) A. Okubo, “Horizontal dispersion of floatable particles in the vicinity of velocity singularities such as convergences,” in Deep sea research and oceanographic abstracts, Vol. 17 (Elsevier, 1970) pp. 445–454.
  • Weiss (1991) J. Weiss, “The dynamics of enstrophy transfer in two-dimensional hydrodynamics,” Physica D: Nonlinear Phenomena 48, 273–294 (1991).
  • Dombre et al. (1986) T. Dombre, U. Frisch, J. M. Greene, M. Hénon, A. Mehr,  and A. M. Soward, “Chaotic streamlines in the abc flows,” Journal of Fluid Mechanics 167, 353–391 (1986).
  • Bohr et al. (2005) T. Bohr, M. H. Jensen, G. Paladin,  and A. Vulpiani, Dynamical systems approach to turbulence (Cambridge University Press, 2005).
  • Toschi and Bodenschatz (2009) F. Toschi and E. Bodenschatz, “Lagrangian properties of particles in turbulence,” Annual review of fluid mechanics 41, 375–404 (2009).
  • Gibert, Xu, and Bodenschatz (2012) M. Gibert, H. Xu,  and E. Bodenschatz, “Where do small, weakly inertial particles go in a turbulent flow?” Journal of Fluid Mechanics 698, 160–167 (2012).
  • Bec et al. (2007) J. Bec, L. Biferale, M. Cencini, A. Lanotte, S. Musacchio,  and F. Toschi, “Heavy particle concentration in turbulence at dissipative and inertial scales,” Physical review letters 98, 084502 (2007).
  • Mordant et al. (2001) N. Mordant, P. Metz, O. Michel,  and J.-F. Pinton, “Measurement of lagrangian velocity in fully developed turbulence,” Physical Review Letters 87, 214501 (2001).