I Introduction
In his landmark 1934 paper, and one of his most cited works, GI Taylor proposed a device to study drop deformation and breakup in a twodimensional flow taylor1934formation . Now called the fourroll mill higdon1993kinematics , the apparatus used electrical motors to rotate four identical cylinders immersed in a viscous fluid. Spinning all cylinders at the same speed (in magnitude), with adjacent cylinders rotating in opposite directions, led to an approximately extensional flow with a stagnation point at the centre. Taylor aimed to study how the extension rate of the flow deformed the drop, when this was placed and kept stable at the stagnation point.
The stagnation point at the centre of an extensional flow is a saddle. The unstable nature of the stagnation point made the drop in Taylor’s experiment difficult to control, and the speed had to be varied in real time to compensate for the drop moving off in the wrong direction taylor1934formation . Over fifty years later, a more systematic method to stabilize the motion of the drop was proposed by Bentley and Leal bentley1986experimental . Using a camera to measure the drift of the drop and computer activated stepping motors to adjust the revolution rates, they modulated the rotation speeds of the cylinders using a simple feedback model where the position of the stagnation point was the control variable.
This newly discovered control scheme allowed further advances in our understanding of capillary flows eggers1997nonlinear , in particular the deformation of drops in shear flows rallison1984deformation . In more recent work, microfluidic devices have been used to control the deformation of small drops stone2004engineering ; SquireQuake_MicrofluidReview2005 . In particular, microfluidic implementations of the fourroll mill hudson2004microfluidic ; lee2007microfluidic and of related Stokes traps tanyeri2010hydrodynamic ; tanyeri2011microfluidic ; shenoy2016stokes have led to pioneering techniques for trapping and manipulating on small scales.
It is a simple mathematical exercise to show that an extension flow is unstable. Approximating the small drop by a point particle, its trajectory in the steady flow field is solution to . If the centre of the apparatus is used as the origin of the coordinate system , we have by symmetry and can therefore approximate
near the origin. The extensional flow is irrotational and, since the flow is incompressible, the tensor
is symmetric and traceless. Therefore, it has real eigenvalues of identical magnitude but opposite sign,
, and . The centre of the extension flow is thus a saddle point, with basin of attraction parallel to the unidirectional compression and instability in all other directions (see streamlines in Fig. 1).In this paper, we aim to design a different type of algorithm to drive the drop back when it drifts in the unstable direction. The natural way to correct the trajectory is to adjust the angular rotation rates of the cylinders, but to do so deterministically requires a control model describing the response of the drop to changes in the flow field and for that control scheme to be implemented in real time. This was the rationale for the algorithm proposed in Ref. bentley1986computer using the position of the stagnation point as the control variable. In this paper, instead of a physicsbased control scheme we devise a stabilization algorithm using the framework of Reinforcement Learning sutton2018reinforcement .
Reinforcement Learning is a branch of Machine Learning that allows a software agent to behave optimally in a given environment (state space) via observation of environmental feedback. In essence, the agent explores the environment by taking actions (which can be anything from moves in chess to steering in a selfdriving car) and receiving positive or negative feedback accordingly. Feedback comes in the form of rewards, which, when suitably added together, make up the return associated with the overall performance. The goal of Reinforcement Learning is, in general, to learn how to maximize this return by improving the agent’s behaviour sutton2018reinforcement . The learning algorithms designed to achieve this vary significantly depending on the nature of the state space (e.g. continuous or discrete, finite or infinite) and on the agent’s knowledge of the effect of actions. When only finitely many actions are available, finding the best behaviour is often entirely algorithmic. If however there is a continuum of states and actions, exploration is typically harder and local improvements to the behaviour have to be found via gradient methods.
Reinforcement Learning has found countless applications in recent years, with significant impact already in fluid dynamics brenner2019perspective ; brunton2020machine . For applications in flow physics at high Reynolds number, Reinforcement Learning has been used for bulk flow control gueniat2016statistical ; rabault2019artificial , the control of free surfaces xie2021sloshing and liquid films belus2019exploiting , shape optimization viquerat2021direct , turbulence modelling novati2021automating and sensor placement paris2021robust . Biological and bioinspired applications at high high Reynolds numbers include control and energy optimization in fish swimming gazzola2016learning ; novati2017synchronisation ; verma2018efficient , gliding and perching novati2019controlled and locomotion in potential flows jiao2021learning . A landmark study even demonstrated how to exploit Reinforcement Learning in experimental conditions for turbulent drag reduction in flow past bluff bodies fan2020reinforcement . Applications in the absence of inertia have been motivated by biological problems in navigation and locomotion, and include optimal navigation and escape of selfpropelled swimmers colabrese2017flow ; gustavsson2017finding ; colabrese2018smart , learning to swim tsang2020self ; liu2021mechanical and to perform chemotaxis hartl2021microswimmers or even active cloaking mirzakhanloo_esmaeilzadeh_alam_2020 . Reinforcement Learning was also incorporated in experiments using artificial microswimmers navigating in noisy environments muinos2021reinforcement .
In our study, we show how to use the framework of Reinforcement Learning to successfully control the position of a drop in a model of the fourroll mill setup. The flow is modelled as the linear superposition of four twodimensional rotlets and the drop treated as a rigid spherical particle smaller than all other length scales in the problem. Our state space is a small neighbourhood of the unstable equilibrium in the resulting twodimensional extension flow, and our actions consist of varying the speed of the cylinders at each time step. We reward actions depending on whether the speed adjustment moves us towards the origin during the time step. Since this is a lowReynoldsnumber setup, we can assume that the flow and the drop both respond instantaneously to speed modulation, so that the outcome of an action depends only on the drop’s current position, and not on its current speed or acceleration. The chosen learning algorithm is a classic ActorCritic method based on gradient ascent. Actions are determined by a set of parameters that are varied, at every time step, in the direction of an estimate of the gradient of performance with respect to these parameters
After introducing the flow model in §II, we give a quick overview of Reinforcement Learning in §III along with a description of our algorithm. The various physical and learning parameters are summarised in §IV. We then demonstrate in §V that, with the right choice of parameters, our algorithm is effective at stabilising the drop from any initial drift. Next, in §VI, we explore the impact of the various physical and learning parameters on the effectiveness of the algorithm. Finally, in §VII we address the robustness of the algorithm against thermal noise, its ability to provide a global policy for all initial positions, and how to modify the algorithm to enable control of the magnitude of the flow extension near the position of the drop.
Ii Flow model and trajectories
ii.1 Flow
As a prototypical device generating an extension flow, we consider a simple model for a twodimensional fourroll mill. The flow is generated by four identical cylinders centred at the corners of a square of side length . All lengths are nondimensionalised by so that the centres of the cylinders are located at in a Cartesian coordinate system (see Fig. 1). Motivated by application in microfluidics, we assume that the rotation rates of the cylinders are small enough that all inertial effects in the fluid can be neglected. We further assume that the cylinders are sufficiently long and far away from each other that we can approximate the flow created by each cylinder as a twodimensional rotlet Batchelor1970 ; kimbook , i.e. by the exact solution for the twodimensional Stokes flow outside an isolated cylinder in an infinite fluid. The flow induced by each cylinder at position is hence given by
(1) 
where is the dimensionless radius of the cylinder and where and are, respectively, the angular velocity and the location of the centre of the cylinder; note that indicates anticlockwise rotation (see Fig. 1). In the limit where , we may approximate the flow near the centre of the device as a linear superposition of the four flows from each cylinder, so that for small,
(2) 
Note that this twodimensional flow is irrotational.
As in Taylor’s original, the case where
(3) 
leads to a purely extensional flow near the origin, since the offdiagonal entries of the velocity gradient are by symmetry. Our Reinforcement Learning algorithm will then modify the individual angular velocities independently in order to correct trajectories (see §VI), so Eq. (3) holds only before speed control is applied.
ii.2 Drop motion
We model the viscous drop, transported by the flow and for which we want to achieve stable motion, as a rigid spherical particle of radius (we thus assume that the drop is very rigid and the Capillary number small enough to not deform it significantly). Its centre, located at , evolves in time according to Faxén’s law kimbook
(4) 
where the flow is given by Eq. (1). Note that for this choice of flow the Faxén term is identically zero because the flow is both incompressible and irrotational and thus . In the absence of noise, we integrate Eq. (4) numerically with the RungeKutta RK4 method. In §VII.1 we also incorporate thermal noise (i.e. Brownian motion) as relevant to the dynamics of small drops.
Iii Reinforcement Learning algorithm
iii.1 Fundamentals of Reinforcement Learning
We begin by introducing some terminology that underpins the rest of the work; the reader is referred to the classical book by Sutton and Barto for a detailed treatment sutton2018reinforcement
. In Reinforcement Learning, agents take actions that depend on their current state, and get rewarded accordingly. The mathematical basis is that of Markov Decision Processes
bellman1957markovian , which consist of the following:(1) A state space to be explored, with realization .
(2) An action space (or , since it may vary between states), with realization , which comprises the moves available at each state.
(3) A probability density function (or mass function, if is countable) , which determines the probability of transitioning from state to state after taking action . This probability never changes during the process.
(4) A reward function , which gives the reward earned after transitioning from to through action .
The actions are drawn from a p.d.f. (or mass function, if the action sets are countable) know as the policy. This is the function that determines behaviour. Exploration takes place in discrete time steps. At time step , the agent lands in state and takes action , which takes it to state according to the distribution
. The probability of landing in a given state is a function of the current state and the choice of action, so transitions have the Markov Property. If the probability distribution
or the reward function are not known to the agent, this is referred to as modelfree Reinforcement Learning.Since we want the agent to behave in a specific way, we introduce a notion of return from time step onwards, given that we are starting from state at time . We define , where is the reward earned at time and is known as discount factor. Multiplication by ensures convergence if rewards are well behaved and captures the uncertainty associated with longterm rewards.
From we can define the state value function , which is the expected return starting from state and following (we thus use to denote expected values in what follows). Our goal is to find (or at least to approximate) the policy which maximizes , i.e. the choice of actions leading to maximum return.
iii.2 Choice of Markov Decision Process
We now describe the simplest version of the algorithm used in this study, with some improvements summarised in §III.9. In the specific viscous flow problem considered here, we wish to learn how to modify the motion of the cylinders in order to manoeuvre the drop towards the origin from a fixed starting point . In other words, given the default angular velocities in Eq. (3), an initial position for the drop and a sequence of time steps , ,
,…, we want to learn how to change the angular velocity vector
at each step so as to bring the drop as close to the origin as possible. We will discuss how to extend this strategy to all initial positions later in the paper.We start by assuming that, at each time step, the angular velocity vector changes instantaneously and that the drop’s position can be computed exactly and with no delay. To make speed adjustments without the use of Reinforcement Learning, we would need to know how a given change in angular velocity affects the trajectory before changing the speed, which is computationally unfeasible. Using Reinforcement Learning, in contrast, we can limit ourselves to observing how the drop reacts to a speed change in a given position and learn through trial and error.
We can now formulate the problem in terms of a Markov Decision Process, following points in §III.1:
(1) We choose the state space to be a square of dimensionless side length centred at the origin (shown schematically in Fig. 1). The drop starts somewhere inside this square and needs to reach the origin while moving inside this square. If the drop ever leaves this region during a run, we terminate execution because the drop has wandered too far. The exact size of the region can be changed depending on the accuracy needed, and it will come into play when we try to find a general strategy that does not work just for .
(2) The action space associated with state consists of all allowed changes to in that particular state. Since we are going to use a gradient ascent method to determine the optimal changes, it is important to keep the action space as small as possible. If, for example, we allowed ourselves to act on all four cylinders at every time step our policy would become a function of position and range over all speed adjustments. Such complexity would be hard to approximate, especially with a probabilistic gradient method. Instead, with our algorithm, we only act on one cylinder at a time, thereby reducing the dimension of to . We split the plane into four quadrants (one per cylinder) and whenever the drop is located in a specific quadrant we only allow the cylinder ahead of it in the clockwise direction to modify its rotation speed (this is illustrated graphically in the insets of Fig. 2B, C). An action consists of changing the angular velocity for that specific cylinder, , to some other value in a prescribed interval , where gives the size of the “wiggle” room and is chosen in advance for all cylinders (see below for more details). At the end of the time step, we instantaneously reset the velocity of this cylinder, so that the effect of the subsequent action only depends on the final position of the drop. Since we have no inertia, transitions obey the Markov Property.
(3) With regard to the probability density function , in the absence of thermal noise each position is a deterministic function of and of the action and is independent of and for all . So we can write for some , and hence the probability density function is a delta function, i.e. . Note that, if we knew exactly, we would also know how actions affect the trajectory, so the problem would be trivially solved. The reason why some sort of control algorithm is needed is precisely that cannot be easily determined. It is worth mentioning that, had we included inertia in the problem, we would have needed to add the drop’s velocity and acceleration to the state space in order for to be welldefined; this increase in dimensionality would have made the problem harder.
(4) For the rewards, we need to favour actions that move the particle closer to the origin, and punish ones that bring it further away from it. We thus choose to reward each speed adjustments in relation to the the drop’s subsequent displacement vector. Our choice of reward function is given by
(5) 
where is a dimensionless parameter designed to tune the peakedness of the function inside the exponential; we explore below how the performance depends on the value of (the value will be chosen for most results). To aid intuition, note that the reward function can also be written as , where is the angle that the displacement vector makes with the inward radial vector . The reward is thus maximal when (inward radial motion) and minimal when (outward radial motion). We found it important that our reward function evaluate actions on a continuous scale. If, for example, we were to assign a value of to moves that point us within some angle of the right direction and to everything else, the algorithm would regard all bad moves as equally undesirable and have difficulty learning. An exponential dependence was chosen over other options, such as a piecewise linear function, in order to reduce the number of free parameters.
iii.3 Choice of algorithm
For our Reinforcement Learning algorithm, we choose a classic ActorCritic method based on gradient ascent sutton2018reinforcement . The “Actor” refers to the policy, which encodes behaviour, while the “Critic” refers to the value function, which measures expected returns. We introduce parametric approximations of both the policy and the state value function, and then, at each time step, vary the parameters in the direction of an estimate of the gradient of performance with respect to the parameters.
iii.4 Actor part of algorithm
We wish to determine the optimal policy for this problem, i.e. the p.d.f. that maximizes for some fixed . We introduce a parametric policy of the form , where is some array that characterizes the policy, and we then use gradient ascent on to find a local optimum for (in all that follows, when we use a subscript in the value functions, it will always indicate implicitly a dependence on ). In other words, if we define , we will seek to optimize for by iterating . This will allow us to improve the policy at every time step (socalled online learning). This is referred to as the “Actor” part of the algorithm, because the policy generates behaviour.
Computing the gradient may appear difficult a priori, but can be achieved using a powerful result known as the policy gradient theorem, proven in Ref. sutton2018reinforcement for countable action spaces. This theorem states that at time the gradient is equal to
(6) 
where
(7) 
is known as the advantage function and we have introduced
(8) 
which is the actionstate value function.
The result in Eq. (6) suggests a practical way to implement an algorithm to determine the parameters of the optimal policy. Specifically, we drop the expectation and, after drawing from the current policy (more on this below), iterate on the parameters of the policy as
(9) 
at each time step. Note that this leads to an unbiased estimate of the policy gradient because the expected value of the update is the true value of the gradient.
iii.5 Parametric policy
We now need to write down an expression for , i.e. our guess for the true optimal policy. Since we have a logarithm in Eq. (6), it is convenient to write in the form
(10) 
where and where
(11) 
For fixed (,), this ensures that is a p.d.f. for . Here can be any convenient function, and in what follows we take it to be a polynomial in the parameters. Specifically, we take to be an array and set
(12) 
As noted before, this can only work if the action space is not too large. If, for example, we could act on multiple cylinders simultaneously, we would need a more complex Ansatz for as well as a higherdimensional array , which would make gradient ascent harder. Then, at time step , the score function becomes a array such that
(13) 
In practice, we generate a second action at time and take
(14) 
where we use the subscript to indicate that this is its value at time step . Then our algorithmic update for in Eq. (9) becomes
(15) 
Note that other choices for are of course possible, a truncated Fourier series being the obvious one, but we found that Eq. (12) was computationally faster. Note also that
is an unbiased estimate of the advantage function
, since(16) 
Therefore we can replace in Eq. (15) and iterate
(17) 
iii.6 Critic Part of Algorithm
The second, or “Critic”, part of the algorithm deals with the approximation of the value function. To make use of Eq. (17), we replace the state value function for our policy with another parametric approximation , where is once again an array. The goal is then to determine which minimizes the distance . This can be done numerically by using gradient descent with the update rule
(18) 
where . Assuming that we can take the gradient inside the expectation, we have
(19) 
After replacing the expectation with the corresponding unbiased estimate, we then obtain the the gradient algorithmic update rule
(20) 
Finally, since , we can use the approximation to get the final form of the update rule as
(21) 
where
(22) 
Similarly to Eq. (12), we take to be an array and
(23) 
Then with , and the update becomes
(24) 
To ensure convergence, it is customary to make and decay geometrically, which we do here by setting and where and are constants and where is the discount factor sutton2018reinforcement .
iii.7 Summary of algorithm
To summarise, the algorithm we implement works as follows:

Choose the step size constants and ;

Initialise the arrays and to and choose the drop position at .

At time step , draw a random action from and record the corresponding reward and next state .

Update the two parameters as C_t+1=C_t+αγ^tδ_tT_t
D_t+1=D_t+βγ^tδ_tQ_t where . 
Repeat until convergence.
iii.8 Sampling from
The final problem we need to address is how to sample from Eq. (10), i.e. randomly chose an action from the approximate policy, without knowing the normalising function . For this we can use a technique know as rejection sampling. Say we want to sample from a p.d.f. on but we only know for an unknown . We then consider a second (known) p.d.f. on and we take so that for all in . We generate and compute
(25) 
Then we generate and if we accept , otherwise we reject it. Then, conditional on being accepted, we have . A proof of this algorithm is given in Appendix A.
In our case we can take
to be the uniform distribution; to generate
it then suffices to find an upper bound for , which is straightforward if we have a bound on each of , , . The only drawback of this method is that if is very small it may take a long time to find an acceptable . To get around this, we generate and then take the speed adjustment to be , where is the wiggle room size and is the halfwidth of the state space. By taking , we can easily find a reasonably small upper bound for , which helps to keep relatively large.iii.9 Time delays and noise
Two aspects were finally added to the algorithm in order to make it physically realistic. First, we dropped the mathematical assumption that the cylidnders can change speed instantaneously. Instead, we assumed that the computer takes a time to determine the position of the drop, accelerates the cylinder over a time and resets the velocity to its initial value over a time . All these delays are included in the same time step. In all cases, we assume that cylinders speed up or slow down with constant angular acceleration. Secondly, we allowed the drop to no longer follow a deterministic trajectory by adding thermal noise, as explained in §II.2. This allows us to test the application of the algorithm to setups on small length scales.
Iv Physical and Reinforcement Learning parameters
iv.1 Episodes and batches
In order to estimate the best policy for a given starting point , we would need to apply the ActorCritic method outlined above to an infinitely long sequence of states starting from . In practice, however, this sequence has to terminate, so it is customary to instead run a number of episodes (i.e. sequences of states) of fixed length from and to apply the ActorCritic algorithm to every state transition as usual. In each new episode, we then use the latest estimates of and as parameters.
A straightforward way to assess the speed at which learning is done is to group episodes into batches (typically of ) and examine the effectiveness of the values of and given by the learning process at the end of each batch. To do this, we use the values of and so obtained to run a separate series of episodes starting from during which no learning occurs. We then compute the average final distance to origin, , as a proxy for the effectiveness of the control algorithm in bringing the drop back to the origin. Note that if we were estimating the policy for a realworld experiment, we would stop running batches as soon as becomes suitably small, and then use the resulting values of the parameters as our practical control algorithm.
iv.2 Physical parameters
In order to run the algorithm we need to fix values for the parameters describing the flow and the physical setup in which the drops moves. In order to use a classical study as benchmark, we take these physical parameters from Bentley and Leal’s fourroll mill control study bentley1986experimental . This leads to the choices of parameters as follows (all dimensions below will thus refer to their paper):

Length scales are nondimensionalised by half the distance between the centre of the cylinders () and time scales by the angular velocities (which we take to be ) from Ref. bentley1986experimental .

The radius of the cylinders was in Ref. bentley1986experimental so the nondimensionalised cylinder radius is . Note that in our theoretical approach for the flow in Eq. (2), we assumed . The experiments in Ref. bentley1986experimental are therefore at the limit of what can be captured by the simple hydrodynamic theory, which is however useful in what follows to demonstrate a proof of concept of our control approach.

For the initial position of the drop, we start by taking to test the functionality of our algorithm, and we then extend to different starting points in §V.

Four time parameters have to be set: the times step (), the lag and the delays ( and ). In Ref. bentley1986experimental the lag was about , and the motors took roughly to modulate speed. We will thus take nondimensionalised values , , . Note that because the algorithm is only rewarded for moving the drop in the right direction, the exact values of the time parameters actually do not matter, as long as they all get scaled by the same factor. In other words, we still expect the algorithm to learn if we replace the four time scales , , and with , , and for some . Reducing the time parameters gives of course more control over the drop, since we adjust its trajectory more frequently.

We need to set the dimensionless rotation wiggle rooms, i.e. the range of angular velocities we allow for the cylinders. For simplicity these are same for all cylinders, set to be unless specified otherwise; we will study how varying this parameter affects performance in §VI.6.
The flow velocity is given by Eq. (2) and the trajectory of the drop is obtained by integrating Eq. (4) with the speed parameters during each step. For numerical integration, we employed the fourth order RungeKutta scheme RK4 with step size .
iv.3 ActorCritic parameters
The Reinforcement Learning algorithm outlined above contains also a number of parameters:
In our exploration, we start by choosing the values of the algorithm parameters randomly and then we vary them one at a time to see how they affect accuracy and learning speed. We use the mean final distance to the origin, , to monitor the algorithm’s success in bringing back the drop to the centre of the flow. When we compute this quantity, all parameters remain the same as in the training episodes, and and are held fixed. When find a local optimum for one parameter, we keep it fixed at that value in subsequent simulations, thereby leading to a set of parameters which should optimize performance, at least locally. This exploration of parameters will be further discussed in §VI.
V Illustration of learned policy
In this first section of results, we demonstrate the effectiveness of Reinforcement Learning in stabilising the motion of the drop when lags and delays are included (but not thermal noise). We first let the algorithm practice with a given starting point and then simulate a trajectory to assess performance (i.e. the practical control of the drop’s motion). We will illustrate the details of the learning process in §VI and the robustness of the algorithm in §VII. The codes used as part of this study have also been posted on GitHub GH where they are freely available.
We assume here that all physical parameters are as in §IV.2 and take . The parameters of the Reinforcement Learning algorithm, which will be examined in detail in §VI, are taken to be , , , . We also set the rotation wiggle room to be of the default angular velocity.
We start the drop at the dimensionless location , estimate , over episodes and then use the learned policy to plot the trajectory of the controlled drop motion. In an experiment, one would use the algorithm to estimate the policy and then apply the control policy until the drop is sufficiently close to the origin, after which the cylinder could resume spinning at their default velocities. Results are shown in Fig. 2A, with a movie of the motion available in Supplementary Material SM . Since the drop starts in the quadrant (see Fig. 1), the motion is initially only affected by , i.e. the rotation rate of the cylinder ahead of it in the clockwise direction. We show in Fig. 2B and Fig. 2C the timeevolution of and in blue and green, respectively. The use of the green and blue in the trajectory from Fig. 2A highlights the parts of the trajectory where each cylinder undergoes a change in its rotation speed (the corresponding cylinder is indicated in the insets of Fig. 2B, C). The final distance from the origin was about , which is smaller that the nondimensionalised value of required in the experiments of Ref. bentley1986computer .
The results in Fig. 2 suggest a simple physical interpretation of the policy. In the absence of control, the drop would be advected towards , from its initial position (see streamlines shown in Fig. 1). The policy obtained via the Reinforcement Learning algorithm causes the angular speed to undergo bursts of small increases above its steady value (typically 50% in magnitude); when is increased, the drop is seen to undergo a small diagonal displacement towards the axis, while when the drop experiences a small amount of free motion. By alternating between the two, the drop is eventually able to reach the axis. Note that the sharp corners in some of the pathlines are a consequence of the absence of inertia. After reaching the axis, the drop crosses into the quadrant, where (the only cylinder we can now act on) undergoes similar small bursts in order to bring the drop back to the axis. The net result of the alternating actions of and is a zigzag motion on both sides of the axis, which eventually brings the drop acceptably close to the origin. Note that when taking the nondimensionalization into account, the motion displayed in Fig. 2 would take about s in the original experimental setup of Ref. bentley1986computer .
We next illustrate how the algorithm performs from different starting points, as well as how trajectories change depending on the initial position. We again take , , , and choose the same time parameters as before. We consider six different starting points located in the four quadrants, specifically , , , , , . The algorithm trains for each point separately, i.e. it computes a different policy for each value of . For each starting point, we allow the algorithm to practice on as many batches as needed until drops below . This never took more than batches, i.e. episodes.
We show in Fig. 3 the trajectories resulting from the learned policies (to reduce crowding, each trajectory terminates as soon as the distance from the origin at the end of a time step becomes smaller than ). In each case, the controlled motion of the drop is shown in thick solid, while the thins lines correspond to the paths that the drop would follow if it were not for speed control (these paths coincide with the streamlines in Fig. 1). In all cases, we see that the algorithm succeeds in bringing the drop back to the origin. All trajectories present small diagonal drifts caused by bursts of increased rotation, separated by free motion along the streamlines. Computationally, points further away from the origin required more training; for example, finding the trajectory starting from in Fig. 3 only required training runs, while the one starting from took .
Vi Learning process and parameters
The previous section demonstrated the effectiveness of Reinforcement Learning in controlling the motion of the drop. We now investigate how accuracy and learning speed depend on the various parameters used by the algorithm. Then we examine in §VII how the algorithm deals with noise and with finding a global policy. As explained in §IV, learning is assessed by running a fixed number of batches () of episodes and plotting the values of the average final distance in each batch. Since results are random, we quantify the uncertainty in each learning curve () by generating it twice with the same parameters and returning the average relative error
(26) 
Since the setup is fourfold symmetric, we restrict our attention to the case where the drop starts out in the second quadrant (denoted by , see Fig. 2).
vi.1 Varying the discount Factor
We start by setting , , , and aim to find the value of the discount factor which causes to decrease the fastest. We ignore the values and , since would result in a very shallow onestep lookahead, and would not ensure , while decay is required in the updates of the ActorCritic method. In Fig. 4 we plot the average final distance as a function of the batch number for different values of in the range . Clearly performance improves steadily with , showing that we can base our choice of actions on longterm predictions; the larger the value of the more we penalize bad actions far ahead in the future, since the th reward gets discounted by
. The average relative errors are small, indicating that the variance within each learning curve is likely to be small. Since it gave the best performance, we take
in what follows.vi.2 Varying the peakedness of the reward function
To address the impact of the peakedness of the reward function, in Fig. 5 we plot the learning curves obtained by running the algorithm with the values , , , and different values of . Small values of , such as , do not adequately discriminate between actions, while very large values (e.g. ) hinder exploration by treating all bad actions as equally undesirable, and also take longer to run. The average relative errors are again very small, which makes us confident that the displayed curves are representative samples. Within a small window from to , the learning speed increases slightly, but since results are all very similar we keep in what follows.
vi.3 Varying the gradient ascent parameters ,
Even with the previous choices of parameters, it still takes approximately episodes to reach a final accuracy of ( batches of episodes, or more). Out of all parameters, we found that the gradient ascent parameters and have the biggest impact on learning speeds. When chosen correctly, they can reduce the number of training episodes to just a few hundred. To demonstrate this, we take , , , and monitor the final average distance for various values of ; the resulting learning curves are shown in Fig. 6. Performance increases steadily with . The values lead to a steep learning curve, dropping below after only iterations. The only real constraint on these parameters is that they cannot be arbitrarily large, because for the ActorCritic algorithm may give very large entries for and , making it hard to find a suitable for the rejection sampling part (§III.8). Furthermore, the final gradient ascent update in each episode has size . If we want this to be reasonably small for and , e.g. less than , we should take . As in the previous simulations, the relative errors are seen to be very small. We therefore settle on the values in what follows.
vi.4 Varying the size of policy and value function arrays
We next examine the impact of the size of the policy and value function arrays and . To see how this parameter affects the final accuracy and the learning speed, we choose different values of and run batches for each value (the other parameters are kept at , and , ). The learning curves are displayed in Fig. 7. We see that the choice performs poorly since becomes a uniform distribution; the remaining values give very similar results, with small relative errors , so we keep in what follows.
vi.5 Varying the step size and the length of episodes
The accuracy of the algorithm depends strongly on the step size , with larger values leading to a poorer accuracy. Furthermore, for large values of , learning may still occur but the learning curve is no longer steadily decreasing with batch number because bad actions can take the drop further away from the target. Through extensive simulations, we found empirically that should be chosen so that the particle can never move by a distance larger than the desired final accuracy during a time step. In their original paper, Bentley and Leal state that a dimensionless final distance of is enough for their experiments (bentley1986computer, ), and since was consistently below this threshold in the previous sections, our chosen is sufficiently small.
The length of the episodes is also an important parameter. To investigate how it affects performance, we fix the values , , , and monitor how the learning speed depends on when it is equal to , , , . For each value, we run batches of episodes of steps each, until we reach a total of steps. This way, all batches consist of time steps and we can compare learning speed batch by batch. The sizes of our batches are thus, respectively, , , and . The resulting learning curves are shown in Fig. 8. The learning curve seems to get steeper as increases, signifying that the algorithm takes longer to identify the optimal strategy. A possible explanation for this result is as follows. Since is very close to , there is very little discounting in the first few time steps. Therefore, if is large, the algorithm can afford to pick suboptimal actions in the beginning because it has time to recover. Conversely, if is small the algorithm cannot waste time on bad actions and needs to aim for the target from the start. After the initial phase, the algorithm proved more accurate for larger values of , likely because the drop is allowed to explore the environment for longer. All learning curves consistently plateau around the mark and relative errors are small. For the purpose of the experiments in Ref. (bentley1986computer, ), performance is essentially the same in all four case, so we keep , which had the smallest .
vi.6 Varying the rotation wiggle rooms
As a reminder, the wiggle room is the halfwidth of the window of (dimensionless) angular velocity within which the cylinders are allowed to change their speeds. This is another important parameter that affects learning speed. If the initial position is far from the origin, large changes in the fluid velocity, and therefore in the torques, may be needed to prevent the drop from wandering out of the state space.
To illustrate this, Fig. 9 shows the learning curves obtained from (with , , , , ) when the wiggle rooms are , , and . We see that errors are almost always negligible and that even a small difference in the the allowed rotations significantly affects the learning speed; small wiggle rooms mean that we need to be more precise with our choice of actions, because we may not be able to recover from a bad one.
From a practical standpoint, a small wiggle room might be preferable to prevent high torques and accelerations of the cylinders. However, a value that is too small prevents the algorithm from stabilising the drop. In general, points further away from the origin will require bigger leeways, and reducing the wiggle rooms decreases learning performance. We keep our wiggle rooms at , which gave good performance while in general requiring less torque than .
vi.7 Variance
After running a batch, we used the resulting policy to simulate episodes in order to estimate the average final distance . Let be the final distance from the origin in the th episode. In order for the algorithm to be useful in practice we need consistency, i.e. for to be as small as possible. To test this, we run batches with the parameters , , , and . For each batch, in Fig. 10 we plot along with the range of the corresponding . We can see that the algorithm is initially rather inaccurate, but then slowly improves and becomes more consistent. In batch , all episodes land within of the origin, which is smaller than the (dimensionless) distance required for the experiments of Ref. (bentley1986computer, ).
Vii Robustness of the learned policy and further control
So far we have established the effectiveness of our Reinforcement Learning algorithm in stabilising the drop trajectory when the algorithm is trained against deterministic motion and when the drop always starts at a fixed location in space. In this section we relax these two assumptions. First we establish that the policy learned in the absence of noise continues to work even in the presence of thermal noise (§VII.1). We next study the extent to which the policy learned from a given starting point is effective when the drop starts from another location (§VII.2). Finally, motivated by experiments where the drop is stretched by the flow in a controlled way, we propose a variant of the algorithm designed to control the extension rate of the flow at the location of the flow (§VII.3).
vii.1 Noise
The dynamics of the drop so far followed Eq. (4) with the model flow from Eq. (1) and it was therefore fully deterministic. Motivated by experimental situations where the drop is small enough to be impacted by thermal noise, we now examine the performance of the deterministic Reinforcement Learning algorithm in a noisy situation.
We incorporate thermal noise using a Langevin approach batchelor1976 . In a dimensional setting, this is classically done by adding a random term to Eq. (4) where is the mobility of the spherical drop in a fluid of viscosity and is a random force. We assume that has zero mean value (i.e. , where we use to denote ensemble averaging) and that it satisfies the fluctuationdissipation theorem
(27) 
where is Boltzmann’s constant and is the absolute temperature.
Moving to dimensionless variables, we used the half distance between the cylinders, , as the characteristic length scale and the inverse cylinder rotation speed, , as the characteristic time scale (see §IV.2), so Eq. (27) allows to define a typical magnitude for the random force, given by . Nondimensionalising by , the Langevin approach consists then in adding a random term of the form to the dimensionless version of Eq. (4), where is a dimensionless random force with and . Here is the dimensionless Péclet number, which compares the relative magnitude of advection by the flow and Brownian diffusion
(28) 
We implement the Langevin approach numerically by adding a random term at the end of each numerical step, where is the step size used in the RK4 scheme and (
) is drawn from a standard normal distribution.
Physically, the Péclet number in Eq. (28) can be recast as a ratio between the radius of the drop and a thermal length scale ,
(29) 
With the dimensions from §IV.2, we have m, s, and assuming the fluid to be water at room temperature ( K, viscosity Pas), we obtain m. In the original work from Ref. bentley1986experimental , the typical drop has radius mm, which leads to in these experiments. This very large number clearly indicates that thermal noise was not important in this original work.
To test robustness, we applied the policy obtained via the Reinforcement Learning algorithm from the previous sections (i.e. under deterministic drop dynamics) to environments with progressively smaller values of the Péclet number, which corresponds physically to shrinking the scale of the drop so that thermal noise becomes progressively more important. The parameters of the algorithm are once again , , , , , , and ; the wiggle rooms are set to . After training runs, we let Pe take values , , and simulated separate episodes in each case. Fig. 12 shows the corresponding proportions of runs landing within a distance of the origin. As expected, the accuracy decreases when the Pe number becomes smaller, dropping from a largest value of when Pe to when . The algorithm was more than accurate for Pe , which is six orders of magnitude smaller than in Bentley and Leal’s experiment (and thus would correspond to nanometersized drops). It is worth mentioning that, even though we did not do it in this work, noise could be included in the training phase rather than added once the policy has been found.
vii.2 Global Policy
So far the learned policy was always obtained for the same fixed starting point . Can we, on the other hand, obtain a policy that is optimal (or sufficiently close to optimal) for all starting point? Intuitively, points in the state space that are close together should have similar optimal policies, so if the state space itself is sufficiently small such a global policy should exist. Should this not be the case, we would have to split the state space in smaller regions and to determine a globally good policy in each region separately.
We investigate the existence of a global policy by estimating the optimal policy from the dimensionless starting position and then running trajectories from a number of other points in the state space. We use the time and learning parameters , , , , , , , . After training runs in a noiseless environment, we construct a rectangular lattice of evenly spaced points in the state space and run a trajectory from each one of them using the policy obtained for (i.e. no further learning occurs during that process). We then use the results to build a colour map of the final distances, i.e. a matrix where is coloured according to the final distance from the origin of a trajectory starting from the corresponding location on the grid. We show the results for all trajectories in the absence of thermal noise in Fig. 13. We can see that all unsuccessful starting points are clustered around the edge of the state space on the sides where the flow points away from the origin, suggesting that we can indeed find a global policy by making the state space a bit smaller. The algorithm was successful in of cases, with an average final distance of
and a standard deviation of
. The average final distance is heavily skewed by the edge cases. In the region , corresponding the lighter strip in the middle, the average final distance was with a standard deviation of and a success rate of . The largest final distance in this region was and the smallest was . To see how thermal noise affects this result, we carry out the same simulations by incorporating noise as in §VII.1 in the case where , with results shown in Fig. 14. The algorithm was now successful in of cases, with an average final distance of ad a standard deviation of . Again, if we restrict the set of initial states to those in these figures improve significantly. Success rate jumps to and the average final distance becomes with a standard deviation of . The largest final distance in this region was and the smallest was . In summary, the overall performance was quite similar to the noiseless case, except for a small decrease in accuracy and consistency in the central region. This shows that the algorithm is robust to noise even in the case of nanometersized drops.vii.3 Extension Rate Control
Returning to the physical aspects of the experiment, Taylor’s original study addressed how the properties of the flow affected the shape of the drop taylor1934formation . When the drop is fixed at the origin, its rate of deformation is dictated by the eigenvalues of the velocity gradient tensor, . The flows considered in this study are twodimensional and irrotational so that remains symmetric and traceless throughout. The velocity gradient is thus characterised by a pair of eigenvalues (extension rate) and (compression). When we alter the speeds of the cylinder with the control algorithm, we inevitably change the eigenvalues of , where is the position of the drop, leading to a timedependent eigenvalue . Since this eigenvalue controls the deformation of the drop, we wish to keep its magnitude as close as possible to the extension rate which we aim to study while we control the drop position.
Here we examine the case where the starting position is , with the same parameters as above (i.e. , , , , , , , ). We assume the drop is subject to thermal noise with . After training runs, we simulate a step trajectory (in which no learning occurs) during which we sample the extension rate times per time step. In Fig. 15A we plot the variation of the scaled extension rate, , with time, where is the value at the centre of the uncontrolled apparatus (using the dimensionless parameters in the problem, we have ). The extension rate is seen to undergo significant variations during the controlled motion of the drop, with jumps that are routinely about the desired value . The norm of the final state was .
To lower the variations on and keep it closer to its target value, we modified the algorithm as follows. The idea is to note that if an angular velocity vector (i.e. the vector of all four cylinder rotations) induces an extension rate at , then by linearity the angular velocity vector induces an extension rate at the same point (). We may then scale, at each time step, with a suitable scalar function so that the angular velocity vector changes as , where corresponds to standard speed modulation. Since the jumps in are due to the rapid changes in angular velocities, we choose to minimize the impact of speed modulation. Specifically, at time step and state we denote , where is the extension rate at resulting from unscaled speed modulation. Then we take a piecewise linear scaling
(30) 
and choose . To compensate for this scaling, we also make the change . In Fig. 15B we show the evolution of the extension rate (scaled by ) in a trajectory with the same parameters as before but with our scaling implemented. A couple of large excursions remain, but performance has noticeably improved relative to the original control algorithm (left). The norm of the final state was , indicating that scaling does not affect accuracy. This proofofprinciple result shows therefore that a scaling in the optimal policy can be used to limit the extension rate in the flow.
Viii Discussion
In this paper we saw how Reinforcement Learning can be applied to solve a classical control problem for fluid dynamics at low Reynolds numbers. Our goal here was to modulate the rotation speeds of a model of Taylor’s fourroll mill in order to stabilize a drop positioned near the stagnation point, which is known to be unstable. We implemented an ActorCritic method and found a probabilistic policy that worked well for all initial positions.
In our approach, we proceeded by steps. We first derived a basic version of the algorithm, and then added measurement delays, thermal noise and extension rate control. The algorithm was able to manoeuvre the drop effectively in all cases and the accuracy achieved was below that required in the experiments of Bentley and Leal bentley1986experimental , and therefore satisfactory for most experimental implementations. Numerical results shown in §VI also demonstrated that learning is remarkably consistent, with minimal variance within the learning curves in the majority of cases.
The good performance observed was, to a large extent, due to our choice of actions rather than to the quality of the approximation for the policy (). Indeed, numerical results in Fig. 7 show that a first order approximation of the form is sufficient to get accurate results. In practice, the learning process was often slower at the beginning, when the algorithm had not yet gathered enough information to take good actions. Then, once the general shape of the policy had been identified, learning sped up significantly, until it slowly tapered off as we approached the theoretical accuracy.
It is worth mentioning here that we attempted other implementations too, which were not successful . We initially tried to discretize the state space, so that the drop would move in a finite grid as opposed to a continuous environment, but it was difficult to combine this with the Markov property and harder to factor in thermal noise. We also used a truncated Fourier series for the form of the function , but this was computationally expensive and it artificially introduced discontinuities as well as Gibbs’ phenomena.
Finally, we also experimented with the shape of the reward function, seeking to penalise actions requiring very large torques. Unfortunately, our attempted modifications in that regard (such as subtracting some simple increasing function of the torque from ) did not succeed. After extensive simulations, we concluded that torque reduction can be achieved by either shrinking the sate space (so that smaller corrections are needed) or reducing the default angular velocities.
There are many possible extensions to our work. One could try an algorithm with higher sample efficiency (i.e. one which makes better use of past experience), one that better balances exploration and exploitation, or a different learning paradigm altogether (i.e. a neural network). We may also implement different approximations for the value function as well as alternative rewards and sampling methods. It would also be interesting to devise a model where time is continuous.
From a physical standpoint, it might be desirable to include inertia (b
Comments
There are no comments yet.