1 Introduction
The classical paradigm for designing fluid control strategies, for example reducing the drag force on a bluff body through active control, consists of a first stage in which we explore and understand the physics of the problem over a wide parametric range, followed by careful modeling and the development of specially designed control strategies that exploit this understanding, and culminating in an optimal tuning of the control parameters [brunton2015closed]. The process involves careful computational or experimental investigation, and intuition gained through that investigation, resulting in heuristically derived schemes. Hence, this procedure is extremely slow to yield effective results. In recent years, the application of machine learning to fluid control problems has received increasing attention [brunton2019machine], because it offers a totally different paradigm for arriving at results more quickly. The combination of proper machine learning tools with domain expertise in fluid mechanics could shift the classical paradigm by directly optimizing the control strategy, reducing or even eliminating human involvement in the modeling and design of the control strategies.
Among the machine learning tools, reinforcement learning (RL) offers especially intriguing opportunities for quick progress, as it has demonstrated the capability of achieving “superhuman” performance in board games [silver2017mastering] and of tackling complex, high-dimensional continuous control tasks [haarnoja2018soft]. Recent explorations of RL for fluid mechanics problems include fish biolocomotion [gazzola2014reinforcement, verma2018efficient], motion and path planning for aerial/aquatic vehicles [colabrese2017flow, novati2019controlled], active flow control for bluff bodies [ma2018fluid, rabault2019artificial], and foil shape optimization [viquerat2019direct].
To the authors’ best knowledge, RL applications to fluid problems have so far been limited to computer simulations. In order to demonstrate the feasibility of applying RL to experimental fluid mechanics, we address experimentally the problem of drag reduction and power gain efficiency maximization for a bluff cylindrical body equipped with two small rotating control cylinders. A sketch of the model used is shown in Fig. 1. The problem has been studied both experimentally [schulmeister2017flow] and numerically [zhu2015simultaneous], investigating the effects of (a) the small-to-large cylinder diameter ratio, (b) the gap ratio, (c) the configuration of the smaller control cylinders, and (d) the rotation rate on the fluid forces and flow patterns. These quantities are defined in terms of the diameter of the main cylinder, the diameter of the smaller control cylinders, the gap between the main cylinder and each of the smaller cylinders, the rotation speed of the smaller cylinders, and the oncoming velocity. Past results showed that the counter-rotating cylinder pair could effectively reduce the drag force on the main cylinder as well as diminish the oscillatory lift force by suppressing the vortices shed in the wake [beaudoin2006drag]. The physics behind this phenomenon is that when the control cylinders are placed at appropriate locations and rotate at a sufficiently fast speed, they are able to interact with the main cylinder’s separating boundary layer and cause it to reattach, resulting in a narrower wake behind the cylinder and hence a significantly reduced pressure drag.
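Table 1: Configuration parameters of the experiment and the simulation.
Parameter | Experiment | Simulation
          | 5.08 cm    | 1
          | 0.125      | 0.125
          | 0.025      | 0.025
          | 9          |
          | 10,160     | 500
          | 0.1 s      | 0.12
          | 3.66       | 5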
In this paper, we first describe our active flow control procedure and then present results of applying RL in both experimental and simulation environments. We demonstrate that, with a properly designed reward function and noise reduction, the agent can learn a control strategy that is close to the optimal static control for reducing the system drag or maximizing the power gain efficiency.
2 Methodology Description
2.1 Experimental model and procedure
The sketch in Fig. 1 outlines the experimental procedure and highlights one episode of the learning process, corresponding to one towing experiment lasting 40 seconds. At the beginning of each experiment, the control cylinders are held still for four seconds to ensure a fully developed wake. Then, the RL agent starts to interact with the environment via a state inquiry and an action decision at 10 Hz (i.e., every 0.1 s). The states in the current experiment are the drag and lift coefficients on the three cylinders altogether, which are calculated as follows,
(1) 
where is the fluid density, and and are the average drag and lift forces over . After completing one experiment, the carriage is brought back to the starting point. Then, the policy of the RL agent is updated based on the experience gathered from all the previous experiments up to that time, while the environment is reset and prepared for the next experiment. A two-minute pause is imposed between towing experiments to avoid cross-contamination between successive experiments.
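As an illustration of the coefficient calculation referred to in Eq. (1), the following minimal sketch turns force samples averaged over one action interval into drag and lift coefficients. It assumes the conventional normalization by 0.5 * rho * U^2 * D * L; the function name, the span L, and all numerical values are hypothetical and are not taken from the experiment.

```python
from statistics import fmean


def force_coefficients(drag_samples, lift_samples, rho, velocity, diameter, span):
    """Return (C_D, C_L) from force samples collected over one action interval.

    Assumes the conventional normalization F / (0.5 * rho * U^2 * D * L);
    the reference area actually used in the experiment is an assumption here.
    """
    if not drag_samples or not lift_samples:
        raise ValueError("need at least one force sample per component")
    # Reference force: dynamic pressure times the projected area D * L.
    reference_force = 0.5 * rho * velocity**2 * diameter * span
    mean_drag = fmean(drag_samples)
    mean_lift = fmean(lift_samples)
    return mean_drag / reference_force, mean_lift / reference_force


# Hypothetical numbers, chosen only to exercise the function.
c_d, c_l = force_coefficients(
    drag_samples=[2.0, 2.2, 1.8],
    lift_samples=[0.1, -0.1, 0.0],
    rho=1000.0,
    velocity=0.2,
    diameter=0.0508,
    span=0.4572,
)
```

In a setup like the one described, the force samples would come from the load cell at its acquisition rate and be averaged once per agent-environment interaction.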
The policy of the RL agent in the current work is updated only between experiments, instead of at every agent-environment interaction, to reduce the action delay caused by the limitations of our hardware. The control and data collection interface is developed in another language, and the RL agent is implemented in Python based on the deep learning package TensorFlow [abadi2016tensorflow]. The XML-RPC protocol is then used for data communication between the cross-language platforms, which allows us to take advantage of the machine learning tools developed in Python as well as the established experimental and computational platforms in other languages with minimum effort.
2.2 Simulation model
In addition to the experiment, we also employ a simulation model implemented in the LilyPad solver [weymouth2015lily]. We conduct two-dimensional numerical simulations and visualize the flow around the main and control cylinders via the same procedures described in the last subsection.
The simulation uses a resolution of 24 grids per main cylinder diameter and a domain size of . The Reynolds number based on the main cylinder is , the same as in the simulation work by [schulmeister2017flow]. Based on , the fixed nondimensional time step is selected equal to , and the state inquiry and action decision are made every 16 time steps. In each episode, the RL agent starts actions at nondimensional time when the wake behind the cylinders has fully developed, and terminates at . Each simulation takes 10 minutes on a single core of a Dell Precision Tower 5810 workstation.
The configuration parameters for the simulation are listed in Table 1. In the simulation, the states for the RL agent differ from those in the experiment: they are the drag and lift coefficients of the main cylinder alone, calculated as follows,
(2) 
where and are the average drag and lift forces on the main cylinder alone over the 16 simulation time steps.
2.3 Reinforcement learning
Reinforcement learning involves an agent interacting with the environment, aiming to learn the policy that maximizes the expected cumulative reward. At each discrete time step $t$, the agent makes an observation of the state $s_t$, selects the corresponding action $a_t$ according to the policy $\pi$ to interact with the environment, and then receives a reward $r_t$. The objective is to find the optimal policy $\pi_\theta$, parameterized by $\theta$, which maximizes the expected cumulative reward,
$$J(\theta) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi_\theta}} \left[ \gamma^{t} r(s_t, a_t) \right], \qquad (3)$$
where $\gamma$ is a discount factor and $\rho_{\pi_\theta}$ denotes the state-action marginals of the trajectory distribution induced by the policy $\pi_\theta$.
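To make the discounted cumulative reward in the objective above concrete, here is a minimal sketch that evaluates it for a single recorded episode. The function name and the reward values are hypothetical; the default discount factor of 0.99 is the value reported later in this section.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma**t * r_t for one episode.

    The sum is accumulated backwards so that each step needs only one
    multiplication: G_t = r_t + gamma * G_{t+1}.
    """
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical per-step rewards from a three-step episode.
episode_rewards = [-1.0, -0.5, -0.25]
example_return = discounted_return(episode_rewards, gamma=0.5)
# -1.0 + 0.5 * (-0.5) + 0.25 * (-0.25) = -1.3125
```

The expectation in the objective would then be approximated by averaging such returns over the transitions the agent has collected.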
As mentioned in the previous subsections, in the current work the state is the concatenation of the drag and lift coefficients, measured on the three cylinders altogether in experimental environments or on the main cylinder alone in simulation environments. The action is the concatenation of the rotation rates of the two control cylinders. The reward received at each time step is derived from the state and action in the subsequent interaction. The detailed formulations of the reward and their comparison will be presented in the next section.
The update of the agent follows one of the state-of-the-art deep RL algorithms, viz. the Twin Delayed Deep Deterministic policy gradient algorithm (TD3) [fujimoto2018addressing]. In this paper, all the neural networks are feedforward neural networks with 2 hidden layers, each of width 256. The discount factor is set to 0.99. The policy exploration noise is set as in the task of drag reduction, and in the task of system power gain efficiency maximization. We use the Adam optimizer with learning rate with batch size , and update the critic networks iterations in each episode, while updating the actor and target networks every iterations. Other hyperparameters are inherited from [fujimoto2018addressing].
3 Results and Discussion
3.1 Experimental validation for constant rotating control cylinders
We first conducted 169 experiments with the control cylinders rotating at constant speeds. The results for the average drag and lift coefficients are plotted in Figs. 2 (a) and (b). The result shows that the decreases from 1.7 to 0.72 as decreases and increases. This decrease is the effect of the fast-rotating control cylinders, which help reattach the previously separating boundary layer on the main cylinder, reducing the wake width and hence the pressure drag [schulmeister2017flow].
A comparison is made between the in the current experiment at and the in the experimental work by [schulmeister2017flow] at . Fig. 2 (c) demonstrates the same trend of the mean drag coefficient against for both sets of results. Note that Fig. 2 (b) shows that when , is close to zero.
The experiment comparison when the control cylinders rotate at a constant speed confirms the validity of the current experimental setup, and we find that the minimum of the happens with at full speed in the clockwise (CW) direction, while at full speed in the counter clockwise (CCW) direction; is found to be close to zero.
3.2 Reinforcement learning in experimental environments
3.2.1 Task one: drag reduction
Three cases have been tested to demonstrate the importance of an appropriately designed RL reward and of applying a Kalman filter (KF) [zarchan2013fundamentals] for noise reduction when the agent queries the states. The results of the , as well as for the first 200 episodes are plotted in Fig. 3, while the reward and filter setups of the three cases are listed as follows:
Case 1: with KF;

Case 2: without KF;

Case 3: with KF.

Fig. 3: Training process in experimental task one. (a) case 1; (b) case 2; (c) case 3. The first row shows the hydrodynamic coefficients over 200 episodes, and the second row shows actions over 200 episodes. The solid lines and the shaded areas represent the mean value and one standard deviation over each episode, respectively. The black dashed lines represent . (d) Time trace of drag coefficient and actions (inset) for episodes 1, 5, 50 and 150 in case 1. The carriage moves at the second and stops at the second. The active control is switched on at the second. (e) Visualization of policy evolution from episode 151 to 160 in case 3, corresponding to the region between the two green dashed lines in (c). In (e), the first row shows and the second row shows , in terms of and .
The result of case 1 is shown in Fig. 3 (a), and the is found to drop quickly and converge to approximately in about 10 episodes, i.e., about half an hour in wall-clock time, where is the minimum value found in the reference experiment of the control cylinders rotating at . The learning curve of actions shows that the agent learns to rotate the two cylinders in opposite directions with near-maximum speeds. We observe that the increases in the first few episodes before decreasing and converging, which is a result of the agent’s random exploration in the early stage of learning.
The time traces of the drag coefficient and actions of the four different episodes in case 1 (highlighted with cross markers in Fig. 3 (a)) are displayed in Fig. 3 (d). A comparison between the raw data (blue) and the filtered data (red) reveals that the KF manages to remove the high-frequency oscillations in the drag coefficient. Fig. 3 (a) of the first episode shows that when the learning process just begins, the RL agent fails to make any informed decision. In the fifth episode shown in Fig. 3 (b), the RL agent explores the rotation of the first control cylinder at its maximum speed in the CCW direction, which results in an increase of the drag coefficient. After tens of policy updates, at the 50th episode shown in Fig. 3 (c), when the active control is turned on, the RL agent manages to make the correct decision to rotate the control cylinders in the right direction and at the right speed, and therefore reduces the drag coefficient. Comparing the actions of the 150th episode in Fig. 3 (d) to those of the 50th episode, we see that the actions are more stable with less variation.
In case 2, we use the same reward as in case 1, but do not employ a KF. The result in Fig. 3 (b) shows that after 200 episodes, the RL agent fails to reduce the as effectively as in case 1, where the KF is employed. The learning curve of actions indicates that the RL agent is not able to learn an appropriate policy for the second rotating control cylinder, resulting in large and . The comparison between cases 1 and 2 clearly shows the necessity of noise reduction when applying RL techniques in experimental environments and real-world applications.
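The kind of noise reduction compared in cases 1 and 2 can be illustrated with a minimal scalar Kalman filter. This is only a sketch under stated assumptions: it uses a random-walk state model, and the class name, variances, and synthetic signal are hypothetical rather than the filter settings used in the experiments.

```python
import random


class ScalarKalmanFilter:
    """One-dimensional Kalman filter with a random-walk state model (assumed)."""

    def __init__(self, process_var, measurement_var,
                 initial_estimate=0.0, initial_var=1.0):
        self.q = process_var        # how fast the true value may drift
        self.r = measurement_var    # sensor noise variance
        self.x = initial_estimate   # current state estimate
        self.p = initial_var        # variance of the current estimate

    def update(self, measurement):
        # Predict: the state is unchanged, its uncertainty grows by q.
        self.p += self.q
        # Correct: blend prediction and measurement using the Kalman gain.
        gain = self.p / (self.p + self.r)
        self.x += gain * (measurement - self.x)
        self.p *= 1.0 - gain
        return self.x


# Synthetic noisy signal around a constant value (hypothetical data).
rng = random.Random(0)
true_value = 1.0
noisy = [true_value + rng.gauss(0.0, 0.3) for _ in range(200)]

kf = ScalarKalmanFilter(process_var=1e-4, measurement_var=0.09,
                        initial_estimate=noisy[0])
filtered = [kf.update(z) for z in noisy]
```

A small process variance relative to the measurement variance gives strong smoothing at the cost of a slower response, which is the trade-off to tune when the filtered coefficient is fed to the agent as its state.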
To demonstrate the importance of properly designing the reward, in case 3 we keep the KF implementation but change the reward to . The augmentation of the weighted squared lift coefficient in case 1 is motivated by the need to reduce the oscillating lift force by preventing the alternate shedding of a vortex street. The result shows that the is reduced slowly over the number of episodes. Between the and the episodes, the reaches a relatively constant value of about 0.83, higher than the , and the is as large as 0.77. As the number of episodes increases, we observe that at around the episode, the drops suddenly and converges to the , while the magnitude of decreases to a value close to zero.
In order to explain such a drastic change of the hydrodynamic coefficients at around the episode, in Fig. 3 (e) we visualize the policy evolution between the 151st and the 160th episodes: the RL agent policy is initially stuck at a local minimum but then manages to escape due to exploration. The and in the whole learning process are mostly concentrated in the region highlighted by the black square. Note that the policy for gradually approaches the strategy of rotating with maximum speed, showing the process of learning in the time interval. In addition, the policy could be far from optimal outside the highlighted region, as the agent learns from the experience collected and can hardly generalize the policy to outlier states.
3.2.2 Task two: maximization of the system power gain efficiency
We define the system power gain efficiency as , which is increased by the drag reduction, , and decreased due to the power loss from the friction of the control cylinder rotation, , where we restrict in this task, is the average drag coefficient when , and the friction coefficient is calculated as [prandtl1949report]. Our goal is to maximize the average system power gain efficiency over one episode. Therefore, we constructed the reward function as follows,
(4) 
Due to the trade-off between the drag reduction and the power loss of the cylinder rotation, in this task, for the static control, the maximum of is achieved at , shown in Fig. 4(a) by the black solid line as the reference. The dots in Fig. 4(a) represent estimated in each episode, and are shown to be concentrated near the peak of the reference line for the episodes with a well-trained RL agent. In fact, the optimal from the RL experiment is found to be higher than the maximum from the static control, which could be explained by the fact that the control strategy designed by the agent is dynamic instead of static. We also plot the and the for each episode in Fig. 4(b).
3.3 Reinforcement learning in simulation environments
We conducted the drag reduction task in the simulation environment, where the parameters used are shown in Table 1, and the results are displayed in Fig. 5 for a total of 50 episodes. The results demonstrate that after only four episodes the RL agent has already achieved a stable hydrodynamic performance, with a negative mean drag coefficient on the main cylinder and a mean lift coefficient close to zero when the control cylinders rotate close to their maximum speeds and in opposite directions.
We select the first, third, tenth and fortieth episodes to visualize the flow behind the main and control cylinders. From Fig. 5, we can see that in the first episode, the vortex shedding behind the main cylinder is strong and the wake is wide, so the pressure drag is large. In the third episode, the rotation of the control cylinders results in a narrower wake and a regular shed vortex street, which leads to a smaller drag coefficient but an enhanced oscillation of the unsteady lift coefficient. In the tenth episode, the RL agent has learnt an appropriate control strategy: rotating the two control cylinders at maximum speed results in a thrust force on the main cylinder as well as a zero mean lift coefficient, as the control cylinders are able to stabilize the flow and eliminate the shed vortices in the wake [zhu2015simultaneous]. The similarity of the wake patterns and the hydrodynamic coefficients between the fortieth and tenth episodes demonstrates the convergence and stability of the RL agent’s active control strategy for the current system.
4 Summary
We demonstrated the feasibility of applying reinforcement learning with a properly designed reward function to discover effective active control strategies in experimental environments, by studying the bluff body flow control problem of actively reducing the drag force and maximizing the system power gain efficiency through the use of two smaller rotating control cylinders attached to the main cylinder. With a properly designed reward function, the agent was able to learn a control strategy comparable to the optimal one found in static control experiments within tens of experiments, requiring only several hours of wall-clock time.
The vortex shedding behind the cylinder was effectively suppressed by the control strategy learned by the agent, as was illustrated in the companion simulation studies. We also demonstrated the necessity of noise reduction techniques using a Kalman filter, a method which is especially suitable for experimental setups.
We believe that the flow control problem studied in the current work is only the beginning, and that reinforcement learning can find wide applicability in a variety of experimental fluid mechanics problems, especially high-dimensional dynamic fluid control problems that are too difficult to tackle with classical methods. The learning algorithm we employ is totally model-free, but it will be worth exploring the possibility of incorporating some domain knowledge and designing physics-informed reinforcement learning algorithms in order to further accelerate scientific discovery.
The code for the reinforcement learning agent and simulation environment is shared and can be downloaded at https://github.com/LiuYangMage/RLFluidControl.
Acknowledgement
DF and MST would like to acknowledge support from the MIT Riser Digital Twin Consortium, and a fellowship provided by Shell International Exploration and Production, Inc.; YL and GEK would like to acknowledge support by the DOE PhILMs project (No. DE-SC0019453).
Appendix A Experimental model
In Fig. 6 we show the control panels and model used in the experiment. The control panels in Fig. 6 (a) have two decks and consist of six major components: two DYN2-series motor controllers, one NI USB-6218 Data Acquisition (DAQ) board, one ATI sensor amplifier, and two power sources. The DAQ board in the upper deck is controlled through USB; it collects the analog data from the sensor amplifier at a sampling rate of 1000 Hz and sends the signal to the two motor controllers at a feedback rate of 10 Hz. In the lower deck, two independent DYN2-series servo motor controllers for two DST410 servo motors are powered by a 60 V DC power source.
Images of the experimental model are shown in Fig. 6 (b) and (c) for two views. The main cylinder is made from a hard-anodized aluminium tube to prevent corrosion in the water. The two smaller stainless steel cylinders are connected to the two DST410 motors via couplings and are supported by bearings on both ends. As shown in Fig. 6 (b), the ATI-Gamma sensor is installed on top of the model and is used to measure the total lift and drag force on the main and two smaller cylinders altogether.