# Controlling Rayleigh-Bénard convection via Reinforcement Learning

Thermal convection is ubiquitous in nature as well as in many industrial applications. The identification of effective control strategies to, e.g., suppress or enhance the convective heat exchange under fixed external thermal gradients is an outstanding fundamental and technological issue. In this work, we explore a novel approach, based on a state-of-the-art Reinforcement Learning (RL) algorithm, which is capable of significantly reducing the heat transport in a two-dimensional Rayleigh-Bénard system by applying small temperature fluctuations to the lower boundary of the system. By using numerical simulations, we show that our RL-based control is able to stabilize the conductive regime and bring the onset of convection up to a Rayleigh number Ra_c ≈ 3 · 10^4, whereas in the uncontrolled case it holds Ra_c=1708. Additionally, for Ra > 3 · 10^4, our approach outperforms other state-of-the-art control algorithms reducing the heat flux by a factor of about 2.5. In the last part of the manuscript, we address theoretical limits connected to controlling an unstable and chaotic dynamics as the one considered here. We show that controllability is hindered by observability and/or capabilities of actuating actions, which can be quantified in terms of characteristic time delays. When these delays become comparable with the Lyapunov time of the system, control becomes impossible.

## Authors

• 2 publications
• 7 publications
• 2 publications
• 6 publications
• ### DATE: Defense Against TEmperature Side-Channel Attacks in DVFS Enabled MPSoCs

Given the constant rise in utilizing embedded devices in daily life, sid...
07/02/2020 ∙ by Somdip Dey, et al. ∙ 3

• ### A Scalable Thermal Reservoir Simulator for Giant Models on Parallel Computers

This paper introduces the model, numerical methods, algorithms and paral...
12/10/2018 ∙ by Hui Liu, et al. ∙ 0

• ### Development of A Scalable Thermal Reservoir Simulator for Distributed-memory Parallel Computers

This paper introduces the model, numerical methods, algorithms and paral...
12/10/2018 ∙ by Hui Liu, et al. ∙ 0

• ### Modeling of a multiple source heating plate

Heating plates describe the transfer of heat from actuators to a target ...
11/30/2020 ∙ by Stephan Scholz, et al. ∙ 0

• ### Control-Tutored Reinforcement Learning

We introduce a control-tutored reinforcement learning (CTRL) algorithm. ...
12/12/2019 ∙ by Francesco De Lellis, et al. ∙ 0

• ### The entropy of a thermodynamic graph

We introduce an algorithmic model of heat conduction, the thermodynamic ...
01/31/2018 ∙ by O. Kurganskyy, et al. ∙ 0

• ### An adjoint optimization approach for the topological design of large-scale district heating networks based on nonlinear models

This article deals with the problem of finding the best topology, pipe d...
08/19/2020 ∙ by Maarten Blommaert, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Rayleigh–Bénard Convection (RBC) provides a widely studied paradigm for thermally-driven flows, which are ubiquitous in nature and in industrial applications [intro:thermobook]. Buoyancy effects, ultimately yielding to fluid dynamics instability, are determined by temperature gradients [intro:chandrasekhar-fluid-instalbility] and impact on the heat transport. The control of RBC is an outstanding research topic with fundamental scientific implications [intro:RB-control-accel-exp]. Additionally, preventing, mitigating or enhancing such instabilities and/or regulating the heat transport is crucial in numerous applications. Examples include crystal growth processes, e.g. to produce silicon wafers [intro:crystal-convection-inhomogeneities-book]. Indeed, while the speed of these processes benefits from increased temperature gradients, the quality of the outcome is endangered by fluid motion (i.e. flow instability) that grows as the thermal gradients increase. Thus the key problem addressed here: can we control and stabilize fluid flows that, due to temperature gradients, would otherwise be unstable?

In the Boussinesq approximation, the fluid motion in RBC can be described via the following equations [intro:lbm-book]:

 ∂u∂t+u⋅∇u=−∇p+ν∇2u+^yαg(T−T0), (1) ∂T∂t+u⋅∇T=κ∇2T. (2)

where denotes the time, the incompressible velocity field (), the pressure, the kinematic viscosity, the thermal expansion coefficient, the magnitude of the acceleration of gravity (with direction ), and the thermal diffusivity. For a fluid system confined between two parallel horizontal planes at distance and with temperatures and , respectively for the top and the bottom element (), it is well known that the dynamics is regulated by three non-dimensional parameters: the Rayleigh and Prandtl numbers and the aspect ratio of the cell (i.e. the ratio between the cell height and width ), i.e.

 Ra=gα(TH−TC)H3κν,Pr=ν/κ,Γ=H/Lx. (3)

Considering adiabatic side walls, a mean heat flux, , independent on the height establishes on the cell:

 q=¯¯¯¯¯¯¯¯¯¯¯¯¯⟨uyT⟩x−κ∂y¯¯¯¯¯¯¯¯⟨T⟩x=const, (4)

where indicates an average with respect to the -axis, parallel to the plates, and the time averaging. The time-averaged heat flux is customarily reported in a non-dimensional form, scaling it by the conductive heat flux, , which defines the Nusselt number

 Nu=qκΔT/H. (5)

As the Rayleigh number overcomes a critical threshold, , fluid motion is triggered enhancing the heat exchange ().

Dubbed in terms of Rayleigh and Nusselt numbers, our control question becomes: can we diminish, or minimize, Nusselt for a fixed Rayleigh number?

In recent years, diverse approaches have been proposed to tackle this issue. These can be divided into passive and active control methods. Passive control methods include: acceleration modulation [intro:RB-control-accel-exp, intro:RB-control-accel-compute], oscillating shear flows [intro:RB-control-shear-flows], and oscillating boundary temperatures [intro:RB-control-temperature-oscillations]. Active control methods include: velocity actuators [intro:RB-control-velocity], and perturbations of the thermal boundary layer [intro:RB-control-temp-first, intro:RB-control-temp-experimental, intro:RB-control-temp-robust]. Many of these methods, although ingenious, are not practical due, e.g., to the requirement of a perfect knowledge of the state of the system, or a starting condition close to the conductive state - something difficult to establish [intro:RB-control-temp-experimental]. Another state-of-the-art active method, that is used in this paper as comparison, is based on a linear control acting at the bottom thermal boundary layer [intro:PDRBcontrol]. The main difficulty in controlling RBC resides in its chaotic behavior and chaotic response to controlling actions. In recent years, Reinforcement Learning (RL) algorithms [intro:RL-book] have been proven capable of solving complex control problems, dominating extremely hard challenges in high-dimensional spaces (e.g. board games [intro:RL-example-GO, intro:RL-example-boardgames] and robotics [intro:RL-example-irl-cube]

). Reinforcement Learning is a supervised Machine Learning (ML)

[RLback:ML-book] approach aimed at finding optimal control strategies. This is achieved by successive trial-and-error interactions with a (simulated or real) environment which iteratively improve an initial random control policy. Indeed, this is usually a rather slow process which may take millions of trial-and-error episodes to converge [intro:RL-example-rainbow-atari]. Reinforcement learning has also been used in connection with fluid flows, such as for training smart inertial or self-propelling particles [colabrese2018smart, colabrese2017flow, gustavsson2017finding], for schooling of fishes [gazzola2016learning, verma2018efficient], soaring of birds and gliders in turbulent environments [reddy2016learning, reddy2018glider], optimal navigation in turbulent flows [biferale2019, alageshan2019path], drag reduction by active control in the turbulent regime [intro:drag], and more [muinos2018reinforcement, novati2018deep, tsang2018self, intro:drag-accel, cichos2020machine, rabault2020deep].

In this work, we show that RL methods can be successfully applied for controlling a Rayleigh–Bénard system at fixed Rayleigh number reducing (or suppressing) convective effects. Considering a 2D proof-of-concept setup, we demonstrate that RL can significantly outperform state-of-the-art linear methods [intro:PDRBcontrol] when allowed to apply (small) temperature fluctuations at the bottom plate (see setup in Figure 1). In particular, we target a minimization of the time-averaged Nusselt number (Eq. (5)), aiming at reducing its instantaneous counterpart:

 Nuinstant(t)=⟨uyT⟩x,y−κ∂y⟨T⟩x,yκΔT/H, (6)

where the additional average along the vertical direction, , amends instantaneous fluctuations.

Finding controls fully stabilizing RBC might be, however, at all impossible. For a chaotic system as RBC, this may happen, among others, when delays in controls or observation become comparable with the Lyapunov time. We discuss this topic in the last part of the paper employing Reinforcement Learning to control the Lorenz attractor, a well-known, reduced version of RBC [back:lorenz1963].

The rest of this manuscript is organized as follows. In Section 2, we formalize the Rayleigh–Bénard control problem and the implementation of both linear and RL controls. In Section 3, we present the control results and comment on the induced flow structures. In Section 4 we analyze physical factors that limit the RL control performance. The discussion in Section 5 closes the paper.

## 2 Control-based convection reduction

In this section we provide details of the Rayleigh–Bénard system considered, formalize the control problem and introduce the control methods.

We consider an ideal gas () in a two-dimensional Rayleigh–Bénard system with at an assigned Rayleigh number (cf. sketch and coordinate system in Figure 1). We assume the four cell sides to satisfy a non-slip boundary condition, the lateral sides to be adiabatic, and a uniform temperature, , imposed at the top boundary. We enforce the Rayleigh number by specifying the average temperature,

 TH=⟨TB(x,t)⟩x, (7)

at the bottom boundary (where is the instantaneous temperature at location of the bottom boundary). Temperature fluctuations with respect to such average,

 ^TB(x,t)=TB(x,t)−TH, (8)

are left for control.

We aim at controlling to minimize convective effects, which we quantify via the time-averaged Nusselt number (cf. Eq.(5)). We further constrain the allowed temperature fluctuations to

 |^TB(x,t)|≤C∀x,t, (9)

to prevent extreme and nonphysical temperature gradients (in similar spirit to the experiments in [intro:RB-control-temp-experimental]).

We simulate the flow dynamics through the Lattice-Boltzmann method (LBM) [intro:lbm-book] employing a double lattice, respectively for the velocity and for the temperature populations (with D2Q9 and D2Q4 schemes on a square lattice with sizes ; collisions are resolved via the standard BGK relaxation). We opt for the LBM since it allows for fast, extremely vectorizable, implementations which enables us to perform multiple (up to hundreds) simulations concurrently on a GPU architecture. See Table 1 for relevant simulation parameters; further implementation details are reported in Appendix A.

Starting from a system in a (random) natural convective state (cf. experiments [intro:RB-control-temp-experimental]), controlling actions act continuously. Yet, to limit learning computational costs, is updated with a period, (i.e. control loop), longer than the LBM simulation step and scaling with the convection time, . We report in Table 3 the loop length, which satisfies, approximately, , and the system size, all of which are Ra-dependent. Once more, for computational efficiency reasons, we retain the minimal system size that enables us to quantify the (uncontrolled) Nusselt number within error (Appendix A).

In the next subsections we provide details on the linear and reinforcement-learning based controls.

### 2.1 Linear control

We build our reference control approach via a linear proportional-derivative (PD) controller [intro:PDRBcontrol]. Our control loop prescribes instantaneously the temperature fluctuations at the bottom boundary as

 ^TB(x,t)=R(~TB(x,t))=−R((kP−kD∂t)E(x,t)) (10)

with , being constant parameters and the (signed) distance from the optimal conductive state (), , satisfying

 E(x,t)=⟨uy(x,y,t)⟩y/V0, (11)

where is a reference vertical velocity. To ensure the constraints given by Eqs. (7),(9) we operate a clipping and renormalization operation, , as described in Appendix B.

Various other metrics, , have been proposed leveraging, for instance, on the shadow graph method ( [intro:RB-control-temp-experimental]), and on the mid-line temperature (, with , [intro:RB-control-temp-first]). These metrics provide similar results and, in particular, an average Nusselt number for the controlled systems within . Hence, we opted for Eq. (11) as it proved to be more stable. Note that, by restricting to , one obtains a linear proportional (P) controller. While supposedly less powerful than a PD controller, in our case the two show similar performance. The controller operates with the same space and time discretization of the LBM simulations, with the time derivative of calculated with a first order backwards scheme. We identify the Rayleigh-dependent parameters and via a grid-search algorithm [RLback:optimization-book] for (cf. values in Table 2). In case of higher Ra, due to the chaoticity of the system, we were unable to find parameters consistently reducing the heat flux with respect to the uncontrolled case.

### 2.2 Reinforcement Learning-based control

In a Reinforcement Learning context we have a policy, , that selects a temperature fluctuation, , on the basis of the observed system state. is identified automatically through an optimization process, which aims at maximizing a reward signal. In our case we define the system state, the allowed controlling actions and the reward are as follows:

• The state space, , includes observations of the temperature and velocity fields (i.e. of scalar fields) probed on a regular grid in nodes for the last time steps (i.e. , where is the current time). Note that the probe grid has a coarser resolution than the lattice, i.e. , , which allows to reduce the complexity of the control problem. It holds, therefore, .

• The action space, , includes the temperature profiles for the lower boundary that the controller can instantaneously enforce. To limit the computational complexity, we allow profiles which are piece-wise constant functions on segments (cf. Figure 1). Moreover, each of the function elements, (), can attain only two temperature levels, i.e.

 ~Tk∈{C,−C}. (12)

To enforce the constraint in Eq. (7),(9) we normalize the profile according to Appendix B, generating the final profile . After normalization, the action space includes distinct actions.

• The reward function defines the goal of the control problem. We define the reward, , as the negative instantaneous Nusselt number (cf. Eq. 6) which results from applying a temperature profile at time . In formulas, it holds

 rl+1=−Nuinstant(tl+1). (13)

Note that the RL controller aims at maximizing the reward accumulated over time (rather than the instantaneous reward), which minimizes the average Nusselt number, Eq. (5), as desired.

We employ the Proximal Policy Optimization (PPO) RL algorithm [intro:PPO-algorithm]

, which belongs to the family of Policy Gradient Methods. Starting from a random initial condition, Policy Gradient Methods search iteratively (and probabilistically) for optimal (or sufficient) policies by gradient ascent (based on local estimates of the performance). Specifically, this optimization employs a probability distribution,

, on the action space conditioned to the instantaneous system state. At each step of the control loop, we sample and apply an action according to the distribution . Notably, the sampling operation is essential at training time, to ensure an adequate balance between exploration and exploitation, while at test time, this stochastic approach can be turned deterministic by restricting to the action with highest associated probability. In our case, at test time we used the deterministic approach for and the stochastic approach for , as this allowed for higher performance.

The PPO algorithm is model-free, i.e. it does not need assumptions on the nature of the control problem. Besides, it does not generally require significant hyperparameter tuning, as often happens for RL algorithms (e.g. value based method

[intro:PPO-algorithm]).

When the state vector

is high-dimensional (or even continuous), it is common to parameterize the policy function in probabilistic terms as , for a parameter vector  [intro:RL-example-GO]. This parameterization can be done via different kinds of functions and, currently, neural networks are a popular choice [intro:universal-NN]

. In the simplest case, as used here, the neural network is a multilayer perceptron

[intro:RL-book]

(MLP). An MPL is a fully connected network in which neurons are stacked in layers and information flows in a pre-defined direction from the input to the output neurons via “hidden” neuron layers. The

-th neuron in the -th layer operates returning the value , which satisfies

 h(n+1)i=σ(b(n)i+∑jA(n)ijh(n)j), (14)

where the ’s are the outputs of the neurons of the previous layer (the -th one), which thus undergo an affine transformation via the matrix and the biases

. The non-linear activation function

provides the network with the capability of approximating non-linear functions [intro:universal-NN]. During training, the parameters get updated through back propagation [intro:back-prop] (according to the loss defined by the PPO algorithm) which results in progressive improvements of the policy function.

To increase the computational efficiency, we opt for a policy function factorized as follows

 π(ai|si)=π(~T1,~T2,…,~Tns∣∣si)=ns∏k=1πk(~Tk∣∣si). (15)

In other words, we address the marginal distributions of the local temperature values . We generate the marginals by an MLP (with two hidden layers each with activation) that has final outputs, , returned by sigmoid activation functions, in formulas:

 yk=ϕ(zk)=11+exp(−zk),k=1,2,…,ns (16)

with (see Fig. 2, note that ). The values provide the parameters for Bernoulli distributions that determine, at random, the binary selection between the temperature levels . In formulas, it holds

 πk(~Tk=+C∣∣si)=Bernoulli(p=yk). (17)

The final temperature profile is then determined via Eq. (12) and the normalization in Eq. (23). We refer to Fig. 2 for further details on the network.

## 3 Results

We compare the linear and RB-based control methods on different scenarios with Rayleigh number ranging between (just before the onset of convection) and (mild turbulence, see Table 3). Figure 3 provides a summary of the performance of the methods in terms of the (ensemble) averaged Nusselt number. Until , the RL control and the linear control deliver similar performance. At higher Ra numbers, in which the Rayleigh–Bénard system is more complex and chaotic, RL controls significantly outperform linear methods. This entails an increment of the critical Rayleigh number, which increases from , in the uncontrolled case, to , in case of linear control, and to in case of RL-based controls. Above the critical Rayleigh, RL controls manage a reduction of the Nusselt number which remains approximately constant, for our specific experiment, it holds

 Nuuncontrolled−NuRL≈2.5% for Ra<3⋅106.

In contrast, the linear method scores only

 Nuuncontrolled−Nulinear≈1.5for Ra<106,

while at higher Rayleigh it results completely ineffective.

The reduction of the average Nusselt number is an indicator of the overall suppression -or the reduction- of convective effects, yet it does not provide insightful and quantitative information on the characteristics of the control and of the flow. In Figure 4, we compare the controls in terms of the time histories of the (ensemble-averaged) Nusselt number. We include four different scenarios. For the RL control is able to stabilize the system () while both linear control methods result in periodic orbits [intro:PDRBcontrol]. At , RL controls are also unable to stabilize the system; yet, this does not result in any periodic flows as in the case of linear control. Finally, at we observe a time-varying Nusselt number even using RL-based controls. To better understand the RL control strategy, in Figures 5 and 6, we plot the instantaneous velocity streamlines for the last two scenarios.

For the case (see Figure 5(c)), the RL control steers and stabilizes the system towards a configuration that resembles a double cell. This reduces convective effect by effectively halving the effective Rayleigh number. In particular, we can compute a effective Rayleigh number, , by observing that the double cell structure can be constructed as two vertically stacked Rayleigh–Bénard systems with halved temperature difference and height (i.e., in formulas, and ). It thus results in an effective Rayleigh number satisfying

 Raeff=gαΔT′H′3νκ=116gαΔTH3νκ=116Ra, (18)

which is in line with the observed reduction in Nusselt.

At it appears that the RL control attempts, yet fails, to establish the “double cell” configuration observed at lower Ra (cf. Figure 6(c)). Likely, this is connected to the increased instability of the double cell configuration as Rayleigh increases.

These results were achieved with less than an hour of training time on an Nvidia V100 for the cases with low Rayleigh number (). However, at the optimization took up to 150 hours of computing time (the majority of which is spent in simulations and the minority of which is spent in updating the policy). For further details on the training process of the RL controller see Appendix D.

## 4 Limits to learning and control

In this section we discuss possible limits to the capability of learning automatic controls when aiming at suppressing convection. Our arguments are grounded on general physics concepts and thus apply to other control problems for non-linear/chaotic dynamics.

In Sect. 2, we showed that RL controls are capable of substantially outperforming linear methods in presence of sufficient control complexity (). It remains however unclear how far these controls are from optimality, especially at high Ra. Here we address the physics factors certainly hindering learning and controlling capabilities.

Having sufficient time and spatial resolution on the relevant state of the system is an obvious requirement to allow a successful control. Such resolution however is not defined in absolute terms, rather it connects to the typical scales of the system and, in case of a chaotic behavior, with the Lyapunov time and associated length scale. In our case, at fixed Ra, learnability and controllability connect with the number and density of probes, with the time and space resolution of the control, but also with its “distance” with respect to the bulk of the flow.

The system state is observed via a number of probes in fixed position (see Figure 0(b)). For a typical (buoyant) velocity in the cell, , there is a timescale associated with the delay with which a probe records a sudden change (e.g. creation/change of a plume) in the system. When this timescale becomes comparable or larger than the Lyapunov time, it appears hopeless for any control to learn and disentangle features from the probe readings. In other words, as Ra increases, we expect that a higher and higher number of probes becomes necessary (but not sufficient) in order to learn control policies.

Besides, our choice to control the system via the bottom plate temperature fluctuations entails another typical delay: the time taken to thermal adjustments to propagate inside the cell. In particular, in a cell at rest, this is the diffusion time, . If the delay gets larger or comparable to the Lyapunov time, controlling the system becomes, again, likely impossible.

To illustrate this concept, we rely on a well-known low-dimensional model inspired by RBC: the Lorenz attractor. The attractor state is three-dimensional and its variables (usually denoted by ; see Appendix E) represent the flow angular velocity, the horizontal temperature fluctuations and the vertical temperature fluctuations. We consider a RL control acting on the horizontal temperature fluctuations ( variable) that aims at either minimizing or maximizing the sign changes of the angular velocity (i.e. maximizing or minimizing the frequency of sign changes of the variable). In other words, the control aims at keeping the flow rotation direction maximally consistent or, conversely, at magnifying the rate of velocity inversions. In this simplified context, we can easily quantify the effects of an artificial delay in the control on the overall control performance (Figure 7). Consistently with our previous observations, when the artificial delay approaches the Lyapunov time the control performance significantly degrades.

Notably, in our original 2D RBC control problem (Sect. 2), control limitations are not connected to delays in observability. In fact, as we consider coarser and coarser grids of probes, the control performance does not diminish significantly (cf. Fig. 8; surprisingly, observations via only allow similar control performance to what achieved employing probes). This suggests that other mechanisms than insufficient probing determine the performance, most likely, the excessively long propagation time (in relation to the Lyapunov time) needed by the controlling actions to traverse the cell from the boundary to the bulk. This could be possibly lowered by considering different control and actuation strategies.

## 5 Discussion

In this paper we considered the issue of suppressing convective effects in a 2D Rayleigh–Bénard  cell, by applying small temperature fluctuations at the bottom plate. In our proof of concept comparison, limited to a square cell and fixed Pr, we showed that controls based on Reinforcement Learning (RL) are able to significantly outperform state-of-the-art linear approaches. Specifically, RL is capable of discovering controls stabilizing the Rayleigh–Bénard system up to a critical Rayleigh number that is approximately times larger than achievable by linear controls and times larger than in the uncontrolled case. Secondly, when the purely conductive state could not be achieved, the RL still produces control strategies capable of reducing convection, which are significantly better than linear algorithms. The RL control achieves this by inducing an unstable flow mode, similar to a stacked double-cell, yet not utilized in the context of RBC control.

Actually no guarantee exists on the optimality of the controls found by RL. Similarly it holds for the linear controls, which additionally need vast manual intervention for the identification of the parameters. However, as we showed numerically, theoretical bounds to controllability hold which are regulated by the chaotic nature of the system, i.e. by its Lyapunov exponents, in connection with the (space and time) resolution of the system observations as well as with the actuation capabilities. We quantified such theoretical bounds in terms of delays, in observability and/or in actuation: whenever these become comparable to the Lyapunov time, the control becomes impossible.

There is potential for replication of this work in an actual experimental setting. However, training a controller only via experiments might take an excessively long time to converge. Recent developments in reinforcement learning showed already the possibility of employing controllers partially trained on simulations (transfer learning

[intro:RL-example-irl-cube]). This would not only be a large step for the control of flows, but also for RL where practical/industrial uses are still mostly lacking [intro:RL-book].

## Appendix A Rayleigh–Bénard simulation details

We simulate the Rayleigh–Bénard system via the lattice Boltzmann method (LBM) that we implement in a vectorized way on a GPU. We employ the following methods and schemes

• Velocity population: D2Q9 scheme, afterwards indicated by ;

• Temperature population: D2Q4 scheme;

• Collision model: BGK collision operator [Appendix:BGK, intro:lbm-book];

• Forcing scheme: As seen in [intro:lbm-book] most forcing schemes can be formulated as follows

 fi(x+eiΔt,t+Δt)−fi(x,t)=[Ωi(x,t)+Si(x,t)]Δt (19) ueq=1ρ∑ifiei+AFΔtρ (20)

with and defined by scheme. We choose the scheme by He et al. [Appendix:He-force-model] for its improved accuracy which reads

 A =12 (21) Si =(1−Δt2τ)feqiρei−uc2sF (22)
• Boundary model: bounce-back rule enforcing no-slip boundary conditions [intro:lbm-book].

To limit the training time, we implemented the LBM vectorizing on the simulations. This enabled us to simulate multiple concurrent, fully independent, Rayleigh–Bénard systems within a single process. This eliminates the overhead of having numerous individual processes running on a single GPU which would increase the CPU communication overhead.

When a RL controller selects a temperature profile for the bottom boundary this is endured for number of LBM steps (this defines one environment step, or env step, with length ). The reason for these, so-called, sticky actions is that within one env step the system does not change significantly. Allowing quicker actions would not only be physically meaningless but also possibly detrimental to the performance (this is a known issue when training RL agents for games where actions need to be sustained to make an impact [intro:RL-example-Curiosity-large]). Furthermore, due to our need to investigate the transient behavior, we set the episode length to env steps. In this way, the transient is extinguished within the first env steps. After each episode the system is reset to a random, fully developed, convective RB state.

In dependence on the Rayleigh number (i.e. system size), it takes between millions and billions env steps to converge to a control strategy. To limit the computing time, we consider the smallest possible system that gives a good estimate for the uncontrolled Nusselt number (error within few percent).

In Table 3 we report the considered Rayleigh numbers and related system sizes.

## Appendix B Control amplitude normalization

To limit the amplitude of the temperature fluctuations and ensure their zero-average (see Eq. (7), (9)) we employ the following three normalization steps, indicated by in the manuscript. Let be the temperature fluctuation proposed by either the linear control or the RL-based control, we obtain as

 ~T′B(x,t)=Clip(~TB(x,t),−C,C) (23) ~T′′B(x,t)=~T′B(x,t)−⟨~T′B(x,t)⟩x (24) ^TB(x,t)=~T′′B(x,t)maxx′(1,|~T′′B(x,t)|/C). (25)

Note that the first operation is a clipping of the local temperature fluctuation between , which is necessary only for the linear control case.

## Appendix C RL algorithm implementation and hyperparameters

In this appendix we elaborate on our design choices about the implementation of RL for Rayleigh–Bénard control.

• Discretization of the bottom boundary in 10 sections. A literature study [LAres:RL-continues-action-2, LAres:RL-continues-action] and preliminary experiments have shown that large/continuous action spaces are currently rather challenging for the convergence of RL methods. In our preliminary experiments we observed that discretizing in sections was even less effective that in sections, and that sections were instead insufficient to get the desired performance.

• 3 layer multilayer perceptron (MLP) for state encoding.

We considered this option over a convolutional neural network (CNN) applied on the entire lattice. The latter had significantly longer training times and lower final performance. Besides, we included in the state observed by the MLP the system readings in the

previous env steps, which is known to be beneficial for performance [intro:RL-example-Curiosity-large].

• PPO algorithm.

We considered this option over value-based methods which were however more difficult to operate with due to the need of extensive hyperparameter tuning. Furthermore, we used the open source implementation of PPO included in the stable-baselines python library

[intro:RL-platform-stable-baselines] (note: training PPO demands for a so-called auxiliary value function [intro:PPO-algorithm]. For that we employed a separate neural network having the same structure as the policy function).

### c.1 Hyperparameters

We used the work by Burda et al. [intro:RL-example-Curiosity-large] as a starting point for our choice of the hyperparameters. We specifically considered two separate hyperparameter sets. i targeting final performance over training speed, used for . ii targeting speed over final performance, used only on the highest Rayleigh number case () and for the research on the probe density. Below one can see the PPO hyperparameters used (see [intro:RL-book] and [intro:PPO-algorithm] for further explanations).

• Number of concurrent environments: 512 (set 2: 128)

• Number of roll-out steps: 128

• Number of samples training samples: (set 2: )

• Entropy coefficient : 0.01

• Learning rate :

• Discount factor : 0.99

• Number of mini-batches: 8 (set 2: 16)

• Number of epoch when optimizing the surrogate: 4 (set 2: 32)

• Value function coefficient for the loss calculation: 0.5

• Factor for trade-off of bias vs. variance for Generalized Advantage Estimator

: 0.95

• PPO Clip-range: 0.2

## Appendix D Training curves

We report in Figure 9 the learning curves for our RL control (performance vs. length of the training session). These curves provide information on the time necessary to converge to a strategy and thus are an indication of the difficulty and stability of the process.

## Appendix E Implementation Lorentz Attractor Control

To illustrate our argument that a delay comparable to the Lyapunov time is detrimental to the control performance, we introduce two control problems defined on the Lorentz Attractor (LA). These LA control problems are defined considering the following equations

 ˙x=σ(y−x), (26) ˙y=x(ρ−z)−y+a, (27) ˙z=xy−βz, (28) subject to  |a|≤1 (29)

with being a relatively small controllable parameter, and , and . The control loop and integration loop (via RK4) have the same time stepping . The two control problems are as follows

1. “LA stabilizer”. We aim at minimizing the frequency with which the flow direction changes (i.e. the frequency of sign changes). Reward: if and zero otherwise;

2. “LA oscillator”. Similar to LA stabilizer but with inverse goal. Reward: if and zero otherwise.

We start the system in a random state around the attractor, the controller is an MLP network, and we use the PPO RL algorithm (similarly to our approach for the complete Rayleigh–Bénard case). We furthermore limit the control to three states, , for training speed purposes.

Applying RL on these control problems with no delay results in the behaviors shown in Figure 10. The control develops complex strategies to maximize/minimize the frequency of sign changes of .