1 Introduction
In recent years, concerns about climate change combined with political and social pressures have pushed the world to increase the installed capacity of renewable energy sources (wind, solar, bioenergy and hydro), allowing renewables to account for 28% of global electricity generation in 2020 IEA (April 2020, accessed January 10, 2021). While significant progress has been made in expanding solar and wind resources, tidal energy remains practically untapped. To date, only two successful large Tidal Range Structure (TRS) projects have been built, namely La Rance (France) and Lake Sihwa (South Korea), with 240 MW and 254 MW of installed capacity, respectively Neill et al. (2018).
A review by the UK's ex-minister of energy Hendry (2016) has helped draw attention to tidal lagoons as a competitive choice among renewables. In his report, the construction of "small-scale" tidal lagoons, such as the Swansea Lagoon (our case study), is suggested as a pathfinder project before moving to larger-scale lagoons. The report also emphasises that tidal lagoons have proposed operational lifetimes of 120 years, far surpassing any other renewable energy type and allowing for very low electricity costs for years. As an example, La Rance, which has been in operation for 55 years, took 20 years to amortise its initial investment and now generates energy at a cost competitive with nuclear or offshore wind sources Evans (October 4th, 2019, accessed January 10, 2021); Hendry (2016).
Among the challenges faced in TRS deployment is the optimisation of energy generation through the operation of hydraulic structures (turbines and sluices), which increases the utilisation factor (ratio of actual energy generated to installed capacity). Although the current literature has advanced in increasing the theoretical power generation capabilities of TRS Neill et al. (2018), there is room for improvement, considering that state-of-the-art (flexible operation) optimisation methods (i) can be computationally expensive and (ii) do not perform real-time control, relying instead on accurate tidal prediction techniques. In view of this, we propose the usage of Deep Reinforcement Learning (DRL) methods, more specifically Proximal Policy Optimisation (PPO) through the Unity ML-Agents package Juliani et al. (2018), which enables state-of-the-art energy generation of TRS (on par with the best optimisation routines) through the real-time control of turbines and sluices. DRL was chosen among machine learning techniques due to the nature of our problem, which involves sequential decision-making in a reactive environment (lagoon water levels vary depending on the operation of hydraulic structures) with the goal of maximising expected return (energy), and also because a target optimal operation of the tidal lagoon is not known "a priori" (a requirement for supervised learning techniques). After training, our method shows consistent performance regardless of the test data used, without requiring future tidal predictions or retraining of the DRL agent. To date, this is the first flexible operation optimisation approach in the literature that can maximise TRS energy generation without such constraints. A 0D model of the Swansea Bay Tidal Lagoon is utilised to compare our DRL method with six optimisation baselines devised from the literature Angeloudis et al. (2018); Xue et al. (2019).

After related work (Section 2), this paper is divided into four parts. In the first part, we explain how tidal barrages extract energy from the tides through the lens of classical and variant operation approaches. In the second part, we cover the theory behind DRL, more specifically the PPO algorithm used in this study. In the third part, we cover our agent-environment setup, modelled with the Unity ML-Agents package (Table S1, Supplementary Material). In the final part, an experimental study contrasting our results with six baselines is presented and discussed.
2 Related Work
State-of-the-art optimisation methods for TRS estimate the available energy in such systems by operating the tidal lagoon hydraulic structures through flexible operational strategies (adaptive, according to tidal amplitudes and lagoon water levels for sequential tidal cycles Xue et al. (2019); Angeloudis et al. (2018)). Under the assumption of well-predictable tides, the flexible operation of turbines and sluices can be inferred by "looking ahead" through harmonic or numerical tidal prediction methods Egbert and Ray (2017) and applying the acquired operation to the real, measured ocean, a procedure that needs to be repeated for every new tide. In fact, and to the best of our knowledge, the requirement of accurate future tidal predictions has been the basis for all optimisation routines developed for enabling flexible operation Ahmadian et al. (2017); Xue et al. (2019); Angeloudis et al. (2018); Xue et al. (2020); Harcourt et al. (2019); Neill et al. (2018); Xue et al. (2021). This constraint can be a problem when future tidal predictions are unavailable, unreliable, have an associated validation cost Medina-Lopez et al. (2021) or are regulated by private companies or government agencies.
State-of-the-art optimisation routines in the literature utilise grid search (a brute-force approach), gradient-based and global optimisation methods Xue et al. (2019); Angeloudis et al. (2018); Harcourt et al. (2019) and, more recently, genetic algorithms Xue et al. (2021) to optimise the operation of TRS. As a basis of comparison with our DRL agent, two non-flexible and four flexible state-of-the-art baselines devised from Angeloudis et al. (2018); Xue et al. (2019) are modelled utilising grid search and global optimisation methods.

3 Tidal Power Overview
TRS extract power by artificially inducing a water head difference between the ocean and an impounded area. By allowing water to flow through the hydraulic structures into an artificial impoundment, the incoming tide (flood tide) is confined within the lagoon at a high level (holding stage). Then, during the receding tide (ebb tide), power generation begins once a sufficiently high operational head is established between the basin and the ocean Prandle (1984). Power generation stops when a minimum operational head is reached. A sluicing sequence immediately follows, where idling turbines and sluices allow water to flow in order to increase the lagoon tidal range for the next operation. Following the same procedure, generating energy is also possible during the flood tide, although with reduced efficiency due to turbines usually being ebb-oriented Angeloudis et al. (2018). In the literature, operational strategies that allow power generation during both flood and ebb tides are called "two-way scheme" operation Prandle (1984). Operational modes for controlling TRS in a "two-way scheme" are shown in Fig. 1 and detailed in Table 1.
Table 1: TRS operational modes.

Operational Mode  Description
Ebb Gen:  Power generation during receding tide
Flood Gen:  Power generation during incoming tide
Sluicing:  Operate sluice gates and/or idle turbines
Holding:  Stop operation of all hydraulic structures

Table 2: Classic operation of turbines and sluices per control stage.

Operational Mode  Turbines  Sluices  Power Gen.
Ebb Gen:  On  Off  Yes (if head exceeds the minimum operational head)
Flood Gen:  On  Off  Yes (if head exceeds the minimum operational head)
Sluicing:  On  On  No
Holding:  Off  Off  No
TRS turbines can be operated either to generate energy or to increase flow rates through the barrage during the sluicing stage (idle operation of turbines). Also, a minimum head, with typical values reported by Aggidis and Benzon (2013), is required for the turbine to generate energy. Considering that the holding stage begins automatically when the difference between ocean and lagoon levels is negligible, and that power generation is not possible with head differences below this minimum, the classic operation of tidal lagoons Prandle (1984) is reduced to two control variables per half-tide: the generation starting head and the generation ending head. As seen in Fig. 1, these pairs occur every half-tide period, as the ocean oscillates between its troughs and peaks.

A slight modification of the discussed classical operation allows for opening the sluice gates at the end of the "flood" and "ebb" generation stages Baker (1991); Angeloudis et al. (2018), independently of the generation ending head, with the possibility of increasing power generation (increased lagoon tidal range when starting the next "ebb" or "flood" stage). This variant operation requires three control variables every half-tide: the generation starting head, the generation ending head and the sluice gate starting head. The water level variations within the lagoon, following classic and variant operations of the hydraulic structures, can be seen in Fig. 1. Tables 2 and 3 show all possible combined operations of turbines and sluices, with the resulting power generation, for each control stage in the classic and variant operations, respectively. Both approaches are used in the optimisation routines of our baselines in Section 6.1.
Table 3: Variant operation of turbines and sluices per control stage.

Operational Mode  Turbines  Sluices  Power Gen.
Ebb Gen:  On  Off/On  Yes (if head exceeds the minimum operational head)
Flood Gen:  On  Off/On  Yes (if head exceeds the minimum operational head)
Sluicing:  On  Off/On  No
Holding:  Off  Off  No
3.1 Tidal Lagoon Simulation: 0D Model
In order to estimate the available energy of two-way operational strategies, analytical or numerical models (0D to 3D) can be considered. When the goal is the optimisation of TRS operation for maximising energy generation, 0D models are usually chosen, given their computational efficiency and the fact that, for "small-scale" projects such as the Swansea Bay Tidal Lagoon, 0D models present good agreement with more complex finite-element 2D models Angeloudis et al. (2017, 2018); Xue et al. (2019); Neill et al. (2018); Schnabl et al. (2019). 0D models are derived from conservation of mass:
$\frac{dZ}{dt} = \frac{Q(t)}{A(Z)}$  (1)

where $Z$ is the water level (in metres) inside the lagoon, $Q$ is the total directional water flow rate (m³/s) from both sluices and turbines and $A$ is the variable lagoon surface area (m²). From Eq. (1), the lagoon water level at the following timestep ($Z_{i+1}$) can be calculated by a backward finite difference method:

$Z_{i+1} = Z_i + \frac{Q_i}{A(Z_i)}\,\Delta t$  (2)

where $Z_i$ is the water level at timestep $i$ and $\Delta t$ is the discretised timestep.
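As a concrete illustration, Eq. (2) amounts to a single explicit update per timestep. In the sketch below, the constant lagoon area and the 60 s timestep are placeholder values, not the parameters calibrated for Swansea Bay:

```python
def step_lagoon_level(z_i, q_i, area, dt=60.0):
    """Advance the lagoon water level by one timestep (Eq. 2).

    z_i  : lagoon water level at timestep i (m)
    q_i  : total signed flow rate through turbines and sluices (m^3/s),
           positive when water enters the lagoon
    area : lagoon surface area evaluated at level z_i (m^2)
    dt   : timestep in seconds (60 s is an assumed value for illustration)
    """
    return z_i + q_i * dt / area

# Example: 5,000 m^3/s entering a 10 km^2 lagoon raises the level 3 cm per minute
z_next = step_lagoon_level(z_i=2.0, q_i=5000.0, area=1.0e7, dt=60.0)
```

In the full model, `area` would be looked up from the digitized variable-area curve at each level rather than held constant.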
3.2 Turbine and Sluice Parametrization
From 0D to 2D models, studies show that flow rate and power from turbines can be approximated through the parametrization of experimental results Falconer et al. (2009); Aggidis and Benzon (2013); Aggidis and Feather (2012). For this work, the equations describing flow and power for low-head bulb turbines were based on experimental results from Andritz Hydro Aggidis and Feather (2012). The edited Andritz chart shown in Fig. 2 demonstrates how turbine unit speed $n_{11}$ and specific unit discharge $Q_{11}$ (obtained experimentally) are related. The graph also shows wicket gate and runner blade openings (in degrees) and iso-efficiency curves.
By specifying the turbine parameters (diameter $D$, number of generating poles $p$ and grid frequency $f$), the turbine rotation speed $n$ (rpm) is obtained from $n = 120 f / p$. Furthermore, the unit speed $n_{11}$, turbine flow rate $Q$ and power output $P$ are calculated as:

$n_{11} = \frac{n D}{\sqrt{H}}$  (3)

$Q = Q_{11} D^2 \sqrt{H}$  (4)

$P = \rho g Q H \eta \eta_o$  (5)

where $H$ is the head difference between ocean and lagoon, $\rho$ the seawater density, $g$ the gravitational acceleration, $\eta$ the turbine efficiency taken from the iso-efficiency curves of Fig. 2 and $\eta_o$ the product of the other efficiencies shown in Table 4.
Table 4: Additional TRS efficiencies.

TRS Efficiencies  (%)
Generator  97
Transformer  99.5
Water friction  95
Gear box/drive train  97.2
Turbine availability  95
Turbine orientation (Flood Gen. only) Angeloudis et al. (2018)  90
When $H$ is available, $n_{11}$ is estimated directly from Eq. (3). For calculating $Q$ and $P$, $Q_{11}$ and $\eta$ are obtained experimentally by adjusting the opening of the wicket gates, the pitch angle of the runner blades and cross-referencing these values with the obtained $n_{11}$ (see Fig. 2). In order to choose appropriate values for $Q_{11}$ and $\eta$, a parameterized curve of maximum power output was drawn over Fig. 2 (blue line) by following the path along which the power-related product of the charted quantities is maximised. If we assume the wicket gates and runner blades are automatically adjusted to always lie on the maximum power output curve, then $Q_{11}$ and $\eta$ become functions of $n_{11}$, as shown in Eqs. (6) and (7):

$Q_{11} = f_Q(n_{11})$  (6)

and

$\eta = f_\eta(n_{11})$  (7)

where $f_Q$ and $f_\eta$ denote the parameterised fits along the maximum power output curve.
For simulating sluice gates, the barrage model utilises the orifice equation, so that the flow rate is a function of the head difference $H$ Prandle (1984); Baker (1991):

$Q_s = C_d A_s \sqrt{2 g H}$  (8)

where $C_d$ is the discharge coefficient for sluices (equal to one in this study, following Angeloudis et al. (2018)) and $A_s$ the sluice gate area.
When generating energy, turbines use Eq. (4) to estimate the flow rate through the barrage. On the other hand, when operating in "idling" mode, turbines use Eq. (8), with the discharge coefficient value adopted from Angeloudis et al. (2018).
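A minimal sketch of the sluicing/idling flow computation from Eq. (8). The 800 m² sluice area is a placeholder, and the sign convention (flow follows the sign of the head difference) is an assumption of this sketch:

```python
import math

def orifice_flow(head, area, cd=1.0, g=9.81):
    """Flow rate (m^3/s) through sluice gates, or idling turbines, from the
    orifice equation (Eq. 8).

    head : ocean-lagoon head difference (m); the returned flow carries its sign
    area : gate opening area (m^2)
    cd   : discharge coefficient (1.0 for sluices in this study)
    """
    return math.copysign(cd * area * math.sqrt(2.0 * g * abs(head)), head)

# Example: a 4 m head across a hypothetical 800 m^2 sluice opening
q = orifice_flow(head=4.0, area=800.0)
```

For idling turbines, the same function would be called with the turbine's effective area and the discharge coefficient reported by Angeloudis et al. (2018) in place of `cd=1.0`.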
When starting or stopping either turbines or sluices, the literature has used sinusoidal ramp functions to simulate the smooth transition of flow output over a fixed ramp period (comparable to the time taken to open or close the hydraulic structures) Angeloudis et al. (2018, 2017, 2016), measured from the time when the current operation was triggered. Since this is a heuristic method, a simpler transition function (named "momentum ramp") is proposed in this work:

$Q_{i+1} = Q_i + m\,(Q_c - Q_i)$  (9)

where $Q_{i+1}$ is the estimated flow rate at the next timestep, $Q_c$ is the total flow rate calculated from the turbine and sluice equations (Eqs. (4) and (8)), $m$ is a dimensionless hyperparameter that controls the intensity of the flow rate update per timestep ($0 < m \leq 1$) and $Q_i$ is the flow rate at timestep $i$. The "momentum ramp" is applied every timestep during simulation. This not only simplifies the code, but also facilitates training, since sluice opening is treated as a continuous control problem (Section 5.2). In this work, $m$ is fixed so that the flow rate converges to its target within the transition times reported for hydraulic structures.
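A sketch of a momentum-style ramp update consistent with the definitions around Eq. (9); both the exact update form and the value m = 0.1 are assumptions of this sketch:

```python
def momentum_ramp(q_i, q_target, m=0.1):
    """One 'momentum ramp' update: the flow rate moves a fraction m of the
    remaining gap towards the target flow computed from the turbine/sluice
    equations. m (0 < m <= 1) is a dimensionless hyperparameter; 0.1 is
    purely illustrative."""
    return q_i + m * (q_target - q_i)

# Repeated application converges geometrically to the target flow,
# e.g. one update per simulated timestep while the gates open:
q = 0.0
for _ in range(60):
    q = momentum_ramp(q, q_target=5000.0, m=0.1)
```

With m = 1 the flow jumps to its target in a single step; smaller m values spread the transition over more timesteps, mimicking the opening/closing time of the structures.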
4 Reinforcement Learning Overview
As shown in the work of Sutton and Barto (2018), a reinforcement learning (RL) problem can be mathematically formalised as a Markov Decision Process (MDP). In an MDP, an agent interacts with an environment through actions ($A_t$), and these actions lead to new environmental states ($S_t$) and possible rewards ($R_t$) for the agent. The quantities $S_t$, $A_t$ and $R_t$ are random variables with well-defined probability distributions.
By sampling multiple timesteps $t$, observations of the agent-environment interaction are organised as a sequence of state-action, next-reward triples:

$S_0, A_0, R_1,\; S_1, A_1, R_2,\; S_2, A_2, R_3, \dots$  (10)

where $s_t$, $a_t$ and $r_t$ are instances of the random variables $S_t$, $A_t$ and $R_t$. The sequence of state-action pairs defines a trajectory $\tau$:

$\tau = (s_0, a_0, s_1, a_1, s_2, a_2, \dots)$  (11)
Also, in an MDP, the probabilities of $S_t$ and $R_t$ are completely conditioned on the preceding state and action ($S_{t-1}$ and $A_{t-1}$), that is:

$p(s', r \mid s, a) = \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$  (12)

The probability distribution of Eq. (12) defines the dynamics of the MDP. It can also be manipulated to yield the state-transition probability distribution (which is just the sum of probabilities over all possible future rewards):

$p(s' \mid s, a) = \sum_{r} p(s', r \mid s, a)$  (13)
For estimating Eq. (13) for a given state, we also need to condition on an action. In non-deterministic scenarios, the selection of possible actions by the agent is a stochastic process, defined by a conditional probability distribution (known as the policy) of the form:

$\pi(a \mid s) = \Pr\{A_t = a \mid S_t = s\}$  (14)

Using Eqs. (13) and (14), the probability distribution of starting in a state $s$ and ending in $s'$, given a policy, can be estimated as:

$p_\pi(s' \mid s) = \sum_{a} \pi(a \mid s)\, p(s' \mid s, a)$  (15)
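Eqs. (12)-(15) can be checked numerically on a toy MDP. In the sketch below, the two-state dynamics and the uniform policy are invented purely for illustration:

```python
# Joint dynamics of a tiny MDP: p[(s, a)][(s_next, r)] = probability (Eq. 12)
p = {
    ("low", "hold"): {("low", 0.0): 0.7, ("high", 0.0): 0.3},
    ("low", "fill"): {("high", 1.0): 0.9, ("low", 0.0): 0.1},
}

def transition_prob(p, s, a, s_next):
    """State-transition probability p(s'|s,a), obtained by summing the
    joint dynamics over all possible rewards (Eq. 13)."""
    return sum(prob for (sn, _r), prob in p[(s, a)].items() if sn == s_next)

def policy_transition_prob(p, policy, s, s_next):
    """p_pi(s'|s): marginalise Eq. (13) over the policy pi(a|s) (Eq. 15)."""
    return sum(pa * transition_prob(p, s, a, s_next)
               for a, pa in policy[s].items())

# A uniform policy over the two actions available in state "low" (Eq. 14)
policy = {"low": {"hold": 0.5, "fill": 0.5}}
```

Because each inner dictionary sums to one, the marginalised transition probabilities automatically sum to one as well.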
With some defined policy, we can sum the observed rewards for each state-action pair (as shown in Eq. (10)) and calculate a total return at timestep $t$:

$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$  (16)

where $\gamma$ is a discount factor between 0 and 1.

The objective of reinforcement learning problems is to find an optimal policy $\pi^*$ that maximises the expected return of rewards conditioned on any initial state, i.e.

$\pi^* = \arg\max_{\pi} \mathbb{E}_\pi\left[G_t \mid S_t = s\right]$  (17)
4.1 Proximal Policy Optimisation (PPO)
In this work, the process of finding an optimal policy is achieved through Proximal Policy Optimisation (PPO) Schulman et al. (2017). PPO has been shown to outperform several other "on-policy" gradient methods Schulman et al. (2017) and is one of the preferred methods for control optimisation when the cost of acquiring new data is low Abbeel (2019b).
Unlike approaches that infer the policy through state-value or action-value functions (e.g. Deep Q-Network Mnih et al. (2013)), PPO uses an "on-policy" approach that maximises the expected sum of rewards by improving its current policy, smoothly shifting the probability density function estimate of the policy towards the optimal policy $\pi^*$.

The PPO algorithm is an updated form of Policy Gradients. Like TRPO (Trust Region Policy Optimisation) Schulman et al. (2015a), it tries to increase sample efficiency (reusing data from previous policies) while constraining gradient steps to a trust region. It is also actor-critic, since it utilises an estimate of the state-value function for its baseline Abbeel (2017). An overview of Policy Gradients and PPO is presented below.
4.1.1 Policy Gradients
Policy gradient methods rely on the fact that a stochastic policy can be parameterized by an "actor" neural network with weights $\theta$ (the resulting policy is denoted $\pi_\theta$ going forward). As represented in Fig. 3, this neural network receives a vector state representation of $s_t$. For the case of discrete actions, the neural network outputs the probabilities of each possible action in that state using a softmax layer. For continuous actions, each node in the last layer outputs the moments of a multivariate Gaussian distribution of the form Poupart (2018); Bøhn et al.:

$\pi_\theta(a \mid s) = \mathcal{N}\big(a;\, \mu_\theta(s), \Sigma_\theta(s)\big)$  (18)
where $\mu_\theta(s)$ and $\Sigma_\theta(s)$ are the parameterized mean and covariance matrix, respectively. While training, actions are randomly sampled from this distribution to favour exploration. During testing, $\mu_\theta(s)$ is taken as the optimum action for each input state $s$.

Considering a trajectory $\tau$, the expected return of following a parameterized policy is $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$, where

$R(\tau) = \sum_{t=0}^{T} r(s_t, a_t)$  (19)

and $r(s_t, a_t)$ is the reward from taking action $a_t$ from state $s_t$. We also note that $R(\tau)$ represents the undiscounted return following a sampled trajectory for a time horizon $T$. With these considerations, finding an optimal policy can be viewed as tuning $\theta$ to maximise $J(\theta)$, i.e. to perform gradient ascent on $J$:

$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$  (20)
A sample-based estimate for $\nabla_\theta J(\theta)$ assumes the form:

$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)\, \hat{A}_t^{(i)}$  (21)

where $N$ is the number of trajectories sampled from the "actor" neural network. For vanilla policy gradient methods, following a trajectory $\tau$, we get

$\hat{A}_t = G_t - V_\phi(s_t)$  (22)

$G_t = \sum_{t'=t}^{T} \gamma^{\,t'-t}\, r(s_{t'}, a_{t'})$  (23)

where $V_\phi(s_t)$, parameterized by a "critic" neural network, is the estimate of the value function of being in state $s_t$ and following policy $\pi_\theta$ thereafter; $G_t$ is the discounted future return of following the chosen trajectory from time $t$; and $\hat{A}_t$ is the advantage estimate of taking this trajectory with respect to the current estimate of $V_\phi$. A complete derivation of Eq. (21) can be seen in Abbeel (2019a).
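The quantities in Eqs. (22) and (23) can be computed in a single backward pass over a sampled trajectory. The sketch below stands in for the critic with a plain list of value estimates:

```python
def discounted_returns(rewards, gamma=0.99):
    """Discounted future return G_t for every timestep of a sampled
    trajectory (Eq. 23), accumulated backwards in one pass."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantages(rewards, values, gamma=0.99):
    """Monte-Carlo advantage estimate A_t = G_t - V(s_t) (Eq. 22),
    with `values` standing in for the critic network's outputs."""
    return [g - v for g, v in zip(discounted_returns(rewards, gamma), values)]
```

The backward accumulation avoids the quadratic cost of re-summing the tail of the reward sequence for every timestep.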
4.1.2 Clipped Surrogate Loss derivation for PPO
In order to increase sampling efficiency Schulman (August, 2017), importance sampling can be used to rewrite the gradient term in Eq. (21) as:

$\nabla_\theta J(\theta) = \mathbb{E}_t\left[\frac{\nabla_\theta \pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, \hat{A}_t\right]$  (24)

The importance sampling form of Eq. (24) allows for reutilising samples from an older policy $\pi_{\theta_{old}}$ to perform gradient ascent steps when refining a new policy. It is obtained by differentiating the surrogate loss:

$L(\theta) = \mathbb{E}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, \hat{A}_t\right] = \mathbb{E}_t\left[r_t(\theta)\, \hat{A}_t\right]$  (25)

where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{old}}(a_t \mid s_t)$ is a probability ratio.
While in TRPO Schulman et al. (2015a) the maximisation of the surrogate loss from Eq. (25) is subjected to a Kullback-Leibler divergence constraint, in PPO Schulman et al. (2017) the surrogate loss is constrained through a clipping procedure, yielding the clipped surrogate loss objective:

$L^{CLIP}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\, \hat{A}_t,\; \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_t\big)\Big]$  (26)

where $\epsilon$ is a hyperparameter that limits large policy updates.
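The per-sample clipping of Eq. (26) can be written directly; eps = 0.2 below is the commonly reported default from Schulman et al. (2017), used here only for illustration:

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate objective (Eq. 26):
    min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, pushing the ratio beyond 1 + eps yields no extra objective, so the gradient of further policy change vanishes; with a negative advantage, the min keeps the pessimistic (more negative) term, discouraging overly large updates in either direction.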
To further reduce variance when estimating the advantage, Schulman et al. (2017); Abbeel (2017) utilise a truncated version of generalized advantage estimation Schulman et al. (2015b), where $\hat{A}_t$ is estimated as

$\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma\lambda)^l\, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$  (27)

where $t$ is a time index within the sampled trajectory time horizon $T$, $\lambda$ is a hyperparameter that performs the exponential weighted average of k-step estimators of the returns Schulman et al. (2015b), and $\delta_t$ is the one-step temporal-difference error.
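The truncated estimator of Eq. (27) is usually evaluated with the backward recursion A_t = delta_t + gamma * lambda * A_{t+1}; a sketch under assumed gamma and lambda values:

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Truncated generalized advantage estimation (Eq. 27).

    rewards : r_0 ... r_{T-1} for one sampled trajectory
    values  : critic estimates V(s_0) ... V(s_T) (one more entry than rewards)
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    """
    adv = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```

Setting lam = 0 recovers the one-step TD error (low variance, high bias), while lam = 1 recovers the Monte-Carlo advantage of Eq. (22) (high variance, low bias).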
When utilising shared parameters for the "actor" and "critic" neural networks (as is the case in this work), the loss function needs to be augmented with a value function error term Schulman et al. (2017). To ensure exploration, an entropy term $S$ is also added. Finally, the loss function to be maximised at each iteration becomes:

$L_t^{CLIP+VF+S}(\theta) = \mathbb{E}_t\Big[L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t)\Big]$  (28)

For this study, the Unity ML-Agents package fixes $c_1$ Pierre (2020), $c_2$ is a hyperparameter controlling the entropy bonus magnitude, and $L_t^{VF}$ is a clipped, squared-error loss between the estimate of the state-value function and the actual return value obtained when following a trajectory Schulman et al. (2017). The implementation of $L_t^{VF}$ in Unity ML-Agents is described in Pierre (2020).
Additionally, parallel training can also be implemented as a way of substituting experience replay by running the policy on multiple instances of the environment. By guaranteeing that each environment starts in a random initial state during training, this parallelism helps decorrelate the sampled data, stabilising learning Mnih et al. (2016).
5 AgentEnvironment Setup
5.1 Unity MLAgents
The Unity3D graphics engine is a popular game development environment that has been used to create games and simulations in 2D and 3D since its debut in 2005. It has also received widespread adoption in other areas, such as architecture, engineering and construction Juliani et al. (2018).
Unity ML-Agents is an open-source project that allows for designing environments where a smart agent can learn through interactions Juliani et al. (2018); Pierre (2020). It has been chosen in this project due to its ease of implementation, built-in PPO algorithm and visual framework for observing real-time control of TRS.

5.2 Agent-Environment Modelling and Training
By creating simple representative 3D models for turbines, sluices, ocean and lagoon, a training environment for TRS is created in Unity3D. In this environment, the equations for simulating flow through the lagoon, when operating sluices and turbines, are extracted from the 0D model representation detailed in Section 3.2. In order to choose appropriate parameters for operating our environment, we follow the literature representations suggested by Angeloudis et al. (2018, 2017); Xue et al. (2019) for the Swansea Bay Tidal Lagoon project. The chosen parameters are shown in Table 5. A variable lagoon surface area, digitized from Xue et al. (2019), is also utilised.
Table 5: Swansea Bay Tidal Lagoon parameters adopted in this work.

Number of turbines
Number of sluices
Grid frequency
Turbine diameter
Sluice area
For ease of visualisation, the 3D representations of sluices and turbines change colour depending on the operational mode chosen by the agent. For the turbine, green represents power generation mode, orange idling mode and black offline mode (zero flow rate). Similarly, sluices change colour between orange and black for sluicing and offline modes, respectively. Fig. 4 shows a capture of the Unity3D environment representation for the Swansea Bay Tidal Lagoon during ebb generation, with the representative models for sluices and turbines in offline and power generation modes, respectively. Ocean and lagoon surface level motion are also represented.
For stabilising and speeding up training, parallel training is performed with 64 copies of the environment (Fig. 5), while episodes are set to one month of duration. During training, each environment instance requires a representative ocean input at the location where the Swansea Lagoon is planned to be constructed. Ideally, ocean measurements could be used as training data. However, due to the lack of sufficient measured data (Section 6.1), it is not possible to train the agent until reasonable performance is reached. Instead, an artificial tide signal simulating the ocean is created by summing the major sinusoidal tide constituents (due to the gravitational pull of the Moon and Sun). Although we are not accounting for other, less predictable local wave motions (e.g. wind waves), the artificial ocean input representation is sufficient for enabling the agent to converge to an optimal policy. A major advantage of this approach is that we can generate any amount of input data required for training the agent.
The tide constituent amplitudes of the simulated ocean utilised in this work (Table 6) were obtained from a numerical simulation at the location of the Swansea Bay Tidal Lagoon by Angeloudis et al. (2018). The periods for each constituent were obtained from Wolanski and Elliott (2015). The final equation for simulating the ocean is given in Eq. (29).
Table 6: Ocean tide constituents (amplitudes and periods).

$\eta_{ocean}(t) = \sum_{j} A_j \cos(\omega_j t + \phi_j)$  (29)
Table 7: Input states provided to the agent.

States (at current and previous timesteps)  Units
Ocean water level  metres (float)
Lagoon water level  metres (float)
Number of online turbines  0 or 16 (integer)
Number of idling turbines  0 or 16 (integer)
Sluice gate opening area  0 to 1 (float)
where $A_j$ are the constituent amplitudes, $\omega_j$ are the angular frequencies (rad/s) of each tidal constituent and $\phi_j$ are random phase lags in the range $[0, 2\pi]$, generated for each environment instance when starting an episode during parallel training, which allows for learning more generalised scenarios.
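A sketch of an artificial tide generator in the spirit of Eq. (29). The two constituents and their amplitudes/periods below are placeholders; the actual Swansea Bay values come from Table 6 and are not reproduced here:

```python
import math
import random

# Placeholder constituents (name, amplitude in m, period in hours).
CONSTITUENTS = [("M2", 3.0, 12.42), ("S2", 1.0, 12.00)]

def artificial_tide(t_seconds, phases):
    """Ocean elevation (m) as a sum of sinusoidal tide constituents (Eq. 29),
    with one phase lag per constituent."""
    eta = 0.0
    for (_name, amp, period_hr), phi in zip(CONSTITUENTS, phases):
        omega = 2.0 * math.pi / (period_hr * 3600.0)  # rad/s
        eta += amp * math.cos(omega * t_seconds + phi)
    return eta

# One random phase per constituent, drawn per environment instance per episode
phases = [random.uniform(0.0, 2.0 * math.pi) for _ in CONSTITUENTS]
elevation = artificial_tide(0.0, phases)
```

Randomising the phases per episode changes how the constituents interfere, exposing the agent to a wide variety of spring-neap-like patterns from a fixed set of amplitudes.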
In each environment, the agent acts as an operator responsible for controlling turbine and sluice operational modes through policy network node outputs, according to a vector of input states (water levels of ocean and lagoon, plus the current operational mode of turbines and sluices, for the current and previous timesteps). The reward received by the agent is collected per timestep (1 minute) and equals the energy generated by the turbines. The input states are shown in Table 7.
Policy network node outputs can be discrete or continuous. In this work, continuous outputs are chosen, reducing the number of nodes in the last layer and, consequently, the complexity of the neural network. Three node outputs determine turbine and sluice operation at a fixed control interval of the simulation, selected so as to match the time usually associated with the opening/closing of hydraulic structures Angeloudis et al. (2018, 2016).
Each node in the last layer outputs a bounded continuous value, and the resulting actions are computed in a hierarchical fashion. The first node determines the number of turbines set to power generation mode (0 or 16), depending on whether the node output is below or above a threshold: if the node outputs a value below the threshold, no turbines generate energy; otherwise, all 16 turbines are set to power generation mode. Therefore, if no turbines are set to power generation, all 16 are available for other operational modes (idling or offline). The second node selects the number of idling turbines in the same manner as the first node, provided all turbines are available; otherwise the number of idling turbines is zero, independent of this node output. If no turbine is selected for power generation or idling, all turbines are set offline. The first two nodes therefore control the turbines through discrete actions.
The third and final node outputs the opening area of the sluice gates. Since any value within the node's output range can be chosen by the neural network, the momentum ramp function (Section 3.2) is applied to the resulting flow rate every timestep, ensuring smooth flow rate transitions independently of the sluice opening area set by the agent.
Beyond reducing the number of node outputs, this configuration also makes the sluice operation independent of the turbine operation. All possible operational modes for turbines and sluices as a function of the node outputs are shown in Tables 8 and 9, respectively. For reproducibility, Table S2 in the Supplementary Material showcases the hyperparameters utilised during training.
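The hierarchical decoding of the three continuous outputs described above can be sketched as follows; the [-1, 1] output range and the zero threshold are assumptions of this sketch, with only the 16-turbine count taken from the text:

```python
N_TURBINES = 16  # Swansea Bay configuration used in this work

def decode_actions(node1, node2, node3, threshold=0.0):
    """Hierarchically decode the three continuous policy outputs into
    turbine/sluice commands (Section 5.2).

    Returns (generating_turbines, idling_turbines, sluice_open_fraction).
    """
    if node1 > threshold:          # node 1: all-or-nothing power generation
        generating, idling = N_TURBINES, 0
    elif node2 > threshold:        # node 2: idling, only if none are generating
        generating, idling = 0, N_TURBINES
    else:                          # otherwise every turbine is set offline
        generating, idling = 0, 0
    # node 3: fraction of the total sluice area to open, mapped from [-1, 1]
    sluice_fraction = (node3 + 1.0) / 2.0
    return generating, idling, sluice_fraction
```

In the full environment, the returned sluice fraction feeds the orifice equation through the momentum ramp, so the agent's continuous choice translates into a smooth flow rate transition.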
The agentenvironment optimisation problem is solved through the PPO algorithm (Section 4.1). After training, the policy neural network receives input states and outputs optimum values, following a policy that maximises energy generation. During testing, this means that the agent receives real ocean measurements as input, performing realtime flexible control of turbines and sluices.
Table 8: Discrete turbine control as a function of the first two node outputs.

Node 1  Node 2  Discrete Turbine Control
Below threshold  Below threshold  Offline Mode
Below threshold  Above threshold  Idling Mode
Above threshold  Below threshold  Power Generation Mode
Above threshold  Above threshold  Power Generation Mode

Table 9: Continuous sluice control as a function of the third node output.

Node 3  Continuous Sluice Control
Non-zero opening  Sluicing Mode (available sluice area set by node output)
Zero opening  Offline Mode (available sluice area = 0)
6 Experiments
In this section, we compare the performance of our DRL-trained agent against state-of-the-art optimisation routines. Code for reproducibility can be made available upon request to the corresponding author.
6.1 Test Data and Baseline Optimisation
For comparing our DRL agent's performance against conventional optimisation routines, we model six baselines devised from the recent literature Angeloudis et al. (2018); Xue et al. (2019) and compare the energy generated in a month for each method. All baselines in this work consider the operation of the Swansea Bay tidal lagoon through either the classic or the variant "two-way scheme" methods, as detailed in Section 3.
Regarding test data for the baselines and the trained agent, we utilise all tide gauge ocean measurements available from the British Oceanographic Data Centre (BODC) at Mumbles Station BODC (accessed January 10, 2021), located at the edge of Swansea Bay. The obtained measurements of ocean elevation are recorded at fixed intervals over a multi-year range. Before utilising the data, a preprocessing step is performed so that data flagged as "improbable", "null value" and "interpolated" by BODC are not considered. After this step, we retain 26 months of usable, non-overlapping test data. The preprocessing step ensures a conservative comparison between the baselines and our trained agent, since it considers scenarios where tidal predictions had a good match with measured data.

Tidal predictions for the same 26 months are also provided by BODC in the same dataset. For each month, the baseline optimisation routines utilise tidal predictions for capturing the operational head values (generation starting head, generation ending head and, when considered, sluice gate starting head) that optimise power generation. These operational head values are then applied to the measured ocean test data, so that comparisons between the baselines and the trained agent can be made. The baselines, in increasing order of optimisation complexity, are described next:

CH (Constant Heads): the best constant generation starting and ending heads are picked for extracting energy during a whole month Ahmadian et al. (2017).

CHV (Constant Heads, with variant operation): the best constant generation starting, generation ending and sluice gate starting heads are picked for extracting energy during a whole month.

EHT (Every Half-Tide): optimised pairs of generation starting and ending heads are picked for every consecutive half-tide. Proposed by Xue et al. (2019).

EHTV (Every Half-Tide, with variant operation): optimised generation starting, generation ending and sluice gate starting heads are picked for every consecutive half-tide.

EHN (Every Half-Tide and Next): optimised generation starting and ending heads are picked for every half-tide, considering the best heads for the next half-tide as well. Proposed by Xue et al. (2019).

EHNV (Every Half-Tide and Next, with variant operation): optimised generation starting, generation ending and sluice gate starting heads are picked for every half-tide, considering the best heads for the next half-tide as well.
All variant optimisation methods are augmented through the addition of independent sluice head operation. This modification is inspired by the work of Baker (1991); Angeloudis et al. (2018). CH and CHV perform non-flexible operation, while EHT, EHTV, EHN and EHNV perform state-of-the-art flexible operation. A summary detailing each baseline's operational heads and optimisation method is shown in Table 10.
Table 10: Operational heads and optimisation category of each baseline (Constant Head: CH, CHV; Every Half-Tide: EHT, EHTV; Every Half-Tide and Next: EHN, EHNV).

Generation starting head:  all baselines
Generation ending head:  all baselines
Sluice gate starting head:  CHV, EHTV, EHNV
Non-flexible operation:  CH, CHV
Flexible operation:  EHT, EHTV, EHN, EHNV
All baselines except EHNV are optimised with a grid search algorithm that iteratively increases its search resolution until convergence. The search starts at a coarse resolution in metres, with the optimisation heads (generation starting, generation ending and, when considered, sluice gate starting) constrained within predefined ranges. After the first run, the search resolution is halved and the algorithm performs a brute-force search around the best previous configuration attained. The latter procedure is repeated until the final search resolution falls below a set tolerance.
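The iterative refinement described above can be sketched as follows; the 1 m initial resolution, the 0.05 m stopping tolerance and the one-dimensional test objective are illustrative choices, not the paper's exact settings:

```python
import itertools

def refine_grid_search(objective, bounds, res=1.0, final_res=0.05):
    """Iteratively refined grid search (Section 6.1 baselines): brute-force
    over a coarse grid of operational heads, then halve the resolution and
    search around the best previous configuration until `res` < `final_res`.

    objective : callable mapping a head tuple to the energy to maximise
    bounds    : list of (lo, hi) ranges, one per operational head
    """
    best_x, best_val = None, float("-inf")
    centre = [(lo + hi) / 2.0 for lo, hi in bounds]
    span = [(hi - lo) / 2.0 for lo, hi in bounds]
    while res >= final_res:
        axes = []
        for c, s, (lo, hi) in zip(centre, span, bounds):
            lo_i, hi_i = max(lo, c - s), min(hi, c + s)
            n = max(1, int((hi_i - lo_i) / res)) + 1
            axes.append([lo_i + k * res for k in range(n)])
        for x in itertools.product(*axes):        # brute-force sweep
            val = objective(x)
            if val > best_val:
                best_x, best_val = x, val
        centre = list(best_x)                      # recentre on the best point
        span = [2.0 * res] * len(bounds)           # shrink the search window
        res /= 2.0                                 # halve the resolution
    return best_x, best_val
```

For a smooth single-peak objective the refinement converges quickly, but like the basin-hopping alternative used for EHNV, it carries no global-optimality guarantee for rugged objectives.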
EHNV requires a different optimisation approach due to its high computational time when utilising the previous grid search method. For this case, we utilise the stochastic global optimisation algorithm basin-hopping Wales and Doye (1997) from the SciPy package Virtanen et al. (2020), with COBYLA as the local minimizer Powell (1994). Basin-hopping was chosen for its efficiency when solving smooth function problems with several local minima separated by large barriers SciPy v0.14.0 Reference Guide (2015). The local minimizer COBYLA is a derivative-free, nonlinear constrained optimisation method that uses a linear approximation approach. Even though basin-hopping is not guaranteed to converge to a global optimum, EHNV is shown to be, on average, the best baseline method for energy generation.
6.2 Agent Performance Evaluation
Following hyperparameter tuning, we trained the agent for steps, until convergence. The cumulative reward (energy) per month (episode) during parallel training, averaged over the 64 instances of the lagoon environment, is shown in Fig. 6. The log-scale inset highlights the two-step plateau observed while converging to an optimal strategy. After starting with a fully random strategy, the cumulative reward received by the agent increases until reaching an intermediate plateau at around steps, where the agent learns the strategy of operating mostly the turbines, while keeping the sluices practically offline during ebb generation. Then, after about steps, the cumulative reward starts increasing again. The second plateau stabilises around steps, with a cumulative reward approximately 25% higher than the first plateau – a gain enabled by (i) a flexible operational strategy learnt by the agent, which adjusts TRS operation according to the tidal range, and (ii) the smart usage of the sluicing mode, as discussed below in the test results. Videos showcasing the strategic operation developed for both plateaus are available in the Supplementary Material.
For test data, we utilise 26 months of real ocean measurements from BODC. These months are presented and numbered in Table S3, while Table S4 (Supplementary Material) compares the amount of energy obtained in the numbered months between our trained agent (performing real-time flexible control) and the baselines. The averaged monthly energy attained by all methods is shown in Fig. 10.
For the baselines, CH and CHV present the worst performance, since constant operational heads cannot account for the varying ocean amplitudes within a month in our test set. Furthermore, baselines with variant operation output more energy on average than their classical counterparts. Finally, the “half-tide and next” approaches showed very small improvements over the “half-tide” methods, while requiring much greater computational time (Tables S5 and S6, Supplementary Material).
For the trained agent, Fig. 7 shows operational test results of power generation and lagoon water levels for one month of measured ocean data (starting with the initial lagoon water level at mean sea level). We note that the agent quickly converges to an optimal energy generation strategy for sequential tidal cycles, independent of the tidal range input – a characteristic of state-of-the-art flexible operation Xue et al. (2019). Furthermore, Figs. 8 and 9 showcase detailed results of real-time control on test data. Apart from ocean water levels, results are coloured according to the actions taken by the agent for turbines and sluices, respectively, as defined in Section 5.2. More specifically, Figs. 7(a) and 8(a) show lagoon water level variations, while Figs. 7(c), 7(b) and 8(b) show power generation and turbine and sluice flow rates. From the sequence of actions taken, we see that the agent arrives at a policy with independent operation of the sluices, i.e. the variant operation of TRS, which was shown to be a better strategy than the classical operation in our baseline comparison. A summary of our method’s accomplishments in comparison with the state-of-the-art baselines is shown in Table 11.
Table 11: Summary of our method’s accomplishments compared with the state-of-the-art baselines. CH and CHV follow Ahmadian et al. (2017); EHT/EHTV and EHN/EHNV follow Xue et al. (2019); the DRL Agent is our work.

| | CH | CHV¹ | EHT | EHTV¹ | EHN | EHNV¹ | DRL Agent |
|---|---|---|---|---|---|---|---|
| real-time flexible control | | | | | | | ✓ |
| prediction-free approach | | | | | | | ✓ |
| variant lagoon operation | | ✓ | | ✓ | | ✓ | ✓ |
| state-of-the-art performance² | | | ✓ | ✓ | ✓ | ✓ | ✓ |

¹ Optimisation routines with variant operation of tidal lagoons; augmentation inspired by Angeloudis et al. (2018, 2016).
² Equivalent outputs, on average, within the error bars (Fig. 10).
Our agent managed very competitive energy outputs, staying on average within a small margin of the best baseline (EHNV). Indeed, for all months tested, our agent performed near-optimally, outputting better results than the state-of-the-art EHN method for 22 out of 26 months, and staying within a small margin in the worst cases. We note that this novel result was obtained by training the agent only once with a simple artificial ocean input, in contrast with the baselines, which require future tidal predictions and must be rerun for every new tide.
7 Conclusions
In this work, we have shown that proximal policy optimisation (a DRL method) can be used for real-time flexible control of tidal range structures after training with artificially generated tide signals, yielding results competitive with state-of-the-art optimisation methods drawn from the literature.
We chose the Swansea Bay Tidal Lagoon for our analysis, given its status as a pathfinder project for larger tidal lagoon projects. We show that our novel approach obtains optimal energy generation from measured tidal data through an optimised control policy for turbines and sluices. Our method shows promising advancements over state-of-the-art optimisation approaches since it (i) performs real-time flexible control with equivalent energy generation, (ii) does not require future tidal predictions, and (iii) needs to be trained only once.
Owing to these characteristic features, the method introduced here should be broadly applicable for optimal energy generation of TRS in cases where future tidal predictions are unreliable or unavailable.
Acknowledgments
We would like to thank the Brazilian agencies Coordenação de Aperfeiçoamento de Pessoal de Ensino Superior (CAPES) and Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) for providing funding for this research.
References
 Lecture 4a: policy gradients, Deep RL Bootcamp Berkeley. External Links: Link Cited by: §4.1.2, §4.1.
 CS 287 lecture 18 – RL I: policy gradients, UC Berkeley EECS. External Links: Link Cited by: §4.1.1.
 Lecture 19: off-policy, model-free RL, UC Berkeley EECS. External Links: Link Cited by: §4.1.
 Operational optimisation of a tidal barrage across the Mersey estuary using 0D modelling. Ocean Engineering 66, pp. 69–81. Cited by: §3.2, Table 4, §3.
 Tidal range turbines and generation on the Solway Firth. Renewable Energy 43, pp. 9–17. Cited by: Figure 2, §3.2.
 Optimisation of tidal range schemes. In Proceedings of the 12th European Wave and Tidal Energy Conference, pp. 1059. Cited by: §2, item 1, Table 11.
 Numerical model simulations for optimisation of tidal lagoon schemes. Applied Energy 165, pp. 522–536. Cited by: §3.2, §5.2, Table 11.
 Optimising tidal range power plant operation. Applied energy 212, pp. 680–690. Cited by: §1, §2, §2, §3.1, §3.2, §3.2, §3.2, Table 4, §3, §3, §5.2, §5.2, §5.2, §6.1, §6.1, Table 11.
 Comparison of 0d, 1d and 2d model capabilities for tidal range energy resource assessments. EarthArXiv (2017). Cited by: §3.1, §3.2, §5.2.
 Tidal power. Institution of Engineering and Technology (1711). Cited by: §3.2, §3, §6.1.
 Download UK tide gauge network data from BODC. External Links: Link Cited by: §6.1.
 Deep reinforcement learning attitude control of fixed-wing UAVs using proximal policy optimization. In 2019 International Conference on Unmanned Aircraft Systems (ICUAS), pp. 523–533. Cited by: §4.1.1.
 Tidal prediction. Journal of Marine Research 75 (3), pp. 189–237. Cited by: §2.
 La Rance: learning from the world’s oldest tidal project. External Links: Link Cited by: §1.
 The Severn barrage and other tidal energy options: hydrodynamic and power output modeling. Science in China Series E: Technological Sciences 52 (11), pp. 3413–3424. Cited by: §3.2.
 Utilising the flexible generation potential of tidal range power plants to optimise economic value. Applied Energy 237, pp. 873–884. Cited by: §2, §2.
 The role of tidal lagoons. Final Report, pp. 67–75. Cited by: §1.
 Global energy review 2020. External Links: Link Cited by: §1.
 Unity: a general platform for intelligent agents. arXiv preprint arXiv:1809.02627 (2018). Cited by: §1, §5.1, §5.1.
 Satellite data for the offshore renewable energy sector: synergies and innovation opportunities. arXiv preprint arXiv:2103.00872 (2021). Cited by: §2.
 Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §4.1.2.
 Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013). Cited by: §4.1.
 Tidal range energy resource and optimization – past perspectives and future challenges. Renewable Energy 127, pp. 763–778. Cited by: §1, §1, §2, §3.1.
 Unity-Technologies/ml-agents. GitHub. Note: https://github.com/Unity-Technologies/ml-agents/blob/master/ml-agents/mlagents/trainers/ppo/optimizer_torch.py Cited by: §4.1.2, §5.1.
 CS885 lecture 7a: policy gradient, university of waterloo. External Links: Link Cited by: §4.1.1.
 A direct search optimization method that models the objective and constraint functions by linear interpolation. In Advances in Optimization and Numerical Analysis (Proceedings of the 6th Workshop on Optimization and Numerical Analysis), pp. 51–67. Cited by: §6.1.
 Simple theory for designing tidal power schemes. Advances in Water Resources 7 (1), pp. 21–27. Cited by: §3.2, §3, §3.
 Implementation of tidal stream turbines and tidal barrage structures in DGSWEM. In International Conference on Offshore Mechanics and Arctic Engineering, Vol. 58899, pp. V010T09A005. Cited by: §3.1.
 Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897. Cited by: §4.1.2, §4.1.
 High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438 (2015). Cited by: §4.1.2, §4.1.2.
 Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017). Cited by: §4.1.2, §4.1.2, §4.1.2, §4.1.2, §4.1.
 Deep RL Bootcamp lecture 5: natural policy gradients, TRPO, PPO, Berkeley. External Links: Link Cited by: §4.1.2.
 Reinforcement learning: an introduction. MIT press. Cited by: §4.
 scipy.optimize.basinhopping. External Links: Link Cited by: §6.1.
 SciPy 1.0: fundamental algorithms for scientific computing in python. Nature Methods 17 (3), pp. 261–272. Cited by: §6.1.
 Global optimization by basin-hopping and the lowest energy structures of Lennard-Jones clusters containing up to 110 atoms. The Journal of Physical Chemistry A 101 (28), pp. 5111–5116. Cited by: §6.1.
 Estuarine ecohydrology: an introduction. Section 2.1: The tides at sea. Elsevier. Cited by: §5.2.
 Optimising the operation of tidal range schemes. Energies 12 (15), pp. 2870. Cited by: §1, §2, §2, §3.1, §5.2, item 3, item 5, §6.1, §6.2, Table 11.
 Design of tidal range energy generation schemes using a genetic algorithm model. Applied Energy 286, pp. 116506. Cited by: §2, §2.
 Genetic algorithm in tidal range schemes’ optimisation. Energy, pp. 117496. Cited by: §2.
Supplementary Material
To be made available after paper acceptance.