Prediction-Free, Real-Time Flexible Control of Tidal Lagoons through Proximal Policy Optimisation: A Case Study for the Swansea Lagoon

by   Túlio Marcondes Moreira, et al.

Tidal range structures have been considered for large-scale electricity generation because of their potential to produce reasonably predictable energy without emitting greenhouse gases. Since the main forcing components driving the tides have deterministic dynamics, the available energy in a given tidal power plant has been estimated, through analytical and numerical optimisation routines, as a mostly predictable event. This assumption forces state-of-the-art flexible operation methods to rely on tidal predictions (concurrent with measured data and up to a multiple of half-tidal cycles into the future) to infer the best operational strategies for tidal lagoons, with the additional cost of running optimisation routines for every new tide. In this paper, we propose a novel optimised operation of tidal lagoons with proximal policy optimisation through Unity ML-Agents. We compare this technique with six operation optimisation approaches (baselines) devised from the literature, using the Swansea Bay Tidal Lagoon as a case study. We show that our approach successfully maximises energy generation through an optimised operational policy of turbines and sluices, yielding results competitive with state-of-the-art optimisation methods regardless of the test data used, while requiring training only once and performing real-time flexible control with measured ocean data only.



1 Introduction

In recent years, concerns about climate change combined with political and social pressures have pushed the world to increase the installed capacity of renewable energy sources (wind, solar, bioenergy and hydro), allowing renewable energy to account for 28% of global energy generation in 2020 IEA (April 2020, accessed January 10, 2021). While significant progress has been made in expanding solar and wind resources, tidal energy remains practically untapped. As of today, only two successful large Tidal Range Structure (TRS) projects have been built, namely, La Rance (France) and Lake Sihwa (South Korea), with 240 MW and 254 MW of installed capacity, respectively Neill et al. (2018).

A review by the UK’s former minister of energy, Hendry (2016), has helped to draw attention to tidal lagoons as a competitive choice among renewables. In his report, the construction of “small-scale” tidal lagoons, such as the Swansea Lagoon (our case study), is suggested as a pathfinder project before moving to larger-scale lagoons. The report also emphasises that tidal lagoons have proposed operational lifetimes of 120 years, far surpassing any other renewable energy type and allowing for very low electricity costs once the investment is recovered. As an example, La Rance, which has been in operation for 55 years, took 20 years to amortise the initial investment and now generates energy at a cost competitive with nuclear or offshore wind sources Evans (October 4th, 2019, accessed January 10, 2021); Hendry (2016).

Among the challenges faced in TRS deployment is the optimisation of energy generation through the operation of hydraulic structures (turbines and sluices), which increases the utilisation factor (ratio of actual energy generated to installed capacity). Although the current literature has advanced in increasing the theoretical power generation capabilities of TRS Neill et al. (2018), there is room for improvement, considering that state-of-the-art (flexible operation) optimisation methods (i) can be computationally expensive and (ii) do not perform real-time control, relying instead on accurate tidal prediction techniques. In view of this, we propose the use of Deep Reinforcement Learning (DRL) methods, more specifically Proximal Policy Optimisation (PPO) through the Unity ML-Agents package Juliani et al. (2018), which enables state-of-the-art energy generation of TRS (on par with the best optimisation routines) through the real-time control of turbines and sluices. DRL was chosen among machine learning techniques due to the nature of our problem, which involves sequential decision-making in a reactive environment (lagoon water levels vary depending on the operation of hydraulic structures) with the goal of maximising expected return (energy), and also because a target optimal operation of the tidal lagoon is not known “a priori”, which is a requirement for supervised learning techniques. After training, our method shows consistent performance regardless of the test data used, and requires neither future tidal predictions nor re-training of the DRL agent. To date, this is the first flexible operation optimisation approach in the literature that can maximise TRS energy generation without such constraints. A 0D model of the Swansea Bay Tidal Lagoon is utilised to compare our DRL method with six optimisation baselines devised from the literature Angeloudis et al. (2018); Xue et al. (2019).

After related work (Section 2), this paper is divided into four parts. In the first part, we explain how tidal barrages extract energy from the tides through the lens of classical and variant operation approaches. In the second part, we cover the theory behind DRL, focusing on the PPO algorithm used in this study. In the third part, we cover our agent-environment setup, modelled with the Unity ML-Agents package (Table S1, Supplementary Material). In the final part, an experimental study contrasting our results with six baselines is presented and discussed.

2 Related Work

State-of-the-art optimisation methods for TRS estimate the available energy in such systems by operating the tidal lagoon hydraulic structures through flexible operational strategies (adaptive, according to tidal amplitudes and lagoon water levels for sequential tidal cycles Xue et al. (2019); Angeloudis et al. (2018)). Assuming that tides are well predictable, flexible operation of turbines and sluices can be inferred by “looking ahead” through harmonic or numerical tidal prediction methods Egbert and Ray (2017) and applying the acquired operation to the real, measured ocean, a procedure that needs to be repeated for every new tide. In fact, to the best of our knowledge, the requirement of accurate future tidal predictions has been the basis for all optimisation routines developed for enabling flexible operation Ahmadian et al. (2017); Xue et al. (2019); Angeloudis et al. (2018); Xue et al. (2020); Harcourt et al. (2019); Neill et al. (2018); Xue et al. (2021). This constraint can be a problem when future tidal predictions are unavailable, unreliable, have some associated validation cost Medina-Lopez et al. (2021) or are regulated by private companies or government agencies.

State-of-the-art optimisation routines in the literature utilise grid search (a brute-force approach), gradient-based or global optimisation Xue et al. (2019); Angeloudis et al. (2018); Harcourt et al. (2019) and, more recently, genetic algorithm methods Xue et al. (2021) to optimise the operation of TRS. As a basis of comparison with our DRL agent, two non-flexible and four flexible state-of-the-art baselines devised from Angeloudis et al. (2018); Xue et al. (2019) are modelled utilising grid search and global optimisation methods.

3 Tidal Power Overview

TRS extract power by artificially inducing a water head difference between the ocean and an impounded area. By allowing water to flow through the hydraulic structures into an artificial impoundment, the incoming tide (flood tide) is confined within the lagoon at a high level (holding stage). Then, during the receding tide (ebb tide), power generation begins when a sufficiently high operational head (the starting head) is established between the basin and the ocean Prandle (1984). Power generation stops when a minimum operational head is reached. A sluicing sequence immediately follows, in which idling turbines and sluices allow water to flow in order to increase the lagoon tidal range for the next operation. Following the same procedure, energy can also be generated during the flood tide, although with reduced efficiency because turbines are usually ebb-oriented Angeloudis et al. (2018). In the literature, operational strategies that allow power generation during both flood and ebb tides are called “two-way scheme” operation Prandle (1984). Operational modes for controlling TRS in a “two-way scheme” are shown in Fig. 1 and detailed in Table 1.

Figure 1: Classic and variant “two-way scheme” operation. Ocean level is represented by the blue line, while the lagoon level is shown in either green dashed lines or red, for classic or variant lagoon operations, respectively.

| Operational Mode | Description |
|---|---|
| Ebb Gen | Power generation during receding tide |
| Flood Gen | Power generation during incoming tide |
| Sluicing | Operate sluice gates and/or idle turbines |
| Holding | Stop operation of all hydraulic structures |

Table 1: TRS control stages.

| Operational Mode | Turbines | Sluices | Power Gen. |
|---|---|---|---|
| Ebb Gen | On | Off | Yes (if the head exceeds the minimum turbine head) |
| Flood Gen | On | Off | Yes (if the head exceeds the minimum turbine head) |
| Sluicing | On | On | No |
| Holding | Off | Off | No |

Table 2: Classic hydraulic structures operation.

TRS turbines can be operated either to generate energy or to increase flow rates through the barrage during the sluicing stage (idle operation of turbines). Also, a minimum head, whose usual range is reported by Aggidis and Benzon (2013), is required for the turbine to generate energy. Considering that the holding stage begins automatically when the difference between ocean and lagoon levels is negligible, and that power generation is not possible with head differences below this minimum turbine head, the classic operation of tidal lagoons Prandle (1984) is reduced to two variables: the starting head and the minimum operational head. As seen in Fig. 1, one pair of these heads occurs every half-tide period, as the ocean oscillates between its valleys and peaks.

A slight modification of the classical operation allows the sluice gates to be opened at the end of the “flood” and “ebb” generation stages Baker (1991); Angeloudis et al. (2018), independently of the turbine operational heads, with the possibility of increasing power generation (through an increased lagoon tidal range when starting the next “ebb” or “flood” stage). This variant operation requires three control variables every half-tide: the starting head, the minimum operational head and the sluice gate starting head. The water level variations within the lagoon, following classic and variant operations of hydraulic structures, can be seen in Fig. 1. Tables 2 and 3 show all possible combined operations of turbines and sluices, with the resulting power generation, for each control stage in classic and variant operations, respectively. Classic and variant approaches to operating TRS are used in the optimisation routines of our baselines in Section 6.1.

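| Operational Mode | Turbines | Sluices | Power Gen. |
|---|---|---|---|
| Ebb Gen | On | Off/On | Yes (if the head exceeds the minimum turbine head) |
| Flood Gen | On | Off/On | Yes (if the head exceeds the minimum turbine head) |
| Sluicing | On | Off/On | No |
| Holding | Off | Off | No |

Table 3: Variant hydraulic structures operation.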

3.1 Tidal Lagoon Simulation - 0D Model

In order to estimate the available energy of two-way operational strategies, analytical or numerical models (0D to 3D) can be considered. When the goal is the optimisation of TRS operation for maximising energy generation, 0D models are usually chosen, given their computational efficiency and the fact that, for “small-scale” projects such as the Swansea Bay Tidal Lagoon, 0D models present good agreement with more complex finite-element 2D models Angeloudis et al. (2017, 2018); Xue et al. (2019); Neill et al. (2018); Schnabl et al. (2019). 0D models are derived from conservation of mass:

$$\frac{d\eta}{dt} = \frac{Q}{A(\eta)} \tag{1}$$

where $\eta$ is the water level (in metres) inside the lagoon, $Q$ is the total directional water flow rate (m$^3$/s) from both sluices and turbines and $A(\eta)$ is the variable lagoon area (m$^2$). From Eq. (1), the lagoon water level at the following time-step ($\eta_{i+1}$) can be calculated by a backward finite difference method:

$$\eta_{i+1} = \eta_i + \frac{Q_i}{A(\eta_i)}\,\Delta t \tag{2}$$

where $\eta_i$ is the water level at time-step $i$ and $\Delta t$ is the discretised time step.
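
As a minimal sketch of the 0D update described above, the following Python snippet advances the lagoon level by one finite-difference step. The function names, the constant-area helper and all numerical values are illustrative assumptions, not values from this study; in practice the area would come from a digitised level-area curve.

```python
def step_lagoon_level(level_m, flow_m3s, area_fn, dt_s):
    """Advance the lagoon water level by one time step (0D mass balance).

    flow_m3s is the signed total flow through turbines and sluices
    (positive values fill the lagoon); area_fn maps level to surface area.
    """
    return level_m + flow_m3s * dt_s / area_fn(level_m)


def constant_area(level_m):
    # Hypothetical flat-bottomed lagoon with a fixed surface area (m^2).
    return 1.0e7


# Ten one-minute steps with a constant inflow of 5000 m^3/s (illustrative).
level = 0.0
for _ in range(10):
    level = step_lagoon_level(level, 5000.0, constant_area, 60.0)
```

Passing the area as a function keeps the update unchanged when a level-dependent surface area replaces the constant used here.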

3.2 Turbine and Sluice Parametrization

From 0D to 2D models, studies show that flow rate and power from turbines can be approximated with the parametrisation of experimental results Falconer et al. (2009); Aggidis and Benzon (2013); Aggidis and Feather (2012). For this work, the equations describing flow and power for low-head bulb turbines were based on experimental results from Andritz Hydro Aggidis and Feather (2012). The edited Andritz chart shown in Fig. 2 demonstrates how the turbine unit speed $n_{11}$ and specific unit discharge $Q_{11}$ (obtained experimentally) are related. The graph also shows wicket gate and runner blade openings ($\alpha$ and $\beta$, in degrees), and iso-efficiency curves $\eta_h$.

Figure 2: Edited Andritz chart for a double-regulated turbine (varying $\alpha$ and $\beta$ angles), adapted from Aggidis and Feather (2012).

By specifying the parameters of the turbine (diameter $D$, number of generating poles $G_p$ and grid frequency $f$), the turbine rotation $S_p$ (rpm) is obtained from $S_p = 120 f / G_p$. Furthermore, the unit speed $n_{11}$, turbine flow rate $Q_t$ and power output $P$ are calculated as:

$$n_{11} = \frac{S_p D}{\sqrt{H}} \tag{3}$$

$$Q_t = Q_{11} D^2 \sqrt{H} \tag{4}$$

$$P = \rho\, g\, Q_t\, H\, \eta_h\, \eta_o \tag{5}$$

where $H$ is the head difference between ocean and lagoon, $\rho$ the seawater density (kg/m$^3$), $g$ the gravitational acceleration (m/s$^2$) and $\eta_o$ the product of the other efficiencies shown in Table 4.

| TRS efficiency | Value (%) |
|---|---|
| Generator | 97 |
| Transformer | 99.5 |
| Water friction | 95 |
| Gear box/drive train | 97.2 |
| Turbine availability | 95 |
| Turbine orientation (Flood Gen. only) Angeloudis et al. (2018) | 90 |

Table 4: Other efficiency considerations for TRS Aggidis and Benzon (2013).
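
The turbine calculation can be sketched in Python as follows. The combined efficiency is the plain product of the Table 4 entries; the speed, unit speed, flow and power functions assume the standard hill-chart relations (synchronous speed 120 f / poles, unit speed S·D/√H, flow Q11·D²·√H and power ρ·g·Q·H·η). The density and gravity constants, the function names and the numbers in the usage note are assumptions made for illustration only.

```python
import math

RHO = 1025.0  # seawater density in kg/m^3 (assumed value)
G = 9.81      # gravitational acceleration in m/s^2 (assumed value)

# "Other" efficiencies from Table 4, expressed as fractions.
OTHER_EFFICIENCIES = {
    "generator": 0.97,
    "transformer": 0.995,
    "water_friction": 0.95,
    "gear_box_drive_train": 0.972,
    "turbine_availability": 0.95,
}
FLOOD_ORIENTATION = 0.90  # applied during flood generation only


def combined_efficiency(flood_generation=False):
    """Product of the Table 4 efficiencies (with the flood penalty if set)."""
    eta = math.prod(OTHER_EFFICIENCIES.values())
    return eta * FLOOD_ORIENTATION if flood_generation else eta


def turbine_speed_rpm(grid_freq_hz, n_poles):
    """Synchronous rotation speed of the generator in rpm."""
    return 120.0 * grid_freq_hz / n_poles


def unit_speed(speed_rpm, diameter_m, head_m):
    """Unit speed used to enter the hill chart."""
    return speed_rpm * diameter_m / math.sqrt(head_m)


def turbine_flow(q11, diameter_m, head_m):
    """Turbine flow rate (m^3/s) from the specific unit discharge q11."""
    return q11 * diameter_m ** 2 * math.sqrt(head_m)


def turbine_power_w(flow_m3s, head_m, eta_h, eta_o):
    """Electrical power (W) for a given flow, head and efficiencies."""
    return RHO * G * flow_m3s * head_m * eta_h * eta_o
```

With these definitions, `combined_efficiency()` evaluates to roughly 0.847 for ebb generation and roughly 0.762 for flood generation; the hill-chart lookups for `q11` and `eta_h` are left to the caller.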

When the head $H$ is available, the unit speed $n_{11}$ is estimated directly from Eq. (3). For calculating the turbine flow rate and power output, the specific unit discharge $Q_{11}$ and the hydraulic efficiency $\eta_h$ are obtained experimentally by adjusting the opening of the wicket gates ($\alpha$) and the pitch angle of the runner blades ($\beta$), and crossing the values with the obtained $n_{11}$ (see Fig. 2). In order to choose appropriate values for $\alpha$ and $\beta$, a parameterised curve of maximum power output was drawn over Fig. 2 (blue line) by following the path along which the resulting power output is maximised. If we assume $\alpha$ and $\beta$ are automatically adjusted to always lie on the maximum power output curve, then $Q_{11}$ and $\eta_h$ become functions of $n_{11}$, as shown in Eqs. (6) and (7).




For simulating sluice gates, the barrage model utilises the orifice equation, so that the flow rate $Q_s$ is a function of the head difference $H$ Prandle (1984); Baker (1991):

$$Q_s = C_d A_s \sqrt{2 g H} \tag{8}$$

where $C_d$ is the discharge coefficient for sluices (equal to one in this study, following Angeloudis et al. (2018)), $A_s$ the sluice area and $g$ the gravitational acceleration.

When generating energy, turbines use Eq. (4) for estimating the flow rate through the barrage. On the other hand, when operating in “idling” mode, turbines use Eq. (8), with the discharge coefficient adopted from Angeloudis et al. (2018).
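
The orifice relation for sluices can be sketched in Python as below. The sign convention (flow takes the sign of the head difference), the function name and the example values are assumptions made for illustration; the unit discharge coefficient follows the value stated for sluices in the text.

```python
import math


def sluice_flow(head_m, sluice_area_m2, discharge_coeff=1.0, g=9.81):
    """Signed orifice flow (m^3/s) through a sluice of the given area.

    The magnitude follows Cd * A * sqrt(2 * g * |H|); the sign follows the
    head difference so that the flow reverses with the tide.
    """
    magnitude = discharge_coeff * sluice_area_m2 * math.sqrt(2.0 * g * abs(head_m))
    return math.copysign(magnitude, head_m) if head_m != 0 else 0.0


# Illustrative call: 2 m of head across a hypothetical 100 m^2 opening.
example_flow = sluice_flow(2.0, 100.0)
```

The same function could be reused for idling turbines by passing the turbine flow area and the corresponding discharge coefficient.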

When starting or stopping either turbines or sluices, the literature has used sinusoidal ramp functions for simulating the smooth transition of flow output as Angeloudis et al. (2018, 2017), where , (around to Angeloudis et al. (2018, 2016)), and

is the time when the current operation was triggered. Since this is a heuristic method, a simpler transition function (named “momentum ramp”) is proposed in this work:


is the estimated flow rate at the next time-step, is the total flow rate calculated from turbines and sluice equations (Eq. (4, 8)),

a dimensionless hyperparameter that controls the intensity of flow rate update per time-step,

is the flow rate at time-step “” and .

The “momentum ramp” is applied every time-step during simulation. This not only simplifies the code, but facilitates training, since sluice opening is treated as a continuous control problem (Section 5.2). In this work we set , which guarantees a precision of for a time interval with .

4 Reinforcement Learning Overview

As shown in the work of Sutton and Barto (2018), a reinforcement learning (RL) problem can be mathematically formalised as a Markov Decision Process (MDP). In an MDP, an agent interacts with an environment through actions ($A_t$), and these actions lead to new environmental states ($S_{t+1}$) and possible rewards ($R_{t+1}$) for the agent. The quantities ($S_t$, $A_t$ and $R_t$) are random variables with well-defined probability distributions.

By sampling multiple time-steps $t$, observations of the agent-environment interaction are organised as a sequence of state-action, next-reward triples:

$$(s_0, a_0, r_1),\ (s_1, a_1, r_2),\ \ldots,\ (s_t, a_t, r_{t+1}),\ \ldots \tag{10}$$

where $s_t$, $a_t$ and $r_t$ are instances of the random variables ($S_t$, $A_t$ and $R_t$). The sequence of state-action pairs defines a trajectory $\tau$:

$$\tau = (s_0, a_0, s_1, a_1, \ldots) \tag{11}$$

Also, in an MDP, the probabilities of $S_t$ and $R_t$ are completely conditioned on the preceding state and action ($S_{t-1}$ and $A_{t-1}$), that is:

$$p(s', r \mid s, a) = \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\} \tag{12}$$

The probability distribution of Eq. (12) defines the dynamics of the MDP. It can also be manipulated to yield the state-action-transition probability distribution (which is just the sum of probabilities over all possible future rewards):

$$p(s' \mid s, a) = \sum_{r} p(s', r \mid s, a) \tag{13}$$

For estimating Eq. (13) for a given state, we also need to condition on an action. In non-deterministic scenarios, the selection of possible actions by the agent is a stochastic process, defined by a conditional probability distribution (known as the policy) of the form:

$$\pi(a \mid s) = \Pr\{A_t = a \mid S_t = s\} \tag{14}$$

Using Eqs. (13) and (14), the probability of starting in a state $s$ and ending in $s'$, given a policy, can be estimated as:

$$p_\pi(s' \mid s) = \sum_{a} \pi(a \mid s)\, p(s' \mid s, a) \tag{15}$$

With some defined policy, we can sum the observed rewards for each state-action pair (as shown in Eq. (10)) and calculate a total return $G_t$ at time-step $t$:

$$G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \tag{16}$$

where $\gamma$ is a discount factor between 0 and 1.
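
As a small worked example of the discounted return, the following Python sketch accumulates a finite list of rewards backwards in time. The function name and the reward values are illustrative assumptions; a finite episode stands in for the infinite sum.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of a finite reward sequence, starting at its first entry.

    rewards[k] is the reward received k steps after the current time-step,
    so the result equals sum(gamma**k * rewards[k]).
    """
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25
example_return = discounted_return([1.0, 1.0, 1.0], 0.5)
```

The backward recursion avoids computing powers of the discount factor explicitly and mirrors how returns are usually accumulated over a recorded episode.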

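The objective of reinforcement learning problems is to find an optimal policy $\pi^*$ that maximises the expected return $G_t$ (the discounted sum of future rewards) conditioned on any initial state $s$, i.e.

$$\pi^* = \arg\max_{\pi}\ \mathbb{E}_{\pi}\left[G_t \mid S_t = s\right] \quad \text{for all } s \tag{17}$$
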
4.1 Proximal Policy Optimisation (PPO)

In this work, the process of finding an optimal policy has been achieved through Proximal Policy Optimisation (PPO) Schulman et al. (2017). PPO was shown to outperform several other “on-policy” gradient methods Schulman et al. (2017) and is one of the preferred methods for control optimisation when the cost of acquiring new data is low Abbeel (2019b).

Unlike approaches that try to infer the policy through state-value or action-value functions (e.g. Deep Q-Network) Mnih et al. (2013), PPO uses an “on-policy” approach that maximises the expected sum of rewards by improving its current policy, smoothly shifting the probability density function estimate of the policy towards the optimal policy $\pi^*$.

The PPO algorithm is an updated form of Policy Gradients. Like TRPO (Trust Region Policy Optimisation) Schulman et al. (2015a), it tries to increase sample efficiency (re-using data from previous policies) while constraining gradient steps to a trust region. It is also an actor-critic method, since it utilises an estimate of the state-value function as its baseline Abbeel (2017). An overview of Policy Gradients and PPO is presented below.

4.1.1 Policy Gradients

Policy gradient methods rely on the fact that a stochastic policy can be parameterised by an “actor” neural network with weights $\theta$, written $\pi_\theta(a \mid s)$ (simplified as $\pi_\theta$ going forward). As represented in Fig. 3, this neural network receives a vector state representation $s$. For the case of discrete actions, the neural network outputs the probabilities of each possible action in that state using a softmax layer. For continuous actions, the nodes in the last layer output the moments of a multivariate Gaussian distribution of the form Poupart (2018); Bøhn et al.:

$$\pi_\theta(a \mid s) = \mathcal{N}\big(\mu_\theta(s),\ \Sigma_\theta(s)\big) \tag{18}$$

where $\mu_\theta(s)$ and $\Sigma_\theta(s)$ are the parameterised mean and covariance matrix, respectively. While training, actions are randomly sampled from the distribution to favour exploration. During testing, $\mu_\theta(s)$ is taken as the optimum action for each input state $s$.
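
The difference between training-time sampling and test-time action selection can be sketched as below. This is an illustrative Python snippet that assumes a diagonal covariance (independent standard deviations per action); the function name and the numbers are hypothetical, and the mean and standard deviation would in practice be produced by the policy network.

```python
import random


def select_action(mean, std, training, rng=random):
    """Pick a continuous action from a diagonal Gaussian policy.

    During training each component is sampled to favour exploration;
    during testing the mean is returned as the deterministic action.
    """
    if not training:
        return list(mean)
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


# Hypothetical three-component policy output.
policy_mean = [0.4, -0.7, 0.1]
policy_std = [0.3, 0.3, 0.3]

explore = select_action(policy_mean, policy_std, training=True,
                        rng=random.Random(0))
exploit = select_action(policy_mean, policy_std, training=False)
```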

Figure 3: Input-output representation of policy network.

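Considering a trajectory $\tau$, the expected return of following a parameterised policy $\pi_\theta$ is $U(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$, where

$$R(\tau) = \sum_{t=0}^{H} r(s_t, a_t)$$

and $r(s_t, a_t)$ is the reward from taking action $a_t$ from state $s_t$. We also note that $R(\tau)$ represents the undiscounted return following a sampled trajectory for a time horizon $H$. With these considerations, finding an optimal policy can be viewed as tuning $\theta$ to maximise $U(\theta)$, i.e. performing gradient ascent on $U(\theta)$ with a step size $\alpha$:

$$\theta \leftarrow \theta + \alpha \nabla_\theta U(\theta)$$

A sample-based estimate for $\nabla_\theta U(\theta)$ assumes the form:

$$\hat{g} = \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)\, \hat{A}_t^{(i)} \tag{21}$$

where $m$ is the number of sampled trajectories from the “actor” neural network. For vanilla policy gradient methods, following a trajectory $\tau$, we get

$$\hat{A}_t = \hat{R}_t - V(s_t), \qquad \hat{R}_t = \sum_{k=t}^{H} \gamma^{k-t}\, r(s_k, a_k)$$

where $V(s_t)$, parameterised by a “critic” neural network, is the estimate of the value function of being in state $s_t$ and following policy $\pi_\theta$ thereafter; $\hat{R}_t$ is the discounted future return of following the chosen trajectory from time $t$; and $\hat{A}_t$ is the advantage estimate of taking this trajectory with respect to the current estimate of $V(s_t)$. A complete derivation of $\hat{g}$ can be seen in Abbeel (2019a).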

4.1.2 Clipped Surrogate Loss derivation for PPO

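In order to increase sampling efficiency Schulman (August, 2017), importance sampling can be used to rewrite the gradient term in Eq. (21) as:

$$\hat{g} = \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H} \frac{\nabla_\theta \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)}{\pi_{\theta_{old}}\big(a_t^{(i)} \mid s_t^{(i)}\big)}\, \hat{A}_t^{(i)} \tag{24}$$

where $\hat{A}_t$ is the advantage estimate. The importance sampling form of Eq. (24) allows samples from an older policy $\pi_{\theta_{old}}$ to be re-used for gradient ascent steps when refining a new policy. It is obtained by differentiating the surrogate loss:

$$L(\theta) = \hat{\mathbb{E}}_t\left[ r_t(\theta)\, \hat{A}_t \right] \tag{25}$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{old}}(a_t \mid s_t)$ is a probability ratio.

While in TRPO Schulman et al. (2015a) the maximisation of the surrogate loss from Eq. (25) is subject to a Kullback–Leibler divergence constraint, in PPO Schulman et al. (2017) the surrogate loss is constrained through a clipping procedure, yielding the clipped surrogate loss objective:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[ \min\big( r_t(\theta)\, \hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon)\, \hat{A}_t \big) \right] \tag{26}$$

where $\epsilon$ is a hyperparameter that limits large policy updates.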
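
The clipping procedure can be illustrated with a short Python sketch that evaluates the clipped surrogate objective on a batch of probability ratios and advantage estimates. The function name, the default clipping value of 0.2 and the sample numbers are assumptions for illustration; this is not the Unity ML-Agents implementation.

```python
def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Batch average of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    terms = []
    for ratio, advantage in zip(ratios, advantages):
        clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        terms.append(min(ratio * advantage, clipped_ratio * advantage))
    return sum(terms) / len(terms)


# A large ratio with a positive advantage is capped at (1 + eps) * A,
# so the objective gives no incentive to move the policy further.
capped = clipped_surrogate([1.5], [1.0])

# With a negative advantage the pessimistic (smaller) term is kept.
pessimistic = clipped_surrogate([1.5], [-1.0])
```

Taking the minimum of the clipped and unclipped terms makes the objective a lower bound on the unclipped surrogate, which is what discourages large policy updates.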

To further reduce variance when estimating the advantage, Schulman et al. (2017); Abbeel (2017) utilise a truncated version of generalised advantage estimation Schulman et al. (2015b), where the advantage $\hat{A}_t$ is estimated as

$$\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma\lambda)^{l}\, \delta_{t+l} \tag{27}$$

where $t$ is a time index within the sampled trajectory time horizon $[0, T]$, $\lambda$ is a hyperparameter that performs the exponentially weighted average of k-step estimators of the returns Schulman et al. (2015b), and $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$.
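
A minimal Python sketch of truncated generalised advantage estimation is given below. It assumes the value list holds one extra bootstrap entry for the state reached after the last reward; the function name and the toy inputs are illustrative.

```python
def truncated_gae(rewards, values, gamma, lam):
    """Truncated generalised advantage estimates for one trajectory segment.

    rewards has T entries; values has T + 1 entries, the last being the
    value estimate of the state reached after the final reward.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference residual at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of the residuals from t onwards.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# With zero value estimates and gamma = lam = 1 the advantages reduce to
# the remaining undiscounted return.
example_adv = truncated_gae([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
```

Setting `lam` to zero recovers the one-step residual, while `lam` equal to one recovers the discounted return minus the value baseline.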

When utilising shared parameters for the “actor” and “critic” neural networks (as is the case in this work), the loss function needs to be augmented with a value function error term Schulman et al. (2017). To ensure exploration, an entropy term “$S$” is also added. Finally, the loss function to be maximised at each iteration becomes:

$$L^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\left[ L^{CLIP}_t(\theta) - c_1 L^{VF}_t(\theta) + c_2\, S[\pi_\theta](s_t) \right] \tag{28}$$

For this study, the Unity ML-Agents package fixes the coefficient $c_1$ Pierre (2020), $c_2$ is a hyperparameter controlling the magnitude of the entropy bonus $S$, and $L^{VF}_t$ is a clipped, squared-error loss between the estimate of the state-value function and the actual return value obtained when following a trajectory Schulman et al. (2017). The implementation of $L^{VF}_t$ in Unity ML-Agents can be seen in Pierre (2020).

Additionally, parallel training can also be implemented as a way of substituting experience replay by running the policy on multiple instances of the environment. By guaranteeing that each environment starts in a random initial state during training, this parallelism helps decorrelate the sampled data, stabilising learning Mnih et al. (2016).

5 Agent-Environment Setup

5.1 Unity ML-Agents

The Unity3D graphics engine is a popular game developing environment that has been used to create games and simulations in 2D and 3D since its debut in 2005. It has received widespread adoption in other areas as well, such as architecture, engineering and construction Juliani et al. (2018).

Unity ML-Agents is an open-source project that allows for designing environments where a smart agent can learn through interactions Juliani et al. (2018); Pierre (2020). It was chosen for this project due to its ease of implementation, built-in PPO algorithm and visual framework for displaying the real-time control of TRS.

5.2 Agent-Environment Modelling and Training

By creating simple representative 3D models for turbines, sluices, ocean and lagoon, a training environment for TRS is created in Unity 3D. In this environment, the equations for simulating flow operation through the lagoon, when operating sluices and turbines, are extracted from the 0D model representation, detailed in Section 3.2. In order to choose appropriate parameters for operating our environment, we follow literature representations suggested by Angeloudis et al. (2018, 2017); Xue et al. (2019), for the Swansea Bay Tidal Lagoon project. The chosen parameters are shown in Table 5. A variable lagoon surface area, digitized from Xue et al. (2019), is also utilised.

Number of turbines
Grid frequency
Turbine diameter (m)
Sluice area (m$^2$)
Table 5: Swansea Lagoon design.

For ease of visualisation, the 3D representations of sluice and turbine change colours depending on the operational mode chosen by the agent. For the turbine, green represents power generation mode, orange – idling mode and black – offline mode (zero flow rate). Similarly, sluices change colour between orange and black for sluicing and offline modes, respectively. Fig. 4 shows a capture of the Unity 3D environment representation for the Swansea Bay tidal Lagoon during ebb generation, with the representative models for sluice and turbines in offline and power generation modes, respectively. Ocean and Lagoon surface level motion are also represented.

Figure 4: Unity 3D environment for a 0D model of the Swansea Bay Tidal Lagoon during ebb generation.

For stabilising and speeding up training, parallel training is performed with 64 copies of the environment (Fig. 5), while episodes are set to month of duration. During training, each environment instance requires a representative ocean input at the location where the Swansea Lagoon is planned to be constructed ( ). Ideally, ocean measurements could be used as training data. However, due to the lack of sufficient measured data (Section 6.1), it is not possible to train the agent until reasonable performance is reached. Instead, an artificial tide signal to simulate the ocean is created by summing the major sinusoidal tide constituents (due to gravitational pull of the Moon and Sun). Although we are not accounting for other less predictable local wave motions (e.g. wind waves), the artificial ocean input representation is sufficient for enabling the agent to converge to an optimal policy. A major advantage of this approach is the fact that we can generate any amount of input data required for training the agent.

Figure 5: 64 instances of the environment during parallel training. For the turbines, green represents power generation mode. For turbines and sluices orange represents idling/sluicing mode and black – offline mode.

The tide constituent’s amplitudes of the simulated ocean utilised in this work (Table 6), were obtained from a numerical simulation, at the location of Swansea Bay Tidal Lagoon, by Angeloudis et al. (2018). The periods for each constituent were obtained by Wolanski and Elliott (2015). The final equation for simulating the ocean can be seen in Eq. (29).

Ocean tide constituent Amplitude (m) Period (hr)
Table 6: Simulated tide constituents at Swansea Bay.

| States (at current and previous time-steps) | Units |
|---|---|
| Ocean water level | metres (float) |
| Lagoon water level | metres (float) |
| Number of online turbines | 0 or 16 (integer) |
| Number of idling turbines | 0 or 16 (integer) |
| Sluice gate opening area | 0 to 1 (float) |

Table 7: Input states for PPO neural network.

where , , and are angular frequencies (rad/s) of each tidal component, and , , and are random phase lags in the range , generated for each environment instance during parallel training when starting an episode, which allow for learning more generalized scenarios.
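
The construction of an artificial tide from summed constituents with random phase lags can be sketched as follows. The constituent amplitudes below are hypothetical placeholders (they are not the Table 6 values), the choice of a cosine form and a phase range of 0 to 2π are assumptions, and the function names are illustrative; only the M2 and S2 periods are standard values.

```python
import math
import random

# (amplitude in metres, period in hours); amplitudes are hypothetical.
CONSTITUENTS = {
    "M2": (3.0, 12.42),
    "S2": (1.0, 12.00),
}


def make_tide(constituents, rng):
    """Return a function giving ocean level (m) at a time in seconds.

    Each constituent receives its own random phase lag, drawn once per
    environment instance, so that every episode sees a different signal.
    """
    phases = {name: rng.uniform(0.0, 2.0 * math.pi) for name in constituents}

    def level(t_seconds):
        total = 0.0
        for name, (amplitude, period_hr) in constituents.items():
            omega = 2.0 * math.pi / (period_hr * 3600.0)  # rad/s
            total += amplitude * math.cos(omega * t_seconds + phases[name])
        return total

    return level


# One synthetic ocean per environment instance, sampled every minute for a day.
tide = make_tide(CONSTITUENTS, random.Random(0))
one_day = [tide(60.0 * minute) for minute in range(24 * 60)]
```

Drawing the phases once per instance, rather than per time-step, keeps each synthetic signal smooth while still decorrelating the parallel environments.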

In each environment, the agent acts as an operator responsible for controlling turbine and sluice operational modes through the policy network node outputs, according to a vector of input states (water levels of ocean and lagoon, plus the current operational mode of turbines and sluices, for the current and previous time-steps). The reward received by the agent is collected per time-step (1 minute) and equals the energy generated by the turbines. Input states are shown in Table 7.

Policy network node outputs can be discrete or continuous. In this work, continuous outputs are chosen, reducing the number of nodes in the last layer, and consequently, the complexity of the neural network. There are node outputs that determine turbine and sluice operation every of simulation. The window was selected for this work since the time usually associated with the opening/closing of hydraulic structures lies in the range Angeloudis et al. (2018, 2016).

Each node in the last layer outputs a bounded continuous value, and the resulting actions are computed in a hierarchical fashion. The first node determines the number of turbines set to power generation mode (0 or 16), depending on whether the node output is below or above a threshold: if the node outputs a value below the threshold, no turbines generate energy; otherwise, all turbines are set to power generation mode. Therefore, only if no turbines are set to power generation are the 16 turbines available for the other operational modes (idling or offline).

The second node selects the number of idling turbines in the same way as the first node, provided that the number of available turbines is 16. Otherwise, the number of idling turbines is 0, independent of this node's output.

If no turbine is selected for power generation or idling modes, all turbines are set offline. Therefore, the first two nodes control turbines through discrete actions.

The third and final node outputs the opening area of the sluice gates. Since any value between can be chosen by the neural network, the momentum ramp function (Section 3.2) is applied to the outputted flow rate every time-step, ensuring smooth flow rate transitions, independently of the opening sluice area set by the agent.
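
The hierarchical decoding of the three node outputs can be sketched in Python as below. The output range of -1 to 1, the threshold of 0 and the linear mapping of the third node to a 0 to 1 sluice opening are assumptions made for illustration; the function name and dictionary keys are hypothetical.

```python
N_TURBINES = 16


def decode_action(node1, node2, node3, threshold=0.0):
    """Map three continuous node outputs to turbine and sluice commands."""
    # Node 1: all turbines generate, or none do.
    generating = N_TURBINES if node1 >= threshold else 0
    available = N_TURBINES - generating

    # Node 2 only matters when every turbine is still available.
    idling = N_TURBINES if (available == N_TURBINES and node2 >= threshold) else 0

    # Whatever is neither generating nor idling is offline.
    offline = N_TURBINES - generating - idling

    # Node 3: rescale from [-1, 1] to a sluice opening fraction in [0, 1].
    sluice_opening = min(1.0, max(0.0, (node3 + 1.0) / 2.0))

    return {
        "generating": generating,
        "idling": idling,
        "offline": offline,
        "sluice_opening": sluice_opening,
    }


# Generation takes priority over idling, regardless of the second node.
ebb_generation = decode_action(0.5, 0.9, -1.0)
```

Because the sluice command is decoded independently of the first two nodes, sluices can be opened or closed in any turbine mode, which is what the variant operation requires.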

Beyond reducing the number of node outputs, this configuration also allows for having the sluice operation independent of turbine operation. All possible operational modes for turbines and sluices as a function of node output () are shown in Tables 8 and 9, respectively. For reproducibility, Table S2 in the Supplementary Material showcases the hyperparameters utilised during training.

The agent-environment optimisation problem is solved through the PPO algorithm (Section 4.1). After training, the policy neural network receives input states and outputs optimum values, following a policy that maximises energy generation. During testing, this means that the agent receives real ocean measurements as input, performing real-time flexible control of turbines and sluices.

Node 1 Node 2 Discrete Turbine Control
Offline Mode
Idling Mode
Power Generation Mode
Power Generation Mode
Table 8: Possible turbine operational modes.
Node 3 Continuous Sluice Control
Sluicing Mode (Available sluice area = )
Offline Mode (Available sluice area = 0)
Table 9: Possible sluice operational modes.

6 Experiments

In this section we compare the performance of our trained DRL agent against state-of-the-art optimisation routines. Code for reproducibility can be made available on request to the corresponding author.

6.1 Test Data and Baselines optimisation

For comparing our DRL agent performance against conventional optimisation routines, we model six baselines devised from the recent literature Angeloudis et al. (2018); Xue et al. (2019) and compare the energy generated in a month for each method. All baselines in this work consider the operation of the Swansea Bay tidal lagoon either through classic or variant “two-way scheme” methods, as detailed in Section 3.

Regarding test data for baselines and trained agent, we utilise all tide gauge ocean measurements available from the British Oceanographic Data Centre (BODC) at Mumbles Station BODC (accessed January 10, 2021), located at the edge of Swansea Bay. The obtained measurements of ocean elevation are recorded every min in a table, for the years of and the range . Before utilising the data, a preprocessing step is performed so that data flagged as “improbable”, “null value” and “interpolated” by BODC are not considered. After this step we retain 26 months of usable, non-overlapping test data. The preprocessing step ensures a conservative comparison between baselines and our trained agent, since it considers scenarios where tidal predictions had a good match with measured data.
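The flag-based preprocessing can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record layout, the flag labels as plain strings, and the rule that a month is kept only when none of its records is flagged are all assumptions.

```python
from collections import defaultdict

# Hypothetical flag labels; the actual BODC file encoding may differ.
REJECTED_FLAGS = {"improbable", "null value", "interpolated"}


def usable_months(records):
    """Return sorted (year, month) keys whose records carry no rejected flag.

    `records` is an iterable of (year, month, elevation_m, flag) tuples,
    where `flag` is None for an unflagged measurement.
    """
    flagged = defaultdict(bool)
    for year, month, _elevation, flag in records:
        # A single rejected flag marks the whole month as unusable.
        flagged[(year, month)] |= flag in REJECTED_FLAGS
    return sorted(key for key, bad in flagged.items() if not bad)
```

Rejecting whole months rather than individual samples keeps each retained month as a continuous time series, which matters when a month is treated as one episode.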

Tidal predictions for the same 26 months are also provided by BODC in the same data-set. For each month, baseline optimisation routines utilise tidal predictions for capturing operational head values , and (when considered) that optimise power generation. These operational head values are then applied to the measured ocean test data, so that comparisons between baselines and trained agent can be made. Baselines, in increasing order of optimisation complexity, are described next:

  • CH (Constant Heads): Best, constant and are picked for extracting energy during a whole month Ahmadian et al. (2017).

  • CHV (Constant Heads, with variant operation): Best, constant heads , and are picked for extracting energy during a whole month.

  • EHT (Every Half-Tide): optimised pairs of and are picked for every consecutive half-tide. Proposed by Xue et al. (2019).

  • EHTV (Every Half-Tide, with variant operation): optimised , and are picked for every consecutive half-tide.

  • EHN (Every Half-Tide and Next): optimised and are picked for every half-tide, considering the best and for the next half-tide as well. Proposed by Xue et al. (2019).

  • EHNV (Every Half-Tide and Next, with variant operation): optimised , and are picked for every half-tide, considering the best , and for the next half-tide as well.

All variant optimisation methods are augmented through the addition of an independent sluice head operation. This modification is inspired by the work of Baker (1991) and Angeloudis et al. (2018). CH and CHV perform non-flexible operation, while EHT, EHTV, EHN and EHNV perform state-of-the-art flexible operation. A summary detailing each baseline's operational heads and method is shown in Table 10.

                          Constant Head    Every Half-Tide    Every Half-Tide and Next
non-flexible operation    CH, CHV
flexible operation                         EHT, EHTV          EHN, EHNV
Table 10: Simplified reference table for baselines.
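The difference between a constant-head baseline and a per-half-tide baseline, and the protocol of optimising on predictions before applying the result to measurements, can be illustrated with a toy sketch. The function `toy_energy` is a made-up stand-in for the lagoon simulation, with a single operational head per half-tide; it is not the model used in this work, and all names and numbers are hypothetical.

```python
def toy_energy(tidal_range, head):
    """Toy half-tide energy: zero unless 0 < head < tidal_range."""
    if head <= 0.0 or head >= tidal_range:
        return 0.0
    return head * (tidal_range - head)


def constant_head(predicted, candidates):
    """CH-style choice: one head maximising total energy over all predictions."""
    return max(candidates, key=lambda h: sum(toy_energy(r, h) for r in predicted))


def half_tide_heads(predicted, candidates):
    """EHT-style choice: the best head for each predicted half-tide separately."""
    return [max(candidates, key=lambda h: toy_energy(r, h)) for r in predicted]


def applied_energy(measured, heads):
    """Energy obtained when the chosen heads are applied to measured ranges."""
    return sum(toy_energy(r, h) for r, h in zip(measured, heads))


# Hypothetical predicted and measured tidal ranges for three half-tides.
predicted = [4.0, 6.0, 8.0]
measured = [4.2, 5.8, 8.1]
candidates = [0.5 * k for k in range(1, 13)]

ch = constant_head(predicted, candidates)
eht = half_tide_heads(predicted, candidates)
print(applied_energy(measured, [ch] * len(measured)), applied_energy(measured, eht))
```

On the predicted series, the per-half-tide choice can never do worse than the constant one, since both draw from the same candidate set. On measured data this ordering is not guaranteed, which is why the quality of the predictions matters for the baselines.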

All baselines, except EHNV, are optimised with a grid search optimisation algorithm, which iteratively increases its search resolution until convergence. Initial search resolution starts with meter, with optimisation heads , and (when considered) within ranges , and , respectively. After the first run, search resolution is halved and the algorithm performs a brute-force search around the best previous configuration attained. The latter procedure is repeated until final search resolution is lower than .
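The refinement loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: the objective, the bounds and the step sizes are placeholders, and the "search around the best previous configuration" is assumed to cover the immediate neighbours at the halved resolution.

```python
from itertools import product


def refine_grid_search(objective, bounds, initial_step, final_step):
    """Maximise `objective` over a box by repeatedly refining a grid.

    `bounds` is a list of (low, high) pairs, one per operational head.
    """
    def axis(lo, hi, step):
        n = int(round((hi - lo) / step))
        return [min(lo + i * step, hi) for i in range(n + 1)]

    step = initial_step
    # First run: coarse brute-force search over the full ranges.
    grids = [axis(lo, hi, step) for lo, hi in bounds]
    best = max(product(*grids), key=objective)

    # Halve the resolution and search around the previous best point,
    # stopping once the resolution drops below `final_step`.
    while step >= final_step:
        step /= 2.0
        grids = [
            [x for x in (b - step, b, b + step) if lo <= x <= hi]
            for b, (lo, hi) in zip(best, bounds)
        ]
        best = max(product(*grids), key=objective)
    return best, objective(best)
```

With three heads, each refinement pass evaluates at most 27 configurations, so most of the cost sits in the initial coarse pass. The sketch also shows why this approach scales poorly when the heads of the next half-tide are added to the search space.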

EHNV requires a different optimisation approach, owing to the high computational time it would demand with the previous grid search method. For this case we utilise the stochastic global optimisation algorithm basin-hopping Wales and Doye (1997) from the SciPy package Virtanen et al. (2020), with COBYLA as the local minimiser Powell (1994). Basin-hopping was chosen for its efficiency when solving smooth function problems with several local minima separated by large barriers (SciPy v0.14.0 Reference Guide, 2015). COBYLA is a derivative-free method for nonlinear constrained optimisation that relies on linear approximations of the objective and constraints. Even though basin-hopping is not guaranteed to converge to a global optimum, EHNV is shown to be, on average, the best baseline method for energy generation.
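A minimal sketch of basin-hopping with COBYLA as the local minimiser is given below, using `scipy.optimize.basinhopping`. The objective `negative_toy_energy` is a hypothetical multimodal surrogate standing in for the negative of the energy returned by a lagoon simulation; the starting point and the number of iterations are likewise arbitrary.

```python
import numpy as np
from scipy.optimize import basinhopping


def negative_toy_energy(heads):
    """Hypothetical objective with several local minima (to be minimised)."""
    heads = np.asarray(heads, dtype=float)
    smooth = np.sum((heads - 3.0) ** 2)    # broad bowl around 3 m
    ripples = np.sum(np.cos(6.0 * heads))  # adds many local minima
    return float(smooth + ripples)


# Arbitrary starting guess for three operational heads.
x0 = np.array([1.0, 1.0, 1.0])

result = basinhopping(
    negative_toy_energy,
    x0,
    niter=50,
    minimizer_kwargs={"method": "COBYLA"},  # derivative-free local step
)
print(result.x, result.fun)
```

Since the perturbation steps are random, repeated runs may return different local minima, which is consistent with the absence of a global convergence guarantee noted above.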

6.2 Agent Performance Evaluation

Following hyperparameter tuning, we trained the agent for steps, until convergence. The cumulative reward (energy) per month (episode) during parallel training, averaged over the 64 instances of the lagoon environment, is shown in Fig. 6. The log-scale inset highlights the two-step plateau that is observed when converging to an optimal strategy. Starting from a fully random strategy, the cumulative reward received by the agent increases until reaching an intermediate plateau at around steps, where the agent learns the strategy of operating mostly the turbines, while keeping sluices practically offline during ebb generation. Then, after about steps, the cumulative reward starts increasing again. The second plateau stabilises around steps, with a cumulative reward approximately 25% higher than the first plateau. This gain is enabled by (i) a flexible operational strategy learnt by the agent, which adjusts TRS operation according to tidal range, and (ii) the smart usage of the sluicing mode, as discussed below in the test results. Videos showcasing the strategic operation developed for both plateaus are available in the Supplementary Material.

Figure 6: Monthly cumulative energy (reward) in GWh, averaged over all 64 environments during parallel training. The log-scale inset highlights the two-step plateau.

For test data, we utilise 26 months of real ocean measurements from BODC. These months are presented and numbered in Table S3, while Table S4 (Supplementary Material) compares the energy obtained in the numbered months by our trained agent (performing real-time flexible control) and by the baselines. The averaged monthly energy attained by all methods is shown in Fig. 10.

For the baselines, CH and CHV present the worst performance, since constant operational heads cannot account for the varying ocean amplitudes in a month (about to in our test set). Furthermore, baselines with variant operation produced more energy on average than their classical counterparts. Finally, “half-tide and next” approaches showed very small improvements () when compared to “half-tide” methods, while requiring much greater () computational time (Tables S5 and S6, Supplementary Material).

For the trained agent, Fig. 7 shows operational test results of power generation and lagoon water levels for one month of measured ocean data (starting with the initial lagoon water level at mean sea level). We note that the agent quickly converges to an optimal energy generation strategy for sequential tidal cycles, independent of tidal range input, which is a characteristic of state-of-the-art flexible operation Xue et al. (2019). Furthermore, Figs. 8 and 9 showcase detailed results of real-time control on test data. Apart from ocean water levels, results are coloured according to the actions taken by the agent for turbines and sluices, respectively, as defined in Section 5.2. More specifically, Figs. 8(a) and 9(a) show lagoon water level variations, while Figs. 8(c), 8(b) and 9(b) show power generation, turbine flow rates and sluice flow rates, respectively. From the sequence of actions taken, we see that the agent arrives at a policy with independent operation of sluices, i.e. the variant operation of TRS, which was shown to be a better strategy than the classical operation in our baseline comparison. A summary of our method's accomplishments in comparison with state-of-the-art baselines is shown in Table 11.

(a) Ocean and lagoon water levels.
(b) Power generation.
Figure 7: Lagoon water levels and power generation results for the trained agent performing flexible control in a month, with measured ocean data only.
(a) Ocean (in blue) and lagoon water levels.
(b) Flow rate from the 16 turbine units.
(c) Combined power output from turbines.
Figure 8: Lagoon water levels, turbine flow rates and power output are shown and coloured following turbine operational mode chosen by the trained agent. Green represents power generation mode, orange – idling mode and black – offline mode.
(a) Ocean (in blue) and lagoon water levels.
(b) Flow rate from sluice gates.
Figure 9: Lagoon water levels and sluice flow rates are shown and coloured following sluice operational mode chosen by the trained agent. Orange represents idling (i.e. sluicing) mode and black – offline mode.
Ahmadian et al. (2017) Xue et al. (2019) Xue et al. (2019) (our work)
real-time flexible control   
prediction-free approach   
variant lagoon operation      
state-of-art performance         
optimisation routines with variant operation of tidal lagoons. Augmentation inspired by Angeloudis et al. (2018, 2016)
Equivalent outputs, on average, within the error bars (Fig. 10).
Table 11: Comparison of state-of-the-art baselines with our proposed DRL Agent.
Figure 10: Averaged monthly energy comparison between baselines and trained agent utilising test data. Sample standard deviations for the various months are also shown as error bars.

Our agent managed very competitive energy outputs, staying on average within of the best baseline (EHNV). Indeed, for all months tested, our agent performed optimally, outputting better results than the state-of-the-art EHN method for 22 out of 26 months, within a margin in the worst scenarios. We note that this novel result was obtained by training the agent once with a simple artificial ocean input, in contrast with the baselines, which require future tidal predictions and must be re-run for every new tide.
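The comparison statistics quoted in this section (monthly means, sample standard deviations used as error bars, the number of months in which one method beats another, and the average gap to a baseline) can be computed along the following lines. The function names are hypothetical and the numbers in the usage example are placeholders, not results from this work.

```python
from statistics import mean, stdev


def monthly_summary(energies):
    """Return (mean, sample standard deviation) of monthly energies."""
    return mean(energies), stdev(energies)


def months_won(agent, baseline):
    """Count months in which the agent's energy exceeds the baseline's."""
    return sum(a > b for a, b in zip(agent, baseline))


def relative_gap(agent, baseline):
    """Average shortfall of the agent relative to the baseline, as a fraction."""
    return (mean(baseline) - mean(agent)) / mean(baseline)


# Placeholder monthly energies in GWh, for illustration only.
agent = [10.0, 12.0, 14.0]
baseline = [11.0, 11.0, 13.0]
print(monthly_summary(agent), months_won(agent, baseline), relative_gap(agent, baseline))
```

A negative `relative_gap` means the agent's average exceeds the baseline's. `statistics.stdev` uses the n - 1 denominator, matching the sample standard deviations shown as error bars.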

7 Conclusions

In this work, we have shown that proximal policy optimisation (a DRL method) can be used for real-time flexible control of tidal range structures after training with artificially generated tide signals, yielding results competitive with state-of-the-art optimisation methods devised from the literature.

We have chosen the Swansea Bay Tidal Lagoon for our analysis, given its status as a pathfinder project for larger tidal lagoon projects. We show that our novel approach obtains optimal energy generation from measured tidal data through an optimised control policy of turbines and sluices. Our method shows promising advancements over state-of-the-art optimisation approaches since it (i) performs real-time flexible control with equivalent energy generation, (ii) does not require future tidal predictions, and (iii) needs to be trained only once.

Owing to its characteristic features, the method introduced here would be of broad applicability for performing optimal energy generation of TRS in cases where future tidal predictions are unreliable or not available.


We would like to thank the Brazilian agencies Coordenação de Aperfeiçoamento de Pessoal de Ensino Superior (CAPES) and Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) for providing funding for this research.


  • P. Abbeel (2017) Lecture 4a: policy gradients, deep rl bootcamp berkeley.
  • P. Abbeel (2019a) CS 287 lecture 18 - rl i: policy gradients, uc berkeley eecs.
  • P. Abbeel (2019b) Lecture 19 off-policy, model-free rl, uc berkeley eecs.
  • G. A. Aggidis and D. S. Benzon (2013) Operational optimisation of a tidal barrage across the mersey estuary using 0-d modelling. Ocean Engineering 66, pp. 69–81.
  • G. A. Aggidis and O. Feather (2012) Tidal range turbines and generation on the Solway Firth. Renewable Energy 43, pp. 9–17.
  • R. Ahmadian, J. Xue, R. A. Falconer, and N. Hanousek (2017) Optimisation of tidal range schemes. In Proceedings of the 12th European Wave and Tidal Energy Conference, pp. 1059.
  • A. Angeloudis, R. Ahmadian, R. A. Falconer, and B. Bockelmann-Evans (2016) Numerical model simulations for optimisation of tidal lagoon schemes. Applied Energy 165, pp. 522–536.
  • A. Angeloudis, S. C. Kramer, A. Avdis, and M. D. Piggott (2018) Optimising tidal range power plant operation. Applied Energy 212, pp. 680–690.
  • A. Angeloudis, M. Piggott, S. C. Kramer, A. Avdis, D. Coles, and M. Christou (2017) Comparison of 0-d, 1-d and 2-d model capabilities for tidal range energy resource assessments. EarthArXiv (2017).
  • C. Baker (1991) Tidal power. Institution of Engineering and Technology (1711).
  • BODC (accessed January 10, 2021) Download uk tide gauge network data from bodc.
  • E. Bøhn, E. M. Coates, S. Moe, and T. A. Johansen (2019) Deep reinforcement learning attitude control of fixed-wing uavs using proximal policy optimization. In 2019 International Conference on Unmanned Aircraft Systems (ICUAS), pp. 523–533.
  • G. D. Egbert and R. D. Ray (2017) Tidal prediction. Journal of Marine Research 75 (3), pp. 189–237.
  • S. Evans (October 4th, 2019, accessed January 10, 2021) La rance: learning from the world’s oldest tidal project.
  • R. A. Falconer, J. Xia, B. Lin, and R. Ahmadian (2009) The severn barrage and other tidal energy options: hydrodynamic and power output modeling. Science in China Series E: Technological Sciences 52 (11), pp. 3413–3424.
  • F. Harcourt, A. Angeloudis, and M. D. Piggott (2019) Utilising the flexible generation potential of tidal range power plants to optimise economic value. Applied Energy 237, pp. 873–884.
  • C. Hendry (2016) The role of tidal lagoons. Final Report, pp. 67–75.
  • IEA (April 2020, accessed January 10, 2021) Global energy review 2020.
  • A. Juliani, V. Berges, E. Vckay, Y. Gao, H. Henry, M. Mattar, and D. Lange (2018) Unity: a general platform for intelligent agents. arXiv preprint arXiv:1809.02627 (2018).
  • E. Medina-Lopez, D. McMillan, J. Lazic, E. Hart, S. Zen, A. Angeloudis, E. Bannon, J. Browell, S. Dorling, R. Dorrell, et al. (2021) Satellite data for the offshore renewable energy sector: synergies and innovation opportunities. arXiv preprint arXiv:2103.00872 (2021).
  • V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
  • S. P. Neill, A. Angeloudis, P. E. Robins, I. Walkington, S. L. Ward, I. Masters, M. J. Lewis, M. Piano, A. Avdis, M. D. Piggott, et al. (2018) Tidal range energy resource and optimization–past perspectives and future challenges. Renewable Energy 127, pp. 763–778.
  • V. Pierre (2020) Unity-technologies/ml-agents. GitHub.
  • P. Poupart (2018) CS885 lecture 7a: policy gradient, university of waterloo.
  • M. J. Powell (1994) Advances in optimization and numerical analysis. In Proceeding of the 6th Workshop on Optimization and Numerical Analysis, pp. 5–67.
  • D. Prandle (1984) Simple theory for designing tidal power schemes. Advances in Water Resources 7 (1), pp. 21–27.
  • A. M. Schnabl, T. M. Moreira, D. Wood, E. J. Kubatko, G. T. Houlsby, R. A. McAdam, and T. A. Adcock (2019) Implementation of tidal stream turbines and tidal barrage structures in dg-swem. In International Conference on Offshore Mechanics and Arctic Engineering, Vol. 58899, pp. V010T09A005.
  • J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015a) Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897.
  • J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel (2015b) High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438 (2015).
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).
  • J. Schulman (August, 2017) Deep rl bootcamp lecture 5: natural policy gradients, trpo, ppo, bootcamp berkeley.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT Press.
  • SciPy v0.14.0 Reference Guide (2015) Scipy.optimize.basinhopping.
  • P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, et al. (2020) SciPy 1.0: fundamental algorithms for scientific computing in python. Nature Methods 17 (3), pp. 261–272.
  • D. J. Wales and J. P. Doye (1997) Global optimization by basin-hopping and the lowest energy structures of lennard-jones clusters containing up to 110 atoms. The Journal of Physical Chemistry A 101 (28), pp. 5111–5116.
  • E. Wolanski and M. Elliott (2015) Estuarine ecohydrology: an introduction. section 2.1 the tides at sea. Elsevier.
  • J. Xue, R. Ahmadian, and R. A. Falconer (2019) Optimising the operation of tidal range schemes. Energies 12 (15), pp. 2870.
  • J. Xue, R. Ahmadian, O. Jones, and R. A. Falconer (2021) Design of tidal range energy generation schemes using a genetic algorithm model. Applied Energy 286, pp. 116506.
  • J. Xue, R. Ahmadian, and O. Jones (2020) Genetic algorithm in tidal range schemes’ optimisation. Energy, pp. 117496.

Supplementary Material

To be made available after paper acceptance.