Reinforcement Learning-Based Automatic Berthing System

12/03/2021
by Daesoo Lee, et al.

Previous studies on automatic berthing systems based on artificial neural networks (ANNs) showed great berthing performance by training the ANN with ship berthing data. However, because an ANN requires a large amount of training data to yield robust performance, ANN-based automatic berthing systems are limited by the difficulty of obtaining berthing data. In this study, to overcome this difficulty, an automatic berthing system based on proximal policy optimization (PPO), one of the reinforcement learning (RL) algorithms, is proposed. RL algorithms can learn an optimal control policy through trial and error by interacting with a given environment and do not require any pre-obtained training data. In the proposed PPO-based automatic berthing system, the control policy controls the revolutions per second (RPS) and rudder angle of a ship. Finally, it is shown that the proposed PPO-based automatic berthing system eliminates the need for obtaining a training dataset and shows great potential for actual berthing applications.

1 Introduction

Ship berthing has traditionally been conducted by an experienced captain because, when a ship approaches a port or harbor, it experiences nonlinear ship motions due to its slow speed and sudden changes in rudder angle and engine revolutions per second (RPS), and these factors have made automation of the berthing process difficult. To overcome this limitation, many studies on automatic berthing systems have been conducted, and ANN-based automatic berthing systems have shown the most robust performance [bae2008study, ahmed2013automatic, im2018artificial, lee2020application]. Although the previous studies on ANN-based automatic berthing systems showed robust berthing performance, the berthing performance was still limited by the amount of training data. Typically, a large amount of training data is required for a well-trained ANN. However, for the ANN-based berthing system, it is hard to obtain a large amount of berthing data as training data; therefore, the performance of the trained ANN is limited.

In this study, to overcome this limitation, a proximal policy optimization (PPO)-based automatic berthing system is proposed, where PPO is one of the most popular reinforcement learning (RL) algorithms [schulman2017proximal]. RL algorithms can train a neural network to take appropriate actions to achieve a given goal by finding an optimal control policy through interaction with a given environment. In the proposed PPO-based automatic berthing system, the actions are controls over the rudder angle and RPS, and the goal is to arrive at a berthing goal point. Through interaction with a given environment, the proposed system learns to provide optimal controls over the rudder angle and RPS, which results in robust berthing from various initial positions. Finally, it is shown that the proposed PPO-based automatic berthing system can train a neural network to arrive at the berthing goal point without any pre-obtained berthing data. In the following sections, the reinforcement learning theory and the details of the proposed PPO-based automatic berthing system are presented, followed by the simulation conditions, simulation results, and discussion.

2 Reinforcement Learning Theory

RL is applied to problems where decisions for actions are made in a sequential order. A unit that learns through interaction with an environment is called an agent. To solve such sequential-action-decision problems, the problem should be defined mathematically, which can be achieved by a Markov decision process (MDP). The MDP consists of the state s_t, action a_t, reward r_t, and action (control) policy π(a_t|s_t), where the policy determines the action a_t to take given the state s_t. The procedure of RL training according to the MDP is shown in Fig. 1. As the loop in Fig. 1 repeats, the policy keeps being updated to maximize the reward, and eventually the optimal policy that maximizes the reward is obtained. Therefore, solving the MDP to find the optimal policy is what an RL algorithm does.

Figure 1: Procedure of RL training according to MDP.

In the early days of RL, the MDP was solved by dynamic programming, where the states, rewards, and policy are computed in table form and a perfectly defined environment is required. However, real-world environments are very complex and involve non-linearity, so it was impossible to learn a mapping function for the optimal policy using dynamic programming in real-world applications. In recent years, RL algorithms have become capable of learning the optimal policy for real-world applications with the help of ANNs or other types of neural networks, where such neural networks act as an approximate mapping function for the policy π. PPO is one of the RL algorithms that trains such neural networks to learn the policy efficiently, and it has shown robust performance in various applications. The overall procedure of RL training with PPO is shown in Algorithm 1, where the agent is composed of such neural networks.

1:  Set an initial state s_0.
2:  for t = 0, 1, 2, ... do
3:     Agent takes action a_t given the current state s_t.
4:     Interact with an environment.
5:     Obtain reward r_t and next state s_{t+1}.
6:     Store the obtained data (s_t, a_t, r_t, s_{t+1}).
7:     Every fixed number of steps, train the agent using PPO.
8:  end for
Algorithm 1 Overall procedure of RL training with PPO.
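To make this loop concrete, a minimal Python sketch is given below. The environment and agent here are placeholder stubs introduced only for illustration (in the actual system the environment would be the ship maneuvering model and the agent a PPO actor-critic); the update interval of 128 steps mirrors the value stated in Section 5 but is otherwise an assumption.

```python
# Minimal sketch of the interaction loop in Algorithm 1 (not the authors' code).
import numpy as np

class DummyEnv:
    """Stand-in environment with a 7-dimensional state and 2-dimensional action."""
    def reset(self):
        return np.zeros(7)
    def step(self, action):
        next_state = np.random.randn(7)            # placeholder dynamics
        reward = -np.linalg.norm(next_state[:2])   # placeholder reward
        return next_state, reward

class DummyAgent:
    """Stand-in agent; a real PPO agent would hold actor/critic networks."""
    def act(self, state):
        return np.random.uniform(-1.0, 1.0, size=2)  # [rudder command, RPS command]
    def update(self, rollout):
        pass  # the PPO clipped-surrogate update would go here

env, agent = DummyEnv(), DummyAgent()
state = env.reset()                 # 1: set an initial state s_0
rollout, update_every = [], 128     # train every fixed number of steps (assumed 128)
for t in range(1000):               # 2: main loop
    action = agent.act(state)                   # 3: take action a_t given s_t
    next_state, reward = env.step(action)       # 4-5: interact, obtain r_t and s_{t+1}
    rollout.append((state, action, reward, next_state))  # 6: store the data
    if (t + 1) % update_every == 0:             # 7: every fixed number of steps
        agent.update(rollout)
        rollout = []
    state = next_state
```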

3 Proposed PPO-Based Automatic Berthing System

In Algorithm 2, the overall procedure of the proposed PPO-based automatic berthing system is presented, and its corresponding illustration is shown in Fig. 2. When setting the initial ship position randomly, minimum and maximum values for the random initial ship position must be defined in advance so that the random initial ship positions are always set within the defined range. The ship position consists of the ship positions along the x and y axes in a global coordinate system and the ship's heading angle. The configuration of the state is presented in Eqs. (1)-(4), where x' and y' are the ship's x and y positions normalized by the ship length L. ψ denotes the heading angle, and its direction is presented in Fig. 3. d is the distance between the berthing goal point and the normalized ship position. u, v, and r denote the velocities in the surge, sway, and yaw directions, respectively. The subscript t denotes the timestep. The action the agent takes consists of the rudder angle and RPS, as shown in Eq. (5), in which δ and n denote the rudder angle and RPS, respectively. The reward is calculated by the reward function shown in Algorithm 3, where the unit of the rudder angle is degrees and the tolerance can be set considering how closely the ship should berth at the berthing goal point. The reward function also uses a local heading angle that is zero when the ship's bow is headed directly towards the berthing goal point. Finally, the training of the agent with PPO is conducted every fixed number of steps, and that number is determined by how much previous data to feed to the agent at every training step.

1:  Initialize the actor and critic.
2:  for each episode do
3:     Randomly select an initial ship position.
4:     Set an initial state s_0.
5:     for t = 0, 1, 2, ... do
6:        Take action a_t by the actor.
7:        Interact with the environment.
8:        Train the actor and critic every fixed number of steps.
9:     end for
10:  end for
Algorithm 2 Overall procedure of the proposed PPO-based automatic berthing system.
Figure 2: Illustration of the overall procedure of the proposed PPO-based automatic berthing system.
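The random selection of the initial ship position in Algorithm 2 (line 3) can be sketched as follows; the bound values and the heading-angle convention are placeholders for illustration, with the actual ranges given by Eqs. (9)-(11) in Section 5.

```python
import numpy as np

# Placeholder bounds for the random initial ship position; the actual
# ranges are defined in Eqs. (9)-(11) of Section 5.
X0_RANGE = (10.0, 30.0)       # normalized x position range (assumed)
Y0_RANGE = (10.0, 30.0)       # normalized y position range (assumed)
PSI0_OFFSET = (-10.0, 10.0)   # offset [deg] added to the goal-pointing heading (assumed)

def sample_initial_position(goal, rng=np.random.default_rng()):
    """Sample a random initial ship position within the predefined range."""
    x0 = rng.uniform(*X0_RANGE)
    y0 = rng.uniform(*Y0_RANGE)
    # The heading is initially set towards the berthing goal point plus a uniform offset.
    # Here the angle is measured from the x axis; the actual convention is shown in Fig. 3.
    psi_to_goal = np.degrees(np.arctan2(goal[1] - y0, goal[0] - x0))
    psi0 = psi_to_goal + rng.uniform(*PSI0_OFFSET)
    return x0, y0, psi0
```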
s_t = [x'_t, y'_t, ψ_t, d_t, u_t, v_t, r_t]    (1)
x'_t = x_t / L    (2)
y'_t = y_t / L    (3)
d_t = sqrt((x'_goal − x'_t)² + (y'_goal − y'_t)²)    (4)
a_t = [δ_t, n_t]    (5)
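To make the state and action definitions concrete, the following sketch assembles the vectors described above; the ordering of the state components and the goal coordinates are assumptions for illustration and may differ from the authors' exact formulation.

```python
import numpy as np

L = 175.0                      # length between perpendiculars of the target ship [m]
goal = np.array([0.0, 0.0])    # berthing goal point in normalized coordinates (assumed)

def build_state(x, y, psi, u, v, r):
    """Assemble the state described by Eqs. (1)-(4).
    x, y are global positions [m], psi the heading angle [deg],
    u, v, r the surge, sway, and yaw velocities."""
    x_n, y_n = x / L, y / L                          # positions normalized by ship length
    d = np.linalg.norm(goal - np.array([x_n, y_n]))  # distance to the goal point
    return np.array([x_n, y_n, psi, d, u, v, r])

def build_action(rudder_angle, rps):
    """Action of Eq. (5): rudder angle delta [deg] and propeller RPS n."""
    return np.array([rudder_angle, rps])
```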
Figure 3: Direction of the heading angle where the positive direction is indicated by the arrow.

Input:
Parameter:
Output:

1:  
2:  if  then
3:     
4:     if  then
5:        // the above condition encourages the ship to be headed towards .
6:        
7:     end if
8:  end if
9:  
10:    // to prevent an excessive rudder control.
11:  
12:  if  then
13:       // to encourage going forward.
14:  end if
15:  
16:    // scaling
Algorithm 3 Reward function.
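Most of the symbols and threshold values in Algorithm 3 are not legible in this copy, but its inline comments describe the intent: reward heading towards the goal, penalize excessive rudder control, encourage forward motion, and scale the result. The sketch below is a hypothetical reward of that shape; all coefficients and thresholds are chosen only for illustration and are not the authors' reward function.

```python
def reward_function(d, local_heading, rudder_angle, surge_speed,
                    tolerance=0.5, scale=0.01):
    """Hypothetical reward following the structure commented in Algorithm 3.
    d: distance to the berthing goal point (normalized by ship length)
    local_heading: angle [deg] between the bow direction and the goal direction
    rudder_angle: commanded rudder angle [deg]
    surge_speed: forward speed u
    All coefficients below are illustrative assumptions."""
    reward = -d                                    # closer to the goal is better
    if d > tolerance:                              # not yet within the berthing tolerance
        if abs(local_heading) < 20.0:              # encourage heading towards the goal
            reward += 0.1
    reward -= 0.01 * abs(rudder_angle)             # prevent excessive rudder control
    if surge_speed > 0.0:                          # encourage going forward
        reward += 0.05
    return scale * reward                          # scaling
```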

4 Architecture of the Agent

The architecture of the agent is shown in Fig. 4. The length of the previous state history fed to the agent is determined by the number of steps between training updates in Algorithm 2. The boxes labeled Flatten, HL, Reshape, and LSTM denote a flatten layer, hidden layer, reshape layer, and LSTM layer, respectively. The feature extraction layers extract embedded features from the previous state history, and the extracted features are fed into the LSTM layer. The state value is one of the agent's outputs, and it measures how good the current state is, considering the future rewards the agent may receive. This state value is used in training the agent with PPO so that it learns the optimal policy, i.e., the policy that takes actions maximizing the reward received throughout an episode. In this study, the sizes of the HL and LSTM layers are set to 64 and 256, respectively.

Figure 4: Architecture of the agent.
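As a rough sketch of the layout in Fig. 4, the following actor-critic module stacks a flatten layer, a hidden layer of size 64, and an LSTM of size 256, then outputs the action (rudder angle and RPS) and the state value. The layer ordering, activations, history length, and output heads are assumptions for illustration; the exact architecture is in the linked repository.

```python
import torch
import torch.nn as nn

class BerthingAgent(nn.Module):
    """Illustrative actor-critic following the Flatten -> HL -> LSTM layout of Fig. 4."""
    def __init__(self, state_dim=7, history_len=128, hidden=64, lstm_size=256, action_dim=2):
        super().__init__()
        self.flatten = nn.Flatten()                          # flatten the state history
        self.hl = nn.Sequential(                             # feature-extraction hidden layer (size 64)
            nn.Linear(state_dim * history_len, hidden), nn.ReLU())
        self.lstm = nn.LSTM(hidden, lstm_size, batch_first=True)  # LSTM layer (size 256)
        self.actor = nn.Linear(lstm_size, action_dim)        # outputs rudder angle and RPS
        self.critic = nn.Linear(lstm_size, 1)                # outputs the state value

    def forward(self, state_history):
        # state_history: (batch, history_len, state_dim)
        feat = self.hl(self.flatten(state_history))          # (batch, hidden)
        out, _ = self.lstm(feat.unsqueeze(1))                # reshape features into a length-1 sequence
        out = out[:, -1]                                     # (batch, lstm_size)
        return self.actor(out), self.critic(out)

# usage sketch
agent = BerthingAgent()
actions, value = agent(torch.zeros(1, 128, 7))
```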

5 Simulation Conditions

Principal particulars of the target ship are shown in Table 1. Control restrictions on the propeller and the rudder are shown in Eqs. (6)-(8), where the units of the propeller RPS n, the rudder angle δ, and the rudder angle rate are RPS (revolutions per second), deg, and deg/s, respectively (a minimal sketch of enforcing such limits is given after the equations below). There is no restriction on the rate of change of the RPS in this paper, in order to first check the feasibility of applying PPO to the automatic berthing system. No environmental load such as wind force is considered, so as to focus on the learning capability and berthing performance of the proposed PPO-based automatic berthing system in the absence of environmental disturbance. The ranges of the random initial ship position x_0, y_0, and ψ_0 in Algorithm 2 are presented in Eqs. (9)-(11), where the subscript 0 denotes a timestep of zero. The initial heading angle is basically set towards the berthing goal point from the initial ship position, with some value added that is sampled from a uniform distribution. The berthing goal point is defined in Eq. (12). The number of steps used in Algorithm 2 (the interval at which the actor and critic are trained) is set to 128. The hyperparameter settings from the original PPO paper are used with slight modifications.
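For reference, the continuous-control hyperparameters reported in the original PPO paper [schulman2017proximal] are listed below; since the settings here are stated only as being used "with slight modifications", these values should be read as a likely starting point rather than the exact configuration of this study.

```python
# Reference hyperparameters from the original PPO paper (continuous-control settings);
# treat these as assumptions about the starting point, not the exact values used here.
ppo_hyperparameters = {
    "clip_epsilon": 0.2,      # clipping parameter of the clipped surrogate objective
    "gamma": 0.99,            # discount factor
    "gae_lambda": 0.95,       # GAE parameter
    "learning_rate": 3e-4,    # Adam step size
    "epochs_per_update": 10,  # optimization epochs per batch of collected data
    "minibatch_size": 64,
    "rollout_steps": 128,     # steps between updates; set to 128 in this study (Section 5)
}
```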

HULL
Length overall 188 m
Length between perpendiculars 175 m
Breadth 25.4 m
Draft 8.5 m
Block coefficient 0.559
RUDDER
Height 7.7 m
Area ratio 1/45.8
Aspect ratio 1.827
PROPELLER
Diameter 6.5 m
Pitch ratio 1.055
Expanded area ratio 0.73
Table 1: Principal particulars of the target ship.
(6)
(7)
(8)
(9)
(10)
(11)
(12)
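The numerical limits in Eqs. (6)-(8) are not reproduced here, but the restrictions amount to bounding the propeller RPS, the rudder angle, and the rudder angle rate. The sketch below shows one way such limits can be enforced on the commanded controls; the limit values are placeholders, not the ones used in the paper.

```python
import numpy as np

# Placeholder limits standing in for Eqs. (6)-(8); the actual values are given in the paper.
MAX_RPS = 2.0            # |n| limit [1/s]
MAX_RUDDER = 35.0        # |delta| limit [deg]
MAX_RUDDER_RATE = 3.0    # |d(delta)/dt| limit [deg/s]

def apply_control_limits(rps_cmd, rudder_cmd, prev_rudder, dt):
    """Clip the commanded RPS and rudder angle to restrictions of the form of Eqs. (6)-(8)."""
    rps = np.clip(rps_cmd, -MAX_RPS, MAX_RPS)
    # Limit how fast the rudder can move within one timestep, then limit its absolute angle.
    max_step = MAX_RUDDER_RATE * dt
    rudder = prev_rudder + np.clip(rudder_cmd - prev_rudder, -max_step, max_step)
    rudder = np.clip(rudder, -MAX_RUDDER, MAX_RUDDER)
    return rps, rudder
```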

6 Simulation and Result Discussion

The training progress of an RL algorithm can be observed through the reward time history and its convergence. Since the goal of an RL algorithm is to find the optimal policy that maximizes the rewards received throughout an episode, training is considered finished when the reward time history has increased and finally converges. The proposed PPO-based automatic berthing system was trained for about 20 hours, and its reward time history is shown in Fig. 5, where the curve is smoothed with a smoothing factor of 0.99. In Fig. 5, it can be observed that the reward time history increases in the early stage and converges later, meaning that the PPO reached the optimal policy at convergence. Next, berthing trajectories using the trained PPO-based automatic berthing system are presented in Fig. 6 and Fig. 7, where the red circle is the berthing goal point with its tolerance, the blue half-square denotes a harbor, and the ship in each trajectory is drawn every 50 s. The control time histories of the propeller and the rudder angle for Fig. 7 can be found in Figs. 8-16. Fig. 6 shows that the berthing control is well managed for initial ship positions within the random initial ship position range used during training, which can be viewed as control within the interpolated range. More importantly, Fig. 7 shows that the berthing control is successful and robust even for initial ship positions outside the training range, which can be viewed as control starting from the extrapolated range. Therefore, it can be argued that PPO can learn a control policy generic enough to perform well even in extrapolated situations, and a further study with an additional restriction on the RPS change rate and with environmental loads is encouraged.

Code for the simulation is available at https://github.com/danelee2601/RL-based-automatic-berthing. In the GitHub repository, a link to a Google Colab notebook is provided where the berthing simulation can be run with the model pre-trained by PPO. There, various initial ship positions can be tried to see the robustness of the PPO-based automatic berthing system.

Figure 5: Reward time history of training the PPO-based automatic berthing system.
Figure 6: Berthing trajectories by the proposed PPO-based automatic berthing system; each figure sub-index denotes the initial condition (x_0, y_0, ψ_0). The initial positions are within the random initial ship position range used during training (interpolated).
Figure 7: Berthing trajectories by the proposed PPO-based automatic berthing system; each figure sub-index denotes the initial condition (x_0, y_0, ψ_0). The initial positions are outside of the random initial ship position range used during training (extrapolated).

7 Conclusion

A PPO-based automatic berthing system is proposed in this study. The main limitation of previous ANN-based automatic berthing systems comes from the large amount of berthing data required to train the ANN, and such berthing data is difficult to obtain. The proposed PPO-based automatic berthing system has three main advantages over the previous ANN-based systems. First, the proposed system does not require any pre-obtained training data because it obtains data through interaction with a given environment. Second, the proposed system is not limited by the amount of training data because it can keep generating training data as long as the simulation continues. Third, with a careful definition of the reward function, a desired maneuvering behavior can be achieved without any prior knowledge of ship maneuvering. The simulation results showed that the proposed system can learn to robustly maneuver the ship to the berthing goal point from a variety of initial positions.

References