## 1 Introduction

Ship berthing has traditionally been conducted by an experienced captain: as a ship approaches a port or harbor, it experiences nonlinear motions due to its slow speed and sudden changes in the rudder angle and engine revolutions per second (RPS), and these factors have made automation of the berthing process difficult. To overcome this limitation, many studies on automatic berthing systems have been conducted, and artificial neural network (ANN)-based automatic berthing systems have shown the most robust performance [bae2008study, ahmed2013automatic, im2018artificial, lee2020application]. Although these previous studies showed robust berthing performance, the performance was still limited by the amount of training data. Typically, a well-trained ANN requires a large amount of training data; however, for an ANN-based berthing system, it is hard to obtain a large number of berthing records as training data, so the performance of the trained ANN is limited. In this study, to overcome this limitation, an automatic berthing system based on proximal policy optimization (PPO), one of the most popular reinforcement learning (RL) algorithms [schulman2017proximal], is proposed. RL algorithms can train a neural network to take appropriate actions to achieve a given goal by finding an optimal control policy through interaction with a given environment. In the proposed PPO-based automatic berthing system, the actions are controls over the rudder angle and RPS, and the goal is to arrive at a berthing goal point. Through interaction with the environment, the proposed system can learn to provide optimal controls over the rudder angle and RPS, which results in robust berthing from various initial positions.
Finally, it is shown that the proposed PPO-based automatic berthing system can train a neural network to arrive at the berthing goal point without any pre-obtained berthing data. In the following sections, reinforcement learning theory and the details of the proposed PPO-based automatic berthing system are presented, followed by the simulation conditions and a discussion of the simulation results.
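The interaction loop described above can be sketched in a few lines. Everything in this sketch (the `BerthingEnv` class, its one-dimensional dynamics, and the random placeholder policy) is a hypothetical stand-in for illustration, not the ship simulator used in this paper:

```python
import random

class BerthingEnv:
    """Toy stand-in for a berthing simulator (hypothetical; the real
    environment integrates nonlinear ship dynamics)."""

    def reset(self):
        self.distance = 10.0             # distance to the berthing goal point
        return self.distance

    def step(self, action):
        rudder, rps = action             # the two controls used in this study
        self.distance = max(0.0, self.distance - 0.1 * rps)
        reward = -self.distance          # closer to the goal -> higher reward
        done = self.distance == 0.0
        return self.distance, reward, done

env = BerthingEnv()
state, done = env.reset(), False
while not done:                          # one episode of interaction
    action = (random.uniform(-35, 35), 1.0)   # placeholder, not a learned policy
    state, reward, done = env.step(action)
```

An RL algorithm such as the PPO replaces the placeholder policy with a neural network and updates it from the collected (state, action, reward) tuples.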

## 2 Reinforcement Learning Theory

RL is applied to problems where decisions for actions are made in a sequential order. A unit that learns through interaction with an environment is called an agent. To solve such sequential decision problems, the problem must be defined mathematically, which can be achieved with a Markov decision process (MDP). The MDP consists of the state $s_t$, action $a_t$, reward $r_t$, and action (control) policy $\pi(a_t \mid s_t)$, where the policy determines the action $a_t$ taken given the state $s_t$. The procedure of RL training according to the MDP is shown in Fig. 1. As the loop in Fig. 1 repeats, the policy keeps being updated to maximize the reward, and eventually the optimal policy that maximizes the reward is obtained. Therefore, solving the MDP to find the optimal policy is what an RL algorithm does.

In the early days of RL, the MDP was solved by dynamic programming, where the states, rewards, and policy are stored in table form and a perfectly defined environment is required. In the real world, however, environments are complex and nonlinear, so it was impossible to learn a mapping function for the optimal policy with dynamic programming in real-world applications. In recent years, RL algorithms have become capable of learning the optimal policy for real-world applications with the help of an ANN or other types of neural networks, where the neural network acts as an approximate mapping function for the policy $\pi$. The PPO is one of the RL algorithms that trains such neural networks to map the policy efficiently, and it has shown robust performance in various applications. The overall procedure of RL training with the PPO is shown in Algorithm 1, where the agent is composed of the neural networks.
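The core of the PPO update in Algorithm 1 is the clipped surrogate objective from [schulman2017proximal]. A minimal NumPy sketch (function and variable names are ours, not from the paper's code):

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Clipped surrogate objective of the PPO (to be maximized).

    ratio:     pi_theta(a|s) / pi_theta_old(a|s) for each sampled step
    advantage: advantage estimate A_t for each sampled step
    eps:       clipping parameter (0.2 in the original PPO paper)
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # The elementwise minimum keeps the update conservative; average over batch.
    return np.mean(np.minimum(unclipped, clipped))
```

Clipping the probability ratio keeps each policy update close to the previous policy, which is what makes the PPO stable enough to train the agent from noisy berthing episodes.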

## 3 Proposed PPO-Based Automatic Berthing System

In Algorithm 2, the overall procedure of the proposed PPO-based automatic berthing system is presented; its corresponding illustration is shown in Fig. 2. When setting the initial ship position randomly, minimum and maximum values for the random initial ship position must be defined in advance so that the random initial positions always fall within the defined range. The ship position consists of the positions along the $x$ and $y$ axes of a global coordinate system and the ship's heading angle. The configuration of the state is presented in Eqs. (1)-(4), where $\bar{x}_t$ and $\bar{y}_t$ are the ship's $x$, $y$ positions normalized by the ship length $L$. $\psi_t$ denotes the heading angle, and its direction is presented in Fig. 3. $d_t$ is the distance between the berthing goal point and the normalized ship position. $u_t$, $v_t$, and $r_t$ denote the velocities in the surge, sway, and yaw directions, respectively. The subscript $t$ denotes the timestep. The action the agent takes consists of the rudder angle and RPS, as shown in Eq. (5), in which $\delta_t$ and $n_t$ denote the rudder angle and RPS, respectively. The reward is calculated by the reward function shown in Algorithm 3, where the unit of the rudder angle is degrees and the tolerance can be set considering how closely the ship should berth at the berthing goal point. The local heading angle in Algorithm 3 is zero when the ship's bow is headed directly toward the berthing goal point. Finally, the training of the agent with the PPO is conducted every $n_{step}$ steps, where $n_{step}$ is determined by how much previous data to feed into each training process.

$$s_t = [\bar{x}_t,\ \bar{y}_t,\ \psi_t,\ d_t,\ u_t,\ v_t,\ r_t] \tag{1}$$

$$\bar{x}_t = \frac{x_t}{L} \tag{2}$$

$$\bar{y}_t = \frac{y_t}{L} \tag{3}$$

$$d_t = \sqrt{(\bar{x}_{goal} - \bar{x}_t)^2 + (\bar{y}_{goal} - \bar{y}_t)^2} \tag{4}$$

$$a_t = [\delta_t,\ n_t] \tag{5}$$
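The state assembly of Eqs. (1)-(4) can be illustrated with a small helper. This is a sketch under the assumptions that the goal coordinates are given in the normalized frame and that $L$ is the length between perpendiculars from Table 1; the function and argument names are illustrative:

```python
import math

L = 175.0  # length between perpendiculars [m] (Table 1)

def make_state(x, y, psi, u, v, r, goal=(0.0, 0.0)):
    """Assemble the state vector: positions normalized by ship length L,
    heading angle, distance to the berthing goal (in normalized coordinates),
    and surge/sway/yaw velocities."""
    x_bar, y_bar = x / L, y / L
    d = math.hypot(goal[0] - x_bar, goal[1] - y_bar)
    return [x_bar, y_bar, psi, d, u, v, r]
```

The resulting 7-element vector is what the agent observes at every timestep $t$.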

## 4 Architecture of the Agent

The architecture of the agent is shown in Fig. 4. The length of the previous state history is determined by the history-length parameter shown in Algorithm 2. The boxes labeled Flatten, HL, Reshape, and LSTM denote a flatten layer, a hidden layer, a reshape layer, and an LSTM layer, respectively. The feature extraction layers extract embedded features from the previous state history, and the extracted features are fed into the LSTM layer. The state value $V(s_t)$ is one of the agent's outputs; it measures how good the current state is, considering the future rewards the agent may receive. This state value is used in training the agent with the PPO to learn the optimal policy that takes appropriate actions to maximize the reward received throughout an episode. In this study, the sizes of the HL and LSTM are set to 64 and 256, respectively.

## 5 Simulation Conditions

Principal particulars of the target ship are shown in Table 1. Control restrictions on the propeller and the rudder are shown in Eqs. (6)-(8), where the units of $n$, $\delta$, and $\dot{\delta}$ are RPS (revolutions per second), deg, and deg/s, respectively. No restriction is placed on the change rate of $n$ in this paper, in order to first check the feasibility of applying the PPO to the automatic berthing system. No environmental load, such as wind force, is considered, in order to focus on the learning capability and berthing performance of the proposed PPO-based automatic berthing system in the absence of environmental disturbance. The ranges of the random initial ship positions in Algorithm 2 are presented in Eqs. (9)-(11), where the subscript 0 denotes a timestep of zero. The initial heading angle is basically set toward the berthing goal point from the initial ship position, with some value added that is sampled from a uniform distribution. The berthing goal point is defined as in Eq. (12). The number of steps $n_{step}$ in Algorithm 2 is set to 128. The hyperparameter settings from the original PPO paper are used with slight changes.

| **HULL** | |
|---|---|
| Length overall | 188 m |
| Length between perpendiculars | 175 m |
| Breadth | 25.4 m |
| Draft | 8.5 m |
| Block coefficient | 0.559 |
| **RUDDER** | |
| Height | 7.7 m |
| Area ratio | 1/45.8 |
| Aspect ratio | 1.827 |
| **PROPELLER** | |
| Diameter | 6.5 m |
| Pitch ratio | 1.055 |
| Expanded area ratio | 0.73 |

(6) |

(7) |

(8) |

(9) |

(10) |

(11) |

(12) |
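The random initialization described above can be sketched as follows. The position ranges here are placeholders (the actual ranges are those of Eqs. (9)-(11), which are not reproduced here); the heading is set toward the berthing goal plus uniform noise, as described in the text:

```python
import math
import random

def sample_initial_position(goal=(0.0, 0.0), x_range=(5.0, 15.0),
                            y_range=(5.0, 15.0), heading_noise_deg=10.0):
    """Sample a random initial ship position as in Algorithm 2.

    x_range, y_range, and heading_noise_deg are illustrative placeholders,
    not the values used in the paper.
    """
    x0 = random.uniform(*x_range)
    y0 = random.uniform(*y_range)
    # Point the bow toward the berthing goal, then add uniform noise.
    base_heading = math.atan2(goal[1] - y0, goal[0] - x0)
    noise = math.radians(random.uniform(-heading_noise_deg, heading_noise_deg))
    return x0, y0, base_heading + noise
```

Bounding the sampling ranges in advance guarantees that every training episode starts inside the defined region, which later serves as the boundary between the interpolated and extrapolated test cases of Figs. 6 and 7.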

## 6 Simulation and Result Discussion

Training progress of an RL algorithm can be observed through the reward time history and its convergence. Since the goal of the RL algorithm is to find the optimal policy that maximizes the rewards received throughout an episode, the training is considered finished when the reward time history has increased and finally converges. The proposed PPO-based automatic berthing system was trained for about 20 hours, and its reward time history is shown in Fig. 5, where the history is filtered by a moving average filter with a smoothing factor of 0.99. In Fig. 5, the reward time history increases in the early stage and converges later, meaning that the PPO reached the optimal policy at convergence. Next, berthing trajectories using the trained PPO-based automatic berthing system are presented in Fig. 6 and Fig. 7, where the red circle is the berthing goal point with its tolerance, the blue half-square denotes a harbor, and the ship in each trajectory is drawn every 50 s. The control time histories of the propeller and the rudder angle for Fig. 7 can be found in Figs. 8-16. Fig. 6 shows that the berthing control is well managed for initial ship positions within the random initial-position range used during training, which can be viewed as control within the interpolated range. More importantly, Fig. 7 shows that the berthing control is successful and robust even for initial ship positions outside the training range, which can be viewed as control starting from the extrapolated range. Therefore, it can be argued that the PPO can learn an optimal control policy generic enough to perform well even in extrapolated situations; a further study with an additional restriction on the change rate of $n$ and with environmental loads is encouraged. Code for the simulation is available at https://github.com/danelee2601/RL-based-automatic-berthing. The GitHub repository links to a Google Colab notebook where the berthing simulation can be run with the model pre-trained by the PPO, so that various initial ship positions can be tried to confirm the robustness of the PPO-based automatic berthing system.
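The reward-curve filtering used for Fig. 5 can be reproduced with an exponentially weighted moving average, interpreting the reported 0.99 as the exponential smoothing weight (a sketch; the paper's exact plotting code is not shown):

```python
def smooth(rewards, weight=0.99):
    """Exponentially weighted moving average of an episode-reward history.

    weight: smoothing factor; 0.99 matches the filtering reported for Fig. 5
    under the assumption that it denotes an exponential weight.
    """
    out, last = [], rewards[0]
    for r in rewards:
        last = weight * last + (1.0 - weight) * r
        out.append(last)
    return out
```

Heavy smoothing like this makes the upward trend and the convergence plateau of the reward history easy to see despite the high episode-to-episode variance typical of RL training.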

## 7 Conclusion

A PPO-based automatic berthing system is proposed in this study. The main limitation of the previous ANN-based automatic berthing systems comes from the large amount of berthing data required to train the ANN, and such berthing data is difficult to obtain. The proposed PPO-based automatic berthing system has three main advantages over the previous ANN-based systems. First, it does not require any pre-obtained training data because it obtains data through interaction with a given environment. Second, it is not limited by the amount of training data because it can generate training data indefinitely as long as the simulation continues. Third, with a careful definition of the reward function, a desired maneuvering behavior can be achieved without any prior knowledge of ship maneuvering. The simulation results showed that the proposed system can learn to robustly maneuver the ship to the berthing goal point from a variety of initial positions.
