Reinforcement Learning Approach to Clear Paths of Robots in Elevator Environment

by Wanli Ma, et al.

A service robot needs to use elevator space efficiently in order to reduce the time it spends waiting for the next elevator. To this end, we propose a hybrid approach that combines reinforcement learning (RL) with voice interaction for robot navigation when entering an elevator. Compared with traditional navigation methods such as Optimal Reciprocal Collision Avoidance (ORCA), RL gives the robot a high exploration ability to find a new clear path into the elevator. The proposed method allows the robot to take an active clear path action towards the elevator when a crowd of people stands at the entrance even though plenty of space remains inside. This is done by embedding a clear path action (beep) into the RL framework, and the resulting navigation policy leads the robot to finish tasks efficiently and safely. Our approach substantially improves the success rate and reward of entering the elevator compared with the state-of-the-art ORCA and RL navigation policies without beep.



1 Introduction

Mobile robots have become a research hotspot for various applications due to the critical impact of the Novel Coronavirus Pneumonia. Cleaning and disinfection, security, logistics, and catering delivery are just some of the uses of mobile robots, which have also formed a line of defense for people caught up in the epidemic. Although mobile robot navigation capabilities have vastly increased, numerous issues remain in real-world applications, such as poor route flexibility and long delivery times for delivery robots. Delivery robots generally work in a multi-floor elevator environment, and they must use the elevator to deliver goods or merchandise from the starting point to the target location. For safe navigation, the robots are controlled to avoid collisions with people as much as possible. As a result, they frequently spend a lot of time waiting for the elevator, which decreases delivery efficiency significantly. In terms of application, the most vital factor in achieving autonomous delivery with mobile robots is accessing the elevator more efficiently. Therefore, we focus on navigation methods that balance safety and efficiency for mobile robots in elevator scenarios, so that mobile robots are both safe and efficient when completing delivery tasks.

When confronted with a complicated and changing environment, people can usually adopt the best navigation technique and get through safely and quickly. In particular, when we are in a hurry and efficiency is our priority, we remind others in the area to make room for us so that we can reach our destination on time. A few mobile robot navigation systems likewise focus on efficient navigation: they appeal to human senses, for instance with visual or audible signals, to remind pedestrians to give way and thus improve the robot’s navigation efficiency [3, 4, 5, 6, 7]. However, this comes with several drawbacks; for example, a high number of warnings may increase the likelihood of collisions in a congested setting [Nishimura2020L2BLT].

Figure 1: (a) ORCA policy without active clear path action; (b) RL policy without active clear path action; (c) policy without active clear path action; (d) RL policy with active clear path action.

To address this problem, we propose a reinforcement learning strategy for mobile robots to balance navigation safety and efficiency in different elevator scenarios. The mobile robot chooses between passively avoiding obstacles by finding a new path and actively clearing its planned path by interacting with humans through a beeping sound. In particular, when there is enough room in the elevator to fit in safely, the robot tries to enter the elevator, actively clears the path ahead, and reminds pedestrians who are blocking its way to move over (see Fig. 1).

In summary, the following are the vital contributions of this work:

  1. To the best of our knowledge, we are the first to provide a method for safely and efficiently entering elevators using a deep reinforcement learning framework.

  2. Reinforcement learning adapts well to unknown cases and helps improve the robot’s working efficiency.

  3. An RL-based policy balances actively clearing a path, by beeping so that pedestrians give way, against passively avoiding obstacles to find a new passable route.

The rest of the paper is organized as follows: Section 2 examines related work on mobile robot navigation strategies, and Section 3 presents the solution for entering and exiting the elevator securely and efficiently. The experimental environment and parameter settings are described in Section 4, and the qualitative and quantitative analyses of the experimental findings are given in Section 5. Section 6 concludes the paper with a summary and future work.

2 Related Work

2.1 Robot navigation in elevator environment

In an elevator environment, existing navigation approaches in the literature primarily focus on the safety of mobile robots, that is, their capability to avoid obstacles. Hada et al. [34] proposed a method that detects obstacles in the planned path of the mobile robot using a PSD sensor. When an obstacle is detected, the robot waits for 5 seconds before re-planning the path. Amano et al. [38] proposed an approach using edge detection and laser detectors to avoid collisions with pedestrians. However, this approach is only suitable for specific elevators with stop lines. Madarasz et al. [11] implemented a safe navigation approach that uses a classic path planning algorithm and ultrasonic rangefinders. Tschichold et al. [6] presented a navigation approach that determines the robot’s current behavior based on the state of the robot. This approach used multiple control algorithms to execute different robot behaviors and thereby achieve safe navigation of mobile robots. Zhao et al. [35] proposed an obstacle avoidance approach that detects obstacles using the YOLO-v3 tiny model and obtains the state of the elevator with a LiDAR based approach, which achieves safe navigation in the elevator. Correia et al. [37] implemented an obstacle avoidance method that utilizes Gmapping to map the entrance and interior of the elevator, yet this approach is not suitable for environments with crowded spaces or obstacles. Meanwhile, Su et al. [36] and Matsumoto et al. [17] proposed navigation methods that use visual information to perform robot obstacle avoidance. Troniak et al. [troniak2013charlie] put forward an approach for ensuring robot safety when entering and exiting a narrow elevator by using machine vision and algorithmic decision-making.

Most of the aforementioned approaches are not suitable for real-world elevator environments. In addition, some navigation approaches take into account the impact of pedestrians’ future trajectories on the safe navigation of mobile robots, which allows robots to avoid pedestrian trajectories during trajectory planning and reduces the likelihood of collisions [12, 13, 29]. However, using these approaches for navigation in a crowded environment is highly likely to cause the robot to freeze [31]. Considering the outstanding performance of deep reinforcement learning in numerous fields [18], some navigation algorithms have exploited it to boost the learning ability of mobile agents [19, 20, 32].

2.2 Safe and efficient robot navigation

To make robot navigation safe and efficient, various navigation methods have used visual or audible messages to pedestrians to keep the mobile robot on a predetermined course, increasing efficiency from the perspective of human-computer interaction [3, 4, 5, 6, 7]. Trautman et al. [2] used a combination of interactive Gaussian process and relevance sampling to implement a navigation strategy in which the agent and pedestrians cooperate. However, this method faced several problems, especially in overcrowded situations. Guzzi et al. [22] proposed a navigation algorithm that uses behavior heuristics to predict pedestrian intentions. Chen et al. [Chen19] implemented a navigation approach that uses deep reinforcement learning and an attention mechanism to perceive crowd dynamics, and could therefore navigate safely in crowded environments.

Understanding and cooperating with human behavior while adhering to social norms has made it easier for robots to navigate safely and effectively [24, 25]. Tai et al. [26] introduced a navigation strategy in which an imitation learning algorithm is embedded with a generative adversarial network. Nishimura et al. [Nishimura2020L2BLT] proposed a navigation approach based on a reinforcement learning algorithm that is compatible with social rules and selects robot behavior reasonably, whilst Chen et al. [30] proposed a similar method to allow robots to avoid collisions. All these approaches exploited different navigation learning strategies and introduced the idea of social norms to increase the safety and efficiency of mobile robot navigation [27]. These navigation strategies enable robots to navigate in crowded environments such as elevators, airports, train stations, and subways.

3 The Proposed Method

3.1 Reinforcement Learning for Robot Navigation

A robot navigation task of entering the elevator can be formulated as a sequential decision making problem. A reinforcement learning framework can solve this [Nishimura2020L2BLT] as a Markov decision process (MDP), which can be defined by a tuple $\langle S, A, P, R, \gamma \rangle$ [everett2021collision]. In particular, $S$ and $A$ are the state and action spaces, respectively, whilst $P$ and $R$ refer to the state-transition and reward functions, respectively. The scalar $\gamma$ is a discount factor, which …. In the sequel, we introduce the model parameters mentioned above.

The state of the robot at each step can be defined by

$$s_t = [d_g, \mathbf{v}_t, \mathbf{p}_t, v_{pref}, r],$$

where $d_g$ is the vertical distance from the current position of the robot to the interior of the elevator. Only the vertical distance is considered because the aim of the robot is to enter the elevator rather than to reach a particular location. The vector $\mathbf{v}_t$ is the velocity whilst $\mathbf{p}_t$ is the current position. Both the preferred speed $v_{pref}$ and the robot’s radius $r$ are hyper-parameters.

Similarly, the state of the $i$-th human, $w_t^i$, includes $d_i$, the distance between the robot and the $i$-th human, and $r_i$, the radius of the $i$-th human.

Thus, with $n$ humans, the joint state of the RL framework becomes

$$s_t^{jn} = [s_t, w_t^1, \ldots, w_t^n].$$
Regarding the robot movements, we set 17 actions in the action space of the robot (independent of beeping), which consist of 16 orientations evenly spaced in $[0, 2\pi)$ at the velocity $v_{pref}$ and a stay action. When the clear path action is considered, the number of actions in the action space doubles to 34, since each of the 17 actions can be taken with or without a sound warning (beep).
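The action space described above can be enumerated in a few lines. The following is a minimal Python sketch, not the authors' code: the function name `build_action_space`, the `(vx, vy, beep)` tuple layout, and the default `v_pref=1.0` are assumptions made for illustration. It pairs every one of the 17 actions, including the stay action, with a beep flag, as the stated total of 34 implies.

```python
import math

def build_action_space(v_pref=1.0, n_orientations=16, with_beep=False):
    """Return a list of (vx, vy, beep) tuples (hypothetical layout)."""
    moves = [(0.0, 0.0)]  # the stay action
    for k in range(n_orientations):
        theta = 2.0 * math.pi * k / n_orientations  # evenly spaced in [0, 2*pi)
        moves.append((v_pref * math.cos(theta), v_pref * math.sin(theta)))
    # Without the clear path action only beep=False exists; with it, each
    # action is available both with and without a beep.
    beeps = (False, True) if with_beep else (False,)
    return [(vx, vy, beep) for beep in beeps for (vx, vy) in moves]

no_beep_actions = build_action_space()
beep_actions = build_action_space(with_beep=True)
print(len(no_beep_actions), len(beep_actions))  # prints "17 34"
```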

Assuming the robot is unaware of the navigation policy of humans, the humans in the elevator can thus be regarded as a part of the environment. The system state transition function in this paper is based on Optimal Reciprocal Collision Avoidance (ORCA) policy which controls human behaviors and some hidden parameters (velocities, goals, etc.).

The aim of using a reinforcement learning method is to train an optimal policy with which the robot chooses the most suitable action at every time step so as to maximize the expected return of an episode, which corresponds to the robot entering the elevator safely and efficiently. The optimal policy $\pi^*$ and the optimal state value function $V^*$ at time $t$ can be defined as:

$$\pi^*(s_t^{jn}) = \underset{a_t}{\operatorname{argmax}} \left[ R(s_t^{jn}, a_t) + \gamma^{\Delta t \cdot v_{pref}} \int_{s_{t+\Delta t}^{jn}} P(s_{t+\Delta t}^{jn} \mid a_t, s_t^{jn}) \, V^*(s_{t+\Delta t}^{jn}) \, d s_{t+\Delta t}^{jn} \right]$$

where $\gamma$ is the discount factor and $P(s_{t+\Delta t}^{jn} \mid a_t, s_t^{jn})$ is the transition probability, that is, the probability of moving from the state $s_t^{jn}$ to the state $s_{t+\Delta t}^{jn}$ by taking action $a_t$, and $V^*(s_{t+\Delta t}^{jn})$ is the optimal value of the successor state.
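The action selection in the optimal policy can be illustrated with a greedy one-step lookahead. The sketch below is a simplification under stated assumptions, not the paper's implementation: it replaces the integral over successor states with a single predicted successor, and `reward_fn`, `step_fn`, `value_fn`, and the default values of `gamma`, `dt`, and `v_pref` are hypothetical placeholders.

```python
def greedy_action(state, actions, reward_fn, step_fn, value_fn,
                  gamma=0.9, dt=0.25, v_pref=1.0):
    """Pick the action maximising R(s, a) + gamma**(dt * v_pref) * V(s')."""
    discount = gamma ** (dt * v_pref)
    best_action, best_score = None, float("-inf")
    for action in actions:
        next_state = step_fn(state, action, dt)  # single predicted successor
        score = reward_fn(state, action) + discount * value_fn(next_state)
        if score > best_score:
            best_action, best_score = action, score
    return best_action

# Toy 1-D usage: the state is the remaining distance to the elevator.
toy_actions = [0.0, 1.0]                          # stay, or move forward at speed 1
toy_step = lambda s, a, dt: max(0.0, s - a * dt)  # distance shrinks when moving
toy_reward = lambda s, a: -0.01 if a else 0.0     # small cost per moving step
toy_value = lambda s: 1.0 - 0.1 * s               # closer to the elevator is better
chosen = greedy_action(3.0, toy_actions, toy_reward, toy_step, toy_value)
```

In this toy setting the gain in value from moving closer outweighs the step cost, so the forward action is selected.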


3.2 Learning to Balance Between Safety And Efficiency

It is very common for people to stand in front of the elevator door and block the entrance even though there is still a lot of room inside the elevator. In such cases, most existing robot navigation policies resort to the time-consuming action of waiting for the next elevator. An active clear path action can be an efficient way to overcome this problem and improve the efficiency with which robots complete tasks in various elevator scenarios.

On the one hand, when the active clear path action is not available, the reward function is designed to reward task completion while penalizing collisions and uncomfortable distances between the robot and humans. Meanwhile, so that the robot completes the task as quickly as possible, the reward is also designed to depend on the task completion time. Every moving step of the robot incurs a small cost so that the robot looks for a shortest path. If there is no accessible path for the robot, stopping is the optimal action. Thus, the reward function becomes


where the terms depend on the closest distance between the robot and the humans, the time limit for completing the task, and the current velocity and position of the robot at time $t$.

On the other hand, assume the robot is able to take the active clear path action. When people stand in front of the elevator door even though there is still usable room in the elevator, the robot can take a beep action to ask them to make room for it. However, beeping too frequently causes too much noise and does not look smart. To make the robot’s behavior more natural and to avoid abuse of the beep, every beep action has a price: a penalty is applied to the reward whenever a beep action happens.
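The structure of such a reward can be sketched as follows. This is only an illustration of the ingredients named in the text (completion reward, collision penalty, discomfort penalty, per-step cost, per-beep cost); the function `step_reward` and every numeric constant in it are hypothetical placeholders, not the values used in the paper, and the dependence on completion time is represented only through the per-step cost.

```python
def step_reward(entered, collided, min_dist, moved, beeped,
                success_reward=1.0, collision_penalty=0.25,
                discomfort_dist=0.2, discomfort_weight=0.5,
                step_cost=0.01, beep_cost=0.05):
    """Reward for one time step; every constant is a hypothetical placeholder."""
    reward = 0.0
    if collided:
        reward -= collision_penalty
    elif entered:
        reward += success_reward
    elif min_dist < discomfort_dist:
        # Penalise being uncomfortably close to the nearest human.
        reward -= discomfort_weight * (discomfort_dist - min_dist)
    if moved:
        reward -= step_cost   # small price per moving step favours short paths
    if beeped:
        reward -= beep_cost   # price per beep discourages needless beeping
    return reward

quiet = step_reward(False, False, 1.0, moved=True, beeped=False)
noisy = step_reward(False, False, 1.0, moved=True, beeped=True)
```

With these placeholder constants the two calls differ exactly by `beep_cost`, so a beep is only worthwhile when it unlocks a larger future reward, such as entering an otherwise blocked elevator.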

3.3 State Value Estimation

To obtain the optimal value function in (4), we use a deep neural network based on a self-attention mechanism [Chen19], which combines human-robot and human-human interactions into one model. The framework of the attention-based network is shown in Fig. 2. The input is the states of all agents in the environment, that is, the state of the robot and the states of the humans. The output of the model is the estimated optimal state value.

Figure 2: The network structure for the self-attention network
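The pooling step at the core of an attention-based network can be sketched in plain Python. This illustrates softmax attention pooling in general, not the architecture of [Chen19] or of this paper: the embeddings and scores below are made-up numbers, and the learned layers that would produce them, as well as the layers that map the pooled feature to a state value, are omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(embeddings, scores):
    """Weighted sum of per-human embeddings using softmax attention weights."""
    weights = softmax(scores)
    dim = len(embeddings[0])
    pooled = [sum(w * e[k] for w, e in zip(weights, embeddings))
              for k in range(dim)]
    return pooled, weights

# Three humans with hypothetical 2-D embeddings; the first one gets the
# highest attention score (for example because it blocks the door).
human_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attention_scores = [2.0, 0.0, 0.0]
crowd_feature, attention_weights = attention_pool(human_embeddings, attention_scores)
```

Because the pooled feature is a weighted sum, its size does not depend on the number of humans and its value does not depend on their ordering, which is what lets one network handle crowds of different sizes.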

4 Experiments

4.1 Environmental Setup

The crowd in the elevator was simulated by circles with radius 1 in an 8 × 12 rectangle. Similarly, the robot was regarded as a circle with radius 1 but filled with a particular color. The locations of the crowd were randomly distributed inside the rectangle. In the experiment, when the robot took a clear path action (beep), the people in the elevator were controlled by the ORCA method [van, Xu21], which navigates them to make room for the robot without collision. Here, we set the safety space to 0.2 and the maximum velocity to 1, and the destination of each human was the same as their starting point. In all other cases, the locations of the humans in the elevator are fixed.

For the robot navigation policies, the starting point of the robot was outside the elevator, facing the middle of the elevator door at a distance of 3. The reinforcement learning based method does not need a destination, but for the ORCA policy we set the middle of the elevator as its destination. The task ends only when the robot

  1. enters the elevator,

  2. collides with humans, or

  3. times out (9 s).

We also set the preferred speed of the robot to 1.
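The three termination conditions listed above can be expressed as a small classifier. The sketch below is illustrative only: the names `Outcome` and `episode_outcome` are hypothetical, and the order in which the conditions are checked (collision first, then success, then timeout) is an assumption. The 9 s limit comes from the text; the 0.25 s step is the time step reported in the evaluation section.

```python
from enum import Enum

class Outcome(Enum):
    RUNNING = "running"
    SUCCESS = "success"
    COLLISION = "collision"
    TIMEOUT = "timeout"

def episode_outcome(entered, collided, elapsed, time_limit=9.0):
    """Classify the current episode state (check order is an assumption)."""
    if collided:
        return Outcome.COLLISION
    if entered:
        return Outcome.SUCCESS
    if elapsed >= time_limit:
        return Outcome.TIMEOUT
    return Outcome.RUNNING

DT = 0.25                    # simulation time step in seconds
MAX_STEPS = int(9.0 / DT)    # number of decisions before the timeout
```

With a 0.25 s step and a 9 s limit, the robot has 36 decisions per episode.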

4.2 Training Setup

A multilayer perceptron based network was implemented in this paper to evaluate the state value. Before reinforcement learning, imitation learning on 3,000 demonstrations from ORCA was used to create a pre-trained model, with a learning rate of 0.01 for 50 epochs. We trained the value network with a stochastic gradient descent (SGD) optimizer for both imitation and reinforcement learning. For reinforcement learning, we set the learning rate and discount factor to 0.001 and 0.9, respectively. The exploration rate was set to 0.5 for the first 5,000 episodes and then kept at 0.1 for the remaining 5,000 episodes.
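The exploration schedule stated above is a simple step function, which can be sketched as follows. The function names and the zero-based episode index are assumptions for illustration; the epsilon-greedy rule shown is the standard one and is not confirmed by the text as the authors' exact exploration mechanism.

```python
import random

def exploration_rate(episode, switch_episode=5000, early=0.5, late=0.1):
    """Step schedule: `early` before `switch_episode`, `late` afterwards."""
    return early if episode < switch_episode else late

def epsilon_greedy(actions, greedy_choice, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return greedy_choice

# Example: the rate used in the first and in the last training episode.
first_rate = exploration_rate(0)
last_rate = exploration_rate(9999)
```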

5 Evaluation

5.1 Qualitative Evaluation

Figure 3: Performance evaluation for no-beep scenario. (a) (c) Cases for ORCA policy; (b) (d) The same cases in (a) and (c) but for RL policy.

The reinforcement learning (RL) based method enters the elevator more flexibly and intelligently than the traditional ORCA navigation policy. Fig. 3 shows two crowd cases for the ORCA policy (a, c) and for the RL policy (b, d).

In both cases in Fig. 3, the ORCA policy fails to enter the elevator and times out (9 s) although there is still plenty of space in the elevator. This is because the objective of the ORCA based method is only to reach a goal that is set at the start, and the goal cannot be changed during the task of entering the elevator. In other words, the exploration range of the ORCA based method is very limited, so it is difficult for the robot to follow a new path that requires a wide detour when it encounters obstacles. Thus, the ORCA based method is more likely to time out when it encounters obstacles.

On the other hand, the RL based method shows an advantage in path replanning. As shown in Fig. 3, both (b) and (d) succeed in entering the elevator, taking 4.8 and 5 seconds, respectively. The RL policy allows the robot to choose the optimal action at each time step (0.25 s) according to the current state values. Furthermore, no route or target point is preset before starting the task. Under the RL policy, the parking position in the elevator and the path into the elevator can change at any time based on the current states. Thus, the RL based navigation method is more flexible and effective than ORCA at entering the elevator in the no-beep scenario.

Voice interaction (beep) between the robot and humans makes a huge difference to the efficiency of entering the elevator. Fig. 4 shows two cases for each navigation policy (RL without beep and RL with beep). In both cases, the crowd blocks the entrance to the elevator and there is no available way for the robot to enter even though enough space is available inside. Here, the RL policy without beep fails to enter the elevator and times out. However, when the beep action (clear path) is included in the RL policy, the robot, as expected, takes an active clear path action (beep) at a suitable time and place to open a valid route into the elevator. After the clear path action, the people who block the entrance move to make room for the robot. Finally, the robot enters the elevator successfully without collision. It is worth noting that since the robot pays a price for each clear path action (beep), the frequency of the beep action is limited by the policy.

Figure 4: Voice interaction (beep) performance analysis. (a) (c) Cases for RL policy without beep; (b) (d) RL policy with beep. Red circles along the robot's path show positions where the robot beeps.

5.2 Quantitative Evaluation

We conducted two groups of control experiments to test the success rate and average reward value of each method. The reward value is an evaluation index of the performance of entering the elevator according to the selected reward function. In each group of control experiments, the number of people in the elevator varies between 4 and 8. Each model was evaluated 100 times with people randomly distributed in the elevator.

In the performance comparison for the no-beep scenario shown in Table 1, both the success rate and the average reward value decrease as the number of people in the elevator increases. This is because the locations of the humans are randomly distributed at the beginning, so as the number of humans grows, the probability of people blocking the door of the elevator rises, and neither ORCA nor RL without beep can address this kind of situation. Thus, the success rate and reward values of both methods show a downward trend as the number of humans increases from 4 to 8.

Comparing the two navigation methods, ORCA performs worse than the RL based method in terms of both success rate and average reward value. The RL based method dominates because RL is able to explore new routes to the elevator, whereas the obstacle avoidance range of the ORCA policy is more limited. For example, when 8 humans are in the elevator, the success rate of entering the elevator for the ORCA policy is only 26% even though there is still enough space in the elevator. The RL based policy, with its stronger ability to explore new paths, achieves a much higher success rate than ORCA, especially when the number of people is large, such as 8 in Table 1 (69%).

In terms of efficiency, we regard the average reward value as a measure of efficiency, as shown in (5). As Table 1 shows, when the number of people is 4, the difference between the rewards of ORCA and RL is around 1 (5.0707 and 6.1449, respectively). However, when the number of humans in the elevator increases to 8, the reward value of RL (4.3443) is almost six times that of ORCA (0.7264), which clearly demonstrates the efficiency of the RL policy.

Number of people | ORCA: Success (%) | ORCA: Reward | RL without beep: Success (%) | RL without beep: Reward
4 | 84 | 5.0707 | 92 | 6.1449
5 | 72 | 3.9901 | 88 | 5.8773
6 | 58 | 2.9195 | 84 | 5.5359
7 | 37 | 1.4607 | 73 | 4.6234
8 | 26 | 0.7264 | 69 | 4.3443
Table 1: Quantitative evaluation of the RL and ORCA based navigation algorithms without beep
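The comparisons drawn from Table 1 can be checked with a few lines of Python. The sketch below only re-computes differences and ratios from the numbers in the table; the variable names are chosen for illustration.

```python
# Table 1 rows: people -> (ORCA success %, ORCA reward, RL success %, RL reward)
table1 = {
    4: (84, 5.0707, 92, 6.1449),
    5: (72, 3.9901, 88, 5.8773),
    6: (58, 2.9195, 84, 5.5359),
    7: (37, 1.4607, 73, 4.6234),
    8: (26, 0.7264, 69, 4.3443),
}

success_gap = {}   # RL minus ORCA success rate, in percentage points
reward_gap = {}    # RL minus ORCA average reward
reward_ratio = {}  # RL average reward divided by ORCA average reward
for people, (orca_s, orca_r, rl_s, rl_r) in table1.items():
    success_gap[people] = rl_s - orca_s
    reward_gap[people] = round(rl_r - orca_r, 4)
    reward_ratio[people] = round(rl_r / orca_r, 2)

print(reward_gap[4], reward_ratio[8])  # prints "1.0742 5.98"
```

The reward gap of about 1.07 at 4 people and the ratio of about 5.98 at 8 people match the "around 1" and "almost six times" statements in the text.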

The proposed method, RL with voice interaction (beep), further improves the navigation performance when entering the elevator. As shown in Table 2, the success rate of the RL policy without beep decreases from 92% to 69% as the number of people increases from 4 to 8. This large degradation is not surprising, since the RL policy without beep cannot handle the situation in which some humans stand in front of the elevator door and block the way into the elevator. In contrast, the proposed RL with beep achieves a success rate of 100% for up to 6 people in the elevator. When the number of people is 7 or 8, the success rate decreases only slightly, to 99% and 98%, respectively. This shows that voice interaction between the robot and people works efficiently, since the people standing in front of the elevator door make room for the robot directly after the beep warning. For the RL policy with beep, the average reward value decreases only slightly, from 6.8409 to 6.5766, as the number of people in the elevator increases. The voice interaction added to RL in this paper is thus an efficient solution for crowded and challenging environments: it keeps the reward value nearly constant, whereas RL without beep degrades noticeably even though it outperforms the ORCA policy.

Number of people | RL without beep: Success (%) | RL without beep: Reward | RL with beep: Success (%) | RL with beep: Reward
4 | 92 | 6.1449 | 100 | 6.8409
5 | 88 | 5.8773 | 100 | 6.8307
6 | 84 | 5.5359 | 100 | 6.7782
7 | 73 | 4.6234 | 99 | 6.7122
8 | 69 | 4.3443 | 98 | 6.5766
Table 2: Performance comparison of the reinforcement learning based navigation algorithms with and without beeps
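How much each policy degrades between 4 and 8 people can be quantified directly from Table 2. The following sketch uses only the table values; the variable names and the choice of a relative reward drop as the summary statistic are illustrative.

```python
# Table 2 rows: people -> (no-beep success %, no-beep reward, beep success %, beep reward)
table2 = {
    4: (92, 6.1449, 100, 6.8409),
    5: (88, 5.8773, 100, 6.8307),
    6: (84, 5.5359, 100, 6.7782),
    7: (73, 4.6234, 99, 6.7122),
    8: (69, 4.3443, 98, 6.5766),
}

# Success-rate gain of beeping, in percentage points, per crowd size.
beep_gain = {n: row[2] - row[0] for n, row in table2.items()}

# Absolute and relative reward drop from 4 to 8 people for each policy.
drop_without = table2[4][1] - table2[8][1]
drop_with = table2[4][3] - table2[8][3]
rel_drop_without = drop_without / table2[4][1]
rel_drop_with = drop_with / table2[4][3]

print(round(drop_with, 4), round(drop_without, 4))  # prints "0.2643 1.8006"
```

The reward of RL with beep falls by roughly 4% over this range, against roughly 29% for RL without beep, which is the sense in which the reward stays "nearly constant" with the beep action.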

The trend chart in Fig. 5 depicts the variation of the success rate for each of the three methods. RL with beep shows the best result: it remains nearly stable as the number of people increases and outperforms the other two methods significantly. Without the active clear path action, the success rates of both methods decrease significantly as the number of people in the elevator increases, and the ORCA based method shows a steeper decrease than the RL based policy.

Figure 5: Trend of the success rate of entering the elevator for the three navigation methods as the number of humans in the elevator increases

6 Conclusion

In this paper, we proposed a reinforcement learning approach with voice interaction that enables a robot to navigate successfully and with improved efficiency when entering an elevator. The proposed approach led to a large improvement in the success rate of entering the elevator and reduced the time the robot needs to finish its tasks. In addition, the reward function of the reinforcement learning based method was designed to keep the robot navigation safe.

The limitation of our study is that the humans were passively controlled by a fixed policy, which may differ from real human behavior. A possible direction for future work is to model human behavior using real measurements and to enable people in the experimental environment to clear paths as well.