Learning Collision-free and Torque-limited Robot Trajectories based on Alternative Safe Behaviors

03/05/2021 ∙ by Jonas C. Kiemel, et al. ∙ KIT 7

This paper presents an approach to learn online generation of collision-free and torque-limited trajectories for industrial robots. A neural network, which is trained via reinforcement learning, is periodically invoked to predict future motions. For each robot joint, the network outputs the kinematic state that is desired at the end of the current time interval. Compliance with kinematic joint limits is ensured by the design of the action space. Given the current kinematic state and the network prediction, a trajectory for the current time interval can be computed. The main idea of our paper is to execute the predicted motion only if a collision-free and torque-limited way to continue the trajectory is known. In practice, the predicted motion is expanded by a braking trajectory and simulated using a physics engine. If the simulated trajectory complies with all safety constraints, the predicted motion is carried out. Otherwise, the braking trajectory calculated in the previous decision step serves as an alternative safe behavior. For evaluation, up to three simulated robots are trained to reach as many randomly placed target points as possible. We show that our method reliably prevents collisions with static obstacles and collisions between the robots, while generating motions that respect both torque limits and kinematic joint limits. Experiments with a real robot demonstrate that safe trajectories can be generated in real-time.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 5

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

In the manufacturing industry, industrial robots are frequently used for tasks like welding and spray painting. Typically, a repetitive motion sequence is hard-coded for each robot prior to putting a new production line into operation. This process is time-consuming and inflexible as the robots cannot react to unforeseen events or changes in their environment. Generating adaptive motions through reinforcement learning is a promising approach and an active field of research [1, 2, 3, 4]. However, robots have to be able to explore their environment in order to learn well-performing motions. When starting learning from scratch, completely random actions are predicted at the beginning of the training process. Robots in production lines are usually placed close together and likely to collide if random motions are performed. In order to prevent damage to the robots and their environment, care must be taken to avoid collisions during the training process. Aside from collisions, a robot can also be damaged if one of its joints is overloaded. The kinematic capabilities of an industrial robot are limited by the maximum position , velocity , acceleration and jerk of each joint, whereas the dynamic potential is restricted by the maximum torque of each actuator. In this work, we consider a motion as safe, if no collision occurs and if no kinematic joint limit or torque limit is violated.

Fig. 1: Our method enables learning of collision-free movements with multiple robots. Torque limits and kinematic joint limits are also respected.

For online trajectory generation, it is not sufficient to check whether all safety constraints are met within the current decision step as inevitable safety violations might occur at a later point in time. Consequently, a robotic system must check compliance with safety constraints over an infinite time-horizon or until a safe goal state is found [5]. A fundamental problem with online trajectory generation is that the safety of all subsequent movements must be guaranteed, although future movements cannot be predicted in advance. To overcome this problem, our work combines two ideas:

  • Compliance with kinematic joint limits is ensured by the design of the action space used for reinforcement learning [6].

  • Collisions and torque limit violations are prevented by ensuring the existence of an alternative safe behavior at each decision step [7].

With our approach, arbitrary movements can be predicted by a neural network without having to worry about potential safety implications. In the sections that follow, we clarify how alternative safe behaviors are generated and demonstrate the effectiveness of our method by learning various reaching tasks with up to three simulated robots. Experiments with a real robot are performed to show that safe motions can be computed in real-time. With the aim of enabling further research, our source code will be made publicly available.

9cm

(a) Observed robot scene

10cm

(b) Successful collision avoidance

12cm

(c) Collision due to conflicting constraints
Fig. 2: Successful and failing collision avoidance using Faverjon and Tournassoud’s method.

Ii Related work

Ii-a Safe reinforcement learning

The term safe reinforcement learning refers to methods that aim at respecting safety constraints during the learning phase and the deployment process [8]. Although reinforcement learning (RL) is widely used in robotics, safe learning is still considered as an outstanding challenge for practical applications [9]. One important aspect of safe reinforcement learning is the problem of safe exploration. In order to find well-performing policies, an RL agent has to explore its environment. During this process, dangerous situations have to be avoided. An overview of different techniques for safe exploration is given in [8] and [10]

. To consider safety constraints during learning, the concept of Constrained Markov decision processes (CMDP)

[11] was introduced and suitable RL algorithms were developed [12]. In [13], different constrained and unconstrained RL algorithms are benchmarked in various environments. It is shown that the use of constrained RL algorithms does not guarantee constraint satisfaction, especially at the beginning of the training process. In contrast, our method ensures safety right from the beginning of the training process and even if the environment is modified afterwards.

Ii-B Learning safe motions in robotics

Applying the concept of safe RL to robotics usually leads to the problem of safe online trajectory generation. When working with industrial robots, relevant safety constraints include prevention of collisions and adherence to kinematic and dynamic joint limits. We note that jerk limits are often ignored by academic prototypes, although jerky movements may damage the gear boxes and actuators of a robot [9]. While most neural networks are trained in simulation environments [13], safety constraints become relevant as soon as motions are to be carried out by real robots. Practitioners have developed various techniques to avoid unsafe behaviors when performing real-world experiments. For instance, penalties for undesired behaviors can be added to the reward function of an unconstrained Markov decision process (MDP) [14, 15]. Although penalties reduce the likelihood of undesirable behaviors, safety violations are not entirely prevented, even if the training process is carried out until convergence [6]

. In some cases, task-specific heuristics can be used to avoid unsafe behaviors

[1]. Designing the action space in such a way that all actions can be safely executed is an elegant solution to deal with safety constraints. However, this approach is often very restrictive and not suitable for all types of constraints [9]. Recently, an action space representation to ensure compliance with kinematic joint constraints was proposed [6]. Conflicting constraints are avoided over an infinite time-horizon and the work space of the robot is not restricted. We utilize the code provided by [6] for our action space representation and to generate braking trajectories. When considering kinematic joint constraints only, all joints can be treated as decoupled. However, preventing collisions and torque limit violations is more challenging as the coupling between the joints has to be taken into account. While most online collision avoidance methods are based on potential fields [16, 17, 18], Faverjon and Tournassoud’s method [19] provides a way to avoid collisions by specifying explicit inequality constraints. In [20], Faverjon and Tournassoud’s method was applied to a reinforcement learning problem. The basic idea is to define and solve a quadratic program (QP) at each decision step. The QP outputs the closest action to the network prediction that satisfies the specified inequality constraints. This approach is computationally efficient as constraint satisfaction is checked for the current decision step only. However, in order to avoid conflicting constraints, a robotic system has to reason over an infinite time-horizon or until a safe goal state is found [5]. The problem of conflicting constraints is demonstrated in Fig. 2. While extensions for non strictly convex objects exist [21], the basic version of Faverjon and Tournassoud’s method supports strictly convex objects only. As shown in Fig. 1(a), we define an orange bounding sphere around the third robot link and try to avoid collisions with the table surface and a stationary blue sphere. Fig. 1(b) and Fig. 1(c) show the distance and the relative velocity between the two spheres for a successful and an unsuccessful attempt to avoid a collision. The red dots visualize the minimum relative velocity for collision avoidance as specified by Faverjon and Tournassoud. The green lines indicate the range of possible relative velocities resulting from kinematic joint constraints. In Fig. 1(c), the desired minimum relative velocity is in conflict with the kinematic constraints. In the time between and , the corresponding quadratic programs have no solution and a collision cannot be prevented. Facing the problem of conflicting constraints, the concept of alternative safe behaviors was proposed in [7]

. We adapt the concept of alternative safe behaviors to RL problems and demonstrate the applicability to a multi-robot system with up to 21 degrees of freedom (DOFs). Compared to

[7], we additionally consider torque constraints, jerk limits and collisions between multiple robots.

Iii Approach

Iii-a Problem statement

Our method addresses the problem of learning collision-free robot motions complying with kinematic and dynamic joint limits. Like in [6], the following kinematic constraints are defined for each revolute robot joint:

(1)
(2)
(3)
(4)

with being the joint angle and , , and standing for position, velocity, acceleration and jerk, respectively. In addition, we specify torque limits for each actuator:

(5)

While the kinematic joint limits apply to the setpoints given to a joint trajectory controller, the torque limits are defined with respect to actual values. In order to avoid collisions, an arbitrary number of obstacle-link pairs and link-link pairs can be observed. Link-link pairs can be used to prevent either self-collisions or collisions between different robots. The safety distance S specifies the minimum distance tolerated between observed obstacle-link pairs and between observed link-link pairs. In the presence of arbitrarily moving obstacles, collision avoidance cannot be guaranteed [5]. Therefore, we assume that our environment is composed of controllable robots and static obstacles only. In addition, the shape and position of each object are known.

To learn motions with model-free RL, we define an unconstrained Markov decision process , where is the state space, is the action space,

are unknown transition probabilities and

is the immediate reward resulting from action . A neural network is trained to predict actions that maximize the sum of immediate rewards. The state contains the current kinematic state (position , velocity , acceleration ) of each robot joint as well as additional task-specific information. The neural network predicts a single scalar per joint as action . The immediate reward results from a task-specific reward function, which specifies the optimization goal of the learning problem. For our experiments, the prediction frequency of the neural network is set to .

Iii-B Adherence to kinematic constraints

In order to comply with kinematic joint constraints, we use the code provided by [6] to compute for each joint the minimum and maximum acceleration and allowed at the beginning of the following time interval . Since all intermediate acceleration values lead to feasible trajectories, the acceleration setpoint at the beginning of the next time interval can be computed by linearly mapping the network prediction onto the range specified by and :

(6)

and depend on the current kinematic state (, , ) and the specified kinematic limits (1) - (4

). To generate a continuous trajectory between the discrete decision steps, acceleration setpoints are linearly interpolated between

and . The corresponding velocity and position setpoints can then be calculated by integration.

Iii-C Prevention of collisions and torque limit violations

Collisions and torque limit violations are prevented by ensuring the existence of an alternative safe behavior prior to executing the motion resulting from the network prediction. The network prediction specifies the desired motion from the beginning of the current time interval to the beginning of the next time interval . To ensure the existence of a safely executable trajectory at , the predicted motion is continued by a braking trajectory that stops all robot joints. Starting from the current kinematic state at , the resulting movement is simulated and checked for collisions and torque limit violations. If no violations occur, an alternative safe behavior is found and the predicted motion is carried out until . Otherwise, the alternative safe behavior computed in the previous decision step is executed.

In the following, the concept of alternative safe behaviors is further elaborated using a real-world example.

The target point is reached.

The robot stops.
Fig. 3: Collisions are prevented even if the shape of an obstacle is changed from (a) to (b) after training.

For the purpose of illustration, a robot is trained to reach randomly placed target points in an environment with a monitor as an obstacle. As shown in Fig. 3, the orientation of the monitor is changed after training so that the desired target point can no longer be reached. The resulting trajectories for both monitor orientations are visualized in Fig. 4 and in the accompanying video. In case of the rotated monitor shown in Fig. 3, a braking process is initiated at as no alternative safe behavior can be found. Note that the finally executed braking trajectory was already checked for collisions in the previous decision step at . Without the braking process, a collision with the monitor would have occurred at .

First target point reached

Braking process initiated

Collision (no braking)
Fig. 4: Exemplary trajectory of a trained neural network for two joints. Solid lines: For the adapted environment shown in Fig. 3, a braking process is initiated at to prevent a collision. Dashed lines: In case of the initial environment shown in Fig. 3, the motion is performed as predicted by the neural network.

It is important to note that the robot does not necessarily come to a standstill when a braking process is initiated. As soon as an alternative safe behavior is found in a subsequent decision step, the braking process is stopped and the motion is continued as predicted by the neural network. Fig. 5 shows an exemplary trajectory generated by selecting random actions. Although a torque limit violation is prevented at , the robot joint does not come to a full stop.

Fig. 5: Exemplary trajectory for a single joint generated by a random agent. Note that neither kinematic joint limits nor torque limits are violated.

Iii-D Computation of alternative behaviors

Conflicting constraints are avoided by ensuring the existence of an alternative safe trajectory at each decision step. The alternative trajectory has to lead to a safe goal state without causing safety violations. In the absence of moving obstacles, a safe goal state is reached when all robot joints have been brought to a stop. The shorter the duration of the alternative trajectory, the less computational effort is required to verify its compliance with the safety constraints. Thus, we compute braking trajectories that are time-optimal with respect to the following predefined kinematic limits:

(7)
(8)

The kinematic limits for braking trajectories have to be either equal to or smaller than the limits specified in (3) and (4). As can be seen in Fig. 4 at , the maximum brake acceleration in our experiments is set to of the maximum joint acceleration.

Starting from the current decision step , the neural network predicts the desired motion until . Assuming a positive joint velocity at , the brake acceleration selected at is the minimum acceleration that satisfies the constraints (1), (7) and (8), the upper velocity limit specified by (2) and a lower velocity limit of zero. Provided that the braking limits (7) and (8) are more restrictive than the initial limits (3) and (4), the constraints for can be incompatible. If so, the conflicting constraint (7) or (8) is replaced with (3) or (4) so that the compliance with the initial kinematic constraints is guaranteed in any case. can be computed with the code provided by [6]. The procedure explained above is repeated for the subsequent time steps , , … until all robot joints are brought to a standstill. Potential benefits of more sophisticated alternative behaviors are discussed in section III-G.

9cm

(a) One robot

9cm

(b) Two robots

9cm

(c) Three robots
Fig. 6: Visualization of the environments used for evaluation.

Iii-E Checking alternative behaviors for collisions

Beginning with the current kinematic state at , the alternative trajectory has to be checked for violations of the specified safety distance S. To do so, position setpoints are generated for each robot joint by sampling the alternative trajectory at a frequency of . We assume that a kinematic model of the robot is given and that the shape and position of each obstacle are known. For every position setpoint, the distance between each observed obstacle-link pair and between each observed link-link pair is computed. Although no forces have to be computed, we use the getClosestPoints() function provided by the physics engine PyBullet [22] to calculate the corresponding distances. If a single distance is smaller than the specified safety distance S, the whole trajectory is considered as unsafe and a braking process is initiated. Note that the computational effort depends on the duration of the alternative trajectory and the selected frequency of distance calculations . For computational efficiency it is preferable to decrease the frequency and to compensate the resulting inaccuracy by increasing the safety distance S. In our experiments, is set to .

We note that our collision avoidance method checks the setpoints of the alternative trajectory rather than the resulting actual values. Typically, trajectory tracking controller of industrial robots are able to closely follow the desired reference trajectory, so that the tracking error can be compensated by defining a relatively small safety distance. If higher accuracy is required, the behavior of the trajectory controller can be approximated by a controller model. The resulting actual position values can then be pre-estimated and checked for violations of the safety distance. In case of our robot, the trajectory controller can be approximated by a first-order low-pass filter with a time constant of

.

Iii-F Checking alternative behaviors for torque limit violations

The torque required to follow a trajectory does not only depend on the desired joint acceleration but also on the current position of the robot and forces like gravity and friction. In order to detect potential torque limit violations, the execution of the alternative trajectory is simulated using the physics engine PyBullet. For our experiments, the simulation frequency is set to . After every simulation step, the torque applied to each joint is read out and checked for compliance with the specified limits. If a violation is detected, the alternative safe trajectory computed in the previous decision step is executed. While running a complete physics simulation is more time-consuming than the method described in III-E, the approach is very flexible with regard to the observed constraint. For instance, the same principle could also be used to limit contact forces between the robot and elastic obstacles. We note that collisions with rigid obstacles typically lead to torque limit violations. Thus, collisions can also be prevented by checking the torque limits only. However, in contrast to the approach described in III-E, no safety distance can be specified. When applying our approach to real robots, torque limit violations can only be prevented if the physics engine is able to accurately model the forces resulting from the execution of the alternative behavior. While the simulation accuracy can be improved based on system identification [2], we assume that a perfect simulation model is given. In practice, model and parameter errors can be compensated by subtracting a safety margin from the actual torque limits.

Iii-G Limitations

In this section, we discuss potential limitations of our approach: Firstly, the use of alternative safe behaviours ensures that the specified safety constraints are not violated. However, the motion predicted by the neural network might get adapted although a feasible way to meet the constraints exists. For example, the braking trajectory itself might cause torque violations, if the acceleration limits specified in (7) are set to high values. In this case, the false positive rate can be reduced by selecting smaller brake accelerations. However, the computational effort increases when the robot is slowed down slowly as the duration of the alternative trajectory increases. Secondly, a moving obstacle can cause a collision even if there is a way to avoid the obstacle. For instance, no evasive movement is performed if the robot has already been brought to a standstill. Thirdly, our implementation assumes that a robot can be safely stopped at any joint configuration. While this is true for industrial robots, legged robots might lose their balance if stopped in unstable positions.

All three limitations can be mitigated by searching for an optimized alternative safe behavior rather than computing a single, deterministic braking trajectory. We intend to investigate optimized alternative trajectories in our future work.

Iv Evaluation

Iv-a Description of evaluation scenarios

We evaluate our approach by learning a reaching task with up to three simulated KUKA iiwa robots. The environments used for our evaluation are shown in Fig. 6. Renderings of the resulting robot motions can be found in the accompanying video. TABLE I summarizes the most important properties of each environment. The environment with a single robot corresponds to the setup in Fig. 3, which is used for our real-world experiments. A total of six obstacles are observed, namely a monitor in front of the robot, the table and four virtual walls around the table. In case of two and three robots, the table is the only obstacle in the environment. However, collisions between the robots are prevented by observing a large number of link-link pairs. Each of the robots is trained to reach randomly placed target points with its last link, which is colored green, purple and cyan, respectively. In Fig. 6, the target points are visualized by green spheres. The color indicates to which robot the target point is assigned. As soon as a target point is reached, a new one is placed. The duration of each episode is set to eight seconds. We note that all robots are controlled by a single, centralized agent. The average number of target points reached per episode indicates the performance of the agent.

Environment DOFs Obstacles Obstacle-link pairs Link-link pairs
One robot 7 6 36 0
Two robot 14 1 12 72
Three robots 21 1 18 216
TABLE I: Properties of the environments used for evaluation.

The maximum values for the position, velocity and torque of each joint are selected according to the official specification of the KUKA iiwa 7 R800 [23]. Since no acceleration limits are provided by KUKA, the maximum accelerations are taken from a comparable Franka Emika robot [24]. The maximum jerk for each joint is limited as the acceleration is linearly interpolated between the decision steps:

(9)

where is the prediction frequency of the neural network. To evaluate our approach with different configurations, we define the following naming convention for our experiments:

  • Protection mode P

    Neither collision nor torque limit prevention

    C

    Collision prevention

    CT

    Collision and torque limit prevention

  • Safety distance S for collision prevention (in centimeter)

  • Specified joint limits J

    A

    Pos: ,

    Vel: , Acc: , Jerk: , Tor:

    B

    Pos: ,

    Vel: , Acc: , Jerk: , Tor:

The factors defined by J A and J B are multiplied with the the corresponding maximum values. For trained agents, the following additional arguments are specified:

  • Distance threshold D of a penalty for being close to the safety distance, given in centimeter, if applied.

  • Target point generation for multiple robots T

    A

    A single target point assigned to the robots in alternating order.

    S

    One target point per robot at the same time.

Configuration Closest
distance
Safety zone Adaptation rate Max. deceleration time
One robot
    

P –

/

S 5

/ J A
    

P CT

/

S 5

/ J A
    

P –

/

S 5

/ J B
    

P C

/

S 5

/ J B
    

P CT

/

S 5

/ J B
Two robots
    

P –

/

S 1

/ J A
    

P CT

/

S 1

/ J A
Three robots
    

P –

/

S 1

/ J A
    

P CT

/

S 1

/ J A
TABLE II: Evaluation of collision prevention based on 900 episodes generated by a random agent.
Configuration Max. normalized joint values Episodes with
Pos. Vel. Acc. Jerk Tor. Tor. violations
One robot
    

P –

/

S 5

/ J B
    

P C

/

S 5

/ J B
    

P CT

/

S 5

/ J B
Two robots
    

P –

/

S 1

/ J B
    

P C

/

S 1

/ J B
    

P CT

/

S 1

/ J B
Three robots
    

P –

/

S 1

/ J B
    

P C

/

S 1

/ J B
    

P CT

/

S 1

/ J B
TABLE III: Evaluation of the compliance with joint constraints based on 900 episodes generated by a random agent.
  Configuration Target points reached Adaptation Influence time Deceleration time Closest Episodes with
Random Trained R1 R2 R3 rate per intervention prior to intervention distance Tor. violations
  One robot
    

P –

/

S 5

/

J A

/ D –
    

P CT

/

S 5

/

J A

/ D –
    

P –

/

S 5

/

J B

/ D –
    

P CT

/

S 5

/

J B

/ D –
  Two robots
    

P C

/

S 1

/

J A

/

D –

/

T S

    

P C

/

S 1

/

J A

/

D 5

/

T S

    

P C

/

S 1

/

J A

/

D 5

/

T A

  Three robots
    

P C

/

S 1

/

J A

/

D 5

/

T A

TABLE IV: Evaluation of performance metrics for trained agents based on 900 episodes.

Iv-B Adherence to safety constraints

As a first step, we evaluate the compliance with the specified safety constraints. For each experiment, 900 episodes are generated by a random agent, which corresponds to a total execution time of two hours. The results are shown in TABLE II and TABLE III. If no alternative safe behaviors are computed ( P – ), the specified safety constraints are violated in every experiment. In contrast, our method ( P CT ) reliably prevents collisions and torque limit violations. However, it can be seen that the closest distance between observed obstacle-link pairs or observed link-link pairs is slightly smaller than the specified safety distance S, which means that the time fraction in the so-called safety zone is not zero. For our evaluation, the closest distance is measured at a frequency of , whereas the alternative behavior is checked for collisions at a rate of . The trade-off between accuracy and computational efficiency was already discussed in III-E. When repeating the experiments with , the closest distance is not smaller than the specified safety distance. The adaptation rate shown in TABLE II indicates the proportion of actions that are adjusted by our method. The additionally listed maximum deceleration time is an important metric for real-time execution as the computational effort depends on the duration of the braking trajectories used as alternative behaviors. As can be seen in TABLE III, kinematic limits are never violated due to the action space used. The very high torque values that occur without calculating alternative behaviors ( P – ) result from collisions. We note that the stated kinematic joint values are not affected by collisions since they refer to the setpoints given to a trajectory controller. As shown in TABLE IV, the specified safety constraints are also respected by trained agents.

Iv-C Evaluation of the learning performance

To evaluate the learning performance for each experiment, we use Proximal Policy Optimization (PPO) [25]

to train neural networks with 256 neurons in the first hidden layer and 128 neurons in the second hidden layer. The task-specific part of the state

contains the Cartesian position of each target point and the relative position between a target point and the correspondingly colored robot link. In the case of alternating target points ( T A ), an additional input signal indicates whether the target point is active or not. The reward assigned to an action results from the reduction of the distances between the target points and the correspondingly coloured robot links. As can be seen in TABLE IV, the average number of target points reached within one episode increases significantly during training. For purposes of comparison, we evaluate the learning performance of agents trained without computing alternative safe behaviors ( P – ). While the number of target points is lower when collisions and torque limit violations are prevented ( P CT ), it is important to note that without collision prevention, the agent has a time advantage as the robot can move through the monitor. If lower joint limits are selected ( J B ), the performance decreases since the maximum velocity of the robot is reduced. For two and three robots we do not observe torque limits ( P C ) to speed up the time required for training. Nevertheless, torque limit violations do not occur since higher torque limits are selected ( J A ). In our first experiment with simultaneous target points ( T S ), the adaptation rate exceeds . Since the predicted actions are overwritten, the adaptation rate should not be too high to avoid negative effects on the performance. When penalizing alternative trajectories that are closer than to any observed obstacle-link pair or link-link pair ( D 5 ), the adaptation rate decreases and the learning performance improves. However, a large performance difference is observed between robot 1 (R1) and robot 2 (R2) as the agent prefers to reach target points with a single robot while keeping the second robot away from potential collisions. In order to balance the performance between the robots, alternating target points (T A) are introduced. Our experiments with two and three robots show that the robots contribute almost equally to the overall performance when alternating target points are used. We additionally evaluate the average influence time of each braking process and the deceleration time required to bring the robots to a full stop. It can be seen that the average influence time is significantly smaller than the deceleration time, which means that in most cases a feasible action is predicted before the robots are fully stopped. Exemplary movements of trained agents are shown in the accompanying video.

Iv-D Real-time capability and sim-2-real transfer

In order to apply our approach to a real robot, alternative safe behaviors must be computed in real time. As a first step, we analyze the average and the maximum computation time of simulated episodes with a duration of using an Intel i9-9900K CPU. The results are shown in TABLE V.

Configuration One robot Two robots Three robots
Mean Max. Mean Max. Mean Max.

P –

/

S 0.05

/ J B

P C

/

S 0.05

/ J B

P CT

/

S 0.05

/ J B
TABLE V: Computation times based on 900 episodes.

The maximum time required to compute an episode for a single robot while preventing collisions and torque limit violations ( P CT ) was . For real-time execution, two threads are started by our program: The first one computes the desired kinematic state at the end of the subsequent decision step considering potential adaptations due to safety violations. As shown in Fig. 7, these computations are performed in advance to the actual decision step. In order to check the alternative safe behavior for torque violations, the actual position at decision step has to be known. To enable the desired precalculation, we model the behavior of the trajectory controller as first order low-pass with a time constant of .

Actionprediction

Safety checks fordecision step

Precalculation
Fig. 7: Safety checks are performed in advance.

The second thread keeps synchronized with the trajectory controller of the real robot while interpolating position setpoints at a frequency of . As can be seen in the accompanying video, we successfully transferred a trained policy including collision and torque limit prevention (P CT) to a real robot.

V Conclusion and future work

This paper presented a method to learn collision-free and torque-limited robot trajectories. Conflicting constraints are avoided by ensuring the existence of an alternative safe behavior at each decision step. Violations of kinematic joint limits are prevented by the design of the action space used for reinforcement learning. Our evaluation with up to three simulated robots showed that neither random agents nor trained agents violated the specified constraints. Safe motions for a single robot could be computed in real-time.

In future work, we would like to apply our approach to other constraints like Cartesian velocity limits and contact forces. Moreover, we are interested in investigating the use of more sophisticated alternative safe behaviors.

Acknowledgment

This research was supported by the German Federal Ministry of Education and Research (BMBF) and the Indo-German Science & Technology Centre (IGSTC) as part of the project TransLearn (01DQ19007A). We appreciate the access to the KUKA Robot Learning Lab at KIT and thank Tamim Asfour for his valuable feedback and advice.

References

  • [1] S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates,” in 2017 IEEE international conference on robotics and automation (ICRA).   IEEE, 2017, pp. 3389–3396.
  • [2] M. Kaspar, J. D. Muñoz Osorio, and J. Bock, “Sim2real transfer for reinforcement learning without dynamics randomization,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020, pp. 4383–4388.
  • [3] J. C. Kiemel, R. Weitemeyer, P. Meißner, and T. Kröger, “TrueÆdapt: Learning smooth online trajectory adaptation with bounded jerk, acceleration and velocity in joint space,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020, pp. 5387–5394.
  • [4] G. Thomas, M. Chien, A. Tamar, J. A. Ojea, and P. Abbeel, “Learning robotic assembly from cad,” in 2018 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2018, pp. 1–9.
  • [5] T. Fraichard, “A short paper about motion safety,” in Proceedings 2007 IEEE International Conference on Robotics and Automation, 2007, pp. 1140–1145.
  • [6] J. C. Kiemel and T. Kröger, “Learning robot trajectories subject to kinematic joint constraints,” in 2021 International Conference on Robotics and Automation (ICRA) (accepted), 2021.
  • [7] S. Rubrecht, V. Padois, P. Bidaud, M. De Broissia, and M. D. S. Simoes, “Motion safety and constraints compatibility for multibody robots,” Autonomous Robots, vol. 32, no. 3, pp. 333–349, 2012.
  • [8] J. Garcıa and F. Fernández, “A comprehensive survey on safe reinforcement learning,”

    Journal of Machine Learning Research

    , vol. 16, no. 1, pp. 1437–1480, 2015.
  • [9] J. Ibarz, J. Tan, C. Finn, M. Kalakrishnan, P. Pastor, and S. Levine, “How to train your robot with deep reinforcement learning: lessons we have learned,” The International Journal of Robotics Research, p. 027836492098785, Jan 2021. [Online]. Available: http://dx.doi.org/10.1177/0278364920987859
  • [10] M. Pecka and T. Svoboda, “Safe exploration techniques for reinforcement learning–an overview,” in International Workshop on Modelling and Simulation for Autonomous Systems.   Springer, 2014, pp. 357–375.
  • [11] E. Altman, Constrained Markov decision processes.   CRC Press, 1999, vol. 7.
  • [12] J. Achiam, D. Held, A. Tamar, and P. Abbeel, “Constrained policy optimization,” in International Conference on Machine Learning, 2017, pp. 22–31.
  • [13] A. Ray, J. Achiam, and D. Amodei, “Benchmarking safe exploration in deep reinforcement learning,” arXiv preprint arXiv:1910.01708, 2019.
  • [14] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke, “Sim-to-real: Learning agile locomotion for quadruped robots,” arXiv preprint arXiv:1804.10332, 2018.
  • [15] O. M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, et al., “Learning dexterous in-hand manipulation,” The International Journal of Robotics Research, vol. 39, no. 1, pp. 3–20, 2020.
  • [16] O. Khatib, “Real-time obstacle avoidance for manipulators and mobile robots,” in Autonomous robot vehicles.   Springer, 1986, pp. 396–404.
  • [17] C. W. Warren, “Global path planning using artificial potential fields,” in 1989 IEEE International Conference on Robotics and Automation.   IEEE Computer Society, 1989, pp. 316–317.
  • [18] F. Flacco, T. Kröger, A. De Luca, and O. Khatib, “A depth space approach to human-robot collision avoidance,” in 2012 IEEE International Conference on Robotics and Automation.   IEEE, 2012, pp. 338–345.
  • [19] B. Faverjon and P. Tournassoud, “A local based approach for path planning of manipulators with a high number of degrees of freedom,” in Proceedings. 1987 IEEE international conference on robotics and automation, vol. 4.   IEEE, 1987, pp. 1152–1159.
  • [20] T.-H. Pham, G. De Magistris, and R. Tachibana, “Optlayer-practical constrained optimization for deep reinforcement learning in the real world,” in 2018 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2018, pp. 6236–6243.
  • [21] F. Kanehiro, F. Lamiraux, O. Kanoun, E. Yoshida, and J.-P. Laumond, “A local collision avoidance method for non-strictly convex polyhedra,” Proceedings of robotics: science and systems IV, p. 33, 2008.
  • [22] E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning,” 2016.
  • [23] “Kuka sensitive robotics lbr iiwa,” p. 30, 2017. [Online]. Available: https://www.kuka.com/-/media/kuka-downloads/imported/9cb8e311bfd744b4b0eab25ca883f6d3/kuka˙lbr˙iiwa˙brochure˙en.pdf
  • [24] “Franka emika robot and interface specifications,” 2017. [Online]. Available: https://frankaemika.github.io/docs/control˙parameters.html
  • [25] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.