PPMC Training Algorithm: A Robot Independent Rough Terrain Deep Learning Based Path Planner and Motion Controller

03/02/2020 ∙ by Tamir Blum, et al. ∙ Tohoku University

Robots can now learn how to make decisions and control themselves, generalizing learned behaviors to unseen scenarios. In particular, AI-powered robots show promise in rough environments like the lunar surface, due to the environmental uncertainties. We address this critical generalization aspect for robot locomotion in rough terrain through a training algorithm we have created called the Path Planning and Motion Control (PPMC) Training Algorithm. This algorithm is coupled with any generic reinforcement learning algorithm to teach robots how to respond to user commands and to travel to designated locations on a single neural network. In this paper, we show that the algorithm works independent of the robot structure, demonstrating that it works on a wheeled rover in addition to the past results on a quadruped walking robot. Further, we take several big steps towards real-world practicality by introducing a rough, highly uneven terrain. Critically, we show through experiments that the robot learns to generalize to new rough terrain maps, retaining a 100% success rate. To the best of our knowledge, this is the first paper to introduce a generic training algorithm teaching generalized PPMC in rough environments to any robot, using just reinforcement learning.




I Introduction

As artificial intelligence (AI) progresses, we are finally gaining the ability to test past predictions of its capabilities and find real applications in a wide array of fields. One promising area for AI is robotics, particularly decision making and control in rough environments, such as celestial body exploration or disaster scenarios: environments that are highly unstructured and require real-time information processing. In such environments we often lack sufficient a priori knowledge of the environment, and the scenarios are hazardous enough for humans that they are better avoided where possible.

This makes artificial intelligence a promising tool, due to its ability to process large amounts of data in real time, increasing automation and environmental awareness. Machine learning can be broken into three subcategories: supervised learning, unsupervised learning and reinforcement learning (RL). Of these, this paper focuses on reinforcement learning, which we believe holds special promise for the field of robotics. A reinforcement learning agent learns continuously by interacting with the environment, exploring and maximizing a reward function during the training period. The resulting decision-making process, called a policy, is stored in a neural network, which takes in state information about the robot, the environment and the goals, and translates it into actions such as wheel speeds.

Fig. 1: CLOVER Rover robot in the simulated rough terrain emulating a lunar environment. Photo credit: NASA (lunar surface)

The 3 main pillars of the PPMC algorithm are:

  1. Region Enabled Travel

  2. Multipoint Travel

  3. Respond to User Commands

Region enabled travel is a generalized ability to travel anywhere in a map, rather than just memorizing how to traverse to the same location each time. Multipoint travel is the ability to travel to points in succession, giving us the ability to design complicated paths. Lastly, we want the robot to be controllable and to respond to user commands, in order to increase its usefulness and give us more control over its actions and movements, allowing us to tell the robot to avoid certain areas or to purposely pass through others.

In this paper, we first touch upon related work detailing past inroads reinforcement learning researchers have made in the field of robotics, and then address a novel area: applying our PPMC training algorithm to a rough-terrain environment. In a past paper, Tohoku University researchers introduced an early version of the PPMC algorithm with limited capabilities for a quadruped robot. This paper extends the algorithm to work independent of robot structure, while also improving it to gain generalization capabilities and increasing the terrain complexity, bringing it closer to many real environments. We show that path planning and motion control are possible in a robust and reliable manner, even in complicated environments. Everything is learned purely through model-free reinforcement learning with a customized reward function.

Fig. 2: Ternary RL architecture: Agent – User – Environment

Fig. 3: The robot path is shown for 30 trials for 4 test case paths in all 3 maps, showing generalization capabilities both beyond the training region (the black inner rectangle) and beyond the training map. The grey circle in the center is the origin, the green square is the first goal and the red square is the final goal. Although the map is bumpy and 3D in nature, we show a 2D projection for simplicity. As can be seen, the robot is able to get to both goals with a 100% success rate even on the maps it was not trained on. Slight path deviations can be seen based on trial and based on map.

II Related Work in Reinforcement Learning and Robotics

Our work builds upon and references related work in a number of subfields of robotics and reinforcement learning. Reinforcement learning has been applied to robotics for some time now in areas such as motion control, mapless navigation, mapped navigation and simulation-to-real-world "transfer learning". In many cases it has been applied in simulation, while in a more limited number of cases it has been applied to real robots, either through transfer learning techniques or by training directly in the real world.

There are several noteworthy works dealing with locomotion and path planning through reinforcement learning. Oxford researchers were able to combine reinforcement learning guided policy search with supervised learning in order to control a tensegrity rover’s locomotion in bumpy terrain[16]. Similar to this work, several other works have combined reinforcement learning with other techniques, such as learning from expert behavior and combining a mixture of local and global path planners[21]. HKUST researchers were able to teach a wheeled robot to path plan in a cluttered flat environment[17].

There has also been some research focused on learning a specialized policy for one particular complex environment, without the need to generalize[8][11]. Several works have focused on applying simulated results to real-world robots, which will be useful in future work[17][18].

There has also been research conducted at JPL on the current limitations and potential AI applications for Mars rovers and other space exploration robots. One piece focuses on the current lack of environmental awareness[14], and another on the potential energy savings that could come from the use of computer vision systems[9].


Some researchers have also taken a look at the human brain and its decision-making process, drawing an analogy to complex AI systems[12]. While that work operates on a macro level with many agents, we are seeking to do the same thing on a micro scale, having a single neural network conduct multiple tasks on a single agent. This will become more important as we add additional systems such as vision and other sensors.

There have been several works also combining both traditional controls with reinforcement learning based controls, finding it useful for hard to model tasks with friction or for walking[10][20]. There have also been works exploring adding humans into the RL loop, such as by having them judge performance[4].

In past work, researchers at Tohoku University’s Space Robotics Lab introduced a prototype of the PPMC algorithm that taught limited capabilities to a quadruped walking robot [1]. The work focused on teaching path planning and motion control in a single quadrant of the training map, without generalization to the rest of the map or to other maps not trained on. It showed that the training algorithm was independent of any specific reinforcement learning algorithm.

Although research at the intersection of robotics and reinforcement learning shows great potential for applying AI in the real world and in rough environments like Mars, there is still much work to be done, particularly on generalization capabilities. This work introduces a training algorithm that trains the robot on rough terrain, independent of its system architecture or of a specific reinforcement learning algorithm. Our experiments show generalized capabilities outside the training region and in newly created maps of varying roughness.

III PPMC Algorithm

III-A Overview

The PPMC algorithm uses reinforcement learning to train a robot-agent system to conduct path planning and motion control, and to respond to user-given goals. It works by using observable goals and randomized waypoints during episodic training. A training perimeter is specified by the user at the start of training. At the start of each episode, two goals, a waypoint and a final goal, are given to the robot. The robot must traverse to the given locations before time is up and the episode resets. Each episode lasts a given duration, and the time limit is increased if the robot reaches a waypoint, to give it time to get to the final goal. The reinforcement learning algorithm then uses the data generated from the episode to improve the decision-making process of the robot, called a policy, which is stored on a neural network. Learning is done on a model-free basis, with the initial policy containing just random noise and exploring until it learns properly according to the customized reward function explained below. The algorithm is described below in pseudocode.

III-B Algorithm

  for e = 1 to E do
     Generate M randomized waypoints within TP
     Set G to the first waypoint
     Set R(G) for the current goal
     Run simulation with Alg, env
     if |GX − PX| < BX and |GY − PY| < BY then
        Update R(G)
        Update G to the next goal
        tep += tinc
     end if
     if t > tep then
        End episode, e += 1
     end if
  end for
Fig. 4: PPMC Algorithm

Algorithm Terminology: TP = training perimeter; M = number of waypoints; Alg = learning algorithm; BX, BY = boundary threshold for X and Y; tinc = episode length increase; tep = initial episode length; t = current time; e = episode counter; GX, GY = goal X and Y coordinates; PX, PY = current X and Y coordinates of robot body; R(G) = reward function w.r.t G; G = goal array; env = environment
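As an illustrative sketch (not the exact implementation), the episode loop of Fig. 4 can be written in Python as follows; `run_step` is a hypothetical stand-in for one simulator step under the current policy, and the thresholds, perimeter size and durations are illustrative values:

```python
import random

def train_ppmc(num_episodes, num_waypoints, perimeter, t_ep0, t_inc,
               run_step, bx=0.5, by=0.5):
    """Sketch of the PPMC episode loop. run_step(goal) stands in for one
    simulator step with the learning algorithm and returns the robot
    body position (px, py)."""
    for e in range(num_episodes):
        # Generate M randomized waypoints within the training perimeter TP
        goals = [(random.uniform(-perimeter, perimeter),
                  random.uniform(-perimeter, perimeter))
                 for _ in range(num_waypoints)]
        g = 0                      # index of the current goal G
        t, t_ep = 0.0, t_ep0      # current time and episode time limit
        while t < t_ep:
            px, py = run_step(goals[g])
            t += 1.0
            # Goal reached when within the boundary thresholds BX, BY
            if abs(goals[g][0] - px) < bx and abs(goals[g][1] - py) < by:
                if g + 1 < len(goals):
                    g += 1              # update G; reward now tracks next goal
                    t_ep += t_inc       # extend the episode time limit
                else:
                    break               # final goal reached: end episode
```

The reward bookkeeping, R(G), lives inside the learning algorithm and is omitted here; only the goal-switching and time-extension logic of the training algorithm is shown.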

III-C Improved Results

In previous work, researchers at Tohoku University first introduced the PPMC algorithm[1], which works independent of the reinforcement learning algorithm, on a quadruped walking robot. In this work, we show that the algorithm also works independent of robot system architecture by showcasing its ability to control a wheeled rover. We use the Actor Critic using Kronecker-Factored Trust Region (ACKTR) algorithm, as it performed best among the algorithms tested in the previous work[19][5].

We also removed some of the limitations and simplifications of the previous work. Namely, the previous work trained on only one quadrant of the training map, the (+X, +Y) quadrant, dubbed the "training perimeter". For this work, we expanded the training perimeter to a square centered at the robot origin. Even though the map was bigger than the training perimeter, the robot learned to generalize and travel anywhere within the map, even outside of the training perimeter, as seen in Case 1 of Fig. 6. This is important because decreasing the size of the training perimeter can reduce training time: it allows shorter episode lengths, which is particularly useful early in the training process.

One of the most critical results of this paper is the resulting ability to generalize, not just outside the training perimeter but also to other maps with different slope gradients and different bump amplitudes and frequencies, as can be seen by comparing the 3 maps in Fig. 6 as well as in TABLE I. Not only this, but the robot traversed to the given goals with a 100% success rate regardless of which map it was tested on, thus overcoming one of the biggest shortfalls of the previous research.

Several changes were made from the original implementation of PPMC in order to improve performance. One of the critical changes included an increased state array with additional information not included previously. This information helped the robot learn locomotion better. More information will be given below.

IV Robot-Specific Setup

Pre-processing of data: The data should be pre-processed such that all observations are on a scale of -1 to 1. This also applies to angular data. We discovered that linearizing angular data, such as the yaw angle of the robot, by taking its sine and cosine, allows for faster training, although it does not improve final performance. This preprocessing must also be done for goal information.
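As a minimal sketch of this pre-processing step (function names are illustrative, not from the paper), a yaw angle can be replaced by its sine/cosine pair, which stays in [-1, 1] and avoids the wrap-around discontinuity at ±π, and other observations can be linearly rescaled:

```python
import math

def encode_angle(yaw):
    """Replace a raw angle (radians) with its sin/cos pair, each in [-1, 1]."""
    return math.sin(yaw), math.cos(yaw)

def normalize(value, low, high):
    """Linearly map an observation from [low, high] onto [-1, 1]."""
    return 2.0 * (value - low) / (high - low) - 1.0
```

The same `normalize` mapping would be applied to the goal coordinates before they are concatenated onto the state array.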

Post-processing of actions: The action values should be post-processed such that the range of -1 to 1 is scaled to the range of useful actions for the robot. In the case of the rover, the actions were scaled to the max motor speed going forward and in reverse. In the case of the walking robot, the actions were scaled to the range of motion of each joint.
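This post-processing is the inverse of the observation scaling; a sketch, with an illustrative (not actual) max wheel speed:

```python
def scale_action(a, low, high):
    """Map a policy output in [-1, 1] onto the actuator range [low, high]."""
    return low + (a + 1.0) * 0.5 * (high - low)

# e.g. a symmetric wheel speed command for the rover
MAX_WHEEL_SPEED = 10.0  # rad/s, illustrative value only
wheel_cmd = scale_action(0.5, -MAX_WHEEL_SPEED, MAX_WHEEL_SPEED)
```

For the walking robot, the same function would be applied per joint with `low`/`high` set to that joint's range of motion.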

Episode duration: The episode duration must be chosen appropriately to give the robot enough time to get to the waypoint for both near and far cases. This will heavily depend on the possible speed of the robot and the size of the training perimeter.

Fail criteria: Fail criteria should be picked so as to eliminate undesirable behaviors the agent might adopt. These are essentially a harder alternative to "penalties", though in most cases the same effect could also be achieved through reward function tuning.
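As one concrete, illustrative fail criterion (the threshold value is an assumption, not from the paper), excessive body tilt can trigger early episode termination before the robot rolls over:

```python
def failed(roll, pitch, max_tilt=1.0):
    """Hard fail check: end the episode if the body tilts past a
    threshold (radians), e.g. the robot is about to roll over."""
    return abs(roll) > max_tilt or abs(pitch) > max_tilt
```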

State data: The state data that is passed to the agent must be chosen carefully. Within reason, the agent does not suffer from being given data it does not need; however, it suffers immensely from not being given data it does need. Examples are the current position of the robot and the position of the goal, which we showed in previous research to be critical for learning path planning.

V Experiments

We created several experiments to test the generalization capability of the algorithm, as well as its general ability to conduct path planning and motion control within the rough environment.

We created 3 different maps. One can see the elevation changes as the robot traversed the different maps for "Test Case 1" in Fig. 5. By having the different maps, we can show that the robot did not simply memorize the rough terrain of the training map, but in fact learned to generalize to terrain of variable roughness, both rougher and flatter than the original. This is particularly useful if we do not know the exact roughness of the terrain the robot will need to traverse, as it readies the robot for any scenario. The maps are described below:

  1. The training map (medium level amplitude and frequency for bumps)

  2. Flatter variant map (decreased amplitude and frequency of bumps)

  3. Bumpier variant map (increased amplitude and frequency of bumps)
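The paper does not give the exact terrain generator, but one simple way to produce map variants of this kind is a sum of sinusoids whose amplitude and frequency are scaled per map; all parameters below are illustrative assumptions:

```python
import math

def height(x, y, amplitude=0.1, frequency=0.5):
    """Illustrative bumpy heightmap: scaling amplitude and frequency
    yields flatter or bumpier variants of the same style of terrain."""
    return amplitude * (math.sin(frequency * x) * math.cos(frequency * y)
                        + 0.5 * math.sin(2.3 * frequency * x + 1.0))

# Map 1 (training), Map 2 (flatter variant), Map 3 (bumpier variant)
maps = {1: dict(amplitude=0.10, frequency=0.5),
        2: dict(amplitude=0.05, frequency=0.3),
        3: dict(amplitude=0.20, frequency=0.8)}
```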

Fig. 5: 3D path visualization for Case 1, in which one can see that map 2 is smoother and map 3 rougher than the training map, map 1.

We tested the final robot-agent system on four test cases and on three maps. For each test case and map combination, 30 trials were conducted. The test cases were picked to cover a wide array of possibilities, including one test case outside the training perimeter, one edge-to-edge test case and two other cases testing different angles of travel for both close and far goals. Through these tests, we verify that the system learned to generalize properly and can conduct proper decision making and control for path planning and motion control in rough environments. A 100% success rate was achieved for every test case and every map tested.

Test Case | Goal 1    | Goal 2  | Map 1 | Map 2 | Map 3
1         | (12,7)    | (-12,7) | 100%  | 100%  | 100%
2         | (-10,-10) | (10,10) | 100%  | 100%  | 100%
3         | (7,-1)    | (-7,9)  | 100%  | 100%  | 100%
4         | (0,8)     | (3,10)  | 100%  | 100%  | 100%
TABLE I: Success ratio for two-point path test cases

Fig. 6: Four test cases were chosen to show CLOVER’s path planning and motion control generalization ability. For all four test cases, CLOVER could successfully get to each goal with 100% success rate, even on maps it was not trained on. The black inner box signifies the training perimeter, the green colored box is the first waypoint and the red colored box is the final goal.

VI Setup

VI-A Reward Function

For simplicity, we break the reward function into three components: primary goal rewards, beneficial behavior rewards, and detrimental behavior penalties. The primary goal is to reach the goals, so we specify this component as the velocity with respect to the current goal. For beneficial behaviors, we specify an alive reward, which encourages the robot to avoid early episode termination; fail conditions, including excessive turning or rolling over, trigger early termination. Detrimental behavior is broken down into several components: a penalty for torque usage, a penalty for reversing or driving backwards, and a penalty for turning. Each term is multiplied by a constant to get the right ratio of reward to penalty[2]. These constants need to be tuned to obtain the proper behavior, and together with the terms they form the customized reward function. Finding a good combination is probably the most time-consuming part of setting up training.
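A sketch of such a reward function, with the components named above; all coefficient values are illustrative assumptions that would need tuning, not the paper's actual constants:

```python
def reward(vel_to_goal, torques, backward_speed, yaw_rate,
           alive=True, c_alive=0.05, c_torque=0.001,
           c_back=0.1, c_turn=0.05):
    """Goal-directed velocity plus an alive bonus, minus penalties for
    torque usage, reversing and turning. Coefficients are illustrative."""
    r_goal = vel_to_goal                        # primary: progress toward goal
    r_alive = c_alive if alive else 0.0         # beneficial: avoid termination
    p_torque = c_torque * sum(t * t for t in torques)
    p_back = c_back * max(0.0, backward_speed)  # penalize reversing only
    p_turn = c_turn * abs(yaw_rate)
    return r_goal + r_alive - p_torque - p_back - p_turn
```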


VI-B Simulation and Neural Network Setup

We chose to use CoppeliaSim, a robotics simulator, due to its flexibility in designing complicated environments both for this work and for future works[7]. We used a fixed simulation time step, with Bullet 2.78 selected as the physics engine at its "default" accuracy setting, which governs the internal time step and constraint-solving iterations[3].

We utilized a standard feed-forward neural network. The training algorithm was made to be independent of the learning algorithm and thus should work with almost any RL algorithm.

Fig. 7: Simple feed forward neural network architecture used with 5 hidden layers, 80 hidden units per layer.
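The architecture in Fig. 7 can be sketched in plain Python as below; the layer sizes follow the figure (5 hidden layers of 80 units, 2 outputs for the rover's two motor commands), while the initialization scheme and tanh activations are assumptions of this sketch:

```python
import math
import random

def make_mlp(in_dim, hidden=80, layers=5, out_dim=2):
    """Build weight matrices for the feed-forward policy:
    in_dim -> 80 x5 -> 2. Initialization choice is illustrative."""
    dims = [in_dim] + [hidden] * layers + [out_dim]
    return [[[random.gauss(0.0, 1.0 / math.sqrt(dims[i]))
              for _ in range(dims[i])]
             for _ in range(dims[i + 1])]
            for i in range(len(dims) - 1)]

def forward(weights, x):
    """Forward pass: tanh hidden layers, tanh-squashed outputs in [-1, 1]."""
    for li, layer in enumerate(weights):
        x = [sum(w * xi for w, xi in zip(row, x)) for row in layer]
        if li < len(weights) - 1:
            x = [math.tanh(v) for v in x]
    return [math.tanh(v) for v in x]
```

The outputs are squashed into [-1, 1], matching the action post-processing convention described in Section IV.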

Although the state array is the same size as in the initial publication, it effectively contains more information, due to the reduced number of actuators in the rover compared to the quadruped robot. It consists of (4) motor speeds and (4) motor torques; (3) the x, y and z position of the body; (6) the vectorized velocity of the body in both absolute and relative frames; (3) the angular velocity of the robot; (3) the roll, pitch and yaw orientation of the body; (1) elapsed episode time; and (1) the angle between the robot and the target. The goal array, which is concatenated onto the state array, remains the same size. It consists of (2) the x and y coordinates of the current goal; and (2) the x and y coordinates of the next goal.
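The listed observations can be assembled as follows; this is a sketch of the concatenation only, with hypothetical argument names, and it assumes each entry has already been scaled to [-1, 1] as described in Section IV:

```python
def build_observation(motor_speeds, motor_torques, body_pos,
                      vel_abs, vel_rel, ang_vel, rpy,
                      elapsed_t, angle_to_target, goal_xy, next_goal_xy):
    """Concatenate the state array described above with the goal array."""
    state = (list(motor_speeds)       # 4 motor speeds
             + list(motor_torques)    # 4 motor torques
             + list(body_pos)         # 3: x, y, z position
             + list(vel_abs)          # 3: absolute-frame velocity
             + list(vel_rel)          # 3: body-frame velocity
             + list(ang_vel)          # 3: angular velocity
             + list(rpy)              # 3: roll, pitch, yaw
             + [elapsed_t, angle_to_target])  # 1 + 1
    goal = list(goal_xy) + list(next_goal_xy)  # 2 + 2
    return state + goal
```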

Parameter            | Value
Batch Size           | 40
# Procs              | 32
Learning Rate        | 0.25
Entropy Coeff        | 0.01
Clipping             | 0.001
Discounting (γ)      | 0.99
Value Function Coeff | 0.5
TABLE II: ACKTR Learning Algorithm Parameters

VI-C Simulated Environment

In this paper, we use an uneven bumpy surface for training along with two additional maps during evaluation. These additional maps include a flatter and a bumpier variant of the training map and are meant to show that the robot agent system can generalize across different surfaces.

VI-D Robot

We trained on a four-wheeled rover agent in CoppeliaSim. This agent has only two motors, as it is two-wheel drive, yielding a 2-element action array as the output of the neural network policy, with the left-side wheels and the right-side wheels each set to the same speed. Each of the motors uses velocity control with a max torque of and a max wheel angular velocity of . The chassis is long and wide, and the wheels have a radius of . The overall robotic system weighs .

Fig. 8: Two Space Robotics Lab Rovers, both candidates to try simulation to real world (sim2real) transfer learning on for future work. Left, low speed swarm CLOVER prototype and right, realistic model of HERO (High-speed Exploration ROver).

VII Discussion and Conclusion

To our knowledge, we are the first to showcase a training algorithm that gives robots the capability to generalize in rough, uneven terrain for path planning and motion control solely through reinforcement learning. It is also, to the best of our knowledge, the first robot-architecture-independent training algorithm. The ability to handle such complex terrain and to generalize will be needed if these robots are to be useful in real-world applications, such as disaster aid and space exploration. High reliability is a must, and we were able to reach a 100% success rate for all trials conducted, even in new environments, both rougher and flatter.

The resulting training algorithm achieves the 3 main goals stated above: region enabled travel, multipoint travel and responding to user commands. These goals were chosen with the purpose of making robots useful in more real-world applications.

Future work will build on this foundation by adding in sensor information such as a camera/LIDAR in order to enable more intelligent path selection. We will also conduct some sim2real transfer experiments[6]. This should be considerably easier on our rover robot platforms of HERO and CLOVER, as compared to a walking robot, due to more accurate simulation dynamics[15][13]. As we add more AI powered subsystems, an important question will be how to best integrate these different components (such as vision and motor control) all into one system, and perhaps comparisons to human/animal brains could help us find a good solution. Another important question will be how the robot will fare with uneven friction levels, such as that presented by loose soil/sand and with noisy sensor information.


  • [1] T. Blum, W. Jones, and K. Yoshida (2020-02) PPMC training algorithm: a deep learning based path planner and motion controller. In 2020 International Conference on Artificial Intelligence in Information and Communication (ICAIIC) (ICAIIC 2020), Fukuoka, Japan. Cited by: §II, §III-C.
  • [2] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016-06) OpenAI Gym. arXiv e-prints, pp. arXiv:1606.01540. External Links: 1606.01540 Cited by: §VI-A.
  • [3] E. Coumans (2015-07) Bullet physics simulation. pp. 1. External Links: Document Cited by: §VI-B.
  • [4] C. Daniel, M. Viering, J. Metz, O. Kroemer, and J. Peters (2014) Active reward learning. In Robotics: Science and Systems, Cited by: §II.
  • [5] P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, Y. Wu, and P. Zhokhov (2017) OpenAI baselines. GitHub. Note: https://github.com/openai/baselines Cited by: §III-C.
  • [6] G. Dulac-Arnold, D. Mankowitz, and T. Hester (2019-04) Challenges of Real-World Reinforcement Learning. arXiv e-prints, pp. arXiv:1904.12901. External Links: 1904.12901 Cited by: §VII.
  • [7] M. F. E. Rohmer (2013) CoppeliaSim (formerly v-rep): a versatile and scalable robot simulation framework. In Proc. of The International Conference on Intelligent Robots and Systems (IROS), Note: www.coppeliarobotics.com Cited by: §VI-B.
  • [8] N. Heess, D. TB, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S. M. A. Eslami, M. Riedmiller, and D. Silver (2017-07) Emergence of Locomotion Behaviours in Rich Environments. arXiv e-prints, pp. arXiv:1707.02286. External Links: 1707.02286 Cited by: §II.
  • [9] S. Higa, Y. Iwashita, K. Otsu, M. Ono, O. Lamarre, A. Didier, and M. Hoffmann (2019-10) Vision-based estimation of driving energy for planetary rovers using deep learning and terramechanics. IEEE Robotics and Automation Letters 4 (4), pp. 3876–3883. External Links: Document, ISSN 2377-3774 Cited by: §II.
  • [10] T. Johannink, S. Bahl, A. Nair, J. Luo, A. Kumar, M. Loskyll, J. Aparicio Ojea, E. Solowjow, and S. Levine (2018-12) Residual Reinforcement Learning for Robot Control. arXiv e-prints, pp. arXiv:1812.03201. External Links: 1812.03201 Cited by: §II.
  • [11] W. Jones, T. Blum, and K. Yoshida (2020-01) Adaptive slope locomotion with deep reinforcement learning. In 2020 IEEE/SICE International Symposium on System Integration (SII), Vol. , pp. . External Links: Document, ISSN Cited by: §II.
  • [12] S. K. Kim and S. Kim (2020-02) Brain-inspired method for hyper-connected and distributed intelligence. In 2020 International Conference on Artificial Intelligence in Information and Communication (ICAIIC) (ICAIIC 2020), Fukuoka, Japan. Cited by: §II.
  • [13] M. Laîné, C. Tamakoshi, M. Touboulic, J. Walker, and K. Yoshida (2018-08) Initial design characteristics, testing and performance optimisation for a lunar exploration micro-rover prototype. Advances in Astronautics Science and Technology 1, pp. 1–7. External Links: Document Cited by: §VII.
  • [14] K. Otsu, G. Matheron, S. Ghosh, O. Toupet, and M. Ono (2018-07) Fast Approximate Clearance Evaluation for Rovers with Articulated Suspension Systems. arXiv e-prints, pp. arXiv:1808.00031. External Links: 1808.00031 Cited by: §II.
  • [15] D. Rodríguez-Martínez, M. Winnendael, and K. Yoshida (2019-10) High‐speed mobility on planetary surfaces: a technical review. Journal of Field Robotics 36, pp. 1436–1455. External Links: Document Cited by: §VII.
  • [16] D. Surovik, K. Wang, and K. E. Bekris (2018-09) Adaptive Tensegrity Locomotion on Rough Terrain via Reinforcement Learning. arXiv e-prints, pp. arXiv:1809.10710. External Links: 1809.10710 Cited by: §II.
  • [17] L. Tai, G. Paolo, and M. Liu (2017-03) Virtual-to-real Deep Reinforcement Learning: Continuous Control of Mobile Robots for Mapless Navigation. arXiv e-prints, pp. arXiv:1703.00420. External Links: 1703.00420 Cited by: §II, §II.
  • [18] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke (2018-04) Sim-to-Real: Learning Agile Locomotion For Quadruped Robots. arXiv e-prints, pp. arXiv:1804.10332. External Links: 1804.10332 Cited by: §II.
  • [19] Y. Wu, E. Mansimov, S. Liao, R. Grosse, and J. Ba (2017-08) Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. arXiv e-prints, pp. arXiv:1708.05144. External Links: 1708.05144 Cited by: §III-C.
  • [20] Y. Yang, K. Caluwaerts, A. Iscen, T. Zhang, J. Tan, and V. Sindhwani (2019-07) Data Efficient Reinforcement Learning for Legged Robots. arXiv e-prints, pp. arXiv:1907.03613. External Links: 1907.03613 Cited by: §II.
  • [21] X. Zhou, Y. Gao, and L. Guan (2019) Towards goal-directed navigation through combining learning based global and local planners. In Sensors, Cited by: §II.