Abstract
In this paper we develop a control algorithm for multi-terrain tracked robots with flippers using a reinforcement learning (RL) approach. The work is based on the DDPG algorithm, which has proven very successful in simple simulation environments. The algorithm works in an end-to-end fashion to control the continuous position of the flippers. This end-to-end approach makes it easy to apply the controller to a wide array of circumstances, but this flexibility comes at the cost of a harder learning problem. The task is made even more complex by the fact that real multi-terrain robots move in partially observable environments. Notwithstanding these complications, being able to smoothly control a multi-terrain robot can produce great benefits in the daily lives of impaired people and in search and rescue situations.
Key Words
Deep reinforcement learning, multi-terrain robot, deep deterministic policy gradient, neural networks, end-to-end learning
1 Introduction
Tracked multi-terrain robots are a category of semi-autonomous robots that are becoming increasingly important thanks to their flexibility. They can be valuable in a wide range of situations, from wheelchairs (http://bfree.hk/2017/) to search and rescue applications. This is because they can overcome obstacles of many kinds quite easily while remaining more stable, and thus safer, than their legged counterparts.
To give greater capabilities to these vehicles, additional components such as flippers need to be used, which makes the locomotion system more complex and the control of the robot harder. This increased complexity must be managed in order to improve the user's experience; hence the need for more powerful and flexible control software, whose development is the focus of this work.
Using state-of-the-art deep RL algorithms, we developed an end-to-end controller capable of reading raw data from the cameras and calculating the best positions for the flippers in order to tackle the obstacles found along the robot's path. This allows fast responses while providing greater stability to the platform. Deep RL is promising for this task because, through the optimization of a reward function, it pushes the robot to find the best course of action to achieve its goal. If this reward function is well designed, it can lead to better performance than human operators. Additionally, thanks to the algorithm's ability to generalize, it can learn to solve many different problems, as shown in [1]. This means it can be applied, without excessive modifications, to different settings and environments.
To train the algorithm, we simulated the robot in a virtual environment, letting it learn how to use the flippers to climb a flight of stairs. It is not easy to find solutions through deep RL in situations like these, where the environment is intrinsically partially observable. At the same time, these settings are very similar to the real-world scenarios in which the vehicle will be operated. In fact, an important limiting factor for the application of deep RL algorithms to the real world is that fully observable environments, in which deep RL performs very well, are seldom found in everyday life. Notwithstanding these issues, their application can open new and exciting opportunities for robotics while making many people's lives easier.
The work is structured as follows: Section 2 describes related work; Section 3.1 presents the robot structure and the sensors it is equipped with; Section 3.2 presents the algorithm; Section 4 describes the implementation details; Section 5 presents the results; Section 6 discusses the implications of our approach; conclusions and future improvements are discussed in Section 7, followed by references to related material.
2 Related Work
State-of-the-art deep RL algorithms manage to solve many different tasks. Among them, the two most widely used are DQN [1] and DDPG [2]. The former managed to master many Atari games, often better than human players, using just raw pixel images as input and outputting discrete actions. The latter achieved something more complex: the use of continuous actions. This could be the key to extending deep RL to real environments, where discrete actions alone might not be enough to guarantee good performance. Another important result was obtained in [3], where the authors applied deep RL to a real and complex robot to make it learn how to move in the most efficient way. It needs to be noted, though, that among these works only the first two used deep RL in an end-to-end fashion, and only in game-style simulators with mostly fully observable environments.
A problem of RL is the time required for training and for collecting a sufficient number of samples. This issue has been addressed in [4], where agents working in parallel were used for sample collection, significantly speeding up the network's training. In [5] the authors applied an asynchronous deep RL algorithm to a mobile robot to navigate it through unseen environments, obtaining better results than previously available algorithms.
When solving complex problems with RL, getting the algorithm to converge can be tricky. To address this, many techniques have been developed, such as replay memory [6] and auxiliary target networks [1]. Another technique, used in [7], is to give the agent auxiliary tasks to solve, which differ from the main goal; this pushes the agent to explore more, leading to better solutions.
If RL algorithms are to be applied to real-world robots, the risk that they can damage the environment or the robot itself while exploring different policies needs to be considered. To overcome this problem, Pecka et al. [8] and Zimmermann et al. [9] added constraints to the exploration policies. Although they also used a tracked robot with flippers, this solution is not feasible when using neural networks because, as of now, it is not possible to control their learning process through constraints. Thus, in our work we relied on training the agent in a simulation environment, which allowed the algorithm to test all the policies needed with no real harm.
3 Preliminaries
In this section, we present the robot we used, the simulation environment and the algorithm itself.
3.1 Robot
The robot is equipped with tracks and flippers, two in the front and two in the rear. During our work, we created a simulated version of it in order to safely train the algorithm. The robot has been simulated in V-REP (http://www.coppeliarobotics.com/) and is based on the tracked wheelchair by B-Free (http://bfree.hk/2017/). Fig. 0(a) shows the robot model, with the four flippers and the cameras, in the simulation environment while it is trying to climb a flight of stairs. As shown in the figure, the robot is equipped with two depth cameras, one in the front and one in the back, whose field-of-view (FOV) volumes are delimited by the blue lines and have the same dimensions as a Kinect FOV. The cameras' resolution is pixels.
In addition to the cameras, the sensing equipment is completed by a 6-axis IMU. This sensor is mounted on the central vertical axis of the platform and is composed of a 3D gyroscope and a 3D accelerometer: this allows the robot to know its own inclination and acceleration at any moment during operation.
Fig. 0(b) shows the inputs coming from the two vision sensors while the robot is in the configuration shown in Fig. 0(a). On the left are the data coming from the front camera, where the differences in depth produced by the stairs are visible. On the right are the back camera's readings, which point at the ground and thus show only a smooth change in depth.
The four flippers can change their orientation between two limits measured from the horizontal position, which is considered the zero angle. It is possible to control each flipper separately, but in order to reduce the action space from four dimensions to two, and to make the simulation more similar to the real platform, the right and left flippers were paired.
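The pairing above can be sketched as a simple mapping from the 2-D policy output to the four flipper joints. This is a hypothetical helper, and the joint ordering is an assumption:

```python
import numpy as np

def expand_action(action):
    """Map the 2-D policy output (front, rear) to the four flipper
    joints, mirroring each command to the left/right pair.
    Hypothetical helper; the joint ordering is an assumption."""
    front, rear = action
    # [front-left, front-right, rear-left, rear-right]
    return np.array([front, front, rear, rear])
```

Pairing halves the dimensionality the policy has to explore, at the cost of ruling out asymmetric flipper configurations.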
3.2 Deep Learning
To address problems with RL, we need to look for policies and state values, and these values need to be stored somewhere. There are two ways to do this: in a lookup table or in a function approximator. The former works adequately for trivial problems, being easier to implement and allowing the value associated with each state to be known at every moment. However, in more complex situations, with high-dimensional observations and actions, the table can become quite resource intensive [10][11]. Hence the need for more elaborate but also more powerful function approximators to store and retrieve these values. In deep learning, neural networks are used as function approximators, allowing the handling of continuous actions, as done in DDPG. This is why our project is based on this algorithm to develop a good controller for our robot.
DDPG uses two networks: the Critic net for the Q value and the Actor net for the actions. To help the networks' convergence, two additional networks are used, called target networks, whose outputs are used as targets for training the Critic and the Actor. The weights of these two target networks are updated with soft updates according to:

θ' ← τ θ + (1 − τ) θ'    (1)

where θ' are the parameters of the target nets and θ those of the two main networks. The weights of the main networks are updated through stochastic gradient descent.
This algorithm, in a fully observable environment, is capable of obtaining good policies in an end-to-end fashion: it can be fed raw data, without pre-processing, and, thanks to the capabilities of neural networks, output appropriate actions for each situation. What we want to achieve is a way to apply this power to more complex, real-world-like situations. The flow of DDPG is shown in Algorithm 1.
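The soft target update in (1) can be written in a few lines. This is a minimal PyTorch sketch, assuming both networks are `torch.nn.Module` instances with matching parameter layouts; the `tau` value is a placeholder:

```python
import torch

def soft_update(target_net, main_net, tau=1e-3):
    """Polyak-average the main network's weights into the target:
    theta' <- tau * theta + (1 - tau) * theta'."""
    with torch.no_grad():
        for tgt, src in zip(target_net.parameters(), main_net.parameters()):
            tgt.mul_(1.0 - tau).add_(tau * src)
```

Because tau is small, the targets trail the main networks slowly, which stabilizes the bootstrapped Q-learning targets.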
4 Implementation
In this section, we formalize, from a mathematical point of view, the problem we want to solve and describe in detail how we approach its solution.
4.1 Problem Definition
In this paper we aimed at implementing an end-to-end algorithm capable of reading raw sensor data and, after some elaboration, helping to control a tracked robot. What we are looking for is a function of the form:

a_t = π(s_t)    (2)

where a_t is the action taken at time t and s_t is the state the robot is in at time t. The state is defined as:

s_t = (D^f_t, D^b_t, I_t)    (3)

where D^f_t and D^b_t are the readings at time t of the front and back depth cameras respectively, and I_t is the IMU data at time t.
The data coming from each camera are shaped as a 3-dimensional matrix containing four frames captured at regular time intervals. This way the robot can infer its velocity and the changes in its orientation and inclination due to the actions it is performing; a similar strategy is also used in [1] and [2]. The data coming from the IMU are a 6-dimensional vector: each element of this vector is calculated as the average of four values captured at the same time instants as the cameras' frames. Using this running average over the IMU data, we manage to smooth out the intrinsic noise of the sensor while still retaining the important information. This is the only pre-processing done on the data.
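The stacking and averaging described above can be sketched as follows. The `ObservationBuffer` helper and the array shapes are illustrative assumptions, not the paper's code:

```python
from collections import deque
import numpy as np

class ObservationBuffer:
    """Stack the last 4 depth frames and average the matching IMU
    readings into one state, as described in the text."""
    def __init__(self, n_frames=4):
        self.frames = deque(maxlen=n_frames)
        self.imu = deque(maxlen=n_frames)

    def push(self, depth_frame, imu_reading):
        self.frames.append(depth_frame)
        self.imu.append(imu_reading)

    def state(self):
        # (H, W, 4) stacked frames; 6-D averaged IMU vector
        return np.stack(self.frames, axis=-1), np.mean(self.imu, axis=0)
```

One such buffer would be kept per camera, with the same 6-D IMU average shared between them.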
4.2 Network Structure
The networks used are multi-input single-output nets: the Critic is composed of two convolutional branches, one for each camera, and two fully connected branches for the IMU data and the previous actions respectively. All these branches are then merged and passed to three fully connected layers that output the Q value for the state-action input pair. The Actor net's structure is the same as the Critic's, except that there is no previous-actions input. The output of this network is the 2-dimensional action vector, where the first element controls the position of the front flippers and the second drives the rear flippers.
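As an illustration, the Actor's multi-branch layout might be sketched in PyTorch as below. The layer sizes, kernel shapes, and input resolution are assumptions, since the exact architecture is not given here:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Sketch of the multi-input Actor: one conv branch per depth
    camera (linear activations), an FC branch for the IMU, merged
    into fully connected layers with a bounded 2-D action output.
    All sizes are illustrative assumptions."""
    def __init__(self, frames=4, imu_dim=6):
        super().__init__()
        def conv_branch():
            return nn.Sequential(
                nn.Conv2d(frames, 16, 8, stride=4),  # linear activation
                nn.Conv2d(16, 32, 4, stride=2),
                nn.Flatten(),
            )
        self.front, self.back = conv_branch(), conv_branch()
        self.imu_fc = nn.Sequential(nn.Linear(imu_dim, 32), nn.LeakyReLU())
        self.head = nn.Sequential(
            nn.LazyLinear(128), nn.LeakyReLU(),
            nn.Linear(128, 64), nn.LeakyReLU(),
            nn.Linear(64, 2), nn.Tanh(),             # bounded 2-D action
        )

    def forward(self, front, back, imu):
        z = torch.cat([self.front(front), self.back(back),
                       self.imu_fc(imu)], dim=1)
        return self.head(z)
```

The Critic would follow the same pattern with an extra fully connected branch for the previous actions and a single linear Q-value output.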
The activations of the convolutional branches are linear. The fully connected layers are instead equipped with LeakyReLU activation functions, except for the last layer, which differs between the two networks: the Critic has a linear output activation, while the Actor output goes through a tanh function. This difference is due to the fact that we want to limit the actions to a bounded range in order to control the flippers, while the Q value can take any real value.

4.3 Reward Function
The reward function is the most important part of an RL algorithm, as it defines the goal of the training process. A badly designed one can hinder learning and result in totally wrong policies. This means that conceiving a proper reward function is not an easy task and requires considerable effort. Nevertheless, it remains easier to create than an efficient hand-crafted controller and generalizes better, which is the main reason why reinforcement learning is becoming increasingly popular. In this paper, we base our reward on the data from the IMU and the distance traversed by the platform.
First, we penalize the square of the gyroscope data, which gives the velocity of rotation around each axis, in order to prevent excessive swinging that could cause harm of any sort to the platform during operation. Then, in the same way, we use the data coming from the accelerometer, that is, the acceleration along each axis, to penalize the inclination with respect to the horizontal plane. This can be expressed as:

r_s = − Σ_{i=1..3} w_i g_i² − Σ_{i=1..3} w_{i+3} a_i²    (4)

where g_i are the values of the gyroscope along the three axes, a_i are the values of the accelerometer, and the vector w contains the parameters used to scale the data according to their importance to us.
Another situation we penalize heavily, for obvious reasons, is the case in which the robot flips upside-down. In this circumstance we give it a large penalty and put an end to the current episode. At the same time, we give a reward proportional to how much the platform advances when moving forward, and slightly punish it for every time step it gets stuck. Finally, when it reaches the goal point it receives a large positive reward.
The resulting function can be expressed as:

r_t = r_s + k Δd_t + P f + R g    (5)

where Δd_t = d_t − d_{t−1}, with d_t being the distance travelled until time t, and a small fixed penalty replaces the distance term when the robot does not advance. f and g are boolean flags: the former is set to 1 if the robot flips over, while the latter is set to 1 if the robot manages to climb the stairs and reach the goal point. k is the reward scaling factor for the forward movement distance, P is the (negative) flip penalty, and R is the goal reward.
4.4 Learning Setup
Like all machine learning algorithms, our implementation has many hyperparameters to set besides the networks' structure. In our experiments, we used the Adam [12] optimizer with the same learning rate for both networks. The training was done in a synchronous online fashion with the data collected during the training process itself and stored in the replay memory. This memory had a capacity of 45000 elements, each composed of the observations from 4 timesteps. To push the algorithm to explore the action space, noise sampled from an Ornstein–Uhlenbeck process was added to the actions. The target networks were updated with a small soft-update rate τ. Moreover, the maximum length of each episode was constrained to 150 steps.
5 Results
To understand which neural network topology was most appropriate, different configurations were tested and compared: 3D convolutional nets, which leave the time dimension intact until the final dense layers, and standard 2D convolutional ones. Training was done in the V-REP simulation environment. The results are shown in Fig. 5 and 4.
As can be seen, the 2D convolutional networks perform much better than their 3D counterparts: the loss for the first topology decreases faster and more smoothly, while the second network suffers from a huge decrease in the Q value, indicating its inability to optimize the policy.
6 Discussion
In this work, we started developing a framework for applying deep reinforcement learning to real robots. As already mentioned, this is a hard problem due to the partial observability of the real world. Previous attempts to use deep RL focused on fully observable environments, where it is easier for the algorithm to obtain good results, but real-world environments are intrinsically partially observable.
During our experiments, we found that the robot tended to get stuck in a local minimum with the front and back flippers locked at fixed angles. This situation can be seen in Fig. 6.
This configuration is quite effective for climbing the first few steps of a flight of stairs, but proves highly inefficient for finishing the task. From this it can be seen that DDPG was not efficient for the task at hand. This can be caused by many factors, which are analyzed in the following sections together with strategies to improve the results.
6.1 Data Representation and Networks Structure
One of the causes at the root of the bad policy could be the data representation. The algorithm has to learn the actions to take from a sequence of previous steps, a task that requires the networks to have some kind of memory structure able to work with the time dimension of the sequence. In our case, this memory is approximated through the use of 4 consecutive frames for each state, from which the nets should infer velocity and movements. This means the time frames are processed by feed-forward networks, which are not perfectly fit for the task. In fact, as can be seen from Fig. 4 and 5, the loss decreases, implying the algorithm is learning something, but the result is not optimal, as can be noted from the decreasing Q value and from visual inspection of the simulation.
Moreover, given the higher performance of the 2D network compared to the 3D one, we can infer that proper handling of the time dimension is a fundamental requirement of the process. In the former topology this dimension is processed by the convolutional branches, while in the latter this processing is delayed until the fully connected layers.
Thus, applying neural networks specifically designed to handle time series might significantly improve the results of the policy search. Such networks are called recurrent neural networks, and the most promising of them are LSTMs [13]. Using recurrent nets would eliminate the need for the 4 stacked frames while allowing the algorithm to work with whole training episodes, thus removing the need for the replay memory.
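As an illustration of this direction, a recurrent encoder could replace frame stacking by carrying the temporal context in its hidden state. The feature and hidden dimensions here are assumptions:

```python
import torch
import torch.nn as nn

class RecurrentEncoder(nn.Module):
    """Illustrative alternative to frame stacking: an LSTM consumes
    one per-timestep feature vector and summarizes the history in
    its hidden state."""
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)

    def forward(self, seq):            # seq: (batch, time, feat_dim)
        out, _ = self.lstm(seq)
        return out[:, -1]              # last step summarizes the sequence
```

The per-timestep feature vector would come from the same convolutional and IMU branches described in Section 4.2, applied to a single frame instead of four.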
6.2 Reward Function
As stated earlier, the most important factor in RL is the reward function. This function tells the algorithm what it has to learn and how, how to interpret the data, and whether it is acting properly. This is why finding the right one, and thus obtaining a properly converging algorithm, requires a lot of study and experimentation.
Our reward function describes what we considered the most important aspects for solving the problem. To guarantee the safety and well-being of the user and of the platform itself, we penalized excessive swinging during operation through equation (4), which takes into account the data coming from the IMU mounted at the centre of the platform.
In addition, situations in which the flippers are in a position that does not allow the platform to proceed are also penalized. This penalty is very important for accomplishing the task, given that the platform is equipped with flippers precisely to overcome this kind of situation.
Changing or improving the reward function, for example by taking into account other factors such as the distance from the stairs or the flippers' inclination, could lead to performance improvements and help in solving the task.
7 Conclusions
In this work, we have laid the basis for applying deep RL algorithms to interesting real-world problems. This task is very complex given the partial observability of the environment and the high processing power such algorithms require to learn good policies. Nonetheless, a proper implementation of these techniques could improve the lives of many people around the world.
Future work includes testing new cost functions and switching from feed-forward neural networks to recurrent ones, namely LSTMs, thus getting rid of the multi-frame inputs. Other interesting research directions could be to discard DDPG in favor of continuous DQN, a different algorithm, or to switch entirely from continuous actions to discrete ones, where the action network's output is a flag that controls whether to increment or decrement the flippers' angles by a fixed quantity. If this quantity is small enough, the movements will be smooth enough to seem continuous. Moreover, this approach would greatly simplify the action space, making the task easier to solve.
Another line of work could be to use imitation learning, where the algorithm is trained by a human teacher showing it how to perform the preferred actions. Once the algorithm has learned them, it could keep optimizing from that starting point, which should lead to improved and even superhuman performance.
References
 [1] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, pp. 529–533, Feb 2015. Letter.
 [2] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, et al., “Continuous control with deep reinforcement learning,” CoRR, vol. abs/1509.02971, 2015.
 [3] X. Geng, M. Zhang, J. Bruce, K. Caluwaerts, et al., “Deep reinforcement learning for tensegrity robot locomotion,” CoRR, vol. abs/1609.09049, 2016.
 [4] V. Mnih, A. P. Badia, M. Mirza, A. Graves, et al., “Asynchronous methods for deep reinforcement learning,” CoRR, vol. abs/1602.01783, 2016.
 [5] L. Tai, G. Paolo, and M. Liu, “Virtualtoreal deep reinforcement learning: Continuous control of mobile robots for mapless navigation,” arXiv preprint arXiv:1703.00420, 2017.
 [6] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, et al., “Playing Atari with deep reinforcement learning,” CoRR, vol. abs/1312.5602, 2013.
 [7] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, et al., “Learning to navigate in complex environments,” CoRR, vol. abs/1611.03673, 2016.
 [8] M. Pecka, V. Salansky, K. Zimmermann, and T. Svoboda, “Autonomous flipper control with safety constraints,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2889–2894, Oct 2016.

 [9] M. Pecka, K. Zimmermann, and T. Svoboda, “Safe exploration for reinforcement learning in real unstructured environments,” in Proc. of the Computer Vision Winter Workshop, 2015.
 [10] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A survey,” CoRR, vol. cs.AI/9605103, 1996.
 [11] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction, vol. 1. MIT press Cambridge, 1998.
 [12] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.

 [13] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, 1997.
 [14] L. Tai, S. Li, and M. Liu, “Autonomous exploration of mobile robots through deep neural networks,” International Journal of Advanced Robotic Systems, vol. 14, no. 4, p. 1729881417703571, 2017.
 [15] M. P. Deisenroth and C. E. Rasmussen, “Efficient reinforcement learning for motor control,” in In 10th International PhD Workshop on Systems and Control, Hluboka nad Vltavou, Czech Republic, 2009.
 [16] S. Dini and M. Serrano, “Combining qlearning with artificial neural networks in an adaptive light seeking robot,” 2012.
 [17] J. Yanga, Y. Zhuang, and C. Li, “Towards behavior switch control for an evolutionary robot based on rl with enn,” International Journal of Robotics and Automation (IJRA), 2012.
 [18] F. S. Melo, “Convergence of qlearning: A simple proof,” Institute Of Systems and Robotics, Tech. Rep, pp. 1–4, 2001.
 [19] L. Tai and M. Liu, “Towards cognitive exploration through deep reinforcement learning for mobile robots,” arXiv preprint arXiv:1610.01733, 2016.
 [20] A. G. Kupcsik, M. P. Deisenroth, J. Peters, and G. Neumann, “Dataefficient generalization of robot skills with contextual policy search.,” in AAAI, 2013.
 [21] K. Zimmermann, P. Zuzanek, M. Reinstein, and V. Hlavac, “Adaptive traversability of unknown complex terrain with obstacles for mobile robots,” in Robotics and Automation (ICRA), 2014 IEEE International Conference on, pp. 5177–5182, IEEE, 2014.
 [22] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” arXiv preprint arXiv:1511.05952, 2015.
 [23] H. P. van Hasselt, A. Guez, M. Hessel, V. Mnih, et al., “Learning values across many orders of magnitude,” in Advances in Neural Information Processing Systems, pp. 4287–4295, 2016.