I Introduction and Related Works
Terrain traversability analysis is a fundamental problem in achieving robot autonomy in off-road environments. In such scenes, many methods designed for urban streets do not adapt well: there is no pavement or lane marking, no curb or other artificial object to delimit road and non-road regions, and the terrain surface is formed by natural objects that may have complex visual and geometric properties. An extensive review of the challenges and the literature is given in [1]. LiDARs and cameras have been the major sensors for online traversability analysis, and the mainstream methods are divided by [1] into geometry-based and appearance-based ones.
Geometry-based methods first generate a geometric representation of the world using LiDAR or depth data, then assess traversability by comparing geometric features such as height, roughness, slope, curvature and width with the vehicle's mechanical properties [2][3][4].
Appearance-based methods assume that traversability is correlated with terrain appearance, and many learning-based approaches have been developed [5][6]. To improve far-field capability, some methods use underfoot or near-field data to self-supervise the learning [7][8]. Recently, deep neural networks have also been employed to model the procedure [9][10][11], where, to ease the burden of data annotation, semi-supervised learning methods are developed that incorporate weakly supervised labels such as the vehicle's driving path.
Behavior-based methods are a new trend in this field, inspired by the development of learning from demonstration (LfD) and its promising results in recent years [12]. Mainstream LfD algorithms can be roughly divided into two classes: Behavior Cloning (BC) [13][14][15] and Inverse Reinforcement Learning (IRL) [16][17]. Behavior Cloning directly learns a mapping from observation to action, while IRL recovers the essential reward function behind expert demonstrations. Although earlier IRL algorithms use simple linear reward functions [16][17][18], deep neural network reward structures [19][20] were later proposed to model high-dimensional and nonlinear processes. Compared with hand-crafted cost functions and supervised-learning methods, IRL has better robustness and scalability [21]. Recently, deep maximum entropy IRL has been used to learn traversal cost maps for urban autonomous driving [22][23], and vehicle kinematics has also been considered in [24] by converting history trajectories into new data channels, which are integrated with scene features to compose the input of a CNN-based cost function. However, in these works vehicle kinematics is not incorporated in the forward reinforcement learning procedure, and the value iteration and state visitation frequency estimation methods used have poor efficiency.
This research proposes a method of off-road traversability analysis and trajectory planning using Deep Maximum Entropy Inverse Reinforcement Learning. The novel contributions are that we encode vehicle kinematics into convolution kernels and propose two novel convolutional neural networks (RL ConvNet and Svf ConvNet) to achieve an efficient forward reinforcement learning process, which solves the problem of the exponential increase of state-space complexity. Experiments are conducted in off-road environments using real driving trajectories, as well as synthesized ones that represent specific behaviors, as demonstrations. Results validate the performance and efficiency of our method.
II Methodology
As illustrated in Fig. 1, this research proposes a deep inverse reinforcement learning framework for off-road autonomous vehicle traversability analysis and trajectory planning, which incorporates vehicle kinematics and employs RL ConvNet and Svf ConvNet for efficient computation.
II-A Problem Formulation
We formulate the process of an autonomous vehicle navigating through an off-road environment as a Markov Decision Process (MDP), defined as a tuple $(S, A, T, \gamma, R)$, where $S$ denotes the state space of the scene, $A$ denotes the action set of the autonomous vehicle, $T$ denotes the state transition probabilities, $\gamma$ denotes the discount factor and finally $R$ denotes the traversal reward. Let $C$ be the traversability costs, $R = -C$, where the lower the costs, the higher the rewards. We are given a demonstration sample set $D = \{(M_i, \zeta_i)\}$, where at scene $M_i$ the vehicle is driven through trajectory $\zeta_i$ by a human expert. A trajectory is a sequence of state-action pairs $\zeta = \{(s_1, a_1), \ldots, (s_N, a_N)\}$, where actions $a_i$ are taken sequentially at states $s_i$. The reward value of a trajectory is simply the accumulated reward (or negative cost) over all states that the trajectory traverses:

$$R(\zeta) = \sum_{s \in \zeta} r(s) = -\sum_{s \in \zeta} c(s) \quad (1)$$
Let $c = f(\phi)$ be a function evaluating the traversability cost of a scene with features $\phi$. Following Wulfmeier et al. [22], we use grid maps to represent $\phi$, $c$ and $r$, and a fully convolutional neural network (FCN) for $f$ with a parameter set $\theta$. It is assumed that human expert trajectories intend to maximize the reward gain, or equivalently to minimize the traversability costs. Our goal is to learn a parameter set $\theta$ for $f$ from the expert's demonstrations, so as to guide an autonomous agent to plan trajectories in similar ways as human drivers.
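For concreteness, the trajectory reward of Eqn. 1 can be computed directly on the grid representation. The following is a minimal sketch, assuming the cost map is a 2D NumPy array and a trajectory is given as a sequence of grid cells:

```python
import numpy as np

def trajectory_reward(cost_map: np.ndarray, trajectory) -> float:
    """Accumulated reward of a trajectory over a grid cost map (Eqn. 1).

    cost_map   : HxW array of traversability costs c(s).
    trajectory : sequence of (row, col) grid cells the vehicle traversed.
    """
    # r(s) = -c(s); the trajectory reward is the sum over visited states.
    return float(-sum(cost_map[r, c] for r, c in trajectory))
```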
II-B Maximum Entropy Deep IRL
Under the maximum entropy assumption, the probability of a trajectory is estimated as below, where trajectories with higher reward values are exponentially more preferable [17]:

$$P(\zeta) = \frac{1}{Z} e^{R(\zeta)} \quad (2)$$

where $Z$ is the partition function.
Given the demonstration sample set $D$, learning can be formulated as maximizing the following log-likelihood:

$$\theta^* = \arg\max_{\theta} L(\theta) \quad (3)$$

$$L(\theta) = \sum_{\zeta \in D} \log P(\zeta \mid \theta) \quad (4)$$
Let $\mu_D$ and $\mu_\theta$ denote the state visitation frequencies of the human expert drivers' policy and of the optimal policy recovered from the reward function, respectively, where $\mu_D$ is approximated from the human demonstration samples, while $\mu_\theta$ is estimated by solving the MDP. Following Ziebart et al. [17] and Wulfmeier et al. [22], $\theta$ is optimized by backpropagating the following loss gradient:

$$\frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial r} \cdot \frac{\partial r}{\partial \theta} = (\mu_D - \mu_\theta) \cdot \frac{\partial r}{\partial \theta} \quad (5)$$
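In practice, the gradient of Eqn. 5 can be applied without forming the full Jacobian $\partial r / \partial \theta$: since $\partial L / \partial r = \mu_D - \mu_\theta$, it suffices to backpropagate this quantity through the reward map. A hedged PyTorch sketch (the function name and optimizer handling are our own illustration, not the paper's code):

```python
import torch

def apply_irl_gradient(r: torch.Tensor, mu_D: torch.Tensor,
                       mu_theta: torch.Tensor, optimizer) -> None:
    """One parameter update following Eqn. 5.

    r        : HxW reward map produced by the FCN (requires grad).
    mu_D     : expert state visitation frequencies (from demonstrations).
    mu_theta : expected visitation frequencies of the current policy,
               treated as a constant with respect to theta.
    """
    optimizer.zero_grad()
    # dL/dr = mu_D - mu_theta; backpropagating it yields dL/dtheta via the
    # chain rule. The optimizer minimizes, so we feed the negated gradient.
    r.backward(gradient=-(mu_D - mu_theta))
    optimizer.step()
```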
Hence, given the current parameter set $\theta$, the following steps are taken at each iteration for optimization:

1. Estimate the traversability cost $c = f(\phi; \theta)$, and let $r = -c$;
2. Perform reinforcement learning to find an optimal policy $\pi$;
3. Compute the expected state visitation frequency $\mu_\theta$;
4. Compute the expert's state visitation frequency $\mu_D$ from the demonstrated trajectories;
5. Optimize $\theta \leftarrow \theta + \eta \cdot \partial L / \partial \theta$ by Eqn. 5, where $\eta$ is a learning rate.

A minimal sketch of this loop is given below.
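The five steps map directly onto a short training-loop sketch; `solve_mdp`, `expected_svf` and `expert_svf` are hypothetical stand-ins for the components described in the following subsections, and `apply_irl_gradient` is the sketch given above:

```python
def irl_iteration(fcn, optimizer, features, demo_trajectories):
    """One optimization iteration of the deep MaxEnt IRL loop (steps 1-5)."""
    cost = fcn(features)                    # step 1: c = f(phi; theta)
    r = -cost                               #         r = -c
    policy = solve_mdp(r.detach())          # step 2: forward RL (RL ConvNet)
    mu_theta = expected_svf(policy)         # step 3: expected visitation freq.
    mu_D = expert_svf(demo_trajectories)    # step 4: expert visitation freq.
    apply_irl_gradient(r, mu_D, mu_theta, optimizer)  # step 5: Eqn. 5 update
```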
II-C Incorporating Vehicle Kinematics
However, the following problems remain. First, vehicle kinematics is non-holonomic, and incorporating non-holonomic constraints is vital for planning trajectories that are physically executable by the vehicle. Second, traditional value iteration and state visitation frequency estimation are time-consuming; in particular, incorporating kinematics adds state dimensions, resulting in an exponential increase in computational complexity.
Tamar et al. [25] proposed a convolutional network structure for the value iteration process, where the previous value $V$ and reward $r$ are passed through a convolution layer and a max-pooling layer, each channel in the convolution output represents the $Q$ function of a specific action, and the convolution kernel weights correspond to the discounted transition probabilities. Thus, by recurrently applying a convolution layer $K$ times, value iteration is efficiently performed with a significant reduction of computation costs. Inspired by this idea, we propose RL ConvNet and Svf ConvNet to incorporate vehicle kinematics and achieve efficient computation at the same time.
II-C1 RL ConvNet
We consider modeling kinematic constraints on the vehicle's orientation. Let $A$ and $O$ be the sets of the vehicle's discrete actions and orientations, respectively. We assume that the vehicle's orientation constrains its state transition probability under a certain action, hence a set of convolutional kernels $\{W_{o,a}\}$ is defined, where $W_{o,a}$ is a kernel whose weights correspond to the discounted transition probabilities after taking action $a$ under orientation $o$. To handle the exponential increase of computational complexity that comes with incorporating kinematics into the state, without explicit performance degradation, we constrain the transition of vehicle orientations to be deterministic. More specifically, we define the function $o' = g(o, a)$, where $o'$ represents the next-step vehicle orientation after taking action $a$ under the current orientation $o$. For convolution purposes, grid maps are employed to represent the value, $Q$ and policy functions, where a grid pixel corresponds to a 2D location in the scene. Let $V_o$ be a value map for vehicle orientation $o$, where for any pixel $(x, y)$, $V_o(x, y)$ is the expected maximal value starting from the location $(x, y)$ with orientation $o$. $Q_{o,a}$ and $\pi_{o,a}$ are defined in similar ways, related to both vehicle orientation and action: $Q_{o,a}$ is a $Q$ map and $\pi_{o,a}$ a policy map of vehicle orientation $o$ and action $a$, where for any pixel $(x, y)$, $Q_{o,a}(x, y)$ is the expected long-term reward and $\pi_{o,a}(x, y)$ is the stochastic policy probability if action $a$ is taken when the vehicle is at location $(x, y)$ with orientation $o$. Hence the original value iteration algorithm can be converted to Algorithm 1, which incorporates vehicle kinematics and accelerates computation through convolution. A sketch of the procedure is given below.
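A minimal NumPy sketch of Algorithm 1, assuming a maximum-entropy soft backup (logsumexp over actions, softmax for the policy) and kernels indexed so that the weight at (center + d) is the discounted probability of displacement d:

```python
import numpy as np
from scipy.signal import correlate2d
from scipy.special import logsumexp, softmax

def rl_convnet(reward, kernels, g, n_iters=150):
    """Kinematics-aware soft value iteration over grid maps.

    reward        : HxW reward map r.
    kernels[o][a] : small 2D kernel; weight at (center + d) is the
                    discounted probability of displacement d for (o, a).
    g[o][a]       : deterministic next orientation after action a at o.
    """
    n_o, n_a = len(kernels), len(kernels[0])
    V = [np.zeros_like(reward) for _ in range(n_o)]
    for _ in range(n_iters):
        # Q_{o,a} = r + correlate(V_{g(o,a)}, W_{o,a}): the backup looks
        # ahead into the successor orientation's value map.
        Q = [[reward + correlate2d(V[g[o][a]], kernels[o][a], mode="same")
              for a in range(n_a)] for o in range(n_o)]
        # Maximum-entropy soft backup over actions: V_o = logsumexp_a Q_{o,a}.
        V = [logsumexp(np.stack(Q[o]), axis=0) for o in range(n_o)]
    # Stochastic policy pi_{o,a} via softmax over the action axis.
    pi = [softmax(np.stack(Q[o]), axis=0) for o in range(n_o)]
    return V, pi
```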
As illustrated in Fig. 2(a), for a particular orientation $o$, each $V_{g(o,a)}$ is first passed through a convolution layer with the kernel $W_{o,a}$ corresponding to orientation $o$ and action $a$, and then through a pixel-wise addition layer with $r$ to estimate $Q_{o,a}$. All $Q_{o,a}$'s of orientation $o$ are then passed through a softmax layer to obtain the updated $V_o$. These operations iterate $K$ times until $V_o$ converges. From there, a sequence of policy maps $\pi_{o,a}$ is estimated via matrix operations, corresponding to the particular orientation $o$ and each of the actions $a \in A$. These operations are conducted for each particular orientation $o$, hence a set of policy maps is obtained corresponding to each pair of $o$ and $a$.

II-C2 Svf ConvNet
The state visitation frequency is also represented as a grid map. Let $\mu_{o,a}$ be the orientation-action state visitation frequency, where each pixel value is the expected frequency with which action $a$ is taken under current orientation $o$ at the corresponding 2D location. Similarly, $\mu_o$ denotes the expected state visitation frequency for vehicle orientation $o$, which is simply the sum of the orientation-action state visitation frequencies. Denoting $\odot$ as the pixel-wise multiplication of two grid maps, $\mu_{o,a}$ is calculated as follows:

$$\mu_{o,a} = \mu_o \odot \pi_{o,a} \quad (6)$$
Hence the original algorithm of computing expected state visitation frequency can be converted to Algorithm 2, which incorporates vehicle kinematics and computes efficiently through convolution.
Fig. 2(b) illustrates the workflow. For any particular orientation $o$, the policy $\pi_{o,a}$ of action $a$ and orientation $o$ is passed through a multiplication layer with $\mu_o$, following Eqn. 6. The resulting orientation-action state visitation frequency map $\mu_{o,a}$ is then passed through a convolution layer with the kernel of $(o, a)$ to propagate the visitation mass, and the propagated maps are accumulated at their successor orientations $g(o, a)$ to obtain the updated orientation state visitation frequencies $\mu_o$. A sketch of the procedure follows.
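A matching NumPy sketch of Algorithm 2. Note that the forward push of visitation mass uses convolution, the adjoint of the correlation used in the value backup above, so the same kernels serve both networks:

```python
import numpy as np
from scipy.signal import convolve2d

def svf_convnet(pi, kernels, g, mu0, n_iters=120):
    """Expected state visitation frequencies via convolution.

    pi[o]   : (n_a, H, W) stochastic policy maps from RL ConvNet.
    mu0[o]  : HxW initial visitation frequencies (start distribution).
    """
    n_o, n_a = len(kernels), len(kernels[0])
    mu = [m.copy() for m in mu0]
    total = [m.copy() for m in mu0]
    for _ in range(n_iters):
        nxt = [np.zeros_like(mu[0]) for _ in range(n_o)]
        for o in range(n_o):
            for a in range(n_a):
                mu_oa = mu[o] * pi[o][a]          # Eqn. 6, pixel-wise
                # Forward-push the mass through the transition kernel and
                # accumulate it at the successor orientation g(o, a).
                nxt[g[o][a]] += convolve2d(mu_oa, kernels[o][a], mode="same")
        mu = nxt
        total = [t + m for t, m in zip(total, mu)]
    return total  # per-orientation expected state visitation frequencies
```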
III Experiments
III-A Experimental Design and Dataset
TABLE I: Experimental design

Experiment   Scene                           Expert's Behavior                                Learnt Cost Function
Exp.1 (E1)   straight and flat road scenes   normal driving behavior                          $c_{E1}$
Exp.2 (E2)   negative obstacle scenes        avoid the negative obstacles                     $c_{E2}$
Exp.3 (E3)   negative obstacle scenes        cross negative obstacles if they block the way   $c_{E3}$
Exp.4 (E4)   negative obstacle scenes        cross all negative obstacles on the road         $c_{E4}$
As shown in Tab. I, we design four experiments to examine the proposed method's capability of learning different expert driving behaviors. In Exp.1, an expert demonstrates normal driving behavior on straight and flat roads, and a cost function is obtained as the result of learning, which is used as the baseline for the other experiments. Exp.2-4 are conducted in negative obstacle scenes, where the expert's behavior is to avoid the negative obstacles in Exp.2, to cross negative obstacles only if they block the way in Exp.3, and to cross all negative obstacles on the road in Exp.4.
Data collection is conducted in off-road environments using the instrumented vehicle shown in Fig. 3(a). The vehicle has a Velodyne HDL-64 LiDAR to map scene features, a GPS/IMU system to capture the expert's driving trajectories, and a front-view monocular camera for visualization only.
We generate a dataset where each frame has a scene map, a demonstration trajectory crossing the scene, and a start and a goal point defined by the trajectory points that enter and leave the scene. The scene map is a 100 × 100 2D grid map with a resolution of 0.25 m. It is generated using all LiDAR data covering the scene, and each grid cell is assigned the height value of the highest LiDAR point projected into it. The whole dataset has a total of 2388 scene maps, and the expert's real driving trajectories are used as the demonstrations for Exp.1 and Exp.2. For Exp.3 and Exp.4, we assume a vehicle of stronger mobility whose expert chooses to cross all or some of the negative obstacles for efficiency: we use the real scene maps of negative obstacles but synthesize demonstration trajectories in compliance with the defined behaviors. A rasterization sketch is given below.
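The height-map rasterization under the stated map parameters can be sketched as follows, assuming the scene frame origin sits at a map corner and point coordinates are in meters:

```python
import numpy as np

def make_height_map(points, size=100, res=0.25):
    """Rasterize LiDAR points into a height grid map.

    points : (N, 3) array of x, y, z in the scene frame (meters).
    Each cell keeps the height of the highest point projected into it,
    matching the 100 x 100, 0.25 m map described above.
    """
    grid = np.full((size, size), -np.inf)
    cols = (points[:, 0] / res).astype(int)
    rows = (points[:, 1] / res).astype(int)
    valid = (rows >= 0) & (rows < size) & (cols >= 0) & (cols < size)
    # np.maximum.at handles repeated indices, keeping the max z per cell.
    np.maximum.at(grid, (rows[valid], cols[valid]), points[valid, 2])
    grid[np.isinf(grid)] = 0.0  # cells with no LiDAR returns
    return grid
```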
We choose 320 frames of straight and flat road scenes for Exp.1 and train a cost function $c_{E1}$. We choose 320 frames of negative obstacle scenes, with trajectories demonstrating the different behaviors of Exp.2, 3 and 4, and train the cost functions $c_{E2}$, $c_{E3}$ and $c_{E4}$, respectively. The remaining scenes are used for testing, where the focus is the comparison of the different cost functions and the trajectories they plan at the same scene.
III-B Implementation Details
III-B1 Transition Kernels
In our implementation, we use the discrete actions and orientations shown in Fig. 4. The orientation is discretized into eight directions at 45-degree intervals to cover the full range. The actions are simplified to combinations of a steering angle (-45, 0 or 45 degrees relative to the current orientation) and driving forward or backward. The transition kernels corresponding to each orientation are illustrated in Fig. 4. Although we only consider one-step transitions between adjacent grid cells, larger and more complex transition kernels can be designed in the future. A sketch of one such kernel construction is given below.
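A sketch of how such kernels could be constructed. The assumption that the vehicle first steers and then moves one cell along the new heading is our own modeling choice for illustration, not a detail stated by the paper:

```python
import numpy as np

# 8 headings, 45 degrees apart; OFFSETS[o] is the unit grid step along o.
OFFSETS = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]
STEER = [-1, 0, 1]   # -45, 0, +45 degrees relative to current heading
GEARS = [1, -1]      # forward, backward

def build_kernels(gamma=0.95):
    """3x3 transition kernels and the orientation-transition table g."""
    kernels = [[None] * 6 for _ in range(8)]
    g = [[0] * 6 for _ in range(8)]
    for o in range(8):
        for a, (s, gear) in enumerate((s, d) for s in STEER for d in GEARS):
            o_next = (o + s) % 8        # deterministic orientation change
            dr, dc = OFFSETS[o_next]
            k = np.zeros((3, 3))
            # One-step deterministic move: all probability mass on the
            # target cell, discounted by gamma as in VIN-style kernels.
            k[1 + gear * dr, 1 + gear * dc] = gamma
            kernels[o][a] = k
            g[o][a] = o_next
    return kernels, g
```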
III-B2 Network and Training Configuration
In the experiments, we adopt a simple five-layer fully convolutional network (FCN) structure which takes the processed LiDAR feature map as input. For the RL ConvNet and Svf ConvNet, we set the number of value iterations to 150 and the number of svf iterations to 120, which were experimentally observed to yield an effective reinforcement learning process and svf computation. The network is trained with the Adam optimizer with an initial learning rate of 1e-4 and a learning rate decay of 0.99. Batch training with a batch size of 5 is also employed, which proved in practice to be more stable than updating weights based on a single demonstration.
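Putting the configuration together, a hypothetical setup might look like the following, where `build_fcn`, `loader` and `forward_rl_and_svf` are illustrative stand-ins for the five-layer cost FCN, the demonstration dataset and the two ConvNets, and the decay schedule is one plausible reading of the stated 0.99 learning-rate decay:

```python
import torch

fcn = build_fcn()  # five-layer fully convolutional cost network (assumed)
optimizer = torch.optim.Adam(fcn.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)

for batch in loader:                          # batches of 5 demonstrations
    optimizer.zero_grad()
    for features, mu_D in batch:
        r = -fcn(features)                    # reward map of this sample
        mu_theta = forward_rl_and_svf(r)      # treated as a constant here
        r.backward(gradient=-(mu_D - mu_theta))  # accumulate Eqn. 5 gradient
    optimizer.step()                          # one update per batch
    scheduler.step()                          # learning-rate decay of 0.99
```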
III-C Training Results
A visualization of the training process is shown in Fig. 5. During training, we periodically evaluate the expert's trajectory reward and the policy's trajectory reward using Eqn. 1, where the average reward of 30 trajectories randomly sampled from the learned policy is used. As shown in Fig. 5(a), the loss value decreases and tends to converge after 200 iterations. The expert's and the policy's rewards both increase, with the policy reward staying slightly below the expert's. Samples of the expert's demonstration trajectories are shown in Fig. 5(b), while policy-generated trajectories at different training stages are shown in Fig. 5(c)-(e). One can clearly see that as iterations increase, the similarity between the expert's and the policy's trajectories becomes higher, which validates that the learned reward guides trajectory planning in compliance with the human's behavior.
More visual evaluation results of our four experiments can be found in Fig. 6. All four learned cost maps successfully capture the high traversal cost of positive obstacles (i.e., trees or bushes) but differ in their assessment of negative obstacles (pits on the road). Trained on human drivers' avoid-negative-obstacle trajectories, Exp.2 learns that negative obstacles have higher cost than flat roads. Exp.3 and Exp.4 show the opposite result, given cross-negative-obstacle trajectories as demonstrations: negative obstacles in the cost maps of Exp.3 and Exp.4 have relatively lower costs compared to flat roads, resulting in the hole-crossing behavior of the planned trajectories. One interesting observation is that Exp.4 assigns lower costs to holes than Exp.3, showing a stronger preference for negative obstacles.
III-D Testing Results
Visualizations of testing results on different scenes are shown in Fig. 7. We demonstrate the learning capability and scalability by comparing the behavior of trajectories generated by the different learned cost functions. As shown in Fig. 7, $c_{E1}$ fails to handle scenes where negative obstacles exist, and its planned trajectories cross them. In contrast, having been demonstrated obstacle-avoiding behavior in Exp.2, $c_{E2}$ performs well in these scenes and successfully avoids negative obstacles. By comparison, $c_{E3}$ and $c_{E4}$ show strong learning capability, as they produce cost maps that assign low traversal cost to negative obstacles. Compared to $c_{E3}$, $c_{E4}$ assigns much lower cost to negative obstacles, since it was demonstrated with trajectories that prefer them more. The testing results show that our method has a strong capability of learning different behaviors and is scalable to different scenes. When transferring to new scenes or new driving behaviors, our method can refine the current model with newly demonstrated trajectories.
III-E Computation Efficiency Analysis
We reimplement the method of Wulfmeier et al. [22] and make a quantitative comparison of computation time with ours. The time spent on reinforcement learning and svf computation for each sample is measured for analysis. The experiment is conducted on an Intel Xeon E5 CPU and an NVIDIA Titan X GPU. As shown in Tab. II and Tab. III, even without GPU acceleration, our method requires far less computation time than Wulfmeier et al. [22] in both training and testing.
TABLE II: Computation time per sample (training)

       Wulfmeier et al. [22]   Ours (CPU)   Ours (GPU)
RL     80.5920 s               0.7501 s     0.3958 s
Svf    19.3368 s               1.0671 s     0.5161 s

TABLE III: Computation time per sample (testing)

       Wulfmeier et al. [22]   Ours (CPU)   Ours (GPU)
RL     77.5618 s               0.6027 s     0.2541 s
Svf    18.7573 s               0.8479 s     0.3860 s
IV Conclusion and Future Works
A method of off-road traversability analysis and trajectory planning using Deep Maximum Entropy Inverse Reinforcement Learning is proposed. A major novelty is the incorporation of vehicle kinematics, where two convolutional neural networks, i.e., RL ConvNet and Svf ConvNet, are developed that encode vehicle kinematics into convolution kernels, so as to solve the problem of the exponential increase of state-space complexity and achieve efficient computation in forward reinforcement learning. Experiments are conducted in which four traversal cost functions are learned from demonstration trajectories representing the behaviors of normal driving, avoiding negative obstacles, crossing all negative obstacles, and crossing negative obstacles only if they block the way. The results demonstrate that the learned cost functions are able to guide trajectory planning in compliance with the expert's behaviors, that the method scales to various scenes, and that it significantly improves the computation efficiency of both learning and testing. In future work, more extensive experimental studies will be conducted, and the accuracy of the kinematic kernels will be improved.
References

[1] P. Papadakis, "Terrain traversability analysis methods for unmanned ground vehicles: A survey," Engineering Applications of Artificial Intelligence, vol. 26, no. 4, pp. 1373–1385, 2013.
[2] J. F. Lalonde, N. Vandapel, D. F. Huber, and M. Hebert, "Natural terrain classification using three-dimensional ladar data for ground robot mobility," Journal of Field Robotics, vol. 23, no. 10, pp. 839–861, 2010.
[3] J. Larson and M. Trivedi, "Lidar based off-road negative obstacle detection and analysis," in International IEEE Conference on Intelligent Transportation Systems, 2011.
[4] S. Kuthirummal, A. Das, and S. Samarasekera, "A graph traversal based algorithm for obstacle detection using lidar or stereo," in 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, Sep. 2011, pp. 3874–3880.
[5] F. Labrosse and M. Ososinski, "Automatic driving on ill-defined roads: An adaptive, shape-constrained, color-based method," Journal of Field Robotics, vol. 32, no. 4, pp. 504–533, 2015.
[6] J. Mei, Y. Yu, H. Zhao, and H. Zha, "Scene-adaptive off-road detection using a monocular camera," IEEE Transactions on Intelligent Transportation Systems, vol. PP, no. 99, pp. 1–12, 2017.
[7] A. Howard, M. Turmon, L. Matthies, B. Tang, A. Angelova, and E. Mjolsness, "Towards learned traversability for robot navigation: From underfoot to the far field," Journal of Field Robotics, vol. 23, no. 11-12, pp. 1005–1017, 2010.
[8] S. Zhou, J. Xi, M. W. McDaniel, T. Nishihata, P. Salesses, and K. Iagnemma, "Self-supervised learning to visually detect terrain surfaces for autonomous robots operating in forested terrain," Journal of Field Robotics, vol. 29, no. 2, pp. 277–297, 2012.
[9] B. Suger, B. Steder, and W. Burgard, "Traversability analysis for mobile robots in outdoor environments: A semi-supervised learning approach based on 3D-lidar data," in 2015 IEEE International Conference on Robotics and Automation (ICRA), May 2015, pp. 3941–3946.
[10] B. Dan, W. Maddern, and I. Posner, "Find your own way: Weakly-supervised segmentation of path proposals for urban autonomy," in IEEE International Conference on Robotics and Automation, 2017.
[11] B. Gao, A. Xu, Y. Pan, X. Zhao, W. Yao, and H. Zhao, "Off-road drivable area extraction using 3D LiDAR data," in Intelligent Vehicles Symposium, 2019, pp. 1323–1329.
[12] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, "A survey of robot learning from demonstration," Robotics and Autonomous Systems, vol. 57, no. 5, pp. 469–483, 2009.
[13] G. Hayes and J. Demiris, "A robot controller using learning by imitation," in Proc. of the Intl. Symp. on Intelligent Robotic Systems, vol. 676, no. 5, 1994, pp. 1257–1274.
[14] D. Pomerleau, "ALVINN: An autonomous land vehicle in a neural network," vol. 1, Jan. 1988, pp. 305–313.
[15] M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba, "End to end learning for self-driving cars," CoRR, vol. abs/1604.07316, 2016.
[16] P. Abbeel and A. Y. Ng, "Apprenticeship learning via inverse reinforcement learning," in International Conference on Machine Learning, 2004.
[17] B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey, "Maximum entropy inverse reinforcement learning," in Proceedings of the 23rd AAAI Conference on Artificial Intelligence, 2008, pp. 1433–1438.
[18] N. D. Ratliff, J. A. Bagnell, and M. A. Zinkevich, "Maximum margin planning," in International Conference on Machine Learning, 2006.
[19] M. Wulfmeier, P. Ondruska, and I. Posner, "Maximum entropy deep inverse reinforcement learning," arXiv e-prints, July 2015.
[20] C. Finn, S. Levine, and P. Abbeel, "Guided cost learning: Deep inverse optimal control via policy optimization," in International Conference on Machine Learning, 2016.
[21] T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, and J. Peters, "An algorithmic perspective on imitation learning," Foundations and Trends in Robotics, vol. 7, no. 1-2, pp. 1–179, 2018.
[22] M. Wulfmeier, D. Z. Wang, and I. Posner, "Watch this: Scalable cost-function learning for path planning in urban environments," in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct. 2016, pp. 2089–2095.
[23] M. Wulfmeier, D. Rao, D. Z. Wang, P. Ondruska, and I. Posner, "Large-scale cost function learning for path planning using deep inverse reinforcement learning," International Journal of Robotics Research, vol. 36, no. 10, 2017.
[24] Y. Zhang, W. Wang, R. Bonatti, D. Maturana, and S. Scherer, "Integrating kinematics and environment context into deep inverse reinforcement learning for predicting off-road vehicle trajectories," CoRR, vol. abs/1810.07225, 2018. [Online]. Available: http://arxiv.org/abs/1810.07225
[25] A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel, "Value iteration networks," in Advances in Neural Information Processing Systems 29, 2016, pp. 2154–2162.