I. Introduction and Related Works
Terrain traversability analysis is a fundamental problem in achieving robot autonomy in off-road environments. In such scenes, many methods designed for urban streets do not adapt well: there is no pavement or lane marking, no curb or other artificial object to delimit road and non-road regions, and the terrain surface is formed by natural objects that may have complex visual and geometric properties. An extensive review of the challenges and the literature is given in . LiDARs and cameras have been used as the major sensors for online traversability analysis, and the mainstream methods are divided by  into geometry-based and appearance-based ones.
Geometry-based methods first generate a geometric representation of the world from LiDAR or depth data, then assess traversability by comparing geometric features, such as height, roughness, slope, curvature, and width, with the vehicle's mechanical properties .
Appearance-based methods assume that traversability is correlated with terrain appearance, and many learning-based approaches have been developed . To improve far-field capability, methods have been developed that use underfoot or near-field data to self-supervise the learning . Recently, deep neural networks have also been employed to model the procedure , where, to address the problem of data annotation, semi-supervised learning methods have been developed that incorporate weakly supervised labels such as the vehicle's driving path.
Behavior-based methods are a new trend in this field, inspired by the development of learning from demonstration (LfD) and its promising results in recent years . Mainstream algorithms in the LfD area can be roughly divided into two classes: Behavior Cloning (BC)  and Inverse Reinforcement Learning (IRL) . Behavior Cloning directly learns a mapping from observation to action, while IRL recovers the essential reward function behind expert demonstrations. Although earlier IRL algorithms used simple linear reward functions , deep neural network reward structures  were later proposed to model high-dimensional and non-linear processes. Compared with handcrafted costs and supervised-learning methods, IRL has better robustness and scalability . Recently, deep maximum entropy IRL has been used to learn a traversal cost map for urban autonomous driving , and vehicle kinematics has also been considered in  by converting the history trajectory into new data channels, which are integrated with scene features to compose the input of a CNN-based cost function. However, in these works vehicle kinematics is not incorporated into the forward reinforcement learning procedure, and the value iteration and state visitation frequency estimation methods used are inefficient.
This research proposes a method of off-road traversability analysis and trajectory planning using deep maximum entropy inverse reinforcement learning. Our novel contribution is to encode vehicle kinematics into convolution kernels and to propose two novel convolutional neural networks (RL ConvNet and Svf ConvNet) that make the forward reinforcement learning process efficient, solving the problem of the exponential increase of state-space complexity. Experiments are conducted in off-road environments using real driving trajectories, as well as synthesized ones representing specific behaviors, as demonstrations. Results validate the performance and efficiency of our method.
As illustrated in Fig. 1, this research proposes a deep inverse reinforcement learning framework for analyzing off-road autonomous vehicle traversability and planning trajectories, which incorporates kinematics and employs RL ConvNet and Svf ConvNet for efficient computation.
II-A Problem Formulation
We formulate the process of an autonomous vehicle navigating through an off-road environment as a Markov Decision Process (MDP), defined as a tuple (S, A, T, γ, r), where S denotes the state space of the scene, A denotes the action set of the autonomous vehicle, T denotes the state transition probabilities, γ denotes the discount factor, and finally r denotes the traverse reward. Let c denote the traversability cost, with r = -c, so that the lower the cost, the higher the reward.
Given a demonstration sample set D = {(M_i, ζ_i)}, at each scene M_i the vehicle is driven through trajectory ζ_i by a human expert. A trajectory ζ = {(s_1, a_1), …, (s_T, a_T)} is a sequence of state-action pairs, where the actions a_t are taken sequentially at the states s_t. The reward value of a trajectory is simply the accumulated reward (or negative cost) over all states that the trajectory traverses.
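With this notation, the accumulated trajectory reward takes the form (a reconstruction consistent with the definitions above):

```latex
R(\zeta) = \sum_{t=1}^{T} r(s_t) = -\sum_{t=1}^{T} c(s_t)
```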
Let f be a function evaluating the traversability cost of a certain scene with features x, i.e., c = f(x; θ). Following Wulfmeier et al. , we use grid maps to represent x, c, and r, and a fully convolutional neural network (FCN) for f with a parameter set θ. It is assumed that human expert trajectories intend to maximize the reward gain, i.e., to minimize the traversability cost. Our goal is to learn a parameter set θ for f from the expert's demonstrations, so as to guide an autonomous agent to plan trajectories in a similar way to human drivers.
II-B Maximum Entropy Deep IRL
Under the maximum entropy assumption, the probability of a trajectory is estimated as below, where trajectories with higher reward values are exponentially more preferable .
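The standard maximum-entropy form of this distribution, following Ziebart et al., is:

```latex
P(\zeta \mid \theta) = \frac{\exp\!\big(R_\theta(\zeta)\big)}{Z(\theta)},
\qquad Z(\theta) = \sum_{\zeta'} \exp\!\big(R_\theta(\zeta')\big).
```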
Given the demonstration sample set D, learning can be formulated as maximizing the following log-likelihood.
Let μ_D and μ denote the state visitation frequencies of the human expert drivers' policy and of the optimal policy recovered from the reward function, respectively, where μ_D is approximated from the human demonstration samples, while μ is estimated by solving the MDP. According to Ziebart et al.  and Wulfmeier et al. , optimizing θ is conducted by back-propagating the following loss gradient.
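In the MaxEnt deep IRL formulation of Wulfmeier et al., this gradient takes the form:

```latex
\frac{\partial \mathcal{L}}{\partial \theta}
= \frac{\partial \mathcal{L}}{\partial r}\,\frac{\partial r}{\partial \theta}
= \left(\mu_D - \mu\right)\frac{\partial r}{\partial \theta}.
```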
Hence, given the current parameter set θ, the following steps are taken at each iteration of the optimization. The processing flow is as follows.
1) Estimate the traversability cost c = f(x; θ), and let r = -c;
2) Run forward reinforcement learning to find an optimal policy π;
3) Compute the expected state visitation frequency μ;
4) Compute the expert's state visitation frequency μ_D from the demonstrated trajectories;
5) Update θ by Eqn. 5, where η is the learning rate.
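The iteration above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `cost_net`, `solve_mdp`, `expected_svf`, and `grad_step` are hypothetical stand-ins for the FCN cost function, the forward RL step, the svf computation, and an optimizer update.

```python
import numpy as np

def maxent_irl_step(feature_map, demo_svf, cost_net, grad_step,
                    solve_mdp, expected_svf):
    """One sketched MaxEnt deep IRL iteration (hypothetical interfaces)."""
    cost = cost_net(feature_map)    # step 1: c = f(x; theta)
    reward = -cost                  # r = -c
    policy = solve_mdp(reward)      # step 2: forward reinforcement learning
    mu = expected_svf(policy)       # step 3: expected visitation frequency
    loss_grad = demo_svf - mu       # steps 4-5: gradient w.r.t. the reward map
    grad_step(loss_grad)            # back-propagate through the cost FCN
    return reward, mu
```

In practice the gradient `demo_svf - mu` would be fed into the FCN's backward pass; here `grad_step` merely receives it.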
II-C Incorporating Vehicle Kinematics
However, the following problems remain. First, vehicle kinematics is non-holonomic, and incorporating non-holonomic constraints is vital for planning trajectories that are physically executable by the vehicle. Second, traditional value iteration and state visitation frequency estimation are time-consuming; in particular, incorporating kinematics adds state dimensions, resulting in an exponential increase in computational complexity.
Tamar et al.  proposed a convolutional network structure for the value iteration process, where the previous value and the reward are passed through a convolution layer and a max-pooling layer; each channel in the convolution output represents the Q-function of a specific action, and the convolution kernel weights correspond to the discounted transition probabilities. Thus, by recurrently applying a convolution layer K times, value iteration is performed efficiently with a significant reduction in computation cost.
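As a rough sketch of this idea (under the simplifying assumptions of a single reward map, 3×3 action kernels, and a shared discount factor), value iteration becomes a repeated convolution followed by max-pooling over action channels:

```python
import numpy as np
from scipy.signal import convolve2d

def conv_value_iteration(reward, kernels, gamma=0.9, iters=50):
    """Value iteration as convolution, in the spirit of Tamar et al.'s VIN.

    `kernels` is a list of 3x3 arrays, one per action; each kernel's
    weights stand for the transition probabilities of that action.
    """
    value = np.zeros_like(reward)
    for _ in range(iters):
        # one Q channel per action: Q_a = r + gamma * conv(V, k_a)
        q = np.stack([reward + gamma * convolve2d(value, k, mode="same")
                      for k in kernels])
        value = q.max(axis=0)  # max-pooling over the action channels
    return value
```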
Inspired by this idea, we propose RL ConvNet and Svf ConvNet to both incorporate vehicle kinematics and achieve efficient computation at the same time.
II-C1 RL ConvNet
We consider modeling the kinematic constraints of the vehicle's orientation. Let A and O be the sets of the vehicle's discrete actions and orientations, respectively. We assume that the vehicle's orientation constrains its state transition probability under a given action; hence a set of convolutional kernels {k_{a,o}} is defined, where k_{a,o} is a kernel whose weights correspond to the discounted transition probabilities after taking action a under orientation o. To handle the exponential increase of computation complexity that comes with incorporating kinematics into the state, without explicit performance degradation, we constrain the transition probability of vehicle orientations to be deterministic. More specifically, we define a function o' = g(o, a), where o' represents the next-step vehicle orientation after taking action a under the current orientation o. For convolution purposes, grid maps are employed to represent the value, Q, and policy functions, and a grid pixel corresponds to a 2D location in the scene. Let V^o be the value map for vehicle orientation o, where for any pixel p, V^o(p) is the expected maximal value starting from location p with orientation o. The Q and policy functions are defined in similar ways, related to both vehicle orientation and action: let Q^{o,a} and π^{o,a} be a Q map and a policy map for vehicle orientation o and action a, where for any pixel p, Q^{o,a}(p) is the expected long-term reward and π^{o,a}(p) is the stochastic policy probability if action a is taken when the vehicle is at location p with orientation o.
Hence the original value iteration algorithm can be converted to Algorithm 1, which incorporates vehicle kinematics and accelerates computation through convolution.
As illustrated in Fig. 2(a), for a particular orientation o, the corresponding value map is first passed through a convolution layer with the kernel k_{a,o} for orientation o and action a, then through a pixel-wise addition layer with the reward r to estimate Q^{o,a}. All Q^{o,a} maps of orientation o are then passed through a Softmax layer to obtain the updated V^o. These operations iterate K times until V^o converges. A sequence of policy maps π^{o,a} is then estimated, corresponding to the particular orientation o and each of the actions a, using matrix operations. These operations are conducted for each particular orientation o, hence a set of policy maps is obtained corresponding to each pair of o and a.
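One such iteration can be sketched as below. The per-orientation bookkeeping (`values`, `kernels`, `next_orient`) and the log-sum-exp "soft" maximum standing in for the paper's Softmax layer are assumptions of this illustration, not the authors' exact layer layout.

```python
import numpy as np
from scipy.signal import convolve2d

def rl_convnet_step(values, reward, kernels, next_orient):
    """One sketched RL ConvNet iteration over discrete orientations.

    values:      dict o -> HxW value map V^o
    kernels:     dict (o, a) -> 3x3 kernel k_{a,o}
    next_orient: dict (o, a) -> successor orientation g(o, a)
    """
    q, new_values = {}, {}
    for o in values:
        acts = [a for (oo, a) in kernels if oo == o]
        for a in acts:
            # deterministic orientation transition selects the value map
            v_next = values[next_orient[(o, a)]]
            q[(o, a)] = reward + convolve2d(v_next, kernels[(o, a)],
                                            mode="same")
        # soft maximum over action channels (the paper's Softmax layer)
        stack = np.stack([q[(o, a)] for a in acts])
        new_values[o] = np.log(np.exp(stack).sum(axis=0))
    return new_values, q
```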
II-C2 Svf ConvNet
The state visitation frequency is also represented as a grid map. Let μ^{o,a} be the orientation-action state visitation frequency, where each pixel value is the expected frequency with which action a is taken under the current orientation o at the corresponding 2D location. Similarly, μ^o denotes the expected state visitation frequency for vehicle orientation o, which is simply the sum of the orientation-action state visitation frequencies. Denoting ⊙ as the pixel-wise multiplication of two grid maps, μ^{o,a} is calculated as μ^{o,a} = π^{o,a} ⊙ μ^o.
Hence the original algorithm of computing expected state visitation frequency can be converted to Algorithm 2, which incorporates vehicle kinematics and computes efficiently through convolution.
Fig. 2(b) illustrates the workflow. For any particular orientation o, the policy map π^{o,a} of action a and orientation o is passed through a multiplication layer with μ^o. The resultant orientation-action state visitation frequency grid map μ^{o,a} is then passed through a convolution layer with the kernel k_{a,o}. The resulting maps are then passed through a weighted summation layer to find the updated orientation state visitation frequencies, where the weights are determined by the deterministic orientation transition function g.
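The propagation step can be sketched in the same style as before; accumulating each propagated map into its successor orientation via `next_orient` is an assumed reading of the weighted-summation layer.

```python
import numpy as np
from scipy.signal import convolve2d

def svf_convnet_step(mu, policy, kernels, next_orient):
    """One sketched Svf ConvNet propagation step (hypothetical structure).

    mu:     dict o -> HxW visitation-frequency map for orientation o
    policy: dict (o, a) -> HxW policy map pi^{o,a}
    """
    new_mu = {o: np.zeros_like(m) for o, m in mu.items()}
    for (o, a), pi in policy.items():
        mu_oa = mu[o] * pi                        # pixel-wise multiplication
        moved = convolve2d(mu_oa, kernels[(o, a)], mode="same")
        new_mu[next_orient[(o, a)]] += moved      # deterministic g(o, a)
    return new_mu
```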
III-A Experimental Design and Dataset
| Exp. | Scene | Expert's Behavior | Learnt Cost Function |
| --- | --- | --- | --- |
| Exp.1 (E1) | straight and flat road scenes | normal driving behavior | |
| Exp.2 (E2) | negative obstacle scenes | avoid the negative obstacles | |
| Exp.3 (E3) | negative obstacle scenes | cross negative obstacles if they block the way | |
| Exp.4 (E4) | negative obstacle scenes | cross all negative obstacles on the road | |
As shown in Tab. I, we design four experiments to examine the proposed method's ability to learn different experts' driving behaviors. In Exp.1, an expert demonstrates normal driving behavior on straight and flat roads, and the cost function learned from it is used as a baseline for the other experiments. Exp.2-4 are conducted in negative obstacle scenes, where the expert's behavior is to avoid the negative obstacles in Exp.2, to cross negative obstacles only if they block the way in Exp.3, and to cross all negative obstacles on the road in Exp.4.
Data collection is conducted in off-road environments using the instrumented vehicle shown in Fig. 3(a). The vehicle has a Velodyne HDL-64 LiDAR to map scene features, a GPS/IMU system to capture the expert's driving trajectories, and a front-view monocular camera for visualization only.
We generate a dataset in which each frame has a scene map, a demonstration trajectory crossing the scene, and start and goal points defined by the trajectory points that enter and leave the scene. The scene map is a 100×100 2D grid map with a resolution of 0.25 meters. It is generated using all LiDAR data collected while crossing the scene, and each grid cell is assigned the height value of the highest LiDAR point projected into it. The whole dataset has a total of 2388 scene maps, and the expert's real driving trajectories are used as the demonstrations for Exp.1 and Exp.2. For Exp.3 and Exp.4, we assume a vehicle of stronger mobility whose expert chooses to cross all or some of the negative obstacles for efficiency; we therefore use the real scene maps of negative obstacles but synthesize demonstration trajectories in compliance with the defined behaviors.
We choose 320 frames of straight and flat road scenes for Exp.1 and train a cost function on them. We choose another 320 frames of negative obstacle scenes, with trajectories demonstrating the different behaviors, for Exp.2, 3, and 4, and train the corresponding cost functions respectively. The remaining scenes are used for testing, where the focus is the comparison of the different cost functions and of the trajectories planned in the same scene.
III-B Implementation Details
III-B1 Transition Kernels
In our implementation, we use the discrete actions and orientations shown in Fig. 4. The orientation is discretized into eight directions at 45-degree intervals to cover the full range. The actions are simplified to combinations of a steering angle (-45, 0, or 45 degrees relative to the current orientation) and driving forward or backward. The transition kernels corresponding to each orientation are illustrated in Fig. 4. Although we only consider one-step transitions between adjacent grid cells, larger and more complex transition kernels can be designed in the future.
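A hypothetical construction of such one-step kernels is sketched below; the orientation indexing, neighbour offsets, and discount value are illustrative assumptions rather than the paper's exact kernels.

```python
import numpy as np

# Orientations indexed 0..7 counter-clockwise from east, at 45-degree
# intervals; OFFSETS gives the (row, col) shift to the neighbour cell
# along each heading on the grid map.
OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
           (0, -1), (1, -1), (1, 0), (1, 1)]

def make_kernel(orient, steer, forward=True, gamma=0.9):
    """Kernel k_{a,o}: move one cell along orient+steer (or its reverse),
    weighted by the discount factor; returns the kernel and g(o, a)."""
    next_o = (orient + steer) % 8
    dr, dc = OFFSETS[next_o]
    if not forward:
        dr, dc = -dr, -dc
    k = np.zeros((3, 3))
    k[1 + dr, 1 + dc] = gamma  # deterministic discounted transition
    return k, next_o

# 8 orientations x (3 steering angles x 2 driving directions) = 48 kernels
kernels = {(o, (s, f)): make_kernel(o, s, f)[0]
           for o in range(8) for s in (-1, 0, 1) for f in (True, False)}
```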
III-B2 Network and Training Configuration
In the experiment, we adopt a simple five-layer fully convolutional network (FCN) structure that takes the processed LiDAR feature map as input; two kernel sizes are used for the convolutions in the FCN. For RL ConvNet and Svf ConvNet, we set the number of value iterations to 150 and the number of svf iterations to 120, which were experimentally observed to yield an effective reinforcement learning process and svf computation. The network is trained with the Adam optimizer, with an initial learning rate of 1e-4 and a learning rate decay of 0.99. Batch training is employed with a batch size of 5, which proved in practice more stable than updating the weights based on a single demonstration.
III-C Training Results
The training process is visualized in Fig. 5. During training, we periodically evaluate the expert's trajectory reward and the policy's trajectory reward using Eqn. 1, taking the average reward of 30 trajectories randomly sampled from the learned policy. As shown in Fig. 5(a), the loss value decreases and tends to converge after 200 iterations. The expert's and the policy's rewards both increase, with the policy's reward remaining slightly lower than the expert's. Samples of the expert's demonstration trajectories are shown in Fig. 5(b), while trajectories generated by the policy at different training stages are shown in Fig. 5(c)-(e). One can clearly see that as the iterations increase, the similarity between the expert's and the policy's trajectories grows, which validates that the learned reward guides trajectory planning in compliance with human behavior.
More visual evaluation results of our four experiments can be found in Fig. 6. All four learned cost maps successfully capture the high traverse cost of positive obstacles (i.e., trees or bushes) but differ in their assessment of negative obstacles (pits on the road). Trained on human drivers' avoid-negative-obstacle trajectories, Exp.2 learns that negative obstacles have a higher cost than flat roads. Exp.3 and Exp.4 show the opposite result given cross-negative-obstacle trajectories as demonstrations: negative obstacles in their cost maps have relatively lower costs than flat roads, resulting in planned trajectories that cross the holes. One interesting observation is that Exp.4 assigns lower costs to holes than Exp.3, showing a stronger preference for negative obstacles.
III-D Testing Results
Visualizations of the testing results on different scenes are shown in Fig. 7. We demonstrate the learning capability and scalability of the method by comparing the behavior of trajectories generated by the different learned cost functions. As shown in Fig. 7, the cost function of Exp.1 fails to handle scenes where negative obstacles exist, and its planned trajectories cross them. On the contrary, having been demonstrated obstacle-avoiding behavior in Exp.2, that cost function shows good performance in these scenes and successfully avoids negative obstacles. By comparison, the cost functions of Exp.3 and Exp.4 show strong learning capability, as they produce cost maps that assign low traverse cost to negative obstacles; that of Exp.4 assigns much lower cost to negative obstacles, since it was demonstrated by trajectories with a stronger preference for them. The testing results show that our method has a strong capability of learning different behaviors and is scalable to different scenes. When transferring to new scenes or new driving behaviors, our method can refine the current model given demonstrations of new trajectories.
III-E Computation Efficiency Analysis
We re-implement the method of Wulfmeier et al.  and make a quantitative comparison of computation time with ours. The time spent on the reinforcement learning and svf computation of each sample is measured for analysis. The experiment is conducted on an Intel Xeon E5 CPU and an NVIDIA Titan X GPU. As shown in Tab. II and Tab. III, even without GPU acceleration, our method requires much less computation time than that of Wulfmeier et al.  in both the training and the testing process.
| Wulfmeier et al.  | Ours (CPU) | Ours (GPU) |
IV. Conclusion and Future Works
A method of off-road traversability analysis and trajectory planning using deep maximum entropy inverse reinforcement learning is proposed. A major novelty is the incorporation of vehicle kinematics: two convolutional neural networks, RL ConvNet and Svf ConvNet, are developed that encode vehicle kinematics into convolution kernels, solving the exponential increase of state-space complexity and achieving efficient computation in forward reinforcement learning. Experiments are conducted in which four traverse cost functions are learned from demonstration trajectories representing the behaviors of normal driving, avoiding negative obstacles, crossing all negative obstacles, and crossing negative obstacles only when they block the way. Results demonstrate that the learned cost functions are able to guide trajectory planning in compliance with the expert's behaviors, that the method scales to various scenes, and that it significantly improves the computation efficiency of both learning and testing. In future work, more extensive experimental studies will be conducted, and the accuracy of the kinematic kernels will be improved.
-  P. Papadakis, “Terrain traversability analysis methods for unmanned ground vehicles: A survey,” Engineering Applications of Artificial Intelligence, vol. 26, no. 4, pp. 1373–1385, 2013.
-  J. F. Lalonde, N. Vandapel, D. F. Huber, and M. Hebert, “Natural terrain classification using three-dimensional ladar data for ground robot mobility,” Journal of Field Robotics, vol. 23, no. 10, pp. 839–861, 2010.
-  J. Larson and M. Trivedi, “Lidar based off-road negative obstacle detection and analysis,” in International IEEE Conference on Intelligent Transportation Systems, 2011.
-  S. Kuthirummal, A. Das, and S. Samarasekera, “A graph traversal based algorithm for obstacle detection using lidar or stereo,” in 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, Sep. 2011, pp. 3874–3880.
-  F. Labrosse and M. Ososinski, “Automatic driving on ill-defined roads: An adaptive, shape-constrained, color-based method,” Journal of Field Robotics, vol. 32, no. 4, pp. 504–533, 2015.
-  J. Mei, Y. Yu, H. Zhao, and H. Zha, “Scene-adaptive off-road detection using a monocular camera,” IEEE Transactions on Intelligent Transportation Systems, vol. PP, no. 99, pp. 1–12, 2017.
-  A. Howard, M. Turmon, L. Matthies, B. Tang, A. Angelova, and E. Mjolsness, “Towards learned traversability for robot navigation: From underfoot to the far field,” Journal of Field Robotics, vol. 23, no. 11-12, pp. 1005–1017, 2010.
-  S. Zhou, J. Xi, M. W. Mcdaniel, T. Nishihata, P. Salesses, and K. Iagnemma, “Self-supervised learning to visually detect terrain surfaces for autonomous robots operating in forested terrain,” Journal of Field Robotics, vol. 29, no. 2, pp. 277–297, 2012.
-  B. Suger, B. Steder, and W. Burgard, “Traversability analysis for mobile robots in outdoor environments: A semi-supervised learning approach based on 3d-lidar data,” in 2015 IEEE International Conference on Robotics and Automation (ICRA), May 2015, pp. 3941–3946.
-  B. Dan, W. Maddern, and I. Posner, “Find your own way: Weakly-supervised segmentation of path proposals for urban autonomy,” in IEEE International Conference on Robotics and Automation, 2017.
-  B. Gao, A. Xu, Y. Pan, X. Zhao, W. Yao, and H. Zhao, “Off-Road Drivable Area Extraction Using 3D LiDAR Data,” Intelligent Vehicles Symposium, no. Iv, pp. 1323–1329, 2019.
-  B. D. Argall, S. Chernova, M. Veloso, and B. Browning, “A survey of robot learning from demonstration,” Robotics and Autonomous Systems, vol. 57, no. 5, pp. 469–483, 2009.
-  G. Hayes and J. Demiris, “A robot controller using learning by imitation,” in Proc. of the Intl. Symp. on Intelligent Robotic Systems, vol. 676, no. 5, pp. 1257–1274, 1994.
-  D. Pomerleau, “ALVINN: An autonomous land vehicle in a neural network,” in Advances in Neural Information Processing Systems, vol. 1, 1988, pp. 305–313.
-  M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba, “End to end learning for self-driving cars,” CoRR, vol. abs/1604.07316, 2016.
-  P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse reinforcement learning,” in International Conference on Machine Learning, 2004.
-  B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey, “Maximum entropy inverse reinforcement learning,” in Proceedings of the 23rd AAAI Conference on Artificial Intelligence, 2008, pp. 1433–1438.
-  N. D. Ratliff, J. A. Bagnell, and M. A. Zinkevich, “Maximum margin planning,” in International Conference on Machine Learning, 2006.
-  M. Wulfmeier, P. Ondruska, and I. Posner, “Maximum Entropy Deep Inverse Reinforcement Learning,” arXiv e-prints, July 2015.
-  C. Finn, S. Levine, and P. Abbeel, “Guided cost learning: Deep inverse optimal control via policy optimization,” in International Conference on International Conference on Machine Learning, 2016.
-  T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, and J. Peters, “An algorithmic perspective on imitation learning,” Foundations and Trends in Robotics, vol. 7, no. 1-2, pp. 1–179, 2018.
-  M. Wulfmeier, D. Z. Wang, and I. Posner, “Watch this: Scalable cost-function learning for path planning in urban environments,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct 2016, pp. 2089–2095.
-  M. Wulfmeier, D. Rao, D. Z. Wang, P. Ondruska, and I. Posner, “Large-scale cost function learning for path planning using deep inverse reinforcement learning,” International Journal of Robotics Research, vol. 36, no. 10, 2017.
-  Y. Zhang, W. Wang, R. Bonatti, D. Maturana, and S. Scherer, “Integrating kinematics and environment context into deep inverse reinforcement learning for predicting off-road vehicle trajectories,” CoRR, vol. abs/1810.07225, 2018. [Online]. Available: http://arxiv.org/abs/1810.07225
-  A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel, “Value iteration networks,” in Advances in Neural Information Processing Systems 29, 2016, pp. 2154–2162.