Off-road Autonomous Vehicles Traversability Analysis and Trajectory Planning Based on Deep Inverse Reinforcement Learning

09/16/2019 ∙ by Zeyu Zhu, et al. ∙ Peking University

Terrain traversability analysis is a fundamental issue in achieving the autonomy of a robot in off-road environments. Geometry-based and appearance-based methods have been studied for decades, while behavior-based methods exploiting learning from demonstration (LfD) are a new trend. Behavior-based methods learn cost functions that guide trajectory planning in compliance with experts' demonstrations, which can be more scalable to various scenes and driving behaviors. This research proposes a method of off-road traversability analysis and trajectory planning using Deep Maximum Entropy Inverse Reinforcement Learning. To incorporate vehicle kinematics while solving the problem of exponentially increasing state-space complexity, two convolutional neural networks, i.e., RL ConvNet and Svf ConvNet, are developed to encode kinematics into convolution kernels and achieve efficient forward reinforcement learning. We conduct experiments in off-road environments. Scene maps are generated using 3D LiDAR data, and expert demonstrations are either the vehicle's real driving trajectories at the scene or synthesized ones representing specific behaviors such as crossing negative obstacles. Four traversability cost functions are learned and tested at various scenes for their capability to guide the trajectory planning of different behaviors. We also demonstrate the computation efficiency of the proposed method.







I Introduction and Related Works

Terrain traversability analysis is a fundamental issue in achieving the autonomy of a robot in off-road environments. In such scenes, many methods designed for urban streets do not adapt, as there is no pavement or lane marking, no curb or other artificial object to delimit road and non-road regions, and the terrain surface is formed by natural objects that may have complex visual and geometric properties. An extensive review of the challenges and literature is given in [1]. LiDARs and cameras have been used as the major sensors for online traversability analysis, and the mainstream methods are divided by [1] into geometry-based and appearance-based ones.

Fig. 1: Framework of our proposed method. The RL ConvNet and Svf ConvNet are our major contributions, which incorporate vehicle kinematics and solve the computation efficiency problem.

Geometry-based methods generate a geometric representation of the world first using LiDAR or depth data, then assess traversability by comparing the geometric features such as height, roughness, slope, curvature and width with the vehicle’s mechanical properties [2][3][4].

Appearance-based methods assume that traversability is correlated with terrain appearance, and many learning-based approaches have been developed [5][6]. To improve far-field capability, methods have been developed that use underfoot or near-field data to self-supervise the learning [7][8]. Recently, deep neural networks have also been employed to model the procedure [9][10][11], where, to address the problem of data annotation, semi-supervised learning methods have been developed that incorporate weakly supervised labels such as the vehicle's driving path.

Behavior-based methods are a new trend in this field, inspired by the development of learning from demonstration (LfD) and its promising results in recent years [12]. Mainstream algorithms in the LfD area can be roughly divided into two classes: Behavior Cloning (BC) [13][14][15] and Inverse Reinforcement Learning (IRL) [16][17]. Behavior Cloning directly learns a mapping from observation to action, while IRL recovers the essential reward function behind expert demonstrations. Although earlier IRL algorithms use simple linear reward functions [16][17][18], deep neural network reward structures [19][20] were later proposed to model high-dimensional and non-linear processes. Compared with handcrafted costs and supervised-learning methods, IRL has better robustness and scalability [21]. Recently, deep maximum entropy IRL has been used to learn a traversable cost map for urban autonomous driving [22][23], and vehicle kinematics has also been considered in [24] by converting the history trajectory into new data channels, which are integrated with scene features to compose the input of a CNN-based cost function. However, in these works vehicle kinematics is not incorporated in the forward reinforcement learning procedure, and their methods of value iteration and state visitation frequency estimation have poor efficiency.

This research proposes a method of off-road traversability analysis and trajectory planning using Deep Maximum Entropy Inverse Reinforcement Learning. Our novel contribution is to encode vehicle kinematics into convolution kernels and propose two novel convolutional neural networks (RL ConvNet and Svf ConvNet) that achieve an efficient forward reinforcement learning process, solving the problem of exponentially increasing state-space complexity. Experiments are conducted in off-road environments using real driving trajectories and synthesized ones representing specific behaviors as demonstrations. Results validate the performance and efficiency of our method.

This paper is organized as follows. The proposed methodology is described in Section II. Experimental results are shown in Section III. Finally, conclusions and future works are given in Section IV.

II Methodology

As illustrated in Fig. 1, this research proposes a deep inverse reinforcement learning framework for analyzing off-road autonomous vehicle traversability and planning trajectories, which incorporates kinematics and employs RL ConvNet and Svf ConvNet for efficient computation.

II-A Problem Formulation

We formulate the process of an autonomous vehicle navigating through an off-road environment as a Markov Decision Process (MDP), which can be defined as a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \gamma, r)$, where $\mathcal{S}$ denotes the state space of the scene, $\mathcal{A}$ denotes the action set of the autonomous vehicle, $\mathcal{T}(s' \mid s, a)$ denotes the state transition probabilities, $\gamma$ denotes the discount factor, and finally $r$ denotes the traverse reward. Let $c$ denote the traversability costs, with $r = -c$, so the lower the costs, the higher the rewards.

We are given a demonstration sample set $\mathcal{D} = \{\zeta_i\}_{i=1}^{N}$, where at scene $i$ the vehicle is driven through trajectory $\zeta_i$ by a human expert. A trajectory is a sequence of state-action pairs $\zeta = \{(s_1, a_1), (s_2, a_2), \ldots\}$, where actions $a_t$ are taken sequentially at states $s_t$. The reward value of a trajectory is simply the accumulated reward (or negative cost) over all states that the trajectory traverses, $R(\zeta) = \sum_{(s, a) \in \zeta} r(s)$.
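With grid maps, this accumulation can be sketched in a few lines (the `(row, col)` trajectory encoding is a simplifying assumption for illustration):

```python
import numpy as np

def trajectory_reward(reward_map, trajectory):
    """Accumulated reward of a trajectory: the sum of per-cell rewards
    (negative traversability costs) over the visited grid cells."""
    return float(sum(reward_map[r, c] for r, c in trajectory))

# reward is the negative of a traversability cost map
cost = np.array([[1.0, 2.0],
                 [3.0, 4.0]])
reward = -cost
traj = [(0, 0), (0, 1), (1, 1)]   # hypothetical 3-step trajectory
```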


Let $f$ be a function evaluating the traversability cost of a certain scene with features $\phi$, $c = f(\phi; \theta)$. Following Wulfmeier et al. [22], we use grid maps to represent $\phi$, $c$, and $r$, and a fully convolutional neural network (FCN) for $f$ with a parameter set $\theta$. It is assumed that human expert trajectories intend to maximize the reward gained, or equivalently to minimize the traversability costs. Our goal is to learn a parameter set $\theta$ for $f$ from the expert's demonstrations, so as to guide an autonomous agent to plan trajectories in ways similar to human drivers.

II-B Maximum Entropy Deep IRL

Under the maximum entropy assumption, the probability of a trajectory is estimated as follows, where trajectories with higher reward values are exponentially more preferable [17]:

$$P(\zeta \mid \theta) = \frac{1}{Z(\theta)} \exp\big(R_\theta(\zeta)\big), \qquad Z(\theta) = \sum_{\zeta'} \exp\big(R_\theta(\zeta')\big)$$
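A minimal numeric sketch of this exponential preference, with hypothetical reward values for three candidate trajectories:

```python
import math

def trajectory_probs(trajectory_rewards):
    """MaxEnt trajectory distribution: P(traj) is proportional to
    exp(R(traj)), normalized by the partition sum over all candidates."""
    m = max(trajectory_rewards)                    # subtract max for stability
    exps = [math.exp(r - m) for r in trajectory_rewards]
    z = sum(exps)
    return [e / z for e in exps]

# a trajectory whose reward is higher by 1 is e (about 2.72x) more probable
probs = trajectory_probs([-1.0, -2.0, -5.0])
```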


Given the demonstration sample set $\mathcal{D}$, learning can be formulated as maximizing the following log-likelihood:

$$\theta^{*} = \arg\max_{\theta} \mathcal{L}(\theta) = \arg\max_{\theta} \sum_{\zeta \in \mathcal{D}} \log P(\zeta \mid \theta)$$


Let $\mu_D$ and $\mathbb{E}[\mu]$ denote the state visiting frequencies of the human expert drivers' policy and of the optimal policy recovered from the reward function, respectively, where $\mu_D$ is approximated from the human demonstration samples, while $\mathbb{E}[\mu]$ is estimated by solving the MDP. According to Ziebart et al. [17] and Wulfmeier et al. [22], optimizing $\theta$ is conducted by back-propagating the following loss gradient:

$$\frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial \mathcal{L}}{\partial r} \cdot \frac{\partial r}{\partial \theta} = \big(\mu_D - \mathbb{E}[\mu]\big) \cdot \frac{\partial r}{\partial \theta}$$


Hence, given the current parameter set $\theta$, the following steps are taken at each iteration of the optimization:

  1. Estimate the traversability cost $c = f(\phi; \theta)$, and let $r = -c$;

  2. Run forward reinforcement learning to find an optimal policy $\pi$;

  3. Compute the expected state visitation frequency $\mathbb{E}[\mu]$;

  4. Compute the expert's state visitation frequency $\mu_D$ from the demonstrated trajectories;

  5. Update $\theta$ by the loss gradient above with a learning rate $\eta$.
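The five steps can be sketched as a single optimization iteration. Everything here is an illustrative skeleton: `cost_fn`, `cost_grad`, `solve_mdp`, and `compute_svf` are hypothetical callables standing in for the FCN forward/backward pass and the forward-RL/SVF stages, not the paper's exact implementation.

```python
import numpy as np

def irl_iteration(theta, features, expert_svf, cost_fn, cost_grad,
                  solve_mdp, compute_svf, lr=1e-4):
    """One MaxEnt deep IRL iteration (steps 1-5).
    cost_fn(theta, features)      -> cost map (step 1)
    solve_mdp(reward)             -> policy (step 2, forward RL)
    compute_svf(policy)           -> expected visitation E[mu] (step 3)
    expert_svf                    -> mu_D, precomputed from demos (step 4)
    cost_grad(theta, features, g) -> dL/dtheta given upstream dL/dr = g."""
    cost = cost_fn(theta, features)        # step 1: traversability cost
    reward = -cost                         #         r = -c
    policy = solve_mdp(reward)             # step 2
    policy_svf = compute_svf(policy)       # step 3
    grad_r = expert_svf - policy_svf       # dL/dr = mu_D - E[mu]
    theta = theta + lr * cost_grad(theta, features, grad_r)  # step 5, ascent
    return theta, grad_r
```

When the policy's visitation frequency matches the expert's, the gradient vanishes and the parameters stop moving, which is the fixed point the loop converges toward.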

(a) RL ConvNet
(b) Svf ConvNet
Fig. 2: (a) Structure of our proposed RL ConvNet. (b) Structure of our proposed Svf ConvNet. $r$ denotes the reward map (negative of the cost). $V^{o}$ denotes the value map for orientation $o$. $W^{o,a}$ denotes the transition kernel corresponding to orientation $o$ and action $a$, and $Q^{o,a}$ and $\pi^{o,a}$ denote the Q maps and stochastic policy maps corresponding to orientation $o$ and action $a$, respectively. $D^{o}$ and $D^{o,a}$ denote the state visiting frequencies for orientation $o$ and orientation-action pair $(o, a)$, respectively.

II-C Incorporating Vehicle Kinematics

However, the following problems remain. First, vehicle kinematics is non-holonomic; incorporating non-holonomic constraints is vital for planning trajectories that are physically executable by vehicles. Second, traditional value iteration and state visiting frequency estimation are time-consuming. In particular, incorporating kinematics adds state dimensions, resulting in an exponential increase in computational complexity.

Tamar et al. [25] proposed a convolutional network structure for the value iteration process, where the previous value map $V$ and the reward map $r$ are passed through a convolution layer and a max-pooling layer; each channel of the convolution output represents the $Q$ function of a specific action, and the convolution kernel weights correspond to the discounted transition probabilities. Thus, by recurrently applying a convolution layer $K$ times, value iteration is performed efficiently with a significant reduction of computation costs.
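To make the idea concrete, here is a toy NumPy sketch of hard-max value iteration as a recurrent convolution. The 3x3 kernel layout, the discount of 0.9, and the grid size are assumptions for illustration, not the networks of Fig. 2:

```python
import numpy as np

def correlate_same(v, kernel):
    """Naive 3x3 'same' cross-correlation with zero padding."""
    pad = np.pad(v, 1)
    out = np.zeros_like(v)
    for i in range(v.shape[0]):
        for j in range(v.shape[1]):
            out[i, j] = np.sum(pad[i:i + 3, j:j + 3] * kernel)
    return out

def shift_kernel(dr, dc, gamma=0.9):
    """Kernel whose single weight gamma sits at the successor-cell offset,
    i.e. a discounted deterministic transition probability."""
    k = np.zeros((3, 3))
    k[1 + dr, 1 + dc] = gamma
    return k

def value_iteration_conv(reward, kernels, n_iters=100):
    """Value iteration as a recurrent convolution: each action channel is
    the reward plus the convolution of V with that action's transition
    kernel, and max-pooling over channels is the Bellman backup."""
    v = np.zeros_like(reward)
    for _ in range(n_iters):
        q = np.stack([reward + correlate_same(v, k) for k in kernels])
        v = q.max(axis=0)
    return v

# four moves plus "stay"; reward 1 at the top-left goal cell
kernels = [shift_kernel(0, 1), shift_kernel(0, -1),
           shift_kernel(1, 0), shift_kernel(-1, 0), shift_kernel(0, 0)]
reward = np.zeros((5, 5))
reward[0, 0] = 1.0
values = value_iteration_conv(reward, kernels)
```

With a discount of 0.9, the value at the goal converges to $1/(1-0.9) = 10$, and values decay with grid distance from the goal.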

Inspired by the idea, we propose RL ConvNet and Svf ConvNet for both incorporating vehicle kinematics and achieving efficient computation at the same time.

II-C1 RL ConvNet

We consider modeling the kinematic constraints of the vehicle's orientation. Let $\mathcal{A}$ and $\mathcal{O}$ be the sets of the vehicle's discrete actions and orientations, respectively. We assume that the vehicle's orientation constrains its state transition probability under a certain action; hence a set of convolutional kernels $\{W^{o,a}\}$ is defined, where $W^{o,a}$ is a kernel whose weights correspond to the discounted transition probabilities after taking action $a$ under orientation $o$. To handle the exponential increase of computational complexity that comes with incorporating kinematics into the state, without explicit performance degradation, we constrain the transition probability of vehicle orientations to be deterministic. More specifically, we define the function $o' = f(o, a)$, where $o'$ is the next-step vehicle orientation index after taking action $a$ under current orientation $o$. For convolution purposes, grid maps are employed to represent the value, $Q$, and policy functions, and a grid pixel corresponds to a 2D location in the scene. Let $V^{o}$ be the value map for vehicle orientation $o$, where for any pixel $(x, y)$, $V^{o}(x, y)$ is the expected maximal value starting from location $(x, y)$ with orientation $o$. $Q^{o,a}$ and $\pi^{o,a}$ are defined in similar ways, related to both vehicle orientation and action: for any pixel $(x, y)$, $Q^{o,a}(x, y)$ is the expected long-term reward and $\pi^{o,a}(x, y)$ is the stochastic policy probability if action $a$ is taken when the vehicle is at location $(x, y)$ with orientation $o$.

0:  reward $r$, transition kernels $\{W^{o,a}\}$, orientation transition function $f(o, a)$
0:  optimal stochastic policy $\{\pi^{o,a}\}$
1:  Initialize: $V^{o} \leftarrow \mathbf{0}$ for all $o \in \mathcal{O}$
2:  for $k = 1, \ldots, K$ do
3:     for $o \in \mathcal{O}$, $a \in \mathcal{A}$ do
4:        $Q^{o,a} \leftarrow r + W^{o,a} * V^{f(o,a)}$
5:     end for
6:     for $o \in \mathcal{O}$ do
7:        $V^{o} \leftarrow \log \sum_{a \in \mathcal{A}} \exp(Q^{o,a})$
8:     end for
9:  end for
10: $\pi^{o,a} \leftarrow \exp(Q^{o,a} - V^{o})$ for all $o \in \mathcal{O}$, $a \in \mathcal{A}$
Algorithm 1 RL ConvNet

Hence the original value iteration algorithm can be converted to Algorithm 1, which incorporates vehicle kinematics and accelerates computation through convolution.

As illustrated in Fig. 2(a), for a particular orientation $o$, each $V^{f(o,a)}$ is first passed through a convolution layer with the kernel $W^{o,a}$ corresponding to orientation $o$ and action $a$, then through a pixel-wise addition layer with $r$ to estimate $Q^{o,a}$. All $Q^{o,a}$s of orientation $o$ are then passed through a Softmax layer to obtain the updated $V^{o}$. These operations iterate $K$ times until $V^{o}$ converges. After convergence, a sequence of policy maps $\pi^{o,a}$ is estimated, corresponding to the particular orientation $o$ and each of the actions $a$, through matrix operations. These operations are conducted for each particular orientation $o$; hence a set of policy maps is obtained corresponding to each pair of $o$ and $a$.

II-C2 Svf ConvNet

State visiting frequency is also represented as a grid map. Let $D_t^{o,a}$ be the orientation-action state visiting frequency at step $t$, where each pixel value is the expected frequency with which action $a$ is taken under current orientation $o$ at the corresponding 2D location. Similarly, $D_t^{o}$ denotes the expected state visitation frequency for vehicle orientation $o$, which is simply the sum of the orientation-action state visiting frequencies. Denoting by $\odot$ the pixel-wise multiplication of two grid maps, $D_t^{o,a}$ is calculated as follows:

$$D_t^{o,a} = D_t^{o} \odot \pi^{o,a}$$
Hence the original algorithm of computing expected state visitation frequency can be converted to Algorithm 2, which incorporates vehicle kinematics and computes efficiently through convolution.

0:  stochastic policy $\{\pi^{o,a}\}$, transition kernels $\{W^{o,a}\}$, initial state distribution $D_1$, orientation transition function $f(o, a)$
0:  expected state visiting frequency $D$
1:  Initialize: $D \leftarrow D_1$
2:  for $t = 1, \ldots, T$ do
3:     for $o \in \mathcal{O}$ do
4:        for $a \in \mathcal{A}$ do
5:           $D_t^{o,a} \leftarrow D_t^{o} \odot \pi^{o,a}$
6:           $D_{t+1}^{f(o,a)} \leftarrow D_{t+1}^{f(o,a)} + W^{o,a} * D_t^{o,a}$
7:        end for
8:     end for
9:     for $o \in \mathcal{O}$ do
10:       $D^{o} \leftarrow D^{o} + D_{t+1}^{o}$
11:    end for
12: end for
Algorithm 2 Svf ConvNet

Fig. 2(b) illustrates the workflow. For any particular orientation $o$, the policy map $\pi^{o,a}$ of action $a$ and orientation $o$ is passed through a multiplication layer with $D_t^{o}$. The resulting orientation-action state visiting frequency grid map $D_t^{o,a}$ is then passed through a convolution layer with the kernel $W^{o,a}$ to obtain its contribution to $D_{t+1}^{f(o,a)}$. All $D_t^{o,a}$s are then passed through a weighted summation layer to find the updated orientation state visiting frequencies $D_{t+1}^{o'}$, where the weights select the pairs with $f(o, a) = o'$.
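A toy NumPy sketch of this forward propagation follows. One detail worth noting: value iteration pulls value back from successor cells, while SVF propagation pushes probability mass forward, so the same transition kernel is applied spatially flipped here. The data layout is again an illustrative assumption:

```python
import numpy as np

def correlate_same(v, kernel):
    """Naive 3x3 'same' cross-correlation with zero padding."""
    pad = np.pad(v, 1)
    out = np.zeros_like(v)
    for i in range(v.shape[0]):
        for j in range(v.shape[1]):
            out[i, j] = np.sum(pad[i:i + 3, j:j + 3] * kernel)
    return out

def svf_convnet(policy, kernels, next_ori, init, n_steps):
    """Expected state visitation frequency by forward propagation.
    policy[o][a]: stochastic policy map; kernels[o][a]: un-discounted
    transition kernel; next_ori[o][a]: successor orientation f(o, a);
    init[o]: initial state distribution of orientation o."""
    n_ori, n_act = len(kernels), len(kernels[0])
    d = [init[o].copy() for o in range(n_ori)]
    svf = [init[o].copy() for o in range(n_ori)]
    for _ in range(n_steps):
        d_next = [np.zeros_like(init[0]) for _ in range(n_ori)]
        for o in range(n_ori):
            for a in range(n_act):
                d_oa = d[o] * policy[o][a]        # D^{o,a} = D^o (.) pi^{o,a}
                # push mass toward successor cells: flipped kernel
                d_next[next_ori[o][a]] += correlate_same(
                    d_oa, np.flip(kernels[o][a]))
        d = d_next
        for o in range(n_ori):
            svf[o] += d[o]                        # accumulate over time
    return svf
```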

III Experiment

III-A Experimental Design and Dataset

Exp.        Scene                          Expert's Behavior                                Learnt Cost Function
Exp.1 (E1)  straight and flat road scenes  normal driving behavior                          $c_1$
Exp.2 (E2)  negative obstacle scenes       avoid the negative obstacles                     $c_2$
Exp.3 (E3)  negative obstacle scenes       cross negative obstacles if they block the way   $c_3$
Exp.4 (E4)  negative obstacle scenes       cross all negative obstacles on the road         $c_4$
TABLE I: Experimental Design

As shown in Tab. I, we design four experiments to examine the proposed method's capability of learning different experts' driving behaviors. In Exp.1, an expert demonstrates normal driving behavior on straight and flat roads, and a cost function is obtained as the result of learning, which is used as the baseline for the other experiments. Exp.2-4 are conducted at negative obstacle scenes, where the expert's behavior is to avoid the negative obstacles in Exp.2, to cross negative obstacles only if they block the way in Exp.3, and to cross all negative obstacles on the road in Exp.4.

Fig. 3: Our experimental vehicle platform and a scene map example. (a) Data collection platform (POSS-V). (b) Example RGB view of an off-road environment with an expert trajectory projected onto it. (c) 2D scene grid map.

Data collection is conducted in off-road environments using the instrumented vehicle shown in Fig. 3(a). The vehicle has a Velodyne HDL-64 LiDAR to map scene features, a GPS/IMU system to capture the expert's driving trajectories, and a front-view monocular camera for visualization only.

We generate a dataset in which each frame has a scene map, a demonstration trajectory crossing the scene, and a start and a goal point defined by the trajectory points that enter and leave the scene. The scene map is a 100×100 2D grid map with a resolution of 0.25 meters. It is generated using all LiDAR data crossing the scene, and each grid cell is assigned the height value of the highest LiDAR points projected onto it. The whole dataset has a total of 2388 scene maps, and the expert's real driving trajectories are used as the demonstrations for Exp.1 and Exp.2. For Exp.3 and Exp.4, assume a vehicle of stronger mobility whose expert chooses to cross all or some of the negative obstacles for efficiency: we use the real scene maps of negative obstacles, but synthesize demonstration trajectories in compliance with the defined behaviors.

We chose 320 frames of straight and flat road scenes for Exp.1 and trained a cost function $c_1$. We chose 320 frames of negative obstacle scenes, with trajectories demonstrating the different behaviors of Exp.2, 3, and 4, and trained cost functions $c_2$, $c_3$, and $c_4$, respectively. The remaining scenes are used for testing, where the focus is the comparison of the different cost functions and the trajectories they plan at the same scene.

III-B Implementation Details

Fig. 4: Illustration of some of the transition convolution kernels used in the experiments. "+"/"-" represents forward/backward.

III-B1 Transition kernels

In our implementation, we use the discrete actions and orientations shown in Fig. 4. The orientation is discretized into eight directions at 45-degree intervals to cover the full range. The actions are simplified to combinations of a steering angle (-45, 0, or 45 degrees relative to the current orientation) and driving forward or backward. The transition kernels corresponding to each orientation are illustrated in Fig. 4. Although we only consider one-step transitions between adjacent grid cells, larger and more complex transition kernels can be designed in the future.
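As a concrete illustration, the six kernels for an east-facing vehicle might be built as below; the discount value and the backward-motion cell offsets are assumptions for illustration, not the exact kernels of Fig. 4:

```python
import numpy as np

GAMMA = 0.9  # assumed discount factor folded into the kernel weights

def one_step_kernel(dr, dc, gamma=GAMMA):
    """3x3 kernel whose single weight is the discounted transition
    probability at the successor cell, offset (dr, dc) from the center."""
    k = np.zeros((3, 3))
    k[1 + dr, 1 + dc] = gamma
    return k

# vehicle oriented east (+column direction); actions combine a steering
# angle {-45, 0, +45} degrees with forward ("+") or backward ("-") motion
KERNELS_EAST = {
    ('-45', '+'): one_step_kernel(-1, 1),   # forward, steer left   -> NE cell
    (  '0', '+'): one_step_kernel( 0, 1),   # forward, straight     -> E cell
    ('+45', '+'): one_step_kernel( 1, 1),   # forward, steer right  -> SE cell
    ('-45', '-'): one_step_kernel( 1, -1),  # backward, steer left  -> SW cell
    (  '0', '-'): one_step_kernel( 0, -1),  # backward, straight    -> W cell
    ('+45', '-'): one_step_kernel(-1, -1),  # backward, steer right -> NW cell
}
```

The seven other orientations would reuse the same construction with rotated offsets.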

III-B2 Network and Training Configuration

In the experiments, we adopt a simple five-layer fully convolutional network (FCN) structure that takes the processed LiDAR feature map as input. Two kernel sizes are used for the convolution layers of the FCN. For the RL ConvNet and Svf ConvNet, we set the number of value iterations to 150 and the number of SVF iterations to 120, which we experimentally observed to be sufficient for an effective reinforcement learning process and SVF computation. The network is trained with the Adam optimizer with an initial learning rate of 1e-4 and a learning rate decay of 0.99. Batch training with a batch size of 5 is also employed, which proved in practice to be more stable than updating weights based on a single demonstration.

Fig. 5: Training process visualization. The loss curve and the rewards of the expert's and the policy's trajectories are shown in (a). Samples of the expert's demonstration trajectories are shown in (b), while policy-generated trajectories at different training stages are shown in (c)-(e).

III-C Training Results

The training process is visualized in Fig. 5. During training, we periodically evaluate the expert's trajectory reward and the policy's trajectory reward using Eqn. 1. The average reward of 30 trajectories randomly sampled from the learned policy is used. As shown in Fig. 5(a), the loss value decreases and tends to converge after 200 iterations. Both the expert's and the policy's rewards increase, with the policy's reward remaining slightly lower than the expert's. Samples of the expert's demonstration trajectories are shown in Fig. 5(b), while policy-generated trajectories at different training stages are shown in Fig. 5(c)-(e). One can clearly see that as iterations increase, the similarity between the expert's and the policy's trajectories becomes higher, which validates that the learned reward successfully guides trajectory planning in compliance with human behavior.

Fig. 6: Training result visualization of the four experiments (note that the experiments here are not meant to find the best cost map, but to demonstrate the capability of learning different behaviors). Columns from left to right: RGB image for reference; LiDAR image with the demonstration trajectory projected onto it; learned cost map after training; state visiting frequency (Svf) map; planned trajectory (in blue). In Exp.2-4, negative obstacles can be found on the LiDAR map (see the red arrows). The learned cost maps evaluate the traverse cost of these holes differently.
Fig. 7: Testing result visualization of different scenes. Due to limited page space, annotations can be found in Fig. 6, and are omitted here.

More visual evaluation results of our four experiments can be found in Fig. 6. All four learned cost maps successfully capture the high traverse cost of positive obstacles (i.e., trees or bushes) but differ in their assessment of negative obstacles (pits in the road). Demonstrated with human drivers' negative-obstacle-avoiding trajectories, Exp.2 learns that negative obstacles have higher cost than flat roads. Exp.3 and Exp.4 show the opposite result, given negative-obstacle-crossing trajectories as demonstrations: negative obstacles in the cost maps of Exp.3 and Exp.4 have relatively lower costs than flat roads, resulting in the hole-crossing behavior of the planned trajectories. One interesting observation is that Exp.4 assigns lower costs to holes than Exp.3, showing a stronger preference for negative obstacles.

III-D Testing Results

Visualizations of testing results on different scenes are shown in Fig. 7. We demonstrate the learning capability and scalability of our method by comparing the behaviors of trajectories generated by the different learned cost functions. As shown in Fig. 7, $c_1$ fails to handle scenes where negative obstacles exist, and its planned trajectories cross them. In contrast, trained on the obstacle-avoiding demonstrations of Exp.2, $c_2$ performs well in these scenes and successfully avoids negative obstacles. By comparison, $c_3$ and $c_4$ show strong learning capability, as they produce cost maps that assign low traverse costs to negative obstacles. Compared to $c_3$, $c_4$ assigns much lower costs to negative obstacles, since it was demonstrated with trajectories that prefer them more. The testing results show that our method has a strong capability of learning different behaviors and is scalable to different scenes. When transferring to new scenes or new driving behaviors, our method can refine the current model given newly demonstrated trajectories.

III-E Computation Efficiency Analysis

We re-implemented the method of Wulfmeier et al. [22] and made a quantitative comparison of its computation time with ours. The time spent on the reinforcement learning and SVF computation stages for each sample is measured for analysis. The experiment is conducted on an Intel Xeon E5 CPU and an NVIDIA Titan X GPU. As shown in Tab. II and Tab. III, even without GPU acceleration, our method requires far less computation time than that of Wulfmeier et al. [22] in both the training and testing processes.

     Wulfmeier et al. [22]  Ours (CPU)  Ours (GPU)
RL   80.5920 s              0.7501 s    0.3958 s
Svf  19.3368 s              1.0671 s    0.5161 s
TABLE II: Training time spent on the two computation stages per sample
     Wulfmeier et al. [22]  Ours (CPU)  Ours (GPU)
RL   77.5618 s              0.6027 s    0.2541 s
Svf  18.7573 s              0.8479 s    0.3860 s
TABLE III: Testing time spent on the two computation stages per sample

IV Conclusion and Future Works

A method of off-road traversability analysis and trajectory planning using Deep Maximum Entropy Inverse Reinforcement Learning is proposed. A major novelty is the incorporation of vehicle kinematics: two convolutional neural networks, i.e., RL ConvNet and Svf ConvNet, are developed that encode vehicle kinematics into convolution kernels, so as to solve the problem of exponentially increasing state-space complexity and achieve efficient computation in forward reinforcement learning. Experiments are conducted in which four traverse cost functions are learned from demonstration trajectories representing the behaviors of normal driving, avoiding negative obstacles, crossing all negative obstacles, and crossing negative obstacles only if they block the way. Results demonstrate that the learned cost functions are able to guide trajectory planning in compliance with the expert's behaviors, that the method scales to various scenes, and that it significantly improves the computation efficiency of both learning and testing. In future work, more extensive experimental studies will be conducted, and improving the accuracy of the kinematic kernels will be addressed.


  • [1] P. Papadakis, “Terrain traversability analysis methods for unmanned ground vehicles: A survey,” Engineering Applications of Artificial Intelligence, vol. 26, no. 4, pp. 1373–1385, 2013.
  • [2] J. F. Lalonde, N. Vandapel, D. F. Huber, and M. Hebert, “Natural terrain classification using three-dimensional ladar data for ground robot mobility,” Journal of Field Robotics, vol. 23, no. 10, pp. 839–861, 2010.
  • [3] J. Larson and M. Trivedi, “Lidar based off-road negative obstacle detection and analysis,” in International IEEE Conference on Intelligent Transportation Systems, 2011.
  • [4] S. Kuthirummal, A. Das, and S. Samarasekera, “A graph traversal based algorithm for obstacle detection using lidar or stereo,” in 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, Sep. 2011, pp. 3874–3880.
  • [5] F. Labrosse and M. Ososinski, “Automatic driving on ill-defined roads: An adaptive, shape-constrained, color-based method,” Journal of Field Robotics, vol. 32, no. 4, pp. 504–533, 2015.
  • [6] J. Mei, Y. Yu, H. Zhao, and H. Zha, “Scene-adaptive off-road detection using a monocular camera,” IEEE Transactions on Intelligent Transportation Systems, vol. PP, no. 99, pp. 1–12, 2017.
  • [7] A. Howard, M. Turmon, L. Matthies, B. Tang, A. Angelova, and E. Mjolsness, “Towards learned traversability for robot navigation: From underfoot to the far field,” Journal of Field Robotics, vol. 23, no. 11-12, pp. 1005–1017, 2010.
  • [8] S. Zhou, J. Xi, M. W. Mcdaniel, T. Nishihata, P. Salesses, and K. Iagnemma, “Self-supervised learning to visually detect terrain surfaces for autonomous robots operating in forested terrain,” Journal of Field Robotics, vol. 29, no. 2, pp. 277–297, 2012.
  • [9] B. Suger, B. Steder, and W. Burgard, “Traversability analysis for mobile robots in outdoor environments: A semi-supervised learning approach based on 3d-lidar data,” in 2015 IEEE International Conference on Robotics and Automation (ICRA), May 2015, pp. 3941–3946.
  • [10] B. Dan, W. Maddern, and I. Posner, “Find your own way: Weakly-supervised segmentation of path proposals for urban autonomy,” in IEEE International Conference on Robotics and Automation, 2017.
  • [11] B. Gao, A. Xu, Y. Pan, X. Zhao, W. Yao, and H. Zhao, “Off-Road Drivable Area Extraction Using 3D LiDAR Data,” Intelligent Vehicles Symposium, no. Iv, pp. 1323–1329, 2019.
  • [12] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, “A survey of robot learning from demonstration,” Robotics and Autonomous Systems, vol. 57, no. 5, pp. 469–483, 2009.
  • [13] G. Hayes and J. Demiris, “A robot controller using learning by imitation,” Proc.of the Intl.symp.on Intelligent Robotic Systems, vol. 676, no. 5, pp. 1257–1274, 1994.
  • [14] D. Pomerleau, “Alvinn: An autonomous land vehicle in a neural network,” vol. 1, 01 1988, pp. 305–313.
  • [15] M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba, “End to end learning for self-driving cars,” CoRR, vol. abs/1604.07316, 2016.
  • [16] P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse reinforcement learning,” in International Conference on Machine Learning, 2004.
  • [17] B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey, “Maximum entropy inverse reinforcement learning,” in Proceedings of the 23rd AAAI Conference on Artificial Intelligence, 2008, pp. 1433–1438.
  • [18] N. D. Ratliff, J. A. Bagnell, and M. A. Zinkevich, “Maximum margin planning,” in International Conference on Machine Learning, 2006.
  • [19] M. Wulfmeier, P. Ondruska, and I. Posner, “Maximum Entropy Deep Inverse Reinforcement Learning,” arXiv e-prints, July 2015.
  • [20] C. Finn, S. Levine, and P. Abbeel, “Guided cost learning: Deep inverse optimal control via policy optimization,” in International Conference on International Conference on Machine Learning, 2016.
  • [21] T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, and J. Peters, “An algorithmic perspective on imitation learning,” Foundations and Trends in Robotics, vol. 7, no. 1-2, pp. 1–179, 2018.
  • [22] M. Wulfmeier, D. Z. Wang, and I. Posner, “Watch this: Scalable cost-function learning for path planning in urban environments,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct 2016, pp. 2089–2095.
  • [23] M. Wulfmeier, D. Rao, D. Z. Wang, P. Ondruska, and I. Posner, “Large-scale cost function learning for path planning using deep inverse reinforcement learning,” International Journal of Robotics Research, vol. 36, no. 10, p. 027836491772239, 2017.
  • [24] Y. Zhang, W. Wang, R. Bonatti, D. Maturana, and S. Scherer, “Integrating kinematics and environment context into deep inverse reinforcement learning for predicting off-road vehicle trajectories,” CoRR, vol. abs/1810.07225, 2018.
  • [25] A. Tamar, Y. WU, G. Thomas, S. Levine, and P. Abbeel, “Value iteration networks,” in Advances in Neural Information Processing Systems 29, 2016, pp. 2154–2162.