1 Introduction
Autonomous systems operating in unstructured, partially observed, and changing real-world environments need an understanding of context to evaluate the safety, utility, and efficiency of their performance. For example, while a bipedal robot may navigate along sidewalks, an autonomous car needs to follow the road lane structure and the traffic signs. Designing a cost function that encodes such rules by hand is cumbersome, if not infeasible, especially for complex tasks. However, it is often possible to obtain demonstrations of desirable behavior that indirectly capture the role of semantic context in the task execution. Semantic labels provide rich information about the relationship between object entities and their surroundings. In this work, we consider an inverse reinforcement learning (IRL) problem in which observations containing semantic information about the environment are available.
There has been significant progress in semantic segmentation techniques, including deep neural networks for RGB image segmentation (Papandreou2015WeakSemSeg; Badrinarayanan2017Segnet; Chen2018EncoderSemSeg) and point cloud labeling via a 2D spherical depth projection (Wu2018SqueezeSeg; Dohan2015LidarSemSeg). Maps that store semantic information can be generated from segmented images (Sengupta2012SemMap; Lu2019MonoSemMap). Gan2019Bayesian and Sun2018ReccOctoMap generalize binary occupancy grid mapping (Hornung2013Octomap) to multi-class semantic mapping in 3D. In this work, we parameterize the navigation cost of an autonomous vehicle as a nonlinear function of such semantic features to explain the demonstrations of an expert.
Learning a cost function from demonstration requires a control policy that is differentiable with respect to the cost parameters. Computing policy derivatives has been addressed by several successful IRL approaches (Neu2012Apprenticeship; Ratliff2006MMP; Ziebart2008MaxEnt). Early works assume that the cost is linear in the feature vector and aim to match the feature expectations of the learned and expert policies.
Ratliff2006MMP computes subgradients of planning algorithms so that the expected reward of an expert policy is better than that of any other policy by a margin. Value iteration networks (VIN) (Tamar2016VIN) show that the value iteration algorithm can be approximated by a series of convolution and max-pooling layers, allowing automatic differentiation to learn the cost function end-to-end. Ziebart2008MaxEnt develops a dynamic programming algorithm to maximize the likelihood of observed expert data and learns a policy of maximum entropy (MaxEnt) distribution. Many works (Levine2011Nonlinear; Wulfmeier2016DeepMaxEnt; Song2019IRL) extend MaxEnt to learn a nonlinear cost using Gaussian processes or deep neural networks. Finn2016GCL uses a sample-based approximation of the MaxEnt objective on high-dimensional continuous systems. However, the cost in most existing work is learned offline using full observation sequences from the expert demonstrations. A major contribution of our work is to develop cost representations and planning algorithms that rely only on causal partial observations.

Achieving safe and robust navigation is directly coupled with the quality of the environment representation and the cost function specifying desirable behaviors. The traditional approach combines geometric mapping of occupancy probability or distance to the nearest obstacle (Hornung2013Octomap; Oleynikova2017Voxblox) with hand-specified planning cost functions. Recent advances in deep reinforcement learning have demonstrated that control inputs may be predicted directly from sensory observations (visuomotor control). However, special model designs (Khan2018MACN) that serve as a latent map are needed in navigation tasks where simple reactive policies are not feasible. Gupta2017CMP decompose visual navigation into two separate stages explicitly: mapping the environment from first-person RGB images and planning through the constructed map with VIN.
Our model also separates the two stages but integrates semantic information to obtain a richer map representation. In addition, Wang2020ICRA propose a differentiable mapping and planning framework to learn the expert cost function. They parameterize the cost function as a neural network over the binary occupancy map probability, which is integrated from previous partial observations. They further propose an efficient A* (Hart1968Astar) planning algorithm that computes the policy at the current state and backpropagates gradients in closed form to optimize the cost parameters. We extend their work by incorporating semantic observations into the map representation and evaluating the model in the CARLA autonomous driving simulator (Dosovitskiy2017CARLA).

We propose a model that learns to navigate from first-person semantic observations and make the following contributions. First, we propose a cost function representation composed of a map encoder, capturing semantic class probabilities from the streaming observations, and a cost encoder, defined as a deep neural network over the semantic features. Second, we optimize the cost parameters using a closed-form subgradient of the cost-to-go, obtained only over a subset of promising states by an efficient planning algorithm. Finally, we verify our model in autonomous navigation experiments in urban environments provided by the CARLA simulator (Dosovitskiy2017CARLA).
2 Problem Formulation
Consider a robot navigating in an unknown environment with the task of reaching a goal state $\mathbf{x}_g$. Let $\mathbf{x}_t \in \mathcal{X}$ be the robot state, capturing its pose, twist, etc., at discrete time $t$. For a given control input $\mathbf{u}_t \in \mathcal{U}$, where $\mathcal{U}$ is assumed finite, the robot state evolves according to known deterministic dynamics: $\mathbf{x}_{t+1} = f(\mathbf{x}_t, \mathbf{u}_t)$. Let $\mathcal{K} := \{0, 1, \ldots, K\}$ be a set of class labels, where $0$ denotes "free" space and $k \geq 1$ denotes a particular semantic class such as car or tree. Let $m^* : \mathcal{X} \to \mathcal{K}$ be a function specifying the true semantic occupancy of the environment by labeling states with semantic classes, and let $\mathcal{M}$ be the space of possible environment realizations $m$. Let $c^* : \mathcal{X} \times \mathcal{U} \to \mathbb{R}_{\geq 0}$ be a cost function specifying desirable robot behavior in a given environment, e.g., according to an expert user or an optimal design. We assume that the robot does not have access to either the true semantic map $m^*$ or the true cost function $c^*$. However, the robot is able to obtain point cloud observations $\mathbf{z}_t = \{(\mathbf{p}_{t,k}, \mathbf{y}_{t,k})\}_k$ at each step $t$, where $\mathbf{p}_{t,k} \in \mathbb{R}^3$ is the measurement location and $\mathbf{y}_{t,k}$ is an observed semantic likelihood such that $\mathbf{1}^\top \mathbf{y}_{t,k} = 1$, whose support is $\mathcal{K}$. In practice, $\mathbf{y}_{t,k}$ can be obtained from a semantic segmentation algorithm (Papandreou2015WeakSemSeg; Badrinarayanan2017Segnet; Chen2018EncoderSemSeg) that predicts the semantic class of the corresponding measurement location $\mathbf{p}_{t,k}$. The observed point cloud depends on the robot state $\mathbf{x}_t$ and the environment realization $m^*$. Given a training set $\mathcal{D} := \{(\mathbf{x}_{t,n}, \mathbf{u}^*_{t,n}, \mathbf{z}_{t,n})\}_{t=1,n=1}^{T_n,N}$ of $N$ expert trajectories with length $T_n$ to demonstrate desirable behavior, our goal is to

1. learn a cost function estimate $c_t$ that depends on an observation sequence $\mathbf{z}_{1:t}$ from the true latent environment and is parameterized by $\boldsymbol{\theta}$,

2. design a stochastic policy $\pi_t$ from $c_t$ such that the robot behavior under $\pi_t$ matches the prior experience $\mathcal{D}$.
To balance exploration in partially observable environments with exploitation of promising controls, we specify $\pi_t$ as a Boltzmann policy (Ramachandran2007BayesianIRL; Neu2012Apprenticeship) associated with the cost $c_t$, $\pi_t(\mathbf{u} \mid \mathbf{x}) \propto \exp\big(-Q^*_t(\mathbf{x}, \mathbf{u})\big)$, where the optimal cost-to-go function $Q^*_t$ is:

(1) $Q^*_t(\mathbf{x}, \mathbf{u}) := \min_{T,\, \mathbf{u}_{t+1:T}} \Big[ c_t(\mathbf{x}, \mathbf{u}) + \sum_{k=t+1}^{T} c_t(\mathbf{x}_k, \mathbf{u}_k) \Big]$ s.t. $\mathbf{x}_{t+1} = f(\mathbf{x}, \mathbf{u})$, $\mathbf{x}_{k+1} = f(\mathbf{x}_k, \mathbf{u}_k)$, $\mathbf{x}_{T+1} = \mathbf{x}_g$.
Problem
Given demonstrations $\mathcal{D}$, optimize the cost function parameters $\boldsymbol{\theta}$ so that the log-likelihood of the demonstrated controls $\mathbf{u}^*_{t,n}$ is maximized under the robot policies $\pi_t$:

(2) $\min_{\boldsymbol{\theta}}\ \mathcal{L}(\boldsymbol{\theta}) := -\sum_{n=1}^{N} \sum_{t=1}^{T_n} \log \pi_t\big(\mathbf{u}^*_{t,n} \mid \mathbf{x}_{t,n}\big).$
The problem setup is illustrated in Fig. 1. While Eqn. (1) is a standard deterministic shortest path (DSP) problem, the challenge is to make it differentiable with respect to the cost parameters $\boldsymbol{\theta}$, which is necessary for the loss in (2) to propagate back through the DSP problem and update the parameters. Once the parameters are optimized, the robot can generalize to navigation tasks in new partially observable environments by iteratively evaluating the cost based on the incoming observations and (re)computing the associated policy.
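Since $\mathcal{U}$ is finite, the Boltzmann policy above is simply a softmax over the negated cost-to-go values. A minimal sketch (the function name and the implicit unit temperature are our own, not from the paper):

```python
import numpy as np

def boltzmann_policy(q_values):
    """Boltzmann policy over a finite control set from cost-to-go values Q*(x, u).

    Lower cost-to-go -> higher probability; shifting by the max keeps exp() stable.
    """
    logits = -np.asarray(q_values, dtype=float)
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

# Controls with equal cost-to-go receive equal probability; cheaper controls dominate.
probs = boltzmann_policy([2.0, 2.0, 5.0])
```

The max-shift does not change the distribution because the softmax is invariant to adding a constant to all logits.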
3 Cost Function Representation
We propose a cost function representation with two components: a semantic map encoder and a cost encoder, each with its own learnable parameters. The model is differentiable by design, allowing its parameters to be optimized by the subsequent planning algorithm described in Sec. 4.
3.1 Semantic Map Encoder
The semantic probability of different environment areas is encoded in a hidden state given the trajectory $\mathbf{x}_{1:t}$ and observations $\mathbf{z}_{1:t}$. Specifically, we discretize the state space into $n$ cells and let $\mathbf{m} = [m_1, \ldots, m_n]^\top \in \mathcal{K}^n$ be the random vector of true semantic labels over the cells. Since $\mathbf{m}$ is unknown to the robot, we maintain the semantic occupancy posterior $\mathbb{P}(\mathbf{m} \mid \mathbf{x}_{1:t}, \mathbf{z}_{1:t})$, given the history of states and observations. The representation complexity may be simplified significantly if one assumes independence among the map cells $m_i$: $\mathbb{P}(\mathbf{m} \mid \mathbf{x}_{1:t}, \mathbf{z}_{1:t}) = \prod_{i=1}^{n} \mathbb{P}(m_i \mid \mathbf{x}_{1:t}, \mathbf{z}_{1:t})$.
Inspired by binary occupancy grid mapping (Thrun2005PR; Hornung2013Octomap), we extend the recurrent updates to the multi-class semantic probability of each cell $m_i$. The log-odds ratio of semantic classes associated with cell $i$ at time $t$ is

(3) $\mathbf{h}_{t,i} := \Big[ \log \frac{\mathbb{P}(m_i = k \mid \mathbf{x}_{1:t}, \mathbf{z}_{1:t})}{\mathbb{P}(m_i = 0 \mid \mathbf{x}_{1:t}, \mathbf{z}_{1:t})} \Big]_{k \in \mathcal{K}}.$
Its recurrent Bayesian update is $\mathbf{h}_{t,i} = \mathbf{h}_{t-1,i} + \mathbf{l}_i(\mathbf{x}_t, \mathbf{z}_t)$. Note that by definition the element of $\mathbf{h}_{t,i}$ corresponding to the free class is zero. The update increment $\mathbf{l}_i(\mathbf{x}_t, \mathbf{z}_t)$ is a log-odds observation model, and we assume the observation $\mathbf{z}_t$ given the cell $m_i$ is independent of the previous observations $\mathbf{z}_{1:t-1}$. The semantic class posterior can be recovered from the semantic log-odds ratio via a softmax function $\sigma(\mathbf{h}_{t,i})$, where $\sigma$ satisfies the following properties:
(4) $\sigma_k(\mathbf{h}) = \frac{e^{h_k}}{\sum_{j \in \mathcal{K}} e^{h_j}}, \qquad \log \frac{\sigma_k(\mathbf{h})}{\sigma_0(\mathbf{h})} = h_k - h_0.$
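The recurrent log-odds update and the softmax recovery can be sketched per cell as follows; the class count, increment values, and class names are illustrative only, with index 0 taken as the "free" class:

```python
import numpy as np

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

class SemanticCell:
    """Per-cell multi-class log-odds filter; index 0 is the 'free' class.

    h[k] stores log P(m_i = k | x_{1:t}, z_{1:t}) - log P(m_i = 0 | x_{1:t}, z_{1:t}),
    so h[0] == 0 by definition.
    """
    def __init__(self, num_classes, prior_logodds=None):
        self.h = np.zeros(num_classes) if prior_logodds is None \
            else np.asarray(prior_logodds, dtype=float)

    def update(self, logodds_increment):
        # Recurrent Bayesian update: h_t = h_{t-1} + l(x_t, z_t).
        self.h = self.h + np.asarray(logodds_increment, dtype=float)
        self.h -= self.h[0]  # keep the free-class reference entry at zero

    def posterior(self):
        # Recover class probabilities via a softmax over the log-odds.
        return softmax(self.h)
```

Because only differences of log-odds matter, re-zeroing the free-class entry after each update leaves the posterior unchanged.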
We provide a simple observation model to instantiate Eq. (3). Consider all cells $i$ that lie on the ray between the robot state $\mathbf{x}_t$ and a labeled point $(\mathbf{p}_{t,j}, \mathbf{y}_{t,j})$ in the point cloud $\mathbf{z}_t$. Let $d_{t,j,i}$ be the distance between the robot position and the center of mass of cell $i$. The inverse observation model relating the label of cell $i$ to the ray between robot state $\mathbf{x}_t$ and labeled point $\mathbf{p}_{t,j}$ is defined as a softmax function with parameters $\boldsymbol{\psi}$, scaled by the distance, which is truncated at a threshold $\bar{d}$:

(5) $\mathbb{P}(m_i = k \mid \mathbf{x}_t, \mathbf{p}_{t,j}, \mathbf{y}_{t,j}) = \sigma_k\big(\mathrm{diag}(\boldsymbol{\psi}) \min(d_{t,j,i}, \bar{d})\, \mathbf{y}_{t,j}\big).$

The function $\mathrm{diag}(\cdot)$ returns a diagonal matrix from a vector, and the prior occupancy log-odds ratio $\mathbf{h}_{0,i}$ depends on the environment (e.g., $\mathbf{h}_{0,i} = \mathbf{0}$ specifies a uniform prior over the semantic classes).
Given the definitions of the log-odds ratio in Eq. (3) and the inverse observation model in Eq. (5), the log-odds update rule for the semantic probability at cell $i$ is $\mathbf{h}_{t,i} = \mathbf{h}_{t-1,i} + \sum_j \mathbf{l}_i(\mathbf{x}_t, \mathbf{p}_{t,j}, \mathbf{y}_{t,j})$, where the log-odds inverse observation model for cells along the ray from $\mathbf{x}_t$ to $\mathbf{p}_{t,j}$ can be simplified using (4) as:

(6) $\mathbf{l}_i(\mathbf{x}_t, \mathbf{p}_{t,j}, \mathbf{y}_{t,j}) = \min(d_{t,j,i}, \bar{d})\big(\mathrm{diag}(\boldsymbol{\psi})\, \mathbf{y}_{t,j} - \psi_0\, y_{t,j,0}\, \mathbf{1}\big).$
A more expressive multi-layer neural network may be used to parameterize the inverse observation model instead of the linear transformation of the semantic probability and distance along the $j$-th ray in Eq. (5):

(7) $\mathbf{l}_i(\mathbf{x}_t, \mathbf{p}_{t,j}, \mathbf{y}_{t,j}) = \mathrm{NN}_{\boldsymbol{\psi}}\big(\mathbf{y}_{t,j},\, d_{t,j,i}\big).$
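The ray-based update of Sec. 3.1 can be sketched as follows, assuming a 2D grid, a sampled ray traversal, and a distance-scaled linear log-odds model referenced to the free class (our simplification); the parameter vector `psi` and all helper names are hypothetical:

```python
import numpy as np

def ray_cells(start, end, resolution=1.0):
    """Cells (2D indices) traversed by the ray from start to end, by dense sampling."""
    start, end = np.asarray(start, dtype=float), np.asarray(end, dtype=float)
    n = int(np.ceil(np.linalg.norm(end - start) / (0.5 * resolution))) + 1
    pts = start + np.linspace(0.0, 1.0, n)[:, None] * (end - start)
    cells, seen = [], set()
    for p in pts:
        c = tuple((p // resolution).astype(int))
        if c not in seen:
            seen.add(c)
            cells.append(c)
    return cells

def logodds_increments(robot_pos, point, y, psi, d_max, resolution=1.0):
    """Distance-scaled log-odds increment for each cell along one ray.

    y   : semantic likelihood of the ray endpoint (sums to 1, index 0 = free)
    psi : per-class weights (hypothetical learnable parameters)
    """
    out = {}
    for c in ray_cells(robot_pos, point, resolution):
        center = (np.asarray(c) + 0.5) * resolution
        d = min(np.linalg.norm(center - np.asarray(robot_pos, dtype=float)), d_max)
        l = d * psi * y      # elementwise distance-scaled linear model
        out[c] = l - l[0]    # reference the free class so entry 0 is zero
    return out
```

Summing these increments over all rays of a scan yields the per-cell update in the recurrent Bayesian filter.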
3.2 Cost Encoder
The cost encoder uses the semantic occupancy grid posterior to define the cost function estimate $c_t(\mathbf{x}, \mathbf{u})$ at a given state-control pair. A convolutional neural network (CNN) (Goodfellowetal2016) with learnable parameters can extract cost features from the environment map. We implement an encoder-decoder neural network architecture (Badrinarayanan2017Segnet) to parameterize the cost function from the semantic class probabilities. The idea is to downsample and upsample at multiple scales to provide both local and global context relating semantic probability to cost.

4 Cost Learning via Differentiable Planning
We follow the planning algorithm in Wang2020ICRA, which enables efficient cost optimization, and briefly review the steps below. The parameters $\boldsymbol{\theta}$ of the cost representation developed in Sec. 3 are optimized by differentiating $\mathcal{L}(\boldsymbol{\theta})$ in (2) through the DSP problem in (1). Motion planning algorithms, such as A* (ARA), solve problem (1) efficiently and determine the optimal cost-to-go only over a subset of promising states. This is sufficient to obtain the subgradient of $\mathcal{L}(\boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$ along the optimal path by applying the subgradient method (Shor2012Subgradient; Ratliff2006MMP).
A backwards A* search applied to problem (1) with start state $\mathbf{x}_g$, goal state $\mathbf{x}_t$, and predecessor expansions according to the transition $f$ provides an upper bound to the optimal cost-to-go: $Q^*_t(\mathbf{x}, \mathbf{u}) \leq c_t(\mathbf{x}, \mathbf{u}) + g(f(\mathbf{x}, \mathbf{u}))$, where $g(\cdot)$ are the values computed by A* for expanded nodes in the CLOSED list and visited nodes in the OPEN list. Strict equality is obtained only if $f(\mathbf{x}, \mathbf{u})$ belongs to the CLOSED list. A Boltzmann policy may be defined using the $g$-values for all states in the CLOSED and OPEN lists and a uniform distribution over $\mathcal{U}$ for all other states.

We rewrite $Q^*_t$ in a form that makes its subgradient with respect to $\boldsymbol{\theta}$ obvious. Let $\mathcal{T}(\mathbf{x}, \mathbf{u})$ be the set of feasible trajectories $\tau$ of horizon $T$ that start at $\mathbf{x}_t = \mathbf{x}$, $\mathbf{u}_t = \mathbf{u}$, satisfy the transition $f$, and terminate at $\mathbf{x}_g$. Let $\tau^*$ be an optimal trajectory corresponding to the optimal cost-to-go function $Q^*_t(\mathbf{x}, \mathbf{u})$. Define $\mu_\tau(\mathbf{x}, \mathbf{u}) \in \{0, 1\}$ as a state-control visitation function indicating whether $(\mathbf{x}, \mathbf{u})$ is visited by $\tau$. The optimal cost-to-go function can be viewed as a minimum over $\mathcal{T}(\mathbf{x}, \mathbf{u})$ of the inner product of the cost function $c_t$ and the visitation function $\mu_\tau$:
(8) $Q^*_t(\mathbf{x}, \mathbf{u}) = \min_{\tau \in \mathcal{T}(\mathbf{x}, \mathbf{u})} \sum_{\mathbf{x}', \mathbf{u}'} c_t(\mathbf{x}', \mathbf{u}')\, \mu_\tau(\mathbf{x}', \mathbf{u}') = \min_{\tau \in \mathcal{T}(\mathbf{x}, \mathbf{u})} \langle c_t, \mu_\tau \rangle,$
where $\mathcal{T}(\mathbf{x}, \mathbf{u})$ can be assumed finite because both $\mathcal{X}$ and $\mathcal{U}$ are finite. Applying the subgradient method (Shor2012Subgradient; Ratliff2006MMP) to (8) shows that the visitation function $\mu_{\tau^*}$ of an optimal trajectory is a subgradient of the optimal cost-to-go. This result and the chain rule allow us to obtain a subgradient of $\mathcal{L}(\boldsymbol{\theta})$ in (2).
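The visitation-based subgradient can be illustrated on a small grid. The sketch below uses Dijkstra's algorithm as a stand-in for the backwards A* search, and per-cell (rather than state-control) costs for brevity; all names are our own:

```python
import heapq
import numpy as np

def dijkstra_path(cost, start, goal):
    """Shortest 4-connected path on a cost grid (stand-in for the A* solver).

    Entering cell (i, j) incurs cost[i, j]; returns the optimal cell sequence.
    """
    H, W = cost.shape
    dist, parent, pq = {start: 0.0}, {}, [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist.get(u, np.inf):
            continue  # stale queue entry
        i, j = u
        for v in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
            if 0 <= v[0] < H and 0 <= v[1] < W:
                nd = d + cost[v]
                if nd < dist.get(v, np.inf):
                    dist[v], parent[v] = nd, u
                    heapq.heappush(pq, (nd, v))
    path, u = [goal], goal
    while u != start:
        u = parent[u]
        path.append(u)
    return path[::-1]

def costtogo_subgradient(cost, start, goal):
    """Visitation indicator of an optimal path: a subgradient of Q* w.r.t. the costs."""
    mu = np.zeros_like(cost)
    for cell in dijkstra_path(cost, start, goal):
        mu[cell] = 1.0
    return mu
```

Backpropagating through the cost network then only requires the gradient of each visited cell's cost with respect to $\boldsymbol{\theta}$, which is why planning over a subset of promising states suffices.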
5 Experiments
5.1 Experiment Setup
We evaluate our approach using the CARLA simulator (0.9.6) (Dosovitskiy2017CARLA), which provides high-fidelity autonomous vehicle simulation in urban environments. Demonstration data for training the cost function representation are collected from a set of training maps, while a separate map is held out for testing. The test map is the largest and includes different street layouts, junctions, and a freeway. In each map, we collect expert trajectories by running the autonomous navigation agent provided by the CARLA Python API. The expert finds the shortest path between two query points while respecting traffic rules, such as staying on the road and keeping to the current lane. Features not related to the experiment are disabled, including spawning of other vehicles and pedestrians and changing of traffic signals. Each vehicle trajectory is discretized into a grid at meter-level resolution. The robot state is the grid cell location, while each control moves the robot to one of its neighboring grid cells. Trajectories that do not fit in the grid are discarded.
The ego vehicle is equipped with a lidar sensor with a fixed maximum range and full horizontal field of view. The vertical field of view spans from level (facing forward) to a downward-facing angle at a fixed angular resolution. A fixed number of lidar rays is generated per scan, and each point measurement is returned only if the ray hits an obstacle. The ego vehicle is also equipped with semantic segmentation cameras that label objects from 13 different classes in RGB images, including road, road line, sidewalk, vegetation, car, building, etc. The cameras face front, left, right, and rear, each capturing a fixed horizontal field of view. The semantic label of each lidar point is retrieved from the corresponding semantic segmentation image by projecting the lidar point into the camera's frame.
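The lidar-to-camera label retrieval can be sketched with a standard pinhole projection; the camera convention, function name, and intrinsics below are our own assumptions, not the CARLA API:

```python
import numpy as np

def label_lidar_point(p_cam, seg_image, K):
    """Fetch the semantic label for a lidar point already expressed in the camera frame.

    p_cam     : 3D point (x right, y down, z forward) -- hypothetical convention
    seg_image : HxW array of per-pixel class ids from the segmentation camera
    K         : 3x3 pinhole intrinsic matrix
    Returns the class id, or None if the point is behind the camera or off-image.
    """
    p = np.asarray(p_cam, dtype=float)
    if p[2] <= 0:  # behind the image plane
        return None
    uvw = K @ p
    u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]
    col, row = int(round(u)), int(round(v))
    H, W = seg_image.shape
    if 0 <= row < H and 0 <= col < W:
        return int(seg_image[row, col])
    return None
```

With four cameras covering all directions, each lidar point is projected into the camera whose frustum contains it.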
5.2 Models and Metrics
We compare our model with two baseline algorithms: Wulfmeier2016DeepMaxEnt and Wulfmeier2016DeepMaxEnt + semantics. Wulfmeier2016DeepMaxEnt use a neural network to learn a cost from lidar point clouds without semantics. The input to the neural network is a grid that stores the mean and variance of the points in each cell, as well as a binary indicator of cell visibility. We augment the grid features with the mode of the semantic labels in each cell to obtain the model Wulfmeier2016DeepMaxEnt + semantics as a fair comparison with ours. Neural networks are implemented in the PyTorch library (Paszke2019Pytorch) and trained with the Adam optimizer (Kingma2014ADAM) until convergence.

The evaluation metrics include: negative log-likelihood (NLL), control accuracy (Acc), trajectory success rate (Traj. Succ. Rate), and Modified Hausdorff Distance (MHD). More precisely, given a test set of expert trajectories and a learned policy with parameters $\boldsymbol{\theta}$, NLL is the average negative log-likelihood of the expert controls under the learned policy, and Acc is the fraction of states at which the most likely control under the policy matches the expert control. Traj. Succ. Rate records the success rate of the learned policy when iteratively rolling out its predicted controls. A trajectory is regarded as successful if it reaches the goal within twice the number of steps of the expert trajectory without hitting an obstacle. MHD compares the trajectory rolled out by the learned policy with the expert trajectory and is defined as $\mathrm{MHD}(\tau_1, \tau_2) = \max\big\{\frac{1}{|\tau_1|}\sum_{\mathbf{x} \in \tau_1} d(\mathbf{x}, \tau_2),\ \frac{1}{|\tau_2|}\sum_{\mathbf{x} \in \tau_2} d(\mathbf{x}, \tau_1)\big\}$, where $d(\mathbf{x}, \tau)$ measures the minimum Euclidean distance from state $\mathbf{x}$ to any state in $\tau$.

5.3 Results and Discussion
Model | NLL | Acc (%) | Traj. Succ. Rate (%) | MHD
Wulfmeier2016DeepMaxEnt | 0.595 | 86.1 | 92 | 4.521
Wulfmeier2016DeepMaxEnt + semantics | 0.613 | 82.7 | 88 | 4.479
Ours | 0.446 | 90.5 | 91 | 3.036
Fig. 4 shows the performance of our model versus Wulfmeier2016DeepMaxEnt and Wulfmeier2016DeepMaxEnt + semantics using the metrics described above. Ours learns to generate policies closest to the expert in new environments, scoring best in NLL and Acc. The predicted trajectory is also closest to the expert, achieving the minimum MHD. The results demonstrate that the semantic map encoder captures more geometric as well as semantic information, so that the optimized cost function generates trajectories that match the expert behavior. We notice that simply taking the mode of the semantic labels in each grid cell degrades the performance of Wulfmeier2016DeepMaxEnt. We conjecture that taking the mode is a deterministic assignment, which could provide conflicting semantic information, while our model uses a probabilistic semantic map encoder with Bayesian updates to avoid information loss. Fig. 5 shows an example of the predicted trajectory at an intersection. The semantic map visualizes the class of highest probability, which mostly reflects the ground truth. Sub-cell objects like road lines are captured in the semantic map distribution but not visualized in the most probable class. Interestingly, our model assigns low cost to the road in front of the robot, medium cost to sidewalks, and high cost to the road behind it. This cost assignment effectively drives the robot toward the goal.
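The MHD metric used in the evaluation above can be computed with a short NumPy routine (a minimal sketch over discretized trajectories, with a hypothetical function name):

```python
import numpy as np

def mhd(traj_a, traj_b):
    """Modified Hausdorff Distance between two trajectories (arrays of states).

    Averages, over each trajectory, the minimum Euclidean distance to the other
    trajectory, then takes the larger of the two directed distances.
    """
    A, B = np.asarray(traj_a, dtype=float), np.asarray(traj_b, dtype=float)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # pairwise distances
    return max(d.min(axis=1).mean(), d.min(axis=0).mean())
```

Unlike the classical Hausdorff distance, averaging the per-state minima makes the metric less sensitive to a single outlier state on either trajectory.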
6 Conclusion
We propose an inverse reinforcement learning approach for inferring navigation costs from demonstrations with semantic observations. Our model introduces a new cost representation composed of a probabilistic semantic occupancy encoder and a cost encoder defined over the semantic features. The cost function can be optimized via backpropagation with a closed-form (sub)gradient. Experiments in the CARLA simulator show that our model outperforms methods that do not encode semantic information probabilistically over time. Our work offers a promising solution for learning semantic features in navigation and may enable efficient online learning in challenging conditions.
We gratefully acknowledge support from NSF CRII IIS1755568, ARL DCIST CRA W911NF1720181, and ONR SAI N000141812828.