I Introduction
Practical applications of autonomous robot systems increasingly require operation in unstructured, partially observed, unknown, and changing environments. Achieving safe and robust navigation in such conditions is directly coupled with the quality of the environment representation and the cost function specifying desirable robot behavior. Designing a cost function that accurately encodes safety, liveness, and efficiency requirements of navigation tasks is a major challenge. In contrast, it is significantly easier to obtain demonstrations of desirable behavior. The field of inverse reinforcement learning [Ng_IRL00, IRLSurvey, Neu_Apprenticeship12] (IRL) provides numerous tools for learning cost functions from expert demonstration.
We consider the problem of safe navigation from partial observations in unknown environments. We assume that demonstrations containing expert poses, controls, and sensory observations over time are available. The objective is to infer a cost function that depends on the observation sequence and explains the demonstrated behavior. This inference can only be done indirectly, by comparing the control inputs that a robot would take under its current cost representation with the expert’s actions in the same situation.
Learning a cost function requires a control policy that is differentiable with respect to the stage cost parameters. Computing such derivatives has been addressed by several successful approaches [Ratliff_06, Ziebart_MaxEnt08, Tamar_VIN16, Okada_PIN17]. Ratliff et al. [Ratliff_06] developed algorithms with regret bounds for computing subgradients of planning algorithms (e.g., A* [ARA], RRT [Lavalle_RRT98, Karaman_RRTstar11], etc.) with respect to the cost features. Ziebart et al. [Ziebart_MaxEnt08] developed a dynamic programming algorithm for computing the expected state visitation frequency of a policy extracted from expert demonstrations to enforce that it earns the same reward as the demonstrated policy. Tamar et al. [Tamar_VIN16] showed that the value iteration algorithm can be approximated by a sequence of convolutions (computing the value function expectation over stochastic transitions) and max-pooling operations (choosing the best action), which allows automatic differentiation. Okada et al. [Okada_PIN17] proposed path integral networks in which the control sequence resulting from path integral control may be differentiated with respect to the controller parameters. All of these works, however, assume a known environment and only optimize the parameters of a cost function defined over it.
A series of recent works [ChoiIRLinPOMDPJMLR2011, Shankar_RLviaRCNN2016, Wulfmeier_DeepMaxEnt16, Gupta_CMP17, Karkus_QMDPNet2017, Karkus_DAN19] address IRL under partial observability. Wulfmeier et al. [Wulfmeier_DeepMaxEnt16] train deep neural network representations for the Maximum Entropy IRL method of Ziebart et al. [Ziebart_MaxEnt08] and consider streaming lidar scan observations of the environment. Karkus et al. [Karkus_DAN19] formulate the IRL problem as a POMDP [ASTROM_POMDP1965], including the robot pose and environment map in the state. Since a naïve representation of the occupancy distribution may require exponential memory in the map size, the authors assume partial knowledge of the environment (the structure of a building is known but the furniture placement is not). A control policy is obtained via the SARSOP [Kurniawati_SARSOP08] algorithm, which approximates the cost-to-go function only over an optimally reachable space. Gupta et al. [Gupta_CMP17] address visual navigation in partially observed environments using a hierarchical VIN as the planner. Khan et al. [Khan_MACN18] introduce a memory module to VIN to address partial observability.
Many IRL algorithms rely on dynamic programming, including VIN and its derivatives [Gupta_CMP17, Khan_MACN18], which requires updating cost-to-go estimates over all possible states. Our insight is that in partially observable environments, the cost-to-go estimates need to be updated and differentiated only over a subset of states. Inspired by [Ratliff_06], we obtain cost-to-go estimates only over promising states using a motion planning algorithm. This yields a closed-form subgradient of the cost-to-go with respect to the learned cost function, computed from the optimal trajectory. While Ratliff et al. [Ratliff_06] exploit this observation in fully observable environments, none of the works focusing on partial observability take advantage of it. Table I compares our work with closely related works.
In summary, we offer two contributions, illustrated in Fig. 1. First, we develop a non-stationary cost function representation composed of a probabilistic occupancy map encoder, with recurrent dependence on the observation sequence, and a cost encoder, defined over the occupancy features (Sec. III). Second, we optimize the cost parameters using a closed-form subgradient of the cost-to-go obtained only over a subset of promising states (Sec. IV).
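Table I: Comparison with closely related works.
Reference \ Feature          | POE | PBB | DNN | CfG
VIN [Tamar_VIN16]            | ✗   | ✗   | ✓   | ✗
CMP [Gupta_CMP17]            | ✓   | ✗   | ✓   | ✗
MaxEntIRL [Ziebart_MaxEnt08] | ✗   | ✗   | ✗   | ✓
MaxMarginIRL [Ratliff_06]    | ✗   | ✓   | ✗   | ✓
DAN [Karkus_DAN19]           | ✓   | ✗   | ✓   | ✗
Ours                         | ✓   | ✓   | ✓   | ✓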
II Problem Formulation
Consider a robot navigating in an unknown environment with the task of reaching a goal state . Let be the robot state, capturing its pose, twist, etc., at discrete time . For a given control input where is assumed finite, the robot state evolves according to known deterministic dynamics: . Let be a function specifying the true occupancy of the environment by labeling states as either feasible () or infeasible () and let be the space of possible environment realizations . Let be a cost function specifying desirable robot behavior in a given environment, e.g., according to an expert user or an optimal design. We assume that the robot does not have access to either the true occupancy map or the true cost function . However, the robot is able to make observations (e.g., using a lidar scanner or a depth camera) of the environment in its vicinity, whose distribution depends on the robot state and the environment . Given a training set of expert trajectories with length to demonstrate desirable behavior, our goal is to

learn a cost function estimate that depends on an observation sequence from the true latent environment and is parameterized by ,

derive a stochastic policy from such that the robot behavior under matches the prior experience .
To balance exploration in partially observable environments with exploitation of promising controls, we specify as a Boltzmann policy [Ramachandran_BayesianIRL07, Neu_Apprenticeship12] associated with the cost :
(1) 
where the optimal cost-to-go function is:
(2)  
s.t. 
Problem.
Given demonstrations , optimize the cost function parameters so that the log-likelihood of the demonstrated controls is maximized under the robot policies :
(3) 
The problem setup is illustrated in Fig. 1. While Eqn. (2) is a standard deterministic shortest path (DSP) problem, the challenge is to make it differentiable with respect to . This is needed to propagate the loss in (3) back through the DSP problem to update the cost parameters . Once the parameters are optimized, the robot can generalize to navigation tasks in new partially observable environments by evaluating the cost based on the observations iteratively and (re)computing the associated policy .
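To make the policy and the training objective concrete, the following minimal sketch builds a Boltzmann policy from a set of cost-to-go values and evaluates the negative log-likelihood of an expert control. It assumes the usual softmax form over negated cost-to-go values; the dictionary layout, the function names, and the example numbers are illustrative assumptions, not taken from the paper.

```python
import math

def boltzmann_policy(q_values):
    """Return pi(u) proportional to exp(-Q(u)) for a dict {control: cost-to-go}."""
    q_min = min(q_values.values())  # shift by the minimum for numerical stability
    weights = {u: math.exp(-(q - q_min)) for u, q in q_values.items()}
    total = sum(weights.values())
    return {u: w / total for u, w in weights.items()}

def negative_log_likelihood(demos):
    """demos: iterable of (q_values, expert_control) pairs (hypothetical layout)."""
    return -sum(math.log(boltzmann_policy(q)[u_star]) for q, u_star in demos)

# Hypothetical cost-to-go values for three controls at one state.
q = {"north": 3.0, "east": 2.0, "south": 5.0}
pi = boltzmann_policy(q)
loss = negative_log_likelihood([(q, "east")])
```

Controls with a lower cost-to-go receive higher probability, and the loss decreases as the expert's control becomes the cheapest option under the learned cost.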
III Cost Function Representation
[Figure caption fragment: A convolutional neural network extracts features from the map state to specify the cost at a given robot state-control pair.]
We propose a cost function representation comprised of two components: a map encoder and a cost encoder.
The map encoder incrementally updates a hidden state using the most recent observation obtained from robot state . For example, a Bayes filter with likelihood function parameterized by can convert the sequential input into a fixed-sized hidden state :
(4) 
The cost encoder uses the latent environment map to define the cost function estimate at a given state-control pair . A convolutional neural network (CNN) [Goodfellowetal2016] with parameters can extract cost features from the environment map:
(5) 
This conceptual model, combining recurrent estimation of a hidden environment state, followed by feature extraction to define the cost at
is illustrated in Fig. 2. The model is differentiable by design, allowing its parameters to be optimized. We propose an instantiation of this general model, specific to modeling occupancy costs from depth measurements in robot navigation tasks.
III-A Map Encoder
We encode the occupancy probability of different environment areas into a hidden state
. In detail, we discretize into cells and let be the vector of true occupancy values over the cells. Since
is unknown to the robot, we maintain the occupancy likelihood given the history of states and observations . See Fig. 3 for an example of a depth measurement and associated occupancy likelihood over the map . The representation complexity may be simplified significantly if one assumes independence among the map cells :
(6) 
We draw inspiration from occupancy grid mapping [Thrun_PR05, Hornung_Octomap13] to design recurrent updates for the occupancy probability of each cell . Since
is binary, its likelihood update can be simplified by defining the log-odds ratio of occupancy:
(7) 
The recurrent Bayesian update of is:
(8) 
where the increment is a log-odds observation model. The occupancy posterior can be recovered from the occupancy log-odds ratio
via a sigmoid function:
(9) 
The sigmoid function satisfies the following properties:
(10) 
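The recurrent log-odds update and the sigmoid recovery of the occupancy posterior can be sketched as below. The additive per-cell update follows the description above; the function names, the three-cell map, and the increment values are assumptions made for illustration. The last two comments state standard sigmoid identities that are useful when converting between probabilities and log-odds.

```python
import math

def sigmoid(x):
    """Logistic function mapping log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def update_log_odds(h, delta_h):
    """Additive recurrent update of per-cell occupancy log-odds.

    Cells outside the sensor field of view receive a zero increment.
    """
    return [h_j + d_j for h_j, d_j in zip(h, delta_h)]

# Hypothetical 3-cell map with a uniform prior (log-odds 0 means probability 0.5).
h = [0.0, 0.0, 0.0]
for delta in ([-1.0, 0.0, 2.0], [-1.0, 0.0, 2.0]):
    h = update_log_odds(h, delta)

occupancy = [sigmoid(h_j) for h_j in h]

# Standard identities: 1 - sigmoid(x) == sigmoid(-x)
# and log(sigmoid(x) / sigmoid(-x)) == x.
```

Because the update is a sum, the hidden state has a fixed size regardless of how many observations have been processed.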
To complete the recurrent occupancy model design, we parameterize the log-odds observation model:
(11) 
where Bayes' rule was used to represent it in terms of an inverse observation model and a prior occupancy log-odds ratio , whose value may depend on the environment (e.g., specifies a uniform prior over occupied and free cells). Note that map cells outside of the sensor field of view at time are not affected by , in which case . Now, consider a cell along the direction of the th sensor ray, whose depth measurement is . Let be the distance between the robot position and the center of mass of cell . We model the occupancy likelihood of as a truncated sigmoid around the true distance :
(12) 
where is a learnable parameter, , and is a distance threshold on the influence of the th sensor ray on the occupancy of cells around the point of reflection. Using (10), this implies the following log-odds inverse sensor model for cells along the th sensor ray:
(13) 
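One plausible reading of the truncated log-odds inverse sensor model is sketched below. It assumes that the log-odds are linear in the difference between the cell distance and the measured depth up to a threshold, and that cells farther behind the reflection point carry no information (log-odds zero). The truncation rule, the names `k` and `eps`, and the numeric values are all assumptions for illustration.

```python
def inverse_sensor_log_odds(cell_distance, measured_depth, k=1.5, eps=0.5):
    """Log-odds of occupancy for a cell along one sensor ray.

    k   : slope of the linear log-odds model (learnable in the paper, fixed here).
    eps : assumed distance threshold behind the point of reflection.
    """
    delta_d = cell_distance - measured_depth
    if delta_d > eps:
        return 0.0          # assumed: no information beyond the threshold
    return k * delta_d      # negative in front of the hit (likely free)

# Hypothetical ray that reflects at depth 4.0.
free_cell = inverse_sensor_log_odds(2.0, 4.0)     # well before the hit
near_hit = inverse_sensor_log_odds(4.4, 4.0)      # just behind the hit
far_behind = inverse_sensor_log_odds(6.0, 4.0)    # beyond the threshold
```

Under this reading, cells between the robot and the reflection point are pushed toward free, cells just behind it are pushed toward occupied, and everything else is left at the prior.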
This model suggests that one may also use a more expressive multi-layer neural network in place of the linear transformation
of the distance differential along the th ray:
(14) 
III-B Cost Encoder
Given a state-control pair , the cost encoder uses the output of the map encoder to obtain a cost function estimate . A complex interacting structure over the feature representation can be obtained via a deep neural network with parameters . Wulfmeier et al. [Wulfmeier_DeepMaxEnt16]
proposed several CNN architectures, including pooling and residual connections of fully convolutional networks, to model
from . While a complex architecture may provide better performance in practice, we also develop a simpler model for comparison using the inductive bias of obstacle avoidance in robot navigation.
Let and be a small and large positive parameter, respectively. The cost of applying control in robot state can be modeled as large when the transition encounters an obstacle and as small otherwise:
(15) 
Since is unknown, we use the estimated occupancy probabilities to compute the expectation of over :
(16)  
where is the entry of the map encoder state that corresponds to the environment cell containing . This simple cost encoder has parameters and its output is differentiable with respect to and through .
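The expectation described above reduces to a convex combination of the small and large cost weighted by the estimated occupancy probability of the cell that the transition enters. A minimal sketch, with hypothetical parameter names and values:

```python
def expected_cost(p_occupied, c_free=1.0, c_obstacle=100.0):
    """Expected stage cost of a transition into a cell.

    p_occupied : estimated occupancy probability of the destination cell.
    c_free     : small positive cost of a free transition (assumed value).
    c_obstacle : large positive cost of hitting an obstacle (assumed value).
    """
    return c_obstacle * p_occupied + c_free * (1.0 - p_occupied)

costs = [expected_cost(p) for p in (0.0, 0.5, 1.0)]
```

The expression is linear in the occupancy probability and in both cost parameters, which is what makes this encoder differentiable with respect to them.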
IV Cost Learning via Differentiable Planning
We focus on optimizing the parameters of the cost representation developed in Sec. III. Since the true cost
is not directly observable, we need to differentiate the loss function
in (3), which, in turn, requires differentiating through the DSP problem in (2) with respect to the cost function estimate .
Value Iteration Networks (VIN) [Tamar_VIN16] show that iterations of the value iteration algorithm can be approximated by a neural network with convolutional and min-pooling layers. This allows VIN to be differentiable with respect to the stage cost. While VIN can be modified to operate with a finite horizon and produce a non-stationary policy, it would still be based on full Bellman backups (convolutions and min-pooling) over the entire state space. As a result, VIN scales poorly with the state-space size, even though in partially observable environments it may not be necessary to determine the optimal cost-to-go at every state and control.
Instead of using dynamic programming to solve the DSP (2), any motion planning algorithm (e.g., A* [ARA], RRT [Lavalle_RRT98, Karaman_RRTstar11], etc.) that returns the optimal cost-to-go over a subset of the state-control space provides a sufficiently accurate solution. For example, a backwards A* search applied to problem (2) with start state , goal state , and predecessor expansions according to the motion model provides an upper bound to the optimal cost-to-go:
where are the values computed by A* for expanded nodes in the CLOSED list and visited nodes in the OPEN list. Thus, a Boltzmann policy can be defined using the values returned by A* for all
and a uniform distribution over
for all other states . A* can significantly improve on the efficiency of VIN [Tamar_VIN16] or other full-backup dynamic programming variants by performing local Bellman backups only on promising states (the CLOSED list).
In addition to improving the efficiency of the forward computation of , using a planning algorithm to solve (2) is also more efficient in backpropagating errors with respect to . In detail, using the subgradient method [Shor_Subgradient12, Ratliff_06] to optimize leads to a closed-form (sub)gradient of with respect to , removing the need for backpropagation through multiple convolutional or min-pooling layers. We proceed by rewriting in a form that makes its subgradient with respect to obvious. Let be the set of feasible state-control trajectories starting at , and satisfying for with . Let be an optimal trajectory corresponding to the optimal cost-to-go function of a deterministic shortest path problem, i.e., the controls in satisfy the additional constraint for . Let be a state-control visitation function indicating if is visited by . With these definitions, we can view the optimal cost-to-go function as the minimum over of the inner product of the cost function and the visitation function :
(17) 
where can be assumed finite because both and are finite. This form allows us to (sub)differentiate with respect to for any , .
Lemma 1.
Let be differentiable and convex in . Then, , where , is a subgradient of the piecewise-differentiable convex function .
Applying Lemma 1 to (17) leads to the following subgradient of the optimal cost-to-go function:
(18) 
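The idea that the subgradient of the optimal cost-to-go is the visitation function of the optimal trajectory can be checked numerically on a toy problem. The sketch below uses backward Dijkstra (equivalent to A* with a zero heuristic) in place of the A* planner, on a hypothetical four-state graph with positive costs; the graph, the state and control names, and the function names are assumptions for illustration.

```python
import heapq
from collections import Counter

def cost_to_go(transitions, cost, goal):
    """Backward Dijkstra: V[x] is the minimal summed cost from x to the goal."""
    preds = {}
    for (x, u), x_next in transitions.items():
        preds.setdefault(x_next, []).append((x, u))
    V = {goal: 0.0}
    heap = [(0.0, goal)]
    while heap:
        v, x = heapq.heappop(heap)
        if v > V.get(x, float("inf")):
            continue  # stale heap entry
        for xp, u in preds.get(x, []):
            cand = cost[(xp, u)] + v
            if cand < V.get(xp, float("inf")):
                V[xp] = cand
                heapq.heappush(heap, (cand, xp))
    return V

def q_value_and_subgradient(transitions, cost, goal, x, u):
    """Return Q(x, u) and the visitation counts of the optimal trajectory."""
    V = cost_to_go(transitions, cost, goal)
    if transitions[(x, u)] not in V:
        raise ValueError("goal is unreachable after applying u at x")
    visits, q = Counter(), 0.0
    while True:
        visits[(x, u)] += 1
        q += cost[(x, u)]
        x = transitions[(x, u)]
        if x == goal:
            return q, visits
        options = [a for (s, a) in transitions
                   if s == x and transitions[(s, a)] in V]
        u = min(options, key=lambda a: cost[(x, a)] + V[transitions[(x, a)]])

# Hypothetical deterministic graph: (state, control) -> next state.
transitions = {("A", "toB"): "B", ("A", "toC"): "C",
               ("B", "toG"): "G", ("B", "toC"): "C", ("C", "toG"): "G"}
cost = {("A", "toB"): 1.0, ("A", "toC"): 4.0,
        ("B", "toG"): 5.0, ("B", "toC"): 1.0, ("C", "toG"): 1.0}

q, visits = q_value_and_subgradient(transitions, cost, "G", "A", "toB")

# Finite-difference check on one visited pair.
bumped = dict(cost)
bumped[("B", "toC")] += 0.01
q_bumped, _ = q_value_and_subgradient(transitions, bumped, "G", "A", "toB")
slope = (q_bumped - q) / 0.01
```

The finite-difference slope matches the visitation count of the perturbed pair, so no backpropagation through the planner itself is needed: only the optimal trajectory has to be recorded.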
which can be obtained from the optimal trajectory corresponding to
. This result and the chain rule allow us to obtain the complete (sub)gradient of
.
Proposition 1.
The computation graph structure implied by Prop. 1 is illustrated in Fig. 1. The graph consists of a cost representation layer and a differentiable planning layer, allowing end-to-end minimization of via stochastic (sub)gradient descent. Full algorithms for the training and testing phases (Fig. 1) are shown in Alg. 1 and Alg. 2.
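The chain-rule step that connects the loss to the stage costs can be illustrated as follows. This is a sketch under assumptions rather than the statement of Prop. 1: it takes a softmax policy over negated cost-to-go values, a negative log-likelihood loss for a single expert control, and per-control visitation counts supplied by a planner. All names and numbers are hypothetical.

```python
import math

def policy(q):
    """Softmax over negated cost-to-go values {control: Q}."""
    q_min = min(q.values())
    w = {u: math.exp(-(v - q_min)) for u, v in q.items()}
    z = sum(w.values())
    return {u: x / z for u, x in w.items()}

def loss_grad_wrt_cost(q, visits, u_star):
    """(Sub)gradient of L = -log pi(u_star) with respect to the stage costs.

    visits[u] maps each (state, control) pair on the optimal trajectory
    that starts with control u to its visitation count.
    """
    pi = policy(q)
    grad = {}
    for u in q:
        # dL/dQ(u) = 1{u == u_star} - pi(u) for a softmax over -Q.
        dl_dq = (1.0 if u == u_star else 0.0) - pi[u]
        for pair, count in visits[u].items():
            grad[pair] = grad.get(pair, 0.0) + dl_dq * count
    return grad

# Hypothetical state with two controls and their optimal trajectories.
q = {"toB": 3.0, "toC": 5.0}
visits = {
    "toB": {("A", "toB"): 1, ("B", "toC"): 1, ("C", "toG"): 1},
    "toC": {("A", "toC"): 1, ("C", "toG"): 1},
}
grad = loss_grad_wrt_cost(q, visits, "toB")
```

A gradient descent step lowers the cost of pairs visited only by the expert's choice and raises the cost of pairs visited only by the alternative, while pairs shared by all trajectories receive zero gradient.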
Although the form of the gradient in Prop. 1 is similar to that in Ziebart et al. [Ziebart_MaxEnt08], our contributions are orthogonal. Our contribution is to obtain a non-stationary cost-to-go function and its (sub)gradient for a finite-horizon problem using forward and backward computations that scale efficiently with the size of the state space. On the other hand, the results of Ziebart et al. [Ziebart_MaxEnt08] provide a stationary cost-to-go function and its gradient for an infinite-horizon problem. The maximum entropy formulation of the stationary policy is a well-grounded property of using a soft version of the Bellman update, which can explicitly model the suboptimality of expert trajectories. However, we can show the benefits of our approach without including the maximum entropy formulation and leave it as future work.
V Experiments
We evaluate our approach in 2D grid world navigation tasks at two scales. Obstacle configurations are generated randomly in maps of sizes or . We use an 8-connected grid so that a control causes a transition from to one of the eight neighbor cells . At each step, the robot receives a lidar scan at resolution, resulting in beams in each scan (see Fig. 3). The lidar range readings are corrupted by additive zero-mean Gaussian noise. The standard deviation of the noise is
and (grid cell = 1) and the lidar maximum range is and in and domains, respectively. Note that the lidar range is smaller than the map size, so that the environment is only partially observable. During test time, the domain size is the maximum size allowed for the observed area along a trajectory. Demonstrations are obtained by running an A* planning algorithm to solve the deterministic shortest path problem on the true map . The numbers of maps and training samples generated are shown in Table II.
Table II: Number of maps and samples (one pair of columns per domain).
Dataset    | #maps | #samples | #maps | #samples
Train      | 7638  | 514k     | 970   | 460k
Validation | 966   | 66k      | 122   | 58k
Test       | 952   | -        | 122   | -
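For reference, the 8-connected motion model used in the grid worlds can be sketched as a neighbor enumeration. The (row, column) indexing and the function name are assumptions for illustration.

```python
def neighbors8(cell, height, width):
    """Return the in-bounds 8-connected neighbors of a (row, col) cell."""
    r, c = cell
    return [(r + dr, c + dc)
            for dr in (-1, 0, 1)
            for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0)
            and 0 <= r + dr < height
            and 0 <= c + dc < width]

interior = neighbors8((1, 1), 3, 3)
corner = neighbors8((0, 0), 3, 3)
```

Each neighbor corresponds to one control, so an interior cell has eight available controls while boundary cells have fewer.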
V-A Baseline and model variations
We use DeepMaxEnt [Wulfmeier_DeepMaxEnt16] as a baseline and compare it to three variants of our model. In all variants, we parameterize an inverse observation model and use the log-odds update rule in Eqn. (13) as the map encoder. This map encoder is sufficiently expressive to model occupancy probability from lidar observations in a 2D environment. The A* algorithm is used to solve the DSP (2), providing the optimal cost-to-go for and a subgradient of according to (18). All the neural networks are implemented in the PyTorch library [Paszke_Pytorch17] and trained with the Adam optimizer [Kingma_ADAM14] until convergence.
DeepMaxEnt uses a neural network to learn a cost function directly from lidar observations without explicitly having a map representation. The neural network in our experiments is the “Standard FCN” in [Wulfmeier_DeepMaxEnt16] in the domain, and the encoder-decoder architecture in [SegNet] in the domain. Value iteration is approximated by a finite number of Bellman backup iterations, equal to the map size. The experiments in the original DeepMaxEnt paper [Wulfmeier_DeepMaxEnt16] use the mean and variance of the height of the 3D lidar points in each cell, as well as a binary indicator of cell visibility, as input features to the neural network. Since our synthetic experiments are set up in 2D, the count of lidar beams in each cell is used as a replacement for the height mean and variance. This is a fair adaptation because Wulfmeier et al. [Wulfmeier_DeepMaxEnt16] argued that obstacles generally represent areas of larger height variance, which corresponds to higher beam counts in our observations.
OursHCE stands for hard-coded cost encoder. This simple variant of our model uses Eqn. (16) with and set explicitly as constants.
OursSCE stands for soft-coded cost encoder and has and in Eqn. (16) as learnable parameters.
OursCNN is our most generic variant using a convolutional neural network as cost encoder. The network architecture is the same as in DeepMaxEnt for fair comparison.
V-B Experiments and Results
Model generalization
Fig. 4 compares multiple measures of accuracy for the different algorithms. Both OursHCE and OursSCE explicitly incorporate cell traversability through the cost design in (16). Test results show that this explicit cost encoder is successful at obstacle avoidance, regardless of whether the parameters , are constants in OursHCE or learnable in OursSCE. The performance of OursHCE also shows that the map encoder is learning a correct map representation from the noisy lidar observations, since the only trainable parameters are in the inverse observation model (13). However, both models fail at matching demonstrations in validation because the cost encoder (16) emphasizes obstacle avoidance explicitly, leaving little capacity to learn from demonstrations. OursCNN combines the strength of learning from demonstrations and generalization to new navigation tasks while avoiding obstacles. Its validation results are on par with DeepMaxEnt, showing the validity of the closed-form subgradient in (18). OursCNN significantly outperforms DeepMaxEnt in new tasks at test time. The performance gap of DeepMaxEnt in the two domains shows that a general CNN architecture applied directly to the lidar scan measurements is not as effective as the map encoder in OursCNN at modeling occupancy probability. Fig. 5 shows the map occupancy estimation, as well as the optimal trajectories necessary for subgradient computation in Sec. IV.
Robustness to noise
The robustness of OursCNN to the observation noise is evaluated in the domain. Fig. 4 shows that the performance degrades as the noise increases, but our inverse observation model (13) still generalizes well, considering that noise with standard deviation of is significant when the lidar range is only .
Computational efficiency
Finally, we compare the efficiency of a forward pass through our A* planner and the value iteration algorithm in DeepMaxEnt. The A* algorithm in our models is implemented in C++ and evaluated on a CPU. The VI algorithm is implemented using convolutional and min-pooling layers in PyTorch as described in [Tamar_VIN16] and is evaluated on a GPU. We record the time that each model takes to return a policy given a cost function . Our A* algorithm takes only ms as compared to VI’s ms on the map. In the domain, our A* algorithm takes ms, compared to VI’s ms, illustrating the scalability of our model in the size of the state space.
VI Conclusion
We proposed an inverse reinforcement learning approach for inferring navigation costs from demonstration in partially observable environments. Our model introduces a new cost representation composed of a probabilistic occupancy encoder and a cost encoder defined over the occupancy features. We showed that a motion planning algorithm can compute optimal cost-to-go values over the cost representation, while the cost-to-go (sub)gradient may be obtained in closed form. Our work offers a promising model for encoding occupancy features in navigation tasks and may enable efficient online learning in challenging operational conditions.