Learning Navigation Costs from Demonstration in Partially Observable Environments

02/26/2020 ∙ by Tianyu Wang, et al. ∙ University of California, San Diego 1

This paper focuses on inverse reinforcement learning (IRL) to enable safe and efficient autonomous navigation in unknown partially observable environments. The objective is to infer a cost function that explains expert-demonstrated navigation behavior while relying only on the observations and state-control trajectory used by the expert. We develop a cost function representation composed of two parts: a probabilistic occupancy encoder, with recurrent dependence on the observation sequence, and a cost encoder, defined over the occupancy features. The representation parameters are optimized by differentiating the error between demonstrated controls and a control policy computed from the cost encoder. Such differentiation is typically computed by dynamic programming through the value function over the whole state space. We observe that this is inefficient in large partially observable environments because most states are unexplored. Instead, we rely on a closed-form subgradient of the cost-to-go obtained only over a subset of promising states via an efficient motion-planning algorithm such as A* or RRT. Our experiments show that our model exceeds the accuracy of baseline IRL algorithms in robot navigation tasks, while substantially improving the efficiency of training and test-time inference.




I Introduction

Practical applications of autonomous robot systems increasingly require operation in unstructured, partially observed, unknown, and changing environments. Achieving safe and robust navigation in such conditions is directly coupled with the quality of the environment representation and the cost function specifying desirable robot behavior. Designing a cost function that accurately encodes safety, liveness, and efficiency requirements of navigation tasks is a major challenge. In contrast, it is significantly easier to obtain demonstrations of desirable behavior. The field of inverse reinforcement learning [Ng_IRL00, IRLSurvey, Neu_Apprenticeship12] (IRL) provides numerous tools for learning cost functions from expert demonstration.

We consider the problem of safe navigation from partial observations in unknown environments. We assume that demonstrations containing expert poses, controls, and sensory observations over time are available. The objective is to infer a cost function that depends on the observation sequence and explains the demonstrated behavior. This inference can only be done indirectly, by comparing the control inputs that a robot would take under its current cost representation to the expert's actions in the same situation.

Fig. 1: Architecture for learning cost function representations from demonstrations. Our main contribution lies in a non-stationary cost representation, combining a probabilistic occupancy map encoder, with recurrent dependence on the observations, and a cost encoder, defined over the occupancy features. We also perform efficient (in the size of the state space) forward non-stationary policy computation and efficient (closed-form subgradient) backpropagation. The bottom-right plot illustrates that motion planning may be used to update and differentiate cost-to-go estimates only in promising areas of the state space rather than using full Bellman backups, which are particularly redundant in partially observable environments.

Learning a cost function requires a control policy that is differentiable with respect to the stage cost parameters. Computing such derivatives has been addressed by several successful approaches [Ratliff_06, Ziebart_MaxEnt08, Tamar_VIN16, Okada_PIN17]. Ratliff et al. [Ratliff_06] developed algorithms with regret bounds for computing subgradients of planning algorithms (e.g., A* [ARA], RRT [Lavalle_RRT98, Karaman_RRTstar11], etc.) with respect to the cost features. Ziebart et al. [Ziebart_MaxEnt08] developed a dynamic programming algorithm for computing the expected state visitation frequency of a policy extracted from expert demonstrations to enforce that it earns the same reward as the demonstrated policy. Tamar et al. [Tamar_VIN16] showed that the value iteration algorithm could be approximated using a series of convolutions (computing the value function expectation over stochastic transitions) and max-pooling operations (choosing the best action), allowing automatic differentiation. Okada et al. [Okada_PIN17] proposed path integral networks in which the control sequence resulting from path integral control may be differentiated with respect to the controller parameters. All of these works, however, assume a known environment and only optimize the parameters of a cost function defined over it.

A series of recent works [ChoiIRLinPOMDPJMLR2011, Shankar_RLviaRCNN2016, Wulfmeier_DeepMaxEnt16, Gupta_CMP17, Karkus_QMDPNet2017, Karkus_DAN19] address IRL under partial observability. Wulfmeier et al. [Wulfmeier_DeepMaxEnt16] train deep neural network representations for Ziebart et al.’s Maximum Entropy IRL [Ziebart_MaxEnt08] method and consider streaming lidar scan observations of the environment. Karkus et al. [Karkus_DAN19] formulate the IRL problem as a POMDP [ASTROM_POMDP1965], including the robot pose and environment map in the state. Since a naïve representation of the occupancy distribution may require exponential memory in the map size, the authors assume partial knowledge of the environment (the structure of a building is known but the furniture placement is not). A control policy is obtained via the SARSOP [Kurniawati_SARSOP08] algorithm, which approximates the cost-to-go function only over an optimally reachable space. Gupta et al. [Gupta_CMP17] address visual navigation in partially observed environments while using hierarchical VIN as the planner. Khan et al. [Khan_MACN18] introduce a memory module to VIN to address partial observability.

Many IRL algorithms rely on dynamic programming, including VIN and derivatives [Gupta_CMP17, Khan_MACN18], which requires updating cost-to-go estimates over all possible states. Our insight is that in partially observable environments, the cost-to-go estimates need to be updated and differentiated only over a subset of states. Inspired by [Ratliff_06], we obtain cost-to-go estimates only over promising states using a motion planning algorithm. This helps to obtain a closed-form subgradient of the cost-to-go with respect to the learned cost function from the optimal trajectory. While Ratliff et al. [Ratliff_06] exploit this observation in fully observable environments, none of the works focusing on partial observability take advantage of it. Table I compares our work with closely related works. In summary, we offer two contributions, illustrated in Fig. 1. First, we develop a non-stationary cost function representation composed of a probabilistic occupancy map encoder, with recurrent dependence on the observation sequence, and a cost encoder, defined over the occupancy features (Sec. III). Second, we optimize the cost parameters using a closed-form subgradient of the cost-to-go obtained only over a subset of promising states (Sec. IV).

Reference \Feature POE PBB DNN CfG
VIN [Tamar_VIN16]
CMP [Gupta_CMP17]
MaxEntIRL [Ziebart_MaxEnt08]
MaxMarginIRL [Ratliff_06]
DAN [Karkus_DAN19]
TABLE I: Comparison with closely related work based on the use of Partially Observable Environments (POE), Partial Bellman Backups (PBB), Deep Neural Network (DNN) representation, and Closed-form Gradients (CfG). PBB refers to computing and differentiating values over a subset of promising states as opposed to the whole state space. DAN [Karkus_DAN19] works with uncertainty only on the robot or furniture location, while the main environment structure is known.

II Problem Formulation

Consider a robot navigating in an unknown environment with the task of reaching a goal state $x_g \in \mathcal{X}$. Let $x_t \in \mathcal{X}$ be the robot state, capturing its pose, twist, etc., at discrete time $t$. For a given control input $u_t \in \mathcal{U}$, where $\mathcal{U}$ is assumed finite, the robot state evolves according to known deterministic dynamics: $x_{t+1} = f(x_t, u_t)$. Let $m^*: \mathcal{X} \to \{-1, 1\}$ be a function specifying the true occupancy of the environment by labeling states as either feasible ($-1$) or infeasible ($1$) and let $\mathcal{M}$ be the space of possible environment realizations $m^*$. Let $c^*: \mathcal{X} \times \mathcal{U} \times \mathcal{M} \to \mathbb{R}_{\geq 0}$ be a cost function specifying desirable robot behavior in a given environment, e.g., according to an expert user or an optimal design. We assume that the robot does not have access to either the true occupancy map $m^*$ or the true cost function $c^*$. However, the robot is able to make observations $z_t \in \mathcal{Z}$ (e.g., using a lidar scanner or a depth camera) of the environment in its vicinity, whose distribution depends on the robot state $x_t$ and the environment $m^*$. Given a training set $\mathcal{D} := \{(x_{t,n}, u^*_{t,n}, z_{t,n}, x_{g,n})\}_{t=1,n=1}^{T_n,N}$ of $N$ expert trajectories with length $T_n$ to demonstrate desirable behavior, our goal is to

  • learn a cost function estimate $c_t: \mathcal{X} \times \mathcal{U} \times \mathcal{Z}^t \times \Theta \to \mathbb{R}_{\geq 0}$ that depends on an observation sequence $z_{1:t}$ from the true latent environment and is parameterized by $\theta \in \Theta$,

  • derive a stochastic policy $\pi_t$ from $c_t$ such that the robot behavior under $\pi_t$ matches the prior experience $\mathcal{D}$.

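To balance exploration in partially observable environments with exploitation of promising controls, we specify $\pi_t$ as a Boltzmann policy [Ramachandran_BayesianIRL07, Neu_Apprenticeship12] associated with the cost $c_t$:

$$\pi_t(u_t \mid x_t; z_{1:t}, \theta) = \frac{\exp\left(-Q_t^*(x_t, u_t; z_{1:t}, \theta)\right)}{\sum_{u \in \mathcal{U}} \exp\left(-Q_t^*(x_t, u; z_{1:t}, \theta)\right)}, \tag{1}$$

where the optimal cost-to-go function $Q_t^*$ is:

$$Q_t^*(x_t, u_t; z_{1:t}, \theta) := \min_{T,\, u_{t+1:T-1}} \sum_{k=t}^{T-1} c_t(x_k, u_k; z_{1:t}, \theta) \quad \text{s.t.} \quad x_{k+1} = f(x_k, u_k), \;\; x_T = x_g. \tag{2}$$

Given demonstrations $\mathcal{D}$, optimize the cost function parameters $\theta$ so that the log-likelihood of the demonstrated controls $u^*_{t,n}$ is maximized under the robot policies $\pi_{t,n}$:

$$\min_{\theta} \mathcal{L}(\theta) := -\sum_{n=1}^{N} \sum_{t=1}^{T_n} \log \pi_{t,n}(u^*_{t,n} \mid x_{t,n}; z_{1:t,n}, \theta). \tag{3}$$

The problem setup is illustrated in Fig. 1. While Eqn. (2) is a standard deterministic shortest path (DSP) problem, the challenge is to make it differentiable with respect to $\theta$. This is needed to propagate the loss in (3) back through the DSP problem to update the cost parameters $\theta$. Once the parameters are optimized, the robot can generalize to navigation tasks in new partially observable environments by evaluating the cost $c_t$ based on the observations $z_{1:t}$ iteratively and (re)computing the associated policy $\pi_t$.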
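As a minimal sketch of the Boltzmann policy and the per-sample loss described above, the following Python snippet turns a list of cost-to-go values (one per control) into control probabilities and evaluates the negative log-likelihood of a demonstrated control. The function names and the example values are hypothetical and chosen only for illustration.

```python
import math

def boltzmann_policy(q_values):
    """Return probabilities proportional to exp(-Q) over a finite control set."""
    q_min = min(q_values)  # shift for numerical stability; the ratio is unchanged
    weights = [math.exp(-(q - q_min)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def negative_log_likelihood(q_values, expert_control):
    """Loss contribution of one demonstrated control index under the policy."""
    return -math.log(boltzmann_policy(q_values)[expert_control])

# Hypothetical cost-to-go values for three controls at one state.
q = [3.0, 1.0, 2.0]
pi = boltzmann_policy(q)
loss = negative_log_likelihood(q, expert_control=1)
```

Controls with lower cost-to-go receive higher probability, so the loss is small when the expert's control is also the cheapest one under the current cost estimate.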

III Cost Function Representation

Fig. 2: Neural network model of a cost function representation, consisting of a map encoder and a cost encoder. A Bayes filter with likelihood function parameterized by $\psi$ takes in sequential observations $z_{1:t}$ and outputs a latent map representation $h_t$. A convolutional neural network, parameterized by $\phi$, extracts features from the map state $h_t$ to specify the cost $c_t(x, u)$ at a given robot state-control pair $(x, u)$. The learnable parameters are $\theta = \{\psi, \phi\}$.

We propose a cost function representation composed of two components: a map encoder and a cost encoder.

The map encoder incrementally updates a hidden state $h_t$ using the most recent observation $z_t$ obtained from robot state $x_t$. For example, a Bayes filter with likelihood function parameterized by $\psi$ can convert the sequential input $z_{1:t}$ into a fixed-sized hidden state $h_t$:

$$h_t = \mathcal{H}(h_{t-1}, x_t, z_t; \psi). \tag{4}$$

The cost encoder uses the latent environment map $h_t$ to define the cost function estimate $c_t(x, u)$ at a given state-control pair $(x, u)$. A convolutional neural network (CNN) [Goodfellow-et-al-2016] with parameters $\phi$ can extract cost features from the environment map:

$$c_t(x, u) = \mathcal{C}(h_t, x, u; \phi). \tag{5}$$

This conceptual model, combining recurrent estimation of a hidden environment state, followed by feature extraction to define the cost at $(x, u)$, is illustrated in Fig. 2. The model is differentiable by design, allowing its parameters to be optimized. We propose an instantiation of this general model, specific to modeling occupancy costs from depth measurements in robot navigation tasks.

III-A Map Encoder

We encode the occupancy probability of different environment areas into a hidden state $h_t$. In detail, we discretize $\mathcal{X}$ into $J$ cells and let $m = [m^1, \ldots, m^J]^\top \in \{-1, 1\}^J$ be the vector of true occupancy values over the cells. Since $m$ is unknown to the robot, we maintain the occupancy likelihood $P(m \mid x_{1:t}, z_{1:t})$ given the history of states $x_{1:t}$ and observations $z_{1:t}$. See Fig. 3 for an example of a depth measurement and associated occupancy likelihood over the map $m$. The representation complexity may be simplified significantly if one assumes independence among the map cells $m^j$:

$$P(m \mid x_{1:t}, z_{1:t}) = \prod_{j=1}^{J} P(m^j \mid x_{1:t}, z_{1:t}). \tag{6}$$

We use inspiration from occupancy grid mapping [Thrun_PR05, Hornung_Octomap13] to design recurrent updates for the occupancy probability of each cell $m^j$. Since $m^j$ is binary, its likelihood update can be simplified by defining the log-odds ratio of occupancy:

$$h_t^j := \log \frac{P(m^j = 1 \mid x_{1:t}, z_{1:t})}{P(m^j = -1 \mid x_{1:t}, z_{1:t})}. \tag{7}$$

The recurrent Bayesian update of $h_t^j$ is:

$$h_t^j = h_{t-1}^j + \log \frac{p(z_t \mid m^j = 1, x_t)}{p(z_t \mid m^j = -1, x_t)}, \tag{8}$$

where the increment is a log-odds observation model. The occupancy posterior can be recovered from the occupancy log-odds ratio $h_t^j$ via a sigmoid function:

$$P(m^j = 1 \mid x_{1:t}, z_{1:t}) = \sigma(h_t^j) := \frac{1}{1 + \exp(-h_t^j)}. \tag{9}$$

The sigmoid function satisfies the following properties:

$$\sigma(x) = 1 - \sigma(-x), \qquad \log \frac{\sigma(x)}{1 - \sigma(x)} = x. \tag{10}$$

Fig. 3: A robot (green) navigating in a 2-D grid environment. The true environment (left) is a grid where black regions are obstacles and white regions are free. On the right, the noisy lidar beams with endpoints in red have maximum range . The robot estimates the occupancy probability through an inverse observation model in Eqn (14) (darker means higher probability of occupancy).

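To complete the recurrent occupancy model design, we parameterize the log-odds observation model:

$$\log \frac{p(z_t \mid m^j = 1, x_t)}{p(z_t \mid m^j = -1, x_t)} = \log \frac{P(m^j = 1 \mid z_t, x_t)}{P(m^j = -1 \mid z_t, x_t)} - h_0^j, \tag{11}$$

where Bayes rule was used to represent it in terms of an inverse observation model $P(m^j \mid z_t, x_t)$ and a prior occupancy log-odds ratio $h_0^j$, whose value may depend on the environment (e.g., $h_0^j = 0$ specifies a uniform prior over occupied and free cells). Note that map cells outside of the sensor field of view at time $t$ are not affected by $z_t$, in which case $P(m^j \mid z_t, x_t) = P(m^j)$. Now, consider a cell $m^j$ along the direction of the $k$-th sensor ray, whose depth measurement is $z_t^{(k)}$. Let $d(x_t, m^j)$ be the distance between the robot position and the center of mass of cell $m^j$. We model the occupancy likelihood of $m^j$ as a truncated sigmoid around the true distance $d(x_t, m^j)$:

$$P(m^j = 1 \mid z_t^{(k)}, x_t) = \begin{cases} \sigma(\psi\, \delta z_t^{j,k}), & \delta z_t^{j,k} \leq \epsilon, \\ \sigma(h_0^j), & \delta z_t^{j,k} > \epsilon, \end{cases} \tag{12}$$

where $\psi$ is a learnable parameter, $\delta z_t^{j,k} := d(x_t, m^j) - z_t^{(k)}$, and $\epsilon$ is a distance threshold on the influence of the $k$-th sensor ray on the occupancy of cells around the point of reflection. Using (10), this implies the following log-odds inverse sensor model for cells along the $k$-th sensor ray:

$$\log \frac{P(m^j = 1 \mid z_t^{(k)}, x_t)}{P(m^j = -1 \mid z_t^{(k)}, x_t)} = \begin{cases} \psi\, \delta z_t^{j,k}, & \delta z_t^{j,k} \leq \epsilon, \\ h_0^j, & \delta z_t^{j,k} > \epsilon. \end{cases} \tag{13}$$

This model suggests that one may also use a more expressive multi-layer neural network in place of the linear transformation $\psi\, \delta z_t^{j,k}$ of the distance differential along the $k$-th ray:

$$\log \frac{P(m^j = 1 \mid z_t^{(k)}, x_t)}{P(m^j = -1 \mid z_t^{(k)}, x_t)} = \begin{cases} \mathrm{NN}(\delta z_t^{j,k}; \psi), & \delta z_t^{j,k} \leq \epsilon, \\ h_0^j, & \delta z_t^{j,k} > \epsilon. \end{cases} \tag{14}$$

In summary, the map encoder starts with prior occupancy log-odds $h_0$, updates them recurrently via:

$$h_t^j = h_{t-1}^j + \sum_{k} \left( \log \frac{P(m^j = 1 \mid z_t^{(k)}, x_t)}{P(m^j = -1 \mid z_t^{(k)}, x_t)} - h_0^j \right), \tag{15}$$

where the log-odds inverse sensor model is specified for the $j$-th cell along the $k$-th ray in (14), and provides the cell occupancy likelihood in (9) as output.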
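The recurrent log-odds update can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes the linear inverse sensor model (a learnable slope times the difference between the cell distance and the measured depth), assumes the truncation applies only to cells more than a threshold beyond the measured depth, and handles a single ray. The names `update_ray`, `psi`, `eps`, and `prior` are hypothetical.

```python
import math

def sigmoid(x):
    """Map a log-odds value to an occupancy probability."""
    return 1.0 / (1.0 + math.exp(-x))

def update_ray(log_odds, cell_dists, depth, psi, eps, prior=0.0):
    """Add the log-odds increment of one range reading to the cells on its ray.

    log_odds[j] is the current log-odds of cell j and cell_dists[j] is its
    distance from the robot. Cells more than eps beyond the measured depth
    are left untouched (assumed truncation rule).
    """
    updated = list(log_odds)
    for j, dist in enumerate(cell_dists):
        delta = dist - depth  # negative in front of the hit, positive behind it
        if delta <= eps:
            updated[j] += psi * delta - prior  # inverse sensor model minus prior
    return updated

# Five cells along one ray, with a hypothetical depth reading of 2.8.
dists = [1.0, 2.0, 3.0, 4.0, 5.0]
h = update_ray([0.0] * 5, dists, depth=2.8, psi=2.0, eps=0.5)
probs = [sigmoid(v) for v in h]
```

In this example the two cells in front of the reading become likely free, the cell just behind it becomes likely occupied, and the two farthest cells keep the uniform prior of 0.5.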

III-B Cost Encoder

Given a state-control pair $(x, u)$, the cost encoder uses the output $h_t$ of the map encoder to obtain a cost function estimate $c_t(x, u)$. A complex interacting structure over the feature representation can be obtained via a deep neural network with parameters $\phi$. Wulfmeier et al. [Wulfmeier_DeepMaxEnt16] proposed several CNN architectures, including pooling and residual connections of fully convolutional networks, to model $c_t$ from $h_t$. While a complex architecture may provide better performance in practice, we also develop a simpler model for comparison using the inductive bias of obstacle avoidance in robot navigation.

Let $c_s$ and $c_l$ be a small and a large positive parameter, respectively. The cost of applying control $u$ in robot state $x$ can be modeled as large when the transition encounters an obstacle and as small otherwise:

$$c(x, u) := \begin{cases} c_s, & m^*(f(x, u)) = -1, \\ c_l, & m^*(f(x, u)) = 1. \end{cases}$$

Since $m^*$ is unknown, we use the estimated occupancy probabilities to compute the expectation of $c(x, u)$ over $m^*$:

$$c_t(x, u) := c_s \left(1 - \sigma(h_t^j)\right) + c_l\, \sigma(h_t^j), \tag{16}$$

where $h_t^j$ is the entry of the map encoder state that corresponds to the environment cell containing $f(x, u)$. This simple cost encoder has parameters $\phi := [c_s, c_l]^\top$ and its output is differentiable with respect to $\phi$ and $\psi$ through $h_t$.
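The simple cost encoder amounts to an expectation of a two-valued cost under the estimated occupancy probability. The Python sketch below illustrates this; the function name and the constants `c_small` and `c_large` are hypothetical placeholders rather than the values used in the experiments.

```python
import math

def expected_cost(log_odds, c_small=1.0, c_large=100.0):
    """Expected stage cost of entering a cell with the given occupancy log-odds."""
    p_occupied = 1.0 / (1.0 + math.exp(-log_odds))
    return c_large * p_occupied + c_small * (1.0 - p_occupied)

cost_free = expected_cost(-5.0)     # cell believed to be free
cost_unknown = expected_cost(0.0)   # cell with a uniform occupancy prior
cost_occupied = expected_cost(5.0)  # cell believed to be occupied
```

Because the output is a smooth function of the log-odds and of the two cost parameters, gradients can flow back to both the cost encoder and the map encoder.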

IV Cost Learning via Differentiable Planning

We focus on optimizing the parameters $\theta$ of the cost representation $c_t(x, u; z_{1:t}, \theta)$ developed in Sec. III. Since the true cost $c^*$ is not directly observable, we need to differentiate the loss function $\mathcal{L}(\theta)$ in (3), which, in turn, requires differentiating through the DSP problem in (2) with respect to the cost function estimate $c_t$.

Value Iteration Networks (VIN) [Tamar_VIN16] show that iterations of the value iteration algorithm can be approximated by a neural network with convolutional and min-pooling layers. This allows VIN to be differentiable with respect to the stage cost. While VIN can be modified to operate with a finite horizon and produce a non-stationary policy, it would still be based on full Bellman backups (convolutions and min-pooling) over the entire state space. As a result, VIN scales poorly with the state-space size, even though in partially observable environments it may not be necessary to determine the optimal cost-to-go at every state and control.

Instead of using dynamic programming to solve the DSP (2), any motion planning algorithm (e.g., A* [ARA], RRT [Lavalle_RRT98, Karaman_RRTstar11], etc.) that returns the optimal cost-to-go over a subset of the state-control space provides an accurate enough solution. For example, a backwards A* search applied to problem (2) with start state $x_t$, goal state $x_g$, and predecessor expansions according to the motion model $f$ provides an upper bound to the optimal cost-to-go:

$$Q_t^*(x, u) = c_t(x, u) + g(f(x, u)) \;\; \forall f(x, u) \in \text{CLOSED}, \qquad Q_t^*(x, u) \leq c_t(x, u) + g(f(x, u)) \;\; \forall f(x, u) \in \text{OPEN},$$

where $g$ are the values computed by A* for expanded nodes in the CLOSED list and visited nodes in the OPEN list. Thus, a Boltzmann policy $\pi_t(u \mid x)$ can be defined using the $g$-values returned by A* for all $x \in \text{CLOSED} \cup \text{OPEN}$ and a uniform distribution over $\mathcal{U}$ for all other states $x$. A* significantly improves on the efficiency of VIN [Tamar_VIN16] and other full-backup dynamic programming variants by performing local Bellman backups only on promising states (the CLOSED list).
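The idea of computing cost-to-go values only over promising states can be sketched with a backward search that stops as soon as the start state is expanded. The sketch below is illustrative and makes several simplifying assumptions that are not from the paper: a 4-connected grid, a stage cost that depends only on the cell being entered, and a zero heuristic (so the backward A* search reduces to Dijkstra's algorithm). The names `backward_search` and `q_values` are hypothetical.

```python
import heapq

MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # assumed 4-connected control set

def backward_search(cost, start, goal):
    """Search backward from goal; stop once start is expanded.

    cost[r][c] is the stage cost of entering cell (r, c). Returns the g-values
    of all labeled cells and the set of expanded (CLOSED) cells.
    """
    rows, cols = len(cost), len(cost[0])
    g = {goal: 0.0}
    closed = set()
    frontier = [(0.0, goal)]
    while frontier:
        g_cur, cell = heapq.heappop(frontier)
        if cell in closed:
            continue  # stale queue entry
        closed.add(cell)
        if cell == start:
            break  # early termination: remaining states are never backed up
        for dr, dc in MOVES:
            pred = (cell[0] - dr, cell[1] - dc)  # applying the move at pred reaches cell
            if 0 <= pred[0] < rows and 0 <= pred[1] < cols:
                g_new = g_cur + cost[cell[0]][cell[1]]
                if g_new < g.get(pred, float("inf")):
                    g[pred] = g_new
                    heapq.heappush(frontier, (g_new, pred))
    return g, closed

def q_values(cost, g, state):
    """Stage cost plus successor g-value, for successors labeled by the search."""
    q = {}
    for move in MOVES:
        nxt = (state[0] + move[0], state[1] + move[1])
        if nxt in g:
            q[move] = cost[nxt[0]][nxt[1]] + g[nxt]
    return q

grid_cost = [[1.0] * 5 for _ in range(5)]
start, goal = (2, 2), (4, 4)
g, closed = backward_search(grid_cost, start, goal)
q = q_values(grid_cost, g, start)
```

In this small example the search expands only part of the 25-cell grid before reaching the start, and the smallest value in `q` equals the cost-to-go of the start state. Successors that the search never labeled are simply absent from `q`.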

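In addition to improving the efficiency of the forward computation of $Q_t^*(x_t, u_t)$, using a planning algorithm to solve (2) is also more efficient in back-propagating errors with respect to $\theta$. In detail, using the subgradient method [Shor_Subgradient12, Ratliff_06] to optimize $\mathcal{L}(\theta)$ leads to a closed-form (sub)gradient of $Q_t^*(x_t, u_t)$ with respect to $c_t(x, u)$, removing the need for back-propagation through multiple convolutional or min-pooling layers. We proceed by rewriting $Q_t^*(x_t, u_t)$ in a form that makes its subgradient with respect to $c_t(x, u)$ obvious. Let $\mathcal{T}(x_t, u_t)$ be the set of feasible state-control trajectories $\tau := x_t, u_t, x_{t+1}, u_{t+1}, \ldots, x_{T-1}, u_{T-1}$ starting at $x_t$, $u_t$ and satisfying $x_{k+1} = f(x_k, u_k)$ for $k = t, \ldots, T-1$ with $x_T = x_g$. Let $\tau^* \in \mathcal{T}(x_t, u_t)$ be an optimal trajectory corresponding to the optimal cost-to-go function $Q_t^*(x_t, u_t)$ of a deterministic shortest path problem, i.e., the controls in $\tau^*$ satisfy the additional constraint $u_k \in \arg\min_{u \in \mathcal{U}} Q_t^*(x_k, u)$ for $k = t+1, \ldots, T-1$. Let $\mu_\tau(x, u) := \sum_{k=t}^{T-1} \mathbb{1}\{(x_k, u_k) = (x, u)\}$ be a state-control visitation function indicating if $(x, u)$ is visited by $\tau$. With these definitions, we can view the optimal cost-to-go function $Q_t^*(x_t, u_t)$ as a minimum over $\mathcal{T}(x_t, u_t)$ of the inner product of the cost function $c_t$ and the visitation function $\mu_\tau$:

$$Q_t^*(x_t, u_t) = \min_{\tau \in \mathcal{T}(x_t, u_t)} \sum_{x \in \mathcal{X},\, u \in \mathcal{U}} c_t(x, u)\, \mu_\tau(x, u), \tag{17}$$

where $\mathcal{X}$ can be assumed finite because both $T$ and $\mathcal{U}$ are finite. This form allows us to (sub)differentiate $Q_t^*(x_t, u_t)$ with respect to $c_t(x, u)$ for any $x \in \mathcal{X}$, $u \in \mathcal{U}$.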

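Lemma 1.

Let $F(x, y)$ be differentiable and convex in $x$. Then, $\nabla_x F(x, y^*)$, where $y^* := \arg\min_y F(x, y)$, is a subgradient of the piecewise-differentiable convex function $G(x) := \min_y F(x, y)$.

Applying Lemma 1 to (17) leads to the following subgradient of the optimal cost-to-go function:

$$\frac{\partial Q_t^*(x_t, u_t)}{\partial c_t(x, u)} = \mu_{\tau^*}(x, u), \tag{18}$$

which can be obtained from the optimal trajectory $\tau^*$ corresponding to $Q_t^*(x_t, u_t)$. This result and the chain rule allow us to obtain the complete (sub)gradient of $\mathcal{L}(\theta)$.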

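Proposition 1.

A subgradient of the loss function $\mathcal{L}(\theta)$ in (3) with respect to the cost function parameters $\theta$ can be obtained as follows:

$$\frac{\partial \mathcal{L}(\theta)}{\partial \theta} = -\sum_{n=1}^{N} \sum_{t=1}^{T_n} \sum_{u_t \in \mathcal{U}} \frac{\partial \log \pi_{t,n}(u^*_{t,n} \mid x_{t,n})}{\partial Q^*_{t,n}(x_{t,n}, u_t)}\, \frac{\partial Q^*_{t,n}(x_{t,n}, u_t)}{\partial \theta}, \tag{19}$$

where the first term has a closed form, while the second term is available from (18) and the cost representation in Sec. III:

$$\frac{\partial Q^*_{t,n}(x_{t,n}, u_t)}{\partial \theta} = \sum_{(x, u) \in \tau^*} \frac{\partial Q^*_{t,n}(x_{t,n}, u_t)}{\partial c_t(x, u)}\, \frac{\partial c_t(x, u)}{\partial \theta}. \tag{20}$$

The computation graph structure implied by Prop. 1 is illustrated in Fig. 1. The graph consists of a cost representation layer and a differentiable planning layer, allowing end-to-end minimization of $\mathcal{L}(\theta)$ via stochastic (sub)gradient descent. Full algorithms for the training and testing phases (Fig. 1) are shown in Alg. 1 and Alg. 2.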
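To make the chain rule concrete, the sketch below computes a subgradient of the single-sample loss with respect to every stage cost, without back-propagating through a planner. It assumes that the derivative of the negative log-likelihood with respect to the cost-to-go of control i is the indicator that i is the expert control minus the policy probability of i (which follows from the Boltzmann form), and that the derivative of each cost-to-go with respect to a stage cost is the visitation count of that state-control pair on the corresponding optimal path. The function name and the toy trajectories are hypothetical.

```python
import math
from collections import defaultdict

def loss_subgradient(q_values, trajectories, expert_control):
    """Subgradient of -log pi(expert_control) w.r.t. each stage cost c(x, u).

    trajectories[i] lists the (state, control) pairs on an optimal path that
    starts with control i; its visitation counts give dQ_i / dc.
    """
    q_min = min(q_values)
    weights = [math.exp(-(q - q_min)) for q in q_values]
    total = sum(weights)
    policy = [w / total for w in weights]
    grad = defaultdict(float)
    for i, path in enumerate(trajectories):
        dloss_dq = (1.0 if i == expert_control else 0.0) - policy[i]
        for pair in path:
            grad[pair] += dloss_dq  # visitation count times dloss/dQ_i
    return dict(grad)

# Two controls with equal cost-to-go; the expert chose control 0.
q = [2.0, 2.0]
paths = [
    [("a", "right"), ("b", "right")],  # optimal path after taking control 0
    [("a", "down"), ("c", "right")],   # optimal path after taking control 1
]
grad = loss_subgradient(q, paths, expert_control=0)
```

A gradient descent step on these values lowers the cost along the expert's path and raises it along the alternative, which is the qualitative behavior one would expect from the loss in (3).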

Although the form of the gradient in Prop. 1 is similar to that in Ziebart et al. [Ziebart_MaxEnt08], our contributions are orthogonal. Our contribution is to obtain a non-stationary cost-to-go function and its (sub)gradient for a finite-horizon problem using forward and backward computations that scale efficiently with the size of the state space. On the other hand, the results of Ziebart et al. [Ziebart_MaxEnt08] provide a stationary cost-to-go function and its gradient for an infinite horizon problem. The maximum entropy formulation of the stationary policy is a well grounded property of using a soft version of the Bellman update, which can explicitly model the suboptimality of expert trajectories. However, we can show the benefits of our approach without including the maximum entropy formulation and will leave it as future work.

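Algorithm 1 Training: learn cost function parameters $\theta$
Input: Demonstrations $\mathcal{D}$
while $\theta$ not converged do
     for $n = 1, \ldots, N$ and $t = 1, \ldots, T_n$ do
          Update $c_{t,n}$ based on $x_{t,n}$ and $z_{t,n}$ as in Sec. III
          Obtain $Q^*_{t,n}(x, u)$ from DSP (2) with stage cost $c_{t,n}$
          Obtain $\pi_{t,n}(u \mid x_{t,n})$ from $Q^*_{t,n}$ via Eq. (1)
     Update $\theta$ via Prop. 1

Algorithm 2 Testing: compute control policy for learned $\theta$
Input: Start state $x_s$, goal state $x_g$, optimized $\theta$
Current state $x_t \leftarrow x_s$
while $x_t \neq x_g$ do
     Make an observation $z_t$
     Update $c_t$ based on $x_t$ and $z_t$ as in Sec. III
     Obtain $Q_t^*(x, u)$ from DSP (2) with stage cost $c_t$
     Obtain $\pi_t(u \mid x_t)$ from $Q_t^*$ via Eq. (1)
     Update $x_t \leftarrow f(x_t, u_t)$ via $u_t := \arg\max_{u} \pi_t(u \mid x_t)$
Output: Navigation succeeds or fails.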

V Experiments

First grid world domain:
Model                                 Val. loss   Val. acc. (%)   Test traj. succ. rate (%)   Test traj. diff.
DeepMaxEnt [Wulfmeier_DeepMaxEnt16]   0.18        93.6            90.9                        0.145
Ours-HCE                              1.10        59.5            99.7                        0.378
Ours-SCE                              0.66        62.1            97.2                        0.174
Ours-CNN                              0.27        90.5            96.7                        0.144

Second grid world domain:
Model                                 Val. loss   Val. acc. (%)   Test traj. succ. rate (%)   Test traj. diff.
DeepMaxEnt [Wulfmeier_DeepMaxEnt16]   0.23        93.7            31.1                        6.528
Ours-HCE                              1.32        42.2            100.0                       2.876
Ours-SCE                              0.66        62.1            84.4                        1.569
Ours-CNN                              0.14        95.1            90.1                        1.196

Fig. 4: Validation and test results for the two grid world domains. Cross entropy loss (3) and prediction accuracy for the validation set are reported. Test trajectories are iteratively rolled out from the non-stationary policy $\pi_t$. A trial is classified as successful if the goal is reached without collisions within twice the number of steps of a shortest path in the ground-truth environment. Ours-CNN is capable of matching the expert demonstrations while generalizing to new robot navigation tasks at test time. Right: Plot showing the effect of noise on the accuracy of the Ours-CNN model.
Fig. 5: Examples of occupancy estimation, A* motion planning, and subgradient computation during a successful test trajectory. The first figure shows the true occupancy map with the robot start and goal locations in green and magenta, respectively. The second and third figures show the current lidar observation in red and the robot trajectory thus far in green. The map occupancy estimate in grayscale is in the background. The optimal cost-to-go estimate from A* is shown in a blue-yellow colormap in the foreground (brighter means higher cost-to-go). The optimal trajectories in cyan corresponding to $Q_t^*(x_t, u_t)$ are obtained during A* planning for subgradient computation in Eqn. (18). The last figure shows the final successful trajectory in green and an optimal shortest path in the fully observable environment in blue.

We evaluate our approach in 2D grid world navigation tasks at two scales. Obstacle configurations are generated randomly in maps of sizes or . We use an 8-connected grid so that a control $u_t$ causes a transition from $x_t$ to one of the eight neighbor cells $x_{t+1}$. At each step, the robot receives a lidar scan at resolution, resulting in beams in each scan (see Fig. 3). The lidar range readings are corrupted by additive zero-mean Gaussian noise. The standard deviation of the noise is and (grid cell = 1) and the lidar maximum range is and in and domains, respectively. Note that the lidar range is smaller than the map size to demonstrate environment partial observability. During test time, the domain size is the maximum size allowed for the observed area along a trajectory. Demonstrations are obtained by running an A* planning algorithm to solve the deterministic shortest path problem on the true map $m^*$. The number of maps and training samples generated are shown in Table II.

             First domain         Second domain
Dataset      #maps   #samples     #maps   #samples
Train        7638    514k         970     460k
Validation   966     66k          122     58k
Test         952     -            122     -
TABLE II: Dataset size. We sample 10 trajectories in each map in training and validation, and each sample takes the form $(x_t, u_t^*, z_t, x_g)$. In testing, the robot’s task is to navigate from one randomly sampled start to goal location on each map.

V-A Baseline and model variations

We use DeepMaxEnt [Wulfmeier_DeepMaxEnt16] as a baseline and compare it to three variants of our model. In all variants, we parameterize an inverse observation model and use the log-odds update rule in Eqn. (13) as the map encoder. This map encoder is sufficiently expressive to model occupancy probability from lidar observations in a 2D environment. The A* algorithm is used to solve the DSP (2), providing the optimal cost-to-go $Q_t^*(x_t, u)$ for $u \in \mathcal{U}$ and a subgradient of $Q_t^*(x_t, u)$ according to (18). All the neural networks are implemented in the PyTorch library [Paszke_Pytorch17] and trained with the Adam optimizer [Kingma_ADAM14] until convergence.

DeepMaxEnt uses a neural network to learn a cost function directly from lidar observations without explicitly having a map representation. The neural network in our experiments is the “Standard FCN” in [Wulfmeier_DeepMaxEnt16] in the domain, and the encoder-decoder architecture in [SegNet] in the domain. Value iteration is approximated by a finite number of Bellman backup iterations, equal to the map size. The experiments in the original DeepMaxEnt paper [Wulfmeier_DeepMaxEnt16] use the mean and variance of the height of the 3D lidar points in each cell, as well as a binary indicator of cell visibility, as input features to the neural network. Since our synthetic experiments are set up in 2D, the count of lidar beams in each cell is used as a replacement for the height mean and variance. This is a fair adaptation because Wulfmeier et al. [Wulfmeier_DeepMaxEnt16] argued that obstacles generally represent areas of larger height variance, which corresponds to higher beam counts in our observations.

Ours-HCE stands for hard-coded cost encoder. This simple variant of our model uses Eqn. (16) with $c_s$ and $c_l$ set explicitly as constants.

Ours-SCE stands for soft-coded cost encoder and has $c_s$ and $c_l$ in Eqn. (16) as learnable parameters.

Ours-CNN is our most generic variant using a convolutional neural network as cost encoder. The network architecture is the same as in DeepMaxEnt for fair comparison.

V-B Experiments and Results

Model generalization

Fig. 4 compares several measures of accuracy across the algorithms. Both Ours-HCE and Ours-SCE explicitly incorporate cell traversability through the cost design in (16). Test results show that this explicit cost encoder is successful at obstacle avoidance, regardless of whether the parameters $c_s$, $c_l$ are constants in Ours-HCE or learnable in Ours-SCE. The performance of Ours-HCE also shows that the map encoder is learning a correct map representation from the noisy lidar observations since the only trainable parameters are in the inverse observation model (13). However, both models fail at matching demonstrations in validation because the cost encoder (16) emphasizes obstacle avoidance explicitly, leaving little capacity to learn from demonstrations. Ours-CNN combines the strength of learning from demonstrations and generalization to new navigation tasks while avoiding obstacles. Its validation results are on par with DeepMaxEnt, showing the validity of the closed-form subgradient in (18). Ours-CNN significantly outperforms DeepMaxEnt in new tasks at test time. The performance gap of DeepMaxEnt in the two domains shows that a general CNN architecture applied directly to the lidar scan measurements is not as effective as the map encoder in Ours-CNN at modeling occupancy probability. Fig. 5 shows the map occupancy estimation, as well as the optimal trajectories necessary for subgradient computation in Sec. IV.

Robustness to noise

The robustness of Ours-CNN to the observation noise is evaluated in the domain. Fig 4 shows that the performance degrades as the noise increases but our inverse observation model (13) still generalizes well considering that noise with standard deviation of is significant when the lidar range is only .

Computational efficiency

Finally, we compare the efficiency of a forward pass through our A* planner and the value iteration algorithm in DeepMaxEnt. The A* algorithm in our models is implemented in C++ and evaluated on a CPU. The VI algorithm is implemented using convolutional and min-pooling layers in PyTorch as described in [Tamar_VIN16] and is evaluated on a GPU. We record the time that each model takes to return a policy given a cost function . Our A* algorithm takes only ms as compared to VI’s ms on the map. In the domain, our A* algorithm takes ms, compared to VI’s ms, illustrating the scalability of our model in the size of the state space.

VI Conclusion

We proposed an inverse reinforcement learning approach for inferring navigation costs from demonstration in partially observable environments. Our model introduces a new cost representation composed of a probabilistic occupancy encoder and a cost encoder defined over the occupancy features. We showed that a motion planning algorithm can compute optimal cost-to-go values over the cost representation, while the cost-to-go (sub)gradient may be obtained in closed form. Our work offers a promising model for encoding occupancy features in navigation tasks and may enable efficient online learning in challenging operational conditions.