Practical applications of autonomous robot systems increasingly require operation in unstructured, partially observed, unknown, and changing environments. Achieving safe and robust navigation in such conditions is directly coupled with the quality of the environment representation and the cost function specifying desirable robot behavior. Designing a cost function that accurately encodes safety, liveness, and efficiency requirements of navigation tasks is a major challenge. In contrast, it is significantly easier to obtain demonstrations of desirable behavior. The field of inverse reinforcement learning [Ng_IRL00, IRLSurvey, Neu_Apprenticeship12] (IRL) provides numerous tools for learning cost functions from expert demonstration.
We consider the problem of safe navigation from partial observations in unknown environments. We assume that demonstrations containing expert poses, controls, and sensory observations over time are available. The objective is to infer the cost function, which depends on the observation sequence and explains the demonstrated behavior. This inference can only be done indirectly, by comparing the control inputs that a robot may take based on its current cost representation to the expert’s actions in the same situation.
Learning a cost function requires a control policy that is differentiable with respect to the stage cost parameters. Computing such derivatives has been addressed by several successful approaches [Ratliff_06, Ziebart_MaxEnt08, Tamar_VIN16, Okada_PIN17]. Ratliff et al. [Ratliff_06] developed algorithms with regret bounds for computing subgradients of planning algorithms (e.g., A* [ARA], RRT [Lavalle_RRT98, Karaman_RRTstar11], etc.) with respect to the cost features. Ziebart et al. [Ziebart_MaxEnt08] developed a dynamic programming algorithm for computing the expected state visitation frequency of a policy extracted from expert demonstrations to enforce that it earns the same reward as the demonstrated policy. Tamar et al. [Tamar_VIN16] showed that the value iteration algorithm could be approximated using a series of convolutions (computing the value function expectation over stochastic transitions) and max-pooling operations (choosing the best action), allowing automatic differentiation. Okada et al. [Okada_PIN17] proposed path integral networks in which the control sequence resulting from path integral control may be differentiated with respect to the controller parameters. All of these works, however, assume a known environment and only optimize the parameters of a cost function defined over it.
A series of recent works [ChoiIRLinPOMDPJMLR2011, Shankar_RLviaRCNN2016, Wulfmeier_DeepMaxEnt16, Gupta_CMP17, Karkus_QMDPNet2017, Karkus_DAN19] address IRL under partial observability. Wulfmeier et al. [Wulfmeier_DeepMaxEnt16] train deep neural network representations for Ziebart et al.’s Maximum Entropy IRL [Ziebart_MaxEnt08] method and consider streaming lidar scan observations of the environment. Karkus et al. [Karkus_DAN19] formulate the IRL problem as a POMDP [ASTROM_POMDP1965], including the robot pose and environment map in the state. Since a naïve representation of the occupancy distribution may require exponential memory in the map size, the authors assume partial knowledge of the environment (the structure of a building is known but the furniture placement is not). A control policy is obtained via the SARSOP [Kurniawati_SARSOP08] algorithm, which approximates the cost-to-go function only over an optimally reachable space. Gupta et al. [Gupta_CMP17] address visual navigation in partially observed environments while using hierarchical VIN as the planner. Khan et al. [Khan_MACN18] introduce a memory module to VIN to address partial observability.
Many IRL algorithms rely on dynamic programming, including VIN and derivatives [Gupta_CMP17, Khan_MACN18], which requires updating cost-to-go estimates over all possible states. Our insight is that in partially observable environments, the cost-to-go estimates need to be updated and differentiated only over a subset of states. Inspired by [Ratliff_06], we obtain cost-to-go estimates only over promising states using a motion planning algorithm. This helps to obtain a closed-form subgradient of the cost-to-go with respect to the learned cost function from the optimal trajectory. While Ratliff et al. [Ratliff_06] exploit this observation in fully observable environments, none of the works focusing on partial observability take advantage of it. The differences between our work and closely related works are summarized in Table I. In summary, we offer two contributions illustrated in Fig. 1. First, we develop a non-stationary cost function representation composed of a probabilistic occupancy map encoder, with recurrent dependence on the observation sequence, and a cost encoder, defined over the occupancy features (Sec. III). Second, we optimize the cost parameters using a closed-form subgradient of the cost-to-go obtained only over a subset of promising states (Sec. IV).
II Problem Formulation
Consider a robot navigating in an unknown environment with the task of reaching a goal state . Let be the robot state, capturing its pose, twist, etc., at discrete time . For a given control input where is assumed finite, the robot state evolves according to known deterministic dynamics: . Let be a function specifying the true occupancy of the environment by labeling states as either feasible () or infeasible () and let be the space of possible environment realizations . Let be a cost function specifying desirable robot behavior in a given environment, e.g., according to an expert user or an optimal design. We assume that the robot does not have access to either the true occupancy map or the true cost function . However, the robot is able to make observations (e.g., using a lidar scanner or a depth camera) of the environment in its vicinity, whose distribution depends on the robot state and the environment . Given a training set of expert trajectories with length to demonstrate desirable behavior, our goal is to
learn a cost function estimate that depends on an observation sequence from the true latent environment and is parameterized by ,
derive a stochastic policy from such that the robot behavior under matches the prior experience .
To balance exploration in partially observable environments with exploitation of promising controls, we specify as a Boltzmann policy [Ramachandran_BayesianIRL07, Neu_Apprenticeship12] associated with the cost :
where the optimal cost-to-go function is:
Given demonstrations , optimize the cost function parameters so that the log-likelihood of the demonstrated controls is maximized under the robot policies :
The problem setup is illustrated in Fig. 1. While Eqn. (2) is a standard deterministic shortest path (DSP) problem, the challenge is to make it differentiable with respect to . This is needed to propagate the loss in (3) back through the DSP problem to update the cost parameters . Once the parameters are optimized, the robot can generalize to navigation tasks in new partially observable environments by evaluating the cost based on the observations iteratively and (re)computing the associated policy .
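The relationship between the cost-to-go values, the Boltzmann policy, and the log-likelihood objective can be illustrated with a minimal sketch. The function names, the unit temperature, and the toy cost-to-go values below are assumptions made for illustration only; they are not taken from the authors' implementation.

```python
import numpy as np

def boltzmann_policy(q_values):
    """Softmax over negated cost-to-go values: lower Q gives higher probability."""
    q = np.asarray(q_values, dtype=float)
    logits = -(q - q.min())           # shifting by the minimum avoids overflow
    weights = np.exp(logits)
    return weights / weights.sum()

def negative_log_likelihood(q_per_step, expert_controls):
    """Sum of -log pi(u_expert | x) over the steps of one demonstration."""
    loss = 0.0
    for q, u in zip(q_per_step, expert_controls):
        loss -= np.log(boltzmann_policy(q)[u])
    return loss

# Hypothetical cost-to-go values for 3 controls at 2 time steps (assumed data).
q_demo = [[1.0, 2.0, 4.0], [3.0, 0.5, 3.0]]
expert = [0, 1]                       # expert picks the lowest-cost control
pi0 = boltzmann_policy(q_demo[0])
loss = negative_log_likelihood(q_demo, expert)
```

Minimizing this loss with respect to the cost parameters raises the probability that the policy assigns to the demonstrated controls, which is why the cost-to-go must be differentiable with respect to the cost.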
III Cost Function Representation
Fig. 2 (caption, partially recovered): A convolutional neural network, parameterized by , extracts features from the map state to specify the cost at a given robot state-control pair . The learnable parameters are .
We propose a cost function representation comprised of two components: a map encoder and a cost encoder.
The map encoder incrementally updates a hidden state using the most recent observation obtained from robot state . For example, a Bayes filter with likelihood function parameterized by can convert the sequential input into a fixed-size hidden state :
The cost encoder uses the latent environment map to define the cost function estimate at a given state-control pair . A convolutional neural network (CNN) [Goodfellow-et-al-2016] with parameters can extract cost features from the environment map:
This conceptual model, which combines recurrent estimation of a hidden environment state with feature extraction to define the cost at , is illustrated in Fig. 2. The model is differentiable by design, allowing its parameters to be optimized. We propose an instantiation of this general model, specific to modeling occupancy costs from depth measurements in robot navigation tasks.
III-A Map Encoder
We encode the occupancy probability of different environment areas into a hidden state. In detail, we discretize into cells and let be the vector of true occupancy values over the cells. Since is unknown to the robot, we maintain the occupancy likelihood given the history of states and observations . See Fig. 3 for an example of a depth measurement and associated occupancy likelihood over the map . The representation complexity may be simplified significantly if one assumes independence among the map cells :
We draw inspiration from occupancy grid mapping [Thrun_PR05, Hornung_Octomap13] to design recurrent updates for the occupancy probability of each cell . Since is binary, its likelihood update can be simplified by defining the log-odds ratio of occupancy:
The recurrent Bayesian update of is:
where the increment is a log-odds observation model. The occupancy posterior can be recovered from the occupancy log-odds ratio via a sigmoid function:
The sigmoid function satisfies the following properties:
To complete the recurrent occupancy model design, we parameterize the log-odds observation model:
where Bayes' rule was used to represent it in terms of an inverse observation model and a prior occupancy log-odds ratio , whose value may depend on the environment (e.g., specifies a uniform prior over occupied and free cells). Note that map cells outside of the sensor field of view at time are not affected by , in which case . Now, consider a cell along the direction of the -th sensor ray, whose depth measurement is . Let be the distance between the robot position and the center of mass of cell . We model the occupancy likelihood of as a truncated sigmoid around the true distance :
where is a learnable parameter, , and is a distance threshold on the influence of the -th sensor ray on the occupancy of cells around the point of reflection. Using (10), this implies the following log-odds inverse sensor model for cells along the -th sensor ray:
This model suggests that one may also use a more expressive multi-layer neural network in place of the linear transformation of the distance differential along the -th ray:
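The recurrent log-odds update described in this subsection can be sketched for the cells along a single sensor ray. This is a minimal illustration under stated assumptions: the names `psi` (slope of the sigmoid), `eps` (distance threshold), and `prior_log_odds`, as well as the exact form of the increment (a linear function of the distance differential inside the threshold, zero outside it), are hypothetical choices and may differ from the authors' parameterization.

```python
import numpy as np

def sigmoid(x):
    """Logistic function mapping log-odds to probabilities."""
    return 1.0 / (1.0 + np.exp(-x))

def log_odds_increment(cell_dist, ray_depth, psi=2.0, eps=0.5, prior_log_odds=0.0):
    """Assumed inverse sensor model for cells along one ray.

    dz < 0: the cell lies before the measured reflection (evidence for free).
    0 < dz <= eps: the cell lies just behind the reflection (evidence for occupied).
    dz > eps: the cell is beyond the ray's influence and is left unchanged.
    """
    dz = np.asarray(cell_dist, dtype=float) - ray_depth
    in_range = dz <= eps
    return np.where(in_range, psi * dz - prior_log_odds, 0.0)

def update_log_odds(h, cell_dist, ray_depth, **kwargs):
    """Recurrent Bayesian update: add the increment to the previous log-odds."""
    return h + log_odds_increment(cell_dist, ray_depth, **kwargs)

# Five cells at distances 1..5 from the robot; the ray reports a depth of 2.9.
dists = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
h = update_log_odds(np.zeros(5), dists, ray_depth=2.9)
p_occ = sigmoid(h)   # occupancy posterior recovered from the log-odds
```

In this toy example, cells before the reflection point move toward free, the cell just behind it moves toward occupied, and cells farther along the ray keep the uniform prior.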
III-B Cost Encoder
Given a state-control pair , the cost encoder uses the output of the map encoder to obtain a cost function estimate . A complex interacting structure over the feature representation can be obtained via a deep neural network with parameters . Wulfmeier et al. [Wulfmeier_DeepMaxEnt16] proposed several CNN architectures, including pooling and residual connections of fully convolutional networks, to model from . While a complex architecture may provide better performance in practice, we also develop a simpler model for comparison using the inductive bias of obstacle avoidance in robot navigation.
Let and be a small and large positive parameter, respectively. The cost of applying control in robot state can be modeled as large when the transition encounters an obstacle and as small otherwise:
Since is unknown, we use the estimated occupancy probabilities to compute the expectation of over :
where is the entry of the map encoder state that corresponds to the environment cell containing . This simple cost encoder has parameters and its output is differentiable with respect to and through .
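The expectation described above can be sketched as follows. The parameter names `c_small` and `c_large`, their default values, and the convention that the relevant map cell is the one containing the successor state are assumptions made for illustration.

```python
import numpy as np

def sigmoid(h):
    """Logistic function mapping log-odds to occupancy probabilities."""
    return 1.0 / (1.0 + np.exp(-h))

def expected_cost(log_odds_map, next_cell, c_small=1.0, c_large=10.0):
    """Expected transition cost under the estimated occupancy of `next_cell`.

    The cost is c_large if the successor cell is occupied and c_small
    otherwise, so its expectation is a convex combination of the two.
    """
    p_occ = sigmoid(log_odds_map[next_cell])
    return c_small * (1.0 - p_occ) + c_large * p_occ

# Toy 2x2 map of log-odds: likely free, unknown, likely occupied, likely free.
h = np.array([[-4.0, 0.0], [4.0, -4.0]])
cost_free = expected_cost(h, (0, 0))
cost_unknown = expected_cost(h, (0, 1))
cost_occupied = expected_cost(h, (1, 0))
```

Because the expression is a smooth function of the two cost parameters and of the log-odds, gradients can flow both to the cost encoder and, through the map state, to the map encoder.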
IV Cost Learning via Differentiable Planning
We focus on optimizing the parameters of the cost representation developed in Sec. III. Since the true cost is not directly observable, we need to differentiate the loss function in (3), which, in turn, requires differentiating through the DSP problem in (2) with respect to the cost function estimate .
Value Iteration Networks (VIN) [Tamar_VIN16] show that iterations of the value iteration algorithm can be approximated by a neural network with convolutional and min-pooling layers. This allows VIN to be differentiable with respect to the stage cost. While VIN can be modified to operate with a finite horizon and produce a non-stationary policy, it would still be based on full Bellman backups (convolutions and min-pooling) over the entire state space. As a result, VIN scales poorly with the state-space size, even though determining the optimal cost-to-go at every state and control may not be necessary in partially observable environments.
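To make the scaling argument concrete, the following sketch performs full Bellman backups on an 8-connected grid, with a minimum over shifted copies of the value array playing the role of min-pooling. It is a plain NumPy illustration rather than the VIN architecture itself, and it assumes that the cost of a transition depends only on the cell being entered.

```python
import numpy as np

# Offsets (row, col) of the eight neighbors reachable by one control.
SHIFTS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def value_iteration(cell_cost, goal, num_iters):
    """Full Bellman backups over every cell of the grid.

    cell_cost[x'] is the assumed cost of any transition entering cell x'.
    Each iteration touches all cells, regardless of where the robot is.
    """
    rows, cols = cell_cost.shape
    v = np.full((rows, cols), np.inf)
    v[goal] = 0.0
    for _ in range(num_iters):
        # Entry cost plus cost-to-go of each cell, padded so borders see inf.
        padded = np.pad(v + cell_cost, 1, constant_values=np.inf)
        candidates = [
            padded[1 + dr:1 + dr + rows, 1 + dc:1 + dc + cols]
            for dr, dc in SHIFTS
        ]
        v = np.minimum.reduce(candidates)   # minimum over the eight controls
        v[goal] = 0.0
    return v

# Toy 4x4 map with two high-cost cells on the diagonal (assumed data).
cost = np.ones((4, 4))
cost[1, 1] = cost[2, 2] = 100.0
v = value_iteration(cost, goal=(3, 3), num_iters=16)
```

Every iteration updates all cells of the map, so the work grows with the map area times the number of iterations even when only a small corridor between the start and the goal matters.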
Instead of using dynamic programming to solve the DSP (2), any motion planning algorithm (e.g., A* [ARA], RRT [Lavalle_RRT98, Karaman_RRTstar11], etc.) that returns the optimal cost-to-go over a subset of the state-control space provides a sufficiently accurate solution. For example, a backwards A* search applied to problem (2) with start state , goal state , and predecessor expansions according to the motion model provides an upper bound to the optimal cost-to-go:
where are the values computed by A* for expanded nodes in the CLOSED list and visited nodes in the OPEN list. Thus, a Boltzmann policy can be defined using the -values returned by A* for all and a uniform distribution over for all other states . A* significantly improves on the efficiency of VIN [Tamar_VIN16] and other full-backup dynamic programming variants by performing local Bellman backups only on promising states (the CLOSED list).
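A backward search of this kind can be sketched as follows. For brevity the sketch uses Dijkstra's algorithm, i.e., A* with a zero heuristic, rather than a heuristic-guided A*, and it assumes that the cost of a transition depends only on the cell being entered. The function names and the toy map are hypothetical.

```python
import heapq
import numpy as np

# Controls on an 8-connected grid, given as (row, col) offsets.
CONTROLS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def backward_search(cell_cost, start, goal):
    """Backward Dijkstra from the goal that stops once the start is expanded.

    Returns a dict g mapping visited cells to cost-to-go values. Values of
    expanded (CLOSED) cells are optimal; values of cells that were only
    visited (OPEN) are upper bounds.
    """
    rows, cols = cell_cost.shape
    g = {goal: 0.0}
    closed = set()
    frontier = [(0.0, goal)]
    while frontier:
        g_x, x = heapq.heappop(frontier)
        if x in closed:
            continue
        closed.add(x)
        if x == start:
            break
        for dr, dc in CONTROLS:               # predecessors p with f(p, u) = x
            p = (x[0] - dr, x[1] - dc)
            if not (0 <= p[0] < rows and 0 <= p[1] < cols):
                continue
            cand = g_x + cell_cost[x]         # cost of moving from p into x
            if cand < g.get(p, np.inf):
                g[p] = cand
                heapq.heappush(frontier, (cand, p))
    return g

def q_values(cell_cost, g, x):
    """Q(x, u) = c(x, u) + g(f(x, u)); infinite where no value is available."""
    q = []
    for dr, dc in CONTROLS:
        nxt = (x[0] + dr, x[1] + dc)
        q.append(cell_cost[nxt] + g[nxt] if nxt in g else np.inf)
    return np.array(q)

# Toy 4x4 map with two high-cost cells on the diagonal (assumed data).
cost = np.ones((4, 4))
cost[1, 1] = cost[2, 2] = 100.0
g = backward_search(cost, start=(0, 0), goal=(3, 3))
q = q_values(cost, g, (0, 0))
```

The search terminates as soon as the start state is expanded, so on larger maps it typically touches far fewer cells than a full sweep of Bellman backups.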
In addition to improving the efficiency of the forward computation of , using a planning algorithm to solve (2) is also more efficient in back-propagating errors with respect to . In detail, using the subgradient method [Shor_Subgradient12, Ratliff_06] to optimize leads to a closed-form (sub)gradient of with respect to , removing the need for back-propagation through multiple convolutional or min-pooling layers. We proceed by rewriting in a form that makes its subgradient with respect to obvious. Let be the set of feasible state-control trajectories starting at , and satisfying for with . Let be an optimal trajectory corresponding to the optimal cost-to-go function of a deterministic shortest path problem, i.e., the controls in satisfy the additional constraint for . Let be a state-control visitation function indicating whether is visited by . With these definitions, we can view the optimal cost-to-go function as the minimum over of the inner product of the cost function and the visitation function :
where can be assumed finite because both and are finite. This form allows us to (sub)differentiate with respect to for any , .
Proposition 1. Let be differentiable and convex in . Then, , where , is a subgradient of the piecewise-differentiable convex function .
which can be obtained from the optimal trajectory corresponding to . This result and the chain rule allow us to obtain the complete (sub)gradient of .
The computation graph structure implied by Prop. 1 is illustrated in Fig. 1. The graph consists of a cost representation layer and a differentiable planning layer, allowing end-to-end minimization of via stochastic (sub)gradient descent. Full algorithms for the training and testing phases (Fig. 1) are shown in Alg. 1 and Alg. 2.
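The statement that the (sub)gradient of the optimal cost-to-go with respect to the cost equals the visitation indicator of the optimal trajectory can be checked numerically on a small example. The graph, the edge costs, and the brute-force path enumeration below are hypothetical and serve only as a sanity check; in practice a planner would supply the optimal trajectory.

```python
import numpy as np

# Hypothetical deterministic graph: EDGES[i] = (state, next_state).
# Each edge plays the role of one state-control pair (x, u).
EDGES = [(0, 1), (0, 2), (1, 3), (2, 3), (1, 2)]
START, GOAL = 0, 3

def all_paths(state, visited=()):
    """Enumerate simple paths from `state` to GOAL as tuples of edge indices."""
    if state == GOAL:
        yield ()
        return
    for i, (s, nxt) in enumerate(EDGES):
        if s == state and nxt not in visited:
            for rest in all_paths(nxt, visited + (state,)):
                yield (i,) + rest

def optimal_cost_and_visitation(cost):
    """Return the optimal cost-to-go from START and the visitation indicator."""
    best = min(all_paths(START), key=lambda p: sum(cost[i] for i in p))
    mu = np.zeros(len(EDGES))
    mu[list(best)] = 1.0
    return float(cost @ mu), mu        # inner product of cost and visitation

cost = np.array([1.0, 4.0, 5.0, 1.0, 1.0])
q_star, mu = optimal_cost_and_visitation(cost)

# Forward finite differences of the optimal cost with respect to each edge cost.
delta = 1e-6
identity = np.eye(len(EDGES))
fd = np.array([
    (optimal_cost_and_visitation(cost + delta * identity[i])[0] - q_star) / delta
    for i in range(len(EDGES))
])
```

When the optimal trajectory is unique, as in this example, the finite-difference estimate matches the visitation indicator, so no back-propagation through the planner's internal iterations is required.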
Although the form of the gradient in Prop. 1 is similar to that in Ziebart et al. [Ziebart_MaxEnt08], our contributions are orthogonal. Our contribution is to obtain a non-stationary cost-to-go function and its (sub)gradient for a finite-horizon problem using forward and backward computations that scale efficiently with the size of the state space. In contrast, the results of Ziebart et al. [Ziebart_MaxEnt08] provide a stationary cost-to-go function and its gradient for an infinite-horizon problem. The maximum entropy formulation of the stationary policy is a well-grounded property of using a soft version of the Bellman update, which can explicitly model the suboptimality of expert trajectories. However, we can show the benefits of our approach without the maximum entropy formulation and leave its inclusion as future work.
We evaluate our approach in 2D grid world navigation tasks at two scales. Obstacle configurations are generated randomly in maps of sizes or . We use an 8-connected grid so that a control causes a transition from to one of the eight neighbor cells . At each step, the robot receives a lidar scan at resolution, resulting in beams in each scan (see Fig. 3). The lidar range readings are corrupted by additive zero-mean Gaussian noise. The standard deviation of the noise is and (grid cell = 1) and the lidar maximum range is and in and domains, respectively. Note that the lidar range is smaller than the map size, so the environment is only partially observable. During test time, the domain size is the maximum size allowed for the observed area along a trajectory. Demonstrations are obtained by running an A* planning algorithm to solve the deterministic shortest path problem on the true map . The number of maps and training samples generated are shown in Table II.
V-A Baseline and model variations
We use DeepMaxEnt [Wulfmeier_DeepMaxEnt16] as a baseline and compare it to three variants of our model. In all variants, we parameterize an inverse observation model and use the log-odds update rule in Eqn. (13) as the map encoder. This map encoder is sufficiently expressive to model occupancy probability from lidar observations in a 2D environment. The A* algorithm is used to solve the DSP (2), providing the optimal cost-to-go for and a subgradient of according to (18). All the neural networks are implemented in the PyTorch library [Paszke_Pytorch17] and trained with the Adam optimizer [Kingma_ADAM14] until convergence.
DeepMaxEnt uses a neural network to learn a cost function directly from lidar observations without explicitly having a map representation. The neural network in our experiments is the “Standard FCN” in [Wulfmeier_DeepMaxEnt16] in the domain, and the encoder-decoder architecture in [SegNet] in the domain. Value iteration is approximated by a finite number of Bellman backup iterations, equal to the map size. The experiments in the original DeepMaxEnt paper [Wulfmeier_DeepMaxEnt16] use the mean and variance of the height of the 3D lidar points in each cell, as well as a binary indicator of cell visibility, as input features to the neural network. Since our synthetic experiments are set up in 2D, the count of lidar beams in each cell is used as a replacement for the height mean and variance. This is a fair adaptation because Wulfmeier et al. [Wulfmeier_DeepMaxEnt16] argued that obstacles generally represent areas of larger height variance, which corresponds to higher beam counts in our observations.
Ours-HCE stands for hard-coded cost encoder. This simple variant of our model uses Eqn. (16) with and set explicitly as constants.
Ours-SCE stands for soft-coded cost encoder and has and in Eqn. (16) as learnable parameters.
Ours-CNN is our most generic variant using a convolutional neural network as cost encoder. The network architecture is the same as in DeepMaxEnt for fair comparison.
V-B Experiments and Results
Fig. 4 compares several accuracy measures across the algorithms. Both Ours-HCE and Ours-SCE explicitly incorporate cell traversability through the cost design in (16). Test results show that this explicit cost encoder is successful at obstacle avoidance, regardless of whether the parameters , are constants in Ours-HCE or learnable in Ours-SCE. The performance of Ours-HCE also shows that the map encoder is learning a correct map representation from the noisy lidar observations since the only trainable parameters are in the inverse observation model (13). However, both models fail at matching demonstrations in validation because the cost encoder (16) emphasizes obstacle avoidance explicitly, leaving little capacity to learn from demonstrations. Ours-CNN combines the strength of learning from demonstrations and generalization to new navigation tasks while avoiding obstacles. Its validation results are on par with DeepMaxEnt, showing the validity of the closed-form subgradient in (18). Ours-CNN significantly outperforms DeepMaxEnt in new tasks at test time. The performance gap of DeepMaxEnt in the two domains shows that a general CNN architecture applied directly to the lidar scan measurements is not as effective as the map encoder in Ours-CNN at modeling occupancy probability. Fig. 5 shows the map occupancy estimation, as well as the optimal trajectories necessary for subgradient computation in Sec. IV.
Robustness to noise
The robustness of Ours-CNN to the observation noise is evaluated in the domain. Fig. 4 shows that the performance degrades as the noise increases, but our inverse observation model (13) still generalizes well, considering that noise with a standard deviation of is significant when the lidar range is only .
Finally, we compare the efficiency of a forward pass through our A* planner and the value iteration algorithm in DeepMaxEnt. The A* algorithm in our models is implemented in C++ and evaluated on a CPU. The VI algorithm is implemented using convolutional and min-pooling layers in PyTorch as described in [Tamar_VIN16] and is evaluated on a GPU. We record the time that each model takes to return a policy given a cost function . Our A* algorithm takes only ms as compared to VI’s ms on the map. In the domain, our A* algorithm takes ms, compared to VI’s ms, illustrating the scalability of our model in the size of the state space.
We proposed an inverse reinforcement learning approach for inferring navigation costs from demonstration in partially observable environments. Our model introduces a new cost representation composed of a probabilistic occupancy encoder and a cost encoder defined over the occupancy features. We showed that a motion planning algorithm can compute optimal cost-to-go values over the cost representation, while the cost-to-go (sub)gradient may be obtained in closed form. Our work offers a promising model for encoding occupancy features in navigation tasks and may enable efficient online learning in challenging operational conditions.