Robotic exploration is the task of autonomously navigating an unknown environment with the goal of gathering sufficient information to represent it, often via a spatial map [stachniss2009robotic]. This ability is key to enabling many downstream tasks such as planning [selin2019efficient] and goal-driven navigation [savva2019habitat, wijmans2019dd, morad2021embodied]. Although a vast portion of existing literature tackles this problem [osswald2016speeding, chaplot2019learning, chen2019learning, ramakrishnan2020occupancy], it is not yet completely solved, especially in complex indoor environments. The recent introduction of large datasets of photorealistic indoor environments [chang2017matterport3d, xia2018gibson] has eased the development of robust exploration strategies, which can be validated safely and quickly thanks to powerful simulation platforms [deitke2020robothor, savva2019habitat]. Moreover, if the simulation is sufficiently realistic, exploration algorithms developed in simulated environments can be deployed in the real world with little hyperparameter tuning [kadian2020sim2real, bigazzi2021out, truong2021bi].
Most of the recently devised exploration algorithms exploit deep reinforcement learning (DRL) [zhu2018deep], as learning-based exploration and navigation algorithms are more flexible and robust to noise than geometric methods [chen2019learning, niroui2019deep, ramakrishnan2021exploration]. Despite these advantages, one of the main challenges in training DRL-based exploration algorithms is designing appropriate rewards.
In this work, we propose a new reward function that employs the impact of the agent's actions on the environment, measured as the difference between two consecutive observations [raileanu2020ride], discounted with a pseudo-count [bellemare2016unifying] for previously-visited states (see Fig. 1). So far, impact-based rewards [raileanu2020ride] have been used only as an additional intrinsic reward in procedurally-generated (grid-like mazes) or singleton (i.e., the test environment is the same employed for training) synthetic environments. Instead, our reward can deal with photorealistic non-singleton environments. To the best of our knowledge, this is the first work to apply impact-based rewards to this setting.
Recent research on robotic exploration proposes the use of an extrinsic reward based on occupancy anticipation [ramakrishnan2020occupancy]. This reward encourages the agent to navigate towards areas that can be easily mapped without errors. Unfortunately, this approach presents two major drawbacks. First, one would need the precise layout of the training environments in order to compute the reward signal. Second, this reward heavily depends on the mapping phase, rather than focusing on what has already been seen. In fact, moving towards new places that are difficult to map would produce a very low occupancy-based reward. To overcome these issues, a different line of work focuses on the design of intrinsic reward functions, which the agent can compute by means of its current and past observations. Some examples of recently proposed intrinsic rewards for robot exploration are based on curiosity [bigazzi2020explore], novelty [ramakrishnan2021exploration], and coverage [chaplot2019learning]. All these rewards, however, tend to vanish with the length of the episode, either because the agent quickly learns to model the environment dynamics and appearance (for curiosity and novelty-based rewards) or because it tends to stay in previously-explored areas (for the coverage reward). Impact, instead, provides a stable reward signal throughout the whole episode [raileanu2020ride].
Since robotic exploration takes place in complex and realistic environments (Fig. 1) that can present an infinite number of states, it is impossible to store a visitation count for every state. Furthermore, the vector of visitation counts would be extremely sparse, causing the agent to assign the same impact score to nearly identical states. To overcome this issue, we introduce an additional module in our design to keep track of a pseudo-count for visited states. The pseudo-count is estimated by a density model trained end-to-end together with the policy. We integrate our newly-proposed reward in a modular embodied exploration and navigation system inspired by that proposed by Chaplot et al. [chaplot2019learning] and consider two commonly adopted collections of photorealistic simulated indoor environments, namely Gibson [xia2018gibson] and Matterport 3D (MP3D) [chang2017matterport3d]. Furthermore, we also deploy the devised algorithm in the real world. The results in both simulated and real environments are promising: we outperform state-of-the-art baselines in simulated experiments and demonstrate the effectiveness of our approach in real-world experiments.
II Related Work
II-A Geometric Robot Exploration Methods
Classical heuristic and geometric-based exploration methods rely on two main strategies: frontier-based exploration [yamauchi1997frontier] and next-best-view planning [gonzalez2002navigation]. The former entails iteratively navigating towards the closest point of the closest frontier, which is defined as the boundary between the explored free space and the unexplored space. The latter entails sequentially reaching cost-effective unexplored points, i.e., points from where the gain in explored area is maximal, weighed by the cost to reach them. These methods have been widely used and improved [holz2010evaluating, bircher2016receding, niroui2019deep] or combined in hierarchical exploration algorithms [zhu2018deep, selin2019efficient]. However, when applied with noisy odometry and localization sensors or in highly complex environments, geometric approaches tend to fail [chen2019learning, niroui2019deep, ramakrishnan2021exploration]. In light of this, increasing research effort has been dedicated to the development of learning-based approaches, which usually exploit DRL to learn robust and efficient exploration policies.
II-B Intrinsic Exploration Rewards
The lack of ground-truth in the exploration task forces the adoption of reinforcement learning (RL) for training exploration methods. Even when applied to tasks different from robot exploration, RL methods have low sample efficiency. Thus, they require designing intrinsic reward functions that encourage visiting novel states or learning the environment dynamics. The use of intrinsic motivation is beneficial when external task-specific rewards are sparse or absent. Among the intrinsic rewards that motivate the exploration of novel states, Bellemare et al. [bellemare2016unifying] introduced the notion of a pseudo visitation count by using a Context-Tree Switching (CTS) density model to extract a pseudo-count from raw pixels and applied count-based algorithms. Similarly, Ostrovski et al. [ostrovski2017count] applied the autoregressive deep generative model PixelCNN [oord2016conditional] to estimate the pseudo-count of the visited state. Rewards that encourage the learning of the environment dynamics include Curiosity [pathak2017curiosity], Random Network Distillation (RND) [burda2018exploration], and Disagreement [pathak2019self]. Curiosity drives the agent towards areas that maximize the prediction error for future states. RND exploits the prediction error on state encodings made with a fixed, randomly initialized network. Disagreement employs an ensemble of dynamics models and rewards the agent for visiting states where the disagreement of the ensemble is high. Recently, Raileanu et al. [raileanu2020ride] proposed to jointly encourage both the visitation of novel states and the learning of the environment dynamics. However, their approach is developed for grid-like environments with a finite number of states, where the visitation count can be easily employed as a discount factor.
In this work, we improve upon impact, a paradigm that rewards the agent proportionally to the change in the state representation caused by its actions, and design a reward function that can deal with photorealistic scenes with non-countable states.
II-C Learning-based Robot Exploration Methods
In the context of robotic exploration and navigation tasks, the introduction of photorealistic simulators has represented a breeding ground for the development of self-supervised DRL-based visual exploration methods. Ramakrishnan et al. [ramakrishnan2021exploration] identified four paradigms for visual exploration: novelty-based, curiosity-based (as defined above), reconstruction-based, and coverage-based. Each paradigm is characterized by a different reward function used as a self-supervision signal for optimizing the exploration policy. In particular, novelty discourages re-visiting the same areas, as it is defined as the inverse visitation count for each area; reconstruction favors reaching positions from which it is easier to predict unseen observations of the environment; and coverage maximizes the information gathered at each time step, be it the number of objects or landmarks reached or the area seen. A coverage-based reward, considering the area seen, is also used in the successful modular approach to Active SLAM presented by Chaplot et al. [chaplot2019learning], which combines a neural mapper module with a hierarchical navigation policy. To enhance exploration efficiency in complex environments, Ramakrishnan et al. [ramakrishnan2020occupancy] resorted to an extrinsic reward by introducing the occupancy anticipation reward, which aims to maximize the agent's accuracy in predicting occluded unseen areas.
II-D Deep Generative Models
Deep generative models are trained to approximate high-dimensional probability distributions by means of a large set of training samples. In recent years, literature on deep generative models has followed three main approaches: latent variable models like VAEs [kingma2013auto], implicit generative models like GANs [goodfellow2014generative], and exact likelihood models. Exact likelihood models can be classified into non-autoregressive flow-based models, like RealNVP [dinh2016density] and Flow++ [ho2019flow++], and autoregressive models, like PixelCNN/RNN [oord2016conditional] and Image Transformer [parmar2018image]. Non-autoregressive flow-based models consist of a sequence of invertible transformation functions that compose a complex distribution modeling the training data. Autoregressive models decompose the joint distribution of images as a product of conditional probabilities of the single pixels. Usually, each pixel is computed using as input only the previously predicted ones, following a raster scan order. In this work, we employ PixelCNN/RNN [oord2016conditional] to learn a probability distribution over possible states and estimate a pseudo visitation count.
III Proposed Method
III-A Exploration Architecture
Following current state-of-the-art navigation architectures for embodied agents, the proposed method comprises three main components: a CNN-based mapper, a pose estimator, and a hierarchical navigation policy. The navigation policy defines the actions of the agent, the mapper builds a top-down map of the environment to be used for navigation, and the pose estimator locates the position of the agent on the map. Our architecture is depicted in Fig. 2 and described below.
III-A1 Mapper

The mapper generates a map of the free and occupied regions of the environment discovered during the exploration. At each time step, the RGB observation and the depth observation are processed to output a two-channel local map depicting the area in front of the agent; each cell of the local map describes the state of a small square area of the environment, and the two channels measure the probability of a cell being occupied and being explored, respectively. The local maps are aggregated and registered to the global map of the environment using the estimated pose from the pose estimator. The resulting global map is used by the navigation policy for action planning.
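As an illustration of the registration step above, the following minimal sketch pastes a two-channel local map into the global map at the agent's estimated cell. The function name, map shapes, and the max-based merge rule are assumptions of this sketch, and rotation by the agent's heading (which a real mapper would apply) is omitted:

```python
import numpy as np

def register_local_map(global_map, local_map, agent_cell):
    """Paste a local egocentric map into the global map (sketch).

    global_map: (2, H, W) array -- occupancy and explored channels.
    local_map:  (2, h, w) array -- the area in front of the agent.
    agent_cell: (row, col) estimated agent position on the global map.
    Values are merged with a per-cell maximum, so a cell marked as
    explored/occupied stays so. Rotation by the agent's heading and
    boundary clipping are omitted for brevity.
    """
    _, h, w = local_map.shape
    r, c = agent_cell
    r0, c0 = r - h, c - w // 2  # the local map lies in front of the agent
    global_map[:, r0:r0 + h, c0:c0 + w] = np.maximum(
        global_map[:, r0:r0 + h, c0:c0 + w], local_map)
    return global_map
```

In the full model the merge is learned rather than a hard maximum; the sketch only conveys the aggregate-and-register data flow.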
III-A2 Pose Estimator
The pose estimator is used to predict the displacement of the agent as a consequence of an action. The considered atomic actions of the agent are: go forward 0.25m, turn left 10°, turn right 10°. However, the noise in the actuation system and the possible physical interactions between the agent and the environment could produce unexpected outcomes causing positioning errors. The pose estimator reduces the effect of such errors by predicting the actual displacement. The input of this module consists of the RGB-D observations and the local maps at the current and previous time steps. Each modality is encoded separately to obtain three different estimates of the displacement, one per modality.
The per-modality estimates are obtained through learned linear layers (weight matrices and biases). Eventually, the displacement estimates are aggregated with a weighted sum, whose weights are predicted by a three-layered fully-connected network (MLP) fed with the concatenation of the CNN-encoded inputs. The estimated pose of the agent at the current time step is obtained by adding the aggregated displacement estimate to the pose at the previous time step.
Note that, at the beginning of each exploration episode, the agent sets its position to the center of its environment representation.
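A minimal sketch of the aggregation and pose update described above might look as follows. The function names are hypothetical, the weights are passed in directly instead of being predicted by the MLP, and the update ignores the rotation of the displacement into the global frame:

```python
import numpy as np

def aggregate_displacement(d_rgb, d_depth, d_map, weights):
    """Weighted sum of the three per-modality displacement estimates
    (sketch). In the full model the weights come from an MLP over the
    concatenated encoded inputs; here they are given and normalized."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return (w[0] * np.asarray(d_rgb)
            + w[1] * np.asarray(d_depth)
            + w[2] * np.asarray(d_map))

def update_pose(pose, displacement):
    """Accumulate the estimated displacement (dx, dy, dtheta) into the
    pose. A real pose estimator would rotate (dx, dy) by the heading."""
    return tuple(p + d for p, d in zip(pose, displacement))
```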
III-A3 Navigation Module
The sampling of the atomic actions of the agent relies on the hierarchical navigation policy that is composed of the following modules: the global policy, the planner, and the local policy.
The global policy samples a point on an augmented global map of the environment that represents the current global goal of the agent. The augmented global map is obtained by stacking the two-channel global map from the mapper with the one-hot representation of the agent position and the map of the visited positions, which collects the one-hot representations of all the positions assumed by the agent from the beginning of the exploration. Moreover, the augmented global map is in parallel cropped with respect to the position of the agent and max-pooled to a lower spatial dimension. These two versions of the augmented global map are concatenated to form the input of the global policy, which samples a goal in the global action space. The global policy is trained with reinforcement learning using our proposed impact-based reward, defined below, which encourages exploration.
The planner consists of an A* algorithm that uses the global map and the global goal to plan a path towards the global goal and samples a local goal within a fixed distance (in meters) from the position of the agent.
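The planning step can be sketched with a standard A* implementation on a binary occupancy grid. The 4-connectivity, unit step cost, and Manhattan heuristic are simplifying assumptions of this sketch, not details taken from the paper:

```python
import heapq
import itertools

def astar(grid, start, goal):
    """A* path planning on a binary occupancy grid (sketch):
    grid[r][c] == 0 is free, 1 is occupied; 4-connected moves with a
    Manhattan-distance heuristic. Returns the path as a list of cells
    from start to goal, or None when the goal is unreachable."""
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    tie = itertools.count()  # tie-breaker so heap entries stay comparable
    open_set = [(h(start), 0, next(tie), start, None)]
    came_from, g_cost = {}, {start: 0}
    while open_set:
        _, g, _, cur, parent = heapq.heappop(open_set)
        if cur in came_from:          # already expanded with a better cost
            continue
        came_from[cur] = parent
        if cur == goal:               # reconstruct the path backwards
            path = []
            while cur is not None:
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        r, c = cur
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < g_cost.get(nxt, float('inf')):
                    g_cost[nxt] = ng
                    heapq.heappush(open_set, (ng + h(nxt), ng, next(tie), nxt, cur))
    return None
```

The local goal would then be chosen as a waypoint on the returned path within the allowed distance from the agent.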
The local policy outputs the atomic actions of the agent to reach the local goal and is trained to minimize the Euclidean distance to the local goal; its reward at each time step is the decrease of the Euclidean distance to the local goal caused by the last action. Note that the output actions in our setup are discrete. These platform-agnostic actions can be translated into signals for the actuators of specific robots, as we do in this work. Alternatively, based on the high-level predicted commands, continuous actions can be predicted, e.g., in the form of linear and angular velocity commands to the robot, by using an additional, lower-level policy, as done in [irshad2021hierarchical]. The implementation of such a policy is beyond the scope of our work.
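A minimal sketch of this distance-reduction reward, with hypothetical function and argument names:

```python
import math

def local_reward(prev_pos, cur_pos, goal):
    """Reward for the local policy (sketch): the reduction in Euclidean
    distance to the local goal after one atomic action. Positive when
    the agent moves closer, negative when it moves away."""
    dist = lambda p: math.hypot(p[0] - goal[0], p[1] - goal[1])
    return dist(prev_pos) - dist(cur_pos)
```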
Following the hierarchical structure, the global goal is reset at a fixed interval of time steps, and the local goal is reset if at least one of the following conditions holds: a new global goal is sampled, the agent reaches the local goal, or the local goal location is discovered to be in an occupied area.
III-B Impact-Driven Exploration
The exploration ability of the agent relies on the design of an appropriate reward for the global policy. In this setting, the lack of external rewards from the environment requires the design of a dense intrinsic reward. To the best of our knowledge, our proposed method presents the first implementation of impact-driven exploration in photorealistic environments. The key idea of this concept is encouraging the agent to perform actions that have an impact on the environment and the observations retrieved from it, where the impact at a given time step is measured as the norm of the difference between the encodings of two consecutive states, considering the RGB observation as the state. The reward of the global policy for the proposed method is calculated as:
where the discounting term is the visitation count of the state at the current time step, i.e., how many times the agent has observed that state. The visitation count is used to drive the agent out of regions already seen in order to avoid trajectory cycles.
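The reward computation can be sketched as follows. The encoder is abstracted away (the function takes precomputed encodings), and the L2 norm and square-root discounting are assumptions of this sketch rather than details confirmed by the text:

```python
import numpy as np

def impact_reward(enc_prev, enc_cur, visit_count):
    """Impact-based global reward (sketch): the L2 distance between the
    encodings of two consecutive observations, discounted by the square
    root of the visitation count of the reached state. Both the norm
    and the square-root discount are assumptions of this sketch."""
    impact = np.linalg.norm(np.asarray(enc_cur) - np.asarray(enc_prev))
    return impact / np.sqrt(max(visit_count, 1))
```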
III-B1 Visitation Counts
The concept of normalizing the reward using the visitation count fails when the environment is continuous, since during exploration the agent is unlikely to visit exactly the same state more than once. In fact, even microscopic changes in the translation or orientation of the agent cause shifts in the values of the RGB observation, thus resulting in new states. Therefore, using a photorealistic continuous environment nullifies the scaling property of the denominator of the global reward in Eq. 7, because almost every state during the exploration episode is encountered for the first time. To overcome this limitation, we implement two types of pseudo-visitation counts to be used in place of the raw visitation count, which extend the properties of visitation counts to continuous environments: Grid and Density Model Estimation.
Grid: With this approach, we consider a virtual discretized grid of cells with fixed size in the environment. We then assign a visitation count to each cell of the grid. To this end, we take the global map of the environment and divide it into square cells of fixed side length. The estimated pose of the agent, regardless of its orientation, is used to select the cell that the agent occupies at the current time step. In the Grid formulation, the visitation count of the selected cell is used in Eq. 7 and is formalized as:
where the discretization operator returns the cell corresponding to the estimated position of the agent.
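A minimal sketch of the Grid pseudo-count, with a hypothetical class name and assuming 2D positions expressed in map units:

```python
from collections import defaultdict

class GridCounter:
    """Grid-based pseudo-visitation count (sketch): the environment is
    discretized into square cells of side `cell_size`, and a count is
    kept per cell, ignoring the agent's orientation."""
    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.counts = defaultdict(int)

    def visit(self, x, y):
        # Map the continuous position to its grid cell and count the visit.
        cell = (int(x // self.cell_size), int(y // self.cell_size))
        self.counts[cell] += 1
        return self.counts[cell]
```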
Density Model Estimation (DME): Let ρ be an autoregressive density model defined over the states s ∈ S, where S is the set of all possible states. We call ρ_n(s) the probability assigned by ρ to the state s after being trained on a sequence of states s_1, ..., s_n, and ρ'_n(s), or recoding probability, the probability assigned by ρ to s after being trained on s_1, ..., s_n, s. The prediction gain PG of ρ describes how much the model has improved in the prediction of s after being trained on s itself, and is defined as PG_n(s) = log ρ'_n(s) − log ρ_n(s).
In this work, we employ a lightweight version of Gated PixelCNN [oord2016conditional] as the density model. To compute the input of the PixelCNN model, we transform the RGB observation to grayscale, and we crop and resize it to a lower resolution. The transformed observation is then quantized into a fixed number of bins to form the final input to the model. The model is trained to predict the conditional probabilities of the pixels in the transformed input image, with each pixel depending only on the previous ones following a raster scan order. Each element of the output represents the probability of a pixel belonging to each of the bins. The joint distribution of the input modeled by PixelCNN is the product of the conditional probabilities of its single pixels:
where each factor is the conditional probability of a single pixel of the image given the previous ones. The model is trained to fit the data distribution by using the negative log-likelihood loss.
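The input preparation described above can be sketched as follows. The naive center crop and stride-based subsampling stand in for proper image resizing, and the function name and bin mapping are assumptions of this sketch:

```python
import numpy as np

def preprocess_observation(rgb, out_size, n_bins):
    """Prepare a PixelCNN input (sketch): grayscale conversion, naive
    center crop plus stride-based subsampling to `out_size` x `out_size`,
    and quantization into `n_bins` integer bins. A real pipeline would
    use proper interpolation-based resizing."""
    gray = rgb.mean(axis=-1)                 # H x W x 3 -> H x W
    h, w = gray.shape
    s = min(h, w)
    top, left = (h - s) // 2, (w - s) // 2
    crop = gray[top:top + s, left:left + s]  # square center crop
    step = max(s // out_size, 1)
    small = crop[::step, ::step][:out_size, :out_size]
    # Quantize pixel intensities in [0, 256) into n_bins integer bins.
    return np.clip((small / 256.0 * n_bins).astype(int), 0, n_bins - 1)
```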
Let n̂ be the pseudo-count total, i.e., the sum of the pseudo-visitation counts of all states visited during the episode. The probability and the recoding probability of a state s can then be written in terms of the pseudo-count N̂_n(s) as ρ_n(s) = N̂_n(s) / n̂ and ρ'_n(s) = (N̂_n(s) + 1) / (n̂ + 1). Note that, if ρ is learning-positive, i.e., if PG_n(s) ≥ 0 for all possible sequences s_1, ..., s_n and all s, we can approximate the pseudo-count as N̂_n(s) ≈ (e^{PG_n(s)} − 1)^{-1}.
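Under the learning-positive assumption, the pseudo-count can be computed from the two probabilities as in this sketch, following the prediction-gain approximation of Bellemare et al.; returning infinity for zero gain is a convention of this sketch:

```python
import math

def pseudo_count(prob_before, prob_after):
    """Pseudo-count from a density model (sketch): with prediction gain
    PG = log p'(s) - log p(s), the pseudo-count is approximately
    (exp(PG) - 1)^-1. PG is clipped at zero so the model is treated as
    learning-positive."""
    pg = max(math.log(prob_after) - math.log(prob_before), 0.0)
    if pg == 0.0:
        return float('inf')  # no prediction gain: state looks fully known
    return 1.0 / (math.exp(pg) - 1.0)
```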
To use this approximation in Eq. 7, we still need to address three problems: it does not scale with the length of the episode, the density model might not be learning-positive, and the pseudo-count should be large enough to avoid the reward becoming too large regardless of the goal selection. To take into account the length of the episode, we introduce a normalizing factor that depends on the number of steps taken by the agent since the start of the episode. Moreover, to force ρ to be learning-positive, we clip the prediction gain to 0 when it becomes negative. Finally, to avoid small values at the denominator of the reward (Eq. 7), we introduce a lower bound of 1 on the pseudo visitation count. The resulting Density Model Estimation pseudo-count applies these three corrections to the approximation above, with an additional term used to scale the prediction gain.
IV Experimental Setup
IV-A Datasets

For comparison with state-of-the-art DRL-based methods for embodied exploration, we employ the photorealistic simulated 3D environments contained in the Gibson dataset [xia2018gibson] and the MP3D dataset [chang2017matterport3d]. Both of these datasets consist of indoor environments where different exploration episodes take place. In each episode, the robot starts exploring from a different point in the environment. Environments used during training do not appear in the validation/test splits of these datasets. Gibson contains scans of different indoor locations, for a total of several million exploration episodes (a subset of the locations is used for testing in the so-called Gibson Val split). MP3D consists of scans of large indoor environments (a portion of which is used for the validation split and another for the test split).
IV-B Evaluation Protocol
Following the well-established evaluation protocol for embodied exploration in Habitat [savva2019habitat, chaplot2019learning, ramakrishnan2020occupancy], we train our models on the Gibson train split. Then, we perform model selection based on the results obtained on Gibson Val. We then employ the MP3D validation and test splits to benchmark the generalization abilities of the agents. To evaluate exploration agents, we employ the following metrics. The IoU between the reconstructed map and the ground-truth map of the environment: here we consider two different classes for every pixel in the map (free or occupied). Similarly, the map accuracy (Acc) is the portion of the map that has been correctly mapped by the agent. The area seen (AS) is the total area of the environment observed by the agent. For both the IoU and the area seen, we also present the results relative to the two different classes, free space and occupied space respectively (FIoU, OIoU, FAS, OAS). Finally, we report the mean positioning error achieved by the agent at the end of the episode. A larger translation error (TE) or angular error (AE, in degrees) indicates that the agent struggles to keep a correct estimate of its position throughout the episode. For all the metrics, we consider episodes of length T = 500 and T = 1000 steps.
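The per-class IoU metric can be sketched as follows, assuming integer maps with 0 = free and 1 = occupied (the function name and encoding are illustrative, not taken from the paper):

```python
import numpy as np

def map_iou(pred, gt):
    """Per-class IoU between a reconstructed and a ground-truth map
    (sketch). Maps are integer arrays with 0 = free and 1 = occupied;
    the IoU of each class is intersection over union of the cells the
    two maps assign to that class."""
    ious = {}
    for cls, name in [(0, 'free'), (1, 'occupied')]:
        inter = np.logical_and(pred == cls, gt == cls).sum()
        union = np.logical_or(pred == cls, gt == cls).sum()
        ious[name] = inter / union if union else 0.0
    return ious
```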
Table I: Model selection on Gibson Val, varying the grid cell size G and the number of bins B.

| | IoU | FIoU | OIoU | Acc | AS | FAS | OAS | TE | AE |
| G = 2 | 0.726 | 0.721 | 0.730 | 51.41 | 61.88 | 34.17 | 27.71 | 0.240 | 4.450 |
| G = 4 | 0.796 | 0.792 | 0.801 | 54.34 | 61.17 | 33.74 | 27.42 | 0.079 | 1.055 |
| G = 5 | 0.806 | 0.801 | 0.813 | 55.21 | 62.17 | 34.31 | 27.87 | 0.077 | 0.881 |
| G = 10 | 0.789 | 0.784 | 0.794 | 54.26 | 61.67 | 34.06 | 27.61 | 0.111 | 1.434 |
| B = 64 | 0.773 | 0.768 | 0.778 | 53.58 | 61.00 | 33.79 | 27.21 | 0.131 | 2.501 |
| B = 128 | 0.796 | 0.794 | 0.799 | 54.73 | 62.07 | 34.27 | 27.79 | 0.095 | 1.184 |
| B = 256 | 0.685 | 0.676 | 0.695 | 49.27 | 61.40 | 33.95 | 27.45 | 0.311 | 6.817 |
[Table II: exploration results on Gibson Val for T = 500 and T = 1000]
[Table III: exploration results on MP3D Val for T = 500 and T = 1000]
[Table IV: exploration results on MP3D Test for T = 500 and T = 1000]
IV-C Implementation Details
The experiments are performed using the Habitat simulator [savva2019habitat], with the observations of the agent set to be RGB-D images. Each model is trained on the training split of the Gibson dataset [xia2018gibson], with multiple environments in parallel, for several million frames.
Navigation Module: The reinforcement learning algorithm used to train the global and local policies is PPO [schulman2017proximal] with the Adam optimizer. The global goal is reset at a fixed interval of time steps, and the local and global policies are updated at regular intervals of steps.
Mapper and Pose Estimator: These models are trained with the Adam optimizer; the global map size is set differently for episodes in the Gibson and MP3D datasets. Both models are updated once per reset interval of the global policy.
Density Model: The model used for density estimation is a lightweight version of Gated PixelCNN [oord2016conditional], consisting of a masked convolution, followed by two residual blocks with masked convolutions, a further masked convolutional layer, and a final masked convolution that returns the output logits, whose last dimension equals the number of bins used to quantize the model input. The resolution of the input and the output of the density model is fixed accordingly.
V Experimental Results
V-A Exploration Results
As a first step, we perform model selection using the results on the Gibson Val split (Table I). Our agents have different hyperparameters that depend on the implementation chosen for the pseudo-counts. When our model employs grid-based pseudo-counts, it is important to determine the dimension of a single cell in this grid-based structure. In our experiments, we test the effects of using squared cells with G ∈ {2, 4, 5, 10}. The best results are obtained with G = 5, with small differences among the various setups. When using pseudo-counts based on a density model, the most relevant hyperparameters depend on the particular model employed as density estimator. In our case, we need to determine the number of bins for PixelCNN, with B ∈ {64, 128, 256}. We find that the best results are achieved with B = 128.
In Table II, we compare the Impact (Grid) and Impact (DME) agents with agents based on the curiosity, coverage, and anticipation rewards. It is worth noting that the anticipation agent uses a strong extrinsic reward that, different from our intrinsic impact-based reward, cannot be computed without detailed information on the ground-truth map layout. As can be seen, the results achieved by Impact are consistently better than those obtained by the competitors. In particular, the two proposed impact-based agents consistently achieve first and second position in terms of IoU, FIoU, OIoU, Acc, AS, FAS, and OAS, both for T = 500 and T = 1000. Overall, the proposed impact-based intrinsic reward provides a notable boost compared to the simpler curiosity-based reward.
[Table II columns: Curiosity, Coverage, Anticipation, Impact (Grid), Impact (DME)]
Further, we test the exploration abilities of the different agents on the Matterport3D (MP3D) dataset. These experiments are intended to assess the generalization abilities of exploration agents; therefore, we keep the hyperparameters selected in the previous experiments fixed. Both on MP3D Val and MP3D Test (Tables III and IV, respectively), the proposed impact-based exploration agents surpass the competitors on all the metrics. The different implementations chosen for the pseudo-counts affect the final performance, with Impact (DME) bringing the best results in terms of AS. Moreover, the two impact-based agents rank first and second in terms of IoU, FIoU, and Acc.
[Table row: OccAnt (RGB-D) [ramakrishnan2020occupancy]: 0.930, 0.800]
In Fig. 3, we report some qualitative results displaying the trajectories and the area seen by different agents in the same episode. Also, from a qualitative point of view, the benefit given by the proposed reward in terms of exploration trajectories and explored areas is easy to identify.
V-B PointGoal Navigation
One of the main advantages of training deep modular agents for embodied exploration is that they easily adapt to downstream tasks, such as PointGoal navigation [savva2019habitat]. Recent literature [chaplot2019learning, ramakrishnan2020occupancy] has shown that hierarchical agents trained for exploration are competitive with state-of-the-art architectures tailored for PointGoal navigation and trained with strong supervision for billions of frames [wijmans2019dd]. Additionally, the training time and data required to learn the policy are much more limited (2 to 3 orders of magnitude smaller). In Table V, we report the results obtained on the standard benchmark for PointGoal navigation in Habitat [savva2019habitat].
We consider five main metrics: the average distance to the goal achieved by the agent, the mean number of atomic steps per episode, and three success-related metrics. The success rate (SR) is the fraction of episodes terminated within a threshold distance from the goal, while the SPL and SoftSPL weigh the distance from the goal by the length of the path taken by the agent, in order to penalize inefficient navigation. As can be seen from Table V, the two proposed agents achieve first and second place when compared with the main competitors, OccAnt [ramakrishnan2020occupancy] and Active Neural SLAM (ANS) [chaplot2019learning], thus confirming the appropriateness of impact-based rewards also for PointGoal navigation. For both competitors, we report the results achieved by the RGB-D agents, as reported in the respective papers. For completeness, we also compare with the results achieved by DD-PPO [wijmans2019dd], a method trained with reinforcement learning for the PointGoal task on billions of frames, orders of magnitude more than the frames used to train our agents.
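For reference, SPL can be computed as in this sketch of the standard definition by Anderson et al.; the function signature is illustrative:

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length (sketch): the mean over episodes
    of S_i * l_i / max(p_i, l_i), where S_i is the binary success of
    episode i, l_i the shortest-path length to the goal, and p_i the
    length of the path actually taken by the agent."""
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += s * l / max(p, l)
    return total / len(successes)
```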
V-C Real-world Deployment
As agents trained in realistic indoor environments using the Habitat simulator are adaptable to real-world deployment [kadian2020sim2real, bigazzi2021out], we also deploy the proposed approach on a LoCoBot robot [locobot]. We employ the PyRobot interface [murali2019pyrobot] to deploy code and trained models on the robot. To enable the adaptation to the real-world environment, some aspects must be taken into account during training. As a first step, we adjust the simulation in order to reproduce realistic actuation and sensor noise. To that end, we adopt the noise model proposed in [chaplot2019learning], based on Gaussian Mixture Models fitted on real-world noise data acquired from a LoCoBot. Additionally, we modify the parameters of the RGB-D sensor used in simulation to match those of the RealSense camera mounted on the robot. Specifically, we change the camera resolution and field of view, the range of depth information, and the camera height. Finally, it is imperative to prevent the agent from learning simulation-specific shortcuts and tricks. For instance, the agent may learn to slide along the walls due to imperfect dynamics in simulation [kadian2020sim2real]. To prevent the learning of such dynamics, we employ the bump sensor provided by Habitat and block the agent whenever it is in contact with an obstacle. When deployed in the real world, our agent is able to explore the environment without getting stuck or bumping into obstacles.
VI Conclusion

In this work, we present an impact-driven approach for robotic exploration in indoor environments. Different from previous research that considered a setting with procedurally-generated environments with a finite number of possible states, we tackle a problem where the number of possible states is uncountable. To deal with this scenario, we exploit a deep neural density model to compute a running pseudo-count of past states and use it to regularize the impact-based reward signal. The proposed agent stands out from the recent literature on embodied exploration in photorealistic environments and can also solve downstream tasks such as PointGoal navigation. Additionally, since our model is trained on the Habitat simulator, it is easy to deploy the trained models in the real world.
Acknowledgments

This work has been supported by “Fondazione di Modena” and by the “European Training Network on PErsonalized Robotics as SErvice Oriented applications” (PERSEO) MSCA-ITN-2020 project (Grant Agreement 955778).