Only a few decades ago, intelligent robots that could autonomously walk and talk existed only in the bright minds of book and movie authors. People used to think of artificial intelligence as a purely fictional feature, as the machines they interacted with were merely reactive and showed no form of autonomy. Nowadays, intelligent systems are everywhere, with deep learning being the main engine of the so-called AI revolution. More recently, advances in the field of embodied AI aim to foster the next generation of autonomous and intelligent robots. Progress in this field includes visual navigation and instruction following [anderson2018vision], even though current research is also focused on the creation of new research platforms for simulation and evaluation [Savva_2019_ICCV, xia2018gibson]. In parallel, research at the intersection of computer vision and natural language processing has made substantial progress on image and video captioning [karpathy2015deep, anderson2018bottom, cornia2020m2]. By describing the content of an image or a video, captioning models can bridge the gap between the black-box architecture and the user.
In this paper, we propose a new task at the intersection of embodied AI, computer vision, and natural language processing, and aim to create a robot that can navigate through a new environment and describe what it sees. We call this new task Explore and Explain, since it tackles the problem of joint exploration and captioning (Fig. 1). In this schema, the agent needs to perceive the environment around itself, navigate it driven by an exploratory goal, and describe salient objects and scenes in natural language. Beyond navigating the environment and translating visual cues into natural language, the agent also needs to identify appropriate moments to perform the explanation step.
It is worthwhile to mention that both exploration and explanation feature significant challenges. Effective exploration without any previous knowledge of the environment cannot exploit a reference trajectory, and the agent cannot be trained with classic methods from reinforcement learning [wijmans2019dd].
To overcome this problem, we design a self-supervised exploration module that is driven solely by curiosity towards the new environment. In this setting, rewards are sparser than in traditional setups and encourage the agent to explore new places and to interact with the environment.
While we are motivated by recent works incorporating curiosity in Atari and other exploration games [agrawal2016learning, pathak2017curiosity, burda2018large], the effectiveness of a curiosity-based approach in a photorealistic, indoor environment has not been tested extensively. Some preliminary studies [ramakrishnan2020exploration] suggest that curiosity struggles with embodied exploration. In this work, we show that a simple modification of the reward function can lead to striking improvements in the exploration of unseen environments.
Additionally, we encourage the agent to produce a description of what it sees throughout the navigation. In this way, we match the agent's internal state (the measure of curiosity) with the variety and the relevance of the generated captions. Such matching offers a proxy for the desirable by-product of interpretability. In fact, by looking at the captions produced, the user can more easily interpret the navigation and perception capabilities of the agent, and the motivations behind the actions it takes [cornia2019smart]. In this sense, our work is related to goal-driven explainable AI, i.e. the ability of autonomous agents to explain their actions and the reasons leading to their decisions [anjomshoae2019explainable].
Previous work on image captioning has mainly focused on recurrent neural networks. However, the rise of the Transformer [vaswani2017attention] and the great effectiveness shown by the use of self-attention have motivated a shift towards recurrent-free architectures. Our captioning algorithm builds upon recent findings on the importance of fully-attentive networks for image captioning and incorporates self-attention both during the encoding of the image features and in the decoding phase. This also allows for a reduction in computational requirements.
Finally, to bridge exploration and recounting, our model relies on a novel speaker policy, which regulates the speaking rate of our captioner using information coming from the agent's perception. We call our architecture , from the name of the task: Explore and Explain.
Our main contributions are as follows:
We propose a new setting for embodied AI, Explore and Explain, in which the agent must jointly deal with two challenging tasks: exploration and captioning of unseen environments.
We devise a novel solution involving curiosity for exploration. Thanks to curiosity, we can learn an efficient policy which can easily generalize to unseen environments.
We are the first, to the best of our knowledge, to apply a captioning algorithm exclusively to indoor environments for robotic exploration. Results are encouraging and motivate further research.
II Related Work
Our work is related to the literature on embodied visual exploration, curiosity-driven exploration, and captioning. In the following, we provide an overview of the most important work in these settings, and we briefly describe the most commonly used interactive environments for navigation agents.
Embodied visual exploration. Current research on embodied AI is mainly focused on tasks that require navigating indoor locations. Vision-and-language navigation [anderson2018vision], point-goal and object-goal navigation [wijmans2019dd, anderson2018evaluation, zhu2017target] are all tasks involving the ability of the agent to move across a previously unknown environment. Very recently, Ramakrishnan et al. [ramakrishnan2020exploration] highlighted the importance of visual exploration in order to pre-train a generic embodied agent. While their study is mainly focused on exploration as a means to gather information and to prepare for future tasks, we investigate the role of surprisal for exploration and the consistency between navigation paths and the descriptions given by the agent during the episodes.
Curiosity-driven exploration. Curiosity-driven exploration is an important topic in the reinforcement learning literature. In this context, [oudeyer2009intrinsic] provides a good summary of early works on intrinsic motivation. Among them, Schmidhuber [schmidhuber2010formal] and Sun et al. [sun2011planning] proposed to use information gain and compression as intrinsic rewards, while Klyubin et al. [klyubin2005empowerment], and Mohamed and Rezende [mohamed2015variational] adopted the concept of empowerment as reward during training. Differently, Houthooft et al. [houthooft2016vime] presented an exploration strategy based on the maximization of information gain about the agent's belief of environment dynamics. Another common approach for exploration is that of using state visitation counts as intrinsic rewards [bellemare2016unifying, tang2017exploration]. Our work follows the strategy of jointly training forward and backward models for learning a feature space, which has proven to be effective for curiosity-driven exploration in Atari and other exploration games [agrawal2016learning, pathak2017curiosity, burda2018large]. To the best of our knowledge, we are the first to investigate this type of exploration algorithm in photorealistic indoor environments.
Interactive environments. When it comes to the training of intelligent agents, an important role is played by the underlying environment. A first test bed for research in reinforcement learning has been provided by the Atari games [bellemare2013arcade, brockman2016openai]. However, these kinds of settings are not suitable for navigation and exploration in general. To solve this problem, many maze-like environments have been proposed [kempka2016vizdoom, beattie2016deepmind]. However, agents trained on synthetic environments hardly adapt to real-world scenarios, because of the drastic change in terms of appearance. Simulation platforms like Habitat [Savva_2019_ICCV], Gibson [xia2018gibson], and the Matterport3D simulator [anderson2018vision] provide a photorealistic environment to train navigation agents. Some of these simulators only provide RGB equirectangular images as visual input [anderson2018vision], while others employ the full 3D model and implement physical interactions with the environment [Savva_2019_ICCV, xia2018gibson].
Automatic captioning. In the last few years, a large number of models have been proposed for image captioning [anderson2018bottom, xu2015show, rennie2017self]. The majority of them use recurrent neural networks as language models and a representation of the image which might be given by the output of a CNN [rennie2017self, vinyals2017show], or by a time-varying vector extracted with attention mechanisms over either a spatial grid of CNN features [xu2015show] or multiple image region vectors extracted from a pre-trained object detector [anderson2018bottom]. Regarding training strategies, notable advances have been made by using reinforcement learning to optimize non-differentiable captioning metrics [rennie2017self]. Recently, following the strong advent of fully-attentive mechanisms in sequence modeling tasks [vaswani2017attention], different Transformer-based captioning models have been presented [cornia2020m2, herdade2019image]. In this work, we devise a captioning model based on the Transformer architecture that, for the first time, is applied to images taken from indoor environments for robotic exploration.
III Proposed Method
The proposed method consists of three main parts: a navigation module, a speaker policy, and a captioner. The last two components constitute the speaker module, which is used to explain the agent's first-person point of view. The explanation is elicited by our speaker module based on the information gathered during the navigation. Our architecture is depicted in Fig. 2 and detailed below.
III-A Navigation module
The navigation policy takes care of the agent's displacement inside the environment. At each time step t, the agent acquires an observation o_t from the surroundings, performs an action a_t, and gets the consequent observation o_{t+1}. The moves available to the agent are simple, atomic actions such as rotate 15 degrees and step ahead. Our navigation module consists of three main components: a feature embedding network, a forward dynamics model, and an inverse dynamics model. The discrepancy between the predictions of the dynamics models and the actual observation is measured by a reward signal r_t, which is then used to stimulate the agent to move towards more informative states.
Embedding network. At each time step t, the agent observes the environment and gathers o_t. This observation corresponds to the raw RGB-D pixels coming from the forward-facing camera of the agent. Yet, raw pixels are not optimal to encode the visual information [burda2018large]. For this reason, we employ a convolutional neural network to encode a more efficient and compact representation of the surrounding environment. We call this embedded representation φ(o_t). To ensure that the features observed by the agent are stable throughout the training, the parameters of this network are kept fixed. This approach has been shown to be efficient for generic curiosity-based agents [burda2018large].
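As a minimal illustration of this design choice, the sketch below uses a fixed random linear projection in place of the random CNN; the class name and dimensions are our own and purely illustrative. The point is that the weights are sampled once and never updated, so the same observation always maps to the same embedding.

```python
import numpy as np

class FixedRandomEncoder:
    """Sketch of a fixed, randomly initialized embedding network.

    A random linear projection stands in for the random CNN of
    [burda2018large]; the weights are drawn once and never trained,
    so the feature space stays stable throughout training.
    """

    def __init__(self, obs_dim: int, feat_dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Fixed weights: drawn once, never updated by any optimizer.
        self.W = rng.standard_normal((obs_dim, feat_dim)) / np.sqrt(obs_dim)

    def __call__(self, obs: np.ndarray) -> np.ndarray:
        # obs: flattened grayscale observation, shape (obs_dim,)
        return np.maximum(obs @ self.W, 0.0)  # ReLU non-linearity
```

Because the encoder is deterministic and frozen, embeddings of the same view never drift during training, which keeps the forward-model targets stationary.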
Forward dynamics model. Given an agent with policy π(φ(o_t); θ_π), represented by a neural network with parameters θ_π, the selected action at timestep t is given by:

a_t ∼ π(φ(o_t); θ_π)     (1)
After executing the chosen action, the agent observes a new visual stimulus φ(o_{t+1}). The problem of predicting the next observation given the current input and the action to be performed can be defined as a forward dynamics problem:

φ̂(o_{t+1}) = f(φ(o_t), a_t; θ_F)     (2)

where φ̂(o_{t+1}) is the predicted visual embedding for the next observation and f is the forward dynamics model with parameters θ_F. The forward model is trained to minimize the following loss function:

L_F(φ(o_t), φ(o_{t+1})) = ½ ‖φ̂(o_{t+1}) − φ(o_{t+1})‖²₂     (3)
Inverse dynamics model. Given two consecutive observations (o_t, o_{t+1}), the inverse dynamics model aims to predict the action performed at timestep t:

â_t = g(φ(o_t), φ(o_{t+1}); θ_I)     (4)

where â_t is the predicted estimate for the action a_t and g is the inverse dynamics model with parameters θ_I. In our work, the inverse model predicts a probability distribution over the possible actions and it is optimized to minimize the cross-entropy loss with the ground-truth action a_t performed in the previous time step:

L_I(â_t, a_t) = − y_t · log â_t     (5)

where y_t is the one-hot representation for a_t.
Curiosity-driven exploration. The agent exploration policy π is trained to maximize the expected sum of rewards:

max_{θ_π} E_π [ Σ_t r_t ]     (6)

where the exploration reward r_t at timestep t, also called surprisal [achiam2017surprise], is given by our forward dynamics model:

r_t = ½ ‖φ̂(o_{t+1}) − φ(o_{t+1})‖²₂     (7)

The overall optimization problem can thus be written as:

min_{θ_π, θ_F, θ_I} [ − λ E_π[Σ_t r_t] + β L_F + (1 − β) L_I ]     (8)

where λ weights the importance of the intrinsic reward signal w.r.t. the policy loss, and β balances the contributions of the forward and inverse models.
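The surprisal reward of Eq. 7 reduces to a squared distance between predicted and observed embeddings. A minimal sketch, assuming the embeddings are plain vectors (the function name is illustrative):

```python
import numpy as np

def surprisal_reward(phi_next_pred: np.ndarray, phi_next: np.ndarray) -> float:
    """Intrinsic reward r_t = 1/2 * ||phi_hat(o_{t+1}) - phi(o_{t+1})||^2.

    The worse the forward model predicts the next embedding, the more
    "surprising" (and hence rewarding) the transition is for the agent.
    """
    diff = phi_next_pred - phi_next
    return 0.5 * float(diff @ diff)
```

Transitions the forward model already predicts well yield near-zero reward, steering the agent towards states it has not yet learned to anticipate.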
Penalty for repeated actions. To encourage diversity in our policy, we devise a penalty which triggers after the agent has performed the same move for a given number of consecutive timesteps. This prevents the agent from always picking the same action and encourages the exploration of different combinations of atomic actions.
We can thus rewrite the surprisal in Eq. 7 as:

r_t = ½ ‖φ̂(o_{t+1}) − φ(o_{t+1})‖²₂ + p_t     (9)

where p_t is the penalty at time step t. In the simplest formulation, p_t can be modeled with a scalar which is either 0 or equal to a constant, after an action has been repeated a fixed number of times.
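A minimal implementation of such a penalty trigger might look like the following; the threshold and penalty values are illustrative placeholders, not the ones used in our experiments.

```python
class RepeatPenalty:
    """Returns a constant negative penalty once the same action has been
    chosen `max_repeats` times in a row, and 0 otherwise."""

    def __init__(self, max_repeats: int = 5, penalty: float = -0.1):
        self.max_repeats = max_repeats
        self.penalty = penalty
        self.last_action = None
        self.count = 0

    def __call__(self, action) -> float:
        # Track the length of the current run of identical actions.
        if action == self.last_action:
            self.count += 1
        else:
            self.last_action, self.count = action, 1
        return self.penalty if self.count >= self.max_repeats else 0.0
```

The returned value plays the role of p_t in Eq. 9: zero in the common case, negative once a run of identical actions exceeds the threshold.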
III-B Speaker policy
As the navigation proceeds, new observations are acquired and rewards are obtained at each time step. Based on these, a speaker policy can be defined, that activates the captioning module. Different types of information from the environment and the navigation module allow defining different policies. In this work, we consider three policies, namely: object-driven, depth-driven, and curiosity-driven.
Object-driven policy. Given the RGB component of the observation o_t, relevant objects can be recognized. When at least a minimum number of such objects are observed, the speaker policy triggers the captioner. The idea behind this policy is to let the captioner describe the scene only when objects that allow connoting the different views are present.
Depth-driven policy. Given the depth component of the observation o_t, the speaker policy activates the captioner when the mean perceived depth value is above a certain threshold. This way, the captioner is triggered depending only on the distance of the agent from generic objects, regardless of their semantic category.
Curiosity-driven policy. Given the surprisal reward defined as in Eq. 7, possibly accumulated over multiple timesteps, the speaker policy triggers the captioner when this value is above a certain threshold. This policy is independent of the type of information perceived from the environment but is instead closely related to the navigation module. Thus, it helps to match the agent's internal state with the generated captions more explicitly than the other policies.
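All three policies reduce to simple threshold tests on different signals. A schematic sketch, where all function names and threshold values are illustrative defaults rather than the ones used in our experiments:

```python
def object_driven_trigger(num_objects: int, min_objects: int = 2) -> bool:
    """Speak when at least `min_objects` relevant objects are visible."""
    return num_objects >= min_objects

def depth_driven_trigger(mean_depth: float, min_depth: float = 2.0) -> bool:
    """Speak when the mean perceived depth exceeds a threshold."""
    return mean_depth > min_depth

def curiosity_driven_trigger(surprisal_window, min_surprisal: float = 1.0) -> bool:
    """Speak when the surprisal accumulated over recent steps exceeds a threshold."""
    return sum(surprisal_window) > min_surprisal
```

Raising a threshold makes the corresponding policy stricter, directly lowering the agent's loquacity (defined in Sec. IV-B).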
III-C Captioning module
When the speaker policy activates, a captioning module is in charge of producing a description in natural language given the current observation. Following recent literature on the topic, we here employ a visual encoder based on image regions [ren2017faster], and a decoder which models the probability of generating one word given previously generated ones. In contrast to previous captioning approaches based on recurrent networks, we propose a fully-attentive model for both the encoding and the decoding stage, building on the Transformer model [vaswani2017attention].
Region encoder. Given a set X of features from image regions extracted from the agent's visual view, our encoder applies a stack of self-attentive and linear projection operations. As the former can be seen as convolutions on a graph, the role of the encoder can also be interpreted as that of learning visual relationships between image regions. The self-attention operator builds upon three linear projections of the input set, which are treated as queries, keys and values for an attention distribution. Stacking region features in matrix form, the operator can be defined as follows:

Attention(X) = softmax( Q Kᵀ / √d ) V,  with Q = X W_Q, K = X W_K, V = X W_V

where d is the dimensionality of the projections.
The output of the self-attention operator is a new set of elements with the same cardinality as the input set, in which each element is replaced with a weighted sum of the values, i.e. of linear projections of the input.
Following the structure of the Transformer model, the self-attention operator is followed by a position-wise feed-forward layer, and each of these two operators is encapsulated within a residual connection and a layer norm operation. Multiple layers of this kind are then applied in a stacked fashion to obtain the final encoder.
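A single-head version of the scaled dot-product operator above can be sketched in a few lines of numpy (the actual model uses multi-head attention, which splits the projections into parallel heads):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X: np.ndarray, Wq: np.ndarray, Wk: np.ndarray,
                   Wv: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product self-attention over a set of region
    features X (one row per region). Queries, keys and values are linear
    projections of the same input set."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # attention distribution
    return A @ V  # each output row is a weighted sum of the value rows
```

Note that the operator is permutation-equivariant over the region set, which is why no positional encoding is needed on the encoder side.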
Language decoder. The output of the encoder module is a set of region encodings with the same cardinality as the input set. We employ a fully-attentive decoder which is conditioned on both previously generated words and region encodings, and is in charge of generating the next tokens of the output caption. The structure of our decoder follows that of the Transformer [vaswani2017attention], and thus relies on self-attentive and cross-attentive operations.
Given a partially decoded sequence of words Y = (w_1, …, w_t), each represented as a one-hot vector, the decoder applies a self-attention operation in which Y is used to build queries, keys and values. To ensure the causality of this sequence encoding process, we purposely mask the attention operator so that each word can only be conditioned on its left-hand sub-sequence, i.e. word w_i is conditioned on (w_1, …, w_{i−1}) only. Afterwards, a cross-attention operator is applied between Y and the region encodings R to condition words on regions, as follows:

CrossAttention(Y, R) = softmax( Q Kᵀ / √d ) V,  with Q = Y W_Q, K = R W_K, V = R W_V
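The causal masking step can be illustrated as follows (single-head, numpy): the mask sets scores for future positions to −inf before the softmax, so their attention weights become exactly zero.

```python
import numpy as np

def causal_self_attention(X: np.ndarray, Wq: np.ndarray, Wk: np.ndarray,
                          Wv: np.ndarray):
    """Masked self-attention over a word sequence X (one row per word):
    position i may only attend to positions <= i, enforcing causality."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Strictly upper-triangular mask: future positions get -inf.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -np.inf
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = e / e.sum(axis=-1, keepdims=True)
    return A @ V, A
```

Because exp(−inf) = 0, the attention matrix A is lower-triangular: the first word attends only to itself, and no position ever sees its right-hand context.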
As in the Transformer model, after a self-attention and a cross-attention stage, a position-wise feed-forward layer is applied, and each of these operators is encapsulated within a residual connection and a layer norm operation. Finally, our decoder stacks together multiple decoder layers, helping to refine the understanding of the textual input.
Overall, the decoder takes as input word vectors, and the i-th element of its output sequence encodes the prediction of a word at time i, conditioned on the words preceding it. After a linear projection and a softmax operation, this encodes a probability over the words in the dictionary. During training, the model is trained to predict the next token given previous ground-truth words; during decoding, we iteratively sample a predicted word from the output distribution and feed it back to the model to decode the next one, until the end of the sequence is reached. Following the usual practice in image captioning literature, the model is trained to predict an end-of-sequence token to signal the end of the caption.
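The decoding loop can be sketched as follows; for simplicity we show greedy argmax decoding rather than the beam search used in our experiments, and `step_fn` is an assumed stand-in for the full decoder.

```python
def greedy_decode(step_fn, bos: str, eos: str, max_len: int = 20):
    """Autoregressive decoding sketch: feed the partial sequence to the
    model, take the most likely next word, append it, and repeat until the
    end-of-sequence token is produced. `step_fn(seq)` is assumed to return
    a {word: probability} distribution over the vocabulary."""
    seq = [bos]
    for _ in range(max_len):
        probs = step_fn(seq)
        word = max(probs, key=probs.get)  # argmax over the distribution
        seq.append(word)
        if word == eos:
            break
    return seq
```

Beam search generalizes this loop by keeping the top-k partial sequences at each step instead of a single one.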
IV Experimental Setup
The main testbed for this work is Matterport3D [Matterport3D], a photorealistic dataset of indoor environments. Some of the buildings in the dataset contain outdoor components like swimming pools or gardens, raising the difficulty of the exploration task. The dataset is split into scenes for training, for validation, and for testing. It also provides instance segmentation annotations that we use to evaluate the captioning module. Overall, the dataset is annotated with different semantic categories. For both training and testing, we use the episodes provided by Habitat API [Savva_2019_ICCV] for the point goal navigation task, employing only the starting point of each episode. The size of the training set amounts to a total of M episodes, while the test set is composed of episodes.
IV-B Evaluation protocol
Navigation module. To quantitatively evaluate the navigation module, we use a curiosity-based metric: we extract the sum of the surprisal values defined in Eq. 7 every steps performed by the agent, and then we compute the average over the number of test episodes.
Captioning module. Standard captioning methods are usually evaluated by comparing each generated caption against the corresponding ground-truth sentences. In our setting, however, only the information on which objects are present in the scene is available, thanks to the semantic annotations provided by the Matterport3D dataset. Therefore, to evaluate the performance of our captioning module, we define two different metrics: a soft coverage measure that assesses how well the predicted caption covers the ground-truth objects, and a diversity score that measures the diversity, in terms of described objects, of two consecutively generated captions.
In detail, for each caption generated according to the speaker policy, we compute a soft coverage measure between the ground-truth set of semantic categories and the set of nouns in the caption. Given a predicted caption, we first extract all nouns N from the sentence and compute the optimal assignment between them and the set of ground-truth categories G, using distances between word vectors and the Hungarian algorithm [kuhn1955hungarian]. We then define an intersection score I(N, G) between the two sets as the sum of assignment profits. Our coverage measure is computed as the ratio of the intersection score to the number of ground-truth semantic classes:

Cov(N, G) = I(N, G) / |G|

where I(N, G) is the intersection score, and the operator |·| represents the cardinality of the set of ground-truth categories.
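The coverage measure can be sketched as follows; for brevity we replace the Hungarian algorithm with brute-force enumeration of one-to-one assignments, which yields the same optimum for the small sets involved. `sim` is an assumed word-similarity function in [0, 1], e.g. cosine similarity between GloVe vectors.

```python
from itertools import permutations

def coverage(nouns, categories, sim):
    """Soft coverage: optimal one-to-one matching between predicted nouns
    and ground-truth categories, divided by the number of categories."""
    if not categories:
        return 0.0
    # Enumerate injective assignments from the smaller set into the larger.
    if len(nouns) >= len(categories):
        cands = (zip(p, categories) for p in permutations(nouns, len(categories)))
    else:
        cands = (zip(nouns, p) for p in permutations(categories, len(nouns)))
    intersection = max(sum(sim(n, c) for n, c in cand) for cand in cands)
    return intersection / len(categories)
```

With an exact-match similarity, a caption mentioning one of two ground-truth categories scores 0.5; a soft similarity additionally credits near-synonyms.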
Since images may contain small objects which should not necessarily be mentioned in a caption describing the overall scene, we define a variant of the coverage measure by thresholding over the minimum object area. In this case, we consider as ground-truth set the objects whose overall area is greater than the threshold.
For the diversity measure, we consider the sets of nouns extracted from two consecutively generated captions, indicated as N_i and N_j, and define a soft intersection-over-union score between the two sets of nouns. Also in this case, we compute the intersection score I(N_i, N_j) between the two sets using word distances and the Hungarian algorithm to find the optimal assignment. Recalling that the size of a set union can be expressed in terms of the intersection, the final diversity score is computed by subtracting the intersection-over-union score from 1 (i.e. the Jaccard distance between the two sets):

Div(N_i, N_j) = 1 − I(N_i, N_j) / (|N_i| + |N_j| − I(N_i, N_j))

where I(N_i, N_j) is the intersection score previously defined, and the operator |·| represents the cardinality of the sets of nouns.
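A corresponding sketch of the diversity score, again with brute-force matching in place of the Hungarian algorithm and an assumed word-similarity function `sim`:

```python
from itertools import permutations

def soft_intersection(a, b, sim):
    """Optimal one-to-one soft matching between two small noun sets."""
    if not a or not b:
        return 0.0
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    return max(sum(sim(x, y) for x, y in zip(small, p))
               for p in permutations(large, len(small)))

def diversity(nouns_a, nouns_b, sim):
    """Jaccard-style distance between the noun sets of two consecutive
    captions: 1 - intersection / union, with |A ∪ B| = |A| + |B| - I."""
    inter = soft_intersection(nouns_a, nouns_b, sim)
    union = len(nouns_a) + len(nouns_b) - inter
    return 1.0 - inter / union if union > 0 else 0.0
```

Two captions describing the same objects score 0, while captions with disjoint noun sets score 1, so higher values indicate less repetitive descriptions.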
We evaluate the diversity of generated captions with respect to the three speaker policies described in Sec. III-B, considering different thresholds for each policy (i.e. number of objects, mean depth value, and surprisal score). For each speaker policy and selected threshold, the agent is triggered a different number of times, thus generating a variable number of captions during the episode. We define the agent's overall loquacity as the number of times it is activated by the speaker policy according to a given threshold. In the experiments, we report the loquacity values averaged over the test episodes.
IV-C Implementation and training details
Navigation module. Navigation agents are trained using only visual inputs, with each observation converted to grayscale, cropped and re-scaled to a size. A stack of four historical observations is used for training in order to model temporal dependencies. We adopt PPO [schulman2017proximal] as learning algorithm and employ Adam [kingma2015adam] as optimizer. The learning rate for all networks is set to and the length of rollouts is equal to . For each rollout we make 3 optimization epochs. The features used by the forward and backward dynamics networks are -dimensional and are obtained using a randomly initialized convolutional network with fixed weights, following the approach in [burda2018large].
The model is trained using the splits described in Sec. IV-A, stopping the training after updates of the agent. The length of an exploration episode is steps. In our experiments, we set the parameters reported in Eq. 8 to and , respectively. Concerning the penalty given to the agent to stimulate diversity (Eq. 9), we set after the same action is repeated for times.
Speaker policy. For the object-driven policy, we use the instance segmentation annotations provided by the Matterport3D simulator. For this policy, we select of the semantic categories in the dataset, discarding the contextual ones, which would not be discriminative for the different views acquired by the agent, such as wall, floor, and ceiling. This way, we can better evaluate the effect of the policy without it being affected by the performance of an underlying object detector at recognizing objects in the agent's current view. Also for the depth-driven policy, we obtain the depth information of the current view from the Matterport3D simulator, averaging the depth values to extract a single score. In the curiosity-driven policy, we consider the sum of the surprisal scores obtained by the agent over the last 20 steps of navigation.
Captioning module. To represent image regions, we use Faster R-CNN [ren2017faster] finetuned on the Visual Genome dataset [krishnavisualgenome, anderson2018bottom], thus obtaining a -dimensional feature vector for each region. To represent words, we use one-hot vectors and linearly project them to the input dimensionality of the model, . We also employ sinusoidal positional encodings [vaswani2017attention] to represent word positions inside the sequence, and sum the two embeddings before the first encoding layer. In both region encoder and language decoder, we set the dimensionality of each layer to , the number of heads to , and the dimensionality of the inner feed-forward layer to . We use dropout with keep probability after each attention layer and after position-wise feed-forward layers.
Following a standard practice in image captioning [rennie2017self, anderson2018bottom], we train our model in two phases using image-caption pairs coming from the COCO dataset [lin2014microsoft]. Firstly, the model is trained with cross-entropy loss to predict the next token given previous ground-truth words. Then, we further optimize the sequence generation using reinforcement learning employing a variant of the self-critical sequence training [rennie2017self] on sequences sampled using beam search [anderson2018bottom]. Pre-training with cross-entropy loss is done using the learning rate scheduling strategy defined in [vaswani2017attention] with a warmup equal to iterations. Then, during finetuning with reinforcement learning, we use the CIDEr-D score [vedantam2015cider] as reward and a fixed learning rate equal to . We train the model using the Adam optimizer [kingma2015adam] and a batch size of . During CIDEr-D optimization and caption decoding, we use beam search with a beam size equal to . To compute coverage and diversity metrics and for extracting nouns from predicted captions, we use the spaCy NLP toolkit111https://spacy.io/. We use GloVe word embeddings [pennington2014glove] to compute word similarities between nouns and semantic class names.
TABLE I (fragment): Average surprisal over test episodes.
w/o penalty for repeated actions (RGB only): 0.193
w/o penalty for repeated actions (depth only): 0.361
w/o penalty for repeated actions (RGB + depth): 0.439
V Experimental Results

V-A Navigation results
As defined in Sec. IV-B, we evaluate the performance of our navigation agents by computing the average surprisal score over test episodes. Results are reported in Table I and show that our complete method () outperforms all other variants, achieving a significantly greater surprisal score than our method without penalty. In particular, the final performance greatly benefits from using both visual modalities (RGB and depth) instead of a single one to represent the scene. Notably, random exploration (i.e. sampling from a uniform distribution over the available actions at each time step) proves to be a strong baseline for this task, performing better than our single-modality RGB agent. Nonetheless, our final agent greatly outperforms the baselines, scoring and above the random policy and the vanilla curiosity-based agent respectively.
TABLE II (fragment): Coverage and diversity of the generated captions for the three speaker policies; recovered rows refer to the 6-layer model as in [vaswani2017attention], with five scores for each of three threshold settings.
Object-driven policy: 0.456, 0.550, 0.609, 0.706, 0.386 | 0.387, 0.502, 0.576, 0.696, 0.363 | 0.348, 0.468, 0.549, 0.691, 0.352
Depth-driven policy: 0.433, 0.532, 0.600, 0.705, 0.360 | 0.420, 0.519, 0.585, 0.701, 0.346 | 0.399, 0.497, 0.566, 0.691, 0.339
Curiosity-driven policy: 0.425, 0.523, 0.588, 0.703, 0.356 | 0.421, 0.515, 0.581, 0.699, 0.360 | 0.422, 0.518, 0.583, 0.702, 0.364
Qualitative Analysis. In Fig. 3, we report some top-down views from the testing scenes, together with the trajectories of three different navigation agents: the random baseline, our approach without the penalty for repeated actions described in Sec. III-A, and our full model. We notice that the agent without penalty usually remains in the starting area and thus has some difficulties in exploring the whole environment. Instead, our complete model demonstrates better results, as it is able to explore a much wider area of the environment. Thus, we conclude that the addition of a penalty for repeated actions in the final reward function is of central importance when it comes to stimulating the agent towards the exploration of regions far from the starting point.
V-B Speaker Results
Here, we provide quantitative and qualitative results for our speaker module, which is composed of a policy and a captioner. The policy is in charge of deciding when to activate the captioner, which in turn generates a description of the first-person view of the agent. Results are reported in Table II and discussed below.
Speaker Policy. Among the three different policies, the object-driven speaker performs the best in terms of coverage and diversity. In particular, setting a low threshold () provides the highest scores. At the same time, the agent tends to speak more often, which is desirable in a visually rich environment. As the threshold gets higher, performance gets worse. This indicates that, as the number of objects in the scene increases, there are many details that the captioner cannot describe. The same applies to the depth-driven policy: while the agent tends to describe well items that are closer, it experiences some trouble when facing an open space with more distant objects ().
Instead, our curiosity-driven speaker shows a more peculiar behaviour: as the threshold grows, results get better in terms of diversity, while the coverage scores are quite stable (only in terms of ). It is also worth mentioning that our curiosity-based speaker can be adopted in any kind of environment, as the driving metric is computed from the raw RGB-D input. The same does not apply to an object-driven policy, since the agent needs semantic information. Further, the curiosity-driven policy employs a learned metric, hence being more closely related to the exploration module.
From all these observations, we can conclude that curiosity not only helps to train navigation agents, but also represents an important metric when bridging cross-modal components in embodied agents.
Captioner. When evaluating the captioning module, we compare the performance using a different number of encoding and decoding layers. As can be seen from Table II, the captioning model achieves the best results when composed of layers for coverage and layer for diversity. While this is in contrast with traditional Transformer-based models [vaswani2017attention], which employ or more layers, it is in line with recent research on image captioning [cornia2020m2], which finds it beneficial to adopt fewer layers. At the same time, a more lightweight network can more easily be embedded in embodied agents, thus being more appropriate for our task.
Qualitative Analysis. We report some qualitative results for in Fig. 4. To ease visualization, we underline the items mentioned by the captioner in the sentence, and highlight them with a bounding box of the same color in the corresponding input image. Our agent can explain the scene perceived from a first-person, egocentric point of view. We can notice that identifies all the main objects in the environment and produces a suitable description even when the view is partially occluded.
VI Conclusion

In this work, we have presented a new setting for embodied AI that is composed of two tasks: exploration and captioning. Our agent uses intrinsic rewards applied to navigation in a photorealistic environment and a novel speaker module that generates captions. The captioner produces sentences according to a speaker policy that can be based on one of three signals: object-driven, depth-driven, and curiosity-driven. The experiments show that is able to generalize to unseen environments in terms of exploration, while the speaker policy filters the number of time steps at which a caption is actually generated. We hope that our work serves as a starting point for future research on this new coupled task of exploration and captioning. Our results with curiosity-based navigation in photorealistic environments and with the speaker module motivate further work in this direction.