Indoor assistant robots that can perform tasks such as searching for objects and answering questions about the environment in response to verbal commands from users have promising application prospects. We expect robots to complete these tasks not only correctly but also efficiently, which improves user experience and reduces energy requirements.
The ability to reason about potential occlusions of objects is essential for achieving this goal. When asked to search for an object, a robot needs to reason about whether the target object may be occluded by visible objects, and then decide whether to check the occluded space by executing movement actions. However, occlusion reasoning is non-trivial: the robot needs to infer the size of the target object from the verbal instruction and compare it with the size of the visible objects. Though existing work has shown that robots with active perception can achieve various tasks [21, 20, 17, 19], in this work we further investigate whether robots can efficiently explore environments by performing occlusion reasoning.
To answer this question, we propose a novel robotic object existence prediction (ROEP) task. Fig. 1 shows the task in a real scenario and in a simulation environment built using the robot simulator CoppeliaSim [15]. The robot is the humanoid Pepper (https://www.softbankrobotics.com/emea/en/pepper) from SoftBank Robotics, which has three omnidirectional wheels for flexible locomotion. The movement of the robot is implemented as a circular motion around the table by 30 degrees clockwise or anticlockwise. The robot receives a word instruction (e.g. “marble”), and is rewarded for correctly predicting whether the target object exists on the table while executing as few movement steps as possible. There are three main challenges behind achieving this goal: 1) the robot needs to connect linguistic concepts with visual representations; 2) the robot needs to memorize the past interactions with the environment to make action selection decisions; and 3) the selected actions and the final prediction functionally interact with each other, which makes the training difficult.
We propose a novel model (see Fig. 2) to address the above challenges. This model is a recurrent neural network consisting of five modules: a visual perception module, a word embedding module, a memory module, an action selection module, and an existence prediction module. The model can be jointly trained with reinforcement learning and supervised learning methods using a curriculum training strategy.
We evaluate our model by comparing it with three baselines, which are a passive model without any movement, a random model with a stochastic movement selection strategy, and an exhaustive exploration model which takes a maximum number of movements. Experimental results demonstrate that our model can outperform the passive and random baselines by a large margin, and achieve a similar prediction accuracy to the exhaustive exploration model while requiring only about of the baseline’s number of movement steps on average. This shows the necessity of active perception and occlusion reasoning to successfully achieve the task, and that a good occlusion reasoning ability is obtained by our model.
As the number of different objects increases, the number of possible combinations of two objects with occlusion increases exponentially. So a good generalization performance on novel combinations of two objects is especially important for occlusion reasoning. We evaluate the generalization performance of our model on novel object combinations held out from the training data, where we show that the generalization to novel object combinations comes with a moderate loss of accuracy while maintaining a small average number of movement steps. Moreover, the generalization performance increases when more kinds of object combinations are included in the training data.
The main contributions of the paper can be summarized as follows: 1) we formulate a novel robotic object existence prediction (ROEP) task, which places high demands on the active perception and occlusion reasoning abilities of robots; 2) we develop a novel model that can efficiently achieve this task; and 3) we find that the proposed model generalizes to novel object combinations with a moderate loss of accuracy, and that a greater variety of object combinations in the training data improves generalization.
II Related Work
Mobile robots with active perception: Zhu et al. [21] proposed a reinforcement learning model for the task of target-driven visual navigation. The model is expected to navigate towards a visual target in indoor scenes in a minimum number of movement steps, using its egocentric visual inputs and an image of the target. Ye et al. [20] studied the problem of mobile robots searching for small target objects in arbitrary poses in indoor environments. They proposed a model that integrates an object recognition module with a deep reinforcement learning-based action selection module for the object searching task. Wang et al. [17] focused on the efficiency of robots when searching for target objects. They proposed a scheme to encode the prior knowledge of the relationship between rooms and objects in a belief map to facilitate efficient searching. Instead of focusing on achieving tasks in large-scale indoor environments, we concentrate on the efficiency of the model when encountering occlusion situations.
Object occlusion: Occlusion between objects is very common in robotic scenarios. However, the occlusion reasoning ability of autonomous mobile robots has not been studied well. Yang et al. [19] introduced the task of embodied amodal recognition, focusing on the visual recognition ability of agents in scenes with occlusion. They proposed a model that can navigate in the environment to perform object classification, localization, and segmentation. However, this work did not concentrate on the occlusion reasoning ability of the agent. A recent work on developing robots with occlusion reasoning ability is [5]. This work introduced the task of answering visual questions via manipulation (MQA), where a robot manipulator needs to perform a series of actions to move objects possibly occluding some small objects on a tabletop, in order to correctly answer visual questions. Similar to the ROEP task, the MQA task also requires the robot to reason about occlusion in order to perform reasonable exploration actions. However, the robot in MQA is a manipulator that explores the environment by moving objects, while we focus on autonomous mobile robots that explore the environment by active perception.
Embodied learning: Robotics research has recently benefited from achievements in vision and language processing. Conversely, researchers are also taking advantage of agents situated in 3D environments to conduct multimodal research. It has been shown that an active agent is able to connect linguistic concepts with visual representations of the environment through training to complete action-involved tasks [8, 4]. Hill et al. [10] found that an embodied agent can achieve one-shot word learning when trained with reinforcement learning in a 3D environment. The proposed ROEP task also involves multiple modalities, including vision, language, and action. Different from the abovementioned work, our model needs to specifically connect linguistic concepts with visual representations of object size through training to achieve the ROEP task.
III Robotic Object Existence Prediction
III-A Simulation Environment
Existing simulation environments are not suitable for the ROEP task. We therefore create a tabletop simulation environment using the robot simulator CoppeliaSim [15] (see Fig. 1). The robot can capture egocentric RGB images with a visual sensor mounted on its head, and execute actions selected from the set {circle_left, circle_right, stop}. By taking the action circle_left, the robot circles around the table clockwise by 30 degrees. The action circle_right works in the same way but in an anticlockwise direction. When the action stop is selected or the maximum number of movement steps is reached, the robot takes no movement action and predicts whether the queried object exists.
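The action semantics described above can be sketched as a small state update on the robot's angular position around the table. This is only an illustrative sketch: the names `step_viewpoint` and `run_episode`, the angle convention (clockwise counted as positive), and the `max_steps` parameter are assumptions and are not taken from the simulation code.

```python
STEP_DEG = 30  # rotation per movement action, as stated in the text
ACTIONS = ("circle_left", "circle_right", "stop")


def step_viewpoint(angle_deg, action):
    """Return the robot's new angular position around the table.

    Assumed convention: angles are measured clockwise in degrees, so
    circle_left (clockwise) adds 30 and circle_right subtracts 30.
    """
    if action == "circle_left":
        return (angle_deg + STEP_DEG) % 360
    if action == "circle_right":
        return (angle_deg - STEP_DEG) % 360
    if action == "stop":
        return angle_deg  # no movement; the prediction is made here
    raise ValueError(f"unknown action: {action}")


def run_episode(actions, max_steps):
    """Apply actions until stop is chosen or max_steps movements are used."""
    angle, steps = 0, 0
    for action in actions:
        if action == "stop" or steps >= max_steps:
            break
        angle = step_viewpoint(angle, action)
        steps += 1
    return angle, steps
```

For example, `run_episode(["circle_left", "circle_left", "stop"], max_steps=5)` ends at 60 degrees after two movement steps.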
A total of 21 everyday objects are used in the simulation environment. Some of them are from the YCB dataset [2]. The rest are provided by CoppeliaSim or collected online. These objects are divided into 3 categories according to their relative size, as shown in Table I. When fitting these objects into cubes, objects from the Large category have a minimum height of and an average volume of . The heights of objects from the Medium category are from to , and their average volume is . Objects from the Small category have a maximum height of and an average volume of . Occlusions can occur between objects from different categories.
III-B Data Generation
Like the CLEVR [12] and ShapeWorld [13] datasets, our data is generated automatically from predefined rules. All samples are generated during training and testing. Each data sample is a triplet [Scene, Query, Prediction]. Scene is an arrangement of objects on the table. Query is a word randomly selected with equal probability from Table I to instruct the robot to search for the referred object in Scene. Prediction is a ground-truth binary label representing whether the target object exists in Scene. It is randomly set as positive or negative with an equal probability of 0.5. Based on a determined pair of Query and Prediction, a corresponding scene is then generated.
There are three different types of scenes: 1) scenes that contain one object; 2) scenes with two objects without occlusion from the initial field of view of the robot; 3) scenes with two objects, one of which is occluded by the other from the initial field of view of the robot. They account for the same proportion (1/3 each) of the generated data. To generate scenes with one object, the object is randomly placed on the table. To generate scenes with two objects, geometric calculations using the position coordinates of the robot’s visual sensor and both the position coordinates and heights of the two objects are applied to control whether there is occlusion in the generated scenes. It should be noted that the smaller object is not necessarily fully occluded by the larger one in scenes with occlusion.
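The sampling procedure described above can be sketched as follows. The object names in `TABLE_I` are hypothetical stand-ins (the real table lists 21 objects), the function name `sample_triplet` is an assumption, and drawing the second object from a different size category is also an assumption. Object placement and the geometric occlusion check are deliberately left out.

```python
import random

# Hypothetical stand-in for Table I (the real table lists 21 objects).
TABLE_I = {
    "Large": ["cereal_box", "pitcher"],
    "Medium": ["mug", "can"],
    "Small": ["marble", "battery"],
}
CATEGORY_OF = {name: cat for cat, names in TABLE_I.items() for name in names}
SCENE_TYPES = ("one_object", "two_visible", "two_occluded")


def sample_triplet(rng):
    """Sample one [Scene, Query, Prediction] triplet (positions omitted)."""
    query = rng.choice(sorted(CATEGORY_OF))   # uniform over query words
    label = rng.random() < 0.5                # positive with probability 0.5
    scene_type = rng.choice(SCENE_TYPES)      # each type with probability 1/3
    others = [o for o in sorted(CATEGORY_OF) if o != query]
    first = query if label else rng.choice(others)
    scene = [first]
    if scene_type != "one_object":
        # Assumption: the second object comes from a different size category.
        # It is never the query word, so the label stays consistent.
        pool = [o for o in others
                if o != first and CATEGORY_OF[o] != CATEGORY_OF[first]]
        scene.append(rng.choice(pool))
    return {"scene": scene, "type": scene_type, "query": query, "label": label}
```

Calling `sample_triplet(random.Random(0))` returns a dictionary in which the query word appears in the scene exactly when the label is positive.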
We have a reasoning table (see Table II) of the ideal action strategy at the first time step in an episode. This table shows whether the robot should move to change its viewpoint or predict the existence of the target object directly when given a query for objects of a specific category (different columns), and the object seen from the initial viewpoint. Except for the situation where a Large object is queried, or a Small object is seen, the robot has to utilize both information from the word instruction and visual perception to make an ideal action decision. Because there are at most two objects on the table, whenever the robot sees two objects, the robot should give an existence prediction directly no matter which object is queried.
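One possible reading of these rules can be written as a small decision function. This is a sketch of the described reasoning, not a reproduction of Table II: the function name, the `target_visible` flag, and the assumption that only an object of a strictly larger category can hide the target are all illustrative.

```python
SIZE_RANK = {"Small": 0, "Medium": 1, "Large": 2}


def ideal_first_action(query_category, seen_categories, target_visible):
    """Return "predict" or "move" for the first time step of an episode."""
    if target_visible:
        return "predict"   # the target is in view: answer directly
    if len(seen_categories) != 1:
        return "predict"   # two objects seen: nothing else can be hidden
    if query_category == "Large":
        return "predict"   # assumed: no object can hide a Large target
    seen = seen_categories[0]
    if SIZE_RANK[seen] > SIZE_RANK[query_category]:
        return "move"      # a larger visible object may occlude the target
    return "predict"       # e.g. a Small object is seen: it hides nothing
```

For instance, a query for a Small object while only a Large object is visible yields `"move"`, whereas any query while a Small object is visible yields `"predict"`.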
Our proposed model is inspired by the recurrent attention model [14], which was originally applied to attention-driven image classification tasks. The proposed model is a recurrent neural network overall (see Fig. 2), and can be divided into five parts: 1) a memory module for incrementally building up state representations, 2) a visual perception module for extracting visual representations, 3) a word embedding module for extracting distributed representations of a query word, 4) an action selection module for making action decisions, and 5) an existence prediction module for producing final predictions.
The Visual Perception module takes the egocentric RGB image ( pixels) as input to extract visual representations. It first extracts the 128 × 28 × 28 image feature maps from the conv3 layer of a fixed ResNet18 [7] pretrained on ImageNet [16]. The features are then passed through two CNN layers, each with 256 kernels of size 3 × 3, and an average pooling layer to obtain the visual representations with a length of . This process is similar to the visual module of the MAC model [11] designed for visual reasoning on the CLEVR dataset [12].
The Word Embedding module maps each word instruction to a 10-dimensional word vector. The weights of the embedding module are randomly initialized, and updated during training.
The Memory module is a recurrent unit that takes the concatenated representations as the input, and combines with the internal representations at the previous time step to produce the new internal representations . This process can be formalized as
where and are weight matrices, is a bias vector, and ReLU(·) is the rectified linear activation function. More sophisticated units such as LSTM or GRU are not used for the memory module because vanishing gradients are not a problem for our task, since only a few recurrent steps have to be taken.
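Written out under assumed notation, the recurrent update described above would take the following form. All symbol names here are assumptions introduced for illustration: $x_t$ for the concatenated visual and word representations at time step $t$, $h_t$ for the internal representations, $W_x$ and $W_h$ for the two weight matrices, and $b$ for the bias vector.

```latex
h_t = \mathrm{ReLU}\left( W_x x_t + W_h h_{t-1} + b \right)
```

This is the standard single-layer recurrent unit with a ReLU nonlinearity, which matches the description of two weight matrices, one bias vector, and no gating.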
The Action Selection module and Existence Prediction module are both classification networks with softmax outputs. The action selection module is a fully connected network with one hidden layer ( hidden units). Its three softmax outputs correspond to three movement actions. The existence prediction module has a single linear layer followed by a softmax layer with two outputs which correspond to the positive and negative prediction respectively.
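The softmax output of the action selection module can be turned into an action as sketched below. This is a minimal illustration using only the standard library: the function names are assumptions, and sampling during training versus taking the most probable action with `greedy=True` is an assumed usage pattern, not something stated in the text.

```python
import math
import random

ACTIONS = ("circle_left", "circle_right", "stop")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(logits, rng, greedy=False):
    """Return (action, log-probability) from three action logits."""
    probs = softmax(logits)
    if greedy:
        idx = max(range(len(probs)), key=probs.__getitem__)
    else:
        # Sample from the stochastic policy defined by the softmax output.
        idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return ACTIONS[idx], math.log(probs[idx])
```

The returned log-probability is the quantity a policy-gradient method needs for the sampled action.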
The parameters of our model include parameters of the visual perception module, the word embedding module, the memory module, the action selection module, and the existence prediction module . The model is non-differentiable overall. We train the model jointly with supervised learning and reinforcement learning methods, where is trained using reinforcement learning and are trained using supervised learning.
The task can be formalized as a partially observable Markov decision process from the perspective of reinforcement learning. The true state of the environment cannot be fully observed. The action selection module is a reinforcement learning agent, which needs to learn a stochastic policy with the parameters , where is one of the three actions in the predefined action set. Executing each movement action except the stop action leads the model to obtain a new visual input. is the history of past interactions with the environment from time step to . The internal representations in the memory module are an approximation to .
The model is expected to gain a high reward at the end of each episode. We design a cost-sensitive reward function containing two parts, an accuracy reward and a latency reward . An accuracy reward of is received when a correct prediction is produced. An accuracy reward of is received when an incorrect prediction is produced. The latency reward is
where is the number of movement steps the agent takes in one episode. means that the stop action is selected at time step . The total reward at time step is a summation of these two rewards: . We use rather than as the denominator of to make sure that is negative when the prediction is incorrect. At other time steps (), we set .
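The structure of the cost-sensitive reward can be illustrated with the sketch below. The numeric choices are assumptions made only so that the stated property holds: an accuracy reward of +1 or -1, and a latency reward of 1 / (T + 2) where T is the number of movement steps. The function names are likewise hypothetical.

```python
def accuracy_reward(correct, r_correct=1.0, r_wrong=-1.0):
    """Assumed values: +1 for a correct prediction, -1 otherwise."""
    return r_correct if correct else r_wrong


def latency_reward(num_moves, offset=2):
    """Assumed form: fewer movement steps give a larger reward."""
    return 1.0 / (num_moves + offset)


def final_reward(correct, num_moves):
    """Total reward at the final time step; other steps receive 0."""
    return accuracy_reward(correct) + latency_reward(num_moves)
```

With these assumed values, the latency term is at most 0.5, so the total reward is negative for every incorrect prediction, and a correct prediction earns more when fewer movement steps are taken.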
The agent is expected to maximize the expected reward return under the policy .
We use Monte-Carlo policy gradient (REINFORCE) [18] to optimize the agent. REINFORCE uses the sample gradient to approximate the actual gradient of
where is the accumulated reward following the action , and is the estimated reward predicted by a baseline network, which has a single linear layer taking as the input. The estimated reward is used for reducing the variance of the gradient estimate. The baseline network is trained with a mean squared error loss.
To use gradient descent algorithms for optimizing the agent, we define loss . It should be noted that gradients of and are not backpropagated to the memory, visual perception, and word embedding modules.
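A minimal numeric sketch of the REINFORCE loss with a baseline, as described above, is given below. It operates on plain Python lists rather than on a neural network: the function names are assumptions, returns are undiscounted (an assumption), and the baseline values are treated as constants in the policy loss.

```python
def returns_from_rewards(rewards):
    """Accumulated reward following each action (undiscounted, assumed)."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]


def reinforce_loss(log_probs, rewards, baselines):
    """Negative log-likelihood weighted by (return - baseline)."""
    returns = returns_from_rewards(rewards)
    # The baseline is a constant here: no gradient flows through it.
    return -sum(lp * (ret - b)
                for lp, ret, b in zip(log_probs, returns, baselines))


def baseline_loss(rewards, baselines):
    """Mean squared error between baseline estimates and returns."""
    returns = returns_from_rewards(rewards)
    return sum((ret - b) ** 2
               for ret, b in zip(returns, baselines)) / len(returns)
```

Because the reward is non-zero only at the final time step, every action in an episode is credited with the same return in this sketch.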
We train these modules along with the existence prediction module using supervised learning methods to optimize the binary cross-entropy loss
where is the labeled ground-truth prediction ( for yes, for no), is the estimated probability of the prediction yes. Gradients of are backpropagated to update parameters of the existence prediction , memory , visual perception , and word embedding module.
The total loss function is a weighted summation of the three losses, as
where and are weight coefficients of and respectively.
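The supervised loss and the weighted sum can be sketched numerically as follows. Several details are assumptions: the label encoding (1 for yes, 0 for no), the parameter names `w_rl` and `w_base`, and the choice to attach the two weight coefficients to the reinforcement learning loss and the baseline loss while leaving the prediction loss unweighted.

```python
import math


def bce_loss(y_true, p_yes, eps=1e-12):
    """Binary cross-entropy for one sample; y_true is 1 (yes) or 0 (no)."""
    p = min(max(p_yes, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def total_loss(l_pred, l_rl, l_base, w_rl=1.0, w_base=1.0):
    """Weighted sum of the three losses (assumed weighting scheme)."""
    return l_pred + w_rl * l_rl + w_base * l_base
```

Lowering `w_rl` reduces the influence of the policy-gradient term relative to the supervised prediction loss.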
IV-C Training Details
We found that it is hard to train the model from scratch on data with all three types of scenes, which agrees with the finding of [19] that jointly training perception and policy networks from scratch is difficult. We therefore resort to a curriculum training strategy and train the model on data with 4 levels of increasing difficulty. We refer to data only containing scenes with one object as L1-1-vis, data only containing scenes with two objects without occlusion as L2-2-vis, data only containing scenes with two objects with occlusion as L3-2-occ, and data containing all types of scenes as L4-overall, in which the three types of scenes occur in equal proportion. The model is trained on these four levels of data sequentially. The parameters obtained from one training stage are loaded as the initial parameters for the next training stage.
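The sequential schedule described above amounts to the following loop. This is a sketch only: `train_stage` is a hypothetical callable standing in for one full training stage, and `episodes_per_stage` is an assumed mapping from level name to episode count.

```python
STAGES = ("L1-1-vis", "L2-2-vis", "L3-2-occ", "L4-overall")


def train_curriculum(init_params, train_stage, episodes_per_stage):
    """Train on the four levels in order, carrying parameters forward.

    train_stage(params, stage_name, num_episodes) must return the
    parameters obtained at the end of that stage.
    """
    params = init_params
    history = []
    for stage in STAGES:
        # Parameters from the previous stage initialise the next one.
        params = train_stage(params, stage, episodes_per_stage[stage])
        history.append((stage, params))
    return params, history
```

Keeping the per-stage snapshots in `history` makes it possible to evaluate the intermediate models as well as the final one.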
We use the Adam optimizer with a learning rate of . The weight coefficients in the total loss function (Eq. 6) are set as , for the training stages on the first three levels. A smaller weight coefficient is used for the last training level to make the training process more stable.
V-A Curriculum Training
Our model is trained using a curriculum training strategy. Specifically, the model is trained sequentially on L1-1-vis, L2-2-vis, L3-2-occ, and L4-overall data with a fixed number of episodes (, , , and respectively) in our experiments. The total training process takes about four days using one GPU (NVIDIA Titan RTX). We noticed that it is unnecessary to train the model to achieve the best performance in the first three training stages if we are only interested in the final model. We repeat the experiment three times to avoid the effect of randomness. The accuracy of correct predictions and the average number of movement steps are used as metrics to evaluate the performance.
Fig. 3 shows the training curves in different training stages. In the first two training stages on L1-1-vis and L2-2-vis data, the accuracy increases stably until reaching a plateau of over , while the average number of movement steps stays near . In the third training stage on L3-2-occ data, the accuracy rapidly increases in the first episodes, together with a rapid increase of the average number of movement steps. In the last two training stages on L3-2-occ and L4-overall data, the average number of movement steps continuously decreases after the accuracy has reached a plateau.
We refer to models obtained from the first three training stages at the , , episodes as Model, Model, Model respectively. The final model is obtained from the last training stage at the episodes, and denoted as Final Model. Performance of each model when tested on different test data ( episodes) is presented in Table III. The results show that each model scores well on the test data that corresponds to the training statistics (diagonal in bold font), and that the final model performs nearly as well as the individual models on their test data. Fig. 4 shows examples when there is only one object, which is larger than the target object, visible from the initial perspective of the agent. A video showing the experimental results is available at https://youtu.be/L4p7yo8dMmQ.
V-B Comparison with Baselines
We compare the proposed model with three baselines that have the same architecture as the proposed model, but with different action selection strategies. These baselines include a passive model without any movement, a random model with a stochastic movement selection strategy, and an exhaustive exploration model that executes the circle_left action for a maximum number of movement steps before producing a prediction. The average movement steps of the three baselines are , , and respectively.
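The three baseline action strategies can be sketched as simple policies. The function names and the `max_steps` parameter are assumptions, and the random baseline here draws uniformly from the three actions, which is one possible reading of a stochastic movement selection strategy.

```python
import random

MOVES = ("circle_left", "circle_right")


def passive_policy(step, max_steps, rng):
    """Never moves: predicts from the initial viewpoint."""
    return "stop"


def random_policy(step, max_steps, rng):
    """Assumed: uniform choice among the two movements and stop."""
    return rng.choice(MOVES + ("stop",))


def exhaustive_policy(step, max_steps, rng):
    """Circles left until the movement budget is used up."""
    return "circle_left" if step < max_steps else "stop"


def count_moves(policy, max_steps, rng):
    """Number of movement steps a policy takes in one episode."""
    steps = 0
    while steps < max_steps and policy(steps, max_steps, rng) != "stop":
        steps += 1
    return steps
```

In this sketch the passive policy always takes 0 movement steps and the exhaustive policy always takes `max_steps`, with the random policy falling in between.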
The prediction accuracy of these baselines and our final model when tested on different test data is presented in Table IV. The passive model and the random model achieve a performance close to that of the exhaustive model on L1-1-vis and L2-2-vis data, but perform poorly on L3-2-occ data. This reveals that active perception is necessary to address the ROEP task. Our model achieves an accuracy similar to that of the exhaustive model on all test data while requiring only of the baseline’s number of movement steps on average ( steps by our model, steps by the exhaustive model). This demonstrates that our model has obtained a good occlusion reasoning ability. However, some challenges remain: 1) the model learns to always move in one direction, rather than choosing the optimal direction according to the orientation of the visible object or a partial occlusion when checking the occluded space (see Fig. 5); 2) the model moves steps on average in scenes without occlusion (L2-2-vis), which is unnecessary.
Table IV: Prediction accuracy of the Passive, Random, and Exhaustive baseline models and our model on the different test data.
V-C Generalization Evaluation
A good generalization performance on novel combinations of two objects is especially important for occlusion reasoning. To evaluate the generalization performance, we train our network on two different sets of training data excluding some object combinations, which are called holdout combinations. That means scenes with some specific object combinations, e.g. [mug, battery], are not included in the training data.
There are three types of combinations of two different size categories, namely [Large, Medium], [Large, Small], and [Medium, Small], and 147 possible combinations of two objects from different size categories. In the first training set, 21 object combinations (7 for each category combination) are held out only for testing, which accounts for about 14.3% (21/147) of all possible combinations. In the second set, 42 object combinations (14 for each category combination) are held out, which accounts for about 28.6% (42/147) of all possible combinations. Holdout combinations are determined by randomly selecting from all possible object combinations before the start of training. Every object in Table I appears in the training data. Experiments are repeated three times with different holdout combinations and random initialization.
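The combination counts above can be checked with a short sketch. The object names in `TABLE` are hypothetical placeholders, and the split of 7 objects per category is inferred from the stated totals (21 objects, 147 cross-category combinations). The sketch does not enforce the condition that every object still appears in the training data.

```python
import itertools
import random

# Hypothetical placeholder names; 7 objects per category is inferred.
TABLE = {cat: [f"{cat.lower()}_{i}" for i in range(7)]
         for cat in ("Large", "Medium", "Small")}


def cross_category_pairs(table):
    """All object pairs whose members come from different size categories."""
    pairs = {}
    for cat_a, cat_b in itertools.combinations(sorted(table), 2):
        pairs[(cat_a, cat_b)] = [(a, b)
                                 for a in table[cat_a] for b in table[cat_b]]
    return pairs


def sample_holdout(table, per_category_pair, rng):
    """Randomly hold out the same number of pairs per category combination."""
    held = []
    for combos in cross_category_pairs(table).values():
        held.extend(rng.sample(combos, per_category_pair))
    return held
```

With 7 objects per category there are 3 × 7 × 7 = 147 cross-category pairs, so holding out 7 or 14 pairs per category combination removes 21 or 42 pairs, that is 1/7 or 2/7 of all combinations.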
Table V presents the test results of the models trained on the aforementioned two sets of training data, denoted as 21 holdout and 42 holdout respectively. Test data L2-2-vis (training) and L3-2-occ (training) contain scenes with object combinations used for training. Test data L2-2-vis (holdout) and L3-2-occ (holdout) only contain scenes with holdout object combinations. The results show that the two models achieve similarly high performance on scenes with object combinations used for training. When tested on L2-2-vis (holdout) and L3-2-occ (holdout), the model trained on 21 holdout still works well, with an accuracy of over and a small average number of movement steps. The performance of the model trained on 42 holdout drops moderately to accuracy when tested on L3-2-occ (holdout), where occlusion reasoning on novel combinations is necessary.
Table V: Test results of the models trained on the 21 holdout and 42 holdout training data.
Experimental setup: The current experimental setup is simplified. There is a strong prior that there are at most two objects on the table, which limits the complexity of potential occlusion situations. An interesting extension is to generate scenes with more objects on the table and extend the task to counting objects. Moreover, the action space of the robot is small. The circle_left and circle_right actions used in the current experimental setting limit the generalization capability to environments with tables of different sizes or shapes. More complex robot actions, such as move_ahead, rotate_left, and rotate_right, will be used in future work. On the one hand, it will be feasible to transfer a robot with these more complex actions to other environments. On the other hand, it will make the task more challenging, as the robot has greater flexibility in its movements, which places higher demands on action planning.
The current training process is complex, since the curriculum training strategy involves four sequential training stages to obtain the final model. A possible solution to simplify training is using unsupervised learning instead of curriculum learning to learn good visual and word representations.
Sim-to-real transfer: In this paper, we validate the effectiveness of the proposed model in a simulation environment. We can imagine that directly transferring the resulting model trained in a simulation environment to a real-world scenario (see Fig. 1) would result in a certain performance loss. Some techniques, such as fine-tuning the model in a more photo-realistic simulation environment with randomized lighting conditions of the real environment, may mitigate the performance degradation.
In this work, we introduced the task of robotic object existence prediction (ROEP), which is complementary to existing robotic tasks that require active perception. Different from existing tasks, ROEP focuses on the occlusion reasoning ability, which helps a robot explore its environments more efficiently. As such, it can be used for example as a probing task for existing models that are designed for general tasks.
To solve ROEP, we proposed a novel recurrent neural network which is trained end-to-end jointly with reinforcement learning and supervised learning methods using a curriculum training strategy. We showed empirically that the proposed model achieves the ROEP task efficiently compared with the baselines. We also showed that generalization to novel object combinations comes with a moderate loss of accuracy, while including more kinds of object combinations in the training data can increase the generalization performance. This finding, which is related to the finding in [9], can be considered as a recommendation when training a model for tasks that implicitly involve occlusion reasoning (e.g., object goal navigation [3]).
Acknowledgement. The authors gratefully acknowledge support from the China Scholarship Council (CSC) and the German Research Foundation DFG under project CML (TRR 169). We would like to thank Erik Strahl for his support with the experimental setup.
[1] (2009) Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pp. 41–48.
[2] (2015) Benchmarking in manipulation research: the YCB object and model set and benchmarking protocols. In Proceedings of the 2015 IEEE International Conference on Advanced Robotics (ICAR).
[3] (2020) Object goal navigation using goal-oriented semantic exploration. In Neural Information Processing Systems (NeurIPS).
[4] (2018) Gated-attention architectures for task-oriented language grounding. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Vol. 32.
[5] (2021) MQA: answering the question via robotic manipulation. In Proceedings of Robotics: Science and Systems (RSS).
[6] (2018) World models. arXiv preprint arXiv:1803.10122.
[7] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
[8] (2017) Grounded language learning in a simulated 3D world. arXiv preprint arXiv:1706.06551.
[9] (2020) Environmental drivers of systematicity and generalization in a situated agent. In Proceedings of the International Conference on Learning Representations (ICLR).
[10] (2021) Grounded language learning fast and slow. In Proceedings of the International Conference on Learning Representations (ICLR).
[11] (2018) Compositional attention networks for machine reasoning. In Proceedings of the International Conference on Learning Representations (ICLR).
[12] (2017) CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2901–2910.
[13] (2017) ShapeWorld - A new test methodology for multimodal language understanding. arXiv preprint arXiv:1704.04517.
[14] (2014) Recurrent models of visual attention. In Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS), pp. 2204–2212.
[15] (2013) CoppeliaSim (formerly V-REP): a versatile and scalable robot simulation framework. In Proceedings of the International Conference on Intelligent Robots and Systems (IROS). www.coppeliarobotics.com
[16] (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
[17] (2018) Efficient object search with belief road map using mobile robot. IEEE Robotics and Automation Letters 3 (4), pp. 3081–3088.
[18] (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3-4), pp. 229–256.
[19] (2019) Embodied amodal recognition: learning to move to perceive objects. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2040–2050.
[20] (2018) Active object perceiver: recognition-guided policy learning for object searching on mobile robots. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 6857–6863.
[21] (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3357–3364.