Teaching Robots Novel Objects by Pointing at Them

12/25/2020 ∙ by Sagar Gubbi Venkatesh, et al. ∙ Indian Institute of Science

Robots that must operate in novel environments and collaborate with humans must be capable of acquiring new knowledge from human experts during operation. We propose teaching a robot novel objects it has not encountered before by pointing a hand at the new object of interest. An end-to-end neural network is used to attend to the novel object of interest indicated by the pointing hand and then to localize the object in new scenes. In order to attend to the novel object indicated by the pointing hand, we propose a spatial attention modulation mechanism that learns to focus on the highlighted object while ignoring the other objects in the scene. We show that a robot arm can manipulate novel objects that are highlighted by pointing a hand at them. We also evaluate the performance of the proposed architecture on a synthetic dataset constructed using emojis and on a real-world dataset of common objects.




I Introduction

Robots that can operate in unconstrained environments and collaborate with humans must be capable of learning about new objects they may encounter. Pointing at an object with our hand is a natural way to communicate with the robot about a new object. In this paper, we consider the problem of teaching robots novel objects by pointing at the new object. Once we show the robot a new object, it can generate and store a feature vector corresponding to that object and then re-use it for one-shot localization of the object in new scenes (Fig. 1).


Neural networks have been used in recent years to learn fully differentiable visuomotor policies that directly map pixels to actuator commands [12, 21, 20, 14, 7]. The neural network architecture typically used for such policies can be decomposed into vision layers and control layers. The vision layers consist of several convolutional and pooling layers followed by a spatial attention mechanism that attends to the objects of interest in the image. We propose modulating the spatial attention so that the network is able to attend to the object that the hand is pointing at (see Fig. 1) while ignoring the other distracting objects in the scene.

In this work, we assume that only the location of the object of interest is available for training and that the position and orientation of the pointing hand are unavailable. So, this is a weakly supervised learning problem where the neural network must figure out as part of the learning process that the pointing hand in the image is salient and then learn to attend to the object being pointed at. On the other hand, this assumption makes the process of data acquisition with a real robot easier by reducing the labeling effort.

Unlike other papers on object detection [15, 16], we are primarily interested in teaching robots new objects. This means that we are interested in objects not seen by the neural network during training. We accomplish this using Siamese networks [11, 18, 1], which are twin neural networks with shared weights. The idea is to use the neural network to obtain from the image a feature vector representing the object of interest rather than classifying the contents of the image as is usually done (Fig. 1). This vector can be subsequently used in new environments to find the novel object of interest.

Fig. 1: One-shot localization of a novel object selected by the pointing hand. The feature vector of the object (blue bottle cap) that the hand is pointing at is extracted using the proposed attention modulation mechanism. This is then used to localize the object in new scenes. Note that the pointing finger is at a considerable distance from the object of interest.

Our contributions are as follows:

  • We propose a spatial attention modulation mechanism that endows the neural network with the ability to selectively attend to the object that is being pointed at while ignoring other distracting objects in the scene.

  • We show that the proposed method can be combined with Siamese networks to teach robots novel objects.

The proposed network architecture is trained on synthetic data constructed from a dataset of emojis. We demonstrate the proposed method on the Dobot Magician robot arm. We show that the robot learns new objects that we point at and can find them in new scenes.

The rest of this paper is organized as follows. In the next section, papers related to this work are discussed. In Section III, the proposed model architecture is described. Section IV details experimental results, and Section V concludes the paper.

II Related Work

II-A Hand Recognition

One way to design a system that can infer the object of interest in a similar scenario is to use a pipelined approach. For example, one could employ deep learning models trained to localize human hands in an image [2, 4], extract the most relevant keypoints of the hand (say, the joints of the index finger), and then fit a line that passes through these keypoints. An object recognition module could then be used to localize each object in the scene, project these points onto the line, and pick the object corresponding to the point closest to the hand as the object of interest. However, training such a system requires strong supervision, such as the positions of all objects in the scene, and possibly even the keypoints of the hand if one would like to fine-tune the hand localization models. This approach is therefore not feasible in the weak supervision setting considered in this paper, where only the location of the object of interest is given. Moreover, it has been shown across a wide range of problems that an end-to-end approach leads to better performance than a pipelined approach for a given task [9, 8, 21].
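The geometric step of the pipelined alternative described above can be sketched as follows. This is only an illustration of that baseline idea, not code from the paper: the function name `pick_pointed_object`, the use of the first and last keypoints to define the line (instead of a least-squares fit), and the `max_offset` tolerance are all assumptions made for the example.

```python
import math

def pick_pointed_object(finger_keypoints, object_centers, max_offset=20.0):
    """Return the index of the object the finger points at, or None.

    finger_keypoints: (x, y) joints ordered from knuckle to fingertip.
    object_centers:   (x, y) centers, e.g. from an object detector.
    max_offset:       hypothetical tolerance (pixels) for distance from the line.
    """
    (x0, y0), (x1, y1) = finger_keypoints[0], finger_keypoints[-1]
    length = math.hypot(x1 - x0, y1 - y0)
    ux, uy = (x1 - x0) / length, (y1 - y0) / length  # unit pointing direction

    best_index, best_along = None, math.inf
    for index, (ox, oy) in enumerate(object_centers):
        along = (ox - x1) * ux + (oy - y1) * uy        # projection onto the line
        offset = abs((ox - x1) * uy - (oy - y1) * ux)  # distance from the line
        # Keep objects in front of the fingertip and near the line; prefer the
        # one whose projection is closest to the hand.
        if along > 0 and offset <= max_offset and along < best_along:
            best_index, best_along = index, along
    return best_index

finger = [(10.0, 50.0), (15.0, 50.0), (20.0, 50.0)]   # pointing along +x
objects = [(80.0, 52.0), (60.0, 90.0), (5.0, 50.0)]   # near the ray, off the ray, behind
chosen = pick_pointed_object(finger, objects)
```

Note that this sketch needs both hand keypoints and all object positions as inputs, which is exactly the strong supervision that the weakly supervised setting does not provide.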

II-B Spatial Attention

The architecture of typical end-to-end networks for visuomotor tasks can be broadly grouped into two sets of layers. The initial group of layers forms the vision layers that help in localizing the relevant objects in the image. The remaining layers form the control layers, which are responsible for producing the control actions required to perform the task at hand. A key component in such end-to-end networks is some form of spatial attention mechanism that learns to attend to the relevant object of interest in the scene. The work presented in [21] demonstrates the use of imitation learning for teaching a PR2 robot to perform simple tasks such as pick-and-place. The authors developed a virtual reality based system to teleoperate the robot and collect training data. The data was then used to train an end-to-end network that maps image pixels directly to robot joint velocities. The network consists of an initial set of convolution layers that generates a feature map. The feature map is passed through a spatial softargmax layer to output a feature vector. The resulting feature vector is then passed through a few fully connected layers to predict the joint velocities of the robot. The spatial softargmax layer serves as a simple spatial attention mechanism where the attention weight corresponding to each pixel of the feature map depends on the degree of activation.
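To make the spatial softargmax operation concrete, here is a minimal sketch for a single-channel map, written in plain Python. It follows the common formulation (a softmax over all pixels followed by the expected pixel coordinate); the function name and the toy input are illustrative assumptions, not taken from [21].

```python
import math

def spatial_soft_argmax(activation):
    """Return the attention-weighted (x, y) location of a 2D activation map.

    `activation` is a list of rows. A softmax over all pixels gives the
    attention weights; the expected pixel coordinate under those weights
    is the soft-argmax.
    """
    peak = max(v for row in activation for v in row)  # for numerical stability
    exps = [[math.exp(v - peak) for v in row] for row in activation]
    total = sum(sum(row) for row in exps)
    x = sum(w * j for row in exps for j, w in enumerate(row)) / total
    y = sum(w * i for i, row in enumerate(exps) for w in row) / total
    return x, y

# A 3x4 map with one strongly activated pixel at row 1, column 2.
activation = [[0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 10.0, 0.0],
              [0.0, 0.0, 0.0, 0.0]]
x, y = spatial_soft_argmax(activation)
```

Because every step is differentiable, the predicted coordinate can be trained directly with a regression loss, which is why this layer is a convenient attention mechanism in end-to-end policies.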

II-C One Shot Learning

Apart from inferring the object of interest in an image in the presence of other objects, another goal of this paper is to enable robots to recognize objects that they have never encountered before by training them on only a few examples involving the novel object. Broadly speaking, meta learning and Siamese networks are two approaches one can take to achieve this. We review both approaches below.

II-C1 Meta Learning

In meta learning, also known as learning to learn, a distribution of tasks is provided as training data. Typically, only a few examples for each task are provided in the training data. The weights of the network after the training process completes serve as a good initialization for the network to learn to perform any new, previously unseen task. Only a few training examples involving the novel task and a few gradient descent steps are required for the network to converge to an optimal set of weights [5]. Recently, meta learning and imitation learning have been combined to enable robots to perform novel tasks such as pick-and-place by training on just a single example [6, 19].

Fig. 2: The proposed neural network architecture for one-shot localization of the object selected by the pointing hand. The convolutional layers in the “Conv” block are Conv3x3(16)-ELU-Conv3x3(32)-ELU-Conv3x3(64)-ELU-MaxPool2x2-Conv3x3(64)-ELU-Conv3x3(128)-ELU-MaxPool2x2-Conv3x3(256)-ELU-Conv3x3(512)-ELU-Conv3x3(1024)-ELU. All convolutions are “valid” convolutions that do not use padding so that the feature vector for the object is the same regardless of whether it is near the edge of the image or at the center. The receptive field of the “Conv” block is 34 px.

II-C2 Siamese Networks

Siamese networks are used to address the similarity learning problem, where it is desirable to infer whether a pair of images (referred to as the exemplar and search images) are similar to each other. This is done by using twin convolutional neural networks with shared weights that transform the two images into feature embeddings. The embedding pair is then combined using a transformation that can be used to make suitable predictions depending on the task at hand. For example, in the context of image classification [11], the transformation is a distance metric that can be used to measure the similarity score between the object in the exemplar image and the search image. The system is trained on several examples of similar and dissimilar pairs of images. Once training is complete, a database of images is built with one image corresponding to each object of interest. At test time, the similarity of the search image is tested against each image in the database to determine the object class of the search image. Siamese networks have been used in face recognition systems as well [17]. However, in both these papers, the comparison is possible only if the exemplar and search images are of the same dimension.
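The database lookup described above can be illustrated with a small sketch. The embedding here is a fixed toy linear map with a tanh nonlinearity, standing in for a trained convolutional branch; `SHARED_WEIGHTS`, `embed`, `classify` and the three-element "images" are all hypothetical and chosen only to show how weight sharing and distance-based matching fit together.

```python
import math

# Toy shared weights: the same matrix embeds both the exemplar and the search input.
SHARED_WEIGHTS = [[0.5, -0.2, 0.1],
                  [0.3, 0.8, -0.5]]

def embed(pixels):
    """Twin branch: every input goes through this one function, so weights are shared."""
    return [math.tanh(sum(w * p for w, p in zip(row, pixels)))
            for row in SHARED_WEIGHTS]

def distance(a, b):
    """Euclidean distance between two embeddings (smaller means more similar)."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def classify(search, database):
    """Return the label of the exemplar whose embedding is closest to the search input."""
    query = embed(search)
    return min(database, key=lambda label: distance(query, embed(database[label])))

# One exemplar per object class, as in the one-shot setting.
database = {"cup": [1.0, 0.0, 0.0], "ball": [0.0, 1.0, 0.0]}
label = classify([0.9, 0.1, 0.0], database)
```

Adding a new object class only requires adding one exemplar to `database`; no retraining of the shared branch is involved.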

The authors of [1] used fully convolutional neural networks to enable comparison of images of dissimilar dimensions with Siamese networks. Their architecture was adapted successfully for object tracking in videos. Here, the user provides the exemplar image by cropping out the object of interest from the first frame of the video, which is then compared against each subsequent frame using the Siamese network. More recently, the authors of [18] combined fully convolutional Siamese networks with spatial attention to enable object localization for robot pick-and-place tasks. The paper explores specifying the object of interest by using visual cues instead of requiring the user to provide a cropped image of the object. Given a group of objects in a scene, the user indicates to the robot the object of interest by shining a laser beam directly on it. Although the authors talk of localizing novel objects using a laser beam as a visual cue, the network designed by them should work for any other kind of visual cue (such as a stick or even a hand) so long as it is in very close proximity to the object of interest. However, a human merely has to point at an object from a distance to convey that it is of interest, and an observer infers and localizes the object being referred to by looking in the direction of the pointing hand. We would like to design systems that communicate intent to robots much like how humans communicate with each other via visual cues or natural language [13, 3]. Learning to localize an object of interest from natural language instructions requires a different architecture design from the one presented in this paper. We restrict our focus to learning to localize the object of interest by using a visual cue, such as a hand pointing at the object from a distance.

III Network Architecture

III-A Localizing the Object of Interest

The proposed neural network for one-shot localization is shown in Fig. 2. Let the exemplar image be the image that contains the novel object that is being pointed at by the hand. Let the search image be the image of the new scene in which the same object must be localized. The network outputs the locations of the object in the exemplar image and in the search image. The mean squared error loss is used to train the network.

The localization of the object is performed in a similar fashion to that described in [18], except for the attention modulation block. The exemplar image is passed through the CNN to obtain a feature map. This feature map is then passed through a bottleneck convolutional layer (conv) to obtain an activation map. Let us ignore for now how the attention modulation map is generated; this is described in Section III-B. A spatial attention map is generated by adding the attention modulation map to the bottleneck activation map, and the resulting sum is passed through a spatial soft-argmax layer whose output is the predicted location of the object of interest in the exemplar image (see Eqns. (1), (2) and (3)). The spatial attention map is used to obtain the feature vector corresponding to the object of interest from the feature map (see Eqn. (4)). Note that the multiplication operator in Fig. 2 denotes element-wise multiplication (with appropriate broadcasting to account for the different numbers of channels in the spatial attention map and the feature map).


The localization of the object in the search image is done by first passing the search image through the CNN to obtain its feature map. The feature vector of the object is then used like a matched filter (or equivalently as the weights of a conv1×1 layer) to generate a response map over the search image. The location of the object in the search image is then determined by passing this response map through a spatial soft-argmax layer (similar to the operations in Eqns. (2) and (3)).
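The data flow of this section can be sketched end to end on a toy example. This is a simplified illustration under stated assumptions, not the authors' implementation: the function names, the 1×3 feature maps, and the hand-written activation and modulation values are hypothetical, and the standard softmax-based soft-argmax is assumed for the attention step.

```python
import math

def softmax2d(logits):
    """Softmax over every pixel of a 2D map (list of rows)."""
    peak = max(v for row in logits for v in row)
    exps = [[math.exp(v - peak) for v in row] for row in logits]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]

def expected_xy(attention):
    """Attention-weighted mean pixel coordinate (the soft-argmax output)."""
    x = sum(w * j for row in attention for j, w in enumerate(row))
    y = sum(w * i for i, row in enumerate(attention) for w in row)
    return x, y

def attend_exemplar(features, activation, modulation):
    """Modulate the activation map, then pool the features with the attention."""
    summed = [[a + m for a, m in zip(a_row, m_row)]
              for a_row, m_row in zip(activation, modulation)]
    attention = softmax2d(summed)
    channels = len(features[0][0])
    vector = [sum(attention[i][j] * features[i][j][c]
                  for i in range(len(features))
                  for j in range(len(features[0])))
              for c in range(channels)]
    return expected_xy(attention), vector

def localize_in_search(search_features, vector):
    """Use the object vector as a 1x1 filter, then take the soft-argmax."""
    response = [[sum(f * v for f, v in zip(pixel, vector)) for pixel in row]
                for row in search_features]
    return expected_xy(softmax2d(response))

# Toy 1x3 exemplar: object A (feature [5, 0]) at x=0, object B ([0, 5]) at x=2.
exemplar_features = [[[5.0, 0.0], [0.0, 0.0], [0.0, 5.0]]]
activation = [[4.0, 0.0, 4.0]]     # both objects produce a peak
modulation = [[0.0, -2.0, -2.0]]   # the beam covers only x=0
(ex_x, ex_y), object_vector = attend_exemplar(exemplar_features, activation, modulation)

# In the search scene the two objects have swapped places.
search_features = [[[0.0, 5.0], [0.0, 0.0], [5.0, 0.0]]]
search_x, search_y = localize_in_search(search_features, object_vector)
```

In this toy case the modulation tilts the attention toward object A in the exemplar, so the pooled vector is dominated by A's features and the matched filter responds most strongly at A's new position in the search scene.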

III-B Generating the Attention Modulation Map

Fig. 3: Beam / cone-like attention modulation maps for different positions and orientations of the pointing hand. The dark regions correspond to a value of -2.0, which suppresses peaks corresponding to objects in the bottleneck activation map (Fig. 2), whereas the bright regions correspond to a value of 0.0, which allows values in that area of the activation map to pass through unchanged. All maps use the same beam width and orientation step size.

When there are multiple objects in the scene, we expect multiple bright spots in the bottleneck activation map, each corresponding to an object (Fig. 6). We would like to suppress the peaks corresponding to objects that are not being pointed at. Had we known the location and orientation of the hand, we could have directly suppressed the irrelevant peaks. However, since we do not have labels corresponding to the pose of the pointing hand in the scene, the neural network must learn to attend to the hand and then use this to suppress the irrelevant peaks in the activation map. To enable this, we use a “soft” or differentiable way to compute the position and orientation of the hand, which is then employed to suppress irrelevant objects in the activation map.

The feature map of the exemplar image is passed through two independent bottleneck layers to produce two maps corresponding to the position and the orientation of the pointing hand, respectively. Similar to Eq. (4), spatial attention is used to attend to the pointing hand and to obtain the orientation of the hand. The final position and orientation of the hand are used to “soft select” a pre-defined attention modulation map (see Fig. 3). The set of pre-defined attention modulation maps includes beams from all possible locations and orientations of the pointing hand, as shown in Fig. 3. Each modulation map is constructed by drawing a beam emanating from the position of the hand in the direction in which the hand is pointing. We use a fixed orientation step size and a fixed beam width. Thus, there are 30 × 30 × 24 = 21600 such maps. Note that no explicit loss function is used to learn the position and orientation of the hand. Rather, the network learns to predict values for them that result in the “selection” of an appropriate attention map, which is possible only by correctly recognizing the position and orientation of the hand. The modulation map thus obtained is added to the bottleneck activation map to highlight the object being pointed at while suppressing the irrelevant ones. Thus, the pixels that lie inside the beam are passed as is, whereas the pixels that lie outside the beam are suppressed. Note that the entire attention modulation scheme is differentiable and hence can be learned through back-propagation.
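A minimal sketch of how such beam maps could be built and softly selected is given below. Several details are assumptions made for illustration only: the beam width of 30 degrees is hypothetical, the soft selection is modeled as a probability-weighted blend of the pre-defined maps, and the tiny 5×5 grid with four orientations stands in for the full bank (one natural reading of the 21600 count is 30 × 30 hand positions times 24 orientations, which would correspond to a 15 degree step). Only the values -2.0 and 0.0 come from the Fig. 3 caption.

```python
import math

OUTSIDE, INSIDE = -2.0, 0.0  # suppression values from the Fig. 3 caption

def beam_map(size, hand_xy, angle_deg, beam_width_deg):
    """Modulation map: INSIDE within the beam cast from the hand, OUTSIDE elsewhere."""
    hx, hy = hand_xy
    rows = []
    for y in range(size):
        row = []
        for x in range(size):
            if (x, y) == (hx, hy):
                row.append(INSIDE)
                continue
            pixel_angle = math.degrees(math.atan2(y - hy, x - hx))
            # Smallest signed difference between the two angles, in [-180, 180).
            diff = (pixel_angle - angle_deg + 180.0) % 360.0 - 180.0
            row.append(INSIDE if abs(diff) <= beam_width_deg / 2 else OUTSIDE)
        rows.append(row)
    return rows

def soft_select(size, position_probs, orientation_probs, angles, beam_width_deg):
    """Blend the pre-defined maps, weighting each by P(position) * P(orientation)."""
    blended = [[0.0] * size for _ in range(size)]
    for hand_xy, p_pos in position_probs.items():
        for angle, p_ori in zip(angles, orientation_probs):
            weight = p_pos * p_ori
            if weight == 0.0:
                continue
            candidate = beam_map(size, hand_xy, angle, beam_width_deg)
            for y in range(size):
                for x in range(size):
                    blended[y][x] += weight * candidate[y][x]
    return blended

# Hand at the left edge of a 5x5 grid, confidently pointing along +x.
angles = [0.0, 90.0, 180.0, 270.0]
modulation = soft_select(5, {(0, 2): 1.0}, [1.0, 0.0, 0.0, 0.0], angles, 30.0)
```

Because the blend is a weighted sum, gradients flow back to the position and orientation probabilities, which is what allows the hand pose to be learned without explicit pose labels.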

The proposed way of creating attention modulation maps is most suitable for top view images (Fig. 6). For perspective views where the depth of the object is more relevant, it may be necessary to generate maps in 3D by casting a cone of rays and using the perspective projection. We leave this for future work.

IV Experimental Results

To evaluate the proposed neural network, we first train it on a synthetic dataset and compare it with alternative architectures. The trained network is deployed on a robot arm to demonstrate its real world performance.

IV-A Localization Performance

A dataset of 5000 training images and 1000 test images is created by placing emojis (Fig. 4) at non-overlapping positions against a backdrop as shown in Fig. 6. A hand emoji (Fig. 5) is placed at a random location pointing to an object. One or more distracting objects are placed at random locations not on the line segment between the pointing hand and the object. The label for each sample is the position of the object that the hand is pointing at.
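The placement rule for distractors can be sketched with rejection sampling. This is an illustrative reconstruction, not the authors' generator: the canvas size, the `margin` used both for non-overlap and for keeping distractors off the hand-to-object segment, and the function names are assumptions, and the hand's rotation toward the target as well as the emoji rendering are omitted.

```python
import math
import random

def distance_to_segment(point, start, end):
    """Shortest distance from `point` to the line segment start-end."""
    (px, py), (ax, ay), (bx, by) = point, start, end
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def sample_scene(rng, size=100.0, n_distractors=2, margin=15.0):
    """Sample hand, target and distractor positions for one synthetic image."""
    def random_point():
        return (rng.uniform(0.0, size), rng.uniform(0.0, size))

    hand = random_point()
    target = random_point()
    while math.hypot(target[0] - hand[0], target[1] - hand[1]) < margin:
        target = random_point()

    distractors = []
    while len(distractors) < n_distractors:
        candidate = random_point()
        # Reject candidates on or near the segment from the hand to the target.
        if distance_to_segment(candidate, hand, target) < margin:
            continue
        # Reject candidates that would overlap an existing distractor.
        if any(math.hypot(candidate[0] - d[0], candidate[1] - d[1]) < margin
               for d in distractors):
            continue
        distractors.append(candidate)
    # The only label is the position of the pointed-at object.
    return {"hand": hand, "target": target, "distractors": distractors, "label": target}

scene = sample_scene(random.Random(0))
```

Since the hand and the target are the endpoints of the segment, the segment check also keeps distractors from overlapping either of them.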

To evaluate the proposed spatial attention modulation mechanism, we first consider only localization of the object in the image containing the pointing hand (the exemplar image) while ignoring the other input (the search image) and the corresponding output of the Siamese network. Table I compares the proposed approach with two baselines. The FC layers baseline uses fully connected layers (FC1024-ELU-FC256-ELU-FC2) to predict the object location from the feature map. The Conv layers baseline uses convolutional layers (Conv3x3(2048)-ELU-MaxPool2x2-Conv3x3(2048)-ELU-Conv3x3(2048)-ELU-MaxPool2x2-Conv3x3(2048)-ELU-Conv3x3(2048)-ELU-FC2) to predict the position of the object. The networks are trained with mean squared error loss with weight decay 1e-8 using the Adam optimizer [10] with learning rate 1e-4. The evaluation metric is accuracy, where a prediction is considered accurate if there is sufficient overlap between the ground truth and predicted bounding boxes. Specifically, the IOU (intersection-over-union) between the ground truth bounding box and the predicted bounding box has to be at least 0.5. All three networks achieve low training error (accuracy over 99%), but the test error varies, and the proposed approach generalizes the best. A sample output is shown in Fig. 6, where we observe that the spatial attention modulation mechanism is working as one might expect.
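The accuracy criterion can be written down directly. The sketch below assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max); how a predicted location is turned into a box (for example, a fixed-size box centered on it) is not stated in the text and is left out here. The function names and sample boxes are illustrative.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - intersection
    return intersection / union if union > 0 else 0.0

def accuracy(predicted_boxes, ground_truth_boxes, threshold=0.5):
    """Fraction of predictions whose IOU with the ground truth is at least `threshold`."""
    hits = sum(iou(p, g) >= threshold
               for p, g in zip(predicted_boxes, ground_truth_boxes))
    return hits / len(ground_truth_boxes)

ground_truth = [(10.0, 10.0, 30.0, 30.0), (10.0, 10.0, 30.0, 30.0)]
predictions = [(12.0, 12.0, 32.0, 32.0),   # small offset: IOU = 324 / 476
               (25.0, 25.0, 45.0, 45.0)]   # large offset: IOU = 25 / 775
score = accuracy(predictions, ground_truth)
```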

A second dataset is constructed as before, additionally containing images of a new environment (the search images) in which the object highlighted by the pointing hand is present along with distracting objects (Fig. 7). The proposed architecture in its entirety, including the Siamese network that processes the search image and predicts the location of the object in it, is trained on this dataset. The accuracy on this dataset drops only marginally to 95.31%. Table II compares performance on this dataset and shows that attention modulation is essential to localize the desired object. The sample output in Fig. 7 shows that the desired object in the search image is being attended to.

Fig. 4: A few sample objects used for training and evaluation. The set of emojis is divided into 2075 for training and 703 for testing.

Fig. 5: A few sample hand images used for training and evaluation. The set of hand emojis is divided into 47 for training and 8 for testing.

Fig. 6: Sample prediction of the proposed architecture. The network has properly localized the pointing hand and chosen a suitable attention modulation map. The activation corresponding to the object that is not pointed at has been appropriately suppressed in the spatial attention map.

Neural Network Architecture    Accuracy
FC layers                      11.72%
Conv layers                    41.41%
Proposed approach              96.88%

TABLE I: Comparison of the proposed approach with different baselines

Neural Network Architecture            Accuracy
Without attention modulation [18]      12.5%
Proposed approach (with modulation)    95.31%

TABLE II: Comparison of localization performance of the Siamese network on novel objects with and without attention modulation

Fig. 7: A sample prediction from the proposed architecture on synthetic data.

IV-B Evaluation on Robot Arm

We demonstrate the proposed neural network using the Dobot Magician, a 3-DoF robot arm (Fig. 8). The objects used for evaluation with the robot are shown in Fig. 9. To convert the localized object in pixel space to the robot co-ordinate space, a chessboard calibration pattern is used (Fig. 10), and OpenCV is used for calibration. Figure 11 shows a sample prediction from the proposed neural network. We see that the pointing hand has been localized, and the network has learnt to predict an appropriate attention modulation map that selects the object being pointed at (blue bottle cap in Fig. 11). We also see that the activation corresponding to the distracting object has been successfully suppressed in the spatial attention map. With the feature vector corresponding to the bottle cap extracted, the Siamese net successfully attends to the same bottle cap in a new scene. In this manner, 20 trials were performed. The proposed network localizes the desired object to within 1 cm in all the trials. A video of the robot in operation is available at https://youtu.be/bJ5HKllhqLg.
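One common way to map a pixel location on a flat workspace to robot coordinates is a planar homography, sketched below. The text only states that a chessboard pattern and OpenCV were used, so the homography model itself, the function name, and the matrix values are assumptions for illustration; in practice such a matrix could be estimated from chessboard corner correspondences, for example with OpenCV's `cv2.findHomography`.

```python
def pixel_to_robot(pixel_xy, homography):
    """Map an image pixel to planar robot coordinates with a 3x3 homography."""
    u, v = pixel_xy
    h = homography
    x = h[0][0] * u + h[0][1] * v + h[0][2]
    y = h[1][0] * u + h[1][1] * v + h[1][2]
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    return x / w, y / w  # divide out the projective scale

# Hypothetical calibration result: 0.5 mm per pixel, flipped y axis, shifted origin.
HYPOTHETICAL_H = [[0.5, 0.0, 100.0],
                  [0.0, -0.5, 200.0],
                  [0.0, 0.0, 1.0]]
robot_x, robot_y = pixel_to_robot((40.0, 60.0), HYPOTHETICAL_H)
```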

Fig. 8: A sample demonstration with the Dobot Magician robot arm.

Fig. 9: Objects used for evaluating the proposed approach on the Dobot arm.

Fig. 10: The chessboard calibration pattern used to convert pixels to robot co-ordinates.

Fig. 11: A sample prediction from the proposed architecture.

V Conclusions

We have proposed a spatial attention modulation method that endows a neural network with the ability to attend to a hand pointing at an object in an image and to focus on the object that is being pointed at. The proposed approach generalizes significantly better than architectures that use only fully connected or convolutional layers for localization. Furthermore, this approach can be combined with a Siamese network to localize objects that were not present in the training dataset. This network architecture can be used in building robots that can interact naturally with humans and learn about new objects over time.


This project was supported by the Robert Bosch Center for Cyber-Physical Systems.


  • [1] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr (2016) Fully-convolutional siamese networks for object tracking. In European conference on computer vision, pp. 850–865.
  • [2] Z. Cao, G. Hidalgo, T. Simon, S. Wei, and Y. Sheikh (2018) OpenPose: realtime multi-person 2d pose estimation using part affinity fields. arXiv preprint arXiv:1812.08008.
  • [3] H. Chen, A. Suhr, D. Misra, N. Snavely, and Y. Artzi (2018) Touchdown: natural language navigation and spatial reasoning in visual street environments. arXiv preprint arXiv:1811.12354.
  • [4] B. Doosti (2019) Hand pose estimation: a survey. arXiv preprint arXiv:1903.01013.
  • [5] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400.
  • [6] C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine (2017) One-shot visual imitation learning via meta-learning. arXiv preprint arXiv:1709.04905.
  • [7] A. Giusti, J. Guzzi, D. C. Cireşan, F. He, J. P. Rodríguez, F. Fontana, M. Faessler, C. Forster, J. Schmidhuber, G. Di Caro, et al. (2015) A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters 1 (2), pp. 661–667.
  • [8] A. Graves and N. Jaitly (2014) Towards end-to-end speech recognition with recurrent neural networks. In 31st International Conference on Machine Learning, ICML 2014, Vol. 5, pp. 1764–1772.
  • [9] A. Gupta, A. Vedaldi, and A. Zisserman (2016) Synthetic data for text localisation in natural images. arXiv preprint arXiv:1604.06646.
  • [10] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [11] G. Koch, R. Zemel, and R. Salakhutdinov (2015) Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, Vol. 2.
  • [12] S. Levine, C. Finn, T. Darrell, and P. Abbeel (2016) End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373.
  • [13] D. Misra, J. Langford, and Y. Artzi (2017) Mapping instructions and visual observations to actions with reinforcement learning. arXiv preprint arXiv:1704.08795.
  • [14] R. Rahmatizadeh, P. Abolghasemi, A. Behal, and L. Bölöni (2018) From virtual demonstration to real-world manipulation using lstm and mdn. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • [15] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788.
  • [16] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99.
  • [17] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf (2014) DeepFace: closing the gap to human-level performance in face verification. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–1708.
  • [18] S. G. Venkatesh and B. Amrutur (2019) One-shot object localization using learnt visual cues via siamese networks. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 6700–6705.
  • [19] T. Yu, C. Finn, A. Xie, S. Dasari, T. Zhang, P. Abbeel, and S. Levine (2018) One-shot imitation from observing humans via domain-adaptive meta-learning. arXiv preprint arXiv:1802.01557.
  • [20] T. Yu, C. Finn, A. Xie, S. Dasari, T. Zhang, P. Abbeel, and S. Levine (2018) One-shot imitation from observing humans via domain-adaptive meta-learning. arXiv preprint arXiv:1802.01557.
  • [21] T. Zhang, Z. McCarthy, O. Jow, D. Lee, X. Chen, K. Goldberg, and P. Abbeel (2018) Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8.