To build generalist robots that can operate on novel objects they have never been exposed to, we must create vision systems that can recognize new objects from a few examples. One way to specify a novel object of interest is to use a computer interface, such as a touchscreen or a keyboard and mouse, to produce a cropped image that contains the object. However, it may be more natural to specify the object of interest using visual cues, such as pointing at it with a hand or a laser pointer. We consider the problem of learning to infer which novel object is being pointed at with a visual cue and then localizing this object so that the robot can operate on it.
One possible approach to building such a vision system is a pipeline: a visual cue detector finds the visual cue in the image, and the object is cropped from the image using the location of the cue. The cropped image can then be used by the robot to look for and manipulate the object of interest in any environment. The problem with such a pipelined approach is that it does not generalize easily to new visual cues. Furthermore, errors in successive stages add up, leading to poor overall performance. Errors in localizing the visual cue could result in the object of interest not being centered in the cropped image, which potentially leads to poor performance in the next stage, whereas in an end-to-end network the later layers naturally learn to compensate for variations in the earlier layers. This problem has been observed in a wide range of tasks from scene text detection to speech recognition, where end-to-end neural networks have outperformed earlier pipelined approaches. Thus, we seek to build an end-to-end neural network that takes in the image where the object of interest is highlighted with a visual cue and the image where the novel object must be localized, and directly outputs the desired location.
Another reason to perform the object localization with an end-to-end differentiable network is that it can then be used as part of a larger imitation learning system. Imitation learning aims to learn to control robots by imitating a human expert. Training data is gathered by recording the activity of a teleoperated robot controlled by an expert. The camera feed and the current state of the robot are given to a neural network, which predicts commands for the actuator. The network architecture in this approach may be broken into two components: (a) vision layers, and (b) control layers. The vision layers are a few convolutional layers followed by a spatial softmax layer that outputs 2D locations of points of interest in the image. The control layers are a few fully connected layers that consume the localization information and produce the actuation signal. In this work, we do not use a neural network for actuation. The neural network only localizes the object of interest, and the pick-and-place operation follows a pre-planned path.
In order to detect novel objects, there are broadly two approaches based on deep learning: meta learning and Siamese networks. Both approaches train on a few instances of a large number of objects with the intent of generalizing to new, unseen objects during inference. Meta learning aims to discover initial weights for the network which, when fine-tuned with a few steps of gradient descent using the novel object as input, result in a network that can detect the new object. Siamese networks use twin convolutional networks with shared weights to directly learn feature representations that can discriminate between similar and dissimilar objects. We propose using a Siamese network with attention to infer the new object of interest and then to find that object in a new scene (Fig. 1). Note that this is a weakly supervised learning problem since the ground truth location of the visual cue is not available and only the location of the object of interest in the new environment is provided during training (Fig. 1), so direct application of approaches like template matching is not possible.
The primary contribution of this paper is a system where users can specify an object of interest to a robot using a laser pointer, which the robot can then manipulate. We evaluate our approach by having a simulated robot pick-and-place novel objects from a synthetic dataset. We also evaluate the performance of the one-shot localization network on a dataset derived from the Omniglot handwritten character dataset and on a small dataset of toys.
II Related Work
II-A Meta Learning
Meta learning, or learning to learn, takes a distribution of tasks during training and produces a quick learner that can generalize to new tasks from a few examples. Domain-adaptive meta learning has been used with imitation learning to learn to imitate by observing a single demonstration performed by a human expert. This differs from plain meta learning in that the loss function used to update the weights of the network must also be learnt, because labels are unavailable for the adaptation input, and this happens through higher-order derivatives. Comparing two images to determine if they are of the same object can be thought of as a distribution of tasks, each corresponding to determining if the image corresponds to a particular object. When examples of a new object are given, a few gradient descent update steps adapt the neural network to look for the new object. Despite being a powerful architecture for quickly learning new tasks in a handful of iterations, the network is harder to train, suffers from vanishing or exploding gradients, and is highly susceptible to the random initialization seed, as has been documented in the literature. We have so far been unsuccessful in using meta learning to localize novel objects specified by visual cues.
II-B Siamese Networks
Siamese networks are twin neural networks that share weights. The feature vectors corresponding to the two images are computed by passing them through the same convolutional network. These vectors are used to perform a binary classification indicating whether the two images are similar. We use the Siamese network to look for and localize a novel object of interest, but we combine it with attention so that the object of interest can be inferred from a larger scene rather than directly provided to the network in a small image patch. The attention mechanism can be thought of as matching a query against a table of key-value pairs. The query is compared against each key to obtain a score, and the weighted average of the corresponding values gives the value of interest. We use this to extract the feature vector of the object of interest from the feature map of a larger image in which the object is present.
A closely related method for determining similarity between images is metric learning with the triplet loss. During training, an anchor image, a positive image (similar to the anchor), and a negative image (dissimilar to the anchor) are used. The distance between the feature vectors of the positive and anchor images is minimized, while that between the anchor and negative images is maximized. Although the architecture we propose is similar to the Siamese network, as we shall see in the next section, the implicit loss it uses for learning is closer to the triplet loss.
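As an illustration, the hinge form of the triplet loss can be sketched in a few lines (a minimal sketch with toy 2-D embeddings and a margin of 1.0; these values are placeholders for illustration, not the settings of any cited work):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: penalize the embedding unless the
    positive is at least `margin` closer to the anchor (in squared
    distance) than the negative."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings: the positive is near the anchor, the negative is far,
# so the margin constraint is satisfied and the loss is zero.
a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])
n = np.array([3.0, 0.0])
loss = triplet_loss(a, p, n)
```

Swapping the positive and negative violates the margin and yields a positive loss, which is the signal that pulls matching pairs together and pushes mismatched pairs apart.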
The triplet loss has been used to sort novel objects into buckets, but with a grasp-first-then-recognize workflow. In this work, we first localize the object of interest and then pick it up.
III Similarity Learning for Novel Object Localization
An image $x_c$ containing the object highlighted by a visual cue is drawn from the distribution $p_c$, i.e., $x_c \sim p_c$. Likewise, $(x, u) \sim p$, where $x$ is an image containing the object (without any visual cue) at position $u$. We now have a supervised learning problem with loss function $L(h_\theta(x_c, x), u)$, where $\theta$ denotes the model parameters.
To begin with, consider the problem of classifying whether two images are similar or dissimilar (Fig. 2). If the two images are $x_1$ and $x_2$, we would like to learn a function $g(x_1, x_2)$ that scores the similarity between them. One way of constructing $g$ is by using a convolutional network $f$. The same convolutional network is applied to both images to obtain feature vectors $f(x_1)$ and $f(x_2)$ respectively. The distance between these two vectors is a measure of similarity,

$g(x_1, x_2) = d\big(f(x_1), f(x_2)\big).$ (1)
The distance function $d$ in Eqn. (1) can be a simple function like a norm of the difference, or a more complex learnable function such as a neural network.
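A minimal sketch of this similarity function, with a fixed linear map standing in for the shared convolutional network (the embedding and its dimensions are illustrative assumptions, not the architecture used in this work):

```python
import numpy as np

def embed(x, w):
    # Stand-in for the shared convolutional network f; in the paper this
    # is a CNN, and the weights w are shared between both inputs.
    return np.tanh(w @ x)

def similarity(x1, x2, w):
    # Siamese scoring: embed both inputs with the SAME weights, then use
    # negative L2 distance between embeddings (larger = more similar).
    return -np.linalg.norm(embed(x1, w) - embed(x2, w))

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))      # shared weights
x = rng.normal(size=8)
s_same = similarity(x, x, w)     # identical inputs embed identically
s_diff = similarity(x, rng.normal(size=8), w)
```

Because the weights are shared, identical inputs always score higher than differing ones, which is the property the binary classifier is trained to exploit.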
We now turn to the problem of localizing a novel object in an image. Here, the exemplar image $z$ contains the object of interest, and we would like to find the location of this object in a larger image $x$ (Fig. 3). A natural way of addressing this problem is to slide a window over the larger image and compare each window with $z$. The location of $z$ inside the large image is found when we get a matching window. If the convolutional network $f$ is fully convolutional, i.e., the network does not use padding (all the convolutions are "valid" convolutions), then the larger image can be passed through the same convolutional network to obtain a feature map $f(x)$, where each "pixel" in the resulting feature map is identical to the feature vector that would have been obtained from the corresponding window in the input image. Subsequently, the feature vector $f(z)$ is compared with each pixel of the feature map to determine where the object of interest lies in $x$. The similarity score for the window at position $(i, j)$ is $s_{ij} = d\big(f(z), f(x)_{ij}\big)$.
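The sliding-window comparison thus reduces to scoring the exemplar's feature vector against every spatial position of the feature map. A minimal numpy sketch, using negative squared L2 distance as one simple choice of distance function:

```python
import numpy as np

def similarity_map(query_vec, feature_map):
    # Compare the query feature vector against every "pixel" of the
    # (h, w, c) feature map; larger score = more similar.
    diffs = feature_map - query_vec          # broadcasts over (h, w)
    return -(diffs ** 2).sum(axis=-1)

rng = np.random.default_rng(0)
fmap = rng.normal(size=(6, 6, 8))            # feature map of the scene
q = fmap[4, 1].copy()                        # query matching one location
scores = similarity_map(q, fmap)
best = np.unravel_index(scores.argmax(), scores.shape)  # -> (4, 1)
```

The hard argmax here is only for illustration; the network itself uses a differentiable soft version, described below.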
Finally, we turn to the problem where the object of interest must also be inferred from a large image (Fig. 4). We would like to infer which object in the image is being highlighted by a visual cue. We use attention to infer the highlighted object. The resulting spatial attention map is used to obtain a weighted average of the feature map, which gives the feature vector of interest.
We now explain the procedure for obtaining the feature vector of interest. The attention score corresponding to pixel $(i, j)$ of the feature map $F$ of the image containing the visual cue is

$e_{ij} = a(F_{ij}),$

where $a$ is a small neural network comprising 1×1 convolutions (bottleneck layers) and ReLUs whose output has a single channel. These attention scores are normalized to the range $(0, 1)$ with the spatial softmax function,

$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k,l} \exp(e_{kl})}.$

The weighted average of the feature map computed using the normalized attention scores,

$\bar{f} = \sum_{i,j} \alpha_{ij} F_{ij},$

gives the feature vector of the object being highlighted.
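This attention pooling can be sketched in a few lines of numpy (the scoring function below is a hand-picked stand-in for the small bottleneck scoring network, chosen only so the example is easy to check):

```python
import numpy as np

def attention_pool(feature_map, score_fn):
    # Score each pixel of the (h, w, c) feature map, normalize the
    # scores with a spatial softmax, and return the attention-weighted
    # average of the feature vectors.
    h, w, _ = feature_map.shape
    scores = np.array([[score_fn(feature_map[i, j]) for j in range(w)]
                       for i in range(h)])
    scores = scores - scores.max()           # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return (feature_map * weights[..., None]).sum(axis=(0, 1))

# Toy check: a scorer that reads the first channel singles out the one
# pixel where that channel is large, so the pooled vector is close to
# the feature vector at that pixel.
fmap = np.zeros((4, 4, 3))
fmap[1, 2] = np.array([10.0, 5.0, -1.0])
vec = attention_pool(fmap, lambda f: f[0])
```

Because the softmax is taken over all spatial positions, a single strongly cued pixel dominates the average, which is exactly the behavior wanted for picking out the highlighted object.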
Once the feature vector corresponding to the object of interest has been obtained, localization of the object in the new image may be performed as described earlier.
IV Network Architecture
The proposed neural architecture for one-shot object localization is shown in Fig. 5. Two images are given as input to the neural network. The adaptation input (upper image in Fig. 5) contains the object of interest and several distracting objects. The desired object is highlighted by a visual cue. This object must be localized in the target environment (lower image in Fig. 5).
The adaptation input is passed through the fully convolutional Siamese network (Fig. 6) to obtain a feature map. The spatial attention mechanism described in the previous section and detailed in Fig. 8 attends to the visual cue, and the feature vector corresponding to the object of interest is extracted from the feature map. The target image is passed through the same Siamese network to obtain its feature map. The feature vector corresponding to the object of interest is combined with this feature map by element-wise multiplication. The combined feature map is then passed through a few bottleneck layers, whose output channels are the similarity score maps.
We now consider how the similarity score maps may be used to localize the object of interest via a differentiable function. If we assume that one and only one object of interest is present in the target image, then the spatial soft-argmax function may be used to extract the position of the object from the score maps.
Equations (8) and (9) describe the soft-argmax operation, which is merely a soft and differentiable version of the argmax operation. Note that, in intuitive terms, the soft-argmax operation asks the question "Which window in the image is most similar to the object of interest?" and not the question "Is this window similar to the object of interest?". This formulation is thus more similar to the triplet loss than to a Siamese network based binary classifier. The distinction is relevant because the former makes the assumption that one and only one object of interest is present in the image.
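A minimal numpy sketch of the spatial soft-argmax (a softmax over all pixels of a score map, followed by the expected coordinate under that distribution):

```python
import numpy as np

def spatial_soft_argmax(score_map):
    # Softmax over all spatial positions, then the expected (row, col)
    # coordinate; differentiable, unlike a hard argmax.
    h, w = score_map.shape
    flat = score_map.reshape(-1)
    flat = flat - flat.max()                 # numerical stability
    probs = (np.exp(flat) / np.exp(flat).sum()).reshape(h, w)
    rows, cols = np.mgrid[0:h, 0:w]
    return (probs * rows).sum(), (probs * cols).sum()

# A sharp peak at (2, 3) yields an expected coordinate close to (2, 3);
# a flatter map would pull the estimate toward the center of the map.
scores = np.zeros((5, 5))
scores[2, 3] = 10.0
r, c = spatial_soft_argmax(scores)
```

This also makes the one-object assumption concrete: with two equally strong peaks, the expected coordinate falls between them rather than on either object.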
The combined vector containing the 2D points of interest is passed through a linear layer to obtain the location of the object of interest. Note that no supervision about the visual cue or its location is provided during training. The inputs are the adaptation input containing the visual cue and the new environment, and the label is the location of the object of interest in the new environment. Mean squared error is the loss function used for training.
V Experiments
We conducted experiments to evaluate the performance of the Siamese network architecture for one-shot object localization. In all the experiments, the network in Fig. 5 was trained using the Adam optimizer with a learning rate of 1e-4.
V-A Localizing novel characters from the Omniglot dataset
The Omniglot dataset is a collection of handwritten characters from 50 alphabets of languages across the world. Each alphabet has from 15 to over 40 letters. There are exactly 20 handwritten instances of each letter across all the alphabets. The 50 alphabets are split into a background set of 40 alphabets and an evaluation set of 10 alphabets. The characters in the background set are used for training and validation, whereas the characters from the evaluation set are used for testing.
We considered the problem of localizing characters from the Omniglot dataset. A synthetic dataset is constructed for the localization experiment. To construct an adaptation input, instances of four different characters are chosen from the training set and placed at random non-overlapping locations on a blank canvas of 150×150 px. One of the placed characters is highlighted with a red dot at its center to simulate a laser pointer. Note that the red marker is not always at the exact center of the character because of the slight variation in the way the characters have been written. The new environment is also built starting from a blank 150×150 px image. A randomly chosen instance of the previously marked character is placed at a random location, and its location is recorded. Three additional characters, each chosen at random either from the previously placed characters or from training-set characters not in the adaptation input, are placed at non-overlapping locations.
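The non-overlapping placement used to build these synthetic canvases can be sketched by rejection sampling (the 150 px canvas follows the setup above; the 28 px patch size and retry limit are illustrative assumptions):

```python
import random

def place_non_overlapping(n, canvas=150, size=28, max_tries=1000, seed=0):
    # Rejection-sample n top-left corners for size x size patches so
    # that no two patches overlap on the canvas.
    rng = random.Random(seed)
    boxes = []
    for _ in range(n):
        for _ in range(max_tries):
            x = rng.randrange(0, canvas - size)
            y = rng.randrange(0, canvas - size)
            # Two axis-aligned squares overlap iff they are closer than
            # `size` along BOTH axes.
            if all(abs(x - bx) >= size or abs(y - by) >= size
                   for bx, by in boxes):
                boxes.append((x, y))
                break
        else:
            raise RuntimeError("could not place all patches")
    return boxes

boxes = place_non_overlapping(4)   # four character locations
```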
Figure 9 shows a sample output from an example in the test set containing a novel character highlighted with a red marker. We see that the attention mechanism focuses on the red marker, and the network has successfully localized the novel character in the new environment. With the width and height of the image normalized to 1, the mean squared error in localizing the marked character (i.e., the squared distance between the actual center of the character and the predicted position) across 256 examples constructed from characters in the test set is 0.002 (3.1% of the width of the image). To evaluate the resilience of the network to small variations in the position of the red dot, we add random jitter of 33% of the character size to the position of the red dot. Despite the added jitter, the mean squared error only slightly increases to 0.0024 on the test set. We also examine the localization performance with a different visual cue. With a large green marker above the character (Fig. 10), the mean squared error on the test set is 0.003.
V-B Localizing novel objects on a table in PyBullet simulator
In order to test the proposed one-shot object localizer with a robot, we created an environment in the PyBullet physics simulator (videos are available at http://ece.iisc.ernet.in/~sagar/iros19). The robot used is a Kuka robot with a gripper attached so that it can pick and place objects on a table (Fig. 11). We constructed 12 cuboid-like objects (Fig. 12) and used 8 of them for training and evaluation and the remaining 4 for testing. Each of the training shapes is painted with 384 random colors to obtain 3072 distinct objects for training. Likewise, the 4 shapes in the test set are each paired with 128 random colors (not intersecting with the training colors) to obtain 512 distinct test objects. Similar to the previous experiment, 4 distinct objects are spawned on the table and one of the objects is marked with a red marker (Fig. 13). The new environment is constructed by placing the marked object at a random location along with 3 random distractor objects.
Figure 13 shows a sample output from objects in the test set. The attention layer has correctly focused on the novel object of interest in the adaptation input, and the network has successfully localized this object in the new environment. With the width and height of the image normalized to 1, the mean squared error in localizing the marked object across 256 sample test scenarios is 0.003 (3.8% of the width of the image). The robot's operation was recorded on a 20-sample test set. In all instances, the robot successfully picked up the object and placed it in the tray at the edge of the table. To examine the performance of this network with a smaller dataset, we reduced the number of colors in the training set to 48 and that in the test set to 12. With this truncated dataset, the network overfit, and the mean squared error in localizing objects in the test set grew to 0.010. On a test sample set of size 20, the robot successfully picked and placed 17 objects, but failed in 3 cases. In one of the failure cases, the localization was off by more than 10%, which caused the gripper to collide with the object. In the two other cases, the robot did not pick up the correct object.
V-C Localizing novel toys on a table
We created a dataset of toys (Fig. 14) to evaluate the performance of the proposed approach on real objects. Similar to the previous experiments, four objects are placed on a table, and one of the objects is highlighted with a laser pointer (Fig. 15). The new environment consists of the highlighted object placed at a random location along with 3 random distractor objects. Figure 15 shows a sample output. We see that the attention mechanism is correctly attending to the highlighted object. With the width and height of the image normalized to 1, the mean squared error in localizing the marked objects across 256 sample test scenarios is 0.03. This is considerably higher than in the previous experiments. Of the 256 images, we found that the localization was successful in 216 images but failed in the rest (error over 15% of the size of the image). We found that specular highlights on the objects appeared similar to the laser pointer and confused the attention mechanism (Fig. 16). Among the successfully localized images, the mean squared error in localizing the marked objects was 0.001 (2.2% of the width of the image).
VI Conclusion
Siamese networks are a useful tool for building robot vision systems that can adapt to novel environments. We have demonstrated that we can specify a previously unseen object of interest to a robot by using a laser pointer. This architecture can be extended to scenarios where it is desirable to specify the object of interest by pointing a finger at it. A major limitation of this architecture is the soft-argmax layer, which can only work when there is one and only one object of interest in the image. Addressing this limitation is left for future work.
This project was funded in part by Yaskawa Electric Corporation. Sagar Gubbi was supported by a Visvesvaraya PhD fellowship, MeitY, Govt. of India. The GPU used in this work was provided by NVIDIA. We thank Ishan Dave and Arun Kumar for assistance with the experimental setup. We also thank Shishir N. Y. Kolathaya and Nihesh Rathod for helpful discussions.
References
- Zhang, Tianhao, et al. "Deep imitation learning for complex manipulation tasks from virtual reality teleoperation." 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018.
- Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." Proceedings of the 34th International Conference on Machine Learning, Vol. 70. JMLR.org, 2017.
- Koch, Gregory, Richard Zemel, and Ruslan Salakhutdinov. "Siamese neural networks for one-shot image recognition." ICML Deep Learning Workshop, Vol. 2. 2015.
- Wojna, Zbigniew, et al. "Attention-based extraction of structured information from street view imagery." 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). Vol. 1. IEEE, 2017.
- Lake, Brenden, et al. "One shot learning of simple visual concepts." Proceedings of the Annual Meeting of the Cognitive Science Society. Vol. 33, No. 33. 2011.
- Nichol, Alex, and John Schulman. "Reptile: a scalable metalearning algorithm." arXiv preprint arXiv:1803.02999 (2018).
- Yu, Tianhe, et al. "One-shot imitation from observing humans via domain-adaptive meta-learning." arXiv preprint arXiv:1802.01557 (2018).
- Antoniou, Antreas, Harrison Edwards, and Amos Storkey. "How to train your MAML." arXiv preprint arXiv:1810.09502 (2018).
- Taigman, Yaniv, et al. "DeepFace: Closing the gap to human-level performance in face verification." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
- Bertinetto, Luca, et al. "Fully-convolutional Siamese networks for object tracking." European Conference on Computer Vision. Springer, Cham, 2016.
- Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.
- Zeng, Andy, et al. "Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching." 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018.
- Levine, Sergey, et al. "End-to-end training of deep visuomotor policies." The Journal of Machine Learning Research 17.1 (2016): 1334-1373.
- Kingma, Diederik P., and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014).