Active Perception for Ambiguous Objects Classification

Recent visual pose estimation and tracking solutions provide notable results on popular datasets such as T-LESS and YCB. However, in the real world, we can find ambiguous objects that do not allow exact classification and detection from a single view. In this work, we propose a framework that, given a single view of an object, provides the coordinates of a next viewpoint to discriminate the object against similar ones, if any, and eliminates ambiguities. We also describe a complete pipeline from a real object's scans to the viewpoint selection and classification. We validate our approach with a Franka Emika Panda robot and common household objects that feature such ambiguities. We release the source code to reproduce our experiments.




I Introduction

Object recognition remains a fundamental skill for robots that operate in complex environments. For example, consider the task “fetch an apple”; this simple task requires first detecting and classifying an apple among other fruits. The task becomes more problematic when we ask the robot to “fetch a ripe apple”, which requires the robot to examine the apple from different views to assess the ripeness.

Several objects, especially those used in the packaging of goods, may have a distinguishing feature only on one side (e.g., paint color or instant noodles flavor). Fig. 1 shows an example of food boxes that show the flavor on one side only, while they look identical from the other sides. In this case, a robot must classify objects between the different flavors; unfortunately, this may be impossible when the objects face visually identical sides (e.g., when they are stored on a shelf). Object classification from visual data often employs neural network-based systems trained on public image datasets. However, public datasets do not include objects that are visually identical except for a single side; hence, different ambiguous objects may fall into the same class. Thus, we must create a specific dataset (e.g., employing a 3D scanner) to train a classifier for our purpose.

Fig. 1: Example of objects visually identical except for a single side.

Consider again the example in Fig. 1. An additional problem is that the training data will include similar input images with different labels (i.e., the backsides of the boxes are visually identical), jeopardizing the convergence of the training. The considerations above led to the following research questions:

  • Question 1 In the presence of ambiguous objects, should we exclude the non-unique views from the training data? How can we exclude them correctly and automatically? Training on the whole dataset might not converge due to different labels on identical inputs. If the images are not identical but very similar, the classifiers might overfit and not generalize correctly to the test data.

  • Question 2 Can actively planning for different viewpoints (e.g., a robot that moves the camera) improve the classification of ambiguous objects? How can we implement such an approach? Even assuming that we can correctly train a classifier, the robot camera might face the ambiguous part of the object. If the robot were able to determine that this view would lead to unreliable classification, it could trigger an appropriate fallback behavior. For example, the robot could look at the other side of the object to acquire a non-ambiguous, classifiable view. For this purpose, we would like to estimate, given an object, the next best view for its classification.

Fig. 2: Possible observed views of the object after changing the robot's viewpoint. While the backs of both objects are similar, the front sides are different enough for classification.

In this paper, we address the above questions by proposing a novel active perception pipeline. We propose an image similarity metric based on the embedding of a denoising autoencoder [16] for image reconstruction. The metric is used to estimate the ambiguity of object orientations and to solve the problem of classifying ambiguous objects. Using this ambiguity score, we are able to correctly train classifiers by excluding ambiguous views from the training data. We propose an active perception framework that selects the next best viewpoint to minimize ambiguity and disambiguates objects using the trained classifier.

II Background

II-A Object classification

Object classification is one of the most common tasks in computer vision. Machine learning approaches turned object classification into a data-oriented problem. The community is continuously improving image classification [17, 13] on standard benchmarks such as the ImageNet dataset [3]; many works are based on popular neural networks such as ResNet [6]. A typical training dataset for image classification consists of labeled images. It is common to use a network whose parameters have been obtained by training on a large, public dataset such as ImageNet, and then fine-tune it using additional data from the task of interest. The training process is usually terminated upon achieving a certain classification performance (e.g., in terms of classification accuracy) on a validation dataset that differs from the training one. If object views are annotated with the pose of the camera, it is possible to cast the pose estimation problem as a classification or regression problem [15].

II-B Object pose estimation with Augmented Autoencoder

Deep Convolutional Neural Networks (CNNs) have been successfully employed for 6D object pose estimation by regressing 2D keypoints on the object, which are then used to estimate the 6D pose via Perspective-n-Point (PnP) techniques. Other techniques relied on denoising autoencoders for image reconstruction in order to extract a meaningful latent representation from RGB images, conditioned on the object orientation, as in Augmented Autoencoders (AAE) [16].


Fig. 3: The training schema for the AAE. The denoising autoencoder is trained to reconstruct the non-augmented rendered view at the same orientation as the augmented input.
Fig. 4: Pose estimation with the AAE. The input image is first encoded into an embedding, which is then compared against a precalculated codebook of embeddings of rendered object views (a grid on SO(3)).

In this context, a denoising autoencoder is trained to reconstruct a clean rendered view of the object at a given orientation, starting from an image of the same view augmented with different light conditions, noise, and small occlusions (see Fig. 3). The autoencoder is based on CNNs and is trained to minimize the error in pixel space between the reconstructed and the non-augmented images:

$$\mathcal{L}_2 = \sum_{i} \left\lVert x^{(i)} - \hat{x}^{(i)} \right\rVert_2^2 \tag{1}$$

where $x^{(i)}$ denotes the $i$-th pixel of the clean target view and $\hat{x}^{(i)}$ the corresponding pixel of the reconstruction.
After the training procedure, the approach stores an array of embeddings generated by encoding the rendered views for uniformly sampled orientations. Such an array is called a codebook. To estimate the object orientation given an input image at runtime (see Fig. 4), one first passes the image through the encoder of the AAE to compute the image embedding $z$. Then, the codebook's closest entry is found by maximizing the cosine similarity between the codebook and image embeddings, i.e.:

$$\mathrm{sim}(z, z_j) = \frac{z \cdot z_j}{\lVert z \rVert \, \lVert z_j \rVert} \tag{2}$$

$$\hat{R} = \operatorname*{arg\,max}_{R_j} \; \mathrm{sim}(z, z_j) \tag{3}$$

where $z_j$ is the codebook embedding associated with the rendered orientation $R_j$.
The autoencoder is trained on rendered data with 3D models of objects.
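As a concrete illustration, the codebook lookup described above can be sketched in a few lines (a minimal NumPy sketch; the `encoder` callable and all names are ours, not those of the released package):

```python
import numpy as np

def build_codebook(encoder, rendered_views):
    """Encode one rendered view per sampled orientation into an embedding matrix."""
    return np.stack([encoder(v) for v in rendered_views])

def estimate_orientation(encoder, codebook, image):
    """Return the codebook index (i.e., the orientation hypothesis) whose
    embedding is most cosine-similar to the embedding of the input image."""
    z = encoder(image)
    sims = codebook @ z / (np.linalg.norm(codebook, axis=1) * np.linalg.norm(z))
    return int(np.argmax(sims)), float(np.max(sims))
```

In the real pipeline the encoder is the convolutional encoder of the trained AAE and the rendered views come from a grid on SO(3); here any callable producing a fixed-size vector works.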

II-C Active perception

Animals often act to gather information about the world through an active perception approach [5]. This approach, applied to robotics, was employed to improve mobile robot localization [2], and for better object tracking in data acquisition [9]. Active vision is also connected to multi-view classification or pose estimation. Recent works [10] show that multi-view solutions improve the results over single-view ones. Other works [11] employed active vision for object classification, but their active vision approach was limited to fusing the data from several fixed camera viewpoints in a passive fashion. Active perception helps to cope with environmental uncertainties when we can move the camera viewpoint. However, correct motion planning and data interpretation might still be challenging.

III Approach Overview

Fig. 5: Proposed pipeline divided into training and inference parts.

In this section, we formalize the problem and briefly describe the developed solution.

We first address how to identify ambiguous orientations of objects. Given a pair of objects A and B, we show how to select views of object A that are the most discriminative against object B, and train a classifier solely on such views. For this purpose, we define an ambiguity rank, which will then be used to compute the next best view to take.

Let $v_A(R)$ be a view of object $A$ at orientation $R$ with respect to the camera frame; the ambiguity of object $A$ at orientation $R$ stems from the existence of an object $B$ such that, at some orientation $R'$, $v_B(R')$ is “visually similar” to $v_A(R)$. This definition implies a certain image similarity metric $S(\cdot, \cdot)$, which we describe in detail in Section IV-B.

Note that the images would differ if taken under different light conditions or backgrounds, or due to noise. We assume that the relative similarity between views at different rotations is mostly unchanged under different light conditions; in other words, we assume that inequalities like $S(v_A(R), v_B(R_1)) > S(v_A(R), v_B(R_2))$ remain true under different illuminations, so we can fix certain light conditions while calculating the image similarity.

Definition 1.

Given two objects $A$ and $B$, the ambiguity of orientation $R$ of object $A$ is defined as the maximum value of a given similarity metric $S$ between $v_A(R)$ and the views of object $B$:

$$\mathrm{amb}_{A|B}(R) = \max_{R'} \; S\!\left(v_A(R),\, v_B(R')\right) \tag{4}$$
The most suitable object orientation that allows unambiguously classifying objects $A$ and $B$ can then be found as the solution of the following minimization:

$$R^\star = \operatorname*{arg\,min}_{R} \; \mathrm{amb}_{A|B}(R) \tag{5}$$
In case we have more than two ambiguous objects $O_1, \dots, O_n$, we can define the ambiguity of orientation $R$ of object $O_i$ as the maximum over all other objects:

$$\mathrm{amb}_{O_i}(R) = \max_{j \neq i} \; \mathrm{amb}_{O_i|O_j}(R) \tag{6}$$
In this paper, we assume for simplicity that we have a pair of ambiguous objects; Eq. (6) above allows trivially extending the developed solutions to a group of three or more objects. We found it easier to analyze and use the ambiguity if it lies in the range $[0, 1]$, so we linearly transform the ambiguity of each object to this range. Now we can define ambiguous and non-ambiguous views.
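The ambiguity of Eq. (4) and its linear rescaling to $[0, 1]$ can be sketched as follows (an illustrative NumPy sketch; the similarity matrix between sampled views of the two objects is assumed precomputed, and the names are ours):

```python
import numpy as np

def ambiguity(sim_matrix):
    """Per-orientation ambiguity: for each sampled orientation of object A
    (rows), take the maximum similarity over all sampled orientations of
    object B (columns), as in Eq. (4)."""
    return sim_matrix.max(axis=1)

def normalize_ambiguity(amb):
    """Linearly map the ambiguity values of one object to [0, 1]."""
    lo, hi = amb.min(), amb.max()
    return (amb - lo) / (hi - lo) if hi > lo else np.zeros_like(amb)
```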

Definition 2.

An orientation $R$ of an object from the group of ambiguous objects is ambiguous if $\mathrm{amb}(R) \geq \bar{a}$ for a chosen threshold $\bar{a}$. It is non-ambiguous otherwise.

A lower similarity between two objects should result in lower ambiguity for views at orientations that show discriminative visual features (e.g., the left-hand side of Fig. 1), and in higher ambiguity for those that are difficult to discriminate (e.g., the right-hand side of Fig. 1).

We now show how to use the ambiguity to select discriminative views and train a classifier that effectively disambiguates between two objects A and B given an image containing one of them. We take a uniformly distributed subset of orientations $\mathcal{R} \subset SO(3)$. Then, we choose a threshold $\bar{a}$ to form the subset $\mathcal{R}_{na} \subseteq \mathcal{R}$ such that $\mathrm{amb}(R) < \bar{a}$ for every $R \in \mathcal{R}_{na}$. We can sort the set of object orientations by their ambiguity values in ascending order (as in Fig. 6) to visually verify that we selected only non-ambiguous orientations.

We use the subset $\mathcal{R}_{na}$ to train the classifier on non-ambiguous data.
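Selecting the non-ambiguous training subset then reduces to sorting and thresholding (an illustrative sketch; the function name is ours):

```python
def split_by_ambiguity(orientations, ambiguities, threshold):
    """Sort orientations by ambiguity (ascending) and keep only those
    strictly below the chosen threshold for classifier training."""
    ranked = sorted(zip(ambiguities, orientations))
    return [o for a, o in ranked if a < threshold]
```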

We first estimate the possible object hypotheses (i.e., $A$ or $B$) and their initial orientations (i.e., $R_A$ or $R_B$), then we compute the related ambiguity. If the resulting ambiguity is above the threshold $\bar{a}$, we move the robot to a more discriminative view.

We train an autoencoder-based model for each group of ambiguous objects (e.g., the pair of boxes with the two flavors in Fig. 2). The trained network is used both for estimating the rotation of the object, as in [16], and for evaluating the similarity metric required to implement Definitions 1 and 2. We search the space of the other object's orientations using the similarity metric to compare the rendered views.

The robot can now perform the active classification of the object (see Fig. 5, inference part). Our input is an RGB image from the robot camera and the current robot pose. We first employ the Faster R-CNN [14] object detector to obtain a crop of the image containing the object of interest, which we input to the autoencoder. In case we have more than one pair of ambiguous objects (e.g., bottles and boxes from Fig. 11), we first have to classify between the categories formed by the ambiguous pairs (e.g., the first category has both bottles, the second category has both boxes). This task can be accomplished with a standard classifier. Once the category of interest is identified, the ambiguity of objects within the category depends on the orientation w.r.t. the camera. Hence, if the current view is ambiguous, we move the robot camera to a more discriminative view.

As mentioned above, we estimate the next best view based on the current robot pose, object pose, and the set of orientations with calculated ambiguities (see Eq.(4)). If the ambiguity of current object orientation is below a given threshold, we can directly apply a classifier trained for this particular category of ambiguous objects.

IV Ambiguity ranking

This section describes the ambiguity ranking procedure (see Fig. 5). We first train the autoencoder-based similarity metric and then estimate the ambiguity rank using this metric.

IV-A Autoencoder-based view similarity

In this work we use the AAE for object rotation estimation and, in addition, we use it to implement the similarity metric as per Definition 1:

$$S\!\left(v_A(R_A),\, v_B(R_B)\right) = \frac{z_A \cdot z_B}{\lVert z_A \rVert \, \lVert z_B \rVert} \tag{7}$$

where $z_A$ and $z_B$ are the joint-AAE embeddings of the two views.
The AAE aims at reconstructing a clean view of the object on a white background, given a crop of the input image containing the object with background, lighting, and occlusion augmentations. The reconstruction occurs by encoding the augmented object view into a latent representation (called embedding) and then decoding the latent representation into the non-augmented object view. To correctly apply the image similarity metric to an object's views, the AAE's latent space must be shared among all similar objects. For this purpose, we trained a single autoencoder to reconstruct all the ambiguous objects of the same group (see Fig. 2). We later refer to this autoencoder as the joint AAE, as views from all ambiguous objects of the same group are jointly used to train it.

We sample object views on a quasi-uniform Fibonacci grid on the sphere around the object, with uniform steps in the in-plane rotation.
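A common construction of such a quasi-uniform Fibonacci grid is the golden-angle spiral (a standard sketch; the paper's exact grid size and in-plane step are not reproduced here):

```python
import math

def fibonacci_sphere(n):
    """Quasi-uniform Fibonacci grid of n viewpoints on the unit sphere,
    built from the golden-angle spiral."""
    phi = math.pi * (3.0 - math.sqrt(5.0))  # golden angle in radians
    pts = []
    for i in range(n):
        y = 1.0 - 2.0 * (i + 0.5) / n       # y strictly inside (-1, 1)
        r = math.sqrt(1.0 - y * y)          # radius of the horizontal circle
        pts.append((math.cos(phi * i) * r, y, math.sin(phi * i) * r))
    return pts
```

Each point is a camera direction; the in-plane rotation is then sampled separately per point.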

Having trained the joint AAE, we can generate a codebook for every ambiguous object of interest (see Sec. II-B).

IV-B Similarity-based view ranking

(a) Three most discriminating view pairs of two objects. The main difference between images, a picture of a bird, is seen clearly and occupies a non-negligible part of the image.
(b) Views from the collection sorted by similarity. Here we see views that lack distinguishable features and views that contain unique features, although in a small part of the view (the picture of a chicken or turkey).
(c) The most similar views as ranked by the trained AAE similarity metric.
Fig. 6: Visualization of the ranking procedure, sorted views from one object (top row) along with the most similar view from the other object (bottom row). It helps to check if we performed the view matching and ranking correctly.

In general, the most similar view pairs might have independent orientations. Therefore, to estimate the initial ambiguity of orientation $R_A$ of object $A$ (i.e., $\mathrm{amb}_{A|B}(R_A)$), we find the most similar view from the other object by solving

$$R_B^\star = \operatorname*{arg\,max}_{R_B} \; S\!\left(v_A(R_A),\, v_B(R_B)\right) \tag{8}$$

using the trained similarity metric as per Eq. (7).

This procedure might be computationally expensive. Thus, we sample fewer views of the object compared to those considered while generating the codebook. Note also that here we do not need to sample orientations that differ only by rotations around the camera's optical axis, as the resulting images are identical up to an in-plane rotation.

(a) Azimuth, elevation, and in-plane rotation axes w.r.t. the object.
(b) Example of the Fibonacci pseudo-uniform sphere grid.
Fig. 7: Fibonacci pseudo-uniform sphere grid.

We replace the cropped image (Fig. 3) with the rendered view $v_A(R_A)$ and take the codebook from object $B$. In order to find the most similar view $v_B(R_B^\star)$, we perform a descent search in the space of rotations, parametrized as Euler angles, in order to maximize the term in Eq. (8).
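The descent search can be sketched as a simple greedy local search over Euler angles with a shrinking step (an illustrative stand-in for the actual optimizer, whose details are not specified here; `score` plays the role of the similarity in Eq. (8)):

```python
def descent_search(score, start, step=0.3, n_iters=50):
    """Greedy local search over Euler angles (a, b, c) maximizing `score`.
    On each iteration, try +/- step on every angle; when no neighbour
    improves, halve the step to refine the solution."""
    best = tuple(start)
    best_s = score(best)
    for _ in range(n_iters):
        improved = False
        for i in range(3):
            for d in (step, -step):
                cand = list(best)
                cand[i] += d
                s = score(tuple(cand))
                if s > best_s:
                    best, best_s, improved = tuple(cand), s, True
        if not improved:
            step *= 0.5
    return best, best_s
```

Like any local search, this only finds a local maximum; starting from the coarse grid match mitigates that in practice.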

We remark that the AAE similarity metric adopted in Eq. (8) was originally designed to estimate the 3D orientation of an object of interest, hence the devised procedure produces correctly aligned view pairs, as can be seen in Fig. 6.

The output of the matching procedure is a list of tuples $(R_A, R_B^\star, S)$, sorted by the similarity metric. A useful feature of this approach is that we can render the view pairs and visually check the correctness of the matching and of the sorting by ambiguity (see Fig. 6).

V Training classifiers

In this section, we address Question 1. We first show how to split the dataset; then we show how to classify ambiguous objects. To avoid overfitting, we validate the classifiers and select the threshold based on a dataset of real images.

V-A Splitting the dataset

Once we have sorted the object orientations by their ambiguity, we have to select an ambiguity threshold $\bar{a}$ that separates non-ambiguous from ambiguous views. This way, the classifier can be trained only on non-ambiguous views. However, we found it challenging to select the ambiguity threshold given only synthetic images rendered from the 3D model. By plotting the ambiguity for the sorted set of rotations $\mathcal{R}$, we observed that there is no sharp change of ambiguity that could indicate a possible ambiguity threshold (e.g., see the AAE baseline in Fig. 8). Moreover, we found that the training procedure can be easily compromised by imperfections in the 3D models, causing overfitting. Therefore, we compute the accuracy on a small dataset of real objects, which then helps us in defining the ambiguity threshold.
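Operationally, the threshold selection can be sketched as a small search over candidate thresholds scored on the real-image validation set (illustrative; `train_and_eval` is a hypothetical stand-in for training a classifier on views below the given threshold and returning its validation accuracy):

```python
def select_threshold(candidate_thresholds, train_and_eval):
    """Pick the ambiguity threshold whose classifier scores best on a small
    dataset of real images. `train_and_eval(t)` trains on views with
    ambiguity < t and returns the validation accuracy."""
    scored = [(train_and_eval(t), t) for t in candidate_thresholds]
    best_acc, best_t = max(scored)
    return best_t, best_acc
```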

V-B Classification of ambiguous objects and between the categories of ambiguous objects

In this work, we mainly focus on a classification inside a pair or a group of ambiguous objects. However, we might face a situation when we have more than one category of ambiguous objects and some other non-ambiguous objects.

VI Inference

In this section, we address Question 2. First, we need to determine if an acquired view is ambiguous. In that case, we have to find the camera pose where we expect to observe the least ambiguous view taking into account robot motion constraints. However, after this additional movement, the new view may still be more ambiguous than expected, for example due to errors in the estimation of the object orientation. In general, we cannot expect to move the robot to the best configuration in one step, and several movements could be required to converge to a configuration that is suitable for the classification of the object identity. The overall process consists of several stages that we report hereafter:

Object detection

We first crop the image as both the AAE and our classifier require a square crop with the object in the center.

Classification between categories of ambiguous objects

In the case of several groups of ambiguous objects, we must first identify in which group a given object falls. As we trained different autoencoder weights for each group, this step of classification provides us with the correct choice of weights for both the joint autoencoder and the in-group classifiers.

Object orientation and view ambiguity estimation

To estimate the view ambiguity, we need to first estimate the object orientation. Hence, we need to pass the image through the encoder of the AAE to get the associated embedding and maximize the similarity between this embedding and the object codebook that has been evaluated offline (see Sec. II-B). As we have at least two in-group classes, each with an associated codebook to be used to estimate the orientation of the object (see Sec. II-B), we may get different ambiguity ranks given that they depend on the orientation and on the in-group object class. If the mean of these ambiguity ranks is below the ambiguity threshold (see Sec. IV-A), we apply an in-group classifier and return the result. Otherwise, we have to move the robot camera to an appropriate pose.
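This orientation-and-ambiguity step can be sketched as follows (illustrative NumPy code; the per-class codebooks and per-orientation ambiguity tables are assumed precomputed offline, and all names are ours):

```python
import numpy as np

def view_ambiguity(embedding, codebooks, ambiguity_tables):
    """For each in-group class, find the closest codebook entry (orientation
    hypothesis) via cosine similarity, look up its precomputed ambiguity
    rank, and return the hypotheses together with their mean ambiguity."""
    hyps, ranks = [], []
    for cls, cb in codebooks.items():
        sims = cb @ embedding / (np.linalg.norm(cb, axis=1) * np.linalg.norm(embedding))
        j = int(np.argmax(sims))          # orientation hypothesis for this class
        hyps.append((cls, j))
        ranks.append(ambiguity_tables[cls][j])
    return hyps, sum(ranks) / len(ranks)
```

If the returned mean ambiguity is below the threshold, the in-group classifier is applied directly; otherwise the next-best-view step is triggered.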

Next best view for object classification

The next view of the object must be such that the classification ambiguity is at least reduced. However, the robot must deal with reachability constraints and avoid collisions with the surrounding objects. Therefore, the possible view orientations are limited to the set $\mathcal{R}_c$ satisfying these constraints. Hence, we compute the next best camera-to-object orientation with the following minimization problem:

$$R^\star = \operatorname*{arg\,min}_{R \in \mathcal{R}_c} \; \frac{1}{|\mathcal{H}|} \sum_{O \in \mathcal{H}} \mathrm{amb}_O\!\left(R_O(R)\right) \tag{9}$$

where $\mathcal{H}$ is the set of object hypotheses and $R_O(R)$ denotes the orientation of hypothesis $O$ expected when the camera moves to orientation $R$.
Here we use the mean to compute the combined ambiguity rank of the expected views under the different object class and orientation hypotheses. Having found the next best robot pose, we move the robot, acquire the new camera input, and repeat the inference steps. We terminate the robot movements when we achieve a view with an ambiguity lower than the desired threshold or when the current view is the least ambiguous among the possible robot poses ($R^\star$ equals the current camera orientation).
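The next-best-view selection and termination logic of Eq. (9) can be sketched as (illustrative; `ambiguity_of` is a hypothetical stand-in for the precomputed ambiguity of the view expected under a given hypothesis):

```python
def next_best_view(reachable, hypotheses, ambiguity_of, current, threshold):
    """Pick the reachable camera orientation minimizing the mean ambiguity
    over all object hypotheses, cf. Eq. (9). Returns None when the current
    view is already classifiable or is itself the least ambiguous option."""
    def mean_amb(view):
        return sum(ambiguity_of(h, view) for h in hypotheses) / len(hypotheses)

    if mean_amb(current) < threshold:
        return None                      # current view is good enough: classify
    best = min(reachable, key=mean_amb)
    return None if best == current else best
```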

VII Comparison against other feature extraction and matching methods

In this section, we compare our autoencoder-based similarity with other metrics. We evaluated the similarity metrics on view pairs $(v_A(R_A), v_B(R_B^\star))$, where the second view in each pair is the most similar to the first one according to the AAE-based image similarity.

(a) Mean squared error
(b) SIFT descriptors
(c) ResNet embedding
(d) Different light conditions
Fig. 8: Different similarity metrics w.r.t. the AAE baseline. The X axis shows the indexes of view pairs sorted by the AAE ambiguity rank; the Y axis shows the scaled similarity metric for the pair indexed by X.
(a) An ambiguity plot for the emulated mustard bottle pair. The three plots were acquired with different numbers of steps in the descent-search part of the ambiguity ranking. A higher number of steps makes the ambiguity rank approach one and results in a smoother plot.
(b) Example of views that have low ambiguity. The modified part of the texture is visible and covers a significant part of the object view.
(c) Example of views with ambiguity close to the threshold. The views are ambiguous, as they only differ in a small part of the view.
(d) Example of ambiguous views. We see an unmodified part of the object; therefore the similarity approaches one.
Fig. 9: Experiments with the YCB mustard bottle and the same bottle with a modified texture.

The first metric we compare to is the pixelwise mean squared error (MSE) between the images of the two views (Fig. 8(a)). The second metric is based on a comparison between the SIFT descriptors [8] evaluated on the two images (Fig. 8(b)). The third metric is the cosine similarity between the embeddings extracted from the two images using a CNN-based feature extractor; specifically, we adopted the ResNet-50 architecture with the last fully connected layer removed (see Fig. 8(c)). We observed that the plots corresponding to the three metrics barely correlate with the AAE baseline (see Fig. 8); this means that these metrics are unable to discriminate ambiguous from non-ambiguous views.
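For reference, the two simplest baselines reduce to a few lines (illustrative NumPy versions of the MSE and embedding-cosine metrics; the SIFT baseline additionally requires a feature library and is omitted):

```python
import numpy as np

def mse_similarity(a, b):
    """Pixelwise MSE turned into a similarity score (higher = more similar)."""
    return -float(np.mean((a - b) ** 2))

def cosine_similarity(za, zb):
    """Cosine similarity between two embedding vectors (e.g., ResNet features)."""
    return float(za @ zb / (np.linalg.norm(za) * np.linalg.norm(zb)))
```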

In addition, we checked how different light conditions affect ambiguity discrimination. To this end, we rendered the views while varying the light intensity. For a broad range of light intensities, the ability to discriminate between the two objects seems mostly unaffected (see Fig. 8(d)).

VIII Experiments

In this section, we present the validation of our approach with a set of experiments, both in simulation and on a real robot. We implemented the whole pipeline for training and inference as a Python package, available online. The package includes an implementation of the autoencoder-based similarity metric using the PyTorch Lightning [4] framework while following the original work on AAE, including the augmentations [16]. We fine-tune a ResNet-18 CNN classifier pre-trained on the ImageNet dataset. For object detection, we use a pre-trained Faster R-CNN network from the Torchvision package [12].

VIII-A Simulated experiment

As mentioned earlier, existing datasets do not contain pairs of objects with substantial ambiguity. For this reason, we generated data featuring pairs of ambiguous objects using the publicly available 3D model of the “mustard bottle” object from the YCB model set [1]. We used both the original 3D mesh of the object and a modified version of it where we introduced a variant in the texture of the front side. The resulting pair of objects has some distinguishing features on the front side and several completely identical views (see Fig. 9(b)). The ambiguity rank of such identical views must be exactly one. Fig. 9(a) shows the obtained ambiguity plot. The ambiguity approaches one for half of the object orientations (half of the sorted views in Fig. 9(a)); therefore, a value close to one should be used as an ambiguity threshold for this pair of objects. Fig. 9 shows examples from the sorted view pairs. The lowest-ranked views look visually the most distinguishable, as they show the modified texture (Fig. 9(b)), while the views with higher ambiguity do not feature the modified texture (Fig. 9(d)); this is also corroborated by the shape of the plot in Fig. 9(a). Near the middle of the sorted views, where the ambiguity approaches one, the differences between the two views are very small (Fig. 9(c)). Hence, in this case, we can choose a similarity threshold close to one: only on views with lower ambiguity would we see the non-ambiguous feature. The proposed tests in simulation show that using the autoencoder to rank the ambiguity of the views is sound, as formulated in Section III.

VIII-B Experimental setup

Despite the lack of ambiguous objects in the available datasets, pairs of ambiguous objects can easily be found in the real world, e.g., among the packaged food available in supermarkets. For our experiments, we chose two groups of ambiguous objects: a pair of food boxes and a pair of bottles (see Fig. 11). We retrieved the 3D models using a commercially available 3D scanner, namely the Shining 3D scanner. Although the reconstructed models had small artifacts, their quality was good enough for our purposes. We used a Franka Emika Panda robot with an Intel RealSense D415 camera, mounted on the end-effector, to acquire the image views.

VIII-C Classification test on recorded data

In a first set of experiments, we acquired a set of images using the camera mounted on the robot in several fixed poses. The poses were chosen such that the object appeared in the center of the image plane. Furthermore, we assumed to know the Cartesian position of the object with respect to the robot reference frame. This assumption does not affect the verification of the pipeline but simplifies the setup and the analysis.

We recorded data by following a fixed trajectory along several parallel circles on a sphere centered at the Cartesian position of the object. Each trajectory was parametrized in terms of azimuth and elevation angles. For each value of the elevation, we sampled uniformly along the space of azimuth angles (see Fig. 7(b)). Poses of the end-effector that were not reachable, due to constraints in the joint configuration, were not considered while acquiring the data. The collected data consisted of the images captured from the robot camera along with the relative pose between the camera and the object. We collected a small dataset of object views from different sides, including ambiguous and non-ambiguous views. The trajectory consisted of three parallel circles (i.e., with constant elevation) on the sphere (see Fig. 7(b)). We sampled points uniformly on each parallel circle. We repeated the execution four times for each object, each time considering a different orientation of the object with respect to the robot root frame.
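The acquisition trajectory can be sketched as camera positions on parallel circles of a sphere centred on the object (illustrative; the radius and angle sets below are placeholders, not the values used in the experiments):

```python
import math

def circle_viewpoints(radius, elevations, n_azimuth):
    """Camera positions on parallel circles (one per fixed elevation) of a
    sphere centred on the object, sampled uniformly in azimuth."""
    poses = []
    for el in elevations:
        for k in range(n_azimuth):
            az = 2.0 * math.pi * k / n_azimuth
            poses.append((radius * math.cos(el) * math.cos(az),
                          radius * math.cos(el) * math.sin(az),
                          radius * math.sin(el)))
    return poses
```

Unreachable poses would then be filtered out with the robot's kinematics before acquisition.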

Fig. 10: The performance of the best classifier for each training ambiguity threshold (X axis) and for different parts of the training data (below a certain ambiguity rank, see legend).
Fig. 11: Two groups of ambiguous objects that were used in the experiments.

We then trained the object classifiers on the collected data while varying the threshold that separates ambiguous from non-ambiguous views. Fig. 10 shows the results in terms of classification accuracy. The classification performance drops significantly when the training data include object orientations with high ambiguity. The objects used in the real experiment have visually distinguishable features on one side only. We found better classification performance when training on the non-ambiguous part of the object's views than on the whole object. This is also clear from Fig. 10, as the plots for evaluation on more ambiguous parts of the recorded data lie under the plot for lower-ambiguity test data. Moreover, classifiers trained on both ambiguous and non-ambiguous orientations tend to predict one class with high certainty, irrespective of the actual identity within the group of ambiguous objects.

VIII-D Active classification results

In a second set of experiments, we tested the active classification capabilities of the proposed pipeline. We performed two types of experiments. First, as we had recorded data for many camera-to-object orientations, we performed an offline active vision test, constraining the robot movements to the recorded positions. We compared this offline experiment against a random baseline in which the robot picked the next view randomly from the recorded poses. Next, we performed an online active vision test, where the robot starts from a random pose, hence a random orientation of the camera, and then tries to reach a pose where the ambiguity rank of the current object view is below the threshold, and classifies the object.

To perform the offline active vision experiments, we recorded images using a similar approach as above, but with a different radius and different steps on the parallel circles on the sphere around the object. We compared the next-best-view selection against random next-view selection; in all cases, we outperform the random selection (see Fig. 12). Note that, as the bottles have a symmetrical shape, they are more challenging for orientation estimation and require additional views.

(a) Plots for the bottles (see Fig. 11).
(b) Plots for the boxes (see Fig. 11).
Fig. 12: Classification success probability in the offline active vision experiment. The X axis represents the ambiguity rank threshold chosen to terminate the active perception. Different plots in the same figure represent the classification performance for different numbers of extra robot movements allowed. Dashed lines represent the performance of the random baseline. For example, the red (highest) plot in Fig. 12(a) shows the fraction of correct active classifications after no more than three extra robot movements.
(a) Initial pose of the robot.
(b) Robot pose after the next best view selection.
Fig. 13: An example of robot movement to improve classification. The robot moves its viewpoint from the back, ambiguous, view of the “chicken box” (from which classification is impossible), to the front, where the box has discriminative features.

In the online active vision experiments, each trial started from a random camera-to-object pose and terminated either upon reaching the ambiguity threshold or when the current robot pose was optimal according to Eq. 9. Overall, experiments were performed, of which only 20 finished with an incorrect classification result, corresponding to a classification accuracy of about . In these experiments, we set the ambiguity threshold to , i.e., if the robot acquired an image evaluated with a rank less than , we stopped the classification procedure. The obtained results correlate with those obtained on the prerecorded data described in Sec. VIII-C when considering the same ambiguity rank (see Fig. 10). An example of the active classification task is shown in Fig. 13.
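The online loop, with its two termination conditions (rank below the threshold, or the current pose already optimal per Eq. 9), can be sketched as follows. Everything here is a hypothetical stand-in: `SimRobot`, `ambiguity_rank`, `next_best_view`, and `classify` are toy replacements for the real arm, the autoencoder scoring, the Eq. 9 criterion, and the final classifier.

```python
# Toy stand-in for the real arm: poses are integers, each with a
# precomputed ambiguity rank (lower = less ambiguous).
class SimRobot:
    def __init__(self, ranks, start):
        self.ranks = ranks
        self.pose = start

def ambiguity_rank(robot):
    return robot.ranks[robot.pose]

def next_best_view(robot):
    # Eq. 9 stand-in: the reachable pose with the lowest ambiguity rank.
    return min(robot.ranks, key=robot.ranks.get)

def classify(robot):
    return "chicken box"  # placeholder label

def active_classify(robot, threshold, max_moves=10):
    for _ in range(max_moves):
        if ambiguity_rank(robot) < threshold:
            return classify(robot)      # view is unambiguous enough: stop
        target = next_best_view(robot)
        if target == robot.pose:        # already optimal (Eq. 9): stop
            return classify(robot)
        robot.pose = target             # move the camera to the next view
    return classify(robot)              # movement budget exhausted
```

The `max_moves` cap is a safety bound for the sketch; in the experiments the loop ends via one of the two stopping conditions.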

IX Conclusions

This paper argued that everyday objects have ambiguous views that make standard classification approaches challenging to apply. We proposed a novel active perception strategy based on view-ambiguity estimation that employs an autoencoder embedding. We validated our approach on a real robot using household objects, demonstrating its feasibility and performance.

In future work, we plan to investigate the use of the autoencoder similarity metric to cluster groups of ambiguous objects within a given dataset, and to extend this work to select the best views for object discrimination in the presence of occlusions. Another research direction is to employ the same similarity metric to handle symmetries in pose estimation.

X Acknowledgements

We thank Fabrizio Bottarel for installing the Franka Emika Panda robot and the associated software ecosystem. This work was supported by the European H2020 project No. 730994 (TERRINet) and ERA-NET CHIST-ERA call 2017 project HEAP.


  • [1] B. Calli, A. Singh, A. Walsman, S. Srinivasa, P. Abbeel, and A. M. Dollar (2015) The YCB object and model set: towards common benchmarks for manipulation research. In 2015 International Conference on Advanced Robotics (ICAR), pp. 510–517. External Links: Document Cited by: §VIII-A.
  • [2] M. Colledanchise, D. Malafronte, and L. Natale (2020) Act, perceive, and plan in belief space for robot localization. In 2020 IEEE International Conference on Robotics and Automation (ICRA), Vol. , pp. 3763–3769. External Links: Document Cited by: §II-C.
  • [3] J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. External Links: Document Cited by: §II-A.
  • [4] W. Falcon et al. (2019) PyTorch Lightning. GitHub. Cited by: §VIII.
  • [5] J. J. Gibson and L. Carmichael (1966) The senses considered as perceptual systems. Vol. 2, Houghton Mifflin Boston. Cited by: §II-C.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. External Links: 1512.03385 Cited by: §II-A.
  • [7] Y. He, W. Sun, H. Huang, J. Liu, H. Fan, and J. Sun (2020) PVN3D: a deep point-wise 3d keypoints voting network for 6dof pose estimation. External Links: 1911.04231 Cited by: §II-B.
  • [8] D. G. Lowe (2004) Distinctive image features from scale-invariant keypoints. International journal of computer vision 60 (2), pp. 91–110. Cited by: §VII.
  • [9] E. Maiettini, G. Pasquale, L. Rosasco, and L. Natale (2017) Interactive data collection for deep learning object detectors on humanoid robots. In 2017 IEEE-RAS 17th International Conference on Humanoid Robotics (Humanoids), pp. 862–868. External Links: Document Cited by: §II-C.
  • [10] K. Ogawara and K. Iseki (2020) Estimation of object class and orientation from multiple viewpoints and relative camera orientation constraints. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vol. , pp. 1–7. External Links: Document Cited by: §II-C.
  • [12] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. External Links: Link Cited by: §VIII.
  • [13] H. Pham, Z. Dai, Q. Xie, M. Luong, and Q. V. Le (2021) Meta pseudo labels. External Links: 2003.10580 Cited by: §II-A.
  • [14] S. Ren, K. He, R. Girshick, and J. Sun (2016) Faster R-CNN: towards real-time object detection with region proposal networks. External Links: 1506.01497 Cited by: §III.
  • [15] M. Schwarz, H. Schulz, and S. Behnke (2015) RGB-D object recognition and pose estimation based on pre-trained convolutional neural network features. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 1329–1335. External Links: Document Cited by: §II-A.
  • [16] M. Sundermeyer, Z. Marton, M. Durner, M. Brucker, and R. Triebel (2018-09) Implicit 3d orientation learning for 6d object detection from rgb images. In The European Conference on Computer Vision (ECCV), Cited by: §I, §II-B, §III, §VIII.
  • [17] M. Tan and Q. V. Le (2020) EfficientNet: rethinking model scaling for convolutional neural networks. External Links: 1905.11946 Cited by: §II-A.