Object recognition remains a fundamental skill for robots that operate in complex environments. For example, consider the task “fetch an apple”; this simple task requires first detecting and classifying an apple among other fruits. The task becomes more problematic when we ask the robot to “fetch a ripe apple”, which requires the robot to examine the apple from different views to assess the ripeness.
Several objects, especially packaged goods, may have a distinguishing feature on only one side (e.g., paint color or instant noodle flavor). Fig. 1 shows an example of food boxes that display the flavor on one side only and look identical from the other sides. In this case, a robot must distinguish between the different flavors; unfortunately, this may be impossible when the objects present visually identical sides to the camera (e.g., when they are stored on a shelf). Object classification from visual data often employs neural network-based systems trained on public image datasets. However, public datasets do not include objects that are visually identical except for a single side. Hence, different ambiguous objects may fall into the same class. Thus, we must create a specific dataset (e.g., employing a 3D scanner) to train a classifier for our purpose.
Consider again the example in Fig. 1. An additional problem is that the training data will include similar input images with different labels (i.e., the back sides of the boxes are visually identical), jeopardizing the convergence of training. The considerations above led to the following research questions:
Question 1: In the presence of ambiguous objects, should we exclude the non-unique views from the training data? How can we exclude them correctly and automatically? Training on the whole dataset might not converge due to different labels on identical inputs. If the images are not identical but very similar, the classifier might overfit and fail to generalize to the test data.
Question 2: Can actively planning for different viewpoints (e.g., a robot that moves the camera) improve the classification of ambiguous objects? How can such an approach be implemented? Even assuming that we can correctly train a classifier, the robot camera might face the ambiguous part of the object. If the robot were able to determine that this view would lead to unreliable classification, it could trigger an appropriate fallback behavior. For example, the robot could look at the other side of the object to acquire a non-ambiguous, classifiable view. For this purpose, we would like to estimate, given an object, the next best view for its classification.
II-A Object classification
Object classification is one of the most common tasks in computer vision. Machine learning approaches turned object classification into a data-oriented problem. The community is continuously improving image classification [17, 13] on standard benchmarks such as the ImageNet dataset, and many works are based on popular neural networks such as ResNet. A typical training dataset for image classification consists of labeled images. It is common to use a network whose parameters have been obtained by training on a large, public dataset such as ImageNet, and then fine-tune it using additional data from the task of interest. Training is usually terminated upon reaching a certain classification performance (e.g., in terms of classification accuracy) on a validation dataset that differs from the training one. If object views are annotated with the pose of the camera, it is possible to cast the pose estimation problem as a classification or regression problem.
II-B Object pose estimation with Augmented Autoencoder
Deep Convolutional Neural Networks (CNNs) were successfully employed for 6D object pose estimation by regressing 2D keypoints on the object, which are then used to estimate the 6D pose via Perspective-n-Point (PnP) techniques. Other techniques relied on denoising autoencoders for image reconstruction in order to extract from RGB images a meaningful latent representation, conditioned on the object orientation, as in Augmented Autoencoders (AAE).
In this context, a denoising autoencoder is trained to reconstruct a clean rendered view of the object at a given orientation, starting from an image of the same view augmented with different light conditions, noise, and small occlusions (see Fig. 3). The autoencoder is based on CNNs and is trained to minimize the pixel-space error between its reconstruction of the augmented image and the non-augmented image.
After the training procedure, the approach stores an array of embeddings generated by encoding the rendered views for uniformly sampled orientations. Such an array is called a codebook. To estimate the object orientation given an input image at runtime (see Fig. 4), one first passes the image through the encoder of the AAE to calculate the image embedding, and then finds the closest codebook entry, i.e., the entry with the highest cosine similarity to the image embedding.
The autoencoder is trained on rendered data with 3D models of objects.
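The codebook lookup described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `cosine_similarity` and `closest_codebook_entry` and the toy two-dimensional embeddings are assumptions made for the example, whereas real AAE embeddings come from the trained encoder and have a much higher dimension. Embeddings are assumed to be non-zero vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def closest_codebook_entry(embedding, codebook):
    """Return (index, similarity) of the codebook row most similar to `embedding`."""
    scores = [cosine_similarity(embedding, row) for row in codebook]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Toy codebook (hypothetical values): three embeddings standing in for
# encoded rendered views at three sampled orientations.
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
index, score = closest_codebook_entry([0.1, 0.9], codebook)
```

The index of the best entry identifies the sampled orientation whose rendered view is closest to the input image in the latent space.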
II-C Active perception
Animals often act to gather information about the world through an active perception approach. This approach, applied to robotics, was employed to improve mobile robot localization and to achieve better object tracking in data acquisition. Active vision is also connected to multi-view classification or pose estimation. Recent works show that multi-view solutions improve the results over single-view ones. Other works employed active vision for object classification, but their approach was limited to fusing the data from several fixed camera viewpoints in a passive fashion. Active perception helps to cope with environmental uncertainties when we can move the camera viewpoint. However, correct motion planning and data interpretation might still be challenging.
III Approach Overview
In this section, we formalize the problem and briefly describe the developed solution.
We first address how to identify ambiguous orientations of objects. Given a pair of objects A and B, we show how to select views from object A that are the most discriminative against object B, and train a classifier solely on such views. For this purpose, we define an ambiguity rank, which is then used to compute the next best view to take.
Consider a view of object A taken at a given orientation with respect to the camera frame; the ambiguity of A at that orientation stems from the existence of another object B such that, at some orientation of its own, B is “visually similar” to A at the given orientation. This definition implies a certain metric of image similarity. In Section IV-B we describe this similarity metric in detail.
Note that the images would be different if taken under different light conditions or backgrounds, or due to noise. We assume that the similarity between two views at different rotations is mostly unchanged under different light conditions. In other words, we assume that inequalities between the similarity values of different view pairs remain true under different illuminations; therefore, we can fix certain light conditions while calculating image similarity.
Given two objects A and B, the ambiguity of an orientation of object A is defined as the maximum value of a given similarity metric between the view of A at that orientation and the views of object B.
The most suitable object orientation for unambiguously classifying objects A and B can then be found by minimizing this ambiguity over the orientations.
If we have more than two ambiguous objects, we can analogously define the ambiguity of an orientation of one object with respect to the other objects in the group.
In this paper, we assume for simplicity that we have a pair of ambiguous objects. Eq. 6 above allows us to trivially extend the developed solutions to a group of three or more objects. We found it easier to analyze and use the ambiguity when it lies in a fixed range, so we linearly transform the ambiguity of each object to this range. Now we can define ambiguous and non-ambiguous views.
An orientation of an object from the group of ambiguous objects is ambiguous if its ambiguity exceeds a chosen threshold. It is non-ambiguous otherwise.
A lower similarity metric between two objects should result in lower ambiguity for the views at the orientations that show discriminative visual features (e.g. left hand side on Fig. 1), and higher ambiguity for those that are difficult to discriminate (e.g. right hand side on Fig. 1).
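A small Python sketch may help make the ambiguity definitions above concrete. It assumes that the pairwise similarities between sampled views of objects A and B have already been computed and stored in a hypothetical matrix `similarity_ab`; the numbers are invented for illustration, and the rescaling to the interval [0, 1] is an assumption of this sketch rather than a detail taken from the text.

```python
def ambiguity_ranks(similarity):
    """Ambiguity of each view of A: its maximum similarity to any view of B.

    similarity[i][j] is the similarity between view i of A and view j of B.
    """
    return [max(row) for row in similarity]

def rescale(values):
    """Linearly map values to [0, 1] (target range assumed for this sketch)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def least_ambiguous(ambiguity):
    """Index of the orientation with the lowest ambiguity."""
    return min(range(len(ambiguity)), key=ambiguity.__getitem__)

# Hypothetical similarities: 3 sampled views of A against 3 sampled views of B.
similarity_ab = [
    [0.2, 0.4, 0.3],  # view 0 of A shows the distinguishing side
    [0.7, 0.9, 0.8],
    [0.6, 1.0, 0.5],  # view 2 of A has an identical counterpart on B
]
raw = ambiguity_ranks(similarity_ab)
scaled = rescale(raw)
best_view = least_ambiguous(scaled)
```

In this toy case the view that exposes the distinguishing side receives the lowest ambiguity and is selected as the most suitable orientation.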
We now show how to use the ambiguity to select discriminative views and train a classifier that effectively disambiguates between two objects A and B, given an image containing one of them. We take a uniformly distributed subset of orientations. Then, we choose a threshold to form the subset of orientations whose ambiguity is below it. We can sort a set of object orientations by their ambiguity values in ascending order (as in Fig. 6) to visually verify that we selected only non-ambiguous orientations.
We use this subset to train the classifier on non-ambiguous data.
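The selection step above can be sketched as follows. The function name `split_by_ambiguity`, the orientation labels, the ambiguity values, and the threshold are all hypothetical and only serve to illustrate sorting by ambiguity and keeping the orientations below the threshold.

```python
def split_by_ambiguity(orientations, ambiguity, threshold):
    """Sort orientations by ambiguity (ascending) and split them at `threshold`.

    Returns (non_ambiguous, ambiguous); only the first list is used for training.
    """
    ranked = sorted(zip(ambiguity, orientations), key=lambda pair: pair[0])
    non_ambiguous = [o for a, o in ranked if a < threshold]
    ambiguous = [o for a, o in ranked if a >= threshold]
    return non_ambiguous, ambiguous

# Hypothetical orientations (named by visible side) and ambiguity values.
orientations = ["front", "left", "back", "right"]
ambiguity = [0.1, 0.6, 1.0, 0.7]
train_views, excluded_views = split_by_ambiguity(orientations, ambiguity, threshold=0.65)
```

Because the output is sorted, the rendered views can be inspected in order to check visually that the training subset contains only discriminative orientations.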
We first estimate the possible object hypotheses (i.e., A or B) and their initial orientations, then we compute the related ambiguity. If the resulting ambiguity is above the threshold, we move the robot to a more discriminative view.
We train an autoencoder-based model for each group of ambiguous objects (e.g., the pair of boxes with two flavors in Fig. 2). The trained network is used both for estimating the rotation of the object, as in the AAE approach, and for evaluating the similarity metric required to implement Definitions 1 and 2. We search in the space of the other objects’ orientations using a similarity metric to compare the rendered orientations.
The robot can now perform the active classification of the object (see Fig. 5, inference part). The input is an RGB image from the robot camera and the current robot pose. We first employ the Faster R-CNN object detector to obtain a crop of the image containing the object of interest, which we then feed to the autoencoder. If we have more than one pair of ambiguous objects (e.g., bottles and boxes from Fig. 11), we first have to classify between the categories formed by the ambiguous pairs (e.g., the first category contains both bottles, the second category contains both boxes). This task can be accomplished with a standard classifier. Once the category of interest is identified, the ambiguity of objects within the category depends on the orientation w.r.t. the camera. Hence, if the current view is ambiguous, we move the robot camera to a more discriminative view.
As mentioned above, we estimate the next best view based on the current robot pose, the object pose, and the set of orientations with calculated ambiguities (see Eq. (4)). If the ambiguity of the current object orientation is below a given threshold, we can directly apply a classifier trained for this particular category of ambiguous objects.
IV Ambiguity ranking
This section describes the ambiguity ranking procedure (see Fig. 5). We first train the autoencoder-based similarity metric and then estimate the ambiguity rank using this metric.
IV-A Autoencoder-based view similarity
In this work we use the AAE for object rotation estimation and, in addition, to implement the similarity metric required by Definition 1.
The AAE aims at reconstructing a clean view of the object on a white background, given a crop of the input image containing the object with background, lighting, and occlusion augmentations. The reconstruction occurs by encoding the augmented object view into the latent representation (called embedding) and then decoding the latent representation into the non-augmented object view. To correctly apply the image similarity metric to an object’s views, the AAE’s latent space must be shared among all similar objects. For this purpose, we trained a single autoencoder to reconstruct all the ambiguous objects. We later refer to this autoencoder as the joint AAE, as views from all ambiguous objects of the same group are jointly used to train it.
We sample object views on a quasi-uniform Fibonacci grid on the sphere around the object and at uniform steps of in-plane rotation.
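The view sampling described above can be illustrated with the standard Fibonacci (golden-angle) lattice on the unit sphere, combined with evenly spaced in-plane rotation angles. This is a sketch under stated assumptions: the function names and the numbers of viewpoints and in-plane steps are hypothetical, and the exact lattice variant used in the original pipeline is not specified in the text.

```python
import math

def fibonacci_sphere(n_points):
    """Quasi-uniform points on the unit sphere (golden-angle Fibonacci lattice)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n_points):
        z = 1.0 - (2.0 * i + 1.0) / n_points  # evenly spaced heights in (-1, 1)
        radius = math.sqrt(1.0 - z * z)       # radius of the horizontal circle at height z
        phi = i * golden_angle                # azimuth advances by the golden angle
        points.append((radius * math.cos(phi), radius * math.sin(phi), z))
    return points

def in_plane_angles(n_steps):
    """Uniformly spaced in-plane rotation angles in [0, 2*pi)."""
    return [2.0 * math.pi * k / n_steps for k in range(n_steps)]

# Hypothetical sampling density: 100 viewpoints, 12 in-plane rotations each.
viewpoints = fibonacci_sphere(100)
rolls = in_plane_angles(12)
sampled_views = [(point, roll) for point in viewpoints for roll in rolls]
```

Each sampled view pairs a camera direction on the sphere with an in-plane rotation, so the total number of views is the product of the two sampling densities.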
Having trained the joint AAE, we can generate a codebook for every ambiguous object of interest (see Sec. II-B).
IV-B Similarity-based view ranking
In general, the most similar view pairs might have independent orientations. Therefore, to estimate the initial ambiguity of an orientation of one object, we find the most similar view of the other object by maximizing the similarity between the two views, using the trained similarity metric as per Eq. (7).
This procedure can be computationally expensive. Thus, we sample fewer views of the object than those considered while generating the codebook. Note also that here we do not need to sample orientations that differ only by a rotation around the camera’s optical axis, as the resulting images are identical up to a rotation.
We replace the cropped image (Fig. 3) with the rendered view of the first object and take the codebook from the other object. To find the most similar view, we perform a descent search in the space of rotations, parametrized as Euler angles, to maximize the term in Eq. (8).
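The text does not detail the search procedure, so the following is only one possible sketch of a local search over three Euler angles. It uses a greedy coordinate search with a shrinking step, which is a stand-in chosen for simplicity, and a toy similarity function in place of the AAE-based metric; `local_search`, `toy_similarity`, and `target` are hypothetical names.

```python
import math

def local_search(score, start, step=0.5, min_step=1e-3):
    """Greedy coordinate search that maximizes score(angles) over Euler angles.

    Each angle is perturbed by +/- step; the step is halved when no move helps.
    """
    best = list(start)
    best_score = score(best)
    while step > min_step:
        improved = False
        for axis in range(len(best)):
            for delta in (step, -step):
                candidate = list(best)
                candidate[axis] += delta
                candidate_score = score(candidate)
                if candidate_score > best_score:
                    best, best_score = candidate, candidate_score
                    improved = True
        if not improved:
            step /= 2.0
    return best, best_score

# Toy stand-in for the similarity: peaks when the angles match `target`.
target = (0.3, -0.2, 1.0)

def toy_similarity(angles):
    return sum(math.cos(a - t) for a, t in zip(angles, target)) / len(target)

best_angles, best_score = local_search(toy_similarity, start=(0.0, 0.0, 0.0))
```

With a real similarity metric the landscape may have several local maxima, so in practice such a search would typically be started from more than one initial orientation.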
We remark that the AAE similarity metric adopted in Eq. (8) was originally designed to estimate the 3D orientation of an object of interest, hence the devised procedure produces correctly aligned view pairs, as can be seen in Fig. 6.
The output of the matching procedure is a list of view-pair tuples, sorted by the similarity metric. A useful feature of this approach is that we can render the view pairs and visually check the correctness of the matching and of the sorting by ambiguity (see Fig. 6).
V Training classifiers
In this section, we address Question 1. We first show how to split the dataset, then we show how to classify ambiguous objects. To avoid overfitting, we validate the classifiers and select the threshold based on a dataset of real images.
V-A Splitting the dataset
Once we have sorted the object orientations by their ambiguity, we have to select an ambiguity threshold that separates the non-ambiguous orientations from the ambiguous ones. This way, the classifier can be trained only on non-ambiguous views. However, we found it challenging to select the ambiguity threshold given only synthetic images rendered from the 3D model. By plotting the ambiguity for the set of sorted rotations, we observed that there is no sharp change of ambiguity that could indicate a possible ambiguity threshold (e.g., see the AAE baseline in Fig. 8). Moreover, we found that the training procedure can be easily compromised by imperfections in the 3D models, causing overfitting. Therefore, we compute the accuracy on a small dataset of real objects, which then helps us in defining the ambiguity threshold.
V-B Classification of ambiguous objects and between the categories of ambiguous objects
In this work, we mainly focus on classification within a pair or a group of ambiguous objects. However, we might face a situation in which we have more than one category of ambiguous objects and some other non-ambiguous objects.
In this section, we address Question 2. First, we need to determine if an acquired view is ambiguous. In that case, we have to find the camera pose where we expect to observe the least ambiguous view taking into account robot motion constraints. However, after this additional movement, the new view may still be more ambiguous than expected, for example due to errors in the estimation of the object orientation. In general, we cannot expect to move the robot to the best configuration in one step, and several movements could be required to converge to a configuration that is suitable for the classification of the object identity. The overall process consists of several stages that we report hereafter:
We first crop the image as both the AAE and our classifier require a square crop with the object in the center.
Classification between categories of ambiguous objects
In the case of several groups of ambiguous objects, we must first identify in which group a given object falls. As we trained different autoencoder weights for each group, this step of classification provides us with the correct choice of weights for both the joint autoencoder and the in-group classifiers.
Object orientation and view ambiguity estimation
To estimate the view ambiguity, we need to first estimate the object orientation. Hence, we need to pass the image through the encoder of the AAE to get the associated embedding and maximize the similarity between this embedding and the object codebook that has been evaluated offline (see Sec. II-B). As we have at least two in-group classes, each with an associated codebook to be used to estimate the orientation of the object (see Sec. II-B), we may get different ambiguity ranks given that they depend on the orientation and on the in-group object class. If the mean of these ambiguity ranks is below the ambiguity threshold (see Sec. IV-A), we apply an in-group classifier and return the result. Otherwise, we have to move the robot camera to an appropriate pose.
Next best view for object classification
The next view of the object must be such that the classification ambiguity is at least reduced. However, the robot must deal with reachability constraints and avoid collisions with the surrounding objects. Therefore, the possible view orientations are limited to the set satisfying these constraints. Hence, we compute the next best camera-to-object orientation by minimizing the expected ambiguity over this set.
Here we use the mean to calculate the combined ambiguity rank of the expected views, assuming different object classes and orientations. Having found the next best robot pose, we move the robot, get the new camera input, and repeat the inference steps. We terminate the robot movements when we achieve a view with an ambiguity lower than the desired threshold or when the current view is the least ambiguous among the possible robot poses (i.e., the minimizer equals the current robot orientation).
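The selection rule and the two termination conditions above can be sketched as a small loop. This is a simplified illustration, not the authors' implementation: the view names, the reachability map, and the ambiguity values are hypothetical, and the expected ambiguity is read from a fixed table, whereas in the real pipeline it is re-estimated from each newly acquired image.

```python
def mean(values):
    return sum(values) / len(values)

def next_best_view(candidates, expected_ambiguity):
    """Reachable view with the lowest mean ambiguity over the class hypotheses."""
    return min(candidates, key=lambda view: mean(expected_ambiguity[view]))

def active_classification(start, reachable, expected_ambiguity, threshold, max_moves=3):
    """Move until the view is below `threshold` or no reachable view is better."""
    current = start
    path = [current]
    for _ in range(max_moves):
        if mean(expected_ambiguity[current]) < threshold:
            break  # view is non-ambiguous: the in-group classifier can be applied
        target = next_best_view(reachable[current], expected_ambiguity)
        if target == current:
            break  # current view is already the least ambiguous reachable one
        current = target
        path.append(current)
    return path

# Hypothetical ambiguity per view, one value per class hypothesis (A, B).
expected_ambiguity = {"back": [0.95, 0.9], "side": [0.7, 0.6], "front": [0.1, 0.2]}
# Hypothetical reachability: views the robot can reach from each view (itself included).
reachable = {
    "back": ["back", "side"],
    "side": ["back", "side", "front"],
    "front": ["side", "front"],
}
path = active_classification("back", reachable, expected_ambiguity, threshold=0.5)
```

The `max_moves` bound mirrors the limited number of extra robot movements considered in the experiments and guarantees that the loop ends even if the estimates fluctuate.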
VII Comparison against other feature extraction and matching methods
In this section, we compare our autoencoder-based similarity with other metrics. We evaluated the similarity metrics on view pairs, where the second view in each pair is the most similar to the first one according to the AAE-based image similarity.
The first metric we compare to is the pixelwise mean squared error (MSE) between the images of the two views (Fig. 7(a)). The second metric is based on a comparison between the SIFT descriptors evaluated on the two images (Fig. 7(b)). The third metric is the cosine similarity between the embeddings extracted from the two images using a feature extractor based on a CNN. Specifically, we adopted the ResNet-50 architecture from which the last fully connected layer has been removed (see Fig. 7(c)). We observed that the plots corresponding to the three metrics barely correlate with the AAE baseline (see Fig. 8); this means that these metrics are unable to discriminate between ambiguous and non-ambiguous views.
In addition, we checked how different light conditions affect ambiguity discrimination. To this end, we rendered the views while varying the light intensity. For a broad range of light intensity configurations, the ability to discriminate between the two objects seems mostly unaffected (see Fig. 7(d)).
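As an illustration of the comparison above, the sketch below computes the pixelwise MSE between two images and a Pearson correlation between two metric curves evaluated on the same sorted view pairs. The text does not state how the correlation between plots was assessed, so the use of the Pearson coefficient is an assumption of this sketch, and the image and curve values are invented.

```python
import math

def pixelwise_mse(image_a, image_b):
    """Mean squared error between two equally sized grayscale images (lists of rows)."""
    diffs = [
        (a - b) ** 2
        for row_a, row_b in zip(image_a, image_b)
        for a, b in zip(row_a, row_b)
    ]
    return sum(diffs) / len(diffs)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Hypothetical metric values over five view pairs sorted by AAE-based ambiguity.
aae_curve = [0.1, 0.3, 0.6, 0.9, 1.0]
other_curve = [0.5, 0.4, 0.6, 0.5, 0.4]
correlation = pearson(aae_curve, other_curve)
```

A coefficient close to zero would indicate that the alternative metric does not follow the ordering induced by the AAE-based ambiguity.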
In this section, we present the validation of our approach with a set of experiments, both in simulation and on a real robot. We implemented the whole pipeline for training and inference as a Python package, available online (https://github.com/safoex/aoc). The package includes an implementation of the autoencoder-based similarity metric using the PyTorch Lightning framework, following the original work on AAE, including augmentations. We fine-tune the ResNet-18 CNN classifier pre-trained on the ImageNet dataset. For object detection, we use a pre-trained Faster R-CNN network from the Torchvision package.
VIII-A Simulated experiment
As mentioned earlier in this work, existing datasets do not contain pairs of objects with substantial ambiguity. For this reason, we generated data featuring pairs of ambiguous objects using the publicly available 3D model of the “mustard bottle” object from the YCB model set. We used both the original 3D mesh of the object and a modified version of it in which we introduced a variation in the texture of the front side. The resulting pair of objects has some distinguishing features on the front side and several completely identical views (see Fig. 8(b)). The ambiguity rank of such identical views must take its maximum value. Fig. 8(a) shows the obtained ambiguity plot. The ambiguity approaches one for half of the object orientations (half of the sorted views in Fig. 8(a)); therefore, a value close to one should be used as the ambiguity threshold for this pair of objects. Fig. 9 shows examples from the sorted view pairs. The lowest-ranked views are visually the most distinguishable, as they show the modified texture (Fig. 8(b)), while the views with higher ambiguity do not feature the modified texture (Fig. 8(d)); this is also corroborated by the shape of the plot in Fig. 8(a). Near the middle of the sorted views, where the ambiguity approaches one, the differences between the two views are very small (Fig. 8(c)). Hence, in this case, we can choose a threshold close to one: only views with an ambiguity below it show the non-ambiguous feature. The tests in simulation show that using the autoencoder to rank the ambiguity of the views, as formulated in Section III, is sound.
VIII-B Experimental setup
Despite the lack of ambiguous objects in the available datasets, pairs of ambiguous objects can be easily found in the real world, e.g., among packaged food available in supermarkets. For our experiments, we chose two groups of ambiguous objects: a pair of food boxes and a pair of bottles (see Fig. 11). We obtained the 3D models using a commercially available 3D scanner, namely the Shining 3D scanner. Although the reconstructed models had small artifacts, their quality was good enough for our purposes. We used a Franka Emika Panda robot with an Intel RealSense D415 camera, mounted on the end-effector, to acquire image views.
VIII-C Classification test on recorded data
In a first set of experiments, we acquired a set of images using the camera mounted on the robot in several fixed poses. The poses were chosen such that the object appeared in the center of the image plane. Furthermore, we assumed that the Cartesian position of the object with respect to the robot reference frame was known. This constraint does not affect the verification of the pipeline but simplifies the setup and the analysis.
We recorded data by following a fixed trajectory along several parallel circles on a sphere centered on the Cartesian position of the object. Each trajectory was parametrized in terms of azimuth and elevation angles. For each value of the elevation, we sampled uniformly along the space of azimuth angles (see Fig. 6(b)). Poses of the end-effector that were not reachable, due to constraints in the joint configuration, were not considered while acquiring the data. The collected data consisted of the images captured from the robot camera along with the relative pose between the camera and the object. We collected a small dataset of object views from different sides, including ambiguous and non-ambiguous views. The trajectory consisted of three parallel circles (i.e., each with a constant elevation) on the sphere (see Fig. 6(b)). We sampled points on each parallel circle. We repeated the execution four times for each object, each time considering a different orientation of the object with respect to the robot root frame.
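The trajectory described above can be sketched as a set of camera positions on parallel circles of a sphere centered on the object. All numeric values below (object position, radius, elevations, number of points per circle) are hypothetical placeholders, since the actual values are not given here, and the function name `circle_waypoints` is introduced only for this example. Reachability filtering and the camera orientation are omitted.

```python
import math

def circle_waypoints(center, radius, elevation_deg, n_points):
    """Camera positions on one parallel circle (constant elevation) around `center`."""
    cx, cy, cz = center
    elevation = math.radians(elevation_deg)
    waypoints = []
    for k in range(n_points):
        azimuth = 2.0 * math.pi * k / n_points  # uniform sampling in azimuth
        x = cx + radius * math.cos(elevation) * math.cos(azimuth)
        y = cy + radius * math.cos(elevation) * math.sin(azimuth)
        z = cz + radius * math.sin(elevation)
        waypoints.append((x, y, z))
    return waypoints

# Hypothetical setup: object position in the robot frame (meters), sphere
# radius, three elevations (degrees), and 12 samples per circle.
object_position = (0.5, 0.0, 0.1)
sphere_radius = 0.4
elevations_deg = (20.0, 40.0, 60.0)
trajectory = [
    point
    for elevation in elevations_deg
    for point in circle_waypoints(object_position, sphere_radius, elevation, 12)
]
```

Every waypoint lies at the same distance from the object, so the object keeps a similar apparent size in all recorded images when the camera is pointed at the sphere center.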
We then trained the object classifiers on the collected data while varying the threshold that separates ambiguous views from non-ambiguous views. Fig. 10 shows the results in terms of classification accuracy. The classification performance drops significantly when the training data include object orientations whose ambiguity exceeds a certain threshold. The objects used in the real experiment have visually distinguishable features on one side only. We found better classification performance on the non-ambiguous part of the object’s views than on the whole object. This is also clear from Fig. 10, where the plots for the evaluation on the more ambiguous parts of the recorded data lie below the plots for the less ambiguous test data. Moreover, classifiers trained on both ambiguous and non-ambiguous orientations tend to predict one class with high certainty irrespective of the actual identity within the group of ambiguous objects.
VIII-D Active classification results
In a second set of experiments, we tested the active classification capabilities of the proposed pipeline. We performed two types of experiments. First, as we had recorded data for many camera-to-object orientations, we performed an offline active vision test in which the robot movements were constrained to the recorded positions. We compared this offline experiment against a random baseline in which the robot picked the next view at random. Next, we performed an online active vision test, in which the robot starts from a random pose, hence a random orientation of the camera, and then tries to reach a pose where the ambiguity rank of the current object view is below the threshold and classify the object.
To perform the offline active vision experiments, we recorded images using a similar approach as above but with a radius of , steps on parallel circles on the sphere around the object. We compared the next best view selection against the random next view. In all cases, we outperform random next view selection (see Fig. 13). Note that as bottles have a symmetrical shape, they are more challenging for orientation estimation and require additional views.
Classification success probability in the offline active vision experiment. The X axis represents the ambiguity rank threshold chosen to terminate the active perception. Different plots in the same figure represent the classification performance for different numbers of extra robot movements allowed. Dashed lines represent the performance of the random baseline. For example, the red (highest) plot in Fig. 11(a) shows the fraction of correct active classifications after no more than three extra robot movements.
Regarding active vision experiments, each experiment started from a random camera-to-object pose and terminated by reaching the ambiguity threshold or when the current robot pose was optimal, according to Eq. 9. Overall, experiments were performed, of which only 20 finished with incorrect classification results, corresponding to a classification accuracy of about . In these experiments, we set the ambiguity threshold to i.e., if the robot acquired an image that was evaluated with a rank less than , we stopped the classification procedure. The obtained results correlate with those obtained on prerecorded data, and described in Sec. VIII-C, when considering the same ambiguity rank of (see Fig. 10). An example of the action recognition task is shown in Fig. 13.
This paper argued that everyday objects have ambiguous views that make standard classification approaches challenging to apply. We proposed a novel active perception strategy based on view ambiguity estimation employing an autoencoder embedding. We validated our approach on a real robot using household objects, demonstrating its feasibility and performance.
In future work we plan to investigate the use of the autoencoder similarity metric to cluster groups of ambiguous objects within a given dataset, and to extend this work to select the best views for object discrimination in the presence of occlusions. Another direction of research is to employ the same similarity metric for handling symmetries in pose estimation.
We thank Fabrizio Bottarel for installing the Franka Emika Panda robot and the associated software ecosystem. This work was supported by the European H2020 project No. 730994 (TERRINet) and ERA-NET CHIST-ERA call 2017 project HEAP.
- (2015) The YCB object and model set: towards common benchmarks for manipulation research. In 2015 International Conference on Advanced Robotics (ICAR), pp. 510–517.
- (2020) Act, perceive, and plan in belief space for robot localization. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 3763–3769.
- (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
- (2019) PyTorch Lightning. GitHub. https://github.com/PyTorchLightning/pytorch-lightning 3.
- (1966) The senses considered as perceptual systems. Vol. 2, Houghton Mifflin, Boston.
- (2015) Deep residual learning for image recognition.
- (2020) PVN3D: a deep point-wise 3D keypoints voting network for 6DoF pose estimation.
- (2004) Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60 (2), pp. 91–110.
- (2017) Interactive data collection for deep learning object detectors on humanoid robots. In 2017 IEEE-RAS 17th International Conference on Humanoid Robotics (Humanoids), pp. 862–868.
- (2020) Estimation of object class and orientation from multiple viewpoints and relative camera orientation constraints. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1–7.
- (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035.
- (2021) Meta pseudo labels.
- (2016) Faster R-CNN: towards real-time object detection with region proposal networks.
- (2015) RGB-D object recognition and pose estimation based on pre-trained convolutional neural network features. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 1329–1335.
- (2018-09) Implicit 3D orientation learning for 6D object detection from RGB images. In The European Conference on Computer Vision (ECCV).
- (2020) EfficientNet: rethinking model scaling for convolutional neural networks.