Investigating the Importance of Shape Features, Color Constancy, Color Spaces and Similarity Measures in Open-Ended 3D Object Recognition

by   S. Hamidreza Kasaei, et al.
University of Groningen

Despite the recent success of state-of-the-art 3D object recognition approaches, service robots frequently fail to recognize many objects in real human-centric environments. For these robots, object recognition is a challenging task due to the high demand for accurate and real-time response under changing and unpredictable environmental conditions. Most of the recent approaches use either shape information only, ignoring the role of color information, or vice versa. Furthermore, they mainly utilize the L_n Minkowski family functions to measure the similarity of two object views, although various other distance measures are applicable to comparing two object views. In this paper, we explore the importance of shape information, color constancy, color spaces, and various similarity measures in open-ended 3D object recognition. Towards this goal, we extensively evaluate the performance of object recognition approaches in three different configurations, including color-only, shape-only, and combinations of color and shape, in both offline and online settings. Experimental results concerning scalability, memory usage, and object recognition performance show that all of the combinations of color and shape yield significant improvements over the shape-only and color-only approaches. The underlying reason is that color information is an important feature for distinguishing objects that have very similar geometric properties but different colors, and vice versa. Moreover, by combining color and shape information, we demonstrate that the robot can learn new object categories from very few training examples in a real-world setting.



I Introduction

One of the primary goals in service robotics is to develop perception capabilities that will allow robots to interact with the environment robustly. Towards this goal, a robot must be able to recognize a large set of object categories accurately. Furthermore, in order to interact with human users, this process of object recognition cannot take more than a fraction of a second. In human-centric environments, the robot may frequently face a new object that visually can be either very similar or not similar to other categories. For example, consider apples and oranges categories: what is the difference between apples and oranges? They both fall within the class of fruits, both are edible, have a similar spherical shape and grow on trees. Although object recognition is a typical task that is performed intuitively by human cognition, it can be quite complex when a robot has to do it.

A 3D object recognition system is composed of several software modules such as Object Detection, Object Representation, Object Recognition and Perceptual Memory. Object detection is responsible for detecting all objects in a scene. Object representation is concerned with the calculation of a set of features for the given object. The obtained representation is then sent to the object recognition module. The target object is finally recognized by comparing its representation against all the descriptions of known objects (stored in the perceptual memory). Object representation therefore plays a prominent role because the output of this module is used for learning as well as recognition. Moreover, the representation of an object should contain enough information to enable recognition of the same or similar objects seen from different perspectives. Therefore, several important questions should be taken into account when representing an object:

Which perceptual data should be used? How should it be represented to the robot? Which senses would a person use to classify highly similar objects?

Arguably, vision is the most important sense, although other senses such as touch could also be used for this task.

Fig. 1: Step by step visualization of the process in the GOOD object descriptor, creating a representation of a ‘Vase’ object. (a) shows the 3D point cloud of the object, its bounding box, reference frame, and three projected views; (b), (c), and (d) show the three projection planes created from these views, i.e., the number of bins is 5. The projections are then converted into histograms by counting the number of points falling in each bin, as shown in (e), (f), and (g); Finally, the GOOD object representation is created by concatenating the three histograms as visualized in (h).

Even so, we still do not have a definite answer on what the difference is between apples and oranges. An apple can be colored orange, while a green-colored orange could also be considered an orange. The same mutual relation holds for their shape. With this in mind, describing objects only by either shape or color will likely lead to confusion eventually. In this work, we assume that an object has already been segmented from a scene. The extracted point cloud of the object, containing RGB and depth data, is used to describe the shape and color of the object for distinguishing objects that have a very similar shape with a different color or vice versa. Towards this goal, we extend the Global Orthographic Object Descriptor (GOOD) [13] by adding color constancy information as an aid to improve object recognition performance. GOOD is a light-weight object descriptor that creates a convenient object representation directly from a 3D point cloud. As 3D data contains more structural information about objects, it is more robust than RGB data to the effects of illumination and shadows [21]. The required steps leading to the eventual GOOD object representation for a vase object are shown in Fig. 1. In summary, this paper contains the following main contributions:

  • Develop a 3D object descriptor that represents both shape and color constancy information for a given object.

  • Extensively evaluate the role of shape features, color constancy, color spaces, and similarity measures in open-ended 3D object recognition.

The remainder of this paper is organized as follows. In Section II, we briefly discuss related works. The methodology for computing the object descriptor is presented in Section III. Evaluation of the proposed descriptor is presented in Section IV. Finally, in Section V, conclusions are presented, and future research is discussed.

II Related Work

Three-dimensional object recognition has been under investigation for a long time in various research fields, such as pattern recognition, computer graphics, and robotics [19][4][14][27]. Although an exhaustive survey of 3D object descriptors is beyond the scope of this paper [2][6][25], we will review the main efforts.

Object representations based on just RGB data are sensitive to illumination and shadows. Moreover, they cannot provide an accurate representation of objects’ shape. To cope with the aforementioned limitations, 3D data can be used to facilitate the representation of objects. Existing 3D object representation approaches are based on either global or local descriptors. As the name suggests, global descriptors represent the complete object. In contrast, local descriptors encode an object in a piece-wise manner, representing small patches of the object around specific key points. Generally, global descriptors are increasingly used in the context of 3D object recognition, object manipulation, as well as geometric categorization. These must be efficient in terms of computation time as well as memory, to facilitate real-time performance. Some descriptors use a Reference Frame (RF) to compute a pose-invariant description. Therefore, this property can be used to categorize 3D shape descriptors into three categories, including (i) shape descriptors without a common reference; (ii) shape descriptors computed relative to a reference axis; (iii) shape descriptors computed relative to an RF.

Most of the shape descriptors of the first category use statistical features or geometric properties of the points on the surface, such as depth value, curvature, and surface normal, to generate a description. For instance, W. Wohlkinger and M. Vincze [28] introduced a global shape descriptor called Ensemble of Shape Functions (ESF) that does not require the use of normals to describe the object. The characteristic properties of an object are represented using an ensemble of ten 64-bin histograms of angle, point distance, and area shape functions. ESF completely ignores the potential role of color information.

In contrast, the descriptors in the second and third categories encode the spatial information of the objects’ points using a Reference Frame (RF). In the second category, Viewpoint Feature Histogram (VFH) [23] is a well-known descriptor. It is based on another set of descriptors, the point feature histogram (PFH) [26], more specifically the fast point feature histogram (FPFH) [24]. The histogram of a PFH results from considering several angular features between the normals of pairs of points in the point cloud. What VFH adds to FPFH is the consideration of a viewpoint component. The direction from the viewpoint to the centroid of the object is translated to all points. The angle between this direction and the normal of each point constitutes the first component of the histogram. The other components of the histogram are similar to FPFH, but the pan, tilt, and yaw angles are now computed between the normals of the points and the viewpoint direction of the centroid. In the third category, we have the Global Orthographic Object Descriptor (GOOD) [13], which performs a principal component analysis on the point cloud of an object to make an unambiguous reference frame for the object. The resulting RF is then used to create three orthogonal projections of the object with respect to the X, Y, and Z axes. Each of these projections is then converted into a histogram, and the histograms are combined using two statistical features, i.e., entropy and variance, to provide the final descriptor of the object. The Globally Aligned Spatial Distribution (GASD) [17] also falls into the third category. GASD explores the idea of forming an object descriptor containing both color and shape information. GASD represents the shape information in a way that is very similar to the GOOD descriptor. In addition, color information is incorporated into the descriptor in order to increase its discriminative power. We refer the reader to two comprehensive surveys on local feature descriptors [5, 16]. In this paper, we select one descriptor from each category to investigate the importance of shape information. These are ESF, VFH, and GOOD.

III Proposed Approach

A point cloud of an object is represented as a set of points, where each point is described by its 3D coordinates and RGB information. In this work, we mainly use the GOOD object descriptor to represent the object as a histogram [13][8]. We use GOOD rather than other 3D object descriptors because it is a pose- and scale-invariant descriptor and is therefore suitable for 3D perception in autonomous robots. As shown in Fig. 1, this method performs a principal component analysis on the point cloud of an object to find the eigenvectors of the object. Over different trials, the direction of the eigenvectors is not unique and has a sign ambiguity. A sign disambiguation method is used to avoid this problem. The resulting unambiguous local reference frame, centered on the object, is then used to create three orthogonal projection planes. Each projection is divided into a grid of n × n bins, where n is the number of bins along each axis, and the grid is used to compute a normalized distribution matrix by counting how many points fall within each bin. The histogram of the plane is created by stringing the rows of the matrix together. The obtained histograms corresponding to the three projections are then combined to form a single representation for a given object. The histogram appearing first in the combined histogram is the one with the highest entropy. The second one is the one with the lowest variance of the remaining two, automatically placing the remaining one in the last position.
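The projection and ordering steps described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the points are already expressed in the object's unambiguous reference frame (the PCA and sign disambiguation steps are skipped), it uses the same extent for all axes, and it takes the plain variance of the bin values as the ordering criterion. The function names `plane_histogram` and `shape_descriptor` are hypothetical.

```python
import math


def plane_histogram(points, a, b, n_bins, half):
    """Normalized n_bins x n_bins histogram of points projected on axes (a, b)."""
    grid = [[0.0] * n_bins for _ in range(n_bins)]
    cell = 2.0 * half / n_bins
    for p in points:
        row = min(int((p[a] + half) / cell), n_bins - 1)
        col = min(int((p[b] + half) / cell), n_bins - 1)
        grid[row][col] += 1.0
    total = float(len(points))
    # String the rows of the matrix together into one flat histogram.
    return [count / total for row in grid for count in row]


def entropy(hist):
    return -sum(v * math.log2(v) for v in hist if v > 0)


def variance(hist):
    mean = sum(hist) / len(hist)
    return sum((v - mean) ** 2 for v in hist) / len(hist)


def shape_descriptor(points, n_bins=5):
    """points: (x, y, z) tuples, assumed to be in the object's reference frame."""
    # Assumption: one symmetric extent shared by all three axes.
    half = max(abs(c) for p in points for c in p) + 1e-9
    hists = [plane_histogram(points, a, b, n_bins, half)
             for a, b in ((0, 1), (0, 2), (1, 2))]
    # First: highest entropy. Second: lowest variance of the remaining two.
    first = max(range(3), key=lambda i: entropy(hists[i]))
    rest = sorted((i for i in range(3) if i != first),
                  key=lambda i: variance(hists[i]))
    return hists[first] + hists[rest[0]] + hists[rest[1]]
```

With n bins per axis the descriptor has 3 × n × n entries, so 15 bins give the 675 shape bins mentioned later in the text.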

The GOOD object descriptor does not contain color information. Therefore, we append color constancy information to the GOOD object descriptor by taking the average color of all points of the object. The idea of considering color constancy information is inspired by the work of Bramão et al. [1], which showed the importance of color constancy in object recognition tasks. Therefore, integrating the color constancy information of an object appears to be sufficient to improve the performance of object recognition. Color diagnostic objects have a single dominant color that is typical for the object and can be used for its recognition. Non-color diagnostic objects do not have a dominant color, and thus color cannot reliably be used to recognize them. Human perception and recognition, of course, do not just use color constancy information to recognize objects. However, the research by Bramão et al. [1] showed that the color diagnosticity of an object significantly influences the performance of object recognition. In most cases, in addition to the shape properties, it is sufficient to only look at the color constancy information. Moreover, the cost of the implementation is lower than that of using an independent texture descriptor (e.g., ORB [22]), and it does not substantially alter the shape descriptor. Since only a few bins are appended to the final object description for the color constancy information, the GOOD descriptor itself is hardly affected. It is worth mentioning that the number of appended color bins depends on the color space. In most cases, it is set to three, which is much smaller than the size of the shape descriptor. Therefore, to avoid the dominance of the shape information, we add a color weight parameter, which is further explained in the experimental results section.

IV Results and Discussion

Three types of experiments were carried out to evaluate the proposed approach. In all experiments, the obtained object representations are paired with an instance-based learning (IBL) approach (see e.g., [20]). Therefore, a category is described by a set of known instances. An advantage of IBL approaches is that they can recognize objects using a minimal number of examples, and the training phase is very fast. IBL is a baseline approach to evaluate object representations. However, more advanced approaches like SVM and Bayesian [11][9] approaches can be easily adapted. Similarly, a simple baseline recognition mechanism in the form of the nearest neighbor classifier is used. In particular, IBL approaches can be seen as a combination of a particular object representation, similarity measure, and classification rule. It should be noted that, in addition to the GOOD descriptor [13], two popular state-of-the-art 3D object descriptors, VFH [23] and ESF [28], were evaluated; both are available in the Point Cloud Library. We compare the obtained results and use the best configuration as the default system configuration in the second round of experiments (open-ended evaluation). In the following subsections, we investigate the importance of shape information and similarity measures using an extensive set of offline evaluations and consider the importance of color constancy and color spaces in a broad set of open-ended assessments.
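The combination of object representation, similarity measure, and nearest neighbor rule can be illustrated with a minimal sketch. The class name `InstanceBasedLearner` and its methods are hypothetical and only mirror the description above; descriptors are plain lists of numbers.

```python
def manhattan(p, q):
    """City-block distance between two equally sized histograms."""
    return sum(abs(a - b) for a, b in zip(p, q))


class InstanceBasedLearner:
    """Each category is described by the set of its stored instances."""

    def __init__(self, distance=manhattan):
        self.distance = distance
        self.memory = {}  # category name -> list of stored descriptors

    def teach(self, category, descriptor):
        # Learning only stores the instance, so training is very fast.
        self.memory.setdefault(category, []).append(list(descriptor))

    def recognize(self, descriptor):
        # Nearest neighbor rule: return the category of the closest instance.
        best_category, best_distance = None, float("inf")
        for category, instances in self.memory.items():
            for instance in instances:
                d = self.distance(descriptor, instance)
                if d < best_distance:
                    best_category, best_distance = category, d
        return best_category
```

Swapping the `distance` argument is enough to evaluate a different similarity measure with the same representation and classification rule.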

IV-A Classical evaluation using restaurant object dataset

For this round of experiments, we have used the restaurant object dataset since it has a small number of classes (10 categories) with significant intra-class variation, which makes it suitable for performing extensive sets of experiments. The parameters of the selected object descriptors must be tuned to provide a good balance between recognition performance, memory usage, and processing speed. The descriptiveness of the GOOD descriptor was evaluated over a range of values for the number of bins. For the VFH descriptor, we performed a parameter sweep on the normal estimation radius parameter to find the value that resulted in the highest accuracy. The ESF object descriptor does not have any parameters to be optimized. Furthermore, the choice of the similarity measure has an impact on the recognition performance. Since the selected object descriptors represent an object as a normalized histogram, the dissimilarity between two histograms can be computed by different distance functions. We refer the reader to a comprehensive survey on distance/similarity measures provided by S. Cha [3]. In this work, care was taken to select distance functions that are dissimilar from each other, which increases the chance that different distance functions lead to different results. Based on these considerations, the following 14 functions have been explored: Euclidean, Manhattan, χ², Pearson, Neyman, Canberra, KL divergence, symmetric KL divergence, Motyka, Cosine, Dice, Bhattacharyya, Gower, and Sorensen. We refer the reader to [3] for the mathematical equations. We therefore performed 10-fold cross-validation experiments for each combination of descriptor, parameter value, and distance function to obtain the best configuration for each method.
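To make the comparison of distance functions concrete, the sketch below implements seven of the listed measures for normalized histograms. The formulas follow their commonly cited definitions (for example, Motyka as the sum of bin-wise maxima divided by the sum of both histograms, and cosine turned into a distance as one minus the cosine similarity); they are written from those definitions and are not taken from the authors' code.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def canberra(p, q):
    # Bins that are empty in both histograms contribute nothing.
    return sum(abs(a - b) / (a + b) for a, b in zip(p, q) if a + b > 0)


def sorensen(p, q):
    return manhattan(p, q) / sum(a + b for a, b in zip(p, q))


def motyka(p, q):
    return sum(max(a, b) for a, b in zip(p, q)) / sum(a + b for a, b in zip(p, q))


def bhattacharyya(p, q):
    overlap = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return float("inf") if overlap == 0 else -math.log(overlap)


def cosine(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm
```

Note that, under this definition, the Motyka distance of two identical normalized histograms is 0.5 rather than 0, so raw values are not comparable across measures; only the ranking of candidates under one measure matters for nearest neighbor classification.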

The configuration that obtained the best performance in terms of accuracy and computation time was GOOD with 15 bins and the Manhattan (city-block) distance function. This distance function belongs to the Minkowski family and is computationally inexpensive. The accuracy of the proposed system with this configuration was 0.97. The time taken by a complete experiment (including both learning and recognition phases) was also measured for this configuration. The following results are computed using this configuration unless otherwise noted.

All parameter combinations that obtained the best accuracy are summarized in Table I. Although a large number of bins provides more details about the point distribution, it increases computation time, memory usage, and sensitivity to noise. The descriptiveness of VFH was not as good as that of the other descriptors. For VFH, the best performance was obtained with the Canberra distance function, followed by the Motyka function at the same radius parameter. One crucial observation is that, for VFH, there is a significant drop in performance when the normal estimation radius becomes too small or too large. It was observed that ESF performed well with all distance functions.

In terms of computation time, GOOD achieves the best performance, ahead of both ESF and VFH. The underlying reason is that GOOD works directly on 3D point clouds and requires neither triangulation of the object’s points nor surface meshing [8]. We, therefore, use the GOOD descriptor as the basis of our proposed model in the remaining experiments.

No.  Descriptor  #bins  Distance Function  Accuracy
1    GOOD        15     Motyka             0.97
2    GOOD        15     Euclidean          0.97
3    GOOD        15     Cosine             0.97
4    GOOD        15     Dice               0.97
5    GOOD        15     Sorensen           0.97
6    GOOD        15     Manhattan          0.97
7    GOOD        15     Gower              0.97
8    GOOD        25     χ²                 0.97
9    GOOD        30     Bhattacharyya      0.97
10   GOOD        35     Bhattacharyya      0.97
11   GOOD        35     χ²                 0.97
12   ESF         -      Euclidean          0.97
13   ESF         -      Manhattan          0.97
14   ESF         -      Bhattacharyya      0.97
15   ESF         -      Sorensen           0.97
16   ESF         -      Neyman             0.97

TABLE I: Best Object Recognition Performance

IV-B Open-ended evaluation using RGB-D object dataset

In this round of experiments, we explore the importance of color constancy and color spaces. To evaluate the performance of object recognition approaches in an open-ended domain, Kasaei et al. [12] have recently adopted a teaching protocol that simulates the simultaneous nature of learning and recognition. The main idea is to emulate the interactions of a robot with the surrounding environment over long periods. The teaching protocol determines which examples are used for training the algorithm and which are used to test it. This protocol is based on a Test-then-Train scheme, which can be followed by a human user or by a simulated user. We develop a simulated teacher that follows the protocol and autonomously interacts with the system using teach, ask and correct actions. In this experiment, the robot initially has zero knowledge, and the training instances become gradually available according to the teaching protocol.

The idea is that the simulated teacher introduces a category to the robot using three randomly selected object views. The robot creates a model for that category based on these instances. Afterward, the teacher picks a never-seen-before object view and tests the robot to see whether it has learned the category and whether learning this category interferes with the previously learned categories. This is done by asking the robot to recognize unseen object views of the currently known categories. When the robot misclassifies a view, the teacher provides feedback with the correct category, and the robot adjusts its category model using the misclassified instance. The simulated teacher estimates the recognition accuracy of the robot using a sliding window whose size, in iterations, depends on the number of categories that the robot has already learned. If fewer iterations than the window size have passed since the agent last learned a new category, all results are used. If the recognition performance of the agent is higher than the protocol threshold, the simulated teacher introduces a new category. The protocol threshold in our experiments is set to a relatively high value since we wanted to make it harder to learn new categories. In this way, the difference between the color weights becomes more visible. This relatively high protocol threshold also allows for robustness tests, as configurations that are still able to learn many categories can be considered more robust. If the agent cannot meet the protocol threshold after a certain number of iterations, a breakpoint is encountered, and the simulated teacher concludes that the agent cannot learn any more categories. The agent may also learn all existing categories before reaching the breakpoint. In such cases, it is no longer possible to continue the protocol, and the evaluation process is halted. In the reported results, this is shown by the stopping condition “lack of data”.
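The decision logic of the simulated teacher can be sketched as below. The function names and return labels are hypothetical, and the window factor, protocol threshold, and iteration limit are deliberately left as required arguments because their values are not restated here; the sketch also assumes that the window size is a multiple of the number of learned categories.

```python
def protocol_accuracy(results, n_categories, window_factor):
    """Accuracy over the most recent results.

    results: booleans (True = correct answer) recorded since the last
    category was introduced. Slicing keeps all of them when fewer than
    one window of results exists.
    """
    window = window_factor * n_categories  # assumed form of the window size
    recent = results[-window:]
    return sum(recent) / len(recent) if recent else 0.0


def teacher_step(results, n_categories, window_factor, threshold, max_iterations):
    """Decide what the simulated teacher does next."""
    accuracy = protocol_accuracy(results, n_categories, window_factor)
    if accuracy > threshold:
        return "introduce_new_category"
    if len(results) >= max_iterations:
        return "breakpoint"  # the agent cannot learn any more categories
    return "keep_testing"
```

A full simulation would wrap `teacher_step` in a loop that asks the learner to recognize unseen views, stores corrections, and stops with "lack of data" once no unused categories remain.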

IV-B1 Adding color information to shape descriptors

Toward this goal, the objects’ point clouds are taken as input in the form of RGB-D (*.pcd) files. We convert the color constancy information of the object’s point cloud into three different color spaces, including RGB, YUV, and HSV. The color spaces and the procedure of combining color constancy and shape information to form a descriptor for a given object are discussed briefly in this section.

RGB is arguably the most widely used color space. In RGB space, colors are made up of red, green, and blue channels, each with values in the range [0, 255]. We get the RGB values for all points of the object and calculate the sum of each channel separately. We then get the average color of the object by dividing the obtained red, green, and blue sums by the number of points of the object. Finally, since the shape information is normalized, i.e., has a range from 0 to 1, we also normalize the obtained color values to the range [0, 1] by dividing each of them by 255. The obtained values are then appended to the shape description of the object.
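The averaging and normalization steps for the RGB case can be sketched as follows, assuming 8-bit color channels and points given as (x, y, z, r, g, b) tuples. The helper names `average_rgb` and `append_color` are hypothetical.

```python
def average_rgb(points):
    """Average color of an object, normalized per channel.

    points: (x, y, z, r, g, b) tuples with 8-bit channels (an assumption).
    """
    n = float(len(points))
    sums = [0.0, 0.0, 0.0]
    for p in points:
        for c in range(3):
            sums[c] += p[3 + c]
    # Divide by the number of points, then by the 8-bit channel maximum.
    return [s / n / 255.0 for s in sums]


def append_color(shape_histogram, points):
    """Shape bins followed by the three average-color bins."""
    return list(shape_histogram) + average_rgb(points)
```

The resulting descriptor has three more entries than the shape histogram, which is why a separate color weight is needed when comparing two descriptors.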

YUV space is mainly used for television transmission and represents a color by three components, one channel for luminance and two channels for chrominance. The Y component determines the brightness of the color, which is referred to as luminance. The U and V components determine the color itself, also called chroma. The YUV values are derived from the RGB values using a linear transformation, after which an offset is added to the U and V components so that all three YUV components share a common range. Afterward, the obtained colors are normalized and appended to the histogram of the object as done for the RGB color space.
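A linear RGB-to-YUV conversion with an offset on the chroma channels can be sketched as below. The coefficients are the JFIF-style luma/chroma weights and the offset of 128 is the usual choice for 8-bit data; both are assumptions made for illustration and may differ from the matrix the authors used. Results are clamped because the unclamped chroma values can overshoot 255 by a fraction.

```python
def rgb_to_yuv(r, g, b):
    """Convert 8-bit RGB to offset YUV (assumed JFIF-style coefficients)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    # Offset of 128 (an assumption) shifts the signed chroma values upward.
    u = -0.168736 * r - 0.331264 * g + 0.5 * b + 128.0
    v = 0.5 * r - 0.418688 * g - 0.081312 * b + 128.0

    def clamp(c):
        return min(max(c, 0.0), 255.0)

    return clamp(y), clamp(u), clamp(v)


def normalized_yuv(r, g, b):
    """YUV components scaled to [0, 1], ready to append to the shape bins."""
    return [c / 255.0 for c in rgb_to_yuv(r, g, b)]
```

For a gray input the two chroma channels land on the offset itself, which is a quick sanity check for any coefficient set.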

HSV color space was developed to take into consideration how humans view color, where H stands for hue, S stands for saturation, and V stands for value. In particular, it describes a color (hue) in terms of its saturation (shade) and value (brightness). The hue component represents an angle, and its value ranges from 0 to 360 degrees. The saturation component describes the percentage of gray in a particular color, and the value component works in conjunction with saturation and describes the brightness or intensity of the color; both range from 0 to 100 percent. The RGB value of every detected point is converted to HSV using the minimum and maximum of the normalized RGB values of the point. It is worth mentioning that the final object descriptor is formed in the same way as for the other two color spaces. A normalized HSV color dissimilarity is then computed between two object views.
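Because hue is an angle, a dissimilarity between two HSV colors has to treat hue as circular. The sketch below shows one possible normalized form, using the standard-library `colorsys` conversion (which returns all three components in [0, 1]); the combination rule is an assumption made for illustration and is not the authors' equation.

```python
import colorsys
import math


def rgb255_to_hsv(r, g, b):
    """8-bit RGB to HSV with all components in [0, 1]."""
    return colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)


def hsv_dissimilarity(hsv_a, hsv_b):
    """Normalized dissimilarity in [0, 1] between two HSV triples."""
    dh = abs(hsv_a[0] - hsv_b[0])
    # Hue wraps around: 0.95 and 0.05 are close, not far apart.
    dh = 2.0 * min(dh, 1.0 - dh)
    ds = abs(hsv_a[1] - hsv_b[1])
    dv = abs(hsv_a[2] - hsv_b[2])
    # Assumed combination: root mean square of the three differences.
    return math.sqrt((dh * dh + ds * ds + dv * dv) / 3.0)
```

With this form, two fully saturated colors on opposite sides of the hue circle, such as red and cyan, differ only in hue and obtain a dissimilarity of 1/√3.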


After forming the object descriptor containing both color and shape information, we use the color weight parameter to set how important the difference in color is when two object representations are compared. We do this because, in the representation of the object, the number of bins representing the shape of the object is much larger than the number of bins representing the color information (675 bins vs. 3 bins), and hence the shape information would largely dominate the decision. We, therefore, compute the overall difference with a weighted distance function that combines the difference in the shape space and the difference in the color space. The color weight is a value between 0 and 1: a weight of 0 corresponds to the shape-only configuration and a weight of 1 to the color-only configuration.
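One form of the weighted distance that is consistent with this description is a convex combination of the two differences. The sketch below assumes that form, assumes the color bins are the last entries of the descriptor, and uses the Manhattan distance in both spaces; `weighted_distance` and its parameters are hypothetical names.

```python
def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def weighted_distance(desc_a, desc_b, color_weight, n_color_bins=3):
    """Assumed form: (1 - w) * shape difference + w * color difference."""
    split = len(desc_a) - n_color_bins  # color bins are assumed to come last
    shape_diff = manhattan(desc_a[:split], desc_b[:split])
    color_diff = manhattan(desc_a[split:], desc_b[split:])
    return (1.0 - color_weight) * shape_diff + color_weight * color_diff
```

With a weight of 0 the function reduces to the shape-only distance and with a weight of 1 to the color-only distance, matching the two extreme configurations evaluated in the experiments.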

IV-B2 Dataset and evaluation metrics

In this round of experiments, we use the Washington RGB-D dataset [15]. This dataset is known as one of the largest available 3D object datasets and consists of 51 categories with 250,000 views of 300 objects. When an experiment is carried out, learning performance is evaluated using several measures [20][18][7], including: (i) the number of learned categories (NLC) at the end of the experiment, an indicator of how much the system was capable of learning; (ii) the number of question/correction iterations (QCI) required to learn those categories and the average number of stored instances per category (AIC), indicators of the time and memory resources required for learning; (iii) Global Classification Accuracy (GCA), computed using all predictions in a complete experiment, and the Average Protocol Accuracy (APA), i.e., the average of all accuracy values successively computed to control the application of the teaching protocol. GCA and APA are indicators of how well the system learns.
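Four of these measures can be computed from a simple experiment log, as sketched below. The log format (prediction pairs, a list of protocol accuracy values, and a per-category instance count) and the function name `summarize_experiment` are assumptions made for this illustration; QCI is omitted because it depends on how the iterations were counted during the run.

```python
def summarize_experiment(predictions, protocol_accuracies, stored_instances):
    """Compute NLC, AIC, GCA, and APA for one finished experiment.

    predictions: list of (predicted_category, true_category) pairs.
    protocol_accuracies: accuracy values computed by the teaching protocol.
    stored_instances: dict mapping each learned category to its number of
        stored instances.
    """
    nlc = len(stored_instances)
    aic = sum(stored_instances.values()) / nlc
    correct = sum(1 for predicted, truth in predictions if predicted == truth)
    gca = correct / len(predictions)
    apa = sum(protocol_accuracies) / len(protocol_accuracies)
    return {"NLC": nlc, "AIC": aic, "GCA": gca, "APA": apa}
```

Averaging these per-experiment values over the repeated runs with different category orders gives summary figures of the kind reported in the tables.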

No.  Color weight  QCI (mean ± std)   NLC (mean ± std)  AIC (mean ± std)  GCA   APA
1    0.0           648.10 ± 196.76    18.90 ± 4.38      11.78 ± 1.35      0.74  0.84

TABLE II: Summary of evaluation using shape information

No.  Color weight  QCI (mean ± std)   NLC (mean ± std)  AIC (mean ± std)  GCA   APA
1    0.1           922.20 ± 459.24    24.10 ± 7.53      11.60 ± 1.84      0.76  0.84
2    0.2           1217.70 ± 669.51   31.80 ± 11.66     10.84 ± 1.40      0.78  0.84
3    0.3           1881.60 ± 555.00   44.70 ± 8.10      11.28 ± 1.27      0.80  0.841
4    0.4           1751.80 ± 477.00   45.70 ± 7.04      10.19 ± 1.19      0.81  0.85
5    0.5           1656.20 ± 260.22   49.40 ± 4.72      8.92 ± 0.80       0.82  0.85
*6   0.6           1632.50 ± 153.28   51.00 ± 0.0       8.30 ± 0.77       0.84  0.86
*7   0.7           1509.50 ± 104.62   51.00 ± 0.0       7.55 ± 0.57       0.85  0.86
*8   0.8           1452.30 ± 76.15    51.00 ± 0.0       7.07 ± 0.47       0.86  0.87
*9   0.9           1410.20 ± 43.18    51.00 ± 0.0       6.79 ± 0.38       0.86  0.88
10   1.0           1257.10 ± 609.35   33.30 ± 10.84     10.50 ± 1.38      0.79  0.85

* Stopping condition was “lack of data”. Best result highlighted by blue color.

TABLE III: Summary of evaluation in RGB color space

IV-B3 Results

Since the order of introducing the categories may have an effect on the performance of the system, ten experiments were carried out for each of the shape-only configuration, the color-only configuration (color weight 1.0), and the nine combinations of shape and color (color weights 0.1 to 0.9) in the three mentioned color spaces. The order matters because, in IBL approaches, the recognition of new objects relies on all the previously learned objects. For example, if the teacher introduces a red apple right after a red tomato (both have a red color and a similar shape), it would be harder to recognize this new object than when a banana (different color and different shape) is introduced after the red tomato. Detailed summaries of the obtained results are reported in Tables II – V, and depicted in Figures 2 – 6. For all results, boxplots are added to show the variation of the obtained results for each configuration based on minimum, first quartile, median, third quartile, and maximum performances. Line plots are also added to display the average number of learned categories as a function of color weight.

No.  Color weight  QCI (mean ± std)   NLC (mean ± std)  AIC (mean ± std)  GCA   APA
1    0.1           802.30 ± 443.21    21.90 ± 7.49      11.53 ± 1.72      0.75  0.84
2    0.2           1183.20 ± 572.72   29.40 ± 10.13     11.64 ± 1.43      0.77  0.84
3    0.3           1507.00 ± 587.95   36.30 ± 8.87      11.62 ± 1.53      0.79  0.84
4    0.4           1524.10 ± 655.25   39.10 ± 9.31      10.65 ± 1.79      0.80  0.84
5    0.5           2095.50 ± 161.95   50.30 ± 1.64      10.95 ± 0.85      0.81  0.84
*6   0.6           1817.70 ± 138.34   51.00 ± 0.0       9.31 ± 0.76       0.82  0.85
*7   0.7           1659.90 ± 84.90    51.00 ± 0.0       8.32 ± 0.47       0.84  0.86
*8   0.8           1455.10 ± 58.64    51.00 ± 0.0       7.23 ± 0.35       0.85  0.87
*9   0.9           1375.50 ± 26.95    51.00 ± 0.0       6.58 ± 0.31       0.87  0.88
10   1.0           1568.30 ± 664.25   40.20 ± 9.56      10.55 ± 1.71      0.80  0.84

* Stopping condition was “lack of data”. Best result highlighted by blue color.

TABLE IV: Summary of evaluation in YUV color space

No.  Color weight  QCI (mean ± std)   NLC (mean ± std)  AIC (mean ± std)  GCA   APA
1    0.1           958.80 ± 523.49    25.80 ± 10.08     10.89 ± 1.69      0.77  0.84
2    0.2           1639.30 ± 618.87   40.50 ± 10.33     11.01 ± 1.18      0.80  0.84
3    0.3           1717.40 ± 407.80   46.40 ± 6.92      9.90 ± 0.98       0.81  0.84
4    0.4           1628.00 ± 114.81   49.90 ± 2.60      8.7079 ± 0.62     0.83  0.85
*5   0.5           1608.50 ± 113.54   51.00 ± 0.0       8.12 ± 0.56       0.84  0.86
*6   0.6           1454.20 ± 71.34    51.00 ± 0.0       7.17 ± 0.45       0.85  0.87
*7   0.7           1406.20 ± 43.20    51.00 ± 0.0       6.73 ± 0.36       0.87  0.88
*8   0.8           1369.60 ± 19.66    51.00 ± 0.0       6.46 ± 0.29       0.87  0.88
*9   0.9           1371.90 ± 28.90    51.00 ± 0.0       6.42 ± 0.33       0.87  0.89
10   1.0           1624.00 ± 513.38   42.30 ± 7.66      10.34 ± 1.78      0.81  0.85

* Stopping condition was “lack of data”. Best result highlighted by blue color.

TABLE V: Summary of evaluation in HSV color space

Fig. 2: Summary of open-ended evaluation of all approaches. These plots show the number of learned categories versus color weight for all experiments in four different spaces. Boxplots represent the distribution of the obtained results for each configuration based on minimum, first quartile, median, third quartile, and maximum performances. The blue lines represent the average number of learned categories as a function of color weight.

One important observation is that considering color constancy information significantly improved object recognition performance. It was found that increasing the color weight improved the performance of the agent in all color spaces. Notably, the agent learned all 51 categories in all color spaces when the color weight was in the range of 0.6 to 0.9. It is worth mentioning that, in this range, all experiments concluded prematurely due to the “lack of data” condition, i.e., no more categories were available in the dataset, indicating the potential for learning many more categories. Moreover, neither the color-only nor the shape-only configuration allowed the agent to learn all categories in all of the experiments.

On closer inspection, the combination of the HSV color and shape model resulted in better performance at all levels of color weight, as clearly shown in Fig. 2. Comparing all approaches, it is also visible that the agent learned all categories faster in HSV space than in the other color spaces. It can also be concluded that shape+HSV () obtained the best GCA and APA with stable performance. In contrast, the performance of the agent with the shape-only () configuration was the worst among the evaluated configurations. In the color-only case (), the best performance was obtained in HSV color space, where the agent on average learned categories, and the YUV and RGB spaces took second and third place by learning on average and categories, respectively.

Fig. 3: Summary of open-ended evaluations: these graphs show the number of question/correction iterations (QCI) required to learn a certain number of categories as a function of color weight. The blue lines also represent the average number of learned categories in different combinations of color and shape.
Fig. 4: This graph shows the number of instances stored in the models of all categories in three system configurations: shape-only, color-only, and shape+HSV (). Each bar represents the three instances provided at the introduction of the category, together with any instances that had to be corrected during the experiment. Onion, jar-food, and camera were the most difficult categories for the shape-only, color-only, and shape+color configurations, respectively, i.e., they required the largest number of instances. It should be noted that categories introduced near the end of the experiment were tested less, which is clearly visible in the general trend of fewer instances being stored for categories that appear later. The agent learned , , and categories with the shape-only, color-only, and shape+color (HSV) configurations, respectively. It is worth mentioning that the shape+color experiment finished due to the lack-of-data condition, showing the potential to learn many more categories.

Fig. 3 illustrates “how fast” the learning occurred in each of the experiments while shedding light on the number of learned categories (blue lines). It shows the number of question/correction iterations (QCI) required to learn a certain number of categories. We can see that, on average, the longest experiments were observed with shape+YUV, when the parameter was set to . The shortest ones were observed with shape+HSV with . It should be noted that the agent with the shape+HSV () configuration was able to learn all 51 categories in all experiments, while the experiments with shape+YUV () were stopped after reaching the breakpoint condition, having learned 50 categories on average (see Tables V and IV). In the case of shape+RGB, the best performance of the agent was achieved when the parameter was set to . With this shape+RGB configuration, the agent on average learned all categories using question/correction iterations. It was also observed that the longest experiments continued for question/correction iterations with the shape+RGB () configuration, and the agent on average was able to learn categories (see Table III).

Fig. 5: Summary of open-ended evaluations: these graphs represent the average number of stored instances per category and the average number of learned categories at the end of the experiments, as an indicator of how much memory each approach takes to learn a certain number of categories. The blue lines display the average number of learned categories as a function of color weight.
Fig. 6: These graphs show the global classification accuracy as a function of the number of learned categories in three different color spaces. In all these experiments, color weight was set to .
Fig. 7: System performance during the serve_a_coke scenario; (a) Initially, the system starts with no knowledge of any object. The posture of the UR5e arm in each state is also visualized. The table is then detected, as shown by the green polygon. Afterward, the object candidates are detected and highlighted by different bounding boxes. The local reference frame of each object represents the pose of the object as estimated by the object tracking module. (b) A user then teaches all the active objects to the system, and all objects are correctly recognized, i.e., the output of object recognition is shown in red on top of each object. (c) The robot then locates the CokeCan object, goes to its pre-grasp area, and (d) picks it up from the table. (e) The robot first retrieves the position of the Cup, then moves the CokeCan on top of the Cup and serves the drink. (f) Finally, the robot goes back to the initial position.
Fig. 8: Our experimental setup consists of a computer for human-robot interaction purposes, a Kinect sensor, and a UR5e robotic-arm as the primary sensory-motor embodiment for perceiving and acting upon its environment.

Fig. 4 represents the exact number of stored instances per category for shape-only, color-only (HSV), and shape+HSV (). Comparing the obtained results, it can be concluded that the agent with the shape+HSV configuration not only stored far fewer instances per category but also learned more categories. Fig. 5 provides a detailed summary of the obtained results concerning the average number of stored instances per category (AIC) as a function of color weight. Comparing all approaches, it is clear that shape+HSV, shape+YUV, and shape+RGB on average stored fewer than seven instances per category to learn all categories, while shape-only and color-only required more than 10 instances per category to learn and categories, respectively. The shape+HSV () configuration on average stored the smallest number of instances per category (see Table V).
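The AIC measure discussed above can be illustrated with a minimal sketch. It assumes, following the caption of Fig. 4, that each category starts with three instances and that every correction adds exactly one more instance to the model; the category names and correction counts below are hypothetical and are not data from the experiments.

```python
from statistics import mean

# Assumption taken from the Fig. 4 caption: three instances are stored
# when a category is introduced.
INITIAL_INSTANCES = 3


def stored_instances(corrections_per_category):
    """Return the stored-instance count per category.

    Assumes (for illustration) that each correction adds one instance.
    """
    return {
        name: INITIAL_INSTANCES + corrections
        for name, corrections in corrections_per_category.items()
    }


def average_instances_per_category(corrections_per_category):
    """Average number of stored instances per category (AIC)."""
    counts = stored_instances(corrections_per_category)
    return mean(counts.values())


# Hypothetical correction counts, for illustration only.
example_corrections = {"onion": 9, "camera": 4, "cup": 0, "vase": 1}
aic = average_instances_per_category(example_corrections)
```

A lower AIC for the same number of learned categories indicates that the representation needs less memory, which is how the shape+color configurations are compared with the shape-only and color-only ones.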

Fig. 6 shows the global classification accuracy obtained by the best combination of shape and color as a function of the number of learned categories in three different color spaces (the best configuration in each color space is highlighted in blue in the respective table). One important observation is that accuracy decreases in all approaches as more categories are introduced. This is expected, since a higher number of categories known by the system tends to make the classification task more difficult.

IV-C Real-robot experiment

To show the strength of the proposed approach, we carried out a real-robot experiment in the context of the serve_a_coke scenario. We integrated the proposed approach into the cognitive robotics system presented in [10]. In this experiment, a table is placed in front of a Kinect sensor, and a user interacts with the system. One instance of each of four object categories is on the table: CokeCan, BeerCan, Cup, and Vase. This is a suitable set of objects for this test, since it contains objects with very similar shapes and different colors (CokeCan, BeerCan, and Cup) as well as objects with very different shapes and similar colors (CokeCan and Vase). The experimental setup is shown in Fig. 8. It consists of a computer for human-robot interaction, a Kinect sensor for perceiving the environment, and a Universal Robot (UR5e) for manipulation purposes. Fig. 7 presents some snapshots of this experiment.

At the start of the experiment, the set of categories known to the system is empty, and therefore the system recognizes all table-top objects as Unknown (Fig. 7 (a)). A user interacts with the system by teaching all object categories. The system conceptualizes them using the extracted object views and recognizes all objects properly (Fig. 7 (b)). In this task, the robot must be able to detect the pose of the objects as well as recognize the label of all active objects. Afterward, it has to grasp the CokeCan object (Fig. 7 (c, d)), transport it on top of the Cup object, and serve the drink (Fig. 7 (e)). The robot finally returns to the initial pose (Fig. 7 (f)). It was observed that the proposed object descriptor is capable of providing a distinctive global feature for recognizing geometrically similar objects with different colors and vice versa. This evaluation also illustrates the process of learning object categories in an open-ended fashion. A video of this session is also available online at:

V Conclusion

In this article, we have investigated the importance of shape features, color constancy information, and similarity measures in open-ended 3D object recognition. Towards this goal, an instance-based 3D object category learning and recognition approach has been developed, which can be seen as a combination of a memory system, an object representation, a similarity measure, and the nearest neighbor classifier. We have selected three state-of-the-art global 3D shape descriptors, namely GOOD [13], ESF [28], and VFH [23], which provide a good trade-off between descriptiveness, computation time, and memory usage and are suitable for real-time robotic applications. In addition, a multitude of distance functions has been implemented to measure the similarity of two object views. Accordingly, system configurations have been examined in offline settings.
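The combination of a memory system, an object representation, a similarity measure, and a nearest neighbor classifier can be sketched as follows. This is a minimal illustration and not the authors' implementation: the class name `InstanceBasedLearner`, the method names, the toy three-bin histograms, and the choice of taking the minimum distance over the stored views of a category are all assumptions made for the example.

```python
import math


def euclidean(a, b):
    """L2 distance between two equal-length histograms."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """L1 distance, included only to show that the measure is pluggable."""
    return sum(abs(x - y) for x, y in zip(a, b))


class InstanceBasedLearner:
    """Hypothetical instance-based learner with a nearest neighbor rule."""

    def __init__(self, distance=euclidean):
        self.distance = distance
        self.memory = {}  # category label -> list of stored object views

    def teach(self, category, view):
        """Store a labelled object view (teach or correct action)."""
        self.memory.setdefault(category, []).append(list(view))

    def recognize(self, view):
        """Return the label of the closest category, or 'Unknown'."""
        if not self.memory:
            return "Unknown"

        def category_distance(category):
            # Assumption: object-to-category distance is the minimum
            # distance to any stored view of that category.
            return min(self.distance(view, s) for s in self.memory[category])

        return min(self.memory, key=category_distance)


learner = InstanceBasedLearner(distance=manhattan)
first_guess = learner.recognize([0.2, 0.8, 0.0])   # memory is still empty
learner.teach("CokeCan", [0.9, 0.1, 0.0])
learner.teach("Cup", [0.1, 0.8, 0.1])
second_guess = learner.recognize([0.2, 0.8, 0.0])
```

Because the distance function is passed in as a parameter, swapping similarity measures only requires changing the constructor argument, which mirrors the kind of comparison across distance functions described above.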

The offline experiments have been performed to optimize the parameters of the selected shape descriptors and to investigate the importance of similarity measures. It was observed that the combination of the GOOD descriptor () and function produced the best result in terms of both accuracy and computation time. We then investigated the importance of color information in an open-ended learning setting. In particular, we added the color constancy information of an object to its shape description. A set of open-ended experiments has been performed in three popular color spaces: RGB, YUV, and HSV. In this round of experiments, we adopted a teaching protocol to incrementally evaluate the performance of the system concerning several characteristics, including descriptiveness, scalability, and experiment time.

Experimental results show that the overall classification performance obtained with the proposed shape+color approach is clearly better than the best accuracies achieved with the color-only and shape-only methods. In particular, by setting the color weight parameter in the range of in all color spaces, the agent could learn all categories in all experiments with stable performance. This might suggest that there are reliable color differences between categories and similar color values within categories in the Washington RGB-D dataset [15]. This is not always the case in real-world environments. Furthermore, it was observed that the performance of the agent with the color-only setting () was better than with the shape-only configuration (). This might be caused by a data bias in the dataset. Concerning learning speed, measured in question/correction iterations (QCI), the best result was obtained with shape+HSV (), followed by shape+YUV with the same . It was also observed that the agent could learn new categories from very few examples in an incremental and open-ended manner. A real-robot demonstration was also carried out to show the usefulness of the proposed method.

Although the addition of color information to the object representation improved the performance of object recognition, the bins representing the color constancy information were greatly outnumbered by the bins dedicated to the shape of the objects. The color information had a small role in the resulting histogram since only the color constancy of the object was used. In the continuation of this work, we would like to investigate the possibility of integrating color information in a concrete manner. Furthermore, separate distance functions could be used to estimate the similarity of objects in terms of shape and color information.
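The idea of using separate distance functions for shape and color can be sketched as below. This is a hypothetical illustration of the future-work suggestion, not a method evaluated in this paper: the function `combined_distance`, the chi-squared and Euclidean choices, the convex weighting by `color_weight`, and the toy histograms are all assumptions.

```python
import math


def chi_squared(a, b):
    """Chi-squared histogram distance; bins that are empty in both are skipped."""
    return 0.5 * sum(
        (x - y) ** 2 / (x + y) for x, y in zip(a, b) if (x + y) > 0
    )


def euclidean(a, b):
    """L2 distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def combined_distance(view_a, view_b, n_shape_bins, color_weight,
                      shape_distance=chi_squared, color_distance=euclidean):
    """Weighted sum of a shape distance and a color distance.

    Each view is assumed to be a flat list: the first n_shape_bins entries
    describe shape, the remaining entries describe color.
    """
    if not 0.0 <= color_weight <= 1.0:
        raise ValueError("color_weight must lie in [0, 1]")
    d_shape = shape_distance(view_a[:n_shape_bins], view_b[:n_shape_bins])
    d_color = color_distance(view_a[n_shape_bins:], view_b[n_shape_bins:])
    return (1.0 - color_weight) * d_shape + color_weight * d_color


# Two toy views with identical shape bins and different color bins.
view_a = [0.5, 0.5, 1.0, 0.0, 0.0]
view_b = [0.5, 0.5, 0.0, 1.0, 0.0]
d = combined_distance(view_a, view_b, n_shape_bins=2, color_weight=0.3)
```

Computing the two distances separately means the influence of color is set by the weight alone and no longer depends on how many bins each part of the histogram contributes, which addresses the imbalance noted above.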


  • [1] I. Bramão, L. Faísca, K. M. Petersson, and A. Reis (2012) The contribution of color to object recognition. Advances in Object Recognition Systems, pp. 16. Cited by: §III.
  • [2] L.E. Carvalho and A. von Wangenheim (2019) 3D object recognition and classification: a systematic literature review. Pattern Anal Applic 22 (2), pp. 1243–1292. Cited by: §II.
  • [3] S. Cha (2007) Comprehensive survey on distance/similarity measures between probability density functions. City 1 (2), pp. 1. Cited by: §IV-A.
  • [4] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587. Cited by: §II.
  • [5] Y. Guo, M. Bennamoun, F. Sohel, M. Lu, and J. Wan (2014) 3D object recognition in cluttered scenes with local surface features: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (11), pp. 2270–2287. Cited by: §II.
  • [6] X. Han, J. S. Jin, J. Xie, M. Wang, and W. Jiang (2018) A comprehensive review of 3D point cloud descriptors. arXiv preprint arXiv:1802.02297. Cited by: §II.
  • [7] S. H. Kasaei, L. S. Lopes, and A. M. Tomé (2018) Coping with context change in open-ended object recognition without explicit context information. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1–7. Cited by: §IV-B2.
  • [8] S. H. Kasaei, L. S. Lopes, A. M. Tomé, and M. Oliveira (2016) An orthographic descriptor for 3D object learning and recognition. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4158–4163. Cited by: §III, §IV-A.
  • [9] S. H. Kasaei, M. Oliveira, G. H. Lim, L. S. Lopes, and A. M. Tomé (2018) Towards lifelong assistive robotics: a tight coupling between object perception and manipulation. Neurocomputing 291, pp. 151–166. Cited by: §IV.
  • [10] S. H. Kasaei, N. Shafii, L. S. Lopes, and A. M. Tomé (2019) Interactive open-ended object, affordance and grasp learning for robotic manipulation. In 2019 IEEE/RSJ International Conference on Robotics and Automation (ICRA), pp. 3747–3753. Cited by: §IV-C.
  • [11] S. H. Kasaei, J. Sock, L. S. Lopes, A. M. Tomé, and T. Kim (2018) Perceiving, learning, and recognizing 3D objects: an approach to cognitive service robots. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §IV.
  • [12] S. H. Kasaei, M. Oliveira, G. H. Lim, L. Seabra Lopes, and A. M. Tomé (2015) Interactive open-ended learning for 3D object recognition: an approach and experiments. Journal of Intelligent & Robotic Systems 80 (3), pp. 537–553. External Links: ISSN 0921-0296, 1573-0409, Document Cited by: §IV-B.
  • [13] S. H. Kasaei, A. M. Tomé, L. Seabra Lopes, and M. Oliveira (2016) GOOD: a global orthographic object descriptor for 3D object recognition and manipulation. Pattern Recognition Letters 83, pp. 312–320. External Links: ISSN 01678655, Document Cited by: §I, §II, §III, §IV, §V.
  • [14] T. Kim, M. Jeong, S. Kim, S. Choi, and C. Kim (2019) Diversify and match: a domain adaptive representation learning paradigm for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12456–12465. Cited by: §II.
  • [15] K. Lai, L. Bo, X. Ren, and D. Fox (2011) A large-scale hierarchical multi-view RGB-D object dataset. In 2011 IEEE International Conference on Robotics and Automation, pp. 1817–1824. Cited by: §IV-B2, §V.
  • [16] C. Leng, H. Zhang, B. Li, G. Cai, Z. Pei, and L. He (2018) Local feature descriptor for image matching: a survey. IEEE Access 7, pp. 6424–6434. Cited by: §II.
  • [17] J. P. S. d. M. Lima and V. Teichrieb (2016) An efficient global point cloud descriptor for object recognition and pose estimation. In 2016 29th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), pp. 56–63. External Links: ISBN 978-1-5090-3568-7 Cited by: §II.
  • [18] L. S. Lopes and A. Chauhan (2007) How many words can my robot learn?: an approach and experiments with one-class learning. Interaction Studies 8 (1), pp. 53–81. Cited by: §IV-B2.
  • [19] E. Martinez-Martin and A. P. Del Pobil (2017) Object detection and recognition for assistive robots: experimentation and implementation. IEEE Robotics & Automation Magazine 24 (3), pp. 123–138. Cited by: §II.
  • [20] M. Oliveira, L. S. Lopes, G. H. Lim, S. H. Kasaei, A. M. Tomé, and A. Chauhan (2016) 3D object perception and perceptual learning in the race project. Robotics and Autonomous Systems 75, pp. 614–626. Cited by: §IV-B2, §IV.
  • [21] D. Regazzoni, G. de Vecchi, and C. Rizzi (2014) RGB cams vs RGB-d sensors: low cost motion capture technologies performances and limitations. Journal of Manufacturing Systems 33 (4), pp. 719–728. External Links: ISSN 02786125, Document Cited by: §I.
  • [22] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski (2011) ORB: an efficient alternative to sift or surf. In 2011 International conference on computer vision, pp. 2564–2571. Cited by: §III.
  • [23] R. B. Rusu, G. Bradski, R. Thibaux, and J. Hsu (2010) Fast 3D recognition and pose using the viewpoint feature histogram. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2155–2162. External Links: ISSN 2153-0858 Cited by: §II, §IV, §V.
  • [24] R. B. Rusu, N. Blodow, and M. Beetz (2009) Fast point feature histograms (fpfh) for 3D registration. In 2009 IEEE International Conference on Robotics and Automation, pp. 3212–3217. Cited by: §II.
  • [25] R. B. Rusu and S. Cousins (2011) 3D is here: point cloud library (pcl). In 2011 IEEE international conference on robotics and automation, pp. 1–4. Cited by: §II.
  • [26] R. B. Rusu, Z. C. Marton, N. Blodow, and M. Beetz (2008) Learning informative point classes for the acquisition of object model maps. In 2008 10th International Conference on Control, Automation, Robotics and Vision, pp. 643–650. Cited by: §II.
  • [27] M. Ullrich, H. Ali, M. Durner, Z. Márton, and R. Triebel (2017) Selecting cnn features for online learning of 3D objects. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5086–5091. Cited by: §II.
  • [28] W. Wohlkinger and M. Vincze (2011) Ensemble of shape functions for 3D object classification. In 2011 IEEE international conference on robotics and biomimetics, pp. 2987–2992. Cited by: §II, §IV, §V.