Combining Shape Features with Multiple Color Spaces in Open-Ended 3D Object Recognition

by Nils Keunecke, et al.
University of Groningen

As a consequence of an ever-increasing number of camera-based service robots, there is a growing demand for highly accurate real-time 3D object recognition. Considering the expansion of robot applications into more complex and dynamic environments, it is evident that it is impossible to pre-program all possible object categories. Robots will have to be able to learn new object categories in the field. The network architecture proposed in this work extends the OrthographicNet, an approach recently proposed by Kasaei et al., which uses a deep transfer learning strategy that not only meets the aforementioned requirements but additionally generates a scale- and rotation-invariant reference frame for the classification of objects. In its current iteration, the OrthographicNet only uses shape information. With the addition of multiple color spaces, the upgraded network architecture proposed here can achieve an even higher descriptiveness while simultaneously increasing the robustness of predictions for similarly shaped objects. Multiple color space combinations and network architectures are evaluated to find the most descriptive system. However, this performance increase is not achieved at the cost of longer processing times, which matters because any system deployed in robotic applications needs to provide real-time information about its environment. Experimental results show that the proposed network architecture ranks competitively among other state-of-the-art algorithms.




I Introduction

By the end of this decade, autonomous mobile robots are expected to be used everywhere in our lives as service robots, self-driving cars and in industrial process automation. However, as the environment in which a robot has to operate becomes ever more complex, it is important that the robot can rely on a robust and accurate form of perceiving its environment. Although 3D object recognition has made significant advancements recently, there are unresolved issues. A dynamic environment makes it impossible to pre-train all possible object categories before the robot goes into operation. Any object recognition system will need the ability to learn new categories in an online scenario, while the robot is running. In order to meet these requirements, a system must be able to adjust its representation of existing categories as new instances are encountered and to accept new categories. While this procedure can be partly supervised in the form of human feedback, the system also has to learn independently from on-site experiences in its environment.
The recently proposed OrthographicNet, as introduced by Kasaei et al. [1], addresses these issues. The approach uses orthographic projections of objects to compute a scale- and rotation-invariant reference frame. Based on a top, side and front view, which are processed by CNNs, the OrthographicNet generates view-specific feature vectors of all projections. These are then max/avg pooled into a global feature vector before an object is ultimately classified. The advantage of this architectural design is a higher average classification accuracy, because the difference between the feature vectors of instances of the same class decreases. However, in its current version, the OrthographicNet is a shape-only descriptor, which means that it does not evaluate any color information. As camera-based robot applications gain in popularity and experimental evidence supports the importance of color-based information for object recognition (Kasaei et al. [2]), the approach proposed in this work argues that the state-of-the-art performance of Kasaei's OrthographicNet can be increased if the descriptor additionally takes color information into consideration. Previous research by Gowda and Yuan [3] indicates that processing additional color spaces, compared to only evaluating standard RGB data, can further increase the accuracy and robustness of the classifier.
It can be expected that not only the average class accuracy increases, but that objects which were previously difficult to separate may then be divided into more precisely defined classes. The ability to provide real-time 3D object recognition cannot be abandoned, due to its importance in service robotics and autonomous transportation. Therefore, memory usage and computation time have to be used as performance metrics during the evaluation of any robot perception system as well.
The remainder of this paper is organized as follows: Section II presents related work. Section III discusses the architecture of the system presented in this paper. Section IV presents the results of the offline and online evaluation and additionally features the robot demonstration. Finally, Section V draws conclusions and proposes further work.

II Related Works

Three-dimensional object recognition has become a field of fundamental importance in computer vision, pattern recognition and robotics. Convolutional neural networks are frequently used for these image classification tasks [4]. Recently, a tendency towards deeper network architectures can be observed [5] [6]. While excellent results have been reported, it has become evident that, due to the enormous number of parameters, computation time and memory requirements have to be considered as well. Once in operation, the image classification algorithm may have limited resources, for example in robot applications. Generally, there is a trend towards optimizing the ratio of descriptiveness to computational complexity of network architectures [3] [7].

Additionally, it has been recognized that it is impossible to pre-train neural networks for 3D object recognition entirely. Open-ended learning approaches have recently become more popular [8] [9] [10]. These approaches address the difficulty CNNs have in adapting to an increasing number of categories, as this would require reshaping the topology of the network. CNNs tend to require a lot of training data to perform accurately, which is infeasible in open-ended learning, as new categories have to be learned opportunistically with very few instances in the beginning and incrementally updated as more instances are presented. Among others, Kasaei et al. [1] report that their transfer learning approach yields excellent results for such tasks. These open-ended network architectures are usually pre-trained on the ImageNet dataset [11].


Fig. 1: This figure by Ayoobi et al. [12] illustrates the problem that shape-only descriptors have with similarly shaped objects. The shape representations of the different presented categories are almost impossible to tell apart. Color information, in contrast, makes the differentiation easy.

In this work, the OrthographicNet as an open-ended deep transfer learning approach to 3D object recognition is addressed in particular. It is one of several recently proposed view-based approaches, which tend to show superior performance compared to volume-based and point-based approaches [13]. In contrast to other view-based approaches, the OrthographicNet uses a single view to generate 2D orthographic projections of an object. For robotics applications and other real-world implementations, multi-view representations of objects are problematic because objects are rarely fully observable.
In general, most object recognition algorithms can be divided into two groups, focusing either on shape information [14] [15] [16] or on color information [17]. While shape-only approaches frequently struggle with similarly shaped objects [2] (Fig. 1), color-based approaches are sensitive to shadows and illumination [18] and tend to be biased towards texture [19]. Even though several state-of-the-art shape-only approaches exist, Bramão et al. have made the neuroscientific argument that color information is essential for human object recognition [20]. Recent findings suggest that neural networks can profit as well if they combine the available shape information with color information [2] [21] [22]. Several strategies to achieve this interaction exist. Three approaches are presented in order of increasing computational complexity. (i) A color constancy value can be calculated to find the average color of an object [2]. While this approach already increases the performance, it cannot capture complicated textures. (ii) Shape information can be evaluated in parallel to an RGB image of the object [23] [24]. This reduces in particular the tendency of purely color-based descriptors to be biased towards texture and helps shape-only descriptors to differentiate between objects of similar shape (like two different soda cans). (iii) Gowda and Yuan have found that combining different color spaces by transforming the RGB image improves the overall accuracy [3], as different features of an object are represented differently in different color spaces. The approach proposed in this work builds on the third strategy and combines it with shape information: it searches for the best combination of color spaces and combines it with a state-of-the-art shape descriptor, the aforementioned OrthographicNet.
It is expected that the resulting architecture has all the advantages of shape and color descriptors while the two parts of the network architecture mitigate each other's weaknesses.
In the following section, the overall system architecture is explained in detail to substantiate the hypothesis that the performance of the OrthographicNet can be improved if the current implementation of the network is extended with color information.

III Overall System Architecture

Fig. 2: The overall system architecture: In the first step, the RGB-D input is preprocessed to generate orthographic projections for shape and color information individually. On the right side of the graph, the shape information is processed via the OrthographicNet [1]. On the left side, RGB color information is transformed into different color spaces and evaluated individually either via a DenseNet 40-12-BC or the MobileNetv2, depending on the setup. The output of both sides is concatenated into a global feature vector, which is finally classified.

From an RGB-D image, a point cloud is extracted, where each point contains standard RGB color information (a red, green, and blue value) as well as a distance value. The data is separated into two input streams, one for color information and one for shape information, which are fed into the proposed network architecture (Fig. 2). The preprocessing of the input images mostly remains unchanged from the original OrthographicNet [1]. Following the process described by Kasaei et al., three scale- and rotation-invariant projections are calculated for the depth data and for the color information, respectively. For details on this process, the reader is referred to the aforementioned paper. The obtained orthographic depth projections, namely the front, side, and top view, are fed into the OrthographicNet. In this work, the MobileNetv2 [25] was used for each individual projection, as Kasaei et al. have reported the best results with this network architecture.
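To make the idea of the three projections concrete, the following minimal sketch bins a point cloud into a front, side, and top view by dropping one coordinate at a time. It is an illustration under stated assumptions, not the preprocessing of [1]: the points are assumed to be already expressed in the object's reference frame, and the function names, the bin count, and the axis-to-view assignment are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]
Grid = List[List[int]]


def project(points: Sequence[Point], axes: Tuple[int, int], bins: int) -> Grid:
    """Count points per cell after dropping one axis (orthographic projection)."""
    # A common extent for all axes keeps the three views on the same scale.
    extent = max((abs(c) for p in points for c in p), default=0.0) or 1.0
    grid = [[0] * bins for _ in range(bins)]
    for p in points:
        # Map a coordinate in [-extent, extent] to a bin index in [0, bins - 1].
        row = min(int((p[axes[0]] + extent) / (2 * extent) * bins), bins - 1)
        col = min(int((p[axes[1]] + extent) / (2 * extent) * bins), bins - 1)
        grid[row][col] += 1
    return grid


def orthographic_views(points: Sequence[Point], bins: int = 8) -> Dict[str, Grid]:
    """Front, side, and top view (the axis assignment is an assumption)."""
    return {
        "front": project(points, (1, 2), bins),  # drop x
        "side": project(points, (0, 2), bins),   # drop y
        "top": project(points, (0, 1), bins),    # drop z
    }
```

Each view is a small 2D histogram that could be rendered as an image and passed to a CNN; the same binning could carry color values instead of counts for the color stream.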
To process the color information, initial tests showed that no performance improvements could be gained by using all three projections; rather, a single projection performed significantly better than the other two. This is related to the aforementioned texture bias of color-based classifiers. Fig. 3 shows the three calculated orthographic RGB projections of a soda can. Entropy is used as a metric to measure how much information is contained in a single projection. All three orthographic RGB projections of an object are compared, and the projection with the highest entropy (i.e., the most information) is selected for further processing. Using only one projection instead of three reduces the network size of the color evaluation by two-thirds, which benefits the computational performance.

Fig. 3: Front, side, and top view of a soda can

Based on the work of Gowda and Yuan [3], the following color spaces were selected: RGB, HED, HSV, LAB, YCbCr, YIQ, and YUV. Additionally, grayscale was included for a total of 8 color spaces. While all color spaces represent the same color, they use different mathematical models to represent that color. The resulting numerical differences have an impact on neural networks, where different filters are learned depending on the color space. Input images are transformed into each color space using the respective transformation, as shown by way of example in the following equation for the transformation from RGB to the YUV color space.
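A commonly used form of this transformation is given here, assuming the BT.601 coefficients; this is an assumption, since other YUV variants use slightly different values and the coefficients used in this work are not stated in the text:

```latex
\begin{equation}
\begin{bmatrix} Y \\ U \\ V \end{bmatrix}
=
\begin{bmatrix}
 0.299    &  0.587    &  0.114   \\
-0.14713  & -0.28886  &  0.436   \\
 0.615    & -0.51499  & -0.10001
\end{bmatrix}
\begin{bmatrix} R \\ G \\ B \end{bmatrix}
\end{equation}
```

Here Y carries the luma and U and V carry the chroma, so a neutral gray pixel (R = G = B) maps to U and V values close to zero.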


After the color space transformations, input images are fed into CNNs.
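The entropy-based selection of a single color projection described earlier in this section can be sketched as follows. The text does not state how the entropy is computed, so this sketch assumes the Shannon entropy of the intensity histogram of a grayscale version of each projection; the function names and the toy images are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

Image = List[List[int]]  # grayscale intensities, e.g. 0..255


def shannon_entropy(image: Image) -> float:
    """Shannon entropy (in bits) of the intensity histogram of one projection."""
    pixels = [value for row in image for value in row]
    counts = Counter(pixels)
    total = len(pixels)
    # Sum over the relative frequency of every intensity that occurs.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def select_view(views: Dict[str, Image]) -> str:
    """Return the name of the projection with the highest entropy."""
    return max(views, key=lambda name: shannon_entropy(views[name]))


views = {
    "front": [[0, 0], [0, 0]],       # uniform image: 0 bits
    "side": [[0, 255], [0, 255]],    # two equally frequent values: 1 bit
    "top": [[0, 85], [170, 255]],    # four equally frequent values: 2 bits
}
best = select_view(views)
```

In this toy example the top view is selected, because its four distinct intensities give the flattest histogram.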

IV Results

To evaluate the proposed approach, a total of three experiments were performed: the offline-evaluation, online-evaluation and finally a real-time robot demonstration.

IV-A Offline Evaluation

In this paper, two neural network architectures were used to evaluate color information. A DenseNet [26] 40-12-BC is used, which is 40 layers deep and has a growth rate of 12. BC refers to the use of bottleneck layers within the dense blocks and of compression in the transition layers that follow them. Additionally, the MobileNetv2 was used. It is significantly deeper than the DenseNet and has ten times as many parameters (Table I). A separate network of each architecture was trained for each color space.

Model | DenseNet | MobileNetv2
Depth | 40 | 88
Feature length | 132 float | 1280 float
Input size | 64 x 64 | 224 x 224
Parameters | 0.225M | 2.25M
Size | 3 MB | 14.5 MB
TABLE I: Properties of the used CNNs

IV-A1 Color space evaluation

Testing was carried out on the Washington-D [27] dataset. This dataset contains images of 300 common household items, which are organized in 51 classes. From the available 250,000 views, 50,000 orthographic projections were generated and divided into a training and a validation dataset with an 80/20 split. From the results displayed in Table II, it can be observed that the MobileNetv2 performs about 1.4 percentage points better on average than the DenseNet. However, in the HSV and the YCbCr color spaces, the DenseNet reports better results than the MobileNetv2. The overall highest average class accuracy (ACA) is found for RGB with the MobileNetv2.

Color space | DenseNet 40-12-BC | MobileNetv2
RGB | 96.56% | 98.56%
HED | 93.59% | 96.86%
HSV | 97.32% | 96.40%
LAB | 96.37% | 96.87%
YCbCr | 97.44% | 97.19%
YIQ | 92.33% | 92.51%
YUV | 95.24% | 97.43%
grayscale | 91.55% | 95.80%
TABLE II: Accuracies for all color spaces
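The per-network averages behind this comparison can be recomputed directly from Table II. The short script below only restates the table values and does not add any new measurement; the variable names are illustrative.

```python
# ACA values in percent, copied from Table II: (DenseNet 40-12-BC, MobileNetv2).
ACA = {
    "RGB": (96.56, 98.56),
    "HED": (93.59, 96.86),
    "HSV": (97.32, 96.40),
    "LAB": (96.37, 96.87),
    "YCbCr": (97.44, 97.19),
    "YIQ": (92.33, 92.51),
    "YUV": (95.24, 97.43),
    "grayscale": (91.55, 95.80),
}

densenet_mean = sum(d for d, _ in ACA.values()) / len(ACA)
mobilenet_mean = sum(m for _, m in ACA.values()) / len(ACA)
gap = mobilenet_mean - densenet_mean

# Color spaces in which the smaller DenseNet beats the MobileNetv2.
densenet_wins = [name for name, (d, m) in ACA.items() if d > m]

print(f"DenseNet 40-12-BC mean ACA: {densenet_mean:.2f}%")  # 95.05%
print(f"MobileNetv2 mean ACA:       {mobilenet_mean:.2f}%")  # 96.45%
print(f"Gap in percentage points:   {gap:.2f}")              # 1.40
print("DenseNet wins in:", densenet_wins)                    # ['HSV', 'YCbCr']
```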

The best ACA of the DenseNet was achieved with the YCbCr color space at 97.44%. Notably, some color spaces perform significantly worse in this classification task. These are grayscale, YIQ and HED. It is evident that not all color spaces may be beneficial for the overall accuracy of a combined system.

IV-A2 Color space optimization

In the next step, the selection of color spaces was optimized by looking for the combination that yielded the highest average class accuracy. To combine the networks trained on a specific color space, the feature vectors of the individual networks were merged using maximum/average pooling and classified with a few fully connected layers. Due to the different sizes of the feature vectors of the DenseNet and the MobileNetv2, the color space optimization was carried out separately for each architecture. It was found that the best combination of color spaces for the MobileNetv2 as well as the DenseNet was a combination of the RGB, HSV, YCbCr and YUV color spaces, at 98.89% and 98.12%, respectively. As computational performance is a key evaluation metric, Table III also shows the best combinations of two and three color spaces.

Combination (DenseNet 40-12-BC) | ACA
YCbCr | 97.44%
HSV, YCbCr | 97.48%
RGB, HSV, YCbCr | 97.67%
RGB, HSV, YCbCr, YUV | 98.12%

Combination (MobileNetv2) | ACA
RGB | 98.56%
RGB, YCbCr | 98.69%
RGB, YCbCr, YUV | 98.79%
RGB, HSV, YCbCr, YUV | 98.89%
TABLE III: Color space optimization for DenseNet 40-12-BC and MobileNetv2
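The merging of the per-color-space feature vectors by maximum or average pooling can be sketched as an element-wise operation over equally long vectors. This is a minimal illustration, assuming one feature vector per color space from the same architecture; the function name `pool` and the toy values are hypothetical.

```python
from typing import List, Sequence


def pool(vectors: Sequence[Sequence[float]], mode: str = "max") -> List[float]:
    """Element-wise max or average pooling over equally long feature vectors."""
    if not vectors:
        raise ValueError("at least one feature vector is required")
    length = len(vectors[0])
    if any(len(v) != length for v in vectors):
        # Vectors of different length (e.g. 132 vs. 1280) cannot be pooled.
        raise ValueError("all feature vectors must have the same length")
    if mode == "max":
        return [max(column) for column in zip(*vectors)]
    if mode == "avg":
        return [sum(column) / len(vectors) for column in zip(*vectors)]
    raise ValueError(f"unknown pooling mode: {mode}")


# Toy feature vectors for two color spaces (hypothetical values).
rgb_features = [1.0, 4.0, 0.5]
ycbcr_features = [3.0, 2.0, 0.5]
pooled_max = pool([rgb_features, ycbcr_features], "max")
pooled_avg = pool([rgb_features, ycbcr_features], "avg")
```

The length check reflects why the optimization is carried out per architecture: pooling is only defined when all vectors have the same length.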

Once the best color space combination is obtained, the final and complete system architecture can be constructed. After an input image is evaluated in terms of its shape information as well as its color information, a combined feature vector is constructed by concatenating the two individual feature vectors. Maximum or average pooling cannot be applied here because, for shape, all orthographic projections are evaluated, while for color, only the orthographic projection with the maximum entropy is evaluated. As in the individual branches described previously, a few fully connected layers with dropout layers in between are used for classification.

IV-A3 Evaluation of the final network architecture

To test the performance of the final network architecture, the Washington-D dataset [27] was used, divided 80/20 into training and validation data. To reduce the overhead of the color space transformation, all images were transformed beforehand for this part of the evaluation. Because the Washington-D dataset is known to be color biased [2] and the previous color-based evaluation in this work consistently yielded higher accuracies than those reported for the OrthographicNet [1], a color weight w is introduced and applied to the color feature vector. The shape feature vector is weighted with 1 - w accordingly:


The combined feature vector is obtained by weighting the shape feature vector, computed from the three orthographic projections (front, side, and top view), and the color feature vector, computed from the view with the maximum entropy.
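The weighting described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the shape weight 1 - w is inferred from Table IV (w = 0 gives the same result for both color networks, i.e. a shape-only system, and w = 1.0 reproduces the color-only results of Table III), and the function name `combine_features` and the toy vectors are hypothetical.

```python
from typing import List, Sequence


def combine_features(shape: Sequence[float], color: Sequence[float],
                     w: float) -> List[float]:
    """Concatenate the shape vector scaled by (1 - w) and the color vector scaled by w."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("the color weight w must lie in [0, 1]")
    # Concatenation (not pooling) is used because the two vectors differ in
    # length and describe different inputs.
    return [(1.0 - w) * x for x in shape] + [w * x for x in color]


shape_vector = [1.0, 2.0]   # hypothetical output of the shape branch
color_vector = [10.0]       # hypothetical output of the color branch

balanced = combine_features(shape_vector, color_vector, w=0.5)
shape_only = combine_features(shape_vector, color_vector, w=0.0)
color_only = combine_features(shape_vector, color_vector, w=1.0)
```

With w = 0 the color part of the vector is all zeros and with w = 1 the shape part is, which matches the two boundary rows of Table IV.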

To mitigate the effect of weight initialization and batch shuffling, each network architecture was trained three times and the reported average class accuracy (ACA) was averaged over the runs. Training was carried out with stochastic gradient descent (SGD), a learning rate of 0.05, and a learning rate decay of 2%.

Weight w | DenseNet 40-12-BC | MobileNetv2
0 | 90.56% | 90.56%
0.2 | 97.44% | 97.51%
0.4 | 98.10% | 98.48%
0.5 | 98.87% | 98.92%
0.6 | 99.14% | 99.00%
0.7 | 99.13% | 99.01%
0.8 | 99.00% | 99.07%
0.9 | 98.92% | 98.93%
1.0 | 98.12% | 98.89%
TABLE IV: Average class accuracy (ACA) for the combined network of color and shape information. A weight w was defined and multiplied with the feature vector of the color evaluation. The feature vector of the shape evaluation was weighted with 1 - w.

Based on the results displayed in Table IV, the model which uses the DenseNet 40-12-BC for color evaluation achieves the highest overall ACA with 99.14% for a color vector weight of 0.6. The highest accuracy for the MobileNetv2 based system was 99.07%, at a weight of 0.8. The MobileNetv2 based color evaluation with its 2.25M parameters thus performs slightly worse than the DenseNet 40-12-BC with only 0.225M parameters. While the classifier is theoretically able to learn the optimal weight between color and shape feature vectors during training, minor improvements can be observed from the weight initialization. Furthermore, the optimal weight between color and shape information will be relevant in the next step of the evaluation, where the open-ended capabilities of the network architecture were tested.

IV-B Online Evaluation

To evaluate the network architecture proposed in this paper in the open-ended scenario, a test-then-train approach is used. Following the protocol proposed by Kasaei [2], a simulated user is defined which carries out three different actions. The teach-action introduces a new category to the system. The ask-action presents a previously unseen view of a known category to the system. In the case of a misclassification, the simulated user chooses the correct-action to notify the system of the misclassification and to tell it the true category of the object. Following this protocol, it is possible to simulate an environment in which a robot is simultaneously learning and recognising. At the beginning of the evaluation, the system does not know any categories, but is pretrained on the Washington-D dataset. An instance-based learning approach [9] is then used to classify the seen object based on the feature vector generated by the network. By continuously introducing new instances of categories to the system, the robustness of the category representations increases. A new category is introduced once the recognition accuracy exceeds a threshold. Should the system fail to reach this threshold after 100 iterations, the experiment is aborted by the simulated user, as it can be concluded that the system no longer has the robustness to learn additional categories. For the online evaluation, the Washington-D dataset [27] was used. As the dataset is limited in terms of categories, it is possible that a system learns all available categories. This stopping condition, "lack of data", is indicated with (*) in Table V.

The performance of the online evaluation highly depends on the order in which categories and views are selected by the simulated user. To account for this factor, all experiments in this section were carried out 10 times. For better comparability and a more precise estimate of the potential performance of this work, five evaluation metrics are used. QCI denotes the number of question-correction iterations that were necessary to learn the categories. This acts as a measure of how fast the system learned. ALC is the average number of categories learned by the system. AIC is the average number of instances per category. Finally, the global class accuracy (GCA) and the average protocol accuracy (APA) indicate how well the system performs.
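The teach, ask, and correct actions of the simulated user can be sketched as a small loop. This is a simplified illustration under stated assumptions and not the evaluation code used in this work: a nearest-neighbour rule stands in for the instance-based learner of [9], the threshold of 0.67 and the sliding window of three times the number of known categories are assumed values that the text does not specify, every category is assumed to contain at least one view, and all names and toy data are hypothetical. APA is not computed here.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def predict(memory: Dict[str, List[Vector]], query: Vector) -> str:
    """Instance-based classification: label of the nearest stored instance."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(memory, key=lambda c: min(sq_dist(v, query) for v in memory[c]))


def simulated_teacher(data: Dict[str, List[Vector]], threshold: float = 0.67,
                      max_iterations: int = 100) -> Dict[str, object]:
    unseen = {c: list(views) for c, views in data.items()}
    memory: Dict[str, List[Vector]] = {}
    qci = asked = correct = 0

    def summary(stop: str) -> Dict[str, object]:
        stored = sum(len(v) for v in memory.values())
        return {"stop": stop, "QCI": qci, "categories": len(memory),
                "AIC": stored / len(memory) if memory else 0.0,
                "GCA": correct / asked if asked else 0.0}

    for category in data:
        memory[category] = [unseen[category].pop(0)]      # teach-action
        outcomes: List[bool] = []
        for iteration in range(max_iterations):
            known = [c for c in memory if unseen[c]]
            if not known:
                return summary("lack of data")
            target = known[iteration % len(known)]
            view = unseen[target].pop(0)                   # ask-action
            qci += 1
            asked += 1
            hit = predict(memory, view) == target
            outcomes.append(hit)
            if hit:
                correct += 1
            else:
                memory[target].append(view)                # correct-action
            window = outcomes[-3 * len(memory):]
            if len(window) >= len(memory) and sum(window) / len(window) > threshold:
                break                                      # ready for a new category
        else:
            return summary("threshold not reached")
    return summary("lack of data")


# Three well-separated toy categories with four 2D "feature vectors" each.
toy = {
    "mug": [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)],
    "can": [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)],
    "vase": [(9.0, 0.0), (9.1, 0.0), (9.0, 0.1), (9.1, 0.1)],
}
result = simulated_teacher(toy)
```

Because the toy categories are trivially separable, every question is answered correctly and the run ends once all categories have been taught, which corresponds to the "lack of data" stopping condition.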

Approach | QCI | ALC | AIC | GCA | APA
RACE | 382.10 | 19.90 | 8.88 | 0.67 | 0.78
BoW | 411.80 | 21.80 | 8.20 | 0.71 | 0.82
Open-Ended LDA | 262.60 | 14.40 | 9.14 | 0.66 | 0.80
GOOD | 1659.20 | 39.20 | 17.28 | 0.66 | 0.74
OrthographicNet(*) | 1342.60 | 51.00 | 8.97 | 0.77 | 0.80
this + DenseNet 40-12-BC(*) | 1409.10 | 51.00 | 10.28 | 0.75 | 0.77
this + MobileNetv2(*) | 1329.10 | 51.00 | 7.97 | 0.81 | 0.83
(*) indicates that the stopping condition was "lack of data."
TABLE V: Summary of the online evaluation

The results of these experiments, in relation to other recent approaches in open-ended object recognition such as BoW, RACE, Open-Ended LDA and GOOD [16], are displayed in Table V. It can be observed that the MobileNetv2 based network architecture proposed in this work achieves the highest GCA and APA of all approaches. Among the approaches that learned all 51 categories, it also learns the fastest, with 1329.10 question-correction iterations, and the average number of instances per category decreased by 1, from 8.97 to 7.97, compared to the shape-only OrthographicNet. Additionally, the network architecture shows a performance increase over the OrthographicNet in the GCA (0.81 versus 0.77) and APA (0.83 versus 0.80) evaluation metrics. The excellent scalability of the OrthographicNet can still be observed, as all 10 experiments for both new network architectures have consistently learned all 51 categories. However, the performance improvement from a purely shape-based descriptor to a system which combines shape and color information, as seen in the offline evaluation, was not nearly as significant in the online evaluation. The network architecture which used the DenseNet 40-12-BC even performed worse than the original OrthographicNet. The DenseNet based model learned more slowly, with 1409.10 question-correction iterations, and performed worse than the original OrthographicNet as well as the MobileNetv2 based network architecture, with a global class accuracy of 0.75 and an average protocol accuracy of just 0.77. A possible cause for the low performance of the DenseNet based architecture may be the comparatively small feature vector.

IV-C Real-Time robot demonstrations

Fig. 4: From the top left to the bottom right, these snapshots show the performance of the system during the Serve_A_Beer scenario. (1) The setup of the environment with the table, the 5 objects, the XBox Kinect sensor, and the UR5e robot arm. (2) Through the Kinect sensor, it can be observed that the table is detected as well as all objects, as indicated by the bounding boxes. (3-4) As more categories are introduced by the human user, the objects are identified. (5) After the command to "serve a beer" is given, the object is located and grasped. (6) The robot moves the object over the target location.

In the final section of the evaluation of the presented approach, the network architecture is integrated into an object perception system developed by Oliveira et al. [28]. For this demonstration, a "Serve_A_Beer" scenario is used. The setup consists of a table with five different objects, namely a BeerCan, CocktailCan, Mug, Oreo, and Vase. An XBox Kinect sensor is used as the perception device and a UR5e robot arm as the action device. Note the similarities in shape between objects, in particular between the two types of cans, but also in color between the CocktailCan and the Vase, which enable this setup to demonstrate the performance of the presented network architecture.
Fig. 4 shows snapshots of the demonstration and of the process described in the following. Initially, the system detects that there are objects on the table, as indicated by the bounding boxes. However, all objects are labelled as category unknown. Secondly, a user starts providing the system with the respective category labels. As more categories are introduced, the system recognizes each of them as a different category and not as another instance of a previously introduced category. When the command to "serve a beer" is given, the system locates the object of the category BeerCan, goes into position to grasp the object and finally picks it up. In the last step of the demonstration, the robot moves the grasped BeerCan over the Mug.
With this real-time robot demonstration, it has been shown that the system is able to recognize objects from different orientations, detect these objects in real time and learn new categories in an open-ended fashion. A video of this demonstration is available at:

V Conclusions

In this work, a network architecture was proposed which seeks to combine shape information with color information for more accurate and robust open-ended 3D object recognition. An RGB-D image was preprocessed following the steps of the OrthographicNet [1] to obtain three rotation- and scale-invariant global orthographic projections of an object. Shape and color information were evaluated separately, using the MobileNetv2 for shape evaluation and the MobileNetv2 as well as the DenseNet 40-12-BC for color evaluation. The results of the shape evaluation and the color evaluation were combined in a feature vector, which was then classified. The proposed approach was analyzed in offline and online experiments and its capabilities were presented in a real-life demonstration. In each test, the system showed state-of-the-art performance in descriptiveness but also in terms of computational performance. Based on its computation time and memory usage, the system can be used in real-time (mobile) robotics applications. The DenseNet 40-12-BC proved superior to the MobileNetv2 both in terms of descriptiveness and computational performance in the offline evaluation, but showed lower descriptiveness in the online evaluation. In the open-ended evaluation, neither of the proposed network architectures was able to show the same superiority over the shape-only approaches that it achieved in the offline evaluation. In the continuation of this work, the open-ended evaluation, and the interaction between the feature vector and the instance-based learning approach in particular, shall be investigated. The lack of other large RGB-D datasets apart from Washington-D remains a key issue due to the dataset's previously discussed color bias.


  • [1] Hamidreza Kasaei. Orthographicnet: A deep learning approach for 3d object recognition in open-ended domains. 02 2019.
  • [2] Hamidreza Kasaei, Maryam Ghorbani, Jits Schilperoort, and Wessel Rest. Investigating the importance of shape features, color constancy, color spaces and similarity measures in open-ended 3d object recognition. 02 2020.
  • [3] Shreyank Gowda and Chun Yuan. ColorNet: Investigating the Importance of Color Spaces for Image Classification, pages 581–596. 05 2019.
  • [4] Ahmed Ali Mohammed Al-Saffar, Hai Tao, and Mohammed Ahmed Talab. Review of deep convolution neural network in image classification. In 2017 International Conference on Radar, Antenna, Microwave, Electronics, and Telecommunications (ICRAMET), pages 26–31. IEEE, 2017.
  • [5] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [6] Zhipeng Zhang and Houwen Peng. Deeper and wider siamese networks for real-time visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4591–4600, 2019.
  • [7] Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282, 2017.
  • [8] SM Lucas. Towards the open ended evolution of neural networks. 1995.
  • [9] S Hamidreza Kasaei, Miguel Oliveira, Gi Hyun Lim, Luís Seabra Lopes, and Ana Maria Tomé. Interactive open-ended learning for 3d object recognition: An approach and experiments. Journal of Intelligent & Robotic Systems, 80(3-4):537–553, 2015.
  • [10] Miguel Oliveira, Luís Seabra Lopes, Gi Hyun Lim, S Hamidreza Kasaei, Angel D Sappa, and Ana Maria Tomé. Concurrent learning of visual codebooks and object categories in open-ended domains. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2488–2495. IEEE, 2015.
  • [11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei Fei Li. Imagenet: a large-scale hierarchical image database. pages 248–255, 06 2009.
  • [12] H. Ayoobi, H. Kasaei, M. Cao, Rineke Verbrugge, and B. Verheij. Local-hdp : Interactive open-ended 3d object categorization. In ECCV 2020, 2020.
  • [13] Asako Kanezaki, Yasuyuki Matsushita, and Yoshifumi Nishida. Rotationnet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5010–5019, 2018.
  • [14] Serge Belongie, Jitendra Malik, and Jan Puzicha. Shape matching and object recognition using shape contexts. IEEE transactions on pattern analysis and machine intelligence, 24(4):509–522, 2002.
  • [15] Ismail Khalid Kazmi, Lihua You, and Jian Jun Zhang. A survey of 2d and 3d shape descriptors. In 2013 10th International Conference Computer Graphics, Imaging and Visualization, pages 1–10. IEEE, 2013.
  • [16] Hamidreza Kasaei, Ana Tomé, Luís Seabra Lopes, and Miguel Oliveira. Good: A global orthographic object descriptor for 3d object recognition and manipulation. Pattern Recognition Letters, 83, 07 2016.
  • [17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [18] Christian H Poth and Werner X Schneider. Breaking object correspondence across saccades impairs object recognition: The role of color and luminance. Journal of Vision, 16(11):1–1, 2016.
  • [19] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.
  • [20] Inês Bramão, Alexandra Reis, Karl Magnus Petersson, and Luís Faísca. The role of color information on object recognition: A review and meta-analysis. Acta psychologica, 138:244–53, 07 2011.
  • [21] Chi-Yi Tsai and Shu-Hsiang Tsai. Simultaneous 3d object recognition and pose estimation based on rgb-d images. IEEE Access, 6:28859–28869, 2018.
  • [22] Saman Zia, Buket Yuksel, Deniz Yuret, and Yucel Yemez. Rgb-d object recognition using deep convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 896–903, 2017.
  • [23] Umar Asif, Mohammed Bennamoun, and Ferdous A Sohel. Rgb-d object recognition and grasp detection using hierarchical cascaded forests. IEEE Transactions on Robotics, 33(3):547–564, 2017.
  • [24] Jerome Paul N Cruz, Ma Lourdes Dimaala, Laurene Gaile L Francisco, Erica Joanna S Franco, Argel A Bandala, and Elmer P Dadios. Object recognition and detection by shape and color pattern recognition utilizing artificial neural networks. In 2013 International Conference of Information and Communication Technology (ICoICT), pages 140–144. IEEE, 2013.
  • [25] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018.
  • [26] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
  • [27] Kevin Lai, Liefeng Bo, Xiaofeng Ren, and Dieter Fox. A large-scale hierarchical multi-view rgb-d object dataset. pages 1817–1824, 05 2011.
  • [28] Miguel Oliveira, Luís Seabra Lopes, Gi Hyun Lim, S Hamidreza Kasaei, Ana Maria Tomé, and Aneesh Chauhan. 3d object perception and perceptual learning in the race project. Robotics and Autonomous Systems, 75:614–626, 2016.