Object recognition is one of the important problems in machine vision and an essential capability for social robots in real-world environments. Object recognition in the real world is challenging because of environmental noise, complex viewpoints, illumination changes and shadows. A 2D camera alone often cannot cope with such a hard task. The release of Kinect promised a new approach to help with compliant hand design and with recognizing objects and humans (emotions) for robots. The characteristics of the new Kinect 2.0 release are listed as follows (see Figure 1):
RGB Camera: Captures color images/video within the field of view.
IR Emitters: When the actively projected Near Infrared Spectrum (NIS) hits a rough object or passes through frosted glass, the spectrum is distorted and forms random spots (called speckle) that can be read by an infrared camera.
Depth Camera: Analyzes the infrared spectrum and creates RGB-Depth (RGB-D) images of human bodies and objects in the field of view.
Microphone Array: Uses built-in components (e.g. a Digital Signal Processor (DSP)) to collect voice and filter background noise simultaneously. It can also locate the direction of the sound source.
Kinect (an RGB-D sensor) can synchronously capture a color image and the corresponding depth information of each pixel for real-world objects and scenes. The color and depth images provide complementary information for real-world tasks. A significant number of applications exploit RGB-D sensors, including object detection [2, 10, 12], human motion analysis [3, 11], object tracking, and object/human recognition [4-8, 14, 15, 33]. In this paper, we only discuss the recognition problem using an RGB-D sensor. Recent feature-based object recognition methods mainly fall into three categories: converting to 3D point clouds, joining 2D and depth image features, and designing RGB-D features.
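As an illustration of the first category, a depth image can be back-projected into a 3D point cloud with the standard pinhole camera model. The following is a minimal sketch only: the function name `depth_to_points` and the intrinsic parameters `fx`, `fy`, `cx`, `cy` are hypothetical and are not taken from this paper or from any Kinect SDK.

```python
def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth image into 3D points with a pinhole model.

    `depth` is a 2D list of depth values (e.g. in meters); `fx`, `fy` are
    focal lengths in pixels and (`cx`, `cy`) is the principal point.
    All intrinsics here are assumed, illustrative values.
    """
    points = []
    for v, row in enumerate(depth):          # v: pixel row
        for u, z in enumerate(row):          # u: pixel column
            if z <= 0:                       # treat non-positive depth as missing
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 depth image with two valid pixels
points = depth_to_points([[0.0, 2.0], [1.0, 0.0]], fx=2.0, fy=2.0, cx=0.5, cy=0.5)
print(points)
```

Each valid pixel contributes one 3D point, so missing depth readings directly reduce the density of the resulting cloud.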
To extract 3D object features, Bo et al. utilize depth information and maps to generate 3D point clouds. Motivated by local features in RGB images, depth kernel features can be extracted from 3D point clouds. However, features of 3D point clouds suffer from noise and limited views when only one view (or a set of similar views) is available. Joining 2D and depth image features is a flexible approach for RGB-D vision learning. It relies on the many excellent 2D image descriptors proposed in computer vision.
, which are based on the texture and edges of the image, respectively. Recently, motivated by the visual perception mechanisms used in image retrieval, the perceptual uniform descriptor (PUD) achieved high performance by involving human perception. PUD defines perceptual structures by the similarity of edge orientations and colors, and introduces structure element correlation statistics to capture the spatial correlation among them. In contrast, local image descriptors focus on describing local information, such as edge and gradient information. Lowe introduced a classical local descriptor called the scale-invariant feature transform (SIFT), which detects and describes local neighborhoods close to key points in scale space. Both SIFT and HOG can be used within the Bag of Words (BOW) framework as image descriptors. Gabor wavelets have also been applied to image understanding because of the similarity between human vision and Gabor wavelets. HMAX-based image descriptors, which follow the hierarchical visual processing in the primary visual cortex (V1), obtain promising results in vision tasks. The above image descriptors can be applied to the RGB and depth images separately. However, most joint features are incompatible, which makes it difficult to choose a suitable weight for joining them.
To make full use of 2D image descriptors, we intend to use a fusion approach to obtain 3D object features for further RGB-D vision learning. To the best of our knowledge, few feature fusion methods have been proposed for RGB-D recognition. The most popular approach is multi-view graph learning. This approach achieves good performance, but only when classifying a small number of objects.
In this paper, a new feature graph fusion (FGF) method is proposed for RGB and depth images. We first use Jaccard similarity to construct a graph of RGB and depth images, which indicates the pairwise similarity of images. Then, the fused feature of the RGB and depth images is computed from our Extended Jaccard Graph (EJG) using a word embedding method. Our feature graph fusion achieves better performance and efficiency with RGB-D sensors for robots. Our simple Kinect-based robot is shown in Figure 2.
II BOW based on SIFT and CNN-RNN
In this section, we introduce the fundamental features used in this paper. As described in the introduction, local and bio-inspired features often achieve better results than other types of features. Subsections II-A and II-B introduce the BOW and CNN-RNN features, respectively.
II-A BOW based on SIFT
The scale-invariant feature transform (SIFT) was first introduced to extract distinctive local features for image matching. This feature is highly discriminative and robust and has been widely applied to various vision tasks. It is invariant to translation, rotation and rescaling of images, and is also partially robust to changes in 3D viewpoint and illumination. SIFT has two major stages that maintain these properties: the detector and the descriptor.
The SIFT detector aims to find key points or regions in the Gaussian scale space. Since natural images from cameras or other devices tend to be sampled from different views, it is necessary to construct a scale-space pyramid that simulates all possible scales so that the locations and scales of key points can be identified accurately. The locations are then determined by detecting local extrema in the difference-of-Gaussian scale space. Low-contrast points and edge responses are further removed because they are less discriminative.
The SIFT descriptor is computed from the image gradients in the neighborhood of a key point. To maintain rotation invariance, the descriptor is rotated relative to the key point orientation. By computing the gradient information in the neighborhood of the key point, the descriptor characterizes the orientation distribution around the key point, which is distinctive and partially robust to changes in illumination and 3D viewpoint. For each key point detected above, a 128-D feature vector is created to capture the local discriminative information.
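The core of the descriptor, a magnitude-weighted histogram of gradient orientations, can be sketched as follows. This is a simplified illustration for a single cell and not a full SIFT implementation: the standard descriptor concatenates 4 x 4 such cells with 8 bins each (giving 128 dimensions) and additionally applies Gaussian weighting, interpolation and rotation to the key point orientation, all of which are omitted here. The function name `orientation_histogram` is hypothetical.

```python
import math


def orientation_histogram(patch, num_bins=8):
    """Magnitude-weighted histogram of gradient orientations for one cell.

    `patch` is a 2D list of grayscale values. Gradients are central
    differences, so only interior pixels contribute.
    """
    hist = [0.0] * num_bins
    rows, cols = len(patch), len(patch[0])
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            dx = patch[y][x + 1] - patch[y][x - 1]
            dy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(dx, dy)
            angle = math.atan2(dy, dx) % (2 * math.pi)   # map to [0, 2*pi)
            bin_index = int(angle / (2 * math.pi) * num_bins) % num_bins
            hist[bin_index] += magnitude
    return hist


# A horizontal intensity ramp: every gradient points along +x (bin 0)
ramp = [[float(x) for x in range(5)] for _ in range(5)]
hist = orientation_histogram(ramp)
print(hist)
```

For the ramp, all gradient energy falls into the first bin, which shows how the histogram encodes the dominant local edge direction.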
Although SIFT performs very well in local feature description and matching, it is not highly efficient, and it is not appropriate for analyzing holistic image features directly because of the large number of key points in each image. The Bag-of-Words model is therefore combined with SIFT: it takes representative SIFT vectors as the basic words, which preserves the distinctiveness of SIFT. The main procedure is to find cluster centers as words by pre-training with k-means, and then to create the word vector by assigning every SIFT descriptor to its nearest word. Normally, these cluster centers correspond to the discriminative patches among images, so the number of words also plays an important part in image representation. A large vocabulary creates elaborate features that describe more detailed information, while a small vocabulary mainly captures the coarse distribution of the SIFT descriptors.
In dense SIFT (D-SIFT), the SIFT descriptor is computed over the whole image region without detecting key points in scale space. Instead of smoothing the images with a Gaussian kernel in scale space, the image is pre-smoothed before feature description. It is therefore much faster than standard SIFT, since key point detection tends to be very time-consuming in large-scale image understanding. The main process of BOW using D-SIFT is summarized in Fig. 3.
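The vocabulary learning and quantization steps of the BOW model can be sketched as follows. This is a minimal illustration under simplifying assumptions: descriptors are short toy vectors rather than 128-D SIFT vectors, k-means is initialized from user-supplied centers instead of a random or k-means++ initialization, and the helper names (`kmeans`, `nearest_word`, `bow_histogram`) are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, words):
    """Index of the visual word closest to `descriptor`."""
    return min(range(len(words)), key=lambda k: squared_distance(descriptor, words[k]))


def kmeans(descriptors, initial_words, iterations=10):
    """Plain Lloyd iterations; returns the cluster centers (visual words)."""
    words = [list(w) for w in initial_words]
    for _ in range(iterations):
        clusters = [[] for _ in words]
        for d in descriptors:
            clusters[nearest_word(d, words)].append(d)
        for k, members in enumerate(clusters):
            if members:  # keep the previous center if a cluster is empty
                words[k] = [sum(col) / len(members) for col in zip(*members)]
    return words


def bow_histogram(descriptors, words):
    """Normalized count of descriptors assigned to each visual word."""
    counts = [0] * len(words)
    for d in descriptors:
        counts[nearest_word(d, words)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(words)


# Toy 2-D "descriptors" forming two obvious clusters
train = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
words = kmeans(train, initial_words=[train[0], train[2]])
image_descriptors = [[0.0, 0.2], [9.0, 10.0], [10.0, 12.0]]
feature = bow_histogram(image_descriptors, words)
print(words, feature)
```

The length of `feature` equals the vocabulary size, so the number of words directly controls how fine-grained the image representation is.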
II-B CNN-RNN (CRNN)
Richard Socher et al. proposed the CNN-RNN model, which has two steps: 1) learning the CNN filters in an unsupervised way by clustering random patches, and then feeding these patches into a CNN layer; 2) using the resulting low-level, translationally invariant features generated by the first step to compose higher-order features that can then be used to classify the images with RNNs.
First, random patches are extracted into two sets: RGB and depth. Then, each set of patches is normalized and whitened. K-means clustering is applied to the patches as pre-processing.
Second, a CNN architecture is chosen for its translational invariance properties to generate features for the RNN layer. The main idea of CNNs is to convolve filters over the input image. The single-layer CNN is similar to the one proposed by Jarrett et al. and consists of a convolution, followed by rectification and local contrast normalization (LCN) [25, 26, 27]. Each image of size (height and width) $d_I$ is convolved with $K$ square filters of size $d_P$, resulting in $K$ filter responses, each of dimensionality $d_I - d_P + 1$. The responses are then average-pooled over square regions of size $d_\ell$ with a stride size of $s$, to obtain a pooled response with width and height equal to $r = ((d_I - d_P + 1) - d_\ell)/s + 1$. So the output $X$ of the CNN layer applied to one image is a $K \times r \times r$ dimensional 3D matrix. The same procedure is applied to both color and depth images separately.
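The size of the CNN output follows from simple arithmetic on the image, filter and pooling sizes. The sketch below assumes a valid (unpadded) convolution followed by pooling with a fixed stride; the parameter names (`d_i` for the image side, `d_p` for the filter side, `d_l` for the pooling region, `s` for the stride, `k` for the number of filters) and the numeric values in the example are illustrative assumptions, not settings reported in this paper.

```python
def cnn_output_shape(d_i, d_p, d_l, s, k):
    """Shape (k, r, r) of a single-layer CNN output after average pooling.

    Assumes a valid convolution of a d_i x d_i image with k filters of
    size d_p x d_p, then pooling over d_l x d_l regions with stride s.
    """
    conv_size = d_i - d_p + 1              # side length of each filter response
    if conv_size < d_l:
        raise ValueError("pooling region larger than the filter response")
    if (conv_size - d_l) % s != 0:
        raise ValueError("stride does not tile the filter response exactly")
    r = (conv_size - d_l) // s + 1         # side length after pooling
    return (k, r, r)


# Illustrative values only
shape = cnn_output_shape(d_i=148, d_p=9, d_l=10, s=5, k=128)
print(shape)  # (128, 27, 27)
```

With these example values, each image is summarized by 27 x 27 column vectors of dimension 128, which become the leaf nodes of the recursive network.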
The idea of recursive neural networks [28, 29] is to learn hierarchical feature representations by applying the same neural network recursively in a tree structure. In the CNN-RNN model, the leaf nodes of the tree are K-dimensional vectors (the result of the CNN pooling over an image patch, repeated for all filters), and there are $r^2$ of them.
It starts with a 3D matrix $X \in \mathbb{R}^{K \times r \times r}$ for each image (the columns are K-dimensional), and defines a block to be a list of adjacent column vectors which are merged into a parent vector $p \in \mathbb{R}^{K}$. For convenience, only square blocks of size $K \times b \times b$ are employed. For instance, if vectors are merged in a block with $b = 3$, the block has a total size of $K \times 3 \times 3$ and yields a list of vectors $(x_1, \ldots, x_9)$. In general, there are $b^2$ vectors in each block. The neural network is $p = f(W [x_1; \ldots; x_{b^2}])$, where the parameter matrix $W \in \mathbb{R}^{K \times b^2 K}$ and $f$ is a nonlinearity such as $\tanh$. Generally, there will be $(r/b)^2$ parent vectors $p$, forming a new matrix $P_1$. The vectors in $P_1$ will again be merged in blocks, just as those in matrix $X$, with the same tied weights, resulting in matrix $P_2$.
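The recursive merging of blocks can be sketched as follows. This is a minimal illustration, assuming that each parent vector is obtained by applying one shared weight matrix to the concatenated child vectors of a square block, followed by a tanh nonlinearity and without a bias term; the function name `merge_blocks`, the grid layout `x[i][j]` and the toy weights are assumptions made for the example.

```python
import math


def merge_blocks(x, w, b):
    """Merge each b x b block of K-dim vectors into one K-dim parent vector.

    `x` is an r x r grid where x[i][j] is a list of K floats, and `w` is a
    K x (b*b*K) weight matrix shared by all blocks (tied weights).
    Returns an (r/b) x (r/b) grid of parent vectors.
    """
    r = len(x)
    if r % b != 0:
        raise ValueError("grid size must be divisible by the block size")
    parents = []
    for bi in range(0, r, b):
        row = []
        for bj in range(0, r, b):
            children = []                      # concatenated child vectors
            for i in range(bi, bi + b):
                for j in range(bj, bj + b):
                    children.extend(x[i][j])
            row.append([
                math.tanh(sum(wk * c for wk, c in zip(w_row, children)))
                for w_row in w
            ])
        parents.append(row)
    return parents


# Toy example: K = 1, a 4 x 4 grid, blocks of size 2 x 2
grid = [[[0.1] for _ in range(4)] for _ in range(4)]
weights = [[1.0, 1.0, 1.0, 1.0]]               # K x (b*b*K) = 1 x 4
level1 = merge_blocks(grid, weights, b=2)      # 2 x 2 grid of parents
level2 = merge_blocks(level1, weights, b=2)    # 1 x 1 grid: the root vector
print(level2[0][0])
```

Applying the same function repeatedly with the same `weights` reduces the grid until a single root vector remains, which serves as the image feature of one RNN.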
III Feature Graph Fusion
In this section, EJG is proposed based on Jaccard similarity for robust graph construction, which is important to our FGF.
III-A Extended Jaccard Graph
We use the extended Jaccard graph to construct a fused graph from which the fused feature is computed. The details of graph fusion are described in this subsection.
We first define as the original image set. Let denote the query, and represent the KNNS (the k-nearest neighbors of a sample) of , . is the original ranking list which returns the top- images of . Similar to , we denote as the KNNS of , , as the KNNS of . The Jaccard coefficient is used to measure the similarity of and as follows:
The information in is more than that in norm measure of and . In construction of the graph, the edge weight between and is denoted by in Eq.(1), where is a decay coefficient.
To avoid outliers in, we consider comparing with by the similar process of and as follows:
Then, the weight of and can be computed by Eq. (4) as follows
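The basic neighborhood comparison behind this graph can be sketched as follows. The sketch only computes the plain Jaccard coefficient between the k-nearest-neighbor sets of two samples; the decay coefficient and the outlier handling of the extended graph are not reproduced. The function names, the use of squared Euclidean distance, and the choice to exclude a sample from its own neighbor set are assumptions made for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn(index, features, k):
    """Set of indices of the k nearest neighbors of sample `index`."""
    distances = sorted(
        (squared_distance(features[index], f), j)
        for j, f in enumerate(features)
        if j != index                       # assumption: a sample is not its own neighbor
    )
    return {j for _, j in distances[:k]}


def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B| of two neighbor sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy 1-D features forming two groups
features = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
neighbors = [knn(i, features, k=2) for i in range(len(features))]
# Pairwise edge weights of a simple Jaccard graph
graph = [[jaccard(ni, nj) for nj in neighbors] for ni in neighbors]
print(graph[0][1], graph[0][3])
```

Samples from the same group share neighbors and receive a positive weight, while samples from different groups receive a weight of zero, so the graph depends on neighborhood overlap rather than on raw feature distances.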
To exploit the complementary information in RGB and depth image features and improve the accuracy of machine/robot recognition, we need to fuse multiple image features. We denote as node, as edge and as weight in the image graph. Assume that RGB and depth features have been extracted from an object. Then the RGB and depth graphs can be constructed by the Extended Jaccard Graph in reference . In graph fusion methods, the RGB feature graph is defined as , and the depth feature graph is denoted by . The multi-feature graph can be expressed by , which satisfies three constraints as follows: 1) ; 2) ; 3) . The fusion graph can be treated as the relationships between images in the dataset. The final fused feature on is obtained in the next subsection.
III-B Fusion Feature by Word Embedding
We fuse the RGB weight affinity matrix and the depth weight affinity matrix into an affinity matrix , where are defined in Subsection III-A. Then, we obtain the normalized neighborhood affinity matrix , where . can be expressed using a Gaussian kernel as follows.
where is the bandwidth parameter of the Gaussian kernel; we denote by the variance of the i-th row.
The fused features are implicitly expressed in the normalized neighborhood affinity matrix . We use the following optimization model to obtain the fused features.
, where , is the fused feature of the -th RGB-D image pair. We take the logarithm of the likelihood function as follows.
IV Experimental Results and Analysis
In this section, we use two datasets to evaluate our feature fusion method. We first introduce the parameters and details of the two datasets. Then, the experimental results and their analysis are given in Subsection IV-B.
IV-A Details of Datasets
Dataset 1 and Dataset 2 were collected with Kinect V2 and Kinect V1, respectively. The differences between V1 and V2 are listed in Table 1. The two datasets are described as follows.
|Parameter||Kinect V1||Kinect V2|
|Joints||20 joints/person||25 joints/person|
|Range of detection||0.8-4.0 m||0.5-4.5 m|
|Angle (horizontal)||57 degrees||70 degrees|
|Angle (vertical)||43 degrees||60 degrees|
|Active IR video stream||No||512×424, 11-bit|
Dataset 1 (DUT RGB-D face dataset): This dataset was acquired with a Microsoft Kinect for Xbox One V2.0 camera, which records RGB images as well as depth images. It contains 1620 RGB-D (RGB and depth) photos of 54 people, stored in 6480 files. Each class includes 30 faces. Expressions of happiness, anger, sorrow, fear and surprise are acquired from five different angles (up, down, left, right and middle) for each person. Color photos are recorded with 8 bits, and each color image is decomposed into three files (R, G and B). Depth photos are recorded as 16-bit data so that small changes in facial depth are accurately captured. None of the people in these photos wear glasses, to ensure the precision of expression acquisition.
Dataset 2: The RGB-D Household Object Dataset contains 300 household objects, which fall into 51 categories. The dataset was captured with a Kinect V1 camera. Each image pair consists of an RGB image and a depth image (RGB size: 640×480, with depth images recorded at 30 Hz). The images are obtained from RGB-D video recorded from different angles of each object. More details can be found in the dataset's original publication.
IV-B Experimental Results and Analysis
In social robot tasks, both recognition and grasping are important for applications. The experimental results on the DUT RGB-D face dataset are listed in Table 2 and Figure 4.
We use the dense SIFT method to extract features from the RGB-D face dataset, and use a one-vs-rest SVM classifier for face recognition. We take 3 faces per class for training and use the rest as the testing set. As can be seen in Table 2, RGB information is more effective than the depth representation: the RGB recognition rate is 82.30%, which is 22.29% higher than that of the depth faces. This is because face recognition is highly related to RGB. An important result is that the joint RGB+depth feature is 0.27% lower than the single RGB feature. This illustrates that the joint feature may suffer from the change in data distribution. Our method achieves 84.50%, which is higher than any single feature or the joint feature, and overcomes the shortcoming of the joint feature through fused graph feature extraction.
Fig. 4 shows the influence of the parameters on our fusion model. FGF is not sensitive to changes in the parameters.
|Methods ()||Recognition Rate||std|
In the object experiment, we use CNN-RNN features extracted from the RGB and depth images and evaluate on 10 splits. In each split, the testing set consists of all images of one instance, and the remaining images form the training set. Table 3 shows the results of RGB-D object recognition. Unlike in the face experiments, object recognition using depth information reaches 92.98%, a higher precision than that obtained using RGB images. This result illustrates that object recognition relies more on the “depth feeling”. Although the joint feature improves the precision on the object dataset, our fused feature is more effective and efficient than the joint one (it reaches 93.92% using only a 200-dimensional feature). Table 4 and Fig. 5 (which uses different parameter settings on Dataset 2) illustrate that our method achieves higher results than other state-of-the-art methods.
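The instance-level split used in the object experiment can be sketched as follows. This is an illustration under an assumption: one instance is held out from every category in each split, which is the common protocol for this kind of dataset but is not spelled out in the text above. The function name `instance_split` and the sample layout `(category, instance, image_id)` are hypothetical.

```python
import random


def instance_split(samples, seed):
    """Hold out all images of one randomly chosen instance per category.

    `samples` is a list of (category, instance, image_id) tuples.
    Returns (train, test) lists of samples.
    """
    rng = random.Random(seed)
    instances = {}
    for category, instance, _ in samples:
        instances.setdefault(category, set()).add(instance)
    # Pick the held-out instance of each category reproducibly
    held_out = {c: rng.choice(sorted(ids)) for c, ids in sorted(instances.items())}
    train = [s for s in samples if held_out[s[0]] != s[1]]
    test = [s for s in samples if held_out[s[0]] == s[1]]
    return train, test


# Toy dataset: 3 apple instances and 2 mug instances, 2 images each
samples = [(c, i, img)
           for c, n in (("apple", 3), ("mug", 2))
           for i in range(n)
           for img in range(2)]
splits = [instance_split(samples, seed) for seed in range(10)]
train, test = splits[0]
print(len(train), len(test))
```

Because whole instances are held out, no object seen during training appears in the test set, so the reported precision measures generalization to new objects of known categories.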
In this paper, we built a vision robot with an RGB-D camera and presented the DUT RGB-D face dataset. We proposed an RGB-D recognition method, FGF, and evaluated it on two RGB-D datasets. FGF achieves better performance than previous approaches and can help robots execute complex tasks, such as SLAM, compliant hand design and human-robot interaction. In future work, we will consider designing a more effective sensor and a robust supervised dimensionality reduction method for robot vision.
-  Li R, Wu W, Qiao H. The compliance of robotic hands–from functionality to mechanism[J]. Assembly Automation, 2015, 35(3): 281-286.
-  Veenendaal A, Daly E, Jones E, et al. Fear Detection with Background Subtraction from RGB-D data[J]. Computer Science and Emerging Research Journal, 2013, 1.
-  Ye M, Zhang Q, Wang L, et al. A survey on human motion analysis from depth data[M]//Time-of-flight and depth imaging. sensors, algorithms, and applications. Springer Berlin Heidelberg, 2013: 149-187.
-  Goswami G, Bharadwaj S, Vatsa M, et al. On RGB-D face recognition using Kinect[C]//Biometrics: Theory, Applications and Systems (BTAS), 2013 IEEE Sixth International Conference on. IEEE, 2013: 1-6.
-  Huynh T, Min R, Dugelay J L. An efficient LBP-based descriptor for facial depth images applied to gender recognition using RGB-D face data[C]//Asian Conference on Computer Vision. Springer Berlin Heidelberg, 2012: 133-145.
-  Szwoch M. On facial expressions and emotions RGB-D database[C]//International Conference: Beyond Databases, Architectures and Structures. Springer International Publishing, 2014: 384-394.
-  Ciaccio C, Wen L, Guo G. Face recognition robust to head pose changes based on the RGB-D sensor[C]//Biometrics: Theory, Applications and Systems (BTAS), 2013 IEEE Sixth International Conference on. IEEE, 2013: 1-6.
-  Min R, Kose N, Dugelay J L. Kinectfacedb: A kinect database for face recognition[J]. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2014, 44(11): 1534-1548.
-  Li S, Ngan K N, Sheng L. A head pose tracking system using RGB-D camera[C]//International Conference on Computer Vision Systems. Springer Berlin Heidelberg, 2013: 153-162.
-  Gupta S, Girshick R, Arbeláez P, et al. Learning rich features from RGB-D images for object detection and segmentation[C]//European Conference on Computer Vision. Springer International Publishing, 2014: 345-360.
-  Zhang C, Tian Y. Histogram of 3D facets: A depth descriptor for human action and hand gesture recognition[J]. Computer Vision and Image Understanding, 2015, 139: 29-39.
-  Xia L, Chen C C, Aggarwal J K. Human detection using depth information by kinect[C]//CVPR 2011 WORKSHOPS. IEEE, 2011: 15-22.
-  Cruz L, Lucio D, Velho L. Kinect and rgbd images: Challenges and applications[C]//Graphics, Patterns and Images Tutorials (SIBGRAPI-T), 2012 25th SIBGRAPI Conference on. IEEE, 2012: 36-49.
-  Filliat D, Battesti E, Bazeille S, et al. RGBD object recognition and visual texture classification for indoor semantic mapping[C]//2012 IEEE International Conference on Technologies for Practical Robot Applications (TePRA). IEEE, 2012: 127-132.
-  Li Y. Hand gesture recognition using Kinect[C]//2012 IEEE International Conference on Computer Science and Automation Engineering. IEEE, 2012: 196-199.
-  Bo L, Ren X, Fox D. Depth kernel descriptors for object recognition[C]//2011 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2011: 821-826.
-  Ojala T, Pietikäinen M, Mäenpää T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns[J]. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 2002, Vol. 24, No. 7, pp. 971-987.
-  Ojala T, Pietikäinen M, Mäenpää T (2000). Gray scale and rotation invariant texture classification with local binary patterns[M]//Computer Vision-ECCV. Springer Berlin Heidelberg, pp. 404-420.
-  Dalal N, Triggs B (2005). Histograms of oriented gradients for human detection[C]//IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1: 886-893.
-  Liu S, Wu J, Feng L, et al. (2016). Perceptual uniform descriptor and Ranking on manifold: A bridge between image representation and ranking for image retrieval[J]. arXiv preprint arXiv:1609.07615.
-  Lowe D G (2004). Distinctive image features from scale-invariant keypoints[J]. International journal of computer vision, Vol. 60, No. 2, pp. 91-110.
-  Jones J P, Palmer L A (1987). An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex[J]. Journal of neurophysiology, 58(6): 1233-1258.
-  H. Qiao, Y. L. Li, F. F. Li, X. Y. Xi and W. Wu, Biologically Inspired Model for Visual Cognition Achieving Unsupervised Episodic and Semantic Feature Learning, IEEE Transactions on Cybernetics, vol. 46, no. 10, pp. 2335-2347, 2016.
-  Socher, Richard, et al. (2012). "Convolutional-recursive deep learning for 3d object classification." Advances in Neural Information Processing Systems.
-  K. Jarrett and K. Kavukcuoglu and M. Ranzato and Y. LeCun. What is the Best Multi-Stage Architecture for Object Recognition? In ICCV. IEEE, 2009.
-  N. Pinto, D. D. Cox, and J. J. DiCarlo. Why is real-world visual object recognition hard? PLoS Comput Biol, 2008.
-  Q.V. Le, M.A. Ranzato, R. Monga, M. Devin, K. Chen, G.S. Corrado, J. Dean, and A.Y. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2012.
-  R. Socher, C. Lin, A. Y. Ng, and C.D. Manning. Parsing Natural Scenes and Natural Language with Recursive Neural Networks. In ICML, 2011.
-  Liu S, Sun M, Feng L, et al. (2016). Three Tiers Neighborhood Graph and Multi-graph Fusion Ranking for Multi-feature Image Retrieval: A Manifold Aspect[J]. arXiv preprint arXiv:1609.07599.
-  Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality[C]//Advances in neural information processing systems. 2013: 3111-3119.
-  Mikolov T, Chen K, Corrado G, et al (2013). Efficient estimation of word representations in vector space[J]. arXiv preprint arXiv:1301.3781.
-  Zhang L, Zhang D (2016). Robust Visual Knowledge Transfer via Extreme Learning Machine based Domain Adaptation, IEEE Transactions on Image Processing, vol. 25, no. 10, pp. 4959-4973.
-  Zhang L, Zhang D. Visual Understanding via Multi-Feature Shared Learning with Global Consistency, IEEE Transactions on Multimedia, vol. 18, no. 2, pp. 247-259, 2016.
-  Liu S, Lin F, Hong Q (2015). Scatter Balance: An Angle-Based Supervised Dimensionality Reduction[J]. IEEE Transactions on Neural Networks & Learning Systems, Vol. 26, No. 2, pp. 277-289.