Object Detection and Sorting by Using a Global Texture-Shape 3D Feature Descriptor

02/04/2018, by Zhun Fan et al.

Object recognition and sorting play a key role in robotic systems, especially for autonomous robots that perform object sorting tasks in a warehouse. In this paper, we present a global texture-shape 3D feature descriptor that can be utilized in a sorting system, and we show that this system performs object sorting tasks well. The proposed descriptor stems from the clustered viewpoint feature histogram (CVFH). Because the CVFH descriptor relies only on the geometrical information of the whole 3D object surface, it cannot perform well on objects with similar geometry. Therefore, we extend the CVFH descriptor with texture information to generate a new global 3D feature descriptor. The proposed descriptor is then tested for sorting 3D objects with multi-class support vector machines (SVM), and it is evaluated on a public 3D image dataset and in real scenes. The evaluation results show that the proposed descriptor achieves good recognition performance compared to the CVFH. We then leverage the proposed descriptor in the proposed sorting system, showing that it helps the system perform the object detection, object recognition and object grasping tasks well.


I Introduction and related work

With the development of e-commerce, it is becoming more and more important for autonomous robots to execute object sorting tasks in a warehouse. When autonomous robots execute object sorting tasks in a warehouse, they need to locate the target objects and identify what the objects are. All this information can be provided by object recognition. However, it is still a challenging task to design a reliable and effective object recognition system [1].

A common approach is to utilize features of the target objects in the 2D image plane for object recognition; examples include SIFT [2], SURF [3] and ORB [4]. Even though these methods perform well on highly textured objects, they are ineffective on textureless objects.

In recent years, more and more low-cost, effective 3D sensors have become available for object recognition. These 3D sensors, such as the Asus Xtion camera and the Microsoft Kinect (RGB-D) camera [5], can capture both shape and texture information of the target objects. As a result, 3D feature descriptors have become a popular choice for object recognition. They are divided into two types: local 3D feature descriptors and global 3D feature descriptors.

A local 3D feature descriptor relies on the geometrical information of the 3D object surface. Each local 3D feature descriptor is computed for a single key point of the 3D object surface, so each 3D object has more than one local 3D feature descriptor. The set of all local 3D feature descriptors can be used for object recognition by means of feature matching. Several local 3D feature descriptors have been proposed for object recognition, such as the 3D shape context (3DSC) [6], the unique shape context descriptor (USC) [7], the normal aligned radial feature (NARF) [8] and the rotational projection statistics (RoPS) [9]. These descriptors achieve good performance on objects that contain rich geometrical information, and they do not require a segmentation stage when applied to object recognition. However, they require substantial computational resources for descriptor computation and matching.

A global 3D feature descriptor also relies on the geometrical information of the whole 3D object surface, but it is computed for the whole 3D object, so every object has only one global 3D feature descriptor. These descriptors can be used for object recognition and classification by means of feature matching, but the target objects must be extracted from the cluttered scene before recognition. Several global 3D feature descriptors have been used for object recognition, such as the global fast point feature histogram (GFPFH) [10], the ensemble of shape functions (ESF) [11], the oriented, unique and repeatable clustered viewpoint feature histogram (OUR-CVFH) [12] and the global radius-based surface descriptor (GRSD) [13]. When used for object recognition, they require far less computational resources for descriptor computation and matching than local 3D feature descriptors.

These 3D feature descriptors rely only on the geometrical information of the whole 3D object surface. Thus they perform well on objects with different shapes, but not on objects with similar shapes. Some 3D feature descriptors are based on both shape and texture information, such as the Color-SHOT (CSHOT) [14], a local 3D feature descriptor that combines shape information with color information. However, as a local descriptor it still requires considerable computational resources for descriptor computation and matching, which worsens the computational burden.

In this work, we propose a global texture-shape 3D feature descriptor for object recognition and sorting. The contributions of the paper are: (1) We present a global texture-shape 3D feature descriptor that is generated by extending the clustered viewpoint feature histogram (CVFH) [15] with texture information. (2) We evaluate the recognition performance of the proposed descriptor on both a public 3D image dataset and real scenes, and use the proposed sorting system to implement the object detection, object recognition and object grasping tasks in sequence. (3) We use a multi-class support vector machine (SVM) [16] classifier for recognizing target objects instead of feature matching, which reduces the computational burden. Additionally, training a multi-class classifier does not require a large data set or a lengthy training process.

The paper is organized as follows. The sorting system, including the proposed feature descriptor, is described in Section II. The experiments and results are presented in Section III. Conclusions and future work are given in Section IV.

II System description

In this section, the segmentation method is presented first. Then we describe the global texture-shape 3D feature descriptor in detail. The proposed object recognition and grasping system is presented at the end of the section.

II-A The method of segmentation

Here, we present a segmentation method as the preprocessing step for utilizing the proposed global 3D descriptor to recognize the target objects. In this paper, we utilize the Microsoft Kinect 3D sensor to capture the point cloud image of the real scene. Fig. 1 shows the process of the proposed segmentation method.

Fig. 1: The process of the proposed method for segmentation: (a) Obtain the 3D point cloud image of real scene. (b) Obtain the region of the target objects. (c) Filter planar objects by using RANSAC algorithm. (d) Get the target objects by using Euclidean Cluster algorithm.

Fig. 1 shows that the first step of segmentation is obtaining the 3D point cloud image of the real scene, which is captured by the Microsoft Kinect camera (Fig. 1a). We then obtain the region of the target objects by filtering the source point cloud image (Fig. 1b). Next, we subtract large planar objects from the scene by utilizing the random sample consensus (RANSAC) [17] algorithm (Fig. 1c). Finally, we obtain each individual target object from the remaining scene point cloud by employing the Euclidean cluster algorithm (based on Euclidean distance, Fig. 1d) [18].

Obtaining the region of the target objects: After obtaining the point cloud of the source scene, this method performs foreground and background subtraction to generate a new scene point cloud. The method then filters out the information on the left and right sides of the new scene point cloud. After that, we obtain the region of the target objects, as shown in Fig. 1b.
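
As a rough illustration of this region-of-interest filtering step, the following minimal numpy sketch applies pass-through filters along chosen axes; the function name and the axis bounds are placeholders and depend on the actual camera setup, which the paper does not specify.

```python
import numpy as np

def passthrough_filter(cloud, axis, lower, upper):
    """Keep only the points whose coordinate along `axis` lies in [lower, upper].

    cloud: (N, 3) array of XYZ points; axis: 0 = x, 1 = y, 2 = z.
    """
    mask = (cloud[:, axis] >= lower) & (cloud[:, axis] <= upper)
    return cloud[mask]

# Example: crop the depth (z) to the working volume, then trim the left and
# right sides (x). The bounds below are placeholders, not values from the paper.
scene = np.random.rand(10000, 3) * 2.0          # stand-in for a Kinect cloud
roi = passthrough_filter(scene, axis=2, lower=0.5, upper=1.5)
roi = passthrough_filter(roi, axis=0, lower=-0.4, upper=0.4)
print(roi.shape)
```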

RANSAC algorithm[17]:

First, the algorithm randomly selects a minimal set of points from the source scene point cloud and fits a mathematical model (here, a plane) to this selected point set, computing the parameters of the model. Then the algorithm estimates the error of every remaining point of the point cloud with respect to the model. If the error is lower than a predetermined threshold, the point fits the model and is considered an inlier; otherwise, it is considered an outlier. These steps are repeated for a fixed number of iterations. After all iterations, we keep the model with the largest number of inliers, and this set of inlier points is the planar object we need to subtract.
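
The following minimal numpy sketch illustrates the RANSAC plane segmentation described above, assuming the scene is an (N, 3) array of points; it is an illustration of the algorithm rather than the PCL implementation used in the paper, and the iteration count and distance threshold are placeholder values.

```python
import numpy as np

def ransac_plane(cloud, n_iters=200, threshold=0.01, rng=None):
    """Fit the dominant plane with RANSAC; return (normal, d, inlier_mask).

    A point p is an inlier when |normal . p + d| < threshold.
    """
    rng = np.random.default_rng(rng)
    best_mask = np.zeros(len(cloud), dtype=bool)
    best_model = None
    for _ in range(n_iters):
        # 1. Randomly sample the minimal set (3 points) and fit a plane model.
        sample = cloud[rng.choice(len(cloud), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:                      # degenerate (collinear) sample
            continue
        normal /= norm
        d = -normal @ sample[0]
        # 2. Score the model: points closer than the threshold are inliers.
        mask = np.abs(cloud @ normal + d) < threshold
        # 3. Keep the model with the largest inlier set.
        if mask.sum() > best_mask.sum():
            best_mask, best_model = mask, (normal, d)
    if best_model is None:
        raise RuntimeError("all RANSAC samples were degenerate")
    return best_model[0], best_model[1], best_mask

# Removing the table: keep only the points outside the plane inliers.
# cloud = ...                         # (N, 3) array from the region of interest
# _, _, plane_mask = ransac_plane(cloud)
# objects = cloud[~plane_mask]
```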

The Euclidean cluster algorithm [18]: The algorithm creates an empty point set $C$ and adds a seed point $p$ from the point cloud image to $C$. It then computes the distance between $p$ and each of its neighboring points $p_k$, and adds a neighboring point $p_k$ to $C$ if this distance is lower than a predetermined threshold. Next, the algorithm selects another point from $C$, including the newly added points, calculates the distance between this point and each of its neighboring points, and again adds the suitable points to $C$. These steps are repeated until all points of the point cloud image have been processed. We then obtain a point set $C$ that contains all suitable points, and this point set is an individual object extracted from the real scene. Similarly, we can obtain the other individual objects from the real scene by using this algorithm.
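
A minimal sketch of this Euclidean clustering procedure, using a scipy k-d tree for the fixed-radius neighbor queries; the cluster radius and minimum cluster size are placeholder parameters rather than values reported in the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def euclidean_clusters(cloud, radius=0.02, min_size=50):
    """Group points into clusters: two points join the same cluster when their
    Euclidean distance is below `radius`, either directly or transitively."""
    tree = cKDTree(cloud)
    unvisited = np.ones(len(cloud), dtype=bool)
    clusters = []
    for seed in range(len(cloud)):
        if not unvisited[seed]:
            continue
        # Grow one cluster from the seed by repeatedly expanding neighbors.
        queue, members = [seed], []
        unvisited[seed] = False
        while queue:
            idx = queue.pop()
            members.append(idx)
            for nb in tree.query_ball_point(cloud[idx], r=radius):
                if unvisited[nb]:
                    unvisited[nb] = False
                    queue.append(nb)
        if len(members) >= min_size:
            clusters.append(np.array(members))
    return clusters

# objects = cloud[~plane_mask]              # cloud with the plane removed
# for c in euclidean_clusters(objects):     # one index array per target object
#     single_object = objects[c]
```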

II-B The design of the global texture-shape 3D feature descriptor

Combining shape information with color information is an essential way to design a reliable and effective object recognition system. Here, we present a global texture-shape 3D feature descriptor that relies on both shape and color information, so it can make full use of the shape and color information of the target object. The proposed descriptor stems from the clustered viewpoint feature histogram (CVFH) [15]; in this paper, it is denoted as Color-CVFH. We first give a brief description of the CVFH descriptor.

CVFH descriptor [15]: The CVFH stems from the viewpoint feature histogram (VFH) [19]. The VFH descriptor is a 3D global descriptor formed by four different angular distributions of the object surface normals, and it consists of two components: a viewpoint direction component and an extended FPFH [20] component. Here, the centroid of the whole set of surface points is denoted by $p_c$, and the normal of this centroid is denoted by $n_c$. We also set a local reference coordinate frame $(u_i, v_i, w_i)$ for every point $p_i$ of the object surface, and we obtain the following equations [19].

$u_i = n_c, \quad v_i = u_i \times \dfrac{p_i - p_c}{\lVert p_i - p_c \rVert}, \quad w_i = u_i \times v_i$   (1)

The normal of every point $p_i$ of the object surface is denoted by $n_i$, and we can obtain four different normal angular deviations by using the following equations [19].

$\cos\alpha_i = v_i \cdot n_i, \quad \cos\varphi_i = u_i \cdot \dfrac{p_i - p_c}{\lVert p_i - p_c \rVert}, \quad \theta_i = \arctan\!\left(\dfrac{w_i \cdot n_i}{u_i \cdot n_i}\right), \quad \cos\beta_i = n_i \cdot \dfrac{v_p - p_c}{\lVert v_p - p_c \rVert}$   (2)

where $v_p$ denotes the position of the viewpoint.

The extended FPFH [20] component is built by utilizing the three normal angular deviations $\alpha$, $\varphi$ and $\theta$, and the viewpoint direction component is built by using the normal angular deviation $\beta$. However, the VFH descriptor cannot perform well on occluded objects. To solve this problem, the CVFH descriptor computes a VFH histogram for each stable, smooth region of the cluster surface obtained by region-growing segmentation, rather than a single VFH histogram for the whole cluster surface. Additionally, the CVFH descriptor contains not only a viewpoint direction component and an extended FPFH component, but also a new component, the shape distribution component (SDC) [15]. The SDC is defined as follows.

$\mathrm{SDC}_i = \dfrac{(p_c - p_i)^2}{\max_{j=1,\dots,N}\,(p_c - p_j)^2}, \quad i = 1, \dots, N$   (3)

where $N$ represents the total number of object surface points. The SDC component is described by a histogram of 45 bins, each normal angular deviation of the extended FPFH [20] component is also described by a histogram of 45 bins, and the angular deviation of the viewpoint direction component is described by a histogram of 128 bins. Therefore, the CVFH descriptor is described by a histogram of 308 bins.
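
To make the histogram layout concrete, the sketch below assembles a 308-bin VFH/CVFH-style histogram for a single smooth region from the angles of Eqs. (1)-(2) and the SDC of Eq. (3). It omits normal estimation and the region-growing step, assumes the viewpoint is at the sensor origin, and its binning and normalization may differ in detail from the PCL implementation.

```python
import numpy as np

def cvfh_like_histogram(points, normals, viewpoint=np.zeros(3)):
    """Build a 308-bin VFH/CVFH-style histogram (3 x 45 FPFH-angle bins,
    45 shape-distribution bins, 128 viewpoint bins) for one smooth region."""
    p_c = points.mean(axis=0)                        # centroid
    n_c = normals.mean(axis=0)
    n_c /= np.linalg.norm(n_c)
    diff = points - p_c
    dist = np.linalg.norm(diff, axis=1) + 1e-12
    dirs = diff / dist[:, None]
    # Local reference frame (u, v, w) per point, Eq. (1).
    u = np.tile(n_c, (len(points), 1))
    v = np.cross(u, dirs)
    w = np.cross(u, v)
    # Angular deviations, Eq. (2).
    cos_alpha = np.sum(v * normals, axis=1)
    cos_phi = np.sum(u * dirs, axis=1)
    theta = np.arctan2(np.sum(w * normals, axis=1), np.sum(u * normals, axis=1))
    vp_dir = (viewpoint - p_c) / np.linalg.norm(viewpoint - p_c)
    cos_beta = normals @ vp_dir
    # Shape distribution component, Eq. (3).
    sdc = dist ** 2 / np.max(dist ** 2)
    # Concatenate the five normalized sub-histograms (4 * 45 + 128 = 308 bins).
    parts = [np.histogram(cos_alpha, bins=45, range=(-1, 1))[0],
             np.histogram(cos_phi, bins=45, range=(-1, 1))[0],
             np.histogram(theta, bins=45, range=(-np.pi, np.pi))[0],
             np.histogram(sdc, bins=45, range=(0, 1))[0],
             np.histogram(cos_beta, bins=128, range=(-1, 1))[0]]
    hist = np.concatenate(parts).astype(float)
    return hist / hist.sum()
```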

Color-CVFH descriptor: In this paper, we design a global texture-shape 3D feature descriptor, named Color-CVFH, which contains two components: a global color histogram and a CVFH histogram. We obtain the point cloud image of the real scene with the Microsoft Kinect RGB-D sensor, and this kind of point cloud contains not only shape information but also color information. We acquire the point cloud of each individual target object from the real scene by utilizing the proposed segmentation method, so we can get the color information of the target object, which is described in the RGB color space. In most circumstances, color information described in the HSV (Hue, Saturation, Value) color space outperforms color information described in the RGB color space for perception, so in this work we transform the color information from the RGB color space to the HSV color space. The hue dimension is the primary component of the HSV color space for perception, and it is divided into a histogram of 90 bins; the saturation and value dimensions are each divided into a histogram of 51 bins. We thus obtain a global color histogram feature of 192 bins, and a single object has only one global color histogram. Here, we denote the global color histogram feature of the object as $H_{color}$, the CVFH histogram feature of the object as $H_{CVFH}$, and the Color-CVFH descriptor of the object as $H_{Color\text{-}CVFH}$. Mathematically, the Color-CVFH descriptor is defined as follows.

$H_{Color\text{-}CVFH} = \left[\, H_{color},\ H_{CVFH} \,\right]$   (4)

Equation (4) shows that the Color-CVFH descriptor consists of two different global histograms: the first part is the global color histogram $H_{color}$, followed by the CVFH histogram $H_{CVFH}$ as the second part. The Color-CVFH descriptor is therefore a global 3D feature descriptor built from a histogram of 500 bins (HSV: 192 bins, CVFH: 308 bins). The structure of the Color-CVFH descriptor is shown in Fig. 2. The Color-CVFH descriptor makes full use of color and geometrical information, and can be conveniently utilized to train a multi-class SVM classifier for recognizing and classifying target objects.

Fig. 2: The structure of the Color-CVFH descriptor
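
A minimal sketch of the Color-CVFH construction of Eq. (4): a 192-bin HSV histogram (90 + 51 + 51 bins) is concatenated with the 308-bin CVFH histogram, reusing the cvfh_like_histogram sketch above. The rgb input is assumed to be the per-point colors of one segmented object, scaled to [0, 1].

```python
import numpy as np
import matplotlib.colors as mcolors

def hsv_color_histogram(rgb):
    """90 hue + 51 saturation + 51 value bins = 192-bin global color histogram.

    rgb: (N, 3) array of per-point colors in [0, 1]."""
    hsv = mcolors.rgb_to_hsv(rgb)
    parts = [np.histogram(hsv[:, 0], bins=90, range=(0, 1))[0],
             np.histogram(hsv[:, 1], bins=51, range=(0, 1))[0],
             np.histogram(hsv[:, 2], bins=51, range=(0, 1))[0]]
    hist = np.concatenate(parts).astype(float)
    return hist / hist.sum()

def color_cvfh(points, normals, rgb):
    """Eq. (4): concatenate the 192-bin color histogram and the 308-bin
    CVFH histogram into one 500-bin global descriptor."""
    return np.concatenate([hsv_color_histogram(rgb),
                           cvfh_like_histogram(points, normals)])
```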

II-C The design of the sorting system

In this section, we present a sorting system based on the proposed Color-CVFH descriptor. The proposed system is able to execute the object detection, object recognition and object grasping tasks in sequence. The communication between the different processes in the system is handled by the Robot Operating System (ROS) [21], and an overview of the proposed system is shown in Fig. 3.

Fig. 3: The overview of the sorting system

Fig. 3 shows that the first step of the sorting system is capturing the point cloud image of the real scene with the Microsoft Kinect 3D sensor. Next, each individual target object is extracted from the real scene by utilizing the RANSAC [17] and Euclidean cluster [18] algorithms, and the objects are detected by calculating the centroid of each target object. After the target objects are detected, the proposed Color-CVFH descriptor of each target object is computed and fed to the trained multi-class SVM classifier to acquire the category of each individual target object; the centroid of each target object is taken as the grasp position for the robotic manipulator. The system then calculates the forward and inverse kinematics solutions of the robotic manipulator and uses them to move the end effector toward the target object, and the grasp is executed at the centroid of the target object. Finally, the robotic manipulator grasps the target objects and places them at their specified positions according to their categories. With this, the sorting system completes the sequence of object detection, object recognition and object grasping tasks.

Forward kinematics and inverse kinematics: In this paper, the sorting system runs on a Universal Robots UR5, a manipulator with 6 revolute joints. The forward kinematics of the UR5 is calculated from its Denavit-Hartenberg (D-H) parameters [22], and the inverse kinematics solution is calculated with a geometric method [23]. The geometric method yields between 1 and 8 inverse kinematics solutions; an example with 8 inverse kinematics solutions is shown in Fig. 4. This geometric method is well suited to implementation in code, which not only saves computation time but also yields high-accuracy inverse kinematics solutions.

Fig. 4: An example of 8 inverse kinematics solutions by using geometric method.
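
The sketch below shows forward kinematics of a 6-joint arm through standard D-H transforms. The UR5 parameter values are the commonly quoted nominal ones and are an assumption here; they should be verified against the manufacturer's data before use on real hardware. The geometric inverse kinematics of [23] is not reproduced.

```python
import numpy as np

def dh_transform(theta, d, a, alpha):
    """Standard Denavit-Hartenberg homogeneous transform for one joint."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([[ct, -st * ca,  st * sa, a * ct],
                     [st,  ct * ca, -ct * sa, a * st],
                     [0.0,      sa,       ca,      d],
                     [0.0,     0.0,      0.0,    1.0]])

# Nominal UR5 D-H parameters (d, a, alpha) per joint; assumed values that
# should be checked against the manufacturer's datasheet.
UR5_DH = [(0.089159,  0.0,      np.pi / 2),
          (0.0,      -0.425,    0.0),
          (0.0,      -0.39225,  0.0),
          (0.10915,   0.0,      np.pi / 2),
          (0.09465,   0.0,     -np.pi / 2),
          (0.0823,    0.0,      0.0)]

def forward_kinematics(joint_angles, dh=UR5_DH):
    """Chain the six joint transforms to get the end-effector pose."""
    T = np.eye(4)
    for theta, (d, a, alpha) in zip(joint_angles, dh):
        T = T @ dh_transform(theta, d, a, alpha)
    return T

print(forward_kinematics(np.zeros(6))[:3, 3])   # end-effector position at home
```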

III Experiments and results

We begin this section by evaluating the proposed Color-CVFH descriptor on a public RGB-D image dataset [24] constructed by Washington University. We evaluate the recognition performance of the proposed descriptor on this dataset at two levels: category level recognition and instance level recognition. In addition, the Color-CVFH descriptor is evaluated on real scenes (150 different scenes sampled in total).

In this work, the recognition performance of the Color-CVFH descriptor and the CVFH descriptor is evaluated with three metrics: recall, precision and F1_score. These metrics are defined as follows.

$\mathrm{recall} = \dfrac{TP}{TP + FN}$   (5)
$\mathrm{precision} = \dfrac{TP}{TP + FP}$   (6)
$F1\_score = \dfrac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$   (7)

where TP represents the number of true positives, FP the number of false positives, TN the number of true negatives, and FN the number of false negatives.
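
A small sketch of Eqs. (5)-(7), computed for one class in a one-vs-all fashion from ground-truth and predicted labels.

```python
import numpy as np

def binary_metrics(y_true, y_pred, positive_label):
    """Recall, precision and F1 for one class treated one-vs-all (Eqs. 5-7)."""
    y_true = np.asarray(y_true) == positive_label
    y_pred = np.asarray(y_pred) == positive_label
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, precision, f1

print(binary_metrics(["apple", "pear", "apple"],
                     ["apple", "apple", "pear"], "apple"))
```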

Finally, we test the proposed sorting system, which uses the proposed Color-CVFH descriptor, by executing the object detection, object recognition and object grasping tasks in sequence.

III-A Evaluation on a public dataset at category level recognition

In this experiment, the public RGB-D image dataset [24] used for evaluating the recognition performance of the Color-CVFH descriptor was created by Washington University. This dataset contains 300 common household objects in 51 categories, and each object (instance) is captured in 3 different video sequences. Some samples of the dataset are shown in Fig. 5.

Fig. 5: The samples of the public RGB-D image dataset

In this public dataset, each category contains several different objects (instances). Category level recognition aims at classifying to which category a previously unseen object belongs, such as recognizing that the object apple2 (shown in Fig. 6) belongs to the category apple. In this experiment, we select ten categories of objects from the public dataset: apple, bell pepper, lemon, lime, orange, peach, pear, plate, potato and tomato. These ten categories are shown in Fig. 6. We select two similar objects from each category; for example, we select apple1 and apple2 from the category apple, which are both red (Fig. 6). From Fig. 6 we can see that apple, bell pepper, peach and tomato have similar shapes and colors; lime and orange have similar shapes but different colors; lemon, orange and pear have similar colors; and pear and plate have similar colors but different shapes.

Fig. 6: One view of two objects from each of the ten categories used in the experiment. For the first row, from left to right: apple1, apple2, bell pepper1, bell pepper2, lemon1 and lemon2. Apple1 and apple2 belong to the category of apple, bell pepper1 and bell pepper2 belong to the category of bell pepper, lemon1 and lemon2 belong to the category of lemon; For the second row, from left to right: lime1, lime2, orange1, orange2, peach1 and peach2. Lime1 and lime2 belong to the category of lime, orange1 and orange2 belong to the category of orange, peach1 and peach2 belong to the category of peach; For the third row, from left to right: pear1, pear2, plate1 and plate2. Pear1 and pear2 belong to the category of pear, plate1 and plate2 belong to the category of plate; For the fourth row, from left to right: potato1, potato2, tomato1 and tomato2. Potato1 and potato2 belong to the category of potato, tomato1 and tomato2 belong to the category of tomato.

This experiment evaluates the recognition performance of the Color-CVFH descriptor and the CVFH descriptor using a multi-class SVM classifier. We select 885 point clouds from each category for training the multi-class SVM classifier, with each object covered by 3 different video sequences, and another 295 point clouds from each category (again with 3 video sequences per object) for testing. The results of the experiment are shown in Table I and Table II.
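
As an illustration of this evaluation protocol, the sketch below trains and tests a multi-class SVM on descriptor vectors with scikit-learn. The random stand-in data, the RBF kernel and the hyperparameters are assumptions, since the paper does not state the SVM implementation or kernel used.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Stand-in data: in the experiment each row would be a 500-bin Color-CVFH
# (or 308-bin CVFH) descriptor and each label one of the ten categories;
# the sample counts and random values here are placeholders only.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((1000, 500)), rng.integers(0, 10, 1000)
X_test, y_test = rng.random((300, 500)), rng.integers(0, 10, 300)

# Multi-class SVM on the global descriptors; kernel and C are assumed defaults.
clf = SVC(kernel="rbf", C=10.0, gamma="scale")
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test), zero_division=0))
```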

Actual class \ Predicted class   Apple   Bell pepper   Lemon   Lime   Orange   Peach   Pear   Plate   Potato   Tomato
Apple 41.02% 30.17% 0.68% 0 5.42% 0 14.24% 0 8.47% 0
Bell pepper 2.71% 54.92% 0 0 1.69% 0 0.68% 0 40% 0
Lemon 0.34% 0 70.51% 24.41% 0 1.69% 1.69% 0 0 1.36%
Lime 0 0 7.12% 72.20% 0 3.05% 0.34% 0 0 17.29%
Orange 2.03% 0.68% 0 0 90.51% 0 6.44% 0 0.34% 0
Peach 0 0.34% 0.68% 28.14% 0 69.49% 1.36% 0 0 0
Pear 1.36% 0.68% 4.07% 1.36% 1.36% 1.02% 90.17% 0 0 0
Plate 0 0 0 0 0 0 0 100% 0 0
Potato 0 80.34% 0 0 0 0 0 0 19.66% 0
Tomato 0 0 0 80.68% 0 0.34% 0 0 0 18.98%
TABLE I: The results of prediction for CVFH by using the public dataset at category level recognition
CVFH Color-CVFH
Recall Precision F1_score Recall Precision F1_score
Apple vs. all 41.02% 86.43% 0.56 100% 100% 1
Bell pepper vs. all 54.92% 32.86% 0.41 100% 100% 1
Lemon vs. all 70.51% 84.90% 0.77 100% 100% 1
Lime vs. all 72.20% 34.92% 0.47 100% 100% 1
Orange vs. all 90.51% 91.44% 0.91 100% 100% 1
Peach vs. all 69.49% 91.93% 0.79 100% 100% 1
Pear vs. all 90.17% 78.47% 0.84 100% 100% 1
Plate vs. all 100% 100% 1 100% 100% 1
Potato vs. all 19.66% 28.71% 0.23 100% 100% 1
Tomato vs. all 18.98% 50.45% 0.28 100% 100% 1
TABLE II: The results of the evaluation for CVFH and Color-CVFH by using the public dataset at category level recognition

From Table I and Table II, we can see that the classifier trained with the CVFH descriptor performs poorly compared with the classifier trained with the Color-CVFH descriptor. For the classifier trained with the CVFH descriptor, about one third of the apple samples are classified as bell pepper, so the classifier obtains a poor recall (41.02%) for the category apple. This classifier is also prone to confusing bell pepper and potato: two fifths of the bell pepper samples are recognized as potato, and about four fifths of the potato samples are classified as bell pepper, so the classifier performs poorly for these two categories in recall (bell pepper: 54.92%, potato: 19.66%) and precision (bell pepper: 32.86%, potato: 28.71%). Over four fifths of the tomato samples are recognized as lime, so the classifier has a poor recall (18.98%) for the category tomato. However, the classifier obtains good results for the category plate in recall, precision and F1_score, because the CVFH descriptor is based on shape information and the plate has a distinctive shape in this experiment.

For the classifier trained with the Color-CVFH descriptor, the proposed descriptor makes full use of the object's shape and color information. Thus this classifier not only performs well on target objects with different shapes, but also achieves good recognition performance on objects with similar shapes and colors. From Table II we can see that the classifier classifies every target object correctly at category level in this experiment: even though apple, bell pepper, peach and tomato have similar shapes and colors, they are still classified correctly.

III-B Evaluation on a public dataset at instance level recognition

In this experiment, we evaluate the recognition performance of the Color-CVFH and CVFH descriptors on the public dataset at instance level recognition. Instance level recognition aims at deciding whether a previously unseen view of an object belongs to an instance that has been seen before. For example, the object apple1 (shown in Fig. 7) belongs to the instance apple1 rather than apple2, even though both instances belong to the category apple. In this experiment, we select all instances from the categories apple and orange: apple1, apple2, apple3, apple4, apple5, orange1, orange2, orange3 and orange4. These nine instances are shown in Fig. 7. From Fig. 7 we can see that apple1 and apple2 have similar shapes and colors, apple3 and apple4 look almost like the same object, and orange1, orange2 and orange4 have similar shapes and colors. All these instances have similar, roughly spherical shapes.

Fig. 7: One view of each of the nine objects (instances) used in the experiment. For the first row, left to right: apple1, apple2, apple3, apple4, apple5; For the second row, left to right: orange1, orange2, orange3, orange4.

For the instance level recognition experiment, we adopt the alternating contiguous frames scenario [24]: first, we divide each video sequence into three contiguous sequences of equal length. Since each object originally has three different video sequences, each object (instance) now has nine sub-sequences. We then randomly select seven of these nine sub-sequences from each object for training the multi-class SVM classifier and use the remaining two for testing.
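
A small sketch of this alternating contiguous frames split for one object, assuming each original video is given as a list of frame indices; seven of the nine sub-sequences are chosen at random for training, as described above.

```python
import numpy as np

def alternating_contiguous_split(sequences, n_train=7, seed=0):
    """Split each video into three contiguous thirds, then pick `n_train` of
    the resulting sub-sequences for training and keep the rest for testing."""
    rng = np.random.default_rng(seed)
    subsequences = []
    for frames in sequences:                    # one list of frames per video
        subsequences.extend(np.array_split(np.asarray(frames), 3))
    order = rng.permutation(len(subsequences))
    train = [subsequences[i] for i in order[:n_train]]
    test = [subsequences[i] for i in order[n_train:]]
    return train, test

# Example with dummy frame indices for the three original videos of one object.
videos = [list(range(0, 90)), list(range(90, 180)), list(range(180, 270))]
train, test = alternating_contiguous_split(videos)
print(len(train), len(test))                    # 7 training, 2 test sub-sequences
```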

Here, we select 4642 different point clouds in total for training and 1290 different point clouds in total for testing, and use these data to train the multi-class SVM classifier. This experiment evaluates the recognition performance of the Color-CVFH and CVFH descriptors. The results of the experiment are shown in Table III, Table IV and Table V.

Actual class \ Predicted class   Apple1   Apple2   Apple3   Apple4   Apple5   Orange1   Orange2   Orange3   Orange4
Apple1 61.76% 0 25% 0 0 0 8.82% 3.68% 0.74%
Apple2 50.36% 0 48.92% 0.72% 0 0 0 0 0
Apple3 66.92% 0 33.08% 0 0 0 0 0 0
Apple4 57.66% 0 36.50% 5.11% 0 0.73% 0 0 0
Apple5 39.19% 0 43.92% 8.78% 0 8.11% 0 0 0
Orange1 4.67% 0 1.33% 9.33% 0 81.33% 2.67% 0 0
Orange2 0 0 2.01% 1.34% 0 0.67% 53.02% 42.28% 0
Orange3 0 0 1.36% 0 0 0 37.41% 61.22% 0
Orange4 0 0 0.66% 0 0 0 0 0 99.34%
TABLE III: The results of prediction for CVFH by using the public dataset at instance level recognition
Actual class \ Predicted class   Apple1   Apple2   Apple3   Apple4   Apple5   Orange1   Orange2   Orange3   Orange4
Apple1 88.24% 11.76% 0 0 0 0 0 0 0
Apple2 0 100% 0 0 0 0 0 0 0
Apple3 0 0 82.71% 17.29% 0 0 0 0 0
Apple4 0 0 32.12% 67.88% 0 0 0 0 0
Apple5 0 0 0 0 100% 0 0 0 0
Orange1 0 0 0 0 0 100% 0 0 0
Orange2 0 0 0 0 0 0 100% 0 0
Orange3 0 0 0 0 0 0 0 100% 0
Orange4 0 0 0 0 0 0 0 0 100%
TABLE IV: The results of prediction for Color-CVFH by using the public dataset at instance level recognition
CVFH Color-CVFH
Recall Precision F1_score Recall Precision F1_score
Apple1 vs. all 61.76% 21.71% 0.32 88.24% 100% 0.94
Apple2 vs. all 0 0 0 100% 100% 1
Apple3 vs. all 33.08% 16.36% 0.22 82.71% 71.43% 0.77
Apple4 vs. all 5.11% 18.92% 0.08 67.88% 80.17% 0.74
Apple5 vs. all 0 0 0 100% 100% 1
Orange1 vs. all 81.33% 89.71% 0.85 100% 100% 1
Orange2 vs. all 53.02% 52.67% 0.53 100% 100% 1
Orange3 vs. all 61.22% 56.96% 0.59 100% 100% 1
Orange4 vs. all 99.34% 99.34% 0.99 100% 100% 1
TABLE V: The results of the evaluation for CVFH and Color-CVFH by using the public dataset at instance level recognition

From Table III, Table IV and Table V, we can see that the classifier trained with the Color-CVFH descriptor outperforms the classifier trained with the CVFH descriptor in terms of recall, precision and F1_score. For the classifier trained with the CVFH descriptor, the instances from the category apple are mostly classified as apple1 or apple3: about half of apple2, about two thirds of apple3, nearly three fifths of apple4 and nearly two fifths of apple5 are classified as apple1, while a quarter of apple1, about half of apple2, about one third of apple4 and more than two fifths of apple5 are classified as apple3. Thus the classifier performs poorly for apple1 and apple3 in precision (apple1: 21.71%, apple3: 16.36%). None of the apple2 and apple5 samples are recognized correctly, so the classifier performs very badly for them in recall and precision. In addition, this classifier tends to confuse orange2 and orange3: more than two fifths of orange2 are classified as orange3, and nearly two fifths of orange3 are classified as orange2, so the classifier does not perform well on them in recall (orange2: 53.02%, orange3: 61.22%) or precision (orange2: 52.67%, orange3: 56.96%).

Thanks to the combination of shape and color information, the classifier trained with the Color-CVFH descriptor performs well in this experiment. It can be observed from Table V that most of the target objects are classified correctly at instance level. Even though the instances belonging to the category orange look similar in both shape and color, they all achieve good performance in terms of recall, precision and F1_score. However, because apple3 and apple4 look almost like the same object in both shape and color, the classifier makes some mistakes when recognizing apple3 and apple4.

III-C Performance evaluation on real scenes

In this experiment, we evaluate the recognition performance of the Color-CVFH and CVFH descriptors on real scenes. The objects used for this experiment are bottle0, cup, ball, bottle1, bottle2 and box. These six objects are shown in Fig. 8, and Fig. 9 shows their point clouds. The six objects are divided into four categories: bottle0, cup, ball and others. The objects bottle0, cup and ball form the categories bottle0, cup and ball respectively, while bottle1, bottle2 and box all belong to the category others. The category others contains 45 different point clouds, and each of the other three categories is built from 100 different point clouds. All these point clouds are acquired with the Microsoft Kinect 3D sensor from multiple views. It can be observed from Fig. 8 and Fig. 9 that bottle0 and bottle1 have the same shape but different colors, bottle0 and bottle2 are similar in both shape and color, and the cup and the box share some similarities in shape and color from certain angles (as shown in Fig. 9), whereas the ball has a distinctive shape in this experiment. All these point clouds are used for training the multi-class SVM classifiers to evaluate the recognition performance of the Color-CVFH and CVFH descriptors.

Fig. 8: One view of the six chosen objects used in the experiment. For the first row, left to right: bottle0, cup, ball; For the second row, left to right: bottle1, box, bottle2.
Fig. 9: The point cloud of the six chosen objects used in the experiment (One view). For the first row, left to right: bottle0, cup, ball; For the second row, left to right: bottle1, box, bottle2.
Fig. 10: Three samples of real scene point clouds: (a) There are ball, bottle0, cup, bottle1 on a planar object. (b) There are bottle0, box, cup, ball on a planar object. (c) There are bottle2, ball, bottle0, cup on a planar object.

We obtain each individual target object from the real scene by utilizing the proposed segmentation method. Here, 150 different point clouds of real scenes are used for this experiment, and Fig. 10 shows some samples of the real scene point clouds. We then use the two aforementioned trained SVM classifiers to evaluate the performance of the Color-CVFH and CVFH descriptors; Table VI and Table VII show the results. It can be observed from Table VI and Table VII that the classifier trained with the CVFH descriptor performs poorly compared with the classifier trained with the Color-CVFH descriptor. For the classifier trained with the CVFH descriptor, bottle0, bottle1 and bottle2 look very much alike in shape, so the classifier tends to put them into the same category: nearly one fifth of the bottle0 samples are recognized as the category others, and over half of the category others (bottle1 and bottle2 belong to this category) are classified as bottle0. Therefore, the classifier does not perform well for the category bottle0 in terms of recall (82.67%) and precision (61.39%). Because the cup and the box share some similarities in shape and color from certain angles, some cup samples are classified as the category others (the box belongs to the category others), so the classifier does not achieve a good precision (90.00%) for the category cup. However, the classifier performs well for the category ball in terms of recall, precision and F1_score, since the ball is quite different in shape from the other objects in this experiment.

Actual class \ Predicted class   Bottle0   Cup   Ball   Others
Bottle0 82.67% 0 0 17.33%
Cup 0 96% 0 4%
Ball 0 4% 96% 0
Others 52% 6.67% 0 41.33%
TABLE VI: The results of prediction for CVFH by testing in real scenes
CVFH Color-CVFH
Recall Precision F1_score Recall Precision F1_score
Bottle0 vs. all 82.67% 61.39% 0.70 100% 100% 1
Cup vs. all 96.00% 90.00% 0.93 100% 100% 1
Ball vs. all 96.00% 100% 0.98 100% 100% 1
TABLE VII: The results of the evaluation for CVFH and Color-CVFH by testing in real scenes

For the classifier trained with the Color-CVFH descriptor, good recognition performance is obtained in terms of recall, precision and F1_score. Thanks to the combination of shape and color information, the classifier trained with the Color-CVFH descriptor not only performs well on objects with similar shapes, but also achieves good recognition performance at category level on objects with similar shapes and colors, such as bottle0 and bottle2 in this experiment. Additionally, Table VII shows that the classifier trained with the Color-CVFH descriptor classifies every individual object from the real scene correctly.

The CVFH descriptor is based only on the geometrical information of the 3D object surface, so it achieves good recognition performance on objects with different shapes, like the plate in the first experiment, but it is relatively ineffective at recognizing objects with similar shape information, like apple2 and apple5 in the second experiment. To cope with this problem, the proposed Color-CVFH descriptor is designed by combining shape information with color information, and the results of the above three experiments show that the proposed descriptor achieves good recognition performance on both the public RGB-D image dataset and real scenes. In addition, the proposed descriptor is relatively effective at recognizing objects with similar shape and color information at category level, like bottle0 and bottle2 in the third experiment, though it makes some mistakes when recognizing objects with the same shape and color at instance level, like apple3 and apple4 in the second experiment. Nevertheless, the proposed Color-CVFH descriptor achieves remarkable recognition performance in most circumstances.

III-D Testing the sorting system

In this experiment, we apply the four-class SVM classifier trained with the proposed Color-CVFH descriptor in the third experiment to the proposed sorting system. We then test its performance in sorting these four categories of objects, covering object detection, object recognition and object grasping, as shown in Fig. 11.

Fig. 11: The real scene of the object recognition and grasping task.
Fig. 12: The result of experiment for object recognition, grasping and sorting: (a) to (i) show the process of the object recognition, object grasping and object sorting.

Fig. 12 shows that the proposed sorting system can not only determine the category of the four target objects, but also grasp them and place them at their specified locations successfully. For example, it can pick up the cup and place it on the stool labeled cup. The experimental results show that the sorting system implements the object detection, object recognition and object grasping tasks well.

IV Conclusion and future work

In this paper, we present a global texture-shape 3D feature descriptor for object recognition and sorting, together with a segmentation method as the preprocessing step for utilizing the proposed descriptor for object recognition. We use a multi-class SVM classifier for recognizing the target objects instead of feature matching, which reduces the computational burden. We evaluate the recognition performance of the proposed descriptor on both a public dataset and real scenes, and the experimental results show that the classifier trained with the proposed descriptor outperforms the classifier trained with the CVFH descriptor when recognizing and sorting objects with similar shapes and colors. Finally, we present and test a sorting system that employs the proposed Color-CVFH descriptor. The experimental results show that the proposed system implements the object detection, object recognition and object grasping tasks well.

The proposed 3D feature descriptor performs well at category level recognition, but it makes some mistakes when distinguishing objects with the same shape and color information at instance level recognition. Additionally, we have focused on improving the performance of object recognition without considering the objects' pose estimation. Thus, pose estimation and richer texture information will be considered in future work.

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China (NSFC) under grants 61300159, 61473241 and 61332002, by the Project of International as well as Hong Kong, Macao & Taiwan Science and Technology Cooperation Innovation Platform in Universities in Guangdong Province under grant 2015KGJH2014, by the Science and Technology Planning Project of Guangdong Province of China under grant 2013B011304002, and by the Educational Commission of Guangdong Province of China under grant 2015KGJHZ014.

References

  • [1] N. Correll, K. E. Bekris, D. Berenson, O. Brock, A. Causo, K. Hauser, K. Okada, A. Rodriguez, J. M. Romano, and P. R. Wurman, “Analysis and observations from the first amazon picking challenge,” IEEE Transactions on Automation Science and Engineering, 2016.
  • [2] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
  • [3] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-up robust features (surf),” Computer vision and image understanding, vol. 110, no. 3, pp. 346–359, 2008.
  • [4] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “Orb: An efficient alternative to sift or surf,” in Computer Vision (ICCV), 2011 IEEE international conference on.   IEEE, 2011, pp. 2564–2571.
  • [5] Z. Zhang, “Microsoft kinect sensor and its effect,” IEEE multimedia, vol. 19, no. 2, pp. 4–10, 2012.
  • [6] A. Frome, D. Huber, R. Kolluri, T. Bülow, and J. Malik, “Recognizing objects in range data using regional point descriptors,” Computer vision-ECCV 2004, pp. 224–237, 2004.
  • [7] F. Tombari, S. Salti, and L. Di Stefano, “Unique shape context for 3d data description,” in Proceedings of the ACM workshop on 3D object retrieval.   ACM, 2010, pp. 57–62.
  • [8] B. Steder, R. B. Rusu, K. Konolige, and W. Burgard, “Point feature extraction on 3d range scans taking into account object boundaries,” in Robotics and Automation (ICRA), 2011 IEEE International Conference on.   IEEE, 2011, pp. 2601–2608.
  • [9] Y. Guo, F. Sohel, M. Bennamoun, M. Lu, and J. Wan, “Rotational projection statistics for 3d local surface description and object recognition,” International journal of computer vision, vol. 105, no. 1, pp. 63–86, 2013.
  • [10] R. B. Rusu, A. Holzbach, M. Beetz, and G. Bradski, “Detecting and segmenting objects for mobile manipulation,” in Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on.   IEEE, 2009, pp. 47–54.
  • [11] W. Wohlkinger and M. Vincze, “Ensemble of shape functions for 3d object classification,” in Robotics and Biomimetics (ROBIO), 2011 IEEE International Conference on.   IEEE, 2011, pp. 2987–2992.
  • [12] A. Aldoma, F. Tombari, R. B. Rusu, and M. Vincze, “Our-cvfh–oriented, unique and repeatable clustered viewpoint feature histogram for object recognition and 6dof pose estimation,” in Joint DAGM (German Association for Pattern Recognition) and OAGM Symposium.   Springer, 2012, pp. 113–122.
  • [13] Z.-C. Marton, D. Pangercic, R. B. Rusu, A. Holzbach, and M. Beetz, “Hierarchical object geometric categorization and appearance classification for mobile manipulation,” in Humanoid Robots (Humanoids), 2010 10th IEEE-RAS International Conference on.   IEEE, 2010, pp. 365–370.
  • [14] F. Tombari, S. Salti, and L. Di Stefano, “A combined texture-shape descriptor for enhanced 3d feature matching,” in Image Processing (ICIP), 2011 18th IEEE International Conference on.   IEEE, 2011, pp. 809–812.
  • [15] A. Aldoma, M. Vincze, N. Blodow, D. Gossow, S. Gedikli, R. B. Rusu, and G. Bradski, “Cad-model recognition and 6dof pose estimation using 3d cues,” in Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on.   IEEE, 2011, pp. 585–592.
  • [16] V. Vapnik, “The support vector method of function estimation,” Nonlinear modeling: Advanced black-box techniques, vol. 55, p. 86, 1998.
  • [17] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
  • [18] R. B. Rusu, “Semantic 3d object maps for everyday manipulation in human living environments,” Ph.D. dissertation, Computer Science department, Technische Universitaet Muenchen, Germany, October 2009.
  • [19] R. B. Rusu, G. Bradski, R. Thibaux, and J. Hsu, “Fast 3d recognition and pose using the viewpoint feature histogram,” in Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on.   IEEE, 2010, pp. 2155–2162.
  • [20] R. B. Rusu, N. Blodow, and M. Beetz, “Fast point feature histograms (fpfh) for 3d registration,” in Robotics and Automation, 2009. ICRA’09. IEEE International Conference on.   IEEE, 2009, pp. 3212–3217.
  • [21] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, “Ros: an open-source robot operating system,” in ICRA workshop on open source software, vol. 3, no. 3.2.   Kobe, 2009, p. 5.
  • [22] J. Denavit, “A kinematic notation for lower-pair mechanisms based on matrices,” ASME J. Appl. Mech., pp. 215–221, 1955.
  • [23] K. P. Hawkins, “Analytic inverse kinematics for the universal robots ur-5/ur-10 arms,” Georgia Institute of Technology, Tech. Rep., 2013.
  • [24] K. Lai, L. Bo, X. Ren, and D. Fox, “A large-scale hierarchical multi-view rgb-d object dataset,” in Robotics and Automation (ICRA), 2011 IEEE International Conference on.   IEEE, 2011, pp. 1817–1824.