The goal of computer vision-based grasp pattern recognition is to derive the grasp type from visual information of household objects, thus facilitating the grasping process in upper limb prosthesis control or gesture-based human-robot interaction. A typical grasping process consists of grasp pattern recognition based on a visual classifier and robot hand control, as shown in Fig. 1 (a). Over the past years, many grasp pattern recognition methods have been proposed, which have had a great impact on several fields including robot imitation [1, 2, 3], human-robot interaction [4, 5], human grasp prediction [6, 7, 8, 9, 10], prosthetic design [11, 12, 13], robot control [14, 15, 16, 17, 18], and prosthetic control [19, 20, 21]. In recent years, the Convolutional Neural Network (CNN) [22, 23] has shown excellent performance on many challenging tasks, such as object detection and tracking, and it provides a new solution for grasp pattern recognition. As a pioneering CNN-based work, Ghazaei et al. proposed a CNN with a simple structure for grasp pattern recognition (called CnnGrasp in this paper). They labeled images of household objects with their corresponding grasp types and thereby converted grasp pattern recognition into an image classification problem. Inspired by this work, a series of CNN-based grasp pattern recognition methods emerged [20, 8].
However, grasp pattern recognition still faces several significant problems. One is the presence of unseen objects during evaluation, where a tested object and its views are wholly unseen by the algorithm during training, such as the example given in Fig. 1 (b). In the bottom panel, the tested object (red apple) never appears in the training set. In this case, it is usually difficult for traditional models to predict correctly.
Unseen-object problems such as the size confusion problem and the structural confusion problem can greatly affect model performance. As shown in Fig. 1 (c), the battery and the glue stick are mis-recognized as ‘PWN’ instead of ‘Tripod’ because they appear large in the images. The structural confusion problem occurs when different objects are shot at certain angles that lead to similar appearances in the images, such as the bottle and box shown in Fig. 1 (c). The image taken of the bottle can be confused with a rectangle, and the potato may be confused with a plate. Therefore, the grasp types may be mis-recognized in these cases.
In addition, compared with object category labeling, grasp type labeling is much more complex and requires considerable human effort, considering the varying hand control requirements and gesture annotations of different applications and tasks. For example, some tasks require higher grasping precision, so the four gestures commonly used in daily life are no longer sufficient. Another example is prosthetic control, in which the influence of personal habits on grasping gestures should be considered. These factors all affect the gesture annotation results. Moreover, it is labor-intensive and time-consuming to re-label gesture data for different applications and scenarios. Therefore, the following problem arises: how to train a relatively powerful grasp classifier with only a few grasp labels given in advance.
The simplest idea for handling this problem is to introduce object category information. Even if part of the visual information of an object is missing, or the object has never been seen before, a human can still figure out the corresponding grasp type from their experience with objects of the same category. This is because object category information usually carries the necessary cues, including the geometric structure, size, and even material of the objects. However, this information is ignored by existing grasp pattern recognition methods. Thus, it is natural to jointly learn object category classification and grasp pattern recognition to improve the latter.
Motivated by the above observation, this paper proposes a new deep learning architecture called Dual-branches Convolutional Neural Network (DcnnGrasp) for effective grasp pattern recognition. DcnnGrasp has two branches, namely the object category classification branch and the grasp pattern recognition branch. The former helps achieve better grasp pattern recognition precision. Since the two recognition tasks share a strong correlation, it is natural to jointly learn object category information and grasp type information. To the best of our knowledge, this is the first work that utilizes object category information for grasp pattern recognition.
The contributions of this work are threefold:
A novel dual-branch network structure (DcnnGrasp) is proposed for grasp pattern recognition (see Fig. 2), in which the features extracted from the object category branch and the grasp type branch are integrated to enhance the performance of grasp pattern recognition. To train DcnnGrasp, four dual-label datasets with grasp type and object category labels were established in this paper by manually labeling existing datasets. The experimental results show that the proposed model achieves significantly improved performance for both seen and unseen testing objects.
To maximize the collaborative learning of object category classification and grasp pattern recognition, this paper proposes a new loss function (JCEAR) from the Bayesian perspective. The regularization parameters in JCEAR are learned adaptively during the training process. Based on JCEAR, a training strategy is formulated that trains the parameters of the object category classification branch and the grasp pattern recognition branch of DcnnGrasp separately and iteratively. This makes the two branches teach each other interactively and improves the performance of both.
Experiments have been conducted to examine the robustness of the proposed method (DcnnGrasp) in recovering 3D information of the objects, its robustness on datasets with missing grasp labels, and its generalizability to unseen objects. All experimental results demonstrate the effectiveness of DcnnGrasp in all cases. Specifically, for the unseen problem, our method achieved a global accuracy (GA) of about 94% and 99% on the RGB-D Object dataset and Hit-GPRec dataset, respectively. Meanwhile, DcnnGrasp outperformed the second-best method by nearly 15% in global accuracy on the RGB-D Object dataset, indicating that the proposed method can solve the unseen problem well. In addition, when only one object with gesture labels in each object category appears in the training process, the GA values of DcnnGrasp are still higher than 90% and 95% on the RGB-D Object dataset and Hit-GPRec dataset respectively, indicating that the proposed method can alleviate the problem of difficult gesture labeling in grasp pattern recognition to some extent.
2 Related Work
In , Došen et al. integrated a simple vision system into a prosthetic hand, in which hardware (a camera and distance sensor) and software were used to recognize the object, and a control signal was generated for the prehension of the artificial hand. After that, a series of computer vision-based grasp pattern recognition methods were proposed [9, 27]. For example, Kopicki et al. proposed a one-shot learning method of dexterous grasps for novel objects, in which point cloud data were collected by a depth (RGB-D) camera. In , the model of each grasp type was learned from a single kinesthetic demonstration, and the resulting models were then used for grasp prediction and generation. This improved the robustness of the method to incomplete data in both the training and testing stages.
The popularity of mobile phones has generally made image data more accessible. Ghazaei et al. were the first to apply grey-scale image data of household objects to the grasp pattern recognition problem, using a CNN (consisting of only two convolutional layers and a downsampling layer) to predict the grasp type from an image of an object. Furthermore, Deng et al. developed an attention-based visual analysis framework to obtain grasp-relevant information in a scene, where a deep convolutional neural network was used to detect the grasp type and locate the attention point on the object. This method sped up grasp planning with more stable configurations. To improve the effectiveness of grasp pattern recognition, Shi et al. combined the 3D information and shape of the object by using the depth image and grayscale image simultaneously. Corona et al. proposed another approach to learn the 3D information of the objects in the image effectively: they designed a multi-task GAN architecture called GanHand to estimate the 3D shape of the objects from a single RGB image and predict the best grasp type according to a taxonomy with 33 grasp types.
When the camera is embedded in the palm of a robotic (or prosthetic) hand, the identified grasp types for the same object from different approaching orientations of the hand may differ [30, 31]. Therefore, in , instead of making absolute positive or negative predictions, Han et al. utilized the priority of grasp types as the prediction result and took ranked lists of grasps as a new label form. Based on this, they constructed a probabilistic classifier using the Plackett-Luce model. Furthermore, Zandigohar et al. used probability lists of grasps as the prediction result. However, this method does not utilize category information, so it requires more training data to cover the feature space to achieve high performance.
3 The Proposed Learning Methods
In this part, we present the design of our proposed method (including DcnnGrasp and JCEAR), which jointly learns object category classification and grasp pattern recognition to improve grasp pattern recognition performance.
3.1 DcnnGrasp for Grasp Pattern Recognition
To achieve the goal of learning object category classification and grasp pattern recognition jointly, this paper designs a dual-branch network structure (DcnnGrasp). As shown in Figure 2, DcnnGrasp consists of two branches, both of which take a given household object image x as the input. For convenience of presentation, this paper uses ‘category’ and ‘grasp’ to respectively denote the upper branch and the lower branch in Figure 2 according to their functions. Each branch consists of one feature extractor and one classifier, and the features extracted from the object category feature extractor and the grasp feature extractor are integrated to enhance the performance of grasp pattern recognition. For the grasp feature extractor, CnnGrasp is used in this paper. In the following, the object category feature extractor, the category classifier, and the grasp classifier are introduced in detail.
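The data flow of the two branches can be sketched in a few lines (a minimal pure-Python sketch; the toy feature functions and dimensions are illustrative stand-ins, not the paper's implementation):

```python
# Minimal sketch of DcnnGrasp's dual-branch data flow.
# Toy stand-ins for the real feature extractors: each maps an input
# image (here a flat list of floats) to a small feature vector.
def category_features(image):
    # stands in for DenseNet + fully connected layers
    return [sum(image) / len(image)] * 4          # toy 4-d category feature

def grasp_features(image):
    # stands in for CnnGrasp's convolutional feature extractor
    return [max(image), min(image)]               # toy 2-d grasp feature

def dcnn_grasp_forward(image):
    f_c = category_features(image)
    f_g = grasp_features(image)
    fused = f_g + f_c      # concatenation: the grasp classifier sees
    return fused           # category information alongside grasp features

image = [0.1, 0.5, 0.9, 0.3]
fused = dcnn_grasp_forward(image)
print(len(fused))  # 6: the fused feature fed to the grasp classifier
```

The key design point is that only the grasp branch consumes the concatenated feature; the category branch is trained on its own output, so it keeps focusing on category classification.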
3.1.1 Object category feature extractor
The object category feature extractor consists of convolutional layers (DenseNet) and fully connected layers. Assume that $F$ is the feature maps obtained by the convolutional layers, and define $h^{(0)} = \mathrm{vec}(F)$ as the vectorized $F$ for convenience. Let
$$h^{(i)} = \sigma\big(W^{(i)} h^{(i-1)} + b^{(i)}\big)$$
be the output of the $i$-th fully connected layer with a dimension of $d_i$ ($i = 1, \ldots, n$), where $\sigma(\cdot)$ is a nonlinear activation function. Therefore, the object category features can be obtained by
$$f_c = h^{(n)}.$$
Here, $\Theta_{cf}$ represents the learnable parameters in DenseNet and the parameters $\{W^{(i)}, b^{(i)}\}_{i=1}^{n}$ of the fully connected layers.
3.1.2 Category classifier
The category classifier takes $f_c$ as input. Let
$$g^{(j)} = \sigma\big(V^{(j)} g^{(j-1)} + c^{(j)}\big), \qquad g^{(0)} = f_c,$$
be the output of the $j$-th layer in the category classifier ($j = 1, \ldots, m$). The output of the category classifier can be obtained as
$$\hat{y}^c = \mathrm{softmax}\big(g^{(m)}\big),$$
where $\Theta_{cc}$ stands for the parameters $\{V^{(j)}, c^{(j)}\}_{j=1}^{m}$.
3.1.3 Grasp classifier
Assume that $f_g$ is the output of the grasp type feature extractor. The grasping information and object category information can be fused by concatenating $f_g$ and $f_c$, as shown in Fig. 2; define the resulting feature as $I = [f_g; f_c]$.

The grasp classifier takes $I$ as its input. Define $u^{(0)} = I$ for convenience. Let
$$u^{(k)} = \sigma\big(U^{(k)} u^{(k-1)} + e^{(k)}\big)$$
be the output of the $k$-th layer with a dimension of $d_k$ ($k = 1, \ldots, l$), where $\sigma(\cdot)$ is a nonlinear activation function. The output of the grasp classifier can be obtained as
$$\hat{y}^g = \mathrm{softmax}\big(u^{(l)}\big),$$
where $\Theta_{gf}$ and $\Theta_{gc}$ represent the learnable parameters in the grasp type feature extractor and the parameters $\{U^{(k)}, e^{(k)}\}_{k=1}^{l}$ of the grasp classifier, respectively.
3.2 Joint Cross-Entropy with Adaptive Regularizer
In this paper, a novel loss function is developed to help the model to further extract and utilize the relationship between object category classification and grasp pattern recognition, thus improving the performance of the model on grasp pattern recognition.
Let one-hot vectors $y^g = (y^g_1, \ldots, y^g_{K_g})$ and $y^c = (y^c_1, \ldots, y^c_{K_c})$ be the true labels for grasp pattern recognition and object category classification, respectively. $\hat{y}^g_i$ and $\hat{y}^c_j$ are the resulting probabilities of belonging to grasp type $i$ and object class $j$, respectively. Thus, $\sum_{i=1}^{K_g} \hat{y}^g_i = 1$ and $\sum_{j=1}^{K_c} \hat{y}^c_j = 1$.
To jointly learn multiple tasks, the simplest choice for the loss function is the following joint cross-entropy (JCE):
$$L_{\mathrm{JCE}} = -\sum_{i=1}^{K_g} y^g_i \log \hat{y}^g_i - \lambda \sum_{j=1}^{K_c} y^c_j \log \hat{y}^c_j,$$
where $\lambda > 0$ is a weighting parameter.
However, JCE is just the weighted sum of the cross-entropy functions of the two tasks, which loses the relationship between object category classification and grasp pattern recognition. To achieve the goal of jointly learning object category classification and grasp pattern recognition, this paper proposes Joint Cross-Entropy with Adaptive Regularizer (JCEAR) from the Bayesian perspective.
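As a concrete reference, the weighted-sum form of JCE can be computed as follows (a minimal pure-Python sketch; the weight `lam` and its placement on the category term are assumptions for illustration, not taken from the paper's implementation):

```python
import math

def cross_entropy(y_true, y_pred):
    """Cross-entropy between a one-hot label and a predicted distribution."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

def joint_cross_entropy(y_grasp, p_grasp, y_cat, p_cat, lam=1.0):
    # JCE: a weighted sum of the two tasks' cross-entropies; note that no
    # term couples the two tasks, which is the weakness discussed above.
    return cross_entropy(y_grasp, p_grasp) + lam * cross_entropy(y_cat, p_cat)

# Toy example: grasp label 'tripod' (index 2 of 4), category class 0 of 3.
y_g, p_g = [0, 0, 1, 0], [0.1, 0.1, 0.7, 0.1]
y_c, p_c = [1, 0, 0], [0.8, 0.1, 0.1]
loss = joint_cross_entropy(y_g, p_g, y_c, p_c, lam=0.5)
print(loss)  # -ln(0.7) - 0.5 * ln(0.8) ≈ 0.4682
```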
Since each of $y^g$ and $y^c$ follows a multinomial distribution, the following likelihood function can be taken:
$$p(y^g, y^c \mid x) = \prod_{i=1}^{K_g} (\hat{y}^g_i)^{y^g_i} \prod_{j=1}^{K_c} (\hat{y}^c_j)^{y^c_j}.$$
Let $W$ be a $K_g \times K_c$ matrix, in which each element of $W$ is $w_{ij} = p(y^g_i = 1 \mid y^c_j = 1)$. Then, by calculating the distribution of $y^g$ and $y^c$ over the training set, $w_{ij}$ can be estimated as
$$\hat{w}_{ij} = \frac{n_{ij}}{\sum_{i'=1}^{K_g} n_{i'j}},$$
where $n_{ij}$ is the number of the images belonging to the $j$-th class object and the $i$-th class grasp.
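The co-occurrence estimate of this conditional-probability matrix can be sketched as follows (a hypothetical minimal implementation; the function name and the sample encoding as `(grasp, category)` pairs are illustrative):

```python
from collections import Counter

def estimate_w(samples, n_grasps, n_categories):
    """Estimate W[i][j] ~ p(grasp i | category j) from (grasp, category) pairs."""
    counts = Counter(samples)  # counts[(i, j)]: images with grasp i, category j
    w = [[0.0] * n_categories for _ in range(n_grasps)]
    for j in range(n_categories):
        col_total = sum(counts[(i, j)] for i in range(n_grasps))
        for i in range(n_grasps):
            if col_total > 0:
                w[i][j] = counts[(i, j)] / col_total  # column-normalized count
    return w

# Toy data: category 0 ('apple') is always grasped with grasp 1 ('spherical');
# category 1 ('pen') is split evenly between grasps 0 and 2.
samples = [(1, 0), (1, 0), (0, 1), (2, 1)]
w = estimate_w(samples, n_grasps=3, n_categories=2)
print(w[1][0], w[0][1])  # 1.0 0.5
```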
Assume a prior on $W$ built around the estimate $\hat{W}$, where $\hat{w}_j$ denotes the $j$-th row vector of $\hat{W}$. Then, from the Bayesian perspective, the model parameters can be obtained by maximizing the posterior, i.e., the likelihood combined with this prior. Taking the negative logarithm leads to the minimization objective used as the loss function of the proposed model, the Joint Cross-Entropy with Adaptive Regularizer (JCEAR).
Meanwhile, the regularization parameter in the proposed loss function JCEAR is updated adaptively during the training process: it is initialized with a constant and updated at each iteration according to the current sample batch, where $t$ is the number of iterations (see Algorithm 1) and $x^{(b)}_s$ is the $s$-th image in the $b$-th sample batch. Here, the parameters of the object category feature extractor are initialized by pretraining DenseNet with ImageNet, and the remaining parameters of the classifiers and the grasp type feature extractor are initialized randomly.
The whole training strategy based on JCEAR is presented in Algorithm 1. To guarantee that each module in DcnnGrasp focuses on its own task, the learnable parameters of the two branches are updated separately in Algorithm 1 (different from deep mutual learning , the two branches here are trained for their own tasks by a common loss function).
4 Experiments

4.1 Experimental Settings
4.1.1 Datasets

To evaluate the effectiveness of the methods, five datasets, including the RGB-D Object dataset , Hit-GPRec dataset , Amsterdam library of object images , Columbia Object Image Library (COIL-100) , and MeganePro dataset 1 , were used in the experiments.
[Table I: dataset division for OCS — columns: Training Set + Validation Set (with grasp labels / without grasp labels) and Testing Set.]
RGB-D Object dataset (http://rgbd-dataset.cs.washington.edu/dataset.html) contains 300 objects (aka instances) in 51 categories with a total of 207,921 images. The dataset was labeled (https://drive.google.com/file/d/1PqQ_5HJOmUcQHLNnBUPUD6_qqhR7s-3c/view) according to four grasp types widely used in daily life: palmar wrist neutral, palmar wrist pronated, pinch, and tripod. A detailed description is presented in the supplementary materials of this paper.
Hit-GPRec dataset (http://homepage.hit.edu.cn/ydp) contains 121 daily objects captured under different environmental conditions (4 types of lighting, 16 rotation angles of view, 4 different camera positions, and different postures of the objects). These objects can be categorized into four grasp types (cylindrical, lateral, spherical, and tripod). Meanwhile, these objects were further manually divided into 66 object categories (https://drive.google.com/file/d/14bEIDjd9a-80LKk0I2xnuQmUUhcNzLoi/view).
Amsterdam library of object images (ALOI, https://aloi.science.uva.nl/) contains color images of a thousand small objects. Each object has 72 images with a black background. This dataset was initially used for object category classification tasks. Ghazaei et al. chose 494 objects from the dataset and added grasp type labels (four gestures: palmar wrist neutral, palmar wrist pronated, pinch, and tripod) to these objects (https://iopscience.iop.org/article/10.1088/1741-2552/aa6802/data). Based on this, our study manually divided the objects into 129 categories (https://docs.google.com/spreadsheets/d/1qvuiRL0x8NTgjMNrRRpDo9utKgCeNFoH/edit#gid=72548980).
Columbia University Image Library (COIL-100, https://www.cs.columbia.edu/CAVE/software/softlib/coil-100.php) contains one hundred objects. Each object was photographed at 5-degree rotations, yielding 72 color images of size 128×128. Our study chose the 96 objects that can be grasped with a single hand and manually labeled these images with four gestures (palmar wrist neutral, palmar wrist pronated, pinch, and tripod) and 53 object category labels (https://drive.google.com/file/d/1psxIYfG9t8Y-_WwO8lIiL4Ykg848qi2a/view?usp=sharing).
MeganePro dataset 1 (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/1Z3IOM) was acquired from 30 healthy subjects with an average age of 46.63 ± 15.11 years. It contains 18 object categories and 10 grasp types that were selected based on hand taxonomies and grasp frequency in Activities of Daily Living. The object images were obtained by detecting and cropping objects from the original video dataset (https://github.com/MountStonne/CropedObjectsByFrames).
4.1.2 Dataset Division

To verify the generalizability and robustness of grasp pattern recognition, two forms of dataset division were considered: within-whole dataset cross-validation (WWC) and between-object cross-validation (BOC). For WWC, the whole dataset was sampled and divided into a training set, a validation set, and a testing set at a preset ratio. The dataset division method BOC was used to verify the ability of the models on unseen objects (objects that appear in neither the training set nor the validation set). In this case, an object and its views were either wholly seen or wholly unseen. A portion of the objects in the dataset were selected and divided into a training set and a validation set at a preset ratio, and the remaining objects were taken as the testing set. For fair comparison, this study used standard ten-fold cross-validation, and the average results were reported.
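The between-object split can be sketched as follows (a hypothetical helper; the 80/20 object split and the 10% validation fraction are illustrative placeholders, since the exact ratios are elided in the text):

```python
import random

def boc_split(objects, train_frac=0.8, val_frac=0.1, seed=0):
    """Between-object split: each object (with all of its views) is either
    wholly seen (train/validation) or wholly unseen (test)."""
    rng = random.Random(seed)
    objs = list(objects)
    rng.shuffle(objs)
    n_seen = int(len(objs) * train_frac)
    seen, test = objs[:n_seen], objs[n_seen:]   # test objects are never seen
    n_val = int(len(seen) * val_frac)
    val, train = seen[:n_val], seen[n_val:]
    return train, val, test

objects = [f"obj_{k}" for k in range(100)]
train, val, test = boc_split(objects)
print(len(train), len(val), len(test))  # 72 8 20
```

Because the split operates on object identities rather than on individual images, no view of a test object can leak into training, which is exactly what makes BOC harder than WWC.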
4.1.3 Object Category-based Sampling
To better investigate the relationship between object categories and grasp labels, we proposed an object category-based sampling (OCS) method, ensuring that at least one object with grasp labels in each object category appears in the training process. Denote the total number of objects in the $c$-th object category and the expected number of randomly chosen objects with grasp labels in the training process as $N_c$ and $M$, respectively. As shown in Table I, if OCS is chosen, then for each object category with $N_c > M$, $M$ objects with grasp labels and $N_c - M - 1$ objects without grasp labels are used in the training process, and the remaining one is taken as the testing data. When $N_c \le M$, all objects in that category and their grasp labels appear in the training process. All samples used in the training process were divided into training and validation sets at a ratio of 9:1, whether grasp labels were used or not. It is worth mentioning that, for OCS, the object category labels of all samples in both the training and validation sets were used, and one-hot vectors composed of zeros were used as the ‘label’ of samples with missing grasp labels during the training process. OCS can be used to test the prediction performance of the model on grasp labels of different objects within the same object category and to better investigate the relationship between object categories and grasp labels. Each experiment on OCS was performed three times, in which the samples for training, validation, and testing were chosen randomly according to Table I, and the average results were reported.
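The per-category sampling rule above can be sketched as follows (a hypothetical helper with illustrative names; it implements the "up to m labeled objects per category, one held out for testing" rule described in the text):

```python
import random

def ocs_split(category_objects, m, seed=0):
    """Object category-based sampling: per category, up to m objects keep
    their grasp labels; if a category has more than m objects, one object
    is held out for testing and the rest are used unlabeled in training."""
    rng = random.Random(seed)
    labeled, unlabeled, test = [], [], []
    for cat, objs in category_objects.items():
        objs = list(objs)
        rng.shuffle(objs)
        if len(objs) > m:
            labeled += objs[:m]        # grasp labels kept in training
            test.append(objs[-1])      # one held-out testing object
            unlabeled += objs[m:-1]    # trained with category label only
        else:
            labeled += objs            # small category: fully labeled
    return labeled, unlabeled, test

cats = {"apple": ["a1", "a2", "a3", "a4", "a5"],
        "pen": ["p1", "p2", "p3"],
        "cup": ["c1"]}
labeled, unlabeled, test = ocs_split(cats, m=1)
print(len(labeled), len(unlabeled), len(test))  # 3 4 2
```

Note that the single-object category ("cup") contributes no test sample, matching the rule that all of its objects and labels go into training.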
4.1.4 Implementation Details
As shown in Fig. 2, RGB images of household objects were taken as the input of our network. The input images were resized to 224×224 pixels. The input images of the category branch (each pixel is divided by 255) and the grasp branch were normalized per channel according to
$$\hat{x}_c = \frac{x_c - \mu_c}{\sigma_c},$$
where $x_c$ represents the $c$-th channel of the image x (with size 224×224), and $\mu_c$ and $\sigma_c$ are the mean and standard deviation of that channel.
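Per-channel standardization can be sketched as follows (a minimal pure-Python version, assuming the standard zero-mean, unit-variance form; real pipelines would use dataset-level statistics rather than per-image ones):

```python
def normalize_channel(channel):
    """Standardize one image channel (a flat list) to zero mean, unit variance."""
    n = len(channel)
    mean = sum(channel) / n
    var = sum((v - mean) ** 2 for v in channel) / n
    std = var ** 0.5 or 1.0            # guard against a constant channel
    return [(v - mean) / std for v in channel]

channel = [0.0, 0.5, 1.0]
out = normalize_channel(channel)
print(out)  # zero-mean, unit-variance values
```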
[Table III: grasp classification results in WWC — columns: Methods, RGB-D Object dataset, Hit-GPRec dataset, ALOI, COIL-100, MeganePro dataset 1.]
[Table IV: grasp classification results in BOC — columns: Methods, RGB-D Object dataset, Hit-GPRec dataset, ALOI.]
The network structure is as follows. The object category feature extractor adopts DenseNet121, and the category classifier adopts three fully connected layers and a softmax layer. The output dimensions of the fully connected layers are 256, 128, and 64, respectively. The softmax layer is essentially a fully connected layer whose activation function is softmax; its output dimension equals the number of object categories. The grasp type feature extractor adopts CnnGrasp. The grasp classifier is composed of two fully connected layers and a softmax layer. The output dimensions of the fully connected layers are 128 and 64, and the output dimension of the softmax layer equals the number of grasp types.
4.2 Comparison with State-of-the-Art Methods
4.2.1 Evaluation Metrics and Compared Methods
In this paper, our method was compared with the following state-of-the-art methods: CnnGrasp , EfficientNet , EfficientNet_v2 , Lightlayers , GhostNet , and RegNet . These methods are described in Table II, where only CnnGrasp is designed specifically for grasp classification; the remaining five methods were designed for the classical classification problem.
4.2.2 Comparison in WWC
To verify the robustness of the proposed method (DcnnGrasp) to different views of objects, our method was compared with all six methods in WWC on all five datasets. All experimental results are presented in Table III. It can be seen from this table that DcnnGrasp achieved the best prediction performance among all methods for grasp classification. Specifically, DcnnGrasp obtained an accuracy of almost 100% on the RGB-D Object dataset, Hit-GPRec dataset, and COIL-100. On the Hit-GPRec dataset and COIL-100, the proposed method even outperformed most compared methods by a large margin in accuracy.
The comparison of the grasp classification methods (CnnGrasp and DcnnGrasp) with the five classical classification methods indicated that both grasp classification methods achieved excellent performance on the ALOI, COIL-100, and MeganePro datasets. For the RGB-D Object dataset and Hit-GPRec dataset, DcnnGrasp still achieved the best prediction performance, but the performance of CnnGrasp was mediocre; the accuracy gap between CnnGrasp and DcnnGrasp was substantial on both datasets. All experimental results shown in Table III illustrate the effectiveness of the proposed method in WWC.
4.2.3 Comparison in BOC
To demonstrate the generalizability of the proposed method to unseen objects, our method was compared with all six methods in BOC on three datasets: the RGB-D Object dataset, Hit-GPRec dataset, and ALOI (for BOC, only the datasets containing more than 100 objects were used). All results are presented in Table IV. It can be seen from this table that DcnnGrasp still achieved the best prediction performance for BOC.
The results in Tables III and IV indicated that BOC (the unseen problem) is much more challenging than WWC. The performance of all compared methods decreased dramatically in most cases. On the RGB-D Object dataset, the gap between the results of WWC and BOC reached 20% for some methods (such as GhostNet, RepVGG, EfficientNet_v2, RegNetX600, and RegNetY600). However, even for BOC, the performance of DcnnGrasp on all three evaluation metrics on the Hit-GPRec dataset was close to 100%. On the RGB-D Object dataset, it achieved an accuracy of more than 94%, which is 15% higher than that of all compared methods.
4.3 Robustness on the dataset with missing grasp labels
4.3.1 Comparison in BOC
In this part, the robustness of DcnnGrasp was investigated on datasets with missing grasp labels, in which the proportion of the remaining grasp labels was varied. Meanwhile, one-hot vectors composed of zeros were used for the missing grasp labels during the training process. DcnnGrasp was compared with the grasp classification method CnnGrasp on the RGB-D Object dataset, Hit-GPRec dataset, and ALOI dataset under BOC sampling. All comparison results are illustrated in Fig. 3.
It can be seen from Fig. 3 that DcnnGrasp significantly outperformed CnnGrasp in all cases. In addition, even for a small proportion of remaining grasp labels, DcnnGrasp maintained excellent performance on the RGB-D Object dataset and Hit-GPRec dataset: its GA values were still higher than 85% and 95% on these two datasets, respectively. This shows the robust grasping prediction ability of DcnnGrasp when datasets have sufficient objects per object category, which guarantees that DcnnGrasp can learn the relationship between grasp types and object categories well.
4.3.2 Comparison in OCS
[Table V: comparison results in OCS — columns: Datasets (RGB-D Object dataset, Hit-GPRec dataset).]
Since prosthetic or robotic hands are generally initialized before leaving the factory, where a well-designed dataset can be used during the training process, we used OCS to simulate such scenarios in this experiment, varying the expected number of objects with grasp labels in each object category. The results for the RGB-D Object dataset and Hit-GPRec dataset are presented in Table V. These two datasets were chosen because they have more objects per object category than other datasets such as ALOI.
|Component||v0||v1||v2|
|Dual branch network||✓||✓||✓|
|Using object category label||||✓||✓|
|Training strategy based on JCEAR||||||✓|
|RGB-D Object dataset (GA, %)||72.87||43.25/76.38||91.50/94.43|
From Table V, we can see that DcnnGrasp maintains much more stable performance than CnnGrasp as the number of labeled objects per category decreases. Even when only one object with grasp labels per object category appears in the training process, the GA values of DcnnGrasp are still higher than 90% and 95% on the RGB-D Object dataset and Hit-GPRec dataset, respectively, while the performance of CnnGrasp drops sharply in this case. This demonstrates the strong robustness of DcnnGrasp to different objects within the same category and to datasets with missing grasp labels.
4.4 Ablation Studies
Ablation studies were performed from two aspects: (1) the grasp type feature extractor (GTFE), and (2) the components of the proposed method. The impact of each aspect on the model performance is analyzed in turn.

4.4.1 Effects of Different GTFEs
This study respectively used EfficientNet (B0), EfficientNet_v2 (B0), LightLayers, and CnnGrasp as the grasp type feature extractor to investigate the impact of different extractors on the network performance (EfficientNet (B0), EfficientNet_v2 (B0), and LightLayers were chosen because of their efficiency and good performance in the comparison experiments). The experimental results are shown in Fig. 4. It can be seen from this figure that the method taking CnnGrasp as the grasp type feature extractor in DcnnGrasp achieved the best results, which shows that the choice of grasp type feature extractor is not tied to the depth of the network (CnnGrasp has only two layers). Meanwhile, the performance of all four models (CnnGrasp, EfficientNet (B0), EfficientNet_v2 (B0), and LightLayers) was significantly enhanced by the proposed method in most cases. Specifically, for the RGB-D Object dataset and Hit-GPRec dataset, the GA values of all methods taking the four traditional models as the grasp type feature extractor reached more than 90%, about 20% higher than those of the four models alone. As discussed later, this performance advantage benefits not only from the dual-branch network but also from the training strategy based on JCEAR, which significantly enhanced the joint learning of object category classification and grasp pattern recognition.
4.4.2 Effects of Different Strategies
To validate the effectiveness of the proposed strategies for improving grasp pattern recognition (the dual-branch network, the introduction of object category information, and the training strategy based on JCEAR), this paper compared the following variants obtained by adding the strategies gradually:
v0: The dual-branch network is used, and the cross-entropy is taken as the loss function.
v1: The dual-branch network is used, and the joint cross-entropy is taken as the loss function, in which object category labels are used.
v2: Both the dual-branch network and the training strategy based on JCEAR are used, in which object category labels are used.
All comparison results are presented in Table VI. It can be seen that each proposed strategy contributes a significant increase in GA, demonstrating their effectiveness. Meanwhile, the comparison of v1 and v2 indicates significant improvements attributed to the proposed training strategy based on JCEAR. In particular, on the RGB-D Object dataset and Hit-GPRec dataset, the increase in GA is at least 13.5%.
5 Discussion

In the above experiments, DcnnGrasp achieved the best performance in most cases. To analyze the performance of DcnnGrasp in grasp pattern recognition in depth, based on the observations from the experiments and the visualized results of DcnnGrasp and CnnGrasp in Fig. 5, this paper provides further discussion on the following four aspects: (1) robustness in recovering 3D information of the objects for WWC, (2) sensitivity to shadows and confusing backgrounds, (3) generalizability to unseen objects, and (4) robustness on datasets with missing grasp labels.
5.1 Robustness in 3D information of the objects for WWC
From the comparison results in WWC, it can be seen that DcnnGrasp obtained the best results on all datasets, and all evaluation metric values were above 98%, indicating strong robustness of DcnnGrasp to different views of the objects. Meanwhile, the visualized results in Fig. 5 (a) show that CnnGrasp was more easily confused about the 3D information of the objects than DcnnGrasp. Specifically, in the prediction for ‘Lemon’, CnnGrasp output the wrong grasping gesture ‘Tripod’, possibly because the 3D shape of ‘Lemon’ was mistakenly recognized as flat. In addition, the cases of ‘Food Box’ and ‘Box’ show that CnnGrasp was sensitive to the deformation produced by photographing the object from different angles. Even in this situation, DcnnGrasp still worked well. All these observations show the strong robustness of DcnnGrasp in recovering 3D information of the objects, which is attributed to the introduction of object category information.
5.2 Sensitivity to the shadow and confusing background
In grasp pattern recognition, shadows and confusing backgrounds often appear in household object image data, which affects the performance of grasp classification. For example, the example images from the ALOI dataset listed in Fig. 5 show that the target object is often indistinguishable from the background. For ‘Bottle’, ‘Ashtray’, and ‘Bowl’, it is difficult to recognize the target object even for humans. This is one of the reasons why most methods failed on the ALOI dataset, particularly under BOC sampling. Another example is the Hit-GPRec dataset, which is contaminated by shadows. As shown in Tables III and IV, on this dataset, DcnnGrasp achieved at least 99% on the evaluation metrics in all cases. This good performance is attributed to the proposed training strategy based on JCEAR, as shown in the ablation study given in Table VI.
5.3 Generalizability to unseen objects
According to the comparison results in Tables III and IV, most methods performed worse in BOC validation than in WWC, indicating the considerable challenge posed by the unseen-object problem. One reason may lie in the datasets. For example, in the ALOI dataset, each category contains only a few individual objects, and some categories contain only one, which makes it difficult to learn the relationship between object categories and grasping gestures. A similar situation can be observed in other datasets. For example, in the Hit-GPRec dataset, the category ‘Roll Paper Tube’ contains only two objects, and they are labeled with different grasp types (one ‘Tripod’, the other ‘Cylindrical’), which led to a wrong prediction by DcnnGrasp for an example of ‘Roll Paper Tube’, as shown in Fig. 5 (b).
Another reason for the wrong grasping-gesture predictions may be that, as shown in Fig. 5 (b), the loss of 3D information is more serious in BOC than in WWC. For example, in the predictions for ‘Can’ and ‘Pitcher’, CnnGrasp mistakenly recognized these two objects as small ones and thus predicted the wrong grasping gesture. This is common in traditional grasp classification methods because size information is lost in RGB images. In addition, the cases of ‘Block’ and ‘Food Box’ show that CnnGrasp is sensitive to the deformation caused by photographing the object from different angles.
Although the unseen-object problem is a challenge in grasp pattern recognition, DcnnGrasp still obtained excellent results (higher than 94%) on the RGB-D Object and Hit-GPRec datasets. It also significantly outperformed the state-of-the-art methods on all datasets in all evaluation metrics. These results verify the strong generalizability of DcnnGrasp to unseen objects, which is attributed to the proposed training strategy based on JCEAR.
5.4 Robustness on datasets with missing grasp labels
In the experiments in Section 4.3, the robustness of DcnnGrasp was investigated on datasets with missing grasp labels. The experimental results indicated the effectiveness of our method even when the proportion of remaining grasp labels is small, especially on the RGB-D Object and Hit-GPRec datasets. The comparison in OCS showed the strong robustness of DcnnGrasp to missing grasp labels relative to CnnGrasp. On the Hit-GPRec dataset, the differences in DcnnGrasp caused by different proportions of remaining grasp labels were less than 2%, whereas for CnnGrasp the differences reached 10% to 15% across the evaluation metrics. Even when only one object with grasp labels per object category appeared in the training process, the GA values of DcnnGrasp were higher than 90% and 95% on the RGB-D Object and Hit-GPRec datasets, respectively. This is because DcnnGrasp can learn and utilize the relationship between object categories and grasping gestures well as long as each object category provides objects with grasp labels during training.
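The category-to-grasp relationship discussed above can be illustrated with a minimal sketch. This is not the paper's implementation; the function name `learn_category_to_grasp` and the sample data are hypothetical. The idea: estimate the most frequent grasp type per object category from dual-labeled training pairs, so that even a single labeled object per category yields a usable mapping for unseen objects of a known category.

```python
from collections import Counter, defaultdict

def learn_category_to_grasp(samples):
    """Estimate the category-to-grasp mapping from (category, grasp) pairs.

    Returns a dict mapping each object category to its most frequent
    grasp type in the training pairs.
    """
    counts = defaultdict(Counter)
    for category, grasp in samples:
        counts[category][grasp] += 1
    return {c: counter.most_common(1)[0][0] for c, counter in counts.items()}

# Hypothetical dual-label training pairs (object category, grasp type)
train = [
    ("apple", "Spherical"), ("apple", "Spherical"),
    ("bottle", "Cylindrical"), ("bottle", "Cylindrical"),
    ("battery", "Tripod"),  # only one labeled object in this category
]

mapping = learn_category_to_grasp(train)
print(mapping["battery"])  # prints: Tripod
```

Note that when a category contains objects with conflicting grasp labels (as in the ‘Roll Paper Tube’ case above), such a frequency-based mapping becomes ambiguous, which is consistent with the wrong prediction observed there.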
6 Conclusion and Future Work
This paper proposes a novel dual-branch convolutional neural network (DcnnGrasp) that utilizes object category information to improve grasp pattern recognition. Meanwhile, a novel loss function, JCEAR, and a new training strategy are given to maximize the collaborative learning of object category classification and grasp pattern recognition. To train DcnnGrasp, two dual-label datasets are constructed based on existing household object datasets, namely the RGB-D Object dataset and the Hit-GPRec dataset. Experimental results demonstrated the excellent performance of the proposed method.
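The collaborative-learning idea summarized above can be sketched as two classification heads whose losses are combined, so that both tasks shape the shared features during training. This is a minimal stand-in, not the actual JCEAR loss: the additive combination and the `alpha` weight are assumptions for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, label):
    """Cross-entropy loss for a single example with integer label."""
    return -math.log(softmax(logits)[label])

def joint_loss(cat_logits, cat_label, grasp_logits, grasp_label, alpha=1.0):
    """Combined loss of the category branch and the grasp branch.

    A simplified additive stand-in for the paper's JCEAR loss:
    the grasp-classification term plus an alpha-weighted
    category-classification term.
    """
    return (cross_entropy(grasp_logits, grasp_label)
            + alpha * cross_entropy(cat_logits, cat_label))
```

With confident, correct logits on both branches the combined loss is near zero, while an error on either branch inflates it, which is what drives the two branches to learn collaboratively.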
As a newly developed technology, computer vision-based grasp pattern recognition still faces many challenges. For example, how can human grasp affordances be predicted from a single RGB image of a scene containing an arbitrary number of objects? It would also be interesting to consider personal habits in grasp pattern recognition.
-  Z. Wang, J. Merel, S. Reed, G. Wayne, N. de Freitas, and N. Heess, “Robust imitation of diverse behaviors,” arXiv preprint arXiv:1707.02747, 2017.
-  Y. Liu, A. Gupta, P. Abbeel, and S. Levine, “Imitation from observation: Learning to imitate behaviors from raw video via context translation,” In ICRA, pp. 1118–1125, 2018.
-  N. Di Palo and E. Johns, “Safari: Safe and active robot imitation learning with imagination,” arXiv preprint arXiv:2011.09586, 2020.
-  H. Jin, E. Dong, M. Xu, and J. Yang, “A smart and hybrid composite finger with biomimetic tapping motion for soft prosthetic hand,” JBE, vol. 17, pp. 484–500, 2020.
-  X. Wang, F. Geiger, V. Niculescu, M. Magno, and L. Benini, “Smarthand: Towards embedded smart hands for prosthetic and robotic applications,” arXiv preprint arXiv:2107.14598, 2021.
-  E. Corona, A. Pumarola, G. Alenya, F. Moreno-Noguer, and G. Rogez, “Ganhand: Predicting human grasp affordances in multi-object scenes,” In CVPR, pp. 5031–5041, 2020.
-  M. Zandigohar, M. Han, M. Sharif, S. Y. Gunay, M. P. Furmanek, M. Yarossi, P. Bonato, C. Onal, T. Padir, D. Erdogmus et al., “Multimodal fusion of emg and vision for human grasp intent inference in prosthetic hand control,” arXiv preprint arXiv:2104.03893, 2021.
-  G. Ghazaei, F. Tombari, N. Navab, and K. Nazarpour, “Grasp type estimation for myoelectric prostheses using point cloud feature learning,” arXiv preprint arXiv:1908.02564, 2019.
-  M. Zandigohar, D. Erdogmus, and G. Schirner, “Netcut: Real-time dnn inference using layer removal,” arXiv preprint arXiv:2101.05363, 2021.
-  M. Veres, M. Moussa, and G. W. Taylor, “Modeling grasp motor imagery through deep conditional generative models,” RAL, vol. 2, no. 2, pp. 757–764, 2017.
-  P. Weiner, J. Starke, F. Hundhausen, J. Beil, and T. Asfour, “The kit prosthetic hand: design and control,” In IROS, pp. 3328–3334, 2018.
-  S. Liu, M. Van, Z. Chen, J. Angeles, and C. Chen, “A novel prosthetic finger design with high load-carrying capacity,” Mechanism and Machine Theory, vol. 156, p. 104121, 2021.
-  K. H. Yusof, M. A. Zulkipli, A. S. Ahmad, M. F. Yusri, S. Al-Zubaidi, and M. Mohammed, “Design and development of prosthetic leg with a mechanical system,” In ICSGRC, pp. 217–221, 2021.
-  S. Kumra and C. Kanan, “Robotic grasp detection using deep convolutional neural networks,” In IROS, pp. 769–776, 2017.
-  G. Du, K. Wang, and S. Lian, “Vision-based robotic grasping from object localization pose estimation grasp detection to motion planning: A review,” arXiv preprint arXiv:1905.06658, 2019.
-  S. Caldera, A. Rassau, and D. Chai, “Review of deep learning methods in robotic grasp detection,” Multimodal Technologies and Interaction, vol. 2, no. 3, p. 57, 2018.
-  Y. Song, L. Gao, X. Li, and W. Shen, “A novel robotic grasp detection method based on region proposal networks,” Robotics and Computer-Integrated Manufacturing, vol. 65, p. 101963, 2020.
-  Z. Deng, G. Gao, S. Frintrop, F. Sun, C. Zhang, and J. Zhang, “Attention based visual analysis for fast grasp planning with a multi-fingered robotic hand,” Frontiers in neurorobotics, vol. 13, p. 60, 2019.
-  L. T. Taverne, M. Cognolato, T. Bützer, R. Gassert, and O. Hilliges, “Video-based prediction of hand-grasp preshaping with application to prosthesis control,” In ICRA, pp. 4975–4982, 2019.
-  F. Hundhausen, D. Megerle, and T. Asfour, “Resource-aware object classification and segmentation for semi-autonomous grasping with prosthetic hands,” In 2019 IEEE-RAS 19th International Conference on Humanoid Robots, pp. 215–221, 2019.
-  T. Kara and A. S. Masri, “Modeling and analysis of a visual feedback system to support efficient object grasping of an emg-controlled prosthetic hand,” CDBME, vol. 5, no. 1, pp. 207–210, 2019.
-  J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J. Cai et al., “Recent advances in convolutional neural networks,” Pattern Recognition, vol. 77, pp. 354–377, 2018.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” In NIPS, vol. 25, pp. 1097–1105, 2012.
-  G. Ghazaei, A. Alameer, P. Degenaar, G. Morgan, and K. Nazarpour, “Deep learning-based artificial vision for grasp classification in myoelectric hands,” JNE, vol. 14, no. 3, p. 036025, 2017.
-  D. P. Bertsekas, “Nonlinear programming,” JORS, vol. 48, no. 3, pp. 334–334, 1997.
-  S. Došen and D. B. Popović, “Transradial prosthesis: artificial vision for control of prehension,” Artificial organs, vol. 35, no. 1, pp. 37–48, 2011.
-  N. Wake, D. Saito, K. Sasabuchi, H. Koike, and K. Ikeuchi, “Object affordance as a guide for grasp-type recognition,” arXiv preprint arXiv:2103.00268, 2021.
-  M. Kopicki, R. Detry, M. Adjigble, R. Stolkin, A. Leonardis, and J. L. Wyatt, “One-shot learning and generation of dexterous grasps for novel objects,” IJRR, vol. 35, no. 8, pp. 959–976, 2016.
-  C. Shi, D. Yang, J. Zhao, and H. Liu, “Computer vision-based grasp pattern recognition with application to myoelectric control of dexterous hand prosthesis,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 28, no. 9, pp. 2090–2099, 2020.
-  M. Han, S. Y. Günay, İ. Yildiz, P. Bonato, C. D. Onal, T. Padir, G. Schirner, and D. Erdoğmuş, “From hand-perspective visual information to grasp type probabilities: deep learning via ranking labels,” In 12th ACM international conference on pervasive technologies related to assistive environments, pp. 256–263, 2019.
-  M. Zandigohar, M. Han, D. Erdoğmuş, and G. Schirner, “Towards creating a deployable grasp type probability estimator for a prosthetic hand,” Cyber Physical Systems. Model-Based Design, pp. 44–58, 2019.
-  K. Lai, L. Bo, X. Ren, and D. Fox, “A large-scale hierarchical multi-view rgb-d object dataset,” In ICRA, pp. 1817–1824, 2011.
-  J.-M. Geusebroek, G. J. Burghouts, and A. W. Smeulders, “The amsterdam library of object images,” IJCV, vol. 61, no. 1, pp. 103–112, 2005.
-  S. A. Nene, S. K. Nayar, H. Murase et al., “Columbia object image library (coil-100),” 1996.
-  M. Cognolato, A. Gijsberts, V. Gregori, G. Saetta, K. Giacomino, A.-G. M. Hager, A. Gigli, D. Faccio, C. Tiengo, F. Bassetto et al., “Gaze, visual, myoelectric, and inertial data of grasps for intelligent prosthetics,” Scientific data, vol. 7, no. 1, pp. 1–15, 2020.
-  G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” In CVPR, pp. 4700–4708, 2017.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” In CVPR, pp. 248–255, 2009.
-  M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” In ICML, pp. 6105–6114, 2019.
-  M. Tan and Q. V. Le, “Efficientnetv2: Smaller models and faster training,” arXiv preprint arXiv:2104.00298, 2021.
-  D. Jha, A. Yazidi, M. A. Riegler, D. Johansen, H. D. Johansen, and P. Halvorsen, “Lightlayers: Parameter efficient dense and convolutional layers for image classification,” In PDCAT, pp. 285–296, 2020.
-  K. Han, Y. Wang, Q. Tian, J. Guo, C. Xu, and C. Xu, “Ghostnet: More features from cheap operations,” In CVPR, pp. 1580–1589, 2020.
-  I. Radosavovic, R. P. Kosaraju, R. Girshick, K. He, and P. Dollár, “Designing network design spaces,” In CVPR, pp. 10428–10436, 2020.
-  M. R. Cutkosky et al., “On grasp choice, grasp models, and the design of hands for manufacturing tasks.” IEEE Transactions on robotics and automation, vol. 5, no. 3, pp. 269–279, 1989.
-  I. M. Bullock, J. Z. Zheng, S. De La Rosa, C. Guertler, and A. M. Dollar, “Grasp frequency and usage in daily household and machine shop tasks,” ToH, vol. 6, no. 3, pp. 296–308, 2013.
-  C. Robert, “Machine learning, a probabilistic perspective,” 2014.
-  F. Thabtah, M. Eljinini, M. Zamzeer, and W. Hadi, “Naïve bayesian based on chi square to categorize arabic data,” In 11th international business information management association conference (IBIMA) conference on innovation and knowledge management in twin track economies, Cairo, Egypt, pp. 4–6, 2009.
-  J. Opitz and S. Burst, “Macro f1 and macro f1,” arXiv preprint arXiv:1911.03347, 2019.
-  Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu, “Deep mutual learning,” In CVPR, pp. 4320–4328, 2018.