DcnnGrasp: Towards Accurate Grasp Pattern Recognition with Adaptive Regularizer Learning

The task of grasp pattern recognition aims to derive the applicable grasp types of an object from visual information. Current state-of-the-art methods ignore the category information of objects, which is crucial for grasp pattern recognition. This paper presents a novel dual-branch convolutional neural network (DcnnGrasp) to achieve joint learning of object category classification and grasp pattern recognition. DcnnGrasp takes object category classification as an auxiliary task to improve the effectiveness of grasp pattern recognition. Meanwhile, a new loss function called joint cross-entropy with an adaptive regularizer is derived by maximizing a posterior, which significantly improves the model performance. Besides, based on the new loss function, a training strategy is proposed to maximize the collaborative learning of the two tasks. Experiments were performed on five household object datasets, including the RGB-D Object dataset, the Hit-GPRec dataset, the Amsterdam Library of Object Images (ALOI), the Columbia Object Image Library (COIL-100), and MeganePro dataset 1. The experimental results demonstrate that the proposed method achieves competitive performance on grasp pattern recognition compared with several state-of-the-art methods. Specifically, our method even outperformed the second-best one by nearly 15% in global accuracy for the case of testing novel objects on the RGB-D Object dataset.


1 Introduction

The goal of computer vision-based grasp pattern recognition is to derive the grasp type from visual information of household objects, thus facilitating the grasping process in upper limb prosthesis control or gesture-based human-robot interaction. A typical grasping process consists of grasp pattern recognition based on a visual classifier and robot hand control, as shown in Fig. 1 (a). In the past years, many grasp pattern recognition methods have been proposed, which have had a great impact on several fields including robot imitation [1, 2, 3], human-robot interaction [4, 5], human grasp prediction [6, 7, 8, 9, 10], prosthetic design [11, 12, 13], robot control [14, 15, 16, 17, 18], and prosthetic control [19, 20, 21]. In recent years, the Convolutional Neural Network (CNN) [22, 23] has shown excellent performance on many challenging tasks, such as object detection and tracking, and it provides a new solution for grasp pattern recognition. As a pioneering CNN-based work, [24] proposed a CNN with a simple structure for grasp pattern recognition (called CnnGrasp in this paper). In [24], Ghazaei et al. labeled images of household objects with corresponding grasp types and converted grasp pattern recognition into an image classification problem. Inspired by this work, a series of CNN-based grasp pattern recognition methods emerged [20, 8].

Fig. 1: (a) Deep learning-based artificial vision system for robot hand control: the visual classifier predicts the grasp type for household objects in the input image and then controls the device to perform the corresponding grasping action, where PWN and PWP refer to palmar wrist neutral and palmar wrist pronated, respectively. (b) The unseen problem in grasp pattern recognition, where the objects in the testing images have never appeared in the training set. (c) Challenges in grasp pattern recognition (especially in the unseen problem), such as the size confusion problem and the structural confusion problem.

However, grasp pattern recognition has several significant problems. One is the presence of unseen objects during evaluation, in which a tested object and its views are wholly unseen by algorithms during training, such as the example given in Fig. 1 (b) [24]. In the bottom panel, the tested object (red apple) has never appeared in the training set. In this case, it is usually difficult for the traditional model to predict correctly.

Problems such as size confusion and structural confusion can greatly affect model performance on unseen objects. As shown in Fig. 1 (c), the battery and the glue stick are mis-recognized as 'PWN' instead of 'Tripod' because they appear to be large in the images. The structural confusion problem occurs when different objects are shot at certain angles that make their appearances in the images similar, such as the bottle and the box shown in Fig. 1 (c): the image of the bottle can be confused with a rectangle, and the potato may be confused with a plate. The grasp types may therefore be mis-recognized in these cases.

In addition, compared with object category labeling, grasp type labeling is much more complex and labor-intensive, considering the varying hand control requirements and gesture annotations of different applications and tasks. For example, some tasks require higher grasping precision, so the four gestures commonly used in daily life are no longer sufficient. Another example is prosthetic control, in which the influence of personal habits on grasping gestures should be considered. These factors all affect the gesture annotation results. Also, it is labor- and time-consuming to re-label gesture data to adapt to different applications and scenarios. Therefore, the following problem arises: how to train a relatively powerful grasp classifier with only a few grasp labels given in advance.

The simplest idea for handling this problem is to introduce object category information. For humans, even if part of the visual information of an object is missing or the object has never been seen before, one can still figure out the corresponding grasp type from experience with objects of the same category. This is because object category information usually contains the necessary cues, including the geometric structure, size, and even material of the objects. However, this information is ignored by existing grasp pattern recognition methods. Thus, it is natural to jointly learn object category classification and grasp pattern recognition to improve grasp pattern recognition.

Motivated by the above observation, this paper proposes a new deep learning architecture called Dual-branch Convolutional Neural Network (DcnnGrasp) for effective grasp pattern recognition. DcnnGrasp has two branches, namely the object category classification branch and the grasp pattern recognition branch. The former helps achieve better grasp pattern recognition precision. Since the two recognition tasks are strongly correlated, it is natural to jointly learn object category information and grasp type information. To the best of our knowledge, this is the first work that utilizes object category information for grasp pattern recognition.

The contributions of this work are threefold:

  • A novel dual-branch network structure (DcnnGrasp) is proposed for grasp pattern recognition (see Fig. 2), in which the features extracted from the object category branch and the grasp type branch are integrated to enhance the performance of grasp pattern recognition. To train DcnnGrasp, four dual-label datasets with grasp type and object category labels were established in this paper by manually labeling existing datasets. The experimental results show that the proposed model achieves significantly improved performance for both seen and unseen testing objects.

  • To maximize the collaborative learning of object category classification and grasp pattern recognition, this paper proposes a new loss function (JCEAR) from the Bayesian perspective. The regularization parameters in JCEAR are learned adaptively during the training process. Based on JCEAR, a training strategy is formulated that trains the parameters of the object category classification branch and the grasp pattern recognition branch of DcnnGrasp separately and iteratively. This lets the two branches teach each other interactively and improves the performance of both.

  • Experiments were conducted to examine the robustness of the proposed method (DcnnGrasp) in capturing 3D information of objects, its robustness on datasets with missing grasp labels, and its generalizability to unseen objects. All experimental results demonstrate the effectiveness of DcnnGrasp in all cases. Specifically, for the unseen problem, our method achieved a global accuracy of about 94% and 99% on the RGB-D Object dataset and the Hit-GPRec dataset, respectively. Meanwhile, DcnnGrasp outperformed the second-best method by nearly 15% in global accuracy on the RGB-D Object dataset, indicating that the proposed method handles the unseen problem well. In addition, when only one object with gesture labels per object category appears in the training process, the GA values of DcnnGrasp are still higher than 90% and 95% on the RGB-D Object dataset and the Hit-GPRec dataset, respectively, indicating that the proposed method alleviates, to some extent, the difficulty of gesture labeling in grasp pattern recognition.

2 Related Work

In [26], Došen et al. integrated a simple vision system into a prosthetic hand, in which a camera, distance sensing hardware, and software were used to recognize the object and generate a control signal for the prehension of the artificial hand. Since then, a series of computer vision-based grasp pattern recognition methods have been proposed [9, 27]. For example, Kopicki et al. [28] proposed a one-shot learning method of dexterous grasps for novel objects, in which point cloud data were collected by a depth (RGB-D) camera. In [28], the model of each grasp type was learned from a single kinesthetic demonstration, and these models were then used for grasp prediction and generation. This improved the robustness of the method to incomplete data in both the training and testing stages.

Fig. 2: The architecture of DcnnGrasp. The model consists of four parts: the object category feature extractor, the category classifier, the grasp type feature extractor, and the grasp classifier. The newly proposed loss function, Joint Cross-Entropy with Adaptive Regularizer (JCEAR), is used to train the network, where $W$ is a conditional probability matrix obtained by statistical methods on the training set and reflects the relationship between object categories and grasp types. To enable the model to achieve better performance, this paper proposes a training strategy based on JCEAR; backpropagation stages 1 and 2 come from this training strategy.

The popularity of mobile phones has made image data much more accessible. Ghazaei et al. [24] were the first to apply grey-scale images of household objects to the grasp pattern recognition problem, using a CNN (consisting of only two convolutional layers and a downsampling layer) to predict the grasp type from an image of an object. Furthermore, Deng et al. [18] developed an attention-based visual analysis framework to obtain grasp-relevant information in a scene; a deep convolutional neural network was used to detect the grasp type and locate the attention point on the object. This method sped up grasp planning with more stable configurations. To improve the effectiveness of grasp pattern recognition, Shi et al. [29] combined the 3D information and shape of the object by using the depth image and the grayscale image simultaneously. Corona et al. [6] proposed another approach to learn the 3D information of objects in an image effectively: they designed a multi-task GAN architecture called GanHand to estimate the 3D shape of the objects from a single RGB image and predict the best grasp type according to a taxonomy with 33 grasp types [6].

When the camera is embedded in the palm of a robotic hand (or prosthetic hand), the grasp types identified for the same object may differ depending on the approaching orientation of the hand [30, 31]. Therefore, in [30], instead of making absolute positive or negative predictions, Han et al. used the priority of grasp types as the prediction result and took ranked lists of grasps as a new label form. Based on this, they constructed a probabilistic classifier using the Plackett-Luce model. Furthermore, Zandigohar et al. [31] used probability lists over grasps as the prediction result. However, this method does not utilize category information, so it requires more training data to cover the feature space and achieve high performance.

3 The Proposed Learning Methods

This section presents the design of the proposed method (including DcnnGrasp and JCEAR) for jointly learning object category classification and grasp pattern recognition to improve grasp pattern recognition performance.

3.1 DcnnGrasp for Grasp Pattern Recognition

To learn object category classification and grasp pattern recognition jointly, this paper designs a dual-branch network structure (DcnnGrasp). As shown in Fig. 2, DcnnGrasp consists of two branches, both of which take a given household object image x as input. For convenience of presentation, this paper uses 'category' and 'grasp' to refer to the upper branch and the lower branch in Fig. 2, respectively, according to their functions. Each branch consists of one feature extractor and one classifier, and the features extracted by the object category feature extractor and the grasp feature extractor are integrated to enhance the performance of grasp pattern recognition. CnnGrasp is used as the grasp feature extractor. In the following, the object category feature extractor, the category classifier, and the grasp classifier are introduced in detail.

3.1.1 Object category feature extractor

The object category feature extractor consists of convolutional layers (DenseNet) and fully connected layers. Let $F$ denote the feature maps obtained by the convolutional layers, and define $f^{(0)} = \mathrm{vec}(F)$ as the vectorized $F$ for convenience. Let

$$ f^{(l)} = \sigma\!\left(W^{c}_{l} f^{(l-1)} + b^{c}_{l}\right), \quad l = 1, \dots, L_{c}, $$

be the output of the $l$-th fully connected layer with dimension $d_{l}$, where $\sigma(\cdot)$ is a nonlinear activation function. The object category feature is then obtained as

$$ I_{c} = f^{(L_{c})}. $$

Here, $\Theta_{cf}$ represents the learnable parameters in DenseNet and the parameters $\{W^{c}_{l}, b^{c}_{l}\}_{l=1}^{L_{c}}$.

3.1.2 Category classifier

The category classifier takes $I_{c}$ as input. Let $g^{(l)}$ be the output of the $l$-th layer in the category classifier ($l = 1, \dots, L_{cc}$), with $g^{(0)} = I_{c}$ and $g^{(l)} = \sigma\!\left(W^{cc}_{l} g^{(l-1)} + b^{cc}_{l}\right)$. The output of the category classifier, i.e., the predicted category probability vector, is obtained as

$$ \hat{y}^{c} = \mathrm{softmax}\!\left(g^{(L_{cc})}\right), $$

where $\Theta_{cc}$ stands for the parameters $\{W^{cc}_{l}, b^{cc}_{l}\}_{l=1}^{L_{cc}}$.

3.1.3 Grasp classifier

Assume that $I_{g}$ is the output of the grasp type feature extractor. The grasping information and the object category information are fused by concatenating $I_{g}$ and $I_{c}$, as shown in Fig. 2; the resulting feature is denoted as $I$.

The grasp classifier takes $I$ as its input. Define $h^{(0)} = I$ for convenience, and let

$$ h^{(l)} = \sigma\!\left(W^{g}_{l} h^{(l-1)} + b^{g}_{l}\right), \quad l = 1, \dots, L_{g}, $$

be the output of the $l$-th layer with dimension $d^{g}_{l}$, where $\sigma(\cdot)$ is a nonlinear activation function. The output of the grasp classifier is obtained as

$$ \hat{y}^{g} = \mathrm{softmax}\!\left(h^{(L_{g})}\right), $$

where $\Theta_{gf}$ and $\Theta_{gc}$ represent the learnable parameters in the grasp type feature extractor and the parameters $\{W^{g}_{l}, b^{g}_{l}\}_{l=1}^{L_{g}}$, respectively.
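For concreteness, the following is a minimal PyTorch sketch of the dual-branch structure described above. It assumes the layer sizes reported in Section 4.1.4 (a DenseNet121 backbone, a 256-128-64 category classifier, and a 128-64 grasp classifier); the internal sizes of the CnnGrasp extractor and the choice of which intermediate feature is concatenated are assumptions, not the authors' released implementation.

```python
# Minimal sketch of DcnnGrasp's dual-branch structure (not the official code).
import torch
import torch.nn as nn
import torchvision

class CnnGraspExtractor(nn.Module):
    """Grasp type feature extractor: two conv layers + a downsampling layer, per [24]."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),           # downsampling
        )
        self.fc = nn.Linear(32 * 4 * 4, out_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

class DcnnGrasp(nn.Module):
    def __init__(self, n_categories, n_grasps):
        super().__init__()
        backbone = torchvision.models.densenet121(weights="IMAGENET1K_V1")
        # Object category feature extractor (DenseNet121 -> 1024-d feature I_c).
        self.cat_extractor = nn.Sequential(
            backbone.features, nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Category classifier: three FC layers (256, 128, 64) + softmax output.
        self.cat_classifier = nn.Sequential(
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, n_categories))       # softmax applied inside the loss
        self.grasp_extractor = CnnGraspExtractor(out_dim=128)
        # Grasp classifier on the fused feature I = [I_g; I_c]: FC (128, 64) + softmax.
        self.grasp_classifier = nn.Sequential(
            nn.Linear(128 + 1024, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, n_grasps))           # softmax applied inside the loss

    def forward(self, x_cat, x_grasp):
        i_c = self.cat_extractor(x_cat)        # category feature I_c
        i_g = self.grasp_extractor(x_grasp)    # grasp feature I_g
        return self.cat_classifier(i_c), self.grasp_classifier(torch.cat([i_g, i_c], 1))
```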

3.2 Joint Cross-Entropy with Adaptive Regularizer

In this paper, a novel loss function is developed to help the model to further extract and utilize the relationship between object category classification and grasp pattern recognition, thus improving the performance of the model on grasp pattern recognition.

Let the one-hot vectors $y^{g} \in \{0,1\}^{m}$ and $y^{c} \in \{0,1\}^{n}$ be the true labels for grasp pattern recognition and object category classification, respectively, where $m$ and $n$ are the numbers of grasp types and object categories. Let $p^{g} = \hat{y}^{g}$ and $p^{c} = \hat{y}^{c}$ be the predicted probabilities over grasp types and object categories, respectively. Thus, $\sum_{j=1}^{m} p^{g}_{j} = 1$ and $\sum_{i=1}^{n} p^{c}_{i} = 1$.

To jointly learn multiple tasks, the simplest choice of loss function is the following joint cross-entropy (JCE):

$$ \mathcal{L}_{\mathrm{JCE}} = -\sum_{j=1}^{m} y^{g}_{j} \log p^{g}_{j} \; - \; \lambda \sum_{i=1}^{n} y^{c}_{i} \log p^{c}_{i}, \tag{1} $$

where $\lambda$ is a fixed weight balancing the two tasks.
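A minimal sketch of this weighted-sum baseline is given below, assuming the two classifier heads output logits and that `lam` stands for the fixed trade-off weight (an assumption; the paper's JCEAR replaces this fixed weighting):

```python
# Sketch of the JCE baseline: a weighted sum of the two tasks' cross-entropies.
import torch.nn.functional as F

def jce_loss(grasp_logits, cat_logits, grasp_labels, cat_labels, lam=1.0):
    ce_grasp = F.cross_entropy(grasp_logits, grasp_labels)  # grasp branch term
    ce_cat = F.cross_entropy(cat_logits, cat_labels)        # category branch term
    return ce_grasp + lam * ce_cat
```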

However, JCE is just the weighted sum of the cross-entropy functions of the two tasks, which loses the relationship between object category classification and grasp pattern recognition. To achieve the goal of jointly learning object category classification and grasp pattern recognition, this paper proposes Joint Cross-Entropy with Adaptive Regularizer (JCEAR) from the Bayesian perspective.

Since each of $y^{g}$ and $y^{c}$ follows a multinomial distribution, the following likelihood function can be taken:

(2)

Let $W$ be an $n \times m$ matrix in which each element $W_{ij}$ is the conditional probability of the $j$-th grasp type given the $i$-th object category. Then,

(3)

Therefore, by calculating the joint distribution of $y^{c}$ and $y^{g}$ over the training set, $W$ can be estimated as

$$ W_{ij} = \frac{N_{ij}}{\sum_{j'=1}^{m} N_{ij'}}, \tag{4} $$

where $N_{ij}$ is the number of images belonging to the $i$-th object category and the $j$-th grasp type.
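A sketch of this estimation step is shown below, assuming integer-coded dual labels; the small epsilon guarding against empty categories is an implementation detail, not from the paper.

```python
# Estimate W[i, j] = P(grasp j | category i) from co-occurrence counts N_ij.
import numpy as np

def estimate_w(cat_labels, grasp_labels, n_categories, n_grasps, eps=1e-12):
    counts = np.zeros((n_categories, n_grasps))
    for i, j in zip(cat_labels, grasp_labels):
        counts[i, j] += 1                                        # N_ij
    return counts / (counts.sum(axis=1, keepdims=True) + eps)   # row-normalize per category
```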

Assume a prior on the grasp prediction $p^{g}$ that is constructed from $W$:

(5)

where

(6)

and $W_{i\cdot}$ is the $i$-th row vector of $W$. Then, from the Bayesian perspective, the maximum a posteriori estimate can be obtained by

(7)

This leads to the following minimization objective, which is taken as the loss function (JCEAR) of the proposed model:

(8)

Meanwhile, the regularization parameter $\gamma$ in the proposed loss function JCEAR is updated adaptively during the training process according to the following rule: $\gamma$ is initialized as $\gamma_{0}$ and is updated by

(9)

where $t$ is the number of iterations (see Algorithm 1), and

(10)

in which $x^{(s)}_{b}$ is the $s$-th image in the $b$-th sample batch. Here,

(11)
Initialize: $\gamma \leftarrow \gamma_{0}$, $t \leftarrow 0$, the learnable parameters in the object category feature extractor by pretraining DenseNet on ImageNet [37], and the remaining parameters of the two branches randomly.
Compute $W$ by (4);
while not converged do
1. Fix the parameters of one branch, and update the parameters of the other branch by solving sub-problem
(12)
2. Update $\gamma$ by (9);
3. Swap the roles of the two branches, fix the branch just updated, and update the remaining branch by solving sub-problem
(13)
4. $t \leftarrow t + 1$;
5. Check the convergence conditions;
end while
Algorithm 1: Training strategy based on JCEAR

The whole training strategy based on JCEAR is presented in Algorithm 1. To guarantee that each module in DcnnGrasp focuses on its own task, the learnable parameters of the two branches are updated separately in Algorithm 1. (Different from deep mutual learning [48], the two branches are trained for their own tasks by a common loss function.)
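The following is a minimal sketch of this alternating scheme, assuming the DcnnGrasp sketch given earlier; `jcear_loss` and `update_gamma` are placeholders standing in for Eqs. (8) and (9), whose exact forms are not reproduced in this text, and the Adam settings follow Section 4.1.4.

```python
# Alternating two-stage update per Algorithm 1 (sketch, not the official code).
import torch

def train_dcnngrasp(model, loader, jcear_loss, update_gamma, epochs=50, gamma=1.0):
    cat_params = list(model.cat_extractor.parameters()) + list(model.cat_classifier.parameters())
    grasp_params = list(model.grasp_extractor.parameters()) + list(model.grasp_classifier.parameters())
    opt_cat = torch.optim.Adam(cat_params, lr=1e-3, betas=(0.9, 0.999))
    opt_grasp = torch.optim.Adam(grasp_params, lr=1e-3, betas=(0.9, 0.999))
    for t in range(epochs):
        for x_cat, x_grasp, y_cat, y_grasp in loader:
            # Stage 1: update one branch while the other branch stays fixed.
            cat_logits, grasp_logits = model(x_cat, x_grasp)
            loss = jcear_loss(grasp_logits, cat_logits, y_grasp, y_cat, gamma)
            opt_cat.zero_grad(); loss.backward(); opt_cat.step()
            # Adaptive regularizer update, cf. Eq. (9).
            gamma = update_gamma(gamma, t)
            # Stage 2: update the other branch with the first one fixed.
            cat_logits, grasp_logits = model(x_cat, x_grasp)
            loss = jcear_loss(grasp_logits, cat_logits, y_grasp, y_cat, gamma)
            opt_grasp.zero_grad(); loss.backward(); opt_grasp.step()
        # A convergence check on a validation set would go here.
    return model
```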

4 Experiments

4.1 Experimental Settings

4.1.1 Dataset

To evaluate the effectiveness of the methods, five datasets were used in the experiments: the RGB-D Object dataset [32], the Hit-GPRec dataset [29], the Amsterdam Library of Object Images (ALOI) [33], the Columbia Object Image Library (COIL-100) [34], and MeganePro dataset 1 [35].

Case | Objects with grasp labels (training + validation) | Objects without grasp labels (training + validation) | Testing objects
$m_i > n$ | $n$ | $m_i - n - 1$ | 1
$m_i \le n$ | $m_i$ | 0 | 0
TABLE I: Object category-based sampling method (OCS), where $m_i$ is the total number of objects in the $i$-th object category and $n$ is the expected number of objects with grasp labels.

4.1.2 Cross-validation

To verify the generalizability and robustness of grasp pattern recognition, two forms of dataset division were considered: within-whole-dataset cross-validation (WWC) and between-object cross-validation (BOC). For WWC, the whole dataset was sampled and divided into a training set, a validation set, and a testing set at a fixed ratio. The BOC division was used to verify the ability of the models on unseen objects (objects that do not appear in the training set or the validation set). In this case, an object and all of its views are either wholly seen or wholly unseen. A fixed proportion of the objects in the dataset were selected and divided into a training set and a validation set, and the remaining objects were taken as the testing set. For fair comparison, this study used standard ten-fold cross-validation, and the average results are reported.
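As an illustration, a minimal sketch of a BOC split is given below; it groups samples by object identity so that held-out objects are completely unseen. The 80/20 proportion is a placeholder, since the exact ratios are not reproduced in this text.

```python
# Between-object cross-validation (BOC) split: hold out whole objects, not images.
import random
from collections import defaultdict

def boc_split(samples, train_val_frac=0.8, seed=0):
    """samples: iterable of (image_path, object_id, category_label, grasp_label)."""
    by_object = defaultdict(list)
    for s in samples:
        by_object[s[1]].append(s)
    objects = sorted(by_object)
    random.Random(seed).shuffle(objects)
    n_keep = int(train_val_frac * len(objects))
    train_val = [s for o in objects[:n_keep] for s in by_object[o]]
    test = [s for o in objects[n_keep:] for s in by_object[o]]
    return train_val, test
```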

4.1.3 Object Category-based Sampling

To better investigate the relationship between object categories and grasp labels, we propose an object category-based sampling (OCS) method, which ensures that at least one object with grasp labels from each object category appears in the training process. Denote the total number of objects in the $i$-th object category and the expected number of randomly chosen objects with grasp labels in the training process by $m_i$ and $n$, respectively. As shown in Table I, if OCS is chosen, then for each object category with $m_i > n$, $n$ objects with grasp labels and $m_i - n - 1$ objects without grasp labels are used in the training process, and the remaining object is taken as testing data. When $m_i \le n$, all objects in that category and their grasp labels appear in the training process. All samples used in the training process were divided into training and validation sets at a ratio of 9:1, whether grasp labels were used or not. It is worth mentioning that, for OCS, the object category labels of all samples in both the training and validation sets were used, and one-hot vectors composed of zeros were used as the 'label' of samples with missing grasp labels during the training process. OCS can be used to test the prediction performance of the model on grasp labels of different objects within the same object category and to investigate the relationship between object categories and grasp labels. Each experiment with OCS was performed three times, in which the samples for training, validation, and testing were chosen randomly according to Table I, and the average results are reported. A sketch of this sampling scheme is given below.
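The following is a minimal sketch of OCS under the assumptions above; the grouping keys and the `None` placeholder for missing grasp labels are implementation choices, not part of the paper.

```python
# Object category-based sampling (OCS): per category, keep grasp labels for up
# to n objects, hold out one object for testing, and drop the grasp labels of
# the rest (their category labels are still used).
import random
from collections import defaultdict

def ocs_split(samples, n, seed=0):
    """samples: list of dicts with keys 'object_id', 'category', 'grasp', ..."""
    rng = random.Random(seed)
    objects_by_cat = defaultdict(set)
    for s in samples:
        objects_by_cat[s['category']].add(s['object_id'])
    labeled, test_objects = set(), set()
    for cat, objs in objects_by_cat.items():
        objs = sorted(objs)
        rng.shuffle(objs)
        if len(objs) <= n:
            labeled.update(objs)          # m_i <= n: all objects keep grasp labels
        else:
            labeled.update(objs[:n])      # n objects with grasp labels
            test_objects.add(objs[n])     # one held-out testing object
            # the remaining objects are used without grasp labels
    train_val, test = [], []
    for s in samples:
        if s['object_id'] in test_objects:
            test.append(s)
        else:
            grasp = s['grasp'] if s['object_id'] in labeled else None
            train_val.append({**s, 'grasp': grasp})
    return train_val, test
```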

4.1.4 Implementation Details

As shown in Fig. 2, RGB images of household objects were taken as the input of our network. The input images were resized to 224×224 pixels. The input images of the category branch were simply scaled (each pixel divided by 255), while those of the grasp branch were normalized per channel according to

$$ \tilde{x}_{c} = \frac{x_{c} - \mu_{c}}{\sigma_{c}}, \tag{14} $$

where $x_{c}$ represents the $c$-th channel of the image $x$ (of size $H \times W$),

$$ \mu_{c} = \frac{1}{HW} \sum_{u=1}^{H} \sum_{v=1}^{W} x_{c}(u, v), \tag{15} $$

and

$$ \sigma_{c} = \sqrt{\frac{1}{HW} \sum_{u=1}^{H} \sum_{v=1}^{W} \bigl(x_{c}(u, v) - \mu_{c}\bigr)^{2}}. \tag{16} $$
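A minimal preprocessing sketch under the reconstruction above (the exact normalization statistics used by the authors are an assumption here):

```python
# Prepare the two branch inputs: [0, 1] scaling for the category branch and
# per-channel standardization (Eqs. (14)-(16)) for the grasp branch.
import numpy as np
from PIL import Image

def preprocess(path):
    img = np.asarray(Image.open(path).convert("RGB").resize((224, 224)), dtype=np.float32)
    x_cat = img / 255.0                                  # category branch input
    mu = img.mean(axis=(0, 1), keepdims=True)            # per-channel mean, Eq. (15)
    sigma = img.std(axis=(0, 1), keepdims=True) + 1e-8   # per-channel std, Eq. (16)
    x_grasp = (img - mu) / sigma                         # Eq. (14)
    return x_cat, x_grasp
```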
Model Name | Description
CnnGrasp | consists of two convolutional layers and a downsampling layer
EfficientNet / EfficientNet_v2 | consider network depth, width, and resolution to balance accuracy and speed
LightLayers | resorts to matrix factorization to reduce the complexity of DNN models without much loss in accuracy
GhostNet | uses the Ghost module to upgrade existing convolutional neural networks
RegNet | a low-dimensional design space composed of simple and regular networks
TABLE II: Description of state-of-the-art methods.
Methods RGB-D Object dataset Hit-GPRec dataset ALOI COIL-100 MeganePro dataset 1
GA MF1 MRC GA MF1 MRC GA MF1 MRC GA MF1 MRC GA MF1 MRC
CnnGrasp 75.81 70.13 73.77 78.56 68.42 70.11 95.45 95.65 95.67 97.45 97.18 97.07 98.57 97.84 97.81
GhostNet 97.16 97.23 97.58 65.31 62.99 67.28 68.40 64.38 68.74 28.13 10.97 25.00 14.09 2.46 10.00
RepVGG 98.76 98.85 98.92 97.43 97.37 97.48 96.66 96.80 96.78 95.78 95.61 95.89 96.66 95.66 96.07
EfficientNet 98.20 98.17 98.13 79.74 77.75 78.18 88.69 89.06 89.48 39.76 29.09 37.89 63.93 60.21 62.22
EfficientNet_v2 99.73 99.74 99.73 86.25 84.53 84.32 97.86 97.87 97.89 93.09 92.55 92.85 97.57 96.76 96.93
RegNetX600 93.22 93.48 94.01 47.74 38.11 45.73 83.56 82.76 83.73 81.16 80.05 82.06 78.77 75.19 75.52
RegNetY600 97.06 97.21 97.12 52.56 47.17 53.25 91.23 91.57 91.16 81.85 79.87 80.57 86.72 82.41 81.42
LightLayers 85.88 85.61 85.70 78.33 76.70 79.01 55.79 49.20 54.80 73.36 71.01 73.68 86.72 82.41 81.42
DcnnGrasp 99.99 99.99 99.99 100.00 100.00 100.00 99.24 99.21 99.26 100.00 100.00 100.00 98.66 98.53 98.60
TABLE III: Comparison with other benchmark methods under WWC sampling.
Methods RGB-D Object dataset Hit-GPRec dataset ALOI
GA MF1 MRC GA MF1 MRC GA MF1 MRC
CnnGrasp 70.70 70.88 71.16 81.59 80.61 81.67 71.77 72.17 72.06
GhostNet 75.52 75.43 76.21 60.66 57.43 64.62 58.08 53.66 60.25
RepVGG 72.60 72.67 73.90 93.87 93.68 94.01 74.08 74.96 75.32
EfficientNet 79.07 78.67 80.07 73.56 69.75 70.13 72.38 72.42 74.17
EfficientNet_v2 76.84 75.65 76.59 78.27 76.46 75.19 72.49 73.07 75.66
RegNetX600 69.61 67.93 69.59 55.99 51.54 54.62 68.71 68.69 69.87
RegNetY600 69.30 70.17 71.80 62.40 57.22 57.96 70.20 70.22 72.13
LightLayers 64.02 63.99 67.32 66.12 61.32 65.18 40.19 33.73 42.61
DcnnGrasp 94.43 94.40 95.28 99.81 99.81 99.88 78.99 79.47 80.18
TABLE IV: Comparison with other benchmark methods under BOC sampling.

The network structure is as follows. The object category feature extractor adopts DenseNet121 [36], and the category classifier adopts three fully connected layers and a softmax layer. The output dimensions of the three fully connected layers are 256, 128, and 64, respectively. The softmax layer is essentially a fully connected layer whose activation function is softmax; its output dimension equals the number of object categories. The grasp type feature extractor adopts CnnGrasp. The grasp classifier is composed of two fully connected layers and a softmax layer. The output dimensions of the two fully connected layers are 128 and 64, and the output dimension of the softmax layer equals the number of grasp types.

For the optimization of the sub-problems (12) and (13), Adam was used as the optimizer, and its learning rate was initialized to 0.001. The exponential decay rates for the first and second moment estimates were set to 0.9 and 0.999, respectively.

4.2 Comparison with State-of-the-Art Methods

4.2.1 Evaluation Metrics and Compared Methods

This study used the global accuracy (GA), the macro recall (MRC) [46] and the macro F1 (MF1) [47] to evaluate the performance of the proposed method [45].
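For reference, a minimal sketch of these three metrics using scikit-learn's standard implementations:

```python
# Global accuracy, macro recall, and macro F1 for grasp predictions.
from sklearn.metrics import accuracy_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    return {
        "GA": accuracy_score(y_true, y_pred),                  # global accuracy
        "MRC": recall_score(y_true, y_pred, average="macro"),  # macro recall
        "MF1": f1_score(y_true, y_pred, average="macro"),      # macro F1
    }
```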

In this paper, our method was compared with the following state-of-the-art methods: CnnGrasp [24], EfficientNet [38], EfficientNet_v2 [39], LightLayers [40], GhostNet [41], and RegNet [42]. These methods are described in Table II; only CnnGrasp is designed specifically for grasp classification, while the remaining methods address the classical classification problem.

4.2.2 Comparison in WWC

To verify the robustness of the proposed method (DcnnGrasp) to different views of objects, our method was compared with all the above methods in WWC on all five datasets. The experimental results are presented in Table III. It can be seen from this table that DcnnGrasp achieved the best grasp classification performance among all compared methods. Specifically, DcnnGrasp obtained an accuracy of almost 100% on the RGB-D Object dataset, the Hit-GPRec dataset, and COIL-100. On the Hit-GPRec dataset and COIL-100, the proposed method even outperformed most of the competing methods by a large margin in accuracy.

The comparison of the grasp classification methods (CnnGrasp and DcnnGrasp) with the five classical classification methods indicated that both grasp classification methods achieved excellent performance on ALOI, COIL-100, and MeganePro dataset 1. On the RGB-D Object dataset and the Hit-GPRec dataset, DcnnGrasp still achieved the best prediction performance, but the performance of CnnGrasp was mediocre: the accuracy gap between CnnGrasp and DcnnGrasp exceeded 24% and 21% on the RGB-D Object dataset and the Hit-GPRec dataset, respectively. All experimental results in Table III illustrate the effectiveness of the proposed method in WWC.

4.2.3 Comparison in BOC

Fig. 3: GA under different proportions on (a) the RGB-D Object dataset, (b) the Hit-GPRec dataset, and (c) the ALOI dataset, where the proportion means the ratio of data items with grasp labels to the data items in the entire dataset.

To demonstrate the generalizability of the proposed method to unseen objects, our method was compared with all the above methods in BOC on three datasets: the RGB-D Object dataset, the Hit-GPRec dataset, and ALOI (for BOC, only datasets containing more than 100 objects were used). All results are presented in Table IV. It can be seen from this table that DcnnGrasp still achieved the best prediction performance under BOC.

In addition, the comparison of the results in Table III and Table IV indicates that BOC (the unseen problem) is much more challenging than WWC. The performance of all compared methods decreased dramatically in most cases. On the RGB-D Object dataset, the gap between the WWC and BOC results exceeded 20% for some methods (such as GhostNet, RepVGG, EfficientNet_v2, RegNetX600, and RegNetY600). However, even under BOC, the performance of DcnnGrasp on all three evaluation metrics was close to 100% on the Hit-GPRec dataset. On the RGB-D Object dataset, it achieved an accuracy of more than 94%, about 15% higher than that of all competing methods.

4.3 Robustness on the dataset with missing grasp labels

4.3.1 Comparison in BOC

In this part, the robustness of DcnnGrasp was investigated on datasets with missing grasp labels, in which only a proportion of the grasp labels was retained. One-hot vectors composed of zeros were used for the missing grasp labels during the training process. DcnnGrasp was compared with the grasp classification method CnnGrasp on the RGB-D Object dataset, the Hit-GPRec dataset, and the ALOI dataset under BOC sampling. All comparison results are illustrated in Fig. 3.

It can be seen from Fig. 3 that DcnnGrasp significantly outperformed CnnGrasp in all cases. In addition, even for a small proportion of grasp labels, DcnnGrasp maintained excellent performance on the RGB-D Object dataset and the Hit-GPRec dataset. For the smallest proportion considered, the GA values of DcnnGrasp were still higher than 85% and 95% on the RGB-D Object dataset and the Hit-GPRec dataset, respectively. This shows the robust grasp prediction ability of DcnnGrasp when the datasets have sufficient objects per object category, which guarantees that DcnnGrasp can learn the relationship between grasp types and object categories well.

4.3.2 Comparison in OCS

Datasets RGB-D Object dataset Hit-GPRec dataset
GA MF1 MRC GA MF1 MRC
n=3 DcnnGrasp 95.53 95.75 96.11 98.40 97.27 96.98
CnnGrasp 68.13 66.95 66.71 78.76 75.44 74.34
n=2 DcnnGrasp 91.46 92.33 91.64 97.71 96.94 97.22
CnnGrasp 63.31 61.72 60.97 76.79 73.21 70.84
n=1 DcnnGrasp 93.24 93.99 93.70 96.98 97.00 96.72
CnnGrasp 58.75 56.37 55.43 68.20 59.96 57.49
TABLE V: Experimental results under the proposed sampling method (OCS). Note that there are n objects with grasp labels in each object category in the training and validation sets.

Since prosthetic or robotic hands are generally initialized before leaving the factory, where a well-designed dataset can be used during the training process, we used OCS to simulate such scenarios in this experiment, with the expected number of objects with grasp labels in each object category set to n = 1, 2, and 3. The results on the RGB-D Object dataset and the Hit-GPRec dataset are presented in Table V. These two datasets were chosen because they have more objects per object category than the other datasets, such as ALOI.

Component | v1 | v2 | v3
Dual-branch network | ✓ | ✓ | ✓
Using object category labels | | ✓ | ✓
Training strategy based on JCEAR | | | ✓
RGB-D Object dataset | 72.87 | 43.25/76.38 | 91.50/94.43
Hit-GPRec dataset | 82.96 | 39.10/85.43 | 98.31/99.81
ALOI dataset | 63.79 | 15.72/69.92 | 41.27/78.99
TABLE VI: Ablation study of our method. Entries of the form a/b report the GA of the object category classification task (a) and of the grasp pattern recognition task (b), respectively.

From Table V, we can see that DcnnGrasp maintains much more stable performance than CnnGrasp as n decreases. Even when only one object with grasp labels per object category appears in the training process, i.e., n = 1, the GA values of DcnnGrasp are still higher than 90% and 95% on the RGB-D Object dataset and the Hit-GPRec dataset, respectively, while the performance of CnnGrasp drops sharply as n decreases to 1. This demonstrates the strong robustness of DcnnGrasp to different objects within the same category and to datasets with missing grasp labels.

4.4 Ablation Studies

4.4.1 Effects of Different GTFEs

Fig. 4: Comparison of the effectiveness of four different grasp type feature extractors on (a) the RGB-D Object dataset, (b) the Hit-GPRec dataset, and (c) the ALOI dataset under BOC sampling. 'Original' represents the four models themselves, namely CnnGrasp, EfficientNet, EfficientNet_v2, and LightLayers. 'Ours' represents DcnnGrasp using each of the four models as its grasp type feature extractor.

Ablation studies were performed from two aspects: (1) grasp type feature extractor (GTFE), and (2) component of the proposed method. The impact of each aspect on the model performance was analyzed in turn.

This study used EfficientNet (B0), EfficientNet_v2 (B0), LightLayers, and CnnGrasp as the grasp type feature extractor to investigate the impact of different extractors on the network performance. (EfficientNet (B0), EfficientNet_v2 (B0), and LightLayers were chosen because of their efficiency and good performance in the comparison experiments.) The experimental results are shown in Fig. 4. It can be seen from this figure that the variant taking CnnGrasp as the grasp type feature extractor achieved the best results, which shows that the choice of grasp type feature extractor is not tied to the depth of the network (CnnGrasp has only two convolutional layers). Meanwhile, the performance of all four models, CnnGrasp, EfficientNet (B0), EfficientNet_v2 (B0), and LightLayers, was significantly enhanced by the proposed method in most cases. Specifically, on the RGB-D Object dataset and the Hit-GPRec dataset, the GA values of all variants taking the four traditional models as the grasp type feature extractor reached more than 90%, which is about 20% higher than those of the four models alone. As discussed later, this performance advantage comes not only from the dual-branch network but also from the training strategy based on JCEAR, which significantly enhances the joint learning of object category classification and grasp pattern recognition.

4.4.2 Effects of Different Strategies

To validate the effectiveness of the proposed strategies to improve the performance of grasp pattern recognition (including the dual-branch network and the introduction of object category information and training strategy based on JCEAR), this paper compared the following variants obtained by combining the strategies gradually:

  • v1: The dual-branch network is used, and the cross-entropy is taken as the loss function.

  • v2: The dual-branch network is used, and the joint cross-entropy is taken as the loss function, in which object category labels are used.

  • v3: Both the dual-branch network and the training strategy based on JCEAR are used, in which object category labels are used.

All comparison results are presented in Table VI. It can be seen that each proposed strategy contributes a significant increase in GA, demonstrating the effectiveness of the proposed strategies. Meanwhile, the comparison of v2 and v3 shows significant improvements attributable to the proposed training strategy based on JCEAR: on the RGB-D Object dataset and the Hit-GPRec dataset, the increase in GA is at least 13.5%.

Fig. 5: Visualized classification results under (a) WWC sampling and (b) BOC sampling. The blue label represents the object category label. A green label indicates a correct classification, and a red label indicates a wrong classification.

5 Discussion

In the above experiments, DcnnGrasp achieved the best performance in most cases. To analyze the performance of DcnnGrasp in grasp pattern recognition in depth, based on the observations from the experiments and the visualized results of DcnnGrasp and CnnGrasp in Fig. 5, this paper further discusses the following four aspects: (1) robustness to the 3D information of the objects under WWC, (2) sensitivity to shadows and confusing backgrounds, (3) generalizability to unseen objects, and (4) robustness on datasets with missing grasp labels.

5.1 Robustness in 3D information of the objects for WWC

From the comparison results under WWC, it can be seen that DcnnGrasp obtained the best results on all datasets, with all evaluation metric values above 98%, indicating strong robustness of DcnnGrasp to different views of the objects. Meanwhile, the visualized results in Fig. 5 (a) show that, compared with DcnnGrasp, CnnGrasp was more easily confused by the 3D information of the objects. Specifically, in the prediction for 'Lemon', CnnGrasp produced the wrong grasping gesture 'Tripod', possibly because the 3D shape of 'Lemon' was mistakenly recognized as flat. In addition, the cases of 'Food Box' and 'Box' show that CnnGrasp was sensitive to the deformation caused by photographing the object from different angles. Even in this situation, DcnnGrasp still worked well. All these observations show the strong robustness of DcnnGrasp in capturing 3D information of the objects, which is attributed to the introduction of object category information.

5.2 Sensitivity to the shadow and confusing background

In grasp pattern recognition, shadows and confusing backgrounds often appear in household object images, which affects the performance of grasp classification. For example, in the ALOI dataset, the example images in Fig. 5 show that the target object is often hard to distinguish from the background; for 'Bottle', 'Ashtray', and 'Bowl', it is difficult to recognize the target object even for humans. This is one of the reasons why most methods failed on the ALOI dataset, particularly under BOC sampling. Another example is the Hit-GPRec dataset, which is contaminated by shadows. As shown in Tables III and IV, on this dataset DcnnGrasp achieved at least 99% on all evaluation metrics in all cases. This good performance is attributed to the proposed training strategy based on JCEAR, as shown in the ablation study in Table VI.

5.3 Generalizability in unseen objects

According to the comparison results in Tables III and IV, most methods performed worse under BOC than under WWC, indicating the considerable challenge of the unseen problem. One reason may be the datasets themselves. For example, in the ALOI dataset there are only a few individuals per category, and some object categories contain only a single object, which makes it difficult to learn the relationship between object categories and grasping gestures. A similar situation can be observed on other datasets. For example, in the Hit-GPRec dataset there are only two objects in the category 'Roll Paper Tube', and the two objects are marked with different grasp types (one 'Tripod' and the other 'Cylindrical'), which leads to a wrong prediction by DcnnGrasp for an example of 'Roll Paper Tube', as shown in Fig. 5 (b).

Another reason for wrong grasping gesture predictions may be that, as shown in Fig. 5 (b), the loss of 3D information is more serious under BOC than under WWC. For example, in the predictions for 'Can' and 'Pitcher', CnnGrasp mistakenly recognized these two objects as small items and thus predicted the wrong grasping gesture. This is common in traditional grasp classification methods because size information is lost in RGB images. In addition, the cases of 'Block' and 'Food Box' show that CnnGrasp is sensitive to the deformation caused by photographing the object from different angles.

Although the unseen problem is a challenge in grasp pattern recognition, DcnnGrasp still obtained excellent results (higher than 94%) on the RGB-D Object dataset and the Hit-GPRec dataset. Also, it significantly outperformed the state-of-the-art methods on all datasets in all evaluation metrics. These results verify the strong generalizability of DcnnGrasp to unseen objects, which is attributed to the proposed training strategy based on JCEAR.

5.4 Robustness on the dataset with missing grasp labels

In the experiments in Section 4.3, the robustness of DcnnGrasp was investigated on datasets with missing grasp labels. The experimental results indicated the effectiveness of our method even when the proportion of remaining grasp labels is small, especially on the RGB-D Object dataset and the Hit-GPRec dataset. The comparison under OCS showed the strong robustness of DcnnGrasp to missing grasp labels compared with CnnGrasp: on the Hit-GPRec dataset, the differences caused by different n were less than 2% for DcnnGrasp, but reached 10% to 15% for CnnGrasp across the evaluation metrics. Even when only one object with gesture labels per object category appeared in the training process, the GA values of DcnnGrasp were higher than 90% and 95% on the RGB-D Object dataset and the Hit-GPRec dataset, respectively. This is because DcnnGrasp can learn and utilize the relationship between object categories and grasping gestures well when sufficient objects with grasp labels from each object category appear in the training process.

6 Conclusion and Future Work

This paper proposes a novel dual-branch convolutional neural network (DcnnGrasp) that utilizes object category information to improve grasp pattern recognition. Meanwhile, a novel loss function (JCEAR) and a new training strategy are given to maximize the collaborative learning of object category classification and grasp pattern recognition. To train DcnnGrasp, dual-label datasets were constructed based on existing household object datasets including the RGB-D Object dataset and the Hit-GPRec dataset. Experimental results demonstrated the excellent performance of the proposed method.

As a newly developed technology, computer vision-based grasp pattern recognition still faces many challenges. For example, how can human grasp affordances be predicted from a single RGB image of a scene with an arbitrary number of objects? It is also interesting to consider personal habits in grasp pattern recognition.

References

  • [1] Z. Wang, J. Merel, S. Reed, G. Wayne, N. de Freitas, and N. Heess, “Robust imitation of diverse behaviors,” arXiv preprint arXiv:1707.02747, 2017.
  • [2] Y. Liu, A. Gupta, P. Abbeel, and S. Levine, “Imitation from observation: Learning to imitate behaviors from raw video via context translation,” In ICRA, pp. 1118–1125, 2018.
  • [3] N. Di Palo and E. Johns, “Safari: Safe and active robot imitation learning with imagination,” arXiv preprint arXiv:2011.09586, 2020.
  • [4] H. Jin, E. Dong, M. Xu, and J. Yang, “A smart and hybrid composite finger with biomimetic tapping motion for soft prosthetic hand,” JBE, vol. 17, pp. 484–500, 2020.
  • [5] X. Wang, F. Geiger, V. Niculescu, M. Magno, and L. Benini, “Smarthand: Towards embedded smart hands for prosthetic and robotic applications,” arXiv preprint arXiv:2107.14598, 2021.
  • [6] E. Corona, A. Pumarola, G. Alenya, F. Moreno-Noguer, and G. Rogez, “Ganhand: Predicting human grasp affordances in multi-object scenes,” In CVPR, pp. 5031–5041, 2020.
  • [7] M. Zandigohar, M. Han, M. Sharif, S. Y. Gunay, M. P. Furmanek, M. Yarossi, P. Bonato, C. Onal, T. Padir, D. Erdogmus et al., “Multimodal fusion of emg and vision for human grasp intent inference in prosthetic hand control,” arXiv preprint arXiv:2104.03893, 2021.
  • [8] G. Ghazaei, F. Tombari, N. Navab, and K. Nazarpour, “Grasp type estimation for myoelectric prostheses using point cloud feature learning,” arXiv preprint arXiv:1908.02564, 2019.
  • [9] M. Zandigohar, D. Erdogmus, and G. Schirner, “Netcut: Real-time dnn inference using layer removal,” arXiv preprint arXiv:2101.05363, 2021.
  • [10] M. Veres, M. Moussa, and G. W. Taylor, “Modeling grasp motor imagery through deep conditional generative models,” RAL, vol. 2, no. 2, pp. 757–764, 2017.
  • [11] P. Weiner, J. Starke, F. Hundhausen, J. Beil, and T. Asfour, “The kit prosthetic hand: design and control,” In IROS, pp. 3328–3334, 2018.
  • [12] S. Liu, M. Van, Z. Chen, J. Angeles, and C. Chen, “A novel prosthetic finger design with high load-carrying capacity,” Mechanism and Machine Theory, vol. 156, p. 104121, 2021.
  • [13] K. H. Yusof, M. A. Zulkipli, A. S. Ahmad, M. F. Yusri, S. Al-Zubaidi, and M. Mohammed, “Design and development of prosthetic leg with a mechanical system,” In ICSGRC, pp. 217–221, 2021.
  • [14] S. Kumra and C. Kanan, “Robotic grasp detection using deep convolutional neural networks,” In IROS, pp. 769–776, 2017.
  • [15] G. Du, K. Wang, and S. Lian, “Vision-based robotic grasping from object localization pose estimation grasp detection to motion planning: A review,” arXiv preprint arXiv:1905.06658, 2019.
  • [16] S. Caldera, A. Rassau, and D. Chai, “Review of deep learning methods in robotic grasp detection,” Multimodal Technologies and Interaction, vol. 2, no. 3, p. 57, 2018.
  • [17] Y. Song, L. Gao, X. Li, and W. Shen, “A novel robotic grasp detection method based on region proposal networks,” Robotics and Computer-Integrated Manufacturing, vol. 65, p. 101963, 2020.
  • [18] Z. Deng, G. Gao, S. Frintrop, F. Sun, C. Zhang, and J. Zhang, “Attention based visual analysis for fast grasp planning with a multi-fingered robotic hand,” Frontiers in neurorobotics, vol. 13, p. 60, 2019.
  • [19] L. T. Taverne, M. Cognolato, T. Bützer, R. Gassert, and O. Hilliges, “Video-based prediction of hand-grasp preshaping with application to prosthesis control,” In ICRA, pp. 4975–4982, 2019.
  • [20] F. Hundhausen, D. Megerle, and T. Asfour, “Resource-aware object classification and segmentation for semi-autonomous grasping with prosthetic hands,” In 2019 IEEE-RAS 19th International Conference on Humanoid Robots, pp. 215–221, 2019.
  • [21] T. Kara and A. S. Masri, “Modeling and analysis of a visual feedback system to support efficient object grasping of an emg-controlled prosthetic hand,” CDBME, vol. 5, no. 1, pp. 207–210, 2019.
  • [22] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J. Cai et al., “Recent advances in convolutional neural networks,” Pattern Recognition, vol. 77, pp. 354–377, 2018.
  • [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” In NIPS, vol. 25, pp. 1097–1105, 2012.
  • [24] G. Ghazaei, A. Alameer, P. Degenaar, G. Morgan, and K. Nazarpour, “Deep learning-based artificial vision for grasp classification in myoelectric hands,”JNE, vol. 14, no. 3, p. 036025, 2017.
  • [25] D. P. Bertsekas, “Nonlinear programming,” JORS, vol. 48, no. 3, pp. 334–334, 1997.
  • [26] S. Došen and D. B. Popović, “Transradial prosthesis: artificial vision for control of prehension,” Artificial organs, vol. 35, no. 1, pp. 37–48, 2011.
  • [27] N. Wake, D. Saito, K. Sasabuchi, H. Koike, and K. Ikeuchi, “Object affordance as a guide for grasp-type recognition,” arXiv preprint arXiv:2103.00268, 2021.
  • [28] M. Kopicki, R. Detry, M. Adjigble, R. Stolkin, A. Leonardis, and J. L. Wyatt, “One-shot learning and generation of dexterous grasps for novel objects,” IJRR, vol. 35, no. 8, pp. 959–976, 2016.
  • [29] C. Shi, D. Yang, J. Zhao, and H. Liu, “Computer vision-based grasp pattern recognition with application to myoelectric control of dexterous hand prosthesis,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 28, no. 9, pp. 2090–2099, 2020.
  • [30] M. Han, S. Y. Günay, İ. Yildiz, P. Bonato, C. D. Onal, T. Padir, G. Schirner, and D. Erdoğmuş, “From hand-perspective visual information to grasp type probabilities: deep learning via ranking labels,” In 12th ACM international conference on pervasive technologies related to assistive environments, pp. 256–263, 2019.
  • [31] M. Zandigohar, M. Han, D. Erdoğmuş, and G. Schirner, “Towards creating a deployable grasp type probability estimator for a prosthetic hand,” Cyber Physical Systems. Model-Based Design, pp. 44–58, 2019.
  • [32] K. Lai, L. Bo, X. Ren, and D. Fox, “A large-scale hierarchical multi-view rgb-d object dataset,” In ICRA, pp. 1817–1824, 2011.
  • [33] J.-M. Geusebroek, G. J. Burghouts, and A. W. Smeulders, “The amsterdam library of object images,” IJCV, vol. 61, no. 1, pp. 103–112, 2005.
  • [34] S. A. Nene, S. K. Nayar, H. Murase et al., “Columbia object image library (coil-100),” 1996.
  • [35] M. Cognolato, A. Gijsberts, V. Gregori, G. Saetta, K. Giacomino, A.-G. M. Hager, A. Gigli, D. Faccio, C. Tiengo, F. Bassetto et al., “Gaze, visual, myoelectric, and inertial data of grasps for intelligent prosthetics,” Scientific data, vol. 7, no. 1, pp. 1–15, 2020.
  • [36] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” In CVPR, pp. 4700–4708, 2017.
  • [37] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” In CVPR, pp. 248–255, 2009.
  • [38] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” In ICML, pp. 6105–6114, 2019.
  • [39] M. Tan and Q. V. Le, “Efficientnetv2: Smaller models and faster training,” arXiv preprint arXiv:2104.00298, 2021.
  • [40] D. Jha, A. Yazidi, M. A. Riegler, D. Johansen, H. D. Johansen, and P. Halvorsen, “Lightlayers: Parameter efficient dense and convolutional layers for image classification,” In PDCAT, pp. 285–296, 2020.
  • [41] K. Han, Y. Wang, Q. Tian, J. Guo, C. Xu, and C. Xu, “Ghostnet: More features from cheap operations,” In CVPR, pp. 1580–1589, 2020.
  • [42] I. Radosavovic, R. P. Kosaraju, R. Girshick, K. He, and P. Dollár, “Designing network design spaces,” In CVPR, pp. 10 428–10 436, 2020.
  • [43] M. R. Cutkosky et al., “On grasp choice, grasp models, and the design of hands for manufacturing tasks.” IEEE Transactions on robotics and automation, vol. 5, no. 3, pp. 269–279, 1989.
  • [44] I. M. Bullock, J. Z. Zheng, S. De La Rosa, C. Guertler, and A. M. Dollar, “Grasp frequency and usage in daily household and machine shop tasks,” ToH, vol. 6, no. 3, pp. 296–308, 2013.
  • [45] C. Robert, “Machine learning, a probabilistic perspective,” 2014.
  • [46] F. Thabtah, M. Eljinini, M. Zamzeer, and W. Hadi, “Naïve bayesian based on chi square to categorize arabic data,” In 11th international business information management association conference (IBIMA) conference on innovation and knowledge management in twin track economies, Cairo, Egypt, pp. 4–6, 2009.
  • [47] J. Opitz and S. Burst, “Macro f1 and macro f1,” arXiv preprint arXiv:1911.03347, 2019.
  • [48] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu, “Deep mutual learning,” In CVPR, pp. 4320–4328, 2018.