Object 6D Pose Estimation with Non-local Attention

02/20/2020 · Jianhan Mei, et al.

In this paper, we address the challenging task of estimating the 6D object pose from a single RGB image. Motivated by deep learning based object detection methods, we propose a concise and efficient network that integrates 6D object pose parameter estimation into the object detection framework. Furthermore, to make the estimation more robust to occlusion, a non-local self-attention module is introduced. The experimental results show that the proposed method achieves state-of-the-art performance on the YCB-Video and Linemod datasets.


1 Introduction

The estimation of object instance 6D pose has been a fundamental component in many application fields, e.g. robotic manipulation and autonomous driving. It is challenging to estimate 6D object instance pose from 2D images since object information is lost during the projection from 3D to 2D.

Typically, the task of 6D object pose estimation is separated into two stages: 1) object instance detection and 2) parameter estimation. Driven by the great success of deep learning [32, 11, 10, 23], object detection has been well studied in recent years, and many deep learning based methods, e.g. [14, 29, 21, 27, 28], have been proposed that achieve excellent performance in many scenarios. All of these methods crop the regions of interest from the feature maps of the convolution layers and normalize them to a fixed size for classification and regression. The classifier and the regressor predict the object information from the local part of the feature maps, and the global information can be recovered by an inverse normalization. However, since a normalization in 2D space is not equivalent for 3D targets, the regressor suffers from a loss of global information when the 6D pose parameters are directly integrated into the detection framework.

For the 6D parameter regression, traditional methods focus on recovering the pose by matching key-point features between 3D models and images [6, 22, 30]. Such methods suffer from the difficulty of key-point extraction and description. Existing RGB-D based methods [17, 3, 2, 31, 18] improve the pose parameter regression significantly by using additional depth information. However, depth cameras have highly constrained configurations and are unavailable in some scenarios (e.g., outdoor scenes). Hence, in this paper we address the problem of 6D parameter regression from RGB images, which are much easier to obtain. Recently, PoseCNN [38] has shown that 6D object pose information can be learned directly from the 2D image by utilizing the powerful learning capability of deep networks and the known camera intrinsics. However, such estimation networks still face problems of parameter normalization, decoupling, and prediction precision.

Figure 1: An overview of the pipeline of the 6D object pose estimation task.

To build up a more efficient and effective network, we integrate 6D object pose parameter estimation into the object detection framework. Assuming the 6D object instance pose is defined with respect to a reference 3D model, we use the 3D bounding box of the model to map each pose to a unique set of eight 2D box corner points. The 6D pose parameters are separated into rotation and translation, which are normalized according to the Region of Interest (RoI) respectively. For better extraction of the object features, a non-local self-attention mechanism is introduced. By weighting the original features using non-local information, the final output features are more robust in case of occlusion. To automatically find the multi-task trade-off between the rotation and translation parameters, the final loss is computed on the transformed 2D coordinates, as shown in Fig. 1. For better illustration and fair comparison, we use Faster R-CNN (Region-based Convolutional Network) [29] as the backbone. Note that the proposed integration method is not limited to specific frameworks.
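
As a concrete illustration of the mapping in Fig. 1, the following sketch projects the eight corners of a reference 3D bounding box under a candidate pose (R, t) into the image using the camera intrinsics K. The function name and the box parameterization are illustrative, not taken from the paper.

```python
import numpy as np

def project_box_corners(R, t, K, dims):
    """Project the 8 corners of an object-centred 3D bounding box to 2D.

    R: (3, 3) rotation, t: (3,) translation, K: (3, 3) camera intrinsics,
    dims: (dx, dy, dz) size of the reference 3D bounding box.
    Returns an (8, 2) array of pixel coordinates.
    """
    dx, dy, dz = dims
    # The 8 corners of an axis-aligned box centred at the object origin.
    corners = np.array([[sx * dx / 2, sy * dy / 2, sz * dz / 2]
                        for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    cam = corners @ R.T + t          # move the corners into the camera frame
    uvw = cam @ K.T                  # pinhole projection
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide -> pixel coordinates
```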

2 Related Work

The acquisition of object instance information in an image is typically based on detection approaches. Before the advent of deep network based detection, the Deformable Part Model (DPM) [12] and Selective Search [33] were the most powerful detectors. In recent years, deep learning has achieved great success [9, 34, 20, 35, 8], and deep learning based detection methods have made significant improvements. Starting from [14, 29], two-stage frameworks have been proposed and improved. Based on [16], RoI pooling is introduced in [29] and used in most recent two-stage detection frameworks. Before feeding the convolutional features to the classifier, the features are cropped and resized to a fixed shape, which causes information loss. Later, single-shot detectors were proposed [21, 27, 28]. Multi-scale bounding boxes attached to the convolutional feature maps are used instead of the proposals of the two-stage frameworks. The single-shot methods are typically faster than the two-stage frameworks, but they still suffer from the input resizing problem.

Understanding 3D scenes from 2D images has been studied for a long time. Especially for instance-level 3D parameter estimation, many research efforts have been made [19, 24, 25, 37, 1, 40, 39, 13]. Traditional pose estimation methods are roughly classified into template-based methods and feature-based methods. With the prevalence of deep learning, deep neural networks have been applied to the 6D object pose estimation task. Based on an object detection framework, either key points or the pose parameters are regressed by the network. However, deep learning based regression methods suffer from object occlusion. In [36], the non-local neural network is introduced. The non-local block is used as self-attention, which enforces the network to use global information. In this work, we utilize the self-attention property of the non-local neural network to obtain better object features.

3 Method

The task of recovering the 6D pose parameters of all the object instances in a single RGB image consists of two parts: object instance detection and pose parameter regression. Based on one of the most popular two-stage object detection frameworks, Faster R-CNN, we integrate the 6D pose parameter regression into the deep network together with the object instance classification and localization. Further, a non-local neural block is introduced as self-attention, which makes the features concentrate on the object parts and become more robust to occlusion.

3.1 Virtual RoI Camera Transform

In [19], the allocentric and egocentric descriptions of the global image and the proposal are discussed. [19] uses the allocentric representation for learning parameters from the RoI features. The pose parameters of each object instance are re-defined by a canonical object center and a 2D amodal bounding box. By applying the perspective mapping, the global egocentric pose can be recovered. However, since the recovered egocentric pose depends on the predicted values, the canonical object center and the 2D amodal bounding box, the prediction errors of the translation and rotation parameters interact with each other. Also, [38] emphasizes the importance of decoupling the regression of the translation and rotation parameters.

Figure 2: Virtual RoI camera transform.

In this work, we normalize the 6D pose parameters according to the RoI proposals. In Faster R-CNN, the RoI features are cropped from the global feature map. As shown in Fig. 2, since the cropped RoI feature describes the proposal, it can be regarded as a view-changing transform of the original scene. Following [15, 19], the virtual RoI camera and its intrinsic matrix are defined as:

$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}, \qquad K_v = \begin{bmatrix} 2f_x/w & 0 & 0 \\ 0 & 2f_y/h & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad (1)$$

where $K$ is the camera intrinsic matrix, $K_v$ is the virtual RoI camera intrinsic matrix, and $w$ and $h$ are the width and height of the RoI.

Since the network structure is fixed, the output size of the RoI pooling is fixed as well. We normalize the camera-changing mapping by the view of the virtual RoI camera, so that each proposal is mapped to a new coordinate space within $[-1, 1]$ and the virtual camera principal axis is mapped to the center point $(0, 0)$. According to [15], the infinite homography matrix $H_\infty = K_v R_v K^{-1}$ defines the 2D transformation between the virtual RoI view and the original image view, where $R_v$ is the rotation matrix from the image camera to the RoI camera. The 6D poses of the objects in each proposal are normalized by the virtual camera principal axis. Here, we discard the allocentric concept: the object poses both in the global image and in the RoI are considered under the egocentric representation, related by the 2D transformation of the infinite homography matrix.
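
A minimal NumPy sketch of this construction, under the assumptions made above (RoI mapped to $[-1,1]^2$ with the principal point at the origin, $H_\infty = K_v R_v K^{-1}$ following [15]); the function name is illustrative.

```python
import numpy as np

def virtual_roi_camera(K, roi):
    """Virtual RoI camera intrinsics, rotation and infinite homography (sketch).

    K: (3, 3) image camera intrinsics; roi = (x, y, w, h) in pixels.
    Returns K_v, R_v and H_inf = K_v @ R_v @ inv(K).
    """
    x, y, w, h = roi
    cx, cy = x + w / 2.0, y + h / 2.0

    # Assumed normalisation: the RoI spans [-1, 1]^2 in the virtual view.
    K_v = np.array([[2.0 * K[0, 0] / w, 0.0, 0.0],
                    [0.0, 2.0 * K[1, 1] / h, 0.0],
                    [0.0, 0.0, 1.0]])

    # Unit ray through the RoI centre, expressed in the image camera frame.
    ray = np.linalg.inv(K) @ np.array([cx, cy, 1.0])
    ray /= np.linalg.norm(ray)

    # Rodrigues formula (Eq. 3) rotating the RoI-centre ray onto the principal
    # axis z, i.e. mapping directions from the image camera to the RoI camera.
    z = np.array([0.0, 0.0, 1.0])
    v, c = np.cross(ray, z), np.dot(ray, z)
    vx = np.array([[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]])
    R_v = np.eye(3) + vx + vx @ vx / (1.0 + c)

    H_inf = K_v @ R_v @ np.linalg.inv(K)  # 2D map from image view to RoI view
    return K_v, R_v, H_inf
```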

In Faster R-CNN, the object bounding boxes are normalized by the RoIs in Eq. 2:

$$x_o = \frac{x - x_r}{w_r}, \quad y_o = \frac{y - y_r}{h_r}, \quad w_o = \frac{w}{w_r}, \quad h_o = \frac{h}{h_r} \qquad (2)$$

where $(x, y, w, h)$ are the left-top coordinate and the width and height of the object box on the image, $(x_r, y_r, w_r, h_r)$ are the left-top coordinate and the width and height of the RoI on the image, and $(x_o, y_o, w_o, h_o)$ are the left-top coordinate and the width and height of the object box on the RoI.

Essentially, the object bounding box normalization in [14, 29] can be regarded as a camera view change without considering the 3D rotation, so that the coordinates can be normalized simply by the width and height ratios of each RoI.
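
A tiny worked example of Eq. (2) as reconstructed above (plain width/height ratios; the helper name is illustrative):

```python
def normalize_box_to_roi(box, roi):
    """Express an object box in the RoI-normalised frame of Eq. (2).

    box, roi: (x, y, w, h) with (x, y) the left-top corner in image pixels.
    """
    x, y, w, h = box
    xr, yr, wr, hr = roi
    return ((x - xr) / wr, (y - yr) / hr, w / wr, h / hr)

# A 40x60 object whose left-top corner lies 20 px right and 20 px below the
# left-top corner of a 100x200 RoI:
# normalize_box_to_roi((120, 220, 40, 60), (100, 200, 100, 200))
# -> (0.2, 0.1, 0.4, 0.3)
```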

3.1.1 Rotation Normalization

As described in the previous section, the views of the RoI proposals are transformed from the original image view by the infinite homography matrix. The rotation matrix $R_v$ relates the image camera principal axis to the RoI proposal camera principal axis, and the 6D object rotation parameters need to be normalized by the RoI proposal camera principal axis accordingly. The center of the RoI proposal is used to calculate the RoI camera principal axis. The rotation matrix can be obtained by the Rodrigues rotation formula [7]:

$$R_v = I + [\mathbf{v}]_\times + [\mathbf{v}]_\times^2 \, \frac{1}{1 + c}, \qquad \mathbf{v} = \mathbf{p} \times \mathbf{z}, \quad c = \mathbf{p} \cdot \mathbf{z} \qquad (3)$$

where $I$ is the 3 by 3 identity matrix, $[\mathbf{v}]_\times$ is the skew-symmetric matrix of $\mathbf{v}$, $\times$ and $\cdot$ denote the cross product and the inner product of vectors respectively, $\mathbf{p}$ is the unit vector along the ray through the RoI center, and $\mathbf{z}$ is the image camera principal axis.

During training, the rotation labels are represented in the RoI-normalized form $R' = R_v R$, where $R$ is the object rotation in the image camera frame. The rotation output of the network is represented as a quaternion $\mathbf{q}$, and the regression output is constrained such that $\|\mathbf{q}\| = 1$. The original object rotation can then be recovered by $R = R_v^{-1} R'$.
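
A short sketch of the quaternion representation of the rotation output and of the normalization and recovery written above (the composition order with $R_v$ follows the reconstruction above and is therefore an assumption):

```python
import numpy as np

def rotmat_to_quat(R):
    """Rotation matrix -> unit quaternion (w, x, y, z); assumes trace(R) > -1."""
    w = np.sqrt(max(0.0, 1.0 + R[0, 0] + R[1, 1] + R[2, 2])) / 2.0
    x = (R[2, 1] - R[1, 2]) / (4.0 * w)
    y = (R[0, 2] - R[2, 0]) / (4.0 * w)
    z = (R[1, 0] - R[0, 1]) / (4.0 * w)
    return np.array([w, x, y, z])

def quat_to_rotmat(q):
    """Quaternion (w, x, y, z) -> rotation matrix, enforcing ||q|| = 1."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)]])

# Label: q = rotmat_to_quat(R_v @ R_object)
# Recovery from a prediction: R_object = R_v.T @ quat_to_rotmat(q_pred)
```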

3.1.2 Translation and Depth Normalization

The depth of an object cannot be directly reflected by a single RGB image alone. Given a 3D translation vector $\mathbf{t} = (t_x, t_y, t_z)^\top$, we treat the 2D translation $(t_x, t_y)$ and the depth $t_z$ separately.

Since all the cropped RoI features are resized to a fixed resolution, the depth cannot be estimated directly, so a perspective-based normalization is used. The depth is represented as:

$$d = Z \sqrt{\frac{S_r}{S_o}} \qquad (4)$$

where $S_o$ and $S_r$ are the area of the identity-mapping 2D bounding box and the area of the RoI, and $Z$ is the label depth. The output depth can be recovered by $Z = d \sqrt{S_o / S_r}$.

Similar to the rotation parameters, $R_v$ is used to obtain the final normalized 3D translation vector:

$$\mathbf{t}' = R_v \mathbf{t}, \qquad \mathbf{t}^\ast = (t'_x,\ t'_y,\ d)^\top \qquad (5)$$

where the translation vector $\mathbf{t}$ is first normalized by $R_v$, then the depth component is replaced by the objective depth $d$, and $\mathbf{t}^\ast$ is the target object translation vector that is further used for the parameter regression. The original translation is recovered by using the inverse matrix $R_v^{-1}$.
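
The depth and translation normalization of Eqs. (4)-(5), as reconstructed above, can be sketched as follows; the square-root area ratio and the choice of computing the depth in the RoI camera frame are assumptions:

```python
import numpy as np

def normalize_translation(t, R_v, area_identity, area_roi):
    """Eqs. (4)-(5) sketch: RoI-normalised translation target.

    t: (3,) object translation in the image camera frame.
    R_v: (3, 3) rotation from the image camera to the RoI camera (Sec. 3.1.1).
    area_identity, area_roi: areas of the identity-mapping 2D box and the RoI.
    """
    t_roi = R_v @ t                                    # rotate into the RoI camera
    d = t_roi[2] * np.sqrt(area_roi / area_identity)   # perspective-normalised depth
    return np.array([t_roi[0], t_roi[1], d])           # replace the depth with d

def recover_translation(t_norm, R_v, area_identity, area_roi):
    """Invert the normalisation: restore the depth, then undo R_v."""
    z = t_norm[2] * np.sqrt(area_identity / area_roi)
    return R_v.T @ np.array([t_norm[0], t_norm[1], z])
```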

3.2 Non-local Attention

The object pose estimation task often suffers from occlusion. We introduce non-local self-attention [36] to make the system concentrate on the non-occluded object parts. Non-local processing was first used in image filtering [5]. In [5], the algorithm computes the filter response considering both local and distant pixels.

In  [36], the non-local mean operation is introduced in deep neural networks as:

$$\mathbf{y}_i = \frac{1}{\mathcal{C}(\mathbf{x})} \sum_{\forall j} f(\mathbf{x}_i, \mathbf{x}_j)\, g(\mathbf{x}_j) \qquad (6)$$

where $\mathbf{y}$ is the output of the operation and $\mathbf{x}$ is the input, $i$ denotes the output position and $j$ enumerates all positions used for the non-local calculation. The function $f$ computes a scalar for its two inputs and $g$ maps $\mathbf{x}_j$ to a new representation; the two functions can take specific instantiations. $\mathcal{C}(\mathbf{x})$ is a normalization factor.

Following [36], we introduce a non-local block after the RoI pooling. The RoI features have a fixed spatial size. As in Eq. 6, each channel of an RoI feature map predicts a full-size attention mask, and the output of the block is a weighted feature that provides better spatial attention for the subsequent tasks.
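
A minimal NumPy sketch of the non-local operation of Eq. (6) in the embedded-Gaussian instantiation of [36], applied to one pooled RoI feature map; the plain weight matrices stand in for learned 1x1 convolutions, and the residual form follows the block design of [36]:

```python
import numpy as np

def nonlocal_block(x, W_theta, W_phi, W_g, W_out):
    """Embedded-Gaussian non-local operation (Eq. 6) on an (H, W, C) feature map.

    W_theta, W_phi, W_g: (C, C') embeddings standing in for learned 1x1 convs;
    W_out: (C', C) output projection. Returns x plus the non-local response.
    """
    h, w, c = x.shape
    flat = x.reshape(h * w, c)                     # every position is one "token"
    theta, phi, g = flat @ W_theta, flat @ W_phi, flat @ W_g
    # f(x_i, x_j) = exp(theta_i . phi_j); the softmax over j plays the role of C(x).
    logits = theta @ phi.T
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    y = attn @ g                                   # weighted sum over all positions j
    return x + (y @ W_out).reshape(h, w, c)        # residual connection
```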

Figure 3: (a) The original RGB images. (b) The images with the transformed 3D models projected as 2D silhouettes.

3.3 Loss

As the pose parameters are separated into translation and rotation, for better multi-task weighting we do not regress the two sets of parameters directly. The final parameter regression loss is instead computed on the transformed coordinates. More specifically, the smooth L1 function [29] is used for all the parameter regression:

$$L = \frac{1}{n} \sum_{i=1}^{n} \mathrm{smooth}_{L_1}\!\left(\mathcal{T}(\hat{\theta})\,\mathbf{x}_i - \mathcal{T}(\theta)\,\mathbf{x}_i\right) \qquad (7)$$

where $\mathcal{T}(\theta)$ denotes the transform applied to the point coordinates $\mathbf{x}_i$, $\theta$ and $\hat{\theta}$ are the ground-truth and predicted pose parameters, and $n$ is the number of points. In the 6D pose case, $\mathcal{T}$ can be represented in transform matrix form:

$$\mathcal{T} = \begin{bmatrix} R & \mathbf{t} \\ \mathbf{0}^\top & 1 \end{bmatrix} \qquad (8)$$

where $R$ is the rotation matrix and $\mathbf{t}$ denotes the translation vector. The last row is the homogeneous extension.
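
A sketch of this transformed-coordinate loss, matching Eqs. (7)-(8) as reconstructed above; the averaging over model points and the function names are assumptions:

```python
import numpy as np

def smooth_l1(diff, beta=1.0):
    """Element-wise smooth L1 (Huber-like) penalty as used in [29]."""
    a = np.abs(diff)
    return np.where(a < beta, 0.5 * a * a / beta, a - 0.5 * beta)

def pose_loss(points, R_pred, t_pred, R_gt, t_gt):
    """Eq. (7) sketch: loss on 3D model points transformed by both poses.

    points: (N, 3) model points; R_*: (3, 3) rotations; t_*: (3,) translations.
    """
    p_pred = points @ R_pred.T + t_pred   # T(theta_hat) x_i
    p_gt = points @ R_gt.T + t_gt         # T(theta) x_i
    return smooth_l1(p_pred - p_gt).mean()
```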

4 Experiment

4.1 Dataset and Setup

Our experiments are mainly conducted on the YCB-Video dataset [38]. Following the dataset split in [38], 80 videos together with 80,000 synthetic images are used for training, and 2,949 key frames are used for testing. For the evaluation metrics, the average distance (ADD) and its symmetric variant (ADD-S) are reported. To test the robustness of the network, the Linemod dataset [17] is also evaluated, following the settings of [4].
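
For reference, a sketch of how ADD and ADD-S are typically computed [38]; the thresholding and area-under-curve step behind the reported numbers is omitted:

```python
import numpy as np

def add_metric(points, R_pred, t_pred, R_gt, t_gt):
    """ADD: mean distance between corresponding model points under both poses."""
    p_pred = points @ R_pred.T + t_pred
    p_gt = points @ R_gt.T + t_gt
    return np.linalg.norm(p_pred - p_gt, axis=1).mean()

def adds_metric(points, R_pred, t_pred, R_gt, t_gt):
    """ADD-S: mean closest-point distance, used for symmetric objects."""
    p_pred = points @ R_pred.T + t_pred
    p_gt = points @ R_gt.T + t_gt
    # For every ground-truth-posed point, take the nearest predicted-posed point.
    d = np.linalg.norm(p_gt[:, None, :] - p_pred[None, :, :], axis=2)
    return d.min(axis=1).mean()
```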

The model is implemented using the TensorFlow library. The VGG16 backbone is initialized from a model pre-trained on ImageNet, and the Adam optimizer is used for gradient computation and weight updates.

4.2 Results on the YCB-Video Dataset

Following [38], the 3D coordinate regression network is chosen as the baseline. The single-RGB version of PoseCNN [38] is also compared with the proposed method.

Method    3D Coordinate    PoseCNN    Proposed    Proposed with self-attention
ADD       15.1             53.7       50.3        53.9
ADD-S     29.8             75.9       75.1        77.0

Table 1: Performance evaluation on the YCB-Video dataset.

As shown in Table 1, the proposed method without the non-local self-attention module reaches performance comparable to PoseCNN [38], even though we use neither the symmetry-aware distance loss nor the Hough-voting-based translation regression that requires an extra segmentation branch. Furthermore, with the self-attention module, our network outperforms the previous methods. Some visualized results are shown in Fig. 3.

4.3 Results on the Linemod Dataset

The Linemod dataset [17] contains 15 objects, with about 1,200 images per object. The target item is annotated in each object's image set. We use the same training setting as [26]. The results are shown in Table 2:

Method    PoseCNN    Proposed with self-attention
Pose      62.7       64.3

Table 2: Performance evaluation on the Linemod dataset.

Consistent with the YCB-Video results, our network outperforms PoseCNN on the evaluation metric of [4], which demonstrates the robustness of the proposed network.

5 Conclusion

In this work, we discuss the integration of 6D object pose estimation into the prevalent deep learning based object detection framework. The 6D pose parameters are separated into rotation and translation, which are normalized separately. Moreover, a non-local self-attention mechanism is introduced to make the performance more robust to occlusion. Experimental results demonstrate the feasibility of the proposed network, which reaches state-of-the-art performance on the two datasets.

References

  • [1] M. Aubry, D. Maturana, A. A. Efros, B. C. Russell, and J. Sivic (2014) Seeing 3d chairs: exemplar part-based 2d-3d alignment using a large dataset of CAD models. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pp. 3762–3769. Cited by: §2.
  • [2] L. Bo, X. Ren, and D. Fox (2014) Learning hierarchical sparse features for RGB-(D) object recognition. I. J. Robotics Res. 33 (4), pp. 581–599. Cited by: §1.
  • [3] E. Brachmann, A. Krull, F. Michel, S. Gumhold, J. Shotton, and C. Rother (2014) Learning 6d object pose estimation using 3d object coordinates. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part II, pp. 536–551. Cited by: §1.
  • [4] E. Brachmann, F. Michel, A. Krull, M. Y. Yang, S. Gumhold, and C. Rother (2016) Uncertainty-driven 6d pose estimation of objects and scenes from a single RGB image. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 3364–3372. Cited by: §4.1, §4.3.
  • [5] A. Buades, B. Coll, and J. Morel (2005) A non-local algorithm for image denoising. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), 20-26 June 2005, San Diego, CA, USA, pp. 60–65. Cited by: §3.2.
  • [6] A. Collet, M. Martinez, and S. S. Srinivasa (2011) The MOPED framework: object recognition and pose estimation for manipulation. I. J. Robotics Res. 30 (10), pp. 1284–1306. Cited by: §1.
  • [7] J. S. Dai (2006) An historical review of the theoretical development of rigid body displacements from Rodrigues parameters to the finite twist. Mechanism and Machine Theory 41 (1), pp. 41–52. Cited by: §3.1.1.
  • [8] H. Ding, X. Jiang, A. Q. Liu, N. M. Thalmann, and G. Wang (2019) Boundary-aware feature propagation for scene segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6819–6829. Cited by: §2.
  • [9] H. Ding, X. Jiang, B. Shuai, A. Q. Liu, and G. Wang (2019) Semantic correlation promoted shape-variant context for segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [10] H. Ding, X. Jiang, B. Shuai, A. Q. Liu, and G. Wang (2020) Semantic segmentation with context encoding and multi-path decoding. IEEE Transactions on Image Processing. Cited by: §1.
  • [11] H. Ding, X. Jiang, B. Shuai, A. Qun Liu, and G. Wang (2018) Context contrasted feature and gated multi-scale aggregation for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2393–2402. Cited by: §1.
  • [12] P. F. Felzenszwalb, R. B. Girshick, and D. A. McAllester (2010) Cascade object detection with deformable part models. In The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 13-18 June 2010, pp. 2241–2248. Cited by: §2.
  • [13] S. Fidler, S. J. Dickinson, and R. Urtasun (2012) 3D object detection and viewpoint estimation with a deformable 3d cuboid model. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States., pp. 620–628. Cited by: §2.
  • [14] R. B. Girshick (2015) Fast R-CNN. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pp. 1440–1448. Cited by: §1, §2, §3.1.
  • [15] A. Harltey and A. Zisserman (2006) Multiple view geometry in computer vision (2. ed.). Cambridge University Press. Cited by: §3.1, §3.1.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun (2014) Spatial pyramid pooling in deep convolutional networks for visual recognition. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part III, pp. 346–361. Cited by: §2.
  • [17] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. R. Bradski, K. Konolige, and N. Navab (2012) Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Computer Vision - ACCV 2012 - 11th Asian Conference on Computer Vision, Daejeon, Korea, November 5-9, 2012, Revised Selected Papers, Part I, pp. 548–562. Cited by: §1, §4.1, §4.3.
  • [18] W. Kehl, F. Milletari, F. Tombari, S. Ilic, and N. Navab (2016) Deep learning of local RGB-D patches for 3d object detection and 6d pose estimation. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III, pp. 205–220. Cited by: §1.
  • [19] A. Kundu, Y. Li, and J. M. Rehg (2018) 3D-rcnn: instance-level 3d object reconstruction via render-and-compare. In CVPR, Cited by: §2, §3.1, §3.1.
  • [20] J. Liu, H. Ding, A. Shahroudy, L. Duan, X. Jiang, G. Wang, and A. K. Chichung (2019) Feature boosting network for 3d pose estimation. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2.
  • [21] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part I, pp. 21–37. Cited by: §1, §2.
  • [22] D. G. Lowe (1999) Object recognition from local scale-invariant features. In ICCV, pp. 1150–1157. Cited by: §1.
  • [23] J. Mei, Z. Wu, X. Chen, Y. Qiao, H. Ding, and X. Jiang (2019) DeepDeblur: text image recovery from blur to sharp. Multimedia Tools and Applications 78 (13), pp. 18869–18885. Cited by: §1.
  • [24] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka (2017) 3D bounding box estimation using deep learning and geometry. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 5632–5640. Cited by: §2.
  • [25] P. Poirson, P. Ammirato, C. Fu, W. Liu, J. Kosecka, and A. C. Berg (2016) Fast single shot detection and pose estimation. In Fourth International Conference on 3D Vision, 3DV 2016, Stanford, CA, USA, October 25-28, 2016, pp. 676–684. Cited by: §2.
  • [26] M. Rad and V. Lepetit (2017) BB8: A scalable, accurate, robust to partial occlusion method for predicting the 3d poses of challenging objects without using depth. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 3848–3856. Cited by: §4.3.
  • [27] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 779–788. Cited by: §1, §2.
  • [28] J. Redmon and A. Farhadi (2017) YOLO9000: better, faster, stronger. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 6517–6525. Cited by: §1, §2.
  • [29] S. Ren, K. He, R. B. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 91–99. Cited by: §1, §1, §2, §3.1, §3.3.
  • [30] F. Rothganger, S. Lazebnik, C. Schmid, and J. Ponce (2006) 3D object modeling and recognition using local affine-invariant image descriptors and multi-view spatial constraints. International Journal of Computer Vision 66 (3), pp. 231–259. Cited by: §1.
  • [31] M. Schwarz, H. Schulz, and S. Behnke (2015) RGB-D object recognition and pose estimation based on pre-trained convolutional neural network features. In IEEE International Conference on Robotics and Automation, ICRA 2015, Seattle, WA, USA, 26-30 May, 2015, pp. 1329–1335. Cited by: §1.
  • [32] B. Shuai, H. Ding, T. Liu, G. Wang, and X. Jiang (2018) Toward achieving robust low-level and high-level scene parsing. IEEE Transactions on Image Processing 28 (3), pp. 1378–1390. Cited by: §1.
  • [33] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders (2013) Selective search for object recognition. International Journal of Computer Vision 104 (2), pp. 154–171. Cited by: §2.
  • [34] X. Wang, H. Ding, and X. Jiang (2019) Dermoscopic image segmentation through the enhanced high-level parsing and class weighted loss. In 2019 IEEE International Conference on Image Processing (ICIP), pp. 245–249. Cited by: §2.
  • [35] X. Wang, X. Jiang, H. Ding, and J. Liu (2019) Bi-directional dermoscopic feature learning and multi-scale consistent decision fusion for skin lesion segmentation. IEEE Transactions on Image Processing. Cited by: §2.
  • [36] X. Wang, R. B. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 7794–7803. Cited by: §2, §3.2, §3.2, §3.2.
  • [37] Y. Xiang, W. Choi, Y. Lin, and S. Savarese (2015) Data-driven 3d voxel patterns for object category recognition. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pp. 1903–1911. Cited by: §2.
  • [38] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox (2017) PoseCNN: A convolutional neural network for 6d object pose estimation in cluttered scenes. CoRR abs/1711.00199. Cited by: §1, §3.1, §4.1, §4.2, §4.2.
  • [39] M. Z. Zia, M. Stark, B. Schiele, and K. Schindler (2013) Detailed 3d representations for object recognition and modeling. IEEE Trans. Pattern Anal. Mach. Intell. 35 (11), pp. 2608–2623. Cited by: §2.
  • [40] M. Z. Zia, M. Stark, and K. Schindler (2014) Are cars just 3d boxes? jointly estimating the 3d shape of multiple objects. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pp. 3678–3685. Cited by: §2.