1 Introduction
In this work, we focus on the challenging task of 3D rotation estimation, i.e., estimating the rotational orientation of an object with respect to a given reference frame (usually the camera), an important topic in computer vision. The 6DOF pose of an object comprises its 3D translation and its 3D rotation in the scene. A given 3D scene can be represented by the 6DOF poses of all objects in the scene. Pose estimation is a vital step in many robotics pipelines such as robotic manipulation and autonomous driving, but current methods have high computational requirements. While some recent pose estimation methods such as [14] and [4] have been able to achieve real-time performance, they require high-end GPUs for fast inference. Their performance on CPUs is significantly limited, even though mobile robots and other embedded devices generally have limited computing hardware due to power and size constraints.
Another issue with state-of-the-art deep neural networks is that they require large training datasets to achieve good performance, often on the order of several hundred thousand samples. However, annotating 6D object pose datasets is an extremely challenging task. For example, to create the popular LINEMOD object pose dataset [3], each frame was annotated by using a robotic arm to control the camera, thereby providing the ground truth rotation for each frame. Despite this, the dataset is still limited to just 1,000 images for each of the 15 object classes, which is significantly less than what would normally be required to train a deep neural network. An alternative is to create synthetic images using 3D models of the objects. However, a neural network trained on computer-generated images may not generalize to the real world.
The key idea in this paper is a two-stage approach: instead of estimating the pose from scratch, we assume that, along with the location of the object, a coarse estimate of the 3D pose of the object is provided to us. Given an RGB image of an object along with its location in the image, represented as a bounding box around the object, we attempt to estimate the 3D rotation of the object with respect to the camera. Our key contribution is a shallow neural network framework that uses a bounding box RGB image of the object along with a coarse 3D pose estimate to produce a highly precise 3D pose. Our system relies only on the RGB bounding box image and, unlike other works that have attempted this solution, does not require a 3D CAD model of the object for training.
Our system can be trained on smaller datasets, and can provide quick inference on computationally limited hardware. One advantage of using a shallow neural network for refinement is the smaller number of training samples required to achieve optimal performance. Our network provides optimal performance with just a few thousand labeled samples per object, and allows for easy dataset augmentation without the use of synthetic images. By using a shallow neural network for regression of the refined 3D pose, we significantly reduce the computing cost of the 3D pose estimation pipeline. Thus, mobile robots can use a shallow-network-based pose refinement algorithm to complement a fast but coarse pose estimation system. Refining an estimated 3D pose has other significant advantages over an end-to-end pose estimation framework. It provides the flexibility to refine a pose to a chosen degree while taking the tradeoff between precision and inference time into account, and it can be used in iterative optimization methods.
2 Related Works
3D pose estimation and object detection are well explored topics. Literature in this domain includes research on 6D pose estimation. In this section, we will focus on the more recent work on this problem. We will also limit our review to work that involves pose estimation using convolutional neural networks on single RGB images rather than methods that use more complex sensing techniques. Further, we will review both general 3D object pose methods as well as methods that specifically involve pose refinement as an optional final step.
2D keypoint detection and matching are commonly used for 3D pose estimation from RGB images. Keypoints in the 2D image are detected using methods such as SIFT [6] and matched to keypoints in 3D models of the objects. Then, the projection of the 3D keypoints is aligned to the 2D keypoints to estimate the pose of the object. This method, used by [17] and [10], is quite computationally demanding.
Newer methods attempt to estimate the 3D pose directly from an RGB image without the use of a 3D model. [7] regress the 3D pose of objects in the 3D rotation space. Their pipeline takes in bounding box images of unoccluded and untruncated objects, and outputs a pose as an axis-angle rotation or a unit quaternion (see section 3). Unlike earlier 3D pose estimation methods such as [13] and [16], which approach pose estimation as a classification problem by classifying rotations into bins, [7] attempt to directly regress the 3D rotation of the object.
Pose refinement as a secondary stage to pose estimation in order to improve results was first used in [9] for hand pose estimation. They used a synthetic image of the hand, created using the estimated pose from the first stage, to update the pose estimate in an iterative algorithm.
[12, 4, 8] subsequently used the refinement approach in 3D object pose estimation. [12] present an end-to-end 3D object pose estimation system using a deep convolutional neural network on RGB images. Their optional pose refinement stage involves generating a binary mask or a color render of the object using a 3D model of the object at the estimated pose, and using a CNN to minimize the error between the projection of the 3D bounding box of the ground truth object pose and the newly rendered object pose. [4] also render each object into the scene using the estimated pose, and then try to minimize the projection error of contours of the rendered surfaces against the ground truth. [8] propose a similar approach. After a coarse initial 3D pose hypothesis, they iteratively minimize the projection error from the rendered models until they converge to a refined pose. All these methods use a 3D CAD model of the object for pose refinement. Our method is the first to refine a pose using just the RGB image.
3 Representing 3D Pose
The 3D orientation of an object in the world is commonly represented by Euler angles. However, Euler angles are not an ideal representation of rotation since they suffer from singularities. We can instead represent the orientation as a 3D rotation by an angle $\theta$ about a given unit axis $v$ passing through the origin of the object, as calculated in the object localization task. This representation of a rotation is called the axis-angle representation. A 3D rotation can also be represented by a rotation matrix $R$, which lies in the set of special orthogonal matrices of dimension 3, i.e., matrices that are orthogonal and have determinant $+1$:
$$SO(3) = \{ R \in \mathbb{R}^{3 \times 3} : R^\top R = I_3,\ \det(R) = +1 \} \tag{1}$$
Another popular representation of 3D rotations is unit quaternions. Rotation matrices and unit quaternions can be converted into each other; the correspondence is one-to-one up to sign, since $q$ and $-q$ represent the same rotation. For an axis-angle rotation represented by unit axis $v$ and angle $\theta$, the corresponding unit quaternion is:
$$q = \left( \cos\frac{\theta}{2},\ \sin\frac{\theta}{2}\, v \right) \tag{2}$$
Similarly, for a given quaternion $q = (q_0, q_1, q_2, q_3)$, the corresponding axis-angle rotation would be:
$$\theta = 2 \cos^{-1}(q_0), \qquad v = \frac{(q_1, q_2, q_3)}{\sqrt{1 - q_0^2}} \tag{3}$$
A quaternion can be split into a scalar part $c = q_0$ and a vector part $\tilde{q} = (q_1, q_2, q_3)$. That is:
$$q = (c, \tilde{q}) \tag{4}$$
With this, we can perform multiplication for quaternions using the standard vector cross and dot products. Multiplication of two quaternions is the same as performing the rotation operations represented by each quaternion sequentially.
$$q\,p = \left( c_q c_p - \tilde{q} \cdot \tilde{p},\ \ c_q \tilde{p} + c_p \tilde{q} + \tilde{q} \times \tilde{p} \right) \tag{5}$$
We also define the quaternion $q^{-1}$ as the quaternion obtained by using $-\theta$ in eq. 2.
$$q^{-1} = \left( \cos\frac{\theta}{2},\ -\sin\frac{\theta}{2}\, v \right) = (c, -\tilde{q}) \tag{6}$$
This represents the reverse of the original rotation. Rotation by $-\theta$ around axis $v$ is the same as rotating by angle $\theta$ around axis $-v$. From this, we can define the geodesic angle between two quaternions $q$ and $p$:
$$d(q, p) = 2 \cos^{-1}\left( \left| S(q^{-1} p) \right| \right) \tag{7}$$
where $S(\cdot)$ is the scalar component of a quaternion.
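The conversions and the product in eqs. 2, 3, 5 and 6 can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' code: the function names are hypothetical, quaternions are stored as `(c, x, y, z)` tuples, and angles are in radians.

```python
import math

def axis_angle_to_quat(axis, angle):
    """Unit quaternion (c, x, y, z) for a rotation by `angle` about `axis` (eq. 2)."""
    norm = math.sqrt(sum(a * a for a in axis))
    ux, uy, uz = (a / norm for a in axis)  # make sure the axis is unit length
    s = math.sin(angle / 2.0)
    return (math.cos(angle / 2.0), s * ux, s * uy, s * uz)

def quat_to_axis_angle(q, eps=1e-12):
    """Axis and angle of a unit quaternion (eq. 3)."""
    c = max(-1.0, min(1.0, q[0]))  # guard acos against rounding
    angle = 2.0 * math.acos(c)
    s = math.sqrt(max(0.0, 1.0 - c * c))
    if s < eps:
        # Zero rotation: the axis is undefined, so an arbitrary one is returned.
        return (1.0, 0.0, 0.0), 0.0
    return (q[1] / s, q[2] / s, q[3] / s), angle

def quat_multiply(q, p):
    """Quaternion product written with dot and cross products (eq. 5)."""
    cq, qv = q[0], q[1:]
    cp, pv = p[0], p[1:]
    dot = sum(a * b for a, b in zip(qv, pv))
    cross = (
        qv[1] * pv[2] - qv[2] * pv[1],
        qv[2] * pv[0] - qv[0] * pv[2],
        qv[0] * pv[1] - qv[1] * pv[0],
    )
    vector = tuple(cq * pv[i] + cp * qv[i] + cross[i] for i in range(3))
    return (cq * cp - dot,) + vector

def quat_inverse(q):
    """Reverse rotation of a unit quaternion (eq. 6)."""
    return (q[0], -q[1], -q[2], -q[3])
```

For example, composing a 90 degree and a 30 degree rotation about the same axis should give a 120 degree rotation about that axis, and multiplying a quaternion by its inverse should give the identity rotation `(1, 0, 0, 0)`.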
4 Dataset
We used the YCB-Video dataset [18] for all our testing and training. The dataset consists of individual frames from video clips of 21 common objects. Each frame is annotated with the ground truth object pose as a 3D rotation matrix, as well as bounding box coordinates for each object in the frame. Other annotations, such as extrinsic camera parameters and object 3D translation, were provided as well but were not required for our framework. Figure 1 is a sample image from the dataset.
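Rounding errors introduced while reading the ground truth rotation matrices resulted in non-orthogonal rotation matrices. This led to singularities when converting the rotation matrices to unit quaternions. Thus, we first had to reorthogonalize each rotation matrix, using a method recommended in [11]. We first took two of the three orthogonal vectors from the rotation matrix and evaluated the error in their dot product:
$$R = \begin{bmatrix} x^\top \\ y^\top \\ z^\top \end{bmatrix} \tag{8}$$
$$e = x \cdot y \tag{9}$$
We used half of the error to reorthogonalize the two vectors, and created the third vector as the cross product of the two reorthogonalized vectors:
$$x_{orth} = x - \frac{e}{2}\, y \tag{10}$$
$$y_{orth} = y - \frac{e}{2}\, x \tag{11}$$
$$z_{orth} = x_{orth} \times y_{orth} \tag{12}$$
Finally, we normalized each of the three orthogonal vectors:
$$x_{norm} = \frac{x_{orth}}{\lVert x_{orth} \rVert}, \qquad y_{norm} = \frac{y_{orth}}{\lVert y_{orth} \rVert}, \qquad z_{norm} = \frac{z_{orth}}{\lVert z_{orth} \rVert} \tag{13}$$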
We used 9 of the 21 objects for all our experiments. Figure 2 shows the 9 objects used. We had over 12,000 bounding box images and corresponding ground truth poses for each of the 9 objects. We used 5,000 samples for training, 3,000 for validation, and 3,000 for testing. Further, since the frames were recorded as video clips, the pose change between subsequent frames was incremental. Thus, the samples of each object were shuffled before being split.
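The shuffle-then-split step can be sketched as below. The split sizes come from the text; the function name, the fixed seed, and the choice to discard any samples beyond the three splits are assumptions made for illustration.

```python
import random

def shuffle_and_split(samples, n_train=5000, n_val=3000, n_test=3000, seed=0):
    """Shuffle one object's samples and cut them into train/val/test splits."""
    needed = n_train + n_val + n_test
    if len(samples) < needed:
        raise ValueError("need at least %d samples, got %d" % (needed, len(samples)))
    shuffled = list(samples)               # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # break the frame-to-frame ordering
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:needed]
    return train, val, test
```

Shuffling before splitting matters here because consecutive video frames show almost the same pose; a split by frame order would put near-duplicate poses into the same split and leave other poses unseen.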
5 Network Architecture
Our network is composed of three stages: the CNN feature extraction stage, the fully connected pose input stage, and the fully connected pose refinement stage. The feature extraction stage takes in the bounding box image of the object as an input. The input image is resized for the network. The pose input stage takes a coarse estimate of the pose as a unit quaternion as input. The outputs of both these stages are concatenated and input into the pose refinement stage. The final 4-dimensional output is unit normalized. Figure 3 shows the overall architecture of the pose refinement pipeline.
Instead of directly outputting a refined 3D pose, we output a unit quaternion representing the rotation from the input pose estimate to the ground truth. The reason for this choice is to prevent the network from overfitting to the image data. The network should factor both the bounding box image and the coarse pose estimate into the final output. We explore this further in section 6.2.4.
5.1 CNN Feature Extraction Stage
The CNN feature extraction stage is a simple shallow CNN composed of two convolutional layers, two batch normalization layers, and two max pooling layers. Figure 4 shows the architecture of the stage. While deeper CNNs perform very well in classification tasks, 3D pose estimation relies mostly on the finer image features extracted by the first few layers of a CNN. This is even more true for pose refinement. Thus, just these few layers are adequate for our application. We verified this by comparing our CNN against the VGG-M network in our framework in section 6.2.2.
5.2 Fully Connected Pose Input Stage
The fully connected pose input stage consists of 3 fully connected layers that expand the dimension of the input pose estimate from 4 to 4,096. This matches the output dimension of the CNN feature extraction stage. Figure 5 shows the architecture of this stage.
5.3 Pose Refinement Stage
The first step in the pose refinement stage is to concatenate the outputs of the other two stages. The resultant 8,192-dimensional vector is fed through 4 fully connected layers to get a 4-dimensional quaternion output. We looked into enforcing a constraint on this stage to output unit-normalized quaternions, as explored in [5]. However, we found that unit-normalizing the 4-dimensional output is simpler and works just as well. Figure 6 shows the architecture of this stage.
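Unit-normalizing the raw 4-dimensional output amounts to dividing it by its Euclidean norm. The sketch below shows this on a plain tuple; the function name and the fallback to the identity rotation for a near-zero output are assumptions, not details given in the text.

```python
import math

def unit_normalize(raw, eps=1e-12):
    """Scale a raw 4-vector to unit length so it is a valid rotation quaternion."""
    norm = math.sqrt(sum(v * v for v in raw))
    if norm < eps:
        # Assumed fallback: a (near) zero output is mapped to "no refinement".
        return (1.0, 0.0, 0.0, 0.0)
    return tuple(v / norm for v in raw)
```

In a PyTorch model the same step would typically be applied to the whole batch at the end of the forward pass, which keeps the network unconstrained internally while every output remains a valid unit quaternion.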
5.4 Loss Function
For our loss function, we minimize the angle between our output rotation and our sample label. We convert the quaternion that relates the two to an axis-angle representation of the rotation. The axis is arbitrary; the angle, however, measures the error in our pose. Thus, for output $q_{out}$ and label $q_{label}$, we use eq. 7:
$$L(q_{out}, q_{label}) = 2 \cos^{-1}\left( \left| S(q_{out}^{-1}\, q_{label}) \right| \right) \tag{14}$$
This loss function disregards the axis of the rotation and only minimizes the angle of the rotation.
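For unit quaternions, the scalar part of the product of one quaternion's inverse with another equals their 4-dimensional dot product, so the angle loss can be evaluated without forming the full product. The sketch below uses that identity; the function name is hypothetical, and taking the absolute value (so that a quaternion and its negative count as the same rotation) is an assumption.

```python
import math

def angle_loss(output, label):
    """Angle in radians of the rotation taking `output` to `label` (unit quaternions)."""
    dot = sum(a * b for a, b in zip(output, label))
    # Clamp to 1 to protect acos against rounding slightly above 1.
    return 2.0 * math.acos(min(1.0, abs(dot)))
```

With an identity label, an output that is off by 0.2 rad about the x axis and one that is off by 0.2 rad about the y axis both yield a loss of 0.2, which illustrates that only the angle of the residual rotation is penalized, not its axis.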
5.5 Training
A single training sample consisted of a single bounding box image of an object along with a coarse estimate of the pose of the object. For training, the estimated pose was generated by rotating the ground truth pose around a random axis by an angle drawn from a uniform distribution:
(15) 
Thus
$$q_{est} = q_{\theta}\, q_{gt} \tag{16}$$
for $q_{\theta}$ generated by using the angle $\theta$ from eq. 15 in eq. 2, where $q_{gt}$ is the ground truth pose and $q_{est}$ the generated pose estimate. Similarly, the label for that sample is generated by using $-\theta$ in eq. 2. In this manner, a single bounding box image can be used for multiple samples by generating several poses. This dataset augmentation is extremely useful when dealing with smaller training datasets. However, we had an ample number of samples from the YCB-Video dataset, so no such augmentation was required. We did, however, verify that our network did not overfit to the images by resampling test images with different pose estimates in section 6.2.4.
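A sketch of how one such training pair could be generated is given below. Several details are assumptions made for illustration: the range of the uniform distribution is exposed as a hypothetical `max_angle` parameter and taken to be symmetric around zero, the perturbation is applied by left multiplication, and all function names are invented here.

```python
import math
import random

def axis_angle_to_quat(axis, angle):
    """Unit quaternion (c, x, y, z) for a rotation by `angle` about unit `axis`."""
    s = math.sin(angle / 2.0)
    return (math.cos(angle / 2.0), s * axis[0], s * axis[1], s * axis[2])

def quat_multiply(q, p):
    """Hamilton product of two quaternions stored as (c, x, y, z)."""
    c1, x1, y1, z1 = q
    c2, x2, y2, z2 = p
    return (
        c1 * c2 - x1 * x2 - y1 * y2 - z1 * z2,
        c1 * x2 + x1 * c2 + y1 * z2 - z1 * y2,
        c1 * y2 - x1 * z2 + y1 * c2 + z1 * x2,
        c1 * z2 + x1 * y2 - y1 * x2 + z1 * c2,
    )

def random_unit_axis(rng):
    """Uniformly distributed direction, from a normalized Gaussian sample."""
    v = [rng.gauss(0.0, 1.0) for _ in range(3)]
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def make_training_sample(q_gt, max_angle, rng):
    """Return (coarse pose estimate, label) for one ground truth pose."""
    axis = random_unit_axis(rng)
    theta = rng.uniform(-max_angle, max_angle)  # hypothetical symmetric range
    q_est = quat_multiply(axis_angle_to_quat(axis, theta), q_gt)
    label = axis_angle_to_quat(axis, -theta)    # rotation that undoes the perturbation
    return q_est, label
```

Applying the label to the generated estimate recovers the ground truth pose, and calling `make_training_sample` repeatedly for the same image yields as many distinct training pairs as needed.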
We first trained our network on 3,000 images for each object. This served as the initialization of our network weights. To test on an individual object, we then fine-tuned our network using the remaining 2,000 training images of that object.
Our nonlinear loss function (eq. 14) has a number of local minima. This can be seen intuitively, since we are outputting a full rotation as a unit quaternion but only minimizing the angle of the rotation. To avoid those local minima while still taking advantage of our loss function, we first trained our network using the MSE loss for 5 epochs, and then continued training with our loss function for another 10 epochs. [7] and [15] both recommended this strategy when using the error angle for loss. We used the Adam optimizer for all training.
6 Results and Discussions
In this section, we introduce our evaluation metric, and discuss the experiments that we performed to verify our claims. We also discuss the decisions we made in our framework and verify our choices. We evaluate our network’s performance with (i) varying sizes of the training dataset, (ii) different CNNs for the feature extraction stage, (iii) different input pose estimate distributions, and (iv) a test for overfitting to images.
6.1 Evaluation Metrics
There are three standard metrics used to evaluate the performance of 6D pose estimation frameworks: 2D reprojection error, IoU score, and average 3D distance of model vertices. Each of these metrics defines a threshold for correct estimation of the pose, the final score being the percentage of samples that meet the threshold. In the 2D reprojection error metric as used by [1], the mean distance between the projections of the object’s 3D mesh vertices onto the image using the estimated pose and the ground truth pose is measured as the error, with a value of 5 pixels taken as the threshold for correctness. Since this metric heavily factors in the object detection and 3D localization task, it was not ideal for us. The IoU score, used by [4], is the measure of the overlap between the projected 3D model using the ground truth pose and the predicted pose, with an overlap of 0.5 taken as the threshold. This is, again, not a very useful metric for 3D rotation. The average 3D distance of model vertices, also known as the ADD metric, uses the mean distance between the object’s 3D mesh vertices at the ground truth pose and the estimated pose, with the threshold scaling with object size. This metric, used by [12] and [18] among others, is better suited to 3D pose estimation. However, since our framework does not rely on the 3D model of the object for pose refinement, we decided not to use it.
[7], who also rely solely on bounding box RGB images for 3D pose estimation, use the geodesic angle between the ground truth and their estimate to evaluate their system. This is simply the angle of the axis-angle rotation required to rotate the estimated pose back to the ground truth. Thus, we generated a refined pose $q_{ref}$ from our output unit quaternion $q_{out}$ and the input pose estimate $q_{est}$ as:
$$q_{ref} = q_{out}\, q_{est} \tag{17}$$
Then, we calculated the geodesic angle between the refined pose and the ground truth pose $q_{gt}$ using eq. 7:
$$d(q_{ref}, q_{gt}) = 2 \cos^{-1}\left( \left| S(q_{ref}^{-1}\, q_{gt}) \right| \right) \tag{18}$$
This is the same as our loss function (eq. 14), but between the ground truth pose and the final refined pose rather than between the sample label and the output quaternion. We used this metric for all our evaluations in the section below.
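The evaluation described above can be sketched as follows: apply the network output to the input estimate, measure the remaining angle to the ground truth, and summarize the errors over a test set. The function names are hypothetical, and reporting the error in degrees, applying the output by left multiplication, and using the population standard deviation are assumptions.

```python
import math
import statistics

def quat_multiply(q, p):
    """Hamilton product of two quaternions stored as (c, x, y, z)."""
    c1, x1, y1, z1 = q
    c2, x2, y2, z2 = p
    return (
        c1 * c2 - x1 * x2 - y1 * y2 - z1 * z2,
        c1 * x2 + x1 * c2 + y1 * z2 - z1 * y2,
        c1 * y2 - x1 * z2 + y1 * c2 + z1 * x2,
        c1 * z2 + x1 * y2 - y1 * x2 + z1 * c2,
    )

def angular_error_deg(q_out, q_est, q_gt):
    """Geodesic angle (degrees) between the refined pose and the ground truth."""
    q_ref = quat_multiply(q_out, q_est)  # refined pose
    dot = abs(sum(a * b for a, b in zip(q_ref, q_gt)))
    return math.degrees(2.0 * math.acos(min(1.0, dot)))

def summarize(errors):
    """Mean and (population) standard deviation of a list of angular errors."""
    return statistics.mean(errors), statistics.pstdev(errors)
```

If the network outputs the identity quaternion, the error is simply the error of the input estimate, so comparing the two values per sample shows how much the refinement stage actually contributes.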
6.2 Experiments
We ran several different experiments to verify each decision we made in our framework and to highlight its advantages. All our code was written in Python 3.6 using the PyTorch library. The network was trained on a system with an AMD Ryzen 7 2700X CPU and Nvidia GTX 1080 Ti GPU. The same system was used for testing. Some inference speed testing was done on the CPU instead of the GPU.
For each experiment, we first established a baseline performance using the training parameters stated in section 5.5. The baseline results are shown in table 1 and fig. 7. All results are tabulated as the mean angular error between the refined pose and the ground truth pose. The plots also show the standard deviation.
Baseline Results  Mean Angular Error  Standard Deviation 

Tin Can  9.87  1.36 
Crackers  5.39  1.34 
Sugar Box  7.15  2.01 
Mustard  4.80  1.45 
Gelatin  6.72  1.02 
Banana  7.50  0.96 
Bleach  6.53  2.36 
Drill  4.69  1.38 
Scissors  7.34  0.95 
Average  6.67  1.42 
6.2.1 Varying Training Dataset Size
The primary advantage of a relatively shallow neural network architecture is that it requires less training, with fewer training samples, to achieve optimal performance. To verify this, we retrained our network in two further configurations: 5,000 base training samples from each object with 5,000 fine-tuning samples for the specific object, and 1,000 base training samples from each object with 1,000 fine-tuning samples for the specific object. We compared the results against the baseline of 3,000 base training samples from each object and 2,000 fine-tuning samples. The results can be seen in table 2.
Training Samples  Baseline  5,000  1,000

Tin Can  9.87  9.21  10.64
Crackers  5.39  4.75  6.03
Sugar Box  7.15  7.09  7.96
Mustard  4.80  4.36  5.24
Gelatin  6.72  6.68  8.21
Banana  7.50  6.35  7.98
Bleach  6.53  6.29  7.06
Drill  4.69  4.62  5.68
Scissors  7.34  7.16  8.73
Average  6.67  6.28  7.50
Our network largely retained its performance even with just 9,000 base training samples (1,000 for each object) and just 1,000 fine-tuning samples: the average error rose from 6.67 to 7.50. Furthermore, increasing the training samples beyond our baseline did not improve the performance significantly (6.28 on average). This highlights the clear advantage of a shallow neural network in training.
6.2.2 Different CNN Feature Extraction Stages
Most neural network architecture literature recommends using CNNs pretrained on large image datasets such as ImageNet and fine-tuned for the particular application. While effective, such CNNs are often extremely dense and require long inference times, especially on CPUs. Thus, another advantage of our shallow CNN feature extraction stage is that it can extract the necessary image features for pose refinement without lengthy inference times on constrained hardware. We compared the baseline performance of our framework with our shallow CNN feature extraction stage against a pretrained CNN stage. For this task, we chose the VGG-M network [2] with weights initialized on ImageNet. This network is one of the shallower pretrained VGG networks. We evaluated results both by performing base training on the network and then fine-tuning it for each specific object, and by simply fine-tuning without any base training on our dataset. The other two stages were not changed or retrained in this experiment. We also measured inference time on the CPU.
As can be seen in table 4 and fig. 9, the VGG-M network performed marginally better than our CNN in the feature extraction stage with base training. With just fine-tuning, our network beat the VGG-M network for most of the objects by a fair margin. More interestingly, our network had significantly faster inference on the CPU. On average, our CNN in our framework was able to refine poses more than three times as fast as the VGG-M network. We believe that our network provides a valid tradeoff between performance and inference time.
Object  Baseline  VGG-M (Base + Fine)  VGG-M (Fine only) 

Tin Can  9.87  10.23  10.78 
Crackers  5.39  4.95  7.23 
Sugar Box  7.15  6.25  8.13 
Mustard  4.80  4.65  6.82 
Gelatin  6.72  5.87  6.34 
Banana  7.50  5.32  8.94 
Bleach  6.53  6.34  8.75 
Drill  4.69  4.81  6.87 
Scissors  7.34  6.54  10.90 
Average  6.67  6.11  8.30 
Network  Average FPS  Average Performance 

Ours  78 fps  6.67 
VGG-M (Base + Fine)  24 fps  6.11 
VGG-M (Fine only)  24 fps  8.30 
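Average frames per second of this kind are usually obtained by timing repeated forward passes after a short warm-up. The helper below is a generic sketch of that procedure, not the authors' benchmarking code; `step` stands in for one call of the refinement pipeline on one sample, and the function names and iteration counts are assumptions.

```python
import time

def measure_fps(step, n_iters=200, n_warmup=10):
    """Average calls per second of `step`, measured after a short warm-up."""
    for _ in range(n_warmup):
        step()  # warm-up calls are not timed
    start = time.perf_counter()
    for _ in range(n_iters):
        step()
    elapsed = time.perf_counter() - start
    return n_iters / max(elapsed, 1e-9)  # guard against a zero-length interval

def speedup(fps_a, fps_b):
    """How many times faster a pipeline running at fps_a is than one at fps_b."""
    return fps_a / fps_b
```

With the figures in the table, `speedup(78, 24)` evaluates to 3.25, which matches the statement that the shallow network refines poses more than three times as fast.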
6.2.3 Different Input Pose Estimate Distributions
Our pose refinement framework is flexible enough to be used with pose estimation frameworks of varying performance. To verify that it works with an extremely coarse pose estimation system as well as with a highly precise one, we simulated tests with varying pose estimate inputs. For our baseline, pose estimates for our training and testing data were generated using axis-angle rotations with angles drawn from the uniform distribution of eq. 15. We compared this performance against pose estimates with angles drawn from two different normal distributions. The results can be seen in table 5 and fig. 10.
Our framework proved to be extremely robust when given input pose estimates of varying precision. Even a highly precise estimate drawn from the narrower normal distribution was further refined by our framework, while refinement performance remained consistent with a coarse input pose estimate drawn from the wider normal distribution. In both cases, the standard deviation for each pose went down significantly.
Training Samples  Baseline  

Tin Can  9.87  12.73  4.21 
Crackers  5.39  9.84  3.74 
Sugar Box  7.15  10.45  3.95 
Mustard  4.80  10.34  3.68 
Gelatin  6.72  9.74  4.17 
Banana  7.50  12.65  4.85 
Bleach  6.53  9.21  4.36 
Drill  4.69  10.18  3.12 
Scissors  7.34  11.65  4.28 
Average  6.67  10.75  4.04 
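Simulating pose estimators of different quality only requires changing the distribution from which the perturbation angle is drawn. The sketch below shows one way to switch between a uniform and a normal distribution. The distribution parameters are not recoverable from the text, so the numbers used in the usage note are placeholders, and the function names are hypothetical.

```python
import random
import statistics

def sample_angle(rng, kind, **params):
    """Draw one perturbation angle (degrees) from the chosen distribution."""
    if kind == "uniform":
        return rng.uniform(params["low"], params["high"])
    if kind == "normal":
        return rng.gauss(params["mean"], params["std"])
    raise ValueError("unknown distribution: %r" % (kind,))

def mean_abs_angle(kind, n=10000, seed=0, **params):
    """Average magnitude of the simulated input error, for a quick sanity check."""
    rng = random.Random(seed)
    return statistics.mean(abs(sample_angle(rng, kind, **params)) for _ in range(n))
```

For instance, `mean_abs_angle("normal", mean=0.0, std=5.0)` and `mean_abs_angle("normal", mean=0.0, std=20.0)` give the average input error of a hypothetical precise and a hypothetical coarse estimator, which can then be compared with the refined errors reported in the table.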
6.2.4 Overfitting to Images
Our network should factor both the bounding box image and the input pose estimate into the final refined pose. A challenge for our network architecture is to ensure that the network does not overfit to the training images. Our solution to this challenge was to have our network output a refining rotation rather than a final refined pose. To verify this design choice, we tested our network by resampling each of our testing images multiple times with a different pose estimate for each sample. Each pose estimate was generated using axis-angle rotations with angles drawn from the uniform distribution of eq. 15.
Object  Baseline  With Resampling 

Tin Can  9.87  9.57 
Crackers  5.39  5.24 
Sugar Box  7.15  7.34 
Mustard  4.80  4.96 
Gelatin  6.72  6.78 
Banana  7.50  7.38 
Bleach  6.53  6.87 
Drill  4.69  4.96 
Scissors  7.34  7.24 
Average  6.67  6.70 
7 Conclusion
We have proposed a new 3D pose refinement framework that can be used in conjunction with a coarse pose estimation framework to produce highly precise object pose estimates with minimal training data. Through our experiments, we have shown that our framework is robust, fast and precise. We have also used these experiments to justify several of our design choices.
References

[1] (2016) Uncertainty-driven 6D pose estimation of objects and scenes from a single RGB image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3364–3372.
[2] (2014) Return of the devil in the details: delving deep into convolutional nets. arXiv preprint arXiv:1405.3531.
[3] (2012) Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In Asian Conference on Computer Vision, pp. 548–562.
[4] (2017) SSD-6D: making RGB-based 3D detection and 6D pose estimation great again. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1521–1529.
[5] Enforcing output constraints via SGD: a step towards neural Lagrangian relaxation.
[6] (1999) Object recognition from local scale-invariant features. In ICCV, Vol. 99, pp. 1150–1157.
[7] (2017) 3D pose regression using convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2174–2182.
[8] (2018) Deep model-based 6D pose refinement in RGB. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 800–815.
[9] (2015) Training a feedback loop for hand pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3316–3324.
[10] (2017) 6-DoF object pose from semantic keypoints. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2011–2018.
[11] (2009) Direction cosine matrix IMU: theory. Diy Drone: Usa, pp. 13–15.
[12] (2017) BB8: a scalable, accurate, robust to partial occlusion method for predicting the 3D poses of challenging objects without using depth. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3828–3836.
[13] (2015) Render for CNN: viewpoint estimation in images using CNNs trained with rendered 3D model views. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2686–2694.
[14] (2018) Real-time seamless single shot 6D object pose prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 292–301.
[15] (2009) Distributed image-based 3D localization of camera sensor networks. In Proceedings of the 48h IEEE Conference on Decision and Control (CDC) held jointly with 2009 28th Chinese Control Conference, pp. 901–908.
[16] (2015) Viewpoints and keypoints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1510–1519.
[17] (2016) Single image 3D interpreter network. In European Conference on Computer Vision, pp. 365–382.
[18] (2018) PoseCNN: a convolutional neural network for 6D object pose estimation in cluttered scenes.