1 Introduction
In the past few years, the fast-paced progress of generic image recognition on ImageNet [11] has drawn increasing research attention to classifying fine-grained object categories [9, 26], e.g., bird species [27] and car makes and models [10]. However, simply recognizing object labels is still far from solving many industrial problems that require a deeper understanding of other attributes of the object [12]. In this work, we study the problem of estimating the 3D pose of fine-grained objects from monocular images. We believe this will become an indispensable component of broader tasks. For example, to build a vision-based car damage assessment system, an important step is to estimate the exact pose of the car so that the damaged part can be well aligned for further detailed analysis.

To address this task, collecting suitable data is of vital importance. However, large-scale as they are, recent category-level pose estimation datasets are typically designed for generic object types [29, 30], and there is so far no large-scale pose dataset for fine-grained object categories. Although datasets on generic object types contain decent pose information, they lack fine-detailed matching of object shapes during annotation, since they usually use only a few universal 3D object models to match a group of differently shaped objects in one hyper-class [30]. In this work, we introduce a new dataset for benchmarking pose estimation on fine-grained objects. Specifically, we augment two existing fine-grained recognition datasets, StanfordCars [10] and CompCars [32], with two types of useful 3D information: (i) for each car in each image, we manually annotate the pose parameters of a full perspective projection; (ii) we provide an accurately matched computer-aided design (CAD) model for each category. The resulting augmented dataset consists of more than 20,000 images covering over 300 fine-grained categories.
To the best of our knowledge, this is the first work on fine-grained object pose estimation. Given the built dataset with high-quality pose annotations, we show that the pose parameters can be predicted from a single 2D image using appearance information only. Unlike most previous works [33, 25, 17], our method does not require the intermediate prediction of 2D/3D key points. In addition, we assume a full perspective model, a more challenging setting than previous works that estimate discrete/continuous viewpoint angles (azimuth) [4] or recover rotation matrices only [14]. Our goal is that when the fine-grained 3D model is projected according to the regressed pose, the projection aligns well with the object in the 2D image. To tackle this problem, we integrate pose estimation into the Faster/Mask R-CNN framework [19, 7] by sharing information between the detection and pose estimation branches. However, this simple extension alone leads to inaccurate predictions. We therefore introduce a dense 3D representation, named the 3D location field, into the end-to-end deep framework; it maps each foreground pixel to its 3D location on the model surface. The idea of using pixel-to-3D-coordinate correspondences was previously explored in multi-stage frameworks with RGB-D input [24, 21, 1]. Within an end-to-end deep framework with RGB input at the category level, we show that this representation provides powerful supervision for CNNs to efficiently capture the 3D shape of objects. Additionally, it requires no rendering, so there is no domain gap between real-world annotated data and synthetic data. By pre-training on a large amount of synthetic location fields, we overcome both the shortage of data and the domain gap that rendering would cause.
Our contribution is threefold. First, we collect a new large-scale 3D pose dataset for fine-grained objects with a better match to the fine-detailed shapes of the objects. Second, we propose a system based on Faster/Mask R-CNN that estimates the parameters of a full perspective model on our dataset. Third, we integrate the location field, a dense 3D representation that efficiently encodes object 3D shapes, into the deep framework in an end-to-end fashion. This goes beyond previous works on category-level pose estimation, which only estimate discrete/continuous viewpoint angles or recover rotation matrices, often with the help of key points.
Table 1: Comparison of our dataset with existing object pose datasets.

Dataset                  # classes  # images  Annotation        Fine-grained
3D Object [20]           10         6,675     discretized view  ✗
EPFL Cars [16]           1          2,299     continuous view   ✗
Pascal 3D+ [30]          12         30,899    2D-3D alignment   ✗
ObjectNet3D [29]         100        90,127    2D-3D alignment   ✗
StanfordCars 3D (Ours)   196        16,185    2D-3D alignment   ✓
CompCars 3D (Ours)       113        5,696     2D-3D alignment   ✓
Total (Ours)             309        21,881    2D-3D alignment   ✓
2 Related Work
Dataset. Earlier object pose datasets are limited not only in scale but also in the types of annotation they cover. Table 1 provides a quantitative comparison between our dataset and previous ones. For example, the 3D Object dataset [20] provides only viewpoint annotations for 10 object classes with 10 instances each. The EPFL Cars dataset [16] consists of 2,299 images of 20 car instances captured at multiple azimuth angles; the other parameters, including elevation and distance, are kept almost constant across instances to simplify the problem [16]. Pascal 3D+ [30] is perhaps the first large-scale 3D pose dataset for generic object categories, with 30,899 images from 12 classes of the Pascal VOC dataset [3]. More recently, the ObjectNet3D dataset [29] extended the scale to 90,127 images of 100 categories. Both Pascal 3D+ and ObjectNet3D assume a camera model with 6 parameters to annotate. However, different images in one hyper-class (e.g., cars) are usually matched to a few coarse 3D CAD models, so the projection error can be large when no accurate CAD model is available. Aware of these problems, we instead project fine-grained CAD models to match the images. In addition, our dataset surpasses most previous ones in both the number of images and the number of classes.
Pose Estimation. Although continuous pose parameters are available for datasets such as Pascal 3D+, the majority of previous works [30, 25, 18, 4, 23] still cast pose estimation as multi-class classification over discrete viewpoint angles, which can be further refined as shown in [31, 6]. Very few works [17, 14] directly regress continuous pose parameters. Although [17] estimates a weak-perspective model for object categories and can overlay 3D models onto 2D images for visualization, its quantitative evaluation is limited to 3D rotations. In contrast, we tackle the more challenging problem of estimating full perspective matrices from a single image. Our new dataset allows us to quantitatively evaluate the estimated perspective projection. Based on it, we design a new efficient CNN framework, as well as a new 3D representation that further improves pose estimation accuracy.
Fine-Grained Recognition. Fine-grained recognition refers to the task of distinguishing subordinate categories [27, 10, 26]. In earlier works, 3D information was a common source of recognition performance gains [34, 28, 15, 22]. As deep learning prevailed and fine-grained datasets grew larger [13, 9], the effect of 3D information on recognition diminished. Recently, [22] incorporated 3D bounding boxes into a deep framework for images of cars taken from a fixed camera. On the other hand, almost all existing fine-grained datasets lack 3D pose labels or 3D shape information [10], and pose estimation for fine-grained object categories is not well studied. Our work fills this gap by annotating poses and matching CAD models on two existing popular fine-grained recognition datasets, and by performing the new task of pose estimation based on the augmented annotations.

3 Dataset
Our dataset annotation process is similar to that of ObjectNet3D [29]. We first select the most appropriate 3D car model from ShapeNet [2] for each category in the fine-grained image dataset. For each image, we then obtain its pose parameters by asking annotators to align the projection of the 3D model with the image using our annotation interface.
3.1 3D Models
We build two fine-grained 3D pose datasets for vehicles. Each dataset consists of two parts: 2D images and 3D models. The 2D images of vehicles are collected from StanfordCars [10] and CompCars [32], respectively. The target objects in most images are non-occluded and easy to identify. To distinguish between fine-grained categories, we adopt a distinct model for each category. Thanks to ShapeNet [2], a large number of 3D models of fine-grained vehicles are available with make/model names in their metadata, which we use to find the corresponding 3D model given an image category name. If there is no exact match between a category name and the metadata, we manually select a visually similar 3D model for that category. For StanfordCars, we annotate images for all 196 categories, of which 148 have exactly matched models. For CompCars, we only include the 113 categories with matched 3D models in ShapeNet. To the best of our knowledge, our dataset is the first to employ fine-grained, category-aware 3D models for 3D pose estimation.
3.2 Camera Model
The world coordinate system is defined in accordance with the 3D model coordinate system. In this setting, a point $\mathbf{X} = [X, Y, Z, 1]^{\top}$ on a 3D model (in homogeneous coordinates) is projected onto a point $\mathbf{x} = [x, y, 1]^{\top}$ on the 2D image:

$$\mathbf{x} \equiv P\mathbf{X} \tag{1}$$

via a perspective projection matrix:

$$P = K\,[R \mid T] \tag{2}$$

where $K$ denotes the intrinsic parameters:

$$K = \begin{bmatrix} f & 0 & u \\ 0 & f & v \\ 0 & 0 & 1 \end{bmatrix} \tag{3}$$

and $R$ encodes the rotation between the world and camera coordinate systems, parameterized by three angles: elevation $e$, azimuth $a$ and in-plane rotation $\theta$. We assume that the camera always faces the origin of the 3D model. Hence the translation $T$ is defined only up to the model depth $d$, the distance between the origins of the two coordinate systems, and the principal point $(u, v)$ is the projection of the origin of the world coordinate system onto the image. As a result, our model has 7 parameters in total: camera focal length $f$, principal point location $(u, v)$, azimuth $a$, elevation $e$, in-plane rotation $\theta$ and model depth $d$. Note that, since the images are collected online, even the annotated intrinsic parameters ($f$, $u$ and $v$) are approximations. Compared with previous annotations [30, 29] with 6 parameters ($f$ fixed), our camera model includes both the camera focal length and the object depth in a full perspective projection, enabling finer 2D-3D alignment.
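To make the camera model concrete, below is a minimal NumPy sketch of assembling the projection matrix of Equations (1)-(3) from the 7 annotated parameters. The Euler-angle convention (azimuth about the up axis, then elevation, then in-plane rotation about the viewing axis) is an assumption for illustration; the dataset's exact convention is defined by the annotation tool.

```python
import numpy as np

def intrinsics(f, u, v):
    """Intrinsic matrix K of Equation (3)."""
    return np.array([[f, 0.0, u],
                     [0.0, f, v],
                     [0.0, 0.0, 1.0]])

def rotation(a, e, theta):
    """World-to-camera rotation R from azimuth a, elevation e and
    in-plane rotation theta (radians). The axis order is an assumed
    convention, not necessarily the one used by the annotation tool."""
    Ra = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a),  np.cos(a), 0.0],
                   [0.0, 0.0, 1.0]])                  # azimuth about z
    Re = np.array([[1.0, 0.0, 0.0],
                   [0.0, np.cos(e), -np.sin(e)],
                   [0.0, np.sin(e),  np.cos(e)]])     # elevation about x
    Rt = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])                  # in-plane rotation
    return Rt @ Re @ Ra

def projection_matrix(f, u, v, a, e, theta, d):
    """Full perspective projection P = K [R | T] of Equation (2).
    Because the camera faces the model origin, T reduces to [0, 0, d]."""
    K = intrinsics(f, u, v)
    R = rotation(a, e, theta)
    T = np.array([[0.0], [0.0], [d]])
    return K @ np.hstack([R, T])

def project(P, X):
    """Project homogeneous 3D points X (4 x N) to pixels (2 x N), Eq. (1)."""
    x = P @ X
    return x[:2] / x[2]
```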
3.3 2D-3D Alignment
We annotate 3D pose information for all 2D images in our datasets through crowdsourcing. To facilitate the process, we developed the annotation tool illustrated in Figure 2. For each image, we choose the 3D model according to the fine-grained car type given beforehand. We then ask the annotators to adjust the 7 parameters so that the projected 3D model aligns with the target object in the 2D image. The process can be roughly summarized as follows: (1) shift the 3D model so that its center (the origin of the world coordinate system) is roughly aligned with the center of the target object in the 2D image; (2) rotate the model to the same orientation as the target object; (3) adjust the model depth and camera focal length to match the size of the target object. Some finer adjustment may follow these three main steps. In this way we annotate all 7 parameters across the whole dataset. On average, each image takes an experienced annotator approximately 60 seconds. To ensure quality, after one round of annotation over the whole dataset, we perform a quality check and have the annotators revise unqualified examples in a second round.
3.4 Dataset Statistics
We plot the distributions of azimuth ($a$), elevation ($e$) and in-plane rotation ($\theta$) in Figures 3 and 4 for StanfordCars 3D and CompCars 3D, respectively. Due to the nature of the original fine-grained recognition datasets, the azimuth is not uniformly distributed, though the distributions of the two datasets are complementary to some degree. As expected, elevations and in-plane rotations vary little, since the images of cars are usually taken from ground level.
4 3D Pose Estimation for Fine-Grained Object Categories
Given an input image of a fine-grained object, our task is to predict all 7 parameters of Equation (2), i.e., the 3D rotation $(a, e, \theta)$, the model depth $d$, the principal point $(u, v)$ and the focal length $f$, such that the projected 3D model aligns well with the object in the 2D image.
4.1 Baseline Framework
Our baseline method uses only 2D appearance to regress the pose parameters. It is a modified version of Faster R-CNN [19], which was originally designed for object detection. Casting our pose estimation problem in a detection framework is motivated by the relation between the two tasks. Since we do not use key points as an attention mechanism, performing pose estimation within the region of interest (RoI) discards unrelated image regions and hence makes more effective use of 2D information. In addition, 3D pose estimation is highly related to the detection task, especially through the intrinsic parameters in Equation (3).
We parameterize the 3D rotation as a quaternion, converted from the angles $(a, e, \theta)$. The principal point is highly correlated with the RoI center, so we regress $(\Delta u, \Delta v)$, the offset of the principal point from the RoI center. This offset exists because, depending on the pose, the projection of the 3D object center need not coincide with the 2D box center. The remaining parameters ($f$ and $d$) are regressed in their standard form.
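As a sketch of how the regression targets could be assembled for one RoI (the SciPy Euler order mirrors the convention assumed in the camera-model sketch above; the helper name is ours, not the paper's):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def pose_targets(a, e, theta, f, d, u, v, roi_cx, roi_cy):
    """Regression targets for one RoI: unit quaternion for the rotation,
    principal-point offset from the RoI center, and f, d as-is."""
    # 'zxz' extrinsic order is an assumed convention
    q = Rotation.from_euler('zxz', [a, e, theta]).as_quat()  # (x, y, z, w)
    du, dv = u - roi_cx, v - roi_cy   # offset from the RoI center
    return np.concatenate([q, [du, dv], [f, d]])
```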
The modification to the network architecture is relatively straightforward. As shown in Figure 5, we add a pose estimation branch alongside the existing class prediction and bounding box regression branches. As in the bounding box regression branch, the estimation of each group of pose parameters consists of a fully-connected (FC) layer and a smoothed $L_1$ loss. The RoI centers are also used to adjust the regression targets at training time and to generate the final predictions at test time, as discussed above. For each training image, the bounding box is derived from the perspective projection of the corresponding 3D model. Since we have fine-grained 3D models and high-quality annotations, these bounding boxes fit their objects tightly.
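A minimal PyTorch-style sketch of what the added branch might look like follows; the layer sizes and output grouping are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class PoseHead(nn.Module):
    """Pose branch added beside the class/box heads, cf. Figure 5."""
    def __init__(self, in_dim=1024):
        super().__init__()
        self.fc_quat = nn.Linear(in_dim, 4)  # rotation as a quaternion
        self.fc_pp = nn.Linear(in_dim, 2)    # principal-point offset (du, dv)
        self.fc_fd = nn.Linear(in_dim, 2)    # focal length f and depth d

    def forward(self, roi_feat):
        q = self.fc_quat(roi_feat)
        q = q / q.norm(dim=-1, keepdim=True)  # normalize to a unit quaternion
        return q, self.fc_pp(roi_feat), self.fc_fd(roi_feat)

# Each group of pose parameters is supervised with a smoothed L1 loss:
smooth_l1 = nn.SmoothL1Loss()
```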
4.2 Improving Pose Estimation via the 3D Location Field
The key difference between our dataset and previous ones is that we have fine-grained 3D models, so the projection aligns better with the image. This advantage allows us to explore dense 3D representations, in addition to 2D appearance, for regressing the pose parameters.
Given an object in an image and its 3D model, our representation, named the 3D location field, maps every foreground pixel to its corresponding location on the surface of the 3D model, i.e., $(x, y) \mapsto (X, Y, Z)$. The resulting field has the same spatial size as the image and three channels containing the $X$, $Y$ and $Z$ coordinates, respectively. A sample image with its 3D location field is shown in Figure 6. The 3D location field is a dense representation of 3D information that can be used directly as network input.
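As an illustration, a ground-truth field can be approximated by splatting each model vertex's model-space coordinates into the pixel it projects to. A real implementation would rasterize triangles with a z-buffer to handle occlusion correctly; this sketch omits that.

```python
import numpy as np

def location_field(vertices, P, height, width):
    """Approximate 3D location field (H x W x 3): each foreground pixel
    stores the (X, Y, Z) model coordinates that project onto it.
    vertices: (N, 3) model points; P: 3 x 4 projection matrix."""
    field = np.zeros((height, width, 3), dtype=np.float32)  # background = 0
    X_h = np.hstack([vertices, np.ones((len(vertices), 1))]).T  # 4 x N
    x = P @ X_h
    px = np.round(x[0] / x[2]).astype(int)
    py = np.round(x[1] / x[2]).astype(int)
    ok = (0 <= px) & (px < width) & (0 <= py) & (py < height)
    field[py[ok], px[ok]] = vertices[ok]  # note: no z-buffering here
    return field
```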
We explore the use of the 3D location field to improve pose estimation based on Mask R-CNN. Since we still expect only a 2D image as input at test time, we regress the 3D location field and then use the regressed field for pose estimation. Based on the framework in Figure 5, we add a branch that regresses the 3D location field (instead of the binary masks in Mask R-CNN). The regressed location fields are fed into a CNN consisting of additional convolutional layers followed by FC layers that regress the pose parameters. The regressions from 2D appearance (as in Figure 5) and from the 3D location field are then combined to produce the final pose parameters. Figure 7 shows the detailed network structure.
We train the pose regression from location fields using a large amount of synthetic data. The synthetic location fields are generated from the 3D models under various predefined poses. The location field is a very suitable representation for synthetic data augmentation for two reasons: (i) the field encodes only 3D location information, without any rendering of the 3D models, and thus naturally avoids the domain gap between synthetic and photo-realistic data; (ii) the field is invariant to the color, texture and scale of the images.
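For instance, synthetic fields could be produced by sampling poses and reusing the field rasterization sketch above; the sampling ranges here are assumptions chosen to mimic the ground-view statistics of Section 3.4.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_synthetic_pose():
    """Sample one synthetic pose (assumed ranges, roughly ground-view)."""
    a = rng.uniform(0.0, 2.0 * np.pi)   # azimuth: full circle
    e = rng.uniform(0.0, np.pi / 6.0)   # elevation: low, near ground view
    theta = rng.normal(0.0, 0.05)       # small in-plane rotation
    d = rng.uniform(3.0, 10.0)          # model depth
    return a, e, theta, d
```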
5 Experiments
5.1 Evaluation Metrics
For each test sample, we use two metrics to comprehensively evaluate the estimated pose.
Following [25, 17], the first metric, Rotation Error, focuses on the quality of the viewpoint estimation only. Given the predicted and ground truth rotation matrices $R$ and $\tilde{R}$, the difference between the two, measured by geodesic distance, is

$$\Delta(R, \tilde{R}) = \frac{\left\| \log\left( R^{\top} \tilde{R} \right) \right\|_{F}}{\sqrt{2}}.$$
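A small sketch of this metric: for rotation matrices, $\|\log(R^{\top}\tilde{R})\|_{F}/\sqrt{2}$ equals the closed form $\arccos\!\big((\mathrm{tr}(R^{\top}\tilde{R}) - 1)/2\big)$, which is simpler to compute.

```python
import numpy as np

def rotation_error(R_pred, R_gt):
    """Geodesic distance between two rotation matrices, in radians."""
    cos_angle = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))  # clip for stability
```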
The second metric evaluates the overall quality of the perspective projection. It is based on the Average Distance of Model Points [8], which measures the average distance between predicted projected points and their corresponding ground truth projections. Concretely, given one test sample with predicted pose $P$, ground truth pose $\tilde{P}$ and corresponding 3D model $\mathcal{M}$, the metric is defined as

$$D_{proj} = \frac{1}{|\mathcal{M}|} \sum_{\mathbf{X} \in \mathcal{M}} \left\| P\mathbf{X} - \tilde{P}\mathbf{X} \right\|_{2}. \tag{4}$$

According to [8], this is the most widely used error function for evaluating a projection matrix. The unit of the above distance is pixels. To make the metric scale-invariant, we normalize it by the diameter of the 2D bounding box and denote the normalized distance as $\hat{D}_{proj}$. It is worth mentioning again that the 3D models are only used when computing the evaluation metrics; at test time, only a single 2D image is fed into the network to predict the pose.
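A self-contained sketch of the normalized projection metric of Equation (4):

```python
import numpy as np

def normalized_proj_distance(P_pred, P_gt, vertices, bbox_diameter):
    """Average Distance of Model Points (Eq. 4), normalized by the
    diameter of the 2D bounding box. vertices: (N, 3) model points."""
    X_h = np.hstack([vertices, np.ones((len(vertices), 1))]).T  # 4 x N
    def project(P):
        x = P @ X_h
        return x[:2] / x[2]
    dists = np.linalg.norm(project(P_pred) - project(P_gt), axis=0)
    return dists.mean() / bbox_diameter
```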
To measure performance over the whole test set, we compute the mean and median of $\Delta(R, \tilde{R})$ and $\hat{D}_{proj}$ over all test samples. In addition, by thresholding the two metrics, we obtain accuracies. For $\Delta(R, \tilde{R})$, following [25, 17], we set the threshold to $\frac{\pi}{6}$ and denote the resulting accuracy as $Acc_{\pi/6}$. For $\hat{D}_{proj}$, the common threshold is 0.1 (denoted $Acc_{0.1}$), meaning that a prediction whose average projection error is less than 10% of the 2D diameter is considered correct.
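Aggregating over the test set is then straightforward, using the thresholds stated above:

```python
import numpy as np

def summarize(rot_errors, proj_errors):
    """Median/mean of both metrics plus Acc_{pi/6} and Acc_{0.1}."""
    rot = np.asarray(rot_errors)
    proj = np.asarray(proj_errors)
    return {
        'median_rot': np.median(rot), 'mean_rot': rot.mean(),
        'acc_pi_6': (rot < np.pi / 6.0).mean(),
        'median_proj': np.median(proj), 'mean_proj': proj.mean(),
        'acc_0.1': (proj < 0.1).mean(),
    }
```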
5.2 Experimental Settings
Data Split. For StanfordCars 3D, since we annotated all the images, we follow the standard train/test split of the original dataset [10], with 8,144 training and 8,041 test examples. For CompCars 3D, we randomly sample two-thirds of our annotated data as the training set and keep the rest for testing, resulting in 3,798 training and 1,898 test examples.
Baseline Implementation. Our implementation is based on the Detectron package [5], which includes Faster/Mask R-CNN implementations. The convolutional body (the "backbone" in [7]) used for the baseline is ResNet-50. For a fair comparison, the convolutional body is initialized from an ImageNet pre-trained model and the other layers are randomly initialized (i.e., we do not use COCO pre-trained detectors). Following the Mask R-CNN setting, the whole network is trained end-to-end. At test time, we adopt a cascaded strategy in which the 3D pose branch is applied only to the highest-scoring box prediction.
Comparison to Previous Baselines. It is worth mentioning that, when only the rotation error of Section 5.1 is evaluated, our baseline in Figure 5 is almost identical to the baselines in Pascal 3D+ [30] and ObjectNet3D [29], except that their detection and pose estimation heads are parallel while ours are cascaded.
3D Location Field. As described in Section 4.2, incorporating 3D location fields involves two steps: field regression, and pose regression from fields. Field regression is trained together with detection and the baseline pose estimation in an end-to-end fashion, similar to Mask R-CNN; the ground truth training fields are generated from the annotations (3D models and poses). The second step, pose regression from fields, is trained on synthetic data generated from the pool of matched 3D models in each dataset (38,102 and 14,017 synthetic samples for StanfordCars 3D and CompCars 3D, respectively). We only regress the quaternion from the location fields.
Table 2: Results on StanfordCars 3D. Rotation errors are in degrees; $\hat{D}_{proj}$ is the normalized projection error.

Method     Median $\Delta(R,\tilde{R})$ (°)  Mean $\Delta(R,\tilde{R})$ (°)  $Acc_{\pi/6}$ (%)  Median $\hat{D}_{proj}$  Mean $\hat{D}_{proj}$  $Acc_{0.1}$ (%)
Baseline   6.68                              9.89                            96.59              0.0888                   0.1087                 60.04
w/ Field   5.68                              7.67                            98.73              0.0834                   0.0977                 66.07

Table 3: Results on CompCars 3D. Rows prefixed "FT" are fine-tuned from a StanfordCars 3D pre-trained model.

Method       Median $\Delta(R,\tilde{R})$ (°)  Mean $\Delta(R,\tilde{R})$ (°)  $Acc_{\pi/6}$ (%)  Median $\hat{D}_{proj}$  Mean $\hat{D}_{proj}$  $Acc_{0.1}$ (%)
Baseline     8.09                              13.02                           93.62              0.1275                   0.1580                 32.52
w/ Field     6.14                              8.98                            98.00              0.1141                   0.1408                 40.15
FT Baseline  5.51                              8.69                            96.84              0.0878                   0.1123                 58.58
FT w/ Field  4.74                              7.45                            98.31              0.0836                   0.1047                 64.01
5.3 Results and Analysis
The quantitative results for StanfordCars 3D and CompCars 3D are shown in Tables 2 and 3, respectively. Figure 8 shows how the accuracy changes with the threshold on $\hat{D}_{proj}$ for both datasets. For CompCars 3D, besides ImageNet initialization we also report results fine-tuned from a StanfordCars 3D pre-trained model, since StanfordCars 3D offers considerably more training samples.
As can be seen in Tables 2 and 3, our baseline performs very well at estimating the rotation matrix on both datasets, with a median $\Delta(R, \tilde{R})$ below 10 degrees and $Acc_{\pi/6}$ around 95%. While recovering the full perspective model is a much more challenging task, Table 2 shows that promising performance can be achieved given enough properly annotated training samples: for StanfordCars 3D, the median $\hat{D}_{proj}$ (the median of the average projection error) is less than 10% of the diameter of the 2D bounding box. When the training set is limited, the first and third rows of Table 3 demonstrate the effectiveness of transfer learning from a larger dataset. Regarding the 3D location field, we observe a consistent performance gain on both datasets. The main reasons are twofold: (i) this 3D representation enables the use of large amounts of synthetic training data with no domain gap; (ii) our field regression, adapted from Mask R-CNN, works well enough that even pose prediction based on the regressed field helps substantially at test time.

We visualize the predicted poses in Figure 9. As shown on the left of Figure 9, our method handles projections of various orientations, scales and locations. The right part of Figure 9 shows failure cases, indicating that there is still room for improvement, especially in the estimation of scale, cases with large perspective distortion, and uncommon poses with few training samples.
6 Conclusion
We study the problem of pose estimation for fine-grained object categories. We annotate two popular fine-grained recognition datasets with fine-grained 3D shapes and poses. We propose an approach that estimates the full perspective parameters from a single image, and we further propose the 3D location field, a dense 3D representation that facilitates pose estimation. Experiments on our datasets suggest that this is an interesting problem for future research.
References
 [1] Brachmann, E., Krull, A., Michel, F., Gumhold, S., Shotton, J., Rother, C.: Learning 6D object pose estimation using 3D object coordinates. In: ECCV (2014)
 [2] Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: ShapeNet: An informationrich 3D model repository. arXiv preprint arXiv:1512.03012 (2015)
 [3] Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision 88(2), 303–338 (2010)
 [4] Ghodrati, A., Pedersoli, M., Tuytelaars, T.: Is 2D information enough for viewpoint estimation? In: BMVC (2014)
 [5] Girshick, R., Radosavovic, I., Gkioxari, G., Dollár, P., He, K.: Detectron. https://github.com/facebookresearch/detectron (2018)
 [6] Hara, K., Vemulapalli, R., Chellappa, R.: Designing deep convolutional neural networks for continuous object orientation estimation. arXiv preprint arXiv:1702.01499 (2017)
 [7] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV (2017)
 [8] Hodan, T., Matas, J., Obdrzálek, S.: On evaluation of 6D object pose estimation. In: ECCV 2016 Workshops, Part III. pp. 606–619 (2016)
 [9] Krause, J., Sapp, B., Howard, A., Zhou, H., Toshev, A., Duerig, T., Philbin, J., Fei-Fei, L.: The unreasonable effectiveness of noisy data for fine-grained recognition. In: ECCV. pp. 301–320 (2016)
 [10] Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3D object representations for fine-grained categorization. In: ICCV Workshops on 3D Representation and Recognition (2013)
 [11] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)
 [12] Lim, J.J., Pirsiavash, H., Torralba, A.: Parsing IKEA objects: Fine pose estimation. In: ICCV (2013)
 [13] Lin, T.Y., Roy-Chowdhury, A., Maji, S.: Bilinear CNN models for fine-grained visual recognition. In: ICCV (2015)
 [14] Mahendran, S., Ali, H., Vidal, R.: 3D pose regression using convolutional neural networks. In: ICCV. vol. 1, p. 4 (2017)
 [15] Mottaghi, R., Xiang, Y., Savarese, S.: A coarse-to-fine model for 3D pose estimation and subcategory recognition. In: CVPR (2015)
 [16] Ozuysal, M., Lepetit, V., Fua, P.: Pose estimation for category specific multiview object localization. In: CVPR (2009)
 [17] Pavlakos, G., Zhou, X., Chan, A., Derpanis, K.G., Daniilidis, K.: 6DOF object pose from semantic keypoints. In: ICRA. pp. 2011–2018 (2017)
 [18] Pepik, B., Stark, M., Gehler, P., Schiele, B.: Teaching 3D geometry to deformable part models. In: CVPR (2012)
 [19] Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: NIPS (2015)
 [20] Savarese, S., FeiFei, L.: 3D generic object categorization, localization and pose estimation. In: ICCV (2007)
 [21] Shotton, J., Glocker, B., Zach, C., Izadi, S., Criminisi, A., Fitzgibbon, A.W.: Scene coordinate regression forests for camera relocalization in RGB-D images. In: CVPR (2013)
 [22] Sochor, J., Herout, A., Havel, J.: BoxCars: 3D boxes as CNN input for improved fine-grained vehicle recognition. In: CVPR (2016)
 [23] Su, H., Qi, C.R., Li, Y., Guibas, L.J.: Render for CNN: Viewpoint estimation in images using CNNs trained with rendered 3D model views. In: ICCV (2015)
 [24] Taylor, J., Shotton, J., Sharp, T., Fitzgibbon, A.W.: The vitruvian manifold: Inferring dense correspondences for one-shot human pose estimation. In: CVPR (2012)
 [25] Tulsiani, S., Malik, J.: Viewpoints and keypoints. In: CVPR. pp. 1510–1519 (2015)
 [26] Van Horn, G., Mac Aodha, O., Song, Y., Shepard, A., Adam, H., Perona, P., Belongie, S.: The iNaturalist challenge 2017 dataset. arXiv preprint arXiv:1707.06642 (2017)
 [27] Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The Caltech-UCSD Birds-200-2011 dataset. Tech. rep. (2011)
 [28] Xiang, Y., Choi, W., Lin, Y., Savarese, S.: Data-driven 3D voxel patterns for object category recognition. In: CVPR. pp. 1903–1911 (2015)
 [29] Xiang, Y., Kim, W., Chen, W., Ji, J., Choy, C., Su, H., Mottaghi, R., Guibas, L., Savarese, S.: ObjectNet3D: A large scale database for 3D object recognition. In: ECCV (2016)
 [30] Xiang, Y., Mottaghi, R., Savarese, S.: Beyond PASCAL: A benchmark for 3D object detection in the wild. In: WACV (2014)
 [31] Yang, L., Liu, J., Tang, X.: Object detection and viewpoint estimation with auto-masking neural network. In: ECCV (2014)
 [32] Yang, L., Luo, P., Change Loy, C., Tang, X.: A large-scale car dataset for fine-grained categorization and verification. In: CVPR (2015)
 [33] Zhou, X., Leonardos, S., Hu, X., Daniilidis, K., et al.: 3D shape estimation from 2D landmarks: A convex relaxation approach. In: CVPR (2015)
 [34] Zia, M.Z., Stark, M., Schiele, B., Schindler, K.: Detailed 3D representations for object recognition and modeling. PAMI 35(11), 2608–2623 (2013)