In the past few years, the fast-paced progress of generic image recognition on ImageNet
has drawn increasing research attention to classifying fine-grained object categories [9, 26], e.g., bird species, car makes and models. However, simply recognizing object labels is still far from solving many industrial problems that require a deeper understanding of other attributes of the object. In this work, we study the problem of estimating the 3D pose of fine-grained objects from monocular images. We believe this will become an indispensable component of some broader tasks. For example, to build a vision-based car damage assessment system, an important step is to estimate the exact pose of the car so that the damaged part can be well aligned for further detailed analysis.
To address this task, collecting suitable data is of vital importance. However, large-scale as they are, recent category-level pose estimation datasets are typically designed for generic object types [29, 30], and there is so far no large-scale pose dataset for fine-grained object categories. Although datasets on generic object types contain decent pose information, they lack fine-detailed matching of object shapes during annotation, since they usually use only a few universal 3D object models to match a group of objects with different shapes within one hyper-class. In this work, we introduce a new dataset for benchmarking pose estimation on fine-grained objects. Specifically, we augment two existing fine-grained recognition datasets, StanfordCars and CompCars, with two types of useful 3D information: (i) for each car in an image, we manually annotate the pose parameters of a full perspective projection; (ii) we provide an accurate match of a computer-aided design (CAD) model for each category. The resulting augmented dataset consists of more than 20,000 images covering over 300 fine-grained categories.
To the best of our knowledge, this is the first work on fine-grained object pose estimation. Given the constructed dataset with high-quality pose annotations, we show that the pose parameters can be predicted from a single 2D image using only appearance information. Compared to most previous works [33, 25, 17], our method does not require the intermediate prediction of 2D/3D key points. In addition, we assume a full perspective model, which is a more challenging setting than previous works that estimate discrete/continuous viewpoint angles (azimuth) or recover only the rotation matrices. Our goal is that, by projecting the fine-grained 3D model according to the regressed pose, the projection aligns well with the object in the 2D image. To tackle this problem, we integrate pose estimation into the Faster/Mask R-CNN framework [19, 7] by sharing information between the detection and pose estimation branches. However, a simple extension leads to inaccurate predictions. Therefore, we introduce a dense 3D representation, named the 3D location field, into the end-to-end deep framework; it maps each pixel to its 3D location on the model surface. The idea of using pixel-to-3D-coordinate correspondences has been explored in multi-stage frameworks with RGB-D input [24, 21, 1]. Within an end-to-end deep framework with RGB input at the category level, we show that this representation provides powerful supervision for CNNs to efficiently capture the 3D shape of objects. Additionally, it requires no rendering, so there is no domain gap between real-world annotated data and synthetic data. Using a large amount of synthetic location fields for pre-training, we overcome the shortage of data as well as the domain gap that rendering would cause.
Our contribution is three-fold. First, we collect a new large 3D pose dataset for fine-grained objects with a better match to their fine-detailed shapes. Second, we propose a system based on Faster/Mask R-CNN that estimates the parameters of a full perspective model on our dataset. Third, we integrate the location field, a dense 3D representation that efficiently encodes object 3D shapes, into the deep framework in an end-to-end fashion. This goes beyond previous works on category-level pose estimation, which only estimate discrete/continuous viewpoint angles or recover rotation matrices, often with the help of key points.
Table 1: Comparison between our dataset and previous 3D pose datasets.

| Dataset | # classes | # images | Annotation | Fine-grained |
|---|---|---|---|---|
| 3D Object | 10 | 6,675 | discretized view | ✗ |
| EPFL Cars | 1 | 2,299 | continuous view | ✗ |
| Pascal 3D+ | 12 | 30,899 | 2D-3D alignment | ✗ |
| ObjectNet3D | 100 | 90,127 | 2D-3D alignment | ✗ |
| StanfordCars 3D (Ours) | 196 | 16,185 | 2D-3D alignment | ✓ |
| CompCars 3D (Ours) | 113 | 5,696 | 2D-3D alignment | ✓ |
| Total (Ours) | 309 | 21,881 | 2D-3D alignment | ✓ |
2 Related Work
Dataset. Earlier object pose datasets are limited not only in scale but also in the types of annotation they cover. Table 1 provides a quantitative comparison between our dataset and previous ones. For example, the 3D Object dataset only provides viewpoint annotations for 10 object classes with 10 instances each. The EPFL Car dataset consists of 2,299 images of 20 car instances captured at multiple azimuth angles; moreover, the other parameters, including elevation and distance, are kept almost constant across instances in order to simplify the problem. Pascal 3D+ is perhaps the first large-scale 3D pose dataset for generic object categories, with 30,899 images from 12 classes of the Pascal VOC dataset. Recently, the ObjectNet3D dataset further extended the scale to 90,127 images of 100 categories. Both the Pascal 3D+ and ObjectNet3D datasets assume a camera model with 6 parameters to annotate. However, different images in one hyper-class (e.g., cars) are usually matched to a few coarse 3D CAD models, so the projection error can be large due to the lack of accurate CAD models in some cases. Aware of these problems, we instead project fine-grained CAD models to match the images. In addition, our dataset surpasses most previous ones in both the number of images and the number of classes.
Pose Estimation. Despite the fact that continuous pose parameters are available for datasets such as Pascal 3D+, a majority of previous works [30, 25, 18, 4, 23] still cast the pose estimation problem as multi-class classification over discrete viewpoint angles, which can be further refined as shown in [31, 6]. Very few works other than [17, 14] directly regress the continuous pose parameters. Although prior work estimates a weak-perspective model for object categories and is able to overlay the 3D models onto 2D images for visualization, its quantitative evaluation is still limited to 3D rotations. In contrast, we tackle a more challenging problem: estimating the full perspective matrices from a single image. Our new dataset allows us to quantitatively evaluate the estimated perspective projection. Based on this, we design a new efficient CNN framework as well as a new 3D representation that further improves pose estimation accuracy.
Fine-Grained Recognition. Fine-grained recognition refers to the task of distinguishing sub-ordinate categories [27, 10, 26]. In earlier works, 3D information was a common source of recognition performance improvements [34, 28, 15, 22]. As deep learning prevails and fine-grained datasets grow larger [13, 9], the effect of 3D information on recognition has diminished. Recently, one approach incorporated 3D bounding boxes into a deep framework for images of cars taken from a fixed camera. On the other hand, almost all existing fine-grained datasets lack 3D pose labels or 3D shape information, and pose estimation for fine-grained object categories is not well studied. Our work fills this gap by annotating poses and matching CAD models on two existing popular fine-grained recognition datasets, and by performing the new task of pose estimation based on the augmented annotations.
3 Dataset

Our dataset annotation process is similar to that of ObjectNet3D. We first select the most appropriate 3D car model from ShapeNet for each category in the fine-grained image dataset. For each image, we then obtain its pose parameters by asking annotators to align the projection of the 3D model with the image using our designed interface.
3.1 3D Models
We build two fine-grained 3D pose datasets for vehicles. Each dataset consists of two parts, i.e., 2D images and 3D models. The 2D images of vehicles are collected from StanfordCars and CompCars, respectively. Target objects in most images are non-occluded and easy to identify. In order to distinguish between fine-grained categories, we adopt a distinct model for each category. Thanks to ShapeNet, a large number of 3D models of fine-grained vehicles are available with make/model names in their metadata, which we use to find the corresponding 3D model given an image category name. If there is no exact match between a category name and the metadata, we manually select a visually similar 3D model for that category. For StanfordCars, we annotate images for all 196 categories, of which 148 categories have exactly matched models. For CompCars, we only include the 113 categories with matched 3D models in ShapeNet. To the best of our knowledge, our dataset is the first to employ fine-grained, category-aware 3D models for 3D pose estimation.
3.2 Camera Model
The world coordinate system is defined in accordance with the 3D model coordinate system. In this case, a point $\mathbf{X} = (X, Y, Z, 1)^{T}$ on a 3D model is projected onto a point $\mathbf{x} = (x, y, 1)^{T}$ on a 2D image:

$$\mathbf{x} \sim P\,\mathbf{X}, \qquad (1)$$

via a perspective projection matrix:

$$P = K\,[R \mid T], \qquad (2)$$

where $K$ denotes the intrinsic parameters:

$$K = \begin{bmatrix} f & 0 & u \\ 0 & f & v \\ 0 & 0 & 1 \end{bmatrix}, \qquad (3)$$

and $R$ encodes a rotation matrix between the world and camera coordinate systems, parameterized by three angles, i.e., elevation $e$, azimuth $a$ and in-plane rotation $\theta$. We assume that the camera always faces the origin of the 3D model. Hence the translation $T$ is only defined up to the model depth $d$, the distance between the origins of the two coordinate systems, and the principal point $(u, v)$ is the projection of the origin of the world coordinate system on the image. As a result, our model has 7 parameters in total: camera focal length $f$, principal point location $u$ and $v$, azimuth $a$, elevation $e$, in-plane rotation $\theta$ and model depth $d$. Note that, since the images are collected online, even the annotated intrinsic parameters ($f$, $u$ and $v$) are approximations. Compared with previous annotations [30, 29] with 6 parameters ($f$ fixed), our camera model considers both the camera focal length and the object depth in a full perspective projection for finer 2D-3D alignment.
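As a concrete sketch of this 7-parameter camera model, the following numpy code assembles K, R and T and projects model points. The rotation composition order and axis conventions here are illustrative assumptions, since the text does not fix them:

```python
import numpy as np

def rotation_from_angles(azimuth, elevation, theta):
    """Rotation matrix from azimuth, elevation and in-plane rotation (radians).
    NOTE: the axis conventions and composition order are assumptions for
    illustration, not the paper's specification."""
    sa, ca = np.sin(azimuth), np.cos(azimuth)
    se, ce = np.sin(elevation), np.cos(elevation)
    st, ct = np.sin(theta), np.cos(theta)
    Ra = np.array([[ca, -sa, 0], [sa, ca, 0], [0, 0, 1]])  # azimuth about z
    Re = np.array([[1, 0, 0], [0, ce, -se], [0, se, ce]])  # elevation about x
    Rt = np.array([[ct, -st, 0], [st, ct, 0], [0, 0, 1]])  # in-plane rotation
    return Rt @ Re @ Ra

def project(points, f, u, v, azimuth, elevation, theta, d):
    """Full perspective projection with the 7 annotated parameters.
    points: (N, 3) array of 3D model points in the world frame."""
    K = np.array([[f, 0, u], [0, f, v], [0, 0, 1]], dtype=float)
    R = rotation_from_angles(azimuth, elevation, theta)
    T = np.array([0.0, 0.0, d])  # camera faces the model origin at depth d
    cam = points @ R.T + T       # world -> camera coordinates
    pix = cam @ K.T              # camera -> homogeneous pixel coordinates
    return pix[:, :2] / pix[:, 2:3]
```

By construction, the model origin projects exactly onto the principal point (u, v), matching the definition above.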
3.3 2D-3D Alignment
We annotate 3D pose information for all 2D images in our datasets through crowd-sourcing. To facilitate the annotation process, we develop the annotation tool illustrated in Figure 2. For each image, we choose the 3D model according to the fine-grained car type given beforehand. Then, we ask the annotators to adjust the 7 parameters so that the projected 3D model is aligned with the target object in the 2D image. This process can be roughly summarized as follows: (1) shift the 3D model so that the center of the model (the origin of the world coordinate system) is roughly aligned with the center of the target object in the 2D image; (2) rotate the model to the same orientation as the target object; (3) adjust the model depth and camera focal length to match the size of the target object. Some finer adjustments may be applied after these three main steps. In this way we annotate all 7 parameters across the whole dataset. On average, each image takes approximately 60 seconds for an experienced annotator. To ensure quality, after one round of annotation across the whole dataset, we perform a quality check and have the annotators do a second round of revision on unqualified examples.
3.4 Dataset Statistics
We plot the distributions of the annotated pose parameters for StanfordCars 3D and CompCars 3D, respectively. Due to the nature of the original fine-grained recognition datasets, we found that azimuth is not uniformly distributed, while the distributions of the two datasets are complementary to some degree. Elevations and in-plane rotations are not severe, as expected, since images of cars are usually taken from the ground view.
4 3D Pose Estimation for Fine-Grained Object Categories
Given an input image of a fine-grained object, our task is to predict all 7 parameters related to Equation (2), i.e., the 3D rotation (azimuth, elevation and in-plane rotation), the model depth, the camera focal length and the principal point, such that the projected 3D model aligns well with the object in the 2D image.
4.1 Baseline Framework
Our baseline method uses only 2D appearance to regress the pose parameters. It is a modified version of Faster R-CNN, which was originally designed for object detection. Casting our pose estimation problem into a detection framework is motivated by the relation between the two tasks. Since we do not use key points as an attention mechanism, performing pose estimation within the region of interest (RoI) removes unrelated image regions and hence makes more effective use of 2D information. In addition, 3D pose estimation is highly related to the detection task, especially for the intrinsic parameters in Equation (3).
We parametrize the 3D rotation using the quaternion representation, converted from the three angles. The principal point is highly related to the RoI center. Therefore, we regress the offset of the principal point from the RoI center. Such an offset exists because the projection of the 3D object center is not necessarily the 2D box center, depending on the pose. The remaining parameters (focal length and model depth) are regressed in their standard format as they are.
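A sketch of how these regression targets could be assembled; the quaternion composition order and the helper names are illustrative assumptions:

```python
import numpy as np

def axis_quat(axis, angle):
    # Unit quaternion (w, x, y, z) for a rotation of `angle` radians about a unit `axis`.
    w = np.cos(angle / 2.0)
    xyz = np.sin(angle / 2.0) * np.asarray(axis, dtype=float)
    return np.concatenate(([w], xyz))

def quat_mul(q, r):
    # Hamilton product of two quaternions.
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def pose_targets(azimuth, elevation, theta, principal_point, roi_center):
    """Regression targets: a unit quaternion for the rotation, plus the offset
    of the principal point from the RoI center (rather than its absolute location)."""
    q = quat_mul(axis_quat([0, 0, 1], theta),
                 quat_mul(axis_quat([1, 0, 0], elevation),
                          axis_quat([0, 0, 1], azimuth)))
    offset = (principal_point[0] - roi_center[0],
              principal_point[1] - roi_center[1])
    return q, offset
```

Because unit quaternions are closed under multiplication, the resulting target is always a valid unit rotation representation.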
The modification of the network architecture is relatively straightforward. As shown in Figure 5, we add a pose estimation branch alongside the existing class prediction and bounding box regression branches. Similar to the bounding box regression branch, the estimation of each group of pose parameters consists of a fully-connected (FC) layer and a smooth L1 loss. The centers of the RoIs are also used to adjust the regression targets at training time and to generate the final predictions at test time, as discussed above. For each training image, the bounding box is derived from the perspective projection of the corresponding 3D model. Since we have fine-grained 3D models and high-quality annotations, these bounding boxes are tight around their corresponding objects.
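For reference, the smooth L1 loss used by the Faster R-CNN box-regression branch (and, per the text, by each pose-parameter group) can be written as follows; the transition point `beta = 1.0` is a common default, not something the paper specifies:

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss: quadratic for residuals below `beta`, linear above,
    so large regression errors do not dominate the gradient."""
    diff = np.abs(pred - target)
    per_elem = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return per_elem.sum()
```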
4.2 Improve Pose Estimation via 3D Location Field
The key difference between our dataset and previous ones is that we have fine-grained 3D models, so the projection aligns better with the image. This advantage allows us to explore the use of dense 3D representations, in addition to 2D appearance, for regressing the pose parameters.
Given an object in an image and its 3D model, our representation, named the 3D location field, maps every foreground pixel to its corresponding location on the surface of the 3D model, i.e., $(u, v) \mapsto (x, y, z)$. The resulting field has the same size as the image and has three channels containing the $x$, $y$ and $z$ coordinates, respectively. A sample image with its corresponding 3D location field is shown in Figure 6. The 3D location field is a dense representation of 3D information that can be directly used as network input.
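A simple way to construct a ground-truth location field from an annotated model and pose might look as follows. Here `project_fn` is a hypothetical callable returning (N, 2) pixel coordinates for the model vertices, and proper visibility handling (z-buffering) is omitted for brevity:

```python
import numpy as np

def location_field(vertices, project_fn, height, width):
    """Build an HxWx3 location field: each foreground pixel stores the
    (x, y, z) model-surface coordinate that projects onto it; background
    pixels remain zero. A z-buffer would be needed for correct visibility;
    this sketch simply overwrites pixels in vertex order."""
    field = np.zeros((height, width, 3), dtype=np.float32)
    pix = np.round(project_fn(vertices)).astype(int)
    for (px, py), v in zip(pix, vertices):
        if 0 <= px < width and 0 <= py < height:
            field[py, px] = v
    return field
```

A dense field would use all surface points (or rasterized faces) rather than vertices alone; the sparse version above is only meant to show the pixel-to-3D-coordinate mapping.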
We explore the use of the 3D location field to improve pose estimation based on Mask R-CNN. Since we still expect only a 2D image as input at test time, we regress the 3D location field and use the regressed field for pose estimation. Based on the framework in Figure 5, we add a branch that regresses the 3D location field (instead of the binary masks in Mask R-CNN). The regressed location fields are fed into a CNN consisting of additional convolutional layers followed by fully-connected layers that regress the pose parameters. The regressions from 2D appearance (as part of Figure 5) and from the 3D location field are then combined to produce the final pose parameters. Figure 7 shows the detailed network structure.
We train the pose regression from location fields using a large amount of synthetic data. The synthetic location fields are generated from the 3D models under various pre-defined poses. The location field is a very suitable representation for synthetic data augmentation for two reasons: (i) the field encodes only 3D location information, without any rendering of the 3D models, and thus naturally avoids the domain gap between synthetic and photo-realistic data; (ii) the field is invariant to the color, texture and scale of the images.
5 Experiments

5.1 Evaluation Metrics
For each test sample, we introduce two metrics to comprehensively evaluate object poses.
Following [25, 17], the first metric, Rotation Error, focuses on the quality of viewpoint estimation only. Given the predicted and ground truth rotation matrices $\tilde{R}$ and $R$, their difference measured by geodesic distance is $\Delta(R, \tilde{R}) = \frac{\|\log(R^{T}\tilde{R})\|_{F}}{\sqrt{2}}$.
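A minimal numpy implementation of this geodesic rotation error, using the identity trace(R) = 1 + 2 cos θ, which is equivalent to the Frobenius-norm-of-log form:

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic distance between two rotation matrices, in degrees.
    Equivalent to ||log(R_pred^T R_gt)||_F / sqrt(2): the angle of the
    relative rotation, recovered here from its trace."""
    R_rel = R_pred.T @ R_gt
    # trace(R_rel) = 1 + 2 cos(angle); clip for numerical safety
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(cos_angle))
```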
The second metric evaluates the overall quality of the perspective projection. It is based on the Average Distance of Model Points, which measures the averaged distance between predicted projected points and their corresponding ground-truth projections. Concretely, given one test result with predicted pose $\tilde{P}$, ground truth pose $P$ and corresponding 3D model $\mathcal{M}$, the metric is defined as

$$ADD = \frac{1}{|\mathcal{M}|} \sum_{\mathbf{X} \in \mathcal{M}} \big\| \operatorname{proj}(P, \mathbf{X}) - \operatorname{proj}(\tilde{P}, \mathbf{X}) \big\|_{2},$$

where $\operatorname{proj}(P, \mathbf{X})$ denotes the 2D projection of the 3D model point $\mathbf{X}$ under pose $P$.
This is among the most widely-used error functions for evaluating a projection matrix. The unit of the above distance is the number of pixels. To make the metric scale-invariant, we normalize it by the diameter of the 2D bounding box and denote the normalized distance as $ADD_{n}$. It is worth mentioning again that the 3D models are only used when computing the evaluation metrics. At test time, only a single 2D image is fed into the network to predict the pose.
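A sketch of the normalized metric, assuming the projections of the model points under both poses have already been computed (function and argument names are illustrative):

```python
import numpy as np

def normalized_add(proj_pred, proj_gt, bbox_diameter):
    """Average 2D distance (in pixels) between model points projected under
    the predicted and ground-truth poses, normalized by the diameter of the
    2D bounding box so the metric is scale-invariant.
    proj_pred, proj_gt: (N, 2) arrays of projections of the same model points."""
    dists = np.linalg.norm(proj_pred - proj_gt, axis=1)
    return dists.mean() / bbox_diameter
```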
To measure performance over the whole test set, we compute the mean and median of the rotation error and the normalized projection distance over all test samples. Also, by setting thresholds on the two metrics, we obtain accuracy numbers. For the rotation error, following [25, 17], we set the threshold to $\pi/6$. For the normalized projection distance, the common threshold is 0.1, meaning that a prediction with average projection error less than 10% of the 2D diameter is considered correct.
5.2 Experimental Settings
Data Split. For StanfordCars 3D, since we have annotated all the images, we follow the standard train/test split provided by the original dataset, with 8,144 training examples and 8,041 testing examples. For CompCars 3D, we randomly split our annotated data into a training set and a testing set, resulting in 3,798 training and 1,898 testing examples.
Baseline Implementation. Our implementation is based on the Detectron package, which includes Faster/Mask R-CNN implementations. The convolutional body (i.e., the “backbone”) used for the baseline is ResNet-50. For a fair comparison, the convolutional body is initialized from an ImageNet pre-trained model, and the other layers are randomly initialized (i.e., we do not use COCO pre-trained detectors). Following the setting of Mask R-CNN, the whole network is trained end-to-end. At test time, we adopt a cascaded strategy, where the 3D pose branch is applied only to the highest-scoring box prediction.
Comparison to Previous Baselines. It is worth mentioning that, when evaluating only the rotation error of Section 5.1, our baseline in Figure 5 is almost identical to the baselines of Pascal 3D+ and ObjectNet3D, except that their detection and pose estimation heads are parallel while ours are cascaded.
3D Location Field. As described in Section 4.2, incorporating 3D location fields involves two steps: field regression, and pose regression from fields. Field regression is trained together with detection and baseline pose estimation in an end-to-end fashion, similar to Mask R-CNN; the ground truth training fields are generated from the annotations (3D models and poses). The second step, pose regression from fields, is trained using synthetic data generated from the pool of matched 3D models in each dataset (38,102 and 14,017 synthetic samples for StanfordCars 3D and CompCars 3D, respectively). We only regress the quaternion from the location fields.
| Method | Median rot. err. (°) | Mean rot. err. (°) | Acc @ π/6 (%) | Median ADD | Mean ADD | Acc @ 0.1 (%) |
|---|---|---|---|---|---|---|
| FT w./ Field | 4.74 | 7.45 | 98.31 | 0.0836 | 0.1047 | 64.01 |
5.3 Results and Analysis
The quantitative results for StanfordCars 3D and CompCars 3D are shown in Table 2 and Table 3, respectively. The change of accuracy w.r.t. the threshold on both datasets is shown in Figure 8. For the CompCars 3D dataset, besides ImageNet initialization, we also report results fine-tuned from a StanfordCars 3D pre-trained model, since the number of training samples in StanfordCars is relatively larger.
As can be seen in Tables 2 and 3, our baseline performs very well at estimating the rotation matrix on both datasets, with a median rotation error of less than 10 degrees and rotation accuracy around 95%. While recovering the full perspective model is a much more challenging task, Table 2 shows that promising performance can be achieved with enough properly annotated training samples: for StanfordCars 3D, the median average projection error is less than 10% of the diameter of the 2D bounding box. When the training set is limited, the first and third rows of Table 3 show the effectiveness of transfer learning from a larger dataset. Regarding the effectiveness of the 3D location field, we observe a consistent performance gain across all datasets. The main reasons are two-fold: (i) this 3D representation enables the use of large amounts of synthetic training data with no domain gap; (ii) our field regression adapted from Mask R-CNN works well enough that even pose prediction based on the regressed field helps substantially at test time.
We visualize the predicted poses in Figure 9. As shown on the left of Figure 9, our method is able to handle projections with various orientations, scales and locations. The failure cases on the right of Figure 9 indicate that there is still room for improvement, especially for the estimation of scale, cases with large perspective distortion, and uncommon poses with few training samples.
6 Conclusion

We study the problem of pose estimation for fine-grained object categories. We annotate two popular fine-grained recognition datasets with fine-grained 3D shapes and poses, and propose an approach to estimate the full perspective parameters from a single image. We further propose the 3D location field, a dense 3D representation that facilitates pose estimation. Experiments on our datasets suggest that this is an interesting problem for future research.
-  Brachmann, E., Krull, A., Michel, F., Gumhold, S., Shotton, J., Rother, C.: Learning 6d object pose estimation using 3d object coordinates. In: ECCV (2014)
-  Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012 (2015)
-  Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88(2), 303–338 (2010)
-  Ghodrati, A., Pedersoli, M., Tuytelaars, T.: Is 2D information enough for viewpoint estimation? In: BMVC (2014)
-  Girshick, R., Radosavovic, I., Gkioxari, G., Dollár, P., He, K.: Detectron. https://github.com/facebookresearch/detectron (2018)
-  Hara, K., Vemulapalli, R., Chellappa, R.: Designing deep convolutional neural networks for continuous object orientation estimation. arXiv preprint arXiv:1702.01499 (2017)
-  He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV (2017)
-  Hodan, T., Matas, J., Obdrzálek, S.: On evaluation of 6d object pose estimation. In: Computer Vision - ECCV 2016 Workshops - Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part III. pp. 606–619 (2016)
-  Krause, J., Sapp, B., Howard, A., Zhou, H., Toshev, A., Duerig, T., Philbin, J., Fei-Fei, L.: The unreasonable effectiveness of noisy data for fine-grained recognition. In: European Conference on Computer Vision. pp. 301–320. Springer (2016)
-  Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3D object representations for fine-grained categorization. In: ICCV Workshops on 3D Representation and Recognition (2013)
-  Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)
-  Lim, J.J., Pirsiavash, H., Torralba, A.: Parsing IKEA objects: Fine pose estimation. In: ICCV (2013)
-  Lin, T.Y., RoyChowdhury, A., Maji, S.: Bilinear CNN models for fine-grained visual recognition. In: ICCV (2015)
-  Mahendran, S., Ali, H., Vidal, R.: 3D pose regression using convolutional neural networks. In: ICCV. vol. 1, p. 4 (2017)
-  Mottaghi, R., Xiang, Y., Savarese, S.: A coarse-to-fine model for 3D pose estimation and sub-category recognition. In: CVPR (2015)
-  Ozuysal, M., Lepetit, V., Fua, P.: Pose estimation for category specific multiview object localization. In: CVPR (2009)
-  Pavlakos, G., Zhou, X., Chan, A., Derpanis, K.G., Daniilidis, K.: 6-DOF object pose from semantic keypoints. In: ICRA. pp. 2011–2018 (2017)
-  Pepik, B., Stark, M., Gehler, P., Schiele, B.: Teaching 3D geometry to deformable part models. In: CVPR (2012)
-  Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NIPS (2015)
-  Savarese, S., Fei-Fei, L.: 3D generic object categorization, localization and pose estimation. In: ICCV (2007)
-  Shotton, J., Glocker, B., Zach, C., Izadi, S., Criminisi, A., Fitzgibbon, A.W.: Scene coordinate regression forests for camera relocalization in RGB-D images. In: CVPR (2013)
-  Sochor, J., Herout, A., Havel, J.: BoxCars: 3D boxes as cnn input for improved fine-grained vehicle recognition. In: CVPR (2016)
-  Su, H., Qi, C.R., Li, Y., Guibas, L.J.: Render for CNN: Viewpoint estimation in images using CNNs trained with rendered 3D model views. In: ICCV (2015)
-  Taylor, J., Shotton, J., Sharp, T., Fitzgibbon, A.W.: The vitruvian manifold: Inferring dense correspondences for one-shot human pose estimation. In: CVPR (2012)
-  Tulsiani, S., Malik, J.: Viewpoints and keypoints. In: CVPR. pp. 1510–1519 (2015)
-  Van Horn, G., Mac Aodha, O., Song, Y., Shepard, A., Adam, H., Perona, P., Belongie, S.: The inaturalist challenge 2017 dataset. arXiv preprint arXiv:1707.06642 (2017)
-  Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The Caltech-UCSD Birds-200-2011 Dataset. Tech. rep. (2011)
-  Xiang, Y., Choi, W., Lin, Y., Savarese, S.: Data-driven 3d voxel patterns for object category recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1903–1911 (2015)
-  Xiang, Y., Kim, W., Chen, W., Ji, J., Choy, C., Su, H., Mottaghi, R., Guibas, L., Savarese, S.: ObjectNet3D: A large scale database for 3D object recognition. In: ECCV (2016)
-  Xiang, Y., Mottaghi, R., Savarese, S.: Beyond PASCAL: A benchmark for 3D object detection in the wild. In: WACV (2014)
-  Yang, L., Liu, J., Tang, X.: Object detection and viewpoint estimation with auto-masking neural network. In: ECCV (2014)
-  Yang, L., Luo, P., Change Loy, C., Tang, X.: A large-scale car dataset for fine-grained categorization and verification. In: CVPR (2015)
-  Zhou, X., Leonardos, S., Hu, X., Daniilidis, K., et al.: 3D shape estimation from 2D landmarks: A convex relaxation approach. In: CVPR (2015)
-  Zia, M.Z., Stark, M., Schiele, B., Schindler, K.: Detailed 3D representations for object recognition and modeling. PAMI 35(11), 2608–2623 (2013)