1 Introduction
Understanding the 3D properties of objects from an image, i.e. recovering objects' 3D pose and shape, is an important task in computer vision, as illustrated in Fig. 1. This task is also called "inverse graphics" [27]; solving it would enable a wide range of applications in vision and robotics, such as robot navigation [30], visual recognition [15], and human-robot interaction [2]. Among them, autonomous driving (AD) is a prominent topic with great potential for practical applications. Yet, in the context of AD, the current leading technologies for 3D object understanding mostly rely on high-resolution LiDAR sensors [34] rather than regular cameras or image sensors.
Table 1: Comparison of key properties of ApolloCar3D versus existing datasets for 3D object instance understanding.
Dataset  Image source  3D property  Car keypoints (#)  Images (#)  Average cars/image  Maximum cars/image  Car models  Stereo
3DObject [52]  Control  complete 3D  No  350  1  1  10  No
EPFL Car [47]  Control  complete 3D  No  2000  1  1  20  No
PASCAL3D+ [65]  Natural  complete 3D  No  6704  1.19  14  10  No
ObjectNet3D [64]  Natural  complete 3D  Yes (14)  7345  1.75  2  10  No
KITTI [21]  Self-driving  3D bbox ori.  No  7481  4.8  14  16  Yes
ApolloCar3D  Self-driving  industrial 3D  Yes (66)  5277  11.7  37  79  Yes
However, we argue that there are multiple drawbacks in using LiDAR that hinder its wider adoption. The most severe one is that the recorded 3D LiDAR points provide at best a sparse coverage of the scene from the front view [21], especially for distant and absorbing regions. Since it is crucial for a self-driving car to maintain a safe braking distance, 3D understanding from a regular camera remains a promising and viable approach that attracts a significant amount of research from the vision community [6, 56].
The recent tremendous success of deep convolutional networks [22] in solving various computer vision tasks is built upon the availability of massive, carefully annotated training datasets, such as ImageNet [11] and MS COCO [36]. Acquiring large-scale training datasets, however, is an extremely laborious and expensive endeavour, and the community especially lacks fully annotated datasets of a 3D nature. For example, for the task of 3D car understanding for autonomous driving, the availability of datasets is severely limited. Take KITTI [21] for instance. Despite being the most popular dataset for self-driving, it offers only a limited number of labelled 3D cars, and only in the form of bounding boxes, without detailed 3D shape information [41]. Deep learning methods are generally hungry for massive labelled training data, yet the sizes of currently available 3D car datasets are far from adequate to capture appearance variations such as occlusion, truncation, and lighting. Other datasets such as PASCAL3D+ [65] and ObjectNet3D [64] contain more images, but the car instances therein are mostly isolated and imaged in a controlled lab setting, and are thus unsuitable for autonomous driving.
To rectify this situation, we propose a large-scale 3D instance car dataset built from real images and videos captured in complex real-world driving scenes in multiple cities. Our new dataset, called ApolloCar3D, is built upon the publicly available ApolloScape dataset [23] and targets 3D car understanding research in self-driving scenarios. Specifically, we select 5,277 images from around 200K released images in the semantic segmentation task of ApolloScape, following several principles: (1) containing a sufficient number of cars driving on the street, (2) exhibiting large appearance variations, and (3) covering multiple driving cases on highways, local roads, and intersections. In addition, for each image, we provide a stereo pair for obtaining stereo disparity; and for each car, we provide 3D keypoints such as door corners and headlights, as well as realistic 3D CAD models with an absolute scale. An example is shown in Fig. 1(b). We provide details about how we define those keypoints and label the dataset in Sec. 2.
Equipped with ApolloCar3D, we are able to directly apply supervised learning to train a 3D car understanding system from images, instead of making unnecessary compromises by falling back to weak supervision or semi-supervision as most previous works do, e.g. 3D-RCNN [28] or single object 3D recovery [60].
To facilitate future research based on our ApolloCar3D dataset, we also develop two 3D car understanding algorithms, to be used as new baselines for benchmarking future algorithms. Details of our baseline algorithms are described in the following sections.
Another important contribution of this paper is a new evaluation metric for this task, which jointly measures the quality of both 3D pose estimation and shape recovery. We refer to our new metric as "Average 3D precision (A3DP)", as it is inspired by the AVP metric (average viewpoint precision) for PASCAL3D+ [65], which however only considers 3D pose. In addition, we supply multiple true positive thresholds, similar to MS COCO [36].
The contributions of this paper are summarized as:

A large-scale and growing 3D car understanding dataset for autonomous driving, i.e. ApolloCar3D, which complements existing public 3D object datasets.

A novel evaluation metric, i.e. A3DP, which jointly considers both 3D shape and 3D pose and is thus more appropriate for the task of 3D instance understanding.

Two new baseline algorithms for 3D car understanding, which outperform several state-of-the-art 3D object recovery methods.

A human performance study, which points out promising future research directions.
2 ApolloCar3D Dataset
Existing datasets with 3D object instances.
Previous datasets for 3D object understanding are often very limited in scale, provide only partial 3D properties, or contain few objects per image [29, 55, 52, 44, 47, 37]. For instance, 3DObject [52] has only 10 instances of cars. The EPFL Car dataset [47] has 20 cars under different viewpoints but was captured on a controlled turntable rather than in real scenes.
To handle more realistic cases from non-controlled scenes, datasets [35] with natural images collected from Flickr [40] or indoor scenes [10] captured with Kinect are extended to 3D objects [51]. The IKEA dataset [35] labelled a few hundred indoor images with 3D furniture models. PASCAL3D+ [65] labelled the 12 rigid categories in PASCAL VOC 2012 [16] images with CAD models. ObjectNet3D [64] proposed a much larger 3D object dataset with images from ImageNet [11] covering 100 categories. These datasets, while useful, are not designed for autonomous driving scenarios. To the best of our knowledge, the only real-world dataset that partially meets our requirement is the KITTI dataset [21]. Nonetheless, KITTI only labels each car by a rectangular bounding box, and lacks fine-grained semantic keypoint labels (e.g. window, headlight). One exception is the work of [42], yet it is limited to 200 labelled images, and its car parameters are not publicly available.
In this paper, as illustrated in Fig. 1, we offer to the community the first large-scale and fully 3D shape labelled dataset with 60K+ car instances, from 5,277 real-world images, based on 34 industry-grade 3D CAD car models. Moreover, we also provide the corresponding stereo image pairs and accurate 2D keypoint annotations. Tab. 1 gives a comparison of key properties of our dataset versus existing ones for 3D object instance understanding.
2.1 Data Acquisition
We acquire images from the ApolloScape dataset [23] due to its high resolution (3384 × 2710), large scale (140K semantically labelled images), and complex driving conditions. From the dataset, we carefully select images satisfying our requirements as stated in Sec. 1. Specifically, we select images from the labelled videos of 4 different cities satisfying (1) a relatively complex environment and (2) an interval of at least 10 frames between selected images. After picking images from the whole dataset using their semantic labels, in order to have more diversity, we prune all images manually, and further select those that contain better variation in car scales, shapes, orientations, and mutual occlusion between instances, yielding 5,277 images for us to label.
For 3D car models, we look for highly accurate shape models, i.e. models for which the offset between the boundary of the reprojected model and the manually labelled mask is small on average. However, 3D car meshes in ShapeNet [4] are still not accurate enough for us, and it is too costly to fit each 3D model in the presence of heavy occlusion, as shown in Fig. 1. Therefore, to ensure the quality (accuracy) of the 3D models, we hired online model makers to manually build the corresponding 3D models given the absolute shape and scale parameters of each car type. Overall, we build 34 real models including sedan, coupe, minivan, SUV, and MPV, which cover the majority of car models and types on the market.
2.2 Data Statistics
In Fig. 2, we provide statistics for the labelled cars w.r.t. translation, orientation, occlusion, and model shape. Compared with KITTI [21], ApolloCar3D contains a significantly larger number of cars that are at long distance or under heavy occlusion, and these cars are distributed diversely in space. From Fig. 2(b), the orientation follows a similar distribution, where the majority of cars on the road are driving towards or away from the data acquisition car. In Fig. 2(c), we show the distribution w.r.t. car types, where sedans occur most frequently. The object distribution per image in Fig. 2(e) shows that most of the images contain more than 10 labelled objects.
3 Context-aware 3D Keypoint Annotation
Thanks to the high-quality 3D models that we created, we develop an efficient machine-aided semi-automatic keypoint annotation process. Specifically, we only ask human annotators to click on a set of pre-defined keypoints on the object of interest in each image. Afterwards, the EPnP algorithm [31] is employed to automatically recover the pose and model of the 3D car instance by minimizing the reprojection error. RANSAC [19] is used to handle outliers or wrong annotations. While only a handful of keypoints can be sufficient to solve the EPnP problem, we define 66 semantic keypoints in our dataset, as shown in Fig. 3, which is a much higher density than in most previous car datasets [57, 43]. The redundancy enables more accurate and robust shape-and-pose registration. We give the definition of each semantic keypoint in the appendix.
Context-aware annotation. In the presence of severe occlusions, for which RANSAC also fails, we develop a context-aware annotation process by enforcing co-planar constraints between one car and its neighboring cars. By doing this, we are able to propagate information among neighboring cars, so that we jointly solve for their poses with context-aware constraints.
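Formally, the objective for a single car pose estimation is
$$E_{2d}(\mathbf{p}, \mathbf{S}, \mathbf{x}) = \sum_{[x_k^3 \in \mathbf{S}]} v_k \, \big\| \pi(x_k^3; \mathbf{p}, \mathbf{K}) - x_k \big\|^2, \tag{1}$$
where $\mathbf{p}$ and $\mathbf{S} \in \{\mathbf{S}_1, \dots, \mathbf{S}_m\}$ indicate the pose and shape of a car instance respectively. Here, $m$ is the number of models. $\mathbf{v} = \{v_k\}$ is a vector indicating whether the $k$-th keypoint of the car has been labelled or not. $x_k$ is the labelled 2D keypoint coordinate on the image. $\pi(\cdot)$ is a perspective projection function projecting the corresponding 3D keypoint $x_k^3$ on the car model given $\mathbf{p}$ and the camera intrinsics $\mathbf{K}$.
Our context-aware co-planarity constraint is formulated as:
$$E_N(\mathbf{p}, \mathbf{S}, \mathbf{p}_n, \mathbf{S}_n) = (\alpha_p - \alpha_{p_n})^2 + (\beta_p - \beta_{p_n})^2 + \big((y_p - h_s/2) - (y_{p_n} - h_{s_n}/2)\big)^2, \tag{2}$$
where $n$ is a spatial neighbor car, $\alpha_p$ is the roll component of $\mathbf{p}$ (with $\beta_p$ its pitch component and $y_p$ the vertical coordinate of its center), and $h_s$ is the height of the car given its shape $\mathbf{S}$.
The total energy to be minimized for finding car pose and shape in image $I$ is defined as:
$$E_I = \sum_{c=1}^{C} \Big\{ E_{2d}(\mathbf{p}_c, \mathbf{S}_c, \mathbf{x}_c) + \mathcal{B}(\mathcal{K}_c) \sum_{n \in \mathcal{N}(c, \mathbf{M}, \kappa)} E_N(\mathbf{p}_c, \mathbf{S}_c, \mathbf{p}_n, \mathbf{S}_n) \Big\}, \tag{3}$$
where $c$ is the index of cars in the image, $\mathcal{B}(\mathcal{K}_c)$ is a binary function indicating whether car $c$ needs to borrow pose information from neighbor cars, and $\mathcal{K}_c = \{x_k\}$ is the set of labelled 2D keypoints of the car. $\mathcal{N}(c, \mathbf{M}, \kappa)$ is the set of richly annotated neighboring cars of $c$ using instance mask $\mathbf{M}$, and $\kappa$ is the maximum number of neighbors we use.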
To judge whether a car needs to use contextual constraints, we define the binary condition in Eq. (3) for a car instance through two tests: the number of annotated keypoints is greater than 6, and the labelled keypoints lie on more than two predefined car surfaces (detailed in Tab. 2). When both hold, the car is solved on its own. Otherwise, we additionally use a nearest neighbor function to find spatially close car instances and regularize the solved poses. Specifically, the metric for retrieving the neighborhood is the distance between the mean coordinates of the labelled keypoints. The maximum number of neighbors is set to a fixed value.
Table 2: Predefined car surfaces and their keypoint labels.
Surface name  Keypoints label
Front surface
Left surface
Rear surface
Right surface
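The decision rule and the neighbor retrieval described above can be sketched as follows. This is a minimal illustration rather than the annotation tool itself: the keypoint-to-surface mapping `SURFACE_OF_KEYPOINT`, the function names, and the default `max_neighbors=2` are hypothetical placeholders, since the actual surface assignment is the one listed in Tab. 2.

```python
from math import dist

# Hypothetical keypoint-to-surface mapping, for illustration only; the
# dataset's actual assignment is the one listed in Tab. 2.
SURFACE_OF_KEYPOINT = {
    0: "front", 1: "front", 2: "front",
    3: "left", 4: "left", 5: "left",
    6: "rear", 7: "rear", 8: "rear",
    9: "right", 10: "right", 11: "right",
}


def needs_context(labelled, min_keypoints=6, min_surfaces=2):
    """Return True when a car is too sparsely annotated to be solved alone.

    `labelled` maps keypoint id -> (u, v) pixel coordinate. A car counts as
    richly annotated when it has more than `min_keypoints` labelled keypoints
    lying on more than `min_surfaces` surfaces.
    """
    surfaces = {SURFACE_OF_KEYPOINT[k] for k in labelled}
    rich = len(labelled) > min_keypoints and len(surfaces) > min_surfaces
    return not rich


def mean_keypoint(labelled):
    """Mean image coordinate of the labelled keypoints of one car."""
    us = [u for u, _ in labelled.values()]
    vs = [v for _, v in labelled.values()]
    return (sum(us) / len(us), sum(vs) / len(vs))


def nearest_rich_neighbors(car_id, cars, max_neighbors=2):
    """Ids of the closest richly annotated cars, by mean-keypoint distance."""
    center = mean_keypoint(cars[car_id])
    candidates = [
        (dist(center, mean_keypoint(kps)), other)
        for other, kps in cars.items()
        if other != car_id and not needs_context(kps)
    ]
    return [other for _, other in sorted(candidates)[:max_neighbors]]


cars = {
    "a": {k: (100 + k, 200) for k in range(9)},  # 9 keypoints on 3 surfaces
    "b": {0: (400, 210), 1: (410, 210)},         # sparsely labelled car
    "c": {k: (500 + k, 200) for k in range(9)},  # 9 keypoints on 3 surfaces
}
neighbors_of_b = nearest_rich_neighbors("b", cars)
```

Only cars that pass the richness test are eligible as neighbors, so a sparsely labelled car never borrows pose information from another sparsely labelled one.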
As illustrated in Fig. 4, to minimize Eq. (3), we first solve for those cars with dense keypoint annotations by exhaustively trying all car types. We require that the average reprojection error be below 5 pixels and that the reprojected boundary offset be within 5 pixels. If more than one car model meets the constraints, we choose the one with the minimum reprojection error. We then solve for the cars with fewer keypoint annotations, using the context information provided by their neighboring cars. After most cars are aligned, we ask human annotators to visually verify and adjust the result before committing it to the database.
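The exhaustive search over car types can be illustrated with a short sketch. It assumes that a pose has already been estimated for each candidate model (e.g. by EPnP with RANSAC) and only shows the selection step by mean reprojection error; the boundary-offset check is omitted, and all function names and the tuple layout of `candidates` are hypothetical.

```python
from math import hypot


def project(K, R, t, X):
    """Pinhole projection of a 3D model point X to pixel coordinates."""
    # Camera-frame point: Xc = R @ X + t
    Xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    return (fx * Xc[0] / Xc[2] + cx, fy * Xc[1] / Xc[2] + cy)


def mean_reprojection_error(K, R, t, model_kps, labelled):
    """Average pixel distance between projected and labelled keypoints."""
    errors = []
    for k, (u, v) in labelled.items():
        pu, pv = project(K, R, t, model_kps[k])
        errors.append(hypot(pu - u, pv - v))
    return sum(errors) / len(errors)


def select_model(K, candidates, labelled, max_error_px=5.0):
    """Pick the candidate (name, R, t, keypoints) with the lowest error.

    Returns (name, error), or None when no candidate is below the threshold.
    """
    best = None
    for name, R, t, model_kps in candidates:
        err = mean_reprojection_error(K, R, t, model_kps, labelled)
        if err < max_error_px and (best is None or err < best[1]):
            best = (name, err)
    return best


K = [[1000.0, 0.0, 500.0], [0.0, 1000.0, 500.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 10.0]
narrow = {0: (1.0, 0.0, 0.0), 1: (0.0, 1.0, 0.0), 2: (-1.0, 0.0, 0.0)}
wide = {0: (1.2, 0.0, 0.0), 1: (0.0, 1.2, 0.0), 2: (-1.2, 0.0, 0.0)}
labelled = {0: (600.0, 500.0), 1: (500.0, 600.0), 2: (400.0, 500.0)}
best = select_model(K, [("narrow", R, t, narrow), ("wide", R, t, wide)], labelled)
```

Returning `None` when no model passes the threshold leaves such cars to the later context-aware stage or to manual adjustment.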
4 Two Baseline Algorithms
Based on ApolloCar3D, we aim to develop strong baseline algorithms to facilitate benchmarking and future research. We first review the most recent literature and then implement what are possibly the two strongest baseline algorithms.
Existing work on 3D instance recovery from images.
3D objects are usually recovered from multiple frames, 3D range sensors [26], or learning-based methods [67, 13]. Nevertheless, addressing 3D instance understanding from a single image in an uncontrolled environment is ill-posed and challenging, and is thus attracting growing attention. With the development of deep CNNs, researchers are able to achieve impressive results with supervised [18, 69, 43, 46, 57, 54, 63, 70, 6, 32, 49, 38, 3, 66] or weakly supervised strategies [28, 48, 24]. Existing works represent an object as a parameterized 3D bounding box [18, 54, 57, 49], coarse wire-frame skeletons [14, 32, 62, 69, 68], voxels [9], a one-hot selection from a small set of exemplar models [3, 45, 1], or point clouds [17]. Category-specific deformable models have also been used for shapes of simple geometry [25, 24].
For handling cases with multiple instances, 3D-RCNN [28] and DeepMANTA [3] are possibly the state-of-the-art techniques, combining a 3D shape model with Faster R-CNN [50] detection. However, due to the lack of high-quality datasets, these methods have to rely on 2D masks or wireframes, which provide only coarse information for supervision. Based on ApolloCar3D, in this paper we adapt their algorithms and conduct supervised training to obtain strong results for benchmarks. Specifically, 3D-RCNN does not consider car keypoints, which we refer to as the direct approach, while DeepMANTA considers keypoints for training and inference, which we call the keypoint-based approach. Nevertheless, neither algorithm is open-sourced yet. Therefore, we have to develop our in-house implementations of their methods, serving as baselines in this paper. In addition, we also propose new ideas to improve the baselines, as illustrated in Fig. 5, which we elaborate on later.
4.1 A Direct Approach
When only car pose and shape are provided, following the direct supervision strategy of 3D-RCNN [28], we crop out the corresponding features for every car instance from a fully convolutional feature extractor with RoI pooling, and build independent fully connected layers to regress its 2D amodal center, allocentric rotation, and PCA-based shape parameters. Following the same strategy, the regression output spaces of rotation and shape are discretized. Nevertheless, for estimating depth, instead of using the amodal box and enumerating depth such that the projected mask best fits the box as in [28], we use ground truth depths as supervision. Therefore, in our implementation, we replace amodal box regression with depth regression, using a depth discretization policy similar to that proposed in [20], which provides state-of-the-art depth estimation from a single image.
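To make the idea of a discretized depth target concrete, the sketch below implements a spacing-increasing (log-uniform) discretization in the spirit of [20]. It is an illustration under assumptions: the exact policy, depth range, and number of bins used in our implementation are not specified here, and the values in the example are hypothetical.

```python
from bisect import bisect_right
from math import exp, log


def sid_edges(d_min, d_max, num_bins):
    """Bin edges that grow geometrically from d_min to d_max (log-uniform)."""
    ratio = log(d_max / d_min)
    return [d_min * exp(ratio * i / num_bins) for i in range(num_bins + 1)]


def depth_to_bin(depth, edges):
    """Index of the bin containing `depth`, clamped to the valid range."""
    idx = bisect_right(edges, depth) - 1
    return min(max(idx, 0), len(edges) - 2)


def bin_to_depth(idx, edges):
    """Representative depth of a bin: the geometric mean of its two edges."""
    return (edges[idx] * edges[idx + 1]) ** 0.5


# Hypothetical range of 1 m to 100 m split into 4 bins.
edges = sid_edges(1.0, 100.0, 4)
label = depth_to_bin(5.0, edges)      # classification target for a 5 m car
decoded = bin_to_depth(label, edges)  # depth recovered from the bin index
```

Bins are narrow for nearby cars and wide for distant ones, which matches the fact that depth uncertainty grows with distance.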
Targeting detailed shape understanding, we further make two improvements over the original pipeline, as shown in Fig. 5(a). First, as mentioned in [28], estimating object 3D shape and pose is distortion-sensitive, and RoI pooling is equivalent to applying a perspective distortion to an instance in the image, which negatively impacts the estimation. 3D-RCNN [28] introduces an infinity homography to handle the problem. In our case, we replace RoI pooling with a fully convolutional architecture, and perform per-pixel regression towards our pose and shape targets, which is simpler yet more effective. Then we aggregate all the predictions inside the given instance mask with a "self-attention" policy, as commonly used for feature selection [59]. Formally, let $\mathbf{X}$ be the feature map; the output for car instance $i$ is computed as
$$\mathbf{o}_i = \sum_{\mathbf{x}} \mathbf{M}^i_{\mathbf{x}} \, \mathbf{A}_{\mathbf{x}} \, (\kappa_o * \mathbf{X})_{\mathbf{x}}, \tag{4}$$
where $\mathbf{o}_i \in \mathbb{R}^{b}$ is the logits of the discretized 3D representation, $\mathbf{x}$ is a pixel in the image, $\mathbf{M}^i$ is a binary mask of object $i$, $\kappa_o$ is the kernels used for predicting outputs, and $\mathbf{A}$ is the attention map. $b$ is the number of bins for discretization following [28]. We call this feature aggregation mask pooling, since it selects the most important information within each object mask.
Secondly, as shown in our pipeline, for estimating car translation, i.e. its amodal center $\mathbf{c}_a = [x_a, y_a]^T$ and depth $d_c$, instead of using the same target for every pixel in a car mask, we propose to output a 3D offset at each pixel w.r.t. the 3D car center, which provides stronger supervision and helps learn more robust networks. Previously, inducing the relative position of object instances has also been shown to be effective in instance segmentation [58, 33]. Formally, let $\mathbf{c}_{3d} = [\mathbf{c}_a^T, d_c]^T$ be the 3D car center; our 3D offset for a pixel $\mathbf{x} = [u, v]^T$ is defined as $\mathbf{f}^3 = \mathbf{c}_{3d} - \mathbf{x}^3$, where $\mathbf{x}^3 = [u, v, d]^T$, and $d$ is the estimated depth at $\mathbf{x}$. In principle, 3D offset estimation is equivalent to jointly computing a per-pixel 2D offset with respect to the amodal center, i.e. $\mathbf{c}_a - \mathbf{x}$, and a relative depth to the center depth, i.e. $d_c - d$. We adopt such a factorized representation for model center estimation, and the 3D model center can then be recovered by
$$\mathbf{c}_{3d} = \sum_{\mathbf{x}} \mathbf{M}^i_{\mathbf{x}} \, \mathbf{A}_{\mathbf{x}} \, (\mathbf{x}^3 + \mathbf{f}^3_{\mathbf{x}}), \tag{5}$$
where $\mathbf{A}_{\mathbf{x}}$ is the attention at $\mathbf{x}$, which is used for output aggregation in Eq. (4). In our experiments in Sec. 5, we show that the two strategies provide improvements over the original baseline results.
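As an illustration of mask pooling and of recovering the car center from per-pixel offsets, the following sketch operates on a flattened list of pixels. It makes several assumptions that are not confirmed by the text: the attention is a softmax restricted to the instance mask, each offset is defined as "center minus pixel", and the function names are hypothetical.

```python
from math import exp


def mask_attention(scores, mask):
    """Softmax of per-pixel attention scores restricted to the instance mask."""
    inside = [i for i, m in enumerate(mask) if m]
    peak = max(scores[i] for i in inside)  # subtract the peak for stability
    weights = {i: exp(scores[i] - peak) for i in inside}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


def mask_pool(per_pixel_outputs, attention):
    """Attention-weighted sum of per-pixel output vectors inside the mask."""
    dim = len(per_pixel_outputs[0])
    pooled = [0.0] * dim
    for i, a in attention.items():
        for d in range(dim):
            pooled[d] += a * per_pixel_outputs[i][d]
    return pooled


def recover_center(pixels, depths, offsets_2d, offsets_depth, attention):
    """Every pixel votes for (x_a, y_a, d_c); votes are attention-averaged."""
    votes = [
        [u + du, v + dv, d + dd]
        for (u, v), d, (du, dv), dd in zip(pixels, depths, offsets_2d, offsets_depth)
    ]
    return mask_pool(votes, attention)


mask = [1, 1, 0]           # the third pixel lies outside the car mask
scores = [0.0, 0.0, 5.0]   # its high score is ignored by the masking
attention = mask_attention(scores, mask)
logits = mask_pool([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], attention)
center = recover_center(
    pixels=[(10, 20), (12, 20), (50, 50)],
    depths=[30.0, 31.0, 99.0],
    offsets_2d=[(1, 0), (-1, 0), (0, 0)],
    offsets_depth=[0.5, -0.5, 0.0],
    attention=attention,
)
```

Because every pixel carries its own offset, pixels on different parts of the car provide distinct supervision signals while still agreeing on a single center.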
4.2 A Keypoint-based Approach
When sufficient 2D keypoints from each car are available (e.g. as in Fig. 5(b)), we develop a simple baseline algorithm, inspired by DeepMANTA [3], to align the 3D car pose via 2D-3D matching.
Different from [3], our 3D car models have much more geometric detail and come with absolute scale, and our 2D keypoints have more precise annotations. Here, we adopt CPM [61], a state-of-the-art 2D keypoint detector, even though the algorithm was originally developed for human pose estimation. We extend it to 2D car keypoint detection and find that it works well.
One advantage of using 2D keypoint prediction over our first baseline, i.e. the "direct approach" in Sec. 4.1, is that we do not have to regress the global depth or scale, the estimation of which by networks is in general not very reliable. Instead of feeding the full image into the network, we crop out each car region in the image for 2D keypoint detection. This is especially useful for images in ApolloScape [23], which have a large number of cars of small size.
Borrowing the context-aware constraints from our annotation process, once we have enough detected keypoints, we first solve the easy cases where a car is less occluded using EPnP [31], then we propagate the information to neighboring cars until all car poses and shapes are found to be consistent with each other w.r.t. the co-planar constraints, via optimizing Eq. (3). We refer to our car pose solver with co-planar constraints as the context-aware solver.
5 Experiments
This section provides key implementation details, our newly proposed evaluation metric, and experimental results. In total, we have experimented on 5,277 images, split into 4,036 for training, 200 for validation, and 1,041 for testing. We sample images for each set following the distribution illustrated in Fig. 2. The goal is to make sure that the testing data cover a wide range of both easy and difficult scenarios.
Implementation details.
Due to the lack of publicly available source code, we re-implemented 3D-RCNN [28] for 3D car understanding without using keypoints, and DeepMANTA [3], which requires keypoint annotations. For training Mask R-CNN, we used the code released on GitHub by an autonomous driving company (https://github.com/TuSimple/mx-maskrcnn). We adopted the fully convolutional features from DeepLabv3 [5] with the Xception65 [8] network and followed the same training policy. For DeepMANTA, we used the keypoint prediction method from CPM [7]. With 4,036 training images, we obtained about 40,000 labelled vehicles with 2D keypoints, used to train a CPM [7] (with 5 stages and VGG-16 initialization).
Table 3: Accuracy of 2D keypoints.
Method  mean pixel error  detection rate
CPM [61]
Human label
Table 4: Comparison of the baseline algorithms.
Methods  Mask  wKP  A3DP-Abs (mean, c-l, c-s)  A3DP-Rel (mean, c-l, c-s)  Time (s)
3D-RCNN [28]  gt    0.29s
+ MP  gt    0.32s
+ MP + OF  gt    0.34s
+ MP + OF  pred.    0.34s
DeepMANTA [3]  gt  ✓  3.38s
+ CA-solver  gt  ✓  7.41s
+ CA-solver  pred.  ✓  8.5s
Human  gt  ✓  607.41s
Evaluation metrics.
Similar to the detection task, average precision (AP) [16] is usually used for evaluating 3D object understanding. However, the similarity is measured using 3D bounding box IoU [21] with orientation (average orientation similarity (AOS) [21]) or 2D bounding box with viewpoint (average viewpoint precision (AVP) [65]). Unfortunately, those metrics can only measure very coarse 3D properties, and object shape has not been considered jointly with 3D rotation and translation.
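Mesh distance [53] and voxel IoU [12] are usually used to evaluate 3D shape reconstruction. In our case, a car model is mostly compact, thus we consider comparing the projection masks of two models, following the idea of the visual hull representation [39]. Specifically, we sample 100 orientations along the yaw angular direction and project each view of the model to a fixed-resolution image. We use the mean IoU over all views as the car shape similarity metric. For evaluating rotation and translation, we follow the metrics commonly used for camera pose estimation [21]. In summary, the criteria for judging a true positive given a set of thresholds $(\delta_s, \delta_t, \delta_r)$ are defined as
$$c_{shape} = \mathrm{IoU}\big(P(s_i), P(s_i^*)\big) \ge \delta_s, \qquad c_{trans} = \| t_i - t_i^* \|_2 \le \delta_t, \qquad c_{rot} = \arccos\big(| q(r_i) \cdot q(r_i^*) |\big) \le \delta_r, \tag{6}$$
where $s_i$, $t_i$, and $r_i$ are the shape ID, translation, and rotation of a predicted 3D car instance, starred quantities denote the ground truth, $P(\cdot)$ renders the projection masks of a model, and $q(\cdot)$ converts a rotation to a unit quaternion.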
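A minimal sketch of the three true positive tests is given below. It assumes that the silhouettes of the predicted and ground truth models have already been rendered as flat binary lists (one per view), and it takes the rotation distance to be the arccosine of the absolute quaternion dot product, which is one common choice and an assumption here; all names and thresholds are hypothetical.

```python
from math import acos, dist


def mean_view_iou(views_pred, views_gt):
    """Mean IoU over paired binary silhouettes, one pair per rendered view."""
    ious = []
    for a, b in zip(views_pred, views_gt):
        inter = sum(1 for p, q in zip(a, b) if p and q)
        union = sum(1 for p, q in zip(a, b) if p or q)
        ious.append(inter / union if union else 1.0)
    return sum(ious) / len(ious)


def rotation_distance(q_pred, q_gt):
    """Angular distance between unit quaternions, invariant to their sign."""
    dot = abs(sum(a * b for a, b in zip(q_pred, q_gt)))
    return acos(min(1.0, dot))  # clamp guards against rounding above 1


def is_true_positive(shape_iou, t_pred, t_gt, q_pred, q_gt,
                     min_iou, max_trans, max_rot):
    """A prediction counts only if shape, translation and rotation all pass."""
    return (
        shape_iou >= min_iou
        and dist(t_pred, t_gt) <= max_trans
        and rotation_distance(q_pred, q_gt) <= max_rot
    )


q_identity = (1.0, 0.0, 0.0, 0.0)
q_yaw_90 = (0.5 ** 0.5, 0.0, 0.0, 0.5 ** 0.5)  # 90 degree turn about one axis
iou = mean_view_iou([[1, 1, 0, 0], [1, 1, 1, 1]], [[1, 0, 0, 0], [1, 1, 1, 1]])
hit = is_true_positive(iou, (0, 0, 10), (0, 0, 11), q_identity, q_identity,
                       min_iou=0.5, max_trans=2.0, max_rot=0.5)
```

Requiring all three tests to pass is what distinguishes this criterion from box-based metrics, where a wrong shape can still be counted as correct.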
In addition, a single set of true positive thresholds, as used by AOS or AVP, is not sufficient to evaluate detected results thoroughly [21]. Here, following the metric of MS COCO [36], we propose to use multiple sets of thresholds from loose to strict for evaluation. Specifically, the thresholds for each criterion over all levels of difficulty are given in the form $[a : i : b]$, which indicates a set of discrete thresholds sampled linearly from $a$ to $b$ with an interval of $i$. Similar to MS COCO, we select one loose criterion and one strict criterion to diagnose the performance of different algorithms. Note that in our metrics, we only evaluate instances within a limited depth range, as we would like to focus on cars that are more immediately relevant to our autonomous driving task.
Finally, in self-driving scenarios that are safety critical, we commonly care more about nearby cars than about those far away. Therefore, we further propose to use a relative error metric for evaluating translation, following the "AbsRel" metric commonly used in depth evaluation [21]. Formally, we change the translation criterion from the absolute error to the error divided by the ground truth distance, and set the thresholds accordingly. We call our evaluation metric with absolute translation thresholds "A3DP-Abs", and the one with relative translation thresholds "A3DP-Rel", and we report the results under both metrics in our later experiments.
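The difference between the absolute and relative translation criteria, and the averaging over threshold levels, can be sketched as follows. The threshold values and the toy `ap_at_level` function are hypothetical and only serve to show the mechanics; computing the actual average precision at one level additionally requires matching scored detections to ground truth.

```python
from math import dist


def threshold_levels(loose, strict, count):
    """`count` evenly spaced thresholds from the loose to the strict bound."""
    step = (strict - loose) / (count - 1)
    return [loose + i * step for i in range(count)]


def translation_error(t_pred, t_gt, relative=False):
    """Euclidean center error, optionally divided by the ground truth range."""
    err = dist(t_pred, t_gt)
    return err / dist(t_gt, (0.0, 0.0, 0.0)) if relative else err


def mean_over_levels(ap_at_level, shape_levels, trans_levels, rot_levels):
    """Average a per-level AP function over paired threshold levels."""
    levels = list(zip(shape_levels, trans_levels, rot_levels))
    return sum(ap_at_level(*level) for level in levels) / len(levels)


# The same 1 m error is ten times less severe at 50 m than at 10 m
# once it is measured relative to the ground truth distance.
near_abs = translation_error((0, 0, 11), (0, 0, 10))
near_rel = translation_error((0, 0, 11), (0, 0, 10), relative=True)
far_rel = translation_error((0, 0, 51), (0, 0, 50), relative=True)

# Toy AP function: a detector that only passes when the translation
# threshold is at least 1 m, evaluated at three hypothetical levels.
score = mean_over_levels(
    lambda s, t, r: 1.0 if t >= 1.0 else 0.0,
    shape_levels=[0.5, 0.7, 0.9],
    trans_levels=[2.0, 1.0, 0.5],
    rot_levels=[0.5, 0.3, 0.1],
)
```

Averaging over levels rewards methods that remain accurate under strict thresholds instead of only clearing a single loose one.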
5.1 Quantitative Results
In this section, we compare our baseline algorithms with the methods presented in Sec. 4 by progressively adding our proposed components and losses. Tab. 4 shows the comparison results. For the direct regression approach, our baseline algorithm "3D-RCNN" provides regression towards translation, allocentric rotation, and car shape parameters. We further extend the baseline method by adding mask pooling (MP) and offset flow (OF). We observe from the table that swapping RoI pooling for mask pooling moderately improves the results, while offset flow brings a significant boost. Together they help avoid geometric distortions from regular RoI pooling and bring an attention mechanism to focus on relevant regions.
For the keypoint-based method, "DeepMANTA" shows the results of using our detected keypoints and solving with PnP for each car individually, yielding reasonable performance. "+CA-solver" means that, for cars without sufficient detected keypoints, we employ our context-aware solver for inference, which provides a further improvement. For both methods, switching from ground truth masks to segmentation from Mask R-CNN causes only a small drop in performance, demonstrating the high quality of the Mask R-CNN results.
Finally, we train a new group of labellers, and ask them to re-label the keypoints on our validation set, which are then passed through our context-aware 3D solver. We denote these results as "human" performance. We can see there is a clear gap between the algorithms and humans. However, even the accuracy of humans is still not satisfying. After checking the results, we found that this is primarily because humans cannot accurately memorize the semantic meaning of all the 66 keypoints, yielding wrongly solved poses. We conjecture this could be fixed by rechecking and refinement, possibly leading to improved performance.
Tab. 3 shows the accuracy of 2D keypoints. For each predicted keypoint, if its distance to the ground truth keypoint is less than a fixed pixel threshold, we regard it as positive; otherwise, it is regarded as negative. We first crop out each car using its ground truth mask, then use CPM [61] to train the 2D keypoint detector. Tab. 3 reports the detection rate (the ratio of the number of positive keypoints to all ground truth keypoints) and the mean pixel error of the detected 2D keypoints. We also show the accuracy of human-labelled keypoints under the same two measures. As discussed above, the mislabelling by humans is primarily because humans cannot accurately memorize the semantic meaning of all the 66 keypoints. However, human labelling is still much better than a trained CPM keypoint detector because of the robustness of humans with respect to appearance and occlusion changes.
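The two keypoint measures can be computed as in the sketch below. It assumes that the mean pixel error is taken over positive detections only and that missing predictions count as negatives; the function name and the 10-pixel threshold in the example are hypothetical.

```python
from math import hypot


def keypoint_accuracy(predicted, ground_truth, max_dist_px):
    """Detection rate and mean pixel error of the positive detections.

    Both arguments map keypoint id -> (u, v). A prediction is positive when
    it lies within `max_dist_px` of its ground truth keypoint.
    """
    errors = [
        hypot(predicted[k][0] - u, predicted[k][1] - v)
        for k, (u, v) in ground_truth.items()
        if k in predicted
    ]
    positives = [e for e in errors if e < max_dist_px]
    rate = len(positives) / len(ground_truth)
    mean_error = sum(positives) / len(positives) if positives else float("nan")
    return rate, mean_error


ground_truth = {0: (0, 0), 1: (10, 0), 2: (20, 0), 3: (30, 0)}
predicted = {0: (3, 4), 1: (10, 0), 2: (50, 0)}  # keypoint 3 was not detected
rate, mean_error = keypoint_accuracy(predicted, ground_truth, max_dist_px=10)
```

Reporting both numbers matters: a detector can reach a low mean error simply by detecting few, easy keypoints, which the detection rate exposes.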
5.2 Qualitative Results
Some qualitative results are visualized in Fig. 7. From the two examples, we can see that the additional keypoint predictions provide more accurate 3D estimation than the direct method, due to the use of geometric constraints and inter-car relationship constraints. In particular, for the direct method, most errors occur in depth prediction. This can be explained by the nature of the method: it predicts the global 3D property of depth purely from object appearance in 2D, which is ill-posed and error-prone. However, thanks to the use of reliable masks, the method discovers more cars than the keypoint-based counterpart. For the keypoint-based approach, correctly detected keypoints are extremely successful at constraining car poses, while failed or missing keypoint estimation, especially for cars of unusual appearance, leads to missed detections or wrong pose solutions.
5.3 Result Analysis
To analyze the performance of different approaches, we evaluate them separately over various distances and occlusion ratios. Detailed results are shown in Fig. 6. From Fig. 6(a, b), as expected, the estimation accuracy decreases with distance, and the gap between human and algorithm narrows at larger distances. In addition, from Fig. 6(c, d), we find that the performance also drops as the occlusion ratio increases. However, we observe that the performance on non-occluded cars is the worst on average among all occlusion patterns. This is because most cars that experience little occlusion are far away and of small scale, while nearby cars are more often occluded.
6 Conclusion
This paper presents by far the largest and growing dataset (namely ApolloCar3D) for instance-level 3D car understanding in the context of autonomous driving. It is built upon industrial-grade high-precision 3D car models fitted to car instances captured in real-world scenarios. Complementing existing related datasets, e.g. [21], we hope this new dataset can serve as a long-standing benchmark facilitating future research on 3D pose and shape recovery.
In order to efficiently annotate complete 3D object properties, we have developed a context-aware 3D annotation pipeline, as well as two baseline algorithms for evaluation. We have also conducted a carefully designed human performance study, which reveals that there is still a visible gap between machine and human performance, motivating and suggesting promising future directions. More importantly, being built upon the publicly available ApolloScape dataset [23], our ApolloCar3D dataset contains a multitude of data sources including stereo, camera pose, semantic instance labels, per-pixel depth ground truth, and moving videos. Working with our data enables training and evaluation of a wide range of other vision tasks, e.g. stereo vision, model-free depth estimation, and optical flow, under real scenes.
7 Acknowledgement
The authors gratefully acknowledge He Jiang from Baidu Research for car visualization using obtained poses. Meanwhile, the authors also gratefully acknowledge Maximilian Jaritz from University of California, San Diego for counting car numbers of the proposed dataset.
References

[1]
M. Aubry, D. Maturana, A. A. Efros, B. C. Russell, and J. Sivic.
Seeing 3d chairs: exemplar partbased 2d3d alignment using a large
dataset of cad models.
In
Proceedings of the IEEE conference on computer vision and pattern recognition
, pages 37623769, 2014.  [2] G. Canal, S. Escalera, and C. Angulo. A realtime humanrobot interaction system based on gestures for assistive scenarios. Computer Vision and Image Understanding, 149:6577, 2016.
 [3] F. Chabot, M. Chaouch, J. Rabarisoa, C. Teulière, and T. Chateau. Deep manta: A coarsetofine manytask network for joint 2d and 3d vehicle analysis from monocular image. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 20402049, 2017.
 [4] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An informationrich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
 [5] L.C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834848, 2018.
 [6] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun. Monocular 3d object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 21472156, 2016.
 [7] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, and G. Y. J. Sun. Cascaded pyramid network for multiperson pose estimation.
 [8] F. Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint, pages 161002357, 2017.
 [9] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3dr2n2: A unified approach for single and multiview 3d object reconstruction. In Proc. Eur. Conf. Comp. Vis., 2016.
 [10] A. Dai, A. X. Chang, M. Savva, M. Halber, T. A. Funkhouser, and M. Nießner. Scannet: Richlyannotated 3d reconstructions of indoor scenes. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
 [11] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei. Imagenet: A largescale hierarchical image database. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 248255. Ieee, 2009.
 [12] X. Di and P. Yu. 3d reconstruction of simple objects from a single view silhouette image. arXiv preprint arXiv:1701.04752, 2017.
 [13] N. Dinesh Reddy, M. Vo, and S. G. Narasimhan. Carfusion: Combining point tracking and part detection for dynamic 3d reconstruction of vehicles. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., June 2018.
 [14] W. Ding, S. Li, G. Zhang, X. Lei, H. Qian, and Y. Xu. Vehicle pose and shape estimation through multiple monocular vision.
 [15] F. Engelmann, J. Stückler, and B. Leibe. Joint object pose estimation and shape reconstruction in urban street scenes using 3d shape priors. In German Conference on Pattern Recognition, pages 219-230. Springer, 2016.
 [16] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
 [17] H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3d object reconstruction from a single image. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
 [18] S. Fidler, S. Dickinson, and R. Urtasun. 3d object detection and viewpoint estimation with a deformable 3d cuboid model. In Proc. Adv. Neural Inf. Process. Syst., pages 611-619, 2012.
 [19] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381-395, 1981.
 [20] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2002-2011, 2018.
 [21] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 3354-3361. IEEE, 2012.
 [22] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Proc. IEEE Int. Conf. Comp. Vis., pages 2980-2988. IEEE, 2017.
 [23] X. Huang, X. Cheng, Q. Geng, B. Cao, D. Zhou, P. Wang, Y. Lin, and R. Yang. The apolloscape dataset for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 954-960, 2018.
 [24] A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik. Learning category-specific mesh reconstruction from image collections. In Proc. Eur. Conf. Comp. Vis., 2018.
 [25] A. Kar, S. Tulsiani, J. Carreira, and J. Malik. Category-specific object reconstruction from a single image. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 1966-1974, 2015.
 [26] W. Kehl, F. Milletari, F. Tombari, S. Ilic, and N. Navab. Deep learning of local rgb-d patches for 3d object detection and 6d pose estimation. In Proc. Eur. Conf. Comp. Vis., pages 205-220. Springer, 2016.
 [27] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional inverse graphics network. In Proc. Adv. Neural Inf. Process. Syst., pages 2539-2547, 2015.
 [28] A. Kundu, Y. Li, and J. M. Rehg. 3d-rcnn: Instance-level 3d object reconstruction via render-and-compare. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 3559-3568, 2018.
 [29] B. Leibe and B. Schiele. Analyzing appearance and contour based methods for object categorization. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., volume 2, pages II-409, 2003.
 [30] J. J. Leonard and H. F. Durrant-Whyte. Directed sonar sensing for mobile robot navigation, volume 175. Springer Science & Business Media, 2012.
 [31] V. Lepetit, F. Moreno-Noguer, and P. Fua. Epnp: An accurate o (n) solution to the pnp problem. Int. J. Comp. Vis., 81(2):155, 2009.
 [32] C. Li, M. Z. Zia, Q.-H. Tran, X. Yu, G. D. Hager, and M. Chandraker. Deep supervision with shape concepts for occlusion-aware 3d object parsing. arXiv preprint arXiv:1612.02699, 2016.
 [33] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. arXiv preprint arXiv:1611.07709, 2016.
 [34] M. Liang, B. Yang, S. Wang, and R. Urtasun. Deep continuous fusion for multi-sensor 3d object detection. In Proc. Eur. Conf. Comp. Vis., pages 641-656, 2018.
 [35] J. J. Lim, H. Pirsiavash, and A. Torralba. Parsing ikea objects: Fine pose estimation. In Proc. IEEE Int. Conf. Comp. Vis., pages 2992-2999, 2013.
 [36] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In Proc. Eur. Conf. Comp. Vis., pages 740-755. Springer, 2014.
 [37] R. Lopez-Sastre, C. Redondo-Cabrera, P. Gil-Jimenez, and S. Maldonado-Bascon. Icaro: image collection of annotated real-world objects, 2010.
 [38] F. Massa, R. Marlet, and M. Aubry. Crafting a multi-task cnn for viewpoint estimation. arXiv preprint arXiv:1609.03894, 2016.
 [39] W. Matusik, C. Buehler, R. Raskar, S. J. Gortler, and L. McMillan. Image-based visual hulls. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pages 369-374. ACM Press/Addison-Wesley Publishing Co., 2000.
 [40] J. McAuley and J. Leskovec. Image labeling on a network: using social-network metadata for image classification. In Proc. Eur. Conf. Comp. Vis., pages 828-841. Springer, 2012.
 [41] M. Menze and A. Geiger. Object scene flow for autonomous vehicles. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 3061-3070, 2015.
 [42] M. Menze, C. Heipke, and A. Geiger. Joint 3d estimation of vehicles and scene flow. In ISPRS Workshop on Image Sequence Analysis (ISA), 2015.
 [43] Y. Miao, X. Tao, and J. Lu. Robust 3d car shape estimation from landmarks in monocular image. In Proc. Brit. Mach. Vis. Conf., 2016.
 [44] P. Moreels and P. Perona. Evaluation of features detectors and descriptors based on 3d objects. Int. J. Comp. Vis., 73(3):263-284, 2007.
 [45] R. Mottaghi, Y. Xiang, and S. Savarese. A coarse-to-fine model for 3d pose estimation and subcategory recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 418-426, 2015.
 [46] A. Mousavian, D. Anguelov, J. Flynn, and J. Košecká. 3d bounding box estimation using deep learning and geometry. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 5632-5640. IEEE, 2017.
 [47] M. Ozuysal, V. Lepetit, and P. Fua. Pose estimation for category specific multi-view object localization. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 778-785. IEEE, 2009.
 [48] G. Pavlakos, L. Zhu, X. Zhou, and K. Daniilidis. Learning to estimate 3d human pose and shape from a single color image. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., June 2018.
 [49] P. Poirson, P. Ammirato, C.-Y. Fu, W. Liu, J. Kosecka, and A. C. Berg. Fast single shot detection and pose estimation. In 3DV, pages 676-684. IEEE, 2016.
 [50] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91-99, 2015.
 [51] B. C. Russell and A. Torralba. Building a database of 3d scenes from user annotations. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 2711-2718. IEEE, 2009.
 [52] S. Savarese and L. Fei-Fei. 3d generic object categorization, localization and pose estimation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 1-8, Oct 2007.
 [53] D. Stutz and A. Geiger. Learning 3d shape completion from laser scan data with weak supervision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, 2018.
 [54] H. Su, C. R. Qi, Y. Li, and L. J. Guibas. Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3d model views. In Proc. IEEE Int. Conf. Comp. Vis., pages 2686-2694, 2015.
 [55] A. Thomas, V. Ferrari, B. Leibe, T. Tuytelaars, B. Schiele, and L. V. Gool. Towards multi-view object class detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., volume 2, pages 1589-1596, June 2006.
 [56] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proc. IEEE Int. Conf. Comp. Vis., pages 4489-4497, 2015.
 [57] S. Tulsiani and J. Malik. Viewpoints and keypoints. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 1510-1519, 2015.
 [58] J. Uhrig, E. Rehder, B. Fröhlich, U. Franke, and T. Brox. Box2pix: Single-shot instance segmentation by assigning pixels to object boxes. In IEEE Intelligent Vehicles Symposium (IV), 2018.
 [59] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008, 2017.
 [60] Y. Wang, X. Tan, Y. Yang, X. Liu, E. Ding, F. Zhou, and L. S. Davis. 3d pose estimation for fine-grained object categories. arXiv preprint arXiv:1806.04314, 2018.
 [61] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4724-4732, 2016.
 [62] J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Torralba, and W. T. Freeman. Single image 3d interpreter network. In ECCV, pages 365-382. Springer, 2016.
 [63] Y. Xiang, W. Choi, Y. Lin, and S. Savarese. Data-driven 3d voxel patterns for object category recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 1903-1911, 2015.
 [64] Y. Xiang, W. Kim, W. Chen, J. Ji, C. Choy, H. Su, R. Mottaghi, L. Guibas, and S. Savarese. Objectnet3d: A large scale database for 3d object recognition. In Proc. Eur. Conf. Comp. Vis., pages 160-176. Springer, 2016.
 [65] Y. Xiang, R. Mottaghi, and S. Savarese. Beyond pascal: A benchmark for 3d object detection in the wild. pages 75-82. IEEE, 2014.
 [66] G. Yang, Y. Cui, S. Belongie, and B. Hariharan. Learning single-view 3d reconstruction with limited pose supervision. In Proc. Eur. Conf. Comp. Vis., September 2018.
 [67] T. Yu, J. Meng, and J. Yuan. Multi-view harmonized bilinear network for 3d object recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., June 2018.
 [68] M. Zeeshan Zia, M. Stark, and K. Schindler. Are cars just 3d boxes? Jointly estimating the 3d shape of multiple objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3678-3685, 2014.
 [69] M. Z. Zia, M. Stark, B. Schiele, and K. Schindler. Detailed 3d representations for object recognition and modeling. IEEE Trans. Pattern Anal. Mach. Intell., 35(11):2608-2623, 2013.

 [70] M. Z. Zia, M. Stark, and K. Schindler. Towards scene understanding with detailed 3d object representations. Int. J. Comp. Vis., 112(2):188-203, 2015.
Appendix A Keypoints Definition
Here we show the definitions of 66 semantic keypoints (Fig. 3).

0: Top left corner of left front car light;

1: Bottom left corner of left front car light;

2: Top right corner of left front car light;

3: Bottom right corner of left front car light;

4: Top right corner of left front fog light;

5: Bottom right corner of left front fog light;

6: Front section of left front wheel;

7: Center of left front wheel;

8: Top right corner of front glass;

9: Top left corner of left front door;

10: Bottom left corner of left front door;

11: Top right corner of left front door;

12: Middle corner of left front door;

13: Front corner of car handle of left front door;

14: Rear corner of car handle of left front door;

15: Bottom right corner of left front door;

16: Top right corner of left rear door;

17: Front corner of car handle of left rear door;

18: Rear corner of car handle of left rear door;

19: Bottom right corner of left rear door;

20: Center of left rear wheel;

21: Rear section of left rear wheel;

22: Top left corner of left rear car light;

23: Bottom left corner of left rear car light;

24: Top left corner of rear glass;

25: Top right corner of left rear car light;

26: Bottom right corner of left rear car light;

27: Bottom left corner of trunk;

28: Left corner of rear bumper;

29: Right corner of rear bumper;

30: Bottom right corner of trunk;

31: Bottom left corner of right rear car light;

32: Top left corner of right rear car light;

33: Top right corner of rear glass;

34: Bottom right corner of right rear car light;

35: Top right corner of right rear car light;

36: Rear section of right rear wheel;

37: Center of right rear wheel;

38: Bottom left corner of right rear car door;

39: Rear corner of car handle of right rear car door;

40: Front corner of car handle of right rear car door;

41: Top left corner of right rear car door;

42: Bottom left corner of right front car door;

43: Rear corner of car handle of right front car door;

44: Front corner of car handle of right front car door;

45: Middle corner of right front car door;

46: Top left corner of right front car door;

47: Bottom right corner of right front car door;

48: Top right corner of right front car door;

49: Top left corner of front glass;

50: Center of right front wheel;

51: Front section of right front wheel;

52: Bottom left corner of right front fog light;

53: Top left corner of right front fog light;

54: Bottom left corner of right front car light;

55: Top left corner of right front car light;

56: Bottom right corner of right front car light;

57: Top right corner of right front car light;

58: Top right corner of front license plate;

59: Top left corner of front license plate;

60: Bottom left corner of front license plate;

61: Bottom right corner of front license plate;

62: Top left corner of rear license plate;

63: Top right corner of rear license plate;

64: Bottom right corner of rear license plate;

65: Bottom left corner of rear license plate.
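
For readers who want to use these indices programmatically, the definitions above can be stored as a lookup table. The following is a minimal sketch in Python covering only the wheel and license-plate keypoints; the names `KEYPOINT_NAMES`, `NUM_KEYPOINTS`, and `keypoints_matching` are hypothetical and chosen for illustration, not taken from any released toolkit.

```python
# Illustrative subset of the 66 keypoint definitions listed in this appendix.
# All identifiers here are hypothetical; only the index/description pairs
# come from the list above.
NUM_KEYPOINTS = 66

KEYPOINT_NAMES = {
    6: "Front section of left front wheel",
    7: "Center of left front wheel",
    20: "Center of left rear wheel",
    21: "Rear section of left rear wheel",
    36: "Rear section of right rear wheel",
    37: "Center of right rear wheel",
    50: "Center of right front wheel",
    51: "Front section of right front wheel",
    58: "Top right corner of front license plate",
    59: "Top left corner of front license plate",
    60: "Bottom left corner of front license plate",
    61: "Bottom right corner of front license plate",
    62: "Top left corner of rear license plate",
    63: "Top right corner of rear license plate",
    64: "Bottom right corner of rear license plate",
    65: "Bottom left corner of rear license plate",
}


def keypoints_matching(term):
    """Return the sorted indices whose description contains `term`.

    The match is a case-insensitive substring test over the subset above.
    """
    term = term.lower()
    return sorted(i for i, name in KEYPOINT_NAMES.items() if term in name.lower())


if __name__ == "__main__":
    # Indices of the four wheel centers and of the rear license plate corners.
    print(keypoints_matching("center of"))
    print(keypoints_matching("rear license plate"))
```

Extending the table to all 66 entries only requires copying the remaining descriptions from the list; the indices are 0-based, as in the definitions.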