Pose Estimation and 3D Reconstruction of Vehicles from Stereo-Images Using a Subcategory-Aware Shape Prior

07/22/2021 ∙ by Max Coenen, et al.

The 3D reconstruction of objects is a prerequisite for many highly relevant applications of computer vision such as mobile robotics or autonomous driving. To deal with the inverse problem of reconstructing 3D objects from their 2D projections, a common strategy is to incorporate prior object knowledge into the reconstruction approach by establishing a 3D model and aligning it to the 2D image plane. However, current approaches are limited due to inadequate shape priors and the insufficiency of the derived image observations for a reliable alignment with the 3D model. The goal of this paper is to show how 3D object reconstruction can profit from a more sophisticated shape prior and from a combined incorporation of different observation types inferred from the images. We introduce a subcategory-aware deformable vehicle model that makes use of a prediction of the vehicle type for a more appropriate regularisation of the vehicle shape. A multi-branch CNN is presented to derive predictions of the vehicle type and orientation. This information is also introduced as prior information for model fitting. Furthermore, the CNN extracts vehicle keypoints and wireframes, which are well-suited for model-to-image association and model fitting. The task of pose estimation and reconstruction is addressed by a versatile probabilistic model. Extensive experiments are conducted using two challenging real-world data sets on both of which the benefit of the developed shape prior can be shown. A comparison to state-of-the-art methods for vehicle pose estimation shows that the proposed approach performs on par or better, confirming the suitability of the developed shape prior and probabilistic model for vehicle reconstruction.




1 Introduction

The image-based reconstruction of three-dimensional (3D) scenes and objects is a major topic of interest in computer vision and photogrammetry. The task of inferring the 3D geometry of objects is very challenging for vision algorithms since the perspective projection from 3D to the 2D image plane leaves many ambiguities about 3D objects, causing their reconstruction and the retrieval of their pose and shape to be ill-posed and difficult to solve. Nonetheless, 3D scene understanding and 3D reconstruction of specific target objects are highly relevant for several disciplines, e.g. autonomous driving. The precise reconstruction of objects, especially of other cars, is fundamental to ensure safe navigation and to enable applications such as interactive motion planning and collaborative positioning. Given this background, this paper presents a method for precise vehicle reconstruction from street-level stereo images (cf. Fig. 1).


Figure 1: Qualitative result of our method. Top: The 3D wireframe of the fitted vehicle model backprojected to the input image. Bottom: Two 3D views on the reconstructed vehicle model and the stereo points.

Geometrically, the image-based reconstruction of 3D objects from their 2D projections is an inverse problem, suffering from the ambiguous mapping inherent in the perspective projection, which causes the task to be ill-posed. To address ill-posed problems of this kind, a common approach is to introduce suitable constraints, e.g. derived from prior knowledge about the 3D shape. Thus, an essential strategy in existing work on image-based object reconstruction is to establish a 3D model and align it to features in the 2D image plane. To relax the requirement for precisely known object models, parametrised deformable shape priors can be formulated, increasing the set of free parameters by the parameters defining the shape. To constrain shape deformations so that the shape prior only results in geometrically valid shapes, the common approach is to penalise deviations from the mean shape determined from the training instances, e.g. (Zia et al., 2013; Engelmann et al., 2016). However, this procedure is founded on the assumption that the differences of vehicle shapes follow a unimodal distribution centred at the mean shape. Arguably, this assumption does not hold true and systematically penalises vehicle categories which differ from the average shape.

For object reconstruction, entities such as keypoints (Pavlakos et al., 2017; Murthy et al., 2017b; Zia et al., 2013), edges/contours (Leotta and Mundy, 2009; Ramnath et al., 2014; Ortiz-Cayon et al., 2016), the surface of the deformable 3D model (Engelmann et al., 2016), or a combination of all (Coenen and Rottensteiner, 2019) are usually aligned to their corresponding counterparts localised in the image. Often, the alignment is performed by minimising the reprojection error between projections of the 3D model entities and the associated 2D image detections (Pavlakos et al., 2017; Leotta and Mundy, 2009). Formulating this goal as an objective function, the estimation of the target parameters corresponds to the minimisation of that function. Typically, the objective is non-linear w.r.t. the target parameters and non-convex, such that one common way for optimisation is to use local optimisation techniques by applying first-order approaches that are based on the Jacobian of the objective. For example, Murthy et al. (2017a) apply an iterative least squares method for the minimisation of the backprojection error between vehicle model keypoints and keypoints detected in the image. In (Engelmann et al., 2016; Pavlakos et al., 2017), gradient descent is applied for vehicle model fitting. However, as non-linear least squares and gradient descent are methods for local optimisation, they do not guarantee finding the global optimum if the objective function is non-convex. Instead, convergence to the correct minimum heavily relies on the initialisation of the parameters and, consequently, fairly accurate initial solutions are required. Furthermore, least squares optimisation in general is highly sensitive to outliers and erroneous detections, which are likely to occur in the context of keypoint based model fitting due to imprecise or false positive keypoint detections.
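The reprojection error and its sensitivity to outliers can be made concrete with a minimal sketch. It assumes an ideal pinhole camera with hypothetical intrinsics `f`, `cx`, `cy`, a model posed by a single yaw angle plus a translation, and five made-up keypoints; the function names are illustrative and not taken from any of the cited methods.

```python
import numpy as np

def project(points_cam, f, cx, cy):
    """Pinhole projection of N x 3 camera-frame points to N x 2 pixel coordinates."""
    x, y, z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    return np.stack([f * x / z + cx, f * y / z + cy], axis=1)

def pose_points(model_pts, yaw, t):
    """Rotate model points about the vertical (y) axis by yaw, then translate by t."""
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return model_pts @ R.T + t

def reprojection_error(model_pts, detections, yaw, t, f=700.0, cx=320.0, cy=240.0):
    """Sum of squared pixel distances between projected model keypoints and detections."""
    proj = project(pose_points(model_pts, yaw, t), f, cx, cy)
    return float(np.sum((proj - detections) ** 2))

# Hypothetical keypoints (metres) and a hypothetical ground-truth pose.
model = np.array([[-1.0, 0.0, -2.0], [1.0, 0.0, -2.0], [1.0, 0.0, 2.0],
                  [-1.0, 0.0, 2.0], [0.0, -1.5, 0.0]])
true_yaw, true_t = 0.3, np.array([0.5, 1.5, 12.0])
obs = project(pose_points(model, true_yaw, true_t), 700.0, 320.0, 240.0)

err_true = reprojection_error(model, obs, true_yaw, true_t)       # perfect fit
err_off = reprojection_error(model, obs, true_yaw + 0.2, true_t)  # wrong yaw

# One gross outlier (80 px) dominates the squared error even at the correct pose.
obs_outlier = obs.copy()
obs_outlier[0] += np.array([80.0, 0.0])
err_outlier = reprojection_error(model, obs_outlier, true_yaw, true_t)
```

Because the residuals enter quadratically, the single displaced detection contributes 80² to the objective at the true pose, which is the behaviour that motivates robust alternatives to plain least squares.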

Following the path of model based object reconstruction, the aforementioned limitations are addressed in this work. Given initially detected vehicles, we present a methodology for the 3D reconstruction of vehicles from stereoscopic images to finally obtain precise estimates for the vehicle’s pose and shape. In extension to our previous work (Coenen and Rottensteiner, 2019), we make the following contributions in this paper:

  • We propose a new subcategory-aware deformable vehicle model to be used as shape prior. In contrast to existing approaches, e.g. (Zia et al., 2013; Engelmann et al., 2016), where a deformable shape model is always learned for the entire class vehicle, this work presents a shape model which also considers individual modes for different vehicle subcategories. Thus, the proposed model allows a more detailed shape regularisation if a prediction of the vehicle type is available. The presented shape prior leads to better constraints on the vehicle shape and demonstrably also enhances the results of pose estimation.

  • We extend our multi-branch CNN presented in (Coenen and Rottensteiner, 2019) to predict a probability distribution for the vehicle’s subcategory in addition to the prediction of the vehicle’s viewpoint and probability maps for vehicle keypoints and wireframe edges. These subcategory predictions are used to apply a more detailed shape regularisation of the vehicles using the proposed subcategory-aware vehicle model.

  • We present a comprehensive probabilistic model for vehicle reconstruction combining multiple observation likelihoods based on the keypoint and wireframe probability maps. Extending our previous work (Coenen and Rottensteiner, 2019), the proposed subcategory-aware shape model and the predictions of the vehicle type are incorporated into the probabilistic formulation to act as a novel state prior for the vehicle shape.

  • We conduct extensive experiments, assessing the effect of individual constituents of the proposed probabilistic model on the quality of the vehicle reconstructions based on the well-known KITTI dataset (Geiger et al., 2012) and our own stereo dataset for vehicle reconstruction. Extending the work in (Coenen and Rottensteiner, 2019), we do not only analyse the effects of the proposed shape prior on the pose estimates, but also those of other components of the probabilistic model, and we also evaluate the estimated vehicle shapes.

2 Related work

This section provides an overview of related work on object reconstruction and object pose and shape recovery. The focus is on different shape representations used in the literature and on methods that deal with objects, data and applications related to autonomous driving with vehicles as main objects of interest.

2.1 3D Shape representations and shape priors

The ways in which object shapes are represented and prior model knowledge is used vary in the literature. In this review, we limit ourselves to methods for the representation of rigid objects such as vehicles. Shape representations which are primarily designed for and applied to non-rigid objects such as humans or clothes, e.g. (Akhter et al., 2011), are not considered.

A strongly generalising and frequently used representation of objects is given by 3D bounding boxes (Chen et al., 2015, 2016; Mousavian et al., 2017; Ku et al., 2019). The estimation of oriented 3D bounding boxes for the objects implicitly contains the information about the objects’ extents and poses, i.e. their positions and orientations in 3D space. However, a 3D bounding box representation entirely neglects the reasoning about the object’s shape by only representing objects as 3D boxes. Yet, the fine-grained estimation of an object’s shape can be important for several applications and tasks, such as the (re-)identification of objects (Tang et al., 2019). To introduce prior model knowledge for the reconstruction of vehicles, simple constraints on the vehicles’ symmetry, on the centres of the wheels or the corners of the rooftop lying in a plane, and prior knowledge about the expected size are enforced in (Murthy et al., 2017a; Ding et al., 2018). While these constraints give a rather basic and coarse representation of an object, computer-aided design (CAD) models allow a very detailed description of a 3D shape. A large variety of CAD models is available for all kinds of vehicle types and brands, which can be used directly to guide the vehicle reconstruction (Güney and Geiger, 2015). However, without knowing the exact CAD model of interest, it is intractable to run computations with each CAD model to find the most suitable one. Given a detected vehicle, Chabot et al. (2017) use a convolutional neural network (CNN) to predict the closest 3D shape template from a set of rigid reference models and pick this model for further processing. However, potential errors in predicting the shape template may lead to erroneous reconstruction results. In this work, instead of enforcing a fixed shape for model fitting, softer constraints on the vehicle shape are implemented, which allow deformations of the shape according to the observations. For the task of vehicle re-identification, Tang et al. (2019) predict the vehicle type, e.g. compact car, sedan, limousine, etc., using a CNN. However, the predicted type is not utilised in the context of vehicle reconstruction. In this work, a CNN which predicts the vehicle type is also used. A confidence-aware shape prior is presented which makes use of the type predictions by constraining shape deformations during the model fitting according to the predicted confidence scores for different vehicle types.

In contrast to rigid model instances, deformable shape representations learned from a set of reference shapes are more flexible and make it possible to cope with the large intra-class variability of vehicles. Therefore, they are used frequently for object reconstruction. As a consequence, the shape parameters are added to the list of target parameters for the estimation. In (Engelmann et al., 2016; Kundu et al., 2018; Manhardt et al., 2019), a Truncated Signed Distance Function (TSDF) is learned as a deformable shape prior for vehicles from a set of CAD models. Using a TSDF, the shape is represented in a voxel grid in which each voxel contains the truncated signed distance towards the object surface and thus, the shape manifold is implicitly represented as the zero level set of the TSDF. The common shape basis of the training set is learned by applying Principal Component Analysis (PCA) to the TSDF representations of the training samples. Once the shape basis is learned, any deformed TSDF shape can be encoded by a low-dimensional shape vector. Similarly, Najibi et al. (2020) use an implicit shape representation as the zero level set of a signed distance field. They train a CNN end-to-end to detect cars in 3D point clouds; the CNN also predicts the signed distance for every 3D point, and based on this information, a mesh representation is constructed. No quantitative evaluation of shape reconstruction is given. In any case, implicit representations only carry the information about the model surface, while information such as semantic keypoint locations or wireframe edges, which can be significant cues for model fitting, is not explicitly contained. In contrast to that, an Active Shape Model (ASM) is another deformable shape representation frequently used as shape prior for vehicles (Zia et al., 2013; Lin et al., 2014; Murthy et al., 2017a; Ansari et al., 2018) that naturally contains keypoint information as it is learned by performing PCA on keypoints from training models. In (Cootes et al., 2001), an Active Appearance Model (AAM) is proposed, in which an ASM based statistical shape model is combined with an additional model representing the texture variations associated with the shape variations. However, while the proposed AAM is only applied to objects observed from one unique viewpoint (a close-up frontal view of human faces in (Cootes et al., 2001)), learning the appearance model for objects from arbitrary viewpoints becomes very complex and therefore intractable. In this work, an ASM representation is adapted as shape prior. However, we extend its keypoint based representation by defining a triangulated mesh and a wireframe topology based on the keypoints to achieve a joint representation for the vehicle’s surface, keypoints, and wireframe at the same time (Coenen and Rottensteiner, 2019). These model entities are incorporated in the reconstruction process.

Such deformable shape priors are flexible but they can only be deformed in accordance with the variability contained in the training data. A common way of regularising the degree of admissible deformations during inference is to penalise deviations from the mean shape (Zia et al., 2013; Engelmann et al., 2016). However, this strategy is founded on the assumption that the object shape variability follows a unimodal distribution, which usually does not hold true. Especially in the case of vehicles, the shape variations result in several disjoint modes corresponding to different vehicle types, rather than following a unimodal distribution (Lin et al., 2014). Consequently, the applied regularisation of shape deformations is likely to enforce incorrect model shapes. In this work, the modes resulting from different vehicle types are learned together with the overall ASM representation. A CNN based prediction of the vehicle type from the image is used to guide the shape regularisation based on a newly proposed category-aware formulation of the shape prior.
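The difference between a mean-shape regulariser and a regulariser that respects type-specific modes can be illustrated with a small sketch. It is not the formulation of this paper, which is not given in this section: the training shapes are synthetic, the two vehicle types are hypothetical, and the subcategory penalty is simply assumed to be a probability-weighted squared distance to per-type means in PCA shape-parameter space.

```python
import numpy as np

def learn_asm(shapes, n_modes):
    """PCA on flattened keypoint sets (one row per training shape)."""
    mean = shapes.mean(axis=0)
    _, sing, vt = np.linalg.svd(shapes - mean, full_matrices=False)
    basis = vt[:n_modes]                               # principal deformation directions
    sigma = sing[:n_modes] / np.sqrt(len(shapes) - 1)  # standard deviation per mode
    return mean, basis, sigma

def encode(shape, mean, basis, sigma):
    """Shape parameters in units of standard deviations."""
    return (basis @ (shape - mean)) / sigma

def decode(gamma, mean, basis, sigma):
    """Deformed shape from a low-dimensional shape vector."""
    return mean + (gamma * sigma) @ basis

def mean_shape_penalty(gamma):
    """Unimodal regulariser: squared distance to the overall mean shape."""
    return float(np.sum(gamma ** 2))

def subcategory_penalty(gamma, type_means, type_probs):
    """Assumed regulariser: probability-weighted squared distance to the type means."""
    d2 = np.sum((type_means - gamma) ** 2, axis=1)
    return float(type_probs @ d2)

# Synthetic bimodal training set: two hypothetical vehicle types, 20 shapes each,
# every shape being 4 keypoints in 3D flattened to 12 values.
rng = np.random.default_rng(0)
base, offset = rng.normal(size=12), rng.normal(size=12)
labels = np.repeat([0, 1], 20)
sign = np.where(labels[:, None] == 0, -1.0, 1.0)
shapes = base + sign * offset + 0.05 * rng.normal(size=(40, 12))

mean, basis, sigma = learn_asm(shapes, n_modes=2)
gammas = np.array([encode(s, mean, basis, sigma) for s in shapes])
type_means = np.array([gammas[labels == k].mean(axis=0) for k in (0, 1)])

g = type_means[0]                 # the typical shape of type 0
p_uni = mean_shape_penalty(g)     # penalised although it is a perfectly typical shape
p_sub = subcategory_penalty(g, type_means, np.array([0.9, 0.1]))
```

In this toy setting the typical shape of one type lies about one standard deviation away from the overall mean, so the unimodal penalty pulls it towards a shape that no training vehicle actually has, whereas the type-weighted penalty is considerably smaller when the type prediction is confident.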

2.2 Pose estimation and 3D reconstruction

2.2.1 3D pose prediction

With the emergence of CNNs, the prediction of 3D object pose from (single) images has experienced a huge boost, and work on 3D object bounding box prediction has expanded significantly over the last few years. One line of work follows the two-step procedure that was already successfully applied by Region Convolutional Network (RCNN) approaches for 2D object detection (Ren et al., 2015), by generating 3D object proposals in a first step, which are passed through a CNN to generate the final detections in a second step. In this context, Chen et al. (2015) make use of stereo image data and the 3D information derived from it to introduce geometric priors, such as object height and point density, and to reason about free-space in order to derive 3D bounding box proposals for street-level objects. Their follow-up work (Chen et al., 2016) replaces the geometric priors by priors based on scene-context and object shape, which are derived from semantic segmentation and instance segmentation using monocular images. While these methods have shown good results, they are computationally expensive due to the generation and the processing of a large number of object proposals initialised in 3D. Stereo images are also used in (Li et al., 2019), where a Stereo RCNN is proposed to simultaneously detect and associate 2D bounding boxes in the left and right images. Furthermore, the authors propose to predict keypoints corresponding to the bottom corners of the 3D bounding box in both stereo images in order to derive the oriented 3D box from them. However, in this work, we are not only interested in the 3D object bounding boxes but rather aim at a shape aware reconstruction of the vehicles.

Another line of work builds upon the success of existing work on 2D object detection for the task of 3D bounding box estimation. Ku et al. (2019) make use of 2D object detections in the image to infer 3D bounding box proposals by leveraging the relation between the 2D bounding box and an estimated object height. However, small errors in the 2D bounding box estimates or the height estimates of the 3D bounding box are likely to cause large errors in the position estimation of the object in 3D space. In (Xiang et al., 2018), a CNN is trained to localise the object centre in the image and to regress the object’s orientation and distance to finally derive the object’s 3D pose. However, predicting object distances, i.e. the absolute scale, from single images is an ill-posed problem and therefore causes ambiguous solutions. Given 2D detections delivered by an object detector, Mousavian et al. (2017) propose a CNN to regress the object extents and orientation from single images instead of regressing the 3D translation in object space. The fact that the perspective projection of the 3D bounding box should fit the 2D image bounding box is used to infer the absolute translation of the object from the regressed object extents and orientation. Similar to this, Tekin et al. (2018) and Grabner et al. (2018) propose to train a CNN for the prediction of the 2D image locations of the projected 3D bounding box vertices to estimate the 6DoF pose of objects with known size via spatial resection. However, these approaches are highly sensitive to errors in the 2D predictions and inaccuracies of the regressed parameters. Estimating 3D dimensions or 3D pose from single images is highly ambiguous and therefore causes large average 3D position errors which can be up to several metres (Mousavian et al., 2017). Besides, the mentioned approaches entirely neglect the reasoning about the object shape by only representing objects as 3D boxes.

2.2.2 3D pose and shape prediction

Recent work on object reconstruction delivers the 3D object pose together with a set of parameters defining the shape of an object given a parametric shape representation. Kundu et al. (2018) learn a ten-dimensional shape representation for vehicles by applying PCA to a training set of voxelised vehicle models. An RCNN is trained to detect vehicles and to regress the ten-dimensional shape vector in addition to the 3D pose parameters, thus obtaining a complete vehicle reconstruction in 3D space. Instead of using PCA, in (Zhu et al., 2017; Manhardt et al., 2019), a 3D convolutional autoencoder is trained from voxelised training shapes to learn a shape parameter vector corresponding to the intermediate representation of the autoencoder in the low dimensional latent space. A network is trained to predict a small number of shape parameters together with the object pose. A major drawback of such direct approaches, in which a CNN is trained to regress shape and pose parameters, is the strong dependency on usually large amounts of expensive 3D training data.

2.2.3 Shape aware reconstruction

In contrast to that, indirect approaches initially detect and finally reconstruct the objects of interest by fitting a 3D model defined a priori to the 2D image observations to reason about the object pose and shape. Another advantage of using 3D models as shape priors is their natural invariance w.r.t. the viewpoint. For instance, a classifier for viewpoint estimation requires training data for each of the considered viewpoint classes, whose number will be large if a fine-grained viewpoint estimation is required, resulting also in a large demand for training data. In contrast to that, model driven approaches allow for model hypotheses from any viewpoint to be fitted to the image observations. A common approach to shape aware object reconstruction is to match entities such as keypoints, edges/contours, the surface or the silhouette of the model to the corresponding entities inferred from the image.

Edge/wireframe based reconstruction
Pioneering work on vehicle model fitting was based on model edge to image edge alignment (Tsin et al., 2009; Leotta and Mundy, 2009). The authors defined an appearance representation for a deformable vehicle model based on salient object edges that are likely to generate intensity edges such as occluding contours and part boundaries. Given sufficient 2D-3D correspondences between image and model edges, the pose and shape of the deformable model are estimated using iterative least squares adjustment. However, finding the 2D image edge corresponding to a 3D model edge, or more specifically finding corresponding point pairs on these edges, is a non-trivial and challenging task. Occlusions, illumination conditions, contrast, shadows or reflections, and model initialisation are likely to cause outliers and consequently lead to incorrect correspondences. An attempt to filter edge maps and extract semantically meaningful contours has been presented in (Isola et al., 2014) and is used in (Ortiz-Cayon et al., 2016) to reduce the number of outliers for the task of edge-to-edge alignment. Furthermore, a threshold for the angles between the edge normals of image and model edges is used as a criterion by Ortiz-Cayon et al. (2016) and Ramnath et al. (2014) to discard unlikely correspondence candidates. Still, the risk of incorrect matches and the need for good model initialisation remain. In this paper, a CNN is used to extract the desired vehicle wireframe edges, where the CNN is trained to semantically distinguish between wireframe edges belonging to different sides of a vehicle, which allows an informed, more robust, and initialisation-invariant model-to-image-edge association.

Contour/silhouette based reconstruction
Another strategy for model based object reconstruction is to align the silhouette that results from the 3D model to a predicted segmentation mask of the target object. Prisacariu et al. (2012) train a Random Forest based on Histogram of Oriented Gradients (HoG) (Dalal and Triggs, 2005) and colour features for foreground-background classification and define an energy function considering pixelwise foreground and background matching scores to fit a deformable vehicle model to vehicle detections. A similar energy term is incorporated in (Dame et al., 2013), where foreground-background models are learned from reference segmentations for the parts of a Deformable Part Model (DPM) (Felzenszwalb et al., 2010) detector, which is used to infer pixel-wise foreground-background probabilities during test time. In (Kar et al., 2015), an instance segmentation method is applied to derive segmentation masks and a silhouette consistency term is used in the model fitting procedure. Similarly, Wang et al. (2020) propose a silhouette alignment term measuring the consistency between an image segmentation mask and the object mask obtained by projecting the shape embedding into the image, and combine it with a term enforcing photometric consistency between the two images of a stereo pair in order to fit a 3D SDF to detected vehicles. However, the mapping from a 3D shape model to a 2D image silhouette is highly multi-modal and ambiguous, e.g. because object details and structures within the silhouette are neglected. Moreover, segmentation masks of a vehicle in front and in rear views are almost identical, which makes the alignment highly sensitive to initialisation. Furthermore, it is also sensitive to segmentation errors and occlusions.

Surface based reconstruction
Reconstructing objects based on the 3D surface of an object shape model requires depth or 3D information that can be derived from image observations (e.g. from stereo triangulation or Structure from Motion (SfM)) or can be obtained from laserscanning. Güney and Geiger (2015) use CAD models of vehicles to sample disparity patches from them and align these patches to disparity maps estimated from stereo images. Similarly, in (Menze et al., 2015) a 3D vehicle ASM is also estimated in accordance with a dense disparity map. In addition, images of a subsequent time epoch are used to also incorporate model constraints based on scene flow information. In contrast, this paper proposes a method for vehicle reconstruction using a single stereo image pair. In (Engelmann et al., 2016) and (Coenen et al., 2017), a TSDF and an ASM, respectively, are learned as shape priors for vehicles and are fitted to 3D point clouds obtained from dense stereo correspondences. Similarly, Xiao et al. (2016) fit a vehicle ASM to 3D points obtained from mobile laserscanning. Points reconstructed in 3D from multi-view images are used by Ortiz-Cayon et al. (2016) to fit a CAD model to detected vehicles. In these cases, model fitting is based on minimising the distances between the 3D points and the surface of the shape manifold. This procedure presents various difficulties. One of the major problems is the missing semantic information of 3D points, required to relate them to corresponding parts of the vehicle model. Another problem arises from outliers in the 3D point cloud, resulting e.g. from detection, matching or segmentation errors, and points belonging to parts not represented by the generalised shape prior, like the vehicle’s interior, antennae, mirrors, etc. Furthermore, noisy point clouds and the increasing uncertainty of the stereo reconstructed 3D points with increasing distance from the camera remain challenging for reconstruction approaches that are only based on 3D points. Besides, when 3D points are used as the only data source, valuable image cues are disregarded completely for model fitting. In this work, 3D information is used jointly with 2D image information to exploit synergies of both domains.

Keypoint based reconstruction
Another strategy for shape-aware vehicle reconstruction is to match model keypoints with their corresponding keypoints detected in the image. One advantage of using such semantic keypoints compared to the approaches described so far is the easier definition of correspondences between the image and model entities. Traditionally, handcrafted features, often based on HoG features, are used for the model-to-keypoint alignment, e.g. in (Li et al., 2011; Zia et al., 2013; Bao et al., 2013), while recently CNNs were introduced to detect keypoints, resulting in a better performance. For instance, Chabot et al. (2017) extend an RCNN for object detection to also regress keypoint coordinates in addition to the bounding box and the object class. A more frequently applied architecture for keypoint detection is a stacked-hourglass architecture (Pavlakos et al., 2017; Murthy et al., 2017b; Ding et al., 2018), in which multiple stacked U-Nets (Ronneberger et al., 2015) are used to infer keypoint probability maps from which the keypoint locations are derived, e.g., using non-maximum suppression. A U-Net-like architecture is also used in this work to predict keypoint heatmaps. Given the 2D image keypoint locations and their corresponding vertices of the 3D model, one naive approach for model fitting is to minimise the reprojection error of a spatial resection to solve for the 6DoF pose (Chabot et al., 2017). However, this naive procedure is problematic for various reasons. On the one hand, it is intolerant to keypoint localisation errors, while at the same time the inferred 2D keypoint localisations are likely to be imprecise, leading to potentially large errors in 3D. On the other hand, it is not robust w.r.t. outliers, such that potentially false keypoint detections influence the pose estimation directly and call for robust strategies.
The probabilistic approach presented in this paper avoids the need for precise keypoint locations, because it is not built upon inferred 2D keypoint coordinates but on the raw keypoint probability maps instead. By incorporating the full keypoint probability distributions into the optimisation instead of only their inferred modes, the model fitting gains robustness w.r.t. imprecise keypoint localisations caused e.g. by broad probability distributions. Performing the keypoint alignment on the raw heatmaps has also been done by Zia et al. (2013). The authors used a Random Forest (RF) classifier (Breiman, 2001), trained on gradient based handcrafted features for keypoint classification, to derive the keypoint probability maps. However, the performance of an RF is inferior to that of deep learning based techniques, as has been found in our own previous work using an RF (Coenen et al., 2019) and a CNN (Coenen and Rottensteiner, 2019) for the prediction of vehicle keypoints.
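The idea of evaluating a model hypothesis directly on raw keypoint probability maps, rather than on extracted keypoint coordinates, can be sketched as follows. This is an illustrative simplification and not the likelihood used in this paper: the heatmaps are synthetic Gaussians, the score is assumed to be a plain sum of log-probabilities read at the projected model keypoints, and all function names are hypothetical.

```python
import numpy as np

def gaussian_heatmap(h, w, centre, sigma):
    """Synthetic keypoint probability map peaked at centre = (row, col)."""
    rows, cols = np.mgrid[0:h, 0:w]
    d2 = (rows - centre[0]) ** 2 + (cols - centre[1]) ** 2
    heat = np.exp(-0.5 * d2 / sigma ** 2)
    return heat / heat.sum()

def heatmap_log_likelihood(heatmaps, projected, eps=1e-12):
    """Sum of log-probabilities read at the projected keypoint positions.

    heatmaps:  K x H x W array, one probability map per keypoint.
    projected: K x 2 array of (row, col) positions of the projected model keypoints.
    """
    k, h, w = heatmaps.shape
    rc = np.rint(projected).astype(int)
    r = np.clip(rc[:, 0], 0, h - 1)
    c = np.clip(rc[:, 1], 0, w - 1)
    return float(np.sum(np.log(heatmaps[np.arange(k), r, c] + eps)))

true_kp = np.array([[20.0, 30.0], [40.0, 50.0]])
# The second map is deliberately broad to mimic an uncertain detection.
maps = np.stack([gaussian_heatmap(64, 64, true_kp[0], 2.0),
                 gaussian_heatmap(64, 64, true_kp[1], 8.0)])

score_true = heatmap_log_likelihood(maps, true_kp)
# Shift one projected keypoint by 4 px at a time.
score_kp0_off = heatmap_log_likelihood(maps, true_kp + np.array([[0.0, 4.0], [0.0, 0.0]]))
score_kp1_off = heatmap_log_likelihood(maps, true_kp + np.array([[0.0, 0.0], [0.0, 4.0]]))
```

The same 4 px displacement costs far less on the broad map than on the sharp one, so uncertain detections automatically constrain the fit less, without any explicit outlier handling on extracted coordinates.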

3 Methodology

This section presents our approach for the model based shape and pose recovery of initially detected vehicles from stereo imagery.

3.1 Overview

The goal of the proposed method is to recover the precise 3D pose (i.e. the 6DoF parameters of the position and orientation in 3D) as well as the type and shape of vehicles detected from street-level stereo images. The method requires images acquired by a calibrated stereo rig, e.g. attached to a moving platform. The camera synchronisation is assumed to be sufficiently accurate so that the influence of object and platform movements can be neglected. The interior and relative orientations as well as the length of the baseline are assumed to be known, and the images are rectified prior to processing so that epipolar lines correspond to image rows. The left stereo partner is defined to be the reference image. The real-time capable ELAS matcher (Geiger et al., 2011) is applied to determine a dense disparity map for every stereo pair, which is used to reconstruct a 3D point cloud in the 3D model coordinate system from every pixel of the reference image via triangulation. The origin of the model coordinate system is defined to be the projection centre of the left camera. Two of its axes span a plane parallel to the image plane of the epipolar images, and the third axis points in the viewing direction.
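For a rectified stereo pair, the triangulation of a disparity map into a point cloud follows the standard relation between depth, focal length, baseline and disparity. The sketch below assumes a focal length `f` in pixels, a principal point `(cx, cy)`, a baseline in metres, and an axis convention (x right, y down, z along the viewing direction) that is chosen here for illustration; the matcher itself is not reproduced.

```python
import numpy as np

def disparity_to_points(disparity, f, baseline, cx, cy):
    """Triangulate valid pixels of a rectified disparity map into 3D points.

    Returns an N x 3 array in a frame with its origin at the left projection
    centre, x to the right, y down and z along the viewing direction
    (an assumed convention).
    """
    v, u = np.nonzero(disparity > 0)      # rows and columns with a valid disparity
    d = disparity[v, u]
    z = f * baseline / d                  # depth from disparity
    x = (u - cx) * z / f
    y = (v - cy) * z / f
    return np.stack([x, y, z], axis=1)

# Tiny hypothetical disparity map with two valid pixels (values in pixels).
disp = np.zeros((4, 6))
disp[2, 3] = 35.0
disp[1, 5] = 7.0
pts = disparity_to_points(disp, f=700.0, baseline=0.5, cx=3.0, cy=2.0)
```

Since depth is inversely proportional to disparity, a fixed disparity error of one pixel translates into a depth error that grows roughly quadratically with distance, which is why distant stereo points are much less reliable than near ones.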

The proposed method is designed to deduce a 3D vehicle model which represents the detected vehicle best in terms of pose and shape. To this end, a 3D model is fitted to the observed data. The target parameters are implicitly contained in the fitted model. Prior knowledge about the 3D layout of the observed scene is extracted and used to constrain the parameter space of the model fitting approach. A deformable vehicle model is learned as a shape prior and is fitted to the detected vehicles. The detection of vehicles and their reconstruction are treated as decoupled tasks. To initially detect the vehicles visible in the stereo images, a state-of-the-art detection method is adapted and its output is tailored towards the requirements of the proposed vehicle reconstruction method.

Based on the initial vehicle detections, the core of the proposed method consists of (1) a novel subcategory-aware deformable vehicle model, (2) a multi-task CNN, trained to extract various pieces of semantic information from the vehicle detections to be incorporated into (3) an extensive probabilistic model that is designed to find the best fitting instance of the deformable model for each vehicle. The general framework of the proposed approach is depicted in Fig. 2 and will be presented in the following sections.

Figure 2: Overview of the proposed framework.

3.1.1 Problem statement

The ultimate goal of the proposed method is to detect the set of vehicles visible in the given stereo pair and to recover their precise shape and 6DoF poses in the model coordinate system. For this purpose, we describe each stereo scene by a 3D ground plane (cf. Sec. 3.1.2). Furthermore, each vehicle is associated with a state vector. The state vector contains the pose and shape of the vehicle, represented by its position on the ground plane, its orientation, i.e. the rotation angle about the normal vector of the ground plane, and a vector determining the shape of a deformable ASM representing the vehicle (cf. Sec. 3.2). It has to be noted that the 6DoF vehicle pose parameters w.r.t. the reference camera can easily be derived from the 2D position and orientation on the ground plane, knowing the rigid transformation between the ground plane and the model coordinate system.

3.1.2 Scene layout

Prior to vehicle reconstruction, the stereo data are used to derive knowledge about the 3D layout of the scene, represented by the 3D ground plane and a probabilistic free-space grid map. Requiring vehicles always to be located on the ground plane, estimating that plane reduces and constrains the parameter space of the model fitting approach, predetermining three of the 6DoF vehicle pose parameters (1 rotational and 2 translational parameters). The ground plane is extracted from the reconstructed 3D points using the RANSAC-based method described in Coenen et al. (2018). All inliers of the final RANSAC consensus set are stored as the set of ground points, whereas the remaining points form the set of arbitrary object points.
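The split of the point cloud into ground points and object points can be illustrated by a generic RANSAC plane fit. This is a minimal sketch of plain RANSAC, not the specific method of Coenen et al. (2018); the inlier threshold, the number of iterations and all names are assumptions.

```python
import numpy as np

def ransac_plane(points, threshold=0.05, iterations=200, seed=0):
    """Fit a plane n.x + d = 0 with plain RANSAC; returns (n, d) and the inlier mask."""
    rng = np.random.default_rng(seed)
    best_mask = np.zeros(len(points), dtype=bool)
    best_plane = None
    for _ in range(iterations):
        sample = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:               # degenerate (collinear) sample
            continue
        normal = normal / norm
        offset = -normal @ sample[0]
        mask = np.abs(points @ normal + offset) < threshold
        if mask.sum() > best_mask.sum():
            best_mask, best_plane = mask, (normal, offset)
    return best_plane, best_mask

# Synthetic scene: 100 points on a flat ground at y = 1.5 and 20 points above it.
rng = np.random.default_rng(1)
ground = np.column_stack([rng.uniform(-10, 10, 100), np.full(100, 1.5), rng.uniform(5, 30, 100)])
objects = np.column_stack([rng.uniform(-10, 10, 20), rng.uniform(0.0, 1.0, 20), rng.uniform(5, 30, 20)])
cloud = np.vstack([ground, objects])
(plane_normal, plane_offset), ground_mask = ransac_plane(cloud)
ground_points, object_points = cloud[ground_mask], cloud[~ground_mask]
```

The inlier mask of the best consensus set plays the role of the set of ground points, and its complement that of the object points.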

A probabilistic free-space grid map, defined in the ground plane with a fixed spatial resolution, is created to quantitatively represent areas not occupied by any object (Coenen et al., 2018). For each grid cell, the number of ground points and the number of object points whose orthogonal projections fall into the cell are counted. Grid cells without any projected points are marked as unknown. For all other cells, the probability of the cell being free space is calculated from these counts.


The free-space grid map is used to derive prior information about the vehicle’s position (cf. Sec. 3.4).

3.1.3 Detection of vehicles

To detect the vehicles, we apply the pretrained Mask-RCNN of He et al. (2017) to the reference image due to its good performance. It delivers not only bounding boxes but also an instance segmentation mask for every vehicle detection. Each detection is associated with an observation vector containing a set of object points as well as its bounding boxes in the left and right images. To extract the vehicle points, the 3D points reconstructed from the disparities of the foreground pixels belonging to the respective segmentation mask are chosen. While the vehicle bounding boxes in the left image are delivered as an output of the Mask-RCNN, the bounding boxes in the right image are derived from the dense stereo correspondences.

3.2 Subcategory-aware shape prior

Similar to Zia et al. (2013), we use a 3D ASM as vehicle shape prior. The ASM is learned by applying principal component analysis (PCA) to a set of 3D keypoints that were manually labelled for a variety of CAD models of vehicles, each belonging to one of a set of different vehicle types. In the experiments of this paper, seven vehicle types are distinguished: {Compact Car, Sedan, SUV, Estate Car, Sports Car, Truck, Van}. A synthesised model, deformed according to the shape parameters, can be obtained by the linear combination of the mean model and the eigenvectors, weighted by the square roots of their corresponding eigenvalues and scaled by the object-specific shape parameters (Eq. 2). It has to be noted that in the practical realisation of the presented method, not all eigenvalues and eigenvectors of the ASM are considered. Instead, in order to reduce the dimensionality of the unknown shape parameter vector, the number of considered shape parameters is restricted and is defined as a proper tradeoff between the complexity of the model and the quality of the model approximation. A fully parametrised instance of a 3D vehicle ASM can be created by shifting and rotating the deformed model on the ground plane according to the translation vector and the heading angle.
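The learning and synthesis of the ASM can be sketched as follows. The sketch uses random data in place of the labelled CAD keypoints, and all names (`learn_asm`, `synthesise`, `gamma`) are chosen for illustration only; it shows PCA on flattened keypoint sets and the linear combination of the mean model with the eigenvectors, scaled by the square roots of the eigenvalues and the shape parameters.

```python
import numpy as np

def learn_asm(shapes):
    """PCA on keypoint sets of shape (n_models, n_keypoints, 3)."""
    n_models = shapes.shape[0]
    flat = shapes.reshape(n_models, -1)
    mean = flat.mean(axis=0)
    centred = flat - mean
    # Rows of vt are the principal directions (eigenvectors of the covariance).
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    eigenvalues = singular_values**2 / (n_models - 1)
    return mean, vt, eigenvalues

def synthesise(mean, eigenvectors, eigenvalues, gamma):
    """Deform the mean model with the first len(gamma) principal components."""
    k = len(gamma)
    offsets = (gamma * np.sqrt(eigenvalues[:k])) @ eigenvectors[:k]
    return (mean + offsets).reshape(-1, 3)

# Placeholder training data: 6 models with 4 keypoints each (random, for illustration).
rng = np.random.default_rng(0)
training_shapes = rng.normal(size=(6, 4, 3))
mean, eigenvectors, eigenvalues = learn_asm(training_shapes)
mean_model = synthesise(mean, eigenvectors, eigenvalues, np.zeros(2))
deformed = synthesise(mean, eigenvectors, eigenvalues, np.array([1.0, -0.5]))
```

Because the shape parameters are scaled by the square roots of the eigenvalues, a parameter value of 1 corresponds to a deformation of one standard deviation along the respective principal direction.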

3.2.1 Geometrical representation

While the classical representation of an ASM only contains explicit information about the keypoints (Cootes et al., 1995), in this work, the ASM is enriched by an additional explicit definition of the model surface as well as a definition of model wireframe edges. In particular, a triangular mesh is defined for the ASM vertices to represent the model surface. The topology of the mesh representing the surface of the ASM is manually defined once and kept constant for all generated, deformed models. A visualisation of the defined triangular vehicle mesh can be found in Fig. 3. Note that keypoints exist (e.g. the centre points of the wheels) which are not part of the triangulation in order to decrease the number of faces.

Moreover, edges connecting pairs of keypoints are manually chosen to define a wireframe topology of the vehicle model, consisting of both crease edges that describe the outline of the vehicle and semantic edges describing the boundaries between semantically different vehicle parts. Each entry of the wireframe contains a pair of keypoints defining a single edge. The edges chosen for this purpose are also depicted in Fig. 3. Furthermore, the wireframe is subdivided into four groups of wireframe edges, each of which contains all edges belonging to the wireframe of one of the four vehicle sides. Note that the four wireframe subsets are not mutually exclusive, as a wireframe edge can belong to two of the distinguished vehicle sides. This distinctive wireframe representation adds further semantic information to the edges. The approach for vehicle reconstruction proposed in this work makes use of this semantic wireframe distinction by learning a CNN-based detector for each of the wireframe representations (cf. Sec. 3.3).

In addition to that, a subset of keypoints is chosen to contain the appearance keypoints for which an image based detector is learned as well (cf. Sec. 3.3). This set of keypoints contains a number of keypoints with a potentially distinctive appearance, such as centre points of wheels, corner points of the wind shield and the rear window, front and back lights, etc. In comparison to the commonly used keypoint based ASM representation (Zia et al., 2013; Pavlakos et al., 2017), the ASM proposed in this paper is extended by the explicit definition of the 3D surface, by the wireframe definition described earlier, as well as by the subcategory awareness which is described subsequently in Sec. 3.2.2. The triangulated surface as well as the wireframe and appearance keypoint definition can be seen in Fig. 3.

Figure 3: Visualisations of the ASM. In the centre, the mean model is depicted with crease edges in red and semantic edges in blue. The triangulated surface is shown in black, the appearance keypoints in green. Additionally to the mean model, deformed models corresponding to the modes for seven vehicle categories are shown.

3.2.2 Mode Learning

In the literature, a common practice is to constrain the shape parameters according to the deviations from the mean shape (Zia et al., 2013; Engelmann et al., 2016), i.e. according to the deviations of the shape parameter vector from the zero vector (cf. Eq. 2). This strategy is based on the implicit simplifying assumption of a unimodal distribution of the vehicle shape deformations represented by the underlying principal components of the ASM, but in reality the distribution is expected to have several disjoint modes for different vehicle classes. For instance, let the set of vehicle types only contain the types Van and Sports Car; in this example, the mean model represents a shape which is neither a Van nor a Sports Car, and penalising the deviations from the mean shape is not an optimal choice. To obtain a more realistic model, the mode for each of the vehicle types is determined in addition to the general ASM formulation. During reconstruction, a prediction of the vehicle type (cf. Sec. 3.3) is used to penalise deviations from the corresponding mode instead of the global mean shape. The mode of a vehicle type is a vector of shape parameters that describes the mean shape of all training exemplars associated with the respective type. Following Eq. 2, the mean category shape and its residuals are expressed by


The modes are calculated according to a least-squares fitting of the overall ASM to the mean shape of each type by


where the mean model is computed from all training exemplars. The mean shape of each type is calculated in advance, given the CAD models and their annotated vehicle categories. In this context, it has to be noted that the only additional annotation effort that is required for the proposed mode learning approach consists of associating a vehicle type label with each CAD model used for learning the ASM. The Jacobi matrix is calculated from the partial derivatives of the linear Eq. 2 with respect to the shape parameters. Given the 3D keypoints defining the ASM and the shape parameters that are considered during vehicle reconstruction, the Jacobi matrix has three rows per keypoint and one column per shape parameter. The mean ASM model as well as the deformed models according to the modes of seven categories are shown in Fig. 3.
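The least-squares fit of a mode can be sketched as follows. As the ASM is linear in the shape parameters, the columns of the Jacobi matrix are the eigenvectors scaled by the square roots of their eigenvalues. All names and the synthetic data are assumptions for illustration; the sketch is not the implementation used in this work.

```python
import numpy as np

def learn_mode(type_mean, global_mean, eigenvectors, eigenvalues, n_params):
    """Least-squares shape parameters that best reproduce a type-specific mean shape."""
    # Jacobi matrix of the linear ASM: (3 * n_keypoints) rows, n_params columns.
    jacobian = (eigenvectors[:n_params] * np.sqrt(eigenvalues[:n_params, None])).T
    gamma, *_ = np.linalg.lstsq(jacobian, type_mean - global_mean, rcond=None)
    return gamma

# Synthetic check with 4 keypoints (12 coordinates) and two shape parameters.
rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.normal(size=(12, 3)))
eigenvectors = q.T[:2]                      # two orthonormal directions
eigenvalues = np.array([4.0, 1.0])
global_mean = rng.normal(size=12)
# Type mean = ASM instance with parameters (0.5, -1.0) plus a part the ASM cannot represent.
type_mean = global_mean + 0.5 * 2.0 * eigenvectors[0] - 1.0 * 1.0 * eigenvectors[1] + 0.3 * q.T[2]
mode = learn_mode(type_mean, global_mean, eigenvectors, eigenvalues, n_params=2)
```

In this synthetic example the recovered mode is (0.5, -1.0); the component orthogonal to the considered eigenvectors remains as the residual of the fit.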

3.3 Multi-branch CNN

As a first step of the model-based vehicle reconstruction, a multi-branch CNN is trained and used to infer semantic information to be used in the probabilistic model. We use the same multi-branch CNN as proposed in (Coenen and Rottensteiner, 2019), but add an additional branch for the prediction of the vehicle type. To make this paper self-contained, we will describe the complete CNN architecture in this section. The input to the network is an image showing a vehicle, cropped by the detection bounding box. The multi-branch CNN consists of one common input branch and three individual output branches, each of them corresponding to one task. Unless specified differently, 3x3 filters are used in the convolutional layers and 2x2 filters with stride 2 are used for max pooling and upsampling operations. The overall architecture of the network can be seen in Fig. 4. All vehicle detections in the reference (left) image, cropped by their respective bounding box, are fed into the network to infer a probability distribution for the vehicle's type, a probability distribution for the vehicle's viewpoint, and the keypoint and wireframe probability maps. Additionally, the keypoint and wireframe heatmaps are computed for the corresponding bounding box crops of the right image by feeding the detection window through the keypoint/wireframe branch. The results are gathered in an additional observation vector for every detected vehicle.

Figure 4: Architecture of the multi-branch CNN. The input is a 3 channel image of size 224x224. The convolutional filters have size 3x3, max pooling and upsampling use filter size 2x2 and stride 2. The number of filters is denoted by d in the figure. Further explanations are given in the main text.

3.3.1 Input branch

The input to the network is a 3-channel image of size 224x224. Bilinear interpolation is used to resize image crops of the vehicle detections to the required size. The input branch acts as a shared backbone feature extractor of the CNN. The architecture of the VGG19 network (Simonyan and Zisserman, 2015) is adopted for this purpose. The feature map of size 14x14 produced by the input branch is forwarded to the task-specific branches, which are explained in the following sections.

3.3.2 Viewpoint branch

The viewpoint branch is added to the CNN in order to derive a probability distribution for the vehicle's viewpoint, which can be incorporated as prior information about the vehicle orientation into the probabilistic approach for model fitting. Since the direct prediction of the vehicle orientation from the detection window is not possible without the additional knowledge of the location of that window in the image, we train the viewpoint branch to predict a probability distribution for the vehicle viewpoint instead. As depicted in Fig. 5a, the viewpoint defines the aspect under which the vehicle is seen. Given the direction of the image ray to the 3D centre of the vehicle, the vehicle orientation can directly be computed from the viewpoint via


The viewpoint branch performs a hierarchical classification of the viewpoint, which is discretised in hierarchical viewpoint bins as depicted in Fig. 5b. By merging the predictions of the individual classification heads using the probabilistic averaging layer, the final probability distribution is obtained. For a detailed description of this branch, the reader is referred to (Coenen and Rottensteiner, 2019).

Figure 5: Definition of the viewpoint angle (a) and of the hierarchical viewpoint classes (b).

3.3.3 Keypoint/Wireframe branch

This branch corresponds to a decoder network, upsampling the output of the input branch to the original input resolution using skip connections between corresponding layers of the encoder and decoder blocks. Following the vehicle keypoint and wireframe definitions of Sec. 3.2, the goal of this branch is to predict the presence of the appearance keypoints and the wireframe edges in the image. Inspired by Newell et al. (2016) and Murthy et al. (2017a), it is trained to produce one heatmap for every appearance keypoint. Additionally, the network is adapted to also output one heatmap for each of the wireframe definitions associated with one of the four vehicle sides. A detailed definition of the appearance keypoints and wireframe edges the network is supposed to predict has been given in Sec. 3.2. The values at each pixel position of the resulting heatmaps correspond to a probability for the presence of the respective keypoint/wireframe edge at that position. The head of this branch consists of convolution layers producing the keypoint heatmaps and the four wireframe heatmaps, using a sigmoid activation function to produce pixel-wise outputs in the interval [0, 1].
3.3.4 Vehicle type branch

The vehicle type branch proposed in this work is used to obtain a prediction of the vehicle type for the target vehicle shown in the input image window. This branch starts with a series of four convolutional layers followed by a max pooling layer. Two fully connected layers are applied at the end of the branch and a softmax classification head delivers a class score for each of the vehicle type classes defined in Sec. 3.2. It has to be noted that the categories of vehicle types do not have a clear definition and, therefore, the class boundaries are somewhat vague. The car body configuration, which is determined by the layout of the engine, passenger and luggage volumes, as well as the number of pillars of a vehicle, i.e. the (almost) vertical supports of a car's roof and windows, are characteristics that can be used to distinguish different vehicle types. However, ambiguities in the distinction of vehicle types exist and are therefore likely to be contained in the predictions of the vehicle type classification branch. We therefore interpret the prediction scores as probabilities for the target vehicle to belong to the respective classes and gather them in a vector, which thus represents the probability distribution for the vehicle type. The complete distribution is incorporated as prior information in the probabilistic model for vehicle reconstruction to introduce constraints on the deformations of the ASM used as shape prior. This makes it possible to handle cases in which the type branch cannot clearly predict whether the vehicle belongs to one specific category or another.

A detailed description of the training procedure of our CNN is given in Sec. 4.2.

3.4 Probabilistic model for vehicle reconstruction

Given the vehicle detections and the observation vector associated with each detection, a vehicle model is fitted to each detection by finding the optimal state variables for position, orientation and shape. For that purpose, a probabilistic model is formulated that simultaneously fits the surface of the 3D ASM to the 3D points, matches model keypoints to the detected keypoints and aligns the model wireframe to the wireframes inferred by the CNN. The inferred scene knowledge in the form of the free-space grid map as well as the probability distributions for orientation and the vehicle type are incorporated in the probabilistic model as state priors. Using a probabilistic formulation, the optimal state can be derived by maximising the posterior


In this work, the likelihood is factorised by individual likelihood terms, jointly incorporating both, 2D and 3D information derived from the stereo pairs and the CNN described in Sec. 3.3. The prior acts as a regularisation term on the state parameters and is factorised by one individual factor for each group of parameters, namely for position, orientation and shape, respectively.


This model is visualised in Fig. 6. The likelihood is composed of a 3D likelihood based on the 3D vehicle points as well as the keypoint and wireframe likelihoods, which are based on the keypoint and wireframe heatmaps, respectively. The state priors for position, orientation and shape are derived based on the probabilistic free-space grid map, the probability distribution for the viewpoint and the prediction for the vehicle type, respectively. An energy function is defined which corresponds to the negative logarithm of the posterior of Eq. 7 and which is minimised to find the optimal state parameters. The logarithmic formulation of the individual likelihood and prior terms is the same as in (Coenen and Rottensteiner, 2019), except that we added the shape prior term. To make the paper self-contained, we explain all terms in the following sections.

Figure 6: Visualisation of the probabilistic model. An ASM is fitted to the reconstructed 3D points, the wireframe heatmaps and the keypoint heatmaps. The probabilistic free-space grid map and the probability distributions for the viewpoint and the vehicle type are used to formulate regularisation terms for the state parameters.

3.4.1 3D likelihood

This likelihood term is based on the distance of the 3D points reconstructed from image points representing the surface of a vehicle (cf. Sec. 3.1.3) from the surface of the model:


This term takes into account the overall number of 3D vehicle points and the depth uncertainty of each individual 3D point, which is determined by applying error propagation to stereo triangulation and using an uncertainty of 1 px for the disparity estimate. To add robustness against possible outliers remaining in the set of vehicle points, we use the Huber loss (Huber, 1964) to calculate the distance:


The Huber loss is more robust against outliers than the quadratic distance of a 3D point from the model surface. To determine the distance of a 3D point from the model surface, its distance to every triangle of the model is calculated and the smallest distance found is used. Minimising this term fits the 3D ASM to the 3D point cloud.
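Two ingredients of this term can be sketched in a few lines: the depth uncertainty obtained by propagating the disparity uncertainty through the triangulation equation, and the Huber loss. The threshold `delta` and all names are assumptions for illustration; how both quantities are combined in the likelihood follows the equation of this section and is not reproduced here.

```python
import numpy as np

def depth_sigma(z, f, baseline, sigma_disparity=1.0):
    """Depth uncertainty from error propagation of z = f * baseline / d."""
    # |dz/dd| = f * baseline / d^2 = z^2 / (f * baseline)
    return z**2 / (f * baseline) * sigma_disparity

def huber(residual, delta):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    a = np.abs(residual)
    return np.where(a <= delta, 0.5 * a**2, delta * (a - 0.5 * delta))

sigma_35m = depth_sigma(35.0, f=700.0, baseline=0.5)      # uncertainty of a point at 35 m
losses = huber(np.array([0.5, 3.0]), delta=1.0)           # delta is a hypothetical threshold
```

The quadratic growth of the depth uncertainty with the distance explains why distant points contribute less to the fit, while the linear branch of the Huber loss limits the influence of outliers.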

3.4.2 Keypoint likelihood

The keypoint likelihood is based on the idea that, when backprojected to the image planes, the keypoints of the ASM representing the true vehicle should be supported by keypoint detections inferred from the image data (i.e. by high probabilities in the keypoint heatmaps) at or close to the backprojected keypoint positions. For this purpose, the keypoints of the model are backprojected to the stereo images, resulting in a list of image points for both the left and the right stereo image. Note that the CNN described in Sec. 3.3 is trained using only the keypoints that are visible in the image and not (self-)occluded. Consequently, the network is able to detect visible keypoints only, which is the reason why self-occlusion caused by the vehicle model itself is considered in the calculation of the keypoint likelihood. Ray tracing techniques can be used to reason about visible and self-occluded model keypoints. However, these techniques are usually computationally expensive. Instead, the rigid topology of the ASM surface and the street-level acquisition setup allow the construction of a look-up table, storing the set of visible model keypoints for every viewpoint. Using this look-up table, a boolean variable is associated with every image keypoint, which is set to 1 if the keypoint is visible and inside the detection bounding box and to 0 otherwise. The total number of visible keypoints is the sum of these variables. The keypoint likelihood is calculated by


This term evaluates the heatmap of each keypoint at the backprojected keypoint location in the respective image. Minimising this term fits the 3D ASM to the keypoints predicted in both images.

3.4.3 Wireframe likelihood

The wireframe likelihood is based on a measure of similarity between the backprojected edges of the model wireframe and the wireframe heatmaps inferred from the CNN. To this end, we backproject the visible parts of the wireframe subsets to the left and right images, resulting in binary wireframe images with entries of 1 at pixels that are crossed by a wireframe edge of the respective subset and 0 everywhere else. To consider differences between the real image wireframe positions and the model wireframe caused by generalisation effects of the vehicle model, the wireframe images are blurred using a Gaussian filter. The size of the filter is defined according to the backprojection uncertainty of the model keypoints, given the generalisation error of the ASM, which is quantified by an uncertainty set to 10 cm in this work. As the filter follows the resulting backprojection uncertainties, the blurring effect is stronger when the vehicle is close to the camera and weaker when it is far away.

Given the heatmaps and the blurred images of the backprojected model wireframe , the wireframe likelihood is calculated according to


where we use the Bhattacharyya coefficient (Bhattacharyya, 1943) as a similarity measure between the blurred wireframe images and the wireframe heatmaps. This term will become large if the backprojected wireframes correspond well to the wireframes predicted by the CNN.
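The Bhattacharyya coefficient between two non-negative images can be sketched as follows. Normalising both images to unit sum before comparison is an assumption of this sketch, as are all names; the Gaussian blurring and the combination over the wireframe subsets and both stereo images are not shown.

```python
import numpy as np

def bhattacharyya_coefficient(image_a, image_b, eps=1e-12):
    """Similarity in [0, 1] between two non-negative images treated as distributions."""
    p = image_a / (image_a.sum() + eps)   # normalise to unit sum (assumption)
    q = image_b / (image_b.sum() + eps)
    return float(np.sum(np.sqrt(p * q)))

# Toy 'wireframe' images: one diagonal line, compared with itself and with a disjoint line.
line = np.eye(4)
other = np.fliplr(np.eye(4)) * (1 - np.eye(4))   # anti-diagonal without shared pixels
same = bhattacharyya_coefficient(line, line)
disjoint = bhattacharyya_coefficient(line, other)
```

Identical images yield a coefficient close to 1 and images without any common support yield 0, which matches the behaviour described above for well and badly aligned wireframes.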

3.4.4 Position prior

The position prior is derived from the probabilistic free-space grid map. It is calculated based on the amount of overlap between the minimum enclosing 2D bounding box of the model on the ground plane and the free-space grid map cells, weighted by their probability of being free space (unknown cells are disregarded):


This term involves the area of the model bounding box and an overlap function, which calculates the overlap between the model bounding box and a grid cell using the surveyor's area formula (Braden, 1986). If the bounding box and a cell do not intersect, the overlap function returns zero. As the free-space grid map is derived from the reconstructed 3D points, a factor is introduced to transfer the uncertainties of the 3D points to the calculation of the position prior. To this end, the factor is used as a weight of this term based on the grid cell size and the depth uncertainty of a reconstructed point at the distance of the model. This prior penalises models that are partly or fully located in areas which are observed as not being occupied by 3D objects. In particular, this prior establishes the constraint that free space between the camera and the point cloud cannot be occupied by the model.
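The overlap computation can be sketched as follows: the (possibly rotated) model bounding box is clipped against an axis-aligned grid cell and the area of the resulting polygon is obtained with the surveyor's (shoelace) formula. The use of Sutherland-Hodgman clipping and all names are assumptions of this sketch; the original implementation may determine the intersection polygon differently.

```python
def shoelace_area(polygon):
    """Surveyor's formula for the area of a simple polygon given as (x, y) tuples."""
    area = 0.0
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0

def clip_to_cell(polygon, xmin, ymin, xmax, ymax):
    """Sutherland-Hodgman clipping of a convex polygon against an axis-aligned cell."""
    def clip(points, inside, intersect):
        result = []
        for i in range(len(points)):
            current, previous = points[i], points[i - 1]
            if inside(current):
                if not inside(previous):
                    result.append(intersect(previous, current))
                result.append(current)
            elif inside(previous):
                result.append(intersect(previous, current))
        return result

    def cut_x(x):
        return lambda p, q: (x, p[1] + (q[1] - p[1]) * (x - p[0]) / (q[0] - p[0]))

    def cut_y(y):
        return lambda p, q: (p[0] + (q[0] - p[0]) * (y - p[1]) / (q[1] - p[1]), y)

    edges = [
        (lambda p: p[0] >= xmin, cut_x(xmin)),
        (lambda p: p[0] <= xmax, cut_x(xmax)),
        (lambda p: p[1] >= ymin, cut_y(ymin)),
        (lambda p: p[1] <= ymax, cut_y(ymax)),
    ]
    for inside, intersect in edges:
        polygon = clip(polygon, inside, intersect)
        if not polygon:
            return []
    return polygon

def overlap_area(box, cell):
    """Overlap between a convex box polygon and a cell (xmin, ymin, xmax, ymax)."""
    clipped = clip_to_cell(box, *cell)
    return shoelace_area(clipped) if len(clipped) >= 3 else 0.0

box = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0)]
half = overlap_area(box, (1.0, 0.0, 3.0, 1.0))   # cell covers the right half of the box
none = overlap_area(box, (5.0, 5.0, 6.0, 6.0))   # cell far away from the box
```

Summing such overlaps over all cells, each weighted by its free-space probability, gives the kind of quantity on which the position prior is based.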

3.4.5 Orientation prior

To calculate the orientation prior for the model, the probability distribution for the vehicle viewpoint inferred by the multi-branch CNN is used. The viewpoint is computed for the model from the model orientation using the relation in Eq. 5. The image ray direction is derived from the ray connecting the camera projection centre and the centre of the vehicle model. The orientation prior is calculated according to


This prior uses the probability for the model viewpoint angle according to the output of the viewpoint classification branch of the CNN. As incorrect viewpoint classifications can be assumed to appear especially between neighbouring viewpoints, this term alone is prone to cause small orientation errors. For this reason, the cosine distance between the most likely viewpoint predicted by the CNN and the model viewpoint is additionally considered in this prior.

3.4.6 Shape prior

The new shape prior formulation proposed in this paper is based on the probability distribution for the vehicle types predicted by the CNN and acts as regularisation term for the shape of the model . Using


the shape prior term penalises deviations of the considered model shape parameters from the predicted modes of the individual vehicle types. In this prior formulation, the deviations from each mode are weighted according to the probability scores predicted by the CNN. The normalising parameter represents the ASM standard deviation of deformation, i.e. the square root of the eigenvalue, in the direction of the eigenvector associated with the respective shape parameter as described in Sec. 3.2.

As it is reasonable to assume the ASM shape parameters of vehicles not to follow a uni-modal distribution but rather a multi-modal distribution with each mode representing one vehicle category, penalising deviations from the overall mean shape, as is usually done in the literature (Zia et al., 2013; Engelmann et al., 2016), is an unfavourable procedure. The category aware shape prior formulation proposed here gives a more realistic and detailed constraint on the shape parameters. The confidence awareness in the prior term, achieved by considering the probability scores for all distinguished categories, reduces the sensitivity of the shape prior to potential uncertainties in the vehicle type predictions of the CNN, e.g. caused by the vague definition of the car type classes discussed in Sec. 3.3.4.
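The idea of the confidence-aware shape prior can be illustrated by a small sketch. The functional form used here, i.e. squared deviations normalised by the standard deviations and weighted by the type probabilities, is an assumption made for illustration and may differ from the exact formulation of this section; all names are hypothetical.

```python
import numpy as np

def shape_prior_energy(gamma, modes, type_probs, sigmas):
    """Probability-weighted penalty on deviations of gamma from the type modes.

    gamma:      (n_params,) shape parameters of the model
    modes:      (n_types, n_params) mode of each vehicle type
    type_probs: (n_types,) CNN probabilities for the vehicle types
    sigmas:     (n_params,) standard deviations used for normalisation
    """
    deviations = (gamma[None, :] - modes) / sigmas[None, :]
    per_type = np.sum(deviations**2, axis=1)   # assumed squared penalty per type
    return float(type_probs @ per_type)

modes = np.array([[0.0, 0.0], [2.0, 0.0]])     # two hypothetical type modes
sigmas = np.ones(2)
confident = shape_prior_energy(np.zeros(2), modes, np.array([1.0, 0.0]), sigmas)
ambiguous = shape_prior_energy(np.zeros(2), modes, np.array([0.5, 0.5]), sigmas)
```

With a confident type prediction, a model located exactly at the predicted mode is not penalised at all, whereas an ambiguous prediction distributes the penalty over the modes of all plausible types.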

3.4.7 Inference

To find the optimal pose and shape parameters for each detected vehicle, the energy function derived from Eq. 7 is minimised. As this function is non-convex and discontinuous due to the changing visibility of keypoints/wireframes caused by self-occlusion, a sequential Monte Carlo sampling procedure is applied to approximately determine the parameter set for which the energy function becomes minimal. To this end, the target parameters are sampled to generate model particles for the vehicle ASM. Starting from one or more initial parameter sets, a number of particles are generated in each iteration by jointly sampling the pose and shape parameters from a uniform distribution centred at the preceding parameter values. For the resampling step, the energy for every particle is calculated in each iteration and the highest weighted particles are introduced as initial seed particles for the next iteration. In each iteration, the size of the interval from which the parameters are sampled is reduced. In the following paragraphs, more details are given.

Initialisation: A common way of initialisation is to define the initial particles using the prior distribution. In our case, the prior distribution heavily depends on the predictions of the CNN for the orientation and the shape. The most likely particle is determined by setting the orientation to the most likely orientation given the viewpoint prediction and the shape to the most likely shape given the vehicle type prediction. The initial translation vector is defined as the centre of the minimum 2D bounding box enclosing the 2D projections of the 3D vehicle points to the ground plane. The particles are sampled using a uniform distribution for position, orientation and shape, respectively, centred at the most likely particle. The initial interval boundaries are set separately for the orientation, the position (in metres) and the shape parameters. The interval for the orientation angle is chosen such that particles can take the whole range of possible orientations in the first iteration, which makes it possible to deal with incorrect initialisations and thus adds robustness against initialisation errors.

Resampling: To resample the particles in each iteration, the energy is calculated for each preceding particle. The best particles, i.e. particles with the lowest energy, are introduced as seed particles for the resampling step. For each seed particle, an equal number of offspring particles are drawn from a uniform distribution centred at the respective seed particle. To encourage convergence of the applied Monte Carlo sampling, the range of the respective parameters used to draw the particles from the uniform distributions is reduced by an iteration-dependent factor, so that the initial parameter range is reduced most strongly in the last iteration. By forwarding multiple particles, the preservation of particles potentially belonging to different minima is enabled. As a consequence, the risk of getting stuck in a local minimum is reduced, which makes it possible to deal with multi-modal distributions and local minima in the objective function.

Final result: The final values for the target parameters of pose and shape are defined after the last iteration and are set to the parameters of the particle achieving the lowest energy within the particle set of the final iteration.
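The sampling procedure described above can be summarised by the following minimal sketch. The geometric shrink factor of 0.7, the toy energy function and all names are assumptions for illustration; the default numbers of particles, iterations and offspring follow the values stated in Sec. 4.2.

```python
import numpy as np

def sample_minimise(energy, initial, half_ranges, n_particles=200,
                    n_iterations=10, n_offspring=10, shrink=0.7, seed=0):
    """Iterative sampling with shrinking uniform intervals around the best particles."""
    rng = np.random.default_rng(seed)
    half_ranges = np.asarray(half_ranges, dtype=float)
    seeds = np.asarray(initial, dtype=float)[None, :]
    n_seeds = n_particles // n_offspring
    for iteration in range(n_iterations):
        ranges = half_ranges * shrink**iteration          # assumed geometric shrinking
        centres = np.repeat(seeds, n_particles // len(seeds), axis=0)
        particles = centres + rng.uniform(-1.0, 1.0, size=centres.shape) * ranges
        energies = np.array([energy(p) for p in particles])
        order = np.argsort(energies)
        seeds = particles[order[:n_seeds]]                # lowest energies become seeds
    # Final result: best particle of the last iteration.
    return particles[order[0]], float(energies[order[0]])

# Toy energy with a discontinuity, standing in for the negative log posterior.
target = np.array([1.0, -2.0])

def toy_energy(p):
    return float(np.sum((p - target) ** 2) + (0.5 if p[0] < 0 else 0.0))

best_state, best_energy = sample_minimise(toy_energy, initial=[0.0, 0.0], half_ranges=[3.0, 3.0])
```

Forwarding several seed particles instead of only the best one is what allows the procedure to keep candidates from different minima alive until the sampling intervals have become small.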

4 Test data and test setup

4.1 Test data

The empirical evaluation of the proposed method is conducted on two data sets. Both data sets consist of stereo image pairs which were acquired by a synchronised and calibrated stereo camera rig, placed on a mobile platform, while driving in regular traffic on public roads in urban environments. One data set is taken from the publicly available KITTI benchmark suite (Geiger et al., 2012) and will therefore be referred to as KITTI data throughout this paper. The second data set was presented in (Coenen and Rottensteiner, 2019) and has been made publicly available at https://doi.org/10.25835/0078519 (Coenen, 2020).

Figure 7: Examples from the test data sets. In (a), an example from the KITTI benchmark is shown, providing 3D reference bounding boxes. In (b), an example of the ICSENS data set is shown, which provides fitted 3D CAD models as reference.

It will be referred to as ICSENS (Integrity and Collaboration in dynamic SEnsor NetworkS, (Schön et al., 2018)) data set in the remainder of this paper. Regarding the KITTI data, the 3D object detection benchmark is used for the evaluation in this paper. It consists of 7481 stereo image pairs and provides the 3D object location and the orientation for every vehicle in the form of manually fitted oriented 3D bounding boxes. It distinguishes three levels of difficulty (easy, moderate and hard), which mainly depend on the level of object occlusion and truncation. In this paper, 260 of the training set images are used for training and the remaining images are used for evaluating the proposed approach. The KITTI test set consists of 7518 stereo images for which no reference is provided. An official evaluation on the test set using the official KITTI metrics is used to compare the performance of the proposed method to the results of related methods which are reported in the KITTI leaderboard (http://www.cvlibs.net/datasets/kitti/eval_3dobject.php). The ICSENS data set consists of a total of 1000 stereo image pairs recorded in the context of (Schön et al., 2018). In contrast to the KITTI data set, which only delivers oriented 3D bounding boxes as references, the ICSENS data set delivers the reference shape and the reference vehicle type in addition to the 3D pose. To this end, we manually fitted the most similar model out of a set of vehicle CAD models to the individual vehicles of the ICSENS data set, using the 3D point cloud obtained from stereo matching and the back-projected wireframes to assess whether a reference model was correct. We also distinguish between easy (fully visible) and difficult (occluded or truncated) vehicles. A visual comparison of the provided references for the KITTI and the ICSENS data can be seen in Fig. 7. Note that the quality of the reference will be affected by the depth errors of the reconstructed 3D points in the same way as the reconstructed models.

4.2 Parameter setting and training

We select the side length of the free-space grid cells to be 25 cm. For inference, the number of particles is set to 200; the number of iterations and the number of offspring particles are both set to 10. The category-aware Active Shape Model (ASM), which is proposed as a shape prior in this work (cf. Sec. 3.2), requires the definition of a set of type classes and the availability of 3D keypoint annotations for a set of training exemplars. Seven vehicle categories are distinguished (cf. Fig. 3): {Compact Car, Sedan, SUV, Estate Car, Sports Car, Truck, Van}. To learn the ASM, 3D keypoints were manually labelled on a set of 36 different CAD vehicle models collected via Google's 3D Warehouse (https://3dwarehouse.sketchup.com), each belonging to one of the considered vehicle types. Each model differs from the other models in shape and, consequently, also in its 3D extents. Tab. 1 shows the statistics obtained from the variations in length, width, and height of the CAD models used in the training set in this work. It shows the mean value, standard deviation (std. dev.) as well as the minimum and maximum values for length, width, and height of the vehicles. With a standard deviation of 0.40 m, the largest variations are present w.r.t. the length of the vehicles. With standard deviations of 0.20 m and 0.10 m, respectively, the variations in height and width are considerably smaller. A more detailed overview of the standard deviations of length, width, and height of the CAD vehicles for each of the considered vehicle types can be seen in Tab. 2. It shows that the largest variations in all directions result for the class Van. The class Sedan exhibits the smallest intra-class variations.

| [m] | mean | std. dev. | max | min |
|---|---|---|---|---|
| length | 4.35 | 0.40 | 5.70 | 3.55 |
| width | 1.80 | 0.10 | 2.34 | 1.65 |
| height | 1.49 | 0.20 | 2.12 | 1.11 |

Table 1: Statistical properties for the vehicle extents of the CAD training set used to learn the ASM.

| [m] | Compact Car | Estate | Sedan | SUV | Van | Sports Car | Truck |
|---|---|---|---|---|---|---|---|
| length | 0.21 | 0.24 | 0.19 | 0.39 | 0.82 | 0.22 | 0.26 |
| width | 0.05 | 0.07 | 0.04 | 0.07 | 0.28 | 0.04 | 0.19 |
| height | 0.10 | 0.06 | 0.05 | 0.14 | 0.28 | 0.13 | 0.12 |

Table 2: Standard deviations of the vehicle dimensions differentiated by the individual vehicle types.

The set of appearance keypoints considered in this work contains 36 individual keypoints and corresponds to the keypoints used in (Zia et al., 2013). As explained in Sec. 3.2, a shape basis is learned for each of the considered vehicle types. In Fig. 8, the fitting error of the overall ASM to the mean shape of the vehicle types resulting from Eq. 4, i.e. the root mean square error (RMSE) of the keypoint coordinates, is shown as a function of the number of principal components used. The RMSE represents the generalisation error of the ASM caused by using a restricted number of eigenvectors. The number of eigenvalues and eigenvectors considered in the ASM during the experiments is fixed to a value that we found to be a proper trade-off between the complexity of the model and the quality of the model fit. While the RMSE of one vehicle type is 0.14 m when using the first three principal components, the RMSE of the remaining vehicle types is smaller than 0.06 m in all cases.

Figure 8: RMSE of the ASM fitting to the individual considered vehicle types.
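The relation between the number of principal components and the fitting error can be illustrated with a minimal sketch. It assumes that the ASM is a PCA model over stacked 3D keypoint coordinates and that the RMSE is taken over all coordinate values; the random shapes, the function names and the choice of the first five models as a "type" are purely hypothetical stand-ins for the annotated CAD models.

```python
import numpy as np

def fit_asm(shapes):
    """Learn mean shape and principal directions from (n_models, n_keypoints, 3)."""
    n_models = shapes.shape[0]
    flat = shapes.reshape(n_models, -1)
    mean_shape = flat.mean(axis=0)
    # Rows of eigvecs are the eigenvectors of the sample covariance matrix,
    # ordered by decreasing eigenvalue.
    _, singular_values, eigvecs = np.linalg.svd(flat - mean_shape, full_matrices=False)
    eigvals = singular_values ** 2 / (n_models - 1)
    return mean_shape, eigvecs, eigvals

def fitting_rmse(target, mean_shape, eigvecs, n_components):
    """RMSE of the coordinates when only the first n_components are used."""
    basis = eigvecs[:n_components]
    residual = target.reshape(-1) - mean_shape
    approx = mean_shape + (basis @ residual) @ basis
    return float(np.sqrt(np.mean((approx - target.reshape(-1)) ** 2)))

rng = np.random.default_rng(0)
# ASSUMPTION: synthetic data standing in for 36 CAD models with 36 keypoints each.
shapes = rng.normal(size=(36, 36, 3))
mean_shape, eigvecs, eigvals = fit_asm(shapes)

# ASSUMPTION: the first five models form one vehicle type; fit to their mean shape.
type_mean = shapes[:5].mean(axis=0)
rmse_curve = [fitting_rmse(type_mean, mean_shape, eigvecs, k) for k in range(36)]
```

Because the basis is orthonormal, the error can only decrease as components are added, which is the kind of curve plotted per vehicle type in Fig. 8.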

To train the proposed multi-branch CNN, images of vehicles cropped by the tightly enclosing bounding box of the vehicles are used. We applied the following training strategy in our experiments. The input branch (cf. Fig. 4) is initialised from its corresponding layers of the VGG19 network (Simonyan and Zisserman, 2015), pre-trained on ImageNet (Russakovsky et al., 2015), and is frozen during the training procedure. The remaining convolutional layers are initialised using the He initialiser (He et al., 2015). For the presented experiments, the type branch is trained separately from the viewpoint and the keypoint/wireframe branches. To train the individual output branches, two different data sets are used. For the joint training of the viewpoint and the keypoint/wireframe branches, training images of vehicles are required, including reference information of the vehicle's viewpoint as well as the image coordinates of the appearance keypoints. To this end, the 260 KITTI images mentioned in Sec. 4.1 are used. The 2D image reference bounding boxes as well as the reference viewpoint angles are provided as annotations of the data set (Geiger et al., 2012). The authors of (Zia et al., 2015) labelled the 36 different appearance keypoints in this subset of images and made the annotations publicly available. Together with the reference viewpoint angles, their keypoint annotations are used to train the keypoint/wireframe branch and the viewpoint branch for the experiments conducted in this paper, while the vehicle type branch is frozen. For more details on the training procedure of the viewpoint and keypoint/wireframe branches we refer the reader to (Coenen and Rottensteiner, 2019). Note that the KITTI images used for training are not used for the evaluation. After training both branches, the vehicle type branch is trained independently from the remaining branches using the data set provided by (Yang et al., 2015), which contains images of vehicles including bounding box and vehicle type annotations. In this context, the reference vehicle type (one of the seven defined types) is assumed to be available for each training sample. In (Yang et al., 2015), twelve different vehicle type classes are distinguished. To map the provided classes to the type definitions used in this work, some classes are merged as shown in Tab. 3. The categorical cross-entropy is used as loss function to train the vehicle type branch. The entire network is trained using the Adam optimizer (Kingma and Ba, 2015), a variant of stochastic mini-batch gradient descent with momentum, with fixed exponential decay rates for the first and second moment estimates, a fixed mini-batch size and a fixed initial learning rate. To improve training, the learning rate is decreased by a constant factor after 5 epochs with no improvement in the validation loss. Furthermore, batch normalisation (Ioffe and Szegedy, 2015) is used and Dropout (Srivastava et al., 2014) is applied to the fully-connected layers with a rate of 0.5. Data augmentation is applied to the training data by horizontally flipping the training images and adapting the viewpoint classes and keypoint/wireframe labels accordingly. The classes for the vehicle type remain unchanged. During training, regions corresponding to between 0 and 20% of the bounding box width are randomly clipped away from the image to simulate occlusions. We expect this to help the network to derive suitable features even if the vehicles are not fully visible. Further, random gamma corrections are applied to the training images to enforce robustness against radiometric differences, e.g. due to illumination conditions, background-foreground contrast, shadowing, etc.

Ours Compact Car Sedan SUV Estate Car Sports Car Truck Van
Yang et al. Hatchback Sedan SUV Estate car Sports car Pickup MPV
(2015) Fastback Crossover Convertible Minibus
Hardtop conv.
Table 3: Vehicle type class definition made in this paper and the associated classes defined in (Yang et al., 2015).
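The augmentation steps described for training (horizontal flipping with adapted labels, random clipping of up to 20% of the bounding box width, random gamma correction) could be sketched as follows. This is only an illustration under several assumptions that are not taken from the paper: the viewpoint is represented as a continuous angle that is mirrored by negation, clipped regions are blacked out rather than cropped, and the gamma range `(0.5, 2.0)` is a placeholder.

```python
import numpy as np

def augment(image, keypoints, viewpoint_deg, rng,
            max_clip=0.2, gamma_range=(0.5, 2.0)):
    """image: float array (H, W, 3) in [0, 1]; keypoints: (K, 2) pixel (x, y)."""
    h, w = image.shape[:2]
    image = image.copy()
    keypoints = keypoints.astype(float).copy()

    # Horizontal flip with probability 0.5; labels are adapted accordingly.
    if rng.random() < 0.5:
        image = image[:, ::-1]
        keypoints[:, 0] = (w - 1) - keypoints[:, 0]
        # ASSUMPTION: mirroring negates the viewpoint angle. Left/right
        # keypoint identities would also have to be swapped in practice.
        viewpoint_deg = (-viewpoint_deg) % 360.0

    # Simulated occlusion: mask a strip of 0 to 20% of the width at one side.
    clip = int(round(rng.uniform(0.0, max_clip) * w))
    if clip > 0:
        if rng.random() < 0.5:
            image[:, :clip] = 0.0
        else:
            image[:, w - clip:] = 0.0

    # Random gamma correction (ASSUMPTION: placeholder range).
    gamma = rng.uniform(*gamma_range)
    image = image ** gamma
    return image, keypoints, viewpoint_deg

rng = np.random.default_rng(1)
img = np.full((10, 20, 3), 0.5)
kp = np.array([[0.0, 5.0], [19.0, 2.0]])
out_img, out_kp, out_vp = augment(img, kp, 30.0, rng)
```

The vehicle type label is deliberately not touched, in line with the statement that the type classes remain unchanged under flipping.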

4.3 Test setup

The vehicle type branch introduced in this work is used to derive a prior distribution for the vehicle type to be incorporated in the proposed probabilistic model to regularise the vehicle shape. An evaluation of the performance of this classification branch allows conclusions to be drawn about the suitability of the proposed CNN for the derivation of shape prior information. The overall accuracy (OA) is used as evaluation criterion. It is computed as the ratio of the number of correctly classified vehicles to the total number of vehicles, and therefore represents the overall proportion of correct classifications in [%]. As explained in Sec. 3.3.4, the class boundaries between some of the considered vehicle type classes are not clearly defined. The vague definition of the class boundaries is expected to be reflected by a broad distribution of the confidence scores predicted by the vehicle type branch over the concerned classes. To account for this effect in the evaluation, different values of OA are reported. The Top-1 OA reports the percentage of correct classifications obtained by using the class exhibiting the highest confidence score as the predicted one. In addition, the Top-2 and Top-3 OA values are analysed. To obtain the Top-2 and Top-3 accuracies, a sample is considered to be a true positive if the reference class is among the two or three classes having the highest confidence scores, respectively. For the evaluation, the ICSENS data set is used, because it contains a reference for the vehicle type.
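The Top-k overall accuracy defined above can be written down in a few lines. The following sketch uses made-up confidence scores for three classes; the function name and the toy data are illustrative only.

```python
import numpy as np

def top_k_overall_accuracy(scores, reference, k):
    """scores: (N, C) confidence scores; reference: (N,) class indices.

    Returns the share [%] of samples whose reference class is among the
    k classes with the highest confidence score.
    """
    top_k = np.argsort(scores, axis=1)[:, ::-1][:, :k]
    hits = (top_k == reference[:, None]).any(axis=1)
    return 100.0 * hits.mean()

# Toy example with four samples and three classes.
scores = np.array([[0.60, 0.30, 0.10],
                   [0.20, 0.50, 0.30],
                   [0.40, 0.35, 0.25],
                   [0.10, 0.20, 0.70]])
reference = np.array([0, 2, 2, 2])

oa_top1 = top_k_overall_accuracy(scores, reference, 1)
oa_top2 = top_k_overall_accuracy(scores, reference, 2)
oa_top3 = top_k_overall_accuracy(scores, reference, 3)
```

In this toy case the second sample is only counted as correct from Top-2 onwards, which mirrors how confusions between neighbouring type classes are tolerated by the Top-2 and Top-3 values.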

To assess the impact of the components of the probabilistic model proposed in this paper, different variants of the probabilistic model are defined, each of them considering a different set of likelihood and prior terms as described in the following paragraphs.

Base: In this variant, the baseline setting is defined in which only the 3D likelihood term is considered for model fitting. The 3D likelihood is chosen to define the baseline model because it is exclusively based on the ASM and the reconstructed 3D points and, therefore, this variant does not require any supervised learning. The observations generated by the CNN (keypoint and wireframe heatmaps) are not considered here. Note that in this variant, the regularisation of shape deformations by the category-aware shape prior is omitted. However, constraints on the shape still have to be introduced to avoid unconstrained shape variations of the ASM, which may result in geometrically invalid vehicle shapes. In this setting, the shape parameters are regularised by penalising deviations from the mean ASM shape, as is also done in comparable related methods, e.g. (Zia et al., 2013; Engelmann et al., 2016). By doing so, the category-aware shape prior formulation proposed in this paper (Eq. 14) is replaced by a mean-shape regularisation term (Eq. 15).


Base+S: In this setting, the shape prior term according to Eq. 14 is added to the model alignment based on the 3D likelihood. This variant allows the potential benefit of the category-aware shape prior to be analysed in comparison to the mean-shape regularisation used in the Base variant.

Base+S+P+O: To assess the potential of the complete state priors, the shape, position, and orientation priors are jointly considered as regularisers on the state parameters during alignment in this variant.

Base+K+W: This setting considers all likelihood terms presented in this paper to assess their potential. To constrain the vehicle shape, Eq. 15 is applied to prevent the vehicle models from degenerating.

Full: The evaluation of the vehicle reconstructions using the complete probabilistic formulation of Eq. 7, referred to as the Full model, assesses the quality of the results that can be obtained by the presented approach. The comparison of these variants is performed on the KITTI data set, while the Full model is also evaluated on the ICSENS data.
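The five variants differ only in which likelihood and prior terms are active. As a compact summary, they can be written as a small configuration table; the key names and string labels below are hypothetical and merely restate the descriptions given above.

```python
# Terms of the probabilistic model as named in the text (labels are illustrative).
LIKELIHOODS = ("3d", "keypoint", "wireframe")
PRIORS = ("shape", "position", "orientation")

# "mean_shape" denotes the regulariser of Eq. 15, "category_aware" the prior of Eq. 14.
VARIANTS = {
    "Base":       {"likelihoods": ("3d",),     "priors": (),         "shape_term": "mean_shape"},
    "Base+S":     {"likelihoods": ("3d",),     "priors": ("shape",), "shape_term": "category_aware"},
    "Base+S+P+O": {"likelihoods": ("3d",),     "priors": PRIORS,     "shape_term": "category_aware"},
    "Base+K+W":   {"likelihoods": LIKELIHOODS, "priors": (),         "shape_term": "mean_shape"},
    "Full":       {"likelihoods": LIKELIHOODS, "priors": PRIORS,     "shape_term": "category_aware"},
}

def uses_cnn_observations(variant):
    """True if the variant relies on keypoint or wireframe heatmaps."""
    active = VARIANTS[variant]["likelihoods"]
    return "keypoint" in active or "wireframe" in active
```

Reading the table row by row makes it easy to see which comparison isolates which component, e.g. Base versus Base+S for the category-aware shape prior.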

Evaluation criteria: To evaluate the vehicle reconstruction, the resulting pose and shape of each fitted 3D vehicle model are compared to the reference.

To evaluate the shape of the vehicle reconstructions, the vehicle dimensions are one criterion that can be used in the case of the KITTI data by means of the reference 3D bounding boxes. To this end, average absolute errors are computed from the differences between the reference and the inferred length, width, and height of the vehicles. In addition, the point clouds acquired by the Velodyne laser scanner, which were used to generate the ground-truth data of the KITTI data set, can be used to obtain a more detailed evaluation of the shape of the reconstructed vehicle model. To this end, we transform the laser scanner point cloud to the model coordinate system and extract the set of 3D laser scanner points lying inside the 3D reference bounding box of each vehicle. To assess the quality of the fitted models, for each vehicle we compute the RMSE of the distances of the laser points associated with the vehicle from the surface of the fitted ASM. The average and the median values computed from the RMSE of all vehicles in the data set are reported in the evaluation. Compared to the bounding box dimension metrics, the errors based on the Velodyne points give a more precise view of the deviations of the reconstructed models from the true shape of the vehicles. Furthermore, whereas the bounding box metrics assess the quality of the full extent of the vehicle, including the self-occluded and therefore non-visible parts, the error metrics based on the Velodyne points only assess the reconstruction quality of the vehicle parts that are visible to the camera and thus exclude the self-occluded, ambiguous vehicle parts.
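Both shape criteria can be sketched in a few lines. The example below is a simplification: instead of exact point-to-surface distances, it approximates the model surface by a dense set of sampled surface points and uses the nearest sample for each laser point. The function names and the toy inputs are assumptions made for illustration.

```python
import numpy as np

def mean_abs_dimension_errors(ref_dims, est_dims):
    """Average absolute errors of length, width, height; inputs are (N, 3) in metres."""
    return np.abs(np.asarray(est_dims) - np.asarray(ref_dims)).mean(axis=0)

def point_to_surface_rmse(laser_points, surface_samples):
    """RMSE of nearest-sample distances from (M, 3) laser points to (S, 3) samples.

    ASSUMPTION: the fitted surface is represented by densely sampled points,
    so the result slightly overestimates the true point-to-surface distance.
    """
    diff = laser_points[:, None, :] - surface_samples[None, :, :]
    nearest = np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)
    return float(np.sqrt(np.mean(nearest ** 2)))

# Toy data: two vehicles with reference and estimated dimensions.
ref_dims = np.array([[4.0, 2.0, 1.5], [4.5, 1.8, 1.4]])
est_dims = np.array([[4.2, 2.0, 1.4], [4.4, 1.9, 1.4]])
dim_errors = mean_abs_dimension_errors(ref_dims, est_dims)

# Toy surface: a planar patch; laser points hover 0.1 m above the samples.
surface = np.array([[x, y, 0.0] for x in range(3) for y in range(3)], dtype=float)
laser = surface + np.array([0.0, 0.0, 0.1])
rmse = point_to_surface_rmse(laser, surface)
```

The per-vehicle RMSE values obtained this way would then be summarised by their median and average over the data set, as described above.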

Instead of 3D bounding boxes, the ICSENS data provides CAD models as reference. In order to compute the same evaluation metrics for the ICSENS data, the minimum 3D bounding box enclosing the CAD models is derived from the reference. In addition, the reference CAD models are used to compute an error metric based on the distances between corresponding keypoints of the reference model and the estimated model in the body coordinate system of the vehicles. To this end, the RMS error of the Euclidean distances between corresponding keypoints is computed for each of the vehicles. This error represents the quality of the estimated shape w.r.t. the reference shape for a single vehicle. To assess the average quality of shape estimation that is achieved on the whole data set, the RMSE of the distances is computed over all vehicles.
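A minimal sketch of the keypoint-based error is given below. The per-vehicle RMS error follows the verbal definition above; how the per-vehicle values are combined into a data set level value is not recoverable from the text here, so the aggregation shown is an assumption.

```python
import numpy as np

def keypoint_rms_error(ref_keypoints, est_keypoints):
    """RMS of Euclidean distances between corresponding (K, 3) keypoints [m]."""
    distances = np.linalg.norm(est_keypoints - ref_keypoints, axis=1)
    return float(np.sqrt(np.mean(distances ** 2)))

def dataset_rms_error(per_vehicle_errors):
    """ASSUMPTION: aggregate per-vehicle errors as the root of their mean square."""
    errors = np.asarray(per_vehicle_errors, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

# Toy example: every estimated keypoint is shifted by (0.3, 0.4, 0.0) m.
ref = np.zeros((36, 3))
est = ref + np.array([0.3, 0.4, 0.0])
vehicle_error = keypoint_rms_error(ref, est)
overall_error = dataset_rms_error([0.3, 0.4])
```

Both keypoint sets are expected in the body coordinate system of the vehicle, so that pose errors do not leak into this shape metric.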


To achieve detailed insights into the quality of pose reconstructions, position and orientation estimates are reported in three stages. Three position metrics report the percentage of determined vehicle positions whose Euclidean distance from the reference position is smaller than 0.25 m, 0.50 m, and 0.75 m, respectively. Similarly, three orientation metrics report the percentage of estimated vehicle orientations whose difference from the reference orientation is smaller than three increasingly tolerant angular thresholds. For a joint evaluation of position and orientation, the number of pose estimates that are correct in both position and orientation is reported, considering the 0.75 m position threshold together with an orientation threshold. In order to derive a global error metric for all vehicles, a robust measure of the average error is used by reporting the median of the absolute position errors and the median of the absolute orientation errors. In addition, the median absolute deviation (MAD) is reported to assess the variability of the errors w.r.t. the median. It is computed as the median of the absolute deviations of the errors from their median, multiplied by a constant factor (Hampel et al., 1986).
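The pose metrics can be sketched as follows. The position thresholds are those stated in the text; the scaling factor 1.4826, which makes the MAD consistent with the standard deviation of a normal distribution, is a commonly used value and an assumption here, as are the function names and toy data.

```python
import numpy as np

def angular_difference_deg(a, b):
    """Absolute difference between two angles in degrees, wrapped to [0, 180]."""
    d = (np.asarray(a, dtype=float) - np.asarray(b, dtype=float) + 180.0) % 360.0 - 180.0
    return np.abs(d)

def share_below(errors, threshold):
    """Percentage of errors that are smaller than the threshold."""
    return 100.0 * np.mean(np.asarray(errors) < threshold)

def scaled_mad(errors, factor=1.4826):
    """Median absolute deviation scaled by a constant factor.

    ASSUMPTION: 1.4826 is the usual normal-consistency constant; the value
    used in the paper is not recoverable from the text.
    """
    errors = np.asarray(errors, dtype=float)
    return factor * np.median(np.abs(errors - np.median(errors)))

# Toy position errors in metres.
position_errors = np.array([0.1, 0.3, 0.6, 0.9])
shares = [share_below(position_errors, t) for t in (0.25, 0.50, 0.75)]
median_error = float(np.median(position_errors))
mad = scaled_mad(position_errors)
wrapped = float(angular_difference_deg(350.0, 10.0))
```

Wrapping the angular difference matters for orientations near 0° and 360°, where a naive subtraction would report an error of 340° instead of 20°.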

5 Evaluation

5.1 Vehicle type prediction

This section provides a study of the performance of the vehicle type branch which is proposed in this paper. As mentioned in Sec. 3.3.4, the association of a vehicle to one of a set of defined vehicle types can be an ambiguous task in some cases, even for human annotators. Depending on the vehicle types that are to be distinguished, the transitions between the appearance of different type classes can be smooth, without clear separation. As a consequence, for some vehicles, the decision of assigning them to one class or another can be difficult because the definition of the classes is somewhat vague and not stringent. Transferring these properties to the expected results of a vehicle type classifier leads to the expectation that the partially vague class definitions are reflected by a larger amount of class confusion on the one hand, and by the prediction of confidence scores that lack a distinct maximum on the other hand. In Tab. 4, the OAs computed from the classification results of the vehicle type branch on the ICSENS data set are shown. To analyse the effect of potential confusions due to vague class boundaries between specific vehicle types, the Top-1 to Top-3 overall accuracies are reported in the table.

| | Top-1 | Top-2 | Top-3 |
|---|---|---|---|
| easy | 58.0 | 71.6 | 79.8 |
| difficult | 57.5 | 70.8 | 79.3 |

Table 4: Overall classification results of the vehicle type branch on the ICSENS data. Top-1 - Top-3 OAs [%] are shown.

Tab. 4 shows that the OA of the vehicle type prediction is relatively low. If only the class having the highest score is considered (Top-1 OA), it is 58.0% and 57.5% for the easy and difficult categories, respectively. Taking into account the OA achieved for the Top-2 and Top-3 evaluation, which permits confusions among the two and three most confidently predicted classes, increases the accuracy by up to 13.6 and 21.8 percentage points, respectively. On the one hand, the different data domains used for training and testing can be a factor limiting the performance of the classifier on the ICSENS data. On the other hand, confusions caused by the vague class definitions described above can account for a significant share of the incorrect classifications. To get deeper insights into the behaviour of the vehicle type branch, Fig. 9 shows selected properties of the confidence scores predicted by that branch. One indicator of whether the vagueness of the class definitions is actually a problem for the classifier is the magnitude of the #1 confidence score, i.e. the score with the largest value among all classes. A histogram of the #1 scores obtained by the vehicle type branch on the ICSENS data is shown in Fig. 9(a). Another indicator for the same phenomenon is the ratio between the #2 and #1 scores, i.e. the ratio between the second largest and the largest confidence scores, lying in the range [0,1]. A larger value of that ratio indicates a higher uncertainty of the classifier about the distinction between the two classes associated with the scores. A histogram of the ratio between the #2 and #1 confidence scores obtained on the ICSENS data is shown in Fig. 9(b). As can be seen in Fig. 9(a), only about 10% of all vehicles obtain a #1 confidence score of 0.9 or higher for one of the classes. Instead, the largest share of the data, more than 18%, achieves #1 scores between 0.5 and 0.6. This distribution allows the conclusion that in a considerable number of cases, the confidence of the classifier in its predicted class is relatively small. As Fig. 9(b) reveals, this conclusion is further supported by the relatively large proportion of data (14% and 16% for the easy and difficult levels) for which the ratio of #2 and #1 confidence scores is larger than 0.75, i.e. the confidence score for the second most probable class is almost as large as the one for the most probable class. Another 18-21% of the data exhibit a confidence score ratio between 0.5 and 0.75.
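The two indicators analysed in Fig. 9 can be computed directly from the predicted confidence scores. The sketch below uses invented score vectors; the function name and the bin layout of ten equally wide bins are assumptions.

```python
import numpy as np

def score_indicators(scores):
    """Return the #1 confidence score and the #2/#1 ratio for (N, C) scores."""
    ordered = np.sort(scores, axis=1)[:, ::-1]
    top1 = ordered[:, 0]
    ratio = ordered[:, 1] / ordered[:, 0]
    return top1, ratio

# Toy confidence scores for two vehicles and three classes.
scores = np.array([[0.80, 0.15, 0.05],
                   [0.45, 0.40, 0.15]])
top1, ratio = score_indicators(scores)

# Relative histograms over [0, 1] in ten bins (ASSUMPTION: bin layout).
top1_hist, _ = np.histogram(top1, bins=10, range=(0.0, 1.0))
ratio_hist, _ = np.histogram(ratio, bins=10, range=(0.0, 1.0))
share_ambiguous = 100.0 * np.mean(ratio > 0.75)
```

A ratio close to 1, as for the second toy vehicle, indicates that two classes are almost equally likely according to the classifier.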

To sum up, the prediction of the vehicle type by the vehicle type branch results in comparably low overall Top-1 accuracies. However, the uncertainties of the prediction are reflected by low confidence scores for the predictions. One reason for this behaviour may be the vagueness of the class definition mentioned earlier. However, the confidence-aware shape prior term (Eq. 14) which makes use of the output of the vehicle type branch considers the prediction uncertainties and is therefore able to handle classification errors caused by a potentially indistinct definition of vehicle type assignments.

(a) Histogram of the #1 confidence scores obtained by the vehicle type branch on the ICSENS data.
(b) Histogram of the ratio of #2 and #1 score of the vehicle type branch on the ICSENS data.
Figure 9: Distribution and properties of the classification confidence scores obtained by the vehicle type branch on the ICSENS data.

5.2 Shape reconstruction

This section provides an analysis of the reconstructed vehicle shapes. In Tab. 5, the results achieved on the KITTI data for the shape metrics of the Base, Base+S and the Full models are shown to assess the influence of the shape prior term on the shape estimation as well as the final results of the entire probabilistic model. Comparing the average absolute errors between the reference and the estimated bounding boxes of the Base and the Base+S variants, the consideration of the proposed shape prior leads to significantly better results for the vehicle length and height, which are improved by up to 7 cm and 4 cm, respectively. Regarding the error metrics based on the Velodyne points, the consideration of the proposed shape prior leads to a decrease of the median and the average RMSE by 3 cm and 1 to 2 cm, respectively. This observation underlines the benefit of the category-aware ASM and shape prior presented in this paper compared to the commonly applied regularisation of the shape by penalising deviations from the mean ASM shape, as is done in the Base variant. The Full model increases the errors in the estimated length and width of the vehicles compared to the Base+S variant but gives the distinctly best results for the object height. Regarding the RMSE of the Velodyne points, further improvements of the median and average errors of up to 3 cm and 2 cm, respectively, are achieved by the Full model over the Base+S variant. Compared to the Base variant, this is an improvement of about 10% (average) to 20% (median). This observation leads to the conclusion that the Full model improves the reconstruction quality of the vehicle parts facing the camera, because these are the parts covered by the laser scanner observations. Consequently, the increase in the errors of the bounding box extents results from the estimated vehicle extents in the direction that is invisible to the camera. As an additional general observation, it can be noted that the errors resulting for the different difficulty levels are of almost the same size, probably due to the applied ASM-based shape prior, which acts as a regulariser on the shape independently of the observability of the vehicles.

| Difficulty | Variant | Avg. abs. error length [m] | Avg. abs. error width [m] | Avg. abs. error height [m] | Velodyne RMSE median [m] | Velodyne RMSE average [m] |
|---|---|---|---|---|---|---|
| easy | Base | 0.36 | 0.10 | 0.23 | 0.31 | 0.32 |
| easy | Base+S | 0.32 | 0.08 | 0.19 | 0.28 | 0.31 |
| easy | Full | 0.37 | 0.11 | 0.12 | 0.25 | 0.29 |
| mod. | Base | 0.39 | 0.10 | 0.23 | 0.31 | 0.33 |
| mod. | Base+S | 0.32 | 0.09 | 0.19 | 0.28 | 0.31 |
| mod. | Full | 0.37 | 0.11 | 0.13 | 0.25 | 0.30 |
| hard | Base | 0.40 | 0.10 | 0.22 | 0.31 | 0.35 |
| hard | Base+S | 0.33 | 0.09 | 0.18 | 0.28 | 0.33 |
| hard | Full | 0.38 | 0.11 | 0.13 | 0.25 | 0.32 |

Table 5: Shape evaluation results on the KITTI data. The dimensions of the reference bounding boxes are compared to the vehicle dimensions resulting from the reconstruction (average absolute errors of the bounding box dimensions). The median and average RMSE of the Velodyne vehicle points to the fitted ASM surface are shown in the last two columns.

Regarding the errors in object dimensions obtained on the ICSENS data (cf. Tab. 6), the pattern and magnitude of the average absolute errors are comparable to those obtained on the KITTI data set. The largest discrepancies between reference and estimation occur w.r.t. the length of the vehicles, while the vehicle width is estimated with the smallest errors. Comparing the different tested variants, the smallest errors are obtained by the Base+S+P+O setting, in which only the 3D likelihood under consideration of the state priors is used, which again is an indicator for the benefit of the proposed shape prior term. Comparing the results of the Full model to those of the Base+S+P+O variant, the average absolute errors are slightly increased. This observation is consistent with those made for the KITTI data set and allows similar conclusions to be drawn. The average keypoint error is consistently smallest (24 cm) for the Base+S+P+O variant and slightly increases for the Full model.

| Difficulty | Variant | Avg. abs. error length [m] | Avg. abs. error width [m] | Avg. abs. error height [m] | Keypoint RMS error [m] |
|---|---|---|---|---|---|
| easy | Base+S+P+O | 0.38 | 0.09 | 0.14 | 0.24 |
| easy | Base+K+W | 0.44 | 0.13 | 0.21 | 0.27 |
| easy | Full | 0.45 | 0.13 | 0.23 | 0.27 |
| difficult | Base+S+P+O | 0.38 | 0.10 | 0.14 | 0.24 |
| difficult | Base+K+W | 0.43 | 0.12 | 0.20 | 0.26 |
| difficult | Full | 0.42 | 0.12 | 0.20 | 0.26 |

Table 6: Shape evaluation results on the ICSENS data. The dimensions of the reference CAD models and the reconstructed ASM are compared. In addition, the keypoint-based RMS error is computed from the Euclidean distances of corresponding keypoints.

5.3 Pose estimation

The pose estimation results for all variants of the probabilistic model on both the KITTI and the ICSENS data sets are presented in Tab. 7, distinguishing between the respective levels of difficulty. The results are discussed in the subsequent sections.

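| Data set | Difficulty | Variant | Pos. < 0.25 m [%] ↑ | Pos. < 0.50 m [%] ↑ | Pos. < 0.75 m [%] ↑ | Ori. threshold 1 [%] ↑ | Ori. threshold 2 [%] ↑ | Ori. threshold 3 [%] ↑ | Joint pos. and ori. [%] ↑ | Median pos. error [m] ↓ | MAD pos. [m] ↓ | Median ori. error [°] ↓ | MAD ori. [°] ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KITTI | easy | Base | 39.9 | 71.8 | 87.3 | 57.0 | 68.8 | 74.5 | 54.1 | 0.31 | 0.25 | 3.9 | 4.5 |
| KITTI | easy | Base+S | 45.0 | 74.9 | 88.8 | 60.4 | 69.5 | 73.1 | 58.3 | 0.28 | 0.22 | 3.3 | 3.8 |
| KITTI | easy | Base+S+P+O | 46.6 | 77.1 | 92.1 | 77.3 | 95.1 | 98.6 | 73.0 | 0.27 | 0.21 | 2.5 | 2.5 |
| KITTI | easy | Base+K+W | 40.7 | 73.0 | 88.5 | 84.9 | 95.2 | 97.9 | 77.6 | 0.31 | 0.24 | 1.9 | 1.8 |
| KITTI | easy | Full | 38.6 | 70.3 | 86.6 | 90.1 | 97.8 | 98.9 | 79.8 | 0.33 | 0.26 | 1.7 | 1.6 |
| KITTI | moderate | Base | 36.4 | 67.5 | 82.6 | 52.9 | 64.5 | 70.2 | 49.7 | 0.34 | 0.27 | 4.5 | 5.4 |
| KITTI | moderate | Base+S | 40.9 | 71.0 | 84.3 | 56.0 | 64.9 | 69.0 | 53.6 | 0.31 | 0.25 | 3.8 | 4.7 |
| KITTI | moderate | Base+S+P+O | 43.0 | 74.2 | 89.2 | 72.5 | 89.8 | 95.3 | 68.0 | 0.29 | 0.22 | 2.7 | 2.7 |
| KITTI | moderate | Base+K+W | 38.6 | 70.6 | 85.9 | 78.7 | 88.4 | 91.8 | 72.5 | 0.32 | 0.26 | 2.1 | 2.1 |
| KITTI | moderate | Full | 37.6 | 69.9 | 86.2 | 85.4 | 93.5 | 96.2 | 76.5 | 0.33 | 0.26 | 1.8 | 1.7 |
| KITTI | hard | Base | 32.5 | 60.8 | 75.2 | 47.7 | 58.0 | 63.4 | 44.2 | 0.39 | 0.33 | 5.6 | 7.4 |
| KITTI | hard | Base+S | 36.3 | 64.0 | 76.9 | 50.4 | 58.4 | 62.3 | 47.6 | 0.35 | 0.30 | 4.9 | 6.5 |
| KITTI | hard | Base+S+P+O | 38.9 | 68.9 | 83.8 | 66.9 | 82.9 | 89.4 | 61.5 | 0.32 | 0.26 | 3.1 | 3.2 |
| KITTI | hard | Base+K+W | 35.4 | 65.0 | 79.8 | 70.3 | 79.6 | 83.2 | 64.7 | 0.36 | 0.30 | 2.4 | 2.7 |
| KITTI | hard | Full | 35.0 | 65.7 | 81.8 | 78.6 | 87.0 | 90.8 | 70.1 | 0.36 | 0.29 | 2.0 | 2.1 |
| ICSENS | easy | Base+S+P+O | 48.3 | 81.1 | 93.5 | 68.5 | 91.0 | 96.7 | 65.3 | 0.26 | 0.19 | 3.4 | 3.0 |
| ICSENS | easy | Base+S+K | 38.3 | 76.9 | 90.7 | 69.0 | 88.3 | 94.0 | 65.5 | 0.31 | 0.21 | 3.2 | 2.9 |
| ICSENS | easy | Full | 44.7 | 82.5 | 93.9 | 73.8 | 92.2 | 97.0 | 70.5 | 0.28 | 0.19 | 2.8 | 2.6 |
| ICSENS | diff. | Base+S+P+O | 41.4 | 75.2 | 88.7 | 64.9 | 87.6 | 93.6 | 60.5 | 0.30 | 0.23 | 3.5 | 3.2 |
| ICSENS | diff. | Base+S+K | 37.3 | 73.7 | 87.6 | 66.2 | 84.7 | 90.1 | 62.5 | 0.32 | 0.23 | 3.3 | 3.2 |
| ICSENS | diff. | Full | 43.3 | 78.1 | 90.5 | 71.6 | 89.3 | 94.1 | 67.4 | 0.29 | 0.21 | 2.9 | 2.8 |
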
Table 7: Quantitative pose estimation results on the KITTI and ICSENS data sets. The best achieved values for the respective metrics are printed in bold. The arrows indicate if a higher (↑) or a lower (↓) value of the corresponding metric is considered to be better.

5.3.1 Results on the KITTI dataset

Base: In this setting, the ASM is fitted to the triangulated point cloud. As can be seen from Tab. 7, about 40% of the reconstructed vehicles have a position error of less than 25 cm and 87.3% of the vehicles have an error of less than 75 cm in the easy category. The median error for the position estimates amounts to 31 cm. Because fewer observations are available for vehicles belonging to the moderate and hard categories, the results for the position estimates are worse, with a median error of up to 39 cm for the hard category. The same behaviour can be observed for the orientation estimates, where the median error of 3.9° achieved for the easy level increases to 5.6° for the hard category.

Base+S: In this setting, the shape prior term is added to the model alignment based on the category-aware ASM proposed in this paper. As can be seen in Tab. 7, the number of correct position estimates, especially w.r.t. the finer-grained 0.25 m and 0.50 m metrics, is significantly improved by the incorporation of the category-aware shape prior, with improvements of up to 5.1%. As has been shown in Sec. 5.2, the category-aware regularisation of the model shapes results in a better representation of the observed vehicles w.r.t. their dimensions. When only the 3D likelihood is considered for model alignment, errors in the reconstructed shape and therefore in the estimated vehicle dimensions occur, because the observed 3D points are only available for a part of the vehicle: no points are observed on the vehicle side facing away from the camera. Introducing a shape regularisation that is aware of the vehicle category, which in many cases is directly related to the vehicle dimensions, leads to better regularisation constraints on the shape and consequently to enhanced results for the position estimates. At the same time, the consideration of the shape prior also improves the results for orientation estimation w.r.t. the finest-grained orientation metric. In conclusion, the category-aware shape prior proposed in this paper leads to an increase in correct position estimates, probably due to a better representation of shape and dimension by the shape prior, and to a better quality of the orientation estimates. This demonstrates the suitability of the type predictions as prior information for the vehicle shape.

Base+S+P+O: To assess the full potential of the state priors, the shape, position, and orientation priors are jointly considered in this variant as regularisers for the state parameters during alignment. As can be seen from Tab. 7, throughout all difficulty levels, the consideration of the position and orientation priors significantly increases the number of correct position and orientation estimates in all metrics and decreases the median errors. The analysis of this variant demonstrates the beneficial effect of incorporating the proposed state priors into the approach for model fitting.

Base+K+W: This setting uses all likelihood terms of the probabilistic model. Compared to the Base setting, in which only the 3D likelihood is used, only subtle improvements are achieved for the position estimates, but the orientation estimates are enhanced distinctly. In the easy category, for instance, the median error for the orientation is decreased from 3.9° to 1.9°, and the enhancement is even larger for the moderate and hard levels. It is reasonable to assume that introducing image observations and semantic knowledge via the keypoint and wireframe likelihoods delivers valuable cues for the model fitting. This indicates the contribution of the likelihoods to the vehicle reconstruction. Qualitative results obtained by the Base+K+W variant are shown in Fig. 10, which contains visualisations of the probability maps for the keypoints and wireframes as well as the wireframe of the reconstructed ASM backprojected to the left stereo image. To be able to show the probability maps for the individual keypoints and wireframe definitions in one image, the heatmaps of all keypoints and wireframes are superimposed and the maximum value among all heatmaps for each pixel is shown.

Figure 10: Qualitative results obtained by the Base+K+W variant on four images from the KITTI data set. For every image, a triplet consisting of the probability maps for vehicle keypoints (top) and vehicle wireframes (middle) superimposed to the left input image are shown (the cold to warm colour coding represents low to high probabilities). Furthermore, the backprojected wireframes of the reconstructed vehicles are depicted (bottom).

Full: According to Tab. 7, when describing the results of the Full probabilistic model compared to the Base+K+W and Base+S+P+O variants, a distinction has to be made between the results for the position and the results for the orientation. Counterintuitively, the number of correct position estimates decreases in the easy category when combining the likelihood terms and prior terms in the Full model, compared to both the Base+K+W and the Base+S+P+O settings. This decrease is less distinct for the moderate and hard categories, where the decrease of correct position estimates of the Full model w.r.t. the Base+S+P+O model is smaller, and in fact an increase of correct position estimates using the Full model can be observed compared to the Base+K+W model. The Base+S+P+O variant delivers the best results for the position estimates throughout all difficulty levels, with 5.5% and 8.0% more correct estimates in the easy category for the 0.75 m and 0.25 m metrics, respectively. In the moderate and hard levels, the differences are smaller, with 3.0%/5.4% and 2.0%/3.9% for the 0.75 m/0.25 m metrics, respectively.

In contrast, regarding the results for the estimated orientations, the Full model achieves the highest number of correct estimates throughout all levels of difficulty. Because the numbers of correct orientation estimates for the coarsest orientation metric achieved by the Base+S+P+O setting were already fairly high, with up to 98.6% in the easy category and up to 89.4% in the hard category, only small improvements of up to 1.4% are achieved by the Full model for this metric. The improvements achieved by the Full model compared to the Base+K+W and the Base+S+P+O models within the finest-grained orientation metric are up to 12.8% in the easy category and up to 11.7% in the hard category, which is remarkably large. The number of vehicle reconstructions that are correct w.r.t. both the position and the orientation (joint metric) is largest for the Full model throughout all difficulty levels. In accordance with the observations described so far, the median errors resulting from the Full model are slightly larger compared to the other variants regarding the position, but slightly smaller w.r.t. the orientation.

Comparing the results of the Full model to the results reported in (Coenen and Rottensteiner, 2019), the performance of the probabilistic model with and without considering the novel shape prior can be investigated. Regarding the metric, the proposed shape prior term improves the results by 7.2%/5.6%/4.3% for the easy/moderate/hard categories. With respect to the orientation, improvements of 3.4%/3.1%/2.5% are achieved for the criterion and the different difficulty levels by using the shape prior term proposed in this work.

5.3.2 Results on the ICSENS dataset

In addition to the results on the KITTI data set, Tab. 7 contains the results for the Base+K+W, the Base+S+P+O, and the Full settings achieved on the ICSENS data. The benefit of combining all likelihood terms with the state priors in the Full formulation for the orientation estimation, which was observed on the KITTI data, is also visible in the results achieved on the ICSENS data. The Full model delivers the best results, although the improvement over the other variants is less distinct than on the KITTI data. However, in contrast to the observations made on the KITTI data set, where the combination of likelihood and prior terms led to a decrease in the number of correct position estimates, the Full model also improves the position results on the ICSENS data. An exception is one metric, for which the Base+S+P+O setting achieves the best results in the easy category. Nevertheless, the overall tendency apparent from Tab. 7 attests to the beneficial effect of the joint consideration of the proposed likelihoods and state priors. Qualitative results obtained by the Full model on the ICSENS data are shown in Fig. 11.

Figure 11: Qualitative results obtained by the Full variant on four images from the ICSENS data set. The backprojected wireframes of the reconstructed vehicles are depicted.

5.3.3 Comparison of the results on the KITTI and the ICSENS data sets

Comparing the results achieved by the Full model on the two data sets, different phenomena can be observed. Regarding the position, the performance on the ICSENS data is significantly better than the performance on the KITTI data. In the easy category, the difference in the number of correct position estimates ranges from 4.5% to 11.3% for the different evaluation metrics. One explanation for the better position results on the ICSENS data may be the larger baseline of the ICSENS stereo rig (85 cm) compared to the baseline used for the KITTI data (54 cm) and, consequently, a lower depth uncertainty.
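The link between baseline and depth uncertainty can be made explicit with the standard error propagation for a rectified stereo pair. The following is a sketch under the assumption that both rigs have a comparable focal length f (in pixels) and disparity uncertainty, which this section does not state; the symbols Z (depth), b (baseline), d (disparity) and sigma are introduced here for illustration only.

```latex
Z = \frac{f\,b}{d}
\quad\Rightarrow\quad
\sigma_Z = \left|\frac{\partial Z}{\partial d}\right| \sigma_d
         = \frac{Z^2}{f\,b}\,\sigma_d ,
\qquad
\frac{\sigma_Z^{\mathrm{ICSENS}}}{\sigma_Z^{\mathrm{KITTI}}}
  = \frac{b^{\mathrm{KITTI}}}{b^{\mathrm{ICSENS}}}
  = \frac{0.54\,\mathrm{m}}{0.85\,\mathrm{m}} \approx 0.64
```

Under these assumptions, a point at the same distance would be triangulated with roughly two thirds of the depth uncertainty on the ICSENS rig.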

Regarding the results obtained for the orientation, the performance of the coarse viewpoint estimation is comparable for both data sets. However, the achieved numbers for the finer-grained orientation evaluation criteria are distinctly lower for the ICSENS data set, especially for the metric, which is 18.4% and 17.7% lower in the easy and hard levels, respectively, compared to the criterion. As a consequence of the worse orientation estimates, the median orientation errors obtained on the ICSENS data are significantly larger than on the KITTI data. As the majority of the likelihood and prior terms of the probabilistic model are based on the predictions of the CNN, a potential reason for the decreased performance on the ICSENS data set may be the domain gap: the CNN is mainly trained on KITTI data and therefore might perform better on data from the same domain.

5.4 Analysis of limitations and further aspects

One difficulty in estimating the position is that the position of a vehicle, represented by its centre point, which lies inside the vehicle, is never directly observed in the images but is instead derived from the reconstructed 3D vehicle shape. A vehicle is never entirely visible in the image; only one or two of the four vehicle sides are observed, while the remaining parts of the vehicle face away from the camera and are therefore invisible. As a consequence, the extent of the vehicle is unobserved and thus ambiguous. Model-based approaches such as the one proposed in this paper constrain and derive the full extent of the object using statistically learned shape priors. As the vehicle position is derived from the reconstructed model, the remaining ambiguities of the object extent in the viewing direction cause errors in the position estimates, so that errors are expected to occur especially in the viewing direction of the camera. To verify this hypothesis, the errors of the position estimates are analysed separately for the lateral direction (across the viewing direction of the camera) and the longitudinal direction (along the viewing direction of the camera). Tab. 8 contains the numbers of correctly estimated lateral and longitudinal vehicle coordinates; for this analysis, a coordinate is considered to be correct if its absolute difference from the reference is smaller than 25, 50 and 75 cm, respectively. The table shows clear differences between the results for the estimated lateral and longitudinal positions.

          Lateral position              Longitudinal position
          25 cm    50 cm    75 cm       25 cm    50 cm    75 cm
easy      88.0     97.3     98.8        43.3     73.4     88.7
moderate  84.1     95.5     97.9        43.4     74.3     88.8
hard      79.3     91.4     95.6        42.0     71.4     85.5
Table 8: Number of correct longitudinal and lateral position estimates of the Full model on the KITTI data, in [%].

While the number of estimated lateral coordinates that are within 75 cm of the reference lies between 95.6 and 98.8% for all difficulty categories, these numbers are approximately 10% lower for the longitudinal position estimates. This discrepancy becomes even larger considering 25 cm as the threshold for an estimate to be counted as correct. In the lateral direction, up to 88.0% of the reconstructed vehicles exhibit a position which is correct within 25 cm. In the longitudinal direction, only half of this number is achieved. The fact that the results for the longitudinal component of the position are worse than those of the lateral one can be attributed to two factors which we believe to interfere with each other. On the one hand, we consider the errors in estimating the vehicle position to be at least partly caused by the ambiguities in determining the spatial extent of the vehicles in the longitudinal direction. On the other hand, this effect can be due to the depth uncertainty, which increases with an increasing distance of the vehicles from the camera.
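The split of a position error into lateral and longitudinal components can be sketched as follows. This is a minimal illustration, not the evaluation code of the paper: it assumes ground-plane coordinates (x to the right, z along the optical axis) in the camera frame and takes the viewing direction to be the ray from the camera centre to the reference position. The function names are hypothetical; only the thresholds of 25, 50 and 75 cm come from the text.

```python
import numpy as np

THRESHOLDS_M = (0.25, 0.50, 0.75)  # thresholds used in Tab. 8


def split_position_error(estimate_xz, reference_xz):
    """Return (lateral, longitudinal) error components in metres.

    ASSUMPTION: the viewing direction is the ray from the camera centre
    (origin) to the reference position in the ground plane.
    """
    estimate = np.asarray(estimate_xz, dtype=float)
    reference = np.asarray(reference_xz, dtype=float)
    view_dir = reference / np.linalg.norm(reference)     # unit vector along the ray
    across_dir = np.array([view_dir[1], -view_dir[0]])   # unit vector across the ray
    error = estimate - reference
    lateral = float(error @ across_dir)
    longitudinal = float(error @ view_dir)
    return lateral, longitudinal


def correct_flags(component_error_m):
    """Flag the error component as correct for each threshold."""
    return tuple(abs(component_error_m) < t for t in THRESHOLDS_M)


# Hypothetical vehicle 10 m straight ahead, estimated 0.1 m to the side
# and 0.6 m too far away.
lat, lon = split_position_error((0.1, 10.6), (0.0, 10.0))
print(correct_flags(lat), correct_flags(lon))
```

Counting the flags over all evaluated vehicles and dividing by the number of vehicles would give percentages of the kind reported in Tab. 8.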

The influence of the distance of a vehicle from the camera on the quality of the reconstruction can be investigated using Tab. 9, which contains the median errors for position and orientation of the Full model on the KITTI data, differentiated by the distance of the vehicle to the camera. The depth uncertainties of a stereo-reconstructed 3D point at the considered distances, assuming a disparity uncertainty of 1 px, are also shown in Tab. 9. To compare the quality of the orientation results with that of the position results, the median orientation errors achieved on the KITTI data are converted into a perpendicular error at the corresponding distances (cf. Fig. 12), which can then be compared to the median position errors achieved at the respective distances.

                                           Vehicle distance
                                           5 - 10 m      10 - 15 m     15 - 20 m     > 20 m
          Depth uncertainty [m]            0.06 - 0.26   0.26 - 0.58   0.58 - 1.03   1.03
easy      Median orientation error [°]     1.5           1.5           1.7           2.1
          Perpendicular [m]                0.13 - 0.26   0.26 - 0.39   0.45 - 0.59
          Median position error [m]        0.21          0.23          0.40          0.58
moderate  Median orientation error [°]     1.5           1.6           1.9           2.3
          Perpendicular [m]                0.13 - 0.26   0.28 - 0.42   0.50 - 0.66
          Median position error [m]        0.20          0.25          0.42          0.57
hard      Median orientation error [°]     1.6           1.8           2.1           2.6
          Perpendicular [m]                0.14 - 0.28   0.31 - 0.47   0.55 - 0.73
          Median position error [m]        0.21          0.28          0.44          0.60
Table 9: Median errors for position, orientation, and the perpendicular achieved by the Full model on the KITTI data differentiated by the vehicle distances.

Figure 12: The relation between vehicle distance, orientation error, and the perpendicular.

Tab. 9 shows that the distance of the vehicle from the camera strongly affects the ability to correctly estimate the vehicle’s position and also affects, although less strikingly, the quality of the orientation estimates. In the easy category, the median error of the position estimates increases by almost a factor of three, from 21 cm for vehicles at a distance between 5 and 10 m to 58 cm for vehicles more distant than 20 m. The numbers for the moderate and hard categories are comparable and show the same pattern. It can be assumed that the increasing depth uncertainty of distant 3D points is responsible for this effect. While an error of 1 px in the disparity leads to uncertainties of 6-26 cm for triangulated 3D points at a distance of 5-10 m, the uncertainty is already 103 cm for points at a distance of 20 m. As a consequence, and unsurprisingly, the ability to precisely estimate the position is heavily influenced by the expected depth uncertainty. An increasing distance of the vehicle also negatively affects the estimation of the orientation, but the influence is less distinct than the effect on the position estimates.
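The depth uncertainties quoted from Tab. 9 are consistent with the standard error propagation for a rectified stereo pair. The sketch below reproduces them under stated assumptions: the baseline of 0.54 m and the disparity uncertainty of 1 px are taken from the text, whereas the focal length of 721.5 px is an assumed value commonly reported for the rectified KITTI cameras and is not given in this section. The function name is illustrative.

```python
def depth_uncertainty(distance_m, baseline_m=0.54, focal_px=721.5,
                      sigma_disparity_px=1.0):
    """Depth uncertainty of a triangulated point: Z^2 / (f * b) * sigma_d.

    ASSUMPTION: focal_px = 721.5 is not stated in the paper section.
    """
    return distance_m ** 2 / (focal_px * baseline_m) * sigma_disparity_px


for distance in (5, 10, 15, 20):
    print(f"{distance:>2} m: {depth_uncertainty(distance):.2f} m")
# prints 0.06, 0.26, 0.58 and 1.03 m, matching the first row of Tab. 9
```

The quadratic growth with distance explains why the uncertainty roughly quadruples between 10 m and 20 m.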

As can be seen, at distances larger than 10 m, the perpendicular resulting from the median orientation errors is larger than the median position error for all difficulty levels. In a distance of 20 m, is almost twice as large as . This behaviour can, for instance, be relevant for applications in collaborative autonomous driving, in which the determined pose of the vehicles is introduced as vehicle-to-vehicle (V2V) observations for the task of collaborative positioning (Knuth and Barooah, 2009).
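The perpendicular values in Tab. 9 can be reproduced from the median orientation errors with a simple trigonometric relation. The sketch below assumes that the perpendicular is the distance multiplied by the tangent of the orientation error; this relation is inferred from the tabulated numbers, since the defining figure is not reproduced here, and the function name is hypothetical.

```python
import math


def perpendicular_error(distance_m, orientation_error_deg):
    """Lateral offset caused by an orientation error at a given distance.

    ASSUMPTION: perpendicular = distance * tan(orientation error).
    """
    return distance_m * math.tan(math.radians(orientation_error_deg))


# Easy category of Tab. 9: 1.5 deg at 5-10 m and 1.7 deg at 15-20 m.
print(round(perpendicular_error(5, 1.5), 2), round(perpendicular_error(10, 1.5), 2))
print(round(perpendicular_error(15, 1.7), 2), round(perpendicular_error(20, 1.7), 2))
# prints 0.13 0.26 and 0.45 0.59, the ranges listed in Tab. 9
```

For such small angles the sine and the tangent differ only beyond the third decimal, so either choice would yield the same table entries.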

In Fig. 13, the average RMSE between the Velodyne points and the reconstructed ASM surface is shown as a function of the vehicle distance to the camera.

Figure 13: Shape evaluation results on the KITTI data based on the Velodyne laser scanner points for the (a) easy, (b) moderate, and (c) hard categories. The average RMSE of the laser scanner vehicle clouds to the ASM surface is shown as a function of the vehicle distance to the camera for the different variants of the probabilistic model.

It can be seen that the RMSE increases significantly once the vehicle distance exceeds 15 m. It also becomes visible that the proposed shape prior in the Base+S variant and the Full model lead to the largest improvements for vehicles close to the camera (up to 15 m); the improvements become smaller with increasing distance of the vehicles from the camera. We assume this to be the result of the increasing depth errors of the reconstructed points which form the basis of the 3D reconstruction (cf. Tab. 9): for points at a large distance from the camera, the deviations of the reconstructed models from the laser point cloud are dominated by these depth errors.
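The distance-dependent shape evaluation can be sketched as follows. The example assumes that the distances between the laser points of a vehicle and the reconstructed surface have already been computed by some point-to-mesh routine, which is not shown; the bin edges of 5, 10, 15 and 20 m are borrowed from Tab. 9 and are an assumption, as are the function names.

```python
import numpy as np


def rmse(point_to_surface_m):
    """Root mean square of the point-to-surface distances of one vehicle."""
    distances = np.asarray(point_to_surface_m, dtype=float)
    return float(np.sqrt(np.mean(distances ** 2)))


def average_rmse_per_bin(vehicle_distances_m, vehicle_rmses_m,
                         inner_edges_m=(5.0, 10.0, 15.0, 20.0)):
    """Average the per-vehicle RMSE within bins of the vehicle distance.

    Bin 0 collects vehicles closer than the first edge, bin k those between
    edge k-1 and edge k, and the last bin those beyond the last edge.
    """
    distances = np.asarray(vehicle_distances_m, dtype=float)
    rmses = np.asarray(vehicle_rmses_m, dtype=float)
    bin_index = np.digitize(distances, inner_edges_m)
    return {int(b): float(rmses[bin_index == b].mean())
            for b in np.unique(bin_index)}


# Hypothetical values: two vehicles at 7 m and 8 m, one at 22 m.
print(average_rmse_per_bin([7.0, 8.0, 22.0], [0.1, 0.3, 0.5]))
```

Plotting the returned averages against the bin centres for each model variant would give curves of the kind described for Fig. 13.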

5.5 Comparison to related methods

In order to compare the performance of the proposed method to the performance of related methods, the test data set and evaluation metrics of the official KITTI benchmark (http://www.cvlibs.net/datasets/kitti) can be used. However, the focus of that benchmark lies on object detection. As a consequence, the metrics that assess the quality of pose estimation are coupled with the detection performance; cf. (Geiger et al., 2012) for a detailed description of the error metrics. To assess the orientation accuracy, the official metric of the KITTI benchmark is the Average Orientation Similarity (AOS), which multiplies the Average Precision (AP) of the detector with the average cosine similarity of the orientation. Because the focus of this work lies on pose estimation rather than on the detection of vehicles, we make use of the Orientation Score (OS) metric proposed by Mousavian et al. (2017). OS factors out the 2D detector performance by computing the ratio of AOS to AP; it is thus unaffected by the detector’s performance and can be used to assess the quality of the estimated orientation. Tab. 10 shows the OS scores achieved by our Full approach and by related state-of-the-art approaches.
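The relation between the three scores can be illustrated with a few lines of code. This is a minimal sketch with hypothetical input values, not the official KITTI evaluation: the per-detection similarity (1 + cos of the angular difference) / 2 is the normalised cosine similarity used in the AOS definition of the benchmark, and the function names are illustrative.

```python
import math


def orientation_similarity(delta_theta_rad):
    """Normalised cosine similarity of one orientation estimate, in [0, 1]."""
    return (1.0 + math.cos(delta_theta_rad)) / 2.0


def orientation_score(aos, ap):
    """OS = AOS / AP, which factors out the detector performance."""
    if ap <= 0.0:
        raise ValueError("AP must be positive to compute the orientation score")
    return aos / ap


# Hypothetical benchmark values in percent, for illustration only.
print(orientation_score(aos=88.0, ap=90.0))
```

Because AOS can never exceed AP, OS is bounded by 1 and equals 1 only if every detected vehicle has a perfectly estimated orientation.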

OS scores                 easy     moderate  hard
Pepik et al. (2015)       96.28    95.29     95.06
Chen et al. (2015)        98.28    97.13     97.73
Chen et al. (2016)        98.57    97.69     97.31
Xiang et al. (2017)       99.84    99.52     99.25
Ku et al. (2019)          99.63    98.81     98.50
Manhardt et al. (2019)    98.38    97.12     96.45
Ours (Full)               98.77    97.65     96.75
Table 10: Comparison to related methods based on the performance on the KITTI test set. The orientation score (OS) allows the orientation estimates to be compared without the impact of the detection performance.

As can be seen from the table, our method is on par with or outperforms most of the listed related methods in the easy and moderate categories. In the hard category, our method shows a larger decrease of the OS than the related methods. Throughout all difficulty levels, the best OS results are achieved by the methods of Xiang et al. (2017) and Ku et al. (2019). However, it has to be noted that the CNN architectures for pose estimation presented in these papers were trained using the entire KITTI training data set of almost 8000 images from the same domain, while the approach presented in this paper was trained on only 260 images from the KITTI training set.

in [%]                              Position (fine to coarse)    Orientation (fine to coarse)    Position and orientation
easy   Mousavian et al. (2017)      19.5    38.4    55.1         87.1    94.9    98.9            54.8
       Ours (Full)                  42.7    75.2    89.5         88.7    97.0    98.5            88.5
mod.   Mousavian et al. (2017)      16.8    34.0    49.2         78.7    90.1    96.1            48.1
       Ours (Full)                  39.3    72.4    87.8         83.5    92.3    95.6            85.8
hard   Mousavian et al. (2017)      13.9    29.1    42.9         68.6    80.2    88.1            40.4
       Ours (Full)                  35.5    66.9    81.9         75.5    84.7    89.0            78.6
Table 11: Comparison of the proposed method to the results achieved by Mousavian et al. (2017) based on the evaluation metrics proposed in this paper.

In order to compare our results to those of other state-of-the-art approaches using the evaluation metrics developed in this paper, we make use of the detection and pose estimation results of (Mousavian et al., 2017), which are provided by the authors for a subset of 3799 images from the KITTI training data. Tab. 11 compares the results of the proposed Full model with the results from Mousavian et al. (2017) on the same set of images. The images that were used for training in this work were excluded from the evaluation of both the results of Mousavian et al. (2017) and ours.

Our method significantly outperforms Mousavian et al. (2017) in terms of the number of correct position estimates, especially for the finest position metric, by a factor of up to 2.5. However, for a fair comparison it has to be noted that no stereo information is used in (Mousavian et al., 2017). Regarding the orientation estimates, the results for the coarse level are similar, whereas the probabilistic model of this work leads to significantly better results for the finer evaluation metrics, particularly in the moderate and hard categories. As a consequence, the number of vehicle reconstructions that are considered correct in both position and orientation is considerably larger for our approach.

6 Conclusion

In this paper, we proposed a subcategory-aware shape prior for vehicles. Together with a CNN-based prediction of the vehicle type, the novel shape prior is incorporated into an extensive probabilistic model for vehicle reconstruction. The results on two real-world data sets have shown the benefit of the proposed shape prior w.r.t. both the quality of the vehicle shape reconstructions and the quality of the vehicle pose estimates. A comparison of our results to those of related state-of-the-art methods has shown that the proposed approach performs on par or better, confirming the suitability of the developed shape prior and probabilistic model for vehicle reconstruction. While the proposed method reconstructs vehicles from one individual stereo pair, i.e. from observations of one epoch in time, an extension of the probabilistic model to a sequence of images acquired at subsequent epochs, combined with the tracking of vehicles, would enable the incorporation of multiple observations of the same object. Suitable constraints, e.g. on plausible changes in position and orientation between two time steps, can for instance be enforced by proper motion models. Besides, each vehicle is reconstructed individually in this paper. Global constraints that, e.g., prevent vehicle reconstructions from coinciding, or contextual relations between vehicles, are not exploited so far but provide opportunities for further improvement. Another future direction of research could be the improvement of the CNN training procedure to allow for end-to-end training and to improve the results of the vehicle type prediction branch. Finally, further experiments can show whether the introduction of relative weights for the components of the probabilistic model could further improve the results.


This work was supported by the German Research Foundation (DFG) as part of the Research Training Group i.c.sens [GRK2159].


  • I. Akhter, Y. Sheikh, S. Khan, and T. Kanade (2011) Trajectory Space: A Dual Representation for Nonrigid Structure from Motion. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 33 (7), pp. 1442–1456.
  • J. A. Ansari, S. Sharma, A. Majumdar, J. K. Murthy, and K. M. Krishna (2018) The Earth ain’t Flat: Monocular Reconstruction of Vehicles on Steep and Graded Roads from a Moving Camera. In IEEE International Conference on Intelligent Robots and Systems (IROS), pp. 8404–8410.
  • S. Y. Bao, M. Chandraker, Y. Lin, and S. Savarese (2013) Dense Object Reconstruction with Semantic Priors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1264–1271.
  • A. Bhattacharyya (1943) On a Measure of Divergence between two Statistical Populations defined by their Probability Distributions. Bulletin of the Calcutta Mathematical Society 35, pp. 99–109.
  • B. Braden (1986) The Surveyor’s Area Formula. The College Mathematics Journal 17 (4), pp. 326–337.
  • L. Breiman (2001) Random Forests. Machine Learning 45 (1), pp. 5–32.
  • F. Chabot, M. Chaouch, J. Rabarisoa, C. Teuliere, and T. Chateau (2017) Deep MANTA: A Coarse-to-Fine Many-Task Network for Joint 2D and 3D Vehicle Analysis from Monocular Image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1827–1836.
  • X. Chen, K. Kundu, Y. Zhu, A. G. Berneshawi, H. Ma, S. Fidler, and R. Urtasun (2015) 3D Object Proposals for accurate Object Class Detection. In Advances in Neural Information Processing Systems (NIPS), Vol. 28, pp. 424–432.
  • X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun (2016) Monocular 3D Object Detection for Autonomous Driving. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2147–2156.
  • M. Coenen, F. Rottensteiner, and C. Heipke (2017) Detection and 3D Modelling of Vehicles from terrestrial stereo Image Pairs. In ISPRS International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Vol. XLII-1/W1, pp. 505–512.
  • M. Coenen, F. Rottensteiner, and C. Heipke (2018) Recovering the 3D Pose and Shape of Vehicles from Stereo Images. In ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Vol. IV-2, pp. 73–80.
  • M. Coenen, F. Rottensteiner, and C. Heipke (2019) Precise Vehicle Reconstruction for autonomous Driving Applications. In ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Vol. IV-2/W5, pp. 21–28.
  • M. Coenen and F. Rottensteiner (2019) Probabilistic Vehicle Reconstruction Using a Multi-Task CNN. In IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 822–831.
  • M. Coenen (2020) The i.c.sens Stereo Image Vehicle Dataset.
  • T. F. Cootes, C. J. Taylor, and D. H. Cooper (1995) Active Shape Models - Their Training and Application. Computer Vision and Image Understanding (CVIU) 61 (1), pp. 38–59.
  • T. F. Cootes, G. J. Edwards, and C. J. Taylor (2001) Active Appearance Models. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 23 (6), pp. 681–685.
  • N. Dalal and B. Triggs (2005) Histograms of oriented Gradients for Human Detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. 886–893.
  • A. Dame, V. A. Prisacariu, C. Y. Ren, and I. Reid (2013) Dense Reconstruction using 3D Object Shape Priors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1288–1295.
  • W. Ding, S. Li, G. Zhang, X. Lei, and H. Qian (2018) Vehicle Pose and Shape Estimation through Multiple Monocular Vision. In IEEE International Conference on Robotics and Biomimetics (ROBIO), pp. 709–715.
  • F. Engelmann, J. Stückler, and B. Leibe (2016) Joint Object Pose Estimation and Shape Reconstruction in urban Street Scenes using 3D Shape Priors. In German Conference on Pattern Recognition (GCPR), Lecture Notes in Computer Science, Vol. 9796, Cham, pp. 219–230. ISBN 978-3-319-45885-4.
  • P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan (2010) Object Detection with discriminatively trained part-based Models. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 32 (9), pp. 1627–1645. ISSN 0098-5589.
  • A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous Driving? The KITTI Vision Benchmark Suite. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3354–3361.
  • A. Geiger, M. Roser, and R. Urtasun (2011) Efficient Large-Scale Stereo Matching. In Asian Conference on Computer Vision (ACCV), Lecture Notes in Computer Science, Vol. 6492, Berlin, pp. 25–38. ISBN 978-3-642-19314-9.
  • A. Grabner, P. M. Roth, and V. Lepetit (2018) 3D Pose Estimation and 3D Model Retrieval for Objects in the Wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3022–3031.
  • F. Güney and A. Geiger (2015) Displets: Resolving stereo Ambiguities using Object Knowledge. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4165–4175.
  • F. R. Hampel, E. M. Ronchetti, P. Rousseeuw, and W. A. Stahel (1986) Robust Statistics: The Approach based on Influence Functions. Wiley-Interscience, New York, New York, USA.
  • K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034.
  • P. J. Huber (1964) Robust Estimation of a Location Parameter. The Annals of Mathematical Statistics 35 (1), pp. 73–101.
  • S. Ioffe and C. Szegedy (2015) Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning (ICML), pp. 448–456.
  • P. Isola, D. Zoran, D. Krishnan, and E. H. Adelson (2014) Crisp Boundary Detection Using Pointwise Mutual Information. In European Conference on Computer Vision (ECCV), pp. 799–814.
  • A. Kar, S. Tulsiani, J. Carreira, and J. Malik (2015) Category-Specific Object Reconstruction From a Single Image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1966–1974.
  • D. P. Kingma and L. J. Ba (2015) Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations (ICLR).
  • J. Knuth and P. Barooah (2009) Distributed collaborative Localization of multiple Vehicles from relative Pose Measurements. In Conference on Communication, Control, and Computing, pp. 314–321.
  • J. Ku, A. D. Pon, and S. L. Waslander (2019) Monocular 3D Object Detection Leveraging Accurate Proposals and Shape Reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11867–11876.
  • A. Kundu, Y. Li, and J. M. Rehg (2018) 3D-RCNN: Instance-level 3D Object Reconstruction via Render-and-Compare. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3559–3568.
  • M. J. Leotta and J. L. Mundy (2009) Predicting high Resolution Image Edges with a generic, adaptive, 3-D Vehicle Model. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1311–1318.
  • P. Li, X. Chen, and S. Shen (2019) Stereo R-CNN Based 3D Object Detection for Autonomous Driving. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7644–7652.
  • Y. Li, L. Gu, and T. Kanade (2011) Robustly Aligning a Shape Model and Its Application to Car Alignment of Unknown Pose. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 33 (9), pp. 1860–1876.
  • Y. Lin, V. I. Morariu, W. Hsu, and L. S. Davis (2014) Jointly Optimizing 3D Model Fitting and Fine-Grained Classification. In European Conference on Computer Vision (ECCV), pp. 466–480.
  • F. Manhardt, W. Kehl, and A. Gaidon (2019) ROI-10D: Monocular Lifting of 2D Detections to 6D Pose and Metric Shape. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2069–2078.
  • M. Menze, C. Heipke, and A. Geiger (2015) Joint 3D Estimation of Vehicles and Scene Flow. In ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Vol. II-3, pp. 427–434.
  • A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka (2017) 3D Bounding Box Estimation Using Deep Learning and Geometry. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7074–7082.
  • J. K. Murthy, G. V. S. Krishna, F. Chhaya, and K. M. Krishna (2017a) Reconstructing Vehicles from a single Image: Shape Priors for Road Scene Understanding. In IEEE International Conference on Robotics and Automation (ICRA), pp. 724–731.
  • J. K. Murthy, S. Sharma, and K. M. Krishna (2017b) Shape Priors for real-time monocular Object Localization in dynamic Environments. In IEEE International Conference on Intelligent Robots and Systems (IROS), pp. 1768–1774.
  • M. Najibi, G. Lai, A. Kundu, Z. Lu, V. Rathod, T. Funkhouser, C. Pantofaru, L. S. Ross, and A. Fathi (2020) DOPS: Learning to detect 3D Objects and predict their 3D Shapes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11913–11922.
  • A. Newell, K. Yang, and J. Deng (2016) Stacked Hourglass Networks for Human Pose Estimation. In European Conference on Computer Vision (ECCV), pp. 483–499.
  • R. Ortiz-Cayon, A. Djelouah, F. Massa, M. Aubry, and G. Drettakis (2016) Automatic 3D Car Model Alignment for Mixed Image-Based Rendering. In International Conference on 3D Vision (3DV), pp. 286–295.
  • G. Pavlakos, X. Zhou, A. Chan, K. G. Derpanis, and K. Daniilidis (2017) 6-DoF Object Pose from Semantic Keypoints. In IEEE International Conference on Robotics and Automation (ICRA), pp. 2011–2018.
  • B. Pepik, M. Stark, P. Gehler, and B. Schiele (2015) Multi-View and 3D Deformable Part Models. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 37 (11), pp. 2232–2245.
  • V. A. Prisacariu, A. V. Segal, and I. Reid (2012) Simultaneous Monocular 2D Segmentation, 3D Pose Recovery and 3D Reconstruction. In Asian Conference on Computer Vision (ACCV), pp. 593–606.
  • K. Ramnath, S. N. Sinha, R. Szeliski, and E. Hsiao (2014) Car Make and Model Recognition using 3D Curve Alignment. In IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 285–292.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems (NIPS), Vol. 28, pp. 91–99.
  • O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 234–241.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252.
  • S. Schön, C. Brenner, H. Alkhatib, M. Coenen, H. Dbouk, N. Garcia-Fernandez, C. Fischer, C. Heipke, K. Lohmann, I. Neumann, U. Nguyen, J.-A. Paffenholz, T. Peters, F. Rottensteiner, J. Schachtschneider, M. Sester, L. Sun, S. Vogel, R. Voges, and B. Wagner (2018) Integrity and Collaboration in Dynamic Sensor Networks. Sensors 18 (7), pp. 2400–2421.
  • K. Simonyan and A. Zisserman (2015) Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations (ICLR).
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: A Simple Way to Prevent Neural Networks from Overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
  • Z. Tang, M. Naphade, S. Birchfield, J. Tremblay, W. Hodge, R. Kumar, S. Wang, and X. Yang (2019) PAMTRI: Pose-Aware Multi-Task Learning for Vehicle Re-Identification Using Highly Randomized Synthetic Data. In IEEE International Conference on Computer Vision (ICCV), pp. 211–220. Cited by: §2.1.
  • B. Tekin, S. N. Sinha, and P. Fua (2018) Real-Time Seamless Single Shot 6D Object Pose Prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 292–301. Cited by: §2.2.1.
  • Y. Tsin, Y. Genc, and V. Ramesh (2009) Explicit 3D Modeling for Vehicle Monitoring in Non-overlapping Cameras. In IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 110–115. External Links: Document Cited by: §2.2.3.
  • R. Wang, N. Yang, J. Stueckler, and D. Cremers (2020) DirectShape: Photometric Alignment of Shape Priors for Visual Vehicle Pose and Shape Estimation. In IEEE International Conference on Robotics and Automation (ICRA). Cited by: §2.2.3.
  • Y. Xiang, W. Choi, Y. Lin, and S. Savarese (2017) Subcategory-Aware Convolutional Neural Networks for Object Proposals and Detection. In IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 924–933. External Links: Document Cited by: §5.5, Table 10.
  • Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox (2018) PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes. In Robotics: Science and Systems, Vol. XIV. External Links: Document Cited by: §2.2.1.
  • W. Xiao, B. Vallet, K. Schindler, and N. Paparoditis (2016) Street-Side Vehicle Detection, Classification and Change Detection using mobile Laser Scanning Data. ISPRS Journal of Photogrammetry and Remote Sensing 114, pp. 166–178. External Links: ISSN 09242716, Document Cited by: §2.2.3.
  • L. Yang, P. Luo, C. Change Loy, and X. Tang (2015) A Large-Scale Car Dataset for Fine-Grained Categorization and Verification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3973–3981. Cited by: §4.2, Table 3.
  • R. Zhu, H. Kiani Galoogahi, C. Wang, and S. Lucey (2017) Rethinking Reprojection: Closing the Loop for Pose-Aware Shape Reconstruction From a Single Image. In IEEE International Conference on Computer Vision (ICCV), pp. 57–65. Cited by: §2.2.2.
  • M. Z. Zia, M. Stark, B. Schiele, and K. Schindler (2013) Detailed 3D Representations for Object Recognition and Modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 35 (11), pp. 2608–2623. External Links: ISSN 0098-5589, Document Cited by: item 1, §1, §1, §2.1, §2.1, §2.2.3, §3.2.1, §3.2.2, §3.2, §3.4.6, §4.2, §4.3.
  • M. Z. Zia, M. Stark, and K. Schindler (2015) Towards Scene Understanding with detailed 3D Object Representations. International Journal of Computer Vision (IJCV) 112 (2), pp. 188–203. External Links: ISSN 0920-5691, Document Cited by: §4.2.