1 Introduction
1.1 Motivation and Objective
Cars are among the most frequently seen object categories in everyday scenes. Car detection and viewpoint estimation by a computer vision system have broad applications such as autonomous driving and parking management. Fig. 1 shows a few examples with varying complexities in car detection from four datasets. Car detection and viewpoint estimation are challenging problems due to large structural and appearance variations, especially the ubiquitous occlusions which further increase the intra-class variations significantly. In this paper, we are interested in learning a unified model which can detect cars in the four datasets and estimate car viewpoints. We aim to address two main issues.

The first is to explicitly represent occlusion. Occlusion is a critical aspect in object detection for several reasons: (i) we do not know ahead of time what portion of an object (e.g., a car) will be visible in a test image; (ii) we also do not know the occluded areas in weakly-labeled training data (i.e., only bounding boxes of single cars are given, as considered in this paper); and (iii) object occlusions in testing data could be very different from those in training data. Handling occlusions entails models capable of capturing the underlying regularities of occlusions at the part level (i.e., different occlusion configurations).
The second is to explicitly exploit contextual information co-occurring with occlusions (see examples in Fig. 1 (b), (c) and (d)), which goes beyond single-car detection. We focus on car-to-car contextual patterns (e.g., different multi-car configurations such as 2-car or 3-car layouts), which will be utilized in detection and viewpoint estimation and naturally integrated with occlusion configurations.
To represent both occlusion and context, we propose to learn an And-Or model which accounts for structural and appearance variations at the multi-car, single-car and part levels jointly. Our And-Or model belongs to the family of grammar models [8, 9] embedded in a hierarchical graph structure, which can express a large number of configurations (occlusion configurations and multi-car configurations) in a compositional and reconfigurable manner. Fig. 3 illustrates our And-Or model. By reconfigurable, we mean that we learn appearance templates and deformation models for single cars and parts, and the composed appearance template for a multi-car contextual pattern is inferred on-the-fly in detection according to the selections of its child single-car Or-nodes. So, our model can express a large number of multi-car contextual patterns with different compatible occlusion configurations of single cars. Reconfigurability is one of the most desired properties in hierarchical models; it plays the main role in boosting the performance in our experiments, and also distinguishes the proposed method from other models such as the visual phrase model [10] and different object-pair models [11, 12, 13, 14].
1.2 Method Overview
1.2.1 Data Preparation with Simulation Study
Manually annotating car views, parts and part occlusions in real images is time-consuming and usually error-prone. One innovation in this paper is that we generate a large set of occlusion configurations and multi-car configurations with CAD models (we used 40 CAD models selected from www.doschdesign.com and Google 3D Warehouse) and a publicly available graphics rendering engine, the SketchUp SDK (www.sketchup.com). In the CAD simulation, the occlusion configurations and multi-car contextual patterns reflect variations in four factors: car type, orientation, relative position and camera view. We decompose a car into semantic parts, shown in different colors on the left side of Fig. 2. We then generate a large number of examples by placing 3 cars in a grid (resembling the regularities of cars in parking lots or on the road; see the middle of Fig. 2). For the cars in the center, we compare their part visibilities from different viewpoints (as illustrated by the camera icons), and obtain the part occlusion data matrix (each row represents an example and each entry takes a binary value, 0/1, representing whether a part is occluded or not under a viewpoint). The data matrix is used to learn the occlusion configurations. Similarly, we learn different multi-car contextual patterns based on the geometric configurations (see some examples on the right side of Fig. 2). Note that the semantic part annotations in the synthetic examples are used to learn the structure of our And-Or model, and the parts are treated as latent variables in weakly-annotated training data of real images. We do not evaluate the performance of part localization; instead we evaluate the viewpoint estimation based on the inferred part configurations.
In the simulation, we place 3 cars in a grid with three considerations: (i) it can generate different occlusion configurations for the car in the center under different camera viewpoints, as well as different multi-car contextual patterns (2-car or 3-car patterns), which is easier than using 2 cars when processing the data in simulation; (ii) it can generate a synthetic dataset in which the occlusion configurations and multi-car contextual patterns are generic enough to cover the four situations in Fig. 1; and (iii) it can also reduce the gap between the synthetic data and real data when learning the initial appearance parameters for parts, by placing a car in the background instead of using a white background (see more details in Sec. 5).
1.2.2 The And-Or Model
There are three types of nodes in the And-Or model: an And-node represents decomposition (e.g., a car is composed of a small number of parts), an Or-node represents alternative ways of decomposition accounting for structural variations (e.g., different part configurations of a single car due to occlusions), and a Terminal-node captures appearance variations to ground a car or a part to image data.
Fig. 3 illustrates the learned And-Or model. The hierarchy consists of a layer of multi-car contextual patterns (top) and several layers of occlusion configurations of single cars (bottom). The overall structure is as follows:
i) The root Or-node represents different multi-car configurations which capture both viewpoints and car-to-car contextual patterns. Each multi-car contextual pattern is then represented by an And-node (e.g., the car pairs and car triples shown in the figure). The contextual information reflects the layout regularities of a small number (e.g., 2 or 3) of cars in real situations (such as cars in a parking lot).
ii) A multi-car And-node is decomposed into nodes representing single cars. Each single car is represented by an Or-node (e.g., the first car and the second car in a pair), since we have different combinations of car types, viewpoints and occlusion configurations. Here, a multi-car And-node embeds the reconfigurable compositional grammar of a multi-car configuration (e.g., the three 2-car configurations in the right-top of Fig. 2) in which the single cars are reconfigurable w.r.t. viewpoint, occlusion configuration (to some extent), and car type. This reconfigurability gives our model the expressive power to handle the large variations of multi-car configurations in real situations.
iii) Each occlusion configuration is represented by an And-node which is further decomposed into parts. Parts are learned using CAD simulation (i.e., the semantic parts) and are organized into consistently visible parts and optional part clusters (see the example in the right-bottom of Fig. 3). A single car can then be represented by the consistently visible parts (the And relation) and one of the optional part clusters (the Or relation). The green dashed bounding boxes show some examples corresponding to different occlusion configurations (i.e., visible parts) from the same viewpoint.
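To make the three node types concrete, the following is a minimal sketch in Python (class and field names are ours; the actual model additionally carries appearance templates, deformation parameters, biases and positions):

class Node:
    """Base node of the And-Or graph (a DAG: nodes may be shared)."""
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

class TerminalNode(Node):
    """Grounds a car or a part to image data (appearance template response)."""
    def __init__(self, name, appearance_score):
        super().__init__(name)
        self.appearance_score = appearance_score

    def score(self):
        return self.appearance_score

class AndNode(Node):
    """Decomposition: child scores are summed (the bias term is omitted here)."""
    def score(self):
        return sum(child.score() for child in self.children)

class OrNode(Node):
    """Alternation: the best child is selected; the argmax branches form the parse tree."""
    def score(self):
        return max(child.score() for child in self.children)

    def best_child(self):
        return max(self.children, key=lambda c: c.score())

# A toy single-car Or-node with two occlusion configurations:
full_view = AndNode("full-view", [TerminalNode("body", 1.2), TerminalNode("wheels", 0.8)])
occluded = AndNode("occluded", [TerminalNode("body", 1.2)])
car = OrNode("car", [full_view, occluded])
print(car.score(), car.best_child().name)   # 2.0 full-view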
1.2.3 Weakly-supervised Learning of the And-Or Model
Using weakly-annotated real-image training data and the synthetic data, we learn the And-Or model in two stages:
i) Learning the structure of the hierarchical And-Or model. Both the multi-car contextual patterns and the occlusion configurations of single cars are learned automatically from the annotated single-car bounding boxes in the training data, together with the synthetic examples generated by CAD simulation. The multi-car contextual patterns are mined by clustering geometric layout features. The occlusion configurations are learned by a clustering method applied to the part visibility data matrix. The learned structure is a directed and acyclic graph, since we have both single-car sharing and part sharing, so Dynamic Programming (DP) can be applied in inference.

ii) Learning the parameters of the hierarchical And-Or model. Given the learned structure, the parameters (for appearance, deformation and bias) are learned under the weak-label structural SVM (WL-SSVM) framework, with the parse trees of positive samples treated as latent variables (see Sec. 5).
1.2.4 Experiments
In experiments, we evaluate the detection performance of our model on four car datasets: the KITTI dataset [1], the PASCAL VOC2007 car dataset [2] and two self-collected datasets, the Street Parking dataset [6] and the Parking Lot dataset [7] (which are released with this paper). Our model outperforms different state-of-the-art variants of the DPM [17] (including the latest implementation [18]) on all four datasets, as well as other state-of-the-art models [19, 20, 14, 6] on the KITTI and Street Parking datasets. We evaluate viewpoint estimation performance on three car datasets: the PASCAL VOC2006 car dataset [2], the 3D car dataset [3], and the PASCAL3D+ car dataset [4]. Our model achieves performance comparable to the state-of-the-art methods (significantly better than the method using deep learning features [21]). The detection code and data are available on the author's homepage (http://www.stat.ucla.edu/~tfwu/projects.htm).

Paper Organization. The remainder of this paper is organized as follows. Section 2 overviews the related work and summarizes our contributions. Section 3 presents the And-Or model and defines its scoring functions. Section 4 presents the method of mining multi-car contextual patterns and occlusion configurations of single cars from weakly-labeled training data. Section 5 discusses the learning of model parameters using WL-SSVM, as well as details of the DP inference algorithm. Section 6 presents the experimental results and comparisons of the proposed model on the four car detection datasets and the three viewpoint estimation datasets. Section 7 concludes the paper with discussions.
2 Related Work and Our Contributions
Over the last decade, object detection has made much progress in various vision tasks such as face detection [22], pedestrian detection [23], and generic object detection [2, 17, 24]. In this section we focus on occlusion and context modeling in object detection, and classify the recent literature into three research streams. For a full review of contemporary approaches, we refer the reader to recent survey articles [25, 26, 27].

i) Single Object Modeling and Occlusion Modeling. Hierarchical models are widely used in the recent object detection literature, and most existing approaches are devoted to learning a single object model. Many works extended the deformable part-based model (DPM) [17] (which has a two-layer structure) by exploring deeper hierarchies and global part configurations [24, 28, 15], using strong manually-annotated parts [29] or CAD models [30], or keeping humans in the loop [31]. To address the occlusion problem, various occlusion models estimate the visibility of parts from image appearance, under the assumption that the visibility of a part is (a) independent of other parts [32, 33, 34, 35, 36], (b) consistent with neighboring parts [37, 15], or (c) consistent with its parent or child parts describing object appearance at different scales [38]. Another essential problem is how to organize part configurations. Recently, [15, 34, 6] explored different ways to address this problem. In particular, [34] modeled different part configurations by local part mixtures; [15] used a more flexible grammar model to infer both the occluder and the visible parts of an occluded person; and [6] organized parts into consistently visible parts and optional part clusters, which is a more efficient way to represent occlusion configurations. Recent work [39, 40, 41, 42, 43] proposed to enumerate possible occlusion configurations and model each occlusion configuration as a specific component. [44] proposed a 2D model to learn discriminative subcategories, and [45] further integrated it with an explicit 3D occlusion model, both showing excellent performance on the KITTI dataset. Though those models were successful in some heavily occluded cases, they did not represent contextual information, and usually learned a separate context model using the detection scores as input features. Recently, an And-Or quantization method was proposed to learn And-Or tree models [24, 46] for generic object detection on PASCAL VOC [2] and to learn 3D And-Or models [47], which could be useful in occlusion modeling.
ii) Object-Pair and Visual Phrase Models. To account for strong co-occurrence, object-pair [11, 12, 13, 14] and visual phrase [10] methods modeled occlusions and interactions using an X-to-X or X-to-Y composite template that spans both one object (i.e., "X", such as a person or a car) and another interacting object (i.e., "X" or "Y", such as the other car in a car pair in a parking lot, or a bicycle on which a person is riding). Although these models can handle occlusion better than single object models, the object-pair and visual phrase methods modeled occlusion implicitly, and they were often manually designed with fixed structures (i.e., not reconfigurable in inference). They performed worse than the original DPM on the KITTI dataset, as evaluated by [14].
iii) Context Models. Many context models have been exploited in object detection with improved performance [48, 49, 50, 51, 52]. Hoiem et al. [50] explored scene context; Desai et al. [49] improved object detectors by incorporating multi-class context on the PASCAL dataset [2] in a max-margin framework. In [51], Tu and Bai integrated the detector responses with background pixels to determine the foreground pixels. In [52], Chen et al. proposed a multi-order context representation to take advantage of the co-occurrence of different objects. Recently, [53] explored geographic contextual information to facilitate car detection, and [54] explored 3D panoramic context in object detection. Although these works verified that context is crucial in object detection, most of them modeled objects and context separately, not in a unified framework.
This paper extends our two previous conference papers [7, 6] in the following aspects: (i) a unified representation is learned for integrating occlusion and context; (ii) more details on the learning algorithm and the detection algorithm are presented; and (iii) more analyses and comparisons on the experimental results are added, with improved performance.
This paper makes three contributions to the literature of car detection.
i) It proposes an And-Or model to represent multi-car context and occlusion configurations. The proposed model is multi-scale and reconfigurable to account for large structural, viewpoint and occlusion variations.
ii) It presents a simple yet effective approach to mine context and occlusion configurations from weakly-labeled training data.
iii) It introduces two datasets for evaluating occlusion and multi-car context, and obtains performance comparable to or better than state-of-the-art car detection methods on four challenging datasets.
3 Representation and Inference
3.1 The And-Or Model and Scoring Functions
In this section, we introduce the notation used to define the And-Or model and its scoring functions.
An And-Or model is defined by a 3-tuple $\mathcal{G} = (V, E, \Theta)$, where $V$ represents the nodes in three subsets: And-nodes $V^{\text{And}}$, Or-nodes $V^{\text{Or}}$ and Terminal-nodes $V^{\text{T}}$; $E$ is the set of edges organizing all the nodes into a directed and acyclic graph (DAG); and $\Theta$ is the set of parameters (for appearance, deformation and bias respectively, to be defined later).
A parse tree is an instantiation of the And-Or model obtained by selecting the best child (according to the scoring functions defined below) for each encountered Or-node. The green arrows in Fig. 3 show an example of a parse tree.
Appearance Features. We adopt the Histogram of Oriented Gradients (HOG) features [55, 17] to describe appearance. Let $I$ be an image defined on an image lattice. Denote by $H$ the HOG feature pyramid computed for $I$ using $\lambda$ levels per octave, and by $\mathcal{P}$ the lattice of the whole pyramid. Let $p = (l, x, y) \in \mathcal{P}$ specify a position $(x, y)$ in the $l$-th level of the pyramid. Denote by $\Phi^{\text{app}}(H, p)$ the HOG features extracted for a Terminal-node placed at position $p$ in the pyramid.
Deformation Features. We allow local deformation when composing child nodes into a parent node. In our model, parts are placed at twice the spatial resolution w.r.t. single cars, while single cars and composite multi-cars are at the same spatial resolution. We penalize the displacement between the anchor location of a child node (w.r.t. the placed parent node) and its actual deformed location. Denote by $\delta = (dx, dy)$ the displacement. The deformation feature is defined by $\Phi^{\text{def}}(\delta) = (dx^2, dx, dy^2, dy)$, as in the DPM [17].
A Terminal-node $t$ grounds a single car or a part to image data (see Layers 3 and 4 in Fig. 3). Given a parent node $A$, the model for $t$ is defined by a 4-tuple $(F_t, s_t, a_t, d_t)$, where $F_t$ is the appearance template, $s_t$ the scale factor for placing node $t$ w.r.t. its parent node, $a_t$ a two-dimensional vector specifying an anchor position relative to the position of the parent node, and $d_t$ the deformation parameters. Given the position $p_A = (l_A, x_A, y_A)$ of the parent node, the scoring function of a Terminal-node is defined by

score$(t \,|\, p_A) = \max_{\delta \in \Delta} \left[ \langle F_t, \Phi^{\text{app}}(H, p_t) \rangle - \langle d_t, \Phi^{\text{def}}(\delta) \rangle \right]$,   (1)

where $\Delta$ is the space of deformation (i.e., the lattice of the corresponding level in the feature pyramid), $p_t = (l_A - s_t \lambda,\; 2^{s_t}(x_A, y_A) + a_t + \delta)$ with $s_t \in \{0, 1\}$, where $s_t = 0$ means the object and parts are placed at the same resolution and $s_t = 1$ means parts are placed at twice the resolution of the object template, and $\langle \cdot, \cdot \rangle$ denotes the inner product. Fig. 3 shows some learned appearance templates.
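As an illustration of Eqn. (1), the sketch below scores a Terminal-node by brute-force search over displacements within a local window (function and parameter names are ours; the actual implementation computes the same maximum over all displacements efficiently with the generalized distance transform, as in the DPM [17]):

import numpy as np

def deformation_feature(dx, dy):
    # Standard quadratic deformation feature (dx^2, dx, dy^2, dy), as in the DPM.
    return np.array([dx * dx, dx, dy * dy, dy])

def terminal_score(score_map, anchor, defo_params, radius=4):
    """Max over displacements of (appearance score - deformation penalty).

    score_map  : 2D array of filter responses <F_t, phi_app(H, p)> at one
                 pyramid level (precomputed by convolving the HOG pyramid
                 with the appearance template F_t).
    anchor     : (x, y) anchor position of the Terminal-node w.r.t. its parent.
    defo_params: 4-vector d_t penalizing (dx^2, dx, dy^2, dy).
    """
    ax, ay = anchor
    best, best_pos = -np.inf, anchor
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x, y = ax + dx, ay + dy
            if 0 <= x < score_map.shape[1] and 0 <= y < score_map.shape[0]:
                s = score_map[y, x] - defo_params @ deformation_feature(dx, dy)
                if s > best:
                    best, best_pos = s, (x, y)   # argmax gives the deformed position
    return best, best_pos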
An And-node $A$ represents the decomposition of a large entity (e.g., a multi-car layout at Layer 1 or a single car at Layer 3 in Fig. 3) into its constituents (e.g., the single cars, or a small number of parts). Single-car And-nodes are associated with viewpoints. Unlike the Terminal-nodes, single-car And-nodes are not allowed to deform within a multi-car configuration in this paper (we implemented deformation in experiments and did not observe a performance improvement, so for simplicity we keep them rigid). Denote by $ch(v)$ the set of child nodes of a node $v$. The position of an And-node is inherited from its parent Or-node, and the scoring function is defined by

score$(A \,|\, p_A) = \sum_{v \in ch(A)} \text{score}(v \,|\, p_A) + b_A$,   (2)

where $b_A$ is the bias term. Each single-car And-node (at Layer 3) can be treated as a DPM [17] or the And-Or structure proposed in [6]; so our model is flexible enough to integrate state-of-the-art single object models. For multi-car And-nodes (at Layer 1), the child nodes are Or-nodes and their scoring function is defined below.
An Or-node represents different structural variations (e.g., the root node and the $i$-th car node at Layer 2 in Fig. 3). For the root Or-node $O$, when placed at position $p$, the scoring function is defined by

score$(O \,|\, p) = \max_{A \in ch(O)} \text{score}(A \,|\, p)$,   (3)

where $ch(O)$ are the multi-car And-nodes. For the $i$-th car Or-node $O_i$, given a parent multi-car And-node $A$ placed at $p_A = (l_A, x_A, y_A)$, the scoring function is defined by

score$(O_i \,|\, p_A) = \max_{v \in ch(O_i)} \text{score}(v \,|\, p_{O_i})$,   (4)

where $p_{O_i} = (l_A, (x_A, y_A) + a_{O_i})$, with $a_{O_i}$ the anchor position of the $i$-th car relative to the multi-car And-node. The best child of an Or-node is computed by taking the $\arg\max$ in Eqn. (3) and Eqn. (4).
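In terms of score maps over positions, Eqns. (2)-(4) amount to element-wise sums and maxima. A minimal sketch (assuming the child score maps have already been shifted into the parent's coordinate frame by their anchors):

import numpy as np

def and_score_map(child_maps, bias):
    # Eqn (2): an And-node sums its (aligned) children's score maps and adds a bias.
    return np.sum(np.stack(child_maps, axis=0), axis=0) + bias

def or_score_map(child_maps):
    # Eqns (3)-(4): an Or-node takes an element-wise max over its children;
    # the argmax map records the winning branch at each position.
    stacked = np.stack(child_maps, axis=0)
    return stacked.max(axis=0), stacked.argmax(axis=0)

# Example with two 2x2 child maps:
a = np.array([[1.0, 0.0], [0.0, 2.0]])
b = np.array([[0.0, 3.0], [1.0, 0.0]])
print(or_score_map([a, b])[0])   # element-wise max of the two maps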
3.2 The DP Algorithm in Detection
In detection, we place the And-Or model at all positions of the feature pyramid and retrieve the optimal parse trees at all positions where the score is greater than the detection threshold. Thanks to the directed and acyclic structure of our And-Or model, we can use the efficient DP algorithm, which consists of two passes:
In the bottom-up pass, following the depth-first-search (DFS) order of the nodes in the And-Or model, we compute the matching scores of all possible parse trees of the And-Or model at all possible positions in the whole feature pyramid.
First of all, we compute the appearance score maps (pyramids) for all Terminal-nodes (by filter convolution). The optimal position of a Terminal-node w.r.t. a parent node can then be computed as a function of the position of the parent node. The quality (matching score) of the optimal position for a Terminal-node w.r.t. a given position of the parent is computed using Eqn. (1) (which yields the deformed score map; we use the generalized distance transform trick as in the DPM [17] for efficiency), and the optimal position itself can be retrieved by replacing the $\max$ in Eqn. (1) with $\arg\max$.
Then, following the DFS order of the nodes, we compute the score maps for all the And-nodes and Or-nodes using Eqn. (2), (3) and (4), the score maps of their child nodes having been computed already. Similarly, we obtain the optimal branch for each Or-node by replacing the $\max$ in Eqn. (3) and (4) with $\arg\max$.
In the top-down pass, we first find all detection candidates for the root Or-node based on its score maps, i.e., the positions $\{p : \text{score}(O \,|\, p) \geq \tau\}$ for a detection threshold $\tau$. Then, following the breadth-first-search (BFS) order of the nodes, we retrieve the optimal parse tree at each such position: starting from the root Or-node, we select the optimal branch of each encountered Or-node, keep all the child nodes of each encountered And-node, and retrieve the optimal position of each Terminal-node. Based on the parsed subtree rooted at the single-car And-nodes, we obtain the viewpoint estimation and the occlusion configuration.
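A sketch of the top-down pass, continuing the node classes from the sketch in Sec. 1.2.2 and assuming the bottom-up pass has filled a dictionary score_maps mapping each node to its 2D score map (pyramid levels are collapsed into one map here for brevity):

import numpy as np

def detect(root_or, score_maps, threshold):
    """Find root detection candidates and read out a parse tree at each."""
    ys, xs = np.nonzero(score_maps[root_or] >= threshold)
    return [((x, y), parse(root_or, (x, y), score_maps)) for x, y in zip(xs, ys)]

def parse(node, pos, score_maps):
    x, y = pos
    if isinstance(node, TerminalNode):
        return (node.name, pos)          # deformed optimal position retrieved here
    if isinstance(node, OrNode):
        # select the optimal branch (the argmax of Eqns (3)-(4))
        best = max(node.children, key=lambda c: score_maps[c][y, x])
        return (node.name, parse(best, pos, score_maps))
    # And-node: keep all children
    return (node.name, [parse(c, pos, score_maps) for c in node.children])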
Postprocessing. To generate the final single-car detection results for evaluation, we apply multi-car guided non-maximum suppression (NMS) to deal with occlusions:
i) Some of the single cars in a multi-car detection candidate are highly overlapped due to occlusion, so directly applying conventional NMS would miss the detection of the occluded cars. We therefore enforce that the single-car bounding boxes within one multi-car prediction do not suppress each other. A similar idea is used in [12].
ii) Overlapping multi-car detection candidates might report multiple predictions for the same single car. For example, if a car is shared by two overlapping multi-car detection candidates, it will be reported twice; we keep only the one with the higher score. A sketch of this procedure is given below.
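A minimal sketch of the multi-car guided NMS (box format (x1, y1, x2, y2); the 0.5 IoU threshold is illustrative):

def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def multicar_guided_nms(candidates, iou_thr=0.5):
    """candidates: list of (score, [car_boxes]) multi-car predictions.
    Returns the kept single-car detections as (score, box) pairs."""
    kept = []
    for score, boxes in sorted(candidates, key=lambda c: -c[0]):
        fresh = []
        for box in boxes:
            # Rule ii: a car already reported by a higher-scoring candidate
            # is kept only once.  Rule i holds because cars of the current
            # candidate are compared against `kept`, never against each other.
            if all(iou(box, kb) < iou_thr for _, kb in kept):
                fresh.append((score, box))
        kept.extend(fresh)
    return kept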
4 Learning And-Or Structures
In this section, we present the methods for learning the structure of the And-Or model by mining contextual patterns and occlusion configurations from the positive training data.
4.1 Generating Multi-car Training Samples
Positive Samples. Denote by $D^+ = \{(I_i, B_i)\}$ the positive training dataset, with $B_i$ being the set of annotated single-car bounding boxes in image $I_i$. Each bounding box is written $b = (x, y, w, h)$, where $(x, y)$ is the left-top corner and $(w, h)$ the width and height.

Denote the set of $n$-car positive samples by

$D_n^+ = \{ C : C \text{ is a set of } n \text{ annotated single-car boxes from one image} \}$,   (5)

where all the contributing images have at least $n$ annotated single cars. We have:

i) $D_1^+$ consists of all the single-car bounding boxes which do not overlap any other box in the same image. For $n \geq 2$, $D_n^+$ is generated iteratively, as sketched after this list.

ii) In generating $D_2^+$ (see Fig. 4 (a)), for each positive image with at least 2 annotated cars, we enumerate all valid 2-car configurations starting from $D_1^+$: we first select a current box as the first car, obtain all the surrounding car bounding boxes which overlap it, and then select as the second car the one with the largest overlap.

iii) In generating $D_n^+$ ($n \geq 3$, see Fig. 4 (b)), for each positive image with at least $n$ annotated cars, we first select a current $(n-1)$-car sample as the seed, obtain the neighboring boxes each of which overlaps at least one bounding box in the seed, and then select the bounding box with the largest overlap and add it to the sample.
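A sketch of steps ii) and iii), reusing the iou helper from the NMS sketch in Sec. 3.2 as the overlap measure (any additional dataset-dependent validity checks on the overlap are omitted):

def generate_ncar_samples(boxes, n):
    """Greedily grow (n-1)-car samples into n-car samples within one image:
    each seed sample is extended with the neighboring box that has the
    largest overlap with any of its members."""
    samples = [[i] for i in range(len(boxes))]          # 1-car seeds
    for _ in range(n - 1):
        grown = []
        for seed in samples:
            neighbors = [(max(iou(boxes[j], boxes[i]) for i in seed), j)
                         for j in range(len(boxes)) if j not in seed]
            best_overlap, j = max(neighbors, default=(0.0, -1))
            if best_overlap > 0:                        # must touch the seed
                grown.append(seed + [j])
        samples = grown
    return samples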
Negative Samples. We collect negative samples from images without cars provided in the benchmark datasets, and apply the hard negative mining approach during parameter learning, as done in the DPM [17].
4.2 Mining Multi-car Contextual Patterns
This section presents the method for learning the multi-car contextual patterns at Layer 1 in Fig. 3. For an $n$-car sample $(b_1, \dots, b_n)$, we use the relative positions of the single cars to describe its layout. Denote by $c_i$ the center of car bounding box $b_i$ ($i = 1, \dots, n$). Let $W$ and $H$ be the width and height of the union bounding box of the sample, respectively. With the center of the first car as the centroid, we define the layout feature by

$\phi_{\text{layout}} = \left( \frac{c_2 - c_1}{(W, H)}, \dots, \frac{c_n - c_1}{(W, H)} \right)$,   (6)

where the division is performed coordinate-wise. We cluster these layout features over $D_n^+$ into $K$ clusters using k-means. The obtained clusters specify the And-nodes at Layer 1 in Fig. 3. The number of clusters $K$ is chosen empirically for each training dataset in our experiments.
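A sketch of the layout feature of Eqn. (6) and the clustering step (using numpy and scikit-learn's KMeans; K is a dataset-dependent choice):

import numpy as np
from sklearn.cluster import KMeans

def layout_feature(boxes):
    """Eqn (6): relative car centers normalized by the union box size.
    boxes: (n, 4) array of (x1, y1, x2, y2) for one n-car sample."""
    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2.0,
                        (boxes[:, 1] + boxes[:, 3]) / 2.0], axis=1)
    size = np.array([boxes[:, 2].max() - boxes[:, 0].min(),    # union width W
                     boxes[:, 3].max() - boxes[:, 1].min()])   # union height H
    return ((centers[1:] - centers[0]) / size).ravel()         # 2(n-1)-dim feature

def mine_context_patterns(samples, K):
    """Cluster the layout features of all n-car samples into K context patterns."""
    feats = np.stack([layout_feature(s) for s in samples])
    return KMeans(n_clusters=K, n_init=10).fit(feats)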
In Fig. 5 (top), we visualize the clustering results on the KITTI [1] and the Parking Lot datasets. Each set of colored points represents a car-to-car context pattern. In the KITTI dataset, we can observe some car-to-car "peak" modes (similar to the analyses in [14]), while the context patterns are more diverse in the Parking Lot dataset.
4.3 Mining Occlusion Configurations
In this section we present the method for learning the occlusion configurations of single cars at Layers 3 and 4 in Fig. 3. We learn the occlusion configurations automatically from a large number of occlusion configurations generated by CAD simulation. Note that the synthetic data are used to learn the occlusion configurations only, while the appearance and geometry parameters are still learned from real data.
4.3.1 Generating Occlusion Configurations
As mentioned in Sec. 1.2.1, we put 3 cars on a grid when generating occlusion configurations. Specifically, we choose the center and 2 other randomly selected positions on the grid, and put cars around these grid points to simulate occlusions. See some examples in Fig. 2.
The occlusion configurations reflect four factors: car type, orientation, relative position and camera view. To generate an occlusion configuration, we randomly assign values to these factors: for each car, we sample a type from the CAD models and an orientation, and place it at its nominal position on the grid plus a small relative offset (along the x-axis and y-axis) between the sampled and nominal positions. The camera view ranges over the azimuth angle, and we discretize the view space into view bins uniformly along the azimuth. In the synthesized configurations, a part is treated as occluded if a sufficiently large fraction of its area is not visible.
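A sketch of the random factor assignment (the grid size, jitter range and number of view bins are illustrative values; the renderer then computes the part visibilities):

import random

CAR_TYPES = ["sedan", "suv", "hatchback"]   # hypothetical subset of the CAD models

def sample_configuration(grid=3, n_cars=3, n_views=8):
    """Randomly assign car type, orientation, position and camera view for
    one synthetic configuration: the center cell plus 2 random cells."""
    center = (grid // 2, grid // 2)
    others = [(r, c) for r in range(grid) for c in range(grid) if (r, c) != center]
    cells = [center] + random.sample(others, n_cars - 1)
    cars = [{
        "type": random.choice(CAR_TYPES),                  # car type
        "orientation": random.uniform(0.0, 360.0),         # orientation (degrees)
        "position": (r + random.uniform(-0.2, 0.2),        # nominal cell + jitter
                     c + random.uniform(-0.2, 0.2)),
    } for r, c in cells]
    view_bin = random.randrange(n_views)                   # azimuth view bin
    return cars, view_bin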
4.3.2 Constructing the Initial And-Or Model of Single Cars
With the part-level visibility information, we compute two vectors for each occlusion configuration: the first is a binary-valued vector (of dimension: number of parts times number of camera views) recording the visibility of each part; the second is a real-valued vector (of dimension proportional to the number of the root plus the parts, times the number of camera views) recording the bounding boxes of the root and parts. In both vectors, the entries corresponding to invisible parts are set to 0.

Denoting by $d$ the dimension of the visibility vector and stacking the vectors of $N$ occlusion configurations, we obtain an occlusion data matrix $X \in \{0, 1\}^{N \times d}$; the first few rows of this matrix are shown on the right side of Fig. 6. Note that since we have partitioned the view space into discrete views, in each row the visible parts always concentrate in the segment of the vector representing that view.

In learning an initial And-Or model, each row of $X$ corresponds to a small subtree of the root Or-node. In particular, each subtree consists of an And-node as its root and a set of Terminal-nodes as its children. An example of the data matrix and the corresponding initial And-Or model is shown in the middle of Fig. 6.
4.3.3 Refining the And-Or Structure
The initial And-Or model is large and redundant, since it has many duplicated occlusion configurations (i.e., duplicated rows in $X$) and a combinatorial number of part compositions. In the following, we pursue a compact And-Or structure. The problem can be formulated as

$\min_{\text{AOG}} \sum_{i=1}^{N} \| x_i - f(x_i; \text{AOG}) \|^2 + \lambda \cdot |\text{AOG}|$,   (7)

where $x_i$ is the $i$-th row of the data matrix $X$, $f(x_i; \text{AOG})$ returns the closest occlusion configuration that the And-Or graph (AOG) can generate, $|\text{AOG}|$ is the number of nodes and edges in the structure, and $\lambda$ is the trade-off parameter balancing model precision and complexity. In each view, we assume the number of occlusion branches is bounded by a small constant.

We solve Eqn. (7) using a modified graph compression algorithm similar to [56]. As illustrated on the right side of Fig. 6, the algorithm starts from the initial And-Or model and iteratively combines branches if the introduced loss is smaller than the decrease in the complexity term $\lambda \cdot |\text{AOG}|$. This process is equivalent to iteratively finding large blocks of 1s in the data matrix through row and column permutations; an example is shown at the bottom of Fig. 6. As there are consistently visible parts in each view, the algorithm quickly converges to the structure shown in Fig. 3.
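A greedy sketch of this compression step under simplifying assumptions of ours: Hamming distance as the merge loss, and the element-wise AND as the merged configuration (i.e., the parts consistently visible in both branches); the thresholds are illustrative:

import numpy as np

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def compress_occlusion_branches(X, max_branches, max_merge_loss=2):
    """Collapse duplicated rows of the binary visibility matrix X, then
    iteratively merge the two closest remaining configurations until at
    most max_branches remain or the merge loss gets too large."""
    rows = sorted({tuple(int(v) for v in r) for r in X})   # unique configurations
    while len(rows) > max_branches:
        n = len(rows)
        i, j = min(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda ij: hamming(rows[ij[0]], rows[ij[1]]))
        if hamming(rows[i], rows[j]) > max_merge_loss:
            break
        merged = tuple(a & b for a, b in zip(rows[i], rows[j]))  # consistently visible parts
        rows = [r for k, r in enumerate(rows) if k not in (i, j)] + [merged]
    return np.array(rows)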
With the refined And-Or model, we obtain the occlusion configurations (i.e., the consistently visible parts and the optional occluded part clusters) in each view. In addition, the bounding box size and nominal position of each Terminal-node w.r.t. its parent And-node can be estimated by the geometric means of the corresponding values in the real-valued vectors. This information is used to initialize the latent variables of our model when learning the parameters.

Variants of And-Or Models. We test our model using two specifications, to be consistent with our two previous conference papers: one is called the And-Or Structure [6], for occlusion modeling based on CAD simulation without the multi-car context components; the other is called the Hierarchical And-Or Model [7], for occlusion and context. We also compare two methods of part selection in the hierarchical And-Or model: one is based on greedy part selection as done in the DPM [17], denoted AOG+Greedy; the other is based on the proposed CAD simulation, denoted AOG+CAD.
5 Learning Parameters
With the learned And-Or structure, we adopt the WL-SSVM method [15] to learn the parameters (for appearance, deformation and bias). When the occlusion configurations are mined by CAD simulation (i.e., for the two model specifications And-Or Structure and AOG+CAD), we use both Step 0 and Step 1 below in learning the parameters; otherwise (i.e., for AOG+Greedy) we use Step 1 only.
Step 0: Initializing Parameters with Synthetic Training Data. We learn the initial parameters from synthetic training data (see Fig. 10). We randomly superimpose the synthetic positive samples on randomly selected real images without cars (instead of using a white background directly; see Fig. 10) to reduce the appearance gap between the synthetic samples and real car samples. In the synthetic data, the parse tree of each multi-car positive sample is known, except that the positions of parts are allowed to deform.
Step 1: Learning Parameters with Real Training Data. In the real training data, we only have annotated bounding boxes for single cars. The parse tree of each multi-car positive sample is hidden, except for the multi-car configuration, which can be computed from the annotated bounding boxes of single cars as stated in Sec. 4.2. We initialize the parse tree of each positive sample either from the initial parameters learned in Step 0 (for the And-Or Structure and AOG+CAD) or, for AOG+Greedy, by using a similar idea as in learning the mixture of DPMs [17] to initialize the single-car And-nodes. After initialization, the parameters are learned iteratively under the WL-SSVM framework. During learning, we run the DP inference to assign the optimal parse trees to the multi-car positive samples.
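The alternation can be sketched as follows (a CCCP-style loop; infer_parse, mine_hard_negatives and solve_convex_qp are placeholders for the DP inference and the WL-SSVM solver of [15, 18], passed in as callables):

def train_wlssvm(params, positives, negatives,
                 infer_parse, mine_hard_negatives, solve_convex_qp,
                 n_iters=10):
    """Latent training loop for Step 1: alternate between (a) assigning
    the best parse tree to each positive sample by DP inference (the
    concave step of CCCP) and (b) convex parameter estimation with
    hard-negative mining (the convex step)."""
    for _ in range(n_iters):
        # (a) latent step: fix the parameters, re-label the positives
        parses = [infer_parse(params, image, boxes) for image, boxes in positives]
        # (b) parameter step: fix the parses, update the parameters
        hard_negatives = mine_hard_negatives(params, negatives)
        params = solve_convex_qp(parses, hard_negatives, params)
    return params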
The objective function to be minimized is defined by

$E(\Theta) = \frac{1}{2} \|\Theta\|^2 + C \sum_{i=1}^{N} L(\Theta, x_i, y_i)$,   (8)

where $(x_i, y_i)$ represents a training sample ($i = 1, \dots, N$), with $x_i$ the image data and $y_i$ the bounding box(es). $L(\Theta, x, y)$ is the surrogate loss function,

$L(\Theta, x, y) = \max_{pt \in \Omega} \left[ \text{score}(pt) + L_{\text{margin}}(y, \text{box}(pt)) \right] - \max_{pt \in \Omega} \left[ \text{score}(pt) - L_{\text{output}}(y, \text{box}(pt)) \right]$,   (9)

where $\Omega$ is the space of all parse trees derived from the And-Or model, $\text{score}(pt)$ computes the score of a parse tree as stated in Sec. 3, and $\text{box}(pt)$ is the bounding box(es) predicted from the parse tree. As pointed out in [15], the loss $L_{\text{margin}}$ encourages high-loss outputs to "pop out" of the first term on the right-hand side, so that their scores get pushed down; the loss $L_{\text{output}}$ suppresses high-loss outputs in the second term, so that the score of a low-loss prediction gets pulled up. More details are given in [15, 16]. In general, since the surrogate in Eqn. (9) is not convex, the objective of Eqn. (8) leads to a non-convex optimization problem. The WL-SSVM adopts the CCCP procedure [57] in optimization, which finds a local optimum of the objective. The underlying loss function is defined by

$l(y, \hat{y}) = \begin{cases} 0, & y = \hat{y} = \perp, \\ 0, & y \neq \perp, \hat{y} \neq \perp \text{ and } ov(y, \hat{y}) \geq 0.5, \\ 1, & \text{otherwise}, \end{cases}$   (10)

where $\perp$ represents the background output and $ov(\cdot, \cdot)$ is the intersection-over-union ratio of two bounding boxes; the 0.5 overlap threshold follows the PASCAL VOC protocol. In practice, we modify the implementation in [18] for our loss formulation.
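A sketch of the loss of Eqn. (10), reusing the iou helper from the NMS sketch in Sec. 3.2 (None stands for the background output):

def structural_loss(y_true, y_pred, overlap_thr=0.5):
    """Eqn (10): zero loss for matching background outputs or a prediction
    that overlaps the ground-truth box by at least the threshold; one otherwise."""
    if y_true is None and y_pred is None:
        return 0.0
    if y_true is None or y_pred is None:
        return 1.0
    return 0.0 if iou(y_true, y_pred) >= overlap_thr else 1.0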
6 Experiments
In this section, we evaluate our models on four car detection datasets and three car viewpoint estimation datasets, and present detailed analyses of different aspects of our models. We first introduce the two self-collected car datasets of street-parking cars and parking-lot cars (Sec. 6.1), and then evaluate the detection performance of our models on four datasets (Sec. 6.2): the two self-collected datasets, the KITTI car dataset [1] and the PASCAL VOC2007 car dataset [2]. We further analyze the performance w.r.t. different aspects of our models (Sec. 6.3). The performance of car viewpoint estimation is presented in Sec. 6.4.
Training and Testing Time. In all experiments, we use parallel computing to train our models. It takes about 9 hours to train an And-Or Structure model and 16 hours to train a hierarchical And-Or model, due to inferring the assignments of the part latent variables on positive training examples and mining hard negatives. For detection, it takes about 2 and 3 seconds to process an image for an And-Or Structure and a hierarchical And-Or model, respectively.
6.1 Datasets
To test our model on occlusion and context modeling, we collected two car datasets (available at http://www.stat.ucla.edu/~boli/publication/streetparkingrelease.zip and parking_lot_release.zip).
The Street Parking Car Dataset. There are several datasets featuring a large number of car images [58, 3, 59, 2], but they are not suitable for evaluating occlusion handling, as the proportion of (moderately or heavily) occluded cars is marginal. The recently proposed KITTI dataset [1] contains occluded cars parked along streets, but it cannot fully evaluate the ability of our model since the car views are rather fixed, as the video sequences are captured from a car driving on the road (e.g., there is no bird's-eye view). In addition, the average number of cars per image is still not large (see the statistics at the bottom of Fig. 7). To provide a more challenging occlusion dataset, we collected one emphasizing street-parking cars with heavy occlusions, diverse viewpoint changes and a much larger number of cars per image (see the last two rows of Fig. 9). Fig. 7 shows the bounding box overlap distribution and the average number of cars per image. For simplicity of annotation, we only label the bounding boxes of single cars in each image. We split the dataset into training and testing sets.
The Parking Lot Dataset. Our Street Parking Car Dataset provides more viewpoints; however, its context and occlusion configurations are relatively restricted (most cars compose head-to-head occlusions). To thoroughly evaluate our models in terms of both context and occlusion, we collected the Parking Lot car dataset, which has larger occlusion variations and a larger number of cars per image (see the fourth and fifth rows of Fig. 9). Although the number of images is small, the number of cars is noticeably large (including left-right mirrored cars for training).
6.2 Detection
We test our hierarchical And-Or model on four challenging datasets.
6.2.1 Results on the KITTI Dataset
The KITTI dataset [1] contains 7,481 training images and 7,518 testing images, captured from an autonomous driving platform. We follow the provided benchmark protocol for evaluation. Since the authors of [1] have not released the test annotations, we test our model in the following two settings.
Training and Testing by Splitting the Trainset. We randomly split the KITTI trainset into training and testing subsets of equal size.
Baseline Methods. Since the DPM [17] is a very competitive model with publicly available source code, we compare our model with the latest version of the DPM (i.e., voc-release5 [18]). The number of components is set to match the baseline methods trained in [1]; other parameters are kept at their defaults.
Parameter Settings. We consider multi-car contextual patterns with a small number of cars, and set the numbers of context patterns and occlusion configurations empirically. As a result, the learned hierarchical And-Or model has the multi-car configurations at Layer 1 and the single-car branches at Layer 2 (see Fig. 3).
Table I. Detection results on the KITTI benchmark: average precision on the Easy, Moderate and Hard subsets for mBoW [19], LSVM-MDPM-us [17], LSVM-MDPM-sv [17, 20], MDPM-un-BB [17], OC-DPM [14], DPM [18] (trained by us), MV-RGBD-RF [60], SubCat [44], 3DVP [45], Regionlets [61], AOG+Greedy-Half and AOG+Greedy-Full.
Detection Results. The left plot in Fig. 8 shows the precision-recall curves of the DPM and our model. Our model outperforms the DPM in terms of average precision (AP). The performance gain comes from both precision and recall, which shows the importance of context and occlusion modeling.
Testing on the KITTI Benchmark. We evaluate our model with two different training data settings on the KITTI testset: one trained on half of the training set, denoted AOG+Greedy-Half, and the other trained on the full training set, denoted AOG+Greedy-Full.
The benchmark has three subsets (Easy, Moderate, Hard) w.r.t. the difficulty of object size, occlusion and truncation. All methods are ranked by performance on the moderately difficult subset. Our entry in the benchmark is "AOG". Table I shows the detection results of our model and other state-of-the-art models. Here, we omit the CNN-based methods, as they are all anonymous submissions. Details of the benchmark results are available at http://www.cvlibs.net/datasets/kitti/eval_object.php.
Our AOG+Greedy-Full outperforms all the DPM-based models. Compared with the best of them, OC-DPM [14], our model improves performance on all three subsets. We also compare with the baseline DPM trained by ourselves using the voc-release5 code [18], and obtain performance gains on all three subsets. Among the DPM-based methods trained by the benchmark authors, our model also outperforms the best one, MDPM-un-BB.
Our model is comparable with SubCat [44], 3DVP [45] and Regionlets [61]. We achieve slightly better performance than Regionlets [61] on the Easy and Hard subsets, but lose a bit of AP on the Moderate subset. Though our method ranks better than 3DVP [45] on the moderately difficult subset, it performs slightly worse on the Easy and Hard subsets, which shows the promise of 3D occlusion modeling and subcategory clustering [44, 45].
Comparing AOG+Greedy-Half and AOG+Greedy-Full, we observe that the major improvement of AOG+Greedy-Full comes from the Moderate subset, while on the Easy and Hard subsets the improvement is small. These results match some analyses in [62], which indicate that there is still large potential for improving object representations, and much effort should be devoted to improving our current hierarchical And-Or model.
The top rows of Fig. 9 show qualitative results of our model on KITTI. The red bounding boxes show successful detections, the blue ones missed detections, and the green ones false alarms. In our experiments, the model is robust in detecting cars with heavy car-to-car occlusions and background clutter. The failure cases are mainly due to extreme occlusions, extremely low resolution, large car deformations and/or inaccurate (or multiple) bounding box localization.
6.2.2 Results on the Parking Lot Dataset
Evaluation Protocol. We follow the PASCAL VOC evaluation protocol [2], but with a modified intersection-over-union overlap threshold (instead of the original 0.5); in practice, we set this threshold as a compromise between localization accuracy and detection difficulty. Detected cars whose bounding box height falls below the minimum evaluation height are not counted as false positives, as done in [1]. We compare with the latest version of the DPM implementation [18]; the numbers of contextual patterns and occlusion configurations are set empirically.
Detection Results. The right plot in Fig. 8 shows the performance comparison between our model and the DPM. Our model outperforms the latest version of the DPM in AP. The fourth and fifth rows of Fig. 9 show qualitative results: our model is capable of detecting cars with different occlusions and viewpoints.
6.2.3 Results on the Street Parking Dataset
To compare with the benchmark methods, we follow the evaluation protocol provided in [6].
The results of our model and the benchmark methods are shown in Table II; our hierarchical And-Or model outperforms the DPM [18] and our previous And-Or Structure [6]. We attribute the improvement to the joint representation of context patterns and occlusion configurations. The last two rows of Fig. 9 show some qualitative examples. Our model is capable of detecting occluded street-parking cars; meanwhile, it also produces a few inaccurate detections and misses some cars (mainly due to low resolution).
6.3 Diagnosing the Performance of our Model
In this section, we evaluate various aspects of our model to diagnose the effect of each individual component.
6.3.1 The Effect of Occlusion Modeling
Our And-Or Structure model is based on CAD simulation, so in the first analysis we test the effectiveness of the learned And-Or structure in representing different occlusion configurations. For this purpose, we generate a synthetic dataset using 5,040 single-car synthetic images as training data, and a mixture of 3,000 2-car and 3-car (placed in a grid) synthetic images as testing data. For each generated image, we add background from the category None of the TU Graz-02 dataset [63] and apply Gaussian blur to reduce boundary effects. Samples of the training and testing data are shown on the left and in the middle of Fig. 10. In the experimental comparison, we report the best DPM (in terms of the number of components) and the best And-Or structure (in terms of views, occlusion configurations, layers and nodes). As shown on the right side of Fig. 10, our model outperforms the DPM by 7.2% in AP.
6.3.2 The Effect of CAD Simulation in Real Situations
To verify the effectiveness of our And-Or Structure model in terms of occlusion modeling, we compare it with the state-of-the-art DPM [17]. Both models are based on part-level occlusion modeling: the And-Or Structure learns semantic visible parts based on CAD simulations, while the DPM handles occlusion implicitly by introducing a truncation feature at each HOG cell. The second and third columns of Table II show their performance on the Street Parking dataset; the semantic visible parts learned from CAD simulations generalize to real datasets. After adding context, we are interested in whether it affects the effectiveness of occlusion modeling. To compare AOG+Greedy and AOG+CAD fairly, they are given the same numbers of context patterns and occlusion configurations. As shown in the fourth and fifth columns of Table II, AOG+CAD performs better than AOG+Greedy, which shows the advantage of modeling occlusion using semantic visible parts.
Fig. 11 shows the part bounding boxes inferred by AOG+Greedy and AOG+CAD. We can observe that the semantic parts in AOG+CAD are meaningful, although they may not be accurate enough in some examples.
6.3.3 The Effect of Multi-car Context Modeling
The state-of-the-art models are mainly based on single-car modeling. To evaluate the effectiveness of context, we compare our hierarchical And-Or model with the non-context models in Table I. Our model outperforms all of them in the different occlusion settings. Specifically, our model outperforms the DPM by a large margin (above 10% in AP) on the "Moderate" and "Hard" KITTI test data, which shows that context is very important to object detection, especially in heavily occluded car-to-car situations.
On the Street Parking dataset, we observe the same trend. In Table II, both AOG+Greedy and AOG+CAD outperform the DPM and the And-Or Structure by a large margin. Here, AOG+Greedy and AOG+CAD jointly model context and occlusion, while the DPM and the And-Or Structure model occlusion only.
6.3.4 Performance on General Occlusion Settings
Our model is general in terms of context and occlusion modeling: it can cope with both occluded and non-occluded situations. To verify our model in less occluded settings, we use the PASCAL VOC 2007 car dataset as a testbed. As analyzed by Hoiem et al. in [5], cars in the PASCAL VOC dataset do not exhibit much occlusion or car-to-car context.
We first show that our And-Or Structure is capable of detecting cars on PASCAL VOC 2007 as well as the DPM method [18]. To approximate the occlusion configurations observed in this dataset, we generate synthetic images with both car-to-car occlusions and car self-occlusions. For the car-to-car occlusions, we use the full grid instead of the special case used for the street-parking setting. Correspondingly, the learned And-Or structure contains branches for self-occlusions as well as branches for car-to-car occlusions; the numbers of DPM components and of views, occlusion configurations, layers and nodes in the And-Or structure are set accordingly for this dataset.
The third column of Table III shows the performance of our And-Or Structure model and the DPM. Our model achieves slightly better recall than the DPM, which matches the analysis in [5]. This experiment shows that our And-Or Structure method does not lose performance on general datasets.
We then verify that our hierarchical And-Or model detects cars on PASCAL VOC 2007 as well as other single object models. We compare with the latest version of the DPM [18]: the APs are 60.6% (our model) and 58.2% (DPM), respectively (Table III).
Table: Average Viewpoint Precision (AVP) on the PASCAL3D+ car dataset under 4, 8, 16 and 24 views, comparing VDPM [4], DPM-VOC+VP [30], fisher+spm [21], decaf [21] and our And-Or Structure.
6.4 View Estimation
With the help of CAD simulations, our And-Or Structure model can estimate the viewpoints of detected cars. To verify this capability, we perform two sets of experiments.
First, we report the mean precision in pose estimation (MPPE), equivalent to the mean of the confusion matrix diagonal, on both the PASCAL VOC 2006 car dataset [69] and the 3D Object dataset [3]. The 3D Object Classes dataset [3] was introduced in 2007; for each class, it has images of 10 different object instances with 8 different poses. We follow the evaluation protocol described in [3]: 7 randomly selected car instances are used for training and 3 instances for testing. The 2D car bounding boxes are computed from the annotated segmentation masks, and the negative examples are collected from the PASCAL VOC 2007 car dataset. The VOC 2006 car dataset [69] has 469 cars with viewpoint labels (frontal, rear, left and right); we only use these labeled images, with the standard training/test split. Detection performance is evaluated through the precision-recall (PR) curve. For view estimation, both datasets emphasize visible cars; our And-Or structure here consists of views with self-occlusion branches. Table IV shows the comparison of our model with the state-of-the-art methods on these two datasets; our model is comparable to or better than some recently proposed models [64, 65, 30].

Second, we compare our model with the state-of-the-art models on the recently proposed PASCAL3D+ dataset [4]. This dataset augments the rigid categories in PASCAL VOC 2012 [2] with 3D annotations by semi-manually fitting CAD models to 2D images. It is a challenging dataset for 3D object detection and pose estimation; we test on the car category. We use the Average Viewpoint Precision (AVP) metric [4] to simultaneously evaluate 2D bounding box localization and viewpoint estimation. In computing the AVP, a candidate detection is considered a true positive if and only if the bounding box overlap is larger than 0.5 and the viewpoint is correct.
7 Conclusion
In this paper, we presented an And-Or model to represent context and occlusion for car detection and viewpoint estimation. The model structure is learned by mining multi-car contextual patterns and occlusion configurations at three levels: a) multi-car layouts, b) single cars and c) parts. The model is organized in a directed and acyclic graph structure, so the efficient DP algorithm can be used in inference, and the model parameters are learned by WL-SSVM [15]. Experimental results show that our model is effective in modeling context and occlusion information in complex situations; it achieves better performance than state-of-the-art car detection methods and comparable performance on viewpoint estimation.
There are two main limitations in our current implementation. The first is that we exploited the multi-car contextual patterns using 2-car composites only; in scenarios similar to street-parking and parking-lot cars, we could explore multi-car context with more than 2 spatially-aligned cars, as well as 3D scene parsing context [70]. The second is that we utilized only HOG features for appearance; based on the recent progress in feature learning with convolutional neural networks (CNNs) [71, 72], we could substitute the HOG features with CNN features. Both aspects are addressed in our ongoing work and may potentially improve the performance.

Acknowledgments
B. Li is supported by China 973 Program under Grant no. 2012CB316300. T.F. Wu and S.C. Zhu are supported by DARPA MSEE project FA 86501117149, MURI grant ONR N000141010933, and NSF IIS1018751. We thank Dr. Wenze Hu for helpful discussions.
References
 [1] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in CVPR, 2012.
 [2] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
 [3] S. Savarese and L. FeiFei, “3d generic object categorization, localization and pose estimation,” in ICCV, 2007.
 [4] Y. Xiang, R. Mottaghi, and S. Savarese, “Beyond pascal: A benchmark for 3d object detection in the wild,” in WACV, 2014.
 [5] D. Hoiem, Y. Chodpathumwan, and Q. Dai, “Diagnosing error in object detectors,” in ECCV, 2012.
 [6] B. Li, W. Hu, T.F. Wu, and S.C. Zhu, “Modeling occlusion by discriminative andor structures,” in ICCV, 2013.
 [7] B. Li, T. Wu, and S.C. Zhu, “Integrating context and occlusion for car detection by hierarchical andor model,” in ECCV, 2014.
 [8] S.C. Zhu and D. Mumford, “A stochastic grammar of images,” Found. Trends. Comput. Graph. Vis., vol. 2, no. 4, pp. 259–362, Jan. 2006.
 [9] P. Felzenszwalb and D. McAllester, “Object detection grammars,” University of Chicago, Computer Science TR201002, Tech. Rep., 2010.
 [10] M. Sadeghi and A. Farhadi, “Recognition using visual phrases,” in CVPR, 2011.
 [11] B. Li, X. Song, T. Wu, W. Hu, and M. Pei, “Couplinganddecoupling: A hierarchical model for occlusionfree object detection,” Pattern Recognition, vol. 47, no. 10, pp. 3254 – 3264, 2014.
 [12] S. Tang, M. Andriluka, and B. Schiele, “Detection and tracking of occluded people,” in BMVC, 2012.
 [13] W. Ouyang and X. Wang, “Singlepedestrian detection aided by multipedestrian detection,” in CVPR, 2013.
 [14] B. Pepik, M. Stark, P. Gehler, and B. Schiele, “Occlusion patterns for object class detection,” in CVPR, 2013.
 [15] R. Girshick, P. Felzenszwalb, and D. McAllester, “Object detection with grammar models,” in NIPS, 2011.
 [16] D. McAllester and J. Keshet, “Generalization bounds and consistency for latent structural probit and ramp loss,” in NIPS, 2011.
 [17] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained partbased models,” TPAMI, vol. 32, no. 9, pp. 1627–1645, Sep. 2010.
 [18] R. B. Girshick, P. F. Felzenszwalb, and D. McAllester, “Discriminatively trained deformable part models, release 5,” http://people.cs.uchicago.edu/~rbg/latent-release5/.
 [19] J. Behley, V. Steinhage, and A. Cremers, “Laserbased Segment Classification Using a Mixture of BagofWords,” in IROS, 2013.
 [20] A. Geiger, C. Wojek, and R. Urtasun, “Joint 3d estimation of objects and scene layout,” in NIPS, 2011.
 [21] A. Ghodrati, M. Pedersoli, and T. Tuytelaars, “Is 2d information enough for viewpoint estimation?” in BMVC, 2014.
 [22] P. Viola and M. J. Jones, “Robust realtime face detection,” Int. J. Comput. Vision, vol. 57, no. 2, pp. 137–154, May 2004.
 [23] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: An evaluation of the state of the art,” TPAMI, vol. 34, no. 4, pp. 743–761, Apr. 2012.
 [24] X. Song, T.F. Wu, Y. Jia, and S.C. Zhu, “Discriminatively trained andor tree models for object detection,” in CVPR, 2013.

 [25] K. Grauman and B. Leibe, Visual Object Recognition, ser. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2011.
 [26] A. Andreopoulos and J. K. Tsotsos, “50 years of object recognition: Directions forward.” Computer Vision and Image Understanding, vol. 117, no. 8, pp. 827–891, 2013.
 [27] X. Zhang, Y.H. Yang, Z. Han, H. Wang, and C. Gao, “Object class detection: A survey,” ACM Comput. Surv., vol. 46, no. 1, pp. 10:1–10:53, Jul. 2013.
 [28] L. Zhu, Y. Chen, A. Yuille, and W. Freeman, “Latent hierarchical structural learning for object detection,” in CVPR, 2010.
 [29] H. Azizpour and I. Laptev, “Object detection using stronglysupervised deformable part models,” in ECCV, 2012.
 [30] B. Pepik, M. Stark, P. Gehler, and B. Schiele, “Teaching 3d geometry to deformable part models,” in CVPR, 2012.
 [31] S. Branson, P. Perona, and S. Belongie, “Strong supervision from weak annotation: Interactive training of deformable part models,” in ICCV, 2011.
 [32] B. Wu and R. Nevatia, “Detection and tracking of multiple, partially occluded humans by bayesian combination of edgelet based part detectors,” International Journal of Computer Vision, vol. 75, no. 2, pp. 247–266, 2007.
 [33] X. Wang, T. Han, and S. Yan, “An hoglbp human detector with partial occlusion handling,” in ICCV, 2009.
 [34] M. Hejrati and D. Ramanan, “Analyzing 3d objects in cluttered images,” in NIPS, 2012.
 [35] X. BurgosArtizzu, P. Perona, and P. Dollár, “Robust face landmark estimation under occlusion,” in ICCV, 2013.
 [36] C. Desai and D. Ramanan, “Detecting actions, poses, and objects with relational phraselets.” in ECCV, 2012.
 [37] T. Gao, B. Packer, and D. Koller, “A segmentationaware object detection model with occlusion handling,” in CVPR, 2011.
 [38] G. Duan, H. Ai, and S. Lao, “A structural filter approach to human detection,” in ECCV, 2010.
 [39] M. Mathias, R. Benenson, R. Timofte, and L. Van Gool, “Handling occlusions with frankenclassifiers,” in ICCV, 2013.
 [40] X. Yu, Z. Lin, J. Brandt, and D. N. Metaxas, “Consensus of regression for occlusionrobust facial feature localization,” in ECCV, 2014.
 [41] M. Z. Zia, M. Stark, and K. Schindler, “Explicit Occlusion Modeling for 3D Object Class Representations,” in CVPR, 2013.
 [42] G. Ghiasi and C. C. Fowlkes, “Occlusion coherence: Localizing occluded faces with a hierarchical deformable part model,” in CVPR, 2014.
 [43] G. Ghiasi, Y. Yang, D. Ramanan, and C. C. Fowlkes, “Parsing occluded people,” in CVPR, 2014.
 [44] E. OhnBar and M. Trivedi, “Learning to detect vehicles by clustering appearance patterns,” TITS, 2015.
 [45] Y. Xiang, W. Choi, Y. Lin, and S. Savarese, “Datadriven 3d voxel patterns for object category recognition,” in CVPR, 2015.

 [46] J. Zhu, T. Wu, S.-C. Zhu, X. Yang, and W. Zhang, “Learning reconfigurable scene representation by tangram model,” in WACV, 2012.
 [47] W. Hu and S.-C. Zhu, “Learning 3d object templates by quantizing geometry and appearance spaces,” TPAMI, vol. 37, no. 6, pp. 1190–1205, 2015.
 [48] Y. Yang, S. Baker, A. Kannan, and D. Ramanan, “Recognizing proxemics in personal photos,” in CVPR, 2012.
 [49] C. Desai, D. Ramanan, and C. Fowlkes, “Discriminative models for multiclass object layout,” IJCV, vol. 95, no. 1, pp. 1–12, 2011.
 [50] D. Hoiem, A. Efros, and M. Hebert, “Putting objects in perspective,” IJCV, vol. 80, no. 1, pp. 3–15, 2008.
 [51] Z. Tu and X. Bai, “Autocontext and its application to highlevel vision tasks and 3d brain image segmentation,” TPAMI, vol. 32, no. 10, pp. 1744–1757, Oct. 2010.
 [52] G. Chen, Y. Ding, J. Xiao, and T. X. Han, “Detection evolution with multiorder contextual cooccurrence,” in CVPR, 2013.
 [53] K. Matzen and N. Snavely, “Nyc3dcars: A dataset of 3d vehicles in geographic context,” in ICCV, 2013.
 [54] Y. Zhang, S. Song, P. Tan, and J. Xiao, “Panocontext: A wholeroom 3d context model for panoramic scene understanding,” in ECCV, 2014.
 [55] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, 2005.
 [56] Z. Si and S.C. Zhu, “Learning andor templates for object recognition and detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 9, pp. 2189–2205, 2013.
 [57] A. L. Yuille and A. Rangarajan, “The ConcaveConvex Procedure (CCCP),” in NIPS, 2001.
 [58] B. Leibe and B. Schiele, “Analyzing appearance and contour based methods for object categorization,” in CVPR, 2003.
 [59] M. Ozuysal, V. Lepetit, and P. Fua, “Pose estimation for category specific multiview object localization,” in CVPR, 2009.

 [60] A. Gonzalez, G. Villalonga, J. Xu, D. Vazquez, J. Amores, and A. Lopez, “Multiview random forest of local experts combining rgb and lidar data for pedestrian detection,” in IEEE Intelligent Vehicles Symposium (IV), 2015.
 [61] X. Wang, M. Yang, S. Zhu, and Y. Lin, “Regionlets for generic object detection,” in ICCV, December 2013.
 [62] X. Zhu, C. Vondrick, D. Ramanan, and C. C. Fowlkes, “Do we need more training data or better models for object detection?” in BMVC, 2012.
 [63] A. Opelt and A. Pinz, “Object Localization with Boosting and Weak Supervision for Generic Object Recognition,” in SCIA, 2005.
 [64] R. J. Lopez-Sastre, T. Tuytelaars, and S. Savarese, “Deformable part models revisited: A performance evaluation for object category pose estimation,” in ICCV Workshops (CORP), 2011.
 [65] C. Gu and X. Ren, “Discriminative MixtureofTemplates for Viewpoint Classification,” in ECCV, 2010.
 [66] M. Sun, H. Su, S. Savarese, and L. FeiFei, “A multiview probabilistic model for 3d object classes,” in CVPR, 2009.
 [67] J. Liebelt and C. Schmid, “Multiview object class detection with a 3D geometric model,” in CVPR, 2010.
 [68] D. Glasner, M. Galun, S. Alpert, R. Basri, and G. Shakhnarovich, “Viewpointaware object detection and pose estimation,” in ICCV, 2011.
 [69] M. Everingham, A. Zisserman, C. K. I. Williams, and L. Van Gool, “The PASCAL Visual Object Classes Challenge 2006 (VOC2006) Results,” http://www.pascalnetwork.org/challenges/VOC/voc2006/results.pdf.
 [70] X. Liu, Y. Zhao, and S. Zhu, “Singleview 3d scene parsing by attributed grammar,” in CVPR, 2014.

 [71] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.
 [72] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in CVPR, 2014.