1 Introduction
Our visual world is rich in structural regularity. Studies in perception show that the human visual system imposes structure to reason about stimuli [1]. Consequently, early work in computer vision studied perceptual organization as a fundamental precept for recognition and reconstruction [2, 3]. However, algorithms designed on these principles relied on hand-crafted features (e.g. corners or edges) and hard-coded rules (e.g. junctions or parallelism) to hierarchically reason about abstract concepts such as shape [4, 5]. Such approaches suffered from limitations in the face of real-world complexities. In contrast, convolutional neural networks (CNNs), as end-to-end learning machines, ignore inherent perceptual structures encoded by task-related intermediate concepts and attempt to directly map from the input to the label space. Abu-Mostafa [6] proposes "hints" as a middle ground, where a task-related hint derived from prior domain knowledge regularizes the training of neural networks by either constraining the parameter space or generating more training data. In this work, we revisit and extend this idea by exploring a specific type of hint, which we refer to as an "intermediate concept", that encodes a sub-goal for achieving the main task of interest. For instance, knowing object orientation is a prerequisite to correctly infer object part visibility, which in turn constrains the 3D locations of semantic object parts. We present a generic learning architecture where intermediate concepts sequentially supervise hidden layers of a deep neural network to learn a specific inference sequence for predicting a final task.
We implement this deep supervision framework with a novel CNN architecture for predicting 2D and 3D object skeletons given a single test image. Our approach is in the spirit of [3, 2], which exploit object pose as an auxiliary shape concept to aid shape interpretation and mental rotation. We combine this early intuition with the discriminative power of modern CNNs by deeply supervising multiple shape concepts such as object pose. As such, deep supervision teaches the CNN to sequentially model intermediate goals to parse 2D or 3D object skeletons across large intra-class appearance variations and occlusion.
An earlier version of this work was presented in a conference paper [7]. In this extended version, we formalize a probabilistic notion of intermediate concepts that predicts improved generalization performance when intermediate concepts are deeply supervised (Section 3). This analysis motivates our network architecture, in which we supervise convolutional layers at different depths with the available intermediate shape concepts. Further, we add new experiments, including a new object class (bed) (Section 5.2.4) and image classification results on CIFAR-100 [8] (Section 5.1).
Due to the scarcity of 3D annotated images, we render 3D CAD models to create synthetic images with concept labels as training data. In addition, we simulate challenging occlusion configurations between objects to enable robust data-driven occlusion reasoning (in contrast to earlier model-driven attempts [9, 10]). Figure 1 introduces our framework and Figure 4 illustrates an instance of a CNN deeply supervised by intermediate shape concepts for 2D/3D keypoint localization. We denote our network as "DISCO", short for Deep supervision with Intermediate Shape COncepts.
Most existing approaches [11, 12, 13, 14, 10] estimate 3D geometry by comparing projections of parameterized shape models with separately predicted 2D patterns, such as keypoint locations or heat maps. This makes prior methods sensitive to partial view ambiguity [15] and incorrect 2D structure prediction. Moreover, the scarcity of 3D annotations on real images further limits their performance. In contrast, our method is trained on synthetic data only and generalizes well to real images. We find deep supervision with intermediate concepts to be a critical element in bridging the synthetic and real worlds. In particular, our deep supervision scheme empirically outperforms both the single-task architecture and multi-task networks that supervise all concepts at the final layer. Further, we quantitatively demonstrate significant improvements over the prior state of the art for 2D/3D keypoint prediction on PASCAL VOC, PASCAL3D+ [16], IKEA [17] and KITTI-3D, where we add 3D annotations for part of the KITTI [18] data. These observations confirm that intermediate concepts regularize the learning of 3D shape in the absence of photorealism in rendered training data.
Additionally, we show another application of our generic deep supervision framework to image classification on CIFAR-100 [8]. Here, coarse-grained class labels used as intermediate concepts improve fine-grained recognition performance, which further validates our deep supervision strategy.
In summary, we make the following contributions in this work:

We present a CNN architecture whose hidden layers are supervised by a sequence of intermediate shape concepts for the main task of 2D and 3D object geometry estimation.

We formulate a probabilistic framework to explain why deep supervision may be effective in certain cases. Our proposed framework is a generalization of conventional supervision schemes employed in CNNs, including multi-task supervision and Deeply Supervised Nets [19].

We show the utility of rendered data with access to intermediate shape concepts. We model occlusions by rendering multiple object configurations, which presents a novel route to exploiting 3D CAD data for parsing cluttered scenes.

We empirically demonstrate state-of-the-art performance on 2D/3D semantic part localization and object classification on several public benchmarks. In some experiments, the proposed approach even outperforms state-of-the-art methods trained on real images. We also demonstrate superior performance to baselines, including conventional multi-task supervision and different orderings of the intermediate concepts.
In the following, we review related work in Section 2 and introduce the probabilistic framework and algorithm of deep supervision in Section 3. Details of the network architecture and data simulation are discussed in Section 4. We present experimental results in Section 5 and conclude the paper in Section 6.
2 Related Work
We present a deep supervision scheme with intermediate concepts for deep neural networks. One application of our deep supervision is 3D object structure inference, which is linked to recent advances in reconstruction, alignment and pose estimation. We review related work on these problems in the following:
Multi-task Learning. In neural networks, multi-task learning architectures exploit multiple task-related concepts to jointly supervise a network at the last layer. Caruana [20] empirically demonstrates its advantage over single-task neural architectures on various learning problems. Recently, multi-task learning has been applied to a number of vision tasks, including face landmark detection [21] and viewpoint estimation [22]. The Hierarchy and Exclusion (HEX) graph [23] was proposed to capture hierarchical relationships among object attributes for improved image classification. In addition, some theories [24, 25] investigate how shared hidden layers reduce the amount of required training data by jointly learning multiple tasks. However, to our knowledge, no study has quantified the performance boost to a main task. It is also unclear whether a given design choice meets the assumptions of conducive task relationships used in these theories. This may explain why some task combinations for multi-task networks yield worse performance than single-task networks [20].
Deep Supervision. Deeply Supervised Nets (DSN) [19]
uses a single task label to supervise the hidden layers of a CNN, speeding up convergence and addressing the vanishing gradient problem. However, DSN assumes that optimal local filters at shallow layers are building blocks for optimal global filters at deep layers, which is probably not true for a complex task. Recently, a two-level supervision scheme was proposed [26] for counting objects in binary images, where one hidden layer is hard-coded to output object detection responses at fixed image locations. This work can be seen as a preliminary study in leveraging task-related cues that assist the final task through deep supervision. We advance this idea further to a more general setting for deep learning without hard-coded internal representations.

3D Skeleton Estimation. Many works model 3D shape as a linear combination of shape bases and optimize basis coefficients to fit computed image evidence such as heat maps [14] and object part detections [10]. A prominent recent approach, the single image 3D INterpreter Network (3D-INN) [27], is a sophisticated CNN architecture that estimates a 3D skeleton based only on detected visible 2D joints. However, in contrast to our approach, the training of 3D-INN does not jointly optimize for 2D and 3D keypoint localization. This decoupling of 3D structure from object appearance leads to partial view ambiguity and thus 3D prediction error.
3D Reconstruction. A generative inverse graphics model is formulated in [12]
for 3D mesh reconstruction by matching mesh proposals to extracted 2D contours. Recently, given a single image, autoencoders have been exploited for 2D image rendering
[28], multi-view mesh reconstruction [29] and 3D shape regression under occlusion [30]. The encoder network learns to invert the rendering process to recognize 3D attributes such as object pose. However, methods such as [29, 30] are quantitatively evaluated only on synthetic data and seem to achieve limited generalization to real images. Other works such as [11] formulate an energy-based optimization framework involving appearance, keypoint and normal consistency for dense 3D mesh reconstruction, but require both 2D keypoint and object segmentation annotations on real images for training. Volumetric frameworks using either discriminative [31] or generative [32] modeling infer a 3D shape distribution on voxel grids given one or more images of an object, but are limited to low resolutions. Lastly, 3D voxel exemplars [33] jointly recognize 3D shape and occlusion patterns by template matching, which is not scalable.

3D Model Retrieval and Alignment. This line of work estimates 3D object structure by retrieving the closest object CAD model and performing alignment, using 2D images [34, 35, 16] or RGB-D data [36, 37]. Unfortunately, a limited number of CAD models cannot represent all instances in one object category. Further, the retrieval step is slow for a large CAD dataset, and alignment is sensitive to error in the estimated pose.
Pose Estimation and 2D Keypoint Detection. "Render for CNN" [22] renders 3D CAD models as additional training data besides real images for object viewpoint estimation. We extend this rendering pipeline to support object keypoint prediction and cluttered-scene rendering to learn occlusions from data. Viewpoint prediction is utilized in [38] to boost the performance of 2D landmark localization. Recent work such as DDN [39] optimizes deformation coefficients based on a PCA representation of 2D keypoints to achieve state-of-the-art performance on faces and human bodies. Dense feature matching approaches which exploit top-down object category knowledge [40, 14] are recent successes, but our method yields superior results while being able to transfer knowledge from rich CAD data.
Occlusion Modeling. Most work on occlusion-invariant recognition relies on explicit occluder modeling [41, 10]. However, just as object appearance is hard to model explicitly, the variation in occluder appearance is too broad to be captured effectively by model-driven approaches. This is why recent work has demonstrated gains by learning occlusion patterns from data [42, 33]. Thanks to deep supervision, which enables effective generalization from CAD renderings to real images, we are able to leverage a significantly larger array of synthetic occlusion configurations.
3 Deep Supervision with Intermediate Concepts
In this section, we introduce a novel CNN architecture with deep supervision. Our approach draws inspiration from Deeply Supervised Nets (DSN) [19]. DSN supervises each layer by the main task label to accelerate training convergence. Our method differs from DSN in that we sequentially apply deep supervision on intermediate concepts intrinsic to the ultimate task, in order to regularize the network for better generalization. We employ this enhanced generalization ability to transfer knowledge from richly annotated synthetic data to the domain of real images.
Toy Example. To motivate the idea of supervising intermediate concepts, consider a very simple network with two layers, y = sigma(w2 * sigma(w1 * x)), where sigma is the ReLU activation sigma(z) = max(z, 0). Suppose the true model generating the data is realized by a particular parameter setting, and the training set contains only inputs on which several distinct parameter settings agree. A learning algorithm may then obtain a different model which still achieves zero loss over the training data but fails to generalize to inputs outside the training set. However, if we have additional cues that tell us the value of the intermediate-layer activation sigma(w1 * x) for each training example, we can achieve better generalization: an incorrect solution that fits the final outputs is removed because its hidden activations do not agree with the intermediate cue. While simple, this example illustrates that deep supervision with intermediate concepts can regularize network training and reduce overfitting.

In the following, we formalize the notion of an intermediate concept in Section 3.1, introduce our supervision approach which exploits intermediate concepts in Section 3.2, and discuss the improved generalization of deep supervision in Section 3.3.
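The toy example's concrete numbers are not specified above, so the following sketch uses assumed values (a two-layer scalar ReLU model with biases, hypothetical training inputs x in {1, 2}) to illustrate how an intermediate cue rejects a spurious zero-loss solution:

```python
# Illustrative sketch of the toy example: a two-layer ReLU "network"
# y = relu(w2 * relu(w1 * x + b1) + b2). All numeric values are assumed
# for illustration; they are not the paper's original values.

def relu(z):
    return max(z, 0.0)

def hidden(params, x):                    # intermediate-layer activation
    w1, b1, _, _ = params
    return relu(w1 * x + b1)

def forward(params, x):                   # final network output
    _, _, w2, b2 = params
    return relu(w2 * hidden(params, x) + b2)

true_params = (1.0, 0.0, 1.0, 0.0)        # realizes y = relu(x)
bad_params = (1.0, -1.0, 1.0, 1.0)        # a spurious solution

train_x = [1.0, 2.0]
train_y = [forward(true_params, x) for x in train_x]   # final-task labels
train_h = [hidden(true_params, x) for x in train_x]    # intermediate cue

# Both parameter settings achieve zero loss on the training outputs...
assert all(forward(bad_params, x) == y for x, y in zip(train_x, train_y))

# ...but the spurious one fails to generalize (e.g. at x = 0):
assert forward(true_params, 0.0) == 0.0
assert forward(bad_params, 0.0) != 0.0

# Supervising the intermediate activation rejects the spurious solution,
# because its hidden values disagree with the intermediate cue:
assert any(hidden(bad_params, x) != h for x, h in zip(train_x, train_h))
```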
3.1 Intermediate Concepts
We consider a supervised learning task of predicting y from an input x. We have a training set S sampled from an unknown distribution D, where each training tuple consists of multiple task labels: (x, y_1, …, y_N). Without loss of generality, we analyze the k-th concept in the following, where 1 ≤ k < N. Here, y_k is regarded as an intermediate concept for estimating y_{k+1}. Intuitively, knowledge of y_k constrains the solution space of y_{k+1}, as in our simple example above.

Formally, we define an intermediate concept y_k of y_{k+1} as a strict necessary condition such that there exists a deterministic function g which maps y_{k+1} to y_k: y_k = g(y_{k+1}). In general, there is no inverse function that maps y_k to y_{k+1}, because multiple values of y_{k+1} may map to the same y_k. In the context of multi-class classification, where tasks y_k and y_{k+1} both contain discrete class labels, task y_{k+1} induces a finer partition over the input space than task y_k by further partitioning each class in y_k. Figure 2 illustrates a fictitious example of hierarchical partitioning over a 2D input space created by three intermediate concepts. As we can see in Figure 2, a sequence of intermediate concepts hierarchically decomposes the input space from coarse to fine granularity. Concretely, we denote a concept hierarchy as (y_1, …, y_N), where y_k is a strict necessary condition of y_{k+1} for all 1 ≤ k < N.
In many vision problems, we can find concepts that approximate a strict concept hierarchy. As mentioned above, non-overlapping coarse-grained class labels constitute strict necessary conditions for a fine-grained classification task. In addition, object pose and keypoint visibility are both strict necessary conditions for 3D object keypoint locations, because the former can be unambiguously determined by the latter.
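The deterministic map from a finer concept to its intermediate concept can be sketched as a lookup table; the class names below are hypothetical examples, and writing the map as a Python dict `g` is our own convention:

```python
# Sketch of a strict necessary condition: a deterministic map g from a
# fine-grained label to its intermediate (coarser) concept. Class names
# are hypothetical, not the paper's label set.
from collections import Counter

g = {  # fine-grained class -> coarse-grained class
    "cup": "container",
    "bowl": "container",
    "sedan": "vehicle",
    "truck": "vehicle",
}

# g is a function: every fine label maps to exactly one coarse label...
assert all(isinstance(coarse, str) for coarse in g.values())

# ...but it has no inverse, since multiple fine labels share a coarse one:
counts = Counter(g.values())
assert counts["container"] > 1   # "container" has no unique fine-grained preimage

# Hence the fine task refines the partition induced by the coarse task:
assert len(set(g.keys())) > len(set(g.values()))
```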
3.2 Algorithm
Given a concept hierarchy (y_1, …, y_N) and the corresponding training set S, we formulate a new deeply supervised architecture to jointly learn the main task along with its intermediate concepts. Consider a multi-layer convolutional neural network that receives an input x and outputs predictions for the N concepts. The k-th concept y_k is applied to supervise the intermediate hidden layer at depth d_k by adding a side output branch at the d_k-th hidden layer. We denote the function represented by the i-th hidden layer as h_i, with parameters W_i. The output branch at depth d_k constructs a function c_k with parameters V_k. Further, we denote f_k = c_k ∘ h_{d_k} ∘ ⋯ ∘ h_1 as the function for predicting concept y_k. Figure 3 shows a schematic diagram of our deep supervision framework. In Section 4, we concretely instantiate each h_i as a convolutional layer followed by batch normalization and ReLU layers, and each c_k as global average pooling followed by fully connected layers. However, we emphasize that our algorithm is not limited to this particular layer configuration.

We formulate the following objective function to encapsulate these ideas:

  min_{W, V} Σ_{k=1}^{N} λ_k Σ_{(x, y_k) ∈ S} ℓ_k(y_k, f_k(x))    (1)

where W = {W_i} and V = {V_k} collect the hidden-layer and branch parameters, and ℓ_k is the loss for task y_k, scaled by the loss weight λ_k. We optimize Equation 1 over (W, V) by simultaneously back-propagating the loss of each supervisory signal all the way back to the first layer.
We note that Equation 1 is a generic supervision framework which subsumes many existing supervision schemes. For example, the standard CNN with single-task supervision is a special case with N = 1. Additionally, multi-task learning [20] places all supervision on the last hidden layer: d_k = d_N for all k. The DSN [19] framework is obtained when every supervised concept is the main task itself: y_k = y_N for all k. In this work, we propose to apply the different concepts in a concept hierarchy at locations with growing depths: d_1 < d_2 < ⋯ < d_N, where d_N is the depth of the last hidden layer.
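The objective in Equation 1 and its special cases can be sketched as follows; the loss functions, loss weights, and depth assignments below are assumed placeholders, not values from the paper:

```python
# Sketch of the deep-supervision objective (Equation 1): a weighted sum of
# per-concept losses, each concept attached at its own supervision depth.

def deep_supervision_loss(predictions, labels, loss_fns, weights):
    """Sum of lambda_k * loss_k(y_k, f_k(x)) over all supervised concepts."""
    return sum(w * fn(p, y)
               for p, y, fn, w in zip(predictions, labels, loss_fns, weights))

l2 = lambda p, y: (p - y) ** 2   # a placeholder per-concept loss

# Three concepts supervised at growing depths (our scheme): d1 < d2 < d3.
depths_ours = [2, 4, 6]
assert all(a < b for a, b in zip(depths_ours, depths_ours[1:]))

# Special cases expressed as depth assignments for a 6-layer network:
depths_single_task = [6]        # standard CNN: only the main task at the top
depths_multi_task = [6, 6, 6]   # multi-task: all concepts at the last layer
depths_dsn = [2, 4, 6]          # DSN: the same (main-task) label at every depth

total = deep_supervision_loss([0.9, 2.1, 3.0], [1.0, 2.0, 3.0],
                              [l2, l2, l2], [1.0, 1.0, 1.0])
assert abs(total - 0.02) < 1e-9   # 0.01 + 0.01 + 0.0
```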
3.3 Generalization Analysis
Notation | Meaning
y_k | The k-th concept
y_{k-1} | The intermediate concept of y_k
d_k | The supervision depth of y_k
f_k | A function that predicts y_k given input x
R(f_k) | True risk of f_k
R_S(f_k) | Empirical risk of f_k given a training set S
F_k | The set of f_k with low empirical risk
F_k* | The set of f_k with low empirical and true risk
P(f_k) | Generalization probability of f_k
F_{k|k-1} | Subset of F_k that achieves low empirical risk on y_{k-1}
F*_{k|k-1} | Subset of F_k* that achieves low empirical risk on y_{k-1}
P(f_k | y_{k-1}) | Generalization probability of f_k constrained by y_{k-1}
In this section, we present a generalization metric and subsequently show how deep supervision with intermediate concepts can improve the generalization of a deep neural network with respect to this metric, compared to other standard supervision methods. We also discuss the limitations of this analysis. For clarity, we summarize our notation in Table I.
3.3.1 Generalization Metric
Deep neural networks are function approximators that learn mappings from an input space X to an output space Y. For a network with a fixed structure, there usually exists a set of functions (equivalently, a set of parameters) whose elements each achieve a low empirical loss on a training set S. In the following, we define a generalization metric to measure the probability that a function is a "true" solution for a supervised learning task.

Recall that f_k represents the function composed of the first d_k hidden layers and an output branch for predicting concept y_k. The true risk of f_k is defined over the random variables x and y_k, where (x, y_k) ~ D:

  R(f_k) = E_{(x, y_k) ~ D} [ ℓ_k(y_k, f_k(x)) ]    (2)

Given a training set S, the empirical risk of f_k is:

  R_S(f_k) = (1 / |S|) Σ_{(x, y_k) ∈ S} ℓ_k(y_k, f_k(x))    (3)

Given limited training data S, a deep neural network is optimized to find a solution with low empirical loss. We consider empirical loss to be "low" when R_S(f_k) ≤ ε, where ε is the risk threshold indicating "good" performance for a task. Next, we define the function set F_k in which each function achieves low empirical risk:

  F_k = { f_k : R_S(f_k) ≤ ε }    (4)

Similarly, we define the function set F_k* in which each function achieves risk at most ε for both the empirical and the true risk:

  F_k* = { f_k : R_S(f_k) ≤ ε and R(f_k) ≤ ε }    (5)

By definition, F_k* ⊆ F_k. Given a training set and network structure, the generalization capability of the outcome of network training depends upon the likelihood that the solution is also a member of F_k*.

We consider the learned function to be a random variable, as it is the outcome of a stochastic optimization process such as stochastic gradient descent. We assume that the optimization algorithm is unbiased within F_k, such that the a priori probability of converging to any f_k ∈ F_k is uniformly distributed. We formalize a generalization metric for a CNN predicting y_k by defining a probability measure based on the function sets F_k and F_k*:

  P(f_k) = μ(F_k*) / μ(F_k)    (6)

where μ(·) is the Lebesgue measure [43] of a set, indicating its "volume" or "size".^1 Moreover, P(f_k) ≤ 1 due to F_k* ⊆ F_k, with equality achieved when F_k* = F_k. It follows that the higher the P(f_k), the better the generalization.

^1 Each function f_k has a one-to-one mapping to a parameter vector in R^m, where m is the number of parameters, and the relevant parameter sets are assumed to be Lebesgue measurable.
When an intermediate concept y_{k-1} of y_k is available, we insert one output branch at depth d_{k-1} of the CNN to predict y_{k-1}. Our deep supervision algorithm in Section 3.2 then aims to minimize the empirical risk on both y_{k-1} and y_k. Recall that d_{k-1} < d_k, so f_k itself does not contain any output branch for the intermediate concept y_{k-1}; however, f_{k-1} shares its first d_{k-1} hidden layers with f_k. Similar to Equation 6, we can define the generalization probability of f_k given the supervision of its intermediate concept y_{k-1}:

  P(f_k | y_{k-1}) = μ(F*_{k|k-1}) / μ(F_{k|k-1})    (7)

where the function set F_{k|k-1} is a subset of F_k:

  F_{k|k-1} = { f_k ∈ F_k : R_S(f_{k-1}) ≤ ε_{k-1} }    (8)

and the function set F*_{k|k-1} is a subset of F_k*:

  F*_{k|k-1} = { f_k ∈ F_k* : R_S(f_{k-1}) ≤ ε_{k-1} }    (9)

Note that we use a different threshold ε_{k-1} for y_{k-1} in order to account for the difference between the loss functions ℓ_k and ℓ_{k-1}. We do not require the true risk of the intermediate concept to be low, because our objective is to analyze the achievable generalization with respect to predicting y_k.

3.3.2 Improved Generalization through Deep Supervision
A machine learning model for predicting y_k suffers from overfitting when the solution achieves low empirical risk over S but high true risk. In other words, the higher the probability P(f_k), the lower the chance that the trained model overfits S. One general strategy to reduce overfitting is to increase the diversity and size of the training set S. In this case, the denominator of Equation (6) decreases because fewer functions achieve low loss on more diverse data. In the following, we show that supervising an intermediate concept y_{k-1} of y_k at some hidden layer is similarly capable of removing some incorrect solutions in F_k and thus improves generalization, because P(f_k | y_{k-1}) ≥ P(f_k).

First, given an intermediate concept y_{k-1} of y_k, where y_{k-1} = g(y_k), we specify the following assumptions for our analysis.

Assumption 1. The neural network underlying our analysis is large enough to satisfy the universal approximation theorem [44] for the concepts of interest; that is, its hidden layers have sufficient learning capacity to approximate arbitrary functions.

Assumption 2. For a concept hierarchy, if f_k is a reasonable estimate of y_k, then g ∘ f_k should also be a reasonable estimate of the corresponding intermediate concept y_{k-1}. Formally, we assume:

  ℓ_k(y, y') ≤ ε  ⟹  ℓ_{k-1}(g(y), g(y')) ≤ ε_{k-1},  for all y, y' ∈ Y_k    (10)

where Y_k is the value space of concept y_k.

Assumption 3. For any f_k ∈ F_k, its first layers up to some depth can be extended to a function in F_{k-1}.

In practice, one may identify many tasks and relevant intermediate concepts satisfying Assumption 2 when using common loss functions ℓ_k and ℓ_{k-1}. We discuss this further in Section 3.3.3. Assumption 3 above follows from the first two in two steps. First, Assumption 1 allows us to find an f_k with low risk, and we can always construct an estimate of y_{k-1} from f_k through g by appending layers that implement g: g ∘ f_k. Second, Assumption 2 further extends this to any f_k ∈ F_k: its first layers can be used to obtain a member of F_{k-1}.
Given an intermediate concept y_{k-1} that satisfies the above assumptions, the following two propositions discuss how d_{k-1} (the supervision depth of y_{k-1}) affects the generalization ability of f_k in terms of P(f_k | y_{k-1}). First, we show that supervising intermediate concepts in the wrong order has no effect on improving generalization.

Proposition 1.
If d_{k-1} ≥ d_k, the generalization performance of f_k is not guaranteed to improve:

  P(f_k | y_{k-1}) = P(f_k)    (11)

Proof.
We first consider the case where y_{k-1} and y_k both supervise the same hidden layer: d_{k-1} = d_k. Given a sample set S and a function f_k which correctly predicts y_k, we can construct a predictor that yields the correct prediction for y_{k-1}. Based on Assumption 1, a multi-layer perceptron (i.e. fully connected layers) is able to represent any mapping function. Therefore, to approximate y_{k-1}, we can append fully connected layers which implement g to f_k, obtaining g ∘ f_k. Based on Assumption 2, for any function in F_k, there exists such a corresponding function which achieves empirical risk at most ε_{k-1} on y_{k-1}. This indicates that F_{k|k-1} = F_k, which in turn implies F*_{k|k-1} = F_k*. When d_{k-1} > d_k, the hidden layers from depth d_k to d_{k-1} can implement an identity mapping, and the analysis for the case d_{k-1} = d_k applies unchanged. As a consequence, Proposition 1 holds. ∎

Proposition 2.
There exists a depth d_{k-1} such that d_{k-1} < d_k and the generalization performance of f_k is improved:

  P(f_k | y_{k-1}) ≥ P(f_k)    (12)
Proof.
From Equations 4 and 8, we observe that F_{k|k-1} ⊆ F_k and F*_{k|k-1} ⊆ F_k*. Thus, we obtain:

  μ(F_{k|k-1}) ≤ μ(F_k)    (13)

Given a training set S, Equation 13 essentially means that the number of functions that simultaneously fit both y_{k-1} and y_k is no larger than the number of functions that fit y_k alone. Intuitively, as in the toy example earlier, the hidden layers of some network solutions for y_k yield incorrect predictions of the intermediate concept y_{k-1}, which implies μ(F_{k|k-1}) < μ(F_k) in practice. Subsequently, Assumption 3 suggests that there exist one or more depths d_{k-1} such that the first d_{k-1} layers of each solution in F_k* can be extended to a member of F_{k-1}. In other words, we can find a supervision depth d_{k-1} for y_{k-1} which satisfies:

  μ(F*_{k|k-1}) = μ(F_k*)    (14)

As a result, Proposition 2 is proved by Equation 13 and Equation 14. ∎
Thus, we can improve the generalization of f_k by inserting the supervision of y_{k-1} before depth d_k. As a consequence, given a concept hierarchy (y_1, …, y_N), the supervision depths of the concepts should be monotonically increasing: d_1 < d_2 < ⋯ < d_N. We then extend Equation 13 to incorporate all available intermediate concepts of the main task y_N:

  μ(F_{N|1,…,N-1}) ≤ μ(F_{N|1,…,N-2}) ≤ ⋯ ≤ μ(F_N)    (15)

As we report in Section 5, empirical evidence shows that more intermediate concepts often greatly improve the generalization performance of the main task, which suggests a large gap between the two sides of Equation 15. Similar to Equation 14, we still have:

  μ(F*_{N|1,…,N-1}) = μ(F_N*)    (16)

As a consequence, the generalization performance for the main task y_N, given its necessary conditions y_1, …, y_{N-1}, can be improved if we supervise each of them at an appropriate depth, with d_1 < ⋯ < d_N:

  P(f_N | y_1, …, y_{N-1}) ≥ P(f_N)    (17)

Furthermore, the generalization probability is monotonically non-increasing as intermediate concepts are removed: P(f_N | y_1, …, y_{N-1}) ≥ P(f_N | y_2, …, y_{N-1}) ≥ ⋯ ≥ P(f_N). The more concepts applied, the better the chance that generalization is improved. In conclusion, deep supervision with intermediate concepts regularizes the network training by decreasing the number of incorrect solutions that generalize poorly to the test set.
3.3.3 Discussion
Generalization of the Intermediate Concept. We generalize the notion of an intermediate concept using conditional probabilities: y_{k-1} is an ε-error necessary condition of y_k if there exists a deterministic function g such that samples (x, y_{k-1}, y_k) drawn from D satisfy:

  P( g(y_k) ≠ y_{k-1} ) ≤ ε    (18)

where 0 ≤ ε < 1. The strict necessary condition defined in Section 3.1 holds when ε = 0. When ε > 0, the monotonically increasing supervision order indicated by Equation 17 is no longer ensured. However, the architecture design suggested by our generalization analysis in Section 3.3.2 achieves the best performance in our empirical studies in Section 5. We believe that the analysis in Section 3.3.2 is a good approximation for the case of small ε in real applications. We leave the analytic quantification of how ε affects deep supervision to future work.
Assumption 2. If Assumption 2 does not hold, both the numerator and the denominator in Equation 7 decrease, by different amounts. As a consequence, we cannot obtain Proposition 1 in all cases. However, many commonly used loss functions satisfy this assumption for suitable thresholds. One simple example is when ℓ_k and ℓ_{k-1} are both indicator functions, i.e. ℓ(y, y') = 1[y ≠ y'] (note that the indicator function can be applied to both discrete and continuous values). In this case, ℓ_k(y, y') = 0 implies ℓ_{k-1}(g(y), g(y')) = 0, and thus Assumption 2 is satisfied. Another example is when ℓ_k and ℓ_{k-1} are both the L2 loss, i.e. ℓ(y, y') = ||y − y'||², and g is a projection, in which case ||g(y) − g(y')|| ≤ ||y − y'||.
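The indicator-loss case of Assumption 2 can be checked mechanically: whenever the fine-grained prediction is correct, the coarse prediction obtained through g is also correct, so the coarse risk never exceeds the fine risk. The labels below are hypothetical:

```python
# Check of the indicator-loss case in Assumption 2: with ell(y, y') = 1[y != y'],
# we have 1[g(y) != g(y')] <= 1[y != y'] pointwise, so the empirical risk of the
# coarse concept is bounded by that of the fine one. Labels are hypothetical.

g = {"cup": "container", "bowl": "container", "sedan": "vehicle"}

indicator = lambda pred, label: 0 if pred == label else 1

# (prediction, ground truth) pairs for the fine-grained task:
samples = [("cup", "cup"), ("bowl", "cup"), ("sedan", "sedan"), ("cup", "sedan")]

fine_risk = sum(indicator(p, y) for p, y in samples) / len(samples)
coarse_risk = sum(indicator(g[p], g[y]) for p, y in samples) / len(samples)

# ("bowl", "cup") is wrong at the fine level but correct at the coarse level,
# so the coarse risk is strictly smaller here; in general it is never larger.
assert coarse_risk <= fine_risk
assert all(indicator(g[p], g[y]) <= indicator(p, y) for p, y in samples)
```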
Uniform Probability over F_k. In practice, this assumption may seem to contradict some empirical studies such as [45], where common CNNs generalize well even after fitting large-scale training data (e.g. ImageNet [46]). This phenomenon actually demonstrates another dimension of improving generalization: training models on a large training set so that F_k shrinks and converges to F_k*. Our results show that deep supervision is an alternative route to generalization given limited training data, or data from a different domain, compared with standard supervision methods.

DSN as a special case. Since a task is also a necessary condition of itself, our deep supervision framework contains DSN [19] as a special case in which each intermediate concept is the main task itself. To illustrate the distinction enabled by our framework, consider mimicking DSN by setting the first intermediate concept y_1 = y_N, so that the first d_1 hidden layers are forced to directly predict the main task. Any such partial network can trivially be extended to a full network by forcing an identity mapping for the layers from d_1 to d_N. This suggests that the solution set for the main task is mainly constrained by the first d_1 layers. Therefore, even though the larger spatial support of deeper layers between d_1 and d_N reduces empirical risk in DSN, the learning capacity is restricted by supervising the main task at the first d_1 layers.
4 Implementation and Data
We apply our method to both object classification and keypoint localization. For object classification, we use a semantic hierarchy of labels to define intermediate concepts; for example, container is an intermediate concept (a generalization) of cup. For keypoint localization, we specify a 3D skeleton for each object class, where nodes (keypoints) represent semantic parts and their connections define the 3D object geometry. Given a single real RGB image of an object, our goal is to predict the 2D keypoint locations in image coordinates as well as normalized 3D coordinates, while inferring their visibility states. The x and y coordinates of 2D keypoint locations are normalized to [0, 1] along the image width and height, respectively. 3D keypoint coordinates are centered at the origin and scaled so that the longest dimension among the three axes has unit length. Note that 2D/3D keypoint locations and their visibility all depend on the specific object pose with respect to the camera viewpoint.
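The keypoint normalization described above can be sketched as follows; the function names are ours, and the axis-aligned-extent interpretation of the "longest dimension" is an assumption:

```python
import numpy as np

# Sketch of the keypoint normalization: normalize_2d maps pixel coordinates
# into [0, 1] by image width/height; normalize_3d centers the keypoints at
# the origin and scales the longest axis-aligned extent to unit length.

def normalize_2d(kps_2d, width, height):
    """kps_2d: (N, 2) array of (x, y) pixel coordinates."""
    return kps_2d / np.array([width, height], dtype=float)

def normalize_3d(kps_3d):
    """kps_3d: (N, 3) array; center at origin, longest dimension -> 1."""
    centered = kps_3d - kps_3d.mean(axis=0)
    extents = centered.max(axis=0) - centered.min(axis=0)
    return centered / extents.max()

kps = np.array([[0.0, 0.0, 0.0], [4.0, 2.0, 1.0]])
norm = normalize_3d(kps)
assert np.allclose(norm.mean(axis=0), 0.0)                      # centered
assert np.isclose((norm.max(0) - norm.min(0)).max(), 1.0)       # unit longest dim
assert np.allclose(normalize_2d(np.array([[320.0, 120.0]]), 640, 240), [[0.5, 0.5]])
```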
To set up the concept hierarchy for 2D/3D keypoint localization, we have chosen, in order: object orientation, which is needed to predict keypoint visibility; keypoint visibility, which roughly depicts the 3D structure; 3D keypoint locations; and finally 2D keypoint locations, including ones that are not visible from the current viewpoint. We impose the supervision of this concept hierarchy on a CNN as shown in Fig. 4 and minimize Equation 1 to compute the network parameters.
We emphasize that the above is not a strict (zero-error) concept hierarchy, because object pose and 3D keypoint locations are not strict necessary conditions for visibility and 2D keypoint locations, respectively. However, we posit that the corresponding residuals (the ε's of Section 3.3.3) are small. First, knowing the object pose constrains keypoint visibilities to such an extent that prior work has chosen to use ensembles of 2D templates for visual object parsing [47, 42]. Second, there is a long and fruitful tradition in computer vision, starting from Marr's seminal ideas [3], of leveraging 3D object representations as a tool for 2D recognition. In sum, our present choice of concepts is an approximate realization of a zero-error concept hierarchy, which nonetheless draws inspiration from our analysis and works well in practice.
4.1 Network Architecture
In this section, we detail the network structure for keypoint localization. Our network resembles the VGG network [48] and consists of deeply stacked convolutional layers. Unlike VGG, we remove local spatial pooling between convolutional layers, motivated by the intuition that spatial pooling leads to a loss of spatial information. Further, we couple each convolutional layer with batch normalization [49] and ReLU, which defines the hidden-layer functions h_i of Section 3.2. The output branch at each supervision depth is constructed with one global average pooling (GAP) layer followed by one fully connected (FC) layer whose number of neurons matches the dimension of the task label, in contrast to the stacked FC layers in VGG. The GAP layer averages filter responses over all spatial locations within the feature map. In Table III of Section 5.2.1, we empirically show that these two changes are critical to significantly improving the performance of VGG-like networks for 2D/3D landmark localization.
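A minimal sketch of such an output branch, with illustrative shapes and random weights standing in for learned parameters:

```python
import numpy as np

# Sketch of an output branch c_k: global average pooling (GAP) over the
# spatial dimensions of a feature map, followed by one fully connected layer.
# All shapes and weights below are illustrative assumptions.

def gap(feature_map):
    """feature_map: (C, H, W) -> (C,) by averaging over all spatial locations."""
    return feature_map.mean(axis=(1, 2))

def fc(x, W, b):
    """Single fully connected layer: (C,) -> (K,)."""
    return W @ x + b

rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 6, 6))     # C=8 channels, 6x6 spatial map
W = rng.standard_normal((3, 8))           # K=3 output neurons (e.g. 3 coordinates)
b = np.zeros(3)

pooled = gap(fmap)
out = fc(pooled, W, b)
assert pooled.shape == (8,)
assert out.shape == (3,)
# GAP of a constant feature map is that constant, per channel:
assert np.allclose(gap(np.full((2, 4, 4), 5.0)), [5.0, 5.0])
```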
We follow the common practice of employing dropout [50] layers between the convolutional layers as an additional means of regularization. At selected layers, we perform downsampling using strided convolutional layers rather than pooling. The bottom-left of Figure 4 illustrates the details of our network architecture, including the total number of convolutional layers deployed; “(Conv-A)xB” means A stacked convolutional layers with filters of size BxB. We use an L2 loss at all points of supervision. In practice, we only consider the azimuth angle of the object viewpoint with respect to a canonical pose. We further discretize the azimuth angle into a fixed number of bins and regress it to a one-hot encoding (the entry corresponding to the predicted discretized pose is set to 1 and all others to 0). Keypoint visibility is also represented by a binary vector, with each entry
indicating the occluded state of a keypoint. During training, each loss is backpropagated to train the network jointly.

4.2 Synthetic Data Generation
Our approach needs a large amount of training data because it is based on deep CNNs. It also requires finer-grained labels than many visual tasks such as object detection. Furthermore, we aim for the method to work in heavily cluttered scenes. Therefore, we generate synthetic images that simulate realistic occlusion configurations involving multiple objects in close proximity. To our knowledge, rendering cluttered scenes that comprise multiple CAD models is a novelty of our approach, although earlier work [42, 33] used real image cutouts for bounding-box-level localization.
An overview of the rendering process is shown in the upper-left of Fig. 4. We pick a small subset of CAD models from ShapeNet [51] for a given object category and manually annotate 3D keypoints on each CAD model. Next, we render each CAD model via Blender with randomly sampled graphics parameters, including camera viewpoint, number and strength of light sources, and surface gloss reflection. Finally, we follow [22] to overlay the rendered images on real backgrounds to avoid overfitting. We crop the object from each rendered image and extract the object viewpoint, 2D/3D keypoint locations and their visibility states from Blender as the training labels. In Figure 4 (right), we show an example rendering and its 2D/3D annotations.
To model multi-object occlusion, we randomly select two different object instances and place them close to each other without overlap in 3D space. During rendering, we compute the occlusion ratio of each instance by calculating the fraction of the visible 2D area versus the complete 2D projection of the CAD model. Keypoint visibility is computed by ray tracing. We select instances whose occlusion ratios lie within a specified range. Fig. 5 shows two training examples where cars are occluded by other nearby cars. For truncation, we randomly select two image boundaries (left, right, top, or bottom) of the object and shift them by a fraction of the image size along that dimension.
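The occlusion-ratio computation described above can be sketched from binary masks. This is an illustrative reading that treats the ratio as the hidden fraction of the full projection (one minus visible over full); the mask representation and function names are assumptions, not the paper's implementation:

```python
def occlusion_ratio(full_mask, visible_mask):
    """Fraction of an instance's full 2D projection that is hidden.

    full_mask: 2D list of 0/1 covering the instance's complete projection.
    visible_mask: same-size 2D list of 0/1 marking pixels that remain
    visible after other objects are rendered in front of the instance.
    """
    full = sum(sum(row) for row in full_mask)
    visible = sum(sum(row) for row in visible_mask)
    return 1.0 - visible / full if full else 0.0
```

Instances whose ratio falls outside the chosen range would then be discarded before training.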
5 Experiments
We first present an empirical study of the image classification problem on CIFAR-100 [8], where a strict concept hierarchy is applied to boost fine-grained object classification performance. Subsequently, we extensively demonstrate competitive or superior performance for 2D/3D keypoint localization over several state-of-the-art methods on multiple datasets: KITTI-3D, PASCAL VOC, PASCAL3D+ [16] and IKEA [17].
5.1 CIFAR-100
Table II
Methods                   Error (%)
DSN [19]                  34.57
FitNet, LSUV [52]         27.66
ResNet-1001 [53]          27.82
pre-act ResNet-1001 [54]  22.71
plain-single              23.31
plain-all                 23.26
DISCO-random              27.53
DISCO                     22.46
The image classification problem has a natural concept hierarchy where object categories can be progressively partitioned from coarse to fine granularity. In this section, we exploit coarse-grained class labels (20 classes) from CIFAR-100 [8] to assist fine-grained recognition into 100 classes. Most existing methods directly learn a model for the fine-grained classification task while ignoring coarse-grained labels. In contrast, we leverage coarse-grained labels as an intermediate concept in our formulation. We use the same network architecture shown in Section 4.1 but with only 20 layers, with a different number of filters for layers 1-5, 6-10 and 11-20. Downsampling is performed at layers 6 and 11, and the coarse-grained label supervises layer 16.
Table II compares the error of DISCO with the state of the art and with variants of DISCO. We use plain-single and plain-all to denote the networks supervised with only the fine-grained label, and with both labels at the last layer, respectively. DISCO-random uses a (fixed) random coarse-grained class label for each training image. We observe that plain-all achieves roughly the same performance as plain-single, which replicates our earlier finding (Section 5.2.1) that an intermediate supervision signal applied at the same layer as the main task helps relatively little in generalization. However, DISCO is able to reduce the error of plain-single by using the intermediate supervision signal at an intermediate layer. These results support our derivation of Proposition 1 and Proposition 2 in Section 3.3. Further, DISCO-random is significantly inferior to DISCO, as a random intermediate concept makes training more difficult. Finally, DISCO slightly outperforms the current state of the art, “pre-act ResNet-1001” [54], on image classification while using only half of the network parameters of [54].
5.2 2D and 3D Keypoint Localization
Table III
Method           2D Full  2D Trunc.  2D Multi-Car Occ  2D Other Occ  2D All  3D Full  3D-yaw Full
DDN [39]         67.6     27.2       40.7              45.0          45.1    NA       NA
WN-gtyaw* [40]   88.0     76.0       81.0              82.7          82.0    NA       NA
Zia et al. [10]  73.6     NA         NA                NA            NA      73.5     7.3
DSN-2D           45.2     48.4       31.7              24.8          37.5    NA       NA
DSN-3D           NA       NA         NA                NA            NA      68.3     12.5
plain-2D         88.4     62.6       72.4              71.3          73.7    NA       NA
plain-3D         NA       NA         NA                NA            NA      90.6     6.5
plain-all        90.8     72.6       78.9              80.2          80.6    92.9     3.9
DISCO-3D-2D      90.1     71.3       79.4              82.0          80.7    94.3     3.1
DISCO-vis-3D-2D  92.3     75.7       81.0              83.4          83.4    95.2     2.3
DISCO-(3D-vis)   91.9     77.6       82.2              86.1          84.5    94.2     2.3
DISCO-reverse    30.4     29.7       22.8              19.6          25.6    54.8     13.0
DISCO-VGG        83.5     59.4       70.1              63.1          69.0    89.7     6.8
DISCO            93.1     78.5       82.9              85.3          85.0    95.3     2.2
DISCO(Det)       95.9     78.9       87.7              90.5          88.3    95.5     2.1
Table IV
Training Data (combinations of Full / Trunc. / MultiCar)   Test: Full  Test: Trunc.  Test: Occ.
                                                           91.8        53.6          68.3
                                                           89.9        73.8          61.7
                                                           91.3        74.7          82.7
                                                           92.9        71.3          63.4
                                                           92.5        73.2          84.1
                                                           90.5        70.4          81.2
                                                           93.1        78.5          83.2
In this section, we demonstrate the performance of the deep supervision network (Fig. 4) for predicting the locations of object keypoints in 2D images and in 3D space.
Dataset. For data synthesis, we sample CAD models of cars, sofas, chairs and beds from ShapeNet [51]. Each car model is annotated with the keypoints of [10] and each furniture model (chair, sofa or bed) with the keypoints of [16] (we use 10 keypoints consistent with [27] to evaluate chair and bed on IKEA). We synthesize 600k car images, including occluded instances, and 300k images of fully visible furniture (chair+sofa+bed). We reserve the rendered images of 5 CAD models from each object category as the validation set.
We introduce KITTI-3D, with annotations of 3D keypoints and occlusion types on car images from [18]. We label each car image with one of four occlusion types: no occlusion (fully visible cars), truncation, multi-car occlusion (the target car is occluded by other cars) and occlusion caused by other objects.
To obtain 3D ground truth for these car images, we fit a PCA model trained on the 3D keypoint annotations of our CAD data, by minimizing the 2D projection error with respect to the known 2D landmarks provided by Zia et al. [10] and the object pose from KITTI [18]. First, we compute the mean shape $\bar{S}$ and principal components $V_i$ from the 3D skeletons of our annotated CAD models, where $\bar{S}$ and $V_i$ are matrices whose columns contain the 3D coordinates of the keypoints. The 3D object structure is thus represented as $S(\alpha) = \bar{S} + \sum_i \alpha_i V_i$, where $\alpha_i$ is the weight for $V_i$. To avoid distorted shapes caused by large $|\alpha_i|$, we constrain each $\alpha_i$ to lie within a fixed multiple of $\sigma_i$, the standard deviation along the $i$-th principal component direction. Next, given the ground-truth pose $\{R, T\}$, we compute the 3D structure coefficients $\alpha^*$ that minimize the projection error with respect to the 2D ground truth $\tilde{Y}$:

$$\alpha^* = \arg\min_{\alpha}\; \big\| h\big( K (R\, S(\alpha) + T) \big) - \tilde{Y} \big\|_2^2 \qquad (19)$$

where the camera intrinsic matrix $K$ encodes the scaling and shifting, and $h(\cdot)$ computes the 2D image coordinates from 2D homogeneous coordinates. In practice, to obtain ground truth of even higher quality, we densely sample object poses $\{R, T\}$ in the neighborhood of the annotated pose and solve (19) by optimizing $\alpha$ for each fixed pose, keeping the solution with the lowest error among all sampled poses. We only provide 3D keypoint labels for fully visible cars, because most occluded or truncated cars lack enough visible 2D keypoints and would thus yield rather crude 3D estimates.
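The PCA shape model used in this fitting (mean shape plus weighted principal components) can be sketched in a few lines. This is an illustrative pure-Python version with each keypoint stored as an [x, y, z] list; the actual optimization over the coefficients and the pose search are omitted:

```python
def reconstruct_shape(mean_shape, basis, alphas):
    """PCA shape model: S = S_bar + sum_i alpha_i * V_i.

    mean_shape: list of K keypoints, each [x, y, z].
    basis: list of principal components, each a list of K [x, y, z] offsets.
    alphas: per-component weights (constrained in the paper to a fixed
    multiple of each component's standard deviation).
    """
    shape = [list(p) for p in mean_shape]  # copy so the mean is untouched
    for a, component in zip(alphas, basis):
        for k, offset in enumerate(component):
            for d in range(3):
                shape[k][d] += a * offset[d]
    return shape
```

Fitting then amounts to searching over the weights (and sampled poses) for the reconstruction whose 2D projection best matches the annotated landmarks.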
Evaluation metric. We use the PCK and APK metrics [56] to evaluate 2D keypoint localization. A 2D keypoint prediction is correct when it lies within a radius $\alpha \cdot L$ of the ground truth, where $L$ is the maximum of image height and width and $\alpha$ controls the tolerance. PCK is the percentage of correct keypoint predictions given the object location and keypoint visibility. APK is the mean average precision of keypoint detection, computed by associating each estimated keypoint with a confidence score. In our experiments, we use the regressed values of keypoint visibility as confidence scores. We extend the 2D PCK and APK metrics to 3D by defining a 3D keypoint prediction as correct when its Euclidean distance to the ground truth, in normalized coordinates, is below the corresponding threshold.
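The PCK metric described above can be sketched as follows. The default tolerance `alpha=0.1` is a common setting in the literature, assumed here since the exact value is elided in the text:

```python
def pck(predictions, ground_truth, visibility, img_h, img_w, alpha=0.1):
    """Percentage of Correct Keypoints.

    A 2D prediction is correct when it lies within alpha * max(height,
    width) of the ground truth; only keypoints marked visible are scored.
    predictions/ground_truth: lists of (x, y); visibility: list of 0/1.
    """
    thresh = alpha * max(img_h, img_w)
    correct = total = 0
    for (px, py), (gx, gy), vis in zip(predictions, ground_truth, visibility):
        if not vis:
            continue  # PCK is conditioned on keypoint visibility
        total += 1
        if ((px - gx) ** 2 + (py - gy) ** 2) ** 0.5 <= thresh:
            correct += 1
    return 100.0 * correct / total if total else 0.0
```

APK differs in that every detection carries a confidence score (here, the regressed visibility) and the curve of precision over recall is averaged instead.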
Training details. We set separate loss weights for visibility, 3D and 2D keypoint locations, and for object pose. We use stochastic gradient descent with momentum to train the proposed CNN from scratch. Our learning rate starts at a fixed value and decreases by one-tenth whenever the validation error reaches a plateau. We apply weight decay, resize all input images to a fixed square resolution, and use a fixed batch size. We initialize all weights using the method of Glorot and Bengio [57]. For car model training, we form each batch using a fixed mixture of fully visible, truncated and occluded cars. For furniture, each batch consists of fully visible and truncated objects randomly sampled from the joint synthetic image set of chair, sofa and bed.
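The plateau-based learning-rate schedule can be sketched as below. The text only specifies the one-tenth decay factor; the patience window and improvement tolerance are illustrative assumptions:

```python
def maybe_decay_lr(lr, val_errors, patience=5, factor=0.1, eps=1e-4):
    """Decrease the learning rate by `factor` (one-tenth) when validation
    error has not improved for `patience` epochs.

    val_errors: history of validation errors, one entry per epoch.
    Returns the (possibly decayed) learning rate for the next epoch.
    """
    if len(val_errors) > patience:
        recent_best = min(val_errors[-patience:])
        earlier_best = min(val_errors[:-patience])
        if recent_best > earlier_best - eps:  # no meaningful improvement
            return lr * factor
    return lr
```

Called once per epoch, this reproduces the "decrease by one-tenth on plateau" behavior without any framework-specific scheduler.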
5.2.1 KITTI-3D
We compare our method with DDN [39] and WarpNet [40] for 2D keypoint localization, and with Zia et al. [10] for 3D structure prediction, using the original source code of each method. WarpNet is a siamese architecture which warps a reference image to a test image, benefiting from class-aware training. In order to use it for the landmark transfer task, we need a reference image to be warped. Thus, we retrieve labeled synthetic car images with the same pose as the test image for landmark transfer using the CNN architecture proposed in [40] (WN-gtyaw), and then compute the median of the predicted landmark locations as the final result. The network is trained to warp pairs of synthetic car images in similar poses. Additionally, we perform an ablative analysis of DISCO. First, we replace all intermediate supervision signals with the final labels, as DSN [19] does, for 2D (DSN-2D) and 3D (DSN-3D) structure prediction. Next, we incrementally remove the deep supervision used in DISCO: DISCO-vis-3D-2D, DISCO-3D-2D, plain-3D and plain-2D are networks without pose, pose+visibility, pose+visibility+2D and pose+visibility+3D supervision, respectively. Further, we change the locations of the intermediate supervision signals: plain-all shifts all supervision signals to the final convolutional layer, DISCO-(3D-vis) swaps 3D and visibility in DISCO, and DISCO-reverse reverses the entire order of supervision in DISCO. Finally, DISCO-VGG replaces the stride-based downsampling and GAP in DISCO with non-overlapping spatial pooling and a fully connected layer, respectively. All methods are trained on the same set of synthetic training images and tested on real cars cropped at ground-truth locations in KITTI-3D.
In Table III, we report PCK accuracies for the various methods (we cannot report Zia et al. [10] on occluded data because only a subset of those images has valid results), as well as the mean error of the estimated yaw angle (“3D-yaw”) over all fully visible cars. This object-centric yaw angle is computed by projecting all 3D keypoints onto the ground plane and averaging the directions of the lines connecting corresponding keypoints on the left and right sides of the car. The 3D-yaw error is then the mean absolute difference between the estimated yaw and the ground truth.
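The object-centric yaw computation described above can be sketched as follows. The axis convention (y as the height axis, so the ground-plane projection keeps x and z) and the circular averaging of directions are our assumptions:

```python
import math

def yaw_from_keypoints(left_pts, right_pts):
    """Estimate object-centric yaw from paired left/right 3D keypoints.

    Projects each left-right correspondence onto the ground plane and
    averages the directions of the connecting lines. Points are (x, y, z)
    tuples with y assumed to be the height axis.
    """
    angles = []
    for (lx, _, lz), (rx, _, rz) in zip(left_pts, right_pts):
        angles.append(math.atan2(rz - lz, rx - lx))
    # Average directions via unit vectors to handle angle wrap-around.
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    return math.atan2(s, c)
```

Averaging over several correspondences makes the estimate robust to noise in any single predicted keypoint.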
Table V (PCK in %; last column reports APK)
Method         Full   Full (larger α)  Occluded  Big Image  Small Image  All (APK)
Long [58]      55.7   NA               NA        NA         NA           NA
VpsKps [38]    81.3   88.3             62.8      90.0       67.4         40.3
DSN-2D         75.4   87.8             54.5      85.5       63.3         NA
plain-2D       76.7   90.6             50.4      80.6       69.4         NA
plain-all      75.9   90.4             53.0      82.4       65.1         41.7
DISCO-reverse  64.5   84.5             41.2      55.5       67.0         24.9
DISCO-3D-2D    81.5   92.0             61.0      87.6       73.1         NA
DISCO          81.8   93.4             59.0      87.7       74.3         45.4
We observe that DISCO outperforms its competitors in both 2D and 3D keypoint localization across all occlusion types. Moreover, we observe a monotonic increase in 2D and 3D accuracy with increasing supervision: plain-2D (or plain-3D) < DISCO-3D-2D < DISCO-vis-3D-2D < DISCO. Further, plain-all is superior to plain-2D and plain-3D, while DISCO exceeds plain-all on both 2D-All and 3D-Full. These experiments confirm that jointly modeling the 3D shape concepts is better than modeling them independently. Moreover, the alternative supervision orders (DISCO-reverse, DISCO-(3D-vis)) are found to be inferior to the proposed order, which captures the underlying structure between shape concepts. Last, DISCO-VGG performs significantly worse than DISCO on 2D-All and 3D-Full, which validates our replacement of local spatial pooling with global average pooling. In conclusion, the proposed deep supervision architecture coupled with intermediate shape concepts improves the generalization ability of the CNN: as more concepts are introduced in the “correct” order, we observe improved performance.
We also conduct an ablative study of training data with different occlusion types. Table IV reports 2D keypoint localization accuracies over the different occlusion categories of KITTI-3D given various combinations of training data. “Occ.” stands for test examples with multi-object occlusions, where the occluder is either another car or a different object such as a pedestrian. DISCO trained on fully visible cars alone achieves much worse performance on truncated and occluded test data than when trained on data with simulated truncation and multi-car occlusion. We observe that multi-car occlusion data is also helpful in modeling truncation cases: the network trained on multi-car data obtains the second-best result on truncated cars. The best overall performance is obtained by including all three types of examples (no occlusion, multi-car occlusion, truncation), emphasizing the efficacy of our data generation strategy.
Finally, we evaluate DISCO on detection bounding boxes computed by RCNN [55] with sufficient IoU overlap with the KITTI-3D ground truth. “DISCO(Det)” in the last row of Table III shows the PCK accuracies of DISCO using these detection results. The 2D/3D keypoint localization accuracies even exceed those of DISCO using ground-truth bounding boxes, on both 2D-All and 3D-All.
5.2.2 PASCAL VOC
We evaluate DISCO on the PASCAL VOC 2012 dataset for 2D keypoint localization [56]. Unlike KITTI-3D, where car images are captured on real roads and mostly at low resolution, PASCAL VOC contains car images with larger appearance variations and heavy occlusions. In Table V, we compare our results with the state of the art [38, 58] on various subsets of the test set: fully visible cars (denoted “Full”), occluded cars, and high-resolution and low-resolution images. Please refer to [38] for details of the test setup. Note that these methods [38, 58] are trained on real images, whereas DISCO training exclusively leverages synthetic data.
We observe that DISCO outperforms [38] on PCK at both thresholds. In addition, DISCO is robust to low-resolution images, improving accuracy on the low-resolution set compared with [38]. This is critical in real perception scenarios where distant objects appear small in images of street scenes. However, DISCO is inferior on the occluded car class and on high-resolution images, attributable to our use of small images for training and to the fact that our occlusion simulation does not capture the complex occlusions created by non-car objects such as walls and trees. Finally, we compute APK accuracy for DISCO on the same detection candidates used in [38] (we run the source code of [38] to obtain the same object candidates). We can see that DISCO outperforms [38] on the entire car dataset (Full+Occluded). This suggests DISCO is more robust to noisy detection results and more accurate at keypoint visibility inference than [38]. We attribute this to the global structure modeling of DISCO during training, where the full set of 2D keypoints resolves partial-view ambiguity, whereas traditional methods like [38] are only supervised with visible 2D keypoints.
Note that some definitions of our car keypoints [10] differ slightly from [56]. For example, we annotate the bottom corners of the front windshield whereas [56] labels the side mirrors. In our experiments, we ignore this annotation inconsistency and directly assess the prediction results. We re-emphasize that, unlike [58, 38], we do not use the PASCAL VOC training set; thus, even better performance is expected when real images with consistent labels are used for training.
5.2.3 PASCAL 3D+
Table VI
Method             CAD alignment GT  Manual GT
VDPM-16 [16]       NA                51.9
Xiang et al. [59]  64.4              64.3
Random CAD [16]    NA                61.8
GT CAD [16]        NA                67.3
DSN-2D             66.4              63.3
plain-2D           67.4              64.3
plain-all          66.8              64.2
DISCO-reverse      54.2              56.0
DISCO              71.2              67.6
PASCAL3D+ [16] provides object viewpoint annotations for PASCAL VOC objects, obtained by manually aligning 3D object CAD models onto the visible 2D keypoints. Because only a few CAD models are used for each category, the 3D keypoint locations are only approximate. Thus, we use the evaluation metric proposed by [16], which measures the 2D overlap (IoU) against the projected model mask. Given the 3D skeleton of an object, we create a coarse object mesh based on its geometry and compute segmentation masks by projecting the coarse mesh surfaces onto the 2D image based on the estimated 2D keypoint locations.
Table VI reports object segmentation accuracies on two types of ground truth. The column “Manual GT” uses the manual pixel-level annotations provided by PASCAL VOC 2012, whereas “CAD alignment GT” uses 2D projections of aligned CAD models as ground truth. Note that “CAD alignment GT” covers the entire object extent in the image, including regions occluded by other objects. DISCO significantly outperforms a state-of-the-art method [33] on both benchmarks, despite using only synthetic data for training. Moreover, on the “Manual GT” benchmark, we compare DISCO with “Random CAD” and “GT CAD”, which stand for the projected segmentations of a randomly selected and of the ground-truth CAD model, respectively, given the ground-truth object pose. We find that DISCO outperforms even “GT CAD”. This provides evidence that jointly modeling the 3D geometry manifold and the viewpoint is better than a pipeline of object retrieval plus alignment. Finally, we note that a forward pass of DISCO takes less than 10 ms at test time, which is far more efficient than sophisticated CAD alignment approaches [10] that usually need more than 1 s per input image.
5.2.4 IKEA
Table VII
Method       Sofa Recall  Sofa PCK  Chair Recall  Chair PCK  Bed Recall  Bed PCK
3D-INN [27]  88.0         31.0      87.8          41.4       88.6        42.3
DISCO        84.4         37.9      90.0          65.5       87.1        55.0
In this section, we evaluate DISCO on the IKEA dataset [17] with the 3D keypoint annotations provided by [27]. One remaining question for the DISCO network is whether it is capable of learning 3D object geometry for multiple object classes simultaneously. Therefore, we train a single DISCO network from scratch which jointly models three furniture classes: sofa, chair and bed. At test time, we compare DISCO with the state-of-the-art 3D-INN [27] on IKEA. Since 3D-INN evaluates the error of 3D structure prediction in the object's canonical pose, we align the PCA bases of both the estimated 3D keypoints and their ground truth. Table VII reports the PCK and average recall [27] (mean PCK over densely sampled thresholds) of 3D-INN and DISCO on all furniture classes; the corresponding PCK curves are visualized in Figure 6. We retrieve the PCK accuracies of 3D-INN on the IKEA dataset from its publicly released results. DISCO significantly outperforms 3D-INN on PCK for sofa, chair and bed, meaning that DISCO obtains more correct keypoint location predictions than 3D-INN. This substantiates that directly exploiting the rich visual details in images, as DISCO does, is critical to inferring more accurate and fine-grained 3D structure than lifting sparse 2D keypoints to 3D shapes as 3D-INN does. However, DISCO is inferior to 3D-INN in terms of average recall on the sofa and bed classes. As shown in Figure 5(a), the incorrect predictions by DISCO deviate more from the ground truth than those of 3D-INN. This is mainly because the 3D shapes predicted by 3D-INN are constrained by shape bases, so even incorrect estimates retain realistic object shapes when recognition fails. Moreover, our 3D keypoint labeling for the sofa CAD models differs slightly from [27].
We annotate the corners of the reachable seating area of a sofa, while IKEA labels the corners of the outer volume parallel to the seating area. We conclude that DISCO is able to learn 3D patterns of object classes beyond the car category and shows potential as a general-purpose approach to jointly model the 3D geometric structure of multiple object classes in a single model.
5.2.5 Qualitative Results
In Figure 7, we visualize example predictions from DISCO on KITTI-3D (left column) and PASCAL VOC (right column). From left to right, each column shows the original object image, the predicted 2D object skeleton with instance segmentation, and the predicted 3D object skeleton with visibility. From top to bottom, we show results under no occlusion (rows 1-2), truncation (rows 3-4), multi-car occlusion (rows 5-6), other occluders (rows 7-8) and failure cases (row 9). We observe that DISCO is able to localize 2D and 3D keypoints on real images with complex occlusion scenarios and diverse car models such as sedans, SUVs and pickups. Moreover, the visibility inference is mostly correct. These capabilities highlight the potential of DISCO as a building block for holistic scene understanding in cluttered scenes. In the failure cases, the left car is mostly occluded by another object and the right one is severely truncated and distorted in projection. We may improve the performance of DISCO on these challenging cases by exploiting more sophisticated data simulation with complex occlusions
[60] and by fine-tuning DISCO on real data. In addition, we qualitatively compare 3D-INN and DISCO on the three categories of the IKEA dataset in Figure 8. For the chair, 3D-INN fails to delineate the inclined seatbacks in the example images, while DISCO captures this structural nuance. For the sofa, DISCO correctly infers the locations of the sofa armrests, whereas 3D-INN merges the armrests into the seating area or predicts an incorrect seatback size. Finally, DISCO yields better estimates of the scale of the bed legs than 3D-INN. We attribute this relative success of DISCO to its direct mapping from image evidence to 3D structure, as opposed to lifting 2D keypoint predictions to 3D.
6 Conclusion
Visual perception often involves sequential inference over a series of intermediate goals of growing complexity towards the final objective. In this paper, we have employed a probabilistic framework to formalize the notion of intermediate concepts, which points to better generalization through deep supervision compared to standard end-to-end training. This inspires a CNN architecture where hidden layers are supervised with an intuitive sequence of intermediate concepts, in order to incrementally regularize the learning to follow the prescribed inference sequence. We practically leveraged this superior generalization capability to address the scarcity of 3D annotation: learning shape patterns from synthetic training images with complex multiple-object configurations. Our experiments demonstrate that our approach outperforms current state-of-the-art methods on 2D and 3D landmark prediction on public datasets, even under occlusion and truncation. We also applied deep supervision to fine-grained image classification and showed significant improvement over single-task as well as multi-task networks on CIFAR-100. Finally, we have presented preliminary results on jointly learning the 3D geometry of multiple object classes within a single CNN. Our future work will extend this direction by learning shared representations for diverse object classes. We also see wide applicability of deep supervision, even beyond computer vision, in domains such as robotic planning, scene physics inference, and generally wherever deep neural networks are being applied. Another future direction is to extract label relationship graphs from the CNN supervised with intermediate concepts, as opposed to the explicitly constructed Hierarchy and Exclusion graphs of [23].
Acknowledgments
This work was part of C. Li’s intern project at NEC Labs America, in Cupertino. We acknowledge the support by NSF under Grant No. 1227277. We also thank Rene Vidal, Alan Yuille, Austin Reiter and Chong You for helpful discussions.
References
 [1] B. J. Smith, “Perception of Organization in a Random Stimulus,” 1986.
 [2] D. Lowe, Perceptual Organization and Visual Recognition. Kluwer, 1985.
 [3] D. Marr and H. K. Nishihara, “Representation and recognition of the spatial organization of three-dimensional shapes,” Proceedings of the Royal Society of London B, 1978.
 [4] R. Mohan and R. Nevatia, “Using perceptual organization to extract 3D structures,” PAMI, 1989.
 [5] S. Sarkar and P. Soundararajan, “Supervised learning of large perceptual organization,” PAMI, 2000.
 [6] Y. S. AbuMostafa, “Hints,” Neural Computation, 1995.
 [7] C. Li, M. Z. Zia, Q.H. Tran, X. Yu, G. D. Hager, and M. Chandraker, “Deep supervision with shape concepts for occlusionaware 3d object parsing,” 2017.
 [8] A. Krizhevsky, “Learning Multiple Layers of Features from Tiny Images,” University of Toronto, Technical Report, Chapter 3, 2009.
 [9] Y. Xiang and S. Savarese, “Object detection by 3d aspectlets and occlusion reasoning,” in 3dRR, 2013.
 [10] M. Z. Zia, M. Stark, and K. Schindler, “Towards Scene Understanding with Detailed 3D Object Representations,” IJCV, 2015.
 [11] A. Kar, S. Tulsiani, J. Carreira, and J. Malik, “Categoryspecific object reconstruction from a single image,” in CVPR, 2015.
 [12] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. B. Tenenbaum, “Deep convolutional inverse graphics network,” in NIPS, 2015.
 [13] T. Wu, B. Li, and S.C. Zhu, “Learning AndOr Model to Represent Context and Occlusion for Car Detection and Viewpoint Estimation,” PAMI, 2016.
 [14] T. Zhou, P. Krähenbühl, M. Aubry, Q. Huang, and A. A. Efros, “Learning Dense Correspondence via 3Dguided Cycle Consistency,” 2016.
 [15] H.J. Lee and Z. Chen, “Determination of 3D human body postures from a single view,” CVGIP, 1985.
 [16] Y. Xiang, R. Mottaghi, and S. Savarese, “Beyond PASCAL: A Benchmark for 3D Object Detection in the Wild,” in WACV, 2014.
 [17] J. J. Lim, H. Pirsiavash, and A. Torralba, “Parsing IKEA Objects: Fine Pose Estimation,” in ICCV, 2013.
 [18] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite,” in CVPR, 2012.
 [19] C.Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, “DeeplySupervised Nets,” AISTATS, 2015.
 [20] R. Caruana, “Multitask learning.” Springer, 1998.
 [21] Z. Zhang, P. Luo, C. C. Loy, and X. Tang, “Facial landmark detection by deep multitask learning,” in ECCV, 2014.
 [22] H. Su, C. R. Qi, Y. Li, and L. J. Guibas, “Render for CNN: Viewpoint estimation in images using CNNs trained with Rendered 3D model views,” in ICCV, 2015.
 [23] J. Deng, N. Ding, Y. Jia, A. Frome, K. Murphy, S. Bengio, Y. Li, H. Neven, and H. Adam, “LargeScale Object Classification Using Label Relation Graphs,” in ECCV, 2014.
 [24] J. Baxter, “A model of inductive bias learning,” JAIR, 2000.
 [25] A. Maurer, M. Pontil, and B. RomeraParedes, “The benefit of multitask representation learning,” JMLR, 2016.
 [26] Ç. Gülçehre and Y. Bengio, “Knowledge matters: Importance of prior information for optimization,” JMLR, 2016.
 [27] J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Torralba, and W. T. Freeman, “Single Image 3D Interpreter Network,” 2016.
 [28] A. Dosovitskiy, J. Springenberg, and T. Brox, “Learning to Generate Chairs with Convolutional Neural Networks,” in CVPR, 2015.
 [29] M. Tatarchenko, A. Dosovitskiy, and T. Brox, “Multiview 3D Models from Single Images with a Convolutional Network,” in ECCV, 2016.
 [30] P. Moreno, C. K. Williams, C. Nash, and P. Kohli, “Overcoming occlusion with inverse graphics.” in ECCV, 2016.
 [31] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese, “3DR2N2: A Unified Approach for Single and Multiview 3D Object Reconstruction,” in ECCV, 2016.

 [32] D. Rezende, S. Eslami, P. Battaglia, M. Jaderberg, and N. Heess, “Unsupervised learning of 3D structure from images,” 2016.
 [33] Y. Xiang, W. Choi, Y. Lin, and S. Savarese, “Datadriven 3D voxel patterns for object category recognition,” in CVPR, 2015.
 [34] M. Aubry, D. Maturana, A. Efros, B. Russell, and J. Sivic, “Seeing 3D chairs: Exemplar partbased 2D3D alignment using a large dataset of CAD models,” in CVPR, 2014.
 [35] J. J. Lim, A. Khosla, and A. Torralba, “FPM: Fine pose Partsbased Model with 3D CAD models,” in ECCV, 2014.
 [36] A. Bansal, B. Russell, and A. Gupta, “Marr revisited: 2D3D Alignment via Surface Normal Prediction,” in CVPR, 2016.
 [37] S. Gupta, P. Arbeláez, R. Girshick, and J. Malik, “Inferring 3d object pose in RGBD images,” arXiv:1502.04652, 2015.
 [38] S. Tulsiani and J. Malik, “Viewpoints and Keypoints,” CVPR, 2016.
 [39] X. Yu, F. Zhou, and M. Chandraker, “Deep Deformation Network for Object Landmark Localization,” 2016.
 [40] A. Kanazawa, D. W. Jacobs, and M. Chandraker, “WarpNet: Weakly Supervised Matching for Singleview Reconstruction,” CVPR, 2016.
 [41] X. Wang, T. X. Han, and S. Yan, “An HOGLBP Human Detector with Partial Occlusion Handling,” in ICCV, 2009.
 [42] B. Pepikj, M. Stark, P. Gehler, and B. Schiele, “Occlusion Patterns for Object Class Detection,” in CVPR, 2013.
 [43] H. Lebesgue, “Intégrale, longueur, aire,” Annali di Matematica Pura ed Applicata (18981922), vol. 7, no. 1, pp. 231–359, 1902.
 [44] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks, 1989.
 [45] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” arXiv preprint arXiv:1611.03530, 2016.
 [46] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei, “Imagenet: A largescale hierarchical image database,” in CVPR, 2009.
 [47] Y. Li, L. Gu, and T. Kanade, “Robustly Aligning a Shape Model and Its Application to Car Alignment of Unknown Pose,” TPAMI, 2011.
 [48] K. Simonyan and A. Zisserman, “Very deep convolutional networks for largescale image recognition,” arXiv:1409.1556, 2014.
 [49] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” JMLR, 2015.
 [50] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in NIPS, 2012.
 [51] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan et al., “Shapenet: An informationrich 3d model repository,” arXiv:1512.03012, 2015.
 [52] D. Mishkin and J. Matas, “All you need is a good init,” 2016.
 [53] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” 2016.
 [54] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” 2016.
 [55] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” CVPR, 2014.
 [56] Y. Yang and D. Ramanan, “Articulated pose estimation with flexible mixturesofparts,” in CVPR, 2011.
 [57] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” AISTATS, 2010.
 [58] J. L. Long, N. Zhang, and T. Darrell, “Do convnets learn correspondence?” in NIPS, 2014.
 [59] R. Mottaghi, Y. Xiang, and S. Savarese, “A coarsetofine model for 3d pose estimation and subcategory recognition,” in CVPR, 2015.
 [60] S. R. Richter, V. Vineet, S. Roth, and V. Koltun, “Playing for data: Ground truth from computer games,” in ECCV, 2016.