1 Introduction
Formulations of vision problems as probabilistic inversions of generative models based on computer graphics have a long history [2, 20] and have recently attracted renewed attention [9, 7, 6]. However, applications to 3D object perception from single natural images have seemed intractable. Generative models for natural images have instead either focused on 2D problems [17], considered only low-dimensional latent scenes composed of simple shapes [18, 9], or made heavy use of temporal continuity [21].
On the modeling side, accounting for the enormous variability in 3D object shape and 2D appearance via realistic generative models can seem wildly intractable. The failure of the inverse graphics approach has primarily been due to the lack of a generic graphics engine and the computational intractability of the inversion problem. It has also proved difficult to navigate the tension between model flexibility and inference tractability: permissive models, such as generic priors on 3D meshes, both exacerbate the intractability of inference and sometimes lead to unrealistic percepts, while identifying high-likelihood inputs to a rendering engine can seem challenging enough without additionally accounting for a highly structured scene prior.
This paper proposes and evaluates a new approach that aims to address these challenges. We show that it is possible to solve challenging, real-world 3D vision problems by approximate inference in generative image models for deformable 3D meshes. Our approach uses scene representations that build on tools from computer-aided design (CAD) and nonparametric Bayesian statistics [14]. We specify priors on object geometry using probabilistic CAD (PCAD) programs within a generic rendering-engine environment [3]: stochastic procedures that sample meshes from component priors and apply affine transformations to place them in a scene. Image likelihoods are based on similarity in a feature space built from standard mid-level image representations from the vision literature [5, 11]. To the best of our knowledge, our system is the first real-world formulation to define rich generative models over rigid and deformable 3D CAD programs to interpret real images. We apply this approach to 3D human pose estimation and object shape reconstruction from single images, achieving quantitative performance that exceeds state-of-the-art baselines. The 3D mesh-based parametrization of our model consists of a large set of mixed discrete and continuous latent variables. This, coupled with the complex many-to-many nature of the graphics mapping, makes computing the inverse mapping an extremely difficult inference problem. Our inference algorithm integrates single-site and locally blocked Metropolis-Hastings proposals, Hamiltonian Monte Carlo, and discriminative proposals learned from training data generated from our models. We show that discriminative proposals aid the inference pipeline in both speed and accuracy, allowing us to leverage the strengths of discriminative methods within generative modeling.
2 Modeling via Probabilistic CAD Programs
Probabilistic deformable CAD programs (PCADs) define generative models for 3D shapes by combining three major components as first proposed in [9]:
(a) Stochastic scene generator is the distribution over 3D meshes and other scene elements such as affine transformations. The scene generator factorizes into several scene elements $S_i$, which together form the scene configuration $S$, i.e. the latent variables over 3D meshes. This factorization induces a set of latent variables $\{S_i\}$ with priors $P(S_i)$.
(b) Approximate renderer projects the 3D mesh to a 2D mid-level representation for stochastic comparison. The approximate renderer is a complex simulator denoted by the function $g(S, \theta) = I_R$, where $I_R$ is the 2D projection of the generated 3D scene $S$ and $\theta$ denotes additional control variables for the rendering engine. One of our key contributions is to formulate image interpretation as inverting this simulator under observations. Blender [3] is a widely used open-source 3D graphics environment that exposes numerous interfaces to model, render, and manipulate complex 3D scenes. In this paper, we take the idea of inverse graphics literally by abstracting the simulator as our function $g$ and driving it with rich priors over the scene elements $S$. To abstract away the extreme pixel variability in real-world images, we transform RGB images to contour maps $I_D$ with the recent Structured Random Forest model [5]. The approximate renderer outputs the same mid-level representation conditioned on $S$, resulting in an image hypothesis space with simple appearance variability while preserving 3D geometric details.

(c) Stochastic comparator is a likelihood function $P(I_D \mid I_R)$ or, in the case of likelihood-free inference [4], a distance function $\lambda(I_R, I_D)$. In our experiments, we use a contour-based representation for both the data $I_D$ and the rendered image $I_R$. Our stochastic comparator is a probabilistic variant of the chamfer distance. Traditionally, the chamfer distance uses a distance transform to produce a smooth cost function over an observation image and a fixed-size template. As a first step, we compute the distance transform $DT(I_D)$, which gives, at every location in the image, the distance to the closest point on the contour. We use the non-symmetric chamfer distance with $I_R$ as the template, since $I_D$ will typically have many outliers and is not robust to variations. The likelihood function can then be expressed as follows:

$$P(I_D \mid I_R) \propto \exp\Big(-\frac{1}{2\sigma^2}\,\frac{1}{|I_R|}\sum_{p \in I_R} DT(I_D)[p]\Big)$$

We can now formulate the image interpretation task as approximately sampling the posterior distribution of scene elements $S$ given observations $I_D$:

$$P(S \mid I_D) \propto P\big(I_D \mid I_R = g(S, \theta)\big)\,\prod_i P(S_i) \qquad (1)$$
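As a concrete illustration, the comparator above can be sketched in a few lines. This is a minimal reading of the probabilistic chamfer likelihood, with the bandwidth `sigma` a hypothetical tuning parameter:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def chamfer_loglik(rendered_contours, data_contours, sigma=1.0):
    """Non-symmetric chamfer log-likelihood of a rendered contour map I_R
    given an observed contour map I_D (both boolean arrays, True = contour).

    The distance transform of I_D gives, at every pixel, the distance to the
    nearest observed contour point; averaging it over the rendered contour
    pixels scores how well the hypothesis explains the observation."""
    dt = distance_transform_edt(~data_contours)   # DT(I_D): distance to contour
    on = rendered_contours.nonzero()
    if len(on[0]) == 0:
        return -np.inf                            # empty rendering explains nothing
    d = dt[on].mean()                             # mean chamfer cost over template
    return -d / (2.0 * sigma ** 2)                # Gaussian-style penalty
```

Identical contour maps score zero; renderings whose contours drift away from the data are penalized in proportion to the mean distance.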
2.1 Generative Model
We demonstrate our framework on two challenging real-world 3D parsing problems: parsing generic 3D artifact objects and inferring fine 3D pose of humans from single images. The flexibility of our rich priors and the expressivity of the generic graphics simulator allow us to handle the extreme shape variability between the two problem domains with relative ease. For all generative models, we denote the affine transformation over an entire object by a latent $4 \times 4$ affine matrix $T$, whose translation, scale, and rotation components are drawn from uniform distributions, with rotation sampled over all three axes independently.
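A minimal sketch of this affine prior follows; the specific ranges are illustrative placeholders, since the exact bounds used in the paper are not reproduced here:

```python
import numpy as np

def sample_affine(rng, t_range=(-1.0, 1.0), s_range=(0.5, 1.5),
                  r_range=(-np.pi, np.pi)):
    """Sample a 4x4 homogeneous affine latent T: uniform translation,
    uniform per-axis scale, and uniform rotation about each axis
    independently.  All ranges are hypothetical placeholders."""
    t = rng.uniform(*t_range, size=3)
    s = rng.uniform(*s_range, size=3)
    ax, ay, az = rng.uniform(*r_range, size=3)
    def rx(a): return np.array([[1, 0, 0],
                                [0, np.cos(a), -np.sin(a)],
                                [0, np.sin(a),  np.cos(a)]])
    def ry(a): return np.array([[ np.cos(a), 0, np.sin(a)],
                                [0, 1, 0],
                                [-np.sin(a), 0, np.cos(a)]])
    def rz(a): return np.array([[np.cos(a), -np.sin(a), 0],
                                [np.sin(a),  np.cos(a), 0],
                                [0, 0, 1]])
    A = np.eye(4)
    A[:3, :3] = rz(az) @ ry(ay) @ rx(ax) @ np.diag(s)  # rotation * scale
    A[:3, 3] = t                                       # translation column
    return A
```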
2.1.1 3D Object Parsing CAD Program
Lathing and casting is a useful representation for expressing CAD models and is the main inspiration for our approach to modeling generic 3D artifacts. Given object boundaries in space, we use a popular graphics algorithm to lathe an object: take a cross section of points, define a medial axis for the cross section, and sweep the cross section along the medial axis while continuously perturbing it with respect to the object profile. The space of possible profiles is extremely large due to variability in 3D shapes. We therefore introduce a generative model with Gaussian Processes (GPs) [14] as a flexible nonparametric prior over the object profiles used for lathing (hereafter denoted $r$). The intermediate output from the graphics simulator is a mesh approximating the whole or part of a 3D object, which can be rendered to an image by camera reprojection.
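A sketch of GP-based lathing under simplifying assumptions (squared-exponential kernel, circular cross-section, a single part; `mean` and `amp` are hypothetical hyperparameters):

```python
import numpy as np

def sample_gp_profile(heights, bandwidth, rng, mean=1.0, amp=0.1):
    """Sample a radial profile r(h) from a GP prior with a squared-exponential
    kernel; `bandwidth` is the hyperparameter inferred during MCMC."""
    d = heights[:, None] - heights[None, :]
    K = amp * np.exp(-0.5 * (d / bandwidth) ** 2) + 1e-8 * np.eye(len(heights))
    return mean + np.linalg.cholesky(K) @ rng.standard_normal(len(heights))

def lathe(heights, radii, n_theta=32):
    """Sweep a circular cross-section along the medial (vertical) axis,
    scaling it by the sampled profile -- a surface of revolution."""
    thetas = np.linspace(0.0, 2 * np.pi, n_theta, endpoint=False)
    radii = np.maximum(radii, 0.0)        # profiles are radii, so clamp at zero
    verts = [(r * np.cos(t), r * np.sin(t), h)
             for h, r in zip(heights, radii)
             for t in thetas]
    return np.array(verts)                # (len(heights) * n_theta, 3) vertices
```

In the actual system the lathing step is carried out by Blender's modeling interfaces rather than by raw vertex generation as here.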
The height $H$ of the object is sampled from a uniform distribution. 3D objects can consist of several subparts; without loss of generality and for simplicity, we study objects with up to two subparts and a circular cross-section. We sample a cut $c$ along the medial axis of the object from a beta distribution, resulting in two independent GPs spanning the cut proportions. Since the smoothness and profile of 3D objects are a priori unknown, we perform hyperparameter inference on the bandwidths $l_1$ and $l_2$ of the covariance kernels of the GPs. The resulting points from the GPs are passed to the graphics simulator for lathing-based mesh generation, which produces the scene mesh $S$. During inference, reconstructing 3D objects amounts to computing the posterior $P(S \mid I_D)$. We show typical samples from the prior and an example inference trajectory in Figure 2. The generative model can be formalized as follows:

$$H \sim \mathrm{Uniform}(h_{\min}, h_{\max}), \quad c \sim \mathrm{Beta}(\alpha, \beta), \quad l_1, l_2 \sim P(l)$$
$$r_1 \sim \mathcal{GP}(m, k_{l_1}), \quad r_2 \sim \mathcal{GP}(m, k_{l_2}), \quad S = \mathrm{lathe}(r_1, r_2, c, H)$$

2.1.2 3D Human Pose CAD Program
We could naturally define a compositional model over parts parameterized by GPs, such as an infinite mixture of GP experts or a hierarchical GP mixture model, to learn 3D shapes such as human bodies; however, this is outside the scope of the current paper and is left for future work. To demonstrate deformable CAD programs, we designed a compositional mesh of the human body, where parts or vertex groups of the mesh approximate bones and joints. We use a popular graphics algorithm to generate an armature for the resulting mesh and define priors over the armature control points.

The armature is a tree with the root node centered around the center of the mesh. The underlying armature tree is used as the medial axis to deform the mesh in a local, part-wise fashion. Each joint/bone on the armature has an associated affine matrix with scale, rotation, and location latent variables. The armature marked on the 3D mesh is depicted in Figure 1b. Whenever the random choices attached to a joint are re-sampled during inference, the change is propagated all the way to the root node or to a predefined stopping node, smoothly evolving the 3D mesh. To assess the coverage of our model, we show samples drawn from the prior in Figure 1b along with an illustrative run of the inference.
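One plausible reading of the propagation step, sketched as a depth-first traversal over the armature tree (the dictionary-based armature encoding is an assumption of this sketch, not the paper's data structure):

```python
import numpy as np

def apply_joint_delta(children, world, joint, delta, stop=None):
    """Re-apply a perturbed transform at `joint` and propagate it through
    the armature tree until `stop`, so a single re-sampled random choice
    smoothly deforms every dependent bone.
    children: dict joint -> list of child joints;
    world:    dict joint -> 4x4 world-space matrix (updated in place)."""
    stack = [joint]
    while stack:
        j = stack.pop()
        world[j] = delta @ world[j]          # compose the perturbation
        if j == stop:
            continue                         # predefined stopping node
        stack.extend(children.get(j, []))
    return world
```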
3 Inference via Markov chain Monte Carlo
Inverting graphics simulators exactly is intractable, so we resort to Markov chain Monte Carlo (MCMC) for approximate inference. Inference in our model is especially hard for several reasons: highly coupled variables, a mix of discrete and continuous variables, and many local modes due to clutter, occlusion, and noisy observations. To make approximate inference practical when inverting the graphics simulator, we propose the following mixture of proposal kernels:
Local Random Proposal: single-site Metropolis-Hastings moves on continuous variables and Gibbs moves on discrete variables. The proposal kernel is:

$$q_{\mathrm{local}}(S' \mid S) = \frac{1}{|S|} \sum_i q_i(S_i' \mid S_i) \prod_{j \neq i} \delta(S_j' - S_j) \qquad (9)$$
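A minimal single-site Metropolis-Hastings step over a dictionary of latents; the symmetric drift kernel `propose_site` is an illustrative stand-in for the model-specific kernels:

```python
import numpy as np

def single_site_mh_step(latents, log_post, propose_site, rng):
    """One single-site Metropolis-Hastings move: pick a latent at random,
    propose a new value from a symmetric kernel, and accept with the usual
    MH ratio.  `log_post` scores a full latent dictionary."""
    i = rng.choice(list(latents))
    proposal = dict(latents)
    proposal[i] = propose_site(latents[i], rng)       # e.g. Gaussian drift
    log_alpha = log_post(proposal) - log_post(latents)
    if np.log(rng.uniform()) < log_alpha:
        return proposal                               # accept
    return latents                                    # reject: keep old state
```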
Block Proposal: affine latent variables such as the rotation matrix are almost always coupled with the latent variables parameterizing the 3D mesh. Let $S_A$ denote a set of latents belonging to the same affine transformation and $S_{\neg A}$ all other non-affine latents. To allow the affine latent groups to mix, we use the following blocked proposal:

$$q_{\mathrm{block}}(S_A' \mid S_A) = \prod_{i \in A} q_i(S_i' \mid S_i), \qquad S_{\neg A}' = S_{\neg A} \qquad (10)$$
Discriminative Proposal: despite asymptotic convergence, inference often gets stuck in local modes due to occlusion, clutter, and noisy observations. Often, the only way to escape a local mode is for the sampler to make precise, coordinated changes to a large number of latent variables at once. We follow a strategy similar to [7] to learn data-driven proposals. Let $f(\cdot)$ denote the feature map of a discriminative pose detector. We generate training pairs from the model, retrieve the neighbors of a test image in feature space, and fit a density estimator over their latents:

$$\{(S^{(n)}, f(g(S^{(n)}, \theta)))\}_{n=1}^{N}, \quad S^{(n)} \sim P(S) \qquad (11)$$
$$\mathcal{N}_K(I_D) = \mathrm{KNN}_K(f(I_D)) \qquad (12)$$
$$q_{\mathrm{data}}(S' \mid I_D) = \mathrm{KDE}\big(\{S^{(n)} : n \in \mathcal{N}_K(I_D)\}\big) \qquad (13)$$
Local Gradient Proposal: since our likelihood function is smooth in the continuous variables, which are often coupled, we can exploit gradient information by constructing a Hamiltonian Monte Carlo kernel $\mathrm{HMC}(S_c)$, where $S_c$ denotes all the continuous scene variables. During inference, we approximate the posterior by accepting or rejecting each simulated trajectory with the following ratio:

$$H(S_c, p) = -\log P(S_c \mid I_D) + \tfrac{1}{2} p^\top p \qquad (14)$$
$$\alpha = \min\big(1,\; \exp\big(H(S_c, p) - H(S_c', p')\big)\big) \qquad (15)$$
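A sketch of one HMC transition with the leapfrog integrator; the step size and trajectory length are hypothetical tuning values:

```python
import numpy as np

def hmc_step(q, log_post, grad_log_post, rng, eps=0.05, n_leapfrog=20):
    """One Hamiltonian Monte Carlo transition on the continuous scene
    variables: simulate Hamiltonian dynamics with the leapfrog integrator,
    then accept/reject to correct the discretisation error."""
    p = rng.standard_normal(q.shape)                 # resample momentum
    q_new, p_new = q.copy(), p.copy()
    p_new += 0.5 * eps * grad_log_post(q_new)        # half step: momentum
    for _ in range(n_leapfrog - 1):
        q_new += eps * p_new                         # full step: position
        p_new += eps * grad_log_post(q_new)          # full step: momentum
    q_new += eps * p_new
    p_new += 0.5 * eps * grad_log_post(q_new)        # final half step
    # Hamiltonian = negative log posterior + kinetic energy (eq. 14)
    h0 = -log_post(q) + 0.5 * p @ p
    h1 = -log_post(q_new) + 0.5 * p_new @ p_new
    return q_new if np.log(rng.uniform()) < h0 - h1 else q   # eq. 15
```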
4 Experiments
4.1 3D Object Parsing
We collected a small dataset of about 20 3D objects from the Internet and demonstrate superior results over the current state-of-the-art model for single-object reconstruction, SIRFS [1]. Unlike the baseline, our model does not require pre-segmented objects, as we do joint inference over affine transformations and mesh elements. Evaluating 3D reconstruction from single images is challenging. We asked CAD experts to manually generate CAD fits to the images in Blender and evaluated both approaches by computing the ZMAE score (shift-invariant surface mean-squared error) and NMSE (mean-squared error over surface normals). As shown in Figure 2a, our model achieves much lower ZMAE and NMSE scores than SIRFS without using pre-segmented objects. Moreover, most of the error in our approach can be attributed to slight misalignment of our 3D parse with respect to the ground truth (SIRFS has an inherent advantage in this metric since objects are pre-segmented). Most importantly, as demonstrated in Figure 4, our approach produces dramatically better qualitative 3D results than SIRFS. Figure 2b shows a typical inference trajectory from prior to posterior on a challenging real-world image. In future work, we hope to extend our model to handle arbitrary cross-sections (with the same GP-based model) and a more structured hypothesis space representing Gaussian Process based object parts.
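One plausible construction of the two scores (the paper does not spell out the exact formulas, so this is a sketch): ZMAE as squared depth error after removing the best global shift, NMSE as mean squared error between unit normal fields.

```python
import numpy as np

def zmae(z_pred, z_true):
    """Shift-invariant surface error: remove the best global depth offset
    (the mean difference, optimal under squared error) before comparing."""
    shift = np.mean(z_true - z_pred)
    return np.mean((z_pred + shift - z_true) ** 2)

def nmse(n_pred, n_true):
    """Mean squared error between unit surface-normal fields of shape (H, W, 3)."""
    n_pred = n_pred / np.linalg.norm(n_pred, axis=-1, keepdims=True)
    n_true = n_true / np.linalg.norm(n_true, axis=-1, keepdims=True)
    return np.mean(np.sum((n_pred - n_true) ** 2, axis=-1))
```

The shift invariance is what makes ZMAE fair for single-image reconstruction, where absolute depth is unobservable.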
4.2 3D Human Pose Estimation
We collected a small dataset of humans performing a variety of poses from sources such as KTH [16], LabelMe [15] images with significant occlusion in the "person sitting" category, and the Internet (about 50 images), and demonstrate superior results in comparison with the current state-of-the-art Deformable Parts Model (DPM) for pose estimation [19]. We project the 3D pose obtained from our model to 2D keypoints (such as head, left arm, etc.) for comparison with the baseline. As shown in Figure 5, we outperform the baseline by a significant margin.
As shown in Figure 6, images with people sitting and heavy occlusion are very hard for the discriminative model to get right, mainly due to a "missing" observation signal, making a strong case for a model-based approach like ours, which gives reasonable parses. Most of our model's failure cases, shown in Figures 6 and 5b, involve inferring arm position; this is typically due to noisy, low-quality contour maps around the arm area, owing to its small size. In the next section, we utilize the strengths of the bottom-up DPM pipeline to learn a better strategy for probabilistic inference.
4.2.1 Discriminative Datadriven Proposals
To aid inference in 3D human pose estimation, we also explore the use of discriminatively trained predictors that act as sample generators in our inversion pipeline. We draw a large number of samples from the prior and estimate the pose on the rendered images using the feed-forward DPM pose-detection pipeline. During inference, we run the DPM pose model on the real test image, find the K nearest neighbors from the sampled dataset in the space of DPM pose parameters, and fit a local kernel density estimator (KDE) over their latents to obtain a discriminatively trained proposal. Intuitively, the feed-forward pathway rapidly initializes the latent variables to a reasonable state, leaving further fine-tuning to the inference pipeline. This effect is confirmed by Figure 7, where about 100 independent Markov chains are run with and without discriminative proposals. The results consistently favor the runs with discriminative learning in both accuracy and speed.
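The pipeline above can be sketched as follows, assuming latents and detector features are flattened to fixed-length vectors (the function names here are illustrative, not the system's API):

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.stats import gaussian_kde

def make_discriminative_proposal(prior_latents, dpm_features, k=50):
    """Build a learned proposal from synthetic data.
    prior_latents: (N, D) scenes sampled from the prior;
    dpm_features:  (N, F) pose-detector features of their renderings.
    Given the detector's features on a real test image, fit a KDE to the
    latents of the K nearest synthetic neighbours and sample proposals."""
    tree = cKDTree(dpm_features)

    def propose(test_features, rng, n_samples=1):
        _, idx = tree.query(test_features, k=k)      # K nearest neighbours
        kde = gaussian_kde(prior_latents[idx].T)     # density over latents
        return kde.resample(n_samples, seed=rng).T   # (n_samples, D) proposals
    return propose
```

Proposals drawn this way land near the posterior mode; the other MCMC kernels then fine-tune the fit.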
[Figure 7 caption: Given a color image, we compute features using the DPM pose detector and retrieve the K nearest neighbors from the synthetically generated data in (a); a kernel density estimator (KDE) fit to their latent variables yields the learned proposal. (d) A sample parse with proposal learning. (e) Samples drawn from the KDE given the color image (b) are semantically close to the posterior; inference fine-tunes the solution via the other proposal kernels in our algorithm. As the log-likelihood plot shows, about 100 independent chains were run with and without the learned proposal; the learned proposal consistently outperforms the baseline in both speed and accuracy.]

5 Discussion
We have shown that it is possible to solve challenging, real-world 3D vision problems by approximately Bayesian inversion of probabilistic CAD programs. Shape variability is addressed using 3D modeling tools from computer graphics, with nonparametric Bayesian statistics handling unknown shape categories. Appearance variability is handled by comparing renderings to real-world images in a feature space based on mid-level representations and distance metrics from computer vision. Inference is performed via a mixture of Hamiltonian Monte Carlo and standard single-site and locally blocked Metropolis-Hastings moves. This approach yields quantitative and qualitative performance improvements on 3D human pose estimation and object reconstruction compared with state-of-the-art baselines, at only moderate computational cost. Additionally, data-driven proposals learned from synthetic data, incorporating representations from human pose detectors, can be used to improve inference speed.
Several research directions seem appealing on the basis of these results. Scaling in object and scene complexity could be handled by incorporating object priors based on hierarchical shape decompositions or infinite mixtures of piecewise GPs. The explicit 3D scenes could be augmented with physical information, with kinematic and physical constraints integrated via undirected potentials; it may begin to be practical to build systems that perform rich physical reasoning directly from real-world visual data. Image comparison could be improved by using richer appearance models such as Epitomes [12]. It could also be fruitful to experiment with directly modeling reflectance and illumination via an approach like SIRFS [1], though choosing the right resolution for comparison may be difficult. It seems natural to explore richer bottom-up proposal mechanisms that integrate state-of-the-art discriminative techniques, including modern artificial neural networks [8]. Many of these directions, as well as the exploration of alternative inference strategies, may be simplified by implementation as generative probabilistic graphics programs atop general-purpose probabilistic programming systems [10].

The generative, approximately Bayesian approach to vision has a long way to go before it can compete with the flexibility and maturity of current bottom-up vision pipelines, let alone their computational efficiency. Basic design trade-offs in modeling and inference are not yet well understood, and building a 3D scene parser that performs acceptably on object-recognition challenges like PASCAL may be a good proxy for the general problem of visual perception. Despite these limitations, however, the generative approach in some ways offers a clearer scaling path to rich percepts in uncontrolled settings. Our results suggest it is now possible to realize some of this potential in practice, and not only produce rich, representationally explicit percepts but also obtain good quantitative performance. We hope to see many more illustrations of this kind of approach in the future.
6 Acknowledgments
We thank Peter Battaglia, Daniel Selsam, Owain Evans, Alexey Radul and Sam Gershman for their valuable feedback and discussions. Tejas Kulkarni was graciously supported by the Henry E. Singleton Fellowship. This work was partly funded by the DARPA PPAML program, grants from the ONR and ARO, and Google's "Rethinking AI" project.
References
 [1] J. Barron and J. Malik. Shape, illumination, and reflectance from shading. Technical report, Berkeley Tech Report, 2013.
 [2] B. G. Baumgart. Geometric modeling for computer vision. Technical report, DTIC Document, 1974.
 [3] Blender Online Community. Blender - a 3D modelling and rendering package. Blender Foundation, Blender Institute, Amsterdam.
 [4] K. Csilléry, M. G. Blum, O. E. Gaggiotti, and O. François. Approximate Bayesian computation (ABC) in practice. Trends in Ecology & Evolution, 25(7):410–418, 2010.
 [5] P. Dollár and C. L. Zitnick. Structured forests for fast edge detection. In ICCV, 2013.
 [6] A. Gupta, A. A. Efros, and M. Hebert. Blocks world revisited: Image understanding using qualitative geometry and mechanics. In Computer Vision–ECCV 2010, pages 482–496. Springer, 2010.
 [7] V. Jampani, S. Nowozin, M. Loper, and P. V. Gehler. The informed sampler: A discriminative approach to bayesian inference in generative computer vision models. arXiv preprint arXiv:1402.0859, 2014.
 [8] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
 [9] V. Mansinghka, T. D. Kulkarni, Y. N. Perov, and J. Tenenbaum. Approximate bayesian image interpretation using generative probabilistic graphics programs. In Advances in Neural Information Processing Systems, pages 1520–1528, 2013.
 [10] V. Mansinghka, D. Selsam, and Y. Perov. Venture: a higherorder probabilistic programming platform with programmable inference. arXiv preprint arXiv:1404.0099, 2014.
 [11] D. R. Martin, C. C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26(5):530–549, 2004.
 [12] G. Papandreou, L.-C. Chen, and A. L. Yuille. Modeling image patches with a generic dictionary of mini-epitomes. In CVPR, 2014.
 [13] J. Portilla and E. P. Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision, 40(1):49–70, 2000.

 [14] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
 [15] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision, 77(1-3):157–173, 2008.
 [16] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local svm approach. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 3, pages 32–36. IEEE, 2004.
 [17] Z. Tu and S.-C. Zhu. Image segmentation by data-driven Markov chain Monte Carlo. PAMI, 24(5):657–673, 2002.
 [18] J. Xiao, B. C. Russell, and A. Torralba. Localizing 3d cuboids in singleview images. In NIPS, volume 2, page 4, 2012.
 [19] Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixturesofparts. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1385–1392. IEEE, 2011.

 [20] A. Yuille and D. Kersten. Vision as Bayesian inference: analysis by synthesis? Trends in Cognitive Sciences, 10(7):301–308, 2006.
 [21] S. Zuffi, J. Romero, C. Schmid, and M. J. Black. Estimating human pose with flowing puppets. In IEEE International Conference on Computer Vision (ICCV), 2013.