Computer vision has historically been formulated as the problem of producing symbolic descriptions of scenes from input images [horn1986robot]. This is usually done by building bottom-up processing pipelines that isolate the portions of the image associated with each scene element and extract features that are indicative of its identity. Many pattern recognition and learning techniques can then be used to build classifiers for individual scene elements, and sometimes to learn the features themselves [hinton2006fast, hinton1997generative].
* The first two authors contributed equally to this work. (vkm, kulkarni, perov, jbt)@mit.edu
This approach has been remarkably successful, especially on problems of recognition. Bottom-up pipelines that combine image processing and machine learning can identify written characters with high accuracy and recognize objects from large sets of possibilities. However, the resulting systems typically require large training corpora to achieve reasonable levels of accuracy, and are difficult both to build and to modify. For example, the Tesseract system [smith2007overview] for optical character recognition is over lines of C++. Small changes to the underlying assumptions frequently necessitate end-to-end retraining and/or redesign.
Generative models for a range of image parsing tasks are also being explored [tu2005image, del2012bayesian, Tu:2002, ZhaoZhu2011, wingate2011nonstandard]. These provide an appealing avenue for integrating top-down constraints with bottom-up processing, and provide inspiration for the approach we take in this paper. But like traditional bottom-up pipelines for vision, these approaches have relied on considerable problem-specific engineering, chiefly to design and/or learn custom inference strategies, such as MCMC proposals [Tu:2002, ZhaoZhu2011] that incorporate bottom-up cues. Other combinations of top-down knowledge with bottom-up processing have been remarkably powerful. For example, [hoiem2006putting] has shown that global, 3D geometric information can significantly improve the performance of bottom-up object detectors.
In this paper, we propose a novel formulation of image interpretation problems, called generative probabilistic graphics programming (GPGP). Generative probabilistic graphics programs share a common template: a stochastic scene generator, an approximate renderer based on existing graphics software, a highly stochastic likelihood model for comparing the renderer’s output with the observed data, and latent variables that control the fidelity of the renderer and the tolerance of the image likelihood. Our probabilistic graphics programs are written in a variant of the Church probabilistic programming language [goodman2008church]. Each model we introduce requires less than lines of probabilistic code. The renderers and likelihoods for each are based on standard templates written as short Python programs. Unlike typical generative models for scene parsing, inverting our probabilistic graphics programs requires no custom inference algorithm design. Instead, we rely on the automatic Metropolis-Hastings transition operators provided by our probabilistic programming system. The approximations and stochasticity in our renderer, scene generator and likelihood models serve to implement a variant of approximate Bayesian computation [wilkinson2008approximate, marjoram2003markov]. This combination can produce a kind of self-tuning analogue of annealing that facilitates reliable convergence.
To the best of our knowledge, our GPGP framework is the first real-world image interpretation formulation to combine probabilistic programming, automatic inference, computer graphics, and approximate Bayesian computation; this constitutes our main contribution. Our second contribution is to provide demonstrations of the efficacy of this approach on two image interpretation problems: reading snippets of degraded and adversarially obscured alphanumeric characters, and inferring 3D road models from vehicle mounted cameras. In both cases we quantitatively report the accuracy of our approach on representative test datasets, as compared to standard bottom-up baselines that have been extensively engineered.
2 Generative Probabilistic Graphics Programs and Approximate Bayesian Inference.
Generative probabilistic graphics programs define generative models for images by combining four components. The first is a stochastic scene generator written as probabilistic code that makes random choices for the location and configuration of the main elements in the scene $S$. The second is an approximate renderer based on existing graphics software that maps a scene $S$ and control variables $X$ to an image $I_R = f(S, X)$. The third is a stochastic likelihood model for image data $I_D$ that enables scoring of rendered scenes given the control variables. The fourth is a set of latent variables $X$ that control the fidelity of the renderer and/or the tolerance in the stochastic likelihood model. These components are described schematically in Figure 1.
We formulate image interpretation tasks in terms of sampling (approximately) from the posterior distribution over scenes given the observed image:
$$P(S \mid I_D) \propto \int P(S)\, P(X)\, P(I_D \mid f(S, X), X)\, dX$$
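As a rough illustration of the four-component template, the following Python sketch draws one sample from a toy generative program. Everything in it is a hypothetical stand-in chosen for brevity: the 1D "image", the function names `sample_scene`, `sample_controls`, `render` and `log_likelihood`, and all priors. It is not the Church code or the renderers used in this paper.

```python
import math
import random

WIDTH = 16  # length of the toy 1D "image"

def sample_scene(rng):
    """Stochastic scene generator: one bar with a random position and width."""
    return {"pos": rng.randrange(WIDTH), "width": rng.randint(1, 4)}

def sample_controls(rng):
    """Latent control variables: renderer blur and likelihood tolerance."""
    return {"blur": rng.uniform(0.1, 3.0), "sigma": rng.uniform(0.05, 1.0)}

def render(scene, controls):
    """Approximate renderer: draw the bar, then blur it with a Gaussian kernel."""
    sharp = [1.0 if scene["pos"] <= x < scene["pos"] + scene["width"] else 0.0
             for x in range(WIDTH)]
    blurred = []
    for x in range(WIDTH):
        weights = [math.exp(-0.5 * ((x - u) / controls["blur"]) ** 2)
                   for u in range(WIDTH)]
        blurred.append(sum(w * s for w, s in zip(weights, sharp)) / sum(weights))
    return blurred

def log_likelihood(observed, rendered, controls):
    """Stochastic likelihood: independent Gaussian noise around the rendering."""
    sigma = controls["sigma"]
    norm = math.log(sigma * math.sqrt(2 * math.pi))
    return sum(-0.5 * ((o - r) / sigma) ** 2 - norm
               for o, r in zip(observed, rendered))

rng = random.Random(0)
scene, controls = sample_scene(rng), sample_controls(rng)
image = render(scene, controls)
print(scene, controls, log_likelihood(image, image, controls))
```

Note that the control variables enter both the renderer (through the blur) and the likelihood (through the tolerance), which is the coupling the fourth component describes.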
We perform inference over execution histories of our probabilistic graphics programs using a uniform mixture of generic, single-variable Metropolis-Hastings transitions, without any custom, bottom-up proposals. We first give a general description of the generative model and inference algorithm induced by our probabilistic graphics programs; in later sections, we describe specific details for each application.
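Let $S = \{S_i\}$ be a decomposition of the scene $S$ into parts $S_i$ with independent priors $P(S_i)$. For example, in our text application, the $S_i$s include binary indicators for the presence or absence of each glyph, along with its identity (“A” through “Z”, plus digits 0-9), and parameters including location, size and rotation. Also let $X = \{X_j\}$ be a decomposition of the control variables $X$ into parts $X_j$ with priors $P(X_j)$, such as the bandwidths of per-glyph Gaussian spatial blur kernels, the variance of a Gaussian image likelihood, and so on. Our proposals modify single elements of the scene and control variables at a time, as follows:
$$P(S) = \prod_i P(S_i), \quad q_i(S'_i, S_i) = P(S'_i), \qquad P(X) = \prod_j P(X_j), \quad q_j(X'_j, X_j) = P(X'_j)$$
Let $K = |\{S_i\}| + |\{X_j\}|$ be the total number of random variables in each execution. For simplicity, we describe the case where this number can be bounded above beforehand, i.e. total a priori scene complexity is limited. At each inference step, we choose a random variable index $k < K$ uniformly at random. If $k$ corresponds to a scene variable $i$, then we propose from $q_i(S'_i, S_i)$, so our overall proposal kernel is $q((S, X) \to (S', X')) = \delta_{S_{-i}}(S'_{-i})\, P(S'_i)\, \delta_X(X')$. If $k$ corresponds to a control variable $j$, we propose from $q_j(X'_j, X_j)$. In both cases we re-render the scene, $I_R = f(S', X')$, where $f$ is the renderer. We then run the kernel associated with this variable, and accept or reject via the Metropolis-Hastings equation, where $I_D$ is the observed image:
$$\alpha_{MH}\big((S, X) \to (S', X')\big) = \min\left(1,\; \frac{P(I_D \mid f(S', X'), X')\, P(S')\, P(X')\, q((S', X') \to (S, X))}{P(I_D \mid f(S, X), X)\, P(S)\, P(X)\, q((S, X) \to (S', X'))}\right)$$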
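The single-variable scheme described above can be sketched in a few lines of Python. This is a minimal sketch under our own assumptions, not the inference engine of the probabilistic programming system: the toy 1D renderer, the `PRIORS` table and the function names are hypothetical. Because each proposal is drawn from the prior of the chosen variable, the prior and proposal terms cancel in the acceptance ratio, which leaves only the likelihood ratio.

```python
import math
import random

WIDTH = 16  # length of the toy 1D "image"

# Hypothetical priors: one sampler per latent variable.
PRIORS = {
    "pos": lambda rng: rng.randrange(WIDTH),       # scene variable
    "width": lambda rng: rng.randint(1, 4),        # scene variable
    "sigma": lambda rng: rng.uniform(0.05, 1.0),   # control variable
}

def render(state):
    """Toy renderer: a bar of ones at the given position and width."""
    return [1.0 if state["pos"] <= x < state["pos"] + state["width"] else 0.0
            for x in range(WIDTH)]

def log_likelihood(observed, state):
    """Gaussian log-likelihood (up to a constant) with tolerance sigma."""
    sigma = state["sigma"]
    return sum(-0.5 * ((o - r) / sigma) ** 2 - math.log(sigma)
               for o, r in zip(observed, render(state)))

def mh_step(state, observed, rng):
    """Resample one uniformly chosen variable from its prior, then accept or reject."""
    name = rng.choice(sorted(PRIORS))
    proposal = dict(state)
    proposal[name] = PRIORS[name](rng)
    # Prior and proposal densities cancel, so only the likelihood ratio remains.
    log_alpha = log_likelihood(observed, proposal) - log_likelihood(observed, state)
    if rng.random() < math.exp(min(0.0, log_alpha)):
        return proposal
    return state

rng = random.Random(0)
observed = render({"pos": 5, "width": 3, "sigma": 0.1})
state = {name: draw(rng) for name, draw in PRIORS.items()}
for _ in range(3000):
    state = mh_step(state, observed, rng)
print(state)
```

Every step re-renders the whole scene, which mirrors the global re-rendering described in the text and is the main cost of this generic strategy.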
We implement our probabilistic graphics programs in a variant of the Church probabilistic programming language. The Metropolis-Hastings inference algorithm we use is provided by default in this system; no custom inference code is required. In the context of our generative probabilistic graphics formulation, this algorithm makes implicit use of ideas from approximate Bayesian computation (ABC). ABC methods approximate Bayesian inference over complex generative processes by using an exogenous distance function to compare sampled outputs with observed data. In the original rejection sampling formulation, samples are accepted only if they match the data within a hard threshold. Subsequently, combinations of ABC and MCMC were proposed [marjoram2003markov], including variants with inference over the threshold value [Ratmann30062009]. Most recently, extensions have been introduced where the hard cutoff is replaced with a stochastic likelihood model [wilkinson2008approximate]. Our formulation incorporates a combination of these insights: rendered scenes are only approximately constrained to match the observed image, with the tightness of the match mediated by inference over factors such as the fidelity of the rendering and the stochasticity in the likelihood. This allows image variability that is unnecessary or even undesirable to model to be treated in a principled fashion.
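To make the contrast between the hard-threshold and stochastic-likelihood variants of ABC concrete, here is a small Python sketch on a toy problem. The simulator, the distance function, the uniform prior and the Gaussian kernel are all illustrative assumptions of ours and do not come from the cited papers.

```python
import math
import random

def simulate(theta, rng, n=20):
    """Toy generative process: n noisy draws centred on theta."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]

def distance(sample, observed_mean):
    """Exogenous distance: gap between the sample mean and the observed mean."""
    return abs(sum(sample) / len(sample) - observed_mean)

def abc_rejection(observed_mean, eps, rng, trials=5000):
    """Hard-threshold ABC: keep prior draws whose simulation lands within eps."""
    kept = []
    for _ in range(trials):
        theta = rng.uniform(-5.0, 5.0)
        if distance(simulate(theta, rng), observed_mean) <= eps:
            kept.append(theta)
    return kept

def abc_soft_weights(observed_mean, tau, rng, trials=5000):
    """Soft ABC: weight every prior draw by a Gaussian kernel on the distance."""
    draws = []
    for _ in range(trials):
        theta = rng.uniform(-5.0, 5.0)
        d = distance(simulate(theta, rng), observed_mean)
        draws.append((theta, math.exp(-0.5 * (d / tau) ** 2)))
    return draws

rng = random.Random(0)
kept = abc_rejection(observed_mean=2.0, eps=0.3, rng=rng)
soft = abc_soft_weights(observed_mean=2.0, tau=0.3, rng=rng)
soft_mean = sum(t * w for t, w in soft) / sum(w for _, w in soft)
print(len(kept), sum(kept) / len(kept), soft_mean)
```

In the soft variant the tolerance `tau` plays the role that the likelihood standard deviation plays in our formulation; treating it as a latent variable, rather than fixing it as done here, is what allows the tightness of the match to be inferred.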
3 Generative Probabilistic Graphics in 2D for Reading Degraded Text.
We developed a probabilistic graphics program for reading short snippets of degraded text consisting of arbitrary digits and letters. See Figure 2 for representative inputs and outputs. In this program, the latent scene contains a bank of variables for each glyph, including whether a potential letter is present or absent from the scene, what its spatial coordinates and size are, what its identity is, and how it is rotated.
Our renderer rasterizes each letter independently, applies a spatial blur to each image, composites the letters, and then blurs the result. We also applied global blur to the original training image before applying the stochastic likelihood model on the blurred original and rendered images. The stochastic likelihood model is a multivariate Gaussian whose mean is the blurry rendering. The control variables for the renderer and likelihood consist of per-letter Gaussian spatial blur bandwidths, a global image blur on the rendered image, a global image blur on the original test image, and the standard deviation of the Gaussian likelihood (with prior parameters set to favor small bandwidths). To make hard classification decisions, we use the sample with lowest pixel reconstruction error from a set of 5 approximate posterior samples. We also experimented with enabling enumerative (griddy) Gibbs sampling for uniform discrete variables with 10% probability. The probabilistic code for this model is shown in Figure 4.
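The rendering and scoring pipeline for text can be sketched in pure Python as follows. This is an illustrative sketch, not the renderer used in our experiments: the 3x3 bitmaps in `GLYPHS`, the canvas size, the max-based compositing rule and all function names are assumptions, and size and rotation are omitted for brevity.

```python
import math

H, W = 8, 12  # toy canvas size

# Hypothetical 3x3 bitmaps standing in for rasterized font glyphs.
GLYPHS = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
}

def rasterize(glyph):
    """Draw one glyph on an empty canvas at its (x, y) offset."""
    img = [[0.0] * W for _ in range(H)]
    for dy, row in enumerate(GLYPHS[glyph["id"]]):
        for dx, v in enumerate(row):
            y, x = glyph["y"] + dy, glyph["x"] + dx
            if 0 <= y < H and 0 <= x < W:
                img[y][x] = float(v)
    return img

def blur_1d(values, bandwidth):
    """Normalized Gaussian smoothing of a 1D list."""
    out = []
    for i in range(len(values)):
        w = [math.exp(-0.5 * ((i - j) / bandwidth) ** 2) for j in range(len(values))]
        out.append(sum(a * b for a, b in zip(w, values)) / sum(w))
    return out

def blur(img, bandwidth):
    """Separable Gaussian blur: rows first, then columns."""
    rows = [blur_1d(r, bandwidth) for r in img]
    cols = [blur_1d(list(c), bandwidth) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def render(glyphs, global_blur):
    """Blur each present glyph, composite by pixelwise max, then blur globally."""
    canvas = [[0.0] * W for _ in range(H)]
    for g in glyphs:
        if g["present"]:
            layer = blur(rasterize(g), g["blur"])
            canvas = [[max(a, b) for a, b in zip(ra, rb)]
                      for ra, rb in zip(canvas, layer)]
    return blur(canvas, global_blur)

def reconstruction_error(observed, rendered):
    """Sum of squared pixel differences."""
    return sum((o - r) ** 2
               for ro, rr in zip(observed, rendered) for o, r in zip(ro, rr))

def gaussian_log_likelihood(observed, rendered, sigma):
    """Isotropic Gaussian log-likelihood centred on the rendering."""
    err = reconstruction_error(observed, rendered)
    return -0.5 * err / sigma ** 2 - H * W * math.log(sigma * math.sqrt(2 * math.pi))

def best_sample(observed, samples):
    """Hard decision: the sample with the lowest pixel reconstruction error."""
    return min(samples, key=lambda s: reconstruction_error(
        observed, render(s["glyphs"], s["global_blur"])))
```

Here each sample is a dictionary with a `glyphs` list and a `global_blur` value; `best_sample` would be called on the handful of approximate posterior samples retained for a given input.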
To assess the accuracy of our approach on adversarially obscured text, we developed a CAPTCHA corpus consisting of over 40 images from widely used websites such as TurboTax, E-Trade, and AOL, plus additional challenging synthetic CAPTCHAs with high degrees of letter overlap and superimposed distractors. Each source of text violates the underlying assumptions of our probabilistic graphics program in different ways. TurboTax CAPTCHAs incorporate occlusions that break strokes within letters, while AOL CAPTCHAs include per-letter warping. These CAPTCHAs all involve arbitrary digits and letters, and as a result lack cues from word identity that the best published CAPTCHA breaking systems depend on [mori2003recognizing]. We observe robust character recognition given enough inference, with an overall character detection rate of 70.6%. To calibrate the difficulty of our corpus, we also ran the Tesseract optical character recognition engine [smith2007overview] on our corpus; its character detection rate was 37.7%.
We have found that the dynamically-adjustable fidelity of our approximate renderer and the high stochasticity of our generative model are necessary for inference to robustly converge to accurate results. This aspect of our formulation can be viewed as a kind of self-tuning, stochastic Bayesian analogue of annealing, and significantly improves the robustness of our approach. See Figure 3 for an illustration of these dynamics.
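A short calculation suggests why an inferred likelihood tolerance can behave like a temperature. The assumptions here are simplifications of ours: an isotropic Gaussian likelihood with standard deviation $\sigma$ over $N$ pixels, a scene proposal drawn from the prior with $\sigma$ held fixed, and squared reconstruction errors $E$ and $E'$ for the current and proposed scenes. The prior on $\sigma$ is ignored in the second expression.

```latex
\log \alpha = \min\!\left(0,\; \frac{E - E'}{2\sigma^2}\right),
\qquad
\hat{\sigma}^2
  = \arg\max_{\sigma^2}\left[-\frac{E}{2\sigma^2} - \frac{N}{2}\log \sigma^2\right]
  = \frac{E}{N}.
```

Under these assumptions $\sigma^2$ acts as a temperature: when the current scene fits poorly ($E$ large), the likelihood favors a large $\sigma$, which flattens the acceptance ratio and lets the chain explore; as the fit improves, smaller values of $\sigma$ become preferred and the posterior over scenes sharpens.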
4 Generative Probabilistic Graphics in 3D: Road Finding.
We have also developed a generative probabilistic graphics program for localizing roads in 3D from single images. This is an important problem in autonomous driving. As with many perception problems in robotics, there is clear scene structure to exploit, but also considerable uncertainty about the scene, as well as substantial image-to-image variability that needs to be robustly ignored. See Figure 5b for example inputs.
The probabilistic graphics program we use for this problem is shown in Figure 7. The latent scene comprises the height of the roadway from the ground plane, the road’s width and lane size, and the 3D offset of the corner of the road from the (arbitrary) camera location. The prior encodes the assumption that the lanes are small relative to the road, and that the road has two lanes and is very likely to be visible (but may not be centered). This scene is then rendered to produce a surface-based segmentation image that assigns each input pixel to one of 4 regions. Rendering is done for each scene element separately, followed by compositing, as with our 2D text program. See Figure 5a for random surface-based segmentation images drawn from this prior. Extensions to richer road and ground geometries are an interesting direction for future work.
In our experiments, we used $k$-means to cluster RGB values from a randomly chosen training image. We used these clusters to build a compact appearance model based on cluster-center histograms, by assigning test image pixels to their nearest cluster. Our stochastic likelihood incorporates these histograms, by multiplying together the appearance probabilities for each image region $r$. These probabilities, denoted $\theta^r$, are smoothed by pseudo-counts $\epsilon$ drawn from a Gamma distribution. Let $Z_r$ be the per-region normalizing constant, let $I_R(x, y)$ be the region that the rendered segmentation assigns to the pixel at coordinates $(x, y)$, and let $I_D(x, y)$ be the quantized pixel at coordinates $(x, y)$ in the input image. Then our likelihood model is:
$$P(I_D \mid I_R, \epsilon) = \prod_{r} \; \prod_{(x, y) \,:\, I_R(x, y) = r} \frac{\theta^r_{I_D(x, y)} + \epsilon}{Z_r}$$
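A histogram-based likelihood of this kind can be sketched in Python as below. The sketch rests on assumptions of ours: the function names are hypothetical, the normalizing constant is taken to be the sum of the smoothed probabilities over clusters, and the pseudo-count is passed in as a fixed number rather than drawn from a Gamma distribution.

```python
import math

def quantize(pixel, centers):
    """Index of the nearest cluster center (squared Euclidean distance in RGB)."""
    return min(range(len(centers)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(pixel, centers[k])))

def region_histograms(quantized, regions, n_clusters, region_ids):
    """Normalized per-region histograms of cluster indices from a labelled image."""
    counts = {r: [0] * n_clusters for r in region_ids}
    for q_row, r_row in zip(quantized, regions):
        for q, r in zip(q_row, r_row):
            counts[r][q] += 1
    return {r: [c / max(1, sum(cs)) for c in cs] for r, cs in counts.items()}

def log_likelihood(quantized, rendered_regions, theta, epsilon):
    """Sum over pixels of log((theta[r][q] + epsilon) / Z_r)."""
    # Assumed normalizer: smoothed probabilities summed over clusters.
    z = {r: sum(p + epsilon for p in probs) for r, probs in theta.items()}
    total = 0.0
    for q_row, r_row in zip(quantized, rendered_regions):
        for q, r in zip(q_row, r_row):
            total += math.log((theta[r][q] + epsilon) / z[r])
    return total

# Tiny usage example with two regions and two colour clusters.
theta = {"road": [0.9, 0.1], "offroad": [0.1, 0.9]}
image = [[0, 0], [1, 1]]
good = [["road", "road"], ["offroad", "offroad"]]
bad = [["offroad", "offroad"], ["road", "road"]]
print(log_likelihood(image, good, theta, 0.01), log_likelihood(image, bad, theta, 0.01))
```

A candidate segmentation that places each pixel in a region whose histogram favors its colour cluster scores higher, which is what lets the scene prior and the appearance model jointly constrain the road geometry.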
Figure 5f shows appearance model histograms from one random training frame. Figure 5c shows the extremely noisy lane/non-lane classifications that result from the appearance model on its own, without our scene prior; accuracy is extremely low. Other, richer appearance models, such as Gaussian mixtures over RGB values (which could be either hand specified or learned), are compatible with our formulation; our simple, quantized model was chosen primarily for simplicity. We use the same generic Metropolis-Hastings strategy for inference in this problem as in our text application. Although deterministic search strategies for MAP inference could be developed for this particular program, it is less clear how to build a single deterministic search algorithm that could work on both of the generative probabilistic graphics programs we present.
In Table 1, we report the accuracy of our approach on one road dataset from the KITTI Vision Benchmark Suite. To focus on accuracy in the face of visual variability, we do not exploit temporal correspondences. We test on every 5th frame for a total of 80. We report lane/non-lane accuracy results for maximum likelihood classification over 10 appearance models (from 10 randomly chosen training images), as well as for the single best appearance model from this set. We use 10 posterior samples per frame for both. For reference, we include the performance of a sophisticated bottom-up baseline system from [aly2008real]. This baseline system requires significant 3D a priori knowledge, including the intrinsic and extrinsic parameters of the camera, and a rough initial segmentation of each test image. In contrast, our approach has to infer these aspects of the scene from the image data. We also show examples of the uncertainty estimates that result from approximate Bayesian inference in Figure 6. Our probabilistic graphics program for this problem requires under 20 lines of probabilistic code.
|Method|Accuracy|
|---|---|
|Aly et al. [aly2008real]|68.31%|
|GPGP (Best Single Appearance)|64.56%|
|GPGP (Maximum Likelihood over Multiple Appearances)|74.60%|
We have shown that it is possible to write short probabilistic graphics programs that use simple 2D and 3D computer graphics techniques as the backbone for highly approximate generative models. Approximate Bayesian inference over the execution histories of these probabilistic graphics programs — automatically implemented via generic, single-variable Metropolis-Hastings transitions, using existing rendering libraries and simple likelihoods — then implements a new variation on analysis by synthesis [yuille2006vision]. We have also shown that this approach can yield accurate, globally consistent interpretations of real-world images, and can coherently report posterior uncertainty over latent scenes when appropriate. Our core contributions are the introduction of this conceptual framework and two initial demonstrations of its efficacy.
To scale our inference approach to handle more complex scenes, it will likely be important to consider more complex forms of automatic inference, beyond the single-variable Metropolis-Hastings proposals we currently use. For example, discriminatively trained proposals could help, and in fact could be trained based on forward executions of the probabilistic graphics program. Appearance models derived from modern image features and texture descriptors [portilla2000parametric, oliva2001modeling] — going beyond the simple quantizations we currently use — could also reduce the burden on inference and improve the generalizability of individual programs. It is important to note that the high dimensionality involved in probabilistic graphics programming does not necessarily mean inference (and even automatic inference) is impossible. For example, approximate inference in models with probabilities bounded away from 0 and 1 can sometimes be provably tractable via sampling techniques, with runtimes that depend on factors other than dimensionality [dagum1997optimal]. In fact, preliminary experiments with our 2D text program (not shown) appear to show flat convergence times with up to 30 unknown letters. Exploring the role of stochasticity in facilitating tractability is an important avenue for future work.
The most interesting potential of generative probabilistic graphics programming is the avenue it provides for bringing powerful graphics representations and algorithms to bear on the hard modeling and inference problems in vision. For example, to avoid global re-rendering after each inference step, we need to represent and exploit the conditional independencies in the rendering process. Lifting graphics data structures such as z-buffers into the probabilistic program itself might enable this. Real-time, high-resolution graphics software contains solutions to many hard technical problems in image synthesis. Long term, we hope probabilistic extensions of these ideas ultimately become part of analogous solutions for image analysis.
We are grateful to Keith Bonawitz and Eric Jonas for preliminary work exploring the feasibility of CAPTCHA breaking in Church, and to Seth Teller, Bill Freeman, Ted Adelson, Michael James and Max Siegel for helpful discussions.