1 Introduction
The primary motivation of this work is that objects and scenes can be represented using hierarchical structures defined by compositional rules. For instance, faces are composed of eyes, a nose, and a mouth. Similarly, geometric objects such as curves can be defined in terms of shorter curves that are recursively described. A hierarchical structure defined by compositional rules yields a rich description of a scene that captures both the presence of different objects and the relationships among them. Moreover, compositional rules provide contextual cues for inference with ambiguous data. For example, the presence of some parts of a face in a scene provides contextual cues for the presence of other parts.
In the models we consider, every object has a type from a finite alphabet and a pose from a finite but large pose space. While classical language models generate sentences using a single derivation, the grammars we consider generate scenes using multiple derivations. These derivations can be unrelated or they can share subderivations. This allows for very general descriptions of scenes.
We show how to represent the distributions defined by probabilistic scene grammars using factor graphs, and we use loopy belief propagation (LBP) [17, 15] for approximate inference. Inference with LBP simultaneously combines “bottom-up” and “top-down” contextual information. For example, when faces are defined using a composition of eyes, nose and mouth, the evidence for a face or one of its parts provides contextual influence for the whole composition. Inference via message passing naturally captures chains of contextual evidence. LBP also naturally combines multiple contextual cues. For example, the presence of an eye may provide contextual evidence for a face at two different locations because a face has a left and a right eye. However, the presence of two eyes side by side provides strong evidence for a single face between them.
We demonstrate the practical feasibility of the approach on two very different applications: curve detection and face localization. Figure 1 shows samples from the two different grammars we use for the experimental results. The contributions of our work include (1) a unified framework for contextual modeling that can be used in a variety of applications; (2) a construction that maps a probabilistic scene grammar to a factor graph together with an efficient message passing scheme; and (3) experimental results showing the effectiveness of the approach.
Probabilistic grammars and compositional models are widely used for parsing sentences in natural language processing [16]. Recursive descriptions of objects using grammars and production rules have also been widely used in computer graphics to generate geometric objects, biological forms, and landscapes [19]. A variety of other compositional models have been used in computer vision. The models we consider are closely related to the Markov backbone model in [14]. Other related approaches include [2, 20, 8, 21, 12, 10]. These previous methods have relied on MCMC or heuristic methods for inference, or dynamic programming for scenes with single objects. The models we consider generalize part-based models for object detection such as pictorial structures [11, 7] and constellations of features [3]. In particular, the grammars we consider define objects that are composed of parts but allow for modeling objects with variable structure. The models we consider also explicitly capture scenes with multiple objects.
2 Probabilistic Scene Grammars and a Factor Graph Representation
Our point of departure is a probabilistic scene grammar that defines a distribution over scenes. The approach is based on the Markov backbone from [14]. Scenes are defined using a library of building blocks, or bricks, that have a type and a pose. Bricks are generated spontaneously or through expansions of other bricks. This leads to a hierarchical organization of the elements of a scene.
Definition 2.1.
A probabilistic scene grammar consists of

A finite set of symbols, or types, .

A finite pose space, , for each symbol .

A finite set of production rules, . Each rule is of the form where . We use to denote the rules with symbol in the left-hand side (LHS). We use to denote the th symbol in the right-hand side (RHS) of a rule .

For each rule we have categorical distributions defining the probability of a pose for conditional on a pose for .

Self-rooting probabilities, , for each symbol .

A noisy-OR parameter, .
The bricks defined by are pairs of symbols and poses,
Definition 2.2.
A scene is defined by

A set of bricks that are present in .

A rule for each brick , and a pose for each in the RHS of .
Let be a directed graph capturing which bricks can generate other bricks in one production. For each rule , if , we include in . We say a grammar is acyclic if is acyclic.
A topological ordering of is an ordering of the bricks such that appears before whenever can generate . When is acyclic we can compute a topological ordering of by topological sorting the vertices of .
Definition 2.3.
An acyclic grammar defines a distribution over scenes, , through the following generative process.

Initially .

For each brick we add to independently with probability .

We consider the bricks in in a topological ordering. When considering , if we expand it.

To expand we select a rule according to and for each in the RHS of we select a pose according to . We add to with probability .
Note that because of the topological ordering of the bricks, no brick is included in after it has been considered for expansion. In particular each brick in is expanded exactly once. This leads to derivation trees rooted at each brick in the scene. The expansion of two different bricks can generate the same brick, and this leads to a “collision” of derivations. When two derivations collide they share a subderivation rooted at the point of collision. Derivations terminate using rules of the form , or through early termination of a branch with probability .
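The generative process of Definition 2.3 can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the experiments. The data layout is an assumption: rules are listed per brick rather than per symbol, each right-hand-side entry is a dictionary mapping candidate child bricks to pose probabilities, and the names `topological_order`, `sample_scene`, `self_root` and `rho` are hypothetical. The sketch also assumes that a selected child is added to the scene with the probability given by the noisy-OR parameter.

```python
import random
from graphlib import TopologicalSorter  # Python 3.9+


def topological_order(bricks, rules):
    """Order bricks so that every brick precedes the bricks it can generate."""
    preds = {b: set() for b in bricks}
    for parent, options in rules.items():
        for _, rhs in options:
            for child_dist in rhs:
                for child in child_dist:
                    preds[child].add(parent)
    # static_order() raises graphlib.CycleError if the grammar is cyclic.
    return list(TopologicalSorter(preds).static_order())


def sample_scene(bricks, rules, self_root, rho, rng):
    """Sample a scene from an acyclic grammar (illustrative layout).

    bricks:    list of (symbol, pose) pairs
    rules:     brick -> list of (rule_probability, rhs), where rhs is a list
               of dicts mapping child brick -> pose probability
    self_root: symbol -> self-rooting probability
    rho:       probability that a selected child is actually added (assumed)
    """
    order = topological_order(bricks, rules)
    # Bricks that appear spontaneously.
    scene = {b for b in bricks if rng.random() < self_root[b[0]]}
    expansions = {}
    for brick in order:
        if brick not in scene or not rules.get(brick):
            continue
        options = rules[brick]
        weights = [p for p, _ in options]
        _, rhs = rng.choices(options, weights=weights, k=1)[0]
        chosen = []
        for child_dist in rhs:
            children = list(child_dist)
            child = rng.choices(children, weights=list(child_dist.values()), k=1)[0]
            chosen.append(child)
            if rng.random() < rho:
                scene.add(child)  # a collision simply re-adds an existing brick
        expansions[brick] = chosen
    return scene, expansions
```

Because bricks are visited in topological order, a brick can only be added to the scene before it is considered for expansion, which matches the observation that each brick in the scene is expanded exactly once.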
2.1 Factor Graph Representation
We can represent the distribution over scenes,
, using a factor graph with binary variables.
Definition 2.4.
A scene generated by an acyclic grammar
defines a set of random variables,
(1)  
(2)  
(3) 
where

if .

if rule is used to expand .

when is expanded with rule , and is the pose selected for .
Let
be the vector of variables
where . We have when is not generated spontaneously or by the expansion of another brick. Therefore,
(4) 
where is the number of variables in with value 1.
Let be the vector of random variables . The generative process determines by selecting a rule for expanding when , and no rule is selected when . Therefore,
(5) 
where is an indicator vector for
Let be the vector of random variables . The generative process selects a pose for if the rule is used to expand a brick. Therefore,
(6) 
where is an indicator vector for .
The joint distribution,
, defined by an acyclic grammar can be expressed using a factored representation following the structure of the generative process defined by ,
(7) 
We can express using a factor graph over the binary variables, with a factor for each term in the product above. The factors in the factor graph representation are
(8)  
(9)  
(10) 
Although we have assumed an acyclic grammar in the derivation of the distribution in equation (7), the factor graph construction can also be applied to arbitrary grammars. This makes it possible to define probability distributions over scenes using cyclic grammars, without relying on the generative process formulation.
3 Inference Using Belief Propagation
To perform approximate inference with the factor graph representation, we use loopy belief propagation (LBP) [17, 15]. Here we describe how to compute LBP messages efficiently for the factor graphs that represent scene grammars.
The factors in our model are of one of two kinds: The factor defined in equation (8) captures a noisy-OR distribution, and the factors and defined in equations (9) and (10) capture categorical distributions in which the outcome probabilities depend on the state of a switching random variable. Figure 2 shows the local graphical representation for the two types of factors. The computation of messages from variables to factors follows the standard LBP equations. Below we describe how to efficiently compute the messages from factors to variables. The computational complexity of message updates for both kinds of factors is linear in the degree of the factor. In the derivations below we assume all messages have nonzero value.
3.1 Message passing for noisy-OR factors
Consider a factor that represents a noisy-OR relationship between binary inputs , and a binary output . Suppose we have a leak in the noisy-OR with probability and independent failure parameter . We define . We can write the factor as
(11)  
(12) 
The message passing equations are straightforward to derive and we simply state them here,
(13)  
(14)  
(15) 
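The linear-time computation for a noisy-OR factor can be illustrated with a short Python sketch. It is not the Matlab/C implementation used in the experiments, and the function name, the message layout and the parameter names `leak` and `fail` are assumptions made for illustration. The sketch assumes the factor gives probability `(1 - leak) * fail**k` to the output being off when `k` inputs are on. Leave-one-out products are obtained by division, which is where the assumption of nonzero messages is used.

```python
def noisy_or_messages(msg_in, msg_out, leak, fail):
    """Factor-to-variable messages for a noisy-OR factor in O(n) time.

    msg_in[i] = (m_i(0), m_i(1)): incoming message from binary input i
    msg_out   = (m_x(0), m_x(1)): incoming message from the binary output
    leak:  probability that the output is on when no input is on (assumed)
    fail:  probability that an active input fails to turn the output on (assumed)
    Returns (messages to each input, message to the output).
    """
    keep_off = 1.0 - leak
    # Per-input sums weighted for the "output stays off" event and unweighted.
    a = [m0 + fail * m1 for m0, m1 in msg_in]
    b = [m0 + m1 for m0, m1 in msg_in]
    prod_a = 1.0
    prod_b = 1.0
    for ai, bi in zip(a, b):
        prod_a *= ai
        prod_b *= bi

    off = keep_off * prod_a
    to_output = (off, prod_b - off)

    to_inputs = []
    for ai, bi in zip(a, b):
        a_rest = prod_a / ai  # product over the other inputs
        b_rest = prod_b / bi
        msg = []
        for v in (0, 1):
            off_v = keep_off * (fail ** v) * a_rest
            msg.append(msg_out[0] * off_v + msg_out[1] * (b_rest - off_v))
        to_inputs.append(tuple(msg))
    return to_inputs, to_output
```

Each message costs a constant number of operations once the two global products are known, so the total cost is linear in the number of inputs rather than exponential as in a direct sum over all input configurations.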
3.2 Message passing for categorical factors
Consider a factor that represents a mixture of categorical distributions. The binary values specify the outcome and controls the outcome probabilities. Concretely,
(16) 
where is the probability of the th outcome with the mixture component defined by .
In this case we can derive the following message passing equations,
(17)  
(18)  
(19) 
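A similar sketch illustrates why the categorical factor also admits linear-time updates. The encoding here is an assumption made for illustration: the outcome variables form an exactly one-hot vector, `theta[y][k]` is the probability of outcome `k` under switch value `y`, and all names are hypothetical. Handling the case in which no outcome is selected (for example, by adding a dummy outcome) is left out of the sketch.

```python
def categorical_messages(msg_z, msg_y, theta):
    """Factor-to-variable messages for a switched categorical factor.

    msg_z[k] = (m_k(0), m_k(1)): incoming message from outcome indicator k
    msg_y[y]: incoming message from the switch variable, one entry per state
    theta[y][k]: probability of outcome k given switch state y (assumed layout)
    The factor is nonzero only when exactly one indicator equals 1.
    Returns (messages to each indicator, message to the switch).
    """
    prod0 = 1.0
    for m0, _ in msg_z:
        prod0 *= m0
    # Ratio m_k(1) / m_k(0); relies on nonzero messages.
    ratio = [m1 / m0 for m0, m1 in msg_z]
    # s[y] = sum_k theta[y][k] * ratio[k], shared by all outgoing messages.
    s = [sum(t * r for t, r in zip(row, ratio)) for row in theta]

    to_y = [prod0 * sy for sy in s]

    to_z = []
    for k, (m0, _) in enumerate(msg_z):
        rest = prod0 / m0  # all other indicators set to 0
        on = sum(my * row[k] for my, row in zip(msg_y, theta))
        off = sum(my * (sy - row[k] * ratio[k])
                  for my, row, sy in zip(msg_y, theta, s))
        to_z.append((rest * off, rest * on))
    return to_z, to_y
```

For a switch with a fixed number of states, the shared sums `s[y]` are computed once, and every outgoing message then costs constant time.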
4 Learning Model Parameters
For a grammar with fixed structure we can use EM to learn the production rule probabilities, , and the self-rooting parameters, . The approach involves iterative updates. In each iteration, we (1) use LBP to compute (approximate) conditional marginal probabilities on training examples with the current model parameters, and (2) update the model parameters according to sufficient statistics derived from the output of LBP.
Let be the marginal probability of brick being expanded using rule in the training example . In the factor graph representation, this corresponds to the marginal probability that a random variable takes a particular value, , a quantity that is approximated by the output of LBP. The update for is,
(20) 
The value of is determined by normalizing probabilities over . We update the self-rooting parameters, , in an analogous way, using approximate marginals computed by LBP.
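The rule-probability update can be sketched as accumulating expected counts and normalizing. This is a minimal sketch under assumed conventions: `marginals` holds one dictionary per (training example, brick) pair for a single left-hand-side symbol, mapping a rule identifier to the approximate marginal reported by LBP. The function name and data layout are hypothetical.

```python
def update_rule_probs(marginals):
    """M-step for the rules of one symbol: normalize expected rule counts.

    marginals: iterable of dicts, rule id -> approximate marginal probability
               that the rule was used to expand a brick (output of LBP).
    Returns a dict, rule id -> updated rule probability.
    """
    totals = {}
    for q in marginals:
        for rule, value in q.items():
            totals[rule] = totals.get(rule, 0.0) + value
    norm = sum(totals.values())
    if norm == 0.0:
        raise ValueError("no expected counts for this symbol")
    return {rule: total / norm for rule, total in totals.items()}
```

For example, `update_rule_probs([{"end": 0.2, "continue": 0.6}, {"end": 0.1, "continue": 0.1}])` gives a probability of about 0.3 to `"end"` and about 0.7 to `"continue"`.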
5 Experiments
To demonstrate the generality of our approach we conducted experiments with two different applications: curve detection and face localization. Previous approaches for these problems typically use fairly distinct methods. Here, we demonstrate that we can handle both problems within the same framework. In particular we have used a single implementation of a general computational engine for both applications. The computational engine can perform inference and learning using arbitrary scene grammars. We report the speed of inference as performed on a laptop with an Intel® i7 2.5GHz CPU and 16 GB of RAM. Our framework is implemented in Matlab/C using a single thread.
5.1 Curve detection
To study curve detection we used the Berkeley Segmentation Dataset (BSDS500) [1] following the experimental setup described in [9]. The dataset contains natural images and object boundaries manually marked by human annotators. For our experiments, we used the standard split of the dataset with training images and test images. For each image we use the boundaries marked by a single human annotator to define ground-truth binary contour maps .
From a binary contour map we generate a noisy image by sampling each pixel independently from a normal distribution whose mean depends on the value of .
(21) 
For our experiments, we used , , .
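The observation model can be sketched as follows. The parameter names `mu_off`, `mu_on` and `sigma` are placeholders for the two means and the standard deviation of the normal distributions; their values are not taken from the paper. The second function evaluates the same Gaussian densities for an observed pixel, which is one way to obtain the per-pixel evidence used during inference.

```python
import math
import random


def noisy_image(contour_map, mu_off, mu_on, sigma, rng):
    """Sample each pixel independently from N(mu_on, sigma) or N(mu_off, sigma).

    contour_map: list of rows of 0/1 values (1 marks a contour pixel)
    rng:         a random.Random instance
    """
    return [[rng.gauss(mu_on if pixel else mu_off, sigma) for pixel in row]
            for row in contour_map]


def pixel_evidence(value, mu_off, mu_on, sigma):
    """Return (p(value | pixel off), p(value | pixel on)) under the same model."""
    def density(mu):
        z = (value - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return density(mu_off), density(mu_on)
```

With a large `sigma` relative to the gap between the two means, individual pixels carry little information, which is the regime where contextual information from the grammar matters most.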
To model binary contour maps we use a first-order Markov process that generates curves of different orientations and varying lengths. The grammar is defined by two symbols: (oriented curve element) and (curve pixel). We consider curves in one of 8 possible orientations. For an image of size , the pose space for is an grid and the pose space for is an grid.
We can express the rules of the grammar as
(22)  
(23)  
(24)  
(25) 
where denotes a rotation of by . Consider generating a “horizontal” curve, with orientation , starting at pixel . The process starts at the brick . Expansion of this brick will generate a brick to denote that pixel is part of a curve in the scene. Expansion of with the first rule ends the curve, while expansion with one of the other rules continues the curve in one of the three pixels to the right of .
The values on the right of the rules above indicate their probabilities. To learn the rule probabilities and self-rooting parameters, we used the approach outlined in Section 4. We show random contour maps generated by this grammar in Figure 1. The model generates multiple curves in a single image due to the self-rooting parameters.
In Figure 3 we show curve detection results using the curve grammar for some examples from the BSDS500 test set (see the supplementary material for more results). We illustrate the estimated probability that each pixel is part of a curve, . This involves running LBP in the factor graph representing the curve grammar. For inference with an observed image, , we use the model in equation (21). In the factor graph, this means the variable is connected to and receives a fixed message from . Inference on a test image took 1.5 hours.
For a quantitative evaluation we compute an AUC score, corresponding to the area under the precision-recall curve obtained by thresholding . We also evaluate a baseline “no-context” model, where the probability that a pixel belongs to a curve is computed using only the observation at that pixel. The grammar model obtained an AUC score of 0.71 while the no-context baseline achieved an AUC score of 0.11. For comparison, in [9] an AUC score of 0.73 was reported for the single-scale Field-of-Patterns (FOP) model. (The contour maps used in [9] may differ from ours since images in the BSDS500 have multiple annotations.)
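An area under the precision-recall curve can be computed from per-pixel probabilities and ground-truth labels as in the sketch below. The text does not specify how the curve is integrated, so the step-wise rule used here (equivalent to average precision) is an assumption, as is the function name. Tied scores are not treated specially.

```python
def pr_auc(scores, labels):
    """Area under the precision-recall curve via step-wise integration.

    scores: estimated probabilities, one per pixel
    labels: ground-truth 0/1 values, one per pixel
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")
    # Lowering the threshold admits pixels in order of decreasing score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_pos = 0
    area = 0.0
    prev_recall = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            true_pos += 1
            recall = true_pos / total_pos
            precision = true_pos / rank
            area += precision * (recall - prev_recall)
            prev_recall = recall
    return area
```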
The use of contextual information defined by the curve grammar described here significantly improves the curve detection performance. Although our method performed well in detecting curves in extremely noisy images, the model has some trouble finding curves with high curvature. We believe this is primarily because the grammar we used does not have a notion of curvature. It is possible to define more detailed models of curves to improve performance. However, we note that a simple first-order model of curves with no curvature information is sufficient to compete well against other approaches such as [9].
5.2 Face Localization
To study face localization, we performed experiments on the Faces in the Wild dataset [13]. The dataset contains faces in unconstrained environments. Our goal for this task is to localize the face in the image, as well as face parts such as eyes, nose, and mouth. We randomly select images for training, and images for testing. Although the dataset comes annotated with the identity of the persons in the image, it does not come with part annotations. We manually annotate all training and test images with bounding box information for the parts: Face, Left eye, Right eye, Nose, Mouth. Examples of the manual annotation are shown in Figure 4.
The face grammar has symbols Face (), Left eye (), Right eye (), Nose (), and Mouth (). Each symbol has an associated set of poses of the form , which represent a position and scale in the image. We refer to the collection of symbols as the parts of the face. The grammar has a single rule of the form . We express the geometric relationship between a face and each of its parts by a scale-dependent offset and region of uncertainty in pose space. The offset captures the mean location of a part relative to the face, and the region of uncertainty captures variability in the relative locations.
Concretely, suppose we had a Face with pose . Then, for each part , the Face would expand to a part somewhere in a uniform region centered at . Having a part-dependent base offset allows us to express information such as “the mouth is typically near the bottom of the face” and “the nose is typically near the middle of the face”. The dependence of the offset on the scale of the Face allows us to place parts in their correct position independent of the Face size. We model the relationship between scales of a Face and a part in a similar way. Modeling the relation between scales allows us to represent concepts such as large faces tending to have large parts. We learn the geometric parameters such as the part offsets by collecting statistics in the training data.
Figure 1 shows samples of scenes with one face generated by the grammar model we estimated from the training images in the face dataset. Note the location and scale of the objects varies significantly in different scenes, but the relative positions of the objects are fairly constrained. Samples of scenes with multiple faces are included in the supplemental material.
Our data model is based on HOG filters [4]. We train HOG filters using publicly available code from [6]. We train separate filters for each symbol in the grammar using our annotated images to define positive examples. Our negative examples are taken from the PASCAL VOC 2012 dataset [5], with images containing the class “People” removed.
The score of a HOG filter is real-valued. We convert this score to a probability using Platt’s method [18], which involves fitting a sigmoid. This allows us to estimate for each symbol . For the observation model we require a quantity that can be interpreted as , up to a proportionality constant. We note that . We approximate using the self-rooting probability, . To connect the data model to the grammar, the normalized scores of each filter are used to define messages into the corresponding bricks in the factor graph.
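The conversion from a filter score to a message can be sketched as follows. The sigmoid follows Platt's parameterization with fitted coefficients `a` and `b`, which are assumed to be available from a separate fitting step. Dividing the posterior by a prior, here a hypothetical `prior` argument standing in for the self-rooting probability, gives a quantity proportional to the likelihood for each state of the brick. The function names are illustrative.

```python
import math


def platt_probability(score, a, b):
    """Platt's sigmoid 1 / (1 + exp(a * score + b)), evaluated stably."""
    t = a * score + b
    if t >= 0:
        e = math.exp(-t)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(t))


def brick_message(score, a, b, prior):
    """Message (value for 'absent', value for 'present') into a brick variable.

    Uses likelihood proportional to posterior / prior, with `prior` playing
    the role of the self-rooting probability (an approximation).
    """
    p = platt_probability(score, a, b)
    return (1.0 - p) / (1.0 - prior), p / prior
```

With `a = -1`, `b = 0` and `prior = 0.1`, a score of 0 maps to a posterior of 0.5, so the evidence favors presence by a factor of nine.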
The result of inference with our grammar model leads to the (approximate) probability that there is an object of each type in each pose in the image. We show detection and localization results on images with multiple faces in the supplementary material. To quantify the performance of the model for localizing the face and its parts on images containing a single face we take the highest probability pose for each symbol. As a baseline we consider localizing each symbol using the HOG filter scores independently, without using a compositional rule.
Figure 4 shows some localization results. The results illustrate the context defined by the compositional rule is crucial for accurate localization of parts. The inability of the baseline model to localize a part implies the local image evidence is weak. By making use of contextual information in the form of a compositional rule we can perform accurate localization despite locally weak image evidence.
We provide a quantitative evaluation of the grammar model and the baseline model in Table 1. The Face localization accuracy of both models is comparable. However, when attempting to localize smaller objects such as eyes, context becomes important since the local image evidence is ambiguous. We also ran an experiment with the grammar model without a HOG filter for the face. Here, the grammar is unchanged but there is no data model associated with the Face symbol. As can be seen in the bottom row of Table 1, we can localize faces very well despite the lack of a face data model, suggesting that contextual information alone is enough for accurate face localization. Inference using the grammar model on a test image took 2 minutes.
Model          Face         Left Eye     Right Eye    Nose        Mouth        Average
HOG filters    14.7 (18.7)  33.8 (39.7)  37.9 (35.1)  8.9 (18.1)  24.6 (35.0)  24.0
Grammar-Full   13.1 (17.1)  6.6 (12.4)   8.2 (16.5)   5.5 (10.6)  11.4 (17.7)  9.0
Grammar-Parts  13.8 (18.3)  6.1 (10.8)   8.8 (19.1)   7.4 (15.1)  12.1 (19.1)  9.7
Table 1: Mean distance of each part to the ground truth location. Standard deviations are shown in brackets. Grammar-Full denotes the grammar model of faces with filters for all symbols. Grammar-Parts denotes the grammar model with no filter for the face symbol. The grammar models significantly outperform the baseline in localization accuracy. Further, the localization of the Face symbol for Grammar-Parts is very good, suggesting that context alone is sufficient to localize the face.
6 Conclusion
Probabilistic scene grammars define priors that capture relationships between objects in a scene. By using a factor graph representation we can apply belief propagation for approximate inference with these models. This leads to a robust algorithm for aggregating local evidence through contextual relationships. The framework is quite general and the practical feasibility of the approach was illustrated on two different applications. In both cases the contextual information provided by a scene grammar proves fundamental for good performance.
Acknowledgements
We would like to thank Stuart Geman and Jackson Loper for many helpful discussions about the topics of this research.
References
[1] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, May 2011.
[2] Elie Bienenstock, Stuart Geman, and Daniel Potter. Compositionality, MDL priors, and object recognition. In Advances in Neural Information Processing Systems, pages 838–844, 1997.
[3] Michael Burl, Markus Weber, and Pietro Perona. A probabilistic approach to object recognition using local photometry and global geometry. In European Conference on Computer Vision, pages 628–641. Springer, 1998.
[4] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 886–893, 2005.
[5] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[6] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Discriminatively trained deformable part models, release 4. http://people.cs.uchicago.edu/~pff/latent-release4/.
[7] Pedro F. Felzenszwalb and Daniel P. Huttenlocher. Pictorial structures for object recognition. International Journal of Computer Vision, 61(1):55–79, 2005.
[8] Pedro F. Felzenszwalb and David McAllester. Object detection grammars. University of Chicago Computer Science Technical Report 2010-02, 2010.
[9] Pedro F. Felzenszwalb and John G. Oberlin. Multiscale fields of patterns. In Advances in Neural Information Processing Systems, pages 82–90, 2014.
[10] Sanja Fidler, Marko Boben, and Aleš Leonardis. Learning a hierarchical compositional shape vocabulary for multi-class object representation. arXiv:1408.5516, 2014.
[11] Martin A. Fischler and Robert A. Elschlager. The representation and matching of pictorial structures. IEEE Transactions on Computers, (1):67–92, 1973.
[12] Ross B. Girshick, Pedro F. Felzenszwalb, and David McAllester. Object detection with grammar models. In Advances in Neural Information Processing Systems, pages 442–450, 2011.
[13] Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
[14] Ya Jin and Stuart Geman. Context and hierarchy in a probabilistic image model. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 2145–2152, 2006.
[15] Frank R. Kschischang, Brendan J. Frey, and Hans-Andrea Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, 2001.
[16] Christopher D. Manning and Hinrich Schütze. Foundations of statistical natural language processing. MIT Press, 1999.
[17] Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Uncertainty in Artificial Intelligence, pages 467–475, 1999.
[18] John C. Platt. Probabilities for SV machines. In Advances in Large Margin Classifiers, pages 61–74. MIT Press, 1999.
[19] Przemyslaw Prusinkiewicz and Aristid Lindenmayer. The algorithmic beauty of plants. Springer, 1991.
[20] Zhuowen Tu, Xiangrong Chen, Alan L. Yuille, and Song-Chun Zhu. Image parsing: Unifying segmentation, detection, and recognition. International Journal of Computer Vision, 63(2):113–140, 2005.
[21] Yibiao Zhao and Song-Chun Zhu. Image parsing with stochastic scene grammar. In Advances in Neural Information Processing Systems, pages 73–81, 2011.
Appendix A Contour Detection Results
Appendix B Multiple Faces
In Figure 7 we show unconstrained samples from the grammar model for faces. Note how the model generates scenes with multiple faces, and also generates parts that appear on their own, since every symbol has a non-zero probability of self-rooting.
In Figure 8 we show localization results for images with multiple faces. In this case we show the top poses for each symbol after performing non-maximum suppression, where is the number of faces in the image.
In general we do not know in advance the number of symbols of each type that are present in the scene. In this case it is not possible to simply select the top poses for each symbol. A different strategy is to use a threshold, and select bricks that have marginal probabilities above the threshold to generate detections. The particular threshold used depends on the desired trade-off between false positives and false negatives. A threshold for a particular application can be set by examining a precision-recall curve. For three example images, we manually selected a single threshold for detection that leads to good results. In Figure 9 we show all Face bricks with marginal probability above the threshold, after applying non-maximum suppression.
Appendix C Contextual Influence
Figure 10 shows the results of inference when conditioning on various parts being in the image in specific poses. The grammar used here was a simplified version of the face grammar in the main paper. The symbols are Face (), Eye (), Nose (), and Mouth (). The pose of each symbol is the set of pixels in the image (there is no scale variation). The only compositional rule is . Note that we use the same symbol to represent the left and right eyes.
As can be seen in Figure 10, when we condition on the presence of a Face at a particular position (first row), the model “expects” to see parts of the face in certain regions in the image. When we condition on the location of an Eye (second row), the model does not know whether the Eye should be a left eye or right eye, hence there are two modes for the location of the Face, and two modes for the location of another Eye. Intuitively LBP is performing the following chain of reasoning: (1) the eye that is known to be present can be a left or right eye, (2) there are two possible regions of the image in which the face can occur, depending on whether the eye that is known to be present is a left or right eye, and finally (3) given each possible pose for the face, the other parts of the face should be located in a particular spatial configuration. When we condition on more parts, LBP can infer that it is more likely for the face to be in one region of the image over another region, and the beliefs for the other face parts reflect this reasoning.