The past few years have seen dramatic improvements in the performance of object recognition systems, especially in 2D object detection and classification. Much of this progress has been driven by deep learning techniques, which allow for end-to-end learning of multiple layers of low-, mid- and high-level image features. These features are used to predict, e.g., the object’s class, its 2D location, or its 3D pose, as long as sufficiently many annotations for the desired output are available for training the corresponding deep net.
On the other hand, automatic semantic parsing of natural scenes that typically exhibit contextual relationships among multiple object instances remains a core challenge in computational vision. As an example, consider the dining room table scene shown in Figure 1, where it is fairly common for collections of objects to appear in a specific arrangement on the table. For instance, a plate setting often involves a plate with a knife, a fork and a spoon to the left or right of the plate, and a glass in front of the plate. Also, the knife, fork and spoon often appear parallel to each other rather than in a random configuration. These complex spatial relationships among object poses are often not captured by existing deep networks, which tend to detect each object instance independently. We argue that modeling such contextual relationships is essential for highly accurate semantic parsing because detecting objects in the context of other objects can potentially provide more coherent interpretations (e.g., by avoiding object detections that are inconsistent with each other).
Proposed Bayesian Framework: We propose to leverage recent advances in object classification, especially deep learning of low-, mid- and high-level features, to build high-level generative models that reason about objects in the scene rather than features in the image. Specifically, we assume we have at our disposal a battery of classifiers trained to answer specific questions about the scene (e.g., is there a plate in this image patch?) and propose a model for the output of these high-level classifiers.
The proposed model is Bayesian, but can be seen as a hybrid of learning-based and model-based approaches. By the former, we refer to parsing an image by scanning it with a battery of trained classifiers (e.g., SVMs or deep neural nets). By the latter, we refer to identifying likely states under the posterior distribution in a Bayesian framework which combines a prior model over interpretations and a data model based (usually) on low-level image features. In a nutshell, we maintain the battery of classifiers and the Bayesian framework by replacing the low-level features with high-level classifiers. This is enabled by defining the latent variables in one-to-one correspondence with the classifiers. In particular, there are no low-level or mid-level features in the model; all variables, hidden and measured, have semantic content. We refer to the set which indexes the latent variables and corresponding classifiers as “queries” and to the latent variables as “annobits”. For example, some annobits might be lists of binary indicators of the presence or absence of visible instances from a subset of object categories in a specific image patch, and the corresponding classifiers might be CNNs which output a vector of weights for each of these categories. An annobit can be seen as a perfect (noiseless) classifier and, vice-versa, a classifier can be seen as an imperfect (noisy) annobit. The data model is the conditional distribution of the family of classifiers given the family of annobits.
The prior model encodes our expectations about how scenes are structured, for example encoding preferred spatial arrangements among objects composing a dining room table setting. Hence the posterior distribution serves to modulate or “contextualize” the raw classifier output. We propose two prior models. The first one combines a prior model of the 3D scene and camera geometry, whose parameters can be encoded by a homography, and a Markov random field (MRF) model of the 2D spatial arrangement of object instances given the homography. The model is motivated by our particular application to parsing dining room table scenes, where most objects lie on the table plane. This model is easy to sample from its posterior, but it is hard to learn tabula-rasa due to lack of modularity and therefore the need for a great many training samples. The second model is based on an attributed graph where each node corresponds to an object instance that is attributed with a category label and a pose in the 3D world coordinate system. The attributed graph is built on top of a random skeleton that encodes spatial relationships among different object instances. This model is easy to learn and sample, but sampling from its posterior is much harder. We get the best of both worlds by using the second model to synthesize a large number of annotated scenes, which are then used to learn the parameters of the first model.
Proposed Scene Parsing Strategy: Depending on the scene, running a relatively small subset of all the classifiers might already provide a substantial amount of information about the scene, perhaps even a sufficient amount for a given purpose. Therefore, we propose to annotate the data sequentially, identifying and applying the most informative classifier (in an information-theoretic sense) at each step given the accumulated evidence from those previously applied.
The selection of queries is task-dependent, but some general principles can be articulated. We want to structure them to allow the parsing procedure to move freely among different levels of semantic and geometric resolution, for example to switch from analyzing the scene as a whole, to local scrutiny for fine discrimination, and perhaps back again depending on current input and changes in target probabilities as evidence is acquired. Processing may be terminated at any point, ideally as soon as the posterior distribution is peaked around a coherent scene description, which may occur after only a small fraction of the classifiers have been executed.
The Bayesian framework provides a principled way for deciding what evidence to acquire at each step and for coherently integrating the evidence by updating likelihoods. At each step, we select the classifier (equivalently, the query) which achieves the maximum value of the conditional mutual information between the global scene interpretation and any classifier given the existing evidence (i.e., output of the classifiers already implemented). Consequently, the order of execution is determined online during scene parsing by solving the corresponding optimization problem at each step. The proposed Information Pursuit (IP) strategy then alternates between selecting the next classifier, applying it to the image data, and updating the posterior distribution on interpretations given the currently collected evidence.
Application to 2D Object Detection and 3D Pose Estimation in the JHU Table-Setting Dataset:
We will use the proposed IP strategy to detect instances from multiple object categories in an image and estimate their 3D poses. More precisely, consider a 3D scene and a semantic description consisting of a variable-length list of the identities and 3D poses of visible instances from a pre-determined family of object categories. We want to recover this list by applying high-level classifiers to an observed image of the scene acquired from an unknown viewpoint. As a proof of concept, we will focus on indoor scenes of dining room tables, where the specific categories are plate, glass, utensil and bottle. Such scenes are challenging due to severe occlusion, complex photometry and intra-class variability. In order to train models and classifiers we have collected and manually labeled images of table settings from the web. We will use this dataset for learning our model, training and testing the classifiers, and evaluating the system’s performance. We will show that we can make accurate decisions about existing object instances by processing only a small fraction of patches from a given test image. We will also demonstrate that coarse-to-fine search naturally emerges from IP.
Paper Contributions: In summary, the core contribution of our work is a Bayesian framework for semantic scene parsing that combines (1) a data model on the output of high-level classifiers as opposed to low-level image features, (2) prior models on the scene that capture rich contextual relationships among instances of multiple object categories, (3) a progressive scene annotation strategy driven by stepwise uncertainty reduction, and (4) a dataset of table settings.
Paper Outline: The remainder of the paper is organized as follows. In section 2 we summarize some related work. In section 3 we define the main system variables and formulate information pursuit in mathematical terms. In section 4 we introduce the annobits and the annocell hierarchy. In section 5 we introduce our prior model on 3D scenes, which includes a prior model on interpretation units and a prior model on scene geometry and camera parameters. In section 6 we introduce a novel scene generation model for synthesizing 3D scenes, which is used to learn the parameters of the prior model. The algorithm for sampling from the posterior distribution, a crucial step, is spelled out in section 7 and the particular classifiers (CNNs) and data model (Dirichlet distributions) we use in our experiments are described in section 8. In section 9 we introduce the “JHU Table-Setting Dataset”, which is composed of about 3000 fully annotated scenes, which we use for training the prior model and the classifiers. In section 10 we present comprehensive experiments, including comparisons between IP and using the CNNs alone. Finally, there is a concluding discussion in section 11.
2 Related Work
The IP strategy proposed in this work is partially motivated by the “divide-and-conquer” search strategy employed by humans in playing parlor and board games such as “Twenty Questions,” where the classifiers would represent noisy answers, as well as by the capacity of the human visual system to select potential targets in a scene and ignore other items through acts of selective attention (Serences and Yantis, 2006; Reynolds et al., 1999). An online algorithm implementing the IP strategy was first introduced by Geman and Jedynak (1996) under the name “active testing” and designed specifically for road tracking in satellite images. Since then, variations on active testing have appeared in (Sznitman and Jedynak, 2010) for face detection and localization, in (Branson et al., 2014) for fine-grained classification, and in (Sznitman et al., 2013) for instrument tracking during retinal microsurgery. However, it has not yet been applied to problems of the complexity of 3D scene interpretation.
CNNs, and more generally deep learning with feature hierarchies, are now ubiquitous. Current CNNs are designed based on the same principles introduced years ago in (Homma et al., 1988; Lecun et al., 1998). In the past decade, more efficient ways to train neural networks with more layers (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007), together with far larger annotated training sets (e.g., large public image repositories such as ImageNet (Deng et al., 2009)) and efficient implementations on high-performance computing systems, such as GPUs and large-scale distributed clusters (Dean et al., 2012; Ciresan et al., 2011), resulted in the success of deep learning and more specifically CNNs. This has led to impressive performance of CNNs on a number of benchmarks and competitions including the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) (Russakovsky et al., 2015). To achieve better performance, the network size has grown constantly in the past few years by taking advantage of the newer and more powerful computational resources.
State-of-the-art object detection systems (e.g., RCNN (Girshick et al., 2016) and faster RCNN (Ren et al., 2015)) initially generate some proposal boxes which are likely to contain object instances; these boxes are then processed by the CNN for classification, and then regressed to obtain better bounding boxes for positive detections. In RCNN (Girshick et al., 2016), the proposals are generated using the “selective search” algorithm (Uijlings et al., 2013), which generates candidates by various ways of grouping the output of an initial image segmentation. The faster region-based CNN (faster RCNN) of Ren et al. (2015) does not use the selective search algorithm to generate the candidate boxes; their network generates the proposals internally in the forward pass. These approaches do not use contextual relations to improve disambiguation and prevent inconsistent interpretations, nor do they allow for progressive annotation or accommodate 3D representations. There is no image segmentation in our approach.
There is a considerable amount of work attempting to incorporate contextual reasoning into object recognition. Frequently this is accomplished by labeling pairs of regions obtained from segmentation or image patches using Conditional Random Fields or Markov Random Fields (Rabinovich et al., 2007; Mottaghi et al., 2014; Sun et al., 2014; Desai et al., 2011). Compositional vision (Geman et al., 2002) embeds context in a broader sense by considering more general, non-Markovian models related to context-sensitive grammars. While most of the work is about discriminative learning and reasoning in 2D (Choi et al., 2012; Sun et al., 2014; Desai et al., 2011; Felzenszwalb et al., 2010; Porway et al., 2010; Hoai and Zisserman, 2014; Rabinovich et al., 2007), several attempts have been made recently at designing models that reason about surfaces of 3D scenes and the interaction between objects and their supporting surfaces (Bao et al., 2010; Hoiem et al., 2007; Lee et al., 2010; Silberman et al., 2012; Saxena et al., 2009; Liu et al., 2014). It has been shown that reasoning about the underlying 3D layout of the scene is, as expected, useful in recognizing interactions with other objects and surfaces (Bao et al., 2010; Hoiem and Savarese, 2011). However, most of the current 3D models do not encode contextual relations among objects on supporting surfaces beyond their coplanarity.
3 General Framework
3.1 Scenes and Queries
Let $\mathcal{Z}$ be a limited set of possible interpretations or descriptions of a physical 3D scene and let $I$ be a 2D image of the scene. In this paper, a description $Z \in \mathcal{Z}$ records the identities and 3D poses of visible instances from a pre-determined family of object categories $\mathcal{C}$. The scene description $Z$ is unknown, but the image $I$ is observed and is determined by the scene together with other, typically unobserved, variables $W$, including the camera’s intrinsic and extrinsic parameters. We will assume that $Z$, $W$ and $I$ are random variables defined on a common probability space.
The goal is to reconstruct as much information as possible about the scene description $Z$ from the observed image $I$ and to generate a corresponding semantic rendering of the scene by visualizing object instances. In our setting, information about $Z$ is supplied by noisy answers to a series of image-based queries from a specified set $Q$. We assume the true answer $Y_q$ to a query $q \in Q$ is determined by $Z$ and the unobserved variables $W$; hence, for each $q \in Q$, $Y_q = f_q(Z, W)$ for some function $f_q$. The dependency of $Y_q$ on $W$ allows the queries to depend on locations relative to the observed image. We regard $Y_q$ as providing a small unit of information about the scene $Z$, and hence assuming a small set of possible values, even just two, i.e., $Y_q \in \{0, 1\}$, corresponding to the answers “no” or “yes” to a binary query. We will refer to every $Y_q$ as an “annobit” whether or not $q$ is a binary query. Also, for each subset of queries $V \subset Q$, we will denote the corresponding subset of annobits as $Y_V = (Y_q, q \in V)$ and similarly $X_V$ for classifiers (see below).
We will progressively estimate the states of the annobits from a matched family of image-based predictors. More specifically, for each query $q \in Q$, there is a corresponding classifier $X_q$, where $X_q = h_q(I)$ for some function $h_q$ of the image $I$. We will assume that each classifier has the same computational cost; this is necessary for sequential exploration based on information flow alone to be meaningful, but can also be seen as a constraint on the choice of queries $Q$. We will further assume that the family of classifiers $X_Q$ is a sufficient statistic for the family of annobits $Y_Q$ in the sense that
$$P(Y_Q \mid I) = P(Y_Q \mid X_Q). \qquad (1)$$
We will use a Bayesian model. The prior model is composed of a scene model for the scene description $Z$, which encodes knowledge about spatial arrangements of scene objects, and a camera model for the unobserved variables $W$. Combining the prior model with the data model then allows us to develop inference methods based on (samples from) the posterior $P(Z, W \mid X_Q)$. While the specific form of these models naturally depends on the application (see section 5 for a description of these models for our application to table scenes), the information pursuit strategy is generally applicable to any prior and data models, as explained next.
3.2 Information Pursuit
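Let $(q_1, \ldots, q_k)$ be an ordered sequence of the first $k$ distinct queries and let $(x_1, \ldots, x_k)$ be possible answers from the corresponding classifiers $X_{q_1}, \ldots, X_{q_k}$. Consider the event
$$E_k = \{X_{q_1} = x_1, \ldots, X_{q_k} = x_k\}, \qquad (2)$$
where $q_\ell$ is the index of the query at step $\ell$ of the process and $x_\ell$ is the observed result of applying classifier $X_{q_\ell}$ to the image $I$. Therefore, $E_k$ is the accumulated evidence after $k$ queries.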
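The IP strategy is defined recursively. The first query is fixed by the model:
$$q_1 = \arg\max_{q \in Q} \mathcal{I}(X_q, Y_Q), \qquad (3)$$
where $\mathcal{I}(X_q, Y_Q)$ is the mutual information, which is determined by the joint distribution of the classifier $X_q$ and the annobits $Y_Q$. Thereafter, for $k \geq 1$,
$$q_{k+1} = \arg\max_{q \in Q} \mathcal{I}(X_q, Y_Q \mid E_k), \qquad (4)$$
which is determined by the conditional joint distribution of $X_q$ and $Y_Q$ given the evidence to date, i.e., given the event $E_k$ that records the $k$ classifier outputs observed so far. According to (4), a classifier with maximum expected information gain given the currently collected evidence is greedily selected at each step of IP.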
From the definition of the mutual information, we have
$$\mathcal{I}(X_q, Y_Q \mid E_k) = H(Y_Q \mid E_k) - H(Y_Q \mid X_q, E_k), \qquad (5)$$
where $H$ denotes the Shannon entropy. Since the first term on the right-hand side does not depend on $q$, one sees that the next query is chosen such that adding to the evidence the result of applying $X_q$ to the test image will minimize, on average, the uncertainty about $Y_Q$. One point of caution regarding the notation $H(Y_Q \mid X_q, E_k)$: here $Y_Q$ and $X_q$ are random variables, while $E_k$ is a fixed event. The notation then refers to the conditional entropy of $Y_Q$ given $X_q$ computed under the conditional probability $P(\cdot \mid E_k)$, i.e., the expectation (with respect to the distribution of $X_q$) of the entropy of $Y_Q$ under $P(\cdot \mid X_q = x, E_k)$.
Returning to the interpretation of the selection criterion, we can also write
$$\mathcal{I}(X_q, Y_Q \mid E_k) = H(X_q \mid E_k) - H(X_q \mid Y_Q, E_k). \qquad (6)$$
This implies that the next question is selected such that:
$H(X_q \mid E_k)$ is large, so that its answer is as unpredictable as possible given the current evidence, and
$H(X_q \mid Y_Q, E_k)$ is small, so that $X_q$ is predictable given the ground truth (i.e., $X_q$ is a “good” classifier).
The two criteria are however balanced, so that one could accept a relatively poor classifier if it is (currently) highly unpredictable.
Depending on the structure of the joint distribution of the classifiers $X_Q$ and the annobits $Y_Q$, these conditional entropies may not be easy to compute. A possible simplification is to make the approximation of neglecting the error rates of $X_q$ at the selection stage, therefore replacing $X_q$ by $Y_q$. Such an approximation leads to a simpler definition of $q_{k+1}$, namely
$$q_{k+1} = \arg\max_{q \in Q} H(Y_q \mid E_k). \qquad (7)$$
Notice that the $X_q$ and $Y_q$ are not assumed to coincide in the conditioning event $E_k$ (which depends on the $X$ variables) so that the accuracy of the classifiers is still accounted for when evaluating the implications of current evidence. So here again, one prefers asking questions whose (true) answers are unpredictable. For example, one would not ask “Is it an urban scene?” after already having got a positive response to “Is there a skyscraper?” nor would one ask if there is an object instance from category $c$ in patch “A” if we already know it is highly likely that there is an object instance from category $c$ in patch “B”, a subset of “A”. Removing previous questions from the search is important with this approximation, since the mutual information in (6) vanishes in that case, but not necessarily the conditional entropy in (7).
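The simplified criterion (7) can be illustrated with a minimal sketch, assuming binary annobits and assuming that samples of the annobits from the current posterior are already available (the sampler itself is not shown). The function names and the array layout are hypothetical choices made for this illustration, not the implementation used in our experiments.

```python
import numpy as np

def binary_entropy(p):
    """Shannon entropy (in bits) of Bernoulli(p), elementwise."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def next_query_by_entropy(posterior_samples, asked):
    """Pick the unasked query whose annobit is most uncertain under the posterior.

    posterior_samples: (n_samples, n_queries) array of 0/1 annobit values,
        assumed to be drawn from the posterior given the current evidence.
    asked: set of query indices already in the history.
    """
    p_yes = posterior_samples.mean(axis=0)   # Monte Carlo estimate of P(Y_q = 1 | evidence)
    scores = binary_entropy(p_yes)           # estimated entropy of each annobit
    for q in asked:                          # previous questions must be removed explicitly
        scores[q] = -np.inf
    return int(np.argmax(scores)), scores

# Toy posterior: query 0 is certain, query 1 is a coin flip, query 2 is in between.
samples = np.array([[1, 1, 0],
                    [1, 0, 0],
                    [1, 1, 0],
                    [1, 0, 1]])
best, scores = next_query_by_entropy(samples, asked={0})
```

In this toy case the selected query is the one whose annobit is closest to a fair coin flip under the posterior, which is the behavior the criterion is meant to capture.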
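Returning to the general situation, (6) can be simplified if one makes two independence assumptions:
1. The classifiers $X_q$, $q \in Q$, are conditionally independent given the annobits $Y_Q$;
2. The classifier $X_q$ is conditionally independent of $Y_Q$ given $Y_q$, i.e., the distribution of $X_q$ depends on $Y_Q$ only through $Y_q$.
Clearly $\mathcal{I}(X_q, Y_Q \mid E_k) = 0$ if query $q$ belongs to the history, so assume $q \notin \{q_1, \ldots, q_k\}$. In what follows, let $y = (y_q, q \in Q)$, where $y_q$ represents a possible value of $Y_q$. Then, under assumptions 1 and 2, and using the fact that the evidence $E_k$ only depends on the realizations of $X_{q_1}, \ldots, X_{q_k}$, we have:
$$H(X_q \mid Y_Q, E_k) = \sum_{y} H(X_q \mid Y_Q = y, E_k)\, P(Y_Q = y \mid E_k) = \sum_{y_q} H(X_q \mid Y_q = y_q)\, P(Y_q = y_q \mid E_k).$$
The entropies $H(X_q \mid Y_q = y_q)$ can be computed from the data model and the mixture weights $P(Y_q = y_q \mid E_k)$ can be estimated from Monte Carlo simulations (see section 7). Similarly, the first term in (6), namely $H(X_q \mid E_k)$, can be expressed as the entropy of a mixture:
$$P(X_q = x \mid E_k) = \sum_{y} P(X_q = x \mid Y_Q = y, E_k)\, P(Y_Q = y \mid E_k).$$
Arguing as with the second term in (6), i.e., replacing $P(X_q = x \mid Y_Q = y, E_k)$ by $P(X_q = x \mid Y_q = y_q)$, $H(X_q \mid E_k)$ is the entropy of the mixture distribution
$$\sum_{y_q} P(X_q = x \mid Y_q = y_q)\, P(Y_q = y_q \mid E_k),$$
where the evidence $E_k$ is fixed. Consequently, given an explicit data model, the information pursuit strategy can be efficiently approximated by sampling from the posterior distribution.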
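As an illustration of the mixture computation above, the following sketch evaluates the expected information gain of a single query as the entropy of the mixture minus the weighted sum of the component entropies. It assumes, for simplicity, a classifier with a finite number of output values, whereas the data model used in our experiments is continuous; the function names and the toy confusion matrices are hypothetical.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (bits) of a discrete distribution; 0 * log(0) is treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-(p[nz] * np.log2(p[nz])).sum())

def expected_information_gain(weights, data_model):
    """Entropy of the mixture minus the average entropy of its components.

    weights: shape (n_y,), estimates of P(Y_q = y | evidence),
        e.g. frequencies among posterior samples.
    data_model: shape (n_y, n_x), row y is P(X_q = x | Y_q = y)
        (assumed discrete in this sketch).
    """
    weights = np.asarray(weights, dtype=float)
    data_model = np.asarray(data_model, dtype=float)
    mixture = weights @ data_model            # P(X_q = x | evidence)
    conditional = sum(w * entropy(row) for w, row in zip(weights, data_model))
    return entropy(mixture) - conditional

perfect = expected_information_gain([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])
useless = expected_information_gain([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]])
settled = expected_information_gain([1.0, 0.0], [[0.9, 0.1], [0.2, 0.8]])
```

A noiseless classifier on a maximally uncertain binary annobit yields one bit, while an uninformative classifier, or any classifier applied to an annobit whose value is already certain, yields zero.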
As a final note, we remark that we have used the variables to represent the unknown scene . Writing
we see that the residual uncertainty on given the current evidence will only slightly differ from the residual uncertainty of as soon as the residual uncertainty of given is small, which is a reasonable assumption when the number of annobits is large enough.
We now pass to a more specific description of the variables and their distributions. In particular, the next section provides our driving principles for the choice of the annobits. We will then discuss the related classifiers, followed by the construction of the prior and data models, their training and the associated sampling algorithms.
4.1 General Principles
The choice of the functions $f_q$ that define the annobits, $Y_q$, $q \in Q$, naturally depends on the specific application. The annobits we have in mind for scene interpretation, and have used in previous related work on a visual Turing test (Geman et al., 2015), fall mainly into three categories:
Scene context annobits: These indicate full scene labels, such as “indoor”, “outdoor” or “street”; since our application is focused entirely on “dining room table settings” we do not illustrate these.
Part-of descriptors: These indicate whether or not one image region is a subset of another, e.g., whether an image patch is part of a table.
Existence annobits: These relate to the presence or absence of object instances with certain properties or attributes. The most numerous set of annobits in our system ask whether or not instances of a given object category are visible inside a specified region.
Functions of these elementary descriptors can also be of interest. For example, we will rely heavily on annobits providing a list of all object categories visible in a given image region, as described in section 4.3.
4.2 Annocell Hierarchy
Recall from section 3.1 that a scene description consists of the object categories and 3D poses of visible instances from a pre-determined family of object categories. Here, motivated by our application to dining room table scenes where objects lie in the table plane, we use a 2D representation of the object pose, which can be put in one-to-one correspondence with its 3D pose via the homography relating the image plane and the table plane (see section 5.2 for details). More specifically, an object instance is a triple $(C, L, D)$, where $C$ denotes the object category in a set of pre-defined categories $\mathcal{C}$, $L$ denotes the location of the center of the instance in the image domain $\mathcal{L}$ and $D$ denotes its size in the image (e.g., diameter). The apparent 2D pose space is therefore $\mathcal{L} \times (0, +\infty)$. More refined poses could obviously be considered.
To define the queries, we divide the apparent pose space into cells. Specifically, we consider a finite, distinguished subset of sub-windows of the image domain, $\mathcal{W}$, and a subset of size intervals, $\mathcal{M}$, and index the queries by the triplet $q = (c, w, m)$, where $c \in \mathcal{C}$, $w \in \mathcal{W}$, and $m \in \mathcal{M}$. For every category $c$, sub-window $w$ and size interval $m$, we let $Y_{c,w,m} = 1$ if an instance of category $c$ with size in $m$ is visible in $w$, and $Y_{c,w,m} = 0$ otherwise. If $m = (0, +\infty)$, i.e., there is no restriction on size, we simply write $Y_{c,w}$. We refer to $w$ as an “annocell.” Specifically, assuming the image domain is $[0, 1]^2$ (by padding and normalizing), $\mathcal{W}$ consists of square patches of four sizes, $2^{-l} \times 2^{-l}$ for $l = 0, 1, 2, 3$. The patches at each “level” overlap: for each level, the row and column shift ratio is $0.25$, i.e., $75\%$ overlap between nearest windows. This leads to 1, 25, 169, and 841 patches for levels 0, 1, 2, and 3 respectively, for a total of 1036 patches. Figure 1 shows some of these regions selected from the four levels of the hierarchy.
Using a hierarchical annocell structure has the advantage of allowing for coarse-to-fine exploration of the pose space. Note also that, by construction, annocells at low resolution are unions of certain high-resolution ones. This implies that the value of the annobits at low resolution can in turn be derived as maximums of high-resolution annobits.
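The patch counts quoted for the four levels can be reproduced with a short sketch. It assumes a unit-square image domain, patches of side $2^{-l}$ at level $l$, and a row and column shift of one quarter of the patch side; these values are chosen here because they yield exactly 1, 25, 169 and 841 patches, and the function name is hypothetical.

```python
def annocell_grid(level, shift_ratio=0.25):
    """Side length and top-left corners of the square patches at one level.

    Assumes the image domain is the unit square and that the patch side
    at `level` is 2 ** -level (an assumption that matches the stated counts).
    """
    side = 2.0 ** (-level)
    step = shift_ratio * side
    n = round((1.0 - side) / step) + 1   # patches per row (and per column)
    corners = [(i * step, j * step) for i in range(n) for j in range(n)]
    return side, corners

counts = [len(annocell_grid(level)[1]) for level in range(4)]
total = sum(counts)
```

With these settings each level has $(4(2^l - 1) + 1)^2$ patches, which gives the per-level counts above and their sum.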
4.3 Extended Existence Annobits
Due to the nature of the classifiers we use in our application, we also introduce annobits that list the categories that have entirely visible instances in an annocell, i.e., the collection
In addition, we also use category-independent, size-related annobits: for each annocell $w$ and size interval $m$, we define a binary annobit which indicates whether or not the average size of the objects present in $w$ belongs to $m$.
4.4 Classifiers for Annobits
The particular image-based predictors of the annobits we use in the table-setting application are described in full detail in section 8. Some examples include:
Variables , , which provide a vector of weights on for predicting .
Variables , , which provide a probability vector on for predicting .
Additional variables (where is a subset of ) will also be introduced. They are designed to predict information units if more than half of overlaps the table. Observe that the classifier assigned to does not necessarily assume the same value as . However, this is not a problem since we are only interested in the conditional distribution of given .
5 Prior Model
Following section 3, the joint distribution of the annobits is derived from a prior model on the 3D scene description, $Z$, and on camera parameters, $W$. We assume these variables to be independent and model them separately.
5.1 Scene Model
Motivated by our application to dining room table scenes, we assume a fixed dominant plane in the 3D model, and choose a coordinate system in $\mathbb{R}^3$ such that the $xy$-plane coincides with this dominant plane. The scene is represented as a set of object instances, assumed to be sitting on a bounded region of the dominant plane, in our case a centered, rectangular table characterized by its length and width. Recall from section 4.2 that each object instance is represented by a category, a location and a size in the image. Here, we assume that objects from a given category have a fixed size, so that the size of an instance is a function of its category. The distribution of the scene description will be defined conditional to the table geometry, since, for example, the size of the table will directly impact the number of objects that it can support. More generally the table can be replaced by some other variable representing more complex properties of the global scene geometry. For convenience we sometimes drop the geometry variable from our notation. However, most of the model components introduced below depend on it, and the proposed model is to be understood conditional to the geometry.
We partition the reference plane into small cells (
in the table-setting case) and use binary variables to indicate the presence of instances of object categories centered in each cell. In other words, we discretize the family of object instances into a binary random field that we will still denote by $Z$. Letting $\mathcal{J}$ denote the set of cells, a configuration can therefore be represented as the binary vector $z = (z_{c,j}, c \in \mathcal{C}, j \in \mathcal{J})$ where $z_{c,j} = 1$ if and only if an object of category $c$ is centered in the cell $j$.
The configuration $z$ is obviously a discrete representation of the scene layout restricted to object categories and locations. Letting $\Omega$ denote the space of all such configurations, we will use a Gibbs distribution on $\Omega$ associated with a family of feature functions $\phi_i : \Omega \to \mathbb{R}$, with $i = 1, \ldots, n$, and scalar parameters $\lambda_1, \ldots, \lambda_n$. The Gibbs distribution then has the following form:
$$p(z) = \frac{1}{\kappa(\lambda)} \exp\Big(\sum_{i=1}^{n} \lambda_i \phi_i(z)\Big),$$
where $\kappa(\lambda)$ is the normalizing factor (partition function) ensuring that the probabilities sum up to one. Figure 2 shows a table and its fitted mesh where each of the cells is a square.
We use the following features:
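Existence features, which indicate whether or not an instance from a given category $c$ is centered anywhere in a given set of cells $J$, therefore taking the form
$$\phi_{c,J}(z) = \max_{j \in J} z_{c,j},$$
where $z_{c,j} \in \{0, 1\}$ indicates whether an instance of category $c$ is centered in cell $j$ and $J$ is a subset of the set of cells. We consider sets at three different granularity levels, illustrated in Figure 3. At the fine level $J = \{j\}$ is a singleton, so that $\phi_{c,J}(z) = z_{c,j}$. We also consider middle-level sets ($3 \times 3$ array of fine cells) and coarse-level sets ($6 \times 6$ array of fine cells) that cover the reference plane without intersection.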
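Conjunction features, which are products of two middle-level existence features (of the same or different categories), and therefore signal their co-occurrence:
$$\phi_{c,J,c',J'}(z) = \Big(\max_{j \in J} z_{c,j}\Big) \Big(\max_{j' \in J'} z_{c',j'}\Big),$$
where $J$ and $J'$ are middle-level sets of cells and $c$, $c'$ are object categories. To limit model complexity, only pairs $(J, J')$ whose centers are less than a threshold away are considered, where the threshold can depend on the pair $(c, c')$.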
Invariance and symmetry assumptions about the 3D scene are then encoded as equality constraints among the model parameters thereby reducing model complexity. Grouping binary features with identical parameters is then equivalent to considering a new set of features that count the number of layout configurations satisfying some conditions on the locations and categories. For table settings, it is natural to assume invariance by rotation around the center of the table. Hence we assume that existence features whose domain is of the same size and located at the same distance from the closest table edge all have the same weights (the $\lambda$’s), and hence the probability only depends on the number of such instances.
We group conjunction feature functions based on the distance of the first patch to the edge of the table, and the relative position of the second patch (left, right, front, or back) with respect to the first patch.
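The two feature types can be made concrete with a small sketch that evaluates existence and conjunction features on a binary layout and combines them into the unnormalized log-probability of a Gibbs distribution. The grid size, the category indices, the block positions and the weights are hypothetical values chosen for illustration; they are not the learned parameters, and the partition function is not computed.

```python
import numpy as np

def existence(z, category, rows, cols):
    """1 if any instance of `category` is centered in the block rows x cols, else 0."""
    return int(z[category, rows, cols].any())

def conjunction(z, cat_a, block_a, cat_b, block_b):
    """Product of two existence features: 1 only if both blocks are occupied."""
    return existence(z, cat_a, *block_a) * existence(z, cat_b, *block_b)

def unnormalized_log_prob(z, features, weights):
    """Weighted sum of feature values; the log partition function is omitted."""
    return sum(w * f(z) for f, w in zip(features, weights))

# Toy layout: 2 categories (0 = plate, 1 = utensil) on a 6 x 6 grid of fine cells.
z = np.zeros((2, 6, 6), dtype=int)
z[0, 1, 1] = 1                          # a plate centered in cell (1, 1)
z[1, 1, 4] = 1                          # a utensil centered in cell (1, 4)

left = (slice(0, 3), slice(0, 3))       # a middle-level 3 x 3 block
right = (slice(0, 3), slice(3, 6))      # the neighbouring 3 x 3 block
features = [
    lambda z: existence(z, 0, *left),
    lambda z: existence(z, 1, *right),
    lambda z: conjunction(z, 0, left, 1, right),
    lambda z: existence(z, 1, *left),
]
weights = [0.5, 0.2, 1.0, -0.3]         # hypothetical weights, not learned values
score = unnormalized_log_prob(z, features, weights)
```

Tying the weights of features that are equivalent under the symmetry assumptions amounts to summing the corresponding feature values before multiplying by the shared weight.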
Remark: The model can be generalized to include pose attributes other than location, e.g., orientation, size and height. If $\Theta$ denotes the space of poses, then one can extend the state space for $z_{c,j}$ to $\{0, 1\} \times \Theta$, interpreting $(1, \theta)$ as the presence of an object with category $c$ and pose $\theta$ in cell $j$, and $(0, \theta)$ as the absence of any object with category $c$, $\theta$ being irrelevant. Features can then be extended to this state space to provide a joint distribution that includes pose. The simplest approach would be to only extend univariate features, so that object poses and other attributes are conditionally independent given their categories and locations (and the geometry variable, since the model is always assumed conditional to it). Other attributes (color, style, etc.) can be incorporated in a similar way.
5.2 Camera Model
The second component of the prior model determines the probability distribution of the extrinsic and intrinsic camera parameters, such as its pose and focal length, respectively. The definition of these parameters is fairly standard in computer vision (see, e.g., Ma et al. (2003)), but the definition of generative models for these parameters is not. In what follows we summarize the typical definitions, and leave the details of the generative model to the Appendix.
Remember that we assumed a fixed (world) coordinate system in 3D in which the $xy$-plane coincides with the dominant “horizontal” plane. Consider also a second, camera coordinate system, such that its $xy$-plane is equal to the image plane. The extrinsic camera parameters are defined by the pose $(R, T)$ of the camera coordinate system relative to the fixed coordinate system, where $R \in SO(3)$ is the camera rotation, which maps the unit axis vectors of the world coordinate system to the unit axis vectors of the camera coordinate system, and $T \in \mathbb{R}^3$ is the translation vector. We parametrize the rotation by three angles $(\psi_x, \psi_y, \psi_z)$ representing, respectively, counter-clockwise rotations of the camera’s coordinate system about the $x$-axis, $y$-axis, and $z$-axis of the world coordinate system (see equation (29) for conversion of unit vectors to angles). Observe that one can express the coordinates $X_w$ of a 3D point in the world coordinate system as functions of its coordinates $X_c$ in the camera coordinate system in the form $X_w = R X_c + T$. Since in our case 3D points lie in a plane $N^\top X_c = d$, where $N$ is the normal to the plane (i.e., table) measured in the camera coordinate system and $d$ is the distance from the plane to the camera center, we further have $X_w = H X_c$, where $H = R + \frac{1}{d} T N^\top$ is the homography between the camera plane and the world plane.
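The relation between the rigid transformation and the plane-induced homography can be checked numerically. The sketch below assumes the standard form $H = R + \frac{1}{d} T N^\top$ for points satisfying $N^\top X_c = d$ (see, e.g., Ma et al. (2003)); the composition order of the three axis rotations and all numerical values are assumptions made for this illustration only.

```python
import numpy as np

def rotation_xyz(ax, ay, az):
    """Rotation Rz @ Ry @ Rx from counter-clockwise angles (radians) about x, y, z.

    The composition order is an assumption of this sketch.
    """
    cx, sx = np.cos(ax), np.sin(ax)
    cy, sy = np.cos(ay), np.sin(ay)
    cz, sz = np.cos(az), np.sin(az)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def plane_homography(R, T, N, d):
    """H = R + T N^T / d, valid for camera-frame points X_c with N^T X_c = d."""
    return R + np.outer(T, N) / d

R = rotation_xyz(0.3, -0.2, 0.1)
T = np.array([0.1, -0.4, 1.5])          # hypothetical translation
N = np.array([0.0, 0.6, 0.8])           # unit normal of the plane in the camera frame
d = 2.0                                 # distance from the plane to the camera center

# A point on the plane: the foot of the normal plus an in-plane offset.
X_c = d * N + 0.7 * np.cross(N, [1.0, 0.0, 0.0])
H = plane_homography(R, T, N, d)
X_w_rigid = R @ X_c + T                 # rigid transformation
X_w_homog = H @ X_c                     # same point through the homography
```

The two results agree for any point on the plane, and differ in general for points off the plane, which is why the representation is restricted to objects lying on the table.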
The intrinsic camera parameters are defined by the coordinates of the focal point, , where is the focal length and is the intersection of the principal axis of the camera with the image plane, as well as the pixel sizes in directions and , denoted by and .
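To make the relation between camera parameters and the plane-to-image homography concrete, the following is a minimal sketch. It assumes the common pinhole convention in which a world point is mapped to camera coordinates by X_cam = R X_world + t, the table lies in the world plane z = 0, and pixels are square, so that the homography is K [r1 r2 t]. This convention, the function names, and the numerical values in the usage note are assumptions made for illustration and may differ from the parametrization used in this paper.

```python
def plane_homography(f, x0, y0, R, t):
    """Return H = K [r1 r2 t] for world points (X, Y, 0).

    Assumes X_cam = R @ X_world + t and square pixels (illustrative convention).
    f, x0 and y0 are expressed in pixels.
    """
    K = [[f, 0.0, x0], [0.0, f, y0], [0.0, 0.0, 1.0]]
    # The first two columns of R and the translation t form a 3x3 matrix M.
    M = [[R[i][0], R[i][1], t[i]] for i in range(3)]
    return [[sum(K[i][k] * M[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def project(H, X, Y):
    """Map a point (X, Y) of the world plane to pixel coordinates."""
    u, v, w = (H[i][0] * X + H[i][1] * Y + H[i][2] for i in range(3))
    return u / w, v / w
```

For instance, with an identity rotation, a camera two meters away from the plane along its optical axis, and a hypothetical focal length of 800 pixels, `project(plane_homography(800.0, 320.0, 240.0, I, [0, 0, 2]), 0, 0)` returns the principal point.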
The complete set of camera parameters is therefore 11-dimensional and given by . Our generative model for assumes that:
Intrinsic camera parameters are independent from extrinsic camera parameters.
Pixels are square, i.e., , but intrinsic parameters are otherwise independent. The focal length is uniformly distributed between 10 and 40 millimeters, (resp. ) is uniformly distributed between and (resp. and ), where and are the width and height of the image in pixels, and is uniformly distributed between and .
The vertical component of is independent of the other two and the distribution of the horizontal components is rotation invariant. Specifically, letting , we assume that follows a Beta distribution so that (expressed in meters). Then, letting denote the distance between the horizontal projection of on the table plane and the center of the table, we assume that follows a Beta distribution. We assume independence of and and invariance by rotation around the vertical axis, which specifies the distribution of .
The distribution of the rotation angles is defined conditionally to . Specifically, we assume that the camera roughly points towards the center of the scene and the horizontal direction in the image plane is also horizontal in the 3D coordinate system. Additional details of the model for are provided in the Appendix.
5.3 Scene Geometry Model and Global Model
We assume that the scene geometry takes values in a finite set of “template geometries” that coarsely cover all possible situations. Note that these templates are defined up to translation, since we can always assume that the 3D reference frame is placed in a given position relative to the geometry. For table settings, where the geometry represents the table itself, our templates were simply square tables with size distributed according to a shifted and scaled Beta distribution ranging from 0.5 to 3 meters. This rough approximation was sufficient for our purposes, even though tables in real scenes are obviously much more variable in shape and size.
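A shifted and scaled Beta variable of this kind can be sampled as in the sketch below. The shape parameters `a` and `b` are placeholders chosen for illustration; they are not the values used in our experiments.

```python
import random


def sample_table_size(a=2.0, b=2.0, low=0.5, high=3.0, rng=random):
    """Draw a square-table side length in meters.

    A Beta(a, b) variable on [0, 1] is scaled and shifted to [low, high].
    The shape parameters a and b are hypothetical placeholders.
    """
    return low + (high - low) * rng.betavariate(a, b)
```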
Finally, the joint prior distribution of all the variables is defined by:
5.4 Learning the Prior Model
The models for and are simple enough that we specified their model parameters manually, as described before. Therefore, the fundamental challenge is to learn the prior model on scene interpretations . For this purpose, we assume that a training set of annotated images is available. The annotation for each image consists of a list of object instances, each one labeled by its category (and possibly other attributes) and apparent 2D pose represented by an ellipse in the image plane. We also assume that sufficient information is provided to propagate the image annotation to a scene annotation in 3D coordinates; this will allow us to train the scene model independently from the unknown transformation that maps it to the image. This can be done in several ways. For example, given four points in the image that are the projections of the corners of a square in the reference plane, one can reconstruct, up to a scale factor, the homography mapping this plane to the image. Doing this with a reasonable accuracy is relatively easy in general for a human annotator, and allows one to invert the outline of every flat object on the image that lies on the reference plane to its 3D shape, up to a scale ambiguity. This ambiguity can be removed by knowing the true distance between two points in the reference plane, and their positions in the image. We used this level of annotation and representation for our table settings, based on the fact that all objects of interest were either horizontal (e.g., plates), or had easily identifiable horizontal components (e.g., bottoms of bottles), and we assumed that plates had a standard diameter of 25cm to remove the scale ambiguity.
As can be seen, the level of annotation required to train our prior model is quite high. While we have been able to produce rich annotations for 3,000 images of dining room table settings (see section 9), this is insufficient to train our model. To address this issue, in the next section we propose a 3D scene generation model that can be used to generate a large number of annotations for as many synthetic images as needed. Given the annotations of both synthetic images (section 6) and real images (section 9), the parameters of our prior model are learned using an accelerated version of the robust stochastic approximation (Nemirovski et al., 2009) to match empirical statistics calculated from top-down samples from the scene generation model (see Jahangiri (2016) for details).
6 Scene Generation Model
In this section we propose a 3D scene generation model that can be used to generate a large number of annotations to train the prior model described in section 5. The proposed model mimics a natural sequence of steps in composing a scene. First, create spontaneous instances by placing some objects randomly in the scene; the distribution of locations depends on the scene geometry. Then, allow each of these instances to trigger the placement of ancillary objects, whose categories and attributes are sampled conditionally, creating groups of contextually related objects. This recursive process terminates when no children are created, or when the number of iterations reaches an upper bound.
6.1 Model Description Using a Generative Attributed Graph
To formally define this process, we will use the notation to represent a family of integer counts indexed by categories, so that . We will also let .
We will assume a probability distribution on , and a family of such distributions . These distributions (which are defined conditionally to ) are used to decide the number of objects that will be placed in the scene at each step. More specifically:
is the conditional joint distribution of the number of object instances from each category that are placed initially on the scene.
For each category , is the joint distribution of the numbers of new object instances that are triggered by the addition of an object instance from category . These distributions can be thought of as the basis distributions in a multi-type branching process (see Mode (1971)).
The complexity of the process is controlled by a master graph that restricts the subset of categories that can be created at each step. More formally, this directed graph has vertices in and is such that is supported by categories that are children of the node . Adjoining to the node labels avoids treating as a special case in the derivations below. The master graph we used on table settings is provided in Figure 4, where we regard “plate” and “bottle” as the children of category . Note that, since we allow spontaneous instances from all categories, every category is a child of category 0.
The output of this branching process can be represented as a directed tree in which each vertex is attributed a category denoted by and is a set of edges. The root node of the tree, hereafter denoted by , essentially represents the empty scene whose “category” is also denoted by 0 (note that ). All other nodes have categories in . Each non-terminal node has children where so that of these children have category . We will refer to as a skeleton tree, which needs to be completed with the object attributes (excluding its category since already includes the category attribute) to obtain a complete scene description. The probability distribution of is
where is the set of terminal nodes and are the category counts of the children of (graphs being identified up to category-invariant isomorphisms). An example of such a graph is provided in Figure 5.
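The top-down sampling of a skeleton tree can be sketched as follows. This is only an illustration under simplifying assumptions: the counts of children are drawn as independent Poisson variables (the model above allows arbitrary joint count distributions), and the master graph, the rates, and all names in the code are hypothetical rather than the ones learned for table settings.

```python
import math
import random

# Hypothetical master graph: category 0 is the root ("empty scene").
MASTER_GRAPH = {
    0: ["plate", "bottle", "glass", "utensil"],
    "plate": ["utensil", "glass"],
    "bottle": ["glass"],
    "glass": [],
    "utensil": [],
}
# Hypothetical mean number of children of each category, per parent category.
RATES = {
    0: {"plate": 2.0, "bottle": 0.5, "glass": 0.3, "utensil": 0.3},
    "plate": {"utensil": 2.0, "glass": 1.0},
    "bottle": {"glass": 1.0},
}


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small rates."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_skeleton(rng, max_depth=3):
    """Return the skeleton tree as a list of (node_id, category, parent_id)."""
    nodes = [(0, 0, None)]          # root node with category 0
    frontier = [(0, 0)]             # (node_id, category) pairs to expand
    for _depth in range(max_depth):
        next_frontier = []
        for parent_id, parent_cat in frontier:
            for child_cat in MASTER_GRAPH[parent_cat]:
                count = sample_poisson(RATES[parent_cat][child_cat], rng)
                for _ in range(count):
                    node_id = len(nodes)
                    nodes.append((node_id, child_cat, parent_id))
                    next_frontier.append((node_id, child_cat))
        if not next_frontier:       # no children were created: stop
            break
        frontier = next_frontier
    return nodes
```

Each returned triple records a vertex, its category, and its parent, which is the information carried by the skeleton tree before pose attributes are attached.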
To complete the description, we need to associate attributes to objects, the most important of them being their poses in the 3D world, on which we focus now. In the MRF designed for our experiments, the only relevant information about pose was the location on the table, a 2D parameter. It is however possible to design a top-down generative model that includes richer information, using for example a 3D ellipsoid. Such representations involve a small number of parameters denoted generically by : each vertex in the skeleton graph is attributed by parameters such as its pose denoted by . When using ellipsoids, involves eight free parameters (five for the shape of the ellipsoid, which is a positive definite symmetric matrix, and three for its center). Fewer parameters would be needed for flat objects (represented by a 2D ellipse), or vertical ones, or objects with rotational symmetry. In any case, it is obvious that the distribution of an object pose depends heavily on its category.
In our model, contextual information is important: when placing an object relative to a parent, the pose also depends on the parent’s pose and category. This is captured by the conditional distribution of the pose parameters for a category , relative to a parent with category and pose . To simplify notation, we allow again for (indicating objects without parent), in which case is irrelevant. The complete attributed graph associated with the scene is now (where is the family of poses) with distribution
where is the parent of . In (19), we have mixed discrete probability mass functions for the object counts and continuous probability density functions for the pose attributes.
If one is only interested in the objects visible in the scene, the scene description, , is obtained by discarding the graph structure from , i.e., only retaining the object categories and poses. More complex scene descriptors could be interesting as well, such as object relationships or groupings (e.g., whether a plate, utensils, and glasses can be considered as belonging to a single setting), in which case the whole graph structure may also be of interest; we do not use such “compositions” in our experiments. Finally, the samples may require some pruning at the final stage, since the model above does not prevent object collisions or overlaps, which one generally wants to avoid. We removed physically impossible samples in which instances of vertical object categories (i.e., bottle and glass) overlapped in the world coordinate system. In general, one can add undirected edges between the children of the same parent to incorporate more context into a single setting. More details on the scene model that we used for table settings can be found in the Appendix.
6.2 Algorithm for Learning the Scene Generation Model
Even though the annotation is assumed to describe the scene in the world coordinate system, the information it provides on is still incomplete, because it does not include the graph structure. To learn the parameters of the branching process, we used the EM algorithm (Dempster et al., 1977) or, more precisely, the Monte-Carlo version of the Stochastic Expectation-Maximization (SEM) algorithm (Celeux and Diebolt, 1985), usually referred to as MCEM in the literature (Wei and Tanner, 1990). In this framework, the conditional expectation of the complete log-likelihood, which is maximized at each step to update the parameters, is approximated by Monte-Carlo sampling, averaging a sufficient number of realizations of the conditional distribution of the complete data given the observed one for the current parameters. Note that the unobserved part of the graph given can be represented as a -dimensional vector , with if is an orphan. These configurations form a subset of , given the constraints imposed by the master graph and the fact that is acyclic. The Gibbs sampling algorithm iteratively updates each according to its conditional distribution given the observed variables and the other , , which can easily be computed using equation (19). Recall that the graph distribution is learned conditional to a given scene geometry .
6.3 Simulated Table Settings
Figure 6 shows top-view visualizations of some annotated images in the dataset that roughly match in size to a table and some samples drawn from the generative attributed graph model for a square table of size learned from matching annotated images. The visual similarity of the samples drawn from the generative attributed graph model to natural scene samples supports the suitability of this model for table-setting scenes, although the proposed model is quite general and can be used to model different types of scenes.
: We developed algorithms for unconditional and conditional sampling of the graph model in the context of IP (the conditional distribution relative to the current history). The unconditional sampling is top-down, easy and fast. However, our conditional sampling based on Metropolis-Hastings (Hastings (1970); Metropolis et al. (1953)) is relatively complex and slow to adapt to a new condition, i.e., it has a long burn-in period; this is partly due to the innate low acceptance rate of the Metropolis-Hastings algorithm, normally (see Roberts and Rosenthal (2001) and Jahangiri (2016) for details). This is why we have not used this model directly in the IP framework, relying instead on the MRF model described in section 5.1, in which the feature expectations are learned on scenes generated by the generative attributed graph model.
7 Conditional Sampling
Sampling from the posterior distribution over hidden variables given evidence is central to our method, being necessary for both IP and performance evaluation. Writing for the unobserved scene-related variables, the prior distribution was given in (17). Recall that the annobits are deterministically related to the scene, with . In this discussion, we will work under the simplifying assumption that the classifiers are conditionally independent given and that, for a given , the conditional distribution of given these variables only depends on . (This assumption can be relaxed to a large extent without significantly increasing the complexity of the algorithm. This will be discussed at the end of this section.) Recall also (see Section 3.2) that at step of IP, in order to compute the conditional mutual information and determine the next query , we require the mixture weights , where is the evidence after steps. Clearly, then, can be estimated from samples from given the history.
The joint distribution of and all the data therefore takes the form
Since the next query is a deterministic function of , the conditional distribution of given is
for which we have again used the conditional independence of the ’s given the scene.
7.1 General Framework
We use a Metropolis-Hastings sampling strategy to estimate the conditional distribution of the scene variables given the history. As a reminder, the algorithm relies on the fact that any transition probability can be modified by rejection sampling to be placed in detailed balance with by letting
provided . The Metropolis-Hastings strategy assumes a family of “elementary moves” represented by transition probabilities . At each step, say , of the algorithm, a move is chosen (based on a random or deterministic scheme), and a new configuration is created with probability , where is the current configuration. The set of elementary moves and the updating scheme must be chosen appropriately to ensure that the chain is ergodic.
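For reference, a generic Metropolis-Hastings step can be sketched as follows. The functions `log_target`, `propose` and `log_q` are hypothetical placeholders for the unnormalized log target, a sampler for the proposal, and its log transition probability; this is a textbook illustration rather than the sampler used in our experiments.

```python
import math
import random


def mh_step(x, log_target, propose, log_q, rng):
    """One Metropolis-Hastings step.

    propose(x, rng) draws a candidate y from q(x, .);
    log_q(x, y) returns log q(x, y).
    The current state x is assumed to have non-zero target probability.
    """
    y = propose(x, rng)
    log_ratio = (log_target(y) + log_q(y, x)) - (log_target(x) + log_q(x, y))
    # Accept with probability min(1, ratio); otherwise stay at x.
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return y
    return x


def run_chain(x0, log_target, propose, log_q, n_steps, seed=0):
    """Run the chain for n_steps and return the visited states."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_steps):
        x = mh_step(x, log_target, propose, log_q, rng)
        samples.append(x)
    return samples
```

Because only the ratio of target values enters the acceptance test, the target needs to be known only up to a normalizing constant, which is what makes the cancellation discussed in the next subsection useful.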
7.2 Application to the Scene Model
The feasibility of the method relies on whether the ratio intervening in (22) is tractable. In this equation, all terms can be relatively easily computed, with the exception of the probabilities in (21) because of the normalizing constant in (14) which depends on . This constant cancels in the ratio whenever the values of in and coincide, i.e., the elementary move does not change the scene geometry. Among moves that satisfy this property, moves involving the camera properties are generally computationally demanding, because they modify all the annobits, while elementary changes in only have a local impact.
7.2.1 Changing the Scene Geometry
To process moves that modify , the normalizing constant in (14), namely
must be computed (where and all depend on ). Whereas an exact computation is intractable, approximations can be obtained, using, for example, the formula
in which is a parameter at which is computable (typically making all variables independent) and each expectation in a numerical approximation of the integral is computed using Monte-Carlo sampling. This is costly but can be computed offline for each value of (which can be discretized over a finite set).
In our application, however, we have used a simpler approach, relying on a good estimator of that is fixed in the rest of the computation. Letting be this estimator, we sampled over a small neighborhood of , making the additional approximation that is constant (as a function of ) in this neighborhood.
7.2.2 Changing the Camera Properties
For the camera properties, we use a proposal distribution taking the form , where the and coordinates in and coincide, and is the observed image. The dependency on is implemented through an estimator limiting the camera parameters, which will be described in the next section. The proposal distribution of can be assumed to be uniform over the finite set of scene geometries which is considered.
7.2.3 Changing object indicators
In our implementation, in which is a collection of binary variables, elementary moves correspond to Gibbs sampling, taking,
if and are such that , and for ; and taking in all other cases.
The overall updating scheme is based on nested loops, where the inner loop updates , the middle one updates and the outer one . Each loop is run several times before an update is made at a higher level.
8 Classifiers and Data Model
We trained three deep CNNs. The first one, “CatNet,” is for object category classification; the second one, “ScaleNet,” is to estimate the size of detected object instances, and the third, “SceneNet,” is to estimate the scene geometry in a given image. All of these CNNs borrow their network architecture, up to the last weight layer, i.e., layer 15, from the VGG-16 network (Simonyan and Zisserman, 2014). The last fully-connected layer (16-th weight layer) and the following softmax layer of these three CNNs were modified to accommodate our design needs. All CNNs rely on “transfer learning” by initializing the first 15 weight-layers to the corresponding weights from the VGG-16 network (available at: http://www.robots.ox.ac.uk/~vgg/research/very_deep/) trained on 1.2 million images from the ImageNet dataset (see Deng et al. (2014)). However, since the last layer’s architecture for all three CNNs is different from VGG-16, the corresponding weights were randomly initialized during training. All CNNs were trained and tested using the Caffe Deep Learning framework (Jia et al., 2014) using an Nvidia Tesla K40 GPU on a desktop computer with Intel i7-4790K Quad-Core processor (8M Cache and up to 4.40 GHz clock rate) and 32-GB RAM running Ubuntu 15.04 operating system. The processing time for each patch is about 12 seconds on our end-of-the-line Intel i7-4790K CPU and 0.2 seconds on the Tesla K40 GPU. Since the input patches are of the same size, namely , and pass through the same network, the classifiers all have the same computational cost during test time. We describe the design, training, and performance of these CNNs in the following subsections.
For each object category , we want to detect if there is at least one instance in a given patch . This will be done simultaneously for all categories, including “background.” Moreover, all patches are resized to and only one CNN is trained independently of the original size of in the image. This suffices in our framework since patches are restricted to the 4-level annocell hierarchy and the smallest annocells remain at the scale of objects except in extreme cases. CatNet is then a CNN with a softmax output layer, which returns a vector of scores , where each , for , reflects a proportional confidence level about the presence of at least one object from category in the patch, while corresponds to an empty patch (or the “No Object” category). The scores are non-negative and sum to 1, but they should not be interpreted as probabilities of existence, since the events they represent are not mutually exclusive, i.e., they can co-occur.
The corresponding annobit is a binary vector where if and only if an object with category exists in . The conditional distribution is taken to be independent of , and modeled as a Dirichlet distribution separately for each of the possible configurations of . We used a fixed-point iterative scheme (without projection) to perform MLE parameter estimation (see Minka (2012)).
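A fixed-point iteration of this type can be sketched as below, using only the Python standard library. The update solves digamma(alpha_k) = digamma(sum of alpha) + mean of log p_k for each component, following the scheme described by Minka; the digamma and trigamma approximations, the initialization, the iteration limits and all function names are choices made for this illustration and are not taken from our implementation.

```python
import math
import random


def digamma(x):
    """Digamma function for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252)))


def trigamma(x):
    """Derivative of the digamma function for x > 0."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv, inv2 = 1.0 / x, 1.0 / (x * x)
    return (result + inv + 0.5 * inv2
            + inv * inv2 * (1 / 6 - inv2 * (1 / 30 - inv2 / 42)))


def inverse_digamma(y, iterations=5):
    """Solve digamma(x) = y by Newton's method."""
    x = math.exp(y) + 0.5 if y >= -2.22 else -1.0 / (y + 0.5772156649015329)
    for _ in range(iterations):
        x -= (digamma(x) - y) / trigamma(x)
    return x


def fit_dirichlet(samples, max_iter=1000, tol=1e-9):
    """Fixed-point MLE of Dirichlet parameters from probability vectors.

    Every coordinate of every sample must be strictly positive; scores that
    are exactly zero would have to be clipped beforehand.
    """
    dim = len(samples[0])
    mean_log = [sum(math.log(p[k]) for p in samples) / len(samples)
                for k in range(dim)]
    alpha = [1.0] * dim
    for _ in range(max_iter):
        psi_sum = digamma(sum(alpha))
        new_alpha = [inverse_digamma(psi_sum + mean_log[k]) for k in range(dim)]
        converged = max(abs(a - b) for a, b in zip(alpha, new_alpha)) < tol
        alpha = new_alpha
        if converged:
            break
    return alpha


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma variables."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [v / total for v in g]
```

In our setting, one such fit would be run per annobit configuration, on the score vectors collected for patches with that configuration.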
Figure 7 illustrates some samples from the learned Dirichlet distribution versus some sample CNN outputs for the corresponding annobit for a few configurations. We have and therefore estimated 16 conditional distributions. The figure shows stacked bar visualizations of 25 samples (per configuration) drawn randomly from data collected by running CatNet on patches (left column) and samples taken from the Dirichlet model learned from CatNet output data (right column), where each row corresponds to one of the 16 annobit configurations. We have shown stacked bars for only four configurations as examples. The length of each colored bar represents the proportion of each category; therefore, the total length of each stacked bar is equal to 1. Two interesting observations are: (1) the lengths of the bars corresponding to the present categories are comparable and usually considerably larger than the lengths of the bars of absent categories; (2) the color distributions of CatNet outputs and Dirichlet model samples are very similar for the same configuration. This supports the argument for using a Dirichlet distribution in modeling the data distribution . Stacked bars are a good means to visually inspect and compare the true empirical distribution versus the Dirichlet model.
Define the scale of an object in an image patch as the ratio of its longest side to the patch size (therefore belonging to when completely visible). The ScaleNet predictor is designed to estimate the average scale of object instances in a given patch, independent of their category.
Assume a quantization of the unit interval (in our experiments, we used and quantization levels of , , , and ). We modified the VGG-16 network by assigning output values to the softmax layer and trained by assigning to each patch in the training data the index such that is closest to the average scale of the objects it contains, using only non-empty patches. The output of the CNN is a vector of non-negative weights summing to one. Again, there is only one CNN and patches of different sizes are aggregated for training. The associated annobit is the index of the Voronoï cell that contains the average scale, obtained by adding midpoints to the initial sequence (which separates the unit interval into regions; see Figure 8). The conditional distribution is then modeled and trained as a Dirichlet distribution for each value .
SceneNet combines binary classifiers predicting whether or not an input patch belongs to the dominant plane. The basic architecture is the same as that of the CatNet and ScaleNet. It returns a region in the image plane. For a given scene geometry and camera properties , let be the representation of in the image plane. We discretize the image plane into non-overlapping patches, and let be the corresponding SceneNet outputs. Let if the corresponding patch belongs to and zero otherwise.
9 JHU Table-Setting Dataset
We collected and annotated the “JHU Table-Setting Dataset,” which consists of about 3000 images of dining room table settings with more than 30 object categories. The images in this dataset were collected from multiple sources such as Google, Flickr, Altavista, etc. Figure 13 shows a snapshot of the dataset, which is made publicly available (at: http://www.cis.jhu.edu/~ehsanj/JHUTableSetting.html).
The images were annotated by three annotators over a period of about ten months using the “LabelMe” online annotating website (Russell et al., 2008). The consistency of labels across annotators was then verified and synonymous labels were consolidated. The annotation task was carried out with careful supervision resulting in high quality annotations, better than what we normally get from crowd-sourcing tools like Amazon Mechanical Turk. Figure 13 shows the annotation histogram of the 30 most annotated categories. The average number of annotations per image is about 17.
To estimate the homography (up to scale), at least four pairs of corresponding points are needed according to the Direct Linear Transformation (DLT) algorithm (Hartley and Zisserman, 2004, p. 88). These four pairs of corresponding points were located in the image coordinate system by the annotators’ best visual judgment about the four corners of a square in the real world whose center coincides with the origin of the table (world) coordinate system. We are able to undo the projective distortion due to the perspective effect by back-projecting the table surface in the image coordinate system onto the world coordinate system. The homography matrices are scaled appropriately (using objects’ typical sizes in the real world) such that after back-projection the distance between object instances in the world coordinate system (measured in meters) can be computed. Figure 13 shows two typical images from this dataset and their rectified versions. Clearly, the main distortions occur for objects which are out of the table plane.
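With exactly four correspondences, the homography can be recovered by solving an 8-by-8 linear system. The sketch below uses the inhomogeneous form of the DLT equations, in which the last entry of the homography is fixed to 1 (an assumption that excludes homographies whose last entry vanishes); the standard DLT instead solves the homogeneous system by singular value decomposition and accepts more than four points. Function names are illustrative.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def homography_from_four_points(source_pts, target_pts):
    """Return H with target ~ H @ source, normalized so that H[2][2] = 1."""
    A, b = [], []
    for (X, Y), (u, v) in zip(source_pts, target_pts):
        # u = (h1 X + h2 Y + h3) / (h7 X + h8 Y + 1), and similarly for v.
        A.append([X, Y, 1.0, 0.0, 0.0, 0.0, -u * X, -u * Y])
        b.append(u)
        A.append([0.0, 0.0, 0.0, X, Y, 1.0, -v * X, -v * Y])
        b.append(v)
    h = solve_linear(A, b)
    return [h[0:3], h[3:6], h[6:8] + [1.0]]


def apply_homography(H, x, y):
    """Apply H to the point (x, y) and return inhomogeneous coordinates."""
    u, v, w = (H[i][0] * x + H[i][1] * y + H[i][2] for i in range(3))
    return u / w, v / w
```

Swapping the two arguments of `homography_from_four_points` gives the image-to-plane mapping, which is the direction needed for back-projecting annotations onto the table.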
Each object instance was annotated with an object category label plus an enclosing polygon. Then, an ellipse was fit to the vertices of the polygon to estimate the object’s shape and pose in the image plane. Figure 13 (left) shows an example annotated image; Figure 13 (middle) shows the corresponding back-projection of vertices of annotation polygons for plates (in red), glasses (in green), and utensils (in black). Note that non-planar objects (e.g., glass) often get distorted after back-projection (e.g., elongated green ellipses) since the homography transformation is a perspective projection from points on the table surface to the camera’s image plane. Hence, we estimated the base of vertical objects (shown by black circles in the middle figure) to estimate their location in the table (world) coordinate system, since the center of the ellipse fitted to the back-projection of such objects’ annotation points is not a good estimate of their 3D location in the real world. Figure 13 (right) shows a top-view visualization of the annotated scene on the left using top-view icons of the corresponding object instances for plates, glasses, and utensils (note that all utensil instances are shown by top-view knife icons).
We also utilized a synthetic table-setting scene renderer for verification purposes. This synthetic image renderer takes as input the camera’s calibration parameters, the camera’s six extrinsic rotation and translation parameters, the table length and width, and the 3D object poses in the table’s coordinate system, and outputs the corresponding table-setting scene. Figure 14 shows some synthetic images generated by this renderer.
10 Experiments and Results
10.1 Classifier Training
10.1.1 CatNet and ScaleNet
We fine-tuned CatNet using a set of 344,149 patches. The training set contained 170,830 patches from the “No Object” category, 36,429 patches from the “Plate” category, 2,074 patches from the “Bottle” category, 49,401 patches from the “Glass” category, and 85,415 patches from the “Utensil” category. If a patch includes multiple object instances, it is repeated in the training set, once for each instance. The train and test patches were extracted from the “JHU Table-Setting Dataset” using the image partitioning scheme explained in section 4.2. The “No Object” category patches were selected from the set of annocell patches whose overlap with the table area is less than of the patch. The number of such background training patches was chosen to be twice the number of patches from the most frequent category (utensil). We evaluated the performance of CatNet on a test set of 62,157 patches. Results from the raw output of CatNet are provided in Table 1, which shows the average scores in the vector of softmax scores returned by CatNet when it is applied to a patch from the corresponding class at different levels of the hierarchy. Unsurprisingly, for each category, the scores for that category increase as the patch size decreases (usually resulting in tighter patches around objects) when the category is present in the patch, which leads to higher classification accuracy for patches from finer levels of the annocell hierarchy.
We fine-tuned ScaleNet on 171,395 patches. Each patch was labeled by one label , respectively associated to the closest scale ratios in , the number of patches in each category being 42,567, 82,509, 37,443 and 8,876.
We evaluated the performance of ScaleNet on a test set of 30,742 patches. Figure 15 shows confusion matrices for the test set in two cases: classification based on the maximum-score class and on the top-2 score classes. A match is declared in the case of top-2 score classification if the true class is among the top two scores. It can be seen that the most common mistakes are made between consecutive classes, which makes sense since consecutive classes are associated with consecutive scale ratios, which have closer output distributions.
Figure 15: ScaleNet confusion matrices on the “training” and “test” sets, considering both max-score classification and top-2 classification.
The main component of SceneNet is a CNN that detects whether a patch is part of the table area. We used 270,410 training patches (including 153,812 background and 116,598 table), and 38,651 test patches (including 18,888 background and 19,763 table). The background and table patches are defined by having, respectively, at most and at least (of the patch) overlap with the table surface area. All of the training and test patches were selected from level-2 and level-3 of the annocell hierarchy.
We classify a level-3 patch as part of the table if both its associated CNN and the one run on one of the level-2 patches that contain it report a positive detection. The final table area prediction, , is defined as the convex hull of the largest connected component of the union of detected level-3 patches. Figure 17 shows the estimated table area for some example images. Figure 18 (left and middle) shows two examples in which misdetected off-table patches are removed after post-processing. Figure 18 (right) shows a poor table detection example, which seems to be due to the lack of sufficient texture on the table. We tested our table detector on 284 images and observed fewer than 5 poor table detections.
We estimate the table size (in 3D) by appropriately scaling the diameter length of its convex hull. The scale was calculated by running ScaleNet on patches from level 2 classified as table, and assuming that the table-setting objects have an average size of cm. Figure 19 shows the histogram of the absolute and relative errors made by our table size estimator. We calculated the true table size by back-projecting the annotated table surface using the homography that was estimated from the annotation of the images. The histogram is centered roughly around 0, meaning that our table size estimator is relatively unbiased.
10.2 IP Experiments
Conditional inference on the posterior distribution given the accumulated evidence after steps of IP was described in section 7 (including, in particular, approximations made to the sampling of the scene geometry and camera properties). The templates we used for the geometry are square tables whose sizes range from 0.9 to 2.7 meters at 20 cm intervals. We selected the template closest to the estimated table size and its two nearest neighbors (or one neighbor if the closest table size is 0.9 or 2.7). For each of them, we sampled 10 homographies which are consistent with the detected table surface area (described in section 10.1.2).
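The template selection rule can be written in a few lines. The sketch below is illustrative only; the function name and the rounding to one decimal (which suits the 20 cm grid) are our choices for the example.

```python
def select_templates(estimated_size, low=0.9, high=2.7, step=0.2):
    """Return the template closest to the estimate and its nearest neighbors.

    Templates are side lengths in meters, rounded to one decimal place.
    """
    count = int(round((high - low) / step)) + 1
    templates = [round(low + i * step, 1) for i in range(count)]
    best = min(range(count), key=lambda i: abs(templates[i] - estimated_size))
    # Keep one neighbor on each side, when it exists.
    return templates[max(best - 1, 0):best + 2]
```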
To generate homography samples that conform to the detected table area, assume a rectangular table with length and width whose four corner points are , , , and . We draw samples from the distribution on camera parameters proposed in section 5.2 and calculate the corresponding homography matrix. Then, we project the four corners of the table to the image coordinate system using this homography matrix and check whether the resulting polygon (quadrilateral) fits the detected table area well, using a similarity measure for 2D shapes. We declare a “good fit” between two shapes and if their distance defined as satisfies
In an attempt to efficiently sample the homography (camera parameter) distribution that is consistent with the detected table area, we first try to find a set of camera parameters that result in a table projection meeting a relaxation of (25), namely , and as soon as we find such a sample we start to greedily fine-tune the camera parameters to finally satisfy (25). During fine-tuning we randomly choose one camera parameter and change it slightly by sampling a normal distribution with small variance centered at the previous value; we accept this change if it results in a smaller distance. We try a total of homographies obtained by sampling the camera model (to satisfy the relaxed condition) or fine-tuning of parameters (to satisfy (25)) and exit the loop as soon as (25) is met; otherwise, if condition (25) was not met during trials, we output the camera parameters resulting in the minimum . Figure 24 shows some example consistent homography samples.
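The two-phase search described above (sampling from the camera model until the relaxed condition holds, then greedy single-parameter perturbation) can be sketched generically. Everything named here is an assumption made for illustration: `sample_params` stands in for the camera prior of section 5.2, `distance` for the shape distance between the projected table and the detected area, and `tau`, `tau_relaxed`, `max_trials` and `sigma` for thresholds and budgets whose actual values are not restated here.

```python
import random

def search_camera(sample_params, distance, tau, tau_relaxed, max_trials,
                  sigma=0.05, rng=random):
    """Return (params, distance) meeting `distance <= tau`, or the best seen."""
    best, best_d = None, float("inf")
    current, current_d = None, float("inf")
    for _ in range(max_trials):
        if current is None:
            # Phase 1: draw from the camera model until the relaxed test passes.
            cand = list(sample_params())
        else:
            # Phase 2: perturb one randomly chosen parameter with small variance.
            cand = list(current)
            i = rng.randrange(len(cand))
            cand[i] = rng.gauss(cand[i], sigma)
        d = distance(cand)
        if d < best_d:
            best, best_d = cand, d
        if current is None:
            if d <= tau_relaxed:
                current, current_d = cand, d
        elif d < current_d:
            # Greedy acceptance: keep the change only if the distance shrinks.
            current, current_d = cand, d
        if best_d <= tau:
            break  # strict condition met, stop early
    return best, best_d
```

Both phases draw on the same trial budget, and if the strict condition is never met the function falls back to the parameters with the minimum distance, mirroring the procedure in the text.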
Recall that at step , IP maximizes the mutual information
over queries and that this mutual information is the difference
(see (6)). Moreover, under our conditional independence assumptions, this reduces to the entropy of a mixture minus a mixture of entropies where in both cases the mixture weights are the conditional probabilities of the annobit given the evidence. In the current case, the queries are indexed by the annocells , where assumes sixteen possible values corresponding to the possible subsets of the four object categories. There are also scale annobits in correspondence with the classifiers but we do not consider these in the selection of queries; of course each time we execute a CatNet classifier for an annocell we also execute the corresponding ScaleNet classifier for and both the CatNet and ScaleNet results are part of the evidence. Once the weights are computed by sampling (see below) from the posterior, we can immediately evaluate the mixture of entropies since the entropy of the Dirichlet distribution has a closed-form solution. For the entropy of the mixture, namely the entropy of the mixture of Dirichlet densities, we estimate the integral by Monte Carlo integration. To generate a sample from the mixture distribution for the Monte Carlo integration, we first select one of the 16 Dirichlet densities with probabilities according to the posterior and then generate a sample from the selected Dirichlet distribution. Given the generated samples from the mixture distribution, we evaluate the negative logarithm of the mixture density at these samples and average to get an estimate of the entropy of the mixture. A similar approach could be taken to estimate the entropy of a single Dirichlet distribution, but since it has a closed-form solution we used the closed form in computing the mixture of entropies. Nevertheless, comparing the closed-form Dirichlet entropy with its Monte Carlo estimate gave us insight into the number of mixture samples needed to estimate the entropy of the mixture reasonably well.
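The computation of "entropy of the mixture minus mixture of entropies" can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: `weights` stands for the posterior probabilities of the annobit states and `alphas` for the Dirichlet parameters attached to each state (both hypothetical inputs here), and `digamma` is a small helper written for this example because the standard library does not provide one.

```python
import math
import random

def digamma(x):
    """Digamma for x > 0 via recurrence plus an asymptotic expansion."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))

def log_beta(alpha):
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))

def dirichlet_entropy(alpha):
    """Closed-form differential entropy of Dirichlet(alpha)."""
    a0, k = sum(alpha), len(alpha)
    return (log_beta(alpha) + (a0 - k) * digamma(a0)
            - sum((a - 1.0) * digamma(a) for a in alpha))

def dirichlet_logpdf(x, alpha):
    return sum((a - 1.0) * math.log(v) for a, v in zip(alpha, x)) - log_beta(alpha)

def sample_dirichlet(alpha, rng):
    g = [rng.gammavariate(a, 1.0) for a in alpha]  # normalized gamma draws
    s = sum(g)
    return [v / s for v in g]

def mixture_entropy_mc(weights, alphas, n_samples, rng):
    """Monte Carlo estimate: average of -log(mixture density) at mixture samples."""
    total = 0.0
    for _ in range(n_samples):
        j = rng.choices(range(len(weights)), weights=weights)[0]
        x = sample_dirichlet(alphas[j], rng)
        logs = [math.log(w) + dirichlet_logpdf(x, a)
                for w, a in zip(weights, alphas) if w > 0]
        m = max(logs)  # log-sum-exp for numerical stability
        total -= m + math.log(sum(math.exp(v - m) for v in logs))
    return total / n_samples

def mutual_information(weights, alphas, n_samples, rng):
    mix_of_entropies = sum(w * dirichlet_entropy(a) for w, a in zip(weights, alphas))
    return mixture_entropy_mc(weights, alphas, n_samples, rng) - mix_of_entropies
```

Running `mixture_entropy_mc` on a single-component mixture and comparing the result against `dirichlet_entropy` reproduces the sanity check mentioned in the text for choosing the number of samples.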
Turning back to the annobit posterior, we determine the states of the annobits from posterior samples by projecting the 3D samples to the image coordinate system using the sampled homography. More specifically, the projection of the sampled locations on the table plane in 3D obviously allows us to answer any queries about locations in the image plane appearing in the definition of an annobit. However, in order to determine what instances of objects are contained in a given annocell, and to measure the average sizes of the instances present, we need an estimate of the set of pixels which constitute the image realization of each instance sampled. For plates and utensils, which are effectively 2D, we simply use the projected circle for plates and projected ellipse for utensils, which of course are again ellipses in the image plane. For glasses and bottles, which are three-dimensional, we know the image representation is larger than the image ellipse obtained by projecting the base circle determined by the sample. Also, the projection of these objects in 2D is oriented perpendicular to the orientation of their base circle projection. Hence, we estimate the projection we would obtain for instances from these categories with a fully 3D to image mapping by moving the center of projection from the center of projected base upward (in the image) and along a vector orthogonal to the main axis of the projected base ellipse; we place the updated object center at a distance from the projected base center equal to half of its size where the size is proportional to the main diameter of the projected base.
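The center adjustment for glasses and bottles can be written as a small geometric helper. This is only a sketch: `theta` (orientation of the main axis of the projected base ellipse, in radians) and `height_ratio` (the proportionality constant between object size and the main diameter) are assumed inputs, and image coordinates are assumed to have the y axis pointing downward.

```python
import math

def shifted_center(cx, cy, theta, main_diameter, height_ratio):
    """Move the projected base center 'up' in the image by half the object size."""
    size = height_ratio * main_diameter
    # Unit vector orthogonal to the main axis of the projected base ellipse.
    nx, ny = -math.sin(theta), math.cos(theta)
    if ny > 0:
        # Image y grows downward, so flip the normal to point upward.
        nx, ny = -nx, -ny
    return cx + 0.5 * size * nx, cy + 0.5 * size * ny
```

For a base ellipse with a horizontal main axis of 40 pixels and an assumed ratio of 2, the center moves 40 pixels straight up in the image.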
We ran IP on a dataset of 284 images. In each step of IP, the two most informative questions, corresponding to the annobits with maximum mutual information, were asked, i.e., two patches were processed by CNNs. Figure 20 shows the annocells selected in the first four steps of IP for a given test image. Figures 21 and 22 show the selected annocells at later IP steps. We can see that the patches selected later are usually from the finer levels, which follows a coarse-to-fine scene analysis paradigm. However, it is entirely plausible, and actually happened during our experiments, to go back to a coarser question after asking a sequence of finer questions. Analogously, we as humans may focus on a particular area while analyzing a scene and then, depending on the collected evidence, zoom out and collect evidence at a coarser level.
It is worth mentioning the difference between the IP selection criterion in (4) and the approximate criterion in (7) in terms of the resolution level of the selected patches. According to our experiments, the approximate selection criterion in (7) usually starts by selecting coarser patches than the IP selection criterion in (4); more specifically, the approximate criterion starts with level-1 whereas the exact criterion starts with level-2. (The reason the approximate criterion does not start with level-0 is that level-0 is basically the whole image, in which most categories are present; analyzing the whole image therefore does not yield much information gain if we are considering only one type of scene category.) This is mainly due to the fact that the approximate criterion ignores the error rates of the classifiers at the selection stage by replacing with . We know that our classifiers are more accurate at finer levels, which favors their selection under the IP criterion in (4). Note that under both criteria the questions selected at the early steps are usually coarser and are progressively refined (coarse-to-fine analysis). This is an interesting contrast between the two criteria. In support of the IP criterion in (4), suppose Alice walks into a bookstore in Brooklyn, where Bob is the bookstore clerk, in search of a novel whose title she does not remember. Bob wants to find the book that Alice is looking for by asking questions that are most informative to him and that Alice can also answer. There is no point in asking a very informative question if Alice cannot provide an accurate answer to it; e.g., Alice may be able to tell Bob the color of the cover but most probably will not be able to name a few of the minor characters in the novel. The IP selection criterion in (4) tries to strike a tradeoff between the information gain of questions and the accuracy of the classifier in answering them.
For the first 100 steps of IP, Figure 25 shows the maximal mutual information for the selected annocell at step , and the corresponding conditional entropy , both averaged across the 284 processed images. Hence but 200 classifiers are involved which explains the ripples with period two in this figure. This is because the second most informative question asked in each step usually has slightly lower conditional mutual information compared to the most informative question of the next step. Naturally the mutual information is smaller than the conditional entropy.
In order to define and visualize the detections generated by sampling from the 3D posterior distribution, we superimpose a uniform grid of size on the image plane. We explained earlier how to associate a set of pixels with the projection of each sampled object instance, which in turn generates a rectangular bounding box. The center of the bounding box then falls into one of the grid cells. For each cell and each category, we aggregate all samples from that category whose center lies in the cell and compute the average of the top-left corner and width/height of the corresponding bounding boxes; we take this average bounding box as the detection for that cell. The score for every detection is proportional to the number of 2D projections contributing to that detection (used to compute the average). We then run non-maximum suppression on the detections for each object category separately; two bounding boxes are considered neighbors if their intersection size over minimum size is greater than 0.3. This yields a final set of scored detections, each of which is labeled as a true positive if the intersection of the ground-truth bounding box and the estimated bounding box is at least 0.7 of the minimum of the two boxes and the ratio of their longest sides is between and . Otherwise it is labeled a false positive.
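The aggregation and suppression steps for a single object category can be sketched as follows. Several details are assumptions made for illustration: boxes are `(x, y, w, h)` tuples with `(x, y)` the top-left corner, the grid cell dimensions `cell_w` and `cell_h` are free parameters, the score is taken to be the raw sample count, suppression is the usual greedy variant, and the longest-side ratio bounds `ratio_lo` and `ratio_hi` must be supplied by the caller.

```python
def intersection_over_min(a, b):
    """Intersection area divided by the area of the smaller box."""
    iw = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    ih = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    return (iw * ih) / min(a[2] * a[3], b[2] * b[3])

def aggregate(boxes, cell_w, cell_h):
    """Average the boxes whose centers share a grid cell; score = sample count."""
    cells = {}
    for x, y, w, h in boxes:
        key = (int((x + w / 2) // cell_w), int((y + h / 2) // cell_h))
        cells.setdefault(key, []).append((x, y, w, h))
    detections = []
    for group in cells.values():
        n = len(group)
        avg = tuple(sum(b[i] for b in group) / n for i in range(4))
        detections.append((avg, n))
    return detections

def nms(detections, thresh=0.3):
    """Greedy non-maximum suppression using intersection over minimum area."""
    kept = []
    for box, score in sorted(detections, key=lambda d: -d[1]):
        if all(intersection_over_min(box, k[0]) <= thresh for k in kept):
            kept.append((box, score))
    return kept

def is_true_positive(det, gt, ratio_lo, ratio_hi, overlap=0.7):
    """Overlap test plus a bound on the ratio of the boxes' longest sides."""
    ratio = max(det[2], det[3]) / max(gt[2], gt[3])
    return intersection_over_min(det, gt) >= overlap and ratio_lo <= ratio <= ratio_hi
```

Using intersection over the minimum area rather than intersection over union means that a small box nested inside a large one counts as fully overlapping, which suppresses duplicate detections at different scales.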
10.3 Experiments with Stand-Alone Classifiers
In this section we consider parsing an image with the results of the classifiers alone, i.e., without the Bayesian model. For CatNet, from the softmax layer output, , we estimate the set of categories present in the annocell as follows. Let denote the weight for category with input patch . Order the weights, starting with the top one, then add new categories until the difference between the weights of the previous one and the new one is greater than a threshold , or until three categories have been selected (including the “No Object” category).
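The category selection rule for CatNet can be sketched as a short function. The dictionary of per-category weights, the category names and the value of `gap_threshold` are hypothetical; the rule itself (walk down the ranked weights, stop at the first gap larger than the threshold or after three categories) follows the description above.

```python
def select_categories(weights, gap_threshold, max_categories=3):
    """Pick the top-ranked categories until a large gap in weight appears."""
    order = sorted(weights, key=weights.get, reverse=True)
    selected = [order[0]]  # the top category is always kept
    for prev, new in zip(order, order[1:]):
        too_many = len(selected) >= max_categories
        big_gap = weights[prev] - weights[new] > gap_threshold
        if too_many or big_gap:
            break
        selected.append(new)
    return selected
```

For instance, with weights 0.45 (plate), 0.40 (glass), 0.10 (utensil) and an assumed threshold of 0.2, the plate and glass are selected and the utensil is rejected because the drop from 0.40 to 0.10 exceeds the threshold.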
For ScaleNet, from the output of (a sequence of weights indexed by the scale categories), we compute an expected scale ratio as a weighted average of the top two categories, i.e., letting be the top two categories with scores , we take . We impose a selection criterion to declare an appropriate bounding box detection, ensuring in particular that objects present in the patch occupy a significant portion of it, by requiring that:
where and . The choice made for favors large differences between the top two scales. Note that ScaleNet returns the correct scale among its top two ratios more than 95% of the time when run on the test set. We also assign a score to the output of ScaleNet, namely .
Finally, each patch from the annocell hierarchy is given a mixed “Category–Scale” score per category. The mixed score for a given patch with scale score and the -th category score is . We declare an annocell patch to be the bounding box of a positive detection for the -th category if both and