1.1 Motivation & objective
Image understanding is one of the core problems in computer vision. Compared to object-detection techniques that focus on the "what is where" problem, we are more interested in mining the semantic hierarchy of object compositions and exploring how these compositions/sub-compositions are organized within an object. Such knowledge is a prerequisite for future high-level human-computer dialogue and interaction.
Therefore, in this paper, we aim to mine deep structures of objects from web images. More importantly, we present a cost-sensitive active Question-Answering (QA) framework to learn the deep structure from a very limited number of part annotations. Our method has the following three characteristics.
Deep and transparent representation of object compositions: In fact, obtaining a transparent representation of the semantic hierarchy is equivalent to understanding detailed object statuses, to some extent. Based on such a hierarchical representation, parsing an entire object into different semantic parts and aligning different sub-components within each part can provide rich information in object statuses, such as the global pose, viewpoint, and local deformation of each certain part.
Thus, as shown in Fig. 1, a nine-layer And-Or graph (AOG) is proposed to represent visual concepts at different layers, ranging from categories and poses/viewpoints to parts and shape primitives with clear semantic meanings. In the AOG, an AND node represents the sub-region compositions of a visual concept, and an OR node lists alternative appearance patterns of the same concept. Unlike previous studies that model visual contexts and taxonomic relationships at the object level, the AOG focuses on semantic object components and their spatial relationships.
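The AND/OR structure described above can be sketched as a simple recursive data type. This is a minimal illustration, not the paper's implementation; all class and field names are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AOGNode:
    """One node of a simplified And-Or graph (illustrative sketch).
    kind is "AND" (sub-region compositions of a visual concept),
    "OR" (alternative appearance patterns of the same concept),
    or "TERM" (a local shape primitive in the bottom layer)."""
    kind: str
    layer: int          # 1..9 in the nine-layer hierarchy
    name: str
    children: List["AOGNode"] = field(default_factory=list)

# A tiny fragment: a category OR node over one pose/viewpoint AND node,
# which composes two part OR nodes in Layer 4.
head = AOGNode("OR", 4, "head")
torso = AOGNode("OR", 4, "torso")
side_view = AOGNode("AND", 3, "side-view", [head, torso])
horse = AOGNode("OR", 2, "horse", [side_view])

assert [c.name for c in horse.children[0].children] == ["head", "torso"]
```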
Multiple-shot QA learning from big data: In order to scale up the technique to big data, we apply the following two strategies to limit the annotation cost. First, we collect web images using search engines as training samples without annotating object boxes. Second, as shown in Fig. 1, we design a QA framework to let the computer automatically figure out a limited number of typical examples of “known unknowns” in the unannotated images, ask users questions, and use the answers to refine the AOG.
Thus, as shown in Fig. 4, we design six types of questions. Each question is oriented to a certain node in the AOG, e.g. whether an image contains an object of a certain category and whether the current AOG provides a correct localization of an object (or a certain semantic part of a category). The computer uses these questions to overcome image noises caused by incorrect search results, intra-class variations, and ubiquitous occlusions.
Note that this multiple-shot QA learning does not fall within the conventional paradigm of active learning. First, we do not pre-define a certain feature space of an object as the prerequisite of active learning. Instead, we use the QA process to gradually enrich the knowledge of object structure, i.e. discovering new alternative part appearances and mining the detailed components of each part. Second, we do not simply treat each answer as a single annotation of a specific object/part sample; instead, we generalize specific answers by mining the corresponding common patterns from big data in a weakly-supervised manner.
Cost-sensitive policy: We formulate a mixed loss for each node in the AOG as the unified paradigm to guide the learning of hierarchical object details. It includes a generative loss (measuring the model error in explaining the images) and a discriminative loss (i.e.
the model’s fitness to human answers). Thus, among the six types of questions, each question corresponds to a certain node in the AOG, and we can use its answers to explicitly optimize the generative and/or discriminative loss of this node. Clear losses and semantic meanings of middle-layer nodes make our deep AOG different from deep neural networks.
As shown in Fig. 1, the QA framework uses the node loss to identify the nodes that are insufficiently trained, and selects the best sequence of questions to optimize a list of AOG nodes in an online manner. In each step, the QA framework balances the costs and potential gains of different questions, and selects the questions with high gains and low costs, to ensure high learning efficiency, which trades off the generative and discriminative losses, the human labor for annotations, and the computational cost.
In fact, this cost-sensitive policy is extensible. In this study, the QA framework combines six types of questions and four modules: 1) graph mining (unsupervised mining of AOG structures without labeling object locations), 2) And-Or template learning (discovery of detailed structures within aligned parts), 3) supervised learning, and 4) object parsing. In addition, the QA system can be extended by adding more questions and modules.
1.2 Related work
Knowledge organization for big data: Many studies have organized models of different categories in a single system. A CNN encodes the knowledge of thousands of categories in numerous neurons, and its black-box representation is not fully chaotic. Surveys have reviewed studies that aim to understand the feature representations inside neural networks. For example, as shown in , each filter in a convolutional layer usually encodes a mixture of visual concepts; for instance, a filter may represent both the head part and the tail part of an animal. However, how to clearly disentangle different visual concepts from convolutional filters remains a significant challenge.
Recently, there has been a growing interest in modeling high-level knowledge beyond object detection. [8, 9] mined models for different categories/subcategories from web images. Other studies constructed a hierarchical taxonomic relationship between categories. [63, 28, 47, 3] formulated the relationships between natural language and visual concepts. Further work built a Turing test system, and other research modeled the contextual knowledge between objects. Knowledge in these studies was mainly defined upon object-level models (e.g. affordance and context). In contrast, we explore deep structures within objects. The deep hierarchy of parts provides a more informative understanding of object statuses.
Multiple-shot QA for learning: Many weakly-supervised methods and unsupervised methods have been developed to learn object-level models. For example, studies of [46, 37, 15, 54], object co-segmentation , and object discovery [44, 51] learned with image-level annotations (without object bounding boxes). In particular, [11, 60] did not require any annotations during the learning process. [16, 6, 36, 35] learned visual concepts from web images.
However, when we explore detailed object structures, manual annotations are still necessary to avoid model drift. Therefore, inspired by active learning methods [48, 50, 21, 33, 29], we hope to use a very limited number of human-computer QAs to learn each object pose/viewpoint. In fact, such QA ideas have been applied to object-level models [14, 41, 49]. Branson et al. used human-computer interactions to point out the locations of object parts to learn part models, but they did not provide part boxes. In contrast, we focus on deep object structures. We design six types of human-computer dialogues/QAs for annotations (see Fig. 4). Our QA system chooses questions based on the generative and discriminative losses of AOG nodes, thereby explicitly refining different AOG nodes. In experiments, our method achieved good performance when parts were labeled on only 3–5 objects for each pose/viewpoint. Similarly, related work used active QA to learn a semantic tree that disentangles the neural activations inside neural networks into hierarchical representations of object parts.
Transparent representation of structures is closely related to the deep understanding of object statuses. Beyond the object bounding box, we can further parse an object and align visual concepts at different layers to different object parts/sub-parts, which provides rich information about local appearance, poses, and viewpoints. In previous studies, many part models were designed with single-layer latent parts [44, 18] or single-layer semantic parts [2, 7, 40, 61, 23, 4], and trained for object detection with strong supervision. [36, 35] proposed to automatically learn multi-layer structures of objects from web images, modeling the object identity, object viewpoints, semantic parts, and their deformation locations. In contrast, we have a different objective, i.e. the weakly-supervised mining of a nine-layer deep structural hierarchy of objects, which models detailed shape primitives. Later studies learned an interpretable CNN whose middle-layer filters represent object parts, used an explanatory tree to represent the CNN's logic of using parts for object classification, learned an explainer network to interpret the knowledge of object parts encoded in a pre-trained CNN, and designed an interpretable modular network structure for multiple categories and multiple tasks, where each network module is functionally interpretable.
The paper makes the following contributions:
1) We propose a nine-layer AOG to represent the deep semantic hierarchy of objects.
2) We propose an efficient QA framework that allows the computer to discover something unknown, to ask questions, and to explicitly learn deep object structures from human-computer dialogues.
3) We use a general and extensible cost-sensitive policy to implement the QA system, which ensures a high efficiency of mining knowledge. To the best of our knowledge, our method is the first to reduce the cost of learning part localization to about ten annotations for each part.
4) We can use our QA framework to learn deep semantic hierarchies of different categories from web images.
2 And-Or graph representation
Fig. 2 shows the nine-layer AOG, which encodes visual concepts at different levels within objects and organizes their hierarchy. The basic element of the AOG is the three-layer And-Or structure in Fig. 3, where an AND node represents 1) the part compositions of a certain concept and 2) their deformation information, and an OR node lists alternative local patterns for a certain part. The AOG is specified by a set of model parameters and is used for object parsing in a given image. Each node in the AOG is associated with both the image region assigned to it during parsing and its own subset of the model parameters.
Each terminal node in the bottom layer represents a pattern of local shape primitives. The inference score of a terminal node $T$ in image $I$ is formulated as the linear response
$$S_T = w_T^{\top}\,\phi(\Lambda_T)$$
where $\phi(\Lambda_T)$ denotes the local features extracted from the image region $\Lambda_T$ of $T$ in $I$, and $w_T$ is the parameter.
Each OR node in the AOG provides a list of alternative local appearance patterns. In particular, OR nodes in Layers 1 and 2 encode the category choices and the possible object poses/viewpoints within each category, respectively, and those in Layers 4, 6, and 8 offer local pattern candidates. When we use the AOG for object inference in image $I$, an OR node $U$ selects its child node with the highest score as the true configuration:
$$\hat{V} = \arg\max_{V \in Child(U)} S_V$$
where $Child(U)$ indicates the children set of node $U$ and $S_V$ denotes the inference score of child $V$. The selected child can be a terminal node, an OR node, or an AND node. Note that an "invisible" child is also attached to $U$; it is activated when the other child patterns cannot be detected.
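The OR-node selection rule above can be sketched in a few lines. This is a hedged illustration (function and key names are assumptions, and the fixed penalty stands in for the learned score of the "invisible" child):

```python
def or_node_score(children_scores, invisible_penalty=-1.0):
    """OR-node inference (sketch): return the best-scoring child and its
    score. An "invisible" child with a fixed penalty score serves as the
    fallback when no child pattern is detected."""
    candidates = dict(children_scores)
    candidates["invisible"] = invisible_penalty
    best = max(candidates, key=candidates.get)
    return best, candidates[best]

# The "front-leg" pattern wins here; if every detection scored below the
# penalty, the invisible child would be selected instead.
best, score = or_node_score({"front-leg": 2.3, "hind-leg": 0.7})
assert best == "front-leg"
assert or_node_score({"a": -5.0})[0] == "invisible"
```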
Each AND node in the AOG contains several sub-region components and models their geometric relationships. In particular, the AND nodes in Layer 3 organize the relationship between object poses/viewpoints and object parts, and those in Layers 5 and 7 encode detailed structural deformation within part patches. The inference score of an AND node $U$ is formulated as the sum of its children's scores:
$$S_U = \alpha\, S^{\mathrm{app}}(\Lambda_U) + \sum_{V \in Child(U)} S_V + \beta \sum_{(V,V') \in \mathcal{N}_U} S^{\mathrm{def}}(\Lambda_V, \Lambda_{V'})$$
where $S^{\mathrm{app}}(\Lambda_U)$ represents the score of the global appearance in the region $\Lambda_U$, $\mathcal{N}_U$ denotes the set of $U$'s neighboring children pairs, and $S^{\mathrm{def}}(\Lambda_V, \Lambda_{V'})$ measures the deformation between the image regions $\Lambda_V$ and $\Lambda_{V'}$ of sibling children $V$ and $V'$. The constants $\alpha$ and $\beta$ are weighting parameters for normalization, learned to give $S_U$ zero mean and unit variance on random background images.
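The AND-node score composition can be sketched as follows. This is a minimal illustration; the weights here are placeholders for the normalization parameters the paper learns on background images.

```python
def and_node_score(appearance, child_scores, deformation_scores,
                   w_app=1.0, w_def=1.0):
    """AND-node inference (sketch): the node's own appearance score plus
    the scores of its children plus pairwise deformation scores between
    neighboring sibling children. w_app and w_def stand in for the
    learned normalization weights."""
    return (w_app * appearance
            + sum(child_scores)
            + w_def * sum(deformation_scores))

# Two detected children; one deformation term penalizes their misalignment.
s = and_node_score(appearance=0.5,
                   child_scores=[1.2, 0.8],
                   deformation_scores=[-0.3])
assert abs(s - 2.2) < 1e-9
```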
2.1 Design of Layers 3–5
Layers 3–5: The three-layer And-Or structure that ranges across the pose/viewpoint, part, and local layers is derived from the AOG pattern proposed in . This technique models the three-layer sub-AOG as the common subgraph pattern that frequently appears among a set of large graphs (i.e. images). For each pose/viewpoint node under a category node, we do not model its global appearance. A pose/viewpoint node contains two types of children, i.e. latent children (parts mined automatically from big data without clear semantic meaning) and semantic children (parts with certain names). Thus, based on (3), we can write the inference score of a pose/viewpoint node without the global appearance term.
Part deformation: We connect all pairs of part nodes under the same pose/viewpoint as neighbors. For each pair of part nodes, the deformation score between them measures the squared difference between the ideal (average) geometric relationship and the actual part relationship detected in the image. In addition, we assign a specific deformation penalty as the deformation score when one of the parts is not detected. The average geometric relationship and the penalty are learned, where the geometric relationship between two parts comprises three types of pairwise features computed from the scale and 2D position of each part.
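A pairwise geometric descriptor of this kind can be sketched as below. The paper's exact three feature types were not recoverable, so this uses a plausible triple built from each part's scale and position; all names are assumptions.

```python
import math

def pairwise_geometry(part_a, part_b):
    """Pairwise geometric features between two sibling parts (sketch):
    1) the log scale ratio, 2) the scale-normalized displacement, and
    3) the orientation of the displacement, computed from each part's
    scale s and 2D position (x, y)."""
    dx = part_b["x"] - part_a["x"]
    dy = part_b["y"] - part_a["y"]
    mean_scale = 0.5 * (part_a["s"] + part_b["s"])
    return (math.log(part_b["s"] / part_a["s"]),
            (dx / mean_scale, dy / mean_scale),
            math.atan2(dy, dx))

# Two equally scaled parts, the second 4 units to the right of the first.
f = pairwise_geometry({"x": 0, "y": 0, "s": 2.0}, {"x": 4, "y": 0, "s": 2.0})
assert f == (0.0, (2.0, 0.0), 0.0)
```

A deformation score can then penalize the squared difference between these features and their averages over training samples.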
Layers 4–5: To simplify the AOG, we allow latent part nodes to have multiple children, whereas a semantic part node can have only one child besides its "invisible" child. For each Layer-5 child of a latent part, the appearance score measures the squared difference between the child's ideal (average) appearance and the actual appearance detected in the image. For the only child of a semantic part, we use part annotations to train a linear SVM to classify its local appearance, and set the appearance score of this child as the SVM score. We also assign a specific appearance penalty to "invisible" children in Layer 5. The weighting parameters are learned to give the score zero mean and unit variance on random background images. The appearance feature for a patch comprises HOG features and the height-width ratio of the patch. A linear SVM is learned to estimate the score for a visible semantic part, returning a positive/negative value for a true/false detection. The model parameters to be learned include the average part appearance, the SVM parameters for semantic parts, and the appearance penalty.
2.2 Design of Layers 5–9
The bottom four layers (Layers 6–9) of the AOG represent detailed structures within the semantic patches in Layer 5 based on the And-Or template proposed in . First, for each AND node in Layers 5 and 7, we do not encode its global appearance. has two children, and the deformation relationship between the two children is used to roughly model the “geometric OR relationships” involved in . Second, each OR node in Layers 6 and 8 has several children, which encodes only the “structural OR information” described in . Finally, terminal nodes in Layer 9 are described by the HIT feature mined by , which combines information of sketches, texture, flat area, and colors of a local patch.
2.3 Object parsing (inference)
Given an image, we use the AOG to perform hierarchical parsing for the object inside it, i.e. estimating a parse graph (see the green lines in Fig. 2) to explain the object:
where we define the parse graph as a set of activated node regions for object understanding, which describes an inference tree of the AOG. We can understand the parse graph in a top-down manner. 1) Suppose an OR node in Layer 1, 2, 4, 6, or 8 has been activated and put into the parse graph. It activates its best child to explain its own image region and puts that child into the parse graph. 2) Suppose an AND node in Layer 3, 5, or 7 has been activated and put into the parse graph. It determines the best image region inside its own region for each of its OR children, and puts these children into the parse graph. Therefore, because we do not encode the global appearance of pose/viewpoint nodes, the objective of object parsing can be re-written as
where the maximization ranges over the set of pose/viewpoint nodes in the AOG. The target parse graph for Layers 3–5 can be estimated via graph matching . As mentioned in , (3) is a typical quadratic assignment problem that can be directly solved by optimizing a Markov random field . The detailed inference for Layers 6–9 is solved by using . The left-right symmetry of objects is considered in applications.
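The top-down parsing scheme above can be sketched as a recursion over a toy AOG, with OR nodes taking the max over children and AND nodes composing all of them. This is a hedged illustration (appearance and deformation terms are omitted, and all names are assumptions):

```python
def parse(node, term_scores):
    """Hierarchical parsing (sketch): bottom-up scoring with top-down
    extraction of the parse graph. Nodes are (kind, name, children)
    tuples; OR nodes take the max over children, AND nodes sum them."""
    kind, name, children = node
    if kind == "TERM":
        return term_scores[name], [name]
    results = [parse(c, term_scores) for c in children]
    if kind == "OR":                       # choose the single best alternative
        score, pg = max(results, key=lambda r: r[0])
        return score, [name] + pg
    score = sum(r[0] for r in results)     # AND composes all children
    pg = [name] + [n for r in results for n in r[1]]
    return score, pg

leaf = lambda n: ("TERM", n, [])
aog = ("OR", "horse",
       [("AND", "side-view",
         [("OR", "head", [leaf("head-a"), leaf("head-b")]),
          leaf("torso")])])
score, parse_graph = parse(aog, {"head-a": 0.9, "head-b": 1.4, "torso": 0.6})
assert abs(score - 2.0) < 1e-9
assert parse_graph == ["horse", "side-view", "head", "head-b", "torso"]
```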
| # | Question storylines for pose/viewpoint | Participants |
|---|---|---|
| 1 | retrain category classification | Computer |
| 2 | check & correct inaccurate semantic part localizations | Users |
| 3 | 1) QA-based collection of object samples for a pose/viewpoint; 2) mine the latent structure of the pose/viewpoint | Users & Computer |
| 4 | generate a new pose/viewpoint: label an initial object example, collect samples, mine the latent structure, label parts | Instructors & Computer |
3 Cost-sensitive QA-based active learning
3.1 Brief overview of QA-based learning
In this section, we define the overall risk of the AOG. We use this risk to guide the growth of the AOG, which includes selecting questions, refining the current visual concepts in the AOG based on the answers, and mining new concepts as new AOG branches. The overall risk combines the cost of asking questions during the learning process with the loss of the AOG representation. The AOG loss further comprises the generative loss (i.e. the fitness between the AOG and real images) and the discriminative loss (i.e. the AOG's fitness to human supervision).
Therefore, minimizing the AOG risk amounts to selecting a limited number of questions that can potentially minimize the AOG loss. In fact, we organize the six types of questions into four types of QA storylines (Fig. 4). In each step of the QA process, we conduct a certain storyline to decrease the risk. Meanwhile, we evaluate the gain (loss decrease) at different AOG nodes after each storyline, so that we can determine the next best storyline in an online manner.
Unlike previous active learning methods that directly use human annotations as ground-truth samples for training, we generalize specific annotations to common patterns among big data so as to update the AOG.
For example, in Layer 4 of the AOG, there are two types of parts, i.e. the semantic parts and latent parts. In Storylines 3 and 4 (details will be discussed later), we first 1) ask for object samples with a certain pose/viewpoint, 2) based on the object examples, select a large number of similar objects from all the web images as potential positives of this pose/viewpoint, then 3) use  to mine the common part patterns among these objects as the latent parts, and 4) model their spatial relationships.
Thus, as in (3) and Fig. 3, spatial relationships between latent parts constitute a graph that represents the latent structure of the pose/viewpoint. Then, we continue to ask for semantic parts in Storylines 3 and 4, and use the pre-mined latent pose/viewpoint structure to localize relative positions of the newly annotated semantic parts. Such a combination of structure mining from big data and part annotations on small data ensures high learning stability.
In the following subsections, we introduce the detailed implementations of the proposed QA framework.
3.2 Overall formulation
As shown in Fig. 4, we design six types of questions to learn the AOG and organize these questions into four types of storylines. Let us assume that the QA framework has selected a sequence of storylines and has modified the AOG parameters accordingly. We use the system risk to evaluate the overall quality of the current status of QA-based learning. The objective of the QA framework is to select the storylines that decrease the overall risk the most:
The system risk comprises the cost of the storylines and the loss (inaccuracy) of the current AOG. Thus, we can expect the QA system to select cheap storylines that greatly improve the model quality.
Definition of a storyline and its cost: Let us consider the set of candidate storylines. Theoretically, there are four different storylines for each pose/viewpoint node in the AOG, which will be introduced later. The QA system selects a sequence of storylines to modify the AOG. Each storyline comprises a list of questions and learning modules. As shown in Table 1, we can represent a storyline as a three-tuple: it proposes some questions (each defined in Fig. 4) for the target parse graph of a pose/viewpoint, expects a tutor to answer these questions, and then uses the answers for training. These storylines choose ordinary users, professional instructors, or the computer itself as the tutor to answer the queries. Because there are four types of storylines for each pose, the search space of storylines grows accordingly.
In addition, each storyline has a certain cost according to both the human labor of answering and the computational cost of model learning (professional instructors have a higher labor cost considering their professional level). The overall cost of a storyline sequence is given as the sum of the costs of its individual storylines.
Definition of the AOG loss: Consider a comprehensive web image dataset governed by an underlying distribution. When we use our AOG to explain the images in this dataset, we can formulate the overall loss as
where the loss compares, for each image, the true parse graph configuration with the estimated configuration. The generative loss measures the fitness between an image and its true parse graph, and the discriminative loss evaluates the classification performance.
The generative loss can be rewritten as
where each term corresponds to the subset of images containing objects of a certain pose/viewpoint: it denotes the average generative loss over that subset, weighted by the probability of the pose/viewpoint; the true pose/viewpoint of the object inside each image determines its subset.
The discriminative loss for a pose/viewpoint comprises a loss for category (pose/viewpoint) classification and a loss for part localization:
where the two terms are weighted by constant prior weights for category classification and part localization, respectively.
3.3 Learning procedure
Algorithm 1 summarizes the procedure of QA-based active learning. In the beginning, we construct the top two layers of the AOG to cover all the categories. We use the keywords of these categories to crawl web images from the internet and build a comprehensive web image dataset. Next, we apply Storyline 4 to each category, which mines an initial model for a certain pose/viewpoint of that category. Then, we use a greedy strategy to solve (9), which estimates an optimal sequence of storylines. In each step, we recursively determine the next best storyline as follows.
where we consider the potential AOG gain (the decrease of the AOG loss, estimated from historical operations and introduced later) of each storyline. Considering (12) and (13), we can rewrite the above equation as
where the terms are the potential gains of the generative loss, the classification loss, and the part-localization loss after the storyline, respectively. The generative gain can be estimated based on the current web images collected for the pose/viewpoint (i.e. the images collected from the category's image pool in Storyline 3) and on the yes/no answer ratio during sample collection in Storyline 3.
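The greedy step above can be sketched as picking the candidate storyline with the largest net benefit. This is an illustrative sketch; the field names and the simple gain-minus-cost criterion are assumptions standing in for the paper's risk terms.

```python
def next_storyline(candidates):
    """Greedy QA step (sketch): each candidate storyline carries an
    estimated gain (predicted decrease of the AOG loss, e.g. from
    historical gains of similar storylines) and a cost (human labor plus
    computation); pick the one with the largest net benefit."""
    return max(candidates, key=lambda s: s["gain"] - s["cost"])

candidates = [
    {"name": "retrain classification", "gain": 0.8, "cost": 0.3},
    {"name": "label parts",            "gain": 2.0, "cost": 1.9},
    {"name": "collect samples",        "gain": 1.5, "cost": 0.4},
]
assert next_storyline(candidates)["name"] == "collect samples"
```

Note that "label parts" has the largest raw gain but is barely worth its cost, which is exactly the trade-off the cost-sensitive policy encodes.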
3.4 Introduction of storylines
Storyline 1: retraining category classification. As the QA framework collects more and more web images, this storyline uses these images to update the AOG parameters for the classification of a certain pose/viewpoint. It mainly decreases the discriminative loss.
Given all the web images that have been collected for a pose/viewpoint, we use the current AOG for object inference on these images. Given an incorrect object inference (i.e. an image is incorrectly recognized as a pose/viewpoint other than the true one), we can use this inference result to produce hard negatives of semantic object parts and retrain the corresponding part classifiers in Layer 5.
Therefore, the potential cost of a future Storyline-1 run mainly comprises the computational cost of object inference over all pose/viewpoint nodes, scaled by a weighting parameter (see Section 4.1 for the parameter settings). The potential gain can be predicted simply from historical gains of similar storylines for the same pose/viewpoint: among all previous storylines, we select those with both the same type of questions and the same target pose/viewpoint, record the resulting gains, and use these historical gains to predict the gain of a future storyline.
Storyline 2: checking & labeling semantic parts. In this storyline, the computer 1) selects a sequence of images, 2) asks users whether the current AOG can correctly localize the semantic parts in these images, and 3) lets users correct the incorrect part localizations to update the AOG.
First, the QA system uses the pose/viewpoint model for object inference on the images in which semantic parts are not labeled. Next, the QA system selects a set of images that potentially contain incorrect localizations of semantic parts. We select the object samples that have good localizations of latent parts but inaccurate localizations of semantic parts, i.e. high scores for latent parts but low scores for semantic parts. Thus, we can determine the target sample (image) by comparing the current pose/viewpoint with a dummy pose/viewpoint constructed by eliminating its semantic parts.
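The sample-selection heuristic above can be sketched as ranking by the gap between latent-part and semantic-part scores. This is a hedged illustration; the score names and the simple difference criterion are assumptions.

```python
def select_for_checking(samples, k=3):
    """Storyline-2 sample selection (sketch): prefer objects whose latent
    parts localize well but whose semantic parts score poorly, i.e. the
    semantic-part localizations most likely worth showing to a user."""
    ranked = sorted(samples,
                    key=lambda s: s["latent"] - s["semantic"],
                    reverse=True)
    return ranked[:k]

samples = [{"id": 1, "latent": 2.0, "semantic": 1.9},
           {"id": 2, "latent": 2.1, "semantic": 0.2},
           {"id": 3, "latent": 0.5, "semantic": 0.4}]
# Sample 2 is well anchored by latent parts yet has a weak semantic-part
# score, so it is the best candidate to show to the user.
assert [s["id"] for s in select_for_checking(samples, k=1)] == [2]
```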
Then, the computer asks users to check whether the part localizations in the selected images are correct (when parts of a pose/viewpoint are labeled for the first time, the QA system also asks about the part compositions/names; see Fig. 4), and finally asks users to label boxes for the incorrect part localizations (see Fig. 4).
Given the annotations of semantic part boxes, we update the geometric relationships between part nodes in Layer 4 based on , and update SVM classifiers for local patch appearance in Layer 5. Given the part annotations, we can further learn detailed structures in Layers 5–9 via .
The cost of this storyline mainly comprises the human labor required for part checking and part labeling. This storyline mainly decreases the part-localization and classification losses. The potential gains for a future storyline can be predicted using historical gains, as in Storyline 1.
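The labor cost of this storyline can be sketched as below. The 5x ratio between labeling a part box and making a yes/no judgment follows the paper's parameter settings (Section 4.1); the function and variable names are assumptions.

```python
def storyline2_cost(n_checked, n_labeled, c_check=1.0, c_label=5.0):
    """Human-labor cost of Storyline 2 (sketch): checking one part
    localization is a cheap yes/no judgment, while labeling one part box
    costs roughly five times as much."""
    return n_checked * c_check + n_labeled * c_label

# Checking ten localizations and relabeling three of them.
assert storyline2_cost(n_checked=10, n_labeled=3) == 25.0
```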
Storyline 3: collecting & labeling new samples. This storyline collects new samples for a pose/viewpoint from web images to update the pose/viewpoint model. It decreases the generative loss and the pose/viewpoint classification loss. First, we use the sub-AOG of the pose/viewpoint to collect new samples with top inference scores from the web images (the images returned by search engines comprise both correct images with target objects and irrelevant background images). The number of samples collected depends on how many times Storyline 3 has already been performed for the pose/viewpoint.
Second, we randomly select a subset of the new object samples, ask users whether they are true samples of the pose/viewpoint, and expect yes/no answers (see Fig. 4).
Third, given the true samples, we use graph mining to refine the And-Or structure in Layers 3–5 for the pose/viewpoint. The sub-AOG is refined towards the common subgraph pattern (pose/viewpoint model) embedded in a set of large graphs (images). Its objective can be roughly written as follows, as proved in the Appendix (Objective function of graph mining).
The above equation refines the sub-AOG by 1) adding new latent parts to (or deleting redundant latent parts from) the pose/viewpoint, 2) determining the number of children (i.e. the number of patches in Layer 5) of each latent part, 3) updating the average appearance of each patch, and 4) refining the average geometric relationship between each pair of children parts.
At the end of Storyline 3, we further apply Storylines 2 and 1 to refine the semantic parts of the pose/viewpoint and to retrain pose/viewpoint classification.
Therefore, the potential cost of a future Storyline-3 run comprises the computational cost of sample collection over the category's entire web image pool, the human labor of checking samples, and the costs of checking parts, labeling parts, and retraining pose/viewpoint classification, which can be estimated as introduced in Storylines 1 and 2. This storyline mainly decreases the generative loss and the classification loss. The generative gain can be roughly estimated from the result of the last Storyline-3 run, and the remaining gains are approximated using historical gains, as before.
Storyline 4: labeling a new sibling pose/viewpoint. As shown in Fig. 4, in this storyline, the QA system requires a professional instructor to label an initial sample for a new pose/viewpoint in a category, and uses iterative graph mining to extract the structure of Layers 3–5 for this pose/viewpoint (only mining latent parts in Layer 5). The graph mining is conducted for three iterations. In each iteration, we first collect new object samples for the pose/viewpoint, as shown in Fig. 4. Based on the collected samples, we optimize the mining objective in (16) to mine/refine the latent parts in Layer 4 and the patches in Layer 5 for this pose/viewpoint. In this way, we obtain the latent structure of the new pose/viewpoint; we then apply Storyline 2 to ask about and label semantic parts and to fix these semantic parts on the latent structure. Finally, we apply Storyline 1 to train classifiers of the semantic parts for pose/viewpoint classification.
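The iterative collect-then-refine loop above can be sketched as follows. This is a hedged skeleton: `collect` and `refine` are placeholders for the paper's sample-collection and graph-mining modules, and all names are assumptions.

```python
def mine_new_pose(initial_sample, collect, refine, n_iters=3):
    """Storyline-4 loop (sketch): starting from one labeled example, the
    system alternately collects new object samples with the current model
    and refines the latent structure by graph mining, for three iterations."""
    samples = [initial_sample]
    model = initial_sample          # the seed stands in for an initial model
    for _ in range(n_iters):
        samples += collect(model)   # gather new samples with current model
        model = refine(samples)     # graph mining over all samples so far
    return model

# Toy run: each "collect" yields one sample, "refine" just counts samples,
# so after three iterations the model has seen four samples.
model = mine_new_pose("seed", lambda m: ["sample"], lambda s: len(s))
assert model == 4
```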
Therefore, the storyline cost is given as the sum of a constant cost for labeling a new pose/viewpoint (see Section 4.1) and the other costs estimated as mentioned above. This storyline mainly decreases the generative loss, which can be computed as in Storyline 3.
4 Experiments
4.1 Parameter settings
To implement the QA system, we set the cost parameters so that the cost of labeling a part is five times that of making a yes/no judgment; this is because we found in our experiments that labeling a part usually takes about five times as long. The computational cost of the collection/inference of each object and the labeling cost for a new pose/viewpoint were set as constants.
We applied Bing Search and used 16 different keywords to collect web images. The keywords included “bulldozer,” “crab,” “excavator,” “frog,” “parrot,” “red panda,” “rhinoceros,” “rooster,” “Tyrannosaurus rex,” “horse,” “equestrian,” “riding motorbike,” “bus,” “aeroplane,” “fighter jet,” and “riding bicycle.” With each keyword, we collected the top-1000 returned images. We used images of the first ten keywords to learn an AOG (namely AOG-10) with ten categories to evaluate the learning efficiency of our QA framework. Then, we used images of the last seven keywords to learn an AOG (namely AOG-7) with five categories (horse, motorbike, bus, aeroplane, and bicycle) and tested the performance on the Pascal VOC dataset .
4.2 Mining of the deep semantic hierarchy
Figs. 5 and 6 illustrate the deep structures of some categories in the AOG-10. The QA system applied a total of 39 storylines to learn AOG-10. The AOG-10 contains two poses/viewpoints for the frog, horse, and parrot categories, and three poses/viewpoints for each of the other seven categories in Layer 3. AOG-10 has 132 semantic part nodes and 84 latent part nodes in Layer 4. AOG-7 contains a total of 12 pose/viewpoint nodes in Layer 3, 48 semantic part nodes, and 48 latent part nodes in Layer 4.
4.3 Evaluation of part localization
The objective of this work is to learn a transparent representation of the deep object hierarchy, and it is difficult to evaluate the quality of deep structures directly. Therefore, we evaluate our AOGs in terms of part localization, although our contribution goes far beyond part localization. We tested AOG-10 on web images and AOG-7 on the Pascal VOC dataset for a comprehensive evaluation.
Baselines: Our AOGs were learned with part annotations on only 2–14 objects per category, whereas most previous methods require a large number of part annotations to produce a valid model. Nevertheless, we selected nine baselines for comparison, including benchmark methods for object detection (here, part detection), popular part-localization approaches, and methods for the interactive learning of parts. For each baseline, we randomly selected different numbers of training samples to learn the model and enable a fair comparison.
First, we focused on , which uses annotations of semantic parts to train DPMs. This method clusters training samples into different object poses/viewpoints and trains a DPM component for each pose/viewpoint. We designed three baselines based on this method, namely SSDPM-2, SSDPM-3, and SSDPM-5. For each category, SSDPM-2, SSDPM-3, and SSDPM-5 learned two, three, and five pairs of left-right-symmetric poses/viewpoints, respectively. (Due to the limited number of training samples, the bulldozer and horse categories could produce at most four pairs of pose/viewpoint models for SSDPM-5. Training samples used in the baselines will be published after the paper's acceptance.)
Then, we used the technique of  as the fourth baseline, namely PLDPM, which requires annotations of both parts and object poses/viewpoints for training. To enable a fair comparison, we only collected and labeled training samples that corresponded to the poses/viewpoints in our AOG.
Table 2. Part localization performance. Each cell lists #box, followed by (APP / AER) values.

| Method    | #box : (APP/AER)            | #box : (APP/AER)           | #box : (APP/AER)           |
| SSDPM     | 228: 58.7/58.4, 67.2/65.7   | 98: 30.0/36.8, 20.6/30.7   | 133: 13.9/22.2, 24.6/31.0  |
| —         | 110: 53.2/55.0, 54.7/58.5   | 57: 3.3/23.2, 5.9/18.0     | 68: 7.7/15.4, 12.3/31.8    |
| P-Graph   | 204: 8.3/0, 5.4/0           | 152: 11.2/0, 15.9/0        | 156: 1.7/0, 1.7/0          |
| Fast-RCNN | 222: 23.0/2.6, 21.3/1.8     | 109: 19.0/0, 20.1/6.7      | 95: 14.6/4.2, 16.8/2.3     |
| —         | 113: 24.1/5.1, 15.1/0       | 51: 3.0/0, 12.6/6.7        | 49: 5.7/0, 7.0/0           |
| YOLOv3    | 222: 33.8/–, 44.4/–         | 109: 18.9/–, 12.3/–        | 186: 14.9/–, 23.4/–        |
| Our       | 9: 60.6/60.5, 68.8/65.1     | 54: 36.7/35.4, 35.3/41.7   | 24: 13.9/28.4, 17.5/31.0   |

| Method    | #box : (APP/AER)            | #box : (APP/AER)           |
| SSDPM     | 30: 57.9/57.3, 24.5/41.5    | 104: 10.1/40.6, 9.5/35.5   |
| —         | 24: 0/7.7, 0/8.0            | 52: 0/18.7, 1.4/16.2       |
| P-Graph   | 148: 7.1/0, 9.3/0           | 180: 3.7/0, 0.6/0          |
| Fast-RCNN | 163: 29.2/5.5, 24.4/0       | 208: 29.7/8.3, 26.1/8.5    |
| —         | 83: 15.6/1.8, 9.7/0         | 104: 14.1/1.7, 19.2/3.4    |
| YOLOv3    | 163: 48.3/–, 30.6/–         | 208: 44.4/–, 38.8/–        |
| Our       | 9: 57.9/62.4, 32.7/48.6     | 46: 24.6/35.8, 23.0/35.7   |

#box indicates the number of part annotations used for model learning, and the performance is evaluated by the values of (APP / AER). With the help of massive web images, our method required only a fraction of the part annotations that were used by SSDPM, and achieved comparable performance.
The fifth baseline was another part model proposed by , namely P-Graph, which organized object parts into a graph and trained an SVM based on the part appearance features and inter-part relationships for part localization.
The sixth baseline was image matching, namely Matching, introduced in . Unlike conventional matching between automatically detected feature points [10, 31, 5], Matching used a graph template to match semantic parts of objects in images. For a fair comparison, Matching constructed a graph template for each pose/viewpoint in our AOG (i.e. using the template of the initial sample labeled in Storyline 4).
Then, we used two benchmark methods for object detection, i.e. Fast-RCNN  and YOLOv3 , as the seventh and eighth baselines to detect object parts. For the Fast-RCNN baseline, we chose the widely used 16-layer VGG network (VGG-16) that was pre-trained on the ImageNet dataset. For each semantic part, we used  to fine-tune the VGG-16 using part annotations and obtained a specific part detector. In order to detect small object parts, we decreased the threshold of the region-proposal module and thus obtained more than 200 region proposals from each object region. For the YOLOv3 baseline, we used part annotations to fine-tune the pre-trained network.
The ninth baseline was a method for interactively annotating and learning object parts, proposed in ; we call it Interactive-DPM. Its idea of online interactive learning of object parts is quite close to our method.
Evaluation metrics: We used two metrics to evaluate part-localization performance. The first is the APP  (Average Percentage of Parts that are correctly estimated). Given each true object, we used the best pose/viewpoint component in the model (the one with the highest score) to explain the object. Then, for each part of that pose/viewpoint, we used the “” criterion [37, 2] to identify correct part localizations. We computed this percentage for each type of semantic part, and the APP is the average over all part types. To reduce the effect of object detection on the APP, we detected the object within the image region of  and , where // indicates the width/height/center of the true object bounding box.
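A minimal sketch of the APP computation is given below. The exact localization criterion is not reproduced here; an IoU threshold of 0.5 is an assumption, and the box/dict layout is illustrative:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def app(predictions, ground_truth, thresh=0.5):
    """Average Percentage of correctly localized Parts (sketch).

    predictions / ground_truth: {part_type: [box, ...]} with boxes
    aligned by index across the two dicts.  The percentage of correct
    localizations is computed per part type, then averaged.
    """
    rates = []
    for part_type, gt_boxes in ground_truth.items():
        pred_boxes = predictions[part_type]
        correct = sum(iou(p, g) >= thresh for p, g in zip(pred_boxes, gt_boxes))
        rates.append(correct / len(gt_boxes))
    return sum(rates) / len(rates)
```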
The second evaluation metric is the AER (average explanation rate) of objects. When an object is detected (to simplify the evaluation, we only detected the best object in each image and ignored the others), if more than  of the parts in its pose/viewpoint component are correctly localized, we consider the object correctly explained by this component. Fig. 7 compares the part-localization performance of the different baselines at a given annotation cost. Different curves/dots correspond to different baselines. For most baselines, the annotation cost is the number of labeled parts on training samples. However, for our QA system, the overall cost consists of the cost of labeling parts and that of making yes/no judgments. Therefore, we drew two curves for our method: Ours simply used the number of part boxes as the cost, whereas Ours (full cost) computed the cost as  (a judgment costs about  of the time of labeling a part).
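The AER and the full-cost accounting above can be sketched as follows. The part-fraction threshold and the relative weight of a yes/no judgment are elided in the text, so the values 0.5 and 0.2 below are placeholders, not the paper's numbers:

```python
def aer(objects, part_fraction_threshold=0.5):
    """Average Explanation Rate (sketch).

    objects: list of (num_correct_parts, num_parts) per detected object.
    An object counts as correctly explained when more than the given
    fraction of the parts in its best pose/viewpoint component are
    correctly localized.  The 0.5 threshold is an assumption.
    """
    explained = sum(c / n > part_fraction_threshold for c, n in objects)
    return explained / len(objects)

def full_annotation_cost(num_boxes, num_judgments, judgment_weight=0.2):
    """Overall QA cost: labeled part boxes plus yes/no judgments, each
    judgment weighted by its relative labeling time (weight assumed)."""
    return num_boxes + judgment_weight * num_judgments
```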
Note that the Interactive-DPM baseline  cannot detect bounding boxes for object parts; it only localizes the center of each part. Therefore, just as in , we used the “average localization error” to evaluate part-localization accuracy. We normalized the pixel error with respect to the part size, computed as . In Fig. 9, we compare the proposed method with Interactive-DPM  in terms of the average localization error.
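A sketch of such a size-normalized center error is shown below. The exact normalization expression is elided in the text; dividing by the square root of the ground-truth part area is one common choice and is only an assumption here:

```python
import math

def normalized_center_error(pred_center, true_center, part_box):
    """Center-localization error normalized by part size (sketch).

    pred_center / true_center: (x, y); part_box: (x1, y1, x2, y2).
    Normalizing by sqrt(w * h) is an assumed formulation.
    """
    dx = pred_center[0] - true_center[0]
    dy = pred_center[1] - true_center[1]
    w = part_box[2] - part_box[0]
    h = part_box[3] - part_box[1]
    return math.hypot(dx, dy) / math.sqrt(w * h)
```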
Comparison of learning efficiency: We used the ten category models in the AOG-10 to explain their corresponding objects. For each category, 75 images were prepared as testing images to compute the object explanation rate. Fig. 8 illustrates the part-localization performance of the AOG-10, and Fig. 7 shows the average explanation rate over the ten categories. To evaluate our method, we computed the performance of intermediate models for each category, which were trained during the QA procedure with different numbers of storylines/questions. Given the same amount of labeling, our method exhibited about twice the explanation rate of Matching. When our method used only 125 bounding boxes for training, i.e.  of SSDPM-3's annotation cost (4258 boxes), it still achieved a higher explanation rate than SSDPM-3 ( vs. ).
Performance on the Pascal VOC2007: We learned the AOG-7 from web images and tested it on horse, motorbike, bus, aeroplane, and bicycle images with the left and right poses/viewpoints. This subset of Pascal images has been widely used for the weakly supervised mining of object part structures [37, 11]. We compared our method with the SSDPM, P-Graph, Fast-RCNN, and Interactive-DPM baselines. SSDPM was learned from the Pascal training samples with the left and right poses/viewpoints, and we required SSDPM to produce the maximum number of components for each category. Table 2 shows the results; the SSDPM models were learned from different numbers of part annotations. In Fig. 9, we compare the average localization errors of Interactive-DPM  and our method.
SSDPM used part annotations for training, so its performance depended on whether the method could extract discriminative features from small part regions; consequently, SSDPM may perform poorly when the annotated parts are not distinguishable enough. In contrast, besides semantic parts, our method also mined discriminative latent parts from images, which increased the robustness of part localization. Unlike SSDPM, P-Graph, and Interactive-DPM, which directly learn part knowledge from a few annotations, we localized semantic parts on a latent object structure that was mined from unannotated web images; thus, our method suffered less from over-fitting. In addition, although Fast-RCNN has exhibited superior performance in most object-detection tasks, it did not perform as well in part detection. This is because 1) object parts are usually small in images, and without contextual knowledge, the low-resolution part patches cannot provide enough distinguishing information; and 2) we only annotated a small number of samples for each part (e.g.  annotations for each part of the aeroplane), which was not enough to learn a solid Fast-RCNN model. In contrast, our method did not require a large number of annotations for learning/fine-tuning, and it modeled the spatial relationships between parts. Therefore, as shown in Table 2 and Fig. 9, our method used fewer part annotations but achieved better localization accuracy.
5 Conclusions and discussion
In this study, we used human-computer dialogues to mine a nine-layer hierarchy of visual concepts from web images and build an AOG. Unlike the conventional problem of object detection, which only focuses on object bounding boxes, our AOG localizes semantic parts of objects and simultaneously aligns common shape primitives within each part, in order to provide a deep understanding of object statuses. In addition, our method combined QA-based active learning and weakly supervised web-scale learning, which exhibited high efficiency in knowledge mining in our experiments.
In recent years, the development of CNNs has driven great progress in object detection. Thus, it becomes increasingly important to go beyond the object level and obtain a transparent understanding of deep object structures. Unlike widely used models (e.g. CNNs for multi-category or fine-grained classification), the objective of our AOG is not multi-category or fine-grained classification but a deep explanation of the structural hierarchy of each specific object. We do not learn the AOG for the application of multi-category classification; instead, we design the loss for part localization and demonstrate the performance of hierarchical understanding of objects. Unlike object parts, the accuracy of the detailed sketches within each local part is difficult to evaluate, as many of these sketches represent latent semantics within object parts.
Compared to deep neural networks, AOGs are more suitable for the weakly supervised learning of deep object structures. Figs. 5 and 6 show one of the main achievements of this study, i.e. the deformable deep compositional hierarchy of an object, which ranges from the “object,” “semantic parts,” and “sub-parts” to “shape primitives.” Such a deep compositional hierarchy is difficult for deep neural networks to learn without sufficient part annotations.
Our deep hierarchical representation of object structures partially solves the typical problem of how to define semantic parts for an object. In fact, different people may define semantic parts at different fine-grained levels, and this uncertainty of part definition demonstrates the necessity of our nine-layer AOG. Our AOG, for the first time, provides a nine-layer coarse-to-fine representation of object parts, which is more flexible than shallow part models: people can define large-scale parts in the first four layers and obtain representations of small parts in deeper layers (please see Figs. 5 and 6). Thus, the flexibility of the AOG representation is one of the main contributions of this research.
Although the AOG can be used for both object detection and part parsing, in recent years deep neural networks [30, 24, 25] have exhibited discrimination power superior to that of graphical models. Therefore, we believe the main value of the proposed method is the weakly supervised mining of deep object structures, which can serve as explainable structural priors of objects for many applications and tasks. For example, a crucial bottleneck for generative networks is their limited interpretability; the automatically mined hierarchical object structures can be used as prior structural codes for generative networks and boost their interpretability.
The current AOG mainly models the common part structures of objects, without a strong discriminative power for fine-grained classification. However, our AOG can provide dense part correspondences between objects, including alignments of both semantic parts and latent parts. Such dense part correspondences are crucial for fine-grained classification. More specifically, as discussed in , we can simply add different attributes to each node in the AOG. In this way, the original AOG nodes mainly localize object parts, while the attribute classifiers in the AOG nodes serve for fine-grained classification.
Search engines usually return some incorrect images that do not contain the target objects, as well as many simple objects that are placed at image centers and well captured without occlusions. Thus, lifelong-learning studies, such as  and ours, mainly learn from simple samples first and then gradually switch to difficult ones. In fact, the comprehensive mining of all object poses/viewpoints, including infrequent ones, remains a challenging long-tail problem.
In this study, we aimed to explore a general QA system for model mining and to test its efficiency. Thus, for simplicity, we applied simple features and trained simple classifiers. However, the QA system can be extended to incorporate more sophisticated techniques (e.g. connecting the AOG to a CNN) to achieve better performance. In our experiments, we used only one or two keywords per category to search for web images, because our weakly supervised method did not need numerous web images for training. Theoretically, however, one could also apply standard linguistic knowledge bases, such as WordNet , to provide several synonyms for the same category as search keywords.
Acknowledgments
This work is supported by the ONR MURI project N00014-16-1-2007, the DARPA XAI Award N66001-17-2-4029, and NSF IIS 1423305.
Appendix: Objective function of graph mining
The objective function in  was proposed in the form of
where the pattern complexity is formulated using the number of nodes in the pattern, . Then, the terms  and  are the average responses of part node  on positive and negative images, respectively:
Considering , we can rewrite the objective as
In addition, the average score of for negative (background) images is normalized to zero. Therefore, we can further approximate the objective as
Therefore, if we redefine a new complexity , we can write the objective function as
-  S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. Vqa: Visual question answering. In ICCV, 2015.
-  H. Azizpour and I. Laptev. Object detection using strongly-supervised deformable part models. In ECCV, 2012.
-  J. L. Ba, K. Swersky, S. Fidler, and R. Salakhutdinov. Predicting deep zero-shot convolutional neural networks using textual descriptions. In ICCV, 2015.
-  S. Branson, P. Perona, and S. Belongie. Strong supervision from weak annotation: Interactive training of deformable part models. In ICCV, 2011.
-  T. S. Caetano, J. J. McAuley, L. Cheng, Q. V. Le, and A. J. Smola. Learning graph matching. In ICCV, 2007.
-  X. Chen and A. Gupta. Webly supervised learning of convolutional networks. In ICCV, 2015.
-  X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In CVPR, 2014.
-  X. Chen, A. Shrivastava, and A. Gupta. Neil: Extracting visual knowledge from web data. In ICCV, 2013.
-  X. Chen, A. Shrivastava, and A. Gupta. Enriching visual knowledge bases via object discovery and segmentation. In CVPR, 2014.
-  M. Cho, K. Alahari, and J. Ponce. Learning graphs to match. In ICCV, 2013.
-  M. Cho, S. Kwak, C. Schmid, and J. Ponce. Unsupervised object discovery and localization in the wild: Part-based matching with bottom-up region proposals. In CVPR, 2015.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
-  J. Deng, J. Krause, A. C. Berg, A. Berg, and L. Fei-Fei. Hedging your bets: Optimizing accuracy-specificity trade-offs in large scale visual recognition. In CVPR, 2012.
-  J. Deng, O. Russakovsky, J. Krause, M. Bernstein, A. Berg, and L. Fei-Fei. Scalable multi-label annotation. In CHI, 2014.
-  T. Deselaers, B. Alexe, and V. Ferrari. Localizing objects while learning their appearance. In ECCV, 2010.
-  S. K. Divvala, A. Farhadi, and C. Guestrin. Learning everything about anything: Webly-supervised visual concept learning. In CVPR, 2014.
-  C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
-  T. Durand, N. Thome, and M. Cord. Mantra: Minimum maximum latent structural svm for image classification and ranking. In ICCV, 2015.
-  M. Everingham, L. Gool, C. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results.
-  V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Progressive search space reduction for human pose estimation. In CVPR, 2008.
-  S. Gavves, T. Mensink, T. Tommasi, C. Snoek, and T. Tuytelaars. Active learning revisited: Reusing past datasets for future tasks. In ICCV, 2015.
-  R. Girshick. Fast r-cnn. In ICCV, 2015.
-  G. Gkioxari, R. Girshick, and J. Malik. Actions and attributes from wholes and parts. In ICCV, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In CVPR, 2017.
-  G. Kim and E. P. Xing. On multiple foreground cosegmentation. In CVPR, 2012.
-  V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. In IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1568–1583, 2006.
-  C. Kong, D. Lin, M. Bansal, R. Urtasun, and S. Fidler. What are you talking about? text-to-image coreference. In CVPR, 2014.
-  K. Konyushkova, R. Sznitma, and P. Fua. Introducing geometry in active learning for image segmentation. In ICCV, 2015.
-  A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
-  M. Leordeanu, R. Sukthankar, and M. Hebert. Unsupervised learning for graph matching. In International Journal of Computer Vision, 96:28–45, 2012.
-  B. Li, W. Hu, T. Wu, and S.-C. Zhu. Modeling occlusion by discriminative and-or structures. In ICCV, 2013.
-  C. Long and G. Hua. Multi-class multi-annotator active learning with robust gaussian process for visual recognition. In ICCV, 2015.
-  G. A. Miller. Wordnet: A lexical database for english. In Communications of the ACM, 38(11):39–41, 1995.
-  D. Modolo and V. Ferrari. Learning semantic part-based models from google images. In arXiv preprint arXiv:1609.03140, 2017.
-  D. Novotny, A. Vedaldi, and D. Larlus. Learning the semantic structure of objects from web supervision. In the Proceedings of the ECCV Workshop on Geometry Meets Deep Learning, 2016.
-  M. Pandey and S. Lazebnik. Scene recognition and weakly supervised object localization with deformable part-based models. In ICCV, 2011.
-  S. Park, B. X. Nie, and S.-C. Zhu. Attribute and-or grammar for joint parsing of human pose, parts and attributes. In IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 40(7):1555–1569, 2018.
-  J. Redmon and A. Farhadi. Yolov3: An incremental improvement. In arXiv:1804.02767, 2018.
-  Z. Ren, C. Wang, and A. Yuille. Scene-domain active part models for object representation. In ICCV, 2015.
-  O. Russakovsky, L.-J. Li, and L. Fei-Fei. Best of both worlds: human-machine collaboration for object annotation. In CVPR, 2015.
-  Z. Si and S.-C. Zhu. Learning hybrid image templates (hit) by information projection. In IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7):1354–1367, 2012.
-  Z. Si and S.-C. Zhu. Learning and-or templates for object recognition and detection. In IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(9):2189–2205, 2013.
-  M. Simon and E. Rodner. Neural activation constellations: Unsupervised part model discovery with convolutional networks. In ICCV, 2015.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
-  H. O. Song, R. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui, and T. Darrell. On learning to localize objects with minimal supervision. In ICML, 2014.
-  C. Sun, C. Gan, and R. Nevatia. Automatic concept discovery from parallel text and visual corpora. In ICCV, 2015.
-  Q. Sun, A. Laddha, and D. Batra. Active learning for structured probabilistic models with histogram approximation. In CVPR, 2015.
-  K. Tu, M. Meng, M. W. Lee, T. E. Choe, and S.-C. Zhu. Joint video and text parsing for understanding events and answering queries. In IEEE MultiMedia, 21(2):42–70, 2014.
-  S. Vijayanarasimhan and K. Grauman. Large-scale live active learning: Training object detectors with crawled data and crowds. In CVPR, 2011.
-  X. Wang, Z. Zhu, C. Yao, and X. Bai. Relaxed multiple-instance svm with application to object discovery. In ICCV, 2015.
-  Q. Zhang, R. Cao, F. Shi, Y. Wu, and S.-C. Zhu. Interpreting cnn knowledge via an explanatory graph. In AAAI, 2018.
-  Q. Zhang, R. Cao, Y. N. Wu, and S.-C. Zhu. Mining object parts from cnns via active question-answering. In CVPR, 2017.
-  Q. Zhang, Y.-N. Wu, and S.-C. Zhu. Mining and-or graphs for graph matching and object discovery. In ICCV, 2015.
-  Q. Zhang, Y. N. Wu, and S.-C. Zhu. Interpretable convolutional neural networks. In CVPR, 2018.
-  Q. Zhang, Y. Yang, Y. Liu, Y. N. Wu, and S.-C. Zhu. Unsupervised learning of neural networks to explain neural networks. In arXiv:1805.07468, 2018.
-  Q. Zhang, Y. Yang, Y. N. Wu, and S.-C. Zhu. Interpreting cnns via decision trees. In arXiv:1802.00121, 2018.
-  Q. Zhang, Y. Yang, Y. N. Wu, and S.-C. Zhu. Network transplanting. In arXiv:1804.10272, 2018.
-  Q. Zhang and S.-C. Zhu. Visual interpretability for deep learning: a survey. In Frontiers of Information Technology & Electronic Engineering, 19(1):27–39, 2018.
-  J.-Y. Zhu, J. Wu, Y. Xu, E. Chang, and Z. Tu. Unsupervised object class discovery via saliency-guided multiple class learning. In IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(4):862–875, 2014.
-  M. Zhu, X. Zhou, and K. Daniilidis. Single image pop-up from discriminatively learned parts. In ICCV, 2015.
-  X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In CVPR, 2012.
-  Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In ICCV, 2015.