Interpretable R-CNN

11/14/2017 ∙ by Tianfu Wu, et al. ∙ NC State University

This paper presents a method of learning qualitatively interpretable models in object detection using popular two-stage region-based ConvNet detection systems (i.e., R-CNN). R-CNN consists of a region proposal network and a RoI (Region-of-Interest) prediction network. By interpretable models, we focus on weakly-supervised extractive rationale generation, that is, learning to unfold latent discriminative part configurations of object instances automatically and simultaneously in detection without using any supervision for part configurations. We utilize a top-down hierarchical and compositional grammar model embedded in a directed acyclic AND-OR Graph (AOG) to explore and unfold the space of latent part configurations of RoIs. We propose an AOGParsing operator to substitute the RoIPooling operator widely used in R-CNN, so the proposed method is applicable to many state-of-the-art ConvNet based detection systems. The AOGParsing operator aims to harness both the explainable rigor of top-down hierarchical and compositional grammar models and the discriminative power of bottom-up deep neural networks through end-to-end training. In detection, a bounding box is interpreted by the best parse tree derived from the AOG on-the-fly, which is treated as the extractive rationale generated for interpreting detection. In learning, we propose a folding-unfolding method to train the AOG and ConvNet end-to-end. In experiments, we build on top of the R-FCN and test the proposed method on the PASCAL VOC 2007 and 2012 datasets with performance comparable to state-of-the-art methods.

1 Introduction

Recently, deep neural networks [LeCun et al.1998, Krizhevsky, Sutskever, and Hinton2012] have improved prediction accuracy significantly in many vision tasks, and even outperform humans in image classification tasks [He et al.2016, Szegedy, Ioffe, and Vanhoucke2016]. In the literature of object detection, there has been a critical shift from more explicit representations and models such as the mixture of deformable part-based models (DPMs) [Felzenszwalb et al.2010] and its many variants, and hierarchical and compositional AND-OR graph (AOG) models [Song et al.2013, Zhu et al.2008, Wu, Li, and Zhu2016, Wu, Lu, and Zhu2016], to less transparent but much more accurate ConvNet based approaches [Ren et al.2015, Dai et al.2016, Redmon et al.2016, Liu et al.2016, He et al.2017, Dai et al.2017]. Meanwhile, it has been shown that deep neural networks can be easily fooled by so-called adversarial attacks which utilize visually imperceptible, carefully-crafted perturbations to cause networks to misclassify inputs in arbitrarily chosen ways [Nguyen, Yosinski, and Clune2015, Athalye and Sutskever2017], even with a one-pixel attack [Su, Vargas, and Kouichi2017]. It has also been shown that deep learning can easily fit random labels [Zhang et al.2016a]. It is difficult to analyze why state-of-the-art deep neural networks work or fail due to the lack of theoretical underpinnings at present [Arora et al.2014]. From a cognitive science perspective, state-of-the-art deep neural networks might not learn and think like people who know and can explain “why” [Lake et al.2016]. Nevertheless, there are more and more applications in which prediction results of computer vision and machine learning modules based on deep neural networks have been used in making decisions with potentially critical consequences (e.g., security video surveillance and autonomous driving).

Figure 1: Illustration of the proposed end-to-end integration of a generic top-down grammar model represented by a directed acyclic AND-OR Graph (AOG) and bottom-up ConvNets. For clarity, we show an AOG constructed for a grid using the method proposed in [Song et al.2013]. The AOG unfolds the space of all possible latent part configurations. We build on the R-FCN method [Dai et al.2016]. Based on the AOG, we use Terminal-node sensitive maps and propose an AOGParsing operator to substitute the position-sensitive RoIPooling operator in the R-FCN, which will infer the best parse tree for a RoI on-the-fly, as well as the best part configuration. See text for details. (Best viewed in color and magnification)

It is now widely recognized that prediction without interpretable justification will eventually have limited applicability. For example, people can get frustrated if someone close to them does something critical without a convincing explanation, let alone a machine system. So, it is crucial to address machines' inability to explain their predicted decisions and actions (e.g., eXplainable AI or XAI proposed in the DARPA grant solicitation [DARPA]), that is, to improve accuracy and transparency jointly: an interpretable model should not only compute correct predictions for a random example with very high probability, but also rationalize its predictions, preferably in a way explainable to end users. Generally speaking, learning interpretable models is to let machines make sense to humans, which usually involves many challenging aspects. So there has not been a universally accepted definition of the notion of model interpretability. In particular, it remains a long-standing open problem to measure interpretability in a quantitative and principled way.

To address the explainability challenge, many works have proposed to visualize the internal filter kernels or to generate attentive activation maps, which reveal many insights into what DNNs have learned in a post-hoc way. Complementary to those methods, this paper focuses on how to unfold the latent structures for addressing model interpretability in learning and inference. We first propose a method of formulating model interpretability. We then present a case study in object detection. We aim to investigate the feasibility of integrating a top-down model representing the space of latent structures with ConvNets end-to-end, and to qualitatively rationalize the popular two-stage region-based ConvNet detection system, i.e., R-CNN [Girshick2015, Ren et al.2015, Dai et al.2016], without hurting the detection performance.

Figure 1 illustrates the proposed method for object detection. It integrates a generic top-down hierarchical and compositional grammar model and bottom-up ConvNets end-to-end. We adopt R-CNN [Girshick2015, Ren et al.2015, Dai et al.2016] in detection. We focus on weakly-supervised extractive rationale generation in the RoI prediction component in R-CNN, that is, learning to unfold latent discriminative part configurations of RoIs automatically and simultaneously in detection without using any supervision for part configurations. To that end, we utilize a generic top-down hierarchical and compositional grammar model embedded in a directed acyclic AND-OR Graph (AOG) [Song et al.2013, Wu, Lu, and Zhu2016] to explore and unfold the space of latent part configurations of RoIs (see an example in the top of Figure 1). There are three types of nodes in an AOG: an AND-node represents binary decomposition of a large part into two smaller ones, an OR-node represents alternative ways of decomposition, and a Terminal-node represents a part instance. The AOG is consistent with the general image grammar framework [Geman, Potter, and Chi2002, Zhu and Mumford2006, Felzenszwalb2011, Zhu et al.2008]. We propose an AOGParsing operator to substitute the RoIPooling operator in R-CNN based detection systems. In detection, each bounding box is interpreted by the best parse tree derived from the AOG on-the-fly, which is the extractive rationale generated for detection.

In experiments, we build on the R-FCN [Dai et al.2016] with the residual net [He et al.2016] pretrained on ImageNet [Russakovsky et al.2015] as backbone. We test our method on the PASCAL VOC 2007 and 2012 datasets with performance comparable to state-of-the-art methods. We also perform an ablation study on different aspects of the proposed method.

2 Related Work

In the literature, many works have focused on the post-hoc interpretability of deep neural networks by associating explanatory semantic information with nodes in a deep neural network. There are a variety of methods including identifying high-scoring image patches [Girshick et al.2014, Long, Zhang, and Darrell2014] or over-segmented atomic regions [Ribeiro, Singh, and Guestrin2016] directly, visualizing the layers of convolutional networks using deconvolutional networks to understand what contents are emphasized in the high-scoring input image patches [Zeiler and Fergus2014], identifying items in a visual scene and recounting multimedia events [Yu et al.2012, Gan et al.2015], and generating synthesized images by maximizing the response of a given node in the network [Erhan et al.2009, Le et al.2012, Simonyan, Vedaldi, and Zisserman2013] or by developing top-down generative convolutional networks [Lu, Zhu, and Wu2016, Xie et al.2016]. Hendricks et al. [Hendricks et al.2016] extended the approaches used to generate image captions [Karpathy and Li2015, Mao et al.2014] to train a second deep network to generate explanations without explicitly identifying the semantic features of the original network. Most of these methods are not model-agnostic except for [Ribeiro, Singh, and Guestrin2016].

More recently, spatial attention-like mechanisms have been widely studied in deep neural network based systems, including the seminal spatial transformer network [Jaderberg et al.2015] which warps the feature map via a global parametric transformation such as an affine transformation, the exploration of global average pooling and class specific activation maps for weakly-supervised discriminative localization (i.e., CAM) [Zhou et al.2016a], the deformable convolution network [Dai et al.2017] and active convolution [Jeon and Kim2017], and more explicit attention based work in image captioning and visual question answering (VQA) such as the show-attend-tell work [Xu et al.2015] and the hierarchical co-attention in VQA [Lu et al.2016]. The Grad-CAM work [Selvaraju et al.2017], extended from the CAM work [Zhou et al.2016b], can produce a coarse localization map highlighting the important regions in the image used by deep neural networks for predicting the concept. In a similar spirit, the excitation back-propagation method [Zhang et al.2016b] can generate task-specific attention maps. The latest network dissection work [Bau et al.2017] reported empirically that interpretable units are found in representations of the major deep learning architectures [Krizhevsky, Sutskever, and Hinton2012, Chatfield et al.2014, He et al.2016] for vision, and interpretable units also emerge under different training conditions. On the other hand, they also found that interpretability is neither an inevitable result of discriminative power, nor is it a prerequisite to discriminative power. Most of these methods are not model-agnostic except for [Ribeiro, Singh, and Guestrin2016, Koh and Liang2017]. In [Koh and Liang2017], a classic technique in statistics, the influence function, is used to understand black-box predictions in terms of training samples, rather than extractive rationale justification.

Our Contributions. This paper makes three main contributions to the emerging field of learning interpretable models as follows: (i) It presents a method of integrating a generic top-down grammar model, embedded in an AOG, and bottom-up ConvNets end-to-end to learn qualitatively interpretable models in object detection. (ii) It presents an AOGParsing operator which can be used to substitute the RoIPooling operator widely used in R-CNN based detection systems. (iii) It shows detection performance comparable to state-of-the-art R-CNN systems, thus shedding light on addressing accuracy and transparency jointly in learning deep models for object detection.

3 Interpreting Model Interpretability

In this section, we present a generic formulation of model interpretability in visual understanding tasks which accounts for unfolding well-defined latent structures in a weakly-supervised way.

Intuitively, we would expect that an interpretable model could learn and capture latent semantic structures automatically which are not annotated in training data. For example, if we consider the basic image classification task with only image labels available in training as commonly used, to compare which classification models are more interpretable or explainable, one principled way is to show the capability of extracting the latent localization of object of interest w.r.t. the ground-truth label. Similarly, a person detector is more interpretable if it is learned using person bounding box annotations only, but capable of interpreting a person detection with the latent semantic structure explained, ideally the kinetic pose. So, our intuitive idea is that model interpretability can be posed as the capability of exploring the latent space of a higher level task (e.g., localization vs classification and pose recovery vs detection) in a principled way, and of capturing the sufficient statistics in the latent space. The more a model can explore and capture the latent tasks at higher level, the better the model interpretability is.

To that end, we first consider an underlying task hierarchy, e.g., from image classification, to object localization and detection, to object part recovery (object parsing), and all the way to full image parsing (i.e., all image pixels are explained-away in a mathematically sound way). Then, for a task at hand (e.g., object detection), we seek a principled way of defining and exploring the latent space of the task of object part-based parsing, and then compute extractive rationale for the task at hand.

Our formulation is a straightforward top-down method. We first build a grammar structure which quantizes and unfolds the space of latent structures by utilizing the methods presented in [Song et al.2013, Wu, Lu, and Zhu2016, Zhu et al.2016]. Then, we integrate the grammar structure into the model in learning and inference. The parse graph of the grammar structure is treated as the qualitatively interpretable result. The grammar structure can be potentially exploited to build quantitatively interpretable models from scratch by defining loss functions on the latent structures captured by the grammar. To investigate the feasibility, we present a case study on object detection in this paper.

4 A Case Study: Toward Interpretable R-CNN

In this section, we first briefly present backgrounds on R-CNN and the construction of the top-down AOG [Song et al.2013, Wu, Lu, and Zhu2016] to be self-contained. Then, we present the end-to-end integration of AOG and ConvNets.

The R-CNN Framework. The R-CNN framework consists of three components: (i) A ConvNet backbone such as the Residual Net [He et al.2016] for feature extraction, whose parameters are shared between the region-proposal network (RPN) and the RoI prediction network. (ii) The RPN for objectness detection (i.e., category-agnostic detection through binary classification between foreground objects and background) and bounding box regression, with its own parameters. A RoI is a foreground bounding box proposal computed by the RPN. (iii) The RoI prediction network for classifying a RoI and refining it, also with its own parameters, which utilizes the RoIPooling operator and usually uses one or two fully connected layer(s) as the head classifier and regressor. We build on top of the R-FCN method [Dai et al.2016] in our experiments. In R-FCN, position-sensitive score maps are used in RoIPooling, that is, cells in a RoI are treated as object parts, each of which has its own score map. The final classification is based on the majority voting after RoIPooling. The parameters of all three components are trained end-to-end.

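The AOG. In the R-CNN framework, a RoI is interpreted as a predefined flat configuration. To learn interpretable models, we need to explore the space of latent part configurations defined in a RoI. To that end, a RoI is first divided into a grid of cells as done in the RoIPooling operator. Denote by $S_{x,y,w,h}$ and $t_{x,y,w,h}$ a non-terminal symbol and a terminal symbol respectively, both representing the sub-grid with left-top $(x, y)$, width $w$ and height $h$ in the RoI. We only utilize binary decomposition, either a horizontal cut or a vertical cut, when interpreting a non-terminal symbol. We have four rules,

$S_{x,y,w,h} \rightarrow t_{x,y,w,h}$ (1)
$S_{x,y,w,h}(l; \leftrightarrow) \rightarrow S_{x,y,l,h} \cdot S_{x+l,y,w-l,h}$ (2)
$S_{x,y,w,h}(l; \updownarrow) \rightarrow S_{x,y,w,l} \cdot S_{x,y+l,w,h-l}$ (3)
$S_{x,y,w,h} \rightarrow t_{x,y,w,h} \mid S_{x,y,w,h}(l_{min}; \leftrightarrow) \mid \cdots \mid S_{x,y,w,h}(w - l_{min}; \leftrightarrow) \mid S_{x,y,w,h}(l_{min}; \updownarrow) \mid \cdots \mid S_{x,y,w,h}(h - l_{min}; \updownarrow)$ (4)

where $l$ is the position of the cut and $l_{min}$ represents the minimum side length of a valid sub-grid allowed in the decomposition. When instantiated, the first rule will be represented by Terminal-nodes, both the second and the third by AND-nodes, and the fourth by OR-nodes.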

The top-down AOG is constructed by applying the four rules in a recursive way [Song et al.2013, Wu, Lu, and Zhu2016]. Denote an AOG by $G = (V, E)$, where $V = V_{And} \cup V_{Or} \cup V_T$, with $V_{And}$, $V_{Or}$ and $V_T$ representing the sets of AND-nodes, OR-nodes and Terminal-nodes respectively, and $E$ is a set of edges. We start with $V = \emptyset$ and $E = \emptyset$, and a first-in-first-out queue of nodes to expand. The construction unfolds all possible latent configurations. We further introduce a super OR-node whose child nodes are those OR-nodes whose sub-grids occupy more than a certain fraction of the entire grid. The super OR-node is used in the unfolding step of learning the AOG model to help find better interpretations for noisy RoIs from the RPN. Figure 1 shows the AOG constructed for a grid. In [Song et al.2013], the two child nodes of an AND-node are allowed to overlap up to a certain ratio, which we do not use in our experiments for simplicity.
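The recursive construction can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_aog`, the tuple encoding of nodes and the `"w"`/`"h"` axis labels are hypothetical, the super OR-node is omitted, and overlapping child nodes are not supported.

```python
from collections import deque


def build_aog(grid_w, grid_h, min_side=1):
    """Unfold all binary decompositions of a grid_w x grid_h grid.

    Returns a dict mapping each node to the list of its child nodes.
    Node encoding (hypothetical): ("OR" | "T", x, y, w, h) or
    ("AND", x, y, w, h, axis, cut), where axis is "w" or "h".
    """
    children = {}
    root = ("OR", 0, 0, grid_w, grid_h)
    queue = deque([root])  # first-in-first-out queue of nodes to expand
    while queue:
        node = queue.popleft()
        if node in children:  # already expanded: the AOG is a DAG
            continue
        kind, x, y, w, h = node[:5]
        if kind == "T":
            children[node] = []
        elif kind == "OR":
            # OR rule: terminate, or pick any valid cut of the sub-grid.
            alts = [("T", x, y, w, h)]
            alts += [("AND", x, y, w, h, "w", l)
                     for l in range(min_side, w - min_side + 1)]
            alts += [("AND", x, y, w, h, "h", l)
                     for l in range(min_side, h - min_side + 1)]
            children[node] = alts
            queue.extend(alts)
        else:
            # AND rule: binary decomposition into two smaller sub-grids.
            axis, l = node[5], node[6]
            if axis == "w":
                kids = [("OR", x, y, l, h), ("OR", x + l, y, w - l, h)]
            else:
                kids = [("OR", x, y, w, l), ("OR", x, y + l, w, h - l)]
            children[node] = kids
            queue.extend(kids)
    return children


aog = build_aog(3, 3, min_side=1)
kinds = [node[0] for node in aog]
print(kinds.count("T"), kinds.count("AND"), kinds.count("OR"))
```

Because nodes are keyed by their sub-grid, a sub-grid reached through different cut sequences is shared rather than duplicated, which is what keeps the graph compact even though it encodes many configurations.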

A parse tree is an instantiation of the AOG, which follows the breadth-first-search (BFS) order of nodes in the AOG, selects the best child node for each encountered OR-node, keeps both child nodes for each encountered AND-node, and terminates at each encountered Terminal-node. A configuration is generated by collapsing all the Terminal-nodes of a parse tree onto the image domain.

4.1 The Integration of AOG and ConvNets.

We now present a simple end-to-end integration of the top-down AOG and ConvNets, as illustrated in Figure 1.

Consider an AOG defined by its grid size and the minimum side length allowed for nodes. Each Terminal-node has its own Terminal-node sensitive score map. All these score maps have the same dimensions, $H \times W \times C$, where the height $H$ and the width $W$ are the same as those of the last layer in the ConvNet backbone, and the number of channels $C$ is the number of classes in detection (e.g., $C = 21$ in the PASCAL VOC benchmarks, including 20 foreground categories and 1 background). The score maps are usually computed through convolution on top of the last layer in the ConvNet backbone. For a Terminal-node placed in a RoI, its $C$-d score vector is computed by average pooling in the corresponding sub-grid occupied by the Terminal-node (in the same way that the position-sensitive RoIPooling of R-FCN computes the score vector of a RoI grid cell). Following the depth-first search (DFS) order, the score vectors of Terminal-nodes are then passed through the AOG in the forward step, according to the folding or unfolding stage in learning. Following the breadth-first-search (BFS) order, the best parse tree per category for a RoI is inferred in the backward step in the unfolding stage, as well as the part configuration. We elaborate the forward and backward computation in the next section.
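The average pooling of a Terminal-node score map over its sub-grid can be sketched as follows. This is an illustrative sketch only: the function name `terminal_score`, the nested-list layout of the score map, the RoI coordinate convention and the outward rounding of bin boundaries are all assumptions made for the example, not details taken from the paper.

```python
import math


def terminal_score(score_map, roi, grid, cell_box):
    """Average-pool one Terminal-node score map over its sub-grid of a RoI.

    score_map: nested lists indexed [class][row][col]
    roi:       (x0, y0, x1, y1) in feature-map pixels, x1 and y1 exclusive
    grid:      (grid_w, grid_h), the number of cells along each side
    cell_box:  (x, y, w, h), the sub-grid of the Terminal-node in cell units
    """
    x0, y0, x1, y1 = roi
    grid_w, grid_h = grid
    cx, cy, cw, ch = cell_box
    cell_w = (x1 - x0) / grid_w
    cell_h = (y1 - y0) / grid_h
    # Pixel bounds of the sub-grid, rounded outward (an assumption).
    left = x0 + math.floor(cx * cell_w)
    right = x0 + math.ceil((cx + cw) * cell_w)
    top = y0 + math.floor(cy * cell_h)
    bottom = y0 + math.ceil((cy + ch) * cell_h)
    scores = []
    for channel in score_map:  # one channel per class
        values = [channel[r][c]
                  for r in range(top, bottom)
                  for c in range(left, right)]
        scores.append(sum(values) / len(values))
    return scores


# Two classes on a 4 x 4 feature map: class 0 is constant, class 1 grows
# with the column index.
demo_map = [
    [[1.0] * 4 for _ in range(4)],
    [[float(col) for col in range(4)] for _ in range(4)],
]
# Right half of a 2 x 2 grid laid over the whole map.
print(terminal_score(demo_map, (0, 0, 4, 4), (2, 2), (1, 0, 1, 2)))
```

In the method described above every Terminal-node has its own score map, so a full implementation would call such a routine once per Terminal-node with that node's map.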

Remark: The number of channels of Terminal-node score maps can take values other than the number of classes, in which case a fully connected layer is added on top of the AOG to predict the class scores. We keep it simple in this paper.

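Algorithm 1 Forward computation with AOG.
Input: the DFS queue of nodes in the AOG, a RoI, and the training stage (folding or unfolding).
while the DFS queue is not empty do
       Pop a node from the DFS queue;
       if the node is an OR-node then
             if in the folding stage then
                   Set its score vector to the MEAN of the score vectors of its child nodes
             else
                   Set its score vector to the MAX of the score vectors of its child nodes (element-wise), and record the best child node per category
             end if
       else if the node is an AND-node then
             Set its score vector to the SUM of the score vectors of its two child nodes
       else if the node is a Terminal-node then
             Compute its score vector in the RoI from its Terminal-node sensitive score map
       end if
end while
Normalize the score vector of the root OR-node:
if in the folding stage then
       Normalize with the folding-stage normalization weight (a scalar)
else
       Compute the unfolding-stage normalization weight vector using Algorithm 2;
       Normalize with the weight vector (element-wise)
end if

Algorithm 2 Computing the forward normalization weight in the unfolding stage.
for each category do
       Initialize the BFS queue with the root OR-node;
       Initialize the normalization weight of the category;
       while the BFS queue is not empty do
             Pop a node from the BFS queue;
             if the node is an OR-node then
                   Push its best child node for the category into the BFS queue
             else if the node is an AND-node then
                   Push its two child nodes into the BFS queue
             else if the node is a Terminal-node then
                   Update the normalization weight of the category
             end if
       end while
end for

Algorithm 3 Backward computation with AOG.
if in the folding stage then
       Compute the gradient of the root score vector using the normalization weight computed in Algorithm 1;
else
       Compute the gradient of the root score vector using the normalization weight vector computed in Algorithm 2;
end if
Initialize the BFS queue with the root OR-node, which receives this gradient;
while the BFS queue is not empty do
       Pop a node from the BFS queue;
       if the node is an OR-node then
             if in the folding stage then
                   Pass its gradient to all of its child nodes through the MEAN operator
             else
                   Pass its gradient, per category, to the best child node computed in Algorithm 1
             end if
       else if the node is an AND-node then
             Pass its gradient to both of its child nodes through the SUM operator
       else if the node is a Terminal-node then
             Back-propagate to the Terminal-node sensitive score maps.
       end if
end while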

4.2 The Folding-Unfolding Learning

Since the Terminal-node sensitive maps are computed with randomly initialized convolution kernels, it is not reasonable to select the best child for each OR-node at the beginning in the forward step, and all nodes not retrieved by the parse trees will not get gradient updates in the backward step. So, we resort to a folding-unfolding learning strategy. In the folding stage, OR-nodes are implemented by MEAN operators and AND-nodes by SUM operators, thus the AOG is actually an AND Graph and all nodes will be updated in the backward computation. The folding stage is usually trained for one or two epochs. In the unfolding stage, OR-nodes are implemented by element-wise MAX operators and AND-nodes still by SUM operators, leading to the AOGParsing operator. For notational simplicity, we omit the grid size and the minimum side length when referring to an AOG. In both forward and backward computation, all RoIs are processed at once in the implementation; we present the algorithms using one RoI for clarity.

Forward Computation. Forward computation (see Algorithm 1) computes score vectors for all nodes following the DFS queue of nodes in an AOG, in both the folding and unfolding stages. It also computes the assignment of the best child node of each OR-node in the unfolding stage. In the forward step, the score vector of the root OR-node needs to be normalized for fair comparison, especially in the unfolding stage where different parse trees have different numbers of Terminal-nodes. The normalization weight in the folding stage is a scalar shared by all categories. The normalization weight in the unfolding stage (see Algorithm 2) is a vector with one entry per category, since different categories might infer different best parse trees in interpreting a RoI.
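The forward pass and the per-category parse tree retrieval can be sketched as below. This is a minimal sketch under stated assumptions: the names `forward` and `parse_tree_terminals`, the tuple node encoding and the toy AOG for a two-cell grid are hypothetical, and the final normalization by the number of Terminal-nodes in the parse tree is an assumed choice for illustration (the text only states that parse trees differ in their number of Terminal-nodes).

```python
from collections import deque


def forward(children, root, terminal_scores, unfolding):
    """Compute score vectors bottom-up (post-order depth-first traversal).

    children:        node -> list of child nodes (empty for Terminal-nodes)
    terminal_scores: Terminal-node -> list of per-class scores
    Returns (scores, best); best[node][c] is the index of the best child
    of an OR-node for class c and is filled in the unfolding stage only.
    """
    scores, best = {}, {}

    def visit(node):
        if node in scores:
            return scores[node]
        kind = node[0]
        if kind == "T":
            s = list(terminal_scores[node])
        else:
            kid_scores = [visit(k) for k in children[node]]
            per_class = list(zip(*kid_scores))  # per_class[c]: all children
            if kind == "AND":  # SUM in both stages
                s = [sum(v) for v in per_class]
            elif unfolding:    # OR-node, element-wise MAX
                best[node] = [max(range(len(v)), key=v.__getitem__)
                              for v in per_class]
                s = [max(v) for v in per_class]
            else:              # OR-node, MEAN (folding stage)
                s = [sum(v) / len(v) for v in per_class]
        scores[node] = s
        return s

    visit(root)
    return scores, best


def parse_tree_terminals(children, best, root, c):
    """Follow the best children for class c in BFS order."""
    terminals, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        if node[0] == "OR":
            queue.append(children[node][best[node][c]])
        elif node[0] == "AND":
            queue.extend(children[node])
        else:
            terminals.append(node)
    return terminals


# Toy AOG for a grid of two cells: keep the whole RoI or split it.
ROOT, T_ALL, SPLIT = ("OR", "all"), ("T", "all"), ("AND", "split")
OR_L, OR_R, T_L, T_R = ("OR", "l"), ("OR", "r"), ("T", "l"), ("T", "r")
toy = {ROOT: [T_ALL, SPLIT], SPLIT: [OR_L, OR_R], OR_L: [T_L],
       OR_R: [T_R], T_ALL: [], T_L: [], T_R: []}
t_scores = {T_ALL: [1.0, 0.5], T_L: [0.2, 0.6], T_R: [0.3, 0.4]}

scores, best = forward(toy, ROOT, t_scores, unfolding=True)
for c in range(2):
    parts = parse_tree_terminals(toy, best, ROOT, c)
    # Assumed normalization: divide by the number of Terminal-nodes.
    print(c, parts, scores[ROOT][c] / len(parts))
```

In this toy example the two classes select different parse trees from the same AOG (the whole RoI for class 0, the two halves for class 1), which illustrates why the unfolding-stage normalization weight has one entry per category.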

Backward Computation. Similarly, by changing the DFS queue to the BFS queue, we can define backward computation using the AOG based on Algorithm 1 for the folding stage, and on Algorithm 2 for the unfolding stage which is summarized in Algorithm 3.
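How gradients are routed through the AOG in the two stages can be sketched as follows. Again this is only an illustration under assumptions: the function name `backward`, the node encoding and the toy AOG are hypothetical, the normalization weight is left out so that only the routing is shown, and a topological order is used in place of a plain BFS queue so that a node shared by several parents is processed once, after all of them.

```python
def backward(children, root, best, root_grad, unfolding):
    """Route the gradient of the root score vector down to all nodes.

    best[node][c] is the index of the best child of an OR-node for class c
    (only needed in the unfolding stage). Returns node -> gradient vector.
    """
    # Topological order: every node appears after all of its parents.
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for kid in children[node]:
            visit(kid)
        order.append(node)

    visit(root)
    order.reverse()

    n_cls = len(root_grad)
    grads = {node: [0.0] * n_cls for node in order}
    grads[root] = list(root_grad)
    for node in order:
        g, kids = grads[node], children[node]
        if node[0] == "AND":  # SUM: both children receive the gradient
            for kid in kids:
                for c in range(n_cls):
                    grads[kid][c] += g[c]
        elif node[0] == "OR":
            for c in range(n_cls):
                if unfolding:  # MAX: only the best child per class
                    grads[kids[best[node][c]]][c] += g[c]
                else:          # MEAN: split evenly among the children
                    for kid in kids:
                        grads[kid][c] += g[c] / len(kids)
    return grads


# Toy AOG for a grid of two cells: keep the whole RoI or split it.
ROOT, T_ALL, SPLIT = ("OR", "all"), ("T", "all"), ("AND", "split")
OR_L, OR_R, T_L, T_R = ("OR", "l"), ("OR", "r"), ("T", "l"), ("T", "r")
toy = {ROOT: [T_ALL, SPLIT], SPLIT: [OR_L, OR_R], OR_L: [T_L],
       OR_R: [T_R], T_ALL: [], T_L: [], T_R: []}
best = {ROOT: [0, 1], OR_L: [0, 0], OR_R: [0, 0]}

unfolded = backward(toy, ROOT, best, [1.0, 1.0], unfolding=True)
folded = backward(toy, ROOT, {}, [1.0, 1.0], unfolding=False)
print(unfolded[T_ALL], unfolded[T_L], folded[T_ALL], folded[T_L])
```

The contrast between the two calls mirrors the motivation for the folding stage: with MEAN operators every Terminal-node receives a gradient, while with MAX operators only the Terminal-nodes on the selected parse trees do.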

| Model | AP@0.5 | AP@0.7 |
|---|---|---|
| fAOG772-d | 81.7 | 67.8 |
| AOG772-d-7 | 80.7 | 68.1 |
| AOG772-d-1 | 82.0 | 68.6 |
| fAOG772 | 80.2 | 66.1 |
| AOG772-7 | 81.1 | 67.7 |
| AOG772-1 | 81.1 | 66.7 |
| fAOG551-d | 81.5 | 66.9 |
| AOG551-d-7 | 80.7 | 68.4 |
| AOG551-d-1 | 82.1 | 67.9 |
| fAOG551 | 81.1 | 67.4 |
| AOG551-7 | 80.5 | 67.9 |
| AOG551-1 | 81.7 | 66.8 |
| fAOG331-d | 81.5 | 67.4 |
| AOG331-d-7 | 80.8 | 67.8 |
| AOG331-d-1 | 81.3 | 67.1 |
| fAOG331 | 80.4 | 66.7 |
| AOG331-7 | 80.2 | 67.7 |
| AOG331-1 | 80.3 | 67.0 |
| RFCN-d-re | 82.0 | 67.9 |

Table 1: Performance comparisons using Average Precision (AP) at the intersection over union (IoU) thresholds 0.5 (AP@0.5) and 0.7 (AP@0.7) respectively in the PASCAL VOC2007 dataset (using the protocol of competition “comp4”, trained using both 2007 and 2012 trainval datasets). In the table, “fAOG772-d” represents the model trained using the deformable AOG and the folding stage only, and “AOG772-d-7” or “AOG772-d-1” the model trained using the folding-unfolding method with the unfolding stage initialized from the model at epoch 7 or 1 in the folding stage respectively. Without “-d”, the AOGs are not deformable. “RFCN-d-re” represents the reproduced results of R-FCN with deformable convolution using our modified code, which are consistent with the results reported in [Dai et al.2017].

| Model | aero | bike | boat | bttle | bus | car | mbik | train | bird | cat | cow | dog | hrse | sheep | pers | plant | chair | tble | sofa | tv | avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AOG772-d-1 | 87.7 | 84.1 | 79.5 | 66.7 | 63.3 | 82.2 | 81.3 | 93.6 | 61.2 | 82.4 | 62.2 | 92.2 | 87.4 | 85.9 | 84.9 | 60.2 | 83.4 | 69.9 | 87.0 | 73.4 | 78.4 |
| RFCN-d-re | 87.0 | 84.3 | 78.8 | 67.8 | 62.2 | 80.9 | 81.7 | 93.8 | 60.3 | 82.5 | 63.4 | 92.2 | 87.0 | 86.6 | 85.5 | 60.3 | 82.8 | 68.8 | 86.4 | 73.5 | 78.3 |

Table 2: Performance comparisons using AP@0.5 in the PASCAL VOC2012 dataset (“comp4”). “AOG772-d-1” can be viewed at http://host.robots.ox.ac.uk:8080/anonymous/EXCJXR.html, and “RFCN-d-re” at http://host.robots.ox.ac.uk:8080/anonymous/BWL8DV.html.
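As a quick arithmetic check, the “avg.” column of Table 2 is the mean of the 20 per-category AP values in each row. The short sketch below recomputes it from the values listed in the table; the variable names are only illustrative.

```python
# Per-category AP values copied from Table 2, in column order.
aog772_d_1 = [87.7, 84.1, 79.5, 66.7, 63.3, 82.2, 81.3, 93.6, 61.2, 82.4,
              62.2, 92.2, 87.4, 85.9, 84.9, 60.2, 83.4, 69.9, 87.0, 73.4]
rfcn_d_re = [87.0, 84.3, 78.8, 67.8, 62.2, 80.9, 81.7, 93.8, 60.3, 82.5,
             63.4, 92.2, 87.0, 86.6, 85.5, 60.3, 82.8, 68.8, 86.4, 73.5]


def mean_ap(values):
    """Mean of per-category AP, rounded to one decimal as in the table."""
    return round(sum(values) / len(values), 1)


print(mean_ap(aog772_d_1), mean_ap(rfcn_d_re))  # 78.4 78.3
```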

5 Experiments

In this section, we present experimental results on the PASCAL VOC 2007 and 2012 benchmarks [Everingham et al.2015]. We also give the ablation study on different aspects of the proposed method. We build on top of the R-FCN method [Dai et al.2016], which is a fully convolutional version of R-CNN framework among the state-of-the-art variants of R-CNN. We implement our method using the latest MXNet. Our source code will be released.

Setting and Implementation Details. We conduct experiments with different settings: (i) Three different AOGs, denoted AOG772, AOG551 and AOG331 in Table 1 (one further configuration is too slow to train, thus not reported). Note that we do not change the bounding box regression branch in the RoI prediction except for the RoI grid size, which is changed to match the AOGs. (ii) Deformable vs non-deformable AOGs. We modified the latest R-FCN with deformable convolution [Dai et al.2017] (i.e., RFCN-d), reusing the code released on GitHub (https://github.com/msracver/Deformable-ConvNets). For deformable AOGs, we allow Terminal-nodes to be deformable in computing their score vectors, similar to the deformable RoIPooling used in [Dai et al.2017]. (iii) Folding vs Folding-Unfolding training procedure. We follow the same hyper-parameter setting provided in the RFCN-d source code for fair comparison (the number of epochs, the initial learning rate, the scheduling step, and a warm-up step with a smaller learning rate for the first mini-batches), and online hard-negative mining is adopted in training.

Ablation Study. The proposed integration of AOG and ConvNets is simple: it substitutes the original RoIPooling operator with the AOGParsing operator. The RoIPooling is computed with a predefined and fixed flat configuration (i.e., a grid). The AOGParsing is computed with a hierarchical and compositional AND-OR graph constructed on top of the same grid to explore a much larger number of latent configurations. Terminal-nodes use the same operators as the cells in the RoIPooling. AND-nodes and OR-nodes adopt very simple operators: SUM, MEAN or element-wise MAX. So, we expect that the proposed integration will not hurt the accuracy of the baseline R-CNN system, while being capable of outputting extractive rationale justification using the parse trees inferred on-the-fly for each detected object. The RoIPooling operator is a special case of the AOGParsing operator.

We conduct the ablation study on PASCAL VOC 2007. Table 1 shows the breakdown performance and comparisons. The results show that all of the variants are comparable in terms of accuracy, which matches our expectation. In terms of the extractive rationale justification, Figure 2 shows some qualitative examples.

Figure 2: Examples of latent part configurations unfolded by AOGs using the learned model “AOG772-1”. We show one random example per category in VOC 2007 test dataset. We show one instance of the top two configurations for the 20 categories with the configuration superposed on the right-top of each image. (Best viewed in color and magnification)

Results. We also test the integration on the PASCAL VOC 2012 benchmark with results shown in Table 2. We report comparisons with the RFCN-d [Dai et al.2016, Dai et al.2017] only since it is one of the state-of-the-art methods.

Runtime. The runtime is mainly affected by the size of an AOG. Our current implementation of the AOG is not optimized, with some operators written in Python instead of C/C++. Per image, “AOG772” roughly takes , “AOG551” roughly takes and “AOG331” roughly takes . RFCN-d roughly takes .

Limitations and Discussions. The proposed method has two main limitations to be addressed in future work. First, although it can show qualitative extractive rationales in detection in a weakly-supervised way, it is difficult to measure the model interpretability, especially in a quantitative way. For quantitative interpretability, we will investigate rigorous definitions which can be formalized as an interpretability-sensitive loss term in end-to-end training. Second, the current implementation of the proposed method did not improve accuracy, although this is not our focus in this paper. We will explore new operators for AND-nodes and OR-nodes in the AOG to improve performance. We hope detection performance will be further improved with the interpretability-sensitive loss term.

6 Conclusion

This paper presented a method of integrating a generic top-down grammar model with bottom-up ConvNets in an end-to-end way for learning qualitatively interpretable models in object detection using the R-CNN framework. It builds on top of the R-FCN method and substitutes the RoIPooling operator with an AOGParsing operator to unfold the space of latent part configurations. It proposed a folding-unfolding method in learning. In experiments, the proposed method is tested on the PASCAL VOC 2007 and 2012 benchmarks with performance comparable to state-of-the-art R-CNN based detection methods. The proposed method computes the optimal parse tree in the AOG as a qualitative extractive rationale in “justifying” detection results. It sheds light on learning quantitatively interpretable models in object detection.

References

  • [Arora et al.2014] Arora, S.; Bhaskara, A.; Ge, R.; and Ma, T. 2014. Provable bounds for learning some deep representations. In ICML, 584–592.
  • [Athalye and Sutskever2017] Athalye, A., and Sutskever, I. 2017. Synthesizing robust adversarial examples. CoRR abs/1707.07397.
  • [Bau et al.2017] Bau, D.; Zhou, B.; Khosla, A.; Oliva, A.; and Torralba, A. 2017. Network dissection: Quantifying interpretability of deep visual representations. In CVPR.
  • [Chatfield et al.2014] Chatfield, K.; Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2014. Return of the devil in the details: Delving deep into convolutional nets. In BMVC.
  • [Dai et al.2016] Dai, J.; Li, Y.; He, K.; and Sun, J. 2016. R-FCN: object detection via region-based fully convolutional networks. In NIPS.
  • [Dai et al.2017] Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; and Wei, Y. 2017. Deformable convolutional networks. CoRR abs/1703.06211.
  • [DARPA] DARPA. Explainable artificial intelligence (XAI) program, http://www.darpa.mil/program/explainable-artificial-intelligence, full solicitation at http://www.darpa.mil/attachments/darpa-baa-16-53.pdf.
  • [Erhan et al.2009] Erhan, D.; Bengio, Y.; Courville, A.; and Vincent, P. 2009. Visualizing higher-layer features of a deep network. Technical Report 1341, University of Montreal.
  • [Everingham et al.2015] Everingham, M.; Eslami, S. M.; Gool, L.; Williams, C. K.; Winn, J.; and Zisserman, A. 2015. The pascal visual object classes challenge: A retrospective. IJCV 111(1):98–136.
  • [Felzenszwalb et al.2010] Felzenszwalb, P. F.; Girshick, R. B.; McAllester, D.; and Ramanan, D. 2010. Object detection with discriminatively trained part-based models. TPAMI 32(9):1627–1645.
  • [Felzenszwalb2011] Felzenszwalb, P. F. 2011. Object detection grammars. In ICCV-Workshops, 691.
  • [Gan et al.2015] Gan, C.; Wang, N.; Yang, Y.; Yeung, D.; and Hauptmann, A. G. 2015. Devnet: A deep event network for multimedia event detection and evidence recounting. In CVPR, 2568–2577.
  • [Geman, Potter, and Chi2002] Geman, S.; Potter, D.; and Chi, Z. Y. 2002. Composition systems. Quarterly of Applied Mathematics 60(4):707–736.
  • [Girshick et al.2014] Girshick, R.; Donahue, J.; Darrell, T.; and Malik, J. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR.
  • [Girshick2015] Girshick, R. 2015. Fast R-CNN. In ICCV.
  • [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
  • [He et al.2017] He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. 2017. Mask R-CNN. In ICCV.
  • [Hendricks et al.2016] Hendricks, L. A.; Akata, Z.; Rohrbach, M.; Donahue, J.; Schiele, B.; and Darrell, T. 2016. Generating visual explanations. In ECCV.
  • [Jaderberg et al.2015] Jaderberg, M.; Simonyan, K.; Zisserman, A.; and Kavukcuoglu, K. 2015. Spatial transformer networks. In NIPS.
  • [Jeon and Kim2017] Jeon, Y., and Kim, J. 2017. Active convolution: Learning the shape of convolution for image classification. CoRR abs/1703.09076.
  • [Karpathy and Li2015] Karpathy, A., and Li, F. 2015. Deep visual-semantic alignments for generating image descriptions. In CVPR, 3128–3137.
  • [Koh and Liang2017] Koh, P. W., and Liang, P. 2017. Understanding black-box predictions via influence functions. In ICML.
  • [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In NIPS, 1106–1114.
  • [Lake et al.2016] Lake, B. M.; Ullman, T. D.; Tenenbaum, J. B.; and Gershman, S. J. 2016. Building machines that learn and think like people. CoRR abs/1604.00289.
  • [Le et al.2012] Le, Q. V.; Ranzato, M.; Monga, R.; Devin, M.; Corrado, G.; Chen, K.; Dean, J.; and Ng, A. Y. 2012. Building high-level features using large scale unsupervised learning. In ICML.
  • [LeCun et al.1998] LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
  • [Liu et al.2016] Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; and Berg, A. C. 2016. SSD: Single shot multibox detector. In ECCV.
  • [Long, Zhang, and Darrell2014] Long, J.; Zhang, N.; and Darrell, T. 2014. Do convnets learn correspondence? In NIPS.
  • [Lu et al.2016] Lu, J.; Yang, J.; Batra, D.; and Parikh, D. 2016. Hierarchical question-image co-attention for visual question answering. In NIPS.
  • [Lu, Zhu, and Wu2016] Lu, Y.; Zhu, S.; and Wu, Y. N. 2016. Learning FRAME models using CNN filters. In AAAI.
  • [Mao et al.2014] Mao, J.; Xu, W.; Yang, Y.; Wang, J.; and Yuille, A. L. 2014. Deep captioning with multimodal recurrent neural networks (m-rnn). CoRR abs/1412.6632.
  • [Nguyen, Yosinski, and Clune2015] Nguyen, A. M.; Yosinski, J.; and Clune, J. 2015. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In CVPR, 427–436.
  • [Redmon et al.2016] Redmon, J.; Divvala, S. K.; Girshick, R. B.; and Farhadi, A. 2016. You only look once: Unified, real-time object detection. In CVPR.
  • [Ren et al.2015] Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS.
  • [Ribeiro, Singh, and Guestrin2016] Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. “Why should I trust you?”: Explaining the predictions of any classifier. CoRR abs/1602.04938.
  • [Russakovsky et al.2015] Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; Berg, A. C.; and Fei-Fei, L. 2015. ImageNet Large Scale Visual Recognition Challenge. IJCV 115(3):211–252.
  • [Selvaraju et al.2017] Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV.
  • [Simonyan, Vedaldi, and Zisserman2013] Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2013. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR abs/1312.6034.
  • [Song et al.2013] Song, X.; Wu, T.; Jia, Y.; and Zhu, S. 2013. Discriminatively trained and-or tree models for object detection. In CVPR, 3278–3285.
  • [Su, Vargas, and Kouichi2017] Su, J.; Vargas, D. V.; and Kouichi, S. 2017. One pixel attack for fooling deep neural networks. CoRR abs/1710.08864.
  • [Szegedy, Ioffe, and Vanhoucke2016] Szegedy, C.; Ioffe, S.; and Vanhoucke, V. 2016. Inception-v4, inception-resnet and the impact of residual connections on learning. CoRR abs/1602.07261.
  • [Wu, Li, and Zhu2016] Wu, T.; Li, B.; and Zhu, S. 2016. Learning and-or model to represent context and occlusion for car detection and viewpoint estimation. TPAMI 38(9):1829–1843.
  • [Wu, Lu, and Zhu2016] Wu, T.; Lu, Y.; and Zhu, S. 2016. Online object tracking, learning and parsing with and-or graphs. TPAMI.
  • [Xie et al.2016] Xie, J.; Lu, Y.; Zhu, S.; and Wu, Y. N. 2016. A theory of generative convnet. In ICML.
  • [Xu et al.2015] Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A. C.; Salakhutdinov, R.; Zemel, R. S.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML.
  • [Yu et al.2012] Yu, Q.; Liu, J.; Cheng, H.; Divakaran, A.; and Sawhney, H. S. 2012. Multimedia event recounting with concept based representation. In MM, 1073–1076.
  • [Zeiler and Fergus2014] Zeiler, M. D., and Fergus, R. 2014. Visualizing and understanding convolutional networks. In ECCV, 818–833.
  • [Zhang et al.2016a] Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; and Vinyals, O. 2016a. Understanding deep learning requires rethinking generalization.
  • [Zhang et al.2016b] Zhang, J.; Lin, Z. L.; Brandt, J.; Shen, X.; and Sclaroff, S. 2016b. Top-down neural attention by excitation backprop. In ECCV.
  • [Zhou et al.2016a] Zhou, B.; Khosla, A.; Lapedriza, À.; Oliva, A.; and Torralba, A. 2016a. Learning deep features for discriminative localization. In CVPR.
  • [Zhou et al.2016b] Zhou, B.; Khosla, A.; Lapedriza, À.; Oliva, A.; and Torralba, A. 2016b. Learning deep features for discriminative localization. In CVPR.
  • [Zhu and Mumford2006] Zhu, S. C., and Mumford, D. 2006. A stochastic grammar of images. Found. and Trends in Comp. G. and V. 2(4):259–362.
  • [Zhu et al.2008] Zhu, L.; Chen, Y.; Lu, Y.; Lin, C.; and Yuille, A. L. 2008. Max margin AND/OR graph learning for parsing the human body. In CVPR.
  • [Zhu et al.2016] Zhu, J.; Wu, T.; Zhu, S.; Yang, X.; and Zhang, W. 2016. A reconfigurable tangram model for scene representation and categorization. TIP 25(1):150–166.