Architecture Disentanglement for Deep Neural Networks

03/30/2020 ∙ by Jie Hu, et al. ∙ 0

Deep Neural Networks (DNNs) are central to deep learning, and understanding their internal working mechanism is crucial if they are to be used for emerging applications in medical and industrial AI. To this end, the current line of research typically involves linking semantic concepts to a DNN's units or layers. However, this fails to capture the hierarchical inference procedure throughout the network. To address this issue, we introduce the novel concept of Neural Architecture Disentanglement (NAD) in this paper. Specifically, we disentangle a pre-trained network into hierarchical paths corresponding to specific concepts, forming the concept feature paths, i.e., the concept flows from the bottom to top layers of a DNN. Such paths further enable us to quantify the interpretability of DNNs according to the learned diversity of human concepts. We select four types of representative architectures ranging from handcrafted to autoML-based, and conduct extensive experiments on object-based and scene-based datasets. Our NAD sheds important light on the information flow of semantic concepts in DNNs, and provides a fundamental metric that will facilitate the design of interpretable network architectures. Code will be available at:




1 Introduction

The practical usage of Deep Neural Networks (DNNs) has been hindered by the inability to fully understand the rationale behind their predictions (Carvalho et al., 2019), an understanding that is especially crucial in many real-world scenarios, such as healthcare, criminal justice and administrative regulation (Rudin, 2019). Thus, interpreting DNNs has attracted ever-increasing research attention in recent years. To this end, existing endeavors (e.g., Network Dissection (Bau et al., 2017) and Concept Activation Vector (Kim et al., 2018)) mainly aim to pin semantic concepts to a DNN’s units or layers.

Figure 1: Examples of some disentangled concepts on ImageNet (the first and the second rows) and Place365 (the third and the fourth rows) with VGG16. (a) An input image assembled with different concepts. (b) Activations from the last block of the original network. (c)-(f) Activations from the last block of the disentangled network with corresponding concept feature paths. The network is disentangled with specific concepts.

However, the hierarchical inference procedure for a specific concept is not effectively captured by simply aligning concepts to units or layers. First, the learned networks entangle all the concepts together. For example, one unit can be responsible for multiple concepts (Morcos et al., 2018; Zhou et al., 2018). Second, only knowing which unit or layer represents which concept is not enough to understand the reasoning process for a given concept. For instance, the ‘conv5-3 unit 151’ of VGG16 pre-trained on ImageNet was shown to represent the concept of ‘Airplane’ (Bau et al., 2017), but it remains unknown how and why this unit infers such a concept from the input. In contrast, if the hierarchical network architecture could be disentangled in terms of concepts, both concerns could be addressed. The disentangled sub-architectures naturally form the inference procedures for the concepts.

Therefore, in this paper, we introduce the new concept of Neural Architecture Disentanglement (NAD), which aims to decompose a pre-trained network into sub-architectures consisting of paths that relate to specific concepts. We term such paths concept feature paths, which are selected combinations of hidden units from a DNN’s bottom block to its top block (a ‘block’ refers to a combination of convolutional layers, activation functions and normalization layers). Fig. 1 shows some examples of disentangled concepts from the original network. Moreover, NAD provides a natural metric for evaluating the interpretability of DNNs. The metric is defined as the variance of similarities between paths, which serves as a quantification of how interpretable the network is. More details about the metric can be found in Sec. 4. To the best of our knowledge, such quantification of network interpretability is the first of its kind.

We conduct extensive experiments with four representative architectures, i.e., VGG (Simonyan & Zisserman, 2014), ResNet (He et al., 2016), DenseNet (Huang et al., 2017) and DARTS-Net (Liu et al., 2018), which range from handcrafted to autoML-based. We investigate the interpretability of the above architectures on an object-based dataset (i.e., ImageNet (Deng et al., 2009)) and a scene-based dataset (i.e., Place365 (Zhou et al., 2017)). NAD sheds important light on the information flow of semantic concepts in DNNs, and provides a fundamental metric that will facilitate the design of interpretable network architectures.

Several interesting observations are revealed by NAD with the proposed metric: (1) The well-organized hierarchical semantic structure of categories in ImageNet and Place365 can be learned by highly interpretable networks. (2) The order of interpretability, from highest to lowest, is roughly VGG16, DARTS-Net/DenseNet121, and ResNet50, as shown in Fig. 6 of our experiments. Interestingly, we note that connection complexity is not a key factor affecting interpretability. (3) Further analysis reveals that basic operations, i.e., the residual operation and the concat operation, have an important effect on interpretability. Specifically, the residual operation reduces interpretability, while the concat operation encourages it.

The rest of this paper is organized as follows: Sec. 2 reviews related work. The concept of Neural Architecture Disentanglement (NAD) and the proposed metric for evaluating DNNs’ interpretability are introduced in Sec. 3 and Sec. 4, respectively. Experimental results are given in Sec. 5. Finally, we conclude this work in Sec. 6.

2 Related Work

Post-model Interpretability for DNNs. Post-model interpretability refers to improving the interpretability after building a model. Many studies have been done to understand DNNs by linking the semantic concepts to units or layers. For instance, Activation Maximization (Erhan et al., 2009; Nguyen et al., 2019) optimizes a random input image to maximize a specific unit, and then assigns a concept to this unit by observing what the optimized input image looks like. Network Dissection (Bau et al., 2017, 2019) directly aligns individual hidden units with a set of semantic concepts by using pre-defined pixel-level labels. Concept Activation Vector (Kim et al., 2018) interprets the internal layers of DNNs in terms of human-friendly concepts by learning a normal vector of a plane that differentiates the selected concept from other concepts. Different from the above works, we focus on linking the image-level labeled concepts to the hierarchically disentangled sub-architectures.

Pre-Model Interpretability for DNNs. An interpretable DNN can also be trained in a pre-model manner by enforced disentanglement of representations (Higgins et al., 2017; Zhang et al., 2018). The key idea is to align the features to the standard Gaussian distribution and then attribute the concepts to the features (Tishby et al., 2000; Hu et al., 2019). Instead of disentangling the representation, we focus on disentangling the network itself, decomposing the whole architecture of a pre-trained DNN in terms of the semantic concepts, to interpret the working mechanism.

Dynamic Routings of DNNs. Dynamic routing studies how to select a route through the network for each input during inference (Bengio et al., 2016, 2013; Bolukbasi et al., 2017; Li et al., 2019). For instance, Distillation Guided Routing (Wang et al., 2018) uses dynamic routing to interpret neural networks: it finds a path for each individual input, and interprets DNNs by comparing intra-class samples. In contrast, our work disentangles the architecture with different semantic concepts, and interprets DNNs according to the semantic difference between inter-class concepts.

3 Neural Architecture Disentanglement

Neural Architecture Disentanglement (NAD) aims to decompose a pre-trained DNN into a set of sub-architectures consisting of feature paths with corresponding semantic concepts, which further provides a metric for quantifying the interpretability of DNNs. In this section, we first define the problem formulation of NAD. Then, we provide our solution to find the concept feature paths for NAD. In the next section, we define the metric to quantify the interpretability of different DNN models according to NAD.

3.1 Problem Formulation

Figure 2: Illustration of the concept feature path. Images of a given concept fed into the pre-trained network pass through a sequence of blocks and produce a series of activations (i.e., feature maps). The activations of the current block are sent to the next block. Each neural unit in the block produces one activation. After selecting a minimum number of activations that keep the activations as close to the originals as possible and still classify the concept correctly, we get the concept feature path by combining the corresponding units of the selected activations. After finding the path for each individual concept, we disentangle the network into a concept-wise path set. Concept feature paths may share units with each other.

A pre-trained neural network can generally be subdivided into two parts, i.e., the feature extractor and the classifier. We focus on the information flow in the feature extractor. (In this paper, we leave aside the decision-making process in the classifier, which focuses instead on classifying the features from the feature extractor.) As the labels are the most straightforward reflection of human cognition of images, we adopt the image labels as semantic concepts for network disentanglement. Let $c$ denote the concept to be studied; the image $x_c$ is sampled from the set $X_c$ with the identical concept $c$. We define the blocks in the feature extractor as functions $\{f_1, \dots, f_L\}$ from the bottom layer to the top layer, and $g$ as the classifier. For the input $x_c$, we can obtain the original activations of concept $c$ as:

$$a_c^{1} = f_1(x_c), \quad a_c^{l} = f_l(a_c^{l-1}) \;\; (l = 2, \dots, L), \quad y_c = g(a_c^{L}), \tag{1}$$

where $\{a_c^{1}, \dots, a_c^{L}\}$ denote the activations of the blocks, and $y_c$ denotes the output of the classifier. The corresponding feature path for $c$ is defined as a set of binary vectors $P_c = \{p_c^{1}, \dots, p_c^{L}\}$, where $p_c^{l}$ has one entry per unit of block $l$. The activations produced by this path are:

$$\hat{a}_c^{1} = p_c^{1} \odot f_1(x_c), \quad \hat{a}_c^{l} = p_c^{l} \odot f_l(\hat{a}_c^{l-1}) \;\; (l = 2, \dots, L), \quad \hat{y}_c = g(\hat{a}_c^{L}), \tag{2}$$

where $\{\hat{a}_c^{1}, \dots, \hat{a}_c^{L}\}$ denote the selected activations, and $\hat{y}_c$ denotes the classification output of the concept path $P_c$. Our target is to find the sparsest $P_c$ that makes the inner activations and the classification results as identical as possible to their original outputs. Note that the constraints on the inner activations are necessary, because a combination of the top layers’ units alone is enough to produce the desired classification results. In other words, without these inner constraints the optimized $p_c^{l}$ in the bottom layers can all be $\mathbf{1}$, as the classification results can be maintained using only partial units in the top layers. We form the objective for finding the concept path of $c$ as:

$$\min_{P_c} \; \sum_{l=1}^{L} D(a_c^{l}, \hat{a}_c^{l}) + D(y_c, \hat{y}_c) \quad \text{s.t.} \quad \|P_c\|_0 \le \tau \, |P_c|, \tag{3}$$

where $D$ is the metric for evaluating the difference (e.g., L1 distance and L2 distance), $\|P_c\|_0$ is the $\ell_0$-norm of the set $P_c$, $\tau$ is the threshold for adjusting the sparsity of the path, and $|P_c|$ is the cardinality of the set $P_c$. Such selected activations form a feature path for concept $c$ with their corresponding units. Fig. 2 explains the generation mechanism of the concept feature path.
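As a rough illustration of the formulation above, the following sketch runs a toy feature extractor with and without a binary path and measures how far the selected activations drift from the original ones. The blocks, the path vectors and the helper names (`make_block`, `forward`, `forward_path`) are hypothetical stand-ins chosen for illustration, the classifier is omitted, and the L1 distance is used as one possible choice of the difference metric.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Block = Callable[[Vector], Vector]


def make_block(weights: Sequence[float]) -> Block:
    """Toy stand-in for a block: per-unit scaling followed by ReLU."""
    def block(x: Vector) -> Vector:
        return [max(0.0, w * v) for w, v in zip(weights, x)]
    return block


def forward(blocks: Sequence[Block], x: Vector) -> List[Vector]:
    """Original activations: each block consumes the previous block's output."""
    activations, a = [], x
    for f in blocks:
        a = f(a)
        activations.append(a)
    return activations


def forward_path(blocks: Sequence[Block], path: Sequence[Sequence[int]],
                 x: Vector) -> List[Vector]:
    """Selected activations: each block output is masked by its binary vector."""
    activations, a = [], x
    for f, p in zip(blocks, path):
        a = [m * v for m, v in zip(p, f(a))]
        activations.append(a)
    return activations


def l1_distance(u: Vector, v: Vector) -> float:
    return sum(abs(a - b) for a, b in zip(u, v))


# Hypothetical two-block extractor with three units per block.
blocks = [make_block([1.0, 2.0, 0.5]), make_block([1.0, 1.0, 3.0])]
path = [[1, 1, 0], [1, 0, 0]]   # hypothetical binary path vectors
x = [1.0, 1.0, 1.0]

original = forward(blocks, x)
selected = forward_path(blocks, path, x)

# Sum of per-block differences (the inner-activation part of the objective).
gap = sum(l1_distance(a, b) for a, b in zip(original, selected))
# Fraction of units kept by the path (the quantity the sparsity constraint bounds).
kept = sum(sum(p) for p in path) / sum(len(p) for p in path)
```

In this toy setting the path keeps half of the units (`kept` is 0.5) at the price of an activation gap of 4.0; the objective trades these two quantities against each other.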

3.2 Finding Concept Feature Paths

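In this subsection, we provide our solution to find such paths by optimizing Eq. 3. Specifically, we first soften the binary values of $P_c$ to continuous values $\Lambda_c = \{\lambda_c^{1}, \dots, \lambda_c^{L}\}$. By using the Sigmoid function, we then constrain these values to $(0, 1)$. The L2 distance is used to measure the difference between the original outputs and the operated outputs. Therefore, the cost function constraining the difference is:

$$\mathcal{L}_d = \sum_{l=1}^{L} \left\| a_c^{l} - \hat{a}_c^{l} \right\|_2 + \left\| y_c - \hat{y}_c \right\|_2, \tag{4}$$

where the selected outputs are:

$$\hat{a}_c^{l} = \sigma(\lambda_c^{l}) \odot f_l(\hat{a}_c^{l-1}), \quad \hat{a}_c^{0} = x_c, \quad \hat{y}_c = g(\hat{a}_c^{L}), \tag{5}$$

and $\sigma$ is the Sigmoid function.

We regularize $\Lambda_c$ directly by its values after the Sigmoid function to enforce sparsity:

$$\mathcal{L}_s = \sum_{l=1}^{L} \left\| \sigma(\lambda_c^{l}) \right\|_1. \tag{6}$$

We do not fix the threshold $\tau$, but instead use a hyper-parameter $\gamma$ to balance the sparsity and accuracy automatically. Therefore, the final cost function is:

$$\mathcal{L} = \mathcal{L}_d + \gamma \, \mathcal{L}_s. \tag{7}$$

To optimize Eq. 7 and find the appropriate path for concept $c$, we adopt stochastic gradient descent to update the continuous values $\Lambda_c$. L2 normalization is applied to the gradient to accelerate convergence.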

Algorithm 1: Finding the Concept Feature Path
  Input: Pre-trained network, data $X_c$ related to concept $c$, continuous vectors $\Lambda_c$, learning rate $\eta$, and balancing parameter $\gamma$.
  Output: Feature path $P_c$ for concept $c$.
  Initialize $\Lambda_c$ randomly.
  repeat
     Sample $x_c$ from $X_c$.
     Calculate original activations of $x_c$ by Eq. 1.
     Calculate selected activations of $x_c$ by Eq. 5.
     Compute cost by Eq. 7 together with Eqs. 4 and 6.
     Update vectors in $\Lambda_c$ by Eq. 8.
  until Convergence
  Discretize $\Lambda_c$ to $P_c$ by Eq. 9.

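Therefore, we update every $\lambda_c^{l} \in \Lambda_c$ by:

$$\lambda_c^{l} \leftarrow \lambda_c^{l} - \eta \, \frac{\nabla_{\lambda_c^{l}} \mathcal{L}}{\left\| \nabla_{\lambda_c^{l}} \mathcal{L} \right\|_2}, \tag{8}$$

where $\eta$ is the learning rate and $\mathcal{L}$ is the final cost function. The parameters of the pre-trained network are fixed in the training phase. After optimization, we discretize the continuous set $\Lambda_c$ back to the binary set $P_c$ by:

$$p_{c,i}^{l} = \begin{cases} 1, & \sigma(\lambda_{c,i}^{l}) > 0.5, \\ 0, & \text{otherwise}, \end{cases} \tag{9}$$

where $\lambda_c^{l}$ represents the $l$-th vector in the set, and $\lambda_{c,i}^{l}$ denotes the $i$-th value of this vector. After obtaining feature paths for all the concepts, we have naturally disentangled the network. Alg. 1 summarizes the overall procedure.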
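The optimization loop can be sketched on a deliberately tiny problem. The example below is not the training code used in the experiments: it assumes a single block whose original activations are given directly, uses a squared L2 difference so that the gradient has a simple closed form, starts from zeros instead of random values so that the run is deterministic, and binarizes with a 0.5 threshold on the Sigmoid output. The function name `find_path` and the default values of `gamma`, `lr` and `steps` are illustrative assumptions.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def find_path(acts: Sequence[float], gamma: float = 0.5,
              lr: float = 0.1, steps: int = 500) -> List[int]:
    """Toy single-block path search with a soft mask s = sigmoid(lam).

    Cost (assumed form): sum_i (a_i - s_i * a_i)^2 + gamma * sum_i s_i
    """
    lam = [0.0] * len(acts)  # deterministic start for the demo
    for _ in range(steps):
        s = [sigmoid(v) for v in lam]
        # Chain rule: dCost/ds_i = -2 a_i^2 (1 - s_i) + gamma, ds_i/dlam_i = s_i (1 - s_i)
        grad = [(-2.0 * a * a * (1.0 - si) + gamma) * si * (1.0 - si)
                for a, si in zip(acts, s)]
        # L2-normalize the gradient before the step; fall back to 1.0 if it is zero.
        norm = math.sqrt(sum(g * g for g in grad)) or 1.0
        lam = [v - lr * g / norm for v, g in zip(lam, grad)]
    # Discretize the soft mask back to a binary path vector.
    return [1 if sigmoid(v) > 0.5 else 0 for v in lam]


# Two strongly activated units and two weak ones (hypothetical values).
mask = find_path([3.0, 2.0, 0.1, 0.05])
```

With these inputs the strongly activated units are kept and the weak ones are dropped, which is the behaviour the sparsity term is meant to encourage; larger values of `gamma` prune more aggressively.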

4 Interpretability Evaluation for DNNs

Disentangling the network into concept feature paths enables us to measure the similarity between different semantic concepts. This similarity further provides a natural metric for evaluating the interpretability of DNNs. Given two concepts $c_i$ and $c_j$ and their feature paths $P_{c_i}$ and $P_{c_j}$, we calculate the average Jaccard similarity coefficient over the elements of $P_{c_i}$ and $P_{c_j}$ to measure the similarity between concepts $c_i$ and $c_j$. This coefficient is defined as the size of the intersection divided by the size of the union of the sets. For example, the coefficient between sets $A$ and $B$ is defined as:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \tag{10}$$

where $|\cdot|$ denotes the cardinality of a set. Then, we can define the similarity between the concepts $c_i$ and $c_j$ with paths $P_{c_i}$ and $P_{c_j}$ as:

$$S(c_i, c_j) = \frac{1}{L} \sum_{l=1}^{L} J\!\left(I_{c_i}^{l}, I_{c_j}^{l}\right), \tag{11}$$

where $I_{c_i}^{l}$ and $I_{c_j}^{l}$ denote the indices of elements that are equal to $1$ in $p_{c_i}^{l}$ and $p_{c_j}^{l}$. Then, we introduce the following hypothesis to define the metric for evaluating the interpretability of different DNNs.
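The path similarity described above can be sketched in a few lines. The concept names and binary vectors below are made up for illustration, and returning 1.0 when both index sets are empty is an assumption of this sketch, since the text does not say how that case is handled.

```python
from typing import List, Sequence, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    # Assumption: two empty sets are treated as identical.
    return len(a & b) / len(union) if union else 1.0


def selected_indices(vec: Sequence[int]) -> Set[int]:
    """Indices of the entries equal to 1 in a binary path vector."""
    return {i for i, v in enumerate(vec) if v == 1}


def path_similarity(p_i: List[Sequence[int]], p_j: List[Sequence[int]]) -> float:
    """Average Jaccard coefficient over corresponding blocks of two paths."""
    scores = [jaccard(selected_indices(u), selected_indices(v))
              for u, v in zip(p_i, p_j)]
    return sum(scores) / len(scores)


# Hypothetical two-block paths with four units per block.
hen = [[1, 1, 0, 0], [1, 0, 1, 0]]
cock = [[1, 1, 0, 0], [1, 0, 0, 0]]
airplane = [[0, 0, 1, 1], [0, 1, 0, 0]]

sim_hen_cock = path_similarity(hen, cock)
sim_hen_airplane = path_similarity(hen, airplane)
```

Here the two bird paths share most of their units and score 0.75, while the bird and airplane paths share none and score 0.0.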

Interpretability Hypothesis: If one network can better formulate the semantic diversity among different concepts, this network is more interpretable.

According to this hypothesis, we define the metric as the variance of the similarities between concepts, where a larger variance indicates higher interpretability. This metric effectively quantifies the interpretability in terms of semantic diversity in networks. For example, intuitively, the similarity between the concepts ‘hen’ and ‘cock’ should be high, while the similarity between the concepts ‘hen’ and ‘airplane’ should be low. Such semantic diversity can be well depicted by the variance of the similarity. Therefore, the metric for evaluating the interpretability of a given network is defined as:


where and are the path classification accuracies of concepts and , and is the mean value of the similarities with concepts.
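To make the variance idea concrete, the sketch below compares two hypothetical networks by the spread of their pairwise concept similarities. It is a simplification, not the metric itself: it takes the plain population variance over distinct concept pairs and leaves out the path classification accuracies that the full metric also involves. The similarity values are invented.

```python
import statistics
from typing import Sequence


def similarity_variance(pair_similarities: Sequence[float]) -> float:
    """Population variance of pairwise path similarities (unweighted sketch)."""
    return statistics.pvariance(pair_similarities)


# Hypothetical similarities for the pairs (hen, cock), (hen, airplane), (cock, airplane).
diverse_net = [0.9, 0.1, 0.1]   # related concepts share paths, unrelated ones do not
flat_net = [0.4, 0.4, 0.4]      # every pair looks equally similar

score_diverse = similarity_variance(diverse_net)
score_flat = similarity_variance(flat_net)
```

Under the interpretability hypothesis, the first network would be rated as more interpretable because its similarities separate related from unrelated concepts, whereas the second assigns the same similarity to every pair.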

5 Experiments

In this section, we first introduce the experimental settings. Then, we check the validity of the optimized concept feature paths, and study the character of neural architecture disentanglement with the learned concept feature paths. After that, we show examples of the concept information flows according to the concept feature paths in DNNs. Finally, we study the interpretability versus discriminability of DNNs from the viewpoint of neural architecture disentanglement, and we investigate the relationship between neural concepts and human intuitions across different architectures.

5.1 Experimental Settings

Datasets. We conduct experiments on both an object-based dataset, i.e., ImageNet (Deng et al., 2009), and a scene-based dataset, i.e., Place365 (Zhou et al., 2017). ImageNet is an image dataset organized according to the WordNet hierarchy (Miller, 1998), in which each node of the hierarchy is depicted by hundreds of images. Place365 contains images comprising various unique scene categories, whose labels represent the entry-level of an environment. The hierarchical semantic categories of the above datasets make them very suitable for quantifying the interpretability of DNNs. We disentangle the networks into the concept feature paths using the training set, and use the validation set to study the properties of neural architecture disentanglement and the network interpretability.

Network Architectures. We select four representative architectures, i.e., VGG16 (Simonyan & Zisserman, 2014), ResNet50 (He et al., 2016), DenseNet121 (Huang et al., 2017) and DARTS-Net (Liu et al., 2018) for our experiments. The connection types in these architectures include the direct connection of VGG, the skip connection of ResNet, the dense skip connection of DenseNet, and the automatically learned connection of DARTS-Net. The operations include the residual operation and the concat operation. Intuitively, one would expect the interpretability of these networks to get worse as their connection complexity increases. However, we find some counter-intuitive results that do not support this argument in Sec. 5.4.

Figure 3: Concept classification results of the original architectures and the disentangled architectures. The horizontal axis represents the values of Top@1 classification precision (%), which are discretized into the bins of . The vertical axis represents the frequency of the concept classification precision located in each bin. The concept precision distributions of disentangled networks are squashed into the range (90,100].

Experimental Details. We use the models pre-trained on ImageNet and Place365 with PyTorch (Paszke et al., 2019), and fix their parameters when training the concept feature paths. The architecture of DARTS-Net is searched on ImageNet. The paths are initialized with random values from the standard Gaussian distribution. We set the learning rate for VGG, ResNet, DenseNet and DARTS-Net on ImageNet, and on Place365 by experience. Adjusting the hyper-parameter to balance the accuracy and sparsity is trivial, and we manually adjust it to on ImageNet, and on Place365, which effectively balance the accuracy and sparsity. The reported values are averaged results of paths with different initializations.

Figure 4: Unit usage rate of different architectures on ImageNet and Place365. The horizontal axis represents the block numbers alongside the networks. The vertical axis represents the unit usage rate of the corresponding block. The overall usage rate of units gradually decreases in the order of VGG16, DARTS-Net, DenseNet121 and ResNet50.
Figure 5: Visualizations of the original activations and the disentangled paths of VGG16. The concepts ‘Frog’ and ‘Panda’ are from ImageNet (the second and third rows), and the concepts ‘Castle’ and ‘Bathroom’ are from Place365 (the fifth and sixth rows). The input images are the same as (a) in Fig. 1. The activations of blocks are resized and binarized by selecting the top 5% of activated pixels. Using the disentangled paths, the corresponding concepts can be found gradually. Best viewed in color (zoom in for details).
Figure 6: Interpretability versus discriminability of architectures on ImageNet and Place365. The horizontal axis represents the interpretability computed by the proposed metric. The vertical axis represents the classification accuracy of the original architectures.

5.2 Validity of Concept Feature Paths in DNNs

We verify the validity of concept feature paths in neural architectures by checking the concept classification accuracy. Specifically, we gather the images from the validation set with the grouped labels, and calculate the concept precision for the same class by running through the corresponding concept feature paths of the disentangled networks. We discretize the Top@1 concept recognition precision (%) into the bins and count the number of classes whose precision is located in the above bins. From the results in Fig. 3, we find that the concepts are classified more precisely via the concept feature paths compared with the original network architectures, which suggests that the semantic concepts are well linked to the concept feature paths. For example, the concept precision distribution of the original VGG16 on Place365 is mainly found in range . Through the concept feature paths, the concept precision distribution of the disentangled VGG16 is squashed into the range .

5.3 Concept Information Flows in DNNs

We study the information flows of concepts in DNNs, and investigate the properties of neural architecture disentanglement by a horizontal comparison in the same architecture and a vertical comparison across different architectures. We first show the unit usage rate of different architectures. The unit usage rate is defined as the average number of selected units for different concepts divided by the number of their corresponding channels. Fig. 4 shows the results of the unit usage rate. From the horizontal comparison in the same architecture, we see that the unit usage rate generally decreases from the bottom blocks to the top blocks, which is consistent with the common statement that the bottom layers of DNNs share low-level patterns and the top layers of DNNs form the high-level semantic information. However, by the vertical comparison across different architectures, we find the overall unit usage rate roughly decreases in the order of VGG, DARTS-Net, DenseNet, and ResNet, which suggests the semantic information is gradually compressed into fewer units. One possible explanation is that the residual operation entangles information more severely than the concat operation through the skip connection over layers, thus learning features that highly entangle semantic information for classification. We discuss this in detail in Sec. 5.4.
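The unit usage rate defined above can be computed per block as follows. This is a minimal sketch: the concept names and path vectors are invented, and the function name `unit_usage_rate` is an illustrative choice.

```python
from typing import Dict, List, Sequence


def unit_usage_rate(paths: Dict[str, List[Sequence[int]]]) -> List[float]:
    """Per-block usage rate: mean number of selected units across concepts,
    divided by the number of units (channels) in that block."""
    concept_paths = list(paths.values())
    n_blocks = len(concept_paths[0])
    rates = []
    for block in range(n_blocks):
        mean_selected = sum(sum(p[block]) for p in concept_paths) / len(concept_paths)
        n_units = len(concept_paths[0][block])
        rates.append(mean_selected / n_units)
    return rates


# Hypothetical two-block paths with four units per block.
paths = {
    "frog": [[1, 1, 1, 0], [1, 0, 0, 0]],
    "panda": [[1, 1, 0, 0], [0, 1, 0, 0]],
}
rates = unit_usage_rate(paths)
```

For these made-up paths the rate drops from 0.625 in the first block to 0.25 in the second, mirroring the bottom-to-top decrease discussed in the text.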

We further use VGG16 as an example to visualize the concept information flows by the activations. Specifically, we combine different animals or scenes into one input image, and visualize the activations from the bottom layers to the top layers. Fig. 1 shows examples of inputs and their activations from the last block of the disentangled paths. We show the activations from the bottom layers to the top layers of the original VGG16 and the disentangled network in Fig. 5, from which we clearly see the information flow of semantic concepts.

Generally, VGG16 first detects edges and colors, and then the small pieces are grouped into concrete patterns. The concrete patterns related to the specified concept are selected last. For example, the feature path of ‘panda’ first detects the round and straight edges and the red, white and green colors in blocks 1 to 4; blocks 5 to 9 then gradually group the small pieces into concrete patterns such as ‘eyes of panda’ and ‘head of cock’; and the final blocks 10 to 13 select the patterns that are most related to the concept ‘panda’. After this, the patterns are sent to the classifier to finish the decision-making process.

5.4 Interpretability versus Discriminability of DNNs

(a) Results on ImageNet. The primary concepts are 1:Bird, 2:Lizard, 3:Snake, 4:Dog, 5:Wolf&Fox, and 6:Cat.
(b) Results on Place365. The primary concepts are 1:Garden, 2:Building, 3:Home, 4:Water-related, 5:Car-related, and 6:Hotel.
Figure 7: Concept cluster heatmaps with selected concepts on (a) ImageNet and (b) Place365. The figures (from left to right) are the results with VGG16, DARTS-Net, DenseNet121 and ResNet50, respectively. The tree structure at the left and top of each figure denotes the clustering results, and the same primary concepts are denoted by the same colored rectangle leaves. The labels are denoted as ‘number-name’, e.g., ‘6-tiger cat’ means the ‘tiger cat’ is a secondary concept of primary concept ‘6:Cat’. If a network is interpretable, the secondary concepts will be clustered with their primary concepts. The heatmap of each figure denotes the path similarity between secondary concepts, the values are normalized into for better visualization. The interpretability of architectures decreases from left to right, respectively. Best viewed in color (zoom in for details).

We study the interpretability versus discriminability of DNNs from the viewpoint of neural architecture disentanglement in this sub-section. Specifically, we compute the interpretability by Eq. 10, Eq. 11 and Eq. 12 after disentanglement, and compute the classification accuracy with the original architectures of VGG16, ResNet50, DenseNet121 and DARTS-Net on ImageNet and Place365. From the quantitative results in Fig. 6, we find VGG16 is the most interpretable architecture among the compared architectures, which is intuitive as VGG16 is the simplest architecture.

The interesting results come from the interpretability of ResNet50, DenseNet121 and DARTS-Net. Intuitively, the more complex connections between layers should make the architecture less interpretable, let alone the automatically-learned connections. However, we find ResNet50 has the worst interpretability among those architectures, and the automatically-learned DARTS-Net has comparable, or even better, results than DenseNet121.

By checking the architectures, we find the outputs of blocks are produced by the residual operation in ResNet, and concat operations in DenseNet121 and DARTS-Net. Therefore, we believe that the residual operation entangles semantic information more severely than the concat operation through the skip connections over layers. Specifically, the concat operation learns features that better depict the similarity and diversity with different concepts, while the residual operation somehow entangles such information. On one hand, we think that the model capacity is greatly enlarged by the residual operation, which might make the model more likely to overfit the training data. On the other hand, the residual operation adds the outputs of layers while the concat operation concatenates them, which could intensify information entanglement and impede the information flow in the network. Based on the above analysis, we conclude that the residual operation hinders the disentanglement of neural architectures, thus affecting their interpretability.
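The structural difference between the two operations can be seen in a small sketch. It only illustrates why a unit-level mask can separate sources after concatenation but not after summation; it does not by itself establish the interpretability claim. The vectors and helper names are hypothetical.

```python
from typing import List, Sequence


def residual_merge(skip: Sequence[float], branch: Sequence[float]) -> List[float]:
    """Residual operation: element-wise sum, so every output unit mixes both sources."""
    return [s + b for s, b in zip(skip, branch)]


def concat_merge(skip: Sequence[float], branch: Sequence[float]) -> List[float]:
    """Concat operation: outputs are stacked, so every unit keeps a single source."""
    return list(skip) + list(branch)


def apply_mask(mask: Sequence[int], units: Sequence[float]) -> List[float]:
    return [m * u for m, u in zip(mask, units)]


skip, branch = [1.0, 0.0], [0.5, 2.0]

summed = residual_merge(skip, branch)    # two units, each a sum of two sources
stacked = concat_merge(skip, branch)     # four units, one source each

# A path over the concatenated output can keep skip unit 0 and branch unit 1 only.
kept_stacked = apply_mask([1, 0, 0, 1], stacked)
# A path over the summed output can only keep or drop already-mixed units.
kept_summed = apply_mask([1, 0], summed)
```

After concatenation a binary path can pick units from either source independently, whereas after summation each retained unit already carries contributions from both the skip connection and the branch.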

We further verify such findings by comparing the relationship between the concepts learned by neural networks and human perception. Concepts in human perception are often organized according to similar patterns caught by our visual systems, which is significant for us to understand the working mechanism of brains in cognitive science. For example, we may cluster the ‘white wolf’ with ‘white fox’ as the same species by their similar patterns such as white fur and pointed ears.

To study whether neural networks have this property, we define the neural concepts of DNNs by the learned concept feature paths. By Eq. 11, we can compute the similarity between any two neural concepts. Therefore, we select classes grouped by six primary human concepts manually, and visualize the cluster heatmaps of the path similarity for ImageNet and Place365, respectively, in Fig. 7. From the cluster results, we find that the clustering of similar concepts is gradually destroyed from VGG16, DARTS-Net, DenseNet121, to ResNet50. Additionally, from the heatmaps, we find that the well summarized relationship indicating the similarity of neural concepts (e.g., the concept ‘white wolf’ is similar to ‘white fox’ but is dissimilar to ‘grass snake’ on ImageNet, and the concept ‘hotel outdoor’ is similar to ‘inn outdoor’ but dissimilar to ‘waterfall’ on Place365) is gradually broken from VGG16, DARTS-Net, DenseNet121, to ResNet50, and the relationship between neural concepts becomes out-of-order with consistent similarity. These results support our findings on the architecture interpretability.

6 Conclusion

In this paper, we present a novel concept, termed Neural Architecture Disentanglement (NAD), for better understanding of DNNs’ interpretability. Differing from the current line of research, which links semantic concepts to a DNN’s single unit or single layer, we aim to capture the hierarchical inference procedure throughout the network. Specifically, a pre-trained network is disentangled according to specific concepts, forming the concept feature paths which capture the concept flows from the bottom to top layers of a DNN. This procedure also enables us to evaluate the interpretability of neural architectures. In experiments, we study the interpretability of DNNs on object-based and scene-based datasets (i.e., ImageNet and Place365) with four types of representative architectures ranging from handcrafted (i.e., VGG, ResNet and DenseNet) to autoML-based (i.e., DARTS-Net). Experimental results suggest that connection complexity is not crucial for neural network interpretability, while the basic operations, such as the residual operation and the concat operation, should be considered instead. The architecture disentanglement not only sheds important light on the information flow of semantic concepts in DNNs, but also provides a fundamental metric that facilitates the design of explainable network architectures.


  • Bau et al. (2017) Bau, D., Zhou, B., Khosla, A., Oliva, A., and Torralba, A. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • Bau et al. (2019) Bau, D., Zhu, J.-Y., Strobelt, H., Zhou, B., Tenenbaum, J. B., Freeman, W. T., and Torralba, A. Gan dissection: Visualizing and understanding generative adversarial networks. In International Conference on Learning Representations, 2019.
  • Bengio et al. (2016) Bengio, E., Bacon, P.-L., Pineau, J., and Precup, D. Conditional computation in neural networks for faster models. In International Conference on Learning Representations, 2016.
  • Bengio et al. (2013) Bengio, Y., Léonard, N., and Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
  • Bolukbasi et al. (2017) Bolukbasi, T., Wang, J., Dekel, O., and Saligrama, V. Adaptive neural networks for efficient inference. In International Conference on Machine Learning, 2017.
  • Carvalho et al. (2019) Carvalho, D. V., Pereira, E. M., and Cardoso, J. S. Machine learning interpretability: A survey on methods and metrics. Electronics, 2019.
  • Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.
  • Erhan et al. (2009) Erhan, D., Bengio, Y., Courville, A., and Vincent, P. Visualizing higher-layer features of a deep network. University of Montreal, 2009.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • Higgins et al. (2017) Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.
  • Hu et al. (2019) Hu, J., Ji, R., Zhang, S., Sun, X., Ye, Q., Lin, C.-W., and Tian, Q. Information competing process for learning diversified representations. In Advances in Neural Information Processing Systems, 2019.
  • Huang et al. (2017) Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • Kim et al. (2018) Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., and Sayres, R. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). In International Conference on Machine Learning, 2018.
  • Li et al. (2019) Li, Y., Ji, R., Lin, S., Zhang, B., Yan, C., Wu, Y., Huang, F., and Shao, L. Dynamic neural network decoupling. arXiv preprint arXiv:1906.01166, 2019.
  • Liu et al. (2018) Liu, H., Simonyan, K., and Yang, Y. Darts: Differentiable architecture search. In International Conference on Learning Representations, 2018.
  • Miller (1998) Miller, G. A. WordNet: An electronic lexical database. MIT press, 1998.
  • Morcos et al. (2018) Morcos, A. S., Barrett, D. G., Rabinowitz, N. C., and Botvinick, M. On the importance of single directions for generalization. In International Conference on Learning Representations, 2018.
  • Nguyen et al. (2019) Nguyen, A., Yosinski, J., and Clune, J. Understanding neural networks via feature visualization: A survey. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. 2019.
  • Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, 2019.
  • Rudin (2019) Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 2019.
  • Simonyan & Zisserman (2014) Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Tishby et al. (2000) Tishby, N., Pereira, F. C., and Bialek, W. The information bottleneck method. arXiv preprint physics/0004057, 2000.
  • Wang et al. (2018) Wang, Y., Su, H., Zhang, B., and Hu, X. Interpret neural networks by identifying critical data routing paths. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • Zhang et al. (2018) Zhang, Q., Nian Wu, Y., and Zhu, S.-C. Interpretable convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • Zhou et al. (2017) Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., and Torralba, A. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  • Zhou et al. (2018) Zhou, B., Sun, Y., Bau, D., and Torralba, A. Revisiting the importance of individual units in cnns via ablation. arXiv preprint arXiv:1806.02891, 2018.