Attending Category Disentangled Global Context for Image Classification

by Keke Tang, et al.

In this paper, we propose a general framework for image classification using the attention mechanism and global context, which can be incorporated into various network architectures to improve their performance. To investigate the capability of the global context, we compare four mathematical models and observe that the global context encoded in the category disentangled conditional generative model retains the richest complementary information to that in the baseline classification networks. Based on this observation, we define a novel Category Disentangled Global Context (CDGC) and devise a deep network to obtain it. By attending to CDGC, the baseline networks can identify the objects of interest more accurately, thus improving the performance. We apply the framework to many different network architectures to demonstrate its effectiveness and versatility. Extensive results on four publicly available datasets validate that our approach generalizes well and is superior to the state-of-the-art. In addition, the framework can be combined with various self-attention based methods to further improve the performance. Code and pretrained models will be made public upon paper acceptance.




1 Introduction

Image classification [13], aiming at categorizing each image into one of several predefined categories, is a fundamental problem in computer vision. Thousands of applications, such as object detection [6, 11, 4, 37, 53, 51, 7], scene classification [27, 35, 28, 64], and segmentation [18, 55, 39, 25, 56], could benefit from it. In the last few decades, researchers focused on hand-crafting descriptors (e.g., SIFT [31], ORB [38]) to represent the images, and then discriminating them with a classifier (e.g., KNN [61] and SVM [45]). However, due to high intra-class variation, low inter-class difference, etc., image classification remains an extremely challenging problem. More recently, with the emergence of deep neural networks (DNNs), researchers have attempted to solve the problem with DNNs, significantly improving the state-of-the-art.


To improve image classification, researchers compete in designing different network structures, from shallow to deep [22, 42, 14] and from narrow to wide [59]. However, with the stronger representation power of deeper networks, the computational complexity increases significantly. In addition, aggressively deep networks with too many parameters tend to cause over-fitting [44] or the vanishing gradient problem [19], hindering further improvement.

Other researchers instead attempt to explore the problem from a different view, seeking to design small network units inspired by cognitive science or bionics. With minimal additional computational cost, these units often bring significant improvement. A good representative is the attention mechanism [16, 48], which has attracted much “attention” from the computer vision community in recent years. The attention mechanism could enable the network to focus on the object of interest while suppressing the background. However, as indicated by Li et al. [26], most of the existing neural network-based attention methods only utilize features at image-level, which may miss some important information, such as global cues across the whole dataset, and thus may be misled when attending. Nevertheless, Zhou et al. [62] have shown that deep networks inherently have the ability to locate objects of interest or the discriminative regions in a single forward-pass without requiring any supervision. Thus, Zagoruyko and Komodakis [58] attempt to improve the performance of a student network by mimicking the attention maps of a powerful teacher network, based on the observation that stronger networks always have peaks in attention while weak networks don’t. The information encoded in the teacher network could be regarded as a kind of global context. However, as both networks are trained for classification, their encoded information has a substantial overlap, and there remains little room for improvement.

To resolve the above issues, we define a novel Category Disentangled Global Context (CDGC). The context captures the underlying property of the whole dataset instead of a single image, and is obtained via a conditional auto-encoder trained in advance. In addition, we specifically enforce the global context to be free of any category-related information via the technique of disentangling. As verified by our experiments, this context could retain richer complementary information to that in the baseline classification networks. Therefore, the baseline networks could adjust the identified locations of objects of interest adaptively by employing the complementary information within CDGC in the manner of attention, thus improving the performance. The key contributions of this paper are three-fold:

  • We summarize four mathematical models that could capture the underlying property of a data population and compare their benefits for global context modeling. To the best of our knowledge, this is the first work that compares them in the aspect of information complementarity.

  • We define a novel CDGC which encodes all the global information except that which is category related, and develop an adversarial conditional auto-encoder network to obtain it.

  • We propose a general framework that employs CDGC to bootstrap deep networks for image classification, including channel-aware amplification and activation hyperplane re-biasing.

We demonstrate the effectiveness and versatility of our approach with exhaustive experiments on four publicly available datasets with different image resolutions and data scales: CIFAR-10/100 [21], ImageNet32×32 [9] and ImageNet-150K [29], and apply it to many different deep network architectures, including both residual and non-residual networks. Experimental results validate the usefulness of our approach, and show it is comparable and sometimes even superior to the state-of-the-art methods [58, 17, 52]. In addition, by combining our approach with theirs, we could obtain some further improvement.

2 Related Work

2.1 Image Classification with CNN

With the popularity of deep networks, there is a growing body of literature on image classification. Since AlexNet [22] made a breakthrough on ImageNet [10], many different networks have been proposed. VGG [42] and Inception models [46] demonstrated the benefits of increasing depth and multi-scale receptive fields. ResNet [14, 15] enabled learning deeper networks through the use of a simple identity skip-connection, and Wide ResNet [59] validated the effectiveness of enlarging the width of deep networks. Most of the above methods focus on improving classification performance by increasing the complexity of deep networks or designing novel architectures.

2.2 Visual Attention

Visual attention [2] is a basic concept in the field of psychology and neuroscience. Inspired by this mechanism, there have been many attempts to apply attention to the tasks of computer vision and related fields, such as image/video captioning [57, 50], visual question answering [43], and so on. By enabling the networks to focus on the object of interest while suppressing the background, these methods often obtain substantial improvement. According to the schemes used to obtain attention, current attention methods could be broadly classified into two categories.

Post-hoc attention methods analyze the attention mechanism mostly to reason about the task of visual classification. Simonyan et al. [41] provided two techniques for visualizing the attention maps by computing a Jacobian of the network output. Cao et al. [3] introduced attention in the form of binary nodes between the network layers of [41]. Zhang et al. [60] introduced the contrastive marginal winning probability to model the top-down attention for neural classification models, which could highlight discriminative regions. Similarly, Zhou et al. applied the classifier weights learned for image classification and the resulting class scores to estimate the category activation maps. Based on this, Selvaraju et al. incorporated guided backpropagation to obtain a novel Grad-CAM. Zagoruyko and Komodakis [58] later defined the gradient-based and activation-based attention maps, and showed that by mimicking the attention maps of a more powerful teacher network, the student network could be improved.

Trainable attention methods instead incorporate the modules of extracting attention and task learning into an end-to-end architecture and are mostly applied to query-based tasks, such as image captioning [57, 54, 5] and machine translation [1]. By projecting the query instance into the same high-dimensional space as the target, relevant spatial regions of the query will be highlighted to guide the desired inference. Instead of attending between two different domains, Jetley et al. [20] re-purposed the global image representation, which refers to features extracted by the higher layers of the CNN, as the target to estimate multi-scale attentions with local image representations extracted by lower layers, and significantly improved the performance.

Figure 1: The architecture of the CDGC network: an (image, category) pair is given as input, and the network outputs a reconstructed image. The network comprises encoder layers, decoder layers and a discriminator for category disentangling, and CDGC is taken as an intermediate output. Note that only the networks within the purple box are run when extracting CDGC.

In contrast to the above query-based methods, self-attention based approaches attempt to learn the attention maps from the input itself. Hu et al. [17] proposed a “Squeeze-and-Excitation” module to exploit the inter-channel relationships, which could be regarded as an attention mechanism applied along the channel axis. Wang et al. [49] proposed an encoder-decoder style attention module to extract 3D attention maps that could refine task-specific feature maps and implemented it in a residual style to enable scalability. Woo et al. [52] proposed a Convolutional Block Attention Module which could extract informative features by blending cross-channel and spatial information together to emphasize meaningful features along both channel and spatial axes.

Our approach consists of both post-hoc and trainable attention modules. The extraction of CDGC could be considered a post-hoc attention method. Among the related work, perhaps the most similar to ours is Zagoruyko and Komodakis [58], as they also attempt to adopt the “global context” of a teacher network to guide the student network. However, the difference is two-fold. First, they apply the global context as a hard constraint, enforcing the two attention maps to be exactly the same, such that some valuable information of the student network is forced to be discarded, while we apply it as a soft guidance and thus could take advantage of both networks. Second, unlike their approach that obtains the global context via a deep network which is a discriminative model, our global context is encoded in a category disentangled conditional generative model, and thus could retain richer complementary information, leaving more room for improvement. Although some trainable attention approaches [32, 20, 26] claim to attend to the global feature, the information is either still within a single image or, although drawn from the whole dataset, encoded by a discriminative model. Attending to CDGC, in turn, is a trainable attention method. Compared with their work, we adopt an amplification then re-bias scheme to make full use of attention.

3 Category Disentangled Global Context

In this section, we first summarize and compare four mathematical models, including one discriminative model and three generative models, that could capture the underlying property of a data population of interest, and then discuss our intuition of using Category Disentangled Global Context (CDGC), and finally the network to obtain it.

3.1 Global Context Modeling

In a multi-category problem with $K$ classes, we denote $x$ as the data tensor/vector and $y$ as its label. Traditionally, there are four mathematical models that could describe the underlying structure.

Discriminative model (DM): given input data $x$, the discriminative model attempts to compute $p(y \mid x)$, with $y \in \{1, \dots, K\}$. This formulation is usually realized by a classification network.

Generative model (GM): given input data $x$, the generative model attempts to model $p(x)$ directly without explicitly considering $y$. Similarly, in deep networks, researchers usually model it as an auto-encoder.

Conditional generative model (CGM): given input data $x$ and its corresponding label $y$, the conditional generative model attempts to compute $p(x \mid y)$. Generally, researchers adopt the architecture of a conditional auto-encoder trained in a semi-supervised manner to model it [8].

Category disentangled conditional generative model (CDCGM): given input data $x$ and its corresponding label $y$, CDCGM attempts to model $x$ conditioned on $y$ while keeping the model itself free of any information about $y$, namely the model shouldn’t be related to $y$ at all. Readers need to distinguish between CDCGM and GM: although GM does not explicitly consider $y$, the information of $y$ is still included, while CDCGM explicitly enforces that the model contains no information of $y$. Nowadays, with the emergence of generative adversarial nets (GAN) [12], researchers attempt to combine CGM with adversarial learning to solve this problem, achieving better disentangling performance [33, 23, 30].

Discussion: Although all the above models could capture the underlying property, their abilities vary a lot. Discriminative models focus on classification boundaries, whereas generative models emphasize the data generation process, and thus generative models always carry richer information [47]. In addition, suppose we adopt a deep network, whose representation power is of course limited, to model the underlying property. As CDCGM does not encode any information of the category label, it is able to carry the richest information of the data among these models. Therefore, we choose the “Category Disentangled Global Context” encoded in the model of CDCGM for its largest complementarity to the original information encoded in the baseline classification networks, so that the baseline networks could learn more from it (see Section 5 for the validation).

Figure 2: Left: the demonstration of applying the CDGC based framework to the feature map outputted by a residual group of a residual network; right: the detailed structure for obtaining the attention map and the channel relevance, consisting of a network that extracts the attention map and two embedding layers. The networks/layers within the purple box could be reused for extending the framework to a feature map that has a larger resolution, as in the translucent dotted box.

3.2 CDGC Network

In this part, we describe how we obtain CDGC via a deep network, where CDGC is an intermediate 3D tensor derived from an input image. Figure 1 demonstrates the structure of our CDGC network, which includes three main components: a conditional auto-encoder, a category dispelling branch and a repulsive loss. In the following, we describe these components one by one.

Conditional Auto-encoder.   The general architecture of our CDGC network is the popular conditional auto-encoder, where some encoder layers (see Figure 1) first encode an input image into a compact latent feature, and then some decoder layers (see Figure 1) decode the latent feature based on the conditional vector into another image that should be similar to the input.

Here, the conditional vector is obtained by representing the category as a one-hot vector and then expanding it to the same spatial resolution as the feature map it is combined with.
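The one-hot expansion described above can be sketched in a few lines. This is only an illustration in plain Python with nested lists (the paper's experiments use PyTorch); the function names `expand_condition` and `concat_channels` and the tensor sizes in the usage note are hypothetical and not taken from the paper.

```python
def expand_condition(label, num_classes, height, width):
    """Build a num_classes x height x width tensor (nested lists).

    Channel k is a constant plane of ones if k == label and of zeros
    otherwise, i.e. a one-hot vector broadcast over the spatial grid.
    """
    if not 0 <= label < num_classes:
        raise ValueError("label must lie in [0, num_classes)")
    one_hot = [1.0 if k == label else 0.0 for k in range(num_classes)]
    return [[[v] * width for _ in range(height)] for v in one_hot]


def concat_channels(feature, condition):
    """Append the condition planes to a C x H x W feature map as extra channels."""
    return feature + condition
```

For example, `expand_condition(2, 10, 8, 8)` yields ten 8×8 planes of which only the third is non-zero, and concatenating it with a 16-channel 8×8 feature map gives 26 channels.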

Compared with the traditional one, we make two modifications: (1) we delay the introduction of the conditional vector, and thus the spatial resolution of CDGC could be larger; (2) we add a skip connection, such that training could be more stable even with adversarial learning. Although they seem simple, these two modifications are critical for extracting a feasible Category Disentangled Global Context.

Category Dispelling Branch.   However, simply disentangling the category in a semi-supervised manner is still not enough. Thus, we follow Lample et al. [23] and add an extra category dispelling branch. We iteratively train the discriminator to classify the category based on CDGC, and then update the encoder/decoder to output a new CDGC that fools the discriminator. In this way, the final CDGC becomes more category-invariant.

Repulsive Loss.   In our experiments, we find it is quite difficult to train such an adversarial conditional auto-encoder, and it always requires a large number of iterations. To resolve this, we further add a repulsive loss to enforce the discrepancy between images generated from the same latent feature but with different conditional vectors to be large enough.

Here, the margin (e.g., 0.01 in our experiments) guarantees reasonable changes, and the weight of the repulsive loss is 0.001.
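The exact form of the repulsive loss is not recoverable from the text here, so the following is only a minimal sketch under an assumption: a hinge penalty that is positive whenever two images decoded with different conditional vectors differ by less than the margin, with the discrepancy measured as a mean absolute difference. Both the hinge form and the distance measure are assumptions, and the function names are hypothetical.

```python
def mean_abs_diff(img_a, img_b):
    """Mean absolute difference between two images given as flat lists."""
    if len(img_a) != len(img_b):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(a - b) for a, b in zip(img_a, img_b)) / len(img_a)


def repulsive_loss(img_a, img_b, margin=0.01):
    """Assumed hinge form: penalize image pairs closer than `margin`.

    img_a and img_b stand for two reconstructions decoded with
    different conditional vectors.
    """
    return max(0.0, margin - mean_abs_diff(img_a, img_b))
```

With the margin of 0.01 quoted above, two identical reconstructions incur the full penalty of 0.01, while reconstructions that differ by more than the margin incur none. In training, this term would be scaled by the stated weight of 0.001.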

4 CDGC based Classification Framework

In this section, we describe a general framework that employs the CDGC proposed in Section 3 to bootstrap deep networks for image classification by using the attention mechanism. Note that we do not require the baseline classification networks to have a certain structure, and our framework could be applied to many different architectures, including non-residual networks (e.g., VGG [42]) and residual networks (e.g., ResNet [14], Wide ResNet [59]). For the framework applied to residual networks, please refer to Figure 2 for a demonstration. For VGG, we apply our framework to the feature map outputted by the convolutional layer that is just before a spatial maximum pooling layer and has the same resolution as CDGC.

To bootstrap the baseline networks, we first extract the (spatial) attention map from CDGC, and then make a channel-aware amplification of the feature map outputted by a layer of the baseline networks, together with a re-bias unit for better activation and an identity connection inspired by residual learning. In the following, we describe these components one by one, and then extend the framework so that it can be applied to multiple layers without bringing too much additional computational complexity.

Extracting Attention Map.   Unlike Wang et al. [49], who compute the attention map from the 3D tensor directly, we conduct two pooling operations to decrease the computational complexity, as in [52]. Specifically, given the CDGC extracted from an input image (see Section 3), we conduct an average pooling and a maximum pooling along the channel axis. Then we extract the attention map by forwarding the two pooled CDGCs to the attention network (see purple box in Figure 2).
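The two channel-wise pooling operations can be illustrated as follows. This is a plain-Python sketch on nested lists rather than the PyTorch implementation used in the experiments; the function name `channel_pool` is hypothetical, and the network that turns the two pooled maps into the attention map is omitted.

```python
def channel_pool(tensor):
    """Pool a C x H x W tensor (nested lists) along the channel axis.

    Returns (avg_map, max_map), each of size H x W.
    """
    channels = len(tensor)
    height, width = len(tensor[0]), len(tensor[0][0])
    avg_map = [
        [sum(tensor[c][i][j] for c in range(channels)) / channels
         for j in range(width)]
        for i in range(height)
    ]
    max_map = [
        [max(tensor[c][i][j] for c in range(channels))
         for j in range(width)]
        for i in range(height)
    ]
    return avg_map, max_map
```

Pooling reduces a C-channel tensor to two single-channel maps, so the cost of the subsequent attention network no longer depends on the channel count of CDGC.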

Channel-aware Amplification.   The motivation of this component is to amplify the activation of neurons that are relevant to classification while suppressing the irrelevant ones with the help of the attention map. Suppose a feature map (before ReLU [34]) is the intermediate output after forwarding some layers of a classification network and has the same spatial resolution as the attention map. As the attention map directly reveals the importance of each region, a straightforward way is to apply an element-wise multiplication to all the channels of the feature map equally. However, although the multiplication makes some relevant channels more informative for classification, other less relevant or even irrelevant channels would be misled. Thus a reasonable approach is to project the attention map into the same number of channels as the feature map by forwarding it to a deep network and then multiply them. However, the improvement is still limited without enough supervision (see Section 5). To resolve this issue, we employ the channel attention mechanism [43] to calculate the channel relevance between each channel of the feature map and the pooled CDGCs. This is achieved by first projecting the pooled CDGCs and the feature map into the same feature space via two embedding networks, then calculating the relevance between the embedded features via dot product after squeezing the spatial dimensions, and finally normalizing them with a sigmoid function (see purple box in Figure 2).

By now, channel-aware amplification could be realized by applying an element-wise multiplication of the feature map and the attention map, weighted with the channel relevance, resulting in an amplified feature map.
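A rough sketch of the channel relevance and the resulting amplification is given below. It assumes the two embedding networks have already been applied and the spatial dimensions squeezed, so each channel and the pooled context are represented by plain vectors; the function names and the reading of "weighted with the channel relevance" as a per-channel scalar multiplier are assumptions, not the paper's code.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def channel_relevance(feature_embed, context_embed):
    """One relevance score per channel: sigmoid of a dot product.

    feature_embed: list of C embedded channel vectors (assumed given).
    context_embed: embedded, squeezed pooled context of the same length.
    """
    return [
        sigmoid(sum(f * g for f, g in zip(vec, context_embed)))
        for vec in feature_embed
    ]


def amplify(feature, attention, relevance):
    """Scale feature[c][i][j] by attention[i][j] and relevance[c]."""
    return [
        [[feature[c][i][j] * attention[i][j] * relevance[c]
          for j in range(len(attention[0]))]
         for i in range(len(attention))]
        for c in range(len(feature))
    ]
```

Because the relevance is computed per channel, a channel that is weakly related to the context is scaled down everywhere, even in regions where the spatial attention is high.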

Figure 3: The demonstration of the re-bias unit in 2D: (a) the activation hyperplane (in purple) of ReLU could not distinguish relevant neurons (solid circles) from irrelevant neurons (hollow circles); (b) the amplification of some relevant neurons; (c) the activation hyperplane of ReLU after using the re-bias unit makes a better discrimination.

Re-bias Unit.   After amplification, the absolute values of task-relevant neurons will be relatively enlarged (see Figure 3(a&b)), making them more distinguishable. However, some amplified neurons (e.g., from -0.001 to -0.1) will still be left inactivated, since ReLU [34] only activates neurons whose values are larger than zero, which corresponds to a fixed “activation hyperplane”. To achieve better activation, we design a re-bias unit which adaptively shifts the activation hyperplane of ReLU, leading to more accurate discrimination of task-relevant neurons (see Figure 3(c)). Note that we acknowledge that ReLU is elegant and effective, and we do not claim that the re-bias unit is required for other architectures. In this paper, we add the re-bias unit to guarantee that our channel-aware amplification mechanism works. In addition, our re-bias unit is very similar to the pre-activation residual unit [15] (see Figure 4(a&b)), except that our output directly flows to ReLU and the following layers, while theirs is passed to other residual units iteratively (except the last one).
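The effect of shifting the activation threshold can be seen with a toy scalar example. In the paper the shift is produced adaptively by a small learned unit; here a fixed, hypothetical shift of 0.15 stands in for it purely for illustration.

```python
def relu(v):
    return max(0.0, v)


def rebias_relu(v, shift):
    """ReLU applied after adding a shift.

    `shift` is a hypothetical constant standing in for the output of
    the learned re-bias unit.
    """
    return relu(v + shift)


amplified_relevant = -0.1    # relevant neuron after amplification (from -0.001)
irrelevant = -0.5            # a neuron that should stay inactive

plain = relu(amplified_relevant)                   # stays at zero
shifted = rebias_relu(amplified_relevant, 0.15)    # now activated
still_off = rebias_relu(irrelevant, 0.15)          # remains inactive
```

Amplification alone moves the relevant neuron further from zero but leaves it on the inactive side of ReLU; only the shift lets it pass while the irrelevant neuron stays suppressed.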

Figure 4: (a) Pre-activation residual unit; (b) re-bias unit.

Identity Connection.   We follow the idea of Wang et al. [49] and add the original feature map before activation, on the consideration that no matter whether the network could learn to amplify the features and/or re-bias the activation hyperplane, its performance should not be worse than the original.

Multiple Layer Extension.   Our framework could be easily extended to the feature maps of multiple layers without a significant increase in complexity. Specifically, for a feature map outputted by another layer that has the same spatial resolution, the outputs of the purple box in Figure 2 could be directly reused, while for a feature map with a different resolution, we could down-sample or up-sample them to make them consistent with the feature map (see Figure 2).

5 Experiments

In this section, we exhaustively evaluate the performance of our CDGC based framework on several publicly available datasets with different image resolutions (e.g., 32×32 and 128×128) and data scales (e.g., 50k to 1 million) to validate its effectiveness. In addition, we apply the framework to many different network architectures, such as VGG [42] (with BatchNorm [19], only one fully connected layer, and Dropout [44] removed), ResNet [15] (with the pre-activation residual unit) and Wide ResNet [59] (we refer to Wide ResNet with depth $d$ and widening factor $k$ as WRN-$d$-$k$), to validate its versatility. The experiments are organized as follows: we first describe the implementation of our framework, and then present the comparison results with three state-of-the-art methods [58, 17, 52] on four different datasets, including CIFAR-10 and CIFAR-100 [21], ImageNet32×32 [9] and ImageNet-150K [29].

To analyze the effectiveness of each component in our framework, including the comparison between different global context modeling approaches, the importance of the repulsive loss, the usefulness of the re-bias unit, and so on, we conduct most ablation studies on the CIFAR datasets due to limited computational resources. Then we conduct more extensive comparisons with the state-of-the-art on the other two larger datasets. Note that the focus of this paper is to demonstrate that our framework could bootstrap standard CNN architectures for the task of image classification, instead of pushing the state-of-the-art results. Thus we intentionally choose simple networks, reproduce all the evaluated networks in the PyTorch [36] framework without any model pre-training, and report our reproduced results throughout the experiments to perform better apples-to-apples comparisons.

5.1 Implementation

CDGC Network.   We adapt the architecture of the CDGC network from [23]. Each convolution layer is specified by its kernel size, stride and padding, and a group consists of a convolution layer, BatchNorm (BN) and ReLU. For images with a resolution of 32×32, our encoder consists of six such groups, with the first four groups belonging to the first part of the encoder. The decoder is symmetric to the encoder, except that the convolutions are replaced with transposed convolutions and the conditional vector is appended as additional constant input channels to the decoder layers (see Figure 1). Therefore, the final resolution of CDGC is 8×8. For images with a resolution of 128×128, the second group is replaced, resulting in a 16×16 CDGC. Note that for the cases with more than ten categories, we additionally adopt a fully connected layer to embed the one-hot category vector into a ten-dimensional conditional vector.

CDGC based Framework.   Among the component networks of the framework, two share the structure BN-ReLU-Conv-BN-ReLU-Conv, one has the structure Conv-BN-ReLU-Conv-BN-Sigmoid, and one has the structure Conv-BN-ReLU-Conv, where Conv denotes a convolution layer. To fit the CDGC network, all input images in our experiments are normalized to [-1, 1], and we adopt the two-layer CDGC based classification framework by adding a 2×2 up-sampling of CDGC.

Training Details.   We adopt SGD with default parameters as the optimizer, with a mini-batch size of 128 for CIFAR and ImageNet32×32, and 32 for ImageNet-150K. The initial learning rate for the CIFAR datasets is set to 0.1, and it is dropped by 0.2 at 60, 120 and 160 epochs, for a total of 200 epochs. For ImageNet32×32 and ImageNet-150K, we start the learning rate at 0.01, drop it by 0.2 every 10 epochs, and train for up to 40 epochs. We evaluate the performance at each epoch and report the best one.
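The step schedule described above can be written as a small helper. This sketch reads "dropped by 0.2" as multiplying the learning rate by 0.2 at each milestone, which is an assumption about the wording; the function and constant names are hypothetical.

```python
def step_lr(epoch, base_lr, milestones, gamma=0.2):
    """Learning rate at a 0-indexed epoch.

    The rate is multiplied by `gamma` once for every milestone that
    has been reached (assumed meaning of "dropped by 0.2").
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


# CIFAR: start at 0.1, drop at epochs 60, 120 and 160 (200 epochs in total).
CIFAR_MILESTONES = (60, 120, 160)

# ImageNet32x32 / ImageNet-150K: start at 0.01, drop every 10 epochs (40 in total).
IMAGENET_MILESTONES = (10, 20, 30)

cifar_schedule = [step_lr(e, 0.1, CIFAR_MILESTONES) for e in range(200)]
```

Under this reading, the CIFAR learning rate takes the values 0.1, 0.02, 0.004 and 0.0008 over the four phases of training.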

5.2 CIFAR Experiments

We start our studies on the CIFAR-10 and CIFAR-100 datasets, both consisting of 50k training images and 10k testing images, in 10 and 100 classes, respectively. Although the images have a small resolution of 32×32, resulting in an even smaller resolution of global context (e.g., 8×8), our approach could still give reasonable benefits. We present experiments trained on the whole training set and evaluated on the test set. For data augmentation, we follow the scheme reported in [24] for training: 4 pixels are padded on each side, and a 32×32 crop is randomly sampled from the padded image or its horizontal flip.
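The augmentation scheme can be sketched as follows for a single-channel image stored as nested lists. Zero padding and a flip probability of 0.5 are assumptions, since the text states neither; the function names are hypothetical, and a real pipeline would use the equivalent image transforms of the training framework.

```python
import random


def pad_image(image, pad, fill=0):
    """Pad an H x W image (nested lists) with `fill` on every side."""
    width = len(image[0])
    blank = [fill] * (width + 2 * pad)
    padded = [list(blank) for _ in range(pad)]
    for row in image:
        padded.append([fill] * pad + list(row) + [fill] * pad)
    padded.extend(list(blank) for _ in range(pad))
    return padded


def random_crop_flip(image, pad=4, size=32, rng=random):
    """Pad, take a random size x size crop, then flip horizontally
    with probability 0.5 (assumed)."""
    padded = pad_image(image, pad)
    top = rng.randint(0, len(padded) - size)      # randint is inclusive
    left = rng.randint(0, len(padded[0]) - size)
    crop = [row[left:left + size] for row in padded[top:top + size]]
    if rng.random() < 0.5:
        crop = [row[::-1] for row in crop]
    return crop
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible, which is convenient when debugging a data pipeline.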

| Architecture | CIFAR-10 Top-1 Acc (%) | CIFAR-100 Top-1 Acc (%) |
|---|---|---|
| VGG13 | 94.18 | 74.72 |
| VGG13 + DM | 93.47 | 73.62 |
| VGG13 + GM | 94.39 | 75.78 |
| VGG13 + CGM | 94.45 | 75.56 |
| VGG13 + CDCGM (Ours) | 94.71 | 75.81 |
| VGG16 | 93.85 | 73.78 |
| VGG16 + DM | 93.16 | 73.42 |
| VGG16 + GM | 94.15 | 75.03 |
| VGG16 + CGM | 94.18 | 74.85 |
| VGG16 + CDCGM (Ours) | 94.33 | 75.24 |

Table 1: Comparison results of different global context modeling approaches on the CIFAR-10/100 datasets.

Different Global Context Modeling Approaches.   In this part, we investigate the benefits of the four global context modeling approaches introduced in Section 3.1 for extracting global context: discriminative model (DM), generative model (GM), conditional generative model (CGM) and category disentangled conditional generative model (CDCGM). For DM, we choose WRN-16-10 to compute the global context, while for GM and CGM, the networks are simply adapted from our CDGC network by removing the conditional vector and/or the category dispelling branch. Experimental results on the CIFAR-10 and CIFAR-100 datasets are reported in Table 1. Both VGG13 and VGG16 obtain a certain amount of improvement on both datasets by attending to the global context computed by generative models. In addition, the final results of attending to GM and CGM are very similar, as both of their models are related to the category label, whether explicitly or not (see Section 3.1). Our approach, which adopts CDCGM to extract the global context, performs the best in all cases, supporting our assumption that CDGC has the richest complementary information to that encoded in the baseline classification networks. Interestingly, the performance drops significantly if attending to the context extracted from a discriminative model, although the approach of [58], which enforces the attention maps to be exactly the same, could work (see Table 3). We guess the reason is that their encoded information is quite similar, making our approach fail to learn how to attend to the context.

| Architecture | CIFAR-10 Top-1 Acc (%) | CIFAR-100 Top-1 Acc (%) |
|---|---|---|
| VGG13 | 94.18 | 74.72 |
| VGG13 + Re-bias | 94.22 | 75.05 |
| VGG13 + Ours w/o CA | 94.60 | 75.62 |
| VGG13 + Ours w/o RL | 94.41 | 75.37 |
| VGG13 + Ours (one layer) | 94.40 (94.06) | 74.48 (75.20) |
| VGG13 + Ours (two layers) | 94.71 | 75.81 |
| VGG16 | 93.85 | 73.78 |
| VGG16 + Re-bias | 94.12 | 73.74 |
| VGG16 + Ours w/o CA | 94.28 | 74.54 |
| VGG16 + Ours w/o RL | 94.19 | 74.36 |
| VGG16 + Ours (one layer) | 94.23 (93.95) | 74.73 (74.11) |
| VGG16 + Ours (two layers) | 94.33 | 75.24 |

Table 2: Ablation studies on the CIFAR-10/100 datasets.

| Architecture | CIFAR-10 Top-1 Acc (%) | CIFAR-100 Top-1 Acc (%) |
|---|---|---|
| VGG13 | 94.18 | 74.72 |
| VGG13 + AT [58] (WRN-16-10) | 94.43 | 75.58 |
| VGG13 + Ours | 94.71 | 75.81 |
| VGG16 | 93.85 | 73.78 |
| VGG16 + AT [58] (WRN-16-10) | 94.23 | 74.76 |
| VGG16 + Ours | 94.33 | 75.24 |
| WRN-16-10 | 94.58 | 77.72 |
| WRN-16-10 + Ours | 95.23 | 79.06 |
| WRN-28-10 | 95.18 | 79.15 |
| WRN-16-10 + CBAM [52] | 95.09 | 79.51 |
| WRN-16-10 + CBAM [52] + Ours | 95.60 | 80.38 |
| WRN-16-10 + SE [17] | 95.81 | 80.38 |
| WRN-16-10 + SE [17] + Ours | 96.02 | 80.86 |

Table 3: Comparison results on the CIFAR-10/100 datasets.

Channel Attention.   As channel attention (CA) is a key component of our framework, we also compare the results with and without it. The approach without CA is implemented by projecting the attention map into the same number of channels as the feature map by forwarding it to a deep network with two convolutional layers. The results in Table 2 show that without channel attention, our framework could still outperform the baseline networks, but is slightly worse than the full approach, validating the usefulness of channel attention.

Re-bias Unit.   As the re-bias unit indeed deepens the baseline networks, readers may wonder whether the final improvement is brought by the extra depth. Thus, we compare our results with those of the baseline networks equipped with only a re-bias unit. Results in Table 2 show that adding the re-bias unit alone could slightly improve the performance in most cases, but the results are far behind those of the full approach (e.g., 75.05% vs. 75.81% with VGG13 on CIFAR-100), validating that the key factor that brings improvement is not the added layers but the amplification then re-bias scheme.

Repulsive Loss.   In this experiment, we demonstrate the importance of the repulsive loss (RL) by training two CDGC networks, one with and one without the repulsive loss, and then applying the two resulting CDGCs to VGG13 and VGG16 to compare their accuracies. The results in Table 2 show that a CDGC extracted without the repulsive loss still improves the classification performance. However, our full framework with the repulsive loss achieves the best results, validating its usefulness.

Multiple Layer Extension.   We investigate the usefulness of applying the framework to multiple layers. The results in Table 2 show that applying the framework to only one of the two layers (the result for the other single layer is given in brackets) is worse than applying it to both layers, validating the usefulness of the multiple layer extension.

Comparison with the State-of-the-art.   We make comprehensive comparisons with the state-of-the-art using various baseline networks (e.g., VGG and WRN) on the CIFAR datasets, and report the results in Table 3. Networks equipped with the CDGC based framework obtain substantial improvement in all cases. In particular, with the CDGC based framework, WRN-16-10 achieves results comparable to those of WRN-28-10. In addition, our method outperforms Attention Transfer (AT) [58], which adopts WRN-16-10 as the teacher network, on both CIFAR datasets. Note that, due to the degradation problem [14], VGG16 performs slightly worse than VGG13, which also indicates the importance of exploring directions other than depth to improve the performance. Although the accuracies of Squeeze-and-Excitation Networks (SE) [17] and the Convolutional Block Attention Module (CBAM) [52] are slightly higher than ours, these methods can be further improved (e.g., 0.4%/0.7% on CIFAR-10/100) by combining them with our framework, as shown in Table 3.

5.3 ImageNet32x32 Experiments

To further validate the effectiveness of our CDGC based framework, in this section we conduct experiments on a more challenging dataset: ImageNet32×32 [9]. It contains exactly the same number of images as the original ImageNet, i.e., 1,281,167 training images from 1000 classes (compared with 50k training images from 10/100 classes in the CIFAR datasets) and 50,000 validation images with 50 images per class, but at a down-sampled resolution of 32×32. We follow the same data augmentation scheme as on CIFAR.

Architecture | Top-1 Acc (%) | Top-5 Acc (%)
ResNet-32 | 37.31 | 62.97
ResNet-32 + AT [58] (ResNet-56) | 37.20 | 62.74
ResNet-32 + Ours | 40.37 | 66.06
ResNet-32 + SE [17] | 37.41 | 63.02
ResNet-32 + SE [17] + Ours | 40.11 | 65.82
ResNet-32 + CBAM [52] | 37.82 | 63.35
ResNet-32 + CBAM [52] + Ours | 40.22 | 65.88
ResNet-56 | 42.35 | 68.01
ResNet-56 + AT [58] (ResNet-110) | 42.48 | 68.18
ResNet-56 + Ours | 44.43 | 70.18
ResNet-56 + SE [17] | 42.53 | 68.29
ResNet-56 + SE [17] + Ours | 45.08 | 70.57
ResNet-56 + CBAM [52] | 43.18 | 68.92
ResNet-56 + CBAM [52] + Ours | 44.55 | 70.06
ResNet-110 | 49.08 | 74.35
ResNet-110 + Ours | 51.36 | 75.93
ResNet-110 + SE [17] | 49.18 | 74.32
ResNet-110 + SE [17] + Ours | 50.29 | 75.17
ResNet-110 + CBAM [52] | 49.43 | 74.55
ResNet-110 + CBAM [52] + Ours | 50.91 | 75.72
Table 4: Classification results on ImageNet32×32.
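
For readers unfamiliar with the two metrics reported in Table 4, the following is a minimal sketch of how Top-1 and Top-5 accuracy are commonly computed from per-class scores. The function name `top_k_accuracy` and the toy scores are illustrative assumptions and are not taken from the authors' code.

```python
def top_k_accuracy(scores, labels, k):
    """Percentage of samples whose true label is among the k highest-scoring classes.

    scores: list of per-sample lists of class scores (hypothetical toy data below).
    labels: list of true class indices, one per sample.
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k classes with the largest scores for this sample.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return 100.0 * hits / len(labels)


# Toy example: 3 samples, 4 classes.
scores = [
    [0.10, 0.60, 0.20, 0.10],
    [0.50, 0.10, 0.30, 0.10],
    [0.25, 0.15, 0.50, 0.10],
]
labels = [1, 2, 0]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(f"Top-1: {top1:.2f}%  Top-2: {top2:.2f}%")
```

Top-5 accuracy is the same computation with `k=5`; it is always at least as high as Top-1, which matches the pattern in the table.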

The results are shown in Table 4. ResNet (basic residual block) [15] at all three depths obtains a large improvement (e.g., 2 to 3% in Top-1 accuracy) when the CDGC based framework is applied, demonstrating that our approach generalizes well to a large-scale dataset with more categories. In addition, we compare the results with three state-of-the-art methods: CBAM [52], SE [17] and AT [58]. The CBAM and SE based approaches slightly improve the baseline networks, whereas the AT based method brings no consistent improvement. Furthermore, combining our framework with CBAM or SE yields a further improvement over CBAM or SE alone, validating the superiority of our framework.
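
The size of the improvement can be recomputed directly from Table 4. The short sketch below does so for the Top-1 column; the numbers are copied from the table, while the dictionary layout and the helper name `absolute_gains` are assumptions made for illustration.

```python
# Top-1 accuracies (%) copied from Table 4: (baseline, baseline + Ours).
TOP1 = {
    "ResNet-32": (37.31, 40.37),
    "ResNet-56": (42.35, 44.43),
    "ResNet-110": (49.08, 51.36),
}


def absolute_gains(results):
    """Return the gain in percentage points for each architecture."""
    return {name: round(ours - base, 2) for name, (base, ours) in results.items()}


gains = absolute_gains(TOP1)
for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} points")
```

The three gains are 3.06, 2.08 and 2.28 percentage points, respectively.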

5.4 ImageNet-150K Experiments

We also conduct experiments on the higher-resolution ImageNet-150K dataset [29], a subset of ImageNet that also contains 1k categories but with 150 images per category (148 for training, 2 for testing). We resize all the images to 128×128. For data augmentation, we horizontally flip each image and randomly take a 128×128 crop with 16 pixels padded on each side.
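
The augmentation described above can be sketched as follows. This is a dependency-free illustration, not the authors' pipeline: it handles a single-channel image stored as nested lists, and the zero-valued padding and the 0.5 flip probability are assumptions, since the text specifies neither. The name `pad_flip_crop` is hypothetical.

```python
import random


def pad_flip_crop(image, pad=16, crop=128, rng=random):
    """Randomly flip a 2-D image horizontally, pad it, and take a random crop.

    image: list of rows (single channel, for brevity).
    Assumptions: zero padding and a flip probability of 0.5.
    """
    if rng.random() < 0.5:
        image = [row[::-1] for row in image]

    height, width = len(image), len(image[0])
    padded_width = width + 2 * pad
    blank = [0] * padded_width
    padded = (
        [blank[:] for _ in range(pad)]
        + [[0] * pad + list(row) + [0] * pad for row in image]
        + [blank[:] for _ in range(pad)]
    )

    # Choose the top-left corner of the crop uniformly inside the padded image.
    top = rng.randint(0, height + 2 * pad - crop)
    left = rng.randint(0, padded_width - crop)
    return [row[left:left + crop] for row in padded[top:top + crop]]


# Toy 128x128 image with a deterministic pattern.
image = [[(r * 128 + c) % 256 for c in range(128)] for r in range(128)]
augmented = pad_flip_crop(image, rng=random.Random(0))
print(len(augmented), len(augmented[0]))
```

With 16 pixels of padding on each side, the padded image is 160×160, so the crop origin can shift by up to 32 pixels in each direction while the output stays 128×128.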

Experimental results are reported in Table 5. ResNet-101 (bottleneck residual block) obtains a larger improvement from the CDGC based framework than from two state-of-the-art methods, CBAM [52] and SE [17], validating that our approach generalizes well to datasets with higher resolutions.

We also investigate the effect of the resolution of CDGC. The framework with CDGC at a higher resolution (e.g., 16×16) performs better. We therefore attempted to obtain CDGC at even higher resolutions for further comparison. However, we found that the CDGC network could hardly converge, probably indicating the difficulty of disentangling latent features at a high resolution.

Figure 5: Top row: input images; middle row: activation-based attention maps [58] of ResNet-101 (output of the last residual group); bottom row: attention maps of ResNet-101 after applying the CDGC based framework.
Architecture | Top-1 Acc (%) | Top-5 Acc (%)
ResNet-101 | 78.70 | 94.55
ResNet-101 + Ours (8×8) | 79.00 | 95.10
ResNet-101 + Ours (16×16) | 80.85 | 95.15
ResNet-101 + SE [17] | 79.30 | 94.35
ResNet-101 + SE [17] + Ours | 80.65 | 94.80
ResNet-101 + CBAM [52] | 78.80 | 94.20
ResNet-101 + CBAM [52] + Ours | 80.35 | 95.50
Table 5: Classification results on ImageNet-150K.

To intuitively demonstrate why our framework can bootstrap the baseline networks, we visualize the activation-based attention maps [58] of ResNet-101 with and without the CDGC based framework. Figure 5 shows that, due to the lack of global information, ResNet-101 alone fails to locate the objects of interest in these examples, whereas with the guidance of CDGC all of these locations are identified correctly.
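
As a rough illustration of what an activation-based attention map is, the sketch below collapses a C×H×W activation tensor into an H×W map by summing powers of absolute activations over channels, in the spirit of [58]. The exponent `p=2`, the L2 normalisation, the nested-list tensor layout and the function name are assumptions made for this example; the exact variant used for Figure 5 is not stated in the text.

```python
def activation_attention_map(features, p=2):
    """Collapse a C x H x W activation tensor (nested lists) into an H x W map.

    Each spatial location receives sum_c |A_c|^p; the map is then L2-normalised.
    Both p=2 and the normalisation are assumptions of this sketch.
    """
    height, width = len(features[0]), len(features[0][0])
    attention = [
        [sum(abs(channel[i][j]) ** p for channel in features) for j in range(width)]
        for i in range(height)
    ]
    norm = sum(v * v for row in attention for v in row) ** 0.5
    if norm == 0:
        return attention
    return [[v / norm for v in row] for row in attention]


# Toy activations: 2 channels, 2x2 spatial grid.
features = [
    [[1.0, -2.0], [0.0, 0.5]],
    [[0.5, 1.0], [0.0, -0.5]],
]
attention = activation_attention_map(features)
for row in attention:
    print([round(v, 3) for v in row])
```

Locations where many channels respond strongly receive high values, which is why such maps are used to show where a network is "looking".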

6 Conclusion

We propose a general framework that employs the novel Category Disentangled Global Context (CDGC) to bootstrap deep networks for image classification. The rationale is that global context unrelated to the category retains rich information that is complementary to that encoded in the classification networks. Therefore, by attending to CDGC, the networks can identify the objects of interest more accurately. Thorough experiments demonstrate that the framework brings substantial improvement to various network architectures and is superior to the state-of-the-art. We hope the framework becomes an important component of CNNs and inspires more research in computer vision beyond image classification.