DIANet: Dense-and-Implicit Attention Network

05/25/2019 ∙ by Zhongzhan Huang, et al. ∙ National University of Singapore ∙ Northwestern University

Attention-based deep neural networks (DNNs) that emphasize the informative features within the local receptive field of an input image have successfully boosted the performance of deep learning in various challenging problems. In this paper, we propose a Dense-and-Implicit-Attention (DIA) unit that can be applied universally to different network architectures and enhance their generalization capacity by repeatedly fusing information throughout different network layers. The communication of information between different layers is carried out via a modified Long Short-Term Memory (LSTM) module within the DIA unit that runs in parallel with the DNN. The shared DIA unit links multi-scale features from different depth levels of the network implicitly and densely. Experiments on benchmark datasets show that the DIA unit is capable of emphasizing channel-wise feature interrelation and leads to significant improvement of image classification accuracy. We further empirically show that the DIA unit is a nonlocal normalization tool that enhances Batch Normalization. The code is released at https://github.com/gbup-group/DIANet.

1 Introduction

Attention, a cognitive process that selectively focuses on a small part of information while neglecting other perceivable information anderson2005cognitive , has been used to effectively ease neural networks from learning large information contexts from images Xu:2015:SAT:3045118.3045336 ; luong2015effective , sentences vaswani2017attention ; britz2017massive ; cheng2016long , and videos miech2017learnable . Especially in computer vision, DNNs incorporated with special operators that mimic the attention mechanism can process informative regions in an image efficiently. Empirical results in mnih2014recurrent ; bahdanau2014neural ; Xu:2015:SAT:3045118.3045336 ; newell2016stacked ; vaswani2017attention ; anderson2018bottom have demonstrated that such attention operators improve the visual representations learned by DNNs hu2018gather . Recently, attention operators have been modularized and stacked into popular networks he2016deep ; huang2017densely as attention modules for further performance improvement. The design of these modules may be task-dependent hu2018squeeze ; woo2018cbam ; park2018bam ; wang2018non ; hu2018gather ; li2019selective ; cao2019GCNet .

In previous works, attention modules are inserted individually at each layer throughout a DNN. Even a tiny add-in module with a small number of parameters per layer substantially increases the total number of parameters as the network depth grows. Besides, the potential network redundancy can also hinder the learning capability of DNNs huang2017densely ; wang2018mixed . Therefore, it is crucial to design efficient attention modules that maintain a reasonable parameter cost and avoid feature redundancy while improving the performance of DNNs.

1.1 Our contribution

To tackle the challenge mentioned above, we propose a Dense-and-Implicit Attention (DIA) unit to enhance the generalization capacity of DNNs. The DIA unit recurrently fuses multi-scale information from previous network layers into later network layers to model attention. The structure and computation flow of a DIA unit are visualized in Figure 1. There are three parts in the DIA unit. The part denoted by 1⃝ extracts spatial-wise jaderberg2015spatial , channel-wise hu2018squeeze , or multi-scale features woo2018cbam ; park2018bam from the feature map in the current layer. The second part, denoted by 2⃝, is the main module in the DIA unit for modeling network attention and is the key innovation of the proposed method. Particularly, we apply a Long Short-Term Memory (LSTM) hochreiter1997long module that not only connects two adjacent layers but also fuses the information from previous layers, creating nonlocal information communication throughout the DNN. Other network structures could also be explored to implement the second part; this is left as future work. The third part, denoted by 3⃝, adjusts (e.g., re-scales or re-distributes) the feature map in the current layer according to the feedback of the second part.

Characteristics and Advantages. (1) As shown in Figure 1, the DIA unit is placed in parallel with the network backbone and is shared by all the layers in the same stage (the collection of successive layers with the same spatial size, as defined in he2016deep ). It links different layers implicitly and densely, which improves the interaction of layers at different depths. (2) Since the DIA unit is shared, the parameter increment it introduces remains roughly constant as the depth of the network increases. (3) The DIA unit adaptively learns a non-local scaling that has a normalization effect.

Figure 1: DIA units. One operator denotes the operation for extracting different scales of features; the other denotes the operation for emphasizing features.

In this work, we focus on studying the DIA unit with a modified LSTM that can learn the channel-wise relationship and the long-distance dependency between layers. As shown in Figure 2, a global average pooling (GAP) layer (part 1⃝ in Figure 1) is used to extract global information from the current layer. An LSTM module (part 2⃝ in Figure 1) is used to integrate multi-scale information; three inputs are passed to the LSTM: the global information extracted from the current raw feature map, the hidden state vector $h_{t-1}$, and the cell state vector $c_{t-1}$ from the previous layers. The LSTM then outputs the new hidden state vector $h_t$ and the new cell state vector $c_t$. The cell state vector $c_t$ stores the information from the current layer and its preceding layers. The new hidden state vector $h_t$ (dubbed the attention vector in our work) is then applied back to the raw feature map by channel-wise multiplication (part 3⃝ in Figure 1) to emphasize the feature importance in each feature map.
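To make the computation flow concrete, below is a minimal PyTorch sketch of a single DIA step, written by us for illustration only. It uses a standard torch.nn.LSTMCell as a stand-in for the modified DIA-LSTM introduced in Section 3.2, with an extra sigmoid to approximate the sigmoid output activation used there; the class and variable names are ours, not those of the released code.

```python
import torch
import torch.nn as nn

class DIAUnitSketch(nn.Module):
    """Illustrative sketch of one DIA step (not the official implementation).

    A standard LSTMCell stands in for the modified DIA-LSTM of Section 3.2.
    """
    def __init__(self, channels):
        super().__init__()
        self.lstm = nn.LSTMCell(input_size=channels, hidden_size=channels)

    def forward(self, feature_map, state):
        # 1) extract a global channel descriptor: (N, C, H, W) -> (N, C)
        g = feature_map.mean(dim=(2, 3))
        # 2) fuse it with the state carried over from the preceding layers
        h, c = self.lstm(g, state)
        # approximate the attention vector; the real DIA-LSTM outputs it directly
        attention = torch.sigmoid(h)                  # (N, C), values in (0, 1)
        # 3) re-scale the raw feature map channel-wise
        out = feature_map * attention[:, :, None, None]
        return out, (h, c)

# Usage on a dummy feature map; the same unit (and its state) is reused
# for every layer of a stage, as described in the following sections.
unit = DIAUnitSketch(channels=64)
x = torch.randn(8, 64, 32, 32)
state = (torch.zeros(8, 64), torch.zeros(8, 64))      # h_0 = c_0 = 0
x, state = unit(x, state)
```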

The LSTM in the DIA unit plays the role of bridging the current layer and the preceding layers so that the DIA unit can adaptively learn the non-linear relationship between features at two different scales. The first scale is the internal information of the current layer, and the second scale represents the outer information from the preceding layers. The non-linear relationship between these two scales benefits attention modeling for the current layer. Additionally, the channel-wise multiplication in the final step of the DIA unit is similar to the scaling operation of Batch Normalization; indeed, the empirical results show that the DIA unit has a non-local normalization effect. With our modified LSTM, the additional parameter cost of the DIA unit is economical.

1.2 Organization

In Section 2, some attention-based networks are reviewed, and the differences between them and ours are discussed. In Section 3, the formal definition of the Dense-and-Implicit Attention Network is introduced. In Section 4, we conduct experiments on benchmark datasets to empirically demonstrate the effectiveness of the DIA unit. In Section 5, the influence of the hyperparameter on our model is studied experimentally. Finally, in Section 6, some evidence of long-distance dependency in our model is shown, and the normalization effect of the DIA unit is also investigated.

Figure 2: Illustration of DIANet with LSTM. In the LSTM module, $c_t$ is the cell state vector and $h_t$ is the hidden state vector. GAP means global average pooling and $\odot$ means channel-wise multiplication.

2 Related Works

Attention Mechanism in Computer Vision. mnih2014recurrent ; zhao2017diversified use the attention mechanism in image classification by utilizing a recurrent neural network to select and process local regions at high resolution sequentially. Concurrent attention-based methods tend to construct operation modules that capture non-local information in an image wang2018non ; cao2019GCNet and model the interrelationship between channel-wise features hu2018squeeze ; hu2018gather . The combination of multi-level attention is also widely studied park2018bam ; woo2018cbam ; DBLP:journals/corr/abs-1904-04402 ; Wang_2017_CVPR . Prior works wang2018non ; cao2019GCNet ; hu2018squeeze ; hu2018gather ; park2018bam ; woo2018cbam ; DBLP:journals/corr/abs-1904-04402 usually insert an attention module into each layer independently. In this work, the DIA unit is innovatively shared by all the layers in the same stage of the network, and existing attention modules can readily be composited into the DIA unit. Besides, we adopt a global average pooling in part 1⃝ to extract global information and a channel-wise multiplication in part 3⃝ to enhance the importance of features, which is similar to SENet hu2018squeeze .

Dense Network Topology. DenseNet, proposed in huang2017densely , connects all pairs of layers directly with an identity map. Through feature reuse, DenseNet enjoys higher parameter efficiency, better generalization capacity, and easier training than alternative architectures lin2013network ; he2016deep ; srivastava2015highway . Instead of explicit dense connections, the DIA unit implicitly links layers at different depths via a shared module, which leads to dense connections. Empirical results in Section 6 show that the DIA unit has a similar effect to DenseNet.

Multi-level Feature Integration. wolf2006critical experimentally shows that even a simple aggregation of low-level visual features sampled from a wide inception field can be efficient and robust for context representation, which inspires hu2018squeeze ; hu2018gather to incorporate multi-level features to improve the network representation. li2016multi also demonstrates that, by biasing the feature response in each convolutional layer with different activation functions, deeper layers can achieve a better capacity for capturing abstract patterns in a DNN. In the DIA unit, the highly non-linear relationship between multi-scale features is learned and integrated via the LSTM module, which helps model attention better and improves the learning performance reported in Sections 4, 5, and 6.

3 Dense-and-Implicit Attention Network (DIANet)

In this section, we formally introduce the DIA unit and elaborate on how it implicitly connects all layers densely. Hereafter, a network built with DIA units is referred to as a DIANet.

3.1 Formulation of DIANets

As shown in Figure 2, when a DIANet is built with a residual network he2016deep , the input of the $t$-th layer is $x_t \in \mathbb{R}^{W \times H \times C}$, where $W$, $H$, and $C$ denote the width, the height, and the number of channels, respectively. $f(\cdot, W_t)$ is the residual mapping of the $t$-th layer with parameters $W_t$, as introduced in he2016deep . Let $u_t = f(x_t, W_t)$. Next, a global average pooling, denoted as $\mathrm{GAP}$, is applied to $u_t$ to extract the global information $g_t = \mathrm{GAP}(u_t) \in \mathbb{R}^{C}$ from the features in the current layer. Then $g_t$ is passed to the LSTM along with a hidden state vector $h_{t-1}$ and a cell state vector $c_{t-1}$ ($h_0$ and $c_0$ are initialized as zero vectors). The LSTM finally generates the current hidden state vector $h_t$ and cell state vector $c_t$ as

$(h_t, c_t) = \mathrm{LSTM}(g_t, h_{t-1}, c_{t-1})$.    (1)

In our model, the hidden state vector $h_t$ is regarded as an attention vector to adaptively recalibrate feature maps. We apply channel-wise multiplication to enhance the importance of features, i.e., $\tilde{u}_t = h_t \odot u_t$, and obtain $x_{t+1}$ after the skip connection, i.e., $x_{t+1} = x_t + \tilde{u}_t$. Table 1 shows the formulation of ResNet, SENet, and DIANet; Part (b) is the main difference between them. The LSTM module is used repeatedly and shared by different layers in parallel with the network backbone, so the number of parameters in the LSTM does not depend on the number of layers in the backbone. SENet utilizes an attention module consisting of fully connected layers to model the channel-wise dependency of each layer independently hu2018squeeze , so the total number of parameters brought by these add-in modules depends on, and increases with, the number of layers in the backbone.

      ResNet                      SENet                                    DIANet (ours)
(a)   $u_t = f(x_t, W_t)$         $u_t = f(x_t, W_t)$                      $u_t = f(x_t, W_t)$
(b)   -                           $h_t = \mathrm{FC}(\mathrm{GAP}(u_t))$   $(h_t, c_t) = \mathrm{LSTM}(\mathrm{GAP}(u_t), h_{t-1}, c_{t-1})$
(c)   $x_{t+1} = x_t + u_t$       $x_{t+1} = x_t + h_t \odot u_t$          $x_{t+1} = x_t + h_t \odot u_t$
Table 1: Formulation of the structures of ResNet, SENet, and DIANet; Part (b) is where they differ. $f(\cdot, W_t)$ is the convolutional (residual) mapping, FC means fully connected layer, and GAP indicates global average pooling.
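The recurrence in Part (b) can be summarized by a short sketch of a stage-level forward pass: one DIA unit is shared by every residual block of a stage, and only the hidden and cell states are carried between layers. The function below is our own illustration (assuming a dia_unit with the interface sketched in Section 1.1), not the official code.

```python
import torch

def stage_forward(blocks, dia_unit, x):
    """Forward pass of one stage with a shared DIA unit (illustrative sketch).

    blocks   -- list of residual mappings f(., W_t) of the stage
    dia_unit -- shared module mapping (feature_map, (h, c)) -> (rescaled_map, (h, c))
    x        -- input feature map x_1 of shape (N, C, H, W)
    """
    n, c = x.shape[0], x.shape[1]
    state = (x.new_zeros(n, c), x.new_zeros(n, c))   # h_0 and c_0 initialized to zero
    for f in blocks:
        u = f(x)                        # residual branch u_t = f(x_t, W_t)
        u, state = dia_unit(u, state)   # u_t <- h_t (channel-wise) * u_t, state carried on
        x = x + u                       # skip connection: x_{t+1} = x_t + h_t * u_t
    return x
```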

3.2 Modified LSTM Module (DIA-LSTM Module)

Now we introduce the modified LSTM module. As shown in Figure 3, compared with the standard LSTM hochreiter1997long module, there are two modifications in our proposed LSTM: 1) a shared linear transformation to reduce the input dimension; 2) a carefully selected activation function for better performance.

A standard LSTM consists of four linear transformation layers, as shown in Figure 3 (Left). Since $g_t$, $h_{t-1}$, and $c_{t-1}$ are of the same dimension $C$, the standard LSTM causes a parameter increment of $8C^2$, as shown in the Appendix. When the number of channels is large, e.g., on the order of a thousand, the parameter increment of the add-in LSTM module in the DIA unit will be over 8 million, which can hardly be tolerated.

Figure 3: Left: standard LSTM module. Right: DIA-LSTM module. We highlight the modified components in DIA-LSTM. “$\sigma$” means the sigmoid activation. “Linear” means a linear transformation.

Hence, to avoid such a scenario, we propose a modified LSTM (denoted as DIA-LSTM) as follows:

(1) Activation Function. We change the output layer's activation function from $\tanh$ to the sigmoid function; further discussion is presented in the ablation study.

(2) Parameter Reduction. As shown in Figure 3 (Left), $g_t$ and $h_{t-1}$ are passed to four linear transformation layers whose input and output dimensions are both $C$. In the DIA-LSTM, a linear transformation layer (denoted as “Linear1” in Figure 3 (Right)) with a smaller output dimension $C/r$ is applied to $g_t$ and $h_{t-1}$, where $r$ is a reduction ratio. Specifically, we reduce the dimension of the input from $C$ to $C/r$ and then apply a non-linear activation function to increase the non-linearity of the module. The dimension is changed back to $C$ when the output is passed to the four linear transformation layers. This modification enhances the relationship between the inputs of different parts of the DIA-LSTM and also effectively reduces the number of parameters by sharing one linear transformation for dimension reduction. The parameter increment is reduced from $8C^2$ to $10C^2/r$, as shown in the Appendix, and we find that with an appropriate reduction ratio $r$ we can make a good trade-off between parameter reduction and the performance of DIANet. Further experimental results are discussed in the ablation study.
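The following is a hedged sketch of how such a DIA-LSTM cell could be written in PyTorch. It assumes, following the counting in the Appendix, that the reduction layer is applied separately to $g_t$ and $h_{t-1}$ (each followed by a non-linearity, here ReLU as an assumption) and that four gate projections map the reduced dimension back to $C$; it illustrates the idea and is not the official implementation.

```python
import torch
import torch.nn as nn

class DIALSTMCellSketch(nn.Module):
    """Sketch of the modified LSTM (DIA-LSTM) with reduction ratio r."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        hidden = channels // reduction
        # reduction ("Linear1"): C -> C/r, one per input stream (our assumption)
        self.reduce_g = nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(inplace=True))
        self.reduce_h = nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(inplace=True))
        # four gate projections (input, forget, output, cell) shared reduction -> C
        self.gates_g = nn.Linear(hidden, 4 * channels)
        self.gates_h = nn.Linear(hidden, 4 * channels)

    def forward(self, g, h_prev, c_prev):
        z = self.gates_g(self.reduce_g(g)) + self.gates_h(self.reduce_h(h_prev))
        i, f, o, u = z.chunk(4, dim=1)
        c = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(u)
        # sigmoid instead of tanh at the output layer, as described above
        h = torch.sigmoid(o) * torch.sigmoid(c)
        return h, c
```

Ignoring biases, this cell has $2(C^2/r + 4C^2/r) = 10C^2/r$ weights, consistent with the counting in the Appendix.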

Figure 4: Left: explicit structure of DIANet. Right: implicit connection of DIA unit.

Implicit and Dense Connection. We now illustrate how the DIA unit connects all the layers in the same stage implicitly and densely. Consider a stage consisting of many layers. As shown in Figure 4 (Left), which depicts the explicit structure of DIANet, a given layer appears to be connected to the other layers only through the network backbone. However, as shown in Figure 4 (Right), there are indirect connections between the current layer and the preceding layers with the help of the shared DIA unit. Unlike DenseNet huang2017densely , which concatenates the raw features of all previous layers directly, a layer receives the information of the preceding layers through data-driven learning within the DIA unit. We call these implicit connections. Since the DIA unit is shared, there is communication between every pair of layers, which leads to dense connections over all layers.

4 Experiments

In this section, we evaluate the performance of the DIA unit on the image classification task and empirically demonstrate its effectiveness. We conduct experiments with popular networks on benchmark datasets. Since SENet hu2018squeeze is also a channel-specific attention model, we compare DIANet with SENet. For a fair comparison, we adjust the reduction ratio such that the number of parameters of DIANet is similar to that of SENet.

Dataset and Model. We conduct experiments on CIFAR10, CIFAR100 cifar , and ImageNet 2012 ILSVRC15 using ResNet he2016deep , PreResNet he2016identity , WRN wrn , and ResNeXt xie2017aggregated . CIFAR10 and CIFAR100 each have 50k training images and 10k test images of size 32 by 32, with 10 and 100 classes respectively. The ImageNet 2012 dataset ILSVRC15 comprises 1.28 million training images and 50k validation images from 1000 classes; random cropping of size 224 by 224 is used in our experiments. The implementation details can be found in the Appendix.

Image Classification. As shown in Table 2, DIANet improves the test accuracy significantly over the original networks and consistently outperforms SENet across datasets. In particular, the performance improvement of ResNet with the DIA unit is the most remarkable. Given the popularity of ResNet, the DIA unit may also be applied to other computer vision tasks.

Model            Dataset    original          SENet             DIANet (ours)     r
                            P(M)  top1-acc.   P(M)  top1-acc.   P(M)  top1-acc.
ResNet164 CIFAR100 1.73 73.43 1.93 75.03 1.95 76.67 4
PreResNet164 CIFAR100 1.73 76.53 1.92 77.41 1.96 78.20 4
WRN52-4 CIFAR100 12.07 79.75 12.42 80.35 12.30 80.99 4
ResNext101,8x32 CIFAR100 32.14 81.18 34.03 82.45 33.01 82.46 4
ResNet164 CIFAR10 1.70 93.54 1.91 94.27 1.92 94.58 4
PreResNet164 CIFAR10 1.70 95.01 1.90 95.18 1.94 95.23 4
WRN52-4 CIFAR10 12.05 95.96 12.40 95.95 12.28 96.17 4
ResNext101,8x32 CIFAR10 32.09 95.73 33.98 96.09 32.96 96.24 4
ResNet34 ImageNet 21.81 73.93 21.97 74.39 21.98 74.60 20
ResNet50 ImageNet 25.58 76.01 28.09 76.61 28.38 77.24 20
ResNet152 ImageNet 60.27 77.58 66.82 78.36 65.85 78.87 20
ResNext50,32x4 ImageNet 25.03 77.19 27.56 78.04 27.83 78.32 10
Table 2: Testing accuracy (%) on CIFAR10, CIFAR100, and ImageNet 2012. “P(M)” is the number of parameters in millions. The rightmost column “r” indicates the reduction ratio of DIANet.

5 Ablation Study

In this section, we conduct ablation experiments to explore how to better embed the DIA unit in different neural network structures and to gain a deeper understanding of the role of each component in the DIA unit. All ablation experiments are performed on the CIFAR-100 dataset with ResNet. First, we discuss the effect of the reduction ratio introduced in Section 3.2. Then the performance of the DIA unit in ResNets of different depths is explored. We also test the performance of DIA-LSTM with different activation functions at the output layer and study the capacity of stacked DIA-LSTM with different numbers of DIA-LSTM modules.

Reduction ratio. The reduction ratio $r$ is the only hyperparameter in our DIANet, as mentioned in Section 3.2. Improving performance with a light parameter increment is one of the main characteristics of our model, so this part investigates the trade-off between model complexity and performance. As shown in Table 3, the test accuracy of DIANet declines only slightly as the reduction ratio increases. In particular, when $r = 16$, the parameter increment is 0.05M compared with ResNet164, and the test accuracy of DIANet is 76.50% while that of the original network is 73.43%. This gives the DIA unit potential in a variety of practical applications, especially those in which a small model size is important.

DIANet
Ratio   P(M)   top1-acc.
1       2.59   76.88
4       1.95   76.67
8       1.84   76.42
16      1.78   76.50
Table 3: Test accuracy (%) with varying reduction ratio on CIFAR100.

CIFAR-100     SENet              DIANet (r = 4)
Depth         P(M)   top1-acc.   P(M)   top1-acc.
ResNet83      0.99   74.67       1.11   75.02
ResNet164     1.93   75.03       1.95   76.67
ResNet245     2.87   75.03       2.78   76.79
ResNet407     4.74   75.54       4.45   76.98
Table 4: Test accuracy (%) with models of different depth on CIFAR100.

The depth of the neural network.

Generally, in practice, deep DNNs with a large number of parameters do not guarantee sufficient performance improvement since, on the one hand, deeper networks probably contain extreme feature and parameter redundancy huang2017densely , and on the other hand, the vanishing-gradient problem becomes worse and makes the DNN harder to train bengio1994learning ; glorot2010understanding ; srivastava2015training . Therefore, designing new structures of deep neural networks he2016deep ; huang2017densely ; srivastava2015training ; hu2018squeeze ; hu2018gather ; wang2018non is of great necessity. Since the DIA unit also changes the network topology of the DNN backbone, evaluating the effectiveness of the DIANet structure is of great importance. Here we investigate how the depth of the DNN influences DIANet in two respects: (1) the performance of DIANet compared with SENet at varying depths; (2) the parameter increment of DIANet in the DNN. The results in Table 4 show that as the depth of the ResNet increases from 83 layers to 407 layers, DIANet achieves higher classification accuracy improvement than SENet with fewer parameters. Moreover, DIANet83 (for simplicity, 83 denotes the depth of the ResNet backbone) achieves results competitive with SENet245, and DIANet164 outperforms all the SENet results by at least 1.13% accuracy while using up to 58.8% fewer parameters. This implies that DIANet has higher parameter efficiency than SENet. The results also suggest the following: for DIANet, as shown in Figure 2, a deeper network means that the DIA-LSTM module passes through more layers recurrently. The DIA-LSTM can handle the interrelationship between the information of different layers in much deeper DNNs and better capture the long-distance dependency between layers. Therefore DIANet can effectively avoid feature redundancy.

Model      P(M)   Activation   #DIA-LSTM   top1-acc.
ResNet164 1.95 sigmoid 1 76.67
ResNet164 1.95 tanh 1 75.24
ResNet164 3.33 sigmoid 3 75.20
ResNet164 3.33 tanh 3 76.47
Table 5: Test accuracy (%) on CIFAR100: the effect of the activation function at the output layer of DIA-LSTM and of the number of stacked DIA-LSTM modules.

Activation function and the number of stacked DIA-LSTM modules. We choose two different activation functions (sigmoid and tanh) for the output layer of DIA-LSTM and two different numbers (one and three) of stacked DIA-LSTM cells to explore the effect of these two factors on classification performance. In Table 5, we find that the performance is significantly improved after replacing the tanh of the standard LSTM with the sigmoid function. As shown in Figure 3 (Right), this activation function is located at the output layer, where it directly changes the effect of the memory cell on the output of the output gate.

When we use the sigmoid function in the output layer of DIA-LSTM, increasing the number of stacked DIA-LSTM modules does not lead to performance improvement but instead to considerable performance degradation. However, when we choose tanh, the situation is different. This suggests that for the DIA unit to be effective, fine structural adjustments are necessary.

6 Analysis

In this section, we study some properties of DIANet, including feature integration and the normalization effect. In DIANet, deeper layers are connected to shallower layers via the DIA-LSTM module. First, a random forest model Gregorutti2017Correlation is used to visualize how the current layer depends on the preceding layers. Second, we study the normalization effect of DIANet by removing Batch Normalization Ioffe:2015:BNA:3045118.3045167 from the networks. In addition, tests on the removal of skip connections and of data augmentation show some positive side effects of DIANet.

Feature Integration. Here we try to understand the dense connections from a numerical perspective. As shown in Figures 2 and 4, the DIA-LSTM bridges the connections between layers by propagating information forward through $h_t$ and $c_t$. Moreover, the hidden states at different layers are also integrated with the extracted global features inside the DIA-LSTM. Notably, $h_t$ is applied directly to the features in the network at each layer $t$. Therefore, the relationship between the hidden states at different layers reflects, to some degree, the connection between different layers. We explore the non-linear relationship between the hidden state $h_t$ of the DIA-LSTM and the preceding hidden states $h_1, \dots, h_{t-1}$, and visualize how the information coming from the preceding layers contributes to $h_t$. To reveal this relationship, we use a random forest to measure variable importance. A random forest can return the contribution of each input variable to the output in the form of an importance measure, e.g., the Gini importance Gregorutti2017Correlation ; the computational details are given in the Appendix. Taking $h_1, \dots, h_{t-1}$ as input variables and $h_t$ as the output variable, we compute the Gini importance of each input. ResNet164 contains three stages, and each stage consists of 18 layers. We carry out the Gini importance computation for each of the three stages separately. As shown in Figure 5, each row presents the importance of the source layers contributing to the target layer. In each sub-graph of Figure 5, the diversity of the variable importance distribution indicates that the current layer utilizes information from many preceding layers. The interaction between shallow and deep layers in the same stage reveals the effect of the implicitly dense connections. In particular, taking the last row of stage 1 as an example, the layers immediately preceding the target layer do not provide the most information; a shallower layer does. We conclude that the DIA unit can adaptively integrate information from multiple layers.

Figure 5: Visualization of feature integration for each stage by random forest.

Moreover, as shown in Figure 5 (stage 3), the information interaction with previous layers in stage 3 is more intense and frequent than in the first two stages. Correspondingly, as shown in Table 6 (Left), when we remove the DIA unit from stage 3, the classification accuracy decreases from 76.67 to 75.40. However, when the DIA unit in stage 1 or stage 2 is removed, the performance degradation is very similar, falling to 76.27 and 76.25, respectively. Also note that for DIANet the parameter increment in stage 2 is much larger than that in stage 1. This implies that the significant performance degradation after the removal of stage 3 may be due not only to the reduction in the number of parameters but also to the loss of dense feature integration.

Removed   P(M)   ΔP(M)   top1-acc.   Δtop1-acc.
stage1    1.94   0.01    76.27       0.40
stage2    1.90   0.05    76.25       0.42
stage3    1.78   0.17    75.40       1.27

Models      CIFAR-10   CIFAR-100
ResNet164   87.32      60.92
SENet       88.30      62.91
DIANet      89.25      66.73
Table 6: (Left) The effect of removing the DIA unit from different stages, tested on CIFAR100 with ResNet164; ΔP(M) and Δtop1-acc. are the differences from the full DIANet (1.95M, 76.67%). (Right) Test accuracy (%) of the models trained without data augmentation.

Normalization Effect of DIANet. Small changes in the shallower hidden layers may be amplified as information propagates through a deep architecture and can sometimes result in a numerical explosion. Batch Normalization (BN) Ioffe:2015:BNA:3045118.3045167 is widely used in modern deep networks since it stabilizes training by standardizing the input of each layer. The DIA unit readjusts the feature maps by channel-wise multiplication, which plays a scaling role similar to BN. In this part, we empirically show that the DIA unit has a normalization effect. As shown in Table 7, models of varying depth are trained on CIFAR-100 with all BN layers removed to eliminate their normalization effect. The experiments are conducted on a single GPU with batch size 128 and initial learning rate 0.1. Both the original ResNet and SENet face numerical explosion without BN, while DIANet can be trained with depth up to 245. Moreover, compared with Table 4, the test accuracy of DIANet without BN still stays above 70%. The performance gap between the network with BN alone and the network with both BN and DIA indicates that the normalization effect of BN differs from that of DIANet. BN learns the shift and scaling parameters using local information, i.e., the current layer and the current batch of data. In contrast, the scaling learned by DIANet integrates information from the preceding layers and enables the network to choose a better scaling for each feature map. The combination of BN and the DIA unit can therefore learn not only the local scaling but also a non-local scaling.

Model        original          SENet             DIANet
             P(M)  top1-acc.   P(M)  top1-acc.   P(M)  top1-acc.
ResNet83 0.88 nan 0.98 nan 0.94 70.58
ResNet164 1.70 nan 1.91 nan 1.76 72.36
ResNet245 2.53 nan 2.83 nan 2.58 72.35
ResNet326 3.35 nan 3.75 nan 3.41 nan
Table 7: Testing accuracy (%). We train models of different depth without BN on CIFAR-100. “nan” indicates the numerical explosion.

Removal of skip connections. The skip connection has become a necessary structure for training DNNs he2016identity ; without skip connections, a DNN is hard to train due to problems such as vanishing gradients. Figure 6(a-b) shows the training curves when we remove the last one-third of the skip connections in each stage. The training loss of DIANet decreases more stably and reaches a smaller final value, and at the same time DIANet attains higher test accuracy than SENet and the original ResNet. To some extent, this shows that the DIA unit can alleviate gradient vanishing in deep network training.

Without data augmentation. Explicit dense connections may help bring more efficient usage of parameters, which makes a neural network less prone to overfitting huang2017densely . Although the dense connections in DIA are implicit, DIANet still shows the ability to reduce overfitting. To verify this, we train the models without data augmentation so as to remove its regularization effect. As shown in Table 6 (Right), DIANet achieves higher test accuracy than ResNet164 and SENet. To some extent, the implicit and dense structure of DIANet may have a regularization effect.

Figure 6: (a-b) and (c-d) show the performance of the three networks without and with the last one-third of the skip connections in each stage, respectively, tested with ResNet164 on CIFAR-100.

7 Conclusion

In this paper, we proposed a Dense-and-Implicit Attention (DIA) unit to enhance the generalization capacity of deep neural networks by recurrently fusing feature attention throughout different layers. Experiments showed that the DIA unit can be universally applied to different network architectures and improve their performance. The DIA unit is also supported by empirical analysis and an extensive ablation study. Notably, the DIA unit can be constructed as a global network module and can implicitly change the topology of existing network backbones.

Acknowledgments

S. Liang and H. Yang gratefully acknowledge the support of the National Supercomputing Centre (NSCC) Singapore nscc and High Performance Computing (HPC) of the National University of Singapore for providing computational resources, and the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. We sincerely thank Xin Wang from Tsinghua University for providing personal computing resources. H. Yang thanks the Department of Mathematics at the National University of Singapore for the support of the start-up grant, and the Ministry of Education in Singapore for the grant MOE2018-T2-2-147.

References

Appendix A Implementation Details

ResNet164 PreResNet164 WRN52-4 ResNext101-8x32
Batch size 128 128 128 128
Epoch 180 164 200 300
Optimizer SGD(0.9) SGD(0.9) SGD(0.9) SGD(0.9)
depth 164 164 52 101
schedule 60/120 81/122 80/120/160 150/225
wd 1.00E-04 1.00E-04 5.00E-04 5.00E-04
gamma 0.1 0.1 0.2 0.1
widen-factor - - 4 4
cardinality - - - 8
lr 0.1 0.1 0.1 0.1
extraction GAP BN+GAP BN+GAP GAP
drop - - 0.3 -
Table 8: Implementation details for CIFAR10/100 image classification. Normalization and standard data augmentation (random cropping and horizontal flipping) are applied to the training data. GAP and BN denote Global Average Pooling and Batch Normalization, respectively.
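As a reading aid, the ResNet164/CIFAR column of Table 8 translates into roughly the following PyTorch optimizer and schedule; this is our interpretation of the table, not the released training script.

```python
import torch

def make_optimizer(model):
    # SGD with momentum 0.9, weight decay 1e-4, initial lr 0.1 (Table 8, ResNet164)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    # lr multiplied by gamma = 0.1 at epochs 60 and 120; 180 epochs in total
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[60, 120], gamma=0.1)
    return optimizer, scheduler
```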
ResNet34 ResNet50 ResNet152 ResNext50-32x4
Batch size 256 256 256 256
Epoch 120 120 120 120
Optimizer SGD(0.9) SGD(0.9) SGD(0.9) SGD(0.9)
depth 34 50 152 50
schedule 30/60/90 30/60/90 30/60/90 30/60/90
wd 1.00E-04 1.00E-04 5.00E-04 5.00E-04
gamma 0.1 0.1 0.2 0.1
lr 0.1 0.1 0.1 0.1
extraction GAP GAP GAP GAP
Table 9: Implementation details for ImageNet 2012 image classification. Normalization and standard data augmentation (random cropping and horizontal flipping) are applied to the training data. Random cropping of size 224 by 224 is used in these experiments. GAP denotes Global Average Pooling.
Batch size training batch size
Epoch number of total epochs to run
Optimizer Optimizer
depth the depth of the network
schedule Decrease learning rate at these epochs
wd weight decay
gamma learning rate is multiplied by gamma on schedule
widen-factor Widen factor
cardinality Model cardinality (group)
lr initial learning rate
extraction the operation for extracting features (part 1⃝ in Figure 1)
drop Dropout ratio
Table 10: Additional explanation of the settings in Tables 8 and 9.

Appendix B Gini importance

Input: the hidden states $H = (h_1, h_2, \dots, h_T)$ collected from one stage; the size of $H$ is $(N, C, T)$, where $N$ denotes the number of samples, $C$ denotes the number of channels of the feature maps in the current stage, and $T$ denotes the number of layers in the current stage.
Output: the heatmap of feature integration for this stage.
for $t = 2$ to $T$ do
     $X \leftarrow (h_1, \dots, h_{t-1})$, reshaped to size $(N, C \cdot (t-1))$ so that the $C$ channels of each preceding layer are contiguous;
     $y \leftarrow h_t$, of size $(N, C)$;
     RF $\leftarrow$ RandomForestRegressor();
     RF.fit($X$, $y$);
     Gini_importances $\leftarrow$ RF.feature_importances_;   # length $C \cdot (t-1)$
     for each preceding layer $s = 1, \dots, t-1$ do
          res($s$) $\leftarrow$ sum of the $C$ Gini importances belonging to layer $s$;
     end for
     append res / sum(res) to the heatmap row of target layer $t$;
end for
Algorithm 1 Calculate feature integration via the Gini importance of a random forest
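A compact Python version of Algorithm 1 could look as follows, using scikit-learn's RandomForestRegressor; the (N, C, T) data layout and the forest hyperparameters are our assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def feature_integration(H):
    """H: array of shape (N, C, T) holding the hidden states h_1..h_T of one stage."""
    N, C, T = H.shape
    heatmap = []
    for t in range(1, T):
        # predictors: all preceding layers, each layer's C channels kept contiguous
        X = H[:, :, :t].transpose(0, 2, 1).reshape(N, t * C)
        y = H[:, :, t]                      # target hidden state of the current layer, (N, C)
        rf = RandomForestRegressor(n_estimators=100, random_state=0)
        rf.fit(X, y)                        # multi-output regression
        imp = rf.feature_importances_.reshape(t, C).sum(axis=1)  # per-layer Gini importance
        heatmap.append(imp / imp.sum())     # normalized contribution of each preceding layer
    return heatmap
```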

Appendix C Number of Parameters of the LSTM

Suppose the input is of size $C$ and the hidden state vector is also of size $C$.

Standard LSTM. As shown in Figure 3 (Left), the standard LSTM requires 4 linear transformations to control the information flow for the input $g_t$ and for $h_{t-1}$, respectively. The output size is set to $C$. To simplify the calculation, the bias is omitted. Therefore, for the input $g_t$, the number of parameters of the 4 linear transformations is equal to $4C^2$. Similarly, the number of parameters of the 4 linear transformations with input $h_{t-1}$ is equal to $4C^2$. The total number of parameters equals $8C^2$.

DIA-LSTM. As shown in Figure 3 (Right), there is a linear transformation to reduce the dimension at the beginning. The dimension of the input is reduced from $C$ to $C/r$ by this first linear transformation, whose number of parameters is $C \cdot C/r = C^2/r$. The output is then passed to 4 linear transformations as in the standard LSTM, and the number of parameters of these 4 linear transformations is $4 \cdot (C/r) \cdot C = 4C^2/r$. Therefore, for the input $g_t$ and reduction ratio $r$, the number of parameters is equal to $C^2/r + 4C^2/r = 5C^2/r$. Similarly, the number of parameters with respect to the input $h_{t-1}$ is the same as that with respect to $g_t$. The total number of parameters equals $10C^2/r$.
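The following is a quick arithmetic check of the counts above (bias omitted), under the same assumption of a separate reduction layer per input stream; for example, with $C = 256$ and $r = 4$, the DIA-LSTM uses roughly a third of the standard LSTM's parameters.

```python
C, r = 256, 4
standard_lstm = 8 * C * C                          # 4 gate layers for g_t + 4 for h_{t-1}
dia_lstm = 2 * (C * (C // r) + 4 * (C // r) * C)   # (reduction + 4 gates) per input stream
print(standard_lstm, dia_lstm)                     # 524288 163840
```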