The official implementation of the paper "DIANet: Dense-and-Implicit Attention Network".
Attention-based deep neural networks (DNNs), which emphasize informative features in a local receptive field of an input image, have successfully boosted the performance of deep learning in various challenging problems. In this paper, we propose a Dense-and-Implicit-Attention (DIA) unit that can be applied universally to different network architectures and enhances their generalization capacity by repeatedly fusing information throughout different network layers. The communication of information between different layers is carried out via a modified Long Short Term Memory (LSTM) module within the DIA unit, placed in parallel with the DNN. The shared DIA unit links multi-scale features from different depth levels of the network implicitly and densely. Experiments on benchmark datasets show that the DIA unit is capable of emphasizing channel-wise feature interrelation and leads to significant improvement of image classification accuracy. We further show empirically that the DIA unit is a nonlocal normalization tool that enhances Batch Normalization. The code is released at https://github.com/gbup-group/DIANet.
Attention, a cognitive process that selectively focuses on a small part of information while neglecting other perceivable information [anderson2005cognitive], has been used to effectively ease neural networks from learning large information contexts from images [Xu:2015:SAT:3045118.3045336; luong2015effective], sentences [vaswani2017attention; britz2017massive; cheng2016long] and videos [miech2017learnable]. Especially in computer vision, DNNs incorporated with special operators that mimic the attention mechanism can process informative regions in an image efficiently. Empirical results in [mnih2014recurrent; bahdanau2014neural; Xu:2015:SAT:3045118.3045336; newell2016stacked; vaswani2017attention; anderson2018bottom] have demonstrated that these special attention operators improve the visual representations of DNNs [hu2018gather]. Recently, attention operators have been modularized and stacked into popular networks [he2016deep; huang2017densely] as attention modules for further performance improvement. The design of these modules may be task-dependent [hu2018squeeze; woo2018cbam; park2018bam; wang2018non; hu2018gather; li2019selective; cao2019GCNet].
In previous works, attention modules are used individually in each layer throughout DNNs. Even a tiny add-in module with a small number of parameters per layer will substantially increase the total parameter count as the network depth increases. Besides, the potential network redundancy can also hinder the learning capability of DNNs [huang2017densely; wang2018mixed]. Therefore, it is crucial to design efficient attention modules that maintain a reasonable parameter cost and avoid feature redundancy while improving the performance of DNNs.
To tackle the challenge mentioned above, we propose a Dense-and-Implicit Attention (DIA) unit to enhance the generalization capacity of DNNs. The DIA unit recurrently fuses multi-scale information from earlier network layers into later network layers to model attention. The structure and computation flow of a DIA unit are visualized in Figure 1. There are three parts in the DIA unit. The part denoted by 1⃝ extracts spatial-wise [jaderberg2015spatial], channel-wise [hu2018squeeze] or multi-scale [woo2018cbam; park2018bam] features from the feature map in the current layer. The second part, denoted by 2⃝, is the main module in the DIA unit for modeling network attention and is the key innovation of the proposed method. In particular, we apply a Long Short-Term Memory (LSTM) [hochreiter1997long] module that not only connects two adjacent layers but also fuses the information from previous layers, creating nonlocal information communication throughout the DNN. Other network structures could also be explored to implement the second part; this is left as future work. The third part, denoted by 3⃝, adjusts (e.g., re-scales or re-distributes) the feature map in the current layer according to the feedback of the second part.
Characteristics and Advantages. (1) As shown in Figure 1, the DIA unit is placed parallel to the network backbone and is shared by all the layers in the same stage (the collection of successive layers with the same spatial size, as defined in [he2016deep]). It also links different layers implicitly and densely, which improves the interaction between layers at different depths. (2) Since the DIA unit is shared, the parameter increment from the DIA unit remains roughly constant as the depth of the network increases. (3) The DIA unit adaptively learns a non-local scaling that has a normalization effect.
In this work, we focus on studying the DIA unit with a modified LSTM that can learn the channel-wise relationship and the long-distance dependency between layers. As shown in Figure 2, a global average pooling (GAP) layer (part 1⃝ in Figure 1) is used to extract global information from the current layer, and an LSTM module (part 2⃝ in Figure 1) is used to integrate multi-scale information. Three inputs are passed to the LSTM: the global information extracted from the current raw feature map, and the hidden state vector h_{t-1} and cell state vector c_{t-1} carried over from the previous layer. The LSTM then outputs a new hidden state vector h_t and a new cell state vector c_t. The cell state vector c_t stores the information from the t-th layer and its preceding layers. The new hidden state vector h_t (dubbed the attention vector in our work) is then applied back to the raw feature map by channel-wise multiplication (part 3⃝ in Figure 1) to emphasize the feature importance in each channel.
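The per-layer flow just described can be sketched in a few lines of numpy. This is a toy stand-in: the simple gated update below replaces the modified LSTM that Section 3 specifies, and all sizes and the mixing matrix `W` are hypothetical.

```python
import numpy as np

def gap(feature_map):
    """Global average pooling: (H, W, N) -> (N,) channel summary (part 1)."""
    return feature_map.mean(axis=(0, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dia_step(feature_map, h_prev, c_prev, W):
    """One DIA step. `W` is a single toy mixing matrix standing in for the
    LSTM's gates (part 2); the attention vector rescales channels (part 3)."""
    g = gap(feature_map)                      # extract global information
    z = np.concatenate([g, h_prev]) @ W       # fuse with the carried state
    c = 0.5 * c_prev + 0.5 * np.tanh(z)       # toy cell-state update
    h = sigmoid(c)                            # attention vector in (0, 1)
    return feature_map * h, h, c              # channel-wise re-weighting

rng = np.random.default_rng(0)
N = 16
W = rng.normal(scale=0.1, size=(2 * N, N))
h = c = np.zeros(N)
x = rng.normal(size=(8, 8, N))
for _ in range(3):    # the same W (i.e., the same shared unit) serves every layer
    x, h, c = dia_step(x, h, c, W)
```

Note how the loop reuses one `W` at every layer: that reuse is what makes the parameter cost independent of depth.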
The LSTM in the DIA unit bridges the current layer and the preceding layers so that the DIA unit can adaptively learn the non-linear relationship between features at two different scales. The first scale of features is the internal information of the current layer, and the second scale represents the outer information from the preceding layers. The non-linear relationship between these two scales benefits attention modeling for the current layer. Additionally, the channel-wise multiplication in the final step of the DIA unit is similar to the scaling operation of Batch Normalization; indeed, the empirical results show that DIA has a non-local normalization effect. With our modified LSTM, the additional parameter cost of the DIA unit is economical.
In Section 2, some attention-based networks are reviewed, and the differences between them and ours are discussed. In Section 3, the formal definition of the Dense-and-Implicit Attention Network is introduced. In Section 4, we conduct experiments on benchmark datasets to empirically demonstrate the effectiveness of the DIA unit. In Section 5, the influence of hyper-parameters on our model is studied experimentally. Finally, in Section 6, some evidence of long-distance dependency in our model is shown, and the normalization effects of the DIA unit are also investigated.
Attention Mechanism. The recurrent attention model of [mnih2014recurrent] uses the attention mechanism in image classification by utilizing a recurrent neural network to select and process local regions at high resolution sequentially. Concurrent attention-based methods tend to construct operation modules to capture non-local information in an image [wang2018non; cao2019GCNet] and to model the interrelationship between channel-wise features [hu2018squeeze; hu2018gather]. The combination of multi-level attention is also widely studied [park2018bam; woo2018cbam; DBLP:journals/corr/abs-1904-04402; Wang_2017_CVPR]. Prior works [wang2018non; cao2019GCNet; hu2018squeeze; hu2018gather; park2018bam; woo2018cbam; DBLP:journals/corr/abs-1904-04402] usually insert an attention module into each layer independently. In this work, the DIA unit is innovatively shared by all the layers in the same stage of the network, and existing attention modules can readily be composited into the DIA unit. Besides, we adopt a global average pooling in part 1⃝ to extract global information and a channel-wise multiplication in part 3⃝ to emphasize the importance of features, which is similar to SENet [hu2018squeeze].
Dense Network Topology. DenseNet, proposed in [huang2017densely], connects all pairs of layers directly with an identity map. Through reusing features, DenseNet enjoys higher parameter efficiency, better generalization capacity, and easier training than alternative architectures [lin2013network; he2016deep; srivastava2015highway]. Instead of explicit dense connections, the DIA unit implicitly links layers at different depths via a shared module, which leads to dense connections. Empirical results in Section 6 show that the DIA unit has a similar effect to DenseNet.
Multi-level Feature Integration. The study [wolf2006critical] experimentally shows that even a simple aggregation of low-level visual features sampled from a wide inception field can be efficient and robust for context representation, which inspired [hu2018squeeze; hu2018gather] to incorporate multi-level features to improve the network representation. [li2016multi] also demonstrates that by biasing the feature response in each convolutional layer using different activation functions, deeper layers can better capture abstract patterns in a DNN. In the DIA unit, the highly non-linear relationship between multi-scale features is learned and integrated via the LSTM module, which is conducive to better attention modeling and improved learning performance, as reported in Sections 4, 5 and 6.
In this section, we formally introduce the DIA unit and elaborate on how it implicitly connects all layers densely. In what follows, a network built with DIA units is referred to as a DIANet.
As shown in Figure 2, when a DIANet is built with a residual network [he2016deep], the input of the t-th layer is x_t ∈ R^{W×H×N}, where W, H and N denote the width, height and number of channels, respectively. f_t is the residual mapping at the t-th layer with parameters W_t, as introduced in [he2016deep]. Let y_t = f_t(x_t, W_t). Next, a global average pooling, denoted GAP(·), is applied to y_t to extract global information from the features in the current layer. Then GAP(y_t) is passed to the LSTM along with a hidden state vector h_{t-1} and a cell state vector c_{t-1} (h_0 and c_0 are initialized as zero vectors). The LSTM finally generates the current hidden state vector h_t and cell state vector c_t as

    (h_t, c_t) = LSTM(GAP(y_t), h_{t-1}, c_{t-1}).
In our model, the hidden state vector h_t is regarded as an attention vector to adaptively recalibrate feature maps. We apply channel-wise multiplication to emphasize the important features, i.e., ỹ_t = h_t ⊙ y_t, and obtain x_{t+1} after the skip connection, i.e., x_{t+1} = x_t + ỹ_t. Table 1 shows the formulations of ResNet, SENet, and DIANet, and part (b) is the main difference between them. The LSTM module is used repeatedly and shared across different layers in parallel to the network backbone; therefore the number of parameters in the LSTM does not depend on the number of layers in the backbone. SENet utilizes an attention module consisting of fully connected layers to model the channel-wise dependency for each layer independently [hu2018squeeze]. The total number of parameters brought by these add-in modules therefore depends on, and increases with, the number of layers in the backbone.
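To make the parameter-sharing argument concrete, here is a back-of-the-envelope count. The module size formula below is a toy SE-style stand-in (two fully connected layers with a reduction ratio), not the exact counts from Table 1: a per-layer scheme pays the module cost at every layer, while a shared DIA-style scheme pays it once per stage.

```python
def per_layer_cost(n_channels, n_layers, module_params):
    """Total add-in parameters when a fresh module is inserted at every layer."""
    return n_layers * module_params(n_channels)

def shared_cost(n_channels, n_layers, module_params):
    """Total add-in parameters when one module is shared by all layers."""
    return module_params(n_channels)          # independent of n_layers

def se_style_params(n, r=4):
    """Two fully connected layers with reduction ratio r (bias omitted)."""
    return 2 * n * n // r

depths = (18, 54, 110)
per_layer = [per_layer_cost(64, d, se_style_params) for d in depths]  # grows with depth
shared = [shared_cost(64, d, se_style_params) for d in depths]        # stays constant
```

At N = 64 channels, the per-layer scheme costs 110 modules at depth 110, while the shared scheme still costs exactly one.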
Compared with a standard LSTM module, there are two modifications in our proposed LSTM: 1) a shared linear transformation to reduce the input dimension; 2) a carefully selected activation function for better performance.
A standard LSTM consists of four linear transformation layers, as shown in Figure 3 (Left). Since the input x_t, the hidden state h_{t-1} and the outputs all have the same dimension N, the standard LSTM causes a parameter increment of 8N², as derived in the Appendix. When the number of channels N is large, the parameter increment of the add-in LSTM module in the DIA unit will be over 8 million, which can hardly be tolerated.
Hence, to avoid such a scenario, we propose a modified LSTM (denoted as DIA-LSTM) as follows:
(1) Activation Function. We change the output layer's activation function from tanh to sigmoid; further discussion is presented in the ablation study.
(2) Parameter Reduction. As shown in Figure 3 (Left), x_t and h_{t-1} are passed to four linear transformation layers with the same input and output dimension N. In the DIA-LSTM, a linear transformation layer (denoted "Linear1" in Figure 3 (Right)) with a smaller output dimension is applied to x_t and h_{t-1}. We use a reduction ratio r in Linear1. Specifically, we reduce the dimension of the input from N to N/r and then apply an activation function to increase the non-linearity of this module. The dimension is changed back to N when the output is passed to the four linear transformation layers. This modification enhances the relationship between the inputs of the different parts of the DIA-LSTM and also effectively reduces the number of parameters by sharing a linear transformation for dimension reduction. The parameter increment reduces from 8N² to 10N²/r, as shown in the Appendix, and we find that with an appropriate reduction ratio r we can achieve a good trade-off between parameter reduction and the performance of DIANet. Further experimental results are discussed in the ablation study.
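A minimal numpy sketch of the modified cell, following our reading of Figure 3 (Right), is given below. The exact wiring of the shared Linear1 and the gate layers in the released code may differ; sizes here are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DIALSTMCell:
    """Toy DIA-LSTM cell: a shared Linear1 reduces each input from N to N/r
    before the four gate layers, and sigmoid replaces tanh on the output path."""
    def __init__(self, n, r=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W_red = rng.normal(scale=0.1, size=(n, n // r))          # shared Linear1
        self.W_gates = rng.normal(scale=0.1, size=(4, 2 * n // r, n)) # i, f, g, o

    def __call__(self, x, h, c):
        zx = np.maximum(0.0, x @ self.W_red)   # reduce x, add non-linearity (ReLU)
        zh = np.maximum(0.0, h @ self.W_red)   # the same Linear1 is reused for h
        z = np.concatenate([zx, zh])
        i, f, g, o = (z @ W for W in self.W_gates)
        c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h_new = sigmoid(o) * sigmoid(c_new)    # sigmoid, not tanh, at the output
        return h_new, c_new

n, r = 64, 4
cell = DIALSTMCell(n, r)
h = c = np.zeros(n)
for _ in range(5):                             # one call per layer in a stage
    h, c = cell(np.ones(n), h, c)
```

Because the output path ends in sigmoid, the attention vector h always lies in (0, 1), which is what makes it suitable for multiplicative channel re-weighting.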
Implicit and Dense Connection. We now illustrate how the DIA unit connects all the layers in the same stage implicitly and densely. Consider a stage consisting of many layers. As shown in Figure 4 (Left), in the explicit structure of DIANet, a layer appears not to be connected to the other layers except through the network backbone. However, as shown in Figure 4 (Right), there are indirect connections between the current layer and the preceding layers with the help of the shared DIA unit. Unlike DenseNet [huang2017densely], which concatenates the raw features of all previous layers directly, a layer receives the information of the preceding layers through data-driven learning within the DIA unit. We call these implicit connections. Since the DIA unit is shared, there is communication between every pair of layers, which leads to dense connections over all layers.
In this section, we evaluate the performance of the DIA unit on image classification tasks and empirically demonstrate its effectiveness. We conduct experiments with popular networks on benchmark datasets. Since SENet [hu2018squeeze] is also a channel-wise attention model, we compare DIANet with SENet. For a fair comparison, we adjust the reduction ratio r such that the number of parameters of DIANet is similar to that of SENet.
Dataset and Model. We conduct experiments on CIFAR10, CIFAR100 [cifar], and ImageNet 2012 [ILSVRC15] using ResNet [he2016deep], PreResNet [he2016identity], WRN [wrn] and ResNeXt [xie2017aggregated]. CIFAR10 and CIFAR100 each have 50k training images and 10k test images of size 32 by 32, with 10 and 100 classes respectively. The ImageNet 2012 dataset [ILSVRC15] comprises 1.28 million training images and 50k validation images from 1000 classes; random cropping of size 224 by 224 is used in our experiments. The implementation details can be found in the Appendix.
Image Classification. As shown in Table 2, DIANet improves the testing accuracy significantly over the original networks and consistently outperforms SENet across different datasets. In particular, the performance improvement of ResNet with the DIA unit is the most remarkable. Given the popularity of ResNet, the DIA unit may also be applied to other computer vision tasks.
In this section, we conduct ablation experiments to explore how to better embed the DIA unit in different network structures and to gain a deeper understanding of the role of each component in the DIA unit. All ablation experiments are performed on the CIFAR-100 dataset with ResNet. First, we discuss the effect of the reduction ratio introduced in Section 3.2. Then the performance of the DIA unit in ResNets of different depths is explored. We also test the performance of the DIA-LSTM with different activation functions at the output layer, and study the capacity of stacked DIA-LSTMs with different numbers of DIA-LSTM modules.
The reduction ratio r is the only hyperparameter in our DIANet, as mentioned in Section 3.2. Improving performance with a light parameter increment is one of the main characteristics of our model, and this part investigates the trade-off between model complexity and performance. As shown in Table 4, the testing accuracy of DIANet declines slightly as the reduction ratio increases. In particular, in one setting the parameter increment is only 0.05M compared with ResNet164, while the testing accuracy of DIANet is 76.50% versus 73.43% for the original network. This gives the DIA unit potential in a variety of practical applications, especially those in which small model size is important.
The depth of the neural network.
Generally, in practice, deep DNNs with a large number of parameters do not guarantee sufficient performance improvement since, on the one hand, deeper networks probably contain extreme feature and parameter redundancy [huang2017densely]. On the other hand, the gradient vanishing problem worsens and makes the DNN hard to train [bengio1994learning; glorot2010understanding; srivastava2015training]. Therefore, designing new structures for deep neural networks [he2016deep; huang2017densely; srivastava2015training; hu2018squeeze; hu2018gather; wang2018non] is of great necessity. Since the DIA unit also changes the network topology of the DNN backbone, evaluating the effectiveness of the DIANet structure is of great importance. Here we investigate how the depth of the DNN influences DIANet in two parts: (1) the performance of DIANet compared to SENet at varying depth; (2) the parameter increment of DIANet in the DNN. The results in Table 4 show that as the depth of the ResNet increases from 83 layers to 407 layers, DIANet achieves a higher classification accuracy improvement than SENet with a smaller number of parameters. Moreover, DIANet83 (for simplicity, 83 denotes the depth of the ResNet backbone) achieves a result competitive with SENet245, and DIANet164 outperforms all the SENet results with at least 1.13% and at most 58.8% parameter reduction. This implies that DIANet has higher parameter efficiency than SENet. The results also suggest that, for DIANet, as shown in Figure 2, a deeper network means the DIA-LSTM module passes through more layers recurrently; the DIA-LSTM can handle the interrelationship between the information of different layers in much deeper DNNs and better capture the long-distance dependency between layers. Therefore DIANet can effectively avoid feature redundancy.
Activation function and the number of stacked DIA-LSTM modules. We choose two different activation functions (tanh and sigmoid) for the output layer of the DIA-LSTM and two different numbers (one and three) of stacked DIA-LSTM cells to explore the effects of these two factors on classification performance. In Table 5, we find that the performance is significantly improved after replacing the tanh in the standard LSTM with sigmoid. As shown in Figure 3 (Right), this activation function is located in the output layer, where it directly modulates the effect of the memory unit on the output of the output gate.
When we use tanh in the output layer of the DIA-LSTM, increasing the number of stacked DIA-LSTM modules does not necessarily lead to performance improvement and may even lead to considerable performance degradation. However, when we choose sigmoid, the situation is different. This suggests that for the DIA unit to be effective, fine structural adjustments are necessary.
In this section, we study some properties of DIANet, including feature integration and the normalization effect. In DIANet, the deeper layers connect to the shallower layers via the DIA-LSTM module. First, a random forest model [Gregorutti2017Correlation] is used to visualize how the current layer depends on the preceding layers. Second, we study the normalization effect of DIANet by removing Batch Normalization [Ioffe:2015:BNA:3045118.3045167] from the networks. In addition, tests on the removal of skip connections and data augmentation show some positive side effects of DIANet.
Feature Integration. Here we try to understand the dense connection from a numerical perspective. As shown in Figures 2 and 4, the DIA-LSTM bridges the connections between layers by propagating information forward through the hidden states h_t and cell states c_t. Moreover, the hidden states from different layers are also integrated within the DIA-LSTM. Notably, h_t is applied directly to the features in the network at each layer t. Therefore, the relationship between the hidden states at different layers reflects, to some degree, the connection strength between those layers. We explore the nonlinear relationship between the hidden state h_t of the DIA-LSTM and the preceding hidden states h_1, ..., h_{t-1}, and visualize how the information coming from h_1, ..., h_{t-1} contributes to h_t. To reveal this relationship, we use a random forest to visualize variable importance. The random forest can return the contribution of each input variable to the output separately in the form of an importance measure, e.g., Gini importance [Gregorutti2017Correlation]; the computation details can be found in the Appendix. Taking h_1, ..., h_{t-1} as input variables and h_t as the output variable, we compute the Gini importance of each input variable. ResNet164 contains three stages, and each stage consists of 18 layers; we perform the Gini importance computation for each stage separately. As shown in Figure 5, each row presents the importance of the source layers contributing to the target layer. In each sub-graph of Figure 5, the diversity of the variable importance distribution indicates that the current layer utilizes the information of the preceding layers. The interaction between shallow and deep layers in the same stage reveals the effect of the implicitly dense connection. In particular, taking the last row of stage 1 as an example, the layers immediately preceding the target do not necessarily provide the most information; an earlier layer contributes more. We conclude that the DIA unit can adaptively integrate information across multiple layers.
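The same kind of probe can be reproduced in miniature with scikit-learn. The data and the dependence pattern below are invented for illustration; in the paper the forest is fitted on the actual DIA-LSTM hidden states.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_samples, n_layers = 300, 7
# Toy stand-ins for per-layer summaries of the hidden states h_1, ..., h_7.
H = rng.normal(size=(n_samples, n_layers))
# A synthetic "h_8" that depends mostly on layer 3 (index 2), slightly on layer 6.
h_target = 0.9 * H[:, 2] + 0.1 * H[:, 5] + 0.05 * rng.normal(size=n_samples)

rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(H, h_target)
gini_importance = rf.feature_importances_   # one normalized score per source layer
```

Plotting `gini_importance` for each target layer, as in Figure 5, shows which earlier layers the target actually draws on; here the forest correctly singles out the planted dependence on layer 3.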
Moreover, in Figure 5 (stage 3), the information interaction with previous layers in stage 3 is more intense and frequent than in the first two stages. Correspondingly, as shown in Table 6 (Left), when we remove the DIA unit in stage 3, the classification accuracy decreases from 76.67 to 75.40. However, when it is removed in stage 1 or 2, the performance degradation is similar, falling to 76.27 and 76.25 respectively. Also note that for DIANet, the parameter increment in stage 2 is much larger than that in stage 1. This implies that the significant performance degradation after the removal of stage 3 may be due not only to the reduction in the number of parameters but also to the lack of dense feature integration.
Normalization Effect of DIANet. Small changes in shallower hidden layers may be amplified as information propagates through a deep architecture and sometimes result in numerical explosion. Batch Normalization (BN) [Ioffe:2015:BNA:3045118.3045167] is widely used in modern deep networks since it stabilizes training by standardizing the input of each layer. The DIA unit readjusts the feature maps by channel-wise multiplication, which plays a scaling role similar to BN. In this part, we empirically show that the DIA unit has a normalization effect. As shown in Table 7, models of varying depth are trained on CIFAR-100 with the BN layers removed to eliminate their normalization effect. The experiments are conducted on a single GPU with batch size 128 and initial learning rate 0.1. Both the original ResNet and SENet face numerical explosion without BN, while DIANet can be trained with depth up to 245. Besides, compared with Table 4, the testing accuracy of DIANet without BN still reaches 70%. The difference between the performance of the network with BN and the network with both BN and DIA indicates that the normalization effects of BN and of DIANet are different. BN learns the shift and scaling parameters by utilizing local information, i.e., the current layer and the current batch of data. The scaling learned by DIANet, however, integrates information from preceding layers and enables the network to choose a better scaling for each feature map. The combination of BN and DIANet learns not only the local scaling but also the non-local scaling.
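The contrast between the two kinds of scaling can be seen in a toy numpy example (illustrative only): BN standardizes each channel using the current batch's statistics, while a DIA-style attention vector rescales channels multiplicatively using information that, in the real network, would come from the preceding layers rather than the random stand-in used here.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(128, 16))   # batch of 128, 16 channels

# Batch Normalization: a *local* rescaling from the current batch's statistics.
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + 1e-5)

# DIA-style scaling: a *non-local* multiplicative factor per channel, produced
# by the recurrent unit from preceding layers (here just a stand-in vector).
attention = 1.0 / (1.0 + np.exp(-rng.normal(size=16)))  # sigmoid output in (0, 1)
dia = x * attention
```

BN forces each channel to zero mean and unit variance regardless of content; the DIA-style scaling preserves the channel's distribution shape and only modulates its magnitude.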
Removal of skip connections. The skip connection has become a necessary structure for training DNNs [he2016identity]. Without skip connections, a DNN is hard to train due to problems such as vanishing gradients. Figure 6(a-b) shows the training curves when we remove the last one-third of the skip connections in each stage. We find that the decline of the training loss in DIANet is more stable, and DIANet finally achieves a smaller training loss. At the same time, DIANet has higher test accuracy than SENet and the original ResNet. To some extent, this shows that the DIA unit can alleviate gradient vanishing in deep network training.
Without data augmentation. Explicit dense connections may help bring more efficient usage of parameters, which makes a neural network less prone to overfitting [huang2017densely]. Although the dense connections in DIA are implicit, DIANet still shows the ability to reduce overfitting. To verify this, we train the models without data augmentation to remove its regularization influence. As shown in Table 6 (Right), DIANet achieves a lower testing error than ResNet164 and SENet. To some extent, the implicit and dense structure of DIANet may have a regularization effect.
In this paper, we proposed a Dense-and-Implicit Attention (DIA) unit to enhance the generalization capacity of deep neural networks by recurrently fusing feature attention throughout different layers. Experiments showed that the DIA unit can be universally applied to different network architectures and improves their performance. The DIA unit is also supported by empirical analysis and an extensive ablation study. Notably, the DIA unit can be constructed as a global network module and can implicitly change the topology of existing network backbones.
S. Liang and H. Yang gratefully acknowledge the support of the National Supercomputing Centre (NSCC) Singapore [nscc] and High Performance Computing (HPC) of the National University of Singapore for providing computational resources, and the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. We sincerely thank Xin Wang from Tsinghua University for providing personal computing resources. H. Yang thanks the Department of Mathematics at the National University of Singapore for its start-up grant, and the Ministry of Education in Singapore for the grant MOE2018-T2-2-147.
K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 2048–2057. JMLR.org, 2015.
A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision (ECCV), pages 483–499. Springer, 2016.
P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6077–6086, 2018.
J. Hu, L. Shen, S. Albanie, G. Sun, and A. Vedaldi. Gather-excite: Exploiting feature context in convolutional neural networks. In Advances in Neural Information Processing Systems, pages 9401–9411, 2018.
X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 249–256, 2010.
| Argument | Description |
| --- | --- |
| Batch size | train batch size |
| Epoch | number of total epochs to run |
| depth | the depth of the network |
| schedule | decrease learning rate at these epochs |
| gamma | learning rate is multiplied by gamma on schedule |
| cardinality | model cardinality (group) |
| lr | initial learning rate |
Suppose the input x_t is of size N and the hidden state vector h_{t-1} is also of size N.
Standard LSTM. As shown in Figure 3 (Left), the standard LSTM requires 4 linear transformations to control the information flow, applied to the inputs x_t and h_{t-1} respectively. The output size is set to N, and to simplify the calculation the bias is omitted. Therefore, for x_t, the number of parameters of the 4 linear transformations equals 4N². Similarly, the number of parameters of the 4 linear transformations with input h_{t-1} equals 4N². The total number of parameters equals 8N².
DIA-LSTM. As shown in Figure 3 (Right), there is a linear transformation to reduce the dimension at the beginning: the input dimension is reduced from N to N/r, so the number of parameters of this linear transformation equals N²/r. The output is then passed into 4 linear transformations as in the standard LSTM, whose number of parameters equals 4 × (N/r) × N = 4N²/r. Therefore, for input x_t and reduction ratio r, the number of parameters equals 5N²/r. Similarly, the number of parameters with input h_{t-1} is the same as that for x_t. The total number of parameters equals 10N²/r.
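These two counts can be checked mechanically (bias terms omitted, as above):

```python
def standard_lstm_params(n):
    """4 gate layers for x_t plus 4 for h_{t-1}, each N x N: 8 * N^2."""
    return 4 * n * n + 4 * n * n

def dia_lstm_params(n, r):
    """Per input: one N -> N/r reduction plus 4 (N/r) -> N gate layers,
    i.e. 5 * N^2 / r; counted for both x_t and h_{t-1}: 10 * N^2 / r."""
    per_input = n * (n // r) + 4 * (n // r) * n
    return 2 * per_input
```

For N = 256 and r = 4, this gives 163,840 parameters for the DIA-LSTM versus 524,288 for the standard cell.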