Attention as Activation

Activation functions and attention mechanisms are typically treated as having different purposes and have evolved differently. However, both concepts can be formulated as a non-linear gating function. Inspired by their similarity, we propose a novel type of activation units, called attentional activation (ATAC) units, as a unification of activation functions and attention mechanisms. In particular, we propose a local channel attention module for simultaneous non-linear activation and element-wise feature refinement, which locally aggregates point-wise, cross-channel feature contexts. By replacing the well-known rectified linear units with such ATAC units in convolutional networks, we can construct fully attentional networks that perform significantly better with a modest number of additional parameters. We conducted detailed ablation studies on the ATAC units using several host networks with varying network depths to empirically verify the effectiveness and efficiency of the units. Furthermore, we compared the performance of the ATAC units against existing activation functions as well as other attention mechanisms on the CIFAR-10, CIFAR-100, and ImageNet datasets. Our experimental results show that networks constructed with the proposed ATAC units generally yield performance gains over their competitors given a comparable number of parameters.


I Introduction

Key technological advances in deep learning include the recent development of advanced attention mechanisms [10] and activation functions [18]. While both attention mechanisms and activation functions represent non-linear operations, they are generally treated as two different concepts and have been improved and extended in different directions over recent years. The development of novel attention mechanisms has led to more sophisticated and computationally heavy network structures, which aim to capture long-range spatial interactions or global contexts [9, 24, 2]. In contrast, activation functions, whether designed by hand [12] or found via automatic search techniques [20], have remained scalar and straightforward.

Recently, a few learning-based approaches [5, 13] have been proposed that augment the rectified linear unit (ReLU) with learnable parameters. However, such learning-based activation functions still suffer from the following shortcomings: Firstly, current work on activation functions is dedicated to augmenting ReLUs with a negative, learnable, or context-dependent slope [17, 5, 22], self-normalizing properties [12], or learnable spatial connections [13]. However, recent research on neural architecture search [20] shows that activation functions that abandon the form of ReLU tend to work better on deeper models across many challenging datasets. Secondly, current learnable activation units either impose a global slope on the whole feature map [5, 22] or aggregate feature context from a large spatial scope [13]. However, in contrast to attention mechanisms, which favor global context, activation functions have a strong inclination towards local context and a point-wise gating manner (also see section IV-B). Thirdly, the ReLU remains the default activation function for deep neural networks since none of its hand-designed alternatives has managed to show consistent performance improvements across different models and datasets.

In this work, we propose the attentional activation (ATAC) unit, a novel dynamic and context-aware activation function, to tackle the above shortcomings. One of our key observations is that both the attention mechanism and the concept of an activation function can be formulated as a non-linear adaptive gating function (see section III-A). More precisely, the activation unit is a non-context-aware attention module, while the attention mechanism can be seen as a context-aware activation function. Besides introducing non-linearity, our ATAC units enable networks to conduct a layer-wise, context-aware feature refinement:

  1. The ATAC units differ from the standard layout of ReLUs and offer a generalized approach to unify the concepts of activation functions and attention mechanisms under the same framework of non-linear gating functions.

  2. To meet both the locality of activation functions and the contextual aggregation of attention mechanisms, we propose a local channel attention module, which aggregates point-wise, cross-channel feature contextual information.

  3. The ATAC units make it possible to construct fully attentional networks that perform significantly better with a modest number of additional parameters.

We conduct extensive ablation studies to investigate the importance of locality and the efficiency of attentional activation as well as fully attentional networks. To demonstrate the effectiveness of our ATAC units, we also compare them with other activation functions and state-of-the-art attention mechanisms. Our experimental results indicate that, given a comparable number of parameters, the models based on our ATAC units outperform other state-of-the-art networks on the well-known CIFAR-10, CIFAR-100, and ImageNet datasets. All the source code and trained models are made publicly available at https://github.com/YimianDai/open-atac.

II Related Work

We start by revisiting both activation units as well as modern attention mechanisms used in the context of deep learning.

II-A Activation Units

Activation functions are an integral part of neural networks. Given finite networks, a better activation function improves convergence and might yield a superior performance. The ReLU has been a breakthrough that has alleviated the vanishing gradient problem and has played an important role in the training of the first “deep” neural network instances [18]. However, it also suffers from the so-called “dying ReLU” problem, which describes the phenomenon that a weight might become zero and does not recover from this state anymore during training. To alleviate this problem, instead of assigning zero to all negative inputs, the leaky rectified linear unit (Leaky ReLU) assigns the value $\alpha x$ to all negative inputs $x$, where the slope $\alpha$ (e.g., $\alpha = 0.01$) is a hyper-parameter [17]. The Gaussian error linear unit (GELU) weighs inputs by their magnitude in a non-linear manner, instead of gating inputs by their sign as it is done by ReLUs [8]. Another variant is the scaled exponential linear unit (SELU), which induces self-normalizing properties to enable a high-level abstract representation [12]. In contrast to such hand-designed units, Ramachandran et al. have leveraged automatic search techniques and propose the so-called Swish activation function [20], which resorts to element-wise gating as non-linearity (i.e., $x \cdot \sigma(x)$, where $\sigma$ is the sigmoid function) and which can be seen as a special non-context-aware case of the ATAC unit proposed in this work.

Another way to extend ReLUs is to introduce learnable parameters. The parametric rectified linear unit (PReLU) learns the optimal slope for the negative region during training [5]. Yang et al. and Agostinelli et al. have generalized this form to a learnable piece-wise linear function formulated as a parametrized sum of hinge-shaped functions [25, 1]. Recently, Kligvasser et al. proposed the so-called xUnit, a learnable non-linear function that augments ReLU with spatial connections [13]. Our ATAC unit follows the idea of learnable activation functions, but differs in at least two important aspects: Firstly, we abandon the idea of improving and augmenting ReLUs as most works do [17, 8, 12, 5, 22, 13]. Instead, we adopt a lightweight and more flexible attention mechanism as activation function. Secondly, we emphasize the importance of locality in ATAC units, i.e., not only should the feature context be locally aggregated, but the attentional weights should also be applied in a point-wise and individual manner.

II-B Attention Mechanisms

Motivated by their success in natural language processing tasks, attention mechanisms have also been widely employed in computer vision applications [10, 9]. Most of them are implemented as one or more “pluggable” modules that are integrated into the middle or final blocks of existing networks. The key idea is to adaptively recalibrate feature maps using weights that are dynamically generated via context aggregation on the feature maps. The squeeze-and-excitation network (SENet) [10], the last winner of the popular ImageNet challenge [21], re-weighs channel-wise feature responses by explicitly modeling inter-dependencies between channels via a bottleneck fully-connected (FC) module. From the perspective of this work, we refer to SENet as a global channel attention module since it aggregates global context via global average pooling (GAP) over entire feature maps. A popular strategy for improving the performance of attention modules is to incorporate long-range spatial interactions. Woo et al. propose the so-called convolutional block attention module to sequentially apply channel and spatial attention modules to learn “what” and “where” to focus [24]. The gather-excite network (GENet) efficiently aggregates feature responses from a large spatial extent by depth-wise convolution (DWConv) and redistributes the pooled information to local features [9]. To capture long-range interactions, attention augmented convolutional neural networks [2] replace convolutions by a two-dimensional relative self-attention mechanism as a stand-alone computational primitive for image classification.

In table I, we provide a brief summary of related feature context aggregation schemes, in which PWConv denotes the point-wise convolution [16] and ShrinkChannel denotes global average pooling along the channel axis. In contrast to the works sketched above, which aim at refining feature maps based on a wide range of contextual information, the proposed local channel attention module emphasizes point-wise, channel-wise context aggregation, which we consider a key ingredient for better activation performance. Another difference of the proposed ATAC unit is that it goes beyond block/module-level refinement to a layer-wise attentional modulation within activation functions, enabling us to build fully attentional networks.

Scale          Interaction   Reference
Global         Spatial       [3, 24]
Global         Channel-wise  [10, 15]
Spatial scope  Spatial       [13, 9]
Point-wise     Channel-wise  ours

TABLE I: Context aggregation schemes in attention modules

We notice that, concurrently with our work, Ramachandran et al. developed a local self-attention layer to replace spatial convolutions, which also provides a way to build fully attentional models [19]. Their experimental analysis indicates that using self-attention in the initial layers of a convolutional network yields worse results compared to using the convolutional stem. However, our ATAC units do not suffer from this problem. More precisely, in terms of increased parameters, using ATAC units in the initial layers is the most cost-effective solution (see section IV-B). We think that the differences can be explained by the different usage of the attention mechanism. In [19], the self-attention layer is used to learn useful features, while the raw pixels at the stem layer are individually uninformative and heavily spatially correlated, which is difficult for content-based mechanisms. In contrast, our ATAC units are only responsible for activating and refining the features extracted by convolutions.

III Attentional Activation

The ATAC units proposed in this work depict a unification of activation functions and attention mechanisms. Below, we first provide a unified framework for these two concepts. Afterwards, we describe the unit followed by a description of how to integrate it into common neural network architectures.

III-A Unification of Attention and Activation

Given an intermediate feature map $\mathbf{X} \in \mathbb{R}^{C \times H \times W}$ with $C$ channels and feature maps of size $H \times W$, the transformation induced by attention mechanisms can be expressed as

$\mathbf{X}' = \mathcal{G}(\mathbf{X}) \otimes \mathbf{X}$   (1)

where $\otimes$ denotes the element-wise multiplication and $\mathcal{G}(\mathbf{X}) \in \mathbb{R}^{C \times H \times W}$ is a three-dimensional weight map generated by the corresponding attention gating module $\mathcal{G}$. Here, the output of the gating module depends on the whole feature map $\mathbf{X}$. Thus, given a specific position $(c, h, w)$, one can rewrite eq. 1 in a scalar form as

$x'_{c,h,w} = \mathcal{G}_{c,h,w}(\mathbf{X}) \cdot x_{c,h,w}$   (2)

where $\mathcal{G}_{c,h,w}$ is a complex gating function. Note that, given a position $(c, h, w)$, the function $\mathcal{G}_{c,h,w}$ is responsible for aggregating the relevant feature context, generating attention weights, and broadcasting as well as picking the weight for $x_{c,h,w}$.

Meanwhile, an activation function can also be formulated in the form of a gating function [13] in the following way:

$x'_{c,h,w} = g(x_{c,h,w}) \cdot x_{c,h,w}$   (3)

For instance, given the ReLU activation function, the scalar gating function $g$ in eq. 3 is the indicator function $g(x) = \mathbb{1}(x > 0)$. For the Swish activation function, $g$ is the sigmoid function $\sigma$. For the sinusoidal representation network (SIREN) [23] unit, $g$ is the sinc function $\sin(x)/x$. Other activation functions follow this formulation in a similar way.

Comparing eq. 2 and eq. 3, it can be seen that both the attention mechanism and the concept of activation functions give rise to non-linear adaptive gating functions. Apart from their specific forms, their only difference is that the activation gating function takes a scalar as input and outputs a scalar, whereas the attention gating function takes a larger spatial, global, or cross-channel feature context as input. This connection between activation functions and attention mechanisms motivates the use of lightweight attention modules as activation functions. Besides introducing non-linearity, such attentional activation units enable the networks to conduct an adaptive, layer-wise, context-aware feature refinement.
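To make the shared gating view concrete, the following minimal sketch writes ReLU and Swish as scalar gates in the sense of eq. 3 and an SE-style channel attention as a context-dependent gate in the sense of eq. 2. It is written in PyTorch purely for illustration (the paper itself is implemented in MXNet), and all tensor shapes and the choice of a global-average-pooling context are illustrative assumptions.

```python
# Minimal sketch of the unified gating view (eq. 2 vs. eq. 3); shapes are illustrative.
import torch

x = torch.randn(1, 4, 8, 8)  # a feature map with C = 4 channels of size 8 x 8

# Activation as gating (eq. 3): the gate g depends only on the scalar itself.
relu_gate = (x > 0).float()          # ReLU: g(x) = 1[x > 0]
swish_gate = torch.sigmoid(x)        # Swish: g(x) = sigmoid(x)
relu_out, swish_out = relu_gate * x, swish_gate * x

# Attention as gating (eq. 2): the gate depends on a feature context, here a
# global channel descriptor as in SE-style attention (illustrative choice).
context = x.mean(dim=(2, 3), keepdim=True)   # global average pooling, C x 1 x 1
attn_gate = torch.sigmoid(context)           # one weight per channel
attn_out = attn_gate * x                     # broadcast over H x W

print(relu_out.shape, swish_out.shape, attn_out.shape)
```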

III-B Local Channel Attention Module

Fig. 1: The proposed attentional activation (ATAC) unit

In this work, we use attention modules as activation functions throughout the entire network architecture. Therefore, the attentional activation functions must be computationally cheap. To simultaneously satisfy the locality requirements of activation functions and the contextual aggregation requirements of attention mechanisms, our ATAC units resort to point-wise convolutions [16] to realize local attention. These are a perfect fit since they model cross-channel correlations in a point-wise manner and require only a few parameters.

The architecture of the proposed local channel attention based attentional activation unit is illustrated in fig. 1: The goal is to enable the network to selectively and element-wisely activate and refine the features according to the point-wise cross-channel correlations. To save parameters, the attentional weight $\mathbf{A}(\mathbf{X})$ is computed via a bottleneck structure as follows:

$\mathbf{A}(\mathbf{X}) = \sigma\bigl(\mathcal{B}\bigl(\mathrm{PWConv}_2\bigl(\delta\bigl(\mathcal{B}\bigl(\mathrm{PWConv}_1(\mathbf{X})\bigr)\bigr)\bigr)\bigr)\bigr)$   (4)

Here, $\delta$ is the ReLU activation function, $\sigma$ is the sigmoid function, and $\mathcal{B}$ denotes the batch normalization operator [11]. The kernel size of $\mathrm{PWConv}_1$ is $\frac{C}{r} \times C \times 1 \times 1$ and the kernel size of $\mathrm{PWConv}_2$ is $C \times \frac{C}{r} \times 1 \times 1$, where the parameter $r$ is the channel reduction ratio. It is noteworthy that $\mathbf{A}(\mathbf{X})$ has the same shape as the input feature map and can thus be used to activate and highlight subtle details in a local manner, both spatially and across channels. Finally, the activated feature map $\mathbf{X}'$ is obtained via an element-wise multiplication with $\mathbf{A}(\mathbf{X})$:

$\mathbf{X}' = \mathbf{A}(\mathbf{X}) \otimes \mathbf{X}$   (5)
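A minimal PyTorch sketch of eq. 4 and eq. 5 is given below. The authors' reference implementation is in MXNet (see the repository linked in section I); the default reduction ratio of r = 4 and the bias-free convolutions are our assumptions for illustration.

```python
import torch
import torch.nn as nn

class ATAC(nn.Module):
    """Sketch of the attentional activation (ATAC) unit of eq. 4/5: a bottleneck
    of two point-wise convolutions with batch normalization, a ReLU in between,
    and a sigmoid that yields an element-wise gate of the same shape as x."""
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        hidden = max(channels // r, 1)
        self.gate = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(x)   # eq. 5: element-wise multiplication

# Usage: the unit replaces a ReLU, e.g. nn.ReLU() -> ATAC(channels=64).
x = torch.randn(2, 64, 32, 32)
print(ATAC(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```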
Fig. 4: Illustrations of the proposed architectures: (a) the basic ATAC-ResNet-V2 block and (b) the bottleneck ATAC-ResNet-V1b block. These blocks can be used to create fully attentional networks by replacing every ReLU with an ATAC unit in a baseline convolutional network.

III-C Fully-Attentional Networks

The unit described above can be used to obtain fully attentional neural networks by replacing every ReLU with an ATAC unit. Due to this paradigm, we can achieve feature refinement at very early stages, even after the first convolutional layer. In comparison with a convolution layer, our ATAC unit induces only a modest amount of additional parameters and computations.

Our experimental evaluation shows that it is worth spending these additional memory and computing resources for the ATAC units instead of, e.g., making the networks deeper. That is, instead of simply increasing the depth of a network, one should pay more attention to the quality of the feature activations. We hypothesize that the reason behind this is that suppressing irrelevant low-level features and prioritizing relevant object features at earlier stages enables networks to encode higher-level semantics more efficiently.

In the experimental evaluation provided below, we consider the ResNet family of architectures as host networks. More specifically, we replace the ReLU units in the basic ResNet block and the bottleneck ResNet block, yielding what we call ATAC-ResNet blocks. The induced architectures are shown in fig. 4(a) and fig. 4(b), and the details are provided in table II.

Stage    Output    ResNet-20    Output    ResNet-50
conv1
stage1
stage2
stage3
stage4
Average Pool, 10/100/1000-d FC, Softmax
TABLE II: The host network architectures with ReLUs being replaced by ATAC units. ResNet-20 [7] is used for the CIFAR-10/100 datasets and ResNet-50 [6] is used for the ImageNet dataset. For ResNet-20, we scale the model depth using different assignments for the parameter n (which defines the number of blocks per stage) to study the relationship between network depth and the induced performance. Note that n = 3 corresponds to the standard ResNet-20 backbone.
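As an illustration, a pre-activation basic block with every ReLU swapped for an ATAC unit might look as follows. This is a PyTorch sketch that reuses the ATAC class from section III-B; the exact layer ordering is our reading of fig. 4(a), not the authors' verbatim implementation, and the block assumes unchanged spatial resolution and channel count.

```python
import torch
import torch.nn as nn
# assumes the ATAC class from the previous sketch is in scope

class ATACBasicBlock(nn.Module):
    """Sketch of a pre-activation (ResNet-V2 style) basic block in which every
    ReLU is replaced by an ATAC unit; identity shortcut, stride 1."""
    def __init__(self, channels: int):
        super().__init__()
        self.bn1, self.atac1 = nn.BatchNorm2d(channels), ATAC(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2, self.atac2 = nn.BatchNorm2d(channels), ATAC(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv1(self.atac1(self.bn1(x)))   # BN -> ATAC -> conv
        out = self.conv2(self.atac2(self.bn2(out)))
        return out + x                              # identity shortcut
```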

IV Experiments

To analyze the potential of the proposed ATAC units, we consider several datasets and compare the architectures mentioned above with state-of-the-art baselines. We also conduct a comprehensive ablation study to investigate the effectiveness of the design of the ATAC units and the behavior of the networks that are induced by them. In particular, the following questions will be investigated in our experimental evaluation:

  1. Q1: Generally, attention mechanisms in vision tasks capture long-range contextual interactions to address semantic ambiguity. Our attentional activation utilizes point-wise local channel attention instead. In our study (see table III), we investigate how important the locality of the feature context is for ATAC units.

  2. Q2: Using a micro-module to enhance the network's discriminability is not a new idea. The network-in-network (NiN) [16] approach, attentional feature refinement (SENet) [10], and the proposed attentional activation all follow this idea. Given the same additional computation and parameter budget, we investigate whether the proposed ATAC units are a better alternative than the NiN-style block, which deepens the network, and an attention module like SENet [10], which refines the feature maps (see table III).

  3. Q3: A natural question is also whether the performance improves consistently when more and more ReLUs are replaced by ATAC units until a fully attentional network is obtained. We will also examine whether ATAC units should be used in the initial layers of a convolutional network, see fig. 16.

  4. Q4: Next, we analyze how networks with our ATAC units compare to networks with other activation functions and other state-of-the-art attention mechanisms. In particular, considering that the proposed ATAC units induce additional parameters and computational costs compared to non-parametric alternatives such as ReLU or Swish, we investigate whether convolutional networks with ATAC units yield a superior performance with fewer layers and parameters, see fig. 16 and table IV.

  5. Q5: Finally, we examine whether deeper networks such as ResNet-50 with our ATAC units suffer from the vanishing gradient problem induced by the sigmoid function, see section IV-C.

IV-A Experimental Settings

For the experimental evaluation, we resort to the CIFAR-10, CIFAR-100 [14], and ImageNet [21] datasets. All network architectures in this work are implemented using the MXNet [4] framework. Since most of the experimental architectures cannot take advantage of pre-trained weights, every architecture instantiation is trained from scratch for fairness. The strategy described by He et al. [5] is used for weight initialization, and the same channel reduction ratio is used for all experiments. For CIFAR-10 and CIFAR-100, the ResNet-20v2 [7] architecture is adopted as the host backbone network, with a single convolutional layer followed by three stages, each having three basic residual blocks. To study the network's behavior with the proposed ATAC units under different computational and parameter budgets, we vary the depths of the models using the block number n in each stage (see table II). The Nesterov accelerated SGD (NAG) optimizer is used for training the models with a learning rate of 0.2, a total number of 400 epochs, a weight decay of 1e-4, a batch size of 128, and a learning rate decay factor of 0.1 at epochs 300 and 350, respectively. For ImageNet, we resort to the ResNet-50-v1b [6] architecture as the host network and NAG as the optimizer with a learning rate of 0.075, a total number of 160 epochs, no weight decay, a batch size of 256 on two GPUs, and a decay of the learning rate by a factor of 0.1 at epochs 40 and 60, respectively. To save computational time, only the ReLUs in the last two stages of ResNet-50-v1b are replaced with our ATAC units, as illustrated in fig. 4(b).
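For reference, the CIFAR schedule above translates roughly into the following PyTorch training configuration. This is a sketch only: the paper uses MXNet's NAG optimizer, the momentum value of 0.9 is our assumption since it is not stated, and the model and training loop are placeholders.

```python
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1))  # placeholder for an ATAC-ResNet
optimizer = SGD(model.parameters(), lr=0.2, momentum=0.9,   # momentum value assumed
                nesterov=True, weight_decay=1e-4)           # Nesterov momentum ~ NAG
scheduler = MultiStepLR(optimizer, milestones=[300, 350], gamma=0.1)

for epoch in range(400):
    # ... forward/backward passes over CIFAR-10/100 with batch size 128,
    #     followed by optimizer.step(), would go here ...
    scheduler.step()  # decays the learning rate by 0.1 at epochs 300 and 350
```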

IV-B Ablation Study

We start by investigating the questions Q1-Q3 raised above. In particular, we consider several micro-module competitors and compare their performances with our approach that replaces each ReLU by an ATAC unit (ATAC (ours) in table III). For these experiments, we consider the CIFAR-10 and CIFAR-100 datasets and ResNet-20 as the host network with a varying number of blocks.

IV-B1 Importance of Locality (Q1)

Fig. 5: SEActivation

We start by comparing the ATAC unit with the SEActivation unit shown in fig. 5, which corresponds to the SE block used by the SENet [10]. Note that, instead of using this block for block-wise feature refinement, we make use of this unit as a layer-wise activation, i.e., as a replacement for the ReLU. Compared to our ATAC unit, the SEActivation unit adds a global average pooling layer at the beginning to obtain a channel descriptor of size $C \times 1 \times 1$. While the ATAC unit and the SEActivation unit have the same number of parameters, they differ w.r.t. the contextual aggregation scale and the application scope of the attentional weight. More precisely, SEActivation aggregates global contextual information, and each feature map of size $H \times W$ shares the same attentional activation weight in the end. In contrast, the ATAC unit captures the channel-wise relationship in a point-wise manner, and each scalar in the feature map has an individual gating weight (i.e., the weight tensor has size $C \times H \times W$).
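For comparison with the ATAC sketch above, the SEActivation baseline can be sketched as follows. This PyTorch version mirrors fig. 5 as described in the text (the same bottleneck gate preceded by global average pooling, so all H x W positions of a channel share one weight); the structural details are our assumptions.

```python
import torch
import torch.nn as nn

class SEActivation(nn.Module):
    """Sketch of the SEActivation baseline: identical in structure to the ATAC
    unit except for a leading global average pooling, so every position within
    a channel shares a single gating weight."""
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # global context, C x 1 x 1
        hidden = max(channels // r, 1)
        self.gate = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False), nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False), nn.BatchNorm2d(channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(self.pool(x))           # gate broadcast over H x W
```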

Table III presents the comparison between both units on CIFAR-10 and CIFAR-100 given a gradually increased network depth. It can be seen that, compared with ATAC, the performance of the network using the SEActivation unit is significantly worse (note that the block is used here in a different fashion than its original usage as feature refinement in the SENet architecture). The results suggest that locality is of vital importance when such units are used for attentional activation.

Unit            CIFAR-10                      CIFAR-100
ReLU            0.895  0.920  0.929  0.935    0.737  0.785  0.799  0.806
SEActivation    0.548  0.601  0.613  0.622    0.388  0.432  0.452  0.456
NiN             0.893  0.917  0.922  0.926    0.743  0.776  0.792  0.796
LocalSENet      0.906  0.926  0.931  0.937    0.762  0.794  0.805  0.811
ATAC (ours)     0.906  0.927  0.936  0.939    0.764  0.796  0.812  0.821
TABLE III: Comparison of the classification accuracies on CIFAR-10 and CIFAR-100 for increasing network depth. The results comparing SEActivation and ATAC suggest that the locality of contextual aggregation is of vital importance. The comparison among NiN, LocalSENet, and ATAC suggests that, given the same computational and parameter budget, the layer-wise feature refinement via the proposed attentional activation outperforms both going deeper via a NiN-style block and block-wise feature refinement via an attention module.

IV-B2 Activation vs. NiN vs. Refinement (Q2)

Fig. 6: NiN

Next, we investigate and compare our ATAC units with two other micro-modules: The first one is a NiN-style block [16], which introduces point-wise convolutions after convolution layers to enhance the network's discriminability. Fig. 6 provides an instantiation of the NiN-style block, which has an increased number of parameters compared to its original implementation to be on par with the other modules. The NiN module is applied after each convolution layer and the associated ReLU in the ResNet-20 host network.

Fig. 7: LocalSENet

The second micro-module considered here as a competitor is the LocalSENet block shown in fig. 7. Basically, instead of using the ATAC unit as a replacement for the ReLU, LocalSENet uses the same local attention mechanism to refine the residual output. Compared to the SEActivation module, we employ a local channel attention module in LocalSENet. Furthermore, we do not use it as an activation function. Thus, the induced networks correspond to the SENet with the global average pooling removed from the attention block on the residual branch. Note that the LocalSENet block refines the residual output after two convolutions, whereas the NiN and ATAC modules are applied after each convolutional layer. Hence, to obtain the same number of parameters and the same computational costs, the channel reduction ratio in the LocalSENet module is set to 1.0.
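A sketch of this competitor is given below (PyTorch, reusing the ATAC module from section III-B as the local channel attention with reduction ratio 1); the precise block layout is our assumption based on fig. 7 and the description above.

```python
import torch
import torch.nn as nn
# assumes the ATAC class from section III-B is in scope

class LocalSEBasicBlock(nn.Module):
    """Sketch of the LocalSENet competitor: a standard basic block whose
    residual output is refined once by the local channel attention
    (reduction ratio 1.0), while plain ReLUs are kept as activations."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.refine = ATAC(channels, r=1)   # local channel attention, r = 1.0
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.refine(self.body(x)) + x)
```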

Table III provides the results, from which it can be seen that: 1) The performance of NiN is not as good as LocalSENet and ATAC, which suggests that, given a small additional budget for parameters and computational costs, one should resort to an attention mechanism instead of a NiN-style block. This strengthens our assumption that, instead of blindly increasing the network depth, refining the feature maps is a more efficient and effective way to increase a network's performance. 2) The difference between LocalSENet and ATAC is that LocalSENet uses the attention mechanism only once with all the additional parameters. In contrast, the ATAC units are applied after every convolution (with each ATAC unit having only half the parameters compared to the LocalSENet module). The results suggest that, given the same budget for parameters and computational costs, one should choose the paradigm that applies as many lightweight attention modules as possible, instead of adopting sophisticated attention modules only a few times.

IV-B3 Towards Fully Attentional Networks (Q3)

Fig. 10: Illustration of the performance gain tendency (in percent) obtained by gradually replacing ReLUs with the proposed ATAC units, starting with the last layer and ending with the first layer, on CIFAR-10 and CIFAR-100. Here, a contribution ratio of 1.0 corresponds to the normalized performance gain obtained via a fully-attentional network with all ReLUs being replaced by ATAC units, and a ratio of 0.0 corresponds to no ATAC units being used (hence, no gain). The results suggest that the network can obtain a consistent performance gain by going towards a fully attentional network.

We also investigated the cost-effectiveness of the fully attentional network with the proposed ATAC units. We analyze the network's predictive performance on CIFAR-10 and CIFAR-100 while gradually replacing ReLUs with ATAC units, starting with the last layer and ending with the first layer. As can be seen in fig. 10(a) and fig. 10(b), the performance tends to increase with more ATAC units. Therefore, a fully attentional network offers a way to obtain a performance increase with marginal additional costs. It can also be seen that the largest performance increase is obtained for the replacements made at the end of the process (steep increase in the range from 0.125 to 0.150), which correspond to replacements of ReLUs by ATAC units in the first layers of the network. This supports our hypothesis that early attentional modulation enables networks to encode higher-level semantics more efficiently by suppressing irrelevant low-level features and highlighting relevant features in the early layers of the networks.
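To mirror this experiment, one could gradually swap ReLUs back-to-front with a helper like the following. This is a heuristic PyTorch sketch, not the authors' experimental code: the helper name is ours, and the channel count of each ATAC unit is inferred from the most recently seen convolution.

```python
import torch.nn as nn
# assumes the ATAC class from section III-B is in scope

def replace_last_relus_with_atac(model: nn.Module, num_to_replace: int) -> int:
    """Swap the last `num_to_replace` ReLU modules of `model` for ATAC units,
    mirroring the back-to-front replacement order of fig. 10."""
    candidates, channels = [], None
    for parent in model.modules():
        for name, child in parent.named_children():
            if isinstance(child, nn.Conv2d):
                channels = child.out_channels        # remember the channel width
            elif isinstance(child, nn.ReLU) and channels is not None:
                candidates.append((parent, name, channels))
    k = min(num_to_replace, len(candidates))
    for parent, name, ch in candidates[len(candidates) - k:]:
        setattr(parent, name, ATAC(ch))              # replace ReLU in place
    return k                                         # number of ReLUs replaced
```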

IV-C Comparison

Finally, we address question Q4 raised at the beginning of this section by comparing our approach with several activation functions and other state-of-the-art network competitors. We also show that the ATAC-ResNet is not affected by the vanishing gradient problem, thus addressing question Q5.

IV-C1 Activation Units & Networks (Q4)

First, we compare the proposed ATAC unit with other activation units, namely ReLU [18], SELU [12], Swish [20], and xUnit [13]. (We also considered GELU [8] and PReLU [5], but their performance was not as good as the aforementioned baselines; they were hence excluded from the overall comparison.) Fig. 13(a) and fig. 13(b) provide the comparison on CIFAR-10 and CIFAR-100 given a gradual increase of the depths of the networks. It can be seen that: (a) The ATAC unit achieves a better performance for all experimental settings, which demonstrates its effectiveness compared to the baselines. The Swish unit, which is also a non-linear gating function, ranked second in the comparison, better than the ReLU-like activation units. These results reaffirm that one can obtain better activation functions by considering non-linearities other than ReLU-like units. (b) Since the ATAC unit outperforms the Swish unit, which can be interpreted as a non-context-aware scalar version of the ATAC unit, we conclude that channel-wise context is beneficial for activation functions. (c) By replacing ReLUs with ATAC units, one can obtain a more efficient convolutional network that yields a better performance with fewer layers or parameters. For example, in fig. 13(b), the ATAC-ResNet achieves the same classification accuracy as the plain ResNet while only using 65% of the parameters.

Fig. 13: Comparison with other activation units on CIFAR-10 and CIFAR-100 with a gradual increase of network depth. The results suggest that we can obtain a better performance with even fewer layers or parameters per network by replacing ReLUs with the proposed ATAC units.
Fig. 16: Comparison of different networks on CIFAR-10/100 given a gradual increase of network depth. The results demonstrate the effectiveness of the layer-wise manner of simultaneous feature activation and refinement by our ATAC units.

Next, we compare our proposed method with the baseline and other state-of-the-art networks. Fig. 16(a) and fig. 16(b) illustrate the results given a gradual increase in network depth for all networks on CIFAR-10 and CIFAR-100. Compared with SENet [10] and GENet [9], the network with ATAC units performs better for all settings. We believe that the improved performance stems from the layer-wise and simultaneous feature activation and refinement of the proposed attentional activation scheme. We also validated our ATAC units on the ImageNet dataset. The results are provided in table IV. Compared to the self-reported results of SENets [10], attention augmented (AA) convolution networks [2], full attention (FA) vision models [19], and gather-excite (GE) networks [9], our ATAC-ResNet-50 achieves the best top-1 error. Notably, compared with GE--ResNet-50 and SE-ResNet-50, ATAC-ResNet-50 requires fewer parameters. Although the number of GFlops is somewhat higher (which mainly stems from the point-wise convolutions in the local contextual aggregation), the network yields a good trade-off compared to the other architectures.

IV-C2 Vanishing Gradients (Q5)

The performance of our ATAC-ResNet on CIFAR-10/100 (ResNet-32) and ImageNet (ResNet-50) empirically answers question Q5: deep networks equipped with ATAC units do not appear to suffer from the vanishing gradient problem. Interestingly, the Swish activation function [20] also adopts the sigmoid function, and networks with Swish units can also go very deep. Note that the sigmoid and softmax functions are not used as activations in our context; they are used to obtain probabilities with which to weigh the feature maps (something that cannot be obtained via ReLUs). This is the reason why attention modules, deep belief networks (DBN), recurrent neural networks (RNN), and long short-term memory (LSTM) networks, as well as our ATAC units, typically contain layer-wise sigmoid or softmax functions and can still go very deep. In fact, the emergence of batch normalization [11] allows the sigmoid function to be used in deep networks again, and the richer expressiveness of the sigmoid gate has led to better results. In addition, residual connections [6] also help to train very deep networks.

Architecture GFlops Params top-1 err. top-5 err.
ResNet-50 [6] 3.86 25.6M 23.30 6.55
SE-ResNet-50 [10] 3.87 28.1M 22.12 5.99
AA-ResNet-50 [2] 8.3 25.8M 22.30 6.20
FA-ResNet-50 [19] 7.2 18.0M 22.40 /
GE--ResNet-50 [9] 3.87 33.7M 21.88 5.80
ATAC-ResNet-50 (ours) 4.4 28.0M 21.41 6.02
TABLE IV: Classification comparison on ImageNet with other state-of-the-art networks. ATAC-ResNet-50 achieves the best top-1 error with fewer parameters than SE-ResNet-50 and GE--ResNet-50.

V Conclusion

Smarter activation functions that integrate what is generally considered separately as attention mechanisms are very promising and worthy of further research. Instead of blindly increasing the depth of a network, one should pay more attention to the quality of the feature activations. In particular, we found that our attentional activation units, a unification of activation functions and attention mechanisms that endows activation units with attentional context information, improve the performance of all the networks and datasets that we have experimented with so far. To meet both the locality of activation functions and the contextual aggregation of attention mechanisms, we proposed a local channel attention module, which locally aggregates point-wise, cross-channel feature contextual information. A simple procedure of replacing all ReLUs with the proposed ATAC units produces a fully attentional network that performs significantly better than the baseline with a modest number of additional parameters. Compared with other activation units, convolutional networks with our ATAC units gain a performance boost with fewer layers or parameters per network.

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China under Grant No. 61573183, the Open Project Program of the National Laboratory of Pattern Recognition (NLPR) under Grant No. 201900029, the Nanjing University of Aeronautics and Astronautics PhD short-term visiting scholar project under Grant No. 180104DF03, the Excellent Chinese and Foreign Youth Exchange Program of the China Association for Science and Technology, and the China Scholarship Council under Grant No. 201806830039.

References

  • [1] F. Agostinelli, M. D. Hoffman, P. J. Sadowski, and P. Baldi (2015) Learning activation functions to improve deep neural networks. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Workshop Track Proceedings, Cited by: §II-A.
  • [2] I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le (2019-10) Attention augmented convolutional networks. In The IEEE International Conference on Computer Vision (ICCV), pp. 3286–3295. Cited by: §I, §II-B, §IV-C1, TABLE IV.
  • [3] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T. Chua (2017) SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 6298–6306. Cited by: TABLE I.
  • [4] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang (2015) MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. In Neural Information Processing Systems, Workshop on Machine Learning Systems, Vol. abs/1512.01274. External Links: 1512.01274. Cited by: §IV-A.
  • [5] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV ’15, Washington, DC, USA, pp. 1026–1034. External Links: ISBN 978-1-4673-8391-2 Cited by: §I, §II-A, §IV-A, footnote 2.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778. Cited by: TABLE II, §IV-A, §IV-C2, TABLE IV.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, pp. 630–645. Cited by: TABLE II, §IV-A.
  • [8] D. Hendrycks and K. Gimpel (2016) Gaussian error linear units (gelus). Cited by: §II-A, §II-A, footnote 2.
  • [9] J. Hu, L. Shen, S. Albanie, G. Sun, and A. Vedaldi (2018) Gather-excite: exploiting feature context in convolutional neural networks. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 9401–9411. Cited by: §I, §II-B, TABLE I, §IV-C1, TABLE IV.
  • [10] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 7132–7141. Cited by: §I, §II-B, TABLE I, item 2, §IV-B1, §IV-C1, TABLE IV.
  • [11] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 448–456. Cited by: §III-B, §IV-C2.
  • [12] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter (2017) Self-normalizing neural networks. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 971–980. Cited by: §I, §I, §II-A, §IV-C1.
  • [13] I. Kligvasser, T. R. Shaham, and T. Michaeli (2018-06) XUnit: learning a spatial activation function for efficient image restoration. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2433–2442. External Links: ISSN 1063-6919 Cited by: §I, §II-A, TABLE I, §III-A, §IV-C1.
  • [14] A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report University of Toronto. Cited by: §IV-A.
  • [15] X. Li, W. Wang, X. Hu, and J. Yang (2019) Selective kernel networks. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 510–519. Cited by: TABLE I.
  • [16] M. Lin, Q. Chen, and S. Yan (2014) Network in network. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), Cited by: §II-B, §III-B, item 2, §IV-B2.
  • [17] A. L. Maas, A. Y. Hannun, and A. Y. Ng (2013) Rectifier nonlinearities improve neural network acoustic models. In in ICML Workshop on Deep Learning for Audio, Speech and Language Processing, Cited by: §I, §II-A, §II-A.
  • [18] V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, ICML'10, USA, pp. 807–814. External Links: ISBN 978-1-60558-907-7. Cited by: §I, §II-A, §IV-C1.
  • [19] N. Parmar, P. Ramachandran, A. Vaswani, I. Bello, A. Levskaya, and J. Shlens (2019) Stand-alone self-attention in vision models. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pp. 68–80. Cited by: §II-B, §IV-C1, TABLE IV.
  • [20] P. Ramachandran, B. Zoph, and Q. V. Le (2018) Searching for activation functions. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Workshop Track Proceedings, Cited by: §I, §I, §II-A, §IV-C1, §IV-C2.
  • [21] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015-12-01) ImageNet large scale visual recognition challenge. 115 (3), pp. 211–252. External Links: ISSN 1573-1405 Cited by: §II-B, §IV-A.
  • [22] L. Sha, J. Schwarcz, and P. Hong (2019) Context dependent modulation of activation function. Cited by: §I, §II-A.
  • [23] V. Sitzmann, J. N. P. Martel, A. W. Bergman, D. B. Lindell, and G. Wetzstein (2020) Implicit neural representations with periodic activation functions. abs/2006.09661. Cited by: §III-A.
  • [24] S. Woo, J. Park, J. Lee, and I. S. Kweon (2018) CBAM: convolutional block attention module. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VII, pp. 3–19. Cited by: §I, §II-B, TABLE I.
  • [25] Y. Yang, J. Sun, H. Li, and Z. Xu (2016) Deep admm-net for compressive sensing mri. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 10–18. Cited by: §II-A.