Humans and animals process vast amounts of information with limited computational resources thanks to attention mechanisms, which allow them to focus resources on the most informative chunks of information [anderson1985cognitive, desimone1995neural, ungerleider2000mechanisms].
This work is inspired by the advantages of visual and biological attention mechanisms for tackling fine-grained visual recognition with Convolutional Neural Networks (CNNs) [lecun1998gradient]. This is a particularly difficult task since it involves looking for details in large amounts of data (images) while remaining robust to deformation and clutter. In this sense, different attention mechanisms for fine-grained recognition exist in the literature: (i) iterative methods that process images using "glimpses" with recurrent neural networks (RNNs) or long short-term memory (LSTM) [sermanet2014attention, zhao2017diversified], and (ii) feed-forward attention mechanisms that augment vanilla CNNs, such as the Spatial Transformer Networks (STN) [jaderberg2015spatial], or top-down feed-forward attention mechanisms (FAM) [rodriguez2017age]. Although it is not applied to fine-grained recognition, the Residual Attention introduced by [wang2017residual] is another example of a feed-forward attention mechanism; it takes advantage of residual connections [he2016deep] to enhance or dampen certain regions of the feature maps in an incremental manner.
Thus, most of the existing attention mechanisms are limited either by having to perform multiple passes through the data [sermanet2014attention], by carefully designed architectures that must be trained from scratch [jaderberg2015spatial], or by a considerable increase in the required amount of memory and computation, which introduces computational bottlenecks [jetley2018learn]. Hence, there is still a need for models with the following learning properties: (i) detect and process in detail the most informative parts of an image, yielding models that are more robust to deformation and clutter [mnih2014recurrent]; (ii) feed-forward and trainable with SGD, for achieving faster inference than iterative models [sermanet2014attention, zhao2017diversified], together with a faster convergence rate than Reinforcement Learning-based (RL) methods [sermanet2014attention, liu2016fully]; (iii) preserve low-level detail, providing direct access to local low-level features before they are modified by residual identity mappings. This last property is important for fine-grained recognition, where low-level patterns such as textures can help to distinguish two similar classes. It is not fulfilled by Residual Attention, where low-level features are subject to noise after traversing multiple residual connections [wang2017residual].
In addition, desirable properties for attention mechanisms applied to CNNs would be: (i) Modular and incremental, since the same structure can be applied at each layer of any convolutional architecture and is easy to adapt to the task at hand; (ii) Architecture independent, that is, able to augment any pre-trained architecture such as VGG [simonyan2014very] or ResNet [he2016deep]; (iii) Low computational impact, meaning that it does not result in a significant increase in memory and computation; and (iv) Simple, in the sense that it can be implemented in a few lines of code, making it appealing for use in future work.
Based on all these properties, we propose a novel attention mechanism that learns to attend to low-level features from a standard CNN architecture through a set of replicable Attention Modules and gating mechanisms (see Section 3). Concretely, as can be seen in Figure 1, any existing architecture can be augmented by applying the proposed model at different depths and replacing the original loss by the proposed one. Remarkably, the modules are independent of the original path of the network, so in practice they can be computed in parallel to the rest of the network. The proposed attention mechanism has been included in a strong baseline, Wide Residual Networks (WRN) [Zagoruyko2016WRN], and applied on CIFAR-10, CIFAR-100 [krizhevsky2009learning], and five challenging fine-grained recognition datasets. The resulting network, called Wide Attentional Residual Network (WARN), systematically enhances the performance of WRNs and surpasses the state of the art in various classification benchmarks.
2 Related Work
There are different approaches to fine-grained recognition [zhao2017survey]: (i) vanilla deep CNNs, (ii) CNNs as feature extractors for localizing parts and performing alignment, (iii) ensembles, and (iv) attention mechanisms. In this work, we focus on (iv), the attention mechanisms, which aim to discover the most discriminative parts of an image to be processed in greater detail, thus ignoring clutter and focusing on the most distinctive traits. These parts are central for fine-grained recognition, where the inter-class variance is small and the intra-class variance is high.
Different fine-grained attention mechanisms can be found in the literature. [xiao2015application] proposed a two-level attention mechanism for fine-grained classification on different subsets of the ILSVRC [russakovsky2012imagenet] dataset, and on CUB200-2011. In this model, images are first processed by a bottom-up object proposal network based on R-CNN [zhang2014part] and selective search [uijlings2013selective]. Then, the softmax scores of another ILSVRC2012 pre-trained CNN, which they call FilterNet, are thresholded to prune the patches with the lowest parent class score. These patches are then classified into fine-grained categories with a DomainNet. Spectral clustering is also used on the DomainNet filters in order to extract parts (head, neck, body, etc.), which are classified with an SVM. Finally, the part- and object-based classifier scores are merged to get the final prediction. The two-level attention obtained state-of-the-art results on CUB200-2011 with only class-level supervision. However, the pipeline must be carefully fine-tuned, since many stages with many hyper-parameters are involved.
Differently from the two-level attention, which consists of independent processing stages and is not end-to-end, Sermanet et al. proposed to use a deep CNN and a Recurrent Neural Network (RNN) to accumulate multi-resolution "glimpses" of an image to make a final prediction [sermanet2014attention]. However, reinforcement learning slows down convergence, and the RNN adds extra computation steps and parameters.
A more efficient approach was presented by Liu et al. [liu2016fully], where a fully-convolutional network is trained with reinforcement learning to generate confidence maps on the image and use them to extract the parts for the final classifiers, whose scores are averaged. Compared to previous approaches, in the work by [liu2016fully] multiple image regions are proposed in a single timestep, thus speeding up the computation. A greedy reward strategy is also proposed in order to increase the training speed. The recent approach presented by [fu2017look] uses a classification network and a recurrent attention proposal network that iteratively refines the center and scale of the input (RA-CNN). A ranking loss is used to enforce incremental performance at each iteration.
Zhao et al. proposed to enforce multiple non-overlapped attention regions [zhao2017diversified]. The overall architecture consists of an attention canvas generator, which extracts patches of different regions and scales from the original image; a VGG-16 [simonyan2014very] CNN is then used to extract features from the patches, which are aggregated with a long short-term memory [hochreiter1997long] that attends to non-overlapping regions of the patches. Classification is performed with the average prediction in each region. Similarly, in [zheng2017learning], they proposed the Multi-Attention CNN (MA-CNN) to learn to localize informative patches from the output of a VGG-19 and use them to train an ensemble of part classifiers.
In [jetley2018learn], the authors propose to extract global features from the last layers of a CNN, just before the classifier, and use them to attend to relevant regions in lower-level feature activations. The attended activations from each level are then spatially averaged, channel-wise concatenated, and fed to the final classifier. The main differences with [jetley2018learn] are: (i) our attention maps are computed in parallel to the base model, while the model in [jetley2018learn] requires the output features for computing attention maps; (ii) WARN uses fewer parameters, so dropout is not needed to obtain competitive performance (these two factors translate into a clear gain in speed); and (iii) gates allow our model to ignore or attend to different information to improve the performance of the original model, while in [jetley2018learn] the full output function is replaced. As a result, WARN obtains a 3.44% error rate on CIFAR-10, outperforming [jetley2018learn] while being 7 times faster without parallelization.
All the previously described methods involve multi-stage pipelines, and most of them are trained using reinforcement learning (which requires sampling and makes them slow to train). In contrast, STNs, FAM, the model in [jetley2018learn], and our approach jointly propose the attention regions and classify them in a single pass. Moreover, different from STNs and FAM, our approach only uses one CNN stream, can be used on pre-trained models, and is far more computationally efficient than STNs, FAM, and [jetley2018learn], as described next.
3 Our approach
Our approach consists of a universal attention module that can be added after each convolutional layer without altering the pre-defined information pathways of any architecture (see Figure 1). This is helpful since it seamlessly augments any architecture such as VGG and ResNet with no extra supervision, i.e. no part labels are necessary. Furthermore, it can be plugged into any existing trained network to quickly perform transfer learning.
3.1 Overview
The attention module consists of three main submodules, depicted in Figure 2(a): (i) the attention heads, which define the most relevant regions of a feature map; (ii) the output heads, which generate a hypothesis given the attended information; and (iii) the confidence gates, which output a confidence score for each attention head. Each of these submodules is described in detail in the following subsections.
As can be seen in Figure 1, a convolution layer is applied to the output of the augmented layer, producing the attentional heatmaps. These attention maps are then used to spatially average the local class probability scores for each of the feature maps, producing one class probability vector per module. This process is applied to an arbitrary number of layers, producing one class probability vector per attended layer. Then, the model learns to correct the initial prediction by attending to the lower-level class predictions. This is the final combined prediction of the network. In terms of probability, the network corrects the initial likelihood by updating the prior with local information.
3.2 Attention head
Inspired by [zhao2017diversified] and the transformer architecture presented by [vaswani2017attention], and following the notation established by [Zagoruyko2016WRN], we have identified two main dimensions to define attentional mechanisms: (i) the number of layers using the attention mechanism, which we call attention depth (AD), and (ii) the number of attention heads in each attention module, which we call attention width (AW). Thus, a desirable property for any universal attention mechanism is to be able to be deployed at any arbitrary depth and width.
This property is fulfilled by including K attention heads (width), depicted in Figure 1, into each attention module (depth). (Notation: H, O, and G are the sets of attention heads, output heads, and attention gates, respectively. Uppercase letters refer to functions or constants, and lowercase ones to indices. Bold uppercase letters represent matrices and bold lowercase ones vectors.) Then, the attention heads at layer l receive the feature activation Z^l of that layer as input, and output K attention masks:

H^l = spatial_softmax(W_H^l * Z^l),    (1)

where H^l is the output matrix of the attention module, W_H^l is a convolution kernel with output dimensionality K used to compute the attention masks corresponding to the K attention heads, and * denotes the convolution operator. The spatial softmax, which performs the softmax operation channel-wise on the spatial dimensions of the input, is used to enforce the model to learn the most relevant region of the image. Sigmoid units could also be used, at the risk of degeneration to all-zeros or all-ones. To prevent the attention heads at the same depth from collapsing into the same region, we apply the regularizer proposed in [zhao2017diversified].
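For concreteness, the attention-head computation described above can be sketched in a few lines of PyTorch (the framework used in our implementation). Shapes and variable names here are illustrative, not taken from the released code:

```python
import torch
import torch.nn.functional as F

def attention_head_masks(z, w_h):
    """Compute K spatial attention masks from the feature activation z.

    z:   (B, C, H, W) output of the augmented layer.
    w_h: (K, C, 1, 1) 1x1 convolution kernel, one output channel per head.
    Returns (B, K, H, W); each mask sums to 1 over its H*W positions.
    """
    b, _, h, w = z.shape
    logits = F.conv2d(z, w_h)                        # (B, K, H, W)
    k = logits.shape[1]
    # spatial softmax: normalize each head over the flattened spatial grid
    masks = F.softmax(logits.reshape(b, k, h * w), dim=-1)
    return masks.reshape(b, k, h, w)

z = torch.randn(2, 8, 5, 5)
w_h = torch.randn(4, 8, 1, 1)
masks = attention_head_masks(z, w_h)                 # (2, 4, 5, 5)
```

Replacing the spatial softmax with a per-position sigmoid is a one-line change, with the degeneration risk noted above.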
3.3 Output head
To obtain the class probability scores, the input feature map Z^l is convolved with a kernel W_O^l of size #labels x C x 1 x 1, where C is the number of input channels to the module. This results in a spatial map of class probability scores:

O^l = W_O^l * Z^l,    (2)

whose spatial dimensions match those of Z^l.
Note that this operation can be done in a single pass for all K heads by setting the number of output channels to K * #labels. Then, the class probability vectors are weighted by the spatial attention scores and spatially averaged:

o_k^l = Σ_{x,y} H_k^l ⊙ O^l,    (3)

where ⊙ is the element-wise product and o_k^l is a vector of #labels entries. The attention scores H_k^l form a 2d flat mask, and the product with each of the #labels channels of O^l is done by broadcasting, i.e. repeating H_k^l for each of the channels of O^l.
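A matching PyTorch sketch of the output head, again with illustrative shapes (L stands for the number of labels); the broadcasting described above appears explicitly:

```python
import torch
import torch.nn.functional as F

def output_head(z, masks, w_o):
    """Attention-weighted class scores for one attention module.

    z:     (B, C, H, W) input feature map.
    masks: (B, K, H, W) attention masks, each summing to 1 over H*W.
    w_o:   (L, C, 1, 1) 1x1 kernel mapping C features to L class scores.
    Returns (B, K, L): one class-score vector per attention head.
    """
    scores = F.conv2d(z, w_o)                             # (B, L, H, W)
    # broadcast each 2-d mask over the L score channels, then sum spatially
    weighted = masks.unsqueeze(2) * scores.unsqueeze(1)   # (B, K, L, H, W)
    return weighted.sum(dim=(3, 4))

z = torch.randn(2, 8, 5, 5)
w_o = torch.randn(10, 8, 1, 1)
masks = torch.full((2, 4, 5, 5), 1.0 / 25)                # uniform masks for the demo
out = output_head(z, masks, w_o)                          # (2, 4, 10)
```

With uniform masks the result reduces to a plain spatial average of the score map, which is a convenient sanity check.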
3.4 Layered attention gates
The final output o^l of an attention module is obtained by a weighted average of the K output probability vectors, through the use of the head attention gates g_H^l, with Σ_k (g_H^l)_k = 1:

o^l = Σ_k (g_H^l)_k · o_k^l,    (4)

where g_H^l is obtained by first convolving Z^l with a kernel W_G^l and then performing a spatial weighted average:

g_H^l = softmax(tanh(Σ_{x,y} H^l ⊙ (W_G^l * Z^l))).    (5)
This way, the model learns to choose the attention head that provides the most meaningful output for a given attention module.
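One plausible implementation of the layered gates, under the same illustrative shapes as the previous sketches (the exact gate parametrization in the released code may differ):

```python
import torch
import torch.nn.functional as F

def layered_gates(z, masks, w_g, head_outputs):
    """Gate-weighted average of the K per-head predictions of one module.

    z:            (B, C, H, W) feature map of the augmented layer.
    masks:        (B, K, H, W) attention masks.
    w_g:          (K, C, 1, 1) 1x1 kernel producing one gate-logit map per head.
    head_outputs: (B, K, L) class-score vectors from the output heads.
    Returns (B, L).
    """
    gate_maps = F.conv2d(z, w_g)                               # (B, K, H, W)
    # spatial weighted average of each gate map under its head's mask
    logits = torch.tanh((masks * gate_maps).sum(dim=(2, 3)))   # (B, K)
    gates = F.softmax(logits, dim=1)                           # sums to 1 over heads
    return (gates.unsqueeze(-1) * head_outputs).sum(dim=1)

z = torch.randn(2, 8, 5, 5)
masks = F.softmax(torch.randn(2, 4, 25), dim=-1).reshape(2, 4, 5, 5)
w_g = torch.randn(4, 8, 1, 1)
head_outputs = torch.full((2, 4, 10), 3.0)   # identical heads -> gated mix is 3.0
o = layered_gates(z, masks, w_g, head_outputs)
```

Because the gates form a convex combination over heads, identical per-head predictions pass through unchanged.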
3.5 Global attention gates
In order to let the model learn to choose the most discriminative features at each depth to disambiguate the output prediction, a set of relevance scores is predicted at the model output: one for each attention module, and one for the final prediction. This way, through a series of gates, the model can learn to query information from each level of the network conditioned on the global context. Note that, unlike in [jetley2018learn], the final predictions do not act as a bottleneck for computing the output of the attention modules.
The relevance scores c are obtained with an inner product between the last feature activation z of the network and the gate weight matrix W_G:

c = tanh(W_G z).    (6)

The gate values g are then obtained by normalizing the set of scores by means of a softmax function:

g_i = e^{c_i} / Σ_{j=1}^{|G|} e^{c_j},    (7)

where |G| is the total number of gates, and c_i is the i-th confidence score from the set of all confidence scores. The final output of the network is the weighted sum of the output of the attention modules and the original network output:

output = g_net · output_net + Σ_{l∈A} g_l · o^l,    (8)

where g_net is the gate value for the original network output (output_net), A is the set of attended layers, and output is the final output taking the attentional predictions into consideration. Note that setting the output of G to 1/|G| corresponds to averaging all the outputs. Likewise, setting {g_net = 1, g_l = 0 ∀l}, i.e. setting the attention gates to zero and the output gate to one, corresponds to the original pre-trained model without attention.
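The global gating described above amounts to a softmax-weighted convex combination of all predictions. A minimal sketch with hypothetical shapes (`w_gate` plays the role of the gate weight matrix):

```python
import torch
import torch.nn.functional as F

def combine_predictions(last_features, w_gate, module_outputs, net_output):
    """Fuse per-module predictions with the original network prediction.

    last_features:  (B, D) last feature activation of the network.
    w_gate:         (G, D) gate weight matrix: one row per attention module
                    plus one row for the original network output.
    module_outputs: list of (B, L) prediction vectors, one per attention module.
    net_output:     (B, L) original network prediction.
    Returns (B, L): softmax-gated sum of all predictions.
    """
    scores = torch.tanh(last_features @ w_gate.t())             # (B, G) relevance scores
    gates = F.softmax(scores, dim=1)                            # normalize across gates
    preds = torch.stack(module_outputs + [net_output], dim=1)   # (B, G, L)
    return (gates.unsqueeze(-1) * preds).sum(dim=1)

feats = torch.randn(2, 16)
w_gate = torch.randn(3, 16)            # two attention modules + the original output
mods = [torch.randn(2, 10), torch.randn(2, 10)]
net = torch.randn(2, 10)
final = combine_predictions(feats, w_gate, mods, net)
```

Since the gates form a convex combination, feeding identical predictions through all paths returns that same prediction, which matches the averaging and identity special cases discussed above.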
It is worth noting that all the operations that use the 1x1 convolution kernels of a module can be aggregated into a single convolution operation. Likewise, the fact that the attention mask is generated by just one convolution operation, and that most masking operations are directly performed in the label space, or can be projected into a smaller dimensionality space, makes the implementation highly efficient. Additionally, the direct access to the output gradients makes the module fast to learn, thus being able to generate foreground masks from the beginning of the training and refining them during the following epochs.
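The aggregation claim can be checked directly: since the attention, output, and gate kernels are all 1x1 convolutions over the same input, stacking them along the output-channel axis yields a single convolution whose result can be split afterwards. A small sketch with hypothetical sizes:

```python
import torch
import torch.nn.functional as F

c, k, l = 8, 4, 10                       # input channels, heads, classes (illustrative)
w_h = torch.randn(k, c, 1, 1)            # attention kernel
w_o = torch.randn(l, c, 1, 1)            # output kernel
w_g = torch.randn(k, c, 1, 1)            # gate kernel
w_all = torch.cat([w_h, w_o, w_g], dim=0)

z = torch.randn(2, c, 5, 5)
out = F.conv2d(z, w_all)                 # one convolution for the whole module
h, o, g = out.split([k, l, k], dim=1)    # recover the three per-kernel outputs
```

Each split piece is identical to what the corresponding kernel would have produced on its own, so the fusion changes nothing but the number of kernel launches.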
4 Experiments
We empirically demonstrate the impact on accuracy and robustness of the different modules in our model on Cluttered Translated MNIST, and then compare it with state-of-the-art models such as DenseNets and ResNeXt. Finally, we demonstrate the universality of our method through a set of experiments on five fine-grained recognition datasets, as detailed next.
4.1 Datasets
Cluttered Translated MNIST (https://github.com/deepmind/mnist-cluttered). Consists of images containing a randomly placed MNIST [lecun1998mnist] digit and a set of randomly placed distractors, see Figure 4(b). The distractors are random patches from other MNIST digits.
CIFAR (https://www.cs.toronto.edu/~kriz/cifar.html). The CIFAR dataset consists of 60K 32x32 images in 10 classes for CIFAR-10, and 100 classes for CIFAR-100. There are 50K training and 10K test images.
Stanford Dogs [khosla2011novel]. The Stanford Dogs dataset consists of 20.5K images of 120 breeds of dogs, see Figure 2(d). The dataset splits are fixed and consist of 12K training images and 8.5K validation images.
UEC Food 100 [matsuda12]. A Japanese food dataset with 14K images of 100 different dishes, see Figure 2(e). In order to follow the standard procedure (e.g. [chen2016deep, hassannejad2016food]), bounding boxes are used to crop the images before training.
Adience dataset [eidinger2014age]. The Adience dataset consists of 26.5K images distributed in eight age categories (0–2, 4–6, 8–13, 15–20, 25–32, 38–43, 48–53, 60+), and gender labels. A sample is shown in Figure 2(a). The performance on this dataset is measured using 5-fold cross-validation.
Stanford Cars [krause20133d]. The Cars dataset contains 16K images of 196 classes of cars, see Figure 2(c). The data is split into 8K training and 8K testing images.
Caltech-UCSD Birds 200 [WahCUB_200_2011]. The CUB200-2011 birds dataset (see Figure 2(b)) consists of 6K train and 5.8K test bird images distributed in 200 categories. Although bounding box, segmentation, and attributes are provided, we perform raw classification as done by [jaderberg2015spatial].
4.2 Ablation study
We evaluate the submodules of our method on Cluttered Translated MNIST, following the same procedure as in [mnih2014recurrent]. The proposed attention mechanism is used to augment a CNN with five convolutional layers and two fully-connected layers at the end. The first three convolution layers are followed by batch normalization and spatial pooling. Attention modules are placed starting from the fifth convolution (or its pooling instead) backward, until the target attention depth (AD) is reached. Training is performed with SGD for a fixed number of epochs, with a learning rate that is divided by a constant factor partway through training. Models are trained, validated, and tested on disjoint splits of the dataset. Weights are initialized following He et al. [he2015delving]. Figure 3(e) shows the effects of the different hyperparameters of the proposed model. The performance without attention is labeled as baseline. Attention models are trained with softmax attention gates and regularized as in [zhao2017diversified], unless explicitly specified.
First, we test the importance of the attention depth (AD) for our model by incrementally adding attention layers after each pooling layer. As can be seen in Figure 3(b), greater AD results in better accuracy, saturating at the deepest configuration; note that at this depth the receptive field of the attention module is small, and thus the performance improvement obtainable from such small regions is limited. Figure 3(c) shows training curves for different values of the attention width (AW) and AD. As can be seen, small performance increments are obtained by increasing the number of attention heads, even with a single object present in the image.
Then, we use the best AD and AW to verify the importance of using softmax on the attention masks instead of sigmoid (Eq. 1), the effect of using gates (Eq. 7), and the benefits of the regularization of [zhao2017diversified]. Figure 3(d) confirms that, ordered by importance, gates, softmax, and regularization each result in an accuracy improvement. In particular, gates play an important role in discarding the distractors, especially for high AW and high AD.
Finally, in order to verify that attention masks are not overfitting on the data, and thus generalize to any amount of clutter, we run our best model so far (Figure 3(d)) on the test set with an increasing number of distractors (from 4 to 64). For the comparison, we included the baseline model before applying our approach, and the same baseline augmented with an STN [jaderberg2015spatial] that reached performance comparable to that of our best model on the validation set. All three models were trained on the same dataset with eight distractors. Remarkably, as can be seen in Figure 3(e), the attention-augmented model demonstrates better generalization than the baseline and the STN.
4.3 Training from scratch
We benchmark the proposed attention mechanism on CIFAR-10 and CIFAR-100, and compare it with the state of the art. As the base model, we choose Wide Residual Networks, a strong baseline with a large number of parameters, so that the additional parameters introduced by our model (WARN) can be considered negligible. The same WRN baseline is used to train an att2 model [jetley2018learn]; we refer to this model as WRN-att2. Models are initialized and optimized following the same procedure as in [Zagoruyko2016WRN]. Attention modules are systematically placed after each of the three convolutional groups, starting from the last one, until the attention depth has been reached, in order to capture information at different levels of abstraction and fine-grained resolution; the same procedure is followed in [jetley2018learn]. The model is implemented in PyTorch [paszke2017pytorch] (code: https://github.com/prlz77/attend-and-rectify) and run on a single workstation with two NVIDIA 1080Ti GPUs.
First, the same ablation study performed in Section 4.2 is repeated on CIFAR-100. We consistently reached the same conclusions as on Cluttered Translated MNIST: accuracy improves by 1.5% when increasing the attention depth from 1 to the number of residual blocks, and the width from 1 to 4. Gating performs 4% better than a simpler linear projection, and 3% better than simply averaging the output vectors. A 0.6% improvement is also observed when the regularization is activated. Interestingly, we found sigmoid attention to perform similarly to softmax. With this setting, WARN reaches 17.82% error on CIFAR-100. In addition, we perform an experiment blocking the gradients from the proposed attention modules to the original network, to analyze whether the observed improvement is due to the attention mechanism or to an optimization effect caused by introducing shortcut paths to the loss function [lee2015deeply]. Interestingly, we observed a 0.2% drop on CIFAR-10 and 0.4% on CIFAR-100, which are still better than the baseline. Note that a performance drop is to be expected, even without taking optimization into account, since backpropagation makes intermediate layers learn to gather more discriminative features for the attention layers. It is also worth noting that fine-grained accuracy improves even when fine-tuning (gradients are multiplied by 0.1 in the base model); see Section 4.4. In contrast, the approach in [jetley2018learn] does not converge when gradients are not sent to the base model, since classification is directly performed on intermediate feature maps (which continuously shift during training).
As seen in Table 1, the proposed Wide Attentional Residual Network (WARN) improves the baseline model on CIFAR-10 and CIFAR-100 even without the use of dropout, and outperforms the rest of the state of the art on CIFAR-10 while being remarkably faster, as shown in Table 2. Notably, the performance on CIFAR-100 makes WARN competitive with DenseNet and ResNeXt, while being up to 36 times faster. We hypothesize that the accuracy of the augmented model is limited by the base network, and even better results could be obtained when applied to the best-performing baseline.
Interestingly, WARN shows superior performance even without the use of dropout; this was not possible with [jetley2018learn], which requires dropout to achieve competitive performance, since it introduces more parameters into the augmented network. The computational efficiency of the top-performing models is shown in Figure 5. WARN provides the highest accuracy per GFlop on CIFAR-10, and is more competitive than WRN and WRN-att2 on CIFAR-100.
4.4 Transfer Learning
We fine-tune an augmented WRN-50-4 pre-trained on ImageNet [russakovsky2012imagenet] and report higher accuracy, compared to the WRN baseline, on five different fine-grained datasets: Stanford Dogs, UEC Food-100, Adience, Stanford Cars, and CUB200-2011. All the experiments are trained for 100 epochs with a batch size of 64. The learning rate is first set to a base value for all layers except the attention modules and the classifier, for which it is ten times higher. The learning rate is reduced by a constant factor every 30 iterations, and the experiment is automatically stopped if a plateau is reached. The network is trained with standard data augmentation, i.e. random crops are extracted from the images with random horizontal flips.
For the sake of clarity, and since the aim of this work is to demonstrate that the proposed mechanism universally improves baseline CNNs for fine-grained recognition, we follow the same training procedure on all datasets. Thus, we do not use high-resolution images, which are central for state-of-the-art methods such as RA-CNNs or MA-CNNs. Accordingly, we do not perform color jitter or other advanced augmentation techniques such as the ones used by [hassannejad2016food] for food recognition. The proposed method is able to obtain state-of-the-art results on Adience Gender, Stanford Dogs, and UEC Food-100 even when trained at lower resolution.
(Table 3; previous state of the art per benchmark: RA-CNN [fu2017look], Inception [hassannejad2016food], MA-CNN [zheng2017learning], FAM [rodriguez2017age], DEX [Rothe-IJCV-2016], and MA-CNN [zheng2017learning].)
As seen in Table 3, WRNs substantially increase their accuracy on all benchmarks by just fine-tuning them with the proposed attention mechanism. Moreover, we report the highest accuracy scores on Stanford Dogs, UEC Food, and gender recognition, and obtain competitive scores when compared with models that use high-resolution images or domain-specific pre-training. For instance, in [Rothe-IJCV-2016] a domain-specific model pre-trained on millions of faces is used for age recognition, while our baseline is a general-purpose WRN pre-trained on ImageNet. It is also worth noting that the performance increase on CUB200-2011 is higher than the one obtained by STNs with larger input images, even though we are augmenting a stronger baseline. This points out that the proposed mechanism might be extracting complementary information that is not captured by the main convolutional stream. As seen in Table 4, WARN not only increases the absolute accuracy, but also provides high efficiency per introduced parameter.
5 Conclusion
We have presented a novel attention mechanism to improve CNNs. The proposed model learns to attend to the most informative parts of the CNN feature maps at different depth levels, and combines them with a gating function to update the output distribution. Moreover, we generalize attention mechanisms by defining them along two dimensions, the number of attended layers and the number of attention heads per layer, and we empirically show that classification performance improves by growing the model along these two dimensions.
We suggest that attention helps to discard noisy, uninformative regions, preventing the network from memorizing them. Unlike previous work, the proposed mechanism is modular, architecture independent, fast, and simple, and yet WRNs augmented with it obtain state-of-the-art results on highly competitive datasets while being 37 times faster than DenseNet and 30 times faster than ResNeXt, and making the augmented model more parameter-efficient. When fine-tuning on a transfer learning task, the attention-augmented model showed superior performance on each recognition dataset. Moreover, state-of-the-art performance is obtained on dogs, gender, and food. Results indicate that the model learns to extract local discriminative information that is otherwise lost when traversing the layers of the baseline architecture.
Authors acknowledge the support of the Spanish project TIN2015-65464-R (MINECO/FEDER), the 2016FI B 01163 grant of Generalitat de Catalunya, and the COST Action IC1307 iV&L Net. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of a Tesla K40 GPU and a GTX TITAN GPU, used for this research.