PyTorch Implementation of CVPR 2018 paper AMNet: Memorability Estimation with Attention
In this paper we present the design and evaluation of an end-to-end trainable, deep neural network with a visual attention mechanism for memorability estimation in still images. We analyze the suitability of transfer learning of deep models from image classification to the memorability task. Further on we study the impact of the attention mechanism on the memorability estimation and evaluate our network on the SUN Memorability and the LaMem datasets. Our network outperforms the existing state of the art models on both datasets in terms of the Spearman's rank correlation as well as the mean squared error, closely matching human consistency.
The ability of human cognition to recall as well as forget visual content after viewing it is very important to the way we acquire new information and interact with our environment. This is becoming increasingly significant as creating and consuming visual content dominates other forms of information exchange. Moreover, low cost, automated image and video capture systems are rapidly becoming the norm in the Internet of Things (IoT) domain, also contributing to the visual information flow.
To which degree an image is later remembered or forgotten is expressed as image memorability. It is an important cognitive measure to be taken into account while processing visual content, whether for human to human or machine to human communication or for storage.
Memorability estimation has a large variety of practical applications, such as selecting or designing highly memorable advertising material, organizing and tagging of photos in albums, introducing a real-time, image memorability measure built into consumer digital cameras, helping to make highly memorable presentations and data visualizations, improving memorability of specific parts of a graphical user interface (GUI) or helping to illustrate education material. An application of a great interest is to measure a decline in memory capacity of patients affected by dementia (such as Alzheimer’s and Parkinson’s diseases) and forms of mild cognitive impairment (MCI).
Prior research has shown that image memorability is a stable property, that is, individuals tend to remember the same images with the same probability regardless of delays, and that it can be quantified and measured. This research has led to the first attempts to learn and predict memorability with machine learning frameworks, initially with low-level, global image features, reaching moderate success. Improving such a solution would, however, require the design of new features, which demands strong domain knowledge that is not well understood in the specific case of memorability.
It has been shown that this problem can be mitigated by applying deep learning techniques to the memorability domain. Deep learning, however, requires a large training dataset, which was not available until Khosla et al. introduced the large memorability dataset LaMem with 60K images and subsequently used it to train MemNet, which is based on AlexNet initialized on the ImageNet and Places datasets. MemNet achieves Spearman’s rank correlation compared with the human consistency as measured by .
Intuitively, image regions that immediately draw our attention would appear to be linked with highly memorable visual content. Indeed, earlier works confirmed this assumption and indicated a potential relationship between visual attention and memorability, but did not further investigate their correlation. To that end, we propose the Attention based Memorability estimation Network (AMNet), a novel, deep neural network architecture with a recurrent, visual attention mechanism whose primary goal is to improve on the state of the art for the memorability prediction task. We also show the advantages of visualizing the generated attention maps and their connection to the memorability property. Our approach is extensively evaluated on the LaMem and SUN Memorability datasets. The main contributions of our work are:
AMNet as a generic architecture for regression tasks with deep CNN, visual attention mechanism and recurrent neural network.
application of the proposed AMNet to the image memorability estimation.
introduction of the incremental memorability estimation with the recurrent network and demonstration of the achieved performance gain.
introduction of the visual attention technique for the memorability estimation and presentation of the performance gain.
demonstration that transfer learning from deep models, trained for image classification, is particularly beneficial for the memorability estimation.
The paper is organized as follows: Section 2 provides background material on image memorability, its properties, measurement and prediction. In section 3 we propose the AMNet and discuss the theoretical framework behind this architecture and the training procedure. The performance of AMNet is studied in section 4, with section 5 concluding this work.
In a pioneering work on image memorability, Isola et al. demonstrated that the ability of our cognition system to remember certain images and forget others is congruent among independent observers, despite large variability in the image content, concluding that memorability is a stable property, intrinsic to images. Based on this premise, Isola et al. investigated factors that give rise to the image memorability effect, which were then used to predict image memorability scores with a machine learning program, based on the global image features GIST, SIFT, HOG, SSIM and pixel histograms.
In order to build better computational models to learn and predict memorability, researchers analyzed the relationship between memorability and various visual factors, image classes and saliency. Bylinskii et al. conducted a number of experiments to better understand the intrinsic and extrinsic effects on image memorability, concluding that the primary substrate of memorability lies in the intrinsic properties of images and all extrinsic effects contribute only marginally.
Deep learning was first applied to the memorability problem by Baveye et al., who proposed the MemoNet model based on GoogLeNet trained on the ImageNet dataset. Another work used CNN features with SVR to predict memorability with accuracy comparable to MemoNet.
To achieve higher accuracy with deep learning techniques, Khosla et al. collected a large memorability dataset LaMem with 60K images and introduced the MemNet model based on the Hybrid-CNN, which is the AlexNet CNN pretrained on the ImageNet and the Places datasets ( million images in total). Researchers also tried to improve memorability prediction by other techniques, such as adaptive transfer learning from external sources or predicting image memorability by multi-view adaptive regression, none exceeding the performance of MemNet.
The relationship between visual attention and memorability was already suggested by Isola et al. but was not further investigated. Mancas and Le Meur studied the link between saliency and memorability and found that the most memorable images have uniquely localized regions, while less memorable ones either do not have precise regions of interest or have several of them. Based on these findings, the authors devised new attention-related features that improved the memorability prediction by 2% compared to the non attention based models. In a similar work, Celikkale et al. applied an attention driven spatial pooling pipeline based on SIFT and HOG features and bottom-up and object-level saliency detectors. Their results, albeit only moderate, still indicate a benefit of the attention based approach. The importance of the memorability regions was explored by Khosla et al., who introduced the concept of attention maps that relate image regions to memorability. These maps are learnt directly as clusters of gradients, textures and color features with the SVM-Rank solver, with results showing benefits of the attention on memorability prediction.
In our work we investigate the application of deep learning methods with visual attention and recurrent network to learn and predict image memorability. To our knowledge the presented approach has not been attempted before.
The AMNet architecture is based on four main components: a deep CNN trained on a large-scale image classification task, a soft attention network, a Long Short Term Memory (LSTM) recurrent neural network, and a fully connected neural network for memorability score regression.
In the following section we introduce the details of the AMNet architecture as shown in Figure 2, starting with the pre-trained CNN for transfer learning. Subsequently we describe the visual soft attention mechanism, the LSTM, and the network for memorability regression. Finally we outline the training procedure and finish with the data augmentation process.
It is common practice to use a pretrained CNN as a fixed feature extractor or to fine tune it for a similar application , mainly to reduce training time and overfitting on tasks with small datasets.
This technique is readily applied to computer vision problems centered around semantic features such as object detection and segmentation; however, little is known about such transfer learning for image memorability estimation, since there is no clear understanding of what visual features trigger the effects of remembering and forgetting.
Khosla et al. have already shown the benefits of fine tuning a pretrained CNN for this domain; however, we decided to evaluate a much deeper model as a fixed feature extractor. Our results show that the features learnt for image classification are highly suitable for the memorability task. In our work we use the ResNet50 model trained on ImageNet, where it achieves a top-1 error of 24.7%.
The ability of a neural network to learn which discrete information elements to focus on within a given training sample was first applied in machine translation by Bahdanau et al. This mechanism is called soft attention due to the fact that it produces a probability weight for every information element rather than a hard decision boundary. The benefit of soft attention is that it can be learnt end-to-end with a gradient based optimization method.
The soft attention mechanism has two components, a network that learns probabilities for each information element within the input data and a gating function that uses these probabilities to weigh data for further processing.
The AMNet estimates the image memorability by taking a single image and generating a memorability score .
The process of memorability estimation is summarized in algorithm 1.
All vectors are column vectors, unless stated otherwise. The memorability is estimated with LSTM  over a three steps long sequence . The LSTM is defined as:
where is the LSTM state at time with size . The vector represents a new image features produced at the step as a result of the application of the attention weights on the input image features and is calculated as a simple weighted sum such that
where are the attention probabilities conditioned on the entire image feature vector and previous LSTM hidden state
The attention probabilities, as well as other functions are parameterised with neural networks. The attention is then represented as a vector of weights produced by a softmax function
The attention weights vector is a product of the image feature vector and the LSTM hidden state
is a simple sum of two affine transformations followed by logistic function
where and are network weights and biases respectively, estimated together with other parameters of the network during optimization.
In order to experiment with the effects of the attention we can conditionally disable it by defining the as a constant function with unit output such that:
The result is that all feature vectors in are considered equally, thus disabling the attention mechanism.
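The attention step described above can be illustrated with a minimal, pure-Python sketch. It is not the authors' implementation: the function names `softmax` and `attend`, the toy feature values and the example scores are all hypothetical, and the sketch assumes that a constant attention score is still passed through the softmax when attention is disabled.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features, scores):
    """Weighted sum of feature vectors using softmax(scores) as weights.

    features: list of L feature vectors, each a list of D floats
    scores:   list of L unnormalized attention scores, one per location
    """
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return pooled, weights

# Toy input: 4 image locations, 2 feature channels (made-up values).
features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]

# In the real model the scores come from the attention network; these are made up.
pooled, weights = attend(features, [0.1, 0.2, 3.0, 0.1])

# Attention disabled: a constant score yields uniform weights, so the
# pooled vector is simply the mean of all feature vectors.
pooled_off, weights_off = attend(features, [1.0] * len(features))
```

With learned scores the pooled vector is dominated by the highest-scoring location, while the constant score reduces the gating to plain average pooling over locations.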
At each step the network produces one discrete memorability score calculated as:
The function maps the LSTM hidden state to the memorability score. The final image memorability is calculated as a sum of the discrete memorabilities
In the first step, the LSTM hidden and memory states are initialized from the image feature vector as follows:
where the functions are single, fully connected neural networks with activation.
The AMNet model is trained by minimizing the following loss function:
The first term represents a mean squared error between the ground truth and predicted image memorability. In order to encourage the attention model to explore all image regions over all time steps, we add a second term, which performs a joint - penalty as a function of activations of all attention maps in the LSTM sequence , introduced by Xu et al. The hyper-parameter specifies the impact of this penalty.
represents the penalty, which enforces sparsity along the sequence dimension . In other words, it encourages a strong activation for only one of the attention maps at location .
Finally, the penalty in the form of in Eq. 14 further promotes an even distribution of activations over all locations. The value of the parameter was experimentally determined as for which the network achieved the highest performance.
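Since the loss equation itself is not reproduced here, the following is only a sketch of how such a loss could look. It assumes the attention penalty has the form used by Xu et al., that is, for every location the attention weights are summed over the steps and the squared deviation of that sum from one is accumulated. The name `lam` for the hyper-parameter, the tensor layout and the averaging of the penalty over the batch are assumptions made for this illustration.

```python
def attention_regularized_loss(y_true, y_pred, attention, lam):
    """Mean squared error plus an attention penalty in the style of Xu et al.

    y_true, y_pred: lists of N memorability scores
    attention:      attention[n][t][i] is the weight at step t, location i, sample n
    lam:            penalty weight (hypothetical name for the hyper-parameter)
    """
    n = len(y_true)
    mse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n

    penalty = 0.0
    for maps in attention:
        num_locations = len(maps[0])
        for i in range(num_locations):
            # Sum over the steps; weights are non-negative, so this is an L1 norm.
            total_at_i = sum(step[i] for step in maps)
            # Squared deviation from one, accumulated over all locations.
            penalty += (1.0 - total_at_i) ** 2
    # Averaging the penalty over the batch is a choice made for this sketch.
    return mse + lam * penalty / n

# Two steps, two locations: each location is attended exactly once -> zero penalty.
balanced = attention_regularized_loss([0.5], [0.7], [[[1.0, 0.0], [0.0, 1.0]]], 0.1)
# Both steps attend the same location -> penalty of 2 before weighting.
collapsed = attention_regularized_loss([0.5], [0.7], [[[1.0, 0.0], [1.0, 0.0]]], 0.1)
```

The penalty is smallest when every location receives a total weight close to one across the sequence, which discourages the attention maps from collapsing onto the same region at every step.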
The entire model is fully differentiable and trained end-to-end with the ADAM optimizer with a fixed learning rate . The input image feature vector is extracted from the layer of the ResNet50 with dimensions . The ResNet50 is trained for image classification on the ImageNet dataset and its weights are not updated during the AMNet training.
The AMNet network is heavily regularized with dropout and with small weights regularization. We found that the dropout was critical to stop the network from overfitting. The training was carried out in minibatches of 256 images and terminated by early stopping when the observed Spearman’s rank correlation on the validation dataset reached its maximum, which was between epoch 30 and 50 depending on the split and the training dataset (LaMem or SUN). Training and validation losses as well as the memorability rank correlation on the validation dataset for LaMem split 1 are shown in Figure 3.
Common augmentation techniques are applied to the images during the training stage to reduce overfitting and improve generalization. A crop of random size of (0.08 to 1.0) of the original size and a random aspect ratio of 3/4 to 4/3 of the original aspect ratio is made and then resized to and randomly, horizontally flipped. For the evaluation only a center crop was selected for the input.
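The sampling of the crop and flip parameters can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the aspect ratio is interpreted as the width to height ratio of the crop and sampled log-uniformly, and the retry count and whole-image fallback are choices made for the sketch. The subsequent resize is omitted because the target size is not given above.

```python
import math
import random

def sample_crop_box(width, height, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3),
                    rng=random, attempts=10):
    """Return (left, top, crop_w, crop_h) for a random crop inside the image."""
    area = width * height
    for _ in range(attempts):
        # Pick the crop area as a fraction of the full image area.
        target_area = area * rng.uniform(*scale)
        # Sample the aspect ratio log-uniformly between the two bounds.
        log_ratio = rng.uniform(math.log(ratio[0]), math.log(ratio[1]))
        aspect = math.exp(log_ratio)
        crop_w = int(round(math.sqrt(target_area * aspect)))
        crop_h = int(round(math.sqrt(target_area / aspect)))
        if 0 < crop_w <= width and 0 < crop_h <= height:
            left = rng.randint(0, width - crop_w)
            top = rng.randint(0, height - crop_h)
            return left, top, crop_w, crop_h
    # Fallback for this sketch: use the whole image if no valid box was found.
    return 0, 0, width, height

def augment_params(width, height, rng=random):
    """Crop box plus a coin flip deciding on the horizontal mirror."""
    box = sample_crop_box(width, height, rng=rng)
    flip = rng.random() < 0.5
    return box, flip

# Example: parameters for a 640x480 training image with a seeded generator.
box, flip = augment_params(640, 480, rng=random.Random(0))
```

At evaluation time no randomness is involved, since only a center crop is used.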
Memorability scores in the LaMem dataset are in the range with distribution shown in Figure 4. For the training purpose the memorability scores were zero mean centered and scaled to range .
We evaluate the AMNet on the LaMem and SUN Memorability datasets. First we briefly describe the datasets and the evaluation metrics used, and then present our qualitative and quantitative results with a comparison against the state of the art.
The main focus of this research work is on the LaMem dataset due to its large size, which makes it suitable for training deep neural networks. LaMem is the largest annotated image memorability dataset to date, with a total of 58741 images. The images cover a wide range of indoor and outdoor environments, objects and people and were obtained from other labeled datasets such as MIR Flickr, the AVA dataset, the affective images dataset, image saliency datasets, SUN, the image popularity dataset, the Abnormal Objects dataset and a Pascal dataset. The memorability scores were collected manually on Amazon Mechanical Turk (AMT) by means of a memorability game introduced by  and improved by . Approximately 80 measurements (memorable=yes/no) were collected per image. There are 5 random splits, each with 45000 images for training, 3741 for evaluation and 10000 for testing.
As a second dataset for evaluation we chose the SUN Memorability dataset pioneered by Isola et al. . There are 2222 images in total, originating from the SUN  dataset with memorability scores collected similarly to the LaMem. There are 25 random splits with equal number of 1111 images for training and testing.
Following the previous work, we report on the performance in terms of rank correlation, specifically a Spearman’s rank correlation coefficient  and mean squared error .
The Spearman’s rank correlation coefficient measures consistency between the predicted and ground truth ranking, within the range where zero represents no correlation. Higher values indicate better memorability prediction method:
where is a number of samples, is a rank of the ground truth memorability score, and the prediction.
The mean squared error is used as a secondary metric, not always presented in previous work. The Spearman’s rank correlation shows a monotonic relationship between the reference and observations but does not reflect the absolute numerical errors between them, which are instead captured by the mean squared error according to:
where is the ground truth memorability score, while the prediction and number of tested samples.
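Both metrics can be illustrated with a short sketch. The rank correlation below uses the standard closed form for data without ties, 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), where d_i is the difference between the two ranks of sample i. The function names and the example scores are made up for the illustration.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(truth, predicted):
    """Spearman's rank correlation for tie-free data."""
    n = len(truth)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(truth), ranks(predicted)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def mean_squared_error(truth, predicted):
    """Average squared difference between ground truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(truth, predicted)) / len(truth)

truth = [0.45, 0.62, 0.78, 0.90]            # made-up memorability scores
shifted = [t - 0.2 for t in truth]          # correct ordering, biased values
flipped = truth[::-1]                       # fully reversed ordering

rho_shifted = spearman_rho(truth, shifted)
mse_shifted = mean_squared_error(truth, shifted)
rho_flipped = spearman_rho(truth, flipped)
```

The biased predictions keep a perfect rank correlation while still incurring a non-zero squared error, which is why the two metrics are reported together.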
In order to obtain results that are fully comparable with the previous work, we used the same training and evaluation protocol as in the  for the LaMem dataset and  for the SUN memorability dataset.
Evaluation on the LaMem dataset was performed by training one model on each of the five random splits as suggested by the authors  and then reporting the final memorability rank correlation and , averaged over the results from five corresponding test datasets.
|Method (LaMem dataset)||Rank correlation||MSE|
|AMNet (no attention)||0.663||0.0085|
In Table 1 we show that the AMNet model with the active attention achieves , or a 5.8% improvement over the best known method MemNet. Even without attention the AMNet outperforms prior work by 3.6%, which demonstrates that the pretrained, deep CNN with our recurrent and regression network layers still achieves high accuracy. The comparatively low performance of the CNN-MTLES method can be attributed to the fact that this model uses various, specifically engineered visual features and features extracted from CNN networks trained on ImageNet and Places. Thus it does not leverage end-to-end deep learning. The CNN-MTLES, however, uses the LaMem dataset, which indicates that even a large dataset does not significantly improve the performance of models based on engineered visual features.
|Method (SUN Memorability dataset)||Rank correlation||MSE|
|Mancas & Le Meur||0.479||NA|
|AMNet (no attention)||0.62||0.012|
|MemoNet 30k||0.636||0.012|
To train the deep AMNet model on the rather small SUN dataset we had to increase regularization to avoid overfitting. We found that in this specific case weights regularization performed better than a stronger dropout or the combination of both. Table 2 shows that the AMNet with attention performs 2% better than the current best model. With the attention disabled, the performance declined to a rank correlation of 0.62, demonstrating the advantages of visual attention for this task.
We found that during training the mean squared error on the validation datasets follows a similar trend to the rank correlation; however, the rank correlation peaks after the model starts overfitting, as seen in Figure 3. It is conceivable that the slightly higher variance at the rank correlation maximum improves generalization in terms of the monotonic relationship between the predicted and ground truth scores, even though the mean squared error starts increasing. For example, during the training on the LaMem split 1, as seen in Figure 3, we attained maximum and while for minimum .
Tables 1 and 2 show that the AMNet exhibits the best performance in terms of the Spearman’s rank correlation as well as the mean squared error on both the LaMem and the SUN datasets. The best performance attains on the LaMem dataset, approaching of the human performance as measured by Khosla et al. A comparison against the state of the art can be seen in Figure 7.
The significant performance gain is achieved by the fact that the neural network learns to focus its attention to specific regions most relevant to memorability. The improvement is close to 2% on the LaMem and almost 5% on the SUN dataset. AMNet learns to explore the image content by producing three visual attention maps, each conditioned on the image content obtained by exploiting the previous map. We have experimented with 2,3,4,5 and 6 LSTM steps and found that three steps are sufficient to achieve the reported performance.
|a) 0.453 (0.45)||b) 0.453 (0.44)||c) 0.78 (0.794)||d) 0.881 (0.892)||e) 0.9 (0.894)||f) 0.887 (0.896)|
In order to interpret the relation between the attention maps and corresponding discrete memorability estimations in each LSTM step, we converted the attention maps to heat maps and visualized them along with the memorability scores. In Figure 5 we show selected images from the LaMem, split 2 test dataset. Images (a), (b) and (c) have low memorability, image (d) a medium one and (e) and (f) high memorability. Images of the attention maps are obtained by taking the output of the softmax function Eq. 6, scaled to range and resized from to .
As we can see in images (a), (c) and (d) in Figure 5, most of the first attention weights gravitate towards the image center, which is most likely caused by the center bias, studied in prior work and attributed primarily to the photographer bias. In the subsequent LSTM steps, however, the attention usually moves to the regions responsible for memorability.
After a close inspection, we found that the attention maps for low memorability images tend to be sparser with few small peaks, while for higher image memorability, the attention maps display sharper focus covering larger regions around the activation peaks. Core image memorability usually originates in regions with people and human faces as evident in images (c) and (f) in Figure 5.
Moreover, we found that the estimates of discrete memorabilities in Eq. 10 decrease with each LSTM step for low memorability images, while for high memorability images they grow. This relation is shown in Figure 6. This effect is consistent within the LaMem test datasets across all splits and can be seen in Figure 5.
Initially, we experimented with an additional penalty function that would encourage the optimizer to estimate the discrete memorabilities in ascending or descending order; however, this always caused a drop in performance. The above observation explains this effect, that is, the gradient of the discrete memorabilities over the LSTM steps differs depending on the core image memorability. Thus forcing the optimizer to maintain a positive or negative gradient has a detrimental effect on the model convergence.
In this work we propose AMNet, a novel deep neural network with a visual attention component for image memorability estimation. This network consists of a pre-trained, deep CNN followed by a modified visual attention mechanism with a recurrent network and a network for memorability regression. By design the AMNet is generic and could be employed for other regression tasks in computer vision.
We show that a deep CNN, trained on large-scale image classification is beneficial for the memorability estimation task, indicating that the feature hierarchies extracted for the image classification are suitable to express the composition underlying the memorability effect.
Finally, we demonstrate that our recurrent visual attention network significantly improves performance of the image memorability learning and inference.