Enhancing Perceptual Attributes with Bayesian Style Generation

12/03/2018 ∙ by Aliaksandr Siarohin, et al. ∙ Università di Trento

Deep learning has brought unprecedented progress in computer vision, and significant advances have been made in predicting subjective properties inherent to visual data (e.g., memorability, aesthetic quality, evoked emotions). Recently, some works have even proposed deep learning approaches that modify images so as to alter these properties. Following this line of research, this paper introduces a novel deep learning framework for synthesizing images in order to enhance a predefined perceptual attribute. Our approach takes a natural image as input and exploits recent models for deep style transfer and generative adversarial networks to change its style so as to modify a specific high-level attribute. Unlike previous works, which focus on enhancing a single property of visual content, we propose a general framework and demonstrate its effectiveness in two use cases: increasing image memorability and generating scary pictures. We evaluate the proposed approach on publicly available benchmarks, demonstrating its advantages over state-of-the-art methods.




1 Introduction

The recent advances in predicting and understanding subjective properties of visual data (e.g. beauty, memorability, interestingness) enabled by deep learning models [17, 18, 9, 13, 3, 22] have motivated researchers in computer vision to take a step forward and investigate automatic techniques that manipulate images in order to modify these properties. For instance, recent works have proposed methods to edit images in order to increase their memorability [25], to improve their aesthetic quality [30] or to evoke specific emotional reactions in users [21]. Recently, deep style transfer methods [11, 6, 15, 29], which allow users to modify pictures by blending them with style images, have gained popularity. These methods have significantly widened the set of editing operations available in traditional image enhancement tools, fostering the diffusion of novel software for turning user pictures into artworks. While earlier methods for neural style transfer [6, 29] considered a fixed set of styles and relied on slow optimization processes, more recent approaches [11, 15] are highly flexible, enable the generation of arbitrary styles and run in close to real time.

Figure 1: Idea behind our approach. Given a generic input image (yellow box) our framework provides as output a set of stylized images (green box) obtained by applying the styles which maximally enhance a given perceptual attribute. The attribute value (shown on top left corners of input and output images) is automatically assessed by a deep network. The style selection process is achieved by modeling the style space (light blue box) as a probability distribution automatically learned from a given training set of style images (orange box) using a generative adversarial network.

Motivated by these recent advances, in this paper we propose a novel approach for generating stylized images in order to enhance a given perceptual attribute. Similarly to previous deep style transfer methods [11, 15], the stylized images are obtained by training a feed-forward neural network which receives as input the original images and the style pictures. Unlike previous works, the style choice is not made by a user but is automatic and driven by a specific criterion: increasing the value of the given perceptual attribute. At the core of our style selection process there is a novel probabilistic framework which exploits recent Generative Adversarial Networks (GANs) to learn a probability distribution modeling the style space and Markov Chain Monte Carlo (MCMC) methods to sample from the learned distribution and compute the best styles. We named the proposed approach BAE (Bayesian Attribute Enhancement). While our framework is generic and can be used for different types of perceptual attributes, in this work we focus on two applications, i.e. increasing memorability, defined as the probability of an image to be remembered [13], and generating scary pictures. We quantitatively and qualitatively evaluate the proposed approach on publicly available datasets, demonstrating superior performance over state-of-the-art methods. Figure 1 illustrates the intuition behind our method.

Contributions. To summarize, the contribution of this work is threefold. (i) We propose a novel framework to automatically modify an input image in order to alter its inherent perceptual attributes. To preserve the semantic content of the original image, our approach relies on a neural style transfer method. In this way, the problem of perceptual attribute enhancement naturally translates to that of retrieving the best styles to apply to the given image. Unlike previous works, which focus on modifying a specific subjective property [25, 30, 21], our method can be applied to an arbitrary attribute. While we tested it on two scenarios, we expect the method to be useful in other applications, e.g. for enhancing the aesthetic quality of images or for increasing their virality score. (ii) By exploiting state-of-the-art deep style transfer techniques [11] within a novel probabilistic framework for modelling the style space, our approach does not simply select the best styles from a small predefined set but can also generate arbitrary new styles. Thus, a higher increase of the attribute score can be obtained with respect to previous approaches [25]. (iii) Our framework is highly flexible and allows automatic selection not only of the best styles but also of the degree of stylization. Furthermore, by relying on MCMC sampling methods, it can be used to compute multiple styles. In this way we keep the users in the loop, suggesting the best styles for attribute increase but still allowing the users to choose among multiple stylized images according to their personal preferences.

2 Related Works

Our work lies at the intersection between two main research lines. The first line focuses on the problem of understanding and predicting subjective properties from visual data, the second one includes works proposing novel deep models for automatic image editing.

Predicting perceptual attributes from visual data. In the last decade several works in computer vision and multimedia have addressed the problem of modelling and predicting perceptual attributes from images and videos. These studies have focused on the automatic assessment of aesthetic value [17, 18], interestingness [9], memorability [13], virality [3], symmetry [5], etc. In some cases, typically where a large amount of training data is available, automatic systems can even reach human-level performance. For instance, Khosla et al. [13] showed that a deep learning model trained on LaMem, the largest memorability dataset so far, can predict image memorability with an accuracy close to that of human annotators. Similarly, recent methods for automatically computing the aesthetic value of images are quite precise, achieving an accuracy above 75% on the AVA dataset [17, 18, 20]. In this work we do not focus only on predicting subjective attributes but also address the more challenging task of image enhancement.

Deep Models for Automatic Image Manipulation. Deep learning models and, in particular, neural style transfer methods [11, 6, 15, 29] and deep generative networks [7, 12] have enabled significant advances in automatic image editing and generation. In the wake of this progress, recent works have taken a step beyond the prediction of perceptual attributes and have proposed methods to manipulate images in order to modify these intrinsic attributes [30, 25, 28]. For instance, Wang et al. [30] addressed the task of increasing the aesthetic value of an image by finding the best crop. Tsai et al. [28] proposed a deep model for image harmonization which adjusts the appearance of the image foreground in order to better adapt it to the background. Liao et al. [16] introduced a method to alter intrinsic image properties like color, texture or style based on deep analogy and visual property transfer. However, these works simply propose strategies to modify a specific property of images but do not provide a general framework to systematically enhance an arbitrary perceptual attribute and quantitatively assess its value increase. Recently, Siarohin et al. [25] moved a step forward in this direction by proposing an approach which selects the best styles for a given image in order to increase its memorability. Still, their method relies on a pre-defined set of styles and the degree of stylization is also fixed a priori. In this work, we overcome these limitations by introducing a more general and flexible approach which operates on a large set of styles and where the trade-off between style and content is regulated by a user-defined hyper-parameter.

3 Enhancing Perceptual Attributes with BAE

As stated in Section 1, the proposed approach deals with the problem of automatically modifying an arbitrary input image in order to enhance a specific perceptual attribute, e.g. its memorability or the likelihood to evoke specific emotional reactions from users. This task is addressed within a novel Bayesian framework that relies on a state-of-the-art neural style transfer method [11]. Our approach aims to modify the given image, increasing its perceptual attribute score by changing its style while retaining the semantic content. In the following we briefly describe the neural style transfer method in [11] and then introduce the proposed approach, providing some details on our implementation.

3.1 Arbitrary Style Transfer

Given an input image I and a style image, a style transfer model produces a modified, stylized image. In this work we consider the style transfer approach in [11] since, unlike earlier methods [6, 29], i) it is not tied to a fixed set of styles and can generate arbitrary new styles, ii) it performs style transfer in real time, and iii) it is very flexible, making it possible to control the degree of stylization also at test time.

The deep architecture proposed in [11] has a simple encoder-decoder structure. The encoder computes the feature maps of the input image and of the style image. These feature maps are then fed to a specific feature alignment layer, the Adaptive Instance Normalization (AdaIN) layer, which aligns the channel-wise mean and variance of the image features to those of the style features, producing the target feature maps (Eqn. 1). The decoder is trained to map the target feature maps back to the image space, generating the stylized image. As is typical in neural style transfer methods, the network is trained by optimizing a loss function which is the weighted sum of two terms, the content loss and the style loss, where a user-defined weight regulates at training time the trade-off between semantic content and stylization. We refer the reader to the original paper [11] for details on the definition of the loss functions.

A prominent feature of the neural style transfer method in [11] is the possibility to control the degree of stylization not only at training time, by changing the loss weight, but also at test time. In particular, a parameter α is introduced and the stylized image is computed by decoding a convex combination of the input feature maps and the target feature maps, weighted by 1 − α and α respectively (Eqn. 2). Here, α = 0 corresponds to the case where no style transfer is performed, while α = 1 corresponds to full stylization.
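As a minimal sketch of the AdaIN alignment and of the test-time blending described above, the following plain-Python code operates on feature maps stored as lists of channels. The function names, the list-based layout and the small constant EPS are assumptions made for illustration; the encoder and decoder of [11] are not reproduced here.

```python
from statistics import fmean, pstdev

EPS = 1e-5  # hypothetical constant that avoids division by zero


def channel_stats(features):
    """Return per-channel means and standard deviations of a list of channels."""
    means = [fmean(ch) for ch in features]
    stds = [pstdev(ch) + EPS for ch in features]
    return means, stds


def adain(content, style):
    """Normalize each content channel, then rescale it to the style statistics."""
    c_mean, c_std = channel_stats(content)
    s_mean, s_std = channel_stats(style)
    return [
        [s_std[k] * (x - c_mean[k]) / c_std[k] + s_mean[k] for x in ch]
        for k, ch in enumerate(content)
    ]


def blend(content, style, alpha):
    """Mix content features (alpha = 0) with the AdaIN target (alpha = 1)."""
    target = adain(content, style)
    return [
        [(1.0 - alpha) * x + alpha * t for x, t in zip(ch, tch)]
        for ch, tch in zip(content, target)
    ]
```

In a full pipeline the output of blend would be passed to the decoder; with alpha = 0 the content features are returned unchanged, which matches the "no style transfer" case.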

3.2 Bayesian Attribute Enhancement

The main idea behind the proposed approach is to construct a model which, given an arbitrary image I and a set of style images, is able to automatically compute a novel set of styles that, applied to the input image, maximally enhance a specific perceptual attribute, e.g. increase its memorability. To build this model, inspired by [11], we first introduce a compact representation for styles in terms of mean and standard deviation of activations. Formally, given a style image and a pre-trained encoder network, we define a style as the pair formed by the mean and the standard deviation of the encoder activations computed on that style image.

Figure 2: Overview of our approach: given an input image I, our method generates, through the GAN generator, the style which maximizes the perceptual attribute score of the stylized image, as computed by the attribute predictor.

Given the set of style images and the associated style representations, we propose to learn a probability density function modelling the style space. While different methods can be used for this purpose, motivated by the recent successes of deep generative models [7], in this paper we consider a Generative Adversarial Network (GAN). A GAN consists of two networks, a generator and a discriminator. These two networks play a minimax game in which the task of the discriminator is to distinguish the samples produced by the generator from the real samples, and the task of the generator is to increase the chances that the discriminator assigns a high probability to a synthetic example. In [7] it is shown that the equilibrium in this game is achieved when the probability density of the generated samples is equal to the probability density of the real ones. For our application we use the style representations as real samples, and learn the generator in order to produce styles from noise. The input to the generator is sampled from some simple noise distribution, such as a Gaussian distribution. For training the GAN model we use an efficient version of Wasserstein GANs [4], specifically WGAN-GP, recently proposed in [8].
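For reference, the WGAN-GP training objective, as it is commonly written, is sketched below. The symbols (critic D, generator G, real distribution P_r, generated distribution P_g, penalty weight \lambda_{gp}) are generic notation assumed here rather than taken from the text above; in this setting the real samples s would be the style representations.

```latex
\begin{aligned}
L_D &= \mathbb{E}_{\tilde{s} \sim P_g}\big[D(\tilde{s})\big]
     - \mathbb{E}_{s \sim P_r}\big[D(s)\big]
     + \lambda_{gp}\, \mathbb{E}_{\hat{s}}\Big[\big(\lVert \nabla_{\hat{s}} D(\hat{s}) \rVert_2 - 1\big)^2\Big], \\
\hat{s} &= \epsilon\, s + (1 - \epsilon)\, \tilde{s}, \qquad \epsilon \sim U[0, 1], \\
L_G &= -\,\mathbb{E}_{z}\big[D(G(z))\big].
\end{aligned}
```

The gradient penalty term replaces the weight clipping of the original Wasserstein GAN as the means of keeping the critic approximately 1-Lipschitz.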

In addition to the GAN model, we propose to learn two additional deep networks. The first network implements the neural style transfer approach described in Section 3.1 and, given a style, produces the stylized image. The second network is used to learn a function which, given an input image I, outputs a probability score reflecting the strength of a given perceptual attribute. The design of this network and its training strategy, described in Section 3.3, is at the core of our method and depends on the chosen subjective attribute. In particular, in this paper we consider two attributes, memorability and scariness, i.e. we propose two different criteria for modifying pictures: increasing their memorability and maximizing their likelihood to evoke scary reactions in users. Given the above definitions, we propose to build a probability density for the joint model:


where the last term is derived from the three learned models: the generator, the style transfer network and the attribute predictor. In this way, we obtain a probability over the generator input which, in our work, can be seen as a latent representation of a style. We propose to exploit this density in order to find the styles which better enhance a given perceptual attribute. Specifically, to obtain a diverse set of styles corresponding to high values of the target attribute, we propose to sample from it using MCMC methods. The best styles, in fact, correspond to the modes of the distribution.

We also extend the proposed Bayesian framework in order to automatically compute not only the best styles but also the degree of stylization. To this aim, we consider Eqn. 2. However, instead of setting α as a constant, we assume that α is a random variable. In this case, similarly to Eqn. 3, we define a joint probability over the latent style representation and α, in which α is assigned a prior probability. With MCMC sampling we then obtain a set of latent style representations, as well as a set of stylization coefficients. In the following, we refer to our method as Bayesian Attribute Enhancer (BAE), while its adaptive version, where the value of α is also computed automatically, is called ABAE. An overview of the proposed framework is illustrated in Fig. 2.
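One plausible way to turn the joint model above into an energy for MCMC sampling is sketched below. It assumes, as an illustration only, that the unnormalized density is the attribute probability of the stylized image multiplied by a standard Gaussian prior on the latent style vector and, in the adaptive case, by a prior on the stylization coefficient. All function names, the callable arguments and the clipping range [0, 1] are hypothetical and not taken from the text.

```python
import math


def log_std_normal(z):
    """Log density of a standard Gaussian evaluated at the vector z."""
    return sum(-0.5 * zi * zi - 0.5 * math.log(2.0 * math.pi) for zi in z)


def energy(z, alpha, image, generator, stylize, attribute_prob,
           log_alpha_prior=None):
    """Negative log of an assumed unnormalized joint density.

    generator(z) returns a style, stylize(image, style, alpha) returns the
    stylized image and attribute_prob(img) returns a score in (0, 1].
    """
    alpha = min(1.0, max(0.0, alpha))  # hypothetical clipping range [0, 1]
    stylized = stylize(image, generator(z), alpha)
    log_p = math.log(attribute_prob(stylized)) + log_std_normal(z)
    if log_alpha_prior is not None:  # adaptive variant: alpha is sampled too
        log_p += log_alpha_prior(alpha)
    return -log_p
```

Low energy corresponds to latent vectors that are both likely under the prior and mapped to styles with a high attribute score, which is why the modes of the distribution are the styles of interest.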

3.3 Implementation

In this Section we report additional details on the implementation of the proposed method. In particular, we describe the adopted deep network architectures and provide further details on the considered MCMC sampling strategies.

Network Architectures. The neural style transfer network is implemented following the original paper [11]. The encoder is built from the first four convolutional layers of a pre-trained VGG-19 [26]. The decoder is implemented with a structure mirroring the encoder, with all pooling layers replaced with up-sampling layers. In the case of ABAE we limit the range of the coefficient α by introducing a clipping function. It is worth noting that, while we consider the method in [11], our framework allows using different style transfer approaches such as the one proposed in [15]. In this case, the only difference would be the representation of the style, which in [15] is modelled in terms of mean and covariance matrix.

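Data: energy function, number of samples M, learning rate
Result: set of sampled points
   // Initialization
1  initialize the chain and set i = 0;
2  while i < M do
       // Generate candidate point
3      draw a candidate point with a Langevin step from the current point;
       // Calculate acceptance ratio
4      compute the acceptance ratio of the candidate;
5      if the candidate is accepted then
           // Accept candidate point
6          move the chain to the candidate, add it to the set and increment i;
8      end if
10 end while
Algorithm 1 Langevin MCMC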
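The structure of Algorithm 1 can be illustrated with a standard Metropolis-adjusted Langevin sampler. This is a sketch under assumptions, not the authors' implementation: the proposal and acceptance formulas are the textbook ones, the function and parameter names are hypothetical, and, following our reading of Algorithm 1, only accepted candidates are stored.

```python
import math
import random


def langevin_mcmc(energy, grad_energy, z0, num_samples, step,
                  rng=None, max_iters=100000):
    """Collect num_samples accepted points from exp(-energy)."""
    rng = rng or random.Random(0)
    z, samples = list(z0), []

    def log_q(dst, src):
        # Log proposal density (up to a constant) of moving from src to dst.
        g = grad_energy(src)
        return -sum((d - s + step * gi) ** 2
                    for d, s, gi in zip(dst, src, g)) / (4.0 * step)

    iters = 0
    while len(samples) < num_samples and iters < max_iters:
        iters += 1
        g = grad_energy(z)
        # Candidate point: gradient step on the energy plus Gaussian noise.
        cand = [zi - step * gi + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
                for zi, gi in zip(z, g)]
        # Log acceptance ratio of the Metropolis-Hastings correction.
        log_ratio = (energy(z) - energy(cand)
                     + log_q(z, cand) - log_q(cand, z))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            z = cand
            samples.append(list(z))
    return samples
```

For example, calling langevin_mcmc with the quadratic energy 0.5 * ||z||^2 and its gradient z draws points that concentrate around the origin; in the setting of this paper the energy would instead involve the generator, the style transfer network and the attribute predictor.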

The implementation of the perceptual attribute predictor depends on the considered attribute. For memorability we rely on the MemNet model introduced in [13] to allow a fair comparison with [25]. As suggested in [13], we consider the HybridCNN network [32] and finetune it on the LaMem dataset [13]. Following this protocol, the resulting model implements a regressor. To normalize the output scores of the memorability predictor into a probability, we pass them through a sigmoid function controlled by a user-defined parameter. We follow a similar approach for deriving the predictor in the case of scariness. We use the InceptionV3 network, as it is one of the best general-purpose models [27]. We trained this model on images with their binary labels from the BAM dataset [31] and then normalized its output in the same way.

In the proposed GAN model the generator is implemented as a stack of three layers, each being either a fully-connected layer with ReLU activation or a fully-connected layer without activation. Similarly, the discriminator is defined as a stack of four such layers.

Style Sampling Methods. In this work we used MCMC sampling in order to find the best styles. MCMC is a general method for sampling from a multivariate probability distribution. We define an energy function from the joint density and choose Langevin MCMC [23] as our sampling method (see Algorithm 1). For simplicity, here we report the formulas only for BAE. The algorithm is similar for ABAE. We also experiment with two other popular MCMC methods: Metropolis-Hastings [10] and Hamiltonian [19]. The effect of using different MCMC methods for creating new styles is discussed in Section 4.2. We also introduced two modifications to the traditional methods to help increase the acceptance rate (line 5 - Algorithm 1):

  • Adaptive gradient: Instead of using the raw gradient of the energy we consider its adaptive version, in analogy to Adam [14]. We found this strategy especially helpful for ABAE, because the gradients with respect to the stylization coefficient and to the latent style representation can differ by several orders of magnitude.

  • Adaptive learning rate: At step 5 in Algorithm 1, upon rejection we decrease the learning rate, while upon acceptance we reset it to its initial value. This strategy eliminates the need to tune the learning rate for each image.
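The two modifications above can be sketched as follows. The Adam-style constants, the decay factor of 0.5 and all names are assumptions chosen for illustration; the text does not specify them.

```python
import math


class AdaptiveGradient:
    """Adam-style rescaling of a gradient, one running estimate per dimension."""

    def __init__(self, dim, beta1=0.9, beta2=0.999, eps=1e-8):
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.m = [0.0] * dim  # running mean of the gradient
        self.v = [0.0] * dim  # running mean of the squared gradient
        self.t = 0

    def __call__(self, grad):
        self.t += 1
        out = []
        for k, g in enumerate(grad):
            self.m[k] = self.beta1 * self.m[k] + (1.0 - self.beta1) * g
            self.v[k] = self.beta2 * self.v[k] + (1.0 - self.beta2) * g * g
            m_hat = self.m[k] / (1.0 - self.beta1 ** self.t)
            v_hat = self.v[k] / (1.0 - self.beta2 ** self.t)
            out.append(m_hat / (math.sqrt(v_hat) + self.eps))
        return out


def adapt_step(step, initial_step, accepted, decay=0.5):
    """Shrink the step after a rejection, restore it after an acceptance."""
    return initial_step if accepted else step * decay
```

The rescaling brings gradient components of very different magnitude to a comparable scale, which is the property the adaptive gradient is meant to provide when the latent style vector and the stylization coefficient are sampled jointly.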

4 Experimental Validation

In this Section we report the results of our experimental evaluation. First, we provide some details on the used datasets and our experimental setup (Section 4.1). Then, in Section 4.2 we quantitatively evaluate the performance of our method in enhancing two different perceptual attributes: memorability and scariness. In the case of memorability, we also discuss the effect of using different sampling methods. Finally, we report qualitative results comparing our method with baselines. Our code is available online [1].

4.1 Experimental Setup and Datasets

Datasets. We considered three datasets in our work.
The DeviantArt dataset [24] is a collection of 500 abstract art paintings collected from an online social network site, deviantart.com, devoted to user-generated art. The dataset was used in [25] to define the style set.
LaMem [13] is a collection of 58,741 images annotated with the corresponding memorability score. The scores were collected through an efficient version of the memorability game. We refer the reader to [13] for further details. This dataset was also considered in [25].
The Behance-Artistic-Media (BAM) dataset [31] is a very large dataset with automatically labeled binary attribute scores. It comprises about 20 attributes, including emotional attributes like scary, gloomy, happy and peaceful. It contains 14,585 images (with positive or negative labels) originally crowdsourced from human annotators for the scary attribute. We were able to download a subset of 11,698 images from this dataset. We use this set to train our scariness predictor.
Experimental Setup. We now provide further details on our experimental setting and implementation. We follow an experimental protocol similar to [25] in order to allow a fair comparison with their work. Note that [25] only focuses on memorability, while our approach deals with arbitrary attributes.
Styles set. For the style set we considered 500 abstract art images from the DeviantArt [24] dataset. While Siarohin et al. [25] considered a pre-defined set of styles selecting 100 images from this dataset, our approach by learning a style probability density function can potentially learn from and generate an infinite number of styles. As described in Section 3.2, we use a GAN model to represent the probability density over the set of styles. The GAN was trained with batch size equal to 64 and for about 100k iterations. All the other hyper-parameters were set as indicated in [8].
Baseline methods. In the case of memorability, we compare the performance of our method with [25]. The code from [25] is available online [2]. This is the closest work to ours, where a set of only 100 styles is considered for increasing memorability. We also consider an additional baseline method which uses the same set of 500 styles as our approach. Specifically, this baseline consists in applying the style transfer method in [11] to the given image with every style picture in the style set, and then comparing the resulting stylized images with those obtained with our method using a fixed α. In the case of scariness, we simply compare with this baseline.
Perceptual Attribute Predictors. For the target attributes, memorability and scariness, we trained two predictors from two independent sets of images. The first predictor, which we denote as the internal predictor, is used for generating the stylized images and corresponds to the attribute predictor of Section 3.2, while the second, which we call the external predictor, is employed only for assessing the performance of our method. Specifically, we use the external predictor to compute the score increases between the original image and the stylized images obtained with our method. In the case of memorability, we split LaMem into two sets of 22,500 images each and use these sets to learn the two predictors. This is exactly the same setup used in [25]. We also used the same training parameters. To verify that these predictors are valid and have performance close to human annotators, following [13], we compute the rank correlation and obtain a value of 0.63 for both models. As reported in [13], this is close to human performance (0.68). In the case of scariness, we finetuned InceptionV3 [27] (considering only the two top inception blocks), originally trained on ImageNet, using labels from the BAM dataset [31]. We split the BAM labeled set into two disjoint sets of 5,849 images each and trained two scariness prediction models.
Style Transfer. As stated above, for the style transfer network we used the recent approach from [11]. We considered the pretrained network released by [11]. For the baseline and our method BAE we used α = 0.5. In the case of adaptive α, we used a Gaussian distribution as prior (see Eqn. 2).
Hyper-parameters. For the experiment on memorability we set and , while for scariness we consider and . In general, we found that the higher is the higher is the attribute increase, but when is too high nearly all the candidates points are rejected at step 5 in Algorithm 1. So, we set to the highest value (we try values on a log scale, i.e. ) for which this effect is not observed. A similar trend is observed for the learning rate . If is high, we obtain more diverse styles. Still, when the learning rate is too high almost all candidate points are rejected. Similarly to , we set the initial learning rate to the highest possible value for which we do not observe this effect (we also try values on log scale ). The problem of choosing the optimal learning rate is partially overcome with the adaptive learning rate strategy described in Section 3.3. Still, choosing an optimal initial learning rate can greatly improve the overall method speed.
Image test set. We evaluate the performance of our approach on the same test set as in [25], consisting of 1,001 generic images. We used this test set also for the experiment on scariness.

Figure 3: Sorted memorability differences for the images obtained by averaging over the top 10 results retrieved with each method. Comparison of our methods with (left) the baseline and (right) the competing work S-cube [25].
Figure 4: Perceptual attributes scores. Sorted scores for the original images and the top results retrieved in the case of (left) memorability and (right) scariness: original image scores and comparison with the top results obtained with the baseline and our method.
(a) memorability
N    Baseline   S-cube [25]   BAE      ABAE
1    0.0792     0.0677        0.0812   0.1067
5    0.0594     0.0590        0.0762   0.0976
10   0.0488     0.0544        0.0723   0.0911

(b) scariness
N    Baseline   BAE      ABAE
1    0.4151     0.5362   0.6960
5    0.3500     0.5194   0.6775
10   0.3153     0.5075   0.6631

Table 1: Increasing (a) memorability and (b) scariness. Performance of our method with fixed α (BAE) and adaptive α (ABAE) compared to the baseline and, in the case of memorability, also to [25]. Performance is measured in terms of attribute score differences averaged over the top N results.

4.2 Results

Evaluation Metrics. Similarly to [25], we use the Top N results and compute the average score difference as evaluation measure. Specifically, for each method computing stylized images, we rank these images based on the attribute scores calculated with the internal predictor. The Top N results correspond to the subset of N images which rank the highest according to these scores. Then, given a generic image I and a corresponding stylized image, we define the score difference as the attribute score of the stylized image minus that of I, both computed with the external predictor. Finally, given the Top N results, we compute for each image I the corresponding average score difference by averaging the score differences over the Top N set.
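The evaluation measure described above can be sketched in a few lines. The function and argument names are hypothetical; internal and external stand for the two attribute predictors and are passed in as plain callables.

```python
def top_n_average_gain(original, stylized, internal, external, n):
    """Average external score gain over the N stylized images ranked best
    by the internal predictor."""
    if not stylized or n < 1:
        raise ValueError("need at least one stylized image and n >= 1")
    # Rank with the internal predictor, keep the Top N.
    ranked = sorted(stylized, key=internal, reverse=True)[:n]
    base = external(original)
    # Score differences are measured with the external predictor only.
    return sum(external(s) - base for s in ranked) / len(ranked)
```

Separating the ranking predictor from the scoring predictor keeps the reported gain from being inflated by images that merely exploit weaknesses of the predictor used during generation.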

Quantitative results. We first perform some experiments in order to compare our approach with baseline methods on the memorability enhancement task. Figure 3 reports the average memorability differences obtained for all the test images. In the plot on the left we compare our approach, in the case of fixed α (BAE) and adaptive α (ABAE), with the baseline. Our approach performs better than the baseline, and the adaptive method ABAE yields a higher memorability gain than BAE with a fixed α. This indicates that the possibility to automatically set the degree of stylization is beneficial in terms of performance. In the plot on the right we compare our best performing method, ABAE, with the competing work [25]. It can be noted that in the case of [25] the average memorability differences are negative for a large set of test images. This difference may be explained by the fact that in the case of [25] the top N styles are retrieved from a pool of only 100 art images, while our method learns the style space from an initial set of 500 styles. This result highlights the importance of considering a wide set of styles, in order to find those which better suit a given image. In this respect, our method is very powerful, being able to interpolate between the styles of a given style set, thus achieving a higher memorability increase.

Figure 4 reports the results of a similar analysis. Specifically, it depicts the sorted scores of the original image set and the corresponding sorted scores of the stylized images obtained with our methods and with the baseline in the case of (left) memorability and (right) scariness enhancement. For each image and all the methods we consider only the best stylized image, i.e. the one for which we measure the highest attribute score increase. From the figure we can observe that both plots exhibit a similar trend: our method outperforms the baseline, and the adaptive version ABAE outperforms the fixed version BAE. It can also be observed that the score increases are more significant in the case of scary images. This result may be partially due to characteristics of the considered dataset: in the case of memorability, most of the images in the dataset exhibit an initial memorability score higher than 0.5, while in the case of scariness the original score is lower than 0.2 for the large majority of the images.

Method   Top N   M.H.     Langevin   Hamiltonian
BAE      1       0.0766   0.0812     0.0780
BAE      5       0.0725   0.0762     0.0734
BAE      10      0.0692   0.0723     0.0700
ABAE     1       0.1070   0.1067     0.1094
ABAE     5       0.0995   0.0976     0.1012
ABAE     10      0.0939   0.0911     0.0955

Table 2: Performance of our methods (top) BAE and (bottom) ABAE considering different sampling strategies. Performance is measured in terms of memorability score differences averaged over the top N results.
Figure 5: Qualitative results: (left) original image and corresponding top result obtained with (center) the baseline and (right) our method BAE. The predicted memorability and scariness scores are reported below each image.
Figure 6: Qualitative results: original input image and top result obtained with the baseline and our method with fixed α (BAE) and adaptive α (ABAE). The corresponding memorability and scariness scores are reported below each image.
Figure 7: Increasing perceptual attributes: top 5 results for a given sample image. (Left) Original image and (right) comparison (top) in the memorability scenario between the baseline and BAE and (bottom) in the scariness scenario between the baseline and ABAE. Scores shown in the figure: memorability, original 0.62; baseline 0.75, 0.73, 0.73, 0.73, 0.76; BAE 0.80, 0.78, 0.80, 0.75, 0.77. Scariness, original 0.01; baseline 0.77, 0.83, 0.72, 0.60, 0.78; ABAE 0.97, 0.97, 0.95, 0.95, 0.95.
Figure 8: Increasing scariness: sample images where our method is not effective. Scores shown in the figure (original I / baseline / BAE / ABAE): 0.02 / 0.66 / 0.45 / 0.51 and 0.06 / 0.24 / 0.21 / 0.29.

A comparison between our approach and the competing methods is also provided in Table 1(a), where we report the average memorability increases over the Top N results for the test set with N = 1, 5 and 10. In all cases our method performs better than the baseline and the competing approach [25], and the highest performance is obtained with the adaptive version of our approach. It is also interesting to note that the performance of the baseline is sometimes comparable or inferior to that of the method in [25]. Indeed, the baseline and [25] are based on two different style transfer approaches, and this may explain the small performance gaps, especially in the case of small N. A similar trend is observed in the experiments on the scariness scenario (Table 1(b)).

Comparison between different sampling methods. Table 2 reports the results of our approach using different sampling methods in the case of memorability enhancement. For BAE, Metropolis-Hastings MCMC yields the lowest scores and Langevin MCMC the highest, while for ABAE the three methods are comparable, with differences below 0.005. Qualitatively, we did not observe significant differences between the three methods. In light of these results, in all our experiments we use Langevin MCMC as sampling strategy, as it represents the best trade-off between performance and computational speed.

Running time. Processing one image on an Nvidia Titan X takes 1m41s for the baseline (500 style images) and 7m20s for ABAE with Langevin MCMC (500 MCMC iterations). However, decreasing the number of ABAE iterations to 100 reduces the running time to 1m28s, while the top-1 average score difference remains higher than that of the baseline (0.55 vs. 0.41).

User study. We ran a user study to show the advantage of using our method for increasing an image attribute in the case of scariness. The study consisted of showing pairs of images to a user, who was asked to indicate, for each pair, the image that looked scarier. We randomly selected 100 images from our test set and considered, for each image, the top results obtained with the baseline and with our method ABAE. We ran the study with 11 people (6 male, 5 female); viewers voted for the image modified with ABAE in 72.36% of the cases on average. The inter-user agreement for this study, measured with Cronbach's alpha coefficient, is 0.78, thus validating the study.
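
For readers unfamiliar with the agreement measure, the sketch below computes Cronbach's alpha from a matrix of votes using the standard formula, alpha = k/(k-1) * (1 - sum of per-rater variances / variance of the summed votes), where k is the number of raters. The vote encoding (1 if a viewer preferred the ABAE image, 0 otherwise) and the function name `cronbach_alpha` are assumptions; the paper does not state how its votes were encoded.

```python
import numpy as np


def cronbach_alpha(votes):
    """Cronbach's alpha for a (num_pairs, num_raters) vote matrix.

    Each column holds one rater's votes over all image pairs
    (assumed encoding: 1 = preferred the ABAE image, 0 = the baseline).
    """
    votes = np.asarray(votes, dtype=float)
    k = votes.shape[1]
    # Sum of the sample variances of each rater's votes.
    rater_var_sum = votes.var(axis=0, ddof=1).sum()
    # Sample variance of the per-pair vote totals.
    total_var = votes.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - rater_var_sum / total_var)
```

The mean of the same matrix gives the fraction of votes in favour of ABAE, which is the kind of preference rate reported in the study.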

Qualitative results. Finally, we report some qualitative results. Figure 5 depicts sample stylized images obtained with our method and with the baseline in the case of (left) memorability and (right) scariness enhancement. Given an input image from the test set, we report the top stylized image computed by the baseline and by BAE. In both cases, the stylization coefficient is set to 0.5. For the baseline, we also display the corresponding selected style. The figure shows that, by generating new styles, our method better tailors the style to a given image and achieves a higher increase in attribute score. In Figure 6 we report additional results showing the effect of further adapting the style to the given image by computing the optimal stylization coefficient. For each image, we report the top result obtained with the baseline and with our methods BAE and ABAE. (Due to space limitations, we do not report the original style image for the baseline.) Our methods produce a significant increase in perceptual score with respect to the baseline and generally create styles that better suit the input image. Furthermore, the performance of ABAE is always close to or significantly better than that of BAE.

So far, we have compared the top-1 results obtained with different methods. In Figure 7 we instead report the top 5 results for some sample images in the two considered scenarios. Specifically, we show a comparison between the baseline and BAE in the case of memorability (left) and between the baseline and ABAE in the case of scariness (right). The images obtained with our method usually achieve higher score increases. In exchange, these increases come with a small loss of diversity among the top stylized images with respect to the baseline. In Figure 8 we report a few cases from the scariness scenario where our method does not perform as expected. In one case (left) our method performs poorly with respect to the baseline. In the other case (right) neither the baseline nor our method can find a suitable solution to create a scary picture.

5 Conclusions

We presented BAE, a novel framework for generating stylized images in order to enhance a predefined perceptual attribute. By exploiting recent advances in neural style transfer and generative adversarial models, we showed that it is possible to edit images so as to increase their memorability and scariness. Future work will be devoted to exploiting different style transfer approaches and considering other subjective properties.

6 Acknowledgments

We gratefully acknowledge Fondazione Caritro for supporting the SMARTourism project and NVIDIA Corporation for the donation of the Titan X GPU used for this research.


  • [1] https://github.com/aliaksandrsiarohin/bae
  • [2] https://github.com/aliaksandrsiarohin/mem-transfer
  • [3] Alameda-Pineda, X., Pilzer, A., Xu, D., Sebe, N., Ricci, E.: Viraliency: Pooling local virality. In: CVPR (2017)
  • [4] Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN. arXiv preprint arXiv:1701.07875 (2017)
  • [5] Funk, C., Liu, Y.: Beyond planar symmetry: Modeling human perception of reflection and rotation symmetries in the wild. In: ICCV (2017)
  • [6] Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks. In: CVPR (2016)
  • [7] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: NIPS (2014)
  • [8] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C.: Improved training of Wasserstein GANs. In: NIPS (2017)
  • [9] Gygli, M., Grabner, H., Riemenschneider, H., Nater, F., Van Gool, L.: The interestingness of images. In: ICCV (2013)
  • [10] Hastings, W.K.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1), 97–109 (1970)
  • [11] Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance normalization. In: ICCV (2017)
  • [12] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks (2017)
  • [13] Khosla, A., Raju, A.S., Torralba, A., Oliva, A.: Understanding and predicting image memorability at a large scale. In: ICCV (2015)
  • [14] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • [15] Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., Yang, M.H.: Universal style transfer via feature transforms. In: NIPS (2017)
  • [16] Liao, J., Yao, Y., Yuan, L., Hua, G., Kang, S.B.: Visual attribute transfer through deep image analogy. ACM Transactions on Graphics (TOG) 36(4), 120 (2017)
  • [17] Lu, X., Lin, Z., Shen, X., Mech, R., Wang, J.Z.: Deep multi-patch aggregation network for image style, aesthetics, and quality estimation. In: ICCV (2015)
  • [18] Mai, L., Jin, H., Liu, F.: Composition-preserving deep photo aesthetics assessment. In: CVPR (2016)
  • [19] Neal, R.M., et al.: MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo 2(11) (2011)
  • [20] Peng, K.C., Chen, T.: Toward correlating and solving abstract tasks using convolutional neural networks. In: IEEE Winter Conference on Applications of Computer Vision (WACV) (2016)
  • [21] Peng, K.C., Chen, T., Sadovnik, A., Gallagher, A.C.: A mixed bag of emotions: Model, predict, and transfer emotion distributions. In: CVPR (2015)
  • [22] Porzi, L., Rota Bulò, S., Lepri, B., Ricci, E.: Predicting and understanding urban perception with convolutional neural networks. In: ACM Multimedia (2015)
  • [23] Rossky, P., Doll, J., Friedman, H.: Brownian dynamics as smart Monte Carlo simulation. The Journal of Chemical Physics 69(10), 4628–4633 (1978)
  • [24] Sartori, A., Yanulevskaya, V., Salah, A.A., Uijlings, J., Bruni, E., Sebe, N.: Affective analysis of professional and amateur abstract paintings using statistical analysis and art theory. ACM Transactions on Interactive Intelligent Systems 5(2), 8 (2015)
  • [25] Siarohin, A., Zen, G., Majtanovic, C., Alameda-Pineda, X., Ricci, E., Sebe, N.: How to make an image more memorable?: A deep style transfer approach. In: ACM ICMR (2017)
  • [26] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  • [27] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: CVPR (2016)
  • [28] Tsai, Y.H., Shen, X., Lin, Z., Sunkavalli, K., Lu, X., Yang, M.H.: Deep image harmonization. In: CVPR (2017)
  • [29] Ulyanov, D., Lebedev, V., Vedaldi, A., Lempitsky, V.S.: Texture networks: Feed-forward synthesis of textures and stylized images. In: ICML (2016)
  • [30] Wang, W., Shen, J.: Deep cropping via attention box prediction and aesthetics assessment. In: ICCV (2017)
  • [31] Wilber, M.J., Fang, C., Jin, H., Hertzmann, A., Collomosse, J., Belongie, S.: BAM! The Behance Artistic Media dataset for recognition beyond photography. In: ICCV (2017)
  • [32] Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., Oliva, A.: Learning deep features for scene recognition using places database. In: NIPS (2014)