A Gated Peripheral-Foveal Convolutional Neural Network for Unified Image Aesthetic Prediction

12/19/2018 ∙ by Xiaodan Zhang, et al. ∙ Xidian University

Learning fine-grained details is a key issue in image aesthetic assessment. Most previous methods extract fine-grained details via a random cropping strategy, which may undermine the integrity of semantic information. Extensive studies show that humans perceive fine-grained details with a mixture of foveal vision and peripheral vision. The fovea has the highest visual acuity and is responsible for seeing details. Peripheral vision is used for perceiving the broad spatial scene and selecting the attended regions for the fovea. Inspired by these observations, we propose a Gated Peripheral-Foveal Convolutional Neural Network (GPF-CNN). It is a dedicated double-subnet neural network, consisting of a peripheral subnet and a foveal subnet. The former mimics the functions of peripheral vision to encode holistic information and provide the attended regions. The latter extracts fine-grained features from these key regions. Considering that peripheral vision and foveal vision play different roles in processing different visual stimuli, we further employ a gated information fusion (GIF) network to weight their contributions. The weights are determined by fully connected layers followed by a sigmoid function. We conduct comprehensive experiments on the standard AVA and Photo.net datasets for unified aesthetic prediction tasks: (i) aesthetic quality classification; (ii) aesthetic score regression; and (iii) aesthetic score distribution prediction. The experimental results demonstrate the effectiveness of the proposed method.




I Introduction

Automatic image aesthetic assessment aims to endow computers with the ability to perceive aesthetics as human beings do. It plays an important role in many real-world applications, such as image recommendation, photo organization and image enhancement [1, 2, 3, 4]. Early attempts in this area focus on handcrafted features based on known aesthetic principles such as the rule of thirds, simplicity or diagonal rules [5, 6, 7, 8, 9]. However, most photographic rules are descriptive and therefore difficult to model mathematically.

Deep learning methods have shown great success in various computer vision tasks [10, 11, 12, 13], and more and more researchers are applying them to image aesthetic assessment [14, 15, 16]. However, most of these networks ignore fine-grained information, which is quite important in aesthetic prediction. To tackle this problem, a previous study [17] represented the image with one patch randomly sampled from the original high-resolution image. However, the aesthetic attributes in one randomly cropped patch may not represent the fine-grained information of the entire image well. Recently, Lu et al. [18] proposed a multi-patch aggregation network (DMA-Net) to extract local fine-grained features from multiple randomly cropped patches. This method achieves promising results, but it ignores the global spatial layout. To address this, Ma et al. [19] proposed a layout-aware framework in which an attribute graph is added to DMA-Net. However, the nodes of the attribute graph need to be predefined, which limits its use in practical applications. Moreover, all of the above studies treat global and local feature extraction as two distinct tasks, whereas in the human visual system these two kinds of features are highly correlated.

It is widely acknowledged that humans perceive scenes with a mixture of high-acuity foveal vision and coarser peripheral vision [20, 21]. The former has the highest density of cones and is responsible for encoding fine-grained details. The latter has a significantly lower density of cones and is mainly used for encoding the broad spatial scene and seeing large objects [21, 22]. More importantly, peripheral vision also actively participates in the attentional selection of the visual space to be processed by the fovea [23]. Motivated by these observations, we mimic this process and develop a Gated Peripheral-Foveal Convolutional Neural Network (GPF-CNN). It is a dedicated double-subnet neural network. The input of the first subnet is a downsampled low-resolution image. We refer to this image as the peripheral view and to the first subnet as the peripheral subnet. The peripheral subnet is composed of a bottom-up feed-forward network that encodes the global composition and a top-down neural attention feedback process that creates a saliency map. We use the saliency map to determine the regions of the peripheral view from which we wish to extract fine-grained details. The input of the second subnet is a high-resolution image, denoted as the foveal view or simply the fovea. We refer to the second subnet as the foveal subnet. Figure 1 shows an example of the attention map, the peripheral view and the foveal view. The model selects a foveal window from the peripheral view under the guidance of top-down neural attention. The corresponding region of the high-resolution image is then cropped for extracting fine-grained details. Finally, features extracted by the foveal subnet are fused with features extracted by the peripheral subnet.

Fig. 1: Illustration of the top-down neural attention map (left), peripheral view (middle) and foveal views (right) of an image.

Recent studies show that foveal vision and peripheral vision play different roles in processing different visual stimuli [21, 24]. Categories such as portrait and animal rely more on fine-grained detail to support an aesthetic decision, and are thus associated more with foveal representations. Other categories, such as landscape and architecture, rely more on global shape and large-scale integration, and are therefore associated more with peripheral representations. Motivated by these findings, we propose a gated information fusion network that weights the foveal and peripheral branches adaptively: if one branch is better at processing a given image, the gating layer directs more information to that branch by increasing the value of its gating node.

Overall, this paper makes the following contributions.

  • A biologically inspired structure is proposed. With this structure, networks can automatically focus on the key regions of the top-down neural attention map to extract fine-grained details. By doing so, we not only establish a relationship between the global and local features, but also preserve semantic integrity, as demonstrated in the experiments.

  • We have also developed a gated information fusion module which can adaptively weight the contributions of the global layout and local fine-grained features according to the input. By combining the weighted global and local features, the proposed module can greatly boost the performance.

  • We conduct comprehensive experiments for unified aesthetic prediction tasks: aesthetic classification, aesthetic regression and aesthetic label distribution. For all these tasks, the proposed model achieves superior performance over the state-of-the-art approaches on public datasets.

The remainder of this paper is organized as follows. In Section II, we briefly summarize the related work. In Section III, we introduce the architecture of the GPF-CNN model. In Section IV, we quantitatively evaluate the effectiveness of the proposed model and compare it with state-of-the-art methods. Finally, we wrap up with conclusions and ideas for future work in Section V.

II Related Work

Contemporary image aesthetic assessment research can be roughly organized around two components: the extraction of more advanced features and the use of more sophisticated learning algorithms. We therefore summarize previous research from these two perspectives: visual representations and learning algorithms.

II-A Visual Representations

There is a vast literature on the problem of designing effective features for aesthetic assessment, starting with the seminal work of [25] and leading to the more recent works of [6, 7, 9]. These features are based on human aesthetic perception and photographic rules. For example, Datta et al. [25] extracted features to model photographic techniques such as the rule of thirds, colorfulness and saturation. Tang et al. [7] modeled photographic rules (composition, lighting and color arrangement) by extracting visual features according to the variety of photo content. Nishiyama et al. [9] proposed using bags of color patterns to model color harmony in aesthetics. Later work by Zhang et al. [26] focused on constructing small-sized connected graphs to encode image composition information. However, these methods based on hand-designed features achieve only limited success because (1) such handcrafted features cannot be applied to all image categories, since photographic rules vary considerably among different images, and (2) handcrafted features are heuristic, and some photographic rules are difficult to quantify mathematically.

Recently, some researchers have applied deep learning networks to image aesthetic quality assessment. Tian et al. [14] proposed a query-dependent aesthetic model with deep learning for aesthetic quality assessment. Their method suffers from reduced accuracy because it uses the networks only as feature extractors. Kao et al. [27] explored deep multi-task networks that leverage semantic information for image aesthetic prediction. In contrast to the aforementioned methods, [17, 18, 19] focused on the fixed-size input constraint of deep networks when applied to aesthetic prediction. The inputs need to be transformed via scaling, cropping or padding before being fed into the neural network, and images often lose holistic information and high-resolution fine-grained details after these transformations. Lu et al. [17] tried to tackle this problem by proposing a double-column network called RAPID. In particular, they represented the global view via a padded or warped image and the local view via a single randomly cropped patch. In order to capture more high-resolution fine-grained details, Lu et al. [18] extended RAPID to a deep multi-patch aggregation network (DMA-Net). In DMA-Net, the input image is represented by a bag of randomly cropped patches, and two network layers (statistics and sorting) are used to aggregate the multiple patches. However, DMA-Net fails to encode the global layout of the image. Ma et al. [19] tried to address this limitation by adding an object-based attribute graph to DMA-Net. Their method relies on a strong hypothesis: the number of attribute graph nodes must be given in advance, which is inapplicable in most cases. Our work also fuses global and local features for aesthetic prediction. It not only makes full use of the attention mechanism, but also adaptively weights the global and local features according to the input.

II-B Learning Algorithms

Early attempts in image aesthetic assessment cast the problem as classification, such as [28, 29, 27, 19, 18]. They classified images into high or low aesthetic quality by thresholding the weighted mean score of the human ratings. Other research, such as [15, 16], used a regression model to predict the aesthetic score. However, image aesthetic quality assessment is highly subjective. The scores given by different people may differ greatly, for example because of cultural background. Thus a scalar value is insufficient to convey the degree of consensus or diversity of opinion among annotators [30]. Considering this, recent research focuses on directly predicting the label distribution of the scores. Jin et al. [30] proposed a new CJS loss to predict the aesthetic label distribution. Murray et al. [31] used the Huber loss to predict the aesthetic score distribution, but predicted each discrete probability independently. Talebi et al. [32] treated the score distribution as ordered classes and used a squared EMD loss to predict the score distributions. In this paper, similarly to [32], we optimize our networks by minimizing the EMD loss.

III Gated Peripheral and Foveal Vision Convolutional Neural Networks

The proposed model includes two subnets: the peripheral subnet and the foveal subnet. A high-resolution input image is first downsampled and fed into the peripheral subnet, which is responsible for encoding the global composition and providing the key region. Then, a top-down back-propagation pass calculates the attention map, which is informative about the model's decisions. Based on the neural attention map, the attended region is selected and fed into the foveal subnet. A GIF module then weights the features extracted by the two subnets. The overall architecture of the model is shown in Figure 2.

Traditional methods often formulate aesthetic assessment as binary classification, as discussed earlier. The binary labels are typically derived from a distribution of scores (e.g. from 1 to 10 on www.dpchallenge.com and from 1 to 7 on www.photo.net) by computing and thresholding the mean score of the distribution. However, a single binary label discards useful information in the ground-truth score distribution, such as the variance and the median. This information is useful for investigating the consensus and diversity of opinions among annotators. Thus, in this paper, we formulate aesthetic assessment as a label distribution prediction problem. Each image in the dataset comes with its ground-truth (user) ratings $\mathbf{p}$. Let $\mathbf{p} = [p_{s_1}, \dots, p_{s_N}]$ denote the score distribution of an image, where $s_i$ represents the $i$-th score bucket, $N$ is the total number of score buckets, and $p_{s_i}$ denotes the number of voters that give a discrete score of $s_i$ to the image. For the AVA dataset, $N = 10$, $s_1 = 1$, $s_N = 10$, whereas for the Photo.net dataset, $N = 7$, $s_1 = 1$, $s_N = 7$ (a detailed introduction to the AVA and Photo.net datasets can be found in Section IV). The score distributions are $\ell_1$-normalized as a preprocessing step, so that $\sum_{i=1}^{N} p_{s_i} = 1$. Once the score distribution is predicted, the mean score can be obtained via $\mu = \sum_{i=1}^{N} s_i \times p_{s_i}$, from which the classification and regression tasks follow. The loss function used in our paper is defined as follows:

$$\mathrm{EMD}(\mathbf{p}, \hat{\mathbf{p}}) = \left( \frac{1}{N} \sum_{k=1}^{N} \left| \mathrm{CDF}_{\mathbf{p}}(k) - \mathrm{CDF}_{\hat{\mathbf{p}}}(k) \right|^{r} \right)^{1/r} \tag{1}$$

where $\mathrm{CDF}_{\mathbf{p}}(k) = \sum_{i=1}^{k} p_{s_i}$ is the cumulative distribution function and $r$ is set to 2 to penalize the Euclidean distance between the CDFs. Our proposed GPF-CNN is applicable to a variety of CNNs, such as AlexNet [33], VGGNet [34] and ResNet [13], as demonstrated in the experiments. For a fair comparison with most aesthetic assessment methods, we select VGG16 [34] as our baseline.
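The label-distribution formulation can be illustrated with a minimal sketch. It assumes the normalized EMD form used in [32], i.e. the r-th root of the mean r-th power difference between the two cumulative distributions, with r = 2 for training. The function names and the vote counts below are hypothetical and serve only as an illustration.

```python
from itertools import accumulate


def normalize(votes):
    """L1-normalize raw vote counts into a score distribution."""
    total = sum(votes)
    if total <= 0:
        raise ValueError("vote counts must sum to a positive value")
    return [v / total for v in votes]


def mean_score(dist, scores=None):
    """Mean score: sum of s_i * p_{s_i}; buckets default to 1..N."""
    if scores is None:
        scores = range(1, len(dist) + 1)
    return sum(s * p for s, p in zip(scores, dist))


def emd(p, p_hat, r=2):
    """EMD between two distributions over the same ordered buckets."""
    if len(p) != len(p_hat):
        raise ValueError("distributions must have the same number of buckets")
    cdf_p = list(accumulate(p))          # cumulative distribution of p
    cdf_q = list(accumulate(p_hat))      # cumulative distribution of p_hat
    n = len(p)
    return (sum(abs(a - b) ** r for a, b in zip(cdf_p, cdf_q)) / n) ** (1.0 / r)


# Hypothetical 10-bucket vote counts (scores 1..10), for illustration only.
p = normalize([0, 1, 3, 10, 30, 40, 25, 8, 2, 1])
p_hat = normalize([1, 2, 5, 15, 30, 35, 20, 8, 3, 1])
loss = emd(p, p_hat, r=2)
mu = mean_score(p_hat)
```

Because the loss compares cumulative distributions, mass predicted in a neighbouring bucket is penalized less than mass predicted far from the true score, which is the reason for treating the buckets as ordered classes.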

Fig. 2: Overall architecture of the GPF-CNN. The input of the peripheral subnet is low resolution image, and the input of foveal subnet is the selected attended region. The GIF module is used to balance the weights of the two subnets. More detailed illustrations of GIF module can be seen in Figure 3.

III-A Top-down Neural Attention Feedback

Fine-grained detail resides in the original high-resolution image. However, training deep networks with large inputs requires a significantly larger dataset and more hardware memory. In this work, we use top-down neural attention to discover the most important region of an image. The network then directs the high-resolution “fovea” to that region to extract fine-grained details. This offers a two-fold benefit. First, it helps to reduce the number of parameters: if we estimated the saliency map with a separate saliency network, the number of learnable parameters would be quite large, increasing both the amount of computation and the difficulty of training. Second, extracting local fine-grained features based on the global network's attention establishes a relationship between the global and the local features.

Recently, many methods have been proposed to explore where neural networks “look” in an image for evidence for their predictions [35, 36]. Our work is inspired by the excitation backprop method [36], which generates the top-down neural attention map based on the probabilistic Winner-Take-All (WTA) model. Given a selected output class, the probabilistic WTA scheme uses a stochastic sampling process to generate a soft attention map. The winning (sampling) probability $P(a_i)$ is defined as

$$P(a_i) = \sum_{a_j \in \mathcal{P}_i} P(a_i \mid a_j)\, P(a_j) \tag{2}$$

where $a_i \in \mathcal{N}$ ($\mathcal{N}$ is the overall neuron set) and $\mathcal{P}_i$ is the parent node set of $a_i$ (in top-down order). As Eq. 2 indicates, $P(a_i)$ is a function of the winning probability of the parent nodes in the preceding layers [36]. Thus, the winner neurons are recursively sampled in a top-down fashion based on a conditional winning probability $P(a_i \mid a_j)$, which is defined as

$$P(a_i \mid a_j) = \begin{cases} Z_j\, \hat{a}_i\, w_{ij} & \text{if } w_{ij} \geq 0, \\ 0 & \text{otherwise,} \end{cases} \tag{3}$$

where $Z_j = 1 / \sum_{i:\, w_{ij} \geq 0} \hat{a}_i w_{ij}$ is the normalization factor, $\hat{a}_i$ is the response of $a_i$, and $w_{ij}$ is the connection weight between $a_i$ and $a_j$. By recursively propagating the top-down signal layer by layer based on Eq. 2 and Eq. 3, we can compute the attention map of the predicted class. The computed attention map indicates which pixels are more relevant for the class. Next, we crop the attended region and zoom in to a finer scale with higher resolution to extract fine-grained features.
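A single top-down propagation step of this scheme can be sketched as follows. This is a simplified illustration for one fully connected layer, assuming non-negative child responses (for example after a ReLU) and following the excitation backprop formulation of [36]; the function name and argument layout are hypothetical.

```python
def excitation_backprop_layer(parent_probs, child_responses, weights):
    """Propagate winning probabilities one layer down.

    parent_probs[j]    : winning probability of parent neuron j
    child_responses[i] : non-negative response of child neuron i
    weights[i][j]      : connection weight between child i and parent j
    Returns the winning probability of every child neuron.
    """
    n_child = len(child_responses)
    child_probs = [0.0] * n_child
    for j, p_parent in enumerate(parent_probs):
        # Only excitatory (non-negative) connections take part.
        contrib = [
            child_responses[i] * weights[i][j] if weights[i][j] >= 0 else 0.0
            for i in range(n_child)
        ]
        z = sum(contrib)
        if z == 0:
            continue  # no excitatory evidence, this parent passes nothing down
        for i in range(n_child):
            # Conditional probability (normalized contribution) times parent mass.
            child_probs[i] += contrib[i] / z * p_parent
    return child_probs


# One parent with probability 1 and two children with responses 1 and 3.
probs = excitation_backprop_layer([1.0], [1.0, 3.0], [[1.0], [1.0]])
```

Applying the same step layer by layer, from the output class down to an early layer, yields the attention map. Probability mass is conserved at each step whenever every parent has at least one excitatory connection with a positive response.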

Attention-based automatic image cropping tries to identify the most important region in the image. It searches for the smallest region that retains a required share of the summed attention. Suppose $G$ is a non-negative valued top-down neural attention map. Larger attention values in $G$ indicate higher visual importance. Without loss of generality, the attended region can be found by solving the following problem:

$$R^{*} = \arg\min_{R} A(R) \quad \text{s.t.} \quad \sum_{(x,y) \in R} G(x,y) \geq \tau \sum_{(x,y)} G(x,y) \tag{4}$$

where $\tau$ is the minimum percentage of total attention to be preserved, $R^{*}$ is the smallest rectangle that contains a $\tau$ percentage of the total attention, and $A(R)$ is the area of the rectangle $R$. It should be emphasized that, for a given $\tau$, $R^{*}$ may not be unique (we use the search strategy of [37] and follow the default parameter setting of that paper). In our algorithm, we always choose the $R^{*}$ with the largest summed attention value.
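The cropping objective can be illustrated with a brute-force sketch. Note that this is not the search strategy of [37]: it simply enumerates all rectangles with an integral image, which is only practical for small attention maps. The function name, the inclusive box convention and the toy attention map are assumptions made for the example.

```python
def crop_attended_region(att, tau):
    """Return (top, left, bottom, right), inclusive, of the smallest rectangle
    that holds at least a fraction `tau` of the total attention.
    Ties in area are broken in favour of the larger summed attention."""
    h, w = len(att), len(att[0])
    # Integral image: integ[y][x] is the sum of att over rows < y and columns < x.
    integ = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            integ[y + 1][x + 1] = (att[y][x] + integ[y][x + 1]
                                   + integ[y + 1][x] - integ[y][x])
    target = tau * integ[h][w]
    best_key, best_box = None, None
    for top in range(h):
        for bottom in range(top, h):
            for left in range(w):
                for right in range(left, w):
                    s = (integ[bottom + 1][right + 1] - integ[top][right + 1]
                         - integ[bottom + 1][left] + integ[top][left])
                    if s < target:
                        continue  # not enough attention preserved
                    area = (bottom - top + 1) * (right - left + 1)
                    key = (area, -s)  # smaller area first, then larger attention
                    if best_key is None or key < best_key:
                        best_key, best_box = key, (top, left, bottom, right)
    return best_box


# Toy 3x3 attention map, for illustration only.
toy_map = [[0, 0, 0],
           [0, 5, 2],
           [0, 1, 0]]
box = crop_attended_region(toy_map, 0.8)
```

The returned box is expressed in attention-map coordinates. To crop the foveal view, it would be scaled to the coordinates of the original high-resolution image.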

III-B Gated Information Fusion (GIF) Network

The GIF module aims to balance the global and local features according to the feature maps. The overall structure is shown in Figure 3. A similar gated information fusion mechanism has been proposed for multi-modal learning [38]. In this paper, we generalise this design and focus on weighting the features by modeling the relationship between channels. The same idea has been adopted in SENet [39]. Let $F_p$ and $F_f$ denote the feature maps from the peripheral subnet and the foveal subnet. The GIF module consists of two parts: the weight generation part and the feature fusion part. In the weight generation part, a global pooling layer $F_{gp}$ is applied before concatenating the feature maps $F_p$ and $F_f$; $F_{gp}$ squeezes global spatial features into channel descriptors [39]. Then, bottlenecks with two fully connected (FC) layers are applied in parallel to capture channel-wise dependencies, and a sigmoid gating layer modulates the learned weights. Finally, the weighted feature maps are fed into the fully connected layers and the classification layer. Let $F_c$ denote the features after concatenation. We summarize the operations of the GIF module as follows:

$$F_c = [F_{gp}(F_p),\, F_{gp}(F_f)], \qquad g_i = \sigma\big(W_2^{i}\, \delta(W_1^{i} F_c)\big), \qquad \tilde{F}_i = g_i \cdot F_i, \quad i \in \{p, f\} \tag{5}$$

where $\sigma$ denotes the sigmoid function, $\delta$ denotes the ReLU function [40], $W_1^{i}$ is a dimensionality-reduction layer and $W_2^{i}$ is a dimensionality-increasing layer as defined in SENet [39], $\tilde{F}_i$ refers to the output features of the $i$-th branch, and $F_i$ denotes the input features of the $i$-th branch.
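The gating mechanism can be sketched in plain Python as follows. This is only an illustration of a SENet-style gate [39] under several assumptions that the text does not confirm: global average pooling as the squeeze operation, one bias-free bottleneck per branch, and one gate value per channel. All names, shapes and the random weights are hypothetical.

```python
import math
import random


def global_avg_pool(fmap):
    """Squeeze a C x H x W feature map into C channel descriptors."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]


def linear(weights, x):
    """Bias-free fully connected layer; weights has shape out_dim x in_dim."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gate(descriptor, w_reduce, w_increase):
    """Bottleneck: reduction FC, ReLU, expansion FC, then sigmoid gating."""
    hidden = [max(0.0, h) for h in linear(w_reduce, descriptor)]
    return [sigmoid(o) for o in linear(w_increase, hidden)]


def gif_fuse(f_peripheral, f_foveal, params):
    """Weight both branches by gates computed from the concatenated descriptors.
    params maps a branch name to its (w_reduce, w_increase) pair."""
    desc = global_avg_pool(f_peripheral) + global_avg_pool(f_foveal)
    fused = {}
    for name, fmap in (("peripheral", f_peripheral), ("foveal", f_foveal)):
        g = gate(desc, *params[name])
        # Scale every spatial position of channel c by its gate value g[c].
        fused[name] = [[[g[c] * v for v in row] for row in ch]
                       for c, ch in enumerate(fmap)]
    return fused


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy inputs: two channels of 2x2 features per branch, hidden size 2.
rng = random.Random(0)
f_p = [[[1.0, 2.0], [3.0, 4.0]], [[2.0, 2.0], [2.0, 2.0]]]
f_f = [[[4.0, 4.0], [4.0, 4.0]], [[0.0, 8.0], [0.0, 8.0]]]
toy_params = {name: (random_matrix(2, 4, rng), random_matrix(2, 2, rng))
              for name in ("peripheral", "foveal")}
fused = gif_fuse(f_p, f_f, toy_params)
```

Because each gate is a sigmoid output in (0, 1), a branch that is less informative for a given image is attenuated rather than removed, so both branches still contribute to the final prediction.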

Fig. 3: The structure of the proposed GIF network. The GIF module produces weights $g_p$ and $g_f$ by applying the fully connected layers and the sigmoid function. Then, $g_p$ and $g_f$ are multiplied with the peripheral and foveal features, respectively, to obtain the weighted information fusion results.

IV Experiments

In this section, we verify the effectiveness of the proposed photo aesthetic prediction approach on different datasets and CNN architectures. First, we perform ablation studies on the AVA dataset. The networks trained include AlexNet [33], VGGNet [34], ResNet [13] and InceptionNet [41]. For all of these architectures, our proposed scheme performs better than the original network. Next, we compare the performance of our scheme with state-of-the-art methods on the AVA and Photo.net datasets.

Fig. 4: Example patches cropped with neural attention. First row: original images; second row: top-down neural attention map; third row: patches cropped with neural attention map.

IV-A Datasets

AVA Dataset: The AVA aesthetic dataset [42] includes images, which is the largest public available aesthetics assessment dataset. The images are collected from www.dpchallenge.com. Each image has about aesthetic ratings ranging from one to ten. We use the same partition of training data and testing data as the previous work [5, 19, 18], i.e. images for training and validation, the rest images for testing.

Photo.net Dataset: The Photo.net dataset [25] is collected from www.photo.net. It consists of images but only images have aesthetic label distribution. Distribution (counts) of aesthetics ratings are in scale. From the overall images, images are used to train, images are used for validation and the rest images are used for test.

IV-B Implementation Details and Evaluation Criteria

Considering that the peripheral subnet is used for encoding the global composition features, we do not rescale the input to a fixed size but downsample it while keeping its original aspect ratio. The longest dimension of the input image is kept to . The training process includes two steps. In the first step, we initialize the convolutional layers of the peripheral subnet with the VGG16 pre-trained on ImageNet [34]. We first train the peripheral subnet with a softmax loss to classify the images into the high or low category. After training the peripheral subnet, we obtain the attended regions by feeding back the top-down neural attention. In the second step, we freeze the convolutional layers of the peripheral subnet and train the foveal subnet and the GIF module. Each input image is normalised through mean RGB-channel subtraction. Both steps use the SGD optimization algorithm. The minibatch samples images randomly in each iteration. The momentum is . The initial learning rate is set to and reduced by a factor of every epochs. The training continues until the validation loss reaches a plateau for epochs. We use the same hyper-parameters for the first and second training steps. Our networks are implemented in the open source PyTorch framework and trained on an NVIDIA Pascal TITAN X GPU.
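The aspect-ratio-preserving downsampling of the peripheral view can be sketched as a small size computation. The function name is hypothetical and `longest_side` is a free parameter of the sketch, not a value taken from the experiments.

```python
def peripheral_size(width, height, longest_side):
    """Target (width, height) whose longest dimension is at most `longest_side`,
    keeping the original aspect ratio. Smaller images are left unchanged."""
    scale = min(1.0, longest_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


# Hypothetical example: a 1000 x 500 image limited to a longest side of 200.
size = peripheral_size(1000, 500, 200)
```

Keeping the aspect ratio means that the peripheral subnet sees the composition undistorted, at the cost of inputs whose shapes differ from image to image.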

Network architecture | Accuracy (%) | SRCC (mean) | LCC (mean) | MAE | RMSE | EMD
VGG16 | 74.41 | 0.6007 | 0.5869 | 0.4611 | 0.5878 | 0.0539
Random-VGG16 | 78.54 | 0.6274 | 0.6382 | 0.4410 | 0.5660 | 0.0510
PF-CNN | 80.60 | 0.6604 | 0.6712 | 0.4176 | 0.5387 | 0.047
GPF-CNN | 80.70 | 0.6762 | 0.6868 | 0.4144 | 0.5347 | 0.046
TABLE I: Effectiveness of neural network attention module.

Layer type | Parameter (AlexNet) | Parameter (VGG16) | Parameter (InceptionNet) | Parameter (ResNet-16)
fully connected | [128,256] | [256,512] | [512,2048] | [64,512]
fully connected | [128,256] | [256,512] | [512,2048] | [64,512]
fully connected | 2048 | 2048 | 2048 | 2048
fully connected | 2048 | 2048 | 2048 | 2048
fully connected | 4096 | 4096 | 4096 | 4096
TABLE II: Parameter settings of GIF in different architectures.

Unlike most traditional methods, which are designed to perform binary classification, we evaluate our proposed method on three aesthetic quality tasks: (i) aesthetic score regression, (ii) aesthetic quality classification, and (iii) aesthetic score distribution prediction. For the aesthetic score regression task, we compute the mean score of the predicted label distribution, i.e. the sum of each score weighted by its predicted probability. For aesthetic quality classification, we threshold the mean score using the same threshold as [5, 6, 19, 27]: images with predicted scores above the threshold are categorized as high quality, and the rest as low quality. The evaluation metrics for the three prediction tasks are as follows.

  • Image aesthetic score regression: We report the Spearman rank-order correlation coefficient (SRCC), the Pearson linear correlation coefficient (LCC), the root mean square error (RMSE) and the mean absolute error (MAE). These are the most commonly used criteria for testing the performance of an IQA method. Of these criteria, SRCC measures prediction monotonicity, and LCC evaluates prediction accuracy. Both SRCC and LCC range from -1 to 1, and a larger value indicates a better result. For MAE and RMSE, a smaller value indicates a better result.

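  • Image aesthetic quality classification: We report the overall accuracy, defined as $\mathrm{Accuracy} = \frac{TP + TN}{P + N}$, where $TP$ and $TN$ are the numbers of correctly classified high-quality and low-quality images, and $P$ and $N$ are the total numbers of high-quality and low-quality images.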

  • Image aesthetic score distribution: We report the EMD values. The EMD measures the closeness of the predicted and ground-truth rating distributions with $r = 1$ in Eq. 1.
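The regression and classification criteria listed above can be computed with a short, dependency-free sketch. The function names are hypothetical, tied values receive average ranks in the Spearman computation, and the classification threshold is passed in as a parameter rather than fixed to any particular value. The score lists are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (LCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def spearman(x, y):
    """Spearman rank-order correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def mae(pred, true):
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)


def rmse(pred, true):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))


def accuracy(pred, true, threshold):
    """Share of images whose predicted and true mean scores fall on the
    same side of the threshold, i.e. (TP + TN) / (P + N)."""
    hits = sum((p > threshold) == (t > threshold) for p, t in zip(pred, true))
    return hits / len(pred)


# Hypothetical mean scores, for illustration only.
true_scores = [1.0, 2.0, 3.0, 4.0]
pred_scores = [1.5, 2.5, 3.5, 4.5]
```

In this toy example the predictions are shifted by a constant, so both correlation coefficients are perfect while MAE and RMSE are not. This shows why the correlation and error criteria are reported together.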

Fig. 5: Top 2 rows: high quality images, as predicted by our GPF-CNN (VGG16), coupled with plots of their ground-truth and predicted score distributions. Bottom 2 rows: the low quality images, as predicted by our GPF-CNN (VGG16), coupled with plots of their ground-truth and predicted score distributions.

IV-C Ablation Studies

Traditional methods extract local features by random cropping [18]. Random cropping is independent of the image content and is therefore unlikely to capture semantic meaning. An alternative is to extract fine-grained details based on salient object detection. Salient object detection performs well when there is a single salient object. When there are multiple objects, it is difficult to choose the most important one. Besides, most landscape images contain no salient object at all. Extracting fine-grained features based on neural attention can tackle these challenges. Figure 4 shows some examples of patches cropped with neural attention. Figures 4(a), (c) and (e) contain only one subject, and the cropped patches capture the important region and preserve semantic integrity. Figures 4(d) and (f) contain multiple objects, and the cropped patches capture all of them.

To validate the neural attention module quantitatively, we construct two baseline models: VGG16 and Random-VGG16. VGG16 is pre-trained on ImageNet and fine-tuned to predict aesthetic quality. Its input is obtained by warping the original image to a fixed size. Random-VGG16 is a double-column deep convolutional neural network. The first column encodes the global view, and the second column uses random cropping to extract local fine-grained information from a patch of fixed size. PF-CNN is a simplification of the proposed GPF-CNN obtained by removing the GIF module. It uses neural attention to extract fine-grained details, and the attended regions fed to the foveal subnet are resized to a fixed size in training and testing. For a fair comparison, we use the same network architecture and the same hyper-parameters. The results are shown in Table I. Both Random-VGG16 and PF-CNN achieve better performance than VGG16, which indicates that incorporating local fine-grained features improves the prediction results. This is consistent with the results of [19, 18], which used a random cropping strategy to encode fine-grained details. PF-CNN exceeds both VGG16 and Random-VGG16 by a significant margin, which illustrates the importance of using the attention mechanism to encode fine details.

Fig. 6: Failure cases of our model. Our model performs poorly on bimodal distributions and on very skewed distributions.

To see whether the GIF module is effective, we compare GPF-CNN with PF-CNN. Unlike PF-CNN, GPF-CNN has a GIF module to weight the global and local features. The baseline network is still VGG16. The detailed parameters of the GIF module in VGG16 are listed in Table II. The comparison results are shown in Table I. GPF-CNN performs better than PF-CNN. In conclusion, our experimental results confirm the importance of fusing global features and local fine details, and highlight the importance of neural attention and the GIF module in our framework.

IV-D Extension to Other Network Architectures

We next investigate the performance of the GPF-CNN mechanism on several other architectures: AlexNet [33], ResNet-16 [13] and InceptionNet [41]. The parameters of the GIF module integrated with AlexNet, ResNet-16 and InceptionNet are shown in Table II. The gating bottleneck consists of two fully connected (FC) layers: a dimensionality-reduction layer, a ReLU, and then a dimensionality-increasing layer. The layer parameters for each architecture are listed in Table II (we tried other parameters but did not see any improvement). The comparison results are reported in Table III. As in the previous experiments, we observe significant performance improvements induced by the GPF-CNN mechanism.

Network architecture | Accuracy (%) | SRCC (mean) | LCC (mean) | MAE | RMSE | EMD
AlexNet | 76.37 | 0.5549 | 0.5665 | 0.4733 | 0.6063 | 0.0525
ResNet-16 | 77.91 | 0.6394 | 0.6505 | 0.4346 | 0.5583 | 0.0484
InceptionNet | 79.43 | 0.6756 | 0.6865 | 0.4154 | 0.5359 | 0.0466
GPF-CNN (AlexNet) | 78 | 0.5996 | 0.6121 | 0.4539 | 0.5820 | 0.0507
GPF-CNN (ResNet-16) | 79.98 | 0.6670 | 0.6779 | 0.4200 | 0.5409 | 0.0470
GPF-CNN (InceptionNet) | 81.81 | 0.6900 | 0.7042 | 0.4072 | 0.5246 | 0.045
TABLE III: Extensions to other network architecture.

IV-E Content-based Photo Aesthetic Analysis

In this section, we demonstrate the effectiveness of the proposed method on various types of images. We select eight image categories from the test set of the AVA dataset: animal, landscape, cityscape, floral, food-drink, architecture, portrait and still-life. The image collection is the same as in the previous works [5, 19, 42]. In each category, we systematically compare the proposed GPF-CNN with VGG16, Random-VGG16 and PF-CNN. The experimental results are shown in Table IV. For all eight categories, Random-VGG16, PF-CNN and GPF-CNN perform better than VGG16. These results indicate that fine-detail information is quite important for image aesthetic prediction. We also find that the proposed GPF-CNN significantly outperforms the baselines in most of the categories. The portrait category shows a substantial improvement, reaching an accuracy of 83.52% against 76.93% for VGG16. This is because the fine details of the face, such as light and contrast, are quite important in portrait aesthetic assessment. The proposed GPF-CNN is sensitive to faces since it uses neural attention to extract fine-grained details (see Figure 4(a)).

| Category | Network architecture | Accuracy (%) | SRCC (mean) | LCC (mean) | MAE | RMSE | EMD |
|---|---|---|---|---|---|---|---|
| animal | GPF-CNN | 80.80 | 0.7480 | 0.7478 | 0.3941 | 0.5051 | 0.045 |
| animal | PF-CNN | 80.23 | 0.7274 | 0.7277 | 0.4091 | 0.5238 | 0.0462 |
| animal | Random-VGG16 | 77.9 | 0.6587 | 0.6654 | 0.4475 | 0.5688 | 0.0516 |
| animal | VGG16 | 76.17 | 0.6212 | 0.6267 | 0.4959 | 0.6325 | 0.0546 |
| landscape | GPF-CNN | 85.42 | 0.7746 | 0.7822 | 0.3713 | 0.4705 | 0.0422 |
| landscape | PF-CNN | 84.78 | 0.7606 | 0.7685 | 0.3831 | 0.4863 | 0.0431 |
| landscape | Random-VGG16 | 82.97 | 0.7230 | 0.7318 | 0.4051 | 0.5153 | 0.0480 |
| landscape | VGG16 | 80.04 | 0.6780 | 0.6876 | 0.4968 | 0.6276 | 0.0541 |
| cityscape | GPF-CNN | 81.68 | 0.7533 | 0.7539 | 0.3956 | 0.5103 | 0.0443 |
| cityscape | PF-CNN | 80.59 | 0.7365 | 0.7362 | 0.4096 | 0.5284 | 0.0456 |
| cityscape | Random-VGG16 | 77.02 | 0.6808 | 0.6827 | 0.4438 | 0.5676 | 0.0508 |
| cityscape | VGG16 | 76.22 | 0.6460 | 0.6424 | 0.5074 | 0.6481 | 0.0552 |
| floral | GPF-CNN | 79.95 | 0.7374 | 0.7348 | 0.3681 | 0.4785 | 0.0423 |
| floral | PF-CNN | 79.15 | 0.7196 | 0.7171 | 0.3794 | 0.4921 | 0.0433 |
| floral | Random-VGG16 | 77.19 | 0.6564 | 0.6576 | 0.4147 | 0.5312 | 0.0487 |
| floral | VGG16 | 75.58 | 0.6184 | 0.6220 | 0.4455 | 0.5709 | 0.05 |
| food-drink | GPF-CNN | 80.22 | 0.7389 | 0.7476 | 0.3919 | 0.4948 | 0.0443 |
| food-drink | PF-CNN | 79.46 | 0.7180 | 0.7288 | 0.4081 | 0.5125 | 0.0456 |
| food-drink | Random-VGG16 | 77.58 | 0.6642 | 0.6740 | 0.4357 | 0.5498 | 0.0503 |
| food-drink | VGG16 | 74.01 | 0.6163 | 0.6278 | 0.4876 | 0.6208 | 0.0536 |
| architecture | GPF-CNN | 81.60 | 0.7431 | 0.7410 | 0.3704 | 0.4822 | 0.0421 |
| architecture | PF-CNN | 81.16 | 0.7221 | 0.7213 | 0.3840 | 0.4970 | 0.0433 |
| architecture | Random-VGG16 | 79.16 | 0.6708 | 0.6709 | 0.4191 | 0.5379 | 0.0476 |
| architecture | VGG16 | 76.83 | 0.6032 | 0.6069 | 0.4748 | 0.6091 | 0.0521 |
| portrait | GPF-CNN | 83.52 | 0.6987 | 0.7047 | 0.4228 | 0.5389 | 0.0475 |
| portrait | PF-CNN | 82.72 | 0.6774 | 0.6866 | 0.4386 | 0.5549 | 0.0487 |
| portrait | Random-VGG16 | 81.71 | 0.6215 | 0.6331 | 0.4672 | 0.5884 | 0.0522 |
| portrait | VGG16 | 76.93 | 0.5671 | 0.5726 | 0.5347 | 0.6784 | 0.0583 |
| still-life | GPF-CNN | 76.35 | 0.7001 | 0.7127 | 0.4039 | 0.5153 | 0.0455 |
| still-life | PF-CNN | 75.23 | 0.6772 | 0.6909 | 0.4210 | 0.5338 | 0.0468 |
| still-life | Random-VGG16 | 71.49 | 0.6158 | 0.6329 | 0.4488 | 0.5683 | 0.0522 |
| still-life | VGG16 | 71.14 | 0.5652 | 0.5810 | 0.49 | 0.6187 | 0.0537 |

TABLE IV: Ablation study on eight image categories.
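
The per-category accuracy gains of GPF-CNN over VGG16 can be recomputed directly from Table IV. The following Python sketch is purely illustrative: the names `ACCURACY` and `accuracy_gains` are our own, and the numbers are copied from the Accuracy column of the GPF-CNN and VGG16 rows.

```python
# Accuracy (%) copied from Table IV as (GPF-CNN, VGG16) per category.
# The dictionary and function names are illustrative, not from the paper.
ACCURACY = {
    "animal": (80.80, 76.17),
    "landscape": (85.42, 80.04),
    "cityscape": (81.68, 76.22),
    "floral": (79.95, 75.58),
    "food-drink": (80.22, 74.01),
    "architecture": (81.60, 76.83),
    "portrait": (83.52, 76.93),
    "still-life": (76.35, 71.14),
}


def accuracy_gains(table):
    """Return the gain of GPF-CNN over VGG16 in percentage points."""
    return {cat: round(gpf - vgg, 2) for cat, (gpf, vgg) in table.items()}


gains = accuracy_gains(ACCURACY)
best = max(gains, key=gains.get)

# List the categories from the largest to the smallest gain.
for cat, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{cat:13s} +{gain:.2f}")
```

With these values, the portrait category has the largest gain (6.59 percentage points) and every category shows a positive gain.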
| Network architecture | Accuracy (%) | SRCC (mean) | LCC (mean) | MAE | RMSE | EMD |
|---|---|---|---|---|---|---|
| RAPID (AlexNet) [17] | 74.2 | - | - | - | - | - |
| DMA-Net (AlexNet) [18] | 75.42 | - | - | - | - | - |
| MNA-CNN (VGG16) [5] | 76.1 | - | - | - | - | - |
| A-Lamp (VGG16) [19] | 82.5 | - | - | - | - | - |
| MTRLCNN (VGG16) [27] | 78.46 | - | - | - | - | - |
| NIMA (VGG16) [32] | 80.6 | 0.592 | 0.610 | - | - | 0.052 |
| NIMA (InceptionNet) [32] | 81.51 | 0.612 | 0.636 | - | - | 0.05 |
| GPF-CNN (AlexNet) | 78 | 0.5996 | 0.6121 | 0.4539 | 0.5820 | 0.0507 |
| GPF-CNN (VGG16) | 80.70 | 0.6762 | 0.6868 | 0.4144 | 0.5347 | 0.046 |
| GPF-CNN (InceptionNet) | 81.81 | 0.6900 | 0.7042 | 0.4072 | 0.5246 | 0.045 |

TABLE V: Comparison with state-of-the-art methods on AVA Dataset.

IV-F Comparison with the State-of-the-Art on AVA Dataset

We quantitatively compare our GPF-CNN with several state-of-the-art methods on the AVA dataset: NIMA [32], MTRLCNN [27], A-Lamp [19], MNA-CNN [5], RAPID [17] and DMA-Net [18]. Note that the methods of [27, 19, 5, 17, 18] are designed to perform binary classification on the aesthetic scores, so only their aesthetic quality classification results are reported. Table V shows the comparison results. GPF-CNN achieves the best results on every reported metric except classification accuracy, where only A-Lamp [19] is higher. RAPID [17] and DMA-Net [18] are based on shallow networks and achieve 74.2% and 75.42% accuracy respectively, whereas the proposed GPF-CNN (AlexNet) achieves 78%, an improvement of 3.8 and 2.58 percentage points. For the larger VGG16 network, GPF-CNN (VGG16) performs slightly worse than A-Lamp [19] but outperforms MTRLCNN [27] and MNA-CNN [5] by 2.24 and 4.6 percentage points respectively. Note that A-Lamp [19] only performs binary classification, whereas our method predicts the score distribution, which provides richer and more precise information. NIMA [32] is most closely related to our work since it also uses the EMD loss to optimise the network. On VGG16, the SRCC and LCC of NIMA are 0.592 and 0.610 respectively, while GPF-CNN achieves 0.6762 and 0.6868, an improvement of 0.0842 and 0.0768. This is, to the best of our knowledge, the state-of-the-art performance on these metrics on the AVA dataset.
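
For readers who want to reproduce the columns of the comparison table, the following Python sketch shows one plausible way to compute the six metrics from predicted and ground-truth score distributions. It rests on several assumptions that are not stated in this section: images are labelled high quality when their mean score exceeds 5, SRCC, LCC, MAE and RMSE are computed on mean scores, and EMD is the normalised CDF distance with exponent `r = 1`. All function names and the sample distributions are hypothetical.

```python
import math


def mean_score(dist, first_score=1):
    """Expected score of a distribution over consecutive integer scores."""
    return sum((first_score + i) * p for i, p in enumerate(dist))


def emd(p, q, r=1):
    """Normalised EMD from cumulative sums (the exponent r is an assumption)."""
    cdf_p = cdf_q = total = 0.0
    for p_k, q_k in zip(p, q):
        cdf_p += p_k
        cdf_q += q_k
        total += abs(cdf_p - cdf_q) ** r
    return (total / len(p)) ** (1.0 / r)


def pearson(x, y):
    """Linear correlation (LCC); assumes neither sequence is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def evaluate(pred_dists, true_dists, threshold=5.0):
    """Compute the six table metrics; the threshold of 5 is an assumption."""
    pred = [mean_score(d) for d in pred_dists]
    true = [mean_score(d) for d in true_dists]
    n = len(pred)
    errors = [a - b for a, b in zip(pred, true)]
    hits = sum((a > threshold) == (b > threshold) for a, b in zip(pred, true))
    return {
        "accuracy": hits / n,
        "srcc": spearman(pred, true),
        "lcc": pearson(pred, true),
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
        "emd": sum(emd(p, q) for p, q in zip(pred_dists, true_dists)) / n,
    }


# Hypothetical 10-bin distributions (scores 1..10); not data from the paper.
truth = [
    [0.0, 0.0, 0.1, 0.3, 0.4, 0.1, 0.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.1, 0.2, 0.4, 0.2, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.2, 0.4, 0.2, 0.1, 0.0, 0.0, 0.0, 0.0],
]
preds = [
    [0.0, 0.05, 0.1, 0.3, 0.35, 0.1, 0.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.1, 0.3, 0.3, 0.2, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.3, 0.3, 0.2, 0.1, 0.0, 0.0, 0.0, 0.0],
]
print(evaluate(preds, truth))
```

Methods that only output a binary label can be scored on accuracy alone, which is why the remaining columns are empty for them in the table.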

Figure 5 shows six top-ranked and six bottom-ranked images randomly selected from the AVA test set, together with plots of the ground-truth and predicted score distributions. The model achieves a high degree of accuracy, with almost perfect reconstruction in some cases. Figure 6 shows some failure cases of our model. Our trained model performs poorly on images whose score distributions are strongly non-Gaussian. However, Gaussian functions fit the score distributions adequately for the large majority of the images in the AVA dataset, as reported by Murray et al. [42].

IV-G Evaluating Performance on Photo.net Dataset

We compare our proposed model with state-of-the-art models on the Photo.net dataset, including the deep learning model proposed in [27], VGG16, and the traditional feature extraction models of [28]. For VGG16, we directly replaced the last layer with a fully connected layer that has one neuron per score level of the Photo.net rating scale, followed by soft-max activations. The comparison results are shown in Table VII. Again, GPF-CNN outperforms the baselines by a large margin, achieving an accuracy of 75.6%. This is around 10.4 percentage points better than MTCNN [27] and 4.91 percentage points better than VGG16.
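
The VGG16 baseline head described above (a fully connected layer followed by soft-max) can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the feature vector, weights, and the choice of seven outputs are hypothetical values chosen only to make the example runnable, and the function names are our own.

```python
import math


def linear(features, weights, bias):
    """Fully connected layer: one weight row and one bias per output neuron."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


def softmax(logits):
    """Numerically stable soft-max: shift by the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_scores(features, weights, bias, first_score=1):
    """Return the predicted score distribution and its mean score."""
    dist = softmax(linear(features, weights, bias))
    mean = sum((first_score + i) * p for i, p in enumerate(dist))
    return dist, mean


# Hypothetical inputs: a 3-dimensional feature vector and a head with
# 7 outputs (an illustrative count; one output per score level is assumed).
num_levels = 7
features = [0.2, -0.1, 0.4]
weights = [[0.1 * (k - 3), 0.05 * k, -0.02 * k] for k in range(num_levels)]
bias = [0.0] * num_levels

dist, mean = predict_scores(features, weights, bias)
print([round(p, 3) for p in dist], round(mean, 3))
```

Because the soft-max outputs sum to one, the head yields a score distribution from which a mean score and a binary quality label can both be derived.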

| Network architecture | Accuracy (%) | SRCC (mean) | LCC (mean) | MAE | RMSE | EMD |
|---|---|---|---|---|---|---|
| GIST_SVM | 59.90 | - | - | - | - | - |
| FV_SIFT_SVM | 60.8 | - | - | - | - | - |
| MTCNN (VGG16) | 65.2 | - | - | - | - | - |
| VGG16 | 70.69 | 0.4097 | 0.4214 | 0.4621 | 0.5623 | 0.0761 |
| GPF-CNN (VGG16) | 75.6 | 0.5217 | 0.5464 | 0.4242 | 0.5211 | 0.070 |

TABLE VII: Comparison with state-of-the-art methods on Photo.net Dataset.

V Conclusion

This paper presents a biologically inspired model for photo aesthetic assessment. In the human visual system, the fovea has the highest visual acuity and is responsible for seeing fine details, while peripheral vision has a significantly lower density of cones and is used for perceiving the broad spatial scene. Moreover, foveal and peripheral vision play different roles in processing different visual stimuli. Inspired by these observations, we propose the GPF-CNN architecture. It learns to focus on the important regions of a top-down neural attention map in order to extract fine-detail features, and its GIF module adaptively fuses the global and local features according to the input feature map. The experimental results on the large-scale AVA and Photo.net datasets show that our GPF-CNN can significantly improve the state-of-the-art for three tasks: aesthetic quality classification, aesthetic score regression and aesthetic score distribution prediction. In future work, we will further explore the human visual system and design more powerful models for aesthetic prediction tasks.

References
  • [1] F.-L. Zhang, M. Wang, and S.-M. Hu, “Aesthetic image enhancement by dependence-aware object recomposition,” IEEE Trans. Multimedia, vol. 15, no. 7, pp. 1480–1490, 2013.
  • [2] A. Samii, R. Měch, and Z. Lin, “Data-driven automatic cropping using semantic composition search,” Computer Graphics Forum, vol. 34, no. 1, pp. 141–151, 2015.
  • [3] S. Bhattacharya, R. Sukthankar, and M. Shah, “A framework for photo-quality assessment and enhancement based on visual aesthetics,” in Proceedings of the 18th ACM International Conference on Multimedia, 2010, pp. 271–280.
  • [4] H. Talebi and P. Milanfar, “Learned perceptual image enhancement.” [Online]. Available: https://arxiv.org/abs/1712.02864
  • [5] L. Mai, H. Jin, and F. Liu, “Composition-preserving deep photo aesthetics assessment,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, June 27-30, 2016, pp. 497–506.
  • [6] L. Guo, Y. Xiong, Q. Huang, and X. Li, “Image esthetic assessment using both hand-crafting and semantic features,” Neurocomputing, vol. 143, pp. 14–26, 2014.
  • [7] X. Tang, W. Luo, and X. Wang, “Content-based photo quality assessment,” IEEE Trans. Multimedia, vol. 15, no. 8, pp. 1930–1943, 2013.
  • [8] C. Chamaret and F. Urban, “No-reference harmony-guided quality assessment,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, June 23-28, 2013, pp. 961–967.
  • [9] M. Nishiyama, T. Okabe, I. Sato, and Y. Sato, “Aesthetic quality classification of photographs based on color harmony,” in Proceedings of the 24th IEEE Conference on Computer Vision and Pattern Recognition, June 20-25, 2011, pp. 33–40.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, 2015.
  • [11] Z. Jiao, X. Gao, Y. Wang, J. Li, and H. Xu, “Deep convolutional neural networks for mental load classification based on EEG data,” Pattern Recognition, vol. 76, pp. 582–595, 2018.
  • [12] X. Zhang, X. Gao, W. Lu, L. He, and Q. Liu, “Dominant vanishing point detection in the wild with application in composition analysis,” Neurocomputing, vol. 311, pp. 260–269, 2018.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [14] X. Tian, Z. Dong, K. Yang, and T. Mei, “Query-dependent aesthetic model with deep learning for photo quality assessment,” IEEE Trans. Multimedia, vol. 17, no. 11, pp. 2035–2048, 2015.
  • [15] S. Kong, X. Shen, Z. L. Lin, R. Mech, and C. C. Fowlkes, “Photo aesthetics ranking network with attributes and content adaptation,” in Proceedings of 14th European Conference on Computer Vision, 2016, pp. 662–679.
  • [16] B. Jin, M. V. O. Segovia, and S. Süsstrunk, “Image aesthetic predictors based on weighted cnns,” in Proceedings of International Conference on Image Processing, September 25-28, 2016, pp. 2291–2295.
  • [17] X. Lu, Z. L. Lin, H. Jin, J. Yang, and J. Z. Wang, “Rating image aesthetics using deep learning,” IEEE Trans. Multimedia, vol. 17, no. 11, pp. 2021–2034, 2015.
  • [18] X. Lu, Z. Lin, X. Shen, R. Mech, and J. Z. Wang, “Deep multi-patch aggregation network for image style, aesthetics, and quality estimation,” in Proceedings of IEEE International Conference on Computer Vision, December 7-13, 2015, pp. 990–998.
  • [19] S. Ma, J. Liu, and C. W. Chen, “A-lamp: Adaptive layout-aware multi-patch deep convolutional neural network for photo aesthetic assessment,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, July 21-26, 2017, pp. 722–731.
  • [20] H. Strasburger, I. Rentschler, and M. Jüttner, “Peripheral vision and pattern recognition: A review,” Journal of Vision, vol. 11, no. 5, pp. 13–13, 2011.
  • [21] P. Wang and G. W. Cottrell, “Central and peripheral vision for scene recognition: a neurocomputational modeling exploration,” Journal of Vision, vol. 17, no. 4, pp. 9–9, 2017.
  • [22] S. Gould, J. Arfvidsson, A. Kaehler, B. Sapp, M. Messner, G. R. Bradski, P. Baumstarck, S. Chung, and A. Y. Ng, “Peripheral-foveal vision for real-time object recognition and tracking in video,” in Proceedings of the 20th International Joint Conference on Artificial Intelligence, January 6-12, 2007, pp. 2115–2121.
  • [23] C. J. Ludwig, J. R. Davies, and M. P. Eckstein, “Foveal analysis and peripheral selection during active visual sampling,” Proceedings of the National Academy of Sciences, vol. 111, no. 2, pp. E291–E299, 2014.
  • [24] A. M. Larson and L. C. Loschky, “The contributions of central versus peripheral vision to scene gist recognition,” Journal of Vision, vol. 9, no. 10, pp. 6–6, 2009.
  • [25] R. Datta, D. Joshi, J. Li, and J. Z. Wang, “Studying aesthetics in photographic images using a computational approach,” in Proceedings of European Conference on Computer Vision, May 7-13, 2006, pp. 288–301.
  • [26] L. Zhang, Y. Gao, R. Zimmermann, Q. Tian, and X. Li, “Fusion of multichannel local and global structural cues for photo aesthetics evaluation,” IEEE Trans. Image Processing, vol. 23, no. 3, pp. 1419–1429, 2014.
  • [27] Y. Kao, R. He, and K. Huang, “Deep aesthetic quality assessment with semantic information,” IEEE Trans. Image Processing, vol. 26, no. 3, pp. 1482–1495, 2017.
  • [28] L. Marchesotti, F. Perronnin, D. Larlus, and G. Csurka, “Assessing the aesthetic quality of photographs using generic image descriptors,” in Proceedings of IEEE International Conference on Computer Vision, November 6-13, 2011, pp. 1784–1791.
  • [29] M. Kucer, A. C. Loui, and D. W. Messinger, “Leveraging expert feature knowledge for predicting image aesthetics,” IEEE Trans. Image Processing, vol. 27, no. 10, pp. 5100–5112, 2018.
  • [30] X. Jin, L. Wu, X. Li, S. Chen, S. Peng, J. Chi, S. Ge, C. Song, and G. Zhao, “Predicting aesthetic score distribution through cumulative jensen-shannon divergence,” in Proceedings of the Thirty-Second Conference on Artificial Intelligence, February 2-7, 2018, pp. 77–84.
  • [31] N. Murray and A. Gordo, “A deep architecture for unified aesthetic prediction.” [Online]. Available: https://arxiv.org/abs/1708.04890
  • [32] H. Talebi and P. Milanfar, “NIMA: neural image assessment,” IEEE Trans. Image Processing, vol. 27, no. 8, pp. 3998–4011, 2018.
  • [33] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proceedings of the 26th Annual Conference on Neural Information Processing Systems, December 3-6, 2012, pp. 1106–1114.
  • [34] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition.” [Online]. Available: https://arxiv.org/abs/1409.1556
  • [35] B. Zhou, A. Khosla, À. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, June 27-30, 2016, pp. 2921–2929.
  • [36] J. Zhang, Z. L. Lin, J. Brandt, X. Shen, and S. Sclaroff, “Top-down neural attention by excitation backprop,” in Proceedings of the 14th European Conference on Computer Vision, October 11-14, 2016, pp. 543–559.
  • [37] J. Chen, G. Bai, S. Liang, and Z. Li, “Automatic image cropping: A computational complexity study,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, June 27-30, 2016, pp. 507–515.
  • [38] J. Kim, J. Koh, Y. Kim, J. Choi, Y. Hwang, and J. W. Choi, “Robust deep multi-modal learning based on gated information fusion network.” [Online]. Available: https://arxiv.org/abs/1807.06233
  • [39] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, June 18-22, 2018, pp. 7132–7141.
  • [40] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning, June 21-24, 2010, pp. 807–814.
  • [41] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
  • [42] N. Murray, L. Marchesotti, and F. Perronnin, “AVA: A large-scale database for aesthetic visual analysis,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, June 16-21, 2012, pp. 2408–2415.