Theme Aware Aesthetic Distribution Prediction with Full Resolution Photos

by   Gengyun Jia, et al.

Aesthetic quality assessment (AQA) of photos is a challenging task due to the subjective and diverse factors in the human assessment process. Nowadays, it is common to tackle AQA with deep neural networks (DNNs) for their superior performance in modeling such complex relations. However, traditional DNNs require fixed-size inputs, and resizing various inputs to a uniform size may significantly change their aesthetic features. Such transformations lead to mismatches between photos and their aesthetic evaluations. Existing methods usually adopt one of two solutions. Some methods directly crop fixed-size patches from the inputs. The others alternately capture the aesthetic features from pre-defined multi-size inputs by inserting adaptive pooling or removing fully connected layers. However, the former destroys the global structure and layout information, which are crucial in most situations. The latter has to resize images into several pre-defined sizes, which is not enough to reflect the diversity of image sizes, so the aesthetic features are still destroyed. To address this issue, we propose a simple and effective method that handles batch inputs of arbitrary sizes and achieves AQA on full resolution images by combining image padding with ROI (region of interest) pooling. Padding keeps the inputs at the same size, while ROI pooling cuts off the forward propagation of features on padding regions, thus eliminating the side effects of padding. Besides, we observe that the same image may receive different scores under different themes, which we call the theme criterion bias. However, previous works only focus on the aesthetic features of the images and ignore the criterion bias brought by their themes. In this paper, we introduce the theme information and propose a theme aware model. Extensive experiments demonstrate the effectiveness of the proposed method over state-of-the-art approaches.




I Introduction

(a) Original
(b) Crop
(c) Pad
(d) Resize
Fig. 1: Examples of transformations that make the photo size fixed. (a): The original image is well photographed. (b): Cropping destroys both the image layout and the object integrity. (c): The additional padding regions confuse the algorithm, because they differ from image to image. (d): Resizing warps the image and introduces noise.

Photo AQA is an interesting task with wide applications, such as retrieving and recommending photos of high aesthetic quality, and guiding aesthetic-driven image enhancement [deng2018aesthetic] with an aesthetic quality discriminator. However, the subjectivity and diversity of human assessment make it complex. Many efforts [ke2006design, tang2013content] tried to leverage general human criteria and focused on designing features according to expert knowledge, such as "the Rule of Thirds", "the color harmony", and "the depth of field". Some researchers also directly introduced features widely adopted in pattern recognition [ke2006design, tang2013content, bhattacharya2010framework, dhar2011high, su2011scenic, li2010towards]. However, these hand-crafted, pre-defined features have limited representation ability, which leads to poor performance in AQA.

With the development of deep learning and the collection of large-scale databases for aesthetic visual analysis, DNN-based methods are widely employed to deal with AQA. DNNs require fixed-size inputs. However, transforming inputs to a uniform size may destroy or change the aesthetic information in the original images, as shown in Fig 1. We can observe from Fig 1(b) that directly cropping fixed-size patches from inputs destroys their original layout information. In Fig 1(c), the padding regions may confuse the networks. Fig 1(d) shows that resizing operations also bring harmful information to the inputs, such as image distortion, blur and artifacts. Existing methods usually adopt one of two solutions. Several methods [Lu2015DeepMA, Ma2017ALampAL] directly crop fixed-size patches from the inputs, destroying the global structure and layout information of the original inputs. Other methods employ special architectures, such as FCN (fully convolutional networks) [long2015fully] or ASP (adaptive spatial pooling) [he2015spatial], to alternately capture the aesthetic features from pre-defined multi-size inputs [fang2018image, apostolidis2019image, cui2018distribution]. These methods require a uniform input size within one batch, but image sizes are diverse and cover a wide range. For example, image sizes in the AVA data set vary widely, which makes it impossible to group images into one batch without resizing. One direct solution is to set the batch size to 1, but this makes the training process unstable and inefficient, and prone to getting stuck in sub-optimal points [apostolidis2019image].

To handle the conflict between the uniform size required in one batch and the diverse image sizes of real-world photos, we propose to combine padding with ROI pooling [girshick2015fast]. ROI pooling was first proposed to extract features of object regions of any size and location. In our model, for inputs of arbitrary sizes, we first pad their boundaries to turn the images into a pre-defined size while leaving the original image content and its aspect ratio untouched. Then a network takes these fixed-size padded images as inputs. We use ROI pooling in a specific layer of the network, which applies max pooling only on the regions corresponding to the original images. The features of padding regions are not forwarded to the next layer. Through this procedure, we achieve arbitrary input sizes with arbitrary batch sizes (as long as the GPU memory can hold them), and introduce no extra noise or useless information. To the best of our knowledge, it is the first method in AQA that supports end-to-end batch training on full resolution images.

People's criteria of AQA are strongly related to a given theme, which means that even when one photo is assessed by the same person, the assessments can differ under different themes. With such theme criterion bias, evaluating a photo based only on the photo itself is ill-conditioned. Two photo websites are widely used as data sources in AQA research. On the first of them, a photo belongs to a challenge with a specific theme, and participants must submit photos that fit the corresponding theme if they want good evaluations. Some extreme examples are shown in Fig 2. The two photos in the first row are both highly blurred, but the right image gets a significantly higher score than the left one. One reason is that the right image comes from the theme 'Motion Blur', in which blurring is regarded as a good feature, whereas for the left image, submitted to the theme 'Color on Color', blurring is considered a drawback by many people. Images in the second row both show natural environments. Although the left image, which belongs to the theme 'Landscape', is beautiful, it is scored lower than the right one, possibly because the content and style of the right image fit the theme 'Harsh Environments' better. Some works [cui2018distribution, kao2017deep] have proposed to use semantic labels to guide the aesthetic assessment. However, these methods still cannot cope with the aforementioned theme criterion bias problem.

To address this issue, it is natural to take the challenge themes into account. To fully utilize them, we encode the theme information and combine it with the extracted visual features. The AQA is then based on the combined features. With the help of theme features, visual features can be extracted and used more adaptively, which makes the assessment more reasonable.

Most previous works [tang2013content, Lu2015DeepMA, Ma2017ALampAL] adopt a classification task and predict a binary label: the score, compared against a threshold, determines whether the photo is good or bad. However, such a binary judgement is not appropriate. As discussed by previous works [cui2018distribution, jin2018predicting], a binary label cannot reflect the subjectivity and diversity of human assessment. Furthermore, the binary label is confusing when the average score is close to the threshold. It is unreasonable to conclude that a photo scoring slightly above the threshold is good while a photo scoring slightly below it is bad. Some works [apostolidis2019image, kao2017deep] have noticed this irrationality and discard images whose scores are close to the threshold. This is not a good solution because many images in the dataset become useless. Regressing the average score instead of classifying may be a solution, but it still cannot reflect the diversity of human judgements of the same photo. Considering all these drawbacks, we decide to predict the aesthetic score distribution, similar to the works [cui2018distribution, jin2018predicting, talebi2018nima, zhang2019gated].

Fig. 2: Examples of themes influencing the assessment criterion. Images in the first row are both heavily blurred. The image on the right gets a higher score because the theme 'Motion Blur' prefers blurred images. In the second row, the left image belongs to the theme 'Landscape', while the right image belongs to 'Harsh Environments'. Although the right image looks less attractive, it gets a higher score because it fits its theme better.

To sum up, the contributions of this paper are:

  • We emphasize the importance of keeping the original size of photos in AQA. Inspired by object detection models, we develop a novel method that applies ROI pooling in networks that take padded images as inputs. This method enables us to utilize padding while eliminating its side effects. It is the first method in AQA that supports end-to-end batch training on arbitrary full resolution images.

  • There is a theme criterion bias in the human AQA process. In this case, assessment based only on the photos themselves is ill-conditioned. This problem has not been studied before. We propose to employ theme information as an additional input to alleviate the criterion bias. The themes are encoded and combined with visual features to predict aesthetic quality.

  • We conduct experiments on both aesthetic distribution prediction and mean score regression. For both tasks, our model outperforms existing methods on various evaluation metrics. The experiments demonstrate the effectiveness of the proposed method.

The rest of this paper is organized as follows: we summarize related works on AQA in Section II and describe the proposed method in detail in Section III. The experimental details are given in Section IV. We finally conclude the paper and discuss future work in Section V.

II Related Work

II-A Photo Aesthetic Quality Assessment

Traditional aesthetic quality assessment is based on hand-crafted features. A wide range of features has been employed to assess image aesthetic quality. Some reflect the technical quality and some are more subjective, and different features are often combined. Ke et al. [ke2006design] analysed several aesthetic factors and connected them with low-level visual features, for example, using the edge distribution to reflect photo simplicity. Several other features such as color, blur and contrast were also introduced. Luo et al. [luo2008photo] proposed to focus on the subject region, and extracted features such as clarity contrast. Tang et al. [tang2013content] further pointed out that the assessment should be based on the content. For example, they designed features specifically for human photos. Many other works [bhattacharya2010framework, dhar2011high, su2011scenic, li2010towards] are also based on these features, and may emphasize particular features or photo contents. Some works [bhattacharya2010framework] also introduced applications based on aesthetic quality assessment, such as aesthetic enhancement. Besides these carefully designed aesthetic features, some generic features such as GIST [oliva2001modeling] and SIFT [lowe2004distinctive] were also employed and showed good performance [marchesotti2011assessing]. Although many of these features are carefully designed, these shallow models still have limited representation power.

In recent years, DNNs have become a widely adopted architecture in many research areas. In AQA, the application of DNNs has also achieved great success. Lu et al. [lu2014rapid, lu2015rating] were among the first to assess photo aesthetic quality with deep neural networks. They designed a two-column architecture to learn features from both a global and a local view. Kao et al. [kao2015visual] proposed to use a regression model instead of a classification model because a continuous score delivers more information about aesthetics. Dong et al. [dong2015photo] used features from an ImageNet pre-trained network to predict binary aesthetic labels. Tian et al. [tian2015query] designed a query-based model: they train a network for each query image based on a subset of images that are strongly related to the query. Kao et al. [kao2017deep] proposed a multi-task network that predicts the aesthetic label and the semantic label simultaneously. They explain that AQA is strongly associated with image semantics.

Although these DNN-based works have made great progress, they still suffer from the fixed-size input problem. Compared with image semantic classification and detection tasks, the influence of the input image size is more significant in AQA. Researchers have therefore proposed several methods to solve this problem. Some works [Lu2015DeepMA, Ma2017ALampAL, lu2015rating, wang2019aspect] crop multiple fixed-size patches from the original images, and aggregate the features extracted from these patches to predict aesthetic quality. Ma et al. [Ma2017ALampAL] proposed to select patches according to criteria based on human perception, so that the patches reflect the features of the original images well. Other works tried to keep the aspect ratio of input images: they adopt either FCN [long2015fully] or ASP [he2015spatial] to generate fixed-size features from arbitrary network inputs [apostolidis2019image, fang2018image, Mai2016CompositionPreservingDP, cui2018distribution]. To meet the requirement of a uniform size within one training batch, some use multi-size training as an approximation, but only two aspect ratios are taken into account [fang2018image, cui2018distribution]. Apostolidis et al. [apostolidis2019image] tested training with batch size 1. Although this preserves the various aspect ratios, they reported significantly worse results due to the difficulty of training. Hosu et al. [hosu2019effective] extracted features from ImageNet pre-trained models without fine-tuning, so they can use the original photos in a two-stage process.

Some researchers noticed that a binary label or a mean score alone cannot reflect the subjectivity and diversity of human assessment. They proposed to learn distributions of aesthetic ratings. Some early works [wu2011learning] employed label distribution learning based on support vector machines. They also used the number of votes to represent the reliability of the ground-truth distribution. Recently, many other works predicting aesthetic rating distributions have been proposed, using various loss functions such as the Kullback-Leibler (KL) divergence [cui2018distribution], the earth mover's distance [talebi2018nima], another distance measure [jin2016image], or the cumulative Jensen-Shannon divergence [jin2018predicting]. These works also combined other strategies, such as using semantic information, or defining reliability based on the kurtosis of the distribution.

II-B Feature Extraction of Multi-size Images

The problem that DNNs can only take inputs of a fixed size has been studied in several areas. It was first tackled with spatial pyramid pooling [he2015spatial]. This work pointed out that it is the fully connected (FC) layers that impose the fixed-size constraint on the input, while the convolutional layers have no such requirement. Spatial pyramid pooling can generate fixed-size feature maps from inputs of arbitrary sizes, so the authors employed a multi-size training strategy with two pre-defined sizes. Removing the FC layers and using only convolutional layers also supports multi-size training. Such a network is called an FCN and was first introduced to solve semantic segmentation tasks [long2015fully].

ROI pooling was originally designed to solve the inefficiency problem in RCNN [girshick2014rich]. This inefficiency derives from the repeated extraction of region proposal features, since there are many overlaps between the proposals. ROI pooling [girshick2015fast] pools features from regions on the CNN feature maps of the whole image, so the features at each location are computed only once. To deal with the different shapes and scales of objects, ROI pooling is able to generate fixed-size features from arbitrary input region sizes and locations. ROI align [he2017mask] is an enhanced version, which fixes the quantization error problem in ROI pooling. This enhancement significantly improves the detection performance on small objects. ROI pooling (align) has become a standard module in many object detection and segmentation models [dai2016r, lin2017feature, cai2018cascade], especially in two-stage models.

III Method

This section describes the proposed method in detail. We first describe how to use ROI pooling and image padding in Section III-A, then introduce the theme aware model in Section III-B. Finally, we detail the training and inference procedure and the network architecture in Sections III-C and III-D respectively.

Fig. 3: The overall architecture. The padded images are fed into the network. In ROI pooling, features of different images are pooled to a uniform size, as shown in the bottom left corner. The pooled features are then processed by several convolutional and pooling layers, which output 2048-dim features. The themes are encoded into one-hot codes and turned into 256-dim features. Visual and theme features are concatenated to predict aesthetic distributions. EMD is employed as our loss function.

III-A ROI Pooling on Padded Full Resolution Images

The image transformations used to meet the fixed-size input restriction of traditional DNNs cause mismatches between images and their aesthetic quality assessments. We aim to eliminate such mismatches while keeping a uniform input size. To this end, we employ the padding transformation for its two properties. First, padding does not change the pixels of the original image, so the transformation is invertible. Second, the padding region and the image region in the padded image are spatially separated. Feature maps of different CNN layers keep spatial correspondences, which means that the separation of image and padding regions is retained on the feature maps. As a result, if the operations of CNN layers are applied only to image regions, they can extract features of the original full resolution images without the noise introduced by padding.

Inspired by the object detection model Fast R-CNN [girshick2015fast], we employ ROI pooling to pool features from specific feature map regions. Specifically, some regions are located on an input image beforehand. Then the image is processed by several layers and these regions are mapped onto the feature maps. Assuming the downsampling ratio is $s$, the coordinate mapping function is:

$$(x_f, y_f) = \left( [x_I / s],\ [y_I / s] \right), \quad (1)$$

where the subscripts $f$ and $I$ indicate coordinates on the feature maps and on the image respectively, and $[\cdot]$ denotes quantization to an integer feature-map index. Pooling is applied only to these regions on the feature maps according to their location, such that it generates the features of each region separately. ROI pooling is applied independently on each channel, as in standard pooling. For a general max pooling, the output feature map is computed as:

$$y_{i,j,c} = \max_{(p,q) \in \mathcal{R}_{i,j}} x_{p,q,c},$$

where $\mathcal{R}_{i,j}$ is the receptive field specified by the output location $i$ and $j$, and $c$ is the channel index. In normal max-pooling layers, the size as well as the moving range of the receptive field are both fixed. In ASP, only the moving range is fixed. In ROI pooling, both of them are adaptive, which enables ROI pooling to pool features from feature map regions of arbitrary sizes and locations. Along each spatial dimension $d$, with $i_d$ the output index along that dimension, the coordinates of $\mathcal{R}_{i,j}$ can be written as:

$$a_d = m_d + \lfloor i_d \cdot k_d \rfloor, \qquad b_d = m_d + \lceil (i_d + 1) \cdot k_d \rceil,$$

where $m_d$ is the moving start specified by the region location, $a_d$ and $b_d$ are the start point and end point of the receptive field along dimension $d$ respectively, $\lfloor \cdot \rfloor$ is the floor operation, $\lceil \cdot \rceil$ is the ceil operation, and $k_d$ is the pooling stride without quantization, i.e. the region length along $d$ divided by the output length along $d$. The feature map at one channel can be written as $F = F_{\mathrm{img}} \cup F_{\mathrm{pad}}$, where the subscripts indicate the region that the features belong to. Once ROI pooling is applied on the feature map, the output contains only the image features:

$$\mathrm{ROIPool}(F, l) = \mathrm{Pool}(F_{\mathrm{img}}),$$

where $l$ is the location of the image region on the feature maps. It can be calculated easily by the mapping function (1).
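The bin computation described above (floor for the start point, ceil for the end point, and an unquantized stride) can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names `map_to_feature` and `roi_max_pool`, the end-exclusive box convention, and the use of rounding in the coordinate mapping are all assumptions made for the example.

```python
import math
import numpy as np

def map_to_feature(coord_img, ratio):
    """Map an image coordinate to a feature-map index.

    Rounding to the nearest integer is an assumption of this sketch.
    """
    return int(round(coord_img / ratio))

def roi_max_pool(feat, box, out_h, out_w):
    """Max-pool the region `box` of `feat` (C, H, W) into (C, out_h, out_w).

    box = (y0, x0, y1, x1) in feature-map coordinates, end-exclusive
    (a hypothetical convention chosen for this example).
    """
    y0, x0, y1, x1 = box
    stride_h = (y1 - y0) / out_h  # pooling stride without quantization
    stride_w = (x1 - x0) / out_w
    out = np.empty((feat.shape[0], out_h, out_w), dtype=feat.dtype)
    for i in range(out_h):
        # start point uses floor, end point uses ceil, clamped to the region
        hs = y0 + math.floor(i * stride_h)
        he = min(y1, y0 + math.ceil((i + 1) * stride_h))
        for j in range(out_w):
            ws = x0 + math.floor(j * stride_w)
            we = min(x1, x0 + math.ceil((j + 1) * stride_w))
            # pooling is applied independently on each channel
            out[:, i, j] = feat[:, hs:he, ws:we].max(axis=(1, 2))
    return out
```

Because both the bin size and the bin positions depend on the box, regions of different sizes all produce an output of shape `(C, out_h, out_w)`, and anything outside the box never contributes to the result.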

In our model, we first pad all images to the same size. Since we plan to use full resolution images, the padding size equals the biggest image size. Theoretically, it does not matter how the images are padded, so we use the most straightforward way and pad zeros along the upper and right boundaries. At the same time, the locations of the image regions in the padded images are recorded. The coordinates of one corner are fixed to (0, 0), so only the coordinates of the diagonally opposite corner need to be adjusted according to the image size. Then we employ a network and replace one of its pooling layers with ROI pooling. Finally, the network takes the padded images and the corresponding locations of the image regions as inputs. The ROI pooling cuts off the forward propagation of padding regions. Therefore the network eventually predicts aesthetic quality based only on image features. The pooling procedure is illustrated in Fig 3. In the bottom left of the figure, light blue and gray regions denote image features and padding area features respectively. It can be seen that different images are eventually pooled into feature maps of the same size.
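The padding and bookkeeping step can be illustrated with a short NumPy sketch. It is a simplification under stated assumptions: the helper names `pad_batch` and `region_max_pool` are hypothetical, the image is anchored at array index (0, 0) so that padding lands on the trailing rows and columns (which side is padded is only a convention), and the region pooling is applied directly to the padded pixels rather than after the first convolutional layers as in the model.

```python
import numpy as np

def pad_batch(images):
    """Zero-pad a list of (C, H, W) images to the biggest H and W in the list.

    Returns the padded batch and, per image, the recorded image region
    (y0, x0, y1, x1) with one corner fixed at (0, 0).
    """
    pad_h = max(img.shape[1] for img in images)
    pad_w = max(img.shape[2] for img in images)
    channels = images[0].shape[0]
    batch = np.zeros((len(images), channels, pad_h, pad_w), dtype=np.float32)
    boxes = []
    for n, img in enumerate(images):
        _, h, w = img.shape
        batch[n, :, :h, :w] = img
        boxes.append((0, 0, h, w))
    return batch, boxes

def region_max_pool(x, box, out_size):
    """Max-pool only the region `box` of x (C, H, W) to (C, out_size, out_size)."""
    y0, x0, y1, x1 = box
    ys = np.linspace(y0, y1, out_size + 1)
    xs = np.linspace(x0, x1, out_size + 1)
    out = np.empty((x.shape[0], out_size, out_size), dtype=x.dtype)
    for i in range(out_size):
        r0, r1 = int(np.floor(ys[i])), min(y1, int(np.ceil(ys[i + 1])))
        for j in range(out_size):
            c0, c1 = int(np.floor(xs[j])), min(x1, int(np.ceil(xs[j + 1])))
            out[:, i, j] = x[:, r0:r1, c0:c1].max(axis=(1, 2))
    return out
```

Pooling a padded image over its recorded box gives the same result as pooling the unpadded image, whatever values the padding holds, which is the property the method relies on.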

Discussion 1: ROI pooling aims to achieve an arbitrary receptive field on a tensor; as a result, quantization approximations are introduced on both the region location and the pooling stride. Such quantization errors can result in mismatches of ROIs. ROI align [he2017mask] was proposed to solve this problem by using interpolation. The two methods perform nearly the same in our model for three reasons. First, the mismatch of ROIs caused by quantization error is determined by the downsampling ratio between the network inputs and the pooling inputs: the number of mismatched pixels along one dimension is bounded in proportion to this ratio. Therefore the small downsampling ratio in our model (discussed in Section III-D) brings only small mismatches. Second, the mismatches have less influence on bigger regions. For example, a mismatch of 10 pixels may totally change the location of a small region, but causes nearly no visible difference on a large one. Since the regions refer to the entire full-resolution images in our model, the mismatches have nearly no impact. Third, predicting an attribute from an entire image is robust to very small translations, as analyzed in [he2017mask].
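The first two arguments can be checked with a small numeric sketch. The downsampling ratio of 4, the region sizes of 20 and 2000 pixels, and the use of rounding for quantization are hypothetical values chosen only for illustration; they are not taken from the paper.

```python
def backprojected_mismatch(coord_img, ratio):
    """Pixels of error after rounding an image coordinate on the feature map
    and projecting it back to image coordinates (rounding is an assumption)."""
    coord_feat = round(coord_img / ratio)
    return abs(coord_img - coord_feat * ratio)

ratio = 4  # hypothetical downsampling ratio before the ROI pooling layer
worst = max(backprojected_mismatch(c, ratio) for c in range(1, 5000))

# The same absolute mismatch matters far less for a large region.
relative_small = worst / 20    # hypothetical small region, 20 pixels wide
relative_large = worst / 2000  # hypothetical full-resolution image, 2000 pixels wide
```

Under these assumptions the worst-case mismatch never exceeds half the downsampling ratio, and its size relative to the region shrinks as the region grows, which is why pooling over an entire full-resolution image is barely affected.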

Discussion 2: We transform image feature maps of different sizes into a uniform size through ROI pooling, which means that our model can be seen as applying resizing at a different layer of the network. Traditional methods resize the network inputs, whereas our model resizes an intermediate result in the forward propagation of the network. Such a substitution has significant advantages. Features of the original images have already been learned before ROI pooling, and the network can learn how to retain useful information in the pooled features.

III-B Theme Aware AQA

In the AVA dataset, photos are submitted and assessed under specific themes. Assessing aesthetic quality from the photos alone therefore brings about theme criterion bias. The criterion adopted to assess one photo may not be appropriate for another photo if they belong to different themes. This motivates us to utilize theme information to alleviate such bias. To this end, we introduce a theme aware model, which fuses visual features and theme information to evaluate the aesthetic quality.

Specifically, images in the AVA dataset come from 1397 different themes. We turn the theme of each image into a one-hot code. Compared with the visual feature dimension in most DNNs, the code dimension of 1397 is too high, since themes contain much less information than images. To balance the two dimensions, the one-hot codes are first processed by a fully connected layer that reduces the dimension to 256. This fully connected layer also works as a theme feature extractor. Then the theme features are concatenated with the visual features extracted by the backbone convolutional layers. Finally the concatenated features are fed into fully connected layers to predict the aesthetic quality of images. The process is illustrated in Fig 3.
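The encoding and fusion steps can be sketched in NumPy as follows. The dimensions 1397, 256 and 2048 come from the text and from Fig 3; everything else is an assumption made for illustration, including the randomly initialized weights `W_theme`, the ReLU activation, and the helper names `one_hot`, `theme_features` and `fuse`.

```python
import numpy as np

NUM_THEMES, THEME_DIM, VISUAL_DIM = 1397, 256, 2048

rng = np.random.default_rng(0)
# Hypothetical parameters of the theme feature extractor (one FC layer).
W_theme = rng.normal(0.0, 0.01, (NUM_THEMES, THEME_DIM)).astype(np.float32)
b_theme = np.zeros(THEME_DIM, dtype=np.float32)

def one_hot(theme_ids, num_themes=NUM_THEMES):
    """Encode integer theme ids as one-hot rows of length num_themes."""
    codes = np.zeros((len(theme_ids), num_themes), dtype=np.float32)
    codes[np.arange(len(theme_ids)), theme_ids] = 1.0
    return codes

def theme_features(theme_ids):
    """Reduce one-hot theme codes to THEME_DIM features (ReLU is an assumption)."""
    return np.maximum(one_hot(theme_ids) @ W_theme + b_theme, 0.0)

def fuse(visual, theme_ids):
    """Concatenate visual features (N, VISUAL_DIM) with theme features (N, THEME_DIM)."""
    return np.concatenate([visual, theme_features(theme_ids)], axis=1)
```

Since a one-hot row selects a single row of the weight matrix, the fully connected layer on one-hot codes behaves like a learned lookup table with one 256-dim vector per theme.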

By taking the theme information into account, the network can extract and use visual features adaptively, thus alleviating the theme criterion bias. For example, if different themes have different visual feature preferences, the theme aware model allows the network to learn different criteria. Without theme information, the network tends to learn a universal criterion that dominates the dataset, resulting in criterion bias on photos with other preferences.

Fig. 4: Some distribution prediction results. Blue bins are predictions and red bins are ground truth. For convenience, images are resized to the same size. It can be seen that our model predicts the distributions of both good and bad photos well.

III-C Training and Inference

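Given a set of images $\{x_i\}$, their aesthetic distributions $\{p_i\}$, and the corresponding challenge themes $\{t_i\}$, the dataset is formulated as $D = \{(x_i, t_i, p_i)\}_{i=1}^{N}$. We aim to learn a model that predicts the aesthetic distributions of photos. The ground-truth distribution is the normalized ratings, defined as $p_i^k = n_i^k / N_i$, where $n_i^k$ is the number of votes giving score $k$ to photo $i$, $N_i$ is the number of voters of photo $i$, and $k$ indicates the rating score. The range of $k$ is $[1, K]$, which is pre-defined and may differ between datasets. $K$ may significantly influence people's judgement: a bigger $K$ results in more diverse assessments. As described before, $t_i$ is encoded into a one-hot code.

The output of the network is a $K$-dimensional vector $\hat{p}$. The softmax function is applied to the last fully connected layer to ensure that the output is a probability distribution:

$$\hat{p}^k = \frac{\exp(z^k)}{\sum_{j=1}^{K} \exp(z^j)},$$

where $z^k$ is the direct output of the $k$-th neuron. Assuming the parameters of the convolutional layers, the theme feature extractor and the output fully connected layers are $\theta_c$, $\theta_t$ and $\theta_o$ respectively, the output can be formulated as:

$$\hat{p} = \mathrm{softmax}\left( f_o\left( \left[ f_c(x; \theta_c),\ f_t(t; \theta_t) \right]; \theta_o \right) \right).$$

In the training phase, we choose a cumulative distribution divergence as the loss function. As discussed in [wu2011learning], it is more appropriate to use a cumulative distance when the distribution is ordered. For example, a distribution concentrated on one score should be closer to a distribution concentrated on an adjacent score than to one concentrated on a distant score. Concretely, similar to [talebi2018nima], we employ the earth mover's distance (EMD) as the loss function:

$$\mathrm{EMD}(p, \hat{p}) = \left( \frac{1}{K} \sum_{k=1}^{K} \left| \mathrm{CDF}_p(k) - \mathrm{CDF}_{\hat{p}}(k) \right|^r \right)^{1/r}, \quad (4)$$

where $\mathrm{CDF}_p(k)$ is the cumulative distribution of $p$, defined as $\mathrm{CDF}_p(k) = \sum_{j=1}^{k} p^j$. We also choose $r = 2$ for its simplicity in optimization.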
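The softmax output and the cumulative-distribution loss can be sketched in NumPy as below. This is a minimal sketch, not the training code: the function names are illustrative, and the normalization by the number of score bins follows the common EMD formulation of [talebi2018nima], which is assumed here.

```python
import numpy as np

def softmax(z):
    """Turn raw outputs of the last FC layer into a probability distribution."""
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def emd(p, q, r=2):
    """Earth mover's distance between two ordered discrete distributions,
    computed from the difference of their cumulative distributions."""
    cdf_diff = np.cumsum(p, axis=-1) - np.cumsum(q, axis=-1)
    return np.mean(np.abs(cdf_diff) ** r, axis=-1) ** (1.0 / r)
```

A short usage example shows why a cumulative distance suits ordered scores: with ten score bins, moving all votes from score 1 to score 2 gives a much smaller distance than moving them to score 10, whereas a bin-wise measure would treat both shifts the same.

```python
scores = 10
p = np.eye(scores)[0]      # all votes on score 1
near = np.eye(scores)[1]   # all votes on score 2
far = np.eye(scores)[9]    # all votes on score 10
assert emd(p, near, r=1) < emd(p, far, r=1)
```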

In the testing phase, we adopt the same procedure as in training. Testing images are padded to the same size and fed into the network. The outputs processed by the softmax function are regarded as the photos' aesthetic distributions. One may argue for using the original images without padding and removing ROI pooling at test time, but such a method introduces a mismatch between testing and training, and the learned network may not fit this situation. We tested this alternative and obtained worse results.

Fig. 5: Some well predicted photos, aspect ratio is kept unchanged. Predicted mean score (ground-truth score) and EMD (r=1) are given below each image.

III-D Network Architecture

We choose Inception-V3 [szegedy2016rethinking] as our backbone network, the same as in previous works [talebi2018nima, zhang2019gated, hosu2019effective]. Inception-V3 is designed for image classification and consists of many modules called inception modules. These modules combine convolutional outputs of different kernel sizes and can therefore extract features at various scales. Inception-V3 also employs powerful strategies to reduce the computational cost, such as factorizing convolutions with large filter sizes. Since we plan to train the network on original full resolution images, these strategies are very helpful.

In Inception-V3, every inception module has at least one pooling layer, and there are 5 basic convolutional layers with 2 max-pooling layers before the inception modules. Theoretically, we can replace any of these pooling layers with ROI pooling, but two reasons argue against a deep ROI pooling layer. On the one hand, it is the different sizes of the padding regions that confuse the network. If padded images are propagated very deep, then as the feature maps shrink, the features of padding regions of different sizes will eventually be mixed with the image features and cannot be isolated. On the other hand, since the padding size is determined by the biggest images in the dataset, training networks on such high resolution images is hard and time consuming. Applying ROI pooling in deep layers would significantly increase the burden of computation and storage. Considering these facts, we replace the first pooling layer in Inception-V3 with ROI pooling.

The output size of ROI pooling also plays a vital role. Since we aim to predict aesthetic quality from full resolution images, keeping the standard output size of the first pooling layer in Inception-V3 would cause a significant information bottleneck and lead to poor performance. Considering that a large portion of the images in the datasets are nearly two times bigger than the standard input size, we choose to double the pooling output size. This modification can bring about mismatches between the CNN outputs and the FC inputs. We add an adaptive pooling layer to cope with this problem.
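How an adaptive pooling layer removes the mismatch between the CNN output size and the FC input size can be sketched as follows. The text does not say whether the adaptive layer averages or takes the maximum, nor its output size, so the average pooling and the helper name `adaptive_avg_pool` below are assumptions made for illustration.

```python
import numpy as np

def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)

def adaptive_avg_pool(feat, out_h, out_w):
    """Average-pool feat (C, H, W) to (C, out_h, out_w) for any H >= 1, W >= 1.

    Bin boundaries are derived from the input size, so the output size is
    fixed regardless of the spatial size of the incoming feature map.
    """
    c, h, w = feat.shape
    out = np.empty((c, out_h, out_w), dtype=np.float64)
    for i in range(out_h):
        r0, r1 = (i * h) // out_h, ceil_div((i + 1) * h, out_h)
        for j in range(out_w):
            c0, c1 = (j * w) // out_w, ceil_div((j + 1) * w, out_w)
            out[:, i, j] = feat[:, r0:r1, c0:c1].mean(axis=(1, 2))
    return out
```

With such a layer in front of the fully connected layers, doubling the ROI pooling output only changes the spatial size of the intermediate feature maps, while the vector handed to the FC layers keeps its length.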

Method                            Euc     KL      JS      Chi-square  EMD (r=2)  EMD (r=1)  CD
Talebi et al. [talebi2018nima]    -       -       -       -           -          0.050      -
Zhang et al. [zhang2019gated]     -       -       -       -           -          0.045      -
Wang et al. [wang2019aspect]      -       -       -       -           -          0.065      -
Fang et al. [fang2018image]       0.144   0.120   -       -           -          -          0.056
Cui et al. [cui2018distribution]  0.127   0.094   -       -           -          -          0.042
Jin et al. [jin2018predicting]    0.158   -       0.037   0.068       0.082*     -          -
Ours                              0.134   0.088   0.022   0.041       0.059      0.041      0.040
Ours (Align)                      0.134   0.089   0.022   0.041       0.059      0.041      0.040
TABLE I: Performance of Distribution Prediction on AVA. 'Align' Indicates That We Replace ROI Pooling with ROI Align

IV Experiments

IV-A Dataset and Evaluation Metrics

We evaluate our algorithm and compare it with other AQA algorithms on two widely used datasets, AVA [murray2012ava] and the dataset of [datta2008algorithmic]. Both datasets are collected from websites. Their aspect ratio distributions are shown in Fig 6; both data sets contain various aspect ratios and cover a wide range. Surprisingly, although the two datasets come from different websites and have different scales, their aspect ratio distributions are nearly the same.

IV-A1 AVA

AVA is a large-scale database for image aesthetics analysis, which contains over 250,000 color images. All images are collected from the DPChallenge website. The aesthetic assessment of each image is given by 78 to 549 individuals, and each voter chooses a score from 1 to 10. AVA also provides semantic information: each image has 0 to 2 semantic labels and belongs to a specific challenge theme. We follow the standard dataset partition as in [murray2012ava].

IV-A2 Photo.net

The Photo.net data set only provides aesthetic labels. It contains 20,278 images, each rated by at least 10 users with scores from 1 to 7. The voting information of some images is lost, so only the mean score and standard deviation are given for them. Because the website has been updated several times, some images are no longer available, and only 16,666 images can be downloaded. We randomly select 14,800 images as the training set, 1,200 images as the testing set, and 666 images as the validation set.
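A random partition with the sizes stated above could be produced as in the following sketch. The function name, the fixed seed, and the integer image identifiers are assumptions for illustration; the actual split used in the experiments is not specified beyond its sizes.

```python
import random


def split_dataset(image_ids, n_train=14800, n_test=1200, n_val=666, seed=0):
    """Randomly partition image ids into train / test / validation subsets."""
    ids = list(image_ids)
    if len(ids) != n_train + n_test + n_val:
        raise ValueError("subset sizes must add up to the number of images")
    rng = random.Random(seed)  # fixed seed (assumption) keeps the split reproducible
    rng.shuffle(ids)
    train = ids[:n_train]
    test = ids[n_train:n_train + n_test]
    val = ids[n_train + n_test:]
    return train, test, val


# Hypothetical ids 0..16665 stand in for the 16,666 downloadable images.
train, test, val = split_dataset(range(16666))
print(len(train), len(test), len(val))  # 14800 1200 666
```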

We evaluate our method with two kinds of metrics. The first kind is for distribution prediction. Following previous works [jin2018predicting, fang2018image, cui2018distribution, talebi2018nima, zhang2019gated], we employ several distance and divergence metrics, including Euclidean distance (Euc), Kullback-Leibler (KL) divergence, Jensen-Shannon (JS) divergence, Chi-square ($\chi^2$) distance, EMD with both $r=2$ and $r=1$ in Eq. (4), and Cosine distance (CD). For all these metrics, a smaller value indicates better performance. The second kind is for the mean score and standard deviation computed from the predicted distribution. We calculate their correlation coefficients and mean square error (MSE). The correlation coefficients include the Pearson Linear Correlation Coefficient (PLCC) and the Spearman Rank-order Correlation Coefficient (SRCC). Larger coefficients indicate better performance. The MSE criterion is only applied to the mean score due to the lack of previously reported results.
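For reference, several of the distribution metrics listed above can be sketched in a few lines of Python. This is an illustrative sketch, not the evaluation code of the paper: the EMD follows the normalized cumulative-distribution form commonly used in AQA work, which we assume corresponds to Eq. (4), and the Chi-square distance is omitted because several variants exist and the one used here is not recoverable from the text.

```python
import math


def kl_divergence(p, q):
    """KL(p || q); terms with p_i = 0 contribute 0. Assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence, the symmetrized KL to the midpoint distribution."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def emd(p, q, r=1):
    """Normalized EMD between two discrete distributions via their CDFs."""
    cp = cq = total = 0.0
    for pi, qi in zip(p, q):
        cp += pi
        cq += qi
        total += abs(cp - cq) ** r
    return (total / len(p)) ** (1.0 / r)


def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def cosine_distance(p, q):
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm = math.sqrt(sum(pi * pi for pi in p)) * math.sqrt(sum(qi * qi for qi in q))
    return 1.0 - dot / norm


# Toy 3-bin distributions (illustrative values only).
p = [0.2, 0.5, 0.3]
q = [0.1, 0.6, 0.3]
print(emd(p, q, r=1), emd(p, q, r=2), kl_divergence(p, q), js_divergence(p, q))
```

All of these return 0 for identical distributions and grow as the prediction moves away from the ground truth, which matches the "smaller is better" reading of Table I.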

IV-B Implementation Details

All our models are pre-trained on ImageNet to accelerate convergence and then fine-tuned on the corresponding data sets. We use the SGD optimizer with a momentum of 0.9 and a weight decay of 0.0001. The models are trained for 30 epochs, and the learning rate is divided by 2 every 10 epochs. The learning rate of the fully connected layers is 10 times larger than that of the convolutional layers. All input images are padded to $800 \times 800$. The batch size is set to 16 due to the memory limitation of the Titan Xp GPU. Such a small batch size cannot provide reliable batch statistics. Since the network contains many batch normalization (BN) layers, we freeze the BN parameters of the pre-trained models. Data augmentation is used to ease overfitting. Concretely, we crop the original images at the four corners while keeping their aspect ratio. The training images are randomly selected from an image collection that contains the original image, the horizontally flipped image, and the four cropped images. In the test stage, we average the results over all the augmented images.
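The step schedule described above can be written down as a small sketch. The base learning rate is passed in as a parameter because its value is not recoverable here; the function names and the value 1.0 in the usage lines are placeholders, not the authors' settings.

```python
def learning_rate(base_lr, epoch, step=10, factor=0.5):
    """Learning rate at a given epoch: divided by 2 every `step` epochs."""
    return base_lr * factor ** (epoch // step)


def schedule(base_lr_conv, epochs=30, fc_multiplier=10):
    """Per-epoch (epoch, conv_lr, fc_lr) tuples; FC layers use a 10x larger rate."""
    return [
        (e,
         learning_rate(base_lr_conv, e),
         learning_rate(base_lr_conv * fc_multiplier, e))
        for e in range(epochs)
    ]


# Placeholder base rate of 1.0, used only to make the decay pattern visible.
for epoch, lr_conv, lr_fc in schedule(1.0)[::10]:
    print(epoch, lr_conv, lr_fc)
# 0 1.0 10.0
# 10 0.5 5.0
# 20 0.25 2.5
```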

| Method | SRCC (mean) | PLCC (mean) | SRCC (std) | PLCC (std) | MSE (mean) |
|---|---|---|---|---|---|
| Kao et al. [kao2015visual] | - | - | - | - | 0.4501 |
| Jin et al. [jin2016image] | - | - | - | - | 0.3373 |
| Kao et al. [kao2016hierarchical] | - | 0.5214 | - | - | 0.3988 |
| Kong et al. [kong2016photo] | 0.5581 | - | - | - | - |
| Talebi et al. [talebi2018nima] | 0.6120 | 0.6360 | 0.2330 | 0.2180 | - |
| Meng et al. [meng2018mlans] | 0.6730 | 0.6860 | - | - | - |
| Wang et al. [wang2019aspect] | 0.6868 | 0.6923 | - | - | 0.2764 |
| Zhang et al. [zhang2019gated] | 0.6900 | 0.7042 | - | - | 0.2752* |
| Hosu et al. [hosu2019effective] | 0.7450 | 0.7480 | - | - | - |
| Ours | 0.7611 | 0.7632 | 0.6918 | 0.7013 | 0.2366 |
| Ours (Align) | 0.7611 | 0.7621 | 0.7031 | 0.7111 | 0.2390 |

TABLE II: Performance of Mean Score and Standard Deviation Prediction on AVA.
Fig. 6: Distributions of image aspect ratios. Blue and red bins denote the distributions of the AVA and Photo.net datasets, respectively. The range is wide, showing the diversity of aspect ratios. The two datasets also show nearly the same distribution.

IV-C Performance Evaluation

We first evaluate the distribution prediction performance. The proposed method is compared with previous distribution learning models [fang2018image, cui2018distribution, jin2018predicting, talebi2018nima, zhang2019gated, wang2019aspect]; some metrics are not reported in these works. Table I gives the detailed comparison. Note that the KL divergence reported in [jin2018predicting] is a symmetrical version. We did not compute this version because we observe zeros in the ground-truth distributions of many images. The EMD (r=2) of [jin2018predicting] in Table I is derived from the reported Euclidean distance of cumulative distributions. As can be seen in Table I, our method achieves the best performance on almost all evaluation metrics.
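The problem with zeros in the ground truth can be made concrete with a small sketch. It assumes the convention that the ground-truth distribution is the first argument of the one-directional KL divergence and that the prediction is strictly positive (for example, a softmax output); the toy distributions are illustrative values, not data from AVA.

```python
import math


def kl(p, q):
    """KL(p || q) with the convention 0 * log(0 / q) = 0; infinite if q_i = 0 < p_i."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total


ground_truth = [0.0, 0.5, 0.5]  # nobody voted for the first score
prediction = [0.1, 0.4, 0.5]    # strictly positive prediction

print(kl(ground_truth, prediction))  # finite
print(kl(prediction, ground_truth))  # inf: the reverse term divides by a zero vote share
```

A symmetrical KL needs both directions, so a single empty score bin in the ground truth already makes it unbounded, whereas the one-directional form stays finite.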

Then we compare the mean score and standard deviation prediction performance; both values are computed from the predicted distributions. Among the competitors, GPF-CNN [zhang2019gated] and NIMA [talebi2018nima] are distribution-based methods. Although Jin et al. [jin2016image] employ a distribution prediction model, their result reported in Table II comes from their regression model, which gives their best result. All the remaining methods are regression models. Note that the MSE of [zhang2019gated] is derived from the reported root MSE. For [hosu2019effective], we use the Inception-v3 result, since it shares our backbone network. Table II shows that the proposed method achieves a new state of the art.
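How the mean score and standard deviation follow from a predicted distribution can be sketched as below. The function name and the toy distributions are assumptions for illustration; the score range 1 to 10 matches AVA.

```python
import math


def mean_and_std(dist, scores=None):
    """Mean and standard deviation of a discrete score distribution.

    dist:   probabilities for each score bin (should sum to 1).
    scores: score value of each bin; defaults to 1..len(dist), as on AVA.
    """
    if scores is None:
        scores = range(1, len(dist) + 1)
    scores = list(scores)
    mean = sum(s * p for s, p in zip(scores, dist))
    var = sum((s - mean) ** 2 * p for s, p in zip(scores, dist))
    return mean, math.sqrt(var)


# Two toy 10-bin distributions with the same mean but different spread.
peaked = [0, 0, 0, 0, 0.5, 0.5, 0, 0, 0, 0]
uniform = [0.1] * 10
print(mean_and_std(peaked))   # mean 5.5, std 0.5
print(mean_and_std(uniform))  # mean 5.5, std about 2.87
```

The two examples share a mean of 5.5 yet differ strongly in standard deviation, which is why Table II reports correlation coefficients for both quantities.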

The results of the model that replaces ROI pooling with ROI align are also given. Both sets of results are nearly the same, which supports our analysis in Section III-A. For simplicity, we only use ROI pooling in the other experiments.

We show some distribution prediction results on the AVA dataset in Fig 4. The blue bins are predictions and the red bins are the ground truth. It can be seen that our method predicts the distributions precisely. Some other results are given in Fig 5, in which we keep the aspect ratio of the images but only show the predicted (ground-truth) mean score and the EMD. Fig 7 gives two failure cases. A possible reason is that their ground-truth distributions show uncommon patterns that differ from most images in this data set: the distributions are non-Gaussian and not smooth.

Fig. 7: Failure cases. The model fails to predict uncommon distributions. The two distributions are both non-Gaussian and have abnormal values. In the first image, the number of people who vote for score 9 is too small. For the second image, score 1 receives too many votes.

IV-D Ablation Study

To evaluate the effectiveness of each module, we test networks trained on transformed images only. Three kinds of transformations are tested. The first one resizes images to the standard input size. Concretely, we first resize all images to a fixed size and then randomly crop a patch of the standard input size with a random horizontal flip. During the test phase, we average the results of 20 randomly augmented images. The second one uses padding. However, images in the AVA dataset are up to $800 \times 800$ in resolution. Learning directly on images of this size is time consuming, so we use down-scaled padded images in training while keeping their aspect ratio. During testing, we adopt the same procedure as in our proposed method. The third one is random cropping. Its training and testing strategies are the same as those of the first one, except that the cropping is applied to the original images. We further test the proposed model without theme information.

For simplicity, we choose SRCC on the mean score and standard deviation, EMD with r=1, and KL divergence as the representative metrics. As can be seen in Table III, our model outperforms the methods that only use transformed input images. The three transformation-based methods have almost the same performance, especially on the distribution metrics, while their correlation coefficients differ slightly. Although the resized padding method has the worst performance on mean score prediction, it performs best among the three on standard deviation prediction.

Table III also shows that, after adding the theme aware model, the improvement in standard deviation prediction (SRCC from 0.3424 to 0.6918) is far larger than that in mean score prediction (SRCC from 0.7438 to 0.7611). This suggests that the theme aware model better reflects the subjectivity and diversity of human ratings. Some detailed results are shown in Fig 8. For the first three images (the images in the first row and the left image in the second row), the model tends to predict a lower score without theme information. One possible reason is that the model loses a positive aesthetic factor, namely that the images fit their corresponding themes well. For example, the first image uses brown and gray as the dominant colors and is filled with smog. Such an image may not be popular in a normal context, but is of high quality under the theme 'Harsh Environments'. Most images in this theme use such cool tones. The last image receives a lower score with theme information, perhaps because of its poor lighting and color. These two aesthetic factors are emphasized more heavily in the theme 'Still Life' than in many other themes.

Fig. 8: Examples of the prediction results with and without theme information. Only the mean score is shown. The model predicts a more accurate score with theme information.
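| Method | SRCC (mean) | SRCC (std) | EMD (r=1) | KL |
|---|---|---|---|---|
| Resize | 0.7026 | 0.2709 | 0.046 | 0.108 |
| Resized Pad | 0.6894 | 0.2927 | 0.046 | 0.107 |
| Random Crop | 0.6965 | 0.2780 | 0.046 | 0.108 |
| Pad+ROI | 0.7438 | 0.3424 | 0.043 | 0.098 |
| Pad+ROI+Theme | 0.7611 | 0.6918 | 0.041 | 0.088 |

TABLE III: Ablation Study.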

IV-E Analysis of Image and Feature Map Sizes

We have argued that keeping the original full resolution image size is very important in AQA. Besides the size of the input image, the size of the feature maps is also important. As discussed in Section III-D, we increase the pooling size to $146 \times 146$ to ease the blocking of information flow. This size is two times bigger than the standard pooling size in Inception-v3.

To validate the importance of these choices, we conduct experiments on downscaled images and feature maps. Specifically, in the first experiment, we resize images so that their edges are no longer than 400 pixels while keeping the aspect ratio, so the padded inputs are all of size $400 \times 400$. This means that only part of the images in AVA are kept unchanged. The feature map size in this setting is $73 \times 73$. In the second experiment, we use the original images, and the feature maps keep the $73 \times 73$ size. Results are shown in Table IV. The performance clearly improves as both the feature maps and the images grow. The table also shows that original full resolution images generate better results even when the ROI pooled feature maps keep the same size. This supports our conclusion in Section III-A that it is better to resize feature maps instead of input images.

| Image Size | Feature Size | SRCC (mean) | SRCC (std) | EMD | KL |
|---|---|---|---|---|---|
| 400 | 73 | 0.7189 | 0.4667 | 0.044 | 0.102 |
| 800 | 73 | 0.7531 | 0.6872 | 0.042 | 0.091 |
| 800 | 146 | 0.7611 | 0.6918 | 0.041 | 0.088 |

TABLE IV: Results on Different Image and Feature Maps Sizes.
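The downscaling used in the first experiment (longest edge limited to 400 pixels, aspect ratio kept) can be sketched as follows. The helper names and the rounding rule are assumptions; images that already fit are returned unchanged, which is why only part of AVA is left untouched.

```python
def fit_longest_edge(width, height, max_edge=400):
    """Shrink (width, height) so the longest edge is at most max_edge, keeping the aspect ratio."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height  # small images are kept unchanged
    scale = max_edge / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def padding_needed(width, height, target=400):
    """Extra columns and rows required to pad an image to target x target."""
    return target - width, target - height


for size in [(800, 600), (640, 800), (300, 200)]:
    resized = fit_longest_edge(*size)
    print(size, "->", resized, "pad", padding_needed(*resized))
# (800, 600) -> (400, 300) pad (0, 100)
# (640, 800) -> (320, 400) pad (80, 0)
# (300, 200) -> (300, 200) pad (100, 200)
```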

IV-F Analysis of Data Augmentation

Data augmentation is a common process in deep learning frameworks. The general method is to crop and flip the original images randomly, and such augmentation has proved effective in many works. However, some previous AQA works [cui2018distribution, fang2018image] only apply random horizontal flipping, considering the possible influence of cropping on the original image layout. The data augmentation in our model is the same as in [hosu2019effective], which combines cropping and flipping. The reason is that the cropping in our data augmentation is slight (we only crop along one side), so the damage to the image layout can be ignored in most cases. It can be seen from Fig 9 that the differences between the cropped and original images are slight. People hardly change their aesthetic evaluations even if they notice the differences, so the cropping augmentation is expected to benefit the training process. To validate this, we test networks trained with and without cropping augmentation. As can be seen in Table V, using flipping and cropping augmentation together improves the performance.
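The augmentation pool described in Section IV-B (the original image, its horizontal flip, and four aspect-ratio-preserving corner crops) can be sketched as crop boxes. The crop fraction is not stated in the text, so `keep_percent=90` below is a hypothetical value, as are the function names.

```python
def corner_crop_boxes(width, height, keep_percent=90):
    """Four corner crops (left, top, right, bottom) that keep the aspect ratio.

    keep_percent is a hypothetical crop fraction; the value used in the
    experiments is not given in the text.
    """
    cw = width * keep_percent // 100
    ch = height * keep_percent // 100
    return [
        (0, 0, cw, ch),                            # top-left
        (width - cw, 0, width, ch),                # top-right
        (0, height - ch, cw, height),              # bottom-left
        (width - cw, height - ch, width, height),  # bottom-right
    ]


def augmentation_pool(width, height, keep_percent=90):
    """Names and boxes of the six variants a training sample is drawn from."""
    full = (0, 0, width, height)
    pool = [("original", full), ("hflip", full)]
    pool += [("crop", box) for box in corner_crop_boxes(width, height, keep_percent)]
    return pool


for name, box in augmentation_pool(800, 600):
    print(name, box)
```

For an $800 \times 600$ image, every crop is $720 \times 540$, so the 4:3 aspect ratio of the original is preserved and only a thin border is removed.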

Fig. 9: An example image, (a) Original, and one of its cropped augmentations, (b) Crop. For visual harmony, the cropped image is resized. People hardly change their aesthetic assessments under such small transformations.
| Augmentation | SRCC (mean) | SRCC (std) | EMD | KL |
|---|---|---|---|---|
| Flip | 0.7553 | 0.6707 | 0.041 | 0.090 |
| Flip+Crop | 0.7611 | 0.6918 | 0.041 | 0.088 |

TABLE V: Results on Different Data Augmentations.

IV-G Evaluation on Photo.net

Photo.net is a small dataset. Because it provides no theme information, we only evaluate the model with ROI pooling on padded images. We pad the images to the same size as for the AVA dataset. The number of voters per image in this data set is much smaller than in AVA, so personal subjectivity has a greater impact, which leads to unstable ground-truth distributions. Predicting distributions is harder under such conditions. To the best of our knowledge, there is only one previous work [zhang2019gated] that predicts aesthetic distributions on this dataset. Since it does not report the data augmentation strategy, we give results with and without cropping augmentation for a fair comparison. Table VI shows that our proposed method outperforms the previous work.

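| Method | SRCC (mean) | PLCC (mean) | MSE | EMD |
|---|---|---|---|---|
| Kong et al. [kong2016photo] | 0.5217 | 0.5464 | 0.2715 | 0.070 |
| Ours-CropAug | 0.5644 | 0.5749 | 0.2273 | 0.066 |
| Ours | 0.5826 | 0.5858 | 0.2195 | 0.065 |

TABLE VI: Performance on the Photo.net data set.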

V Conclusion and Future Works

Resizing images can significantly influence their aesthetic features. This paper proposed a simple framework that supports end-to-end batch training on original full resolution photos to predict their aesthetic distributions. We achieve this by applying ROI pooling to the feature maps of padded images. ROI pooling can pool features from an arbitrary location with an arbitrary receptive field. Besides, we argued that evaluating photo aesthetic quality from images alone leads to the theme criterion bias problem. Therefore, we introduce the themes as extra information, which enables the network to extract and use visual features adaptively. Experimental results showed that our method outperforms the state-of-the-art distribution learning and regression models. We also analyzed the importance of image size and cropping augmentation in the experiments.

Although our proposed method achieves the best performance, there is still work to do. We have shown the importance of the feature map size. Considering the diversity of image sizes, it seems sub-optimal to use a uniform feature map size, even after the images have been processed by several layers. We will further explore how to make the network more suitable for different image sizes. On the other hand, our method works poorly on skewed distributions, since the model tends to predict distributions with a mean score close to 5. How to predict such distributions more precisely is worth further study.