Considering user agreement in learning to predict the aesthetic quality

10/13/2021 ∙ by Suiyi Ling, et al.

How to robustly rank the aesthetic quality of given images has been a long-standing ill-posed problem. The challenge stems mainly from the diverse subjective opinions of different observers about varied types of content. There is growing interest in estimating user agreement by considering the standard deviation of the scores, instead of predicting only the mean aesthetic opinion score. Nevertheless, when comparing a pair of contents, few studies consider how confident we are regarding the difference in their aesthetic scores. In this paper, we thus propose (1) a re-adapted multi-task attention network to predict both the mean opinion score and the standard deviation in an end-to-end manner; (2) a brand-new confidence interval ranking loss that encourages the model to focus on image pairs whose difference in aesthetic scores is less certain. With this loss, the model is encouraged to learn the uncertainty of the content that is relevant to the diversity of observers' opinions, i.e., user disagreement. Extensive experiments demonstrate that the proposed multi-task aesthetic model achieves state-of-the-art performance on two different types of aesthetic datasets, i.e., AVA and TMGA.




1 Introduction

Visual aesthetic assessment is important for many real-world use cases, including automatic image composition [su2021camera], image creation, editing [xu2020spatial], and graphical design, and it is attracting growing interest in other practical research topics. As people differ in how they respond to artworks [schlotz2020aesthetic], aesthetic assessment, unlike general quality assessment, is associated more with high-level components of the contents in terms of emotions, composition, and beauty. Thus, it is more subjective than the quality assessment of compressed/distorted contents [talebi2018nima]. State-of-The-Art (SoTA) approaches [pfister2021self, xu2020spatial, hosu2019effective] concentrate on leveraging different deep neural networks with large-scale annotated aesthetic data, and have achieved decent performance on benchmark datasets in recent years. Other than predicting the aesthetic quality score in terms of mean opinion scores ($\mu$) alone, more and more studies, e.g., NIMA [talebi2018nima], develop their models by taking the standard deviation ($\sigma$) of the scores, i.e., how much observers agree with each other, into account.

Figure 1: Example of contents that have significantly different standard deviations $\sigma$, and a demonstration of how Confidence Intervals (CI) could help to rank the aesthetic scores.

When ranking the aesthetic qualities of a pair of contents, it is important to compare not only their mean opinion scores but also the corresponding confidence intervals. For instance, as shown in Fig. 1, because the art style of one image is simpler than that of the other, its aesthetic votes are polarized. As depicted in Fig. 1 (b), some of the observers gave this image very high scores as they preferred the style, while most of the other observers gave very low scores. As a result, the standard deviation of its aesthetic score is much higher than that of the other image. When comparing the aesthetic quality of the two images, we want to make a statistically solid decision based on the opinions of the majority. In other words, as presented in Fig. 1 (e), we consider one image significantly better than the other only if the confidence intervals of the two contents do not overlap [tiotsop2021modeling, ling2020strategy]. In general, the Confidence Interval (CI) of the aesthetic score of a given image $I$ is given by:

$$CI(\mu_I) = \mu_I \pm z \cdot \frac{\sigma_I}{\sqrt{n}} \qquad (1)$$

where $z$ is the critical value corresponding to the chosen confidence level, and $n$ is the number of observers. However, within modern deep learning frameworks, it is impractical to first compute the CIs and then compare them (e.g., limited by the design of the loss function). Therefore, as an alternative, we may prefer to know how confident we are regarding the difference between the aesthetic scores of a given pair of images $I_i$ and $I_j$. If $CI_d$ is narrow enough, we can say confidently that they are significantly different, where $CI_d$ is the confidence interval of the score difference $\mu_i - \mu_j$, defined as:

$$CI_d = (\mu_i - \mu_j) \pm z \sqrt{\frac{\sigma_i^2 + \sigma_j^2}{n}} \qquad (2)$$

It is not hard to see that the standard deviations $\sigma$ of the observers' scores appear in both aforementioned equations, and thus are important to predict along with the mean opinion score $\mu$. Based on the discussion above, the contribution of this study is two-fold: (1) we propose to predict both $\mu$ and $\sigma$ together via a novel multi-task attention network; (2) inspired by equation (2), we develop a new confidence interval loss so that the proposed model is capable of learning to rank the aesthetic quality with respect to the confidence intervals.
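As a rough numerical illustration of the two confidence intervals discussed above, the sketch below computes the CI of a mean score and the CI half-width of a score difference, then checks whether two CIs overlap. The function names, the example scores, the shared observer count `n`, and the choice of z = 1.96 (the usual critical value for a 95% confidence level) are assumptions made for illustration, not values taken from the paper.

```python
import math

Z_95 = 1.96  # assumed critical value (95% confidence level)


def confidence_interval(mu, sigma, n, z=Z_95):
    """Return the (low, high) bounds of the CI of a mean opinion score."""
    half_width = z * sigma / math.sqrt(n)
    return mu - half_width, mu + half_width


def diff_half_width(sigma_i, sigma_j, n, z=Z_95):
    """Half-width of the CI of the difference between two mean scores."""
    return z * math.sqrt((sigma_i ** 2 + sigma_j ** 2) / n)


def cis_overlap(ci_a, ci_b):
    """True when two (low, high) intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


# Hypothetical pair: similar means, very different observer agreement.
ci_calm = confidence_interval(mu=6.0, sigma=1.0, n=100)
ci_polarized = confidence_interval(mu=5.5, sigma=2.5, n=100)
print(cis_overlap(ci_calm, ci_polarized))  # True: the intervals overlap
print(round(diff_half_width(1.0, 2.5, 100), 4))
```

With more observers the intervals shrink, so the same pair of means can become significantly different; the polarized image keeps the wider interval throughout.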

2 Related Work

In the past decade, the performance of aesthetic assessment models has grown at a respectable pace. Li et al. [li2009aesthetic] proposed one of the early efficient aesthetic metrics based on hand-crafted features. By formulating aesthetic quality assessment as a ranking problem, Kao et al. [kao2015visual] developed a rank-based methodology for aesthetic assessment. Akin to [kao2015visual], another ranking network was proposed in [kong2016photo] with attributes and content adaptation. To facilitate heterogeneous input, a double-column deep network architecture was presented in [lu2015rating], which was subsequently improved in [lu2015deep] with a novel multiple-column architecture. Ma et al. developed a salient patch selection approach [ma2017lamp] that achieved significant improvements. Three individual convolutional neural networks (CNNs) that capture different types of information were trained and integrated into one final aesthetic identifier in [kao2016hierarchical]. Global average pooled activations were utilized by Hii et al. in [hii2017multigap] to take image distortions into account. Later, a triplet loss was employed in a deep framework in [schwarz2018will] to further push the performance beyond most methods available at the time. The Neural IMage Assessment (NIMA) [talebi2018nima], developed by Talebi et al., is commonly considered the baseline model. It was the very first metric that evaluates the aesthetic score by predicting the distribution of the ground truth data. To assess UAV video aesthetically, a deep multi-modality model was proposed [kuang2019deep]. As global pooling is conducive to arbitrary high-resolution input, MLSP [hosu2019effective] was proposed, based on Multi-Level Spatially Pooled features (MLSP). Recently, an Adaptive Fractional Dilated Convolution (AFDC) [chen2020adaptive] was proposed to incorporate the information of image aspect ratios.

3 The proposed Model

3.1 Deep Image Features for Arbitrary Input

It is verified in [hosu2019effective] that the wide MLSP feature architecture Pool-3FC achieves SoTA performance. Thus, it is adopted in this study as the baseline model. Similar to MLSP, we extract features from the Inception ResNet-v2 architecture. As the dimension of these features is too high, they are first divided into 16 sub-features (i.e., the dark-grey blocks at the upper-left corner of Fig. 2) to ease the computation of the attention features and reduce the number of parameters used across the architecture. After the division, each sub-feature is fed into a Lighter-weight Multi-Level Spatially Pooled feature block (LMLSP), as depicted in Fig. 2. The LMLSP module is shown in Fig. 3. It is composed of three streams: two convolution streams and a stream of average pooling followed by a convolution. Through this design, the dimension of the multi-level spatially pooled feature is reduced to facilitate the subsequent fully connected layers.
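The division into 16 sub-features can be pictured with a small sketch. It assumes the split is an equal, contiguous partition along the channel axis, which the text does not state explicitly; the function name and the toy channel count are hypothetical.

```python
def split_channels(channels, parts=16):
    """Split a channel list into `parts` equal, contiguous groups."""
    if len(channels) % parts != 0:
        raise ValueError("channel count must be divisible by the number of parts")
    size = len(channels) // parts
    return [channels[k * size:(k + 1) * size] for k in range(parts)]


# Toy example with 32 channels instead of the real feature depth.
sub_features = split_channels(list(range(32)))
print(len(sub_features), len(sub_features[0]))  # 16 2
```

Each of the resulting groups would then be processed by its own LMLSP block.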

Figure 2: Overall network architecture of the proposed model.
Figure 3: Detailed visualization of each LMLSP block.

3.2 Predicting $\mu$ and $\sigma$ at the Same Time with Multi-Task Attention Network

The goal of using multi-task learning is two-fold: (1) to predict both the mean opinion score and the standard deviation at the same time; (2) via joint loss functions, to let the network learn to consider observers' diverse opinions ($\sigma$), the content uncertainty, etc., when predicting the final aesthetic score.

In this study, the multi-task attention network proposed in [liu2019end] is adapted to predict both the mean opinion score $\mu$ and the standard deviation of the subjects' scores $\sigma$. The proposed LMLSP blocks are utilized as task-shared feature blocks, and separate attention modules, which are linked to the shared model, are designed to learn task-specific features for $\mu$ and $\sigma$ correspondingly. The attention modules of $\mu$ (in cyan) and $\sigma$ (in dark blue) are shown in Fig. 2 and 3. As presented in Fig. 3, within each individual LMLSP block $i$, two attention masks are attached to each stream, i.e., one for task $\mu$ and one for task $\sigma$ at each of the two convolution streams and the average pooling stream. The task-specific features $f^{\mu}_{i,s}$ and $f^{\sigma}_{i,s}$ in the $i$-th LMLSP block, at the $s$-th stream, are calculated via element-wise multiplication of the corresponding attention masks $a^{\mu}_{i,s}$ and $a^{\sigma}_{i,s}$ with the shared stream features $f_{i,s}$:

$$f^{\mu}_{i,s} = a^{\mu}_{i,s} \odot f_{i,s} \qquad (3)$$

$$f^{\sigma}_{i,s} = a^{\sigma}_{i,s} \odot f_{i,s} \qquad (4)$$

where $\odot$ indicates the element-wise multiplication operation.

Then, for each task $\mu$/$\sigma$, the task-specific features obtained from each stream are concatenated into one dedicated feature, followed by a Global Average Pooling (GAP). The same is done for the shared features from the three streams. As such, for the $i$-th LMLSP block, three features are output, namely $F^{\mu}_{i}$, $F^{\sigma}_{i}$ and $F_{i}$, which are the feature of task $\mu$, the feature of task $\sigma$ and the feature of the shared network respectively:

$$F^{\mu}_{i} = GAP(f^{\mu}_{i,1} \oplus f^{\mu}_{i,2} \oplus f^{\mu}_{i,3}) \qquad (5)$$

$$F^{\sigma}_{i} = GAP(f^{\sigma}_{i,1} \oplus f^{\sigma}_{i,2} \oplus f^{\sigma}_{i,3}) \qquad (6)$$

$$F_{i} = GAP(f_{i,1} \oplus f_{i,2} \oplus f_{i,3}) \qquad (7)$$

where $GAP(\cdot)$ is the Global Average Pooling and $A \oplus B$ denotes the concatenation of tensors $A$ and $B$.

Afterward, as depicted in the lower part of Fig. 2, for each task, i.e., $\mu$ or $\sigma$, at each LMLSP block, the task-specific feature is first concatenated with the shared feature $F_{i}$, and all the obtained features are further concatenated across all 16 LMLSP blocks to generate the final feature for each task. The feature flow is highlighted in cyan for task $\mu$ and in dark blue for task $\sigma$. Lastly, the features are forwarded to three consecutive Fully-Connected (FC) layers, with the same attention modules for each task after each FC layer, to predict the final $\hat{\mu}$ and $\hat{\sigma}$. It is worth mentioning that the entire network is trained in an end-to-end manner. As such, the attention modules for $\mu$ and $\sigma$ serve as feature selectors that pick up the relevant dimensions for predicting $\mu$ or $\sigma$ respectively, while the shared LMLSP blocks and the shared FC layers learn the general features across the two tasks.
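The masking, concatenation and pooling steps described above can be sketched on toy data as follows. This is only an illustration under simplifying assumptions: features are plain nested lists indexed as `[position][channel]`, the attention masks are given constants rather than learned modules, and all function names are hypothetical.

```python
def hadamard(mask, feature):
    """Element-wise product of two [position][channel] maps of equal shape."""
    return [[m * f for m, f in zip(m_row, f_row)]
            for m_row, f_row in zip(mask, feature)]


def concat_channels(streams):
    """Concatenate stream features along the channel axis."""
    positions = len(streams[0])
    return [[c for stream in streams for c in stream[p]]
            for p in range(positions)]


def global_average_pool(feature):
    """Average every channel over all spatial positions."""
    positions = len(feature)
    channels = len(feature[0])
    return [sum(row[c] for row in feature) / positions for c in range(channels)]


def lmlsp_outputs(shared_streams, mu_masks, sigma_masks):
    """Return the (task mu, task sigma, shared) features of one block."""
    mu_streams = [hadamard(a, f) for a, f in zip(mu_masks, shared_streams)]
    sigma_streams = [hadamard(a, f) for a, f in zip(sigma_masks, shared_streams)]
    return (global_average_pool(concat_channels(mu_streams)),
            global_average_pool(concat_channels(sigma_streams)),
            global_average_pool(concat_channels(shared_streams)))


# Three toy streams, each with 2 spatial positions and 1 channel.
shared = [[[1.0], [3.0]], [[2.0], [4.0]], [[0.0], [2.0]]]
half = [[[0.5], [0.5]]] * 3   # hypothetical mask values for task mu
ones = [[[1.0], [1.0]]] * 3   # hypothetical mask values for task sigma
f_mu, f_sigma, f_shared = lmlsp_outputs(shared, half, ones)
print(f_mu, f_sigma, f_shared)
```

In the full model, the per-block task feature would additionally be concatenated with the shared feature and then across all 16 blocks before the FC layers; that step is omitted here for brevity.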

3.3 Loss function

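Under the multi-task learning setting, the joint loss function of the tasks of predicting the mean $\mu$ and the standard deviation $\sigma$ of subjects' scores is defined as:

$$L = \lambda_1 L_{\mu} + \lambda_2 L_{\sigma} \qquad (8)$$

where $\lambda_1$ and $\lambda_2$ are the parameters that balance the loss of predicting $\mu$, i.e., $L_{\mu}$, and that of predicting $\sigma$, i.e., $L_{\sigma}$. $L_{\sigma}$ is simply defined as the Mean Absolute Error (MAE) between the ground truth $\sigma_i$ of an image $I_i$ and the corresponding prediction $\hat{\sigma}_i$, where $N$ is the number of images:

$$L_{\sigma} = \frac{1}{N} \sum_{i=1}^{N} |\sigma_i - \hat{\sigma}_i| \qquad (9)$$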
As emphasized in Section 1, when comparing the aesthetic scores of two images $I_i$ and $I_j$, we want to know not only whether image $I_i$ is significantly better than $I_j$ in terms of aesthetic quality, but also how confident we are regarding the difference. Specifically, we want to consider this 'certainty' when predicting $\mu$. Thus, $L_{\mu}$ is defined with a novel $L_{CI}$, namely, the Confidence Interval Loss:

$$L_{\mu} = L_{MAE}(\mu, \hat{\mu}) + \lambda L_{CI} \qquad (10)$$

where $\lambda$ is a parameter that balances the two losses. Inspired by the confidence interval of the difference between two aesthetic scores $\mu_i - \mu_j$, which is defined as $CI_d$ (shown in Section 1), we further define $L_{CI}$ as:

$$L_{CI} = \sum_{(i,j) \in P} G(i,j) \cdot \left| (\mu_i - \mu_j) - (\hat{\mu}_i - \hat{\mu}_j) \right| \qquad (11)$$

where $\mu$ and $\hat{\mu}$ are the ground truth mean and the predicted mean respectively, $P$ is the set of all possible pairs, and $G(i,j)$ is defined as below:

$$G(i,j) = \begin{cases} 1, & \text{if } z \sqrt{\frac{\sigma_i^2 + \sigma_j^2}{n}} \geq \tau \\ 0, & \text{otherwise} \end{cases} \qquad (12)$$

where $\tau$ is the margin. Without loss of generality, $z$ is fixed to the critical value of a given confidence level [rumsey2009statistics], and $n$ indicates the number of observers. $G(i,j)$ serves as a gating function. In equation (11), the prediction error of a given image pair is accumulated only if $G(i,j)$ equals 1, i.e., under the condition $z \sqrt{(\sigma_i^2 + \sigma_j^2)/n} \geq \tau$. Note that the term $z \sqrt{(\sigma_i^2 + \sigma_j^2)/n}$ comes from equation (2), i.e., the definition of $CI_d$. The proposed Confidence Interval Loss thus penalizes only the pairs for which this term is higher than the margin $\tau$. In other words, it focuses on pairs that have higher uncertainty (a larger confidence interval of the aesthetic score difference). Furthermore, the loss for a pair that triggers the gating function is $|(\mu_i - \mu_j) - (\hat{\mu}_i - \hat{\mu}_j)|$, which encourages the predicted score difference to be as close as possible to the ground truth one.

The intuition behind this selection of pairs is to concentrate on pairs that have larger uncertainties, which further enables the model to learn uncertainty from the ambiguous contents and improve its performance on such pairs.
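To make the gating idea concrete, here is a minimal sketch of how such a loss could be computed on a toy batch. Several details are assumptions rather than facts from the paper: the gate uses the ground truth standard deviations and a `>=` comparison, the pairwise errors are summed without normalization, the point-wise term for the mean is an MAE, z is 1.96, and the margin and observer count are made-up values. Only the weights 0.6, 0.4 and 0.5 come from the experimental setup reported in Section 4.1.

```python
import math
from itertools import combinations


def mae(truth, pred):
    """Mean absolute error between two equally long score lists."""
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)


def gate(sigma_i, sigma_j, n, margin, z=1.96):
    """1 when the CI half-width of the score difference reaches the margin."""
    return 1 if z * math.sqrt((sigma_i ** 2 + sigma_j ** 2) / n) >= margin else 0


def ci_loss(mu, mu_hat, sigma, n, margin, z=1.96):
    """Accumulate the pairwise difference error over gated (uncertain) pairs."""
    total = 0.0
    for i, j in combinations(range(len(mu)), 2):
        if gate(sigma[i], sigma[j], n, margin, z):
            total += abs((mu[i] - mu[j]) - (mu_hat[i] - mu_hat[j]))
    return total


def joint_loss(mu, mu_hat, sigma, sigma_hat, n, margin,
               lam=0.5, w_mu=0.6, w_sigma=0.4):
    """Weighted sum of the mu loss (MAE + lam * CI loss) and the sigma MAE."""
    loss_mu = mae(mu, mu_hat) + lam * ci_loss(mu, mu_hat, sigma, n, margin)
    loss_sigma = mae(sigma, sigma_hat)
    return w_mu * loss_mu + w_sigma * loss_sigma


# Toy batch: only pairs involving the high-sigma third image pass the gate.
mu, mu_hat = [5.0, 6.0, 7.0], [5.0, 6.5, 6.0]
sigma, sigma_hat = [1.0, 1.0, 3.0], [1.0, 1.5, 2.5]
print(ci_loss(mu, mu_hat, sigma, n=100, margin=0.5))
print(joint_loss(mu, mu_hat, sigma, sigma_hat, n=100, margin=0.5))
```

In this toy batch the pair of two low-variance images is skipped, so the ranking term is driven entirely by the pairs whose score difference is uncertain.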

4 Experiment

4.1 Experimental setup

The performance of the proposed model was assessed on two different types of aesthetic quality datasets that were developed for different use cases/scenarios:

(1) The Tencent Mobile Gaming Aesthetic (TMGA) dataset [ling2020subjective], which was developed for improving gaming aesthetics. In this dataset, there are in total 1091 images collected from 100 mobile games, where each image was labeled along four different dimensions: 'Fineness', 'Colorfulness', 'Color harmony', and 'Overall aesthetic quality'. The entire dataset is divided into 80%, 10%, and 10% for training, validation, and testing respectively.

(2) The Aesthetic Visual Analysis (AVA) dataset [murray2012ava], which was developed for general aesthetic purposes. It contains about 250 thousand images that were collected from a photography community. Each individual image was rated by an average of 210 observers. The same random split strategy as in [chen2020adaptive, hosu2019effective] was adopted to obtain 235,528 images for training/validation (95%/5%) and 20,000 for testing.
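An 80%/10%/10% split such as the one used for TMGA could be produced along the following lines. The seed, the rounding rule (truncation, with the remainder going to the test set) and the function name are assumptions; the paper does not describe how the split was drawn.

```python
import random


def split_dataset(items, train=0.8, val=0.1, seed=0):
    """Shuffle items and split them into train/validation/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


# 1091 is the number of TMGA images; integer ids stand in for the images.
train_ids, val_ids, test_ids = split_dataset(range(1091))
print(len(train_ids), len(val_ids), len(test_ids))
```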

Since most of the metrics were evaluated on the AVA dataset in terms of binary classification performance, ACCuracy (ACC) was computed by classifying images into two classes with a cut-off score equal to 5 [talebi2018nima, hosu2019effective] when reporting the performances on the AVA dataset. However, as emphasized in [hosu2019effective], this predominant two-class accuracy suffers from several pitfalls. For instance, due to the unbalanced distribution of images in the training and testing sets (different numbers of images for different aesthetic quality score values), 'accuracy' does not fully reveal the performance of the metrics under test regarding their capability to rank the aesthetic scores of images. Thus, as done in [hosu2019effective, ling2020subjective], we computed the Pearson correlation coefficient (PCC) and the Spearman's rank order correlation coefficient (SCC) between the ground truth $\mu$ and the predicted $\hat{\mu}$ to evaluate the performance of different aesthetic quality assessment approaches. Similar to [talebi2018nima], as the range of $\sigma$ differs across the dataset, the performances of models in terms of predicting $\sigma$ are reported in PCC and SCC. The proposed model is mainly compared to the SoTA aesthetic models AFDC [chen2020adaptive] and MLSP [hosu2019effective], and to one of the most popular baseline models, NIMA [talebi2018nima].
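For reference, the three evaluation measures can be sketched in plain Python as below. This is an illustrative implementation, not the evaluation code used in the paper: ties in the Spearman ranks are resolved by average ranks, and an image is assigned to the positive class when its score is strictly greater than the cut-off, both of which are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient (PCC) between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    k = 0
    while k < len(order):
        end = k
        while end + 1 < len(order) and values[order[end + 1]] == values[order[k]]:
            end += 1
        average = (k + end) / 2 + 1
        for t in range(k, end + 1):
            result[order[t]] = average
        k = end + 1
    return result


def spearman(x, y):
    """Spearman rank order correlation (SCC): PCC computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def binary_accuracy(truth, pred, cutoff=5.0):
    """Share of images placed on the same side of the cut-off score."""
    hits = sum((t > cutoff) == (p > cutoff) for t, p in zip(truth, pred))
    return hits / len(truth)


# Hypothetical ground truth and predicted mean scores.
truth = [4.0, 5.5, 6.0, 7.0]
pred = [4.5, 5.2, 6.5, 6.4]
print(pearson(truth, pred), spearman(truth, pred), binary_accuracy(truth, pred))
```

The toy data shows why accuracy alone can be uninformative: every image lands on the correct side of the cut-off, yet the last two predictions are ranked in the wrong order, which only the rank correlation reflects.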

During training, we applied the ADAM optimizer with an initial learning rate that was divided by 10 every 20 epochs. As this is a preliminary study on learning both $\mu$ and $\sigma$ with a multi-task attention network by taking the confidence interval into account, we still want to focus on predicting $\mu$. Thus, $\lambda_1$ and $\lambda_2$ in (8) were set to 0.6 and 0.4 respectively, with a slightly higher weight for task $\mu$. $\lambda$ in (10) was set to 0.5.
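The learning-rate schedule described above (division by 10 every 20 epochs) corresponds to a simple step decay. The sketch below assumes the first reduction takes effect at epoch 20 (counting from 0) and uses a placeholder initial rate, since the actual value is not available here.

```python
def step_decay_lr(initial_lr, epoch, drop_every=20, factor=0.1):
    """Learning rate after one division by 10 per completed 20-epoch period."""
    return initial_lr * factor ** (epoch // drop_every)


# 1e-3 is a placeholder, not the value used in the paper.
for epoch in (0, 19, 20, 45):
    print(epoch, step_decay_lr(1e-3, epoch))
```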

4.2 Experimental Results

Results on AVA dataset: The PCC, SCC, and ACC values of the considered models in predicting $\mu$ and $\sigma$ on the AVA dataset are shown in Table 1. Regarding task $\mu$, the proposed model accomplishes state-of-the-art performance by obtaining the best PCC and SCC values and the second-best ACC value. Regarding task $\sigma$, the proposed model achieves the best performance (as the code of AFDC is not publicly available, its performance in predicting $\sigma$ on AVA and its performances on TMGA are not reported). It can be noticed that the performances of objective models in predicting $\sigma$ are lower than those in predicting $\mu$. One potential reason is that predicting $\sigma$ is more challenging than predicting $\mu$, as it is relevant not only to the contents but also to the subjects who participated in the test.

Results on TMGA dataset: For a fair comparison, NIMA and MLSP were first finetuned on the training set of TMGA dataset with the optimized hyper-parameters. As the performance of MLSP is higher than NIMA by a large margin, we only report the results of MLSP. The results are presented in Table 2. Similarly, the performances of our model are superior to the ones of the compared models.

| Model | PCC ($\mu$) | SCC ($\mu$) | ACC | PCC ($\sigma$) | SCC ($\sigma$) |
|---|---|---|---|---|---|
| NIMA [talebi2018nima] | 0.636 | 0.612 | 81.51% | 0.233 | 0.218 |
| MLSP [hosu2019effective] | 0.756 | 0.757 | 81.72% | 0.568 | 0.549 |
| AFDC [chen2020adaptive] | 0.671 | 0.648 | 83.24% | - | - |
| Proposed | 0.785 | 0.779 | 82.97% | 0.624 | 0.613 |

Table 1: Performances of relevant models on the AVA dataset.

| Metric | Model | Fineness | Colorful | Harmony | Overall |
|---|---|---|---|---|---|
| PCC ($\mu$) | MLSP [hosu2019effective] | 0.904 | 0.900 | 0.888 | 0.872 |
| PCC ($\mu$) | Proposed | 0.923 | 0.936 | 0.918 | 0.910 |
| SCC ($\mu$) | MLSP [hosu2019effective] | 0.904 | 0.904 | 0.826 | 0.8652 |
| SCC ($\mu$) | Proposed | 0.917 | 0.923 | 0.889 | 0.897 |
| PCC ($\sigma$) | MLSP [hosu2019effective] | 0.607 | 0.413 | 0.289 | 0.5151 |
| PCC ($\sigma$) | Proposed | 0.686 | 0.557 | 0.372 | 0.559 |
| SCC ($\sigma$) | MLSP [hosu2019effective] | 0.555 | 0.364 | 0.301 | 0.475 |
| SCC ($\sigma$) | Proposed | 0.679 | 0.486 | 0.370 | 0.478 |

Table 2: Performances of relevant models on the TMGA dataset.

4.3 Ablation Study

To validate the effectiveness of each component within the proposed model, ablation studies (with 2 ablative models) were conducted. The results are presented in Table 3: (1) comparing rows 1 and 2 demonstrates that employing the proposed $L_{CI}$ improves the performance of the baseline single-task MLSP model in predicting $\mu$; (2) comparing rows 1 and 3 shows that the performance of the aesthetic model is boosted by learning both $\mu$ and $\sigma$ together via the multi-task attention network.

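| Model | PCC ($\mu$) | SCC ($\mu$) | ACC |
|---|---|---|---|
| Single-Task (MLSP) [hosu2019effective] | 0.756 | 0.757 | 81.72% |
| Single-Task with $L_{CI}$ | 0.762 | 0.765 | 81.89% |
| Multi-Task & without $L_{CI}$ | 0.775 | 0.772 | 82.24% |
| Multi-Task & with $L_{CI}$ (Proposed) | 0.785 | 0.779 | 82.97% |

Table 3: Ablation studies on the AVA dataset.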

4.4 Conclusion

In this study, a novel aesthetic model is proposed to learn the aesthetic ranking by exploiting confidence intervals, based on (1) learning both the mean and the standard deviation of aesthetic scores through a multi-task attention network; (2) a new confidence interval loss that helps the model concentrate on the less confident pairs and learn the high-level ambiguous characteristics of the contents. Experiments show that our model achieves superior performance compared to the state-of-the-art models in predicting both the mean and the standard deviation of aesthetic scores.