Revisiting Image Aesthetic Assessment via Self-Supervised Feature Learning

11/26/2019 ∙ by Kekai Sheng, et al. ∙ Xiamen University ∙ Tencent

Visual aesthetic assessment has been an active research field for decades. Although the latest methods have achieved promising performance on benchmark datasets, they typically rely on a large number of manual annotations, including both aesthetic labels and related image attributes. In this paper, we revisit the problem of image aesthetic assessment from the perspective of self-supervised feature learning. Our motivation is that a suitable feature representation for image aesthetic assessment should be able to distinguish different expert-designed image manipulations, which are closely related to negative aesthetic effects. To this end, we design two novel pretext tasks to identify the types and parameters of editing operations applied to synthetic instances. The features from our pretext tasks are then fed to a one-layer linear classifier to evaluate their performance on binary aesthetic classification. We conduct extensive quantitative experiments on three benchmark datasets and demonstrate that our approach can faithfully extract aesthetics-aware features and outperform alternative pretext schemes. Moreover, we achieve results comparable to state-of-the-art supervised methods that use 10 million labels from ImageNet.





With the explosive growth of online visual data, the demand for image aesthetic assessment [1] in many multimedia applications has increased dramatically. Typically, this assessment process seeks to evaluate the aesthetic level of each image according to certain rules commonly agreed upon in human visual perception, ranging from fine-grained local textures and lighting details to high-level semantic layout and composition. These highly subjective and ambiguous perceptual metrics pose formidable challenges to designing intelligent agents that automatically and quantitatively measure image aesthetics, especially with conventional hand-crafted features.

Following the recent advances in deep convolutional neural networks, researchers have explored various data-driven, learning-based approaches for aesthetic assessment and have reported impressive results in the past few years [24, 25, 33, 34], benefiting from several image aesthetic benchmarks [27, 16]. However, the inherent shortcomings of these datasets still deter us from continuously scaling up the volume of training data and improving the performance: (1) the sizes of existing image aesthetic datasets are far from enough to feed the latest neural networks with very deep architectures, since the labor involved in manual labeling is prohibitively expensive and developing aesthetics-invariant data augmentation remains an open problem; (2) meanwhile, subjective human annotations are often strongly biased towards personal aesthetic preference, and thus require an excessive amount of data to neutralize such inconsistency for reliable training.

Figure 1: Illustration of our key idea. Some image editing operations typically have negative aesthetic impact and the degree of the impact is related to manipulation parameters.

Due to the poor scalability and consistency of available aesthetics datasets, it is becoming more and more attractive to break these bottlenecks of purely supervised training with unsupervised [9, 35], or more specifically, self-supervised features [2, 5, 31, 41]. The main idea of self-supervision is to design pretext tasks that are naturally available, contextually relevant, and capable of providing proxy loss as the training signal to guide the learning of features which are consistent with the real target. As a result, we can significantly broaden the scope of training data without introducing any further annotation cost.

In the context of image aesthetic assessment, due to the absence of an in-depth understanding of the human perception system, designing a proper self-supervised pretext task can be challenging. However, although it is difficult to quantitatively measure the aesthetic score, predicting the relative influence of certain controlled image editing operations can be much easier. Therefore, we propose two key observations to help design the task (see Fig. 1): (1) some parametric image manipulation operations, such as blurring, pixelation, and patch shuffling, are very likely to have a consistent negative impact on image aesthetics; (2) the degree of impact due to such degradation increases monotonically with the values of the corresponding operation parameters.

In this work, motivated by the strong correlation between image aesthetic levels and some degradation operations, we propose, to the best of our knowledge, the very first self-supervised learning scheme for image aesthetic assessment. The core idea behind our approach is to extract aesthetics-aware features with two novel self-supervision pretext tasks on distinguishing the type and strength of image degradation operations. To improve the training efficiency, we also introduce an entropy-based weighting strategy to filter out image patches with less useful training signals. The experimental results demonstrate that our self-supervised aesthetics-aware feature learning is able to achieve promising performance on available aesthetics benchmarks such as AVA [27] and AADB [16], and outperforms a wide range of commonly adopted self-supervision schemes, including context prediction [5], jigsaw puzzle [28], colorization [18, 41], and rotation recognition [8].

In summary, our main contributions include:

  • We propose a simple yet effective self-supervised learning scheme to extract useful features for image aesthetic assessment without using manual annotations.

  • We present an entropy-based weighting strategy to help strengthen meaningful training signals in manipulated image patches.

  • On three image aesthetic assessment benchmarks, our approach outperforms other self-supervised counterparts and even works better than models pre-trained on ImageNet or Places datasets using a large number of labels.

Related Work

Image aesthetic assessment has been extensively studied in the past two decades. Conventional solutions use hand-crafted feature extractors designed with domain expertise to model the aesthetic aspects of images [1, 15]. More recently, the power of deep neural networks makes it possible to learn feature representations that can surpass hand-crafted ones. Typical approaches include distribution-based objective functions [34, 13], multi-level spatial pooling [10], attention-based learning schemes [33], and attribute/semantics-aware models [24, 30]. The advance of learning-based image aesthetic assessment has also inspired a number of practical solutions for various usage scenarios such as clothing [38] and food [32].

Self-supervised feature learning can be considered as one type of unsupervised learning algorithm [2], which aims to learn useful representations without manual annotations. Many effective pretext tasks have been proposed in this direction, such as context prediction [5], colorization [41], split-brain [42], and RotNet [8]. The representations learned from self-supervised schemes turn out to be useful for many downstream tasks, e.g., tracking [36], re-identification [7], and image generation [21].

Aesthetics-aware image manipulations can be generally divided into two categories: (1) Rule-based approaches leverage empirical knowledge and domain expertise to enforce well-established photographic heuristics such as gamma correction and histogram equalization. (2) Data-driven approaches focus on learning powerful feature representations from examples to improve aesthetics in certain aspects, such as aesthetics-aware image enhancement [4, 11], image blending [12], and colorization [41].

Our Method

Key Observations

Our self-supervised feature learning approach for image aesthetic assessment is based on two key observations. First, experiments in previous work on image aesthetic assessment benchmarks indicate that inappropriate data augmentation (e.g., brightness/contrast/saturation adjustment, PCA jittering) during training results in performance degradation at test time [27, 16]. Second, a convolutional neural network (CNN) model trained on manually annotated aesthetic labels can inherently acquire the ability to distinguish fine-grained aesthetic differences caused by various image manipulation methods. In Fig. 2, for instance, some image editing operations (such as Gaussian blur, downsampling, color quantization, and rotation) can increase the prediction confidence of aesthetically negative images, or even turn an aesthetically positive example into a negative one (as shown in the last row of Fig. 2).

Consequently, we argue that these fine-grained perceptual quality problems, without manual annotations, are closely related to the image aesthetic assessment task, for which meaningful training instances can be constructed via proper expert-designed image manipulations.

Figure 2: Some learning-free parametric image editing operations can introduce controllable aesthetic degradations on input images. The aesthetic label (P for positive and N for negative), as well as the corresponding assessment confidence predicted by a fully supervised model are listed beneath each image.

Selected Image Manipulations

According to empirical knowledge, different image editing operations have diverse effects on the manipulated output [23, 40]. Furthermore, some operations require complex parameter settings (e.g., inpainting) and it is often difficult to automatically compare the output to the input in terms of aesthetic level (e.g., grayscale conversion). In our case, however, we need to select image manipulations with easily controllable parameters and predictable perceptual quality.

Specifically, as listed in Table 1, we adopt a variety of image manipulation operations with different parameters to construct artificial training instances, including (1) downsampling by a scaling factor and upsampling back to the original resolution via bilinear interpolation; (2) JPEG compression with a percentage number to control the quality level; (3) Gaussian noise controlled by the variance; (4) Gaussian blur controlled by the standard deviation; (5) color quantization into a small number of levels; (6) brightness change based on a scaling factor; (7) random patch shuffle [14]; (8) pixelation based on a patch size; (9) rotation by a certain degree; and (10) linear blending (mixup [39]) based on a constant alpha value.

Attribute       Operation           Parameters
Much noise      JPEG compression
                Gaussian noise
Camera shake    Rotation
Soft / Grainy   Downsampling
Poor lighting   Exposure
Fuzzy           Gaussian blur
Distracting     Patch shuffle
Table 1: Image editing operations and the parameters adopted to construct meaningful pretext tasks.
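As a rough illustration of how parametric degradations of this kind can be generated, the following minimal sketch implements three of them (color quantization, pixelation, and patch shuffle) on a grayscale patch stored as nested lists with values in [0, 1]. The function names, the data layout, and the exact formulas are assumptions made for illustration; they are not taken from the paper's implementation.

```python
import random


def quantize(patch, levels):
    """Color quantization: snap each value in [0, 1] to one of `levels` evenly spaced levels."""
    step = levels - 1
    return [[round(v * step) / step for v in row] for row in patch]


def pixelate(patch, block):
    """Pixelation: replace every block x block tile by its mean value."""
    h, w = len(patch), len(patch[0])
    out = [row[:] for row in patch]
    for top in range(0, h, block):
        for left in range(0, w, block):
            cells = [(r, c)
                     for r in range(top, min(top + block, h))
                     for c in range(left, min(left + block, w))]
            mean = sum(patch[r][c] for r, c in cells) / len(cells)
            for r, c in cells:
                out[r][c] = mean
    return out


def patch_shuffle(patch, grid, rng):
    """Patch shuffle: cut the patch into grid x grid tiles and permute them."""
    h, w = len(patch), len(patch[0])
    th, tw = h // grid, w // grid  # assumes h and w are divisible by grid
    tiles = [[row[c * tw:(c + 1) * tw] for row in patch[r * th:(r + 1) * th]]
             for r in range(grid) for c in range(grid)]
    rng.shuffle(tiles)
    out = []
    for r in range(grid):
        for i in range(th):
            out.append([v for c in range(grid) for v in tiles[r * grid + c][i]])
    return out
```

In each case a single scalar (`levels`, `block`, `grid`) controls the strength of the degradation, which is the property the pretext tasks rely on.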

Aesthetics-Aware Pretext Tasks

In this section, we propose our aesthetics-aware pretext task in a self-supervised learning scheme from two aspects. On the one hand, the ability to categorize different types of image manipulations can be beneficial to the learning of aesthetics-aware representations. On the other hand, for the same type of editing operation, different control parameters render different aesthetic levels, and the tendency of the quality shift is predictable [26]. Take JPEG compression, for instance: decreasing the output image quality parameter always decreases the aesthetic level. Therefore, by constructing images with manipulations resulting in predictable degradation behaviors, we can extract meaningful training signals for the fine-grained aesthetics-aware pretext tasks.

Degradation identification loss.

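We denote an image patch as $x$, and the manipulation parameter as $\theta_t$ for $t \in \mathcal{T}$. The loss term of our first pretext task, i.e., the degradation identification loss $\mathcal{L}_{deg}$, encourages the model to recognize which operation has been applied to $x$:

$$\mathcal{L}_{deg}(x, t) = -\log p_t\big(O(x, \theta_t); W\big),$$

where $O(x, \theta_t)$ is the transformed output patch given the image patch $x$ by the parameters $\theta_t$, and $p_t(\cdot\,; W)$ is the probability predicted by our model $W$ that $x$ has undergone a degradation operation of type $t$.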
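The description above corresponds to a standard classification objective over operation types. A minimal sketch, assuming the model outputs one logit per operation class (including the None class) and that the loss is the negative log-probability of the operation that was actually applied; the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def degradation_identification_loss(logits, applied_op):
    """Negative log-probability assigned to the index of the applied operation."""
    probs = softmax(logits)
    return -math.log(probs[applied_op])
```

With uniform logits over four classes the loss equals log 4, and it decreases as the logit of the applied operation grows relative to the others.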

For a comprehensive coverage of image attributes (e.g., resolution, color, spatial structures), we leverage a variety of typical manipulation operations as listed in Table 1. To take better advantage of synthesized instances, we adopt parameters that always induce aesthetic degradation with observably different patterns. Apart from these distortions, a None operation (i.e., no manipulation) is also taken into consideration. That is, we require the model to categorize different classes of editing operations.

Figure 3: The diagram of our self-supervised approach. We propose two aesthetic-aware pretext tasks, and apply an entropy-based weighting scheme to enhance the training efficiency. In this way, we learn useful aesthetic-aware features in a label-free manner.

Triplet loss.

Training with the degradation identification loss $\mathcal{L}_{deg}$ alone is not enough for the task of aesthetic assessment, since some editing operations may produce low-level artifacts that are easy to detect and will lead the network to learn trivial features with no practical semantics [8]. Our solution to this issue is to encode the information via triplets $(x, O(x, \theta_{t_1}), O(x, \theta_{t_2}))$, where $\theta_{t_1}$ and $\theta_{t_2}$ are two different parameters of a certain operation $t$ in Tab. 1 (except for rotation and exposure). The two parameters are specified to create aesthetic distortions with a predictable relationship, i.e., the image patch edited using $\theta_{t_1}$ is aesthetically more positive than the one edited using $\theta_{t_2}$. Therefore, in an ideal aesthetic-aware representation, the distance between the original image patch $x$ and $O(x, \theta_{t_1})$ should be smaller than the distance between $x$ and $O(x, \theta_{t_2})$. In this way, we propose the second task, the triplet loss $\mathcal{L}_{trp}$, which penalizes triplets that violate this ordering. The distances are computed on $h(\cdot\,; W)$, the normalized feature extracted from the model $W$ given a patch.
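The ordering constraint described above can be sketched as a hinge on feature distances. The example below is an illustration under stated assumptions: it uses Euclidean distance between L2-normalized feature vectors and an optional `margin` parameter, neither of which is confirmed by the text, and the function names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a feature vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]


def distance(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(f_orig, f_mild, f_strong, margin=0.0):
    """Zero when the mildly degraded patch is closer to the original than the
    strongly degraded one by at least `margin` (an assumed parameter)."""
    a, p, n = (l2_normalize(f) for f in (f_orig, f_mild, f_strong))
    return max(0.0, distance(a, p) - distance(a, n) + margin)
```

Here `f_orig`, `f_mild`, and `f_strong` stand for the features of the original patch and of its two edited versions, with the milder edit expected to stay closer to the original.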

It should be noted that we do not apply the triplet loss term from the beginning of the training process, since the representation learned from the degradation identification loss $\mathcal{L}_{deg}$ in the early stage can oscillate drastically and thus may not be suitable for comparisons of fine-grained features. To cope with these training dynamics, we activate the triplet loss $\mathcal{L}_{trp}$ only after the curve of $\mathcal{L}_{deg}$ has plateaued.
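The text does not specify how a plateau is detected. One simple possibility, shown purely as an assumption, is to compare the mean loss over the most recent epochs with the mean over the preceding window; the function name, `window`, and `tol` are all hypothetical.

```python
def has_plateaued(loss_history, window=3, tol=0.01):
    """Return True when the mean loss of the last `window` epochs improved by
    less than a fraction `tol` of the mean over the preceding `window` epochs."""
    if len(loss_history) < 2 * window:
        return False  # not enough history to compare two windows
    previous = sum(loss_history[-2 * window:-window]) / window
    recent = sum(loss_history[-window:]) / window
    return (previous - recent) < tol * abs(previous)
```

A training loop could call this once per epoch on the recorded loss values and switch on the second loss term the first time it returns True.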

Total loss function.

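By putting the degradation identification loss term and the triplet loss term together, we formulate a new self-supervised pretext task as below:

$$\mathcal{L} = \sum_{x \in \mathcal{X}} \sum_{t \in \mathcal{T}} \big[ \mathcal{L}_{deg}(x, t) + \lambda\, \mathcal{L}_{trp}(x, t) \big],$$

where $\mathcal{X}$ is a mini-batch made up of $N$ patches, $\mathcal{T}$ is the set of all the image manipulations, and $\lambda$ is a scaling factor to balance the two terms. The diagram of our proposed self-supervised scheme is shown in Fig. 3.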
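To make the aggregation concrete, the following sketch sums the two per-patch, per-operation loss values over a mini-batch and switches the triplet term on or off. It assumes the loss values have already been computed and are passed in as nested lists; the function name and the `triplet_active` flag are hypothetical.

```python
def total_loss(deg_losses, trp_losses, lam, triplet_active=True):
    """Sum over patches and operations of deg + lam * trp.

    deg_losses[i][t] and trp_losses[i][t] hold the two loss values for patch i
    and operation t. Operations without a triplet term can simply carry 0.0.
    """
    total = 0.0
    for deg_row, trp_row in zip(deg_losses, trp_losses):
        for l_deg, l_trp in zip(deg_row, trp_row):
            total += l_deg
            if triplet_active:
                total += lam * l_trp
    return total
```

Setting `triplet_active=False` reproduces the early phase of training, in which only the degradation identification term is optimized.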

Entropy-based weighting.

To improve the training efficiency and reduce noisy signals from misleading instances, we present a simple yet effective entropy-based weighting strategy. Specifically, after warming up for several epochs, we apply an entropy-based weight $w_i$ to each patch $x_i$, where a constant $\epsilon$ controls the lower bound of $w_i$ and is kept fixed in our experiments. The rationale behind this strategy is that instances with higher entropy values tend to have more uncertain visual cues for image aesthetics, and thus should be assigned lower weights in the optimization process.
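The exact weighting formula is not recoverable from the text, so the sketch below shows only one form that is consistent with the description: the weight decreases with entropy and never falls below a lower bound. It assumes that the entropy is taken over the model's predicted distribution across operation classes and is normalized by its maximum value; the function names and the default `eps` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_weight(probs, eps=0.1):
    """Weight in [eps, 1]: close to 1 for confident predictions, eps for uniform ones."""
    normalized = entropy(probs) / math.log(len(probs))  # maximum entropy is log(K)
    return max(eps, 1.0 - normalized)
```

Each patch's loss would then be multiplied by its weight, so that ambiguous patches contribute less to the gradient.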


Our proposed pretext task is similar to [6, 19], in which CNNs are trained to discriminate instances from different types of data augmentation methods. Different from these previous methods, we select a set of image manipulation operations and design our loss function carefully so that the learned representation can be aware of significant patterns of visual aesthetics. Besides, instead of building on selected high-quality photos, we conduct pretext task optimization directly on images from ImageNet [3]. This strategy makes our learning scheme more flexible to use.

One might argue that we ignore some global aesthetic factors, e.g., the rule of thirds. The reasons are twofold. First, global image attributes are more complex to manipulate than the local ones considered in our approach. Second, global factors generally involve semantics, which are not available in the context of self-supervised feature learning. We do not intend to mimic the statistics of real-world visual aesthetics, but to propose pretext tasks which are suitable for visual aesthetic assessment.


Baseline Methods

We compare the performance of our method with five typical self-supervised visual pretext tasks, as listed below:


  • The context predictor [5] predicts the relative positions between two square patches cropped from one input image.

  • The image colorization task [41] requires the model to estimate the color channels from a gray-scale image.

  • The cross-channel predictor [42] estimates one subset of the color channels from another with constrained spatial regions.

  • The primitive counting task [29] requires the model to generate visual primitives with their number invariant to transformations including both scaling and tiling.

  • A rotation predictor [8] is trained to recognize the 2D rotation applied to input images, i.e., the operation is image rotation and the angle is one of 0°, 90°, 180°, and 270°.

In our experiments, we use the pre-trained models released by the authors for reliable and fair comparisons.

Additionally, we also test three typical pre-training strategies, including the method pre-trained with -way object labels from ImageNet [3] or -way scene labels from Places [43], and the Gaussian random initialization, i.e., without any pretext task.

Training Pipeline

Figure 4: The schematic of a self-supervised learning framework, which we apply in our experiments to evaluate various pretext tasks for image aesthetic assessment.

Our entire training pipeline (Fig. 4) contains two parts: a self-supervised pre-training stage to learn visual features with unlabeled images and a task adaptation stage to evaluate how the learned features perform in the task of image aesthetic assessment.

In the pre-training stage (Fig. 4 left), the first few layers share the same structure with AlexNet [17] for fair comparisons. In the task adaptation stage (Fig. 4 right), following the same configurations in [42], we freeze each model learned in the pre-training and leverage similar linear classifiers to evaluate visual features from each convolutional layer for the task of binary aesthetic classification. The channel number of each convolutional layer is shown in Fig. 4, and the dimensions of the corresponding fully connected layers are , , , , and , respectively.
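The task adaptation stage can be pictured as fitting a linear classifier on top of frozen features. The following sketch is an illustration only: it trains a logistic-regression probe with plain gradient descent on precomputed feature vectors, with all names and hyperparameters chosen hypothetically rather than taken from the paper.

```python
import math


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit weights w and bias b of a logistic-regression classifier on frozen
    features (list of vectors) with binary labels (0 or 1)."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            for j in range(dim):
                grad_w[j] += (p - y) * x[j]
            grad_b += p - y
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Binary decision of the linear probe for one feature vector."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
```

Because only `w` and `b` are updated, the accuracy of such a probe reflects how linearly separable the frozen features already are for the aesthetic labels.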

Implementation details.

In the pre-training stage, we first resize the shorter edge of each input image to . Then we randomly crop one patch of resolution from the resized image. Next, we randomly choose three manipulation operations in Tab. 1 to edit each patch. We apply SGD optimization using a batch size of , with a Nesterov momentum of and a weight decay of . We begin with a learning rate of and drop it by a factor of after every epochs. To avoid training oscillation, we activate the triplet loss with a scaling factor of after the first epochs. The following adaptation stage shares the same settings except that the learning rate starts from .
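The schedule described above (a stepwise learning-rate drop plus delayed activation of the triplet term) can be sketched as two small helpers. All arguments are placeholders to be supplied by the reader; no specific hyperparameter values from the paper are implied, and the function names are hypothetical.

```python
def step_lr(base_lr, epoch, drop_factor, drop_every):
    """Learning rate at `epoch` when it is divided by `drop_factor` every `drop_every` epochs."""
    return base_lr / (drop_factor ** (epoch // drop_every))


def triplet_weight(epoch, lam, warmup_epochs):
    """Weight of the triplet term: zero during warm-up, `lam` afterwards."""
    return lam if epoch >= warmup_epochs else 0.0
```

For example, with a hypothetical `drop_factor` of 10 and `drop_every` of 10, the rate at epoch 25 is one hundredth of `base_lr`.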

Benchmarks for Aesthetic Assessment

Aesthetic Visual Analysis (AVA).

The AVA dataset [27] contains approximately images. Each image has about crowdsourced aesthetic ratings in the range of to . Following the common practice in [24, 10], we consider images whose average aesthetic scores are no less than as positive instances and adopt the same training/test partition, i.e., images for training and for testing.
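The binarization rule above is simple but the boundary case matters: an image whose average score equals the threshold counts as positive. A minimal sketch, with the threshold left as a parameter and the function names hypothetical:

```python
def mean_score(ratings):
    """Average of the individual aesthetic ratings of one image."""
    return sum(ratings) / len(ratings)


def binarize(ratings, threshold):
    """Return 1 (positive) if the average rating is no less than `threshold`, else 0."""
    return 1 if mean_score(ratings) >= threshold else 0
```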

Aesthetics with Attributes Database (AADB).

The AADB dataset [16] contains images with aesthetic ratings and eleven additional attributes. We follow [16] to split the dataset into three partitions, i.e., , , and images for training, validation, and testing, respectively. Without loss of generality, we binarize the aesthetic ratings into two classes using a threshold of , similar to AVA.

Chinese University of Hong Kong-Photo Quality Dataset (CUHK-PQ).

The CUHK-PQ dataset [22] contains images with binary aesthetic labels. Commonly used training/testing partitions on this dataset include a random split and a five-fold split for cross-validation. We use the former one in our experiments.

                AVA                                         AADB                                        CUHK-PQ
Method          conv1  conv2  conv3  conv4  conv5  average  conv1  conv2  conv3  conv4  conv5  average  conv1  conv2  conv3  conv4  conv5  average
ImageNet label  79.3   79.0   79.3   79.1   79.4   79.22    63.6   65.0   64.2   67.5   64.2   64.9     77.8   83.3   84.1   84.2   83.1   82.5
Places label    79.6   79.5   79.2   79.6   79.8   79.54    61.8   63.2   63.4   65.0   65.2   63.72    76.0   82.2   82.5   83.1   81.6   81.08
Without pretext 77.2   77.8   78.0   78.2   78.0   77.84    61.3   58.9   62.1   61.6   64.0   61.58    73.6   72.0   73.3   74.6   73.1   73.32
Context         79.8   79.2   79.2   79.0   79.0   79.24    59.4   59.8   62.4   62.6   63.2   61.48    69.2   75.9   77.9   79.0   78.7   76.16
Colorization    80.0   79.7   79.5   79.2   79.2   79.52    57.0   63.2   66.4   63.4   63.4   62.68    73.5   79.3   80.6   82.4   82.6   79.68
Split-Brain     79.5   79.7   79.5   80.1   79.4   79.64    58.2   64.4   67.8   63.8   65.6   63.96    77.5   83.7   84.7   83.7   84.5   82.82
Counting        53.0   52.2   63.3   65.5   58.8   58.56    61.8   61.3   60.4   62.8   62.3   61.72    75.6   76.0   74.3   72.5   71.8   74.04
RotNet          77.6   73.8   80.3   80.3   80.3   78.46    54.6   57.8   64.2   66.0   66.0   61.72    67.2   66.8   63.6   63.6   63.6   64.96
Ours            78.4   79.9   80.5   80.8   80.6   80.02    62.6   65.9   68.4   68.9   65.8   66.32    77.4   83.4   85.3   85.6   85.1   83.36
Table 2: Task generalization performance (%) of different convolutional layers from models guided by different pretext tasks, measured for visual aesthetic assessment with linear layers on the AVA, AADB, and CUHK-PQ benchmarks.

Results and Discussions

Evaluation of Unsupervised Features

Our experimental results on the three benchmarks are reported in Table 2. In each column, the best numbers are shown in bold font and the second best are highlighted with an underscore. We can make several interesting observations from the table.

Figure 5: The accuracy results of the 3rd conv. block from different learning schemes in low data adaptation on AADB.

The proposed scheme generally works the best on all three benchmarks. It is evident that our approach achieves competitive results consistently compared with other baselines on the three datasets, especially for the mid-level layers. Self-supervised visual features can even outperform semantics-related features that are pre-trained with manual labels from ImageNet or Places. Furthermore, in terms of accuracy, our method is comparable to or better than existing self-supervised learning schemes in image aesthetic assessment.

Figure 6: The accuracy results of the 3rd conv. block from different methods in low data adaptation on CUHK-PQ.

Mid-level features achieve the best results in task adaptation. By comparing the performance of different layers, we can see the correlation between image aesthetics and network depth. As shown in all tables, mid-level features (conv3 & conv4) generally outperform high-level features (conv5) and low-level ones (conv1 & conv2) in terms of accuracy. The observations are consistent with [41, 8]. One possible explanation is that, during back-propagation, the last convolutional layer receives more training signal for the pretext task than the mid-level layers. Consequently, it is more likely to overfit to the pretext task, while using mid-level features leads to better generalization performance.

In addition to the image attributes used in our pretext tasks, color is another important factor in image aesthetic assessment. Among the other tested pretext tasks, Split-brain [42] and Colorization [41] consistently achieve better results. This indicates that color and the image attributes in Tab. 1 are key factors in assessing visual aesthetics. Regarding why RotNet fails to achieve good results on the CUHK-PQ benchmark, we suspect that both high-quality images and low-quality ones in this dataset share similar distributions or visual patterns.

Method                Backbone    Labels in the pre-training   Aesthetic label alone   Results (%)
DMA-Net [20]          AlexNet                                                          75.4
AA-Net [37]           VGG                                                              76.9
[16]                  AlexNet                                                          77.3
MNA-CNN [25]          VGG                                                              77.4
NIMA [34]             Inception                                                        81.5
Pool-3FC [10]         Inception                                                        81.7
A-Lamp [24]           VGG                                                              82.5
[33]                  ResNet-18                                                        83.0
Our pretext task      AlexNet     0                                                    82.0
+ non-linear layers   ResNet-18                                                        82.8
Table 3: Results of several methods on the AVA benchmark measured in terms of binary classification accuracy of aesthetic labels.

Low Data Adaptation

One practical issue that self-supervised learning schemes are able to handle is low data adaptation, where very few labels are available for task adaptation. In our case, we simulate the low data adaptation regime by using of the original training data per class in the task adaptation stage. The final results are shown in Fig. 5 and Fig. 6. From these two figures, we can see that our proposed learning scheme generally outperforms baseline models pre-trained with -way labels from ImageNet. As the margin becomes larger when more manual aesthetic labels are used in the task adaptation, our method uses data more efficiently than the vanilla supervised counterpart.
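Simulating the low data regime amounts to drawing the same fraction of training examples from every class. The sketch below shows one way to do this, assuming labels are given as a list and the fraction is supplied by the caller; the function name, the `seed` argument, and the rule of keeping at least one example per class are assumptions for illustration.

```python
import random


def subsample_per_class(labels, fraction, seed=0):
    """Return sorted indices that keep `fraction` of the examples of each class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    chosen = []
    for indices in by_class.values():
        keep = max(1, int(len(indices) * fraction))  # keep at least one per class
        chosen.extend(rng.sample(indices, keep))
    return sorted(chosen)
```

Sampling per class rather than over the whole set keeps the class balance of the reduced training set equal to that of the original one.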

It is also interesting to note that our method and vanilla supervised method have similar adaptation performance when training data is available. We believe this fact is due to the complexity of visual aesthetics, i.e., when aesthetic labels are extremely few, the sampled instances cannot cover the entire distribution faithfully and thus lead to poor assessment results.

Adaptation Using a Non-Linear Classifier

If we use non-linear layers in the task adaptation stage, we can achieve results close to state-of-the-art fully supervised approaches [24, 33, 34, 10] which have a test accuracy of on the AVA benchmark, as shown in Tab. 3. Note that our method does not use millions of labels from ImageNet during the pre-training, and we only use aesthetic labels in the task adaptation stage. Besides, our accuracy is about on the AADB benchmark and on the CUHK-PQ dataset, using ResNet-18 as the backbone network. These numbers are also close to those of the same network pre-trained with millions of labels from ImageNet (i.e., on the AADB and on the CUHK-PQ). Arguably, we manage to achieve similar assessment performance using far fewer manual labels.

Ablation Study

Pretext tasks.

We perform the pre-training with either one of the two pretext tasks and measure the final assessment results. It turns out that joint training using both loss terms generally yields better aesthetic assessment accuracy than using the degradation identification loss $\mathcal{L}_{deg}$ or the triplet loss $\mathcal{L}_{trp}$ alone. The accuracy difference is on the AVA dataset and on the AADB dataset. We also find that pre-training with the triplet loss $\mathcal{L}_{trp}$ alone is prone to undesirable training dynamics, while the two-stage learning scheme is more stable and consequently leads to better results.

Image editing operations.

We randomly select one operation in Tab. 1 and exclude it from the pre-training stage to analyze the impact on the final assessment performance. We can make several observations from the corresponding results shown in Tab. 4. First, editing operations related to softness, camera shake, and poor lighting are relatively more significant in learning aesthetic-aware features than other operations. Second, image manipulations related to distraction, fuzziness, and noise have relatively smaller margins. This indicates that these operations may create aesthetically ambiguous instances that do not always provide consistent signals. Interestingly, camera shake has the most significant impact on AVA, while soft/grainy has the most significant impact on AADB.

Attribute AVA AADB
Much noise
Poor lighting
Soft / Grainy
Camera shake
Table 4: The degradation caused by discarding some image manipulation operations during the pretext task training.

Entropy-based weighting.

Without our entropy-based weighting scheme, there will be at least down in the performance after task adaptation. Besides, some undesired dynamics will occur during the training process. Therefore, we can strengthen meaningful training signals from manipulated instances and improve training efficiency by assigning an entropy-based weight to each patch.


In this paper, we propose a novel self-supervised learning scheme to investigate the possibility of learning useful aesthetic-aware features without manual annotations. Based on the correlation between negative aesthetic effects and several expert-designed image manipulations, we argue that an aesthetic-aware representation space should distinguish between results yielded by various operations. To this end, we propose two pretext tasks, one to recognize what kind of editing operation has been applied to an image patch, and the other to capture fine-grained aesthetic variations due to different manipulation parameters. Our experimental results on three benchmarks demonstrate that the proposed scheme can learn aesthetics-aware features effectively and generally outperforms existing self-supervised counterparts. Besides, we achieve results comparable to state-of-the-art supervised methods on the AVA dataset without using labels from ImageNet. We hope these findings can help us obtain a better understanding of image visual aesthetics and inspire future research in related areas.


This work was supported in part by National Natural Science Foundation of China under nos. 61832016, 61672520 and 61720106006, in part by Key R&D Program of Jiangxi Province (No. 20171ACH80022), in part by CASIA-Tencent Youtu joint research project.