The goal of image quality assessment (IQA) is to quantify the perceptual quality of images. In the deep learning era, many IQA approaches [12, 34, 36, 43, 49] have achieved significant success by leveraging the power of convolutional neural networks (CNNs). However, CNN-based IQA models are often constrained by the fixed-size input requirement in batch training, i.e., the input images need to be resized or cropped to a fixed shape as shown in Figure 1 (b). This preprocessing is problematic for IQA because images in the wild have varying aspect ratios and resolutions. Resizing and cropping can alter image composition or introduce distortions, thus changing the quality of the image.
To learn IQA on the full-size image, the existing CNN-based approaches use either adaptive pooling or resizing to get a fixed-size convolutional feature map. MNA-CNN processes a single image in each training batch, which is not practical for training on a large dataset. Hosu et al. extract and store fixed-size features offline, which costs additional storage for every augmented image. To keep the aspect ratio, Chen et al. propose a dedicated convolution that preserves the aspect ratio in the convolutional receptive field. Their evaluation verifies the importance of aspect-ratio-preserving (ARP) processing in IQA tasks, but the method still needs resizing and smart grouping for effective batch training.
In this paper, we propose a patch-based multi-scale image quality Transformer (MUSIQ) to bypass the CNN constraints on fixed input size and predict the quality effectively on the native resolution image as shown in Figure 1 (a). The Transformer was first proposed for natural language processing (NLP) and has recently been studied for various vision tasks [4, 5, 6, 11]. Among these, the Vision Transformer (ViT) splits each image into a sequence of fixed-size patches, encodes each patch as a token, and then applies a Transformer to the sequence for image classification. In theory, this kind of patch-based Transformer model can handle arbitrary numbers of patches (up to memory constraints), and therefore does not require preprocessing the input image to a fixed resolution. This motivates us to apply the patch-based Transformer to IQA tasks with full-size images as input.
Existing works have shown the benefit of using multi-scale features extracted from CNN feature maps at different depths. This inspires us to transform the native resolution image into a multi-scale representation, enabling the Transformer's self-attention mechanism to capture information on both fine-grained detailed patches and coarse-grained global patches. Besides, unlike the convolution operation in CNNs, which has a relatively limited receptive field, self-attention can attend to the whole input sequence and can therefore effectively capture the image quality at different granularities.
However, it is not straightforward to apply the Transformer to the multi-aspect-ratio multi-scale input. Although self-attention accepts input sequences of arbitrary length, it is permutation-invariant and therefore cannot capture patch location in the image. To mitigate this, ViT adds a fixed-length positional embedding to encode the absolute position of each patch in the image. However, the fixed-length positional encoding fails when the input length varies. To solve this issue, we propose a novel hash-based 2D spatial embedding that maps the patch positions to a fixed grid to effectively handle images with arbitrary aspect ratios and resolutions. Moreover, since the patch locations at each scale are hashed to the same grid, it aligns spatially close patches at different scales so that the Transformer model can leverage information across multiple scales. In addition to the spatial embedding, a separate scale embedding is introduced to help the Transformer distinguish patches coming from different scales in the multi-scale representation.
The main contributions of this paper can be summarized as follows:
- We propose a patch-based multi-scale image quality Transformer (MUSIQ), which supports processing full-size input with varying aspect ratios or resolutions, and allows multi-scale feature extraction.
- A novel hash-based 2D spatial embedding and a scale embedding are proposed to support positional encoding in the multi-scale representation, helping the Transformer capture information across space and scales.
2 Related Work
Image Quality Assessment. Image quality assessment aims to quantitatively predict perceptual image quality. There are two important aspects for assessing image quality: technical quality and aesthetic quality. The former focuses on perceptual distortions while the latter also relates to image composition, artistic value and so on. In past years, researchers have proposed many IQA methods: early natural scene statistics based [14, 26, 28, 47], codebook-based [40, 42] and CNN-based [12, 34, 36, 43, 49]. CNN-based methods achieve state-of-the-art performance. However, they usually need to crop or resize images to a fixed size in batch training, which affects the image quality. Several methods have been proposed to mitigate the distortion from resizing and cropping in CNN-based IQA. An ensemble of multiple crops from the original image has proven effective for IQA [7, 16, 24, 33, 34], but it introduces non-negligible inference cost. In addition, MNA-CNN handles full-size input by adaptively pooling the feature map to a fixed shape. However, it only accepts a single input image per training batch to preserve the original resolution, which is not efficient for large-scale training. Hosu et al. extracted and stored fixed-size features from the full-size image for model training, which costs extra storage for every augmented image and is inefficient for large-scale training. Chen et al. proposed an adaptive fractional dilated convolution to adapt the receptive field according to the image aspect ratio. The method preserves the aspect ratio but cannot handle full-size input without resizing. It also needs a smart grouping strategy in mini-batch training.
Transformers in Vision. Transformers were first applied to NLP tasks and achieved great performance [41, 10, 23]. Recent works have applied Transformers to various vision tasks [4, 5, 6, 11]. Among these, the Vision Transformer (ViT) employs a pure Transformer architecture to classify images by treating an image as a sequence of patches. For batch training, ViT resizes the input images to a fixed square size, e.g., 224 × 224, from which a fixed number of patches are extracted and combined with a fixed-length positional embedding. This constrains its usage for IQA since resizing will affect the image quality. To solve this, we propose a novel Transformer-based architecture that accepts the full-size image for IQA.
Positional Embeddings. Positional embeddings are introduced in Transformers to encode the order of the input sequence. Without them, the self-attention operation is permutation-invariant. Vaswani et al. used deterministic positional embeddings generated from sinusoidal functions. ViT showed that deterministic and learnable positional embeddings work equally well. However, those positional embeddings are generated for fixed-length sequences. When the input resolution changes, the pre-trained positional embeddings are no longer meaningful. Relative positional embeddings [32, 2] were proposed to encode relative distance instead of absolute position. Although relative positional embeddings can work for variable-length inputs, they require substantial modifications to the Transformer attention and cannot capture multi-scale positions in our use case.
3 Multi-scale Image Quality Transformer
3.1 Overall Architecture
To tackle the challenge of learning IQA on full-size images, we propose a multi-scale image quality Transformer (MUSIQ) which can handle inputs with arbitrary aspect ratios and resolutions. An overview of the model is shown in Figure 2.
We first build a multi-scale representation of the input image, containing the native resolution image and its ARP resized variants. The images at different scales are partitioned into fixed-size patches and fed into the model. Since patches come from images of varying resolutions, we need to effectively encode the multi-aspect-ratio multi-scale input into a sequence of tokens (the small boxes in Figure 2), capturing the pixel, spatial, and scale information.
To achieve this, we design three encoding components in MUSIQ: 1) a patch encoding module to encode patches extracted from the multi-scale representation (Section 3.2); 2) a novel hash-based spatial embedding module to encode the 2D spatial position of each patch (Section 3.3); 3) a learnable scale embedding to encode the scale each patch comes from (Section 3.4).
After encoding the multi-scale input into a sequence of tokens, we use the standard approach of prepending an extra learnable "classification token" (CLS) [10, 11]. The CLS token state at the output of the Transformer encoder serves as the final image representation. We then add a fully connected layer on top to predict the image quality score. Since MUSIQ only changes the input encoding, it is compatible with any Transformer variant. To demonstrate the effectiveness of the proposed method, we use the classic Transformer (Appendix A) with a relatively lightweight setting to make the model size comparable to ResNet-50 in our experiments.
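As a rough illustration of the prediction path described above, the following sketch prepends a CLS token, runs an encoder, and applies a linear head to the CLS output state. The names `predict_quality`, `encoder`, `head_weights`, and `head_bias` are hypothetical, and `identity_encoder` is only a placeholder standing in for a real Transformer encoder.

```python
def predict_quality(tokens, cls_token, encoder, head_weights, head_bias):
    """Predict a scalar quality score from a token sequence.

    tokens:  list of D-dimensional token vectors (lists of floats).
    encoder: any callable mapping a token sequence to an equally long
             sequence of output states (stand-in for the Transformer).
    """
    # Prepend the learnable CLS token to the encoded multi-scale sequence.
    sequence = [list(cls_token)] + [list(t) for t in tokens]
    states = encoder(sequence)
    # The output state at the CLS position represents the whole image.
    cls_state = states[0]
    # A single fully connected layer maps the CLS state to a score.
    return sum(w * x for w, x in zip(head_weights, cls_state)) + head_bias


def identity_encoder(sequence):
    """Placeholder encoder (assumption): returns its input unchanged."""
    return sequence
```

Because only the input encoding is specific to the method, any encoder with this sequence-in, sequence-out shape could be substituted for the placeholder.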
3.2 Multi-scale Patch Embedding
Image quality is affected by both the local details and the global composition. In order to capture both the global and local information, we propose to model the input image with a multi-scale representation. Patches from different scales enable the Transformer to aggregate information across multiple scales and spatial locations.
As shown in Figure 2, the multi-scale input is composed of the full-size image with height $H$, width $W$, channel $C$, and a sequence of ARP resized images obtained from the full-size image using a Gaussian kernel. The resized images have height $h_k$, width $w_k$, channel $C$, where $k = 1, \dots, K$ and $K$ is the number of resized variants for each input. To align resized images for a consistent global view, we fix the longer side length to $L_k$ for each resized variant and yield:
$$\alpha_k = L_k / \max(H, W), \quad h_k = \alpha_k H, \quad w_k = \alpha_k W,$$
where $\alpha_k$ represents the resizing factor for each scale.
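The resizing rule can be sketched as follows, assuming the resizing factor is the target longer side divided by the longer side of the original image. The function names and the rounding to whole pixels are assumptions for illustration; the default longer sides of 224 and 384 follow the default setting reported in the experiments.

```python
def arp_resized_shape(height, width, longer_side):
    """Return (alpha, h, w) for an aspect-ratio-preserving resize whose
    longer side equals `longer_side`."""
    alpha = longer_side / max(height, width)
    # Rounding to whole pixels is an assumption; the text does not specify it.
    new_h = max(1, round(height * alpha))
    new_w = max(1, round(width * alpha))
    return alpha, new_h, new_w


def multiscale_shapes(height, width, longer_sides=(224, 384)):
    """Shapes of the native image followed by its ARP resized variants."""
    shapes = [(height, width)]
    for side in longer_sides:
        _, h, w = arp_resized_shape(height, width, side)
        shapes.append((h, w))
    return shapes
```

For a 768 × 1024 input, `multiscale_shapes(768, 1024)` gives the native shape plus variants of 168 × 224 and 288 × 384, all with the same 3:4 aspect ratio.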
Square patches of size $P$ are extracted from each image in the multi-scale representation. For images whose width or height is not a multiple of $P$, we pad the image with zeros accordingly. Each patch is encoded into a $D$-dimensional embedding by the patch encoder module, where $D$ is the latent token size used in the Transformer.
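A minimal sketch of the patch extraction step, using a single-channel image stored as a list of rows. Padding at the bottom and right edges is an assumption (the text only says the image is zero-padded), and `extract_patches` is a hypothetical helper name.

```python
def extract_patches(image, patch_size):
    """Split a 2D image (list of rows) into square patches, zero-padding
    the bottom and right edges up to multiples of patch_size.

    Returns a list of ((row, col), patch) pairs in row-major order.
    """
    height, width = len(image), len(image[0])
    rows = -(-height // patch_size)  # ceiling division
    cols = -(-width // patch_size)
    # Pad every row on the right, then append all-zero rows at the bottom.
    padded = [list(r) + [0] * (cols * patch_size - width) for r in image]
    padded += [[0] * (cols * patch_size)
               for _ in range(rows * patch_size - height)]
    patches = []
    for i in range(rows):
        for j in range(cols):
            patch = [r[j * patch_size:(j + 1) * patch_size]
                     for r in padded[i * patch_size:(i + 1) * patch_size]]
            patches.append(((i, j), patch))
    return patches
```

Keeping the `(row, col)` index with each patch is convenient because the spatial embedding is later derived from the patch position.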
We use a 5-layer ResNet as the patch encoder module to learn a better representation for the input patch. We find that encoding the patch with a few convolution layers performs better than linear projection when pre-training on ILSVRC-2012 ImageNet (see Section 4.4). Since the patch encoding module is lightweight and shared across all input patches, which are small, it adds only a small number of parameters.
The sequences of patch embeddings output by the patch encoder module are concatenated to form a multi-scale embedding sequence for the input image. The numbers of patches from the original image and the resized ones are calculated as $N = HW/P^2$ and $n_k = h_k w_k / P^2$, respectively.
Since each input image has a different resolution and aspect ratio, $H$ and $W$ are different for each input and therefore $N$ and $n_k$ are different. To get fixed-length input during training, we follow the common practice in NLP and zero-pad the encoded patch tokens to the same length. An input mask is attached to indicate the effective input, which is used in the Transformer to perform masked self-attention (Appendix A.3). Note that the padding operation does not change the input because the padding tokens are ignored in the multi-head attention by masking them.
As previously mentioned, we fix the longer side length to $L_k$ for each resized variant. Therefore $n_k \le L_k^2 / P^2 = m_k$ and we can safely pad to $m_k$. For the native resolution image, we simply pad or cut the sequence to a fixed length $l$. The padding is not necessary during single-input evaluation because the sequence length can be arbitrary.
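The padding and masking steps can be sketched as below. This is an illustrative approximation rather than the implementation from Appendix A.3: the helper names are hypothetical, the upper bound uses a ceiling because images are padded to multiples of the patch size, and masking is shown for a single row of attention scores.

```python
import math


def max_patches(longer_side, patch_size):
    """Upper bound on the patch count of a resized variant whose longer
    side is `longer_side` (each side holds at most ceil(L / P) patches)."""
    per_side = math.ceil(longer_side / patch_size)
    return per_side * per_side


def pad_or_truncate(tokens, length, dim):
    """Zero-pad (or cut) a token list to `length`; return (tokens, mask)."""
    kept = [list(t) for t in tokens[:length]]
    n_pad = length - len(kept)
    mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return kept + [[0.0] * dim for _ in range(n_pad)], mask


def masked_softmax(scores, mask):
    """Softmax over attention scores that ignores masked (padding) keys.
    Assumes at least one key is unmasked."""
    masked = [s if m else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]
```

Masked keys receive exactly zero attention weight, which is why padding does not alter the result computed from the real tokens.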
3.3 Hash-based 2D Spatial Embedding
Spatial positional embedding is important in vision Transformers to inject awareness of the 2D image structure into the 1D sequence input. The traditional fixed-length positional embedding assigns an embedding to every input location. This fails for variable input resolutions, where the number of patches is different and therefore each patch in the sequence may come from an arbitrary location in the image. Besides, the traditional positional embedding models each position independently and therefore cannot align spatially close patches from different scales.
We argue that an effective spatial embedding design for MUSIQ should meet the following requirements: 1) effectively encode patch spatial information under different aspect ratios and input resolutions; 2) spatially close patches at different scales should have close spatial embeddings; 3) efficient and easy to implement, non-intrusive to the Transformer attention.
Based on that, we propose a novel hash-based 2D spatial embedding (HSE) where the patch located at row $i$, column $j$ is hashed to the corresponding element in a $G \times G$ grid. Each element in the grid is a $D$-dimensional embedding.
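We define HSE by a learnable matrix $T \in \mathbb{R}^{G \times G \times D}$. Suppose the input resolution is $H \times W$. The input image will be partitioned into $\frac{H}{P} \times \frac{W}{P}$ patches. For the patch at position $(i, j)$, its spatial embedding is defined by the element at position $(t_i, t_j)$ in $T$ where
$$t_i = \frac{i \times G}{H / P}, \quad t_j = \frac{j \times G}{W / P}.$$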
The $D$-dimensional spatial embedding $T_{t_i, t_j}$ is added to the patch embedding element-wise as shown in Figure 2. For fast lookup, we simply round $(t_i, t_j)$ to the nearest integers. HSE does not require any changes to the Transformer attention module. Moreover, both the computation of $t_i$ and $t_j$ and the lookup are lightweight and easy to implement.
To align patches across scales, patch locations from all scales are mapped to the same grid $T$. As a result, patches located closely in the image but from different scales are mapped to spatially close embeddings in $T$, since $i$ and $H$ as well as $j$ and $W$ change proportionally to the resizing factor $\alpha$. This achieves spatial alignment across different images from the multi-scale representation.
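The hashing step can be sketched as follows, assuming the patch index is scaled by the ratio of the grid size to the number of patches along that axis and then rounded to the nearest integer. The function name `hse_cell` and the clamping of out-of-range indices are assumptions added for illustration.

```python
import math


def hse_cell(i, j, rows, cols, grid_size):
    """Map the patch at (row i, column j) of a rows x cols patch grid to a
    cell of a shared grid_size x grid_size embedding table."""
    t_i = i * grid_size / rows
    t_j = j * grid_size / cols
    # Round to the nearest integer for a fast table lookup.
    cell_i = math.floor(t_i + 0.5)
    cell_j = math.floor(t_j + 0.5)
    # Clamping is an assumption: rounding can reach grid_size for patches
    # near the bottom or right edge, which would fall outside the table.
    return min(cell_i, grid_size - 1), min(cell_j, grid_size - 1)
```

With a 10 × 10 grid, the central patch (12, 16) of a 24 × 32 patch grid maps to cell (5, 5), while the central patch (3, 3) of a resized variant with a 6 × 7 patch grid maps to the neighbouring cell (5, 4), illustrating how patches from different scales land on nearby embeddings.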
There is a trade-off between expressiveness and trainability in the choice of the hash grid size $G$. A small $G$ may result in many collisions between patches, which makes the model unable to distinguish spatially close patches. A large $G$ wastes memory and may need more diverse resolutions to train. In our IQA setting, where rough positional information is sufficient, we find that once $G$ is large enough, changing $G$ only results in small performance differences (see Appendix B). We set $G = 10$ in the experiments.
3.4 Scale Embedding
Since we reuse the same hashing matrix for all images, HSE does not make a distinction between patches from different scales. Therefore, we introduce an additional scale embedding (SCE) to help the model effectively distinguish information coming from different scales and better utilize information across scales. In other words, SCE marks which input scale the patch is coming from in the multi-scale representation.
We define SCE as a learnable scale embedding $Q \in \mathbb{R}^{(K+1) \times D}$ for the input image with $K$-scale resized variants. Following the spatial embedding, the first element $Q_0 \in \mathbb{R}^{D}$ is added element-wise to all the $D$-dimensional patch embeddings from the native resolution image. $Q_k \in \mathbb{R}^{D}$, $k = 1, \dots, K$, are also added element-wise to all the patch embeddings from the resized image at scale $k$.
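Putting the three encodings together, each token can be viewed as the element-wise sum of a patch embedding, a spatial embedding looked up by grid cell, and a scale embedding looked up by scale index. The sketch below assumes the tables are plain nested lists and that index 0 denotes the native resolution image; `encode_tokens` and its argument names are hypothetical.

```python
def encode_tokens(patch_embs, grid_cells, scale_ids, hse_table, sce_table):
    """Add spatial and scale embeddings to patch embeddings.

    patch_embs: list of D-dimensional patch embeddings.
    grid_cells: list of (t_i, t_j) cells, one per patch.
    scale_ids:  list of scale indices (0 = native image, k = k-th variant).
    hse_table:  G x G x D nested list of spatial embeddings.
    sce_table:  (K + 1) x D nested list of scale embeddings.
    """
    tokens = []
    for emb, (t_i, t_j), k in zip(patch_embs, grid_cells, scale_ids):
        spatial = hse_table[t_i][t_j]
        scale = sce_table[k]
        # Element-wise sum of the three D-dimensional vectors.
        tokens.append([p + s + q for p, s, q in zip(emb, spatial, scale)])
    return tokens
```

Two patches that share a grid cell but come from different scales receive the same spatial term and differ only in the scale term, which is what lets the model tell them apart.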
3.5 Pre-training and Fine-tuning
Typically, Transformer models need to be pre-trained on large datasets, e.g., ImageNet, and fine-tuned on the downstream tasks. During pre-training, we still keep random cropping as an augmentation to generate images of different sizes. However, instead of doing square resizing as is common practice in image classification, we intentionally skip resizing to prime the model for inputs with different resolutions and aspect ratios. We also employ common augmentations such as RandAugment and mixup in pre-training.
When fine-tuning on IQA tasks, we do not resize or crop the input image to preserve the image composition and aspect ratio. In fact, we only use random horizontal flipping for augmentation in fine-tuning. For evaluation, our method can be directly applied on the original image without aggregating multiple augmentations (multi-crops sampling).
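When fine-tuning on the IQA datasets, we use common regression losses such as the L1 loss for a single mean opinion score (MOS) and the Earth Mover Distance (EMD) loss to predict the quality score distribution $p$:
$$\mathrm{EMD}(p, \hat{p}) = \left( \frac{1}{M} \sum_{m=1}^{M} \left| \mathrm{CDF}_p(m) - \mathrm{CDF}_{\hat{p}}(m) \right|^r \right)^{1/r},$$
where $p$ is the normalized score distribution over $M$ score buckets, $\hat{p}$ is the predicted distribution, and $\mathrm{CDF}_p(m)$ is the cumulative distribution function $\sum_{i=1}^{m} p_i$.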
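A minimal sketch of an EMD loss between two normalized score distributions, assuming the standard form based on the r-th power of CDF differences. The default exponent `r=2` is an assumption for illustration, as are the function names.

```python
def cumulative(dist):
    """Cumulative distribution of a list of probabilities."""
    total, out = 0.0, []
    for p in dist:
        total += p
        out.append(total)
    return out


def emd_loss(p_true, p_pred, r=2):
    """Earth Mover Distance between two normalized score distributions.

    Both inputs are lists of equal length that sum to 1. The default
    exponent r=2 is an assumption, not a value taken from the text.
    """
    cdf_true, cdf_pred = cumulative(p_true), cumulative(p_pred)
    mean_diff = sum(abs(a - b) ** r
                    for a, b in zip(cdf_true, cdf_pred)) / len(p_true)
    return mean_diff ** (1.0 / r)
```

Unlike a per-bucket loss, the CDF-based form penalizes predictions more when probability mass is placed further from the correct score bucket.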
4 Experimental Results
4.1 Datasets
PaQ-2-PiQ is the largest picture technical quality dataset to date, containing 40k real-world images and 120k cropped patches. Each image or patch is associated with a MOS. Since our model does not make a distinction between images and extracted patches, we simply use all the 30k full-size images and the corresponding 90k patches from the training split to train the model. We then run the evaluation on the 7.7k full-size validation set and the 1.8k test set.
The SPAQ dataset consists of 11k pictures taken by 66 smartphones. For a fair comparison, we follow prior work and resize the raw images such that the shorter side is 512. We only use the image and its corresponding MOS for training, not including the extra tag information in the dataset.
KonIQ-10k contains 10k images selected from YFCC100M, a large public multimedia database.
AVA is an image aesthetic assessment dataset. It contains 250k images with 10-scale score distribution for each.
4.2 Implementation Details
For MUSIQ, the multi-scale representation is constructed from the native resolution image and two ARP resized inputs ($L_1 = 224$ and $L_2 = 384$) by default. It therefore uses 3-scale input. Our method also works on 1-scale input using just the full-size image without resized variants. We report the results of this single-scale setting as MUSIQ-single.
We use patch size $P = 32$. The dimension of the Transformer input tokens is $D = 384$, which is also the dimension of the pixel patch embedding, HSE, and SCE. The grid size of HSE is set to $G = 10$. We use the classic Transformer with lightweight parameters (384 hidden size, 14 layers, 1152 MLP size and 6 heads) to make the model size comparable to ResNet-50. The final model has around 27 million total parameters.
We pre-train our models on ImageNet for 300 epochs, using Adam with weight decay and cosine learning rate decay. We set the maximum number of patches from the full-size image to 512 in training. For fine-tuning, we use SGD with momentum and cosine learning rate decay for 10, 30, 30, 20 epochs on PaQ-2-PiQ, KonIQ-10k, SPAQ, and AVA, respectively. Batch size is set to 512 for AVA, 96 for KonIQ-10k, and 128 for the rest. For AVA, we use the EMD loss. For other datasets, we use the L1 loss.
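For readers unfamiliar with cosine learning rate decay, a common form of the schedule is sketched below. The function name, the `final_lr` parameter, and any base learning rate passed in are illustrative assumptions and are not taken from the training setup above.

```python
import math


def cosine_decay(step, total_steps, base_lr, final_lr=0.0):
    """Learning rate at `step` under a cosine decay from base_lr to final_lr."""
    # Clamp progress to [0, 1] so steps outside the schedule stay bounded.
    progress = min(max(step / total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (base_lr - final_lr) * cosine
```

The rate starts at `base_lr`, falls slowly at first, drops fastest around the midpoint, and flattens out as it approaches `final_lr`.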
The models are trained on TPUv3. All the results from our method are averaged across 10 runs. Spearman rank ordered correlation (SRCC), Pearson linear correlation (PLCC), and the standard deviation (std) are reported.
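The two correlation metrics can be computed as in the following dependency-free sketch. It uses the standard definitions (Pearson correlation of the raw values for PLCC, Pearson correlation of average ranks for SRCC) and is not the evaluation code used in the experiments; the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[pos]]):
            end += 1
        average = (pos + end) / 2.0 + 1.0
        for k in order[pos:end + 1]:
            result[k] = average
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman rank-order correlation coefficient (SRCC)."""
    return pearson(ranks(x), ranks(y))
```

SRCC measures only whether predictions are ordered like the ground-truth scores, while PLCC also rewards a linear relationship, so a monotonic but nonlinear predictor reaches an SRCC of 1 with a PLCC below 1.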
4.3 Comparing with the State-of-the-art (SOTA)
Results on PaQ-2-PiQ. Table 1 shows the results on the PaQ-2-PiQ dataset. Our proposed MUSIQ outperforms other methods on both the validation and test sets. Notably, the test set is entirely composed of pictures having at least one dimension exceeding 640 pixels. This is very challenging for traditional deep learning approaches, where resizing is inevitable. Our method outperforms previous methods by a large margin on the full-size test set, which verifies its robustness and effectiveness.
|Validation Set||Test Set|
|BIQA  (25 crops)||0.906||0.917|
Results on KonIQ-10k. Table 2 shows the results on the KonIQ-10k dataset. Our method outperforms the SOTA methods. In particular, BIQA needs to sample 25 crops from each image during training and testing. This kind of multi-crop ensemble is a way to mitigate the fixed-shape constraint in CNN models. But since each crop is only a sub-view of the whole image, the ensemble is still an approximation. Moreover, it adds inference cost for every crop, and sampling can introduce randomness in the result. Since MUSIQ takes the full-size image as input, it can directly learn the best aggregation of information across the full image, and only one evaluation is involved.
Results on SPAQ. Table 3 shows the results on the SPAQ dataset. Overall, our model is able to outperform other methods in terms of both SRCC and PLCC.
|Fang  (w/o extra info)||0.908||0.909|
|Method||Cls. acc.||MSE||SRCC||PLCC|
|A-Lamp (50 crops)||0.825||-||-||-|
|NIMA (VGG16)||0.806||-||0.592||0.610|
|NIMA (Inception-v2)||0.815||-||0.612||0.636|
|(32 crops)||0.830||-||-||-|
|Zeng (ResNet101)||0.808||0.275||0.719||0.720|
|Hosu (20 crops)||0.817||-||0.756||0.757|
|AFDC + SPP (single warp)||0.830||0.273||0.648||-|
|AFDC + SPP (4 warps)||0.832||0.271||0.649||0.671|
Results on AVA. Table 4 shows the results on the AVA dataset. Our method achieves the best MSE and has top SRCC and PLCC. As previously discussed, instead of multi-crops sampling, our model can accurately predict image aesthetics by directly looking at the full-size image.
4.4 Ablation Studies
Importance of Aspect-Ratio-Preserving (ARP). CNN-based IQA models usually resize the input image to a square resolution without preserving the original aspect ratio. We argue that such preprocessing can be detrimental to IQA because it alters the image composition. To verify that, we compare the performance of the proposed model with either square or ARP resizing. As shown in Table 5, ARP resizing performs better than square resizing, demonstrating the importance of ARP when assessing image quality.
|Method||# Params||SRCC||PLCC|
|NIMA (Inception-v2) (224 square input)||56M||0.612||0.636|
|NIMA (ResNet50)* (384 square input)||24M||0.624||0.632|
|ViT-Base 32* (384 square input)||88M||0.654||0.664|
|ViT-Small 32* (384 square input)||22M||0.656||0.665|
|MUSIQ w/ square resizing (512, 384, 224)||27M||0.706||0.720|
|MUSIQ w/ ARP resizing (512, 384, 224)||27M||0.712||0.726|
|MUSIQ w/ ARP resizing (full, 384, 224)||27M||0.726||0.738|
To intuitively understand the importance of keeping aspect ratios in IQA, we follow prior work and artificially resize the same image to different aspect ratios, then run the models to predict quality scores. Since aggressive resizing causes image quality degradation, a good IQA model should give lower scores to such unnatural-looking images. As shown in Figure 3, MUSIQ (blue curve) is sensitive to the change of aspect ratios, while scores from the other models trained with square resizing are not. This shows that ARP resizing is important and that MUSIQ can effectively detect quality degradation due to resizing.
Effect of Full-size Input and the Multi-scale Input Composition. In Tables 1, 2, 3, and 4, we compare using only the full-size input (MUSIQ-single) and the multi-scale input (MUSIQ). MUSIQ-single achieves promising results, showing the importance of preserving full-size input in IQA. The performance is further improved using multi-scale input, and the gain is larger on PaQ-2-PiQ and AVA because these two datasets have much more diverse resolutions than KonIQ-10k and SPAQ. This shows that multi-scale input is important for effectively capturing quality information on real-world images with varying sizes.
We also vary the multi-scale composition and show in Table 6 that multi-scale input consistently improves performance over single-scale models. The performance gain of multi-scale input is more than that of a simple ensemble of individual scales, because an average ensemble of individual scales actually underperforms the model that uses only the full-size image. Since MUSIQ has a full receptive field over the multi-scale input sequences, it can more effectively aggregate quality information across scales.
|(512, 384, 224)||0.629||0.718|
|(full, 384, 224)||0.646||0.739|
|Average ensemble of (full), (224), (384)||0.640||0.710|
To further verify that the model captures different information at different scales, we visualize the attention weights on each image in the multi-scale representation in Figure 4. We observe that the model tends to focus on more detailed areas on full-size high-resolution images and on more global areas on the resized ones. This shows that the model learns to capture image quality at different granularities.
Effectiveness of Proposed Hash-based Spatial Embedding (HSE) and Scale Embedding (SCE). We run ablations on different ways to encode spatial information and scale information using positional embeddings. As shown in Table 7, there is a large gap between adding and not adding spatial embeddings. This aligns with the previous finding that spatial embedding is crucial for injecting 2D image structure. To further verify the effectiveness of HSE, we try adding a fixed-length spatial embedding as in ViT. This is done by treating all input tokens as a fixed-length sequence and assigning a learnable embedding to each position. The performance of this method is unsatisfactory compared to HSE for two reasons: 1) the inputs have different aspect ratios, so each patch in the sequence can come from a different location in the image, and a fixed positional embedding fails to capture this change; 2) since each position is modeled independently, there is no cross-scale information, meaning that the model cannot locate spatially close patches from different scales in the multi-scale representation. Moreover, the method is inflexible because a fixed-length spatial embedding cannot be easily applied to large images with more patches. In contrast, HSE is meaningful under all conditions.
|Fixed-length (no HSE)||0.707||0.722|
A visualization of the learned HSE cosine similarity is provided in Figure 5. As depicted, the HSEs of spatially close locations are more similar (yellow color), which corresponds well to the 2D structure. For example, the bottom HSEs are brightest at the bottom. This shows that HSE can effectively capture the 2D structure of the image.
In Table 8, we show that adding SCE can further improve performance when compared with not adding SCE. This shows that SCE is helpful for the model to capture scale information independently of the spatial information.
Choice of Patch Encoding Module. We tried different designs for encoding the patch, including linear projection as in ViT and a small number of convolutional layers. As shown in Table 9, using a simple convolution-based patch encoding module can boost the performance. Adding more convolutional layers has diminishing returns, and we find that a 5-layer ResNet provides a satisfactory representation for the patch.
5 Conclusion
We propose a multi-scale image quality Transformer (MUSIQ), which can handle full-size image input with varying resolutions and aspect ratios. By transforming the input image to a multi-scale representation with both global and local views, the model is able to capture the image quality at different granularities. To encode positional information in the multi-scale representation, we propose a hash-based 2D spatial embedding and a scale embedding strategy. Although MUSIQ is designed for IQA, it can be applied to other scenarios where task labels are sensitive to image resolution and aspect ratio. Moreover, MUSIQ is compatible with any type of Transformer that accepts input as a sequence of tokens. Experiments on four large-scale IQA datasets show that MUSIQ consistently achieves state-of-the-art performance, demonstrating the effectiveness of the proposed method.
-  (1984) Pyramid methods in image processing. RCA engineer 29 (6), pp. 33–41. Cited by: §1.
Attention augmented convolutional networks.
Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3286–3295. Cited by: §2.
-  (2017) Deep neural networks for no-reference and full-reference image quality assessment. IEEE Transactions on image processing 27 (1), pp. 206–219. Cited by: Table 2.
-  (2020) End-to-end object detection with transformers. In European Conference on Computer Vision, pp. 213–229. Cited by: §1, §2.
-  (2020) Pre-trained image processing transformer. arXiv preprint arXiv:2012.00364. Cited by: §1, §2.
Generative pretraining from pixels.
Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, pp. 1691–1703. Cited by: §1, §2.
Adaptive fractional dilated convolution network for image aesthetics assessment.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14114–14123. Cited by: §1, §2, §4.4, Table 4.
-  (2020) Randaugment: practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 702–703. Cited by: §3.5.
-  (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §A.4.
-  (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio (Eds.), pp. 4171–4186. External Links: Cited by: §A.1, §A.3, §2, §3.1.
-  (2021) An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations, External Links: Cited by: Figure 6, §A.4, Appendix H, §1, §1, §2, §2, §3.1, §3.2, §3.3, §4.4, §4.4, Table 5.
-  (2020) Perceptual quality assessment of smartphone photography. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3677–3686. Cited by: Appendix F, MUSIQ: Multi-scale Image Quality Transformer, 3rd item, §1, §2, §4.1, §4.1, Table 3.
-  (2017) Convolutional sequence to sequence learning. In International Conference on Machine Learning, pp. 1243–1252. Cited by: §2.
-  (2017) Perceptual quality prediction on authentically distorted images using a bag of features approach. Journal of Vision 17 (1), pp. 32–32. Cited by: §2, Table 3.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §3.2.
-  (2019) Effective aesthetics prediction with multi-level spatially pooled features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9375–9383. Cited by: §1, §1, §2, Table 4.
-  (2020) KonIQ-10k: an ecologically valid database for deep learning of blind image quality assessment. IEEE Transactions on Image Processing 29, pp. 4041–4056. Cited by: Table 16, Appendix E, MUSIQ: Multi-scale Image Quality Transformer, 3rd item, §2, §4.1.
-  (2014) Convolutional neural networks for no-reference image quality assessment. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1733–1740. Cited by: Table 1.
-  (2016) Fully deep blind image quality predictor. IEEE Journal of Selected Topics in Signal Processing 11 (1), pp. 206–220. Cited by: Table 2.
-  (2016) Photo aesthetics ranking network with attributes and content adaptation. In European Conference on Computer Vision, pp. 662–679. Cited by: Table 4.
-  (2018) Which has better visual quality: the clear blue sky or a blurry animal?. IEEE Transactions on Multimedia 21 (5), pp. 1221–1234. Cited by: Table 2.
-  (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125. Cited by: §1.
-  (2019) RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692. External Links: Cited by: §2.
-  (2017) A-lamp: adaptive layout-aware multi-patch deep convolutional neural network for photo aesthetic assessment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4535–4544. Cited by: §2, Table 4.
-  (2016) Composition-preserving deep photo aesthetics assessment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 497–506. Cited by: §1, §2, Table 4.
-  (2012) No-reference image quality assessment in the spatial domain. IEEE Transactions on image processing 21 (12), pp. 4695–4708. Cited by: §2, Table 1, Table 2, Table 3.
-  (2012) Making a “completely blind” image quality analyzer. IEEE Signal processing letters 20 (3), pp. 209–212. Cited by: Table 1.
-  (2011) Blind image quality assessment: from natural scene statistics to perceptual quality. IEEE transactions on Image Processing 20 (12), pp. 3350–3364. Cited by: §2, Table 3.
-  (2017) A deep architecture for unified aesthetic prediction. arXiv preprint arXiv:1708.04890. Cited by: Table 4.
-  (2012) AVA: a large-scale database for aesthetic visual analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2408–2415. Cited by: 3rd item, §2, §4.1.
-  (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. Cited by: §A.4, §3.2.
-  (2018) Self-attention with relative position representations. arXiv preprint arXiv:1803.02155. Cited by: §2.
-  (2018) Attention-based multi-patch aggregation for image aesthetic assessment. In Proceedings of the 26th ACM international conference on Multimedia, pp. 879–886. Cited by: §2, Table 4.
-  (2020) Blindly assess image quality in the wild guided by a self-adaptive hyper network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3667–3676. Cited by: Appendix E, §1, §2, §4.1, §4.3, Table 2.
-  (2017) Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision, pp. 843–852. Cited by: §A.4.
-  (2018) NIMA: neural image assessment. IEEE Transactions on Image Processing 27 (8), pp. 3998–4011. Cited by: §1, §2, §3.5, Table 1, Table 4, Table 5.
-  (2016) YFCC100M: the new data in multimedia research. Communications of the ACM 59 (2), pp. 64–73. Cited by: §4.1.
-  (2017) Attention is all you need. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Cited by: Figure 6, §A.1, §A.2, §A.3, §B.2, §1, §2, §2, §3.1, §3.2, §4.2.
-  (2016) Blind image quality assessment based on high order statistics aggregation. IEEE Transactions on Image Processing 25 (9), pp. 4444–4457. Cited by: Table 2.
-  (2013) Learning without human scores for blind image quality assessment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 995–1002. Cited by: §2, Table 3.
-  (2019) XLNet: generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems 32, pp. 5754–5764. Cited by: §2.
-  (2012) Unsupervised feature learning framework for no-reference image quality assessment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1098–1105. Cited by: §2, Table 3.
-  (2020) From patches to pictures (paq-2-piq): mapping the perceptual space of picture quality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3585. Cited by: MUSIQ: Multi-scale Image Quality Transformer, 3rd item, §1, §2, §4.1, §4.3, Table 1.
-  (2019) A unified probabilistic formulation of image aesthetic assessment. IEEE Transactions on Image Processing 29, pp. 1548–1561. Cited by: Table 4.
-  (2017) A probabilistic quality representation approach to deep blind image quality prediction. arXiv preprint arXiv:1708.08190. Cited by: Table 2.
-  (2017) Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412. Cited by: §3.5.
-  (2015) A feature-enriched completely blind image quality evaluator. IEEE Transactions on Image Processing 24 (8), pp. 2579–2591. Cited by: §2, Table 2, Table 3.
-  (2018) The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595. Cited by: §1.
-  (2018) Blind image quality assessment using a deep bilinear convolutional neural network. IEEE Transactions on Circuits and Systems for Video Technology 30 (1), pp. 36–47. Cited by: §1, §2, Table 2, Table 3.
-  (2020) MetaIQA: deep meta-learning for no-reference image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14143–14152. Cited by: Appendix E, §4.1, Table 2.
Appendix A Transformer Encoder
A.1 Transformer Encoder Structure
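The Transformer block layer consists of multi-head self-attention (MSA), Layernorm (LN) and MLP layers. Residual connections are added in between the layers.
In MUSIQ, the multi-scale patches are encoded as $x_k^n$, where $k = 0, \ldots, K$ is the scale index and $n$ is the patch index in the scale. $k = 0$ represents the full-size image. We then add HSE and SCE to the patch embeddings, forming the multi-scale representation input. Similar to previous works, we prepend a learnable [class] token embedding to the sequence of embedded tokens ($x_{class}$).
The Transformer encoder can be formulated as:
$$
\begin{aligned}
E_p &= [x_0^1; \ldots; x_0^l; x_1^1; \ldots; x_1^{m_1}; \ldots; x_K^1; \ldots; x_K^{m_K}] \\
z_0 &= [x_{class}; E_p + E_{HSE} + E_{SCE}] \\
z'_q &= \mathrm{MSA}(\mathrm{LN}(z_{q-1})) + z_{q-1}, \quad q = 1, \ldots, L \\
z_q &= \mathrm{MLP}(\mathrm{LN}(z'_q)) + z'_q, \quad q = 1, \ldots, L
\end{aligned}
$$
$E_p$ is the patch embedding. $E_{HSE}$ and $E_{SCE}$ are the spatial embedding and scale embedding, respectively. $l$ is the number of patches from the original resolution. $m_1, \ldots, m_K$ are the numbers of patches from the resized variants. $z_0$ is the input to the Transformer encoder. $z_q$ is the output of each Transformer layer and $L$ is the total number of Transformer layers.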
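The assembly of the encoder input described above (patch embeddings plus spatial and scale embeddings, with a [class] token prepended) can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function name `build_encoder_input`, the patch counts, and the embedding width are hypothetical, and the sketch assumes one scale embedding per scale shared by all patches of that scale.

```python
import numpy as np

def build_encoder_input(patch_emb, spatial_emb, scale_emb, class_token):
    """Prepend the [class] token to the sum of patch, spatial and scale embeddings.

    patch_emb, spatial_emb, scale_emb: arrays of shape (N, D), one row per patch.
    class_token: array of shape (D,), standing in for the learnable [class] embedding.
    """
    tokens = patch_emb + spatial_emb + scale_emb                    # (N, D)
    return np.concatenate([class_token[None, :], tokens], axis=0)   # (N + 1, D)

rng = np.random.default_rng(0)
D = 8
patches_per_scale = [6, 4, 2]        # hypothetical: full-size image, then two resized variants
N = sum(patches_per_scale)

patch_emb = rng.normal(size=(N, D))
spatial_emb = rng.normal(size=(N, D))

# Assumption: one embedding per scale, repeated for every patch of that scale.
scale_table = rng.normal(size=(len(patches_per_scale), D))
scale_ids = np.repeat(np.arange(len(patches_per_scale)), patches_per_scale)
scale_emb = scale_table[scale_ids]

class_token = np.zeros(D)
z0 = build_encoder_input(patch_emb, spatial_emb, scale_emb, class_token)
print(z0.shape)
```

Because patches from all scales are concatenated into one sequence, the sequence length depends on the image resolution, which is why padding and masking are needed for batch training.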
A.2 Multi-head Self-Attention (MSA)
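In this section we introduce the standard self-attention (SA) [38] (Figure 7) and its multi-head version (MSA). Suppose the input sequence is represented by $z \in \mathbb{R}^{N \times D}$, and $Q$, $K$, and $V$ are its query, key, and value representations, respectively. They are generated by projecting the input sequence with learnable matrices $U_q, U_k, U_v \in \mathbb{R}^{D \times d}$, respectively, where $d$ is the inner dimension of $Q$, $K$, and $V$. We then compute a weighted sum over $V$ using attention weights $A \in \mathbb{R}^{N \times N}$, which are pairwise similarities between $Q$ and $K$.
$$
\begin{aligned}
Q &= z U_q, \quad K = z U_k, \quad V = z U_v \\
A &= \mathrm{softmax}\left(Q K^\top / \sqrt{d}\right) \\
\mathrm{SA}(z) &= A V
\end{aligned}
$$
MSA is an extension of SA where $s$ self-attention operations (heads) are conducted in parallel. The outputs from all heads are concatenated and then projected to the final output with a learnable matrix $U_m \in \mathbb{R}^{s \cdot d \times D}$. $d$ is typically set to $D/s$ to keep the computation and the number of parameters constant for each $s$.
$$
\mathrm{MSA}(z) = [\mathrm{SA}_1(z); \ldots; \mathrm{SA}_s(z)]\, U_m
$$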
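As a concrete illustration of the projection, scaled softmax, and head concatenation described in this section, the following NumPy sketch implements single-head and multi-head self-attention. It is a minimal sketch under assumed shapes; the function names, the random weights, and the sizes (`N`, `D`, `s`) are illustrative and not taken from the MUSIQ code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(z, U_q, U_k, U_v):
    """Single-head SA. z: (N, D); each projection matrix: (D, d)."""
    Q, K, V = z @ U_q, z @ U_k, z @ U_v
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (N, N) attention weights
    return A @ V                                  # (N, d) weighted sum over the values

def multi_head_self_attention(z, heads, U_m):
    """heads: list of (U_q, U_k, U_v) tuples, one per head; U_m: (s * d, D)."""
    outputs = [self_attention(z, *head) for head in heads]
    return np.concatenate(outputs, axis=-1) @ U_m  # (N, D)

rng = np.random.default_rng(0)
N, D, s = 5, 16, 4
d = D // s                                         # inner dimension per head
z = rng.normal(size=(N, D))
heads = [tuple(rng.normal(size=(D, d)) for _ in range(3)) for _ in range(s)]
U_m = rng.normal(size=(s * d, D))
out = multi_head_self_attention(z, heads, U_m)
print(out.shape)
```

Setting the per-head dimension to `D // s` keeps the total projection size independent of the number of heads, matching the convention stated above.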
A.3 Masked Self-Attention
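Masking is often used in self-attention [38, 10] to ignore padding elements or to restrict attention positions and prevent data leakage (in causal or temporal predictions). In batch training, we use the input mask to indicate the effective input and to ignore padding tokens. As shown in Figure 7, the mask is added to the attention weights before the softmax. By setting the corresponding elements to $-\infty$ before the softmax step in Equation 10, the attention weights on invalid positions are close to zero.
The attention mask is constructed as $M \in \mathbb{R}^{N \times N}$ for a sequence of $N$ tokens, where
$$
M_{i,j} =
\begin{cases}
0 & \text{if attention from position } i \text{ to position } j \text{ is valid} \\
-\infty & \text{otherwise}
\end{cases}
$$
Then the masked self-attention weight matrix is calculated as
$$
A_{mask} = \mathrm{softmax}\left((Q K^\top + M) / \sqrt{d}\right)
$$
where $Q$ and $K$ are the query and key representations and $d$ is their inner dimension.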
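The effect of the additive mask can be checked with a small NumPy sketch. It assumes a padding-only mask, where every query may attend to every non-padding token, and it uses a large negative constant instead of $-\infty$ so that a fully masked row cannot produce NaNs. The function name and the toy sizes are hypothetical.

```python
import numpy as np

def masked_attention_weights(Q, K, valid):
    """Q, K: (N, d). valid: boolean (N,), False marks padding tokens."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # Additive mask: 0 where the attended position is valid, a large negative value otherwise.
    # Assumption: -1e9 stands in for -inf; after the softmax both give zero weight.
    mask = np.where(valid[None, :], 0.0, -1e9)
    scores = scores + mask
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
valid = np.array([True, True, True, False])   # the last token is padding
W = masked_attention_weights(Q, K, valid)
print(W.round(3))
```

The column of the padding token receives zero weight in every row, while each row still sums to one over the valid tokens.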
A.4 Different Transformer Encoder Settings
We use a lightweight parameter setting for the Transformer encoder in the main experiments to make the model size comparable to ResNet-50. Here we also report results for different Transformer encoder settings. The model variants are listed in Table 10. The MUSIQ-Small model is the one used in the main experiments of the paper. The performance of these variants on the AVA dataset is shown in Table 11. Overall, these models perform similarly when pre-trained on ImageNet. Larger Transformer backbones might need more pre-training data to perform better: prior experiments show that larger Transformer backbones get better performance when pre-trained on ImageNet21k or JFT-300M [35].
Appendix B Additional Studies for HSE
B.1 Grid Size in HSE
We run ablation studies on the grid size $G$ in the proposed hash-based 2D spatial embedding (HSE). Results are shown in Table 12. A small $G$ may result in collisions, so the model cannot distinguish spatially close patches. A large $G$ makes the hashing sparser and therefore needs more diverse resolutions to train; otherwise some positions may not have enough data to learn good representations. One can potentially generate a fixed HSE matrix $T$ for larger $G$ when detailed positions really matter (using the sinusoidal function, see Appendix B.2). With a learnable $T$, a good rule of thumb is to let the grid size times the patch size roughly equal the average resolution, so that the grid size is close to the average number of patches per side. Given the average resolution across the 4 datasets and our patch size of 32, we use a grid size of around 10 to 15. Overall, we find that varying $G$ does not change the performance much once it is large enough, showing that rough spatial encoding is sufficient for IQA tasks.
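The trade-off between collisions and sparsity can be made concrete with a small sketch of hashing patch positions onto a grid. The mapping below (scale the patch index by the grid size and take the floor) is an assumption chosen to keep indices in range, and the function names and the 14-row example are hypothetical.

```python
def hse_grid_index(i, j, rows, cols, G):
    """Map patch (i, j) of a rows x cols patch layout to a cell of a G x G grid.

    Assumption: indices are scaled by G and floored, so results lie in [0, G - 1].
    """
    return (i * G) // rows, (j * G) // cols

def used_row_cells(rows, G):
    """Distinct grid rows that are hit by the patch rows of one image."""
    return sorted({hse_grid_index(i, 0, rows, 1, G)[0] for i in range(rows)})

rows = 14   # hypothetical: a 448-pixel side with patch size 32
for G in (5, 14, 20):
    cells = used_row_cells(rows, G)
    print(f"G={G}: {len(cells)} distinct cells for {rows} patch rows")
```

With `G = 5`, several neighbouring patch rows share one cell (collisions). With `G = 20`, every patch row has its own cell but some cells are never used by this resolution, which is why a larger grid needs more diverse resolutions to train every position.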
B.2 Sinusoidal HSE vs. Learnable HSE
Besides the learnable HSE matrix $T$ introduced in the paper, another option is to generate a fixed positional encoding matrix using the sinusoidal function as in [38]. In Table 13, we compare the learnable $T$ with the generated sinusoidal $T$ for different grid sizes $G$. Overall, the learnable $T$ gives slightly better performance than the fixed $T$.
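A fixed sinusoidal table of this kind could be generated as sketched below. The 1D encoding follows the standard sine and cosine formulation of the original Transformer. How the two spatial axes are combined is an assumption made for illustration (half of the channels encode the row, half the column), and the function names are hypothetical.

```python
import numpy as np

def sinusoidal_1d(num_positions, dim):
    """Standard sinusoidal encoding: sine on even channels, cosine on odd channels."""
    pos = np.arange(num_positions)[:, None]             # (num_positions, 1)
    i = np.arange(dim // 2)[None, :]                    # (1, dim / 2)
    angles = pos / np.power(10000.0, 2 * i / dim)
    enc = np.zeros((num_positions, dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

def sinusoidal_hse(G, D):
    """Fixed (G, G, D) table. Assumption: D is divisible by 4; the first half of
    the channels encodes the grid row and the second half the grid column."""
    half = sinusoidal_1d(G, D // 2)
    row_part = np.broadcast_to(half[:, None, :], (G, G, D // 2))
    col_part = np.broadcast_to(half[None, :, :], (G, G, D // 2))
    return np.concatenate([row_part, col_part], axis=-1)

T_fixed = sinusoidal_hse(G=10, D=16)
print(T_fixed.shape)
```

Since the table is generated rather than learned, every grid cell has a well-defined embedding even if no training image ever hashes to it.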
B.3 Visualization of HSE with Different $G$
Figures 8 and 9 visualize the learned HSE for two different grid sizes $G$. Even with $G$ as small as 5, the similarity matrix corresponds well to the patch position in the image, showing that HSE captures patch position in the image.
Appendix C Effect of Patch Size
We ran an ablation on different patch sizes $P$; results are shown in Table 14. In our settings, we find that patch size $P = 32$ performs well across datasets.
Appendix D The Maximum Number of Patches ($l$) from Full-size Image
We run an ablation with different $l$ during training. As shown in Table 15, using a large $l$ in fine-tuning can improve the model performance. Since higher-resolution images have more patches than lower-resolution ones, when $l$ is too small, some larger images might be cut off, and the model performance will degrade.
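The cut-off effect can be illustrated with a short sketch that brings a variable-length patch sequence to a fixed length and returns the validity mask used to ignore padding. This is a minimal sketch: the function name `pad_or_truncate` and the toy sizes are hypothetical, and keeping the first patches when truncating is an assumption.

```python
import numpy as np

def pad_or_truncate(patches, max_len):
    """patches: (n, D). Returns (max_len, D) tokens and a boolean validity mask."""
    n, D = patches.shape
    kept = min(n, max_len)
    out = np.zeros((max_len, D), dtype=patches.dtype)
    out[:kept] = patches[:kept]             # patches beyond max_len are cut off
    mask = np.arange(max_len) < kept        # True for real patches, False for padding
    return out, mask

small_image = np.ones((5, 4))     # 5 patches, embedding width 4
large_image = np.ones((12, 4))    # 12 patches

_, mask_small = pad_or_truncate(small_image, max_len=8)
_, mask_large = pad_or_truncate(large_image, max_len=8)
print(mask_small.sum(), "of 5 patches kept;", mask_large.sum(), "of 12 patches kept")
```

The small image is padded and fully preserved, while the large image loses the patches beyond the limit, which is the degradation the ablation measures.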
Appendix E KonIQ-10k More Results
In our main experiment on KonIQ-10k, we followed BIQA [34] and MetaIQA [50] and reported the average over 10 random 80/20 train-test splits to avoid bias. On the other hand, methods like KonCept512 [17] use a fixed split instead of averaging. In Table 16, we report our results using the same fixed split. Images in KonIQ-10k all have the same resolution, and CNN models like KonCept512 usually need a cherry-picked fixed input size to work well. Unlike CNN models that are constrained by a fixed input size, MUSIQ does not need the input size to be tuned and generalizes well to diverse resolutions.
Appendix F SPAQ Full-size Results
As mentioned in Section 4.1, we resize the raw images such that the shorter side is 512, following prior work, for a fair comparison with the reference methods. Since our model can be applied directly to images without resizing, we also report in Table 17 the performance on the SPAQ full-size test set when training on the SPAQ full-size training set. The two settings differ only marginally.
| Setting | SRCC | PLCC |
|---|---|---|
| Full-size train and test | 0.916 | 0.919 |
| Resized train and test | 0.917 | 0.921 |
Appendix G Computation Complexity
For the default MUSIQ model, the number of parameters is around 27M. For a 224×224 image, its FLOPs are at the same level as SOTA CNN-based models such as ResNet-50 (23M parameters). Training an IQA model takes 0.8 TPUv3-core-days on average. MUSIQ is compatible with efficient Transformer backbones such as Linformer and Performer, which greatly reduce the complexity of the original Transformer. We leave model speedup as future work.
Appendix H Multi-scale Attention Visualization
To understand how MUSIQ uses self-attention to integrate information across different scales, we visualize in Figure 10 the average attention weights from the output tokens to each image in the multi-scale representation. In short, the attention weights are averaged across all heads and then recursively multiplied across layers, accounting for the mixing of attention across tokens through all layers.
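The head averaging and recursive multiplication described above can be sketched as follows. This is a hedged illustration in the style of attention rollout, not the authors' visualization code: adding the identity matrix to model the residual connection and renormalizing the rows are assumptions, and the function name and toy sizes are hypothetical.

```python
import numpy as np

def attention_rollout(attentions):
    """attentions: list over layers of arrays shaped (heads, N, N), rows summing to 1."""
    n = attentions[0].shape[-1]
    rollout = np.eye(n)
    for layer_attn in attentions:
        a = layer_attn.mean(axis=0)                 # average over heads
        a = a + np.eye(n)                           # assumption: account for the residual path
        a = a / a.sum(axis=-1, keepdims=True)       # renormalize each row
        rollout = a @ rollout                       # recursive multiplication across layers
    return rollout

rng = np.random.default_rng(0)
layers, heads, N = 3, 2, 5
attentions = []
for _ in range(layers):
    a = rng.random((heads, N, N))
    attentions.append(a / a.sum(axis=-1, keepdims=True))   # toy row-stochastic attention

rollout = attention_rollout(attentions)
class_to_patches = rollout[0, 1:]    # attention from the first ([class]) token to the patch tokens
print(class_to_patches.round(3))
```

Summing the entries of `class_to_patches` that belong to each scale would give one aggregate attention value per image in the multi-scale representation.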