Learning Generalized Spatial-Temporal Deep Feature Representation for No-Reference Video Quality Assessment

by   Baoliang Chen, et al.
City University of Hong Kong

In this work, we propose a no-reference video quality assessment method, aiming to achieve high-generalization capability in cross-content, -resolution and -frame rate quality prediction. In particular, we evaluate the quality of a video by learning effective feature representations in spatial-temporal domain. In the spatial domain, to tackle the resolution and content variations, we impose the Gaussian distribution constraints on the quality features. The unified distribution can significantly reduce the domain gap between different video samples, resulting in a more generalized quality feature representation. Along the temporal dimension, inspired by the mechanism of visual perception, we propose a pyramid temporal aggregation module by involving the short-term and long-term memory to aggregate the frame-level quality. Experiments show that our method outperforms the state-of-the-art methods on cross-dataset settings, and achieves comparable performance on intra-dataset configurations, demonstrating the high-generalization capability of the proposed method.




I Introduction

There has been an increasing demand for accurately predicting the quality of videos, coinciding with the exponential growth of video data. In the context of video big data, it becomes extremely difficult and costly to rely solely on the human visual system to conduct timely quality assessment. As such, objective video quality assessment (VQA), the goal of which is to design computational models that automatically and accurately predict the perceived quality of videos, has become more prominent. According to the availability of the pristine reference video in the application scenario, the assessment of video quality can be categorized into full-reference VQA (FR-VQA), reduced-reference VQA (RR-VQA) and no-reference VQA (NR-VQA). Despite remarkable progress, the NR-VQA of real-world videos, which has received great interest due to its high practical utility, is still very challenging, especially when the videos are acquired, processed and compressed with diverse devices, environments and algorithms.

For NR-VQA, numerous methods have been proposed in the literature, and the majority of them rely on a machine learning pipeline in which a quality prediction model is trained with labeled data. Methods relying on handcrafted features [tu2020ugc, mittal2012no, zhang2015feature, mittal2015completely] and deep learning features [liu2018end, zhu2020metaiqa, fang2020perceptual, ren2017ran4iqa] have been developed, with the assumption that the training and testing data are drawn from closely aligned feature spaces. However, it is widely acknowledged that different distributions of training and testing data create the risk of poor generalization capability, and as a consequence, inaccurate predictions could be obtained on videos whose statistics differ dramatically from those in the training set. The underlying design principle of the proposed VQA method is learning features with high generalization capability, such that the model is able to deliver high quality prediction accuracy on videos that are not sampled from the domain of the training data. This aligns well with real application scenarios, in which the testing data are unknown. To verify the performance of our method, we conduct experiments on four cross-dataset settings with available databases, including KoNViD-1k [hosu2017konstanz], LIVE-Qualcomm [ghadiyaram2017capture], LIVE-VQC [sinno2018large] and CVD2014 [nuutinen2016cvd2014]. Experimental results demonstrate superior performance of our method over existing state-of-the-art models by a significant margin. The main contributions of this paper are as follows,

  • We propose an objective NR-VQA model that is capable of automatically assessing the perceptual quality of videos resulting from different acquisition, processing and compression techniques. The proposed model is driven by learning features that specifically characterize the quality, and is able to deliver high prediction accuracy for videos whose characteristics differ dramatically from the training data.

  • In the spatial domain, we develop a multi-scale feature extraction scheme to explore the quality features in different scales, and an attention module is further incorporated to adaptively weight the features by their importance. We further unify the quality features of each frame with a Gaussian distribution where the mean and variance of the distribution are learnable. As such, the domain gap of different video samples caused by the content and distortion types can be further reduced by such a normalization operation.

  • In the temporal domain, a pyramid temporal pooling layer is proposed to account for the quality aggregation in temporal domain. The pyramid temporal pooling can make temporal pooling independent of the number of frames of the input video and aggregate the short-term and long-term quality levels of a video in a pyramid manner, which further enhances the generalization ability of the proposed model.

II Related Works

II-A No-reference Image Quality Assessment

Generally speaking, general purpose no-reference image quality assessment (NR-IQA) methods, which do not require any prior information of distortion types, hold the assumption that the destruction of “naturalness” could be a useful clue in quality assessment. The so-called natural scene statistic (NSS) approaches rely on a series of handcrafted features extracted in both spatial and frequency domains. Mittal et al. [mittal2012no] investigated NSS features by exploiting the local spatial normalized luminance coefficients. Xue et al. [xue2014blind] combined the gradient magnitude (GM) and Laplacian of Gaussian (LoG) features, and the results show that the joint GM-LoG statistics could obtain desirable performance for the NR-IQA task. Gu et al. [gu2014using] proposed a general purpose NR-IQA metric by exploiting features that are highly correlated to human perception, including structural information and gradient magnitude. The Distortion Identification-based Image Verity and Integrity Evaluation (DIIVINE) method was developed by Moorthy et al. [moorthy2011blind] with a two-stage framework, which includes distortion identification and support vector regression (SVR) to predict quality scores for distorted natural images. Narwaria et al. quantified structural representation in images with the assistance of singular value decomposition (SVD), and formulated quality prediction as a regression problem to predict the image score using SVR. Another efficient NR-IQA method in [saad2012blind] explored the discrete cosine transform (DCT) domain statistics to predict perceptual quality. Zhang et al. [zhang2013no] designed the DErivative Statistics-based Image QUality Evaluator (DESIQUE), exploiting statistical features related to quality in spatial and frequency domains, which can be fitted by a generalized Gaussian distribution model to estimate image quality.

Recently, sophisticated deep learning based NR-IQA methods have been developed, demonstrating superior prediction performance over traditional methods. Zhang et al. [zhang2018blind] proposed a deep bilinear model for NR-IQA that is suitable for the quality assessment of synthetic and real distorted images. The bilinear model includes two convolutional neural networks (CNNs): S-CNN and pre-trained VGG, which account for the synthetic and real-world distortions, respectively. In view of the challenges in cross-distortion-scenario prediction, Zhang et al. [zhang2020learning] used massive image pairs composed of multiple databases simultaneously to train a unified blind image quality assessment model. The Neural IMage Assessment (NIMA) model [talebi2018nima], which tackles the problem of understanding visual aesthetics, was trained on the large-scale Aesthetic Visual Analysis (AVA) dataset [murray2012ava] to predict the distribution of quality ratings. Su et al. [su2020blindly] proposed an adaptive multi-scale hyper-network architecture, which consists of two modules: content understanding and quality prediction networks, to predict the quality score based on captured local and global distortions. Zhu et al. [zhu2020metaiqa] developed a reference-free IQA metric based on deep meta-learning, which can be easily adapted to unknown distortions by learning meta-knowledge shared by humans. Bosse et al. [bosse2017deep] proposed a data-driven end-to-end method for the FR and NR image quality assessment tasks simultaneously.

II-B No-reference Video Quality Assessment

Recently, considerable efforts have been dedicated to VQA, in particular for quantifying the compression and transmission artifacts. Manasa et al. [manasa2016optical] developed an NR-VQA model based on the statistics of optical flow. In particular, to capture the influence of distortion on optical flow, statistical irregularities of optical flow at patch level and frame level are quantified, which are further combined with the SVR to predict the perceptual video quality. Li et al. [li2015no] developed an NR-VQA method by combining the 3D shearlet transform and deep learning to pool the quality score. Video Multi-task End-to-end Optimized neural Network (V-MEON) [liu2018end] is an NR-VQA technique designed based on feature extraction with 3D convolutional layers. Such spatial-temporal features could lead to better quality prediction performance. Korhonen et al. [korhonen2019two] extracted Low Complexity Features (LCF) from full video sequences and High Complexity Features (HCF) from key frames, following which SVR is used to predict the video score. Vega et al. [vega2017deep] focused on packet loss effects for video streaming settings, and an unsupervised learning based model is employed at the video server (offline) and the client (in real-time). In [li2019quality], Li et al. integrated both content and temporal-memory in the NR-VQA model, and the gated recurrent unit (GRU) is used for long-term temporal feature extraction. You et al. [you2019deep] used a 3D convolution network to extract local spatial-temporal features from small clips in the video. This not only addresses the problem of insufficient training data, but also effectively captures the perceptual quality features, which are finally fed into the LSTM network to predict the perceived video quality.

Fig. 1: The framework of the proposed generalized NR-VQA model. For each frame of the input video, we first utilize the pre-trained VGG16 [simonyan2014very] network to extract the multi-scale features with an attention module. Subsequently, the extracted features are further processed by a fully connected layer to reduce its dimension, followed by a GRU module to acquire the frame-level quality features. We further regularize the frame-level quality features by enforcing the features to be subject to Gaussian distributions via adversarial learning. Finally, a pyramid pooling strategy is utilized for temporal quality aggregation inspired by short-term and long-term memory effects.

II-C Domain Generalization

The VQA problem also suffers from the domain gap between the labeled training data (source domain) and unseen testing data (target domain), so that a model trained on the labeled data cannot generalize well to the unseen data. These feature gaps may originate from different resolutions, scenes, acquisition devices/conditions and processing/compression artifacts. Over the past years, numerous attempts have been made to address the domain generalization problem by learning domain-invariant representations [muandet2013domain, erfani2016robust, xie2017controllable, ghifary2015domain, xu2014exploiting, motiian2017unified], which lead to promising results. In [andrew2013deep], Canonical Correlation Analysis (CCA) was proposed to learn the shareable information among domains. Muandet et al. [muandet2013domain] proposed to leverage Domain Invariant Component Analysis (DICA) to minimize the distribution mismatch across domains. In [carlucci2019domain], Carlucci et al. learn the generalized representation by shuffling the image patches, and this idea was further extended by [wang2020heterogeneous], in which the samples across multiple source domains are mixed for the heterogeneous domain generalization task. The generalization of adversarial training [goodfellow2014explaining, sinha2017certifiable] has also been extensively studied. For example, Li et al. [li2018domain] proposed the MMD-AAE model, which extends adversarial autoencoders by imposing the Maximum Mean Discrepancy (MMD) measure to align the distributions among different domains. In our work, instead of training domain classifiers, which is hindered by sample complexity [schmidt2018adversarially] and uncontrolled conditions (scenes, distortion types, motion, resolutions, etc.), we regularize the learned feature to follow a Gaussian distribution via adversarial training, shrinking the learned feature mismatch across domains.

III The Proposed Scheme

We aim to learn an NR-VQA model with high generalization capability for real-world applications. Generally speaking, three intrinsic attributes that govern the generalization capability of VQA are considered, including spatial resolution, frame rate and video content (e.g., the captured scenes and the distortion type). As shown in Fig. 1, we first extract the frame-level quality features with a pretrained VGG16 model [simonyan2014very], inspired by the repeatedly proven evidence that such features could reliably reflect the visual quality [zhang2018blind, ding2020image, li2019quality, li2017image]. To encode the generalization capability to different spatial resolutions into the feature representation, statistical pooling moments are leveraged and the features in the five convolution stages (from top layer to bottom layer) are aggregated with the channel attention. To further enhance the generalization capability to unseen domains, the large distribution gap between the source and target domains is blindly compensated by regularizing the learned quality feature into a unified distribution. In the temporal domain, a pyramid aggregation module is further proposed, leading to the final quality features for quality prediction.

III-A Attention Based Multi-scale Feature Extraction

Herein, the feature representation that is equipped with strong generalization capability in terms of the spatial resolution of a single frame is obtained based on the pretrained VGG ConvNets. It is widely acknowledged that the pooling moments determine the discriminability of features, and we adopt the mean and standard deviation (std) based pooling strategies. In particular, for frame $i$, supposing the mean pooling and std pooling results of the output feature at stage $s$ ($s \in \{1, \dots, 5\}$) as $M_i^s$ and $D_i^s$ respectively, the multi-scale quality representations can be acquired by concatenating the pooled features at each stage as follows,

$$F_i^{m} = \mathrm{Concat}\left(M_i^1, M_i^2, \dots, M_i^5\right), \qquad F_i^{d} = \mathrm{Concat}\left(D_i^1, D_i^2, \dots, D_i^5\right),$$

where $F_i^{m}$ and $F_i^{d}$ stand for the multi-scale mean feature and std feature of frame $i$. However, it may not be feasible to concatenate the two pooled features straightforwardly for quality regression, due to the high relevance of $F_i^{m}$ with the semantic information [wan2019information]. As a result, the learned model tends to overfit to the specific scenes in the training set. Here, instead of discarding $F_i^{m}$, as shown in Fig. 2, $F_i^{m}$ is regarded as the semantically meaningful feature working as the integral part in the attention based multi-scale feature extraction. To be specific, for $N$ frames, given $F_i^{m}$, we first calculate the std of each channel along the temporal dimension as follows,

$$\bar{F}^{m} = \frac{1}{N}\sum_{i=1}^{N} F_i^{m}, \qquad S^{m} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(F_i^{m} - \bar{F}^{m}\right)^{2}},$$

where the frame index is denoted as $i$ and the square and square root are taken per channel. Given $S^{m}$, two fully connected layers are learned to implement the attention mechanism, as shown in Fig. 2,

$$W = f_2\left(f_1\left(S^{m}\right)\right),$$

where $f_1$ and $f_2$ represent the two fully connected layers. The underlying principle is that the attention weight in each channel depends on the corresponding variance along the temporal domain, which is highly relevant to the video content variations. As such, the nested pooling with spatial mean and temporal std could provide the attention map by progressively encoding the spatial and temporal variations into a global descriptor. Then the frame-specific quality representation $F_i$ can be obtained by $F_i^{d}$ and its attention weight $W$ as follows,

$$F_i = W \odot F_i^{d},$$

where “$\odot$” represents the element wise multiplication.
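The nested pooling described above (spatial mean and std per frame, then temporal std of the mean feature, then channel-wise weighting of the std feature) can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: the function names are hypothetical, the multi-scale concatenation is represented simply as a list of channels, and `placeholder_attention` is a fixed sigmoid standing in for the two learned fully connected layers, which are not reproduced here.

```python
import math


def mean_std(values):
    """Population mean and standard deviation of a list of floats."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


def spatial_pool(frame):
    """Pool one frame spatially.

    `frame` is a list of channels; each channel is a flat list of activations.
    Returns the per-channel mean vector and the per-channel std vector.
    """
    stats = [mean_std(channel) for channel in frame]
    return [m for m, _ in stats], [s for _, s in stats]


def temporal_std(mean_features):
    """Std of every channel of the mean-pooled features along the frame axis."""
    n_channels = len(mean_features[0])
    return [mean_std([f[c] for f in mean_features])[1] for c in range(n_channels)]


def placeholder_attention(descriptor):
    """Assumption: a plain sigmoid standing in for the two learned FC layers."""
    return [1.0 / (1.0 + math.exp(-d)) for d in descriptor]


def frame_quality_features(video):
    """Weight each frame's std-pooled feature by the video-level attention."""
    pooled = [spatial_pool(frame) for frame in video]
    means = [m for m, _ in pooled]
    stds = [s for _, s in pooled]
    weights = placeholder_attention(temporal_std(means))
    return [[w * s for w, s in zip(weights, frame_std)] for frame_std in stds]


# Toy video: two frames, two channels, two activations per channel.
video = [
    [[1.0, 3.0], [2.0, 2.0]],
    [[2.0, 6.0], [2.0, 2.0]],
]
features = frame_quality_features(video)
```

In this toy input the second channel never changes over time, so its temporal std is zero; the attention weight is shared by all frames of the video, while the std-pooled feature it multiplies is frame specific.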

Fig. 2: Illustration of the attention module for feature extraction.

III-B Feature Regularization with Gaussian Distribution

Given the frame-level quality feature $F_i$, the Gated Recurrent Unit (GRU) [cho2014learning] layer is utilized to refine the frame-level feature by involving the temporal information. In particular, we use a fully connected layer (denoted as $f_r$) to reduce the redundancies of the VGG feature, following which the resultant feature is processed by a GRU layer,

$$h_i = \mathrm{GRU}\left(f_r(F_i)\right).$$

However, we argue that $h_i$ is still not generalized enough to different scenes and distortion types. To enhance the generalization capability of $h_i$, we resort to feature regularization, expecting to learn the quality feature with a unified distribution. The underlying assumption of generalizing to an unseen domain is that there exists a discrete attribute separating the data into different domains. However, a naïve extension to VQA may be confused by numerous discrete or continuous attributes (e.g., scene, distortion type, motion, resolution) for domain classification. As such, instead of dividing the data into different domains, we restrict the frame-level feature $h_i$ to be subject to a mixture Gaussian distribution by a GAN based model, and moreover the mean and variance of the presumed Gaussian distribution can also be adaptively learned. To be specific, as shown in Fig. 1, we first average the extracted $h_i$ of each frame as follows,

$$\bar{h} = \frac{1}{N}\sum_{i=1}^{N} h_i.$$

Herein, we treat the feature extractor as the generator $G$ of a GAN model and we sample a vector of the same dimension (denoted as $z$) from the prior Gaussian distribution as reference. Then the discriminator $D$ tries to distinguish the generated feature from the sampled vector. The GAN model is trained through the following adversarial loss,

$$\min_{G}\max_{D}\; \mathbb{E}_{z \sim p(z)}\left[\log D(z)\right] + \mathbb{E}_{x}\left[\log\left(1 - D(G(x))\right)\right],$$

where $z$ is the vector sampled from the Gaussian distribution $p(z)$, $x$ is the input video and $G$ generates the feature $\bar{h}$. When the network is trained in the first epochs, we constrain $p(z)$ to be the standard Gaussian distribution with mean $0$ and variance $1$. However, this imposes a strong constraint that the features in each dimension share the Gaussian distribution with identical mean and variance. Generally speaking, each dimension of the feature is expected to represent a perceptually relevant attribute for quality inference, such that they ultimately follow different Gaussian distributions parameterized by different $\mu$ and $\sigma$. This motivates us to adapt the mean and variance of the prior Gaussian distribution of each dimension via learning. More specifically, to learn the parameters $\mu \in \mathbb{R}^{L}$ and $\sigma \in \mathbb{R}^{L}$, where $L$ is the dimension of $\bar{h}$, we impose the constraint on $\bar{h}$ to regress the quality score


Here, we use to represent the predicted quality score of the input video, and we aim to regress towards the ground-truth mean opinion score (MOS) via learning the optimal and . Moreover, indicates the dimension. During the training of the network, after every epochs, we use the Gaussian distribution with learned and to replace the distribution in previous epochs. From the experimental results, we also find such an adaptive refreshing mechanism can further improve the performance of our model compared with standard Gaussian distribution.

III-C Pyramid Feature Aggregation

Temporal domain aggregation plays an indispensable role in objective VQA models. We consider two cognitive mechanisms in visual quality perception [hochreiter1997long, zhang2016video]. The short-term memory effect persuades us to consider the video quality for each localized time-frame, due to the consensus that subjects are resistant in their opinion and prefer consistent quality when watching the video. Moreover, the long-term memory effect suggests that the global pooling over the whole video sequence in a coarse-to-fine manner could lead to the final video quality. Therefore, we imitate such perception mechanisms with a pyramid feature aggregation (PFA) strategy. In the PFA, the short-term memory and long-term memory are incorporated and the aggregation result is independent of the number of frames. More specifically, as illustrated in Fig. 3, in the bottom layer of the pyramid, for , we calculate its weight by synthesizing it with its surrounding frames,


where the and are two 1D-CNNs and their kernel sizes are all set to . Moreover, and

are the activation functions, and

is defined as follows,


Then the weighted frame-level quality feature can be acquired,


Subsequently, the weighted frame-level features along the temporal dimension are aggregated in a pyramid manner. In general, the perceivability along the temporal dimension determines the sampling density governed by the number of layers. Herein, we empirically set the number of layers with a constant number 7. To be specific, for the layer (), the weighted frame-level features are aggregated into a vector with the dimension , where denotes the feature dimension in . In other words, the video is averagely divided into time slots, and within each time slot, average feature pooling is performed for aggregation. Finally, we concatenate the aggregated features of all layers, leading to the video-level quality feature with a constant dimension that is independent of the number of frames and frame rate, . We first apply a fully connected layer () to reduce the channel from to 1, then another fully connected layer () is adopted to synthesize the pyramid aggregated features. As such, the quality of input videos can be predicted as follows,


where is the prediction score. This strategy provides more flexibility than single layer aggregation by incorporating the variations along the temporal dimension.
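The slot-wise averaging behind the pyramid aggregation can be illustrated with a short plain-Python sketch. It is not the authors' implementation: the function name is hypothetical, the learned frame weights and the two fully connected layers are omitted, and the number of time slots per layer is an assumption here (layer $k$ is split into $k$ slots), since the exact slot count is not recoverable from the text above. The point it illustrates is that the output size depends only on the number of layers, not on the number of frames.

```python
def pyramid_aggregate(frame_features, num_layers=7):
    """Average frame features inside equal time slots, layer by layer.

    `frame_features` is a list of T per-frame vectors of equal length C.
    Assumption: layer k (1-based) divides the video into k time slots.
    Returns the concatenated list of slot averages, each a vector of length C.
    """
    num_frames = len(frame_features)
    dim = len(frame_features[0])
    pooled = []
    for k in range(1, num_layers + 1):
        for slot in range(k):
            start = (slot * num_frames) // k
            end = ((slot + 1) * num_frames) // k
            end = max(end, start + 1)  # keep at least one frame in very short videos
            chunk = frame_features[start:end]
            pooled.append(
                [sum(frame[c] for frame in chunk) / len(chunk) for c in range(dim)]
            )
    return pooled


# Two videos with different lengths yield the same number of slot vectors.
short_video = [[float(t)] for t in range(10)]
long_video = [[float(t)] for t in range(300)]
short_out = pyramid_aggregate(short_video)
long_out = pyramid_aggregate(long_video)
```

Under this assumption, seven layers give 1 + 2 + ... + 7 = 28 slot vectors for both videos; the first slot is the global average and deeper layers describe progressively shorter time spans.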

Fig. 3: Illustration of PFA module.

III-D Objective Loss Function

The final loss function involves the frame-level and video-level quality regression results acquired in Eqn. (9) and Eqn. (13), as well as the distribution based feature regularization,




Herein, and are two trade-off parameters. In the testing phase, we use the as the final quality score that our model predicts.

IV Experimental Results

IV-A Experimental Setups

IV-A1 Datasets

To validate the proposed method, we evaluate our model on four datasets including KoNViD-1k [hosu2017konstanz], LIVE-Qualcomm [ghadiyaram2017capture], LIVE-VQC [sinno2018large] and CVD2014 [nuutinen2016cvd2014].

CVD2014. In this dataset, 78 different cameras, ranging from low-quality phone cameras to dedicated digital single-lens reflex cameras, are used to capture 234 videos. In particular, five unique scenes (traffic, city, talking head, newspaper and television) are covered by these videos at two resolutions, 480P (640×480) and 720P (1280×720).

LIVE-VQC. Videos in this dataset are acquired by 80 inexperienced mobile camera users, leading to a variety of authentic distortions levels. There are in total 585 video scenes in this dataset, containing 18 different resolutions ranging from to .

LIVE-Qualcomm. This dataset consists of 208 videos in total, which are recorded by 8 different mobile phones in 54 different scenes. Six common in-capture distortion categories are studied in this database: noise and blockiness distortions; incorrect or insufficient color representation; over/under-exposure; autofocus related distortions; unsharpness; and camera shaking. All these sequences have an identical resolution of 1080P and very similar frame rates.

KoNViD-1k. KoNViD-1k is the largest VQA dataset which contains in total 1200 video sequences. These videos are sampled from YFCC100m [thomee2016yfcc100m] dataset. Various devices are used to acquire these videos, leading to 12 different resolutions. A portion of the videos in the dataset are acquired by professional photographers, such that there is a large variance in terms of the video quality.

In Fig. 4, the sampled frames from above four datasets are shown, from which we can observe that these videos are featured by diverse scenes (e.g., indoors and outdoors), resolutions (from to ) as well as quality levels. In view of the diverse content, resolutions and frame rates in real-world applications, there has been an exponential increase in the demand for the development of VQA models with high generalization capability.

Fig. 4: Sample frames from four video datasets. The corresponding resolution () and values are also provided.
Module               Layer     Kernel size  Channels (in, out)  Stride
VGG16 Backbone       Conv ×2   3            (3, 64)             1
                     Conv ×2   3            (64, 128)           1
                     Conv ×3   3            (128, 256)          1
                     Conv ×3   3            (256, 512)          1
                     Conv ×3   3            (512, 512)          1
Attention module     FC        -            (1472, 320)         -
Pyramid Aggregation  Conv1D    15           (32, 1)             1
                     Conv1D    15           (1, 1)              1
TABLE I: Architecture of the network in the proposed method.
Training on KoNViD-1k
                   CVD2014                   LIVE-Qualcomm             LIVE-VQC
Type    Method     SROCC   PLCC    KROCC     SROCC   PLCC    KROCC     SROCC   PLCC    KROCC
NR-IQA  NIQE       0.3856  0.4410  0.2681    0.1807  0.1672  0.1196    0.4573  0.4025  0.3154
        BRISQUE    0.4626  0.5060  0.3238    0.3061  0.3303  0.2071    0.5805  0.5788  0.4089
        WaDIQaM    0.6988  0.7151  0.5081    0.4926  0.5471  0.3545    0.6461  0.6797  0.4634
        NIMA       0.5446  0.5836  0.3818    0.3413  0.4011  0.2003    0.5642  0.6204  0.3932
        SPAQ       0.6188  0.6151  0.4339    0.1879  0.2374  0.1274    0.4653  0.5202  0.3202
NR-VQA  VSFA       0.6278  0.6216  0.4489    0.5574  0.5769  0.3966    0.6792  0.7198  0.4905
        TLVQM      0.3569  0.3838  0.2442    0.4730  0.5127  0.3290    0.5953  0.6248  0.4268
        VIDEVAL    0.6494  0.6638  0.4684    0.4048  0.4351  0.2758    0.5318  0.5329  0.3685
        Ours       0.7972  0.7984  0.5891    0.6200  0.6666  0.4445    0.6797  0.7327  0.4864

Training on LIVE-Qualcomm
                   KoNViD-1k                 CVD2014                   LIVE-VQC
Type    Method     SROCC   PLCC    KROCC     SROCC   PLCC    KROCC     SROCC   PLCC    KROCC
NR-IQA  NIQE       0.4564  0.3619  0.3148    0.3856  0.4410  0.2681    0.4573  0.4025  0.3154
        BRISQUE    0.4370  0.4274  0.2983    0.4626  0.5060  0.3238    0.5805  0.5788  0.4089
        WaDIQaM    0.3671  0.3510  0.2538    0.3189  0.3255  0.2189    0.5385  0.5377  0.3756
        NIMA       0.2877  0.2588  0.1948    0.2705  0.2768  0.1842    0.3401  0.3711  0.2306
        SPAQ       0.1330  0.1541  0.0898    0.1663  0.1508  0.1116    0.2854  0.3122  0.1926
NR-VQA  VSFA       0.6643  0.6716  0.4769    0.5348  0.5606  0.3751    0.6425  0.6819  0.4613
        TLVQM      0.0347  0.0467  0.0205    0.4893  0.4721  0.3361    0.4091  0.3559  0.2763
        VIDEVAL    0.1812  0.3441  0.1113    0.6059  0.6244  0.4246    0.4314  0.4122  0.2931
        Ours       0.6694  0.6258  0.4847    0.7046  0.6665  0.5115    0.6201  0.6100  0.4397

Training on LIVE-VQC
                   KoNViD-1k                 CVD2014                   LIVE-Qualcomm
Type    Method     SROCC   PLCC    KROCC     SROCC   PLCC    KROCC     SROCC   PLCC    KROCC
NR-IQA  NIQE       0.4564  0.3619  0.3148    0.3856  0.4410  0.2681    0.1807  0.1672  0.1196
        BRISQUE    0.4370  0.4274  0.2983    0.4626  0.5060  0.3238    0.3061  0.3303  0.2071
        WaDIQaM    0.4352  0.4451  0.2997    0.5362  0.5417  0.3666    0.4049  0.4207  0.2760
        NIMA       0.5848  0.5988  0.4105    0.3532  0.3835  0.2427    0.3106  0.3362  0.2098
        SPAQ       0.3542  0.3468  0.2048    0.5494  0.4982  0.3837    0.2714  0.3235  0.1811
NR-VQA  VSFA       0.6584  0.6666  0.4751    0.5061  0.5415  0.3623    0.5094  0.5350  0.3551
        TLVQM      0.6023  0.5943  0.4289    0.4553  0.4749  0.3134    0.6415  0.6534  0.4599
        VIDEVAL    0.5007  0.4841  0.3422    0.5702  0.5171  0.4125    0.3021  0.3602  0.2064
        Ours       0.7085  0.7074  0.5179    0.6894  0.6645  0.4888    0.5952  0.6245  0.4285

Training on CVD2014
                   KoNViD-1k                 LIVE-Qualcomm             LIVE-VQC
Type    Method     SROCC   PLCC    KROCC     SROCC   PLCC    KROCC     SROCC   PLCC    KROCC
NR-IQA  NIQE       0.4564  0.3619  0.3148    0.1807  0.1672  0.1196    0.4573  0.4025  0.3154
        BRISQUE    0.4370  0.4274  0.2983    0.3061  0.3303  0.2071    0.5805  0.5788  0.4089
        WaDIQaM    0.4981  0.4825  0.3456    0.2863  0.3305  0.1906    0.4598  0.5086  0.3222
        NIMA       0.3142  0.3013  0.2120    0.0294  0.0628  0.0189    0.2769  0.2933  0.1857
        SPAQ       0.3253  0.3335  0.2209    0.1523  0.1951  0.0996    0.3619  0.4066  0.2482
NR-VQA  VSFA       0.5759  0.5636  0.4108    0.3256  0.3718  0.2192    0.4600  0.4783  0.3171
        TLVQM      0.5437  0.5052  0.3758    0.3334  0.3838  0.2279    0.5397  0.5527  0.3803
        VIDEVAL    0.1918  0.3260  0.1220    0.1208  0.3315  0.0809    0.4751  0.5167  0.3192
        Ours       0.6069  0.5942  0.4345    0.5316  0.5827  0.3713    0.5872  0.5986  0.4138

TABLE II: Performance comparisons on four datasets with cross-dataset settings. In each column, the best and second-best values are marked in boldface and underlined, respectively.

IV-A2 Implementation details

We implement our model in PyTorch. In Table I, we detail the layer-wise network of our proposed method. In particular, we retain the original size of each frame as input without the resizing operation. The VGG-16 network is pretrained on ImageNet
[deng2009imagenet] and we fix its parameters when training. The batch size in the training phase is 128 and we adopt Adam optimizer for optimization. The learning rate is fixed to 1e-4. The weighting parameters , in Eqn. (14) are set as 0.5 and 0.05, respectively. In each setting, we fix the maximum epoch as and the model learned at the latest epoch will be used for testing. For every 20 epochs (), we renew the mean and variance of the predefined distribution in Eqn. (8

). It is worth mentioning that all the experimental settings (hyper-parameters and learning strategy) are fixed. Four evaluation metrics are reported in this paper, including: Spearman’s rank-order correlation coefficient (SROCC), Kendall’s rank-order correlation coefficient (KROCC), Pearson linear correlation coefficient (PLCC), and root mean square error (RMSE). As suggested in [video2000final], the predicted quality scores are passed through a nonlinear logistic mapping function before computing PLCC and RMSE,


where are regression parameters to be fitted.
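For readers who want to reproduce the four evaluation metrics, the following plain-Python sketch computes SROCC, KROCC, PLCC and RMSE from two score lists. The function names are hypothetical, KROCC is computed here as Kendall's tau-b (an assumption about the tie handling), and the nonlinear logistic mapping applied before PLCC and RMSE is deliberately left out, so the sketch operates on whatever scores it is given.

```python
import math


def plcc(x, y):
    """Pearson linear correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def srocc(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return plcc(average_ranks(x), average_ranks(y))


def krocc(x, y):
    """Kendall rank-order correlation (tau-b variant, assumed here)."""
    concordant = discordant = ties_x = ties_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denom


def rmse(predicted, mos):
    """Root mean square error between predicted scores and MOS."""
    return math.sqrt(sum((p - m) ** 2 for p, m in zip(predicted, mos)) / len(mos))


predicted = [1.0, 2.0, 3.0, 4.0]
mos = [1.0, 2.0, 4.0, 3.0]
scores = (srocc(predicted, mos), krocc(predicted, mos), plcc(predicted, mos), rmse(predicted, mos))
```

SROCC and KROCC depend only on the ordering of the scores, which is why they measure prediction monotonicity, whereas PLCC and RMSE depend on the actual values and therefore on the mapping applied to the predictions.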

IV-B Quality Prediction Performance

In this subsection, we evaluate the performance of our method with four different cross-dataset settings to verify the generalization capability. We compare the proposed method with both NR-IQA methods including NIQE [mittal2012making], BRISQUE [mittal2012no], WaDIQaM [bosse2017deep], NIMA [talebi2018nima], SPAQ [fang2020perceptual] and NR-VQA methods including VSFA [li2019quality], TLVQM [korhonen2019two], VIDEVAL [tu2020ugc]. In each setting, the models are trained on one dataset and tested on the other three datasets. For deep learning based NR-IQA models, we extract two frames per second from each video in the training set, and the MOS of the video is treated as the quality score of the extracted frames for model training. The results are shown in Table II, from which we can find that our method achieves the best SROCC in ten of the twelve train/test pairs, which reveals the superior generalization ability of our proposed method. The two exceptions are training on LIVE-Qualcomm and testing on LIVE-VQC, where VSFA is better, and training on LIVE-VQC and testing on LIVE-Qualcomm, where TLVQM is better. Compared with NR-VQA methods, the overall performance of NR-IQA methods is not satisfactory, as the temporal information is discarded. However, even the VQA based methods cannot achieve very promising performance in such challenging settings. For example, when VIDEVAL is trained on the LIVE-Qualcomm dataset, its SROCC is 0.6059 on the CVD2014 dataset but degrades significantly to 0.1812 on the KoNViD-1k dataset, which further demonstrates the large domain gap between the two datasets. As shown in Table II, training on the CVD2014 dataset and cross-testing on the other three datasets is the most challenging setting, as only 234 videos and 5 scenes are involved in CVD2014. The limited data cause over-fitting. However, our method still achieves the highest SROCC on all three test sets in this setting, demonstrating the robustness and promising generalization capability of our method.

IV-C Quality Prediction Performance on Intra-dataset

In this subsection, to further verify the effectiveness of our method, we evaluate it with intra-dataset settings on LIVE-Qualcomm, KoNViD-1k and CVD2014. We compare the proposed method with six state-of-the-art methods: BRISQUE [mittal2012no], NIQE [mittal2012making], CORNIA [ye2012unsupervised], VIIDEO [mittal2015completely], VIDEVAL [tu2020ugc] and VSFA [li2019quality]. More specifically, for each dataset, 80% of the data are used for training and the remaining 20% for testing. This procedure is repeated 10 times, and the mean and standard deviation of the performance values are reported in Table III. From Table III, we can observe that our method still achieves the best overall performance in terms of both prediction monotonicity (SROCC, KROCC) and prediction accuracy (PLCC, RMSE). In particular, on the most challenging dataset, LIVE-Qualcomm, our method achieves a 7.2% SROCC improvement over the second-best method, VSFA. Although our method ranks second on the CVD2014 dataset, its performance remains comparable with that of the state-of-the-art method VSFA and shows a large gain over the other methods. These results indicate that our method attains superior generalization capability without sacrificing performance in intra-dataset settings.
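The repeated-split protocol can be sketched as below. This is a generic illustration, not the authors' evaluation script: the `train_and_score` callback, the random seed, the use of the sample standard deviation and the toy data are all assumptions, and the placeholder "model" simply uses the raw feature as its prediction.

```python
import random
import statistics

def ranks(values):
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def srocc(pred, mos):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(ranks(pred), ranks(mos))

def repeated_split_eval(samples, train_and_score, runs=10, train_ratio=0.8, seed=0):
    """Repeat a random train/test split and return (mean, std) of the scores."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        cut = int(round(train_ratio * len(shuffled)))
        scores.append(train_and_score(shuffled[:cut], shuffled[cut:]))
    return statistics.fmean(scores), statistics.stdev(scores)

def toy_train_and_score(train, test):
    # Placeholder: no real training; the feature itself is the prediction.
    return srocc([f for f, _ in test], [m for _, m in test])

# Hypothetical (feature, MOS) pairs with a perfectly monotonic relation.
toy = [(x, 2.0 * x + 1.0) for x in range(50)]
mean_srocc, std_srocc = repeated_split_eval(toy, toy_train_and_score)
print(mean_srocc, std_srocc)
```

Reporting the standard deviation alongside the mean matters most for the smaller datasets, where a single 20% test split contains only a few dozen videos.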

| Dataset | Metric | BRISQUE | NIQE | CORNIA | VIIDEO | VIDEVAL | VSFA | Ours |
|---|---|---|---|---|---|---|---|---|
| Overall | SROCC | 0.643 (± 0.059) | 0.526 (± 0.055) | 0.591 (± 0.052) | 0.237 (± 0.073) | 0.686 (± 0.035) | 0.771 (± 0.028) | 0.811 (± 0.017) |
| | KROCC | 0.465 (± 0.047) | 0.369 (± 0.041) | 0.423 (± 0.043) | 0.164 (± 0.050) | 0.503 (± 0.032) | 0.582 (± 0.029) | 0.620 (± 0.020) |
| | PLCC | 0.625 (± 0.053) | 0.542 (± 0.054) | 0.595 (± 0.051) | 0.218 (± 0.070) | 0.660 (± 0.037) | 0.762 (± 0.031) | 0.817 (± 0.017) |
| | RMSE | 3.895 (± 0.380) | 4.214 (± 0.323) | 4.139 (± 0.300) | 5.115 (± 0.285) | 3.753 (± 0.365) | 3.074 (± 0.448) | 2.832 (± 0.399) |
| LIVE-Qualcomm | SROCC | 0.504 (± 0.147) | 0.463 (± 0.105) | 0.460 (± 0.130) | 0.127 (± 0.137) | 0.566 (± 0.078) | 0.737 (± 0.045) | 0.790 (± 0.015) |
| | KROCC | 0.365 (± 0.111) | 0.328 (± 0.088) | 0.324 (± 0.104) | 0.082 (± 0.099) | 0.405 (± 0.074) | 0.552 (± 0.047) | 0.594 (± 0.009) |
| | PLCC | 0.516 (± 0.127) | 0.464 (± 0.136) | 0.494 (± 0.133) | -0.001 (± 0.106) | 0.568 (± 0.089) | 0.732 (± 0.036) | 0.792 (± 0.033) |
| | RMSE | 10.731 (± 1.335) | 10.858 (± 1.013) | 10.759 (± 0.939) | 12.308 (± 0.881) | 10.760 (± 1.231) | 8.863 (± 1.042) | 7.605 (± 1.015) |
| KoNViD-1k | SROCC | 0.654 (± 0.042) | 0.544 (± 0.040) | 0.610 (± 0.034) | 0.298 (± 0.052) | 0.695 (± 0.024) | 0.755 (± 0.025) | 0.810 (± 0.014) |
| | KROCC | 0.473 (± 0.034) | 0.379 (± 0.029) | 0.436 (± 0.029) | 0.207 (± 0.035) | 0.509 (± 0.020) | 0.562 (± 0.022) | 0.622 (± 0.167) |
| | PLCC | 0.626 (± 0.041) | 0.546 (± 0.038) | 0.608 (± 0.032) | 0.303 (± 0.049) | 0.658 (± 0.025) | 0.744 (± 0.029) | 0.814 (± 0.010) |
| | RMSE | 0.507 (± 0.031) | 0.536 (± 0.010) | 0.509 (± 0.014) | 0.610 (± 0.012) | 0.483 (± 0.011) | 0.469 (± 0.054) | 0.386 (± 0.213) |
| CVD2014 | SROCC | 0.709 (± 0.067) | 0.489 (± 0.091) | 0.614 (± 0.075) | 0.023 (± 0.122) | 0.746 (± 0.056) | 0.880 (± 0.030) | 0.831 (± 0.040) |
| | KROCC | 0.518 (± 0.060) | 0.358 (± 0.064) | 0.441 (± 0.058) | 0.021 (± 0.081) | 0.562 (± 0.057) | 0.705 (± 0.044) | 0.634 (± 0.053) |
| | PLCC | 0.715 (± 0.048) | 0.593 (± 0.065) | 0.618 (± 0.079) | -0.025 (± 0.144) | 0.753 (± 0.053) | 0.885 (± 0.031) | 0.850 (± 0.053) |
| | RMSE | 15.197 (± 1.325) | 17.168 (± 1.318) | 16.871 (± 1.200) | 21.822 (± 1.152) | 14.292 (± 1.413) | 11.287 (± 1.943) | 11.135 (± 1.875) |
TABLE III: Performance comparisons on three VQA datasets with intra-dataset settings. The mean and standard deviation (std) of the performance values over 10 runs are reported. The overall performance is the weighted average of the performance values over all three databases, where the weights are proportional to the size of each dataset. In each row, the best and second-best values are marked in boldface and underlined, respectively.
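The size-weighted averaging behind the "Overall" rows can be checked with a short calculation. The dataset sizes below are assumptions: 234 videos for CVD2014 is stated in the text, while 208 for LIVE-Qualcomm and 1200 for KoNViD-1k are the commonly cited sizes and are not given in this section. The per-dataset SROCC means are copied from Table III.

```python
# Assumed dataset sizes (number of videos); see the note above.
SIZES = {"LIVE-Qualcomm": 208, "KoNViD-1k": 1200, "CVD2014": 234}

def weighted_overall(per_dataset, sizes=SIZES):
    """Average per-dataset scores with weights proportional to dataset size."""
    total = sum(sizes[name] for name in per_dataset)
    return sum(per_dataset[name] * sizes[name] for name in per_dataset) / total

# Mean SROCC values from Table III.
ours = {"LIVE-Qualcomm": 0.790, "KoNViD-1k": 0.810, "CVD2014": 0.831}
vsfa = {"LIVE-Qualcomm": 0.737, "KoNViD-1k": 0.755, "CVD2014": 0.880}

print("Ours overall SROCC:", weighted_overall(ours))
print("VSFA overall SROCC:", weighted_overall(vsfa))
```

Under these assumed sizes, the recomputed values agree with the reported overall SROCC figures (0.811 and 0.771) to within about 0.001; small differences are expected because the inputs are already rounded to three decimals. Since KoNViD-1k accounts for most of the videos, the overall row is dominated by the KoNViD-1k results.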

IV-D Ablation Study

In this subsection, we perform an ablation analysis to reveal the functionality of the different modules in the proposed method. The experiments are conducted in a cross-dataset setting (training on KoNViD-1k and testing on the other three datasets). As shown in Table IV, the performance is reported in terms of SROCC and PLCC. To assess the effectiveness of the attention module used in multi-scale feature extraction, we directly concatenate the mean and std pooling features without applying attention and keep the remaining parts unchanged for training. This model is denoted as Concat in Table IV, where we can observe that the performance on all testing sets is degraded, especially on the LIVE-Qualcomm dataset. A similar phenomenon can be observed when the pyramid pooling module is ablated (denoted as Ours PyramidPooling in Table IV). The reason is that the videos in the LIVE-Qualcomm dataset challenge both human subjects and objective VQA models, as indicated in [ghadiyaram2017capture]. As such, a more dedicated design in both the spatial and temporal domains is desired. Subsequently, we remove the Gaussian distribution regularization module from the original model, leading to a model denoted as Ours Distribution. From the results, we can find that both the SROCC and PLCC are degraded compared with our original method (denoted as Ours), which demonstrates that regularization of the feature space is also important for a generalized VQA model.
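The Concat baseline can be illustrated with a small sketch of mean and std pooling followed by plain concatenation. The data layout (each scale as a list of channels, each channel a flat list of spatial activations), the use of the population standard deviation and the function names are assumptions for illustration; the actual network features and attention module are not reproduced here.

```python
import statistics

def mean_std_pool(feature_map):
    """Pool each channel over its spatial positions.

    feature_map: list of channels, each a flat list of activations.
    Returns the per-channel means followed by the per-channel stds.
    """
    means = [statistics.fmean(channel) for channel in feature_map]
    stds = [statistics.pstdev(channel) for channel in feature_map]
    return means + stds

def concat_multiscale(feature_maps):
    """Concatenate pooled features from several scales with no attention weighting."""
    pooled = []
    for feature_map in feature_maps:
        pooled.extend(mean_std_pool(feature_map))
    return pooled

# Hypothetical two-scale input with two channels per scale.
scale_1 = [[1.0, 1.0, 1.0, 1.0], [0.0, 2.0, 0.0, 2.0]]
scale_2 = [[3.0, 5.0], [4.0, 4.0]]
frame_feature = concat_multiscale([scale_1, scale_2])
print(frame_feature)
```

In this baseline every pooled statistic contributes with equal weight, whereas the attention module in the full model can reweight them, which is the difference the Concat row in Table IV is meant to isolate.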

| Method | Metric | CVD2014 | LIVE-Q | LIVE-V |
|---|---|---|---|---|
| Concat | SROCC | 0.7466 | 0.5286 | 0.6357 |
| | PLCC | 0.7625 | 0.5847 | 0.6889 |
| Ours Distribution | SROCC | 0.7735 | 0.5900 | 0.6692 |
| | PLCC | 0.7638 | 0.6524 | 0.7142 |
| Ours PyramidPooling | SROCC | 0.7732 | 0.5884 | 0.6701 |
| | PLCC | 0.7631 | 0.6495 | 0.7173 |
| Ours | SROCC | 0.7972 | 0.6200 | 0.6797 |
| | PLCC | 0.7984 | 0.6666 | 0.7327 |
TABLE IV: Ablation studies with KoNViD-1k as the training data. For simplicity, we denote LIVE-Qualcomm as LIVE-Q and LIVE-VQC as LIVE-V.

IV-E Visualization

To better understand the quality-relevant features learned by our method, we train our model on one specific dataset at a time and visualize the quality features of all videos in the four datasets. More specifically, for each video, we extract its feature (as shown in Eqn. (8)) and then reduce the feature dimension to two with t-SNE [maaten2008visualizing], as visualized in Fig. 5. We can observe that the features generated from the different testing sets largely overlap with the features of the training set, which reveals that the domain gaps among the four datasets can be reduced with our method. Moreover, the closely aligned feature distributions obtained when different datasets are used for training demonstrate that a consistent feature space can be learned by our model, leading to superior performance in the cross-dataset settings.

Moreover, to verify whether the Gaussian distribution is updated from its initial standard form (zero mean and unit variance) in each dimension of the quality feature, we also plot the final values of the mean and variance in Fig. 6 for the four cross-dataset tests. We can observe that the distributions differ considerably across feature dimensions. For example, when the model is trained on the LIVE-VQC dataset, the variance of the 30th dimension is nearly 1.4 times that of the 17th dimension, which further reveals that the quality of a video is governed by features from different dimensions with different sensitivities.

Fig. 5: t-SNE visualization of the features extracted from each dataset. The dataset used for training is provided under each sub-figure.
Fig. 6: Mean and variance of each dimension of the quality feature in Eqn. (8). The dataset used for training is provided under each sub-figure.

V Conclusions

In this paper, we propose an NR-VQA method that aims to improve the generalization capability of the quality assessment model when the training and testing videos differ in content, resolution and frame rate. The effectiveness of the proposed method, which has been validated in both cross-dataset and intra-dataset settings, arises from feature learning based on the unified distribution constraint and pyramid temporal aggregation. The proposed model is extensible from multiple perspectives. For example, it can be further applied to optimization tasks in which the pristine reference video is not available. Moreover, the design philosophy could be further applied to other domains (e.g., high dynamic range, screen content, virtual reality).