
Convolutional Neural Networks for Video Quality Assessment

Video Quality Assessment (VQA) is a very challenging task due to its highly subjective nature. Moreover, many factors influence VQA. Compression of video content, while necessary for minimising transmission and storage requirements, introduces distortions which can have detrimental effects on the perceived quality. Especially when dealing with modern video coding standards, it is extremely difficult to model the effects of compression, due to the unpredictability of encoding on different content types. Transmission also introduces delays and other distortion types which affect the perceived quality. It would therefore be highly beneficial to accurately predict the perceived quality of video distributed over modern content distribution platforms, so that specific actions could be taken to maximise the Quality of Experience (QoE) of the users. Traditional VQA techniques based on feature extraction and modelling may not be sufficiently accurate. In this paper, a novel Deep Learning (DL) framework based on end-to-end feature learning is introduced for effectively predicting the perceived quality of video delivered through content distribution mechanisms. The proposed framework is based on Convolutional Neural Networks, taking into account compression distortion as well as transmission delays. Training and evaluation of the proposed framework are performed on a user-annotated VQA dataset created specifically for this work. The experiments show that the proposed methods can achieve high accuracy in quality estimation, showcasing the potential of using DL in complex VQA scenarios.



1 Introduction

Due to the large size of uncompressed video signals, video compression is essential to ensure that content can be distributed efficiently and promptly to final users. In order to achieve the low bit-rates required for smooth video delivery over conventional networks, video distribution systems rely on lossy compression. Modern video coding solutions such as the H.265/High Efficiency Video Coding (HEVC) standard HEVC are capable of achieving very high compression ratios while minimising the distortion introduced during compression. Nonetheless, the processing performed by HEVC encoders may introduce a certain amount of distortion or artefacts in the video signal, which negatively impacts the user's perceived quality of the received video.

HEVC relies on a flexible approach which allows content to be compressed with different strengths (defined by the Quantisation Parameter, QP, used within the compression loop), depending on the application. Higher QPs result in lower quality of the compressed signal at smaller average bit-rates, whereas lower QPs result in better quality content at higher bit-rates. When distributing content in challenging network conditions, it is crucial to ensure that bit-rates do not exceed the capacity of the network; otherwise users will experience delays, which also contribute negatively to the perceived quality of the received signal. Moreover, it is worth noting that using the same settings on different video content may produce very different results in terms of quality and bit-rate of the compressed signal. This is due to the complex mechanisms adopted by standards such as HEVC, which rely on exploiting spatial, temporal and statistical redundancies within the video signal to achieve compression. Due to the unpredictability of the compression step, it is very difficult to predict the Quality of Experience (QoE) perceived by viewers watching compressed content distributed through a transmission network.

QoE is defined by the International Telecommunication Union (ITU) as "the overall acceptability of an application or service, as perceived subjectively by the end-user" rec2007p . As such, QoE may include the complete end-to-end system effects (acquisition, processing, compression, storage, transmission, etc.), and is also influenced by user expectations and context. In this paper, the focus is on perceived subjective Video Quality Assessment (VQA), taking into account the effects of coding distortion and transmission delays. Subjective VQA can be measured in psychophysical experiments in which a number of subjects rate a given set of content. Depending on the application, tests can be performed as Full-Reference (FR), where viewers compare the processed video against the original reference; as Reduced-Reference (RR), where the comparison is based on a limited number of features extracted from the reference video; or, in case only the processed videos are presented to the viewers, as No-Reference (NR) tests. The latter is the case for the NR subjective tests used within this paper (described in more detail in the rest of the paper), in that the assessment was performed at the receiving end, where the uncompressed signal before transmission is not available.

In order to obtain representative ratings, a certain number of non-expert viewers (to avoid potential subject bias) should be invited. According to ITU-T Recommendation P.910 itu2008p910 , any number between and is desirable. Moreover, due to possible influence factors from heterogeneous contexts, tests should be performed in a neutral environment (e.g. a dedicated laboratory room). After the tests, the scores of all participants are averaged to compute the so-called Mean Opinion Scores (MOS). Clearly, preparing and running such tests can be expensive and time-consuming.

For this reason, methods to objectively predict the VQA of video content are highly desirable. In this paper, a Deep Learning (DL) approach for automatic VQA based on Convolutional Neural Networks (CNNs) is proposed, with the goal of predicting the expected perceived quality of compressed video content after transmission. The system is tailored for very challenging applications entailing the use of User Generated Content (UGC), which is increasingly relevant in many scenarios. As such, the proposed system was trained and tested under challenging conditions, both from the compression perspective (in that the content contains noise and fast motion, and is in general of lower quality than professionally captured video content) and from the transmission perspective (in that mobile device users may be in areas with low network coverage).

The system is capable of making the VQA prediction at the source and, differently from other methods, it accepts raw visual data as input, without performing any further processing or transmission, thus reducing the necessary complexity. Moreover, differently from alternative techniques, the supervised DL approach ensures end-to-end feature learning, and the regression-flavoured task is transformed into a classification task, aiming to provide results which are easily exploitable within the distribution chain, as illustrated in the rest of this paper.

The rest of this paper is organised as follows. Section 2 briefly summarises existing approaches to NR VQA methods and metrics, provides basic notation and preliminaries of supervised DL prediction techniques, and gives an overview of state-of-the-art methods. Section 3 formulates the problem and describes the dataset used in this work, while Section 4 presents the proposed DL-based NR prediction framework. Section 5 provides an extensive experimental evaluation of the proposed model and an analysis of the results. Finally, concluding remarks are drawn in Section 6.

2 State of the art

The simplest way of evaluating the quality of video signals is to use FR metrics such as the Peak Signal-to-Noise Ratio (PSNR) winkler2008evolution , which is a function of the Mean Square Error (MSE) between each frame of the reference and the processed video signal. PSNR is widely used in video coding, for instance in rate-distortion optimisation, where it has proven to work well while being inexpensive to compute. On the other hand, PSNR may not match well with perceived visual quality due to the complex, highly non-linear behaviour of the human visual system wang2009mean , and it cannot be used to measure some of the effects of transmission (such as delays), because it does not generalise to the temporal dimension. Similarly, the popular Structural Similarity (SSIM) index wang2004image , wang2004video is frequently used for estimating video quality. The computation is performed frame-by-frame on the luminance component of the video sequence and, in conjunction with the contrast and structure components, the overall degradation is computed as the average of the SSIM indexes at the frame level. To deal with variations across scales, SSIM can be extended to Multi-Scale Structural Similarity (MS-SSIM) wang2003multiscale . More complex FR metrics have also been proposed, including the Video Quality Metric (VQM) pinson2004new , the Motion-based Video Integrity Evaluation (MOVIE) seshadrinathan2010motion , and the Visual Information Fidelity (VIF) li2016toward .
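As a brief illustration of the FR metrics discussed above, a minimal PSNR computation could look like the sketch below; the frame-wise averaging used for the sequence-level score is a common convention, not necessarily the exact procedure of the cited works.

```python
import numpy as np

def psnr(reference: np.ndarray, distorted: np.ndarray, max_value: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio (in dB) between two frames of equal shape."""
    mse = np.mean((reference.astype(np.float64) - distorted.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10((max_value ** 2) / mse)

def sequence_psnr(ref_frames, dist_frames) -> float:
    """Sequence-level PSNR, commonly taken as the per-frame average."""
    return float(np.mean([psnr(r, d) for r, d in zip(ref_frames, dist_frames)]))
```

Note that, as discussed above, a temporal average of frame-level scores cannot capture transmission effects such as delays.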

On the other hand, NR metrics have also been proposed to estimate the quality of video content shahid2014no , bovik2013automatic . These metrics typically require lower computational complexity yang2007perceptual , yang2005novel , kawayoke2008nr , brandao2010no , so that they can be used on-line for quantifying the quality of video content. A DCT-based approach for estimating the effects on quality of various types of compression distortion is proposed in saad2014blind , while a more general approach relying on statistical properties of undistorted videos is presented in mittal2016completely . An overview study torres2016experimental compares different metrics and concludes that there is no universally effective metric, indicating that for many applications automatic VQA remains an open research question.

In addition to the aforementioned models, which try to directly model the distortion in a picture, statistical models can also be defined, in which independent variables are fit against results obtained in a subjective quality evaluation test using regression techniques. These methods rely on the availability of a set of pre-annotated training data, generated by means of subjective VQA. As such, instead of directly trying to model the distortions, these methods try to correlate the annotated assessment with the reference signal. One way of achieving this goal is by manually defining specific features that are assumed to be relevant to the subjective quality, and subsequently using a mapping between the feature space and the subjective quality space. Unfortunately, manually designing such a mapping may be difficult, and as such these methods may not be ideal, especially in cases where the processing consists of complex, unpredictable operations, such as those entailed by modern video encoders.

As a possible alternative, ML-based methods have recently been proposed. These methods typically rely on two steps: feature extraction, in which representative features of the video content are computed; and classification, in which the extracted features are mapped into class scores by a trained algorithm. From an abstract perspective, ML schemes usually apply a dimensionality reduction technique to reduce the original data space, followed by a prediction scheme performed by trainable algorithmic methods. Among such methods, Support Vector Machines (SVMs), k-Nearest Neighbours (k-NN) and Decision Trees (DT) have all been used in VQA mittal2012no , le2006convolutional , saad2014blind , narwaria2012svd , plakia2016user . A typical ML system tries to learn and gain knowledge from the training data it is provided with, in order to be able to make predictions on new test data. Nonetheless, several issues must be taken into consideration. First of all, ML methods depend heavily on sophisticated feature extraction methods designed specifically for a certain task. These methods are based on the assumption that the selected features are relevant to the subjective quality, but varying datasets of the same nature can limit the validity of this assumption. Furthermore, such techniques only consider the distortions introduced by compression, and do not take transmission into account.

A few approaches have been presented dealing with predicting the perceived quality after transmitting video content through a network. In bampis2017learning , the authors consider video impairments based on playback interruptions, mainly caused by bandwidth limitations. A set of features is derived, which is then used to fit a regression-based predictive approach bampis2017study . Regression is performed using different models such as Support Vector Regression (SVR), Random Forests (RF) or Gaussian Boosting (GB). In kumar2015intelligent , a ML scheme is adopted for wireless communication applications. The proposed method involves a Pseudo-Subjective Quality Assessment (PSQA) procedure, during which a finite set of highly influential parameters is selected, and video content is subsequently rated by subjective viewers. The subjective data is then fed to a regression model.

In sogaard2015video the authors propose a regularised linear regression NR model called Elastic Net (EN). The extracted video features are based on an approach described in sogaard2015no , with the goal of estimating the QP used during encoding and the corresponding PSNR, similar to the work presented in bi2003dimensionality . Another approach is proposed in rehman2015display , where the authors present the results of a subjective study assessing the effects of viewing conditions and display devices on the VQA process. Furthermore, they propose a FR metric called SSIMplus ssimplus which operates in real-time for predicting the quality of video content.

A NR machine-learning-based approach for streaming applications has also been proposed in the literature, in which the authors extract eight NR video features (at bit-stream and pixel level) and combine them with the nominal bit-rate and estimated level of packet loss in order to form a representative feature set. This feature set is subsequently fed to regression-based predictive algorithms, which carry out the QoE assessment. Such algorithms range from Multiple Linear Regression (MLR) and standard regression trees to GP regression and SVRs. Finally, a deep unsupervised learning scheme has been proposed in which the authors employ Restricted Boltzmann Machines combined with eight NR features.

Inspired by the fact that DL techniques are effective on many problems in image and video processing (for instance image/video classification, human activity recognition, etc.) compared to conventional machine learning techniques, a DL framework for efficient VQA prediction is proposed in this paper. The choice of a DL approach towards VQA is twofold. Firstly, DL models can acquire remarkable generalisation capabilities when sufficient data is used for training, especially when using data augmentation techniques such as those utilised in the proposed methodology. Secondly, DL models do not depend on sophisticated feature extraction and selection techniques, as is the case with the traditional machine learning techniques mentioned above, but perform end-to-end learning and optimisation via linear and non-linear transformations of raw pixel data. Due to the subjective nature of this problem, defining a set of features that appropriately correlates with the final VQA is not trivial, and DL approaches are suitable for this task in that they overcome this step. DL models based on Convolutional Neural Networks (CNNs) have recently been used for picture-quality prediction kim2017deep . The approach proposed in this paper goes beyond the state of the art by investigating the use of CNNs towards higher-order models. In particular, temporal information is considered by feeding three-dimensional patches to the algorithm, ensuring that variations of quality over time are taken into account. VQA is posed as a classification problem, as shown in the following section.

3 Problem formulation

In this section, some background is provided to formulate the problem of predicting VQA using deep neural networks. In addition, a description of the dataset used for training and testing of the approach is provided. The dataset was created specifically to develop the work presented in this paper.

3.1 VQA as a classification problem

When applying ML for predictive VQA modelling purposes, training data is used to make a prediction. This prediction should generalise well on new data on which there is no ground truth. As such, predictive modelling can be described as the goal of approximating a mapping function from input variables to output variables. In the case described in this paper, the output variable is a score of the predicted VQA of the current piece of content being considered. The output variable to represent the VQA can be treated as either a discrete or continuous parameter. In the latter case, a continuous scale could be used to score the VQA from a minimum to a maximum.

In ideal conditions, assuming the availability of a large training set, this choice could better fit the ground truth obtained from subjective testing, which is computed as the average of the MOS scores provided by the test participants. Such averages are expressed as continuous variables. On the other hand, posing the problem with a continuous output variable does not work well when only a limited number of items is available for training. Under these conditions, the network has a limited ability to learn from the data, and generalisation is more difficult to obtain. For that reason, a discretised output variable can be more suitable, leading to higher prediction accuracy despite the limited size of the training set. When the output variable is discrete, the task is typically referred to as a classification problem. The input variables in a classification problem can be either real-valued or discrete-valued.

Following from the aforementioned observations, the following scheme is proposed in this paper. The objective is to predict the VQA of a given piece of compressed video content that needs to be transmitted through a transmission network under known network conditions. The CNN takes raw pixel data as input. This data is obtained by decoding the compressed bitstream, and then pre-processing the data in order to embed the effects of the network within the signal. In this respect, several parameters which affect the network conditions need to be taken into account, including the Maximum Segment Size (MSS), the Round-Trip Time (RTT), and the loss rate. The throughput rate is then estimated (assuming that the Transmission Control Protocol, TCP, is used) as:


R = \frac{MSS}{RTT} \sqrt{\frac{3}{2p}} \qquad (1)

where MSS is the maximum segment size, RTT is the round-trip time, and p is the loss rate.
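A minimal sketch of this throughput estimate, assuming the widely used Mathis approximation for steady-state TCP throughput (the function name and units are illustrative, and the paper's exact formula may differ):

```python
import math

def tcp_throughput(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Approximate steady-state TCP throughput in bits per second,
    following the Mathis model: R = (MSS / RTT) * sqrt(3 / (2 * p))."""
    segments_per_rtt = math.sqrt(3.0 / (2.0 * loss_rate))
    return (mss_bytes * 8) * segments_per_rtt / rtt_s

# e.g. a 1460-byte MSS, 100 ms RTT and 1% loss rate yield roughly 1.4 Mbps
rate_bps = tcp_throughput(1460, 0.1, 0.01)
```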

Finally, for a given video content with bit-rate B and length T (in seconds), the following delay can be considered:

D = \frac{B \cdot T}{R} \qquad (2)

The delay D is manually added at the beginning of the raw pixel data signal before this is fed to the CNN. Training is performed using a dataset consisting of annotated content (as described later in this section). The ground truths used during training are the average MOS values, discretised into a fixed number of classes. In order to estimate the performance of the proposed classification predictive model, the accuracy of the prediction can be computed, corresponding to the percentage of correctly classified samples over the total number of estimations in a test set. A scheme of the proposed approach is illustrated in Figure 1.
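The delay computation and the accuracy measure described above can be sketched as follows (variable names are illustrative):

```python
def transmission_delay(bitrate_bps: float, duration_s: float,
                       throughput_bps: float) -> float:
    """Delay (s) incurred when delivering the whole clip: total bits
    divided by the estimated network throughput."""
    return (bitrate_bps * duration_s) / throughput_bps

def classification_accuracy(predicted, ground_truth) -> float:
    """Fraction of correctly classified samples in a test set."""
    correct = sum(p == g for p, g in zip(predicted, ground_truth))
    return correct / len(ground_truth)
```

For instance, a 2 Mbps clip of 10 seconds delivered over a 4 Mbps link incurs a 5-second delay.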


Figure 1: Proposed CNN-based approach for VQA prediction.

3.2 Dataset

In designing DL solutions and training ML models, the definition of suitable training and test sets is a critical task. With regard to VQA, a set of video sequences with different levels of perceived quality is required. In addition, the annotated perceived QoE associated with each video is also essential, to act as the ground truth in the training process. The subjects should annotate video clips under known conditions in terms of compression parameters as well as the network conditions under which the video content was transmitted.

Given the nature of the problem presented in this paper and the general approach depicted in Figure 1, a suitable dataset was difficult to identify in the literature. Therefore, a new dataset was built specifically for the problem at hand. The dataset comprises video clips encoded with different QPs and transmitted under simulated network conditions. As already mentioned, the method should be capable of dealing with challenging scenarios such as those imposed by the use of UGC, which is typically of poorer quality than broadcast content.

In this regard, UGC video sequences from the open access Edinburgh Festival dataset weerakkody2017 were selected. Each video clip is of seconds in length, comprising frames, a spatial resolution of samples, and frame rate of Hz. Exemplary frames from the aforementioned UGC video sequences can be seen in Figure 2.

Figure 2: The UGC sequences used for the creation of the annotated VQA dataset.

The sequences were encoded with different QPs. The encoding was performed using a practical HEVC encoder solution. The Turing codec Turing was used for this purpose: an open-source practical HEVC software implementation which is capable of compressing video content at very high compression efficiency, while at the same time providing features typical of practical encoders, such as low complexity, high parallelisation capabilities, and minimal memory requirements. This is beneficial to ensure that the proposed approach can be used within practical use-case scenarios.

Encoding was performed with different QP values aimed at simulating the various compression distortions witnessed in distribution conditions. Hence the values , , , , and were selected, to cover a wide range of delivery requirements. As already mentioned, the effects of encoding with different QP values on the actual quality of the decoded signal are difficult to predict, and highly content-dependent. As an example, Figure 3 shows frames extracted from two of the three UGC sequences in the dataset, encoded with different QP values. As can be seen, the effects of high quantisation are generally very evident, in that the sequence is on average of much lower quality. On the other hand, such effects are not uniform. Smoother areas of the content, such as the hand or the background in the sequence shown in Figure 3(a), tend to be compressed more easily even when using higher QP values. Conversely, textured areas with a higher amount of detail suffer more from the use of high QP values, as is evident from the sequence shown in Figure 3(b).

(a) UGC sequence 1
(b) UGC sequence 2
Figure 3: Example frames extracted from two of the UGC sequences used for the dataset, compressed with QP= (left) and QP= (right).

In addition to considering the effects of compression, transmission is also considered in the dataset. By considering various combinations of the aforementioned network parameters, four realistic conditions were modelled, each represented by a final overall network throughput rate (and corresponding delay, obtained as in Equation 2). The selected network conditions are the following:

  • Mbps rate associated with ms, and

  • Mbps rate associated with ms, and

  • Mbps rate associated with ms, and

  • Mbps rate associated with ms, and

An MSS of Bytes was considered for all cases. The effects of the above network conditions were incorporated into the encoded sequences by means of pre-processing (as illustrated in Figure 1).

The combination of the different encoding parameters and network parameters resulted in different video clips, which formed the basis of the dataset. These video clips and the simulations described above constitute the training data fed to the DL model as input. During the training phase, the network produces QoE ratings for each input, which need to be compared with ground-truth values. Hence it is necessary to have an annotated dataset of VQA values associated with the described video content. To address this, a subjective assessment of the video dataset was performed, and the quality of the videos was rated based on the distortions introduced by compression and network conditions.

The subjective assessment was based on the standard Absolute Category Rating (ACR) method recommended by the ITU itu2008p910 ; series2012methodology , which measures the perceived quality of stimuli that are presented to the viewers separately and rated independently. The VQA ratings range from to , representing bad, poor, fair, good, and excellent QoE.

In this regard, subjects were selected to participate in the subjective assessment and asked to watch the items in the dataset. Viewing conditions were carefully controlled to simulate normal scenarios in which viewers usually watch TV. Normal displays with a native resolution of pixels were used for the assessment. Participants were instructed to consider the general quality of the items, taking into account all the conditions present in the videos (including the content, visual artefacts and distortions, starting delay, camera movements, etc.), and to give a single rating from to to describe the QoE of watching that specific item.

The subjects were shown the items in a random order, in full-screen mode. Randomisation is important in order to avoid biasing the participants. The evaluation took approximately one hour for each participant. After performing the experiments, the obtained opinion scores were averaged to obtain a single MOS for each item, which was used for the training of the CNNs. Moreover, further processing was applied to the QoE scores in terms of discretisation, by grouping the average QoE values into classes.
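A minimal sketch of this discretisation step, assuming a 1-to-5 MOS scale and five equal-width classes (both assumptions, since the exact bounds and class count are not restated here):

```python
def mos_to_class(mos: float, n_classes: int = 5, lo: float = 1.0, hi: float = 5.0) -> int:
    """Map a continuous MOS on [lo, hi] to one of n_classes equal-width
    bins (0 = lowest quality). Scale bounds and bin count are assumptions."""
    width = (hi - lo) / n_classes
    idx = int((mos - lo) / width)
    return min(idx, n_classes - 1)  # clamp mos == hi into the top bin
```

For example, with these assumed bounds an average MOS of 3.2 falls into the middle class.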

4 Proposed method

In this section, the proposed methodology for predicting the VQA of video signals is presented, together with details of the design of the CNN used to perform the VQA estimation.

4.1 Data pre-processing and augmentation

CNNs require a considerable amount of data in order to ensure that sufficient generalisation is achieved during the training. Moreover, directly processing frames of samples is too complex from the perspective of memory requirements during the training, as well as during the classification when applying the CNNs for performing the VQA prediction.

In order to tackle this problem, a patch-based data augmentation technique was applied to the training data. For simplicity, only the luminance component was considered in the proposed approach. Each video volume (corresponding to the three-dimensional matrix of luminance pixel values of size , where is the number of frames in the sequence) was split into a sequence of non-overlapping cubic patches of size containing pixel values, which serve as the examples used for training the CNN. Since each entire video sequence is annotated with a single label, the same VQA label is propagated to each of the training samples extracted from that dataset item. The obtained training examples are then grouped together to form the dataset used for training the CNN.
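A sketch of this patch extraction, assuming a cube side of 32 samples (the actual patch size is not restated here) and discarding any remainder at the volume borders:

```python
import numpy as np

def extract_patches(volume: np.ndarray, size: int = 32) -> np.ndarray:
    """Split a (T, H, W) luminance volume into non-overlapping size^3
    cubes. The cube side (32 here) is an assumption; border remainders
    that do not fill a full cube are discarded."""
    t, h, w = (d // size for d in volume.shape)
    patches = []
    for i in range(t):
        for j in range(h):
            for k in range(w):
                patches.append(volume[i * size:(i + 1) * size,
                                      j * size:(j + 1) * size,
                                      k * size:(k + 1) * size])
    return np.stack(patches)

# Each patch inherits the sequence-level label, e.g.:
# labels = np.full(len(patches), sequence_label)
```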

4.2 Spatio-temporal learning with higher-order CNNs

The proposed method fits within classification-flavoured supervised learning, employing higher-order CNNs for efficient spatio-temporal feature learning of the video content. Unlike conventional neural networks, CNNs employ the notion of local receptive fields in order to effectively extract features from raw data. More specifically, each locally connected subset of input neurons is mapped to a single output neuron, a process which is performed in a stacked manner throughout the convolutional layers, in order to capture as many representative features as possible. The connection between input and output neurons is performed via convolutions by means of trainable kernels, namely filters with specific filter coefficients. The dimensionality of the resulting feature maps can be reduced using pooling layers in order to avoid over-fitting issues.

Formally, the value of a convolved output neuron at position (x, y) in the j-th feature map of the i-th layer can be expressed as follows:

v_{ij}^{xy} = f\left( b_{ij} + \sum_{m} \sum_{p=0}^{P_i - 1} \sum_{q=0}^{Q_i - 1} w_{ijm}^{pq} \, v_{(i-1)m}^{(x+p)(y+q)} \right) \qquad (3)

where f is an activation function, w_{ijm}^{pq} stands for the value of the kernel connected to the m-th feature map of the previous layer at position (p, q), v_{(i-1)m}^{(x+p)(y+q)} represents the value of the input neuron, b_{ij} is the bias of the computed feature map, and P_i and Q_i are the height and width of the kernel, respectively.

Processing video signals implies that, in addition to spatial information, temporal (inter-frame) redundancies also exist among neighbouring frames. In order to exploit such information, a 3D-CNN approach is proposed in this paper for the effective preservation of temporal and motion features which may be essential to VQA. 3D convolution is an extension of the 2D convolution operation, in which the learnable convolution kernel is a three-dimensional cube which considers local spatial regions extracted from adjacent frames. Formally, the third-order analogue of Equation 3 can be expressed as follows:

v_{ij}^{xyz} = f\left( b_{ij} + \sum_{m} \sum_{p=0}^{P_i - 1} \sum_{q=0}^{Q_i - 1} \sum_{r=0}^{R_i - 1} w_{ijm}^{pqr} \, v_{(i-1)m}^{(x+p)(y+q)(z+r)} \right) \qquad (4)

where R_i depicts the temporal dimension of the kernel cube, and the respective quantities in Equation 3 are extended to their three-dimensional counterparts. By adopting such a strategy, the feature maps in the convolution layer are connected to multiple contiguous frames in the previous layer, better capturing motion information.
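A didactic (non-optimised) single-channel sketch of 3D convolution, with one input feature map, identity activation and "valid" boundary handling:

```python
import numpy as np

def conv3d_valid(volume: np.ndarray, kernel: np.ndarray, bias: float = 0.0) -> np.ndarray:
    """'Valid' 3D convolution of a (T, H, W) volume with a single
    (kt, kh, kw) kernel; a didactic sketch, not an efficient implementation."""
    kt, kh, kw = kernel.shape
    T, H, W = volume.shape
    out = np.empty((T - kt + 1, H - kh + 1, W - kw + 1))
    for z in range(out.shape[0]):
        for x in range(out.shape[1]):
            for y in range(out.shape[2]):
                # each output neuron sums a local spatio-temporal region
                out[z, x, y] = np.sum(volume[z:z + kt, x:x + kh, y:y + kw] * kernel) + bias
    return out
```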

Figure 4: Proposed CNN-based architecture for QoE prediction.

4.3 QoE prediction modelling

The architecture of the CNN used throughout this paper was inspired by work in human action recognition tran2015learning . The network utilises multiple stacks of convolutional layers, max-pooling layers, normalisation layers and Rectified Linear Unit (ReLU) activation layers. It has been shown (for instance in tran2015learning ) that small receptive fields of convolution kernels may lead to higher classification accuracies than larger kernels. Extending this assumption to the temporal dimension, while also noting that temporal redundancy generally decreases quickly with time, an equivalent temporal kernel length of was also considered in this paper. A representation of the architecture of the network described in this section is presented in Figure 4.

In terms of the number of stacks of layers employed, architectures comprising and layers were explored. In the proposed model, each convolution layer is always followed by a max-pooling layer. These layers are then followed by two fully-connected layers. Finally, a soft-max layer is used for the prediction task.

Regarding the number of filters, a "doubling-depth" strategy was adopted, in which each deeper layer in the network utilises double the number of convolution filters of the previous layer. This approach was shown to be successful in tasks in which it is crucial for the network to learn abstract features tran2015learning.

In order to preserve as much temporal information as possible, the max-pooling layer following the first convolutional layer in the network uses a reduced pooling size. This ensures that temporal redundancies are not discarded in the initial stages of the signal path. All subsequent max-pooling layers following deeper convolutional layers in the network use a larger pooling size, downsampling the signal further. The network is completed by the two fully-connected layers.
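To make the layer arithmetic concrete, the following sketch traces tensor shapes through a stack of convolution/max-pooling pairs with the doubling-depth filter strategy. The input size, base filter count and pooling factors are illustrative assumptions (the first pooling layer is assumed to leave the temporal axis untouched), not the paper's exact configuration:

```python
def layer_stack_shapes(input_shape, n_layers, base_filters,
                       first_pool=(1, 2, 2), pool=(2, 2, 2)):
    """Trace (T, H, W, C) shapes through n_layers of 'same'-padded 3D
    convolutions (with doubling-depth filter counts), each followed by
    max-pooling. Pooling sizes here are assumed, not the paper's values."""
    shapes = []
    t, h, w, _ = input_shape
    filters = base_filters
    for i in range(n_layers):
        pt, ph, pw = first_pool if i == 0 else pool
        t, h, w = t // pt, h // ph, w // pw  # max-pooling downsampling
        shapes.append((t, h, w, filters))
        filters *= 2  # doubling-depth strategy
    return shapes

print(layer_stack_shapes((16, 64, 64, 1), n_layers=2, base_filters=16))
# [(16, 32, 32, 16), (8, 16, 16, 32)]
```

Note how the first pooling stage halves only the spatial dimensions, so temporal information survives the early part of the signal path.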

In terms of the optimiser, both standard SGD optimisation with different learning rates and Adagrad optimisation duchi2011adaptive were investigated. The latter was selected due to its adaptive behaviour, reducing the learning rate of parameters with high gradients and, conversely, increasing that of parameters which receive small or infrequent updates. Categorical Cross-Entropy was used as the loss function.
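The Adagrad behaviour described above can be sketched with a minimal NumPy update rule; the learning rate and gradient values below are made up for illustration. With a constant per-parameter gradient, the accumulated normalisation makes the effective step size nearly independent of the raw gradient magnitude:

```python
import numpy as np

def adagrad_step(w, grad, cache, lr=0.01, eps=1e-8):
    """One Adagrad update: parameters with a large accumulated squared
    gradient receive a smaller effective learning rate, and vice versa."""
    cache += grad ** 2                       # per-parameter gradient history
    w -= lr * grad / (np.sqrt(cache) + eps)  # normalised update
    return w, cache

w = np.array([1.0, 1.0])
cache = np.zeros_like(w)
# First coordinate repeatedly sees a large gradient, the second a small one
for _ in range(10):
    w, cache = adagrad_step(w, np.array([10.0, 0.1]), cache)
print(w)  # both parameters receive (almost) identical total updates
```

This scale-invariance is why Adagrad tends to behave more stably than plain SGD when gradient magnitudes differ widely across parameters.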

Training was performed using a varying number of epochs, with the goal of observing how the prediction accuracy changes as the number of epochs increases. The CNN performance was measured in terms of Classification Accuracy and Loss with respect to the total number of epochs, measured for each patch. However, patches are extracted from specific video sequences, and it is therefore important to also assess the performance of the network in terms of its ability to predict the VQA of an entire video sequence. To this end, an aggregation process was considered in order to obtain a single accuracy score for each item in the dataset. For this purpose, different strategies can be utilised. In this paper, the two approaches described in kim2017deep are considered, namely:

  • Aggregation in terms of majority-vote strategy;

  • Using a pre-trained model strategy.

When considering aggregation in terms of the majority-vote strategy, patch aggregation is not performed during the actual training process. On the contrary, training treats each patch as independent from the others, disregarding the fact that patches may be extracted from the same item. After training, during classification, the labels associated with patches belonging to the same item are grouped together, and the most frequently predicted label is selected as the label for the entire video sequence. This strategy is represented in Figure 5. When adopting this strategy, each patch is therefore independently "regressed" onto the global subjective score for the video sequence.
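A minimal sketch of the majority-vote aggregation; the per-patch labels below are hypothetical:

```python
from collections import Counter

def majority_vote(patch_labels):
    """Aggregate per-patch predicted labels into one sequence-level label."""
    return Counter(patch_labels).most_common(1)[0][0]

# Hypothetical per-patch predictions for one video sequence
patches = [3, 3, 2, 3, 1, 3, 2]
print(majority_vote(patches))  # 3
```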

Figure 5: Majority-voting strategy for patch aggregation.

As an alternative to the majority-vote strategy, a pre-trained model can instead be considered. In this case the patch aggregation process is incorporated within the training process, as illustrated in Figure 6. The aforementioned 3D-CNN is initially trained on video signal data using each patch individually, as is the case for the majority-vote strategy. The network gains knowledge from this data and embeds it within the trained weights. These weights can be extracted and then transferred to another CNN to perform the actual classification. Formally, a fixed number of patches is randomly extracted from each video (the same number of patches was used throughout the rest of this paper). After each patch is fed to the CNN, training is performed by means of backpropagation. The trained set of weights is then extracted and arranged in a one-dimensional vector whose length equals the total number of filter coefficients in the CNN. Finally, the vectors obtained for all patches of a video are arranged in a single feature vector.

These feature vectors, extracted from the set of training videos, are then used to train an additional one-dimensional CNN, whose input is the feature vector described above. The structure of this CNN is identical to that of the 3D CNN, but the 3D filters are replaced by their 1D counterparts. The output is again formulated in terms of classification as a single discrete value, representing the predicted VQA of the current video.

When applying this approach to a new video from a test set, the first 3D CNN is trained again (using the patches extracted from the test video) to obtain the corresponding feature vector. This is then input to the second CNN, which outputs the final VQA prediction. Intuitively, this strategy implies that, instead of training the 1D CNN from scratch, the learned features of each video item are transferred from the 3D CNN. This allows good classification accuracy even though the second 1D CNN is trained with a very limited number of training samples (equal to the number of video sequences in the training set). This strategy overcomes the limitations of the data augmentation technique used, while avoiding the majority-vote strategy, which may lead to unsatisfactory results. A detailed analysis of the performance of both strategies is presented in the rest of this paper.
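The weight-transfer step can be sketched as follows; the filter shapes and patch count are illustrative assumptions, not the values used in the paper:

```python
import numpy as np

def weights_to_feature_vector(per_patch_weights):
    """Flatten the trained 3D-CNN weights obtained for each of the N patches
    of a video into a single feature vector of length N * L, where L is the
    number of filter coefficients; this vector feeds the 1D CNN."""
    vectors = [np.concatenate([w.ravel() for w in weights])
               for weights in per_patch_weights]   # one length-L vector per patch
    return np.concatenate(vectors)                 # length N * L

# Hypothetical weights: 3 patches, each yielding two small filter banks
per_patch = [[np.random.rand(2, 3, 3), np.random.rand(4, 3, 3)]
             for _ in range(3)]
feature = weights_to_feature_vector(per_patch)
print(feature.shape)  # (162,) = 3 patches * (18 + 36) coefficients
```

The 1D CNN then sees one such vector per video, which is why its training set is as small as the number of video sequences.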

Figure 6: Patch aggregation using pre-trained model.

To further quantify the performance of the proposed system, apart from the Classification Accuracy and Loss, other metrics were also used, namely True Positive Rate (TPR), False Negative Rate (FNR), False Positive Rate (FPR), True Negative Rate (TNR) and accuracy-per-class.
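These one-vs-rest rates can be computed directly from predictions, as in this minimal sketch (labels and predictions below are hypothetical):

```python
import numpy as np

def one_vs_rest_rates(y_true, y_pred, label):
    """TPR, FNR, FPR and TNR for one class, treating the multi-class
    problem as a one-vs-rest statistical hypothesis test."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    pos, neg = y_true == label, y_true != label
    tp = np.sum(pos & (y_pred == label))   # correctly detected positives
    fn = np.sum(pos & (y_pred != label))   # missed positives
    fp = np.sum(neg & (y_pred == label))   # false alarms
    tn = np.sum(neg & (y_pred != label))   # correctly rejected negatives
    return tp / (tp + fn), fn / (tp + fn), fp / (fp + tn), tn / (fp + tn)

# Hypothetical predictions over three quality labels
tpr, fnr, fpr, tnr = one_vs_rest_rates([1, 1, 2, 3, 3],
                                       [1, 2, 2, 3, 1], label=1)
print(tpr, fnr, fpr, tnr)
```

Repeating the computation for each label yields the per-class rates reported later, and accuracy-per-class follows from the diagonal of the same confusion counts.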

5 Experimental evaluation

In this section, results of the adopted CNNs under different scenarios using various parameters are presented and discussed, to illustrate the performance of the methodologies introduced in this paper.

5.1 Effects of compression in VQA prediction

In a first set of experiments, the proposed DL-based VQA prediction was assessed taking into account only the effects of compression. To this end, network conditions were considered constant: the experiments only utilise the items in the dataset corresponding to the UGC clips compressed with the considered QP values and transmitted using the best available network conditions. In all experiments presented in this subsection, the discrete labels used for training and testing were obtained by quantising the average MOS ratings using non-overlapping intervals of equal size, resulting in three labels, where label 1 corresponds to the lowest annotated QoE and label 3 corresponds to the highest QoE.
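The quantisation of MOS ratings into discrete labels can be sketched as below; the [1, 5] MOS range and the three-label setting are assumptions made for illustration:

```python
import numpy as np

def quantise_mos(mos, n_labels=3, lo=1.0, hi=5.0):
    """Map a continuous MOS rating onto 1..n_labels using non-overlapping
    equal-size intervals. The [lo, hi] range is an assumed MOS scale."""
    edges = np.linspace(lo, hi, n_labels + 1)[1:-1]   # interior bin edges
    return int(np.digitize(mos, edges)) + 1           # labels start at 1

print([quantise_mos(m) for m in (1.2, 3.0, 4.8)])  # [1, 2, 3]
```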

The training samples used to feed the CNN are fixed-size patches. For the purpose of this experiment, a two-layer model was used, with the parameter specifications described in Section 4. The effects of using different optimisers, as well as a varying number of trainable filters, were investigated. Moreover, an increasing number of training epochs was used, in order to highlight the effect of the number of epochs on network performance.

First, a two-layered CNN was considered, with 16 filters in the first layer and 32 filters in the second layer. A comparison between the conventional SGD optimiser and the Adagrad optimiser was first performed, using 50 epochs for training. Results of this analysis are shown in Figure 7, which plots the patch-wise CNN accuracy for training and validation with respect to the number of epochs up to which the network was trained.

The results of this experiment show that the SGD optimiser produces less stable results than the Adagrad optimiser. Both optimisation algorithms lead to validation accuracy of up to approximately 77% (see Table 1). In the case of Adagrad, the model accuracy steadily increases with the epochs, and the accuracy on the validation set stabilises towards that of the training set, showing that the model is capable of generalising well during training.

Figure 7: Patch-wise classification accuracy for the Adagrad (left) and SGD (right) optimisers with 16-32 filters.

In an attempt to obtain more representative features, the effect of the number of trainable filters per layer was also investigated. In order to isolate this effect, the conventional SGD optimiser was again used for these experiments. The number of filters was increased by a factor of 8, corresponding to using 128 filters in the first layer and 256 filters in the second layer. The performance of the system in terms of model accuracy as well as model loss is illustrated in Figure 8.

Figure 8: Patch-wise classification accuracy (left) and loss (right) for SGD optimiser with 128-256 filters.

The plots in Figure 8 (left) show that the CNN achieves good classification accuracy, even though the accuracy for specific patches is considerably lower than that obtained with a smaller number of filters, pointing to the fact that the generalisation properties of the CNN may have decreased as an effect of increasing the number of filters. The decaying trend of the model loss depicted in Figure 8 (right) further confirms this claim.

To increase the generalisation properties, an additional experiment was then performed in which training was carried out for 500 epochs, instead of the 50 epochs used in the aforementioned experiments. Results of this experiment are presented in Figure 9.

Figure 9: Patch-wise classification accuracy (left) and loss (right) for training the system with 500 epochs.

It can be observed that, while the accuracy stabilises and does not seem to improve further with the increasing number of epochs, the system improves its generalisation properties, resulting in a decreasing number of patches with low classification accuracy. This is also reflected in fewer peaks in the loss function shown on the right of Figure 9.

Table 1 reports the classification accuracy of the system when utilising the majority-vote strategy to obtain item-wise classification accuracy (as opposed to the patch-wise results shown in the aforementioned plots). A sequence-wise accuracy of 80% is obtained in all experiments. Due to the generally good patch-wise accuracies, the majority-vote strategy leads to consistently good results, smoothing out the effects of the varying conditions considered in these experiments.

Optimiser Filters Epochs P-Accuracy S-Accuracy
Adagrad 16-32 50 77.58% 80%
SGD 16-32 50 76.88% 80%
SGD 128-256 50 77.14% 80%
SGD 128-256 500 74.20% 80%
Table 1: Patch-wise (P) and Sequence-wise (S) classification accuracy for various optimisers, number of filters and epochs.

5.2 Joint effects of compression and network conditions in VQA prediction

In a second set of experiments, the analysis was extended to include the rest of the dataset, considering impairments due to both compression and transmission. An analysis similar to that of the previous subsection was performed in order to investigate the influence of the parameters on the CNN performance. In the initial experiments presented in this subsection, non-overlapping intervals of equal size were again used for discretisation, resulting in three labels.

First, the conventional SGD optimiser was used, with 16 filters in the first layer and 32 filters in the second layer. The results obtained in terms of patch-wise accuracy and loss are depicted in Figure 10. As can be seen, the network does not generalise well, providing low classification accuracy on the validation set. The model loss increases with the number of epochs, highlighting that these CNN parameters are not suitable to model the combined effects of network conditions and compression.

Figure 10: Patch-wise classification accuracy (left) and loss (right) for SGD optimiser with 16-32 filters and 50 epochs.

Given that the Adagrad optimiser was shown to perform better in the results presented in the previous subsection, this optimiser was also tested under these conditions. Moreover, increasing the number of epochs was also shown to have a positive effect on the CNN performance; therefore, 500 epochs were used for the subsequent test. Under these conditions, the CNN performance improves considerably, especially in terms of validation accuracy, as shown in Figure 11.

Figure 11: Patch-wise classification accuracy (left) and loss (right) for Adagrad optimiser with 16-32 filters and 500 epochs.

The results of the experiment demonstrate that Adagrad again behaves more consistently than SGD in terms of patch-wise classification. The two models were also compared using the majority-vote strategy to obtain sequence-wise accuracy, with the model utilising the Adagrad optimiser clearly outperforming the one based on the conventional SGD optimiser.

In an attempt to obtain better classification performance, the number of filters per layer was increased to 128 in the first layer and 256 in the second layer. The obtained patch-wise classification accuracy is shown in Figure 12.

Figure 12: Patch-wise classification accuracy using 128-256 filters and 50 epochs.

The aforementioned results were used as the basis to construct a final model for the CNN, using the increased number of filters per layer, trained with a large number of epochs, and making use of the SGD optimiser. Figure 13 presents the results of this experiment, showing high patch-wise accuracy and a correspondingly high sequence-wise accuracy obtained with the majority-vote strategy.

Figure 13: Patch-wise classification accuracy obtained with final model parameters, using majority-vote strategy.

An analysis of the label-wise classification accuracy was performed, to understand whether the CNN is better at predicting specific labels. Figure 14 shows the classification accuracy obtained for each label. As expected, the labels at the extremes are easier to predict (in that the corresponding impairments are either very evident, or the items provide very high QoE). Nonetheless, the CNN predicts every label with consistently high accuracy.

Figure 14: Classification accuracy per label.

To further analyse the label-wise performance of the CNN, the classification problem was treated as a statistical hypothesis test, and its performance was investigated in terms of commonly used metrics such as the aforementioned TPR, FNR, FPR and TNR. These results can be seen in Figure 15, where the low FPR obtained for two out of the three classes indicates good classification capabilities.

Figure 15: Statistical Hypothesis Testing performance metrics.

Finally, the model was tested using the pre-trained strategy for patch aggregation. A two-layered 3D CNN was used, followed by a three-layer 1D CNN. When using this strategy, sequence-based classification is incorporated during training. Still, even though the training of the two subsequent CNNs is performed with the goal of obtaining a single sequence-wise classification, the performance of the first (3D) CNN can be assessed separately in order to evaluate its classification accuracy. Results of this test are shown in Figure 16, where it can be seen that good patch-wise classification is obtained, even though the results on the validation set are less stable than those obtained using the majority-vote strategy. Sequence-wise, a lower accuracy was obtained, showing that the pre-trained model does not perform as well as the model making use of the majority vote. Nonetheless, using this model avoids the need for grouping the labels and performing majority-vote patch aggregation, because a single classification can be obtained directly as the output of the 1D CNN. Since better performance is obtained with the majority-vote strategy, this was selected as the model of choice, unless the application requires the network to directly provide a single classification for each video item.

Figure 16: Patch-wise classification accuracy obtained with final model parameters, using the pre-trained model strategy.

Different discretisation strategies can be adopted when transforming the continuous average MOS ratings into discrete labels. An analysis of different strategies was performed in order to determine which one to adopt when constructing the discrete classes. Four different scenarios were considered, corresponding to dividing the continuous range of MOS ratings into non-overlapping intervals of four different sizes, resulting in a different total number of labels in each case.

Results of these experiments were computed in terms of sequence-wise accuracy, using both patch aggregation strategies described in this paper, and can be seen in Figure 17. It can be observed that the classification accuracy drops considerably when smaller discretisation intervals are considered. This is to be expected, in that the problem becomes more complex as the CNN has to classify each item into more classes. Interestingly, while the majority-vote strategy outperforms the pre-trained model in the three-label case, the opposite behaviour is observed in the other cases, where the pre-trained model performs better; with the largest number of labels, the pre-trained model achieves a much higher accuracy than the majority-vote strategy. Nonetheless, the results justify the choice of using three output labels, which outperforms all other configurations.

The results presented in this section justify the choice of the model parameters and highlight the acceptable performance of the proposed approach, stressing its good generalisation properties, especially considering the complexity of the problem tackled in this paper.

Figure 17: Sequence-wise classification accuracy with majority-vote strategy (blue) and pre-trained model with different number of output labels (classes).

6 Conclusions

This paper presents a novel DL approach to predict the perceived quality of video content affected by compression and network conditions. The approach is based on higher-order CNNs, which learn efficient spatio-temporal feature representations. The method was developed by means of an ad-hoc dataset comprising various pieces of challenging UGC material affected by HEVC encoding and various network conditions. The problem was posed as a classification problem, and two strategies were proposed in order to obtain sequence-wise classification. Extensive evaluation was provided to identify suitable network parameters, showing that the method is capable of achieving consistent classification accuracy under challenging conditions.

7 Acknowledgments

The work leading to this paper was co-supported by the Greek General Secretariat for Research and Technology (GSRT), the Hellenic Foundation for Research and Innovation (HFRI) and by the project COGNITUS, which received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 687605.



  • (1) G. J. Sullivan, J.-R. Ohm, W.-J. Han, T. Wiegand, Overview of the high efficiency video coding (HEVC) standard, IEEE Transactions on Circuits and Systems for Video Technology 22 (12) (2012) 1649–1668. doi:10.1109/TCSVT.2012.2221191.
  • (2) ITU-T Rec. P.10/G.100 Amendment 1, New Appendix I – Definition of Quality of Experience (QoE), International Telecommunication Union.
  • (3) ITU-T Rec. P.910, Subjective video quality assessment methods for multimedia applications.
  • (4) S. Winkler, P. Mohandas, The evolution of video quality measurement: From PSNR to hybrid metrics, IEEE Transactions on Broadcasting 54 (3) (2008) 660–668.
  • (5) Z. Wang, A. C. Bovik, Mean squared error: Love it or leave it? a new look at signal fidelity measures, IEEE signal processing magazine 26 (1) (2009) 98–117.
  • (6) Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE transactions on image processing 13 (4) (2004) 600–612.
  • (7) Z. Wang, L. Lu, A. C. Bovik, Video quality assessment based on structural distortion measurement, Signal processing: Image communication 19 (2) (2004) 121–132.
  • (8) Z. Wang, E. P. Simoncelli, A. C. Bovik, Multiscale structural similarity for image quality assessment, in: Signals, Systems and Computers, 2004. Conference Record of the Thirty-Seventh Asilomar Conference on, Vol. 2, IEEE, 2003, pp. 1398–1402.
  • (9) M. H. Pinson, S. Wolf, A new standardized method for objectively measuring video quality, IEEE Transactions on broadcasting 50 (3) (2004) 312–322.
  • (10) K. Seshadrinathan, A. C. Bovik, Motion tuned spatio-temporal quality assessment of natural videos, IEEE transactions on image processing 19 (2) (2010) 335–350.
  • (11) Z. Li, A. Aaron, I. Katsavounidis, A. Moorthy, M. Manohara, Toward a practical perceptual video quality metric, The Netflix Tech Blog 6.
  • (12) M. Shahid, A. Rossholm, B. Lövström, H.-J. Zepernick, No-reference image and video quality assessment: a classification and review of recent approaches, EURASIP Journal on Image and Video Processing 2014 (1) (2014) 40.
  • (13) A. C. Bovik, Automatic prediction of perceptual image and video quality, Proceedings of the IEEE 101 (9) (2013) 2008–2024.
  • (14) K.-C. Yang, C. C. Guest, K. El-Maleh, P. K. Das, Perceptual temporal quality metric for compressed video, IEEE Transactions on Multimedia 9 (7) (2007) 1528–1535.
  • (15) F. Yang, S. Wan, Y. Chang, H. R. Wu, A novel objective no-reference metric for digital video quality assessment, IEEE Signal processing letters 12 (10) (2005) 685–688.
  • (16) Y. Kawayoke, Y. Horita, NR objective continuous video quality assessment model based on frame quality measure, in: Image Processing, 2008. ICIP 2008. 15th IEEE International Conference on, IEEE, 2008, pp. 385–388.
  • (17) T. Brandão, M. P. Queluz, No-reference quality assessment of H.264/AVC encoded video, IEEE Transactions on Circuits and Systems for Video Technology 20 (11) (2010) 1437–1447.
  • (18) M. A. Saad, A. C. Bovik, C. Charrier, Blind prediction of natural video quality, IEEE Transactions on Image Processing 23 (3) (2014) 1352–1365.
  • (19) A. Mittal, M. A. Saad, A. C. Bovik, A completely blind video integrity oracle, IEEE Transactions on Image Processing 25 (1) (2016) 289–300.
  • (20) M. Torres Vega, V. Sguazzo, D. C. Mocanu, A. Liotta, An experimental survey of no-reference video quality assessment methods, International Journal of Pervasive Computing and Communications 12 (1) (2016) 66–86.
  • (21) A. Mittal, A. K. Moorthy, A. C. Bovik, No-reference image quality assessment in the spatial domain, IEEE Transactions on Image Processing 21 (12) (2012) 4695–4708.
  • (22) P. Le Callet, C. Viard-Gaudin, D. Barba, A convolutional neural network approach for objective video quality assessment, IEEE Transactions on Neural Networks 17 (5) (2006) 1316–1327.
  • (23) M. Narwaria, W. Lin, Svd-based quality metric for image and video using machine learning, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 42 (2) (2012) 347–364.
  • (24) M. Plakia, M. Katsarakis, P. Charonyktakis, M. Papadopouli, I. Markopoulos, On user-centric analysis and prediction of QoE for video streaming using empirical measurements, in: Quality of Multimedia Experience (QoMEX), 2016 Eighth International Conference on, IEEE, 2016, pp. 1–6.
  • (25) C. G. Bampis, A. C. Bovik, Learning to predict streaming video QoE: Distortions, rebuffering and memory, arXiv preprint arXiv:1703.00633.
  • (26) C. G. Bampis, Z. Li, A. K. Moorthy, I. Katsavounidis, A. Aaron, A. C. Bovik, Study of temporal effects on subjective video quality of experience, IEEE Transactions on Image Processing 26 (11) (2017) 5217–5231.
  • (27) P. A. Kumar, S. Chandramathi, Intelligent video QoE prediction model for errorprone networks, Indian Journal of Science and Technology 8 (16).
  • (28) J. Søgaard, S. Forchhammer, J. Korhonen, Video quality assessment and machine learning: Performance and interpretability, in: Quality of Multimedia Experience (QoMEX), 2015 Seventh International Workshop on, IEEE, 2015, pp. 1–6.
  • (29) J. Søgaard, S. Forchhammer, J. Korhonen, No-reference video quality assessment using codec analysis, IEEE Transactions on Circuits and Systems for Video Technology 25 (10) (2015) 1637–1650.
  • (30) J. Bi, K. Bennett, M. Embrechts, C. Breneman, M. Song, Dimensionality reduction via sparse support vector machines, Journal of Machine Learning Research 3 (Mar) (2003) 1229–1243.
  • (31) A. Rehman, K. Zeng, Z. Wang, Display device-adapted video quality-of-experience assessment, in: Human Vision and Electronic Imaging XX, Vol. 9394, International Society for Optics and Photonics, 2015, p. 939406.
  • (33) M. T. Vega, D. C. Mocanu, S. Stavrou, A. Liotta, Predictive no-reference assessment of video quality, Signal Processing: Image Communication 52 (2017) 20–32.
  • (34) M. T. Vega, D. C. Mocanu, J. Famaey, S. Stavrou, A. Liotta, Deep learning for quality assessment in live video streaming, IEEE signal processing letters 24 (6) (2017) 736–740.
  • (35) J. Kim, H. Zeng, D. Ghadiyaram, S. Lee, L. Zhang, A. C. Bovik, Deep convolutional neural models for picture-quality prediction: Challenges and solutions to data-driven image quality assessment, IEEE Signal Processing Magazine 34 (6) (2017) 130–141.
  • (36) R. Weerakkody, An Open Access Dataset of User Generated Videos from Edinburgh Festival 2016 (May 2017). doi:10.5281/zenodo.582605.
  • (37) S. Blasi, M. Naccari, R. Weerakkody, J. Funnell, M. Mrak, The open-source turing codec: Toward fast, flexible, and parallel HEVC encoding, SMPTE Motion Imaging Journal 126 (9) (2017) 1–8. doi:10.5594/JMI.2017.2744578.
  • (38) ITU-R, Methodology for the subjective assessment of the quality of television pictures, Recommendation ITU-R BT.500-13 (2012).
  • (39) D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Paluri, Learning spatiotemporal features with 3d convolutional networks, in: Computer Vision (ICCV), 2015 IEEE International Conference on, IEEE, 2015, pp. 4489–4497.

  • (40) J. Duchi, E. Hazan, Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, Journal of Machine Learning Research 12 (Jul) (2011) 2121–2159.