DeepQoE: A Unified Framework for Learning to Predict Video QoE

Motivated by the prowess of deep learning (DL) based techniques in prediction, generalization, and representation learning, we develop a novel framework called DeepQoE to predict video quality of experience (QoE). The end-to-end framework first uses a combination of DL techniques (e.g., word embeddings) to extract generalized features. Next, these features are combined and fed into a neural network for representation learning. Such representations serve as inputs for classification or regression tasks. Evaluating the performance of DeepQoE with two datasets, we show that for the small dataset, the accuracy of all shallow learning algorithms is improved by using the representation derived from DeepQoE. For the large dataset, our DeepQoE framework achieves significant performance improvement in comparison to the best baseline method (90.94% vs. 82.84%). DeepQoE, an open source tool, provides video QoE research much-needed flexibility in fitting different datasets, extracting generalized features, and learning representations.




1 Introduction

Video quality of experience (QoE), which directly assesses the perceptual quality of service (QoS) from the end users’ perspective, has become the de facto metric in guiding the design, deployment, and operation of video related services and applications [1, 2]. Notwithstanding its crucial role in services like video streaming, the measurement, modeling, and prediction of video QoE remain challenging tasks [3]. Video QoE depends on many, often inter-related, factors, from system parameters such as resolution and frame rate [4, 5] to demographic information such as gender and age [5]. These factors, often referred to as influence factors (IFs), fall into three categories: system IFs, context IFs, and human IFs. Though user experience is often considered subjective and hard to quantify, human IFs will continue to be an essential part of QoE measurement and prediction [2]. For measuring QoE, two types of models are often used: subjective test models and objective quality models [6]. A subjective test directly measures QoE by soliciting users’ evaluations in a controlled laboratory environment. Users are given a series of test video sequences (original and processed ones) and are then required to score the video quality. Objective quality models often use the results from subjective tests as ground truth to identify the objective QoS parameters that contribute to user perceptual quality and to map these parameters to user QoE. Though these models are widely deployed, they have three drawbacks. First, conducting subjective tests can be costly in terms of time and money. Second, these models often rely on hand-crafted features and data representations that are unique to a specific dataset, and are thus difficult to apply directly to other scenarios. As such, when applied to new datasets, these models often do not generalize well. Third, though video QoE prediction can take the form of either classification or regression, both tasks often share significant similarities in features and processing methods. However, many models neglect these similarities and develop separate frameworks for feature engineering and training to perform classification or regression, often leading to inefficiency in model development and training.

To bridge these gaps, we aim to design a QoE prediction framework with the following design guidelines and objectives. First, the use of dataset-specific representation and feature engineering should be minimized. Instead, by leveraging the potential of deep learning techniques and the availability of large datasets, both feature extraction and representation should be designed in an end-to-end, learning based manner. Second, the framework should have the efficiency, configurability, and flexibility to perform multiple tasks and to facilitate transfer learning. To this end, we propose DeepQoE, an end-to-end and unified deep learning based framework that consists of three phases in tandem. First, we leverage deep learning based techniques (e.g., convolutional neural networks (CNNs)) to extract general features from different datasets or types of data. By applying these techniques, we can map data of different types and modalities into the same high-dimensional feature space. Next, a deep neural network (DNN) is used to process the different features to produce a representation. Finally, existing models or DeepQoE's own prediction layers can directly take the learned representations as input for classification or regression tasks. In a nutshell, the whole framework supplies a complete pipeline for feature extraction, representation learning, and QoE prediction, which is applicable to a variety of datasets.

Figure 1: DeepQoE framework. Input data, categorized into four types, is first processed with different learning based methods and then concatenated into one feature vector to learn a representation. The learned representation is used for classification or regression.

We compare DeepQoE to several shallow learning algorithms (e.g., decision tree) on classification problems. The results show that, on the small dataset, the performance of our framework is comparable to these non-deep-learning based algorithms, and that all machine learning algorithms perform better when they use the representation derived from DeepQoE. When applied to a large dataset, DeepQoE achieves significant performance improvements compared to the strongest baseline (DeepQoE 90.94% vs. SVM 82.84%). In addition to the performance improvement, the proposed framework has the following advantages. First, to the best of our knowledge, DeepQoE is the first model to directly predict the QoE score, in the form of either classification or regression, using the same framework. Second, with the help of a diverse set of deep learning techniques, DeepQoE provides powerful generalization and feature extraction, which enable effective transfer learning in video QoE research. To facilitate QoE research, we also develop DeepQoE into an open source tool and release pre-trained DeepQoE models.

The rest of the paper is organized as follows. Section 2 introduces related works on deep learning and video QoE prediction. Section 3 provides a detailed description of the design of the DeepQoE framework. Section 4 describes the experiments and presents the experimental results. Section 5 concludes this paper and discusses future works.

2 Related Works

Deep Learning.

Deep learning has emerged as a powerful set of frameworks and techniques that are widely applied in computer vision, natural language processing, and speech recognition research. In this work, we focus on three deep learning techniques: CNNs, word embeddings, and DNNs.

CNNs are among the most powerful tools to process visual information and often provide much better results than traditional models in video analysis tasks [7, 8]. Among various CNN frameworks aimed at extracting video features, DeepVideo [7] first uses a CNN to extract features frame-by-frame and then fuses them to derive temporal correlations. 3-Dimensional Convolutional Neural Networks (C3D) [8], with an added dimension in the convolution filters, produce features that contain both frame and context information. Word embeddings map words into vectors, such that vectors of words that are similar in semantics are also close in distance. Two prevailing frameworks are word2vec [9] and GloVe [10]. Word2vec, a prediction-based framework, uses a very large-scale dataset to learn word embeddings. In comparison, GloVe is a count-based approach that uses global co-occurrence statistics to produce word vectors.

DNNs, also called deep feedforward networks, are often used as function approximators. Recently, many new techniques, such as Dropout [11], have been introduced to mitigate over-fitting.

Video QoE Prediction. The prevailing approaches for video QoE prediction in general fall into two categories: objective QoE monitoring and data-driven approaches. Objective QoE monitoring often considers a set of video QoE influence factors (IFs) and designs schemes to fit them. The work in [12] selects video content as the IF and proposes a clustering algorithm to predict QoE. In [13], a user-centric model is built to select the important external audiovisual factors and users’ internal factors. Among data-driven approaches, some machine learning algorithms (e.g., linear regression [14], decision tree [1]) have been applied to predict QoE and to identify the important factors. Recently, deep learning techniques such as recurrent neural networks [15] have been used to predict video QoE. Since most of these models or frameworks rely on features unique to the particular dataset used, they may lack the capability to generalize. In addition, the models are designed to solve one task only: either classification or regression. To address these issues, we propose our DeepQoE framework, which can not only process data with a wide range of formats and modalities but also has the flexibility of performing both classification and regression tasks.

3 Framework Design

The architecture of the proposed framework is illustrated in Fig. 1. It has three phases that supply an end-to-end pipeline for predicting video QoE: feature preprocessing, representation learning, and QoE prediction.

3.1 Feature preprocessing

The goal of the feature preprocessing phase is to map input data into initial feature vectors, which are then fed into the representation learning phase. Since the training datasets could come from different sources, they pose challenges for feature preprocessing in the following aspects:

  • Heterogeneity in data modality and type: some datasets only contain categorical information and numerical values, while other datasets include video sequences (or video features) as well as detailed text descriptions of video type and content.

  • Heterogeneity in representation approaches: even within a dataset, finding a general representation for different categorical information can be difficult. For example, while it is easy to encode user gender information with 0 and 1, it is less straightforward to encode the resolution information (e.g., 480P, 720P, etc.) in an efficient manner. Moreover, categorical information such as video type can be represented as an index (integer), a one-hot vector (vector of zeros and a single one), or an embedded vector (vector of continuous variables). It is not immediately obvious which representation will give rise to the best classification or regression performance.
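The three candidate encodings for categorical information can be illustrated with a minimal Python sketch. The vocabulary, the function names, and the randomly initialised embedding table are assumptions made for illustration; in a trained model the embedding vectors would be learned parameters, and nothing here reflects the released DeepQoE code.

```python
import random

# Hypothetical vocabulary of resolution categories (labels taken from the text).
RESOLUTIONS = ["360P", "480P", "720P"]


def to_index(value, vocab):
    """Encode a category as its integer position in the vocabulary."""
    return vocab.index(value)


def to_one_hot(value, vocab):
    """Encode a category as a vector of zeros with a single one."""
    vec = [0] * len(vocab)
    vec[to_index(value, vocab)] = 1
    return vec


def make_embedding_table(vocab, dim, seed=0):
    """Create one small random vector per category.

    Assumption: in a real model these vectors are trainable; here they are
    only initialised so that the lookup can be demonstrated.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in vocab]


def to_embedding(value, vocab, table):
    """Look up the continuous vector assigned to a category."""
    return table[to_index(value, vocab)]


table = make_embedding_table(RESOLUTIONS, dim=8)
index = to_index("480P", RESOLUTIONS)
one_hot = to_one_hot("480P", RESOLUTIONS)
embedded = to_embedding("480P", RESOLUTIONS, table)
```

The index is the most compact form but imposes an arbitrary ordering, the one-hot vector removes that ordering at the cost of dimensionality, and the embedded vector lets the model learn a dense representation of fixed size.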

To address these challenges, we categorize the input data into four types: text, video, categorical information (integer values), and continuous values. For each input type, we adopt a specific approach to extract the features. Specifically, we use GloVe (pre-trained on a Wikipedia corpus), C3D (pre-trained on the Sports-1M dataset), an embedding layer, and a dense layer to extract the features for text, video, categorical information, and continuous values, respectively, as shown in Fig. 1. Let $x_i$ denote the input data of type $i$ and $v_i$ denote the extracted feature vector. The preprocessing can be summarized as:

$$v_i = f_i(x_i; \theta_i),$$

where $f_i$ represents the feature extraction method for data type $i$ and $\theta_i$ represents the learned parameters.

3.2 Learning representation

In this phase, the different feature vectors output by the preprocessing phase are first fused into a single feature vector. We use a simple concatenation to combine the feature vectors, since we find that other fusion approaches (such as a 1D CNN [16]) do not offer a noticeable performance improvement. Specifically, the fusion operation has the following mathematical form:

$$v = g_1(v_1) \oplus g_2(v_2) \oplus \cdots \oplus g_n(v_n),$$

where $v$ is the fused feature vector, $\oplus$ represents the concatenation operation, $n$ is the number of feature vectors, and $g_i$ is a general function associated with the $i$-th feature vector $v_i$. The salient feature of this design is that the functions $g_i$ provide a general and flexible way of assigning different “weights” to different features. Moreover, the choice of $g_i$ can be considered a form of hyper-parameter tuning in the training process. This approach can help us not only to achieve better performance but also to identify the important contributing factors to QoE (by evaluating the $g_i$ that achieve the best performance) during training and testing.
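The fusion step can be sketched in a few lines of Python: each feature vector passes through its own function and the results are concatenated in order. The helper names, the toy feature values, and the block-averaging projection are hypothetical and chosen only to make the dimensions easy to follow.

```python
def identity(vec):
    """Per-feature function that leaves the feature vector unchanged."""
    return list(vec)


def make_linear_projection(weights):
    """Build a per-feature function v -> W v from a list of weight rows."""
    def project(vec):
        return [sum(w * x for w, x in zip(row, vec)) for row in weights]
    return project


def fuse(features, transforms):
    """Apply one function per feature vector, then concatenate the results."""
    if len(features) != len(transforms):
        raise ValueError("need exactly one transform per feature vector")
    fused = []
    for vec, transform in zip(features, transforms):
        fused.extend(transform(vec))
    return fused


# Toy feature vectors (values are placeholders, not real extracted features).
text_feature = [0.1] * 50        # e.g. a 50-dimensional word vector
resolution_feature = [0.2] * 8   # e.g. an 8-dimensional embedding
bitrate_feature = [0.5]          # e.g. a 1-dimensional dense output

# Hypothetical 5x50 projection: each output averages a block of ten inputs.
reduce_text = make_linear_projection(
    [[0.1 if 10 * r <= c < 10 * (r + 1) else 0.0 for c in range(50)]
     for r in range(5)]
)

fused = fuse(
    [text_feature, resolution_feature, bitrate_feature],
    [reduce_text, identity, identity],
)
```

Shrinking or enlarging the output size of a per-feature function changes how many entries of the fused vector that feature occupies, which is one way to read the "weights" mentioned in the text.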

The fusion layer is followed by a few fully connected layers to continue the learning of a representation. The number of layers is another design parameter that can be adjusted depending on the size of the dataset. In particular, the representation output at layer $l$, $h_l$, has the following form:

$$h_l = \sigma_l(W_l h_{l-1} + b_l),$$

where $h_{l-1}$, $W_l$, $b_l$, and $\sigma_l$ represent the input, the weight, the bias, and the activation function of layer $l$. In addition, the dropout technique [11] is applied to these hidden layers during training to prevent overfitting.
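A fully connected layer with dropout can be sketched as follows. The ReLU activation and the "inverted" dropout variant (rescaling the kept units during training so that inference needs no change) are assumptions for illustration; the text does not state which activation or dropout variant DeepQoE uses.

```python
import random


def relu(x):
    """Rectified linear unit (assumed activation, not confirmed by the text)."""
    return x if x > 0.0 else 0.0


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W @ inputs + b)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def dropout(values, rate, rng, training=True):
    """Inverted dropout.

    During training each unit is zeroed with probability `rate` and the
    surviving units are scaled by 1 / (1 - rate); at inference the values
    pass through unchanged.
    """
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)
hidden = dense(
    inputs=[1.0, -2.0],
    weights=[[0.5, -0.25], [1.0, 1.0]],
    biases=[0.0, 0.5],
    activation=relu,
)
dropped = dropout(hidden, rate=0.5, rng=rng)
inference = dropout(hidden, rate=0.5, rng=rng, training=False)
```

Stacking several `dense` calls, each followed by `dropout` during training, gives the kind of representation network described above; the number of layers stays a free design parameter.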

Figure 2: Performance comparison between using original features and using representations derived from DeepQoE. Using DeepQoE representations, all models achieve better performance.

3.3 Predicting video QoE

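In this phase, the learned representation is fed into a NN or a DNN, which performs either classification or regression. Let $r$ denote the representation vector output from the representation learning phase and $y$ denote the ground truth. Using a NN with only a single layer (specified by the weight matrix $W$) as an example, we apply cross-entropy as the loss function for classification:

$$\mathcal{L}_{\mathrm{cls}} = -\sum_{k} y_k \log \phi(W r)_k,$$

where $\phi$ is the softmax activation function, $y$ is encoded as a one-hot vector, and $k$ indexes the QoE classes. For regression, the loss function is:

$$\mathcal{L}_{\mathrm{reg}} = \frac{1}{N} \sum_{j=1}^{N} \left( y^{(j)} - \psi(W r^{(j)}) \right)^2,$$

where $\psi$ is the linear activation function, the superscript $(j)$ indexes the samples, and $N$ is the number of samples in a training batch.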
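Both loss functions can be written out for a single-layer predictor in plain Python. This is a sketch under assumptions: the regression loss is taken to be the batch mean squared error (consistent with the MSE values reported later), the ground truth for classification is passed as a class index rather than a one-hot vector, and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, representation):
    """Single layer without bias: one output per weight row."""
    return [sum(w * x for w, x in zip(row, representation)) for row in weights]


def cross_entropy(weights, representation, true_class):
    """Cross-entropy of the softmax output against a class index."""
    probs = softmax(linear(weights, representation))
    return -math.log(probs[true_class])


def mean_squared_error(weight_row, representations, targets):
    """Batch MSE of a linear (identity-activation) output against targets."""
    preds = [sum(w * x for w, x in zip(weight_row, r)) for r in representations]
    return sum((y - p) ** 2 for y, p in zip(targets, preds)) / len(targets)


# With all-zero weights every one of the five classes gets probability 1/5.
uniform_loss = cross_entropy([[0.0, 0.0]] * 5, [1.0, 2.0], true_class=3)

# Two samples: predictions are 3.0 and 4.0, targets are 3.0 and 5.0.
batch_mse = mean_squared_error([1.0, 0.0], [[3.0, 9.0], [4.0, 9.0]], [3.0, 5.0])
```

Because the two tasks differ only in the output activation and the loss, the same learned representation can serve either of them.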

4 Results Presentation

To evaluate the performance of the proposed DeepQoE framework, we design three experiments based on two datasets. For each experiment, the architecture of DeepQoE is adjusted according to the requirement of the experiment.

4.1 Small text dataset

The first and second experiments both use a small dataset, WHU-MVQoE2016 [17]. There are four video types: movie, cartoon, news, and sports. For each type, there are two different video titles. Each video title is encoded at three resolutions: 720P, 480P, and 360P. In addition, each resolution is encoded with three different bitrates. In total, there are 72 video clips in the dataset. A total of 16 subjects are asked to watch the videos on a phone and rate them on a scale of one (bad) to five (excellent). The dataset also includes the ages and genders of the end users. After post-processing, the dataset contains 1116 rating scores and 72 mean opinion scores (MOS).
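The clip count follows directly from the design of the dataset, and a MOS is the mean of the ratings collected for one clip. The sketch below reproduces that arithmetic; the ratings passed to `mean_opinion_score` are made up for illustration, and the figure of 1152 is only the upper bound reached if every subject rated every clip (the source reports 1116 ratings after post-processing).

```python
from itertools import product
from statistics import mean

VIDEO_TYPES = ["movie", "cartoon", "news", "sports"]
TITLES_PER_TYPE = 2
RESOLUTIONS = ["720P", "480P", "360P"]
BITRATES_PER_RESOLUTION = 3
SUBJECTS = 16

# One clip per (type, title, resolution, bitrate level) combination.
clips = list(product(VIDEO_TYPES, range(TITLES_PER_TYPE),
                     RESOLUTIONS, range(BITRATES_PER_RESOLUTION)))
num_clips = len(clips)
max_ratings = num_clips * SUBJECTS  # upper bound before post-processing


def mean_opinion_score(ratings):
    """MOS of one clip: the mean of its ratings on the 1-to-5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return mean(ratings)


example_mos = mean_opinion_score([4, 5, 3, 4])  # hypothetical ratings
```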

Figure 3: Performance comparison of different methods on the small dataset. DeepQoE is comparable to the shallow learning algorithms when using the same original features.

4.1.1 Classification

The first experiment is conducted to predict users’ voting scores, which can be cast as a classification task with five classes. In the preprocessing module, we use the pre-trained GloVe model to transform the four video types into four 50-dimensional vectors. For resolution, we use an embedding layer to map a resolution value to a vector of eight dimensions. For bitrate and user age, we normalize them to the range [0, 1] and use a dense layer to get two vectors of one dimension each. For user gender information, we use an embedding layer to get a vector of one dimension. Next, the 50-dimensional vector of the video type is reduced to five dimensions. The representation learning phase concatenates these vectors into a single one and then feeds it to two fully connected layers, with dropout applied to prevent overfitting. Finally, the output layer uses the softmax activation function and the cross-entropy loss function for training and prediction.
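The dimensions listed above determine the size of the concatenated vector, and the [0, 1] scaling of bitrate and age can be done in several ways. The sketch below adds up the stated dimensions and uses min-max scaling, which is an assumption since the text does not name the normalization method; the age values are hypothetical.

```python
def min_max_normalize(values):
    """Scale values linearly to [0, 1]; constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Per-feature dimensions after preprocessing, as stated in the text.
FEATURE_DIMS = {
    "video_type": 5,   # 50-dimensional GloVe vector reduced to 5
    "resolution": 8,   # embedding layer
    "bitrate": 1,      # dense layer on the normalized value
    "age": 1,          # dense layer on the normalized value
    "gender": 1,       # embedding layer
}
fused_dim = sum(FEATURE_DIMS.values())

ages = [18, 24, 30]  # hypothetical user ages
normalized_ages = min_max_normalize(ages)
```

Under these dimensions the two fully connected layers receive a 16-dimensional input.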

Figure 4: Spearman Rank Order Correlation Coefficient (SROCC) of our DeepQoE trained by using news, cartoon and sport videos and tested on the movie videos.

We compare the DeepQoE model with shallow learning models (such as SVM and random forest). In particular, we repeat the same experiment 10 times and use the average as our final result. Each time, the output of the last fully connected layer of the trained DeepQoE model is fed into one of the well-known machine learning models as features. The results show that all models that use the representations provided by DeepQoE perform better than when they use the original features (Fig. 2). When using the original features directly, DeepQoE may not perform the best, but it is comparable to the shallow learning algorithms, as demonstrated in Fig. 3.

4.1.2 Regression

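Video_1  Video_2  Video_3  Video_4  Video_5  Video_6  Video_7  Video_8  Baseline  DeepQoE
0.104    0.145    0.107    0.076    0.129    0.076    0.141    0.229    0.126     0.298
Table 1: Regression models comparison (MSE). Video_1 to Video_8 are the per-title MSEs of the baseline model, and Baseline is their average.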

In the second experiment, we predict the MOS of the videos. Since this is cast as a regression problem, the ground truth is replaced by average scores and data related to user gender and age is not used.

We use the simple regression model from [2] as the baseline. The model essentially performs single variable regression — it predicts MOS based only on the bitrate of a video title. We first apply this regression model to each of the eight video titles in our dataset and generate eight mean square error (MSE) values. We then take the average of these eight MSEs, which is 0.126. To evaluate the performance of DeepQoE, we use all the video related information in our dataset (video type, resolution, and bitrate) and train the model only once in an end-to-end fashion. The MSE of DeepQoE regression is 0.298, as shown in Table 1.
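The baseline protocol (fit one single-variable model per title, compute its MSE, then average over titles) can be sketched as below. The linear form of the per-title fit is an assumption, since the functional form used in [2] is not given here, and the bitrate and MOS values in the toy fit are made up. The per-title MSEs are the ones listed in Table 1.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return slope, y_bar - slope * x_bar


def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line on the given points."""
    return mean((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Toy per-title fit with hypothetical (bitrate, MOS) points.
bitrates = [1.0, 2.0, 3.0]
mos_values = [2.0, 3.0, 4.0]
slope, intercept = fit_line(bitrates, mos_values)
toy_mse = mse(bitrates, mos_values, slope, intercept)

# Per-title baseline MSEs from Table 1 and their average.
PER_TITLE_MSE = [0.104, 0.145, 0.107, 0.076, 0.129, 0.076, 0.141, 0.229]
baseline_mse = mean(PER_TITLE_MSE)
```

Averaging the eight listed values gives 0.125875, which rounds to the 0.126 reported for the baseline.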

Movie   Cartoon  Sport   News
0.4119  0.5069   0.6771  1.2679
Table 2: DeepQoE regression results. We use three kinds of videos as the training set and then use the remaining one as the test set to verify that our approach can obtain a fair prediction result.

The larger MSE of our model can initially be attributed to the fact that deep learning based models often work better with a very large dataset, as evidenced by the results presented in the next section. Improving the regression performance (with the help of new techniques and more data) remains the focus of our ongoing research. However, we note that our model has two distinctive advantages. First, the regression performed by the baseline model is on a per-video basis, so the regression coefficients obtained for one video cannot be directly applied to another one. In contrast, our DeepQoE model, trained on one dataset, includes all the general features sufficient to predict the MOS of videos of various types, resolutions, and bitrates (Table 2). Second, for a new video sequence of a given bitrate, the baseline model cannot directly predict the MOS of this video, because data points at different bitrates (needed to obtain the regression coefficients) are unavailable. With the help of generalized feature extraction, our model can find the correlation between the new video and the trained ones, and is thus able to predict the MOS (Fig. 4).
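Figure 4 reports the Spearman Rank Order Correlation Coefficient (SROCC) between predicted and observed scores. For readers unfamiliar with the metric, a minimal sketch follows: it ranks both sequences (ties receive their average rank) and computes the Pearson correlation of the ranks. The function names and the example scores are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def srocc(predicted, observed):
    """Spearman rank order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(predicted), average_ranks(observed))


# Hypothetical predicted and observed MOS values with identical ordering.
agreement = srocc([1.2, 2.5, 3.1, 4.0], [1.0, 2.0, 3.5, 4.8])
reversal = srocc([1.2, 2.5, 3.1, 4.0], [4.8, 3.5, 2.0, 1.0])
```

Since SROCC depends only on ordering, it rewards a model that ranks unseen videos correctly even when its absolute MOS predictions are offset.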

4.2 Large video dataset

SVM       Decision Tree  Random Forest  AdaBoost  Naïve Bayes  DeepQoE
3344.64s  82.91s         33.56s         700.03s   233.54s      504.41s
Table 3: Training time comparison.

For the third experiment, we use a large dataset from [18], which has 220 different video titles. Each video title is five seconds in duration and has four different resolutions. For each resolution, a video title is encoded into 52 video clips with different quantization parameters (QP). In total, there are 45760 different video clips in the dataset. A total of 800 subjects participated in this test. After post-processing, three just-noticeable-difference (JND) [19] points are derived for each video clip. A JND point is a statistical quantity that accounts for the maximum difference unnoticeable to a user. Using the notion of JND to measure the quality of coded images and videos was recently proposed [20, 21]. As such, we use JND as the QoE metric in our experiment. Specific to our experiment and dataset [18], the three JND points represent the three QP values at which a noticeable degradation of video quality is observed. That is, any QP value smaller than the first JND point (before its occurrence) is considered excellent; QP values between the first and second JND points are considered good; QP values between the second and third JND points are considered fair; and QP values larger than the third JND point are considered bad. Thus we cast the QoE prediction as a classification problem with four QoE classes: excellent, good, fair, and bad. Similar to the previous two experiments, we extract generalized features first and then make a prediction using the softmax activation function. Since this dataset includes the video clips in addition to categorical data, we take full advantage of this by using a 3D CNN to extract video content features. The results show that the DeepQoE model can effectively capture the JND information and thus provides much better classification performance: its 90.94% accuracy is higher than that of every shallow learning algorithm, as shown in Fig. 5.
In contrast to the results on the small dataset, DeepQoE performs the best on the large dataset. Moreover, training the proposed model takes only about 15% of the time needed to train the SVM (504.41 s vs. 3344.64 s), which provides the best performance among the non-deep-learning based algorithms (Table 3).
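The labelling rule that turns three JND points into four QoE classes can be sketched with a sorted-boundary lookup. Two points are assumptions: the JND values in the example are invented, and a QP exactly equal to a JND point is assigned to the worse class, because the text does not specify how boundary values are handled.

```python
from bisect import bisect_right

QOE_CLASSES = ["excellent", "good", "fair", "bad"]


def qoe_class(qp, jnd_points):
    """Map a QP value to a QoE class using three increasing JND points.

    Assumption: a QP equal to a JND point falls into the worse class.
    """
    if len(jnd_points) != 3 or list(jnd_points) != sorted(jnd_points):
        raise ValueError("expected three JND points in increasing order")
    return QOE_CLASSES[bisect_right(list(jnd_points), qp)]


# Dataset size implied by the description: titles x resolutions x QP levels.
total_clips = 220 * 4 * 52

# Hypothetical JND points (in QP units) for one title at one resolution.
jnd = (25, 32, 38)
labels = [qoe_class(qp, jnd) for qp in (20, 25, 30, 35, 40)]
```

Replacing `bisect_right` with `bisect_left` would assign boundary values to the better class instead.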

Figure 5: Performance comparison on the large dataset. DeepQoE shows the best performance compared with the shallow learning algorithms.

5 Conclusion and Future Works

Accurate and efficient QoE prediction provides important guidance for the deployment and operation of video services and applications. To address two main drawbacks of current prediction models, namely over-reliance on dataset-specific feature engineering and lack of configurability for transfer learning, we propose DeepQoE, a deep learning based framework capable of feature extraction, representation learning, and QoE prediction. Our results show that the representation learned via DeepQoE can improve the prediction accuracy of shallow learning models on a small dataset. When applied to a larger dataset, our framework is shown to achieve the best performance in comparison to other state-of-the-art algorithms. For our future research, we plan to continue to stress-test and improve the performance of DeepQoE as larger datasets of subjective QoE tests become available. We also plan to extend DeepQoE to real-time QoE prediction. In particular, we will evaluate whether techniques such as the attention mechanism [22] can help to improve the performance in continuous-time scenarios.


  • [1] Athula Balachandran et al., “Developing a predictive model of quality of experience for internet video,” in ACM SIGCOMM Computer Communication Review. ACM, 2013, vol. 43, pp. 339–350.
  • [2] Weiwen Zhang et al., “QoE-driven cache management for HTTP adaptive bit rate streaming over wireless networks,” IEEE Transactions on Multimedia, vol. 15, no. 6, pp. 1431–1445, 2013.
  • [3] Tiesong Zhao et al., “QoE in video transmission: A user experience-driven strategy,” Commun. Surveys Tuts, vol. 19, no. 1, pp. 285–302, 2017.
  • [4] Naty Ould Sidaty et al., “Influence of video resolution, viewing device and audio quality on perceived multimedia quality for steaming applications,” in EUVIP. IEEE, 2014, pp. 1–6.
  • [5] Quan Huynh-Thu et al., “Temporal aspect of perceived quality in mobile video broadcasting,” IEEE Trans. Broadcast, vol. 54, no. 3, pp. 641–651, 2008.
  • [6] Yanjiao Chen et al., “From QoS to QoE: A tutorial on video quality assessment,” Commun. Surveys Tuts, vol. 17, no. 2, pp. 1126–1165, 2015.
  • [7] Andrej Karpathy et al., “Large-scale video classification with convolutional neural networks,” in CVPR, 2014, pp. 1725–1732.
  • [8] Du Tran et al., “C3D: generic features for video analysis,” CoRR, abs/1412.0767, vol. 2, no. 7, pp. 8, 2014.
  • [9] Tomas Mikolov et al., “Distributed representations of words and phrases and their compositionality,” in NIPS, 2013, pp. 3111–3119.
  • [10] Jeffrey Pennington et al., “GloVe: Global vectors for word representation,” in EMNLP, 2014, pp. 1532–1543.
  • [11] Nitish Srivastava et al., “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [12] Hossein Malekmohamadi et al., “Automatic QoE prediction in stereoscopic videos,” in ICMEW. IEEE, 2012, pp. 581–586.
  • [13] Jiarun Song et al., “QoE evaluation of multimedia services based on audiovisual quality and user interest,” IEEE Trans. Multimedia, vol. 18, no. 3, pp. 444–457, 2016.
  • [14] Florin Dobrian et al., “Understanding the impact of video quality on user engagement,” in ACM SIGCOMM Computer Communication Review. ACM, 2011, vol. 41, pp. 362–373.
  • [15] Christos G Bampis et al., “Recurrent and dynamic models for predicting streaming video quality of experience,” IEEE Transactions on Image Processing, 2018.
  • [16] Steven Bohez et al., “Sensor fusion for robot control through deep reinforcement learning,” arXiv preprint arXiv:1703.04550, 2017.
  • [17] Yingxue Zhang et al., “WHU-MVQoE2016: A quality of experience dataset for mobile video research,” Dec. 2016.
  • [18] Haiqiang Wang et al., “VideoSet: A large-scale compressed video quality dataset based on JND measurement,” J Vis Commun Image Represent, vol. 46, pp. 292–302, 2017.
  • [19] Jason Fischer et al., “Serial dependence in visual perception,” Nature Neuroscience, vol. 17, no. 5, pp. 738–743, 2014.
  • [20] Jingteng Xue et al., “Mobile JND: Environment adapted perceptual model and mobile video quality enhancement,” in ACM MMSys. ACM, 2012, pp. 173–183.
  • [21] Jinjian Wu et al., “Just noticeable difference estimation for images with free-energy principle,” IEEE Trans. Multimedia, vol. 15, no. 7, pp. 1705–1710, 2013.
  • [22] Dzmitry Bahdanau et al., “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.