VideoATLAS_release
None
view repo
Mobile streaming video data accounts for a large and increasing percentage of wireless network traffic. The available bandwidths of modern wireless networks are often unstable, leading to difficulties in delivering smooth, high-quality video. Streaming service providers such as Netflix and YouTube attempt to adapt their systems to adjust in response to these bandwidth limitations by changing the video bitrate or, failing that, allowing playback interruptions (rebuffering). Being able to predict end user' quality of experience (QoE) resulting from these adjustments could lead to perceptually-driven network resource allocation strategies that would deliver streaming content of higher quality to clients, while being cost effective for providers. Existing objective QoE models only consider the effects on user QoE of video quality changes or playback interruptions. For streaming applications, adaptive network strategies may involve a combination of dynamic bitrate allocation along with playback interruptions when the available bandwidth reaches a very low value. Towards effectively predicting user QoE, we propose Video Assessment of TemporaL Artifacts and Stalls (Video ATLAS): a machine learning framework where we combine a number of QoE-related features, including objective quality features, rebuffering-aware features and memory-driven features to make QoE predictions. We evaluated our learning-based QoE prediction model on the recently designed LIVE-Netflix Video QoE Database which consists of practical playout patterns, where the videos are afflicted by both quality changes and rebuffering events, and found that it provides improved performance over state-of-the-art video quality metrics while generalizing well on different datasets. The proposed algorithm is made publicly available at http://live.ece.utexas.edu/research/Quality/VideoATLAS release_v2.rar.
READ FULL TEXT VIEW PDFNone
Mobile video traffic accounted for 55 percent of total mobile data traffic in 2015, according to the Cisco Visual Networking Index (VNI) and global mobile data traffic forecast [1]. Since video data traffic and streaming services are significantly increasing, content providers such as Netflix and YouTube must make resource allocation decisions and mediate tradeoffs between operational costs and end user Quality of Experience (QoE). Since in video data applications such as streaming the human is the end user, perceptually-driven optimization strategies are desireable to guide the resource allocation problem.
While the motivation for perceptually-driven models is obvious, QoE prediction is still far from being an easy task. The low-level human visual system (HVS) is complex and driven by non-linear processes not yet well understood. There are also cognitive factors that influence perceived QoE, adding further layers of complexity, complicating the analysis of human subjective data and the design of QoE prediction models. For example, subjective QoE is affected by recency: more recent QoE experiences may have a higher impact on currently perceived QoE [2]. We are interested here in two types of subjective QoE: retrospective QoE and continuous-time QoE. In studies of retrospective QoE, subjects provide a single score describing their overall QoE on each presented video sequence. Studies of continuous-time QoE involve the real-time measurement of each subject’s current QoE, which may be triggered by changes in video quality or streaming and by short or long term memory effects.
With respect to these challenges, we will show that existing objective video quality assessment (VQA) methods inadequately model subject QoE. There is also a broad spectrum of video distortions ranging from video compression artifacts to rebuffering events, all having different effects on subject QoE. In streaming applications, rebuffering currently appears to be a necessary evil, since the available bandwidth is volatile and hard to predict. However, only recently have sophisticated approaches been developed that predict the effects of rebuffering on QoE. Yet, making unified QoE predictions involving diverse impairments remains an elusive goal.
Towards solving this challenging problem, we have developed a learning-based approach to making QoE predictions when the videos are afflicted by both bitrate changes and rebuffering. This is commonly seen in practice, where video bitrate often varies over time and where rebuffering events frequently occur. However, most existing subjective video quality datasets cannot be used to study general QoE models, since they either do not contain both rebuffering and quality changes, or they are of limited size or their design is not suitable for streaming applications. Towards filling this gap, the recently introduced LIVE-Netflix dataset [3] was specifically designed for this problem and includes the outcomes of a large subjective study.
The rest of this paper is organized as follows. Section II discusses previous work on QoE prediction related to streaming applications. Then, Section III gives an overview of the LIVE-Netflix dataset [3] that we use to study these impairments and to develop more general QoE models. Section IV investigates whether currently used VQA methods are suitable for QoE prediction on this dataset and motivates the need for a more general framework. Section V describes the proposed learning-based QoE prediction framework and Section VI presents experimental results. Finally, Section VII gives conclusion.
QoE prediction models typically consider a set of video impairments in light of human subjective data. To facilitate a description of previous work on QoE prediction, consider the following two types of video impairments that affect perceived user QoE:
A. Impairments of Videos with Normal Playback
The most typical streaming scenario is to apply an adaptive bitrate allocation strategy such that bandwidth consumption is optimized. An example of a compressed video can be seen in Fig. 1a. The effects of bitrate changes on the retrospective QoE may vary according to a number of QoE-related scene aspects: low-level content (slow/fast motion scenes), previous bitrates, frequency of bitrate shifts and their noticeability, the display device being used and so on [4]. Apart from bitrate selection schemes which lead to compression artifacts, other network-related distortions arise from packet losses [5] or impairments of the source videos. A commonality of these impairments is that there are no implied playback interruptions, with the rare exception of severe packet loss, where whole groups of frames cannot be properly decoded. To help study and measure the video quality degradations induced by these video distortions, many successful datasets have been built [6, 4, 7, 8]. An overview of available video quality datasets can be found in [9].
A wide variety of video quality assessment (VQA) models have been proposed ranging from full-reference (FR) to no-reference (NR) [10]. These include standard frame-based techniques (FR-IQA) such as SSIM [11, 12] and MS-SSIM [13], temporal FR-VQA methods such as VQM_VFD [14], MOVIE [15], ST-MAD [16], VMAF [17] and FLOSIM [18] and reduced-reference models like STRRED [19].
No-reference (NR) VQA has also been deeply studied [20]. Many distortion-specific NR VQA methods [21, 22, 23] have been designed to predict the effect of domain-relevant distortions on perceived quality. In a general model [24]
, a natural scene statistics model in the DCT domain was used to train a support vector regressor to predict the effects of packet loss, MPEG-2 and H.264 compression. VIIDEO
[25] generalizes further by relying only on statistical regularities of natural videos, rather than on subjective scores or prior information about the distortion types. However, the NR VQA problem remains far from an ultimate solution.B. Playback Interruption
When the available bandwidth reaches a critical value (e.g. in a mobile streaming scenario), playback interruption is sometimes very difficult to avoid. Fig. 1b depicts an example of playback interruption. While the effects of rebuffering on QoE are not yet well understood, various studies have shown that the duration, frequency and location of rebuffering events severely affects QoE [26, 27, 28, 29]. By making use of global rebuffering statistics, Quality of Service (QoS) models such as FTW [30] and VsQM [31] have been proposed. More recent efforts [28] have sought to both model the effects of rebuffering on user QoE, and to integrate them with models of recency [2].
These video impairments are usually studied in isolation. For example, QoE models have either been designed for videos suffering from compression distortion or from rebuffering, but not both. This is partly due to the unavailability of suitable subjective data, along with the difficulty of combining objective video quality models and rebuffering-related information into single QoE scores. In [32], FR quality algorithms such as SSIM and MS-SSIM were combined with rebuffering information yielding the Streaming Quality Index (SQI). In [33]
, the authors fed QP values and rebuffering related features into a Random Neural Network learning model to make QoE predictions. However, their method was evaluated on only 4 contents and on short video sequences of 16 seconds, did not consider longer term memory effects and did not deploy perceptually relevant VQA algorithms. This suggests the need for larger streaming-oriented subjective datasets and algorithms which collectively build on perceptually driven VQA methods, rebuffering models and other QoE-aware features. Note that HAS uses TCP, hence it is resilient to video quality degradations related to packet loss, such as glitches and other transient artifacts
[6]. As a result, the two main impairment categories that a streaming dataset should include are compression (due to the multiple encoding bitstream representations of the high-quality source content) and playback interruptions (due to throughput and buffer limitations).We begin by describing the recently designed LIVE-Netflix Video QoE Database which contains videos suffering from temporal rate/quality changes and rebuffering events. Next, we develop Video ATLAS: a new learning framework that integrates objective VQA metrics with rebuffering-related features to conduct QoE prediction.
Most existing video quality databases consider the two main video impairments (quality changes and playback interruptions) either in isolation or in an ad hoc fashion, hampering their practical relevance. In addition, due to the difficulty of designing and carrying out large video subjective studies, many of these datasets are of quite limited size in terms of video content and/or the number of participants. We recently designed the LIVE-Netflix Video QoE Database [3], which uses a set of 8 different playout patterns on 14 diverse video contents. The video content spans a variety of content types typical of streaming applications, including action scenes, drama, cartoons and anime. We gathered approximately subjective QoE (both continuous and retrospective) scores from 56 subjects, each participating in three 45 minute sessions.
The playout patterns contain mixtures of static and dynamic bitrate selection strategies together with playback interruptions, assuming practical network conditions and buffer size. Figure 2 shows the exemplar temporal bandwidth condition. For all playout patterns, it is assumed that the available bandwidth can reach a maximum value of kbps and a minimum of kbps; a variety of playout bitrates can occur within this range. However, the buffer capacity is assumed constant over all playout patterns.
The underlying study design allows for direct comparisons between playout patterns with regards to bitrates, and the locations and the durations of playback interruptions. These playout patterns model realistic network allocation policies that content providers need to decide on. The diverse spatiotemporal characteristics and realistic playout patterns make the new LIVE-Netflix dataset a useful tool for training and evaluating video QoE predictors. This dataset consists of both public and Netflix content. The public videos together with metadata for all videos will be made available.
Most VQA algorithms do not consider playback interruptions. However, the increasingly pressing problem of rebuffering events in streaming applications dictates the need to quantify the effects of using (or failing to use) rebuffering-aware methods when predicting user QoE. Therefore, we selected a few important objective quality metrics and applied them on the LIVE-Netflix dataset twice. First, on the set of videos distorted only by video quality changes with normal playback (). Second, on all the videos in the dataset (). Then, we calculated the correlations of the prediction models against the retrospective subjective scores in the LIVE-Netflix Database to better understand the effect of including rebuffering-aware information. We used the following models: PSNR, PSNRhvs [34], SSIM [11], MS-SSIM [13], NIQE [35], VMAF [17], the FR version of STRRED [19] and GMSD [36]. Note that for PSNRhvs [34] we used the publicly available implementation of the Daala codec [37]. For the rest of the implementations, we use the publicly available implementations and all objective quality metrics were applied on the luminance component. The results are tabulated in Table I.
IQA/VQA metric | ||
---|---|---|
PSNR (IQA, FR) | 0.5561 | 0.5152 |
PSNRhvs [34] (IQA, FR) | 0.5841 | 0.5385 |
SSIM [11] (IQA, FR) | 0.7852 | 0.7015 |
MS-SSIM [13] (IQA, FR) | 0.7532 | 0.6800 |
NIQE [35] (IQA, NR) | 0.3960 | 0.1697 |
VMAF [17] (VQA, FR) | 0.7533 | 0.6097 |
STRRED [19] (VQA, RR) | 0.7996 | 0.6594 |
GMSD [36] (IQA, FR) | 0.6476 | 0.5812 |
Consider , which only includes video compression artifacts. NIQE performed the worst, since it is a frame-based NR method. PSNR performed worse than all FR methods, while PSNRhvs achieved a small improvement over PSNR. The gradient-based GMSD performed worse than SSIM. STRRED yielded the best performance whereas VMAF performed poorly. Notably, STRRED performed similar to SSIM, while MS-SSIM performed worse than SSIM. This raises the following contradiction: if we consider videos that suffered only from video bitrate changes () , why would a single scale algorithm such as SSIM perform better than its multiscale counterpart and almost the same as a more sophisticated VQA model such as STRRED? We believe that when subjects are exposed to both rebuffering and quality changes, they tend to internally compare between them rather than evaluating their QoE merely based on quality changes. This makes objective video quality models less reliable, by decorrelating their performance against perceived QoE. This strongly suggests that rebuffering and bitrate changes must be considered jointly and not in isolation.
Next we consider the performance of these quality models on . First, there was clearly a large drop in the performance of all models compared to . Note that SSIM unexpectedly outperformed STRRED and MS-SSIM. This suggests that objective quality models are less suitable for QoE prediction on videos afflicted by interrupted playback. However, in mobile streaming applications, rebuffering events occur often. Again, this implies the need to integrate QoE-aware information into QoE prediction models. In this direction, we next describe a new learning framework which integrates objective video quality, rebuffering-related and memory features to significantly improve QoE prediction.
Our proposed framework is designed to make predictions on retrospective QoE scores i.e. the subjective score given by subjects after the video playback has finished. In order to capture both video quality and to predict reactions to playback interruptions, we compute the following types of QoE-relevant input features:
1. Objective video quality scores (VQA)
During normal playback, any good video quality algorithm can be used to measure objective QoE scores. Our method allows the use of any full reference (FR) or no reference (NR) image/video quality model [20] as appropriate for the application context. We selected several that are both highly compute-efficient and that deliver accurate VQA predictions, rather than using compute-intensive models [15, 38]. Since we are focused on predicting retrospective QoE scores, a pooling strategy was chosen that collapses per-frame objective quality measurements into a single value. A number of different pooling strategies have been proposed [4, 39, 40] that capture subjective QoE aspects such as recency (whereby more recent experiences have a larger weight when making retrospective evaluations) or the peak-end effect (the worst and best parts of an event affect the QoE more). For simplicity, we deployed simple averaging of the QoE scores as suggested in [39], reserving recency modeling as a separate input feature.
2. Rebuffering-aware features (R and R)
When the playback is interrupted, objective video quality algorithms are not operative. Based on previous observations regarding the effects of re-buffering [41, 42, 30, 27, 28], we use the length of each rebuffering event measured in seconds (R) and the number of rebuffering events (R). The length of the rebuffering event(s) were normalized to the duration of each video.
3. Memory-related feature (M)
The users’ QoE also depends on the recency effect. When conducting retrospective QoE prediction, we computed the time since the last rebuffering event or rate drop took place and was completed i.e. the number of seconds with normal playback at the maximum possible bitrate until the end of the video. This feature was normalized to the duration of each video.
4. Impairment duration feature (I)
While the previous features consider rebuffering and quality changes, we also computed the time (in sec.) per video over which a bitrate drop took place; following the simple notion that the relative amount of time that a video is more heavily distorted is directly related to the overall QoE. This feature was normalized to the duration of each video.
We now describe the feature extraction process. Consider all frame pairs (
, ), where indexes the th frame of the pristine video and indexes the corresponding frame of the distorted video, where . If there are no rebuffering events in the distorted video then ; else we determine based on the number of frozen frames up until this point for a particular video. In other words, these two frames must be synchronized in order to be able to extract meaningful objective quality measurements. Next, apply any FR IQA or VQA algorithm to measure the per-frame objective quality, then apply simple average pooling of those values, yielding a single quality-predictive feature that will be used later. In addition, all the other features are collected, assuming that for retrospective QoE prediction, the number of rebuffered frames as well as the locations of the bitrate changes are known. Note that for some VQA methods, adjacent frames may be needed to compute frame differences. In that case we ensure that all frame differencing takes place between two consecutive frames that both have normal playback. If an NR method is used, it is computed only on unstalled frame(s).After collecting all the features computed on each video, we then deployed a learning-based approach where the subjective data and the input features were used to train a regression engine. Note that no constraint was placed on which objective quality algorithm or regression model is used. In our experiments, we studied the performance of our proposed approach across different regression and IQA/VQA models. The final output of our overall system is a single retrospective QoE score on each input test video.
To evaluate the proposed method on the LIVE-Netflix Video QoE Database, we conducted two different experiments. The first one (Experiment 1) consisted of creating two disjoint content sets: one for training and one for testing. Within each content (training or testing), all patterns were used for training or testing. While this is a common approach used to account for content dependencies in learning-based VQA methods, it may also occur that the different “distortions” or playout patterns induce pattern dependencies, resulting in overestimation of the true predictive power of a learning-based method. To examine pattern independence we also conducted a second experiment (Experiment 2), where we picked one of the playout patterns as a test pattern and the rest as training patterns. Thus, for each testing pattern there were 14 test points (one for each content) and 98 testing points. On both tests, we applied a regression model (e.g. Random Forest regression) to predict the QoE scores of the test set given the input training features and MOS scores. We excluded the subjective scores gathered from the three training videos. Since our model does not produce continuous scores, we used only the retrospective QoE scores from all 14 test contents.
To demonstrate the behavior of Video ATLAS we evaluated it using several different types of regression models [43]
: linear models (Ridge and Lasso regression), Support Vector Regression (SVR) using a rbf kernel and ensemble methods such as Random Forest (RF), Gradient Boosting (GB) and Extra Trees (ET) regression. For the ensemble methods, feature normalization was not required, but we preprocessed the features for all regression models by mean subtraction and scaling to unit variance. Note that we computed the data mean and variance in the feature transformation step using only the training data. For each of the regression models, we determined the best parameters using 10-fold cross validation on the training set. This process was repeated on all possible train/test splits.
After each of the regression models was trained, we applied regression on the test features to make QoE predictions. Then, we correlated the regressed values with the MOS scores in the test set and calculated the Spearman Rank Order Correlation Coefficients (SROCC) and the Pearson Linear Correlation Coefficients (LCC). The former measures the monotonicity of the regressed values and the latter the linearity of the output, which is highly desirable since it describes the degree of simplicity of a trained model. Before computing the LCC, we first applied a non-linear regression step on the output QoE scores of our method, as suggested in
[44].We conducted 1000 different trials, each using a random train and test split of the video content. To avoid content dependencies, we select of the 14 contents in the database as the training contents (11 training contents) and the rest as the testing contents (3 testing contents). For direct comparison, we used a pre-generated set of train/test indices. The SROCC and LCC calculations were repeated on each of the trials yielding a distribution of SROCC and LCC values for all possible train/test content combinations. Taking the median value of this distribution of correlation scores yields a single number describing the performance level of the proposed method. Table II shows the SROCC and LCC results after trials.
VQA | PSNR | PSNRhvs [34] | SSIM [11] | MS-SSIM [13] | NIQE [35] | VMAF [17] | STRRED [19] | GMSD [36] | mean |
BR | 0.6074 | 0.6252 | 0.6748 | 0.6557 | 0.1391 | 0.6043 | 0.6348 | 0.6496 | 0.5734 |
Ridge | 0.6687 | 0.6817 | 0.7565 | 0.7461 | 0.4130 | 0.6278 | 0.7957 | 0.6948 | 0.6730 |
Lasso | 0.6496 | 0.6687 | 0.7461 | 0.7383 | 0.4191 | 0.6409 | 0.7983 | 0.6922 | 0.6691 |
SVR | 0.6313 | 0.6417 | 0.8252 | 0.8226 | 0.6730 | 0.6026 | 0.8704 | 0.6878 | 0.7193 |
ET | 0.4265 | 0.4387 | 0.8547 | 0.8752 | 0.7530 | 0.4756 | 0.8439 | 0.4527 | 0.6400 |
RF | 0.4931 | 0.5312 | 0.8088 | 0.8154 | 0.6222 | 0.4930 | 0.8104 | 0.5417 | 0.6395 |
GB | 0.4830 | 0.4944 | 0.7990 | 0.7899 | 0.5878 | 0.5145 | 0.8032 | 0.5000 | 0.6215 |
VQA | PSNR | PSNRhvs [34] | SSIM [11] | MS-SSIM [13] | NIQE [35] | VMAF [17] | STRRED [19] | GMSD [36] | mean |
BR | 0.6048 | 0.6534 | 0.7288 | 0.7104 | 0.3752 | 0.7561 | 0.7213 | 0.6861 | 0.6545 |
Ridge | 0.8145 | 0.8224 | 0.8531 | 0.8517 | 0.5984 | 0.8158 | 0.8703 | 0.8254 | 0.8064 |
Lasso | 0.8192 | 0.8312 | 0.8558 | 0.8514 | 0.6034 | 0.8292 | 0.8719 | 0.8374 | 0.8124 |
SVR | 0.7939 | 0.8016 | 0.9073 | 0.8973 | 0.7633 | 0.7742 | 0.9358 | 0.8106 | 0.8355 |
ET | 0.6325 | 0.6392 | 0.9186 | 0.9289 | 0.8407 | 0.6808 | 0.9088 | 0.6869 | 0.7796 |
RF | 0.6767 | 0.6922 | 0.8905 | 0.8868 | 0.7182 | 0.6591 | 0.8770 | 0.7026 | 0.7629 |
GB | 0.6744 | 0.7060 | 0.8661 | 0.8546 | 0.7143 | 0.7115 | 0.8678 | 0.7043 | 0.7624 |
First, note that both the SROCC and the LCC were improved when using the regression scheme for all quality metrics and for at least one regression model type. For VMAF, PSNR, PSNRhvs and GMSD the regression result did not improve using every regressor. However, the improvements of SSIM, MS-SSIM, NIQE and STRRED were remarkably higher for all the regression models. MS-SSIM using ET yielded the best overall performance in terms of SROCC, while STRRED using SVR yielded the best LCC value. STRRED is an information-theoretic approach to VQA that builds on the innovations in [45, 46]. It achieves quality prediction efficiency without the need to compute motion vectors, unlike [38, 15]. Regarding improvements in terms of LCC, all regression models improved most of the quality metrics. These observations support the argument that introducing an effective regression scheme into the QoE process has a large positive impact on QoE prediction over a wide range of leading video quality models.
To demonstrate the overall improvements delivered by the learned regression models, we also calculated the average SROCC and LCC values for the BR case and for each regression model separately (see the last columns of Table II). In both cases, the SVR regressor achieved the highest average performance followed by Ridge. The performance of the Ridge and Lasso models was somewhat higher than that of the RF and ET, while GB yielded the worst performance across all regression models, although it was still higher than the average performance of BR, which was notably low in the case of NIQE. Next, we visually demonstrate the effect of the proposed learning framework (in Fig. 3) for the case of STRRED and the Random Forest regression model. Clearly, the predicted QoE significantly improved both in terms of monotonicity and linearity.
While our proposed system deploys features that collectively deliver excellent results, it is interesting to analyze the relative feature contributions. One way to study the feature importances is by a tree-based method, as follows. First, we picked the best and the worst performing quality models before regression (when evaluated on the whole database), i.e., STRRED and NIQE, along with the highest performing SVR regression model (in terms of SROCC). Figure 4 shows the feature importances after pre-generated train/test splits.
Clearly, the video quality model used plays an important role in QoE prediction. The memory feature also has a strong contribution since for retrospective QoE evaluation, recent experiences are a strong QoE indicator. The rebuffering features delivered an important but somewhat smaller contribution. For retrospective QoE evaluations and distinct impairment events such as rebuffering, the lower contribution of the R feature (rebuffering duration) may possibly be explained by the duration neglect effect [2]: subjects may remember that a rebuffering event occurred, but may not be sensitive to its duration. However, as demonstrated earlier, both tested video quality models were greatly improved in terms of both SROCC and LCC when combined with Video ATLAS. Since NIQE is not a very good video quality predictor (although it is a very effective still picture quality predictor), the importance of the VQA feature was lower while the importance of the I and M features was relatively higher as compared to STRRED.
To further investigate the effects of those feature types on the retrospective QoE prediction task, we experimented further by using different feature subsets, and recording the QoE prediction performance of each. First, consider the following feature subsets:
individual feature subsets: VQA(1), M(2),
I(3) and R+R(4)
2 feature types subsets: VQA+M(5) and VQA+I(6)
subsets: VQA+M+R(7), M+R+R(8), M+I+R
+R(9), VQA+I+R+R(10), VQA+M+R+R(11) and VQA+M+I+R+R(12)
The SROCC and LCC results are shown in Table III, where we selected STRRED as the quality prediction model. Clearly, when using the individual components as features, the QoE prediction result was maximized when using VQA but was still very low, especially for other components such as M. Notably, the regression performance for the VQA subset was maximized in the case of the Ridge and Lasso linear regressions, but for the M (memory) and R+R (rebuffering) feature types, the performance was greatly reduced using those regression models compared to SVR, ET, RF and GB. This may be explained by the fact that the design of IQA/VQA algorithms such as STRRED ultimately aims for linear/explainable models. By contrast, the memory or rebuffering-aware features are highly non-linear, hence non-linear regression models may be expected to perform better.
We now move on to the different feature combinations and their effect on QoE prediction. First, note that when VQA is removed from the feature set (e.g. in columns 8 and 9) the prediction performance dropped considerably. Meanwhile, using only two features (VQA and M in column 5) we were able to achieve better prediction results than with any other combination of 2 feature types (or a single feature). This again strongly supports the importance of memory/recency effects on QoE when viewing longer video sequences. Regarding the regression models, Ridge and Lasso gave very similar performances when using fewer feature types, but as the number of features grew, Lasso yielded better results. Overall, the combination of all feature types gave the best performance over most regression models. This suggests that a successful QoE prediction model should consider diverse QoE-aware features in order to better approximate subjective QoE.
Features | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Ridge | 0.6348 | 0.2296 | 0.2700 | 0.3094 | 0.6000 | 0.6235 | 0.7870 | 0.4105 | 0.4172 | 0.7878 | 0.7735 | 0.7957 |
Lasso | 0.6348 | 0.2296 | 0.2700 | 0.3243 | 0.6304 | 0.6417 | 0.7991 | 0.4075 | 0.3955 | 0.8013 | 0.7991 | 0.7983 |
SVR | 0.5748 | 0.3807 | 0.2758 | 0.3740 | 0.7322 | 0.5878 | 0.8183 | 0.4210 | 0.4839 | 0.8543 | 0.8122 | 0.8704 |
ET | 0.5074 | 0.3076 | 0.2345 | 0.2993 | 0.7431 | 0.5962 | 0.7496 | 0.3119 | 0.3924 | 0.8348 | 0.7574 | 0.8435 |
RF | 0.5304 | 0.3961 | 0.2713 | 0.3218 | 0.7537 | 0.5691 | 0.7633 | 0.4126 | 0.4656 | 0.8074 | 0.7708 | 0.8096 |
GB | 0.5691 | 0.3905 | 0.2658 | 0.3527 | 0.7461 | 0.6001 | 0.7668 | 0.4355 | 0.4984 | 0.8070 | 0.7607 | 0.8036 |
Features | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Ridge | 0.7213 | 0.4507 | 0.3049 | 0.2930 | 0.7141 | 0.6475 | 0.7610 | 0.4602 | 0.6247 | 0.7854 | 0.7590 | 0.8703 |
Lasso | 0.7213 | 0.4507 | 0.3049 | 0.2956 | 0.7348 | 0.6956 | 0.7870 | 0.4592 | 0.6201 | 0.8055 | 0.7868 | 0.8719 |
SVR | 0.6454 | 0.4325 | 0.3148 | 0.3169 | 0.8133 | 0.6472 | 0.8510 | 0.4497 | 0.6959 | 0.8945 | 0.8392 | 0.9358 |
ET | 0.5407 | 0.3754 | 0.3110 | 0.3138 | 0.7620 | 0.6031 | 0.7596 | 0.3899 | 0.6173 | 0.9004 | 0.7659 | 0.9090 |
RF | 0.5685 | 0.4451 | 0.3528 | 0.3261 | 0.7794 | 0.6024 | 0.7862 | 0.4706 | 0.6966 | 0.8686 | 0.7975 | 0.8742 |
GB | 0.6287 | 0.4514 | 0.3514 | 0.3141 | 0.7755 | 0.6269 | 0.7904 | 0.4751 | 0.7413 | 0.8665 | 0.7865 | 0.8686 |
IQA/VQA metric | mean | hysteresis | VQ |
---|---|---|---|
PSNR | 0.6687 | 0.6687 | 0.6817 |
PSNRhvs [34] | 0.6817 | 0.6878 | 0.6965 |
SSIM [11] | 0.8547 | 0.8470 | 0.7887 |
MS-SSIM [13] | 0.8752 | 0.8722 | 0.7743 |
NIQE [35] | 0.7530 | 0.7513 | 0.6591 |
VMAF [17] | 0.6409 | 0.6226 | 0.6400 |
STRRED [19] | 0.8704 | 0.8800 | 0.8687 |
GMSD [36] | 0.6948 | 0.6800 | 0.6843 |
IQA/VQA metric | mean | hysteresis | VQ |
---|---|---|---|
PSNR | 0.8145 | 0.8173 | 0.8231 |
PSNRhvs [34] | 0.8224 | 0.8254 | 0.8635 |
SSIM [11] | 0.9186 | 0.9121 | 0.8804 |
MS-SSIM [13] | 0.9289 | 0.9281 | 0.8700 |
NIQE [35] | 0.8407 | 0.8495 | 0.7678 |
VMAF [17] | 0.8292 | 0.8136 | 0.8202 |
STRRED [19] | 0.9358 | 0.9390 | 0.9317 |
GMSD [36] | 0.8254 | 0.8062 | 0.8191 |
We now analyze the effects of the amount of training data used in the regression scheme on QoE prediction. By varying the percent of training data in the train/test split, we repeated the same process as before, over random trials. Figure 5 shows how the SROCC changed when the amount of training data varied between (2 training contents) and (11 training contents). Clearly, the prediction performance increased when the available training data was increased. The best performance in terms of SROCC and LCC was reached when MS-SSIM was used as the quality model. Note that while NIQE performed the worst before applying regression, now it performed better than VMAF, GMSD, PSNR and PSNRhvs when the ratio of the train/test split was larger than 0.4. Notably, the SROCC performance of GMSD, PSNR and PSNRhvs did not significantly vary until the train/test split was larger than 0.6. By contrast, STRRED, SSIM and MS-SSIM delivered good results (in terms of SROCC and LCC) when only a small amount of training data was used.
We also experimented with the type of pooling that is applied on the quality metric before it is used in the regression framework. We combined all features and used the pre-generated train and test splits. To collapse the frame-based objective quality scores to a single summary VQA score, we applied the hysteresis pooling method in [47] and the VQ pooling method in [40]. The former combines past and future quality scores within a window, while the latter clusters the video frames into low and high quality regions and weights their contributions to the overall VQA score. The results are tabulated in Table IV. For the mean pooling case, we used the results reported in Table II.
Given the results in Table IV, we observed that the use of temporal pooling strategies other than mean pooling did not always improve QoE prediction [39]. Of the 8 metrics we reported, only 3 were improved in terms of SROCC and 4 in terms of LCC. Further, these improvements were not very significant, with the exception of PSNR and PSNRhvs.
Finally, we compared models within each of three QoE prediction categories: QoS (FTW and VsQM), VQA-based ones (PSNR, SSIM, MS-SSIM) and hybrid ones (SQI and the proposed Video ATLAS). Table V shows the median SROCC and LCC results for these methods. A statistical significance test (Wilcoxon ranksum test [48] with significance level ) was carried out by comparing the distributions of SROCC across all trials. The results of this analysis are tabulated in Table VI. Clearly, Video ATLAS outperformed the other QoE prediction models when using SSIM and MS-SSIM. These improvements are also visually demonstrated in Fig. 6 using MS-SSIM for all regression models. In this example, the best performing regression model was ET.
Method | SROCC | LCC | Best |
---|---|---|---|
FTW [30] | 0.3403 | 0.2956 | - |
VsQM [31] | 0.3120 | 0.2421 | - |
PSNR | 0.6074 | 0.6048 | - |
SSIM [11] | 0.6748 | 0.7289 | - |
MS-SSIM [13] | 0.6557 | 0.7104 | - |
PSNR+SQI [32] | 0.6565 | 0.6599 | - |
SSIM+SQI [32] | 0.7565 | 0.8031 | - |
MS-SSIM+SQI [32] | 0.7270 | 0.7731 | - |
PSNR+ATLAS | 0.6687 | 0.8145 | Ridge |
SSIM+ATLAS | 0.8547 | 0.9186 | ET |
MS-SSIM+ATLAS | 0.8752 | 0.9289 | ET |
QoS | VQA | SQI | ATLAS | |||||||||
FTW | VsQM | PSNR | SSIM | MS-SSIM | PSNR | SSIM | MS-SSIM | PSNR | SSIM | MS-SSIM | ||
QoS | FTW | -1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
VsQM | 0 | -1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
VQA | PSNR | 1 | 1 | -1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
SSIM | 1 | 1 | 1 | -1 | -1 | 1 | 0 | 0 | -1 | 0 | 0 | |
MSSIM | 1 | 1 | 1 | -1 | -1 | -1 | 0 | 0 | -1 | 0 | 0 | |
SQI | PSNR | 1 | 1 | 1 | 0 | -1 | -1 | 0 | 0 | 0 | 0 | 0 |
SSIM | 1 | 1 | 1 | 1 | 1 | 1 | -1 | 1 | 1 | 0 | 0 | |
MSSIM | 1 | 1 | 1 | 1 | 1 | 1 | 0 | -1 | 1 | 0 | 0 | |
ATLAS | PSNR | 1 | 1 | 1 | -1 | -1 | 1 | 0 | 0 | -1 | 0 | 0 |
SSIM | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | -1 | 0 | |
MSSIM | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | -1 |
We then examined pattern independence when applying Video ATLAS, with results shown in Table VII. Clearly, all of the video quality models were improved by using either SQI and/or VideoATLAS. When combined with either SSIM or MS-SSIM, Video ATLAS improved prediction performance more than SQI did. However, unlike our finding in Experiment 1, we found that not all of the features contributed to the QoE prediction result. To further illuminate this claim, Table VIII tabulates the QoE prediction results on the best feature subset of each regressor when STRRED (with mean pooling) was applied. It may be observed that the combination of the VQA and M features was important for all regressors. This again demonstrates the strong recency/memory effects that contribute to retrospective QoE evaluation. In the case of LCC, Video ATLAS was further improved by including the rebuffering features R and/or R.
Method | SROCC | LCC | Best |
---|---|---|---|
PSNR | 0.4945 | 0.5312 | - |
PSNR+SQI [32] | 0.4989 | 0.5340 | - |
PSNR+ATLAS | 0.4945 | 0.5321 | Ridge |
SSIM [11] | 0.6615 | 0.7947 | - |
SSIM+SQI [32] | 0.6791 | 0.7927 | - |
SSIM+ATLAS | 0.7143 | 0.8650 | RF |
MS-SSIM [13] | 0.6659 | 0.7982 | - |
MS-SSIM+SQI [32] | 0.6835 | 0.7955 | - |
MS-SSIM+ATLAS | 0.6961 | 0.8345 | GB |
NIQE | 0.4681 | 0.4107 | - |
NIQE+ATLAS | 0.6447 | 0.6541 | RF |
VMAF | 0.3890 | 0.4486 | - |
VMAF+ATLAS | 0.7415 | 0.7075 | RF |
STRRED | 0.8066 | 0.7848 | - |
STRRED+ATLAS | 0.8198 | 0.7923 | Ridge |
GMSD | 0.4989 | 0.5545 | - |
GMSD+ATLAS | 0.5256 | 0.6679 | RF |
Regressor | Best SROCC | Feature Set | Best LCC | Feature Set |
---|---|---|---|---|
Ridge | 0.7934 | VQA+M | 0.7939 | VQA+M+R |
Lasso | 0.7934 | VQA+M | 0.7918 | VQA+M+R |
SVR | 0.8242 | VQA+M | 0.8618 | VQA+M+R |
RF | 0.7385 | VQA+M | 0.8771 | VQA+I+R+R |
ET | 0.7437 | VQA+I+R+R | 0.8698 | VQA+M+R+R |
GB | 0.8079 | VQA+M+R | 0.8821 | VQA+M+R |
The proposed framework uses subjective data to make QoE predictions; hence its predictive power must also be carefully evaluated on other video QoE databases to understand its generalizability. The only publicly available video QoE database that considers other interactions between rebuffering and quality changes is the Waterloo Video QoE Database [32] (Waterloo DB). This recently developed database consists of 20 RAW HD 10 sec. reference videos. Each video was encoded using H.264 into three bitrate levels (500Kbps, 1500Kbps, 3000Kbps) yielding 60 compressed videos. For each one of those sequences, two more categories of video sequences were created by simulating a 5 sec. rebuffering event either at the beginning or at the middle of the video sequence. In total, 200 video sequences were evaluated by more than 25 subjects. Based on the collected subjective data, the authors designed the Streaming QoE Index (SQI) to “account for the instantaneous quality degradation due to perceptual video presentation impairment, the playback stalling events, and the instantaneous interactions between them”.
Unlike the LIVE-Netflix database (LIVE-Netflix DB), Waterloo DB consists of short video sequences (which may not reflect the experiences of viewers watching minutes or hours of video content), used fewer subjects, and importantly, the rebuffering events and the bitrate/quality changes were not driven by any realistic assumptions on the available network or the buffer size. However, given its simplicity and the lack of availability of other public domain databases of this type, applying our proposed model framework on this database may yield a comparison of practical worth. We compared the predictive power of our model with SQI [32], FTW [30], VsQM [31] and several VQA models. Aside from SQI and Video ATLAS, the other methods do not consider both rebuffering events and bitrate variations. When conducting direct comparisons, we used only the quality prediction models that were reported for SQI: PSNR, SSIM, MS-SSIM and SSIMplus [49]. Given the simple playout patterns, only the VQA+M+R feature set was applicable for Video ATLAS. Since the videos in the Waterloo DB do not suffer from dynamic rate changes, the M feature was computed here as the amount of time since a rebuffering event took place. We refer to this feature as M. As before, we conducted trials, split the contents into training and testing subsets to avoid content bias, and used a pre-generated matrix of such indices. We carried out the following three experiments:
Experiment 3: We conducted 1000 trials of 80% train, 20% test splits on the Waterloo DB. The results are tabulated in Table IX. For Video ATLAS, only the best regression model (in terms of SROCC) is reported. To ensure that SQI yielded its best results on this dataset, we used the parameters suggested in [32] (different for each quality model). As before, video quality models did not perform as well as the SQI and Video ATLAS variants. Notably, the performance of MS-SSIM and SSIMplus were worse than that of SSIM even though both have been shown to yield better results than SSIM on the IQA and VQA problems. This verifies our earlier observation: the Waterloo DB contains both rebuffering events and quality changes; hence a better IQA/VQA model may not always correlate better with subjective QoE. Overall, Video ATLAS performed slightly better than SQI, likely in part since the playout patterns in that dataset are simpler, the feature variation is smaller and the number of input features was reduced to only three. Given that SQI was designed on the Waterloo DB, the Video ATLAS results are quite promising.
Method | SROCC | LCC | Best |
---|---|---|---|
FTW [30] | 0.3290 | 0.3358 | - |
VsQM [31] | 0.2358 | 0.3324 | - |
PSNR | 0.6894 | 0.6875 | - |
SSIM [11] | 0.8172 | 0.8544 | - |
MS-SSIM [13] | 0.7986 | 0.8345 | - |
SSIMplus [49] | 0.8025 | 0.8414 | - |
PSNR+SQI [32] | 0.7800 | 0.7535 | - |
SSIM+SQI [32] | 0.9085 | 0.9028 | - |
MS-SSIM+SQI [32] | 0.8891 | 0.8808 | - |
SSIMplus+SQI [32] | 0.9103 | 0.9012 | - |
PSNR+ATLAS | 0.7799 | 0.7510 | SVR |
SSIM+ATLAS | 0.9142 | 0.9097 | SVR |
MS-SSIM+ATLAS | 0.8955 | 0.8880 | Lasso |
SSIMplus+ATLAS | 0.9084 | 0.8981 | Ridge |
Next, we studied the performance of our proposed QoE prediction framework when one of the databases is used for testing and the other for training. In this case, we applied 10-fold cross validation on the entire training dataset to determine the parameters of each regressor. Some regressors, such as RF, may give different results each time; hence we conducted 50 iterations and tabulated the median results in Table X.
Experiment 4: We used the Waterloo DB for training and tested the trained models on the LIVE-Netflix DB. For SQI, since we trained on the Waterloo DB, we again used the suggested optimal parameters from [32]. For Video ATLAS, we used the Waterloo DB to determine the best parameters for each regressor. The best QoE predictor was Video ATLAS, when combined with SSIM. Clearly, simple QoE predictors based on rebuffering information only such as FTW (or VsQM), or that only uses standard video quality models, perform worse than more general QoE models such as SQI and Video ATLAS. It may also be observed that Video ATLAS outperformed SQI in terms of SROCC and LCC. While Video ATLAS performed better, it should be noted that it used only 3 of the 5 input features (VQA+M+R), given the simple design of the Waterloo DB. A more general dataset for training could potentially increase the predictive performance of Video ATLAS even further.
Experiment 5: We then used the LIVE-Netflix DB to train the QoE prediction models, and tested them on the Waterloo DB. Again, to ensure that SQI would yield the best possible results when testing on the Waterloo DB, we used the parameters suggested in [32]. For Video ATLAS, we used the Waterloo DB to determine the best parameters of each regressor. As is also shown in Table X, SQI and Video ATLAS delivered similar results (Video ATLAS is slightly better when combined with SSIM and MS-SSIM) while FTW, VsQM and objective VQA models performed poorly. Again, when testing on the Waterloo DB, Video ATLAS uses only 3 features, thereby hampering its predictive power. As shown before, combining multiple complimentary features into the Video ATLAS engine is important if it is to achieve its most competitive performance. However, Video ATLAS still competed well against SQI (which was designed and optimized into the Waterloo DB) despite the fact that it was trained on the LIVE-Netflix dataset. This strongly suggests that it generalizes well. By contrast, the results of SQI in experiments 4 and 5 show that it did not generalize as well on the LIVE-Netflix DB.
In experiments 3, 4 and 5, we found that simple learning models such as SVR, Ridge and Lasso, when combined with the three most important features: VQA, M (or M) and R, performed better than SQI and tree-based regressors. This simplicity of Video ATLAS is highly desirable: simple regressors with features that capture the three main properties of subjective QoE (video quality, rebuffering and memory) are more explainable and less likely to overfit on unseen test data.
Method | SROCC | LCC | Best |
---|---|---|---|
FTW [30] | 0.3352 | 0.2900 | - |
VsQM [31] | 0.3236 | 0.2374 | - |
PSNR | 0.5152 | 0.5073 | - |
SSIM [11] | 0.7015 | 0.7219 | - |
MS-SSIM [13] | 0.6800 | 0.7104 | - |
PSNR+SQI [32] | 0.5904 | 0.5905 | - |
SSIM+SQI [32] | 0.7451 | 0.7070 | - |
MS-SSIM+SQI [32] | 0.7239 | 0.6848 | - |
PSNR+ATLAS | 0.6155 | 0.6116 | SVR |
SSIM+ATLAS | 0.8203 | 0.7813 | Lasso |
MS-SSIM+ATLAS | 0.8000 | 0.7670 | Lasso |
Method | SROCC | LCC | Best |
---|---|---|---|
FTW [30] | 0.3154 | 0.3313 | - |
VsQM [31] | 0.2259 | 0.3233 | - |
PSNR | 0.6715 | 0.6587 | - |
SSIM [11] | 0.8177 | 0.8408 | - |
MS-SSIM [13] | 0.7928 | 0.8168 | - |
PSNR+SQI [32] | 0.7492 | 0.7316 | - |
SSIM+SQI [32] | 0.9009 | 0.8897 | - |
MS-SSIM+SQI [32] | 0.8807 | 0.8652 | - |
PSNR+ATLAS | 0.7439 | 0.7254 | SVR |
SSIM+ATLAS | 0.9090 | 0.8963 | Lasso |
MS-SSIM+ATLAS | 0.8888 | 0.8716 | Lasso |
We described a learning-based approach for QoE prediction that integrates video quality models, rebuffering-aware, and memory features into a single QoE prediction model. This framework embodies our first attempt to develop an integrated QoE model, where rebuffering events and quality changes are considered in a unified way. We envision developing more sophisticated models for QoE prediction which could be directly used for continuous time QoE monitoring [50]. Towards predicting continuous time scores, combining frame-based objective quality models with temporally varying rebuffering statistics will require a better understanding of how QoE is affected by and further modulated by both inherent short and long term memory effects.
Towards achieving this goal, time series models such as ARIMA [51] can be exploited. The LIVE-Netflix Video QoE Database includes continuous time subjective data which is rich and suitable for designing such continuous time QoE models. Therefore, a natural step forward is to deploy prediction methods which also integrate temporal aspects of user QoE in order to design better strategies for the resource allocation problem. However, this remains a challenging problem.
The authors would like to acknowledge Zhi Li for his valuable comments on the manuscript. Also, Anush K. Moorthy, Ioannis Katsavounidis and Anne Aaron for their help in designing the LIVE-Netflix Video QoE Database and for sharing their insights on video streaming problems.
K. D. Singh, Y. Hadjadj-Aoul, and G. Rubino, “Quality of experience estimation for adaptive HTTP/TCP video streaming using H. 264/AVC,” in
IEEE Consumer Communications and Networking Conference, 2012, pp. 127–131.