Cumulative Quality Modeling for HTTP Adaptive Streaming

Huyen T. T. Tran, et al. (09/06/2019)

Thanks to the abundance of Web platforms and broadband connections, HTTP Adaptive Streaming has become the de facto choice for multimedia delivery nowadays. However, the visual quality of adaptive video streaming may fluctuate strongly during a session due to bandwidth fluctuations. It is therefore important to evaluate the quality of a streaming session over time. In this paper, we propose a model to estimate the cumulative quality for HTTP Adaptive Streaming. In the model, a sliding window of video segments is employed as the basic building block. Through statistical analysis using a subjective dataset, we identify three important components of the cumulative quality model, namely the minimum window quality, the last window quality, and the average window quality. Experimental results show that the proposed model achieves high prediction performance and outperforms related quality models. Another advantage of the proposed model is its simplicity, which makes it effective for real-time estimation. The source code of the proposed model has been made available to the public at https://github.com/TranHuyen1191/CQM.


1 Introduction

HTTP Adaptive Streaming (HAS) has become the de facto choice for multimedia delivery nowadays. In HAS, a video is encoded into different quality versions [thang2012_TCE_full]. Each version is further divided into a series of segments. Depending on throughput fluctuations, segments of appropriate quality versions are delivered from the server to the client, resulting in quality variations during a session. Therefore, a key challenge in HAS is how to evaluate the quality of a session over time. The evaluation can provide service providers with suggestions to enhance the quality of services [TMM_2013_4]. Also, some existing studies deploy quality models to build and evaluate effective adaptive streaming strategies [TMM_2018_1, TMM_2017_3].

Here, we would like to differentiate three quality concepts as follows.

  • Continuous quality means the instantaneous quality that is continuously perceived at any moment of the session.

  • Cumulative quality means the quality cumulated from the beginning of the session up to a given moment.

  • Overall quality means the cumulative quality measured at the end of the session. Obviously, this concept is a special case of the cumulative quality.

It should be noted that the concepts of continuous quality and overall quality have been covered in Recommendations ITU-R BT.500-13 and ITU-T P.880 [ITU2004-P880, ITU2012-BT500] and have been investigated in a large number of previous studies.

To the best of our knowledge, however, few previous studies have actually considered the cumulative quality. In [QoE_tobias2011_memory], the cumulative quality was investigated in the context of Web services. The work in [QoE_ZWangTimevarying] was the first study on the cumulative quality of a video streaming session, where the authors focused on the impact of quality variations. However, this work employed very short sessions, only 5–15 seconds.

In this study, our goal is to model the cumulative quality of HTTP adaptive video streaming. We first carry out a subjective test to measure the cumulative quality of long sessions of 6 minutes. Then, the impacts of quality variations, primacy, and recency are investigated. Based on the obtained results, a cumulative quality model (called CQM) is proposed, in which a sliding window of video segments is the basic unit of computation. It should be noted that, in the following, the term "window" refers to either the conceptual sliding window or a window at a certain location. Experimental results show that the average window quality, the minimum window quality, and the quality of the last window are key components of the cumulative quality model. Also, it is found that the proposed model outperforms six existing models. Moreover, the proposed model is applicable to real-time quality monitoring thanks to its low computation complexity. To the best of our knowledge, the proposed model is the first cumulative quality model for actual streaming sessions.

The remainder of this paper is organized as follows. Section 2 discusses the related work. Because the proposed model is based on an analysis of subjective results, the subjective test is presented in Sect. 3. Then, Sect. 4 presents the proposed cumulative quality model. In Sect. 5, we evaluate the performance and computation complexity of the proposed model and compare it to six existing models. Finally, conclusions are drawn in Sect. 6.

2 Related work

In this section, we will discuss the work related to three types of quality, namely, 1) overall quality, 2) continuous quality, and 3) cumulative quality.

2.1 Overall quality

The overall video streaming experience of end-users can be quantified with the concept of Quality of Experience (QoE). For video streaming, QoE expresses to what extent users are delighted or annoyed with the provided service [qualinet2013qoe]. In [QoE_tobias2013_DTMA, QoE_tobias2012_initial], it was found that the impact of the initial delay of the video stream is not severe, whereas the impact of stalling, i.e., playback interruptions, is significant. To model the impact of interruptions, previous studies generally used statistics such as the number of interruptions [QoE_singh2012qualityML, QoE_liu2015_deriving], and the average [QoE_singh2012qualityML], maximum [QoE_singh2012qualityML], sum [QoE_rodriguez2016_video, QoE_liu2015_deriving], and histogram [tran2016_GC] of interruption durations. To ensure smooth streaming when end-users face throughput fluctuations, e.g., in mobile networks, HAS allows the client to adapt the video bit rate to the network conditions. Thereby, initial delay and stalling, which are severe QoE degradations of video streaming, can be reduced. However, due to the bit rate adaptation, the visual quality of the video might vary, which introduces an additional QoE factor [QoE_seufert2015_survey].

Existing studies on overall visual quality were mostly limited to short sessions (about 1-3 minutes) [QoE_bellLab2013_QoEmodel, QoE_ywang2015_assessing, tran2017_IEICEhistogram, QoE_tobias2014_assessing, QoE_seufert2015_impact]. These studies mainly focused on the impact of quality variations, which is modeled by statistics of segment quality values and switching amplitudes (i.e., differences between consecutive segment quality values) such as the average [QoE_bellLab2013_QoEmodel], standard deviation [QoE_bellLab2013_QoEmodel], minimum [QoE_ywang2015_assessing], median [QoE_ywang2015_assessing], histogram [tran2017_IEICEhistogram], and time duration spent on different quality levels [QoE_tobias2014_assessing, QoE_seufert2015_impact].

For long sessions, primacy and recency are also important factors to be considered. Here, the primacy (recency) factor refers to the influence of quality degradations near the beginning (end) of a session. The authors in [QoE_tavakoli2016_JSAC] found that primacy and recency both have significant impacts on the overall quality of a session. The work in [QoE_Seufert2013_pool] studied different temporal pooling methods, which emphasize different aspects (e.g., recency, lowest quality), for aggregating objective quality metrics into an overall quality score. In [QoE_rodriguez2016_video], the authors proposed an overall quality model taking into account the impacts of quality variations, primacy, and recency. Specifically, a session is divided into three temporal intervals. In each interval, the impact of quality variations is modeled by the frequencies of switching types, where each switching type is defined based on resolutions and frame rates. To take into account the impact of primacy and recency, each interval is simply assigned a weight representing its contribution to the overall quality of the session. The experimental results revealed that the first interval has the highest weight, and thus the largest contribution to the overall quality.

In the latest stage of the ITU-T P.1203 standardization for quality assessment of streaming media, a model (called P.1203) is recommended for predicting the overall quality of sessions with durations from 1 to 5 minutes [ITU1203_3]. The P.1203 model also takes into account the impacts of quality variations, primacy, and recency. To model the impact of quality variations, the model uses the average of the segment quality values in each temporal interval and various statistics calculated over the whole session, such as the total number of quality direction changes and the difference between the maximum and minimum segment quality. To take into account the impact of primacy and recency, a weighted sum of all segment quality values in the session is used.

2.2 Continuous quality

Recommendation ITU-R BT.500-13 describes the Single Stimulus Continuous Quality Evaluation (SSCQE) method for subjective assessment of continuous quality. In this method, test sessions are displayed in a random order. Each subject, while watching a video, is asked to continuously move a slider along a continuous scale so that its position reflects his/her judgment of the quality at that instant. All subjects' quality ratings at each instant of each video are averaged to compute a mean opinion score (MOS) for that instant.
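As a minimal illustration of this per-instant averaging, the following sketch computes the MOS at each instant from a hypothetical ratings matrix (one row per subject, one column per instant); the shape and values are ours, for illustration only.

```python
import numpy as np

# Hypothetical SSCQE ratings: one row per subject, one column per
# time instant, on a continuous quality scale.
ratings = np.array([
    [4.2, 3.9, 3.1, 2.8, 3.5],
    [4.0, 3.7, 3.3, 2.5, 3.2],
    [4.5, 4.1, 3.0, 2.9, 3.6],
])

# The MOS at each instant is the average of all subjects' ratings there.
mos_per_instant = ratings.mean(axis=0)
print(mos_per_instant)
```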

The work in [QoE_Chen2014] is the first study on the continuous quality of a streaming session. Note that the authors use the term "time-varying quality" to refer to continuous quality. To measure the continuous quality, the authors conducted a subjective test similar to the SSCQE method. Then, a continuous quality model taking into account the impact of recency was proposed. In particular, a Hammerstein-Wiener model was employed to predict the continuous quality of 5-minute long sessions. As this work is focused on continuous quality, the model only depends on the quality values of the last 15 seconds.

The work in [QMon_shafiq_infocom18] uses machine learning to predict initial delay, stalling, and video quality from the network traffic in windows of 10 s. The considered features are derived from IP or TCP/UDP headers only. ViCrypt [QMon_seufert2019_stream] detects QoE degradations on encrypted video streaming traffic in real time within 1 s by using a stream-like analysis approach with two continuous sliding windows and a cumulative window. The features are based on packet-level statistics of the network traffic and allow accurate recognition of initial delay and stalling [QMon_seufert2019_stream], as well as video resolution and average bitrate [QMon_wassermann2019_let].

The work in [texas2018] presents a continuous-time QoE predictor using an ensemble of Hammerstein-Wiener models, while [QoE_BampisRecurrent2018ML] develops a neural-network-based continuous quality model. As discussed in Recommendation ITU-R BT.500-13 [ITU2012-BT500], the continuous quality values of a session can be utilized to obtain the overall quality. However, this issue is still under study [ITU2012-BT500, QoE_bovikTimeVarying2011, QoE_bampis2017Dataset].

2.3 Cumulative quality

To the best of our knowledge, the only previous study on the cumulative quality of a streaming session is [QoE_ZWangTimevarying], where the authors presented some qualitative observations regarding the impact of quality variations. However, the authors employed simple simulated sessions of very short durations (5-15 seconds) with only 1-3 segments. It was found that a small switching amplitude leaves the cumulative quality quite stable, whereas a large switching amplitude results in a significant change of the cumulative quality. From these observations, the authors proposed a cumulative quality model in which a piecewise linear function of switching amplitudes is used to quantify the impact of quality variations.

The preliminary work of our cumulative quality research was presented in [tran2018_QoMEX]. In this paper, the previous work is extended significantly. First, we carried out more subjective tests with new videos, so the dataset is now doubled. Second, the factors in the model are extensively studied with one-way analysis of variance (ANOVA). In addition, the impacts of the window size and the window quality model on the model performance are explored in detail and the best setting is recommended. Finally, the evaluation is extended with more related models and in-depth analyses of the models' performances with respect to sequence length as well as their computation complexity.

The contributions of our work fall into two general categories. First, we build a dataset that is specific to the cumulative quality. Our dataset helps to investigate how well existing overall quality models predict the cumulative quality. Second, we propose a new cumulative quality model that can predict the cumulative quality of streaming sessions well. In particular, the distinguishing features of our study are as follows.

  • First, a subjective test was specifically designed for measuring the cumulative quality of HAS sessions. In our test, there are in total 72 test sequences generated from six 6-minute long videos. The total time required for rating these sequences was approximately 160 hours.

  • Second, through statistical analysis, insights into the impacts of three factors of quality variations, primacy, and recency are provided. In particular, it is found that the impacts of the quality variations and recency are significant. However, no significant impact of the primacy is observed.

  • Third, we propose a new cumulative quality model that takes into account the impacts of the quality variations and recency. Experimental results show that the proposed model is able to predict the cumulative quality of streaming sessions well.

  • Fourth, a comparison of the proposed model with six existing models was conducted. This is the first time such a large number of quality models have been investigated for cumulative quality prediction. Experimental results show that the proposed model outperforms the existing models.

  • Fifth, it was found that the proposed model is applicable to real-time quality monitoring thanks to its low computation complexity. This feature is especially important for cost-effective evaluation of streaming technologies.

3 Subjective Test for Cumulative Quality

In this study, to measure the cumulative quality over time, each streaming session was converted into test sequences of different lengths. In the test, each subject viewed a random sequence and then rated the quality of the whole sequence. This approach is similar to that used in [QoE_ZWangTimevarying], where each 15-second long session was divided into three sequences of 5, 10, and 15 seconds.

Video      Content                                         Type
Video #1   Slow movements of characters                    Animated video, Movie
Video #2   A story about Sintel and her friend, a dragon   Animated video, Movie
Video #3   Conversations of characters                     Natural video, Movie
Video #4   A talk show host analyzing news                 Natural video, News
Video #5   A documentary about a science experiment        Natural video, Documentary
Video #6   A soccer match                                  Natural video, Sport
Table 1: Features of Source Videos

There are in total six 6-minute long videos used in this study, denoted by Video #1, Video #2, Video #3, Video #4, Video #5, and Video #6, with features presented in Table 1. These videos were encoded using H.264/AVC (libx264) with a frame rate of 24 fps. In this study, we used two adaptation sets, each consisting of 9 versions with different QP values and/or resolutions. In particular, the 9 versions in the first adaptation set have the same resolution of 1280×720 and 9 different QP values of 52, 48, 44, 40, 36, 32, 28, 24, and 20. The first adaptation set was used to generate the streaming sessions of Video #1, Video #2, and Video #3. The 9 versions in the second adaptation set differ in both resolution and QP. Specifically, they correspond to the 9 combinations of {QP, resolution} given by {24, 256×144}, {26, 426×240}, {24, 426×240}, {26, 640×360}, {24, 640×360}, {26, 854×480}, {24, 854×480}, {26, 1280×720}, and {24, 1280×720}. The second adaptation set was used to generate the streaming sessions of Video #4, Video #5, and Video #6. The average bitrates of the versions are shown in Table 2. In this study, every version is divided into short segments with a duration of 1 second.

Version    Average bitrate (kbps)
           Video #1   Video #2   Video #3   Video #4   Video #5   Video #6
1          146        187        187        179        455        570
2          196        239        244        310        794        1034
3          310        333        353        382        1010       1304
4          455        482        528        548        1397       1823
5          717        717        813        675        1764       2295
6          1118       1097       1263       791        2017       2647
7          1751       1743       2005       977        2549       3330
8          2802       2910       3362       1303       3209       4382
9          4538       4993       6089       1613       3930       5500
Table 2: Average Bitrates of Versions
Figure 1: An example of version variations in a streaming session.

For each video, two full-length sessions of 6 minutes were generated by using the adaptation method of [thang2013_JCN] and two bandwidth traces from a mobile network [HAS_muller2012_evaluation]. The duration of 6 minutes was selected such that it is longer than the average video duration watched on YouTube, which is 5:01 minutes [QoE_nam2016_qoewhycat]. The bandwidth traces have average throughputs varying from 1484.87 kbps to 3432.33 kbps, and standard deviations from 867.01 kbps to 1252.75 kbps. An example of version variations in a 6-minute session is provided in Fig. 1.

From each full-length session, six test sequences were extracted, spanning from time 0 to the 1-, 2-, 3-, 4-, 5-, and 6-minute marks, respectively. So, from the six original videos, there were in total 72 test sequences, with durations from 1 minute to 6 minutes. The total duration of all the test sequences is 252 minutes. Because a rating session longer than 1.5 hours may cause fatigue and boredom [P.9132014], the subjective test was divided into four parts that were conducted on different days. The duration of each part was approximately 1.5 hours, of which about 1 hour was spent rating the test sequences. During the rating process, there was a 10-minute break every 20 minutes. In order to avoid boredom, each subject took part in at most two test parts.

The subjective test was conducted using the absolute category rating (ACR) method. Test conditions were designed following Recommendation ITU-T P.913 [P.9132014]. In the subject-training stage, the subjects got used to the procedure and the range of quality impairments. In the test, the sequences were displayed in random order on a black background. The screen size was 14 inches with a resolution of 1366×768. For each sequence, each subject gave a score at the end of the sequence, ranging from 1 (worst) to 5 (best), reflecting his/her opinion of the quality of the whole sequence.

There were in total 71 subjects taking part in the test. The total time of the test was approximately 160 hours. Screening analysis of the test results was performed following Recommendation ITU-T P.913 [P.9132014], and two subjects were rejected. After discarding these subjects’ scores, each test sequence was rated by 23 valid subjects. The MOS of each sequence was computed as the average of the valid subjects’ scores.

The 95% confidence intervals of the subjective scores are shown in Fig. 2. In general, the confidence intervals are in the range of 0.08 to 0.35. Also, the subjective scores range from 2 to about 4.7. This means the cumulative quality varies drastically during a session.

4 Cumulative Quality Model

Figure 2: Confidence intervals of the MOSs.
Figure 3: An illustration of the "sliding window" with a size of 3 segments.

4.1 Overview

To build a cumulative quality model taking into account the impacts of multiple factors, the basic ideas of our solution are as follows.

  • Quality variations over a long session are divided into long-term and short-term changes. Specifically, short-term changes refer to quality variations of neighboring segments, while long-term changes refer to quality variations between temporal intervals.

  • To represent the impact of long-term changes, the concept of a "sliding window" is used. Specifically, a window of segments is moved along the session, segment by segment, as illustrated in Fig. 3. After each move, a window quality value is computed.

  • To represent the impact of short-term changes within a window, an existing overall quality model is used. Such a model is hereafter called the window quality model.

  • The cumulative quality value at any time point is computed based on window quality values, taking into account the impacts of factors such as long-term changes and recency. Note that, at the first time points, when the watched video duration is shorter than the window size, the corresponding cumulative quality values are directly computed from the window quality model (a minimal sketch of the sliding-window computation follows this list).
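The following is a minimal sketch of this sliding-window computation, assuming 1-second segments and using a plain average as a stand-in window quality model; the paper instead plugs in an existing overall quality model (e.g., Tran's [tran2016_GC]), and the function name is ours.

```python
import numpy as np

def window_qualities(segment_qualities, W):
    """Slide a window of W segments along the session, one segment at a
    time, and return one window quality value per window position.

    A plain average is used as a stand-in window quality model here;
    in the paper, an existing overall quality model is applied to the
    segments inside each window.
    """
    q = np.asarray(segment_qualities, dtype=float)
    if len(q) < W:
        # Watched duration shorter than the window size: score the
        # segments seen so far directly with the window quality model.
        return np.array([q[:i + 1].mean() for i in range(len(q))])
    return np.array([q[i:i + W].mean() for i in range(len(q) - W + 1)])

# Example: per-segment quality scores (1..5) of a 10-segment session.
print(window_qualities([4.5, 4.2, 3.0, 2.1, 2.3, 3.8, 4.0, 4.1, 3.9, 4.4], W=3))
```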

In the next subsection, an effect analysis of quality variations, primacy, and recency is first presented. Then, based on the obtained results, a cumulative quality model is proposed.

4.2 Proposed quality model

As mentioned, to identify the key components of a cumulative quality model, we carried out a statistical analysis of window quality statistics. In particular, the first window quality value $Q_F$ and the last window quality value $Q_L$ were employed to represent the impacts of the primacy and recency, respectively. For the factor of long-term changes, three parameters are considered, namely the average quality $Q_{avg}$, the minimum quality $Q_{min}$, and the maximum quality $Q_{max}$ of all windows up to a given time point.

Suppose that the window has just moved to segment $i$ (with $i \ge W$, where $W$ is the window size in segments). By using the window quality model, the window quality value $Q^w_i$ is calculated. After that, the statistics $Q_F$, $Q_L$, $Q_{avg}$, $Q_{min}$, and $Q_{max}$ are updated by the following equations.

$Q_F(i) = Q^w_W$   (1)
$Q_L(i) = Q^w_i$   (2)
$Q_{avg}(i) = \frac{1}{i-W+1}\sum_{j=W}^{i} Q^w_j$   (3)
$Q_{min}(i) = \min_{W \le j \le i} Q^w_j$   (4)
$Q_{max}(i) = \max_{W \le j \le i} Q^w_j$   (5)
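Each of these statistics can be maintained in constant time per window slide, which is what keeps the model cheap enough for real-time monitoring. Below is a minimal sketch of such an incremental update; the class and attribute names are ours.

```python
import math

class WindowStats:
    """Running statistics of window quality values (Eqs. (1)-(5)):
    first, last, average, minimum, and maximum window quality."""

    def __init__(self):
        self.n = 0              # number of windows seen so far
        self.q_first = None     # Q_F
        self.q_last = None      # Q_L
        self.q_sum = 0.0        # running sum for Q_avg
        self.q_min = math.inf   # Q_min
        self.q_max = -math.inf  # Q_max

    def update(self, q_w):
        """O(1) update when the window slides to a new segment."""
        self.n += 1
        if self.q_first is None:
            self.q_first = q_w
        self.q_last = q_w
        self.q_sum += q_w
        self.q_min = min(self.q_min, q_w)
        self.q_max = max(self.q_max, q_w)

    @property
    def q_avg(self):
        return self.q_sum / self.n

# Hypothetical window quality values of a session.
stats = WindowStats()
for q_w in [3.8, 3.5, 2.9, 3.2, 4.0]:
    stats.update(q_w)
print(stats.q_first, stats.q_last, stats.q_avg, stats.q_min, stats.q_max)
```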
Window size   Statistic   Q_F    Q_L      Q_max   Q_min    Q_avg
(seconds)
30            F           0.01   13.53    3.70    45.87    70.12
              p           0.94   <0.001   0.06    <0.001   <0.001
              η_p^2       0.00   0.17     0.05    0.40     0.51
50            F           0.73   39.51    0.00    65.32    67.10
              p           0.40   <0.001   0.97    <0.001   <0.001
              η_p^2       0.01   0.37     0.00    0.49     0.50
70            F           1.20   41.18    0.39    52.34    64.83
              p           0.28   <0.001   0.76    <0.001   <0.001
              η_p^2       0.02   0.38     0.01    0.43     0.49
Table 3: Results of Effect Analysis of Window Quality Statistics

Table 3 shows the results obtained from one-way analysis of variance (ANOVA). To assess the effect size, partial Eta-squared values ($\eta_p^2$) are also reported in Table 3. Here, the window quality model is the model proposed in [tran2016_GC] (called Tran's), and the window size is set to 30, 50, and 70 seconds.

The values in Table 3 indicate that, for all the considered window sizes, no significant effect was observed for $Q_F$ (i.e., $p > 0.05$). In contrast, significant results with large effect sizes were obtained for $Q_L$ (i.e., $p < 0.001$ and $\eta_p^2 \ge 0.17$). This implies that the impact of the primacy on the cumulative quality can be neglected, while the impact of the recency has to be considered.

With regard to long-term changes, no significant effect was found for $Q_{max}$ (i.e., $p \ge 0.06$), yet significant effects with large effect sizes were observed for $Q_{min}$ and $Q_{avg}$ (i.e., $p < 0.001$ and $\eta_p^2 \ge 0.40$). This implies that only the minimum and average window quality have to be considered.

To sum up, the results suggest that $Q_{avg}$, $Q_{min}$, and $Q_L$ should be key components of a cumulative quality model. Based on these observations, we propose a cumulative quality model given by

$Q_{cum}(i) = w_1 \cdot Q_{avg}(i) + w_2 \cdot Q_{min}(i) + w_3 \cdot Q_L(i)$   (6)

where $w_1$, $w_2$, and $w_3$ are the corresponding weights of the $Q_{avg}$, $Q_{min}$, and $Q_L$ components.
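In code, Eq. (6) is a one-line weighted sum over the running statistics; the weight values below are placeholders for illustration only (the fitted values are reported in Sect. 5.2.3).

```python
def cqm_score(q_avg, q_min, q_last, w1, w2, w3):
    """Cumulative quality as the weighted sum of Eq. (6)."""
    return w1 * q_avg + w2 * q_min + w3 * q_last

# Placeholder weights, not the fitted values from the paper.
print(cqm_score(q_avg=3.8, q_min=2.6, q_last=4.1, w1=0.5, w2=0.2, w3=0.3))
```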

It is interesting to note that the proposed model is in agreement with the peak-end rule [QoE_Kahneman1993], which says that users judge an experience largely by its peak and its end. Here, the peak of a session is the most severe quality impairment, captured by $Q_{min}$, and the end corresponds to the recency effect, captured by $Q_L$ in our model. In the case of speech quality, a large impact of the minimum quality on QoE was also shown [QoE_Koster2017]. For HTTP adaptive streaming, it was found in [QoE_tobias2014_assessing, QoE_seufert2015_impact] that the number of quality switches is not statistically significant, but the time the video is played out on each quality level is. Also, [QoE_Seufert2013_pool] showed that a good temporal pooling method is taking the average over the whole session, implying that $Q_{avg}$ is a key influence factor. Thus, all the key factors of the proposed model are in line with the findings of previous studies. Yet, the CQM model is the first to integrate these factors into a single model for predicting the cumulative quality of HAS sessions.

In the next section, we will investigate the performance of the proposed model and some existing models.

5 Model Evaluation and Analysis

5.1 Evaluation Methodology

This section is divided into two evaluations, each aiming at an important question. In the first evaluation, we investigate the best setting (i.e., window quality model and window size) for the proposed model. The second evaluation is carried out to see whether existing overall quality models can predict the cumulative quality, especially for long sessions.

There are in total six existing models employed in this study, denoted by Tran's [tran2016_GC], Guo's [QoE_ywang2015_assessing], Vriendt's [QoE_bellLab2013_QoEmodel], Yin's [QoE_Yin_2015], P.1203 [ITU1203_3, ITUT_implement1, ITUT_implement2, ITUT_implement3], and Rehman's [QoE_ZWangTimevarying]. Among these models, only Rehman's was proposed for cumulative quality prediction; the others were originally proposed for overall quality prediction.

Similar to [QoE_Database_ZDuanmu2018, QoE_QoEIndex_ZDuanmu2018], to evaluate the performance of the existing models, we implemented them using the parameter settings stated in the original papers. In addition, following Recommendation ITU-T P.1401 [ITUT_Rec1401], a first-order linear regression between predicted scores and MOSs was performed for each model to compensate for possible variances between subjective tests. The obtained slope and intercept coefficients are stated in the following subsections.

For the evaluations, the 72 sequences in our dataset were randomly divided into a training set of 36 sequences and a test set of the 36 remaining sequences. The training set was used to obtain the model parameters by curve fitting, and the test set was used to evaluate the performance of the models. We randomly selected 50 training sets; for each training set, the remaining sequences formed the corresponding test set.

To measure the performance of the models, we used two metrics: the Pearson Correlation Coefficient (PCC) and the Root-Mean-Squared Error (RMSE). The PCC and RMSE values reported below were averaged over the 50 test sets. Since the capability of real-time processing is an especially important feature for cumulative quality models, we also measured the computation complexity of the models, defined as the average time required to obtain a cumulative quality value per 1-second segment. The measurement was conducted on a computer with an Intel Core i3-2120 processor at 3.30 GHz and 8 GB RAM.
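A minimal sketch of this evaluation step is given below, assuming numpy and hypothetical score arrays; the mapping follows the first-order linear regression described above, and the helper name is ours.

```python
import numpy as np

def evaluate(predicted, mos):
    """Map predicted scores to the subjective scale with a first-order
    linear regression (ITU-T P.1401 style), then compute PCC and RMSE."""
    predicted = np.asarray(predicted, dtype=float)
    mos = np.asarray(mos, dtype=float)
    slope, intercept = np.polyfit(predicted, mos, deg=1)
    mapped = slope * predicted + intercept
    pcc = np.corrcoef(mapped, mos)[0, 1]
    rmse = np.sqrt(np.mean((mapped - mos) ** 2))
    return slope, intercept, pcc, rmse

# Hypothetical predicted scores and MOSs for illustration.
print(evaluate([3.1, 3.8, 2.5, 4.2], [3.3, 4.0, 2.4, 4.5]))
```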

5.2 Performance Analysis of the CQM Model

Model           Training set        Test set
                PCC      RMSE       PCC      RMSE
CQM+Tran's      0.94     0.26       0.93     0.27
CQM+Guo's       0.91     0.31       0.89     0.34
CQM+Vriendt's   0.93     0.29       0.92     0.31
CQM+Yin's       0.92     0.31       0.91     0.33
CQM+P.1203      0.94     0.26       0.92     0.28
Table 4: Performance of CQM Model using Different Window Quality Models

In this subsection, we investigate the performance of the proposed model under different settings. Our goal is to find the best settings of 1) window quality model and 2) window size of the proposed model.

For this purpose, we first present results using some different window quality models. Then, various window sizes are investigated. Finally, the model parameters are determined based on result analysis.

5.2.1 Window quality model

In this part, the five overall quality models of Tran's, Guo's, Vriendt's, Yin's, and P.1203 are employed to obtain window quality values. Note that these models all take into account the impact of short-term changes. Further, note that Rehman's model is a cumulative quality model; it is therefore not used here but only later for comparison purposes.

Table 4 shows the performance of the CQM model using the different window quality models with a window size of 50 seconds. It can be seen that the performance of the CQM model is generally good with all the window quality models. In particular, the combination CQM+Tran's provides the best prediction performance: the PCC and RMSE values are 0.94 and 0.26 for the training set, and 0.93 and 0.27 for the test set. The main reason is that Tran's model utilizes histograms of segment quality values and switching amplitudes, which are shown to be more effective in modeling the impact of short-term changes than the statistics used in the other models [tran2017_IEICEhistogram].

Since the combination CQM+Tran’s provides the best performance, Tran’s model is used as the window quality model in the rest of this paper.

5.2.2 Window size

In this part, the performance of the CQM model is evaluated using different window sizes. As mentioned, Tran's model is employed to obtain window quality values. Fig. 4 shows the performance of the proposed model with window sizes ranging from 2 to 90 seconds with a step size of 2 seconds. It is clear that, for a given window size, the training set always achieves higher PCC values and lower RMSE values than the test set. In general, the behaviors of the PCC and RMSE curves for the training and test sets are similar. In particular, the prediction performance improves quickly (i.e., the PCC value increases quickly while the RMSE value drops quickly) as the window size increases up to 14 seconds. When the window size is between 14 and 50 seconds, some small improvements in PCC and RMSE are observed. The best prediction performance on the test set is achieved with a window size of 50 seconds. Specifically, the PCC and RMSE values are 0.94 and 0.26 for the training set, and 0.93 and 0.27 for the test set. When the window size increases beyond 50 seconds, the PCC falls sharply and the RMSE rises dramatically. Therefore, to achieve the highest performance, the window size should be 50 seconds. This window size will be used in the rest of this paper.

5.2.3 Model parameters

Similar to [QoE_liu2015_deriving], in order to obtain the model parameters, we pick the best training set, i.e., the one that provides the highest PCC on the corresponding test set. The resulting parameter values are given by

(7)
(8)

The high numerical values of the weights $w_1$, $w_2$, and $w_3$ reconfirm the observations in Sect. 4 that $Q_{avg}$, $Q_{min}$, and $Q_L$ are key components of the cumulative quality model, and that the impacts of the quality variations and recency on the cumulative quality of a session are significant. In addition, it can be seen that $w_1$ is the highest while $w_2$ is the lowest. So the impact of the average window quality is strongest, and the impact of the minimum window quality is weakest.
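The curve-fitting step can be reproduced with ordinary least squares over the training sequences; the sketch below uses hypothetical data and assumes the linear form of Eq. (6) without an intercept term.

```python
import numpy as np

# Hypothetical training data: one row per training sequence with its
# (Q_avg, Q_min, Q_L) statistics, and the corresponding MOSs.
X = np.array([
    [3.8, 2.6, 4.1],
    [3.2, 2.0, 3.5],
    [4.3, 3.9, 4.4],
    [2.9, 1.8, 2.6],
])
mos = np.array([3.9, 3.0, 4.3, 2.5])

# Fit the weights (w1, w2, w3) of Eq. (6) by ordinary least squares.
weights, *_ = np.linalg.lstsq(X, mos, rcond=None)
print(weights)
```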

5.3 Model Comparison

In this subsection, we compare the CQM model and the six existing models in terms of the prediction performance and the computation complexity.

Figure 4: Performance of CQM model using different window sizes
Figure 5: Performance of models with different sequence lengths

Fig. 5 shows the PCC values of the models for different sequence lengths. We can see that, when the sequence length is 1 minute, the PCC values of Tran's, Guo's, Vriendt's, Yin's, and P.1203 models are high. This suggests that these models can predict the overall quality of a short session well, and thus each of them can be used as a window quality model with good performance, as discussed in Subsect. 5.2.1.

However, when the sequence length increases, the PCC values of the models decrease. Among the models, the PCC of the CQM model is highest for all the sequence lengths. Meanwhile, the performance of Rehman’s model is lowest. A possible explanation is that Rehman’s model is designed using very short sessions with a duration of 5–15 seconds. Thus it is not really suitable for longer sessions (i.e., 1–6 minutes). In addition, there is no consideration for long-term changes and recency in Rehman’s, Tran’s, Guo’s, Vriendt’s, and Yin’s models, so the performances of these models are all lower than that of the CQM model.

Model       Slope   Intercept   PCC    RMSE   Computation complexity (ms)
Tran's      1.24    -1.27       0.89   0.31   0.22
Guo's       1.01    -0.25       0.72   0.49   0.02
Vriendt's   1.02    -0.41       0.85   0.37   0.05
Yin's       1.07    -0.79       0.80   0.42   0.06
P.1203      1.04    -0.93       0.89   0.32   1682.82
Rehman's    3.00    -0.42       0.62   0.67   0.05
CQM         -       -           0.93   0.27   0.20
Table 5: Performance of Models in Predicting Cumulative Quality (PCC and RMSE on the test set; slope and intercept are the linear compensation coefficients)
Figure 6: An example of the cumulative quality values of a streaming session.

Table 5 summarizes the performance and the computation complexity of the models. Here, the PCC and RMSE are averaged over the 50 test sets containing sequences of different lengths. We can see that the performance results are similar to those in Fig. 5. In particular, the performance of the CQM model is the highest and the performance of Rehman's model is the lowest.

Regarding the computation complexity, it can be seen that the CQM model takes less than 1 ms to obtain a cumulative quality value, so the cumulative quality can be updated after every segment as the window slides forward. In other words, the CQM model is applicable to real-time quality monitoring.

The computation complexity of the P.1203 model is considerably higher than that of the other models. In particular, the P.1203 model takes on average 1.68 s to calculate a cumulative quality value, whereas each of the remaining models has an average processing time of less than 1 ms per cumulative quality value.

To better understand the cumulative quality, Fig. 6 shows the MOSs and the scores predicted by the CQM model corresponding to the adaptation result in Fig. 1. We can see that the predicted scores closely follow the MOSs. In addition, the cumulative quality fluctuates strongly during the session. This means that evaluating the overall quality only at the end of a streaming session is not enough to fully understand the quality of the video streaming service. So, the cumulative quality over time is of crucial importance in adaptive streaming.

6 Conclusions and Future Work

In this paper, we have presented a model for predicting the cumulative quality of adaptive video streaming. The proposed model was developed based on the concept of a "sliding window" over a streaming session, where each window is characterized by a quality value.

First, a subjective test was specifically designed and conducted for measuring the cumulative quality. Second, through statistical analysis, it was found that the impacts of the quality variations and recency are significant. We integrated the significant key components, namely the average window quality, the minimum window quality, and the last window quality, into a new cumulative quality model called CQM, which is able to accurately predict the cumulative quality of streaming sessions. The advantage of the proposed CQM model is its simplicity, while being in line with well-known effects from the literature, namely the applicability of simple temporal pooling and the peak-end rule.

The CQM model was compared with six existing models and was found to outperform them in predicting the cumulative quality. Moreover, the proposed model is applicable to real-time quality monitoring thanks to its low computation complexity, which is especially important for the cost-effective evaluation of streaming technologies. In the future, the model will be used to assess the quality of different adaptive streaming techniques. Also, we will develop novel quality adaptation strategies based on the CQM model.

References