I Introduction
Quality of Experience (QoE) research commonly relies on the collection of subjective ratings from a chosen panel of users to quantify various QoE dimensions (also referred to as QoE features [1]), e.g., related to perceived audio/visual quality, perceived usability, or overall perceived quality. Various rating scales have been used in both the user experience (UX) and QoE research fields, but the results of subjective studies reported by the QoE community have to a large extent relied on a standardized 5-point Absolute Category Rating (ACR) scale to calculate Mean Opinion Score (MOS) values. Although it has been argued that researchers should go beyond the MOS in their studies [2] in order to consider different applications and user diversity, MOS estimates remain a staple of the QoE literature.
In this context, the statistical analysis of subjective study results, subsequently used to derive QoE estimation models [3], relies on the estimation of confidence intervals (CIs) to quantify the significance of MOS values per test condition. Challenges arise in dealing with uncertainties resulting from problems such as ordering effects and subject biases [3, 4]. Such statistical uncertainties are expressed in terms of CIs. Given the nature of QoE studies, two main issues arise. Firstly, rating scales used in quantitative QoE evaluation are bounded at both ends, so the individual rating scores of a subject are limited. However, the calculation of CIs relies on the normal distribution (by the central limit theorem) or Student’s t-distribution, both of which are unbounded.
Secondly, due to the inherent complexity of running subjective studies, resulting in a compromise between a large number of test conditions and participant fatigue, the number of test subjects taking part in a study is generally small, in particular when running tests in a lab environment. We note that while methods such as crowdsourcing may be utilized to obtain a much larger population sample, in many cases the specifics of the study call for a controlled lab environment. As an example, and bearing in mind that the number of required participants clearly depends on the test design, number of test conditions, and target population, the ITU-T recommends a minimum of 24 subjects (controlled environment) or 35 subjects (public environment) for subjective assessment of audiovisual quality [5]. ITU-T Recommendation P.1401 further states that if fewer than 30 samples are used, the normal distribution starts to become distorted and the calculation of CIs based on normality assumptions is no longer valid. In cases with fewer than 30 samples, P.1401 advocates the use of the Student’s t-distribution when calculating CIs.
Given the aforementioned issues, we highlight that commonly used CI estimators do not work properly for small sample sizes, as the normal distribution assumption may not be valid, and that they violate the bounds of the rating scale. In this paper we review statistical approaches in the literature for their application in the QoE domain for MOS interval estimation (instead of having only a point estimator, which is the MOS). Due to space restrictions, we consider only discrete rating scales, and test the CI estimators in terms of efficiency (CI width), coverage (how many CIs overlap the true mean value), and outlier ratio.
The remainder of this paper is organized as follows. Section II provides the background on CIs, such as the central limit theorem, used to derive CI estimators. Section III considers common estimators for the MOS and introduces some estimators based on binomial distributions that are suitable for MOS CI estimation. It also discusses other methods not commonly used in the QoE community, such as simultaneous CIs and Bayesian approaches for multinomial distributions, as well as bootstrap CIs. Section IV defines various scenarios for evaluating the performance of the estimators in terms of coverage, outlier ratio, and CI width. Section V concludes this work and gives some recommendations on CI estimators for MOS values in practice.
II Background
For the sake of readability, we briefly state the definitions and theorems used to obtain an interval estimate, denoted confidence intervals in the following. Table I provides a summary of the notation used throughout the paper.

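Let $X_1, \dots, X_n$ be a random sample from a probability distribution with statistical parameter $\theta$, which is a quantity to be estimated. The confidence interval $(\theta_{\mathrm{lo}}, \theta_{\mathrm{up}})$ is obtained by
$$P(\theta_{\mathrm{lo}} \le \theta \le \theta_{\mathrm{up}}) = \gamma = 1 - \alpha \tag{1}$$
where $\gamma$ is the confidence coefficient (or degree of confidence). The confidence interval contains the statistical parameter $\theta$ with probability $\gamma$.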

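Let $X_1, \dots, X_n$ be a random sample of size $n$ taken from a population with expected value $\mu$ and variance $\sigma^2$, $\sigma^2 < \infty$. Then the sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ asymptotically follows a normal distribution with expected value $\mu$ and variance $\sigma^2/n$ as $n \to \infty$:
$$\bar{X} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right) \tag{2}$$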

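The confidence interval for the sample mean $\bar{X}$, with sample standard deviation $S$ and standard error of the sample mean $S/\sqrt{n}$ according to the CLT, can be obtained by
$$\bar{X} \pm z_{1-\alpha/2}\,\frac{S}{\sqrt{n}} \tag{3}$$
where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$ quantile of the standard normal distribution $N(0,1)$. $S^2$ estimates the unknown variance $\sigma^2$.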
This assumes that the sample size is large, and that the sampling distribution is symmetric, which is not always the case. In the following, we detail how to establish a confidence interval in the case of a sampling distribution whose density function is symmetric or nonsymmetric around the mean.
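Note that the variance of the sample mean is $\mathrm{Var}[\bar{X}] = \sigma^2/n$, where $\sigma^2$ is the variance of a single sample $X_i$. This implies that $\mathrm{Var}[\bar{X}] \to 0$ as $n \to \infty$, while $\mathrm{Var}[X_i] = \sigma^2$ is unaffected by $n$.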
Table I: Summary of notation.
Variable  Description
$Y_x$  random variable of user ratings for test condition $x$
$K$  users rate on a discrete $K$-point rating scale from $1$ to $K$
$n$  number of users rating the test condition
$k$  number of test conditions (TC)
$r$  number of simulation runs
$Y_{u,x,j}$  sampled user rating for user $u$, TC $x$, and simulation run $j$
$\hat{Y}_{x,j}$  MOS, i.e., sample mean over the $n$ user ratings, for TC $x$ and run $j$
$\gamma$  confidence level
$\alpha$  significance level, e.g., $\alpha = 0.05$; it is $\alpha = 1 - \gamma$
III Confidence Interval Estimators for MOS
III-A Problem Formulation
We assume a discrete rating scale with $K$ rating items, leading to a multinomial distribution, which is a generalization of the binomial distribution. For a certain test condition, $n$ users rate the quality on a discrete $K$-point rating scale, e.g., $K = 5$ for the commonly used 5-point ACR scale. Each scale item $i$ is selected with probability $p_i$ for $i = 1, \dots, K$; $\sum_{i=1}^{K} p_i = 1$.
The users rate quality as one of the $K$ categories. Samples $(n_1, \dots, n_K)$ indicate the number of ratings obtained per category, with $\sum_{i=1}^{K} n_i = n$ (i.e., each user has provided one rating). With each category $i$ having a fixed probability $p_i$, the multinomial distribution gives the probability of any particular combination of numbers of successes for the various categories (under the condition $\sum_{i=1}^{K} n_i = n$):
$$P(N_1 = n_1, \dots, N_K = n_K) = \frac{n!}{n_1! \cdots n_K!}\, p_1^{n_1} \cdots p_K^{n_K} \tag{4}$$
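In QoE tests, we are interested in the rating of an arbitrary user. The marginal distribution (when $n = 1$) is $P(Y = i) = p_i$. The expected rating $E[Y] = \sum_{i=1}^{K} i\,p_i$ is estimated by the sample mean (aka MOS), assuming a linear rating scale:
$$\hat{Y} = \frac{1}{n} \sum_{i=1}^{K} i\, n_i \tag{5}$$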
We denote $Y$ as a random variable of the rating of the users. We observe a sample $Y_1, \dots, Y_n$ with $Y_u \in \{1, \dots, K\}$. As previously stated, in subjective QoE tests, the number of users $n$ is typically not very high. From the samples $Y_u$, the MOS and CI can be estimated. However, given the use of a bounded rating scale and a small sample size, the CLT approximation may not hold: the sampling distribution might be asymmetric around the sample mean, and existing CI estimators will potentially violate the bounds of the rating scale, i.e., $\theta_{\mathrm{lo}} < 1$ and/or $\theta_{\mathrm{up}} > K$.
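The relation between category counts and the MOS can be illustrated with a short Python sketch. The category probabilities `p` and the panel size `n = 24` below are hypothetical values chosen only for illustration; they are not taken from any study.

```python
import numpy as np


def mos_from_counts(counts):
    """MOS on a linear scale 1..K from the number of ratings per category."""
    counts = np.asarray(counts, dtype=float)
    scale = np.arange(1, counts.size + 1)  # scale items 1, ..., K
    return float(np.dot(scale, counts) / counts.sum())


# Hypothetical category probabilities for a 5-point ACR scale (assumption).
p = np.array([0.05, 0.15, 0.30, 0.35, 0.15])
n = 24  # hypothetical number of users

rng = np.random.default_rng(seed=1)
counts = rng.multinomial(n, p)  # one multinomial draw: ratings per category
mos = mos_from_counts(counts)
expected = float(np.dot(np.arange(1, 6), p))  # true expected rating

print("counts per category:", counts)
print("MOS:", round(mos, 3), "expected rating:", round(expected, 3))
```

With only 24 ratings, the MOS of a single draw can deviate noticeably from the expected rating, which is the motivation for reporting an interval rather than a point estimate.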
III-B Regular Normal and Student’s t-Distribution
The most common way of constructing a CI from a set of samples, $Y_1, \dots, Y_n$, is to apply the CLT. When the variance of $Y$ is not known, the quantile must be taken from a Student’s t-distribution with confidence level $1-\alpha$ and $n-1$ degrees of freedom, unless the number of samples is sufficiently large ($n \ge 30$ according to ITU-T Recommendation P.1401). In that case, the quantiles of the Student’s t-distribution and the standard normal distribution are approximately the same.
The CI for both the Student’s t-distribution and the normal distribution is estimated using (3); the only difference lies in the quantiles.
Note that truncating the bounds to the rating scale, i.e., setting $\theta_{\mathrm{lo}} = 1$ and/or $\theta_{\mathrm{up}} = K$ whenever the interval exceeds the scale, is not correct.
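As a minimal sketch of the interval in (3), the following Python function computes the normal and Student’s t intervals for one test condition. The function name `mos_ci` and the example ratings are hypothetical; the only library calls used are the quantile functions `scipy.stats.norm.ppf` and `scipy.stats.t.ppf`.

```python
import numpy as np
from scipy import stats


def mos_ci(ratings, alpha=0.05, use_t=True):
    """Return (MOS, lower, upper) for the interval MOS +/- q * S / sqrt(n)."""
    y = np.asarray(ratings, dtype=float)
    n = y.size
    mos = y.mean()
    std_err = y.std(ddof=1) / np.sqrt(n)  # S / sqrt(n) with unbiased S
    if use_t:
        q = stats.t.ppf(1 - alpha / 2, df=n - 1)  # Student's t quantile
    else:
        q = stats.norm.ppf(1 - alpha / 2)  # standard normal quantile
    return float(mos), float(mos - q * std_err), float(mos + q * std_err)


# Hypothetical ratings of 12 users for a high-quality test condition.
ratings = [5, 5, 5, 5, 4, 5, 5, 5, 4, 5, 5, 3]
mos, lo_t, up_t = mos_ci(ratings, use_t=True)
_, lo_z, up_z = mos_ci(ratings, use_t=False)

print(f"MOS={mos:.3f}  t: [{lo_t:.3f}, {up_t:.3f}]  normal: [{lo_z:.3f}, {up_z:.3f}]")
```

For this sample, both upper bounds exceed the top of the 5-point scale, which illustrates the bound violation discussed above; the t interval is the wider of the two because of the small sample size.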
III-C Simultaneous CIs for Multinomial Distribution
A complementary approach is to consider the multinomial proportions $p_i$ of user ratings on the scale for item $i$ and then to derive exact confidence coefficients of simultaneous CIs for those multinomial proportions. A method for computing the CIs for functions of the multinomial proportions is proposed in [6], which can be directly applied to the computation of the MOS, see Eq. (5). There are $n_i$ user ratings for category $i$, and the interval uses a quantile of the $\chi^2$ distribution with one degree of freedom that accounts for the simultaneous CIs. The MOS is $\hat{Y} = \frac{1}{n}\sum_{i=1}^{K} i\,n_i$.
(6) 
III-D Using Binomial Proportions for Discrete Rating Scales
The shifted binomial distribution can be used as an upper bound distribution for user rating distributions when users rate on a $K$-point rating scale ($1, \dots, K$). The binomial distribution leads to high standard deviations in QoE tests [7] and follows exactly the SOS hypothesis with parameter $a = 1/(K-1)$, i.e., $\mathrm{Var}[Y] = a\,(E[Y]-1)(K-E[Y])$.
Let us consider $n$ users. Assume the user ratings follow a shifted binomial distribution, $Y_u - 1 \sim \mathrm{Bin}(K-1, p)$. Then, the sum of the shifted user ratings also follows a binomial distribution:
$$\sum_{u=1}^{n} (Y_u - 1) \sim \mathrm{Bin}\big(n(K-1),\, p\big) \tag{7}$$
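and then $E[Y_u] = 1 + (K-1)\,p$. Due to differences among users, it may be that $Y_u - 1 \sim \mathrm{Bin}(K-1, p_u)$ with $p_u \neq p_v$ for users $u$ and $v$. The binomial sum variance inequality can be used to derive an upper bound. Let us consider $Z = \sum_{u=1}^{n} (Y_u - 1)$, which does not follow a binomial distribution. We define $Z' \sim \mathrm{Bin}\big(n(K-1), \bar{p}\big)$ with $\bar{p} = \frac{1}{n}\sum_{u=1}^{n} p_u$. As a result of the binomial sum variance inequality, the variance of $Z'$ is an upper bound for QoE tests:
$$\mathrm{Var}[Z] = (K-1)\sum_{u=1}^{n} p_u (1-p_u) \;\le\; n(K-1)\,\bar{p}\,(1-\bar{p}) = \mathrm{Var}[Z'] \tag{8}$$
Hence, we may use $Z'$ instead of $Z$ to derive conservative CIs for the MOS based on the CI $(p_{\mathrm{lo}}, p_{\mathrm{up}})$ for the unknown $\bar{p}$:
$$\big(1 + (K-1)\,p_{\mathrm{lo}},\; 1 + (K-1)\,p_{\mathrm{up}}\big) \tag{9}$$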
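The binomial sum variance inequality can be checked numerically. The sketch below assumes that each user $u$ rates according to a shifted binomial distribution with an individual success probability; the probabilities in `p_users` are hypothetical.

```python
import numpy as np


def variance_of_sum(p_users, K=5):
    """Variance of the sum of shifted ratings with user-specific probabilities."""
    p = np.asarray(p_users, dtype=float)
    return float(((K - 1) * p * (1 - p)).sum())


def variance_bound(p_users, K=5):
    """Variance of a single binomial with n*(K-1) trials and the mean probability."""
    p = np.asarray(p_users, dtype=float)
    p_bar = p.mean()
    return float(p.size * (K - 1) * p_bar * (1 - p_bar))


# Hypothetical user-specific probabilities (assumption for illustration).
p_users = [0.5, 0.6, 0.7, 0.8]
exact = variance_of_sum(p_users)
bound = variance_bound(p_users)

print(f"Var of heterogeneous sum = {exact:.2f} <= binomial bound = {bound:.2f}")
```

The gap between the two values equals $(K-1)\sum_u (p_u - \bar{p})^2$, so the bound is tight exactly when all users share the same probability.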
CI estimation for binomial distributions has drawn attention in the literature and several suggestions have been provided. A few works compare the CI estimators for binomial proportions [8, 9, 10, 11]. For example, [10] suggests using the Wilson interval and the Jeffreys prior interval for small $n$. The normal theory approximation of a confidence interval for a proportion is known as the Wald interval, which is, however, not recommended [12]. For readability, we write $z$ for the $(1-\alpha/2)$ quantile of the standard normal distribution.
III-D1 Wald interval employing normal approximation
From the MOS we obtain . The standard deviation is . The CI for the MOS is as follows.
(10) 
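A Wald-type interval for the MOS can be sketched as follows. This is an illustration under stated assumptions, not necessarily the exact form of Eq. (10): the proportion is taken as the MOS mapped linearly to $[0,1]$, and the standard error is computed over $m = n(K-1)$ Bernoulli trials, matching the binomial sum in Eq. (7). The function name `wald_mos_ci` and the example ratings are hypothetical.

```python
import numpy as np
from scipy import stats


def wald_mos_ci(ratings, K=5, alpha=0.05):
    """Wald interval for the proportion, mapped back to the rating scale."""
    y = np.asarray(ratings, dtype=float)
    m = y.size * (K - 1)  # assumed number of Bernoulli trials
    p_hat = (y.mean() - 1) / (K - 1)  # MOS mapped to [0, 1]
    z = stats.norm.ppf(1 - alpha / 2)
    half = z * np.sqrt(p_hat * (1 - p_hat) / m)
    return float(1 + (K - 1) * (p_hat - half)), float(1 + (K - 1) * (p_hat + half))


# Hypothetical ratings close to the upper end of a 5-point scale.
lo, up = wald_mos_ci([5] * 11 + [4])
print(f"Wald CI: [{lo:.3f}, {up:.3f}]")

# Degenerate case: identical ratings collapse the interval to a single point.
lo_all, up_all = wald_mos_ci([5] * 12)
print(f"All ratings 5: [{lo_all:.3f}, {up_all:.3f}]")
```

Both outputs show known weaknesses of the Wald interval near the scale edges: the first interval exceeds the scale bound, and the second has zero width.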
III-D2 Wilson score interval with continuity correction
For the Wilson interval, a continuity correction is proposed which aligns the minimum coverage probability, rather than the average probability, with the nominal value.
(11)  
(12)  
(13) 
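The following sketch implements the continuity-corrected Wilson score interval in its commonly cited form and maps it to the rating scale. It may differ in detail from Eqs. (11) to (13). As assumptions of this sketch, the number of trials is $m = n(K-1)$ and the number of successes is the sum of the shifted ratings; the function name `wilson_cc_mos_ci` is hypothetical.

```python
import numpy as np
from scipy import stats


def wilson_cc_mos_ci(ratings, K=5, alpha=0.05):
    """Continuity-corrected Wilson interval for the proportion, mapped to 1..K."""
    y = np.asarray(ratings, dtype=float)
    m = y.size * (K - 1)  # assumed number of Bernoulli trials
    x = float(np.sum(y - 1))  # number of successes
    p = x / m
    z = stats.norm.ppf(1 - alpha / 2)
    denom = 2 * (m + z**2)
    centre = 2 * m * p + z**2
    # max(..., 0) only guards the square root for very small quantiles z.
    rad_lo = z * np.sqrt(max(z**2 - 1 / m + 4 * m * p * (1 - p) + (4 * p - 2), 0.0)) + 1
    rad_up = z * np.sqrt(max(z**2 - 1 / m + 4 * m * p * (1 - p) - (4 * p - 2), 0.0)) + 1
    p_lo = 0.0 if x == 0 else max(0.0, (centre - rad_lo) / denom)
    p_up = 1.0 if x == m else min(1.0, (centre + rad_up) / denom)
    return float(1 + (K - 1) * p_lo), float(1 + (K - 1) * p_up)


lo, up = wilson_cc_mos_ci([5] * 11 + [4])  # hypothetical ratings
print(f"Wilson (cc) CI: [{lo:.3f}, {up:.3f}]")
```

Because the proportion interval is confined to $[0,1]$, the resulting MOS interval stays within the scale and is asymmetric around the MOS near the edges.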
III-D3 Clopper-Pearson
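The Clopper-Pearson interval is the central exact interval [13], and we use the implementation based on the beta distribution [12]. The parameter $x$ quantifies the number of ‘successes’ of the corresponding binomial proportion, i.e., $x = \sum_{u=1}^{n} (Y_u - 1)$ for user ratings $Y_u$, out of $m = n(K-1)$ trials. The $q$ quantile of the beta distribution with parameters $c$ and $d$ is denoted by $B(q; c, d)$.
$$p_{\mathrm{lo}} = B\big(\tfrac{\alpha}{2};\, x,\, m - x + 1\big) \tag{14}$$
$$p_{\mathrm{up}} = B\big(1 - \tfrac{\alpha}{2};\, x + 1,\, m - x\big) \tag{15}$$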
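A minimal sketch of the Clopper-Pearson interval using the beta quantile function `scipy.stats.beta.ppf` is given below. The number of trials $m = n(K-1)$, the success count as the sum of shifted ratings, and the function name `clopper_pearson_mos_ci` are assumptions of this sketch.

```python
import numpy as np
from scipy import stats


def clopper_pearson_mos_ci(ratings, K=5, alpha=0.05):
    """Clopper-Pearson interval for the proportion, mapped to the scale 1..K."""
    y = np.asarray(ratings, dtype=float)
    m = y.size * (K - 1)  # assumed number of Bernoulli trials
    x = float(np.sum(y - 1))  # number of successes
    # The beta quantile is undefined for a zero shape parameter, so the
    # boundary cases x = 0 and x = m are set explicitly.
    p_lo = 0.0 if x == 0 else stats.beta.ppf(alpha / 2, x, m - x + 1)
    p_up = 1.0 if x == m else stats.beta.ppf(1 - alpha / 2, x + 1, m - x)
    return float(1 + (K - 1) * p_lo), float(1 + (K - 1) * p_up)


lo, up = clopper_pearson_mos_ci([5] * 11 + [4])  # hypothetical ratings
print(f"Clopper-Pearson CI: [{lo:.3f}, {up:.3f}]")

lo_all, up_all = clopper_pearson_mos_ci([5] * 12)
print(f"All ratings 5: [{lo_all:.3f}, {up_all:.3f}]")
```

Unlike the Wald interval, the case of identical ratings still yields an interval of positive width whose upper bound coincides with the top of the scale.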
III-D4 Jeffreys Interval
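A Bayesian approach for binomial proportions is the Jeffreys interval, which is an exact Bayesian credibility interval and guarantees a mean coverage probability of $1-\alpha$ under the specified prior distribution. Brown et al. [10] have chosen the Jeffreys prior [14]. Although it follows a different paradigm, it also has good frequentist properties and looks similar to Clopper-Pearson. The calculation also uses the number of successes $x = \sum_{u=1}^{n} (Y_u - 1)$ out of $m = n(K-1)$ trials, as defined above, and the quantiles $B(q; c, d)$ of the beta distribution with parameters $c$ and $d$.
$$p_{\mathrm{lo}} = B\big(\tfrac{\alpha}{2};\, x + \tfrac{1}{2},\, m - x + \tfrac{1}{2}\big) \tag{16}$$
$$p_{\mathrm{up}} = B\big(1 - \tfrac{\alpha}{2};\, x + \tfrac{1}{2},\, m - x + \tfrac{1}{2}\big) \tag{17}$$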
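The Jeffreys interval can be sketched analogously with `scipy.stats.beta.ppf`. As assumptions of this sketch, the trial count is $m = n(K-1)$, and the boundary modification from [10] is applied (lower limit 0 if there are no successes, upper limit 1 if there are only successes). The function name `jeffreys_mos_ci` is hypothetical.

```python
import numpy as np
from scipy import stats


def jeffreys_mos_ci(ratings, K=5, alpha=0.05):
    """Jeffreys interval for the proportion, mapped to the rating scale 1..K."""
    y = np.asarray(ratings, dtype=float)
    m = y.size * (K - 1)  # assumed number of Bernoulli trials
    x = float(np.sum(y - 1))  # number of successes
    # Posterior under the Jeffreys prior Beta(1/2, 1/2) is Beta(x+1/2, m-x+1/2).
    p_lo = 0.0 if x == 0 else stats.beta.ppf(alpha / 2, x + 0.5, m - x + 0.5)
    p_up = 1.0 if x == m else stats.beta.ppf(1 - alpha / 2, x + 0.5, m - x + 0.5)
    return float(1 + (K - 1) * p_lo), float(1 + (K - 1) * p_up)


lo, up = jeffreys_mos_ci([5] * 11 + [4])  # hypothetical ratings
print(f"Jeffreys CI: [{lo:.3f}, {up:.3f}]")
```

For ratings centred on the scale midpoint, the interval is symmetric around the MOS, since the posterior beta distribution then has equal shape parameters.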
III-E Bootstrap Confidence Intervals
The nonparametric bootstrap method as introduced by Efron [15] uses solely the empirical distribution of the observed sample. Resampling from the empirical distribution yields many MOS estimates, one per bootstrap simulation run. As a result, a distribution of mean values is observed and the CIs can be directly obtained based on Eq. (1). We use Matlab’s implementation of the ‘bias corrected and accelerated percentile’ method to cope with the skewness of the observed distribution, cf. [15].
IV Numerical Results
For evaluating the estimators’ performance, we consider different scenarios in which the user ratings for a test condition are sampled from a known distribution. The commonly used 5-point ACR scale is considered. We investigate two different scenarios: (1) binomial distribution as an upper bound in terms of variance for QoE tests, (2) low variance, where users only rate $\{2, 3, 4\}$ and avoid the rating scale edges. The performance is then evaluated with several metrics: the coverage of the CIs, the width of the CIs, and the outlier ratio.
IV-A Scenarios for Performance Evaluation
We consider a point rating scale. For a certain test condition , the user ratings follow a certain discrete distribution, with for . User ratings are sampled for test condition for the users , from the distribution . The simulations are repeated times to get statistically significant results in the evaluation. The index represents the th simulation run. We use repetitions. For the evaluation, we consider test conditions with the known mean value, i.e., the expected value, for . It is with and indicating the maximum and minimum possible user rating , respectively.
IV-A1 Binomially Distributed User Ratings
This scenario represents a high variance of user ratings which is also observed in real QoE tests. The user rating diversity for any QoE experiment can be quantified in terms of the SOS parameter $a$, which is defined in [7]. For example, [2] reports the SOS parameter measured for the results of a web QoE study, which was among the highest SOS parameters observed for different QoE studies and applications such as video streaming, VoIP, and image QoE. The results of gaming QoE studies have shown a similarly high SOS parameter. The binomial distribution leads to an SOS parameter of $a = 1/(K-1) = 0.25$ on the 5-point scale and is therefore appropriate as a realistic scenario for high variances.
(18) 
with MOS and . Hence, .
IV-A2 Low Variance
Next, we consider a scenario with low variances. In that case, users do not use the edges of the rating scale and only rate $\{2, 3, 4\}$. This can be realized with a shifted binomial distribution.
(19) 
with and . Then . The SOS parameter is numerically derived [2] and found to be .
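Both scenarios can be sampled with a few lines of Python. The sketch assumes that the binomial scenario uses ratings $1 + \mathrm{Bin}(K-1, p)$ and that the low variance scenario on a 5-point scale uses ratings $2 + \mathrm{Bin}(2, p)$, so that only the inner items 2, 3, and 4 occur. The function names and the chosen expected values are hypothetical.

```python
import numpy as np


def sample_binomial_ratings(rng, mu, n, K=5):
    """High variance scenario: ratings 1 + Bin(K-1, p) with expected value mu."""
    p = (mu - 1) / (K - 1)
    return 1 + rng.binomial(K - 1, p, size=n)


def sample_low_variance_ratings(rng, mu, n):
    """Low variance scenario (assumed form): ratings 2 + Bin(2, p) on {2, 3, 4}."""
    p = (mu - 2) / 2
    return 2 + rng.binomial(2, p, size=n)


rng = np.random.default_rng(seed=0)
high = sample_binomial_ratings(rng, mu=4.0, n=1000)
low = sample_low_variance_ratings(rng, mu=3.0, n=1000)

print("binomial:     mean", high.mean(), "variance", high.var(ddof=1))
print("low variance: mean", low.mean(), "variance", low.var(ddof=1))
```

For an expected value of 4 the binomial scenario has a theoretical variance of $4 \cdot 0.75 \cdot 0.25 = 0.75$, while the low variance scenario with expected value 3 has a theoretical variance of $2 \cdot 0.5 \cdot 0.5 = 0.5$ and never produces the ratings 1 or 5.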
IV-B Metrics for Evaluating the Performance of the Estimators
According to the distribution defined in a given scenario, we generate $n$ samples (i.e., user ratings) for each of the $k$ test conditions and repeat the simulation $r$ times. The user rating $Y_{u,x,j}$ indicates the rating of user $u$ for test condition $x$ in run $j$.
For each test condition $x$ and each run $j$, the MOS is derived by averaging over the sampled subjects’ ratings.
$$\hat{Y}_{x,j} = \frac{1}{n} \sum_{u=1}^{n} Y_{u,x,j} \tag{20}$$
The CI estimator does not know the underlying distribution or the expected values $\mu_x$. We investigate the performance of the CI estimator with the following metrics.
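IV-B1 Coverage
For a certain confidence interval $(\theta_{\mathrm{lo}}^{x,j}, \theta_{\mathrm{up}}^{x,j})$ derived from the samples of all users for test condition $x$ in run $j$, we can check whether the expected value $\mu_x$ is contained in the confidence interval.
$$C_{x,j} = \begin{cases} 1 & \text{if } \theta_{\mathrm{lo}}^{x,j} \le \mu_x \le \theta_{\mathrm{up}}^{x,j} \\ 0 & \text{otherwise} \end{cases} \tag{21}$$
Then, the coverage of the CI estimator for test condition $x$ is the average over all simulation runs, i.e., the probability that the CI contains the expected value. The marginal distribution of $C_{x,j}$ for a fixed test condition $x$ gives the test condition perspective and will be defined accordingly for the CI width and the outlier ratio.
$$C_x = \frac{1}{r} \sum_{j=1}^{r} C_{x,j} \tag{22}$$
The marginal distribution of $C_{x,j}$ for a single QoE study, i.e., a fixed run $j$, gives the QoE study perspective.
$$C_j = \frac{1}{k} \sum_{x=1}^{k} C_{x,j} \tag{23}$$
Please note that the overall average over all studies and test conditions is obtained either by averaging over $C_x$ or over $C_j$.
$$C = \frac{1}{k} \sum_{x=1}^{k} C_x = \frac{1}{r} \sum_{j=1}^{r} C_j \tag{24}$$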
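IV-B2 Outlier Ratio
For test condition $x$ and study $j$, we estimate the probability that the confidence interval $(\theta_{\mathrm{lo}}^{x,j}, \theta_{\mathrm{up}}^{x,j})$ is outside the bounds of the rating scale $[1, K]$.
$$O_{x,j} = \begin{cases} 1 & \text{if } \theta_{\mathrm{lo}}^{x,j} < 1 \text{ or } \theta_{\mathrm{up}}^{x,j} > K \\ 0 & \text{otherwise} \end{cases} \tag{25}$$
Then, we define the outlier ratio from the test condition perspective and the QoE study perspective, respectively.
$$O_x = \frac{1}{r} \sum_{j=1}^{r} O_{x,j}, \qquad O_j = \frac{1}{k} \sum_{x=1}^{k} O_{x,j} \tag{26}$$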
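IV-B3 CI Width
Finally, the widths $W_x$ and $W_j$ of the confidence intervals are considered from the test condition perspective and the QoE study perspective, respectively. Thereby, the CI widths $W_{x,j} = \theta_{\mathrm{up}}^{x,j} - \theta_{\mathrm{lo}}^{x,j}$ are averaged over all runs and over all test conditions, respectively.
$$W_x = \frac{1}{r} \sum_{j=1}^{r} W_{x,j}, \qquad W_j = \frac{1}{k} \sum_{x=1}^{k} W_{x,j} \tag{27}$$
Please note that the average over $W_x$ and the average over $W_j$ are identical.
$$W = \frac{1}{k} \sum_{x=1}^{k} W_x = \frac{1}{r} \sum_{j=1}^{r} W_j \tag{28}$$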
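The three metrics can be computed from a simple simulation loop, sketched below for the binomial scenario and a Student’s t interval. The function names, the grid of expected values in `mus`, and the settings `n = 24` and `r = 200` are hypothetical choices for illustration and are not the settings used for the reported results.

```python
import numpy as np
from scipy import stats


def student_t_ci(y, alpha=0.05):
    """Student's t interval for the mean of the ratings y."""
    y = np.asarray(y, dtype=float)
    q = stats.t.ppf(1 - alpha / 2, df=y.size - 1)
    half = q * y.std(ddof=1) / np.sqrt(y.size)
    return y.mean() - half, y.mean() + half


def evaluate(ci_fn, mus, n=24, r=200, K=5, seed=0):
    """Coverage, outlier ratio, and CI width per test condition and per study."""
    rng = np.random.default_rng(seed)
    k = len(mus)
    covered = np.zeros((k, r))
    outlier = np.zeros((k, r))
    width = np.zeros((k, r))
    for x, mu in enumerate(mus):
        p = (mu - 1) / (K - 1)
        for j in range(r):
            y = 1 + rng.binomial(K - 1, p, size=n)  # binomial scenario
            lo, up = ci_fn(y)
            covered[x, j] = lo <= mu <= up  # CI contains the expected value
            outlier[x, j] = (lo < 1) or (up > K)  # CI leaves the rating scale
            width[x, j] = up - lo
    return {
        "coverage_per_tc": covered.mean(axis=1),  # average over runs
        "coverage_per_study": covered.mean(axis=0),  # average over TCs
        "outlier_per_tc": outlier.mean(axis=1),
        "outlier_per_study": outlier.mean(axis=0),
        "width_per_tc": width.mean(axis=1),
        "width_per_study": width.mean(axis=0),
    }


res = evaluate(student_t_ci, mus=np.linspace(1.25, 4.75, 8))
print("overall coverage:", res["coverage_per_tc"].mean())
print("overall outlier ratio:", res["outlier_per_tc"].mean())
print("overall CI width:", res["width_per_tc"].mean())
```

Any other estimator that returns a lower and an upper bound can be passed as `ci_fn`, which allows the estimators to be compared under identical sampled ratings when the same seed is used.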
IV-C Scenario with Binomially Distributed Ratings
Binomial  

norm.  0.92  0.08  0.55  0.01  0.83  0.08  0.68 
stud.  0.93  0.09  0.55  0.01  0.85  0.09  0.72 
sim.CI  0.96  0.08  0.55  0.00  0.92  0.13  0.87 
Wald  0.98  0.14  0.55  0.04  0.94  0.30  1.36 
CP  0.97  0.01  0.93  0.03  0.91  0.00  0.72 
Wils.  0.97  0.00  0.93  0.04  0.90  0.00  0.73 
Jeff.  0.95  0.00  0.92  0.04  0.89  0.00  0.68 
boot.  0.93  0.05  0.52  0.00  0.87  0.00  0.67 
Low. var.  
norm.  0.90  0.10  0.28  0.00  0.82  0.00  0.48 
stud.  0.91  0.09  0.28  0.00  0.83  0.00  0.51 
sim.CI  0.93  0.10  0.28  0.01  0.87  0.00  0.61 
Wald  1.00  0.00  1.00  0.00  1.00  0.00  1.67 
CP  1.00  0.23  0.98  0.14  0.99  0.00  0.87 
Wils.  1.00  0.24  0.98  0.16  0.99  0.00  0.87 
Jeff.  1.00  0.05  0.98  0.23  0.97  0.00  0.82 
boot.  0.91  0.11  0.28  0.01  0.83  0.00  0.47 
Figures 1 and 2 show the results for the binomial distribution scenario for the TC and QoE study perspective, respectively.
The boxplots show the median within the box. The bottom and top of the box are the first and third quartiles. The upper and lower ends of the whiskers denote the most extreme data points that lie within 1.5 times the interquartile range (IQR) of the upper and lower quartile, respectively. Data outside 1.5 IQR are marked as outliers with a dot.
An overview of the performance measures is provided in Table II. The numerical results from the binomial case show that Clopper-Pearson, Jeffreys, and bootstrapping perform well from both the test condition and the QoE study perspective. They have good coverage, do not suffer from outliers, and have small CI widths.
The proposed idea based on binomial proportion fails if the distribution has a higher variance than a binomial distribution. Then, the coverage is poor; the confidence intervals are too small, as only binomial variances are assumed, but in reality we have higher variances. This is however very rare in actual QoE studies. If the variances are higher, this is often an indicator for hidden influence factors in the test setup or some other issues [7].
IV-D Low Variance Scenario
We now consider only the QoE study perspective, which is provided in Figure 3. In case of low variances, the three identified estimators (Wilson, Clopper-Pearson, Jeffreys) still perform very well, and coverage is 100%. However, in that case, the CI width is larger than for the normal or Student’s t estimators. The reason for this is that the proposed estimators assume a binomial distribution (i.e., a much larger variance) and necessarily overestimate the CIs. For all estimators, the outlier ratio is zero. Still, the normal and Student’s t estimators have problems covering certain TCs at the edge of the scale.
Figure 4 considers the average CI width and coverage when varying the number of subjects $n$ in the study. The most efficient way to decrease the CI width is to increase $n$. It is worth noting that the binomial proportion estimators show almost constant coverage, in contrast to bootstrapping.
V Recommendations and Conclusions
Subjective QoE studies often involve a relatively small number of test participants. Moreover, the rating scales used are commonly discrete and bounded at both ends, with study results reported in the form of MOS values and CIs derived for various test conditions to quantify the significance of MOS values. Given the importance of using efficient CI estimators in the context of deriving QoE models, we evaluate several MOS CI estimators, and develop our own estimator based on binomial proportions. The numerical results indicate that the proposed idea based on binomial estimators is robust and conservative in practice. Wilson, Clopper-Pearson, and Jeffreys lead to comparable results, with excellent coverage and outlier properties. However, very good coverage comes at the cost of larger CI widths. The Wald interval performs poorly unless $n$ is quite large, which is not commonly the case in QoE studies. Standard confidence intervals based on the normal and Student’s t-distribution, as well as simultaneous CIs for multinomial distributions, suffer from the CIs exceeding the bounds of the rating scale. Bootstrapping has similar issues, i.e., some test conditions are not captured properly, but the outlier ratio is always zero due to sampling.
In summary, for QoE tests characterized by a small sample size and the use of discrete bounded rating scales, the proposed binomial estimators (Clopper-Pearson, Wilson, Jeffreys) are conservative, but exact and recommended. For decreasing the CI widths, bootstrapping or standard CIs may be used in case of low variance (i.e., a small SOS parameter $a$) at the cost of decreased coverage, but the most effective way is to increase the number of subjects. If the SOS parameter is larger than for a binomial distribution ($a > 1/(K-1)$), the results and test design should be checked, as there may be hidden influence factors in the study. An implementation of the CI estimators and the recommended estimators based on the SOS parameter is available on GitHub at https://github.com/hossfeld.
References
 [1] P. Le Callet, S. Möller, and A. Perkis, “Qualinet white paper on definitions of quality of experience,” European Network on Quality of Experience in Multimedia Systems and Services, Tech. Rep. Version 1.2, 2013.
 [2] T. Hoßfeld, P. E. Heegaard, M. Varela, and S. Möller, “QoE beyond the MOS: an in-depth look at QoE via better metrics and their relation to MOS,” Quality and User Experience, vol. 1, no. 1, p. 2, 2016.
 [3] ITU-T, “Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models,” International Telecommunication Union, Tech. Rep. P.1401, July 2012.
 [4] L. Janowski and M. Pinson, “Subject bias: Introducing a theoretical user model,” in Quality of Multimedia Experience (QoMEX), 2014 Sixth International Workshop on. IEEE, 2014, pp. 251–256.
 [5] ITU-T, “Methods for the subjective assessment of video quality, audio quality and audiovisual quality of internet video and distribution quality television in any environment,” International Telecommunication Union, Tech. Rep. P.913, March 2016.
 [6] S. Jin, Computing Exact Confidence Coefficients of Simultaneous Confidence Intervals for Multinomial Proportions and their Functions. Department of Statistics, Uppsala University, 2013.
 [7] T. Hoßfeld, R. Schatz, and S. Egger, “SOS: The MOS is not enough!” in Quality of Multimedia Experience (QoMEX), 2011 Third International Workshop on. IEEE, 2011, pp. 131–136.
 [8] A. M. Pires and C. Amado, “Interval estimators for a binomial proportion: Comparison of twenty methods,” REVSTAT–Statistical Journal, vol. 6, no. 2, pp. 165–197, 2008.
 [9] S. E. Vollset, “Confidence intervals for a binomial proportion,” Statistics in medicine, vol. 12, no. 9, pp. 809–824, 1993.
 [10] L. D. Brown, T. T. Cai, and A. DasGupta, “Interval estimation for a binomial proportion,” Statistical science, pp. 101–117, 2001.
 [11] R. G. Newcombe, Confidence intervals for proportions and related measures of effect size. CRC Press, 2012.
 [12] A. Agresti and B. A. Coull, “Approximate is better than “exact” for interval estimation of binomial proportions,” The American Statistician, vol. 52, no. 2, pp. 119–126, 1998.
 [13] C. J. Clopper and E. S. Pearson, “The use of confidence or fiducial limits illustrated in the case of the binomial,” Biometrika, pp. 404–413, 1934.
 [14] H. Jeffreys, The theory of probability. OUP Oxford, 1998.
 [15] B. Efron, “Bootstrap methods: another look at the jackknife,” in Breakthroughs in statistics. Springer, 1992, pp. 569–593.