1 Introduction
Subjective experiment methodologies to evaluate the perceptual audiovisual quality of multimedia and television services have been well studied. Recommendations such as ITUR BT.500 [1], ITUT P.910 [2] and ITUT P.913 [3] standardize the procedures to conduct subjective experiments and postprocess raw opinion scores to yield the final mean opinion scores (MOS). To account for the inherently noisy and often unreliable nature of test subjects, the recommendations have included corrective mechanisms such as subject rejection (BT.500), subject bias removal (P.913), and criteria for establishing the confidence intervals of the MOS (BT.500, P.910 and P.913). The standardized procedures are not without their own limitations. For example, in BT.500, if a subject is deemed an outlier, all opinion scores for that subject are discarded, which could be excessive. The BT.500 procedure also incorporates a number of hard coded parameters and thresholds, which may not be suitable for all conditions and subjective tests.
As an alternative, we propose a simple model to account for two of the most dominant behaviors of test subject inaccuracy: bias and inconsistency. In addition, this model can effectively deal with inattentive subject outliers that give random scores. Compared to BT.500style subject rejection, the proposed model can be thought as performing “soft” subject rejection as it explicitly models subject outliers as having large inconsistencies, and thus is able to diminish the effect from the estimated true quality scores. To solve for the model parameters, we propose to jointly optimize its likelihood function, also known as maximum likelihood estimation (MLE) [4]. We present two numeric solvers: 1) a NewtonRaphson (NR) solver [5], and 2) an Alternating Projection (AP) solver. We further show that the AP solver can be considered as a weighted and iterative generalization of the subject bias removal procedure in P.913. The AP solver also has the advantage of having no hard coded parameters and thresholds.
One of the challenges is to fairly compare the proposed methods to its alternatives. We evaluate the proposed simple model and its numerical solvers separately. To evaluate the model’s fit to real datasets, we use Bayesian Information Criterion (BIC) [6], where the winner can be characterized as having a good fit to data while maintaining a small number of parameters. We also compare the confidence intervals of the estimated quality scores, where a tighter confidence interval implies a higher confidence in the estimation. To evaluate the model’s robustness against subject outliers, we perform a simulation study on how the true quality score’s root mean squared error (RMSE) changes compared to the clean case as the number of outliers increases. Lastly, to validate that the numerical solvers are indeed accurate, we use synthetic data to compare the recovered parameters against the ground truth.
The rest of the paper is organized as follows. Section Prior Art and Standards discusses prior art and standards. The proposed model is presented in Section Proposed Model, followed by the two numerical solvers in Section Proposed Solvers, and the calculation of confidence intervals in Section Confidence Interval. Section Experimental Results presents the experimental results.
The source code of this work is opensourced on Github [7].
2 Prior Art and Standards
Raw opinion scores collected from subjective experiments are known to be influenced by the inherently noisy and unreliable nature of human test subjects [8]. To compensate for the influence of individuals, a common practice is to average the raw opinion scores from multiple subjects, yielding a MOS per stimulus. Standardized recommendations incorporate more advanced corrective mechanisms to further compensate for test subjects’ influence, and criteria for establishing the confidence intervals of MOS.

ITUR BT.500 Recommendation [1]
defines methodologies including doublestimulus impairment scale (DSIS) and doublestimulus continuous quality scale (DSCQS), and a corresponding procedure for subject rejection (ITUR BT.50014 Section A12.3.1) prior to the calculation of MOS. Video by video, the procedure counts the number of instances when a subject’s opinion score deviates by a few sigmas (i.e. standard deviation), and rejects the subject if the occurrences are more than a fraction. All scores corresponding to the rejected subjects are discarded, which could be considered excessive. On the other hand, our experiment shows that, in the presence of many outlier subjects, the procedure is only able to identify a portion of them. Another drawback of this approach is that it incorporates a number of hard coded parameters and thresholds to determine the outliers, which may not be suitable for all conditions. It also establishes the corresponding way of calculating the confidence interval (ITUR BT.50014 Section A12.2.1).

ITUT P.910 Recommendation [2] defines methodologies including absolute category rating (ACR), degradation category rating DCR (equivalent to DSIS), absolute category rating with hidden reference (ACRHR) and the corresponding differential MOS (DMOS) calculation, and recommends using the BT.500 subject rejection and confidence interval calculation procedure in conjunction.

ITUT P.913 Recommendation [3] defines a procedure to remove subject bias (ITUT P.913 Section 12.4) before carrying out other steps. It first finds the mean score per stimulus, and subtract it from the raw opinion scores to get the residual scores. Then it averages the residue scores on a persubject basis to yield an estimate of each subject’s bias. The subject bias is then removed from the raw opinion scores. For P.913 to possess resistance to subject outliers, it needs to be combined with BT.500style subject rejection. Yet, by doing so, it poses similar weaknesses as BT.500.
For completeness, below we give mathematical descriptions of the subject rejection method standardized in ITUR BT.50014 and the subject bias removal method in ITUT P.913. Let be the opinion score voted by subject on stimulus in repetition . Note that, in BT.50014, the notation is used to indicate test condition and is used to indicate sequence/image; in this paper, the test condition and sequence/image are combined and collectively represented by the stimulus notation . Let denote the mean value over scores for stimulus and for repetition , i.e. . Similarly, denotes the
th order central moment over scores for stimulus
and repetition , i.e. . Lastly, denotes the sample standard deviation for stimulus and repetition , i.e. . In the previous, the term indicates the number of observers that have offered an opinion score for a given stimulus/repetition, . This number of observers could be the same, , or different per stimulus, if a subjective experiment has been designed in such a way. The subject rejection procedure in ITUR BT.50014 Section A12.3 can be summarized in Algorithm 1.ITUT P.913 does not consider repetitions, so the notation denotes the opinion score voted by subject on stimulus . The subject bias removal procedure in ITUT P.913 Section 12.4 can be summarized in Algorithm 2.
3 Proposed Model
We propose a simple yet effective model to account for two of the most dominant effects of test subject inaccuracy: subject bias and subject inconsistency. We further show that this model can effectively deal with inattentive subjects that give random scores, without invoking explicit subject rejection. The proposed model is a simplified version of [9] without considering the ambiguity of video content. Compared to the previously proposed model, the solutions to the simplified model are more efficient and stable.
We assume that each opinion score
can be represented by a random variable as follows:
(1) 
where is the true quality of stimulus , represents the bias of subject , the nonnegative term represents the inconsistency of subject , and are i.i.d. Gaussian random variables. The index represents repetitions.
It is important to point out that a subject with erroneous behaviors can be modeled by a large inconsistency value . The erroneous behaviors that can be modeled include but are not limited to: subject giving random scores, subject being absentminded for a portion of a session, or software issue that randomly shuffles a subject’s scores among multiple stimuli. By successfully estimating and accounting its effect to calculating the true quality score, we can compensate for subject outliers without invoking BT.500style subject rejection.
Given a collection of opinion scores from a subjective experiment, the task is to solve for the free parameters , such that the model fits the observed scores the best. This can be formulated as a maximum likelihood estimation (MLE) problem. Let the loglikelihood function be
i.e. a monotonic measure of the probability of observing the given raw scores, for a set of these parameters. We can solve the model by finding
that maximizes , or . This problem can be numerically solved by the proposed NewtonRaphson method or the Alternating Projection method, to be discussed in Section Proposed Solvers.It is important to notice that the recoverability of and in (1) is up to a constant shift. Formally, assume is a solution that maximizes , one can easily show that where , is another solution that achieves the same maximum likelihood value . This implies that the optimal solution is not unique. In practice, we can enforce a unique solution, by adding a constraint that forces the mean subject bias to be zero, or
This intuitively makes sense, since bias is relative  saying everyone is positively biased is equivalent to saying that no one is positively biased. It is also equivalent to assuming that the sample of observers that offer opinion scores in a subjective experiment are truly random and do not consist of only “expert” viewers or “lazy” viewers that tend to offer lower or higher opinion scores, as a whole. There is always the possibility, once a subjective test establishes that the population from where subjects were recruited have such a collective bias, to change the condition and thus properly estimate what the true “typical” observer, drawn from a more representative pool that would vote.
Lastly, one should keep in mind that it is always possible to use more complicated models than (1) to capture other effects in a subjective experiment. For example, [9] considers content ambiguity, and [10, 11] considers perstimulus ambiguity. There are also environmentrelated factors that could induce biases. Additionally, the votes are influenced by the voting scales chosen, for example, continuous vs. discrete [12]. Our hope is that the proposed model strikes a good balance between the model complexity and explanatory power. In Section ModelData Fit, we show that the proposed model yields better modeldata fit than BT.500 and P.913 used today.
4 Proposed Solvers
Let us start by simplifying the form of the loglikelihood function . We can write:
(3) 
where (4) uses the independence assumption on opinion scores, is the Gaussian density function with mean and standard deviation , and denotes equal with omission of constant terms.
Note that not every subject needs to vote on each stimulus in every repetition. Our proposed solvers can effectively deal with subjective tests with incomplete data where some observations are missing. Denote by the missing observations in an experiment. All summations in this paper are ignoring the missing observations , that is, is equivalent to , and so on.
4.1 NewtonRaphson (NR) Solver
With (3), the first and secondorder partial derivatives of can be derived (see the first Appendix). We can apply the NewtonRaphson rule [5] to update each parameter in iterations. We further use a refresh rate parameter to control the speed of innovation and to avoid overshooting. Note that other update rules can be applied, but using the NewtonRaphson rule yields nice interpretability.
Also note that the NR solver finds a local optimal solution when the problem is nonconvex. It is important to initialize the parameters properly. We choose zeros as the initial values for , the mean score for , and the residue standard deviation for , where is the “residue”, , and . The NR solver is summarized in Algorithm 3. A good choice of refresh rate and stop threshold are and , respectively, but varying these parameters would not significantly change the result.
The “new” parameters can be simplified to the following form:
(4)  
(5)  
Note that there are strong intuitions behind the expressions for the newly estimated true quality and subject bias . In each iteration, is reestimated, as the weighted mean of the opinion scores with the currently estimated subject bias removed. Each opinion score is weighted by the “subject consistency” , i.e., the higher the inconsistency for subject , the less reliable the opinion score, hence less the weight. For the subject bias , it is simply the average shift between subject ’s opinion scores and the true values.
4.2 Alternating Projection (AP) Solver
This solver is called “alternating projection” because in a loop, we alternate between projecting (or averaging) the opinion scores along the subject dimension and the stimulus dimension. To start, we initialize to , where , same as the NR solver. The subject bias is initialized differently to , where is the average shift between subject ’s opinion scores and the true values. Note that the calculation of and matches precisely to the ones in Algorithm 2 (ITUT P.913). Within the loop, first, the “residue” is updated, followed by the calculation of the subject inconsistency as the residue’s standard deviation per subject, or , where . Then, the true quality and the subject bias are reestimated, by averaging the opinion scores along either the subject dimension or the stimulus dimension . The projection formula precisely matches equations (4) and (5) of the NewtonRaphson method. The AP solver is summarized in Algorithm 4. A good choice of the stop threshold is .
In sum, the AP solver can be considered as a generalization of P.913 Section 12.4 in the following sense: first, the AP solver is iterative until convergence whereas P.913 only goes through the initialization steps; second, in the AP solver the reestimation of quality score is weighted by the subject consistency
whereas in P.913, the reestimation is unweighted. Please note that weighting multiple random variables by the inverse of their variance is the minimum error parameter estimation, as can be trivially proven through Lagrange multipliers.
5 Confidence Interval
The estimate of each model parameter , , is associated with a confidence interval. Using the CramerRao bound [13], the asymptotic confidence intervals for the mean term and have the form , where their secondorder derivatives can be found in the first Appendix. The confidence interval for the standard deviation term has the form , where
is the percent point function (the inverse of CDF) of a chisquare distribution with
degrees of freedom. After simplification, the confidence intervals for , and are:(6)  
where is the number of samples that subject has viewed.
There is one fact worth mentioning. Recall that is equivalent to , where represents missing observation. If there is no missing observation, that is, the subjective test has complete data, then the lengths of the confidence intervals for , are the same, equal to (since it is independent of the subscript ). This is very different from the confidence intervals estimated from a conventional approach (for example, plain MOS, or BT.500), where each stimulus has a different confidence interval length (see the second Appendix for a MLE interpretation of the plain MOS). This phenomenon can be explained by the fact that all the true quality parameters , are estimated jointly, yielding identical certainty for all the estimated parameters.
6 Experimental Results
We compare the proposed method (the proposed model and its two numerical solvers) with the prior art BT.500 and P.913 recommendations. For P.913, after subject bias removal, we assume that a BT.500style subject rejection is carried out, before calculating the MOS and the corresponding confidence intervals. We first illustrate the proposed model by giving visual examples on two datasets: VQEG HD3 dataset [14] (which is the compressiononly subset of the larger HDTV Ph1 Exp3 dataset) and the NFLX Public dataset [15]. We then validate the modeldata fit using the Bayesian Information Criterion (BIC) on 22 datasets, including 20 datasets as part of a different larger experiments: VQEG HDTV Phase I [14]; ITS4S [16]; AGH/NTIA [10, 17]; MM2 [18]; ITUT Supp23 Exp1 [19]; and ITS4S2 [20]. We also evaluate the confidence intervals on the estimated quality scores on these 22 datasets. Next, we demonstrate that the proposed model is much more effective in dealing with outlier subjects. We then use synthetic data to validate the accuracy of the numerical solvers and the confidence interval calculation. Lastly, we compare the runtime of the various schemes.
6.1 Visual Examples
First, we demonstrate the proposed method on the VQEG HD3 and the NFLX Public datasets. Refer to Figure 1 for a visualization of the raw opinion scores. The 44th video of the VQEG HD3 dataset has a quality issue that all its scores are low. The NFLX Public dataset includes four subjects whose raw scores were shuffled due to a software issue during data collection.
Figure 2 shows the recovered quality scores of the four methods compared. The quality scores recovered by the two proposed methods are numerically different from the ones from BT.500 and P.913, suggesting that the recovery is nontrivial. The average confidence intervals by the proposed methods are generally tighter, compared to the ones from BT.500 and P.913, suggesting that the estimation has higher confidence. The NBIC scores, to be discussed in detail in Section ModelData Fit, represent how well the model fits the data. It can be observed that the proposed model fits the data better than BT.500 and P.913.
Figure 3 shows the recovered subject bias and subject inconsistency by the methods compared. On the VQEG HD3 dataset, it can be seen that the 20th subject has the most positive bias, which is evidenced by the whitish horizontal strip visible in Figure 3 (a). On the NFLX Public dataset, the last four subjects, whose raw scores are scrambled, have very high subject inconsistency values. Correspondingly, their estimated biases have very loose confidence intervals. This illustrates that the proposed model is effective in modeling outlier subjects. On the contrary, among the four outlier subjects, both BT.500 and P.910 fail to reject the 28th subject.
The subject bias and inconsistency revealed through the recovery process could be valuable information for subject screening. Unlike BT.500, which makes a binary decision on if a subject is accepted/rejected, the proposed approach characterizes a subject’s inaccuracy in two dimensions, along with their confidence intervals, allowing further interpretation and study. How to use the bias and inconsistency information to better screen subjects remains our future work.
6.2 ModelData Fit
Bayesian Information Criterion [6] is a criterion for modeldata fit. When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. BIC attempts to balance between the degree of freedom (characterized by the number of free parameters) and the goodness of fit (characterized by the loglikelihood function). Formally, BIC is defined as , where is the total number of observations (i.e. the number of opinion scores), is the number of model parameters, and is the loglikelihood function. One can interpret that the lower the free parameter numbers , and the higher the loglikelihood , the lower the BIC, and hence the better fit. In this work, we adopt the notion of a “normalized BIC”, defined as the BIC divided by the number of observations, or:
as the model fit criterion, for easier comparison across datasets.
Table NBIC shows the NBIC reported on the compared methods on the 22 public datasets. The MOS method is the plain MOS without subject rejection or subject bias removal. for MOS and BT.500 is , where is the number of stimuli (refer to the second Appendix for a MLE interpretation of the plain MOS). For P.913, is equal to , where is the number of subjects (due to the subject bias term). For the calculation of the loglikelihood function, notice that if subject rejection is applied, only the opinion scores after rejection are taken into account. The result in Table NBIC shows that the proposed two solvers yield better modeldata fit than the plain MOS, BT.500 and P.913 approaches.
Dataset  MOS  BT.500  P.913  NR/AP 

VQEG HD3  2.75  2.74  2.39  2.30 
NFLX Public  2.97  2.57  2.55  2.52 
HDTV Ph1 Exp1  2.45  2.46  2.38  2.20 
HDTV Ph1 Exp2  2.72  2.72  2.52  2.32 
HDTV Ph1 Exp3  2.72  2.71  2.37  2.29 
HDTV Ph1 Exp4  2.96  2.96  2.51  2.27 
HDTV Ph1 Exp5  2.77  2.77  2.47  2.33 
HDTV Ph1 Exp6  2.51  2.49  2.32  2.16 
ITUT Supp23 Exp1  2.91  2.91  2.35  2.31 
MM2 1  2.80  2.78  2.83  2.74 
MM2 2  3.89  3.89  3.52  3.13 
MM2 3  2.48  2.47  2.45  2.41 
MM2 4  2.74  2.73  2.62  2.47 
MM2 5  2.90  2.82  2.67  2.64 
MM2 6  2.81  2.74  2.74  2.72 
MM2 7  2.73  2.72  2.76  2.67 
MM2 8  3.00  2.92  2.88  2.70 
MM2 9  3.27  3.21  2.95  2.79 
MM2 10  3.04  3.05  2.98  2.82 
its4s2  3.63  3.63  2.96  2.59 
its4s AGH  3.15  3.05  2.77  2.64 
its4s NTIA  2.94  2.91  2.53  2.38 
6.3 Confidence Interval of Quality Scores
Table CI shows the average length of the confidence intervals on the compared methods on the 22 public datasets. The smaller the number, the tighter the confidence interval, thus more confident the estimation is. For MOS, BT.500 and P.913, the confidence intervals are calculated based on (8). For BT.500 and P.913, only the opinions scores after rejection are taken into account. For the proposed methods, the confidence intervals are calculated based on (6). It can be observed that the proposed two methods yield tighter confidence intervals compared to the other methods. For some databases BT.500 generates wider confidence interval than the plain MOS. This phenomena can be explained by the fact that subject rejection decreases the number of samples, even though the variance may also be decreased. Overall, the obtained confidence interval can be either narrower or wider.
Dataset  MOS  BT.500  P.913  NR/AP 

VQEG HD3  0.59  0.60  0.49  0.46 
NFLX Public  0.62  0.54  0.5  0.44 
HDTV Ph1 Exp1  0.50  0.61  0.48  0.46 
HDTV Ph1 Exp2  0.57  0.57  0.53  0.48 
HDTV Ph1 Exp3  0.56  0.59  0.52  0.48 
HDTV Ph1 Exp4  0.63  0.63  0.52  0.47 
HDTV Ph1 Exp5  0.57  0.57  0.53  0.49 
HDTV Ph1 Exp6  0.50  0.51  0.48  0.45 
ITUT Supp23 Exp1  0.61  0.61  0.56  0.47 
MM2 1  0.59  0.60  0.57  0.53 
MM2 2  1.21  1.21  1.12  0.88 
MM2 3  0.47  0.48  0.45  0.42 
MM2 4  0.58  0.59  0.54  0.48 
MM2 5  0.63  0.65  0.58  0.52 
MM2 6  0.62  0.70  0.59  0.56 
MM2 7  0.60  0.61  0.57  0.55 
MM2 8  0.76  0.76  0.71  0.66 
MM2 9  0.84  0.85  0.74  0.68 
MM2 10  0.77  0.83  0.73  0.70 
its4s2  0.82  0.82  0.66  0.60 
its4s AGH  0.68  0.68  0.61  0.56 
its4s NTIA  0.57  0.58  0.54  0.48 
6.4 Robustness against Outlier Subjects
We demonstrate that the proposed method is much more effective in dealing with (corrupted) outlier subjects compared to other methods. We use the following methodology in our reporting of results. For each method compared, we have a benchmark result, which is the recovered quality scores obtained using that method  for fairness  on an unaltered full dataset (note that for the NFLX Public dataset, unlike the one used in Figure 1, 2 and 3, we start with the dataset where the corruption on the last four subjects has been corrected). We then consider that a number of the subjects are “corrupted” and simulate it by randomly shuffling each corrupted subject’s votes among the video stimuli. We then run each method compared on the partially corrupted datasets. The quality scores recovered are normalized by subtracting the mean and dividing by the standard deviation of the scores of the unaltered dataset. The normalized scores are compared against the benchmark, and a rootmeansquarederror (RMSE) value is reported.
Figure 4 reports the results on the two datasets, comparing the proposed method with plain MOS, BT.500, P.913 and the proposed AP solver, as the number of corrupted subjects increases. It can be observed that in the presence of subject corruption, The proposed method achieves a substantial gain over the other methods. The reason is that the proposed model was able to capture the variance of subjects explicitly and is able to compensate for it. On the other hand, the other methods are only able to identify part of the corrupted subjects. Meanwhile, traditional subject rejection employs a set of hard coded parameters to determine outliers, which may not be suited for all conditions. By contrast, the proposed model naturally integrates the various subjective effects together and is solved efficiently by the MLE formulation.
Figure 5 reports the results as we increase the probability of corruption from 0 to 1 while fixing the number of corrupted subjects to 10. It can be seen that as the corruption probability increases, the RMSE increases linearly/nearlinearly for other methods, while the RMSE increases much slower for the proposed method, and it saturates at a constant value without increasing further. A simplified explanation is that, since only a subset of a subject’s scores is unreliable, discarding all of the subject’s scores is a waste of valuable subjective data, while the proposed method can effectively avoid that.
6.5 Validation of Solvers and Confidence Interval Calculation
Next, we demonstrate that the NR and AP solvers can accurately recover the parameters of the proposed model. This is shown using synthetic data, where the ground truth of the model parameters are known. In this section, we considered only the NFLX Public dataset for simulations. The random samples are generated using the following methodology. For each proposed solver, we take the NFLX Public dataset and run the solver to estimate the parameters. The parameters estimated from a real dataset allow us to run simulations with practical settings. We then treat the estimated parameters as the “synthetic” parameters, run simulations to generate synthetic samples according to the model (1). Subsequently, we run the solver again on the synthetic data to yield the “recovered” parameters.
Figure 6 shows the scatter plots of the synthetic vs. recovered parameters, for the true quality , subjective bias and subject inconsistency terms. It can be observed that the solvers recover the parameters reasonably well. We have to keep in mind that the synthetic data, differently from usual subjective scores of category rating, are continuous. For discrete data, some specific problems would influence the obtained results as described in [12]. Since those problems are not the main topic of this paper we do not go into more details and leave it as a future topic of research.
Figure 6(a) also shows the recovery result of the BT.500 and P.913. It is noticeable that the recovered subject biases by the AP method and the P.913 subject bias removal are very similar. This should not be surprising, considering that the AP method can be treated as a weighted and iterative generalization of the P.913 method.
Also plotted in Figure 6(b) are the confidence intervals of the recovered parameters. The reported “CI%” is the percentage of occurrences where the synthetic ground truth falls within the confidence interval. By definition, we expect the CI% to be 95% on average. To verify this, we run the same simulation on the 22 public datasets. For each dataset, the simulation is run 100 times with different seeds. The result is shown in Table CI%. We compare the proposed NR and AP methods with the plain MOS. It can be seen that all methods yield CI% to be very close to 95%, but slightly below. The explanation is that both have assumed that the underlying distribution is Gaussian, but with both the mean and standard deviation unknown, one should use a Student’s tdistribution instead. If the tdistribution is used, the coefficient can no longer be a fixed value 1.96 but is a function of the number of subjects and repetitions.
For the NR and AP methods, there are occasional cases where the CI% is significantly lower, for example, and for MM2 2 dataset. This is the case where the stimuli and/or subject dimensions are small, yielding nonGaussian behavior (recall that the confidence interval calculated is asymptotic). In practice, we can introduce a correction term to compensate for the nonGaussianity.
Dataset  MOS  NR  AP  

VQEG HD3  93.3  93.6  93.9  93.0  93.2  94.4  91.9 
NFLX Public  94.2  93.7  94.5  93.1  93.5  94.1  92.3 
HDTV Ph1 Exp1  93.9  94.1  93.9  93.1  93.8  94.2  91.3 
HDTV Ph1 Exp2  93.8  94.0  94.5  92.5  93.8  94.0  91.2 
HDTV Ph1 Exp3  93.9  93.9  94.4  92.5  93.7  94.1  90.6 
HDTV Ph1 Exp4  93.8  94.0  94.3  91.9  93.8  94.1  90.9 
HDTV Ph1 Exp5  93.8  94.1  94.2  92.2  93.9  94.2  90.9 
HDTV Ph1 Exp6  93.8  94.0  94.4  92.6  93.9  94.0  91.0 
ITUT Supp23 Exp1  93.8  94.0  94.4  91.2  93.8  94.9  90.0 
MM2 1  93.5  92.8  95.4  92.6  92.5  94.0  91.6 
MM2 2  92.1  81.5  92.9  80.0  68.1  92.1  75.4 
MM2 3  94.4  93.6  95.1  93.4  93.4  94.2  92.0 
MM2 4  93.2  93.6  95.6  93.0  93.2  95.1  92.0 
MM2 5  93.2  93.2  95.7  92.7  91.8  95.3  91.4 
MM2 6  93.6  93.3  95.2  92.8  93.0  94.1  91.4 
MM2 7  93.6  93.3  95.2  92.8  92.9  94.2  91.9 
MM2 8  93.0  92.4  95.4  88.8  92.2  94.5  87.0 
MM2 9  93.2  93.3  94.8  89.1  92.8  94.2  88.1 
MM2 10  93.2  93.1  95.7  89.7  92.8  94.5  87.9 
its4s2  93.1  94.1  94.6  60.6  94.1  94.2  59.2 
its4s AGH  93.6  94.0  94.4  90.4  94.0  94.4  89.7 
its4s NTIA  93.9  94.4  94.7  86.1  94.3  95.1  85.6 
6.6 Runtime and Iterations
Lastly, we evaluate the runtime of the proposed NR and AP methods compared to the others. The experiment was performed on a MacBook Pro (15inch, 2018) with 2.9 GHz Intel Core i9 with 32 GB 2400 MHz DDR4 memory, macOS version 10.14.6. The schemes compared are implemented in Python, and are opensource on Github [7]. The results of 100 simulations runs (based on the similar methodology as in the previous sections) of each methods are reported in Table Runtime. The results reveal the order of magnitude of the algorithms compared. The plain MOS is typically the fastest, while the BT.500 and P.913 are two magnitude slower. The NR and AP algorithms are three and one magnitude slower, respectively. Notably, the AP runs faster than BT.500 and P.913, and is about 50x faster than the NR. The AP also requires about half the number of iterations to reach convergence than the NR.
Dataset  Mean Runtime (seconds)  No. Iterations  

MOS  BT.500  P.913  NR  AP  NR  AP  
VQEG HD3  5.2e4  1.5e2  1.5e2  2.1e1  4.3e3  26.2  12.1 
NFLX Public  5.7e4  1.8e2  1.9e2  2.8e1  4.5e3  34.5  11.8 
HDTV Ph1 Exp1  7.7e4  3.3e2  3.4e2  2.0e1  4.6e3  23.4  10.3 
HDTV Ph1 Exp2  7.8e4  3.3e2  3.4e2  2.8e1  4.9e3  33.2  11.3 
HDTV Ph1 Exp3  7.8e4  3.3e2  3.4e2  2.5e1  4.7e3  29.4  10.7 
HDTV Ph1 Exp4  7.6e4  3.3e2  3.4e2  3.3e1  5.0e3  38.3  11.5 
HDTV Ph1 Exp5  7.8e4  3.3e2  3.4e2  2.7e1  4.7e3  31.3  10.8 
HDTV Ph1 Exp6  7.6e4  3.3e2  3.4e2  2.2e1  4.6e3  25.8  10.7 
ITUT Supp23 Exp1  8.1e4  3.5e2  3.5e2  3.4e1  5.0e3  36.0  11.6 
MM2 1  4.9e4  1.3e2  1.3e2  2.1e1  4.3e3  27.4  12.4 
MM2 2  4.0e4  1.0e2  1.1e2  5.8e1  1.4e2  78.0  54.9 
MM2 3  5.3e4  1.3e2  1.4e2  1.8e1  4.2e3  23.3  11.6 
MM2 4  5.0e4  1.3e2  1.4e2  2.6e1  4.6e3  33.4  13.8 
MM2 5  5.0e4  1.3e2  1.4e2  2.9e1  6.0e3  37.3  19.3 
MM2 6  4.8e4  1.2e2  1.3e2  2.2e1  4.3e3  28.8  13.1 
MM2 7  4.8e4  1.2e2  1.3e2  2.0e1  4.2e3  25.6  12.3 
MM2 8  4.3e4  1.1e2  1.1e2  2.7e1  5.5e3  35.3  18.7 
MM2 9  4.3e4  1.1e2  1.2e2  2.8e1  5.1e3  36.5  16.8 
MM2 10  4.3e4  1.1e2  1.2e2  2.3e1  4.8e3  29.8  15.4 
its4s2  3.3e3  2.5e1  2.5e1  1.1e+0  1.3e2  49.8  13.3 
its4s AGH  8.7e4  4.1e2  4.2e2  3.5e1  5.3e3  39.4  11.6 
its4s NTIA  2.6e3  1.6e1  1.6e1  6.4e1  1.1e2  46.2  11.3 
7 Conclusions
In the paper, we proposed a simple model to account for two of the most dominant effects of test subject inaccuracy: subject bias and subject inconsistency. We further proposed to solve the model parameters through maximum likelihood estimation and presented two numerical solvers. We compared the proposed methodology with the standardized recommendations including ITUR BT.500 and ITUT P.913, and showed that the proposed methods have advantages in: 1) better modeldata fit, 2) tighter confidence intervals, 3) better robustness against subject outliers, 4) shorter runtime, 5) absence of hard coded parameters and thresholds, and 6) auxiliary information on test subjects. We believe the proposed methodology is generally suitable for subjective evaluation of perceptual audiovisual quality in multimedia and television services, and we propose to update the corresponding recommendations with the methods presented.
References
 [1] ITUR BT.50014 (10/2019): Methodologies for the Subjective Assessment of the Quality of Television Images. https://www.itu.int/rec/RRECBT.500.
 [2] ITUT P.910 (04/08): Subjective Video Quality Assessment Methods for Multimedia Applications. https://www.itu.int/rec/TRECP.910.
 [3] ITUT P.913 (03/2016): Methods for the Subjective Assessment of Video Quality, Audio Quality and Audiovisual Quality of Internet Video and Distribution Quality Television in Any Environment. https://www.itu.int/rec/TRECP.913.
 [4] Maximum Likelihood Estimation (Wikipedia). https://en.wikipedia.org/wiki/Maximum_likelihood_estimation. [Online; accessed 30March2020].
 [5] Newton’s Method in Optimization (Wikipedia). https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization. [Online; accessed 30March2020].
 [6] Bayesian Information Criterion (Wikipedia). https://en.wikipedia.org/wiki/Bayesian_information_criterion. [Online; accessed 30March2020].
 [7] SUREAL  Subjective Recovery Analysis. https://github.com/Netflix/sureal. [Online; accessed 30March2020].
 [8] Tobias Hoßfeld, Raimund Schatz, and Sebastian Egger. SOS: The MOS is not enough! 2011 3rd International Workshop on Quality of Multimedia Experience, QoMEX 2011, pages 131–136, 2011.
 [9] Zhi Li and Christos G. Bampis. Recover Subjective Quality Scores from Noisy Measurements. https://arxiv.org/abs/1611.01715. in Proceedings of the Data Compression Conference, 2017.
 [10] L. Janowski and M. Pinson. Subject bias: Introducing a theoretical user model. In 2014 Sixth International Workshop on Quality of Multimedia Experience (QoMEX), pages 251–256, Sep. 2014.
 [11] L. Janowski and M. Pinson. The accuracy of subjects in a quality experiment: A theoretical subject model. IEEE Transactions on Multimedia, 17(12):2210–2224, Dec 2015.
 [12] Lucjan Janowski, Bogdan Ćmiel, Krzysztof Rusek, Jakub Nawała, and Zhi Li. Generalized score distribution. https://arxiv.org/abs/1909.04369. in arXiv:1909.04369 [stat.ME], 2019.
 [13] T.M. Cover and J.A. Thomas. Elements of Information Theory. A WileyInterscience publication. Wiley, 2006.
 [14] Margaret Pinson, Filippo Speranza, M Barkowski, V Baroncini, R Bitto, S Borer, Y Dhondt, R Green, L Janowski, T Kawano, et al. Report on the validation of video quality models for high definition video content. Video Quality Experts Group, 2010.
 [15] Netflix Public Dataset. https://github.com/Netflix/vmaf/blob/master/resource/doc/datasets.md. [Online; accessed 30March2020].
 [16] Margaret H. Pinson. Its4s: A video quality dataset with foursecond unrepeated scenes. Technical Report NTIA Technical Memo TM18532, NTIA/ITS, Feb. 2018.
 [17] Margaret H. Pinson and L. Janowski. Agh/ntia: A video quality subjective test with repeated sequences. Technical Report NTIA Technical Memo TM14505, NTIA/ITS, June 2014.
 [18] M. H. Pinson, L. Janowski, R. Pepion, Q. HuynhThu, C. Schmidmer, P. Corriveau, A. Younkin, P. Le Callet, M. Barkowsky, and W. Ingram. The influence of subjects and environment on audiovisual subjective tests: An international study. IEEE Journal of Selected Topics in Signal Processing, 6(6):640–651, Oct 2012.
 [19] ITUT PSeries. ITUT codedspeech database. Technical Report Series P: Telephone Transmission Quality, Telephone Installations, Local Line Networks, ITUT, Feb. 1998.
 [20] Margaret H. Pinson. ITS4S2: An image quality dataset with unrepeated images from consumer cameras,. Technical Report NTIA Technical Memo TM19537, NTIA/ITS, Apr. 2019.
8 Appendix: First and SecondOrder Partial Derivatives of
We can derive the firstorder and secondorder partial derivatives of with respect to , and as:
9 Appendix: An MLE Interpretation of the Plain MOS
The plain MOS and its confidence interval can be interpreted using the notion of maximum likelihood estimation. Consider the model:
(7) 
where is the opinion score, is the true quality of stimulus and is the “ambiguity” of . is i.i.d. Gaussian. Note that this is different from the proposed model (1) where is associated with the subjects, not the stimuli. We can define the loglikelihood function for this model as , and solve for and that maximize the loglikelihood function, as follows:
The secondorder partial derivative w.r.t. to is . The 95% confidence interval of is then:
(8) 
One minor difference between (8) and the 95% confidence interval formula in BT.50014 Section A12.2.1 is that the former uses differential degrees of freedom 0 and the latter uses 1 for the sample standard deviation calculation. In fact, neither one is fully precise. In the most precise way to calculate the confidence interval, one should use a Student’s tdistrubiton with a differential degrees of freedom 1 (see Section Validation of Solvers and Confidence Interval Calculation and Table CI% for more discussions).