Industry depends on video, image, and audio quality assessments when making important business decisions. Examples include optimizing video encoders, choosing video transmission bandwidths, fine tuning networks, and agreeing upon standard speech codecs for future cellular networks. The traditional options are to either conduct a subjective test or choose a well-established objective metric.
Both options have advantages and limitations. Subjective tests are accurate but time-consuming. Objective metrics are inaccurate but fast. Moreover, metrics quickly become unreliable when applied to new technologies or scenarios outside of their intended scope. For example, a metric designed for broadcast video applications may yield random ratings when given user generated content. Neither option meets the needs of a company to quickly assess new technologies during the research and development process.
This paper proposes a compromise solution: a subjective test where four to six team members rate each stimulus multiple times. This would reduce costs associated with recruiting and handling subjects, shifting the work onto team members who are obliged to assist. Obvious quality impairments are easily spotted by all subjects (particularly experts), and novel technologies are not problematic. This compromise would provide a viable third option, as long as the accuracy lies somewhere between the accuracy of subjective test data and objective metric values. We have named the resulting test protocol FOWR: Few Observers With Repetitions.
In this paper, we will review previous studies of subject rating behaviors, which provide the theoretical basis for this proposal. We then present a subjective test that was designed to evaluate this FOWR experiment design. The conclusions reached by this test will be compared with conclusions reached by a test conducted with the traditional experiment design.
I-A Related Work
The conventional experiment design is well described in the literature. Outside the video and audio quality literature, a typical subjective experiment is called a within-subjects design. Details about the analysis of such experiments can be found in classical literature from sociology, like chapter 14 of .
We cannot take short-cuts when choosing stimuli. An illustrative example is Fig. 3 of . Thirteen experiments conducted at various labs investigated the same issue: an equation that maps audio quality and video quality (measured separately) to the overall audiovisual quality (assessed jointly). Eight experiments used one or two source videos and reached disparate conclusions; five experiments used five to ten source videos and reached similar conclusions. Subjective and objective analyses are only reliable if care is taken when choosing media stimuli and impairments.
However, the number and nature of subjects are more open to discussion. The decision about the number of subjects should be based on so-called power analysis . This analysis is simple for classical problems where the effect size is well understood, but in the case of new technology and less classical experiments it is difficult to say what is the effect size. In this case a pilot study is a good option, but we have to carefully analyze the pilot study data. A low number of participants can have a strong influence on the obtained results .
An understanding of the specificity of subjective experiments is key for correctly planning experiments. One issue is to analyze how the number of subjects influences the obtained results. Pinson et al.  present a detailed analysis of the influence of the number of subjects and the environment. The main conclusion was that 24 subjects is a reasonable number. A different conclusion was drawn by Winkler in , where 15 was recommended. The difference could be the result of complicated behaviors related to the scoring process, as described in  pages 118–119.
Pinson A has significantly better quality than B, or vice versa). Assuming a well-designed and carefully conducted subjective test using the 5-level absolute category rating (ACR) method, the MOS CI is approximately 0.5 for 24 subjects, 0.7 for 15 subjects, 1.1 for 9 subjects, and 1.5 for 6 subjects. Unknown factors can lead to higher CIs, but usually no more than the next category (e.g., a 24 subject test with CI = 0.7). The likelihood that two subjective test labs will disagree on the rank ordering of stimuli is with a long tail extending up to
and an outlier at. These lab-to-lab comparisons show similar rating behaviors for speech, image, and video quality experiments.
Additionally, Brunnström and Barkowsky  analyze how the discriminating power of experiments decreases when the number of comparisons increases. A typical subjective experiment with 24 subjects and 100 pairwise comparisons (about 10 points compared pairwise) cannot discriminate reliably between levels smaller than 1.0 MOS differences in a 1-5 scale. The sample size required to discriminate between different levels for 0.5 MOS differences increases to 80 subjects.
A few studies have compared expert and non-expert rating behaviors. Speranza et al.  found a small but statistically significant difference between expert MOSs and non-expert MOSs, particularly at low bit-rates where experts were more critical than non-experts. Note that the authors assume that subjective tests produce absolute MOSs, which was later refuted . Speranza et al. did not compare the relative ranking of stimuli MOSs, but a visual examination of the  scatter plots indicates a simple relationship: experts used more of the 100-level scale and had a negative bias.
Kumcu et al.  compared expert and non-expert ratings for two different applications: medical imagery and denoising. Their experts and non-experts produced comparable scores except for the rank ordering of two denoising systems. Recent analyses of  and similar datasets indicate that experts and non-experts agree on the rank ordering of stimuli quality, when statistical equivalence is taken into consideration (revision to , pending publication). However, experts may have an increased or decreased sensitivity to certain impairments. Thus, for some applications, experts and non-experts will act like two pools of subjects selected at random from the same population, while for other applications, experts may reach different conclusions than non-experts about whether MOSs are equivalent or significantly different.
Several researchers have tried to describe and model the voting processes. Streijl, Winkler, and Hands  present a detailed analysis of the limitations of Mean Opinion Scores (MOS). Mean opinion scores are not a precise number, they note, but a statistical measurement with some uncertainty. This statistical nature has the advantage of allowing the use of parametric statistics on the MOS analysis, as the averaging of subject scores shows normal behavior. However, mean values might not capture all the relevant information of subjective scores, and some alternatives have been proposed, such as 10% or 90% quantiles, discrimination between pairs of similar or dissimilar stimuli , or quality score distributions [33, 27].
Recent analyses try to build a mathematical model that describes subject rating behaviors. Janowski and Pinson  model subject voting as a stochastic Gaussian process, with users presenting a bias and randomness that deviate from the target ground truth. The ground truth is also affected by a random noise associated to the difficulty of rating each Processed Video Sequence (PVS). Variations of this model have been proposed for processing of noisy subjective data , modeling just-noticeable-difference scores , or analyzing Adaptive Media Playout quality .
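The model can be written as

$$U_{ijr} = \psi_j + \Delta_i + \upsilon_i X + \phi_j Y \qquad (1)$$

where
$U_{ijr}$ is a random variable related to a vote
$o_{ijr}$ is the observed rating for subject $i$, PVS $j$, and repetition $r$
$\psi_j$ is the true quality of PVS $j$ (i.e., ground truth)
$\Delta_i$ is the subject bias (i.e., overall shift between the $i$th subject’s ratings and the true quality)
$\upsilon_i$ is the magnitude of the $i$th subject’s random noise
$\phi_j$ is the magnitude of the $j$th PVS’s random noise
$X, Y \sim N(0, 1)$ are random variables with normal distribution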
Janowski and Pinson show that subject bias is quite stable . The observed subject biases, , follow a normal distribution:
where they observe . They also observe that the combined distributions of , and span about of the rating scale, and the ranking of sequences is quite stable across subjects. Of particular importance to this paper, the model underlying (1) indicates that the number of subjects in an experiment can be reduced if each subject scores each PVS multiple times: by repeating the experiment, the mean score of each individual subject for each individual PVS should converge to the expected value of $U_{ijr}$, which represents the true opinion of the subject:

$$E[U_{ijr}] = \psi_j + \Delta_i \qquad (3)$$

where $\psi_j$ is the true quality of PVS $j$ and $\Delta_i$ is the bias of subject $i$.
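The convergence argument can be illustrated with a small simulation. The sketch below assumes the additive form of the subject model described above (true quality plus subject bias plus two zero-mean Gaussian noise terms). Every numeric value in it (true quality, bias, noise magnitudes, number of trials) is a hypothetical placeholder, not a value taken from any dataset, and ratings are kept continuous, whereas real ACR ratings would also be rounded and clipped to the 1 to 5 scale.

```python
import random
import statistics


def simulate_ratings(true_quality, bias, subject_noise, pvs_noise, repetitions, rng):
    """Draw repeated ratings of one PVS by one subject.

    Assumed model: rating = true_quality + bias
                            + subject_noise * X + pvs_noise * Y,
    with X and Y independent standard normal variables.
    """
    return [
        true_quality + bias
        + subject_noise * rng.gauss(0.0, 1.0)
        + pvs_noise * rng.gauss(0.0, 1.0)
        for _ in range(repetitions)
    ]


rng = random.Random(0)
TRUE_QUALITY, BIAS = 3.2, -0.4          # hypothetical values
expected_opinion = TRUE_QUALITY + BIAS  # what the repeated mean should approach


def mean_abs_error(repetitions, trials=2000):
    """Average distance between the mean of the repeated ratings and the expected opinion."""
    errors = []
    for _ in range(trials):
        ratings = simulate_ratings(TRUE_QUALITY, BIAS, 0.6, 0.3, repetitions, rng)
        errors.append(abs(statistics.fmean(ratings) - expected_opinion))
    return statistics.fmean(errors)


error_by_reps = {r: mean_abs_error(r) for r in (1, 4, 10)}
for r, err in error_by_reps.items():
    print(f"{r:2d} repetitions: mean |error| = {err:.3f}")
```

Note that the repeated mean approaches the true quality shifted by the subject bias, not the true quality itself, which is why the bias has to be handled separately.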
Subject bias can change slightly over the duration of a long experiment or in response to certain subject matter. For example, dataset ITS4S2  is divided into 14 sessions, each with photographs on a different topic (e.g., landscapes or disasters). Visual inspection of subject biases indicates that only one of 16 subjects had notable session biases: liking tourist photos and disliking photos of places. Janowski and Pinson  measure statistically significant differences in subject bias between sessions for 6 of 26 subjects in  but conclude that these perturbations are small enough to be safely ignored.
I-B Research Questions
Given the following findings of previous research:
Subjects show a consistent additive bias.
Subjects tend to agree on pairwise orderings.
The number of subjects can be reduced if each subject scores each PVS several times.
A well-designed and carefully conducted subjective experiment will have a resolution no better than 0.5 MOS.
We pose the following research questions:
Is it possible to design a valid experiment using few observers with repetitions (FOWR) with resolution of 0.5 MOS?
Can these subjects be experts?
How can subject bias be correctly handled?
Would this approach be at least as reliable as objective quality scores?
II Experimental design
II-A Subjective experiment
To answer these questions, we conducted a subjective experiment, ITERO, using PVSs drawn from a prior experiment, ITS4S. The ITS4S dataset  contains 4-second PVSs without repetition, organized into eight sessions of 100 PVSs. For ITERO we chose the “Everglades” session and added 10 sequences from the “Sports” session, for a total of 110 PVSs. ITS4S provides subject ratings for these sequences from two independent labs: ITS and AGH.
The aim of ITERO is to assess whether, by repeating the experiment several times, it is possible to replicate the results from ITS4S “Everglades” with just one or a few subjects. However, (3) shows that this will only be possible if we can estimate the subject bias, $\Delta_i$. Our proposed solution is adding a reference: 10 sequences from a prior test, with known MOSs from 24 subjects. We hypothesize that we can estimate subject bias using the prior test subset and remove it from the new test’s ratings. Therefore, ITERO has 10 sequences from the “Sports” session (the reference) and 100 sequences from the “Everglades” session (the new test).
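The hypothesis can be sketched in a few lines. This is only an illustration of the idea of estimating a bias on reference sequences and subtracting it from the new ratings; the function names, the dictionary layout, and the toy numbers are assumptions introduced here, not part of the ITERO tooling.

```python
import statistics


def estimate_bias(subject_ratings, reference_mos):
    """Mean difference between a subject's ratings and the known MOS,
    computed over the reference sequences only (dicts keyed by PVS id)."""
    return statistics.fmean(
        subject_ratings[pvs] - reference_mos[pvs] for pvs in reference_mos
    )


def remove_bias(subject_ratings, bias, new_test_pvs):
    """Subtract the estimated bias from the ratings of the new-test sequences."""
    return {pvs: subject_ratings[pvs] - bias for pvs in new_test_pvs}


# Hypothetical toy data: three reference PVSs with known MOS, two new PVSs.
reference_mos = {"ref1": 2.0, "ref2": 3.5, "ref3": 4.5}
subject_ratings = {"ref1": 3, "ref2": 4, "ref3": 5, "new1": 4, "new2": 2}

bias = estimate_bias(subject_ratings, reference_mos)
corrected = remove_bias(subject_ratings, bias, ["new1", "new2"])
print(round(bias, 3), corrected)
```

The correction is only as good as the bias estimate, so it relies on the reference sequences being rated on the same internal scale as the new ones.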
The ITERO experiment was conducted in three different labs: Nokia, AGH, and UPM. At Nokia and AGH, each person rated sequences on their own laptop. At UPM, the same computer was used by all the people in the experiment. Each subject was instructed to repeat the experiment 10 times, preferably not twice on the same day. Twenty-seven subjects took part in the ITERO experiment. Of these, 20 finished all 10 repetitions.
|Item|Statement|
|Confidence|It was easy to have an opinion about each sequence|
|Focus|I have been focused on the task for the whole duration of the test|
|Tiredness|I am tired of doing this test|
In each experiment session, the subject performed a screen test, rated the same 110 PVSs, and answered a questionnaire. The screen test assessed interactions between the subjects, monitor, and environment (e.g., the subject’s ability to perceive small luma differences on this monitor in the current lighting condition). This tool has been successfully used in crowdsourcing experiments in the past [7, 4]. A short questionnaire was conducted after each session to assess the subject’s confidence, focus, and tiredness (Table I). Questions were answered on a 5-point Likert scale, where 5 is totally agree and 1 is totally disagree.
This subjective test adheres to ITU-T Rec. P.913, which ensures the rights and welfare of the human subjects (see Clause 11.1). Photographs of the subjective testing environments are omitted, due to the large diversity of locations.
II-B Structure of the data set
The data from one ITERO session and a single subject will be referred to as a repetition. The objective of the test is to discover whether a few (e.g. 3-4) subjects of the ITERO test can, after a few repetitions, obtain similar results as in a “traditional” subjective experiment, conducted with more than 20 subjects. With this aim, from the data obtained in the ITERO dataset, two subsets have been extracted: ITERO-TEN and ITERO-ONE.
ITERO-TEN includes the data from the 10 repetitions of the 20 subjects that concluded them all. In the subsequent analysis, $o_{ijr}$ will represent the rating for subject $i$, PVS $j$, and repetition $r$ within the ITERO-TEN dataset.
ITERO-ONE takes the first repetition of each ITERO subject. It includes data from all 27 subjects, regardless of how many repetitions the subject completed. Subjects were screened for outlier rejection according to Rec. ITU-R BT.500, and no one was rejected. ITERO-ONE is itself a “traditional” subjective assessment experiment, with 27 different subjects scoring the PVSs for the first time. It will be referred to as the baseline. The notation $\mathrm{MOS}^{B}_j$ will be used to represent the ITERO-ONE MOS value for PVS $j$.
Additionally, we will consider the original ITS4S experiment. The ITS4S data will be referred to as the ground truth, since ITS4S was conducted according to the best known practices in ITU-T Rec. P.913. Consequently, the notation $\mathrm{MOS}^{GT}_j$ will be used to represent the ITS4S MOS value for PVS $j$.
II-C Dataset comparison
The target of the study is to determine whether a few subjects from ITERO-TEN can, after some repetitions, be equivalent to a “traditional” subjective experiment or to a state-of-the art objective metric. This equivalence will be analyzed in terms of association, agreement, perceptual similarity, and confusion analysis.
Association measures the (potentially linear) correlation between two variables, which are related but may have been measured from two different populations (or two different variables of the same observed population). We will measure association by computing Pearson’s linear correlation coefficient (PCC), as it is the most frequently used comparison metric both for objective and subjective scores.
Agreement is based on the sameness or difference between two values that measure the same underlying variable. In our case, it would mean that ITERO-TEN scores measure the very same quantity as the baseline or the ground truth. Several options are proposed in the literature, mostly designed for the evaluation of objective scores (see e.g. ITU-T J.149). We will use the Root Mean Square Error (RMSE), which is probably the most popular one, and is typically reported as a figure of merit in the scientific literature. Measuring agreement is particularly relevant in our experiment due to the effect of bias: a theoretical subject who always reported his/her true opinion (the true quality shifted by a constant subject bias $\Delta_i$) would have perfect association (PCC = 1), but not perfect agreement (RMSE = $|\Delta_i|$).
Perceptual similarity is based on the idea that individual subjects are unable to perceive small differences of quality between similar PVSs. In particular, increasing the resolution of the measurement scale beyond the recommended 5 levels does not increase the accuracy. This finding, together with the limitations of the discriminating power of existing experiments already described in the introduction, suggests that the actual resolution of subjective scores must lie somewhere between 0.5 and 1.0 MOS points. Considering this, we will assume that the same PVS has been rated similarly in two different experiments if its MOS rating differs by less than 0.5 points. We will measure perceptual similarity by computing the ratio of PVSs that have been rated similarly (MOS05).
Confusion analysis compares the conclusions reached by the ground truth test and the ITERO-TEN dataset. Pinson  provides expected variances, based on lab-to-lab comparisons. We measure the differences between the conclusions reached by the ground truth and the ITERO experiment design, to understand whether the differences fall within the expected behavior of subjective testing.
It is not likely that any subjective or objective score is going to replicate exactly the true quality so that PCC = 1, RMSE = 0 or MOS05 = 1. In practice, the scores extracted from ITERO-TEN should be able to match the performance of state-of-the-art objective metrics, or the results of two different laboratories conducting the same subjective experiment.
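The three numeric figures of merit can be computed as in the following minimal sketch. It assumes two equal-length lists of per-PVS MOS values; the helper names and the toy data are illustrative. The example compares a list of scores against a copy shifted by a constant offset, which reproduces the bias effect discussed for agreement: association stays perfect while agreement and perceptual similarity degrade.

```python
import math


def pcc(x, y):
    """Pearson's linear correlation coefficient between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rmse(x, y):
    """Root mean square error between two score lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def mos05(x, y, threshold=0.5):
    """Fraction of PVSs whose two MOS values differ by less than the threshold."""
    return sum(abs(a - b) < threshold for a, b in zip(x, y)) / len(x)


reference = [1.5, 2.0, 3.0, 4.0, 4.5]    # hypothetical MOS values
shifted = [m + 0.6 for m in reference]   # same scores with a constant +0.6 bias

print(pcc(reference, shifted), rmse(reference, shifted), mos05(reference, shifted))
```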
Table II shows the performance of a few Full-Reference quality metrics with respect to the dataset VQEG-HDTV-3 . PSNR is a traditional benchmark for objective scores. VQuad-HD (ITU-T J.341), and VMAF  can be considered state-of-the-art metrics for compression and, in some cases, packet loss artifacts. PSNR and VQuad-HD objective scores have been extracted from , which includes 3-degree polynomial fitting towards the subjective score. VMAF scores were obtained using the subset of VQEG-HDTV-3 that contains only compression artifacts . Due to the different ways in which those metrics were developed, Table II is not a fair comparison among them. Nonetheless, it provides a view on what to expect from an objective metric working within its comfort zone.
|Metric|PCC|RMSE|MOS05|
|VQuad-HD (ITU-T J.341)|0.917|0.446|0.714|
|VQM (ITU-T J.144)|0.794|0.690|0.597|
Table III shows the same analysis, using subjective data instead of objective data. It performs lab-to-lab comparisons using experiments that were conducted by six or more international labs: the common set from VQEG-HDTV  and VQEG-MM2 , . We made lab-to-lab pairwise comparisons and then picked the median over all pairs. Table III also compares the ITS and AGH lab data from ITS4S session “Everglades,” as these sequences constitute most of the content in ITERO.
III Results for individual subjects
III-A Subject responses
The ITERO experiment was self-paced, both within each repetition and in the scheduling of repetitions on different days. The distribution of experiment duration was very heterogeneous: some subjects did the 10 repetitions in 12 days, while others took 8 months. The median time between two consecutive repetitions of the same subject was 2 days. However, in 10% of the cases, consecutive repetitions were longer than 2 weeks apart. Self-reported values of confidence, focus, and tiredness had slight variations in response to the repetition number (Fig. 1).
The screen test performed at the beginning of each repetition provides a reliability index on a 0-100 scale, which was at least 95 in 93% of the sessions. The sessions for which the reliability index was not at least 95 were randomly distributed across different subjects and repetitions. We found no obvious behavior pattern that suggests that any subject or repetition should be discarded or analyzed separately.
Fig. 2 shows the fraction of changes in scoring for the same PVS in consecutive sessions, for the 20 subjects that did all 10 repetitions. It shows a learning process over time, which stabilizes after a few repetitions. On the one hand, all subjects gave the same ratings for at least half of the sequences starting from the third repetition. There were a few outliers, but only as many as expected from the distribution, and these are not unduly influential. On the other hand, all subjects changed their vote on at least 10% of the sequences even at the 10th repetition.
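The quantity plotted in Fig. 2 can be computed per subject as sketched below. The function name and the toy ratings are illustrative; each inner list is assumed to hold one subject's scores for the same PVSs, in the same order, in one session.

```python
def fraction_changed(previous, current):
    """Fraction of PVSs whose score differs between two sessions of one subject."""
    if len(previous) != len(current):
        raise ValueError("both sessions must rate the same PVSs")
    return sum(p != c for p, c in zip(previous, current)) / len(previous)


# Hypothetical ratings of six PVSs by one subject over three sessions.
sessions = [
    [3, 4, 2, 5, 1, 3],
    [3, 3, 2, 4, 1, 3],
    [3, 3, 2, 4, 2, 3],
]

# One value per pair of consecutive sessions.
changes = [fraction_changed(a, b) for a, b in zip(sessions, sessions[1:])]
print(changes)
```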
It is interesting to see how this learning process results in information gain. To do so, we have compared the scores of the subjects in ITERO-TEN with the baseline MOS from ITERO-ONE (outward comparison), as well as with the mean opinion of the subject across the ten repetitions (inward comparison):

$$\bar{o}_{ij} = \frac{1}{10}\sum_{r=1}^{10} o_{ijr} \qquad (4)$$

where $o_{ijr}$ is the rating of subject $i$ for PVS $j$ in repetition $r$, and which can be interpreted as an approximation of the subject true opinion.
Fig. 3 shows the results of this comparison for PCC, RMSE and MOS05 metrics, for four treatments of the ITERO subjective ratings. Black lines show comparisons between each subject and his/her own estimated true opinion (inward), while blue lines show comparisons with the baseline ITERO-ONE (outward). Solid lines show results for the current individual repetition $r$, dashed lines accrue the first $R$ repetitions, and dotted lines accrue the last $R$ repetitions (reverse).
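Taking the left graphic (Pearson’s linear correlation coefficient) as an example, the solid blue line shows PCC between a single repetition of a single subject and the baseline, averaged over all ITERO-TEN subjects:

$$\mathrm{PCC}_{O,C}(r) = \frac{1}{I}\sum_{i=1}^{I}\rho_j\left(o_{ijr},\ \mathrm{MOS}^{B}_j\right)$$

where $o_{ijr}$ represents an individual rating of the ITERO-TEN set, $\mathrm{MOS}^{B}_j$ represents the baseline MOS of each PVS calculated from ITERO-ONE, $\rho_j$ represents the correlation coefficient across all PVSs, $I$ is the number of ITERO-TEN subjects, and subscript (O,C) is an abbreviation for (Outward, Current).
The dashed blue line shows the correlation between the first $R$ repetitions of a single subject and the baseline, averaged over all ITERO-TEN subjects:

$$\mathrm{PCC}_{O,A}(R) = \frac{1}{I}\sum_{i=1}^{I}\rho_j\left(\frac{1}{R}\sum_{r=1}^{R} o_{ijr},\ \mathrm{MOS}^{B}_j\right)$$

where subscript (O,A) is an abbreviation for (Outward, Accrued). That is, the dashed blue line accumulates or accrues data from all prior repetitions, while the solid line considers only the current repetition.
The dotted blue line, labeled reverse, considers the correlation between the last $R$ sessions of a single subject and the baseline, averaged over all ITERO-TEN subjects:

$$\mathrm{PCC}_{O,R}(R) = \frac{1}{I}\sum_{i=1}^{I}\rho_j\left(\frac{1}{R}\sum_{r=11-R}^{10} o_{ijr},\ \mathrm{MOS}^{B}_j\right)$$

where subscript (O,R) is an abbreviation for (Outward, Reverse).
The black lines repeat these same computations, but they use, instead of the baseline ($\mathrm{MOS}^{B}_j$), the inward opinion of each subject ($\bar{o}_{ij}$, see equation (4)) across the ten repetitions:

$$\mathrm{PCC}_{I,C}(r) = \frac{1}{I}\sum_{i=1}^{I}\rho_j\left(o_{ijr},\ \bar{o}_{ij}\right)$$

$$\mathrm{PCC}_{I,A}(R) = \frac{1}{I}\sum_{i=1}^{I}\rho_j\left(\frac{1}{R}\sum_{r=1}^{R} o_{ijr},\ \bar{o}_{ij}\right)$$

$$\mathrm{PCC}_{I,R}(R) = \frac{1}{I}\sum_{i=1}^{I}\rho_j\left(\frac{1}{R}\sum_{r=11-R}^{10} o_{ijr},\ \bar{o}_{ij}\right)$$

where subscripts (I,C), (I,A), and (I,R) are an abbreviation for (Inward, Current), (Inward, Accrued), and (Inward, Reverse) respectively.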
The shaded area around each line represents the 95% confidence interval of the mean. Confidence intervals have been excluded from reverse plots for figure clarity, but they are similar to the ones for their respective accrued plots. The same treatments have been applied to RMSE (Fig. 3, center) and MOS05 (right).
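The current, accrued, and reverse treatments can be sketched for a single subject as follows (the curves in Fig. 3 additionally average over subjects). This is an illustrative reading of the text, in which "accrued" means the per-PVS mean of the first repetitions and "reverse" the per-PVS mean of the last ones; function names and toy data are assumptions.

```python
import math


def pcc(x, y):
    """Pearson's linear correlation coefficient between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mean_over(ratings, reps):
    """Per-PVS mean of one subject's ratings over the given repetition indices.

    ratings[r][j] is the score given to PVS j in repetition r.
    """
    return [sum(ratings[r][j] for r in reps) / len(reps)
            for j in range(len(ratings[0]))]


def outward_curves(ratings, baseline):
    """Return (current, accrued, reverse) PCC curves against a baseline MOS."""
    total = len(ratings)
    current = [pcc(ratings[r], baseline) for r in range(total)]
    accrued = [pcc(mean_over(ratings, range(n)), baseline)
               for n in range(1, total + 1)]
    reverse = [pcc(mean_over(ratings, range(total - n, total)), baseline)
               for n in range(1, total + 1)]
    return current, accrued, reverse


# Hypothetical data: 3 repetitions of 4 PVSs by one subject.
baseline = [1.8, 2.9, 3.7, 4.6]
ratings = [[2, 2, 4, 5], [1, 3, 4, 4], [2, 3, 3, 5]]
current, accrued, reverse = outward_curves(ratings, baseline)
print(current, accrued, reverse)
```

The inward variants follow by passing the subject's own mean over all repetitions, `mean_over(ratings, range(len(ratings)))`, in place of `baseline`.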
III-B Inward comparison
As already observed in Fig. 2, Fig. 3 shows a learning process of the subjects with respect to their final opinions, as the first session tends to be farther away from the inward MOS than the rest. This learning process stops at about the fourth or fifth repetition. Although the inward accrued correlation, RMSE, and MOS05 converge to the final opinion of the subject, as expected, each individual repetition does not.
Individual scores at each repetition are, by definition, integers, and therefore they present some quantization noise when used to approximate the estimated true opinion of each subject. Accruing the results of several repetitions can remove this noise and, after the sixth repetition, subjects have converged to their estimated true opinions within our target resolution of 0.5 MOS points.
III-C Outward comparison
Each individual repetition has similar properties of outward current association ($\mathrm{PCC}_{O,C}$), agreement ($\mathrm{RMSE}_{O,C}$) and perceptual similarity ($\mathrm{MOS05}_{O,C}$). Unlike inward comparisons, the learning process described above does not result in each individual session being closer to the baseline than the previous one.
However, this learning process produces some additional information, which makes the outward accrued metrics converge to the baseline. Part of this convergence may be just the compensation for quantization noise. But another part is truly produced by the learning process: the changes in opinion of the subject during the first repetitions actually generate information. This can be seen when comparing (direct) accrued to reverse curves: the former converges to a saturation point much faster than the latter.
The behaviors described above apply similarly to association, agreement, and perceptual similarity. After the first four or five repetitions, the average subject has , , and . No significant improvement is produced after that, which can be interpreted as the opinion changes being just random noise.
III-D Subject bias
Bias has been computed with respect to the baseline (ITERO-ONE), and the distributions are calculated for the 20 subjects who completed all repetitions:

$$b_{ir} = \frac{1}{J}\sum_{j=1}^{J}\left(o_{ijr} - \mathrm{MOS}^{B}_j\right)$$

where $o_{ijr}$ is the rating of subject $i$ for PVS $j$ in repetition $r$, $\mathrm{MOS}^{B}_j$ is the baseline MOS of PVS $j$, and $J$ is the number of PVSs.
Fig. 4 shows the bias of the different subjects for each repetition $r$, as well as the median. The subjects tend to get slightly more pessimistic with time, as seen in the median, but the distribution is relatively stable otherwise. Four individual subjects have been identified to illustrate different behaviors: A and B are consistently optimistic or pessimistic. C is the subject whose bias has the highest variance; it drifts from 0.1 to -0.6 as sequences advance. D is the subject whose bias has the lowest variance. It is relatively stable; however, it shows oscillations.
From Fig. 4 we can see that bias is not perfectly stable. To better understand its instability we tested whether bias for one session is statistically different from bias obtained for all sessions. For all repetitions we can calculate mean bias by:

$$\bar{b}_i = \frac{1}{10}\sum_{r=1}^{10} b_{ir}$$

where $b_{ir}$ is the bias of subject $i$ in repetition $r$.
For each subject $i$ and each repetition $r$, we have used Student’s t-test to determine whether the global bias $\bar{b}_i$ for subject $i$ is statistically different from the bias $b_{ir}$ obtained for his/her session $r$. Since we run multiple comparisons, we use the Bonferroni correction of the significance level. The analysis shows that for 8 subjects we have no statistically different sessions, for 6 we have one statistically different session, for 5 two, and for 1 three statistically different sessions. The obtained results are tricky to interpret. For most comparisons we do not detect a statistically significant difference, but this is not always true.
Additionally, considering the bias values obtained for subsequent repetitions, a slow global decreasing trend can be observed. Such a trend cannot be related to bias instability, since it is a systematic change. We do not know why such a bias trend is observed. People seem to become slightly more critical as they see the same videos repeatedly. This increases the likelihood of detecting statistical significance between different repetitions. Considering all this, we think that the bias is mostly stable.
A complementary analysis is studying how many sequences are needed to estimate bias from a complete experiment. Along this line, Fig. 5 shows the root mean square error of the global bias versus the bias predicted by $n$ samples. To compute it, we randomly chose a session and then randomly chose $n$ samples out of the 110 sequences of the session. Results suggest that bias can be predicted from around 15 samples with an error of around 0.2; for 60 samples the bias estimation error is around 0.15.
In summary, our results suggest that subject bias actually exists and is stable across sessions. However, there is always going to be an error (of at least around 0.15) when estimating the bias from specific sessions, and this error will be higher if the estimation is done from a reduced subset of sequences.
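The dependence of the bias estimation error on the number of sampled sequences can be sketched with a resampling loop. The version below is a simplification: it compares the bias of one full session with the bias estimated from a random subset of that same session, so it ignores session-to-session variation and its error falls toward zero as the subset grows. The rating differences are synthetic and every parameter is a hypothetical placeholder.

```python
import math
import random
import statistics


def bias_estimation_rmse(differences, n, trials, rng):
    """RMSE between the bias over all sequences and the bias from n random ones.

    differences: per-PVS (rating - baseline MOS) values of one subject in one session.
    """
    full_bias = statistics.fmean(differences)
    squared_errors = []
    for _ in range(trials):
        sample = rng.sample(differences, n)   # n sequences drawn without replacement
        squared_errors.append((statistics.fmean(sample) - full_bias) ** 2)
    return math.sqrt(statistics.fmean(squared_errors))


rng = random.Random(1)
# Hypothetical session: 110 rating-minus-MOS differences around a bias of -0.2.
differences = [rng.gauss(-0.2, 0.8) for _ in range(110)]

rmse_by_n = {n: bias_estimation_rmse(differences, n, 2000, rng) for n in (5, 15, 60)}
for n, err in rmse_by_n.items():
    print(f"n = {n:3d}: bias estimation RMSE = {err:.3f}")
```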
IV Results for several observers
IV-A Estimating the baseline
So far we have seen that a single subject has limited ability to predict the results of a “traditional” experiment such as the baseline. After approximately four repetitions, there is no visible gain of information or prediction capability. We will now analyze whether this limitation can be overcome by aggregating the results of a reduced number of subjects.
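To do so, we pick $S$, a random sample of $N$ subjects from ITERO-TEN, and we consider the MOS obtained by their first $R$ repetitions, $\widehat{\mathrm{MOS}}_j$. We then compute our benchmark metrics between each subset and a modified baseline, $\widetilde{\mathrm{MOS}}^{B}_j$, that excludes the ratings from the selected subjects (note that $\widehat{\mathrm{MOS}}_j$ and $\widetilde{\mathrm{MOS}}^{B}_j$ depend on $S$, but we have decided not to show it explicitly to simplify the notation):

$$\widehat{\mathrm{MOS}}_j = \frac{1}{N R}\sum_{i \in S}\sum_{r=1}^{R} o_{ijr}$$

$$\widetilde{\mathrm{MOS}}^{B}_j = \frac{1}{27 - N}\sum_{i \notin S} o_{ij1}$$

where $o_{ijr}$ is the rating of subject $i$ for PVS $j$ in repetition $r$, and the second sum runs over the ITERO-ONE subjects that are not in $S$.
For instance, Pearson correlation is computed as

$$\mathrm{PCC} = \rho_j\left(\widehat{\mathrm{MOS}}_j,\ \widetilde{\mathrm{MOS}}^{B}_j\right) \qquad (16)$$

where $\rho_j$ represents the correlation coefficient across all PVSs. RMSE and MOS05 are computed likewise.
For each combination of $N$ and $R$, we repeat the process 1000 times and then compute the distribution of the benchmark metrics.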
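The resampling procedure can be sketched as below. It is a simplified illustration on synthetic data: the subject pool, the vote generator, and all parameters are hypothetical, and the baseline is formed from the first repetition of the subjects left out of each random subset.

```python
import math
import random
import statistics


def pcc(x, y):
    """Pearson's linear correlation coefficient between two score lists."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mos(ratings, subjects, repetitions):
    """Per-PVS mean over the given subjects and repetitions (ratings[i][r][j])."""
    n_pvs = len(ratings[0][0])
    return [statistics.fmean(ratings[i][r][j] for i in subjects for r in repetitions)
            for j in range(n_pvs)]


def monte_carlo_pcc(ratings, n_subjects, n_reps, draws, rng):
    """Sorted PCC values between random subsets and the baseline without them."""
    values = []
    for _ in range(draws):
        chosen = rng.sample(range(len(ratings)), n_subjects)
        others = [i for i in range(len(ratings)) if i not in chosen]
        subset_mos = mos(ratings, chosen, range(n_reps))   # first n_reps repetitions
        baseline_mos = mos(ratings, others, range(1))      # first repetition only
        values.append(pcc(subset_mos, baseline_mos))
    return sorted(values)


# Synthetic votes: true quality + subject bias + noise, rounded to the 1..5 scale.
rng = random.Random(2)
N_SUBJECTS, N_REPS, N_PVS = 12, 4, 30
truth = [rng.uniform(1.5, 4.5) for _ in range(N_PVS)]
biases = [rng.gauss(0.0, 0.3) for _ in range(N_SUBJECTS)]
ratings = [[[min(5, max(1, round(truth[j] + biases[i] + rng.gauss(0.0, 0.6))))
             for j in range(N_PVS)]
            for _ in range(N_REPS)]
           for i in range(N_SUBJECTS)]

values = monte_carlo_pcc(ratings, n_subjects=4, n_reps=4, draws=1000, rng=rng)
median_pcc = statistics.median(values)
p5_pcc = values[int(0.05 * len(values))]   # 5th percentile (worst case for PCC)
print(f"median PCC = {median_pcc:.3f}, 5th percentile = {p5_pcc:.3f}")
```

RMSE and MOS05 distributions would be obtained the same way, taking the 95th percentile as the worst case for RMSE.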
The first row of Fig. 6 shows the cumulative distribution of the metrics for and . Table IV shows the values of the most relevant points of the distribution: median and 5th (PCC, MOS05) or 95th (RMSE) percentiles. We have initially selected as it is the point where additional repetitions do not seem to increase information from individual subjects, as described above. As a reference, we have also computed the distribution of the metrics resulting from pairwise comparisons of different repetitions of the same subjective experiment, as also described in section II-D (see Table III): VQEG-HDTV (6 laboratories, 15 comparisons), VQEG-MM2 (10 laboratories, 45 comparisons) and ITS4S (2 laboratories, 1 comparison).
By most metrics, having 4 or 5 subjects repeat the session 4 times each outperforms the results of predicting one “standard” experiment from the results of another one, both considering the median and the worst case (95% percentile). Having 4 subjects repeat the assessment 4 times results in correlation coefficients higher than 0.94, RMSE values lower than 0.45, and MOS05 higher than 0.68, with a probability higher than 95%, also in line with state-of-the-art objective metrics (see Table II).
The second row of Fig. 6 shows the evolution of the metrics with respect to the repetitions. The most relevant points of the cumulative distribution are shown: median, and edge percentiles (5% and 95%). It can be seen that the behavior is similar to that of a single observer: it clearly improves during the first 3 repetitions, and it stabilizes at the 4th or 5th. The ability to improve during repetition seems to be slightly better in association (correlation) than in agreement (RMSE) or perceptual similarity (MOS05), particularly for the worst case.
This limitation can be explained by the combined bias of the users within the same subset:
According to the subject bias model in (2), is the sum of independent normal random variables :
The actual distribution of the combined bias is approximately Gaussian, with the observed mean shown in Table V. It approximately follows (19). Therefore, subject bias imposes a practical limitation on the achievable agreement between an experiment with a reduced number of subjects and a “traditional” one: with only a few subjects, the combined bias will result in systematic error affecting RMSE and MOS05 metrics. In the next subsection we will explore whether it is possible to remove such bias by introducing some additional sequences in the experiment for that purpose (“Sports” sequences, in our case).
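A short worked step shows why averaging over only a few subjects leaves a residual bias. It assumes, for illustration only, that the combined bias is the mean of the $N$ individual biases and that these are independent draws from a normal distribution with mean $\mu_\Delta$ and standard deviation $\sigma_\Delta$; both symbols, and $\Delta_S$ for the combined bias, are notation introduced here.

```latex
\Delta_S = \frac{1}{N}\sum_{i \in S}\Delta_i,
\qquad
\Delta_i \sim \mathcal{N}\!\left(\mu_\Delta,\ \sigma_\Delta^2\right)
\;\Longrightarrow\;
\Delta_S \sim \mathcal{N}\!\left(\mu_\Delta,\ \frac{\sigma_\Delta^2}{N}\right)
```

Under these assumptions the standard deviation of the combined bias shrinks only as $1/\sqrt{N}$: going from one subject to four halves it, and adding repetitions does not reduce it at all.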
IV-B Estimating the ground truth
So far we have shown the performance of the metrics of $N$ subjects, $R$ repetitions, when predicting the values of the very same experiment done once by the remaining subjects (baseline). However, the final target would be to be able to predict the original “Everglades” results in the ITS4S database (ground truth). In fact, the addition of the “Sports” sequences should be helpful to estimate and correct the combined bias of the subjects.
The same analysis described in the previous subsection is done to compute the benchmark metrics of a random subset of ITERO-TEN with the ground truth ITS4S. Now (16) is replaced by:

$$\mathrm{PCC} = \rho_j\left(\widehat{\mathrm{MOS}}_j,\ \mathrm{MOS}^{GT}_j\right)$$

and likewise for the rest of the metrics. Here $\widehat{\mathrm{MOS}}_j$ is the MOS obtained by the random subset of subjects, $\mathrm{MOS}^{GT}_j$ is the MOS of each PVS in the original ITS4S “Everglades” experiment, and $\rho_j$ is computed only for the 100 “Everglades” sequences.
Fig. 7 shows some of the results. Results are similar in terms of association, but the behavior is worse in terms of agreement or perceptual similarity; even though the results do not differ very much, it is clear that the addition of repetitions does not improve RMSE or MOS05 significantly.
A potential explanation for this is that introducing the 10 “Sports” sequences within the “Everglades” dataset actually modified the scoring scales of the subjects. Fig. 8(a) shows the MOS of each individual sequence in ITS4S vs ITERO-TEN, considering all users and four repetitions. Two different effects can be observed. On the one hand, the spread of the distribution is wider in the case of ITERO-TEN. On the other hand, “Everglades” and “Sports” sequences are clearly scored differently: on average, “Everglades” sequences are scored 0.09 points lower in ITERO-TEN than in ITS4S, whereas “Sports” sequences are scored 0.35 points higher in ITERO-TEN.
As a consequence, it is not possible to compensate for the subject bias in “Everglades” sequences with only the bias estimated from “Sports” sequences; it actually makes results worse (increases RMSE and reduces MOS05). Fig. 8(b) shows that, even though there is some correlation between predicted and actual bias, it is not strong enough to really improve the agreement properties of the MOS calculation. This is in line with Fig. 5 showing that 10 sequences do not allow precise estimation of bias.
A possible reason why the “Sports” sequences cannot predict the general subject bias is that their content may differ significantly from “Everglades”. To explore this possibility, we have used different subsets of sequences (within the 110 PVSs) to estimate the bias of the whole experiment. In particular, we have taken each of the SRCs, as defined in the ITS4S dataset, as the bias estimator. Unlike other subjective experiments, different PVSs of the same ITS4S dataset contain different source content; however, they belong to the same scene as the original content. Fig. 8(c) shows the distribution of bias estimation error for each SRC, including the “Sports” subset. As can be seen, the error depends quite significantly on the specific sequences selected to compute it, and therefore it is not stable across experiments.
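The idea of estimating a subject's bias from a small anchor subset, and checking how far that estimate is from the bias measured on the whole experiment, can be sketched as below. The function names, the dictionary layout, and the definition of the estimation error as a simple difference are assumptions made for illustration.

```python
def estimate_bias(subject_ratings, reference_mos, pvs_ids):
    """Mean difference between a subject's ratings and the reference MOS on the given PVSs."""
    diffs = [subject_ratings[p] - reference_mos[p] for p in pvs_ids]
    return sum(diffs) / len(diffs)


def bias_estimation_error(subject_ratings, reference_mos, anchor_ids):
    """Bias estimated on the anchor PVSs minus the bias measured on all PVSs."""
    all_ids = list(reference_mos)
    return (estimate_bias(subject_ratings, reference_mos, anchor_ids)
            - estimate_bias(subject_ratings, reference_mos, all_ids))


# Hypothetical ratings and reference MOS for three PVSs (illustration only).
ratings = {"pvs_a": 4, "pvs_b": 3, "pvs_c": 2}
reference = {"pvs_a": 3.5, "pvs_b": 3.0, "pvs_c": 2.5}
print(bias_estimation_error(ratings, reference, ["pvs_a"]))
```

In this toy case the subject is unbiased overall, yet the single anchor suggests a positive bias. Correcting with that estimate would make the result worse, which mirrors the behavior reported for the “Sports” subset.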
IV-C Confusion Analysis
When a subjective test is repeated in multiple labs, each lab will reach slightly different conclusions. For all pairs of stimuli, A and B, we will use the paired stimulus Student’s t-test to decide whether A is better than, equivalent to, or worse than B. We can then compare the conclusions reached by the different labs and tally their frequency.
We are only interested in two of the outcomes. The first is the likelihood that the two subjective tests disagree (i.e., the labs reach opposing conclusions on the quality ranking of A and B). Pinson shows that the disagree rate is stable and not influenced by the number of subjects or range of quality in the subjective test. We expect a well-designed and carefully conducted subjective test to have a disagree rate ≤ 1%.
The second outcome of interest is the likelihood that two subjective test labs agree (i.e., the labs reach the same conclusion on the quality ranking of A and B, ignoring ties). The agree rate is influenced by the number of subjects and range of quality in the subjective test, so we must compare results with statistics gathered from ITERO-ONE and the ground truth. This gives us three labs, each with 24 or 27 subjects. Using all available subjects and the Everglades PVSs, lab-to-lab comparisons yield agree rates of 66%, 66%, and 68%. If we randomly select 15 subjects, the agree rate ranges from 52% to 63%, with an average of 57%.
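The lab-to-lab comparison described above can be sketched as follows. Each lab classifies every stimulus pair with a paired t-test, and the two sets of conclusions are then tallied. This is a hedged sketch: the function names are hypothetical, the critical value `t_crit` is supplied by the caller, and expressing both rates as fractions of all stimulus pairs is an assumption.

```python
import math
from itertools import combinations


def compare_pair(ratings_a, ratings_b, t_crit):
    """Paired t-test on per-subject ratings: 1 if A > B, -1 if A < B, 0 if equivalent."""
    diffs = [a - b for a, b in zip(ratings_a, ratings_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0:
        # All subjects gave the same difference; any nonzero mean is decisive.
        return (mean > 0) - (mean < 0)
    t_stat = mean / math.sqrt(var / n)
    if abs(t_stat) <= t_crit:
        return 0
    return 1 if t_stat > 0 else -1


def lab_conclusions(ratings, t_crit):
    """ratings maps stimulus id -> list of ratings (same subject order for every stimulus)."""
    return {(a, b): compare_pair(ratings[a], ratings[b], t_crit)
            for a, b in combinations(sorted(ratings), 2)}


def agree_disagree_rates(lab1, lab2):
    """Fractions of pairs ranked the same way (ties excluded) and ranked oppositely."""
    pairs = list(lab1)
    agree = sum(1 for p in pairs if lab1[p] != 0 and lab1[p] == lab2[p])
    disagree = sum(1 for p in pairs if lab1[p] * lab2[p] == -1)
    return agree / len(pairs), disagree / len(pairs)
```

For a lab with 24 subjects (23 degrees of freedom), a two-sided test at the 95% level corresponds to `t_crit` of about 2.069; the confidence level itself is an assumption here, since it is not stated in this section.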
We will use the 100 Everglades sequences and the ITERO-TEN subjects to compute agree and disagree rates for different numbers of subjects and repetitions. For each case, we will randomly select subjects and repetitions, and then compare those ratings to the ground truth data from each of the ground truth labs separately. This random selection will be repeated 50 times, for a total of 100 trials.
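The random selection of subjects and repetitions can be sketched as below. The nested data layout `ratings[subject][repetition][pvs]` and the function names are assumptions made for illustration; the sketch only covers the sub-sampling and MOS computation, not the subsequent comparison with each ground truth lab.

```python
import random


def subsample_mos(ratings, n_subjects, n_reps, rng):
    """MOS per PVS from a random subset of subjects and repetitions.

    ratings[subject][repetition][pvs] holds one rating.
    """
    subjects = rng.sample(sorted(ratings), n_subjects)
    n_available = len(ratings[subjects[0]])
    reps = rng.sample(range(n_available), n_reps)
    n_pvs = len(ratings[subjects[0]][0])
    mos = []
    for p in range(n_pvs):
        scores = [ratings[s][r][p] for s in subjects for r in reps]
        mos.append(sum(scores) / len(scores))
    return mos


def run_trials(ratings, n_subjects, n_reps, n_trials=50, seed=0):
    """Repeat the random selection n_trials times with a reproducible generator."""
    rng = random.Random(seed)
    return [subsample_mos(ratings, n_subjects, n_reps, rng) for _ in range(n_trials)]
```

Comparing each of the 50 resulting MOS vectors against two ground truth labs separately would give the 100 trials mentioned in the text.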
Based on this data, Table VI shows the likelihood, in percent, that a test with N subjects and R repetitions will have rates of agreement (≥ 52%) and disagreement (≤ 1%) equivalent to a conventional subjective test of 15 subjects. Table VII repeats this analysis for a 24-subject test, where equivalence requires agreement ≥ 66% and disagreement ≤ 1%. Each column contains a single number of subjects (e.g., “1S” means one subject, “4S” means four subjects).
From Tables VI and VII, let us choose experiment designs where the likelihood of equivalence is ≥ 95%. We will add one repetition beyond the minimum as a safety margin, because our statistics unrealistically assume that no other factors will cause the disagree rate to rise. These criteria, combined with our desire for a small number of subjects, identify the following experiment designs for 15 subjects:
3 subjects & 5 repetitions ≈ 15 subjects
4 subjects & 4 repetitions ≈ 15 subjects
5 subjects & 3 repetitions ≈ 15 subjects
And these experiment designs for 24 subjects:
5 subjects & 6 repetitions ≈ 24 subjects
6 subjects & 5 repetitions ≈ 24 subjects
The lower values in Table VII reinforce the theory that the 24-subject test is a higher standard of performance than the 15-subject test.
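The selection rule used above (the smallest number of repetitions whose likelihood of equivalence reaches 95%, plus one extra repetition as a safety margin) can be written as a short helper. The function name is hypothetical, and the numbers in the example are placeholders that do not reproduce Table VI or Table VII.

```python
def pick_design(likelihood_by_reps, threshold=95.0, margin=1):
    """Smallest repetition count meeting the threshold, plus a safety margin.

    likelihood_by_reps maps repetitions -> likelihood (percent) for one subject count.
    Returns None when no repetition count reaches the threshold.
    """
    for reps in sorted(likelihood_by_reps):
        if likelihood_by_reps[reps] >= threshold:
            return reps + margin
    return None


# Placeholder numbers for illustration only; these are NOT the values of Table VI.
example_column = {1: 40.0, 2: 80.0, 3: 96.0, 4: 99.0}
print(pick_design(example_column))  # 3 repetitions meet the threshold, so 3 + 1 = 4
```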
V-A FOWR test methodology
After analyzing the results obtained by the ITERO experiment, we can state that the main research question has been answered affirmatively: it is possible to use few observers with repetitions (FOWR) to obtain valid subjective scores, although with some limitations.
The experiment was designed to take place under conditions that are easy to replicate in any quality assessment laboratory: most subjects were actually staff of the laboratory, certainly including video experts, and the viewing conditions were not strict. In fact, from the observers who completed the ten repetitions, only 4 of them can be considered “naive subjects” with respect to video quality. They show similar results as the others, e.g. their combined performance for is PCC=0.96, RMSE=3.28, MOS05=0.89. Therefore we can argue that our conclusions apply to expert viewers, but they will probably be applicable to other kinds of observers as well.
To get good association results, it is enough to do the experiment with 4 subjects and 4 repetitions. Repetitions should be done on different days. For agreement and perceptual similarity, however, there is a problem with combined bias. Chances are good that 4 subjects are enough (median results still beat state-of-the-art metrics), but the distribution tails (5/95 percentiles) are worse. For safer results, 5-6 subjects should be used. In general, increasing the number of subjects will always improve the test result, while increasing the number of repetitions (beyond 4) will not.
Because accurate agreement cannot be guaranteed, the FOWR protocol cannot replace a full subjective assessment test. However, it can provide good enough results for a pre-test that helps prepare a full subjective test. It can also be used when no objective metric is available, with similar expected predictive capability. To put this into perspective, Pinson shows that some objective metrics perform equivalently to subjective tests of 24 subjects, when confidence intervals are used to make decisions.
V-B On the limits of the subject model
Subject model (1) assumes that ratings in a subjective test follow a specific random process. Our experiment confirms this hypothesis by showing that even the same subject repeating the same experiment generates different answers. These differences go beyond what the coarseness of the scale can explain, even for a very simple five-point scale. Future analysis should focus on better understanding and limiting answer randomness.
The experiment repetition provides interesting data about subject bias. First we see that we can estimate subject bias with limited precision, which is not surprising taking into account the typical precision of a psychological test. We see that for most subjects the bias is stable, so we can count it as an important model parameter. On the other hand, we showed that using different content to estimate bias does not work well. One explanation is that we need more sequences (see Fig. 5): 10 sequences are not enough to estimate bias for a 100-sequence experiment. Another is that the bias estimated with some sequences in one specific experiment cannot be used to predict the bias of those sequences within a different experiment. Besides, introducing just 10% additional sequences can affect the scores of the sequences under study.
Finally, we have also shown that there is a practical limitation on the ability to estimate the subject model parameters by repeating the same experiment several times. Repetitions do not allow estimating the subject's true opinion with the precision that would be expected from (3), because consecutive repetitions of the random variable for a given subject and sequence are not independent.
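One way to see why dependent repetitions limit precision is the textbook expression for the variance of an average of equally correlated variables. The symbols below ($\sigma^2$ for the per-repetition variance, $\rho$ for the correlation between repetitions, $R$ for the number of repetitions) are introduced here for illustration and are not the paper's notation; equal pairwise correlation is a simplifying assumption.

```latex
\operatorname{Var}\!\left(\frac{1}{R}\sum_{r=1}^{R} X_r\right)
  = \frac{\sigma^2}{R}\,\bigl(1 + (R-1)\,\rho\bigr)
  \;\xrightarrow[R \to \infty]{}\; \rho\,\sigma^2
```

With $\rho = 0$ this reduces to the usual $\sigma^2 / R$. With any positive correlation, the variance levels off at $\rho\,\sigma^2$ rather than vanishing, so additional repetitions bring diminishing returns.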
V-C Implications for traditional experiments
Even though there are limitations on the ability to compute the actual true opinion, we have assumed that we can obtain a reasonable estimate by averaging the results from all the available repetitions, as defined in (4). Additionally, Fig. 3 shows that subject opinion converges after a few repetitions, so that the 10th repetition is closer to the estimated true opinion than the first one. However, “traditional” experiments (e.g. based on ITU-T P.910) only ask for the first opinion of each subject. Would results be different if the experiment was performed with a better estimate of the true opinion, e.g. using several repetitions?
|Comparison||PCC||RMSE||MOS05|
|First vs Average||0.990||0.176||0.991|
|Last vs Average||0.995||0.101||1.000|
|First vs Last||0.981||0.230||0.963|
Fortunately, when considering the aggregate opinion of all the subjects, we have found no differences between the alternative estimates of the true opinion. Table VIII shows the pairwise comparison between the first repetition, the last one, and the average of all 10 sessions, for the 20 subjects in the ITERO-TEN subset: the coincidence is almost exact. However, the distinction may be relevant when modeling the behavior of individual subjects.
Additionally, there are some implications for the design of large multi-site subjective tests, such as the ones performed by VQEG under the HDTV or MM projects. Those tests traditionally use different source contents in each laboratory. However, there is typically a common set that appears in all the tests and is used to align the results across labs. Unfortunately, introducing those extra sequences may alter the scores of the sources under study in unexpected ways (see Fig. 8), particularly affecting agreement and perceptual equivalence of the experiments.
In this paper, we propose the FOWR experiment design, where a small number of subjects rate the same set of PVSs repeatedly, on different days. We prove that the FOWR experiment design is non-inferior to a conventional subjective test. By non-inferior, we mean the FOWR experiment design yields similar performance to a conventional experiment design, based on association, agreement, perceptual similarity, and confusion analysis.
We recommend the FOWR methodology for pilot studies (to indicate trends), for pre-tests, and as an alternative to objective metrics for laboratory applications. The FOWR experiment design is particularly valuable when an objective metric is not available (e.g., new technologies, camera capture). The FOWR method allows a small team to make quick and reasonably accurate quality assessments when the time and expense of subject recruitment are prohibitive.
For most applications, we recommend 4 subjects rating all stimuli 4 times on subsequent days. This experiment design is at least as good as the best objective metrics and will probably respond similarly to a 15 subject test.
There are intrinsic limitations on the protocol, particularly with respect to its capacity for agreement, as subject bias cannot be compensated. If accurate agreement is required, we recommend 5 subjects scoring 6 times or 6 subjects scoring 5 times. These experiment designs will probably respond similarly to a 24-subject test. We tested the FOWR protocol on expert subjects, who are more likely to be recruited for this type of in-house test.
Subject bias exists and is reasonably stable across time. However, subject bias is not uniform across sequences, either from the same or from different experiments. There is “behavioral correlation” (optimistic subjects tend to be more optimistic, average-wise). However, subjects who are “optimistic” with respect to one content sequence may be “pessimistic” with respect to another.
Our results also have some implications for the modeling of subjective score processes. First, the hypothesis that subject bias is independent of the PVS is only valid within a given subjective test; it is not stable across tests. In addition, when adding some sequences from test A into experiment B, those sequences will not only be evaluated differently from how they were originally rated in A, but also impact the evaluation of the sequences already present in B. This challenges the whole concept of “common set” in cross-lab experiments. And finally, when repeatedly rating the same set of sequences, subjects tend to converge to their true opinion after about 4 repetitions. When modeling the subject scoring process, this may conflict with traditional experimental design where subjects are instructed to rate PVSs that they have never seen before.
The authors want to thank the subjects who volunteered to repeat the same subjective assessment experiment ten times.
The authors also want to thank Aleks Zaleński (AGH) and Daniel Berjón (UPM) for their help with the test setup.
- (2014-01) Digital video concepts, methods, and metrics.
- (2018) Statistical quality of experience analysis: on planning the sample size and statistical significance testing. Journal of Electronic Imaging 27 (5), pp. 053013.
- (2018-12) Number of participants required for common designs in psychology: a power analysis. PsyArXiv.
- (2014) Crowdsourcing 2.0: enhancing execution speed and reliability of web-based QoE testing. In 2014 IEEE International Conference on Communications (ICC), pp. 1070–1075.
- (2010) Report on the validation of video quality models for high definition video content. http://www.its.bldrdoc.gov/media/4212/vqeg_hdtv_final_report_version_2.0.zip.
- (2017) No silver bullet: QoE metrics, QoE fairness, and user diversity in the context of QoE management. In 2017 Ninth International Conference on Quality of Multimedia Experience (QoMEX), pp. 1–6.
- (2014) Best practices for QoE crowdtesting: QoE assessment with crowdsourcing. IEEE Transactions on Multimedia 16 (2), pp. 541–558.
- (2010) Study of rating scales for subjective quality assessment of high-definition video. IEEE Transactions on Broadcasting 57 (1), pp. 1–14.
- (2012-01) Experimental design and analysis.
- (2019) Notation for subject answer analysis. arXiv preprint arXiv:1903.05940.
- (2015) The accuracy of subjects in a quality experiment: a theoretical subject model. IEEE Transactions on Multimedia 17 (12), pp. 2210–2224.
- (2016) On the accuracy of objective image and video quality models: new methodology for performance evaluation. In 2016 Eighth International Conference on Quality of Multimedia Experience (QoMEX), pp. 1–6.
- (1987) Association, agreement, and equity. Quality and Quantity 21 (2), pp. 109–123.
- (2017) Performance of four subjective video quality assessment protocols and impact of different rating preprocessing and analysis methods. IEEE Journal of Selected Topics in Signal Processing 11 (1), pp. 48–63.
- (2016) Toward a practical perceptual video quality metric. The Netflix Tech Blog 6.
- (2017) Recover subjective quality scores from noisy measurements. In Data Compression Conference (DCC), 2017, pp. 52–61.
- (2018) Data analysis in multimedia quality assessment: revisiting the statistical tests. IEEE Transactions on Multimedia 20 (8), pp. 2063–2072.
- (2019) Subjective assessment of adaptive media playout for video streaming. In 2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX), pp. 1–6.
- (2012-10) The influence of subjects and environment on audiovisual subjective tests: an international study. IEEE Journal of Selected Topics in Signal Processing 6 (6), pp. 640–651.
- (2013-07) Subjective and objective evaluation of an audiovisual subjective dataset for research and development. In 2013 Fifth International Workshop on Quality of Multimedia Experience (QoMEX), pp. 30–31.
- (2013) Selecting scenes for 2D and 3D subjective video quality tests. EURASIP Journal on Image and Video Processing 2013 (1), pp. 50.
- (2018) ITS4S: a video quality dataset with four-second unrepeated scenes. Technical report, Institute for Telecommunication Sciences / NTIA. NTIA Technical Memo TM-18-532.
- (2014) AGH/NTIA: a video quality subjective test with repeated sequences. NTIA Technical Report TM-14-505.
- (2019) ITS4S2: an image quality dataset with unrepeated images from consumer cameras. NTIA Technical Report TM-19-537.
- (2020) Confidence intervals for subjective tests and objective metrics that assess image, video, speech, or audiovisual quality. NTIA Technical Report TM-21-550.
- (2007-09) Understanding power and rules of thumb for determining sample size. Tutorials in Quantitative Methods for Psychology 3.
- (2019) Fundamental advantages of considering quality of experience distributions over mean opinion scores. In 2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX), pp. 1–6.
- (2010) Objective and subjective quality assessment with expert and non-expert viewers. In 2010 Second International Workshop on Quality of Multimedia Experience (QoMEX), pp. 46–51.
- (2015) Measuring video quality in the network: from quality of service to user experience. In 9th International Workshop on Video Processing and Consumer Electronics (VPQM 2015), pp. 5–6.
- (2016-03-01) Mean opinion score (MOS) revisited: methods and applications, limitations and alternatives. Multimedia Systems 22 (2), pp. 213–227.
- (2018) A user model for JND-based video quality assessment: theory and applications. In Applications of Digital Image Processing XLI, Vol. 10752, pp. 107520M.
- (2009-07) On the properties of subjective ratings in video quality experiments. In 2009 International Workshop on Quality of Multimedia Experience, pp. 139–144.
- (2011) Learning to predict the perceived visual quality of photos. In 2011 International Conference on Computer Vision, pp. 225–232.