1 Introduction
Preference aggregation from annotators’ pairwise labels on test candidates is a traditional but still active research topic. As the name implies, its objective is to infer the underlying rating or ranking of the test candidates from the binary labels given by annotators (users or players), e.g. “which one is better?”. With the availability of big data, preference aggregation from pairwise labels has been widely applied in recommendation systems (for movies, music, news, books, research articles, restaurants and products, according to users’ preference selections), in social networks for aggregating social opinions, and in sports, chess and online games to infer the global ranking of the players.
In some applications, such as game player matching systems (e.g. MSR’s TrueSkill system [herbrich2007trueskill]), friend-making websites and subjective image/video quality assessment (IQA/VQA) [li2012analysis], discovering the underlying scores of the test candidates is more important than the rank order, because the scores tell the system the intensity of the users’ preferences. The system can then assign well-matched opponents to online game players, recommend possible friends who share the user’s interests, or quantitatively evaluate the performance of different coding/rendering/display techniques in the IQA/VQA domain. However, as the number of test candidates grows, which is happening nowadays, the number of required pairwise comparisons grows quadratically, which makes a full test infeasible. Thus, there is an urgent need to reduce the number of pairwise comparisons, that is, to select only a subset of the pairs without losing aggregation accuracy.
In this paper, we present a hybrid active sampling strategy for pairwise labeling based on the Bradley-Terry (BT) model [bradley1952rank], which converts pairwise preference data to scale values. This work considers not only inferring the ranking but also recovering the underlying rating. The term “hybrid” indicates that different sampling strategies are used depending on the test budget. An active learning recipe is adopted in our strategy by maximizing the information gain according to Lindley’s Bayesian optimal framework [lindley1956measure]. To capture the latent rating information, the minimum spanning tree (MST) is employed, where the pairwise comparison is considered as an undirected graph. The MST guarantees a strong connection and eventually leads to higher prediction precision with the BT model. In addition, the MST allows for a parallel implementation of pairwise comparison through a crowdsourcing platform (such as Amazon MTurk), i.e. multiple annotators can work at the same time. Source code is publicly available on GitHub (https://github.com/jingnantes/hybridmst).
The main contributions of our work are highlighted as follows:
1) Batch mode facility: when the number of test candidates is $n$, the proposed HybridMST active sampling strategy allows for $n-1$ parallel pairwise comparisons each time.
2) Error tolerance: we do not model the annotators’ behavior in this work; however, the use of the MST tolerates, to some extent, malicious labeling from spammers (who give wrong or random answers).
3) Low computational complexity: compared to the state-of-the-art method that considers numerous parameters and deals with both active sampling and noise removal (e.g. CrowdBT [chen2013pairwise]), HybridMST has much lower time complexity.
4) Application flexibility: HybridMST is applicable in all conditions where aggregation of ranking, rating or both is required. It can be conducted both in a small-scale lab test environment and on a large-scale crowdsourcing platform.
The remainder of this paper is organized as follows. Related work is introduced in Section 2. The proposed HybridMST strategy is presented in Section 3, with both theoretical analysis and Monte Carlo simulation analysis. Extensive experimental validation on simulated and real-world datasets is reported in Section 4. Finally, Section 5 concludes this work.
2 Related Work
In real applications of preference aggregation, an annotator’s label can be explicit, for instance a Likert scale score from “excellent” to “bad”, or implicit, e.g. a pairwise comparison vote on two test candidates. Explicit labels are more likely to be inconsistent [negahban2012iterative] [Liexplore] and noisy due to diverse influence factors [qualinetwhite]. According to a well-known phenomenon in the psychological study of human choice, “human response to comparison questions is more stable in the sense that it is not easily affected by irrelevant alternatives” [ailon2009reconciling]. Obtaining labels from pairwise comparison is thus a more appealing approach for labeling applications that involve human participants, such as image quality assessment. Nevertheless, whatever the type of pairwise comparison, pairwise labels still suffer from noise from a variety of sources, such as the human annotator’s expertise, the emotional state of the players in a match, or the environment (external factors) of the competition venue. In such cases, the challenge becomes how to invert this implicit and, in most cases, noisy pairwise data back to the true global ranking or rating.
Several models have been proposed to explain the relation between pairwise comparison responses and a ranking/rating scale, including earlier heuristic methods such as the Borda Count [emerson2013original], and the currently widely used probabilistic permutation models such as the Plackett-Luce (PL) model [plackett1975analysis] [luce2005individual], the Mallows model [mallows1957non], the Bradley-Terry (BT) model [bradley1952rank], and the Thurstone-Mosteller (TM) model [thurstone1927law]. When facing large-scale data with sparse labels, these models may have computational complexity or parameter estimation issues. Thus, in the machine learning community, numerous studies have focused on optimizing the parameters of these models [azari2012random] [lu2011learning], designing efficient algorithms [soufiani2013generalized] [freund2003efficient], providing sharp minimax bounds [shah2016estimation] and proposing novel aggregation models [ailon2009reconciling] [crammer2002pranking] [qin2010new]. Meanwhile, other studies aim to develop novel models that infer the latent scores of the test candidates from pairwise data and eventually obtain the rank ordering [negahban2012iterative] [dangauthier2008trueskill] [cortes2007magnitude] [wauthier2013efficient].
It is well known that pairwise comparison needs a large number of pairwise labels to infer the ranking, which is very time consuming in most applications. A straightforward way to speed up the pairwise labeling procedure is data sampling. The simplest pair sampling strategy is random sampling. Examples are the “balanced subset” method proposed by Dykstra [dykstra1960rank], in which the test candidates are arranged in a form (a triangle or a rectangular matrix) and only subsets of the test candidates are compared, and the HRRG (HodgeRank on Random Graph) method proposed by Xu et al. [xu2012hodgerank], in which a random graph is used and only connected vertices are compared, while a Hodge theory based rank model (HodgeRank) converts the sparse pairwise data to scale ratings. Another way to sample pairs is based on the empirical observation that comparing closer/similar pairs is more important than comparing distant pairs. In [silverstein1998quantifying], the authors proposed applying sorting algorithms to sample pairs. In [Liboosting] [li2013subjective], Li et al. proposed an Adaptive Rectangular Design (ARD) that adaptively and iteratively selects pairs based on the estimated rank ordering of the test candidates.
To further improve the aggregation performance, recent studies have focused on active learning for information retrieval. In [jamieson2011active], the authors exploit the underlying low-dimensional Euclidean space of the data to discover the ranking using a small number of pairwise comparisons. Other studies focus on selecting the pairs that generate the maximum information gain, as defined by a utility function. In [pfeiffer2012adaptive], the sampling strategy is based on the TM model and employs the Bayesian optimization framework, while Chen et al. [chen2013pairwise] (CrowdBT) use the BT model and also consider the annotator’s influence. Xu et al. [xu2017hodgerank] (Hodgeactive) employ the HodgeRank model together with Bayesian information maximization to actively select the pair.
Active learning based sampling methods have demonstrated outstanding performance on different datasets. However, they still have at least one of the following drawbacks: 1) The sampling procedure is a sequential decision process, which means the next pair can only be generated once the previous observation is finished. Such a sequential mode is not suitable for large-scale (e.g. crowdsourcing) experiments, in which many conditions are tested in parallel. 2) Most of the proposed methods focus on ranking aggregation, which might not be accurate enough for applications that require rating scores. 3) The annotators’ unreliability in labeling the pairwise data should be considered in the active learning process; in other words, the active sampling strategy should be robust to observation errors. A straightforward way is to model the annotator’s behavior, as done in the CrowdBT method [chen2013pairwise]. However, this is computationally expensive.
To resolve the challenges mentioned above, in this paper we propose a hybrid active sampling strategy that allows for batch-mode labeling and is robust to annotators’ random/inverse labeling behavior when inferring the scale ratings. Details are given in the following sections.
3 Proposed Methodology
Let us assume that we have $n$ objects $A_1, A_2, \ldots, A_n$ to test in a pairwise comparison experiment. The underlying quality scores of these objects are $s = (s_1, s_2, \ldots, s_n)$. In an experiment, the annotator’s observed score for object $A_i$ is $r_i$, a random variable $r_i = s_i + \epsilon_i$, where the noise term is a Gaussian random variable $\epsilon_i \sim \mathcal{N}(0, \sigma_i^2)$. In a single trial, if $r_i > r_j$, then the annotator selects $A_i$ over $A_j$, and the outcome is registered as $y_{ij} = 1$. If $r_i < r_j$, then $y_{ij} = 0$. If $r_i = r_j$, $y_{ij}$ is randomly assigned 0 or 1 (in a real test, the annotator would make a random selection in this case). The probability of selecting $A_i$ over $A_j$ is denoted as $\Pr(A_i \succ A_j)$.
3.1 Preference aggregation model
There are already several well-known models that convert pairwise probability data to cardinal scale ratings, as mentioned above. In this study, we choose the BT model as an example. This work could, however, be easily extended to the generalized linear model (GLM), in which the BT model is the logit case and the TM model is the probit case.
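According to the BT model, for any two objects $A_i$ and $A_j$, the probability that $A_i$ is preferred over $A_j$, i.e. $\Pr(A_i \succ A_j)$, can be represented as:
(1) $\Pr(A_i \succ A_j) \triangleq \pi_{ij} = \frac{\pi_i}{\pi_i + \pi_j}$
where $\pi_i$ is the merit of the object $A_i$. The relationship between the underlying score $s_i$ and $\pi_i$ is $s_i = \log \pi_i$; thus, we obtain:
(2) $\pi_{ij} = \frac{e^{s_i}}{e^{s_i} + e^{s_j}} = \frac{1}{1 + e^{-(s_i - s_j)}}$
Since what we measure is a distance between two objects, there are in total $n-1$ free parameters to be estimated. To infer the parameters of the BT model, Maximum Likelihood Estimation (MLE) is adopted in this study. Given the pairwise comparison results arranged in a matrix $M = (m_{ij})_{n \times n}$, where $m_{ij}$ is the total number of trial outcomes $A_i \succ A_j$, the likelihood function takes the form:
(3) $L(s \mid M) = \prod_{i < j} \pi_{ij}^{m_{ij}} (1 - \pi_{ij})^{m_{ji}}$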
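As an illustration of how the BT likelihood can be maximized numerically, the following minimal sketch uses the classical minorization-maximization (Zermelo) iteration on the merits. This is one of several possible solvers and is an assumption made here for illustration, not necessarily the solver used by the authors; the function names and the example counts are hypothetical.

```python
import math


def bt_probability(s_i, s_j):
    """BT probability that object i is preferred over object j (logistic form)."""
    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))


def bt_mle(M, iterations=500):
    """Estimate BT scores from a count matrix M, where M[i][j] is the number
    of trials in which object i was preferred over object j.

    Assumes every object has at least one win and one loss, so that the
    maximum likelihood estimate exists. Returns centred log-merits.
    """
    n = len(M)
    pi = [1.0] * n
    wins = [sum(M[i]) for i in range(n)]  # total wins of each object
    for _ in range(iterations):
        new_pi = []
        for i in range(n):
            denom = sum((M[i][j] + M[j][i]) / (pi[i] + pi[j])
                        for j in range(n) if j != i)
            new_pi.append(wins[i] / denom)
        total = sum(new_pi)
        pi = [p * n / total for p in new_pi]  # fix the scale of the merits
    s = [math.log(p) for p in pi]
    mean_s = sum(s) / n
    return [x - mean_s for x in s]  # only score differences are identifiable


# Hypothetical counts for three objects (illustrative, not from the paper).
M = [[0, 8, 9],
     [2, 0, 7],
     [1, 3, 0]]
scores = bt_mle(M)
```

Centring the scores reflects the fact that only $n-1$ parameters are free: adding a constant to all scores leaves every pairwise probability unchanged.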
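Replacing $\pi_{ij}$ by $1/(1 + e^{-(s_i - s_j)})$ and maximizing the log-likelihood function $\log L(s \mid M)$, we obtain the MLEs $\hat{s} = (\hat{s}_1, \ldots, \hat{s}_n)$. Generally, there is no closed-form solution for the MLEs, and they are found numerically. The MLEs $\hat{s}$ approximately follow a multivariate Gaussian distribution. The covariance matrix $\hat{\Sigma}$ can be estimated using the Hessian matrix of the log-likelihood [bradley1955rank] (see Appendix A.1). Thus, for a given pairwise observation $M$, we obtain the approximate prior information on $s$, i.e. $s \sim \mathcal{N}(\hat{s}, \hat{\Sigma})$.
3.2 Active learning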
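The purpose of active learning is to gain as much information as possible from the observations. Given the prior information, the next pair (or pairs) selected should provide more information than any other. A utility function is thus defined to measure this expected information gain (EIG). Generally, the Kullback-Leibler divergence (KLD) between the prior distribution and the posterior distribution of $s$ is used as the utility function [chen2013pairwise] [pfeiffer2012adaptive]. In contrast, in this study we use the local pair distribution rather than the global multivariate distribution to calculate the EIG.
According to the MLEs based on the current observations, $s \sim \mathcal{N}(\hat{s}, \hat{\Sigma})$. For a pair $\{A_i, A_j\}$, the score distance between them is $s_{ij} = s_i - s_j \sim \mathcal{N}(\hat{s}_{ij}, \sigma_{ij}^2)$, where $\hat{s}_{ij} = \hat{s}_i - \hat{s}_j$ and $\sigma_{ij}^2 = \hat{\Sigma}_{ii} + \hat{\Sigma}_{jj} - 2\hat{\Sigma}_{ij}$. The EIG of the pair, $U_{ij}$, is defined as the expected KLD between the prior distribution and the posterior distribution of $s_{ij}$, that is:
(4) $U_{ij} = \sum_{y_{ij}} p(y_{ij}) \int p(s_{ij} \mid y_{ij}) \log \frac{p(s_{ij} \mid y_{ij})}{p(s_{ij})} \, ds_{ij}$
where $p(s_{ij})$ is the prior density and $p(s_{ij} \mid y_{ij})$ is the posterior density given the outcome $y_{ij}$ ($y_{ij} = 1$ if $A_i \succ A_j$, otherwise $y_{ij} = 0$). According to Bayes’ theorem, $p(s_{ij} \mid y_{ij}) = p(y_{ij} \mid s_{ij}) \, p(s_{ij}) / p(y_{ij})$, so Equation (4) can be rewritten as:
(5) $U_{ij} = \sum_{y_{ij}} \int p(y_{ij} \mid s_{ij}) \, p(s_{ij}) \log p(y_{ij} \mid s_{ij}) \, ds_{ij} - \sum_{y_{ij}} p(y_{ij}) \log p(y_{ij})$
where $p(y_{ij} \mid s_{ij})$ is the conditional probability of the outcome $y_{ij}$ given $s_{ij}$. We define $p_{ij} = p(y_{ij} = 1 \mid s_{ij})$ and $q_{ij} = p(y_{ij} = 0 \mid s_{ij})$; thus, we have $p_{ij} = 1/(1 + e^{-s_{ij}})$ and $q_{ij} = 1 - p_{ij}$. Equation (5) can then be rewritten in a tractable form:
(6) $U_{ij} = E(p_{ij} \log p_{ij}) + E(q_{ij} \log q_{ij}) - E(p_{ij}) \log E(p_{ij}) - E(q_{ij}) \log E(q_{ij})$
where $E(\cdot)$ is the expectation taken w.r.t. the prior distribution, i.e. $s_{ij} \sim \mathcal{N}(\hat{s}_{ij}, \sigma_{ij}^2)$. For instance, the first term in Equation (6) can be written in the form:
(7) $E(p_{ij} \log p_{ij}) = \int \frac{1}{1 + e^{-x}} \log\left(\frac{1}{1 + e^{-x}}\right) \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} e^{-\frac{(x - \hat{s}_{ij})^2}{2\sigma_{ij}^2}} \, dx$
This form allows us to use Gauss-Hermite quadrature [davis2007methods] for approximation, which reduces the computational complexity dramatically. In our study, 30 sample points are used for estimation. An example of the contour plot and mesh plot of the EIG under different mean and standard deviation conditions is shown in Figure 1. According to this figure, pairs that have similar scores, or whose score differences have high uncertainty, generate more information, which is consistent with the studies in [silverstein1998quantifying] [Liboosting].
3.3 Hybrid pair selection strategy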
Now, based on the current observations, we could estimate the EIG for all pairs. The next step is to study how to select the pair/pairs based on the EIG.
3.3.1 Global Maximum (GM) method
A conventional way of active sampling is to select the pair that provides the highest EIG [chen2013pairwise] [pfeiffer2012adaptive] [xu2017hodgerank] [ye2014active], that is:
(8) $E_{\mathrm{GM}} = \arg\max_{(i,j),\, i \neq j} U_{ij}$
However, as discussed above, this is a sequential sampling strategy, which has limitations in real applications such as large-scale data processing or crowdsourcing platforms, where parallel execution is necessary. Thus, a method that allows for batch-mode implementation is considered.
3.3.2 Minimum Spanning Tree (MST) method
Pairwise comparison can be considered as an undirected graph $G = (V, E)$, where the vertices $V$ represent the test candidates and the edges $E$ represent whether or not the pairs are compared. In our study, the vertex set contains the $n$ test candidates and the edge set contains all candidate pairs. The edge weights $w_{ij}$ are the inverse of the EIG of the candidate pairs, i.e. $w_{ij} = 1/U_{ij}$.
An MST is a subset of the edges of a connected, edge-weighted undirected graph that connects all the vertices together, without any cycles and with the minimum possible total edge weight. The characteristics of an MST include:

If there are $n$ vertices in the graph, then each spanning tree has $n-1$ edges.

If each edge has a distinct weight, then there is exactly one MST.

If the weights are positive, then an MST is a minimum-cost subgraph connecting all vertices.
Thus, the MST facilitates batch-mode labeling in real applications, guarantees a strong connection over all test candidates and, because the edge weights are inverse EIGs, favors pairs with a high information gain. The pair selection criterion based on the MST method is:
(9) $E_{\mathrm{MST}} = \arg\min_{T} \sum_{(i,j) \in T} w_{ij}$, where $T$ ranges over the spanning trees of $G$
In this study, we use Prim’s algorithm [prim1957shortest] to find the MST, as it is well suited to dense graphs. An example of an undirected weighted graph and its MST is shown in Figure 2.
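The MST-based batch selection can be sketched as follows. This is a minimal heap-based version of Prim’s algorithm on the complete graph with weights equal to the inverse EIG; the function name `mst_pairs` and the EIG values are hypothetical, and the authors’ implementation may differ.

```python
import heapq


def mst_pairs(eig):
    """Return the n-1 pairs forming the minimum spanning tree of the complete
    graph whose edge weights are 1 / eig[i][j] (Prim's algorithm).

    Assumes eig is symmetric with strictly positive off-diagonal entries.
    """
    n = len(eig)
    in_tree = [False] * n
    in_tree[0] = True
    # Heap entries: (weight, vertex already in the tree, candidate vertex).
    heap = [(1.0 / eig[0][j], 0, j) for j in range(1, n)]
    heapq.heapify(heap)
    edges = []
    while heap and len(edges) < n - 1:
        weight, i, j = heapq.heappop(heap)
        if in_tree[j]:
            continue  # this edge would close a cycle
        in_tree[j] = True
        edges.append((min(i, j), max(i, j)))
        for k in range(n):
            if not in_tree[k]:
                heapq.heappush(heap, (1.0 / eig[j][k], j, k))
    return edges


# Hypothetical symmetric EIG values for four objects (illustrative only).
eig = [[0.0, 0.9, 0.2, 0.1],
       [0.9, 0.0, 0.8, 0.3],
       [0.2, 0.8, 0.0, 0.7],
       [0.1, 0.3, 0.7, 0.0]]
batch = mst_pairs(eig)
```

The returned pairs can be handed to different annotators at the same time, since none of them depends on the outcome of another pair in the batch.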
3.3.3 Threshold setting
In this section, we analyze the performance of the GM and MST methods. Firstly, in the GM method, we initialize the pair comparison matrix $M$ so that the BT model can be solved [chen2013pairwise]. Then, we design a Monte Carlo simulation experiment with 10, 16, 20 and 40 test objects. The underlying scores are uniformly distributed from 1 to 5, with noise $\epsilon_i \sim \mathcal{N}(0, \sigma_i^2)$, where $\sigma_i$ is uniformly distributed between 0 and 0.7. In a simulated test, if the sampled score $r_i$ is higher than $r_j$, then $A_i$ is selected over $A_j$. We also model the observation errors that might happen in a real test, i.e. the annotator makes a mistake (inverting the vote) during the test. The probabilities of observation errors are set to 10%, 20%, 30% and 40%. Therefore, there are in total 16 simulated tests, and each test is repeated 100 times.
To evaluate the aggregation performance of GM and MST, the Pearson Linear Correlation Coefficient (PLCC) and Kendall’s tau coefficient (Kendall) between the designed ground truth scores and the MLE scores obtained by the BT model are calculated. For easier illustration, in the following sections we define 1 standard trial number as the total number of comparisons that one observer needs to make in a Full Pair Comparison (FPC), that is, for $n$ objects, 1 standard trial number equals $n(n-1)/2$ comparisons.
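The simulated annotator described above can be sketched as follows. This is a minimal illustration of the stated setup (uniform ground-truth scores, Gaussian observation noise, vote inversion with a given error probability); the function name, the seed and the choice of 10 objects with a 10% error rate are assumptions made for the example.

```python
import random


def simulate_vote(s_i, s_j, sigma_i, sigma_j, error_rate, rng):
    """Return 1 if the simulated annotator prefers object i over object j."""
    r_i = rng.gauss(s_i, sigma_i)  # noisy perceived score of object i
    r_j = rng.gauss(s_j, sigma_j)  # noisy perceived score of object j
    if r_i == r_j:
        vote = rng.randint(0, 1)  # tie: pick at random
    else:
        vote = 1 if r_i > r_j else 0
    if rng.random() < error_rate:
        vote = 1 - vote  # observation error: the vote is inverted
    return vote


rng = random.Random(0)
n = 10
scores = [rng.uniform(1.0, 5.0) for _ in range(n)]  # ground-truth scores
sigmas = [rng.uniform(0.0, 0.7) for _ in range(n)]  # per-object noise level

# One full pair comparison (one standard trial) with a 10% error rate.
M = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        if simulate_vote(scores[i], scores[j], sigmas[i], sigmas[j], 0.1, rng):
            M[i][j] += 1
        else:
            M[j][i] += 1
```

After this loop, `M` holds exactly one vote per pair, i.e. $n(n-1)/2$ comparisons, which is one standard trial number in the terminology used here.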
By running Student’s t-test on the performance of the GM and MST methods and checking whether they differ significantly (and which one is better), we find that, in general, the GM method performs better than the MST method when the standard trial number is less than 1. As the number of comparisons increases, the MST method performs better than the GM method, especially when the observation errors are large.
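To benefit from both the GM and MST methods, we develop a hybrid active sampling strategy with 1 standard trial number as the switching threshold, i.e.:
(10) $E_{\mathrm{Hybrid}} = \begin{cases} E_{\mathrm{GM}}, & \text{if the standard trial number} \leq 1 \\ E_{\mathrm{MST}}, & \text{otherwise} \end{cases}$
where $E_{\mathrm{GM}}$ and $E_{\mathrm{MST}}$ denote the pairs selected by the GM method (Equation (8)) and the MST method (Equation (9)), respectively.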
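The switching rule can be sketched in a few lines. The function names are hypothetical, and the exact handling of the boundary (whether the switch happens at or just after one completed standard trial) is an assumption made here.

```python
def standard_trial(n):
    """Number of comparisons in one full pair comparison of n objects."""
    return n * (n - 1) // 2


def sampling_mode(num_comparisons_done, n):
    """Choose the sampling mode from the budget spent so far.

    Assumption: GM is used until one standard trial has been completed,
    and MST is used afterwards.
    """
    if num_comparisons_done < standard_trial(n):
        return "GM"   # sequential: pick the single pair with the highest EIG
    return "MST"      # batch: pick the n-1 pairs of the minimum spanning tree
```

For example, with 10 objects the strategy stays in GM mode for the first 45 comparisons and then switches to MST mode.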
The whole HybridMST sampling strategy is summed up in Algorithm 1.
4 Experiments
4.1 Simulated dataset
In this experiment, the proposed method is compared with state-of-the-art methods including FPC [P910], ARD [Liboosting], HRRG [xu2011random], CrowdBT [chen2013pairwise], and Hodgeactive [xu2017hodgerank]. A Monte Carlo simulation is conducted on 60 conditions (stimuli) whose scores are randomly selected from a uniform distribution on the interval [1, 5]. The assumptions are exactly the same as in the experiment of Section 3.3.3, and the observation error is set to 10%.
To obtain statistically reliable results, the simulation experiment is repeated 100 times. The relationship between the ground truth and the estimated scores is evaluated by Kendall, PLCC, and the Root Mean Square Error (RMSE). Results are shown in Figure 3. It should be noted that, because the PLCC, Kendall and RMSE values increase or decrease quickly and appear to saturate when the trial number is large, it is difficult to visually distinguish the performance of the different methods. Thus, in this paper, we rescale the Kendall and PLCC values by the Fisher transformation, i.e. $z = \frac{1}{2} \ln \frac{1 + r}{1 - r}$ for a coefficient $r$, and the RMSE values by a separate rescaling function.
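For reference, the evaluation measures and the Fisher transformation can be sketched in plain Python. This is an illustrative implementation (Kendall’s tau is computed in its simple tau-a form, which ignores tie corrections), not the authors’ evaluation code.

```python
import math


def fisher_z(r):
    """Fisher transformation of a correlation coefficient r in (-1, 1)."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


def plcc(x, y):
    """Pearson linear correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            total += (sign > 0) - (sign < 0)
    return total / (n * (n - 1) / 2)


def rmse(x, y):
    """Root mean square error between two equal-length lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))
```

The transformation stretches values close to 1: coefficients of 0.90 and 0.99 differ by 0.09 on the raw scale but by more than 1 after the transformation, which makes saturated curves easier to tell apart.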
Qualitative analysis
Under the condition that each annotator inverts the vote with a probability of 10%, Figure 3 shows that Hodgeactive has the strongest performance in ranking aggregation (Kendall) when the test budget (i.e. the number of comparisons) is small. As the trial number increases, the proposed HybridMST method and CrowdBT show performance comparable to Hodgeactive. Regarding rating aggregation (PLCC and RMSE), the proposed HybridMST method performs significantly better than the others, except when the trial number is small, i.e. less than 2 or 3, where Hodgeactive performs slightly better than HybridMST. CrowdBT shows performance similar to ARD in rating aggregation, which is lower than HybridMST and Hodgeactive but higher than HRRG.
Saving budget compared to FPC
Following ITU-R BT.500 [BT500] and ITU-T P.910 [P910], 15 standard trial numbers (i.e. 15 annotators each comparing all pairs) is the minimum requirement for FPC to generate reliable results. In this part, we compare how much budget can be saved by the active sampling methods, i.e. HybridMST, Hodgeactive, and CrowdBT. The means of Kendall, PLCC and RMSE are used as follows: if $m$ pairwise comparisons in HybridMST/Hodgeactive/CrowdBT achieve the same precision as FPC with 15 standard trial numbers, the saving budget is:
(11) $\eta = \left(1 - \frac{m}{15 \cdot n(n-1)/2}\right) \times 100\%$
The obtained $\eta$ for Kendall, PLCC and RMSE are 77.11%, 74.89% and 74.89% for HybridMST, and 84.57%, 68.61% and 71.65% for Hodgeactive, respectively. CrowdBT only has an $\eta$ value for Kendall, which is 78.43%, because it needs more trials than FPC to achieve the same precision in PLCC and RMSE, which does not save budget.
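The budget saving can be computed with a one-line helper. The formula below is an assumption reconstructed from the description above (the fraction of the 15-trial FPC budget that is no longer needed); the function name and the example numbers are hypothetical.

```python
def saving_budget(num_comparisons, n, fpc_trials=15):
    """Fraction of the FPC budget saved when `num_comparisons` comparisons
    reach the same precision as FPC with `fpc_trials` standard trials.

    Assumed definition: 1 - m / (fpc_trials * n * (n - 1) / 2).
    """
    fpc_comparisons = fpc_trials * n * (n - 1) // 2
    return 1.0 - num_comparisons / fpc_comparisons


# Illustrative case: 60 stimuli, the same precision reached after
# 3 standard trials (3 * 1770 = 5310 comparisons) instead of 15.
example = saving_budget(5310, 60)
```

In this illustrative case the active method needs one fifth of the FPC comparisons, so the saving is 80%.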
Computational cost
To evaluate the computational cost of each sampling method, the same Monte Carlo simulation test is conducted for $n = 10$, $20$ and $100$. The average time cost (milliseconds/pair) over 100 repetitions for each method is shown in Table 1. All computations are done using MATLAB R2014b on a MacBook Pro laptop with a 2.5 GHz Intel Core i5 and 8 GB of memory.
FPC is the simplest method, without any learning process, and therefore has the highest computational efficiency. ARD, HRRG and Hodgeactive also show advantages in runtime. CrowdBT shows a runtime similar to our HybridMST in GM mode. When HybridMST is in MST mode, it is approximately $n-1$ times more efficient than CrowdBT and the GM method. It should be noted that our proposed HybridMST method only uses the GM method in the first standard trial (which is easily reached in a large-scale crowdsourcing labeling experiment) and then switches to the MST method. Thus, in real applications, our sampling strategy is in MST mode in most cases, which is much faster than CrowdBT. Nevertheless, all runtimes are in a feasible range, even for a large number of conditions and our unoptimized code.
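Table 1: Average runtime (milliseconds/pair) of each sampling method for different numbers of objects $n$.
| $n$ | FPC | ARD | HRRG | CrowdBT | Hodgeactive | HybridMST (GM) | HybridMST (MST) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 10 | 0.11 | 1.24 | 0.38 | 85.69 | 0.34 | 48.72 | 6.16 |
| 20 | 0.10 | 0.62 | 0.34 | 188.56 | 0.22 | 153.61 | 8.97 |
| 100 | 0.10 | 0.16 | 0.65 | 3033.02 | 0.65 | 3007.08 | 30.04 |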
To demonstrate the superiority of batch-mode sampling in real applications, we take a typical VQA experiment as an example (the same reasoning holds for player matching systems, recommendation systems, etc.). The typical presentation structure of sequential sampling methods (HRRG, CrowdBT, Hodgeactive, GM) for one pair comparison is: pair presentation time ($t_p$) + annotator’s voting time ($t_v$) + runtime of the pairwise sampling algorithm ($t_s$), where $t_p$ and $t_v$ are generally 15 seconds in total, and $t_s$ is determined by the algorithm used. Sequential sampling methods cannot generate a new optimal pair to compare until the annotator is done with the previous pair. This introduces unacceptable delay in the system if multiple annotators work at the same time.
In contrast, the batch-based HybridMST (in MST mode) generates multiple pairs at once, which can be labeled in parallel by multiple annotators. Ideally (annotators work synchronously), the whole procedure for $n-1$ pairs needs about one presentation and voting period plus the runtime of the sampling algorithm. In the worst case, the annotators work one after the other (just as in the sequential methods), and the presentation and voting times of all pairs add up. For comparison, the time cost of a whole pairwise comparison procedure, including stimulus presentation time and voting time in a typical VQA experiment, is shown in Table 2, which demonstrates that HybridMST is particularly applicable in large-scale crowdsourcing experiments.
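Table 2: Time cost (seconds) of a whole pairwise comparison procedure, including presentation and voting time, in a typical VQA experiment.
| $n$ | CrowdBT | Hodgeactive | HybridMST (GM) | HybridMST (MST, ideal case) | HybridMST (MST, worst case) |
| --- | --- | --- | --- | --- | --- |
| 10 | 135.8 | 135.0 | 135.4 | 15.1 | 135.1 |
| 20 | 288.6 | 285.0 | 287.8 | 15.2 | 285.2 |
| 100 | 1782.0 | 1485.1 | 1782.0 | 17.9 | 1487.9 |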
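The entries of Table 2 appear consistent with a simple timing model for labeling $n-1$ pairs, sketched below. The model is an assumption inferred from Tables 1 and 2 rather than a formula stated in the text, and the function names are hypothetical; it reproduces the $n = 10$ row to within rounding, while entries for larger $n$ match only approximately.

```python
PRESENT_AND_VOTE_S = 15.0  # presentation + voting time per pair, in seconds


def sequential_time(n, runtime_ms_per_pair):
    """Total time (s) for n-1 pairs when each pair waits for the previous one."""
    pairs = n - 1
    return pairs * (PRESENT_AND_VOTE_S + runtime_ms_per_pair / 1000.0)


def batch_time(n, runtime_ms_per_pair, parallel=True):
    """Total time (s) for one MST batch of n-1 pairs.

    parallel=True: all pairs are labeled at the same time (ideal case).
    parallel=False: annotators work one after the other (worst case).
    """
    pairs = n - 1
    sampling = pairs * runtime_ms_per_pair / 1000.0
    labeling = PRESENT_AND_VOTE_S if parallel else pairs * PRESENT_AND_VOTE_S
    return sampling + labeling


# Runtimes (ms/pair) for n = 10 taken from Table 1.
crowdbt_10 = sequential_time(10, 85.69)
mst_ideal_10 = batch_time(10, 6.16)
mst_worst_10 = batch_time(10, 6.16, parallel=False)
```

Under this model the labeling time dominates: the sampling runtime contributes well under a second for 10 objects, so the gain of the batch mode comes almost entirely from parallel annotation.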
4.2 Real-world datasets
In this section, we compare our proposed HybridMST with the state-of-the-art active learning methods CrowdBT [chen2013pairwise] and Hodgeactive [xu2017hodgerank]. For statistical reliability, each method is run 100 times. Two real-world datasets are used, as detailed below.
Video Quality Assessment (VQA) dataset
This VQA dataset is a complete and balanced pairwise dataset from [xu2011random]. It contains 38400 pairwise comparisons for video quality assessment of 10 references from the LIVE database [livevideodb]. Each reference contains 16 different types of distortions. 209 annotators took part in this test.
Image Quality Assessment (IQA) dataset
This IQA dataset is a complete but imbalanced dataset from [xu2012hodgerank]. It contains 43266 pairwise comparisons for quality assessment of 15 references from the LIVE 2008 [livedb] and IVC 2005 [ivcdb] databases. Each reference contains 16 different types of distortions. 328 annotators from the Internet took part in the test.
As there is no ground truth for the real-world datasets, we consider the results obtained from all observers as the ground truth. Again, Kendall, PLCC and RMSE are used as the evaluation measures. Due to space limitations, only part of the results are shown in Figures 4 and 5.
In the real-world datasets, where the annotators’ labels are much noisier and more diverse than in our simulated condition, the proposed HybridMST is more robust to noisy labels than the other methods. Regarding ranking aggregation (Kendall), Hodgeactive still performs slightly better than HybridMST when the trial number is small, but the gap is smaller than in the simulated data. As the test budget increases, HybridMST performs comparably to or even better than Hodgeactive. Both outperform CrowdBT. Regarding rating aggregation (PLCC and RMSE), HybridMST always outperforms the others significantly. Hodgeactive performs similarly to CrowdBT on the VQA dataset, but much better than CrowdBT on the IQA dataset.
Both the simulated and the real-world experiments demonstrate that when the test budget is limited (2 to 3 standard trial numbers) and the objective is ranking aggregation, i.e. we care more about the rank order of the test candidates than about their underlying scores, using Hodgeactive is safer than HybridMST. In all other conditions, HybridMST is more applicable, considering both the aggregation accuracy and the batch-mode execution.
5 Conclusions
In this paper, we present an active sampling strategy called HybridMST for pairwise preference aggregation. We define the EIG based on a local KLD, where Bayes’ theorem is used to find a tractable form and Gauss-Hermite quadrature is used for efficient estimation. Pair sampling is a hybrid strategy that takes advantage of both the GM method and the MST method, allowing for better ranking and rating aggregation in both small and large trial number conditions. In both the simulated experiment and the real-world VQA and IQA datasets, HybridMST shows outstanding aggregation ability. In addition, on crowdsourcing platforms, the proposed batch-mode MST method can speed up the pairwise comparison procedure significantly through parallel labeling.
References
 (1) R. Herbrich, T. Minka, and T. Graepel, “Trueskill™: a bayesian skill rating system,” in Advances in neural information processing systems, 2007, pp. 569–576.
 (2) J. Li, M. Barkowsky, and P. Le Callet, “Analysis and improvement of a paired comparison method in the application of 3DTV subjective experiment,” International Conference on Image Processing, pp. 629–632, Sep. 2012.
 (3) R. Bradley and M. Terry, “Rank analysis of incomplete block designs: I. The method of paired comparisons,” Biometrika, vol. 39, no. 3/4, pp. 324–345, Dec. 1952.
 (4) D. V. Lindley, “On a measure of the information provided by an experiment,” The Annals of Mathematical Statistics, pp. 986–1005, 1956.
 (5) X. Chen, P. N. Bennett, K. Collins-Thompson, and E. Horvitz, “Pairwise ranking aggregation in a crowdsourced setting,” in Proceedings of the sixth ACM international conference on Web search and data mining. ACM, 2013, pp. 193–202.
 (6) S. Negahban, S. Oh, and D. Shah, “Iterative ranking from pairwise comparisons,” in Advances in neural information processing systems, 2012, pp. 2474–2482.
 (7) J. Li, M. Barkowsky, J. Wang, and P. Le Callet, “Exploring the effects of subjective methodology on assessing visual discomfort in immersive multimedia,” IS&T Electronic Imaging, Human Vision and Electronic Imaging, Jan. 2018.
 (8) P. Le Callet, S. Möller, and A. Perkis, “Qualinet white paper on definitions of quality of experience v.1.1,” European Network on Quality of Experience in Multimedia Systems and Services (COST Action IC 1003), Jun. 2012.
 (9) N. Ailon, “Reconciling real scores with binary comparisons: A new logistic based model for ranking,” in Advances in Neural Information Processing Systems, 2009, pp. 25–32.
 (10) P. Emerson, “The original borda count and partial voting,” Social Choice and Welfare, vol. 40, no. 2, pp. 353–358, 2013.
 (11) R. L. Plackett, “The analysis of permutations,” Applied Statistics, pp. 193–202, 1975.
 (12) R. D. Luce, Individual choice behavior: A theoretical analysis. Courier Corporation, 2005.
 (13) C. L. Mallows, “Non-null ranking models. I,” Biometrika, vol. 44, no. 1/2, pp. 114–130, 1957.
 (14) L. Thurstone, “A law of comparative judgment,” Psychological review, vol. 34, no. 4, pp. 273–286, 1927.
 (15) H. Azari, D. Parks, and L. Xia, “Random utility theory for social choice,” in Advances in Neural Information Processing Systems, 2012, pp. 126–134.
 (16) T. Lu and C. Boutilier, “Learning Mallows models with pairwise preferences,” in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 145–152.
 (17) H. A. Soufiani, W. Chen, D. C. Parkes, and L. Xia, “Generalized method-of-moments for rank aggregation,” in Advances in Neural Information Processing Systems, 2013, pp. 2706–2714.
 (18) Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer, “An efficient boosting algorithm for combining preferences,” Journal of machine learning research, vol. 4, no. Nov, pp. 933–969, 2003.
 (19) N. B. Shah, S. Balakrishnan, J. Bradley, A. Parekh, K. Ramchandran, and M. J. Wainwright, “Estimation from pairwise comparisons: Sharp minimax bounds with topology dependence,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2049–2095, 2016.
 (20) K. Crammer and Y. Singer, “Pranking with ranking,” in Advances in neural information processing systems, 2002, pp. 641–647.
 (21) T. Qin, X. Geng, and T.-Y. Liu, “A new probabilistic model for rank aggregation,” in Advances in neural information processing systems, 2010, pp. 1948–1956.
 (22) P. Dangauthier, R. Herbrich, T. Minka, and T. Graepel, “Trueskill through time: Revisiting the history of chess,” in Advances in Neural Information Processing Systems, 2008, pp. 337–344.
 (23) C. Cortes, M. Mohri, and A. Rastogi, “Magnitude-preserving ranking algorithms,” in Proceedings of the 24th international conference on Machine learning. ACM, 2007, pp. 169–176.
 (24) F. Wauthier, M. Jordan, and N. Jojic, “Efficient ranking from pairwise comparisons,” in International Conference on Machine Learning, 2013, pp. 109–117.
 (25) O. Dykstra, “Rank analysis of incomplete block designs: A method of paired comparisons employing unequal repetitions on pairs,” Biometrics, vol. 16, no. 2, pp. 176–188, Jun. 1960.
 (26) Q. Xu, Q. Huang, T. Jiang, B. Yan, W. Lin, and Y. Yao, “Hodgerank on random graphs for subjective video quality assessment,” Multimedia, IEEE Transactions on, vol. 14, no. 3, pp. 844–857, 2012.
 (27) D. A. Silverstein and J. E. Farrell, “Quantifying perceptual image quality,” Proc. IS&T Image Processing, Image Quality, Image Capture, Systems Conference, vol. 1, pp. 242–246, May 1998.
 (28) J. Li, M. Barkowsky, and P. Le Callet, “Boosting Paired Comparison methodology in measuring visual discomfort of 3DTV: performances of three different designs,” IS&T/SPIE Electronic Imaging, Feb. 2013.
 (29) ——, “Subjective assessment methodology for preference of experience in 3dtv,” in IVMSP Workshop, 2013 IEEE 11th. IEEE, 2013, pp. 1–4.
 (30) K. G. Jamieson and R. Nowak, “Active ranking using pairwise comparisons,” in Advances in Neural Information Processing Systems, 2011, pp. 2240–2248.
 (31) T. Pfeiffer, X. A. Gao, Y. Chen, A. Mao, and D. G. Rand, “Adaptive polling for information aggregation.” in AAAI, 2012.
 (32) Q. Xu, J. Xiong, X. Chen, Q. Huang, and Y. Yao, “Hodgerank with information maximization for crowdsourced pairwise ranking aggregation,” in AAAI, 2018.
 (33) R. A. Bradley, “Rank analysis of incomplete block designs: III. Some large-sample results on estimation and power for a method of paired comparisons,” Biometrika, vol. 42, no. 3/4, pp. 450–470, 1955.
 (34) P. J. Davis and P. Rabinowitz, Methods of numerical integration. Courier Corporation, 2007.

 (35) P. Ye and D. Doermann, “Active sampling for subjective image quality assessment,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 4249–4256.
 (36) R. C. Prim, “Shortest connection networks and some generalizations,” Bell Labs Technical Journal, vol. 36, no. 6, pp. 1389–1401, 1957.
 (37) ITU-T P.910, “Subjective video quality assessment methods for multimedia applications,” International Telecommunication Union, Apr. 2008.
 (38) Q. Xu, T. Jiang, Y. Yao, Q. Huang, B. Yan, and W. Lin, “Random partial paired comparison for subjective video quality assessment via hodgerank,” in Proceedings of the 19th ACM international conference on Multimedia. ACM, 2011, pp. 393–402.
 (39) ITU-R BT.500-13, “Methodology for the subjective assessment of the quality of television pictures,” International Telecommunication Union, Geneva, Switzerland, Jan. 2012.
 (40) “Live video quality assessment database,” http://live.ece.utexas.edu/research/quality/live_video.html.
 (41) H. Sheikh, Z. Wang, L. Cormack, and A. Bovik, “Live image quality assessment database release 2,” http://live.ece.utexas.edu/research/quality.
 (42) P. Le Callet and F. Autrusseau, “Subjective quality assessment irccyn/ivc database,” 2005, http://www.irccyn.ecnantes.fr/ivcdb/.
 (43) F. Wickelmaier and C. Schmid, “A Matlab function to estimate choice model parameters from paired-comparison data,” Behavior Research Methods, Instruments, and Computers, vol. 36, no. 1, pp. 29–40, Feb. 2004.
Appendix
A.1 Estimating covariance matrix by Hessian matrix
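The MLEs $\hat{s}$ follow a multivariate Gaussian distribution. The covariance matrix of $\hat{s}$ can be estimated using the Hessian matrix $H$ of the log-likelihood $\log L(s \mid M)$, i.e.,
(12) $H_{ij} = \left. \frac{\partial^2 \log L(s \mid M)}{\partial s_i \, \partial s_j} \right|_{s = \hat{s}}$
Following [wickelmaier2004matlab] [bradley1955rank], we construct a matrix $C$ by augmenting the negative Hessian $-H$ with a column and a row vector of ones and a zero in the bottom right corner, and inverting the result:
(13) $C = \begin{bmatrix} -H & \mathbf{1} \\ \mathbf{1}^{\top} & 0 \end{bmatrix}^{-1}$
The first $n$ columns and rows of $C$ form the estimated covariance matrix of $\hat{s}$, i.e., $\hat{\Sigma}$.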
A.2 Simplification of Utility function
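In our work, the EIG can be written as:
(14) $U_{ij} = \sum_{y_{ij}} p(y_{ij}) \int p(s_{ij} \mid y_{ij}) \log \frac{p(s_{ij} \mid y_{ij})}{p(s_{ij})} \, ds_{ij} = \sum_{y_{ij}} \int p(y_{ij} \mid s_{ij}) \, p(s_{ij}) \log p(y_{ij} \mid s_{ij}) \, ds_{ij} - \sum_{y_{ij}} p(y_{ij}) \log p(y_{ij})$
In our study, $y_{ij}$ only has two values, 1 and 0. We define $p_{ij} = p(y_{ij} = 1 \mid s_{ij})$ and $q_{ij} = p(y_{ij} = 0 \mid s_{ij})$; thus, we have $p_{ij} = 1/(1 + e^{-s_{ij}})$ and $q_{ij} = 1 - p_{ij}$. Since $p(y_{ij} = 1) = \int p_{ij} \, p(s_{ij}) \, ds_{ij} = E(p_{ij})$ and likewise $p(y_{ij} = 0) = E(q_{ij})$, then:
(15) $U_{ij} = E(p_{ij} \log p_{ij}) + E(q_{ij} \log q_{ij}) - E(p_{ij}) \log E(p_{ij}) - E(q_{ij}) \log E(q_{ij})$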
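A.3 Gauss-Hermite quadrature estimation
In our paper,
(18) $E(p_{ij} \log p_{ij}) = \int f(x) \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} e^{-\frac{(x - \hat{s}_{ij})^2}{2\sigma_{ij}^2}} \, dx = \frac{1}{\sqrt{\pi}} \int e^{-t^2} f\left(\sqrt{2}\,\sigma_{ij}\, t + \hat{s}_{ij}\right) dt$, with $f(x) = \frac{1}{1 + e^{-x}} \log\left(\frac{1}{1 + e^{-x}}\right)$ and the change of variable $x = \sqrt{2}\,\sigma_{ij}\, t + \hat{s}_{ij}$
According to Gauss-Hermite quadrature, the value of integrals of the form $\int_{-\infty}^{+\infty} e^{-x^2} f(x) \, dx$ can be estimated by $\sum_{i=1}^{n} w_i f(x_i)$, where $n$ is the number of sample points used (please note that this $n$ is not the total number of objects in the paper), the $x_i$ are the roots of the physicists’ version of the Hermite polynomial $H_n(x)$:
(21) $H_n(x) = (-1)^n e^{x^2} \frac{d^n}{dx^n} e^{-x^2}$
and the associated weights $w_i$ are given by
(22) $w_i = \frac{2^{n-1} \, n! \, \sqrt{\pi}}{n^2 \left[H_{n-1}(x_i)\right]^2}$
In our study, n = 30.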
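The EIG estimate can be sketched with a small Gauss-Hermite rule. To keep the example free of external dependencies, the sketch below hard-codes the standard 5-point nodes and weights instead of the 30 points used in the study, so its values are coarser; the function names are hypothetical and the natural logarithm is assumed.

```python
import math

# 5-point Gauss-Hermite rule (physicists' convention, weight exp(-x^2)).
GH_NODES = [-2.0201828704560856, -0.9585724646138185, 0.0,
            0.9585724646138185, 2.0201828704560856]
GH_WEIGHTS = [0.01995324205904591, 0.3936193231522412, 0.9453087204829419,
              0.3936193231522412, 0.01995324205904591]


def gauss_expectation(f, mean, std):
    """Approximate E[f(X)] for X ~ N(mean, std^2) with Gauss-Hermite nodes."""
    total = sum(w * f(mean + math.sqrt(2.0) * std * x)
                for x, w in zip(GH_NODES, GH_WEIGHTS))
    return total / math.sqrt(math.pi)


def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    z = math.exp(s)
    return z / (1.0 + z)


def xlogx(v):
    """v * log(v), with the limit value 0 at v = 0."""
    return v * math.log(v) if v > 0.0 else 0.0


def expected_information_gain(mean, std):
    """EIG of a pair whose score difference is distributed as N(mean, std^2)."""
    e_p = gauss_expectation(sigmoid, mean, std)
    e_q = 1.0 - e_p
    e_plogp = gauss_expectation(lambda s: xlogx(sigmoid(s)), mean, std)
    e_qlogq = gauss_expectation(lambda s: xlogx(sigmoid(-s)), mean, std)
    return e_plogp + e_qlogq - xlogx(e_p) - xlogx(e_q)
```

Even with this coarse rule, the qualitative behavior described in Section 3.2 is visible: a pair with a mean score difference of 0 yields a larger value than a pair with a mean difference of 3 at the same standard deviation, and a larger standard deviation yields a larger value at the same mean.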
A.4 Complete results of Video Quality Assessment (VQA) datasets
Complete results of Kendall, PLCC and RMSE on References 5 to 10 of the VQA dataset are shown in Figures 6, 7 and 8.
A.5 Complete results of Image quality assessment (IQA) dataset