Active Sampling for Pairwise Comparisons via Approximate Message Passing and Information Gain Maximization

04/12/2020 ∙ by Aliaksei Mikhailiuk, et al. ∙ University of Cambridge, Microsoft, UCL

Pairwise comparison data arise in many domains with subjective assessment experiments, for example in image and video quality assessment. In these experiments observers are asked to express a preference between two conditions. However, many pairwise comparison protocols require a large number of comparisons to infer accurate scores, which may be infeasible when each comparison is time-consuming (e.g. videos) or expensive (e.g. medical imaging). This motivates the use of an active sampling algorithm that chooses only the most informative pairs for comparison. In this paper we propose ASAP, an active sampling algorithm based on approximate message passing and expected information gain maximization. Unlike most existing methods, which rely on partial updates of the posterior distribution, we are able to perform full updates and therefore substantially improve the accuracy of the inferred scores. The algorithm relies on three techniques for reducing computational cost: inference based on approximate message passing, selective evaluations of the information gain, and selecting pairs in a batch that forms a minimum spanning tree of the inverse of information gain. We demonstrate, with real and synthetic data, that ASAP offers the highest accuracy of inferred scores compared to the existing methods. We also provide an open-source GPU implementation of ASAP for large-scale experiments.




I Introduction

The fields of subjective assessment and preference aggregation are concerned with measuring and modeling human judgments. Participants usually rate a set of stimuli or conditions according to some criteria, or rank a subset of them. Rating is inherently more complex for participants than ranking [Tsukida2011, Shah2015, Mantiuk2012a, Ye2014]. Thus, comparative judgment experiments are gaining attention in subjective assessment and crowd-sourced experiments, e.g. for image quality assessment. The simplest form of ranking experiment is comparing conditions in pairs (the pairwise comparison protocol), and hence it is the most common choice. Here observers are asked to choose one out of two conditions according to some criteria. As opposed to rating, in which conditions are mapped directly to a scale by computing mean opinion scores, we need to model and infer the latent scores from pairwise comparisons. This problem is known as psychometric scaling. Models used for scaling typically rely on the assumptions of Thurstone’s model [Thurstone1927] or Bradley-Terry’s model [Bradley1952]. The main limitation of pairwise comparison experiments is that for $n$ conditions there are $n(n-1)/2$ possible pairs to compare, which makes collecting all comparisons too costly for large $n$. However, active sampling can be used to select the most informative comparisons, minimizing experimental effort while maintaining accurate results.

The need for an efficient active sampling algorithm for preference aggregation is motivated by the recent spread of applications reliant on: i) user preferences (e.g. recommendation systems, information retrieval and relevance estimation) [7023415]; ii) matchmaking in gaming systems such as TrueSkill for Xbox Live [NIPS2006_3079] and Elo for chess and tennis tournaments [doi:10.1111/1467-9876.00159]; iii) psychometric experiments for behavioural psychology [doi:10.1348/000711003321645412]; and iv) quality of experience (e.g. image and video quality) [Prashnani_2018_CVPR, 8578166, Ponomarenko2015, Ye2014].

State-of-the-art active sampling methods are typically based on information gain maximization [AAAI124747, GLICKMAN2005279, crowdbt2013, Ye2014, HybridMSTRAFAL, Xu2018HodgeRankWI], where pairs in each trial are selected to maximize the weighted change of the posterior distribution of the scale. However, these are computationally expensive for a large number of conditions ($n$), as they require computing the posterior distribution for $n(n-1)/2$ pairs at every iteration of the algorithm. To make active sampling computationally feasible, most existing techniques update the posterior distribution only for the pairs that were selected for the next comparison. We show that this leads to a sub-optimal choice of pairs and worse accuracy as the number of measurements increases. To address this problem, we substantially reduce the computational cost of active sampling by using approximate message passing for inference, and by computing the expected information gain only for the subset of the most informative pairs. The reduced computational overhead allows us to update the full posterior distribution at every iteration, thus greatly improving the accuracy. To ensure balanced design and allow for a batch sampling mode, we sample the pairs from a minimum spanning tree as in [HybridMSTRAFAL]. The proposed technique (ASAP - Active SAmpling for Pairwise comparisons) results in the most accurate psychometric scale, especially for a large number of measurements. Moreover, the algorithm has a structure that is easy to parallelize, allowing for a fast GPU implementation. We show the benefit of using the full posterior update by comparing to an approximate version of the algorithm (ASAP-approx) that, similar to other methods, relies on an online posterior update.
Our main contributions are: A) an analysis of existing active sampling methods for pairwise comparison experiments under a range of condition score distributions, using both synthetic and real image and video quality assessment data; B) a novel active sampling method (ASAP), offering the highest accuracy of the scale; and C) an implementation of 9 algorithms, released along with the paper as open-source software for active sampling in pairwise comparison experiments and including the first GPU implementation of such a method.

II Related Work

Comparative judgment experiments arise in ranking (ordering conditions) and scaling applications (putting conditions on a scale where distances convey meaning). Suppose we aim to compare a set of $n$ conditions $o_1, \dots, o_n$ (conditions being images, players, etc.) that are evaluated according to a feature or characteristic (subjective measurements such as aesthetics, relevance, quality, etc.) with unknown underlying ground truth scores $r = (r_1, \dots, r_n)$, $r_i \in \mathbb{R}$. In this paper, we simply refer to these as quality scores. The simplest experimental protocol is to compare pairs $(o_i, o_j)$, $i \neq j$ (referred to as pairwise comparisons). Although other works exist, e.g. estimating total or partial order [heckel2018approximate, heckel2019, jamieson2011, yue2011, szorenyi2015], this paper is focused on active sampling for psychometric scale construction, which uses pairwise comparisons to estimate quality scores $\hat{r}$ that approximate $r$. This section discusses related work, divided into four groups, based on the type of approach: passive, sorting, information-gain and matchmaking. The methods tested in the experiments are highlighted in bold face. We also distinguish between sequential methods, where the next pair is generated only upon receiving the outcome for the preceding pair, and batch (or parallel) methods, where a batch of comparison pairs is generated and outcomes can be obtained in parallel. Batch methods are preferred in crowd-sourcing, where multiple conditions are distributed to participants in parallel.

Passive approaches

When every condition is compared to every other condition the same number of times, the experimental design is referred to as full pairwise comparisons (FPC). Such an approach is impractical, as it requires $n(n-1)/2$ comparisons per participant. Another approach, nearest conditions (NC), relies on the idea that conditions that are similar in quality are more informative for constructing the quality scale [zerman:hal-01654133]. Thus, if the approximate ranking is known in advance, one can compare only the conditions that are neighbours in the ranking. Such an initial ranking, however, may not be available in practice.

Sorting approaches

Similar to NC, sorting-based methods rank the conditions, then compare those that are of similar quality. Authors in [doi:10.1117/1.1344187] proposed an active sampling algorithm using a binary tree. Every new condition descends down the tree, branching depending on whether it is better or worse than the condition in the current node. Authors in [Maystre2017] applied Quicksort [10.1093/comjnl/5.1.10] using pairwise comparisons as the comparison operator.

Recently, [Ponomarenko2015] used the Swiss system, known from chess tournaments, to rank conditions in subjective assessment of visual quality. The Swiss system first chooses random conditions to compare, then sorts the conditions to find pairs that are similar. A related method is the Adaptive Rectangular Design (ARD) [P915], which allows comparison of conditions far apart on the quality scale in later stages of an experiment. The work of [chen2016] takes a different approach, where active sampling (AKG) is based on a Bayesian decision process maximising Kendall’s tau rank correlation coefficient [kendall1938].

Sorting approaches are praised for their simplicity and low computational complexity and are thus often employed in practice. However, these approaches use heuristics that often result in suboptimal comparison choices, and in general perform worse than the methods that rely on information gain.

Information-gain approaches

These methods are based on information maximization. That is, the posterior distribution of quality scores is computed and the next comparison is selected according to a utility function, e.g. Kullback-Leibler (KL) divergence [kullback1951] between the current distribution and the distribution assuming any possible comparison [Settles10activelearning]. This group is the most relevant to our new method. Methods listed in this section are sequential, unless stated otherwise.

A greedy Bayesian approach, Crowd-BT, was proposed in [crowdbt2013]. The entropy for every pair of conditions is computed using the posterior distribution of each pair individually rather than jointly. The method also explicitly accounts for reliability of each annotator: scores and annotator quality are updated using an alternating optimization strategy.

Authors in [AAAI124747] derive the score distribution from the maximum likelihood estimation and the negative inverse of the Hessian of the log likelihood function. Since the original implementation was not provided by the authors and our implementation suffered from numerical instability, we did not include it in our tests.

Authors in [GLICKMAN2005279, Ye2014] develop a fully Bayesian framework to compute the posterior distribution of the quality scores. Hybrid-MST [HybridMSTRAFAL] extends this idea by selecting batches of comparisons (instead of single pairs) to maximize the information gain in the minimum spanning tree [Cormen:2009:IAT:1614191] of a comparison graph. The time efficiency of the method over its predecessor is improved by computing the information gain locally, i.e. within the compared pair.

A different approach is taken by [Xu2018HodgeRankWI], where authors propose to solve a least-squares problem to elicit a latent global rating of the conditions using the Hodge decomposition of pairwise comparison data. Like other methods, the information gain is computed using the posterior of only the pair of compared conditions. We refer to this approach as HR-active.


Matchmaking approaches

A matchmaking system was proposed for gaming, together with the TrueSkill algorithm [NIPS2006_3079]. The aim is to find the pairs of players with the most similar skill. The skill distribution of a pair of players is used to predict the match outcome. We refer to this approach as TS-sampling.

Our Work

In contrast to the previous work, our method (i) allows for batch and sequential modes; (ii) estimates the posterior using the entire set of comparison outcomes that has been collected so far; and (iii) computes the utility function for a subset of pairs, saving computations without compromising on performance.

III Methodology

Our algorithm consists of two main steps: (i) computing the posterior distribution of the score variables $r$ using the pairwise comparisons collected so far; (ii) using the posterior of $r$ to select the next comparison to be performed. In this section we first describe the score posterior estimation and then explain our active sampling algorithm. We then discuss some features that make it more computationally efficient. Pseudo-code is included in the supplementary.

III-A Posterior Estimation

Posterior Estimation Model

Our model is similar to Thurstone’s model Case V [Thurstone1927], with unobserved normally distributed independent random variables. However, our approach is fully Bayesian, and so instead of point value scores $r_i$ for each condition $o_i$, we assume that each score is a random variable with distribution $r_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$. Analogous to Thurstone’s model, $\mu_i$ represents the score value of $o_i$. $\sigma_i$ represents the uncertainty in an estimate of $\mu_i$ and is not explicitly expressed in Thurstone’s model (it can be obtained, for example, by bootstrapping [Perez2017]). The probability that $o_i$ is better than $o_j$ is then given by noting that the difference of the two scores is itself normally distributed with mean $\mu_i - \mu_j$, so that:

$$P(o_i \succ o_j) = \Phi\!\left(\frac{\mu_i - \mu_j}{\sigma_{ij}}\right), \qquad (1)$$

where $\Phi$ is the cumulative standard normal distribution function and $\sigma_{ij}$ is the standard deviation of the score difference, which combines the score variances $\sigma_i^2$ and $\sigma_j^2$ with a noise parameter $\beta$ representing what is referred to in the literature as observer/comparison noise. We further assume the Thurstone Case V model, in which $\beta$ is constant across all conditions. The choice of $\beta$ determines the relationship between distances in the scale and probabilities of better quality. In our experiments $\beta$ is set to a fixed value.
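As a rough illustration of Equation 1, the following Python sketch evaluates the preference probability from two Gaussian score estimates. The function names and the `noise_var` argument are assumptions made for this example: the exact way the observer-noise parameter enters the difference variance is folded into a single hypothetical `noise_var` term.

```python
import math


def std_normal_cdf(x: float) -> float:
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_better(mu_i: float, var_i: float,
                mu_j: float, var_j: float,
                noise_var: float) -> float:
    """Probability that condition i is preferred over condition j.

    noise_var is a hypothetical stand-in for the observer-noise
    contribution to the variance of the score difference.
    """
    sigma_ij = math.sqrt(var_i + var_j + noise_var)
    return std_normal_cdf((mu_i - mu_j) / sigma_ij)


# Two conditions with equal means are indistinguishable (probability 0.5);
# a higher mean for i pushes the probability towards 1.
p_equal = prob_better(0.0, 1.0, 0.0, 1.0, noise_var=1.0)
p_higher = prob_better(2.0, 1.0, 0.0, 1.0, noise_var=1.0)
```

Note that larger score uncertainties or larger observer noise pull the probability towards 0.5, which is why uncertain pairs remain informative to compare.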

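For a pair of compared conditions $(o_{i_k}, o_{j_k})$ for $k = 1, \dots, t$, where $t$ is the total number of comparisons measured so far, we denote the comparison outcome as $y_k \in \{+1, -1\}$, where $y_k = +1$ indicates that $o_{i_k}$ was preferred and $y_k = -1$ indicates that $o_{j_k}$ was preferred, with no draws allowed. In the inference step, we want to estimate the distribution of the score variables $r$ given the observed outcomes $y = (y_1, \dots, y_t)$. The posterior distribution is:

$$p(r \mid y) \propto p(y \mid r)\, p(r), \qquad (2)$$

where we assume a factorizing Gaussian prior distribution over scores $p(r) = \prod_{i=1}^{n} \mathcal{N}(r_i; \mu_0, \sigma_0^2)$, with $\mu_0$ and $\sigma_0$ being the parameters of the prior, set to fixed values. The likelihood of observing the comparison outcomes given the ground truth scores is modelled as:

$$p(y \mid r) = \prod_{k=1}^{t} p(y_k \mid r_{i_k}, r_{j_k}), \qquad (3)$$

where individual likelihoods can be defined through an indicator on the difference variable $d_k$ of the $k$-th compared pair, $\mathbb{I}(\operatorname{sign}(d_k) = y_k)$, i.e. equal to 1 if the sign of $y_k$ is the same as that of the difference $d_k$ and 0 otherwise.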

Although the score posterior can be written exactly via Bayes' rule, the binary nature of the output factor means that the likelihood in Eq. 3 is not conjugate to the Gaussian prior. This would lead to a non-Gaussian posterior for $r$, and result in challenging, high-dimensional integrals for our information gain metric. A Gaussian approximation to messages yields a multivariate Gaussian posterior with a diagonal covariance matrix, resolving both issues.

Posterior Estimation Inference

Fig. 1: Factor graph for 2 comparisons of 3 conditions.

Figure 1 shows a factor graph implementing the model above, used as the basis for efficient inference and inspired by TrueSkill [NIPS2006_3079]. The posterior over $r$ is inferred via message passing between nodes on the graph, with messages computed using the sum-product algorithm. In the general case of $n$ conditions and $t$ comparisons, we will have $n$ score variables and prior factors, and $t$ difference factors, difference variables, output factors and output variables. Messages from output factors are approximated as Gaussians using expectation propagation via moment matching.

III-B Sampling Algorithm: ASAP

The basis of the proposed active sampling algorithm is to compute the posterior distribution over $r$ that would arise from each possible pairwise outcome in the next comparison, and then use this to choose the next comparison based on a criterion of maximum information gain.

Several utility functions can be used to compute the expected information gain (EIG). Our choice is the commonly used Kullback-Leibler (KL) divergence [kullback1951] between the prior and posterior distributions.

More specifically, our active sampling strategy picks conditions $(o_i, o_j)$ to compare in measurement $t+1$, such that they maximize a measure of information gain $I_t(o_i, o_j)$:

$$(o_i, o_j)_{t+1} = \operatorname*{arg\,max}_{o_i, o_j \in S,\; i \neq j} I_t(o_i, o_j), \qquad (4)$$

where $S$ is the set of all conditions and the subindex $t$ indicates that we use all measurements collected up to the point in time $t$. For simplicity, we define $p_t$ as the estimated posterior after measurement $t$.

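For each possible pair $(o_i, o_j)$, let $p^{ij}_{t+1}$ and $p^{ji}_{t+1}$ denote the updated posterior distributions (i.e. including comparison $t+1$) if $o_i$ is selected over $o_j$ and vice versa. Since we cannot anticipate the outcome of the pairwise comparison, i.e. which condition will be selected, similarly to other active sampling methods [HybridMSTRAFAL, Xu2018HodgeRankWI, crowdbt2013, AAAI124747, GLICKMAN2005279], we weight the EIG with the probability of each outcome. We compute this probability using Equation 1, writing $p_{ij} = P(o_i \succ o_j)$ for condition $o_i$ selected over $o_j$ and $p_{ji} = 1 - p_{ij}$ for the opposite outcome. With $p_t$ the current posterior, the EIG is then defined as:

$$I_t(o_i, o_j) = p_{ij}\, \mathrm{KL}\!\left(p^{ij}_{t+1} \,\|\, p_t\right) + p_{ji}\, \mathrm{KL}\!\left(p^{ji}_{t+1} \,\|\, p_t\right). \qquad (5)$$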
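The outcome-weighted information gain described above can be sketched in Python as follows. This is only an illustration under stated assumptions: posteriors are represented as lists of `(mean, variance)` tuples (a diagonal Gaussian), the two candidate posteriors are supplied by the caller rather than inferred here, and the KL direction (updated posterior relative to the current one) is the usual convention in Bayesian experimental design, assumed rather than taken from the text. All function names are hypothetical.

```python
import math


def kl_diag_gauss(p, q):
    """KL(p || q) for diagonal Gaussians given as lists of (mean, variance)."""
    total = 0.0
    for (mu_p, var_p), (mu_q, var_q) in zip(p, q):
        total += 0.5 * (math.log(var_q / var_p)
                        + (var_p + (mu_p - mu_q) ** 2) / var_q
                        - 1.0)
    return total


def expected_information_gain(current, post_if_i_wins, post_if_j_wins, p_ij):
    """Weight the gain of each hypothetical outcome by its probability.

    current         -- posterior before the comparison
    post_if_i_wins  -- posterior if condition i were preferred
    post_if_j_wins  -- posterior if condition j were preferred
    p_ij            -- probability that i is preferred (Equation 1)
    """
    gain_i = kl_diag_gauss(post_if_i_wins, current)
    gain_j = kl_diag_gauss(post_if_j_wins, current)
    return p_ij * gain_i + (1.0 - p_ij) * gain_j


# Toy example with two conditions; the updated posteriors are made-up numbers.
current = [(0.0, 1.0), (0.0, 1.0)]
if_i = [(0.46, 0.79), (-0.46, 0.79)]
if_j = [(-0.46, 0.79), (0.46, 0.79)]
eig = expected_information_gain(current, if_i, if_j, p_ij=0.5)
```

Because the posterior is diagonal, the KL divergence reduces to a sum over conditions, which is what makes evaluating many candidate pairs in parallel straightforward.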


III-C Efficiency considerations

At every iteration $t$, the number of comparisons to consider is $n(n-1)/2$, where $n$ is the total number of compared conditions. Because the posterior has to be evaluated for every candidate pair, the cost of selecting the next comparison is the cost of one posterior evaluation multiplied by the number of pairs. This may be very costly when the number of conditions is large. Here, we discuss two modifications that reduce the computational cost, and a batch mode, which also improves the accuracy.

Approximate (online) posterior estimation (ASAP-approx)

In order to quantify the improvement in accuracy brought by the full posterior update, we follow the common approach and consider the use of an online posterior update using assumed density filtering (ADF) [murphy_2012]. That is, the posterior $p_t$ is used as the prior when computing the information gain for comparison $t+1$, allowing our algorithm to run in an online manner [minka2018trueskill]. Thus, for every candidate pair $(o_i, o_j)$, we update only the scores $r_i$ and $r_j$, so the cost per pair no longer depends on the number of comparisons collected. No additional ADF-projection step is required since expectation propagation has already yielded a Gaussian approximation to the posterior. The time complexity of selecting the next comparison is thus reduced to the order of the number of pairs, $O(n^2)$. However, computational efficiency comes at the cost of accuracy in posterior estimation [minka2018trueskill]. We refer to the algorithm using the approximate posterior update as ASAP-approx.
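To make the online update concrete, here is a minimal Python sketch of a two-condition Gaussian update in the style of the well-known TrueSkill no-draw update, in which only the two compared scores change. It is an assumption that this matches the update used by ASAP-approx in detail; `adf_update` and `noise_var` (a single term standing in for observer noise) are hypothetical names.

```python
import math


def std_normal_pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def adf_update(mu_w, var_w, mu_l, var_l, noise_var):
    """Moment-matched update after the 'winner' w is preferred over 'loser' l.

    Returns (mu_w, var_w, mu_l, var_l) after the comparison. Only these two
    scores are touched; all other conditions keep their current estimates.
    Note: for extremely surprising outcomes the CDF can underflow to zero;
    a production implementation would need a numerically stable ratio.
    """
    c2 = var_w + var_l + noise_var          # variance of the score difference
    c = math.sqrt(c2)
    t = (mu_w - mu_l) / c
    v = std_normal_pdf(t) / std_normal_cdf(t)   # mean correction factor
    w = v * (v + t)                             # variance correction, in (0, 1)
    new_mu_w = mu_w + (var_w / c) * v
    new_mu_l = mu_l - (var_l / c) * v
    new_var_w = var_w * (1.0 - (var_w / c2) * w)
    new_var_l = var_l * (1.0 - (var_l / c2) * w)
    return new_mu_w, new_var_w, new_mu_l, new_var_l


# Two conditions that start out identical: the preferred one moves up,
# the other moves down, and both become more certain.
updated = adf_update(0.0, 1.0, 0.0, 1.0, noise_var=1.0)
```

The update runs in constant time per pair, which is the source of the speed-up; the price is that information from the new comparison never propagates to the other conditions.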

Selective EIG evaluations

Some comparisons are less informative than others [Settles10activelearning], such as conditions far apart on a scale where the outcome is obvious [AAAI124747, GLICKMAN2005279]. Therefore we evaluate the EIG only for the most informative pairs. For that we use a simple criterion based on Equation 1 to compute the probability that conditions $o_i$ and $o_j$ are selected for EIG evaluation. Since Equation 1 gives the probability that condition $o_i$ is better than $o_j$, it can be used to identify obvious outcomes. Thus, the selection probability is large when the difference between the scores and their standard errors are small. To ensure that at least one pair including $o_i$ is selected, we scale the selection probability per condition.

Minimum spanning tree for the batch mode

When a sampling algorithm is in the sequential mode, one pair of conditions is scheduled in every iteration of the algorithm. However, selecting a batch of comparisons in a single iteration of the algorithm is computationally more efficient and can yield superior accuracy [HybridMSTRAFAL]. To extend our algorithm to the batch mode, we treat pairwise comparisons as an undirected graph. Vertices are conditions, and edges are pairwise comparisons. We follow the approach from [HybridMSTRAFAL], where the minimum spanning tree (MST) is constructed from the graph of comparisons. The MST is a subset of the edges connecting all the vertices together, such that the total edge weight is minimal. The edges of our graph are weighted by the inverse of the EIG, i.e. for an edge connecting conditions $o_i$ and $o_j$ the weight is given by $w_{ij} = 1 / I_t(o_i, o_j)$. $n-1$ pairs are selected for the MST, allowing us to compute the EIG only once every $n-1$ comparisons, greatly improving speed. Since each condition is compared at least once within our batch, detrimental imbalanced designs [JMLR:v18:16-206], where a subset of conditions is compared significantly more often than the rest, are eliminated.
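The batch selection described above can be sketched with Prim's algorithm on a dense graph whose edge weights are the inverse of the EIG. This is a minimal illustration rather than the authors' implementation: `mst_batch`, the `eps` guard against zero EIG, and the toy matrix are assumptions made for the example.

```python
import math


def mst_batch(eig, eps=1e-12):
    """Select n-1 pairs forming a minimum spanning tree of the 1/EIG graph.

    eig -- symmetric n x n list of lists with the expected information gain
           of every pair (diagonal entries are ignored).
    Returns a list of (i, j) index pairs to be compared in the next batch.
    """
    n = len(eig)

    def weight(i, j):
        return 1.0 / max(eig[i][j], eps)    # high EIG -> cheap edge

    in_tree = [False] * n
    in_tree[0] = True
    best_w = [math.inf] * n                 # cheapest edge into the tree
    best_from = [-1] * n
    for j in range(1, n):
        best_w[j] = weight(0, j)
        best_from[j] = 0

    edges = []
    for _ in range(n - 1):
        # Attach the outside vertex with the cheapest connecting edge.
        j = min((k for k in range(n) if not in_tree[k]),
                key=lambda k: best_w[k])
        edges.append((best_from[j], j))
        in_tree[j] = True
        for k in range(n):
            if not in_tree[k] and weight(j, k) < best_w[k]:
                best_w[k] = weight(j, k)
                best_from[k] = j
    return edges


# Toy EIG matrix for 4 conditions (made-up numbers).
toy_eig = [[0, 5, 1, 1],
           [5, 0, 4, 1],
           [1, 4, 0, 3],
           [1, 1, 3, 0]]
batch = mst_batch(toy_eig)
```

Every condition appears in at least one selected pair, which is the property that keeps the design balanced.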

IV Evaluation

To assess different sampling strategies, we run Monte Carlo simulations on synthetic and real datasets. The Spearman rank-order correlation coefficient (SROCC) and root-mean-squared error (RMSE) between the ground truth and estimated scores are used for performance evaluation. We report our results as multiples of standard trials, where 1 standard trial corresponds to $n(n-1)/2$ measurements (the number of possible pairs for $n$ conditions). For clarity, we present RMSE on a log-scale, and SROCC after a Fisher transformation ($z = \operatorname{arctanh}(\rho)$). The same method, based on approximate message passing, was used to produce the scale from pairwise comparisons for each method. We verified that the scaled results are consistent with the MLE-based method from [Perez2017].
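The evaluation quantities above are simple to compute; the following Python sketch shows one way to do so using only the standard library. The helper names are hypothetical, and the rank computation assumes there are no tied scores.

```python
import math


def standard_trial(n: int) -> int:
    """Number of comparisons in one standard trial for n conditions."""
    return n * (n - 1) // 2


def rmse(truth, estimate):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(truth, estimate))
                     / len(truth))


def ranks(values):
    """Zero-based ranks; assumes all values are distinct (no ties)."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0] * len(values)
    for rank, k in enumerate(order):
        result[k] = rank
    return result


def srocc(truth, estimate):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    ra, rb = ranks(truth), ranks(estimate)
    mean = (len(ra) - 1) / 2.0
    cov = sum((x - mean) * (y - mean) for x, y in zip(ra, rb))
    var = sum((x - mean) ** 2 for x in ra)   # same for both when no ties
    return cov / var


def fisher_z(rho: float) -> float:
    """Fisher transformation; undefined for |rho| = 1."""
    return math.atanh(rho)


truth = [0.1, 0.7, 1.5, 2.2]
estimate = [0.0, 0.9, 1.4, 2.5]
```

The Fisher transformation stretches the region near 1, which makes differences between methods that all achieve high correlation easier to see in a plot.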

IV-A Algorithms compared

We implement and compare different active sampling strategies using original authors’ codes where possible: AKG [chen2016], Crowd-BT [crowdbt2013], HR-active [Xu2018HodgeRankWI] and Hybrid-MST [HybridMSTRAFAL]. Our own implementation was used for Quicksort [Maystre2017], Swiss System [Ponomarenko2015], and TS-sampling [NIPS2006_3079].

IV-B Simulated Data

In order to control the ground truth distribution underlying the data, we first run a Monte Carlo simulation with synthetic data. In the simulation, we use $P(o_i \succ o_j)$ from Equation 1 to draw the outcome $y_t$ of the comparison at trial $t$ between conditions $o_i$ and $o_j$, which are determined by each algorithm. We note that the strongest influence on the results is the proximity of compared conditions in the target scale. When conditions have comparable scores, they are confused more often in comparisons, whereas when conditions are far apart in the scale they are easily distinguished, resulting in different performances for sampling methods. Hence, we consider 3 scenarios for 20 conditions with scores sampled uniformly from: (i) a large range (scores far apart); (ii) a medium range; (iii) a small range (scores close together). Results for larger numbers of conditions are given in Section IV-D. We run the simulation 100 times for comparisons ranging from 1 to 15 standard trials.
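A simulated observer of the kind described above can be sketched in a few lines of Python. The score range, the noise level and the function names below are placeholders chosen for illustration, not the values used in the experiments.

```python
import math
import random


def std_normal_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def simulate_outcome(r_i: float, r_j: float, noise_std: float,
                     rng: random.Random) -> int:
    """Draw +1 if condition i is preferred, -1 otherwise.

    The preference probability follows the form of Equation 1, evaluated
    at the ground truth scores with a hypothetical noise level noise_std.
    """
    p_ij = std_normal_cdf((r_i - r_j) / noise_std)
    return 1 if rng.random() < p_ij else -1


rng = random.Random(0)
# 20 ground truth scores drawn uniformly from a placeholder range.
scores = [rng.uniform(0.0, 5.0) for _ in range(20)]
# One simulated comparison between the first two conditions.
y = simulate_outcome(scores[0], scores[1], noise_std=1.0, rng=rng)
```

Narrowing the placeholder range makes the preference probabilities approach 0.5, so outcomes become noisier and more comparisons are needed to separate the conditions.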

Selective EIG evaluations

Figure 2(a) shows the proportion of saved evaluations with selective EIG computations. Since we initialize our algorithm with all scores set to 0, all possible pairs have their EIG computed at first (0 standard trials in the plot), as all conditions are close to each other. As more data are collected, conditions move away from each other on the scale and the EIG is computed for a subset of pairs only. Computational saving is greater for large-range simulations than for small-range simulations. In small-range simulations, conditions first move away from each other, as in the first few iterations their relative distances are likely to be overestimated, decreasing the overall number of computations; however, with more measurements the conditions move closer, and the proportion of saved evaluations decreases. Figure 2(b) shows the probability of the EIG being evaluated after 10 standard trials for 20 conditions sampled from the medium range. For visualization purposes, conditions were ordered ascending in the quality scale. Pairs of conditions along the diagonal, i.e. close in the scale, have a higher chance of their EIG being computed. Figure 2(c) shows the performance of ASAP with and without selective EIG evaluations. Thus, selectively evaluating EIG greatly reduces the number of computations while maintaining the same accuracy measured in RMSE and SROCC. In the following sections, we only present the results with selective EIG computations.

Fig. 2: (a) Percentage of saved evaluations with selective EIG evaluations; (b) probability of EIG evaluation after 10 standard trials for medium range; and (c) RMSE and SROCC with and without selective EIG.

Minimum spanning tree for the batch mode

Figure 3 shows the results of ASAP with and without batch mode for medium-range simulations. Without MST batch mode, the method is likely to result in an imbalanced sampling pattern, where certain conditions are compared significantly more often than others. This has a detrimental effect on performance, deteriorating the results with growing number of comparisons [JMLR:v18:16-206]. Below, we only present results with MST batch mode.

Fig. 3: Simulation with 20 conditions sampled from the medium range with and without MST. We observe a similar pattern for conditions sampled from the small and large ranges.

Simulation results

Figure 4 shows the results of the simulation for the implemented strategies. At all tested ranges, EIG-based methods have lower RMSE, and therefore higher accuracy, than the sorting methods (Quicksort and the Swiss System). While TS-sampling and Crowd-BT have good accuracy for the large range, these are among the worst methods for the small range. ASAP-approx performs similarly to the other methods with an online posterior update, but offers a modest and consistent improvement in accuracy over Hybrid-MST and HR-active. Of all tested methods, ASAP, employing the full posterior update, is the most accurate by a substantial margin across all ranges.

For SROCC, EIG-based methods do not show a clear advantage over sorting methods; however, it should be noted that EIG-based methods are designed to optimize for RMSE rather than ranking. Even so, ASAP still performs the best for small and medium range simulations, and one of the best for large range, reaching SROCC of 0.99 within five standard trials. It should be noted, however, that the problem of ordering conditions from the large range is trivial and the best methods compete at 0.99+ SROCC levels (almost perfect ordering). Because of the poor performance of the sorting-based methods, we do not consider them in the following experiments.

Fig. 4: Simulation results with 20 conditions for the compared sampling strategies.

IV-C Real Data

We validate the performance of sampling strategies on two real-world datasets: i) Image Quality Assessment (IQA) LIVE dataset [Sheikh2006b], with pairwise comparisons collected by [Ye2014]; and ii) Video Quality Assessment (VQA) dataset [Xu2011]. Each dataset contains complete and balanced matrices of pairwise comparisons, with each condition compared to every other condition the same number of times. The empirical probability of one condition being better than another is obtained from the measured data and used throughout the simulation. We compute RMSE and SROCC between scores produced by each method, and scores obtained by scaling the original matrices of all comparisons.

IQA dataset

To allow multiple runs of the Monte Carlo simulation, we randomly select 40 conditions from the 100 available. In the original matrix, each condition is compared 5 times with each other (5 standard trials), yielding 24750 comparisons.

Fig. 5: Compared sampling strategies on LIVE dataset.

Figure 5 shows the results. The performance trends are consistent with the results for the simulated data for the medium range. ASAP has the best performance both in terms of SROCC and RMSE. It is followed by ASAP-approx, Hybrid-MST, and TS-sampling, each having roughly the same performance in terms of both RMSE and SROCC. Crowd-BT and HR-active have the worst performance in terms of both RMSE and SROCC.

VQA dataset

The dataset contains 10 reference videos with 16 distortions. Each matrix contains 3840 pairwise comparisons, i.e. each pair was compared 32 times.

Fig. 6: Compared sampling strategies on VQA dataset.

Figure 6 shows the results of running simulations on the first two reference videos. The performance trends are again, in general, consistent with the results for the simulated data sampled from the medium range, except that TS-sampling performs substantially worse, and Hybrid-MST outperforms ASAP-approx for small numbers of trials. ASAP consistently outperforms other methods. The results for the remaining eight reference videos are given in the supplementary.

IV-D Large Scale Experiments

It is often considered that 15 standard trials is the minimum requirement for FPC to generate reliable results [itu910, itu500]; however, this is rarely feasible in practice. Real-world large-scale datasets barely reach 1 standard trial. To make experiments with a large number of conditions feasible, individual reference scenes or videos are often measured and scaled independently, missing important cross-content comparisons. However, the lack of cross-content comparisons yields less accurate scores [zerman:hal-01654133]. Active sampling techniques, such as ASAP, should accurately measure a large number of conditions, while saving a substantial amount of experimental effort. To test such a scenario, we simulate the comparison of 200 conditions with scores distributed in the medium range. The results, shown in Figure 7, demonstrate that even with a small number of standard trials ASAP outperforms existing methods; it is followed by ASAP-approx and Hybrid-MST.

Fig. 7: Large scale experiment simulation with 200 conditions sampled from the medium range.

IV-E Running Time and Experimental Effort

A practical active sampling method must generate new samples in an acceptable amount of time. Hence, in Figure 8 we plot the time taken by each method as the number of conditions grows. The reported times are for generating a single pair of conditions, assuming that 5 standard trials have been collected so far. CPU times were measured for MATLAB R2019a code running on a 2.6GHz Intel Core i5 CPU and 8GB 1600MHz DDR3 RAM. GPU time was measured for PyTorch 1.4 with CUDA 9.2, running on a GeForce GTX1080. We omit sorting methods as they do not offer sufficient accuracy. Although ASAP is the slowest method when running on a CPU, it can be effectively parallelized on a GPU and deliver the results in a shorter time than other methods running on a CPU.

In Figure 9 we show the experimental effort required to reach an acceptable level of accuracy for 20 and 200 conditions, where we define experimental effort as the time required to reach an RMSE of 0.15. We assume that each comparison takes 5 seconds, which is typical for image quality assessment experiments [2019TIP, Ponomarenko2015]. ASAP offers the biggest saving in experimental effort for both small and large scale experiments. In an experiment with 200 conditions ASAP achieves an accuracy of 0.15 RMSE in 0.355 standard trials. The total experimental time is thus 9.8h (7065 comparisons), which is significantly better than the 14.6h (10550 comparisons) for Hybrid-MST. Similarly, for 20 conditions the entire experiment would take 40 min for ASAP and 120 min for Hybrid-MST to reach the same accuracy of score estimates. For experiments with longer comparison times (e.g. video comparison) or high comparison cost (e.g. medical images) ASAP’s advantage is even greater.
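The experimental-effort figures quoted above follow from simple arithmetic, which the short Python sketch below reproduces. The function name is hypothetical; the 5-second decision time and the numbers of conditions and comparisons are those stated in the text.

```python
import math


def experimental_effort(n_conditions: int, standard_trials: float,
                        seconds_per_comparison: float = 5.0):
    """Return (number of comparisons, total time in hours)."""
    pairs = n_conditions * (n_conditions - 1) // 2   # one standard trial
    comparisons = math.ceil(standard_trials * pairs)
    hours = comparisons * seconds_per_comparison / 3600.0
    return comparisons, hours


# ASAP with 200 conditions: 0.355 standard trials of 19900 pairs each.
asap_comparisons, asap_hours = experimental_effort(200, 0.355)

# Hybrid-MST needs 10550 comparisons for the same accuracy.
hybrid_hours = 10550 * 5.0 / 3600.0
```

This gives 7065 comparisons and roughly 9.8 hours for ASAP, against roughly 14.65 hours for Hybrid-MST.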

Fig. 8: Average time to select the next comparison for a varied number of conditions and 5 standard trials.
Fig. 9: Experimental effort (amount of time, assuming 5 second decision time, required to reach 0.15 RMSE) for experiments with 20 and 200 conditions.

V Conclusions

In this paper, we showed the importance of choosing the right sampling method when collecting pairwise comparison data, and proposed a fully Bayesian active sampling strategy for pairwise comparisons – ASAP.

Commonly used sorting methods perform poorly compared to the state-of-the-art methods based on the EIG, and even EIG-based methods are sub-optimal, as they rely on a partial update of the posterior distribution. ASAP computes the full posterior distribution, which is crucial to achieving accurate EIG estimates, and thus to the accuracy of active sampling. Fast computation of the posterior, important for real-time applications, was made possible by using a fast and accurate factor graph approach, which is new to the active sampling community. In addition, ASAP only computes the EIG for the most informative pairs, reducing the computational cost of ASAP by up to 80%, and selects batches using a minimum spanning tree method, which avoids imbalanced designs.

We recommend ASAP, as it offered the highest accuracy of inferred scores compared to existing methods in experiments with real and synthetic data. The computational cost of our technique is higher than for other methods in the CPU implementation, but is still in the range that makes the technique practical, with a substantial saving of experimental effort. For large-scale experiments, in GPU implementation ASAP offers both accuracy and speed.


This project has received funding from EPSRC research grant EP/P007902/1, from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement N° 725253, EyeCode), and from the Marie Skłodowska-Curie grant agreement N° 765911 (RealVision).