1 Introduction
With the Internet and the explosive growth of information it brings, individuals today face a rapidly expanding set of choices (e.g., which book to buy, which hotel to book, etc.). All of these examples yield comparisons without explicitly revealing an underlying preference or utility function. That is, only partially ordered choices subject to the preference are observed instead of the whole utility function, especially paired comparisons, into which all partially ordered choices can be converted. Therefore, the aggregation of incomplete comparison data to reveal the global preference function has been an important topic in recent decades.
In recent years, researchers in this area have usually taken one of three approaches: (i) common consensus, which assumes that all users’ choices are stochastic revelations of a common global preference or utility function on candidates; (ii) collaborative filtering for personalized ranking, which often assumes that different users have correlated preference functions represented by some low-rank rating matrices; (iii) mixtures of random utility models [31, 13, 29], which assume that the personal choice comes from one of a small set of underlying random utility models that are as yet unknown. Regarding common consensus pursuit, there has been a large volume of studies, from social choice theory to modern “rank aggregation” in computer science [3, 11, 8, 54, 28, 35, 29, 7, 19], on how to consistently aggregate pairwise comparisons into a global consensus ranking that summarizes the preference of all users. On the other hand, low-rank models of collaborative filtering [41, 40, 53, 27]
assume that there exist a small number of underlying intrinsic utility functions such that every individual’s personalized preference is a linear combination of these intrinsic utility functions, using nuclear norm as a penalty, while mixture models can be consistently recovered using tensor moment matching methods
[31]. However, few works cover the full spectrum by considering both the social preference and individual variations simultaneously. In other words, they do not take into account the multilevel hierarchy from social choice to individuals. To see the nature of such hierarchies in preference learning, consider movie ratings from the MovieLens dataset as an example. Fig.1 shows two-level movie preference functions learned from this dataset: the common preference and 21 group preferences. Fig.1 (a) illustrates this two-level hierarchical model with six representative groups, among which farmer, tradesman, and artist are the top 3 groups exhibiting a large deviation from the common preference, while self-employed, writer, and homemaker are those showing preferences similar to the common one. Such results suggest that the main consuming groups of this website include homemaker, writer, and unemployed, who have more freedom to spend their time compared with other occupations. A coarse-grained model may consider only the common preference for all users, a refined model may incorporate these group variations to reflect diversity, and a further refined model may consider diversity at the individual level. Fig.1 (b) shows the group preference diversity using the methodology proposed in this paper. The purple curve represents the common preference, while the remaining 21 curves represent the 21 occupation group preferences along regularization paths: the earlier a curve becomes nonzero, the more salient the distinction between that group’s preference and the common one. At different locations along the axis, models of different diversity levels can be chosen.
In addition to the intrinsic preference diversity among users, participants in crowdsourcing experiments may behave abnormally because of diverse environments. Even if they share the same preference or utility function in making choices, they might suffer various disturbances during the experiments. For example: (i) an annotator may click one side more often than the other. When some pairs are highly confusing or annotators get too tired, some annotators tend to click one side simply to raise their record and receive more payment, while for pairs with substantial differences they click as usual; (ii) some extremely careless annotators, or robots pretending to be human annotators, do not look at the instances at all and click one side all the time to quickly receive payment for the work. This kind of behavior is called the annotator’s position bias, which has been studied in [10].
The examples above suggest that we must take user- or annotator-specific variations into account in a crowdsourced preference aggregation task. In this paper, we propose a simple dynamic scheme that can learn multilevel utility models, from the social common preference to individual diversity, in a unified spectrum, adapted to different statistical models (e.g., linear, Bradley-Terry, and Thurstone-Mosteller, etc.).
As classical social choice theory [1] points out, preference aggregation toward a global consensus is doomed to encounter conflicts of interest. What is a suitable way to analyze these conflicts quantitatively?
In this paper, we are inspired by the Hodge-theoretic approach proposed in [19], which decomposes the pairwise comparison data into three orthogonal components: the global consensus ranking, the local inconsistency as triangular cycles, and the global inconsistency as harmonic cycles. Instead of merely extracting the global ranking component, often called HodgeRank, we also exploit the latter two components, both of which are cycles and which collectively encode all the conflicts of interest in the data. To decipher the sources of these conflicts, we further decompose the cycles by considering two types of annotator-specific variations: the annotator’s personalized preference deviations from the common ranking, which characterize multiple criteria in comparisons, and the annotator’s position bias, which deteriorates the quality of the data. This results in a linear mixed-effects extension of HodgeRank, called Mixed-Effects HodgeRank here. The same principle can be applied to the various generalized linear models studied in this paper.
To initiate a task of crowdsourced preference aggregation, we usually assume that the majority of participants share a common preference and behave rationally, while deviations from it exist but are sparse. A parsimonious model is therefore desired, with a sparsity structure on personalized preference deviations and position biases. Because the amount of such sparse random effects is unknown in reality, it is natural to pursue a family of parsimonious models at a variety of sparsity levels. Algorithmically, we develop Linearized Bregman Iterations (LBI) as a discretized Inverse Scale Space method in our setting. It is a simple iterative procedure that generates a sequence of parsimonious models, evolving from the common global ranking in HodgeRank to annotators’ personalized rankings in a fully parametric model that might overfit the data. As the algorithm iterates, large deviations in personalized preference or abnormal behavior typically appear early, while annotators who follow the common preference show up at a later stage. In practice, when the number of participants is large and the sample size is relatively small, early stopping regularization is needed to prevent overfitting in the full model. Owing to its algorithmic simplicity, the method admits an easy (synchronized) parallel implementation that meets the needs of large-scale data analysis.
As a summary, our main contributions in this new framework are highlighted as follows:

A linear mixed-effects extension of HodgeRank including both the fixed effect of the common ranking and random effects such as the annotator’s preference deviations and position bias. It can be easily extended to generalized linear models, including the Bradley-Terry (BT) and Thurstone-Mosteller (TM) models, which are more efficient for binary comparison data than the basic linear model (associated with the L2 loss).

A path of parsimonious estimates of the preference deviation and position bias at different sparsity levels, based on Linearized Bregman Iterations as a discretization of Inverse Scale Space method, which allows a simple synchronized parallelization for an almost linear speedup.
This paper is an extension of our conference paper [52], where we proposed a basic linear mixed-effects model that not only derives the common preference at the population level, but also estimates an annotator’s large preference/utility deviation at the individual level, as well as an abnormal annotator’s position bias. However, that work has some limitations. First, it does not aim to predict preferences (social or individual) based on features of new users and/or new alternatives. Such featured data are ubiquitous in e-commerce, for example in recommendations of books, movies, and restaurants based on their styles and on user categories. To learn preference or ranking functions with predictive power on unseen products, a feature representation of the candidates in comparison must be used as model input in addition to the local ranking orders. Second, other types of models are not studied, such as generalized linear models, which are particularly efficient for discrete choice data. In this extended version, we propose a unified framework that includes various generalized linear models, which learn both the social preference functions based on features of the alternatives to be compared and personalized utility functions conditioned on user categories. Such a model, with at least two levels of diversity, enables us to simultaneously learn a coarse-grained social preference function together with fine-grained personalized rankings, equipped with predictive power for the choices of new users on new alternatives. In this paper, we shall see that the Linearized Bregman Algorithm can be adapted to all these generalized linear models with fast and parallel path algorithms, and in particular enjoys the improved statistical precision of generalized linear models for binary comparisons in real-world datasets, without losing the algorithmic simplicity of the basic linear model.
The remainder of this paper is organized as follows. Sec.2 reviews related work. We then systematically introduce the methodology for parsimonious mixed-effects HodgeRank estimation in Sec.3. Extensive experimental validation on one simulated and three real-world crowdsourced datasets is presented in Sec.4. Finally, Sec.5 presents concluding remarks.
2 Related Work
Statistical preference aggregation, in particular ranking or rating from pairwise comparisons, is a classical problem which can be traced back to the 18th century. Various methods have been studied for this problem, including the Borda count [11], maximum likelihood methods such as the Bradley-Terry model [8], rank centrality (PageRank/MC3) [30, 7], and, most recently, HodgeRank [19].
HodgeRank, as an application of combinatorial Hodge theory to the preference or rank aggregation problem from pairwise comparison data, was first introduced in [19], inspiring a series of studies in computer science [50, 36, 34] and game theory [5], in addition to traditional applications in fluid mechanics [6] and computer vision
[55], etc. It is a general framework for decomposing paired comparison data on graphs, possibly imbalanced (where different candidate pairs may receive different numbers of comparisons) and incomplete (where every voter may give only partial comparisons), into three orthogonal components (gradients, local cycles, and harmonic cycles). With these components, HodgeRank not only provides a means to determine a global ranking from paired comparison data under various statistical models (e.g., Uniform, Thurstone-Mosteller, Bradley-Terry, and Angular Transform), but also measures the inconsistency of the global ranking obtained. The inconsistency indicates the validity of the ranking obtained and can be further studied in terms of its geometric scale, namely whether the inconsistency in the ranking data arises locally or globally. Local inconsistency can be fully characterized by triangular cycles, while global inconsistency involves cycles consisting of more than three nodes (harmonic cycles), which may arise from data incompleteness and, when present as a large component, indicate serious conflicts in the ranking data. It has also been shown that, under a natural statistical model in which pairwise comparisons are drawn randomly and independently from some underlying probability distribution, the rank centrality (PageRank) and HodgeRank algorithms both converge to an optimal ranking under a “time-reversibility” condition. However, PageRank is only able to aggregate the pairwise comparisons into a global ranking over the items, whereas HodgeRank also measures the inconsistency of the global ranking obtained. Exploiting random graphs, we can efficiently control the global inconsistency via the topology of random clique complexes
[49] as well as the sampling efficiency [37]. However, all of these methods have a major drawback: they aim to find one global ranking and thus cannot analyze the conflicts of interest or discrepancies across users. In HodgeRank [19], such conflicts are encoded in the components of cyclic rankings, which are not user-specific. On the other hand, in crowdsourcing scenarios, users may vote according to multiple criteria or under different environments, which contributes to preferential diversity. Deciphering such behaviors becomes necessary for better exploitation of crowdsourcing data.
Recently, some personalized ranking methods have arisen from the standard collaborative filtering (CF) approach based on matrix factorization [41, 40, 53]. The key idea behind them is to find a low-rank user rating matrix via nuclear norm regularization, such that every user’s utility is a linear combination of such low-rank ratings. However, such models are not a natural fit in crowdsourcing scenarios, where the majority of voters share some common preference while some annotators might deviate from it significantly.
Beyond the CF approach, there are various techniques for modeling annotators’ abnormal behaviors in general crowdsourcing [24, 25, 23, 56, 58, 57, 18, 59, 43, 46, 21], etc. The basic idea of these works is to characterize user quality using some probabilistic behavior models. The models roughly fall into two categories [43]: either a single parameter is associated with each user’s quality, indicating the probability that the annotator correctly answers a task [17, 26], or a general confusion matrix is used for each user, as an extension of the classic work of Dawid and Skene (DS)
[9, 39, 48]. In particular, [21] considers task-dependent user quality parameters or confusion matrices such that the majority follows the common parameter while some may deviate from it with personalized parameters; on the other hand, [46] directly exploits the correlations between user confusion matrices to discover hidden groups of users. While these methods can model the quality of workers in general crowdsourcing experiments for label aggregation, they do not consider the peculiarity of crowdsourced preference aggregation, where every user may vote according to some utility. For example, in pairwise comparisons, the confusion matrix approach leads to an adversarial mixture ranking model [44], where every voter follows a mixture of rational behavior, voting according to the common ranking model, and abnormal behavior, voting according to its adversarial ranking. However, voters are not necessarily adversarial; for example, robot clickers on one side can be captured by position bias in our model, and random clickers can be captured by their deviations in personalized ranking, neither of which is an adversarial voter. Therefore, the models with a quality parameter or confusion matrix above are coarse-grained models in crowdsourced ranking, insufficient to capture preferential diversity. In this paper, we are inspired by the HodgeRank approach and propose a parsimonious multilevel model for personalized rankings that deciphers conflicts of interest that are not necessarily adversarial, and so may capture a wider or more refined preferential diversity in crowdsourced rank aggregation than previous models.
3 Methodology
In this section, we systematically introduce the methodology for parsimonious mixed-effects HodgeRank estimation. Specifically, we first introduce the proposed mixed-effects model based on HodgeRank, in which three kinds of random utility models are presented: the basic linear model with L2 loss, the Bradley-Terry model, and the Thurstone-Mosteller model. Then we present a simple iterative algorithm called Linearized Bregman Iterations to generate paths of parsimonious models at different sparsity levels, followed by Synchronized Parallel LBI to meet the needs of large-scale data analysis. Finally, early stopping regularization is discussed at the end of this section.
3.1 Mixed-Effects HodgeRank on Graphs
Suppose there are alternatives or items to be ranked, represented by data points with a feature matrix , where is a
dimensional feature vector representing item
. The pairwise comparison labels collected from users can be naturally represented as a directed comparison graph . Let be the vertex set of items and be the set of edges, where is the set of all users who compared items. User provides his/her preference between choice and , such that means prefers to and otherwise. Hence we may assume with skew-symmetry (orientation) . The magnitude of can represent the degree of preference and it varies in applications. The simplest setting is the binary choice, where if prefers to and otherwise. In applications, users are often categorized by their classifications, such as occupations and ages, hence may be a summary statistic of all the pairwise comparisons between and among the same category of users. The general purpose of preference aggregation is to look for a global score such that
(1) 
where
is a loss function,
denotes the confidence weights on made by rater (for simplicity, assumed to be for the provided voting data), and () represents the global ranking score of item i (j, respectively). In HodgeRank, one benefits from the use of the square loss, which leads to fast algorithms for finding the optimal global ranking , which becomes one component of a general orthogonal decomposition of paired comparison data [19], i.e. where the component cycles can be further decomposed into
Local cycles are triangular cycles, e.g. ; while global cycles, also called harmonic cycles, are loops involving more than three nodes (e.g. ) and typically traversing all nodes in the graph. These cycles may arise from conflicts of interest in the ranking data. Analyzing statistical models of cycles is therefore crucial to understanding the conflicts of interest.
In crowdsourcing scenarios, the conflicts of interest are mainly due to two kinds of sources: (i) the multiple criteria adopted by different annotators when they compare items in ; (ii) the abnormal behavior of annotators in the experiments, e.g. simply clicking one side of the pair when they get bored, tired, or distracted. From this viewpoint, such cycles in HodgeRank are usually caused by personalized ranking, position bias, and stochastic noise.
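To make the notion of a local (triangular) cycle concrete, the following minimal sketch sums a skew-symmetric pairwise flow around one triangle; a nonzero sum signals a local inconsistency. The function name `triangle_curl`, the dictionary representation, and the toy flows are assumptions made for illustration and are not taken from the paper.

```python
def triangle_curl(flow, i, j, k):
    """Sum a skew-symmetric flow around the oriented triangle i -> j -> k -> i."""
    def y(a, b):
        # Skew-symmetry: the flow on (b, a) is minus the flow on (a, b).
        return flow[(a, b)] if (a, b) in flow else -flow[(b, a)]
    return y(i, j) + y(j, k) + y(k, i)

# A flow of the form y_ij = s_i - s_j (a pure "gradient") has zero curl.
scores = {"a": 3.0, "b": 1.0, "c": 0.0}
consistent = {
    ("a", "b"): scores["a"] - scores["b"],
    ("b", "c"): scores["b"] - scores["c"],
    ("a", "c"): scores["a"] - scores["c"],
}

# "a beats b, b beats c, c beats a" is a purely cyclic preference.
cyclic = {("a", "b"): 1.0, ("b", "c"): 1.0, ("c", "a"): 1.0}

print(triangle_curl(consistent, "a", "b", "c"))
print(triangle_curl(cyclic, "a", "b", "c"))
```

The first flow is fully explained by a global score, so its curl vanishes; the second cannot be explained by any score vector, and its curl is as large as possible for unit flows.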
To be specific, together with the global ranking component in HodgeRank, we consider the following linear mixed-effects model for an annotator’s pairwise ranking:
(2) 

is the common preference parameter such that the inner product with the feature , gives the common preference score on item , as a fixed effect;

is the user’s preference deviation parameter from the common consensus such that becomes user ’s personalized preference score, as a random effect;

is an annotator’s position bias, which captures the careless behavior by clicking one side during the comparisons;

The distribution can be an arbitrary cumulative distribution function.
Here is a population-level parameter which indicates some common coefficient weight vector of the features. In reality, as preferences vary greatly across different types of users, we allow each type of user to have personalized parameters. These personalized parameters can be obtained by adding some random effects to the population parameter , representing personalized deviations from the population behavior. Moreover, measures an annotator’s position bias, i.e. the tendency to always click one side in paired comparison experiments. Under the random design of pairwise comparison experiments, a candidate should be placed on the left or the right randomly, so the position should not affect the choice of a careful annotator. However, some annotators might get confused, tired, or distracted in experiments, such that they always click one side during some periods, which can be detected by such [51].
Considering the variety of applications, this model may incorporate several types of feature matrices , motivated by, but not limited to, the following examples.

is an identity matrix.
For example, in world college ranking, consists of colleges to be ranked, and indicates user prefers college to . In this scenario, we do not have the features of each college but only the pairwise comparisons obtained from users.
is low-level (or deep) visual features. For example, in music ratings, can be the low-level audio features extracted from each audio frame (spectrum power, Zero Crossing Rate, intensity, bandwidth, pitch, MFCC, etc.).

is of categorical type. For example, in movie ratings, can be the genres of movie (e.g., Action, Adventure, Animation, Comedy, Drama, etc.). Or in restaurant ratings, can be the cuisine types (e.g., Bar, CafeCoffeeShop, Cafeteria, FastFood, etc.) of the restaurant .
To make the notation clear, let satisfies and . Let satisfies . Denote , , and . Let , so .
Different distribution functions correspond to different statistical models. For example, when is a normal or sub-Gaussian distribution function, the data follow a normal or sub-Gaussian distribution, which means
(3) 
or in matrix form
(4) 
where measures the random noise in sampling, which is of zero mean and bounded. For notational simplicity, we abuse the notation to denote the vector . In this case, the loss function is often the L2 loss, the negative log-likelihood of a Gaussian:
(5) 
For robust statistics, one can also adopt loss [35] or Huber’s loss which is equivalent to the loss with samplewise sparse .
For binary comparison data , there is a family of generalized linear models (GLMs) in statistics:
(6) 
or
(7) 
where is a symmetric cumulative distribution function (CDF) whose continuous inverse is well-defined. For example,
1. Bradley-Terry model:
(8)
2. Thurstone-Mosteller model:
(9) 
More models can be found in [8, 19, 50, 49]. We note that in the general Hodge-theoretic framework for binary pairwise comparison data, one can map binary comparison data into skew-symmetric flows on graphs by and Hodge decomposition can be applied to such flows [19, 49]. In this paper, for the GLM model (3.1), the loss function is chosen as the following negative log-likelihood,
(10) 
Here we use the symmetry . Fig.2 compares the three losses: L2, Bradley-Terry (BT), and Thurstone-Mosteller (TM). One can see that, as in classification, Bradley-Terry and Thurstone-Mosteller provide convex surrogates [2] of the binary comparison 0-1 loss with a better approximation than the L2 loss. Therefore, one should expect these two models to be more efficient in reducing the pairwise mismatch (Kendall distance) from the observed data, as we shall see later in this paper.
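The comparison of the three losses can be sketched numerically. The snippet below is an illustration under stated assumptions: the label is taken to be +1 or -1, `margin` denotes the label times the predicted score difference, the Bradley-Terry CDF is the logistic function, and the Thurstone-Mosteller CDF is the standard Gaussian CDF. The function names are hypothetical.

```python
import math

def l2_loss(margin):
    # For a label in {-1, +1}, (label - score)^2 equals (1 - margin)^2.
    return (1.0 - margin) ** 2

def bt_loss(margin):
    # Negative log-likelihood under a logistic CDF: log(1 + exp(-margin)),
    # written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def tm_loss(margin):
    # Negative log-likelihood under the standard Gaussian CDF.
    # Intended for moderate margins; the CDF underflows for very negative ones.
    cdf = 0.5 * math.erfc(-margin / math.sqrt(2.0))
    return -math.log(cdf)

for m in (-1.0, 0.0, 1.0, 3.0):
    print(m, round(l2_loss(m), 4), round(bt_loss(m), 4), round(tm_loss(m), 4))
```

The L2 loss grows again once the margin exceeds 1, so it penalizes comparisons that are predicted correctly with high confidence, whereas the two likelihood-based losses decrease monotonically in the margin, as the 0-1 loss does.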
3.2 Parsimonious Paths of Multilevel Models with Linearized Bregman Iteration
In crowdsourced preference aggregation scenarios with good controls, it is natural to assume a parsimonious model. In such a model, the majority of annotators carefully follow the common behavior governed by the fixed effect parameter , while only a small set of annotators might have nonzero personalized deviations and abnormal behavior in position bias. This amounts to assuming that the parameter is group sparse, i.e. vanishes for all simultaneously, and that is sparse as well, i.e. zero for most of the careful annotators.
Let’s consider two representative scenarios:
When is an identity matrix, , so such a sparsity pattern motivates us to consider the following penalty function with a mixture of a LASSO () penalty on and a group LASSO penalty on :
(11) 
Remark 1
Usually a normalization factor is used before a group lasso penalty , where is the group size of . But here all the have the same group size, and , so the column norm of is on average times of , this basically cancels out the factor . So here we just use this simple formula.
When is low-level (or deep) visual features, such a sparsity pattern only requires the traditional LASSO () penalty on both and .
(12) 
Given the loss function and the penalty function, the following Linearized Bregman Iterations (LBI) give rise to a sequence of parsimonious (sparse) models:
(13a)  
(13b)  
(13c) 
where , , , is the iteration index, and the proximal map associated with the penalty function is given by
Here variable is an auxiliary parameter used for gradient descent, where by Moreau decomposition .
The Linearized Bregman Iteration (13) generates a path of global ranking score estimators and sparse estimators for preference deviation and position bias, . It starts from the null model and evolves into parsimonious mixed-effects models with different levels of sparsity until it reaches the full model, which is often overfitted. To avoid overfitting, early stopping regularization is required to find an optimal tradeoff between model complexity and in-sample error. In this paper, we use cross-validation to find the early stopping time, as discussed in Sec.3.4.
The Linearized Bregman algorithm was first introduced in [32] as a scalable algorithm for large-scale image restoration with TV regularization. It has several advantages over the widely used LASSO-type convex regularizations. First of all, it is simpler than LASSO in generating sparse regularization paths: instead of running several optimization problems in parallel over a grid of regularization parameters, a single run of LBI generates the whole regularization path. LBI is thus well suited to large problems.
The main advantage of such a three-line algorithm lies not only in its algorithmic simplicity but also in its improved statistical precision. In fact, it has been shown [33] that LBI can be less biased than LASSO, much like nonconvex regularization [12]. Precisely, as and , the limit dynamics of Linearized Bregman Iterations in sparse linear regression may achieve model selection consistency under nearly the same conditions as LASSO, yet return the unbiased Oracle estimator, while the LASSO estimator is well known to be biased. In our case, LBI (13) is a discretization of the following limit differential inclusion:
(14a)
(14b)  
(14c) 
It evolves as gradient descent flows on a subspace restricted by . For example, for a LASSO penalty , the support set must lead to as a constant function, which leads to , whence is a minimizer (maximum likelihood estimator) restricted to the support set . Such a minimizer is unbiased when sign consistency is reached, and hence is statistically more accurate than any convex regularized estimator such as LASSO. For more details, we refer the reader to [33] and references therein. Dynamics (14) is often called Inverse Scale Space, as it evolves through coarse-to-fine models, where at different one obtains models at different levels.
Here we give some remarks on the implementation details of the Linearized Bregman Iterations (13).

The parameter determines the bias of the sparse estimators, a bigger leading to less biased ones. The parameter is the step size, which determines the precision of the path, with a large rapidly traversing a coarse-grained path. However, one has to keep small to avoid possible oscillations of the paths, e.g. . The default choice in this paper is as a tradeoff between performance and computation cost.
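The path behavior described above can be illustrated with a generic LBI for sparse least squares. This is a minimal sketch and not the paper's Alg. 1: it has no fixed-effect block and no group penalty, and the names `kappa` (damping factor), `alpha` (step size), `shrink`, `lbi_path`, and `first_active` are assumptions chosen for illustration. The step size is set conservatively so that `kappa * alpha` times the largest eigenvalue of the Hessian stays below 2.

```python
import math

def shrink(z, lam=1.0):
    """Soft-thresholding, the proximal map of the L1 penalty."""
    return [math.copysign(max(abs(v) - lam, 0.0), v) for v in z]

def lbi_path(X, y, kappa=10.0, n_iter=200):
    """Return the list of sparse estimates visited by LBI on 0.5/n * ||y - X b||^2."""
    n, p = len(X), len(X[0])
    # trace(X^T X) / n upper-bounds the largest Hessian eigenvalue.
    hess_bound = sum(v * v for row in X for v in row) / n
    alpha = 1.0 / (kappa * hess_bound)
    z = [0.0] * p
    beta = [0.0] * p
    path = [list(beta)]
    for _ in range(n_iter):
        resid = [sum(X[i][j] * beta[j] for j in range(p)) - y[i] for i in range(n)]
        grad = [sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(p)]
        z = [z[j] - alpha * grad[j] for j in range(p)]   # gradient step on z
        beta = [kappa * b for b in shrink(z)]            # sparse estimate
        path.append(list(beta))
    return path

def first_active(path, j):
    """Index of the first iterate at which coordinate j is nonzero, or None."""
    for k, beta in enumerate(path):
        if beta[j] != 0.0:
            return k
    return None

# Toy problem: identity design, one strong, one null, and one weak signal.
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [3.0, 0.0, 1.0]
path = lbi_path(X, y)
print(first_active(path, 0), first_active(path, 1), first_active(path, 2))
print([round(b, 6) for b in path[-1]])
```

In this toy run the strong coordinate enters the path before the weak one and the null coordinate never enters, while the final iterate approaches the unpenalized least-squares solution, which is why early stopping is needed on noisy data.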
Now we are ready to give the following Linearized Bregman Algorithm for our Mixed-Effects HodgeRank as Alg. 1.
The function is different for different models. For linear model
While for GLM, it can be written as follows:
Here and mean entry-wise multiplication and division, respectively, and is the probability density function corresponding to . Here corresponds to the Bradley-Terry model and to the Thurstone-Mosteller model.
3.3 Synchronized Parallel LBI
To meet the needs of large-scale data analysis, we introduce a vanilla version of synchronized parallel LBI. Algorithm 1 only requires matrix-vector multiplications, which are easy to parallelize. Algorithm 2 is the synchronized parallel version of Algorithm 1.
Initialization: Given parameter , and thread number , .
Split data and variables: , .
Iteration: For each thread
. For all in ,
(16a)  
(16b)  
(16c)  
(16d)  
(16e)  
(16f)  
(16g) 
Synchronize.
Synchronize.
Stopping: exit when stopping rules are met.
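The split-compute-synchronize pattern can be sketched for the gradient of a least-squares loss. This is only an illustration of the data flow, under several assumptions: it splits the data rows but not the variables, the helper names `partial_gradient` and `parallel_gradient` are hypothetical, and Python threads are used for brevity even though they do not give a real speedup for pure-Python arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_gradient(rows, targets, beta):
    """Unnormalized least-squares gradient X^T (X beta - y) on one data chunk."""
    grad = [0.0] * len(beta)
    for x, y in zip(rows, targets):
        resid = sum(a * b for a, b in zip(x, beta)) - y
        for j, xj in enumerate(x):
            grad[j] += xj * resid
    return grad

def parallel_gradient(X, y, beta, n_threads):
    """Split the rows across workers, then synchronize by summing the parts."""
    chunks = [(X[m::n_threads], y[m::n_threads]) for m in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = list(pool.map(lambda c: partial_gradient(c[0], c[1], beta), chunks))
    # Synchronization point: every worker has finished before the sum.
    return [sum(col) for col in zip(*parts)]

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [1.0, 2.0, 3.0, 4.0]
beta = [1.0, 1.0]
print(partial_gradient(X, y, beta))
print(parallel_gradient(X, y, beta, n_threads=2))
```

Because the gradient is a sum over comparisons, the partial results can be computed independently and combined at a single synchronization point per iteration.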
3.4 Early Stopping Regularization
Alg.1 or 2 returns a solution path with many estimators of different sparsity. We therefore need to find an optimal stopping time among to choose the best estimators and avoid overfitting. Here we sketch the cross-validation procedure for choosing the optimal stopping time:

Given the training data, fix and , then split the data into folds. Then choose a list of parameter .

for do

On the th fold of training data, use the estimator to predict, and then compute prediction error.
end for

Return the optimal with minimal average prediction error.
Remark: Because Alg.1 or 2 only returns the estimator at discrete and may not contain the predecided parameter , we use a linear interpolation of the nearest two estimators and to approximate . is further obtained by using .
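The interpolation in the remark and the selection of the stopping time can be sketched as follows. This is a minimal sketch under assumptions: each fold supplies its own path as parallel lists of times and estimates, and `error_fn` is a hypothetical user-supplied function that scores one estimate on one fold's held-out data.

```python
from bisect import bisect_right

def interpolate_path(times, estimates, t):
    """Linearly interpolate between the two path estimators nearest to t."""
    if t <= times[0]:
        return list(estimates[0])
    if t >= times[-1]:
        return list(estimates[-1])
    k = bisect_right(times, t)              # times[k - 1] <= t < times[k]
    w = (t - times[k - 1]) / (times[k] - times[k - 1])
    return [(1.0 - w) * a + w * b for a, b in zip(estimates[k - 1], estimates[k])]

def choose_stopping_time(fold_paths, t_grid, error_fn):
    """Pick the grid time with the smallest average held-out error.

    fold_paths: list of (times, estimates, held_out) tuples, one per fold.
    """
    best_t, best_err = None, float("inf")
    for t in t_grid:
        errs = [error_fn(interpolate_path(times, est, t), held_out)
                for times, est, held_out in fold_paths]
        avg = sum(errs) / len(errs)
        if avg < best_err:
            best_t, best_err = t, avg
    return best_t

# Toy path with a single coefficient; the held-out "truth" is 4.0 in both folds.
times = [0.0, 1.0, 2.0]
estimates = [[0.0], [2.0], [6.0]]
folds = [(times, estimates, 4.0), (times, estimates, 4.0)]
best = choose_stopping_time(folds, [0.5, 1.0, 1.5, 2.0],
                            lambda est, truth: abs(est[0] - truth))
print(interpolate_path(times, estimates, 1.5), best)
```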
4 Experiments
In this section, four examples with both simulated and real-world data illustrate the validity of the analysis above and the applications of the proposed methodology. The first example uses simulated data, while the latter three exploit real-world data collected by crowdsourcing.
4.1 Simulated Study
Settings We validate the proposed algorithm on simulated data with labeled by 100 users. Specifically, we first generate the feature matrix for each node: , where is a dimensional ( in this experiment) column feature vector drawn randomly from representing node . Then each entry of the common coefficient is nonzero with probability , and the nonzero entries are drawn randomly from . In addition, for each user , each entry of the personalized deviation coefficient is nonzero with probability and is drawn randomly from . Moreover, each user has a probability of having a nonzero , and those nonzero values are drawn randomly from . Finally, we draw samples for each user randomly with binary responses following the model , where . The sample number is spread uniformly over . In this way, we obtain a multi-edge graph labeled by 100 users.
Comparative Results To see whether the proposed method provides more precise preference functions for users by introducing individual-specific parameters, we randomly split the whole data sample into a training set and a testing set. In particular, we first split the items into training items ( of the total items) and testing items (the remaining ). Pairwise comparisons that contain one or two testing items are placed in the testing set, while the others form the training set. In other words, with this partition, each comparison in the testing set contains at least one newcomer that has never appeared in the training set. To ensure statistical stability, we repeat this procedure 20 times. We compare our fine-grained model with 7 competitors, i.e., RankSVM [20], RankBoost [14], RankNet [4], gdbt [15], dart [47], Unified Robust Learning to Rank (URLR) [16], and HodgeRank [19]. Tab.I shows the experimental results of the proposed mixed-effects model compared with other coarse-grained models, which indicate that all of our models exhibit smaller test error (i.e. mismatch ratio) owing to their parsimonious multilevel structure. It is also worth mentioning that the GLM-based models (i.e. Bradley-Terry and Thurstone-Mosteller) perform better than the linear model, which suggests that these two are more suitable for binary data.
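The item-based partition described above can be sketched as follows. The function name `split_by_items`, the tuple layout `(user, item_i, item_j, label)`, and the training fraction of 0.8 are placeholders for illustration; the fraction actually used in the experiments is not reproduced here.

```python
import random

def split_by_items(comparisons, items, train_fraction, seed=0):
    """Hold out items, then route every comparison touching one to the test set."""
    rng = random.Random(seed)
    shuffled = sorted(items)
    rng.shuffle(shuffled)
    n_train = int(train_fraction * len(shuffled))
    train_items = set(shuffled[:n_train])
    train, test = [], []
    for comp in comparisons:
        _, i, j, _ = comp
        if i in train_items and j in train_items:
            train.append(comp)
        else:
            test.append(comp)   # at least one item is unseen in training
    return train, test, train_items

# Toy data: one user comparing every pair among 10 items.
items = list(range(10))
comparisons = [(0, i, j, 1) for i in items for j in items if i < j]
train, test, train_items = split_by_items(comparisons, items, train_fraction=0.8)
print(len(train), len(test))
```

Repeating the call with different seeds gives the repeated random splits used to stabilize the reported test error.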
Speedup of SynParLBI We then demonstrate the linear speedup of the synchronized parallel LBI. In evaluating a parallel system, the typical performance measure is speedup, defined as the ratio of the elapsed time when executing a program on a single thread (the single-thread execution time) to the execution time when M threads are available. Let T(M) be the time required to complete the task on M threads. The speedup is the ratio S(M) = T(1)/T(M).
In our setting, the thread number varies from 1 to 16. Fig.3 (Left) shows the mean running time over 20 repetitions of SynParLBI on a 16-core server with an Intel(R) Xeon(R) E5-2670 2.60GHz CPU and 384GB of RAM, running 64-bit Linux 4.2.0. Fig.3 (Right) shows the speedup with error bars given by the confidence interval [0.25, 0.75]. The parallel LBI reduces the running time almost linearly in the number of threads.
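To make the speedup measure concrete, the sketch below computes S(M) = T(1)/T(M) from repeated timings. It rests on assumptions: the mean single-thread time is used as the baseline, and the interval is taken as the 25% and 75% quantiles of the per-run speedups, which is one plausible reading of the [0.25, 0.75] error bars rather than a confirmed description of how the figure was produced.

```python
import statistics


def speedup_summary(times):
    """times maps a thread count M to a list of measured run times.

    Returns, for each M, the mean speedup T(1)/T(M) together with the
    25% and 75% quantiles of the per-run speedups.
    """
    baseline = statistics.mean(times[1])          # mean single-thread time T(1)
    summary = {}
    for m, runs in sorted(times.items()):
        per_run = sorted(baseline / t for t in runs)
        quartiles = statistics.quantiles(per_run, n=4, method="inclusive")
        summary[m] = {
            "speedup": baseline / statistics.mean(runs),
            "q25": quartiles[0],
            "q75": quartiles[2],
        }
    return summary
```

A perfectly linear speedup corresponds to `summary[M]["speedup"]` being equal to `M` for every thread count.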
4.2 Movie Preference Prediction
Dataset The MovieLens 1M dataset (https://grouplens.org/datasets/movielens/) comprises 3952 movies rated by 6040 users. Each movie is rated on a scale from 1 to 5, with 5 indicating the best movie and 1 the worst. There are a total of one million ratings in this dataset. Moreover, demographic information (gender, age range, and occupation) is provided voluntarily by the users. The movie titles are identical to those provided by IMDb (http://www.imdb.com/), and each movie can be represented as an 18-dimensional genre feature vector covering Action, Adventure, Animation, Children's, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, and Western.
Settings We then select a subset of this dataset containing 100 movies rated by 420 users, ensuring that each user has at least 20 ratings and each movie has been rated by at least 10 users. Since the proposed algorithm is designed for pairwise comparisons, we convert the rating information into a set of pairwise comparisons. More specifically, for each user we create a pairwise comparison whenever that user rates one item higher than another. No pairwise comparison is generated if the two items receive the same rating.
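The conversion from ratings to pairwise comparisons can be sketched as below. The data layout (a nested dictionary from user to item to rating) and the function name `ratings_to_comparisons` are assumptions made for illustration.

```python
from itertools import combinations


def ratings_to_comparisons(ratings):
    """Turn per-user ratings into (user, preferred_item, other_item) triples.

    ratings: dict mapping user -> dict mapping item -> numeric rating.
    Pairs with equal ratings produce no comparison.
    """
    comparisons = []
    for user, rated in ratings.items():
        for (i, r_i), (j, r_j) in combinations(sorted(rated.items()), 2):
            if r_i > r_j:
                comparisons.append((user, i, j))
            elif r_j > r_i:
                comparisons.append((user, j, i))
    return comparisons
```

For example, a user who rates item "a" with 5 and items "b" and "c" with 3 contributes the two comparisons a over b and a over c, and nothing for the tied pair (b, c).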
Individual Preference Following the experimental design of the simulated study, we also split the dataset into a training set and a testing set. All experiments were repeated 20 times to reduce variance. As on the simulated dataset, the proposed fine-grained method outperforms the coarse-grained models, with smaller mean test error, as shown in Tab.II. Moreover, Fig.4 shows the running time of SynParLBI on this movie dataset, where a nearly linear speedup is again observed.
Occupation and Age Preference Movie preference behavior may be influenced by occupation and age. Tab.III (a) lists the occupation categories in this dataset, while Tab.III (b) lists the age ranges. To exhibit the influence of occupation on movie preference, users from the same occupation are treated as a group. To further investigate the characteristics of groups with personalized preferences, we plot the LBI regularization paths of the preference deviations, as shown in Fig.1 (b) in the introduction. The purple curve indicates the path of the common preference parameter, which pops up first. The red curves represent the top 3 groups (i.e., farmer, artist, and tradesman), which jump out early; groups that jump out earlier are those with a larger deviation from the common ranking. The blue curves indicate the bottom 3 groups (i.e., homemaker, writer, and self-employed), which jump out later and tend to show preferences similar to the common one. The common preference itself is illustrated in Fig.5(a), where the bars are the proportions of movie genres among the top 50 movies ranked by the common consensus preference. The top four genres in the common (social) preference are Drama, Comedy, Romance, and Animation, in that order.
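The genre proportions among the top-ranked movies can be computed along the following lines. This is a hedged sketch: since a movie may carry several genres, we assume here that each genre tag counts once and that proportions are normalized by the total number of tags among the top movies. The normalization behind Fig.5 may differ, and the function name `genre_proportions` is hypothetical.

```python
from collections import Counter


def genre_proportions(scores, genres, top_k=50):
    """Share of each genre among the top_k movies by preference score.

    scores: dict mapping movie -> estimated preference score.
    genres: dict mapping movie -> list of genre names.
    """
    top = sorted(scores, key=scores.get, reverse=True)[:top_k]
    counts = Counter(g for movie in top for g in genres[movie])
    total = sum(counts.values())
    return {g: c / total for g, c in counts.most_common()}
```

Applying the same function to the scores of a single occupation or age group gives a group-level profile that can be compared with the common one.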
Besides these occupation trends, movie preference also changes with age; Fig.5(b) illustrates the evolution of preference over age groups. Users under 18 prefer Drama and Action movies most, while those aged 18-24 favor Drama and Comedy instead. Users aged 25-34 begin to enjoy love stories. In their 40s, however, they come to like Thriller movies best. Not surprisingly, as they continue into old age (56 and beyond), their retrospection on life cherishes love more deeply, and Romance becomes their favourite again.
4.3 Image Quality Assessment (IQA)
Settings Two publicly available datasets, LIVE [42] and IVC [22], are used in this work. Together they include paired comparisons collected from 342 observers of different cultural backgrounds. The number of responses that each reference image receives differs. To validate whether the estimated preference function of the annotators is good enough, we take reference image 1 as an illustrative example; other reference images exhibit similar results.
Results Tab.IV shows the mean test error over 20 repetitions (70% of the data for training, 30% for testing) achieved by this scheme. Consistent with the simulated data, on this dataset the mixed-effects model with each of the three losses approximates the annotators' preferences better than the HodgeRank estimator. Moreover, Fig.6 shows the running time of SynParLBI on this IQA dataset, where a nearly linear speedup is again observed.
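The test error reported here is a mismatch ratio, which can be sketched as below together with a random 70/30 split. Two assumptions are made for illustration: the split is taken over comparisons (rather than items), and a predicted margin of exactly zero is counted as a prediction of -1. Both function names are hypothetical.

```python
import random


def mismatch_ratio(scores, labels):
    """Fraction of test pairs whose predicted sign disagrees with the label.

    scores: predicted preference margins, one per test comparison.
    labels: observed responses in {+1, -1}.
    """
    wrong = sum(1 for s, y in zip(scores, labels) if (s > 0) != (y > 0))
    return wrong / len(labels)


def random_split(comparisons, train_frac=0.7, seed=0):
    """Shuffle the comparisons and cut them into training and testing parts."""
    rng = random.Random(seed)
    shuffled = list(comparisons)
    rng.shuffle(shuffled)
    cut = int(round(train_frac * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]
```

The mean test error is then the average of `mismatch_ratio` over 20 splits drawn with different seeds.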
To further investigate the characteristics of annotators with personalized rankings, Fig.7 illustrates the LBI regularization paths of the annotators' preference deviations, with optimal (i.e., ) returned by cross-validation under the three losses. The red curves in Fig.7 represent the top 10 annotators, who jump out early. Moreover, Fig.8 compares the common ranking (i.e., com.) with the personalized rankings of 9 representative annotators at . The X-axis represents the user index: users 2, 3, 4 jump out early, corresponding to the paths labeled with red stars in Fig.7; users 5, 6, 7 jump out in the middle, corresponding to green stars; users 8, 9, 10 jump out late, corresponding to blue stars. The images on the Y-axis are ordered from lower to higher (i.e., from blue to red) according to the common ranking score calculated by our method, and the color represents the ranking position returned by the corresponding user. Users who jump out late exhibit a ranking order almost consistent with the common ranking, while the earlier ones rank almost opposite to the common ranking.
Remark Among the top 10 annotators returned by the linear model, 9 of them (all except the annotator with ID = 133) click one side almost all the time (i.e., position-biased annotators), while this is not the case for the results returned by the other two models. The reason for this phenomenon lies in the difference between the linear model and the probability models. Such kind of user always has or . In the simple linear model, a position bias term alone is not enough to fit such a user. Since the common score always exists and is nonzero, only or and can fit the data well, so under the linear model, these users' is nonzero. In the other two GLMs, by contrast, the probabilistic interpretation makes a single enough to fit the data, since a with much larger magnitude than can already dominate the probability even . Such users also exist in the other datasets, but their samples are not as many as in this one, so the phenomenon is not observed there. This gives an example in which a GLM can be qualitatively better than the linear model for binary comparison data.
Moreover, Fig.9 illustrates the LBI regularization paths of the annotators' position biases, with red lines representing the top 10 annotators. The results returned under the three loss functions are exactly the same. Tab.V further shows the click counts on each side (i.e., Left and Right) for these top 10 position-biased annotators. These annotators can be divided into two types: (1) those who click one side all the time (with ID in blue); (2) those who click one side with high probability (the others). Although annotators of type (1) are relatively easy to identify by inspecting their inputs, visual inspection can hardly pick out annotators of type (2), whose behavior mixes rational and abnormal choices. It is therefore essential to design a statistical methodology that quantitatively detects this kind of position-biased annotator on crowdsourcing platforms. Interestingly, the annotators highlighted in blue in Tab.V click the left side all the time. Going back to the crowdsourcing platform, we find that the reason is a default choice on the left button, which induces some lazy annotators to cheat on the task.
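Click counts of the kind reported in the table can be tallied with a few lines of Python. The sketch below is a purely descriptive check, not the LBI-based detection used in this paper: it counts left and right clicks per annotator and ranks annotators by how one-sided their clicks are. The input layout (pairs of annotator ID and a side in "L" or "R") and both function names are assumptions.

```python
from collections import Counter


def click_counts(clicks):
    """Tally (left, right) click counts per annotator.

    clicks: iterable of (annotator_id, side), with side either "L" or "R".
    """
    counts = {}
    for annotator, side in clicks:
        counts.setdefault(annotator, Counter())[side] += 1
    return {a: (c["L"], c["R"]) for a, c in counts.items()}


def most_one_sided(clicks, top=10):
    """Annotators sorted by the share of clicks on their favored side."""
    counts = click_counts(clicks)

    def ratio(annotator):
        left, right = counts[annotator]
        return max(left, right) / (left + right)

    # Break ties in favor of annotators with more clicks.
    return sorted(counts, key=lambda a: (-ratio(a), -sum(counts[a])))[:top]
```

Such raw counts flag type (1) annotators immediately, but they cannot separate a type (2) annotator from one whose one-sided clicks reflect genuine preferences, which is where the model-based position bias term is needed.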
4.4 WorldCollege Ranking
Settings We now apply the proposed method to the WorldCollege dataset, which is composed of 261 colleges. Using the Allourideas crowdsourcing platform, a total of 340 distinct annotators from various countries (e.g., USA, Canada, Spain, France, Japan) are shown random pairs of these colleges and asked to decide which of the two universities is more attractive to attend. In total, we obtain 8,823 pairwise comparisons.
Results We apply the proposed method to the resulting dataset and find that, similar to the simulation and the other two real-world datasets, the mixed-effects model outperforms HodgeRank with smaller mean test error, as shown in Tab.VI. Moreover, Fig.10 shows the linear speedup of SynParLBI on this dataset. Note that in this dataset, only 9 annotators are identified as having distinct personalized rankings at optimal (i.e., ) selected via cross-validation in the linear model case, as shown in Fig.11(a). The other two losses, which attain smaller mean test error, detect more than 9 personalized-ranking annotators via cross-validation. To make the three losses easier to compare, we show only the top 9 annotators for the other two as well, in Fig.11(b) and 11(c). The top 9 annotators returned by the three losses are exactly the same in this dataset. The common ranking vs. the personalized rankings of 9 representative users is shown in Fig.12, with observations similar to those on the other two datasets. In addition, the regularization paths of position bias and the click counts of the top 10 annotators in this dataset are shown in Fig.13 and Tab.VII. As in the IQA dataset, these annotators either click one side all the time or click one side with high probability in mixed behavior. Among the top 10 position-biased annotators, the three cases differ in only one annotator: the linear model and Thurstone-Mosteller both pick out the annotator with ID=115, while Bradley-Terry treats the annotator with ID=245 as position-biased. A further inspection of the dataset confirms that this detection result is reasonable, as the left/right click ratios of these two annotators are 34:0 and 0:34 respectively, as shown in Tab.VII.
5 Conclusions
In this paper, we propose a parsimonious mixed-effects model based on HodgeRank to learn users' preference or utility functions in crowdsourced ranking, taking into account both the personalized preference deviations from the common preference and the position biases of the annotators. Specifically, the common preference scores indicate a consistent population-level ranking that approximates the behavior of all users, while a small set of annotators may have nonzero personalized deviations and abnormal behavior in position bias. Equipped with the newly developed Linearized Bregman Iteration, a simple iterative procedure that generates a sequence of parsimonious models, we establish a dynamic path from the common utility to individual variations, with different levels of parsimony or sparsity on personalization. Within this dynamic scheme, three kinds of models are systematically discussed: the linear model with L2 loss, the Bradley-Terry model, and the Thurstone-Mosteller model. Experimental studies on simulated examples and real-world datasets show that the proposed method achieves better performance (i.e., smaller test error) than the traditional HodgeRank. In addition, generalized linear models may fit binary comparison data more effectively, in terms of both the reduction of pairwise mismatch (Kendall distance) from observations and the discrimination of position bias from personalized preference deviations. Our results suggest that the proposed methodology is an effective tool for investigating the diversity of annotators' behavior in modern crowdsourced preference data.
6 Acknowledgments
The research of Qianqian Xu was supported in part by National Key Research and Development Plan (No.2016YFB0800403), in part by National Natural Science Foundation of China (No.61672514, 61390514, 61572042), Beijing Natural Science Foundation (4182079), Youth Innovation Promotion Association CAS, and CCF-Tencent Open Research Fund. The research of Xiaochun Cao was supported in part by National Natural Science Foundation of China (No.U1636214, 61650202), Beijing Natural Science Foundation (No.4172068), Key Program of the Chinese Academy of Sciences (No.QYZDBSSWJSC003). The research of Qingming Huang was supported in part by National Natural Science Foundation of China: 61332016, 61620106009, U1636214 and 61650202, in part by National Basic Research Program of China (973 Program): 2015CB351800, in part by Key Research Program of Frontier Sciences, CAS: QYZDJSSWSYS013. The research of Yuan Yao was supported in part by Hong Kong Research Grant Council (HKRGC) grant 16303817, National Basic Research Program of China (No. 2015CB85600, 2012CB825501), National Natural Science Foundation of China (No. 61370004, 11421110001), as well as awards from Tencent AI Lab, Si Family Foundation, Baidu Big Data Institute, and Microsoft Research-Asia.
References
 [1] K. Arrow. Social Choice and Individual Values, 2nd Ed. Yale University Press, New Haven, CT, 1963.
 [2] P. L. Bartlett, M. I. Jordan, and J. D. Mcauliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
 [3] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In International Conference on World Wide Web, pages 107–117, 1998.

 [4] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In International Conference on Machine Learning, pages 89–96, 2005.
 [5] O. Candogan, I. Menache, A. Ozdaglar, and P. A. Parrilo. Flows and decompositions of games: Harmonic and potential games. Mathematics of Operations Research, 36(3):474–503, 2011.
 [6] A. Chorin and J. Marsden. A Mathematical Introduction to Fluid Mechanics. Texts in Applied Mathematics. Springer, 1993.
 [7] C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation methods for the web. In International Conference on World Wide Web, pages 613–622, 2001.
 [8] H. David. The Methods of Paired Comparisons. Oxford University Press, 1988.
 [9] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):20–28, 1979.
 [10] R. L. Day. Position bias in paired product tests. Journal of Marketing Research, 6(1):98–100, 1969.
 [11] J. de Borda. Mémoire sur les Elections au Scrutin. Histoire de l’Académie Royale des Sciences, 1781.
 [12] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of American Statistical Association, pages 1348–1360, 2001.
 [13] V. Farias, S. Jagabathula, and D. Shah. A data-driven approach to modeling choice. In Advances in Neural Information Processing Systems, pages 504–512, 2009.
 [14] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4(Nov):933–969, 2003.

 [15] J. H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232, 2001.
 [16] Y. Fu, T. M. Hospedales, T. Xiang, J. Xiong, S. Gong, Y. Wang, and Y. Yao. Robust subjective visual property prediction from crowdsourced pairwise labels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(3):563–577, 2016.
 [17] S. Guo, A. G. Parameswaran, and H. Garcia-Molina. So who won?: dynamic max discovery with the crowd. pages 385–396, 2012.
 [18] H. Hu, Y. Zheng, Z. Bao, G. Li, J. Feng, and R. Cheng. Crowdsourced POI labelling: Location-aware result inference and task assignment. In IEEE International Conference on Data Engineering, pages 61–72, 2016.
 [19] X. Jiang, L.H. Lim, Y. Yao, and Y. Ye. Statistical ranking and combinatorial Hodge theory. Mathematical Programming, 127(6):203–244, 2011.

 [20] T. Joachims. SVMrank: Support vector machine for ranking, 2009.
 [21] E. Kamar, A. Kapoor, and E. Horvitz. Identifying and accounting for task-dependent bias in crowdsourcing. In AAAI Conference on Human Computation and Crowdsourcing, pages 92–101, 2015.
 [22] P. Le Callet and F. Autrusseau. Subjective quality assessment irccyn/ivc database, 2005. http://www.irccyn.ecnantes.fr/ivcdb/.
 [23] G. Li, C. Chai, J. Fan, X. Weng, J. Li, Y. Zheng, Y. Li, X. Yu, X. Zhang, and H. Yuan. CDB: optimizing queries with crowd-based selections and joins. In ACM International Conference on Management of Data, pages 1463–1478, 2017.
 [24] G. Li, J. Wang, Y. Zheng, and M. J. Franklin. Crowdsourced data management: A survey. IEEE Trans. Knowl. Data Eng., 28(9):2296–2319, 2016.
 [25] G. Li, Y. Zheng, J. Fan, J. Wang, and R. Cheng. Crowdsourced data management: Overview and challenges. In ACM International Conference on Management of Data, pages 1711–1716, 2017.
 [26] X. Liu, M. Lu, B. C. Ooi, Y. Shen, S. Wu, and M. Zhang. CDAS: A crowdsourcing data analytics system. PVLDB, 5(10):1040–1051, 2012.
 [27] Y. Lu and S. N. Negahban. Individualized rank aggregation using nuclear norm regularization. In Allerton Conference on Communication, Control, and Computing (Allerton), pages 1473–1479. IEEE, 2015.

 [28] W. Ma, J. M. Morel, S. Osher, and A. Chien. An L1-based variational model for retinex theory and its application to medical images. In IEEE Conference on Computer Vision and Pattern Recognition, pages 153–160, 2011.
 [29] S. Negahban, S. Oh, and D. Shah. Iterative ranking from pairwise comparisons. In Advances in Neural Information Processing Systems, pages 2483–2491, 2012.
 [30] S. Negahban, S. Oh, and D. Shah. Iterative ranking from pairwise comparisons. In Annual Conference on Neural Information Processing Systems, pages 2483–2491, 2012.

 [31] S. Oh and D. Shah. Learning mixed multinomial logit model from ordinal data. In Advances in Neural Information Processing Systems, pages 595–603, 2014.
 [32] S. Osher, M. Burger, D. Goldfarb, J. Xu, and W. Yin. An iterative regularization method for total variation-based image restoration. SIAM Journal on Multiscale Modeling and Simulation, 4(2):460–489, 2005.
 [33] S. Osher, F. Ruan, J. Xiong, Y. Yao, and W. Yin. Sparse recovery via differential inclusions. Applied and Computational Harmonic Analysis, 41(2):436–469, 2016.
 [34] B. Osting, C. Brune, and S. Osher. Enhanced statistical rankings via targeted data collection. In International Conference on Machine Learning, pages 489–497, 2013.
 [35] B. Osting, J. Darbon, and S. Osher. Statistical ranking using the l1-norm on graphs. Inverse Problems and Imaging, 7(3):907–926, 2013.
 [36] B. Osting, J. Darbon, and S. Osher. Statistical ranking using the l1-norm on graphs. Inverse Problems & Imaging, 7(3), 2013.
 [37] B. Osting, J. Xiong, Q. Xu, and Y. Yao. Analysis of crowdsourced sampling strategies for HodgeRank with sparse random graphs. Applied and Computational Harmonic Analysis, 41(2):540–560, 2016.
 [38] A. Rajkumar and S. Agarwal. A statistical convergence perspective of algorithms for rank aggregation from pairwise data. In International Conference on Machine Learning, pages 118–126, 2014.
 [39] V. C. Raykar, S. Yu, L. H. Zhao, A. K. Jerebko, C. Florin, G. H. Valadez, L. Bogoni, and L. Moy. Supervised learning from multiple experts: whom to trust when everyone lies a bit. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 889–896, 2009.
 [40] J. D. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In International conference on Machine learning, pages 713–719, 2005.

 [41] R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In International Conference on Machine Learning, pages 880–887, 2008.
 [42] H. Sheikh, Z. Wang, L. Cormack, and A. Bovik. LIVE image & video quality assessment database, 2008.
 [43] A. Sheshadri and M. Lease. SQUARE: A benchmark for research on computing crowd consensus. In AAAI Conference on Human Computation and Crowdsourcing, 2013.
 [44] C. Suh, V. Y. F. Tan, and R. Zhao. Adversarial top ranking. IEEE Transactions on Information Theory, 63(4):2201–2225, 2017.
 [45] N. M. Tran. HodgeRank is the limit of Perron Rank. Mathematics of Operations Research, 41(2):643–647, 2016. arXiv:1201.4632.
 [46] M. Venanzi, J. Guiver, G. Kazai, P. Kohli, and M. Shokouhi. Community-based Bayesian aggregation models for crowdsourcing. In International World Wide Web Conference, pages 155–164, 2014.

 [47] R. K. Vinayak and R. Gilad-Bachrach. DART: dropouts meet multiple additive regression trees. In International Conference on Artificial Intelligence and Statistics, 2015.
 [48] J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. R. Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems, pages 2035–2043, 2009.
 [49] Q. Xu, Q. Huang, T. Jiang, B. Yan, W. Lin, and Y. Yao. HodgeRank on random graphs for subjective video quality assessment. IEEE Transactions on Multimedia, 14(3):844–857, 2012.
 [50] Q. Xu, T. Jiang, Y. Yao, Q. Huang, B. Yan, and W. Lin. Random partial paired comparison for subjective video quality assessment via HodgeRank. pages 393–402. ACM Multimedia, 2011.
 [51] Q. Xu, J. Xiong, X. Cao, and Y. Yao. False discovery rate control and statistical quality assessment of annotators in crowdsourced ranking. In International Conference on Machine Learning, pages 1282–1291, 2016.
 [52] Q. Xu, J. Xiong, X. Cao, and Y. Yao. Parsimonious mixed-effects HodgeRank for crowdsourced preference aggregation. page preprint. ACM Multimedia, 2016.
 [53] J. Yi, R. Jin, S. Jain, and A. Jain. Inferring users’ preferences from crowdsourced pairwise comparisons: A matrix completion approach. In AAAI Conference on Human Computation and Crowdsourcing, 2013.
 [54] S. Yu. Angular embedding: A robust quadratic criterion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(1):158–173, 2012.
 [55] J. Yuan, G. Steidl, and C. Schnorr. Convex Hodge decomposition and regularization of image flows. Journal of Mathematical Imaging and Vision, 33(2):169–177, 2009.
 [56] Y. Zheng, R. Cheng, S. Maniu, and L. Mo. On optimality of jury selection in crowdsourcing. In International Conference on Extending Database Technology, pages 193–204, 2015.
 [57] Y. Zheng, G. Li, and R. Cheng. DOCS: domain-aware crowdsourcing system. PVLDB, 10(4):361–372, 2016.
 [58] Y. Zheng, G. Li, Y. Li, C. Shan, and R. Cheng. Truth inference in crowdsourcing: Is the problem solved? PVLDB, 10(5):541–552, 2017.
 [59] Y. Zheng, J. Wang, G. Li, R. Cheng, and J. Feng. QASCA: A quality-aware task assignment system for crowdsourcing applications. In ACM International Conference on Management of Data, pages 1031–1046, 2015.