The flourishing of online crowdsourcing services (e.g., Amazon Mechanical Turk) presents an effective way to distribute tasks to human workers around the world, on demand and at scale. Recently, a plethora of pairwise comparison data has arisen in crowdsourcing experiments on the Internet Liu (2011); Chen et al. (2013), ranging from marketing and advertisements to competitions and elections. Information of this kind is all around us: which college a student selected, who won a chess match, which movie a user watched, etc. How to aggregate massive amounts of personalized pairwise comparison data to reveal the global preference function has been an important topic over the last decades Cynthia et al. (2001); Jiang et al. (2011); Xu et al. (2011); Negahban et al. (2012); Chen et al. (2013); Osting et al. (2013).
But is the aggregated result necessarily more important than individual opinions? This is not always the case, especially when the Internet is flooded with diverse personalized information. Disagreement within the crowd cannot simply be interpreted as a random perturbation of a consensus that everybody should follow. For example, we often observe quite different preferences on a college ranking or a favorite movie list. Hence a wave of personalized ranking research has arisen in recent years in search of better individualized models. One line of related research assumes that the ranking function is determined by a small number of underlying intrinsic functions, such that every individual’s personalized preference is a linear combination of these intrinsic functions Yi et al. (2013); Lu and Negahban (2015); Jiang et al. (2018); He et al. (2018). Another line of research attributes the personalized bias to user quality, where either a single parameter or a general confusion matrix is adopted to model the users’ ability to provide a correct label Hu et al. (2016); Kamar et al. (2015); Li et al. (2017); Sheshadri and Lease (2013); Venanzi et al. (2014); Zheng et al. (2015). There is also a trend to explore personalized ranking effects in terms of preference distributions Lu and Boutilier (2014, 2011). Moreover, Xu et al. (2016, 2019) take a broader view by considering both the social preference and individual variations simultaneously. Specifically, they design a basic linear mixed-effect model which can not only derive the common preference at the population level, but also estimate each user’s preference/utility deviation at the individual level.
All the work mentioned above either focuses on instance-wise preference learning or assumes that the candidates are comparable in a total order. For pairwise preference learning, however, the answer might go beyond a win/loss option in real-world scenarios. The following gives an example in crowdsourced college ranking.
Example. In world college ranking on crowdsourcing platforms such as Allourideas, a participant is asked “which university (of the following two) would you rather attend?”. As shown in Fig.1, the pairwise ranking graph has the universities to be ranked as its vertex set and the university pairs that received comparisons from users as its edge set. Different colors indicate different users. If a voter thinks one college is better than the other, a solid arrow is drawn between the two (i.e., superiority). However, when a voter thinks the two listed colleges are incomparable and difficult to judge, he/she may click the button “I can’t decide”, and a dotted line then connects the two (i.e., tie).
Here, if a voter believes that the two items in a pair share similar strength and neither is superior to the other, he/she may abstain from the decision and leave it as a tie. An abstention of this kind is an obvious means of avoiding unreliable predictions. Pairwise comparison data of this kind, together with the “I cannot decide” option, provide information about possible ties or equivalence classes of items in partial orders. Though there is some work in the literature on organizing information in partial orders of such tied subsets or equivalence classes (partitions, bucket orders) Gionis et al. (2006); Lebanon and Mao (2008), little has been done on learning individualized partial order models from such pairwise comparison data with ties.
In this paper, we aim to learn an individualized partial ranking model for each user based on this kind of pairwise ranking graph with ties. Based on the partial ranking, we could recommend universities to a specific user: for example, universities of the same quality as college A, or universities that are slightly better than college B.
Another challenge of personalized preference ranking comes from the fact that abnormal users might exist in the crowd. They either exhibit a pattern extremely different from that of the majority of the crowd or are malicious users trying to attack the learning system. To detect abnormal users in crowdsourced data, existing studies often adopt a majority voting strategy, which ignores the personalized effect.
In view of the issues mentioned above, we propose a unified framework, called iSplitLBI, for personalized partial ranking, tie state recognition, and abnormal user detection. The merits of our framework are three-fold: 1) It decomposes the parameters into three orthogonal parts, namely, abnormal signals, personalized signals, and random noise. The abnormal signals serve the purpose of abnormal user detection, while the abnormal signals and personalized signals together are mainly responsible for predicting each user's partial ranking. 2) It provides a compatible framework for both the prediction of individual preferences (i.e., model prediction) and the identification of abnormal users (i.e., model selection) by virtue of a variable splitting scheme. 3) Exploiting the regularization path, it simultaneously searches hyper-parameters and model parameters. To the best of our knowledge, this is the first proposal of such a model in the literature on partial ranking.
In crowdsourced pairwise comparison experiments, suppose there are alternatives or items to be ranked. Traditionally, the pairwise comparison labels collected from users can be naturally represented as a directed comparison graph . Let be the vertex set of items and be the set of edges, where is the set of all users who compared items. User provides his/her preference between choice and , such that means prefers to and otherwise.
However, in real-world applications, ties are ubiquitous. In this case, if a rater thinks neither of the two items in a pair is superior to the other, he/she may abstain from this decision and instead declare a tie, as is shown with the red dotted line in Fig.1. This inspires us to adopt a win/tie/lose user feedback in the following sense:
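One way to write this feedback, assuming the notation $y_{ij}^{u}$ for the label that user $u$ gives to the pair $(i, j)$ and the three classes $\{-1, 0, 1\}$ used in the experiments, is sketched below. The symbol names are assumptions made for illustration.

```latex
y_{ij}^{u} =
\begin{cases}
  1,  & \text{if user } u \text{ prefers } i \text{ to } j, \\
  0,  & \text{if user } u \text{ declares a tie between } i \text{ and } j, \\
  -1, & \text{if user } u \text{ prefers } j \text{ to } i.
\end{cases}
```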
Given the definition of the user feedback, in the rest of this section, we elaborate our proposed model in the following order. First we propose a probability model to describe the generation process of the comparison results. Then we present a simple iterative algorithm called individualized Split Linearized Bregman Iterations (i.e., iSplitLBI) for individualized partial ranking. In the end, we provide a decomposition property of iSplitLBI which dives deeper into the insights of our proposed model.
2.1 Probabilistic Model of Partial Ranking with Ties
Now we describe our dataset at hand with the following notations. Suppose that we have users and for a specific user , he/she annotates pairwise comparisons. For a specific comparison , the user provides a label correspondingly following (1). We denote the set of all pairwise comparisons available for user as , and define the label set as:
Then our dataset could be expressed as . We assume that each user has a personalized score list for all items. We denote such true personalized score lists as , where is the number of items that are available for . Furthermore, for any specific , is a personalized threshold value to be learned for decision. Then, for a specific user , and a specific observation , we assume that is produced by comparing the score difference with the threshold . Meanwhile, to model the randomness of the sampling and the decision making process, we model the uncertainty of with an associated random noise which has a c.d.f . Then, in our model, user would choose , if the observed personalized score difference is greater than the threshold . To the opposite, if is smaller than , then user would choose . Otherwise, has a smaller magnitude than , in which case the user would claim a tie. Above all, is obtained from the following rule:
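The decision rule described above can be sketched in Python. This is a minimal illustration rather than the authors' code: the function name `sample_label`, its argument names, and the choice of additive logistic noise (which matches the Bradley-Terry setting) are assumptions.

```python
import math
import random


def sample_label(score_i, score_j, threshold, rng=random):
    """Draw a win/tie/lose label for one pair from one user.

    score_i, score_j: the user's personalized scores of the two items.
    threshold: the user's non-negative tie threshold.
    The noise is standard logistic (an assumption for this sketch).
    """
    # Inverse-c.d.f. sampling of standard logistic noise.
    p = rng.random()
    p = min(max(p, 1e-12), 1 - 1e-12)  # keep the logarithm finite
    noise = math.log(p / (1 - p))

    observed = score_i - score_j + noise
    if observed > threshold:
        return 1    # the user prefers item i
    if observed < -threshold:
        return -1   # the user prefers item j
    return 0        # difference within the threshold: tie
```

A larger threshold widens the band in which the user abstains, so users with larger thresholds produce more ties for the same scores.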
Furthermore, we define two variables and as :
is a random variable with a c.d.f, we could then derive the probability to observe , respectively. Specifically, together with (3) and (4) we have:
Note that different could lead to different models. In this paper, we simply consider the most widely adopted Bradley-Terry model: , while leaving other models for future studies.
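For the Bradley-Terry choice, the noise c.d.f. is the logistic function $1/(1+e^{-t})$, and the three label probabilities follow directly from the decision rule. The sketch below assumes the noise is added to the score difference before it is compared with the threshold; the function and argument names are hypothetical.

```python
import math


def logistic_cdf(t):
    """C.d.f. of the standard logistic distribution (Bradley-Terry noise)."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def label_probabilities(score_diff, threshold):
    """Return (P(y = 1), P(y = 0), P(y = -1)) for one comparison.

    score_diff: the user's score of item i minus the score of item j.
    threshold: the user's non-negative tie threshold.
    """
    # y = 1 when score_diff + noise > threshold.
    p_win = 1.0 - logistic_cdf(threshold - score_diff)
    # y = -1 when score_diff + noise < -threshold.
    p_lose = logistic_cdf(-threshold - score_diff)
    # y = 0 in the remaining band around zero.
    p_tie = 1.0 - p_win - p_lose
    return p_win, p_tie, p_lose
```

With a zero threshold the tie probability vanishes and the model reduces to the usual two-outcome Bradley-Terry model.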
2.2 Individualized Split LBI
In our framework, we assume the majority of participants share a common preference interest and behave rationally, while deviations from that exist but are sparse. To be specific, we consider the following linear model for annotator’s individualized partial ranking:
where (1) and represent the consensus level pattern, in which is the common global ranking score, is the common , as a fixed effect, and are the th and th element of , respectively; (2) and represent the individualized bias pattern, in which is the annotator’s preference deviation from the common ranking score , is the individualized bias with , as a random effect, and are the th and th element of , respectively; (3) is the random noise.
To make the notation clear, let , and , then we could represent as:
Given all above, for a specific user , it is easy to write out the negative log-likelihood:
In the constraints we use , where , as closed and convex approximations of the positivity constraints . The benefits of employing the relaxations are two-fold: 1) the closed domain constraints induce a closed-form solution; 2) the threshold improves the quality of the solution by avoiding ill-conditioned cases that are too close to zero.
Obviously, the personalized bias could not grow arbitrarily large. More reasonably, only highly personalized users have a significant bias and , while the majority of the mass tends to have smaller or even zero biases. If we denote and , this means that
satisfies group sparsity, then we add a group lasso penalty to the loss function, which is in the form:
where is a regularization parameter. Such a structural penalty (7) can identify abnormal users whose and are nonzero. These non-zero terms increase the penalty function. However the corresponding reduction of loss function must dominate the increasing penalty so as to minimize the overall objective function. In this sense, the abnormal users capture the strong signals for individualized biases. However, it ignores the possibility that weak signals could also induce individualized biases. Such signals help to decrease the loss, but the reduction of loss is not strong enough to cover the penalty term. This motivates us to propose a variable splitting scheme to simultaneously embrace strong and weak patterns. Specifically, we model the overall signal as the sum of the strong signals and weak signals . The group lasso penalty is exhibited on the strong signals. Moreover, we give the weak signals an penalty in the form: to avoid it being arbitrarily large. Denote the parameter set as Define and , the loss function is defined as:
Instead of directly solving the above-mentioned problem, we adopt the Split Linearized Bregman Iterations which we call individualized Split LBI (iSplitLBI), which gives rise to a regularization path where both the model parameters and hyper-parameters are simultaneously evolved. The -th iteration on such a path is given as:
where the initial choice , , , , parameters , and the proximal map associated with a convex function is defined by . Here and denote the indicator functions of the sets and , respectively (the indicator function of a set is 0 when the input variable is in the set and +∞ otherwise). Hence, at each iteration, the first two steps perform a projected gradient descent on and , which keeps the variables feasible.
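Two proximal maps appear in updates of this kind: the proximal map of a group lasso penalty, which is a group-wise soft-thresholding, and the proximal map of an indicator function, which is a projection onto the corresponding set. The sketch below shows both in plain Python. The function names, the list-of-lists layout (one block per user), and the lower-bound form of the constraint set are assumptions made for illustration.

```python
import math


def group_soft_threshold(groups, threshold=1.0):
    """Proximal map of threshold * (sum of Euclidean norms of the groups).

    groups: list of lists, one inner list per user (that user's block).
    Blocks whose norm is at most `threshold` are set to zero; larger
    blocks are shrunk toward zero by `threshold` in norm.
    """
    result = []
    for block in groups:
        norm = math.sqrt(sum(v * v for v in block))
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        result.append([scale * v for v in block])
    return result


def project_lower_bound(values, lower):
    """Projection onto {x : x >= lower}, i.e. the proximal map of the
    indicator function of that set (assumed form of the constraint)."""
    return [max(v, lower) for v in values]
```

Because whole blocks are zeroed together, a user either enters the sparse model with an entire deviation block or not at all, which is what makes the support usable for user-level detection.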
The iSplitLBI algorithm generates a regularized solution path of dense estimators and sparse estimators . These sparse estimators could be obtained by projecting (, ) onto the support set of (, ), respectively. Along the path, the stopping time at in this algorithm plays the same role as the regularization parameter in the lasso problem. In fact, Eq.(9a)-(9d) describes one iteration of the optimization process, which is actually a discretization of a dynamical system shown in Huang and Yao (2018). Such a dynamical system is known as inverse scale spaces Burger et al. (2005); Osher et al. (2016); Huang et al. (2020), leveraging a regularization path consisting of sparse models at different levels from the null to the full. At iteration , the cumulative time can be regarded as the inverse of the Lasso regularization parameter (here roughly ): the larger is , the smaller is the regularization and hence the more nonzero parameters enter the model. Following the dynamics, the model gradually grows from sparse to dense models with increasing complexity. In particular as , the dynamics may reach some over-fitting models when noise exists like our case, equivalent to a full model in generalized Lasso of minimal regularization. To prevent such over-fitting models in noisy applications, we adopt an early stopping strategy to find an optimal stopping time by cross validation.
Moreover, the also plays an important role in the model. When , only sparse strong signals (features) are kept in the model, and iSplitLBI reduces to the LBI algorithm, which has been shown to reach model selection consistency under nearly the same condition as LASSO for linear models Osher et al. (2016). Recently, it was shown in Huang and Yao (2018) that model selection consistency can also hold under non-linear models. With a finite value of , it is shown in Huang et al. (2016, 2020) that the sparse estimator enjoys improved model selection consistency. Moreover, equipped with the variable splitting scheme, the finite value of enables the overall signals (here ) to capture features ignored by the strong (sparse) signals. It has been shown in the literature (e.g. Sun et al. (2017); Zhao et al. (2018)), in agreement with our discussion, that such features can improve prediction in various tasks. Now we note the following implementation details for iSplitLBI. The hyper-parameter is a damping factor which determines the bias of the sparse estimators, a bigger leading to less biased estimators (bias-free as ). The hyper-parameter is the step size which determines the precision of the path, with a large rapidly traversing a coarse-grained path. However, one has to keep small to avoid possible oscillations of the paths, e.g. . The default choice in this paper is as a tradeoff between performance and computation cost.
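To make the roles of the damping factor, the step size, and early stopping concrete, the following sketch runs the plain Linearized Bregman Iteration on a least-squares loss. It is not iSplitLBI itself: there is no variable splitting, no group structure, and no constraint. The names `kappa` and `alpha` and their values are assumptions chosen for the toy problem.

```python
def soft_threshold(z, t=1.0):
    """Scalar soft-thresholding with threshold t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lbi_path(X, y, kappa=10.0, alpha=0.01, n_iter=200):
    """Plain LBI for the loss (1 / 2n) * ||y - X theta||^2.

    Returns the list of theta iterates, i.e. the regularization path.
    kappa: damping factor (assumed name); alpha: step size (assumed name).
    """
    n, p = len(X), len(X[0])
    z = [0.0] * p
    theta = [0.0] * p
    path = []
    for _ in range(n_iter):
        residual = [y[r] - sum(X[r][c] * theta[c] for c in range(p))
                    for r in range(n)]
        grad = [-sum(X[r][c] * residual[r] for r in range(n)) / n
                for c in range(p)]
        # Gradient step on the auxiliary variable z.
        z = [z[c] - alpha * grad[c] for c in range(p)]
        # Sparse iterate: damped soft-thresholding of z.
        theta = [kappa * soft_threshold(z[c]) for c in range(p)]
        path.append(list(theta))
    return path
```

Early iterates are all zero, coordinates with strong signal enter the model first, and stopping the loop at a chosen iteration plays the role of picking a regularization level.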
2.3 Decomposition Property of iSplit LBI
By virtue of the variable splitting term, the dense parameter enjoys a specific orthogonal decomposition property, as is shown in Fig.2:
(1) is simply , i.e., the projection of on the support set of . In other words, if , and otherwise. Users corresponding to the non-zero columns of deviate significantly from the popular scores and the common threshold . Thus the structure of could tell us who is an abnormal user in the crowd. In this sense, we refer to as the abnormal signal. This corresponds to the strong signals in the last subsection. (2) Among the remainder of such projection, stands for the elements having a significantly larger magnitude than random noise. This component drives the dense parameter further away from the sparse parameter . According to the discussion in the previous subsection, this component takes into consideration the weak signals that help to further reduce the loss function. In this sense, including brings better performance to . (3) The remaining entries in are referred to as , i.e., the random noise, which is inevitable due to the randomness of the data.
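The three-part decomposition can be sketched as follows. The sketch assumes that the support is taken per user (a block is on the support when the sparse estimator has any non-zero entry in it) and uses a hypothetical cutoff `noise_level` to separate weak signals from noise; neither detail is specified in the text above.

```python
def decompose(dense, sparse, noise_level):
    """Split each user's dense deviation block into three parts.

    dense, sparse: lists of per-user blocks (lists of floats), same shape.
    noise_level: hypothetical magnitude cutoff between weak signal and noise.
    Returns (abnormal, weak, noise), which sum entrywise to `dense`.
    """
    abnormal, weak, noise = [], [], []
    for d_block, s_block in zip(dense, sparse):
        # A user is on the support if the sparse block is non-zero.
        on_support = any(v != 0.0 for v in s_block)
        a = [v if on_support else 0.0 for v in d_block]
        rest = [v - av for v, av in zip(d_block, a)]
        # Entries of the remainder above the cutoff count as weak signal.
        w = [r if abs(r) > noise_level else 0.0 for r in rest]
        n = [r - wv for r, wv in zip(rest, w)]
        abnormal.append(a)
        weak.append(w)
        noise.append(n)
    return abnormal, weak, noise


def abnormal_users(sparse):
    """Indices of users whose sparse block is non-zero."""
    return [u for u, block in enumerate(sparse) if any(v != 0.0 for v in block)]
```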
With all above, we present a compatible framework for both model prediction and model selection: (1) The strong signal contains all the personalized biases which is a better choice for model prediction; (2) and exclude the weak and dense personalized signals in the overall signals, which makes it a natural choice of abnormal user identification using model selection. This motivates us to take advantage of the support set of to detect abnormal users, while utilizing for prediction.
3.1 Simulated Study
Settings. We validate our algorithm on simulated data with items and annotators. We first generate the true common ranking scores . Then each annotator has a probability to have a nonzero . Those nonzero s are drawn randomly from . If is nonzero, we generate as , otherwise we simply set , where . At last, we draw samples for each user randomly following the Bradley-Terry model. The sample number uniformly spans . Finally, we obtain a multi-edge graph with ties annotated by 50 annotators.
Abnormal User Detection. In this part, we validate abnormal user detection ability of iSplitLBI with visualization analysis. As we have stated, the support set of (or equivalently) implies the abnormal users. In this sense, we visualize the (the ground-truth parameters) and (the estimated parameters) in Fig.3 (a)-(b), whereas we visualize the magnitude of (i.e. ) and (i.e. ) in Fig.3 (c)-(d). Although the magnitude of tends to be smaller than the true parameter, the results in Fig.3 (a)-(b) clearly suggest a perfect detection of the abnormal users.
Furthermore, Fig.4 shows the L2-distance between each user’s individualized ranking and the common ranking. Clearly, the abnormal users we detected all exhibit a larger L2-distance from the common ranking than the other users. This indicates that the 13 detected abnormal users are those with large deviations from the population’s opinion.
Prediction Ability. After showing the successful detection of abnormal users, in the following, we will exhibit the prediction ability of the proposed iSplit LBI method.
(1) Evaluation metrics: We measure the experimental results via two evaluation criteria, Macro-F1 and Micro-F1, over the three classes {-1, 0, 1}; both take precision and recall into account. The larger the value of Micro-F1 and Macro-F1, the better the performance. For more details, please refer to Zhang and Zhou (2014).
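As a reference for how the two criteria differ, the following sketch computes both for single-label predictions over the three classes. The function name and signature are illustrative and not taken from the paper.

```python
def f1_scores(y_true, y_pred, labels=(-1, 0, 1)):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro-F1 pools the counts over classes before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro-F1 averages the per-class F1 scores with equal weight.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro
```

Macro-F1 weights the tie class equally with the other two even when ties are rare, so it is the more sensitive of the two to how well ties are recognized.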
(2) Competitors: We employ two competitors that share most of the problem settings with iSplitLBI. i) The -cut algorithm Cheng et al. (2010) is an early attempt at common partial ranking. Since -cut is an ensemble-based algorithm, its performance depends on the choice of weak learners. Consequently, we compare our proposed algorithm with -cut variants that adopt different types of weak learners and regularization schemes. For the weak learners in -cut, we tune the coefficients of the Ridge/LASSO regularization and pick the best parameters through 5-fold cross-validation on the training set. ii) A recently developed margin-based MLE method Xu et al. (2018), where the Uniform, Bradley-Terry, and Thurstone-Mosteller models are considered, respectively.
(3) Qualitative Results: Tab.1 shows the performance of our proposed algorithm and the competitors. In this table, the second column shows the weak learners and regularization terms employed in -cut and the three models proposed in the MLE-based algorithm. Specifically, LR stands for logistic regression, SVM for the Support Vector Machine, LS for the method of least squares, and SVR for Support Vector Regression. For regularization, we employ the Ridge and LASSO terms. We split each user’s pairwise comparisons into a training set and a testing set, and repeat this procedure 20 times to ensure statistical stability. iSplit LBI significantly outperforms the other two competitors in both Micro-F1 and Macro-F1, owing to its individualized property.
3.2 Human Age
Dataset. In this dataset, 25 images from the human age dataset FG-NET (http://www.fgnet.rsunit.com/) are annotated by a group of volunteers on the ChinaCrowds platform. Each annotator is presented with two images and asked to choose which one is older (or to indicate that it is difficult to judge). In total, we obtain 9589 responses from 91 annotators.
Qualitative Results. Tab.2 shows the performance of our proposed algorithm and the competitors. Our proposed algorithm significantly outperforms the other two competitors in terms of both Micro-F1 and Macro-F1. Moreover, Fig.5 (a) shows the L2-distance between the individualized ranking of selected users (i.e., the top 10% and bottom 10% in the regularization path) and the common ranking. Users who jump out earlier (i.e., the top 10%, marked in pink) show a larger L2-distance; they are thus the users with large deviations from the population’s opinion and can be treated as abnormal users. On the contrary, users who jump out later (i.e., the bottom 10%, marked in blue) tend to have a smaller or even zero L2-distance.
3.3 WorldCollege Ranking
Dataset. We now apply the proposed method to the world college ranking dataset, which is composed of 261 colleges. Using the Allourideas crowdsourcing platform, a total of 340 random annotators with different backgrounds from various countries (e.g., USA, Canada, Spain, France, Japan, China, etc.) are shown random pairs of these colleges and asked to decide which of the two universities is more attractive to attend. If the voter thinks the two colleges are incomparable, he/she can choose the third option by clicking “I cannot decide”. In total, we obtain 11012 responses, among which 9409 are pairwise comparisons with clear opinions (i.e., 1/-1) and the remaining 1603 are records in which the voter clicked “I cannot decide” (i.e., 0).
Qualitative Results. Tab.3 shows the comparison results on the college dataset. Our proposed algorithm again achieves better Micro-F1 and Macro-F1 by a large margin over all the -cut and MLE-based variants. To investigate the reason behind this, we further compare our proposed algorithm with the MLE-based algorithms in terms of fine-grained precision and recall on each label in Fig.5 (c). For labels -1 and 1, the performance improvement is relatively small, whereas a sharp improvement is observed for label 0. This suggests that the overall improvement of our proposed algorithm comes mainly from its strength in recognizing the incomparable pairs, which is exactly the main pursuit of this paper. Moreover, as for the human age dataset, we also plot the distance between the top/bottom 10% users’ individualized ranking and the common ranking, and a similar phenomenon occurs on this dataset, as shown in Fig.5 (b). Again, we see a significant difference between the most individualized rankers and the least individualized rankers.
In this paper, we propose a novel method called iSplitLBI which is capable of simultaneously predicting personalized rankings with ties and detecting the abnormal users in the crowd. To tackle the personalized deviations of the scores, a hierarchical decomposition of the model parameters is designed where both the popular opinions and the individualized effects are taken into consideration. Furthermore, a specific variable splitting scheme is adopted to separate the functionality of model prediction and abnormal user detection. Experiments on both simulated examples and real-world applications demonstrate the effectiveness of the proposed method.
This work was supported in part by the National Key R&D Program of China (Grant No. 2016YFB0800403), in part by National Natural Science Foundation of China: 61620106009, U1636214, 61836002, U1803264, U1736219, 61672514 and 61976202, in part by National Basic Research Program of China (973 Program): 2015CB351800, in part by Key Research Program of Frontier Sciences, CAS: QYZDJ-SSW-SYS013, in part by the Strategic Priority Research Program of Chinese Academy of Sciences, Grant No. XDB28000000, in part by Peng Cheng Laboratory Project of Guangdong Province PCL2018KP004, in part by Beijing Natural Science Foundation (4182079), in part by Youth Innovation Promotion Association CAS, and in part by Hong Kong Research Grant Council (HKRGC) grant 16303817.
-  (2005) Nonlinear inverse scale space methods for image restoration. In International Workshop on Variational, Geometric, and Level Set Methods in Computer Vision, pp. 25–36. Cited by: §2.2.
-  (2013) Pairwise ranking aggregation in a crowdsourced setting. In International Conference on Web Search and Data Mining, pp. 193–202. Cited by: §1.
-  (2010) Predicting partial orders: ranking with abstention. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 215–230. Cited by: §3.1.
-  (2001) Rank aggregation methods for the web. In International Conference on World Wide Web, pp. 613–622. Cited by: §1.
-  (2006) Algorithms for discovering bucket orders from data. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 561–566. Cited by: §1.
-  (2018) Adversarial personalized ranking for recommendation. In International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 355–364. Cited by: §1.
-  (2016) Crowdsourced POI labelling: location-aware result inference and task assignment. In International Conference on Data Engineering, pp. 61–72. Cited by: §1.
-  (2016) Split LBI: an iterative regularization path with structural sparsity. In Advances in Neural Information Processing Systems, pp. 3369–3377. Cited by: §2.2.
-  (2020) Boosting with structural sparsity: a differential inclusion approach. Applied and Computational Harmonic Analysis 48 (1), pp. 1–45. Cited by: §2.2, §2.2.
-  (2018) A unified dynamic approach to sparse model selection. In International Conference on Artificial Intelligence and Statistics, pp. 2047–2055. Cited by: §2.2, §2.2.
-  (2011) Statistical ranking and combinatorial Hodge theory. Mathematical Programming 127 (6), pp. 203–244. Cited by: §1.
-  (2018) Recommendation in heterogeneous information networks based on generalized random walk model and bayesian personalized ranking. In ACM International Conference on Web Search and Data Mining, pp. 288–296. Cited by: §1.
-  (2015) Identifying and accounting for task-dependent bias in crowdsourcing. In AAAI Conference on Human Computation and Crowdsourcing, Cited by: §1.
-  (2008) Non-parametric modeling of partially ranked data. Journal of Machine Learning Research 9, pp. 2401–2429. Cited by: §1.
-  (2017) CDB: optimizing queries with crowd-based selections and joins. In ACM International Conference on Management of Data, pp. 1463–1478. Cited by: §1.
-  (2011) Learning to rank for information retrieval. Springer. Cited by: §1.
-  (2011) Learning mallows models with pairwise preferences. In International Conference on Machine Learning, pp. 145–152. Cited by: §1.
-  (2014) Effective sampling and learning for mallows models with pairwise-preference data. The Journal of Machine Learning Research 15 (1), pp. 3783–3829. Cited by: §1.
-  (2015) Individualized rank aggregation using nuclear norm regularization. In Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1473–1479. Cited by: §1.
-  (2012) Iterative ranking from pair-wise comparisons. In Advances in neural information processing systems, pp. 2474–2482. Cited by: §1.
-  (2016) Sparse recovery via differential inclusions. Applied and Computational Harmonic Analysis 41 (2), pp. 436–469. Cited by: §2.2, §2.2.
-  (2013) Enhanced statistical rankings via targeted data collection. In International Conference on Machine Learning, pp. 489–497. Cited by: §1.
-  (2013) Square: a benchmark for research on computing crowd consensus. In AAAI conference on human computation and crowdsourcing, Cited by: §1.
-  (2017) GSplit LBI: taming the procedural bias in neuroimaging for disease prediction. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 107–115. Cited by: §2.2.
-  (2014) Community-based bayesian aggregation models for crowdsourcing. In International Conference on World Wide Web, pp. 155–164. Cited by: §1.
-  (2011) Random partial paired comparison for subjective video quality assessment via HodgeRank. In ACM International Conference on Multimedia, pp. 393–402. Cited by: §1.
-  (2019) From social to individuals: a parsimonious path of multi-level models for crowdsourced preference aggregation. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (4), pp. 844–856. Cited by: §1.
-  (2016) Parsimonious mixed-effects HodgeRank for crowdsourced preference aggregation. In ACM International Conference on Multimedia, pp. 841–850. Cited by: §1.
-  (2018) A margin-based mle for crowdsourced partial ranking. In ACM International Conference on Multimedia, pp. 591–599. Cited by: §3.1.
-  (2013) Inferring users’ preferences from crowdsourced pairwise comparisons: a matrix completion approach. In AAAI Conference on Human Computation and Crowdsourcing, pp. 207–215. Cited by: §1.
-  (2014) A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering 26 (8), pp. 1819–1837. Cited by: §3.1.
-  (2018) MSplit LBI: realizing feature selection and dense estimation simultaneously in few-shot and zero-shot learning. In International Conference on Machine Learning, pp. 5907–5916. Cited by: §2.2.
-  (2015) QASCA: a quality-aware task assignment system for crowdsourcing applications. In ACM SIGMOD International Conference on Management of Data, pp. 1031–1046. Cited by: §1.