Consensus measure of rankings

04/27/2017 ∙ by Zhiwei Lin, et al. ∙ Nanyang Technological University, Ulster University

A ranking is an ordered sequence of items, in which an item with a higher ranking score is preferred over items with lower ranking scores. In many information systems, rankings are widely used to represent preferences over a set of items or candidates. The consensus measure of rankings is the problem of evaluating the degree to which a set of rankings agree. It can be used to evaluate rankings in many information systems, as quite often there is no ground truth available for evaluation. This paper introduces a novel approach to the consensus measure of rankings using a graph representation, in which the vertices or nodes are the items and the edges are the relationships between items in the rankings. Such a representation leads to various algorithms for consensus measure in terms of different aspects of the rankings, including the number of common patterns, the number of common patterns of a fixed length, and the length of the longest common patterns. The proposed measure can be adopted for various types of rankings, such as full rankings, partial rankings and rankings with ties. This paper demonstrates how the proposed approaches can be used to evaluate the quality of rank aggregation and the quality of top-k rankings from the Google and Bing search engines.


I Introduction

In many information systems, rankings are widely used to represent preferences over a set of items or candidates, ranging from information retrieval and recommender systems to decision making systems [1, 2, 3, 4, 5, 6], in order to improve the quality of the services provided by those systems. For example, in a search engine, the list of terms suggested after a user's first few keystrokes is a typical ranking, and such a ranking service, widely adopted nowadays, has great impact on the user's search experience; the list of search results returned after a query is issued is also a ranking.

A ranking is an ordered sequence of items, in which an item with a higher ranking score is preferred over items with lower ranking scores. The consensus of rankings is the degree to which the rankings agree according to certain common patterns. The consensus measure can be used in many information systems to uncover how close or related the rankings are. For example, in group decision making, a group of experts express their preferences over a set of candidates by using rankings, and measuring the degree of consensus is very useful for reaching consensus [2].

In many information systems with large volumes of items, such as search engines, it is hard to clearly define what the ground truth is, which makes it more difficult to evaluate and compare the rankings returned by the systems. The consensus measure of rankings, as a tool for understanding how related or close the rankings are, will help engineers and researchers discern which aspects of a ranking system need to be improved and detect outliers [7, 8]. For a set of rankings , one approach to understanding the degree to which the rankings agree is to use a rank correlation or similarity function by pairwise comparison [3, 9, 10, 11, 12, 13, 14, 15, 16]. Notable functions include the Kendall index and the Spearman index [12, 14], which, however, do not have a weighting scheme with which less important items can be penalized. It is common in information retrieval that the documents (items) at the top of a ranking list are more important than those at the bottom [17]. As such, it makes sense to reduce the impact of the bottom items with a weighting scheme. For example, the variation of the index with average precision, denoted by , is able to give greater weight to the top items of the ranking lists [9]. These methods assume rankings are conjoint, meaning that the items in the rankings overlap completely. Consequently, they cannot be used for partial rankings, in which the items may not mutually overlap. As a similarity function for two partial rankings, RBO (rank-biased overlap) weights the number of common items according to the depth of the rankings [15], but it does not take into account the order of items in the rankings.

When one of the correlation or similarity functions is used as a consensus measure for a set of rankings in , we can aggregate the pairwise comparison values across all rankings for times. Since the pairwise comparison is based on the degree of commonality in two rankings with respect to features or patterns (e.g., the common items, or the concordant pairs against the discordant pairs), the aggregated result is not informative enough to tell the extent to which the rankings agree in , according to the study by Elzinga et al. [18]. Moreover, rankings can be full or partial, especially top-k [17], and the existing measures fail to meet the requirements for handling different types of rankings.

In order to effectively evaluate and compare rankings, which may be full or partial and in which some items may need to be weighted, this paper proposes a new approach based on a graph representation. The novelty of this paper lies in the fact that the proposed consensus measure of rankings does not need pairwise comparison, which is significantly different from the pairwise approaches using similarity or correlation functions. The contributions of the paper include:

  • we introduce a directed acyclic graph (DAG) to represent the relationship between items in the rankings so that such representation can be used to induce efficient algorithms for consensus measure of rankings;

  • the proposed DAG representation enables us to approach the consensus measure of rankings in terms of different aspects of the common features or patterns hidden in the rankings, including the number of common patterns, the number of common patterns with a fixed length , and the length of the longest common patterns;

  • the proposed DAG representation is extended to allow the edges in the graph to have weights, so that more “important” features or patterns are assigned higher values and features or patterns with less “importance” are penalized;

  • we also demonstrate that the consensus measure of rankings with the graph representation can be extended to handle duplicate rankings, rankings with ties, and rankings whose top items need to be weighted;

  • we show that our approach can be used for different types of rankings, including full rankings and top-k rankings.

The rest of the paper is organized as follows. Section II introduces the important notations and concepts used in the paper, followed by a review of related work in Section III. Section IV presents a directed graph representation approach for consensus measure. Section V shows how the proposed approaches can be used to evaluate rank aggregation and to compare top-k rankings. The paper is concluded in Section VI.

II Preliminaries

This section introduces notations and concepts of graph, ranking sequence, and consensus measure that will be used in the rest of the paper.

II-A Directed graph

A directed graph is a pair , where is the set of nodes (or vertices) and is the set of directed edges. A directed edge means that the edge leaves node and enters node . An edge that leaves a node and returns to the same node is called a loop. Given a graph with nodes, matrix is used to denote the adjacency matrix of graph , where if there exists edge ; and , otherwise.

The adjacency matrix assumes that all the edges have identical weights of 1, and this can be relaxed in the weighted directed graph. A weighted directed graph is a directed graph, in which is a set of weights on the edges and each edge is assigned a non-zero weight . Then, the adjacency matrix for is defined as if and , otherwise.
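The adjacency-matrix construction above can be sketched in a few lines of Python; the node count, edge list, and helper name below are illustrative choices of ours, not from the paper.

```python
import numpy as np

def adjacency_matrix(n_nodes, weighted_edges):
    """Adjacency matrix of a weighted directed graph: entry (i, j) holds
    the weight of edge i -> j, and 0 when there is no such edge."""
    A = np.zeros((n_nodes, n_nodes))
    for i, j, w in weighted_edges:
        A[i, j] = w
    return A

# Three nodes with edges 0 -> 1 (weight 2.0) and 1 -> 2 (weight 0.5);
# the unweighted case is recovered by setting every weight to 1.
A = adjacency_matrix(3, [(0, 1, 2.0), (1, 2, 0.5)])
```

Setting all weights to 1 gives back the plain 0/1 adjacency matrix of the unweighted definition.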

A path from node to is a sequence of distinct non-loop edges connecting node and .

II-B Ranking sequences

A ranking is an ordered sequence of distinct items drawn from a universe , where and is more preferred than if . The length of is denoted by . For notational simplicity, we shall simply write a ranking as a sequence of in the rest of the paper.

For a ranking , where for , we can define the embedded patterns with respect to subsequences. A sequence is called a subsequence of , denoted by , if can be obtained by deleting items from . We denote by that is not a subsequence of . For example, , and .

A ranking sequence with no items is an empty sequence. We use to denote the set of all possible non-empty subsequences of . can be partitioned into subsets , where consists of all subsequences of length . For example, if , then , in which each subsequence has length .

The degree to which rankings agree lies in the common patterns or features which are embedded in the rankings. For ranking sequences, the subsequences are the patterns or features. Given a set of rankings , consider , each element is a common subsequence of , for which we also use the notation . Similar to , we also define to denote the subsets of all common subsequences of length . Therefore, it holds that , where . In a special case, for two rankings and , we will write to denote the set of -long common subsequences between and .
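To make these definitions concrete, the sets of common subsequences can be enumerated by brute force; this is exponential in the ranking length, so it serves only as a tiny-example sketch (the helper names and the two sample rankings are ours), and Section IV replaces this enumeration with an efficient graph computation.

```python
from itertools import combinations

def subsequences(ranking):
    """All non-empty subsequences of a ranking, as tuples of items."""
    items = list(ranking)
    subs = set()
    for k in range(1, len(items) + 1):
        for idx in combinations(range(len(items)), k):
            subs.add(tuple(items[i] for i in idx))
    return subs

def common_subsequences(rankings):
    """Subsequences shared by every ranking in the set."""
    common = subsequences(rankings[0])
    for r in rankings[1:]:
        common &= subsequences(r)
    return common

# Two rankings over {a, b, c}: the common subsequences are the three
# singletons plus ('a','b') and ('a','c'), so the count is 5 and the
# longest common subsequence has length 2.
common = common_subsequences(["abc", "acb"])
```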

It is clear that accommodates all common features (subsequences), which are subsumed by each ranking . Let denote the number of all common subsequences of , i.e.,

(1)

The more common features has, or the bigger is, the higher the degree of consensus has. We also define

(2)

in order to measure the consensus in with respect to the number of subsequences of a given length . The length of the longest common subsequences of rankings in is denoted by or simply . Then, .

Therefore, we have the following properties:

  • For a set with only one ranking , where , ;

  • For a set of two rankings , where and , we have

  • For two sets of rankings and , if , then

    (3)

II-C Consensus measure of rankings in feature spaces

For a set of rankings , we can form a set of features . Let and . Each ranking

can be represented by a feature vector with a mapping function

:

where

(4)

It is clear that , defined in Equation (1) can be rewritten using the inner product on -inner product spaces [19, 20] as

(5)

With the generalized inner product, we find that is a kernel function [21] when . The rewritten form relies on the definition of in Equation (4), whose co-domain is . It is computationally expensive to enumerate all the features and to form . Later in the paper, we will transform the relationships between items into a graph so that efficient algorithms can be found without enumerating features explicitly, similar to the kernel trick for kernel functions [21].

III Related work

A ranking can be a full or partial ranking, depending on the number of items from being ranked. A ranking is a full ranking if . A ranking is called a partial ranking if the items in form a subset of . A top-k ranking is a sub-ranking of a full ranking containing only the top-k items. Rankings with ties occur when some items share an identical ranking score, which happens very often in decision making or voting processes [16]. For example, in a ranking , both items and are assigned an identical ranking score.

Evaluation or comparison of rankings is an important task in many ranking-related systems, including decision making, information retrieval, voting and recommender systems [1, 2]. One approach to evaluating rankings is to use rank correlation between two rankings. The widely used Kendall index [12, 13] is a measure of rank correlation between two rankings and over items that takes into account the 2-long common subsequences between them, which can be formulated as

where is a reverse ranking of . In rank aggregation, one could also use the Kendall distance – a variation of the Kendall :

The Spearman index is another measure of rank correlation; it does not utilize the 2-long common subsequences but instead takes into account each item's positions in and [14]. It is defined as follows

where . The Spearman footrule distance is an distance, which is a variation of the Spearman :

Compared with the Spearman , the Kendall ignores items' positions, which are in many cases very important factors, e.g., for top-k rankings. Moreover, the Spearman cannot be used for sensitivity detection and analysis, as studied in [13].
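For reference, the two classical indices reviewed above can be computed directly for full rankings over the same items. This is a plain-Python sketch of the standard definitions; the function names are ours.

```python
def kendall_tau(r1, r2):
    """Kendall rank correlation: (concordant - discordant) / (n choose 2)."""
    pos2 = {item: i for i, item in enumerate(r2)}
    n = len(r1)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # The pair (r1[i], r1[j]) is concordant if it keeps its order in r2.
            if pos2[r1[i]] < pos2[r1[j]]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def spearman_rho(r1, r2):
    """Spearman rank correlation from squared position differences."""
    pos2 = {item: i for i, item in enumerate(r2)}
    n = len(r1)
    d2 = sum((i - pos2[item]) ** 2 for i, item in enumerate(r1))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

Identical rankings give +1 under both indices, and exactly reversed rankings give -1.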

Both the Kendall and the Spearman can only be used for full rankings; they cannot be used for partial or top-k rankings. Even for full rankings, both of them lack weighting schemes and are not flexible enough for rankings whose items at the top are more important than the items at the bottom [17]. Therefore, it is necessary to reduce the impact of the bottom items with a down-weighting scheme. For example, the variation of the index with average precision, denoted by , is able to give greater weight to the top items of the ranking lists [9]. Shieh also developed a weighted metric based on the Kendall by adding weighting factors to the 2-long subsequences [22]. For full rankings with ties, was proposed based on the Kendall index [16]. One extension to the Spearman index, by Iman et al., assigns higher weights to the items at the top [23].

The above methods assume that rankings are full rankings, meaning that the items in the rankings overlap completely. Therefore, they cannot be used for partial rankings. In information retrieval, it is often more interesting to compare rankings based on their top-k items. Fagin et al. proposed two measures and by adapting both the Kendall and the Spearman for top-k rankings [17]. As a similarity function for two partial rankings, RBO (rank-biased overlap) weights the number of common items according to the depth of the rankings [15], but it does not take into account the order of items in the rankings.

Full Partial Weighted Ties
TABLE I: A summary of the popular indices for various types of rankings with comparison to our approach of and .

These functions are pairwise comparisons, and they can be turned into a consensus measure for a set of rankings by aggregating the pairwise distance values across all rankings. For example, one can use if the Kendall index is preferred. However, this aggregated result is not informative enough to tell the extent to which the rankings agree in , according to the study by Elzinga et al. [18].

We summarize the popular indices in Table I and show that our approach of and is more flexible for various types of rankings, which will be demonstrated in the next section. Moreover, the existing indices shown in Table I cannot be used for sensitivity detection in the consensus measure, while our approach can discern how the rankings come to agree by varying the parameters for the gaps and positions of items, as pointed out in Section IV-D1 and verified in Section V.

IV Graph representation for consensus measure of rankings

This section will introduce a graph approach to consensus measure of rankings by calculating and .

IV-A A motivating example

Consider a set of rankings . Without loss of generality, we randomly pick (note that ) and form a lower triangular matrix of size , where for , if the item and the item of both occur in the same order in all rankings in , and otherwise. Then we obtain matrix :

(6)

With matrix , we can induce a weighted directed graph on the diagonal elements of , where is the set of vertices, is the set of loops and is the set of non-loop edges. Later, we may use interchangeably without confusion, as each stands for an item. Hereinafter, we shall distinguish loops from non-loop edges and, abusing notation, simply call the latter edges.

The edges are drawn as follows: for , an edge from to is added if the following conditions all hold: (1) ; (2) ; (3) . We also add dashed loops on the diagonal elements of value . Figure 1 shows the weighted directed graph for the matrix , in which there are seven directed (solid) edges, i.e.,

or simply . Those edges are the 2-long common subsequences: , and all of them occur in , , and . As such, . Similarly, paths (recall that our definition of a path excludes loop edges) of length correspond to common subsequences of length . We find that , with the common subsequences being , , and . Next, , with the common subsequence being . There are no longer common subsequences, since the length of the longest path in is .

In Figure 1, the five dashed loops mean five singletons, i.e., , , , , . As a result, . Therefore, we obtain .
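The construction sketched above — pick one ranking, record pairwise order agreement across all rankings in a lower triangular matrix, and read 2-long common subsequences off the non-loop edges — can be mocked up as follows. The three sample rankings and the helper name are illustrative; the paper's actual example set and its matrix in Equation (6) are not reproduced here.

```python
import numpy as np

def order_matrix(rankings):
    """Lower triangular 0/1 matrix M over the items of a reference ranking:
    off-diagonal M[a, b] = 1 iff base[b] precedes base[a] in every ranking
    (a non-loop edge), and M[a, a] = 1 iff base[a] occurs in every ranking
    (a loop)."""
    base = min(rankings, key=len)   # reference ranking (fewest items)
    pos = [{item: i for i, item in enumerate(r)} for r in rankings]
    n = len(base)
    M = np.zeros((n, n), dtype=int)
    for a in range(n):
        for b in range(a):
            x, y = base[b], base[a]
            if all(x in p and y in p and p[x] < p[y] for p in pos):
                M[a, b] = 1
        if all(base[a] in p for p in pos):
            M[a, a] = 1
    return base, M

base, M = order_matrix(["abcd", "abdc", "acbd"])
A = np.tril(M, -1)        # strictly lower part: the non-loop edges
c2 = int(A.sum())         # number of 2-long common subsequences
```

For these three rankings the common 2-long subsequences are ab, ac, ad and bd (bc fails in the third ranking, cd in the second), so `c2` is 4.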

Fig. 1: The weighted directed graph of the matrix (in Equation (6)) with directed edges , from to , where , if , , and .

This process of finding patterns of various lengths with the graph representation not only allows us to calculate , but also makes it easy to calculate the number of all common patterns and the length of the longest common subsequences .

IV-B Consensus measure by graph representation


Fig. 2: The weighted directed graph generalized from the matrix (in Equation (6)) with weights on edges and weights on loops .

The example shown in Fig. 1 presents an approach with graph representation for the consensus measure of rankings when . This section extends the graph representation to the consensus measure of rankings by calculating and , when .

In Equation (4), the definition of assumes that the features in are all assigned a weight of 1. However, this is not appropriate in many cases, when some features or items in are more important than others [9, 15]. The definition of is not flexible enough to differentiate the importance of the features. As such, we shall extend it to if , and if , so that “important” features receive higher values of while features with less importance are “penalized” with lower . Therefore, we rewrite Equation (5) as

(7)

for .

In the DAG shown in Figure 1, we assume that the weights on the edges equal 1, which does not reflect how each subsequence is embedded in the original rankings. Considering the four rankings in , item occurs at different positions in the rankings, as shown in the following table:

6 5 10 5

The position of in is 10, which deviates substantially from the positions of in the other rankings. In order to incorporate such factors, which may affect the degree of consensus, we relax the assumption that the weights are identically 1 and generalize the induced weighted DAG by introducing two functions and . Figure 2 shows the new DAG, where each edge is associated with weight and each loop is assigned , where and . We will illustrate how the two functions reflect those factors in the following sections, and how they are related to in Equation (7).

For simpler presentation of our algorithm, we introduce the (left-continuous) Heaviside function

(8)

Now we present the following theorem for measuring the consensus of rankings.

Theorem 1.

Given a set of rankings over a universe , where each ranking is naturally associated with a map defined as

(9)

Let be an arbitrary ranking from , , and be an adjacency matrix of a graph, where

(10)

and let be the strictly lower triangular part of , and be a vector of all ones. Then,

(11)
Proof.

Note that gives the sum of all entries in a matrix . By the definition of , we know that if and otherwise. It follows that is the number of s on the diagonal, which, noting that all other diagonal entries are s, equals the sum of the diagonal entries of , or .

For , it is a classical inductive argument that when , where is the number of common subsequences of length which begin with and end with . The advertised result then follows from the fact that all rankings have distinct items, and thus the common subsequences also have distinct items. ∎

Theorem 1 shows that the individual items are weighted by and the edges between any two items by , reflecting the strength of the relationship between two items and .
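In the unit-weight special case (all loop and edge weights equal to 1), Theorem 1 reduces to counting loops plus paths of every length, which a few lines of numpy can illustrate. The matrix below is our own illustrative data, not the paper's Equation (6).

```python
import numpy as np

def consensus_count(M):
    """Unit-weight consensus count: loops on the diagonal give the 1-long
    common subsequences, and 1^T A^k 1 counts the k-edge paths, i.e. the
    (k+1)-long common subsequences, for the strictly lower triangular A."""
    n = M.shape[0]
    A = np.tril(M, -1).astype(float)   # non-loop edges
    ones = np.ones(n)
    total = float(np.trace(M))         # 1-long common subsequences (loops)
    v = ones.copy()
    for _ in range(1, n):              # A is nilpotent, so A^n = 0
        v = A @ v
        total += float(ones @ v)
    return total

# Four items, all with loops; edges encode the common pairs ab, ac, ad, bd,
# and the single two-edge path a -> b -> d encodes the common triple abd.
M = np.array([[1, 0, 0, 0],
              [1, 1, 0, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1]])
c = consensus_count(M)                 # 4 loops + 4 edges + 1 path = 9
```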

IV-B1 – weighted by standard deviation of item’s positions

The position of item in a ranking is an indication of the strength of being preferred. To show the importance of the position of , we define , the average of the positions of in , as follows.

(12)

If an item is placed in a small range of positions throughout all rankings, we assume that this item is consistently preferred at the same level by all rankings. On the other hand, if an item has a low position in one ranking but a high position in another, the big difference between the positions indicates inconsistency in the preferences over this item. To take the differences in an item's positions into account in the consensus measure, we define

(13)

where , in order to weigh the item using the positions . In fact, when feature is a singleton (i.e., ), it is clear that

where for Equation (7).
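A loop weight of this flavor can be sketched as follows. Since Equation (13)'s exact form is not reproduced above, the exponential-decay shape and the parameter `sigma` here are our assumptions; the only property carried over from the text is that higher positional disagreement yields a lower weight.

```python
import math

def position_weight(item, rankings, sigma=0.5):
    """Hypothetical loop weight: decays with the standard deviation of the
    item's (1-based) positions across the rankings; sigma in (0, 1)."""
    positions = [r.index(item) + 1 for r in rankings if item in r]
    mean = sum(positions) / len(positions)
    std = math.sqrt(sum((p - mean) ** 2 for p in positions) / len(positions))
    return sigma ** std   # zero deviation -> weight 1; more deviation -> less
```

An item ranked at the same position everywhere gets weight 1, while an item whose positions disagree is penalized.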

IV-B2 – weighted by gaps

The gap between items has been used for pairwise kernel functions or sensitivity detection [24, 21]. Now we extend this to the set-wise consensus measure of .

The weighted DAG in Figure 2 shows that the edges and , i.e., subsequences and , are quite different in terms of the distance between and , and between and , in . There are no items between and ; however, and are separated by three other items , and , which means that is much more preferred than . Therefore, for each and every 2-long subsequence , we define the gap , which indicates how much more is preferred than its successor in the 2-long subsequence with respect to the original ranking sequence . The following table shows the gaps for and with respect to the example rankings.

1 2 1 5
4 4 9 4


Clearly, the accumulated gaps for and are much bigger than those for and . This suggests that is less likely to be preferred over . To take this likelihood into account, we define

(14)

where and , so that any subsequence with bigger average gaps will be “penalized”.

Now we relate to for Equation (7). For , a -long subsequence is represented by a -long path in the graph. As each edge has weight , defining

makes Equation (11) consistent with Equation (7).
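An edge weight of this kind can be sketched as follows. Equation (14)'s exact form is not reproduced above, so the decay shape `lam ** avg_gap` and the parameter `lam` are our assumptions; what is carried over from the text is only that a bigger average gap between two items across the rankings penalizes the corresponding edge.

```python
def gap_weight(u, v, rankings, lam=0.8):
    """Hypothetical edge weight for the 2-long subsequence (u, v): decays
    with the average number of items separating u and v; lam in (0, 1)."""
    gaps = [r.index(v) - r.index(u) - 1 for r in rankings]  # items between u and v
    avg_gap = sum(gaps) / len(gaps)
    return lam ** avg_gap   # adjacent everywhere -> weight 1; wider gaps -> less
```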

IV-B3 An example for and

Comparing Fig. 1 and Fig. 2, the example in Section IV-A is obviously a special case of Theorem 1 when and .

Now we can use Theorem 1 to calculate the consensus score for the example in Section IV-A.

Example 1.

Consider , based on the matrix in Equation (6), we have

Then,

IV-C and

In Example 1, we observe that for , which implies that there are no common subsequences of length more than 4, and hence the length of the longest common subsequences of is 4. Based on this fact, the following corollary of Theorem 1 provides an algorithm for calculating .

Corollary 1.

Under the assumptions in Theorem 1, the length of the longest common subsequences in can be obtained by
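Corollary 1's idea in the unit-weight case — the longest common subsequence length is the largest k whose count is non-zero, i.e. one more than the longest path in the DAG, found by repeated multiplication of the nilpotent edge matrix — can be sketched as follows; the test matrix is our own illustrative data.

```python
import numpy as np

def longest_common_length(M):
    """Largest k with a k-long common subsequence: 1 + the length of the
    longest path in the strictly lower triangular edge matrix A."""
    n = M.shape[0]
    A = np.tril(M, -1).astype(float)
    if np.trace(M) == 0:
        return 0                      # no common items at all
    length = 1                        # at least a singleton exists
    P = A.copy()
    while P.any():                    # A^(k-1) != 0 => some k-long subsequence
        length += 1
        P = P @ A
    return length

# Loops on all four items; the two-edge path a -> b -> d makes the longest
# common subsequence 3 items long.
M = np.array([[1, 0, 0, 0],
              [1, 1, 0, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1]])
longest = longest_common_length(M)
```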

Corollary 2.

Under the assumptions in Theorem 1,

where and is the identity matrix of size . Consequently, can be computed in time.

Proof.

Since the longest possible length of a common subsequence is , Theorem 1 implies that

Invoking the identity and the observation that since is strictly lower triangular, we obtain that

(15)

Now we discuss the runtime. Since is a strictly lower triangular matrix, is lower triangular. Note that computing is equivalent to solving and, when is lower triangular, this can be done efficiently in time using forward elimination (degenerate Gaussian elimination), where denotes the number of non-zero entries in . Therefore can be computed in time, and thus in time. ∎

We remark that the runtime in Corollary 2 is significantly faster, by a factor of , than the naïve algorithm that sums up over , which would take time.
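The trick behind Corollary 2 can be shown in a few lines: because A is strictly lower triangular (hence nilpotent), the geometric series (I - A)^{-1} = I + A + A^2 + ... terminates, so the whole path sum falls out of one triangular solve instead of summing matrix powers. The example matrix is our own illustrative data; a generic solver stands in for the forward elimination named in the proof.

```python
import numpy as np

def path_sum_via_solve(A):
    """Total weight of all non-empty paths, sum_{k>=1} 1^T A^k 1, read off
    from the single linear solve (I - A) x = 1 (the identity term 1^T I 1 = n
    is subtracted at the end)."""
    n = A.shape[0]
    ones = np.ones(n)
    # For lower triangular I - A this is exactly a forward elimination.
    x = np.linalg.solve(np.eye(n) - A, ones)
    return float(ones @ x) - n

# Strictly lower triangular edge matrix with 4 edges and one two-edge path.
A = np.array([[0, 0, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 0],
              [1, 1, 0, 0]], dtype=float)
s = path_sum_via_solve(A)   # 4 single edges + 1 two-edge path = 5
```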

IV-D Remarks

Data: A set of rankings
Result:
1 Pick an arbitrary (for efficiency, we can choose the ranking with the least number of items);
2 ;
3 Initialize , with ;
4 Initialize , with ;
5 Initialize , with ;
6 Initialize , with ;
7 for  to  do
8       ;
9       for  to  do
10             ;
11            
12       end for
13      
14 end for
15 ;
16 ;
17 ;
18 ;
19 while  do
20       ;
21       ;
22       ;
23      
24 end while
25;
return
Algorithm 1 Pseudocode to compute

Algorithm 1 is the pseudocode for Theorem 1. Generating (Line 11) takes time. Computing (Line 1) takes time. For each , the matrix-vector multiplication in Line 1 takes time, since has at most non-zero entries. In fact, . Line 20 takes time. Overall, computing , after the generation of , takes time for .

IV-D1 Sensitivity detection by gaps and positions of items

There are some differences among . Note that is controlled by and in Equation (13), where is a factor reflecting the deviation of each item's positions in the rankings. Higher disagreement of the items' positions in the rankings results in a lower . Hence provides sensitivity detection of the variation of items' positions. For , the measure takes into account the strength of the relationship between two items by incorporating and in Equation (14).

We demonstrate such ability for sensitivity detection in terms of the gaps and positions of items in Section V.

IV-D2 Duplicate rankings

In the above example of and its adjacency matrix in Equation (6), we assumed that there are only distinct rankings in . However, this is not always the case, especially in group decision making processes, where duplicate rankings may be produced by the experts. For example, we may have a multi-set , , which contains a duplicate ranking of . Obviously, has a higher degree of consensus than . However, as , cannot discriminate between and . In order to distinguish them, we let be the number of occurrences of ranking in , and define