1 Introduction
Given a dataset with multiple attributes, the challenge is to combine the values of multiple attributes to arrive at a rank. In many applications, especially in databases with numeric attributes, a weight vector $\vec{w}$ is used to express user preferences in the form of a linear combination of the attributes, i.e., $\sum_i w_i A_i$. Finding flights based on a linear combination of some criteria such as price and duration [8], diamonds based on depth and carat [33], and houses based on price and floor area [33] are a few examples.
The difficulty is that the concept of “best” lies in the eyes of the beholder. Different users may consider different attributes more important, and hence arrive at very different rankings. In the absence of explicit user preferences, the system can remove dominated items, and offer the remaining Pareto-optimal [7] set as representing the desirable items in the data set. Such a skyline (resp. convex hull) is the smallest subset of the data that is guaranteed to contain the top choice of a user based on any monotonic (resp. linear) ranking function. Börzsönyi et al. [9] initiated skyline research in the database community, and a large body of work has since been conducted in this area. A major issue with such representatives is that they can be a large portion of the dataset [5, 34], especially when there are multiple attributes. Hence, several researchers have tackled [13, 52] the challenge of finding a small subset of the data for further consideration.
One elegant way to find a smaller subset is to define the notion of regret for any particular user, that is, how much this user loses by restricting consideration only to the subset rather than the whole set. The goal is to find a small subset of the data such that this regret is small for every user, no matter what their preference function. There has been considerable attention given to the regret-ratio minimizing set [45, 5] problem and its variants [44, 55, 39, 17, 1, 12, 40]. For a function $f$ and a subset of the data, let $m$ be the maximum score of the tuples in the dataset based on $f$ and $m'$ be the one for the subset. The regret-ratio of the subset for $f$ is the ratio of $m - m'$ to $m$. The classic regret-ratio minimizing set problem aims to find a subset of size $r$ that minimizes the maximum regret-ratio for any possible function. Other variations of the problem are pointed out in § 7.
Unfortunately, in most real situations, the actual score is a “made up” number with no direct significance. This is even more so the case when attribute values are drawn from different domains. In fact, the score itself could also be on a made-up scale. Considering the regret as a ratio helps, but is far from being a complete solution. For example, wine ratings appear to be on a 100 point scale, with the best wines in the high 90s. However, wines rated below 80 almost never make it to a store, and the median rating is roughly 88 (the exact value depends on the rater). Let’s say the best wine in some data set is at 90 points. A regret of 3 points gives a very small regret ratio of about 0.03, but actually only promises a wine with a rating of 87, which is below the median! In other words, a small value of regret ratio can actually result in a large swing in rank. In the case of wines, at least the rating scales see enough use that most wine drinkers would have a sense of what a score means. But consider choosing a hotel. If a website takes your preferences into account and scores a hotel at 17.2 for you, do you know what that means? If not, then how can you meaningfully specify a regret ratio?
Although ordinary users may not have a good sense of actual scores, they almost always understand the notion of rank. Therefore, as an alternative to the regret-ratio, we consider the position of the items in the ranked list and propose the position distance of items to the top of the list as the rank-regret measure. We define the rank-regret of a subset of the data to be $k$, if it contains at least one of the top-$k$ tuples of any possible ranking function.
Since items in a dataset are usually not uniformly distributed by score, solutions that minimize regret-ratio do not typically minimize rank-regret. In this paper, we seek to find the smallest subset of the given data set that has rank-regret of $k$. We call this subset the order-$k$ rank-regret representative of the database. (We will write this as $k$-RRR, or simply as RRR when $k$ is understood from context.) The order-1 rank-regret representative of a database (for linear ranking functions) is its convex hull: it is guaranteed to contain the top choice of any linear ranking function. The convex hull is usually very large: almost the entire data set with five or more dimensions [5, 34]. By choosing a value of $k$ larger than 1, we can drastically reduce the size of the rank-regret representative set, while guaranteeing everyone a choice in their top-$k$ even if not the absolute top choice.
Unfortunately, finding RRR is NP-complete, even for three dimensions. However, we find a bound on the maximum rank of an item for a function and use it for designing efficient approximation algorithms. We also find the connection of the RRR problem with well-known notions in combinatorial geometry such as the $k$-set [24], a set of $k$ points in $d$-dimensional space separated from the remaining points by a $(d-1)$-dimensional hyperplane. We show how the $k$-set notion can be used to find a set that guarantees a rank-regret of $k$ and has size at most a logarithmic factor times the optimal solution. We then show how a smart partitioning of the function space offers an elegant way of finding the rank-regret representative.
Summary of contributions. The following is a summary of our contributions in this paper:

We propose the rank-regret representative as a way of choosing a small subset of the database guaranteed to contain at least one good choice for every user.

We provide a key theorem that, given the rank of an item for a pair of functions, bounds the maximum rank of the item for any function “between” these functions.

For the special 2D case, we provide an approximation algorithm 2drrr that is guaranteed to achieve the optimal output size and an approximation ratio of 2 on the rank-regret.

We identify the connection of the problem with the combinatorial geometry notion of the $k$-set. We review how $k$-set enumeration can be modeled by graph traversal. Using the collection of $k$-sets, for the general case with a constant number of dimensions, we model the problem as a geometric hitting set, and propose the approximation algorithm mdrrr that guarantees the rank-regret of $k$ and a logarithmic approximation ratio on its output size. We also propose a randomized algorithm for $k$-set enumeration, based on the coupon collector’s problem.

We propose a function space partitioning algorithm mdrc that, for a fixed number of dimensions, guarantees a fixed approximation ratio on the rank-regret. As confirmed in the experiments, applying a greedy strategy while partitioning the space makes this algorithm both efficient and effective in practice.

We conduct extensive experimental evaluations on two real datasets to verify the efficiency and effectiveness of our proposal.
In the following, we first formally define the terms, provide the problem formulation, and study its complexity in § 2. We provide the geometric interpretation of items, a dual space, and some statements in § 3 that play key roles in the technical sections. In § 4, we study the 2D problem and propose an approximation algorithm for it. § 5 starts by revisiting the $k$-set notion and its connection to our problem. Then we provide the hitting set based approximation algorithm, as well as the function space partitioning based algorithm, for the general multidimensional case. Experimental results and related work are provided in § 6 and § 7, respectively, and the paper is concluded in § 8.
2 Problem Definition
Database: Consider a database $\mathcal{D}$ of $n$ tuples, each consisting of $d$ attributes $A_1, A_2, \ldots, A_d$ that may be involved in a user’s preference function. (Each tuple could also include additional attributes that are not involved in the user preferences. We do not consider these attributes for the purpose of this paper.) Without loss of generality, we consider $A_i \in \mathbb{R}^{+}$ for all $i \in [1, d]$. We represent each tuple $t \in \mathcal{D}$ as a $d$-dimensional vector $\langle t[1], t[2], \ldots, t[d] \rangle$.
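Ranking function: Consider a ranking function $f: \mathbb{R}^d \rightarrow \mathbb{R}$ that maps each tuple to a numerical score. We further assume, through applying any arbitrary tie-breaker, that no two tuples in the database have the same score, i.e., $\forall t_i, t_j \in \mathcal{D}$ with $i \neq j$, there is $f(t_i) \neq f(t_j)$. We say $t_i$ outranks $t_j$ if and only if $f(t_i) > f(t_j)$. For each $t \in \mathcal{D}$, let $\nabla_f(t)$ be the rank of $t$ in the ordered list of $\mathcal{D}$ based on $f$. In other words, there are exactly $\nabla_f(t) - 1$ tuples in $\mathcal{D}$ that outrank $t$ according to $f$.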
Ranking functions can take a wide variety of forms. A popular type of ranking function studied in the database literature is the linear ranking function, i.e.,
$$f_{\vec{w}}(t) = \sum_{i=1}^{d} w_i \, t[i] \qquad (1)$$
where $\vec{w}$ ($\forall i \in [1, d]$, $w_i \geq 0$) is a weight vector used to capture the importance of each attribute to the final score of a tuple. We use $\mathcal{L}$ to refer to the set of all possible linear ranking functions.
Maxima representation: For a given database $\mathcal{D}$, if the set of ranking functions of interest is known, say $\mathcal{F}$, then we can derive a compact maxima representation of $\mathcal{D}$ by only including a tuple $t \in \mathcal{D}$ if it represents the maxima (i.e., is the No. 1 ranked tuple) for at least one ranking function in $\mathcal{F}$. For example, if we focus on linear ranking functions in $\mathcal{L}$, then the maxima representation of $\mathcal{D}$ is what is known in the computational geometry and database literature as the convex hull [19] of $\mathcal{D}$. Similarly, the set of skyline tuples [9], a superset of the convex hull, forms the maxima representation for the set of all monotonic ranking functions [6].
A problem with the maxima representation is its potentially large size. For example, depending on the “curvature” of the shape within which the tuples are distributed, even in 2D, the convex hull can be as large as [34]. The problem gets worse in higher dimensions [53, 5]. As shown in [5], in practice, even for a database with dimensionality as small as , the convex hull can often be as large as .
To address this issue, we propose in this paper to relax the definition of maxima representation in order to reduce its size. Specifically, instead of requiring the representation to contain the top-1 item for every ranking function, we allow the representation to stand so long as it contains at least one of the top-$k$ items for every ranking function. This tradeoff between the compactness of the representation and the “satisfaction” of each ranking function is captured in the following formal definitions of rank-regret:
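Definition 1
Given a subset of tuples $X \subseteq \mathcal{D}$ and a ranking function $f$, the rank-regret of $X$ for $f$ is the minimum rank of all tuples in $X$ according to $f$. Formally, writing $\nabla_f(t)$ for the rank of $t$ according to $f$,
$$RR_f(X) = \min_{t \in X} \nabla_f(t)$$
Definition 2
Given a subset of tuples $X \subseteq \mathcal{D}$ and a set of ranking functions $\mathcal{F}$, the rank-regret of $X$ for $\mathcal{F}$ is the maximum rank-regret of $X$ for all functions in $\mathcal{F}$, i.e.,
$$RR_{\mathcal{F}}(X) = \max_{f \in \mathcal{F}} RR_f(X) = \max_{f \in \mathcal{F}} \min_{t \in X} \nabla_f(t)$$
Definition 3
Given a set of ranking functions $\mathcal{F}$ and a user input $k$, we say $X \subseteq \mathcal{D}$ is a rank-regret representative of $\mathcal{D}$ if and only if $X$ has the rank-regret of at most $k$ for $\mathcal{F}$, and no other subset of $\mathcal{D}$ satisfies this condition while having a smaller size than $X$. Formally:
$$\min_{X \subseteq \mathcal{D}} |X| \quad \text{s.t.} \quad RR_{\mathcal{F}}(X) \leq k$$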
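Definitions 1 and 2 can be illustrated with a minimal Python sketch. It assumes a finite, hand-picked list of linear functions standing in for the (infinite) function set, and the helper names `rank`, `rank_regret`, `max_rank_regret`, and `linear` are hypothetical, chosen only for this illustration.

```python
from typing import Callable, Sequence

Item = Sequence[float]
RankFn = Callable[[Item], float]

def rank(data: Sequence[Item], f: RankFn, t: Item) -> int:
    """Rank of t under f: one plus the number of tuples that outrank t."""
    return 1 + sum(1 for u in data if f(u) > f(t))

def rank_regret(subset: Sequence[Item], data: Sequence[Item], f: RankFn) -> int:
    """Definition 1: the best (minimum) rank achieved by any tuple of the subset."""
    return min(rank(data, f, t) for t in subset)

def max_rank_regret(subset, data, functions) -> int:
    """Definition 2, restricted to a finite list of functions (an assumption)."""
    return max(rank_regret(subset, data, f) for f in functions)

def linear(w: Sequence[float]) -> RankFn:
    """Linear ranking function with weight vector w."""
    return lambda t: sum(wi * ti for wi, ti in zip(w, t))

# Toy data, made up for this example.
data = [(0.9, 0.1), (0.6, 0.6), (0.1, 0.9), (0.5, 0.4)]
functions = [linear(w) for w in [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]]

# The single tuple (0.6, 0.6) is never worse than second under these functions.
print(max_rank_regret([(0.6, 0.6)], data, functions))
```

Because only three functions are checked, the printed value is a lower bound on the true rank-regret over all linear functions, not the exact value.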
Problem Formulation: Finding the rank-regret representative of the dataset is our objective in this paper. Therefore, we define the rank-regret representative (RRR) problem as follows:
Rank-Regret Representative (RRR) Problem:
Given
a dataset $\mathcal{D}$,
a set of ranking functions $\mathcal{F}$,
and a user input $k$,
find the rank-regret representative of $\mathcal{D}$ for $\mathcal{F}$ and $k$ according to Definition 3.
We note that there is a dual formulation of the problem, i.e., a user specifies the output size $|X|$, and aims to find the $X$ that has the minimum rank-regret. Interestingly, a solution for the RRR problem can be easily adapted for solving this dual problem. Given a solver for RRR and the set size $|X|$, one may apply a binary search to vary the value of $k$ in the range $[1, n]$ and, for each value of $k$, call the solver to find the RRR. If the output size is larger than $|X|$, then the search continues in the upper half of the search space for $k$; otherwise it moves to the lower half. Given an optimal solver for RRR, this algorithm is guaranteed to find the optimal solution for the dual problem at the cost of an additional $\log n$ factor in the running time.
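The binary search for the dual problem can be sketched as follows. This is only an illustration under stated assumptions: `solve_rrr` is a hypothetical brute-force stand-in for an RRR solver, it checks the rank-regret over a finite grid of 2D weight vectors rather than all linear functions, and all names are introduced here for the example.

```python
from itertools import combinations
from math import cos, sin, pi

def rank_regret(subset, data, weights):
    """Maximum over the weight vectors of the best rank reached by the subset."""
    worst = 0
    for w in weights:
        def score(t, w=w):
            return sum(a * b for a, b in zip(w, t))
        best = min(1 + sum(score(u) > score(t) for u in data) for t in subset)
        worst = max(worst, best)
    return worst

def solve_rrr(data, weights, k):
    """Hypothetical exact solver: smallest subset with rank-regret <= k (brute force)."""
    for size in range(1, len(data) + 1):
        for subset in combinations(data, size):
            if rank_regret(subset, data, weights) <= k:
                return list(subset)
    return list(data)

def solve_dual(data, weights, max_size):
    """Binary search on k in [1, n] for the smallest k whose RRR fits in max_size."""
    lo, hi, best = 1, len(data), None
    while lo <= hi:
        k = (lo + hi) // 2
        solution = solve_rrr(data, weights, k)
        if len(solution) > max_size:
            lo = k + 1          # too many items: allow a larger rank-regret
        else:
            best = (k, solution)
            hi = k - 1          # fits: try a smaller rank-regret
    return best

data = [(0.9, 0.1), (0.6, 0.6), (0.1, 0.9), (0.5, 0.4)]
weights = [(cos(i * pi / 20), sin(i * pi / 20)) for i in range(11)]
print(solve_dual(data, weights, max_size=1))
```

The search relies on the optimal output size being non-increasing in $k$, which is what makes halving the range valid.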
In the rest of the paper, we focus on $\mathcal{L}$, the class of linear ranking functions.
Complexity analysis: The decision version of the RRR problem asks if there exists a subset of $\mathcal{D}$ of a given size that satisfies the rank-regret of $k$. Somewhat surprisingly, even though no solution for RRR exists in the literature, we can readily use previous results to prove the NP-completeness of RRR. Specifically, the $(k, \epsilon)$-regret problem studied in Agrawal et al. [1] asks if there exists a set that guarantees the maximum regret-ratio of $\epsilon$ from the top-$k$-th item of any linear ranking function. Note that the $(k, \epsilon)$-regret problem is the equivalent of the RRR problem for $\epsilon = 0$. Given that the NP-completeness proof in [1] covers the case when $\epsilon = 0$, through a reduction from the convex polytope vertex cover (CPVC) problem proven NP-complete by Das et al. [20], the NP-completeness of RRR follows.
We would like to reemphasize that even though the complexity of RRR was established in existing work, RRR is still a novel problem to study because all previous work in the regret-ratio area focused on the case where $\epsilon > 0$. In other words, they seek approximations on the absolute score achieved by tuples in the compact representation, a strategy which, as discussed in the introduction, could lead to a significant increase in rank-regret because many tuples may congregate in a small band of absolute scores. RRR, on the other hand, focuses on the rank perspective (i.e., $\epsilon = 0$) and assumes no specific distribution of the absolute scores.
3 Geometric interpretation of items
In this section, we use the geometric interpretation of items, explain a dual transformation, and propose a theorem that plays a key role in designing the RRR algorithms.
Each item $t$ with $d$ scalar attributes can be viewed as a point in $\mathbb{R}^d$. As an example, Figure 3 shows a sample dataset and its items as points in $\mathbb{R}^2$. In this space, every linear preference function $f$ with the weight vector $\vec{w}$ can be viewed as a ray that starts at the origin and passes through the point $\vec{w}$. For each item $t$, consider the line (in general, the hyperplane) orthogonal to the ray of $f$ that passes through $t$; let the projection of $t$ on $f$ be the intersection of this line with the ray of $f$. The ordering of items based on $f$ is the same as the ordering of their projections on $f$, where the items that are farther from the origin are ranked higher. Figure 3 shows the ray of one such function, as well as the ordering of the items based on it. Every ray starting at the origin in $\mathbb{R}^d$ is represented by $d - 1$ angles. For example, in $\mathbb{R}^2$, every ray is identified by a single angle, as is the ray of the function in Figure 3.
Small changes in the weights of a function will move the corresponding ray slightly, and hence change the projection points of items. However, it may not change the ordering of items. In fact, while the function space is continuous and the number of possible weight vectors is infinite, the number of possible ordering between the items is, combinatorially, bounded by .
In order to study the ranking of items based on various functions, throughout this paper, we consider the dual space [24] that transforms a tuple $t$ in $\mathbb{R}^d$ to the hyperplane $\mathsf{d}(t)$ as follows:
$$\mathsf{d}(t):\ \sum_{i=1}^{d} t[i] \cdot x_i = 1 \qquad (2)$$
In the dual space, the ray of a linear function with the weight vector $\vec{w}$ remains the same as in the original space: the origin-starting ray that passes through the point $\vec{w}$. The ordering of items based on a function $f$ is specified by the intersections of the dual hyperplanes with its ray. However, unlike the original space, the intersections that are closer to the origin are ranked higher. Using Equation 2, every tuple $t$ in two-dimensional space gets transformed to the line $\mathsf{d}(t): t[1]\,x_1 + t[2]\,x_2 = 1$. Figure 3 shows the items of the example dataset in the dual space. The order in which the dual lines intersect an axis gives the ordering of the items based on the function that uses only the corresponding attribute; any set containing one of the first $k$ items in that ordering has a rank-regret of at most $k$ for that function.
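The claim that the dual space preserves the ranking, with closer intersections ranked higher, can be checked numerically. The sketch below assumes the dual hyperplane of a tuple $t$ is $\sum_i t[i] x_i = 1$ as in Equation 2; the function names and the toy data are hypothetical.

```python
from math import sqrt

def score(w, t):
    """Score of tuple t under the linear function with weight vector w."""
    return sum(wi * ti for wi, ti in zip(w, t))

def dual_distance(w, t):
    """Distance from the origin to where the ray of w meets the dual hyperplane of t.

    A point on the ray is lam * w / |w|; it lies on sum_i t[i] * x_i = 1
    exactly when lam = |w| / (w . t).
    """
    norm = sqrt(sum(wi * wi for wi in w))
    return norm / score(w, t)

data = [(0.9, 0.1), (0.6, 0.6), (0.1, 0.9), (0.5, 0.4)]
w = (2.0, 1.0)

by_score = sorted(range(len(data)), key=lambda i: -score(w, data[i]))
by_dual = sorted(range(len(data)), key=lambda i: dual_distance(w, data[i]))
print(by_score, by_dual)  # the two orderings coincide
```

Since the distance is inversely proportional to the score, sorting by increasing distance and by decreasing score always agree when scores are positive.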
The set of dual hyperplanes $\{\mathsf{d}(t) \mid t \in \mathcal{D}\}$ defines a dissection of $\mathbb{R}^d$ into connected convex cells named the arrangement of hyperplanes [24]. The borders of the cells in the arrangement are $(d-1)$-dimensional facets. For example, in Figure 3, the arrangement of dual lines dissects the space into connected convex polygons. The borders of the convex polygons are one-dimensional line segments. For every facet in the arrangement, consider a line segment starting from the origin and ending on it. Let the level of the facet be the number of hyperplanes intersecting this line segment. We define a top-$k$ border (or simply $k$-border) as the set of facets having level $k$. For example, the red chain consisting of piecewise linear segments in Figure 3 shows the top-$k$ border for the example. For any function $f$, the hyperplanes intersecting the ray of $f$ on or below the top-$k$ border are the top-$k$. Looking at the red line in Figure 3, one may confirm that:

The top-$k$ border is not necessarily convex.

A dual hyperplane may contain more than one facet of the top-$k$ border. For example, in Figure 3, one of the dual lines contains two line segments of the top-$k$ border.
In the following, we propose an important theorem that is the key to designing the 2D algorithm, as well as the practical algorithm in MD.
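Theorem 1
For any item $t \in \mathcal{D}$, consider two (if any) functions $f$ and $f'$ where $\nabla_f(t) \leq k$ and $\nabla_{f'}(t) \leq k$. Also, consider a line segment $\ell$ starting from a point on the ray of $f$ and ending at a point on the ray of $f'$. For any function $f''$ whose ray intersects $\ell$, $\nabla_{f''}(t) \leq 2k$.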
We use the dual space and prove the theorem by contradiction. In the dual space, consider the 2D plane passing through the rays of $f$ and $f'$, referred to as $H$. Note that $H$ is the affine space for the origin-starting rays that intersect $\ell$. The intersection of each hyperplane $\mathsf{d}(t_i)$ and this plane is a line that we name $L_i$. The arrangement of the lines $L_i$, $t_i \in \mathcal{D}$, identifies the orderings of items based on any origin-starting ray (function) that falls in $H$. This is similar to Figure 3, in which the arrangement of dual lines identifies the possible orderings of the items. For any pair of items $t_i$ and $t_j$, the intersection of the lines $L_i$ and $L_j$ shows the function (the origin-starting ray that passes through the intersection) that ranks $t_i$ and $t_j$ equally well, while on one side of this point $t_i$ outranks $t_j$, and $t_j$ outranks $t_i$ on the other side. Note that since $L_i$ and $L_j$ are both (one-dimensional) lines, they intersect at most once.
Now consider the item $t$ and its corresponding line $L_t$ in the arrangement, and suppose that $\nabla_{f''}(t) > 2k$. Since $\nabla_f(t) \leq k$, there exist at most $k - 1$ lines below $L_t$ on the ray of $f$. Moving from the ray of $f$ toward the ray of $f''$, in order for $t$ to have a rank greater than $2k$, $L_t$ has to intersect with at least $k$ lines in a way that after the intersection points (toward $f'$) the corresponding items outrank $t$. Since every pair of lines has at most one intersection point, $L_t$ will not intersect with those lines any further. As a result, those (at least) $k$ items keep outranking $t$, and thus $t$ cannot have a rank smaller than or equal to $k$ again, which contradicts the fact that $\nabla_{f'}(t) \leq k$.
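Theorem 1 can be sanity-checked empirically in 2D, where a function is identified by a single angle. The sketch below is not part of the proof; it assumes random points in the unit square and a fixed angular grid, and `rank_at` and `check_theorem1` are hypothetical helper names.

```python
import random
from math import cos, sin, pi

def rank_at(data, theta, t):
    """Rank of t under the linear function whose ray has angle theta."""
    w = (cos(theta), sin(theta))
    s = w[0] * t[0] + w[1] * t[1]
    return 1 + sum(1 for u in data if w[0] * u[0] + w[1] * u[1] > s)

def check_theorem1(data, k, steps=200):
    """Between the first and last grid angle with rank <= k, rank must stay <= 2k."""
    angles = [i * (pi / 2) / steps for i in range(steps + 1)]
    for t in data:
        good = [a for a in angles if rank_at(data, a, t) <= k]
        if not good:
            continue  # t is never in the top-k on this grid
        first, last = good[0], good[-1]
        for a in angles:
            if first <= a <= last and rank_at(data, a, t) > 2 * k:
                return False
    return True

rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(50)]
print(check_theorem1(points, k=3))
```

A grid check of this kind can only fail to find a counterexample; it complements the argument above rather than replacing it.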
4 RRR in 2D
In this section, we study the special case of two-dimensional (2D) data in which $d = 2$. In § 2, we discussed the complexity of the problem for $d \geq 3$. However, we believe that the complexity of the problem is due to the complexity of covering the possible top-$k$ results and therefore, provide an approximation algorithm for 2D. We consider the items in the dual space and use Theorem 1 as the key for designing the algorithm 2drrr. Later on, in § 5, we also use this theorem for designing a practical algorithm for the multidimensional cases.
Based on our discussion about the top-$k$ border in the previous section, each dual line may contain multiple segments of the top-$k$ border. As a result, for each item, the set of functions for which the item is in the top-$k$ is a collection of disjoint intervals. Based on Theorem 1, if we take the convex closure of the union of these intervals, we get a single interval in which the item is guaranteed to be in the top-$2k$. Thus, we are effectively applying Theorem 1 to get the 2-approximation factor.
At a high level, the algorithm 2drrr consists of two parts. It first makes an angular sweep of a ray anchored at the origin, from the $x$-axis (angle $0$) toward the $y$-axis (angle $\pi/2$), so that for every item $t$, it finds the first (smallest angle) and the last (largest angle) function for which $t$ is in the top-$k$. Then it transforms the problem into an instance of one-dimensional range cover [10] and solves it optimally.
The first part, i.e., the angular sweep, is described in Algorithm 1. For every item $t$, the algorithm aims to find the first and the last function for which $\nabla_f(t) \leq k$; we refer to their angles as the begin and end angles of $t$. Algorithm 1 initially orders the items based on their $x$-coordinates and puts them in an ordered list that keeps track of the orderings while moving from the $x$-axis to the $y$-axis. It uses a min-heap data structure to maintain the ordering exchanges between the adjacent items in the list. Please note that each ordering exchange is always between two adjacent items in the list. Using Equation 2, the angle of the ordering exchange between two items $t_i$ and $t_j$ is the angle of the ray through the intersection of their dual lines:
$$\theta_{t_i, t_j} = \arctan \frac{t_j[1] - t_i[1]}{t_i[2] - t_j[2]}$$
For the items that are initially in the top-$k$, Algorithm 1 sets the begin angle to $0$. Then, it sweeps the ray and pops the next ordering exchange from the heap. Upon visiting an ordering exchange, the algorithm updates the ordered list. If the exchange occurs between the items at rank $k$ and $k+1$: (i) if this is the first time the entering item enters the top-$k$, the algorithm sets its begin angle to the current angle, and (ii) for the item that leaves the top-$k$, it sets its end angle to the current angle. The algorithm will update the end angle later on if the item enters the top-$k$ again. Figure 4 shows the ranges for the example dataset in Figure 3 (the top-$k$ border is shown in Figure 3).
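The ranges produced by the sweep can be reproduced with a simple brute-force sketch. This is not Algorithm 1: instead of a heap of adjacent exchanges, it collects every pairwise exchange angle and evaluates the ranks at the midpoint of each resulting angular interval, which is slower but easy to verify. The names `exchange_angle`, `rank_at`, and `topk_ranges` are hypothetical.

```python
from math import atan, cos, sin, pi

def exchange_angle(ti, tj):
    """Angle in (0, pi/2) at which ti and tj swap order, or None if there is none."""
    dx = tj[0] - ti[0]
    dy = ti[1] - tj[1]
    if dx == 0 or dy == 0 or dx / dy <= 0:
        return None
    return atan(dx / dy)

def rank_at(data, theta, t):
    w = (cos(theta), sin(theta))
    s = w[0] * t[0] + w[1] * t[1]
    return 1 + sum(1 for u in data if w[0] * u[0] + w[1] * u[1] > s)

def topk_ranges(data, k):
    """Map item index -> (begin, end): first and last angle at which it is in the top-k."""
    cuts = {0.0, pi / 2}
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            angle = exchange_angle(data[i], data[j])
            if angle is not None:
                cuts.add(angle)
    cuts = sorted(cuts)
    ranges = {}
    # The ordering is constant between consecutive exchange angles,
    # so one evaluation at the midpoint of each interval is enough.
    for left, right in zip(cuts, cuts[1:]):
        mid = (left + right) / 2
        for i, t in enumerate(data):
            if rank_at(data, mid, t) <= k:
                begin, _ = ranges.get(i, (left, right))
                ranges[i] = (begin, right)
    return ranges

data = [(0.9, 0.1), (0.6, 0.6), (0.1, 0.9), (0.5, 0.4)]
print(topk_ranges(data, k=1))
```

Items that never enter the top-$k$ (here the last tuple for $k = 1$) simply do not appear in the result.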
After computing the ranges for the items, the problem is transformed into a one-dimensional range cover instance. The objective is to cover the function space (the range between $0$ and $\pi/2$) using the least number of ranges. The greedy approach leads to the optimal solution for this problem – that is, at every iteration, select the range with the maximum coverage of the uncovered space.
At every iteration, the uncovered space is identified by a set of intervals. Due to the greedy nature of the algorithm, the range of each remaining item intersects with at most one uncovered interval.
To explain this by contradiction, consider an item $t$ whose range intersects with two (or more) uncovered intervals (Figure 5). Let $I_1$ and $I_2$ be these intervals. Also, let us name the covered space between $I_1$ and $I_2$ as $I_c$. (i) Since the range of $t$ intersects with both $I_1$ and $I_2$, $I_c$ is contained within the range of $t$, which implies that the range of $t$ is larger than $I_c$. (ii) $I_c$ should be covered by the range of at least one previously selected item $t'$. Also, since the ranges of items are continuous, the range of $t'$ cannot be larger than $I_c$. As a result, the range of $t'$ is smaller than the range of $t$, which contradicts the fact that the ranges are selected greedily.
Using this observation, after finding the ranges for each item, 2drrr (Algorithm 2) uses a sorted list to keep track of the uncovered intervals. Each element of the list is a pair of an angle and a flag that specifies whether the angle is the beginning or the end of an uncovered interval.
At every iteration, for each item $t$ that has still not been selected, the algorithm applies a binary search to find the element in the list that falls right before the beginning of the range of $t$. Then, depending on whether this element specifies the beginning or the end of an uncovered interval, it computes how much of the uncovered region the range of $t$ covers. The algorithm chooses the item with the maximum coverage, adds it to the selected set, and updates the uncovered intervals accordingly. It stops when no more uncovered intervals are left.
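The greedy selection step described above can be sketched as follows. This is a simplified illustration rather than Algorithm 2: it recomputes the coverage of every remaining range by a linear scan instead of using a sorted list with binary search, and the function name `greedy_cover`, the tolerance `eps`, and the example ranges are assumptions made for this sketch.

```python
from math import pi

def greedy_cover(ranges, lo=0.0, hi=pi / 2, eps=1e-12):
    """Repeatedly pick the range covering the most still-uncovered length of [lo, hi]."""
    uncovered = [(lo, hi)]
    remaining = dict(ranges)
    chosen = []
    while uncovered:
        def gain(r):
            return sum(max(0.0, min(r[1], b) - max(r[0], a)) for a, b in uncovered)
        if not remaining:
            raise ValueError("the ranges do not cover the function space")
        best = max(remaining, key=lambda i: gain(remaining[i]))
        if gain(remaining[best]) <= eps:
            raise ValueError("the ranges do not cover the function space")
        begin, end = remaining.pop(best)
        chosen.append(best)
        # Subtract [begin, end] from every uncovered interval.
        updated = []
        for a, b in uncovered:
            if min(b, begin) - a > eps:
                updated.append((a, min(b, begin)))
            if b - max(a, end) > eps:
                updated.append((max(a, end), b))
        uncovered = updated
    return chosen

# Hypothetical ranges (item id -> begin and end angle), not taken from Figure 4.
ranges = {0: (0.0, 0.6), 1: (0.5, 1.1), 2: (1.0, pi / 2), 3: (0.2, 0.4)}
print(sorted(greedy_cover(ranges)))
```

In this instance items 0, 1, and 2 are all required, and item 3 is never picked because its range lies inside the range of item 0.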
As an example, for the dataset in Figure 3, if we execute Algorithm 2 on the ranges provided in Figure 4, it returns the set .
Theorem 2
The algorithm 2drrr runs in time.
Intuitively, the summation of the cost of each iteration of the greedy algorithm is used to derive the running time. Please find the details of the proof in Appendix A.
Theorem 3
The output size of 2drrr is not more than the size of the optimal solution for RRR.
The proof follows from the fact that the ranges identified by Algorithm 1 provide a superset for each top-$k$ result. Please refer to Appendix A for the details.
Theorem 4
The output of 2drrr guarantees the maximum rank-regret of $2k$.
5 RRR in MD
In multidimensional cases (MD) where $d > 2$, the continuous function space becomes problematic, the geometric shapes become complex, and even operations such as computing the volume of a shape and the set operations become inefficient. Therefore, in this section, we use the $k$-set notion [24] to take an alternative route for solving the RRR problem by transforming the continuous function space to discrete sets. This leads to the design of an approximation algorithm that guarantees the rank-regret of $k$, introduces a logarithmic approximation ratio in the output size, and runs in polynomial time for a constant number of dimensions. We will explain the details of this algorithm in § 5.2. Then, in § 5.3, we propose the function-space partitioning algorithm mdrc that uses the result of Theorem 1 in its design for solving the problem without finding the $k$-sets. Note that the proposed algorithms in this section are also applicable for 2D.
5.1 $k$-Set and Its Connection to RRR
A $k$-set is an important notion in combinatorial geometry with applications including half-space range search [16, 18]. Given a set of points in $\mathbb{R}^d$, a $k$-set is a collection of exactly $k$ points in the point set that are strictly separable from the rest of the points using a $(d-1)$-dimensional hyperplane.
Consider a finite set $P$ of points in the Euclidean space $\mathbb{R}^d$. A hyperplane $h$ partitions it into $P \cap h^+$ and $P \cap h^-$, called half spaces of $P$, where $h^+$ (resp. $h^-$) is the open half space above (resp. below) $h$ [24]. (We use the word above (resp. below) to refer to the half space that does not contain (resp. contains) the origin.) The hyperplane $h$ in the Euclidean space $\mathbb{R}^d$ can be uniquely defined by a point $\rho$ on it and a $d$-dimensional normal vector $\vec{v}$ orthogonal to it, and has the following equation:
$$h(\rho, \vec{v}):\ \sum_{i=1}^{d} v_i \, (x_i - \rho_i) = 0 \qquad (3)$$
A half space $h^+$ of $P$ is a $k$-set if $|P \cap h^+| = k$. Without loss of generality, we consider the positive half spaces $h^+$. That is, $S \subseteq P$ is a $k$-set if there exist a point $\rho$ and a positive normal vector $\vec{v}$ such that $S = P \cap h(\rho, \vec{v})^+$ and $|S| = k$. For example, the empty set is a 0-set and each point on the convex hull of $P$ is a 1-set. We use $\mathcal{S}$ to refer to the collection of $k$-sets of $P$; i.e., $\mathcal{S} = \{S \subseteq P \mid S \text{ is a } k\text{-set}\}$. For example, Figure 6 shows the collection of $k$-sets for the dataset of Figure 3.
If we consider items as points in $\mathbb{R}^d$, the notion of $k$-sets is interestingly related to the notion of top-$k$ items, as the following arguments show:

A hyperplane $h(\rho, \vec{v})$ describes the set of all points with the same score as the point $\rho$, for the ranking function $f$ with the weight vector $\vec{v}$, i.e., the set of attribute-value combinations with the same score as $\rho$ based on the ranking function $f$.

If we consider a hyperplane $h(\rho, \vec{v})$ where $|h(\rho, \vec{v})^+| = k$, the set of points belonging to $h(\rho, \vec{v})^+$ is equivalent to the top-$k$ items of $\mathcal{D}$ for the ranking function $f$ with the weight vector $\vec{v}$.
Lemma 5
Let $\mathcal{S}$ be the collection of all $k$-sets for the points corresponding to the items $t \in \mathcal{D}$. For each possible linear ranking function $f$, there exists a $k$-set $S \in \mathcal{S}$ such that top-$k$($f$) $= S$.
We provide the proof by contradiction. Please refer to Appendix A for the details.
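The correspondence between top-$k$ results and $k$-sets can be made concrete with a small sketch: the top-$k$ of a linear function is separated from the rest by a hyperplane with the same normal vector, placed midway between the $k$-th and $(k+1)$-th scores. The helper names and toy data below are hypothetical, and distinct scores are assumed.

```python
def score(w, t):
    return sum(wi * ti for wi, ti in zip(w, t))

def top_k(data, w, k):
    """Indices of the k highest-scoring tuples under weight vector w."""
    order = sorted(range(len(data)), key=lambda i: -score(w, data[i]))
    return frozenset(order[:k])

def separating_offset(data, w, k):
    """Offset c such that the hyperplane w . x = c isolates the top-k (needs k < n)."""
    scores = sorted((score(w, t) for t in data), reverse=True)
    return (scores[k - 1] + scores[k]) / 2

data = [(0.9, 0.1), (0.6, 0.6), (0.1, 0.9), (0.5, 0.4)]
w, k = (2.0, 1.0), 2

c = separating_offset(data, w, k)
above = frozenset(i for i, t in enumerate(data) if score(w, t) > c)
print(top_k(data, w, k), above)  # the same k-set
```

Every point strictly above the hyperplane is in the top-$k$ and every point below it is not, which is exactly the separability condition in the definition of a $k$-set.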
Based on Lemma 5, all possible answers to top-$k$ queries on linear ranking functions can be captured by the collection of $k$-sets. This will help us in solving the RRR problem in § 5.2. As we shall explain in § 7, the best known upper bounds on the number of $k$-sets in $\mathbb{R}^2$ and $\mathbb{R}^3$ are $O(n k^{1/3})$ [22] and $O(n k^{3/2})$ [49]. For $d > 3$, the best known upper bound is $O(n^{d - \epsilon})$ [2], where $\epsilon > 0$ is a small constant. (Note that this is polynomial for a constant $d$.) However, as we shall show in § 6, in practice $|\mathcal{S}|$ is significantly smaller than the upper bound.
In Appendix B, we review the $k$-set enumeration. For the 2D case, a ray sweeping algorithm (similar to Algorithm 1) that follows the $k$-border finds the collection of $k$-sets. For higher dimensions, the enumeration can be modeled as a graph traversal problem [3]. The algorithm considers the $k$-set graph in which the vertices are the $k$-sets and there is an edge between two $k$-sets if the size of their intersection is $k - 1$. We discuss the connectivity of the graph, and explain how to traverse it and enumerate the $k$-sets.
Next, we use the $k$-set notion for developing an approximation algorithm for RRR that guarantees a rank-regret of $k$ and a logarithmic approximation ratio on the output size.
5.2 MDRRR: Hitting-Set Based Approximation Algorithm
As we discussed in § 5.1, the collection of $k$-sets contains the set of all possible top-$k$ results for the linear ranking functions. As a result, a set of tuples $X$ that contains at least one item from each $k$-set is guaranteed to have at least one of the items in the top-$k$ of any linear ranking function, which implies that $X$ satisfies the rank-regret of $k$. On the other hand, since every $k$-set $S = P \cap h(\rho, \vec{v})^+$ is the top-$k$ of at least the linear function with the weight vector $\vec{v}$, a subset that does not contain any of the items of a $k$-set does not satisfy the rank-regret of $k$.
One can see that, given the collection of $k$-sets, our RRR problem is similar to the minimum hitting set problem [38]. Given a universe of items $\mathcal{D}$, and a collection of sets $\mathcal{S}$ where each set is a subset of $\mathcal{D}$, the minimum hitting set problem asks for the smallest set of items $X \subseteq \mathcal{D}$ such that $X$ has a non-empty intersection with every set of $\mathcal{S}$. The minimum hitting set problem is known to be NP-complete [38] and the existing approximation algorithm provides a logarithmic factor from the optimal size. A deterministic polynomial time algorithm with an improved factor of $O(\delta \log(\delta c))$, where $c$ is the optimal size, has been proposed by [10] for a specific instance of this problem, the geometric hitting set problem, where $\delta$ is the Vapnik-Chervonenkis dimension (VC-dimension). The VC-dimension is defined as the cardinality of the largest set of points $A$ that can be shattered by $\mathcal{S}$, i.e., the system induced by $\mathcal{S}$ on $A$ contains all the subsets of $A$ [51]. In the RRR problem, since the $k$-sets are defined by half spaces, the VC-dimension is $d$ (the number of attributes) [4, 10].
Next, we formally show the mapping of the RRR problem into the geometric hitting set problem, and provide the details of the approximation algorithm.
Mapping to the geometric hitting set problem:
Given a set space $(\mathcal{D}, \mathcal{S})$, where $\mathcal{S}$ is the collection of $k$-sets and $\mathcal{D}$ is the set of points, find the smallest set $X \subseteq \mathcal{D}$ such that $\forall S \in \mathcal{S}$, $X \cap S \neq \emptyset$.
In mdrrr (Algorithm 3), we use the approximation algorithm for the geometric hitting set problem that is proposed in [10] using the concept of $\epsilon$-nets [35]. More formally, an $\epsilon$-net of $\mathcal{S}$ for $\mathcal{D}$ is a set of points $N \subseteq \mathcal{D}$ such that $N$ contains a point for every $S \in \mathcal{S}$ with size of at least $\epsilon |\mathcal{D}|$. Algorithm 3 shows the pseudocode of mdrrr, the approximation algorithm that uses the mapping to the geometric hitting set problem. The algorithm initializes the weight of each point to one. It then iteratively, in polynomial time, selects (using weighted sampling) a small-sized set of tuples $N$ that intersects all highly weighted sets in $\mathcal{S}$. More formally, if a set $N$ intersects each set of $\mathcal{S}$ with weight larger than $\epsilon W$, where $W$ is the total weight of the points in $\mathcal{D}$, then $N$ is an $\epsilon$-net. If $N$ is not a hitting set (lines 4-9), then the algorithm doubles the weight of the points in the particular set of $\mathcal{S}$ missed by $N$.
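The weight-doubling idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not Algorithm 3 and not the full method of [10]: a plain weighted random sample stands in for the $\epsilon$-net finder, the weight of a missed set is doubled unconditionally, and the candidate size is increased one by one. All names, the iteration cap, and the example sets are assumptions.

```python
import random

def weighted_sample(weights, size, rng):
    """Draw `size` distinct points; heavier points are more likely to be drawn."""
    pool = dict(weights)
    picked = set()
    for _ in range(min(size, len(pool))):
        points = list(pool)
        p = rng.choices(points, weights=[pool[q] for q in points])[0]
        picked.add(p)
        del pool[p]
    return picked

def hitting_set(sets, seed=0, iters_per_size=200):
    """Try candidate sizes 1, 2, ...; for each size, sample and double missed weights."""
    rng = random.Random(seed)
    universe = {p for s in sets for p in s}
    for size in range(1, len(universe) + 1):
        weights = {p: 1 for p in universe}
        for _ in range(iters_per_size):
            candidate = weighted_sample(weights, size, rng)
            missed = [s for s in sets if not (s & candidate)]
            if not missed:
                return candidate
            for p in missed[0]:      # double the weight of one missed set
                weights[p] *= 2
    return set(universe)

# Hypothetical collection of 2-sets over six items.
ksets = [frozenset({0, 1}), frozenset({1, 2}), frozenset({3, 4}), frozenset({4, 5})]
result = hitting_set(ksets)
print(sorted(result))
```

The sketch always terminates, because a sample containing every point hits every set; what it does not provide is the approximation guarantee of the net-based algorithm.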
Discussion: In summary, considering the one-to-one mapping between the RRR problem and the geometric hitting set problem over the collection of $k$-sets, we can see that:

mdrrr guarantees the rank-regret of $k$. That is because mdrrr is guaranteed to return at least one item from each set in $\mathcal{S}$, the set of all top-$k$ results.

mdrrr guarantees the approximation ratio of $O(d \log(d c))$, where $c$ is the optimal output size and $d$ is the number of attributes.

mdrrr runs in polynomial time. This is because it has been shown in [10] that the number of iterations the algorithm must perform is at most $O(c \log(n/c))$, where $n$ is the number of points in $\mathcal{D}$, and $c$ is the size of the optimal hitting set. Moreover, recall that mdrrr needs the collection of $k$-sets, which can be enumerated by traversing the $k$-set graph (cf. Appendix B), which runs in polynomial time.
Nevertheless, although it runs in polynomial time, the mdrrr algorithm is quite impractical as described above. It needs the collection of sets (), as input. Therefore, its efficiency depends on the set enumeration and the size of . Although, as we shall show in § 6, in practice the size of is reasonable and as explained in Appendix B, the set graph traversal algorithm is linear in , the algorithm does not scale beyond a few hundred items in practice. The reason is that while exploring each set, it needs to solve much as linear programs, each of size constraints over variables. This makes the enumeration extremely inefficient. Therefore, we need to explore practical alternatives to the set enumeration algorithm.
In the next subsection, we propose kset, a more practical randomized algorithm for $k$-set enumeration.
5.2.1 kset: Sampling for $k$-set enumeration
Here we propose a sampling-based alternative for $k$-set enumeration. There is a many-to-one mapping between the linear ranking functions and the $k$-sets. That is, while a $k$-set is the top-$k$ of an infinite number of linear ranking functions, every ranking function is mapped to only one $k$-set, the set of top-$k$ tuples for that function. Instead of the exact enumeration of the $k$-sets, which requires solving expensive linear programming problems, we propose a randomized approach based on the coupon collector’s problem [28]. The coupon collector’s problem describes the “collect the coupons and win” contest. Given a set of $m$ coupons, consider a sampler that every time picks a coupon uniformly at random, with replacement. The requirement is to keep sampling until all coupons are collected. It has been shown that the expected number of samples to draw is in $\Theta(m \log m)$. We use this idea for collecting the $k$-sets by generating random ranking functions and taking their top-$k$ results as the $k$-sets. This is similar to the coupon collector’s problem setting, except that the probabilities of retrieving the $k$-sets are not equal. For each $k$-set, this probability depends on the portion of the function space for which it is the top-$k$. Therefore, rather than applying a $k$-set enumeration algorithm, Algorithm 4 repeatedly generates random functions and computes their corresponding $k$-sets, stopping when it does not find a new $k$-set after a certain number of iterations. The algorithm returns the collection of $k$-sets it has discovered.
Recall that the function space in MD is modeled by the universe of origin-starting rays. The set of points on the surface of the (first quadrant of the) unit hypersphere represents the universe of origin-starting rays. Therefore, uniformly selecting points from the surface of the hypersphere in $\mathbb{R}^d$ is equivalent to uniformly sampling the linear functions. Algorithm 4 adopts the method proposed by Marsaglia [43] for uniformly sampling points on the surface of the unit hypersphere in order to generate random functions. It generates the weight vector of the sampled function as the absolute values of random normal variables. We note that since the $k$-sets are not collected uniformly by kset, its running time is not the same as that of the coupon collector’s problem, but as we shall show in § 6, it runs well in practice.
After finding the sampled collection using Algorithm 4, we pass it to mdrrr in place of the complete collection of $k$-sets. Since Algorithm 4 does not guarantee the discovery of all $k$-sets, the output of the hitting set algorithm does not guarantee the rank-regret of $k$ for the missing $k$-sets. However, the missing $k$-sets (if any) are expected to lie in very small regions that have never been hit by a randomly generated function. Also, the fact that adjacent $k$-sets in the $k$-set graph differ in only one item further reduces the chance that a missing $k$-set is not covered. Therefore, it is very unlikely that none of the top-$k$ items of a randomly generated function is within the output.
On the other hand, since Algorithm 4 finds a subset of the $k$-sets, the output size of the hitting set algorithm on this subset is not more than its output size on the complete collection of $k$-sets. As a result, the output size remains within the logarithmic approximation factor.
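The sampling procedure of this subsection can be sketched in Python as follows. This is an illustrative reading of the description, not Algorithm 4 verbatim: the function names and the `patience` parameter (the number of consecutive draws without a new $k$-set before stopping, set to 100 to mirror the termination condition used in § 6) are assumptions, and items are assumed to be tuples of non-negative scores where higher is better.

```python
import random


def random_weight_vector(d, rng):
    """Absolute values of normal draws, normalized onto the unit hypersphere."""
    w = [abs(rng.gauss(0.0, 1.0)) for _ in range(d)]
    norm = sum(x * x for x in w) ** 0.5
    return [x / norm for x in w]


def top_k(data, w, k):
    """Indices of the k items with the highest linear score under weights w."""
    order = sorted(range(len(data)),
                   key=lambda i: -sum(a * b for a, b in zip(data[i], w)))
    return frozenset(order[:k])


def sample_ksets(data, k, patience=100, seed=0):
    """Collect k-sets by sampling random ranking functions.

    Stops after `patience` consecutive samples that reveal no new k-set.
    """
    rng = random.Random(seed)
    d = len(data[0])
    found = set()
    misses = 0
    while misses < patience:
        s = top_k(data, random_weight_vector(d, rng), k)
        if s in found:
            misses += 1
        else:
            found.add(s)
            misses = 0
    return found


if __name__ == "__main__":
    items = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7), (0.1, 0.1)]
    print(sample_ksets(items, k=1))
```

Because the stopping rule is probabilistic, a $k$-set that occupies a very small region of the function space may be missed, which is exactly the trade-off discussed above.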
5.3 MDRC: Function Space Partitioning
Given the collection of sets, the hitting set based approximation algorithm mdrrr guarantees the rankregret of while introducing a logarithmic approximation in its output size. Despite these nice guarantees, mdrrr still suffers from set enumeration, as it can only be executed after the sets have been discovered. Therefore, as we shall show in § 6, in practice it does not scale well for large problem instances. One observation from the set graph is the high overlap between the sets, as the adjacent sets differ in only one item. As a result many of them may share at least one item. For example, we selected 20 random items from the DOT (Department of Transportation) dataset (c.f. § 6) while setting . By performing an angular sweep of a ray from the xaxis to the yaxis while following the border (see Figure 3), we enumerated the sets. In Figure 8, we illustrate the overlap between these sets. The figure confirms the high overlap between the sets where the item with id 2 appears in all except one of the sets. This motivates the idea of finding these items without enumerating the sets. In addition, the top of two similar functions (where the angle between their corresponding rays is small) are more likely to intersect.
We use these observations in this subsection and propose the function-space partitioning algorithm mdrc which (similar to the 2D algorithm 2drrr) leverages Theorem 1 in its design. The algorithm is based on an extension of Theorem 1 that bounds the rank of an item that appears in the top-$k$ of the functions corresponding to the corners of a convex polytope in the function space.
mdrc considers the function space in which every function (i.e., a ray starting from the origin) in is identified as a set of angles. Rather than discovering the sets and transforming the problem to a hitting set instance, here our objective is to cover the continuous function space (instead of the discrete set space). Intuitively, we propose a recursive algorithm which, at every recursive step, considers a hyperrectangle in the function space, and either assigns a tuple to the functions in the space, or uses a round robin strategy on the angles to break down the space in two halves, and to continue the algorithm in each half. This partitioning strategy is similar to the Quadtree data structure [32]. The reason for choosing this strategy is to maximize the similarity of the functions in the corners of the hyperrectangles to increase the probability that their top sets intersect. mdrc also follows a greedy strategy in covering the function space, by partitioning a hyperrectangle only if it cannot assign a tuple to it.
Consider the space of possible ranking functions in . This is identified by a set of angles , where . To explain the algorithm, consider the binary tree where each node is associated with a hyperrectangle in the angle space, specified by a range vector of size . The root of the tree is the complete angle space, that is the hyperrectangle defined between the ranges on each dimension. Let the level the nodes increase from top to bottom, with the level of the root being . Every node at level uses the angle to partition the space in two halves, the negative half (left children) and the positive half (the right child). Figure 8 illustrates an example of such tree for 3D. The root uses the angle to partition the space. The left child of the root is associated with the rectangle specified by the ranges and the right child shows the one by . The nodes at level use the angle for partitioning the space.
At every node, the algorithm checks the top items in the corners of the node’s hyperrectangle and if there exists an item that is common to all of them, returns it. Otherwise, it generates the children of the node and iterates the algorithm on the children. The algorithm combines the outputs of each of the halves as its output. Algorithm 5 shows the pseudocode of the recursive algorithm mdrc. The algorithm is started by calling mdrc .
As a running example for the algorithm, let us consider Figure 8. The algorithm starts at the root, partitions the space in two halves, as the intersection of the top of its hyperrectangle’s corners are empty, and does the recursion at nodes and . The node finds the item which appears in the top of all of its corners and returns it to . Node , however, cannot find such an item and does the recursion by partitioning its hyperrectangle along the angle . Nodes and find the items and and return them to which returns to the root. The root returns as the representative.
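The recursion walked through above can be sketched in Python as follows. This is a simplified illustration rather than Algorithm 5 itself. The spherical-coordinate mapping in `angles_to_weights`, the choice of the smallest common item id, and the `max_depth` safeguard are assumptions introduced for the sketch, and at least two attributes are assumed.

```python
import itertools
import math


def angles_to_weights(angles):
    """Map d-1 angles in [0, pi/2] to a non-negative unit weight vector."""
    w, prod = [], 1.0
    for a in angles:
        w.append(prod * math.cos(a))
        prod *= math.sin(a)
    w.append(prod)
    return w


def top_k(data, w, k):
    """Indices of the k items with the highest linear score under weights w."""
    order = sorted(range(len(data)),
                   key=lambda i: -sum(x * y for x, y in zip(data[i], w)))
    return set(order[:k])


def mdrc(data, k, ranges=None, level=0, max_depth=40):
    """Cover the angle space with items shared by the corners' top-k sets."""
    d = len(data[0])
    if ranges is None:
        ranges = [(0.0, math.pi / 2)] * (d - 1)  # root: the whole angle space
    corner_tops = [top_k(data, angles_to_weights(corner), k)
                   for corner in itertools.product(*ranges)]
    common = set.intersection(*corner_tops)
    if common:
        return {min(common)}  # one item shared by all corners
    if level >= max_depth:
        # Safeguard (not in the paper): stop splitting, keep the corners' items.
        return set.union(*corner_tops)
    axis = level % (d - 1)  # round robin over the angles
    lo, hi = ranges[axis]
    mid = (lo + hi) / 2.0
    left = ranges[:axis] + [(lo, mid)] + ranges[axis + 1:]
    right = ranges[:axis] + [(mid, hi)] + ranges[axis + 1:]
    return (mdrc(data, k, left, level + 1, max_depth)
            | mdrc(data, k, right, level + 1, max_depth))


if __name__ == "__main__":
    items = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7), (0.1, 0.1)]
    print(mdrc(items, k=2))
```

In the small 2D example, the two corners of the root range already share an item in their top-2, so the sketch returns it without any partitioning. With $k = 1$ the corners on either side of a rank switch never share an item, which is why the sketch carries a depth limit.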
Theorem 6
The algorithm mdrc guarantees the maximum rankregret of .
This proof uses Theorem 1 to extend the maximum rank bound from one dimensional ranges to dimensions. Please find the details in Appendix A. Theorem 6 uses the result of Theorem 1 to provide an upper bound on the maximum rank of the items assigned to each hyperrectangle, for the functions inside it. However, as we shall show in § 6, the rankregret of its output in practice is much less. For all the experiments we ran, the output of mdrc satisfied the maximum rank of for all settings. Also, following the greedy nature in partitioning the function space, as we shall show in § 6, the output of mdrc in all cases was less than 40. In addition, in § 6, we show that this algorithm is very efficient and scalable in practice.
6 Experimental Evaluation
6.1 Setup
Datasets. To evaluate our algorithms for computing RRR, we conducted experiments over two real multi-attribute datasets that could potentially benefit from the use of rank-regret. We describe these datasets next.
US Department of Transportation flight delay database (DOT; www.transtats.bts.gov/DL_SelectFields.asp?): This database is widely used by third-party websites to identify the on-time performance of flights, routes, airports, and airlines. After removing the records with missing values, the dataset contains 457,892 records, for all flights conducted by the 14 US carriers in the last months of 2017, over the scalar attributes DepDelay, TaxiOut, Actualelapsedtime, ArrivalDelay, Airtime, Distance, Taxiin, and CRSelapsedtime. For Airtime and Distance higher values are preferred while for the rest of attributes lower values are better.
Blue Nile (BN; www.bluenile.com/diamondsearch?): Blue Nile is the largest online diamond retailer in the world. We collected its catalog, which contained 116,300 diamonds at the time of our experiments. We consider the scalar attributes Carat, Depth, LengthWidthRatio, Table, and Price. For all attributes except Price, higher values are preferred. The value of a diamond depends heavily on these measurements, and small changes in these scores may mean a lot in terms of the quality of the jewel. For example, while the listed diamonds range from 0.23 carat to 20.97 carat, minor changes in the carat affect the price. We considered two similar diamonds, where one is 0.5 carat and the other is 0.53 carat. Even though all other measures are similar for both diamonds, the second is 30% more expensive than the first one. The same holds for Depth, LengthWidthRatio, and Table. Such settings, where slight changes in the scores may dramatically affect the value (and the rank) of the items, highlight the motivation for rank-regret.
We normalize each value of a higherpreferred attribute as and for each lowerpreferred attribute , we do it as .
Algorithms evaluated: In addition to the theoretical analyses, we evaluate the algorithms proposed in this paper. In § 4, we proposed 2drrr, the algorithm that uses Theorem 1 to transform the problem into one dimensional range covering. This quadratic algorithm guarantees the approximation ratio of 2 on the maximum rank regret of its output. In this section, we shall show that in all the cases it generated an output with maximum rank of . For 2D, we implemented the raysweeping algorithm (similar to Algorithm 1) that enumerates the sets by following the changes in the border (Figure 3). We also implemented the set graph based enumeration explained in Appendix B for MD. We did not include the results here, but we observed that it does not scale beyond a few hundred items (that is because it need to solve much as linear programs for a single set). Instead, we apply the randomized algorithm kset for finding the sets (while setting the termination condition to 100). The MD algorithms proposed in § 5 are the hittingset based algorithm mdrrr and the space function covering algorithm mdrc. As we explained in § 1 and 7, all of the existing algorithms proposed for different varieties of regretratio consider the score difference, as the measure of regret and apply the optimization based on it. Still to verify this, we consider comparing with them as the baseline. As we shall further explain in § 7, the advanced algorithms for the regretratio problem are two similar studies [1, 5] that both work based on discretizing the function space and applying hitting set, and therefore, provide similar controllable additive approximation factors. We adopt the hdrrms algorithm [5] which as mentioned in [1, 40] should perform similar to the one in [1]. Since the input the algorithm is the index size, in order to be fair in the comparison, in all settings, we first run the algorithm mdrc, and then pass the output size of it as the input to hdrrms.
Evaluations: In addition to the efficiency, we evaluate the effectiveness of the proposed algorithms. That is, we study whether the algorithms can find a small subset with bounded rank-regret based on $k$. We consider the running time as the efficiency measure and the rank-regret of the output set, as well as its size, for effectiveness. Computing the exact rank-regret of a set requires constructing the arrangement of items in the dual space, which does not scale to large settings. Therefore, in the experiments, to estimate the rank-regret of a set in MD, we draw 10,000 functions uniformly at random (based on Lines 4 to 6 of Algorithm 4) and use them for the estimate.
Default values: For each experiment, we study the impact of varying one variable while fixing the other variables to their default values. The default values are as follows: (i) dataset size ($n$): 10,000, (ii) number of attributes ($d$): 3, and (iii) $k$: top-1%.
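The sampling-based estimate of rank-regret used in this setup can be sketched in Python as follows. The function name `estimate_rank_regret` and its parameters are hypothetical. The sketch assumes normalized items where higher scores are better, and it draws weight vectors as absolute values of normal variables, as described for Algorithm 4. Because only sampled functions are checked, the result is a lower bound on the exact rank-regret.

```python
import random


def estimate_rank_regret(data, subset, num_functions=10000, seed=0):
    """Estimate the rank-regret of `subset` (item indices) over random functions.

    For each sampled linear function, take the best-scoring item of the subset
    and compute its rank in the full dataset; report the worst rank observed.
    """
    rng = random.Random(seed)
    d = len(data[0])
    worst = 0
    for _ in range(num_functions):
        # Scaling the weight vector does not change the ranking, so no
        # normalization is needed here.
        w = [abs(rng.gauss(0.0, 1.0)) for _ in range(d)]
        scores = [sum(a * b for a, b in zip(item, w)) for item in data]
        best = max(scores[i] for i in subset)
        rank = 1 + sum(1 for s in scores if s > best)
        worst = max(worst, rank)
    return worst


if __name__ == "__main__":
    items = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7), (0.1, 0.1)]
    print(estimate_rank_regret(items, [2], num_functions=2000))
```

A subset is accepted as a rank-regret representative for a given $k$ when this estimate does not exceed $k$.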
Hardware and platform. All experiments were performed on a Linux machine with a 3.8 GHz Intel Xeon processor and 64 GB memory. The algorithms are implemented using Python 2.7.
6.2 Results
2D. We use a ray-sweeping algorithm, similar to Algorithm 1, to enumerate the $k$-sets by following the changes in the border. We also use ray sweeping to find the (exact) rank-regret of a set in 2D. Due to space limitations, for 2D, we only provide the plots for the DOT dataset. Figures 12 and 12 show the performance of the algorithms for varying the dataset size ($n$) from 1000 to 400,000. The running times of 2drrr and mdrrr are dominated by the time required by the sweeping-line algorithms for finding the ranges (Algorithm 1) and the $k$-sets. Since these two algorithms have a similar structure, their running times are similar. Still, because the sweeping-ray algorithm is quadratic, these algorithms did not scale beyond 100K items. On the other hand, mdrc does not depend on finding the $k$-sets or sweeping a line. Rather, it partitions the space until the top-$k$ of the two corners of each range intersect. Due to the binary-search nature of the algorithm, which halves the space at every iteration, the functions at the two ends of each range soon become similar enough to share an item in their top-$k$. Therefore, the algorithm performs very well in practice and scales well to large settings. For example, it took less than a second to run mdrc for 100K items, while 2drrr and mdrrr required several thousand seconds (see Figure 12). In Figure 12, and in all other plots with two y-axes, the left axis shows the rank-regret and the right one shows the output size. The dashed green line shows the border for the rank-regret of 1%.
The algorithm 2drrr guarantees the optimal output size. For all settings its output also had a rank-regret of less than $k$, confirming that it returned the optimal solution. On the other hand, mdrrr guarantees the rank-regret of $k$ and provides a logarithmic approximation ratio on its output size. This is also confirmed in the figure, where the rank-regret of the output of mdrrr is always below the green line. However, the size of its output is more than the optimal for two (out of three) settings. The space partitioning algorithm mdrc provides an output that in all cases satisfied the rank-regret of $k$, and its output size was also the minimum, confirming that it also discovered the optimal output. In Figures 12 and 12, we fix the dataset size and the other variables to their defaults and study the effect of changing $k$ on the efficiency of the algorithms and the quality of their outputs. Similar to Figure 12, 2drrr and mdrrr have similar running times (due to applying the ray-sweeping algorithm) and mdrc runs within a few milliseconds for all settings. On the other hand, in Figure 12, the output size of mdrc is in all cases except one equal to the optimal output size (the output size of 2drrr) while, due to its logarithmic approximation ratio, the hitting-set based mdrrr generates larger outputs. mdrrr guarantees the rank-regret of $k$, which is confirmed in the figure. mdrc also provided the maximum rank-regret of $k$ for all settings, and 2drrr did so for all settings except one, for which its maximum rank-regret was slightly above the threshold.
kset size. Next, we compare the actual size of sets with the theoretical upperbounds, using the kset algorithm. To do so, we select the DOT and BN datasets, set number of items to 10K and study the impact of varying and . The results are provided in Figures 16, 16, 16, and 16. The leftyaxis in the figures show the size and the rightyaxis show the running time of the kset algorithm. The horizontal green line in the figures highlight the number of items K. Figures 16 and 16 show the results for varying for DOT and BN, respectively. First, as observed in the figures, the actual sizes of the sets are significantly smaller than the best known theoretical upperbound for 3D ( [49]). In fact, the number of sets is closer to than the upperbounds. Second, the number of sets for is significantly larger than the number of sets for smaller values of . Recall that the sets are densely overlapping, as the neighboring sets in the set graph only differ in one item. As increases (up until ), for each node of the set graph the number of candidate transitions to the neighboring sets increases which affect as well. Although significantly smaller than the upper bound, still the sizes are large enough to make the set discovery impractical for large settings. For example, running the kset algorithm for the DOT dataset and took more than ten thousand seconds. The observations for varying (Figures 16 and 16) are also similar. Also, the gap between the theoretical upperbound for and the actual sets sizes show how loose the bounds are.
MD. Here, we study the algorithms proposed for the general cases where . mdrrr is the hitting set based algorithm that, given the collection of sets, guarantees the rankregret of and a logarithmic increase in the output size. So far, the 2D experiments confirmed these bounds. The other algorithm is the space partitioning algorithm mdrc which is designed based on Theorem 1. Given the possibly large number of sets and the cost of finding them (even using the randomized algorithm kset), this algorithm is designed to prevent the set enumeration. mdrc uses the fact that the sets are highly overlapping and recursively partitions the space (see Figure 8) into several hypercubes and stops the recursion for each hypercube as soon as the intersection of the top items in its corners is not empty. This algorithm performs very well in practice, as after a few iterations, the functions in the corners become similar enough to share at least one item in their top. Also, the maximum rankregret of the item that appear in the top of the corners of the hyperrectangle for the functions inside the hypercube is much smaller than the bound provided in Theorem 6. We so far observed it in the 2D experiments where in all cases the rankregret of the output of mdrc is less than , while the output size also was always close to the optimal output size.
In addition to these algorithms, we compare the efficiency and effectiveness of our algorithms against, hdrrms [5], the recent approximation algorithm proposed for regretratio minimizing problem. Since hdrrms takes the index size as the input, we first run the mdrc algorithm and pass its output size to hdrrms. Having a different optimization objective (on the regretratio), as we shall show, the output of hdrrms fails to provide a bound on the rankregret. In the first experiment, fixing the other variables to their default values, we vary the dataset size from 1000 to 400,000 for DOT and from 1000 to 100,000 for BN. Figures 20, 20, 20, and 20 show the results. Figures 20 and 20, 20 show the running time of the algorithms for DOT and BN, respectively. Looking at these figures, first mdrrr did not scale for 100K items. The reason is that mdrrr needs the collection of sets in order to apply the hitting set. For a very large number of items even the kset algorithm does not scale. hdrrms has a reasonable running time in all cases. mdrc has the least running time for large values of and in all cases it finished in less than a few seconds. The reason is that after a few recursions, the functions in the corners of the hypercubes become similar and share an item in their top. Figures 20 and 20 show the effectiveness of the algorithms for these settings. The leftyaxes show the maximum rankregret of an output set while the rightyaxes show the output size. The green lines show the rankregret of border. First, the output size for all settings is less than 20 items, which confirm the effectiveness of algorithms for finding a rankregret representative. As explained in § 5.2, mdrrr guarantees the rankregret of , which is observed here as well. As expected, hdrrms fails to provide a rankregret representative in all cases. Both for DOT and BN, the maximum rankregret of the output of hdrrms are close to , the maximum possible rankregret. 
For example, for DOT and 400K, the rankregret of hdrrms was 112K, i.e., there exists a function for which the top based on the output of hdrrms has the rank 112,000. Based on Theorem 6, for these settings, the rankregret of the output of mdrc is guaranteed to be less than for all cases. However, in practice we expect the rankregret to be smaller than this. This is confirmed in both experiments for DOT (Figure 20) and BN (Figure 20) where the output of mdrc provided the rankregret of .
Next, we evaluate the impact of varying the number of dimensions. Setting $n$ to 10,000 and $k$ to 1% of $n$ (i.e., 100), we vary the number of attributes, $d$, for both DOT and BN. Figures 24, 24, 24, and 24 show the results. The running times of the algorithms for DOT and BN are provided in Figures 24 and 24. Similar to the previous experiments, since the hitting-set based algorithm mdrrr requires the collection of $k$-sets, it was not efficient. Both hdrrms and mdrc performed well in both experiments. On the other hand, looking at Figures 24 and 24, hdrrms fails to provide a rank-regret representative, as in all settings the rank-regret of its output was several thousands, while the maximum possible rank-regret is $n$. The outputs of the algorithms proposed in § 5, as expected, satisfied the requested rank-regret. Interestingly, the output of mdrc had a lower rank-regret, especially for DOT where its rank-regret was around 10 for all settings. The output size of both mdrrr and mdrc was less than 40 for all settings and both datasets, which confirms their effectiveness as representatives.
In the last experiment, we evaluate the impact of varying . For both datasets, while setting to 10,000 and to 3, we varied from 0.1% of items (i.e., 10) to 10% (i.e., 1000). Figures 28, 28, 28, and 28 show the results. Looking at Figures 28 and 28 which show the running time of the algorithms for DOT and BN, respectively, mdrrr had the worst performance, and it got worse as increased. The bottleneck in mdrrr is the set enumeration, and (looking at Figures 16 and 16) it increased by , as the number of sets increased. Both hdrrms and mdrc were efficient for all settings. One interesting fact in these plots is that the running time of mdrc decreases as increases. This is despite the fact that, as showed in Figures 16 and 16, the number of sets increased. The reason for the decrease, however, is simple. The probability that the top of corners of a hypercube share an item increases when looking at larger values of where each set contains more items. Although hdrrms was efficient in all settings, similar to the previous experiments it fails to provide a rankregret representative as the rankregret of its output is not bounded. The outputs of mdrrr and mdrc, on the other hand, had smaller rankregret than the requested in all settings for both datasets. Again, the output sizes in all settings were less than 20, which confirm the effectiveness of them as the rankregret representative.
Summary of results: To summarize, the experiments verified the effectiveness and efficiency of our proposal. While the adaptation of the regretratio based algorithm hdrrms fails to provide a rankregret representative, 2drrr, mdrrr, and mdrc found small sets with small rankregrets. Although the rankregret of the outputs of 2drrr and mdrc can be larger than , in our experiments and our measurements those were always below . mdrrr provided small outputs that as expected, always guarantees the rankregret of . Interestingly, the output size of mdrc was around the size of the one by mdrrr, which verifies the effect of the greedy behavior of mdrc. The output sizes in all the experiments were less than 40, confirming the effectiveness of the representatives. The quadratic 2drrr and the hittingset based algorithm mdrrr scaled up to a limit, whereas mdrc had low running time at all scales.
7 Related Work
The problem of finding preferred items of a dataset has been extensively investigated in recent years, and research has spanned multiple directions, most notably in top-$k$ query processing [37] and skyline discovery [9]. In top-$k$ query processing, the approach is to model the user preferences by a ranking/utility function which is then used to preferentially select tuples. Fundamental results include access-based algorithms [31, 30, 11, 42] and view-based algorithms [36, 21]. In skyline research, the approach is to compute subsets of the data (such as skylines and convex hulls) that serve as the data representatives in the absence of explicit preference functions [9, 7, 48]. Skylines and convex hulls can also serve as effective indexes for top-$k$ query processing [15, 54, 6].
Efficiency and effectiveness have always been the challenges in the above studies. While top-$k$ algorithms depend on the existence of a preference function and may require a complete pass over all of the data before answering a single query, representatives such as skylines may become overwhelmingly large and ineffective in practice [5, 34]. Studies such as [13, 52] focus on reducing the skyline size. In an elegant effort towards finding a small representative subset of the data, Nanongkai et al. [45] introduced the regret-ratio minimizing representative. The intuition is that a “close-to-top” result may satisfy the users’ need. Therefore, for a subset of data and a preference function, they consider the score difference between the top result of the subset versus the actual top result as the measure of regret, and seek the subset that minimizes its maximum regret over all possible linear functions. Since then, works such as [44, 55, 5, 39, 47, 1, 12, 40] studied different challenges and variations of the problem. Chester et al. [17] generalize the regret-ratio notion to $k$-regret, in which the regret is considered to be the difference between the top result of the subset and the actual top-$k$ result (instead of the top-1 result). They also prove that the problem is NP-complete for a variable number of dimensions. [12, 1] prove that the $k$-regret problem is NP-complete even when $d = 3$, using the polytope vertex cover problem [20] for the reduction. As explained in § 2, this also proves that our problem is NP-complete for $d \geq 3$. For the case of two-dimensional databases, [17] proposes a quadratic algorithm and [5] improves the running time. The cube algorithm and a greedy heuristic [45] are the first algorithms proposed for regret-ratio in MD. Recently, [1, 5] independently proposed similar approximation algorithms for the problem, both discretizing the function space and applying the hitting set, thus providing similar controllable additive approximation factors. The major difference is that [5] considers the original regret-ratio problem while [1] considers the $k$-regret variation.
It is important to note that the above prior works consider the score difference as the regret measure, making their problem setting different from ours, since we use the rank difference as the regret measure.
The geometric notions used in this paper, such as arrangement, dual space, and set, are explained in detail in [24]. Finding bounds on the number of sets of a point set do not lead to promising results on the upper bound of the size of . Lovasz and Erdos [41, 29] initiated the study of set notion and provided an upper bound on the maximum number of sets in . The problem in has also been studied in [26, 50, 25, 46]. The best known upper bound on the number of sets in and are [22] and [49], respectively. For higher dimensions, finding an upper bound on the number of sets has been extensively studied [24, 50, 23, 2]; the best known upper bound is [2], where is a small constant. The problem of enumerating all sets has been studied in [27, 14] for 2D and [3] for MD.
8 Final Remarks
In this paper, we proposed a rank-regret measure that is easier for users to understand, and often more appropriate, than regret computed from score values. We defined the rank-regret representative as the minimal subset of the data containing at least one of the top-$k$ of any possible ranking function. Using a geometric interpretation of items, we bounded the maximum rank of items on ranges of functions and utilized combinatorial geometry notions for designing effective and efficient approximation algorithms for the problem. In addition to theoretical analyses, we conducted empirical experiments on real datasets that verified the validity of our proposal. Among the proposed algorithms, mdrc seems to be scalable in practice: in all experiments, within a few seconds, it could find a small subset with small rank-regret.
References
 [1] P. K. Agarwal, N. Kumar, S. Sintos, and S. Suri. Efficient algorithms for kregret minimizing sets. LIPIcs, 2017.
 [2] N. Alon, I. Bárány, Z. Füredi, and D. J. Kleitman. Point selections and weak nets for convex hulls. Combinatorics, Probability and Computing, 1(03):189–200, 1992.
 [3] A. Andrzejak and K. Fukuda. Optimization over kset polytopes and efficient kset enumeration. In WADS, 1999.
 [4] P. Assouad. Densité et dimension. Ann. Institut Fourier (Grenoble), 1983.
 [5] A. Asudeh, A. Nazi, N. Zhang, and G. Das. Efficient computation of regretratio minimizing set: A compact maxima representative. In SIGMOD. ACM, 2017.
 [6] A. Asudeh, S. Thirumuruganathan, N. Zhang, and G. Das. Discovering the skyline of web databases. VLDB, 2016.
 [7] A. Asudeh, G. Zhang, N. Hassan, C. Li, and G. V. Zaruba. Crowdsourcing paretooptimal object finding by pairwise comparisons. In CIKM, 2015.
 [8] A. Asudeh, N. Zhang, and G. Das. Query reranking as a service. VLDB, 9(11), 2016.
 [9] S. Borzsony, D. Kossmann, and K. Stocker. The skyline operator. In ICDE, 2001.
 [10] H. Brönnimann and M. T. Goodrich. Almost optimal set covers in finite VC-dimension. DCG, 14(4):463–479, 1995.
 [11] N. Bruno, S. Chaudhuri, and L. Gravano. Top-k selection queries over relational databases: Mapping strategies and performance evaluation. TODS, 2002.
 [12] W. Cao, J. Li, H. Wang, K. Wang, R. Wang, R. Chi-Wing Wong, and W. Zhan. k-regret minimizing set: Efficient algorithms and hardness. In LIPIcs, 2017.
 [13] C.-Y. Chan, H. Jagadish, K.-L. Tan, A. K. Tung, and Z. Zhang. Finding k-dominant skylines in high dimensional space. In SIGMOD, 2006.
 [14] T. M. Chan. Remarks on k-level algorithms in the plane. Manuscript, Department of Computer Science, University of Waterloo, Waterloo, Canada, 1999.
 [15] Y.-C. Chang, L. Bergman, V. Castelli, C.-S. Li, M.-L. Lo, and J. R. Smith. The onion technique: indexing for linear optimization queries. In SIGMOD, 2000.
 [16] B. Chazelle and F. P. Preparata. Halfspace range search: an algorithmic application of k-sets. In SOCG. ACM, 1985.
 [17] S. Chester, A. Thomo, S. Venkatesh, and S. Whitesides. Computing k-regret minimizing sets. VLDB, 7(5), 2014.
 [18] K. L. Clarkson. Applications of random sampling in computational geometry, ii. In SOCG. ACM, 1988.
 [19] G. B. Dantzig. Linear programming and extensions. Princeton university press, 1998.

 [20] G. Das and M. T. Goodrich. On the complexity of optimization problems for 3-dimensional convex polyhedra and decision trees. Computational Geometry, 8(3), 1997.
 [21] G. Das, D. Gunopulos, N. Koudas, and D. Tsirogiannis. Answering top-k queries using views. In VLDB, 2006.
 [22] T. K. Dey. Improved bounds for planar k-sets and related problems. DCG, 19(3):373–382, 1998.
 [23] T. K. Dey and H. Edelsbrunner. Counting triangle crossings and halving planes. In SOCG, pages 270–273. ACM, 1993.
 [24] H. Edelsbrunner. Algorithms in combinatorial geometry, volume 10. Springer Science & Business Media, 1987.
 [25] H. Edelsbrunner, N. Hasan, R. Seidel, and X. J. Shen. Circles through two points that always enclose many points. Geometriae Dedicata, 32(1):1–12, 1989.
 [26] H. Edelsbrunner and E. Welzl. On the number of line separations of a finite set in the plane. Journal of Combinatorial Theory, Series A, 38, 1985.
 [27] H. Edelsbrunner and E. Welzl. Constructing belts in two-dimensional arrangements with applications. SICOMP, 15(1):271–284, 1986.

 [28] P. Erdős. On a classical problem of probability theory. Magy. Tud. Akad. Mat. Kut. Int. Kőz., 6(12), 1961.
 [29] P. Erdős, L. Lovász, A. Simmons, and E. G. Straus. Dissection graphs of planar point sets. A survey of combinatorial theory, pages 139–149, 1973.
 [30] R. Fagin, R. Kumar, and D. Sivakumar. Comparing top k lists. Journal on Discrete Mathematics, 2003.
 [31] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. JCSS, 2003.
 [32] R. A. Finkel and J. L. Bentley. Quad trees a data structure for retrieval on composite keys. Acta informatica, 1974.
 [33] Y. D. Gunasekaran, A. Asudeh, S. Hasani, N. Zhang, A. Jaoua, and G. Das. Qr2: A third-party query reranking service over web databases. In ICDE Demo, 2018.
 [34] S. Har-Peled. On the expected complexity of random convex hulls. arXiv preprint arXiv:1111.5340, 2011.
 [35] D. Haussler and E. Welzl. ε-nets and simplex range queries. DCG, 2(2):127–151, 1987.
 [36] V. Hristidis and Y. Papakonstantinou. Algorithms and applications for answering ranked queries using ranked views. VLDB, 13(1), 2004.
 [37] I. F. Ilyas, G. Beskales, and M. A. Soliman. A survey of top-k query processing techniques in relational database systems. CSUR, 40(4):11, 2008.
 [38] R. M. Karp. Reducibility among combinatorial problems. In Complexity of computer computations, pages 85–103. Springer, 1972.
 [39] T. Kessler Faulkner, W. Brackenbury, and A. Lall. k-regret queries with nonlinear utilities. VLDB, 8(13), 2015.
 [40] N. Kumar and S. Sintos. Faster approximation algorithm for the k-regret minimizing set and related problems. In ALENEX. SIAM, 2018.
 [41] L. Lovász. On the number of halving lines. Ann. Univ. Sci. Budapest, Eötvös, Sec. Math, 14:107–108, 1971.
 [42] A. Marian, N. Bruno, and L. Gravano. Evaluating top-k queries over web-accessible databases. ACM Trans. Database Syst., 29(2), 2004.
 [43] G. Marsaglia et al. Choosing a point from the surface of a sphere. The Annals of Mathematical Statistics, 43(2), 1972.
 [44] D. Nanongkai, A. Lall, A. Das Sarma, and K. Makino. Interactive regret minimization. In SIGMOD. ACM, 2012.
 [45] D. Nanongkai, A. D. Sarma, A. Lall, R. J. Lipton, and J. Xu. Regret-minimizing representative databases. VLDB, 2010.
 [46] J. Pach, W. Steiger, and E. Szemerédi. An upper bound on the number of planar k-sets. DCG, 7(1):109–123, 1992.
 [47] P. Peng and R. C.-W. Wong. Geometry approach for k-regret query. In ICDE. IEEE, 2014.
 [48] M. F. Rahman, A. Asudeh, N. Koudas, and G. Das. Efficient computation of subspace skyline over categorical domains. In CIKM. ACM, 2017.
 [49] M. Sharir, S. Smorodinsky, and G. Tardos. An improved bound for k-sets in three dimensions. In SOCG. ACM, 2000.
 [50] G. Tóth. Point sets with many k-sets. DCG, 26(2), 2001.

 [51] V. Vapnik. The nature of statistical learning theory. Springer Science & Business Media, 2013.
 [52] A. Vlachou and M. Vazirgiannis. Ranking the sky: Discovering the importance of skyline points through subspace dominance relationships. DKE, 69(9), 2010.
 [53] W. Weil and J. Wieacker. Stochastic geometry, handbook of convex geometry, vol. a, b, 1391–1438, 1993.
 [54] D. Xin, C. Chen, and J. Han. Towards robust indexing for ranked queries. In VLDB, 2006.
 [55] S. Zeighami and R. C.-W. Wong. Minimizing average regret ratio in database. In SIGMOD. ACM, 2016.
Appendix A Proofs
Theorem 2. The algorithm 2drrr is in $O(n^2 \log n)$, where $n$ is the number of items.
The complexity of 2drrr is determined by Algorithms 1 and 2. Algorithm 1 first orders the items based on $x$ in $O(n \log n)$. Then it sweeps a ray from the $x$-axis toward the $y$-axis and, at every intersection, applies a constant number of operations. The number of intersections is bounded by $O(n^2)$, which is therefore the running time of Algorithm 1. Calling Algorithm 1 generates at most $n$ ranges, one for each item. Every iteration of Algorithm 2 is in $O(n \log n)$, as it applies a binary search on the set of uncovered intervals for each unselected item, and the number of uncovered intervals is bounded by $n$. Given that the output size is bounded by $n$, Algorithm 2 is in $O(n^2 \log n)$.
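The quadratic bound on the number of sweep events can be illustrated with a minimal sketch. It assumes that items are 2D points scored by the weight vector (cos θ, sin θ) with θ between the x-axis and the y-axis. The helper name `exchange_angles` is hypothetical, and the code enumerates the events by brute force instead of reproducing Algorithm 1.

```python
import math
from itertools import combinations

def exchange_angles(points):
    """Angles in (0, pi/2) at which two items swap their relative order.

    Hypothetical helper (not from the paper): every pair of items yields at
    most one such angle, so the list has at most n*(n-1)/2 entries.
    """
    angles = []
    for (px, py), (qx, qy) in combinations(points, 2):
        dx, dy = px - qx, py - qy
        # Scores tie when dx*cos(t) + dy*sin(t) = 0, i.e. tan(t) = -dx/dy.
        # A tie inside (0, pi/2) needs dx and dy to have opposite signs,
        # which means neither item dominates the other.
        if dx * dy < 0:
            angles.append(math.atan2(abs(dx), abs(dy)))
    return sorted(angles)

if __name__ == "__main__":
    items = [(0.9, 0.1), (0.1, 0.9), (1.0, 1.0)]
    print(exchange_angles(items))
```

In the three-item example, only the first two items swap order, because the third one dominates both; a sweep would therefore process a single event.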
Theorem 3. The output size of 2drrr is not more than the size of the optimal solution for RRR.
Following the border while sweeping a ray from the $x$-axis to the $y$-axis, the top-$k$ results change only when a line above the border intersects with it; Figure 3 shows such a change at the intersection of two lines. Consider the collection of the top-$k$ results and the ranges of ray angles (called top-$k$ regions) that produce them. Now consider the ranges that Algorithm 1 generates for each item, referred to here as the ranges of items. These ranges mark the first and last angles for which an item is in the top-$k$. For each top-$k$ region, consider the set of items whose ranges cover it. Each top-$k$ region is covered by every item in its top-$k$, and it may also be covered by the ranges of some other items. Therefore, this set is a superset of the region's top-$k$. The minimum number of items needed to include at least one item from each of these supersets is not larger than the minimum number needed to include at least one item from each of the corresponding top-$k$ sets. As a result, the output size of 2drrr is not greater than the size of the optimal solution for the RRR problem.
Theorem 4. The output of 2drrr guarantees the maximum rank-regret of $2k$.
The proof follows directly from Theorem 1. For each item $t$, Algorithm 1 finds a range such that $t$ is in the top-$k$ at both its beginning and its end. Therefore, based on Theorem 1, the rank of $t$ for each of the functions inside its range is no more than $2k$. Algorithm 2 covers the function space with the ranges generated by Algorithm 1. Hence, for each function, there exists an item in the output whose rank is at most $2k$.
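The bound used in this proof can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes 2D items scored by the weight vector (cos θ, sin θ), samples the angles between the two ends of a range, and reports the worst rank of any item that is in the top-k at both ends. The function names and the sampling resolution are assumptions made for illustration.

```python
import math
import random

def rank(points, item, theta):
    """1-based rank of points[item] under the weights (cos theta, sin theta)."""
    wx, wy = math.cos(theta), math.sin(theta)
    own = wx * points[item][0] + wy * points[item][1]
    return 1 + sum(1 for (x, y) in points if wx * x + wy * y > own)

def worst_rank_in_range(points, k, lo, hi, steps=200):
    """Worst sampled rank, inside [lo, hi], over items in the top-k at both ends.

    Returns 0 when no item is in the top-k at both ends of the range.
    """
    worst = 0
    for i in range(len(points)):
        if rank(points, i, lo) <= k and rank(points, i, hi) <= k:
            for s in range(steps + 1):
                theta = lo + (hi - lo) * s / steps
                worst = max(worst, rank(points, i, theta))
    return worst

if __name__ == "__main__":
    random.seed(0)
    data = [(random.random(), random.random()) for _ in range(200)]
    k = 5
    print(worst_rank_in_range(data, k, 0.3, 1.1), "<=", 2 * k)
```

Sampling only inspects finitely many functions, so this is a sanity check of the bound rather than a proof of it.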
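Lemma 5. Let $\mathcal{S}$ be the collection of all $k$-sets for the points corresponding to the items in the dataset. For each ranking function $f$, there exists a $k$-set $S \in \mathcal{S}$ such that top-$k$($f$) = $S$.
The proof is by contradiction. Consider a ranking function $f$ with weight vector $w$ whose top-$k$ is $S$, and suppose that $S$ does not belong to $\mathcal{S}$. Let $t$ be the item whose rank under $f$ is $k$. Consider the hyperplane $h$ that is orthogonal to $w$ and passes through $t$. For all the items $t'$ in $S$, $f(t') \geq f(t)$, and for all items $t'$ not in $S$, $f(t') < f(t)$. Hence, the items in $S$, and only those, fall in the positive half space of $h$, i.e., $h^+$. Since $|S|$ is $k$, $h^+$ contains exactly $k$ points. Therefore $S$ is a $k$-set and should belong to $\mathcal{S}$, which contradicts the assumption that $S$ does not belong to the collection of $k$-sets.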
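The lemma can be illustrated with a short sketch, under the assumption that no two items have the same score (general position). The helper names are hypothetical. Unlike the proof, which places the hyperplane through the item of rank k, the sketch shifts the hyperplane halfway between the k-th and (k+1)-th scores so that a strict comparison suffices.

```python
import random

def score(w, p):
    """Linear score of point p under weight vector w."""
    return sum(wi * pi for wi, pi in zip(w, p))

def top_k(points, w, k):
    """Indices of the k highest-scoring points under w."""
    order = sorted(range(len(points)), key=lambda i: score(w, points[i]), reverse=True)
    return set(order[:k])

def halfspace_members(points, w, k):
    """Indices of points strictly above a hyperplane with normal w.

    The offset lies between the k-th and (k+1)-th scores, so the function
    assumes k < len(points) and distinct scores.
    """
    ranked = sorted((score(w, p) for p in points), reverse=True)
    offset = (ranked[k - 1] + ranked[k]) / 2
    return {i for i, p in enumerate(points) if score(w, p) > offset}

if __name__ == "__main__":
    random.seed(0)
    data = [tuple(random.random() for _ in range(3)) for _ in range(50)]
    w = (0.5, 0.3, 0.2)
    print(top_k(data, w, 4) == halfspace_members(data, w, 4))
```

The set cut off by the hyperplane has exactly k members and coincides with the top-k, which is what makes the top-k of a linear ranking function a k-set.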
Theorem 6. mdrc guarantees the maximum rank-regret of $dk$.
The proof of this theorem is based on Theorem 1. We also consider the arrangement lattice [24] for this proof. Every convex region in an $m$-dimensional space is bounded by $(m-1)$-dimensional convex facets. Each of these facets is in turn bounded by $(m-2)$-dimensional facets, and this continues all the way down to the ($0$-dimensional) points. For example, the borders of a convex polyhedron in 3D are two-dimensional convex polygons; the borders of the polygons are (one-dimensional) line segments, each specified by two points. The arrangement lattice is the data structure that describes the convex polyhedron by its $i$-dimensional facets, for each dimension $i$. The nodes at level $i$ of the lattice represent the $i$-dimensional facets, each connected to the $(i-1)$-dimensional facets on its border, as well as to the $(i+1)$-dimensional facets for which it is a border.
Now, let us consider the hyperrectangle of each of the leaf nodes in the recursion tree of mdrc (cf. Figure 8), and let $t$ be the tuple that appears in the top-$k$ of all corners of the hyperrectangle. Consider the arrangement lattice for the hyperrectangle of the leaf node, and let us move up from the bottom of the lattice, identifying the maximum rank of $t$ at each level. Since $t$ is in the top-$k$ of both corners of each line segment in level 1, based on Theorem 1, its rank for each point on the line is at most $2k$. Level 2 of the lattice contains the two-dimensional rectangles, each built from the line segments at level 1. For every point inside each rectangle at level 2, consider a line segment in the rectangle's affine space that starts from one of its corners, passes through the point, and ends on an edge of the rectangle. Since the rank of $t$ at the corner is at most $k$ and at any point on the edge at most $2k$, based on Theorem 1, the rank of $t$ for the points inside the rectangles at level 2 of the lattice is at most $3k$. Similarly, consider each hyperrectangle at level $i$ of the lattice. It is built from the $(i-1)$-dimensional hyperrectangles at level $i-1$. For every point inside the $i$-dimensional hyperrectangle, consider the line segment that starts from a corner of the hyperrectangle, passes through the point, and hits its border. By induction, the rank of $t$ on the $(i-1)$-dimensional borders of the hyperrectangle is at most $ik$. Therefore, since the rank of $t$ at the corner is at most $k$, based on Theorem 1, its rank for the points inside the $i$-dimensional hyperrectangle is at most $(i+1)k$. Consequently, the rank of $t$ for every point inside the $(d-1)$-dimensional hyperrectangle (the top of the lattice) is at most $dk$. mdrc partitions the function space into hyperrectangles such that, for each, there exists an item whose rank is at most $dk$ for every function in it, which proves the claim.
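The induction over the levels of the lattice can be written compactly. The notation $c_i$ is introduced here only for illustration: it denotes the largest number of items that outrank the tuple at any function on an $i$-dimensional face of a leaf hyperrectangle. The sketch assumes, as the proof does through Theorem 1, that an item outranking the tuple at an interior point of a segment also outranks it at one of the two endpoints, and that the function space has $d-1$ dimensions.

```latex
\begin{align*}
c_0 &\le k - 1
  && \text{(the tuple is in the top-$k$ at every corner)} \\
c_i &\le c_0 + c_{i-1}
  && \text{(segment from a corner to an $(i-1)$-dimensional face)} \\
c_i &\le (i + 1)(k - 1)
  && \text{(by induction on $i$)} \\
\text{rank on an $i$-dimensional face} &\le c_i + 1 \le (i + 1)\,k \\
\text{rank on the $(d-1)$-dimensional hyperrectangle} &\le d\,k
\end{align*}
```

For $i = 1$ this gives the $2k$ bound for line segments and for $i = 2$ the $3k$ bound for rectangles.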