Skyline Queries in O(1) time?

09/12/2017 ∙ by Spyros Sioutas, et al. ∙ National Technical University of Athens ∙ Aristotle University of Thessaloniki ∙ Ionian University

The skyline of a set P of points (SKY(P)) consists of the "best" points with respect to minimization or maximization of the attribute values. A point p dominates another point q if p is as good as q in all dimensions and strictly better than q in at least one dimension. In this work, we focus on the static 2-d space and provide expected performance guarantees for 3-sided Range Skyline Queries on the Grid, where N is the cardinality of P, B the size of a disk block, and R the capacity of main memory. We present the MLR-tree, which offers optimal expected cost for finding planar skyline points in a 3-sided query rectangle, q=[a,b]×(-∞,d], in both the RAM and I/O models on the grid [1,M]×[1,M], by a single scan of only the points contained in SKY(P). In particular, it supports skyline queries in a 3-sided range in O(t· t_PAM(N)) time (O((t/B)· t_PAM(N)) I/Os), where t is the answer size and t_PAM(N) the time required for answering predecessor queries for d in a PAM (Predecessor Access Method) structure, which is a special component of the MLR-tree and efficiently stores root-to-leaf paths or sub-paths. By choosing PAM structures with O(1) expected time for predecessor queries under discrete μ-random distributions of the x and y coordinates, the MLR-tree supports skyline queries in optimal O(t) expected time (O(t/B) expected number of I/Os) with high probability. The space cost becomes superlinear and can be reduced to linear for many special practical cases. If we choose a PAM structure with O(1) amortized time for batched predecessor queries (under no assumption on the distributions of the x and y coordinates), the MLR-tree supports batched skyline queries in optimal O(t) amortized time; however, the space becomes exponential. In the dynamic case, the update time complexity is affected by an O(log^2 N) factor.


1 Introduction

In this paper, we study efficient algorithms with non-trivial performance guarantees for skyline processing on the static plane. Let P denote the set of points in the dataset. Also, let p_i denote the value of the i-th coordinate of a point p (in our case i ∈ {1,2}).

[Dominance] A point p dominates another point q (p ≺ q) when p is as good as q in all dimensions and strictly better than q in at least one dimension. Formally: p ≺ q when p_i ≤ q_i for all i, and there exists j such that p_j < q_j.

[Skyline] The skyline of a set of points P contains the points that are not dominated by any other point. Formally: SKY(P) = {p ∈ P : there is no q ∈ P such that q ≺ p}.

In the above definitions, we have assumed that small values are preferable. However, this may change according to the concept and characteristics of the dimensions. For example, if each point represents a purchased item with one dimension being the price and the other dimension being the quality, then the best items should have low price and high quality.
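The two definitions can be made concrete with a minimal sketch, assuming that smaller values are preferable in every dimension. The names `dominates` and `naive_skyline` are hypothetical and used only for illustration; the quadratic test below is a direct reading of the definitions, not the method proposed in this paper.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def dominates(p: Point, q: Point) -> bool:
    """True if p dominates q when smaller values are preferable in every dimension."""
    no_worse = all(pi <= qi for pi, qi in zip(p, q))
    strictly_better = any(pi < qi for pi, qi in zip(p, q))
    return no_worse and strictly_better


def naive_skyline(points: List[Point]) -> List[Point]:
    """Quadratic-time skyline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


pts = [(1, 5), (2, 3), (3, 4), (4, 1), (5, 2)]
print(naive_skyline(pts))  # [(1, 5), (2, 3), (4, 1)]
```

For mixed preferences such as low price and high quality, one simple option is to negate the coordinate that should be maximized before applying the same test.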

Skyline queries have attracted the interest of the database community for more than a decade. Although the problem was already known in Computational Geometry under the name maximal (or minimal) vectors, the necessity to support skyline queries in databases was first addressed in [1]. Since then, they have been used in many applications including multi-criteria decision making, data mining and visualization, quantitative economics research and environmental surveillance [2, 3, 4]. Assume that we use the operator SKYLINE OF to express skyline queries using SQL. Then, a SQL query asking for the skyline of a relation could look like the following two examples:

SELECT id, name, price, quality  
FROM items 
WHERE price <= 100 AND quality>=3 
SKYLINE OF price MIN, quality MAX 

SELECT player_name, Height, Performance  
FROM Basketball_Team 
WHERE Height IN AND  
Performance IN  
SKYLINE OF Height MAX, Performance MAX 

In the first example, price and quality are dependent variables. However, in the second example, the two variables, the Height of a player and his overall Performance, are completely independent. Generally speaking, in multi-dimensional space, various well-known dimensionality reduction techniques [2, 3, 4] generate spatial vectors with uncorrelated (independent) dimensions. Thus, the probabilistic study of the skyline problem with independent dimensions is of great practical interest. Observe that, in addition to the skyline preferences, the WHERE clause contains some additional constraints. For example, the user may not be interested in an item that is more expensive than she can afford. Usually, these additional constraints form a rectangular area referred to as the region of interest. The answer to the query comprises the skyline of the points falling inside the region of interest. Another example is given in Figure 1, where MAX(X)-MIN(Y) semantics are being used. The skyline of the entire dataset is composed of the points , , and , whereas the skyline inside the region of interest contains the black dot points.

Figure 1: Example of a skyline result in a 3-sided query rectangle

In this paper, we present the MLR (Modified Layered Range) tree-structure providing an optimal expected solution for finding planar skyline points in a 3-sided query rectangle q=[a,b]×(-∞,d] in both the RAM and I/O models on the grid [1,M]×[1,M], by a single scan of only the points of the answer rather than of all the sorted points.

The latter means that the MLR-tree supports planar skyline queries in a 3-sided range in O(t· t_PAM(N)) time (O((t/B)· t_PAM(N)) I/Os), where t is the answer size and t_PAM(N) the time required for answering predecessor queries for d in a PAM (Predecessor Access Method) structure, which is a special component of the MLR-tree and efficiently stores root-to-leaf paths or sub-paths. By choosing PAM structures with O(1) expected time for predecessor queries under discrete μ-random distributions of the x and y coordinates, the MLR-tree supports skyline queries in optimal O(t) expected time (O(t/B) expected number of I/Os) with high probability. In addition to the general case, where the x and y coordinates are drawn from a μ-random distribution, we examine two more special cases of practical interest: (a) The inserted points have their -coordinates drawn from a class of distributions, whereas the -coordinates are arbitrarily distributed. In this case the space becomes linear, and the query time is marginally affected by a very small sublogarithmic factor. (b) The -coordinates are arbitrarily distributed and the -coordinates are continuously drawn from a more restricted class of smooth distributions. Similarly, the space is reduced to linear, and the query time remains unaffected (optimal). The practical interest of these special cases stems from the fact that any probability distribution is smooth in the sense defined later, for a suitable choice of the parameters. Finally, if we choose a PAM structure with O(1) amortized time for batched predecessor queries (under no assumption on the distributions of the x and y coordinates), the MLR-tree supports batched skyline queries in optimal O(t) amortized time; however, the space becomes exponential. In the dynamic case, the update time complexity is affected by an O(log^2 N) factor.

The proposed data structure borrows ideas from the Modified Priority Search Tree presented in [5] that supports simple 3-sided range reporting queries. However, the modifications to support skyline queries are novel and non-trivial. The same problem (dynamic I/O-efficient range skyline reporting) but with worst case guarantees (logarithmic query I/Os, logarithmic amortized update I/Os and linear space) has been presented in [6].

The rest of the work is organized as follows. Related work in the area and a brief discussion of our contributions are given in Section 2, whereas some fundamental concepts are presented in Section 3. A detailed description and analysis of our contributions are given in Section 4. Finally, in Section 6 we conclude the work and briefly discuss future research.

2 Related Work and Contributions

In this section, we describe related research in the area, focusing on the best available results. In addition, we present our contributions.

Results for RAM and PM models: The best previous solution was presented in [8]; it supports maxima (skyline) queries in optimal worst case time and updates in worst case time consuming linear space in the RAM model of computation with word size , where the coordinates of the points are integers in the range .

In the Pointer Machine (PM) or Comparison model (comparison is the only allowed computation on the coordinates of the points) the solution in [8] requires optimal worst case query time and worst case update time. The data structure of [8] also supports the more general query of reporting the maximal points among the points that lie in a given 3-sided orthogonal range unbounded from above in the same complexity. It can also support 4-sided queries in worst case time, and worst case update time, using space, where is the size of the output.

Results for the I/O model: The best previous solution has been presented in [9]. In the external-memory model, the 2-d version of the problem is known to be solvable in O((N/B) log_{R/B}(N/B)) I/Os and O(N/B) (i.e., linear) space, where N is the cardinality of P, B the size of a disk block, and R the capacity of main memory.

In particular, the skyline SKY(P) of a set P of 2-d points can be extracted by a single scan, provided that the points of P have been sorted in ascending order of their x-coordinates. For example, consider any point p ∈ P, and let P' be the set of points of P that rank before p in the sorted order. Clearly, p cannot be dominated by any point that ranks after p, because p has a smaller x-coordinate than any of those points. On the other hand, p is dominated by some point in P' if and only if the y-coordinate of p is greater than y_min, where y_min is the smallest y-coordinate of all the points in P'.

Table 1: New bounds for dynamic 3-sided planar skyline on the grid [1,M]×[1,M] in the RAM model of computation.
Data Distributions Space Query Time : -random expected : -random : -random expected : : -random expected : : arbitrary amortized : arbitrary for any for batched skyline queries

Table 2: New bounds for dynamic 3-sided planar skyline on the grid [1,M]×[1,M] in the I/O model of computation.
Data Distributions Space Query Time : -random expected : : -random expected :

To populate SKY(P), it suffices to read P in its sorted order and, at any time, keep the smallest y-coordinate y_min of all the points already seen. The next point scanned is added to SKY(P) if its y-coordinate is below y_min, in which case y_min is updated accordingly. In the I/O model, this algorithm performs O((N/B) log_{R/B}(N/B)) I/Os, which is the complexity of sorting N elements in external memory.
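The sort-then-scan procedure just described can be sketched in a few lines for the in-memory case. This is an illustrative sketch under the assumption that smaller values are preferable in both coordinates; the function name `scan_skyline` is hypothetical, and ties in the first coordinate are broken by the second coordinate so that the better point is scanned first.

```python
def scan_skyline(points):
    """Skyline under MIN/MIN semantics: sort by x, then one scan tracking the smallest y."""
    skyline = []
    y_min = float("inf")  # smallest y-coordinate among the points scanned so far
    # Tuples sort by x first and by y second, so equal-x ties put the better point first.
    for x, y in sorted(points):
        if y < y_min:
            skyline.append((x, y))
            y_min = y
    return skyline


pts = [(3, 4), (1, 5), (5, 2), (2, 3), (4, 1)]
print(scan_skyline(pts))  # [(1, 5), (2, 3), (4, 1)]
```

After sorting, the scan touches each point once, so the sort dominates the cost, which mirrors the I/O bound stated above.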

For fixed , the solution presented in [9] requires I/Os. Previously, the best solution was adapted from an in-memory algorithm, and requires I/Os.

Our Contributions: In this work, we provide novel algorithmic techniques with non-trivial performance guarantees, to process planar skyline queries inside a region of interest. Evidently, in its static case (no insertions and deletions), the problem can be solved by reusing existing techniques that return the skyline of the entire dataset and keeping only the points that fall inside the region of interest. This approach leads to suboptimal solutions, since the processing cost does not depend on the size of the region and the number of points that fall inside: for every query, the whole dataset must be scanned. In addition, most of the proposed algorithms are not equipped to handle insertions and deletions of points.

An exception to this behavior is the BBS [10] algorithm, which in most cases returns the skyline without scanning the entire R-tree index and in addition, it supports skyline computation inside a region of interest and can handle insertions and deletions. However, BBS does not offer any theoretical performance guarantee.

In this paper, we propose the MLR (Modified Layered Range) tree-structure providing an optimal expected solution for finding planar skyline points in a given 3-sided query rectangle q=[a,b]×(-∞,d] in both the RAM and I/O models on the grid [1,M]×[1,M], by a single scan of only the points contained in SKY(P). The latter means that the MLR-tree supports planar skyline queries in O(t/B) expected number of I/Os (O(t) expected time in RAM), where t is the answer cardinality, consuming super-linear space in the general case, which becomes linear under specific data distributions. Also, the MLR-tree supports batched skyline queries in optimal O(t) amortized time; however, the space becomes exponential. In the dynamic case, the update time complexity is affected by an O(log^2 N) factor. Our results are summarized in Tables 1 and 2.

3 Fundamental Concepts

For main memory solutions we consider the RAM model of computation. We denote by N the number of elements that reside in the data structures and by t the size of the query.

For the external memory solutions we consider the I/O model of computation [11]. This means that the input resides in the external memory in a blocked fashion. Whenever a computation needs to be performed on an element, the block of size B that contains that element is transferred into main memory, which can hold at most R elements. Every computation that is performed in main memory is free, since the block transfer is orders of magnitude more time consuming. Unneeded blocks that reside in main memory are evicted by an LRU replacement algorithm. Naturally, the number of block transfers (I/O operations) constitutes the metric of the I/O model.

Furthermore, in the dynamic case we will consider that the points to be inserted are drawn from an unknown discrete distribution. Also, the asymptotic bounds are given with respect to the current size of the data structure. Finally, deletions of the elements of the data structures are assumed to be uniformly random. That is, every element present in the data structure is equally likely to be deleted [12].

3.1 Probability Distributions

In this section, we overview the probability distributions that will be used in the remainder of the paper. We will consider that the x- and y-coordinates are distinct elements of these distributions and will choose the appropriate distribution according to the assumptions of our constructions.

A probability distribution is μ-random if the elements are drawn randomly with respect to a density function denoted by μ. For this paper, we assume that μ is unknown.

Informally, a distribution defined over an interval is smooth if the probability density over any of its subintervals does not exceed a specific bound, however small this subinterval is (i.e., the distribution does not contain sharp peaks).

Given two functions and , then is -smooth if there exists a constant , such that for all , and for all naturals , for a random key it holds that:

(1)

where for or , and for where and .

The above implies that no key can get a point mass, i.e. a value with nonzero probability (in the sense that it is bounded below by a positive constant). More accurately, if we initially consider the whole universe of keys with , and , we equally split it into , many equal consecutive subsets of keys, then (1) implies that each subset (containing consecutive keys) gets probability mass , which is as , when . Hence, as , each key in has probability mass. Once more, we can describe (1) by rephrasing the intuitive description of -smooth distribution as:

“among a number (measured by ) of consecutive subsets, each containing consecutive keys from , no subset containing consecutive keys from should be too dense (measured by ) compared to the others”.

The class of -smooth distributions (for appropriate choices of and ) is a superset of both regular and uniform classes of distributions, as well as of several non-uniform classes [13, 14]. Actually, any probability distribution is -smooth, for a suitable choice of .

The grid distribution assumes that the elements are integers that belong to a specific range [1,M].

3.2 Preliminary Access Methods

In this section, we describe the data structures that we utilize in order to achieve the desired complexities.

Half-Range Minimum/Maximum Queries: The Half-Range Maximum Query (h-RMQ) problem asks to preprocess an array of size such that, given an index range where , we can report the position of the maximum element in this range on . Notice that we do not want to change the order of the elements in , in which case the problem would be trivial. This is a restricted version of the general RMQ problem, in which the range is , where . In [16] the RMQ problem is solved in time using space and preprocessing time. The currently most space-efficient solution that supports queries in time appears in [17]. We could use these solutions for our h-RMQ problem, but in our case the problem can be solved much more simply by maintaining an additional array of maximum elements for each index of the initial array.
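The "additional array of maximum elements" can be illustrated as follows. This is a minimal sketch which assumes that the half-range is a prefix of the array (the exact range notation is not recoverable from the text here); the name `build_prefix_argmax` is hypothetical.

```python
def build_prefix_argmax(values):
    """pos[r] = index of the maximum of values[0..r], leftmost on ties; O(n) time and space."""
    pos = []
    best = 0
    for i, v in enumerate(values):
        if v > values[best]:
            best = i
        pos.append(best)
    return pos


A = [3, 1, 4, 1, 5, 9, 2, 6]
prefix_argmax = build_prefix_argmax(A)
# A half-range query for the prefix ending at index r is a single array lookup.
print(prefix_argmax[3])  # 2, since A[2] = 4 is the maximum of A[0..3]
```

The original array is left untouched and each query costs one lookup, which is why the general RMQ machinery is unnecessary for this restricted case.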

The Lazy B-tree: The Lazy B-tree of [18] is a simple but non-trivial externalization of the techniques introduced in [19]. The first level consists of an ordinary B-tree, whereas the second one consists of buckets of size , where is approximately equal to the number of elements stored in the access method. Each bucket consists of two list layers, and respectively, where , each of which has size. The technical details concerning both the maintenance of criticalities and the representation of buckets, can be found in [18]. The following theorem provides the complexities of the Lazy B-tree:

Theorem 1: The Lazy B-Tree supports the search operation in worst-case block transfers and update operations in worst-case block transfers, provided that the update position is given.

Interpolation Search Trees: In [20], a dynamic data structure based on interpolation search (IS-Tree) was presented, which requires linear space and can be updated in time w.c. Furthermore, the elements can be searched in time expected w.h.p., given that they are drawn from a -smooth distribution, for any arbitrary constants . The externalization of this data structure, called interpolation search B-tree (ISB-tree), was introduced in [18]. It supports update operations in worst-case I/Os provided that the update position is given and search operations in I/Os expected w.h.p. The expected search bound holds w.h.p. if the elements are drawn from a -smooth distribution, where and are constants. If the elements are drawn from the more restricted densities the expected number of I/Os for the search operation becomes with high probability ( is an arbitrarily chosen constant). The worst case search bound is block transfers.

Random Input: The data structure presented in [21], the Random Search Array (RSA), alleviates all lower bounds for the dynamic predecessor search problem, by proving constant time with high probability (w.h.p.), as grows large, thus improving over all approaches presented in [22, 13, 20]. The fine details of this dynamic data structure show that it achieves constant predecessor time w.h.p., working with only short memory words of length -bits, meaning that or . For equals to exactly -bits and for , RSA consumes super-linear space . The tuning of the positive constant for practical purposes was not studied in this paper.

Batched Predecessor Queries: The data structure presented in [7] answers batched predecessor queries in amortized time. In particular, it supports queries in time per query and requires space for any , where is the size of the universe. It can also answer predecessor queries in time per query and requires space for any . The method of solution relies on a certain way of searching for predecessors of all elements of the query in parallel.

In a general case, the solution in [7] presents a data structure that supports queries in time per query and requires space for any , as well as a data structure that supports queries in time per query and requires space for any .

4 The MLR-tree

In the following, we describe in detail the indexing scheme, which is termed the Modified Layered Range Tree (MLR-tree). The description of the MLR-tree is considered in the MAX-X, MIN-Y case. The other three cases can be handled in a similar way.

4.1 The Main Memory Static Non-Linear-Space MLR-tree

The Static Non-Linear MLR-tree (see Figure 2) is a static data structure that stores points on the 2-d grid. It is stored as an array in memory, yet it can be visualized as a complete binary tree. The static data structure is an augmented binary search tree on the set of points that resembles a range tree. stores all points in its leaves with respect to their -coordinate in increasing order. Let be the height of tree . We denote by the subtree of with root the internal node .

Let be the root-to-leaf path for leaf of . We denote by the subpath of consisting of nodes with depth . Similarly, () denotes the set of nodes that are left (right) children of nodes of and do not belong to . Let be the point stored in leaf of the tree where is its -coordinate and is its -coordinate. denotes the search path for , i.e., it is the path from the root to and it is equal to . The binary tree is augmented as follows:

Figure 2: The static non-linear-space MLR-tree in main memory.
  • Each internal node stores a point , which is the point with the minimum -coordinate among all points in its subtree .

  • Each internal node is equipped with a secondary data structure , which stores all points in with respect to -coordinate in increasing order. is implemented with a Predecessor Access Method (PAM) as well as an h-RMQ structure (see  3.2).

  • Each leaf stores arrays and , where , corresponding to sets and respectively. In particular, these arrays contain the points for each node in the corresponding sets. These arrays are sorted with respect to their -coordinate and are implemented with a PAM. In addition, they are also implemented as h-RMQ structures.

We use an array of size , which stores pointers to the leaves of . In particular, contains a pointer to the leaf of with maximum -coordinate smaller or equal to (this is ’s predecessor). In this way, we can determine in time the leaf of a search path for a particular point in . Finally, tree is preprocessed in order to support Lowest Common Ancestor queries in time. Since is static, one can use the methods of [15, 16] to find the LCA (as well as its depth) of two leaves in time by attaching to each node of a simple label.
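The pointer array over the grid can be pictured with the following minimal sketch. It assumes integer coordinates in the range 0..M and uses an index into the sorted leaf list in place of a pointer; the names `build_predecessor_table`, `leaves` and `table` are hypothetical.

```python
def build_predecessor_table(xs, M):
    """table[v] = index of the leaf with the largest coordinate <= v, or -1 if none exists."""
    leaves = sorted(xs)          # leaves in increasing coordinate order
    table = [-1] * (M + 1)       # one slot per grid value 0..M
    j = -1                       # index of the current predecessor leaf
    for v in range(M + 1):
        while j + 1 < len(leaves) and leaves[j + 1] <= v:
            j += 1
        table[v] = j
    return leaves, table


leaves, table = build_predecessor_table([9, 2, 5], M=10)
print(leaves[table[7]])  # 5, the predecessor of 7 among the stored coordinates
```

The table is built in a single pass over the grid and then answers each predecessor lookup with one array access, at the cost of space proportional to the grid side.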

Having concluded the description of the data structure, we move to the skyline query. Assume we want to compute the skyline in the query range q=[a,b]×(-∞,d]. The procedure to compute the points on the skyline is the following:

  1. We use the array to find the two leaves and of for the search paths and respectively. Let be the LCA of leaves and and let be its depth.

  2. The predecessor of is located in and and let these predecessors be at positions and respectively. In addition, let be the node that has the following property: the -coordinate of point belongs in the range and it has the largest -coordinate (the -coordinate of falls in the range because of step 1) among all nodes in and . This means that node is the rightmost node that has a point with -coordinate within the range .

  3. By executing an h-RMQ in and arrays for the range and node is located. The subtree stores the point (which surely exists) with the maximum -coordinate among all points in the query range . By executing a predecessor query for in returning the result , and then making an h-RMQ in for the range , we find and report the required point with the maximum -coordinate that belongs to the skyline (recall that we use MAX-X and MIN-Y semantics).

  4. The query range now becomes .

  5. We repeat the previous steps until .
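The overall shape of the loop above (find the point with the maximum x-coordinate in the current 3-sided range, report it, shrink the range, repeat) can be sketched as follows. This is not the MLR-tree itself: the constant-time lookups through PAM and h-RMQ structures are replaced here by a naive linear scan, and the way the range shrinks (to x < p.x and y < p.y on an integer grid) is an assumption made for this sketch under MAX-X, MIN-Y semantics. The name `range_skyline` is hypothetical.

```python
def range_skyline(points, a, b, d):
    """Skyline (MAX-X, MIN-Y) of the points inside [a, b] x (-inf, d], in decreasing x order."""
    result = []
    while True:
        # Naive stand-in for the tree lookup: all points in the current 3-sided range.
        candidates = [p for p in points if a <= p[0] <= b and p[1] <= d]
        if not candidates:
            break
        # Point with the maximum x-coordinate; equal-x ties go to the smaller y.
        p = max(candidates, key=lambda pt: (pt[0], -pt[1]))
        result.append(p)
        # Assumed shrink step on an integer grid: later answers need x < p.x and y < p.y.
        b, d = p[0] - 1, p[1] - 1
    return result


pts = [(1, 1), (2, 6), (3, 2), (4, 7), (5, 4), (6, 3), (7, 8), (8, 5), (9, 9)]
print(range_skyline(pts, a=2, b=8, d=7))  # [(8, 5), (6, 3), (3, 2)]
```

Each iteration reports exactly one skyline point, which is why the query cost of the real structure is proportional to the answer size times the cost of the per-iteration lookups.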

Before moving to the analysis of the data structure we need to prove its correctness with respect to the skyline range query.

Theorem 2: The skyline range query correctly returns the skyline within the range q=[a,b]×(-∞,d].

Proof.

We prove by induction that the query algorithm returns the points of the skyline in decreasing order with respect to -coordinate. The first time that the algorithm is executed, the point on the skyline with the largest -coordinate is returned. To prove this statement, assume that some other point with largest -coordinate is returned. This means that this point should be in a subtree rooted not at but at a different node. However, this is impossible since is the rightmost subtree whose point with minimum -coordinate is in the range . For the same reason, will be located correctly in the h-RMQ. As a result, the point on the skyline with the largest -coordinate is correctly located and reported first. Assume that some points of the skyline have already been reported. We have to show the following: a) the points considered in the current loop, namely those in and whose -coordinate is in the range , are the ones we must consider and only these, and b) the reported point on the skyline has the largest -coordinate among all points in the new query range. For the second part, a discussion similar to that of the base case applies. For the first part, it is enough to note that all points that are dominated by the reported skyline points are not considered since the query range has changed. ∎

Let be the required space for elements for the Predecessor Access Method (PAM) and let be the time complexity for a predecessor query. Finally, let be the time complexity for the construction of the PAM on elements. Building tree is performed in a bottom-up manner. In particular, tree as well as the respective points within the internal nodes can be built in time, since we have to sort the points with respect to the -coordinate. Arrays , for all internal nodes , are constructed in a bottom-up manner by merging the two arrays of the children, which are already sorted with respect to -coordinate, into one array in their father in time linear in their size. Note that the elements are copied and the arrays of the children are not destroyed. Then, we construct the h-RMQ structure in linear time as well as the PAM in time for node . This can be carried out in time since elements are processed at each level of the tree as well as in time for the PAM of each structure. Finally, sequences and , where , for all leaves , can be constructed one by one in . This is because, for each leaf among the leaves in total, we construct such sequences each of which has size . Each such sequence must be structured with a Predecessor Access Method (PAM) as well as an h-RMQ structure. In this particular case we choose to use -heaps [23] as a PAM due to the small size of the sets and their linear time construction. The total time to construct the data structure on elements is .
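The bottom-up merging of the children's sorted arrays into their father can be sketched as below. The sketch assumes a balanced binary tree over the points sorted by x, identifies each node by the half-open index interval of the leaves below it, and omits the PAM and h-RMQ structures that the text attaches to each array. The name `build_y_sorted_arrays` is hypothetical; `heapq.merge` is used as the linear-time merge.

```python
from heapq import merge


def build_y_sorted_arrays(points):
    """Map each tree node (lo, hi) to the points of leaves lo..hi-1 sorted by y."""
    pts = sorted(points)  # leaves in increasing x order
    arrays = {}

    def build(lo, hi):
        if hi - lo == 1:
            arrays[(lo, hi)] = [pts[lo]]
        else:
            mid = (lo + hi) // 2
            left = build(lo, mid)
            right = build(mid, hi)
            # Post-order: both children are finished, so merge their y-sorted arrays.
            # The children's lists are copied into the father, not destroyed.
            arrays[(lo, hi)] = list(merge(left, right, key=lambda p: p[1]))
        return arrays[(lo, hi)]

    if pts:
        build(0, len(pts))
    return arrays


arrays = build_y_sorted_arrays([(1, 5), (2, 3), (3, 4), (4, 1)])
print(arrays[(0, 4)])  # [(4, 1), (2, 3), (3, 4), (1, 5)]
print(arrays[(0, 2)])  # [(2, 3), (1, 5)]
```

Every level of the tree holds each point once, so the merging work per level is linear in the number of points.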

Recall that for each point of the SKY(P) set, we execute in total three predecessor queries (two of them in Step 2 and one in Step 3). Since all other steps can be carried out in time, the total time complexity of the query algorithm is . The space complexity of the MLR-tree is dominated by the space used for implementing the , and sets as well as by the array , which is as implied by the discussion in the previous paragraph.

4.2 The Main Memory Static Linear-Space MLR-tree

We can reduce the space of the data structure described in 4.1 by using pruning techniques as in [24, 25]. However, pruning alone does not reduce the space to linear. We can get a better space complexity by recursive pruning until reaching a tree of constant size, but it will still be superlinear by an iterated logarithm factor (aggravating by a similar multiplicative term the time complexity of the query). The iterated logarithm, written log*, is equal to the number of times the logarithm must be iteratively applied before the result is at most 1 for the first time. To get an optimal space bound we use a combination of pruning and table lookup, which ends the recursion prematurely.
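To give a feeling for how slowly the iterated logarithm grows, here is a small worked sketch. It assumes base-2 logarithms, which the text does not specify; the name `log_star` is hypothetical.

```python
import math


def log_star(n):
    """Number of times log2 must be applied to n before the result is at most 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count


print(log_star(16))      # 3, via 16 -> 4 -> 2 -> 1
print(log_star(65536))   # 4, via 65536 -> 16 -> 4 -> 2 -> 1
```

Even for inputs far beyond any practical dataset size the value stays in the single digits, so the extra factor is small but still not constant.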

The pruning method is as follows: consider the nodes of with height . These nodes are roots of subtrees of of size and there are such nodes. Let be the tree whose leaves are these nodes and let be the subtrees of these nodes for . We call the first layer of the structure and the subtrees the second layer.

Figure 3: The static linear-space MLR-tree.

and each subtree is implemented as an MLR-tree. The representative of each tree is the point with the minimum -coordinate among all points in . The leaves of contain only the representatives of the respective trees . Each tree is further pruned at height resulting in trees with elements. Once again, contains the representatives of the third layer trees in a similar way as before. Each tree is structured as a table which stores all possible precomputed solutions. In particular, each is structured by using a PAM with respect to -coordinate as well as with respect to -coordinate (two different structures in total). In this way, we can extract the position of the predecessor in with respect to and coordinates. What is needed to be computed for is the point with the maximum -coordinate that lies within a -sided range region. To accomplish this, we use precomputation and tabulation for all possible results.

For the sake of generality, assume that the size of is . Let the points in be sorted by -coordinate. Let their rank according to -coordinate be given by the function . Clearly, function may generate all possible permutations of the points. We make a four-dimensional table ANS, which is indexed by the number of permutations (one dimension with choices) as well as the possible positions of the predecessor (3 dimensions with choices for the 3-sided range). Each cell of array ANS contains the position of the point with the maximum -coordinate for a given permutation that corresponds to a tree and the 3-sided range. Each tree corresponds to a permutation index that indexes one dimension of table ANS. The other indices are generated by predecessor queries on the -coordinate and one predecessor query on the coordinate. The size of ANS is and obviously it is common for all trees in the third layer of the MLR-tree. To build it, we proceed as follows:

We attach a unique label in the range to each one of the permutations corresponding to the respective index in array ANS. This label is constructed by enumerating systematically all permutations and keeping them in an array of labels. Each label is represented by bits. Each tree in the third layer is attached with such a label based on the permutation generated by the -coordinates of its points. This is the only step in the building process that requires knowledge of the trees . Then, we compute for every permutation and for every possible combination of the three predecessors the rank of the point with the maximum -coordinate. This can be done by a single scan of the permutation for each possible combination of the predecessor queries.
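The tabulation can be sketched for a tiny tree size as follows. The sketch assumes that the points of a third-layer tree are sorted by x, that a permutation gives the y-rank of each point, and that a 3-sided range is described by two positions in x order and one cap on the y-rank; under these assumptions the point with the maximum x-coordinate is simply the largest qualifying index. The names `build_ans`, `labels` and `ans` are hypothetical, and the table has s!·s³ cells in this sketch.

```python
from itertools import permutations


def build_ans(s):
    """ans[label][lo][hi][cap] = largest index i in lo..hi with y-rank <= cap, or -1 if none."""
    labels = {}   # permutation of y-ranks -> label used to index the table
    ans = []
    for label, perm in enumerate(permutations(range(s))):
        labels[perm] = label
        table = [[[-1] * s for _ in range(s)] for _ in range(s)]
        for lo in range(s):
            for hi in range(lo, s):
                for cap in range(s):
                    # Single scan of the permutation, from the right end of the x-range.
                    for i in range(hi, lo - 1, -1):
                        if perm[i] <= cap:
                            table[lo][hi][cap] = i
                            break
        ans.append(table)
    return labels, ans


labels, ans = build_ans(3)
tree_perm = (2, 0, 1)            # y-ranks of the three points, listed in x order
label = labels[tree_perm]        # the only step that looks at a concrete tree
print(ans[label][0][2][0])       # 1: only the point at index 1 has y-rank <= 0
print(ans[label][0][2][1])       # 2: index 2 has y-rank 1 and the largest x
```

Because the table depends only on the permutation and not on actual coordinates, a single copy can be shared by all trees of the same size.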

Although the skyline query changes to incorporate the division of the structure into layers, these changes are not extensive. Let be the initial range query. To answer this query on the three layered structure we access the layer trees containing and by using the array. Then, we locate the subtrees and containing the representative leaves of the accessed layer trees. The roots of these subtrees are leaves of . The MLR query algorithm described in 4.1 is executed on with these leaves as arguments. Once we reach the node with the maximum -coordinate, we continue in the layer tree corresponding to the representative with the maximum -coordinate located in . The same query algorithm is executed on this layer tree and then we move similarly to a tree in the third layer. We make three predecessor queries for , , and in and we use the ANS table to locate the point with the maximum -coordinate by retrieving the permutation index of . Let the point be the desired point at the third layer. We go back to . The range query now becomes and iterate as described in 4.1.

The total space required for the data structure depends on the size of each of the three layers. For the first layer, the MLR-tree on the representatives requires linear space for the leaf structures (all structures for each leaf are structured as -heaps and h-RMQ structures requiring linear space). For the structures, the total space needed is . The second layer consists of trees with representative points of the third layer each. Since each one of these trees is itself an MLR-tree its size is . For the structures for each tree in the second layer we need space. In total, the space for the second layer is . In the third layer, we use linear space for the two predecessor data structures (-heaps) as well as a table of size , which is . The construction time of the data structure can be similarly derived taking into account that the ANS table can be constructed in time. As for the query, we get an number of predecessor queries per iteration, in which iteration we report a point on the skyline. The following theorem summarizes the result (note that is the time needed to sort a list of elements):

Given a set of points on the 2-d grid , we can store them in a static main memory data structure that can be constructed in
time using space. It supports skyline queries in a -sided range in worst-case time, where is the answer size.

4.2.1 A Note on External Memory

The result can be easily extended to external memory as well. The base tree is a static -tree, where is the size of the block. One change to the structure is related to the definition of and . In particular, (and similarly ) correspond to the node with the minimum -coordinate among all nodes that are children of the nodes in and are to the left of a node in among all children of the father of that also belongs to . This means that may contain nodes and each leaf may have such lists. Another change is related to the level trees. We make the assumption that , which means that a level tree can be easily stored in blocks of size and as a result there is no need to use tabulation. See Figure 3 for a depiction of the tree. To get a feeling of the problem size that would violate this assumption, we get that when , which even for small values of block size, like , we get that a level tree can be stored in a block when , which is a number much larger than a googol (). The changes in the query algorithm are insignificant and mainly related to the change of the definition of and for all and .

The following theorem is an easy extension of Theorem 4.2 for external memory.

Given a set of points on the 2-d grid , we can store them in a static external memory data structure that can be constructed in
using
space. It supports skyline queries in a -sided range in I/Os, where is the answer size.

4.2.2 Results for the Static Case in Main Memory and External Memory

Applying Theorem 4.2 to various implementations of PAMs in main memory, we get different results, summarized in the following:

  • Binary Trees: Assuming that the PAM is a simple binary tree, the MLR tree uses space, can be constructed in time (by merging the sorted lists in linear time in a new sorted list with respect to the -coordinate) and has a query time of .

  • van Emde Boas trees [26]: Assuming that the PAM is a van Emde Boas tree, the MLR tree uses space, can be constructed in time and has a query time of .

  • IS-tree [20] (Random Input): Assuming that the PAM is an Interpolation Search Tree and that the elements are drawn from a -smooth distribution, where is a constant, then the MLR tree uses space, can be constructed in time and has an expected query time of .

  • RSA [21] (Random Input): Assume that the PAM is the Random Search Array and that the elements are drawn from a -input distribution, where and (vastly larger than the family of distributions for the IS-tree). The MLR tree uses space, can be constructed in time and has an expected query time of with high probability, where is an arbitrarily chosen constant.

  • BPQ [7] (Batched Predecessor Queries): Assume that the PAM is the Data Structure presented in [7], that answers batched predecessor queries in amortized time. In particular, supports queries in time per query and requires space for any . In this case, the MLR-tree uses exponential space and supports batched skyline queries in optimal amortized time.
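The main memory results above differ only in which PAM is plugged into the MLR-tree. As a minimal sketch of the interface a PAM has to offer, the following uses a sorted array with binary search as a stand-in for the binary tree case (logarithmic time per predecessor query). The class and method names are hypothetical and do not come from the paper.

```python
from bisect import bisect_right


class SortedArrayPAM:
    """Stand-in PAM (assumption): predecessor queries by binary search."""

    def __init__(self, keys):
        # Store the distinct keys in increasing order.
        self.keys = sorted(set(keys))

    def predecessor(self, q):
        """Return the largest stored key <= q, or None if there is none."""
        i = bisect_right(self.keys, q)
        return self.keys[i - 1] if i > 0 else None


pam = SortedArrayPAM([3, 8, 15, 23, 42])
print(pam.predecessor(20))
```

Swapping in a faster PAM changes only the cost of `predecessor`, which is the factor that multiplies the answer size in the query bounds.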

Similarly, applying Theorem 4.2.1 for various implementations of PAMs in external memory we get the following results.

  • -trees: Assuming that the PAM is a simple binary tree, the MLR tree uses space, can be constructed in time and has a query time of .

  • -trees: Assuming that the PAM is an -tree [18] for discrete distributions as indicated by [20], and assuming that the coordinates of the points are generated by a smooth discrete distribution for each dimension independently, the MLR tree uses space, can be constructed in time and has a query time of . For a smaller set of distributions, the query time can be reduced to (see [20]).

4.2.3 The -sided Skyline Problem is at least as hard as the Predecessor Problem

Our approach makes explicit that the main bottleneck in the -sided skyline problem is the predecessor problem. We now show that this is not an artifact of our approach: the -sided skyline problem is in fact at least as difficult as the predecessor problem. This means that we can only hope for bounds that resemble those of the predecessor problem, and not for better ones. In the following, we show how the predecessor problem can be solved efficiently by means of the -sided skyline problem, implying that the lower bounds of the predecessor problem apply here as well. Note that this is folklore knowledge; we provide it for the sake of completeness and because our approach depends explicitly and heavily on the predecessor problem.

Assume a sorted sequence of integer elements chosen from the range . We construct in time a set of points in two dimensions. Assume that we use MAX-MAX semantics. Let a predecessor query on , where an arbitrary integer and let the answer of the query be . We make the -sided skyline query with range . This means that there is no restriction on the -coordinate and we only wish to find the skyline of all points that have -coordinate .

We argue that the result of this particular -sided skyline query is the point . Indeed, by construction, each point dominates all points , such that , which means that for all -sided ranges the skyline consists of at most one point. Since all points satisfy the restriction on the -coordinate, we must consider only the points with -coordinate . The point with the largest -coordinate and with -coordinate is . This point dominates all other such points, so it is the only point on this skyline. Consequently, we have answered the predecessor query as well. Since the bounds of our approach are similar to the optimal bounds of the predecessor problem, we can state that our solutions are optimal.
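The reduction can be checked on a small instance. The sketch below assumes that each element x of the sorted sequence is mapped to the point (x, x), which is one construction under which every point dominates all earlier ones with MAX-MAX semantics; the exact construction is an assumption here. The skyline is computed by brute force, not by the MLR-tree, and all function names are hypothetical.

```python
def dominates(p, q):
    """MAX-MAX semantics: p is at least as large as q in both
    coordinates and differs from q."""
    return p[0] >= q[0] and p[1] >= q[1] and p != q


def skyline_x_at_most(points, q):
    """Brute-force skyline of the points with x <= q (no bound on y)."""
    cand = [p for p in points if p[0] <= q]
    return [p for p in cand if not any(dominates(o, p) for o in cand)]


def predecessor_via_skyline(xs, q):
    """Answer a predecessor query on xs with one skyline query."""
    points = [(x, x) for x in xs]  # assumed construction
    sky = skyline_x_at_most(points, q)
    # The skyline has at most one point; its x-coordinate is the answer.
    return sky[0][0] if sky else None


xs = [2, 5, 9, 14]
print(predecessor_via_skyline(xs, 10))
```

Any data structure that answers this restricted skyline query therefore also answers predecessor queries on the original sequence.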

5 The Dynamic MLR-tree

Making the layered MLR-tree described in 4.2 dynamic involves all layers. The following issues must be tackled: 1. a dynamic tree structure must be used, with care taken over how rebalancing operations are performed; 2. the layer trees must have variable size within a predefined range and must be rebuilt appropriately as soon as they violate this bound (by splitting or merging/sharing with adjacent trees); similarly, the permutation index must be defined so as to allow for variable-length permutations; and 3. all arrays attached to nodes or leaves, as well as array , must be updated efficiently.

To begin with, global rebuilding [27] is used in order to maintain the structure. In particular, let be the number of elements stored at the time of the latest reconstruction. Once the number of updates since that time exceeds , where is a constant, the whole data structure is reconstructed, taking into account that the number of elements is . In this way, it is guaranteed that the current number of elements is always within the range . We call the time between two successive reconstructions an epoch. The tree structure used for the first two layers is a weight-balanced tree, like the -trees [29] or the weight-balanced -trees [28]. In the latter case, the tree is not binary and the definition of lists and is extended, analogously to the external memory static MLR-tree, in order to take into account the appropriate nodes.
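The epoch mechanism can be illustrated with a small counter-based wrapper. This is a sketch under assumptions: the rebuild threshold is taken to be a constant fraction `c` of the size at the last reconstruction, a sorted list stands in for the rebuilt structure, and the class and attribute names are hypothetical. It shows only when a reconstruction is triggered, not how updates are served within an epoch.

```python
class GloballyRebuilt:
    """Counts updates and rebuilds once they exceed c * n0 (assumed rule)."""

    def __init__(self, items, c=0.5):
        self.c = c
        self.rebuilds = 0
        self._rebuild(set(items))

    def _rebuild(self, items):
        # Start a new epoch: remember the size and reset the counter.
        self.items = items
        self.n0 = len(items)
        self.updates = 0
        self.rebuilds += 1
        self.static = sorted(items)  # stand-in for the rebuilt structure

    def _after_update(self):
        self.updates += 1
        if self.updates > self.c * self.n0:
            self._rebuild(self.items)

    def insert(self, x):
        self.items.add(x)
        self._after_update()

    def delete(self, x):
        self.items.discard(x)
        self._after_update()


s = GloballyRebuilt(range(100), c=0.5)
for i in range(100, 150):
    s.insert(i)
print(s.rebuilds, s.n0)
```

With this rule the number of stored elements stays within a constant factor of `n0` throughout an epoch, which is what allows size-dependent parameters to be fixed at reconstruction time.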

Henceforth, assume for brevity that . We impose that all trees at layer will have size within the range . To compute the permutation index, if the size of the layer tree is , then we pad the increasing sequence of elements in the tree with values in order to have exactly size (alternatively, we could count also the number of subsets of size in the range increasing the size of the table ANS but not exceeding the bound). In addition, the array that indexes the leaves is structured with a PAM since it must be dynamic as well.

Assume that an update operation takes place. The following discussion concerns the case of inserting a new point since the case of deleting an existing point from the structure is symmetric. First, is used to locate the predecessor of , and in particular to locate the tree of layer that contains the predecessor of . Array is updated accordingly. The predecessor of in is located by using the respective -heap. If , then and are inserted in the respective -heaps. If , then is split into two trees with size approximately . This means that new -heaps must be constructed while two new permutation indices must be computed for the two new trees. Let be the layer tree that gets the new leaf. Note that is affected either structurally, when one of its leaves at layer splits as in this case ( is ) or it is affected without structural changes, when is minimum among all the -coordinates of and thus the representative of changes. In the latter case, all structures on the path of must be updated with the new point. In addition, let be the highest node with height in that has (the point with the minimum -coordinate in its subtree changes to ). Then, for all leaves in the subtree of , the -heaps for and as well as the h-RMQ structures are updated, given that . In the former case, we make rebalancing operations on the internal nodes of on the path . These rebalancing operations result in changing as in the previous case the -heaps for the and while the respective structures of the node that is rebalanced have to be recomputed. Similar changes happen to the tree of the first layer given that either a tree of the second layer splits or its minimum element is updated. In case of deleting , the layers of the MLR-tree are handled similarly.
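The insertion path at the lowest layer (locate the tree holding the predecessor, insert, split on overflow) can be sketched as follows. This is a deliberate simplification: plain sorted lists stand in for the layer trees, only one coordinate is tracked, and the function name and the `max_size` parameter are hypothetical. The -heaps, the permutation indices, and the propagation to the upper layers are omitted.

```python
from bisect import bisect_right, insort


def insert_with_split(buckets, x, max_size):
    """Insert x into the bucket holding its predecessor; split the
    bucket into two halves if it grows beyond max_size.

    buckets: list of non-empty sorted lists covering increasing ranges.
    """
    firsts = [b[0] for b in buckets]
    # Bucket whose first key is the predecessor of x (bucket 0 if none).
    i = max(bisect_right(firsts, x) - 1, 0)
    insort(buckets[i], x)
    if len(buckets[i]) > max_size:
        b = buckets[i]
        mid = len(b) // 2
        buckets[i:i + 1] = [b[:mid], b[mid:]]  # two buckets of about half size
    return buckets


buckets = [[1, 3, 5], [10, 12]]
insert_with_split(buckets, 4, max_size=3)
print(buckets)
```

A split changes the set of leaves seen by the layer above, which is the structural case discussed in the text; an insertion that only changes a bucket's minimum corresponds to the non-structural case.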

In the following discussion assume that the time complexity of the update operation supported by the PAM on elements is . The change of the point with the minimum -coordinate can always propagate from to the root of . can be updated in time since the two updates in -heaps cost while the computation of the permutation index costs . Let the respective tree in the second layer be . Then, the cost for changing the point with the minimum -coordinate in each node on the path from the leaf to the root of is related to the update cost for the and lists as well as for the structures. In particular, all lists and are updated (deletion of the previous point and insertion of the new one in a -heap) in time. Similarly, a deletion and an insertion is carried out in each structure in total time. The same holds for the tree getting a total complexity of .

Rebalancing operations on the level trees as well as on the level tree of the structure may be applied when splits or fusions of leaves of level trees take place. Since level trees are exponentially smaller than the level tree and they are the same, the cost is dominated by the rebalancing operations at . Assume an update operation at a leaf of . In the worst case, each structure may have to be reconstructed and, similarly to the previous paragraph, the and structures need to be updated. The total cost is equal to for the lists while it is for the structures, since the reconstruction of the structure of the root dominates the cost. One can similarly reason for level trees. However, the amortized cost is much lower for two reasons: 1. there is an update at a leaf of roughly every update operations, and 2. the weight property of the tree structures guarantees that costly operations are rare. By using a standard weight property argument combined with the above two reasons, we get that the amortized rebalancing cost is . This amortized cost is dominated by the cost to update the minimum element, in which case the worst-case and the amortized cost coincide. The following theorem summarizes the result:

Given a set of points on the 2-d grid , we can store them in a dynamic main memory data structure that uses
space and supports update operations in
in the worst case. It supports skyline queries in a -sided range in worst-case time, where is the answer size.

The inefficiency of the update operations is overwhelming. Although rebalancing operations are efficient in an amortized sense, the change of minimum depends on the user and in principle this change can propagate to the root in each update operation. In the following, we overcome this problem by making a rather strong assumption about the distribution of the points.

5.1 Exploiting the Distribution of the Elements

To reduce the huge worst-case update cost of Theorem 5 we have to tackle the propagation of minimum elements. Assume that a new point is to be inserted in the MLR-tree. Let be stored in level tree according to . We call the point violating if is the minimum -coordinate among all -coordinates of the points in . When a new point is violating, an update operation must be performed on . In the following, we show that, under assumptions on the generating distributions of the and coordinates of the points, only violations will happen during an epoch (recall that an epoch is the time between two successive reconstructions of the structure, defined by the update operations). We provide only a sketch of the structure since it is an easy adaptation of the probabilistic results of [20, 5].

We assume that all points have their coordinate generated by the same discrete distribution that is -smooth, where and are constants. We also assume that the coordinates of all points are generated by a restricted set of discrete distributions , independently of the distribution of the coordinate. We later show the properties that must have and provide specific examples. Finally, we assume that deletions are equiprobable for each existing point in the structure. In a nutshell, the structure requires that during an epoch tree remains intact and only level and level trees are updated. All violating points are stored explicitly and since they are only a few during an epoch, we can easily support the query operation. After the end of the epoch, the new structure has no violating points stored explicitly.

The construction of the static tree now follows the lines of [20]. Without going into details, assume that the coordinates are in the range . Then, this range is recursively divided into subranges. The terminating condition for the recursion is when a subrange has elements. Note that the bounds of these subranges do not depend on the stored elements but only on the properties of the distribution. This construction is necessary to ensure certain probabilistic properties for discrete distributions. However, instead of building an interpolation search tree, we build a binary tree on these subranges and then continue building the lists of the leaves and the internal nodes as in the previous structures. The elements within each subrange correspond to a level tree whose leaves are level trees. Theorem 1 and Lemma 2 of [20] imply the following theorem with respect to each epoch:

The construction of the terminating subranges defining the level trees can be performed in time in expectation with high probability. Each level tree has points in expectation with high probability during an epoch.

The above theorem guarantees that the size of the buckets is not expected to change considerably and as a result we are allowed to assume that no update operations will happen on . This is the result of assuming that the coordinates of the points inserted are generated by an