
An Efficient Skyline Computation Framework

08/12/2019
by   Rui Liu, et al.
University of Tours

Skyline computation aims at finding the set of tuples that are not worse than any other tuple in all dimensions of a multidimensional database. In this paper, we present SDI (Skyline on Dimension Index), a general framework for skyline computation driven by dimension indexing. We prove that, to determine whether a tuple belongs to the skyline, it is enough to compare this tuple with a bounded subset of the skyline tuples in an arbitrary dimensional index, rather than with all existing skyline tuples. Based on SDI, we also show that any skyline tuple can be used to stop the whole skyline computation process while still outputting the complete set of all skyline tuples. We develop an efficient algorithm, SDI-RS, that significantly reduces the skyline computation time and whose space and time complexity can be guaranteed. Our experimental evaluation shows that SDI-RS outperforms the baseline algorithms in general and is especially efficient on high-dimensional data.


1. Introduction

Skyline computation aims at finding the set of tuples that are not worse than any other tuple in all dimensions, with respect to given criteria, in a multidimensional database. The formal concept of skyline was first proposed in 2001 by extending SQL queries to find interesting tuples with respect to multiple criteria (Borzsony et al., 2001), with the notion of dominance: we say that a tuple $p$ dominates another tuple $q$ if and only if for each dimension, the value in $p$ is better than the respective value in $q$. The predicate better can be defined by any total order, such as less than or greater than.

Figure 1. The hotel Skyline on distance and price.

For instance, Figure 1 shows the most frequently cited example in the skyline literature, which considers the prices of hotels with respect to their distances to the city center (sometimes to the beach, or to the railway station, etc.). If we are interested in hotels that are not only cheap but also close to the city center (smaller values are better), the hotels on the lower-left frontier of Figure 1 constitute the skyline. A hotel dominates another hotel when it is better in both distance and price; however, two hotels do not dominate each other when one is better in price but the other is better in distance. In real-world database and user-centric applications, such price-distance style queries are undoubtedly interesting and useful, and have been widely recognized.

Since the BNL algorithm was first proposed (Borzsony et al., 2001), the skyline computation problem has been studied in depth for about two decades and many algorithms have been developed to compute the skyline, such as Bitmap/Index (Tan et al., 2001), NN (Kossmann et al., 2002), BBS (Papadias et al., 2005), SFS (Chomicki et al., 2005), LESS (Godfrey et al., 2005), SaLSa (Bartolini et al., 2006), SUBSKY (Tao et al., 2007), ZSearch (Lee et al., 2007), and ZINC (Liu and Chan, 2010). However, the efficiency of existing algorithms often depends on either complex data structures or specific application/data settings. For instance, Bitmap is based on a bitmap representation of dimensional values but is limited by their cardinality; Index is integrated into a B-tree construction process; NN and BBS rely on a specific data structure, the R-tree, and NN moreover has difficulty handling high-dimensional data (see for instance (Tan et al., 2001)); SUBSKY specifically requires the B-tree and tuning of the number of anchors; ZSearch and ZINC are built on top of the ZB-tree. On the other hand, besides parallel skyline computation (Chester et al., 2015), different variants of standard skylines have been defined and studied, such as top-k skylines (Tao et al., 2007), streaming data skylines (Lin et al., 2005), partially ordered skylines (Liu and Chan, 2010), etc., which are outside our scope.

In this paper, we present the SDI (Skyline on Dimension Index) framework that allows efficient skyline computation by indexing dimensional values. We first introduce the notion of dimensional index, based on which we prove that in order to determine whether a tuple belongs to the skyline, it is enough to compare it only with the existing skyline tuples present in any one dimensional index instead of comparing it with all existing skyline tuples. Furthermore, within the context of dimension indexing, we show that in most cases, one comparison instead of two is enough to confirm the dominance relation. These properties can significantly reduce the total count of dominance comparisons, which is usually the bottleneck of skyline computation. Different from all existing sorting/indexing based skyline algorithms, the application of dimension indexing allows skyline computation to be extended to any totally ordered categorical data such as, for instance, user preferences on colors and on forms. Based on SDI, we also prove that any skyline tuple can be used to define a stop line crossing the dimensional indexes to terminate the skyline computation, which is particularly efficient on correlated data. We therefore develop the algorithm SDI-RS (RangeSearch) for efficient skyline computation with dimension indexing. Our experimental evaluation shows that SDI-RS outperforms our baseline algorithms (BNL, SFS, and SaLSa) in general, especially on high-dimensional data.

The remainder of this paper is organized as follows. Section 2 reviews related skyline computation approaches. In Section 3, we present our dimension indexing framework and prove several important properties, based on which we propose the algorithm SDI-RS in Section 4. Section 5 reports our experimental evaluation of the performance of SDI-RS in comparison with several baseline benchmarks. Finally, we conclude in Section 6.

2. Related Work

In this section, we briefly introduce mainstream skyline computation algorithms.

Börzsönyi et al. (Borzsony et al., 2001) first proposed the concept of skyline and several basic computation algorithms, of which Nested Loop (NL) is the most straightforward: it compares each pair of tuples, and therefore always has the same time complexity $O(n^2)$ regardless of the distribution of the data. Built on top of the naive NL algorithm, the Block Nested Loop (BNL) algorithm employs a memory window to improve efficiency significantly; its best-case complexity is reduced to $O(n)$ when no temporary file is generated during the process, whereas the worst case remains $O(n^2)$, for example when all tuples in the database are incomparable with each other.
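To make the window mechanism concrete, the following is a minimal in-memory sketch of the BNL idea. It assumes that the window never overflows (so no temporary file is needed) and that smaller values are better. The hotel values and the names `dominates` and `bnl` are illustrative assumptions, not data or code from the paper.

```python
# Hypothetical (distance, price) pairs; smaller is better in both dimensions.
HOTELS = [(3.0, 70), (2.0, 60), (1.0, 90), (4.0, 40), (0.5, 150), (2.5, 95)]


def dominates(p, q):
    """p dominates q: not worse in every dimension, strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def bnl(data):
    """Block-nested-loop skyline with an unbounded in-memory window."""
    window = []  # IDs of the tuples that are currently undominated
    for tid, p in enumerate(data):
        # First comparison: is the incoming tuple dominated by the window?
        if any(dominates(data[w], p) for w in window):
            continue
        # Second comparison: does the incoming tuple evict window tuples?
        window = [w for w in window if not dominates(p, data[w])]
        window.append(tid)
    return window


print(bnl(HOTELS))  # IDs of the skyline hotels
```

Note that every incoming tuple may be compared with the window in both directions, and that window tuples can be evicted later; both costs are what the dimension indexing approach presented in this paper tries to avoid.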

Bitmap and Index (Tan et al., 2001) are two efficient algorithms for skyline computation. Bitmap-based skyline computation is very efficient; however, it is restricted to databases with a limited number of distinct values in each dimension, and it incurs a high I/O cost and requires a large amount of memory when the database is huge. Index organizes tuples into index tables according to the dimension in which each tuple has its best value. Skyline tuples are more likely to be at the top of each index table, so a tuple can be pruned if its minimum value over all dimensions is larger than the maximum value over all dimensions of another tuple.

Sort-Filter-Skyline (SFS) (Chomicki et al., 2005) and Sort and Limit Skyline algorithm (SaLSa) (Bartolini et al., 2006) are two other presorting-based algorithms. SFS follows a process similar to BNL but presorts tuples based on the skyline criteria before reading them into the window. SaLSa shares the idea of presorting tuples with SFS, but the two algorithms use different approaches to optimize the comparison passes: SFS uses an entropy function to estimate the probability of a tuple being in the skyline, whereas SaLSa uses a stop point. Indeed, SaLSa is designed on top of the following observation: if a skyline tuple dominates all unread tuples, then the skyline computation can be terminated. Such a special tuple is called the stop point in SaLSa, and it can effectively prune irrelevant tuples that cannot be in the skyline. However, the selection of the stop point depends on dominance comparisons, which is completely different from our notion of stop line, which is determined by dimensional indexes without dominance comparison.

The SUBSKY algorithm (Tao et al., 2007) converts $d$-dimensional tuples into 1D values so that all tuples can be sorted on these values, which helps to determine whether a tuple is dominated by a skyline tuple. SUBSKY sorts the whole database on the full space but computes the skyline on a subspace chosen by user criteria. Nevertheless, the full-space index may not be accurate when pruning data, as the index may be calculated on unrelated dimensions. SDI also supports skyline computation on a subspace, but without re-sorting tuples. Moreover, dimensional indexes guarantee the best sorting of the subspace and can prune more tuples.

Besides sorting-based algorithms, some algorithms solve the skyline computation problem using the R-tree structure, such as NN (Nearest Neighbors) (Kossmann et al., 2002) and BBS (Branch-and-Bound Skyline) (Papadias et al., 2005). NN exploits the relationship between nearest neighbors and skyline results. It observes that skyline tuples must be close to the coordinate origin: the tuple closest to the coordinate origin must be part of the skyline. Using this first skyline tuple, the database can be further split into several regions, and the first skyline tuple becomes the coordinate origin of these regions. The nearest points of each region are skyline tuples as well, so the whole process iterates until no more regions can be split. BBS uses an idea similar to NN. The main difference between them is that the NN process may include redundant searches whereas BBS needs only one traversal path. NN and BBS are both efficient but nevertheless rely on a complex data structure, which is not necessary for the SDI algorithm.

3. Dimension Indexing for Skyline Computation

We present in this section the SDI (Skyline on Dimension Index) framework, within which we prove several interesting properties that significantly reduce the total count of dominance comparisons during the skyline computation.

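Let $\mathcal{D}$ be a $d$-dimensional database that contains $n$ tuples; each tuple $p \in \mathcal{D}$ is a vector of $d$ attributes, with $|p| = d$. We denote by $p[i]$, for $1 \le i \le d$, the dimensional value of a tuple $p$ in dimension $D_i$ (in the rest of this paper, we consider by default that $i$ satisfies $1 \le i \le d$). Given a total order $\succ_i$ on all values in dimension $D_i$, we say that the value $p[i]$ of the tuple $p$ is better than the respective value $q[i]$ of the tuple $q$ if and only if $p[i] \succ_i q[i]$; if $p[i] = q[i]$, we say that $p[i]$ is equal to $q[i]$, so that $p[i]$ is not worse than $q[i]$ if and only if $p[i] \succ_i q[i]$ or $p[i] = q[i]$, denoted by $p[i] \succeq_i q[i]$. Besides, that $p[i]$ is not better than $q[i]$ is denoted by $p[i] \nsucc_i q[i]$. We have that $p[i] \nsucc_i q[i] \iff q[i] \succeq_i p[i]$. Without loss of generality, we denote by $\succ$ the ensemble of all total orders $\succ_i$ on all dimensions and, without confusion, write $p[i] \succ q[i]$ instead of $p[i] \succ_i q[i]$.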

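Definition (Dominance).

Given the total order $\succ$ and a database $\mathcal{D}$, a tuple $p \in \mathcal{D}$ dominates a tuple $q \in \mathcal{D}$ if and only if $p[i] \succeq q[i]$ on each dimension $D_i$, and $p[k] \succ q[k]$ for at least one dimension $D_k$, denoted by $p \succ q$.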

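Further, we write $p \nsucc q$ when $p$ does not dominate $q$, and we denote by $p \sim q$ that the tuple $p$ and the tuple $q$ are incomparable, that is, neither dominates the other. A tuple is a skyline tuple if and only if no tuple dominates it. We therefore formally define the skyline as follows.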

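Definition (Skyline).

Given the total order $\succ$ and a database $\mathcal{D}$, a tuple $p \in \mathcal{D}$ is a skyline tuple if and only if there is no $q \in \mathcal{D}$ such that $q \succ p$. The skyline $S$ on $\succ$ is the complete set of all skyline tuples, that is, $S = \{p \in \mathcal{D} \mid \nexists\, q \in \mathcal{D},\ q \succ p\}$.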

It is easy to see that the skyline $S$ of a database $\mathcal{D}$ is a set of mutually incomparable tuples in $\mathcal{D}$, that is, $p \sim q$ for any two tuples $p, q \in S$.

ID      $D_1$  $D_2$  $D_3$  $D_4$  $D_5$  $D_6$  Skyline
$p_0$   7.5    1.3    7.5    4.5    5.3    2.1    Yes
$p_1$   4.7    6.7    6.7    9.3    3.8    5.1    Yes
$p_2$   8.4    9.4    5.3    5.8    6.7    7.5    No
$p_3$   5.3    6.6    6.7    6.8    5.8    9.3    Yes
$p_4$   8.4    5.2    5.1    5.5    4.1    7.5    Yes
$p_5$   9.1    7.6    2.6    4.7    7.3    6.2    Yes
$p_6$   5.3    7.5    1.9    5.9    3.4    1.8    Yes
$p_7$   5.3    7.5    6.7    7.2    6.3    8.8    No
$p_8$   6.7    7.3    7.6    9.7    5.3    8.7    No
$p_9$   7.5    9.6    4.8    8.9    9.5    6.5    No
Table 1. A sample database with $d = 6$, $n = 10$, and $m = 6$.

Table 1 shows a sample database of 6 dimensions ($d = 6$) that contains 10 tuples ($n = 10$), of which 6 are skyline tuples ($|S| = 6$; we also denote the size of the skyline by $m$, following most of the literature), where the order less than is applied to all dimensions.

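Example.

Among all the 10 tuples listed in Table 1, $p_1 \succ p_8$, $p_4 \succ p_2$, $p_6 \succ p_7$, and $p_6 \succ p_9$; $p_0$, $p_3$, and $p_5$ do not dominate any tuples and are not dominated by any tuples. The skyline is therefore $S = \{p_0, p_1, p_3, p_4, p_5, p_6\}$. ∎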
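The dominance relations of this example can be checked mechanically. The sketch below is a naive quadratic check, not the method proposed in this paper; it assumes 0-based tuple IDs in the row order of Table 1 and the order less than on every dimension, and the names `dominates` and `naive_skyline` are illustrative.

```python
# Rows of Table 1; the list position is the tuple ID (0..9).
DATA = [
    (7.5, 1.3, 7.5, 4.5, 5.3, 2.1),
    (4.7, 6.7, 6.7, 9.3, 3.8, 5.1),
    (8.4, 9.4, 5.3, 5.8, 6.7, 7.5),
    (5.3, 6.6, 6.7, 6.8, 5.8, 9.3),
    (8.4, 5.2, 5.1, 5.5, 4.1, 7.5),
    (9.1, 7.6, 2.6, 4.7, 7.3, 6.2),
    (5.3, 7.5, 1.9, 5.9, 3.4, 1.8),
    (5.3, 7.5, 6.7, 7.2, 6.3, 8.8),
    (6.7, 7.3, 7.6, 9.7, 5.3, 8.7),
    (7.5, 9.6, 4.8, 8.9, 9.5, 6.5),
]


def dominates(p, q):
    """p dominates q: not worse in every dimension, strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def naive_skyline(data):
    """IDs of the tuples that no other tuple dominates (O(n^2) comparisons)."""
    return [i for i, q in enumerate(data)
            if not any(dominates(p, q) for p in data)]


skyline = naive_skyline(DATA)
# For every non-skyline tuple, list the IDs of the tuples that dominate it.
dominators = {i: [j for j, p in enumerate(DATA) if dominates(p, DATA[i])]
              for i in range(len(DATA)) if i not in skyline}

print(skyline)     # [0, 1, 3, 4, 5, 6]
print(dominators)  # {2: [4], 7: [6], 8: [1], 9: [6]}
```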

The basis of our approach is to build dimensional indexes with respect to the per-dimension total orders concerned, which allow the skyline to be determined without performing dominance comparisons against either all tuples in the database or all tuples in the current skyline. In general, our approach can significantly reduce the total number of dominance comparisons, which largely determines the total processing time of skyline computation. Furthermore, our approach constructs the skyline progressively, so no delete operation is required.

For each dimension $D_i$ of the database $\mathcal{D}$, the total order $\succ_i$ can be considered as a sorting function that produces $I_i$, an ordered list of all tuple values in the dimension $D_i$ of the database. We call such a list a dimensional index.

Definition (Dimensional Index).

Given a database $\mathcal{D}$, the dimensional index $I_i$ for a dimension $D_i$ is an ordered list of tuple IDs sorted first by dimensional values with respect to the total order $\succ_i$, and then, in case of ties, by their lexicographic order.

In order to avoid unnecessary confusion, we represent a dimensional index as a list of entries of the form value:ID, as shown in Table 2 (where all skyline tuples are marked with *).

$I_1$    $I_2$    $I_3$    $I_4$    $I_5$    $I_6$
4.7:1*   1.3:0*   1.9:6*   4.5:0*   3.4:6*   1.8:6*
5.3:3*   5.2:4*   2.6:5*   4.7:5*   3.8:1*   2.1:0*
5.3:6*   6.6:3*   4.8:9    5.5:4*   4.1:4*   5.1:1*
5.3:7    6.7:1*   5.1:4*   5.8:2    5.3:0*   6.2:5*
6.7:8    7.3:8    5.3:2    5.9:6*   5.3:8    6.5:9
7.5:0*   7.5:6*   6.7:1*   6.8:3*   5.8:3*   7.5:2
7.5:9    7.5:7    6.7:3*   7.2:7    6.3:7    7.5:4*
8.4:2    7.6:5*   6.7:7    8.9:9    6.7:2    8.7:8
8.4:4*   9.4:2    7.5:0*   9.3:1*   7.3:5*   8.8:7
9.1:5*   9.6:9    7.6:8    9.7:8    9.5:9    9.3:3*
Table 2. Dimension indexing of the sample database shown in Table 1.
Example.

Table 2 shows the 6 dimensional indexes with respect to all the 6 dimensions of the sample database shown in Table 1. In detail, in $I_1$, the dimensional value 5.3 appears in 3 tuples ($p_3$, $p_6$, and $p_7$), so these 3 entries are secondarily sorted by tuple IDs, that is, in the order 3, 6, 7. ∎
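Building such an index amounts to one sort per dimension. The following minimal sketch assumes 0-based tuple IDs, the order less than, and (value, ID) pairs as index entries; the name `build_index` is an illustrative assumption.

```python
# Rows of Table 1; the list position is the tuple ID (0..9).
DATA = [
    (7.5, 1.3, 7.5, 4.5, 5.3, 2.1),
    (4.7, 6.7, 6.7, 9.3, 3.8, 5.1),
    (8.4, 9.4, 5.3, 5.8, 6.7, 7.5),
    (5.3, 6.6, 6.7, 6.8, 5.8, 9.3),
    (8.4, 5.2, 5.1, 5.5, 4.1, 7.5),
    (9.1, 7.6, 2.6, 4.7, 7.3, 6.2),
    (5.3, 7.5, 1.9, 5.9, 3.4, 1.8),
    (5.3, 7.5, 6.7, 7.2, 6.3, 8.8),
    (6.7, 7.3, 7.6, 9.7, 5.3, 8.7),
    (7.5, 9.6, 4.8, 8.9, 9.5, 6.5),
]


def build_index(data, dim):
    """Entries (value, tuple ID) of one dimension, sorted by value, then by ID."""
    # Python compares tuples lexicographically, so ties on the value are
    # broken by the tuple ID, as required by the definition above.
    return sorted((row[dim], tid) for tid, row in enumerate(data))


indexes = [build_index(DATA, dim) for dim in range(len(DATA[0]))]
for value, tid in indexes[0]:
    print(f"{value}:{tid}")  # first column of Table 2
```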

Now let us consider the dimensional indexes containing distinct dimensional values only, such as $I_4$ shown in Table 2. In such an index $I_i$ without duplicate dimensional values, we see that a tuple $q$ can only be dominated by a tuple $p$ that precedes $q$ in $I_i$ (which implies that $p[i] \succ q[i]$), since for any tuple $p'$ that follows $q$ in $I_i$, we have that $q[i] \succ p'[i]$, so $p'$ cannot dominate $q$.

Lemma 6.

Given a database $\mathcal{D}$, let $S$ be the skyline of $\mathcal{D}$, $I_i$ be a dimensional index containing only distinct dimensional values, and $q$ be a tuple. Then, $q \in S$ if and only if $p \nsucc q$ for every skyline tuple $p \in S$ whose entry precedes that of $q$ on $I_i$. ∎

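Proof.

If $q$ is the first entry of $I_i$, then $q$ is a skyline tuple because no tuple is better than $q$ in the dimension $D_i$, since all dimensional values are distinct. Otherwise, let $p$ be a skyline tuple that precedes $q$ on $I_i$ with $p \nsucc q$; then $p[i] \succ q[i]$, so $q \nsucc p$, and there exists a dimension $D_k$ such that $q[k] \succ p[k]$, that is, $p \sim q$. Now let $p'$ be a skyline tuple that follows $q$ on $I_i$; then $q[i] \succ p'[i]$, so $p' \nsucc q$, and further $q \nsucc p'$ because $p'$ is a skyline tuple, so we also have $p' \sim q$. Thus, $q$ is incomparable to any skyline tuple, so $q$ is a skyline tuple, that is, $q \in S$. ∎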

With Lemma 6, to determine whether a tuple $q$ is a skyline tuple, it is only necessary to compare $q$ with each skyline tuple $p$ that precedes $q$ in one dimensional index $I_i$, instead of comparing $q$ with all skyline tuples. Furthermore, we recall that BNL-like algorithms dynamically update the early skyline set, which requires a second dominance comparison between an incoming tuple $q$ and an early skyline tuple $p$ to determine whether $q \succ p$. However, with dimensional indexes, Lemma 6 shows that one dominance comparison is enough to determine $p \nsucc q$, instead of two comparisons. Lemma 6 also ensures a progressive construction of the skyline.

However, in most cases and particularly in real data, there are often duplicate values in each dimension, for which Lemma 6 cannot be established. As shown in Table 2, there are duplicate values in most dimensions; a typical instance is $I_1$, in which two different cases should be identified:

  1. The dimensional value 5.3 appears in three entries 5.3:3, 5.3:6, and 5.3:7, where $p_3$ and $p_6$ are skyline tuples and $p_7$ is not a skyline tuple.

  2. The dimensional value 8.4 appears in the two entries 8.4:2 and 8.4:4, where $p_2$ is not a skyline tuple but is indexed before the skyline tuple $p_4$.

In case (1), a simple straightforward scan of these three dimensional index entries can progressively identify that $p_3$ ($p_1 \nsucc p_3$) and $p_6$ ($p_1 \nsucc p_6$ and $p_3 \nsucc p_6$) are skyline tuples and filter out $p_7$ ($p_6 \succ p_7$). However, in case (2), a straightforward scan cannot progressively identify skyline tuples: no preceding skyline tuple dominates $p_2$, so $p_2$ will first be identified as a skyline tuple; then, since no tuple dominates $p_4$, $p_4$ will be identified as a skyline tuple without checking whether $p_4 \succ p_2$; hence, the final output skyline is wrong.

To resolve such misidentifications of skyline tuples, we propose a simple solution that first divides a dimensional index into logical blocks of entries, one per distinct dimensional value, and then applies the BNL algorithm to each block containing more than one entry to find the block skyline tuples, so that Lemma 6 can be established.

Definition (Index Block).

Given a database $\mathcal{D}$, let $I_i$ be the dimensional index of a dimension $D_i$. An index block of $I_i$ is a set of dimensional index entries that share the same dimensional value, sorted by the lexicographical order of tuple IDs.

If a block contains only one entry, its tuple is compared with the existing skyline tuples with respect to Lemma 6; otherwise, for any block that contains more than one entry, each block skyline tuple must be compared with the existing skyline tuples with respect to Lemma 6. We can therefore generalize the notion of tuples in Lemma 6 to block skyline tuples: when a block contains one entry, the concerned tuple is itself a block skyline tuple.

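Theorem 8.

Given a database $\mathcal{D}$, let $S$ be the skyline of $\mathcal{D}$, $I_i$ be a dimensional index, and $q$ be a block skyline tuple on $I_i$. Then, $q \in S$ if and only if $p \nsucc q$ for every skyline tuple $p \in S$ whose index block precedes that of $q$ on $I_i$.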

Proof.

With the proof of Lemma 6 and the statement of block skyline tuples, the proof of Theorem 8 is immediate. ∎

4.7:1
5.3:3 5.3:6 5.3:7
6.7:8
7.5:0 7.5:9
8.4:2 8.4:4
9.1:5
Table 3. A block view of the dimensional index $I_1$.
Example.

As shown in Table 3, 6 blocks can be located in $I_1$ with respect to its 6 distinct values: 4.7, 5.3, 6.7, 7.5, 8.4, and 9.1. According to Theorem 8: the block 4.7 contains $p_1$, so $p_1$ is a block skyline tuple and is the first skyline tuple; the block 5.3 contains $p_3$, $p_6$, and $p_7$, where $p_3 \sim p_6$ and $p_6 \succ p_7$, so $p_3$ and $p_6$ are block skyline tuples such that $p_1 \nsucc p_3$ and $p_1 \nsucc p_6$, hence $p_3$ and $p_6$ are new skyline tuples; the block 6.7 contains $p_8$, so $p_8$ is a block skyline tuple, which is dominated by $p_1$; the block 7.5 is different from the block 5.3, since $p_0 \sim p_9$ so both of them are block skyline tuples, and we have $p_6 \succ p_9$, so only $p_0$ is a skyline tuple; the block 8.4 is the same case as the block 5.3, where $p_4$ is a skyline tuple; finally, no skyline tuple dominates $p_5$, so the skyline is $S = \{p_0, p_1, p_3, p_4, p_5, p_6\}$. ∎
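The block-by-block scan of this example can be sketched as follows for a single dimensional index. This is an illustration of Theorem 8 under stated assumptions (0-based tuple IDs, the order less than, and a simple pairwise check standing in for the block-level BNL); the name `skyline_on_index` is illustrative.

```python
from itertools import groupby

# Rows of Table 1; the list position is the tuple ID (0..9).
DATA = [
    (7.5, 1.3, 7.5, 4.5, 5.3, 2.1),
    (4.7, 6.7, 6.7, 9.3, 3.8, 5.1),
    (8.4, 9.4, 5.3, 5.8, 6.7, 7.5),
    (5.3, 6.6, 6.7, 6.8, 5.8, 9.3),
    (8.4, 5.2, 5.1, 5.5, 4.1, 7.5),
    (9.1, 7.6, 2.6, 4.7, 7.3, 6.2),
    (5.3, 7.5, 1.9, 5.9, 3.4, 1.8),
    (5.3, 7.5, 6.7, 7.2, 6.3, 8.8),
    (6.7, 7.3, 7.6, 9.7, 5.3, 8.7),
    (7.5, 9.6, 4.8, 8.9, 9.5, 6.5),
]


def dominates(p, q):
    """p dominates q: not worse in every dimension, strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline_on_index(data, dim):
    """Scan one dimensional index block by block, in index order."""
    index = sorted((row[dim], tid) for tid, row in enumerate(data))
    skyline = []  # skyline tuples found in the preceding blocks
    for _, entries in groupby(index, key=lambda entry: entry[0]):
        block = [tid for _, tid in entries]
        # Block skyline: tuples not dominated inside their own block.
        block_sky = [t for t in block
                     if not any(dominates(data[s], data[t]) for s in block)]
        # One-directional check against earlier skyline tuples only.
        skyline.extend(t for t in block_sky
                       if not any(dominates(data[s], data[t]) for s in skyline))
    return skyline


print(skyline_on_index(DATA, 0))  # [1, 3, 6, 0, 4, 5]
```

The output lists the skyline tuples in the order in which the scan of the first index discovers them, which matches the walk-through above; no tuple is ever removed from the result once it has been added.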

It is important to note that Theorem 8 allows dominance comparisons to be performed on arbitrary dimensional indexes, and the computation stops when the last entry of any index is reached. Therefore, a dynamic dimension switching strategy can further improve the efficiency of skyline computation based on dimension indexing. For instance, if we follow a breadth-first search strategy among all dimensional indexes shown in Table 2, then when we examine the second entry of a dimensional index, we do not have to compare its tuple with all skyline tuples found so far but only with those that precede it in that index; moreover, if the tuple of an entry is already a skyline tuple, it can be ignored. We also note that duplicate dimensional values severely impact the overall performance of dimensional index based skyline computation; therefore, reasonable dimension selection/sorting heuristics are helpful.

4. A Range Search Approach to Skyline

In this section, we first propose the notion of stop line, which allows the search for skyline tuples to be terminated early by pruning irrelevant tuples, and then present the algorithm SDI-RS (RangeSearch) for skyline computation based on the SDI framework. Notice that the name RangeSearch stands for the bounded search range used while determining skyline tuples.

4.1. Stop Line

Let us consider again the skyline $S = \{p_0, p_1, p_3, p_4, p_5, p_6\}$ and the dimensional indexes shown in Table 2. It is easy to see that all 6 skyline tuples can be found within the first two entries of the dimensional indexes; hence, a realistic question is whether we can stop the skyline computation before reaching the end of any dimensional index.

Definition (Stop Line).

Given a database $\mathcal{D}$, let $p$ be a skyline tuple. A stop line established from $p$, denoted by $L_p$, is the set of dimensional index entries of $p$, one in each dimension. Each such index entry is a stop line entry, and an index block containing a stop line entry is a stop line block.

Let $p$ be a tuple; we denote by $o_i(p)$ the offset of the index block that contains $p$ on a dimensional index $I_i$, that is, the position of the index block that contains $p$. Hence, let $p$ be a stop line tuple and $q$ be a tuple; we say that the stop line $L_p$ covers the index entry of $q$ on a dimensional index $I_i$ if $o_i(p) < o_i(q)$. For instance, Table 4 shows the stop line created from the tuple $p_6$, which covers 41 index entries in total, not counting $p_7$ on $I_1$ or on $I_2$, where it shares the stop line block with $p_6$.

$I_1$    $I_2$    $I_3$    $I_4$    $I_5$    $I_6$
4.7:1    1.3:0    [1.9:6]  4.5:0    [3.4:6]  [1.8:6]
5.3:3    5.2:4    2.6:5    4.7:5    3.8:1    2.1:0
[5.3:6]  6.6:3    4.8:9    5.5:4    4.1:4    5.1:1
5.3:7    6.7:1    5.1:4    5.8:2    5.3:0    6.2:5
6.7:8    7.3:8    5.3:2    [5.9:6]  5.3:8    6.5:9
7.5:0    [7.5:6]  6.7:1    6.8:3    5.8:3    7.5:2
7.5:9    7.5:7    6.7:3    7.2:7    6.3:7    7.5:4
8.4:2    7.6:5    6.7:7    8.9:9    6.7:2    8.7:8
8.4:4    9.4:2    7.5:0    9.3:1    7.3:5    8.8:7
9.1:5    9.6:9    7.6:8    9.7:8    9.5:9    9.3:3
Table 4. The stop line created from tuple $p_6$ (entries in brackets) covers 41 index entries in total.

Obviously, let $p$ be a stop line tuple and $q$ be a tuple such that $p \succ q$; then we have that $o_i(p) \le o_i(q)$ on each dimensional index $I_i$ and $o_k(p) < o_k(q)$ on at least one dimensional index $I_k$.

Theorem 2.

Given a database $\mathcal{D}$, let $L_p$ be the stop line with respect to a skyline tuple $p$. By following any top-down traversal of all dimensional indexes, if all stop line blocks have been traversed, then the complete set of all skyline tuples has been generated and the skyline computation can stop.

Proof.

Let $p$ be the stop line tuple and $q$ be another skyline tuple; we have (1) $p \sim q$, or (2) $p[i] = q[i]$ on every dimension, if $p$ and $q$ have identical dimensional values. In the first case, there exists a dimension $D_k$ such that $q[k] \succ p[k]$, so $o_k(q) < o_k(p)$; that is, if the index traversal passes the stop line $L_p$, the tuple $q$ must have been identified at least in the dimension $D_k$. In the second case, we have $o_i(p) = o_i(q)$ for any dimension $D_i$. In both cases, if all stop line blocks have been processed, then all skyline tuples have been found. ∎

In principle, any skyline tuple can be chosen to form a stop line; however, different stop lines behave differently in pruning useless tuples. For instance, as shown in Table 4, the stop line created from $p_6$ covers 41 index entries in total and two tuples can be pruned; however, as shown in Table 5, the stop line created from $p_0$ covers only 37 index entries and no tuple can be pruned. Obviously, a good stop line shall cover as many index entries as possible, so we can use an optimality function, $\min(p)$, to minimize the offsets of a skyline tuple $p$ in all dimensional indexes for building a stop line $L_p$, defined as:

$$\min(p) = \left( \max_{1 \le i \le d} o_i(p),\ \frac{1}{d} \sum_{i=1}^{d} o_i(p) \right)$$

The function $\min(p)$ sorts tuples first by the maximum offset, then by the mean offset in all dimensional indexes, so the skyline tuple with the smallest $\min(p)$ is the best stop line tuple. Hence, a dynamically updated stop tuple can be maintained by keeping the tuple with the smallest $\min(p)$ among all skyline tuples found so far.
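The selection of the stop line tuple can be sketched as follows. The sketch assumes 0-based block offsets (the number of distinct better values in a dimension), the order less than, and that an entry is covered when its block lies strictly after the stop line block; the names `block_offsets`, `stop_key`, and `covered_entries` are illustrative assumptions.

```python
# Rows of Table 1; the list position is the tuple ID (0..9).
DATA = [
    (7.5, 1.3, 7.5, 4.5, 5.3, 2.1),
    (4.7, 6.7, 6.7, 9.3, 3.8, 5.1),
    (8.4, 9.4, 5.3, 5.8, 6.7, 7.5),
    (5.3, 6.6, 6.7, 6.8, 5.8, 9.3),
    (8.4, 5.2, 5.1, 5.5, 4.1, 7.5),
    (9.1, 7.6, 2.6, 4.7, 7.3, 6.2),
    (5.3, 7.5, 1.9, 5.9, 3.4, 1.8),
    (5.3, 7.5, 6.7, 7.2, 6.3, 8.8),
    (6.7, 7.3, 7.6, 9.7, 5.3, 8.7),
    (7.5, 9.6, 4.8, 8.9, 9.5, 6.5),
]
SKYLINE = [0, 1, 3, 4, 5, 6]  # skyline tuple IDs of Table 1


def block_offsets(data, tid):
    """0-based offset of the block containing tuple tid, per dimension."""
    return [len({row[i] for row in data if row[i] < data[tid][i]})
            for i in range(len(data[tid]))]


def stop_key(data, tid):
    """Sort key: maximum block offset first, then mean block offset."""
    offsets = block_offsets(data, tid)
    return (max(offsets), sum(offsets) / len(offsets))


def covered_entries(data, tid):
    """Index entries located in blocks strictly after the stop line blocks."""
    return sum(sum(1 for row in data if row[i] > data[tid][i])
               for i in range(len(data[tid])))


best = min(SKYLINE, key=lambda tid: stop_key(DATA, tid))
print(best, block_offsets(DATA, best), covered_entries(DATA, best))
```

On the sample database, this selects tuple 6 with 41 covered entries; tuple 4 has the same maximum offset, so the choice is decided by the mean offset.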

$I_1$    $I_2$    $I_3$    $I_4$    $I_5$    $I_6$
4.7:1    [1.3:0]  1.9:6    [4.5:0]  3.4:6    1.8:6
5.3:3    5.2:4    2.6:5    4.7:5    3.8:1    [2.1:0]
5.3:6    6.6:3    4.8:9    5.5:4    4.1:4    5.1:1
5.3:7    6.7:1    5.1:4    5.8:2    [5.3:0]  6.2:5
6.7:8    7.3:8    5.3:2    5.9:6    5.3:8    6.5:9
[7.5:0]  7.5:6    6.7:1    6.8:3    5.8:3    7.5:2
7.5:9    7.5:7    6.7:3    7.2:7    6.3:7    7.5:4
8.4:2    7.6:5    6.7:7    8.9:9    6.7:2    8.7:8
8.4:4    9.4:2    [7.5:0]  9.3:1    7.3:5    8.8:7
9.1:5    9.6:9    7.6:8    9.7:8    9.5:9    9.3:3
Table 5. The stop line created from tuple $p_0$ (entries in brackets) covers 37 index entries in total.

Nevertheless, the use of stop lines requires all stop line blocks in all dimensions to be examined, so it is difficult to predict whether a scan will first reach the end of some dimensional index or first finish examining all stop line blocks, although we can state that stop lines effectively help the skyline computation on correlated data. We also note that the use of stop lines requires that all dimensions are indexed, which is an additional constraint when applying Theorem 8 and Theorem 2 together, since Theorem 8 does not require indexes to be constructed for all dimensions. We thus propose to consider different application strategies of Theorem 8 and Theorem 2, with respect to particular use cases and data types, to accelerate the skyline computation.

4.2. The RangeSearch Algorithm

Theorem 8 reduces the count of dominance comparisons while computing the skyline. However, as mentioned in Section 3, duplicate dimensional values severely increase the dominance comparison count because BNL-based local comparisons must be applied. Notice that it is useless to apply SFS or SaLSa to such local comparisons because their sorting functions disable one of the most important features of our dimension indexing based approach: an individual skyline selection criterion, including one for non-numerical values, can be independently applied to each dimension.

In order to reduce the impact of duplicate dimensional values, we propose a simple solution based on sorting dimensional indexes by their cardinalities, that is, by their numbers of distinct dimensional values. The computation starts from the best dimensional index so that the calls of BNL can be minimized. For instance, in Table 2, the dimensional indexes can be sorted as $I_4$ (10 distinct values), then $I_2$, $I_5$, and $I_6$ (9 each), $I_3$ (8), and $I_1$ (6), where the best dimensional index $I_4$ contains no duplicate dimensional values, so Lemma 6 can be directly established and dimension switching can be performed earlier.
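The cardinality-based ordering can be checked with a few lines. The sketch assumes 0-based dimension numbers and keeps the original dimension order among ties, which is an arbitrary choice; the variable names are illustrative.

```python
# Rows of Table 1; the list position is the tuple ID (0..9).
DATA = [
    (7.5, 1.3, 7.5, 4.5, 5.3, 2.1),
    (4.7, 6.7, 6.7, 9.3, 3.8, 5.1),
    (8.4, 9.4, 5.3, 5.8, 6.7, 7.5),
    (5.3, 6.6, 6.7, 6.8, 5.8, 9.3),
    (8.4, 5.2, 5.1, 5.5, 4.1, 7.5),
    (9.1, 7.6, 2.6, 4.7, 7.3, 6.2),
    (5.3, 7.5, 1.9, 5.9, 3.4, 1.8),
    (5.3, 7.5, 6.7, 7.2, 6.3, 8.8),
    (6.7, 7.3, 7.6, 9.7, 5.3, 8.7),
    (7.5, 9.6, 4.8, 8.9, 9.5, 6.5),
]

# Cardinality of a dimension = number of distinct values (index blocks).
cardinalities = [len({row[dim] for row in DATA}) for dim in range(len(DATA[0]))]

# Dimensions with the most distinct values (fewest duplicates) come first;
# sorted() is stable, so ties keep their original order.
order = sorted(range(len(cardinalities)), key=lambda dim: -cardinalities[dim])

print(cardinalities)  # [6, 9, 8, 10, 9, 9]
print(order)          # [3, 1, 4, 5, 2, 0], i.e. I_4, I_2, I_5, I_6, I_3, I_1
```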

We then present SDI-RS (RangeSearch), shown in Algorithm 1, an algorithm that applies Theorem 8 and Theorem 2 to skyline computation on sorted dimensional indexes by performing dominance comparisons only with a range of skyline tuples instead of all of them.

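Input: Sorted dimensional indexes $I_1, \dots, I_d$
Output: Complete set $S$ of all skyline tuples
1  $L^* \leftarrow$ empty stop line
2  while true do
3      foreach $I_i$ do
4          while $B \leftarrow$ get next block from $I_i$ do
5              if $B$ = null then
6                  return $S$
7              foreach $p \in B$ do
8                  if $p$ has been compared and $p \notin S$ then
9                      remove $p$ from $B$
10             $S_B \leftarrow$ compute the block Skyline from $B$ by BNL
11             foreach $p \in S_B$ and $p \notin S$ do
12                 if $S_i \nsucc p$ then
13                     $S_i \leftarrow S_i \cup \{p\}$
14                     $S \leftarrow S \cup \{p\}$
15                     $L_p \leftarrow$ build stop line from $p$
16                     if $L^* = \emptyset$ or $L_p$ is better than $L^*$ then
17                         $L^* \leftarrow L_p$
18             if the stop line entry of $L^*$ has been reached in each dimension $D_k$ then
19                 return $S$
20             if [dimension-switching] then
21                 break
Algorithm 1 SDI-RS (RangeSearch)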

The algorithm accepts a set of sorted dimensional indexes of a $d$-dimensional database $\mathcal{D}$ as input and outputs the complete set $S$ of all skyline tuples. First, we initialize an empty stop line $L^*$, then we enter a round-robin loop that finds the complete set of all skyline tuples with respect to Theorem 8 and Theorem 2. In each dimensional index based iteration, we first get the next block $B$ of index entries from $I_i$. According to Theorem 8, if $B$ is null, which means that the end of $I_i$ is reached, we exit the algorithm by returning $S$; otherwise, we treat all index entries block by block to find skyline tuples. If a tuple $p \in B$ has already been compared and marked as a non-skyline tuple, we ignore it in order to avoid comparing it with other tuples again; however, if $p$ is a skyline tuple, we shall keep it because $p$ may dominate other new tuples in the block-based BNL while computing the block skyline $S_B$. Then, for each tuple $p \in S_B$ such that $p \notin S$ (again, we do not want to compare a skyline tuple with other skyline tuples), we compare it with all existing skyline tuples $S_i$ present in the current dimension $D_i$. Here we introduce a shortcut operator $S_i \nsucc p$ at line 12, which means that none of the skyline tuples in $S_i$ dominates $p$; according to Theorem 8, $p$ must be a skyline tuple in this case and must be added to the dimensional skyline $S_i$ and the global skyline $S$. Furthermore, with respect to Theorem 2, we build a new stop line $L_p$ from each new skyline tuple $p$, and if it is better than the current stop line $L^*$ (or no stop line is defined), we update $L^*$ to $L_p$. When the above dominance comparisons are finished, we compare the current iteration position on all dimensions with the latest stop line; if the stop line entry is reached in each dimension, RangeSearch stops by returning the complete skyline $S$. Otherwise, RangeSearch switches to the next dimension and repeats the above procedure with respect to a particular [dimension-switching] strategy.
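The procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses breadth-first switching only (one block per dimension per round), 0-based tuple IDs and block offsets, the order less than, a simple pairwise check in place of the block-level BNL, and a non-empty input. All names (`sdi_rs`, `stop_key`, `dim_sky`, and so on) are illustrative.

```python
from itertools import groupby


def dominates(p, q):
    """p dominates q: not worse in every dimension, strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def sdi_rs(data):
    """Skyline tuple IDs of a non-empty list of equal-length numeric tuples."""
    d = len(data[0])
    # Dimensional indexes, stored as lists of blocks (lists of tuple IDs).
    indexes = []
    for i in range(d):
        entries = sorted((row[i], tid) for tid, row in enumerate(data))
        indexes.append([[tid for _, tid in group]
                        for _, group in groupby(entries, key=lambda e: e[0])])
    # Block offset of every tuple on every dimension.
    offset = [{tid: b for b, block in enumerate(index) for tid in block}
              for index in indexes]
    # Visit the dimensions with the fewest duplicate values first.
    order = sorted(range(d), key=lambda i: -len(indexes[i]))

    def stop_key(t):
        offs = [offset[i][t] for i in range(d)]
        return (max(offs), sum(offs) / d)

    pos = [0] * d                     # next block to read, per dimension
    dim_sky = [[] for _ in range(d)]  # skyline tuples already seen, per index
    skyline, in_sky, compared = [], set(), set()
    stop = None                       # tuple defining the current stop line
    while True:
        for i in order:
            if pos[i] == len(indexes[i]):  # end of an index: S is complete
                return skyline
            # Drop tuples already known to be dominated; keep skyline tuples.
            block = [t for t in indexes[i][pos[i]]
                     if t in in_sky or t not in compared]
            pos[i] += 1
            block_sky = [t for t in block
                         if not any(dominates(data[s], data[t]) for s in block)]
            for t in block_sky:
                if t not in in_sky and not any(
                        dominates(data[s], data[t]) for s in dim_sky[i]):
                    skyline.append(t)
                    in_sky.add(t)
                    if stop is None or stop_key(t) < stop_key(stop):
                        stop = t
            compared.update(block)
            dim_sky[i].extend(t for t in block_sky if t in in_sky)
            # Stop once every stop line block has been traversed.
            if stop is not None and all(
                    pos[k] > offset[k][stop] for k in range(d)):
                return skyline


SAMPLE = [  # rows of Table 1, tuple IDs 0..9
    (7.5, 1.3, 7.5, 4.5, 5.3, 2.1), (4.7, 6.7, 6.7, 9.3, 3.8, 5.1),
    (8.4, 9.4, 5.3, 5.8, 6.7, 7.5), (5.3, 6.6, 6.7, 6.8, 5.8, 9.3),
    (8.4, 5.2, 5.1, 5.5, 4.1, 7.5), (9.1, 7.6, 2.6, 4.7, 7.3, 6.2),
    (5.3, 7.5, 1.9, 5.9, 3.4, 1.8), (5.3, 7.5, 6.7, 7.2, 6.3, 8.8),
    (6.7, 7.3, 7.6, 9.7, 5.3, 8.7), (7.5, 9.6, 4.8, 8.9, 9.5, 6.5),
]
print(sorted(sdi_rs(SAMPLE)))  # [0, 1, 3, 4, 5, 6]
```

One design choice in this sketch is worth noting: skyline tuples that were discovered on another dimension are also recorded in `dim_sky[i]` when their block is read on index $I_i$, so that later blocks of $I_i$ are compared with every skyline tuple that precedes them on that index.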

In our approach, we consider breadth-first dimension switching (BFS) and depth-first dimension switching (DFS). With BFS, once a block has been examined and SDI-RS must continue to run, the next dimension is taken. With DFS, once a block has been examined and SDI-RS must continue to run, SDI-RS goes ahead in the current dimension as long as the current block contains new skyline tuples, until it meets a block without any new skyline tuple. The difference between breadth-first switching and depth-first switching is clear. DFS tries to accelerate the search for skyline tuples in each dimension, and this strategy benefits the most from Theorem 8; furthermore, if the best stop line is balanced across the dimensions, then DFS reaches the stop line quickly in each dimension, so more tuples can be pruned. However, DFS is not efficient if there are a large number of duplicate values in some dimensions, because each block must be examined before switching to the next dimension. In this case, depth-first switching takes duplicate dimensional values into account: since all dimensional indexes are sorted with respect to their cardinalities, SDI-RS always starts from the best dimensions, which contain fewer duplicate values, and finds as many skyline tuples as possible by depth-first switching; hence, when switching to other dimensions, it is possible that some tuples in some blocks have already been compared or are already skyline tuples, so no more comparisons will be performed for them.

In comparison with sorting-based algorithms like SFS and SaLSa, SDI-RS sorts tuples with respect to each dimension individually, which is useful when different criteria are applied to determine the skyline. For instance, we can assign the order less than ($<$) to one dimension and the order greater than ($>$) to another dimension, without any additional calculation to unify and normalize dimensional values. For the same reason, SDI-RS can directly process categorical data in the same way as numerical data: if a total order can be defined on a categorical attribute, for instance, a user preference on colors such that blue $\succ$ green $\succ$ yellow $\succ$ red, then SDI-RS can treat such values as ordered numerical values without any adaptation.
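A per-dimension order can be expressed as one key function per dimension, as in the minimal sketch below. The items, the color ranking, and all names (`ITEMS`, `KEYS`, `rank`, `dimensional_index`) are hypothetical and only illustrate the idea; the skyline itself is computed with a naive check for brevity.

```python
# Hypothetical items: (color, price, rating).
ITEMS = [
    ("blue", 120, 4.0),
    ("green", 100, 4.5),
    ("red", 100, 4.5),
    ("yellow", 150, 3.0),
    ("red", 80, 3.5),
]

# User preference on colors: blue is best, red is worst.
COLOR_RANK = {"blue": 0, "green": 1, "yellow": 2, "red": 3}

# One key per dimension; a smaller key means a better value.
KEYS = [
    lambda color: COLOR_RANK[color],  # categorical total order
    lambda price: price,              # less than
    lambda rating: -rating,           # greater than
]


def rank(item):
    """Map an item to its per-dimension sort keys."""
    return tuple(key(value) for key, value in zip(KEYS, item))


def dominates(p, q):
    """Dominance on the per-dimension keys (smaller key is better)."""
    rp, rq = rank(p), rank(q)
    return all(a <= b for a, b in zip(rp, rq)) and any(a < b for a, b in zip(rp, rq))


def dimensional_index(items, dim):
    """Item IDs sorted by the order of one dimension, ties broken by ID."""
    return sorted(range(len(items)),
                  key=lambda tid: (KEYS[dim](items[tid][dim]), tid))


skyline = [i for i, q in enumerate(ITEMS)
           if not any(dominates(p, q) for p in ITEMS)]
print(dimensional_index(ITEMS, 0))  # [0, 1, 3, 2, 4]
print(skyline)                      # [0, 1, 4]
```

No normalization across dimensions is needed: each dimension is only ever compared with itself, which is what makes mixed orders and categorical attributes straightforward.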

With dimensional indexes, SDI-RS is efficient in both space and time. The storage requirement for dimensional indexes is bounded: for instance, a C/C++ implementation of SDI-RS may represent an index entry as a struct of tuple ID and dimensional value that requires 16 bytes (64-bit ID and 64-bit value). Therefore, if each dimensional index is a std::vector, the in-memory size of the dimensional indexes is double the database size: for instance, 16GB of heap memory fits the allocation of 1,000,000,000 ID/value structures, that is, 125,000,000 8-dimensional tuples. Let $d$ be the dimensionality, $n$ be the cardinality of data, and $m$ be the size of the skyline. The generation of dimensional indexes requires $O(dn \log n)$ with respect to a general-purpose sorting algorithm of $O(n \log n)$ complexity. For the best case, that is, $m = 1$, the computation stops immediately since the only skyline tuple is the stop line; for the worst case, where all tuples are skyline tuples, SDI-RS finishes in

according to Theorem 8 if each block contains only one tuple (that is, the case without duplicate dimensional values). More generally, if the best dimension index contains duplicate values, then SDI-RS finishes in

since the worst-case is that all duplicate values appear in the same block.
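The storage estimate given above (16 bytes per index entry, one entry per tuple and per dimension) can be checked with a few lines of arithmetic. This is a sketch under the assumption of a typical 64-bit platform where the struct has no extra padding; `IndexEntry`, `indexBytes`, and `tuplesThatFit` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

struct IndexEntry {
    std::uint64_t id;  // 64-bit tuple ID
    double value;      // 64-bit dimensional value
};

// Holds on common 64-bit platforms; the sketch relies on it.
static_assert(sizeof(IndexEntry) == 16, "expected a 16-byte index entry");

// Bytes needed by all dimensional indexes: one entry per tuple and dimension.
inline std::uint64_t indexBytes(std::uint64_t tuples, std::uint64_t dims) {
    return tuples * dims * sizeof(IndexEntry);
}

// Number of tuples whose dimensional indexes fit into a memory budget.
inline std::uint64_t tuplesThatFit(std::uint64_t budgetBytes, std::uint64_t dims) {
    assert(dims > 0);
    return budgetBytes / (dims * sizeof(IndexEntry));
}
```

For example, `tuplesThatFit(16000000000ULL, 8)` gives the number of 8-dimensional tuples whose indexes occupy the 1,000,000,000 entries that fit in 16GB (taking 1GB as 10^9 bytes).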

5. Experimental Evaluation

In this section, we report the experimental evaluation of SDI-RS, run with both BFS and DFS dimension switching, and compare it with three baseline algorithms, BNL, SFS, and SaLSa, on synthetic and real benchmark datasets. The vol sorting function and the max sorting function are applied to SFS and SaLSa respectively, as mentioned in (Bartolini et al., 2006).

Figure 2. Overall performance of SDI-RS. Panels: run-time on independent datasets, (a) 100K and (b) 1M; dominance comparisons on independent datasets, (c) 100K and (d) 1M; run-time on correlated datasets, (e) 100K and (f) 1M; dominance comparisons on correlated datasets, (g) 100K and (h) 1M; run-time on anti-correlated datasets, (i) 100K and (j) 1M; dominance comparisons on anti-correlated datasets, (k) 100K and (l) 1M.

We generate independent, correlated, and anti-correlated synthetic datasets using the standard Skyline Benchmark Data Generator (http://pgfoundry.org/projects/randdataset) (Borzsony et al., 2001), with cardinalities of 100K and 1M tuples and the dimensionality in the range of 2 to 24. Three real datasets, NBA, HOUSE, and WEATHER (Chester et al., 2015), have also been used. Table 6 and Table 7 show the statistics of all these datasets.

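| Dataset | Cardinality | d=2 | d=4 | d=6 | d=8 | d=16 | d=24 |
|---|---|---|---|---|---|---|---|
| Independent | 100K | 12 | 282 | 2534 | 9282 | 82546 | 99629 |
| Independent | 1M | 17 | 423 | 6617 | 30114 | 629091 | 981611 |
| Correlated | 100K | 3 | 9 | 49 | 135 | 3670 | 13479 |
| Correlated | 1M | 1 | 19 | 36 | 208 | 8688 | 58669 |
| Anti-correlated | 100K | 56 | 3865 | 26785 | 55969 | 96816 | 99730 |
| Anti-correlated | 1M | 64 | 8044 | 99725 | 320138 | 892035 | 984314 |

Table 6. Skyline size of synthetic datasets, by dimensionality d.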
| Dataset | Cardinality (n) | Dimensionality (d) | Skyline Size (m) |
|---|---|---|---|
| NBA | 17264 | 8 | 1796 |
| HOUSE | 127931 | 6 | 5774 |
| WEATHER | 566268 | 15 | 63398 |

Table 7. Statistics of real datasets.

We implemented SDI-RS in C++ (C++11 standard), where dimensional indexes were implemented with STL std::vector and std::sort(). In order to evaluate the overall performance of SDI-RS, three baseline algorithms, BNL, SFS, and SaLSa, were also implemented in C++ on the same code base. All algorithms were compiled using LLVM Clang with the -O3 optimization flag. All experiments were performed on a virtual computation node with 16 vCPUs and 32GB RAM, hosted on a server with 4 Intel Xeon E5-4610 v2 2.30GHz processors and 256GB RAM.

Figure 2 shows the overall run-time, including loading/indexing data, and the total dominance comparison count of SDI-RS and BNL/SFS/SaLSa on 100K and 1M datasets, where the dimensionality is set to 2, 4, 6, 8, 16, and 24. We note that on low-dimensional datasets there are no large differences among the four algorithms; however, SDI-RS greatly outperforms BNL/SFS/SaLSa on high-dimensional datasets. Indeed, the run-time of SDI-RS grows almost linearly with the dimensionality, which is reasonable since the main cost in skyline computation is dominance comparison and SDI-RS significantly reduces the total count of dominance comparisons. On the other hand, it is surprising that SaLSa did not finish within 5 hours on the 24-dimensional datasets of Figure 2 (b) and Figure 2 (j). Notice that SaLSa outperforms BNL and SFS on real datasets.
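To make the notion of a dominance comparison count concrete, the sketch below counts the dominance tests of a naive window-based skyline scan. It is an illustrative baseline written for this discussion, not the code used in the experiments and not SDI-RS; it assumes that smaller values are better in every dimension, and `Tuple`, `dominates`, and `naiveSkyline` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Tuple = std::vector<double>;

// True if a dominates b: a is no worse in every dimension and strictly
// better in at least one (smaller is better). Each call counts as one
// dominance comparison.
bool dominates(const Tuple& a, const Tuple& b, std::uint64_t& comparisons) {
    assert(a.size() == b.size());
    ++comparisons;
    bool strictlyBetter = false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] > b[i]) return false;
        if (a[i] < b[i]) strictlyBetter = true;
    }
    return strictlyBetter;
}

// Returns the IDs (positions in `data`) of the skyline tuples.
std::vector<std::size_t> naiveSkyline(const std::vector<Tuple>& data,
                                      std::uint64_t& comparisons) {
    std::vector<std::size_t> window;  // current skyline candidates
    for (std::size_t t = 0; t < data.size(); ++t) {
        bool dominated = false;
        for (std::size_t s : window) {
            if (dominates(data[s], data[t], comparisons)) {
                dominated = true;
                break;
            }
        }
        if (dominated) continue;
        // Drop the candidates that the new tuple dominates, then keep it.
        window.erase(std::remove_if(window.begin(), window.end(),
                                    [&](std::size_t s) {
                                        return dominates(data[t], data[s], comparisons);
                                    }),
                     window.end());
        window.push_back(t);
    }
    return window;
}
```

Every candidate in the window may be tested against every incoming tuple, which is why the count grows quickly when the skyline is large, as it is on high-dimensional data.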

We note that, unlike on independent and anti-correlated datasets, the total run-time of SDI-RS on low-dimensional correlated datasets is much longer than that of BNL/SFS/SaLSa, because SDI-RS requires building dimensional indexes. Table 8 details the skyline searching time (the time elapsed on dominance comparisons and data access, in msec) and the total run-time (the time elapsed on the whole process, including data loading and sorting/indexing, in msec). It is clear that the construction of dimensional indexes accounts for most of the run-time of SDI-RS when the total processing time is short.

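| Algorithm | Cardinality | Search | Total | Search | Total | Search | Total | Search | Total |
|---|---|---|---|---|---|---|---|---|---|
| SDI-RS+BFS | 100K | 0.14 | 271 | 0.38 | 664 | 423 | 1657 | 243 | 1348 |
| SDI-RS+BFS | 1M | 0.17 | 5161 | 0.64 | 10293 | 1.34 | 23388 | 9896 | 40830 |
| SDI-RS+DFS | 100K | 0.17 | 238 | 0.25 | 692 | 0.98 | 1263 | 4.75 | 1443 |
| SDI-RS+DFS | 1M | 0.14 | 5772 | 0.42 | 12069 | 1.32 | 21845 | 5.86 | 33012 |
| BNL | 100K | 2.51 | 75 | 5.36 | 141 | 5.33 | 243 | 10.29 | 284 |
| BNL | 1M | 25.49 | 744 | 45.13 | 1468 | 62.56 | 2267 | 94.04 | 2901 |
| SaLSa (max) | 100K | 2686 | 2784 | 386 | 543 | 26.03 | 278 | 51.67 | 361 |
| SaLSa (max) | 1M | 88.63 | 1067 | 451 | 2117 | 377 | 2987 | 674 | 3829 |
| SFS (vol) | 100K | 1.49 | 91 | 4.53 | 162 | 4.91 | 263 | 10.99 | 320 |
| SFS (vol) | 1M | 13.33 | 931 | 33.78 | 1605 | 43.04 | 2346 | 88.83 | 3128 |

Table 8. Skyline searching time (msec) and total run-time (msec) on correlated datasets. Each Search/Total column pair corresponds to one dimensionality setting.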

Table 9 shows the performance of SDI-RS on real datasets. DFS dimension switching outperforms BFS dimension switching on both the NBA and HOUSE datasets; however, BFS outperforms DFS on the WEATHER dataset. After investigating these datasets, we confirm that there are a large number of duplicate values in several dimensions of the WEATHER dataset, so the BFS dimension switching strategy has the advantage. BNL outperforms all other tested algorithms in total time on the HOUSE dataset, which corresponds to the results obtained from synthetic low-dimensional independent datasets. Furthermore, the number of updates of the best stop line in SDI-RS is quite limited with respect to the size of the skyline.

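| | SDI-RS+BFS | SDI-RS+DFS | BNL | SaLSa | SFS |
|---|---|---|---|---|---|
| Dominance | 680,388 | 662,832 | 8,989,690 | 6,592,178 | 8,989,690 |
| Search Time (msec) | 54 | 38 | 151 | 108 | 147 |
| Total Time (msec) | 172 | 158 | 191 | 152 | 189 |
| Stop Line Update | 15 | 32 | - | - | - |

(a) NBA dataset: n = 17264, d = 8, m = 1796.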

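| | SDI-RS+BFS | SDI-RS+DFS | BNL | SaLSa | SFS |
|---|---|---|---|---|---|
| Dominance | 4,976,773 | 4,860,060 | 59,386,118 | 51,484,870 | 59,386,118 |
| Search Time (msec) | 962 | 337 | 1,486 | 1,550 | 1,534 |
| Total Time (msec) | 2,663 | 1,918 | 1,716 | 1,800 | 1,768 |
| Stop Line Update | 16 | 18 | - | - | - |

(b) HOUSE dataset: n = 127931, d = 6, m = 5774.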

| | SDI-RS+BFS | SDI-RS+DFS | BNL | SaLSa | SFS |
|---|---|---|---|---|---|
| Dominance | 1,744,428,382 | 1,737,143,260 | 14,076,080,681 | 7,919,746,895 | 14,076,080,681 |
| Search Time (msec) | 48,773 | 58,047 | 539,100 | 394,995 | 545,263 |
| Total Time (msec) | 65,376 | 77,665 | 541,820 | 397,650 | 547,914 |
| Stop Line Update | 14 | 18 | - | - | - |

(c) WEATHER dataset: n = 566268, d = 15, m = 63398.

Table 9. Performance evaluation on real datasets.

We did not directly compare SDI-RS with all existing skyline algorithms. However, since most of the literature compares proposed algorithms with BNL, SFS, or SaLSa, the comparative results obtained in our experimental evaluation indicate that SDI-RS outperforms most existing skyline algorithms.

6. Conclusion

In this paper, we presented a novel and efficient skyline computation approach. We proved that in multidimensional databases, skyline computation can be conducted on an arbitrary dimensional index that is constructed with respect to a predefined total order determining the skyline. We therefore proposed SDI, a general skyline computation framework based on dimension indexing. We further showed that any skyline tuple can be used to stop the computation process while still outputting the complete skyline. Based on our analysis, we developed a new progressive skyline algorithm, SDI-RS, that first builds sorted dimensional indexes and then efficiently finds skyline tuples by dimension switching in order to minimize the count of dominance comparisons. Our experimental evaluation shows that SDI-RS outperforms most existing skyline algorithms. Our future research directions include the further development of the SDI framework as well as its adaptation to the context of Big Data, for instance with the Map-Reduce programming model.

References

  • I. Bartolini, P. Ciaccia, and M. Patella (2006) SaLSa: computing the skyline without scanning the whole sky. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management, CIKM’06, pp. 405–414.
  • S. Borzsony, D. Kossmann, and K. Stocker (2001) The Skyline operator. In Proceedings of the 17th International Conference on Data Engineering, ICDE’01, pp. 421–430.
  • S. Chester, D. Šidlauskas, I. Assent, and K. S. Bøgh (2015) Scalable parallelization of skyline computation for multi-core processors. In Proceedings of the 31st International Conference on Data Engineering, ICDE’15, pp. 1083–1094.
  • J. Chomicki, P. Godfrey, J. Gryz, and D. Liang (2005) Skyline with presorting: theory and optimizations. In Intelligent Information Processing and Web Mining, pp. 595–604.
  • P. Godfrey, R. Shipley, and J. Gryz (2005) Maximal vector computation in large data sets. In Proceedings of the 31st International Conference on Very Large Data Bases, VLDB’05, pp. 229–240.
  • D. Kossmann, F. Ramsak, and S. Rost (2002) Shooting stars in the sky: an online algorithm for skyline queries. In Proceedings of the 28th International Conference on Very Large Data Bases, VLDB’02, pp. 275–286.
  • K. C. K. Lee, B. Zheng, H. Li, and W. Lee (2007) Approaching the skyline in z order. In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB’07, pp. 279–290.
  • X. Lin, Y. Yuan, W. Wang, and H. Lu (2005) Stabbing the sky: efficient skyline computation over sliding windows. In Proceedings of the 21st International Conference on Data Engineering, ICDE’05, pp. 502–513.
  • B. Liu and C. Chan (2010) ZINC: efficient indexing for skyline computation. PVLDB 4, pp. 197–207.
  • D. Papadias, Y. Tao, G. Fu, and B. Seeger (2005) Progressive skyline computation in database systems. ACM Transactions on Database Systems 30 (1), pp. 41–82.
  • K. Tan, P. Eng, and B. C. Ooi (2001) Efficient progressive skyline computation. In Proceedings of the 27th International Conference on Very Large Data Bases, VLDB’01, pp. 301–310.
  • Y. Tao, X. Xiao, and J. Pei (2007) Efficient skyline and top-k retrieval in subspaces. IEEE Transactions on Knowledge and Data Engineering 19 (8), pp. 1072–1088.