1 Introduction
Top-k queries were the first solution proposed for the problem of finding the best results in a dataset. In many applications, users want to access or visualize only the most relevant results from an otherwise large set of tuples. Top-k queries identify these results through a scoring function defined by the preferences of the user: the better the score, the higher an object ranks in the output. Top-k queries can be classified along different dimensions, such as Query Model, Data Access Methods, Implementation Level, Data and Query Uncertainty, and Ranking Function [ilyas2008survey].
Sometimes, however, it is not possible to compute the best k results. For example, let’s assume that we want to book a train between Milan and Rome and we are looking for a cheap ticket on a high-speed train. As we might expect, high-speed trains tend to be more expensive than normal ones, so the two requirements conflict and such tickets cannot properly be classified as cheap. The purpose of skyline queries [skylineoperator] is to solve this problem by displaying the most interesting results to the end user. Skylines are based on the concept of dominance: a point dominates another point if it is as good or better in all dimensions and strictly better in at least one. Points that are not dominated by any other point make up the skyline.
Finally, flexible skylines or F-skylines [ciaccia2017reconciling] (called R-skylines in the original paper) are an evolution of traditional skyline queries. F-skylines use the same concept of dominance as normal skylines but apply it to a set of scoring functions, thus allowing greater flexibility than the techniques previously described.
To fully understand flexible skylines, we must first introduce the concept of F-dominance. A generic tuple s F-dominates another tuple z if, for every scoring function taken into consideration, the tuple s is no worse than z. Giving a more precise definition:
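Def 1.1: Let $\mathcal{F}$ be a set of monotone scoring functions. A generic tuple $s$ F-dominates another tuple $z$, indicated by $s \prec_{\mathcal{F}} z$, if $\forall f \in \mathcal{F}.\ f(s) \le f(z)$ (bearing in mind that if $f(s) \le f(z)$, $s$ has a better or equal result than $z$).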
Flexible skylines use two operators, both subsets of the skyline: ND, the set of non-F-dominated tuples, and PO, the set of potentially optimal tuples, which returns the best tuples (i.e., top-1) according to some scoring function in $\mathcal{F}$. It is important to note that PO is always a subset of ND.
Other notable approaches that try to evolve the traditional paradigms and bridge top-k and skylines are ORD/ORU [mouratidis2021marrying] and UTK (Uncertain Top-k Query) [mouratidis2018exact]. ORD satisfies the OSS (output-size specified) property, accepts relaxed input and is dominance-oriented. ORU shares the first two characteristics of ORD but follows a utility paradigm instead of a dominance-oriented one.
UTK has two versions. In the first one, given a set of uncertain user preferences that form an approximate description of the weight values, it reports all options that may belong to the top-k set. In the second, it additionally includes in the result set the exact top-k set for each of the possible weight settings.
In the next sections, we will discuss how these techniques can be compared, their main strengths and weaknesses and some of their applications.
2 Comparisons
As argued in Reference [freitas2004critical], each approach has its own strengths and weaknesses, which are summarised in Table 1. Top-k queries, since they are so tightly bound to user preferences, lack simplicity of formulation and the ability to display interesting results.
For example, suppose the best plane ticket is determined by the scoring function P = 0.9*Price + 0.1*NumOfStopovers. The function P could yield good results for a person whose main focus is keeping the price as low as possible. However, a businessman who flies a lot might object and, to save time, prefer to minimize the number of stopovers. This, of course, would require modifying the weights, yielding a completely different result.
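To make the effect of the weights concrete, the following minimal sketch ranks a few tickets with the scoring function P (lower is better) and then with the two weights swapped. The ticket data and the helper name `top_k` are hypothetical, and both attributes are assumed to be normalized to [0, 1] so that the weights are comparable.

```python
import heapq

# Hypothetical tickets: (name, normalized price, normalized number of stopovers).
tickets = [
    ("A", 0.4, 0.6),
    ("B", 1.0, 0.0),
    ("C", 0.6, 0.3),
    ("D", 0.2, 1.0),
]

def top_k(items, weights, k):
    """Return the names of the k items with the lowest weighted score."""
    w_price, w_stops = weights

    def score(ticket):
        _, price, stops = ticket
        return w_price * price + w_stops * stops

    return [t[0] for t in heapq.nsmallest(k, items, key=score)]

# A price-focused user and a stopover-focused user get different top-2 results.
price_focused = top_k(tickets, (0.9, 0.1), 2)
stopover_focused = top_k(tickets, (0.1, 0.9), 2)
print(price_focused, stopover_focused)
```

With this toy data the two weight settings return disjoint top-2 sets, which illustrates how strongly the output depends on the weights chosen by the user.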
As much as this structure can hinder simplicity and the display of overall interesting results, it also has many benefits. Top-k queries offer an exceptional level of control over the cardinality of the results, together with trade-offs among attributes and efficiency.
In skylines, there is no need to specify the weights of a scoring function, which makes for a more straightforward and simple-to-use paradigm. This technique also focuses strongly on providing interesting results, a task that is not achievable with traditional ranking queries. However, skylines present some weaknesses too: the result of a skyline operation may contain a large number of tuples, and skylines cannot take any form of user preferences into account.
Overall, skylines trade user preferences and efficiency for simplicity of use and the ability to show interesting results.
Flexible skylines try to bridge the two previous approaches [ciaccia2020flexible], resulting in a more balanced technique. F-skylines can, in fact, account for the different importance of various attributes. They can also model user preferences by leveraging the weights used in different scoring functions, while still displaying the overall most interesting results.
We also find that the ORD/ORU approach has many benefits compared to traditional techniques, mainly personalization, controllable output size and flexibility in preference specification. These techniques are not as simple as F-skylines but, on the other hand, offer better performance.
On the topic of alternative approaches, UTK focuses on providing the user with a practical and easy-to-use design by computing weight regions that take into account the uncertain preferences a user may have. Basically, with this approach, we forgo control over the result cardinality in order to offer a better user experience.
We will now focus on one last dimension of analysis considered in Table 1: performance. As presented in the introductory section, flexible skylines are an evolution of traditional skylines. Therefore, we will only compare the performance of top-k queries and F-skylines.
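Table 1: Strengths and weaknesses of each approach.

| Dimension of analysis | Top-k | Skyline | F-skylines | ORD/ORU | UTK |
|---|---|---|---|---|---|
| Control of result cardinality | Yes | No | By modifying constraints | Yes | No |
| Trade-off among attributes | Yes | No | Yes | Yes | Yes |
| Simplicity | No | Yes | Yes | No | Yes |
| Display of interesting results | No | Yes | Yes | Yes | Yes |
| Efficient performance | Yes | No | No | No | No |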
2.1 Performance of Top-k queries
Firstly, we introduce the notion of instance optimality. Instance optimality is a form of optimality aimed at when standard optimality is unachievable.
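Def 2.1.1: Let $\mathcal{A}$ be a family of algorithms, $\mathcal{I}$ a set of problem instances and $\mathit{cost}$ a cost metric applied to an algorithm-instance pair. Algorithm $A^{*}$ is instance-optimal wrt $\mathcal{A}$ and $\mathcal{I}$ for the cost metric $\mathit{cost}$ if there exist constants $k_1$ and $k_2$ such that, for all $A \in \mathcal{A}$ and $I \in \mathcal{I}$:
$$\mathit{cost}(A^{*}, I) \le k_1 \cdot \mathit{cost}(A, I) + k_2$$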
Top-k queries are known for their high efficiency. The most established and popular algorithms that allow ranking queries to reach this goal are Fagin’s Algorithm (FA) [fagin1998fa], the Threshold Algorithm (TA) [FAGIN2003614] and No Random Access (NRA) [FAGIN2003614]. NRA is used only when random access cannot be executed, since its performance is worse than that of FA and TA.
FA is not instance-optimal, has sublinear time complexity, and its stopping criterion is independent of the scoring function. FA has three phases. In the first phase, the algorithm executes only sorted accesses; in the second phase, only random accesses. In the third and final phase, the scores of the retrieved objects are computed with the chosen scoring function, which determines the top-k results. FA is greatly surpassed by TA, which in general performs much better because it can adapt to the specific scoring function. TA is instance-optimal and uses a threshold value, calculated by applying the chosen scoring function to the tuple composed of the last scores seen by sorted access on each ranking. TA stops as soon as k of the objects seen so far score at least as well as the threshold. Also, unlike FA, TA performs random accesses immediately after each round of sorted accesses (one for each ranking) rather than in a separate second phase.
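The behaviour of TA described above can be sketched as follows. This is a minimal in-memory illustration rather than a distributed implementation: higher scores are assumed to be better, every object is assumed to appear in every ranking, random access is simulated with dictionaries, and the function name `threshold_algorithm` and the toy data are hypothetical.

```python
import heapq

def threshold_algorithm(rankings, score, k):
    """Top-k with TA. Higher scores are better.

    rankings: m lists of (object_id, local_score), each sorted by
              local_score in descending order.
    score:    monotone aggregation function over a list of m local scores.
    Returns the top-k (object_id, overall_score) pairs and the number of
    rounds of sorted access that were needed.
    """
    lookup = [dict(r) for r in rankings]   # simulates random access
    seen = {}                              # object_id -> overall score
    depth = 0
    while depth < len(rankings[0]):
        last = []
        for r in rankings:                 # one sorted access per ranking
            obj, local = r[depth]
            last.append(local)
            if obj not in seen:            # random accesses right away
                seen[obj] = score([lk[obj] for lk in lookup])
        depth += 1
        threshold = score(last)            # best score an unseen object can have
        best = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(best) == k and best[-1][1] >= threshold:
            return best, depth
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1]), depth

# Two hypothetical rankings of the same five objects.
r1 = [("a", 9), ("b", 8), ("c", 5), ("d", 3), ("e", 1)]
r2 = [("b", 9), ("c", 7), ("a", 6), ("e", 4), ("d", 2)]
result, rounds = threshold_algorithm([r1, r2], sum, 2)
print(result, rounds)
```

On this toy input the algorithm stops after two rounds of sorted access, without ever reading the bottom three positions of either ranking.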
Since their introduction, however, many algorithms that improve on the efficiency of these baselines have been published. BPA (Best Position Algorithm) and its improved version BPA2 [akbarinia2007best] are two examples. As explained in Reference [akbarinia2007best], BPA2 can outperform TA by a factor of $(m+1)/2$, where $m$ is the number of the best positions. As evident, BPA2 becomes more efficient the larger the required output size is. These kinds of problems are not uncommon in real applications of top-k queries. For example, consider a database that holds every branch of an international bank, from which we want to view the subsidiaries with the most transactions. Since the database is very large, the number of top positions may range from a few tens to the order of thousands.
It is important to note that FA, TA, NRA and BPA2 are designed for a distributed setting, where the components of the system are located on different networked computers. Table 2 summarizes the main top-k algorithms discussed, ranked by efficiency.
Table 2: Summary of the main top-k algorithms discussed, ranked by efficiency.

| Algorithm | Data Access | Notes |
|---|---|---|
| BPA2 | Sorted and random access | Better than TA by a factor of (m+1)/2 |
| TA | Sorted and random access | Instance-optimal, can adjust to a specific scoring function |
| FA | Sorted and random access | Stopping criterion is independent of the scoring function |
| NRA | Sorted access | Instance-optimal but no exact scores |
2.2 Performance of Flexible Skylines
Before delving into F-skylines, let’s discuss two of the most famous algorithms for traditional skylines: Block Nested Loop (BNL) [skylineoperator] and Sort-Filter-Skyline (SFS) [chomicki2003skyline], both designed for centralized settings. BNL was the first algorithm introduced to compute skylines. It adopts a naive approach, applying a nested-loops algorithm that compares each tuple with all the other tuples. This, of course, results in a very inefficient algorithm, particularly for large datasets. Overall, BNL has a time complexity of $O(N^2)$, where $N$ is the number of tuples. A solution to this problem is SFS, which presorts the data to improve efficiency on big datasets. Even though SFS is much more efficient in some cases, its overall complexity is still the same as that of BNL.
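The difference between the two algorithms can be sketched as follows, assuming that lower values are better on every attribute. This is a simplified illustration: the window is kept entirely in memory (the original BNL uses a bounded window and temporary files), the attribute sum is just one possible monotone presorting function for SFS, and the data are hypothetical.

```python
def dominates(p, q):
    """p dominates q: no worse everywhere, strictly better somewhere."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def skyline_bnl(points):
    """Nested-loops skyline: each tuple is compared against the window."""
    window = []
    for p in points:
        if any(dominates(w, p) for w in window):
            continue                       # p is dominated, discard it
        # p may dominate tuples that were kept earlier, so remove them.
        window = [w for w in window if not dominates(p, w)]
        window.append(p)
    return window

def skyline_sfs(points):
    """Presort by a monotone function, then filter in one pass."""
    window = []
    # After sorting by the sum, a tuple can never be dominated by a later
    # one, so tuples in the window are never removed.
    for p in sorted(points, key=sum):
        if not any(dominates(w, p) for w in window):
            window.append(p)
    return window

# Hypothetical train tickets: (price in euros, travel time in hours).
trains = [(50, 3), (30, 5), (60, 2), (55, 4), (30, 6), (80, 2)]
print(skyline_bnl(trains))
print(skyline_sfs(trains))
```

Both functions return the same three tickets, only in a different order. The presorting saves SFS the removal step, but in the worst case (for example, when no tuple dominates any other) both still compare every pair of tuples.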
Now, we will introduce some of the algorithms initially proposed for F-skylines, SVE1F and PODI2, and then compare them to SFS.
SVE1F is an algorithm for computing ND. It is one-phased, meaning that ND is calculated directly from the database without computing the skyline beforehand. SVE1F presorts the dataset before any other operation is carried out and uses vertex enumeration to perform dominance tests. Its worst-case time complexity depends on the time complexity of the vertex enumeration, the number of tuples in the final ND set, and the number q of vertices used in the vertex enumeration.
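The idea of testing dominance on vertices can be illustrated with a minimal sketch. It is not the authors' SVE1F implementation and relies on several assumptions: scoring functions are linear (weighted sums), lower scores are better, the weights range over a convex polytope whose vertices are already known (the enumeration step is skipped), and dominance additionally requires a strictly better score on at least one vertex. Under these assumptions the score difference is linear in the weights, so checking the vertices is enough. The presort by the centroid score and all names are hypothetical choices.

```python
def f_dominates(s, z, vertices):
    """s F-dominates z for all linear scoring functions whose weight
    vector lies in the polytope spanned by the given vertices."""
    diffs = [sum(w * (a - b) for w, a, b in zip(v, s, z)) for v in vertices]
    return all(d <= 0 for d in diffs) and any(d < 0 for d in diffs)

def non_dominated(tuples, vertices):
    """One-phase ND computation: presort, then filter in a single pass."""
    # The centroid of the vertices is itself a valid weight vector, so an
    # F-dominated tuple always sorts after the tuple that dominates it.
    centroid = [sum(c) / len(vertices) for c in zip(*vertices)]
    order = sorted(tuples,
                   key=lambda t: sum(w * a for w, a in zip(centroid, t)))
    nd = []
    for t in order:
        if not any(f_dominates(s, t, vertices) for s in nd):
            nd.append(t)
    return nd

# Weights (w1, w2) with w1 + w2 = 1 and the constraint w1 >= w2:
# the polytope is the segment between these two vertices.
vertices = [(1.0, 0.0), (0.5, 0.5)]
# Hypothetical tuples; (3, 5), (5, 3) and (6, 1) form the plain skyline.
data = [(3, 5), (5, 3), (6, 1), (4, 6)]
nd = non_dominated(data, vertices)
print(nd)
```

In this toy example the constraint on the weights removes (5, 3) from the result even though it belongs to the traditional skyline, which shows how ND can be a strict subset of the skyline.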
PODI2 is an algorithm for computing PO. It is two-phased: in the first phase it calculates the ND tuples, and in the second phase it filters out all the non-PO tuples, finally outputting the PO set. Testing whether a tuple is potentially optimal requires solving a linear programming (LP) problem, which can be highly inefficient and time consuming when the dataset is very large. To mitigate this, PODI2 uses an incremental approach, solving the complete LP problem through smaller LP problems of increasing size. The worst-case time complexity depends on the time complexity of computing ND and on the time complexity of the F-dominance tests. Both SVE1F and PODI2 are designed for centralized systems.
In Reference [ciaccia2020flexible], a comparison between SFS, SVE1F and PODI2 is carried out by testing the algorithms on three different datasets:

NBA: a famous dataset with thousands of points which include different statistics on NBA players’ performance;

ANT: a synthetic dataset with anticorrelated values across different dimensions;

UNI: a synthetic dataset with uniformly distributed values.
In Table 3, we find a summary of this performance comparison, including the complexities of each algorithm. Also, in Figure 1, the performance results are graphically displayed.
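Table 3: Performance comparison of SFS, SVE1F and PODI2 on the three datasets.

| Algorithm | NBA | UNI | ANT | Complexity |
|---|---|---|---|---|
| SFS | 0.58 | 0.34 | 11.17 | |
| SVE1F | 0.65 | 0.36 | 1.54 | |
| PODI2 | 1.74 | 1.41 | 12.69 | |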
As we can see, calculating ND and PO can be a costly task that hinders performance. All the algorithms initially proposed to calculate these two operators have a higher time complexity than top-k algorithms and are therefore less efficient. This is of course a big issue when comparing F-skylines with top-k queries, as the added benefits of F-skylines may not be worth the additional computing time.
Moving again to a distributed setting, some progress has been made with the introduction of the Flexible Score Aggregation (FSA) algorithm [ciaccia2018fsa]. FSA is instance-optimal and results from combining the FA and TA algorithms introduced in Section 2.1. FSA computes a parameterized set that, when the parameter is equal to one, matches the result of ND. This algorithm is very efficient and effective, leading to a reduction in the cardinality of the result set, even for large datasets. However, this is not enough: FSA still has quadratic complexity, while top-k queries are far better in terms of raw performance.
Finally, comparing the performance of F-skylines with that of top-k queries, we can see that F-skylines cannot match the level of performance of top-k queries. Bearing in mind that flexible skylines and top-k algorithms are designed for different settings, it is clear that the main objective of these new approaches is not maximizing performance. For example, calculating PO is still really inefficient, although it offers much more interesting results than any traditional technique. The topic of the quality and content of the output data will be covered in Section 2.4.
Regarding performance, a notable observation is that the efficiency of dominance tests could be further improved using decision trees [choi2021optimization]. However, this methodology has not yet been tested with F-skylines, so it remains an open problem.
2.3 Comparisons between the new approaches
Shifting the focus from the comparison with traditional top-k, we now compare F-skylines with two other new approaches, ORD/ORU and UTK.
The main difference between ORD/ORU and the approaches in [ciaccia2020flexible] and [mouratidis2018exact] is that ORD/ORU produce an output of controllable size, respecting the OSS property, as seen in Table 1. Also, F-skylines and UTK use a fixed polytope region given in advance to establish dominance, whereas the ORD/ORU dominance region is effectively a hypersphere and usually not given in advance. Therefore, as established in [mouratidis2021marrying], ORD/ORU are hardly comparable, in terms of output data, to the other two techniques. However, focusing on performance only, ORD/ORU algorithms are much more efficient than F-skylines and UTK. ORD, in particular, is 2 to 4 orders of magnitude faster than F-skylines, while ORU is 12 to 134 times (or even more in some cases) faster than JAA, the best algorithm for fixed-region approaches, employed by UTK.
Another important difference is that F-skylines, with their operators ND and PO, generalize top-k queries with $k = 1$ fixed, while UTK and the ORD/ORU approaches allow any $k$.
2.4 Output data
In order to discuss differences in output data, we first have to introduce two measures: precision and recall. Let $T$ be a set of top-k tuples on a dataset $r$ and $S$ be the set of the tuples present in PO, ND or SKY over the same dataset $r$, where SKY is the skyline output set on $r$. We define precision as $|T \cap S| / |T|$ and recall as $|T \cap S| / |S|$.
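As a small illustration, the sketch below computes the two measures for a top-k result and a set S, taking precision as the fraction of top-k tuples that are also in S and recall as the fraction of tuples in S that are also top-k tuples. The tuple identifiers and the function name `precision_recall` are hypothetical.

```python
def precision_recall(top_k, s):
    """Precision and recall of a top-k result against a set S
    (for example PO, ND or the skyline)."""
    top_k, s = set(top_k), set(s)
    common = top_k & s
    precision = len(common) / len(top_k)   # share of top-k tuples in S
    recall = len(common) / len(s)          # share of S tuples in the top-k
    return precision, recall

# Hypothetical tuple identifiers.
top_5 = ["t1", "t2", "t3", "t4", "t5"]
nd = ["t1", "t3", "t7", "t9"]
precision, recall = precision_recall(top_5, nd)
print(precision, recall)
```

Here only two tuples are shared, so precision is 2/5 and recall is 2/4: most of the top-5 tuples are not in S, and half of S is missed by the top-5 query.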
In the context of this comparison, the aim is to assess whether the results yielded by the two techniques overlap in a significant way or present different outputs. The value of these two measures is highly dependent on how similar the top-k scoring function is to the family of scoring functions used in the F-skyline. In Table 4, we can see what precision and recall values represent in our context.
Regarding precision, experimental data gathered in [ciaccia2020flexible] shows that, for each k in a generic dataset $r$, $\mathit{precision}(\mathrm{PO}) \le \mathit{precision}(\mathrm{ND}) \le \mathit{precision}(\mathrm{SKY})$, meaning that a top-k result set is more likely to contain SKY tuples than ND or PO tuples. This means that the more we seek interesting results with F-skylines, the more the precision drops when compared with a traditional top-k approach.
Regarding recall, the experimental data of Reference [ciaccia2020flexible] shows that, when the number of tuples returned by top-k and by the F-skyline operators is equal, recall ranges between 38% and 50% in one case, while in the other it drastically decreases to 18%–38%. This means that, for top-k to reach an output quality similar to that of F-skylines, we would need an extremely large k, resulting in a huge output set.
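Table 4: Meaning of precision and recall values in this context.

| Measure | High Value | Low Value |
|---|---|---|
| Precision | Most top-k tuples are also in S | Most top-k tuples are NOT also in S |
| Recall | Most tuples in S are also top-k tuples | Most tuples in S are NOT also top-k tuples |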
3 Applications
In this section, we will present some of the main applications of both top-k queries and F-skylines. Top-k queries are used in a wide variety of fields, thanks to their efficiency and relatively straightforward approach. Two of the most classical applications of top-k are multi-criteria queries and k-nearest neighbours. An example of a multi-criteria query is ranking the best k cars of an online car dealer by combining different criteria such as mileage, price, color, etc. k-nearest neighbours, instead, is a similarity search used in machine learning: given N points in a D-dimensional space and a point Z in that space, it finds the k points closest to Z.
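The k-nearest neighbours search can be read as a top-k query whose scoring function is the distance from Z, as in the following minimal sketch. Euclidean distance, the brute-force scan and the sample points are assumptions made for illustration.

```python
import heapq
import math

def k_nearest(points, z, k):
    """Return the k points closest to z (Euclidean distance)."""
    # The distance to z plays the role of the scoring function:
    # the lower the score, the better the point.
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, z))

# Hypothetical points in a 2-dimensional space.
points = [(0, 0), (1, 1), (2, 2), (5, 5), (1, 0)]
nearest = k_nearest(points, (1, 0.4), 2)
print(nearest)
```

A real system would typically use an index structure instead of scanning all N points, but the ranking semantics are the same.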
In addition to these classical applications, top-k queries are used in a multitude of niche applications, for example privacy-preserving top-k queries [vaidya2005privacy] and reverse top-k queries [vlachou2010reverse]. The first technique aims to reconcile privacy with top-k, while the second looks at top-k queries through the lens of the product manufacturer rather than the user.
F-skylines can, of course, be used in all the fields where traditional skylines are useful and established, bearing in mind that the number of real applications is smaller because the concept was introduced more recently. However, a very important field where F-skylines could have a huge impact is Data Warehouses, where queries already take up a lot of time and where finding useful and interesting data is extremely valuable. Data Warehouses are often centralized and collect a huge amount of data from different sources, resulting in really expensive queries. However, this is accepted and common, since the main objective of Data Warehouses is not performance but extracting meaningful and interesting data to be used in a business context, which aligns particularly well with the purpose and goals of flexible skylines.
4 Conclusions
In this paper, we compared the ranking techniques with the aim of finding out whether top-k is better than flexible skylines, or vice versa.
There is no definitive answer, as it greatly depends on the dimension of evaluation. In general, we conclude that top-k is better if the main focus is on pure performance and computing capabilities. If we instead focus on usability, F-skylines offer the user a simpler approach without leaving user preferences behind. On the topic of output data, F-skylines are clearly the better alternative, giving the most interesting results in a more compact result set.
Overall, F-skylines are better in every aspect apart from performance, basically trading efficiency for greater simplicity and better output data. Notably, ORD/ORU and UTK are also better than top-k in every aspect apart from performance.