Machine learning (ML) and artificial intelligence (AI) are taking the world by storm. This holds for both the real world as well as science and in particular computer science. Research areas that were believed to have been researched to completion have been revisited with exciting new results showing that considerable improvements are still possibleif
In the database world, we witnessed a surge of similar re-exploration endeavors in the past five years. Notable recent examples of works in that space include cardinality estimation(Hilprecht et al., 2020; Woltmann et al., 2019), auto-tuning (Pavlo et al., 2017; Aken et al., 2017), and indexing (Kraska et al., 2018; Galakatos et al., 2019; Ding et al., 2020; Ferragina and Vinciguerra, 2020; Kipf et al., 2020). From these three areas, we believe that indexing is the most surprising result because both cardinality estimation and auto-tuning are optimization problems and thus have a natural proximity to machine learning. The connection to indexing becomes evident when we examine a special case of indexing.
Problem Statement: Given a sorted, densely packed array of keys and a query asking for a particular key that may or may not exist in that array, return the array index of that key.
In other words, we are looking for a function that assigns to each key its position in the sorted array. Traditionally, this function is implemented by a suitable algorithm like binary search or a data structure like a B-tree. In contrast, Kraska et al. (Kraska et al., 2018) observe that this function can be learned through regression, effectively making the indexing problem a machine learning task.
Goals: We pursue the following objectives with this paper.
Provide the first inventor-independent analysis of RMIs.
Provide our own and complete implementation of RMIs.
Understand when and why RMIs outperform other indexes.
Develop a clear guideline for database architects when to use and how to configure RMIs.
Contributions: We make the following contributions to achieve these goals:
. What is their core idea? How to perform an index lookup? How to build an RMI? Which hyperparameters must be taken into account? (Section 2)
(2) The Learned Indexing Landscape. We provide a detailed overview on the design dimensions of learned indexes and the already large body of work in that space. (Section 3)
Our experimental analysis consists of four parts (see Section 4 for the setup):
(3) Predictive Accuracy Analysis. First, we analyze the predictive power of RMIs. How well can an RMI predict the position of a key in a sorted array? What is the impact of hyperparameters such as the model types and layer sizes on the prediction quality and the index size? (Section 5)
(4) Lookup Time Analysis. Second, we provide an in-depth analysis of the end-to-end lookup time of RMIs. We investigate a large space of different model configurations for predicting the position as well as error bounds and search algorithms for correction prediction errors. We show that while the choice of hyperparameters can have a huge impact on the lookup performance, in many cases a single configuration outperforms the others. (Section 6)
(5) Built Time Analysis. Third, we investigate the build time of RMIs. How do the many hyperparameters like the model types, layer size, or type of error bounds affect the build time of RMIs? What is the impact of the data distribution? We also discuss how to improve the original training algorithm. (Section 7)
(5) Comparison to Other Indexes. Fourth, we provide an end-to-end comparison in terms of lookup time and build time with other learned like ALEX (Ding et al., 2020), PGM-index (Ferragina and Vinciguerra, 2020), and RadixSpline (Kipf et al., 2020), as well as state-of-the-art index structures like B-trees (Bayer and McCreight, 1970), ART (Leis et al., 2013), and HIST-Tree (Crotty, 2021). (Section 8)
(6) Guideline We conclude by presenting a clear guideline for database architects when to use which learned index and when to use a traditional index. (Section 9)
2. Recursive Model Indexes
In this section, we briefly recap recursive model indexes. Besides the core idea of RMIs, we explain lookup procedure, training algorithm, and the hyperparameters of RMIs.
2.1. Core Idea
RMIs are based on the observation that the position of a key in a sorted array can be computed using the cumulative distribution function (CDF) of the data. Let be a dataset consisting of keys. Further, letbe the CDF of . Then, the position of each key in the sorted array is computed as:
Note that in the context of learned indexes, the term CDF is frequently used synonymously for a mapping from key to position in the sorted array instead of its statistical definition of a mapping from key to the probability that a random variable will take a value less than or equal to that key. In the following, we submit to the former interpretation.
The core idea of an RMI is to approximate the CDF of a dataset by means of a hierarchical, multi-layer model. Consider Figure 1 for an example three-layer RMI. Each model in an RMI approximates a segment of the CDF, all models of a layer together approximate the entire CDF. An RMI is a directed acyclic graph (DAG), i.e. in contrast to a tree, a node (or model) in an RMI may have multiple direct predecessors. We denote the -th layer of a -layered RMI by where and refer to the -th model of the -th layer by . The first layer of an RMI always consists of a single root model . Each subsequent layer may consist of an arbitrary number of models. The number of models of a layer is denoted by and called the size of the layer.
2.2. Index Lookup
A lookup in an RMI consists of two steps: (1) Prediction: We evaluate the RMI on a given key yielding a position estimate. (2) Error Correction: We search the key in the area around the estimated position in the sorted array to compensate for estimation errors. In the following, we discuss both steps in more detail.
Prediction. Consider again Figure 1 that shows an index lookup for key 42. We start by evaluating the root model on key 42 yielding a position estimate. Based on this estimate, model in the next layer is chosen for evaluation. This iterative process is continued until the position estimate pos of the last layer is obtained.
Definition 2.1 (RMI Prediction).
Let be a -layer RMI trained on dataset consisting of keys. Let us denote the value restricted to the interval by
The predicted position for key of the -th layer is recursively defined as
Intuitively, to determine the model in layer that is evaluated on key , the estimate of the previous layer is scaled to the size of the current layer. Note that might be less than 0 or greater than . Thus, the result is restricted to to evaluate to a valid model index. The predicted position for key of RMI is the output of layer :
Error correction. Based on the estimate obtained by evaluating the RMI, the sorted array is searched for the key. In Figure 1, the position estimate for key 42 points to key 37 in the sorted array. Since 3742, we have to search to the right of 37 to find 42. To facilitate the search, an RMI may store error bounds that limit the size of the interval that has to be searched. The RMI guarantees that if a key is present, then it can be found within the provided error bounds. A simple way of achieving this is to store the maximum absolute error err of the RMI. The left and right search bounds, i.e. the error interval, is set to . If the key exists, it must be within these bounds. We search the interval for key using an appropriate algorithm like binary search.
2.3. Training Algorithm
The goal of the training process is to minimize the prediction error. The training algorithm is shown in Listing 1. Its core idea is to perform a top-down layer-wise bulk loading. We start by assigning all keys to the root model (line 6). Then, the root model is trained on those keys (line 9). Afterwards, the keys are assigned to the next-layer models based on the root model’s estimates (lines 11–13). We proceed by training the models of the next layer on the keys that were assigned to the models. This process is repeated for each layer until the last layer has been trained. If the RMI should provide error bounds, the error bounds need to be computed on the trained RMI (after line 13).
RMIs offer a high degree of freedom in configuration and tuning. In the following, we briefly describe each configuration parameter. We provide a set of possible configurations for each parameter inSection 4.2 when describing the experimental setup.
Model types are crucial to the predictive quality of RMIs. While simple models, e.g., linear regression, are small in size and fast to train and evaluate, more complex models, e.g., neural networks, might offer higher accuracy.
Layer count. The number of layers determines the depth of an RMI. While a deeper RMI might provide a more accurate estimate, deeper RMIs are larger in size and take longer to train and evaluate.
Layer sizes. The size of a layer defines the number of models in that layer. A higher number of models usually leads to more accurate predictions since each model has to cover less keys.
Error bounds. Error bounds facilitate the error correction by limiting the size of the interval that has to be searched. Error bounds can be chosen on different granularities or be omitted altogether.
Search algorithm. Depending on the error bounds, several search algorithms may be applied to perform error correction, e.g., binary search, linear search, or exponential search.
3. Related Work
has caused both excitement and criticism within the database community. Although learned indexes opened up a new line of research, some people were skeptical. Early criticism focused on the lack of efficient updates, the relatively weak baselines, and the absence of an open-source implementation(Neumann, 2017; Bailis et al., 2018). Later, Crotty (Crotty, 2021) claimed that the performance advantages of learned indexes are primarily due to implicit assumptions on the data such as sortedness. Subsequently published learned indexes addressed some of these weaknesses (Galakatos et al., 2019; Ding et al., 2020; Ferragina and Vinciguerra, 2020; Kipf et al., 2020).
Although RMI is the first learned index, it is frequently among the best performing indexes in experimental evaluations (Kipf et al., 2019; Marcus et al., 2020a; Ferragina and Vinciguerra, 2020; Kipf et al., 2020; Crotty, 2021). In the following, we give an overview of existing learned indexes and experimental analyses of them.
3.1. Learned Indexes
While existing learned indexes commonly approximate the CDF, they most notably differ in the following aspects. (1) the type of model that is used to approximate the CDF, (2) whether they are trained bottom-up or top-down, and (3) whether they support updates. Table 1 gives an overview of existing learned indexes. We discuss each learned index in more detail below.
FITing-tree. FITing-tree (Galakatos et al., 2019) models the CDF using piecewise linear approximation (PLA). During training, a dataset is first divided into variable-sized segments by a greedy algorithm in a single pass over the data. The segments are created in such a way that their linear approximation, represented by the first and last key of the segment, satisfies a user-defined error bound. Segments are then indexed by bulk loading them into a B-tree. Thus, FITing-tree can be considered as a sparse B-tree with variable-sized pages. The size of a FITing-tree depends on the amount of segments that are required to satisfy the error bound. A lookup consists of three steps: (1) traversing the B-tree to find the segment that contains the key, (2) computing an estimated position based on the linear function approximating the segment, and (3) searching the key within the error bounds around the estimated position. FITing-tree offers two insert strategies. Either inserts are carried out in-place where a strict ordering is retained at all times or using a buffering strategy where each segment has a buffer that is merged with the other keys in the segment whenever the buffer is full. Unfortunately, at the time of writing, an open-source implementation of FITing-tree was not available which prevented us from including it in our experiments.
ALEX. ALEX (Ding et al., 2020) uses a variable-depth tree structure to approximate the CDF with linear models. Internal nodes are linear models, which, given a key, determine the child node. Leaf nodes hold the data, the distribution of which is again approximated by a linear model. During a lookup the tree is traversed until a leaf node is reached, then a position is predicted using the leaf’s linear model, and finally the key is searched using exponential search. Like RMI, ALEX is trained top-down, however, unlike RMI, ALEX has a dynamic structure that is controlled by a cost model that decides how to split nodes. Further, ALEX supports inserts where full leaf nodes are either split or expanded.
PGM-index. Like FITing-Tree and ALEX, PGM-index (Ferragina and Vinciguerra, 2020) also approximates the CDF by means of PLA. Similar to FITing-tree, PGM-index starts by computing segments that satisfy an error bound . However, in contrast to FITing-Tree, PGM-index creates a PLA-model that is optimal in the number of segments. Each segment consists of the smallest key in the segment and a linear function that approximates the segment with an error of at most . Afterwards, this process is continued recursively by again creating a PLA-model on the keys of the segments. The recursion is terminated as soon as a single segment is left. So unlike ALEX, each path from the root model to a segment is of equal length. A lookup is an iterative process where on each level of the PGM-index (1) a linear model predicts the next-layer segment containing the key, (2) the correct segment is searched for within the error bounds around the prediction using binary search, and (3) the process is continued for the next-layer segment until the sorted array of keys is reached. Ferragina and Vinciguerra (Ferragina and Vinciguerra, 2020) also introduce variants of PGM-index that support updates (dynamic PGM-index) and compression on the segment level (compressed PGM-index). Like FITing-tree (Galakatos et al., 2019), the size of PGM-index depends on the number of segments required to satisfy the error bound.
RadixSpline. In contrast to the aforementioned learned indexes, RadixSpline (Kipf et al., 2020)
approximates the CDF using a linear spline. The linear spline is fit in a single pass over the data and guarantees a user-defined error bound. The resulting spline points are inserted into a radix table that maps keys to the smallest spline point with the same prefix. The size of the radix table depends on the user-defined prefix length. A lookup in a RadixSpline consists of the following steps: (1) finding the spline points surrounding the lookup key using the radix table, (2) performing linear interpolation between the spline point to obtain an estimated position, and (3) applying binary search in the error interval around the estimated position to find the key. Like RMI, RadixSpline has a fixed number of layers and does not support updates.
3.2. Experiments and Analysis
With the Search On Sorted Data (SOSD) benchmark, Kipf et al. (Kipf et al., 2019) propose a benchmarking framework for learned index structures. Besides providing a variety of index implementations, both learned and traditional, the framework introduces four real-world datasets. In their preliminary analysis, the authors conclude that RMI and RadixSpline were able to outperform traditional indexes including ART (Leis et al., 2013), FAST (Kim et al., 2010), and B-trees while being significantly smaller in size. They also stated that the lack of efficient updates, long building times, and the need for hyperparameter tuning are notable drawbacks of learned indexes.
Marcus et al. (Marcus et al., 2020a) conduct a more detailed experimental analysis of learned indexes based on the framework and datasets from SOSD (Kipf et al., 2019). The authors perform a series of experiments to explain the superior performance of learned indexes. They conclude that a combination of cache misses, branch misses, and number of instructions accounts for most of the improved performance compared to traditional indexes. Further, the authors show that learned indexes are pareto-optimal in terms of size and lookup performance independently of dataset, dataset size, and key size.
Marcus et al. (Marcus et al., 2020b) provide an open-source implementation of RMIs along with an automatic optimizer. For a given dataset, the optimizer computes a set of pareto-optimal configurations (model types and second layer size) with respect to lookup time and index size of two-layer RMIs using grid-search. Instead of automatically optimizing RMIs, in our work, we aim to understand the implications of each hyperparameter. Inventors of the RMI were involved in all three aforementioned publications. To the best of our knowledge, we conduct the first independent analysis of RMIs.
Ferragina et al. (Ferragina et al., 2020) take a theoretical approach at understanding the benefits of learned indexes, specifically of indexes based on PLA. The authors show that for a number of distributions, PGM-index (Ferragina and Vinciguerra, 2020), while achieving the same query time complexity as B-trees, offers improved space complexity. To support their theoretical results, Ferragina and Vinciguerra (Ferragina and Vinciguerra, 2020) conduct several experiments both on synthetic and real-world datasets. The theoretical results build a solid foundation for further research. However, since RMIs are neither limited to PLA nor do RMIs aim to construct the optimal number of segments (like PGM-index does), the results cannot be transferred to RMIs.
Our work focuses on the implications of each hyperparameter in RMIs. Besides the type of models and the size of the second layer, we also investigate error bounds and search algorithms and conduct the first inventor-independent analysis of RMIs.
4. Experimental Setup
In this section, we introduce the implementation, hyperparameters, datasets, and workload used in our experiments and indexes considered for comparison. All experiments are conducted on a Linux machine with an Intel® Xeon® CPU E5-2620 v4 (2.10 GHz, 20 MiB L3) and 4x8 GiB DDR4 RAM running Arch Linux 5.12.4. Our code is compiled with clang-11.1. using optimization level -O2 and executed single-threaded.
Our implementation of RMIs is purely written in C++. RMI classes have a fixed number of layers and model types are passed as template arguments. This implies that all models in a layer are of the same type. Training algorithms of the model types are adapted from the open-source Rust implementation (Marcus et al., 2020b). When assigning keys to the next-layer models, the open source-implementation always copies keys to a new array. We optimized the training process based on the observation that the models considered here are monotonic and will never create overlapping segments. Thus, when assigning keys to next-layer models, we simply store iterators on the sorted array of the first and last key of each segment. We then train the next-layer models by passing them the respective iterators and thereby avoid copying keys. Further, instead of training all models on a mapping from key to position in the sorted array, we train inner layers on a mapping from key to next-layer model index which is obtained by scaling the position to the size of the next layer similar to Equation (3). In other words, we train inner layers directly on a targeted equi-width segmentation. This approach saves a multiplication and division during lookup that are otherwise required for computing the model index from the position estimate.
In the following, we give a detailed list of hyperparameter configurations evaluated in our experiments.
Model types. We consider the model types listed in Table 2. These models are most frequently recommended by the automatic optimizer of the open-source implementation (Marcus et al., 2020b). Linear regression (LR) is a linear model that minimizes the mean squared error. Linear spline (LS) and cubic spline (CS) fit a linear respectively a cubic spline segment through the leftmost and rightmost data points. Radix (RX) eliminates the common prefix and maps keys to their most significant bits. For the last layer, we only consider LR and LS. We plan to consider more model types as future work.
Layer count. It was previously reported that in most cases two layers are sufficient to accurately approximate a CDF (Marcus et al., 2020b, a). Thus, we also limit ourselves to two-layer RMIs. We plan to explore RMIs with more than two layers as future work.
Layer size. We vary the size of the second layer between and in power of two steps.
Error bounds. We consider five different variants of error bounds listed in Table 3
. The approaches differ in two properties: (1) Error bounds might be computed on different granularities, either for each last-layer model individually (local) or for the entire RMI (global). Global bounds, while being extremely memory efficient, are prone to outliers as the single largest error determines the search interval size of all lookups. Local bounds are more robust against outliers as an outlier only affects the respective model. However, depending on the layer size, local bounds significantly increase the memory footprint of the RMI. (2) We can either store the maximum absolute error (absolute) or both the maximum positive and negative error individually (individual). While the former is again more space efficient, the latter allows for tighter bounds, especially if a model tends to either overestimate or underestimate the actual position. Additionally, we might not store any bounds (NB). Both local individual (LInd) and NB were suggested byKraska et al. (Kraska et al., 2018), local absolute (LAbs) is the default configuration of the open-source implementation (Marcus et al., 2020b).
|LInd||Local Individual||Model||max +/- error||(Kraska et al., 2018)|
|LAbs||Local Absolute||Model||max abs error||(Marcus et al., 2020b)|
|GInd||Global Individual||RMI||max +/- error|
|GAbs||Global Absolute||RMI||max abs error|
|NB||No Bounds||-||none||(Kraska et al., 2018)|
Search algorithm. The evaluated search algorithms are listed in Table 4. We generally distinguish between two types of search algorithms: (1) search algorithms that only consider the error bounds and (2) search algorithms that also utilize the estimated position (model-biased) (Kraska et al., 2018). Standard binary search is an example of the first type of search algorithm. We search the key in the interval between the two error bounds and ignore the position estimate. However, binary search can be made model-biased by slightly adjusting it. Instead of choosing the middle element of the interval as first comparison point, we choose the estimated position. While this small adjustment does not change the complexity of the search, in practice it sometimes yields slightly faster lookups. Similarly, linear search and exponential search can be tweaked to become model-biased. Instead of searching the interval from left to right, we start the search from the estimated position and search to the left or right, depending on whether the estimated position was overestimated or underestimated. The search is stopped either if we reach an error bound or the key cannot be found anymore. Initially, we also considered standard linear search and exponential search for our experiments but both always performed worse than their model-biased counterparts. Note that not all combinations of error bounds and search algorithms make sense, e.g., in the case of absolute error bounds, model-biased binary search and standard binary search are essentially the same as the estimate will be the center of the interval anyways. Further, model-biased linear and exponential search do not require error bounds to work.
Learned indexes are know to adapt well to artificial data sampled from statistical distributions (Marcus et al., 2020a). To make our experimental evaluation more realistic, we use four real-world datasets from the SOSD benchmark (Kipf et al., 2019). Each dataset consists of 200M 64-bit unsigned integer keys. The CDFs of the four datasets are depicted in Figure 2, zoom-ins show a segment of 100 consecutive keys indicating the amount of noise in the dataset.
books: keys represent the popularity of books on Amazon.
fb: keys represent Facebook user ID. This dataset contains 21 outliers at the upper end of the key space that are several orders of magnitude larger than the rest of the keys.
osmc: keys represent cell IDs on OpenStreetMap. This dataset has clusters that are artifacts of projecting two-dimensional data into one-dimensional space (Marcus et al., 2020a).
wiki: key represents edit timestamps on Wikipedia.
For the lookup performance, we consider lower bound queries, i.e. for a given key, the index returns an iterator to the smallest element in the sorted array that is equal to or greater than the key. The sorted array is kept in memory. We always perform three independent runs of 20M lower bound lookups, the keys of which are sampled from the sorted array uniformly at random with a fixed seed. Reported execution times are the average execution time of the median run. As a checksum, we sum up the returned positions.
We compare RMIs in terms of lookup time and build time against a number of indexes listed in Table 5. Note that due to the workload, we only consider indexes that support lower bound queries. In the following, we give a brief overview of the indexes.
|ALEX (Ding et al., 2020)||Learned||sparsity||(2)|
|PGM-index (Ferragina and Vinciguerra, 2020)||Learned||max error||(22)|
|RadixSpline (Kipf et al., 2020)||Learned||radix width, max error||(23)|
|B-tree (Bayer and McCreight, 1970)||Tree||sparsity||(25)|
|Hist-Tree (Crotty, 2021)||Tree||num bins, max error||(5)|
|ART (Leis et al., 2013)||Trie||sparsity||(24)|
Learned indexes. ALEX (Ding et al., 2020), PGM-index (Ferragina and Vinciguerra, 2020), and RadixSpline (Kipf et al., 2020) are learned indexes discussed in Section 3.1. The index size of PGM-index and RadixSpline is varied based on the maximum error. Additionally, RadixSpline provides a parameter to adjust the size of the RadixTable that is used to index the spline points. Since we do not consider update performance here, we use the standard variant of PGM-index that does not support updates. ALEX does not provide any parameters itself, so we vary its size by adjusting the number of keys that are inserted (sparsity), i.e. we only insert every -th key.
Trees and tries. B-tree (Bayer and McCreight, 1970), ART (Leis et al., 2013), and the recently published Hist-Tree (Crotty, 2021) are traditional in-memory index structures. Like ALEX, we vary the size of B-tree and ART by adjusting the number of keys that are inserted. Therefore, we use the implementation of ART that supports lower bound queries from SOSD (Kipf et al., 2019). Hist-Tree provides two tuning parameters: the number of bins determines the size of a node in the tree and the maximum error defines a threshold for the size of a terminal node. We use an implementation of a compact Hist-Tree that does not support updates in favor of lookup performance.
Binary search. We also consider standard binary search over the sorted array without any index as provided by std::lower_bound.
5. Predictive Accuracy Analysis
In this section, we analyze the impact of hyperparameters on the predictive accuracy of a two-layer RMI. Our analysis is divided into three parts.
Segmentation (Section 5.1): We investigate how root models of different types divide the keys into segments.
Prediction (Section 5.2): We analyze how models of different types approximate the segments of the CDF.
Error bounds (Section 5.3): We examine how different types of error bounds limit the error interval to be searched.
A two-layer RMI splits the CDF into segments by partitioning the keys based on the root model’s approximation of the CDF. The goal is to create segments that can then be accurately approximated by the second-layer models. Thus, a segmentation of keys on the root-level cannot be fully independently evaluated without considering the second layer. Nevertheless, we would like to first focus only on the root model and examine how models of different types segment the keys.
We trained a model of each type on each of the datasets and then segmented the keys based on these models for varying numbers of segments. Figure 3 shows the CDFs and the corresponding root model approximations. We observed that segmentations differ in two major properties: (1) the percentage of empty segments
, i.e. segments that do not contain any key and (2) the skewness oflarge segments, i.e. segments containing significantly more keys than others. In the following, we discuss both properties in more detail.
Empty segments. Since there is a second-layer model for every segment, empty segments increase the size of an RMI without improving the prediction accuracy. Thus, we should aim for as few empty segments as possible.
Figure 4 shows the percentage of empty segments when segmenting the keys with root models of different types. We generally observe that the percentage of empty segments increases with an increasing number of segments. LS and CS exhibit similar behavior. On smooth CDFs like books and wiki, the percentage only slightly increases. This is due to the fact that both approximations cover the entire range of positions. In both cases, CS has slightly less empty partitions because it approximates the CDF more accurately.
In contrast, LR does not always cover the entire range of positions (cf. Figure 3). For example, on books, LR never predicts the first fifth of positions resulting in the corresponding segments to remain empty. Similar behavior can be observed for the last tenth of positions on wiki. Nevertheless, the percentage hardly increases with an increasing number of segments on both datasets. Similar to LR, RX also does not cover the entire range of positions. In fact, RX often only covers a fraction of positions resulting in a high percentage of empty partitions.
Due to the clustered distribution of keys on the osmc dataset, the percentages are higher and increase more quickly since the keys are distributed over a small number of segments, irrespective of the type of root model.
Due to the few outliers that strongly affect the CDF approximation of fb, none of the root models is able to segment the keys in a meaningful way. Since the majority of keys is mapped to the same position, all of these keys are assigned to the same segment. Increasing the number of segments gradually removes the outliers from this segment but does not improve the segmentation.
Large segments. Large segments potentially follow a more complex distribution and are more difficult to approximate by the second-layer models. Therefore, large partitions may negatively affect the prediction quality of an RMI.
Figure 5 shows the number of keys that reside in the largest segment as a measure of skewness of keys among segments. For CS, LS, and RX, the size of the largest segment decreases as the number of segments increases. LS and CS mostly show similar sizes. The difference in books can be attributed to the flattening of CS at the upper end, which results in more keys being assigned to a single segment. Segments created by RX are generally slightly larger than those created by LS since RX leaves more segments empty.
For LR, the size of the largest partition remains near constant. The reason for this is that LR may produce estimates outside the range of valid positions. These out-of-range predictions are then clamped to either the first or last valid position. Clamping can be observed at the upper end of of books and osmc and the lower end of wiki in Figure 3. All keys whose prediction is clamped will be assigned to the same segment. Increasing the number of segments only decreases the size of these segments until the segments consist exclusively of keys whose prediction had to be clamped.
On fb, almost all keys reside in a single segment, regardless of the number of segments and type of the root models. This is due to the fact that the root models predict the same position for almost all keys and thus they are assigned to the same segment.
In summary, without considering the second layer of the RMI, LS and CS seem to be better suited for segmenting the keys than LR and RX because RX produces many empty segments and LR often creates large segments at the upper and lower end of the key space due to clamping.
5.2. Position Prediction
After dividing the keys into segments based on the root model, a second-layer model is trained on each segment. These models are then used to predict the position of keys in their respective segment. We analyze which model type is best suited for this task.
Given the segmentations from the root layer, we train second-layer models of types LS and LR on the segments. In Figure 6, we report the median absolute error over all keys as a measure of deviation between predicted position and actual position. We decided against reporting the mean absolute error (MAE), as it was skewed upwards due to high errors on the large partitions when segmenting with an LR model. In Figure 6, we refer to an RMI that uses RX in the first layer and LR in the second layer as RXLR.
As expected, RMIs with more segments and thus more second-layer models generally produce more accurate predictions. On both the books and wiki dataset, RMIs with more than second-layer models even achieve errors in single digits. The osmc and fb dataset are more difficult to approximate. The osmc dataset has a clustered distribution that results in a high number of empty segments, making non-empty segments larger on average. Additionally, these segments often have a significant amount of noise and cannot be approximated precisely with the models considered here. Similarly, the large prediction error of fb can also be attributed to the single large segment. The sudden drop in prediction error between and segments is due to none of the outliers being assigned to the large segment anymore. Although the distribution within that large segment is close to uniform, it still contains a considerable amount of noise that leads to the persistent high prediction error.
Comparing the different RMI configurations, RMIs with LR, LS, and CS as root model achieve similar errors while RX performs slightly worse. Regarding the second-layer models, LR always achieves lower errors than LS. This is expected since LR minimizes the mean squared error.
In summary, on datasets without outliers, we could achieve accurate predictions. Models that use RX on the first layer achieve slightly less accurate predictions. On the second layer, LR should be preferred over LS due to higher accuracy. Generally, the more segments are created, the better. On the dataset with outliers, we could not achieve accurate predictions as the keys could not be segmented in a meaningful way and the large segment contained too much noise to be approximated by a single linear model.
5.3. Error Bounds
In order to facilitate correcting prediction errors, RMIs can store error bounds which limit the size of the error interval during the actual lookup operation. In the following, we analyze how different strategies for storing error bounds affect error interval sizes.
Given RMIs with different combinations of first-layer and second-layer models and varying numbers of segments, we compute error bounds of different types on these RMIs. Figure 7 shows the median error interval size over all keys for two representative combinations of first-layer and second-layer models, i.e. the median number of keys that have to be searched during error correction. We omit fb as the size of the error interval remains constant due to inaccurate predictions. At a similar index size, an RMI with global bounds (either GAbs or GInd) has roughly twice the number of segments compared to an RMI using LInd. Nevertheless, global bounds lead to larger error intervals. The reason for this is that global bounds are less robust since a single inaccurate prediction determines the error interval size for all keys. With local bounds, this inaccurate prediction would only affect the segment that the key belongs to. In the wiki dataset, we even see that with global bounds the error interval size increases despite an increasing index size.
Comparing LInd and LAbs, both error bounds achieve similar error interval sizes. LS as a second-layer model profits from LInd as unlike LR, LS does not minimize the error and thus often tends to either underestimate or overestimate. As expected, the error interval sizes are closely related to prediction accuracy. For example, both on books and wiki, we could achieve more accurate predictions and consequently, the error intervals are smaller by up to two orders of magnitude compared to osmc.
In summary, local bounds should be preferred over global bounds as they achieve smaller error intervals for the same index size.
6. Lookup Time Analysis
In this section, we analyze the impact of hyperparameters on the lookup performance of two-layer RMIs. Our analysis is divided into three parts.
Model types (Section 6.1): We investigate the lookup performance of different combinations of model types.
Error bounds (Section 6.2): We analyze the impact of different error bounds on the lookup performance.
Search algorithms (Section 6.3): We examine how different search algorithms affect the lookup performance.
6.1. Model Types
We saw in Section 5.2 that the combination of model types impacts prediction accuracy. In the following, we will evaluate how the model types affect lookup performance.
Figure 8 compares the lookup time of different combinations of first-layer and second-layer models with varying index size. We use local absolute bounds and standard binary search for error correction. This is the default configuration used in the open-source implementation of RMIs (Marcus et al., 2020b). The dashed horizontal lines in Figure 8 is the average time for finding a key in the sorted array using binary search. We omit this line for books since all configurations are faster than binary search.
Overall, the data distribution has the largest impact on lookup performance. All combinations of models achieve the fastest lookup times on books and wiki. The first-layer model type has a larger impact on lookup performance than the second-layer model type. RMIs using RX, LS, or CS as first-layer model type achieve faster lookup times almost without exception. RX is slightly worse for index sizes below 7 MiB and 40 MiB on books and wiki, respectively. LR performs worse on wiki for index sizes above 2 MiB and on books in general. Regarding the second-layer model type, RMIs that use LR always perform better than RMIs using LS. This is expected since LR achieves smaller errors which leads to smaller error intervals.
Despite the good performance on books, osmc, and wiki, none of the RMIs is able to beat binary search on the fb dataset. Previous work reported performance similar to the other datasets on fb (Kipf et al., 2019; Marcus et al., 2020a). This performance was likely achieved by using a variant of LR that ignores the lowest and highest 0.01% of keys as this is the model type suggested by the automatic optimizer for RMIs (Marcus et al., 2020b). This approach, while effectively eliminating the outliers in fb
from the segmentation process, only works if there are at most 0.01% of outliers at either end of the key space. We did not include this model type in our evaluation because we believe that a more robust solution potentially involving outlier detection should be sought.
In summary, our observations are consistent with the error analysis in Section 5.2 suggesting that prediction accuracy is a strong indicator for lookup performance. In our experiment, LSLR and CSLR consistently achieve good lookup performance and should be the preferred choice.
6.2. Error Bounds
In Section 5.3, we showed that the type of error bounds has a significant effect on the size of the error interval that has to be searched during a lookup. In the following, we evaluate how the error bounds affect lookup performance.
Figure 9 shows the lookup times of two combinations of models with different error bounds using binary search for error correction. We omit fb as different error bounds could not improve lookup performance. RMIs with local bounds generally perform better than RMIs with global bounds. This is consistent with our observation from Section 5.3 that at similar index size, local bounds lead to smaller error intervals. Nevertheless, binary search mitigates differences in search interval size drastically, e.g., global and local bounds perform almost identical with LSLR on books, although the search interval sizes differ by more than an order of magnitude. Whether individual or absolute bounds perform better depends on the second-layer model type: LS benefits from individual bounds as LS is more prone to either overestimate or underestimate, LR performs better with absolute bounds.
In summary, local bounds always perform better than global bounds. We recommend LAbs as it works best with LR, our recommended second-layer model type.
6.3. Search Algorithms
Next, we compare the performance of different search algorithms for error correction. Figure 10 shows the lookup times of two combinations of models on books, osmc, and wiki using different search algorithms for error correction. Note that we use LInd for Bin and MBin and NB for MExp and MLin, respectively. Thus, the differences in lookup time are not only due to search time but also due to slightly different evaluation time.
Let us first consider books and wiki which are the easiest to approximate. Here, Bin and MBin only perform best for relatively small index sizes before MExp is faster, e.g., MExp is faster on books when using LSLR of size larger than 0.1 MiB. The reason for this is that binary search relies on error bounds that are determined by the worst prediction in the respective segment. However, many predictions are more accurate, so that a few steps of exponential search are sufficient to find the key in a significantly smaller search interval. MLin beats Bin and MBin only if the prediction error is in the low single digits, otherwise MLin performs worse.
Let us consider osmc next which is harder to approximate. Here, Bin and MBin always achieve the fastest lookup times. MExp is slightly slower but the relative difference decreases with increasing index size. This is due to the fact that predictions are less accurate (cf. Section 5.2) and thus more steps are required to determine the search interval for MExp. In our experiments, MLin barely beats the baseline of searching the entire sorted array using binary search once the median error drops below 100 but due to imprecise predictions requires to many steps to beat any of the other searches.
In summary, Bin and MBin consistently achieve fast lookup times. MExp is faster if the prediction error is sufficiently small.
7. Build Time Analysis
In this section, we analyze the impact of hyperparameters on the build performance of two-layer RMIs. Recall that the build process of a two-layer RMI consists of four steps: (1) training the root model, (2) creating segments based on the root model, (3) training the second-layer models, and (4) computing error bounds over all keys. Figure 11 shows build times for different configurations of RMIs on the books dataset. In the following, we discuss each hyperparameter that affects build time individually.
Root model type. Consider Figure (a) for a build time comparison of different root model types. Models in general and root models in particular not only differ in training time, which affects step (1), but also in evaluation time, which affects steps (2) and (4). The most notable difference between the model types in terms of training time is whether a model considers all keys, like LR, or a constant number of keys, like LS, CS, and RX. Since the evaluation time of LR and LS is the same, the difference in build time in Figure (a) can be attributed entirely to the training time of the root model. Like LS, RX also considers only two keys for training. Here, the faster build time of RX is caused by the faster evaluation of RX during segmentation. CS is faster than LR because it again only considers a constant number of keys but slower than LS. This is due to two reasons: The evaluation of CS takes longer than LS and we train an additional LS to potentially fall back to in case it has a lower prediction error111This is also done by the open-source implementation (Marcus et al., 2020b)..
Second-layer model type. Consider Figure (b) for a build time comparison of different second-layer model types. Analogously to the root model type, the second-layer model type affects training time and evaluation time. Regarding training time, second layers consisting of LS models takes about two seconds less to train than second layers consisting of LR models. In this example, the second layer is never evaluated because we do not compute bounds. Otherwise, evaluation time would be the same for LR and LS as both are linear models.
Error bounds. Consider Figure (c) for a build time comparison of different error bounds. Computing bounds requires evaluating the RMI on every key plus the actual computation of the bounds. This additional effort explains the difference in built time between NB and configurations with bounds. The difference between individual configurations with bounds is mainly due to branch misses during the calculation of the bounds. For indexes of similar size, local bounds trigger more branch misses than global bounds and individual bounds trigger more branch misses than absolute bounds.
Index size. Consider again the RMI configuration without bounds in Figure (c). The build time remains almost constant as long as the entire RMI fits in cache. Once the RMI no longer fits in cache, the build time increases due to cache misses. Next, consider the configurations with bounds in Figure (c). Here, the previously described branch misses and cache misses add up, which is why the build time already increases for configurations that are smaller than the cache size.
In summary, differences in build times can be attributed to varying training and evaluation times of models, branch misses when computing bounds, and cache misses for RMIs that do not fit into cache. Our optimization of the segmentation process for monotonous models described in Section 4.1 improves the build time of RMIs by factor 2x compared to previous reports (Marcus et al., 2020a).
8. Comparison Against Other Indexes
In this section, we compare RMIs against a number of indexes in terms of lookup time, build time, and index size. The section also serves the purpose of demonstrating that our implementation of RMIs is competitive. We consider the indexes introduced in Section 4.5 and vary the parameters listed in Table 5 to obtain indexes of different sizes. For simplicity, the RMI used in this comparison has a fixed configuration (LSLR with LAbs) and we only vary the size of the second layer. This RMI configuration achieved optimal or near-optimal lookup performance in our previous experiments.
8.1. Lookup time
Lookup times with respect to index size are shown in Figure 12. During a lookup, each index yields a search range, either through error bounds or level of sparsity. We use binary search to find keys in that search range. Note that both Hist-Tree and ART did not work on wiki.
At index sizes beyond 100 MiB, Hist-Tree outperforms the other indexes. Depending on the dataset, Hist-Tree performs best at index sizes between 500 MiB and 2.2 GiB. Since there is no single best configuration across all datasets, hyperparameters must be tuned to achieve optimal performance. Learned indexes perform best at index sizes of up to 100 MiB. Like Hist-Tree, the performance of learned indexes is impacted by the data distributions. Each of the learned indexes except ALEX performs best on at least one of the datasets. PGM-index and ALEX are more robust to the data distribution but PGM-index outperforms ALEX on every dataset at index sizes below 100 MiB. RMI and RadixSpline work best on datasets with a smooth CDF (books, wiki
). Since the maximum error on PGM-index and RadixSpline is capped, for a fixed configuration, each lookup takes roughly the same time. In contrast, the estimation error of RMIs might vary greatly between segments inducing a noticeable variance in lookup times222We tried to accurately measure the variance in lookup times for RMIs but due to caching effects were not able to do.. B-tree is a general purpose data structure that does not make any assumption about the data. Therefore, the performance of B-tree is independent of the dataset. Nevertheless, B-tree was barely able to beat binary search in our experiments. ART is always faster than a B-tree of similar size. However, ART is significantly slower than learned indexes while being up to two orders of magnitude larger in size.
Figure 13 shows the best lookup time of each index on the books and osmc dataset. The lookup time is divided into evaluation time (evaluating the model or traversing the tree) and search time (performing error correction or finding the key within a data page). We see that there is a trade-off between fast evaluation and fast search. RMIs clearly prioritize fast evaluation: The evaluation leads to the correct segment in a fixed number of steps but the RMI does not provide any guarantees on the prediction accuracy. Adding additional segments continuously improves the lookup performance because they hardly increase the evaluation time while improving the search time. In contrast, PGM-index and RadixSpline prioritize fast error correction: They cap the maximum error at the cost of a slower evaluation that requires traversing additional layers or performing searches to find the correct segment. At a certain point, the improved search time of a smaller maximum error does not compensate the longer evaluation time and the lookup performance decreases. This implies that determining the optimal index size of other learned indexes is more complicated than for RMIs.
8.2. Build Time
Build times with respect to index size are shown in Figure 14. Note that for a fair comparison, the build times do not include the time to prepare the data for a particular index.
At any index size, B-tree provides the fastest build times closely followed by ART. Both B-tree and ART provide consistent build times across datasets. In order to vary the index size of B-tree and ART, both indexes are only built on a subset of the data thus simulating a sparse index. With an increasing number of keys, the structure of the indexes becomes more complex, e.g., the B-tree introduces more levels, and the build time increases. Similarly, ALEX is also built on a subset of the dataset. However, with an increasing number of keys, the build time of ALEX increases significantly. ALEX not only learns the distribution of the data but actually stores the key/position pairs in data nodes. With an increasing number of keys, the depth of ALEX also increases leading to higher build times. In contrast, RMI, PGM-index, and RadixSpline are always built on the entire dataset. This means that their build times are higher from the outset. RMI and RadixSpline have a fixed number of layers. Therefore, the build time increases only slightly with an increasing index size and their build time is hardly impacted by the data distribution. PGM-index, on the other hand, has a variable number of layers. In order to achieve low prediction errors, more layers have to be trained leading to a steeper increase in build times for larger index sizes. Overall, RadixSpline is faster to build than RMI and PGM-index. Hist-Tree exhibits similar build times to the learned indexes. However at sizes beyond 100 MiB its built time quickly increases. Again, this is mainly due to the increasing depth that is required to achieve the desired accuracy.
Overall, it can be stated that the benefits of learned indexes come at the cost of increased build times. Thus, the improvement of build times should be a priority of future work.
9. Conclusion and Guideline
We summarize our findings by suggesting a default configuration for RMIs and providing use cases for the evaluated indexes.
9.1. How to configure RMIs?
Although RMIs provide numerous hyperparameters, the following configuration should provide good performance in most cases.
Model types. If the dataset does not contain outliers, the root model has surprisingly low impact on the overall performance. We recommend using spline models, preferably LS. For the second layer, LR always performs better than LS. LS should only be chosen, if faster training is prioritized.
Layer size In our experiments, increasing the size of the second layer only ever improved lookup performance. We suggest a layer size of at least 0.01% of the number of keys in the dataset.
Error bounds. At similar index size, local bounds should always be preferred over global bounds. Whether individual or absolute bounds are used hardly impacts performance. No bounds should only be considered, if the model is extremely accurate (median prediction errors in the low tens).
Search algorithm. If bounds are available, binary search is to be preferred. Otherwise, model-biased exponential search reliably provides good performance.
9.2. When to use which index?
Hist-Tree should be chosen, if lookup performance is the main priority and both a large index size and comparably high build times are acceptable. A sparsely populated ART does not provide the same lookup performance as learned indexes but is very robust against data distributions and offers very low build times. Learned indexes, in general, offer excellent lookup performance at smaller index sizes. RadixSpline offers the best balance between build time and lookup time. ALEX is the fastest in terms of build time. PGM-index is the most robust against data distributions. RMI offers the best lookup performance on smooth CDFs.
- Automatic database management system tuning through large-scale machine learning. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017, S. Salihoglu, W. Zhou, R. Chirkova, J. Yang, and D. Suciu (Eds.), pp. 1009–1024. Cited by: §1.
-  (Website) External Links: Cited by: Table 5.
- External Links: Cited by: §3.
- Organization and maintenance of large ordered indexes. pp. 107–141. Cited by: §1, §4.5, Table 5.
-  (Website) External Links: Cited by: Table 5.
-  (Website) External Links: Cited by: Table 5.
- Hist-tree: those who ignore it are doomed to learn. In 11th Conference on Innovative Data Systems Research, CIDR 2021, Virtual Event, January 11-15, 2021, Online Proceedings, Cited by: §1, §3, §3, §4.5, Table 5.
- ALEX: an updatable adaptive learned index. In Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, D. Maier, R. Pottinger, A. Doan, W. Tan, A. Alawini, and H. Q. Ngo (Eds.), pp. 969–984. Cited by: §1, §1, §3.1, §3, §4.5, Table 5.
- Why are learned indexes so effective?. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Proceedings of Machine Learning Research, Vol. 119, pp. 3123–3132. Cited by: §3.2.
- The pgm-index: a fully-dynamic compressed learned index with provable worst-case bounds. Proc. VLDB Endow. 13 (8), pp. 1162–1175. Cited by: §1, §1, §3.1, §3.2, §3, §3, §4.5, Table 5.
- FITing-tree: A data-aware index structure. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, P. A. Boncz, S. Manegold, A. Ailamaki, A. Deshpande, and T. Kraska (Eds.), pp. 1189–1206. Cited by: §1, §3.1, §3.1, §3.
- DeepDB: learn from data, not from queries!. Vol. 13, pp. 992–1005. Cited by: §1.
- FAST: fast architecture sensitive tree search on modern cpus and gpus. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6-10, 2010, A. K. Elmagarmid and D. Agrawal (Eds.), pp. 339–350. Cited by: §3.2.
- SOSD: A benchmark for learned indexes. Vol. abs/1911.13014. Cited by: §3.2, §3.2, §3, Figure 2, §4.3, §4.5, §6.1.
- RadixSpline: a single-pass learned index. In Proceedings of the Third International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, aiDM@SIGMOD 2020, Portland, Oregon, USA, June 19, 2020, R. Bordawekar, O. Shmueli, N. Tatbul, and T. K. Ho (Eds.), pp. 5:1–5:5. Cited by: §1, §1, §3.1, §3, §3, §4.5, Table 5.
- The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018, G. Das, C. M. Jermaine, and P. A. Bernstein (Eds.), pp. 489–504. Cited by: §1, §1, §1, §3, §4.2, §4.2, Table 3, Table 4.
- The adaptive radix tree: artful indexing for main-memory databases. In 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 8-12, 2013, C. S. Jensen, C. M. Jermaine, and X. Zhou (Eds.), pp. 38–49. Cited by: §1, §3.2, §4.5, Table 5.
- Benchmarking learned indexes. Proc. VLDB Endow. 14 (1), pp. 1–13. External Links: Cited by: §3.2, §3, §4.2, §4.3, §4.3, §6.1, §7.
- CDFShop: exploring and optimizing learned index structures. In Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, D. Maier, R. Pottinger, A. Doan, W. Tan, A. Alawini, and H. Q. Ngo (Eds.), pp. 2789–2792. Cited by: §3.2, §4.1, §4.2, §4.2, §4.2, Table 3, §6.1, §6.1, footnote 1.
- External Links: Cited by: §3.
- Self-driving database management systems. In 8th Biennial Conference on Innovative Data Systems Research, CIDR 2017, Chaminade, CA, USA, January 8-11, 2017, Online Proceedings, Cited by: §1.
-  (Website) External Links: Cited by: Table 5.
-  (Website) External Links: Cited by: Table 5.
-  (Website) External Links: Cited by: Table 5.
-  (Website) External Links: Cited by: Table 5.
- Cardinality estimation with local deep learning models. In Proceedings of the Second International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, aiDM@SIGMOD 2019, Amsterdam, The Netherlands, July 5, 2019, R. Bordawekar and O. Shmueli (Eds.), pp. 5:1–5:8. Cited by: §1.