HINT: A Hierarchical Index for Intervals in Main Memory

04/22/2021 ∙ by George Christodoulou, et al.

Indexing intervals is a fundamental problem, finding a wide range of applications. Recent work on managing large collections of intervals in main memory focused on overlap joins and temporal aggregation problems. In this paper, we propose novel and efficient in-memory indexing techniques for intervals, with a focus on interval range queries, which are a basic component of many search and analysis tasks. First, we propose an optimized version of a single-level (flat) domain-partitioning approach, which may have large space requirements due to excessive replication. Then, we propose a hierarchical partitioning approach, which assigns each interval to at most two partitions per level and has controlled space requirements. Novel elements of our techniques include the division of the intervals at each partition into groups based on whether they begin inside or before the partition boundaries, reducing the information stored at each partition to the absolutely necessary, and the effective handling of data sparsity and skew. Experimental results on real and synthetic interval sets of different characteristics show that our approaches are typically one order of magnitude faster than the state-of-the-art.


1. Introduction

There is a wide range of applications that require managing large collections of intervals. Indicatively, in temporal databases (Snodgrass and Ahn, 1986; Böhlen et al., 2017), each tuple has a validity interval, which captures the period of time during which the tuple is valid. In statistics and probabilistic databases (Dalvi and Suciu, 2004), uncertain values are often approximated by (confidence or uncertainty) intervals. In data anonymization (Samarati and Sweeney, 1998), attribute values can be generalized to intervals. XML data indexing techniques (Min et al., 2003) encode label paths as intervals and evaluate path expressions using containment relationships between the intervals.

We approach the problem of indexing a large collection S of objects (or records), based on an interval attribute that characterizes each object. Hence, we model each object as a pair (s.id, s), where s.id is the object’s identifier (which can be used to access any other attribute of the object), and s = [s.st, s.end] is the interval associated to the object.

Our focus is on range queries, the most general type of queries over intervals. Given a query interval q = [q.st, q.end], the objective is to find the ids of all objects whose intervals overlap with q. Formally, the result of a range query q on an object collection S is {s.id : s ∈ S, s.st ≤ q.end and s.end ≥ q.st}. For example, in a temporal database, a timeslice query (Salzberg and Tsotras, 1999) asks for all tuples in a table that are valid at some time during a given query interval. Range queries can be specialized to retrieve intervals that satisfy any relation in Allen’s set (Allen, 1981), e.g., intervals that are covered by q. Stabbing queries (pure-timeslice queries in temporal databases) are a special class of range queries for which q.st = q.end. Without loss of generality, we assume that the intervals and queries are closed at both ends. Our methods can easily be adapted to manage intervals and/or process range queries which are open at either or both sides, i.e., of the form [st, end), (st, end], or (st, end).
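For concreteness, the overlap predicate and a naive linear-scan evaluation of a range query can be sketched in C++ as follows; the Record type and field names are illustrative and not part of our implementation, and every index discussed below aims to avoid this full scan.

#include <cstdint>
#include <vector>

struct Record {
    uint32_t id;    // object identifier s.id
    uint32_t st;    // interval start s.st
    uint32_t end;   // interval end s.end (closed)
};

// Closed intervals [s.st, s.end] and [q.st, q.end] overlap iff
// s.st <= q.end and q.st <= s.end.
inline bool overlaps(const Record& s, uint32_t qst, uint32_t qend) {
    return s.st <= qend && qst <= s.end;
}

// Naive O(|S|) range query: scan the whole collection and report qualifying ids.
std::vector<uint32_t> range_query_scan(const std::vector<Record>& S,
                                       uint32_t qst, uint32_t qend) {
    std::vector<uint32_t> result;
    for (const Record& s : S)
        if (overlaps(s, qst, qend))
            result.push_back(s.id);
    return result;
}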

Although range queries are fundamental and very popular, previous work has mainly focused on more expensive and complex queries, such as temporal aggregation (Kline and Snodgrass, 1995; Moon et al., 2003; Kaufmann et al., 2013b) and interval joins (Dignös et al., 2014; Piatov et al., 2016; Bouros and Mamoulis, 2017; Bouros et al., 2021; Böhlen et al., 2017; Cafagna and Böhlen, 2017). For range and stabbing queries, classic data structures for managing intervals are typically used, like the interval tree (Edelsbrunner, 1980). Still, these methods have not been optimized for handling very large collections of intervals in main memory. Hence, there is room for the design and use of new data structures, which exploit the characteristics and capabilities of modern hardware. Domain-partitioning schemes have been proposed to facilitate interval join queries (Dignös et al., 2014; Bouros and Mamoulis, 2017; Bouros et al., 2021; Cafagna and Böhlen, 2017) and these can be potentially used for processing interval range queries. For example, in Ref. (Bouros and Mamoulis, 2017; Bouros et al., 2021), the domain is split into disjoint partitions and each interval is assigned to all partitions it overlaps with. Although this introduces replication (hence, the storage requirements are not minimal), query evaluation is fast, because only a limited number of partitions that overlap with the query are accessed. The state-of-the-art domain-partitioning index for intervals is considered to be the recently proposed period index (Behrend et al., 2019), which first divides the space into coarse disjoint partitions and then re-partitions each division hierarchically.

In this paper, we first identify some deficiencies of domain-partitioning indices and propose a number of techniques that greatly boost their performance. This allows us to develop interval indices which are significantly faster than the state-of-the-art. First of all, for queries that overlap with the boundaries of multiple partitions, the same interval may be detected as a query result in multiple partitions. Although duplicates can be eliminated by a simple and cheap post-processing check (Dittrich and Seeger, 2000), more intervals than necessary may have to be accessed during query evaluation. We tackle this problem by further dividing the intervals in each partition into groups, based on whether they start inside or before the partition boundaries. The second problem is that in a single-level partitioning scheme, the replication of intervals can be excessive, hence the index size may grow a lot. In view of this, we propose HINT, an indexing scheme based on a hierarchical domain decomposition, which has controlled space requirements. In addition, we use the prefixes of the values that define the interval and query boundaries to guide search, greatly reducing the number of comparisons. For datasets where the domain is discrete and relatively small, our index can process queries without conducting any comparisons. For larger domains, we design a variant of HINT, termed HINT^m, which limits the comparisons to a small number of partitions.

Finally, we optimize the way the data are physically stored in each partition, by (i) only storing the elements of the interval data representations that are necessary for comparisons and result reporting; (ii) sorting the contents of each partition, in order to facilitate efficient search in it; (iii) deploying a sparse array representation and indexing for the intervals, in order to alleviate data sparsity and skew; and (iv) storing the ids of the intervals in each partition in a dedicated array (i.e., a column), to avoid unnecessary data accesses wherever comparisons are not necessary. Our experiments on real and synthetic datasets show that our approaches are typically one order of magnitude faster than classic interval indexing approaches and the state-of-the-art. We also show that our index can be used to efficiently process overlap interval joins between a small collection and a large, indexed interval collection.

Section 2 reviews related work. Our contributions can be summarized as follows:

  • We propose a number of techniques, which significantly improve the performance of a single-level domain-partitioning index (FLAT) for intervals (Section 3).

  • We propose the Hierarchical index for INTervals (HINT), which exploits the binary representation of the interval endpoints to partition and search them efficiently (Section 4). A version of HINT for relatively small domains can avoid comparisons and does not need to store any information about the intervals, besides their identifiers. In addition, HINT applies and extends the optimization techniques designed for the FLAT family in order to support indexing and search of intervals in arbitrary domains.

  • We present two techniques that can greatly improve the performance of our indices in practice (Section 4.3).

  • We evaluate the performance of our techniques experimentally (Section 5) on real and synthetic data, showing that they outperform the state-of-the-art, typically by one order of magnitude.

2. Related Work

One of the most popular data structures for intervals is Edelsbrunner’s interval tree (Edelsbrunner, 1980), designed to efficiently find all intervals that contain a given value (stabbing query). The interval tree is a binary search tree, which takes O(n) space and answers stabbing queries in O(log n + K) time, where n is the number of intervals and K is the size of the query result. The tree divides the domain recursively by placing all intervals strictly before (after) the domain’s center to the left (right) subtree and all intervals that overlap with the domain’s center at the root. This process is repeated recursively for the left and right subtrees using the centers of the corresponding sub-domains. The intervals assigned to each tree node are sorted in two lists based on their starting and ending values, respectively. Given a stabbing query, the tree is accessed recursively and at each node one or both sorted lists are traversed to obtain the results at that node. Interval trees can also be used to answer interval (i.e., range) queries. A relational interval tree for disk-resident data was proposed in (Kriegel et al., 2000).

Another classic data structure for intervals is the segment tree (de Berg et al., 2008). Like the interval tree, the segment tree is a binary search tree, which hierarchically divides the domain. A set of elementary domain intervals is defined from the endpoints of the indexed intervals and these form the leaves of the tree. The domain of each non-leaf node is the union of the domains of its children. Each data interval s is assigned to the smallest set of nodes whose domain intervals collectively define s. Hence, the space requirements are O(n log n). Given a stabbing query, the binary tree is searched to find the nodes whose domain interval includes the query value, and all intervals in them are reported in O(log n + K) time. The tree is not designed for range queries, which demand a duplicate result elimination mechanism.

In computational geometry (de Berg et al., 2008), indexing intervals has been studied as a subproblem within a more complex problem, such as orthogonal 2D range search, and the worst-case optimal interval tree is typically used. Indexing intervals has re-gained interest with the advent of temporal databases (Böhlen et al., 2017). In the context of temporal data, a number of indices have been proposed for secondary memory, with a main focus on effective versioning and compression (Becker et al., 1996; Lomet et al., 2008). Even though such indices could be used to solve our problem, they are tailored for historical versioned data and they are designed for disk-resident data; hence, they are not better than the interval tree for the general interval indexing problem that we study in this paper. Besides, recent research on indexing intervals does not address basic queries such as stabbing or range queries, but more demanding operations such as temporal aggregation (Kline and Snodgrass, 1995; Moon et al., 2003; Kaufmann et al., 2013b) and interval joins (Dignös et al., 2014; Piatov et al., 2016; Bouros and Mamoulis, 2017; Bouros et al., 2021; Cafagna and Böhlen, 2017; Chekol et al., 2019; Zhu et al., 2019).

Specifically, the timeline index (Kaufmann et al., 2013b) can be used to compute aggregates at any time moment during a query time interval. The index is composed of a table with events (start/end points of intervals) and queries are processed by scanning the events contained in the query interval. It also materializes the entire set of active interval ids at certain checkpoints, in order to avoid a full scan of the events table for every query (scans are performed from the last checkpoint before the query). When used to process range queries (i.e., timeslice queries), the timeline index accesses more data than necessary (i.e., events between the last checkpoint and the query interval), hence it is mostly appropriate for temporal aggregation. This data structure was later adapted and optimized for processing interval overlap joins (Piatov et al., 2016). Piatov and Helmer (Piatov and Helmer, 2017) present a collection of plane-sweep algorithms that extend the timeline index with other forms of temporal aggregation, such as aggregation over fixed intervals, sliding window aggregates, and MIN/MAX aggregates. A domain partitioning technique for interval joins was proposed in (Bouros and Mamoulis, 2017; Bouros et al., 2021), which can also be adapted and used for interval range queries, as we discuss in the next section. Alternative partitioning techniques for overlap interval joins were proposed in (Dignös et al., 2014; Cafagna and Böhlen, 2017). Temporal joins considering Allen’s algebra relationships for RDF data were studied in (Chekol et al., 2019). Finally, multi-way interval joins in the context of temporal k-clique enumeration were studied in (Zhu et al., 2019).

The period index (Behrend et al., 2019) is a domain-partitioning self-adaptive structure, specialized for range and duration queries. The authors split the time domain into coarse partitions and then divide each partition hierarchically using a fixed number of levels, in order to organize the intervals assigned to the partition based on their durations. For range queries, this index performs on par with the interval tree and a single-level domain partitioning technique, so we compare our methods with it experimentally in Section 5.

3. Flat Indexing

In this section, we propose a single-level indexing approach (termed FLAT), based on disjoint partitions, which also serves as building block for our hierarchical index, presented in Section 4. FLAT is inspired by partitioning approaches for interval joins (e.g., (Bouros and Mamoulis, 2017; Gunadhi and Segev, 1991)) and spatial joins (e.g., (Dittrich and Seeger, 2000)). Besides presenting a basic version of FLAT, stemming directly from previous work, we propose a number of optimizations, which greatly improve its efficiency in practice. Table 1 summarizes the notations used throughout the paper.

notation                description
s.id, s.st, s.end       identifier, start point, end point of interval s
q = [q.st, q.end]       query range interval
P_i                     i-th domain partition and its contents
P_i.st, P_i.end         start, end point of the i-th domain partition
P_i^O (P_i^R)           sub-partition of P_i with originals (replicas)
P_i^in (P_i^aft)        intervals in P_i ending inside (after) partition P_i
P_{ℓ,i}                 i-th partition at level ℓ of a HINT index
prefix(k, x)            k-bit prefix of integer x
Table 1. Table of notations

3.1. Basic Version of FLAT

FLAT divides the domain D into p partitions P_1, ..., P_p, which are pairwise disjoint in terms of their interval span and collectively cover the entire data domain; that is, (i) P_i ∩ P_j = ∅ for each i ≠ j, and (ii) P_1 ∪ ... ∪ P_p = D. (Whenever the context is clear, we use the same symbol P_i to denote both the contents of partition P_i and the interval that defines it; for example, we use P_i.st and P_i.end to denote the start and end values of the domain interval corresponding to partition P_i.) Domain partitioning could be regular (i.e., uniform) or may adapt to the data distribution (e.g., see (Bouros and Mamoulis, 2017)).

Each interval s ∈ S is assigned to all partitions that s overlaps with. For example, assume that the domain is divided into four partitions P_1, ..., P_4 and consider the set of 5 intervals shown in Figure 1; one of the intervals is assigned to a single partition only, whereas another (spanning all four partitions) is assigned to all of them. Given a range query interval q, the intervals that overlap with q can be obtained by accessing the partitions which collectively cover q. For each partition P_i which is contained in q (i.e., q.st ≤ P_i.st and P_i.end ≤ q.end), all intervals assigned to P_i are guaranteed to overlap with q. For each partition P_i which is not contained in q but overlaps with q, we need to compare each interval in P_i with q in order to determine whether it is a query result. Each partition that does not overlap with q can be safely ignored. For example, for the query interval q in Figure 1, three of the partitions have to be accessed, wherefrom we obtain the query results. Note that an interval included in an accessed partition is not necessarily a query result, if it does not overlap with q.

One issue that arises in range query evaluation using FLAT is that duplicate results may be produced and need to be handled. This may happen when the query interval overlaps with multiple partitions. For example, in Figure 1, the same interval can be identified as a result of query q in more than one of the accessed partitions. An efficient approach for handling duplicates (that does not rely on hashing) is the reference value method (Dittrich and Seeger, 2000), which was originally proposed for rectangles but can directly be applied to 1D intervals. For each interval s found to overlap with q in a partition P_i, we compute the reference value max(s.st, q.st) and report s only if this value is contained in P_i. Since the reference value is unique for the pair (s, q), s is reported in only one partition and duplicate results are avoided. In Figure 1, the replicated interval is only reported in the partition that contains its reference value. Still, FLAT has the overhead that the duplicate results must be computed and considered before being eliminated by the reference value approach. In the next subsection, we show how the computation and elimination of duplicates can be completely avoided. In addition, we show how the necessary comparisons for each query can be minimized.
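A minimal sketch of the reference value check, assuming partitions described only by their start and end values (the helper name is hypothetical):

#include <algorithm>
#include <cstdint>

// Report interval s in partition [part_st, part_end] only if the reference
// value max(s.st, q.st) falls inside that partition. Because the reference
// value is unique per (s, q) pair, s is reported in exactly one partition.
inline bool report_in_partition(uint32_t s_st, uint32_t q_st,
                                uint32_t part_st, uint32_t part_end) {
    uint32_t ref = std::max(s_st, q_st);
    return part_st <= ref && ref <= part_end;
}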

Figure 1. Example of basic FLAT indexing

3.2. FLAT+

The key idea behind the FLAT+ index is to divide the intervals that are assigned to each partition into two groups, inspired by the re-partitioning approach proposed in (Bouros and Mamoulis, 2017) for the parallel processing of interval joins. In particular, for each partition P_i, all intervals that start inside P_i, i.e., not before its starting endpoint P_i.st, are placed into set P_i^O (for originals), while those which start before P_i.st are placed into set P_i^R (for replicas). Figure 2 shows the contents of P_i^O and P_i^R for each of the four partitions of Figure 1.

Given a range query interval q, for the first partition P_f that overlaps with q, we consider all intervals in it as possible query results, as usual. However, for every other partition P_i that overlaps with q, where i > f, we only need to consider the intervals in P_i^O and can safely disregard the entire set P_i^R. This is because each interval in P_i^R starts before P_i and, hence, is guaranteed to also appear in the previous partition P_{i−1} that overlaps with q; it would therefore be a duplicate result in P_i. Consider again query q in Figure 1. In the first overlapping partition, we consider all intervals, i.e., both originals and replicas. In the subsequent overlapping partitions, we only consider the originals and ignore the replicas. Recall that for the first and the last overlapping partitions we have to perform comparisons between all considered intervals and q, in order to verify whether they overlap with q, while for all partitions in-between we just report all intervals in P_i^O without any comparisons. In Figure 2, the accessed divisions per partition for query q are shown in bold.
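A sketch of this evaluation strategy follows, reusing the Record type and overlaps() helper from the earlier snippet and assuming a hypothetical Partition struct with separate originals/replicas vectors:

#include <vector>

struct Partition {
    std::vector<Record> originals;  // intervals starting inside the partition
    std::vector<Record> replicas;   // intervals starting before the partition
};

// FLAT+ range query sketch: replicas are read only in the first overlapping
// partition; comparisons happen only in the first and last overlapping partitions.
std::vector<uint32_t> flatplus_query(const std::vector<Partition>& P,
                                     size_t first, size_t last,
                                     uint32_t qst, uint32_t qend) {
    std::vector<uint32_t> result;
    for (size_t i = first; i <= last; ++i) {
        bool needs_cmp = (i == first) || (i == last);
        for (const Record& s : P[i].originals)
            if (!needs_cmp || overlaps(s, qst, qend))
                result.push_back(s.id);
        if (i == first)                           // replicas only here
            for (const Record& s : P[i].replicas)
                if (overlaps(s, qst, qend))
                    result.push_back(s.id);
    }
    return result;
}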

Figure 2. FLAT+ partitioning

In the rest of this section, we propose a number of additional optimizations that further improve the performance of FLAT indexing; these optimizations target interval range queries and are not relevant to the problem of interval joins, so they were not considered in previous work (Bouros and Mamoulis, 2017).

3.2.1. Subdivisions of P_i^O and P_i^R

We further divide set P_i^O into P_i^{O_in} and P_i^{O_aft}, such that P_i^{O_in} (resp. P_i^{O_aft}) includes the intervals from P_i^O that end inside (resp. after) partition P_i. Similarly, P_i^R is divided into P_i^{R_in} and P_i^{R_aft}.

Range queries that overlap with multiple partitions. Consider a range query q which overlaps with a sequence of more than one partitions. Let P_f be the first partition overlapping with q and P_l the last one (f < l). As already mentioned, we should consider all intervals in P_f (i.e., both P_f^O and P_f^R) for possible overlap with q. However, since the query interval starts inside P_f but ends after P_f.end, we have the following lemma:

Lemma 1.

If f < l, then (i) each interval s in P_f^{O_in} or P_f^{R_in} overlaps with q iff s.end ≥ q.st; and (ii) all intervals in P_f^{O_aft} and P_f^{R_aft} are guaranteed to overlap with q.

Hence, we need just one comparison for each interval in P_f^{O_in} and P_f^{R_in}, whereas we can report all intervals in P_f^{O_aft} and P_f^{R_aft} as query results without any comparisons. For each partition P_i between P_f and P_l, we just report the intervals in P_i^O (i.e., in P_i^{O_in} ∪ P_i^{O_aft}) as results, without any comparisons. For the last partition P_l, we consider just P_l^O and perform one comparison per interval. Specifically:

Lemma 2.

If f < l, each interval s in P_l^O overlaps with q iff s.st ≤ q.end.

Range queries that overlap with a single partition. If f = l, i.e., the range query overlaps only one partition P_f, we can use the following lemma in place of Lemmas 1 and 2:

Lemma 3.

If f = l, then

  • each interval s in P_f^{O_in} overlaps with q iff s.st ≤ q.end and s.end ≥ q.st,

  • each interval s in P_f^{O_aft} overlaps with q iff s.st ≤ q.end,

  • each interval s in P_f^{R_in} overlaps with q iff s.end ≥ q.st,

  • all intervals in P_f^{R_aft} overlap with q.

Overall, the subdivisions help us to minimize the number of intervals in P_f and in P_l for which we have to apply comparisons. Figure 3 shows the subdivisions which are accessed by query q in each partition. In the first partition P_f, all four subdivisions are accessed, but comparisons are needed only for the intervals in P_f^{O_in} and P_f^{R_in}. In each partition in-between, all “originals” in P_i^{O_in} and P_i^{O_aft} are accessed and reported without any comparisons. Finally, in P_l, all “originals” in P_l^{O_in} and P_l^{O_aft} are accessed and compared to q.end.

Figure 3. Subdivisions in FLAT+

3.2.2. Sorting the intervals in each subdivision

We can keep the intervals in each subdivision sorted, in order to reduce the number of comparisons for queries that access them. For example, let us examine the last partition P_l that overlaps with a query q. If the intervals s in P_l^O are sorted on their start endpoint (i.e., s.st), we can simply access and report the intervals until the first s such that s.st > q.end. Alternatively, we can perform a binary search to find the first s such that s.st > q.end and then scan and report all intervals before it. Table 2 (second column) summarizes the sort orders for each of the four subdivisions of a partition that can be beneficial in range query evaluation. For a subdivision P_i^{O_in}, intervals may have to be compared on their start point (if i = l), on their end point (if i = f), or on both points (if i = f = l). Hence, we choose to sort on either s.st or s.end, to accommodate two of these three cases. For a subdivision P_i^{O_aft}, intervals may only have to be compared on their start point (if i = l). For a subdivision P_i^{R_in}, intervals may only have to be compared on their end point (if i = f). Finally, for a subdivision P_i^{R_aft}, there is never any need to compare the intervals, so no order provides any search benefit.
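As an illustration, with the intervals of a subdivision sorted by start point, the qualifying prefix for the s.st ≤ q.end test can be located with a single binary search (a sketch reusing the Record type from above):

#include <algorithm>
#include <cstdint>
#include <vector>

// 'sorted_by_st' holds the intervals of a subdivision sorted by s.st. All
// intervals with s.st <= qend qualify; locate the first one that does not and
// report everything before it.
void report_prefix(const std::vector<Record>& sorted_by_st, uint32_t qend,
                   std::vector<uint32_t>& result) {
    auto stop = std::upper_bound(
        sorted_by_st.begin(), sorted_by_st.end(), qend,
        [](uint32_t value, const Record& s) { return value < s.st; });
    for (auto it = sorted_by_st.begin(); it != stop; ++it)
        result.push_back(it->id);
}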

subdivision     beneficial sorting      necessary data
P_i^{O_in}      by s.st or by s.end     s.id, s.st, s.end
P_i^{O_aft}     by s.st                 s.id, s.st
P_i^{R_in}      by s.end                s.id, s.end
P_i^{R_aft}     no sorting              s.id
Table 2. Sort orders that can be beneficial

3.2.3. Storage optimization

So far, we have assumed that each interval s is stored as a triplet (s.id, s.st, s.end) in each partition whereto s is assigned. However, if we split the partitions into subdivisions, as discussed in Section 3.2.1, we do not need to keep all information of the intervals in them. Specifically, as Lemmas 1, 2, and 3 suggest, for each subdivision P_i^{O_in}, we may need to use s.st and/or s.end for each interval s, while for each subdivision P_i^{O_aft}, we may need s.st for each s, but we will never need s.end. From the intervals of each subdivision P_i^{R_in}, we may need s.end, but we will never use s.st. Finally, for each subdivision P_i^{R_aft}, we just have to keep the identifiers of the intervals. Table 2 (third column) summarizes the data that we need to keep from each interval in the subdivisions of each partition. Since each interval is stored as an “original” in just one partition, but as a replica in possibly multiple partitions, a lot of space can be saved by storing only the necessary data, especially when the intervals span multiple partitions.

3.3. Evaluation of FLAT+

We now evaluate the four FLAT+ versions and FLAT (Section 3.1). In particular, we denote by (i) FLAT+(base), the version described in the introductory text of Section 3.2, which stores the intervals of a partition inside the P^O and P^R sets; (ii) FLAT+subs, the version which employs the subdivisions P^{O_in}, P^{O_aft}, P^{R_in}, P^{R_aft} discussed in Section 3.2.1; (iii) FLAT+subs+sort, the version described in Section 3.2.2, which additionally sorts the intervals inside the subdivisions; and (iv) FLAT+subs+sort+sopt, the version discussed in Section 3.2.3, which also applies the storage optimization. For the comparison, we used datasets BOOKS and TAXIS, which represent inputs with long and short intervals, respectively, and ran 10K randomly distributed queries while varying the number of partitions and the extent of each query as a percentage of the domain size. For more details about the implementation and compilation options, our machine, and the datasets and queries, please refer to Section 5 and Table 3. Figure 4 reports the results; for the plots in Figures 4(e) and 4(f), the number of partitions is set to the best value observed in Figures 4(b) and 4(c), respectively.

The plots clearly show the efficiency of FLAT+ indexing and the merit of the proposed optimizations; all FLAT+ versions steadily outperform FLAT. Overall, we observe that FLAT+subs+sort+sopt is the fastest FLAT+ version while always being an order of magnitude faster than basic FLAT. For datasets with long-lived intervals such as BOOKS, FLAT+subs+sort+sopt achieves an order of magnitude higher throughput than FLAT+(base) and up to 3 times higher throughput compared to FLAT+subs and FLAT+subs+sort. On the other hand, for datasets with very short intervals like TAXIS, the advantage of FLAT+subs+sort+sopt is smaller especially compared to FLAT+subs+sort mainly due to the low replication ratio. Nevertheless, FLAT+subs+sort+sopt is still the fastest FLAT+ version. The benefits from the storage optimizations discussed in Section 3.2.3 are twofold. As expected, the index always occupies less space; Figures 4(a), 4(b) clearly show this. In addition though, the query evaluation is accelerated because FLAT+subs+sort+sopt incurs a smaller footprint in main memory and so, a higher cache hit ratio.

In view of the above, we consider only the FLAT+subs+sort+sopt version for the rest of the text, simply denoting it as FLAT+.

Figure 4. Comparing FLAT variants (FLAT, FLAT+(base), FLAT+subs, FLAT+subs+sort, FLAT+subs+sort+sopt) on BOOKS and TAXIS

4. Hierarchical Indexing

If the indexed interval collection includes a large number of long intervals, FLAT+ will have high storage requirements (even with the storage optimization), due to excessive replication. In this section, we propose a Hierarchical index for INTervals (HINT), in which the number of interval replications is bounded, but the search performance is not compromised. In a nutshell, HINT is a hierarchy of FLAT+ indices; each interval is guaranteed to be assigned to at most two partitions per level. Hence, long intervals are assigned to a relatively small number of partitions in HINT. Our range query evaluation algorithm over HINT (i) uses bitwise operations to determine the relevant partitions per level and (ii) is expected to conduct comparisons on very few partitions.

4.1. A comparison-free version of HINT

We first describe a version of HINT which is appropriate when the domain is discrete and relatively small. Specifically, assume that the domain wherefrom the endpoints of the intervals in S take their values is [0, 2^m − 1], where m is not very large. We can define a regular hierarchical decomposition of the domain into partitions, where at each level ℓ from 0 to m, there are 2^ℓ partitions, denoted by P_{ℓ,0}, ..., P_{ℓ,2^ℓ−1}. Figure 5 illustrates the hierarchical domain partitioning for m = 4.

Next, we assign each interval s to the smallest set of partitions which collectively define s. It is not hard to show that s will be assigned to at most two partitions per level. For example, the interval shown in Figure 5 is assigned to one partition at one level and to two partitions at another level. The assignment of each interval to the partitions can be done using Algorithm 1. In a nutshell, we maintain two masks a and b, initialized to the endpoints s.st and s.end. Starting from the bottom-most level ℓ = m, if the last bit of a (resp. b) is 1 (resp. 0), we assign the interval to partition P_{ℓ,a} (resp. P_{ℓ,b}) and increase a (resp. decrease b) by one. We then update a and b by cutting off their last bits (i.e., integer division by 2, or a bitwise right-shift) and proceed to the level above. If, at the next level, a > b, we terminate.

Input : HINT index H, interval s
Output : updated H after indexing s
a ← s.st; b ← s.end;            ▹ set masks to the endpoints
ℓ ← m;                          ▹ start at the bottom-most level
while ℓ ≥ 0 and a ≤ b do
    if last bit of a is 1 then
        add s to P_{ℓ,a};       ▹ update partition
        a ← a + 1;
    if last bit of b is 0 then
        add s to P_{ℓ,b};       ▹ update partition
        b ← b − 1;
    a ← a div 2; b ← b div 2;   ▹ cut off the last bits
    ℓ ← ℓ − 1;                  ▹ repeat for the level above
ALGORITHM 1 Assignment of an interval to partitions
Figure 5. Hierarchical partitioning and assignment of an interval

4.1.1. Range queries

A range query q can be evaluated by finding, at each level, the partitions that overlap with q. Specifically, the partitions that overlap with the query interval at level ℓ are partitions P_{ℓ,prefix(ℓ,q.st)} to P_{ℓ,prefix(ℓ,q.end)}, where prefix(ℓ, x) denotes the ℓ-bit prefix of integer x. All intervals in these partitions are guaranteed to overlap with q, whereas intervals in none of these partitions cannot overlap with q. However, the same data interval may exist in multiple partitions (of the same or different levels); hence, directly reporting all these intervals would produce duplicate results.

To avoid producing and then eliminating duplicates, we divide the intervals in each partition P_{ℓ,i} into originals P_{ℓ,i}^O and replicas P_{ℓ,i}^R, as we have proposed in Section 3.2 for the FLAT+ index. Given a range query q, at each level of the index, we report all intervals in the first partition that overlaps with q; for every other partition P_{ℓ,i} that overlaps with q, we report all intervals in P_{ℓ,i}^O and ignore P_{ℓ,i}^R. This guarantees that no result is missed and no duplicates are produced. The reason is that each interval appears as an original in just one partition per level, hence reporting only originals cannot produce any duplicates. At the same time, all replicas in the first partition per level that overlaps with q begin before q and overlap with q, so they should be reported. On the other hand, replicas in subsequent partitions that overlap with q are intervals which are either originals in a previous partition overlapping with q, or replicas in the first such partition; hence, they can safely be skipped. Finally, the partitions that overlap with q at each level ℓ are partitions P_{ℓ,f} to P_{ℓ,l}, where f (resp. l) is the ℓ-bit prefix of q.st (resp. q.end). Algorithm 2 describes the range query algorithm using HINT.

Input : HINT index H, query interval q
Output : set R of all intervals that overlap with q
R ← ∅;
foreach level ℓ in 0, ..., m do
    p ← prefix(ℓ, q.st);
    R ← R ∪ P_{ℓ,p}^O ∪ P_{ℓ,p}^R;     ▹ first overlapping partition: originals and replicas
    while p < prefix(ℓ, q.end) do
        p ← p + 1;
        R ← R ∪ P_{ℓ,p}^O;             ▹ subsequent partitions: originals only
return R;
ALGORITHM 2 Range query algorithm on HINT

For example, consider the hierarchical partitioning of Figure 6 and a query interval q = [5, 9]. The binary representations of q.st and q.end are 0101 and 1001, respectively. The partitions that overlap with q at each level are shown in bold (blue) and dashed (red) lines and can be determined by the corresponding prefixes of 0101 and 1001. At each level ℓ, all intervals in the first relevant partition (bold/blue) are reported, while only the original intervals in the subsequent relevant partitions (dashed/red) are reported.

Figure 6. Accessed partitions for range query

Discussion. The version of the HINT index described above finds all range query results without conducting any comparisons. This means that in each partition P_{ℓ,i}, we just have to keep the ids of the intervals that are assigned to it (i.e., we do not have to store/replicate the interval endpoints). In addition, the relevant partitions at each level are computed by bit-shifting operations, which are fast. To use HINT for arbitrary integer domains, we should first normalize all interval endpoints by subtracting the minimum endpoint, in order to convert them to values in a [0, 2^m − 1] domain (the same transformation should be applied on the queries).
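A minimal sketch of this per-level bookkeeping, assuming endpoint values already normalized to [0, 2^m − 1] with m < 32 (variable and function names are illustrative):

#include <cstdint>

// For each level l (l = m is the finest), the partitions relevant to a query
// [qst, qend] are those whose index lies between the l-bit prefixes of the two
// endpoints; the l-bit prefix of an m-bit value is a right shift by (m - l).
void relevant_partitions(uint32_t qst, uint32_t qend, int m) {
    for (int l = m; l >= 0; --l) {
        uint32_t first = qst >> (m - l);    // l-bit prefix of q.st
        uint32_t last  = qend >> (m - l);   // l-bit prefix of q.end
        // report all intervals of partition (l, first), originals and replicas,
        // and only the originals of partitions (l, first+1), ..., (l, last)
        (void)first; (void)last;            // reporting itself omitted in this sketch
    }
}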

4.2. Indexing arbitrary intervals

The version of HINT in Section 4.1 is not appropriate for arbitrary intervals, as it requires the domain to be discrete and relatively small. We now present a generalized version of HINT, denoted by HINT^m, which can be used for intervals in arbitrary domains. HINT^m uses a hierarchical domain partitioning with m + 1 levels, based on a [0, 2^m − 1] domain; each raw interval endpoint is mapped to a value in [0, 2^m − 1] by linear rescaling. The mapping function is f(x) = ⌊(x − min) · (2^m − 1) / (max − min)⌋, where min and max are the minimum and maximum interval endpoints in the dataset S, respectively. Hence, each raw interval [s.st, s.end] is mapped to the interval [f(s.st), f(s.end)]. The mapped interval is then assigned to at most two partitions per level of HINT^m, using its mapped endpoints and Algorithm 1.

For ease of presentation, we will assume that the raw interval endpoints take values in [0, 2^b − 1], where b > m, which means that the mapping function simply outputs the m most significant bits of its input. Figure 5 shows an example of a raw interval, its mapped counterpart, and the partitions whereto it is assigned.
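A sketch of the rescaling step with hypothetical names (to_key, dmin, dmax); rounding details may differ in the actual implementation, and the intermediate product is assumed to fit in 64 bits:

#include <cstdint>

// Map a raw endpoint x in [dmin, dmax] to an m-bit key in [0, 2^m - 1]
// by linear rescaling (floor of the scaled value).
inline uint32_t to_key(uint64_t x, uint64_t dmin, uint64_t dmax, unsigned m) {
    return static_cast<uint32_t>(((x - dmin) * ((1ull << m) - 1)) / (dmax - dmin));
}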

In contrast to HINT, the set of partitions whereto an interval s is assigned in HINT^m does not define s exactly, but the smallest interval in the [0, 2^m − 1] domain which covers s. Hence, for a range query q, simply reporting all intervals in the relevant partitions at each level (as in Algorithm 2) would produce false hits. Instead, as in the FLAT indices, comparisons to the query endpoints may be required for the first and the last partition at each level that overlap with q. Specifically, we consider each level of HINT^m as a FLAT+ index in order to evaluate range queries. At each level, we apply the partition subdivisions and all the sorting/storage optimizations described in Section 3.2 to form the HINT^m index.

4.2.1. Query evaluation using HINT^m

A straightforward range query evaluation algorithm would perform comparisons in the first and the last relevant partition at each level (by accessing the relevant subdivisions), according to Section 3.2.1. Specifically, we can apply Algorithm 2 to go through the partitions at each level that overlap with q and (i) for the first partition P_{ℓ,f}, verify whether s.end ≥ q.st for each interval s in it, and (ii) for the last partition P_{ℓ,l}, verify whether s.st ≤ q.end for each interval s in it. For each partition between P_{ℓ,f} and P_{ℓ,l}, we report all intervals in P^O without any comparison. As an example, consider the HINT^m index and the range query interval q shown in Figure 7. The identifiers of the partitions relevant to q are shown in the figure (together with some indicative intervals that are assigned to these partitions). At level 4, we have to perform comparisons for all intervals in the first relevant partition P_{4,5}. In partitions P_{4,6}, ..., P_{4,8}, we just report the originals in them as results, while in the last partition P_{4,9} we compare the start points of all originals with q.end, before we can confirm whether they are results or not. At level 3, the relevant partitions are P_{3,2}, P_{3,3}, and P_{3,4}, and we perform comparisons in P_{3,2} and P_{3,4}.

Figure 7. Avoiding redundant comparisons in HINT^m

It turns out, however, that the algorithm described above may perform redundant comparisons, as at certain levels it is not necessary to do comparisons at the first and/or the last partition. For instance, in the previous example, we do not have to perform comparisons for the last relevant partition P_{3,4} at level 3, because any interval assigned to P_{3,4} should overlap with P_{4,8} and the interval spanned by P_{4,8} is covered by q. This means that the start point of every interval in P_{3,4} is guaranteed to be before q.end (which is inside P_{4,9}). In addition, observe that for any relevant partition which is the last partition at an upper level and covers P_{3,4} (i.e., the last relevant partitions at levels 2, 1, and 0), we do not have to conduct the s.st ≤ q.end tests, because the intervals in these partitions are guaranteed to start before q.end. On the other hand, for the first relevant partition P_{3,2} at level 3, we do have to conduct comparisons, because P_{3,2} starts before the first relevant partition at level 4, i.e., P_{4,5}, which contains q.st. Finally, for the first relevant partition at level 2 and the first relevant partition at any level above 2, we do not have to perform comparisons. This observation is formalized by the lemma below:

Lemma 4.

Given a range query q, if the first (resp. last) relevant partition for q at level ℓ starts (resp. ends) at the same domain value as the first (resp. last) relevant partition at level ℓ + 1, then for the first (resp. last) relevant partitions at level ℓ and all levels above it, we do not have to compare their intervals to q with respect to their end points s.end (resp. start points s.st), as these intervals are guaranteed to pass the s.end ≥ q.st (resp. s.st ≤ q.end) test.

The next question is whether there exists a fast way to verify the condition of Lemma 4. The answer is yes. If the last bit of the index f (resp. l) of the first (resp. last) partition P_{ℓ,f} (resp. P_{ℓ,l}) relevant to the query at level ℓ is 0 (resp. 1), then the first (resp. last) relevant partition at the level above satisfies the condition. For example, in Figure 7, examine the first relevant partition at level 4. The last bit of its index is not 0, which means that the first relevant partition at level 3 does not satisfy the condition. On the other hand, consider the last relevant partition at level 4. The last bit of its index is 1; hence, the last relevant partition at level 3 satisfies the condition and we do not have to perform comparisons in the last relevant partitions at level 3 and above. When we consider the first relevant partition at level 3, we repeat the bit test to find that its last bit is 0 this time, meaning that for the first relevant partitions at level 2 and above, no comparisons are needed.

Algorithm 3 is a pseudocode of the range query algorithm on HINT^m. Although, for simplicity, we present the algorithm assuming only the basic division of each partition into P^O and P^R, in practice each partition is divided into P^{O_in}, P^{O_aft}, P^{R_in}, P^{R_aft}, according to Section 3.2.1, and the optimizations described in Sections 3.2.2 and 3.2.3 are applied. The algorithm accesses all levels of the index, bottom-up. It uses two auxiliary flag variables compfirst and complast to mark whether it is necessary to perform comparisons at the current level (and all levels above it) at the first and the last relevant partition, respectively, according to the discussion in the previous paragraph. At each level ℓ, we find the indices f and l of the relevant partitions to the query, based on the ℓ-bit prefixes of q.st and q.end (Line 3). For the first index f, the partitions holding originals and replicas, P^O_{ℓ,f} and P^R_{ℓ,f}, are accessed. The algorithm first checks whether f = l, i.e., whether the first and the last partitions coincide. In this case, if both compfirst and complast are set, we perform all comparisons according to Section 3.2.1 (paragraph Range queries that overlap with a single partition). If only complast (resp. compfirst) is set, then we only conduct the s.st ≤ q.end (resp. s.end ≥ q.st) comparisons. If neither compfirst nor complast is set, no comparisons are necessary. If f < l, we do or skip the comparisons for the first partition, depending on compfirst. For all partitions after the first one, we only consider the subdivisions holding the originals, and the only case where we have to conduct comparisons is when we are at the last partition (i = l) and complast is set.

Input : HINT^m index H, query interval q
Output : set R of the ids of all intervals that overlap with q
1  R ← ∅; compfirst ← TRUE; complast ← TRUE;
2  for ℓ ← m down to 0 do                               ▹ bottom-up
3      f ← prefix(ℓ, q.st); l ← prefix(ℓ, q.end);
4      for i ← f to l do
5          if i = f then                                ▹ first overlapping partition
6              if i = l and compfirst and complast then ▹ Lemma 3
7                  R ← R ∪ {s.id : s ∈ P^O_{ℓ,i}, s.st ≤ q.end and s.end ≥ q.st};
8                  R ← R ∪ {s.id : s ∈ P^R_{ℓ,i}, s.end ≥ q.st};
9              else if i = l and complast then
10                 R ← R ∪ {s.id : s ∈ P^O_{ℓ,i}, s.st ≤ q.end};
11                 R ← R ∪ {s.id : s ∈ P^R_{ℓ,i}};
12             else if compfirst then                   ▹ Lemma 1
13                 R ← R ∪ {s.id : s ∈ P^O_{ℓ,i} ∪ P^R_{ℓ,i}, s.end ≥ q.st};
14             else
15                 R ← R ∪ {s.id : s ∈ P^O_{ℓ,i} ∪ P^R_{ℓ,i}};
16         else if i = l and complast then              ▹ last partition, Lemma 2
17             R ← R ∪ {s.id : s ∈ P^O_{ℓ,i}, s.st ≤ q.end};
18         else                                         ▹ in-between or last, without comparisons
19             R ← R ∪ {s.id : s ∈ P^O_{ℓ,i}};
20     if last bit of f is 0 then compfirst ← FALSE;    ▹ Lemma 4
21     if last bit of l is 1 then complast ← FALSE;     ▹ Lemma 4
22 return R;
ALGORITHM 3 Range query algorithm on HINT^m

4.2.2. Analysis

We now analyze the cost of HINT^m in terms of (i) the number of partitions for which comparisons are necessary and (ii) the overall number of intervals which have to be compared with q in order to determine whether they are results.

At the last level m of the index, we definitely have to do comparisons in the first and the last relevant partition (which are typically different). At level m − 1, for each of the first and last partitions, we have a 50% chance of avoiding comparisons. Hence, the expected number of partitions for which we have to perform comparisons at level m − 1 is 1. Similarly, at level m − 2 each of the remaining first/last partitions has a 50% chance of avoiding comparisons, and so on. Overall, under worst-case conditions, where m is large and q is long, the expected number of partitions for which we need to perform comparisons is bounded by 4.
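This bound comes from a simple geometric series (assuming, as above, independent 50% chances at successive levels):

\[
2 \;+\; 2\cdot\tfrac{1}{2} \;+\; 2\cdot\tfrac{1}{4} \;+\; \dots
\;=\; 2\sum_{i=0}^{m} 2^{-i} \;<\; 4 .
\]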

More importantly, the number of intervals which are compared to q is the same as the number of intervals compared to q in a FLAT+ index which has the same partitions as the bottom-most level of HINT^m. This is because, in both cases, the only intervals that are compared are those covered by the level-m partition that includes q.st or by the level-m partition that includes q.end.

Overall, HINT^m is expected to have the same search performance as a FLAT+ index with 2^m regular partitions. On the other hand, HINT^m is expected to have lower space requirements, as each interval is assigned to at most two partitions per level; in particular, the space complexity of HINT^m is O(m·n), where n is the number of intervals. In contrast, each interval can be assigned to all partitions of FLAT+ in the worst case, i.e., the space complexity of FLAT+ is O(2^m·n).

4.3. Optimizations

In this section, we discuss implementation techniques which improve the performance of HINT and HINT^m. First, we show how to handle very sparse or skewed data at each level. Another (orthogonal) optimization is decoupling the storage of the interval ids from the storage of the interval endpoints in each (sub-)partition.

4.3.1. Handling data skew and sparsity

Data skew and sparsity may cause many partitions to be empty, especially at the lowest levels of HINT^m (i.e., for large values of ℓ). Recall that a query accesses a sequence of multiple partitions at each level ℓ. Since the intervals are physically distributed over the partitions, this results in unnecessary accesses of empty partitions and may cause cache misses. We propose a storage organization where all divisions at the same level ℓ are merged into a single table T_ℓ and an auxiliary index is used to find each non-empty division. The auxiliary index locates the first non-empty partition whose id is greater than or equal to the ℓ-bit prefix of q.st (e.g., via binary search or a binary search tree). From thereon, the non-empty partitions which overlap with the query interval are accessed sequentially and distinguished with the help of the auxiliary index. Hence, the contents of the partitions relevant to each query are always accessed sequentially. Figure 8(a) shows an example at one level of HINT^m. From the partitions at that level, only five are non-empty (shown in grey at the top of the figure). All nine intervals in them (sorted by start point) are unified in a single table, as shown at the bottom of the figure (the binary representations of the interval endpoints are shown). For the moment, ignore the ids column shown at the right of the figure. The sparse index for the level has one entry per non-empty partition, pointing to the first interval in it. For the query shown in the example, the index is used to find the first non-empty partition whose id is greater than or equal to the ℓ-bit prefix of q.st. All relevant non-empty partitions are then accessed sequentially, up to the position of the first interval of the first non-empty partition whose id exceeds the ℓ-bit prefix of q.end.
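A sketch of such a sparse per-level index and its lookup, with illustrative names; the merged interval table itself is omitted:

#include <algorithm>
#include <cstdint>
#include <vector>

// Per level: the ids of the non-empty partitions (sorted) and, for each,
// the offset of its first interval in the level's merged table.
struct SparseLevelIndex {
    std::vector<uint32_t> part_id;  // sorted ids of non-empty partitions
    std::vector<size_t>   offset;   // offset[i] = first interval of part_id[i]
};

// Locate the first non-empty partition whose id is >= the l-bit prefix of q.st.
size_t first_relevant(const SparseLevelIndex& idx, uint32_t qst_prefix) {
    auto it = std::lower_bound(idx.part_id.begin(), idx.part_id.end(), qst_prefix);
    return static_cast<size_t>(it - idx.part_id.begin());
}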

Searching for the first non-empty partition that overlaps with q at each level can be quite expensive if the non-empty partitions are numerous. To alleviate this issue, we suggest adding to the auxiliary index a link from each non-empty partition at a level to the first non-empty partition at the level above whose id is greater than or equal to the id of its parent (i.e., its prefix at that level). Hence, instead of performing binary search at the level above, we follow the link from the first partition relevant to the query at the current level and (if necessary) apply a linear search backwards, starting from the pointed partition, to identify the first non-empty partition that overlaps with q. Figure 8(b) shows an example, where each non-empty partition at a level is linked with the first non-empty partition with a greater than or equal prefix at the level above. Given the query of the example, we use the auxiliary index to find the first non-empty partition which overlaps with q at the bottom level and access the subsequent relevant non-empty partitions sequentially. Then, we follow the pointer of the first relevant partition to reach the first non-empty partition at the level above which overlaps with q. We repeat this at each level; the pointed partition is not always guaranteed to be the first one that may overlap with q, in which case we go backwards until we reach it.
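Continuing the previous sketch, a hypothetical helper that, starting from the linked candidate position at the upper level, scans backwards to the first non-empty partition still relevant to the query:

#include <cstdint>

// 'candidate' is the position reached by following the stored link; move
// backwards while the previous non-empty partition id is still >= the
// upper level's prefix of q.st, i.e., still relevant to the query.
size_t first_relevant_via_link(const SparseLevelIndex& upper, size_t candidate,
                               uint32_t qst_prefix_upper) {
    while (candidate > 0 && upper.part_id[candidate - 1] >= qst_prefix_upper)
        --candidate;
    return candidate;
}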


(a) auxiliary index

(b) linking between levels

Figure 8. Storage and indexing optimizations

4.3.2. Reducing cache misses

At most levels of HINT^m, no comparisons are conducted and the only operations are on the ids of the intervals which qualify the query. In addition, even at the levels where comparisons are required, these are restricted to the first and the last relevant partitions P_{ℓ,f} and P_{ℓ,l}, and no comparisons are needed for the partitions in-between. Summing up, when accessing any (sub-)partition for which no comparison is required, we do not need any information about the intervals except for their ids. Hence, in our implementation, for each (sub-)partition, we store the ids of all intervals in it in a dedicated array (the ids column) and the interval endpoints (wherever necessary) in a different array. If we need the id of an interval that qualifies a comparison, we can access the corresponding position of the ids column. This storage organization greatly improves search performance by reducing the cache misses, because for the intervals that do not require comparisons we only access their ids and not their endpoints. This optimization is orthogonal to and applied in combination with the strategy discussed in Section 4.3.1, i.e., we store all divisions at each level ℓ in a single table T_ℓ, which is decomposed into a column that stores the ids and another table for the endpoint data of the intervals. An example of the ids column is shown in Figure 8(a). If, for a sequence of partitions at a level, we do not have to perform any comparisons, we just access the sequence of the interval ids that are part of the answer, which is implied by the position of the first such partition (obtained via the auxiliary index). In this example, all intervals in the partitions for which no comparisons are needed are guaranteed to be query results and can be sequentially fetched from the ids column, without having to access the endpoints of the intervals. The auxiliary index guides the search by identifying and distinguishing between the partitions for which comparisons should be conducted and those for which comparisons are not necessary.
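A sketch of the decoupled layout for one level, with illustrative field names; how many endpoint values are actually materialized depends on the subdivision, per Table 2:

#include <cstdint>
#include <vector>

// Decoupled columns for one index level: the ids of all stored intervals, in
// partition order, live in one contiguous array; endpoint data used for
// comparisons live in separate arrays. Partitions that need no comparisons
// are answered by reading the ids column only.
struct LevelStorage {
    std::vector<uint32_t> ids;     // ids column, one entry per stored interval
    std::vector<uint32_t> starts;  // start points, only where start comparisons may occur
    std::vector<uint32_t> ends;    // end points, only where end comparisons may occur
};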

5. Experimental Analysis

In this section, we compare the performance of the indices proposed in this paper against previous work on indexing intervals for range queries. Specifically, the comparison includes our optimized single-level index presented in Section 3.2 and evaluated in Section 3.3, i.e., variant FLAT+subs+sort+sopt, which is denoted by FLAT+. We also include our hierarchical indices HINT and HINT^m, detailed in Sections 4.1 and 4.2, respectively. We compare our methods against three baseline approaches. The first one is an implementation of the interval tree (Edelsbrunner, 1980) (code from (Garrison, )); this is the best data structure in terms of worst-case performance. The second competitor is the period index (Behrend et al., 2019), which is the state-of-the-art index for range (and duration) queries. We implemented both the adaptive and the non-adaptive variants of the period index, but we kept the adaptive version as it performed best (as also shown in (Behrend et al., 2019)). The third competitor is FLAT, the basic version of a single-level index, described in Section 3.1 (used also as a competitor in (Behrend et al., 2019) under the name ‘Grid’). All algorithms were implemented in C++ and compiled using gcc (v4.8.5) with flags -O3, -mavx and -march=native. The experiments were conducted on a machine with 384 GBs of RAM and a dual Intel(R) Xeon(R) CPU E5-2630 v4 clocked at 2.20GHz, running CentOS Linux 8.2.2004.

BOOKS WEBKIT TAXIS GREEND
Cardinality
Domain (sec)
Min duration (sec)
Max duration (sec)
Avg. duration (sec)
Avg. duration [%] 6.98 7.19 0.0024 0.000005
Table 3. Characteristics of real datasets

5.1. Data and queries

For our experiments, we used four real collections of time intervals, which have also been used in previous work (Dignös et al., 2014; Piatov et al., 2016; Bouros and Mamoulis, 2017; Cafagna and Böhlen, 2017; Bouros et al., 2021). Their characteristics are summarized in Table 3. BOOKS (Bouros and Mamoulis, 2017) contains the time intervals during which books were lent out by Aarhus public libraries in 2013 (https://www.odaa.dk). WEBKIT (Bouros and Mamoulis, 2017, 2018; Dignös et al., 2014) records the file history in the git repository of the Webkit project from 2001 to 2016 (https://webkit.org); the intervals indicate the periods during which a file did not change. TAXIS (Bouros et al., 2021) includes the time periods of taxi trips (pick-up and drop-off timestamps) from New York City (https://www1.nyc.gov/site/tlc/index.page) in 2013. GREEND (Cafagna and Böhlen, 2017; Monacchi et al., 2014) records time periods of power usage data from households in Austria and Italy from January 2010 to October 2014. BOOKS and WEBKIT contain around 2M intervals each, which are quite long on average. On the other hand, TAXIS and GREEND contain over 100M relatively short intervals.

We also generated synthetic collections of intervals, in order to simulate different cases for the lengths and the skewness of the input intervals. Table 4 shows the parameters that determine the synthetic datasets and their default values. The domain of the datasets ranges from 32M to 512M, which requires the index level parameter m to range from 25 to 29 for a comparison-free HINT. The cardinality ranges from 1M to 100M. The lengths of the intervals were generated using numpy’s random.zipf() function; they follow a zipfian distribution with probability density function p(x) = x^(−α)/ζ(α), where ζ is the Riemann zeta function. A small value of α results in most intervals being relatively long, while a large value of α results in the great majority of intervals having length 1. The positions of the middle points of the intervals are generated from a normal distribution centered at the middle point μ of the domain, i.e., the middle point of each interval is generated by calling numpy’s random.normalvariate(μ, σ). The greater the value of σ, the more spread out the intervals are in the domain.

parameter               values (default in bold)
Domain length           32M, 64M, 128M, 256M, 512M
Cardinality             1M, 5M, 10M, 50M, 100M
α (interval length)     1.01, 1.1, 1.2, 1.4, 1.8
σ (interval position)   10K, 100K, 1M, 5M, 10M
Table 4. Parameters of synthetic datasets

On the real datasets, we ran range queries which are uniformly distributed in the domain. On the synthetic datasets, the positions of the queries follow the distribution of the data. In both cases, the lengths of the query intervals were fixed to a percentage of the domain size (default 0.1%). At each experimental instance, we ran 10K random queries, in order to measure the overall throughput.

dataset throughput [queries/sec] index size [MBs]
original optimized original optimized
BOOKS 12098 36173 3282 273
WEBKIT 947 39000 49439 337
TAXIS 2931 31027 10093 7733
GREEND 648 47038 57667 10131
Table 5. HINT: impact of the skewness & sparsity optimization (Section 4.3.1), default parameters

Figure 9. HINT^m: impact of the optimizations of Section 4.3 (original, skew & sparsity optimization, cache misses optimization, both optimizations) on BOOKS, WEBKIT, TAXIS, and GREEND

5.2. Impact of Optimizations

In the first experiment, we test the effect of the optimization techniques presented in Section 4.3 on the performance of HINT and HINT^m. Since the comparison-free version of HINT (Section 4.1) just stores interval ids, the cache misses optimization described in Section 4.3.2 is not applicable to it, so we only test the effect of the data skew & sparsity optimization (Section 4.3.1), in Table 5. Observe that the optimization has a great effect on both the throughput and the size of the index on all four real datasets, as empty partitions are effectively excluded from query evaluation and from the indexing process.

Figure 9 shows the effect of either or both of these optimizations on the performance of HINT^m for different values of m (i.e., the number of index levels used). Observe that for all values of m, the version of HINT^m which uses both optimizations is superior to all other versions. As expected, the skew & sparsity optimization helps to reduce the space requirements of the index when m is large, because in this case there are many empty partitions at the bottom levels of the index. At the same time, the cache misses optimization helps in reducing the number of cache misses in all cases where no comparisons are needed. Overall, the optimized version of HINT^m converges to its best performance at a relatively small value of m, where the space requirements of the index are relatively low, especially on the BOOKS and WEBKIT datasets which contain long-lived intervals. On the other hand, as we have seen in the experiments of Section 3.3, FLAT+ needs a relatively large number of partitions to achieve its best performance (especially for BOOKS and WEBKIT).

Table 6 shows the best values for the number of partitions of FLAT+ and the number of levels m of HINT^m, i.e., the smallest values for which the indices converge to their best throughputs. In addition, for these values, the table shows the replication factor of the indices, i.e., the average number of partitions in which each interval is stored as a replica. A larger number implies larger storage requirements. Observe that the replication factor of FLAT+ is very high for BOOKS and WEBKIT because these datasets contain long intervals, whereas the factor of HINT^m is relatively low, because long intervals are stored in at most two partitions per level. On the other hand, for TAXIS and GREEND replication is low for both FLAT+ and HINT^m, because the intervals are very short. Finally, the last line of the table (avg. comp. part.) shows the average number of HINT^m partitions for which comparisons have to be applied. All numbers are below 4, which is consistent with our expectation (see the analysis in Section 4.2.2). In practice, this means that the performance of HINT^m is very close to that of the comparison-free HINT.

                            BOOKS   WEBKIT   TAXIS   GREEND
best # partitions (FLAT+)
repl. factor (FLAT+)
best m (HINT^m)
repl. factor (HINT^m)
avg. comp. part. (HINT^m)
Table 6. Statistics in FLAT+ and HINT^m

5.3. Index performance comparison

In this section, we compare the optimized versions of HINT, HINT^m, and FLAT+ with the three competitors: the interval tree, the period index, and FLAT. We start with our tests on the real datasets. For each dataset, the best number of partitions and the best number of index levels m are used for FLAT+ and HINT^m, respectively, as shown in Table 6. Figure 10 compares the query throughputs of all methods on queries of various lengths (as a percentage of the domain size), whereas Table 7 shows the sizes of each index in memory and Table 8 shows the construction cost of each index. The first set of bars in each plot corresponds to stabbing queries, i.e., range queries of length 0. Observe that HINT and HINT^m outperform the competition across the board, being one order of magnitude faster than the competitors. FLAT+ achieves very good performance as well. In fact, for very short queries, FLAT+ may outperform HINT and HINT^m, as in these cases the query range is usually completely contained inside a single FLAT+ partition. On the other hand, FLAT+ requires significantly more space for datasets with long intervals (BOOKS and WEBKIT) (see Table 6).

HINT^m is overall the best index, as it achieves the performance of HINT while requiring less space, confirming the findings of our analysis in Section 4.2.2. As shown in Table 7, HINT always has higher space requirements than HINT^m; even up to an order of magnitude higher in the case of GREEND. What is more, since HINT^m offers the option to control the occupied space in memory by appropriately setting the parameter m, it can handle scenarios with space limitations. HINT is marginally better than HINT^m only on the datasets with short intervals (TAXIS and GREEND) and only for selective queries. In these cases, the intervals are stored at the lowest levels of the hierarchy, where HINT^m typically needs to conduct comparisons to identify results, while HINT applies comparison-free retrieval.

Figure 10. Comparing throughputs on the real datasets BOOKS, WEBKIT, TAXIS, and GREEND (Interval tree, Period index, FLAT, FLAT+, HINT, HINT^m)
index BOOKS WEBKIT TAXIS GREEND
Interval tree 97 115 3125 2241
Period index 210 217 2278 1262
FLAT 949 604 2165 1264
FLAT+ 3725 6466 3564 1270
HINT 273 337 7733 10131
HINT^m 150 174 3079 1283
Table 7. Comparing index size [MBs]
index BOOKS WEBKIT TAXIS GREEND
Interval tree 0.249647 0.333642 47.1913 26.8279
Period index 1.14919 1.35353 76.9302 46.3992
FLAT 1.26315 0.952408 4.02325 2.23768
FLAT+ 22.334614 47.882271 32.965872 6.041567
HINT 1.70093 11.7671 49.589 36.5143
HINT^m 1.15779 1.15857 39.4318 8.39652
Table 8. Comparing index time [sec]
Figure 11. Comparing throughputs, synthetic datasets

The next set of experiments is on the synthetic datasets. In each experiment, we fix all but one of the parameters (domain length, cardinality, α, σ, query extent) to their default values and vary the remaining one (see Table 4). The number of partitions of FLAT+ and the number of levels m of HINT^m were tuned to their best values on each dataset (as we did for the real datasets). Figure 11 reports the results, which follow a similar trend to the tests on the real datasets. HINT and HINT^m are always significantly faster than the competition, followed by FLAT+ in all cases. Differently from the case of the real datasets, FLAT is steadily outperformed by both the interval tree and the period index. Essentially, the uniform partitioning of the domain employed by FLAT cannot cope with the skewness of the synthetic datasets; in contrast, despite using the same partitioning, FLAT+ is boosted by the subdivisions and the optimizations discussed in Section 3, while HINT and HINT^m use a much finer partitioning. As expected, the domain size, the dataset cardinality and the query extent have a negative impact on the performance of all indices. Essentially, increasing the domain length under a fixed query extent affects the performance similarly to increasing the query extent, i.e., the queries become longer and less selective, including more results. Further, the querying cost grows linearly with the dataset size, because the number of query results is proportional to it. On the other hand, as α grows, the intervals become shorter, so the query performance improves. Similarly, by increasing σ the intervals become more widespread, which means that the queries are expected to retrieve fewer results, and the query cost drops accordingly.

5.4. Overlap Interval Joins

In the last experiment, we investigate the applicability of our indices to the evaluation of overlap interval joins, where the objective is to find, in two joined datasets and , all pairs of interval