I Introduction
Connected mobility companies need to process vast amounts of location data in near realtime to run their businesses. For instance, Uber needs to join locations of passenger requests with a set of predefined polygons to display available products (e.g., Uber X) and to enable dynamic pricing^{1}^{1}1https://eng.uber.com/gogeofence/. Another example are traffic use cases where the positions of vehicles need to be joined with street segments to enable realtime traffic control. With a future of connected (and possibly selfdriving) cars and drones, highperformance geospatial joins will become a key feature for increasingly demanding applications.
While geospatial joins have been studied for decades, many of them optimize for I/O operations, which is an important performance factor for diskbased systems. With the large mainmemory capacities of modern hardware, however, it is for the first time possible to maintain finegrained index structures purely in main memory.
We propose an adaptive geospatial join that addresses workloads with streaming points and static polygons. Our approach is optimized for modern hardware with large highbandwidth memory and our implementation fully utilizes manycore processors and their Single Instruction Multiple Data (SIMD) units.
Figure 1 illustrates our approach. We make use of true hit filtering proposed by [1]. The authors of this work used maximum enclosed rectangles and circles as progressive approximations of single polygons. In contrast, we use an adaptive grid represented by a specialized radix tree, Adaptive Cell Trie (ACT), to not only index single polygons but entire sets of polygons.
There are systems (e.g., Oracle Spatial [2] and Spark Magellan^{2}^{2}2https://github.com/harsha2010/magellan) that have used the idea of true hit filtering. We take this idea one step further and leverage the large mainmemory capacities available in modern hardware to maximize the likelihood of true hits. We increase this likelihood by training the index using historical data points.
In summary, we transform the traditionally computeintensive problem of geospatial joins into a memoryintensive one. Our approach results in speedups of up to two orders of magnitude compared to stateoftheart standalone libraries and geospatial database systems.
Our contributions include:

An adaptive geospatial join that uses hierarchical grids combined with radix indexing techniques to identify most join partners in the filter phase.

An approximate version of this approach that guarantees a userdefined precision.

An optimization of this technique for Intel’s latest Xeon Phi, named Knights Landing (KNL).
Ii Background
Unless stated otherwise, we follow the semantics of the —ST_Covers— join predicate supported by PostGIS and Oracle Spatial. —ST_Covers— evaluates whether one geospatial object (e.g., a polygon) covers another (e.g., a point). Points on the edges or vertices of a polygon are considered to be within the polygon. Other predicates, such as —ST_Intersects—, are not explicitly addressed in this paper but can be implemented with similar techniques.
Our geospatial join uses several building blocks:
PIP: A pointinpolygon (PIP) test determines whether a point lies within a polygon. Typically such a test is performed using complex geometric operations, such as the ray tracing algorithm
which involves drawing a line from the query point to a point that is known to be outside of the polygon and counting the number of edges that the line crosses. If the line crosses an odd number of edges, the query point lies within the polygon. The runtime complexity of this algorithm is
withbeing the number of edges. While there are many conceptual optimizations to the PIP test, this operation remains computationally expensive since it processes real numbers (e.g., latitude/longitude coordinates) and thus involves floating point arithmetics. In geospatial joins, this test should thus be avoided whenever possible.
Spacefilling curves: Geographical coordinates can either be indexed using multidimensional data structures, such as Rtrees, or they can be linearized and indexed using onedimensional data structures, such as Btrees or tries. Linearization methods include the Hilbert spacefilling curve and the Z curve. Our approach relies on discretization and linearization but does not depend on a concrete spacefilling curve. For our approach to work, the cell enumeration must only fulfill the property that children cells share a common prefix with their parent cell.
Google S2: We use the Google S2 library^{3}^{3}3http://code.google.com/archive/p/s2geometrylibrary/ that provides primitives that our approach builds upon. These primitives include a PIP test algorithm and adaptive grid approximations of polygons. In general, S2 offers many features for processing geospatial data in both discrete and nondiscrete space. To transform coordinates from the nondiscrete space (the latitude/longitude coordinate system) into the discrete space, S2 maps points on earth onto a surrounding unit cube. Then it recursively subdivides each of the six faces of the cube in a quadtree fashion and enumerates the cells using the Hilbert spacefilling curve. The enumerated cells are 64 bit integers, called cell ids, that uniquely identify a cell. The three most significant bits represent one of the six faces of the cube. The following zero to 30 bit pairs identify the quadtree cell. The next bit is always set and is used to identify the level of a cell. The remaining bits (if any) are set to zero. The smallest cells use all 64 bits.
Figure 2 illustrates the hierarchical decomposition of a cube face and shows how many bits the corresponding cell ids require depending on their level. The colored bits encode the levels. The cell ids in the example share a common prefix with their ancestors up to the colored bit. For example, is an ancestor of and (i.e., fully contains and ). A containment relationship between cells can thus be efficiently computed using bitwise operations.
As mentioned above, S2 offers methods for approximating polygons. Figure 2(a) shows a covering of a polygon. A covering is a collection of cells (possibly at different levels) fully covering a polygon. Figure 2(b) shows an interior covering. As the name suggests, an interior covering approximates the interior of a polygon. All cells of an interior covering (interior cells) are completely contained within the polygon. When a point lies in an interior cell, we know that it also lies within the polygon. Both types of coverings can be utilized to speed up PIP tests.
To allow for an efficient search, S2 stores the cell ids of a covering in a sorted vector. Besides sorting the cell id vector, it allows for normalizing the covering. A normalized covering neither contains conflicting nor duplicate cells. A conflict between two cells exists, if one cell contains the other. Only when the covering is normalized, cell containment checks can be efficiently () implemented using a binary search on the sorted vector. The normalization of an covering does not lead to a precision loss.
Iii Geospatial Join
With our approach, we address geospatial joins between streaming points and static polygons. An example of such a workload is mapping Uber or DriveNow cars and passenger requests to predefined zones for allocation purposes.
Iiia Overview
Our geospatial join takes a set of simple polygons, builds an index, and probes points against the index. The high level idea is to index adaptive grid approximations of the polygons in a highlyefficient and specialized radix tree. In contrast to techniques that first reduce the number of polygons using an index, e.g., an Rtree on the polygons’ minimum bounding rectangles (MBRs), and then refine candidates using geometric operations, our approach implements true hit filtering [1] and identifies most or even all join partners in the filter phase. Given a memory budget and a precision bound, we adapt our index to satisfy these constraints while maximizing probe performance.
Our algorithm consists of five phases:
Build logical index: We compute coverings and interior coverings of all polygons and merge them to form a logical index.
Build physical index: Then we build a physical index on the cells of the logical index.
Training: When our approximate approach cannot guarantee a userdefined precision without exceeding a specified memory budget, we use an exact approach and train the index with historical data points to increase probe performance. Popular areas that expect more hits are approximated using a finergrained grid than less popular areas.
Probe index: In this phase, we probe the points against the physical index. An index probe can either return a false hit or a list of hits, where each hit can either be a true hit or a candidate hit.
Refine candidates: Finally, we perform exact geometric computations
for all candidate hits^{4}^{4}4Often, this phase can be skipped. In fact, it is only required if exact results are needed and if there are candidate hits at all..
We use the S2 library to compute coverings and interior coverings. Note that our technique does not rely on S2. In fact, any other adaptive grid index would work as well. We then merge the individual coverings to create a logical index, before we build a physical index. Note that the first two phases can be performed in one step, i.e., the individual coverings can be computed and merged while inserting their cells into the index. In our approach, the probe (filter) phase is the performance critical part. We therefore highly parallelize this phase using thread and instructionlevel parallelism to accelerate lookups in the physical index. In the refinement phase, we use S2’s PIP test, which implements the ray tracing algorithm (cf., [3] for performance numbers).
The overall strategy of our technique is to minimize the number of (expensive) refinements. We can decrease the number of refinements by using a more finegrained index. Since we make use of true hit filtering, a precise index allows us to identify most join partners during the filter phase. We can always omit the refinement phase when approximate results are sufficient. The error of such an approximate algorithm is bound by the length of the diagonal of the largest covering cell, which we choose to be sufficiently small.
Our geospatial join takes two parameters: (i) a memory budget that must not be exceeded and (ii) an applicationdependent precision^{5}^{5}5The maximum distance (in meters) between the two partners of a false positive join pair. that must be guaranteed. We first try using the approximate approach by refining the index until we can guarantee the userdefined precision bound. This involves replacing covering cells with their children cells until the largest covering cell is sufficiently small (i.e., its diagonal is less than the precision bound). It might occur that we exhaust the memory budget during this procedure. In that case, we use the exact approach and train the index until we have used the entire memory budget. Training the index reduces the likelihood for expensive geometric operations. We evaluate these two approaches in Section V.
In summary, we trade memory consumption with precision (approximate approach) and probe performance (exact approach). The approximate approach almost involves no computation (except result aggregation) while the exact approach reduces expensive computations by training the index. Thus, both strategies favor modern hardware with large main memory capacities and high memory bandwidths.
IiiB Build Logical Index
We compute coverings and interior coverings of all polygons, and combine them into a logical index, which we call super covering. The precision of the super covering defines the selectivity of the index. As mentioned earlier, a normalized covering excludes conflicting and duplicate cells. In contrast to the lossless normalization of a single covering, we might experience a precision loss when merging two overlapping coverings and normalizing the resulting super covering^{6}^{6}6There is no precision loss if the overlap only consists of duplicate cells.. In other words, a normalized super covering may be less selective than two individual coverings. Figures 3(a) and 3(b) show the coverings of two individual polygons. The red cells have conflicts with cells of the other covering. Figure 3(c) shows the normalized super covering. The conflicting cells were expanded to larger cells, which lead to a precision loss.
When combining the covering and the interior covering of a single polygon, the situation is even more severe since there will be many overlapping cells. To retain the precision of the individual coverings, we can omit removing conflicting cells and only remove duplicates. However, there is a tradeoff between the size of a super covering and its precision. We have experimented with three flavors of a super covering: (i) only covering cells, (ii) covering and interior cells with precision loss, and (iii) covering and interior cells without precision loss. The first approach only indexes covering cells. This approach cannot identify true hits during the filter phase. The second approach additionally indexes interior cells and can thus skip the refinement phase when hitting an interior cell. The first two approaches normalize the super covering when combining the cells leading to a precision loss. The third approach avoids the normalization. Instead of storing a cell and its descendant cell , we compute their difference and store and . This has the advantage that there will not be any overlap between the indexed cells and thus an index lookup will at most return a single cell. The side effect is that the total number of cells will increase since consists of at least three cells. This approach retains the precision and the type (covering or interior) of the individual cells as well as the mappings of cells to polygons^{7}^{7}7We expect the individual coverings (that serve as inputs to the merge and index phases) to be normalized. Instead, they could also be normalized “on the fly”.. Figure 5 illustrates this precision preserving conflict resolution. Assume that and are cells of two different coverings. First, we compute , which consists of six cells. We then copy all references of to and and omit . Note that the cell count is increased by five.
Each cell in the super covering maps to a list of polygon references. A polygon reference has two attributes:
polygon id: The id of the polygon that this cell references.
interior flag: Whether the cell is an interior or a covering cell of the polygon.
Listing 1 outlines the algorithm that builds a super covering of coverings and interior coverings and retains the precision of the individual coverings. We iterate over all input cells and try to insert them into the super covering. When a cell already exists, this means that it is also part of another covering that has already been processed. When a cell conflicts with another cell, this means that either the current cell covers the other cell or vice versa. These two cases may happen when polygons overlap or are close to each other. Conflicts also occur when we first insert the cells of a covering of a given polygon and then the cells of its interior covering. The interior cells always overlap some (if not all) covering cells. By appropriately resolving these conflicts, we retain the precision of the index. As mentioned earlier, this strategy increases the total number of cells. However, a more precise index reduces the number of refinements and thus increases the overall performance of our algorithm. In the remainder of this paper, we use the approach that retains the precision of the individual coverings.


Figure 6 illustrates a coarsegrained and a finegrained super covering of the five NYC boroughs. For computing the illustrated super coverings, we use up to 50 and up to 200 cells per individual (interior) covering for the coarsegrained and the finegrained super covering, respectively. The key message of this illustration is that the size of the green area increases with more precise individual coverings, while the size of the red and the blue area decreases. Increasing the size of the green area increases the likelihood that we can skip the refinement phase. Interior cells are not only more likely to be hit than covering cells but they also tend to be larger. Since larger cells use less bits, this leads to an additional performance improvement when probing ACT. The reason for this is that large cells are indexed high up in the tree and are thus found sooner.
IiiC Build Physical Index
To store the logical index, we use a physical index consisting of two data structures, ACT and a lookup table. Both data structures are designed as inmemory data structures and are optimized for modern hardware. In particular, we utilize large mainmemory capacities, avoid unnecessary cache misses, and have parallelized and vectorized the lookups in ACT (cf., Section IV). ACT stores the cell ids of the super covering and the lookup table stores the corresponding lists of polygon references. When a given cell maps to less than three polygon references, we inline the references into the tree nodes avoiding an additional indirection and a possible cache miss.
Adaptive Cell Trie: In S2, the cells of an individual covering are stored in a sorted vector that can be efficiently searched using a binary search (cf., Section II). While this approach has a runtime complexity of , it is not very space efficient, since the cell ids of the cells of a single covering often share a common prefix that is stored redundantly. For a single covering, the space consumption can be reduced by storing the common prefix separately and only storing the varying parts of the cells (e.g., using delta encoding). Since the cells of a single covering are spatially dense and thus likely to share a common prefix, this works well. However, when combining the coverings and interior coverings of multiple polygons, the cells are not necessarily dense anymore. Also, we only require prefix lookups, i.e., we need to search for the cell ids that share a common prefix with our search key (the cell id of a query point) to check whether the query point is contained in one of the indexed cells. A radix tree thus is the ideal data structure for our needs. It is more spaceefficient than a sorted vector, since it avoids redundantly storing common prefixes (in a trie, the path to a leaf node implicitly defines the key). Another advantage of a radix tree is that lookups are in with being the key length. Besides the algorithmic complexity, lookup performance depends on the number of cache misses. Since we are generally interested in high selectivity and index many cells, the tree usually exceeds cache size (cf., Section V). Thus, traversing the tree results in many cache misses (at least one per noncached node).
Let be the average key length and be the fanout of the tree. Then the average costs
of a lookup can be estimated as follows:
= * costs per node access
Increasing the number of indexed cells does not necessarily increase . Thus, the costs of a lookup in a noncached tree are relatively independent of the number of indexed cells. The number of node accesses is bounded by the maximum key length , which is 60 when using S2’s adaptive grid index^{8}^{8}8The three face bits are indexed in a dedicated face node and the level encoding bit is not required.. Our tree uses a default fanout () of 256 (= 8 bits). Thus every level in the tree represents four S2 levels (recall that every S2 level is encoded with two bits). This has the side effect that we can only index cells at certain levels.
Let be the S2 level granularity of the tree. Then the following holds for indexed cells:
mod = 0
Thus, we need to denormalize^{9}^{9}9Denormalizing a cell to a given level means replacing the cell with all of its descendant cells at that level. cells upon insertion and replicate their payloads.
While a fanout of 256 results in sparse nodes and thus in a large space consumption, it allows for efficient lookups. An adaptive radix tree (ART) [4] is usually more spaceefficient, however, experiments show that introducing a second node type with four children (Node4 in ART) would (i) only save a negligible amount of space in the case of our workload and (ii) switching between the different node types would have a significant impact on lookup performance. Also, lookups in compressed node types would be more expensive due to the additional indirection for accessing the payloads. Since ACT is designed as a transient data structure that is built at runtime, we favor performance over space consumption. With , the maximum number of node accesses is . In practice, a lower is often sufficient. For example, allows for indexing S2 cells up to level 24 (a level 24 cell represents at most 2 on earth) and limits the number of node accesses to 6.


Figure 6(a) illustrates the structure of ACT. Values (payloads or offsets) can be found in any node at any level of the tree. Every node consists of a fixedsized array of 256 entries of 8 byte pointers. These pointers are tagged. By default, all entries point to a sentinel node indicating a false hit. Such a tagged entry can be:

An 8 byte pointer to a child or the sentinel node

An inlined payload (a 31 bit value)

Two inlined payloads (two 31 bit values)

An offset (a 31 bit value) into a lookup table
We use the two least significant bits of the 8 byte pointer to differentiate between these four possibilities. For an inlined payload, we differentiate between a true hit and a candidate hit using the least significant bit of the 31 bit payload. Thus, we can effectively only store bit payloads (i.e., index up to polygons). We decided to inline two payloads since it is very likely that a cell has less than three payloads in the case of disjoint polygons. We do not inline more than two payloads since this would mean further reducing the supported number of polygons.
Since our implementation uses S2, we need to maintain up to six radix trees, one for each face of the cube (cf., Section II). We call the node that stores the root nodes of these trees face node.
Lookup table: When a cell references more than two polygons, the tree contains an offset into a lookup table. Cells often reference the same set of polygons. To avoid redundancy, we therefore only store unique polygon reference lists. Figure 6(b)
shows an example of a lookup table. The reference lists are split into two parts, a list with true hits and a list with candidate hits. Both lists contain polygon ids. When a cell references at most two polygons, we inline its payloads into the tree. Assuming disjoint polygons and a uniform distribution of points, a more finegrained super covering may reduce the chance that we need to access the lookup table. Let
be the area that is covered by cells that contain an offset into the lookup table (i.e., cells that reference more than two polygons). Increasing the granularity of the index may reduce the size of () and thus reduces the likelihood of having to access the lookup table. The lookup table is encoded as a single 32 bit unsigned integer array. The offsets stored in the tree are simply offsets into that array. Each encoded entry contains the number of true hits followed by the true hits, the number of candidate hits, and the candidate hits.IiiD Training
When we cannot build an index that satisfies a userdefined precision without exceeding the memory budget, we use an approach that may enter the expensive refinement phase. To minimize the likelihood of refinements, we train the index to adapt to the expected distribution of query points.
The index can either be trained online or offline. In this work, we only discuss and evaluate how to train the index in a dedicated training phase. The online case introduces additional concurrency and buffer management issues that we leave for future work.
Training the index minimizes the area that is covered by expensive cells. We define expensive cells as cells that map to polygon reference lists with at least one candidate hit. When we hit such a cell during the join, we need to enter the refinement phase. Expensive cells are marked blue and red in Figure 6.
We start off with an instance of ACT that does not exceed the memory budget. When a training point hits an expensive cell, we determine the logical representation of this cell^{10}^{10}10Recall that ACT uses a fanout of 256 thus the physical cell in the tree might not be the same as the logical cell that was originally inserted into the tree.. Then we verify for each of its four children cells whether they intersect, are fully contained in, or do not intersect the referenced polygons at all, and update ACT accordingly. We repeat this procedure until we have exhausted the memory budget. We show the effect of training the index in Section V.
IiiE Probe Index
A lookup in ACT returns at most one cell that maps to a list of polygon references. The probe algorithm is shown in Listing 2. In contrast to a search in a binary tree, a search in a radix tree is comparisonfree. This means that we do not compare the value of the search key to the value(s) stored in the current node. We only need to extract the relevant bits of the search key and jump to the corresponding offset. However, we do a comparison for checking whether the entry contains a payload.
[outerlinewidth=0.1pt,outerlinecolor=black, innerleftmargin=2pt,innerrightmargin=2pt,innertopmargin=15pt,innerbottommargin=10pt]
For the tagged entry returned by the tree probe, we need to differentiate between (i) one payload, (ii) two payloads, and (iii) an offset. In the first case, we need to check whether the payload is invalid, which indicates a false hit. Otherwise, we extract the interior flag (the least significant bit of the 31 bit payload) and the polygon id and return a polygon reference. In the second case, we extract and return both references. Only in the third case, we need to access the lookup table to retrieve the polygon references.
Listing 3 shows the complete probe algorithm. For a given point, we retrieve the cell that it is contained in (if such a cell exists) and go over all references of this cell. We can skip refinement checks for true hits and immediately output the join partners.
Iv Modern Hardware
In this section, we present the optimizations of our proposed join algorithm for modern hardware architectures. The general hardware trend is having an increasing number of cores with ever larger vector processing units (VPUs). That applies to established serverclass machines such as Intel’s Xeon and particularly to highperformance computing platforms such as Intel’s Xeon Phi. In this work, we focus on Intel’s secondgeneration Xeon Phi processor, Knights Landing.
Iva Knights Landing Processor
Just like its predecessor, the KNL processor is a many integrated core (MIC) architecture which draws its computational power from wide VPUs. The corresponding AVX512 offer SIMD capabilities on 512bit registers. In contrast to the first generation Phi, KNL is available as a coprocessor expansion card and as a selfboot processor (socket on a mainboard) which is fully backward compatible as it also implements all legacy instructions. In this work, we focus on the selfboot processor and not the coprocessor expansion card.
Compared to our Xeon processor (cf., Section V), equipped with 14 fast clocked brawny cores per socket, the KNL processor consists of 64 wimpy cores moderately clocked at 1.3 GHz. The register width as well as the number of vector registers has been doubled over AVX2 architectures. Thus, for a software to run efficiently on KNL, it is essential to exploit the offered SIMD capabilities.
Especially for dataintensive workloads like ours, KNL offers three promising features: (i) it is equipped with an onchip High Bandwidth Memory (HBM) which is comparable to the memory on highend GPUs in speed and size, (ii) the processor performs a much more aggressive memory prefetching, and (iii) the 4way hyper threading can help to hide memory latencies.
IvB Parallelism
KNL is a massively parallel processor in terms of threadlevel parallelism and data parallelism on the instruction level.
IvB1 Multithreading
Threadlevel parallelism can be achieved using wellknown techniques. In existing parallel joins, multiple threads probe an index structure (e.g., a hash table) in parallel. During the probe phase, the index structure is in an immutable state. Therefore, an arbitrary number of threads can perform read operations without the need for synchronization. As no contention points exist, the join performance typically scales linearly in the number of threads.
IvB2 Data Parallelism
Achieving data parallelism on instruction level is more involved and requires several modifications over the scalar implementation. The goal is again to perform multiple index lookups in parallel. We divide a 512 bit SIMD register into eight 64 bit SIMD lanes and can thus process up to eight elements simultaneously per vector unit. The major difference to thread parallelism is, that the same instruction is executed on all inflight lookups, which is due to the lockstep nature of SIMD. Thus, for the SIMD implementation to be efficient, the inflight lookups need to make progress with every issued instruction. This implies that every lookup must be in the same stage of the algorithm, i.e., it is no longer possible that one lookup performs a prefix comparison while simultaneously another issues a memory load. To achieve this, our probe algorithm from Listing 2 is decomposed into the following stages:
Determine tree root: As mentioned earlier, our implementation uses S2 for space partitioning the earth. Thus, a join index actually consists of up to six radix trees. In this stage, we extract the three face bits of the query point’s cell id to determine the radix tree that needs to be traversed. Further, we check if the query point and the polygons indexed in the radix tree share the predetermined common prefix. If the prefixes do not match, then it is guaranteed that there is no join partner. We want to point out that on the one hand this stage of the algorithm is S2 specific but on the other hand it is similar to a scenario where the data is radix partitioned by three bits.
Tree traversal: In this stage, we traverse the tree. In each iteration, we compute the offset of the entry in the current tree node according to the search key. If the entry is a pointer to a child tree node we follow the pointer and continue.
Produce output: In the third and last stage, we interpret the content of the entries where the search in the previous stage terminated. We have to distinguish the cases (i) false hit, (ii) single or double hit, and (iii) multiple hits which are either true or candidate hits (cf., Section III).
To illustrate the vectorized implementation, we use pseudo code instead of C++ for better readability. In the pseudo code, vectors are annotated with an arrow, bitmasks are named —m—, and for masked operations we use the following notation:
= —operation——(——)—
The —operation— is applied to the components in specified by mask and the results are stored in . The remaining components are copied from to . Further, we make use of the —vpermd/q— instruction (for better readability we refer to it as —permute—) that shuffles vector components using the corresponding index vector, e.g.,
= —permute(——)—.
During index construction we initialize the following vectors: [outerlinewidth=0.1pt,outerlinecolor=black, innerleftmargin=2pt,innerrightmargin=2pt,innertopmargin=3pt,innerbottommargin=10pt]
These vectors are used to determine the tree roots in the first stage (cf., Listing 4). – For performance reasons these vectors remain in CPU registers to avoid memory loads. – Given a vector of cell ids, we first extract the face bits of each cell id. The resulting vector is then used as a permutation index vector to move the pointer of the corresponding root nodes to the output vector . Similarly, we obtain the prefixes and the lengths of the prefixes. The actual prefix check, an equality comparison, results in a bitmask which is used as an execution mask in the later stage. I.e., if the cell id in SIMD lane does not match the common prefix, then the th bit in the bitmask is set to zero.
[outerlinewidth=0.1pt,outerlinecolor=black, innerleftmargin=2pt,innerrightmargin=2pt,innertopmargin=15pt,innerbottommargin=10pt]
The tree traversal, shown in Listing 5, is more involved. As it performs multiple traversals in lockstep, it is required to do some bookkeeping about the current traversal state. The mask —m— keeps track of the active SIMD lanes and in —m— the algorithm memorizes the lanes that produced an output. The output mask therefore represents the execution mask for the next stage. The output values itself are stored in .
In each iteration of the while loop, the algorithm first computes the entry offsets from which it subsequently gathers the contents. The gather is a masked operation for two reasons: (i) to avoid unnecessary memory accesses and (ii) to keep already existing output values in .
Once the bucket contents are loaded into a vector register, we have to check whether the values are payloads or pointers. In case of pointers, the addresses are compared to the address of the sentinel node. If the comparison evaluates to true, the search terminates without producing an output, otherwise the corresponding pointers are moved to . Based on the two masks —m— and —m—, the traverse mask —m— is updated and the search continues until —m— is zero.
[outerlinewidth=0.1pt,outerlinecolor=black, innerleftmargin=2pt,innerrightmargin=2pt,innertopmargin=15pt,innerbottommargin=10pt]
In the last stage, we have to distinguish between true and candidate hits. We have not vectorized this stage since its performance impact is neglectable. However, for the —select polygon_id, count(*) … group by polygon_id— query we use in the evaluation (cf., Section V), we partially optimized the aggregation using SIMD. Otherwise, the aggregation would become a bottleneck due to the wimpy CPU cores of KNL.
Our SIMDfied aggregation is optimized for the most likely cases, where either one polygon id or two polygon ids are returned by the ACT lookup. We have implemented a direct aggregation based on gather and scatter instructions and the efficient popcount implementation presented in [5]. In the less likely cases, where the query point matches more than two polygons (i.e., ACT returns an offset into the lookup table) or where the match is inconclusive, we materialize the results into a small L1 resident buffer, which is consumed sequentially later on.
V Evaluation
In this section, we show the performance of our join and the selectivity of our index for a join between points and polygons on two different machines. We compare the performance of our technique to an Rtree based approach that uses the same code in the refinement phase as ours. To give a point of reference, we also compare against PostGIS and Magellan. In addition to the results of the exact algorithm, we show that an approximate algorithm can significantly speed up the join while providing sufficiently precise results for many use cases.
Configuration: We evaluate our approach on a serverclass machine (Xeon) and on KNL (cf., Section IVA). Xeon runs Ubuntu 16.04 and is equipped with an Intel Xeon E52680 v4 CPU (2.40 GHz, 3.30 GHz turbo) and 256 GB DDR4 RAM. The machine has two NUMA sockets with 14 physical cores each, resulting in a total of 28 physical cores (56 hyperthreads). KNL runs CentOS Linux release 7.2.1511 and is equipped with a single Intel Xeon Phi CPU 7210 (1.30 GHz, 1.50 GHz turbo), 96 GB DDR4 RAM, and 16 GB onchip HBM. We use GCC version 5.4.0 with O3 enabled in all experiments.
We join 1.23 B points from the NYC taxi dataset^{11}^{11}11http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml against NYC’s boroughs (5 polygons), neighborhoods (289 polygons), and census blocks (39,184 polygons) and count the number of points per polygon. While there are only five boroughs, their polygons are significantly more complex.
We compare our technique to an approach that builds an Rtree on the polygons, probes the points against the index, and refines candidate hits. In the refinement phase, we use S2’s PIP test, the same algorithm that our technique uses. In the filter phase, we use the boost Rtree implementation (1.6.0) and evaluate different splitting strategies (—linear—, —quadratic—, and —rstar—) and maximum element counts per node (4, 8, 16, and 32). —rstar— with a maximum of 8 elements per node performs best in all workloads. We therefore omit the results for the other configurations. We have also evaluated the —STRtree— implementation in GEOS (3.5.0), however, omit it in the further evaluation as it cannot compete with —rstar— for our workloads. For example in the neighborhoods join on Xeon, —STRtree— achieves a throughput of 27.4 M points/s, while —rstar— processes 70.9 M points/s. The performance difference is caused by the selectivity of the respective index structure. —rstar— returns 1.60 candidates on average while —STRtree— returns 1.94. In addition, we benchmark PostgreSQL 9.6.1 (PostGIS 2.3.1) with a GiST index on the polygons. We configured PostgreSQL to use as many parallel workers^{12}^{12}12The intraquery parallelism feature was introduced in PostgreSQL 9.6. as available hyperthreads. As another competitor, we evaluate a singlemachine deployment of Magellan. Similar to our approach, Magellan uses true hit filtering. Our default configuration for computing the individual coverings is as follows: max covering cells = 128, max covering level = 30, max interior cells = 256, and max interior level = 20.


Results: Figure 8 shows the throughput of our approach compared to the boost Rtree using all threads (56 and 256 threads on Xeon and KNL, respectively).
approx always skips the refinement phase and treats all candidate hits as true hits. The precision of approx is bounded by the diagonal length of the largest covering cell. We refined the index to only contain sufficiently small covering cells to guarantee the denoted precision. approx10m has a false positive rate of almost 0% for boroughs, 3% for neighborhoods, and 17% for the census blocks dataset. The average distance of these false positives join partners is 1.31 m for boroughs, 1.62 m for neighborhoods, and 1.50 m for the census blocks dataset. This shows that our technique can achieve a significantly better precision than the userdefined precision bound (10 m in that case) for real workloads. approx10m consumes 1.2 GB of memory for the largest of the three datasets.
exact denotes the full join that refines candidate hits using S2. We train the index with 1 M historical data points for all datasets.
Figure 9 shows the throughput increase when training ACT with an increasing number of points. An untrained ACT achieves a throughput of 2360 M points/s, whereas ACT trained with 1 M points achieves 3732 M points/s (+58%).
The Xeon binaries are compiled with AVX2 optimizations (march=coreavx2) while the KNL implementation is handoptimized (cf., Section IV). Compared to an autovectorized version, the performance of our handoptimized implementation is between 29% and 70% higher. Autovectorization with AVX512 (march=knl) instead of AVX2 does not have a performance impact suggesting that handtuning is essential to fully utilize the wide vector processing units of KNL.
On Xeon (Figure 7(a)), ACTexact achieves a throughput of 3735 M points/s for the boroughs dataset. With a higher number of polygons in the neighborhoods and census datasets, the throughput of ACTexact decreases. The reason for the slowdown is the lower selectivity and the higher space consumption of our index for the larger datasets (cf., Table I). While the index of the boroughs dataset fits into the cache, the index of the neighborhoods and census blocks datasets exceed the cache size. The Rtree achieves a peak performance of 61.2 M points/s for the neighborhoods dataset. The reason for its slow performance for the boroughs dataset is as follows: S2’s PIP test implements the ray tracing algorithm. As mentioned earlier, this algorithm has a runtime complexity of with being the number of edges. Since the boroughs are complex polygons with many edges, the PIP tests in the refinement phase are very expensive. Here, our algorithm shines since it can identify most join partners in the filter phase and only needs to enter the refinement phase for 0.1% of the points.
As a point of reference, PostgreSQL (PostGIS) with a GiST index on the polygons only achieves a throughput of 0.39 M points/s for the boroughs, 1.09 M points/s for the neighborhoods, and 0.69 M points/s for the census blocks dataset. Similar to the Rtree based join, PostgreSQL’s performance suffers from the complex polygons in the boroughs dataset. Even though PostgreSQL parallelizes this query, its performance is still orders of magnitude lowers than ours. Magellan performs better than PostGIS and achieves a throughput of 0.88 M points/s for boroughs, 4.57 M points/s for neighborhoods, and 2.24 M points/s for census blocks.
With 10 meter precision, ACT achieves a throughput of 4532 M points/s for the boroughs dataset and still 874 M points/s for the almost 8000x larger census blocks dataset. The reason for the similar performance of ACTapprox100m and ACTapprox10m is that neither of the two performs any exact checks. Thus, their performance is dominated by the costs for the ACT node accesses and the final aggregation. For the census blocks dataset, approx10m is even faster than approx100m, since the finergrained index of approx10m returns less candidates on average leading to less aggregations.
On KNL (Figure 7(b)), ACTexact achieves a peak throughput of 4436 M points/s for the boroughs dataset outperforming the Xeon counterpart. For the census blocks dataset, we can see the same effect as on the Xeon machine with the more precise approx approach achieving the highest throughput. These results show that the singlesocket KNL can compete with or even outperform a dualsocket Xeon machine when the code uses vector instructions.
metric  boroughs  neighborhoods  census 

tree nodes  413  13025  590147 
false hits  0.07%  0.09%  0.07% 
solely true hits  99.9%  87.1%  72.1% 
candidates  1.06  1.66  1.72 
metric  boroughs  neighborhoods  census 

tree nodes  456  22484  634005 
false hits  0.07%  0.11%  0.07% 
solely true hits  99.9%  97.7%  88.6% 
candidates  1.15  1.57  1.70 


Table I shows different metrics of our index for the three polygon datasets. The quality metric solely true hits indicates the percentage of points that skipped the expensive refinement phase, which is clearly above 70% throughout all datasets. The points for which the index probe neither returns a false hit nor solely true hits and thus need to enter the refinement phase are compared against less than two candidate polygons on average. For the boroughs dataset, our index consumes less than one MB of memory and thus the top levels of the tree fit into the L2 cache of the Xeon machine. For the larger datasets, we use a larger number of cells and thus also a larger index structure. The exact index of the census blocks dataset requires more than one GB of memory. The high memory consumption is caused by the high fanout of 256 of our radix tree. An adaptive radix tree would be more spaceefficient, however, switching between the different node types has a significant impact on probe performance, especially for cacheresident indices.
Table II shows the effect of training the index with 1 M training points (cf., Section IIIB). This optimization improved the percentage of solely true hits for all three datasets (especially for census blocks) while increasing the number of tree nodes and thus the space consumption of the indices.
Finally, we evaluate the scalability of our technique on both machines. We choose the census blocks dataset for this experiment since its index significantly exceeds the cache sizes of both machines but still fits into the MCDRAM of KNL. Figure 10 shows the results.
On Xeon, all three approaches scale almost linearly with the number of physical cores and benefit from hyperthreading. The fact that an oversubscription of cores has a positive performance impact shows that our technique is bound by memory access latencies and having more threads than physical cores can hide these latencies.
On KNL, ACTapprox10 has its maximum performance with 2way hyperthreading (128 threads) and decreases with 4way hyperthreading (256 threads). ACTexact scales from 13.0 M points/s with one thread to 417 M points/s with 56 threads on Xeon. On KNL, ACTexact scales from 25.9 M points/s with eight threads to 260 M points/s with 256 threads.
Vi Related Work
Geospatial joins have been studied for decades and there is a lot of related work on algorithmic techniques. [6] gives a summary of spatial join techniques. [1] proposes true hit filtering in the form of maximum enclosed rectangles and circles allowing to skip the refinement phase in many cases. [7] follows up on this approach by using raster approximations in the form of uniform grids thereby improving the selectivity. Both approaches optimize not only for selectivity but also for I/O operations, which is an important performance factor for diskbased database systems. [2] recursively divides the MBR of a polygon into four cells until a certain granularity is reached, identifies interior cells, and indexes them in an Rtree to skip refinement checks during join processing. [8] presents an approach that leverages multiple grids with different resolutions to index geospatial objects using standard database indices.
In contrast to existing techniques, our approach proposes a new way of processing geospatial joins by identifying the majority or even all of the join partners in the filter phase using an adaptive grid while leveraging the large mainmemory capacities, high memory bandwidths, multicores, and wide vector processing units of modern hardware. None of these techniques supports training the index with historical data points to improve probe performance.
Several database systems and geographical information systems (GIS) support geospatial joins. [9]
present a comprehensive survey of stateoftheart systems and classifies the approaches that exist in literature to design and implement such systems. There are two main geospatial processing libraries used by database systems: GEOS
^{13}^{13}13http://trac.osgeo.org/geos/ and S2 (cf., Section II). MongoDB, a documentoriented database, uses S2 for geospatial processing and indexes the coverings of geospatial objects in a Btree^{14}^{14}14http://blog.mongodb.org/post/50984169045/newgeofeaturesinmongodb24. HyPerSpace [3], a geospatial extension to the mainmemory database system HyPer [10], also uses S2 for geospatial processing. PostGIS^{15}^{15}15http://postgis.refractions.net/, a geospatial extension to PostgreSQL^{16}^{16}16http://postgresql.org/, uses GEOS for geospatial processing and an Rtree implemented on top of GiST [11] for indexing geospatial objects. MonetDB, a columnoriented database, also uses GEOS for geospatial processing^{17}^{17}17http://monetdb.org/Documentation/Extensions/GIS. Magellan, a library for geospatial analytics that is based on Apache Spark, indexes geospatial objects using the Z curve. Similar to our technique, it is capable of identifying join partners in the filter phase. However, it uses a uniform grid and thus cannot adapt to the geometrical features of the polygons and the expected distribution of query points. Two popular Hadoopbased geospatial information systems are HadoopGIS [12] and SpatialHadoop [13]. Both systems partition the data and store it in HDFS blocks such that spatial objects that belong to a particular partition are stored in the same HDFS block and insert the MBRs of the partitions into a global index. Since these systems rely on offline partitioning of the data, they cannot efficiently handle the online case where points are streamed. GeoSpark [14] does not support the online case either since it does not allow to perform the join without an index on points, and Simba [15] does not yet support polygons. Most work on hardware optimizations for geospatial joins focuses on GPU offloading[16, 17, 18] and [19] propose a vision for a GPUaccelerated endtoend system for performing spatial computations which decides whether to use CPUs or GPUs based on a cost model.Vii Conclusions
We have presented an adaptive geospatial join that leverages true hit filtering using a hierarchical grid represented by a specialized radix tree. We have transformed a traditionally computeintensive problem into a memoryintensive one. Our approach is enabled by adaptively using the large highbandwidth main memory of modern hardware. We have shown that it is possible to refine the index up to a userdefined precision and identify all join partners in the filter phase. We have demonstrated that the exact version of our algorithm can adapt to the expected point distribution. We have optimized our implementation for the brand new KNL processor and have shown that our approach can outperform existing techniques by up to two orders of magnitude.
In future work, we plan to evaluate our approach on the nextgeneration Intel Xeon processor, named Skylake, which also supports AVX512, and on GPUs. We also plan to address offline workloads by leveraging small materialized aggregates (SMAs) available in Data Blocks [20].
Acknowledgment
We want to thank Ram Sriharsha for his help in evaluating Magellan. This work has been sponsored by the German Federal Ministry of Education and Research (BMBF) grant 01IS12057 (FASTDATA and MIRIN). This work is further part of the TUM Living Lab Connected Mobility (TUM LLCM) project and has been funded by the Bavarian Ministry of Economic Affairs and Media, Energy and Technology (StMWi) through the Center Digitisation.Bavaria, an initiative of the Bavarian State Government.
References
 [1] T. Brinkhoff, H. Kriegel, R. Schneider, and B. Seeger, “Multistep processing of spatial joins,” in Proc. of SIGMOD, 1994, pp. 197–208. [Online]. Available: http://doi.acm.org/10.1145/191839.191880
 [2] K. V. R. Kanth and S. Ravada, “Efficient processing of large spatial queries using interior approximations,” in Proc. of SSTD, 2001, pp. 404–424. [Online]. Available: http://dx.doi.org/10.1007/3540477241_21
 [3] V. Pandey, A. Kipf, D. Vorona, T. Mühlbauer, T. Neumann, and A. Kemper, “Highperformance geospatial analytics in HyPerSpace,” in Proc. of SIGMOD. ACM, 2016, pp. 2145–2148.
 [4] V. Leis, A. Kemper, and T. Neumann, “The adaptive radix tree: ARTful indexing for mainmemory databases,” in Proc. of ICDE, 2013, pp. 38–49. [Online]. Available: http://dx.doi.org/10.1109/ICDE.2013.6544812
 [5] T. Gubner and P. Boncz, “Exploring Query Compilation Strategies for JIT, Vectorization and SIMD,” in ADMS, 2017.
 [6] E. H. Jacox and H. Samet, “Spatial join techniques,” ACM Trans. Database Syst., vol. 32, no. 1, p. 7, 2007. [Online]. Available: http://doi.acm.org/10.1145/1206049.1206056
 [7] G. Zimbrao and J. M. de Souza, “A raster approximation for processing of spatial joins,” in Proc. of VLDB, 1998, pp. 558–569. [Online]. Available: http://www.vldb.org/conf/1998/p558.pdf
 [8] J. Roth, “The extended split index to efficiently store and retrieve spatial data with standard databases,” in Proc. of IADIS, 2009, pp. 85–92.
 [9] A. Eldawy and M. F. Mokbel, “The era of big spatial data,” in ICDE Workshops, 2015, pp. 42–49. [Online]. Available: http://dx.doi.org/10.1109/ICDEW.2015.7129542
 [10] A. Kemper and T. Neumann, “HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots,” in Proc. of ICDE, 2011, pp. 195–206. [Online]. Available: http://dx.doi.org/10.1109/ICDE.2011.5767867
 [11] J. M. Hellerstein, J. F. Naughton, and A. Pfeffer, “Generalized search trees for database systems,” in Proc. of VLDB, 1995, pp. 562–573. [Online]. Available: http://www.vldb.org/conf/1995/P562.PDF
 [12] A. Aji, F. Wang, H. Vo, R. Lee, Q. Liu, X. Zhang, and J. H. Saltz, “HadoopGIS: A high performance spatial data warehousing system over MapReduce,” PVLDB, vol. 6, no. 11, pp. 1009–1020, 2013. [Online]. Available: http://www.vldb.org/pvldb/vol6/p1009aji.pdf
 [13] A. Eldawy, “SpatialHadoop: Towards flexible and scalable spatial processing using MapReduce,” in SIGMOD, PhD Symposium, 2014, pp. 46–50. [Online]. Available: http://doi.acm.org/10.1145/2602622.2602625
 [14] J. Yu, J. Wu, and M. Sarwat, “Geospark: a cluster computing framework for processing largescale spatial data,” in Proc. of SIGSPATIAL, 2015, pp. 70:1–70:4. [Online]. Available: http://doi.acm.org/10.1145/2820783.2820860
 [15] D. Xie, F. Li, B. Yao, G. Li, L. Zhou, and M. Guo, “Simba: Efficient inmemory spatial analytics,” in Proc. of SIGMOD, 2016, pp. 1071–1085. [Online]. Available: http://doi.acm.org/10.1145/2882903.2915237
 [16] N. Bandi, C. Sun, D. Agrawal, and A. El Abbadi, “Fast computation of spatial selections and joins using graphics hardware,” Inf. Syst., vol. 32, no. 8, pp. 1073–1100, 2007. [Online]. Available: https://doi.org/10.1016/j.is.2006.12.001
 [17] S. You, J. Zhang, and L. Gruenwald, “Parallel spatial query processing on gpus using rtrees,” in SIGSPATIAL Workshops, 2013, pp. 23–31. [Online]. Available: http://doi.acm.org/10.1145/2534921.2534949
 [18] J. Zhang, S. You, and L. Gruenwald, “Parallel selectivity estimation for optimizing multidimensional spatial join processing on gpus,” in Proc. of ICDE, 2017, pp. 1591–1598. [Online]. Available: https://doi.org/10.1109/ICDE.2017.236
 [19] H. Chavan, R. Alghamdi, and M. F. Mokbel, “Towards a GPU accelerated spatial computing framework,” in ICDE Workshops, 2016, pp. 135–142. [Online]. Available: http://dx.doi.org/10.1109/ICDEW.2016.7495634
 [20] H. Lang, T. Mühlbauer, F. Funke, P. A. Boncz, T. Neumann, and A. Kemper, “Data Blocks: Hybrid OLTP and OLAP on compressed storage using both vectorization and compilation,” in Proc. of SIGMOD, 2016, pp. 311–326. [Online]. Available: http://doi.acm.org/10.1145/2882903.2882925
Comments
There are no comments yet.