The raster model is a logical model widely used for the representation of data in Geographic Information Systems (GIS) Worboys and Duckham (2004); Rigaux et al. (2002) and for the storage of images in general. It is mainly used in GIS to store information of continuous variables, that cover the whole space and for which a specific value at each point in space may exist. Essentially the raster model represents this information as matrices of values. A matrix is built by dividing the space into fixed-size cells, so each cell represents the value of the spatial feature in the corresponding region. Raster image representations store the value of each pixel in a cell of the matrix.
The raster model is frequently used in GIS to store data related to natural geographic phenomena like temperature, wind speed, rainfall level, land elevation, atmospheric pressure, etc. Other information not related to nature, such as land use, is also well suited to this model. The alternative model, the vector model, usually represents discrete variables that have well-defined boundaries, using a collection of points and segments. This is a good fit for the representation of information related to human-made constructions, rivers, boundaries of lakes and forests, etc., but not for phenomena that cannot be described with a few points and lines.
In this paper, we focus on the efficient representation of raster data. As stated before, a raster is essentially a matrix, so an uncompressed raster representation would use much space (for instance, a raster image with a resolution of just 0.5 km and worldwide coverage would require a matrix of around 13 GB to store an integer per cell; modern high-resolution raster imagery can reach much higher spatial resolution, and therefore requires much larger storage space). Because of this, plain raster representations usually have to be stored in secondary memory. Compressed raster representations exist, but they are mainly designed to reduce storage, and do not provide efficient access. Most of them are based on well-known compression techniques such as run-length encoding or LZW Welch (1984). In these compressed solutions the space requirements become much smaller, due to the locality of raster datasets (spatial continuity): close cells tend to have similar values. However, in most of them the full file, or at least large chunks of the file, must be decompressed even to display a small region of the space. A well-known technique, called tiling Shekhar et al. (2017), divides the raster into smaller, fixed-size tiles and compresses each tile independently, providing some level of direct access and taking advantage of the locality of values to improve compression. For example, the TIFF image format and its extension for geographical information (http://trac.osgeo.org/geotiff/) support this partition into tiles with different compression techniques including LZW. Still, tiles must be relatively large to enable compression.
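As a rough check of the storage figure mentioned above, the following Python sketch reproduces the arithmetic. The Earth dimensions (about 40,000 km around the equator and 20,000 km from pole to pole) and the 4-byte integer per cell are assumptions made here for illustration; they are not stated in the text.

```python
# Back-of-the-envelope size of an uncompressed worldwide raster.
# Assumptions (not from the text): 40,000 km x 20,000 km extent,
# and one 4-byte integer stored per cell.
CELL_KM = 0.5
BYTES_PER_CELL = 4

cols = int(40_000 / CELL_KM)   # cells along the equator
rows = int(20_000 / CELL_KM)   # cells from pole to pole
size_gb = rows * cols * BYTES_PER_CELL / 1e9

print(rows, cols, round(size_gb, 1))
```

Under these assumptions the matrix has 40,000 rows and 80,000 columns, which is consistent with the figure of around 13 GB.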
When data collections are stored by a GIS in a compressed format, such as the ones described above, some of the processing tasks that involve the complete raster can be performed by simply decompressing the data. However, many operations would benefit from direct access to regions (e.g., to display a local map), or the ability to find the cells whose value is within some range. A classic example of this is the visualization of pressure or temperature bands Zhang and You (2010), where the raster is filtered to display cells differently according to the range of values to which they belong. Another example involves retrieving the regions of a raster with an elevation above a given threshold, to find zones with snow alert, or below a value, to find regions with risk of floods Martinis et al. (2009). Regular compressed raster representations lack the indexing capabilities on the values stored in the raster that would be required to answer this type of query. Therefore, these representations need to traverse the complete raster in order to return the cells that contain a given value, even when the results may be restricted to a small subset of the cells.
There are several approaches to provide direct access to values in a raster dataset. For instance, we can consider the raster as a 3-dimensional matrix and use computational geometry solutions to answer any query involving spatial ranges or ranges of values by means of range reporting queries Chan et al. (2011). However, these solutions require superlinear space and therefore they are not suitable to the large datasets involved. Other representations of raster data that aim at efficient querying are usually based on quadtrees Finkel and Bentley (1974), particularly variations of the linear quadtree Gargantini (1982), a data structure originally devised for secondary memory. There exist other quadtree representations Chang and Chang (1994); Lin (1997a) that can work in internal memory, and they are very efficient for processing complete rasters, but they usually lack query capabilities to access specific regions or cells with specific values. An extension of the quadtree to 3 dimensions, or oct-tree Samet (1984), could support those queries in a similar way to the computational geometry solutions. This structure does not require superlinear space, but does not provide compression either.
Compact data structures have been a very active research topic for the last few decades. They aim to represent any kind of information (texts, permutations, trees, graphs, etc.) in compressed space, while supporting query and processing algorithms that are able to work over the compact representation. This allows compact data structures to improve the efficiency of classical data structures, thanks to being stored in upper levels of the memory hierarchy. However, regarding spatial information, and more precisely, raster data representation and querying, most of the previous work based on compact representations lacks advanced query support Chang and Chang (1994); Lin (1997b).
A simple solution to store raster data using compact data structures could be achieved by reading the raster row-wise and storing the sequence of values. We could use any compressed sequence representation Grossi et al. (2003); Golynski et al. (2006); Barbay et al. (2010) to return the cells with a given value (or a range of values Grossi et al. (2003); Navarro (2012)) efficiently, but in this kind of approach restricting the search to a spatial range becomes difficult. Furthermore, these sequence representations achieve at best the zero-order entropy space of the sequence, and this is not a significant space reduction in many cases, since it cannot fully exploit the spatial locality of values in raster data.
In this paper, we propose several compact data structures for raster data that efficiently support different queries, particularly those combining spatial indexing (filtering cells in a spatial window) with filters on values (retrieving cells with a specific value or in a range of values). We build on existing compact data structures that represent sets of points in a kind of compressed linear quadtree, and upgrade them to efficiently store and query raster data in different forms: simple binary images, general raster matrices, and even time-evolving raster data. We experimentally test our proposals to demonstrate their low space requirements and good performance in these new application domains. Notice that our data structures can be conceived as a compact representation for any kind of matrix. Nevertheless, they rely on the locality of values to achieve compression, so we focus our evaluation on raster data that displays spatial continuity.
2 Previous Concepts
2.1 The $k^2$-tree
The $k^2$-tree Brisaboa et al. (2014b) is a compact data structure for the representation of sparse binary matrices that was initially devised to represent the adjacency matrix of Web graphs, and later applied to the compression of social networks Claude and Ladra (2011) and RDF databases Álvarez-García et al. (2015). Given a binary matrix of size $n \times n$, the $k^2$-tree conceptually represents it as a $k^2$-ary tree, for a given $k$ (the size of the matrix is assumed to be a power of $k$; if it is not, the matrix is expanded to the next power of $k$, filling the new cells with 0s). The root of the conceptual tree corresponds to the complete matrix. Then, the matrix is partitioned into $k^2$ equal-sized submatrices of size $n/k \times n/k$, and each of them (taken from left to right and top to bottom) is represented as a child of the root node. A single bit is associated to each node: a 1 is used if the submatrix associated to the node contains at least one 1; otherwise, the bit is set to 0. The subdivision is applied recursively for each node with value 1, until we reach a matrix full of 0s or the cells of the original matrix. The conceptual tree is then stored using two bitmaps: $T$ stores all the bits in the upper levels of the tree, following a levelwise traversal, and $L$ stores only the bits in the last level. Figure 1 shows an example of $k^2$-tree.
To navigate the tree, a rank structure over $T$ is built. This structure is used to compute the number of ones in the bitmap up to any position ($rank_1$ operation) in constant time, using sublinear space Munro (1996). The $k^2$-tree exhibits a property that provides simple navigation over the conceptual tree using only the bitmaps and the rank structure: given a value 1 at any position $p$ in $T$, its $k^2$ children will start at position $p' = rank_1(T, p) \times k^2$ of $T:L$. When the last level is reached, $p' > |T|$, so the excess $p' - |T|$ determines their position in $L$. A $k^2$-tree can answer single cell queries, queries reporting a complete row/column or general range queries (i.e., retrieve all the 1s in a range) using only rank operations to traverse the tree, by visiting all the necessary subtrees.
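The construction and navigation just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the C implementation evaluated in the paper: the function names are hypothetical, bitmaps are plain Python lists, and rank is answered with a prefix-sum table instead of a constant-time sublinear-space structure.

```python
def build_k2tree(matrix, k=2):
    """Build bitmaps (T, L) for a square 0/1 matrix whose side is a power of k."""
    levels = []                          # bits of each tree level, in level order
    frontier = [(0, 0, len(matrix))]     # (row, col, side) of submatrices to split
    while frontier and frontier[0][2] > 1:
        side = frontier[0][2] // k
        bits, nxt = [], []
        for r0, c0, _ in frontier:
            for i in range(k):           # children left to right, top to bottom
                for j in range(k):
                    r, c = r0 + i * side, c0 + j * side
                    has_one = any(matrix[x][y]
                                  for x in range(r, r + side)
                                  for y in range(c, c + side))
                    bits.append(int(has_one))
                    if has_one:
                        nxt.append((r, c, side))
        levels.append(bits)
        frontier = nxt
    T = [b for level in levels[:-1] for b in level]
    return T, levels[-1]


def get_cell(T, L, n, r, c, k=2):
    """Return the bit at (r, c) by following a single root-to-leaf branch."""
    rank1 = [0]
    for b in T:                          # rank1[i] = number of 1s in T[0:i]
        rank1.append(rank1[-1] + b)
    start, side = 0, n                   # start: position of the first child
    while True:
        side //= k
        p = start + (r // side) * k + (c // side)
        r, c = r % side, c % side
        if p >= len(T):                  # last level: the excess indexes L
            return L[p - len(T)]
        if T[p] == 0:                    # submatrix full of 0s
            return 0
        start = rank1[p + 1] * k * k     # rank_1(T, p) * k^2, counting T[p] itself


M = [[0, 1, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 0]]
T, L = build_k2tree(M)
print(T, L)
```

Row, column and range queries follow the same pattern, except that several children may have to be visited at each level instead of one.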
Some improvements have been proposed by the original authors of the $k^2$-tree Brisaboa et al. (2014b) to enhance its compression and query efficiency. For instance, a hybrid $k^2$-tree uses different values of $k$ in the upper and lower levels of decomposition. Other techniques use statistical compression of the bitmap $L$. A dynamic variant of the $k^2$-tree, called d$k^2$-tree, has also been proposed Brisaboa et al. (2017). The d$k^2$-tree is based on a custom implementation of dynamic bitmaps for $T$ and $L$. By supporting update operations over $T$ and $L$, in addition to rank and select operations, the d$k^2$-tree is able to handle changes in the bits of the binary matrix, as well as insertion of new rows/columns at the end of the matrix.
2.2 The I$k^2$-tree
The Interleaved $k^2$-tree Álvarez-García et al. (2017) (I$k^2$-tree) is a data structure based on the $k^2$-tree and devised to deal with RDF triples. Given a ternary relation, the I$k^2$-tree uses vertical partitioning to decompose it into a collection of binary relations, one for each different value of one of its dimensions. Hence, each adjacency matrix will store the pairs that are related with the corresponding value. That dimension is called the partitioning variable. After this transformation, each of the binary relations could simply be stored in a separate $k^2$-tree, but the I$k^2$-tree is able to represent all those matrices simultaneously in the same tree, providing indexing capabilities also on the partitioning variable.
Conceptually, building an I$k^2$-tree is equivalent to building a collection of $k^2$-trees and merging the equivalent branches of the conceptual trees into a single tree, where each node will store the bits of all the $k^2$-trees. This means that the children of the root node will always have one bit per matrix, but nodes at lower levels of the tree have as many bits as 1s exist in their parent node (i.e., as many bits as trees contain that node). The conceptual tree is stored in two bitmaps $T$ and $L$, exactly like a $k^2$-tree. Figure 2 displays an example of I$k^2$-tree with three matrices. Note that the fourth node at the first level of the I$k^2$-tree has 3 bits, one per matrix; its first bit is 0, because the bottom-right submatrix of the first matrix is full of 0s, and the second and third bits of the node are set to 1; therefore, its children have 2 bits each.
The I$k^2$-tree can be navigated in a similar fashion to $k^2$-trees: at the root level we have $k^2$ nodes with one bit per matrix each; given a node, the position of its children in the bitmaps can be computed in a similar way, adding a fixed correction factor; each child of the node will have as many bits as bits are set to 1 in the current node. Observe also that, thanks to having the bits for all the matrices together in the same node, it is possible to restrict traversal of the tree to a specific value in the partitioning dimension. However, pruning the tree by the partitioning dimension requires more complex operations than filtering branches on the other dimensions, so the structure is usually limited to domains where a partitioning variable of small size can be selected. See Álvarez-García et al. (2017) for further details on the implementation of query operations in the I$k^2$-tree.
2.3 The $k^2$-treap
The $k^2$-treap Brisaboa et al. (2014a) is another proposal based on the $k^2$-tree, and also inspired by the treap Seidel and Aragon (1996). It is specifically designed to answer range top-$k$ queries on multidimensional grids (e.g., OLAP cubes). Given a matrix and a spatial window inside the matrix, a range top-$k$ query asks for the location of the $k$ highest values in the query window.
Starting with a matrix where each cell can be empty or store a numeric value, the $k^2$-treap follows a recursive partition of the matrix into $k^2$ submatrices, similar to the $k^2$-tree. The decomposition works as follows: the root of the tree stores the coordinates of the cell whose weight is the maximum value in the matrix, as well as the cell value. Then, the cell is marked as empty and removed from the matrix. Next, the resulting matrix is subdivided into $k^2$ submatrices and we add the corresponding children nodes to the root of the tree. The assignment process is repeated for each child, taking the cell with the maximum value and its coordinates from the corresponding submatrix, and deleting the value of the cell before continuing. Decomposition eventually stops when a completely empty submatrix is found or the cells of the original matrix are reached.
The conceptual $k^2$-treap is stored using three elements: a sequence coords per level, keeping the coordinates of the local maxima, stored as relative offsets to the origin of the current submatrix (note that empty nodes are dismissed, and coordinates are not needed in the last level of the tree since submatrices are of size 1); the values of the local maxima (i.e., their weights), differentially encoded with respect to their parent node and compressed using DACs Brisaboa et al. (2013) (notice that a small array is also stored to mark the offset in the array where each level of the tree starts); and the tree topology, stored like a $k^2$-tree but with a single bit array, $T$, with rank support. This change is necessary since rank operations are also needed in the last level of the tree in the $k^2$-treap.
Figure 3 shows an example of $k^2$-treap construction. The top of the image shows the state of the matrix at each level of decomposition, and the cells selected as local maxima at each level are highlighted, except in the last level where all the cells are local maxima. Empty submatrices are represented in the tree with the symbol “-”.
The $k^2$-treap provides support for cell access, basic range and top-$k$ queries, and also interval queries regarding cell weights. A detailed description of the structure and navigation algorithms is presented in Brisaboa et al. (2016).
3 Representation of binary rasters
Binary images can be considered as the simplest form of raster data. In a binary image we store a matrix that uses a single bit per cell, to determine whether a single feature is present or not within the region of the space corresponding to that cell. Hence, a binary raster is essentially a simplified version of a general raster, limiting the range of possible values to two. Several GIS applications make use of this simple technique to represent binary attributes of the space. Examples of this would be information of events like oil spills, plagues or cloud cover in their simplest version, as well as simple rasterized representations of vectorial data.
Due to the simplicity of these binary images, their representation usually requires specific techniques to achieve the best compression and query performance. The $k^2$-tree, introduced in Section 2.1, is an example of those. However, it was devised to compress Web graphs, so it works well mostly on binary matrices that are very sparse.
In this section we propose a solution, that we call -ones, based on the -tree, designed to efficiently compress the kind of binary images that usually appear in GIS applications. Essentially, our technique is devised to overcome the limitation of the -tree to sparse matrices: our technique is designed to efficiently compress binary matrices with a large percentage of 1s, as long as there is some clusterization of the values, which is typical of most real-world raster data.
Our proposal is based on the same decomposition of the binary matrix used by the -tree, but we recursively divide the matrix until we reach any uniform region, be it full of 0s or 1s. This means that in our -ones we have 3 possible types, or “colors”, of node, following the usual naming of quadtrees: black and white nodes are regions full of ones and zeros, respectively; the internal nodes, that are regions with ones and zeros, are gray. Note that the main difference of our proposal with a -tree is that we are able to represent large regions of ones with a single node, instead of using a full subtree.
The -ones can efficiently answer cell retrieval queries, as well as row/column or range queries, simply by performing a top-down traversal of the tree branches that intersect the region of interest. The only consideration is that when a black node is found, we need to output all the cells of the region of interest that fall within the submatrix covered by that node.
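To make the conceptual tree and the handling of black nodes during a range query concrete, the following Python sketch builds the three-color tree as nested lists and reports the 1s inside a window. It works on the conceptual tree only, not on any of the compact encodings; the representation ("W", "B", or a list of children for gray nodes) and the function names are choices made here for illustration.

```python
def build(matrix, r0, c0, side, k=2):
    """Return 'W' (all 0s), 'B' (all 1s), or a list of k*k children (gray)."""
    ones = sum(matrix[r][c]
               for r in range(r0, r0 + side)
               for c in range(c0, c0 + side))
    if ones == 0:
        return "W"
    if ones == side * side:
        return "B"
    s = side // k
    return [build(matrix, r0 + i * s, c0 + j * s, s, k)
            for i in range(k) for j in range(k)]


def window_query(node, r0, c0, side, win, k=2, out=None):
    """Collect the 1-cells inside win = (rmin, rmax, cmin, cmax), inclusive."""
    out = [] if out is None else out
    rmin, rmax, cmin, cmax = win
    outside = (r0 > rmax or c0 > cmax or
               r0 + side - 1 < rmin or c0 + side - 1 < cmin)
    if node == "W" or outside:
        return out
    if node == "B":
        # Black node: output every cell of the window covered by this region.
        out += [(r, c)
                for r in range(max(r0, rmin), min(r0 + side - 1, rmax) + 1)
                for c in range(max(c0, cmin), min(c0 + side - 1, cmax) + 1)]
        return out
    s = side // k
    for idx, child in enumerate(node):   # gray node: descend into children
        window_query(child, r0 + (idx // k) * s, c0 + (idx % k) * s,
                     s, win, k, out)
    return out


M = [[1, 1, 0, 0],
     [1, 1, 0, 1],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
tree = build(M, 0, 0, 4)
print(tree)
print(window_query(tree, 0, 0, 4, (0, 1, 1, 3)))
```

In this example the top-left quadrant, which a plain $k^2$-tree would expand into a full subtree, is a single black node.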
The goal of the -ones is to be efficiently traversed in a similar manner to the -tree, using just rank operations. To achieve this purpose, we devised a small set of implementation alternatives that store the conceptual tree in different ways. Figure 4 shows all our implementation variants for the same conceptual tree. For each variant we will describe how the components are built, and how the basic traversal operations are implemented, since query algorithms are based on the same conceptual traversal of the tree in all cases.
3.1 Naive 2-bit coding: -ones
A simple representation for the new conceptual tree, where three kinds of nodes exist, is just to use 2 bits per node instead of one. According to this idea, we use the following encoding: internal (gray) nodes are encoded with 10; white nodes with 00; black nodes with 01. Then, the first bit of each node is stored in a bitmap $T$, and the second bit in a second bitmap $T'$. With this setup, the bitmap $T$ marks internal nodes with 1 and leaves with 0, and $T'$ stores the type of leaf. Note that this is not necessary in the last level of the tree, where no internal nodes can exist, so in the last level we use a single bitmap $L$ like in regular $k^2$-trees.
This variant can be efficiently navigated, much like a $k^2$-tree, as follows: given an internal node at position $p$ in $T$, we can compute the position where its children start as $rank_1(T, p) \times k^2$, since each 1 in $T$ is an internal node that yields exactly $k^2$ children. If we find a leaf node ($T[p] = 0$), we check the second bitmap, $T'[p]$, to determine whether it is a black or white node. Notice that the rank structure to perform traversal in constant time is only needed in $T$ but not in $T'$ or $L$.
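A minimal Python sketch of this naive 2-bit encoding follows. The bitmap names `T`, `T2` (standing for the second bitmap) and `L`, the function names, and the linear-time rank are assumptions made for illustration; the sketch also assumes that the input matrix is not uniform, so that the root is a gray node.

```python
def build_naive(matrix, k=2):
    """Return (T, T2, L) for a square 0/1 matrix whose side is a power of k."""
    T, T2, L = [], [], []
    frontier = [(0, 0, len(matrix))]          # gray regions still to be split
    while frontier:
        side = frontier[0][2] // k
        nxt = []
        for r0, c0, _ in frontier:
            for i in range(k):
                for j in range(k):
                    r, c = r0 + i * side, c0 + j * side
                    ones = sum(matrix[x][y]
                               for x in range(r, r + side)
                               for y in range(c, c + side))
                    if side == 1:             # last level: plain cell values
                        L.append(ones)
                    elif ones == 0:           # white leaf -> 00
                        T.append(0)
                        T2.append(0)
                    elif ones == side * side: # black leaf -> 01
                        T.append(0)
                        T2.append(1)
                    else:                     # gray node -> 10
                        T.append(1)
                        T2.append(0)
                        nxt.append((r, c, side))
        frontier = nxt
    return T, T2, L


def get_cell(T, T2, L, n, r, c, k=2):
    """Return the value of cell (r, c); only T needs rank support."""
    start, side = 0, n
    while True:
        side //= k
        p = start + (r // side) * k + (c // side)
        r, c = r % side, c % side
        if p >= len(T):                       # last level
            return L[p - len(T)]
        if T[p] == 0:                         # leaf: T2[p] gives its color
            return T2[p]
        start = sum(T[:p + 1]) * k * k        # rank_1(T, p) * k^2


M = [[1, 1, 0, 0],
     [1, 1, 0, 1],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
T, T2, L = build_naive(M)
print(T, T2, L)
```

Note that every non-leaf-level node occupies one position in both `T` and `T2`, which is what makes the lookup `T2[p]` possible without any extra rank operation.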
3.2 Improved 2-bit encoding: -ones
The naive encoding uses 2 bits per node to represent only three possible node types in the tree. We can use a more space-efficient encoding using just 1 bit for internal nodes. In this variant, internal nodes are stored with a bit 1 in $T$, but do not have a second bit in $T'$. Again, the last level is stored with a single bitmap $L$.
In this variant we can compute the children of a node using the same formula of the previous approach, since $T$ and $L$ are identical. The only difference is that, when we reach a leaf node at position $p$, the corresponding bit in $T'$ will not be located at position $p$ but at $rank_0(T, p)$.
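The change in how the leaf color is located can be illustrated with a short Python sketch. The example bitmaps are hypothetical (one black leaf, one gray node and two white leaves in the first level), the names `T` and `T2` are assumptions, and positions are 0-based, so the color of the leaf at position `p` is found at index `rank0(T, p) - 1`.

```python
def rank0(bits, p):
    """Number of 0s in bits[0..p], inclusive (linear scan for brevity)."""
    return (p + 1) - sum(bits[:p + 1])


# Hypothetical first level: black leaf, gray node, white leaf, white leaf.
T = [0, 1, 0, 0]
T2_naive = [1, 0, 0, 0]                            # one bit per node
T2 = [b for t, b in zip(T, T2_naive) if t == 0]    # one bit per leaf only


def leaf_color(T, T2, p):
    """Return 1 (black) or 0 (white) for the leaf at position p of T."""
    assert T[p] == 0, "position p must be a leaf"
    return T2[rank0(T, p) - 1]


print(T2, [leaf_color(T, T2, p) for p in (0, 2, 3)])
```

The saving is one bit per internal node, at the price of a rank operation over the 0s of `T` every time a leaf is reached.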
3.3 Navigable DF-expression: -ones
Following the same ideas of the previous variants of using two bitmaps $T$ and $T'$, we propose a variant based on the DF-expression encoding Kawaguchi and Endo (1980). In this variant, we encode internal nodes with 10, white nodes with 0 and black nodes with 11. We use the same bitmaps $T$ and $T'$ for the first and second bits of each node, and a single bitmap $L$ in the last level.
The encoding used by this variant has been suggested to be more space efficient than the previous ones, but it is not as efficient in our implementation since it requires more complex computations. Particularly, to compute the children of a gray node we need to count the number of internal nodes up to the current position, as $rank_1(T, p) - rank_1(T', rank_1(T, p))$. This increases complexity and forces us to add a rank structure not only to $T$ but also to bitmap $T'$ in order to perform the previous computation. Notice also that, unlike in previous variants, we now need to check the bitmap $T'$ to know if the current node is internal or a (black) leaf.
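The extra rank operation can be seen in the following Python sketch. The example bitmaps are hypothetical (first level with a black node, a gray node and two white nodes), the names `T`, `T2` and `L` are assumptions, and counting gray nodes as "nodes whose first bit is 1, minus the black ones" is one possible way to carry out the computation described above.

```python
def rank(bits, p):
    """Number of 1s in bits[0..p], inclusive (linear scan for brevity)."""
    return sum(bits[:p + 1])


K = 2
# First level: black (11), gray (10), white (0), white (0).
T = [1, 1, 0, 0]       # first bit of every node
T2 = [1, 0]            # second bit, only for nodes whose first bit is 1
L = [0, 0, 0, 1]       # cells below the gray node


def node_type(p):
    """Classify the node at position p of T."""
    if T[p] == 0:
        return "white"
    return "black" if T2[rank(T, p) - 1] == 1 else "gray"


def children_start(p):
    """Position in T:L of the first child of the gray node at position p."""
    marked = rank(T, p)              # nodes with first bit 1 up to p
    blacks = rank(T2, marked - 1)    # how many of those are black leaves
    return (marked - blacks) * K * K


print([node_type(p) for p in range(4)], children_start(1))
```

Both `T` and `T2` are queried with rank here, which matches the observation that this variant needs rank support on the two bitmaps.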
3.4 An asymmetric approach: -ones
Our last proposal aims at storing our conceptual tree, with three types of nodes, using the same data structure of the original $k^2$-tree, and almost identical encoding. Internal nodes are encoded with 1 and white leaves with 0, like in the original $k^2$-tree. Black leaves are encoded as a small subtree: an internal node with $k^2$ white leaves as children. We take advantage of this configuration, which is not possible in a $k^2$-tree, to mark black nodes using $k^2 + 1$ (typically 5) bits.
This variant can be traversed exactly like a $k^2$-tree. The only difference is that, when we are performing traversal, if we reach a node encoded with 1, we need to check its children: if all of them are white, the current node is black. In practice, this can be performed when checking the node, or we can simply traverse it like an internal node in the $k^2$-tree, and in the next step check whether it was indeed an internal or black node.
This variant uses a very asymmetric encoding for the nodes, requiring $k^2 + 1$ bits for black nodes and only 1 bit for white nodes. However, it also has an interesting property: since it is identical to a $k^2$-tree where regions of ones are encoded using a shorter subtree, this approach will never exceed the space of the original $k^2$-tree.
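The asymmetric encoding can be sketched in Python as follows. As before, this is only an illustration under assumptions: function names are hypothetical, rank is computed by a linear scan, and the input matrix is assumed not to be uniform. The query checks the children of a node as soon as it finds a 1, which is one of the two options mentioned above.

```python
def build_asymmetric(matrix, k=2):
    """Return (T, L): a plain k^2-tree layout where a black node is a 1
    followed, one level below, by k*k zero children."""
    levels = []
    frontier = [(0, 0, len(matrix), False)]   # (row, col, side, is_black)
    while frontier:
        side = frontier[0][2] // k
        bits, nxt = [], []
        for r0, c0, _, black in frontier:
            for i in range(k):
                for j in range(k):
                    r, c = r0 + i * side, c0 + j * side
                    ones = 0 if black else sum(matrix[x][y]
                                               for x in range(r, r + side)
                                               for y in range(c, c + side))
                    if side == 1 or ones == 0:
                        # cell value, white leaf, or child of a black node
                        bits.append(int(ones > 0))
                    else:
                        bits.append(1)        # gray or black node
                        nxt.append((r, c, side, ones == side * side))
        levels.append(bits)
        frontier = nxt
    return [b for lvl in levels[:-1] for b in lvl], levels[-1]


def get_cell(T, L, n, r, c, k=2):
    """Return the value of cell (r, c)."""
    B = T + L                                 # T:L addressed as one sequence
    start, side = 0, n
    while True:
        side //= k
        p = start + (r // side) * k + (c // side)
        r, c = r % side, c % side
        if B[p] == 0:
            return 0
        if side == 1:
            return 1
        start = sum(T[:p + 1]) * k * k        # rank_1(T, p) * k^2
        if not any(B[start:start + k * k]):   # all children white: black node
            return 1


M = [[1, 1, 0, 0],
     [1, 1, 0, 1],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
T, L = build_asymmetric(M)
print(T, L)
```

In this example the black top-left quadrant costs 5 bits (a 1 in `T` plus four 0s in `L`), whereas a plain $k^2$-tree would also use 5 bits for it (a 1 plus four 1s); the gain appears for larger uniform regions, whose full subtree is replaced by these 5 bits.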
3.5 Experimental evaluation
In this section we compare our variants with the original $k^2$-tree. We focus on two different types of data, with fundamentally different characteristics: Web graph datasets, that are very sparse, and raster data, where there can be a large percentage of ones. Table 1 shows the datasets used. The Web graph datasets (provided by the Laboratory for Web Algorithmics (LAW) at http://law.di.unimi.it/datasets.php) are very sparse (less than 0.005% of ones). The raster datasets have been extracted from the Digital Land Model (MDT05) of the Spanish Geographic Institute (http://www.cnig.es). They are high-resolution (cells of $5 \times 5$ meters) elevation rasters. We took several fragments of the overall dataset, numbered as shown in Table 1. Two of the datasets are built by combining several adjacent pieces to build larger rasters. Note that the original datasets store decimal values; we select a reference value and build binary matrices by selecting all cells with value below the given threshold.
We run all the experiments on an AMD Phenom II X4 processor with 8 GB of DDR2 RAM, running Ubuntu 12.04.1. Our implementations are written in C and compiled with gcc version 4.6.2 with -O9 optimizations.
3.5.1 Space analysis
We evaluate our -ones implementation variants comparing them with original -trees. We use for the comparison Web graphs and raster datasets, with a threshold set to have 50% of ones. For all the approaches we use a hybrid version, where in the first three levels of decomposition and in the remaining levels.
Table 2 shows the compression achieved by our techniques and the original -tree. We highlight the best compression results for each dataset. In the first four datasets, Web graphs, the -ones slightly improves the compression of the original -tree, thanks to being able to exploit slightly larger clusters of ones that appear in most Web graphs. However, the sparsity of the datasets makes all of our other variants larger than the -tree. In raster datasets, due to the much higher percentage of ones, the -tree becomes much less efficient than our variants: our techniques are roughly 10 times smaller than the -tree in all the datasets. The -ones achieves the best compression results in all of them, but all the variants are relatively close.
To better demonstrate the difference in performance when compared with the -tree, we extended the evaluation to binary rasters with varying percentage of ones. Figure 5 (left) displays the compression obtained for the dataset , with thresholds set to get between 1% and 90% of ones. Results show that all the -ones variants are already smaller than the -tree baseline with a 1% of ones in the dataset, due to the larger size of the clusters of ones. The right-hand plot in Figure 5 focuses on the differences among our proposals. All of them achieve similar results and evolve almost in parallel, but the -ones is the best variant in general and the -ones is the worst. The -ones, being asymmetric, is slightly worse when the percentage of ones is around 50%.
3.5.2 Query times
In this section we focus on the query performance of our variants, particularly compared to that of the $k^2$-tree. Specifically, we measure performance on cell retrieval queries, i.e., queries that return the value of a given cell. They involve the traversal of a single branch of the tree to locate a cell, so they provide a clearer comparison of the differences in traversal cost among variants. We perform tests using Web graphs and binary rasters, that are again generated using a threshold over the original datasets to get binary images with 50% and 10% of ones. We use a set of 10 million random queries for each dataset, and show the average query times in $\mu$s/query.
Table 3 shows the results for all the datasets, grouped by family. In Web graphs, the -tree achieves the best query times, due to the simpler navigation required. Our encodings obtain higher query times than original -trees. Nevertheless, the overhead of our fastest variant, the -ones, is very low. The -ones is also very efficient, whereas the -ones and especially the -ones are slower, due to the extra rank operations required. In the raster datasets, our fastest solutions are always more efficient than the -tree, due to the improved access to regions full of ones. Again, the -ones is the fastest variant and the -ones the slowest. There is also a difference in performance depending on the percentage of ones in the dataset: the -ones and -ones are very similar in both cases, but the -ones is slightly better when the percentage of ones is lower. Considering that the -ones achieves the best compression in all the datasets, we consider it to yield the best space-time tradeoff overall for any dataset. The -ones can offer slightly better query times sacrificing space, whereas the -ones can be an alternative when the number of ones is expected to be relatively low.
3.6 Comparison with linear quadtrees
The decomposition of the space in submatrices used in our proposal is a generalization of the quadrant decomposition used by generic quadtrees. Hence, our technique can be seen as a compact quadtree representation, since the conceptual tree we are representing in our variants, for $k = 2$, can also be stored as a classical quadtree.
The linear quadtree Gargantini (1982) is a representation devised to work efficiently from secondary storage. In the linear quadtree, the quadrants are numbered 0-3 from left to right and top to bottom. Each entry in the matrix (i.e. each 1 in binary matrices) will be represented by a sequence representing the quadrant chosen at each decomposition step to reach the corresponding cell. These sequences, called quadcodes, can be sorted and stored in a B-Tree in secondary memory. Cell retrieval queries can be implemented as a simple search for the corresponding quadcode in the B-Tree.
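The computation of quadcodes and the static search setup can be sketched in Python as follows. This is an illustrative sketch that follows the description above (one quadcode per 1 in the matrix, quadrants numbered 0-3 from left to right and top to bottom); the function names are hypothetical, and a sorted in-memory list searched with `bisect` stands in for the B-Tree.

```python
from bisect import bisect_left


def quadcode(r, c, n):
    """Sequence of quadrants (0-3) chosen to reach cell (r, c) in an n x n grid."""
    digits = []
    side = n
    while side > 1:
        side //= 2
        digits.append(2 * (r // side) + (c // side))
        r, c = r % side, c % side
    return tuple(digits)


def build_static(matrix):
    """Sorted list with the quadcode of every 1 in the matrix."""
    n = len(matrix)
    return sorted(quadcode(r, c, n)
                  for r in range(n) for c in range(n) if matrix[r][c])


def contains(codes, r, c, n):
    """Cell retrieval: binary search for the quadcode of (r, c)."""
    code = quadcode(r, c, n)
    i = bisect_left(codes, code)
    return i < len(codes) and codes[i] == code


M = [[0, 1, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 0]]
codes = build_static(M)
print(codes)
```

Because quadcodes are compared lexicographically, all the cells of a quadrant are contiguous in the sorted list, which is what allows efficient access to subregions.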
Our -ones variants are in practice more similar in space to compact quadtree representations designed for main memory, but those are usually designed for operations involving the full raster, whereas our techniques still retain the ability to efficiently access a subregion of the space, something that can be easily performed with linear quadtrees but not with other compact representations. In this section we compare the performance to answer cell retrieval queries of our techniques against linear quadtree implementations. We implemented an in-memory version of the linear quadtree, that uses a B-Tree maintained in main memory. Additionally, since the linear quadtree is a dynamic data structure that allows efficient modifications, we perform different comparisons for a static and dynamic setup. In the static comparison, we use our -ones, and compare it with a linear quadtree that stores quadcodes in an array in main memory, using binary search to answer queries. In the dynamic comparison, we use a linear quadtree with a regular B-Tree, fully in main memory. We use a dynamic version of the -ones, that is a straightforward adaptation of the existing d-tree data structure to properly handle the new semantics for regions of ones. The machine and configuration of our variants are the same as in Section 3.5.
Table 4 shows the compression, in bits per one, achieved by the -ones and the corresponding static and dynamic linear quadtrees (QT). We only show results for a subset of the collections, since results are similar among all Web graphs, and among all raster datasets. Results show that our variants are around 10 times smaller than linear quadtrees in all the datasets.
Table 5 displays a comparison of query times. We measure the average query time over a query set with 1 million random cell retrieval queries. As shown, our structures are still 2-3 times faster than the linear quadtree in the static setup. In the dynamic setup, the overhead required by the dynamic implementation of our structure causes it to become 2-3 times slower than the static version, so query times become similar to those of linear quadtrees. Due to this, we are faster than linear quadtrees in the raster datasets, but slower in Web graphs. We consider the raster datasets to be more significant to the actual performance of the solutions, since they are designed for this kind of data, but even the worse query times obtained in Web graphs are easily compensated by the much better (8x) compression.
4 Representation of general rasters and spatio-temporal data
In this section we introduce solutions based on the -ones that can handle more complex raster data. Particularly, we focus on the representation of general raster data and temporal raster data. In general rasters we have a matrix of non-binary values in which each cell contains a numeric value. Temporal rasters store the evolution of a raster data along time. We will describe the usual problems for both kinds of raster data and then introduce our proposals to store them.
In our representations for general raster data we aim at providing support for queries involving not only the spatial dimension of the dataset, but also the possible values stored. For instance, the values above a given threshold in an elevation raster can be selected to yield snow alerts in a given region. Our solutions are designed to efficiently answer this kind of queries, combining a spatial constraint with a filter on the possible values, as well as simpler queries involving constraints only on space or values.
Due to their characteristics, some of the data structures we introduce for integer rasters can also be adapted to the representation of spatio-temporal data, or time-evolving regional data. We consider temporal rasters containing the evolution of a binary raster dataset along time. Hence, we essentially have a collection of rasters corresponding to the same feature in different time points. In these datasets, we also have two ways to filter the data: spatial constraints, to obtain values in a region, and temporal constraints, to obtain values in a given time interval. We consider the following temporal constraints:
- Time-instant, or time-slice, queries refer to a single point in time.
- Time-interval queries refer to a time interval. We consider three different semantics for them: standard queries just return all the results found, possibly with multiple occurrences for the same cell; weak queries return the set of cells that fulfilled the query constraints at any point in the interval (e.g., in a cloud cover raster, find the regions that were covered at any time); strong queries return the set of cells that fulfilled the constraints during the full interval.
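The three time-interval semantics can be illustrated on plain, uncompressed snapshots. The following is only a minimal sketch, assuming each snapshot is a list of rows of 0/1 values and the window is given as inclusive row and column bounds; the function names are hypothetical and no compact structure is involved.

```python
from typing import List, Set, Tuple

Snapshot = List[List[int]]          # one binary raster per time instant
Window = Tuple[int, int, int, int]  # (row_min, row_max, col_min, col_max), inclusive


def cells_in_window(snap: Snapshot, window: Window) -> Set[Tuple[int, int]]:
    """Cells set to 1 inside the window of a single snapshot (time-instant query)."""
    r0, r1, c0, c1 = window
    return {(r, c)
            for r in range(r0, r1 + 1)
            for c in range(c0, c1 + 1)
            if snap[r][c] == 1}


def standard_query(snaps: List[Snapshot], t0: int, t1: int, window: Window):
    """All occurrences (t, row, col); a cell may be reported once per instant."""
    return [(t, r, c)
            for t in range(t0, t1 + 1)
            for (r, c) in sorted(cells_in_window(snaps[t], window))]


def weak_query(snaps: List[Snapshot], t0: int, t1: int, window: Window):
    """Cells that are 1 at some instant of [t0, t1] (union, i.e. OR)."""
    result: Set[Tuple[int, int]] = set()
    for t in range(t0, t1 + 1):
        result |= cells_in_window(snaps[t], window)
    return result


def strong_query(snaps: List[Snapshot], t0: int, t1: int, window: Window):
    """Cells that are 1 at every instant of [t0, t1] (intersection, i.e. AND)."""
    result = cells_in_window(snaps[t0], window)
    for t in range(t0 + 1, t1 + 1):
        result &= cells_in_window(snaps[t], window)
    return result
```

A time-instant query corresponds to calling `cells_in_window` on a single snapshot.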
4.1 Our proposals
The proposals we introduce next are k²-tree variants, in most cases built from our k²-ones implementations. For general rasters, we assume that our input is a matrix whose cells contain integer values in a bounded range. Note that this implies the assumption that the number of different values is not too large; raster datasets with floating-point values can either be rounded or mapped to an integer range. For temporal rasters, we assume that we have a collection of binary rasters of the same size. Most of our proposals can be applied to both cases, with adjusted algorithms to answer the relevant queries.
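As an illustration of the rounding or mapping step mentioned above, the following sketch quantizes a floating-point raster to a chosen precision and shifts the result so that values start at 0. The function names and the `precision` parameter are assumptions made for this example, not part of the original proposal.

```python
from typing import List, Tuple


def quantize(matrix: List[List[float]], precision: float = 1.0) -> Tuple[List[List[int]], int]:
    """Round each value to a multiple of `precision` and shift to start at 0.

    Returns the integer matrix and the offset needed to recover the
    rounded values. Python's round() rounds exact halves to the nearest
    even integer, which is irrelevant for this illustration.
    """
    rounded = [[round(x / precision) for x in row] for row in matrix]
    offset = min(min(row) for row in rounded)
    return [[v - offset for v in row] for row in rounded], offset


def dequantize(value: int, offset: int, precision: float = 1.0) -> float:
    """Recover the rounded original value of a cell."""
    return (value + offset) * precision
```

With a precision of 1.0, the elevations 10.4, 12.6, 11.2 and 10.0 would be stored as the integers 0, 3, 1 and 0, with offset 10.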
4.1.1 Multiple k²-ones: M-ones
The M-ones uses a collection of k²-ones to store the original data. If we see the input matrix as a collection of binary matrices, one for each possible value, its representation is reduced to the representation of a collection of binary rasters. The M-ones simply stores each of these binary matrices (i.e. the cells with each possible value) in a different k²-ones.
In this approach, queries involving cells with a given value can be answered by checking a single k²-ones. Queries involving a range of values, however, require checking all the trees in the range, so they become less efficient. The worst performance, therefore, is expected in queries with no constraints on values, where all the trees have to be checked.
The same approach can be used for temporal raster data: we use a different tree per time instant. Time-instant queries are executed on a single tree, but time-interval queries require a synchronized traversal of several trees. Note that in standard time-interval queries we can just return all the results by querying each tree separately, but for weak and strong queries we need to traverse all the trees simultaneously and compute the OR (weak) or AND (strong) of their corresponding bits to filter out branches that do not fulfill the query semantics.
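The decomposition into one binary matrix per value can be sketched as follows. This is only an illustration over plain Python lists: in the actual proposal each binary matrix would be stored in a compact tree, and the function names here are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Matrix = List[List[int]]


def split_by_value(matrix: Matrix) -> Dict[int, Matrix]:
    """Return one binary matrix per distinct value (1 where the cell holds it)."""
    values = sorted({v for row in matrix for v in row})
    return {v: [[1 if x == v else 0 for x in row] for row in matrix]
            for v in values}


def cell_value(layers: Dict[int, Matrix], r: int, c: int) -> Optional[int]:
    """Sequential search: every layer may have to be checked."""
    for v, layer in layers.items():
        if layer[r][c]:
            return v
    return None


def cells_in_value_range(layers: Dict[int, Matrix], lo: int, hi: int) -> Set[Tuple[int, int]]:
    """Union of the layers of all values in [lo, hi]; cost grows with the range."""
    out: Set[Tuple[int, int]] = set()
    for v, layer in layers.items():
        if lo <= v <= hi:
            out |= {(r, c)
                    for r, row in enumerate(layer)
                    for c, bit in enumerate(row) if bit}
    return out
```

The sketch makes the cost profile visible: a single-value query touches one layer, whereas range queries and cell retrieval iterate over several layers.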
4.1.2 Cumulative k²-ones: CM-ones
Our second proposal, the CM-ones, is based on the same idea of building a tree per value, but uses a cumulative approach: the first tree will store the cells with the minimum value; each consecutive tree will store the cells with the next value, plus all the cells stored in previous trees. Figure 7 shows the CM-ones representation, for the same input matrix of Figure 6.
In this approach, the trees store a much larger number of ones. However, taking advantage of the ability of the k²-ones to store large regions of ones, the space of the final structure is not expected to increase too much with respect to the previous approach. In some raster datasets, where values tend to form concentric curves, the use of cumulative values can even improve compression by generating larger clusters.
The CM-ones can answer any query involving a single value, or range of values, using the same strategy: for a range of values, we compute the results for the upper limit of the range and subtract those of the largest value below its lower limit (in practice, we can traverse both trees simultaneously to filter out branches as soon as possible). Hence, its performance is independent of the length of the range. Additionally, it can answer queries not involving value constraints more efficiently: to find the value of a single cell, instead of checking every tree, we can use binary search to look for the leftmost tree that contains the cell.
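The cumulative strategy can be sketched over plain binary matrices. This is a minimal illustration with hypothetical function names, assuming the layers are kept as uncompressed lists rather than compact trees.

```python
from bisect import bisect_left, bisect_right
from typing import List, Set, Tuple

Matrix = List[List[int]]


def build_cumulative(matrix: Matrix) -> Tuple[List[int], List[Matrix]]:
    """Layer i marks with a 1 every cell whose value is <= values[i]."""
    values = sorted({v for row in matrix for v in row})
    layers = [[[1 if x <= v else 0 for x in row] for row in matrix]
              for v in values]
    return values, layers


def cell_value(values: List[int], layers: List[Matrix], r: int, c: int) -> int:
    """Binary search for the leftmost layer that contains the cell."""
    lo, hi = 0, len(layers) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if layers[mid][r][c]:
            hi = mid
        else:
            lo = mid + 1
    return values[lo]


def cells_in_value_range(values: List[int], layers: List[Matrix],
                         lo: int, hi: int) -> Set[Tuple[int, int]]:
    """Cells with value in [lo, hi]: layer(hi) minus layer(largest value < lo)."""
    i_hi = bisect_right(values, hi) - 1   # last layer with value <= hi
    i_lo = bisect_left(values, lo) - 1    # last layer with value < lo
    if i_hi < 0:
        return set()
    upper = layers[i_hi]
    lower = layers[i_lo] if i_lo >= 0 else None
    return {(r, c)
            for r, row in enumerate(upper)
            for c, bit in enumerate(row)
            if bit and not (lower is not None and lower[r][c])}
```

Whatever the length of the range, only two layers are inspected, which mirrors the independence from the range length noted above.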
The CM-ones relies on the fact that the leftmost tree containing a 1 for the cell yields the actual value of the cell. This approach cannot be used for time-evolving data, where the same cell can change value several times.
4.1.3 k³-tree
The k³-tree is a straightforward extension of the k²-tree to three dimensions. The conceptual decomposition of a bi-dimensional matrix can be extended to any number n of dimensions, creating kⁿ submatrices at each step to build a kⁿ-tree. Navigation of the tree is similar, just considering constraints in the new dimensions and adjusting the formulas to nodes with kⁿ children.
Our approach uses a k³-tree to store the complete raster matrix. In particular, it stores a 3-dimensional binary matrix, where the third dimension is the value of the cell. Hence, for each coordinate the only 1 in the third dimension will correspond to the value of that cell.
Retrieval algorithms in the k³-tree are quite simple: to get the value of a cell, we simply traverse the conceptual tree looking at all the branches for that coordinate; to find cells with a given value or range of values, we fix the range in the third dimension and search for all the ones in the corresponding slice of the matrix.
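The retrieval strategy can be sketched with a pointer-based conceptual tree for k = 2, where each node splits its cube into 8 subcubes. This is only an illustration under stated assumptions: the cube side is a power of two large enough to cover rows, columns and values, the tree is not stored in compact bitmaps, and all names are hypothetical.

```python
from typing import List, Set, Tuple

Point = Tuple[int, int, int]  # (row, col, value)


def raster_to_points(matrix: List[List[int]]) -> Set[Point]:
    """One 1 per cell in the 3-dimensional matrix, at height equal to its value."""
    return {(r, c, v) for r, row in enumerate(matrix) for c, v in enumerate(row)}


def build(points: Set[Point], origin: Point, side: int):
    """None for an empty cube, True for a single 1, else a list of 8 children."""
    if not points:
        return None
    if side == 1:
        return True
    half = side // 2
    children = []
    for dx in (0, half):
        for dy in (0, half):
            for dz in (0, half):
                o = (origin[0] + dx, origin[1] + dy, origin[2] + dz)
                sub = {p for p in points
                       if all(o[i] <= p[i] < o[i] + half for i in range(3))}
                children.append(build(sub, o, half))
    return children


def query(node, origin: Point, side: int, lo: Point, hi: Point) -> List[Point]:
    """Report the ones inside the inclusive box [lo, hi], pruning empty or disjoint cubes."""
    if node is None:
        return []
    if any(origin[i] > hi[i] or origin[i] + side - 1 < lo[i] for i in range(3)):
        return []
    if side == 1:
        return [origin]
    half = side // 2
    out: List[Point] = []
    child = 0
    for dx in (0, half):
        for dy in (0, half):
            for dz in (0, half):
                o = (origin[0] + dx, origin[1] + dy, origin[2] + dz)
                out.extend(query(node[child], o, half, lo, hi))
                child += 1
    return out
```

Retrieving the value of a cell fixes the first two dimensions and leaves the third unconstrained, while a value-range query fixes the third dimension; both are instances of the same box query.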
The k³-tree can also be applied to temporal raster data. Considering the third dimension as time, we can combine all the raster datasets in a single 3-dimensional matrix. Time-instant and standard time-interval queries are similar to queries on values. Weak and strong time-interval queries can be processed as standard queries, filtering out repeated values during or after traversal.
4.1.4 I-ones
The I-tree has been shown to improve the performance of a collection of k²-trees in other application domains. Therefore, our next proposal is an adaptation of the same data structure to work with our k²-ones. This just requires adjustments in the data structures and basic navigation operations, similar to those performed in individual k²-ones. For instance, using the variant based on the k²-ones, a second bitmap must be added, and additional operations are defined to check the color of a node and traverse the tree to reach its children.
Figure 8 shows the I-ones, for the same input matrix used in previous examples. We display the actual bits used by the k²-ones encoding, and the final bitmaps generated. Notice that the bits of each node correspond to the concatenation of the corresponding bits in the equivalent M-ones representation.
The I-ones can answer queries involving a single value or a range of values using the same traversal techniques of the original I-tree. Even if navigation is slower than in individual k²-ones, making simple queries slower, the ability to combine all the trees into one provides a much more efficient way to perform checks in queries involving ranges of values or not involving value constraints.
The I-ones can also be adapted to temporal raster data. In particular, most time-interval queries can be efficiently answered by keeping track of the corresponding limits of the range for each node: in weak queries, if the current node contains at least one 1 in our interval, we can confirm the result immediately; in strong queries, if a node has at least one 0 in the interval, we can discard the result.
4.2 Experimental Evaluation for General Rasters
We test the performance of our proposals using the real elevation rasters described in Section 3. Since the values stored are floating-point values obtained from interpolation, we round the values to a precision of 1 m.
We compare the compression of our techniques with a GeoTIFF (http://trac.osgeo.org/geotiff/) representation of the same datasets, in two variants. The plain tiff variant simply stores the matrix row-wise, using a 16-bit integer per cell; the compressed tiff variant uses the default compression options: the matrix is partitioned into fixed-size tiles, and LZW compression is applied to each tile.
To measure the query efficiency of our proposals, we compare them with GeoTIFF using the libtiff library, version 4.0.3. All time measurements correspond to CPU time. We consider the following representative queries: cell retrieval queries, that ask for the value of a given cell; single-value queries, that ask for all the cells with a given value; and combined queries, that ask for cells within a spatial region and with values in a given range.
Table 6 shows the compression obtained for different raster datasets. For each dataset, we show the number of different values existing in the dataset, as well as the zero-order entropy of the matrix, read in row order. The best space results are obtained by the compressed TIFF representation, and the best of our proposals is the k³-tree, which is only 10-20% larger. Note that the compressed tiff variant is designed mainly for compression, and it does not provide support for efficient access.
Table 7 shows the results obtained for cell retrieval queries. Our best approach, the k³-tree, is much faster than the compressed tiff variant, and even faster than the plain version (this is an artifact due to the nature of the library, which is not designed to access specific cells and always processes the data in chunks). Among our techniques, the CM-ones variant is several times slower than the k³-tree, but still efficient. The M-ones and I-ones variants are much less efficient in this simple query, in the first case due to the need for a sequential search in all the trees, and in the second case because of the added complexity of the structure.
Table 8 displays the query times to retrieve all cells with a given value. This query demonstrates the indexing capabilities of our techniques, all of them being much faster than the TIFF-based implementations, because we can filter results by value while they have to traverse the complete dataset. The M-ones is the fastest technique, since it has a specific structure per value. The CM-ones, as expected, is roughly two times slower. The I-ones is also inefficient, due to the more complex navigation of the structure. Finally, the k³-tree is now slightly slower than the other techniques, due to the locality of values: many regions with values close to the target generate branches in the tree that have to be checked but will be discarded later.
Table 9 shows the query times obtained for combined queries involving different spatial windows and value ranges. Results confirm that all our proposals are again faster than the TIFF-based solutions, which are unable to filter small subsets of data. The CM-ones is now the fastest of our techniques in most cases, thanks to its ability to efficiently compute the difference between any two values. The k³-tree also achieves good query times overall, and is the fastest technique in some of our tests, thanks to its ability to efficiently filter in the 3 dimensions at the same time. The M-ones is very inefficient, especially with longer ranges, whereas the I-ones is also inefficient but scales better to longer ranges.
4.3 Experimental Evaluation for Temporal Rasters
Next we test the application of our proposals to temporal raster data. We perform an experimental evaluation on real and synthetic datasets. CFCA and CFCB contain cloud fractional cover data (obtained from the Satellite Application Facility on Climate Monitoring, at http://www.cmsaf.eu), covering the whole world with a resolution of 0.25 degrees. CFCA uses data from years 1982 to 1985, and CFCB data from 2007 to 2009. Our threshold to determine the value of the raster is a cover value above 50%. RegionsA and RegionsB are synthetic datasets created by randomly grouping circles and altering their borders to build random but generally smooth and connected regions. Time evolution in these datasets simulates slow movement and changes/deformations of the original shapes.
The experiments in this section were run on a machine with 4 Intel Xeon E5520 cores at 2.27 GHz and 72 GB of RAM, running Ubuntu 9.10. Our code is compiled with gcc 4.4.1, with -O9 optimizations.
Table 10 displays the spatial size, number of time instants and percentage of ones in each dataset. The remaining columns of the table show the compression results obtained by our proposals. As a baseline, we show the space that would be necessary to store the quadcodes of the corresponding raster datasets with two approaches: using a separate representation per time instant (base); and using a differential approach where only the changes are stored at each time instant (diff). The latter corresponds to the minimum space that would be required by a linear quadtree that uses differential encoding, like the OLQ Tzouramanis et al. (2004). Results show that our techniques are much more space-efficient than the baseline. The M-ones and the I-ones, which do not take advantage of similarities between consecutive time instants, achieve the best results in the CFCA and CFCB datasets. However, in RegionsA and RegionsB the k³-tree is much more efficient. This is due to the change rate of the datasets: in the CFC datasets a large fraction of values change between consecutive time instants, whereas in our Regions datasets changes are more gradual. Therefore, the k³-tree can take advantage of similarities between consecutive time instants in the latter, but is not able to do so in the former.
In order to confirm the effect of the change rate, we build smaller datasets taking subsets of 100 snapshots from RegionsA. We build datasets taking every time instant, every second time instant, and so on, hence representing the same temporal raster with different time granularity. We also create a new, larger dataset, built like RegionsA, and generate a similar group of subsets from it.
Figure 9 shows the compression results obtained in the datasets built from RegionsA (left) and in the datasets built from the larger raster (right). Each plot displays how the compression obtained by our structures evolves as the change rate (measured as the percentage of ones that change on average between consecutive time instants) increases. The k³-tree is the most efficient of our proposals for datasets with a very small change rate, but when the number of changes reaches a given threshold, the implementations that ignore similarity between snapshots (M-ones and I-ones) become more efficient. Hence, the k³-tree is the best alternative for slowly changing datasets, but it is not able to exploit similarities when changes exceed a relatively low percentage. Notice that datasets with a high change rate would also be difficult to compress using any other state-of-the-art techniques based on exploiting this similarity, like OLQs.
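The change rate and the subsampling of snapshots can be sketched as follows. This is one possible reading of the measure, stated here as an assumption: for each pair of consecutive snapshots, the number of cells that differ is divided by the number of ones in the earlier snapshot, and the resulting percentages are averaged. The function names are hypothetical.

```python
from typing import List

Snapshot = List[List[int]]


def change_rate(snaps: List[Snapshot]) -> float:
    """Average percentage of changed cells relative to the ones of the earlier snapshot."""
    rates = []
    for prev, cur in zip(snaps, snaps[1:]):
        ones = sum(sum(row) for row in prev)
        changed = sum(a != b
                      for row_prev, row_cur in zip(prev, cur)
                      for a, b in zip(row_prev, row_cur))
        if ones:  # skip pairs whose earlier snapshot is empty
            rates.append(100.0 * changed / ones)
    return sum(rates) / len(rates) if rates else 0.0


def subsample(snaps: List[Snapshot], step: int) -> List[Snapshot]:
    """Keep every step-th snapshot, i.e. a coarser time granularity."""
    return snaps[::step]
```

Under this definition, taking every second or third snapshot of a slowly changing dataset yields a higher change rate, which is the effect the subsets are meant to expose.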
We compare the query performance of our proposals using snapshot queries and time-interval queries. We select different window sizes and time interval lengths, and build random query sets for each of them. For time-instant queries, our query sets for each configuration contain 1,000 random queries. For time-interval queries, we also consider different interval lengths, and build query sets with 10,000 random queries per window size, interval length and dataset. In all cases, we measure CPU times, and average the times over a number of repetitions of the full query set to obtain precise results.
Figure 10 shows the results obtained for all the datasets in snapshot queries, for different spatial window sizes. Results are consistent with those in the previous section: the M-ones is the fastest technique, since it only has to query one tree. The I-ones is around two times slower, but still faster than the k³-tree, which must traverse many branches corresponding to time instants close to the target.
Figure 11 shows the query times for standard time-interval queries (i.e. queries returning all occurrences for the same cell). We display results for a representative window of size 32, and interval lengths 1 to 40. The k³-tree is the most efficient technique for long intervals, whereas the I-ones is competitive in shorter intervals. Notice that the M-ones is only the fastest for snapshot queries.
In addition to standard time-interval queries we also check weak and strong queries. The results are shown in Figures 12 and 13 respectively. The evolution of query times is significantly different for these queries: the M-ones technique still achieves query times roughly proportional to the length of the interval, since it must perform a search in all the trees involved. However, the k³-tree and the I-ones are much less affected by the interval length. The I-ones obtains similar times for any interval length, and is the best solution in general in this case, since it has the ability to efficiently check any time interval at any node of the conceptual tree. The k³-tree, on the other hand, cannot improve the query times of the standard query algorithm, being forced to check all the branches and then remove duplicates, so it becomes much slower than the I-ones. In strong interval queries, in which many search branches could potentially be filtered by checking the intervals, the k³-tree is the slowest technique in general, especially in the CFC datasets, due to their higher change rate.
5 Top-k range queries in raster data
In this section we describe how to apply the same ideas devised in previous sections to obtain structures that solve top-k range queries, i.e. given a spatial window, queries that retrieve the k cells with maximum values inside it. The k²-treap, introduced in Section 2.3, is able to answer this kind of query in general matrices. We introduce next two variants that extend the original k²-treap to efficiently handle raster matrices where values are highly clustered. Then, we compare our proposals with a naive technique based on the M-ones, which simply searches for cells in the tree corresponding to the maximum value, and keeps searching in consecutive trees until the desired number of results is obtained.
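The naive baseline can be sketched over per-value lists of cells. This is a minimal illustration, assuming a plain dictionary from each value to its cells stands in for the collection of per-value trees; the function name is hypothetical.

```python
from typing import Dict, List, Tuple

Window = Tuple[int, int, int, int]  # (row_min, row_max, col_min, col_max), inclusive


def top_k_naive(matrix: List[List[int]], window: Window, k: int) -> List[Tuple[int, int, int]]:
    """Return up to k (value, row, col) triples with maximum values inside the window."""
    r0, r1, c0, c1 = window
    by_value: Dict[int, List[Tuple[int, int]]] = {}
    for r, row in enumerate(matrix):
        for c, v in enumerate(row):
            by_value.setdefault(v, []).append((r, c))
    results: List[Tuple[int, int, int]] = []
    # Visit the per-value structures from the highest value downwards.
    for v in sorted(by_value, reverse=True):
        for r, c in by_value[v]:
            if r0 <= r <= r1 and c0 <= c <= c1:
                results.append((v, r, c))
                if len(results) == k:
                    return results
    return results
```

The number of per-value structures visited depends on how many distinct values are needed to collect k cells in the window, which is why this baseline is sensitive to both k and the distribution of values.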
5.1 k²-treap variants
Our first variant, called k²-treap-uniform, is built in a similar manner to the original k²-treap. Yet, like in our k²-ones, the decomposition of the matrix stops whenever a “uniform” submatrix is found. This can happen when an empty region is identified or when the same value is shared by all its cells. Figure 14 shows an example of this tree decomposition. The matrices in the figure display the consecutive steps of the k²-treap construction, where the top cells (cells with the maximum value) for each step are highlighted. Observe that any dataset can be represented in a more compact way if similar values are present in many of its submatrices. Notice also that in uniform nodes all the cells in the submatrix share the same value, so we do not have to keep the coordinates of the cell with the maximum.
After these changes in the conceptual tree, we use a k²-ones to store the tree shape. Uniform nodes are marked as black nodes were in a binary raster, and empty nodes as white nodes. Using the same techniques explained for binary matrices, we can easily check whether a node is empty or uniform. In empty nodes we stop traversal, and in uniform nodes we can immediately output all the cells in the submatrix with the same value. The actual representation uses, in addition to the k²-ones, the same auxiliary arrays, which work essentially as in the original k²-treap.
Only minor adjustments are required to traverse the conceptual tree in our variant. Unlike node values, which are kept for all the nodes in the k²-treap, coordinates are just stored for non-leaf nodes. Therefore, the offset in the list of coordinates corresponding to the current position and level in the tree can be obtained from the number of internal nodes that exist up to the current position. To compute the offset of the node in the list of values, we also have to consider uniform nodes: the offset is obtained from the number of internal nodes plus the number of uniform nodes that exist up to the current position.
Our second proposal, called k²-treap-uniform-or-empty, tries to improve compression even more, at the expense of increasing query times. This approach differs slightly from the previous one. Here, we stop decomposition at any node as long as all the non-empty cells in the corresponding submatrix share the same value (even if other cells are empty). For instance, in Figure 14, the bottom-left quadrant becomes uniform with this new definition. This variant essentially builds the same k²-treap representation, but taking into account that these regions are now also considered as uniform. Figure 15 depicts an example of this new approach.
This proposal will cut many branches earlier during the construction of the tree. Even so, it has a drawback: since we cannot easily tell apart uniform and empty regions, some results may be emitted more than once. For instance, if a cell of the bottom-left quadrant had the maximum value in the matrix, it will be emitted at the root of the tree. But when traversing the bottom-left quadrant, if we identify that region as “uniform”, it may be emitted a second time. Hence, to solve top-k queries we use an additional data structure to keep track of already emitted results (any binary search tree or hash table suffices for this purpose). The additional overhead may become significant in space and/or time for large k, providing a space/time tradeoff between this proposal and the k²-treap-uniform.
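The two stopping conditions and the duplicate filter can be sketched as follows. This is only an illustration: a submatrix is given as a flat list of cells, `EMPTY` is a hypothetical marker for cells without a value, and a Python set plays the role of the hash table that tracks emitted results.

```python
from typing import Iterable, List, Optional, Tuple

EMPTY = None  # hypothetical marker for an empty cell


def is_uniform(cells: List[Optional[int]]) -> bool:
    """First variant: every cell holds the same value and none is empty."""
    distinct = set(cells)
    return len(distinct) == 1 and EMPTY not in distinct


def is_uniform_or_empty(cells: List[Optional[int]]) -> bool:
    """Second variant: all non-empty cells share one value; empty cells are allowed."""
    distinct = set(cells) - {EMPTY}
    return len(distinct) == 1


def emit_once(candidates: Iterable[Tuple[int, int, int]], k: int) -> List[Tuple[int, int, int]]:
    """Keep the first k (value, row, col) results, skipping cells already emitted."""
    seen = set()
    out: List[Tuple[int, int, int]] = []
    for value, r, c in candidates:
        if (r, c) in seen:
            continue
        seen.add((r, c))
        out.append((value, r, c))
        if len(out) == k:
            break
    return out
```

The set grows with the number of emitted results, which illustrates why the bookkeeping overhead becomes noticeable for large k.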
5.2 Experimental evaluation
To test the query efficiency of our proposals, we compare them with the M-ones representation of raster data. Notice that, despite its simplicity, the M-ones can efficiently answer top-k queries by querying the individual trees, starting from the one corresponding to the highest value, so it should be relatively efficient for this kind of query.
Table 11 shows the compression results obtained by our k²-treap variants and the M-ones for different raster datasets. Our first variant, the k²-treap-uniform, is larger than the M-ones, but the k²-treap-uniform-or-empty achieves better compression. Both variants obtain reasonable results in terms of space, at least comparable to the solutions described for general raster data, so they are a viable alternative if top-k queries are relevant.
Table 12 shows the query times for top-k queries obtained by all the tested data structures. For each dataset several window sizes and values of k are tested, by generating sets of random square windows within the bounds of the raster. Results show that the k²-treap-uniform exhibits good performance regardless of the window size or the value of k, as it is the alternative that achieves the best results in most cases. The k²-treap-uniform-or-empty, the most compact of the k²-treap variants, still behaves well for small values of k, but when k increases the overhead of keeping track of previous results dominates the query cost. Also, observe that for larger values of k the M-ones becomes more competitive with the k²-treap data structures: when the query retrieves many cells from one or more k²-trees, it requires less computation than extracting values one by one from the k²-treap.
6 Conclusions
We have presented several compact data structures for the representation of general raster data with advanced query support. Our representations store real raster datasets in small space and provide efficient access not only to regions of the raster, but also advanced query capabilities, such as selecting cells with a particular value or range of values, queries that involve spatio-temporal restrictions, or even top-k queries.
Most of the proposals are based on variants of the k²-tree. We propose a representation, called k²-ones, that enhances the k²-tree so that we can efficiently compress any kind of clustered binary matrix. Building on this, we propose compact and indexed solutions for different application domains. Additionally, most of the approaches introduced can be transformed into dynamic solutions using a dynamic k²-tree.
Overall, our proposals obtain good compression results and are able to answer a variety of interesting queries. In our experiments we show that our proposals are very compact, several times smaller than state-of-the-art representations based on linear quadtrees, and still able to store and query large datasets in main memory. We evaluate our representations for general raster data, showing their relative strengths and drawbacks: the k³-tree obtains very good space results, being close to a compressed GeoTIFF representation, and shows competitive times in most cases, but the variant with independent k²-ones obtains the best time results to retrieve all the cells with a given value, and the variant with cumulative k²-ones obtains the best results in most of the queries involving ranges of values. In all cases, the query times of our proposals are clearly better than those of the representations based on GeoTIFF images. We also apply some of the proposals to the representation of time-evolving raster data. Results show again relative strengths among our proposals: a k³-tree is the best solution for slowly-changing datasets, but as soon as the change rate increases the approaches based on multiple k²-ones become smaller. Finally, we also test new proposals to answer top-k queries in raster data. Our experiments confirm the space efficiency of the k²-treap variants, which are competitive in space with our other representations of raster data and faster to answer top-k queries.
We show the scalability of our representations to efficiently represent rasters with several thousands of different values. Nevertheless, our proposals assume that the number of different values in the dataset is not too high, and the space efficiency of most of them will degrade otherwise. We claim that in many real-world datasets, even though the values actually stored may have a high precision, that precision does not add quality or accuracy after a given threshold: when measuring features such as temperature, elevation, pressure, etc. the actual measurements may have high precision, but the interpolation of values, or even the simple averaging of measurements, distorts the precision of the measurements, so for many purposes we can safely reduce the precision of the values significantly without reducing the quality of the dataset.
The preliminary version of this work inspired several other research lines. In particular, limitations to handling large ranges of values were recently addressed in follow-up research Ladra et al. (2017), that extends our original work to support higher-precision datasets. Our representations are preferable when high-resolution values are not available or not relevant (e.g., in some applications, high-resolution values are just interpolations), as well as in domains where the number of different values is small (e.g., land-use rasters). Additionally, we have extended our proposals to efficiently store and query time-evolving data, a challenging problem where other solutions are difficult to apply due to the particularities of spatio-temporal queries.
References
- A succinct data structure for self-indexing ternary relations. Journal of Discrete Algorithms 43, pp. 38–53. Cited by: §2.2, §2.2.
- Compressed vertical partitioning for efficient RDF management. Knowledge and Information Systems 44 (2), pp. 439–474. Cited by: §2.1.
- Alphabet partitioning for compressed rank/select and applications. In Proc. of the 21st International Symposium on Algorithms and Computation (ISAAC 2010), pp. 315–326. Cited by: §1.
- Compressed representation of dynamic binary relations with applications. Information Systems 69, pp. 106–123. Cited by: §2.1.
- Aggregated 2d range queries on clustered points. Information Systems, pp. 34–49. Cited by: §2.3.
- K²-Treaps: range top-k queries in compact space. In Proc. 21st Int. Symp. on String Processing and Information Retrieval (SPIRE 2014), pp. 215–226. Cited by: §2.3.
- DACs: bringing direct access to variable-length codes. Information Processing and Management 49 (1), pp. 392–404. Cited by: §2.3.
- Compact representation of web graphs with extended functionality. Information Systems 39 (1), pp. 152–174. Cited by: §2.1, §2.1.
- Orthogonal range searching on the RAM, revisited. In Proc. of the 27th International Symposium on Computational Geometry (SoCG 2011), pp. 1–10. Cited by: §1.
- Fixed binary linear quadtree coding scheme for spatial data. Visual Communications and Image Processing, pp. 1214–1220. Cited by: §1, §1.
- Practical representations for web and social graphs. In Proc. 20th ACM Int. Conf. on Information and Knowledge Management (CIKM 2011), pp. 1185–1190. Cited by: §2.1.
- Compact querieable representations of raster data. In Proc. of the 20th Int. Sym. on String Processing and Information Retrieval (SPIRE 2013), pp. 96–108. Cited by: footnote 1.
- New data structures and algorithms for the efficient management of large spatial datasets. Ph.D. Thesis, Department of Computer Science, University of A Coruña, Spain. Cited by: footnote 1.
- Quad trees: a data structure for retrieval on composite keys. Acta Informatica 4, pp. 1–9. Cited by: §1.
- An effective way to represent quadtrees. Communications of the ACM 25 (12), pp. 905–910. Cited by: §1, §3.6.
- Rank/select operations on large alphabets: a tool for text indexing. In Proc. of the 17th ACM-SIAM Symposium on Discrete Algorithms (SODA 2006), pp. 368–373. Cited by: §1.
- High-order entropy-compressed text indexes. In Proc. of the 14th ACM-SIAM Symposium on Discrete Algorithms (SODA 2003), pp. 841–850. Cited by: §1.
- On a method of binary-picture representation and its application to data compression. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-2 (1), pp. 27–35. Cited by: §3.3.
- Scalable and queryable compressed storage structure for raster data. Information Systems 72, pp. 179–204. Cited by: §6.
- Set operations on constant bit-length linear quadtrees. Pattern Recognition 30 (7), pp. 1239–1249. Cited by: §1.
- Towards operational near real-time flood detection using a split-based automatic thresholding procedure on high resolution TerraSAR-X data. Natural Hazards and Earth System Sciences 9, pp. 303–314. Cited by: §1.
- Tables. In Proc. of the 16th Foundations of Software Technology and Theoretical Computer Science Conference (FSTTCS 1996), pp. 37–42. Cited by: §2.1.
- Wavelet trees for all. In Proc. of the 23rd Annual Symposium on Combinatorial Pattern Matching (CPM 2012), pp. 2–26. Cited by: §1.
- Spatial databases - with applications to GIS. Elsevier. Cited by: §1.
- The quadtree and related hierarchical data structures. ACM Comput. Surv. 16 (2), pp. 187–260. Cited by: §1.
- Randomized search trees. Algorithmica 16 (4/5), pp. 464–497. Cited by: §2.3.
- Encyclopedia of GIS. Springer. Cited by: §1.
- Benchmarking access methods for time-evolving regional data. Data & Knowledge Engineering 49 (3), pp. 243–286. Cited by: §4.3.
- A technique for high-performance data compression. Computer 17 (6), pp. 8–19. Cited by: §1.
- GIS: a computing perspective, 2nd edition. CRC Press, Inc. Cited by: §1.
- Supporting web-based visual exploration of large-scale raster geospatial data using binned min-max quadtree. In Scientific and Statistical Database Management, M. Gertz and B. Ludäscher (Eds.), Berlin, Heidelberg, pp. 379–396. Cited by: §1.