Spatial data can be represented using either a raster or a vector data model. Basically, vector models represent the space using points and lines connecting those points. They are used mainly to represent man-made features. Raster models represent the space as a tessellation of disjoint fixed size tiles (usually squares), each one storing a value. They are traditionally used in engineering, modeling, and representations of real-word elements that were not made by men, such as pollution levels, atmospheric and vapor pressure, temperature, precipitations, wind speed, land elevation, satellite imagery, etc.
Temporal evolution of vectorial data has been extensively studied, with a large number of data structures to index and/or store spatio-temporal data. Examples are the 3DR-tree , HR-tree , the MVR-tree , or PIST .
In  the classical Map Algebra of Tomlin for managing raster data is extended to manage raster data with a temporal evolution. The conceptual solution is simple, instead of considering a matrix, it considers a cube, where each slice of the temporal dimension is the raster corresponding to one time instant.
Most real systems capable of managing raster data, like Rasdaman, Grass, or even R are also capable of managing time-series of raster data. These systems, as well as raster representation formats such as NetCDF (standard format of the OGC111http://www.opengeospatial.org/standards/netcdf) and GeoTiff, rely on classic compression methods such as run length encoding, LZW, or Deflate to reduce the size of the data. The use of these compression methods poses an important drawback to access a given datum or portion of the data, since the dataset must be decompressed from the beginning.
Compact data structures [7, 11] are capable of storing data in compressed form and enable us to access a given datum without the need for decompressing from the beginning. In most cases, compact data structures are equipped with an index that provides fast access to data. There are several compact data structures designed to store raster data [2, 8]. In this work, we extend one of those compact data structures, the , to support representing time-series of rasters.
2 Related work
In this section, we first revise the , a compact data structure that can be used to represent binary matrices. Then, we also present several compact data structures for representing raster data containing integers in the cells. We pay special attention to discuss one of them, the , which is the base of our proposal Temporal ().
The  was initially designed to represent web graphs, but it also allows to represent binary matrices, that is, rasters where the cells contain only a bit value. It is conceptually a non-balanced -ary tree built from the binary matrix by recursively dividing it into submatrices of the same size. First, the original matrix is divided into submatrices of size , being the size of the matrix. Each submatrix generates a child of the root whose value is if it contains at least one , and otherwise. The subdivision continues recursively for each node representing a submatrix that has at least one , until the submatrix is full of 0s, or until the process reaches the cells of the original matrix (i.e., submatrices of size 11).
The is compactly stored using just two bitmaps and . T stores all the bits of the conceptual , except the last level, following a level-wise traversal: first the bit values of the children of the root, then those in the second level, and so on. stores the last level of the tree.
It is possible to obtain any cell, row, column, or window of the matrix very efficiently, by running and operations  over the bitmaps and .
The  is obtained by simply adding a third dimension to the , and thus, it conceptually represents a binary cube. This can be trivially done by using the same space partitioning and representation techniques from the , yet applied to cubes rather than to matrices.
Thus, each 1 in the binary cube represents a tuple , where are the coordinates of the cell of the raster and is the value stored in that cell.
The  representation for a raster dataset is composed by as many as different values can be found in the raster. Given different values in the raster: , contains , where each has a value 1 in those cells whose value is .
2.0.4 2D-1D mapping:
In , it is presented a method that uses a space-filling curve to reduce the raster matrix to an array, and the use of one dimensional index (for example a B-tree) over that array to access the data.
has proven to be superior in both space and query time [12, 8] to all the other compact data structures for storing rasters. In , it was also compared with NetCDF. It drew slightly worse space needs than the compressed version (that uses Deflate) of NetCDF, but queries performed noticeably faster.
is based in the same partitioning method of the , that is, it recursively subdivides the matrix into submatrices and builds a conceptual tree representing these subdivisions. Now, in each node, instead of having a single bit, it contains the minimum and maximum values of the corresponding submatrix. The subdivision stops when the minimum and maximum values of the submatrix are equal, or when the process reaches submatrices of size 11. Again the conceptual tree is compactly represented using, in addition to binary bitmaps, efficient encoding schemes for integer sequences.
More in detail, let be the size of the input matrix. The process begins by obtaining the minimum and maximum values of the matrix. If these values are different, they are stored in the root of the tree, and the matrix is divided into submatrices of size . Each submatrix produces a child node of the root storing its minimum and maximum values. If these values are the same, that node becomes a leaf, and the corresponding submatrix is not subdivided anymore. Otherwise, this procedure continues recursively until the maximum and minimum values are the same, or the process reaches a 11 submatrix.
Figure 1 shows an example of the recursive subdivision (top) and how the conceptual tree is built (centre-top), where the minimum and maximum values of each submatrix are stored at each node. The root node corresponds to the original raster matrix, nodes at level 1 correspond to submatrices of size 44, and so on. The last level of the tree corresponds to cells of the original matrix. Note, for instance, that all the values of the bottom-right 44 submatrix are equal; thus, its minimum and maximum values are equal, and it is not further subdivided. This is the reason why the last child of the root node has no children.
The compact representation includes two main parts. The first one represents the topology of the tree () and the second one stores the maximum/minimum values at the nodes (). The topology is represented as in the , except that the last level () is not needed. The maximum/minimum values are differentially encoded with respect to the values stored at the parent node. Again, these values are stored as arrays following the same method of the , that is, following the same level-wise order of the conceptual tree. By using differential encoding, the numbers become smaller. Directly Addressable Codes (DACs)  take advantage of this, and at the same time, provide direct access. The last two steps to create the final representation of the example matrix are also illustrated in Figure 1. In the center-bottom and bottom parts, we respectively show the tree with the differences for both the maximum and minimum values, and the data structures that compose the final representation of the . Therefore, the original raster matrix is compactly stored using just a bitmap , which represents the tree topology, and a pair of integer arrays ( and ), which contain the minimum and maximum values stored at the tree. Note that when the raster matrix contains uniform areas, with large areas of equal or similar values, this information can be stored very compactly using differential encoding and DACs.
The maximum/minimum values provide indexation of the stored values, this technique is usually known as lightweight indexation. It is possible to query the structure only decompressing the affected areas. Queries can be efficiently computed navigating the conceptual tree by running and operations on and, in parallel, accessing the arrays and .
3 : A temporal representation for raster data
Let be a raster matrix of size that evolves along time with a timeline of size time instants. We can define as the sequence of raster matrices of size for each time instant .
A rather straightforward baseline representation for the temporal raster matrix can be obtained by simply representing each raster matrix in a compact way with a . In this section we use a different approach. The idea is to use sampling at regular intervals of size . That is, we represent with a all the raster matrices . We will refer to those rasters as snapshots of at time . The raster matrices that follow a snapshot are encoded using as a reference. The idea is to create a modified to represent where, at each step of the construction process, the values in the submatrices are encoded as differences with respect to the corresponding submatrices in rather than as differences with respect to the parent node as usual in a regular .
With this modification, we still expect to encode small gaps for the maximum and minimum values in each node of the conceptual tree of . Yet, in addition, when a submatrix in is identical to the same submatrix in , or when all the values in both submatrices differ only in a unique gap value , we can stop the recursive splitting process and simply have to keep a reference to the corresponding submatrix of and the gap (when they are identical, we simply set ). In practice, keeping that reference is rather cheap as we only have to mark, in the conceptual tree of , that the subtree rooted at a given node has the same structure of the one from the conceptual tree of . For such purpose, in the final representation of , we include a new bitmap , aligned to the zeroes in . That is, if we have (node with no children), we set ,222From now on, asume returns the number of bits set to in , and . Note that the first index of , , , and is . and set . Also, if we have , we also can set and (where is the gap between the maximum values of both submatrices) to handle the case in which the maximum and minimum values in the corresponding submatrix are identical (as in a regular ).
The overall construction process of the for the matrix related to the snapshot can be summarized as follows. At each step of the recursive process, we consider a submatrix of and the related submatrix in . Let the corresponding maximum and minimum values of the submatrix of be and , and those of be and respectively. Therefore:
If and are identical (or if we reach a 11 submatrix), the recursive process stops. Being the position in the final bitmap , we set , , and .333Since in we have to deal both with positive and negative values, we actually apply a zig-zag encoding for the gaps .
If all the values in and differ only in a unique value (or if they are identical, hence ), we set , , and .
Otherwise, we split the submatrix into parts and continue recursively. We set , and, as in the regular , , and .
Figure 2 includes an example of the structures involved in the construction of a over a temporal raster of size 88, with . The raster matrix corresponding to the first time instant becomes a snapshot that is represented exactly as the in Figure 1. The remaining raster matrices and are represented with two that are built taking as a reference. We have highlighted some particular nodes in the differential conceptual trees corresponding to and . (i) the shaded node labeled in indicates that the first 44 submatrix of both and are identical. Therefore, node has no children, and we set: , , and . (ii) the shaded node labeled in illustrates the case in which all the values of a given submatrix are increased by . In this case values in become in . Again, the recursive traversal stops at that node, and we set: , , and (values are increased by ). (iii) the shaded node labeled in corresponds to the node labeled in . In this case, when we sum the maximum and minimum values of both nodes we obtain that that node in has the same maximum and minimum values (set to ). Consequently the recursive process stops again. In this case, we set , , and .
4 Querying temporal raster data
In this section, we show two basic queries over .
Obtaining a cell value in a time instant:
This query retrieves the value of a cell of the raster at time instant : . For solving this query, there are two cases: if is represented by a snapshot, then the algorithm to obtain a cell in the regular is used, otherwise, a synchronized top-down traversal of the trees representing that time instant () and the closest previous snapshot () is required.
Focusing on the second case, the synchronized traversal inspects the two nodes at each level corresponding to the submatrix that contains the queried cell. The problem is that due to parts of or having the same value, the shape of the trees representing them can be different. Therefore, it is possible that one of the two traversals reaches a leaf, whereas the other does not. In such a case, the traversal that did not reach a leaf, continues, but the process must remember the value in the reached leaf, since that is the value that will be added or subtracted to the value found when the continued traversal reaches a leaf.
Indeed, we have three cases: (a) the processed submatrix of is uniform, (b) the original submatrix of is uniform and, (c) the processed submatrix after applying the differences with the snapshot has the same value in all cells.
Algorithm 1 shows the pseudocode of this case. To obtain the value stored at cell of the raster matrix , it is invoked as getCell, assuming that the cell at position (0,0) of the raster is that in the upper-left corner.
is used to store the current position in the bitmap of () during the downward traversal at any given step of the algorithm, similarly, is the position in of (). When () has a value, it means that the traversal reached a leaf and, in () the algorithm keeps the maximum value stored at that leaf node. Note that, , , , , and are global variables.
In lines 1-11, the algorithm obtains the child of the processed node that contains the queried cell, provided that in a previous step, the algorithm did not reach a leaf node (signaled with / set to ). In (), the algorithm stores the maximum value stored in that node.
If the condition in line 12 is true, the algorithm has reached a leaf in both trees, and thus the values stored in and are added/subtracted to obtain the final result. If the condition of line 15 is true, the algorithm reaches a leaf in the snapshot. This is signaled by setting to and then a recursive call continues the process.
The If in line 19 treats the case of reaching a leaf in . If the condition of line 20 is true, the algorithm uses bitmap to check if the uniformity is in the original submatrix or if it is in the submatrix resulting from applying the differences between the corresponding submatrix in and . A in implies the latter case, and this is solved by setting to and performing a recursive call. A means that the treated original submatrix of has the same value in all cells, and that value can be obtained adding/subtracting the values stored in and , since the unique value in the submatrix of is encoded as a difference with respect to the maximum value of the same submatrix of , and thus the traversal ends.
The last case is that the treated nodes are not leaves, that simply requires a recursive call.
Retrieving cells with range of values in a time instant:
obtains from the raster of the time instant , the positions of all cells within a region containing values in the range .
Again, if is represented with a snapshot, the query is solved with the normal algorithm of the
. Otherwise, as in the previous query, the search involves a synchronized top-down traversal of both trees. This time requires two main changes: (i) the traversal probably requires following several branches of both trees, since the queried region can overlap the submatrices corresponding to several nodes of the tree, (ii) at each level, the algorithm has to check whether the maximum and minimum values in those submatrices are compatible with the queried range, discarding those that fall outside the range of values sought.
5 Experimental evaluation
In this section we provide experimental results to show how handles a dataset of raster data that evolve along time. We discuss both the space requirements of our representation and its performance at query time.
We used several synthetic and real datasets to test our representation, in order to show its capabilities. All the datasets are obtained from the TerraClimate collection , that contains high-resolution time series for different variables, including temperature, precipitations, wind speed, vapor pressure, etc. All the variables in this collection are taken in monthly snapshots, from 1958 to 2017. Each snapshot is a 43208640 grid storing values with spatial resolution. From this collection we use data from two variables: TMAX (maximum temperature) is used to build two synthetic datasets, and VAP (vapor pressure) is compressed directly using our representation. Variable TMAX is a bad scenario for our approach, since most of the cells change their value between two snapshots. In this kind of dataset, our would not be able to obtain good compression. Hence, we use TMAX to generate two synthetic datasets that simulate a slow, and approximately constant, change rate, between two real snapshots. We took the snapshots for January and February 2017 and built two synthetic datasets called T_100 and T_1000, simulating 100 and 1000 intermediate steps between both snapshots; however, note that to make comparisons easier we only take the first 100 time steps in both datasets. We also use a real dataset, VAP, that contains all the monthly snapshots of the variable VAP from 1998 to 2017. Note that, although we choose a rather small number of time instants in our experiments, the performance of our proposal is not affected by this value: it scales linearly in space with the number of time instants, and query times should be unaffected as long as the change rate is similar.
We compared our representation with two baseline implementations. The first, called 444https://gitlab.lbd.org.es/fsilva/k2-raster is a representation that stores just a full snapshot for each time instant, without trying to take advantage of similarities between close time instants. The second baseline implementation, NetCDF, stores the different raster datasets in NetCDF format, using straightforward algorithms on top of the NetCDF library555https://www.unidata.ucar.edu/software/netcdf/ (v.4.6.1) to implement the query operations. Note that NetCDF is a classical representation designed mainly to provide compression, through the use of Deflate compression over the data. Therefore, it is not designed to efficiently answer indexed queries.
We tested cell value queries (getCellValue) and range queries (getCells). We generated sets of 1000 random queries for each query type and configuration: 1000 random cell value queries per dataset, and sets of 1000 random range queries for different spatial window sizes (ranging from 44 windows to the whole matrix), and different ranges of values (considering cells with 1 to 4 possible values). To achieve accurate results, when the total query time for a query set was too small, we repeated the full query set a suitable number of times (in practice, 100 or 1000 times) and measured the average time per query.
All tests were run on an Intel (R) Core TM i7-3820 CPU @ 3.60GHz (4 cores) with 10MB of cache and 64GB of RAM, over Ubuntu 12.04.5 LTS with kernel 3.2.0-126 (64 bits). The code is compiled using gcc 4.7 with -O9 optimizations.
|(varying )||NetCDF (varying deflate level)|
Table 1 displays the space requirements for the datasets T_100 and T_1000 in all the representations. We tested our with several sampling intervals , and also show the results for NetCDF using several deflate levels, from level 0 (no compression) to level 9. Our representation achieves the best compression results in both datasets, especially in T_1000, as expected, due to the slower change rate. In T_100, our approach achieves the best results for , since as the number of changes increases our differential approach becomes much less efficient. In T_1000, the best results are also obtained for a relatively small (6-8), but our proposal is still smaller than for larger . NetCDF is only competitive when compression is applied, otherwise it requires roughly 20 times the space of our representations. In both datasets, NetCDF with compression enabled becomes smaller than the representation, but is able to obtain even smaller sizes.
Figure 3 shows the space/time trade-off for the datasets T_100 and T_1000 in cell value queries. We show the results only for NetCDF with compression enabled (deflate level 2 and 5), and for with a sampling interval of 6 and 50. The is slower than the baseline , but is much smaller if a good is selected. Note that we use two extreme sampling intervals to show the consistency of query times, since in practice only the best approach in space would be used for a given dataset. In our experiments we work with a fixed
, but straightforward heuristics could be used to obtain an space-efficientwithout probing for different periods: for instance, the number of nodes in the tree of differences and in the snapshot is known during construction, so a new snapshot can be built whenever the size of the tree of differences increases above a given threshold.
Table 2 shows an extract of the range query times for all the representations in datasets T_100 and T_1000. We only include here the results for with a of 6 and 50, and for NetCDF with deflate level 2 and 5, since query times with the other parameters report similar conclusions. We also show the results for some relevant spatial window sizes and ranges of values. In all the cases, is around 50% slower than , due to the need of querying two trees to obtain the results. However, the much smaller space requirements of our representation compensate for this query time overhead, especially in T_1000. NetCDF, that is not designed for this kind of queries, cannot take advantage of spatial windows or ranges of values, so it is several orders of magnitude slower than the other approaches. The last query set (ALL) involves retrieving all the cells in the raster that have a given value (i.e. the spatial window covers the complete raster). In this context, NetCDF must traverse and decompress the whole raster, but our representation cannot take advantage of its spatial indexing capabilities, so this provides a fairer comparison. Nevertheless, both and are still several times faster than NetCDF in this case, and our proposal remains very close in query times to the baseline.
Figure 4 (left) shows the space/time trade-off for the real dataset VAP. Results are similar to those obtained for the previous datasets: our representation, , is a bit slower in cell value queries than , but also requires significantly less space. The NetCDF baseline is much slower, even if it becomes competitive in space when deflate compression is applied.
Finally, Figure 4 (right) displays the query times for all the alternatives in range queries over the VAP dataset. The is again a bit faster than the , as expected, but the time overhead is within 50%. NetCDF is much slower, especially in queries involving small windows, as it has to traverse and decompress a large part of the dataset just to retrieve the values in the window. Note that even if the window covers the complete raster, and achieve significantly better query times.
6 Conclusions and future work
In this work we introduce a new representation for time-evolving raster data. Our representation, called , is based on a compact data structure for raster data, the , that we extend to efficiently manage time series. Our proposal takes advantage of similarities between consecutive snapshots in the series, so it is especially efficient in datasets where few changes occur between consecutive time instants. The provides spatial and temporal indexing capabilities, and is also able to efficiently filter cells by value. Results show that, in datasets where the number of changes is relatively small, our representation can compress the raster and answer queries very efficiently. Even if its space efficiency depends on the dataset change rate, the is a good alternative to compress raster data with high temporal resolution, or slowly-changing datasets, in small space.
As future work, we plan to apply to our representation some improvements that have already been proposed for the , such as the use of specific compression techniques in the last level of the tree. We also plan to develop an adaptive construction algorithm, that selects an optimal, or near-optimal, distribution of snapshots to maximize compression.
-  Abatzoglou, J.T., Dobrowski, S.Z., Parks, S.A., Hegewisch, K.C.: Terraclimate, a high-resolution global dataset of monthly climate and climatic water balance from 1958–2015. Scientific Data 5(170191) (2017)
-  de Bernardo, G., Álvarez-García, S., Brisaboa, N.R., Navarro, G., Pedreira, O.: Compact Querieable Representations of Raster Data. In: Proceedings of the International Symposium on String Processing and Information Retrieval (SPIRE). pp. 96–108 (2013)
-  Botea, V., Mallett, D., Nascimento, M.A., Sander, J.: Pist: An efficient and practical indexing technique for historical spatio-temporal point data. GeoInformatica 12(2), 143–168 (2008)
-  Brisaboa, N.R., Ladra, S., Navarro, G.: DACs: Bringing direct access to variable-length codes. Information Processing and Management 49(1), 392–404 (2013)
-  Brisaboa, N.R., Ladra, S., Navarro, G.: Compact representation of web graphs with extended functionality. Information Systems 39(1), 152–174 (2014)
-  Couclelis, H.: People manipulate objects (but cultivate fields): Beyond the raster-vector debate in GIS. In: Proceedings of GIS: from space to territory - theories and methods of spatio-temporal reasoning. pp. 65–77 (1992)
-  Jacobson, G.: Succinct static data structures. Ph.D. thesis, Carnegie-Mellon (1988)
-  Ladra, S., Paramá, J.R., Silva-Coira, F.: Scalable and queryable compressed storage structure for raster data. Information Systems (72), 179–204 (2017)
-  Mennis, J., Viger, R., Tomlin, C.D.: Cubic map algebra functions for spatio-temporal analysis. Cartography and Geographic Information Science 32(1), 17–32 (2005)
-  Nascimento, M.A., Silva, J.R.O.: Towards historical R-trees. In: Proceedings of the 1998 ACM symposium on Applied Computing, SAC’98. pp. 235–240. ACM (1998)
-  Navarro, G.: Compact Data Structures – A practical approach. Cambridge University Press (2016)
-  Pinto, A., Seco, D., Gutiérrez, G.: Improved queryable representations of rasters. In: Proceedings of the 2017 Data Compression Conference (DCC). pp. 320–329 (2017)
-  Tao, Y., Papadias, D.: MV3R-tree: A spatio-temporal access method for timestamp and interval queries. In: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB). pp. 431–440 (2001)
-  Vazirgiannis, M., Theodoridis, Y., Sellis, T.K.: Spatio-temporal composition and indexing for large multimedia applications. ACM Multimedia Systems Journal 6(4), 284–298 (1998)