Relative compression techniques were designed to exploit the redundancy of highly repetitive datasets. They represent a large set of sequences with respect to another sequence called the reference. Since the differences between the reference and the remaining sequences are small, relative compression saves considerable space on such datasets. These techniques have long been applied to DNA collections, but they can be extended to other kinds of highly repetitive data. For example, in datasets of objects moving over networks (taxis) or along predefined routes between points (planes or boats), objects tend to follow the best-known route between origin and destination, so their trajectories are similar and repetitive. The aim of this study is to exploit the repetitiveness of object trajectories and to support spatio-temporal queries efficiently by using relative compression.
The best-known relative compression technique is Relative Lempel-Ziv (RLZ) [kuruppu2010relative], which compresses each sequence of a dataset by applying an LZ77 [ziv1978compression] parsing with respect to a reference. The reference can be a representative sequence of the dataset or an artificial reference assembled from parts of other sequences. RLZ achieves a good compression ratio and supports efficient random access to the original sequence. We can therefore compress trajectories with RLZ and retrieve the trajectory of an object over a given time interval. However, RLZ alone cannot solve spatio-temporal queries efficiently; for example, it cannot retrieve the objects within a region during a time interval.
In this work, we propose a structure based on RLZ that compresses the set of sequences in a dataset with respect to an artificial reference built using the technique proposed in [liao2016effective]. Once the reference is constructed, the LZ77 parser generates z phrases with respect to it for each trajectory. By combining existing data structures for trajectories [BGGBNPspire17], built on the reference, with O(z)-sized data structures on the sequences of phrases, we offer the same functionality as previous work [BGGBNPspire17, brisaboa2016gract] within RLZ-bounded space. We expect this arrangement to obtain the best of both previous works: the low space requirement of GraCT [brisaboa2016gract] and the speed of ContaCT [BGGBNPspire17].
2.1 Relative Lempel-Ziv
Let be a sequence of length called source and be a sequence of length called reference, where . The Relative Lempel-Ziv (RLZ) compresses by using an LZ77 parsing with respect to . As a consequence of the parsing step, we obtain z phrases which represent . Therefore can be represented with phrases , and every phrase is stored as a pair where is the starting position of the -th phrase at and its length.
For example, with and , RLZ represents with three phrases. The first phrase is and it is represented with the pair because it appears at position in and . beginning at position at position and with length , hence it is encoded with . Finally, we obtain the pair which corresponds with .
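As an illustration of the parsing step, the following minimal Python sketch factors a source string greedily against a reference and emits (position, length) pairs. The function names `rlz_parse` and `rlz_decode`, the example strings, and the 0-based positions are assumptions made for this sketch; the quadratic scan is a simple stand-in for the indexed search that a practical implementation would typically use.

```python
def rlz_parse(source, reference):
    """Greedily factor `source` into (position, length) phrases over `reference`.

    Positions are 0-based offsets into `reference` (an assumption of this sketch).
    """
    phrases = []
    i = 0
    while i < len(source):
        best_pos, best_len = -1, 0
        # Try every starting position of the reference and keep the longest match.
        for start in range(len(reference)):
            length = 0
            while (i + length < len(source)
                   and start + length < len(reference)
                   and source[i + length] == reference[start + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = start, length
        if best_len == 0:
            raise ValueError(f"symbol {source[i]!r} does not occur in the reference")
        phrases.append((best_pos, best_len))
        i += best_len
    return phrases


def rlz_decode(phrases, reference):
    """Rebuild the source by copying each phrase out of the reference."""
    return "".join(reference[pos:pos + length] for pos, length in phrases)


reference = "ACGTACGA"   # hypothetical reference
source = "GTACACGACG"    # hypothetical source
phrases = rlz_parse(source, reference)
print(phrases)           # [(2, 4), (4, 4), (1, 2)]
print(rlz_decode(phrases, reference) == source)
```

Because every phrase is a plain copy from the reference, decoding needs only the pairs and the reference, which is what makes random access to the compressed sequence possible.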
GraCT [brisaboa2016gract] is a compact data structure designed to store trajectories and support spatio-temporal queries. It assumes regular timestamps and stores trajectories using two components. At regular intervals, it records the positions of all the objects in a structure called a snapshot. The positions of objects between snapshots are represented in a structure called a log.
Let us denote the snapshot representing the position of all the objects at timestamp . Given a parameter , which specifies the distance between two consecutive snapshots, and , there is a log for each object, which is denoted , being the identifier of the object. The log stores the differences of positions compressed with RePair [larsson2000off], a grammar-based compressor. In order to speed up the queries over the resulting sequence, the nonterminals are enriched with additional information, mainly the MBB of the trajectory segment encoded by the nonterminal.
Each snapshot is a binary matrix in which a cell set to 1 indicates that one or more objects are located at that position of the space. To store such a (generally sparse) matrix, GraCT uses a k²-tree [ktree]. The k²-tree is a space- and time-efficient version of a region quadtree [Sam2006], and it is used to filter the objects that may be relevant for queries involving spatial areas.
ContaCT is based on GraCT, hence it keeps the same components: snapshots and logs. The main differences are in the log. Instead of a log compressed with RePair, the differences of positions are stored continuously using two bitmap per axis, one for positive movements and another for negative movements. For the -axis, we have the bitmap where we store how many positions an object moves to the right in each timestamp and the bitmap which stores the movements to the left. Two additional bitmaps are required to store the movements up and downwards. In those bitmaps, the movements are encoded in unary. For example, if an object moves two positions to the right, its representation is and because it moves positions to the left.
In order to compute the position of an object in GraCT, we have to perform a sequential traversal of the log from the closest snapshot up to the queried timestamp. ContaCT avoids this traversal: the position of an object can be computed with select [j-ssds-89] operations over those bitmaps, which can be solved in constant time using a sublinear number of extra bits.
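To make the unary log concrete, the sketch below encodes the per-timestamp displacements along one axis into two bitmaps and recovers a coordinate with a select operation. It assumes that each timestamp contributes as many 0s as cells moved, followed by a closing 1, which is one convention consistent with the description; the helper names (`encode_unary`, `select1`, `zeros_before`, `x_at`) are hypothetical, and the linear-scan `select1` merely stands in for a constant-time select structure.

```python
def encode_unary(moves):
    """Encode non-negative per-timestamp moves: k zeros, then a terminating 1."""
    bits = []
    for k in moves:
        bits.extend([0] * k)
        bits.append(1)
    return bits


def select1(bits, i):
    """0-based position of the i-th 1 (i >= 1). Linear scan used for clarity only."""
    seen = 0
    for pos, bit in enumerate(bits):
        seen += bit
        if bit == 1 and seen == i:
            return pos
    raise IndexError("fewer than i ones in the bitmap")


def zeros_before(bits, i):
    """Total cells moved during the first i timestamps."""
    return 0 if i == 0 else select1(bits, i) - (i - 1)


def x_at(snapshot_x, right, left, i):
    """x coordinate i timestamps after a snapshot taken at x = snapshot_x."""
    return snapshot_x + zeros_before(right, i) - zeros_before(left, i)


dx = [2, 0, -1, 3]                                # hypothetical x displacements
right = encode_unary([max(d, 0) for d in dx])     # [0, 0, 1, 1, 1, 0, 0, 0, 1]
left = encode_unary([max(-d, 0) for d in dx])     # [1, 1, 0, 1, 1]
print(x_at(10, right, left, 3))                   # 11
```

The point of the layout is that a coordinate at any timestamp costs two select operations per axis, independently of how far the queried timestamp is from the snapshot.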
3 Relative compression of trajectories (RCT)
RCT uses snapshots and logs, just like ContaCT and GraCT. As with the previous structures, the main differences lie in the log, which in RCT is composed of two parts: an artificial reference and the log of trajectories.
Using the technique presented in [liao2016effective], we build an artificial reference that represents well all the trajectories stored in the dataset. After the construction of the artificial reference, the trajectories are compressed with RLZ. In order to speed up the queries, a structure similar to ContaCT is built on the artificial reference, and O(z)-sized structures are added on the sequences of phrases, where z is the number of phrases of the LZ77 parsing.
3.1 Artificial reference
The phrases obtained after applying RLZ compression are pointers to the artificial reference, so most queries have to be solved on the artificial reference. For this reason, we need a mechanism to answer two queries efficiently:
movement(i,j), computes the movement performed in the reference from the time instant i to the time instant j. It returns the pair where is the difference in -axis from to , and the equivalent in -axis .
MBB(i,j), computes the minimum bounding box of the movements represented by the reference between time instants and . The value returned, , is relative to the position of the object at time instant .
In order to support movement(i,j), we add the bitmaps of ContaCT. Recall that all the differences are represented in unary, as we can observe in Figure 1(a). This representation allows that the number of zeroes before the -th 1 corresponds with the number of movements (upwards, downwards, to the right, or to the left) depending on the bitmap. from the initial time instant to the -th movement, denoted as , being : , or . That difference can be computed as , which is solved in constant time using an extra space of bits. Hence, the number of movements in the bitmap from to is computed as . Therefore, and .
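The movement(i,j) query described above can be sketched as a difference of two prefix counts per bitmap. The sketch assumes four unary bitmaps over the reference (right, left, up, down), with time instant 0 denoting the start of the reference; the class names `UnaryBitmap` and `ReferenceMoves` are hypothetical, and the stored list of 1-positions is only a stand-in for a succinct select structure.

```python
class UnaryBitmap:
    """Unary-coded moves; keeps the positions of the 1s to answer select directly."""

    def __init__(self, moves):
        self.one_pos = []
        pos = 0
        for k in moves:
            pos += k                 # skip the k zeros of this timestamp
            self.one_pos.append(pos)
            pos += 1                 # skip the terminating 1

    def zeros_before(self, i):
        """Cells moved in the first i timestamps (zeros before the i-th 1)."""
        return 0 if i == 0 else self.one_pos[i - 1] - (i - 1)

    def between(self, i, j):
        """Cells moved from time instant i to time instant j (i <= j)."""
        return self.zeros_before(j) - self.zeros_before(i)


class ReferenceMoves:
    def __init__(self, steps):
        self.right = UnaryBitmap([max(dx, 0) for dx, _ in steps])
        self.left = UnaryBitmap([max(-dx, 0) for dx, _ in steps])
        self.up = UnaryBitmap([max(dy, 0) for _, dy in steps])
        self.down = UnaryBitmap([max(-dy, 0) for _, dy in steps])

    def movement(self, i, j):
        """Net (dx, dy) performed in the reference between instants i and j."""
        dx = self.right.between(i, j) - self.left.between(i, j)
        dy = self.up.between(i, j) - self.down.between(i, j)
        return dx, dy


steps = [(1, 0), (1, 1), (0, 1), (-1, 0), (2, -1)]   # hypothetical reference
ref = ReferenceMoves(steps)
print(ref.movement(1, 4))                            # (0, 2)
```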
To solve MBB(i,j), we need to compute the maximum and minimum in each axis. A naive approach could be store a range minimum query structure rmq and a range maximum query structure rMq per axis. Both structures only return the index of the minimum/maximum, hence they do not store the values, and each one takes an space of bits. For example, the minimum value could be computed as . However, is the size of the trajectory, which can be very large.
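A minimal sketch of this naive approach follows. The range minimum and maximum helpers return only an index, and the corresponding value is then recovered from the cumulative displacement, mirroring the idea that the values themselves are not stored. The prefix-sum arrays stand in for the unary bitmaps, the linear scans stand in for succinct range-query structures, and the output order (min x, min y, max x, max y) is an assumption of this sketch.

```python
from itertools import accumulate


def prefix(deltas):
    """Cumulative displacement after 0, 1, 2, ... steps."""
    return [0] + list(accumulate(deltas))


def range_argmin(values, i, j):
    """Index of the minimum in values[i..j] (stand-in for an rmq structure)."""
    return min(range(i, j + 1), key=lambda k: values[k])


def range_argmax(values, i, j):
    """Index of the maximum in values[i..j] (stand-in for an rMq structure)."""
    return max(range(i, j + 1), key=lambda k: values[k])


def mbb(steps, i, j):
    """Bounding box of the reference between instants i and j, relative to instant i."""
    xs = prefix(dx for dx, _ in steps)
    ys = prefix(dy for _, dy in steps)
    min_x = xs[range_argmin(xs, i, j)] - xs[i]
    max_x = xs[range_argmax(xs, i, j)] - xs[i]
    min_y = ys[range_argmin(ys, i, j)] - ys[i]
    max_y = ys[range_argmax(ys, i, j)] - ys[i]
    return min_x, min_y, max_x, max_y


steps = [(1, 0), (1, 1), (0, 1), (-1, 0), (2, -1)]   # hypothetical reference
print(mbb(steps, 1, 4))                              # (0, 0, 1, 2)
```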
Taking into account that most of the time the objects move in a constant direction, we can sample the local minimums/maximums per axis and mark in a bitmap the movements where the local minimums/maximums appears. For example, in Figure 1(a) we can compute the of MBB(5,11), we have the bitmap which locates the local minimums and the which returns the index of the local minimum.
First, we compute where we have to run the operation from to . and it corresponds with the movement . Finally, as it is a minimum local, we have to compare the extreme values and against and return the minimum of them. We do not store , but we can compute in constant time. The is computed with to obtain a relative value. We repeat this step to compute , and .
3.2 Log of trajectories
The trajectories are compressed with RLZ, as we can observe in Figure 1(b), therefore each trajectory is represented with phrases: . Recall that each phrase is a pair where is the starting position of the -th phrase in the reference and its length. We store the information of the pairs separately, values in an array and we mark in the bitmap the beginning of all the phrases at . We store for each the previous position . In addition, we store the time instant when the trajectory starts. With this information we can compute the position of an object at in constant time, as we show in the next section.
The minimum value in -axis of each is stored by the array with an structure, and the same with the maximum value in an array with an structure. This structures are replicated per axis, hence we can compute the which involves each . As we will explain later, it speeds up the time-interval queries.
4.1 Search object
Given at time instant and an object identifier we access the log to retrieve the position at . First we perform a , thus we know contains the result, and it is stored inside at position . By accessing the reference we compute the position as .
4.2 Trajectory
The operation trajectory returns the position of a given object between two time instants: and . It can be solved by computing the position with search object at and applying movement(i,i+1) for every , where belongs to the set of phrases which contains the time instants between and .
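The two operations above can be sketched as follows, under several simplifying assumptions: the reference is held as prefix sums rather than bitmaps, the phrase boundaries are kept in a sorted list searched with `bisect` instead of a bitmap with rank support, and the absolute position at the start of every phrase is stored explicitly. The names `Reference`, `CompressedTrajectory`, `position`, and `trajectory` are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


class Reference:
    def __init__(self, steps):
        self.px = [0] + list(accumulate(dx for dx, _ in steps))
        self.py = [0] + list(accumulate(dy for _, dy in steps))

    def movement(self, i, j):
        return self.px[j] - self.px[i], self.py[j] - self.py[i]


class CompressedTrajectory:
    def __init__(self, reference, phrases, start_time, start_pos):
        self.ref = reference
        self.phrases = phrases            # (reference position, length) pairs
        self.start_time = start_time
        # Step offset at which each phrase begins (stand-in for bitmap + rank).
        self.phrase_begin = [0] + list(accumulate(l for _, l in phrases))
        # Absolute position of the object at the beginning of each phrase.
        self.phrase_pos = [start_pos]
        for p, l in phrases:
            dx, dy = reference.movement(p, p + l)
            x, y = self.phrase_pos[-1]
            self.phrase_pos.append((x + dx, y + dy))

    def position(self, t):
        step = t - self.start_time
        if not 0 <= step <= self.phrase_begin[-1]:
            raise ValueError("time instant outside the trajectory")
        # Phrase covering this step; the final instant belongs to the last phrase.
        k = min(bisect_right(self.phrase_begin, step) - 1, len(self.phrases) - 1)
        p, _ = self.phrases[k]
        dx, dy = self.ref.movement(p, p + step - self.phrase_begin[k])
        x, y = self.phrase_pos[k]
        return x + dx, y + dy

    def trajectory(self, t1, t2):
        # Simplification: one lookup per instant instead of chaining movement(i, i+1).
        return [self.position(t) for t in range(t1, t2 + 1)]


ref = Reference([(1, 0), (1, 0), (0, 1), (0, 1), (-1, 0)])   # hypothetical data
traj = CompressedTrajectory(ref, [(2, 2), (0, 3)], start_time=10, start_pos=(5, 5))
print(traj.position(13))          # (6, 7)
print(traj.trajectory(12, 14))    # [(5, 7), (6, 7), (7, 7)]
```

Storing the position at each phrase boundary is what lets a lookup touch a single phrase: the remaining displacement inside the phrase is one movement query on the reference.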
4.3 Time slice
Let us define a region . Time slice returns the objects within at a given time instant . In order to solve this query we have to consider the maximum speed of the dataset, . The algorithm starts by obtaining the candidates from the previous snapshot at , it means, all the objects which are contained in , where extends in all directions . Finally, we compute the position at of every candidate using the operation search object and return those candidates that are contained in .
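The filtering step of time slice can be sketched as below. The sketch assumes that the region is enlarged by the maximum speed times the number of timestamps elapsed since the snapshot, that the speed is measured in cells per timestamp along each axis, and that the snapshot is a plain dictionary rather than a compact tree; `expand`, `contains`, `time_slice`, and the `position_at` callback are hypothetical names.

```python
def expand(region, margin):
    """Grow an axis-aligned region (x1, y1, x2, y2) by `margin` in every direction."""
    x1, y1, x2, y2 = region
    return x1 - margin, y1 - margin, x2 + margin, y2 + margin


def contains(region, point):
    x1, y1, x2, y2 = region
    x, y = point
    return x1 <= x <= x2 and y1 <= y <= y2


def time_slice(region, t, snapshot_time, snapshot, position_at, max_speed):
    """Objects inside `region` at time t, filtered through the previous snapshot."""
    # No object can travel farther than this since the snapshot was taken.
    margin = max_speed * (t - snapshot_time)
    search = expand(region, margin)
    candidates = [oid for oid, pos in snapshot.items() if contains(search, pos)]
    # Only the candidates need their exact position computed from the log.
    return sorted(oid for oid in candidates if contains(region, position_at(oid, t)))


# Hypothetical data: positions at the snapshot (time 0) and constant velocities.
snapshot = {1: (0, 0), 2: (10, 10), 3: (4, 4)}
velocity = {1: (1, 1), 2: (0, 0), 3: (-1, 0)}


def position_at(oid, t):
    (x, y), (vx, vy) = snapshot[oid], velocity[oid]
    return x + vx * t, y + vy * t


print(time_slice((2, 2, 5, 5), 3, 0, snapshot, position_at, max_speed=1))   # [1]
```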
4.4 Time interval
The time interval query returns those objects which are in a given region at any time instant . To solve this query we split the interval into as many intervals as portions of the log between two snapshots overlaps . Then, this portions can be solved similar to time-slice. Firstly, we obtain the candidates from the previous snapshot at by extending in all directions . For each candidate, we check the phrases which overlap the interval . As we can observe in Figure 2, some phrase are completely included in but others can be partially included. In the worst case, there are one interval where the phrases are completely covered and two extreme partially covered intervals ( and ).
Since we are storing the minimum and maximum of each phrase per axis, we can compute the covered by in constant time. First, we need to know the interval of phrases equivalent to , that interval is between and . By computing and we obtain the minimum and maximum of the for the -axis, respectively. Analogously, we obtain the minimum and maximum of -axis. Then, we check if the is contained in , in that case we add the object to the solution and stop the search. If the intersects with , we run this process recursively splitting the interval into two halves. On the other hand, if is outside and they do not intersect, we stop processing the actual interval and we process the next one.
Once, we have to process only one phrase, we run the binary search in the reference. We split it in two halves and we continue recursively computing the as , where is the relative and the previous location of the object. When we have not found any completely contained in between , we repeat these steps on the reference for the partially covered intervals.
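The recursive test used by time interval can be sketched on a plain list of positions. The sketch checks the bounding box of a range: if it lies inside the region the object is reported, if it is disjoint the range is discarded, and otherwise the range is halved. For brevity it recurses directly down to single positions instead of switching from phrases to the reference, and it recomputes each box with a linear scan where the structure described above would use constant-time range queries; all function names are hypothetical.

```python
def mbb(points, lo, hi):
    """Bounding box (x1, y1, x2, y2) of points[lo..hi]; linear scan for clarity."""
    xs = [x for x, _ in points[lo:hi + 1]]
    ys = [y for _, y in points[lo:hi + 1]]
    return min(xs), min(ys), max(xs), max(ys)


def inside(box, region):
    return (region[0] <= box[0] and region[1] <= box[1]
            and box[2] <= region[2] and box[3] <= region[3])


def disjoint(box, region):
    return (box[2] < region[0] or region[2] < box[0]
            or box[3] < region[1] or region[3] < box[1])


def visits_region(points, lo, hi, region):
    """True if any of points[lo..hi] lies inside the region."""
    box = mbb(points, lo, hi)
    if inside(box, region):
        return True            # every position in the range is in the region
    if disjoint(box, region):
        return False           # no position in the range can be in the region
    if lo == hi:
        return False           # defensive: a single point is inside or disjoint
    mid = (lo + hi) // 2       # the box only intersects: split and recurse
    return (visits_region(points, lo, mid, region)
            or visits_region(points, mid + 1, hi, region))


points = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 2), (5, 2)]   # hypothetical positions
print(visits_region(points, 0, len(points) - 1, (3, 1, 4, 3)))   # True
print(visits_region(points, 0, len(points) - 1, (1, 1, 2, 3)))   # False
```

In the second call the box of the whole range intersects the region, but both halves turn out to be disjoint from it, so the search stops after two further checks without visiting individual positions.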