A Compact Representation for Trips over Networks built on self-indexes

Representing the movements of objects (trips) over a network in a compact way while retaining the capability of exploiting such data effectively is an important challenge of real applications. We present a new Compact Trip Representation (CTR) that handles the spatio-temporal data associated with users' trips over transportation networks. Depending on the network and types of queries, nodes in the network can represent intersections, stops, or even street segments. CTR represents separately sequences of nodes and the time instants when users traverse these nodes. The spatial component is handled with a data structure based on the well-known Compressed Suffix Array (CSA), which provides both a compact representation and interesting indexing capabilities. The temporal component is self-indexed with either a Hu-Tucker-shaped Wavelet-tree or a Wavelet Matrix that solve range-interval queries efficiently. We show how CTR can solve relevant counting-based spatial, temporal, and spatio-temporal queries over large sets of trips. Experimental results show the space requirements (around 50-70 baseline) and query efficiency (most queries are solved in the range of 1-1000 microseconds) of CTR.


Using Compressed Suffix-Arrays for a Compact Representation of Temporal-Graphs

Temporal graphs represent binary relationships that change along time. T...

New structures to solve aggregated queries for trips over public transportation networks

Representing the trajectories of mobile objects is a hot topic from the ...

Towards a compact representation of temporal rasters

Big research efforts have been devoted to efficiently manage spatio-temp...

Querying Spatial-Temporal-Spectral Data Using a Graphical Query Builder

Constructing complex queries on data which combines spatial, temporal, a...

A Grammar-based Compressed Representation of 3D Trajectories

Much research has been published about trajectory management on the grou...

Semantrix: A Compressed Semantic Matrix

We present a compact data structure to represent both the duration and l...

Revisiting compact RDF stores based on k2-trees

We present a new compact representation to efficiently store and query l...

1 Introduction

Due to the current advances in sensor networks, wireless technologies, and RFID-enabled ubiquitous computing, data about moving-objects (also referred to as trajectories) is an example of massive data relevant in many real applications. Think in the notion of Smart Cities, where the implementation of new technologies in public transportation systems has become more widespread all around the world in the last decades. For instance, nowadays many cities -from London to Santiago- provide the users of the public transportation with smartcards that help in making the payment to access buses or subways an easier task. Even though smartcards may only collect data when users enter to the transportation system, it is possible to derive the users’ trip (when they enter and leave the system) using historical data and transportation models Munizaga20129 . When having this data, counting or aggregate queries of trajectories become useful tools for traffic monitoring, road planning, and road navigation systems.

New technologies and devices generate a huge amount of highly detailed, real-time data. Several research exists about moving-object databases (MODs) DBLP:conf/chorochronos/GutingBEJLNSV03 ; DBLP:journals/geoinformatica/Spaccapietra01 ; DBLP:conf/sigmod/ForlizziGNS00 ; DBLP:journals/geoinformatica/ErwigGSV99 and indexing structures DBLP:books/sp/PelekisT14 ; DBLP:conf/vldb/PfoserJT00 ; DBLP:conf/vldb/PapadiasT01 ; DBLP:conf/ssd/Frentzos03 ; DBLP:journals/geoinformatica/AlmeidaG05 ; DBLP:journals/vldb/PopaZOBV11 ; DBLP:conf/icde/Cudre-MaurouxWM10 . They, however, have addressed typical spatio-temporal queries such as time slice or time interval queries that retrieve trajectories or objects that were in a spatial region at a time instant or during a time interval. They were not specially designed to answer queries that are based on counting, such as the number of distinct trips, which are more meaningful queries for public-transportation or traffic administrators. This problem was recently highlighted in DBLP:conf/cikm/LiCDYZZ0Z15 , where authors describe an approximate query processing of aggregate queries that count the number of distinct trajectories within a region. In this work, we concentrate on counting-based queries on a network, which includes the number of trips starting or ending at some time instant in specific stops (nodes) or the top-k most used stops of a network during a given time interval.

The work in this paper proposes a new structure named Compact Trip Representation () that answers counting-based queries and uses compact self-indexed data structures to represent the large amount of trips in a compact space. combines two well-known data structures. The first one, initially designed for the representation of strings, is Sadakane’s Compressed Suffix Array () Sad03 . The second one is the Wavelet Tree WT03 (). To make the use of the possible, we define a trip or trajectory of a moving object over a network as the temporally-ordered sequence of the nodes the trip traverses. An integer is assigned to each node such that a trip is a string with the s of the nodes. Note that this representation avoids the cost of storing coordinates to represent the locations users pass through during a trip. It is just enough to identify the stops or nodes and when necessary to map these nodes to geographic locations. Then a , over the concatenation of these strings (trips), is built with some adaptations for this context. In addition, we discretize the time in periods of fixed duration (i.e. timeline split into 5-minute intervals) and each time segment is identified by an integer . In this way, it is possible to store the times when trips reach each node by associating the corresponding time with each node in each trip. The sequence of times for all the nodes within a trip is self-indexed with a to efficiently answer temporal and spatio-temporal queries.

We experimentally tested our proposal using two sets of data representing trips over two different real public transportation systems. Our results are promising because the representation uses around % of its original size and answers most of our spatial, temporal, and spatio-temporal queries within microseconds. In addition, since implicitly keeps all the original trajectories in a compact and self-indexed way, it would permit us to extend its functionality with additional operations that could benefit from the indexed access provided both by the underlying and structures. No experimental comparisons with classical spatial or spatio-temporal indexing structures were possible, because none of them were designed to answer the types of queries in this work. Our approach can be considered as a proof of concept that opens new application domains for the use of well-known compact data structures such as the and the , creating a new strategy for exploiting trajectories represented in a self-indexed way.

The organization of this paper is as follows. Section 2 reviews previous works on trip representations. It also presents and upon which we develop our proposal. We pay special attention to show the internals of those structures and discuss also their properties and functionality. Then, in Section 2.3.2, we also present the wavelet matrix () and show how to create a Hu-Tucker-shaped (). These are the two variants of we use to represent temporal data. In Section 3, we present the main counting-based queries that are of interest for a transportation network. In Section 4, we present and show how to reorganize a dataset of trips to allow a to handle the spatial data and a -based structure to manage the temporal data. Section 5 shows how represents the spatial component and how spatial queries are dealt with. In Section 6, we focus on how to represent the temporal component of trips and how to answer temporal queries. We also include a brief comparison of the space/time trade-off of and . In Section 7, we show how spatio-temporal queries are solved by , and Section 8 includes our experimental results. Finally, conclusions and future work are discussed in Section 9.

2 Previous Work

2.1 Models of trajectory and types of queries

There is a large amount of work on data models for moving-object data DBLP:conf/ssdbm/WolfsonXCJ98 ; DBLP:conf/icde/SistlaWCD97 ; DBLP:journals/tods/GutingBEJLSV00 ; DBLP:conf/chorochronos/GutingBEJLNSV03 ; DBLP:journals/geoinformatica/Spaccapietra01 ; DBLP:conf/sigmod/ForlizziGNS00 ; DBLP:journals/geoinformatica/ErwigGSV99 ; DBLP:books/mk/GutingS2005 . Basically, a moving-object data model represents the continuous change of the location of an object over time, what is called the trajectory of the object.

Moving-object data is an example of big data that differ in the representation of location, contextual or environmental information where the movement takes place, the time dimension that can be continuous or discrete, and the level of abstraction or granularity on which the trajectories are described DBLP:journals/sigspatial/DamianiIGV15 . A common classification of trajectories distinguishes free from network-based trajectories. Free trajectories or Euclidean trajectories are a sequence of GPS points represented by an ad-hoc data type of moving points DBLP:conf/ssdbm/WolfsonXCJ98 ; DBLP:conf/icde/SistlaWCD97 ; DBLP:journals/tods/GutingBEJLSV00 . Network-based trajectories are a temporal ordered sequence of locations on networks. This trajectory model includes a data type for representing networks and for representing the relative location of static and moving points on the network DBLP:journals/vldb/GutingAD06 . In a recent work, network-matched trajectories are defined to avoid the need of a mobile map at the moving-object side DBLP:journals/tits/Ding0GL15 .

The definition of trajectories at an abstract level must be materialized in an internal representation with access methods for query processing. An early and broad classification of spatial-temporal queries for historical positions of moving objects DBLP:conf/vldb/PfoserJT00 identifies coordinate- and trajectory-based queries. Coordinate-based queries include the well-known time-slice, time-interval and nearest-neighbor queries. Examples are find objects or trajectories in a region at a particular time instant or during some time interval. Another important example of range-based queries is find the k-closest objects with respect to a given point at a given time instant. Trajectory-based queries involve topology of trajectories (e.g., overlap and disjoint) and information (e.g., speed, area, and heading) that can be derived from the combination of time and space. An example of such queries would be find objects or trajectories that satisfy a spatial predicate (eg., leave or enter a region) at a particular time instant or time interval. There also exist combined queries addressing information of particular objects: Where was object X at a particular time instant or time interval?. In all previous queries, the results are individual trajectories that satisfy the query constraints.

When dealing with large datasets of trajectories we can find scenarios where answering counting based or aggregated queries are typically of concern. This is for example the case of network management applications, those for mobility analysis, or when there are privacy issues that prevent us from revealing the original individual trajectories. In this context, we can further distinguish range- from trajectory-based queries. Range queries impose constraints in terms of a spatial location and temporal interval. Examples of these queries are to retrieve the number of distinct trajectories that intersect a spatial region or spatial location (stop) at a given time instant or time interval, retrieve the number of distinct trajectories that start at a particular location (stop) or in a region and/or end in another particular location of region, retrieve the number of trajectories that follow a path, and retrieve the top-k locations (stops) or regions with the larger number of trajectories that intersect at a given time instant or time interval. Trajectory-based queries require not only to use the spatio-temporal points of trajectories but also the sequence of these points. Examples of such queries are to find the number of trajectories that are heading (not necessarily ending at) to a spatial location during a time interval, find the destination of trajectories that are passing through a region during a time interval, find the number of starting locations of trajectories that go or pass through a region during a time interval.

2.2 Trajectory indexing

Many data structures have been proposed to support efficient query capabilities on collections of trajectories. We refer to (DBLP:books/sp/PelekisT14, , Chapter 4)

for a comprehensive and up-to-date survey on data management techniques for trajectories of moving objects. We can broadly classify these data structures into two groups: those that index trajectories in free space and those that index trajectories that are constrained to a network.

In free space, it is common to see spatial indexes that extend the R-Tree index DBLP:conf/sigmod/Guttman84 beyond a simple 3D R-Tree where the time is one of the dimensions. Two examples of such indexes can be found in DBLP:conf/vldb/PfoserJT00 where the authors present two fundamental variations of the R-Tree: the STR-Tree and the TB-Tree. Both indexes modify the classical construction algorithm for the R-Tree, where the nodes are not only grouped by the spatial distance among the indexed objects, but also by the trajectories they belong to. In the MV3R-tree DBLP:conf/vldb/PapadiasT01 , the construction takes into account temporal information of the moving objects, adapting ideas from the Historical R-Tree nascimento1998towards . Another interesting approach is described in chakka2003indexing , where the authors split trajectories of moving objects across partitions of space, indexing each partition separately. This improves query efficiency, as only the partitions that intersect a query region are accessed.

R-Tree adaptations can also be useful when the trajectories are constrained to a network. They exploit the constraints imposed by the topology of the network to optimize the data structure. This is the case of the FNR-tree DBLP:conf/ssd/Frentzos03 , which consists of a 2D R-Tree to index the segments of the trajectories over the network, and a forest of 1D R-Trees used to index the time interval when each trajectory is moving through each segment of the network. The MON-Tree DBLP:journals/geoinformatica/AlmeidaG05 can be seen as an improvement over the FNR-Tree, saving considerable space by indexing MBRs of larger network elements (edge segments or entire roads) and reducing the number of disk accesses at query time. Both indexes are outperformed by the TMN-Tree chang2010tmn in query time, which indexes whole trajectories of moving objects with a 2D R-Tree and indexing the temporal component with a B-Tree, which proves to be more efficient for that application than the R-Tree.

PARINET is another interesting alternative to represent trajectories constrained to a network DBLP:journals/vldb/PopaZOBV11 . It partitions trajectories into segments from an underlying road network using a complex cost model to minimize the number of disk accesses at query time. It takes into account the spatial relations among the indexed network elements, as well as some statistics of the data to index. Then it adds a temporal B-tree to index the trajectory segments from each road. Those indexes permit PARINET to filter out candidate trajectory segments matching time constraints at query time. The same ideas were used in TRIFL that2015trifl , where the cost model is adapted for flash storage.

All previous data structures were designed to answer spatio-temporal queries, where the space, in particular geographic coordinates, and time are the main filtering criteria. Examples of such queries are: retrieve trajectories that crossed a region within a time interval, retrieve trajectories that intersect, or retrieve the -best connected trajectories (i.e., the most similar trajectories in terms of a distance function). Yet, they could not easily support queries such as the number of trips starting in X and ending at Y. A recent work in DBLP:conf/cikm/LiCDYZZ0Z15 proposes a method to compute the approximate number of distinct trajectories that cross a region. Note that computing aggregate queries of trajectories in the hierarchical structure of classical spatio-temporal indices is usually done by aggregating the information maintained in index nodes at the higher levels to avoid accessing the raw spatio-temporal data. However, for a trajectory aggregate query, maintaining the statistical trajectory information on index nodes does not work because what matters for these queries is to determine the number of distinct trajectories in a spatio-temporal query region.

The application of data compression techniques has been explored in the context of massive data about trajectories. The work by Meratnia and de By DBLP:conf/edbt/MeratniaB04 adapts a classical simplification algorithm by Douglas and Peucker to reduce the number of points in a curve and, in consequence, the space used to represent trajectories. Potamias et al. DBLP:conf/ssdbm/PotamiasPS06 use concepts, such as speed and orientation, to improve compression. It is also possible cao2006spatio to compress a trajectory in a way that the maximum error at query time is deterministic, although the method greatly depends on the distance function to be used.

In DBLP:journals/josis/RichterSL12 ; DBLP:journals/jss/KellarisPT13 ; DBLP:conf/w2gis/FunkeSSS15 , they focus mainly on how to represent trajectories constrained to networks, and in how to gather the location of one or more given moving objects from those trajectories. Yet, these works are out of our scope as they would poorly support queries oriented to exploit the data about the network usage such as those that compute the number of trips with a specific spatio-temporal pattern (e.g. Count the trips starting at stop and ending at stop in working days between 7:00 and 9:00).

A recent work DBLP:conf/gis/KroghPTT14 proposed an indexing structure called NETTRA to answer strict and approximate path queries that can be implemented in standard SQL using -trees and self-JOIN operations. For each trajectory, NETTRA represents the sequence of adjacent network edges touched by the trajectory as entries in a table with four columns: id, entering and leaving time, and a hash value of the entire path up to and including the edge itself. Using the hash value for the first and last edge on a query path, NETTRA determines whether the trajectory followed a specific path between these edges. Also for strict path queries, Koide et al. DBLP:conf/gis/KoideTY15 proposed a spatio-temporal index structure called SNT-Index that is based on the integration of a FM-index DBLP:conf/focs/FerraginaM00 to store spatial information with a forest of B+trees that stores temporal information. To the best of our knowledge, this makes up the first technique using compact data structures to handle spatial data in this scenario. Yet, in our opinion, strict path queries have little interest in the context of exploiting data to analyze the usage of a transportation network.

Unlike previous works, we designed an in-memory representation, that targets at solving counting-based queries, and is completely based on the use of compact data structures (discussed in the next section) to make it successful not only in time but also in space needs. Since keeps data in a compressed way, it will permit to handle larger sets of trajectories entirely in memory and consequently to avoid costly disk accesses.

2.3 Underlying Compact Structures of

relies on two components: one to handle the spatial information and another to represent temporal information. The spatial component is based on the well-known a Compressed Suffix Array (Sad03 . The temporal component can be implemented with either a Wavelet Tree (WT03 or a Wavelet Matrix (CNO15 . The latter is a variant of that performs better when representing sequences built on a large alphabet as we see below.

2.3.1 Sadakane’s Compressed Suffix Array (CSA)

Given a sequence built over an alphabet of length , the suffix array built on MM93 is a permutation of of all the suffixes so that for all , being the lexicographic ordering. Because contains all the suffixes of in lexicographic order, this structure permits to search for any pattern in time by simply binary searching the range that contains pointers to all the positions in where occurs.

To reduce the space needs of , Sadakane’s CSA Sad03 uses another permutation defined in GV00 . For each position in pointed from , gives the position such that points to . There is a special case when , in this case gives the position such that . In addition, we could set up a vocabulary array with all the different symbols from that appear in , and a bitmap aligned to so that if or if (; otherwise). Basically, a in marks the beginning of a range of suffixes pointed from such that the first symbol of those suffixes coincides. That is, if the and one in occur in and respectively, we will have . Note that returns the number of 1s in and can be computed in constant time using extra bits Jac89 ; Mun96 .

By using , , and , it is possible to simulate a binary search for the interval where a given pattern occurs () without keeping nor . Note that, the symbol pointed by can be obtained as , and we can obtain the following symbol from the source sequence as , can be obtained as , and so on. Therefore,  replaces , and it does not need anymore to perform searches.

However, in principle, would have the same space requirements as . Fortunately, is highly compressible. It was shown to be formed by subsequences of increasing values GV00 so that it can be compressed to around the zero-order entropy of  Sad03 and, by using -codes to represent the differential values, its space needs are bits. In NM07 , they showed that can be split into (for any ) runs of consecutive values so that the differences within those runs are always . This permitted to combine -coding of gaps with run-length encoding (of -) yielding higher-order compression of . In addition, to maintain fast random access to , absolute samples at regular intervals (every entries) are kept. Parameter implies a space/time trade-off. Larger values lead to better compression of but slow down access time to non-sampled values.

In FBNCPR12 , authors adapted Sadakane’s CSA to deal with large (integer-based) alphabets and created the integer-based CSA (). They also showed that, in their scenario (natural language text indexing), the best compression of was obtained by combining differential encoding of runs with Huffman and run-length encoding.

2.3.2 The Wavelet Tree (WT)

Given a sequence built on an alphabet with symbols that are encoded with a fixed-length binary code , a  WT03 built over is a balanced binary tree where leaves are labeled with the different symbols in , and each internal node contains a bitvector . The bitvector in the root node contains the first bit from the codes of all the symbols in . Then symbols whose code starts with a 0 are assigned to the left child, and those with codes starting with a 1 are assigned to the right child. In the second level, the bitvectors contain the second bits of the codes of their assigned symbols. This applies recursively for every node, until a leaf node is reached. Leaf nodes can only contain one kind of symbol. The height of the tree is , and since the bitvectors of each level contain bits, the overall size of all the bitvectors is bits. To calculate the total size of the we also need to take into account the space needed to store pointers from each symbol to its corresponding tree node which is bits. In addition, as we see below the reduces the general problem of solving , , and operations to the problem of computing , and on the bitvectors. Therefore, additional structures to efficiently support those operations add up to space. The overall size of the is + . Figure 1.(left) shows a built on the sequence assuming we use a 3-bit binary encoding for the symbols in . Shaded areas are not included in the but help us to see the subsequences handled by the children of a given node.

Figure 1: (left) and (right) for sequence . Shaded areas are neither included in the nor in the .

Among others, the permits to answer the following queries in time:

  • returns .

  • returns the number of occurrences of symbol in .

  • returns the position of the i-th occurrence of symbol in .

  • described in gagie2012new , returns the number of occurences in of the symbols between and .

To solve and operations we traverse the from the root until we reach a leaf. In the case of we descend the tree taking into account the encoding of in each level. Being the bitmap in the root node, if we move to the left child and set ; otherwise we move to the right child and set . We proceed recursively until we reach a leaf where we return . is solved similarly, but at each level, we either move left or right depending on if or respectively. The leaf where we arrive corresponds to the symbol which is returned. To solve we traverse the tree from the leaf corresponding to symbol until the root. At level , we look at the value of the -th bit of the encoding of (). If we set (where is the bitmap of the parent of the current node), otherwise we set . Then we move to level and proceed recursively until the root, where the final value of is returned.

In this work we are also interested in operation that allow us to count the number of occurrences of all the symbols within . Assuming the encodings of the symbols form also a contiguous range this can be solved in gagie2012new ; CNO15 . The idea is to traverse the tree from the root and descend through the nodes that cover the leaves in . At each node (whose bitmap is and range is considered), that covers symbols in range we check whether . In that case we sum occurrences. If both ranges are disjoint we found a node not covering the range and stop the traversal. Similarly, if range becomes empty traversal stops on that branch. Otherwise, we recursively descend from node to their children and where we map the interval into and with operation. In practice, , and , . For more details and pseudocodes see Nav16 ; gagie2012new ; CNO15 . In Figure 1.(left) we can see the nodes (, , and ) that must be traversed, and the ranges within the bitmaps in those nodes, to solve . Therefore, we want to compute the number of occurrences of symbols between and that occur within . We start in the root node , where contains zeroes and ones. We compute , and . At this point we could move to , but we can see that all the encodings of the symbols start by and they are covered by . Therefore, we report occurrences of symbols in range and no further processing is done in the subtree whose root is . However, we descent to since in could belong to any symbol in range , and we have to track only occurrences of symbol . We check and compute , and . Since the second bit of the encoding of symbol is a (as for symbol ), we can discard descending on the left child of and move only to its right child where we are interested in the range . Since covers both symbols and , and the third bit of the encoding of is a zero whereas it is a one for , we do only need to count the number of ones in . After computing , and , we report occurrence of symbol . Therefore, we conclude .

One way of reducing the space needs of a WT consists in compressing its bitvectors Navjda13 . Among others (i.e. Golynski et al. Golynski2007 which is better theoretically), Raman el al. technique Raman:2002:SID:545381.545411 (RRR) is, in practice, one of the best choices. The overall size of the becomes , whereas operations still require time.

Another way of compressing a is to use a prefix-free variable-length encoding for the symbols. For example, Huffman huffman1952method code can be used to build a Huffman-Shaped  ferragina2009compressed , where the tree is not balanced anymore. The size reduces to ,222 term includes both the tree pointers and the size of the Huffman model. and average time becomes for , , and (worst-case time is still Barbay:2013:CPA:2562345.2562626 ). By using compressed bitvectors CNO15 space can be reduced even further to . Unfortunately, the Huffman codes given to adjacent symbols are no longer contiguous, and it is not possible to give a bound for anymore, even if the code is canonical. Hu-Tucker codes hu1971optimal can be used instead.333Hu-Tucker  hu1971optimal is an optimal prefix code that preserves the order of the input vocabulary. This means that the lexicographic order of the output variable-length binary codes is the same as the order of the input symbols. Compression degrades slightly with respect to using Huffman coding,444Being and the average codeword length of Huffman coding and Hu-Tucker codes respectively, it holds: and (see Cover:2006:EIT:1146355 (pages 122-123), or HORIBE1977148 ; GilbertandMore1959 ). but the codes for adjacent symbols are lexicographically contiguous. This permits to solve efficiently. The size of a Hu-Tucker-shaped () can be bounded to and can be reduced to by using compressed bitvectors as well.

The Wavelet Matrix (WM)

For large alphabets, the size of the is affected by the term . A pointerless CNspire08.1 permits to remove555In a pointerless Huffman-shaped a term still remains due to the need of storing the canonical Huffman model. that term by concatenating all the bitmaps level-wise and computing the values of the pointers during the traversals. The operations on a pointerless have the same time complexity but become slower in practice.

By reorganizing the nodes in each level of a pointerless , the Wavelet Matrix (CNO15 obtains the same space requirements ( bits), yet its performance is very close to that of the regular with pointers. Figure 1.(right) shows an example.

As in the , the -th level stores the -th bits of the encoded symbols. A single bitvector is kept for each level. In the first level, stores the -st bit of the encoding of the symbols in the order of the original sequence . From there on, at level , symbols are reordered according to the -th bit of their encoding; that is, according to the bit they had in the previous level. Those symbols whose encoding had a zero at position must be arranged before those that had a one. After that, the relative order from the previous level is maintained. That is, if a symbol occurred before other symbol , and the -th bit of their encoding coincides, then will precede at level .

If we simply keep the number of zeros at each level (), we can easily see that the -th zero at level is mapped at position within , whereas the -th one at level is mapped at position within . This avoids the need for pointers and permits to retain the same time complexity of the operations. For implementation details see CNO15 ; ordonez2015statistical . For example to solve , we see that and . We move to the next level where we check position ; we see that and . We move to next level and check position , where we finally see . Therefore, we have decoded the bits that correspond to symbol .

To reduce the space needs of we could use compressed bitvectors as for s. Space needs become bits. Yet, compressing the by giving either a Huffman or Hu-Tucker shape is not possible as the reordering of the could lead to the existence of holes in the structure that would ruin the process of tracking symbols during traversals. To overcome this issue an optimal Huffman-based coding was specifically developed for wavelet matrices CNO15 ; Farina2016 . This allows to obtain space similar to that of a pointerless Huffman-shaped but faster , , and operations. Unfortunately, since the encodings of consecutive symbols do not form a contiguous range, is no longer supported in time and computing is required for each in .

As indicated before, since in we need efficient support for operation, we will try (see Section 6) the Hu-Tucker-shaped as well as the uncompressed . In addition, we will couple them with both uncompressed and RRR compressed bitvectors.

3 Counting-based queries

In transportation systems, new technologies such as automatic fare collection (e.g., smartcards) and automatic passenger counting have made possible to generate a huge amount of highly detailed, real-time data useful to define measures that characterize a transportation network. This data is particularly useful because it actually consists of real trips, combining implicitly the service offered by a public transportation system with the demand for the system. When having this data, it is not the data about individual trajectories but measures of the use of the network what matters for traffic monitoring and road planning tasks. Examples of useful measures are accessibility and centrality indicators, referred to how easy is to reach locations or how important certain stops are within a network Morency2007193 ; El-Geneidy2011 ; Wang2015335 ; gmtnNAS15 . All these measures are based on some kind of counting queries that determine the number of distinct trips that occur within a spatial and/or temporal window.

Among other types of queries, in this work we focus on the following counting queries, which to the best of our knowledge have not been addressed by previous proposals. In general terms, we define two general queries, number-of-trips queries and top-k queries, upon which we apply spatial, temporal or spatio-temporal constraint when useful.

  • Number-of-trips queries. This is a general type of queries that counts the number of distinct trips. When applying spatial, temporal or spatio-temporal constraints, it can specialized in the following queries:

    1. Pure spatial queries:

      • Number of trips starting at node (starts-with-x).

      • Number of trips ending at node (ends-with-x).

      • Number of trips starting at and ending at (from-x-to-y).

      • Number of trips using or passing through node (uses-x)

    2. Spatio-temporal queries:

      • Number of trips starting at node during time interval (starts-with-x).

      • Number of trips ending at node during the time interval (ends-with-x).

      • Number of trips starting at and ending at occurring during time interval (from-x-to-y). This type of queries is further classified into: (i) from-x-to-y with strong semantics (from-x-to-y-strong), which considers trips that completely occur within interval . (ii) from-x-to-y with weak semantics (from-x-to-y-weak), which considers trips whose life time overlap .

      • Number of trips using node during the time interval (uses-x).

    3. Pure temporal queries:

      • Number of trips starting during the time interval (starts-t).

      • Total usage of network stops during the time interval (uses-t).

      • Number of trips performed during the time interval (trips-t).

  • Top-k queries. In this type of queries we want to retrieve the nodes with the highest number of trips. In this case, depending on having a temporal constraint or not we include the following queries:

    1. Pure spatial Top-k queries:

      • Top-k most used nodes (top-k) that returns the nodes with the largest number of trips passing through.

      • Top-k most used nodes to start a trip (top-k-starts) that returns the nodes with the largest number of trips that start at that node.

    2. Spatio-temporal Top-k queries:

      • Top-k most used nodes during time interval (top-k) that returns the nodes with the largest number of trips passing through within time interval .

      • Top-k most used nodes to start a trip during time interval (top-k-starts) that returns the nodes with the largest number of trips starting there within time interval at that node.

4 Compact Trip Representation ()

If we consider a network with nodes, we can see a dataset of trips over as a set of trips, where for each trip , we represent a list with the temporary-ordered nodes it traverses and the corresponding timestamps: , , , and . Note that every node in the network can be identified with an integer ID (

) and that, if we are interested in analyzing the usage patterns of the network, we will probably be interested in discretizing time into time intervals (i.e. 5-min, 30-min intervals). Therefore, we will have

different time intervals that can also be identified with an integer ID ().

The size of the time interval is a parameter for the time-discretizing process that can be adjusted to fit the required precision in each domain. For example, in a public transportation network where we could have data including five years of trips, one possibility would be to divide that five-years period into 10-minute intervals hence obtaining a vocabulary of roughly different intervals. Other possibility would be to use cyclically annual 10-minute periods resulting in . However, in public transportation networks, queries such as “Number of trips using the stop X on May 10 between 9:15 and 10:00” may be not as useful as queries such as “Number of trips using stop X on Sundays between 9:15 and 10:00”. For this reason, can adapt how the time component is encoded depending on the queries that the system must answer.

Figure 2: A set of trips over a network with 10 nodes.
Example 4.1.

Figure 2 shows a network that contains nodes numbered from to . Over that network we have six trips (), and, for each of them, we indicate the sequence of nodes it traverses and the time when the trip goes through those nodes. If we discretize time into 5-minute intervals, starting at 08:00h, and ending at 9:20h, we will have have different time intervals. Any timestamp within interval will be assigned time-code , those within code , and so on until times within that are given time-code . Therefore, our dataset of trips will be: : , , , , , , , , , , , , where bold numbers indicate node IDs and slanted ones indicate times. ∎

In we represent both the spatial and the temporal component of the trips using well-known self-indexing structures in order to provide both a compact representation and the ability to perform fast indexed searches at query time. In Section 5 we focus on the spatial component and discuss how we adapted to deal with trips. We also show how we support spatial queries. Then, in Section 6 we show that the times, which are kept aligned with the spatial component of the trips, can be handled with a -based representation. Actually we study two alternatives (a and a ) and show how temporal and spatio-temporal (Section 7) queries are supported by .

5 Spatial component of

We use a to represent the spatial component of our dataset of trips within . Yet, we perform some preprocessing on before building a on it. Initially, we sort the trips by their first node (), then by the last node (), then by the starting time (), and finally, by its second node (), third node (), and successive nodes (). Note that the start time () of the trip does not belong to the spatial component, but it is nevertheless used for the sorting.666This initial sorting of the trips will allow us to answer some useful queries very efficiently (i.e., count trips starting at and ending at ).

Following with Example 4.1, after sorting the trips in with the criteria above, our sorted dataset would look like: : , , , , , , , , , , , . Note that appears before because during the sorting process we compare with ; that is, we compare the starting nodes ( and ) and then the ending nodes ( and ). If needed (not in this example) we would have also compared the slanted values ( and ) that are the starting times of the trips, and finally the rest of nodes ( and ). Similarly, the two trips containing nodes are sorted by the starting times ( and ).

In a second step, we enlarge all the trips with a fictitious terminator-node whose timestamp is set to that of the initial node of the trip. We choose terminators such that ; that is the lexicographic value of is smaller for smaller values. In addition, the lexicographic value of any terminator must be lower than the ID of any node in a trip. Therefore, an enlarged trip would become .

The next step involves concatenating the spatial component of all the enlarged trips and to add an extra trailing terminator to create a sequence . must be lexicographically smaller than any other entry in (then it also holds ). In the top part of Figure 3, we can see array for the running example, as well as the corresponding time-IDs that are regarded in sequence ( shows the original times).

Finally, we build a on top of to obtain a self-indexed representation of the spatial component in . Figure 3 depicts the structures and used by built over . There is also a vocabulary containing a symbol and the different node IDs in lexicographic order.

Note that the use of different values as terminators ensures that our sorting criteria are kept even if we follow the standard suffix-sort procedure777Suffix is compared with suffix . required to build suffix array during the creation of . Yet, when we finish that process, we can replace all those terminators by a unique . This is the reason why there is only one symbol in .

Although they are not needed in , we show also suffix array and ’ for clarity reasons in Figure 3. contains the first entries of from a regular , whereas we introduced a small variation in for entries . For example, points to the first node of the first trip . and point to the second node. and point to the third node. and point to the ending of the first trip. Therefore, in the standard , and point to the first node of the second trip. However, in , and point to the first node of the first trip. With this small change, subsequent applications of will allow us to cyclically traverse the nodes of the trip instead of accessing the following entries of .

Figure 3: Structures involved in the creation of a .

Another interesting property arises from the use of a cyclical on trips, and from using trip terminators. Since the first entries in correspond the symbols that mark the end of each trip in (remember that corresponds the ), we can see that the node of the trip can be obtained as , (where ). This property makes it very simple to find starting nodes for any trip. For example, if we focus on the shaded area , we can find the ending terminator of the fourth trip at the position (because the first corresponds to the final at ). Therefore, its starting node can be found as . Since and , the starting node is . For illustration purposes note that it would correspond to . By applying again, the next node of that trip would be obtained by computing , , and accessing (that is, we have obtained , and so on.

Regarding the space requirements of the in , we can expect to obtain a good compressibility due to the structure of the network, and the fact that trips that start in a given node or simply those going through that node will probably share the same sequence of “next” nodes. This will lead us to obtaining many runs in  NM07 , and consequently, good compression.

5.1 Dealing with Spatial Queries

With the structure described for representing the spatial component of the trips, the following queries can be solved.

  • Number of trips starting at node (starts-with-x). Because was cyclically built in such a way that every symbol is followed by the first node of its trip, this query is solved by over the , which results on a binary search for the pattern over the section corresponding to symbols. Then gives the number of trips starting at .

  • Number of trips ending at node (ends-with-x). In a similar way to the previous query, this one can be answered with .

  • Number of trips starting at and ending at (from-x-to-y). Combining both ideas from above, and thanks to the cyclical construction of , this query is solved using .

  • Number of trips using node (uses-x). Even though we could solve this query with , it is more efficient to solve it by directly operating on . Assuming that is at position in the vocabulary of (), its total frequency is obtained by . If is the last entry in , we set .

  • Top-k most used nodes (top-k). We provide two possible solutions for this query named: sequential and binary-partition approaches.

    • To return the most used nodes using sequential approach (top-k-seq). The idea is to apply operation sequentially for every node from to to compute the frequency of each node and to return the nodes with highest frequency. We use a min-heap that is initialized with the first nodes, and for every node from to , we compare its frequency with that of the minimum node (the root) from the heap. In case the frequency of is higher, the root of the heap is replaced by and then moved down to comply with the heap ordering. At the end of the process, the heap will contain the top-k most used nodes , which can be sorted with the heapsort algorithm if needed. Finally, we return . Note that this approach always performs operations on .

    • The binary-partition (top-k-bin)

      approach takes advantage of a skewed distribution of frequency of the nodes that trips traverse. Working over

      and , we recursively split into two segments after each iteration. If possible, we leave the same number of different nodes in each side of the partition. Initially, we start considering the range in which corresponds to the nodes that appear in from positions to .888We skip the at the first entry of and its corresponding entries in ; that is, . We use a priority queue that is initialized as . Then, assuming and , we create two partitions and , which correspond respectively to the nodes in and . These segments created after the partitioning step are pushed into . The pseudocode can be found in Figure 4.

      The priority of each segment in is directly the size of its range in (). When a segment extracted from represents the instance of only one node (, with ), that node is returned as a result of the top-k algorithm (we return ). The algorithm stops when the first nodes are found.

      For example, when searching for the top-1 most used nodes in the example from Figure 3, is initialized with the segment , corresponding to nodes from 1 to 10 (positions from 2 to 11 in ). Note that the entries of from 1 to 7 and represent the symbol. Since it is not an actual node, it must be skipped. Then is split producing the segments for nodes 1 to 5 () and for nodes 6 to 10 (). After three more iterations, we extract , hence obtaining the segment for the single node 3 (position in ), concluding that the Top-1 most used node is with a frequency equal to .

      GetTopK_most_used_nodes : (l.1 )   new PriorityQueue(); (l.2 )  pushselect; (l.3 )  ;
    (l.4 )  while :
    (l.5 ) pop;
    (l.6 ) if :
    (l.7 ) ; (l.8 ) ; (l.9 ) else: (l.10) ; (l.11) select; (l.12) push; (l.13) push; (l.14)  return ;  
    Figure 4: Algorithm Top-k most used nodes using binary-partition approach.
  • Top-k most used nodes to start a trip (top-k-starts). Both top-k approaches above can be adapted for answering top-k-starts. However, unlike its simpler variant, it requires performing over (rather than a on ) at each iteration, hence increasing the temporal complexity of the operation.

    The implementation of the linear approach is straightforward. The binary-partition approach differs slightly from the algorithm in Figure 4: in (l.2) we insert into , and we replace (l.11) with ; .

5.2 Implementation details

In our implementation of , we used the 999http://vios.dc.fi.udc.es/indexing from FBNCPR12 briefly discussed in Section 2.3.1. Yet, we introduced some small modifications:

  • The construction of the Suffix Array is done with SA-IS algorithm nong2011two .101010https://sites.google.com/site/yuta256/sais In comparison with the qsufsort algorithm111111 http://www.larsson.dogma.net/research.html Larsson:2007:FSS:1314704.1314853 used in the original , it achieves a linear time construction and a lower extra working space.

  • In , they used a plain representation for bitvector and additional structures to support in constant time using ( bits). With that structure, they could solve in time (yet they did not actually needed solving in ). In our , we have used the SDArray from okanohara2007practical to represent . It provides a very good compression for sparse bitvectors, as well as constant-time operation.

  • In FBNCPR12 , operation was implemented with a simple binary search over rather than using the backward-search optimization proposed in the original Sad03 . In our experiments, we used backward search since it led to a much lower performance degradation at query time when a sparse sampling of was used.

6 Temporal component of

In this section we focus on the temporal component associated with each node of the enlarged trips in our dataset. Recall that in Figure 3, sequence contains the time associated with each node in a trip, and a possible encoding of times. In we focus on the values in , yet, since is not kept anymore in , we reorganize the values in to keep them aligned with rather than with . Those values are represented within array in Figure 3. For example, we can see that corresponds with , corresponds with , and so on.

Aiming at having a compact representation of while permitting fast access and resolution of range-based queries (that we could use to search for trips within a given time interval), we have considered two -based alternatives from the ones presented in Section 2.3.2:

Figure 5: Balanced (top) and Hu-Tucker-shaped (bottom) for the times within in Figure 3.
  • A Wavelet Tree  WT03 using variable-length Hu-Tucker codes hu1971optimal (). Recall this is the variant that permits to compress the original symbols with variable-length codes and still supports operation in time. Since Hu-Tucker coding assigns shorter codes to the most frequent symbols, the compression of our is highly dependent of the distribution of frequencies of the values in . Yet, if our trips represent movements of single users in a transportation network, we could expect to observe two or more periods corresponding to rush hours within a single day. This would lead to obtaining a skewed distribution of the frequencies for the symbols in , and consequently, we could expect to have better compression than if we used a balanced . The expected number of bits of our is .

  • A balanced Wavelet Matrix (CNO15 . As we showed in Section 2.3.2 the is typically the most compact uncompressed variant of and it is faster than a pointerless . This is the reason why we chose a balanced instead of a balanced as this second alternative. Recall that, contains symbols, and each symbol can be encoded with bits, hence the balanced will be a matrix of bits.

In Figure 5, we show both the and the built on top of from Figure 3. The binary code-assignment to the source symbols in and that obtained after applying Hu-Tucker encoding algorithm hu1971optimal are also included in the figure.

6.1 Dealing with Temporal queries

With either one of the described alternatives ( or ) to represent time intervals we can answer the following pure temporal queries:

  • Number of trips starting during the time interval (starts-t). Since we keep the starting time of each trip within , we can efficiently solve this query by simply computing .

  • Total usage of network stops during the time interval (uses-t). This query can be seem as the sum of the number of trips that traversed each network node during . We can solve this query by computing .

  • Number of trips performed during the time interval (trips-t). This is also an interesting query that permits to know the actual network usage during a time interval. To solve this query we could compute trips-t by subtracting the number of trips that started after (starts-t) and the number of trips that ended before (ends-t) from the total number of trips (). However, recall that has the starting time of each trip, but we do not keep their ending time. We could solve ends-t by taking the first node () of each trip starting before , then applying until reaching the ending node (), and finally getting the ending time of that trip associated to node . However, this would be rather inefficient. A possible solution to efficiently solve ends-t, would require to increment our temporal component, in parallel with , with another -based representation of the ending times for our trips. This would permit to report the number of trips ending before as , but would increase the overall size of

    . Yet, note that even without keeping ending-times, we could provide rather accurate estimations of

    trips-t for a system administrator. For example, using uses-t to compute the number of times each trip went through any node during the time interval , and dividing that value by the average nodes per trip. Another good estimation can also be obtained with starts-t.

6.2 Implementation details

We include here details regarding how we tune our and . As we discussed in Section 2.3.2, both and are built over bitvectors that require support for and operations. In our implementations we included two alternative bitvector representations avaliable at libcds library:121212https://github.com/fclaude/libcds

  • A plain bitvector based on Mun96 named RG with additional structures to support in constant time ( in logaritmic time). RG includes a sampling parameter () that we set to value . In this case, our bitvector RG uses bits. That is, we tune RG to use a sparse sampling.

  • A compressed RRR bitvector Raman:2002:SID:545381.545411 . The RRR implementation includes a sampling parameter that we tune to values , , and . Higher sampling values achieve better compression.

In advance, when presenting results for and we will consider the four bitvector configurations above. Regarding our implementations of and , note that we reused the same implementation of from CNO15 , and we created our custom implementation, paying special focus at solving efficiently.

6.3 Comparing the space/time trade-off of and

In order to compare the efficiency of our (that uses variable-length codes and supports efficiently) with a balanced alternative under different time distributions (recall that this is time distribution invariant), we run some experiments that evaluate the average time to execute operation on both representations.

Figure 6: Time distributions used. The y-axis indicates the number of passengers per each 5-min interval.

We used a dataset of generated trips (Refer to Section 8.1 for the details about Madrid dataset) and we generated three kinds of time distributions for our evaluation. We refer to them as: uniform, skewed and very skewed. They are shown in Figure 6

. According to the total number of passengers in a day, in the uniform distribution

passengers use the network for each 5-minute interval. We also generated a skewed distribution for the time interval frequencies in an effort to model the usage of a public transportation network in a regular working day, where the starting time of a trip is generated according to the following rules:

  • With 30% of probability, a trip occurs during a morning rush hour.

  • With 45% of probability, a trip occurs in an evening rush hour.

  • With 5% of probability, a trip occurs during lunch rush hour.

  • The remaining 20% of probability is associated to unclassified trips, starting at a random hour of the day, which may also fall into one of the three previous periods discussed.

In the very skewed distribution we increase the rush-hour probabilities with 40% for the morning rush hour, 50% for the evening rush hour, 8% for lunch period and only 2% of random movements.

Then we built the and the considering two different granularities for the discretization of times: five-minute and thirty-minute intervals. Then, we generated random intervals of times over the whole time sequence of the dataset considering interval widths of five minutes, one hour, and six hours. Finally, we run queries (we show average times) from each query set over the six configurations of and