Using Compressed Suffix-Arrays for a Compact Representation of Temporal-Graphs

12/28/2018 ∙ by Nieves R. Brisaboa, et al. ∙ University of Concepcion ∙ Universidade da Coruña ∙ Universidad del Desarrollo

Temporal graphs represent binary relationships that change along time. They can model the dynamism of, for example, social and communication networks. Temporal graphs are defined as sets of contacts that are edges tagged with the temporal intervals when they are active. This work explores the use of the Compressed Suffix Array (CSA), a well-known compact and self-indexed data structure in the area of text indexing, to represent large temporal graphs. The new structure, called Temporal Graph CSA (TGCSA), is experimentally compared with the most competitive compact data structures in the state-of-the-art, namely, EDGELOG and CET. The experimental results show that TGCSA obtains a good space-time trade-off. It uses a reasonable space and is efficient for solving complex temporal queries. Furthermore, TGCSA has wider expressive capabilities than EDGELOG and CET, because it is able to represent temporal graphs where contacts on an edge can temporally overlap.




1 Introduction

The main assumption of static graphs is that the relationship between two vertexes is always available. However, this is not true in many real-world situations. For example, consider how friendship relations evolve in an online social network, or how the connectivity in a communication network changes when users, with their mobile devices, move around a city. Temporal graphs deal with the time-dependence of relationships between vertexes by representing these relationships as a set of contacts Nicosia et al. (2013). Each contact represents an edge (i.e., two vertexes) tagged with the time interval when the edge was active. For example, in a communication network, a contact may represent a call between users made from 4:00 pm to 4:05 pm.

The temporal dimension of edges adds a constraint to the relationship between vertexes not found in static graphs: two vertexes can communicate only if there is a time-respecting path (also called a journey Nicosia et al. (2013)) between them Nicosia et al. (2013); Sizemore and Bassett (2017); Wu et al. (2014); Tang et al. (2013); Holme and Saramäki (2012). For example, in Figure 1.b (the time-aggregated graph of the edges of the temporal graph in Figure 1.a), two vertexes may be connected by several paths, and yet by no path at all when the temporal availability of the edges is taken into account: a vertex may be unreachable simply because the edges reaching it are not available at the required times. Therefore, taking into account the temporal dynamism of graphs allows us to exploit information about temporal correlations and causality, which would be unfeasible through a classical analysis of static graphs Nicosia et al. (2013); Holme and Saramäki (2012); Michail (2016).

Figure 1: A temporal graph composed of four vertexes with a lifespan of four time-instants: a) A snapshot-based representation showing the available edges per time-instants, b) the time-aggregated graph of the temporal graph.

A direct approach to represent temporal graphs could be a time-ordered sequence of snapshots (Figure 1.a), one per time instant, each showing the state of the temporal graph at that instant as a static graph. Several centralized and distributed processing systems follow this approach (e.g., Pregel Malewicz et al. (2010), Giraph, Neo4j, Trinity Shao et al. (2013)), but without specific support for temporal extensions Kosmatopoulos et al. (2016).

In temporal graphs where contacts are active during long time intervals (as in a social network), consecutive snapshots tend to become very similar. Thus, strategies based on a sequence of snapshots are space consuming because edges are duplicated in each snapshot. An alternative change-based approach represents the temporal graph by the differences between snapshots; that is, by the set of edges that appear/disappear along time. These differences can be calculated with respect to consecutive snapshots Ferreira and Viennot (2002), or with respect to a derived graph that diminishes the number of stored edges Ren et al. (2011); Khurana and Deshpande (2013); Labouseur et al. (2013); Semertzidis and Pitoura (2016a).
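The change-based idea can be sketched as follows (a minimal Python illustration, not the representation of any of the cited systems): each snapshot is a set of edges, and only the edges that appear or disappear between consecutive snapshots are recorded.

```python
def snapshot_diffs(snapshots):
    """Turn a sequence of snapshots (sets of edges) into per-instant
    (appeared, disappeared) edge sets, avoiding edge duplication."""
    events, prev = [], set()
    for snap in snapshots:
        events.append((snap - prev, prev - snap))
        prev = snap
    return events

s0 = {("a", "b")}
s1 = {("a", "b"), ("b", "c")}
s2 = {("b", "c")}
ev = snapshot_diffs([s0, s1, s2])
assert ev[1] == ({("b", "c")}, set())   # (b,c) appears at t=1
assert ev[2] == (set(), {("a", "b")})   # (a,b) disappears at t=2
```

When consecutive snapshots are similar, the per-instant difference sets are far smaller than the snapshots themselves.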

The change-based approach has also been used for pre-computing reachability queries Semertzidis and Pitoura (2016b, a), as some paths may remain available for several time instants Bannister et al. (2013). Although these works improve the time performance of complex algorithms, they overlook the space cost, which becomes crucial for large temporal graphs. In this context, a compact representation can keep larger sections or even the whole temporal graph in memory and, in consequence, queries could become much more efficient by avoiding disk transfers.

Recently, some compact approaches to represent temporal graphs have been proposed Caro et al. (2016, 2015). The work in Caro et al. (2016) presents a tree-shaped compact data structure based on the Quadtree Samet (1984), which represents each contact of a temporal graph as a point in a four-dimensional space. This data structure was designed to reduce space usage at the expense of access time in sparse temporal graphs. EdgeLog (Time Interval Log per Edge) Caro et al. (2015) uses a compressed inverted index, which provides fast answers to different types of queries, in particular adjacency queries that recover the active neighbors of a vertex at a specific time instant. CET (Compact Events ordered by Time) Caro et al. (2015) uses a wavelet tree Navarro (2014); Grossi et al. (2003) to represent temporal graphs and is the best alternative in the state of the art for answering queries about the time instants at which the state of an edge changes.

Both EdgeLog and CET overcome the overhead of storing a snapshot per time instant by representing the temporal graph as a log of events. These events indicate when edges become active or inactive. The activation state of a given edge can then be recovered by counting how many events occurred on that edge during a time interval. Assuming all edges start inactive, an even number of events means that the edge has returned to the inactive state, whereas an odd number of events means that the edge is currently active. A detailed explanation of these data structures is given in Sections 2.2 and 2.3.
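The parity rule above can be illustrated with a small sketch (ours, not the authors' code), where an edge's event log is just a sorted list of the time instants at which its state toggles:

```python
from bisect import bisect_right

def is_active(event_times, t):
    """event_times: sorted times at which the edge toggled its state.
    The edge starts inactive, so an odd number of events up to and
    including t means it is currently active."""
    return bisect_right(event_times, t) % 2 == 1

# Edge toggles at times 2 (on), 5 (off), 7 (on again):
assert is_active([2, 5, 7], 3)        # one event so far -> active
assert not is_active([2, 5, 7], 6)    # two events -> inactive
assert is_active([2, 5, 7], 10)       # three events -> active
```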


A main drawback of log-based structures such as EdgeLog and CET is that they cannot represent time-overlapping contacts of an edge. For example, if a contact represents the data communication between two machines during a time interval, it is impossible to represent a second contact between the same machines during an overlapping time interval. This limitation arises because, in these structures, the event that represents the activation of the second contact would be interpreted as the deactivation event of the first contact.

The work in this paper presents and evaluates a data structure named Temporal Graph CSA (TGCSA). TGCSA is a compact and self-indexed structure based on a modification of the well-known Compressed Suffix Array (CSA) Sadakane (2003), extensively used for text indexing. We focus on algorithms to process temporal-adjacency queries, which recover the set of active neighbors of a vertex at a given time instant. These queries are basic building blocks for computing time-respecting paths Michail (2016), which can be useful in the context of moving-object data Mamoulis et al. (2004); Krogh et al. (2014), and also when analyzing activity patterns as temporally ordered sequences of actions occurring at specific time instants or time intervals Liu et al. (2016a, b).

We also present algorithms for answering queries that recover the snapshot of the graph at a time instant, as well as queries to recover the state of single edges. In addition, we include a complete experimental evaluation with real and synthetic data that compares TGCSA with EdgeLog and CET in terms of both space and time usage. The results of this evaluation show that TGCSA opens new opportunities for the application of suffix arrays Manber and Myers (1993); Sadakane (2003) in the context of graphs in general, and of temporal graphs in particular.

As discussed above, there are different fields where the application of our TGCSA, or of other existing compact alternatives such as EdgeLog or CET, can be of interest. Among others, we can mention Shmueli et al. (2014); Hulovatyy et al. (2015): (i) Social networks, where friendships establish connections between nodes that can vary along time. (ii) Biological networks, where functional brain connections are dynamic. (iii) Communication networks, where nodes are connected while they exchange information; this applies to both person-to-person and machine-to-machine communication. (iv) Transportation networks, where the connectivity between nodes can change due to scheduling and traffic conditions. In this context, one could also model movements on a network by considering that two nodes are connected if there exists an object that moves from one node to the other during a time interval.

The structure of this paper is as follows. Section 2 presents preliminary concepts about temporal graphs and relevant queries on them. To make the paper self-contained, Sections 2.2 and 2.3 provide a brief overview of both EdgeLog and CET, the state-of-the-art techniques we compare TGCSA with. Section 3 introduces TGCSA by showing how to modify a traditional CSA to create it. It also describes how TGCSA solves relevant queries for temporal graphs and provides pseudocode for such operations. Finally, this section presents a new representation of the Ψ array of the CSA Grossi and Vitter (2000); Fariña et al. (2012), which increases the query performance of TGCSA. Section 4 provides the experimental evaluation, which uses real and synthetic data. Final conclusions and future research directions are given in Section 5.

2 Preliminary concepts

In this section we introduce temporal graphs and a classification of the relevant basic queries that could be of interest for most applications. We also review previous compact representations of temporal graphs.

2.1 Temporal graph definition

Formally, a temporal graph is a set of contacts that connect pairs of vertexes during time intervals defined over the lifetime of the graph. A contact of an edge (u, v) is a 4-tuple (u, v, ts, te), where [ts, te) is the time interval when the edge is active Nicosia et al. (2013). We say that an edge is active at time t if there exists a contact (u, v, ts, te) such that t ∈ [ts, te). Note that this definition applies to directed graphs, as we consider ordered pairs of vertexes.

We classify operations on temporal graphs into two categories: queries checking the connectivity between vertexes, and queries retrieving the changes in connectivity that occurred along time. For the first category, we define four operations: (1) checking whether an edge is active; (2) returning the active direct neighbors of a vertex; (3) returning the active reverse neighbors of a vertex; and (4) returning all the active edges (the snapshot of the graph). For example, in the temporal graph of Figure 2.a, we can check which edges are active at a given time instant, obtain the direct and reverse neighbors of each vertex at that instant, and recover the corresponding snapshot.

For queries retrieving the changes in connectivity, we define two operations: (1) returning the set of edges that were activated at a given time instant, and (2) returning the set of edges that were deactivated at that instant. Figure 2.a allows us to see, for each time instant, which edges were activated and which were deactivated.

(a) Set of contacts
(b) EdgeLog representation
Figure 2: A temporal graph and its EdgeLog representation. The reverse aggregated graph is omitted in (b).

Note that all previous queries have a time-instant or a time-interval version. In what follows, we concentrate on time-instant queries, which can be easily extended to answer time-interval queries, and they also serve as the building blocks for more complex temporal measures that are based on recovering time-respecting paths Michail (2016).

2.2 EdgeLog: Baseline representation

A simple temporal graph representation Bui-Xuan et al. (2003) stores the aggregated graph (the static graph including all the edges that were active at any time during the lifetime of the temporal graph) as adjacency lists, one per vertex, with a sorted list of time intervals attached to each neighboring vertex indicating when that edge is/was active. Figure 2.b shows a conceptual example.

To check whether an edge (u, v) was active at time t, we first check if v appears in the adjacency list of vertex u. If v is found, we then check whether t falls into one of the time intervals associated with v in the time-interval list of that edge. The direct neighbors of a vertex u at time t are recovered similarly: for each neighbor v in the adjacency list of u, we check whether t is within the time intervals of the edge (u, v).

A simple representation of the aggregated graph and the temporal labels attached to its vertexes has two main drawbacks: (1) it uses much space; and (2) retrieving reverse neighbors requires traversing all the adjacency lists. The EdgeLog data structure Caro et al. (2015) addresses these weaknesses. On the one hand, since both the adjacency lists and the time-interval lists are sorted (i.e., they are increasing sequences x₁ < x₂ < ⋯ < xₖ), they can be represented as d-gaps (x₁, x₂ − x₁, …, xₖ − xₖ₋₁), and those differences can be compressed using a variable-length encoding (e.g., Zukowski et al. (2006), Zhang et al. (2008), Witten et al. (1999)). On the other hand, to avoid traversing all the adjacency lists in reverse-neighbor queries, EdgeLog stores a reverse aggregated graph containing, for each vertex, an adjacency list with all its reverse neighbors. Therefore, to get the reverse neighbors of a vertex v at time t, we first use the reverse adjacency list to obtain the candidate reverse neighbors of v. Then, for each candidate reverse neighbor u, we search for v in the adjacency list of u and, finally, check whether the edge (u, v) is active at time t (using the time-interval list of the edge).
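The d-gap transform EdgeLog applies to its sorted lists can be sketched as follows (an illustration; the actual structure couples it with the variable-length encoders cited above):

```python
def to_dgaps(xs):
    """Encode a strictly increasing list as its first value plus the
    gaps between consecutive values; small gaps are cheap to store
    under variable-length codes."""
    return [xs[0]] + [b - a for a, b in zip(xs, xs[1:])]

def from_dgaps(gaps):
    """Invert to_dgaps via prefix sums."""
    out, acc = [], 0
    for g in gaps:
        acc += g
        out.append(acc)
    return out

assert to_dgaps([3, 7, 8, 15]) == [3, 4, 1, 7]
assert from_dgaps([3, 4, 1, 7]) == [3, 7, 8, 15]
```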

2.2.1 Strengths and weaknesses of EdgeLog

EdgeLog is a simple structure that relies on well-known technology. It is expected to be extremely space-efficient when the temporal graph has a low number of edges per vertex and a large number of contacts per edge. Conversely, a low number of contacts per edge has a negative impact on the compression obtained by EdgeLog (as d-gaps become large). Note also that, even with the reverse aggregated graph for finding reverse neighbors, performance is expected to be poor if the number of edges per vertex is high, because all the candidate adjacency lists must be checked.

EdgeLog was designed to be efficient for connectivity queries (checking edges and retrieving direct and reverse neighbors), but it cannot efficiently answer queries such as: “Find all the edges that have active contacts at time t” or “Find all the edges that have been active only once”. This is because, in such operations, all the adjacency lists must be processed. Also, the applicability of EdgeLog is limited to temporal graphs whose contacts do not temporally overlap; that is, it assumes that a contact of an edge ends before another contact of the same edge starts.

2.3 Cet: Compact Events ordered by Time

In CET, a temporal graph is a sequence of symbol pairs that represent the changes in the connectivity between vertexes. Each pair represents either the activation or the deactivation of an edge along time. Note that a contact (u, v, ts, te) generates two changes: an activation of the edge (u, v) at time ts, and a deactivation at time instant te. The sequence of pairs is composed of the changes in the connectivity of edges (i.e., activations or deactivations produced by all the contacts in the temporal graph), grouped by time instant in increasing order. Figure 3.a shows how the sequence of changes of the temporal graph from Figure 2.a is built: each entry records the edge affected by an activation or deactivation event, and the entries are grouped by the time instant at which those events occur.

The activation state of an edge at time instant t is computed by counting how many times the pair encoding the edge appears in the subsequence of changes from the beginning of the lifetime up to and including t. As we assume that all edges are inactive at the beginning of the lifetime, the first occurrence of the pair means that the edge becomes active, the second occurrence means that it becomes inactive, and so on. In consequence, if the pair appears an odd number of times, the edge is active; otherwise, it is inactive. For example, in Figure 3.a, a pair that occurs three times within the queried interval corresponds to an edge that is active at the query time. The direct neighbors of a vertex u at time t are recovered with the same counting strategy, but counting the pairs whose first component is u. Similarly, the reverse neighbors of a vertex v are obtained by counting the pairs whose second component is v.
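The counting semantics can be sketched as follows (a plain linear scan for clarity; CET obtains the same counts in logarithmic time through the IWT described next):

```python
from collections import Counter

def direct_neighbors(changes, u, t):
    """changes: list of (time, (source, target)) activation/deactivation
    events in increasing time order. A pair seen an odd number of times
    up to time t denotes a currently active edge."""
    counts = Counter(edge for (ti, edge) in changes if ti <= t and edge[0] == u)
    return {v for ((_, v), c) in counts.items() if c % 2 == 1}

# Edge (a,b) toggles at times 1 and 3; edge (a,c) toggles at times 2 and 4:
ev = [(1, ("a", "b")), (2, ("a", "c")), (3, ("a", "b")), (4, ("a", "c"))]
assert direct_neighbors(ev, "a", 2) == {"b", "c"}
assert direct_neighbors(ev, "a", 3) == {"c"}
```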

The sequence of pairs is represented with an Interleaved Wavelet Tree (IWT) Caro et al. (2015), a variant of the Wavelet Tree Grossi et al. (2003, 2011) capable of counting the number of occurrences of multidimensional symbols in logarithmic time, while keeping a reduced space. The Wavelet Tree is a balanced binary tree whose leaves are labeled with the symbols of an alphabet, and whose internal nodes handle a range of that alphabet. Each node of the Wavelet Tree represents its subsequence as a bitmap of 0s and 1s, depending on the binary code used to represent each symbol of the alphabet. Figure 3.b shows the IWT representation of the sequence of changes of the temporal graph in Figure 2.a. (For more details on the Wavelet Tree and its applications, refer to Navarro (2014).)

In the IWT, the pairs of symbols in are represented by an interleaved code that is the result of interleaving the bits (Morton Code Samet (2006)) of the codes corresponding to the source and target vertexes of each pair. Figure 3.c shows the interleaved bits for the pairs (corresponding to the edges) of the temporal graph in Figure 2.a. Note that the symbols in pair ad are given the codes 00 and 11 respectively. Therefore, the interleaved code for pair ad is 0101, and those four bits are represented along the wavelet tree by starting in the root node with the first 0. Because that bit is a zero, we move to the left child in the next level where we use the second bit of such code. This second bit is 1 and appears at the first position in the bitmap. Subsequently, we move to the right child in the next level, and use the third bit of the code, which is the 0 at the first position of the bitmap. Finally, we move again to the left child of the node and reach the last level where we set the last bit of the code of ad, which is 1.
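The bit interleaving can be sketched as follows (a minimal illustration of the Morton code; the IWT then stores these interleaved codes):

```python
def morton(u, v, bits):
    """Interleave the bits of two codes: bit i of u, then bit i of v,
    from most significant to least significant (Morton code)."""
    out = 0
    for i in range(bits - 1, -1, -1):
        out = (out << 1) | ((u >> i) & 1)
        out = (out << 1) | ((v >> i) & 1)
    return out

# With codes a = 00 and d = 11, the pair (a, d) interleaves to 0101:
assert morton(0b00, 0b11, 2) == 0b0101
```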

The counting of the occurrences of a symbol in a prefix of the sequence is translated into rank operations over the bitmaps in the path from the root to the leaf of that symbol. (Given a bitmap B, rank_b(B, i) computes the number of occurrences of bit b in B[1, i].) The algorithm works as follows. At the root node, if the first bit of the symbol is 0 (1), we descend to the left (right) child of the node, and the position i is updated to rank_0(B, i) (rank_1(B, i)). This process is repeated recursively until we reach a leaf node, where the updated value of i is the number of occurrences of the symbol. In total, this counting strategy requires one rank operation per bitmap in the path of the symbol. Figure 3.b shows, with a darker background, the bitmaps used to count how many times a symbol appears up to the fifth position of the sequence.
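The level-by-level counting can be illustrated with a toy pointer-based wavelet tree (a sketch: real implementations use compact bitmaps with constant-time rank instead of the linear scan below):

```python
class WaveletTree:
    """Minimal pointer-based wavelet tree over integers in [lo, hi)."""
    def __init__(self, seq, lo=0, hi=None):
        if hi is None:
            hi = max(seq) + 1
        self.lo, self.hi = lo, hi
        self.mid = (lo + hi) // 2
        if hi - lo <= 1 or not seq:
            self.bits = None                      # leaf (or empty) node
            return
        self.bits = [0 if s < self.mid else 1 for s in seq]
        self.left = WaveletTree([s for s in seq if s < self.mid], lo, self.mid)
        self.right = WaveletTree([s for s in seq if s >= self.mid], self.mid, hi)

    def rank(self, symbol, i):
        """Occurrences of symbol in seq[0:i]: one bitmap rank per level."""
        if self.bits is None:
            return i
        ones = sum(self.bits[:i])                 # rank1(B, i); O(1) with real bitmaps
        if symbol < self.mid:
            return self.left.rank(symbol, i - ones)
        return self.right.rank(symbol, ones)

wt = WaveletTree([3, 1, 0, 3, 2, 1, 3])
assert wt.rank(3, 7) == 3    # symbol 3 occurs 3 times in the first 7 positions
assert wt.rank(1, 5) == 1
```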

Figure 3: The CET data structure representing the temporal graph in Figure 2.a. The top part shows the sequence of changes . The bottom-left part shows the Interleaved Wavelet Tree (IWT) representation of . The bottom-right part shows the interleaving bits used to represent pairs of symbols in the IWT.

2.3.1 Strengths and weaknesses of Cet

One advantage of CET is its ability to retrieve reverse neighbors with the same time performance as direct neighbors, due to the bi-dimensional representation used for storing the activation/deactivation events of edges. Indeed, we just need to update the retrieval range to obtain the frequency of neighboring changes of the edges whose target vertex is the queried one.

Another advantage is that the time performance of operations on vertexes and edges is independent of the number of contacts in the graph. This is because the IWT counts events in logarithmic time with respect to the number of edges, instead of sequentially scanning the history of events. Due to the temporal arrangement of activation/deactivation events, operations about events on edges are easily solved by extracting the subsequence related to the time instant of the query. For example, to obtain the edges that change their state at a given time instant, we just need to recover the pairs of vertexes in the section of the sequence related to the events occurred at that instant.

Despite the advantages of CET, its main weakness is the counting strategy used to recover the states of edges when contacts are active for short time intervals. For example, if we want to retrieve a snapshot at a time instant t in a graph where all the edges were activated and deactivated before t, we are forced to retrieve the frequency of all the edges (i.e., visiting each node of the IWT), although only a small fraction of them will actually be in the output. In addition, the frequency counting does not allow the representation of temporal graphs with overlapping contacts: the event representing the activation of an overlapping contact would be interpreted as the deactivation of the previous one.

2.4 Improved representations of EdgeLog and Cet

In the previous section, the descriptions of EdgeLog and CET are given for temporal graphs where edges can freely appear and disappear along time, with no restrictions on the number of contacts per edge. The representation of these data structures can be improved by taking into account properties of the graph being represented. In particular, properties such as the duration and the dynamism of contacts Holme and Saramäki (2012).

When all contacts last only one time instant, both EdgeLog and CET can be modified to store only the event that activates an edge because, by definition, every edge remains active for exactly one time instant. This small modification invalidates the strategy used to check whether an edge is active (i.e., the counting strategy in CET, and the interval check in EdgeLog), but it enables a new one. In EdgeLog, the list of time intervals per edge is replaced by a list of the time instants when the edge was active; checking the activation state of an edge at time t then amounts to verifying whether that list contains t. Similarly, in CET, checking the activation state of an edge is replaced by checking whether the edge appears in the subsequence related to the events occurred at time instant t.

The data structures were also specialized for temporal graphs where each edge has only one contact, and once activated, this contact remains active until the end of the lifetime. In the literature, these graphs are called incremental graphs Demetrescu et al. (2010). With this kind of temporal graphs, the modification is straightforward. As all contacts end at the same time instant (i.e., at the end of the lifetime), it is not necessary to explicitly store the events that deactivate the edges. Caro et al. Caro et al. (2015) also used this strategy to improve the space cost of both EdgeLog and CET data structures, without the need of updating the query algorithms. Nevertheless, its usefulness depends on how many contacts effectively end at the last time instant of the graph.

3 CSA for Temporal graphs (Tgcsa)

The Compressed Suffix Array for Temporal Graphs (TGCSA) is a new data structure adapted from Sadakane’s Compressed Suffix Array (CSA) Sadakane (2003) to represent temporal graphs. Unlike EdgeLog and CET, it can represent contacts of the same edge that temporally overlap, which makes TGCSA a more general representation for temporal graphs.

Below we provide a brief presentation of the CSA. Then, we include a detailed description of TGCSA, where we show how to create it, and we present a modification of its main structure, the Ψ array (Section 3.4), aimed at improving its efficiency. Finally, we show how TGCSA solves the most relevant temporal queries.

3.1 Sadakane’s Compressed Suffix Array (CSA)

Given a sequence T[1, n] of length n built over an alphabet Σ, the suffix array A[1, n] built on T is a permutation of [1, n] of all the suffixes T[i, n] such that T[A[i], n] ≺ T[A[i+1], n] for all 1 ≤ i < n, where ≺ denotes the lexicographic ordering Manber and Myers (1993). Figure 4.a shows the suffix array for the text "abracadabra$". (The $ at the end of T is a terminator that must be lexicographically smaller than all the other symbols in Σ.)

Because A contains all the suffixes of T in lexicographic order, this structure permits searching for any pattern P[1, m] in O(m log n) time with a simple binary search for the range A[sp, ep] that contains pointers to all the positions in T where P occurs. The m term in the cost appears because, at each step of the binary search, one may need to compare up to m symbols of P with those of the suffix under inspection. Unfortunately, the space requirements of A are high.
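The plain suffix-array search can be sketched as follows (a toy quadratic construction and a binary search that materializes prefixes for simplicity; positions are 0-based):

```python
from bisect import bisect_left, bisect_right

def suffix_array(T):
    """All suffix start positions of T, sorted lexicographically.
    O(n^2 log n) toy construction; practical builders are linear-time."""
    return sorted(range(len(T)), key=lambda i: T[i:])

def sa_range(T, A, P):
    """Binary search for the SA interval of suffixes prefixed by P.
    (Materializing the m-length prefixes is a shortcut; a real index
    compares P against each probed suffix on the fly.)"""
    pref = [T[i:i + len(P)] for i in A]
    return bisect_left(pref, P), bisect_right(pref, P)

T = "abracadabra$"
A = suffix_array(T)
sp, ep = sa_range(T, A, "abra")
assert sorted(A[sp:ep]) == [0, 7]    # "abra" occurs at positions 0 and 7
```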

To reduce these space needs, the CSA Sadakane (2003) uses another permutation, Ψ, defined in Grossi and Vitter (2000). For each position i in T pointed to by A[j], Ψ[j] gives the position j′ such that A[j′] points to i + 1. There is a special case when A[j] = n, in which case Ψ[j] gives the position j′ such that A[j′] = 1. In addition, two other structures are needed: a vocabulary array S with all the distinct symbols that appear in T, and a bitmap D aligned with A such that D[j] = 1 if the suffix pointed to by A[j] starts with a different symbol than the suffix pointed to by A[j − 1] (and D[1] = 1; otherwise D[j] = 0). Basically, a 1 in D marks the beginning of a range of suffixes pointed from A whose first symbol coincides. Therefore, if the k-th and (k+1)-th 1s of D occur at positions p and p′, respectively, all the suffixes pointed from the entries A[p], A[p+1], …, A[p′ − 1] start with the same symbol of the vocabulary, namely S[k]. The bitmap D is thus used to index the vocabulary array: the first symbol of the suffix pointed to by A[j] is S[rank₁(D, j)]. Recall that rank₁(D, j) returns the number of 1s in D[1, j] and can be computed in constant time using o(n) extra bits Jacobson (1989); Munro (1996), whereas select₁(D, k) returns the position of the k-th 1 in D. Figure 4.b shows the components of the CSA for the text "abracadabra$".
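The components above can be built naively as follows (a sketch of the uncompressed components; Sadakane's CSA stores Ψ, D, and S in compressed form):

```python
def csa_parts(T):
    """Toy construction of the CSA components of T (terminator included):
    suffix array A, permutation Psi, bitmap D, and vocabulary S."""
    n = len(T)
    A = sorted(range(n), key=lambda i: T[i:])
    inv = [0] * n
    for j, i in enumerate(A):
        inv[i] = j
    # Psi[j]: position in A of the suffix starting one symbol after A[j]
    psi = [inv[(A[j] + 1) % n] for j in range(n)]
    # D marks the first suffix of each group sharing the same first symbol
    D = [1 if j == 0 or T[A[j]] != T[A[j - 1]] else 0 for j in range(n)]
    S = sorted(set(T))
    return A, psi, D, S

A, psi, D, S = csa_parts("abracadabra$")
rank1 = lambda bits, j: sum(bits[:j])   # O(1) with o(n) extra bits in a real CSA
# First symbol of the j-th smallest suffix, recovered without A or T:
firsts = [S[rank1(D, j + 1) - 1] for j in range(len(A))]
assert "".join(firsts) == "".join(sorted("abracadabra$"))
```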

Figure 4: The Compressed Suffix Array for the text "abracadabra$". The left part shows the Suffix Array (A). The right part depicts the permutation Ψ, the bitmap D, and the vocabulary S. Arrows under the elements of Ψ denote its (highly compressible) increasing subsequences. In addition, the inverse of the Suffix Array, A⁻¹, is also shown.

By using Ψ, D, and S, it is possible to perform the binary search without accessing A or T. Note that the first symbol of the suffix pointed to by A[j] can be obtained as S[rank₁(D, j)]; its second symbol as S[rank₁(D, Ψ[j])]; its third symbol as S[rank₁(D, Ψ[Ψ[j]])]; and so on. Recall that Ψ[j] basically indicates the position in A that points to the symbol following the one pointed to by A[j]. Therefore, by using Ψ, D, and S, we can obtain all the symbols that we may need to compare with P at each step of the binary search.

In principle, Ψ would have the same space requirements as A. Fortunately, Ψ is highly compressible. It was shown to be formed by at most σ subsequences of increasing values Grossi and Vitter (2000), σ being the alphabet size, and, therefore, it can be compressed to around the zero-order entropy of T Sadakane (2003) by using γ-codes to represent the differential values. Note that, in Figure 4.b, the arrows under Ψ denote its increasing subsequences. Navarro and Mäkinen (2007) showed that Ψ can also be split into runs of consecutive values whose internal differences are always 1, which permitted them to combine the γ-encoding of gaps with run-length encoding of those 1-runs, yielding higher-order compression of Ψ. In addition, to maintain fast random access to Ψ, absolute samples at regular intervals are kept.
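The differential γ-encoding of an increasing run of Ψ can be sketched as follows (Elias γ only, without the run-length layer):

```python
def gamma_encode(x):
    """Elias gamma code of a positive integer, as a bit string:
    (len(bin(x)) - 1) zeros followed by the binary representation."""
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def encode_run(run):
    """Differential gamma encoding of one strictly increasing run
    of positive values (gaps are always >= 1)."""
    out, prev = [], 0
    for v in run:
        out.append(gamma_encode(v - prev))
        prev = v
    return "".join(out)

def decode_gaps(bits):
    """Decode a concatenation of gamma codes back into the gaps."""
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

run = [3, 7, 8, 15]
enc = encode_run(run)
acc, dec = 0, []
for g in decode_gaps(enc):
    acc += g
    dec.append(acc)
assert dec == run    # prefix sums of the decoded gaps recover the run
```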

In Fariña et al. (2012), the authors adapted the CSA to deal with large (integer-based) alphabets and created the integer-based CSA (iCSA). They also showed that, in this scenario, the best compression of Ψ was obtained by combining the differential encoding of runs with Huffman Huffman (1952) and run-length encoding.

As said before, Ψ, D, and S are enough to simulate the binary search for the interval A[sp, ep] where a pattern P occurs, without keeping A and T. Being occ = ep − sp + 1 the number of occurrences of P in T, this permits solving the well-known count operation. However, if one is interested in locating those occurrences in T, A is still needed. In addition, to be able to extract arbitrary substrings of T, we also need the inverse permutation A⁻¹, so that we know the position in A that points to a given position of T. In practice, only sampled values of A and A⁻¹ are stored. A non-sampled value A[j] can be retrieved by applying Ψ k times, until a sampled position j′ is reached; then A[j] = A[j′] − k. Similarly, non-sampled values of A⁻¹ can be obtained by applying Ψ k times from the previous sampled entry. From this point, the CSA is a self-index built on T that replaces it (as any substring of T can be extracted) and no longer needs T to perform searches.
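The sample-based recovery of A can be sketched as follows (a toy illustration; here Ψ wraps around the terminator, so the recovered position is taken modulo n):

```python
def lookup_A(j, samples, psi, n):
    """Recover A[j] from sparse samples of A by iterating Psi.
    Each Psi application advances one position in the text, so after
    k steps reaching a sampled entry j', A[j] = (A[j'] - k) mod n."""
    k = 0
    while j not in samples:
        j = psi[j]
        k += 1
    return (samples[j] - k) % n

T = "abracadabra$"
n = len(T)
A = sorted(range(n), key=lambda i: T[i:])
inv = {i: j for j, i in enumerate(A)}
psi = [inv[(A[j] + 1) % n] for j in range(n)]
# Sample the entries of A that point to every 4th text position:
samples = {j: A[j] for j in range(n) if A[j] % 4 == 0}
assert all(lookup_A(j, samples, psi, n) == A[j] for j in range(n))
```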

3.2 Modifying Csa to represent Temporal Graphs

Recall that a temporal graph is a set of contacts of the form (u, v, ts, te), where u and v are vertexes and the link or edge between them is active during the time interval [ts, te). Both ts and te belong to the set of time instants representing the lifetime of the graph. In Example 1, we include a set of five contacts that we will use in our discussion below.

Example 1

Let us consider the temporal graph in Figure 5, with numbered vertexes and time instants. This graph contains the five contacts depicted in the figure. ∎

Figure 5: The temporal graph from Example 1.

Aiming to use a CSA to obtain a self-indexed representation of a set of contacts (i.e., all their terms regarded as a unique sequence), we discuss in this section the two adaptations that we performed. The first one, using disjoint alphabets, consists in assigning ids from disjoint alphabets to vertexes and time instants. Then, when we perform a query for a given id (or a sequence of ids) within the CSA, that id will unambiguously correspond either to a source vertex, a target vertex, a starting time instant, or an ending time instant. The second modification consists in making Ψ cyclical on the elements of the 4-tuple representing a contact. This permits us to use the regular binary search procedure of the CSA to efficiently search for (and retrieve) those contacts matching some constraints on their terms.

3.2.1 Using disjoint alphabets

Given a set of contacts, such as the one in Example 1, our procedure to create TGCSA starts by building an ordered list of the contacts, sorted by their first term, then (in case of ties) by the second term, and so on. After that, these sorted contacts are regarded as a sequence with 4m elements (m being the number of contacts), and a suffix array is built over it. This is depicted in Figure 6.

Figure 6: Suffix Array for the contacts from Example 1 using a single alphabet.

If S were made up of text, S and A (or a CSA built on S) would be enough to perform searches for any word or text substring. In our scenario, however, when we search for a symbol x we have to be able to distinguish among its occurrences as a source vertex, as a target vertex, as a starting time instant, and as an ending time instant. This would require accessing all the entries A[i] within the range returned by the binary search and checking the positions in S they are pointing to. In practice, if A[i] mod 4 = 1, then A[i] points to a source vertex; otherwise, if A[i] mod 4 = 2, then it points to a target vertex, and so on. However, this procedure would ruin the search time, which would now grow by an additive O(occ) term, where occ is the number of occurrences of the query pattern in S.

A simple workaround to the problem above consists in using disjoint alphabets for the four terms in a contact. In our case, we use alphabets Σ1, Σ2, Σ3, and Σ4, satisfying that Σ1 < Σ2 < Σ3 < Σ4 (< indicates lexicographic order). Note that we can always replace vertexes and time instants in the original set of contacts by new ids satisfying this property. For example, in Figure 7, we have created a new sequence S where: (i) the ids of the source vertexes have been kept as they were initially; (ii) an offset has been added to the ids of the target vertexes; (iii) a larger offset has been added to the ids of the starting time instants; and (iv) a still larger offset has been added to the ids of the ending time instants. Now, when we build the suffix array for the new S, we can search for the id of a term shifted by the corresponding offset, depending on whether we want to find the occurrences of that term as a source vertex, target vertex, starting time, or ending time. For example, to search for the occurrences of a starting time t, we simply add the starting-time offset to t and use the suffix array (or the CSA) to look for the resulting id; to search for a target vertex v, we would instead add the target-vertex offset to v. In any case, we retain the original search time, as expected.

Figure 7: Suffix Array for the contacts from Example 1 using disjoint alphabets. The structures A, Ψ, and D of the corresponding CSA are also depicted.

An interesting by-product that arises from the use of disjoint alphabets is that, since values from Σ1 are smaller than those from Σ2 (and so on), the first quarter of entries in A (A[1..n]) will point to the first terms of all the contacts, the next n entries (A[n+1..2n]) will point to the second terms, and so on. Consequently, the first quarter of entries of Ψ (Ψ[1..n]) will point to positions in the range [n+1, 2n], because in the indexed sequence each symbol from Σ1 is followed by a symbol from Σ2, and so on. In turn, each entry in the last quarter of Ψ will point to a position in the range [1, n], corresponding to the first quarter of entries in A.

In our example, recall we have n = 5 contacts. We can see that the entries in the four quarters of A discussed above match this property. In addition, in Figure 7, we have also included the structure Ψ that arises when we build the corresponding CSA. In this case, we can also verify that the values in each quarter of Ψ fall within the expected ranges. This property will be of interest in the following section.

3.2.2 Modifying Ψ to make it cyclical on the terms of each contact

Recall that, in a regular CSA, once we know that the i-th entry of the underlying suffix array points to a position p = A[i] of the source sequence S, we can recover the symbols of the original sequence from there on: S[p], then the next symbol S[p+1] = S[A[Ψ[i]]], then S[p+2] = S[A[Ψ[Ψ[i]]]], and so on. Therefore, as shown in Section 3.1, by using Ψ, D, and the vocabulary, we can binary search for any pattern P, obtaining the range [l, r] so that A[l..r] points to the positions in S where P can be found. Then, from those positions on, we could recover the source data of the suffixes that start with P. Unfortunately, this mechanism allows us to recover the source data only forward-wise (not backwards), and this is not enough in our scenario, because we typically want to search for the contacts that match a constraint on one of their terms and then retrieve all four terms of each matching contact.

To clarify the issue above, consider, for example, a search for the contacts whose target vertex is v; the binary search gives us the occurrences of v in the second quarter of A. To retrieve the remaining terms of a matching contact found at entry i, we would apply Ψ once to reach its starting time and once more to reach its ending time. However, a third application of Ψ would not recover the first term of the current contact, but the first term of the next contact in S. As in a regular CSA, to retrieve the source vertex, we would have to access A[i] to know the position p in S where the target vertex occurs, so that the source vertex could be retrieved from S[p−1]. Now, because S is not actually kept in the CSA, to extract S[p−1] we have to know the entry j of A such that A[j] = p−1; that is, j = A⁻¹[p−1] (recall that A⁻¹ indicates which entry of A points to each position of S). Finally, by accessing the symbol associated with entry j, we would have fully recovered the contact we were searching for. To sum up, the previous procedure would make it necessary to use not only Ψ, D, and the vocabulary, but also A and A⁻¹, as explained in Section 3.1. Fortunately, we can modify Ψ in such a way that it allows us to move circularly from one term to the next term within a given contact.

Recall that, due to our disjoint alphabets, if A[i] points to the last term of a contact, then Ψ[i] stores the position in A pointing to the first term of the following contact, which is in the range [1, n]. For TGCSA, we modified these pointers in the last quarter of Ψ in such a way that, instead of pointing to the position corresponding to the first term of the following contact, they point to the first term of the same contact; that is, Ψ′[i] = Ψ[i] − 1, or Ψ′[i] = n if Ψ[i] = 1. The modified quarter of Ψ is depicted as Ψ′ in Figure 7. In this way, starting at any entry in Ψ and following the pointers Ψ[i], Ψ[Ψ[i]], and so on, all the elements of the current contact can be retrieved, but no entry from any other tuple will be reached. Due to this modification, in the example above, we can recover the source vertex with one further application of Ψ, and A and A⁻¹ are no longer needed.
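The modification can be sketched as follows (a toy with 0-based indexes, whereas the paper uses 1-based ones; the ids already come from disjoint alphabets and are illustrative, and S and A are kept here only to read back the terms, which the real TGCSA does not need):

```python
# Toy 0-based sketch of the cyclic-Psi modification. Sid holds two contacts,
# (1,5,7,13) and (2,6,8,14), with ids from disjoint alphabets.
Sid = [1, 5, 7, 13, 2, 6, 8, 14]
m = len(Sid)
n = m // 4                                    # number of contacts
A = sorted(range(m), key=lambda p: Sid[p:])   # naive suffix array
Ainv = [0] * m
for idx, p in enumerate(A):
    Ainv[p] = idx
Psi = [Ainv[(A[i] + 1) % m] for i in range(m)]
# Redirect the last quarter: the last term of each contact now points back
# to the first term of the *same* contact instead of the next one.
for i in range(3 * n, 4 * n):
    Psi[i] = (Psi[i] - 1) % n
# Starting at any term, four applications of Psi cycle through the contact:
i = Ainv[4]                                   # entry of the source vertex "2"
terms = []
for _ in range(4):
    terms.append(Sid[A[i]])
    i = Psi[i]
print(terms)
```

Running this prints the four terms of the second contact, and the fourth Ψ step lands back on the entry we started from.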

Note that it is no longer possible to traverse the whole CSA by just using Ψ, because consecutive applications of the function now cyclically visit the four elements of the corresponding contact. However, this small change in Ψ, which makes it cyclical on the terms of each contact, brings additional searching capabilities that we will exploit in Section 3.5.

3.3 Detailed construction of TGCSA

Once we have explained the need for disjoint alphabets and the reason why we use a modified Ψ, in this section we explain the actual procedure to build our TGCSA. In Figure 8, we depict all the structures involved in the creation of a TGCSA representing the temporal graph in Example 1.

As indicated above, the first step to build a TGCSA is to create a sequence with the ordered contacts. (Note that the particular ordering is not relevant because we have a set of contacts; we will assume that contacts are sorted by their first term, then by the second one, and so on.)

The second step involves defining a reversible mapping that enables us to use disjoint alphabets. Let us assume we have n different vertexes and τ time instants. It is possible to define a reversible mapping function that maps the terms of any original contact (u, v, ts, te) to (u, v + n, ts + 2n, te + 2n + τ). This mapping defines four ranges of entries in an alphabet Σ = [1, 2n + 2τ] covering both vertexes and time instants. Note that a vertex w is mapped either to the integer w or to w + n, depending on whether it is the source or the target vertex of an edge. Similarly, a time instant t is mapped either to t + 2n or to t + 2n + τ. This allows us to distinguish between source/target vertexes and starting/ending time instants by simply checking the range its value falls into.
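The offset mapping and its inverse can be sketched as follows (toy sizes; the function names shift/unshift are ours, introduced only for illustration):

```python
# Minimal sketch of the reversible offset mapping. Term types:
# 1 = source vertex, 2 = target vertex, 3 = starting time, 4 = ending time.
n, tau = 4, 5                          # number of vertexes and time instants
OFFSET = [0, 0, n, 2 * n, 2 * n + tau]

def shift(w, k):                       # map term w of type k into [1, 2n+2tau]
    return w + OFFSET[k]

def unshift(x):                        # recover (term, type) from the range
    if x <= n:
        return x, 1
    if x <= 2 * n:
        return x - n, 2
    if x <= 2 * n + tau:
        return x - 2 * n, 3
    return x - 2 * n - tau, 4

assert unshift(shift(3, 2)) == (3, 2)  # vertex 3 as a target vertex
assert unshift(shift(3, 3)) == (3, 3)  # time instant 3 as a starting time
print(shift(3, 2), shift(3, 3))
```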

Observe that, even though a vertex w always exists in the temporal graph, it might never occur as a source vertex, or never as a target vertex. Similarly, a time instant t might not occur as the starting or as the ending time of any contact, yet we could still be interested in retrieving all the edges that were active at time t.

To overcome the existence of holes in the alphabet Σ, a bitmap B is used. We set B[i] = 1 if the i-th symbol of Σ occurs in some contact, and B[i] = 0 otherwise. Therefore, each of the four terms within a contact will correspond to a 1 in B. Then, an alphabet of size σ = rank₁(B, |B|) (recall that rank₁(B, p) returns the number of 1s in B[1..p]) is created containing the positions in B where a 1 occurs. For each symbol x of Σ with B[x] = 1, a function mapID(x) = rank₁(B, x) assigns a contiguous integer id to x. The reverse mapping is provided via unmapID(i) = select₁(B, i) (recall that select₁(B, i) computes the position of the i-th 1 in B).
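Putting the offsets and the bitmap together, the whole alphabet mapping can be sketched as below (a toy with rank/select simulated over plain Python lists; the contacts and sizes are hypothetical, and getmap/getunmap follow the names used in the text):

```python
# Hedged sketch of the alphabet mapping: offsets make the four alphabets
# disjoint, and bitmap B removes the holes left by unused ids.
n_v, tau = 4, 5                    # number of vertexes / time instants
sigma = 2 * n_v + 2 * tau          # size of the full alphabet [1 .. 2n+2tau]
OFFSET = [0, 0, n_v, 2 * n_v, 2 * n_v + tau]   # per term type 1..4

contacts = [(1, 2, 1, 3), (2, 4, 2, 5)]        # hypothetical contacts
used = sorted({w + OFFSET[k] for c in contacts for k, w in enumerate(c, 1)})
B = [0] * (sigma + 1)              # B[x] = 1 iff symbol x occurs in a contact
for x in used:
    B[x] = 1

def mapID(x):                      # rank1(B, x): number of 1s in B[1..x]
    return sum(B[1:x + 1])
def unmapID(i):                    # select1(B, i): position of the i-th 1
    return used[i - 1]
def getmap(w, k):                  # term w of type k -> contiguous id
    return mapID(w + OFFSET[k])
def getunmap(i, k):                # contiguous id -> original term value
    return unmapID(i) - OFFSET[k]

Sid = [getmap(w, k) for c in contacts for k, w in enumerate(c, 1)]
print(Sid)
assert all(getunmap(getmap(w, k), k) == w
           for c in contacts for k, w in enumerate(c, 1))
```

Note how vertex 2 receives two different ids, one as a source and one as a target, so its two roles can be queried independently.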

Figure 8: Structures involved in the creation of a TGCSA for the temporal graph in Example 1.

At this point, a sequence Sid of ids can be created by setting Sid[i] = mapID(S[i]). Indeed, denoting by 1, 2, 3, and 4 the types of source vertexes, target vertexes, starting time instants, and ending time instants, respectively, we can map any symbol w of type k into Sid through getmap(w, k), which adds the offset corresponding to type k and then applies mapID. Similarly, the reverse mapping getunmap recovers the original symbol (and its type) from a given id.

Once we have made up our indexable sequence Sid, an iCSA is built over it. (We actually add four integers that make up a dummy contact at the beginning of Sid; this is required to avoid limit checks at query time.) Then, as discussed in Section 3.2.2, we modify the Ψ array of our TGCSA to allow moving circularly from one term to the next one within the same contact. To do this, we simply have to modify the last quarter of the regular Ψ array as explained above. This small change brings an interesting property that allows us to perform a query for any term of a contact in the same way: we use the iCSA to binary search for a term of the contact(s), obtaining a range [l, r], and then, by circularly applying Ψ up to three times, we can retrieve the other terms of the matching contact(s).

To sum up, TGCSA consists of the bitmap B and the structures Ψ and D of the iCSA. In practice, B is compressed using the strategy of Raman et al. (2007), which supports both rank and select in constant time over a compressed bitmap, whereas for D we use a faster (but larger) bitmap representation from Fariña et al. (2012). For the representation of Ψ, we also use the best option from Fariña et al. (2012), which samples Ψ at regular intervals and then differentially encodes the remaining values. Yet, we also created an alternative representation for Ψ that is discussed in Section 3.4.

3.4 A more suitable representation of Ψ for temporal graphs

The regular representation of Ψ is based on sampling the Ψ array at regular intervals (one sample every tΨ entries) and then differentially encoding the remaining values between two samples. Fariña et al. (2012) studied different encodings for the non-sampled values and showed that the best space/time trade-off in a text-indexing scenario was obtained by coupling run-length encoding of 1-runs (sequences of consecutive entries whose differences are equal to 1) with bit-oriented Huffman codes. In practice, Huffman codes are used to indicate the presence of 1-runs of each length. Huffman codes are also reserved to represent short gaps below a given threshold. Finally, additional Huffman codes are used as escape codes to mark the number of bits b needed to represent either a large positive gap or a negative gap; in both cases, such an escape code is followed by the gap value represented with b bits.

Figure 9: Example of the new representation of Ψ.

In this paper, we present a new strategy to represent Ψ that tries to speed up access performance at the cost of using a little more space. An example of the resulting structure is shown in Figure 9. We also use sampling and differentially encode the non-sampled values. Yet, we made some changes with respect to the traditional representation, which are summarized as follows:

  • We use vbyte (byte-aligned) codes Williams and Zobel (1999) rather than bit-oriented Huffman codes to differentially encode the non-sampled values. This should result in around one order of magnitude improvement in decoding speed when sequential values of Ψ are to be retrieved. Note that, in the bottom part of Figure 9, we include a sequence of byte-oriented codewords (either 1- or 2-byte codewords in our example) that represent the gaps of the original structure. The sequence can also contain a pair of codewords that encodes a 1-run together with its length. Of course, using byte-aligned rather than bit-oriented codes implies a loss in compression effectiveness.

  • We do not sample at regular intervals. Instead, we keep samples aligned with the 1s in the bitmap D; that is, there is a sample at the beginning of the interval in Ψ corresponding to each symbol of the vocabulary. This modification brings three main advantages:

    1. We ensure that the first position l of the interval [l, r] of each symbol is always sampled, whereas with the traditional representation the previous sampled position could lie anywhere within the preceding tΨ entries, so l was sampled only with probability 1/tΨ. Note that, in TGCSA, a typical access pattern to Ψ during searches (see Section 3.5) consists in traversing all the values Ψ[l..r] once we know the interval [l, r] corresponding to a given symbol. With the traditional representation, this requires decoding the gaps from the previous sample up to position l to obtain synchronization at value Ψ[l], and then sequentially decoding gaps from there on. Since Ψ[l] is always sampled in the new representation, we avoid that synchronization cost.

    2. While in the traditional representation of Ψ the differential sequence could contain negative values (when Ψ[i] belongs to one symbol and Ψ[i+1] to the next one) Grossi and Vitter (2000), the new representation does not have to deal with negative values, because the first position of each symbol is always a sampled position.

    3. We do not break 1-runs. Recall that 1-runs occur mainly within the range Ψ[l..r] corresponding to a given symbol. Because our first-level sampling stores only one sample, at position l, 1-runs are no longer split. This is interesting for both space and access time, because a unique pair of codewords can represent an arbitrarily long 1-run; that is, we no longer break a 1-run every tΨ values, as the traditional representation does.

    In Figure 9, we can see that the samples aligned with the 1s in D consist of a triple of values: the absolute value of Ψ at the sampled position, a pointer to the encoded sequence, and the index of the sampled position. In practice, these triples are kept in three parallel arrays, with one entry per sampled position.

    Note that the absolute values of the sampled positions are kept explicitly in the sample arrays and are not represented within the encoded sequence (exactly as in the traditional representation). Hence, no codeword is associated with a sampled value; the first codeword after a sample encodes the gap between the sampled value and the next entry of Ψ. The pointer stored with each sample indicates the position in the encoded sequence from which we have to decode to recover the values that follow the sample: we fetch the absolute sampled value and then sequentially decode gaps, adding them up. As an important remark, observe that, given a symbol, we will use its sample to obtain the starting position l of its range [l, r]. We could also skip storing the array with the sampled indexes, as the index of the i-th sample can be computed with select₁(D, i). This introduces a space/time trade-off that we discuss in the next section.

    Despite the advantages of the sampling structure described above, our representation also has a main drawback: we cannot parameterize the number of samples we want to use. Thus, we may be using a too dense sampling for infrequent symbols (consequently, we expect compression to suffer in datasets with very large vocabularies), or a very sparse sampling for frequent symbols, as they will have only one sample at the beginning of the corresponding interval [l, r]. This fact could slow down the access to an individual position Ψ[i], with l < i ≤ r. To overcome this, we add a second-level sampling that samples one out of every tΨ positions (tΨ is again the sampling interval). We use a second bitmap (see Figure 9) to mark the positions of these samples, and, aligned with its 1s, parallel arrays keep the corresponding sampling data. This second-level sampling works exactly like the first-level one, with the exception that the sampled values are also retained in the encoded sequence. This redundant data allows us to sequentially decode all the values belonging to a given symbol without accessing the second-level sampling data, which is of interest when we want to retrieve a range of consecutive values of Ψ instead of a single one.
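The vbyte coding and the run collapsing described above can be sketched as follows for a single symbol's interval (the function names and the RUN flag are our own illustrative choices; the real implementation differs in details such as the two-level sampling):

```python
# Illustrative sketch: vbyte-encoding of one Psi interval. The sample keeps
# the absolute first value; the rest are stored as vbyte gaps, with every
# 1-run of length >= 2 collapsed into the pair (RUN, length).
RUN = 0  # reserved gap value marking a 1-run (real gaps are always >= 1)

def vbyte_encode(x, out):
    while x > 127:
        out.append(x & 127)        # 7 payload bits, high bit 0 = "more bytes"
        x >>= 7
    out.append(x | 128)            # high bit 1 terminates the codeword

def vbyte_decode(buf, pos):
    x, shift = 0, 0
    while True:
        b = buf[pos]; pos += 1
        x |= (b & 127) << shift
        shift += 7
        if b & 128:
            return x, pos

def encode_interval(psi_vals):
    sample, buf = psi_vals[0], bytearray()
    gaps = [b - a for a, b in zip(psi_vals, psi_vals[1:])]
    i = 0
    while i < len(gaps):
        j = i
        while j < len(gaps) and gaps[j] == 1:
            j += 1
        if j - i >= 2:                     # collapse a 1-run of length >= 2
            vbyte_encode(RUN, buf); vbyte_encode(j - i, buf)
            i = j
        else:
            vbyte_encode(gaps[i], buf); i += 1
    return sample, bytes(buf)

def decode_interval(sample, buf, count):
    vals, pos, cur = [sample], 0, sample
    while len(vals) < count:
        g, pos = vbyte_decode(buf, pos)
        if g == RUN:
            run, pos = vbyte_decode(buf, pos)
            for _ in range(run):
                cur += 1
                vals.append(cur)
        else:
            cur += g
            vals.append(cur)
    return vals

interval = [10, 11, 12, 13, 20, 220]   # increasing Psi values of one symbol
s, code = encode_interval(interval)
assert decode_interval(s, code, len(interval)) == interval
print(len(code), "bytes")
```

Here the 1-run 10..13 costs a single pair of codewords regardless of its length, and only the gap of 200 needs a 2-byte codeword, illustrating the space/speed compromise of byte-aligned codes.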

3.4.1 Comparing the space/time trade-off of the new representation of Ψ with the traditional one

We ran experiments to compare the space/time trade-off obtained by our new representation of Ψ against the traditional one and against the variant of our representation in which the arrays with the sampled indexes are not stored. We tuned these representations using four different sampling values for tΨ (from sparser to denser sampling). In addition, we include in the comparison a non-compressed baseline representation of Ψ that stores each entry with ⌈log₂ |Ψ|⌉ bits and provides direct access to any position.

In Figures 10 and 11, we compare the space (shown as the number of bits needed to represent each entry of Ψ) and the time (per entry reported) required to access the values of Ψ in three different scenarios. In the plots labeled [B1] and [B2], we assume that the ranges [l, r] for all the symbols are known and we perform a buffered access to retrieve the values Ψ[l..r] of those symbols; in scenario [B2], we only process the symbols occurring a minimum number of times. In these buffered scenarios, synchronization is done once to obtain Ψ[l] (except for the non-compressed baseline, which has direct access and does not require synchronization at all), and from there on we sequentially decode the subsequent values. In the last scenario (plot labeled [S1]), we show the cost of accessing Ψ at individual positions (hence synchronization, for the compressed variants, is required for each access); we access all the positions of Ψ sequentially, one by one.

We have run tests for all the datasets in Table 2 (described in Section 4) and show results here for four datasets: I.Comm.Net, Powerlaw, Flickr-Data, and Wikipedia-Links. We do not show plots for the ba* datasets because their shapes are fairly identical to those of I.Comm.Net (yet with a slightly different x-axis).

Figure 10: Space/time trade-off for buffered access to .

We can see that the synchronization cost required by the traditional representation, together with the slower decoding of bit-oriented Huffman codes in comparison with vbyte, makes it more than 5 times slower than our new representation when decoding all the entries of Ψ corresponding to a given symbol. In Section 3.5, we will see that this particular operation appears in most TGCSA query algorithms (a for loop after a binary search that returns the range of Ψ values for a given symbol). The shortcoming of this speed-up at recovering values of Ψ is that its overall size increases. As expected, in the Flickr-Data dataset, due to its large vocabulary size in comparison with the number of contacts, the new representation is unsuccessful: even a plain representation of Ψ would be smaller. We also include results for the variant that does not explicitly store the arrays with the sampled indexes; it requires select operations to know the position in Ψ corresponding to the i-th sample. In general, when the number of synchronization operations is small, this variant offers an interesting space/time trade-off. In particular, it typically yields the same performance as the baseline representation while requiring noticeably less space.

Figure 11: Space/time trade-off for sequential access to .

Unfortunately, not all the accesses to Ψ performed at query time follow a sequential pattern in TGCSA. In that case, the previous buffered retrieval of values is not applicable, and we need to perform many random accesses to positions within Ψ. Accessing a random position i implies that each access must initially check whether i is a sampled position, which is accomplished by checking whether the corresponding bit is set in the first-level or in the second-level sampling bitmap (recall that access(B, i) returns the value of the bit at position i in the bitmap B); in that case, the sampled value is returned directly from the first- or second-level sample arrays, respectively. Yet, in