On computing distances and latencies in Link Streams

07/03/2019, by Frédéric Simard, et al. (uOttawa)

Link Streams were proposed a few years ago as a model of temporal networks. We seek to understand the topological and temporal nature of these objects through the efficient computation of distances, latencies and lengths of shortest fastest paths, and we develop several algorithms to compute these values. Proofs of correctness are presented, along with bounds on the temporal complexities of the methods as functions of link stream parameters. One purpose of this study is to support the development of algorithms for centrality functions on link streams, such as the betweenness centrality and the closeness centrality.


1 Introduction

Network science has been greatly influenced in recent years by the notion of temporal networks. Researchers in various fields have observed that real data vary over time and that static networks are insufficient to capture the full extent of some phenomena. Different models of temporal networks have been suggested, among which the link streams of Latapy et al. [7], which capture the network evolution in continuous time. As with other forms of networks, the notions of paths and distances are fundamental to the study of link streams. Kempe et al. [5] mention the use of time-respecting paths to study temporal networks, with applications to epidemiology, in which one seeks information about the spread of a virus in a population. Human interactions can also be analyzed with temporal networks, as observed by Tang et al. [13], and the link stream framework can help advance those studies. Although online social networks can be thought of as varying in discrete time, with tweets and retweets for example, in real social networks interactions have durations, which must be taken into account to describe the data accurately. As for how link streams can be used in practice, many studies have emerged from the SocioPatterns Collaboration, which provides datasets on face-to-face contacts with temporal labels [4, 1]. Those datasets are valuable tools to investigate more accurately aspects of social networks such as homophily [12] and epidemics [8].

Latapy et al. develop the notion of shortest fastest paths in their link stream model as a new concept of path that gathers both the temporal and the structural information of a link stream. A shortest fastest path is one that is shortest among the fastest paths between two endpoints. This type of path is used to define a betweenness centrality, and it appears other centrality functions could be defined in the same way. A social network can thus be analyzed from different perspectives: using the distance to measure how the connectivity of a group varies over time, the latency to measure how quickly information can spread through a group of people, and the length of a shortest fastest path to measure how efficiently this information is relayed. Note also how the times at which a shortest path starts and ends influence the information it can spread.

We propose here to compute the metrics of shortest (fastest) paths in a link stream with different algorithms. General definitions are presented in section 2, followed by a state of the art in section 3. Then, we present our two main methods in section 4, experiments in section 5 and we conclude in section 6.

2 Background

Most definitions are taken from Latapy et al. [7]. A link stream is a tuple $L = (T, V, E)$ where $T$ is a set of time instants, $V$ is a finite set of nodes (vertices) and $E \subseteq T \times V \otimes V$ is a set of links (edges). Here, $V \otimes V$ denotes the set of unordered pairs of vertices and we write $W = T \times V$. We say an element $(t, v) \in W$ is a temporal vertex.

An edge of $E$ is a tuple $(t, uv)$ with $t \in T$ and $uv \in V \otimes V$. Given an interval $[b, e] \subseteq T$, we write $([b, e], uv)$, instead of $\{(t, uv) : t \in [b, e]\}$, to mean that all edges $(t, uv)$ such that $t \in [b, e]$ are in $E$. We say an edge $([b, e], uv)$ is maximal if there exists no other edge $([b', e'], uv)$ such that $[b, e] \subsetneq [b', e']$. We say a maximal edge $([b, e], uv)$ starts on $b$, ends on $e$ and has duration $e - b$. We let $\mathcal{T}$ be the set of event times of $L$, that is, the set of starting and ending times of its maximal edges. Elements of $\mathcal{T} \times V$ are called event nodes. We write $\mathcal{W} = \mathcal{T} \times V$.
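To fix ideas, here is a minimal C++ sketch (our own, with illustrative names such as Edge and LinkStream that are not those of the released implementation [10]) of a link stream stored as its maximal edges, together with the computation of its event times:

```cpp
#include <set>
#include <vector>

// Illustrative sketch only: a link stream stored as its list of maximal edges.
struct Edge {
    int u, v;     // unordered pair of nodes (by convention u < v)
    double b, e;  // the pair is linked over the whole interval [b, e]
};

struct LinkStream {
    int n = 0;                // number of nodes |V|
    std::vector<Edge> edges;  // maximal edges

    // Event times: the starting and ending times of all maximal edges.
    std::set<double> eventTimes() const {
        std::set<double> T;
        for (const Edge& ed : edges) {
            T.insert(ed.b);
            T.insert(ed.e);
        }
        return T;
    }
};
```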

A maximal edge, as well as and are illustrated on the link stream of Figure 1. On this link stream, is a maximal edge, whereas is not. Thus, .

Figure 1: A simple link stream with a maximal edge.

The graph induced by a time $t \in T$ is defined as $G_t = (V, E_t)$ with $E_t = \{uv : (t, uv) \in E\}$. In a link stream $L$, a path from a temporal node $(\alpha, u)$ to a temporal node $(\omega, v)$ is a sequence $(t_0, u_0 v_0), (t_1, u_1 v_1), \ldots, (t_k, u_k v_k)$ of elements of $T \times V \otimes V$ such that $u_0 = u$, $v_k = v$, $t_0 \ge \alpha$, $t_k \le \omega$, and for all $i$, $t_i \le t_{i+1}$, $v_i = u_{i+1}$ and $(t_i, u_i v_i) \in E$. We say that such a path starts at $t_0$, arrives at $t_k$, has length $k + 1$ and duration $t_k - t_0$. We write $(\alpha, u) \rightsquigarrow (\omega, v)$ to mean that there exists a path from $(\alpha, u)$ to $(\omega, v)$, and say $(\omega, v)$ is reachable from $(\alpha, u)$. We also call $t_0$ a starting time and $t_k$ an arrival time from $(\alpha, u)$ to $(\omega, v)$. Each path between two fixed temporal nodes thus defines a pair of starting time and associated arrival time. On the link stream of Figure 1, two paths are illustrated: the green one and the red one. Both have the same starting and arrival times from their common source to their common destination, and both paths are fastest. We can also say $t_0$ is a starting time from a temporal node $(\alpha, u)$ to a node $v$, in which case there exists some time $\omega$ such that $t_0$ is the starting time of a path from $(\alpha, u)$ to $(\omega, v)$. The same goes for arrival times.

We say a path $P$ from $(\alpha, u)$ to $(\omega, v)$ is shortest if it has minimal length, and we call its length the distance from $(\alpha, u)$ to $(\omega, v)$, written $d((\alpha, u), (\omega, v))$. Similarly, $P$ is fastest if it has minimal duration, in which case this duration is called the latency from $(\alpha, u)$ to $(\omega, v)$ and is written $\ell((\alpha, u), (\omega, v))$. Note that if $(\alpha, u) \rightsquigarrow (\omega, v)$, there exists at least one pair of starting time and arrival time whose difference equals this latency. Finally, $P$ is called shortest fastest if it has minimal length among the set of fastest paths from $(\alpha, u)$ to $(\omega, v)$. We call its length the sf-metric from $(\alpha, u)$ to $(\omega, v)$ and write it $\delta((\alpha, u), (\omega, v))$. In general, this is not a distance as it does not respect the triangle inequality and is only a premetric; a simple counterexample is shown on Figure 2. On the same figure are drawn a shortest path, two fastest paths and a unique shortest fastest path.
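In symbols, writing $d$, $\ell$ and $\delta$ for the distance, latency and sf-metric as above, and $\mathcal{P}$ for the set of paths from $(\alpha, u)$ to $(\omega, v)$ (a notation used only for this display), the three quantities read:

```latex
% |P| denotes the length of a path P and dur(P) its duration.
\begin{align*}
  d\big((\alpha,u),(\omega,v)\big)      &= \min_{P \in \mathcal{P}} |P|, \\
  \ell\big((\alpha,u),(\omega,v)\big)   &= \min_{P \in \mathcal{P}} \operatorname{dur}(P), \\
  \delta\big((\alpha,u),(\omega,v)\big) &= \min\bigl\{\, |P| : P \in \mathcal{P},\ \operatorname{dur}(P) = \ell\big((\alpha,u),(\omega,v)\big) \,\bigr\}.
\end{align*}
```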

Figure 2: The shortest path between the two encircled temporal nodes is drawn in green. The two fastest paths are drawn in red and in blue. The sole shortest fastest path is the red one. Observe that the triangle inequality fails for the sf-metric on this example.

3 Related work

This work is close to the study of Wu et al. [15]. As such, the applications of computing fastest and shortest paths mentioned by these authors also apply here. The main contribution of the present work is to compute sf-metrics, as well as distances and latencies, in a single pass over a dataset. Taken separately, Wu et al.'s fastest and shortest path methods are insufficient to compute centralities such as the betweenness of Latapy et al., while an algorithm combining them to produce sf-metrics is not efficient because it requires iterating multiple times over the dataset. Our methods, in contrast, iterate only once to produce the three metrics and are suitable for studying different aspects of a link stream. We also output the starting and arrival times of shortest (fastest) paths, which give valuable information on connectivity. This study was initiated as a first step towards computing Latapy et al.'s betweenness centrality.

Furthermore, this work is also close to Tang et al. [14], since these authors define a betweenness centrality on temporal networks in terms of fastest shortest paths. Whether to use fastest shortest or shortest fastest paths (or any other path that combines temporal and structural information) depends on what information one wants to emphasize, which in turn depends on the context of the study. Shortest and fastest paths were also studied by Xuan et al. [16], and we were inspired by their all-pairs fastest path method to develop Algorithm 2. The latter is relevant for computing some centralities because metrics between all pairs of (temporal) nodes may be required. To our knowledge, Xuan et al.'s method is the only one of its kind to return latencies between all pairs of nodes. More recently, Casteigts et al. [2] adopted the same strategies as Xuan et al. for computing shortest and fastest paths in a distributed way.

Casteigts et al. [3] also offer a survey of temporal networks that includes many applications of shortest and fastest paths. In particular, such paths can be used to study the reachability of a temporal node from another. It appears from that survey that either the distance or the latency is often used as a temporal metric to evaluate how well a temporal node can communicate with another. In this regard, the sf-metric can be used as another temporal function since it combines the temporal as well as the structural information into a single map. Note that the notion of foremost paths (or journeys) is also used by some authors [2] to study temporal reachability. A foremost path only has minimal arrival time, while its starting time is unconstrained. This type of path is also useful in many studies and we expect our algorithms can be extended to those cases to output lengths of shortest foremost paths.

Finally, observe that the link stream framework is also close to the Time-Varying Graphs framework [3]. Thus, all results presented in this paper carry over to that framework as well.

4 Multiple-targets shortest fastest paths algorithms

The full implementations of the algorithms presented here, in C++, can be found online [10].

We present here two main methods, Algorithms 1 and 2 that compute the distances, latencies and sf-metrics from one source event node to all other event nodes. Algorithm 2 builds on the first method to compute those values for all pairs of event nodes. Subsection 4.4 also presents Algorithm 3 that was derived from Algorithm 1. This last method was first devised to fairly compare Algorithm 1 against the literature, but is also interesting as a standalone algorithm. We focus on the first two algorithms.

We present some small results that lead the way to those algorithms. The strategy for both methods is essentially the same: we compute the distances from any temporal node to such that is the largest (or maximal) starting time from any to . If it happens that , then this distance is the sf-metric from the former to the latter temporal node. Otherwise, since we iterate chronologically over , this latency must have been computed at a time earlier than and is saved in memory.

4.1 Two simple lemmas

The algorithms we present compute what we call reachability triples that contain information about the lengths of shortest paths from one temporal node to another as well as the starting and arrival times of those paths.

Definition 4.1 (Reachability triples).

Let $(t, u)$ be an event node. If there exists a shortest path of length $d$ from $(t, u)$ to the event node $(a, v)$ that starts on a largest starting time $s$, then we say $(s, a, d)$ is a reachability triple from $(t, u)$ to $v$.

In the following, we write $R_v$ for the dictionary of reachability triples from a fixed source event node to any node $v$. In order to reduce the cost of operations in $R_v$, we assume this dictionary is implemented in such a way that $R_v$ holds keys $s$ and $R_v[s]$ holds pairs $(a, d)$ that form reachability triples $(s, a, d)$. We write $R$ for the collection of these dictionaries, organized so that accessing each $R_v$ takes constant time.
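As a rough illustration of this layout (the type names are ours, not those of the code released online [10]), the dictionaries could be declared as follows in C++:

```cpp
#include <map>
#include <set>
#include <utility>
#include <vector>

// One dictionary R_v per node v, keyed by (largest) starting time s.
// R_v[s] holds pairs (arrival time a, distance d); together with the key s
// they form the reachability triples (s, a, d) from the fixed source to v.
using ArrivalDistance  = std::pair<double, int>;
using ReachabilityDict = std::map<double, std::set<ArrivalDistance>>;

// Indexing the dictionaries by node id gives constant-time access to each R_v,
// e.g. AllDicts R(numberOfNodes);
using AllDicts = std::vector<ReachabilityDict>;
```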

Algorithms 1 and 2 compute distances from largest starting times only. Those distances are contained in the dictionaries $R_v$, for each $v \in V$, as part of reachability triples. Note that if a link stream reduces to a static network, that is, if the set of time instants $T$ is a singleton, then each $R_v$ will contain the usual distance from a fixed source to $v$. The temporal nature of a link stream forces us to take starting and arrival times into account when looking for shortest paths. Moreover, reachability triples could also be defined without the constraint that starting times are largest; however, the algorithms would not be as efficient because the dictionaries would grow larger.

Lemma 4.2 below, due to Wu et al. [15], states that shortest paths are prefix-shortest. We say a path $P'$ from a temporal node to another temporal node is a prefix of another path $P$ from the same source if $P'$ is an initial subsequence of $P$.

Lemma 4.2.

Let $P$ be a shortest path from a temporal node to another. Then, every prefix of $P$ is a shortest path between its own endpoints.

Proof.

Suppose otherwise and assume there exists a temporal node $(\omega', v')$ such that the prefix $P'$ of $P$ ending at $(\omega', v')$ is not shortest from the source to $(\omega', v')$. Then, there exists a shorter path $Q$ from the source to $(\omega', v')$, with $|Q| < |P'|$. Since $|Q| < |P'|$, we can use $Q$ to form a path from the source to the destination of $P$ that is shorter than $P$, contradicting the minimality of $P$. ∎

Let and be two temporal nodes. Then we define the outer distance from to , , as either , when , or , when . Lemma 4.3 below suggests that it suffices to compute distances in the induced graphs $G_t$, for any time $t$, to deduce the distances between two temporal nodes.

Lemma 4.3.

Let be a source temporal node and be a temporal node reachable from the source by a non-empty shortest path. Then, there exists and a connected component of such that

(1)
Proof.

Let be a non-empty shortest path from to . Then, and . There exist non-empty subpaths in of the form . Let be such a subpath with the largest number of elements. By Lemma 4.2, the prefix of from to is shortest and its length is . Moreover, the subpath of from to must also be shortest with length . Finally, since is shortest and the two subpaths formed by are shortest, must also be a shortest path. Then, has length . The result follows by letting and be a connected component of . ∎

4.2 A single-source method

In this section, we present Algorithm 1, which computes the distances (from largest starting times), latencies and sf-metrics from a source event node to all other reachable event nodes. This algorithm mixes iterations on the induced graphs $G_t$, for each time $t \in \mathcal{T}$, with an all-pairs distances method on their connected components. Recall that if $s$ is the largest starting time of a path from the source to some temporal node, then the duration of such a path either equals the latency from the source to that node or it does not. If it does, then the corresponding distance is the sf-metric from the source to that node. This length is computed with Lemma 4.3 by using the outer distances saved in memory as well as the all-pairs distance method on $G_t$. Thus, when we iterate over all pairs of starting time and outer distance from the source to a node, we can deduce the duration and length of the shortest fastest paths from the source to it. This method uses a set that is assumed sorted in lexicographic order. Sorting helps lower the temporal complexity, but is not fundamental to understanding the algorithm.

Remark 4.4.

In Algorithms 1 and 2, we assumed the dictionaries were implemented in the form of self-balanced binary trees in order to obtain logarithmic worst-case complexities. In our implementations, we used hash tables to lower the average-case complexity.
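In C++ terms, this is the usual trade-off between std::map (a self-balanced tree with logarithmic worst case) and std::unordered_map (a hash table with constant average case), sketched below for illustration:

```cpp
#include <map>
#include <unordered_map>

// Worst-case bound used in the proofs: a self-balanced tree,
// O(log n) per lookup or insertion.
std::map<double, int> treeDict;

// What our implementations favour in practice: a hash table,
// O(1) per operation on average, at the price of a linear worst case.
std::unordered_map<double, int> hashDict;
```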

Before proving that Algorithm 1 is correct, let us go through a small example in order to build intuition. Algorithms 2 and 3 are highly similar.

Example 4.5.

Consider again the link stream of Figure 2. Suppose the source is again , and . Thus, Algorithm 1 will look for shortest (fastest) paths that can reach temporal nodes and . The unique largest starting time from the source to at time is . This time is given by the greatest key in for any . Then, we iterate over the outer distances from to for each . Note how the time of the source has changed from to . By definition, and since the link stream is discrete, outer distances are given as the distances from to for each . Thus, we find outer distances from to and from to . Node is discovered at time and its outer distance does not exist before that. Finally, combining the outer distances with the distances inside the graph induced by at time , we find the distance from to is , from to and also from to . This last distance is given by the combination between the outer distance from to and the distance in from to . Since node is discovered first at time , that is its first arrival time from is , then the latency from to is and the distance from to is the sf-metric from the former to the latter.

Proposition 4.6.

Algorithm 1 correctly computes the latencies and sf-metrics from a source event node to all other reachable event nodes as well as the set of dictionaries . It requires at most operations in the worst case.

Proof of correctness.

Let be some reachable destination. Let’s show by induction on that , and is correct up to time .

  • When , we iterate only on time and the result is clear.

  • Suppose the result holds for all . Let be the times previously iterated over on line 1 and the current time. By the induction hypothesis, by time , all values of , for all , are correctly updated. Let be the connected component of containing . If , then the result follows as in the case with . Then, suppose . Since each is correctly updated up to time for each reachable , contains triples for each that have been visited prior to from the source from a starting time . The set contains the largest starting time from the source to . Then, either or this latency is given by some such that . Let’s iterate on .

    By Lemma 4.3, there exists a time and a connected component of such that . The sequence of distances

    is non-increasing for each because each element is minimal. Thus, since , in particular this lemma holds with and . Then,

    By the induction hypothesis, the outer distance can be recovered from for each . Then, using and the dictionary returned by the all-pairs distances algorithm on line 1, the expression above reduces to . In the last equation, the intermediary node over which the minimum is taken is irrelevant. If , then the distance from the source to is the same as the distance from the source to . Thus, it holds that:

    Thus, when we iterate on the element from , we construct the set of nodes at distance from at time . The last equation is thus used to insert into the right triple for each . When we have iterated over all of , all dictionaries are correct at time . Finally, it suffices to observe that once is updated with its final value, then by definition the update of on algorithm 1 yields the sf-metric from to for each .

Proof of complexity.

Let us write and . On each time , we first look up the connected components of , which requires at most operations. On each component of , we run an all-pairs distances method, which makes at most operations. For each node , the list in contains at most elements since there can be at most as many pairs in as there are arrival times on . The same goes for the number of keys in .

There are at most times such that and thus can be constructed with at most operations for all . Inserting and removing an element from takes at most operations: operation for accessing , operations for accessing key and operations to insert or remove an item in a set of size at most . The costliest operations on the connected component are those insertions and deletions. Thus, operating over takes at most operations. The list contains at most triples since for each node , it holds a largest starting time and at most distances (one distance for each arrival time on ). Thus, the for loop over will make at most operations.

The total number of operations at any time is bounded above by . It suffices to multiply this sum by and use the observation that . ∎

Observe that we use the sets $\mathcal{T}$ and $V$ as parameters to evaluate the temporal complexities of our algorithms. These appear as natural choices since $\mathcal{T}$ indicates how the temporal dimension affects the number of operations, while it also serves as a surrogate for $T$, which is in general infinite.

4.3 A multiple-sources sf-metrics method

Suppose is finite and starts on some time . Algorithm 2 returns a set of dictionaries of sf-metrics for each pair of nodes of dictionary such that and . During its execution, it updates a dictionary such that , and from . This dictionary helps in computing and in constructing from any source. It also returns a set of dictionaries of latencies.

Proposition 4.7.

Algorithm 2 returns the latencies, sf-metrics and dictionaries between all pairs of nodes in at most operations.

Proof of correctness.

Let us show that holds correct reachability triples from to for any two nodes , and time . Thus, let us fix those three variables. Let us show this by induction on .

  • If , then either and are in the same connected component of or not. This part is clear.

  • Suppose the result holds for any . Let be the sequence of times previously iterated over. Let be the connected component containing at time . If , then we argue as in the first case and the result follows. Otherwise, by the induction hypothesis, there must exist a largest starting time from to that can be found in , for some since all such node is connected to . Observe that contains pairs of largest starting time and arrival time from to . Observe also that is again an arrival time on . Thus, it suffices to compute the distance from to to obtain a reachability triple from to . We argue as in the proof of Algorithm 1 that Algorithm 2 returns this distance . The update again follows the same reasoning as before.

Proof of complexity.

Again, let and . The costliest operations occur in the for loop starting on algorithm 2. There are at most keys on each , for any . For any and , the size of is upper-bounded by since the starting time is maximal. Thus, at most operations are required. Finding the largest starting time requires in the worst case operations. By the same reasoning, the insertion on algorithm 2 will make at most operations.

, for any and , has a size at most , thus the loop over to find requires at most operations.

Recovering the last element of takes at most operations, thus the loop on makes at most operations. Meanwhile, inserting into takes at most operations. The for loop on algorithm 2 thus makes at most:

operations. This loop is itself repeated for all connected components , which in turn yields:

operations. Thus, this method should make at most operations in the worst case on each time . This number of operations is repeated at most times and the result follows. ∎

Observe that Algorithm 1 needs only be called once per source node in order to deduce the lengths of all shortest fastest paths from any source to any destination, since it discovers all starting times from each source. Thus, about operations are required for Algorithm 1 to produce the same output as Algorithm 2. The multiple-sources algorithm is thus faster when the desired output is the set of sf-metrics from all sources to all destinations. The temporal complexities of both methods are affected mostly by the induced graphs $G_t$. In subsection 4.4, we will see that complexities decrease drastically in cases such as $\gamma$-paths with $\gamma > 0$, since we can then remove the dependency on those induced graphs.

4.4 Shortest paths with delays

In subsection 5.1, we want to compare Algorithm 1 against the shortest path procedure of Wu et al. [15] on the same datasets they used. The shortest path procedure of these authors is the most efficient method known to return distances in temporal networks. However, this algorithm works only on paths with positive delays, that is, $\gamma$-paths with $\gamma > 0$.

A $\gamma$-path in a link stream is a path such that $t_{i+1} \ge t_i + \gamma$ for all $i$ and some fixed $\gamma \ge 0$. We call $\gamma$ the delay and note that the usual path corresponds to a $0$-path. When $\gamma > 0$, it is not necessary to iterate over connected components, since the nodes of a component do not communicate instantaneously, and we can simplify Algorithm 1 in order to reduce its number of operations. The complexities of Algorithms 1 and 2 are mainly influenced by the operations related to the graphs $G_t$, for each time $t$, namely: finding connected components, computing the all-pairs distances and iterating on the set of nodes at equal distances in the connected component. When $\gamma > 0$, we can remove the dependency on the induced graphs and accelerate our methods. Thus, we present Algorithm 3, which is deduced from Algorithm 1 and assumes $\gamma > 0$. Its correctness and temporal complexity follow from the same arguments used in Proposition 4.6.
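For concreteness, a small self-contained check of the delay constraint (our own illustration, independent of Algorithm 3) could look as follows:

```cpp
#include <cstddef>
#include <vector>

// Illustrative check: a sequence of timed hops forms a gamma-path when
// consecutive hops share an endpoint and their departure times are
// separated by at least gamma.
struct Hop {
    double t;  // time at which the edge (t, uv) is traversed
    int u, v;  // traversed from u to v
};

bool isGammaPath(const std::vector<Hop>& hops, double gamma) {
    for (std::size_t i = 0; i + 1 < hops.size(); ++i) {
        if (hops[i + 1].u != hops[i].v) return false;         // must be a walk
        if (hops[i + 1].t < hops[i].t + gamma) return false;  // delay constraint
    }
    return true;
}
```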

Proposition 4.8.

When $\gamma > 0$, Algorithm 3 computes the latencies and sf-metrics from a source event node to all reachable event nodes, as well as the set of dictionaries $R_v$, for all $v \in V$, in at most operations.

Proof.

This follows from the same reasoning as in Proposition 4.6. ∎

Finally, in Algorithm 3, the dictionaries of latencies and sf-metrics are implemented such that the keys are nodes and the values are pairs whose second component is the time at which the value was computed at that node. For example, a latency entry records both the latency from the source to a node and the time at which that latency was obtained. This enables us to sort the dictionaries by time. The same work could be done for Algorithm 2, that is, to adapt it for the case $\gamma > 0$, although that was not the focus here.
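A possible C++ layout of such a dictionary, together with the sort by time it enables (illustrative types, not those of the released code [10]):

```cpp
#include <algorithm>
#include <unordered_map>
#include <utility>
#include <vector>

// A dictionary keyed by node id whose value is the pair
// (computed value, time at which it was computed at that node),
// e.g. a latency together with its time.
using TimedValues = std::unordered_map<int, std::pair<double, double>>;

// Sorting the entries by that time, as mentioned above.
std::vector<std::pair<int, std::pair<double, double>>> sortByTime(const TimedValues& D) {
    std::vector<std::pair<int, std::pair<double, double>>> entries(D.begin(), D.end());
    std::sort(entries.begin(), entries.end(),
              [](const auto& a, const auto& b) { return a.second.second < b.second.second; });
    return entries;
}
```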

5 Experiments

We present some experiments to highlight the running times of Algorithms 1 and 2. In the first one, we compare Algorithm 3 with the single-source shortest path method from Wu et al. [15]. Algorithm 3 acts as a surrogate for Algorithm 1. Although Algorithm 2 should be more efficient than Algorithm 1 when the goal is to compute values between all pairs of temporal nodes, Wu et al. evaluated their method from a small set of source nodes on large datasets. It would be infeasible at this point to evaluate both our methods on the same datasets between all pairs of temporal nodes. In a second experiment, we compared the running times of our two methods on synthetic link streams.

Algorithm 2 was inspired by Xuan et al.'s fastest paths method, which does not return distances; comparing the two methods would thus be unfair to ours.

All experiments were run on a single machine with GHz Intel Core i7 processor and Gb of RAM. All methods were implemented in C++ with standard libraries, including Wu et al.’s method. We implemented standard approaches to compute connected components and all pairs distances in graphs.
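For instance, a BFS-based helper of the kind we refer to (a sketch, not the exact code of our implementation [10]) computes all pairwise distances in an unweighted graph, from which the connected components can also be read off:

```cpp
#include <queue>
#include <vector>

// BFS from every vertex of an unweighted graph given as adjacency lists:
// yields all pairwise distances, and two vertices lie in the same connected
// component iff their distance is not -1.
std::vector<std::vector<int>> allPairsDistances(const std::vector<std::vector<int>>& adj) {
    const int n = static_cast<int>(adj.size());
    std::vector<std::vector<int>> dist(n, std::vector<int>(n, -1));  // -1: unreachable
    for (int s = 0; s < n; ++s) {
        std::queue<int> q;
        dist[s][s] = 0;
        q.push(s);
        while (!q.empty()) {
            const int u = q.front();
            q.pop();
            for (int w : adj[u]) {
                if (dist[s][w] == -1) {
                    dist[s][w] = dist[s][u] + 1;
                    q.push(w);
                }
            }
        }
    }
    return dist;
}
```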

5.1 Runtime comparison with the literature

We presented in subsection 4.4 Algorithm 3, which was motivated by a similar method developed by Wu et al. [15]. We now compare how Algorithm 3 fares against their algorithm. Since we are not aware of methods comparable to Algorithms 1 and 2, this is our only comparison with the literature.

Wu et al. analyzed their method within the framework of temporal graphs and deduced a temporal complexity that is hard to compare with ours. We translate their result into link stream parameters, upper bounding with and with . The shortest path algorithm of Wu et al. then makes at most operations in the worst case. The worst-case temporal complexities of both algorithms are thus the same.

We ran experiments on link streams of various sizes, as measured with $|V|$, $|\mathcal{T}|$ and $|E|$. We used the same datasets as Wu et al. (these datasets are only used as benchmarks; they all describe discrete temporal networks and can be found in the KONECT library of networks [6]; only the parameter values required for our experiments were extracted), randomly chose a number of different nodes from each and ran both methods one after the other. The full results (in seconds) can be found in Table 1. The running times of Wu et al.'s method are either comparable to or significantly smaller than those of Algorithm 3. However, our method does more work, since it must also compute latencies and ensure the distances correspond to the sf-metrics. The running times of Wu et al.'s procedure are thus presented for reference only; it should not be expected that our methods be faster. The datasets are heterogeneous, which explains the variability in running times, and we have not yet pinpointed a hidden link stream parameter that might account for it. The dictionaries $R_v$ are sensitive to the number of arrival times from the source, and we suspect that in the problematic datasets some nodes have a very high number of arrival times, which would make searching values in the corresponding dictionaries more costly.

Dataset |V| |T| |E| Wu et al. (s) Algorithm 3 (s) ratio
arxiv 28093 2337 4596803 1.30 170.00 130.77
digg 30398 9125 87627 1.60 1.10 0.69
elec 7118 90741 103675 0.71 2.90 4.08
enron 87273 178721 1148072 5.20 85.00 16.35
epinions 755760 501 13668320 41.00 40.00 0.98
facebook 63731 204914 817035 10.00 8.90 0.89
flickr 2302925 134 33140017 120.00 3700.00 30.83
slashdot 51083 67327 140778 4.80 4.30 0.90
wikiconflict 116836 215982 2917785 6.90 21.00 3.04
wiki 1870709 2198 39953145 100.00 22000.00 220.00
youtube 3223585 203 9375374 170.00 160.00 0.94
Table 1: Runtime comparison between Algorithm 3 and the shortest path method of Wu et al. [15]

5.2 Comparison between algorithms 1 and 2

Algorithms 1 and 2 were run on a set of randomly generated link streams with a number of nodes ranging from 100 to 165, in increments of 5, each experiment being repeated several times. Although the link streams are small in scale, the running times are significant since we compute the distances from every source to every destination. The link streams were constructed by generating Erdős–Rényi graphs $G(n, p)$, with $n = |V|$ and a fixed edge probability $p$. Then, on each edge $uv$, we drew a time instant $t$ uniformly at random and added both directed edges $(t, uv)$ and $(t, vu)$ to $E$. In this case, edges have no duration and the time instants are integers: this helps ensure the size of $\mathcal{T}$ is fixed and small, so the running times scale only with $|V|$.
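A sketch of this generation process (parameter names and values here are placeholders, not the paper's exact settings) could be:

```cpp
#include <random>
#include <tuple>
#include <vector>

// Erdős–Rényi graph G(n, p) whose edges each receive a uniformly random
// integer time instant and are inserted in both directions.
std::vector<std::tuple<int, int, int>> randomLinkStream(int n, double p, int tMax,
                                                        std::mt19937& rng) {
    std::bernoulli_distribution keepEdge(p);
    std::uniform_int_distribution<int> drawTime(0, tMax);
    std::vector<std::tuple<int, int, int>> links;  // (t, u, v)
    for (int u = 0; u < n; ++u) {
        for (int v = u + 1; v < n; ++v) {
            if (keepEdge(rng)) {
                const int t = drawTime(rng);
                links.emplace_back(t, u, v);
                links.emplace_back(t, v, u);  // both directed edges, as above
            }
        }
    }
    return links;
}
```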

Figure 3 presents the results of this comparison. We observe that, as the number of nodes increases, the running time of Algorithm 1 grows faster than that of Algorithm 2, a clear indication that the latter method is faster. Table 2(a) shows the mean running times (over all repetitions of the same experiment) of each algorithm on link streams with a fixed number of nodes. In terms of scale, the MSMD method handles a link stream of 165 nodes and about 19,000 edges (the size of $E$ is an average over all repetitions) in, on average, less than 60 seconds. Its counterpart takes more than 18 minutes for the same calculations.

Figure 3: Runtime comparison between Algorithms 1 (SSMD) and 2 (MSMD) on synthetic link streams (runtime in seconds vs number of nodes)

Since Algorithm 2 is more scalable than Algorithm 1, we generated a new set of link streams, again with the same process as before, although the time instants are now drawn uniformly at random in an interval while the duration of an edge is drawn uniformly at random in another interval. Since $|\mathcal{T}|$ grows with each generated edge, we kept $|V|$ lower than in the former experiment, up to 80. The measured results are presented in the upper part of Table 2(b), above the horizontal line. We fitted, with the statistical software R [9], a linear model on the runtime of Algorithm 2 as a function of both $|V|$ and $|\mathcal{T}|$ in order to extrapolate the runtime of this method for larger values of these parameters. The fit is reasonable but imperfect, although it is sufficient to illustrate the scaling trend. Extrapolating, we obtain the values below the horizontal line. We observe that with around 200 nodes and 27,000 event times, Algorithm 2 should already take more than a day to finish. This suggests scalability might be an issue: even with such a long running time, we could not tackle a real-world dataset.

|V| |E| (average) SSMD (s) MSMD (s)
100 6942.40 142.36 11.74
105 7656.40 173.02 13.58
110 8394.00 208.28 15.47
115 9173.60 248.52 17.65
120 10005.60 293.48 20.01
125 10835.60 354.76 22.41
130 11723.20 404.31 25.23
135 12654.00 470.48 28.19
140 13601.60 547.13 31.50
145 14583.20 628.99 34.84
150 15609.20 718.28 38.40
155 16675.20 824.66 42.47
160 17794.80 946.35 46.74
165 18915.20 1107.04 52.12
(a) Comparisons between algorithms 1 and 2
|V| |E| |T| Runtime (s)
10 29 58 0.06
20 134 268 1.75
30 313 626 15.08
40 550 1100 64.20
50 857 1714 203.79
60 1225 2450 557.06
70 1670 3340 1248.19
80 2177 4354 2379.12
--- extrapolated values below this line ---
100 3391 6783 6635.30
120 4872 9745 14814.50
140 6620 13241 28717.40
160 8635 17271 50469.20
180 10917 21835 82519.61
200 13466 26932 127642.88
(b) Runtimes of Algorithm 2
Table 2: Runtimes (in seconds) of Algorithms 1 and 2
Algorithm 1: single-source distances, latencies and sf-metrics.
Input: a link stream, the set of event times, a source event node
Output: dictionaries of sf-metrics and latencies from the source to all other event nodes, and the set of dictionaries $R_v$ for each node $v$
(Pseudocode listing omitted; see the online implementation [10] for the full algorithm.)