Embedding-based routing algorithms rely on the assignment of a distinct logical coordinate to every node in a network. To discover routes between nodes, such routing algorithms rely on a metric function that indicates the logical distance between coordinates.
A highly promising approach for large-scale networks with low diameter is the use of rooted spanning tree embeddings (Herzen et al., 2011; Chávez et al., 2007; Houthooft and others, 2015; Roos et al., 2016), where the logical coordinate of each node is an integer vector that uniquely encodes its position in a rooted spanning tree over the network. Such embeddings enable routing with low path stretch while requiring each node to keep only a polylogarithmic number of bits per neighbor as routing information (Herzen et al., 2011). Furthermore, multiple rooted trees can be leveraged in parallel to enable routing despite intermittent failures (Houthooft and others, 2015; Roos et al., 2016).
Due to its high efficiency, routing based on rooted spanning tree embeddings is well-suited for Friend-to-Friend (F2F) overlay networks, such as the dark Freenet (Clarke and others, 2010) or the friend-to-friend mode of GNUnet (Grothoff, 2017). These overlays restrict connectivity to mutually trusted nodes to achieve strong security and privacy in the presence of malicious participants. To set up connections to nodes of other participants, an attacker needs to perform social engineering, which we consider to be costly to conduct on a large scale.
One of the key properties F2F overlays aim to achieve is membership concealment (Vasserman and others, 2009): identifying information, such as the IP address of a node, is not revealed to any untrusted participants. Here, these networks differ dramatically from anonymity networks such as Tor, which reveal the IP address to the guard or bridge node (Dingledine et al., 2004). However, due to the trust-based restriction of connectivity, the structure of an F2F overlay resembles the social graph of its participants. Previous studies have shown that unknown individuals in a social graph can be de-anonymized by looking for nodes with similar structural properties in another, non-anonymous social graph (Narayanan and Shmatikov, 2009; Korula and Lattanzi, 2014) obtained from publicly available data, e.g., by crawling online social networks. As a consequence, distributed algorithms that operate on F2F overlays, such as routing, should minimize exposure of overlay structure.
As the logical coordinate of each node in a rooted spanning tree embedding corresponds to a path from that node to the root of the spanning tree, this routing approach inherently leaks information about the structure of the encoded spanning tree and thus also the overlay structure. Furthermore, as this approach also leverages non-tree links during routing, colluding participants may be able to obtain additional information about links between overlay nodes by tracking which nodes a message has traversed, which makes de-anonymization attacks more accurate. Consequently, the usage of rooted spanning tree embeddings conflicts with the aforementioned goal of membership concealment. Yet, there is no work that quantifies the actual privacy loss caused by the logical coordinates.
While topology-hiding communication protocols have been proposed in the literature, they either rely on flooding for route discovery (Zhang and others, 2012) or perform broadcast to all participants for each message (Akavia et al., 2020) and thus incur prohibitively high overhead for communication in large networks. Thus, such protocols do not pose a suitable alternative to embedding-based routing.
In this paper, we present the following contributions:
We formalize the concept of topological knowledge about an overlay network and explain in detail which knowledge an attacker can infer from observed logical coordinates of a rooted spanning tree embedding.
We show that if data messages do not carry the logical coordinate of their originator in plain text, then colluding malicious participants cannot unambiguously infer links incident to nodes beyond their direct neighborhood.
We perform an extensive simulation study for two state-of-the-art algorithms to evaluate the number of previously unknown participants that malicious participants can infer from logical coordinates propagated by the embedding algorithms.
The results of our simulation study show that in social graph-like overlay networks, the way logical coordinates are assigned has a strong impact on the number of participants that can be discovered. If coordinate elements are determined by enumeration of child nodes, an attacker can infer roughly one order of magnitude more participants than if vectors of random numbers are used as coordinates.
2. Related Work
While there are no studies on the inference of topology from embedding-based routing, the inference of network structure from other routing algorithms, in particular IP routing, has been addressed several times. In the following, we thus give an overview of state-of-the-art methods in the context of IP routing and discuss their applicability to embedding-based routing for F2F overlays.
One of the first approaches to obtain a snapshot of the Internet was by means of sending IP packets with varying initial values in their Time-To-Live (TTL) field (Spring and others, 2004; Donnet et al., 2005; Holbert and others, 2015; Jin et al., 2008). Whenever the TTL of an IP packet reaches zero during transit, many Internet routers drop the packet and send a notification towards the originator of the message. As the notification contains the IP address of the reporting router, paths between different endpoints can be recovered by sending packets with increasing initial TTL values between them while recording the received notification messages. Embedding-based routing schemes for F2F overlays do not have such a notification mechanism, so similar approaches for exploring the topology are not applicable.
Works from the area of network tomography infer the topology between multiple nodes based on end-to-end probe measurements of network characteristics, such as message loss or delay (Ni and others, 2009; Krishnamurthy and Singh, 2012; Malekzadeh and MacGregor, 2013; Coates and others, 2002). If the measurements at two receiving nodes are highly correlated when probes are sent by the same source node, then it is assumed that the path from the source to the first receiver overlaps with the path from the source to the second receiver and thus, there must be a common node on both paths.
However, tomography can detect if paths are likely to overlap but cannot reveal the number of overlapping nodes or the actual length of the paths. Thus, the inferred topology may contain fewer nodes than there actually are. To overcome this limitation, network tomography approaches have been combined with notification messages (Ni and others, 2009) or packets with a limited hop number (Malekzadeh and MacGregor, 2013). As explained before, embedding-based routing schemes for F2F overlays do not provide packet loss notification mechanisms. As routing on greedy embeddings does not suffer from routing loops, limiting the maximum number of hops is furthermore unnecessary.
In settings where nodes can learn their hop distance to all other nodes in the network, estimates on network topologies can be inferred from the hop distances of a subset of nodes (Bouchoucha et al., 2019). So far, no existing F2F network supports collection of hop distance from one node to all other nodes. However, when embeddings based on breadth-first-search spanning trees are used for routing, every node can learn its hop distance to a subset of nodes from the logical coordinates of its neighbors. The algorithm of Bouchoucha et al. (Bouchoucha et al., 2019) then enables inference of links between the nodes in this subset.
However, performing topology inference solely from hop distance information disregards further information that is available to the adversary. For example, if the adversary discovers two nodes that are each two hops away from one of their nodes and a third node that is three hops away, they cannot tell to which of the two former nodes the third node is connected. We will show in Section 4 that adversaries can easily infer some of those links between the discovered nodes from the logical coordinates. Furthermore, all links that are inferred by the prior algorithm are also inferred by the attacks presented in Section 4. Thus, our algorithm is able to infer more topological information in networks that use embedding-based routing.
3. System model
In the following, we explain our system model, including our terminology, and subsequently state our adversary model.
3.1. Network Model
We consider overlay networks with bidirectional connections and thus model an overlay as an undirected graph $G = (V, E)$, where $V$ represents the set of participating nodes and an edge $\{u, v\} \in E$ represents a connection between two nodes $u$ and $v$. We say that $u$ is a neighbor of $v$ iff $\{u, v\} \in E$. In the following, we define the neighborhood of a set of nodes in $G$ as the set of all neighbors of the nodes in that set.
We do not assume that participating nodes have knowledge about the structure of the overlay beyond their direct neighborhood.
To enable communication between nodes that are not neighbors in the overlay, the network leverages routing based on rooted spanning tree embeddings (Herzen et al., 2011; Chávez et al., 2007). In these embeddings, a unique vector of integers is assigned to every node that encodes its position in a rooted spanning tree over the network. Each such vector then denotes the logical coordinate of the corresponding node in a virtual space. To do so, state-of-the-art distributed embedding algorithms (Hoefer et al., 2013; Herzen et al., 2011; Houthooft and others, 2015) first form a rooted spanning tree over the current overlay. Afterwards, the root node $r$ of the spanning tree sets a predefined vector $c_r$ as its logical coordinate. Subsequently, $r$ determines an ordering among its children in the spanning tree and assigns the vector $c_r \,\|\, i$ to the $i$-th child for each $i$, where "$\|$" denotes concatenation. As soon as a child $v$ of $r$ has set its logical coordinate $c_v$ accordingly, it analogously determines an order among its children and assigns $c_v \,\|\, j$ to its $j$-th child. This process continues until every node has obtained a logical coordinate.
For simplicity, in this paper we assume that the empty vector is assigned to the root node, i.e., $c_r = ()$, as proposed by Höfer et al. (Hoefer et al., 2013). The attacks presented in the following sections can, however, easily be adapted for other root coordinate assignments. Figure 1 shows an example for such an assignment of logical coordinates.
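The assignment procedure can be illustrated with a minimal, centralized sketch in Python. It is not one of the cited distributed algorithms: as assumptions made purely for illustration, the spanning tree is built by breadth-first search, children are ordered by sorting their identifiers, and enumeration indices start at 1. The function name `assign_coordinates` and the adjacency-dictionary input are hypothetical.

```python
from collections import deque


def assign_coordinates(adjacency, root):
    """Assign enumeration-based coordinates along a BFS spanning tree.

    adjacency: dict mapping each node to an iterable of its neighbors
               (undirected overlay; hypothetical input format).
    Returns (coords, parent), both dicts keyed by node.
    """
    coords = {root: ()}      # the root is assigned the empty vector
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        index = 0
        # Sorting is an arbitrary, illustrative way to order the children.
        for v in sorted(adjacency[u]):
            if v not in coords:
                index += 1                          # indices start at 1 (assumption)
                coords[v] = coords[u] + (index,)    # parent coordinate || index
                parent[v] = u
                queue.append(v)
    return coords, parent


overlay = {
    "r": ["a", "b"],
    "a": ["r", "b", "c"],   # a-b is a non-tree link in this example
    "b": ["r", "a"],
    "c": ["a"],
}
coords, parent = assign_coordinates(overlay, "r")
```

In this example, the link between `a` and `b` is not part of the spanning tree, so it has no influence on the assigned coordinates.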
In the following, we say that a node $u$ with coordinate $c_u$ is the parent of a node $v$ with coordinate $c_v$ if $c_u$ is a prefix of $c_v$ and $|c_v| = |c_u| + 1$. In this case, we also say $v$ is a child of $u$. We say that $v$ is a sibling of $w$ if both have the same parent. Furthermore, we say that $v$ is a descendant of $u$ if $c_u$ is a prefix of $c_v$.
For the actual routing of messages, nodes determine the logical distance between two coordinates $c_1$ and $c_2$ by means of the tree distance
$$d(c_1, c_2) = |c_1| + |c_2| - 2 \cdot \mathrm{cpl}(c_1, c_2),$$
where "$|c|$" denotes the length of the vector $c$ and "$\mathrm{cpl}(c_1, c_2)$" denotes the length of the longest common prefix of $c_1$ and $c_2$. When a node $u$ receives a message with target coordinate $c_t$ that differs from the coordinate $c_u$ assigned to $u$, $u$ forwards the message to a neighbor $v$ with coordinate $c_v$ such that $d(c_v, c_t) < d(c_u, c_t)$.
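The following Python sketch illustrates the distance computation and the greedy forwarding decision. It assumes that the tree distance is the sum of both coordinate lengths minus twice the length of their longest common prefix. The text only requires the next hop to be strictly closer to the target; choosing the closest such neighbor is an additional assumption, and all function names are hypothetical.

```python
def common_prefix_length(a, b):
    """Length of the longest common prefix of two coordinate tuples."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def tree_distance(a, b):
    """Hop distance between two coordinates in the spanning tree."""
    return len(a) + len(b) - 2 * common_prefix_length(a, b)


def next_hop(own, neighbor_coords, target):
    """Pick the neighbor closest to the target, if strictly closer than us.

    neighbor_coords: dict mapping neighbor id -> coordinate tuple.
    Returns None if no neighbor is strictly closer to the target.
    """
    best, best_distance = None, tree_distance(own, target)
    for neighbor, coord in neighbor_coords.items():
        distance = tree_distance(coord, target)
        if distance < best_distance:
            best, best_distance = neighbor, distance
    return best


neighbors = {"parent": (), "child": (1, 1), "shortcut": (2,)}
hop = next_hop((1,), neighbors, (2, 3))
```

Here the node with coordinate `(1,)` forwards via its shortcut neighbor at `(2,)` rather than via its parent, which is the kind of path shortening that non-tree links make possible.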
An important feature of the embeddings considered here is that during routing, nodes check the coordinates of all their neighbors, including those that are neither their parent nor their child. This property allows routing to find shorter paths than those found by simple spanning tree routing (Herzen et al., 2011) and allows the discovery of alternate paths in case of failures (Roos et al., 2016; Houthooft and others, 2015). In the following, we denote overlay links that are part of the spanning tree as tree links while all other links are called shortcut links.
3.2. Adversary model
F2F overlays such as Freenet (Clarke and others, 2010) offer services like messaging and publishing in a censorship-resistant and anonymous manner, making them valuable communication tools for journalists and activists. In this work, we therefore consider a malicious actor that aims to identify the participants of an F2F network, e.g., to uncover activist networks.
Due to the trust-based formation of links, the topology of F2F overlays corresponds to the graph of mutual acquaintances between its participants. It therefore seems likely that the F2F overlay topology resembles other graphs that represent social interactions and relationships, such as those obtained from crawling online social networks or phone call records (Narayanan and Shmatikov, 2009). If the attacker is able to infer a subgraph of the overlay, they can then leverage graph-based de-anonymization attacks (Narayanan and Shmatikov, 2009; Yartseva and Grossglauser, 2013; Korula and Lattanzi, 2014; Sharad, 2016) to infer the identity of node operators. (We refer to the attacker using the singular they (Lee, ).) Such de-anonymization attacks heuristically find mappings between the nodes of two graphs based on structural features, such as common neighbors or node degrees. The adversary thus aims to infer as much information as possible about the overlay graph to increase the number of mappings that can be found and to increase the chance that the found mappings are indeed correct.
As we are interested in the leakage of topology information due to the overlay’s routing algorithm, we focus on internal attackers, where the adversary participates in an F2F overlay with one or more nodes under their control, which we call malicious nodes in the following. Protection against external attackers that infer overlay participants and links via traffic analysis is an orthogonal problem which can be addressed by tunneling F2F overlay messages through non-suspicious services (Barradas and Santos, 2020).
We assume that the attacker was able to identify a subset of the overlay’s participants and lured each of them to let their node set up a link to at least one malicious node. Malicious nodes participate in the embedding and routing but may deviate arbitrarily from correct behavior to obtain topology data. In the following, we denote nodes of identified participants that are connected to malicious nodes as compromised nodes.
4. Inference of Topology Structure
As described in Section 3, we consider routing based on logical coordinates that are assigned to nodes based on a rooted spanning tree over the overlay network. Because each link in the spanning tree corresponds to a unique link in the overlay network, it is desirable for the attacker to uncover the structure of the spanning tree, as it inherently corresponds to a subgraph of the overlay network’s topology. As the logical coordinate assigned to each node encodes the unique path in the spanning tree from that node to the root node, an attacker can leverage observations about which logical coordinates have been assigned to nodes to draw conclusions about the structure of the spanning tree and hence, the overlay.
To enable routing, data packets furthermore need to carry the logical coordinate of the recipient node. As explained in the previous section, messages may be routed via shortcut links, i.e., links that are not part of the spanning tree. By keeping track of which messages with which recipient coordinates have been routed via their nodes, the attacker can detect if a shortcut link has been used and infer possible paths taken by the message. As a consequence, the actual routing of messages allows the attacker to make inferences with regard to shortcut links between nodes.
In this section, we investigate the above risks in detail. To do so, we first formalize the concept of topological knowledge about an overlay network. We then specify which concrete inferences can be made from observed logical coordinates. Afterwards, we analyze which inferences can be made from observations about the trajectories of messages routed via the overlay.
4.1. Modeling topological knowledge
For a given overlay network, we model the adversary’s knowledge about it at a fixed point in time by a tuple consisting of a set of known nodes, a set of known links, a set of known absent links, and a partial coordinate assignment. The set of known nodes contains the nodes that the adversary considers to be participating in the overlay. This set always contains the compromised nodes defined in Section 3 and the malicious nodes but may furthermore contain pseudonymous nodes that the adversary is aware of but cannot immediately identify due to a lack of further information. While the adversary can unambiguously relate malicious and compromised nodes to overlay nodes (e.g., by IP address), a pseudonymous node is considered to be participating in the overlay, but cannot be related to a particular overlay node. More formally, the underlying injective mapping from known nodes to overlay nodes is known to the adversary for malicious and compromised nodes but not for pseudonymous nodes.
The set of known links contains links between known nodes that the adversary knows to exist. This means that every link in this set is guaranteed to correspond to a link in the overlay. The set of known absent links encodes those links that the adversary knows to be non-existent between the known nodes, meaning that no pair in this set corresponds to a link in the overlay.
The partial coordinate assignment encodes the logical coordinates of the nodes the adversary is aware of. It is a partial function because the embedding algorithm may not yet have assigned a coordinate to a malicious or compromised node. As we derive pseudonymous nodes from logical coordinates in the following, it is always defined for pseudonymous nodes.
4.2. Inference of tree links
We now consider concrete inferences made from observations about coordinates assigned to nodes. A malicious participant may learn about the coordinates of other nodes in two ways:
To enable routing, each node needs to be aware of the logical coordinates of its neighbors. Therefore, as soon as a logical coordinate has been assigned to a node, it notifies all of its overlay neighbors about it. As a consequence, malicious nodes learn about the logical coordinates of their non-malicious neighbors.
Messages carry the logical coordinate of the target node. If a message is routed via a malicious node, it can thus read the coordinate included in the message.
Now consider that an adversary has received a coordinate $c$ that was previously unknown, meaning that their knowledge contains no node to which $c$ has been assigned. First, they can infer that there exists a node $v$ to which coordinate $c$ has been assigned. If $v$ is a compromised node, i.e., a non-malicious node with a malicious neighbor, then it is already included in the set of known nodes and only the mapping of $v$ to $c$ is added to the coordinate assignment. Otherwise, the attacker generates a unique pseudonymous identifier for $v$, adds it to the set of known nodes, and adds a corresponding mapping to the coordinate assignment.
Furthermore, the attacker participates in the overlay and is thus aware of the embedding algorithm described in Section 3.1. From coordinate $c$, they can thus also draw the following conclusions:
The coordinate assigned to a node corresponds to the coordinate of its parent node and an additional element at the end. Thus, if $|c| > 0$, meaning that $v$ is not the root node of the spanning tree, they can infer that there must be a node $p$ whose coordinate $c_p$ equals $c$ without its last element and that $v$ and $p$ are connected with each other. Thus, if $p$ is a previously unknown node, the attacker generates a unique identifier for $p$, adds it to the set of known nodes, and adds a corresponding mapping of $p$ to $c_p$. Furthermore, the attacker adds a link between $p$ and $v$ to the set of known links.
The coordinate elements are determined by enumeration of child nodes in the spanning tree. Thus, if the last element of $c$ is not the smallest enumeration index, they can infer that there must be nodes whose coordinates consist of $c_p$ followed by each smaller index and that all of them are connected to the node with coordinate $c_p$. For those nodes whose coordinates were previously unknown, the attacker thus analogously generates unique identifiers and adds corresponding entries to the set of known nodes, the set of known links, and the coordinate assignment.
If $|c| > 1$, i.e., $v$ is not a child of the root node, the attacker can then additionally make analogous inferences for every non-empty prefix of $c$. Figure 2 shows an example for inferences made from a coordinate based on the previously described considerations. In Section 5, we present results from a simulation study that shed light on the number of nodes whose participation can be inferred in realistic settings.
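The inferences described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: pseudonymous nodes are identified directly by their coordinates, enumeration indices start at 1 (the parameter `first_index` is hypothetical), and the function name `infer_enumerated` is not taken from the paper.

```python
def infer_enumerated(coord, first_index=1):
    """Nodes and tree links implied by one observed enumeration-based coordinate.

    Nodes are represented by their coordinate tuples; links are
    (parent_coordinate, child_coordinate) pairs.
    """
    nodes = {()}      # the root with the empty coordinate always exists
    links = set()
    # Apply the same reasoning to the coordinate and each non-empty prefix.
    for length in range(1, len(coord) + 1):
        prefix = coord[:length]
        parent = prefix[:-1]
        # All children with a smaller or equal index must exist as well.
        for index in range(first_index, prefix[-1] + 1):
            child = parent + (index,)
            nodes.add(child)
            links.add((parent, child))
    return nodes, links


nodes, links = infer_enumerated((2, 3))
```

For the single observed coordinate `(2, 3)`, this yields six nodes (the root, two children of the root, and three children of the node at `(2,)`) and five tree links.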
To enable routing in a manner that hides the ultimate recipient of a message, Roos et al. (Roos et al., 2016) proposed an obfuscation scheme for logical coordinates. While not explicitly designed to hinder inference of topology structure, their obfuscation scheme nonetheless reduces the topological information an attacker can derive from observed coordinates. In the following, we thus explain key changes and their effects in more detail. In Section 5, we present simulation results showing that the obfuscation scheme drastically reduces the number of inferred participants.
Concretely, the embedding-based routing from Roos et al. differs from the routing presented in Section 3 in two key points:
Randomly chosen $b$-bit integers are used as coordinate elements instead of enumeration indexes.
Before publishing the logical coordinate vector of a node, the vector is padded to a fixed length by appending a corresponding number of additional randomly chosen integers. Subsequently, each element of the padded vector is replaced by a cryptographic hash value over that element and a randomly chosen number.
Note that the second modification is used only to generate obfuscated addresses that can be published out of band to enable participants to contact a node in a privacy-preserving manner. In the coordinate assignment procedure, nodes only use non-padded coordinates.
As a consequence of the first modification, an attacker learning about a coordinate cannot infer whether a coordinate that differs only in its last element has been assigned to any other node, since the last element was chosen randomly, independent of the number of children in the spanning tree. However, the attacker can still infer that there is another node to which the coordinate without its last element has been assigned and that this node is connected to the node with the observed coordinate, and they can proceed analogously with every non-empty prefix of the coordinate.
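As a hedged illustration of what remains inferable with random coordinate elements, the following Python sketch derives only the chain of ancestors from an observed coordinate. As before, nodes are represented by their coordinates, and the function name `infer_ancestors` is hypothetical.

```python
def infer_ancestors(coord):
    """Nodes and tree links implied by a coordinate with random elements.

    Only the prefixes of the coordinate (the ancestors of the node) can be
    inferred; nothing can be concluded about siblings.
    """
    prefixes = [coord[:n] for n in range(len(coord) + 1)]
    links = list(zip(prefixes, prefixes[1:]))   # (parent, child) pairs
    return prefixes, links


prefixes, links = infer_ancestors((17, 4))
```

A coordinate with two elements thus reveals three nodes and two links, regardless of the values of its elements.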
The second modification keeps the attacker from learning about previously unknown node coordinates by reading the target coordinates included in data messages routed via malicious nodes. In short, this is because the attacker cannot determine the actual number of randomly added elements of the target coordinates. While the attacker, given an obfuscated target coordinate, can determine the longest common prefix between it and any coordinate they are already aware of, they cannot tell whether the first element after this common prefix is already part of the random padding or not. Due to the properties of the cryptographic hash function, the attacker furthermore cannot unambiguously infer the value of that element of the padded coordinate. Even if a node publishes multiple obfuscated variants of its coordinate, the attacker can only determine possible longer common prefixes among them by exhaustive search over the range of possible element values, which is computationally infeasible for a sufficiently large bit length $b$ of the coordinate elements.
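The padding and hashing step can be sketched in Python as follows. This is not the exact construction of Roos et al.: as assumptions for illustration, SHA-256 is used as the hash function, the randomly chosen number is a 16-byte salt that is published together with the hashed vector, and the names `obfuscate` and `obfuscated_common_prefix_length` are hypothetical.

```python
import hashlib
import secrets


def _hash_element(salt, element):
    """Hash one coordinate element together with the salt (assumed scheme)."""
    return hashlib.sha256(salt + element.to_bytes(16, "big")).digest()


def obfuscate(coord, padded_length, bits=64, salt=None):
    """Pad a coordinate with random elements and hash every element."""
    if len(coord) > padded_length:
        raise ValueError("coordinate is longer than the padded length")
    if salt is None:
        salt = secrets.token_bytes(16)
    padding = tuple(secrets.randbits(bits)
                    for _ in range(padded_length - len(coord)))
    return salt, tuple(_hash_element(salt, x) for x in coord + padding)


def obfuscated_common_prefix_length(own_coord, salt, hashed):
    """Common prefix length between a known coordinate and an obfuscated one."""
    n = 0
    for element, digest in zip(own_coord, hashed):
        if _hash_element(salt, element) != digest:
            break
        n += 1
    return n


salt, hashed = obfuscate((5, 7), padded_length=4)
```

An observer who knows the coordinate `(5, 7, 9)` can determine that it shares a prefix of length two with the obfuscated vector, but the vector itself does not reveal whether its third element belongs to the real coordinate or to the padding.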
4.3. Inference of shortcut links
Recall from Section 3 that embedding-based routing also considers non-tree edges for forwarding. To detect the usage of such shortcut links, malicious nodes record every message that they received, including the message’s target coordinate $c_t$ as well as the coordinate of the neighbor from which they received the message. If the attacker is aware of the logical coordinate $c_u$ of another node $u$ over which the message was routed previously, they can then check if the coordinate $c_m$ of the recording malicious node $m$ is a prefix of $c_u$ or of $c_t$. If this is not the case, then $m$ does not lie on the path from $u$ to the target in the spanning tree underlying the coordinate assignment and thus, the message must have been routed via a shortcut link.
An attacker may become aware of the coordinates of nodes previously traversed by a message via multiple means. If the originator of a message writes its own logical coordinate into the message to enable the recipient to send a reply, the attacker can simply read this coordinate. In the following, we however do not assume that sending nodes include their coordinate in messages sent. Even if they do so, reading the sender coordinate by malicious nodes can be prevented by having nodes publish a cryptographic key along with their coordinate, such that senders can attach their coordinate to messages in an encrypted form. Since F2F networks typically do not obfuscate message contents during routing, e.g., via re-encryption, the adversary can instead determine if the same message was routed via two or more malicious nodes and in which order.
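The prefix check described above can be expressed as a short Python sketch. The function names are hypothetical, and coordinates are represented as tuples; the check only detects that some shortcut link was used and does not identify which one.

```python
def is_prefix(prefix, coord):
    """True if `prefix` is a (not necessarily proper) prefix of `coord`."""
    return len(prefix) <= len(coord) and coord[:len(prefix)] == prefix


def shortcut_was_used(own, previous, target):
    """Decide whether a recorded message must have used a shortcut link.

    own:      coordinate of the recording malicious node
    previous: coordinate of a node the message traversed earlier
    target:   target coordinate carried by the message
    """
    # If our coordinate is a prefix of neither, we are not on the tree path.
    return not (is_prefix(own, previous) or is_prefix(own, target))


on_tree_path = shortcut_was_used((1, 2), (1, 2, 3), (4,))
off_tree_path = shortcut_was_used((2, 1), (1, 1), (3,))
```

In the first call the recording node is an ancestor of the previously traversed node, so no conclusion can be drawn. In the second call the recording node is an ancestor of neither endpoint, so the message cannot have followed the tree path.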
Given the adversary has received a message and is aware of the coordinate of a previously traversed node, the actual inference of possible shortcut links is non-trivial. The message may have been routed via a yet unknown node or there may be two or more known nodes that qualify as the next hop. Figure 3 shows an example for such cases.
To formalize the conditions when the existence or absence of a link can be concluded, we first introduce the concept of a hypothetical overlay that addresses the possible presence of yet unknown nodes. Afterwards, we define the notion of a plausible trajectory within a hypothetical overlay and subsequently specify when a message is said to prove the existence or absence of an overlay link.
4.3.1. Hypothetical overlay
Given knowledge about an overlay, a corresponding hypothetical overlay is a tuple consisting of the known nodes, a set of dummy nodes, a set of hypothetical links, and a coordinate assignment for known and dummy nodes.
Each dummy node represents an unknown number of nodes with the same parent in the spanning tree. The coordinate assignment of the hypothetical overlay assigns the same coordinates to each known node as the adversary’s knowledge but additionally assigns a unique, random coordinate to every dummy node. To enable discovery of all possible trajectories, the set of hypothetical links includes all pairs of nodes except those that the adversary knows to be unconnected. Given the adversary’s knowledge and a set of malicious nodes, a corresponding hypothetical overlay can be generated via the following steps:
Set , , and
Determine the length $\ell$ of the longest known coordinate.
For every non-malicious known node with coordinate length $l < \ell$, add a subtree of depth $\ell - l$ by adding a dummy node with a unique coordinate for each level.
For every pair of nodes that is not known to be unconnected, add a link to the set of hypothetical links.
Figure 4 shows an example for the generation of the hypothetical overlay.
While a node may have a shortcut link to a yet unknown node whose logical coordinate has more elements than the longest known coordinate, we omit the generation of such dummy nodes. It can easily be shown that if a message may have been routed via an unknown node with a longer coordinate, then it is also possible that this message was routed instead via the predecessor of that node whose coordinate is as long as the longest known coordinate, which is represented by a dummy node. Thus, even if dummy nodes with longer coordinates are omitted from the hypothetical overlay, we ensure that if a message may have been routed via an unknown node, then there always is at least one corresponding route via a dummy node in the hypothetical overlay.
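The generation of a hypothetical overlay can be sketched in Python as follows. This is one possible reading of the steps above, under several assumptions: only nodes with a known coordinate are considered, each subtree of dummy nodes is a chain with one dummy node per level, dummy coordinates use deterministic placeholder elements instead of random numbers, and links are stored as `frozenset` pairs. All names are hypothetical.

```python
from itertools import combinations


def build_hypothetical_overlay(coords, malicious, absent_links):
    """Build dummy nodes, hypothetical links, and extended coordinates.

    coords:       dict mapping known node -> coordinate tuple
    malicious:    set of malicious nodes
    absent_links: set of frozenset({u, v}) pairs known to be unconnected
    """
    hyp_coords = dict(coords)
    dummies = set()
    max_len = max(len(c) for c in coords.values())
    for node, coord in coords.items():
        if node in malicious:
            continue   # the attacker knows all children of malicious nodes
        current = coord
        for depth in range(len(coord) + 1, max_len + 1):
            # Unique placeholder element (stand-in for a random value).
            current = current + (f"?{node}:{depth}",)
            dummy = ("dummy", node, depth)
            dummies.add(dummy)
            hyp_coords[dummy] = current
    # Every pair is a possible link unless it is known to be absent.
    links = {frozenset(pair)
             for pair in combinations(list(hyp_coords), 2)
             if frozenset(pair) not in absent_links}
    return dummies, links, hyp_coords


known = {"m": (2,), "r": (), "x": (1,), "y": (1, 1)}
dummies, links, hyp_coords = build_hypothetical_overlay(
    known, malicious={"m"}, absent_links={frozenset({"m", "x"})})
```

In this example the longest known coordinate has two elements, so two dummy nodes are added below the root `r` and one below `x`, while `y` and the malicious node `m` receive none.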
4.3.2. Plausible trajectories and link existence
To be able to define a plausible trajectory, we first need to formalize the observation of a message by malicious nodes. We do so with the notion of a trace record, as given by Definition 1.
Definition 1 (Trace record). Let be an overlay network and let be a set of observation points in . For a message , let with and be the path along which has been forwarded in .
For a given pair , a trace record of on is a 4-tuple where
There exists a subsequence of such that and .
Although a packet may traverse more than two malicious nodes on its way to the target node, we treat each path between two consecutively traversed malicious nodes as a separate trace record. We consider this simplification to be valid, as the greedy routing of each message from one malicious node to another is independent of the path over which the message was routed before.
Based on the notion of a trace record, we define a plausible trajectory as given by Definition 2.
Definition 2 (Plausible trajectory). Let be an overlay network and let be a coordinate assignment to the nodes in . Furthermore, let denote a set of observation points, denote a priori knowledge about and , and let be a hypothetical overlay for .
Given a trace record with and of a message with target coordinate , a sequence of nodes from is called a plausible trajectory for towards given knowledge if:
The first condition of Definition 2 ensures that only trajectories matching the trace record are considered to be plausible. Because we only consider trajectories between two malicious nodes, the second condition ensures that other malicious nodes are excluded. The third condition ensures that a plausible trajectory does not contradict the adversary’s knowledge about absent links, as pairs are not included in . The fourth condition reflects that, due to greedy routing, nodes only forward messages to neighbors whose distance to the target is strictly lower than their own. The fifth condition furthermore guarantees that a plausible trajectory does not contradict the adversary’s knowledge about existing links. For an example, again consider Figure 3. If there were a link between node and that is known to the adversary, then the fifth condition would ensure that any route via is not considered plausible, because if the message for target had been received by node , then it would have greedily forwarded it directly to instead of forwarding it to node .
Although there may be multiple plausible trajectories for a given trace record, there are cases where the adversary may nonetheless be able to infer the existence or absence of a link. Definition 3 therefore specifies when a trace record is said to prove the existence or absence of a link between known nodes.
Definition 3 (Proof of link existence). Let be an overlay network and let be a coordinate assignment to the nodes in . Furthermore, let denote a set of observation points, denote a priori knowledge about and .
A trace record with and of a message with target coordinate proves the existence of a link between two known nodes given knowledge , if all plausible trajectories for towards given knowledge include the sequence .
4.3.3. Limits of inference from data messages
For an attacker aiming to perform graph-based de-anonymization attacks, it is desirable to obtain knowledge about the links incident to pseudonymous nodes, as this can be used to make correct de-anonymization more likely. In the following, we show that by tracing message trajectories, the adversary cannot unambiguously infer shortcut links other than those between compromised nodes. In particular, we show that for every pair of nodes where or is a pseudonymous node, every trace record that has a plausible trajectory that includes the sequence also has at least one plausible trajectory that does not include the sequence .
As a prerequisite, Lemma 4 states that whenever the logical coordinates of two nodes and have the same length, then either or for every coordinate .
Lemma 4.
Let be an overlay network and let be a coordinate assignment for the nodes in . The following holds for every pair of nodes : if , then there is no coordinate such that .
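Proof. As described in Section 3, the tree distance $d$ between two coordinates is computed solely from the length of the coordinates as well as the length $\mathrm{cpl}$ of their common prefix. Since $|c_u| = |c_v|$, $c_u$ can only have a lower distance to a coordinate $c$ than $c_v$ if it has a longer common prefix with $c$. Therefore, assume that $\mathrm{cpl}(c_u, c) = \mathrm{cpl}(c_v, c) + k$ with $k \geq 1$. Thus,
$$d(c_u, c) = |c_u| + |c| - 2 \cdot \mathrm{cpl}(c_u, c) = |c_v| + |c| - 2 \cdot (\mathrm{cpl}(c_v, c) + k) = d(c_v, c) - 2k.$$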
Theorem 5.
Let be an overlay network and let be a coordinate assignment to the nodes in . Also, let denote a set of observation points, denote a priori knowledge about and . Furthermore, let be a pair of nodes such that .
If there is a trace record with and of a message with a target coordinate that proves the existence of a link between and , then it must hold that and .
In the following, we show that if does not have a malicious neighbor, the adversary cannot unambiguously determine whether any message was indeed forwarded directly from to or vice versa. This is because it is always possible that there is a yet unknown node over which or may have routed the message instead. More formally, we show that if is not a compromised node, i.e. , not connected to a malicious node, then for every trace record, there is at least one plausible trajectory that does not include the sequence or . Thus, no trace record proves the existence of a link according to Definition 3. The proof analogously holds for the case that is not a compromised node.
W.l.o.g., assume that is not a compromised node. First, note that cannot be a child of and vice versa. Otherwise, since the attacker is aware of ’s and ’s coordinates, he could already infer the existence of a link between and as described in Section 4.2. Nonetheless, it is possible that is a higher-order descendant of in the sense that may be a descendant of a child of and vice versa. For simplicity, however, in the following we only present the proof for the case that neither is a descendant of nor a descendant of . The proof for the case that either one is a descendant of the other proceeds very similarly and can be found in the long version of this paper.
Given that is not a descendant of and vice versa, it follows that neither of them can be the root node of the spanning tree. As all neighbors of are non-malicious by assumption, thus must have a parent that also must be a non-malicious node. also must have at least one more non-malicious neighbor the adversary is aware of, which may either be a child of or a neighbor connected via an already known shortcut link. Otherwise, the adversary is unable to tell if it is even possible that any message he received traversed .
The two key insights used in this proof are that the adversary cannot tell i) if there is another, unknown child of besides , and ii) if has any yet unknown children. Thus, the hypothetical overlay corresponding to the adversary’s knowledge contains a dummy node that is a sibling of as well as a dummy node that is a child of . As the attacker is unaware of the connections of these unknown nodes, is connected to as well as all neighbors of , as it is possible that an unknown child of may have such links. Similarly, is also connected to . We consider a worst-case scenario, where is neither connected to nor and the attacker is aware of this fact, such that is also neither connected to nor in the hypothetical overlay. Figure 5 illustrates the considered scenario.
A message with target may only be forwarded from to or vice versa if or , respectively. We now prove each case separately.
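The forwarding condition can be made concrete with a minimal sketch. It assumes that the tree distance takes the common form "sum of both coordinate lengths minus twice the common prefix length", which is consistent with the statement that the distance depends only on these lengths but is not quoted from Section 3. The function names and the dictionary-based neighbor table are hypothetical.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, ...]


def common_prefix_length(a: Coord, b: Coord) -> int:
    """Number of leading elements shared by both coordinates."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def tree_distance(a: Coord, b: Coord) -> int:
    """Assumed tree distance: hops up to the lowest common ancestor and down again."""
    return len(a) + len(b) - 2 * common_prefix_length(a, b)


def closer_neighbors(own: Coord, neighbors: Dict[str, Coord], target: Coord) -> List[str]:
    """Names of all neighbors that are strictly closer to the target than the node itself."""
    own_distance = tree_distance(own, target)
    return sorted(
        name
        for name, coord in neighbors.items()
        if tree_distance(coord, target) < own_distance
    )


# Hypothetical example: two siblings below parent (1, 2) and a target outside that subtree.
sibling_a: Coord = (1, 2, 0)
sibling_b: Coord = (1, 2, 5)
target: Coord = (1, 4)
candidates = closer_neighbors((1, 2, 0, 7), {"a": sibling_a, "b": sibling_b}, target)
```

Under this assumed distance, two children of the same parent are equally close to any target whose coordinate does not extend the parent's coordinate, so a forwarding node connected to both has no reason to prefer one over the other. This is the kind of ambiguity the case analysis relies on.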
Case : In this case, the message with target forwarded by a malicious node must subsequently have been routed first via and afterwards via . Since is not a descendant of , in this case it follows that the coordinate of must not be a prefix of , as then only descendants of can have a lower distance to than . Because all neighbors of are non-malicious, any message forwarded by a malicious node towards must first traverse one of ’s neighbors before reaching . Since ’s coordinate is not a prefix of , cannot be ’s parent , because it must hold that , meaning that would not forward the message to . Thus, must either be a child of or a neighbor connected via a known shortcut link.
At the same time, may also be connected to another, unknown sibling of , and thus is connected to in the hypothetical overlay. Because and are a child of , it must hold that . Also, since is not a prefix of , it follows that and thus, . As a consequence, may thus send the message to instead of . Since is connected to in the hypothetical overlay, there is at least one plausible trajectory that includes the sequence instead of and therefore, any trace record obtained from such a message cannot prove the existence of the link .
Case : In this case, a message was first sent to , which then may have forwarded it to . Here, we distinguish between three cases, namely that i) , ii) , and iii) .
In case (i), Lemma 4 implies that . For every child of , it therefore holds that . Consequently, there is at least one plausible trajectory that includes the sequence instead of and thus, any trace record obtained in this case cannot prove the existence of link .
In case (ii), the assumption implies that the coordinate of ’s parent must be a prefix of and thus . From Lemma 4, it follows that and thus for every child of , including those represented by in the hypothetical overlay. Thus, there is at least one plausible trajectory that includes the sequence instead of , such that any trace record obtained in this case also cannot prove the existence of link .
In case (iii), we further need to distinguish two cases, namely that a) , i.e., the recipient of the message is not a descendant of , and b) , i.e., the recipient is a descendant of . If (a) holds, then it is possible that has another child represented by with and thus . As it then follows that , it is thus possible that sent the message to instead of . If (b) is true, it must hold that , since is not a descendant of . Thus,
Thus, for every child of , it holds that . As a consequence, there is at least one plausible trajectory that contains the sequence and therefore any trace record obtained in this setting also cannot prove the existence of link .
We now consider the case that either is a descendant of or vice versa. As explained at the beginning of the proof, cannot be a child of or vice versa. Furthermore, there must be at least one malicious node on the path between and in the spanning tree. Otherwise, it is always possible that the message was routed solely along the tree links.
Case : In this case, the message may have first been routed via , which then may have sent it to . Since is not a neighbor of a malicious node, any message that may traverse must first traverse a non-malicious neighbor of . Furthermore, it must hold that , as would otherwise not forward the message to .
First, consider the case that is a descendant of . Since cannot be the parent of , it must also hold that is a descendant of and . From the latter statement, it also follows that the target of the message cannot be a descendant of and thus, it must hold that for every child of .
Because all children of have the same distance to , ’s neighbor may thus forward the message to ’s sibling instead of . As is connected to in the hypothetical overlay, there is at least one plausible trajectory that includes the sequence instead of and therefore, any trace record obtained from such a message cannot prove the existence of the link .
Now consider the case that is a descendant of . In this case, must have a child that also has as descendant. Furthermore, it must hold that . Because is also a descendant of , also must have a child such that . For every other child of , it must hold that . As a consequence, it is possible that has forwarded the message to a yet unknown child represented by of instead. Since is connected to in the hypothetical overlay, there is at least one plausible trajectory that includes the sequence instead of .
Case : If is a descendant of , then it must hold that , since is not the parent of . As a consequence, for every child of , it must hold that . Thus, may forward the message to a yet unknown child of represented by in the hypothetical overlay. As a consequence, there is a plausible trajectory that includes the sequence instead of .
If is a descendant of , then it similarly must hold that , since is not the parent of . Analogously to the previous case, it thus follows that there is a plausible trajectory that includes the sequence instead of . ∎
Note that Theorem 5 holds irrespective of whether the target coordinate of the message is obfuscated, as described in Section 4.2. However, if the target coordinates are not obfuscated, an adversary that inspects received messages may eventually learn almost all coordinates that are currently assigned to nodes and thus become more confident about the absence of yet unknown nodes. For scenarios where the overhead incurred by coordinate obfuscation is considered too high, our proof of Theorem 5 suggests that the deliberate introduction of fake child nodes by nodes that actually have only a single child node is a protection measure worth further investigation.
One limitation of the proof is that it is restricted to settings where the adversary cannot determine the coordinate of the actual originator of the message. However, as explained before, the coordinate of the sender can be obfuscated via different means to prevent monitoring by traversed nodes.
5. Simulation study
In the previous section, we showed that malicious participants can unambiguously infer tree links from observed coordinates while the monitoring of message trajectories does not allow unambiguous inferences most of the time. While the obfuscation scheme from VOUTE (Roos et al., 2016) outlined in Section 4.2 can be used to render the coordinates included in data packets useless for inference of tree links, it does not prevent inferences from the coordinates propagated by the embedding algorithm.
To evaluate the privacy risk posed by the fact that every node learns the actual logical coordinate of each of its neighbors in realistic settings, we performed a simulation study using OMNeT++. In particular, this study investigates how many previously unknown nodes malicious participants can infer from the observed coordinates. We chose to do a simulation instead of a measurement study on a real-world F2F overlay, since the routing considered in this paper is not yet in use by any currently deployed overlay.
Metrics: Given an adversary with malicious node set and knowledge , let denote the overlay the attacker is aware of after it has processed the logical coordinates assigned to the neighbors of malicious nodes. We measure the number of newly discovered nodes by the number of pseudonyms . As the adversary can only de-anonymize nodes he is aware of, thus gives the maximum number of users the adversary may de-anonymize based on routing information.
Datasets: Because existing F2F overlays have not yet reached widespread adoption and are designed to hinder collection of topology information, there are currently no network snapshots available for investigation. Given that F2F overlays resemble social trust relationships, we thus leverage datasets obtained from crawling online social networks, whose characteristics are presented in Table 1. All of these graphs are undirected.
SPI denotes a graph obtained from a German university social network (Paul and others, 2016). Brightkite (BK) denotes a graph obtained by crawling the Brightkite location-based online social network (Cho and others, 2011). WoT represents a subgraph of a snapshot from the PGP Web of Trust taken on February 7, 2012 from the wotsap database (www.lysator.liu.se/~jc/wotsap/wots2/, accessed 2021-01-08). As the original snapshot was a directed graph, we first removed any links between pairs of nodes that do not have links in both directions. The WoT graph used for our study consists of the largest connected component of the modified snapshot.
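The preprocessing of the WoT snapshot can be sketched as follows. This is a minimal illustration and not the tooling used for the study: the edge-list input format and both function names are assumptions, and nodes that are left without any mutual link simply drop out of the graph.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple

Adjacency = Dict[str, Set[str]]


def mutual_undirected(edges: Iterable[Tuple[str, str]]) -> Adjacency:
    """Keep only links that exist in both directions and make them undirected."""
    directed = set(edges)
    adjacency: Adjacency = {}
    for u, v in directed:
        if u != v and (v, u) in directed:
            adjacency.setdefault(u, set()).add(v)
            adjacency.setdefault(v, set()).add(u)
    return adjacency


def largest_component(adjacency: Adjacency) -> Adjacency:
    """Return the subgraph induced by the largest connected component."""
    seen: Set[str] = set()
    best: Set[str] = set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects the component containing `start`.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return {node: adjacency[node] & best for node in best}


# Hypothetical directed snapshot: c -> d has no reverse link and is discarded.
snapshot = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "b"),
            ("c", "d"), ("e", "f"), ("f", "e")]
graph = largest_component(mutual_undirected(snapshot))
```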
Model, System Parameters, and Set-up: For our study, we implemented two state-of-the-art embedding algorithms, namely Greedy Forest Routing (GFR) (Houthooft and others, 2015) and VOUTE (Roos et al., 2016). In contrast to GFR, which uses enumeration indexes as coordinate elements, VOUTE uses random numbers, thus preventing the inference of further sibling nodes and their coordinates. While both algorithms allow the redundant construction of multiple embeddings, we chose the parameters of the algorithms such that a single BFS spanning tree with a randomly chosen root node is constructed in each simulation run. Since each embedding assigns a different logical coordinate to every node, the inference of network structure across multiple parallel embeddings is non-trivial. We thus consider this task to be an interesting avenue for further research.
As our adversary can only obtain information from compromised nodes, we performed simulations with compromised nodes for each graph. We use fixed values for instead of a fraction of the graph’s number of nodes, as this allows us to focus on the effect of graph structure on the effectiveness of the attack. Otherwise, we could not tell whether an increase in the number of inferred pseudonyms for large graphs stems mostly from the increased number of compromised nodes.
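The difference between the two kinds of coordinate elements can be illustrated with a centralized sketch. It is not an implementation of GFR or VOUTE, both of which are distributed protocols. It only builds a BFS spanning tree and labels children either with consecutive indexes (assumed here to start at 0) or with random numbers. The function name, the `bits` parameter and the adjacency-dictionary input are hypothetical.

```python
import random
from collections import deque
from typing import Dict, Set, Tuple

Coord = Tuple[int, ...]


def bfs_embedding(adjacency: Dict[str, Set[str]], root: str,
                  randomized: bool, rng: random.Random,
                  bits: int = 16) -> Dict[str, Coord]:
    """Assign each node the coordinate of its BFS parent plus one extra element."""
    coords: Dict[str, Coord] = {root: ()}
    queue = deque([root])
    while queue:
        parent = queue.popleft()
        used: Set[int] = set()
        next_index = 0
        for child in sorted(adjacency[parent]):
            if child in coords:
                continue  # already attached to the tree via another node
            if randomized:
                # Random element, redrawn on collision among siblings.
                element = rng.getrandbits(bits)
                while element in used:
                    element = rng.getrandbits(bits)
            else:
                # Enumeration index: children are numbered consecutively.
                element = next_index
                next_index += 1
            used.add(element)
            coords[child] = coords[parent] + (element,)
            queue.append(child)
    return coords


# Hypothetical four-node overlay rooted at "r".
overlay = {"r": {"a", "b"}, "a": {"r", "c"}, "b": {"r"}, "c": {"a"}}
enumerated = bfs_embedding(overlay, "r", randomized=False, rng=random.Random(0))
randomized = bfs_embedding(overlay, "r", randomized=True, rng=random.Random(0))
```

In both variants, the coordinate of a node extends the coordinate of its parent by one element. Only in the enumerated variant does an element's value carry information about further siblings.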
For each graph, we determined the compromised nodes by randomly selecting a subset of nodes from that serve as malicious nodes. For each value of , we randomly selected 20 sets of malicious nodes such that each set results in compromised nodes.
We implemented two types of adversarial behaviors: In the first scenario, the malicious nodes follow the embedding algorithm correctly. In the second scenario, each malicious node acts to each non-malicious neighbor as if it does not have other neighbors, thus always becoming a child node of . The only exception is that if a malicious node is chosen as root node, it follows the embedding algorithm correctly. Whenever a compromised node becomes the child of a malicious node , the coordinate of only reveals pseudonymous nodes that can already be inferred from ’s coordinate. Thus, we expect the number of inferred pseudonyms to increase when the malicious nodes actively keep non-malicious nodes from becoming their child.
Each simulation run for a given graph and set of malicious nodes proceeded as follows: first, create a network with nodes and add a corresponding link for each edge in . Subsequently, configure the nodes in according to the simulated adversarial behavior and initialize the adversary’s knowledge such that , , and . Afterwards, run the simulation until all nodes have received a coordinate. Whenever a compromised node notifies a malicious neighbor about its coordinate , any tree links and coordinates that can be inferred as described in Section 4.2 and are not yet included in knowledge are added.
Figure 6 shows the mean value for across the different graphs, attacker behaviors and number of compromised nodes. Each point in Figure 6 shows the mean value of over all 20 sets of malicious nodes, with 50 runs done per set.
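The knowledge update in the last step can be sketched as follows. Since Section 4.2 is not reproduced here, the inference rules are assumptions: every prefix of an observed coordinate is taken to be the coordinate of an ancestor, and, if enumeration indexes are used, an element with value i is taken to imply siblings with indexes 0 to i-1. All names are hypothetical.

```python
from typing import Set, Tuple

Coord = Tuple[int, ...]


def infer_coordinates(observed: Coord, enumerated: bool) -> Set[Coord]:
    """Coordinates whose existence follows from one observed coordinate."""
    # Every prefix, including the empty root coordinate and the coordinate itself.
    inferred = {observed[:k] for k in range(len(observed) + 1)}
    if enumerated:
        # Assumption: indexes are handed out consecutively starting at 0,
        # so index i reveals the siblings with indexes 0 .. i-1.
        for k, element in enumerate(observed):
            for smaller in range(element):
                inferred.add(observed[:k] + (smaller,))
    return inferred


def tree_links(coords: Set[Coord]) -> Set[Tuple[Coord, Coord]]:
    """Parent-child links implied by a set of coordinates."""
    return {(c[:-1], c) for c in coords if c}


def update_knowledge(known: Set[Coord], observed: Coord, enumerated: bool) -> int:
    """Add newly inferred coordinates to `known` and return how many were new."""
    new = infer_coordinates(observed, enumerated) - known
    known |= new
    return len(new)


# Hypothetical run: the adversary initially knows only the root coordinate.
known_enumerated: Set[Coord] = {()}
known_random: Set[Coord] = {()}
gain_enumerated = update_knowledge(known_enumerated, (2, 1), enumerated=True)
gain_random = update_knowledge(known_random, (2, 1), enumerated=False)
```

In this toy run, the same observed coordinate yields more newly inferred coordinates when elements are enumeration indexes than when they are random numbers.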
By comparing the results from GFR with the results from VOUTE, it becomes apparent that the usage of enumeration indexes as coordinate allows malicious participants to infer roughly one order of magnitude more participating nodes and their coordinates than if random numbers are used. While an adversary able to compromise 200 participants discovered on average around pseudonymous nodes on the SPI graph if VOUTE is used, they discovered around nodes if GFR is used. On the Facebook graph, the number of inferred pseudonymous nodes increased from if VOUTE is used to if GFR is used.
The more elements a coordinate announced by a compromised node has, the more likely it is that new pseudonymous nodes can be inferred. As the average length of node coordinates decreases as the average hop distance to the root node decreases, we expected the number of inferred pseudonymous nodes to drop as the average shortest path length shrinks. Contrary to our expectation, the mean value for on the Facebook graph was always the highest across all graphs, even though it has the lowest shortest path length on average among all graphs used for our study. At the same time, the mean value for on the Web-of-Trust graph was always the lowest across all graphs for those runs where VOUTE is used as embedding algorithm, despite its high average shortest path length. These results indicate that is more strongly affected by other properties, such as the graph’s number of nodes, degree sequence as well as clustering.
By letting malicious nodes actively deviate from the correct behavior, the adversary is indeed able to infer more pseudonymous nodes than if malicious nodes operate correctly, although at a very limited scale for both embedding algorithms. For example, on the Brightkite graph, the mean value for given 1000 compromised nodes increased by from to inferred pseudonyms if GFR is used when malicious nodes actively misbehaved. In the runs where VOUTE was used, increased by from to . Similarly, on the Web-of-Trust graph, the mean value for increased by from to for the runs with GFR and increased by from to for the runs with VOUTE.
Summary of results: Our study indicates that in overlay networks resembling social graphs, the usage of randomized coordinate elements reduces the number of participants that an attacker can infer from observed coordinates by at least one order of magnitude. Contrary to our intuition, our results show that the average shortest path length is not the most decisive factor for the number of pseudonymous nodes the attacker is able to infer. Furthermore, by letting malicious nodes only become leaf nodes, an attacker can increase the number of inferred pseudonyms by up to roughly .
In this work, we analyzed the vulnerability of routing based on rooted spanning tree embeddings to inference attacks, in which adversaries aim to detect or even identify participants in an overlay network. We showed that malicious participants can partially infer the structure of the encoded spanning tree from observed coordinates. Furthermore, as most currently proposed algorithms use enumeration indexes as coordinate elements, malicious participants can additionally infer the coordinates of child nodes from each element. To evaluate the feasibility of link inferences from observed data messages, we introduced the concept of a hypothetical overlay to represent the topological knowledge of an attacker, which takes potentially unknown links and participants into account. Based on this concept, we showed that inference of links beyond the direct neighborhood of malicious nodes is not possible if the attacker cannot determine the originator of a message.
Our simulation study indicates that in social graph-like networks, such as F2F overlays, the usage of random numbers as coordinate elements instead of enumeration indexes reduces the number of inferred tree nodes by more than one order of magnitude. Furthermore, by letting malicious nodes keep their non-malicious neighbors from choosing them as parent in the spanning tree, an attacker can increase the number of inferred tree nodes by up to .
From the proof regarding the inference of links from data messages, we identified the introduction of fake child nodes as a protection measure against link inferences in settings where the attacker may be aware of the coordinates of nearly all nodes. Further research is needed to design such a countermeasure in a way that an attacker cannot easily detect whether a particular coordinate belongs to a fake or an actual child node. Moreover, further work is needed to investigate inferences that can be made if multiple such embeddings are formed in parallel and in the presence of network dynamics.
- Topology-hiding computation on all graphs. Journal of Cryptology, pp. 176–227.
- Towards a scalable censorship-resistant overlay network based on WebRTC covert channels. In Proceedings of the 1st International Workshop on Distributed Infrastructure for Common Good, pp. 37–42.
- Topology inference of unknown networks based on robust virtual coordinate systems. Transactions on Networking, pp. 405–418.
- Routing in wireless networks with position trees. In International Conference on Ad-Hoc Networks and Wireless, pp. 32–45.
- Friendship and mobility: user movement in location-based social networks. In Knowledge Discovery and Data Mining.
- Private communication through a network of trusted connections: the dark Freenet. Available at: freenetproject.org/assets/papers/freenet-0.7.5-paper.pdf
- Maximum likelihood network topology identification from edge-based unicast measurements. SIGMETRICS Performance Evaluation Review, pp. 11–20.
- Tor: the second-generation onion router. Technical report, Naval Research Lab, Washington DC.
- Improved algorithms for network topology discovery. In Workshop on Passive and Active Network Measurement.
- The GNUnet System. Université de Rennes 1.
- Scalable routing easy as PIE: A practical isometric embedding protocol. In ICNP.
- Greedy Embedding, Routing and Content Addressing for Darknets. In KiVS/NetSys.
- Network topology inference with partial information. Transactions on Network and Service Management, pp. 406–419.
- Robust geometric forest routing with tunable load balancing. In INFOCOM.