Fault Tolerant Network Constructors

by Othon Michail, et al.
University of Liverpool

In this work we examine what graphs (networks) can be stably and distributedly formed if adversarial crash failures may happen. Our dynamic graphs are constructed by fixed memory protocols, which are like population protocols but also allow nodes to form/delete links when pairwise interactions occur (Network Constructors). First, we consider standard Network Constructors (i.e. without fault notifications) and we partially characterize the class of such protocols that are fault-tolerant. We show that the class is non-empty but small. Then, we assume a minimal form of fault notifications (N-NET protocols) and we give fault-tolerant protocols for constructing graphs such as spanning star and spanning line. We show a fault tolerant construction of a Turing Machine M that allows a fault tolerant construction of any graph accepted by M in linear space with a population waste of min{n/2 + f(n), n} (due to the construction of M), where f(n) is an upper bound on the number of faults. We then extend the class of graphs to any graph accepted in O(n^2) space, by allowing min{2n/3 + f(n), n} waste. Finally, we use non-constant memory to achieve a general fault-tolerant restart of any N-NET protocol with no waste.




1 Introduction and Related Work

In this work, we address the issue of the dynamic formation of graphs under faults. We do this in a minimal setting, that is, a population of agents running Population Protocols that can additionally activate/deactivate links when nodes meet. This model, called Network Constructors, was introduced in [1] and is strongly inspired by the Population Protocol (PP) model [2] and the Mediated Population Protocol (MPP) model [3]. Population Protocols run on networks that consist of computational entities called agents. One of the challenging characteristics is that the agents have no control over the schedule of interactions with each other. In a population of agents, repeatedly a pair of agents is chosen to interact, and they update their states based on their previous states. In general, the interactions are scheduled by a fair scheduler. When the execution time of a protocol needs to be examined, a very common example of a fair scheduler is the selection of pairs uniformly at random. The main difference between PPs and Network Constructors is that in the PP (and the MPP) models, the focus is on the computation of functions of some input values, while Network Constructors are mostly concerned with the stable formation of networks satisfying some graph property. Fault tolerance must now additionally account for the graph configuration; thus, previous results on self-stabilizing PPs and MPPs [4, 5] do not apply here.

In [1], Michail and Spirakis give protocols for several basic network construction problems, and they prove several universality results by presenting generic protocols that are capable of simulating a Turing Machine and exploiting it in order to stably construct a large class of networks, in the absence of crash failures.

In this work, we examine what networks can be stably formed if adversarial crash faults may exist. Here, adversarial crash faults mean that an adversary knows the rules of the protocol and can select at any time some node to remove from the population. We assume that the faults can only happen sequentially, that is, in every step at most one fault may occur.

A main difference between our work and existing self-stabilization approaches is that, due to constant local memory combined with possibly unbounded (e.g. linear) connections with other nodes, the nodes cannot distinguish whether they still have some activated connections with the remaining nodes or not, after a fault has occurred. This difficulty is the reason why it is not sufficient to just restart the state of a node in case of a fault, hence existing self-stabilization approaches cannot be directly applied here [6, 7]. In addition, in contrast to previous self-stabilizing approaches [8, 9] that are based on shared memory models, two adjacent nodes can only store one bit of memory in the edge joining them, which denotes the existence or not of a connection between them.

Angluin et al. [12] incorporated the notion of self-stabilization into the population protocol model, giving self-stabilizing protocols for some problems such as leader election. They focus on the goal of stably maintaining some property such as a legal coloring of the communication graph, or having a unique leader.

A previous work of Delporte-Gallet et al. [10] studies the issue of correctly computing functions on the node inputs in the Population Protocol model [2], in the presence of crash faults and transient faults that can corrupt the states of the nodes. They construct a transformation which makes any protocol that works in the failure-free setting tolerant to such failures, as long as modifying a small number of inputs does not change the output. Guerraoui and Ruppert [11] introduced a new model, called Community Protocol, which is inspired by the Population Protocol model, but the nodes have unique identifiers and enough memory to store a constant number of other agents’ identifiers. They show that this model can solve any decision problem in NSPACE(n log n) while tolerating a constant number of Byzantine failures.

In [13], Peleg studies logical structures, constructed over static graphs, that need to satisfy the same property on the resulting structure after node or edge failures. He distinguishes between the stronger type of fault-tolerance obtained for geometric graphs (termed rigid fault-tolerance) and the more flexible type required for handling general graphs (termed competitive fault-tolerance). It differs from our work, as we address the problem of constructing such structures over dynamic graphs and we study fault-tolerance of distributed models.

Our contribution: A Network Constructor (NET) protocol stabilizes to a network satisfying some graph property $P$, starting from an initial configuration where all nodes are in the same state and all connections are disabled. The protocols in [1] do not consider any type of faults, and it is not clear whether they can tolerate even a single fault. In this work, we formally define the model that extends NET with crash failures, and we examine NET protocols in the presence of such faults. Whenever a node crashes, it is removed from the population, along with all its activated edges. This leaves the remaining population in a state where some actions may need to take place in order to eventually stabilize to a correct network. We answer the following questions: Can we always re-stabilize to a correct graph in this setting, and if not, what is the class of graph properties for which we can always find a fault-tolerant protocol? What are the minimal additional assumptions that we need to make in order to find fault-tolerant protocols for a bigger class of properties?

In Section 3, we study the class of properties for which we are able to design protocols that tolerate any number of faults. We show that this class is non-empty but very small, and then we show that for a wider class of properties, such protocols do not exist, if we do not make further assumptions (e.g. fault notifications or non-constant memory).

The main source of difficulty in the standard NET model (call it SNET) is that after a crash fault, it is not possible for the remaining population to detect the absence of the crashed node, with the purpose of taking actions and eventually re-stabilizing to a correct graph. Also, alive nodes cannot sense the changes in the links that were attached to them before faults occurred (crash faults change the degrees of alive nodes). This means that even if the faults occur only after stabilization, we show for some graph properties that the protocol cannot update the network (in order to fix it), unless it would incorrectly update a stable network in some other execution.

In light of the impossibilities in the SNET model, we introduce the minimal additional assumption of fault notifications on some nodes of the population (N-NET model). In particular, after a fault on some node occurs, its adjacent nodes (if any) are notified. If no adjacent nodes exist, an arbitrary node in the population is notified. In that way, we guarantee that at least one node in the population will sense the removal of the crashed node. (Some constructions work without notifications in the case of a crash failure on an isolated node, but for some of them it is essential.)

In Section 4.1, we give protocols for some otherwise infeasible graph properties that we are now able to construct while tolerating any number of crash failures.

We go one step further, trying to provide universal constructors that can tolerate crash failures. To this end, we allow the nodes to toss a fair coin during an interaction (PN-NET model), and in Section 4.2 we investigate the more generic question of what is in principle constructible. We call useful space the number of nodes that eventually form the graph that satisfies the required property, and waste the rest of the population. The idea is based on [1], where several universality results are shown by constructing, on part of the population, a network capable of simulating a Turing Machine (waste), and then repeatedly constructing a random network $G$ on the remaining nodes (useful space). The Turing Machine that decides the target language $L$ is then executed with the network $G$ as input. If the Turing Machine accepts, it outputs $G$; otherwise, a random graph is constructed again. A fault-tolerant extension of this is the core idea of our universality results for the PN-NET model, tolerating any number of crash failures.

In order to give fault-tolerant protocols without waste, in Section 4.3 we design a protocol that can be composed in parallel with any N-NET protocol in order to make it fault-tolerant. The idea is to restart the protocol whenever a crash failure occurs. We show that restarting is impossible with constant local memory if the nodes form an unbounded number of connections. For this reason, we need to supply the agents with more memory (at most logarithmic in the population size).

Finally, in Section 5 we conclude and discuss further interesting open problems.

2 Model and Definitions

A Standard Network Constructor (SNET) is a distributed protocol defined by a 4-tuple $(Q, q_0, Q_{out}, \delta)$, where $Q$ is a finite set of node-states, $q_0 \in Q$ is the initial node-state, $Q_{out} \subseteq Q$ is the set of output node-states, and $\delta : Q \times Q \times \{0, 1\} \to Q \times Q \times \{0, 1\}$ is the transition function. The system consists of a population $V_I$ of $n$ distributed processes (also called nodes). In the generic case, there is an underlying interaction graph $G_I = (V_I, E_I)$ specifying the permissible interactions between the nodes. In this work, $G_I$ is a complete undirected interaction graph, i.e. $E_I = \{uv : u, v \in V_I \text{ and } u \neq v\}$.

The main difference between this model and the Population Protocol model is that the edges have binary states (active or inactive). In other words, we say that the nodes are allowed to form connections between them. During a (pairwise) interaction, the agents are allowed to access the state of their joining edge and either activate it (state = 1) or deactivate it (state = 0). When the edge state between two nodes $u$ and $v$ is activated, we say that $u$ and $v$ are connected, or adjacent at that time $t$, and we write $u \sim_t v$. Initially, all nodes are in the same state and all connections are inactive. The goal is for the processes, after interacting and activating/deactivating connections for a while, to end up with a desired stable network, which satisfies some graph property $P$.

In this work, we present a version of this model that allows adversarial crash failures. A crash (or halting) failure causes an agent to cease functioning and play no further role in the execution. We also discuss edge failures throughout the paper. An edge failure disconnects two adjacent nodes (i.e. the edge state between two nodes is altered from 1 to 0).

The execution of a protocol proceeds in discrete steps. In every step, a pair of nodes $uv$ from $E_I$ is selected by an adversary scheduler, subject to some fairness guarantee. These nodes interact and update their states and the state of the edge between them according to a joint transition function $\delta$. If two agents in states $q_u$ and $q_v$ with the edge joining them in state $q_{uv}$ encounter each other, they can change into states $q_u'$, $q_v'$ and $q_{uv}'$, where $(q_u', q_v', q_{uv}') = \delta(q_u, q_v, q_{uv})$. Without loss of generality, assume that the transition function is symmetric: $\delta(q_u, q_v, q_{uv}) = (q_u', q_v', q_{uv}')$ implies $\delta(q_v, q_u, q_{uv}) = (q_v', q_u', q_{uv}')$.
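A single interaction step can be sketched as follows. This is a minimal illustration, assuming the joint transition function is stored as a Python dictionary that maps a triple (state, state, edge bit) to the updated triple, that missing entries mean "no effect", and that pairs are drawn uniformly at random (one common fair scheduler). The names `interact` and `random_step` and the toy rule in the usage note are hypothetical and not part of the model definition.

```python
import random


def interact(states, edges, delta, u, v):
    """Apply one pairwise interaction between distinct nodes u and v.

    states: dict node -> state
    edges:  set of frozenset({u, v}) holding the currently active edges
    delta:  dict (q_u, q_v, edge_bit) -> (q_u', q_v', edge_bit')
    Returns True if some rule fired, False if the interaction had no effect.
    """
    e = frozenset((u, v))
    bit = 1 if e in edges else 0
    # Symmetry: a rule written for (q_u, q_v) is also tried as (q_v, q_u).
    for a, b in ((u, v), (v, u)):
        key = (states[a], states[b], bit)
        if key in delta:
            states[a], states[b], new_bit = delta[key]
            if new_bit:
                edges.add(e)
            else:
                edges.discard(e)
            return True
    return False


def random_step(states, edges, delta, rng):
    """One step of a uniformly random scheduler (assumed fair scheduler)."""
    u, v = rng.sample(sorted(states), 2)
    return interact(states, edges, delta, u, v)
```

For instance, with the hypothetical single rule `{("a", "a", 0): ("a", "b", 1)}`, two nodes in state `a` become connected and one of them moves to state `b`; a second interaction of the same pair has no effect because no rule matches.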

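A configuration is a mapping $C$ specifying the state of each node and each edge of the interaction graph. An execution of the protocol is a finite or infinite sequence of configurations, $C_0, C_1, C_2, \ldots$, each of which is a multiset of node-states drawn from $Q$ and edge-states drawn from $\{0, 1\}$. In the initial configuration $C_0$, all nodes are in state $q_0$ and all edges are inactive. A configuration $C_k$ is obtained from $C_{k-1}$ by one of the following types of transitions:

  1. Ordinary transition: two nodes $u$ and $v$ interact, and their states and the state of the edge $uv$ are updated according to $\delta$.

  2. Crash failure: a node $u$ is removed from the configuration, together with the states of all its incident edges.

  3. Null step: $C_k = C_{k-1}$.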

We say that $C'$ is reachable from $C$ and write $C \rightsquigarrow C'$, if there is a sequence of configurations $C = C_0, C_1, \ldots, C_t = C'$, such that $C_i \to C_{i+1}$ for all $i$, $0 \le i < t$. The fairness condition that we impose on the scheduler is quite simple to state. Essentially, we do not allow the scheduler to avoid a possible step forever. More formally, if $C$ is a configuration that appears infinitely often in an execution, and $C \to C'$, then $C'$ must also appear infinitely often in the execution. Equivalently, we require that any configuration that is always reachable is eventually reached.

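We define the output of a configuration $C$ as the graph $G(C) = (V, E)$ where $V = \{u \in V_I : C(u) \in Q_{out}\}$ and $E = \{uv : u, v \in V, u \neq v, \text{ and } C(uv) = 1\}$. If there exists some step $t \ge 0$ such that $G(C_i) = G$ for all $i \ge t$, we say that the output of an execution $C_0, C_1, \ldots$ stabilizes (or converges) to graph $G$, every configuration $C_i$, for $i \ge t$, is called output-stable, and $t$ is called the running time under our scheduler.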
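The output of a configuration and the point after which a recorded run no longer changes can be computed as in the sketch below. It assumes a configuration is held as a dictionary of node states plus a set of active edges; `output_graph` and `last_change_step` are hypothetical helper names. A finite trace can only show when the output last changed; it cannot certify that the execution is output-stable forever.

```python
def output_graph(states, edges, q_out):
    """Return (V, E): the nodes in an output state and the active edges among them.

    states: dict node -> state
    edges:  set of frozenset({u, v}) holding the active edges
    q_out:  set of output node-states
    """
    nodes = {u for u, q in states.items() if q in q_out}
    active = {e for e in edges if e <= nodes}
    return nodes, active


def last_change_step(outputs):
    """Smallest t such that outputs[i] == outputs[-1] for all i >= t.

    `outputs` is the list of output graphs recorded along a finite prefix
    of an execution, so the result is only a candidate stabilization time.
    """
    t = len(outputs) - 1
    while t > 0 and outputs[t - 1] == outputs[-1]:
        t -= 1
    return t
```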

Finally, we say that an SNET protocol stabilizes eventually to a graph of type $P$ if and only if after a finite number of pairwise interactions, the graph defined by "on" edges does not change and has property $P$. We call that stable graph the graph $G$.

Definition 1.

Let $P$ be a property of graphs. Two graphs $G_1$ and $G_2$ are said to be equivalent under property $P$, or belong to the same class under $P$, if and only if both have property $P$. We denote this by $G_1 \sim_P G_2$.

Definition 2.

Let $\Pi$ be an SNET protocol that stabilizes to the graph $G$, having property $P$. $\Pi$ is called k-fault-tolerant iff there exists a size $n_0$ such that for any population size $n \ge n_0$, $\Pi$ stabilizes to a graph $G'$ that is equivalent to $G$ under $P$, even if a sequence of up to $k$ crash failures occurs during an execution. We also call $\Pi$ fault-tolerant if it stabilizes to a graph that is equivalent to $G$ under $P$, regardless of the number of faults.

To define N-NETs, we now extend the Standard Network Constructors model with a fault flag in each agent. When a node $u$ crashes at time $t$, every node $v$ which was adjacent to $u$ at time $t$ (i.e. $u$ and $v$ were connected at that time) is notified, that is, the fault flag of every such $v$ becomes 1. In the case where $u$ is an isolated node (i.e. it has no enabled connections), a (random) node in the network is notified, and its fault flag becomes 2. At any time, the agents are allowed to access the fault flag and reset it to zero. We call this model N-NET.
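The notification mechanism can be sketched as below. This is an illustration under stated assumptions: a fault flag of 0 means "no notification", and the two non-zero values 1 (a neighbour crashed) and 2 (an isolated node crashed somewhere) are assumed encodings; the function name `crash_with_notification` is hypothetical.

```python
import random


def crash_with_notification(states, flags, edges, u, rng):
    """Crash node u in an N-NET style population.

    states: dict node -> state
    flags:  dict node -> fault flag (0 = not notified; 1 and 2 are assumed values)
    edges:  set of frozenset({u, v}) holding the active edges
    Removes u and its active edges, then raises fault flags:
    every former neighbour gets flag 1; if u was isolated, one
    arbitrary surviving node gets flag 2.
    Returns the set of former neighbours of u.
    """
    neighbours = {w for e in edges if u in e for w in e if w != u}
    edges.difference_update({e for e in edges if u in e})
    del states[u]
    del flags[u]
    if neighbours:
        for w in neighbours:
            flags[w] = 1
    elif states:
        # Isolated crash: notify one arbitrary surviving node.
        flags[rng.choice(sorted(states))] = 2
    return neighbours
```

A protocol built on top of this would read a node's flag, apply its fault rule, and reset the flag to zero.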

More formally, the set of node-states is $Q \times \{0, 1, 2\}$, and for clarity in our descriptions and protocols, we define two types of transition functions. The first one determines the state/connection updates of pairwise interactions ($\delta_1$), while the second transition function determines the state updates after a fault ($\delta_2$). The first transition function is triggered after a pairwise interaction, while $\delta_2$ is triggered right after a fault.

The separation of these transition functions is equivalent to the case where only one transition function $\delta$ exists. Consider the case where a node $u$ crashes, notifying a node $v$ in the population (its fault flag becomes either 1 or 2). Then, in the first case (separate transition functions), $v$ is instantly allowed to update its state, while in the second case (unified transition functions), $v$ waits until its next interaction with a node $w$, applying the rule of $\delta_2$ independently of the state and connection of $w$. During the same interaction, $v$ and $w$ can also update their states and connections based on the corresponding rule of $\delta_1$.

Finally, we define PN-NET in precisely the same way as N-NET, but in extension to the above model, every pair of processes is capable of tossing an unbiased coin during an interaction between them.

3 On the existence of Fault-Tolerant SNET Protocols

In this section, we study the existence of fault-tolerant protocols in the SNET model. We say that a protocol $\Pi$ constructs a graph property $P$ if every execution of $\Pi$ on a population of agents stabilizes on a graph with property $P$. We show that not all properties can be constructed by an SNET protocol under faults, but there is a class of properties that has fault-tolerant SNET protocols for any number of crash failures.

Definition 3.

Let $G$ be a graph with property $P$. Call $u$ a critical node of $G$ if, by removing $u$ and all its edges at time $t$, the resulting network $G'$ does not satisfy property $P$ (i.e. $G'$ and $G$ are not equivalent under $P$).

In other words, if there are no critical nodes in $G$, then any (induced) subgraph of $G$ that can be obtained by removing nodes and all their edges (crash failures) also satisfies $P$. The properties that satisfy this are known as hereditary properties in the literature.

Definition 4.

A property $P$ is called hereditary if for any graph $G$ with property $P$, every induced subgraph of $G$ also satisfies $P$. In other words, $G$ has no critical nodes.

Examples of hereditary properties are “Bipartite graph”, “Planar graph”, “Forest of trees”, “Clique”, “Set of cliques”, “Maximum node degree at most $\Delta$” and so on. We call Hereditary the class of all hereditary properties.

We now define a subclass of this class of properties, which we call Preserving Graphs or PG.

Definition 5.

A property $P$ is called preserving if for any graph $G$ with property $P$, every subgraph of $G$ (not necessarily induced) also satisfies $P$.

Examples of preserving properties are “Bipartite graph”, “Planar graph”, “Maximum node degree at most $\Delta$” and so on. We call Preserving Graphs or PG the class of all preserving properties.

Theorem 1.

PG is a subclass of Hereditary.


Proof. Consider a property $P \in$ PG, and a graph $G$ of type $P$. Then, if we remove any node $u$ and all its edges, the resulting graph $G'$ should still have property $P$, as $G'$ is a subgraph of $G$ and $P \in$ PG. Thus, PG $\subseteq$ Hereditary. Now, consider the property “Clique”. If we remove a node and all its edges from a graph $G$ of type “Clique”, the resulting graph is still a clique of smaller size. However, any subgraph of $G$ which consists of all the nodes of $G$ and only a proper subset of its edges is not a clique. Thus, “Clique” $\in$ Hereditary, but “Clique” $\notin$ PG. ∎
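The separation used in the proof can be checked by brute force on a small instance. The sketch below enumerates all induced subgraphs and all spanning subgraphs (same node set, any subset of edges) of the complete graph on four nodes and tests the clique property on each; all function names are hypothetical helpers introduced for this illustration.

```python
from itertools import combinations


def complete_graph(n):
    """Nodes 0..n-1 with every pair connected."""
    nodes = set(range(n))
    edges = {frozenset(p) for p in combinations(range(n), 2)}
    return nodes, edges


def is_clique(nodes, edges):
    """True if every pair of the given nodes is joined by an edge."""
    return all(frozenset(p) in edges for p in combinations(sorted(nodes), 2))


def induced_subgraphs(nodes, edges):
    """All subgraphs obtained by deleting nodes together with their edges."""
    ordered = sorted(nodes)
    for k in range(len(ordered) + 1):
        for keep in combinations(ordered, k):
            kept = set(keep)
            yield kept, {e for e in edges if e <= kept}


def spanning_subgraphs(nodes, edges):
    """All subgraphs on the full node set with an arbitrary subset of edges."""
    edge_list = sorted(edges, key=sorted)
    for k in range(len(edge_list) + 1):
        for keep in combinations(edge_list, k):
            yield set(nodes), set(keep)


def holds_for_all(prop, graphs):
    """True if the property holds on every (nodes, edges) pair."""
    return all(prop(n, e) for n, e in graphs)
```

On the four-node clique, `holds_for_all(is_clique, induced_subgraphs(...))` is true while `holds_for_all(is_clique, spanning_subgraphs(...))` is false, matching the claim that “Clique” is hereditary but not preserving.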

Theorem 2.

If a protocol $\Pi$ stabilizes to a graph $G$ of property $P \in$ PG and if for all $t$, the graph $G_t$ formed at step $t$ is a subgraph of $G$ (i.e. $\Pi$ does not remove any edges), then $\Pi$ resists any sequence of single faults.


Proof. Since $P \in$ PG, then for each $t$, $G_t$ also has property $P$. But then any fault does not destroy the property at any $t$. ∎

In other words, for any property $P$ which is preserving, a protocol that stabilizes to a graph with property $P \in$ PG does not need to deal with the failures in order to fix the configuration, as this class of graphs has the interesting property of maintaining $P$ in every subgraph. Note that protocols for properties in PG tolerate both crash and edge failures. Edge failures corrupt the state of an edge, that is, an activated edge between two nodes is removed, leaving the two corresponding nodes disconnected.

There are some properties for which we can still design fault-tolerant protocols, without having to deal with the crash failures. An example of such a property is the Spanning Clique. Let Clique be the following symmetric protocol. If we consider the case where no crash faults are allowed, for any population size, the Clique Protocol stabilizes to a clique with all the nodes in state r (i.e. the stable graph is a "clique on all nodes" and the property is "clique").

Initial state:
\\All transitions that do not appear have no effect.
Protocol 1 Clique
Lemma 1.

Clique Protocol is fault-tolerant.


Proof. Let $k < n$ and assume that $k$ nodes crash during the execution. Call $V'$ the set of remaining nodes.

(a) If all nodes in $V'$ are in state b, then the remaining nodes shall form a clique (in state r).

(b) If all nodes in $V'$ are in state r, then again, the Clique Protocol stabilizes to a clique.

(c) If $V'$ contains both colors, then the r nodes will convert the b nodes to r and again the Clique Protocol stabilizes to a clique. ∎

Definition 6.

A state $q$ of an SNET protocol $\Pi$ is called critical iff its disappearance from the population at some execution point makes it impossible for $\Pi$ to stabilize to a graph of property $P$ with no crash faults.

This means that if at some point during an execution the remaining population does not have state $q$ in any node, then $\Pi$ will either not stabilize to any graph or stabilize to a graph $G'$ that does not have property $P$. The following observation holds by the definition of the critical states.

Observation 1.

An SNET protocol $\Pi$ is fault-tolerant iff $\Pi$ has no critical states.

Theorem 3.

There exists an SNET protocol with at least one critical state. In other words, not all SNET protocols are fault-tolerant.


Proof. Let $\Pi$ be Protocol 2, which constructs a spanning star.

Initial state:
Protocol 2 Spanning Star

If we do not allow crash faults to happen, then $\Pi$ will stabilize to a graph of type "spanning star", where the center is in one state and the leaves are in another.

Now, assume that the adversary waits until only one node in the center state remains (the center) and then removes it (crash failure). Now, only nodes in the leaf state remain, and with just one fault, $\Pi$ will converge to a set of independent vertices in the leaf state (empty graph). Thus, the center state in protocol $\Pi$ is critical. ∎
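The adversary's strategy can be replayed in a small simulation. The rule table of Protocol 2 is not reproduced in this text, so the sketch below uses an assumed three-rule star constructor with hypothetical state names `center` and `leaf`: two centers connect and one becomes a leaf, a center connects to any leaf, and two connected leaves disconnect. The scheduler is assumed to pick pairs uniformly at random.

```python
import random
from itertools import combinations

# Assumed rule set (not taken from the source): (q_u, q_v, edge) -> (q_u', q_v', edge')
DELTA = {
    ("center", "center", 0): ("center", "leaf", 1),
    ("center", "leaf", 0): ("center", "leaf", 1),
    ("leaf", "leaf", 1): ("leaf", "leaf", 0),
}


def step(states, edges, u, v):
    """Apply the matching rule (in either orientation) to the pair u, v."""
    e = frozenset((u, v))
    bit = 1 if e in edges else 0
    for a, b in ((u, v), (v, u)):
        key = (states[a], states[b], bit)
        if key in DELTA:
            states[a], states[b], new_bit = DELTA[key]
            (edges.add if new_bit else edges.discard)(e)
            return True
    return False


def is_quiescent(states, edges):
    """True if no pair of alive nodes has an applicable rule."""
    for u, v in combinations(sorted(states), 2):
        bit = 1 if frozenset((u, v)) in edges else 0
        if (states[u], states[v], bit) in DELTA or (states[v], states[u], bit) in DELTA:
            return False
    return True


def run(states, edges, rng, max_steps=100_000):
    """Run random pairwise interactions until no rule applies (or the cap is hit)."""
    for _ in range(max_steps):
        if is_quiescent(states, edges):
            return True
        u, v = rng.sample(sorted(states), 2)
        step(states, edges, u, v)
    return is_quiescent(states, edges)


def crash(states, edges, u):
    """Crash failure without notifications: remove u and its active edges."""
    del states[u]
    edges.difference_update({e for e in edges if u in e})
```

Running this on eight nodes until quiescence yields a star; crashing the unique center afterwards leaves seven isolated leaves, and since no rule applies to two unconnected leaves the population stays an empty graph, which is the behaviour the proof describes.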

Here, it is reasonable to ask whether there exists another SNET protocol which is fault-tolerant and stabilizes to a graph of property "spanning star". We call this protocol the "self-stabilizing" version of the Spanning Star protocol.

Theorem 4.

There exists no SNET which would be the self-stabilizing version of the Spanning Star protocol, even with one fault.


Proof. Assume such an SNET protocol $\Pi$ exists. Then $\Pi$ should stabilize to a spanning star regardless of whether a fault occurs or not. Clearly, in any SNET protocol that stabilizes to a spanning star, the eventual state of the center of the star (say $c$) will be different from any of the states of the other nodes (leaves). This is because under any fair scheduler, nodes meet infinitely often. Then, the eventual states of the leaves of the star should enforce no edges between them. Thus, if $c$ was one of the states of the leaves, then no leaf would be connected to the center. Let us run the protocol $\Pi$, until stabilization, under no faults. Let $S$ be the set of states of the leaves, after the spanning star is formed. Now, let the adversary wait until this happens and $c$, $S$ appear. Then, the adversary removes the center node (crash failure). Since $\Pi$ is fault-tolerant, the rules of $\Pi$ should recreate the star. This means that the states in $S$ and the rules of $\Pi$ should create edges among the former leaves. But then, even when no faults occur, the same rules and the same sequence of interactions should create edges in the star (among the former leaves). This contradicts the assumption that $S$ is the set of states of the leaves after the star is formed. Thus, no such $\Pi$ can exist. ∎

Corollary 1.

There is at least one SNET protocol $\Pi$ which cannot have an equivalent fault-tolerant version $\Pi'$.

In a similar way, we show the following lemma.

Lemma 2.

There is no $k$-fault-tolerant SNET protocol for constructing a spanning line, for any $k \ge 1$.


Proof. Assume such an SNET protocol $\Pi$ exists. Then $\Pi$ should stabilize to a spanning line regardless of whether up to $k$ faults occur or not. Clearly, in any SNET protocol that stabilizes to a spanning line, the eventual state (or states) of the endpoints of the line (say $e$) will be different from any of the states of the other nodes. This is because under any fair scheduler, nodes meet infinitely often. Then, the eventual states of the inner nodes of the line should enforce no more edges between them and other nodes. Thus, if $e$ was one of the states of the inner nodes, then no more nodes could be connected to the line, thus, the protocol would end up with many disjoint lines. Let us run the protocol $\Pi$, until stabilization, under no faults. Let $S$ be the set of states of the inner nodes, after the spanning line is formed. Now, let the adversary wait until this happens and $e$, $S$ appear. Then, the adversary removes an inner node (in a state of $S$) from the line (crash failure). Since $\Pi$ is fault-tolerant, the rules of $\Pi$ should recreate the spanning line. This means that the states in $S$ and the rules of $\Pi$ should create edges among the former inner nodes. But then, even in the case where no faults occur, the same rules and the same sequence of interactions should create edges in the line, among the former inner nodes (i.e. a cycle is formed). This contradicts the assumption that $S$ is the set of states of the inner nodes after the spanning line is formed. Thus, no such $\Pi$ can exist. ∎

We now show that if there exists at least one critical node in $G$, there is no SNET protocol that always stabilizes to the correct network even if a single failure occurs during an execution.

Theorem 5.

If there exists a critical node in $G$, there is no 1-fault-tolerant SNET protocol that stabilizes to it.


Proof. Let $\Pi$ be an SNET protocol that stabilizes to a graph $G$, having property $P$ and tolerating one crash failure. Consider an execution $E$ and a sequence of configurations $C_0, C_1, \ldots$ of $E$. Assume a time $t$ at which the output of $E$ has stabilized to graph $G$ (i.e. the output of every configuration $C_i$ with $i \ge t$ is $G$). Let $u$ be a critical node in $G$. Assume that the scheduler removes $u$ and all its edges (crash failure) at time $t' > t$, resulting in a graph $G'$ that does not have property $P$. In order to fix the network, the protocol must change the configuration at some point; for example, a node $v$ changes its state. Now, call $E'$ the execution in which node $u$ does not crash, and in which, from time $t'$ until $v$ changes its state, the node $v$ has the same interactions as in the previous case where node $u$ crashed. Then, $v$ changes its state in order to fix the network, since it cannot distinguish $E'$ from $E$. Whether $u$ crashes or not leads to the same result (i.e. $v$ tries to fix the network thinking that $u$ has crashed). This means that if we are constantly trying to detect faults in order to deal with them, this would happen indefinitely and the protocol would never stabilize. Consider that the network has stabilized to $G$. At some point, because of the infinite execution, a node will surely but wrongly detect a crash failure. Thus, $\Pi$ has not really stabilized. ∎

4 Notified Network Constructors

In this section, we use the N-NET model as described in Section 2, and we investigate whether the additional information in each agent (the fault flag) is sufficient in order to design $k$-fault-tolerant or fault-tolerant protocols, overcoming the impossibility of certain graph properties in the SNET model (graphs with critical nodes).

4.1 Fault-tolerant N-NET protocols via minimal updates

In this section, our goal is to design protocols in which, after a fault, the nodes try to fix the configuration with minimal updates and eventually stabilize to a correct network. We give protocols for some properties, such as spanning star and cycle cover, and in Section 4.2 we give a fault-tolerant spanning line protocol which is part of our generic constructor capable of constructing a large class of networks.

Initial state:
Protocol 3 FT Spanning Star
Lemma 3.

FT Spanning Star is fault-tolerant.


Proof. Assume that any number of faults occur during an execution. Initially, all nodes are in state $b$ (black). Two nodes connect with each other if at least one of them is black; if both of them are black, one of them becomes $r$ (red). A black node can become red only by interaction with another black node, in which case they also become connected. Thus, with no crash faults, a connected component always includes at least one black node. In addition, all isolated nodes are always in state $b$. This is because, if a red node removes an edge it becomes black.

Then, if a (connected) node crashes, the adjacent nodes are notified and the red nodes among them become black; thus, any connected component should again include at least one black node. Now, consider the case where only one black node remains in the population. Then the rest of the population (in state $r$) should be in the same connected component as the unique $b$ node. Then, if the $b$ node crashes, at least one black node will appear; thus, this protocol maintains the invariant that there is always at least one black node in the population. FT Spanning Star then stabilizes to a star with a unique black node in the center. ∎
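The invariant used in the proof (every connected component contains at least one black node) can be checked in simulation. The rule table of Protocol 3 is not reproduced in this text, so the rules below are an assumption reconstructed from the prose of the proof: black nodes connect and one turns red, a black node connects to a red node, two connected red nodes disconnect and turn black, and red neighbours of a crashed node turn black. The rule for two connected black nodes is an extra assumption, and notifications for crashes of isolated nodes are ignored in this sketch.

```python
import random

# Assumed rules (reconstructed from the proof text, not from the protocol listing).
DELTA = {
    ("b", "b", 0): ("b", "r", 1),
    ("b", "b", 1): ("b", "r", 1),  # extra assumption, not stated in the proof
    ("b", "r", 0): ("b", "r", 1),
    ("r", "r", 1): ("b", "b", 0),
}


def step(states, edges, u, v):
    e = frozenset((u, v))
    bit = 1 if e in edges else 0
    for a, b in ((u, v), (v, u)):
        key = (states[a], states[b], bit)
        if key in DELTA:
            states[a], states[b], new_bit = DELTA[key]
            (edges.add if new_bit else edges.discard)(e)
            return True
    return False


def crash(states, edges, u):
    """Remove u and its edges; notified neighbours apply the fault rule (turn black)."""
    neighbours = {w for e in edges if u in e for w in e if w != u}
    edges.difference_update({e for e in edges if u in e})
    del states[u]
    for w in neighbours:
        states[w] = "b"


def components(states, edges):
    """Connected components of the graph of active edges (depth-first search)."""
    adj = {u: set() for u in states}
    for e in edges:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for s in states:
        if s in seen:
            continue
        stack, comp = [s], set()
        while stack:
            x = stack.pop()
            if x not in comp:
                comp.add(x)
                stack.extend(adj[x] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def invariant_holds(states, edges):
    return all(any(states[u] == "b" for u in c) for c in components(states, edges))


def simulate(n, steps, crash_every, rng):
    """Random interactions with periodic random crashes; True if the invariant never breaks."""
    states, edges = {i: "b" for i in range(n)}, set()
    for t in range(steps):
        u, v = rng.sample(sorted(states), 2)
        step(states, edges, u, v)
        if t % crash_every == crash_every - 1 and len(states) > 3:
            crash(states, edges, rng.choice(sorted(states)))
        if not invariant_holds(states, edges):
            return False
    return True
```

The sketch only checks the invariant along one random run; it does not by itself establish convergence to a star, which is what the lemma asserts for the actual protocol.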

Initial state:
Protocol 4 FT Cycle-Cover

Similarly, we can show the following lemma.

Lemma 4.

FT Cycle-Cover is fault-tolerant.

4.2 Universal Fault-Tolerant Constructors with waste

In this section, we ask whether there is a generic fault-tolerant constructor capable of constructing a large class of networks. We first give a fault-tolerant protocol that constructs a spanning line, and then we show that we can simulate a given TM on that line, tolerating any number of crash faults.

Lemma 5.

FT Spanning Line is fault-tolerant.


Proof. Initially, all nodes are in the initial state $q_0$ and they start connecting with each other in order to form lines that eventually merge into one.

When two nodes become connected, one of them becomes a leader and starts connecting with isolated nodes (expands). A leader is always an endpoint. The other endpoint is in a non-leader endpoint state, while the inner nodes are in an inner state. Our goal is to have only one leader on one endpoint, because leaders are also used in order to merge lines. Otherwise, if both endpoints were leaders, the line could form a cycle.

When two leaders meet, they connect (line merge) and a node appears. This state performs a random walk on the line and its purpose is to meet both endpoints (at least once) before becoming an leader. After interacting with the first endpoint, it becomes and changes the endpoint to . Whenever it interacts with the same endpoint they just swap their states from , to , and vice versa. In this way, we guarantee that will eventually meet the other endpoint in state , or . In the first case, the node becomes a leader (), after having walked the whole line at least once.

Now, consider the case where a fault may happen on some node on the line. If the fault flag of an endpoint state becomes , it updates its state to . Otherwise, the line splits into two disjoint lines and the new endpoints become . An becomes a walking state , changes the endpoint into and performs the same process (random walk).

If there are more than one walking states on a line, then all of them are , or and they perform a random walk. None of them can ever satisfy the criterion to become before first eliminating all the other walking states and/or the unique leader (when two walking states meet, only one survives and becomes ), simply because they form natural obstacles between itself and the other endpoint. If a new fault occurs, then this can only introduce another state which cannot interfere with what existing ’s are doing on the rest of the line (can meet them eventually but cannot lead them into an incorrect decision).

If an leader is merging while there are ’s and/or ’s on its line (but it is not aware of that), the merging results in a new state, which is safe because a cannot make any further progress without first succeeding to beat everybody on the line. A can become only after walking the whole line at least once (i.e. interact with both endpoints) and to do that it must have managed to eliminate all other walking states of the line on its way. ∎
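The requirement that a walking state meets both endpoints before becoming a leader can be illustrated with a minimal sketch. It models only the random walk on a fixed line, not the pairwise scheduler, the state swaps, or faults; the function name and the integer positions are assumptions made for illustration.

```python
import random

def walk_until_both_endpoints(n, start, rng):
    """Random walk on the path 0..n-1 (n >= 2).

    Returns the number of moves made until the walker has visited
    both endpoints at least once.
    """
    pos = start
    seen_left = pos == 0
    seen_right = pos == n - 1
    steps = 0
    while not (seen_left and seen_right):
        if pos == 0:
            pos = 1                      # an endpoint has a single neighbour
        elif pos == n - 1:
            pos = n - 2
        else:
            pos += rng.choice((-1, 1))   # inner node: pick a neighbour uniformly
        steps += 1
        seen_left = seen_left or pos == 0
        seen_right = seen_right or pos == n - 1
    return steps

if __name__ == "__main__":
    print(walk_until_both_endpoints(10, 3, random.Random(1)))
```

Since the walker can only become a leader after touching both endpoints, it must have crossed every inner node of the line at least once, which is what the proof relies on.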

Protocol 5 FT Spanning Line (transition rules not shown). Surviving comments from the listing: the walking nodes perform a random walk on the line and eliminate each other until only one survives.

Lemma 6.

There is an N-NET such that when is executed on nodes and at most faults can occur, , will eventually simulate a given TM of space in a fault-tolerant way.


The state of has two components , where is executing a spanning line formation procedure, while handles the simulation of the TM . Our goal is to eventually construct a spanning line, where initially the state of the second component of each node is in an initial state except from one node which is in state head and indicates the head of the TM.

In general, the states and are updated in parallel and independently from each other, apart from some cases where we may need to reset either , or both.

In order to form a spanning line under crash failures, the component will be executing our FT Spanning Line protocol which is guaranteed to construct a line, spanning eventually the non-faulty nodes.

It is sufficient to show that the protocol can successfully reinitialize the state of all nodes on the line after a final event has happened and the line is stable and spanning. Such an event can be a line merging, a line expansion, a fault on an endpoint or an intermediate fault. The latter, though, can only be a final event if one of the two resulting lines is completely eliminated due to faults before merging again. In order to re-initialize the TM when the line expands to an isolated node , we alter a rule of the FT Spanning Line protocol. Whenever a leader expands to an isolated node , the leader becomes while the node in becomes , thus introducing a new walking state.

We now exploit the fact that in all these cases, FT Spanning Line will generate a or a state in each affected component.

Whenever a or state has just appeared or interacted with an endpoint or respectively, it starts resetting the simulation component of every node that it encounters. If it ever manages to become a leader , then it finally restarts the simulation on the component by reintroducing to it the tape head.

When the last event occurs, the final spanning line has a or leader in it, and we can guarantee a successful restart due to the following invariant. Whenever a line has at least one state and no further events can happen, FT Spanning Line guarantees that there is one or that will dominate every other state on the line and become an , while having traversed the line from endpoint to endpoint at least once.

In its final departure from one endpoint to the other, it will dominate all and states that it will encounter (if any) and reach the other endpoint. Therefore, no other states can affect the simulation components that it has reset on its way, and upon reaching the other endpoint it will successfully introduce a new head of the TM while all simulation components are in an initial state . ∎

Lemma 7.

There is a fault-tolerant N-NET protocol which partitions the nodes into two groups and with waste at most , where is an upper bound on the number of faults that can occur. is a spanning line with a unique leader in one endpoint and can eventually simulate a TM . In addition, each node of is connected with exactly one node of , and vice versa.


Initially all nodes are in state . Protocol partitions the nodes into two equal sets and and every node maintains its type forever. This is done by a perfect matching between ’s where one becomes and the other becomes . Then, the nodes of execute the FT Spanning Line protocol, which guarantees the construction of a spanning line, capable of simulating a TM (Lemma 6). The rest of the nodes (), which are connected to exactly one node of each, are used to construct on them random graphs. Whenever a line merges with another line or expands towards an isolated node, the simulation component in the states of the line nodes, as described in Lemma 6, is reinitialised sequentially.

Assume that a fault occurs on some node of the perfect matching before that pair has been attached to a line. In this case, its pair will become isolated, therefore it is sufficient to switch that back to .

If a fault occurs on a node after its pair has been attached to a line, goes into a detaching state which disconnects it from its line neighbors, turning them into and itself becoming a upon release. An state on one endpoint is guaranteed to walk the whole line at least once (as ) in order to ensure that a unique leader will be created. If fails before completing this process, its neighbors on the line shall be notified, becoming again , and if one of its neighbors fails we shall treat this as part of the next type of faults. This procedure shall disconnect the line but may leave the component connected through active connections within . But this is fine as long as the FT Spanning Line protocol guarantees a correct restart of the simulation after any event on a line. This is because eventually the line in will be spanning and the last event will cause a final restart of the simulation on that line.

Assume that a fault occurs on a node that is part of the line. In this case the neighbors of on the line shall instantly become . Now, its pair , which may have an unbounded number of neighbors at that point, becomes a special deactivating state that eventually deactivates all connections and never participates again in the protocol; thus, it stays forever as waste. This is because the fault partially destroys the data of the simulation, thus, we cannot safely assume that we can retrieve the degree of and successfully deactivate all edges. As there can be at most such faults we have an additional waste of . Now, consider the case where is one neighbor of a node which is trying to release itself after its neighbor in failed. Then, implements a -counter in order to remember how many of its alive neighbors have been deactivated by itself or due to faults in order to know when it should become . ∎

Theorem 6.

For any graph language that can be decided by a linear space TM, there is a fault tolerant PN-NET that constructs a graph in with waste at most , where is an upper bound on the number of faults that can occur.


By Lemma 7, there is a protocol that constructs two groups and of equal size, where each node of is matched with exactly one node of , and vice versa. In addition, the nodes of form a spanning line, and by Lemma 6 it can simulate a TM . After the last fault occurs, is correctly initialized and the head of the TM is on one of the endpoints of the line. The two endpoints are in different states; assume that the endpoint on which the head ends up is in state (left endpoint), and the other is in state (right endpoint).

We now provide the protocol that performs the simulation of the TM , which we separate into several subroutines. The first subroutine is responsible for simulating the direction on the tape and is executed once the head reaches the endpoint . The simulation component (as in Lemma 6) of each node has three sub-components . is used to store the head of the TM, i.e. the actual state of the control of the TM, is used to store the symbol written on each cell of the TM, and is either , or , indicating whether that node is on the left or on the right of the head (or unknown). Assume that after the initialization of the TM, for all nodes of the line. Finally, whenever the head of the TM needs to move from a node to a node , , and .

Direction. Once the head of the TM is introduced in the endpoint by the line’s leader, it moves on the line, leaving marks on the component of each node. It moves on the nodes which are not marked, until it eventually reaches the endpoint. At that point, it starts moving on the marked nodes, leaving marks on its way back. Eventually, it reaches the endpoint again. At that time, for each node on its right it holds that . Now, every time it wants to move to the right it moves onto the neighbor that is marked by while leaving an mark on its previous position, and vice versa. Once the head completes this procedure, it is ready to begin working as a TM.

Constructing a random graph in . This subroutine of the protocol constructs a random graph in the nodes of . In the Probabilistic N-NET model, the nodes are allowed to toss a fair coin during an interaction. This means that we allow transitions that with probability 1/2 give one outcome and with probability 1/2 another. To achieve the construction of a random graph, the TM implements a binary counter ( bits) in its memory and uses it in order to uniquely identify the nodes of set according to their distance from . Whenever it wants to modify the state of edge of the network in , the head assigns special marks to the nodes in at distances and from the left of the endpoint . Note that the TM uses its (distributed) binary counter in order to count these distances. If the TM wants to access the th node in , it sets the counter to , places a mark on the left endpoint and repeatedly moves the mark one position to the right, decreasing the counter by one in each step, until . Then, the mark has been moved exactly positions to the right. In order to construct a random graph in , it first assigns a mark to the first node , which indicates that this node should perform random coin tosses in its next interactions with the other marked nodes, in order to decide whether to form connections with them or not. Then, the leader moves to the next node on its line and waits to interact with the connected node in . It assigns a mark , and waits until this mark is deleted. The two nodes that have been marked ( and ) will eventually interact with each other, and they will perform the (random) experiment. Finally, the second node deletes its mark (). The head then moves to the next node and performs the same procedure, until it reaches the other endpoint . Finally, it moves back to the first node (marked as ), deletes the mark and moves one step right. This procedure is repeated until the node that should be marked as is the right endpoint . It does not mark it and it moves back to . The result is an equiprobable construction of a random graph. In particular, all possible graphs over nodes have the same probability to occur. Now, the input to the TM is the random graph that has been drawn on , which provides an encoding equivalent to an adjacency matrix.
Once this procedure is completed, the protocol starts the simulation of the TM . There are edges, where and has available space, which is sufficient for the simulation on a space TM.
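The pairwise sweep described above can be sketched centrally as follows. This is only an illustration under simplifying assumptions: the nodes of the construction group are indexed by their distance from the left endpoint, the marks are replaced by two loop indices, and the name `draw_random_graph` is hypothetical.

```python
import random

def draw_random_graph(k, rng):
    """Toss one fair coin per unordered pair of k nodes.

    Returns a symmetric 0/1 adjacency matrix with a zero diagonal.
    """
    adj = [[0] * k for _ in range(k)]
    # 'first' plays the role of the node holding the first mark; the right
    # endpoint never receives it, so it ranges over 0..k-2.
    for first in range(k - 1):
        # 'second' plays the role of the second mark sweeping to the right.
        for second in range(first + 1, k):
            coin = rng.randint(0, 1)     # fair coin toss for this pair
            adj[first][second] = coin
            adj[second][first] = coin
    return adj

if __name__ == "__main__":
    for row in draw_random_graph(4, random.Random(0)):
        print(row)
```

Every pair is visited exactly once and decided by an independent fair coin, so each labelled graph on the k nodes is produced with the same probability.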

Read edges of . We now present a mechanism, which can be used by the TM in order to read the state of an edge joining two nodes in . Note that a node in can be uniquely identified by its distance from the endpoint . Whenever the TM needs to read the edge joining the nodes and , it sets the counter to . Assume w.l.o.g. that . It performs the same procedure as described in the subroutine which draws the random graph in . It moves a special mark to the right, decreasing by one in each step, until it becomes zero. Then, it assigns a mark on the th node of , and then performs the same for , where it also assigns a mark (to the th node). When the two marked nodes ( and ) interact with each other, the node which is marked as copies the state of the edge joining them to a flag (either or ), and they both delete their marks. The head waits until it interacts again with the second node, and if the mark has been deleted, it reads the value of the flag .
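A minimal sketch of the counter-driven addressing used to reach a node at a given distance, and of reading one edge bit, is given below. The counter is kept as a list of bits (least significant first) to mimic a counter spread over the line nodes; the helper names and the adjacency-matrix input are assumptions made for illustration.

```python
def to_bits(value, width):
    """Encode value as a list of bits, least significant bit first."""
    return [(value >> k) & 1 for k in range(width)]

def decrement(bits):
    """Decrement the counter in place; return False if it is already zero."""
    for idx, bit in enumerate(bits):
        if bit == 1:
            bits[idx] = 0
            for lower in range(idx):
                bits[lower] = 1          # borrow: lower zero bits become ones
            return True
    return False

def locate(index, width):
    """Move a mark right from the left endpoint until the counter hits zero."""
    bits = to_bits(index, width)
    position = 0
    while decrement(bits):
        position += 1
    return position

def read_edge(adj, i, j, width):
    """Mark the i-th and j-th nodes and copy the state of their edge to a flag."""
    u = locate(i, width)
    v = locate(j, width)
    flag = adj[u][v]
    return flag

if __name__ == "__main__":
    adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
    print(read_edge(adj, 1, 2, width=2))
```

The caller is assumed to pick `width` large enough to represent the largest index.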

After a simulation, the TM either accepts or rejects. In the first case, the constructed graph belongs to and the Turing Machine halts. Otherwise, the random graph does not belong to , thus the protocol repeats the random experiment. It constructs again a random graph, and starts over the simulation on the new input.
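The repeat-until-accept loop can be sketched as rejection sampling. The decider below (graph connectivity) is only a stand-in for the simulated TM, and the names `construct` and `is_connected` are hypothetical.

```python
import random
from itertools import combinations

def is_connected(k, edges):
    """Example decider: True iff the graph on nodes 0..k-1 is connected."""
    if k == 0:
        return True
    adj = {u: set() for u in range(k)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for w in adj[u] - seen:
            seen.add(w)
            stack.append(w)
    return len(seen) == k

def construct(k, accepts, rng):
    """Draw random graphs until the decider accepts one.

    Returns the accepted edge set and the number of attempts.
    """
    attempts = 0
    while True:
        attempts += 1
        edges = {pair for pair in combinations(range(k), 2)
                 if rng.random() < 0.5}
        if accepts(k, edges):
            return edges, attempts

if __name__ == "__main__":
    edges, attempts = construct(6, is_connected, random.Random(0))
    print(sorted(edges), attempts)
```

Because each draw is uniform over all graphs and rejected draws are discarded, the accepted graph is uniform over the graphs in the language, provided the language contains at least one graph of that size.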

A final point that we should make clear is that if during the simulation of the TM an event occurs (crash fault, line expansion, or line merging), by Lemma 6 and Lemma 7, the protocol reconstructs a valid partition between and , the TM is re-initialized correctly, and a unique head is introduced in one endpoint. At that time, edges in may exist, but this fact does not interfere with the (new) simulation of the TM, as a new random experiment takes place for each pair of nodes in prior to each simulation. ∎

We now show that if the constructed network is required to occupy instead of half of the nodes, then the available space of the TM-constructor dramatically increases from to . We provide a protocol which partitions the population into three sets , and of equal size . The idea is to use the set as a binary memory for the TM, where the information is stored in the edges of .

Protocol 6 3-Partition (transition rules not shown).

Lemma 8.

Protocol 3-Partition partitions the nodes into three groups , and , with waste , where is an upper bound on the number of faults that can occur. is a spanning line with a unique leader in one endpoint and can eventually simulate a TM, each node in is connected with exactly one node of , and each node of is connected to exactly one node in and one node in .


Protocol 3-Partition constructs lines of three nodes each, where one endpoint is in state , the other endpoint in state , and the center is in state . The nodes of operate as in Lemma 7 (i.e. they execute the FT Spanning Line protocol). A (connected) pair of nodes waits until a third node is attached to it, and then the center becomes and starts executing the FT Spanning Line protocol. Note that at some point, it is possible that the population may only consist of pairs in states and . For this reason, we allow nodes to connect with each other, forming lines of four nodes. One of the nodes becomes and the other becomes . A node in becomes only after deactivating its connection with a node (its previous pair). This results in lines of three nodes each with nodes in states , and . Then, the nodes start forming a line, spanning all nodes of . In a failure-free setting, the correctness of this protocol follows from Lemma 7. In addition, by Lemma 6, the TM of the line is initialized correctly after the last occurring event (line expansion, line merging, or crash fault).

If we consider crash failures, it is sufficient to show that eventually is a spanning line and and are disjoint. If a node ever becomes or , it might form connections with other nodes in or respectively, because of a TM simulation. A node in never forms connections with nodes in . After they receive a fault notification, they become the deactivating state . A node in state is disconnected from any other node, thus, it eventually becomes isolated and never participates in the execution again. We do this because nodes in and can form an unbounded number of connections. The data of the TM have been partially destroyed (because of the crash failure), therefore it is not safe to assume that we can retrieve their degree and successfully re-initialize them.

After a fault notification, a node in state (inner node of a line of four nodes) becomes . A node in waits until its next interaction with a connected node . If is in state , this means that now a triple has been formed, thus becomes . If is in state , they delete the edge joining them, becomes and becomes ( might have formed connections with other nodes in ).

After a fault notification, a node in becomes and waits until its next interaction with a connected node . At that point, can be either , , or . In all cases they disconnect from each other and becomes . The state indicates that the node should release itself from the spanning line in . This procedure works as described in Lemma 7, thus, after releasing itself from the line, it becomes . If is in state or , it becomes . If is in state , it becomes , as its (unique) adjacent node can only be in state .

After a fault notification, a node in or becomes and continues participating in the execution. Finally, a node in state , after receiving a fault notification, becomes (a is the result of a fault notification in a node).

Note that a node in any state except and can be re-initialized correctly, thus it may participate in the execution again. It is apparent that no node that might have formed an unbounded number of connections can participate in the execution again after a crash fault. This guarantees that the connections in and can be correctly initialized after the final event, and that no node in can be connected with more than one node in . In addition, if a node receives a fault notification, it releases itself from the line, thus introducing new walking states in the resulting line(s). By Lemma 6, this guarantees the correct re-initialization of the TM. Finally, a crash failure can lead to deactivating two more nodes, in the worst case. These nodes never participate in the execution again, thus they remain forever as waste. This means that after crash failures, the partitioning will be constructed in nodes. ∎

Theorem 7.

For any graph language that can be decided by a space TM, there is a protocol that constructs equiprobably with waste at most , where is an upper bound on the number of faults.


Protocol 6 partitions the population into three groups , and and by Lemma 8, it tolerates any number of crash failures, while initializing correctly the TM after the final event (line expansion, line merging, or crash fault). Reading and writing on the edges of is performed in precisely the same way as reading/writing the edges of (described in Theorem 6). Thus, the Turing Machine now has a space binary memory (the edges of ) and space on the edges of the spanning line . The random graph is constructed on the nodes of (useful space), where by Lemma 8, in the worst case. ∎

4.3 Designing Fault-Tolerant protocols without waste by assuming non-constant memory per node

A very simple (yet impractical) idea that could tolerate any number of faults is to restart the protocol each time a node crashes. The implementation of this idea requires the ability of some nodes to detect the removal of a node.

Definition 7.

Consider any execution of a finite protocol . There exists a finite number of different executions, and for each execution a step that stabilizes. Call the th configuration of execution , where . Then, we call maximum reachable degree of the value .

We first show that even in the case where the whole population is notified about a crash failure, global restart is impossible for protocols with unbounded maximum reachable degree, if the nodes have constant memory. However, we provide a protocol that restarts the population, but we supply the agents with bits of memory. In our approach, we use the N-NET model, and if a node crashes, the set of the nodes that are notified, has the task to restart the protocol (i.e. to convert the current configuration into an initial one).

Consider a protocol with the initial state . We define as global restart the process which leads all alive nodes to the initial state without any enabled connections among them and then gradually starts again.

Theorem 8.

Consider a protocol with unbounded maximum reachable degree. Then, global restart of is impossible for nodes with constant memory, even if every node in the population is notified about the crash failure.


Consider a protocol with constant number of states and unbounded maximum reachable degree, which stabilizes to a graph of property . Then any degree more than cannot be remembered by a node, that is, a state cannot indicate the degree of a node.

Assume that at time a crash failure occurs and that there are some edges in the graph (call them spurious edges).

Protocol is allowed to have rules that are triggered by the fault and try to erase those edges (erasing process). We assume that all nodes in the population are notified about the crash failure. But, as long as the nodes are not aware of their degree, they do not know when the edge erasing process stops in order to allow the restart. To stop the erasing process is equivalent to counting the remaining edges and wait until the degree reaches zero. After a node deletes an edge it either stays in the same state or updates it in order to remember it. No more than such changes can happen, thus it is impossible to delete all edges and restart with constant memory.

So, any self-stabilizing protocol will inherit (after restarting gradually) some arbitrary spurious edges. Thus, global restart is impossible. ∎

A very interesting related question is whether a protocol with unbounded maximum reachable degree can still stabilize to a correct graph after an unsuccessful restart, where some edges exist in the beginning of the execution. This is equivalent to asking whether can still stabilize to a correct , if we arbitrarily enable some connections prior to the execution.

Theorem 9.

Consider an SNET protocol which stabilizes to a graph of property . Given that all nodes are in an initial state and assuming an adversary that can initialize arbitrarily any subset of edges among nodes, stabilizes to a graph .


Assume w.l.o.g. that stabilizes to a spanning line. Since the nodes have constant memory (i.e. constant number of states), there exists at least one state which nodes stabilize to. Consider an execution where two nodes and are in the same state after stabilization at time . Consider also a node in state which is adjacent to but not to , and that and never interacted with each other until time .

Consider now that the adversary initializes the edge between and to on, and we run an execution of which is exactly the same as ( and won’t update their connection state, as they do not interact until ). Then, node stabilizes having three enabled connections. Since and are both in the same state , cannot distinguish and . If there was a rule in which disconnects and , this would also happen in the case where was not adjacent to , resulting to stabilize to a graph with at least two disjoint lines, as would be disconnected from . ∎

In light of the impossibility result of Theorem 8, we allow the nodes to use non-constant local memory in order to develop a fault tolerating procedure based on restart. Our goal is to come up with a protocol that can be composed with any N-NET protocol , so that their composition is a fault-tolerant version of . Essentially, whenever a fault occurs, will restart all nodes in a way equivalent to as if a new execution of had started on the whole remaining population.

We give a protocol that achieves this as follows. All nodes are initially leaders. Through a standard pairwise leader elimination procedure, a unique leader would be guaranteed to remain in the absence of failures. But because a fault can remove the last remaining leader, the protocol handles this by generating a new leader upon getting a fault notification. This guarantees the existence of at least one leader in the population and eventually (after the last fault) of a unique one. There are two main events that trigger a new restarting phase: a fault and a leader elimination. As any new event must trigger a new restarting phase that will not interfere with an outdated one, eventually overriding the latter and restarting all nodes once more, we use phase counters to distinguish among phases. In the presence of a new event it is always guaranteed that a leader at maximum phase will eventually increase its phase, therefore a restart is guaranteed after any event. The restarts essentially cause gradual deactivation of edges (by having nodes remember their degree throughout) and restoration of nodes’ states to , thus executing on a fresh initial configuration. For the sake of clarity, we first present a simplified version of the restart protocol that guarantees resetting the state of every node to a uniform initial state . So, for the time being we may assume that the protocol to be restarted through composition is any Population Protocol that always starts from the uniform initial configuration (all in initially). Later on we shall extend this to handle protocols that are Network Constructors instead.

Description of the PP Restarting Protocol. The state of every node consists of two components and . runs the restart protocol while runs the given PP . In general, they run in parallel with the only exception when restarts . The component of every node stores a leader variable, taking values from , and is initially , a phase variable, taking values from , initially , and a fault binary flag, initially .

The transition function is as follows. We denote by the value of variable of node and the value of it after the transition under consideration.

If a leader’s flag becomes or , it sets it to , increases its phase by one, and restarts . If a follower’s flag becomes or , it sets it to , increases its phase by one, becomes a leader, and restarts . We now distinguish three types of interactions.

When a leader interacts with a leader , one of them remains leader (state ) and the other becomes a follower (state ), both set their phase variable to and both reset their component (protocol ) to (i.e. restart ).

When a leader interacts with a follower , if , do nothing in but execute a transition of (both and involved). If , then both set their phase variable to and both restart , and finally, if , then and restarts .

When a follower interacts with a follower , if do nothing in but execute transition of . If , then sets and restarts , and finally, if , then sets and restarts .
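The three interaction types and the fault rule can be sketched as below. The exact phase expressions are not recoverable from the text, so the sketch assumes the following: phases start at 0, a leader elimination and a leader meeting a follower of higher phase both move the pair to the maximum of the two phases plus one, and in every other unequal case the node with the lower phase copies the higher phase and restarts. The names `Node`, `interact`, `on_fault_flag` and `simulate` are hypothetical.

```python
import random
from dataclasses import dataclass

Q0 = "q0"  # assumed name for the uniform initial state of the restarted protocol

@dataclass
class Node:
    leader: bool = True   # all nodes start as leaders
    phase: int = 0        # assumed initial phase
    state: str = Q0       # second component, running the restarted protocol

def restart(node):
    node.state = Q0

def on_fault_flag(node):
    """A raised fault flag makes the node a leader in a new phase."""
    node.leader = True
    node.phase += 1
    restart(node)

def interact(u, v, step_p):
    if u.leader and v.leader:
        v.leader = False                              # pairwise elimination
        u.phase = v.phase = max(u.phase, v.phase) + 1
        restart(u)
        restart(v)
    elif u.leader != v.leader:
        lead, foll = (u, v) if u.leader else (v, u)
        if lead.phase == foll.phase:
            step_p(u, v)                              # normal transition
        elif lead.phase < foll.phase:
            lead.phase = foll.phase = foll.phase + 1  # leader overtakes
            restart(lead)
            restart(foll)
        else:
            foll.phase = lead.phase
            restart(foll)
    else:
        if u.phase == v.phase:
            step_p(u, v)
        else:
            low, high = (u, v) if u.phase < v.phase else (v, u)
            low.phase = high.phase
            restart(low)

def simulate(n, steps, seed):
    """Run a uniform random scheduler with a no-op protocol in the second component."""
    rng = random.Random(seed)
    nodes = [Node() for _ in range(n)]
    for _ in range(steps):
        u, v = rng.sample(nodes, 2)
        interact(u, v, lambda a, b: None)
    return nodes

if __name__ == "__main__":
    print(simulate(5, 5000, seed=7))
```

Under these assumptions a failure-free run ends with a single leader and all nodes in the same phase, which is the situation Lemma 9 and Lemma 10 describe.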

We now show that given any such PP , the above restart protocol when composed as described with , gives a fault-tolerant version of (tolerating any number of crash faults).

Lemma 9 (Leader Election).

In every execution of , a configuration with a unique leader is reached, such that no subsequent configuration violates this property.


If after the last fault there is still at least one leader, then from that point on at least one more leader appears (due to the fault flags) and only pairwise eliminations can decrease the number of leaders. But pairwise elimination guarantees eventual stabilization to a unique leader. It remains to show that there must be at least one leader after the last fault. The leader state becomes absent from the population only when a unique leader crashes. This generates a notification, raising at least one follower’s fault flag, thus introducing at least one leader. ∎

Call a leader-event any interaction that changes the number of leaders. Observe that after the last leader-event in an execution there is a stable unique leader .

Lemma 10 (Final Restart).

On or after the last leader-event, will go to a phase such that , where denotes the remaining nodes after the crash faults. As soon as this happens for the first time, let denote the set of nodes that have restarted exactly once on or after that event. Then , an interaction between and results in . Thus, will eventually be .


We first show that on or after the last leader-event there will be a configuration in which and it is stable. As there is a unique leader and follower-to-follower interactions do not increase the maximum phase within the followers population, will eventually interact with a node that is in the maximum phase. At that point it will set its phase to that maximum plus one and we can agree that before that follower also sets its own phase during that interaction to the new max, it has been satisfied that .

When the above is first satisfied, and . Any interaction within , only executes a normal transition of , as in they are all in the same phase. Any interaction between a and a , results in , because interactions between followers in cannot increase the maximum phase within , thus holds and the transition is: and restarts , thus enters . It follows that cannot decrease and any interaction between the two sets increases , thus eventually becomes equal to . ∎

Putting Lemma 9 and Lemma 10 together gives the aforementioned result.

Theorem 10.

For any such PP , it holds that is a fault-tolerant version of .

Lemma 11.

The required memory in each agent for executing protocol is bits.


Initially all nodes are potential leaders, and they eliminate each other, moving to next phases at the same time. In the worst case, a single leader will eliminate every other leader, turning them into followers, thus in a failure-free setting the phase of becomes at most . If we consider the case where crash faults may occur, each fault can result in notifying the whole population. This will happen if was adjacent to every other node by the time it crashed. Thus, all nodes increase their phase by one and become leaders again. In the worst case, a single leader eliminates all the other leaders, thus, after the first fault, the maximum phase will be increased by . The maximum phase that can be reached is , where is the maximum number of faults that may occur (). Thus, each node is required to have bits of memory. ∎

N-NET Restarting Protocol. We are now extending the PP Restarting Protocol in order to handle any N-NET protocol . Call this new protocol . We store in the component of each node a degree variable, that is, whenever a connection is formed or deleted, increases or decreases the value of degree by one respectively. In addition, whenever the fault flag of a node becomes one, it means that an adjacent node of it has crashed, thus it decreases degree by one. In the case of Network Constructors, the nodes cannot instantly restart the protocol by setting their state to the initial one . By Theorem 9, it is evident that we first need to remove all the edges in order to have a successful restart and eventually stabilize to a correct network.

We now define an intermediate phase, called Restarting Phase , which the nodes that need to be restarted enter by setting the value of a variable restart to (stored in the component). As long as their degree is more than zero, they do not apply the rules of the protocol in their second component , but instead they deactivate their edges one by one. Eventually their degree reaches zero, and then they set restart to and continue executing protocol . We can say that a node , which is in phase (), becomes available for interactions of (in ) only after a successful restart. This guarantees that a node will not start executing the protocol again unless its degree first reaches zero.

The additional Restarting Phase does not interfere with the execution of the PP Restarting Protocol, but it only adds a delay on the stabilization time.

Lemma 12.

The variable degree of a node always stores its correct degree.


In a failure-free setting, whenever a node forms a new connection, it increases its degree variable by one, and whenever it deactivates a connection, it decreases it by one. In case of a fault, all the adjacent nodes are notified, as their fault flag becomes one. Thus, they decrease their degree by one. In case of a fault with no adjacent nodes, a random node is notified, and its fault flag becomes two. In that case, it leaves the value of degree the same. ∎
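The bookkeeping in this argument can be checked with a small centralized sketch. The class `Population` and its method names are hypothetical; the ground-truth edge set is kept only to compare against the per-node counters.

```python
import random

class Population:
    def __init__(self, n):
        self.alive = set(range(n))
        self.edges = set()                      # ground truth, for checking only
        self.degree = {u: 0 for u in range(n)}  # the per-node degree variable

    def activate(self, u, v):
        e = frozenset((u, v))
        if u != v and {u, v} <= self.alive and e not in self.edges:
            self.edges.add(e)
            self.degree[u] += 1
            self.degree[v] += 1

    def deactivate(self, u, v):
        e = frozenset((u, v))
        if e in self.edges:
            self.edges.remove(e)
            self.degree[u] -= 1
            self.degree[v] -= 1

    def crash(self, u, rng):
        """Remove u; return the list of notified nodes with their flag value."""
        self.alive.discard(u)
        del self.degree[u]
        incident = [e for e in self.edges if u in e]
        notified = []
        for e in incident:
            (v,) = e - {u}
            self.edges.remove(e)
            self.degree[v] -= 1                 # flag 1: a neighbour crashed
            notified.append((v, 1))
        if not incident and self.alive:
            v = rng.choice(sorted(self.alive))
            notified.append((v, 2))             # flag 2: degree left unchanged
        return notified

    def true_degree(self, v):
        return sum(1 for e in self.edges if v in e)

if __name__ == "__main__":
    pop = Population(4)
    pop.activate(0, 1)
    pop.activate(0, 2)
    print(pop.crash(0, random.Random(0)), pop.degree)
```

After any sequence of these operations, the stored counter of every alive node equals its true degree, which is the invariant the restarting phase depends on.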

Theorem 11.

For any N-NET protocol , it holds that is a fault-tolerant version of .


Consider the case where a node (either leader or follower) needs to be restarted. It enters the restarting phase in order to deactivate all of its enabled connections, and it will start executing only after its degree becomes zero (by Lemma 12 this will happen correctly), thus, always runs on nodes with no spurious edges (edges that are the result of previous executions). Whenever two connected nodes and