Optimally Self-Healing IoT Choreographies

by   Jan Seeger, et al.
Siemens AG

In the industrial Internet of Things domain, applications are moving from the Cloud into the edge, closer to the devices producing and consuming data. This means applications move from the scalable and homogeneous cloud environment into a constrained heterogeneous edge network. Making edge applications reliable enough to fulfill Industrie 4.0 use cases is still an open research challenge. Maintaining operation of an edge system requires advanced management techniques to mitigate the failure of devices. This paper tackles this challenge with a twofold approach: (1) a policy-enabled failure detector that enables adaptable failure detection and (2) an allocation component for the efficient selection of failure mitigation actions. We evaluate the parameters and performance of our failure detection approach and the performance of an energy-efficient allocation technique, and present a vision for a complete system as well as an example use case.


1. Introduction

IoT application deployments are currently mostly cloud-based, with a central processing component provided with data via remote sensors and actuators. This centralized cloud structure limits the possible applications for latency and confidentiality reasons. As a response, processing is moving back into the edge of the network, or into the sensors and actuators themselves. Centralized cloud systems are being supplanted by edge systems in latency- and privacy-sensitive applications. Industrial applications have stringent requirements in these fields, with low latency necessary for monitoring and control applications, and confidentiality necessary for economic reasons.

Shi et al. (Shi et al., 2016) define the term “edge” as “any computing and network resources along the path between data sources and cloud data centers”. We use this definition, but focus on edge networks that are close to the data producers and consumers and fully under the control of the application owner. Thereby, applications consist of multiple chained tasks, which can be distributed over several edge nodes to enable collaboration, e.g., to process a complex algorithm or AI pipeline in a distributed and coordinated way. A key challenge is the heterogeneity of industrial edge networks, which consist of various kinds of nodes (e.g., industrial PCs, HMI units, network switches, or field devices such as simple sensors and actuators). These nodes have varying computational and communication capacities. Hence, operating a distributed application reliably and efficiently requires intelligent management approaches.

In this paper, we focus on (1) efficiently detecting failures of devices and software components using an accrual-based failure detection augmented with policies, and (2) automatically mitigating failures by finding an optimal allocation of application tasks, e.g., towards minimized energy consumption of the system. This work describes the latest findings on our research agenda to enable distributed IoT choreographies. Our path began with the introduction of the “Recipe” concept for defining IoT application templates (Thuluva et al., 2017a), continued by our work on improving the runtime management of such recipes by handling them as service choreographies (Seeger et al., 2018), and most recently defined a mechanism for the dynamic and resilient management of IoT choreographies (Seeger et al., 2019).

A “Recipe” describes an application template as a graph of abstract IoT tasks (e.g., device services such as “stream video” or “notify user”) where data flows along the edges of the graph, and tasks are executed when they have received enough input data. An example recipe use case implementing a vibration analysis for rotating machinery is described in more detail in Section 5.

Recipe tasks are generic, and a concrete recipe is derived by “instantiating” these generic tasks with concrete implementations that are available in the system. Using this mechanism, parts of the application can be replaced automatically when a failure is detected. What our previous work lacks is an evaluation of the “quality” of these replacements. So far, the first available replacement was chosen without regard to its effect on the properties of the system. Relevant properties are, for example, the end-to-end latency of the application or the energy usage. By optimizing the placement of tasks on devices, we can optimize the properties of a system even when devices fail. Building on our previous work, we present here a thorough examination of our failure detector and an evaluation of its memory and processing usage, as well as a detailed description and evaluation of our mechanism to optimize the assignment of tasks to devices for minimal energy usage.

2. Background & Related Work

In this section, we describe the context of our work. Section 2.1 introduces relevant works in the field of composing services (and IoT devices) into applications. Section 2.2 provides an overview of mechanisms for detecting failures in distributed systems. Section 2.3 presents related work on optimal placements of system operators.

2.1. IoT Composition

The composition of web services has been extensively researched (Sheng et al., 2014). Thereby, service composition can be classified into two types, service orchestration and service choreography, based on the manner in which the participant services interact (Sheng et al., 2014). Productive solutions for web services composition typically follow the orchestration approach.

In the IoT domain, we have today established composition systems with a broad user community such as “If This Then That” and Node-RED. These tools use simple composition techniques that are executed centrally as orchestrations. These platforms target mainstream users and lack systematic engineering support, which leads, e.g., to widely duplicated recipes, as shown by Ur et al. (Ur et al., 2016).

Giang et al. (Giang et al., 2015) focus on application-level distributed choreographies by building on Node-RED as a visual programming tool. However, they do not address the configuration of critical automation systems and their need for failure detection and recovery.

Khan et al. (Khan et al., 2017) propose a reliable infrastructure for IoT compositions, but focus on communication of data instead of application-level orchestrations. Thuluva et al. (Thuluva et al., 2017b) employ Semantic Web technologies to enable low-effort engineering of industrial IoT applications. However, they do not focus on the runtime aspects and dynamic reconfiguration or failure detection. Focusing on building automation, Ruta et al. (Ruta et al., 2014) present a multi-agent framework that uses semantic technologies and makes use of automated reasoning for enabling device discovery and orchestration of IoT components. Their approach does not address failure handling or mitigation.

This work builds up on our previous works (Thuluva et al., 2017a; Seeger et al., 2019; Seeger et al., 2018) that present an IoT composition as a “Recipe”, i.e., separate from its implementation. A semi-automated service composition and instantiation tool assists the user in creating the composition. IoT choreographies are described by a directed graph of connected abstract application components, with data flowing along the edges of these components. During the instantiation phase, each abstract component is replaced with a concrete one. The approach of this work builds up on this concept by rerunning the instantiation algorithm when a failure is detected, to find another component that can fulfill the functionality of the failed component.

Figure 1 shows an example of a recipe that combines multiple services of devices in an intrusion detection system. The green boxes are ingredients that need to be replaced with concrete components when the recipe is instantiated. Ingredients are connected via their outputs and inputs. In this example, video and audio streams are connected to analytics components that feed into an aggregating intrusion detector and finally a component that is able to send notifications. The recipe designer can further specify application-level constraints (e.g., minimum video frame rate) on the interactions between sensors and analysis services (Seeger et al., [n.d.]).

Figure 1. Example of a recipe combining multiple device services for object detection (Source: (Seeger et al., [n.d.])).

2.2. Failure Detection

Failure detection is an essential building block for distributed systems. Without a suitable failure detector, distributed applications are generally not guaranteed to complete successfully. Chandra et al. describe the theory of failure detection in (Chandra and Toueg, 1996) and define two properties of failure detectors: completeness and accuracy. Completeness describes the property of a failure detector to correctly detect failure, while accuracy describes the capability of a detector to not detect failure on nodes functioning correctly.

Our research is based on φ-accrual failure detection, which is described in detail by Défago et al. (Défago et al., 2004); more recent implementations include Satzger et al. (Satzger et al., 2007) and Liu et al. (Liu et al., 2018). Accrual failure detectors calculate the probability of a node having failed from the distribution of inter-arrival times of received failure detection messages. Thereby, accrual failure detectors are strongly complete (there is some time after which all failed processes are permanently marked failed by all other processes) and eventually strongly accurate (there is some time after which correct processes are not marked failed by other correct processes).

In IoT environments, failures have to be detected on all involved devices and services, in order to ensure reliability. Kodeswaran et al. (Kodeswaran et al., 2016) present a system for efficient failure management in smart home environments based on tracking the most performed activities. This knowledge is used to predict future degradation and failures of involved devices. However, their work is restricted to the smart home domain and not directly usable in general IoT or industrial IoT cases.

The Gaia framework (Chetan et al., 2005) allows building pervasive systems and also includes failure detection. Thereby, the implemented failure detector is part of a central controller, i.e., the availability of the controller has to be ensured for the failure detection to work. Our approach is based on the distributed nodes and does not require communication with a central controller to detect a failure.

In (De Moraes Rossetto et al., 2015), an unreliable failure detector is presented that enables the definition of an impact factor for the involved nodes. This allows tuning the performance of the failure detector for specific application needs.

Guclu et al. (Guclu et al., 2016) present a distributed failure detector that builds on trust management. Their method evaluates the trustworthiness of the data from neighboring nodes. However, their approach is limited to homogeneous networks and similar structured data. This makes it not applicable to our scenarios of heterogeneous edge and industrial IoT environments.

2.3. Optimal Allocation

Efficiently allocating application tasks of a recipe to available edge devices is comparable to a widely studied research problem in the distributed systems field: the optimal operator placement of distributed stream processing applications, or the optimal selection of networked devices for tasks or computations of a chained process or workflow. When either a hardware or software failure occurs, an application component has failed, and this failure needs to be mitigated.

Tasks have different parameters, depending on the optimization target. The result of the task allocation problem is an allocation, an assignment of tasks to devices, that fulfills the constraints, and improves the performance of the system in some metric.

An overview of existing allocation approaches for stream processing is given in (Lakshmanan et al., 2008).

Based on Constraint Programming, Haubenwaller & Vandikas (Haubenwaller and Vandikas, 2015) describe an approach for the efficient distribution of actors (processing tasks) to IoT devices. The approach resembles the Quadratic Assignment Problem and is NP-hard, resulting in long computation times when scaling up. Samie et al. (Samie et al., 2016) present another Constraint Programming-based approach that takes bandwidth limitations into account and minimizes the energy consumption of IoT nodes. The system optimizes computation offloading from an IoT node to a gateway; however, it does not consider composed computations that can be distributed to multiple devices.

A Game Theory-based approach is presented in (Sardellitti et al., 2015) that aims at the joint optimization of radio and computational resources of mobile devices. The system finds a local optimum for multiple users; however, it only aims at deciding whether to fully offload a computation or to fully process it on the device.

Based on Non-linear Integer Programming, Sahni et al. (Sahni et al., 2017) present their Edge Mesh algorithm for task allocation that optimizes overall energy consumption and considers data distribution, task dependency, embedded device constraints, and device heterogeneity. However, only basic evaluation and experimentation are done and no performance comparison has been performed.

Based on Integer Linear Programming (ILP), Mohan & Kangasharju (Mohan and Kangasharju, 2016) propose a task assignment solver that first minimizes the processing cost and secondly optimizes the network cost, which stems from the assumption that Edge resources may not be highly processing-capable. An intermediary step reduces the sub-problem space by combining tasks and jobs with the same associated costs. This reduces the overall processing costs.

Cardellini et al. (Cardellini et al., 2016) describe a comprehensive ILP-based framework for optimally placing operators of distributed stream processing applications, while being flexible enough to be adjusted to other application contexts. Different optimization goals are considered, e.g., application response time and availability. They propose their solution as a unified general formulation of the optimal placement problem and provide a strong theoretical foundation. The framework is flexible so that it can be extended by adding further constraints or shifted to other optimization targets. Hence, we utilize their framework and extend it by incorporating further constraints for our optimization goal, the overall energy usage.

3. A Failure Detector for Self-Healing IoT Choreographies

In this section, we describe our failure detector PE-FD and its properties in comparison to other failure detectors, and present our policy concept to tune the failure detector for specific application requirements.

3.1. The PE-FD Failure Detector

Failure detection is a crucial functionality for a distributed system that should operate reliably. Without the information on the current state of components, a mitigator cannot decide what (if any) action to take to continue operation of the system. In this work, we focus on crash failures (Tanenbaum and Steen, 2007), where devices and software work correctly until they fail permanently. For efficient failure detection in IoT Choreographies, we have developed our Policy-Enabled Failure Detector (PE-FD). It is based on the principle of φ-accrual failure detection (Défago et al., 2004) and augmented with the support for “policies”, where parameters of the failure detection algorithm are adjusted according to application requirements.

In general, φ-accrual failure detectors are unreliable, meaning that errors in their output are permissible, but after some point, their output is always correct. Compared to “traditional” failure detectors (e.g., heartbeat-based or adaptive (Bertier et al., 2002; Chen et al., 2002)), φ-accrual detectors compute a suspicion function $\varphi$ that describes the probability of a node having crashed. This probability is computed by estimating the distribution of the inter-arrival times of heartbeats, and computing the probability of a new heartbeat arriving after the current time. With the current time being $t_{now}$, and the time of the last timestamp’s arrival being $t_{last}$, the suspicion is thus given by the formula $\varphi(t_{now}) = -\log_{10}(P_{later}(t_{now} - t_{last}))$, where $P_{later}$ is computed from the inferred distribution of timestamp inter-arrival times.

Failure detectors differ in the parameters and implementation of estimating $P_{later}$, either storing all inter-arrival times for every heartbeat and using the empirical distribution function, or assuming a certain distribution for inter-arrival times and estimating the parameters of the distribution. Our PE-FD failure detector computes suspicion in constant time and space by taking advantage of the one-sided Chebyshev inequality together with empirical estimators for both the mean $\mu$ and the variance $\sigma^2$.


In equations (1) and (2), we define two helper variables that store the sum of the timestamps and the sum of the squares of the timestamps. We can then compute the mean by dividing the sum by the number of timestamps, and derive the variance via equation (4). To calculate mean and variance, we thus need to store three variables (the number of timestamps, their sum, and the sum of their squares), independent of the number of timestamps received. The suspicion can then be calculated without any additional information. To combat overflow and numerical instability, the two sums are periodically reset after a number of timestamps have been received. We call this number the “learning window”. However, after resetting, we need a certain number of heartbeats to regain a good estimate for the distribution parameters. Thus, we introduce a parameter that is the minimum number of heartbeats that needs to be received until the new estimate is used.
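The estimators described above can be written compactly as follows. This is a sketch under assumptions rather than the paper's exact equations: the symbols $x_i$ (the $i$-th recorded value), $S_1$, $S_2$, $\hat{\mu}$ and $\hat{\sigma}^2$ are chosen here for illustration, and the last line is the standard one-sided Chebyshev (Cantelli) bound applied to the time $t$ elapsed since the last heartbeat.

```latex
\begin{align}
S_1 &= \sum_{i=1}^{n} x_i, &
S_2 &= \sum_{i=1}^{n} x_i^2, \\
\hat{\mu} &= \frac{S_1}{n}, &
\hat{\sigma}^2 &= \frac{S_2}{n} - \hat{\mu}^2, \\
P_{later}(t) &\le \frac{\hat{\sigma}^2}{\hat{\sigma}^2 + (t - \hat{\mu})^2}
  \quad \text{for } t > \hat{\mu}.
\end{align}
```

Only $n$, $S_1$ and $S_2$ have to be stored, which is what makes the memory footprint independent of the number of heartbeats received.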

Listing 3.1. Suspicion calculation and state update for the PE-FD failure detector.

A Python implementation of the suspicion algorithm is included in Listing 3.1. It can be seen that the calculation of the next state takes a constant amount of computations (three additions, and one multiplication). The necessary operations to calculate a suspicion value are three divisions, three multiplications and four additions. The number of computations is independent of the chosen parameters. With extremely large values for the learning window, the two running sums could overflow, but this is not a realistic constraint for 32-bit variables.
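To illustrate the constant-time update and suspicion calculation, the following is a minimal sketch and not the authors' listing. It assumes that the suspicion is the negative base-10 logarithm of the one-sided Chebyshev bound, and that the previous estimate stays in use after a reset until the minimum number of heartbeats has been received. The names `PeFdState`, `on_heartbeat`, `suspicion`, `n_min` and `n_max` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class PeFdState:
    n: int = 0                          # inter-arrival times in the current window
    total: float = 0.0                  # running sum of inter-arrival times
    total_sq: float = 0.0               # running sum of squared inter-arrival times
    mean: Optional[float] = None        # estimate currently in use
    var: Optional[float] = None
    last_arrival: Optional[float] = None


def on_heartbeat(state: PeFdState, t_arrival: float,
                 n_min: int = 5, n_max: int = 1000) -> None:
    """Update the state with a new heartbeat (assumed learning window n_min..n_max)."""
    if state.last_arrival is not None:
        delta = t_arrival - state.last_arrival
        # Constant-time update of the three stored variables.
        state.n += 1
        state.total += delta
        state.total_sq += delta * delta
        if state.n >= n_min:
            # Enough samples: refresh the estimate that suspicion() uses.
            state.mean = state.total / state.n
            state.var = max(state.total_sq / state.n - state.mean * state.mean, 0.0)
        if state.n >= n_max:
            # Reset the sums; the old estimate stays in use until n_min is reached again.
            state.n, state.total, state.total_sq = 0, 0.0, 0.0
    state.last_arrival = t_arrival


def suspicion(state: PeFdState, t_now: float) -> float:
    """Suspicion value; 0.0 while no estimate exists or the heartbeat is not yet late."""
    if state.mean is None or state.var is None or state.last_arrival is None:
        return 0.0
    k = (t_now - state.last_arrival) - state.mean
    if k <= 0.0:
        return 0.0
    if state.var == 0.0:
        return math.inf
    p_later = state.var / (state.var + k * k)   # one-sided Chebyshev bound
    return -math.log10(p_later)
```

With heartbeats alternating between 9 and 11 seconds apart, the estimate converges to a mean of 10 and a variance of 1, so a heartbeat that is 3 seconds overdue yields a bound of 0.1 and a suspicion of about 1.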

φ-accrual failure detectors can be converted into binary failure detectors by choosing a threshold, and marking a process as failed when the suspicion rises above this threshold. We describe the choices for the threshold in the next section, and the behavior of our PE-FD failure detector is evaluated in detail in Section 6.1.

3.2. Policies for application-tuned failure detection

The PE-FD failure detector provides a number of adjustable parameters:

  • The minimum number of heartbeats required for an estimate

  • The maximum number of heartbeats until the current estimate is reset

  • The heartbeat period

  • The suspicion threshold

The minimum and maximum number of heartbeats are called the “learning window” in combination. These parameters represent a wide range of adaptability for our algorithm. By adjusting these parameters based on policies that take the structure and requirements of applications into account, failure detection can be improved over a “one-size-fits-all” approach.

The maximum and minimum number of heartbeats are relevant for nodes with changing network conditions. For example, a mobile node can benefit from a lower maximum and minimum number of heartbeats, so the failure detection algorithm can adapt to changing network conditions with fewer received heartbeats. The complementary adjustment is possible as well: For wired nodes, increasing the size of the learning window allows them to ignore transient failures, and keep the application working.

When failure detection is used in a web service composition such as those described in Section 2.1, the structure of the composition can be inspected to modify the suspicion threshold and the heartbeat period. When a task is central to an application, and no replacement is available, the heartbeat period should be set low to allow quick detection of failures. The suspicion threshold should be set relatively high, so as not to cause false positives.
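A policy of this kind could be sketched as a function that derives detector parameters from properties of a node and its task. The class `FdParams`, the function `select_params`, the baseline values and the scaling factors are all assumptions made for illustration; only the direction of each adjustment follows the text.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FdParams:
    n_min: int                  # minimum heartbeats required for an estimate
    n_max: int                  # maximum heartbeats until the estimate is reset
    heartbeat_period_s: float   # heartbeat period in seconds
    threshold: float            # suspicion threshold


# Hypothetical baseline; real values depend on the deployment.
BASE = FdParams(n_min=10, n_max=1000, heartbeat_period_s=10.0, threshold=1.0)


def select_params(mobile: bool, central: bool, replaceable: bool,
                  base: FdParams = BASE) -> FdParams:
    p = base
    if mobile:
        # Smaller learning window: adapt to changing network conditions faster.
        p = replace(p, n_min=max(2, p.n_min // 2), n_max=max(p.n_min, p.n_max // 4))
    else:
        # Larger learning window: ride out transient failures on wired nodes.
        p = replace(p, n_max=p.n_max * 2)
    if central and not replaceable:
        # Detect quickly, but avoid false positives on an irreplaceable task.
        p = replace(p, heartbeat_period_s=p.heartbeat_period_s / 2,
                    threshold=p.threshold * 2)
    elif replaceable:
        # A false positive is cheap when a replacement exists.
        p = replace(p, threshold=p.threshold / 2)
    return p
```

For example, `select_params(mobile=True, central=False, replaceable=True)` shrinks the learning window and lowers the threshold relative to `BASE`.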

4. Optimal mitigation of failures in IoT Choreographies

In this section, we describe how our failure detector is combined with a task assignment approach to an optimal self-healing procedure for IoT choreographies, and we describe the details of allocating tasks to optimize the overall energy usage.

4.1. Optimal Self-Healing Procedure

IoT applications are becoming more prevalent in various domains, e.g., smart homes and buildings, industrial manufacturing, transportation, or healthcare. Such IoT applications often consist of multiple tasks that interact in the form of a dataflow graph, where components exchange data along directed edges. A simple but popular execution engine for such applications is “IFTTT” (Section 2.1).

With the recipe concept (Section 2.1; Thuluva et al., 2017a), we have defined a schema for the expression of such dataflow graphs. These recipes are executed in a distributed fashion, and can be dynamically replaced and reconfigured (see (Seeger et al., 2018)). With the growing scale of IoT device deployments and applications, such dynamic replacement and reconfiguration will become more important. Additionally, IoT applications are penetrating more and more crucial areas (e.g., patient monitoring or optimization of industrial processes), i.e., failures may have a large impact and the ability to reconfigure the system dynamically is crucial.

In our previous work, the chosen replacement component was not evaluated for its quality, besides fulfilling the obvious functional requirements. Hence, we aim here at evaluating the replacement of devices with regard to a defined quality metric and thereby improve the operating parameters of the application and the network. This is especially important with long-running and resource-constrained processes.

Combined with the failure detection presented in Section 2.2, the steps of our optimal self-healing procedure for an IoT orchestration are as follows:

  1. Detect software or device failure

  2. Find functionally matching replacement devices

  3. Optimal assignment of application tasks to available network nodes

  4. Reconfigure application with replacement device and software

Everything in that application is allocated. We can efficiently detect software or device failure (1) via the failure detection algorithm described in Section 3, and tune the detection according to the application requirements with policies as described in Section 3.2. The semantic matching algorithm in (Seeger et al., 2018; Thuluva et al., 2017a) can then find a functionally matching replacement component (2). Then, in step (3), we evaluate the placement of recipe components on nodes of the modified graph via the allocation algorithm described in the next Section 4.2. Step (4), the reconfiguration of the instantiated recipe, then happens as described in (Seeger et al., 2018).

4.2. Energy-optimal Task Assignment

As described in Section 2.3, we have based our approach for allocating recipe tasks on (Cardellini et al., 2016). Cardellini et al. evaluate configuring a system for optimal response time and optimal availability. They formulate the allocation of operators as an ILP problem, which they hand over to an IBM CPLEX solver to find the optimal solution. We have followed this approach to optimize the energy usage of an IoT application formulated as a recipe.

We define optimality of the allocation by total energy use over one execution of the recipe. Energy during recipe execution is consumed in two phases: “Device energy” is consumed by a device when executing a task, and “network energy” is consumed by the device when sending the result of the calculation over the network. The optimal configuration of the network is the assignment of tasks to devices that results in the lowest total consumption of energy and satisfies the constraints. The constraints concern the requirements that an assignment must satisfy: Each task should only be allocated once and resource requirements for assigned tasks should not exceed the resources of the node. This problem is a form of the quadratic assignment problem, and thus NP-hard. We have developed and evaluated a heuristic that reduces the problem to a non-quadratic assignment problem, which we describe in Section 4.3.


We define the energy-optimal task assignment as follows: The recipe consists of a set of tasks $T$ connected by directed edges $E$. The network that tasks can be evaluated on consists of a set of nodes $N$ connected by a set of undirected links $L$. The result of the allocation is a matrix $X$ where $X_{t,n} = 1$ if and only if task $t$ is allocated to node $n$.

Applies to    Description
Task t        Resources required for the evaluation of task t.
Task t        Output of task t for a single received input.
Task t        Computation time required for completing task t once.
Node n        Processing power of node n.
Node n        Resources available on node n.
Node n        Energy consumption of node n for one unit of computation.
Link l        Energy use for the transfer of one data packet over link l.
Nodes u, v    Energy cost of the shortest path between nodes u and v.
Table 1. Parameters of energy-aware allocation algorithm.

Tasks, nodes and links have properties that are relevant for the energy consumption of the application once allocated. These parameters are described in Table 1. Three of these parameters are defined as multiples of the corresponding value of some reference node. The resources of a node are expressed as a single scalar, but additional resource requirements can easily be introduced into the model.

For calculating the network energy, we need to know whether a link between two tasks is assigned to a link between two nodes. For this, we introduce a matrix $Y$, where $Y_{(t_1,t_2),(n_1,n_2)} = 1$ if and only if the communication between task $t_1$ and task $t_2$ is allocated on the network link between nodes $n_1$ and $n_2$. This corresponds to $Y_{(t_1,t_2),(n_1,n_2)} = X_{t_1,n_1} \cdot X_{t_2,n_2}$, where $X_{t,n} = 1$ if and only if task $t$ is allocated to node $n$. Unfortunately, this is not a linear constraint, and thus we need to linearize the formulation.


We follow the formulation presented in (Cardellini et al., 2016) and define an ILP model as shown in Equations 7 to 14. Equations 7 to 9 describe the linearization of the network matrix $Y$. Equations 10 and 11 express the “only allocated once” and “resources not exceeded” constraints. Equations 12 and 13 calculate network and device energy as described above. Finally, we calculate the total energy use of the assignment by adding both energies in equation 14. The objective of the optimization is the minimization of the total used energy.
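An ILP of the shape described above can be sketched as follows. This is an illustration under assumptions, not the paper's equations: the symbols $R_t$ (task resources), $O_t$ (task output), $C_t$ (computation time), $S_n$ (processing power), $A_n$ (available resources), $P_n$ (energy per unit of computation) and $D_{n_1,n_2}$ (shortest-path energy cost) are chosen here, the device energy term $\frac{C_t}{S_n} P_n$ is an assumed reading of Table 1, and the three inequalities are one standard way to linearize a product of binary variables.

```latex
\begin{align}
\min_{X,Y} \quad & E_{dev} + E_{net} \\
E_{dev} &= \sum_{t \in T} \sum_{n \in N} \frac{C_t}{S_n}\, P_n\, X_{t,n} \\
E_{net} &= \sum_{(t_1,t_2) \in E} \; \sum_{n_1,n_2 \in N}
           O_{t_1}\, D_{n_1,n_2}\, Y_{(t_1,t_2),(n_1,n_2)} \\
\text{s.t.} \quad
& \sum_{n \in N} X_{t,n} = 1 && \forall t \in T \\
& \sum_{t \in T} R_t\, X_{t,n} \le A_n && \forall n \in N \\
& Y_{(t_1,t_2),(n_1,n_2)} \le X_{t_1,n_1} \\
& Y_{(t_1,t_2),(n_1,n_2)} \le X_{t_2,n_2} \\
& Y_{(t_1,t_2),(n_1,n_2)} \ge X_{t_1,n_1} + X_{t_2,n_2} - 1 \\
& X_{t,n},\; Y_{(t_1,t_2),(n_1,n_2)} \in \{0, 1\}
\end{align}
```

The number of $Y$ variables grows with $|E| \cdot |N|^2$, which is why the linearization dominates the size of the model.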

We implemented this model in a Python script using the PuLP linear programming library. We can then find a solution for the problem using the CPLEX solver, which uses a branch-and-bound approach (Ross and Soland, 1975). For a discussion of the benchmark results, see Section 6.2.
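To make the objective concrete, the sketch below evaluates the total energy of every feasible allocation for a tiny, made-up instance. It deliberately uses brute-force enumeration instead of PuLP and CPLEX, so it only scales to toy sizes. The instance data, the dictionary keys and the device energy formula (computation time divided by processing power, times energy per unit of computation) are assumptions for illustration.

```python
from itertools import product

# Hypothetical instance: three chained tasks, two nodes, one link.
TASKS = {
    "sense":  {"res": 1, "out": 4, "comp": 1.0},
    "filter": {"res": 2, "out": 1, "comp": 4.0},
    "decide": {"res": 1, "out": 0, "comp": 2.0},
}
EDGES = [("sense", "filter"), ("filter", "decide")]
NODES = {
    "a": {"res": 3, "speed": 1.0, "power": 1.0},
    "b": {"res": 3, "speed": 2.0, "power": 3.0},
}
LINKS = {("a", "b"): 0.5}  # energy per data packet, undirected


def path_costs(nodes, links):
    """All-pairs cheapest path energy via Floyd-Warshall."""
    inf = float("inf")
    d = {(u, v): 0.0 if u == v else inf for u in nodes for v in nodes}
    for (u, v), energy in links.items():
        d[u, v] = d[v, u] = min(d[u, v], energy)
    for k in nodes:
        for u in nodes:
            for v in nodes:
                d[u, v] = min(d[u, v], d[u, k] + d[k, v])
    return d


def feasible(alloc):
    """Each task is placed exactly once by construction; check node resources."""
    used = {n: 0 for n in NODES}
    for task, node in alloc.items():
        used[node] += TASKS[task]["res"]
    return all(used[n] <= NODES[n]["res"] for n in NODES)


def total_energy(alloc, dist):
    device = sum(TASKS[t]["comp"] / NODES[n]["speed"] * NODES[n]["power"]
                 for t, n in alloc.items())
    network = sum(TASKS[src]["out"] * dist[alloc[src], alloc[dst]]
                  for src, dst in EDGES)
    return device + network


def best_allocation():
    dist = path_costs(NODES, LINKS)
    candidates = (dict(zip(TASKS, choice))
                  for choice in product(NODES, repeat=len(TASKS)))
    return min((a for a in candidates if feasible(a)),
               key=lambda a: total_energy(a, dist))
```

In this instance, placing all tasks on node `a` would exceed its resources, so the search has to trade device energy against the packets sent over the link.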

4.3. A Linear Heuristic for Energy-Optimized Allocation

The quadratic assignment problem described in the previous section is NP-hard and thus compute-intensive. The culprit for this is the network cost calculation and the linearization of Y, which results in a large number of constraints. By approximating the network energy, we can get a faster solution, which is however no longer optimal. We quantify the loss of optimality and the speedup in Section 6.2.

By removing the matrix Y and the associated constraints, we create a linear problem that can be evaluated effectively by the simplex method (Nelder and Mead, 1965). Our approach approximates the energy required for sending a packet of data by taking the average of a node’s links. We introduce a parameter that describes the average transmission cost of a node’s links.


The complete model reuses constraints (10) and (11) with the constraint (15). By transforming the QAP into a linear problem, we greatly increase the speed of finding a solution, and make the optimization feasible for on-line usage.
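The effect of the approximation can be sketched on a small, made-up instance: once network energy is charged per task at the average cost of the hosting node's links, the cost of an allocation becomes a plain sum over (task, node) pairs. The instance data, the names and the device energy formula are assumptions for illustration, and the search below enumerates allocations rather than calling a linear programming solver.

```python
from itertools import product

# Hypothetical instance: three chained tasks, two nodes, one link.
TASKS = {
    "sense":  {"res": 1, "out": 4, "comp": 1.0},
    "filter": {"res": 2, "out": 1, "comp": 4.0},
    "decide": {"res": 1, "out": 0, "comp": 2.0},
}
NODES = {
    "a": {"res": 3, "speed": 1.0, "power": 1.0},
    "b": {"res": 3, "speed": 2.0, "power": 3.0},
}
LINKS = {("a", "b"): 0.5}  # energy per data packet, undirected


def avg_link_cost(node):
    """Average transmission cost over all links attached to a node."""
    costs = [energy for (u, v), energy in LINKS.items() if node in (u, v)]
    return sum(costs) / len(costs) if costs else 0.0


def approx_cost(task, node):
    """Separable cost: device energy plus output packets at the average link cost."""
    device = TASKS[task]["comp"] / NODES[node]["speed"] * NODES[node]["power"]
    return device + TASKS[task]["out"] * avg_link_cost(node)


def feasible(alloc):
    used = {n: 0 for n in NODES}
    for task, node in alloc.items():
        used[node] += TASKS[task]["res"]
    return all(used[n] <= NODES[n]["res"] for n in NODES)


def heuristic_allocation():
    candidates = (dict(zip(TASKS, choice))
                  for choice in product(NODES, repeat=len(TASKS)))
    return min((a for a in candidates if feasible(a)),
               key=lambda a: sum(approx_cost(t, n) for t, n in a.items()))
```

Because the approximation charges transmission energy even when two connected tasks share a node, the chosen allocation can differ from the energy-optimal one, which is the loss of optimality mentioned above.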

5. System Model & Application Example

This section presents our system for the optimal self-healing of IoT choreographies and describes an example use case for applying the developed system.

Figure 2. System model of an optimally self-healing IoT choreography

Figure 2 shows the integrated system combining our optimal allocator and the PE-FD failure detector. The system consists of three main components, as well as the devices in the network. A knowledge base stores the knowledge about the system, such as available devices, applications and links between devices. One possible storage mechanism for such data would be a semantic triple store such as Apache Jena. With this semantic store, the system can take advantage of semantic reasoning and translation, as described in (Seeger et al., [n.d.]). The configurator controls the creation of the system and is responsible for configuring devices into a choreography. It is not involved in the normal operation of the system, only in administrative actions (such as reconfiguring applications when devices fail). The allocator is the component responsible for running the allocation algorithms described in Section 4. Finally, the network contains devices that communicate via heterogeneous network links. These devices are running an engine that supports configuration by the configurator.

When a device or software component fails, this is detected by the PE-FD failure detection algorithm, and the configurator is informed by the devices that have detected the failure (1). The configurator retrieves the applications that the failed component was part of from the knowledge base and finds replacements for the failed devices among those available in the network (2). Then, the set of applications and available devices is passed to the allocator (3), which computes an allocation for tasks and devices, and returns the resulting allocation to the configurator (4), which then applies the new configuration to devices in the network (5).
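The interaction between the components can be sketched as a single function on the configurator side. All names and signatures here are hypothetical; the knowledge base, replacement search, allocator and device configuration are passed in as callables so that the sketch makes no claim about how they are implemented.

```python
from typing import Callable, Dict, List

Allocation = Dict[str, str]  # task name -> device name


def self_heal(
    failed_device: str,
    affected_apps: Callable[[str], List[str]],
    find_replacements: Callable[[List[str], str], List[str]],
    allocate: Callable[[List[str], List[str]], Allocation],
    apply_configuration: Callable[[Allocation], None],
) -> Allocation:
    """Run steps (2) to (5) once a failure report (1) has reached the configurator."""
    apps = affected_apps(failed_device)                # (2) look up affected applications
    devices = find_replacements(apps, failed_device)   # (2) find replacement devices
    allocation = allocate(apps, devices)               # (3) and (4) query the allocator
    apply_configuration(allocation)                    # (5) reconfigure the devices
    return allocation
```

Keeping the allocator behind a plain callable makes it possible to swap the exact ILP for the linear heuristic without touching the rest of the loop.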





Figure 3. Wireless vibration analysis use case (based on: Krügel et al. (Krügel et al., 2019))

As motivated above, the usage of such a system for optimally self-healing IoT choreographies is increasingly important. As an example use case for this system, we discuss here the vibration analysis of rotating machinery via vibration sensors as described by Krügel et al. (Krügel et al., 2019). The structure of the application is shown in Figure 3. Vibration sensors sense the vibrations of a rotating machine (such as an engine or fan). The vibration information is preprocessed for the analysis and fed into a reduced-complexity model of the machine. From this model, the key performance indicators are derived. These KPIs are forces acting on the machine parts. The forces are then postprocessed, and finally, a maintenance decision is made. In parallel, signal-based fault detection based on flags can detect faults.

It is easy to see that these components of the vibration analysis have different requirements for processing power and required resources. The sensor nodes running the analysis are battery powered and wireless, since they need to be non-invasive and placed on machines without introducing extra infrastructure. As such, using an energy-optimal allocation for tasks is important to maximize the runtime of the analysis.

6. Evaluation

In this section, we first evaluate our failure detector, PE-FD (Section 6.1), and then the mitigator component for optimal allocation of tasks (Section 6.2).

6.1. Evaluation of PE-FD Algorithm

(a) False positive rate vs. threshold
(b) Detection time (s) vs. threshold
Figure 4. Behavior of PE-FD with varying thresholds

By adjusting the threshold, we can modify the behavior of the failure detection algorithm. A lower threshold leads to faster detection of failures, while increasing the threshold reduces the number of false positives. Figure 4 shows the behavior of an example failure detection run. We chose normally distributed timestamp inter-arrival times: we generated inter-arrival times with a mean of 20 seconds and a variance of 5 seconds for 3000 seconds, and then increased the mean of the distribution to 50 seconds for another 3000 seconds. Such a shift might happen when the sending node switches into an energy saving mode, or changes its network connection to one with increased latency. We then calculated the detection time (first correct “failed” verdict) and mistake rate (incorrect “failed” verdicts) for thresholds from 0.1 to 2.0. As seen in Figure 4, increasing the threshold decreases the false positive rate, but increases the detection time.
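The synthetic input for such an experiment could be generated along the following lines. This is a minimal sketch under stated assumptions: the function name `interarrival_trace` is hypothetical, the stated "variance of 5" is interpreted as a variance (so the standard deviation is its square root), and the PE-FD suspicion function itself is not reproduced here.

```python
import math
import random


def interarrival_trace(phases, seed=0):
    """Generate timestamp arrival times from normally distributed gaps.

    phases: list of (mean_s, variance, duration_s) tuples, one per phase.
    Returns the absolute arrival times in seconds.
    """
    rng = random.Random(seed)
    arrivals = []
    now = 0.0
    phase_end = 0.0
    for mean, variance, duration in phases:
        phase_end += duration
        std = math.sqrt(variance)
        while True:
            # Clip to a small positive gap so that time always advances.
            gap = max(1e-3, rng.gauss(mean, std))
            if now + gap > phase_end:
                break
            now += gap
            arrivals.append(now)
    return arrivals


# Mean 20 s for 3000 s, then mean 50 s for another 3000 s.
trace = interarrival_trace([(20.0, 5.0, 3000.0), (50.0, 5.0, 3000.0)])
```

A failure detector under test would then be fed these arrival times and evaluated once per threshold value.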

Selecting such a threshold can be done with application-level policies based on reconfiguration policies for a node. A node for which replacements are available can be configured with a lower threshold, since replacing it on a false positive will still allow the system to function, and the node will be replaced faster on a “true” failure.

Figure 5. Detection time vs. timestamp interval.

Adjusting the timestamp interval has an impact on the detection time. By lowering the timestamp interval, the detection time is decreased at an increased cost of network traffic.

Additionally, lowering the timestamp interval can drastically decrease battery life for energy-starved nodes, as waking up and sending packets consumes a large amount of energy.

An example for this behavior in 10 runs of the PE-FD can be seen in Figure 5. We generated timestamp times that were normally distributed around the sending interval with a variance of 1 to account for network delays for a period of 1000 seconds. We configured PE-FD with a threshold of 0.8, and an infinite learning window. We then sampled the suspicion function every 5 seconds, and measured the detection time (i.e. the first “true” detection) for the resulting suspicion values.

We thus see that it is advantageous to adjust the timestamp interval of the algorithm dynamically, trading off between the importance of high detection speed (which might be mandated by QoS requirements made by the user) and network and battery efficiency. Policies to adjust the timestamp interval should take into account the “kind” of node they are operating on (wireless or wired network, battery or mains powered) to achieve optimal results.
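A policy of this kind might look like the sketch below. All names (`NodeProfile`, `DetectorConfig`, `select_config`) and all numeric values are hypothetical and chosen only for illustration; in particular, halving the QoS bound to obtain the interval is an assumed rule of thumb, not a result from our evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeProfile:
    battery_powered: bool
    wireless: bool
    replacements_available: int


@dataclass(frozen=True)
class DetectorConfig:
    threshold: float
    timestamp_interval_s: float


def select_config(node, max_detection_time_s=None):
    """Pick detector parameters from the kind of node (illustrative values)."""
    # Replaceable nodes tolerate false positives, so use a lower threshold.
    threshold = 0.5 if node.replacements_available > 0 else 1.5
    # Start from a short interval and relax it for constrained nodes.
    interval = 5.0
    if node.wireless:
        interval *= 2  # save network traffic on wireless links
    if node.battery_powered:
        interval *= 2  # save energy on battery-powered nodes
    # A QoS bound on detection time overrides the efficiency preference.
    if max_detection_time_s is not None:
        interval = min(interval, max_detection_time_s / 2)
    return DetectorConfig(threshold, interval)


sensor = NodeProfile(battery_powered=True, wireless=True, replacements_available=1)
gateway = NodeProfile(battery_powered=False, wireless=False, replacements_available=0)
sensor_config = select_config(sensor)
gateway_config = select_config(gateway)
```

The replaceable battery-powered sensor ends up with a low threshold and a long interval, while the irreplaceable mains-powered gateway gets a high threshold and a short interval.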

(a) Detection time vs. learning window size.
(b) Mistake rate vs. learning window size.
Figure 6. Effect of learning window on parameters.

The final parameter remaining is the size of the learning window. We evaluated different configurations of PE-FD with varying window sizes. The timestamps for this experiment were generated with a normal distribution. We sampled timestamp inter-arrival times with a mean of 20 seconds for 1000 seconds, with a mean of 50 seconds for 1000 seconds, and again with a mean of 20 seconds for 500 seconds. This resulted in a total of 95 timestamps. Figure 6 shows the effect of the window size on mistake rate and detection time.

The graph shows an interesting trend at a window size of 50. The first phase generated approximately 50 timestamps (1000 seconds / 20 seconds interval) with a 20 second delay. This means the learning window of the size-50 configuration was reset right as the distribution was changing. Also, 50 is the largest configuration smaller than the “period” of timestamp changes. This means that this configuration adapts most slowly to changes. We can see this in the strong growth of both mistake rate and detection time. Smaller configurations adapt faster, while larger configurations “smear” across the two timestamp distributions and learn a “mixed” distribution with a mean between 20 and 50, and a higher variance. We see this in the graph by the detection time decreasing with larger window sizes. The majority of incorrect suspicions are generated at the change from the 20-second to the 50-second distribution, at which time there were only 50 timestamps. Thus, the configurations with a window size greater than 50 perform the same as the size-50 configuration. Since the average delay decreased at the change from 50 to 20 seconds, this change generated almost no suspicion: the suspicion calculated was set to zero for most samplings of the suspicion function (see Equation (6)).

The general takeaway is that setting the learning window is difficult. If periodic changes in the timestamp distribution are expected, care should be taken to select a learning window smaller than the period of change. If no periodic changes are expected, a large learning window should decrease false positives.
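As a rough check of the timestamp count, assume the three phases have mean inter-arrival times of 20, 50, and 20 seconds (consistent with the changes between 20 and 50 discussed above) and ignore the random variation of the individual gaps. The expected number of timestamps is then:

```latex
\frac{1000\,\mathrm{s}}{20\,\mathrm{s}}
+ \frac{1000\,\mathrm{s}}{50\,\mathrm{s}}
+ \frac{500\,\mathrm{s}}{20\,\mathrm{s}}
= 50 + 20 + 25 = 95
```

The first term also explains why a window of 50 timestamps fills up just as the first change of distribution occurs.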

6.2. Evaluation of Mitigator

To evaluate the performance of the mitigator, we have built a Python-based evaluation framework. We evaluate the performance of the mitigator by generating a random network and a random recipe, and letting the allocator find the optimal allocation.

We generate the network with two classes of nodes: wireless nodes are connected via an energy-inefficient wireless connection, and wired nodes are connected via an energy-efficient wired connection. In our configuration, 60% of the nodes are wired nodes, and the remaining 40% are wireless nodes. Nodes are connected to each other with a certain probability. That probability is 0.8 for wired-wired connections, 0.5 for wireless-wireless connections and 0.4 for wireless-wired connections. Wired connections use 0.2 units of energy, while wireless connections use 0.8 units of energy. Nodes have a varying amount of resources, uniformly distributed between a lower bound of 1 and an upper bound of 8 resource units. Nodes also have a varying processing speed, with a speedup factor between 1 and 3 compared to a reference processor. Finally, nodes use between 0.5 and 1.5 times as much energy as a reference processor for a single unit of computation.
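A network generator with these parameters could be sketched as follows. This is not the code of our evaluation framework: the function name `random_network` and the dictionary layout are hypothetical, resources are assumed to be integers, speedup and energy factor are assumed to be continuous uniform values, and a mixed wireless-wired link is assumed to cost as much as a wireless one.

```python
import random
from itertools import combinations

# Link probability by the kinds of the two endpoints.
LINK_PROBABILITY = {
    frozenset(["wired"]): 0.8,
    frozenset(["wireless"]): 0.5,
    frozenset(["wired", "wireless"]): 0.4,
}


def random_network(n_nodes, seed=0):
    """Return (nodes, links) for a random network of n_nodes nodes."""
    rng = random.Random(seed)
    n_wired = round(0.6 * n_nodes)  # 60% wired, remaining nodes wireless
    nodes = {}
    for i in range(n_nodes):
        nodes[i] = {
            "kind": "wired" if i < n_wired else "wireless",
            "resources": rng.randint(1, 8),          # assumed integer units
            "speedup": rng.uniform(1.0, 3.0),        # vs. reference processor
            "energy_factor": rng.uniform(0.5, 1.5),  # vs. reference processor
        }
    links = {}
    for a, b in combinations(nodes, 2):
        kinds = frozenset([nodes[a]["kind"], nodes[b]["kind"]])
        if rng.random() < LINK_PROBABILITY[kinds]:
            # Assumption: any link touching a wireless node costs 0.8 units.
            links[(a, b)] = 0.2 if kinds == {"wired"} else 0.8
    return nodes, links


nodes, links = random_network(10)
```

Fixing the seed makes a generated network reproducible, which helps when the same network has to be given to several allocation algorithms.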

Figure 7. “Long” (left) and “wide” (right) recipes.

For the recipe, we generate two classes of recipes with a certain number of tasks: a “wide” recipe and a “long” recipe. In a “wide” recipe, two tasks are designated the “start” and “end” tasks, and every other task needs input from the start task and sends output to the end task. In a “long” recipe, tasks are linked serially. Figure 7 shows two example recipes. Each recipe task has resource requirements randomly distributed between 1 and 8, an output factor randomly distributed between 0.5 and 1.5, and a computation size of 1 or 2.
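The two recipe shapes can be illustrated with a short generator. Again, this is only a sketch: the function name `make_recipe` and the data layout are hypothetical, resource requirements are assumed to be integers, and task 0 and the last task are arbitrarily chosen as the “start” and “end” tasks of a wide recipe.

```python
import random


def make_recipe(n_tasks, shape, seed=0):
    """Return (tasks, edges) for a "long" or "wide" recipe."""
    if n_tasks < 3:
        raise ValueError("a recipe needs at least 3 tasks")
    rng = random.Random(seed)
    tasks = {
        t: {
            "resources": rng.randint(1, 8),          # assumed integer units
            "output_factor": rng.uniform(0.5, 1.5),
            "computation": rng.choice([1, 2]),
        }
        for t in range(n_tasks)
    }
    if shape == "long":
        # Tasks are linked serially: 0 -> 1 -> ... -> n-1.
        edges = [(t, t + 1) for t in range(n_tasks - 1)]
    elif shape == "wide":
        # Every inner task reads from the start task and writes to the end task.
        start, end = 0, n_tasks - 1
        inner = range(1, end)
        edges = [(start, t) for t in inner] + [(t, end) for t in inner]
    else:
        raise ValueError("shape must be 'long' or 'wide'")
    return tasks, edges


_, long_edges = make_recipe(5, "long")
_, wide_edges = make_recipe(5, "wide")
```

For five tasks, the long recipe has four serial edges, while the wide recipe has three inner tasks, each with one incoming and one outgoing edge.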

(a) CPU time for the optimal allocation algorithm vs. the number of nodes. Each experiment with n nodes was measured 5 times with 3 to n-1 tasks.
(b) CPU time for the optimal allocation algorithm vs. the number of tasks. Each experiment with n tasks was measured 5 times with 5 to 20 nodes.
Figure 8. Runtime for optimal allocation.

As expected, the optimal allocation algorithm scales very badly (non-polynomially). In Figure 8, we see the runtime of the algorithm for varying problem sizes. The shaded area shows the variance with respect to the parameter that is not shown (different recipe sizes for the network node graph, differing network sizes for the recipe node graph). The time needed for finding the optimal allocation becomes unwieldy very quickly.

(a) CPU time of the allocation heuristic vs. the number of nodes. Each experiment with n nodes was measured 5 times with 3 to n-1 tasks.
(b) CPU time of the allocation heuristic vs. the number of tasks. Each experiment with n tasks was measured 5 times with 5 to 20 nodes.
Figure 9. Heuristic runtime

In comparison, our heuristic presented in Section 4.3 finds a solution much more quickly. Figure 9 shows the runtime of the heuristic for different network and recipe sizes. For the slowest case for the full allocation, the heuristic takes 8 seconds of CPU time, while the solver consumes 864104 seconds (about 10 days) of CPU time for finding the optimal allocation. The allocation evaluation was executed on an Amazon EC2 m4.10xlarge machine with 40 virtual cores and 160 GiB of memory. Peak memory use was 51 GiB.

Figure 10. Energy consumption of heuristic solution scaled against optimal solution.

However, the heuristic loses about 30% of energy efficiency compared to the optimal algorithm. As seen in Figure 10, 50% of the solutions lie in the 0.6 to 0.8 range.

Figure 11. Performance of heuristic vs. number of tasks and recipe type.

In Figure 11, the performance of the heuristic as related to the size of the recipe can be seen. The performance of the heuristic decreases with larger networks. This can be explained by network links becoming longer, as the difference between the node-local transmission energy and the actual transmission energy grows with a growing network. The diagram also shows that the long recipe is harder to allocate than the wide recipe.

This improves on the results shown in (Cardellini et al., 2016) for the sampling heuristic, where a sampling factor of 30% (i.e. only about one third of all nodes was considered) led to a runtime of 5 seconds, but a solution quality of 40% for the sequential application.

7. Conclusions & Future Work

Today, IoT applications are increasingly executed in edge environments to avoid latency and privacy issues associated with a cloud-based execution. To enable the execution of complex applications on the edge, we need to split them into separate tasks and execute them on multiple devices. An example of such a complex application has been described above: the vibration analysis of rotating machinery on the manufacturing shop floor. Running such distributed applications reliably is a challenge.

In this work, we present a system that supports the self-healing of such IoT choreographies. The system rests on two main contributions: (1) a novel failure detector concept that supports a wide range of parameters for application-specific and policy-based configuration, together with guidelines for the selection of these parameters, and (2) an ILP formulation for optimal task allocation with regard to energy, together with a heuristic that makes on-line computation of allocations feasible. We have evaluated both the PE-FD failure detector and the performance of the allocation algorithm.

In the future, we plan to formalize the policies described textually in Section 3.2, possibly in the form of semantic rules in combination with a reasoner. In conjunction with formally described application requirements, this will allow us to automatically infer a tuned failure detector parameterization without manual configuration.

Further, we aim to extend our allocation benchmarking framework to evaluate other heuristics for allocation, taking into account resource distribution in the network. These heuristics will likely perform better on large networks, where the simple “outgoing” heuristic fails.

Additionally, realizing and evaluating the use case as described in Section 5 will be required to gain a better understanding of the use of allocation and failure detection in industrial automation systems. We also aim to integrate our system with the Node-RED framework for the easy creation of applications. With approaches such as Distributed Node-RED (Giang et al., 2015) and the traction Node-RED is gaining in automation communities, this will be a promising field in which to apply the techniques described in this paper.

This work is part of the SEMIoTICS project, which develops a pattern-driven framework to guarantee secure and dependable behavior in IoT environments (Fysarakis et al., 2019). It received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No. 780315.


  • Bertier et al. (2002) M. Bertier, O. Marin, and P. Sens. 2002. Implementation and performance evaluation of an adaptable failure detector. In Proceedings International Conference on Dependable Systems and Networks. 354–363.
  • Cardellini et al. (2016) Valeria Cardellini, Vincenzo Grassi, Francesco Lo Presti, and Matteo Nardelli. 2016. Optimal operator placement for distributed stream processing applications. In Proceedings of the 10th ACM International Conference on Distributed and Event-based Systems. ACM Press, 69–80.
  • Chandra and Toueg (1996) Tushar Deepak Chandra and Sam Toueg. 1996. Unreliable Failure Detectors for Reliable Distributed Systems. J. ACM 43, 2 (March 1996), 225–267.
  • Chen et al. (2002) Wei Chen, S. Toueg, and M. K. Aguilera. 2002. On the quality of service of failure detectors. IEEE Trans. Comput. 51, 1 (Jan. 2002), 13–32.
  • Chetan et al. (2005) Shiva Chetan, Anand Ranganathan, and R. Campbell. 2005. Towards fault tolerance pervasive computing. IEEE Technology and Society Magazine 24, 1 (2005), 38–44.
  • De Moraes Rossetto et al. (2015) Anubis Graciela De Moraes Rossetto, Carlos O. Rolim, Valderi Leithardt, Guilherme A. Borges, Cláudio F.R. Geyer, Luciana Arantes, and Pierre Sens. 2015. A new unreliable failure detector for self-healing in ubiquitous environments. In Proceedings - International Conference on Advanced Information Networking and Applications, AINA.
  • Défago et al. (2004) X. Défago, N. Hayashibara, R. Yared, and T. Katayama. 2004. The Accrual Failure Detector. In Reliable Distributed Systems, IEEE Symposium on (SRDS). 66–78.
  • Fysarakis et al. (2019) K. Fysarakis, G. Panoudakis, N. Petroulakis, O. Soultatos, A. Bröring, and T. Marktscheffel. 2019. Architectural Patterns for Secure IoT Orchestrations. In Global Internet of Things Summit (GIoTS 2019), 17.-21. June 2019, Aarhus, DK. IEEE.
  • Giang et al. (2015) N. K. Giang, M. Blackstock, R. Lea, and V. C. M. Leung. 2015. Developing IoT applications in the Fog: A Distributed Dataflow approach. In 2015 5th International Conference on the Internet of Things (IOT). 155–162.
  • Guclu et al. (2016) Sila Ozen Guclu, Tanir Ozcelebi, and Johan Lukkien. 2016. Distributed Fault Detection in Smart Spaces Based on Trust Management. Procedia Computer Science 83 (Jan. 2016), 66–73.
  • Haubenwaller and Vandikas (2015) Andreas Moregård Haubenwaller and Konstantinos Vandikas. 2015. Computations on the edge in the internet of things. Procedia Computer Science 52 (2015), 29–34.
  • Khan et al. (2017) W. Z. Khan, M. Y. Aalsalem, M. K. Khan, M. S. Hossain, and M. Atiquzzaman. 2017. A reliable Internet of Things based architecture for oil and gas industry. In 2017 19th International Conference on Advanced Communication Technology (ICACT). 705–710.
  • Kodeswaran et al. (2016) Palanivel A. Kodeswaran, Ravi Kokku, Sayandeep Sen, and Mudhakar Srivatsa. 2016. Idea: A System for Efficient Failure Management in Smart IoT Environments. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys ’16). ACM, New York, NY, USA, 43–56.
  • Krügel et al. (2019) S. Krügel, J. Maierhofer, T. Thümmel, and D. J. Rixen. 2019. Rotor Model Reduction for Wireless Sensor Node Based Monitoring Systems. 13th International Conference on Dynamics of Rotating Machines (2019).
  • Lakshmanan et al. (2008) G. T. Lakshmanan, Y. Li, and R. Strom. 2008. Placement Strategies for Internet-Scale Data Stream Systems. IEEE Internet Computing 12, 6 (Nov. 2008), 50–60.
  • Liu et al. (2018) Jiaxi Liu, Zhibo Wu, Jian Dong, Jin Wu, and Dongxin Wen. 2018. An energy-efficient failure detector for vehicular cloud computing. PLOS ONE 13, 1 (Jan. 2018), e0191577.
  • Mohan and Kangasharju (2016) Nitinder Mohan and Jussi Kangasharju. 2016. Edge-Fog cloud: A distributed cloud for Internet of Things computations. In 2016 Cloudification of the Internet of Things (CIoT). IEEE, 1–6.
  • Nelder and Mead (1965) J. A. Nelder and R. Mead. 1965. A Simplex Method for Function Minimization. Comput. J. 7, 4 (Jan. 1965), 308–313.
  • Ross and Soland (1975) G. Terry Ross and Richard M. Soland. 1975. A branch and bound algorithm for the generalized assignment problem. Mathematical Programming 8, 1 (Dec. 1975), 91–103.
  • Ruta et al. (2014) M. Ruta, F. Scioscia, G. Loseto, and E. Di Sciascio. 2014. Semantic-Based Resource Discovery and Orchestration in Home and Building Automation: A Multi-Agent Approach. IEEE Transactions on Industrial Informatics 10, 1 (Feb. 2014), 730–741.
  • Sahni et al. (2017) Yuvraj Sahni, Jiannong Cao, Shigeng Zhang, and Lei Yang. 2017. Edge Mesh: A new paradigm to enable distributed intelligence in Internet of Things. IEEE access 5 (2017), 16441–16458.
  • Samie et al. (2016) Farzad Samie, Vasileios Tsoutsouras, Lars Bauer, Sotirios Xydis, Dimitrios Soudris, and Jörg Henkel. 2016. Computation offloading and resource allocation for low-power IoT edge devices. In 2016 IEEE 3rd World Forum on Internet of Things (WF-IoT). IEEE, 7–12.
  • Sardellitti et al. (2015) Stefania Sardellitti, Gesualdo Scutari, and Sergio Barbarossa. 2015. Joint optimization of radio and computational resources for multicell mobile-edge computing. IEEE Transactions on Signal and Information Processing over Networks 1, 2 (2015), 89–103.
  • Satzger et al. (2007) Benjamin Satzger, Andreas Pietzowski, Wolfgang Trumler, and Theo Ungerer. 2007. A New Adaptive Accrual Failure Detector for Dependable Distributed Systems. In Proceedings of the 2007 ACM Symposium on Applied Computing (SAC ’07). ACM, New York, NY, USA, 551–555.
  • Seeger et al. ([n.d.]) J. Seeger, A. Bröring, M.-O. Pahl, and E. Sakic. [n.d.]. Rule-Based Translation of Application-Level QoS Constraints into SDN Configurations for the IoT. In EuCNC 2019, Valencia, Spain. IEEE.
  • Seeger et al. (2018) Jan Seeger, Rohit A. Deshmukh, and Arne Bröring. 2018. Running Distributed and Dynamic IoT Choreographies. In 2018 IEEE Global Internet of Things Summit (GIoTS) Proceedings, Vol. 2. IEEE, Bilbao, Spain, 33–38. arXiv: 1802.03159.
  • Seeger et al. (2019) J. Seeger, R. A. Deshmukh, V. Sarafov, and A. Bröring. 2019. Dynamic IoT Choreographies. IEEE Pervasive Computing 18, 1 (Jan. 2019), 19–27.
  • Sheng et al. (2014) Quan Z. Sheng, Xiaoqiang Qiao, Athanasios V. Vasilakos, Claudia Szabo, Scott Bourne, and Xiaofei Xu. 2014. Web services composition: A decade’s overview. Information Sciences 280 (Oct. 2014), 218–238.
  • Shi et al. (2016) W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu. 2016. Edge Computing: Vision and Challenges. IEEE Internet of Things Journal 3, 5 (Oct. 2016), 637–646.
  • Tanenbaum and Steen (2007) Andrew S. Tanenbaum and Maarten van Steen. 2007. Distributed systems - principles and paradigms, 2nd Edition. Pearson Education.
  • Thuluva et al. (2017a) Aparna Saisree Thuluva, Arne Bröring, Ganindu P. Medagoda, Hettige Don, Darko Anicic, and Jan Seeger. 2017a. Recipes for IoT Applications. In Proceedings of the Seventh International Conference on the Internet of Things (IoT ’17). ACM, New York, NY, USA, 10:1–10:8.
  • Thuluva et al. (2017b) Aparna Saisree Thuluva, Kirill Dorofeev, Monika Wenger, Darko Anicic, and Sebastian Rudolph. 2017b. Semantic-Based Approach for Low-Effort Engineering of Automation Systems. In On the Move to Meaningful Internet Systems. OTM 2017 Conferences (Lecture Notes in Computer Science). Springer, Cham, 497–512.
  • Ur et al. (2016) Blase Ur, Melwyn Pak Yong Ho, Stephen Brawner, Jiyun Lee, Sarah Mennicken, Noah Picard, Diane Schulze, and Michael L. Littman. 2016. Trigger-Action Programming in the Wild: An Analysis of 200,000 IFTTT Recipes. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI ’16). ACM, New York, NY, USA, 3227–3231.