A Roadmap Towards Resilient Internet of Things for Cyber-Physical Systems

10/16/2018
by   Denise Ratasich, et al.
Intel

The Internet of Things (IoT) is a ubiquitous system connecting many different devices (the things) that can be accessed remotely. Because the IoT makes it possible to monitor and control the physical environment remotely, that is, the IoT contains cyber-physical systems (CPS), the two concepts of dependability and security become deeply intertwined. The increasing level of dynamicity, heterogeneity, and complexity adds to the system's vulnerability and challenges its ability to react to faults. This paper summarizes the state of the art of existing surveys on anomaly detection, fault-tolerance and self-healing, and adds a number of other methods applicable to achieve resilience in an IoT. We particularly focus on non-intrusive methods ensuring data integrity in the network. Furthermore, this paper presents the main challenges in building a resilient IoT for CPS. It further summarizes our solutions, work in progress and future work on this topic for the project "Trustworthy IoT for CPS". Finally, this framework is applied to the selected use case of a smart sensor infrastructure in the transport domain.


1 Introduction

Cyber-physical systems (CPS) [1, 2, 3, 4] are emerging smart information and communications technology (ICT) systems that deeply influence our society in several application domains. Examples include unmanned aerial vehicles (UAV), wireless sensor networks, (semi-) autonomous cars [5], vehicular networks [3] and a new generation of sophisticated life-critical and networked medical devices [6]. CPS consist of collaborative computational entities that interact tightly with physical components through sensors and actuators. They are usually federated as a system-of-systems, communicating with each other and with humans over the Internet of Things (IoT), a network infrastructure enabling the interoperability of these devices.

Figure 1: Introduction to CPS and Internet of Things (IoT) and growing trends of connected devices and generated data. Data Sources: [7, 8, 9, 10]

1.1 Motivation for Resilient CPS

The advent of the Internet has revolutionized the communication between humans. Similarly the CPS and IoT are reshaping the way in which we perceive and interact with our physical world. This comes at a price: these systems are becoming so pervasive in our daily life that failures and security vulnerabilities can be the cause of fatal accidents, undermining their trustworthiness in the public eye.

In recent years, mainstream newspapers have published several articles about CPS being recalled from the market due to software and/or hardware bugs. For example, in 2015 The New York Times reported [11] the discovery of a software bug in the Boeing 787 that could cause “the plane power control units to shut down power generators if they were powered without interruption for 248 days”. The Washington Post has recently published an article [12] about Fiat Chrysler Automobiles NV recalling over 4.8 million U.S. vehicles for a defect that prevents drivers from shutting off cruise control, exposing them to a potential hazard. The recent accident in which Uber’s self-driving vehicle killed a pedestrian shocked the world [13], raising several concerns about the safety and trustworthiness of this technology.

With the connection of a CPS to the Internet, security also becomes a crucial factor that is intertwined with safety (“if it is not secure it is not safe” [14]). The tight interaction between the software and the physical components in CPS enables cyber-attacks to have catastrophic physical consequences. The Guardian reported last year [15] that over half a million pacemakers had been recalled by the American Food and Drug Administration due to fears that hackers could exploit cyber security flaws to deplete their batteries or to alter the patient’s heartbeat. In 2015 the BBC announced [16] that the blackout of the Ukrainian power grid was the consequence of malware installed on computer systems at power generation firms, which gave the hackers remote access to these computers. In the same year, two hackers demonstrated in front of the media [17] that they could hijack a Jeep over the Internet.

The rise of the IoT, which is forecast to grow to 75 billion devices in 2025 (Fig. 1), is exacerbating the problem by providing an incredibly powerful platform to amplify these cyber-attacks. An example is the Mirai botnet, which in 2016 exploited more than 400,000 devices connected through the IoT as a vehicle to launch some of the most potent distributed denial-of-service (DDoS) attacks in history [18].

Managing and monitoring such ultra-large-scale systems is becoming extremely challenging. A desired property in this setting is resilience, i.e., the service delivery (or functionality) that can justifiably be trusted persists when facing changes [19]. In other words, the system shall remain safe and secure in the event of faults and threats (see Fig. 2 for some examples in the automotive domain) that may be unpredictable at design time or may emerge during runtime [19, 14].

Figure 2: Examples of faults and threats in a connected vehicle.

1.2 State-of-the-Art

Resilience has been identified and discussed as a challenge in IoT [20, 21, 14, 22]. However, it has been mostly studied in other areas of computer science (see Table 1). The majority of surveys focus on one building block of a resilient system, e.g., a CPS, or one attribute of resilience. For instance, some publications survey security by intrusion detection [23, 24] (e.g., based on machine learning / data mining [25] or computational intelligence [26]). Recent surveys on the IoT (Table 2) review definitions, state IoT and research challenges or discuss technologies to enable interoperability and management of the IoT. However, to the best of our knowledge, resilience, adaptation and long-term dependability and security have not yet been discussed in the context of IoT for CPS.

References | Dependability Techniques (Detection, Diagnosis, Recovery) | Security Techniques (Detection, Mitigation) | Challenges | Case Study
[27, 19, 28] ✓(1) ✓(1)
[29, 30]
[31]
[23, 24, 25]
[32, 33, 34, 35, 36]
[37]
[38]
[39, 40, 41, 42]
This paper
Table 1: Comparison to resilience roadmaps and surveys (annotations: (1) terminology only).

References | Enabling Technologies | Resilience Techniques (Dependability, Security) | Challenges | Applications | Case Study
[20, 21]
[43]
[22]
[14]
[44, 45, 46]
[47]
[48]
This paper
Table 2: Comparison to IoT roadmaps and surveys.

1.3 Novel Contributions

This paper provides an overview of the state of the art in resilience (that is, dependability and security) for the IoT. We focus on resilience mechanisms that can be applied during runtime and may be extended to adapt, such that a system undergoing changes remains resilient. We discuss a roadmap to achieve resilience, and illustrate our recent work on this topic with a case study. In particular:

  • We summarize state-of-the-art methods and discuss recent work on detection, diagnosis, recovery and/or mitigation of faults. Due to the expected heterogeneous architecture, we specifically target non-intrusive methods which reason and act in the communication network or at the interfaces of the IoT devices.

  • We state the challenges of these techniques when applied in the IoT and depict a roadmap on how to achieve resilience in an IoT for CPSs.

  • Besides discussing several new perspectives, we demonstrate some of our key methods/solutions and ongoing work on providing high resilience for the information collected and employed by the IoT in an automotive case study.

1.4 Organization of the Paper

The rest of the paper is organized as follows (see also Fig. 3). The next two sections (Section 2 and Section 3) introduce the terminology around resilience, fault types and examples, building blocks of resilient systems and architectural layers to the readers. Section 4 collects state-of-the-art techniques for fault detection and recovery. Section 5 states research challenges for resilience, and particularly for the long-term dependability and security. Section 6 discusses challenges and our roadmap to resilience in IoT with several new perspectives. Section 7 presents some of our key solutions to this topic on the case study “resilient smart mobility”. Section 8 finally concludes the paper with a discussion of the presented solutions and future work.

Figure 3: Organization of the paper.

2 Resilience

In order to provide a better understanding of resilient IoT, we introduce resilience and its terminology in this section.

2.1 Attributes of Resilience

We desire the IoT for CPS to be dependable and secure throughout its entire life-cycle. Avizienis [27] defines the dependability of a system as the combination of the following attributes: availability (readiness for correct service), reliability (continuity of correct service), safety (absence of catastrophic consequences), integrity (absence of improper system alterations) and maintainability (ability to undergo modifications and repairs). Security comprises availability, integrity and confidentiality (the absence of unauthorized disclosure of information).

Robustness can be considered another attribute of dependability. It has its roots in control theory and CPS, where a system is called robust if it continues to function properly under faults of stochastic nature (e.g., noise). In recent work on the concepts of cyber-physical systems-of-systems (CPSoS) [4], robustness is extended to also cover security issues in CPS: “Robustness is the dependability with respect to external faults (including malicious external actions)”. Figure 4 summarizes the attributes of a resilient system.

[Figure 4 diagram; labels: resilience, security, dependability, availability, integrity, confidentiality, maintainability, safety, reliability, robustness, all kinds of faults, external faults, long-term dependability and security or scalable resilience, evolvability]
Figure 4: Relation of system attributes in the context of resilience.

A fault-tolerant system recovers from faults to ensure the ongoing service [27], i.e., achieving dependability and robustness of a system.

The term resilience is often used by the security community to describe the resistance to attacks (malicious faults). Laprie [19] defines resilience for a ubiquitous, large-scale, evolving system: Resilience is “The persistence of service delivery that can justifiably be trusted, when facing changes.”. The author builds upon the definition of dependability by giving the following short definition of resilience “The persistence of dependability when facing changes.”.

A ubiquitous, heterogeneous, complex system-of-systems will typically change over time, raising the need for the dependability and security established during design time to scale up. We therefore find the definition of resilience from Laprie [19] a good fit to express the needs of an IoT for CPS. A resilient IoT ensures its functionality even when facing unexpected failures. Moreover, it should scale dependability and security when it comes to functional, environmental and technological changes [49]; we refer to this capability as long-term dependability and security.

However, to ensure resilience in a system, two important factors need to be analyzed: i) possible faults (the sources of dependability and security threats, see Sec. 3) and ii) available detection and mitigation methodologies (techniques and actions to apply, see Sec. 4).

3 Faults, Errors, Failures and Attacks

A failure is an event that occurs when a system deviates from its intended behavior. The failure manifests due to an unintended state - the error - of one or more components of the system. The cause of an error is called the fault [27].

Fault Type | Examples
Physical | Broken connector (e.g., due to aging effects), radiation, noise, interference, power transients, power-down or short generated by an attacker, material theft (e.g., copper), denial-of-service by jamming / signal interference
Development | Hardware production defect, hardware design error (“errata”), software bug in program or data (memory leaks, accumulation of round-off errors, wrong set of parameters), unforeseen circumstances (of the system and/or its environment), vulnerabilities, aging effects like memory bloating or leaking
Interaction | Input mistake, message collision, spoofing (obscure identity), modify information with a Trojan horse, no or late message delivery (e.g., by replay attack), denial-of-service by flooding (e.g., bomb of connection requests), hacked sensor producing inaccurate or false data causing incorrect control decisions and actuator actions
Permanent | Design faults, broken connector, noise, stuck-at ground voltage due to a short, logic bomb carried by a virus slowing down or crashing the system, aging effects (e.g., electromigration)
Transient | Radiation, power transients, input mistake, intrusion attempt (via vulnerabilities, e.g., heating the RAM to trigger memory errors)
Table 3: Main classifications of faults by [27] with examples.

The source of a fault (Table 3) may be internal or external. Internal faults may be of physical nature (e.g., broken component connector) or introduced by the design (software/hardware bug). External faults originate from the environment (e.g., noise, radiation) or inputs (e.g., wrong or malicious usage of the system). Faults can be mainly classified into transient and permanent faults. Although a transient fault manifests only for a short period of time, it can cause an error and might lead to a permanent failure. Physical faults (internal/environmental) and inputs may be transient or permanent. Design faults are always permanent. Faults that cannot be systematically reproduced are often called intermittent faults (e.g., effects of temperature on hardware, a transient fault like a short in the circuit activated by a specific input). Such faults lead to so-called soft errors. A possible attack scenario (that is a malicious external fault) is often referred to as a (security) threat.

Figure 5: Dependability failures and security threats with respect to CPS layers.

Consider the CPS/IoT infrastructure shown in Figure 5. Faults (e.g., radiation or a malicious signal for an actuator) may occur at different layers of the architecture (e.g., physical or control layer, respectively) [50]. The physical layer is vulnerable to disruption, direct intervention or destruction of physical objects (e.g., sensors, actuators and mechanical components). The network layer (here: the IoT) connects the devices. The monitors and controllers in the control layer are vulnerable to uncertainties of the environment and manipulation of measurements and control signals. The information layer collects information and is particularly vulnerable to privacy and integrity issues.

The next two sections state examples of known and emerging faults when the IoT meets CPS (see also Fig. 5).

3.1 Dependability Faults in IoT

The IoT is susceptible to communication failures, particularly due to its size and heterogeneity. Traditional CPS would avoid or mitigate such failures by verification and sufficient testing of the design and final implementation of the network component. However, the IoT will evolve in technology and grow in size over time. For instance, the following faults may occur per CPS layer:

  • Physical Layer:

    • Interference: Disruption of a signal. The number of connected devices and subsequently the radiation increases which may influence sensor measurements, transmitted messages or control signals [51].

  • Network Layer:

    • Message Collision: Similarly to interference, the number of communicating devices might trigger communication failures, e.g., collisions or an overload of the network.

    • Protocol Violation: Wrong message content due to different protocol version or protocol mismatch.

  • Control Layer:

    • Deadline Miss: Late control signal reception. Control loops still have to follow the timing constraints of a CPS application.

    • Misusage: Send/set wrong inputs to a component, e.g., due to wrong or incomplete syntactical and/or semantic information about the device.

  • Information Layer:

    • Unavailability: Missing information caused by a technology update. Things might be connected, disconnected or updated in the IoT.

3.2 Security Threats in CPS

Security has been a topic since the beginning of computer networks: identifying vulnerabilities (that is, internal faults or weak points in a system that enable an attacker to alter the system [27]) and avoiding or mitigating malicious attacks on devices. However, in CPS additional vulnerabilities arise from the connection to the physical domain and the uncertain behavior of the physical environment [38, 52]. For instance, the following attacks may be applied per CPS layer:

  • Physical Layer: [53, 54, 55]

    • Information Leakage: Steal critical information from devices, e.g., secret keys or side channel parameters [56, 57, 58, 59, 60].

    • Denial of Service (DoS): Manipulate several parameters to perform a denial of service attack, e.g., hack the power distribution network to drain the energy [61, 62], destroy the sensors or actuators (in case of physical access), add extra power/communication load.

  • Network Layer: W.r.t. security, this is the most vulnerable layer in a CPS because of the vast possibilities of attacks on communication networks which emerged over the years [63, 64, 65, 66].

    • Jamming: Overload the communication network by introducing fake traffic [67, 68, 69, 70].

    • Collision: Manipulate the timing, power and/or frequency of a network to trigger metastable states which eventually lead to data collision or violation of communication protocols  [71, 72, 73, 74].

    • Routing ill-direct: Manipulate the routing mechanism leading to data collision, data flooding and selective forwarding of data [75, 76].

  • Control Layer:

    • Desynchronization: Violate the timing or manipulate clocks [77, 78, 79, 80]. This can also lead to a DoS [81] and/or information leakage [82, 83, 84].

  • Information Layer:

    • Eavesdropping: Steal or sniff information. This is one of the major threats related to privacy.

    Moreover, information can also be manipulated to perform several attacks, i.e., jamming, collision or DoS.

The potential threats and consequences can be expressed in security threat models for CPS [38]. To define a certain threat model, the following factors have to be identified:

  1. Source/Attacker: All possible factors/actors that intentionally disturb or interrupt the behavior or functionality of the CPS [38].

  2. Attack Methodology: The methodology or framework used to perform the attacks. It depends on the attacker’s capabilities (available computational power, access to CPS resources and layers, etc.), the motive (reason for the attack) [38] and the type of attack vectors.

  3. Consequences/Payload: The consequences of the actions that a successful attack performs to achieve its motive, e.g., compromising the confidentiality [85], integrity [86], availability [87], privacy [88] and safety [89] of the CPS or information stealing [38].

Table 4 provides a summary of the possible threat models for each layer of CPS.

Layer | Attackers | Methodology | Payloads
Physical | M, D, E | Physical Intervention | DoS, Aging Reliability
Sensor/Actuator | M, D, E | Hacking, Control Access, Data Manipulations | Energy Stealing, DoS, Information Leakage, Desynchronization
Communication | M, D, E | Replay, Sybil, Jamming, Flooding, Spoofing | Energy Stealing, DoS, Information Leakage, Desynchronization
Control | M, D, E | Control Access, Eavesdropping | Information Leakage, DoS, Desynchronization
Information | M, D, E | Eavesdropping | Information Leakage
Integration Level | M, D, E | All possible control & communication attacks | Energy Stealing, DoS, Information Leakage, Desynchronization
Table 4: Threat models for different CPS layers (M: Manufacturer, D: Designer, E: External Attacker).

3.3 Long-term Dependability and Security Threats

The IoT and CPS will undergo changes over time, especially when subjected to long operational durations (over decades, as in autonomous vehicles). The following aspects of change [49] might trigger faults (see examples per CPS layer in Fig. 5).

  1. Environmental: Uncertainty of the physical world. Decay and aging of material and components.

  2. Functional: Different and/or new applications and requirements. Dynamic system, i.e., connecting/disconnecting devices.

  3. Technological: Different and/or new components (e.g., maintenance, upgrades, demands), devices, interfaces or protocols. Unknown attacks (zero-day malware).

3.4 Fault Behavior

A failure manifests as wrong content or wrong timing (early, late or no message at all) of the intended service. Components may contain an error detection mechanism and additionally suppress wrong outputs. Such components are called fail-silent. Some components automatically stop their execution on failures (halt or crash); these are so-called fail-stop components. However, an erroneous component may also provide wrong outputs, i.e., the service is erratic (e.g., babbling), which can cause other services to fail. In the worst case, the behavior/output of the failed component appears inconsistent to different observers (Byzantine failure) [27, 28].

4 Techniques for Resilient IoT for CPS

There are various online and offline approaches to achieve resilience in a system. Developers may try to prevent faults (e.g., by an appropriate design, encryption or consensus), tolerate faults (e.g., by switching to a redundant component or another pre-defined configuration), remove/mitigate faults (e.g., isolate faulty components to avoid the propagation of faults) or forecast faults (e.g., to estimate the severity or consequences of a fault) [27]. We want to focus on the possibilities to fulfill the following requirements regarding resilience:

  • R1: Detection and identification of faulty, attacked or failed components during runtime in the IoT. Faulty or already failed components shall be detected to be able to maintain or recover to a healthy system state providing correct system services.

  • R2: Autonomously maintain resilience in the IoT. Ensure the functionality of a dynamic and heterogeneous system in the presence of faults, i.e., recover from failures in an automatic fashion.

The following two sections give an overview of methods, split into detection and diagnosis, and recovery or mitigation of failures. They summarize background and terminology, highly-cited surveys (at least 100 citations according to Google Scholar), recent surveys (2015 or later), recent approaches not part of surveys / additional work, and examples (see distribution in Table 6) given the keywords in Table 5. Note that we tried to cite original publications and no derivations of basic fault-tolerant techniques.

Detection and Diagnosis: anomaly detection, fault detection/diagnosis, security in CPS, intrusion detection, runtime monitoring, runtime verification, self-awareness
Recovery or Mitigation: self-healing, self-adaptation, software adaptation, runtime reconfiguration, fault-tolerance, fault recovery, threat mitigation
Further keywords: dependability, resilience
Table 5: Keywords used to find relevant research.
Type Detection and Diagnosis Recovery or Mitigation
Background / Terminology [27] [39] [31] [30] [29] [27] [19] [32] [28] [4] [36] [35]
Highly-cited surveys [39] [31] [24] [23] [90] [40] [32] [41]
Recent surveys [29] [25] [38] [37] [42] [46] [47]
Additional [91] [30] [92] [93] [29] [94] [95] [44] [96] [97]
Table 6: Collection and distribution of basic work (apart from derived and optimized techniques) per publication type (ordered by publication year, ascending).

4.1 Detection and Diagnosis

Anomaly detection is the process of identifying an abnormal behavior or pattern. The abnormal behavior or service failure (e.g., wrong state, wrong message content) is caused by a fault [27], e.g., a random failure, a design error or an intruder. Though this definition probably complies with all fault detection mechanisms listed in this section, the various communities use different keywords depending on the application or type of the mechanism. The related term monitoring is used in the field of runtime verification to refer to the act of observing and evaluating temporal behaviors [29]. In the security domain the phrase intrusion detection is used for reasoning about threats.

[Figure 6 diagram; labels: Detection, Redundancy, Specification, Anomaly Model, Verification, Signature, Statistics, Machine Learning, Information-theoretic, Classification, Nearest-Neighbor, Clustering, Distribution / Histogram, Hypothesis Tests, Principal Component Analysis]
Figure 6: A taxonomy of methods for fault detection.

Halting failures (fail-stop or fail-silent behavior) can be detected by simple methods like watchdogs or timeouts. Faults that manifest in erratic or inconsistent values or timing need a behavior specification, model or replica to compare against (we therefore focus on these methods). Such detection methods can be roughly separated w.r.t. the knowledge used to compare to the actual behavior (Fig. 6).
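As a minimal illustration of the timeout idea mentioned above, the following Python sketch flags components that have been silent for too long. The class name, the component names and the 2-second timeout are hypothetical and only serve the example; it is not taken from any of the surveyed works.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatWatchdog:
    """Flags components whose last heartbeat is older than `timeout` seconds."""

    timeout: float
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, component: str, now: float) -> None:
        # Record the time at which the component last reported in.
        self.last_seen[component] = now

    def silent_components(self, now: float) -> list:
        # A component that stays silent longer than the timeout is suspected
        # to have halted (fail-stop or fail-silent behavior).
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


# Hypothetical usage with explicit timestamps (seconds).
watchdog = HeartbeatWatchdog(timeout=2.0)
watchdog.heartbeat("lidar", now=0.0)
watchdog.heartbeat("camera", now=0.0)
watchdog.heartbeat("camera", now=1.5)
suspects = watchdog.silent_components(now=3.0)
```

Such a check only reveals missing messages; it cannot tell whether a message that did arrive carries a wrong value, which is why the remaining methods compare against a specification, model or replica.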

The expected or faulty behavior is represented via formal models or specifications (runtime verification) [30, 29], signatures describing attack behaviors [23, 24], learned models (classification, statistics) [31, 39, 23, 24], clusters, or the data instances themselves (nearest-neighbor) [31, 24].

Another field of reasoning about failures is the root cause analysis or fault localization which identifies the reason why a fault occurs (e.g., a vulnerability of the system or the first failed component which caused other components to fail due to fault propagation).

4.1.1 Redundancy

Additional information sources can detect many types of faults [98]. A simple method to verify a message’s content or intermediate result is plausibility checking or majority voting [28], e.g., by comparing a received message’s content against redundant information sources (see also “agreement” in Sec. 4.2). Nevertheless, redundancy is typically the last resort to increase the resilience or to ensure a specific level of dependability because it is costly when it is added explicitly (e.g., triple modular redundancy often deployed in the avionics [28]).

In hardware, fault detection by redundancy is also known as lockstep execution, where typically two computational units run the same operations in parallel to detect faults [99, 100]. When three replicas are used, the fault can be masked by majority voting (under the assumption that at most one component fails at a time), see also Triple Modular Redundancy (TMR) in Section 4.2.
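The difference between detecting a fault with two replicas and masking it with three can be sketched in a few lines of Python. The function names are hypothetical, and the voter assumes replica outputs that can be compared for exact equality (analog values would need a tolerance band instead).

```python
from collections import Counter


def duplex_mismatch(a, b, tolerance=0.0):
    """Lockstep-style check: two replicas can detect a fault but not mask it."""
    return abs(a - b) > tolerance


def majority_vote(values):
    """Return (winner, dissenting indices); winner is None without a strict majority."""
    winner, count = Counter(values).most_common(1)[0]
    if count <= len(values) // 2:
        # No value is backed by more than half of the replicas.
        return None, list(range(len(values)))
    return winner, [i for i, v in enumerate(values) if v != winner]


# Hypothetical replica outputs: the third replica is faulty.
result, faulty = majority_vote([42, 42, 17])
```

With three replicas the voter returns the agreed value and identifies the dissenting replica; if all three disagree, the single-fault assumption is violated and no value can be trusted.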

However, some techniques exploit implicit or functional redundancy that is already available in the system. For instance, [92] combines anomaly detection with sensor fusion. Their approach uses a particle filter that fuses data from different sensors and simultaneously calculates a trust value for the information sources, derived from the normalization factor, i.e., the sum of weights of the particles. When the weights of the particles are high, the information sources match the prediction and are rated trustworthy. The authors in [101] propose using hard-wired local data of an automotive ECU to check the plausibility of a received control input. Our method presented in Section 7 also relies on implicit (and explicit) redundancy.
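To make the idea of a trust value derived from particle weights more concrete, the following Python sketch scores a measurement by the average unnormalized likelihood it receives from a set of particles. This is a simplified illustration under our own assumptions (one-dimensional state, Gaussian measurement model, hypothetical numbers), not the algorithm of [92].

```python
import math
import random


def sensor_trust(particles, measurement, sigma):
    """Average unnormalized Gaussian likelihood of a measurement over all particles."""
    weights = [
        math.exp(-0.5 * ((measurement - x) / sigma) ** 2) for x in particles
    ]
    # The sum of weights is the normalization factor of the particle filter;
    # dividing by the particle count keeps the score in the range [0, 1].
    return sum(weights) / len(weights)


rng = random.Random(0)
# Hypothetical predicted state: a distance of about 10 m with 0.2 m spread.
particles = [rng.gauss(10.0, 0.2) for _ in range(500)]

trust_ok = sensor_trust(particles, measurement=10.1, sigma=0.5)
trust_faulty = sensor_trust(particles, measurement=14.0, sigma=0.5)
```

A measurement close to the prediction yields weights near one and thus a high score, while a measurement far from all particles yields a score near zero and can be flagged as untrustworthy.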

4.1.2 Specification

Figure 7: Specification-based monitoring can be employed either during the CPS execution or at design-time during the CPS model simulation.
Verification of Safety Properties

The IoT generally consists of spatially distributed and networked CPS. At design time, the CPS behavior can be modeled using hybrid systems, a mathematical framework that combines discrete transition systems capturing the computational behavior of the software component with continuous (often stochastic and nonlinear) ordinary differential equations (ODEs) describing the behavior of the physical substratum with which the software component is deeply intertwined.

Although there has been a great effort in the literature to provide efficient computational techniques and tools [102, 103, 104, 105, 106, 107, 108, 109] to analyze safety properties in hybrid systems, exhaustive verification (i.e., model checking) is in general undecidable [91]. The approaches currently available to check safety properties are based on generating conservative over-approximations of the dynamics of the state variables, called flow pipes [110], and on checking whether those intersect the unsafe regions of interest. However, these methods are generally limited to small-scale CPS models. This limitation becomes more evident when we want to study more complex emergent behaviors, which result from the interactions among system components and can be observed only by taking large-scale CPS into consideration.
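The principle of over-approximating reachable states and testing them against an unsafe region can be illustrated with a deliberately simple Python sketch. It assumes a scalar discrete-time system x[k+1] = a*x[k] + w with a bounded disturbance |w| <= d and a >= 0; real flow-pipe tools handle continuous time and hybrid dynamics, which this toy example does not. All names and numbers are hypothetical.

```python
def reach_intervals(x_lo, x_hi, a, d, steps):
    """Interval bounds for x[k+1] = a*x[k] + w with |w| <= d and a >= 0."""
    if a < 0:
        raise ValueError("this sketch assumes a >= 0 so that bounds stay ordered")
    pipe = [(x_lo, x_hi)]
    for _ in range(steps):
        # Propagate the lower and upper bound under the worst-case disturbance.
        x_lo, x_hi = a * x_lo - d, a * x_hi + d
        pipe.append((x_lo, x_hi))
    return pipe


def intersects(interval, unsafe):
    """True if two closed intervals overlap."""
    lo, hi = interval
    u_lo, u_hi = unsafe
    return lo <= u_hi and u_lo <= hi


pipe = reach_intervals(x_lo=0.9, x_hi=1.1, a=0.5, d=0.1, steps=5)
unsafe = (1.5, 2.0)
safe = not any(intersects(box, unsafe) for box in pipe)
```

If no interval of the sequence touches the unsafe region, the property holds for every admissible initial state and disturbance within the considered horizon; an intersection only indicates a potential violation, because the bounds are conservative in general.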

Hybrid systems are approximate models of the real CPS behavior, so their analysis may not always be faithful due to inevitable approximation errors (especially of the physical behavior) in the modeling phase. Furthermore, CPS models are not always available because of intellectual property issues, so CPS often need to be studied as black-box systems whose internal behavior cannot be observed.

Runtime Verification

A complementary approach to exhaustive verification is to equip CPS with monitors that verify the correctness of their execution. Monitoring consists of observing the evolution of the discrete and continuous variables characterizing the CPS behavior and deciding whether the observed trace of values is good or bad. As Fig. 7 illustrates, these traces can be obtained by simulating the CPS design or can be observed during the CPS execution through the instrumentation of the system under test (SUT) (more details concerning instrumentation techniques can be found in [111]).

Runtime verification (RV) [29] is a specification-based monitoring technique that decides whether an observed trace of a SUT conforms to rigorous requirements written in a formal specification language. The main idea of RV consists in providing efficient techniques and tools that enable the automatic generation of a software- or hardware-based monitor [112, 113] from a requirement. RV can provide useful information about the behavior of the monitored system, at the price of a limited execution coverage.

RV is nowadays a well-established technique, widely employed in both academia and industry, both before system deployment (for testing and verification) and after deployment (to ensure reliability, safety, robustness and security).

A typical example of a formal specification language is Linear Temporal Logic (LTL), introduced by Pnueli in [114]. LTL provides a very concise and elegant logic-based language to specify sequences of Boolean propositions and their relations at different points in time. LTL considers only the temporal order of the events and not the actual point in time at which they occur. For example, it is not possible to specify that a property should hold after one unit of time and before three and a half units of time.
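As a minimal sketch of what such a specification means operationally, the following Python snippet checks the response property G(request -> F grant) on a finite recorded trace. The proposition names `request` and `grant`, the encoding of a trace as a list of sets, and the finite-trace interpretation (a request that is still unanswered when the trace ends counts as a violation) are assumptions made for illustration and are not taken from [114].

```python
from typing import List, Set


def always_eventually_response(trace: List[Set[str]],
                               trigger: str = "request",
                               response: str = "grant") -> bool:
    """Check G(trigger -> F response) on a finite trace.

    Each trace element is the set of propositions that hold at that step.
    Assumption: a trigger without a simultaneous or later response before
    the end of the trace is treated as a violation.
    """
    pending = False  # True while some trigger still awaits a response
    for step in trace:
        if trigger in step:
            pending = True
        if response in step:
            pending = False  # a response discharges all pending triggers
    return not pending


# Hypothetical traces: only the order of events matters, not their timing.
ok_trace = [{"request"}, set(), {"grant"}]
bad_trace = [{"request"}, {"grant"}, {"request"}]
print(always_eventually_response(ok_trace))   # True
print(always_eventually_response(bad_trace))  # False
```

Note that the check cannot express "grant within three steps"; that kind of timing bound is exactly what LTL leaves out.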

Real-time temporal logics [115] overcome these limits by embedding a continuous time interval in the until temporal operator. Signal Temporal Logic [116, 117] is a popular example of a real-time temporal logic suitable to reason about the real-time requirements for CPS which has been proposed for detection of threats [118].

Although reasoning about a single trace can provide insight into safety properties, this is generally not sufficient to capture important information-flow security properties [119] such as noninterference, non-inference and information leakage. These properties are called hyperproperties because verifying them requires two or more execution traces of the system to be considered at the same time. In order to specify hyperproperties, LTL and STL have been extended to HyperLTL [120] and HyperSTL [121], respectively, by adding universal and existential quantifiers over a set of traces to the syntax. Runtime verification of such specification languages is still an open challenge (some preliminary results appeared in [122]), since the majority of the available monitoring algorithms are developed to handle only a single trace.

Falsification-based analysis and Parameter synthesis

As illustrated in Fig.7, the Boolean semantics of STL decides whether a signal is correct or not with respect to a given specification. However, this answer is not always informative enough to reason about the CPS behavior, since the continuous dynamics of these systems are expected to be tolerant with respect to the value of certain parameters, the initial conditions and the external inputs.

Several researchers have proposed to address this issue by defining a quantitative semantics for STL [123, 124]. This semantics replaces the binary satisfaction relation with a quantitative robustness degree function that returns a real value (see Fig. 7) indicating how far a signal is from satisfying or violating a specification. A positive or negative sign of the robustness value indicates whether the formula is satisfied or violated, respectively.
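The idea of a robustness degree can be illustrated with a small sketch for two simple formulas over a sampled signal: G (x < limit) and F_[start,end] (x > limit). The discrete-time setting, the function names and the example signal are assumptions for illustration; the tools cited here work on richer formulas and dense-time signals.

```python
from typing import Sequence


def rob_always_below(signal: Sequence[float], limit: float) -> float:
    """Robustness of G (x < limit): the worst-case margin over the trace."""
    return min(limit - x for x in signal)


def rob_eventually_above(signal: Sequence[float], limit: float,
                         start: int, end: int) -> float:
    """Robustness of F_[start,end] (x > limit); bounds are sample indices."""
    window = signal[start:end + 1]
    return max(x - limit for x in window)


speed = [10.0, 12.5, 14.0, 13.0, 11.0]  # hypothetical sampled signal

print(rob_always_below(speed, 15.0))            # 1.0  (satisfied, margin 1.0)
print(rob_always_below(speed, 13.5))            # -0.5 (violated by 0.5)
print(rob_eventually_above(speed, 13.5, 1, 3))  # 0.5  (satisfied)
```

The sign gives the Boolean verdict, while the magnitude tells how much the signal could be perturbed before the verdict changes.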

The notion of STL robustness has been exploited in several tools [125, 126] for falsification analysis [127] and parameter synthesis [128, 129] of CPS models. On the one hand, minimizing the robustness [125] is suitable for searching the input space for counterexamples that violate (falsify) the specification. On the other hand, maximizing the robustness [126] can be used to tune the parameters of the system to improve its resilience. To this end, a global optimization engine is employed to systematically guide the search.

Signature-based Intrusion Detection

Signature-based intrusion detection compares the observed behavior against pre-defined behavior (known as golden behavior or signature) to identify abnormal events during runtime [23]. Though these techniques effectively identify intrusions with a small number of false positives, they require a precisely calibrated signature [93]. Therefore, such techniques are not feasible if designers and IP providers are not trusted. Such misuse-based intrusion detection typically cannot handle zero-day attacks, i.e., new and previously unknown attacks. It is therefore often combined with anomaly detection (e.g., in [130]).

4.1.3 Anomaly-based Detection

Statistical Techniques

In statistical anomaly detection the data is fit into a statistical model. If a test instance occurs in the low probability region of the model, i.e., it is unlikely to be generated by the model, then it is claimed to be an anomaly. Statistical models can be specified with parameters when the underlying distribution is known (e.g., is Gaussian). The parameters are trained by machine learning (ML) algorithms [31] or estimation [39] describing the correct behavior of the system. The inverse of the test instance’s probability to be generated can directly be used as anomaly score. Statistical tests can also be used to label or score a test instance (e.g., box plot rule).
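A minimal sketch of both variants mentioned above, assuming a univariate Gaussian model and using only the Python standard library. The z-score is used here as a simple stand-in for the inverse-probability score, and the threshold of three standard deviations, the factor 1.5 in the box plot rule and the temperature readings are illustrative assumptions.

```python
import statistics
from typing import Sequence, Tuple


def fit_gaussian(training: Sequence[float]) -> Tuple[float, float]:
    """Estimate mean and standard deviation from fault-free training data."""
    return statistics.mean(training), statistics.stdev(training)


def z_score(x: float, mean: float, std: float) -> float:
    """Anomaly score: distance from the mean in standard deviations."""
    return abs(x - mean) / std


def is_anomaly(x: float, mean: float, std: float,
               threshold: float = 3.0) -> bool:
    """Flag instances that fall into the low-probability tails of the model."""
    return z_score(x, mean, std) > threshold


def box_plot_outlier(x: float, training: Sequence[float],
                     k: float = 1.5) -> bool:
    """Box plot rule: outside [Q1 - k*IQR, Q3 + k*IQR] counts as an outlier."""
    q1, _, q3 = statistics.quantiles(training, n=4)
    iqr = q3 - q1
    return x < q1 - k * iqr or x > q3 + k * iqr


# Hypothetical temperature readings recorded during correct operation.
normal = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7]
mean, std = fit_gaussian(normal)
print(is_anomaly(25.0, mean, std), box_plot_outlier(25.0, normal))  # True True
print(is_anomaly(20.1, mean, std), box_plot_outlier(20.1, normal))  # False False
```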

The model can be expressed by the data itself, e.g., in a histogram, by kernel functions or particles, which is typically used when the distribution of the data is unknown. The test instances or samples may be evaluated by statistical hypothesis tests. For instance, the Wilcoxon signed-rank test [131] compares two related samples to determine if they have the same underlying distribution (which is unknown and does not have to be the normal distribution).

Principal component analysis (PCA) is used to project the data to lower dimensions, i.e., it reduces the dimensionality of the data to a set of uncorrelated variables. A test instance can be marked anomalous when its projection on the components results in a high variance, meaning that the test instance does not fit the typical correlation of the data.

However, simple tests, Gaussian models and histograms are nowadays mostly replaced by (deep) neural networks which stand out handling multivariate and non-linear data.

Machine Learning or Data Mining

Typical anomaly detection techniques based on machine learning can be used with data where no domain knowledge is available (e.g., black-box components like IP cores). The models may be updated during operation. When the desired behavior is known, it can instead be expressed as a formal model (specification-based monitoring).

Classification-based anomaly detection learns a model (SVM, neural network, Bayesian network, rules or decision trees) from labeled training data (e.g., states and observations of the system) to classify the test data into normal classes and anomalies or outliers [31]. Instead of assigning a test instance to a class, one may use scores representing the likelihood of a test instance being an anomaly. For instance, the authors in [132] use recurrent neural networks to detect anomalies in real-time data. The network models short- and long-term patterns of time series and serves as a prediction model of the data. The error between the predicted and the actual value serves as an anomaly score.

Nearest-neighbor-based detection techniques measure the distance from a data instance under test to its neighbors to identify anomalies. Different metrics (e.g., Euclidean distance) are applied to specify an anomaly score, that is, the likelihood of a data instance being an anomaly. Another approach is to measure the density, that is, the number of instances within a given radius around the data instance under test. The complexity of nearest-neighbor techniques grows quadratically with the number of data instances. They are typically applied in an unsupervised manner.
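Both scores can be sketched in a few lines, assuming Euclidean distance and a small in-memory data set. The function names, the choice of k and the sample points are illustrative assumptions; scoring every instance of a data set against all others this way is what leads to the quadratic cost.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def knn_score(point: Point, data: Sequence[Point], k: int = 3) -> float:
    """Anomaly score: Euclidean distance to the k-th nearest neighbor."""
    distances = sorted(math.dist(point, other) for other in data)
    return distances[k - 1]


def density(point: Point, data: Sequence[Point], radius: float) -> int:
    """Number of data instances within `radius` of the point under test."""
    return sum(1 for other in data if math.dist(point, other) <= radius)


# Hypothetical two-dimensional observations of normal behavior.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5)]

inlier, outlier = (0.5, 0.4), (5.0, 5.0)
print(knn_score(inlier, data) < knn_score(outlier, data))     # True
print(density(inlier, data, 1.0), density(outlier, data, 1.0))  # 5 0
```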

Data instances are first distributed into clusters (by clustering algorithms, e.g., expectation maximization, k-means, self-organizing maps, many of which use distance or density measures). An anomaly is a data instance that does not fit into any cluster.

Information-theoretic

By investigating the information content described by, e.g., the entropy of the information, one may draw conclusions about anomalies in the data (for information-theoretic measures characterizing regularity in data see [133]). When the entropy exceeds a threshold, the test instance is marked as an anomaly. The threshold is defined by the set of anomalies. In highly irregular data the gap between threshold and maximum entropy may be low (the set of true anomalies is small).
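A minimal sketch of an entropy check, assuming the data is a window of discrete symbols (e.g., message types seen on a link) and that Shannon entropy is the chosen measure. The threshold value is a hypothetical parameter that would have to be derived from known anomalies.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def shannon_entropy(symbols: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the empirical symbol distribution."""
    counts = Counter(symbols)
    total = len(symbols)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def is_anomalous(window: Sequence[Hashable], threshold: float) -> bool:
    """Mark the window as an anomaly when its entropy exceeds the threshold."""
    return shannon_entropy(window) > threshold


regular = "aaaaaaab"    # hypothetical window dominated by one message type
irregular = "abcdefgh"  # hypothetical window with eight distinct types

print(shannon_entropy("aabb"))                # 1.0
print(is_anomalous(regular, threshold=2.0))   # False
print(is_anomalous(irregular, threshold=2.0)) # True
```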

4.1.4 Fault-Localization

When fault detection only tells us that a failure has happened in a subsystem, we need means to identify the specific part causing the failure.

This is often performed by root cause analysis [134] or fault-localization [135, 136, 137, 138, 139, 140, 141]. In the software engineering community there is a considerable amount of literature about (semi-)automatic techniques assisting the developer to localize and to explain program bugs (for a comprehensive survey we refer to the work in [141]). A well-established statistical approach is spectrum-based fault-localization (SFL) [139], a technique that provides a ranking of the program components that are most likely responsible for the observed fault.
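The ranking step of SFL can be sketched as follows. The sketch assumes the Ochiai similarity coefficient, which is one commonly used choice and is not prescribed by the text above; the component names and test outcomes are made up for illustration.

```python
import math
from typing import Dict, List, Set, Tuple


def ochiai_ranking(coverage: List[Set[str]],
                   failed: List[bool]) -> List[Tuple[str, float]]:
    """Rank components by suspiciousness using the Ochiai coefficient.

    coverage[i] is the set of components exercised by test i,
    failed[i] tells whether test i failed (the oracle's verdict).
    """
    total_failed = sum(failed)
    components = set().union(*coverage)
    scores: Dict[str, float] = {}
    for comp in components:
        # ef/ep: failing/passing tests that executed the component
        ef = sum(1 for cov, f in zip(coverage, failed) if f and comp in cov)
        ep = sum(1 for cov, f in zip(coverage, failed) if not f and comp in cov)
        denom = math.sqrt(total_failed * (ef + ep))
        scores[comp] = ef / denom if denom > 0 else 0.0
    # Most suspicious first; ties broken by name for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical spectrum: four tests over components A, B and C.
coverage = [{"A", "B"}, {"A", "C"}, {"B", "C"}, {"A"}]
failed = [False, True, True, False]
for name, score in ochiai_ranking(coverage, failed):
    print(name, round(score, 3))
# C 1.0
# B 0.5
# A 0.408
```

Component C is executed by every failing test and by no passing test, so it tops the ranking.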

This approach has recently also been employed to localize faults in Simulink/Stateflow CPS models [136, 137, 138, 140, 135], displaying an accuracy similar to that of the same method applied to software systems [137]. Although classical SFL is agnostic to the nature of the oracle and only requires knowing whether or not the system passes a specific test case, the authors in [135] have introduced a novel approach where the oracle is a specification-based monitor. This makes it possible to leverage the trace diagnostic method proposed in [142] and to obtain more information about the failed tests (for example the segment of time where the fault occurred), improving the fault-localization.

Often this approach is only applied offline for debugging processes, however, it can be used to isolate a failed HW/SW component from the system to avoid fault propagation or trigger its recovery.

4.2 Recovery or Mitigation

Broadly speaking, a system can be adapted by changing its parameters or its structure (architecture) [32, 36]. The following four action types of possible re-configurations are defined in [37] (splitting structural adaptation into further classes): re-parameterization to change the parameters of a component, re-instantiation to create and remove components, rewiring to redirect connections between components, and relocation to migrate functionality to another platform. The latter three action types require redundancy to some extent. We extend and refine these types in the following (Fig. 8).

[Figure 8 node labels: Recovery or Mitigation; Re-Parameterization; Runtime Enforcement; Redundancy; Optimization; Rule-based; Re-Instantiation; Replacement; Agreement; Rewiring; Relocation]
Figure 8: A taxonomy of methods for recovery or mitigation.

Unless otherwise stated, the adaptation can be applied on different architectural levels of the system. For instance, the change of the clock speed or other hardware parameters is the re-parameterization on the physical level of a device. Changing the receiver of a software component’s output is rewiring on the process/task level.

4.2.1 Re-Parameterization

In general, a re-parameterization (or reconfiguration) switches to another configuration of one or more components that is typically no longer the optimal setting, i.e., the quality of service is decreased (graceful degradation). Adaptation of parameters requires knowledge about the underlying algorithm of the erroneous component and is therefore typically performed by the component itself or within a subsystem. The configuration can be selected by optimization [143], or by a reasoner based on a set of rules, an ontology or a logic program [37]. Approaches from control theory use state observers or estimators to derive parameters to mitigate stochastic faults [39]. For instance, an adaptive Kalman filter (AKF) [144] changes its filter parameters during runtime based on the inputs. For example, the measurement covariance can be increased when an input signal gets worse or even permanently fails (cf.: a traditional KF or state estimator mitigates noise and transient failures only).
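The effect of increasing the measurement covariance can be sketched with a scalar Kalman filter. This is a deliberately simplified, innovation-gated adaptation and not the AKF of [144]: the random-walk process model, the gate of three standard deviations, the inflation factor and all numeric values are assumptions chosen for illustration.

```python
class AdaptiveKalman1D:
    """Scalar Kalman filter that inflates R when a measurement looks faulty."""

    def __init__(self, x0: float, p0: float, q: float, r: float,
                 gate: float = 3.0, inflation: float = 100.0) -> None:
        self.x, self.p = x0, p0          # state estimate and its variance
        self.q, self.r_nominal = q, r    # process and measurement variance
        self.gate, self.inflation = gate, inflation

    def step(self, z: float) -> float:
        # Predict with a random-walk model: the state stays, uncertainty grows.
        p_pred = self.p + self.q
        innovation = z - self.x
        expected_var = p_pred + self.r_nominal
        # Adapt: an innovation far outside its expected spread indicates a
        # degraded input, so trust the measurement less (larger R).
        r = self.r_nominal
        if innovation ** 2 > (self.gate ** 2) * expected_var:
            r *= self.inflation
        # Update with the (possibly inflated) measurement variance.
        gain = p_pred / (p_pred + r)
        self.x += gain * innovation
        self.p = (1.0 - gain) * p_pred
        return self.x


# Hypothetical sensor stream with one faulty reading (80.0).
readings = [20.1, 19.9, 80.0]
adaptive = AdaptiveKalman1D(x0=20.0, p0=1.0, q=0.01, r=0.25)
plain = AdaptiveKalman1D(x0=20.0, p0=1.0, q=0.01, r=0.25, inflation=1.0)
for z in readings:
    a, b = adaptive.step(z), plain.step(z)
print(round(a, 2), round(b, 2))  # adaptive stays near 20, plain jumps away
```

With `inflation=1.0` the filter degenerates to a traditional KF, which makes the comparison in the usage example possible.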

4.2.2 Runtime Enforcement

Runtime enforcement [95, 97] merges runtime verification with adaptation. This powerful technique ensures that a program conforms to its specification. A so-called enforcer acts on the interface of a component changing inputs or outputs to comply with a set of formal properties. The enforcer uses an automaton and/or rules to correct the IO in case of faults. This approach has been pioneered by the work of Schneider [145] on security automata which halt the program whenever it deviates from a safety requirement. Since then, there has been a great effort in the RV community to define new enforcement mechanisms with primitives [146, 147, 148, 149, 150] or that support more expressive specifications [151, 152, 153].
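A minimal sketch of an enforcer acting on the output interface of a component. It assumes two hypothetical safety rules for an actuator command (a value range and a maximum change per cycle) and keeps the last released value as its only state; it is not one of the enforcement mechanisms from the cited works.

```python
class RangeRateEnforcer:
    """Corrects commands so they stay in [low, high] and change slowly."""

    def __init__(self, low: float, high: float, max_step: float,
                 initial: float) -> None:
        if not low <= initial <= high:
            raise ValueError("initial value must lie inside the safe range")
        self.low, self.high, self.max_step = low, high, max_step
        self.last = initial  # enforcer state: last value that was released

    def enforce(self, command: float) -> float:
        # Rule 1: limit the change per cycle relative to the last output.
        step = max(-self.max_step, min(self.max_step, command - self.last))
        corrected = self.last + step
        # Rule 2: keep the value inside the safe range. Since `last` is in
        # range, clamping moves the value towards `last`, so rule 1 still holds.
        corrected = max(self.low, min(self.high, corrected))
        self.last = corrected
        return corrected


enforcer = RangeRateEnforcer(low=0.0, high=100.0, max_step=10.0, initial=50.0)
print(enforcer.enforce(55.0))    # 55.0  (compliant, passed through)
print(enforcer.enforce(90.0))    # 65.0  (rate limited)
print(enforcer.enforce(-500.0))  # 55.0  (rate limited in the other direction)
```

Compliant commands pass through unchanged, so the enforcer only interferes when the component deviates from the properties.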

4.2.3 Redundancy

Redundant components ensure availability (passive) and increase reliability (active). Failed components can be re-instantiated, replaced by spares, mitigated by voting or fusion, rewired or relocated [41, 37].

Re-Instantiation or Restart

A straightforward fault-tolerance method is to restart a failed software component. The tasks or the system typically save checkpoints or output messages of components on a periodic basis to roll back to a healthy state [154]. The restart might be combined with a re-parameterization. Checkpointing/restart techniques are well studied for operating systems [155] and may be applied on fog nodes or cloud servers. The primary/backup approach activates a typically aperiodic backup task if the primary task fails [156]. Adaptations in hardware and software also mitigate reliability threats while considering the optimization cost constraints [157, 158, 159]. Similarly, runtime reconfiguration policies have also been proposed to mitigate the reliability threats in microprocessors [160].

Replacement or Cold/Hot Spares

The simplex architecture [161] considers two redundant subsystems. A highly dependable subsystem jumps in when the high-performance subsystem fails. Triple modular redundancy (TMR) replicates HW and/or SW components to mask failures (through a voter, i.e., includes detection). The replicates are in the best (but most costly) case diverse w.r.t. their design such that also design and input errors can be masked [28]. Such hardware redundancy is typically added during design time and used in closed, non-elastic systems. To exploit these techniques, several reliability resilient microprocessor designs [162, 163, 164] and corresponding software layer controls [165, 166, 167] have been proposed to ensure the resilience towards reliability threats, i.e., soft errors. Typically, TMR-based solutions possess a large area and power overhead. However, adaptive-TMR solutions [168] can trade-off between power budget and reliability threats. Similarly, software and hardware error masking techniques [169, 163] exploit the dark silicon (under-utilized areas) in multi-core systems [170] to mitigate faults. However, an IoT orchestrator can maintain a directory of available services and redirect resource requests if necessary.
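The voter used in modular redundancy can be sketched as a strict majority vote over the replica outputs. The sketch assumes that the outputs are exactly comparable values; replicas producing noisy analog values would need an agreement tolerance, which is not shown here.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of the replicas.

    With three replicas this masks one arbitrary faulty output. If no value
    has a strict majority, the failure is detected but cannot be masked.
    """
    if not outputs:
        raise ValueError("no replica outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 > len(outputs):
        return value
    raise RuntimeError("no majority among replica outputs")


print(majority_vote([42, 42, 7]))  # 42 (the faulty replica is outvoted)
```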

Implicit redundancy like related observations in a system (in contrast to traditional redundancy that is the explicit replication of components) can be exploited by structural adaptation. A substitute component is instantiated to replace the failed component which includes also rewiring and possibly also a relocation [171, 172] (see Sec. 7 for an example).

Agreement / Voting or Fusion

Byzantine failures (inconsistent failures to different observers), typically caused by malicious attacks, can be detected and tolerated using replicas (here: redundant services on different nodes of a distributed system) by agreement or consensus on the outputs [94]. The output of redundant components can be combined or fused, e.g., via filters or fuzzy logic [42]. Through recent implementations and usage in cryptocurrencies [173, 174], attention has shifted towards smart contracts and blockchains, which ensure authentication and integrity of data [44, 47, 46, 96]. Basically, a blockchain is a series of data records, each linked to its predecessor by a cryptographically secure hash, which makes it computationally infeasible to alter the blockchain unnoticed. However, blockchains suffer from complexity, energy consumption and latency and therefore currently cannot be used for real-time anomaly detection or applied by simple nodes with low computational power and restricted battery power budgets [96]. They are, nevertheless, already being examined for managing access to data (authorization), purchasing devices or computing power, or managing public-key infrastructure in the IoT [44, 175, 176].
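The integrity property of such a hash-linked series of records can be sketched with the standard library. The sketch covers only the linking and the tamper check; consensus, signatures and distribution over nodes are deliberately left out, and the block layout and the sensor records are illustrative assumptions.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # assumed placeholder hash for the first block


def make_block(data: Any, prev_hash: str) -> Dict[str, Any]:
    """Create a block whose hash covers its data and the predecessor's hash."""
    payload = json.dumps({"data": data, "prev": prev_hash}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"data": data, "prev": prev_hash, "hash": digest}


def build_chain(records: List[Any]) -> List[Dict[str, Any]]:
    chain, prev = [], GENESIS
    for record in records:
        block = make_block(record, prev)
        chain.append(block)
        prev = block["hash"]
    return chain


def verify_chain(chain: List[Dict[str, Any]]) -> bool:
    """Recompute every hash and check every link to the predecessor."""
    prev = GENESIS
    for block in chain:
        recomputed = make_block(block["data"], block["prev"])["hash"]
        if block["prev"] != prev or recomputed != block["hash"]:
            return False
        prev = block["hash"]
    return True


chain = build_chain([{"sensor": "radar-1", "v": 27.5},
                     {"sensor": "radar-1", "v": 27.9},
                     {"sensor": "radar-2", "v": 31.0}])
print(verify_chain(chain))   # True
chain[1]["data"]["v"] = 99.0  # tamper with a stored record
print(verify_chain(chain))   # False
```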

Rewiring or Redirection

Broken links in mesh networks are typically reconfigured using graph theory considering node properties and application requirements [177]. A software component may route the task flow to a recovery routine [41].

Relocation

Migration of software components or tasks is studied in the field of resource optimization, utilization and dynamic scheduling on (virtual) machines. Optimization algorithms [143], multi-agent systems [178] or reinforcement learning [179] find a new task configuration utilizing resources in case of a platform failure. Tasks may also be migrated in advance when the health state decreases [154]. Cloud applications drive new technologies like containerization, resource-centric architectures and microservices, which ease service orchestration in complex and elastic systems. Dragoni et al. [180] predict increased dependability using microservices, which focus on small, independent and scalable function units (cf. fault containment units in Kopetz [28]); however, security remains a concern.

5 Long-Term Dependability and Security

During design time only a subset of failures and threats can be considered; changes of the system itself or of the environment cannot be predicted and may lead to new fault scenarios. Moreover, over time (especially for systems deployed for several decades, like autonomous vehicles), new attacks can emerge (e.g., adversarial machine learning, although ML theory is decades old), new vulnerabilities in the system can be uncovered (recent examples are Spectre and Meltdown in the decades-old technology of high-end processors), and attackers may become more powerful and intelligent (e.g., learning-based attacks).

We therefore believe that the IoT needs enhanced self-adaptation techniques (possibly cognitive in nature) to achieve long-term dependability and security. For instance, apart from traditional fault-tolerance like backup hardware/software components or checkpointing and restarting, self-healing is a promising approach which is related to self-adaptation and self-awareness. Self-aware systems learn models of the system itself and its environment to reason and act (e.g., self-heal) in accordance with higher-level goals (e.g., availability) [36]. The key feature of self-* or self-X techniques is continuous learning and optimization, performed during runtime to evolve the models upon system changes.

To design and build a long-term dependable and secure IoT of smart CPS, the following research questions need to be addressed first:

  1. How to detect and separate subsystem failures and minimize the failure dependencies of the subsystems? How to guarantee the resilience of the system when applying machine learning and/or self-adaptation?

  2. How to detect and recover compromised components with minimal performance and energy overhead? How to learn from unknown attacks on-the-fly and devise appropriate mitigation strategies online, e.g., online on-demand isolation, new fail-safe modes, etc. besides investigating fast learning of on-going attacks to minimize the attack surface?

  3. How to ensure the robustness of the resilience mechanisms themselves?

To address these challenges, the following techniques are envisioned to ensure long-term dependability and security.

5.1 Verification and validation

Ensuring the complex dependencies and integrity of several components and subsystems within a system is a very challenging research question. The state-of-the-art on dependability and security assurance is based on model-driven design that consists of specifying rigorously the structure and the behavior of the systems using formal models. These models are amenable to formal verification techniques [181, 182, 183, 184] that can provide comprehensive guarantees about correctness of the system’s properties. The accuracy of these models and the test coverage limit the validity of the assurance.

The addition of data-driven learning-enabled subsystems introduces uncertainty in the overall design process and may result in an unpredictable emergent behavior. This is because the operational behavior of these subsystems is a function of the data they train upon and it is very difficult to predict.

This lack of predictability calls for novel approaches to ensuring long-term dependability and security. We envision at least two interesting research directions to pursue.

One idea is to take inspiration from the natural immune systems that protect animals from dangerous foreign pathogens (i.e., bacteria, viruses, parasites). In our case, we can think of a specialized subsystem that learns both how the surrounding environment evolves and how to best react to attacks. However, this approach would leave the system vulnerable during the learning process.

Another possible direction is to provide mechanisms to enforce dynamic assurance of security and dependability at runtime. A similar approach in control theory can be found in the Simplex Architecture (SA) [161]. SA consists of a plant and two versions of the controller: a pre-certified baseline controller and a non-certified high-performance controller. A decision module decides whether to switch between the two controllers depending on how close the high-performance controller is to violating the safety region. In our case, we could envision a dedicated decision module (for example a runtime monitor) granting a certain degree of autonomy and trust to its subsystems depending on how far the overall system is from violating its safe and secure operating conditions.
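The switching logic of such a decision module can be sketched as follows, assuming a one-dimensional state, an interval as safety region and a fixed safety margin. All names and numbers are hypothetical; a real Simplex decision module would reason about the reachable states of the plant rather than the current distance alone.

```python
from typing import Callable

Controller = Callable[[float], float]


def make_decision_module(baseline: Controller, performance: Controller,
                         safe_low: float, safe_high: float,
                         margin: float) -> Controller:
    """Build a controller that falls back to the baseline near the boundary."""

    def decide(state: float) -> float:
        # Distance of the current state to the border of the safety region
        # (negative if the state has already left the region).
        distance = min(state - safe_low, safe_high - state)
        if distance < margin:
            return baseline(state)   # pre-certified controller takes over
        return performance(state)    # enough slack: use the fast controller

    return decide


# Hypothetical controllers returning an actuator command for a given state.
baseline = lambda state: 0.0       # conservative: stop actuating
performance = lambda state: 5.0    # aggressive: full speed

controller = make_decision_module(baseline, performance,
                                  safe_low=0.0, safe_high=100.0, margin=10.0)
print(controller(50.0))  # 5.0 (well inside the safety region)
print(controller(95.0))  # 0.0 (too close to the boundary)
```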

5.2 Intelligent and Adaptive Systems

Long-term dependability and security can be achieved by intelligent and adaptive systems (i.e., so-called smart or cognitive systems) that consider uncertainties and changes throughout the lifecycle, and increase their inherent robustness levels on-the-fly autonomously through continuous self-optimization and self-healing.

Many of the techniques presented in the last section do not handle dynamic systems or systems that evolve over time. However, the approaches can be evaluated and extended by artificial intelligence, or new techniques developed to cope with the elasticity of the IoT.

Machine learning-based fault detection and recovery are replacing traditional (pre-configured) techniques because of their ability to extract new and hidden features from the complex and enormous amount of data [185, 186]. To design intelligent and adaptive ML-based secure sub/-systems, first a trained model must be acquired with respect to safe, secure and dependable behavior while considering uncertainties, unforeseen threats and failures, and design constraints. Next, the trained model is integrated within these sub/-systems for online threat and fault detection under the area and power constraints. However, ML-based techniques typically do not consider the limited computational resources, complexity, probably poor interoperability or real-time constraints of the IoT. Therefore, one has to apply scalable and/or distributed techniques.

5.3 Robustness

The subsystem ensuring resilience shall be robust against the elasticity or dynamicity of the system. The machine learning models need to be updated or reconstructed from the basic building blocks on system changes. Models like deep neural networks (DNNs) and recurrent neural networks (RNNs) are effective in classifying real-world inputs when trained over large data sets. Unfortunately, the decision-making systems using NNs cannot be analyzed and rectified due to the currently used black-box models of NNs.

However, AI systems used in industry (in particular safety-critical CPS) need to follow strict regulations and are expected to explain the reasoning behind their decision-making, which is not viable when using ML-based systems [187]. Recent advances in AI, e.g., biologically inspired NNs, may provide the necessary information to get certified. For instance, Hasani and Lechner et al. [188] are able to interpret the purpose of individual neurons and can provide bounds on the dynamics of the NN.

Moreover, a NN has several security and reliability vulnerabilities w.r.t. data, e.g., data poisoning, model stealing or adversarial examples [189, 61]. To ensure robustness in such NN-based decision systems, several countermeasures have been proposed. A common approach is to encrypt the data or the underlying model [190, 191, 192]. However, encryption works only as long as the encryption techniques and confidence vectors remain hidden from the adversary. Moreover, it requires additional computational resources for encryption and decryption. Other approaches are, e.g., watermarking [193, 194], input transformation [195] and adversarial learning [196, 197]. Note that these countermeasures protect the NNs against known attacks only. Therefore, to ensure robustness also under unknown attacks and unforeseen circumstances, formal verification-based approaches [198, 199] are emerging as an alternative solution.

6 Roadmap

We are investigating techniques for anomaly detection and self-healing to ensure resilience in IoT for CPS.

6.1 Goals

The overarching goal of our research is to provide guidelines, methods and tools to enable a safe and secure IoT for CPS.

Our contributions are two-fold. We increase the dependability of the IoT (and in further consequence, the CPSs using it) by self-healing and the security by developing (semi-) automatic configuration, testing and threat detection. We plan to address the following research questions:

  • How to improve the resilience of the IoT by fail-operational mechanisms?

  • How to verify and monitor IoT components?

  • How to detect anomalies in the IoT with minimum performance and energy overhead?

  • How to ensure high resilience even under unpredictable attack and failure scenarios?

  • What architectural requirements are necessary to ensure resilience with these mechanisms?

In summary, the key research goals of our contribution are:

  • Propose novel design methodologies and architectures for scalable resilience in IoT for CPS.

  • Propose an energy-efficient analysis (verification) and threat detection.

  • Propose a framework to design low-power and ML-based run-time anomaly detection.

  • Propose a methodology to identify and assert the runtime safety and security properties.

  • Propose a self-healing mechanism for the IoT.

6.2 Challenges

The resilience of systems using anomaly-based detection and self-healing raises the following research questions and challenges.

  • C1: Resource Limitations. The majority of IoT components are resource-constrained devices. The developer often has to trade off power, time and costs against resilience. Typically, small IoT devices like commercial off-the-shelf (COTS) microcontrollers may provide insufficient capabilities. Some technologies might therefore need hardware implementations (e.g., RV monitor) or should be designed as a lightweight and fully distributed, layered, or clustered service (e.g., a monitor per subsystem).

    For instance, one major challenge in anomaly detection is the data acquisition under the consideration of power and design constraints. This raises the following research questions:

    1. How to extract/acquire and analyze particular characteristics during run-time while considering the design and power constraints?

    2. How to reduce the area and energy overhead of the data acquisition, i.e., power-ports, for runtime measurement and modeling?

  • C2-1: Interoperability and Complexity. The IoT is a large dynamic network of heterogeneous components. In particular, COTS components or components protected by intellectual property (IP) may not provide a proper specification of their behavior for some of the detection and adaptation methods. Furthermore, new devices or subsystems may introduce unknown interfaces (here: unknown to the resilience-enabling technologies). In particular, this raises the following research questions:

    1. How to identify the reference communication behavior without any reference system?

    2. How to model the communication behavior which can be used to identify the anomalous behavior?

    In anomaly detection, for instance, one of the major challenges is to identify the appropriate golden/reference behavior which can be compared with the online/offline behavior. This raises the following particular research questions:

    1. How to model/identify the reference/golden behavior that covers the key characteristics and can be scalable?

    2. How to obtain the labeled data for supervised training to extract the reference model?

    3. Which modeling techniques and corresponding characteristics are appropriate to identify the anomalous behavior with complete coverage?

  • C2-2: Interoperability and Sharing. The devices of a CPS are specified during design time having a specific application in mind. The things of an IoT will most likely be shared between applications while different fog/cloud applications might request different QoS of the devices, e.g., regarding dependability. The methods therefore must also consider and combine the requirements of different applications and the value of trust of the information (e.g., used to derive actions). Due to the vast size of an IoT, a central mechanism most likely will not be able to cope with all the input data necessary to achieve resilience (considering memory and time constraints).

  • C3: Real-Time and Scalability. One major shift from sensor networks to the IoT is the control and manipulation of actuators from the distance, i.e., the IoT comprises a cyber-physical system. The CPS typically has to satisfy time constraints (rates, deadlines) in order to function correctly. In such real-time applications the probing of information by a monitor or changes in the system (e.g., connection of new things, updates, recovery) shall not influence the timing behavior of the CPS. Furthermore, the timeliness to detect and react to critical failures has to be considered.

However, the complexity and dynamicity of the network will leave the door ajar for some faults, e.g., physical faults, design errors or zero-day malware. Therefore a proper never-give-up strategy [28] to cope with unconsidered failures has to be developed.

6.3 Milestones

Figure 9: A brief history of computer systems and our roadmap towards resilient IoT for CPS.

Figure 9 depicts the evolution of embedded systems (milestones as junctions), their goals and requirements (as lines). The lower part of Fig. 9 summarizes our milestones (1.i-iv, 2.i-iii, 3) given below.

  • How to improve the resilience of the IoT by fail-operational mechanisms? How to monitor IoT components?

The IoT will most likely contain many heterogeneous components with different capabilities of resilience. We therefore consider fail-operational mechanisms that target the dependability of the information exchanged between IoT components. The mechanism shall be applicable within the fog and/or cloud running on an independent component or may be applied in an IoT device itself if the performance requirements for the mechanism are satisfied.

We use given implicit redundancy of information provided by distinct IoT components to self-heal the IoT. To this end, the major effort lies in i) developing and extending our redundancy model, ii) implementing a self-adaptive fault detection, iii) applying fault diagnosis, and iv) recovery considering currently available information.

  • How to verify IoT components? How to detect anomalies in the IoT with minimum performance and energy overhead?

We propose a methodology which consists of the following phases: i) security vulnerability analysis, ii) low-power and iii) ML-based anomaly detection.

The first phase of the proposed methodology is to analyze the IoT for CPS for security vulnerabilities. Unlike traditional simulation and emulation techniques, we plan to leverage formal verification for this analysis. After identifying the security vulnerabilities and the corresponding parameters, i.e., communication and side-channel parameters, the next step is to use this information to develop online anomaly detection techniques. In this project, we plan to leverage two key characteristics, i.e., communication behavior and power (dynamic and leakage), to develop the low-power and ML-based anomaly detection techniques.

  • What architectural requirements are necessary to ensure resilience with these mechanisms?

Finally we collect the architectural requirements of our developed mechanisms to be added to design guidelines for resilient IoT.

In the following, we present a case study to demonstrate how the above mechanisms can be employed in a real-world use case to detect, diagnose and mitigate faults.

7 Case Study: Resilient Smart Mobility

To illustrate the effectiveness of our approach, we perform a case study on mobile autonomous systems, i.e., vehicle-to-everything (V2X) communication in automated driving. The network connects sensors, controllers and actuators, buildings, infrastructure and roadside systems.

In particular, let us consider vehicles driving on a highway (Fig. 10). Radar sensors are mounted along the street and form a collaborative sensor field. In order to improve object detection and classification, a multi-object tracking scheme is employed, which uses subsequent sensor measurements in the form of prediction and update cycles to estimate vehicle locations. The tracking data can be used for, e.g., traffic congestion forecasts or accident investigations. A set of radar sensors is connected to a fog node, that is, a computing unit and IoT gateway in the vicinity of the sensors. The tracker, a software component running on a fog node, tracks the vehicles on the road segment covered by the associated radars. Some vehicles (e.g., autonomous cars) are equipped with distance sensors like radar, lidar or depth cameras. The fog node(s) of these cars can connect to nearby fog nodes of the street (directly over a vehicular network called VANET, or via the mobile network over the cloud). Additional MEMS sensors can support energy management, health and comfort in road transportation.

We assume the IoT infrastructure (things, fog, cloud, network) is given and propose methods to increase the resilience of the IoT.

Figure 10: Visualization of the use case.

Failures of the radar sensors in our example will lead to inaccurate or even unusable tracking results. Failure scenarios like communication crashes and dead batteries (fail-silent, fail-stop) are relatively easy to handle (e.g., watchdog/timeout). However, the sensor measurements received by the tracker running in the fog node may be erroneous due to noise (e.g., communication line, aging), environmental influences (e.g., dirtying of the radar) or a security breach (e.g., hacked fog node that collects data of a group of sensors). To detect a failure of the sensor, one has to create a particular failure model for each possible hazard (e.g., aging, dirtying and a security breach). A simple method for detecting a faulty sensor value in different failure scenarios is to check it against other information sources, i.e., to exploit redundancy. However, explicit redundancy, that is, replicating observation components, is costly.

Self-healing can be applied to react also to failures not specifically considered during design-time. A very promising way of achieving self-healing is through structural adaptation (SHSA), by replacing a failed component with a substitute component by exploiting implicit redundancy (or functional and temporal redundancy) [200]. We use a knowledge base [171, 172] modeling relationships among system variables given that certain implicit redundancy exists in the system and extract a substitute from that knowledge base using guided search (Sec. 7.2). The knowledge base can also be used to monitor the system by comparing the information of variables against each other, i.e., to detect failures (Sec. 7.1).

SHSA can be encapsulated in separate components listening and acting on the communication network of the IoT, e.g., as tasks monitor, diagnose and recover running on a fog node (Fig. 11).

Figure 11: Overview of the self-healing components and proposed integration into a fog node.

SHSA monitors the information communicated between components (typically the sensor measurements or filtered/estimated observations), identifies the failed component and replaces messages of the failed component delivering an erroneous output by spawning a substitute software component. SHSA considers the currently available information in the network, i.e., can be applied in dynamic systems like the IoT (components may be added and removed during runtime). The knowledge base, in particular the relationships between the communicated information, can be defined by the application’s domain expert or learned (approximated by, e.g., neural networks, SVMs or polynomial functions, see also [200]).

Alternatively, the monitor and diagnose task may be installed in the cloud analyzing the logged tracks to trigger maintenance of radar sensors. The requirements needed by SHSA regarding the architecture of the system (e.g., communication network) and a reference implementation of SHSA can be found in [200].

7.1 Detection and Diagnosis

In our future work, we want to use the SHSA knowledge base described below to perform plausibility checks upon related information.

As our focus is on adaptation of the software cyber-part in a CPS (cf. dynamic reconfiguration of an FPGA), we assume that each physical component comprises at least one software component (e.g., the driver of the radar in the vehicle) and henceforth consider the software components only. The CPS implements certain functionality, i.e., a desired service (e.g., collision avoidance). The components implementing the CPS’ objectives are called controllers.

7.1.1 SHSA Knowledge Base

A system can be characterized by properties referred to as variables (e.g., the position and velocity of a tracked vehicle). The values of system variables are communicated between different components, typically via message-based interfaces. We denote transmitted data that is associated with a variable as an information atom, or itom for short [201]. A variable can be provided by different components simultaneously (e.g., two radars with overlapping fields of view). Each software component executes a program that uses input itoms and provides output itoms. An itom is needed when it is an input of a controller. A variable is provided when at least one corresponding itom can be received.

Variables are related to each other. A relation is a function or program (e.g., math, pseudo code or executable python code) to evaluate an output variable from a set of input variables.

The knowledge base is a bipartite directed graph (which may also contain cycles) with independent sets of variables and relations of a CPS. Variables and relations are the nodes of the graph. Edges specify the input/output interface of a relation. For instance, Fig. 12 models the relationships between the variables in the tracking use case (only relevant nodes, relationships and edge directions for the scenario in Fig. 13 are shown). The knowledge base can also be encoded by a set of rules, e.g., written in Prolog. It is then possible to further customize the model, e.g., to follow the requirements and constraints of a CPS application.
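The graph structure described above can be sketched in a few lines of Python. This is only an illustrative data structure, not the reference implementation from [200]; the class names, the relation names (`from_behind`, `predict`) and the one-dimensional position model are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Relation:
    """A relation node: evaluates one output variable from input variables."""
    name: str
    output: str
    inputs: List[str]
    func: Callable[..., float]


@dataclass
class KnowledgeBase:
    """Bipartite directed graph with variable nodes and relation nodes."""
    variables: Set[str] = field(default_factory=set)
    relations: Dict[str, Relation] = field(default_factory=dict)

    def add_relation(self, rel: Relation) -> None:
        # Edges run: input variable -> relation -> output variable.
        self.variables.update(rel.inputs)
        self.variables.add(rel.output)
        self.relations[rel.name] = rel

    def relations_for(self, variable: str) -> List[Relation]:
        """Return all relations whose output is the given variable."""
        return [r for r in self.relations.values() if r.output == variable]


kb = KnowledgeBase()
# Hypothetical relations for the tracking use case (1-D positions in metres).
kb.add_relation(Relation(
    "from_behind", "position", ["position_behind", "distance"],
    lambda position_behind, distance: position_behind + distance))
kb.add_relation(Relation(
    "predict", "position", ["position_prev", "velocity", "dt"],
    lambda position_prev, velocity, dt: position_prev + velocity * dt))
```

Calling `kb.relations_for("position")` returns both relations, i.e., the two alternative ways of evaluating the position from other variables in this toy model.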

A proper data association identifies which itoms or measurements represent the same variable, e.g., links the different position itoms to each other. For instance, the GPS position of a vehicle (transmitted by the vehicle itself) has to be linked to the corresponding radar track (provided by the radar).

Subsequently, the redundant itoms can be used, e.g., to monitor a radar sensor, to substitute a failed radar or to increase the accuracy of a tracking application by sensor fusion.

Figure 12: Knowledge base. Ellipses are variables, boxes are relationships (functions). The variables are annotated with possible itoms. Bold itoms are available in the scenario in Fig. 13.

Figure 13: An exemplary scenario from the use case. Visualization of itoms from the knowledge base in Fig. 12.

The interested reader is referred to [172] and [171] for more details on the SHSA knowledge base.

7.1.2 Fault Detection by Redundancy

An itom has failed when it deviates from its specification. Our monitor uses the knowledge base to periodically perform a plausibility check to identify a failed itom. The automatic setup of a runtime monitor follows these steps:

  • Select the variable to be monitored (typically the corresponding variable to the itom under test), e.g., the position of a vehicle.

  • Collect the provided itoms (e.g., subscribe to all available messages). Note, the availability of variables may change from time to time which should trigger a new setup of the monitor.

  • Extract relations of the monitored variable and available variables from the knowledge base (similar to the search of valid substitutions in Sec. 7.2).

Figure 14: A monitor checking the position of a vehicle using different itoms. The itoms are first transferred into the common domain (here: the position of the vehicle) and compared against each other.

The instantiated monitor for the position of a vehicle is depicted in Fig. 14. At each time step the relations are executed to bring the available itoms (provided variables) into the common domain (variable to be monitored) where the values are compared against each other. The monitor returns the fault status or a confidence / health / trust value for each itom used in the plausibility check.

The confidence may be expressed by a distance metric or error between the itoms in the common domain. The trust or confidence of a radar may be accumulated from the individual confidence values of the tracked vehicles, i.e., the vehicles in the field of view of the radar. As soon as the confidence falls below a specific threshold for a specific amount of time the status of the respective itom is classified as failed.
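A minimal sketch of such a plausibility check is given below. It assumes that all itoms have already been mapped into the common domain, uses the median as the reference value (our choice for this example; the text does not prescribe a voting scheme), and counts consecutive time steps in which the error exceeds a threshold. The class name, the threshold and the `patience` parameter are hypothetical.

```python
from statistics import median
from typing import Dict


class PlausibilityMonitor:
    """Compares itoms in a common domain and flags persistent deviations."""

    def __init__(self, threshold: float, patience: int):
        self.threshold = threshold  # maximum tolerated error (assumed, in metres)
        self.patience = patience    # consecutive violations before "failed"
        self.violations: Dict[str, int] = {}

    def step(self, values: Dict[str, float]) -> Dict[str, str]:
        """values: itom name -> value already mapped into the common domain."""
        reference = median(values.values())  # assumed reference: the median
        status = {}
        for name, value in values.items():
            error = abs(value - reference)   # distance metric in the common domain
            if error > self.threshold:
                self.violations[name] = self.violations.get(name, 0) + 1
            else:
                self.violations[name] = 0    # deviation was only transient
            failed = self.violations[name] >= self.patience
            status[name] = "failed" if failed else "ok"
        return status


monitor = PlausibilityMonitor(threshold=1.0, patience=2)
first = monitor.step({"radar": 50.0, "gps": 50.2, "vehicle_behind": 49.9})
second = monitor.step({"radar": 58.0, "gps": 52.1, "vehicle_behind": 52.0})
third = monitor.step({"radar": 61.0, "gps": 54.0, "vehicle_behind": 54.2})
```

In this run the radar itom deviates in the second and third step, so it is classified as failed only in the third step; a single outlier would not change its status.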

The monitor can identify failed itoms in the common domain. However, when the output of a relation mismatches in the common domain, all inputs of that relation are marked faulty. Identifying the failed itom among them would require an additional monitor for each input variable; to avoid these additional monitors, a fault localization can be performed.

7.1.3 Anomaly Detection

In addition to SHSA, we are developing a low power runtime anomaly detection and ML-based runtime anomaly detection [52] to ensure: i) a secure and safe platform for automated driving, and ii) secure V2X communication.

Low Power

To address the key challenge of power overhead in CPS, we propose a methodology that leverages the traditional low-power online anomaly detection techniques, in particular, assertions, sensor-based analysis and runtime monitoring. In this methodology, the first step is to identify an appropriate detection scheme based on the security threats, security metrics and design constraints. Second, based on the selected technique, the setup of corresponding assertions or sensor-based runtime monitoring is developed and implemented.

Figure 15: The effects of trust-hub Trojan benchmarks (i.e., MC8051-T200 and T400) on the communication behavior of MC8051 for Gaussian and Exponential input data distribution and an Overview of the motivational case study of an MC8051-based communication network.

We propose to use communication behavior-based assertions to identify online anomalies with low power and area overhead. To illustrate the effect of intrusions on communication behavior, we analyzed the effects of several MC8051 trust-Hub benchmarks on an MC8051-based communication network. The analysis in Fig. 15 shows that in case of a denial-of-service attack, the communication channel carries fewer output packets than input packets. However, in case of flooding, jamming, and information leakage attacks, the traffic in the communication channel exceeds the injected input data. Therefore, it can be concluded that the communication behavior can be used to identify anomalous behavior. However, extracting the communication behavior without any golden circuit is not straightforward, which raises the following research challenges:

  1. How to identify the reference communication behavior without any reference circuits/systems?

  2. How to statistically model the communication behavior which can be used to identify the anomalous behavior?

  3. How to measure and analyze the communication for low power runtime anomaly detection?
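The observation from Fig. 15 can be phrased as a simple assertion over one observation window of a channel. The sketch below is only an illustration of that idea; the function name, the 5% tolerance band and the assumption that input and output packet counts are available per window are ours, and the research challenges listed above (in particular obtaining a reference behavior) remain open.

```python
def classify_traffic(packets_in: int, packets_out: int,
                     tolerance: float = 0.05) -> str:
    """Assertion on one observation window of a communication channel."""
    if packets_in < 0 or packets_out < 0:
        raise ValueError("packet counts must be non-negative")
    # Allow a small relative deviation before raising an alarm (assumed value).
    slack = tolerance * max(packets_in, 1)
    if packets_out < packets_in - slack:
        return "suspected denial of service"        # fewer outputs than inputs
    if packets_out > packets_in + slack:
        return "suspected flooding, jamming or leakage"  # more traffic than injected
    return "normal"
```

For example, `classify_traffic(1000, 700)` reports a suspected denial of service, whereas `classify_traffic(1000, 1040)` stays within the tolerance band and is reported as normal.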

Machine Learning

With the increasing trend of connected devices, the number of communication channels also increases exponentially. Thus, communication behavior-based assertions cannot feasibly handle the large number of communication channels and the corresponding communication data. Therefore, to increase the scope of the online anomaly detection for larger CPS with big data analysis, in this project, we propose to explore machine learning algorithms to extract the hidden features from the side-channel parametric and communication behaviors (i.e., power and communication behavior). The first step in developing an ML-based anomaly detection is to select an appropriate ML algorithm based on the design constraints, security threats and complexity of the measured data. The second step is to train and implement the ML algorithm based on measured data with minimum power and area overhead.

To illustrate the effect of intrusions on power profile, we analyzed the MC8051 with and without trust-Hub benchmarks, i.e., MC8051-T200 and T400, in Xilinx power analyzer. The experimental analysis in Fig. 16 shows that intrusions in MC8051 have a significant impact on the power distribution with respect to different pipeline stages (see labels 1 to 4). Therefore, it can be concluded that the power profiling of the processing elements/controllers in CPS can be used to identify the abnormalities.

Figure 16: Effects of trust-hub Trojan benchmarks (i.e., MC8051-T200, T300 and T400) on Power Correlation with respect to Pipeline Stages for Different Instructions, i.e., MOV, ADD, INC, JMP.

Though the power profiling of the microprocessor can be used to detect an anomalous behavior, power-based ML training and runtime measurement is not easy. Therefore, the following research challenges must be considered while designing the ML-based online anomaly detection:

  1. How to extract the power profiles of the processing elements (controllers in CPS) for efficient ML training?

  2. How to reduce the area and energy overhead of power-ports for runtime measurement and modeling?

7.1.4 Fault Localization [135]

The fault detection mechanisms described in the last sections can identify failed data on the communication network. In order to recover the failed component responsible for the wrong information we have to apply fault localization.

Engineers often design CPS using the MathWorks Simulink toolset to model their functionalities. These models are generally complex hybrid systems that are often impossible to analyze using only the reachability analysis techniques described before. A popular technique to find bugs in Simulink/Stateflow models is falsification-based testing [125, 126, 202]. This approach consists in monitoring an STL property over traces produced by systematically simulating the CPS design using different sets of test cases. For each generated trace the monitor returns a real value that indicates how far the trace is from violating the property. This information can be used to guide the test case generation to find an input sequence that falsifies the specification. However, this approach does not provide any information on which component has failed or which precise moment in time is responsible for the observed violation. To overcome this shortcoming, Bartocci et al. [135] have recently introduced a new procedure that aids designers in debugging Simulink/Stateflow hybrid system models, guided by STL specifications. This approach combines a trace diagnostics technique [142] that localizes time segments and interface variables contributing to the property violations, a slicing method [203] that maps these time segments to the internal states and transitions of the model, and a spectrum-based fault-localization method [139] that produces a ranking of the internal states and/or transitions that are most likely to explain the fault.
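To make the notion of a real-valued monitor verdict concrete, the following sketch computes the robustness of the simple property "always (x < limit)" over a simulated trace and enumerates a few candidate inputs until one yields a negative value. It is a toy illustration only: `simulate` is a hypothetical stand-in for a Simulink simulation, and the plain enumeration replaces the robustness-guided test generation used by the cited falsification tools.

```python
from typing import Iterable, List, Optional, Tuple


def simulate(u: float, steps: int = 50) -> List[float]:
    """Hypothetical stand-in for a model simulation: first-order lag towards u."""
    x, trace = 0.0, []
    for _ in range(steps):
        x += 0.1 * (u - x)
        trace.append(x)
    return trace


def robustness_always_below(trace: List[float], limit: float) -> float:
    """Robustness of 'always (x < limit)': the worst-case margin over the trace."""
    return min(limit - x for x in trace)


def falsify(candidates: Iterable[float],
            limit: float) -> Tuple[Optional[float], float]:
    """Return the first falsifying input (or None) and the lowest robustness seen."""
    lowest = float("inf")
    for u in candidates:
        rho = robustness_always_below(simulate(u), limit)
        lowest = min(lowest, rho)
        if rho < 0:  # negative robustness: the specification is violated
            return u, rho
    return None, lowest


counterexample, rho = falsify([2.0, 6.0, 9.0, 12.0], limit=10.0)
```

A positive robustness means the trace satisfies the property with some margin, and a smaller value means it is closer to a violation. In a guided search, this value would serve as the cost function for selecting the next test input.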

7.2 Recovery or Mitigation

A failed itom can be replaced by a function of related itoms. To this end, the knowledge base is searched for relationships that use provided variables, and a substitute is spawned.

7.2.1 Replacement

The substitute search algorithm traverses the knowledge base (Fig. 12) from the failed but needed information as root to find a valid substitution [172].

A substitution of a variable is a connected acyclic sub-graph of the knowledge base with the following properties: i) The output variable is the only sink of the substitution. ii) Each variable has zero or one relationship as predecessor. iii) All input variables of a relation must be included (it follows that the sources of the substitution graph are variables only).

A substitution is valid if all sources are provided, otherwise the substitution is invalid (Fig. 17). Only a valid substitution can be instantiated (to a substitute) by concatenating the relationships which take the selected itoms as input (e.g., best itoms of the source variables).

Figure 17: A valid substitution for the failed street radar. Old data from the predecessor radar is used to forward estimate the position of the vehicles.

Substitutions can be found by depth-first search of the knowledge base with the failed variable as root. The search may stop as soon as all unprovided variables are substituted [171]. In [172] we present a guided search approach using a performance measure for substitutions.
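The depth-first search can be sketched as follows. This is a simplified illustration under our own assumptions, not the algorithm from [171] or the guided search from [172]: the knowledge base is a small hypothetical dictionary, the first valid substitution is returned without ranking, and cycles are avoided by tracking the variables on the current search branch.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical knowledge base: relation name -> (output variable, input variables).
RELATIONS: Dict[str, Tuple[str, List[str]]] = {
    "r_behind": ("position", ["position_behind", "distance"]),
    "r_predict": ("position", ["position_prev", "velocity"]),
    "r_speed": ("velocity", ["wheel_speed"]),
}


def find_substitution(variable: str, provided: Set[str],
                      path: Optional[Set[str]] = None) -> Optional[List[str]]:
    """Depth-first search from the failed variable.

    Returns the relation names of a valid substitution (inputs first),
    or None if no valid substitution exists.
    """
    path = (path or set()) | {variable}  # variables on the current branch
    for name, (output, inputs) in RELATIONS.items():
        if output != variable:
            continue
        plan: List[str] = []
        for var in inputs:
            if var in provided:
                continue  # provided variable: a source of the substitution
            if var in path:
                break     # cycle: this relation cannot be used here
            sub = find_substitution(var, provided, path)
            if sub is None:
                break     # unprovided input that cannot be substituted
            plan.extend(sub)
        else:
            return plan + [name]  # all inputs are covered: valid substitution
    return None
```

With `provided = {"position_prev", "wheel_speed"}`, the search rejects `r_behind` (its inputs are not provided) and returns `["r_speed", "r_predict"]`, i.e., the velocity is first derived from the wheel speed and then used to predict the position.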

The result of the search - the substitution - is instantiated in a substitute [200]. In particular, the substitute subscribes to the input itoms and concatenates the functions or programs from the relationships. The substitute then periodically publishes the output. To avoid inconsistencies and fault propagation, the failed component (probably publishing erratic messages) should be shut down as soon as possible.
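The concatenation of relations into a substitute can be illustrated as follows. The sketch covers only the function composition; subscribing to itoms and periodic publishing depend on the middleware and are omitted. The relation tuples, the assumed wheel radius and the name `make_substitute` are hypothetical and not taken from the reference implementation in [200].

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical relation: (output variable, input variables, function).
Relation = Tuple[str, List[str], Callable[..., float]]


def make_substitute(plan: List[Relation]) -> Callable[[Dict[str, float]], float]:
    """Chain relations (inputs first) into one function of the source itoms."""
    if not plan:
        raise ValueError("a substitution needs at least one relation")

    def substitute(itoms: Dict[str, float]) -> float:
        values = dict(itoms)  # copy, so the caller's data is not modified
        output = 0.0
        for out_var, in_vars, func in plan:
            output = func(*(values[v] for v in in_vars))
            values[out_var] = output  # intermediate result feeds the next relation
        return output

    return substitute


plan = [
    # Assumed wheel radius of 0.25 m: angular speed (rad/s) -> velocity (m/s).
    ("velocity", ["wheel_speed"], lambda w: w * 0.25),
    ("position", ["position_prev", "velocity", "dt"],
     lambda p, v, dt: p + v * dt),
]
estimate_position = make_substitute(plan)
value = estimate_position({"wheel_speed": 100.0, "position_prev": 20.0, "dt": 0.5})
```

A substitute component would call `estimate_position` whenever new input itoms arrive and publish the result in place of the failed itom.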

8 Conclusion

This paper summarizes the state of the art of detection and recovery methods for reacting to failures in IoT for CPS. We further present the main challenges and a roadmap towards a resilient IoT. The main challenges identified for existing and new resilience methods are:

  • Limited resources of computation and power (e.g., for runtime data acquisition).

  • Limited knowledge of device and interface semantics (e.g., to retrieve a reference behavior for anomaly detection or model the redundancies in the system).

  • Ensuring that adding or applying resilience techniques does not alter the (real-time) behavior.

  • Provide long-term dependability and security, that is, ensure resilience also after environmental, functional or technological changes of the system.

  • Adaptation, verification, validation and robustness of the resilience techniques.

Moreover, we introduced some of our key solutions on an automotive example. The SHSA knowledge base presented in Section 7 describes implicit and explicit redundancy in a communication network. It can therefore be exploited to monitor, replace or fuse information. Because SHSA is based on redundancy, it can handle various fault scenarios. In particular, permanent faults in the IoT can be detected and recovered from, provided some redundancy exists. As long as the failed components can be isolated and replaced by redundant information, the methods can handle physical, development or interaction faults manifested as failures at the components’ interfaces.

The monitors tackle the requirement on fault detection by voting over redundant information or comparing it to some reference behavior (R1). An additional fault localization identifies the failed component and triggers its disconnection to avoid fault propagation. The substitution replaces failed information with redundant information (R2).

The presented techniques need a reference behavior, common understanding of the information or access to relevant redundancy (C2). Therefore, the IoT should provide proper interoperability (e.g., in form of standards). Under some constraints (bounded or static SHSA knowledge base, estimation of the worst-case execution time of relationships) SHSA is suitable for real-time applications [171]. However, solutions to increase scalability have to be investigated (C3). Moreover, the individual IoT devices might not have the resources to implement detection and recovery (C1). In future work we therefore want to focus on a distributed approach of the mechanism (e.g., by splitting the knowledge base for subsystems, or monitor in a distributed fashion like agreement protocols do).

Acknowledgement

The research leading to these results has received funding from the IoT4CPS project partially funded by the “ICT of the Future” Program of the FFG and the BMVIT. The authors acknowledge the TU Wien University Library for financial support through its Open Access Funding Programme.

References

  • [1] Edward A. Lee and Sanjit A. Seshia. An introductory textbook on cyber-physical systems. In Proceedings of the 2010 Workshop on Embedded Systems Education, WESE ’10, pages 1:1–1:6, New York, NY, USA, 2010. ACM.
  • [2] Ragunathan (Raj) Rajkumar, Insup Lee, Lui Sha, and John Stankovic. Cyber-physical systems: The next computing revolution. In Proc. of DAC ’10: the 47th Design Automation Conference, pages 731–736, New York, NY, USA, 2010. ACM.
  • [3] Ragunathan Rajkumar. A cyber-physical future. Proceedings of the IEEE, 100(Special Centennial Issue):1309–1312, 2012.
  • [4] Andrea Ceccarelli, Andrea Bondavalli, Bernhard Froemel, Oliver Hoeftberger, and Hermann Kopetz. Basic Concepts on Systems of Systems, pages 1–39. Springer International Publishing, Cham, 2016.
  • [5] Daniel J. Fagnant and Kara Kockelman. Preparing a nation for autonomous vehicles: opportunities, barriers and policy recommendations. Transportation Research Part A: Policy and Practice, 77:167–181, 2015.
  • [6] Insup Lee, Oleg Sokolsky, Sanjian Chen, John Hatcliff, Eunkyoung Jee, BaekGyu Kim, Andrew King, Margaret Mullen-Fortino, Soojin Park, Alexander Roederer, and Krishna K. Venkatasubramanian. Challenges and research directions in medical cyber-physical systems. Proceedings of the IEEE, 100(1):75–90, 2012.
  • [7] IHS. Internet of Things (IoT) connected devices installed base worldwide from 2015 to 2025 (in billions). In Statista - The Statistics Portal. Accessed September 10, 2018. Available from https://www.statista.com/statistics/471264/iot-number-of-connected-devices-worldwide/.
  • [8] World Bank. World: Total population 2007-2017 (in billion inhabitants). In Statista - The Statistics Portal. Accessed October 18, 2018. Available from https://www.statista.com/statistics/805044/total-population-worldwide/.
  • [9] David Reinsel, John Gantz, and John Rydning. Data age 2025: The evolution of data to life-critical. Don’t focus on big data; focus on data that’s big. In IDC, Seagate, April. Accessed October 19, 2018. Available from https://www.seagate.com/files/www-content/our-story/trends/files/Seagate-WP-DataAge2025-March-2017.pdf.
  • [10] Kathy Winter. For self-driving cars, there’s big meaning behind one big number: 4 terabytes. In Intel News Room. Accessed October 19, 2018. Available from https://newsroom.intel.com/editorials/self-driving-cars-big-meaning-behind-one-number-4-terabytes/.
  • [11] Jad Mouawad. F.A.A. Orders Fix for Possible Power Loss in Boeing 787. New York Times, May 1, 2015.
  • [12] Peter Holley. Chrysler Fiat announces recall of nearly 5 million U.S. cars. The Guardian, May 25, 2018.
  • [13] Edward Helmore. Uber shuts down self-driving operation in Arizona after fatal crash. The Guardian, May 23, 2018.
  • [14] Robin Bloomfield, Kateryna Netkachova, and Robert Stroud. Security-informed safety: If it’s not secure, it’s not safe. In Anatoliy Gorbenko, Alexander Romanovsky, and Vyacheslav Kharchenko, editors, Software Engineering for Resilient Systems, pages 17–32, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.
  • [15] Alex Hern. Hacking risk leads to recall of 500,000 pacemakers due to patient death fears. The Guardian, August 31, 2017.
  • [16] Rory Cellan-Jones. F.A.A. Orders Fix for Possible Power Loss in Boeing 787. BBC, February 26, 2016.
  • [17] Andy Greenberg. Hackers Remotely Kill a Jeep on the Highway – With Me in It. Wired, July 1, 2015.
  • [18] Constantinos Kolias, Georgios Kambourakis, Angelos Stavrou, and Jeffrey Voas. DDoS in the IoT: Mirai and other Botnets. Computer, 50(7):80–84, 2017.
  • [19] Jean-Claude Laprie. From Dependability to Resilience. In Dependable Systems and Networks (DSN 2008), 38th Annual IEEE/IFIP International Conference, 2008.
  • [20] Luigi Atzori, Antonio Iera, and Giacomo Morabito. The Internet of Things: A survey. Computer Networks, 54(15):2787 – 2805, 2010.
  • [21] Ovidiu Vermesan, Peter Friess, Patrick Guillemin, Sergio Gusmeroli, Harald Sundmaeker, Alessandro Bassi, Ignacio Soler Jubert, Margaretha Mazura, Mark Harrison, Markus Eisenhauer, et al. Internet of things strategic research roadmap. Internet of Things-Global Technological and Societal Trends, 1(2011):9–52, 2011.
  • [22] Martin Wollschlaeger, Thilo Sauter, and Juergen Jasperneite. The future of industrial communication: Automation networks in the era of the internet of things and industry 4.0. IEEE Industrial Electronics Magazine, 11(1):17–27, 2017.
  • [23] Ismail Butun, Salvatore D Morgera, and Ravi Sankar. A survey of intrusion detection systems in wireless sensor networks. IEEE communications surveys & tutorials, 16(1):266–282, 2014.
  • [24] Robert Mitchell and Ing-Ray Chen. A survey of intrusion detection techniques for cyber-physical systems. ACM Computing Surveys (CSUR), 46(4):55, 2014.
  • [25] A. L. Buczak and E. Guven. A Survey of Data Mining and Machine Learning Methods for Cyber Security Intrusion Detection. IEEE Communications Surveys Tutorials, 18(2):1153–1176, Secondquarter 2016.
  • [26] Shelly Xiaonan Wu and Wolfgang Banzhaf. The use of computational intelligence in intrusion detection systems: A review. Applied Soft Computing, 10(1):1 – 35, 2010.
  • [27] Algirdas Avizienis, Jean-Claude Laprie, Brian Randell, and Carl Landwehr. Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Trans. on Dependable and Secure Computing, 1:11–33, 2004.
  • [28] Hermann Kopetz. Real-Time Systems: Design Principles for Distributed Embedded Applications. Springer, New York, 2nd edition, 2011.
  • [29] Ezio Bartocci and Yliès Falcone, editors. Lectures on Runtime Verification - Introductory and Advanced Topics, volume 10457 of Lecture Notes in Computer Science. Springer, 2018.
  • [30] Martin Leucker and Christian Schallhart. A brief account of runtime verification. The Journal of Logic and Algebraic Programming, 78(5):293 – 303, 2009.
  • [31] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly Detection: A Survey. ACM Comput. Surv., 41(3):15:1–15:58, July 2009.
  • [32] B. H. C. Cheng et al. Software Engineering for Self-Adaptive Systems: A Research Roadmap. In Software Engineering for Self-Adaptive Systems, pages 1–26. Springer Verlag, Berlin, Heidelberg, 2009.
  • [33] S. Siva Sathya and K. Syam Babu. Survey of fault tolerant techniques for grid. Computer Science Review, 4(2):101 – 120, 2010.
  • [34] R. de Lemos et al. Software Engineering for Self-Adaptive Systems: A Second Research Roadmap, pages 1–32. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.
  • [35] Danny Weyns. Software Engineering of Self-Adaptive Systems: An Organised Tour and Future Challenges. Springer, 2017.
  • [36] Samuel Kounev, Peter Lewis, Kirstie L. Bellman, Nelly Bencomo, Javier Camara, Ada Diaconescu, Lukas Esterle, Kurt Geihs, Holger Giese, Sebastian Götz, Paola Inverardi, Jeffrey O. Kephart, and Andrea Zisman. The Notion of Self-aware Computing, pages 3–16. Springer International Publishing, Cham, 2017.
  • [37] Zoltan Papp and George Exarchakos, editors. Runtime Reconfiguration in Networked Embedded Systems - Design and Testing Practices. Internet of Things - Technology, Communications and Computing. Springer Science+Business Media Singapore, 2016.
  • [38] Abdulmalik Humayed, Jingqiang Lin, Fengjun Li, and Bo Luo. Cyber-physical systems security – a survey. IEEE Internet of Things Journal, 4(6):1802–1831, 2017.
  • [39] Rolf Isermann. Fault-diagnosis systems: an introduction from fault detection to fault tolerance. Springer Science & Business Media, 2006.
  • [40] Debanjan Ghosh, Raj Sharman, H. Raghav Rao, and Shambhu Upadhyaya. Self-healing Systems - Survey and Synthesis. Decis. Support Syst., 42(4):2164–2185, January 2007.
  • [41] Harald Psaier and Schahram Dustdar. A survey on self-healing systems: approaches and systems. Computing, 91(1):43–73, Jan 2011.
  • [42] Nancy E ElHady and Julien Provost. A systematic survey on sensor failure detection and fault-tolerance in ambient assisted living. Sensors, 18(7):1991, 2018.
  • [43] Andrea Zanella, Nicola Bui, Angelo Castellani, Lorenzo Vangelista, and Michele Zorzi. Internet of things for smart cities. IEEE Internet of Things journal, 1(1):22–32, 2014.
  • [44] Marco Conoscenti, Antonio Vetro, and Juan Carlos De Martin. Blockchain for the internet of things: A systematic literature review. In Computer Systems and Applications (AICCSA), 2016 IEEE/ACS 13th International Conference of, pages 1–6. IEEE, 2016.
  • [45] Partha Pratim Ray. A survey on internet of things architectures. Journal of King Saud University-Computer and Information Sciences, 30(3):291–319, 2018.
  • [46] Ana Reyna, Cristian Martín, Jaime Chen, Enrique Soler, and Manuel Díaz. On blockchain and its integration with iot. challenges and opportunities. Future Generation Computer Systems, 2018.
  • [47] Minhaj Ahmad Khan and Khaled Salah. Iot security: Review, blockchain solutions, and open challenges. Future Generation Computer Systems, 82:395–411, 2018.
  • [48] Arbia Riahi Sfar, Enrico Natalizio, Yacine Challal, and Zied Chtourou. A roadmap for security challenges in the internet of things. Digital Communications and Networks, 4(2):118–137, 2018.
  • [49] Jean-Claude Laprie. Resilience for the scalability of dependability. In Network Computing and Applications, fourth IEEE international symposium on, pages 5–6. IEEE, 2005.
  • [50] Song Han, Miao Xie, Hsiao-Hwa Chen, and Yun Ling. Intrusion detection in cyber-physical systems: Techniques and challenges. IEEE systems journal, 8(4):1052–1062, 2014.
  • [51] I. Yaqoob, E. Ahmed, I. A. T. Hashem, A. I. A. Ahmed, A. Gani, M. Imran, and M. Guizani. Internet of Things Architecture: Recent Advances, Taxonomy, Requirements, and Open Challenges. IEEE Wireless Communications, 24(3):10–16, June 2017.
  • [52] Muhammad Shafique, Faiq Khalid, and Semeen Rehman. Intelligent security measures for smart cyber physical systems. In 2018 21st Euromicro Conference on Digital System Design (DSD), pages 280–287. IEEE, 2018.
  • [53] Jacob Wurm, Yier Jin, Yang Liu, Shiyan Hu, Kenneth Heffner, Fahim Rahman, and Mark Tehranipoor. Introduction to cyber-physical system security: A cross-layer perspective. IEEE Trans. Multi-Scale Comput. Syst, 3(3):215–227, 2017.
  • [54] II Myers et al. Automated security domain partitioning with a formal method perspective of a cyber-physical systems. 2017.
  • [55] Fillia Makedon, Zhengyi Le, Heng Huang, Eric Becker, and Dimitrios Kosmopoulos. An event driven framework for assistive cps environments. ACM SIGBED Review, 6(2):3, 2009.
  • [56] Steve Zdancewic and Andrew C Myers. Secure information flow and cps. In European Symposium on Programming, pages 46–61. Springer, 2001.
  • [57] Qian Xu, Pinyi Ren, Houbing Song, and Qinghe Du. Security-aware waveforms for enhancing wireless communications privacy in cyber-physical systems via multipath receptions. IEEE Internet of Things Journal, 4(6):1924–1933, 2017.
  • [58] Sujit Rokka Chhetri, Sina Faezi, and Mohammad Abdullah Al Faruque. Fix the leak!: an information leakage aware secured cyber-physical manufacturing system. In Proceedings of the Conference on Design, Automation & Test in Europe, pages 1412–1417. European Design and Automation Association, 2017.
  • [59] Mauro Conti. Leaky cps: Inferring cyber information from physical properties (and the other way around). In Proceedings of the 4th ACM Workshop on Cyber-Physical System Security, pages 23–24. ACM, 2018.
  • [60] Sujit Rokka Chhetri, Sina Faezi, and Mohammad Abdullah Al Faruque. Information leakage-aware computer-aided cyber-physical manufacturing. IEEE Transactions on Information Forensics and Security, 13(9):2333–2344, 2018.
  • [61] Florian Kriebel, Semeen Rehman, Muhammad Abdullah Hanif, Faiq Khalid, and Muhammad Shafique. Robustness for smart cyber physical systems and internet-of-things: From adaptive robustness methods to reliability and security for machine learning. In 2018 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pages 581–586. IEEE, 2018.
  • [62] Juan Carlos Balda, Alan Mantooth, Rick Blum, and Paolo Tenti. Cybersecurity and power electronics: Addressing the security vulnerabilities of the internet of things. IEEE Power Electronics Magazine, 4(4):37–43, 2017.
  • [63] Yilin Mo and Bruno Sinopoli. Secure control against replay attacks. In Communication, Control, and Computing, 2009. Allerton 2009. 47th Annual Allerton Conference on, pages 911–918. IEEE, 2009.
  • [64] Saqib Ali, Taiseera Al Balushi, Zia Nadir, and Omar Khadeer Hussain. Wsn security mechanisms for cps. In Cyber Security for Cyber Physical Systems, pages 65–87. Springer, 2018.
  • [65] Yasser Shoukry, Paul Martin, Paulo Tabuada, and Mani Srivastava. Non-invasive spoofing attacks for anti-lock braking systems. In International Workshop on Cryptographic Hardware and Embedded Systems, pages 55–72. Springer, 2013.
  • [66] Jerome Radcliffe. Hacking medical devices for fun and insulin: Breaking the human scada system. In Black Hat Conference presentation slides, volume 2011, 2011.
  • [67] Sridhar Adepu, Jay Prakash, and Aditya Mathur. Waterjam: An experimental case study of jamming attacks on a water treatment system. In Software Quality, Reliability and Security Companion (QRS-C), 2017 IEEE International Conference on, pages 341–347. IEEE, 2017.
  • [68] Yuzhe Li, Ling Shi, Peng Cheng, Jiming Chen, and Daniel E Quevedo. Jamming attacks on remote state estimation in cyber-physical systems: A game-theoretic approach. IEEE Transactions on Automatic Control, 60(10):2831–2836, 2015.
  • [69] Lianghong Peng, Xianghui Cao, Changyin Sun, Yu Cheng, and Shi Jin. Energy efficient jamming attack schedule against remote state estimation in wireless cyber-physical systems. Neurocomputing, 272:571–583, 2018.
  • [70] Lianghong Peng, Xianghui Cao, Hongbao Shi, and Changyin Sun. Optimal jamming attack schedule for remote state estimation with two sensors. Journal of the Franklin Institute, 2018.
  • [71] Zhiyuan Zheng and AL Reddy. Towards improving data validity of cyber-physical systems through path redundancy. In Proceedings of the 3rd ACM Workshop on Cyber-Physical System Security, pages 91–102. ACM, 2017.
  • [72] Xianghui Cao and Changyin Sun. Probabilistic denial of service attack against remote state estimation over a markov channel in cyber-physical systems. In Control Conference (ASCC), 2017 11th Asian, pages 946–951. IEEE, 2017.
  • [73] Jun-Sheng Wang and Guang-Hong Yang. Data-driven methods for stealthy attacks on tcp/ip-based networked control systems equipped with attack detectors. IEEE transactions on cybernetics, (99):1–12, 2018.
  • [74] Soodeh Dadras and Chris Winstead. Insider vs. outsider threats to autonomous vehicle platooning. 2018.
  • [75] George Hatzivasilis, Ioannis Papaefstathiou, and Charalampos Manifavas. Scotres: Secure routing for iot and cps. IEEE Internet of Things Journal, 4(6):2129–2141, 2017.
  • [76] Saqib Ali, Taiseera Al Balushi, Zia Nadir, and Omar Khadeer Hussain. Distributed control systems security for cps. In Cyber Security for Cyber Physical Systems, pages 141–160. Springer, 2018.
  • [77] Hussam Amrouch, Prashanth Krishnamurthy, Naman Patel, Jörg Henkel, Ramesh Karri, and Farshad Khorrami. Emerging (un-) reliability based security threats and mitigations for embedded systems: special session. In Proceedings of the 2017 International Conference on Compilers, Architectures and Synthesis for Embedded Systems Companion, page 17. ACM, 2017.
  • [78] Bikash Poudel, Naresh Kumar Giri, and Arslan Munir. Design and comparative evaluation of gpgpu-and fpga-based mpsoc ecu architectures for secure, dependable, and real-time automotive cps. In Application-specific Systems, Architectures and Processors (ASAP), 2017 IEEE 28th International Conference on, pages 29–36. IEEE, 2017.
  • [79] Ruggero Lanotte, Massimo Merro, Riccardo Muradore, and Luca Viganò. A formal approach to cyber-physical attacks. In Computer Security Foundations Symposium (CSF), 2017 IEEE 30th, pages 436–450. IEEE, 2017.
  • [80] Mark Yampolskiy, Wayne E King, Jacob Gatlin, Sofia Belikovetsky, Adam Brown, Anthony Skjellum, and Yuval Elovici. Security of additive manufacturing: Attack taxonomy and survey. Additive Manufacturing, 2018.
  • [81] Yanbo Dong and Peng Zhou. Jamming attacks against control systems: A survey. In Intelligent Computing, Networked Control, and Their Engineering Applications, pages 566–574. Springer, 2017.
  • [82] William M Nichols. Hybrid Attack Graphs for Use with a Simulation of a Cyber-Physical System. PhD thesis, The University of Tulsa, 2018.
  • [83] Samuel B Moore, Jacob Gatlin, Sofia Belikovetsky, Mark Yampolskiy, Wayne E King, and Yuval Elovici. Power consumption-based detection of sabotage attacks in additive manufacturing. arXiv preprint arXiv:1709.01822, 2017.
  • [84] S Rekhis, N Boudriga, and N Ellouze. Securing implantable medical devices against cyberspace attacks. In Anti-Cyber Crimes (ICACC), 2017 2nd International Conference on, pages 187–192. IEEE, 2017.
  • [85] Julius C Aguma and Bruce McMillin. Introduction of a hybrid monitor to cyber-physical systems. arXiv preprint arXiv:1805.01975, 2018.
  • [86] Chuadhry Mujeeb Ahmed, Aditya Mathur, and Martin Ochoa. arXiv preprint arXiv:1712.01598, 2017.
  • [87] Hasan Gunduz and Dilan Jayaweera. Reliability assessment of a power system with cyber-physical interactive operation of photovoltaic systems. International Journal of Electrical Power & Energy Systems, 101:371–384, 2018.
  • [88] Glenn A Fink, Thomas W Edgar, Theora R Rice, Douglas G MacDonald, and Cary E Crawford. Security and privacy in cyber-physical systems. In Cyber-Physical Systems, pages 129–141. Elsevier, 2017.
  • [89] Victoria Marquis, Rebecca Ho, William Rainey, Matthew Kimpel, Joseph Ghiorzi, William Cricchi, and Nicola Bezzo. Toward attack-resilient state estimation and control of autonomous cyber-physical systems. In Systems and Information Engineering Design Symposium (SIEDS), 2018, pages 70–75. IEEE, 2018.
  • [90] J. O. Kephart and D. M. Chess. The vision of autonomic computing. Computer, 36(1):41–50, Jan 2003.
  • [91] Thomas A. Henzinger. The theory of hybrid automata. In Proceedings, 11th Annual IEEE Symposium on Logic in Computer Science, New Brunswick, New Jersey, USA, July 27-30, 1996, pages 278–292. IEEE Computer Society, 1996.
  • [92] N. Bißmeyer, S. Mauthofer, K. M. Bayarou, and F. Kargl. Assessment of node trustworthiness in VANETs using data plausibility checks with particle filters. In 2012 IEEE Vehicular Networking Conference (VNC), pages 78–85, Nov 2012.
  • [93] Davide Fauri, Daniel Ricardo dos Santos, Elisa Costante, Jerry den Hartog, Sandro Etalle, and Stefano Tonetta. From system specification to anomaly detection (and back). In Proceedings of the 2017 Workshop on Cyber-Physical Systems Security and PrivaCy, pages 13–24. ACM, 2017.
  • [94] Miguel Castro and Barbara Liskov. Practical byzantine fault tolerance and proactive recovery. ACM Transactions on Computer Systems (TOCS), 20(4):398–461, 2002.
  • [95] Yliès Falcone. You should better enforce than verify. In Howard Barringer, Yliès Falcone, Bernd Finkbeiner, Klaus Havelund, Insup Lee, Gordon J. Pace, Grigore Rosu, Oleg Sokolsky, and Nikolai Tillmann, editors, Proceedings of the 1st international conference on Runtime verification (RV 2010), volume 6418 of Lecture Notes in Computer Science, pages 89–105. Springer-Verlag, 2010.
  • [96] W. Meng, E. W. Tischhauser, Q. Wang, Y. Wang, and J. Han. When Intrusion Detection Meets Blockchain Technology: A Review. IEEE Access, 6:10179–10188, 2018.
  • [97] Yliès Falcone, Leonardo Mariani, Antoine Rollet, and Saikat Saha. Runtime failure prevention and reaction. In Lectures on Runtime Verification - Introductory and Advanced Topics, volume 10457, pages 103–134. Springer, 2018.
  • [98] J. Petit and S. E. Shladover. Potential Cyberattacks on Automated Vehicles. IEEE Transactions on Intelligent Transportation Systems, 16(2):546–556, April 2015.
  • [99] Stefan Poledna. Fault-Tolerant Real-Time Systems: The Problem of Replica Determinism. Kluwer Academic Publishers, Norwell, MA, USA, 1996.
  • [100] Florian Kriebel, Muhammad Shafique, Semeen Rehman, Jörg Henkel, and Siddharth Garg. Variability and reliability awareness in the age of dark silicon. IEEE Design & Test, 33(2):59–67, 2016.
  • [101] Jürgen Dürrwang, Marcel Rumez, Johannes Braun, and Reiner Kriesten. Security Hardening with Plausibility Checks for Automotive ECUs. In VEHICULAR 2017: The Sixth International Conference on Advances in Vehicular Systems, Technologies and Applications, pages 38–41. IARIA, 07 2017.
  • [102] Matthias Althoff. Reachability analysis of nonlinear systems using conservative polynomialization and non-convex sets. In Proc. of HSCC ’13: the 16th International Conference on Hybrid Systems: Computation and Control, pages 173–182. ACM, 2013.
  • [103] Martin Fränzle and Christian Herde. Hysat: An efficient proof engine for bounded model checking of hybrid systems. Formal Methods in System Design, 30(3):179–198, 2007.
  • [104] Eugene Asarin, Thao Dang, and Oded Maler. The d/dt tool for verification of hybrid systems. In Computer Aided Verification, 14th International Conference, CAV 2002,Copenhagen, Denmark, July 27-31, 2002, Proceedings, volume 2404 of LNCS, pages 365–370. Springer, 2002.
  • [105] Rajarshi Ray, Amit Gurung, Binayak Das, Ezio Bartocci, Sergiy Bogomolov, and Radu Grosu. Xspeed: Accelerating reachability analysis on multi-core processors. In HVC, volume 9434 of LNCS, pages 3–18. Springer, 2015.
  • [106] Xin Chen, Erika Ábrahám, and Sriram Sankaranarayanan. Flow*: An analyzer for non-linear hybrid systems. In CAV, pages 258–263, 2013.
  • [107] Goran Frehse, Colas Le Guernic, Alexandre Donzé, Scott Cotton, Rajarshi Ray, Olivier Lebeltel, Rodolfo Ripado, Antoine Girard, Thao Dang, and Oded Maler. SpaceEx: Scalable verification of hybrid systems. In Shaz Qadeer Ganesh Gopalakrishnan, editor, CAV, LNCS. Springer, 2011.
  • [108] Soonho Kong, Sicun Gao, Wei Chen, and Edmund Clarke. dReach: -reachability analysis for hybrid systems. In TACAS, pages 200–205, 2015.
  • [109] Parasara Sridhar Duggirala, Sayan Mitra, Mahesh Viswanathan, and Matthew Potok. C2E2: A verification tool for stateflow models. In TACAS, pages 68–82. Springer, 2015.
  • [110] Hui Kong, Ezio Bartocci, and Thomas A. Henzinger. Reachable set over-approximation for nonlinear systems using piecewise barrier tubes. In Proc. of CAV 2018: the 30th International Conference on Computer Aided Verification, volume 10981 of LNCS, pages 449–467. Springer, 2018.
  • [111] Ezio Bartocci, Yliès Falcone, Adrian Francalanza, and Giles Reger. Introduction to runtime verification. In Lectures on Runtime Verification - Introductory and Advanced Topics, volume 10457 of Lecture Notes in Computer Science, pages 1–33. Springer, 2018.
  • [112] Stefan Jaksic, Ezio Bartocci, Radu Grosu, Reinhard Kloibhofer, Thang Nguyen, and Dejan Ničković. From signal temporal logic to FPGA monitors. In Proc. of MEMOCODE 2015: the 13th ACM/IEEE International Conference on Formal Methods and Models for Codesign, pages 218–227. IEEE, 2015.
  • [113] Konstantin Selyunin, Stefan Jaksic, Thang Nguyen, Christian Reidl, Udo Hafner, Ezio Bartocci, Dejan Nickovic, and Radu Grosu. Runtime monitoring with recovery of the SENT communication protocol. In Proc. of CAV 2017: the 29th International Conference on Computer Aided Verification, volume 10426 of LNCS, pages 336–355. Springer, 2017.
  • [114] Amir Pnueli. The temporal logic of programs. In Proc. of the 18th Annual Symposium on Foundations of Computer Science, pages 46–57. IEEE, 1977.
  • [115] Rajeev Alur and Thomas A. Henzinger. A really temporal logic. Journal of ACM, 41(1):181–204, 1994.
  • [116] Oded Maler and Dejan Nickovic. Monitoring Temporal Properties of Continuous Signals. In Proc. of FORMATS-FTRTFT 2004: the Joint International Conferences on Formal Modeling and Analysis of Timed Systmes and Formal Techniques in Real-Time and Fault-Tolerant Systems, volume 3253 of LNCS, pages 152–166. Springer-Verlag, 2004.
  • [117] Alexandre Donzé, Oded Maler, Ezio Bartocci, Dejan Ničković, Radu Grosu, and Scott A. Smolka. On temporal logic and signal processing. In Proc. of ATVA 2012: the 10th International Symposium on Automated Technology for Verification and Analysis, volume 7561 of LNCS, pages 92–106. Springer, 2012.
  • [118] Austin Jones, Zhaodan Kong, and Calin Belta. Anomaly detection in cyber-physical systems: A formal methods approach. In Decision and Control (CDC), 2014 IEEE 53rd Annual Conference on, pages 848–853. IEEE, 2014.
  • [119] Michael R. Clarkson and Fred B. Schneider. Hyperproperties. J. Comput. Secur., 18(6):1157–1210, 2010.
  • [120] Michael R. Clarkson, Bernd Finkbeiner, Masoud Koleini, Kristopher K. Micinski, Markus N. Rabe, and César Sánchez. Temporal logics for hyperproperties. In Proc. of POST 2014: the Third International Conference on Principles of Security and Trust, volume 8414 of LNCS, pages 265–284. Springer, 2014.
  • [121] Luan Viet Nguyen, James Kapinski, Xiaoqing Jin, Jyotirmoy V. Deshmukh, and Taylor T. Johnson. Hyperproperties of real-valued signals. In Proc. of MEMOCODE ’17: the 15th ACM-IEEE International Conference on Formal Methods and Models for System Design, MEMOCODE ’17, pages 104–113. ACM, 2017.
  • [122] Borzoo Bonakdarpour and Bernd Finkbeiner. Runtime verification for hyperltl. In Proc. of RV 2016: the 16th International Conference on Runtime Verification, volume 10012 of LNCS, pages 41–45. Springer, 2016.
  • [123] Georgios E. Fainekos and George J. Pappas. Robustness of temporal logic specifications for continuous-time signals. Theor. Comput. Sci., 410(42):4262–4291, 2009.
  • [124] Alexandre Donzé and Oded Maler. Robust satisfaction of temporal logic over real-valued signals. In Proc. of FORMATS’10: the 8th International Conference on Formal Modeling and Analysis of Timed Systems, volume 6246 of LNCS, pages 92–106. Springer, 2010.
  • [125] Yashwanth Annpureddy, Che Liu, Georgios E. Fainekos, and Sriram Sankaranarayanan. S-TaLiRo: A tool for temporal logic falsification for hybrid systems. In Proc. of TACAS 2011: the 17th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, volume 6605 of LNCS, pages 254–257, 2011.
  • [126] Alexandre Donzé. Breach, A toolbox for verification and parameter synthesis of hybrid systems. In Proc. of CAV 2010: the 22nd International Conference on Computer Aided Verification, volume 6174 of LNCS, pages 167–170. Springer, 2010.
  • [127] Adel Dokhanchi, Aditya Zutshi, Rahul T. Sriniva, Sriram Sankaranarayanan, and Georgios Fainekos. Requirements driven falsification with coverage metrics. In Proc. of EMSOFT: the 12th International Conference on Embedded Software, pages 31–40. IEEE, 2015.
  • [128] Alexandre Donzé, Bruce Krogh, and Akshay Rajhans. Parameter synthesis for hybrid systems with an application to simulink models. In Proc. of HSCC 2009: the 12th International Conference on Hybrid Systems: Computation and Control, volume 5469 of LNCS, pages 165–179. Springer, 2009.
  • [129] Ezio Bartocci, Luca Bortolussi, Laura Nenzi, and Guido Sanguinetti. System design of stochastic models using robustness of temporal properties. Theor. Comput. Sci., 587:3–25, 2015.
  • [130] O. E. David and N. S. Netanyahu.

    Deepsign: Deep learning for automatic malware signature generation and classification.

    In 2015 International Joint Conference on Neural Networks (IJCNN), pages 1–8, July 2015.
  • [131] Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, 1945.
  • [132] Pankaj Malhotra, Lovekesh Vig, Gautam Shroff, and Puneet Agarwal. Long Short Term Memory Networks for Anomaly Detection in Time Series. In European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), pages 89–94, April 2015.
  • [133] Wenke Lee and Dong Xiang. Information-theoretic measures for anomaly detection. In Security and Privacy, 2001. S&P 2001. Proceedings. 2001 IEEE Symposium on, pages 130–143. IEEE, 2001.
  • [134] Paul F Wilson. Root cause analysis: A tool for total quality management. ASQ Quality Press, 1993.
  • [135] Ezio Bartocci, Thomas Ferrère, Niveditha Manjunath, and Dejan Nickovic. Localizing faults in simulink/stateflow models with STL. In Proc. of HSCC 2018: the 21st International Conference on Hybrid Systems: Computation and Control, pages 197–206, 2018.
  • [136] Bing Liu, Lucia, Shiva Nejati, Lionel C. Briand, and Thomas Bruckmann. Localizing multiple faults in simulink models. In International Conference on Software Analysis, Evolution, and Reengineering, pages 146–156. IEEE Computer Society, 2016.
  • [137] Bing Liu, Lucia, Shiva Nejati, Lionel C. Briand, and Thomas Bruckmann. Simulink fault localization: an iterative statistical debugging approach. Softw. Test., Verif. Reliab., 26(6):431–459, 2016.
  • [138] Bing Liu, Lucia, Shiva Nejati, and Lionel C. Briand. Improving fault localization for simulink models using search-based testing and prediction models. In International Conference on Software Analysis, Evolution and Reengineering, pages 359–370. IEEE Computer Society, 2017.
  • [139] R. Abreu, P. Zoeteweij, and A. J. C. van Gemund. On the accuracy of spectrum-based fault localization. In Testing: Academic and Industrial Conference Practice and Research Techniques, pages 89–98. IEEE, 2007.
  • [140] Jyotirmoy V. Deshmukh, Xiaoqing Jin, Rupak Majumdar, and Vinayak S. Prabhu. Parameter optimization in control software using statistical fault localization techniques. In Proc. of ICCPS 2018: the 9th ACM/IEEE International Conference on Cyber-Physical Systems, pages 220–231. IEEE / ACM, 2018.
  • [141] W. Eric Wong, Ruizhi Gao, Yihao Li, Rui Abreu, and Franz Wotawa. A survey on software fault localization. IEEE Trans. Software Eng., 42(8):707–740, 2016.
  • [142] Thomas Ferrère, Oded Maler, and Dejan Nickovic. Trace diagnostics using temporal implicants. In International Symposium on Automated Technology for Verification and Analysis, volume 9364 of LNCS, pages 241–258. Springer, 2015.
  • [143] R Timothy Marler and Jasbir S Arora. Survey of multi-objective optimization methods for engineering. Structural and multidisciplinary optimization, 26(6):369–395, 2004.
  • [144] R. Mehra. On the identification of variances and adaptive kalman filtering. IEEE Transactions on Automatic Control, 15(2):175–184, April 1970.
  • [145] Fred B. Schneider. Enforceable security policies. ACM Trans. Inf. Syst. Secur., 3(1):30–50, 2000.
  • [146] Jay Ligatti, Lujo Bauer, and David Walker. Edit automata: enforcement mechanisms for run-time security policies. Int. J. Inf. Sec., 4(1-2):2–16, 2005.
  • [147] Yliès Falcone, Laurent Mounier, Jean-Claude Fernandez, and Jean-Luc Richier. Runtime enforcement monitors: composition, synthesis, and enforcement abilities. Formal Methods in System Design, 38(3):223–262, 2011.
  • [148] Nataliia Bielova and Fabio Massacci. Iterative enforcement by suppression: Towards practical enforcement theories. Journal of Computer Security, 20(1):51–79, 2012.
  • [149] Egor Dolzhenko, Jay Ligatti, and Srikar Reddy. Modeling runtime enforcement with mandatory results automata. Int. J. Inf. Sec., 14(1):47–60, 2015.
  • [150] Yliès Falcone, Thierry Jéron, Hervé Marchand, and Srinivas Pinisetty. Runtime enforcement of regular timed properties by suppressing and delaying events. Systems & Control Letters, 123:2–41, 2016.
  • [151] Srinivas Pinisetty, Yliès Falcone, Thierry Jéron, Hervé Marchand, Antoine Rollet, and Omer Nguena-Timo. Runtime enforcement of timed properties revisited. Formal Methods in System Design, 45(3):381–422, 2014.
  • [152] Yliès Falcone and Hervé Marchand. Enforcement and validation (at runtime) of various notions of opacity. Discrete Event Dynamic Systems, 25(4):531–570, 2015.
  • [153] Matthieu Renard, Antoine Rollet, and Yliès Falcone. Runtime enforcement using büchi games. In Proceedings of the 24th ACM SIGSOFT International SPIN Symposium on Model Checking of Software, pages 70–79. ACM, 2017.
  • [154] E. Meneses, X. Ni, G. Zheng, C. L. Mendes, and L. V. Kalé. Using Migratable Objects to Enhance Fault Tolerance Schemes in Supercomputers. IEEE Transactions on Parallel and Distributed Systems, 26(7):2061–2074, July 2015.
  • [155] J. C. Sancho, F. Petrini, K. Davis, R. Gioiosa, and S. Jiang. Current practice and a direction forward in checkpoint/restart implementations for fault tolerance. In 19th IEEE International Parallel and Distributed Processing Symposium, April 2005.
  • [156] S. Ghosh, R. Melhem, and D. Mosse. Fault-tolerance through scheduling of aperiodic tasks in hard real-time multiprocessor systems. IEEE Transactions on Parallel and Distributed Systems, 8(3):272–284, Mar 1997.
  • [157] Tuo Li, Muhammad Shafique, Jude Angelo Ambrose, Semeen Rehman, Jörg Henkel, and Sri Parameswaran. Raster: Runtime adaptive spatial/temporal error resiliency for embedded processors. In Design Automation Conference (DAC), 2013 50th ACM/EDAC/IEEE, pages 1–7. IEEE, 2013.
  • [158] Tuo Li, Muhammad Shafique, Semeen Rehman, Jude Angelo Ambrose, Jörg Henkel, and Sri Parameswaran. Dhaser: dynamic heterogeneous adaptation for soft-error resiliency in asip-based multi-core systems. In Computer-Aided Design (ICCAD), 2013 IEEE/ACM International Conference on, pages 646–653. IEEE, 2013.
  • [159] Tuo Li, Muhammad Shafique, Semeen Rehman, Swarnalatha Radhakrishnan, Roshan Ragel, Jude Angelo Ambrose, Jörg Henkel, and Sri Parameswaran. Cser: Hw/sw configurable soft-error resiliency for application specific instruction-set processors. In Proceedings of the Conference on Design, Automation and Test in Europe, pages 707–712. EDA Consortium, 2013.
  • [160] Tuo Li, Muhammad Shafique, Jude Angelo Ambrose, Jörg Henkel, and Sri Parameswaran. Fine-grained checkpoint recovery for application-specific instruction-set processors. IEEE Transactions on Computers, 66(4):647–660, 2017.
  • [161] D. Seto, B. Krogh, L. Sha, and A. Chutinan. The simplex architecture for safe online control system upgrades. In Proc. of ACC 1998: the American Control Conference, volume 6, pages 3504–3508 vol.6, 1998.
  • [162] Muhammad Shafique, Siddharth Garg, Tulika Mitra, Sri Parameswaran, and Jörg Henkel. Dark silicon as a challenge for hardware/software co-design: Invited special session paper. In Proceedings of the 2014 International Conference on Hardware/Software Codesign and System Synthesis, page 13. ACM, 2014.
  • [163] Jorg Henkel, Lars Bauer, Hongyan Zhang, Semeen Rehman, and Muhammad Shafique. Multi-layer dependability: From microarchitecture to application level. In Design Automation Conference (DAC), 2014 51st ACM/EDAC/IEEE, pages 1–6. IEEE, 2014.
  • [164] Jörg Henkel, Lars Bauer, Nikil Dutt, Puneet Gupta, Sani Nassif, Muhammad Shafique, Mehdi Tahoori, and Norbert Wehn. Reliable on-chip systems in the nano-era: Lessons learnt and future trends. In Proceedings of the 50th Annual Design Automation Conference, page 99. ACM, 2013.
  • [165] Semeen Rehman, Florian Kriebel, Duo Sun, Muhammad Shafique, and Jörg Henkel. dtune: Leveraging reliable code generation for adaptive dependability tuning under process variation and aging-induced effects. In Proceedings of the 51st Annual Design Automation Conference, pages 1–6. ACM, 2014.
  • [166] Semeen Rehman, Muhammad Shafique, Florian Kriebel, and Jörg Henkel. Reliable software for unreliable hardware: embedded code generation aiming at reliability. In Proceedings of the seventh IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis, pages 237–246. ACM, 2011.
  • [167] Semeen Rehman, Florian Kriebel, Muhammad Shafique, and Joerg Henkel. Reliability-driven software transformations for unreliable hardware. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 33(11):1597–1610, 2014.
  • [168] Florian Kriebel, Semeen Rehman, Duo Sun, Muhammad Shafique, and Jörg Henkel. Aser: Adaptive soft error resilience for reliability-heterogeneous processors in the dark silicon era. In Proceedings of the 51st annual design automation conference, pages 1–6. ACM, 2014.
  • [169] Muhammad Shafique, Semeen Rehman, Pau Vilimelis Aceituno, and Jörg Henkel. Exploiting program-level masking and error propagation for constrained reliability optimization. In Proceedings of the 50th Annual Design Automation Conference, page 17. ACM, 2013.
  • [170] Dennis Gnad, Muhammad Shafique, Florian Kriebel, Semeen Rehman, Duo Sun, and Jörg Henkel. Hayat: Harnessing dark silicon and variability for aging deceleration and balancing. In Design Automation Conference (DAC), 2015 52nd ACM/EDAC/IEEE, pages 1–6. IEEE, 2015.
  • [171] Oliver Höftberger. Knowledge-based Dynamic Reconfiguration for Embedded Real-Rime Systems. PhD thesis, Technische Universität Wien, 2015.
  • [172] D. Ratasich, T. Preindl, K. Selyunin, and R. Grosu. Self-Healing by Property-Guided Structural Adaptation. In 2018 IEEE 1st International Conference on Industrial Cyber-Physical Systems (ICPS), pages 199–205, May 2018.
  • [173] Satoshi Nakamoto. Bitcoin: A peer-to-peer electronic cash system. 2008.
  • [174] Andrew Miller and Joseph J LaViola Jr. Anonymous byzantine consensus from moderately-hard puzzles: A model for bitcoin. Available on line: http://nakamotoinstitute. org/research/anonymous-byzantine-consensus, 2014.
  • [175] Ali Dorri, Salil S Kanhere, Raja Jurdak, and Praveen Gauravaram. Blockchain for iot security and privacy: The case study of a smart home. In Pervasive Computing and Communications Workshops (PerCom Workshops), 2017 IEEE International Conference on, pages 618–623. IEEE, 2017.
  • [176] Olivier Alphand, Michele Amoretti, Timothy Claeys, Simone Dall’Asta, Andrzej Duda, Gianluigi Ferrari, Franck Rousseau, Bernard Tourancheau, Luca Veltri, and Francesco Zanichelli. Iotchain: A blockchain security architecture for the internet of things. In Wireless Communications and Networking Conference (WCNC), 2018 IEEE, pages 1–6. IEEE, 2018.
  • [177] Cristina Alcaraz and Javier Lopez. Safeguarding Structural Controllability in Cyber-Physical Control Systems. In Ioannis Askoxylakis, Sotiris Ioannidis, Sokratis Katsikas, and Catherine Meadows, editors, Computer Security – ESORICS 2016, pages 471–489, Cham, 2016. Springer International Publishing.
  • [178] W. Khamphanchai, S. Pisanupoj, W. Ongsakul, and M. Pipattanasomporn. A multi-agent based power system restoration approach in distributed smart grid. In 2011 International Conference Utility Exhibition on Power and Energy Systems: Issues and Prospects for Asia (ICUE), pages 1–7, Sept 2011.
  • [179] Y. Yan, B. Zhang, and J. Guo. An Adaptive Decision Making Approach Based on Reinforcement Learning for Self-Managed Cloud Applications. In 2016 IEEE International Conference on Web Services (ICWS), pages 720–723, June 2016.
  • [180] Nicola Dragoni, Saverio Giallorenzo, Alberto Lluch Lafuente, Manuel Mazzara, Fabrizio Montesi, Ruslan Mustafin, and Larisa Safina. Microservices: yesterday, today, and tomorrow. In Present and Ulterior Software Engineering, pages 195–216. Springer, 2017.
  • [181] Brandon Bohrer, Yong Kiam Tan, Stefan Mitsch, Magnus O Myreen, and André Platzer. Veriphy: verified controller executables from verified cyber-physical system models. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 617–630. ACM, 2018.
  • [182] Mujahid Mohsin, Zahid Anwar, Ghaith Husari, Ehab Al-Shaer, and Mohammad Ashiqur Rahman. Iotsat: A formal framework for security analysis of the internet of things (iot). In Communications and Network Security (CNS), 2016 IEEE Conference on, pages 180–188. IEEE, 2016.
  • [183] Mujahid Mohsin, Muhammad Usama Sardar, Osman Hasan, and Zahid Anwar. Iotriskanalyzer: A probabilistic model checking based framework for formal risk analytics of the internet of things. IEEE Access, 5:5494–5505, 2017.
  • [184] Eun-Young Kang, Dongrui Mu, Li Huang, and Qianqing Lan. Model-based verification and validation of an autonomous vehicle system. arXiv preprint arXiv:1803.06103, 2018.
  • [185] Arslan Munir and Farinaz Koushanfar. Design and analysis of secure and dependable automotive cps: A steer-by-wire case study. IEEE Transactions on Dependable and Secure Computing, 2018.
  • [186] Seong-Taek Park, Guozhong Li, and Jae-Chang Hong. A study on smart factory-based ambient intelligence context-aware intrusion detection system using machine learning. Journal of Ambient Intelligence and Humanized Computing, pages 1–8, 2018.
  • [187] Travers Ching, Daniel S Himmelstein, Brett K Beaulieu-Jones, Alexandr A Kalinin, Brian T Do, Gregory P Way, Enrico Ferrero, Paul-Michael Agapow, Michael Zietz, Michael M Hoffman, et al. Opportunities and obstacles for deep learning in biology and medicine. Journal of The Royal Society Interface, 15(141):20170387, 2018.
  • [188] Ramin M Hasani, Mathias Lechner, Alexander Amini, Daniela Rus, and Radu Grosu. Re-purposing compact neuronal circuit policies to govern reinforcement learning tasks. arXiv preprint arXiv:1809.04423, 2018.
  • [189] Muhammad Abdullah Hanif, Faiq Khalid, Rachmad Vidya Wicaksana Putra, Semeen Rehman, and Muhammad Shafique. Robust machine learning systems: Reliability and security for deep neural networks. In 2018 IEEE 24th International Symposium on On-Line Testing And Robust System Design (IOLTS), pages 257–260. IEEE, 2018.
  • [190] Lucjan Hanzlik, Yang Zhang, Kathrin Grosse, Ahmed Salem, Max Augustin, Michael Backes, and Mario Fritz. Mlcapsule: Guarded offline deployment of machine learning as a service. arXiv preprint arXiv:1808.00590, 2018.
  • [191] Nick Hynes, Raymond Cheng, and Dawn Song. Efficient deep learning on multi-source private data. arXiv preprint arXiv:1807.06689, 2018.
  • [192] M. Sadegh Riazi, Bita Darvish Rouhani, and Farinaz Koushanfar. Deep learning on private data. in IEEE Security and Privacy Magazine, 2018.
  • [193] Huili Chen, Bita Darvish Rouhani, and Farinaz Koushanfar. Deepmarks: A digital fingerprinting framework for deep neural networks. arXiv preprint arXiv:1804.03648, 2018.
  • [194] Bita Darvish Rouhani, Huili Chen, and Farinaz Koushanfar. Deepsigns: A generic watermarking framework for ip protection of deep learning models. arXiv preprint arXiv:1804.00750, 2018.
  • [195] Valentina Zantedeschi, Maria-Irina Nicolae, and Ambrish Rawat. Efficient defenses against adversarial attacks. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 39–49. ACM, 2017.
  • [196] Adnan Siraj Rakin, Jinfeng Yi, Boqing Gong, and Deliang Fan. Defend deep neural networks against adversarial examples via fixed and dynamic quantized activation functions. arXiv preprint arXiv:1807.06714, 2018.
  • [197] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. arXiv preprint arXiv:1511.04508, 2015.
  • [198] Weiming Xiang, Patrick Musau, Ayana A Wild, Diego Manzanas Lopez, Nathaniel Hamilton, Xiaodong Yang, Joel Rosenfeld, and Taylor T Johnson. Verification for machine learning, autonomy, and neural networks survey. arXiv preprint arXiv:1810.01989, 2018.
  • [199] Weiming Xiang and Taylor T Johnson. Reachability analysis and safety verification for neural network control systems. arXiv preprint arXiv:1805.09944, 2018.
  • [200] D. Ratasich, O. Höftberger, H. Isakovic, M. Shafique, and R. Grosu. A Self-Healing Framework for Building Resilient Cyber-Physical Systems. In 2017 IEEE 20th International Symposium on Real-Time Distributed Computing (ISORC), pages 133–140, May 2017.
  • [201] Hermann Kopetz. A Conceptual Model for the Information Transfer in Systems-of-Systems. In 2014 IEEE 17th International Symposium on Object/Component/Service-Oriented Real-Time Distributed Computing, pages 17–24, June 2014.
  • [202] Aditya Zutshi, Sriram Sankaranarayanan, Jyotirmoy V. Deshmukh, James Kapinski, and Xiaoqing Jin. Falsification of safety properties for closed loop control systems. In International Conference on Hybrid Systems: Computation and Control, pages 299–300. ACM, 2015.
  • [203] Robert Reicherdt and Sabine Glesner. Slicing MATLAB Simulink models. In International Conference on Software Engineering, pages 551–561. IEEE Computer Society, 2012.