# Minimum Violation Control Synthesis on Cyber-Physical Systems under Attacks

Cyber-physical systems are conducting increasingly complex tasks, which are often modeled using formal languages such as temporal logic. The system's ability to perform the required tasks can be curtailed by malicious adversaries that mount intelligent attacks. At present, however, synthesis in the presence of such attacks has received limited research attention. In particular, the problem of synthesizing a controller when the required specifications cannot be satisfied completely due to adversarial attacks has not been studied. In this paper, we focus on the minimum violation control synthesis problem under linear temporal logic constraints of a stochastic finite state discrete-time system with the presence of an adversary. A minimum violation control strategy is one that satisfies the most important tasks defined by the user while violating the less important ones. We model the interaction between the controller and adversary using a concurrent Stackelberg game and present a nonlinear programming problem to formulate and solve for the optimal control policy. To reduce the computation effort, we develop a heuristic algorithm that solves the problem efficiently and demonstrate our proposed approach using a numerical case study.


## I Introduction

Cyber-physical systems have been identified to play important roles in multiple application domains such as health care systems, cloud computing, and smart homes. To model the increasingly complex tasks and corresponding desired system behaviors consistently, rigorously and compactly, temporal logics such as linear temporal logic (LTL) and computation tree logic (CTL) are adopted in recent literature. Typical system properties that can be modeled using LTL, whose syntax and semantics have been well developed, include liveness (e.g., ‘always eventually A’), reactivity (e.g., ‘if A, then B’), safety (e.g., ‘always not A’) and so on.

Formal methods provide a class of theory and methods for controller design to satisfy given specifications modeled using temporal logics. Such control synthesis problems have been investigated in different applications such as robotic motion planning [1, 2] and optimal control [3, 4]. However, these works normally explicitly or implicitly assume the existence of the controller, which is not always the case.

In [5], the reasons a controller cannot be synthesized are characterized as either unsatisfiability or unrealizability. Unsatisfiability is caused by the incompatibility of the specifications given to the system, while unrealizability is caused by uncertainties and stochastic errors. Beyond uncertainties and stochastic errors, malicious attacks can also make controllers unsynthesizable in CPS. Malicious attacks raise concerns about CPS security since they can lead to misbehaviors and failures. For instance, a power outage caused by attackers on a power system [6], the false data injection (FDI) based attack CarShark on automobiles [7], and the widely known Stuxnet attack on industrial control systems (ICS) all caused significant economic losses and/or safety risks.

The approaches proposed for analyzing uncertainties and stochastic errors are not applicable to malicious attacks on CPS. Uncertainties and stochastic errors are often viewed as independent and identically distributed random variables, which is not the case for malicious and strategic attacks. In the worst case, stochastic elements such as environment behavior are interpreted as malicious attacks on the system, and a zero-sum game provides a good model for such worst-case analysis [8]. Meanwhile, failures returned by a control synthesis framework could also be caused by malicious and strategic attacks such as jamming and Denial-of-Service (DoS) attacks, which are subject to a different information pattern compared to a zero-sum game. In the security domain, the Stackelberg game is a more reasonable model [9, 10]: one player (the leader) commits to its strategy first, and the other player (the follower) observes the leader's strategy and then plays its best response. A Stackelberg game can capture the information asymmetry and model the value of information.

In this paper, we consider a stochastic discrete-time system in the presence of an adversary, which is abstracted as a stochastic game (SG). The system is given a set of specifications modeled in syntactically co-safe LTL (scLTL). We focus on the scenario where no controller can be synthesized to satisfy all specifications simultaneously, due to either incompatibility between specifications or the presence of the adversary. We therefore study the minimum violation control synthesis problem, i.e., computing a control strategy that violates the less important specifications and satisfies the most important ones based on the user's preference [11]. To the best of our knowledge, this is the first attempt to analyze minimum violation control synthesis on a stochastic system in the presence of an adversary. To summarize, we make the following contributions.

- We formulate a stochastic game to model the interaction between the controller and the adversary. We give examples of typical attacks in CPS that can be incorporated into our proposed framework. To model the limited observation capability of a human adversary, anchoring bias is considered.
- We present the completion procedure to augment the automaton associated with each specification given to the system. We calculate the product SG using the completed automata and the SG. We formulate a nonlinear programming problem on the product SG to calculate the optimal control policy.
- We propose a heuristic algorithm to compute an approximate solution. The proposed algorithm significantly reduces computation and memory cost, and its convergence is proved.
- We use a numerical case study to demonstrate the proposed approach. With the proposed approach, more specifications can be satisfied when the presence of the adversary is taken into account. Finally, we show the relationship between the controller's expected utility and the anchoring bias parameter of the adversary.

The remainder of this paper is organized as follows. Related work is presented in Section II and preliminary backgrounds are presented in Section III. Section IV presents problem formulation. We give solution method in Section V. A numerical case study is given in Section VI. We conclude this paper in Section VII.

## II Related Work

Control synthesis under temporal logic constraints normally assumes the specifications can be satisfied. Contributions on the cases when the specifications cannot be fulfilled can be classified into four categories. First, the minimum violation problem for deterministic systems has been studied in [11, 12, 13, 14]. Violations caused by conflicts between specifications have been studied in [11, 12, 13], where a control strategy that satisfies the most important specifications is synthesized. In [14], a two-player concurrent Stackelberg differential game is formulated. Quantitative preference over satisfaction of scLTL is investigated in [15]. However, contributions [11, 12, 13, 14, 15] focus on deterministic systems and hence the proposed approaches are not applicable to stochastic systems. Second, unsynthesizable specifications are analyzed in [5]. Third, the model repair problem is investigated so that satisfaction of the specifications is guaranteed [16, 17]. Finally, the specification revision problem is investigated in [18], and planning revision under temporal logic specifications is investigated in [19]. However, none of the aforementioned papers consider the presence of an adversary. Furthermore, non-deterministic automata are used in the aforementioned papers while deterministic automata are used in this paper.

Secure control in adversarial environments has been investigated using both control-theoretic approaches [20] and game-theoretic methods [10, 21]. Where game theory meets temporal logic, turn-based two-player SGs have been used to construct a model checker [22] and model checking frameworks [23, 24]. In contrast, a general-sum concurrent SG is considered in this paper. Secure control under LTL specifications modeling liveness and safety constraints is considered in [25]. The approach proposed in [25] focuses on liveness and safety constraints, while this paper considers specifications modeled using scLTL.

## III Preliminaries

In this section, we present backgrounds on linear temporal logic and stochastic games.

### III-A Linear Temporal Logic (LTL)

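An LTL formula consists of a set of atomic propositions $\Pi$, Boolean operators including negation ($\neg$), conjunction ($\wedge$) and disjunction ($\vee$), and temporal operators including next ($X$) and until ($U$) [26]. An LTL formula is defined inductively as

$$\phi = \mathrm{True} \mid \pi \mid \neg\phi \mid \phi_1 \wedge \phi_2 \mid X\phi \mid \phi_1 \, U \, \phi_2.$$

Other operators such as implication ($\Rightarrow$), eventually ($\Diamond$) and always ($\Box$) can be defined using the operators above. In particular, $\phi_1 \Rightarrow \phi_2$ is equivalent to $\neg\phi_1 \vee \phi_2$, $\Diamond\phi$ is equivalent to $\mathrm{True} \, U \, \phi$, and $\Box\phi$ is equivalent to $\neg\Diamond\neg\phi$.

The semantics of LTL formulas are defined over infinite words in $(2^{\Pi})^{\omega}$. Informally speaking, $\Box\phi$ is true if and only if $\phi$ is true at the current time step and at all future times. $\Diamond\phi$ is true if and only if $\phi$ is true at some future time. $X\phi$ is true if and only if $\phi$ is true at the next time step. A word $\eta$ satisfying an LTL formula $\phi$ is denoted as $\eta \models \phi$.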

In this paper, we focus on syntactically co-safe LTL (scLTL) formulas.

###### Definition 1.

(scLTL [27]): Any string that satisfies a scLTL formula consists of a finite string (a good prefix) followed by any infinite continuation. This continuation does not affect the formula’s truth value.

By Definition 1, a word satisfies an scLTL formula if it contains a good prefix such that for any suffix .

For each scLTL formula, a deterministic finite automaton (DFA) can be obtained. A DFA is defined as follows.

###### Definition 2.

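(Deterministic finite automaton): A DFA is a tuple $\mathcal{A} = (Q, q_0, \Sigma, \delta, F)$, where $Q$ is a finite set of states, $q_0 \in Q$ is the initial state, $\Sigma$ is the alphabet, $\delta : Q \times \Sigma \to Q$ is the set of transitions and $F \subseteq Q$ is the set of accepting states.

A run of a DFA over a finite input word $\sigma_0 \sigma_1 \cdots \sigma_{k-1}$ is a sequence of states $q_0 q_1 \cdots q_k$ such that $q_{i+1} = \delta(q_i, \sigma_i)$ for all $0 \leq i < k$. A run is accepting if $q_k \in F$. The satisfaction of a formula $\phi$ by a run $\rho$ is denoted as $\rho \models \phi$. To enable violations of specifications, we assume any DFA is complete, i.e., for any $q \in Q$ and $\sigma \in \Sigma$, $\delta(q, \sigma)$ is defined. Completion can be achieved by adding an additional (sink) state and letting $\delta(q, \sigma)$ point to it whenever $\delta(q, \sigma)$ is undefined.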
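The completion procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary encoding of the transition function, the sink name `q_sink`, and the toy two-state automaton are all hypothetical and do not come from the paper. The sketch always adds the sink, which simply stays unreachable if the DFA was already complete.

```python
def complete_dfa(states, alphabet, delta, sink="q_sink"):
    """Return (states, delta) of a complete DFA.

    delta maps (state, symbol) -> next state. Every missing pair is
    redirected to a fresh non-accepting sink that loops on all symbols.
    """
    if sink in states:
        raise ValueError("sink name collides with an existing state")
    new_states = set(states) | {sink}
    new_delta = dict(delta)
    for q in new_states:
        for a in alphabet:
            # Keep defined transitions, send undefined ones to the sink.
            new_delta.setdefault((q, a), sink)
    return new_states, new_delta


def accepts(delta, q0, accepting, word):
    """Run a complete DFA on a finite word and test acceptance."""
    q = q0
    for a in word:
        q = delta[(q, a)]
    return q in accepting


# Hypothetical toy DFA: the first symbol must be 'a'; q1 is accepting.
states = {"q0", "q1"}
alphabet = {"a", "b"}
delta = {("q0", "a"): "q1", ("q1", "a"): "q1", ("q1", "b"): "q1"}
states_c, delta_c = complete_dfa(states, alphabet, delta)
```

After completion, a word that starts with `b` is no longer stuck on an undefined transition; it runs into the sink and is rejected, which is what lets a violated specification be tracked explicitly.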

### III-B Stochastic Game (SG)

A Stochastic Game (SG) is defined as follows.

###### Definition 3.

(Stochastic Game): A stochastic game is a tuple $\mathcal{SG} = (S, U_C, U_A, Pr, \Pi, L)$, where $S$ is a finite set of states, $U_C$ is a set of actions of the controller, $U_A$ is a set of actions of an adversary, and $Pr : S \times U_C \times U_A \times S \to [0, 1]$ is a transition function where $Pr(s, u_C, u_A, s')$ is the probability of a transition from state $s$ to state $s'$ when the controller takes action $u_C$ and the adversary takes action $u_A$. $\Pi$ is a set of atomic propositions. $L : S \to 2^{\Pi}$ is a labeling function mapping each state to a subset of propositions in $\Pi$.

Denote the admissible actions, i.e., the set of actions available to the controller (resp. adversary) at each state $s$, as $U_C(s)$ (resp. $U_A(s)$). A finite (resp. infinite) path on the SG is a finite (resp. infinite) sequence of states. A control policy $\mu$ (resp. adversary policy $\lambda$) is a function specifying the probability distribution over control (resp. attack) actions given the historical trajectory, i.e., a finite path. An admissible policy is a policy whose support is the set of admissible actions at each state. In particular, we consider memoryless control (resp. adversary) policies in this paper, i.e., $\mu$ (resp. $\lambda$) depends only on the current state.

Stackelberg SG is a widely adopted model in security domain [9]. In the Stackelberg setting, one player is the leader and another player is the follower. The leader first commits to a strategy . The follower then observes the strategy and play its best response . Given any control policy , the best response from the adversary is represented as , where is the adversary’s utility given a pair of leader-follower strategies. The Stackelberg equilibrium is defined as follows.

###### Definition 4.

(Stackelberg Equilibrium (SE)): Denote the utility that the leader (resp. follower) gains in a stochastic game under leader follower strategy pair as (resp. ). A pair of leader follower strategy is an SE if leader’s strategy is optimal given that the follower observes its strategy and plays its best response, i.e., , where is the set of all admissible policies of the controller and denotes the best response to the leader’s strategy from the follower.

In Stackelberg games with human adversaries, anchoring bias is used to model the confidence of the adversary in its observations of $\mu$ [28]. When anchoring bias is considered, the response $\lambda$ might not be the best response to the control policy $\mu$. Human adversaries normally assign uniform probability to the control actions at each state [28]. When more information is obtained via observation, adversaries slowly update the distributions. In this paper a linear model is adopted to represent the estimated probability. In this model, the human adversary's estimate of the probability that the controller takes action $u_C$ at state $s$ is calculated as

$$\tilde{\mu}(s, u_C) = \alpha \frac{1}{|U_C(s)|} + (1 - \alpha)\,\mu(s, u_C), \quad \forall s, u_C, \tag{1}$$

where $\alpha \in [0, 1]$ is a parameter that tunes the balance between the adversary's original (uniform) estimate and the true probability. When $\alpha = 0$, the estimated probability becomes the true probability and thus the adversary plays its best response. When $\alpha = 1$, the estimated probability becomes the uniform distribution, implying that the adversary has no capability to observe or infer the control policy based on its observations.
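As a small illustration of the linear anchoring-bias model in (1), the following Python sketch computes the adversary's estimated policy from the true one. The nested-dictionary policy encoding and the example state and action names are assumptions made for this sketch; it also assumes each state's dictionary lists exactly the admissible actions at that state.

```python
def perceived_policy(mu, alpha):
    """Apply the linear anchoring-bias model of Eq. (1).

    mu maps state -> {control action: probability}. The keys of each
    inner dict are taken to be the admissible actions at that state
    (an assumption of this sketch).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    mu_tilde = {}
    for s, dist in mu.items():
        n = len(dist)  # number of admissible control actions at s
        mu_tilde[s] = {u: alpha / n + (1.0 - alpha) * p for u, p in dist.items()}
    return mu_tilde


# Hypothetical two-action policy at a single state.
mu = {"s0": {"left": 0.9, "right": 0.1}}
half_biased = perceived_policy(mu, 0.5)
unbiased = perceived_policy(mu, 0.0)
fully_biased = perceived_policy(mu, 1.0)
```

With the bias parameter at 0.5 the estimate lies halfway between the true distribution and the uniform one, and the two extreme values reproduce the limiting cases discussed in the text.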

## IV Problem Formulation

In this section, we first present the problem formulation. Then we show that several typical CPS security problems can be analyzed using the proposed framework. We consider a finite-state discrete-time system in the presence of an adversary, which can be abstracted using a SG as defined in Definition 3.

We adopt the concurrent Stackelberg setting. In particular, the controller acts as the leader and the adversary is the follower. The controller first commits to its strategy (or control policy) $\mu$. Then the adversary, who observes the historical behavior of the controller, plays its response $\lambda$ to the control policy $\mu$. We assume that both the controller and the adversary can observe the current state $s$. At each system state $s$, the controller and the adversary take actions simultaneously and the system evolves to state $s'$ following the transition function $Pr$ defined in Definition 3.

The system is assigned a set of specifications $\{\phi_1, \cdots, \phi_n\}$ modeled using scLTL [12, 13]. By satisfying each specification $\phi_i$, the controller gains a reward $r(\phi_i)$. The objective of the controller is to maximize the total reward obtained by satisfying specifications. In the worst case, the adversary attempts to deviate the system behavior and drive the system to violate the specifications so as to minimize the total reward obtained by the system. Hence, the specifications may not be satisfiable simultaneously, due to either incompatibility of specifications or the presence of the adversary. Thus we investigate the minimum violation problem on such a system as follows.

###### Problem 1.

Given an SG abstracted from the system in the presence of an adversary and a set of specifications that potentially cannot be satisfied by system simultaneously, with each associated with a reward function , compute a control policy such that and the best response from adversary constitutes SE defined in Definition 4.

In the following, we show several problems in security domain can be formulated using our proposed framework.

#### IV-1 Patrolling Security Game (PSG) with a single type of adversary [29]

The states $S$ are set as the locations in the PSG. The actions $U_C$ and $U_A$ are the actions available to the patrol unit and the adversary, respectively. In particular, $U_C$ includes the actions that move the patrol unit among the locations, while $U_A$ contains the intrusion actions modeling which location is targeted by the adversary. The transition probability $Pr$ captures the transition uncertainty. The actions taken by both players jointly determine their utilities. For instance, the adversary wins if the target region is under attack without protection and the patrol unit wins otherwise.

The interaction between the patrol unit and adversary is modeled as a Stackelberg game. The security force is the leader while the adversary is the follower. The adversary can observe the schedule of security force (by waiting outside the environment indefinitely) and play its best response.

By using our proposed framework, task dependent rewards can be defined and thus more complex behaviors of the patrolling unit can be considered. For example, the patrolling unit can be given the following tasks: visit areas in sequence (e.g., ‘First region A then region B then region C’: ) and reactivity (e.g., ‘if some passenger enters prohibited region, stop them’: ).

#### IV-2 Jamming Attacks on CPS

Applications such as SCADA networks and remotely controlled UAVs can be modeled as CPS where the controller communicates with the plant via a wireless network corrupted by a strategic jamming attacker.

Let the state of the plant evolves following a finite state discrete-time dynamics , where is the system state, is the system input jointly determined by the control signal and adversary signal for all and is stochastic disturbance. Function can be formulated as: (i) [30], or (ii) [31]. The formulation of (i) models scenario where the adversary can cause collision at the receiver equipped on the plant and result in denial-of service (DoS) attack. The formulation in (ii) models the scenario where the adversary can flip several bits in the packet and result in false information at the plant. Note that when the adversary launches DoS attack, the actuator can generate no input for the plant as when [30], or when .

Consider the example of an autonomous UAV. The reachability specification can be given to the UAV as ‘eventually reach target region and avoid obstacles’, i.e, .

## V Proposed Solution for Problem 1

In this section, we first present a mixed integer non-linear programming (MINLP) formulation. Then we propose a heuristic solution to compute a proper stationary control policy, which will be defined later.

For each specification , a complete DFA can be constructed. Given the set of complete automata with each associated with , we can construct a product automaton using the following definition [26].

###### Definition 5.

(Product automaton): A product automaton obtained from is a tuple , where is a finite set of states, is the initial state, is the alphabet inherited from , if for all and is the set of accepting states.
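The synchronous transition rule of Definition 5 (every component automaton reads the same symbol and moves at once) can be sketched as follows. The tuple encoding of product states, the dictionary transition tables, and the two toy DFAs are hypothetical and serve only as an illustration; each component is assumed to be complete.

```python
def product_step(q, symbol, deltas):
    """Advance all component DFAs on the same input symbol.

    q is a tuple (q_1, ..., q_n) of component states and deltas is the
    matching list of complete transition tables (state, symbol) -> state.
    """
    return tuple(delta[(qi, symbol)] for qi, delta in zip(q, deltas))


def product_run(q0, word, deltas):
    """Return the product state reached after reading a finite word."""
    q = q0
    for symbol in word:
        q = product_step(q, symbol, deltas)
    return q


# Two hypothetical complete DFAs over the alphabet {"x", "y"}:
# the first waits for "x", the second waits for "y".
delta_1 = {("a0", "x"): "a1", ("a0", "y"): "a0",
           ("a1", "x"): "a1", ("a1", "y"): "a1"}
delta_2 = {("b0", "x"): "b0", ("b0", "y"): "b1",
           ("b1", "x"): "b1", ("b1", "y"): "b1"}
deltas = [delta_1, delta_2]
after_x = product_step(("a0", "b0"), "x", deltas)
after_xy = product_run(("a0", "b0"), ["x", "y"], deltas)
```

Because the components are complete, the product transition is always defined, so the product state keeps tracking every specification even after some of them have been violated.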

Given the SG and product automaton , we can construct a product SG defined as follows.

###### Definition 6.

(Product SG): Given SG and product automaton , a (weighted and labeled) product SG is a tuple , where is a finite set of states, (resp. ) is a finite set of control inputs (resp. attack signals), if , , and is a weight function assigning each transition a reward.

The weight function of product SG is defined as

$$W\big((s, q_1, \cdots, q_n), u_C, u_A, (s', q_1', \cdots, q_n')\big) = \sum_{i=1}^{n} I_{ii'}\, r(\phi_i), \tag{2}$$

where the indicator $I_{ii'} = 1$ if $q_i \notin F_i$ and $q_i' \in F_i$, and $I_{ii'} = 0$ otherwise. By definition (2), a trace collects the reward $r(\phi_i)$ the first time specification $\phi_i$ is satisfied. We index the states in the product SG as $s_P$.
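A minimal sketch of the weight function (2) is given below. It reads the indicator as being one exactly when a component automaton moves from a non-accepting to an accepting state, which is how the phrase "satisfied at first time" is interpreted here; that reading, the tuple encoding of automaton states, and the example rewards are assumptions of this sketch. The system state and the two actions are omitted because, under this reading, the reward depends only on the automaton components.

```python
def transition_reward(q, q_next, accepting_sets, rewards):
    """Reward of one product transition in the spirit of Eq. (2).

    q and q_next are tuples of component DFA states before and after
    the transition. accepting_sets[i] and rewards[i] belong to the
    i-th specification. A reward is paid only when a component enters
    its accepting set from outside it (assumed reading of the indicator).
    """
    total = 0.0
    for qi, qi_next, accepting, reward in zip(q, q_next, accepting_sets, rewards):
        if qi not in accepting and qi_next in accepting:
            total += reward
    return total


# Hypothetical example with two specifications worth 5 and 2.
accepting_sets = [{"a1"}, {"b1"}]
rewards = [5.0, 2.0]
first = transition_reward(("a0", "b0"), ("a1", "b0"), accepting_sets, rewards)
second = transition_reward(("a1", "b0"), ("a1", "b1"), accepting_sets, rewards)
repeat = transition_reward(("a1", "b1"), ("a1", "b1"), accepting_sets, rewards)
```

In this sketch a specification pays out once per entry into its accepting set; whether accepting states are absorbing, so that this happens at most once along a trace, depends on the automata and is not checked here.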

A proper control policy on product SG is defined as follows.

###### Definition 7.

(Proper Policies): A stationary control policy is proper if under , regardless of the policy chosen by the adversary, the set of destination states can eventually be reached with positive probability, where a destination state is a state such that is an absorbing state in automaton for all .

If a control policy is improper, then under policy , there exists some state that has zero probability to reach the set of destination states.

### V-A MINLP Formulation

For the controller’s strategy, since randomized stationary strategies are considered in Problem 1, we have that

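$$
\begin{aligned}
\mu(s_P, u_C) &\geq 0, && \forall s_P \in S_P,\ u_C \in U_C, && (3)\\
\sum_{u_C \in U_C(s_P)} \mu(s_P, u_C) &= 1, && \forall s_P \in S_P, && (4)\\
\lambda(s_P, u_A) &\in \{0, 1\}, && \forall s_P \in S_P,\ u_A \in U_A, && (5)\\
\sum_{u_A \in U_A(s_P)} \lambda(s_P, u_A) &= 1, && \forall s_P \in S_P, && (6)
\end{aligned}
$$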

where (4) and (6) guarantee that each probability distribution sums to one. Eq. (5) holds since in Stackelberg games, it is sufficient to consider pure strategies for the follower [32].

The value function for the controller (resp. adversary ) is defined as the expected reward for the controller (resp. adversary) starting from state . The value functions can be characterized using the following lemma.

###### Lemma 1.

The expected reward of the controller and adversary induced by policy and can be represented as

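$$
\begin{aligned}
V_C(s_P) &= \sum_{u_C \in U_C} \Big[ \mu(s_P, u_C) \sum_{u_A \in U_A} \lambda(s_P, u_A) \sum_{s_P'} Pr_P(s_P, u_C, u_A, s_P') \big( W(s_P, u_C, u_A, s_P') + V_C(s_P') \big) \Big],\\
V_A(s_P) &= \sum_{u_C \in U_C} \Big[ \tilde{\mu}(s_P, u_C) \sum_{u_A \in U_A} \lambda(s_P, u_A) \sum_{s_P'} Pr_P(s_P, u_C, u_A, s_P') \big( -W(s_P, u_C, u_A, s_P') + V_A(s_P') \big) \Big].
\end{aligned}
$$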

Moreover, given a pair of policies and , the expected reward of the controller and adversary are the unique solutions to the linear equations above.
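To make Lemma 1 concrete, the sketch below fixes a pair of policies, builds the induced Markov chain, and evaluates the controller's value function. It uses simple fixed-point iteration on the equation for $V_C$ rather than a direct linear solve, and it assumes the control policy is proper so that the iteration converges. The dictionary encodings and the three-state toy game are hypothetical.

```python
def induce_chain(states, mu, lam, Pr, W):
    """Markov chain and one-step expected rewards under fixed policies.

    mu[s] maps control actions to probabilities, lam[s] is the pure
    adversary action at s, Pr[(s, uC, uA)] maps successors to
    probabilities, and W[(s, uC, uA, s')] is the transition reward
    (missing entries count as zero).
    """
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    P = [[0.0] * n for _ in range(n)]
    w = [0.0] * n
    for s in states:
        u_A = lam[s]
        for u_C, p_c in mu[s].items():
            for s_next, p in Pr[(s, u_C, u_A)].items():
                P[index[s]][index[s_next]] += p_c * p
                w[index[s]] += p_c * p * W.get((s, u_C, u_A, s_next), 0.0)
    return P, w


def evaluate(P, w, tol=1e-12, max_iter=100_000):
    """Iterate V <- w + P V from V = 0 until successive iterates agree."""
    n = len(w)
    V = [0.0] * n
    for _ in range(max_iter):
        V_new = [w[i] + sum(P[i][j] * V[j] for j in range(n)) for i in range(n)]
        if max(abs(a - b) for a, b in zip(V_new, V)) < tol:
            return V_new
        V = V_new
    raise RuntimeError("value iteration did not converge; is the policy proper?")


# Hypothetical game: s0 may stall, s1 reaches the absorbing goal for reward 4.
states = ["s0", "s1", "goal"]
mu = {"s0": {"go": 1.0}, "s1": {"go": 1.0}, "goal": {"stay": 1.0}}
lam = {"s0": "idle", "s1": "idle", "goal": "idle"}
Pr = {("s0", "go", "idle"): {"s0": 0.5, "s1": 0.5},
      ("s1", "go", "idle"): {"goal": 1.0},
      ("goal", "stay", "idle"): {"goal": 1.0}}
W = {("s1", "go", "idle", "goal"): 4.0}
P, w = induce_chain(states, mu, lam, Pr, W)
V_C = evaluate(P, w)
```

In this toy chain the goal is reached with probability one from every state, so the controller's value is the full reward of 4 at both non-goal states and zero at the goal.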

###### Proof.

The expected one-step reward starting from state $s_P$ is calculated as

$$\tilde{W}(s_P) = \sum_{u_C \in U_C} \mu(s_P, u_C) \sum_{u_A \in U_A} \lambda(s_P, u_A) \sum_{s_P'} Pr_P(s_P, u_C, u_A, s_P')\, W(s_P, u_C, u_A, s_P').$$

Given a pair of policies $\mu$ and $\lambda$, the stochastic game reduces to a Markov chain. The expected reward $V_C(s_P)$ is then viewed as the expected reward collected by the path starting from $s_P$ to the set of destination states, which is equivalent to the shortest path problem on the induced Markov chain. By the dynamic programming algorithm for the stochastic shortest path problem on a Markov chain, we have [33]

$$V_C(s_P) = \tilde{W}(s_P) + \sum_{s_P'} Pr_P(s_P, u_C, u_A, s_P')\, V_C(s_P'). \tag{7}$$

Then by the definition of $\tilde{W}$, Lemma 1 holds. ∎

Based on Lemma 1, we have the following proposition which gives the sufficiency of considering proper policies.

###### Proposition 1.

If a proper control policy attains the highest expected reward for the controller among all proper policies, then it attains the highest expected reward among all stationary policies.

###### Proof.

Let be the control policy that enables the controller receiving highest reward among all stationary policies. If is a proper policy, then the result clearly holds. Next, we focus on the scenario where is improper and show that by construction, we have a proper control policy such that the expected rewards for the controller under policy and are equal. Divide the set of states into two subsets and . In particular, let be the set of states that cannot reach the set of destination states, while denotes the set of states that reach the set of destination states with positive probability. Let for all . By hypothesis on , we have that the proper control policy corresponds to the highest expected reward when the initial state is in . By the assumption on the existence of a proper policy , we let for all . Since is an improper policy while is a proper policy, we have that the expected reward received by the controller by committing to control policy is no less than committing to . Hence, we have a proper control policy such that the controller receives expected reward no less than committing to improper policy . ∎

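By Proposition 1, we can restrict the search space of control policies to the set of proper control policies. Denote the expected reward obtained by the controller starting from state $s_P$ when the controller commits to strategy $\mu$ and the adversary takes action $u_A$ as $B_C(s_P, \mu, u_A)$. Define $B_A(s_P, \tilde{\mu}, u_A)$ for the adversary analogously. Then for all $s_P$ and $u_A$, the expected reward for the controller (resp. adversary) can be represented as

$$
\begin{aligned}
B_C(s_P, \mu, u_A) &= \sum_{u_C \in U_C} \mu(s_P, u_C) \Big[ \sum_{s_P'} Pr_P(s_P, u_C, u_A, s_P') \big( W(s_P, u_C, u_A, s_P') + V_C(s_P') \big) \Big], && (8)\\
B_A(s_P, \tilde{\mu}, u_A) &= \sum_{u_C \in U_C} \tilde{\mu}(s_P, u_C) \Big[ \sum_{s_P'} Pr_P(s_P, u_C, u_A, s_P') \big( -W(s_P, u_C, u_A, s_P') + V_A(s_P') \big) \Big], && (9)
\end{aligned}
$$

which are the expected utilities of the controller and the adversary, respectively. Note that the adversary's expected reward depends on its estimate $\tilde{\mu}$ of the control policy defined in (1). Since $\lambda$ is binary, we can bound the values for the adversary and the controller using the big-M method [32], for all $s_P$ and $u_A$, as follows:

$$
\begin{aligned}
B_A(s_P, \tilde{\mu}, u_A) \leq V_A(s_P) &\leq B_A(s_P, \tilde{\mu}, u_A) + \big(1 - \lambda(s_P, u_A)\big) Z, && (10)\\
V_C(s_P) &\leq B_C(s_P, \mu, u_A) + \big(1 - \lambda(s_P, u_A)\big) Z, && (11)
\end{aligned}
$$

where $Z$ is a sufficiently large positive number. Inequalities (10) and (11) give bounds for $V_A$ and $V_C$. Depending on the value of $\lambda(s_P, u_A)$, the upper bound for $V_A(s_P)$ (resp. $V_C(s_P)$) is either effectively infinite (when $\lambda(s_P, u_A) = 0$) or equal to $B_A(s_P, \tilde{\mu}, u_A)$ (resp. $B_C(s_P, \mu, u_A)$) when $\lambda(s_P, u_A) = 1$.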
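Constraint (10) forces the adversary's value at a state to equal the largest value of (9) over its actions, attained by the action selected with a binary variable equal to one. The sketch below enumerates that maximizer directly for a single state instead of encoding it with big-M constraints. The data structures, the action names `idle` and `jam`, and all numbers are hypothetical.

```python
def adversary_value(s, u_A, mu_tilde, Pr, W, V_A):
    """Evaluate Eq. (9) for one state and one adversary action.

    mu_tilde[s] maps control actions to the adversary's estimated
    probabilities, Pr[(s, uC, uA)] maps successors to probabilities,
    W holds transition rewards (missing entries count as zero) and
    V_A maps states to the adversary's current value estimates.
    """
    total = 0.0
    for u_C, p_c in mu_tilde[s].items():
        for s_next, p in Pr[(s, u_C, u_A)].items():
            total += p_c * p * (-W.get((s, u_C, u_A, s_next), 0.0) + V_A[s_next])
    return total


def best_response(s, actions_A, mu_tilde, Pr, W, V_A):
    """Pure adversary action maximizing Eq. (9); ties go to the first listed."""
    return max(actions_A, key=lambda u: adversary_value(s, u, mu_tilde, Pr, W, V_A))


# Hypothetical single-state example: jamming makes the rewarded move fail often.
mu_tilde = {"s": {"go": 1.0}}
Pr = {("s", "go", "idle"): {"goal": 1.0},
      ("s", "go", "jam"): {"goal": 0.2, "s": 0.8}}
W = {("s", "go", "idle", "goal"): 4.0, ("s", "go", "jam", "goal"): 4.0}
V_A = {"s": -1.0, "goal": 0.0}
value_idle = adversary_value("s", "idle", mu_tilde, Pr, W, V_A)
value_jam = adversary_value("s", "jam", mu_tilde, Pr, W, V_A)
chosen = best_response("s", ["idle", "jam"], mu_tilde, Pr, W, V_A)
```

Note that the response is computed against the estimated policy, so with a nonzero anchoring bias it need not be a best response to the true control policy.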

To compute the control policy that maximizes the expected utility of controller, the following optimization problem can be formulated [32].

$$
\begin{aligned}
\max_{\mu, \lambda, V_C, V_A} \quad & \gamma^{T} V_C && (12)\\
\text{s.t.} \quad & \text{(3)–(6) and (8)–(11)}
\end{aligned}
$$

where $\gamma$ is the initial distribution over the state space $S_P$. Since constraints (10) and (11) introduce nonlinearity and $\lambda$ is binary, the optimization problem (12) is an MINLP.

### V-B Heuristic Solution

The MINLP (12) is nonconvex and solving it is NP-hard. In the following, we present a value iteration based heuristic solution to the MINLP (12).

As shown in Algorithm 1, we first initialize an arbitrary set of initial policies using sampling approach, where the sample space is the product of probability simplices in . Then by solving the optimal control problem from the perspective of adversary on the MDP induced by each control policy [1, 2, 3, 4], we can solve for a set of expected rewards for the controller associated with the initial policies.
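The initialization step can be illustrated by sampling one probability vector per state from the corresponding simplex. The text does not say which sampling distribution is used; the sketch below assumes uniform sampling on each simplex (via normalized exponential draws), and the state and action names are hypothetical.

```python
import random


def sample_simplex(n, rng):
    """Draw a point uniformly from the (n-1)-dimensional probability simplex.

    Normalizing n independent Exp(1) draws gives a Dirichlet(1, ..., 1)
    sample, i.e. a uniform point on the simplex.
    """
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_policy(admissible, rng):
    """Sample a randomized stationary policy, one simplex point per state."""
    policy = {}
    for s, actions in admissible.items():
        probs = sample_simplex(len(actions), rng)
        policy[s] = dict(zip(actions, probs))
    return policy


# Hypothetical admissible control actions for two product states.
rng = random.Random(0)
admissible = {"s0": ["N", "E"], "s1": ["N", "E", "S"]}
initial_policies = [sample_policy(admissible, rng) for _ in range(5)]
```

Each sampled policy would then serve as one starting point of the outer loop; using several starting points matters because the heuristic is not guaranteed to find a global optimum.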

For each initial expected reward , value iteration (line to line ) is used to find a control policy such that the objective function is maximized. In particular, at iteration , given the expected reward obtained from previous iteration , the following mixed integer linear programming (MILP) is solved to calculate the proper control policy .

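$$
\begin{aligned}
\max_{\mu, \lambda, V_C, V_A} \quad & \gamma^{T} V_C && (13)\\
\text{s.t.} \quad & V_C(s_P) \leq B_C^{k}(s_P, \mu, u_A) + \big(1 - \lambda(s_P, u_A)\big) Z, && \forall s_P, u_A\\
& B_A^{k}(s_P, \mu, u_A) \leq V_A(s_P) \leq B_A^{k}(s_P, \mu, u_A) + \big(1 - \lambda(s_P, u_A)\big) Z, && \forall s_P, u_A
\end{aligned}
$$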

where $B_C^{k}$ and $B_A^{k}$ are obtained by (8) and (9) using the value functions from the previous iteration. Note that when solving the MILP (13), the policy chosen by the adversary is the best response to $\tilde{\mu}$ obtained from (1). The algorithm terminates when either the stopping condition of the value iteration is met or the MILP (13) is infeasible. The first termination condition covers the scenario where an optimal solution can be found by solving the optimization problem. Since the initial guess is given arbitrarily while the expected reward is bounded, the MILP (13) might be infeasible. In this case, the initial guess is skipped and the value iteration module terminates. After a feasible solution is found at some iteration, we store the resulting expected reward in a vector. The control policy returned by Algorithm 1 is then the one corresponding to the maximum value in that vector.

The convergence of Algorithm 1 is presented in the following theorem.

###### Theorem 1.

Algorithm 1 converges in finite time.

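Before presenting the proof of Theorem 1, we first introduce two operators, denoted $T_\mu$ and $T$, as follows:

$$
\begin{aligned}
T_\mu V_C(s_P) &= \min_{\lambda \in BR(\tilde{\mu})} \sum_{u_C \in U_C} \mu(s_P, u_C) \sum_{u_A \in U_A} \lambda(s_P, u_A) \sum_{s_P'} \Big[ Pr_P(s_P, u_C, u_A, s_P') \big( W(s_P, u_C, u_A, s_P') + V_C(s_P') \big) \Big], && (14)\\
T V_C(s_P) &= \max_{\mu} \min_{\lambda \in BR(\tilde{\mu})} \sum_{u_C \in U_C} \mu(s_P, u_C) \sum_{u_A \in U_A} \lambda(s_P, u_A) \sum_{s_P'} \Big[ Pr_P(s_P, u_C, u_A, s_P') \big( W(s_P, u_C, u_A, s_P') + V_C(s_P') \big) \Big]. && (15)
\end{aligned}
$$

The following lemmas characterize these operators.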

###### Lemma 2.

For any vectors and such that , we have for all policies and , where iteratively applying operator times.

###### Proof.

By definition (14) and (15), we can view the operator as the total expected reward collected from a -stage problem with cost per stage . Increasing is equivalent to increasing the terminal reward (e.g., the reward collected when reaching the destination) in the -stage problem. Since cost per stage is fixed, hence increasing will increase the expected total reward in the -stage problem, which implies monotonicity of . ∎

The argument above is only a sketch; we omit the detailed proof of Lemma 2 due to space limits.

###### Lemma 3.

Denote the expected reward induced by proper control policy and adversary policy as . Then satisfies .

###### Proof.

Since we focus on stationary policies, then by inducting Lemma 1, can be represented as

$$T_\mu^{M} V_C = Pr^{M} V_C + \sum_{m=0}^{M-1} Pr^{m}\, \tilde{W}, \tag{16}$$

where is the transition matrix of the Markov chain induced by control policy and adversary policy . Since the control policy is proper, we can eventually reach the set of destination states with probability . By definition (2), no reward can be collected when starting from destination states. Therefore, we have . Then, by taking limit on both sides of (16) as tends to infinity, we have . By the definition of , we have , and hence Lemma 3 is proved. ∎

Finally, we have the following proposition.

###### Proposition 2.

The optimal expected total reward for the controller at each iteration satisfies .

###### Proof.

Suppose the expected reward for the controller is at some iteration such that . If , we have that is not a feasible solution to MILP (13). If , then starting from , we can always search along some direction in the feasible region of (13) until we reach the boundary of the feasible region to find some . Hence, is not the optimal solution to (13). Therefore, we have holds. ∎

In the following, we present the proof of Theorem 1.

###### Proof.

(Proof of Theorem 1.) We show that Algorithm 1 terminates within finite iterations because both outer and inner loops terminate within finite iterations.

First, the outer loop executes exactly times and thus the outer loop terminates within finite iterations.

Next, we show at each outer loop iteration , the value iteration module converges within finite time. It is obvious that the inner loop terminates when the initial guess on is not feasible. In the following we focus on the feasible case. Let be the iteration index of value iteration (line to line ). Let us denote the expected reward of the controller induced by control policy and adversary policy at each iteration as . Let the expected reward of each transition starting from state and the transition matrix under control policy and adversary policy be and , respectively. By Lemma 1 and Proposition 2, we observe that is equivalent to find a control policy such that . Therefore , where the inequality holds by definition (14) and (15), i.e., . View as . Then composing times and taking the limit as , by Lemma 2, we can construct a sequence of inequalities . Therefore, we have , where the convergence of follows from Lemma 3. Hence, the expected reward increases with respect to the number of iterations . Since is upper bounded by , we claim that the value iteration module converges within finite time. ∎

Furthermore, we characterize the value function returned by Algorithm 1 using the following proposition.

###### Proposition 3.

The expected reward of the controller returned by Algorithm 1 is the value function obtained by committing to control strategy returned by Algorithm 1.

The advantage of Algorithm 1 is that it significantly reduces computation and memory cost compared to global optimization techniques [34] and discretization-based approximate algorithms [32]. Global optimization techniques such as spatial branch and bound have been shown to be inefficient compared to MILP. The approximate solution proposed in [32] introduces extra binary variables and constraints, whose numbers grow linearly with the discretization resolution. The introduction of extra variables and constraints weakens its scalability, especially for the large state space of the product SG. In contrast, Algorithm 1 introduces no additional variables when solving the MILP. Therefore, Algorithm 1 significantly saves memory and model construction time for commercial solvers. Algorithm 1 does not guarantee that a globally optimal solution will be found. Hence, executing Algorithm 1 from different initial points can improve its performance.

## VI Case Study

In this section, we present a numerical case study to demonstrate the proposed approach.

### VI-A Case Study Settings

Suppose a robot is performing tasks modeled in scLTL in a bounded environment. We consider the robot following standard discrete time model , where is the location of the robot at time , is the control input from the controller, is the input signal from the adversary and is the stochastic disturbance, is the time interval. Therefore, we have that the control signal of the robot is compromised by the adversary. Here we let .

We divide the region into sub-regions with each size is . We abstract the stochastic game as follows [25]. Let each sub-region be a state in the stochastic game. Hence, the stochastic game has states and we will refer to state and sub-region interchangeably in the following. Each state can be mapped to a subset of atomic propositions by labeling function as shown in Fig. 1. The action sets for the controller and adversary are defined as , implying moving towards the adjacent sub-region. When the adversary compromises the control input, the probability that the robot transits to its intended state is . Moreover, when the robot is at , the adversary can block all the transitions of the robot (e.g., close the door of the room).

Suppose the robot is given specifications as shown in Table I. The robot is required to visit the or before visiting . Moreover, they are required to be visited in this particular order if possible. In the meantime, the robot should avoid during the visit to guarantee safety property. Finally, the robot is required to eventually visit once it has visited .

### VI-B Case Study Results

Let the upper left state in Fig. 1 be the initial state. Fig. 1 shows two trajectories generated using the proposed approach and the control policy synthesized without considering the presence of the adversary. Without considering the adversary, the control policy attempts to satisfy all the specifications in . However, the adversary is capable to block all the transitions at state marked as