I Introduction
Cyber-physical systems (CPSs) are entities in which the operation of a physical system is governed by its interactions with computing devices and algorithms. These systems are ubiquitous [baheti2011cyber], and vary in scale from energy systems to medical devices and robots. In applications like autonomous cars and robotics, CPSs are expected to operate in dynamic and potentially dangerous environments with a large degree of autonomy. In such a setting, the system might be the target of malicious attacks that aim to prevent it from accomplishing a goal. An attack can be carried out on the physical system, on the computers that control the physical system, or on communication channels between components of the system. Such attacks by an intelligent attacker have been reported across multiple domains, including power systems [sullivan2017cyber], automobiles [shoukry2013non], water networks [slay2007lessons], and nuclear reactors [farwell2011stuxnet]. Adversaries are often stealthy, and tailor their attacks to cause maximum damage. Therefore, strategies designed to address only modeling and sensing errors may not satisfy performance requirements in the presence of an intelligent adversary who can manipulate system operation.
The preceding discussion makes it imperative to develop methods to specify and verify CPSs and the environments they operate in. Formal methods [baier2008principles] enable the verification of the behavior of CPS models against a rich set of specifications [lahijanian2015formal]. Properties like safety, liveness, stability, and priority can be expressed as formulas in linear temporal logic (LTL) [kress2007s, ding2014optimal], and can be verified using off-the-shelf model checkers [cimatti1999nusmv, kwiatkowska2011prism] that take these formulas as inputs.
Markov decision processes (MDPs) [bertsekas2015dynamic, puterman2014markov] have been used to model environments where outcomes depend on both an inherent randomness in the model (transition probabilities) and an action taken by an agent. These models have been extensively used in applications like robotics [lahijanian2012temporal] and unmanned aircraft [temizer2010collision]. Current literature on the satisfaction of an LTL formula over an MDP assumes that states are fully observable [ding2014optimal, lahijanian2012temporal, niu2018secure]. In many practical scenarios, however, states may not be observable. For example, as seen in [thrun2005probabilistic], a robot might only have an estimate of its location based on the output of a vision sensor. The inability to observe all states necessitates a framework that accounts for partial observability. For LTL formula satisfaction in partially observable environments with a single agent, partially observable Markov decision processes (POMDPs) can be used to model and solve the problem [sharan2014finite, sharan2014formal]. However, determining an ‘optimal policy’ for an agent in a partially observable environment is NP-hard for the infinite horizon case, as shown in [vlassis2012computational]. This motivates techniques that determine approximate solutions.
Heuristics to approximately solve POMDPs include belief replanning [cassandra1996acting], most likely belief state policy and entropy weighting [kaelbling1998planning], grid-based methods [brafman1997heuristic], and point-based methods [kurniawati2008sarsop]. The difficulty in computing exactly optimal policies and the lack of complete observability may be exploited by an adversary to launch new attacks on the system. The synthesis of parameterized finite state controllers (FSCs) for a POMDP to maximize the probability of satisfying an LTL formula (in the absence of an adversary) was proposed in [sharan2014finite] and [sharan2014formal]. This is an approximate strategy, since it does not use the observation and action histories; it uses only the most recent observation to determine an action. This restricts the class of policies that are searched over, but the finite number of states in an FSC makes the problem computationally tractable. The authors of [yu2008near] showed the existence of optimal FSCs for the average cost POMDP. In comparison, for the setting in this paper, where we have two competing agents, we present guarantees on the convergence of a value-iteration based procedure in terms of the number of states in the environment and the FSCs.
In this paper, we study the problem of determining strategies for an agent that has to satisfy an LTL formula in the presence of an adversary in a partially observable environment. The agent and the adversary take actions simultaneously, and these jointly influence transitions between states.
I-A Contributions
The setting that we consider in this paper assumes two players or agents, a defender and an adversary, each of which is limited in that it does not observe the state exactly. The policies of the agents are represented as FSCs. The goal for the defender is to synthesize a policy that maximizes the probability of satisfying an LTL formula for any adversary policy. We make the following contributions.

We show that maximizing the satisfaction probability of the LTL formula under any adversary policy is equivalent to maximizing the probability of reaching a recurrent set of a Markov chain constructed by composing representations of the environment, the LTL objective, and the respective agents’ controllers.

We develop a heuristic algorithm to determine defender and adversary FSCs of fixed sizes that will satisfy the LTL formula with nonzero probability, and show that it is sound. The search for a defender policy that will maximize the probability of satisfaction of the LTL formula for any adversary policy can then be reduced to a search among these FSCs.

We propose a procedure based on value iteration that maximizes the probability of satisfying the LTL formula under fixed defender and adversary FSCs. This satisfaction probability is related to a Stackelberg equilibrium of a partially observable stochastic game involving the defender and adversary. We also give guarantees on the convergence of this procedure.

We study the case when the size of the defender FSC can be changed to improve the satisfaction probability.

We present an example to illustrate our approach.
The value-iteration procedure and the varying defender FSC size described above are new to this work, as are the more detailed examples. This differentiates the present paper from a preliminary version that appears in [bhaskar2019finite].
I-B Outline
An overview of LTL and partially observable stochastic games (POSGs) is given in Section II. We define FSCs for the two agents, and show how they can be composed with a POSG to yield a Markov chain in Section III. Section IV relates LTL satisfaction on a POSG to reaching specific subsets of recurrent sets of an associated Markov chain. Section V gives a procedure to determine defender and adversary FSCs of fixed sizes that ensure that the LTL formula is satisfied with nonzero probability. A value-iteration procedure to maximize the probability of satisfying the LTL formula under fixed defender and adversary FSCs is detailed in Section VI. Section VII addresses the scenario when states may be added to the defender FSC in order to improve the probability of satisfying the LTL formula under an adversary FSC of fixed size. Illustrative examples are presented in Section VIII. Section IX summarizes related work in POMDPs and temporal logic satisfaction on MDPs. Section X concludes the paper.
II Preliminaries
In this section, we give a concise introduction to linear temporal logic and partially observable stochastic games. We then detail the construction of a product-POSG, which allows us to determine whether runs on a POSG satisfy an LTL formula.
II-A Linear Temporal Logic
Temporal logic frameworks enable the representation of, and reasoning about, temporal information on propositional statements. Linear temporal logic (LTL) is one such framework, in which the progress of time is ‘linear’. An LTL formula [baier2008principles] is defined over a set of atomic propositions $AP$, and can be written as $\phi ::= \mathrm{true} \mid \sigma \mid \neg\phi \mid \phi \wedge \phi \mid \mathbf{X}\phi \mid \phi\,\mathbf{U}\,\phi$, where $\sigma \in AP$, and $\mathbf{X}$ and $\mathbf{U}$ are temporal operators denoting the next and until operations, respectively.
The semantics of LTL are defined over infinite words in $(2^{AP})^\omega$. We write $\eta \models \phi$ when a trace $\eta \in (2^{AP})^\omega$ satisfies an LTL formula $\phi$. Here, the superscript $\omega$ indicates the infinite length of the word. (More precisely, $\eta$ is a word in an $\omega$-regular language, which is a generalization of regular languages to words of infinite length [baier2008principles].)
Definition II.1 (LTL Semantics)
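Let $\eta = \eta_0\eta_1\eta_2\ldots \in (2^{AP})^\omega$, and let $\eta^i := \eta_i\eta_{i+1}\ldots$ denote its suffix starting at position $i$. Then, the semantics of LTL can be recursively defined as:

$\eta \models \mathrm{true}$ if and only if (iff) $\mathrm{true}$ is true;

$\eta \models \sigma$ iff $\sigma \in \eta_0$;

$\eta \models \neg\phi$ iff $\eta \not\models \phi$;

$\eta \models \phi_1 \wedge \phi_2$ iff $\eta \models \phi_1$ and $\eta \models \phi_2$;

$\eta \models \mathbf{X}\phi$ iff $\eta^1 \models \phi$;

$\eta \models \phi_1\,\mathbf{U}\,\phi_2$ iff $\exists j \geq 0$ such that $\eta^j \models \phi_2$ and $\eta^k \models \phi_1$ for all $0 \leq k < j$.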
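Moreover, the logic admits derived formulas of the form: i) $\phi_1 \vee \phi_2 := \neg(\neg\phi_1 \wedge \neg\phi_2)$; ii) $\phi_1 \Rightarrow \phi_2 := \neg\phi_1 \vee \phi_2$; iii) $\mathbf{F}\phi := \mathrm{true}\,\mathbf{U}\,\phi$ (eventually); iv) $\mathbf{G}\phi := \neg\mathbf{F}\neg\phi$ (always).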
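The recursive semantics can be checked mechanically on ultimately periodic (lasso) words $\eta = u\,v^\omega$, which are enough to illustrate the operators. The following is a minimal sketch under that assumption; the tuple encoding of formulas and the names `holds`, `F`, and `G` are hypothetical and introduced only for illustration. The until operator is computed as the least fixed point of $\phi_1\,\mathbf{U}\,\phi_2 \equiv \phi_2 \vee (\phi_1 \wedge \mathbf{X}(\phi_1\,\mathbf{U}\,\phi_2))$, and the derived operators "eventually" and "always" are built from it.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical AST: ("true",), ("ap", name), ("not", f), ("and", f, g), ("X", f), ("U", f, g)
Formula = Tuple
Letter = FrozenSet[str]


def holds(phi: Formula, prefix: List[Letter], loop: List[Letter]) -> bool:
    """Decide whether the lasso word prefix . loop^omega satisfies phi."""
    if not loop:
        raise ValueError("the loop part of a lasso word must be nonempty")
    word = prefix + loop
    n = len(word)
    # Successor of each position; the last position wraps to the start of the loop.
    succ = list(range(1, n)) + [len(prefix)]

    def sat(f: Formula) -> List[bool]:
        op = f[0]
        if op == "true":
            return [True] * n
        if op == "ap":
            return [f[1] in word[i] for i in range(n)]
        if op == "not":
            return [not v for v in sat(f[1])]
        if op == "and":
            return [x and y for x, y in zip(sat(f[1]), sat(f[2]))]
        if op == "X":
            inner = sat(f[1])
            return [inner[succ[i]] for i in range(n)]
        if op == "U":
            left, right = sat(f[1]), sat(f[2])
            # Least fixed point of Z = right or (left and X Z), iterated from all-False;
            # n + 1 rounds suffice because a witness lies within n steps of any position.
            z = [False] * n
            for _ in range(n + 1):
                z = [right[i] or (left[i] and z[succ[i]]) for i in range(n)]
            return z
        raise ValueError(f"unknown operator: {op!r}")

    return sat(phi)[0]


def F(f: Formula) -> Formula:
    """Derived 'eventually': true U f."""
    return ("U", ("true",), f)


def G(f: Formula) -> Formula:
    """Derived 'always': not F not f."""
    return ("not", F(("not", f)))


a, b = frozenset({"a"}), frozenset({"b"})
print(holds(("U", ("ap", "a"), ("ap", "b")), [a], [b]))  # True: a, then b forever
print(holds(G(("ap", "a")), [a], [b]))                   # False: a fails at position 1
```

Restricting to lasso words keeps every check finite; general $\omega$-words cannot be handled this way, which is one reason automata such as the DRA below are used instead.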
Definition II.2 (Deterministic Rabin Automaton)
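A deterministic Rabin automaton (DRA) is a quintuple $\mathcal{R} = (Q, \Sigma, \delta, q_0, F)$, where $Q$ is a nonempty finite set of states, $\Sigma$ is a finite alphabet, $\delta: Q \times \Sigma \to Q$ is a transition function, $q_0 \in Q$ is the initial state, and $F = \{(L(1), K(1)), \ldots, (L(M), K(M))\}$ is such that $L(i), K(i) \subseteq Q$ for all $i \in \{1, \ldots, M\}$, and $M$ is a positive integer.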
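A run of $\mathcal{R}$ is an infinite sequence of states $q_0q_1\ldots$ such that for all $i \geq 0$, $q_{i+1} = \delta(q_i, \alpha)$ for some $\alpha \in \Sigma$. The run is accepting if there exists a pair $(L(i), K(i)) \in F$ such that the run intersects $L(i)$ finitely many times and $K(i)$ infinitely often. An LTL formula $\phi$ over $AP$ can be represented by a DRA with alphabet $2^{AP}$ that accepts all and only those words in $(2^{AP})^\omega$ that satisfy $\phi$.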
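Because a DRA is deterministic, its run on an ultimately periodic word $u\,v^\omega$ is itself eventually periodic, so the Rabin condition can be decided from the set of states visited infinitely often. The sketch below assumes lasso-shaped inputs and encodes $\delta$ as a Python callable; the class name `DRA` and the method `accepts` are hypothetical. The example automaton accepts exactly the words in which the proposition $b$ holds infinitely often.

```python
from typing import Callable, FrozenSet, Hashable, List, Set, Tuple

Letter = FrozenSet[str]
State = Hashable


class DRA:
    """Deterministic Rabin automaton (Q, Sigma, delta, q0, F) with F = [(L(i), K(i)), ...]."""

    def __init__(self, delta: Callable[[State, Letter], State], q0: State,
                 pairs: List[Tuple[Set[State], Set[State]]]):
        self.delta = delta
        self.q0 = q0
        self.pairs = pairs

    def accepts(self, prefix: List[Letter], loop: List[Letter]) -> bool:
        """Rabin acceptance of the unique run on the lasso word prefix . loop^omega."""
        if not loop:
            raise ValueError("the loop part of a lasso word must be nonempty")
        q = self.q0
        for letter in prefix:
            q = self.delta(q, letter)
        # Read copies of the loop until the state at a loop boundary repeats;
        # from then on the run is periodic.
        first_seen = {}   # boundary state -> index of the segment starting there
        segments = []     # states visited while reading one copy of the loop
        while q not in first_seen:
            first_seen[q] = len(segments)
            visited = []
            for letter in loop:
                q = self.delta(q, letter)
                visited.append(q)
            segments.append(visited)
        inf_often = set().union(*segments[first_seen[q]:])
        # Accepting iff some pair has L(i) visited finitely often and K(i) infinitely often.
        return any(not (inf_often & set(L)) and bool(inf_often & set(K))
                   for L, K in self.pairs)


# Example: a two-state DRA for "b holds infinitely often" over AP = {b}.
gfb = DRA(delta=lambda q, letter: 1 if "b" in letter else 0, q0=0,
          pairs=[(set(), {1})])
print(gfb.accepts([], [frozenset({"b"}), frozenset()]))  # True
print(gfb.accepts([frozenset({"b"})], [frozenset()]))    # False
```

States seen only in the prefix or before the cycle closes are visited finitely often and therefore never enter `inf_often`, which is exactly what the Rabin condition requires.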
II-B Stochastic Games and Markov Chains
A stochastic game involves one or more players, and starts with the system in a particular state. Transitions to a subsequent state are probabilistically determined by the current state and the actions chosen by each player, and this process is repeated. Our focus will be on two-player stochastic games, and we omit the qualifier ‘two-player’ for the remainder of this paper.
Definition II.3 (Stochastic Game)
A stochastic game [niu2018secure] is a tuple $\mathcal{G} = (S, s_0, U_D, U_A, T, AP, \mathcal{L})$. $S$ is a finite set of states, $s_0 \in S$ is the initial state, and $U_D$ and $U_A$ are finite sets of actions of the defender and adversary, respectively. $T: S \times U_D \times U_A \times S \to [0,1]$ encodes $T(s'|s, u_D, u_A)$, the transition probability from state $s$ to state $s'$ when the defender and adversary actions are $u_D$ and $u_A$, respectively. $AP$ is a set of atomic propositions. $\mathcal{L}: S \to 2^{AP}$ is a labeling function that maps a state to the subset of atomic propositions that are satisfied in that state.
Stochastic games can be viewed as an extension of Markov decision processes to the case of more than one player. For a player, a policy is a mapping from sequences of states to actions if it is deterministic, or from sequences of states to probability distributions over actions if it is randomized. A policy is Markov if it depends only on the most recent state.
In this paper, we focus our attention on the Stackelberg setting [fudenberg1991game]. In this framework, the first player (leader) commits to a policy. The second player (follower) observes the leader’s policy and chooses its policy as the best response to the leader’s policy, defined as the policy that maximizes the follower’s utility. We also assume that the players take their actions concurrently at each time step.
We now define the notion of a Stackelberg equilibrium, which characterizes a solution to a Stackelberg game. Let $\mathcal{U}_L(\mu_L, \mu_F)$ ($\mathcal{U}_F(\mu_L, \mu_F)$) be the utility gained by the leader (follower) when the leader adopts a policy $\mu_L$ and the follower adopts a policy $\mu_F$.
Definition II.4 (Stackelberg Equilibrium)
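A pair $(\mu_L^*, \mu_F^*)$ is a Stackelberg equilibrium if $\mu_L^* \in \arg\max_{\mu_L} \mathcal{U}_L(\mu_L, BR(\mu_L))$ and $\mu_F^* = BR(\mu_L^*)$, where $BR(\mu_L) \in \arg\max_{\mu_F} \mathcal{U}_F(\mu_L, \mu_F)$ is the follower’s best response to $\mu_L$. That is, the leader’s policy is optimal given that the follower observes the leader’s policy and plays its best response.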
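To make Definition II.4 concrete, the sketch below computes a Stackelberg equilibrium by enumeration when each player chooses from a small finite set of policies and the utilities are tabulated. This is an illustrative assumption: in this paper the policies are FSCs and do not form such a small set. Ties in the follower's best response are broken in the leader's favor, which is one common convention (the "strong" Stackelberg equilibrium). In the example, the defender is the leader with utility equal to a hypothetical satisfaction probability, and the adversary's utility is one minus that probability; the function name and policy labels are hypothetical.

```python
from typing import Dict, Hashable, Iterable, Tuple

Policy = Hashable


def stackelberg(u_leader: Dict[Tuple[Policy, Policy], float],
                u_follower: Dict[Tuple[Policy, Policy], float],
                leader_policies: Iterable[Policy],
                follower_policies: Iterable[Policy]) -> Tuple[Policy, Policy, float]:
    """Leader commits first; the follower best-responds. Returns (mu_L*, mu_F*, U_L)."""
    follower_policies = list(follower_policies)
    best = None
    for mu_l in leader_policies:
        top = max(u_follower[(mu_l, mu_f)] for mu_f in follower_policies)
        best_responses = [mu_f for mu_f in follower_policies
                          if u_follower[(mu_l, mu_f)] == top]
        # Assumption: ties in the follower's best response are broken in the
        # leader's favor (the "strong" Stackelberg convention).
        mu_f = max(best_responses, key=lambda f: u_leader[(mu_l, f)])
        value = u_leader[(mu_l, mu_f)]
        if best is None or value > best[2]:
            best = (mu_l, mu_f, value)
    return best


# Hypothetical satisfaction probabilities p(defender policy, adversary policy).
p = {("d1", "a1"): 0.9, ("d1", "a2"): 0.2,
     ("d2", "a1"): 0.6, ("d2", "a2"): 0.5}
# Defender (leader) maximizes p; adversary (follower) maximizes 1 - p.
result = stackelberg(p, {k: 1.0 - v for k, v in p.items()},
                     ["d1", "d2"], ["a1", "a2"])
print(result)  # ('d2', 'a2', 0.5)
```

Policy `d1` looks attractive against `a1`, but the follower answers it with `a2`; committing first is what makes `d2` the leader's better choice.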
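When $|U_D| = 1$ and $|U_A| = 1$, $\mathcal{G}$ is a Markov chain [meyn2012markov], and we write $T(s'|s)$ for its transition probabilities. For $s, s' \in S$, $s'$ is accessible from $s$, written $s \to s'$, if $\prod_{k=0}^{n-1} T(s_{k+1}|s_k) > 0$ for some finite sequence of states $s = s_0, s_1, \ldots, s_n = s'$. Two states $s$ and $s'$ communicate if $s \to s'$ and $s' \to s$. The communicating classes of states partition the state space of the Markov chain. A state is transient if, starting from that state, there is a nonzero probability of never returning to it, and is recurrent otherwise. In a finite-state Markov chain, every state is either transient or positive recurrent, and a communicating class is recurrent if and only if no transition leaves it.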
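The definitions above give a direct way to split the states of a small finite Markov chain into communicating classes and to identify the recurrent ones (the closed classes). The sketch below uses plain reachability, which is quadratic in the number of states; a strongly-connected-component algorithm such as Tarjan's [tarjan1972depth], as used in Section V, finds the same classes in linear time. The dictionary encoding and the names `accessible_from` and `classify` are illustrative assumptions.

```python
from typing import Dict, FrozenSet, Hashable, List, Set, Tuple

State = Hashable
Chain = Dict[State, Dict[State, float]]  # P[s][s'] = T(s'|s); every state is a key


def accessible_from(P: Chain, s: State) -> Set[State]:
    """States s' with s -> s' (including s itself, via the empty path)."""
    seen, stack = {s}, [s]
    while stack:
        u = stack.pop()
        for v, prob in P[u].items():
            if prob > 0 and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def classify(P: Chain) -> Tuple[List[FrozenSet[State]], List[FrozenSet[State]]]:
    """Split the communicating classes of a finite chain into recurrent and transient ones."""
    reach = {s: accessible_from(P, s) for s in P}
    classes, assigned = [], set()
    for s in P:
        if s not in assigned:
            cls = frozenset(t for t in reach[s] if s in reach[t])  # s and t communicate
            classes.append(cls)
            assigned |= cls
    # In a finite chain, a class is recurrent iff it is closed (no transition leaves it).
    recurrent = [c for c in classes if all(reach[s] <= c for s in c)]
    transient = [c for c in classes if c not in recurrent]
    return recurrent, transient


P = {0: {0: 0.5, 1: 0.5}, 1: {2: 1.0}, 2: {1: 1.0}, 3: {3: 1.0}}
recurrent, transient = classify(P)
print(recurrent, transient)  # [{1, 2}, {3}] recurrent; [{0}] transient
```

State 0 leaks probability into the closed class $\{1, 2\}$ and never returns, so it is transient; state 3 is an absorbing (recurrent) class that is not accessible from the others.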
II-C Partially Observable Stochastic Games
Partially observable stochastic games (POSGs) extend Definition II.3 to the case when instead of observing a state directly, each player receives an observation that is derived from the state. This can be viewed as an extension of POMDPs to the case when there is more than one player.
Definition II.5 (Partially Observable Stochastic Game)
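A partially observable stochastic game is defined by the tuple $\mathcal{G} = (S, s_0, U_D, U_A, T, \mathcal{O}_D, \mathcal{O}_A, O, AP, \mathcal{L})$, where $S, s_0, U_D, U_A, T, AP$, and $\mathcal{L}$ are as in Definition II.3. $\mathcal{O}_D$ and $\mathcal{O}_A$ denote the (finite) sets of observations available to the defender and adversary. $O: \mathcal{O} \times S \to [0,1]$ encodes $O(o|s) = \mathbb{P}(o|s)$, where $\mathcal{O} := \mathcal{O}_D \cup \mathcal{O}_A$.
The function $O$ models imperfect sensing. For $O$ to satisfy the conditions of a probability distribution, we need $\sum_{o \in \mathcal{O}_D} O(o|s) = 1$ and $\sum_{o \in \mathcal{O}_A} O(o|s) = 1$ for all $s \in S$.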
II-D Adversary and Defender Models
The initial state of the system is $s_0$. A transition from a state $s_k$ to the next state $s_{k+1}$ is determined jointly by the actions of the defender and adversary according to the transition probability function $T$.
At a state $s_k$, the adversary makes an observation $o_k^A \in \mathcal{O}_A$ of the state according to $O(o_k^A|s_k)$. The adversary is also assumed to be aware of the policy $\mu$ committed to by the defender. Therefore, the overall information available to the adversary is $\mathcal{I}_k^A := \{o_0^A, \ldots, o_k^A, \mu\}$.
Unlike the adversary, the defender at state $s_k$ only makes an observation $o_k^D \in \mathcal{O}_D$ of the state according to $O(o_k^D|s_k)$. Therefore, the overall information available to the defender is $\mathcal{I}_k^D := \{o_0^D, \ldots, o_k^D\}$.
Definition II.6 (POSG Policy)
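A (defender or adversary) policy for the POSG is a map from the respective overall information to a probability distribution over the corresponding action space; that is, a defender policy maps the defender’s overall information at each time step to a distribution over $U_D$, and an adversary policy maps the adversary’s overall information to a distribution over $U_A$.
Policies of the form above are called randomized policies. If, for every possible overall information, the distribution places all its probability on a single action, the policy is called deterministic. In the sequel, we will use finite state controllers as a representation of policies that consider only the most recent observation.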
II-E The Product-POSG
In order to find runs on the POSG $\mathcal{G}$ that would be accepted by a DRA built from an LTL formula $\phi$, we construct a product-POSG. This construction is motivated by the product-stochastic game construction in [niu2018secure] and the product-POMDP construction in [sharan2014finite].
(1)  
Definition II.7 (ProductPOSG)
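Given $\mathcal{G}$ and $\mathcal{R}$ (built from LTL formula $\phi$), a product-POSG is $\mathcal{P} = (S^{\mathcal{P}}, s_0^{\mathcal{P}}, U_D, U_A, T^{\mathcal{P}}, \mathcal{O}_D, \mathcal{O}_A, O^{\mathcal{P}}, F^{\mathcal{P}})$, where $S^{\mathcal{P}} = S \times Q$, $s_0^{\mathcal{P}} = (s_0, \delta(q_0, \mathcal{L}(s_0)))$, $T^{\mathcal{P}}((s', q')|(s, q), u_D, u_A) = T(s'|s, u_D, u_A)$ if $\delta(q, \mathcal{L}(s')) = q'$, and $0$ otherwise, $O^{\mathcal{P}}(o|(s, q)) = O(o|s)$, $F^{\mathcal{P}} = \{(L^{\mathcal{P}}(1), K^{\mathcal{P}}(1)), \ldots, (L^{\mathcal{P}}(M), K^{\mathcal{P}}(M))\}$, and $(s, q) \in L^{\mathcal{P}}(i)$ iff $q \in L(i)$, $(s, q) \in K^{\mathcal{P}}(i)$ iff $q \in K(i)$, for $i = 1, \ldots, M$.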
From the above definition, the acceptance conditions in the product-POSG depend on the DRA, while the transition probabilities of the product-POSG are determined by those of the original POSG. Therefore, a run on the product-POSG can be used to generate a path on the POSG and a run on the DRA. If the run on the DRA is accepting, we say that the product-POSG satisfies the LTL specification $\phi$.
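The product construction is mechanical once the POSG transitions, the labeling function, and the DRA transition function are tabulated. A minimal sketch follows, with hypothetical dictionary encodings and function names. Observation probabilities depend only on the POSG component of a product state, so they are inherited unchanged and omitted here. The initial product state is taken as the POSG initial state paired with the DRA state reached after reading the label of that state, which is a common convention and an assumption of this sketch.

```python
# Hypothetical encodings:
#   T[(s, u_D, u_A)] = {s': T(s'|s, u_D, u_A)}   (POSG transitions)
#   label[s]         = atomic propositions true in s (the labeling function)
#   delta(q, letter) = next DRA state
def product_transitions(T, label, delta, dra_states):
    """T^P[((s, q), u_D, u_A)] = {(s', delta(q, label(s'))): T(s'|s, u_D, u_A)}."""
    TP = {}
    for (s, u_D, u_A), dist in T.items():
        for q in dra_states:
            TP[((s, q), u_D, u_A)] = {(s2, delta(q, label[s2])): prob
                                      for s2, prob in dist.items()}
    return TP


def product_initial_state(s0, q0, label, delta):
    """Pair s0 with the DRA state reached after reading the label of s0."""
    return (s0, delta(q0, label[s0]))


def lift_pairs(pairs, product_states):
    """(s, q) is in L^P(i) (resp. K^P(i)) iff q is in L(i) (resp. K(i))."""
    return [({x for x in product_states if x[1] in L},
             {x for x in product_states if x[1] in K})
            for L, K in pairs]


def delta(q, letter):
    """DRA for 'b holds infinitely often': state 1 records that b was just seen."""
    return 1 if "b" in letter else 0


label = {"s0": frozenset(), "s1": frozenset({"b"})}
T = {("s0", "d", "a"): {"s0": 0.5, "s1": 0.5},
     ("s1", "d", "a"): {"s0": 1.0}}
TP = product_transitions(T, label, delta, dra_states=[0, 1])
init = product_initial_state("s0", 0, label, delta)
pairs_P = lift_pairs([(set(), {1})], [(s, q) for s in label for q in (0, 1)])
print(TP[(("s0", 0), "d", "a")])  # {('s0', 0): 0.5, ('s1', 1): 0.5}
```

Only the DRA component is updated deterministically, so each product transition keeps exactly the probability of the underlying POSG transition.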
III Problem Setup
This section details the construction of FSCs for the two agents. An FSC for an agent can be interpreted as a policy for that agent. The defender and adversary policies are determined by probability distributions over transitions in finite state controllers that are composed with the POSG. When the FSCs are composed with the product-POSG, the resulting entity is a Markov chain. We then establish a way to determine satisfaction of an LTL specification on the product-POSG in terms of runs on the composed Markov chain.
III-A Finite State Controllers
Finite state controllers consist of a finite number of internal states. Transitions between these states are governed by the current observation of the agent. In our setting, we have two FSCs, one for the defender and another for the adversary. We then limit the search for defender and adversary policies to a search over FSCs of fixed cardinality.
Definition III.1 (Finite State Controller)
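A finite state controller for the defender (adversary), denoted $\mathcal{C}_D$ ($\mathcal{C}_A$), is a tuple $\mathcal{C} = (G, g_0, \omega)$, where $G$ is a finite set of (internal) states of the controller, $g_0 \in G$ is the initial state of the FSC, and $\omega: G \times \mathcal{O} \times G \times U \to [0,1]$, written $\omega(g', u|g, o)$, is a probability distribution of the next internal state and action, given a current internal state and observation. Here, $\mathcal{O} = \mathcal{O}_D$ and $U = U_D$ for the defender, and $\mathcal{O} = \mathcal{O}_A$ and $U = U_A$ for the adversary; we write $\mathcal{C}_D = (G_D, g_0^D, \omega_D)$ and $\mathcal{C}_A = (G_A, g_0^A, \omega_A)$.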
An FSC is a finite-state probabilistic automaton that takes the current observation of the agent as its input, and produces a distribution over the actions as its output. The FSC-based control policy is defined as follows. The initial states of the FSCs are determined by the initial state of the POSG. The defender commits to a policy at the start. At each time step, the policy returns a distribution over the actions and the next state of the defender FSC, given the current state of the FSC and the state of the POSG as observed through the observation function. The adversary, aware of this policy, observes the state through the observation function and responds with an action generated by the adversary FSC. Actions at each step are taken concurrently.
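A single FSC can be sketched as a small stochastic automaton that, on each observation, samples its next internal state and action jointly from $\omega(\cdot, \cdot | g, o)$. The table encoding of $\omega$ as (probability, next internal state, action) triples and the class name `FiniteStateController` are assumptions for illustration. The method `structure()` returns the entries of $\omega$ with positive probability, i.e., the kind of FSC structure that Section V manipulates.

```python
import random
from typing import Dict, Hashable, List, Set, Tuple

Node = Hashable
Obs = Hashable
Action = Hashable
# omega[(g, o)] = [(probability, g_next, action), ...] encodes omega(g', u | g, o)
Omega = Dict[Tuple[Node, Obs], List[Tuple[float, Node, Action]]]


class FiniteStateController:
    def __init__(self, g0: Node, omega: Omega, tol: float = 1e-9):
        for key, outcomes in omega.items():
            if abs(sum(prob for prob, _, _ in outcomes) - 1.0) > tol:
                raise ValueError(f"omega(. | {key}) is not a probability distribution")
        self.g = g0
        self.omega = omega

    def step(self, observation: Obs, rng: random.Random) -> Action:
        """Sample (g', u) ~ omega(. | g, o), move to g', and return the action u."""
        outcomes = self.omega[(self.g, observation)]
        weights = [prob for prob, _, _ in outcomes]
        _, g_next, action = rng.choices(outcomes, weights=weights)[0]
        self.g = g_next
        return action

    def structure(self) -> Set[Tuple[Node, Obs, Node, Action]]:
        """Tuples (g, o, g', u) that omega allows with positive probability."""
        return {(g, o, g2, u) for (g, o), outs in self.omega.items()
                for prob, g2, u in outs if prob > 0}


# A two-node controller: on observation "o1", node 0 plays "left" and node 1 plays "right".
fsc = FiniteStateController(0, {
    (0, "o1"): [(1.0, 1, "left")],
    (1, "o1"): [(1.0, 0, "right")],
    (0, "o2"): [(0.5, 0, "left"), (0.5, 1, "right")],
    (1, "o2"): [(1.0, 1, "stay")],
})
rng = random.Random(0)
actions = [fsc.step("o1", rng) for _ in range(3)]
print(actions)  # ['left', 'right', 'left']
```

Although the controller reacts only to the current observation, its internal node acts as a finite memory: the same observation "o1" yields alternating actions.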
Definition III.2 (Proper FSCs)
An FSC is proper with respect to an LTL formula $\phi$ if there is a positive probability of satisfying $\phi$ under this policy in an environment represented as a partially observable stochastic game.
This is similar to the definition in [hansen2003synthesis], with the distinction that the terminal state of an FSC in that context is related here to the Rabin acceptance pairs of the Markov chain formed by composing the defender and adversary FSCs with the product-POSG (Section III-B).
III-B The Global Markov Chain
The defender and adversary FSCs, when composed with the product-POSG, result in a finite-state, fully observable Markov chain. To maintain consistency with the literature, we will refer to this as the global Markov chain (GMC) [sharan2014finite].
Definition III.3 (Global Markov Chain (GMC))
The GMC resulting from a productPOSG controlled by FSCs and is , where , , is given by Equation (1), and .
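Similar to the product-POSG, the Rabin acceptance condition for the GMC $\mathcal{M}$ is $F^{\mathcal{M}} = \{(L^{\mathcal{M}}(1), K^{\mathcal{M}}(1)), \ldots, (L^{\mathcal{M}}(M), K^{\mathcal{M}}(M))\}$, where a GMC state $(s, q, g_D, g_A)$ belongs to $L^{\mathcal{M}}(i)$ iff $(s, q) \in L^{\mathcal{P}}(i)$, and belongs to $K^{\mathcal{M}}(i)$ iff $(s, q) \in K^{\mathcal{P}}(i)$.
A state of the GMC is $s^{\mathcal{M}} = (s, q, g_D, g_A)$, where $(s, q)$ is a state of the product-POSG and $g_D$ and $g_A$ are internal states of the defender and adversary FSCs. A path on the GMC is a sequence $s^{\mathcal{M}}_0 s^{\mathcal{M}}_1 \ldots$ such that $T^{\mathcal{M}}(s^{\mathcal{M}}_{k+1}|s^{\mathcal{M}}_k) > 0$ for all $k \geq 0$, where $T^{\mathcal{M}}$ is the transition probability in the GMC. The path is accepting if it satisfies the Rabin acceptance condition. This corresponds to an execution in the product-POSG controlled by the defender and adversary FSCs.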
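Equation (1), which gives the GMC transition probabilities, is not legible in this version. The sketch below assumes the form suggested by the description at the start of Section VI: each FSC samples its next internal state and action conditioned on its current internal state and observation, and (our assumption) the two agents' observations are drawn independently given the current product-POSG state, so that
$T^{\mathcal{M}}((s', g_D', g_A')|(s, g_D, g_A)) = \sum_{o_D, o_A, u_D, u_A} O(o_D|s)\,O(o_A|s)\,\omega_D(g_D', u_D|g_D, o_D)\,\omega_A(g_A', u_A|g_A, o_A)\,T^{\mathcal{P}}(s'|s, u_D, u_A)$,
where $s$ and $s'$ stand for product-POSG states $(s, q)$. The function name and encodings are hypothetical.

```python
from collections import defaultdict
from itertools import product


def gmc_row(s, g_D, g_A, TP, obs_D, obs_A, omega_D, omega_A):
    """Outgoing transition probabilities of GMC state (s, g_D, g_A).

    Assumed encodings (hypothetical):
      TP[(s, u_D, u_A)] = {s': T^P(s'|s, u_D, u_A)}  (s ranges over product-POSG states)
      obs_D[s] = {o_D: O(o_D|s)},  obs_A[s] = {o_A: O(o_A|s)}
      omega_X[(g, o)] = [(prob, g_next, action), ...]
    """
    row = defaultdict(float)
    for (o_D, p_oD), (o_A, p_oA) in product(obs_D[s].items(), obs_A[s].items()):
        for w_D, g_D2, u_D in omega_D[(g_D, o_D)]:
            for w_A, g_A2, u_A in omega_A[(g_A, o_A)]:
                weight = p_oD * p_oA * w_D * w_A
                for s2, p_T in TP[(s, u_D, u_A)].items():
                    row[(s2, g_D2, g_A2)] += weight * p_T
    return dict(row)


# Example: one observation per agent; the adversary randomizes between two actions.
TP = {("x", "d", "a1"): {"x": 1.0}, ("x", "d", "a2"): {"y": 1.0}}
obs_D = {"x": {"oD": 1.0}}
obs_A = {"x": {"oA": 1.0}}
omega_D = {(0, "oD"): [(1.0, 0, "d")]}
omega_A = {(0, "oA"): [(0.5, 0, "a1"), (0.5, 0, "a2")]}
row = gmc_row("x", 0, 0, TP, obs_D, obs_A, omega_D, omega_A)
print(row)  # {('x', 0, 0): 0.5, ('y', 0, 0): 0.5}
```

Because every factor is a probability distribution, each row sums to one, so the composition is indeed a Markov chain with no remaining choices.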
III-C Problem Statement
The goal is to synthesize a defender policy that maximizes the probability of satisfying an LTL specification under any adversary policy. This probability depends on both the defender and the adversary FSCs. In this paper, we assume that the size of the adversary FSC is fixed and known to the defender. This can be interpreted as one way for the defender to have knowledge of the capabilities of an adversary. Future work will consider the problem for FSCs of arbitrary sizes. The problem is formally defined below.
Problem III.4
Given a partially observable environment and an LTL formula, determine a defender policy specified by an FSC that maximizes the probability of satisfying the LTL formula under any adversary policy that is represented as an FSC with a fixed number of internal states. That is,
(2) 
Optimizing over and indicates that the solution will depend on , , and .
IV LTL Satisfaction and Recurrent Sets
The first result in this section expresses the probability that the product-POSG satisfies the LTL specification in terms of the recurrent sets of the GMC. We then present a procedure to generate recurrent sets of the GMC that additionally satisfy the LTL formula. The main result of this section relates Problem III.4 to determining FSCs that maximize the probability of reaching certain types of recurrent sets of the GMC.
Let denote the recurrent states of under FSCs and . Let be the restriction of a recurrent state to a state of .
Proposition IV.1
if and only if there exists such that for any , there exists a Rabin acceptance pair and an initial state of , , where the following conditions hold:
(3)  
Proof:
If for every , at least one of the conditions in Equation (3) does not hold, then at least one of the following statements is true: i): no state that has to be visited infinitely often is recurrent; ii): there is no initial state from which a recurrent state that has to be visited infinitely often is accessible; iii): some state that has to be visited only finitely often in steady state is recurrent. This means for all .
Conversely, if all the conditions in Equation (3) hold for some , then by construction.
To quantify the satisfaction probability for a defender policy under any adversary policy, assume that the recurrent states of are partitioned into recurrence classes . This partition is maximal, in the sense that two recurrent classes cannot be combined to form a larger recurrent class, and all states within a given recurrent class communicate with each other [sharan2014formal].
Definition IV.2 (Feasible Recurrent Set)
A recurrent set is feasible under FSCs and if there exists such that and . Let denote the set of feasible recurrent sets under the respective FSCs.
Let be the event that a path of will reach a recurrent set. Algorithm 1 returns feasible recurrent sets of under fixed FSCs .
We have the following result:
Theorem IV.3
The probability of satisfying an LTL formula in a POSG with policies and is equal to the probability of paths in the GMC (under the same FSCs) reaching feasible recurrent sets. That is,
(4) 
Proof:
Since the recurrence classes are maximal, . From Definition IV.2, a feasible recurrent set will necessarily contain a Rabin acceptance pair. Therefore, the probability of satisfying the LTL formula under and is equivalent to the probability of paths on leading to feasible recurrent sets, which is given by Equation (4).
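Theorem IV.3 reduces the satisfaction probability to a reachability probability in the GMC. A minimal sketch of that computation follows, assuming the recurrent classes have already been computed and taking a class to be feasible when, for some Rabin pair $(L, K)$, it contains no state of $L$ and at least one state of $K$ (our reading of the conditions of Proposition IV.1). Names and encodings are hypothetical. Transient states are handled by iterating $x(s) = \sum_t T(t|s)\,x(t)$ from zero, a standard way to compute absorption probabilities, rather than by a linear solve.

```python
from typing import Dict, FrozenSet, Hashable, List, Set, Tuple

State = Hashable


def satisfaction_probability(P: Dict[State, Dict[State, float]],
                             recurrent: List[FrozenSet[State]],
                             pairs: List[Tuple[Set[State], Set[State]]],
                             s0: State, tol: float = 1e-12,
                             max_iter: int = 100_000) -> float:
    """Probability of reaching a feasible recurrent set from s0 in a finite Markov chain."""
    # Assumed feasibility test: some pair (L, K) with R disjoint from L and meeting K.
    feasible = [R for R in recurrent
                if any(not (R & set(L)) and bool(R & set(K)) for L, K in pairs)]
    target = set().union(*feasible)
    absorbed_elsewhere = set().union(*recurrent) - target
    x = {s: 1.0 if s in target else 0.0 for s in P}
    # Iterating x(s) = sum_t P(s, t) x(t) on transient states, starting from zero,
    # increases monotonically to the absorption probabilities.
    for _ in range(max_iter):
        change = 0.0
        for s in P:
            if s in target or s in absorbed_elsewhere:
                continue
            new = sum(prob * x[t] for t, prob in P[s].items())
            change = max(change, abs(new - x[s]))
            x[s] = new
        if change < tol:
            break
    return x[s0]


P = {0: {0: 0.5, 1: 0.25, 3: 0.25}, 1: {2: 1.0}, 2: {1: 1.0}, 3: {3: 1.0}}
recurrent = [frozenset({1, 2}), frozenset({3})]
prob = satisfaction_probability(P, recurrent, pairs=[(set(), {2})], s0=0)
print(round(prob, 6))  # 0.5
```

From state 0 the chain is absorbed in $\{1, 2\}$ or $\{3\}$ with equal probability, and only $\{1, 2\}$ meets the set that must be visited infinitely often.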
Corollary IV.4
From Theorem IV.3, it follows that:
(5) 
V Determining Candidate FSCs of Fixed Sizes
(6) 
(7) 
If the sizes of the defender and adversary FSCs are fixed, then their design is equivalent to determining the transition probabilities between their internal states. In this section, we present a heuristic procedure that uses only the most recent observations of the defender and adversary to generate a set of admissible FSC structures such that the resulting GMC has a feasible recurrent set. We show that the procedure has a computational complexity that is polynomial in the number of states of the GMC, and establish that the algorithm is sound.
Definition V.1
An algorithm is sound if every solution it returns is correct, i.e., every output satisfies the specification of the problem. It is complete if it returns a solution for every input for which a solution exists, and reports ‘failure’ if no solution exists.
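Let $\mathbb{I}_D(g, o, g', u) \in \{0, 1\}$ equal $1$ if and only if the defender FSC assigns positive probability to moving from internal state $g$ to internal state $g'$ while issuing action $u$ when the observation is $o$; the structure $\mathbb{I}_A$ of the adversary FSC is defined analogously. Thus, a structure shows whether an observation can enable the transition from an FSC state $g$ to $g'$ while issuing action $u$. We also assume that for every pair $(g, o)$, there exists a pair $(g', u)$ such that the corresponding structure entry equals $1$ [sharan2014formal].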
In Algorithm 2, for defender and adversary FSCs with a fixed number of states, we determine candidate structures for the two FSCs such that the resulting GMC has a feasible recurrent set. We start with initial candidate structures and induce the digraph of the resulting GMC (Line 1). In our case, the initial structures allow every transition, i.e., every entry of each structure equals one. We first determine the set of communicating classes of the GMC, which is equivalent to determining the strongly connected components (SCCs) of the induced digraph (Line 3). A communicating class is recurrent if and only if it is a sink SCC of the induced digraph. For a candidate class, the set of states to be avoided contains those states of the class that belong to the set of the Rabin acceptance pair that has to be visited only finitely many times, and that therefore have zero probability in steady state (Line 6). This set further contains the states outside the class that can be transitioned to from some state in the class, because once the system transitions out of the class, it will not be able to return to it in order to satisfy the Rabin acceptance condition, and the class would not be recurrent (Line 5). The set of states to be visited infinitely often contains those states of the class that need to be visited infinitely often according to the Rabin acceptance condition (Line 7).
The agents have access to a state only via their observations. A defender action is forbidden if there exists an adversary action that allows a transition to a state in the set of states to be avoided under the current defender and adversary observations. This is achieved by setting the corresponding entries in the defender FSC structure to zero (Lines 12–17). An adversary action is not useful if, for every defender action, the probability of transitioning to a state in the set of states to be avoided is nonzero under the current observations. This is achieved by setting the corresponding entry in the adversary FSC structure to zero (Lines 18–23).
Proposition V.2
Define and . Then, Algorithm 2 has an overall computational complexity of .
Proof:
The overall complexity depends on: (i) Determining strongly connected components (Line 3): This can be done in [tarjan1972depth]. Since and , this is in the worst case, and (ii) Determining the structures in Lines 926: This is . The result follows by combining the two terms.
Proposition V.3
Algorithm 2 is sound.
Proof:
This is by construction. The output of the algorithm is a set of FSC structures such that the resulting GMC in each case has a state that is recurrent and has to be visited infinitely often. By Definition IV.2, this state belongs to a feasible recurrent set. Moreover, if the algorithm returns a nonempty solution, a solution to Problem III.4 exists, since the FSCs are proper.
Algorithm 2 is suboptimal, since it only considers the most recent observations of the defender and adversary. It is also not complete, since there might be a feasible solution that the algorithm cannot determine. If Algorithm 2 returns no FSC structures of a particular size, a heuristic is to increase the number of states in the defender FSC by one and run the algorithm again. Once we obtain proper FSC structures of fixed sizes, we show in Section VII that the satisfaction probability can be improved by adding states to the defender FSC in a principled manner (for adversary FSCs of fixed size). Algorithm 2 and Proposition V.3 allow us to restrict our treatment to proper FSCs for the rest of the paper.
VI Value Iteration for POSGs
In this section, we present a value-iteration based procedure to maximize the probability of satisfying the LTL formula for defender and adversary FSCs of fixed sizes. We prove that the procedure converges to a unique optimal value, corresponding to the Stackelberg equilibrium.
Notice that in Equation (1), the defender and adversary policies are specified as probability distributions over the next FSC internal state and the respective agent action, and conditioned on the current FSC internal state and the agent observation. With , we rewrite these in terms of a mapping :
(8) 
This will allow us to express Equation (1) as:
(9) 
Define a value function over the state space of the GMC that represents the probability of satisfying the LTL formula when starting from a state of the GMC. Additionally, define and characterize the following operators:
where is the transition probability in the GMC induced by policies and (Equation (9)).
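The operators of this section act on FSC parameters under partial observability, and their displayed definitions are not legible in this version. As a much simpler, swapped-in illustration of the monotone convergence argument used in Lemma VI.3, the sketch below runs value iteration on a fully observed game with pure actions, in which the defender commits to an action at each state and the adversary best-responds: $V_{k+1}(s) = \max_{u_D} \min_{u_A} \sum_{s'} T(s'|s, u_D, u_A)\,V_k(s')$, with $V \equiv 1$ on a target set standing in for the feasible recurrent sets. Starting from the indicator of the target set, the sequence is nondecreasing and bounded by 1, so it converges by the Monotone Convergence Theorem. This is not the procedure of this paper; the state names, transition table, and function name are hypothetical.

```python
def max_min_value_iteration(states, targets, actions_D, actions_A, T,
                            tol=1e-12, max_iter=10_000):
    """V_{k+1}(s) = max_{u_D} min_{u_A} sum_{s'} T(s'|s,u_D,u_A) V_k(s'), V = 1 on targets.

    Returns the final values and the sequence V_0, V_1, ... for inspection.
    """
    V = {s: 1.0 if s in targets else 0.0 for s in states}
    history = [V]
    for _ in range(max_iter):
        new = {}
        for s in states:
            if s in targets:
                new[s] = 1.0
                continue
            # Defender commits to u_D; adversary picks the worst-case response u_A.
            new[s] = max(
                min(sum(prob * V[t] for t, prob in T[(s, u_D, u_A)].items())
                    for u_A in actions_A)
                for u_D in actions_D)
        history.append(new)
        converged = max(abs(new[s] - V[s]) for s in states) < tol
        V = new
        if converged:
            break
    return V, history


states = ["start", "goal", "trap"]
T = {("start", "safe", "attack"): {"goal": 0.4, "start": 0.3, "trap": 0.3},
     ("start", "safe", "idle"): {"goal": 0.8, "start": 0.2},
     ("start", "risky", "attack"): {"trap": 1.0},
     ("start", "risky", "idle"): {"goal": 1.0}}
for u_D in ("safe", "risky"):
    for u_A in ("attack", "idle"):
        T[("trap", u_D, u_A)] = {"trap": 1.0}  # trap is absorbing and never satisfies
V, history = max_min_value_iteration(states, {"goal"}, ["safe", "risky"],
                                     ["attack", "idle"], T)
print(round(V["start"], 6))  # 0.571429, i.e. 4/7
```

The adversary always answers with "attack", so the defender's best value solves $V = 0.4 + 0.3V$; the printed history increases toward $4/7$ without ever exceeding it.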
Proposition VI.1
Before proving Proposition VI.1, we need some intermediate results. Inequalities in their proofs hold elementwise.
Theorem VI.2
[Monotone Convergence Theorem] [royden2010real] If a sequence of real numbers $\{a_n\}$ is monotone increasing and bounded from above, then it is a convergent sequence.
Lemma VI.3
Let be the satisfaction probability obtained under any pair of policies and , where is the best response to . Let be the operation that composes the operator times, and be the corresponding value obtained (i.e., ). Then, there exists a value such that .
Proof:
We prove Lemma VI.3 by showing that the sequence of values obtained by repeatedly applying the operator is bounded and monotone.
We first show boundedness. By definition of the operator , is obtained as a convex combination of . Since is the satisfaction probability, it is in . Thus, is bounded, and consequently, is bounded for all .
We next show monotonicity by induction. We have that is the value function associated with a control policy . Denote the best response of the adversary to as . Let . From the definitions of and , we have . Furthermore, since