## I Introduction

Cyber-physical systems (CPSs) are complex entities in which the working of a physical system is governed by interactions with computing devices and algorithms. These systems are ubiquitous [1], and vary in scale from power systems to medical devices and robots. In applications like self-driving cars and robotics, the systems are expected to work in dynamically changing and potentially dangerous environments with a large degree of autonomy. A natural question to ask before solving a problem in this domain is the means by which the environment, goals, and constraints, if any, are specified.

Markov decision processes (MDPs) [2, 3] have been used to model environments where outcomes depend both on an inherent randomness in the model (transition probabilities) and on an action taken by an agent. These models have been used extensively in applications, including robotics [4] and unmanned aircraft [5]. Formal methods [6] are a means to verify the behavior of complex models against a rich set of specifications [7]. Linear temporal logic (LTL) is a particularly well-understood framework for expressing properties like safety, liveness, and priority [8, 9]. These properties can then be verified using off-the-shelf model checkers [10, 11].

The system might be the target of malicious attacks aimed at preventing it from reaching a goal. An attack can be carried out on the physical system, on the computers that control the working of the system, or on communication channels between components of the system. Such attacks have been reported across multiple application domains, including power systems [12], automobiles [13], water networks [14], and nuclear reactors [15]. Therefore, strategies designed only to address modeling and sensing errors and uncertainties may not be optimal in the presence of an intelligent adversary who can manipulate the operation of the system.

Prior work on verifying the satisfaction of an LTL formula over an MDP or a stochastic game assumes that the states are fully observable. In many practical scenarios, this may not be the case. For example, a robot might only have an estimate of its current location based on the output of a vision sensor [16]. This necessitates the use of a framework that accounts for partial observability. For the single-agent case, partially observable Markov decision processes (POMDPs) can be used to model the problem. However, partial observability is a serious limitation in determining an 'optimal policy' for an agent, which motivates techniques for computing approximate solutions. Heuristics for approximately solving POMDPs include belief replanning, most-likely-belief-state policies, entropy weighting [17, 18], grid-based methods [19], and point-based methods [20].

A large body of work studies classes of problems relevant to this paper (see Sec. VI).
These can be divided into three broad categories: *i)*: synthesis of strategies for systems represented as an MDP that must additionally satisfy a temporal logic (TL) formula; *ii)*: synthesis of strategies for POMDPs; *iii)*: synthesis of defender and adversary strategies for an MDP under a TL constraint.
While there has been recent work on the synthesis of controllers for POMDPs under TL specifications, it has largely been restricted to the single-agent case, and does not address the setting where there is an adversary with a competing objective.

In this paper, we study the problem of determining strategies for an agent that has to satisfy an LTL formula in the presence of an adversary in a partially observable environment. The defender and adversary take actions simultaneously, and these jointly influence the transitions of the system. Our approach is motivated by the treatment in [21] and [22], which propose the synthesis of parameterized finite state controllers (FSCs) for a POMDP in order to maximize the probability of satisfying an LTL formula. This is an approximate strategy, since it refrains from using the entire observation and action histories and uses only the most recent observation to determine an action. Although this restricts the class of policies searched over, FSCs are attractive because they can be used to solve the average-reward problem over the infinite horizon [22].

### I-a Contributions

We extend this setting to include an adversary who is also limited in that it cannot exactly observe the state. The adversary's policy is determined by an FSC whose goal is opposite to that of the defender. The goal of the defender is to synthesize a policy that maximizes the probability of satisfying an LTL formula under any adversary policy. We show that this is equivalent to maximizing, under any adversary policy, the probability of reaching a recurrent set of a Markov chain that additionally contains states that must be reached in order to satisfy the LTL formula. The search for policies involves optimizing over both the size of the FSC and its parameters (transition probabilities). We present a procedure for determining defender and adversary FSCs of fixed sizes that satisfy the LTL formula with nonzero probability. The search for a defender policy that maximizes the probability of satisfying the LTL formula under any adversary policy is then reduced to a search over these FSCs of fixed size. If the FSCs are parameterized in an appropriate way, the problem lends itself to gradient-based optimization techniques.

### I-B Outline

A quick introduction to LTL and partially observable stochastic games (POSGs) is given in Section II. We set up our problem in Section III, where we first define FSCs for the two agents, and show how they can be composed with a POSG to yield a Markov chain. Section IV presents our main results relating LTL satisfaction on a POSG to reaching recurrent sets of a Markov chain, and a procedure to determine candidate FSCs. An illustrative example is presented in Section V. Section VI summarizes related work in POMDPs and TL satisfaction on MDPs, and Section VII concludes the paper and points to future directions of research.

## II Preliminaries

In this section, we give a concise introduction to linear temporal logic and partially observable stochastic games. We then detail the construction of a product automaton that will be used to verify that runs of a POSG satisfy an LTL formula.

### II-A Linear Temporal Logic

A *linear temporal logic (LTL) formula* [6] is defined over a set of atomic propositions $AP$, and can be inductively written as:

$$\varphi := \text{true} \mid a \mid \neg\varphi \mid \varphi_1 \wedge \varphi_2 \mid \mathbf{X}\varphi \mid \varphi_1 \mathbf{U} \varphi_2$$

Here, $a \in AP$, and $\mathbf{X}$ and $\mathbf{U}$ are temporal operators denoting the *next* and *until* operations respectively.

The semantics of LTL are defined over (infinite) words in $(2^{AP})^\omega$, and we write $\sigma \models \varphi$ when a trace $\sigma = \sigma_0\sigma_1\ldots$ satisfies an LTL formula $\varphi$. Further, let $\sigma[i:]$ denote the suffix of $\sigma$ starting at position $i$. Then, $\sigma \models \text{true}$ always holds; $\sigma \models a$ if and only if (iff) $a \in \sigma_0$; $\sigma \models \neg\varphi$ iff $\sigma \not\models \varphi$; $\sigma \models \varphi_1 \wedge \varphi_2$ iff $\sigma \models \varphi_1$ and $\sigma \models \varphi_2$; $\sigma \models \mathbf{X}\varphi$ iff $\sigma[1:] \models \varphi$; $\sigma \models \varphi_1\mathbf{U}\varphi_2$ iff there exists $i \geq 0$ such that $\sigma[i:] \models \varphi_2$ and $\sigma[j:] \models \varphi_1$ for all $j < i$.

Further, the logic admits derived formulas of the form:
*i)*: $\varphi_1 \vee \varphi_2 = \neg(\neg\varphi_1 \wedge \neg\varphi_2)$;
*ii)*: $\varphi_1 \Rightarrow \varphi_2 = \neg\varphi_1 \vee \varphi_2$;
*iii)*: $\mathbf{F}\varphi = \text{true} \, \mathbf{U} \, \varphi$ (*eventually*);
*iv)*: $\mathbf{G}\varphi = \neg\mathbf{F}\neg\varphi$ (*always*).

###### Definition II.1

A *deterministic Rabin automaton (DRA)* is a quintuple $\mathcal{R} = (Q, \Sigma, \delta, q_0, F)$ where $Q$ is a nonempty finite set of states, $\Sigma$ is a finite alphabet, $\delta: Q \times \Sigma \rightarrow Q$ is a transition function, $q_0 \in Q$ is the initial state, and $F = \{(Fail(i), Inf(i))\}_{i=1}^{M}$ is such that $Fail(i), Inf(i) \subseteq Q$ for all $i$, and $M$ is a positive integer.

A *run* of $\mathcal{R}$ is an infinite sequence of states $q_0q_1\ldots$ such that $q_{k+1} = \delta(q_k, \sigma_k)$ for all $k$ and for some $\sigma_k \in \Sigma$.
The run is *accepting* if there exists $i \in \{1, \ldots, M\}$ such that the run intersects with $Fail(i)$ finitely many times, and with $Inf(i)$ infinitely often.
An LTL formula $\varphi$ over $AP$ can be represented by a DRA with alphabet $2^{AP}$ that accepts all and only those runs that satisfy $\varphi$.
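The acceptance condition above can be made concrete on a lasso-shaped run, i.e. a finite prefix followed by a cycle repeated forever. Since the states visited infinitely often are exactly those on the cycle, a Rabin pair accepts iff the cycle avoids its $Fail$ set and intersects its $Inf$ set. A small sketch in our own (hypothetical) encoding, not taken from the paper:

```python
# Checking the Rabin condition on a lasso run prefix.(cycle)^omega.
# States visited infinitely often = states on the cycle, so a pair
# (fail, inf) accepts iff the cycle avoids `fail` and meets `inf`.

def rabin_accepts(prefix, cycle, pairs):
    """pairs: iterable of (fail, inf) pairs, each a set of DRA states."""
    cyc = set(cycle)
    return any(not (cyc & fail) and bool(cyc & inf) for fail, inf in pairs)
```

Note that the prefix is irrelevant to acceptance: finitely many visits to a $Fail$ state are permitted.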

### II-B Partially Observable Stochastic Games

###### Definition II.2

A *stochastic game* [23] is a tuple $\mathcal{G} = (S, s_0, U_d, U_a, \mathcal{T}, AP, \mathcal{L})$.
$S$ is a finite set of states, $s_0$ is the initial state, and $U_d$ and $U_a$ are the finite sets of actions of the defender and adversary.
$\mathcal{T}$ encodes $\mathcal{T}(s' \mid s, u_d, u_a)$, the probability of transition from a state $s$ to a state $s'$ when defender and adversary actions are $u_d$ and $u_a$ respectively.
$AP$ is a set of atomic propositions, and $\mathcal{L}: S \rightarrow 2^{AP}$ is a labeling function that maps a state to the subset of atomic propositions that are satisfied in that state.

A stochastic game can thus be viewed as an extension of Markov decision processes (MDPs) to the case when there is more than one player taking an action.

When $|U_d| = 1$ and $|U_a| = 1$, $\mathcal{G}$ is a *Markov chain (MC)*.
For $s, s' \in S$, $s'$ is *accessible* from $s$, written $s \rightarrow s'$, if $\mathcal{T}(s_{i+1} \mid s_i) > 0$ for some (finite sequence of) states $s = s_1, s_2, \ldots, s_n = s'$.
Equivalently, $s \rightarrow s'$ if there is a positive probability of reaching $s'$ from $s$ in a finite number of steps.
Two states *communicate*, written $s \leftrightarrow s'$, if $s \rightarrow s'$ and $s' \rightarrow s$.
*Communicating classes* of states cover the state space of the MC.
A state is *transient* if there is a nonzero probability of not returning to it when we start from that state, and is *positive recurrent* otherwise.
If some state in a communicating class is recurrent (transient), then the same holds for all other states in that class.
Moreover, in a finite-state MC, every state is either transient or positive recurrent.
We refer the reader to [24] for a detailed exposition.
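The classification above is computable directly from the digraph of nonzero transitions: a communicating class is recurrent exactly when no transition leaves it (it is a "sink" class). A minimal sketch, using our own helper names rather than anything from the paper:

```python
# Recurrent classes of a finite MC from the digraph of nonzero transitions.
# Two states communicate iff each is accessible from the other; a
# communicating class is recurrent iff it is closed (no edge leaves it).

def reachable(adj, s):
    """All states accessible from s (including s itself)."""
    seen, stack = {s}, [s]
    while stack:
        for t in adj.get(stack.pop(), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def recurrent_classes(adj):
    states = set(adj) | {t for ts in adj.values() for t in ts}
    reach = {s: reachable(adj, s) for s in states}
    classes, done = [], set()
    for s in states:
        if s in done:
            continue
        cls = {t for t in reach[s] if s in reach[t]}  # communicating class of s
        done |= cls
        if all(t in cls for u in cls for t in adj.get(u, ())):  # closed?
            classes.append(cls)
    return classes
```

For instance, in the chain `{0: [1], 1: [0, 2], 2: [2]}` the class `{0, 1}` has an edge leaving it and is therefore transient, while `{2}` is recurrent.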

Partially observable stochastic games (POSGs) extend Definition II.2 to the case when states may not be observable, and each agent could observe the state according to a different observation function. This can be viewed as an extension of POMDPs to the case when there is more than one player.

###### Definition II.3

A *partially observable stochastic game* is $\mathcal{G}_{PO} = (S, s_0, U_d, U_a, \mathcal{T}, AP, \mathcal{L}, O_d, O_a, \mathcal{O}_d, \mathcal{O}_a)$, where $S, s_0, U_d, U_a, \mathcal{T}, AP, \mathcal{L}$ are as in Definition II.2. $O_d$ and $O_a$ denote the (finite) sets of observations available to the defender and adversary. $\mathcal{O}_i$ encodes $\mathcal{O}_i(o_i \mid s)$, the probability of agent $i \in \{d, a\}$ observing $o_i \in O_i$ when the state is $s$.

The functions $\mathcal{O}_d, \mathcal{O}_a$ can be viewed as a means to model imperfect sensing. Then, we have $\sum_{o_i \in O_i} \mathcal{O}_i(o_i \mid s) = 1$ for each $s \in S$.

The information available to agent $i$ until time $t$, denoted $I_t^i$, can be inductively defined as: $I_0^i = \{o_0^i\}$, $I_t^i = \{I_{t-1}^i, u_{t-1}^i, o_t^i\}$. The overall information is $I_t = \{I_t^d, I_t^a\}$.

###### Definition II.4

A *(defender or adversary) policy* $\mu_i$ for the POSG is a map from the overall information available to that agent to a probability distribution over the respective action space, i.e. $\mu_i: I_t^i \rightarrow \mathcal{P}(U_i)$, where $i \in \{d, a\}$ and $\mathcal{P}(U_i)$ denotes the set of probability distributions over $U_i$. Policies of the form above are called *randomized policies*.
If $\mu_i(I_t^i)$ places all its mass on a single action, it is called a *deterministic policy*.

In this paper, defender and adversary policies will be determined by probability distributions over transitions in finite state controllers (Sec. III-A) that are composed with the POSG. This method is chosen because the FSCs, when composed with the product-POSG (Sec. II-C), result in a finite-state Markov chain.

### II-C The Product-POSG

In order to find runs on $\mathcal{G}_{PO}$ that would be accepted by a DRA $\mathcal{R}_\varphi$ built from an LTL formula $\varphi$, we construct a product-POSG. This construction is motivated by the product-stochastic-game construction in [23] and the product-POMDP construction in [21].

###### Definition II.5

Given a POSG $\mathcal{G}_{PO}$ and a DRA $\mathcal{R}_\varphi = (Q, 2^{AP}, \delta, q_0, F)$ corresponding to an LTL formula $\varphi$, a *product-POSG* is a tuple $\mathcal{G}_{prod} = (S_{prod}, s_{0,prod}, U_d, U_a, \mathcal{T}_{prod}, O_d, O_a, \mathcal{O}_{d,prod}, \mathcal{O}_{a,prod}, F_{prod})$.

Here, $S_{prod} = S \times Q$ and $s_{0,prod} = (s_0, q_0)$. The transitions satisfy $\mathcal{T}_{prod}((s', q') \mid (s, q), u_d, u_a) = \mathcal{T}(s' \mid s, u_d, u_a)$ iff $q' = \delta(q, \mathcal{L}(s'))$, and $= 0$ otherwise. The observation functions satisfy $\mathcal{O}_{i,prod}(o_i \mid (s, q)) = \mathcal{O}_i(o_i \mid s)$, with $i \in \{d, a\}$. The acceptance condition is $F_{prod} = \{(Fail_{prod}(i), Inf_{prod}(i))\}_{i=1}^{M}$, where $(s, q) \in Fail_{prod}(i)$ iff $q \in Fail(i)$, and $(s, q) \in Inf_{prod}(i)$ iff $q \in Inf(i)$.

From the above definition, it is clear that the acceptance condition of the product-POSG depends on the DRA, while the transition probabilities of the product-POSG are determined by the transition probabilities of the original POSG. Therefore, a run on the product-POSG can be used to generate both a path on the POSG and a run on the DRA. Then, if the run on the DRA is accepting, we say that the product-POSG satisfies the LTL specification $\varphi$.
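The product construction above can be sketched in a few lines with hypothetical dict-based encodings: the product inherits the POSG transition probabilities and moves the DRA deterministically on the label of the successor state.

```python
# Sketch of the product construction: the DRA component is updated
# deterministically using the label of the next POSG state, so only the
# POSG contributes randomness.

def product_transitions(T, L, delta):
    """T: {(s, ud, ua): {s2: prob}};  L: {s: label};  delta: {(q, label): q2}."""
    dra_states = {q for (q, _lbl) in delta}
    Tp = {}
    for (s, ud, ua), dist in T.items():
        for q in dra_states:
            Tp[((s, q), ud, ua)] = {(s2, delta[(q, L[s2])]): p
                                    for s2, p in dist.items()}
    return Tp
```

With a two-state POSG and a two-state DRA, each product transition keeps the original probability while the automaton tracks the label of the state just entered.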

## III Problem Setup

This section details the construction of finite state controllers (FSCs) for the defender and adversary. An FSC for an agent can be interpreted as a policy for that agent. When the FSCs are composed with the product-POSG, the resulting entity is a Markov chain. We then establish a way to determine satisfaction of an LTL specification on the product-POSG in terms of runs on the composed Markov chain. A treatment for the single-agent case when the environment is specified as a POMDP was presented in [21].

The transition probabilities of the global Markov chain defined in Section III-B are given by:

$$\Pr\big((s', q', g_d', g_a') \,\big|\, (s, q, g_d, g_a)\big) = \sum_{o_d, o_a, u_d, u_a} \mathcal{O}_d(o_d \mid s)\, \mathcal{O}_a(o_a \mid s)\, \omega_d(g_d', u_d \mid g_d, o_d)\, \omega_a(g_a', u_a \mid g_a, o_a)\, \mathcal{T}_{prod}\big((s', q') \,\big|\, (s, q), u_d, u_a\big) \qquad (1)$$

and its Rabin acceptance sets are given by:

$$(s, q, g_d, g_a) \in Fail_{\mathcal{M}}(i) \iff q \in Fail(i) \qquad (2)$$

$$(s, q, g_d, g_a) \in Inf_{\mathcal{M}}(i) \iff q \in Inf(i) \qquad (3)$$

### III-A Finite State Controllers

Finite state controllers comprise a finite set of internal states. The transitions between any two states are governed by the current observation of the agent. A directed cyclic graph of internal states of the FSC allows for remembering events relevant to taking optimal actions [21]. In our setting, we will have two FSCs, one for the defender and another for the adversary. We will then limit the search for defender and adversary policies to FSCs of fixed cardinality.

###### Definition III.1

A *finite state controller for the defender (adversary)*, denoted $\mathcal{C}_d$ ($\mathcal{C}_a$), is a tuple $\mathcal{C}_i = (G_i, \omega_i)$, $i \in \{d, a\}$, where $G_i$ is a finite set of (internal) states of the controller, and $\omega_i$, written $\omega_i(g', u \mid g, o)$, is a probability distribution over the next internal state and action, given a current internal state and observation.
The initial state of $\mathcal{C}_i$ is a probability distribution over $G_i$, and will depend on the initial state of the system.
Here, $g, g' \in G_i$, $u \in U_i$, and $o \in O_i$.

The setup works as follows: Initial states of the FSCs are determined by the initial state of the POSG. At each time step, the defender observes the state of $\mathcal{G}_{PO}$ according to $\mathcal{O}_d$ and commits to a policy generated by $\mathcal{C}_d$. The adversary observes this, observes the state according to $\mathcal{O}_a$, and responds with a policy generated by $\mathcal{C}_a$. These actions are taken concurrently and applied to $\mathcal{G}_{PO}$, which transitions to the next state per the distribution $\mathcal{T}$, and the process is repeated.
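One round of this closed loop can be simulated as follows; all names (`obs_d`, `fsc_d`, `T`, and so on) are our own illustrative encodings with deterministic observations, not notation from the paper.

```python
import random

def sample(dist, rng):
    """Draw an outcome from a {outcome: probability} dict."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes])[0]

def step(s, gd, ga, model, rng):
    """One round: both agents observe, both FSCs draw a (next internal
    state, action) pair, then the POSG transitions under the joint action."""
    obs_d, obs_a, fsc_d, fsc_a, T = model
    od, oa = obs_d[s], obs_a[s]                 # imperfect sensing of s
    gd2, ud = sample(fsc_d[(gd, od)], rng)      # defender FSC transition
    ga2, ua = sample(fsc_a[(ga, oa)], rng)      # adversary FSC transition
    s2 = sample(T[(s, ud, ua)], rng)            # POSG transition
    return s2, gd2, ga2
```

Iterating `step` produces exactly the kind of execution that the global Markov chain of Section III-B describes in closed form.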

###### Definition III.2

An FSC is *proper* if there is a positive probability of satisfying a given LTL formula in a finite number of steps under this policy on a system represented by a POMDP.

This is similar to the definition in [25], with the distinction that the terminal state of an FSC in that context will be directly related to Rabin acceptance pairs of a Markov chain formed by composing and with a product-POSG (Sec III-B). We will restrict ourselves to proper FSCs for the rest of this paper.

### III-B The Global Markov Chain

The FSCs and , when composed with , will result in a finite-state, fully observable Markov chain.
To maintain consistency with the literature, we will refer to this as the *global Markov chain (GMC)* [21].

###### Definition III.3

The *global Markov chain* $\mathcal{M}$ resulting from a product-POSG $\mathcal{G}_{prod}$ controlled by FSCs $\mathcal{C}_d$ and $\mathcal{C}_a$ is the tuple $\mathcal{M} = (X, x_0, \Pr)$, where $X = S \times Q \times G_d \times G_a$, $x_0$ is determined by $s_{0,prod}$ and the initial distributions of the FSCs, and $\Pr$ is given by Equation (1).

Similar to $\mathcal{G}_{prod}$, the Rabin acceptance condition for $\mathcal{M}$ is: $F_{\mathcal{M}} = \{(Fail_{\mathcal{M}}(i), Inf_{\mathcal{M}}(i))\}_{i=1}^{M}$, with $x \in Fail_{\mathcal{M}}(i)$ iff $q \in Fail(i)$ and $x \in Inf_{\mathcal{M}}(i)$ iff $q \in Inf(i)$, where $q$ is the DRA component of $x$.

A state of $\mathcal{M}$ is of the form $x = (s, q, g_d, g_a)$. A path on $\mathcal{M}$ is a sequence $x_0x_1\ldots$ such that $\Pr(x_{i+1} \mid x_i) > 0$, where $\Pr$ here corresponds to the transition probabilities in Equation (1). A path on $\mathcal{M}$ is accepting if it satisfies the Rabin acceptance condition. This corresponds to an execution in $\mathcal{G}_{prod}$ controlled by $\mathcal{C}_d$ and $\mathcal{C}_a$. A probability space over paths of $\mathcal{M}$ is defined in the usual way [6].
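The composition that yields the global Markov chain amounts to marginalizing the observations and joint actions out of the FSC kernels and the product-POSG transitions. A sketch with illustrative names and deterministic observations for brevity:

```python
# Probability of the global Markov chain moving from x to x2, where the
# first component of x is a product-POSG state. FSC kernels map
# (internal state, observation) to a distribution over (next state, action).

def gmc_prob(x, x2, obs_d, obs_a, fsc_d, fsc_a, Tp):
    (s, gd, ga), (s2, gd2, ga2) = x, x2
    od, oa = obs_d[s], obs_a[s]
    total = 0.0
    for (gd_, ud), pd in fsc_d[(gd, od)].items():
        if gd_ != gd2:
            continue
        for (ga_, ua), pa in fsc_a[(ga, oa)].items():
            if ga_ != ga2:
                continue
            total += pd * pa * Tp.get((s, ud, ua), {}).get(s2, 0.0)
    return total
```

Handling stochastic observations would add two more sums (over the observation sets) weighted by the observation probabilities.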

### III-C System Model

Consider a discrete-time finite-state system: $s_{t+1} = f(s_t, u_{d,t}, u_{a,t}, w_t)$, where $w_t$ represents a stochastic disturbance. This system can be abstracted as an SG with finite state and action spaces using a simulation-based algorithm, similar to that in [26].

### III-D Problem Statement

The goal is to synthesize a defender policy that maximizes the probability of satisfaction of an LTL specification under any adversary policy. Clearly, this will depend on the FSCs $\mathcal{C}_d$ and $\mathcal{C}_a$. In this paper, we will assume that the size of the adversary FSC is fixed and known. This can be interpreted as one way for the defender to have knowledge of the capabilities of an adversary, which is a reasonable assumption. Future work will consider the problem for FSCs of arbitrary sizes. Formally,

###### Problem III.4

Given a partially observable environment and an LTL formula $\varphi$, determine a defender policy specified by a finite state controller $\mathcal{C}_d$ that maximizes the probability of satisfying $\varphi$ under any adversary policy represented as a finite state controller $\mathcal{C}_a$ of fixed size $|G_a|$. That is,

$$\mathcal{C}_d^* = \arg\max_{\mathcal{C}_d} \min_{\mathcal{C}_a} \Pr^{\mathcal{C}_d, \mathcal{C}_a}(\varphi) \qquad (4)$$

Optimizing over $\mathcal{C}_d$ and $\mathcal{C}_a$ indicates that the solution will depend on $|G_d|$, $|G_a|$, and the parameters $\omega_d, \omega_a$.

## IV Results

### IV-A LTL Satisfaction and Recurrent Sets

Our main result relates the probability of the LTL specification being satisfied by the product-POSG, denoted $\Pr^{\mathcal{C}_d, \mathcal{C}_a}(\varphi)$, in terms of recurrent sets of the GMC. Let $R$ denote the recurrent states of $\mathcal{M}$ under FSCs $\mathcal{C}_d$ and $\mathcal{C}_a$. Let $x|_Q$ be the restriction of a recurrent state $x$ to a state of the DRA $\mathcal{R}_\varphi$.

###### Proposition IV.1

$\Pr^{\mathcal{C}_d, \mathcal{C}_a}(\varphi) > 0$ if and only if there exists a recurrent set $R_j \subseteq R$ such that for some Rabin acceptance pair $(Fail(i), Inf(i))$ and an initial state $x_0$ of $\mathcal{M}$, the following conditions hold:

$$\exists x \in R_j : x|_Q \in Inf(i), \qquad x_0 \rightarrow x, \qquad \forall x' \in R_j : x'|_Q \notin Fail(i) \qquad (5)$$

If for every recurrent set $R_j$, at least one of the conditions in Equation (5) does not hold, then at least one of the following statements is true:
*i)*: no state that has to be visited infinitely often is recurrent;
*ii)*: there is no initial state from which a recurrent state that has to be visited infinitely often is accessible;
*iii)*: some state that has to be visited only finitely often in steady state is recurrent.
This means that $\Pr^{\mathcal{C}_d, \mathcal{C}_a}(\varphi) = 0$.

Conversely, if all the conditions in Equation (5) hold for some $R_j$, then $\Pr^{\mathcal{C}_d, \mathcal{C}_a}(\varphi) > 0$ by construction.

To quantify the satisfaction probability for a defender policy under any adversary policy, assume that the recurrent states of $\mathcal{M}$ are partitioned into recurrence classes $R_1, \ldots, R_r$. This partition is maximal, in the sense that two recurrent classes cannot be combined to form a larger recurrent class, and all states within a given recurrent class communicate with each other [22].

###### Definition IV.2

A recurrent set $R_j$ is *feasible* under FSCs $\mathcal{C}_d$ and $\mathcal{C}_a$ if there exists a Rabin acceptance pair $(Fail(i), Inf(i))$ such that $\{x \in R_j : x|_Q \in Inf(i)\} \neq \emptyset$ and $\{x \in R_j : x|_Q \in Fail(i)\} = \emptyset$.
Let $\mathcal{F}$ denote the set of feasible recurrent sets under the respective FSCs.

Over infinite executions, a path of $\mathcal{M}$ will reach a recurrent set. Let $reach(\mathcal{F})$ denote the event that such a path reaches a feasible recurrent set. Then, Theorem IV.3 states that Problem III.4 is equivalent to determining defender FSCs that maximize the probability of reaching feasible recurrent sets of the GMC under any adversary FSC.

###### Theorem IV.3

$$\max_{\mathcal{C}_d} \min_{\mathcal{C}_a} \Pr^{\mathcal{C}_d, \mathcal{C}_a}(\varphi) = \max_{\mathcal{C}_d} \min_{\mathcal{C}_a} \Pr^{\mathcal{C}_d, \mathcal{C}_a}(reach(\mathcal{F})) \qquad (6)$$

Since the recurrence classes are maximal, a path of $\mathcal{M}$ will eventually be absorbed in exactly one class $R_j$. From Definition IV.2, a feasible recurrent set will necessarily contain a Rabin acceptance pair. Therefore, the probability of satisfying the LTL formula under $\mathcal{C}_d$ and $\mathcal{C}_a$ is equivalent to the probability of paths on $\mathcal{M}$ leading to feasible recurrent sets. That is, $\Pr^{\mathcal{C}_d, \mathcal{C}_a}(\varphi) = \Pr^{\mathcal{C}_d, \mathcal{C}_a}(reach(\mathcal{F}))$.

Then, for some (fixed) $\mathcal{C}_d$ (and initial state of $\mathcal{M}$), the minimum probability of satisfying $\varphi$ over all adversary FSCs is equal to the minimum probability of reaching a feasible recurrent set.

The result follows for a maximizing $\mathcal{C}_d$.
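For fixed FSCs the GMC is a finite Markov chain, so the reachability probability in Theorem IV.3 solves the standard linear fixpoint: the probability is 1 on the target set, and elsewhere equals the expected probability after one step. A minimal value-iteration sketch (our own helper, assuming the target states are absorbing):

```python
# Probability of reaching `target` in a finite MC, by iterating the fixpoint
# p(x) = 1 on the target and p(x) = sum_y P(x, y) * p(y) elsewhere.

def reach_prob(P, target, iters=2000):
    """P: {x: {y: prob}}; target: set of states to reach."""
    p = {x: (1.0 if x in target else 0.0) for x in P}
    for _ in range(iters):
        p = {x: (1.0 if x in target
                 else sum(pr * p[y] for y, pr in P[x].items()))
             for x in P}
    return p
```

Evaluating this quantity for candidate parameter values is the inner step of the max-min optimization in Equation (6).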

### IV-B Determining Candidate $\mathcal{C}_d$ and $\mathcal{C}_a$

If the sizes of $\mathcal{C}_d$ and $\mathcal{C}_a$ are fixed, then their design is equivalent to determining the transition probabilities between their internal states. We are guided by the treatment in [22]. However, our framework differs in that we additionally consider the presence of an adversary while aiming to satisfy an LTL specification.

Let the FSC policies $\omega_d$ and $\omega_a$ be parameterized by $\theta_d$ and $\theta_a$ respectively.
Then, $\omega_i(g', u \mid g, o) = \omega_i(g', u \mid g, o; \theta_i)$, with $i \in \{d, a\}$.
Any parameterization of the controller is valid so long as it obeys the laws of probability.
We use the *softmax parameterization* [22, 27],
since its derivative can be easily computed.
Let $\theta^i_{g, o, g', u}$ determine the relative probability of making a transition in the FSC from $g$ to $g'$ along with taking the corresponding action $u$, given an observation $o$.
Then, the transition probabilities of the FSCs are:

$$\omega_i(g', u \mid g, o; \theta_i) = \frac{\exp\big(\theta^i_{g, o, g', u}\big)}{\sum_{\bar{g} \in G_i} \sum_{\bar{u} \in U_i} \exp\big(\theta^i_{g, o, \bar{g}, \bar{u}}\big)} \qquad (7)$$

The parameterization considered in Algorithm 1 can be viewed as a special case of the softmax parameterization with $\theta^i_{g, o, g', u}$ equal for all $(g', u)$.
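A sketch of the softmax parameterization described above, for a single (internal state, observation) pair; `theta` maps each (next state, action) pair to its unnormalized weight, and the output is always a valid conditional distribution.

```python
import math

# Softmax over unnormalized weights: any real-valued theta yields a
# probability distribution over (next internal state, action) pairs.

def softmax_policy(theta):
    z = {k: math.exp(v) for k, v in theta.items()}
    total = sum(z.values())
    return {k: v / total for k, v in z.items()}
```

With all weights equal, this reduces to the uniform distribution over the allowed pairs, matching the special case used in Algorithm 1.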

Define the *structure* $\mathcal{S}^i \in \{0, 1\}^{|G_i| \times |O_i| \times |G_i| \times |U_i|}$, where $\mathcal{S}^i(g, o, g', u) = 1$ iff $\omega_i(g', u \mid g, o) > 0$. $\mathcal{S}^i$ then serves to indicate if it is possible, given an observation $o$ in a state $g$ of an FSC, to transition to $g'$ in the FSC while issuing action $u$. We further assume that for every $(g, o)$ there exists some $(g', u)$ such that $\mathcal{S}^i(g, o, g', u) = 1$ [22]. Let a state in the GMC be denoted $x = (s, q, g_d, g_a)$.

In Algorithm 1, for defender and adversary FSCs with a fixed number of states, we determine candidate structures $\mathcal{S}^d$ and $\mathcal{S}^a$ such that the resulting GMC will have a feasible recurrent set.
We start with initial candidate structures and induce the digraph of the resulting GMC (*Line 1*).
This MC might not contain a feasible recurrent set.
We first determine the set of communicating classes of the MC, which is equivalent to determining the strongly connected components (SCCs) of the induced digraph (*Line 3*).
A communicating class of the MC will be recurrent if it is a *sink* SCC of the corresponding digraph.
The set $Avoid$ contains those GMC states that are part of the Rabin acceptance pair that has to be visited only finitely many times (and therefore, to be visited with very low probability in steady state) (*Line 6*).
$Avoid$ further contains states outside a candidate recurrent class that can be transitioned to from some state in that class.
This is because once the system transitions out of the class, it will not be able to return to it in order to satisfy the Rabin acceptance condition (*Line 5*) (and hence the class will not be recurrent).
The set $Reach$ contains those GMC states that need to be visited infinitely often according to the Rabin acceptance condition (*Line 7*).

Recall that the agents have access to the actual state only via their individual observations.
A defender action is forbidden if there exists an adversary action that will allow a transition to a state in $Avoid$ under observations $o_d$ and $o_a$.
This is achieved by setting the corresponding entries in $\mathcal{S}^d$ to zero (*Lines 12-17*).
An adversary action is not useful if for every defender action, the probability of transitioning to a state in $Reach$ is nonzero under $o_d$ and $o_a$.
This is achieved by setting the corresponding entry in $\mathcal{S}^a$ to zero (*Lines 18-23*).
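The two pruning steps above can be sketched as follows. This is our own encoding, not the paper's pseudocode: `K_d[ud]` and `K_a[ua]` stand for 0/1 structure entries for one fixed pair of observations, and the two predicates are assumed to be precomputed from the GMC digraph.

```python
# Masking sketch: zero out forbidden defender actions (some adversary reply
# can enter the avoid set), then zero out not-useful adversary actions
# (every remaining defender action keeps a nonzero probability of entering
# the set of states to be visited infinitely often).

def mask(K_d, K_a, hits_avoid, reaches_target):
    for ud in list(K_d):          # cf. Lines 12-17
        if any(hits_avoid(ud, ua) for ua in K_a if K_a[ua]):
            K_d[ud] = 0
    for ua in list(K_a):          # cf. Lines 18-23
        if all(reaches_target(ud, ua) for ud in K_d if K_d[ud]):
            K_a[ua] = 0
    return K_d, K_a
```

Note the order matters: the adversary pruning quantifies only over defender actions that survived the first pass.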

The computational complexity of Algorithm 1 depends on: *i)*: determining the SCCs.
This can be done in time linear in the number of vertices and edges of the digraph [28].
The vertex set of the digraph is the state space of the GMC, $S \times Q \times G_d \times G_a$, and the number of edges is at most quadratic in the number of vertices.
Therefore, the SCCs can be determined in time quadratic in the size of the GMC in the worst case.
*ii)*: determining the structures in *Lines 9-26*.
This, in the worst case, requires examining every pair of GMC states together with every pair of observations and every joint action.
The overall computational complexity is therefore polynomial in the sizes of the GMC, the action sets, and the observation sets.

###### Proposition IV.4

Algorithm 1 is sound. That is, each feasible FSC structure returned by the algorithm will have at least one feasible recurrent set.

This is by construction. The output of the algorithm is a set of FSC structures such that the resulting GMC in each case has a state that is recurrent and has to be visited infinitely often. This state, by Definition IV.2, belongs to a feasible recurrent set. Moreover, if the algorithm returns a nonempty solution, a solution to Problem III.4 will exist, since we assume that the FSCs are proper.

Algorithm 1 is *suboptimal* since we only consider the most recent observations of the defender and adversary.
It is also not complete, since there might be a feasible solution that cannot be determined by the algorithm.

###### Remark IV.5

For $\mathcal{C}_d$ and $\mathcal{C}_a$ of fixed sizes and structures $\mathcal{S}^d$ and $\mathcal{S}^a$, a solution to Problem III.4 is:

$$\theta_d^* = \arg\max_{\theta_d} \min_{\theta_a} \Pr^{\mathcal{C}_d(\theta_d), \mathcal{C}_a(\theta_a)}(reach(\mathcal{F})) \qquad (8)$$

This follows from the fact that for fixed FSC sizes and structures, the properties of a set (recurrent or transient) in the GMC will not change. What remains then is to choose the transition probabilities appropriately. For a softmax parameterization, this computation is presented in [22], and we omit it for want of space.

### IV-C Determining Recurrent States to Visit

Algorithm 2 returns a subset of the recurrent states that is consistent with the Rabin acceptance pairs that need to be visited 'often' in steady state.
If there is a reward structure over the states of the GMC that incentivizes visits to this subset, then the expected long-term average reward is equal to the *expected occupation measure* of the subset [21].
Moreover, in the infinite horizon, we can assume that the system has been absorbed in a recurrent set, and the resulting (sub-)Markov chain is irreducible.
Then, this problem can be solved by viewing it as minimizing an *average cost per stage* problem [2].
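Once a path is absorbed in a single recurrent class, the chain restricted to that class is irreducible, and the expected occupation measure of a state equals its stationary probability. A minimal sketch (our own helper; power iteration, which suffices for small aperiodic classes):

```python
# Stationary distribution of an irreducible MC restricted to one recurrent
# class, by repeatedly pushing probability mass through the transitions.

def stationary(P, iters=5000):
    """P: {x: {y: prob}}, restricted to a single recurrent class."""
    states = list(P)
    pi = {x: 1.0 / len(states) for x in states}
    for _ in range(iters):
        nxt = dict.fromkeys(states, 0.0)
        for x, row in P.items():
            for y, pr in row.items():
                nxt[y] += pi[x] * pr
        pi = nxt
    return pi
```

Weighting the stationary probabilities by a per-state reward gives the long-term average reward that the average-cost-per-stage formulation of [2] optimizes.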

## V Example

Assume the state space is given by . This will define an grid. The defender’s actions are and the adversary’s actions are , denoting right, left, up, down, attack, and not attack. The observations of both agents are , with , and . Let . Then, if , it can be shown that the corresponding DRA will have two states , with . The transition probabilities for and are defined below. The probabilities for other action pairs can be defined similarly. Let denote the neighbors of .

For this example, let . Then,
