Synthesis of Provably Correct Autonomy Protocols for Shared Control

05/15/2019, by Murat Cubuktepe, et al., The University of Texas at Austin and Radboud Universiteit

We synthesize shared control protocols subject to probabilistic temporal logic specifications. More specifically, we develop a framework in which a human and an autonomy protocol can issue commands to carry out a certain task. We blend these commands into a joint input to a robot. We model the interaction between the human and the robot as a Markov decision process (MDP) that represents the shared control scenario. Using inverse reinforcement learning, we obtain an abstraction of the human's behavior and decisions. We use randomized strategies to account for randomness in the human's decisions, caused by factors such as the complexity of the task specifications or imperfect interfaces. We design the autonomy protocol to ensure that the resulting robot behavior satisfies given safety and performance specifications in probabilistic temporal logic. Additionally, the resulting strategies generate behavior as similar to the behavior induced by the human's commands as possible. We solve the underlying problem efficiently using quasiconvex programming. Case studies involving autonomous wheelchair navigation and unmanned aerial vehicle mission planning showcase the applicability of our approach.


I Introduction

In shared control, a robot executes a task to accomplish the goals of a human operator while adhering to additional safety and performance requirements. Applications of such human-robot interaction include remotely operated semi-autonomous wheelchairs [15], robotic teleoperation [19], and human-in-the-loop unmanned aerial vehicle mission planning [11]. A human operator issues a command through an input interface, which maps the command directly to an action for the robot. The problem is that a sequence of such actions may fail to accomplish the task at hand, due to limitations of the interface or failure of the human operator in comprehending the complexity of the problem. Therefore, a so-called autonomy protocol provides assistance for the human in order to complete the task according to the given requirements.

At the heart of the shared control problem is the design of an autonomy protocol. In the literature, there are two main directions, based on either switching the control authority between the human and the autonomy protocol [28], or on blending their commands into a joint input for the robot [9, 18].

One approach to switching the authority first determines the desired goal of the human operator with high confidence, and then assists towards exactly this goal [10, 21]. In [14], switching the control authority between the human and the autonomy protocol ensures satisfaction of specifications that are formally expressed in temporal logic. In general, switching the authority may decrease the human's satisfaction, as humans usually prefer to retain as much control as possible [20].

In blending, the autonomy protocol provides an alternative command in addition to the one of the human operator. To introduce a more flexible trade-off between the human's control authority and the level of autonomous assistance, both commands are then blended to form a joint input for the robot. A blending function determines the emphasis that is put on the autonomy protocol in the blending, that is, it regulates the amount of assistance provided to the human. Switching of authority can be seen as a special case of blending, as the blending function may assign full control to the autonomy protocol or to the human. In general, putting more emphasis on the autonomy protocol in blending may lead to greater accuracy in accomplishing the task [8, 9, 23]. However, humans prefer to retain control of the robot and may not approve if the robot issues commands that differ significantly from their own [19, 20]. None of the existing blending approaches provide formal correctness guarantees that go beyond statistical confidence bounds. Correctness here refers to ensuring safety and optimizing performance according to the given requirements. Our goal is to design an autonomy protocol that admits formal correctness while rendering the robot behavior as close to the human's commands as possible, which has been shown to enhance the human experience.

A human may be uncertain about which command to issue in order to accomplish a task. Moreover, a typical interface used to parse the human's commands, such as a brain-computer interface, is inherently imperfect. To capture such uncertainties and imperfections in the human's decisions, we introduce randomness into the commands issued by humans. It may also not be possible to blend two different deterministic commands: if the human's command is “up” and the autonomy protocol's command is “right”, we cannot blend these two commands into another deterministic command. By introducing randomness into the commands of the human and the autonomy protocol, we ensure that blending is always well-defined.

Take as an example a scenario involving a semi-autonomous wheelchair [15] whose navigation has to account for a randomly moving autonomous vacuum cleaner, see Fig. 1. The wheelchair needs to navigate to the exit of a room, and the vacuum cleaner moves in the room according to a probabilistic transition function. The task of the wheelchair is to reach the exit gate while not crashing into the vacuum cleaner. The human may not fully perceive the motion of the vacuum cleaner. Note that the human’s commands, depicted with the solid red line in Fig. 1(a), may cause the wheelchair to crash into the vacuum cleaner. The autonomy protocol provides another set of commands, which is indicated by the solid red line in Fig. 1(b), to carry out the task safely without crashing. However, the autonomy protocol’s commands deviate highly from the commands of the human. The two sets of commands are then blended into a new set of commands, depicted using the dashed red line in Fig. 1(b). The blended commands perform the task safely while generating behavior as similar to the behavior induced by the human’s commands as possible.

We model the behavior of the robot as a Markov decision process (MDP) [27], which captures the robot’s actions inside a potentially stochastic environment. Problem formulations with MDPs typically focus on maximizing an expected reward (or, minimizing the expected cost). However, such formulations may not be sufficient to ensure safety or performance guarantees in a task that includes a human operator. Recently, it was shown that a reward structure is not sufficient to capture temporal logic constraints in general [17]. We design the autonomy protocol such that the resulting robot behavior satisfies probabilistic temporal logic specifications. Such verification problems have been extensively studied for MDPs [2] and mature tools exist for efficient verification [22, 7].

Fig. 1: A wheelchair in a shared control setting. (a) Autonomy perspective. (b) Human perspective.

In what follows, we call a formal interpretation of a sequence of the human’s commands the human strategy, and the sequence of commands issued by the autonomy protocol the autonomy strategy. In [18], we formulated the problem of designing the autonomy protocol as a nonlinear programming problem. However, solving nonlinear programs is generally intractable [3]. Therefore, based on [26], we proposed a greedy algorithm that iteratively repairs the human strategy such that the specifications are satisfied, without guaranteeing optimality. Here, we propose an alternative approach for the blending of the two strategies. We follow the approach of repairing the strategy of the human to compute an autonomy protocol, and we ensure that the robot behavior induced by the repaired strategy deviates minimally from the human strategy and satisfies safety and performance properties given as temporal logic specifications. We formally define the problem as a quasiconvex optimization problem, which can be solved efficiently by checking the feasibility of a number of convex optimization problems [4].

The question remains how to obtain the human strategy in the first place. It may be unrealistic that a human can provide a strategy for an MDP that models a realistic scenario. To this end, we create a virtual simulation environment that captures the behavior of the MDP. We ask humans to participate in two case studies to collect data about typical human behavior, and we use inverse reinforcement learning to obtain a formal interpretation of the human's inputs as a strategy [1, 30]. In our first case study, we model a typical shared control scenario based on autonomous wheelchair navigation [15]. In our second case study, we consider an unmanned aerial vehicle mission planning scenario, where the human operator is to patrol certain regions while keeping away from enemy aerial vehicles.

In summary, the main contribution of this paper is to efficiently synthesize an autonomy protocol such that the resulting blended or repaired strategy meets all given specifications while deviating only minimally from the human strategy. We present a new technique based on quasiconvex programming, which can be solved efficiently using convex optimization [4].

Organization. We introduce the formal foundations that we need in Section II. We provide an overview of the general shared control concept in Section III. We present the shared control synthesis problem and provide a solution based on convex optimization in Section IV. We demonstrate the applicability and scalability of our approach on experiments in Section V and draw a conclusion and critique of our approach in Section VI.

II Preliminaries

In this section, we introduce the required formal models and specifications that we use to synthesize the autonomy protocol, and we give a short example illustrating the main concepts.

II-A Markov Decision Processes

A probability distribution over a finite set is a function that assigns each element a value in [0,1] such that the values sum to one. The collection of all such functions forms the set of distributions over that set.

[MDP] A Markov decision process (MDP) consists of a finite set of states, an initial state, a finite set of actions, a transition probability function that maps each state-action pair to a distribution over successor states, a finite set of atomic propositions, and a labeling function that labels each state with a subset of the atomic propositions. We extend the labeling function to sequences of states by applying it to each state in the sequence. MDPs have nondeterministic choices of actions at the states; the successors are determined probabilistically via the associated probability distribution. We assume that the MDP contains no deadlock states, that is, at every state at least one action is available. A cost function associates a cost with each state-action pair. If there is only a single action available at each state, the MDP reduces to a discrete-time Markov chain (MC). We use strategies to resolve the choices of actions in order to define a probability measure and an expected cost measure for MDPs. [Strategy] A memoryless and randomized strategy for an MDP is a function that maps each state to a distribution over the available actions. A special case are deterministic strategies, which choose a single action with probability one in each state. Resolving all nondeterminism of an MDP with a strategy yields an induced Markov chain.

[Induced MC] For an MDP and a strategy, the induced MC has the same states as the MDP, and its transition probabilities are obtained by averaging the action-dependent transition probabilities of the MDP according to the probabilities that the strategy assigns to the actions in each state.
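As a concrete rendering of these definitions, the following standard notation may be helpful; the symbols below are our own choice and not necessarily the authors' original ones.

\[
  \mathcal{M} = (S, s_I, A, \mathcal{P}, AP, L), \qquad \mathcal{P}\colon S \times A \to \mathit{Distr}(S),
\]
\[
  \sigma\colon S \to \mathit{Distr}(A), \qquad
  \mathcal{P}^{\sigma}(s, s') = \sum_{a \in A} \sigma(s)(a)\,\mathcal{P}(s, a, s'),
\]

where the second line gives a memoryless randomized strategy and the transition probabilities of the MC it induces.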

In the following, we assume that every state of a given MDP is reachable from the initial state under some strategy. This is a standard assumption for MDPs, as unreachable states can be removed by a graph search over the MDP in a preprocessing step [2].

A finite or infinite sequence of states generated in an MDP under a strategy is called a path. In the induced MC, starting from the initial state, the state visited at each step is a random variable. The probability of reaching one state from another in one step equals the corresponding transition probability of the induced MC, and this measure extends to sets of paths in the standard way. The paths of an MDP under a strategy are exactly the paths of the induced MC.

[Occupancy Measure] The occupancy measure of a strategy for an MDP is defined as

(1)

where the random variables in (1) denote the state and the action at each time step. The occupancy measure of a state-action pair is the expected number of times that the action is taken in that state under the strategy. In our solution approach, we use the occupancy measure of a strategy to compute an autonomy protocol.
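In the notation introduced above (again our own), the occupancy measure in (1) is commonly written as

\[
  x_{\sigma}(s, a) \;=\; \sum_{t=0}^{\infty} \Pr{}^{\sigma}\!\big(S_t = s,\, A_t = a\big),
\]

where S_t and A_t are the random state and action at time step t.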

II-B Specifications

We use linear temporal logic (LTL) to specify a set of tasks [2]. A specification in LTL is built from a set of atomic propositions, the Boolean connectives, and the temporal connectives always, until, eventually, and next. An infinite sequence of subsets of the atomic propositions defines an infinite word, and an LTL specification is interpreted over such infinite words. A word either satisfies a given LTL specification or it does not.

(DRA) A deterministic Rabin automaton (DRA) consists of a finite set of states, an initial state, an alphabet, a transition function between the states, and a set of accepting state pairs.

A run of a DRA is an infinite sequence of states in which each pair of consecutive states is connected by a transition labeled with some letter of the alphabet. A run is accepting if there exists an accepting state pair such that the run eventually avoids one set of the pair forever while visiting the other set infinitely often. Given an LTL specification over a set of atomic propositions, a DRA whose alphabet consists of subsets of the atomic propositions can be constructed that accepts exactly the words satisfying the LTL specification [2].

For the MC induced by an MDP and a strategy, each path generates a word by applying the labeling function to its states. For an LTL specification, the set of paths whose words are accepted by the corresponding DRA, and hence satisfy the specification, is measurable [2]. We define the probability of satisfying the LTL specification for an MDP under a strategy as the probability of this set of paths.

The synthesis problem is to find a strategy for an MDP such that, given an LTL specification and a probability threshold, the induced MC satisfies

(2)

that is, the strategy satisfies the specification with probability at least the given threshold.

We also consider expected cost properties, which require that the expected cost of reaching a set of goal states does not exceed a given upper bound.
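In the same assumed notation, with φ an LTL specification, β a probability threshold, G a set of goal states, and κ a cost bound, the two kinds of requirements read

\[
  \Pr{}^{\sigma}_{\mathcal{M}}(\varphi) \;\geq\; \beta
  \qquad\text{and}\qquad
  \mathbb{E}^{\sigma}_{\mathcal{M}}\!\left[\text{cost to reach } G\right] \;\leq\; \kappa .
\]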

Fig. 2: An MDP with a target state and the MC induced by a strategy. (a) The MDP. (b) The induced MC.

Fig. 2(a) depicts an MDP with an initial state in which two actions are available; the same holds for a second state. Selecting an action in one of these states leads to the possible successor states with the indicated probabilities. For the remaining two states we omit the actions, as they only have self-loops.

Consider a safety specification requiring that the target state be reached with at least a given probability. One deterministic strategy induces a probability of reaching the target that is too low; therefore, the specification is not satisfied, see the induced MC in Fig. 2(b). Likewise, a randomized strategy over the two actions may violate the specification if the induced probability of reaching the target remains below the threshold. However, the other deterministic strategy induces a sufficiently high probability and thus satisfies the specification.

II-C Strategy synthesis in an MDP

Given an MDP and an LTL specification, we aim to synthesize a strategy that satisfies the specification, or equivalently, a strategy that satisfies the condition in (2).

(Product MDP) Let an MDP and a DRA for the given specification be given. The product MDP has as states the pairs of MDP states and DRA states, an initial state obtained from the initial MDP state and the DRA state reached by reading its label, the actions of the MDP, a transition probability function that moves the MDP component probabilistically and updates the DRA component deterministically according to the label of the successor MDP state, a labeling function inherited from the MDP, and an acceptance condition lifted from the accepting state pairs of the DRA.

(AEC) An end component of the product MDP is given by a nonempty set of states together with a selection of available actions at these states such that every selected action leads, with probability one, only to states inside the set, and the directed graph induced by the set and the selected actions is strongly connected. An accepting end component (AEC) is an end component that, for some accepting state pair of the Rabin condition, avoids one set of the pair entirely and intersects the other.

Given a product MDP, we modify it by making all states in its end components absorbing, that is, every action available in such a state leads back to that state with probability one. Making end components absorbing is common practice in tools for model checking LTL specifications on MDPs [2, 22, 7]. This modification does not change the probability of satisfying an LTL specification, as stated below.

(From [6]) In each end component of an MDP, there exists a strategy under which every state of the end component reaches any other state of the end component with probability 1.

A memoryless and randomized strategy for the product MDP maps each product state to a distribution over actions. Such a memoryless strategy corresponds to a finite-memory strategy in the underlying MDP: the DRA component of a product state serves as a memory state that tracks the progress of the specification along the run. For the MDPs given in Definition 1 and LTL specifications, memoryless strategies in the product MDP are sufficient to achieve the maximal probability of satisfying the specification [2].

Some states of the product MDP may be unreachable from its initial state. These states do not affect strategy synthesis and can be removed. We therefore assume that the product MDP contains no unreachable states.

Let a strategy for the product MDP be given, and let the corresponding finite-memory strategy on the underlying MDP be constructed through the procedure explained above. The paths of the MDP under this strategy satisfy the LTL specification with probability at least the given threshold if and only if the paths of the MC induced by the strategy on the product MDP reach and stay in some AECs with probability at least that threshold [2].

III Conceptual description of shared control

Fig. 3: Shared control architecture. The formal model, the specifications, and the human strategy are inputs to the synthesis of the autonomy protocol; at run time, the command of the human strategy and the command of the autonomy strategy are combined by the blending function into a blended command that drives the robot execution.

We now detail the general shared control concept adopted in this paper and state the formal problem. Consider the setting in Fig. 3. As inputs, we have a set of task specifications, a model of the robot behavior, and a blending function. The given robot task is described by certain performance and safety specifications. For example, it may not be safe to take the shortest route because there may be too many obstacles along that route. To satisfy the performance considerations, the robot should prefer the shortest route possible that does not violate the safety specifications. We model the behavior of the robot inside a stochastic environment as an MDP.

In our setting, a human issues a set of commands for the robot to execute. It may be unrealistic that a human can grasp an MDP that models a realistic shared control scenario. Indeed, a human will likely have difficulties interpreting a large number of possibilities and the associated probabilities of paths and payoffs [13], and it may be impractical for the human to provide the human strategy to the autonomy protocol, due to the possibly large state space of the MDP. Therefore, we compute a human strategy as an abstraction of a sequence of the human's commands, which we obtain using inverse reinforcement learning [1, 30].

We design an autonomy protocol that provides another strategy, which we call the autonomy strategy. Then, we blend the two strategies according to the blending function into the blended strategy. The blending function reflects the preference between the human strategy and the autonomy strategy at each state. We ensure that the blended strategy deviates minimally from the human strategy.

At runtime, we can then blend the commands of the human with the commands of the autonomy strategy. The resulting “blended” commands induce the same behavior as the blended strategy, and the specifications are satisfied. Note that blending commands at runtime according to the predefined blending function and autonomy protocol simply requires a linear combination of real values and is thus very efficient.

The shared control synthesis problem is then the synthesis of a repaired strategy that satisfies the specifications while deviating minimally from the human strategy. The deviation between the human strategy and the repaired strategy is measured by the maximal difference between the probabilities the two strategies assign to any action in any state of the MDP. We state the problem that we study as follows.

Let an MDP, an LTL specification, a human strategy, and a constant probability threshold be given. Synthesize a repaired strategy that solves the following problem.

(3)
subject to
(4)
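A symbolic sketch of this problem, writing σ_h for the human strategy, σ_ha for the repaired strategy, φ for the specification, and β for the threshold (our notation, not necessarily the paper's), is:

\[
\begin{aligned}
\underset{\sigma_{ha}}{\text{minimize}}\quad & \max_{s \in S,\, a \in A}\ \big|\sigma_h(s)(a) - \sigma_{ha}(s)(a)\big|\\
\text{subject to}\quad & \Pr{}^{\sigma_{ha}}_{\mathcal{M}}(\varphi) \;\geq\; \beta .
\end{aligned}
\]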

For convenience, we use the original MDP instead of the product MDP in what follows, as all concepts are directly transferable.

IV Synthesis of the autonomy protocol

In this section, we describe our approach to synthesizing an autonomy protocol for the shared control synthesis problem. We start by formalizing the concepts of strategy blending and strategy repair. We then show how to synthesize a repaired strategy that deviates minimally from the human strategy based on quasiconvex programming. Finally, we discuss how to include additional specifications in the problem and consider other measures under which the human strategy and the repaired strategy induce similar behavior.

IV-A Strategy blending

Given the human strategy and the autonomy strategy , a blending function computes a weighted composition of the two strategies by favoring one or the other strategy in each state of the MDP [19, 8, 9].

Reference [9] argues that the weight of blending shows the confidence in how well the autonomy protocol can assist in performing the human's task. Put differently, the blending function should assign a low confidence to the actions of the human if they may lead to a violation of the specifications. Recall Fig. 1 and the example in the introduction. In the cells of the gridworld where some actions may result in a collision with the vacuum cleaner with a high probability, it makes sense to assign a higher weight to the autonomy strategy.

We pick the blending function as a state-dependent function that weighs the confidence in both the human strategy and the autonomy strategy at each state of the MDP  [19, 8, 9].

[Linear blending] Given the MDP, two memoryless strategies (here, the human strategy and the autonomy strategy), and a blending function, the blended strategy at each state and action is the convex combination of the two strategies, weighted at each state by the value of the blending function:

(5)

For each state, the value of the blending function represents the weight of the human strategy at that state, that is, how much emphasis the blending puts on the human's choices there. For example, referring back to Fig. 1, the critical cells of the gridworld correspond to certain states of the MDP. At these states, we may assign a very low confidence to the human strategy; the blending function then takes a small value, meaning the blended strategy in such a state puts more emphasis on the autonomy strategy.
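A minimal Python sketch of the linear blending in (5), assuming strategies are represented as per-state dictionaries from actions to probabilities (the function and variable names are illustrative, not part of the paper):

def blend(sigma_h, sigma_a, b):
    """Blend a human strategy and an autonomy strategy state by state.

    sigma_h, sigma_a: dict state -> dict action -> probability (same state sets)
    b: dict state -> weight in [0, 1] placed on the human strategy
    """
    sigma_ha = {}
    for s in sigma_h:
        sigma_ha[s] = {
            a: b[s] * sigma_h[s].get(a, 0.0) + (1.0 - b[s]) * sigma_a[s].get(a, 0.0)
            for a in set(sigma_h[s]) | set(sigma_a[s])
        }
    return sigma_ha

# Example: human prefers "up", autonomy prefers "right"; with weight 0.3 on the
# human at this state, the blended strategy is a well-defined distribution.
sigma_h = {"s0": {"up": 1.0}}
sigma_a = {"s0": {"right": 1.0}}
print(blend(sigma_h, sigma_a, {"s0": 0.3}))  # {'s0': {'up': 0.3, 'right': 0.7}} (key order may vary)

In the example, the human command “up” and the autonomy command “right” are blended into a randomized command, which is exactly why randomized strategies keep blending well-defined.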

IV-B Solution to the shared control synthesis problem

In this section, we propose an algorithm for solving the shared control synthesis problem. Our solution is based on quasiconvex programming which can be solved by checking feasibility of a number of convex optimization problems. We show that the result of the quasiconvex program is the repaired strategy as in Problem 1. The strategy satisfies the task specifications while deviating minimally from the human strategy. We use that result to compute the autonomy strategy that may then be subject to blending.

IV-B1 Perturbation of strategies

As mentioned in the introduction, the blended strategy should deviate minimally from the human strategy. To measure the quantity of such a deviation, we introduce the concept of perturbation, which was used in [5]. To modify a (randomized) strategy, we employ additive perturbation by increasing or decreasing the probabilities of action choices in each state. We also ensure that for each state, the resulting strategy is a well-defined distribution over the actions.

[Strategy perturbation] Given the MDP and a strategy, a perturbation assigns an additive change to each state-action pair such that, in every state, the changes over all actions sum to zero and the resulting values remain valid probabilities. Overloading the notation, the perturbed strategy adds the perturbation value of each state-action pair to the corresponding probability of the original strategy:

(6)
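A symbolic sketch of (6), writing δ for the perturbation (our notation):

\[
  \sigma'(s)(a) \;=\; \sigma(s)(a) + \delta(s, a),
  \qquad
  \sum_{a \in A} \delta(s, a) = 0
  \quad\text{and}\quad
  0 \le \sigma(s)(a) + \delta(s, a) \le 1 \ \ \text{for all } s, a .
\]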

IV-B2 Dual linear programming formulation for MDPs

In this section, we recall the LP formulation for computing a strategy that maximizes the probability of satisfying a specification in an MDP [27, 12]. Consider the set of states that belong to accepting end components (of the product MDP) and the set of all remaining states that have a nonzero probability of reaching the accepting end components. Both sets can be computed in time polynomial in the size of the MDP by a graph search [2]. In this section, we assume that there exists a strategy that satisfies the LTL formula with a probability of at least the given threshold, which can be verified in polynomial time by solving a linear programming problem [2].

The variables of the dual LP formulation are the following:

  • a variable for each state-action pair that defines the occupancy measure of that pair under the strategy, i.e., the expected number of times the action is taken in the state.

  • a variable for each state that gives the probability of reaching a state in an accepting end component.

(7)
subject to
(8)
(9)
(10)

where the constant on the right-hand side is one for the initial state and zero otherwise. The constraints (8) and (9) ensure that, for each state, the expected number of times the state is entered equals the expected number of times an action is taken in that state, a flow conservation condition. The constraint (10) ensures that the specification is satisfied with a probability of at least the given threshold. We refer the reader to [27, 12] for details about the constraints of the LP.

For any optimal solution to the LP in (7)–(10),

(11)

is an optimal strategy, and the variables form its occupancy measure; see [27] and [12] for details. After finding an optimal solution to the LP in (7)–(10), we can compute the probability of satisfying the specification from the objective value in (7).
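For reference, a standard form of this dual LP and of the strategy recovery in (11), in the usual notation of [27, 12] (our symbols, not necessarily identical to the paper's equations (7)–(11)):

\[
\begin{aligned}
\text{maximize}\quad & \sum_{s \in S_{\mathrm{aec}}} y(s)\\
\text{subject to}\quad
& \sum_{a \in A} x(s, a) \;=\; \alpha(s) + \sum_{s' \in S_r} \sum_{a \in A} \mathcal{P}(s', a, s)\, x(s', a)
  && \forall s \in S_r,\\
& y(s) \;=\; \alpha(s) + \sum_{s' \in S_r} \sum_{a \in A} \mathcal{P}(s', a, s)\, x(s', a)
  && \forall s \in S_{\mathrm{aec}},\\
& x(s, a) \ge 0, \quad y(s) \ge 0,\\
& \sum_{s \in S_{\mathrm{aec}}} y(s) \;\geq\; \beta,
\end{aligned}
\]

where α(s) is one for the initial state and zero otherwise, S_r denotes the states outside accepting end components that can reach them, and S_aec denotes the states of accepting end components. A strategy is then recovered as

\[
  \sigma(s)(a) \;=\; \frac{x(s, a)}{\sum_{a' \in A} x(s, a')}
  \qquad \text{for } s \in S_r \text{ with } \sum_{a' \in A} x(s, a') > 0 .
\]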

IV-B3 Strategy repair using quasiconvex programming

Given the human strategy, the aim of the autonomy protocol is to compute the blended strategy, or equivalently the repaired strategy, that induces behavior similar to the human strategy while satisfying the specifications. We compute the repaired strategy by perturbing the human strategy, as introduced in Definition IV-B1. We give our formulation for computing the repaired strategy in the following lemma.

The shared control synthesis problem can be formulated as a nonlinear program with the following variables:

  • the occupancy variable for each state-action pair and the reachability variable for each state, as defined for the optimization problem in (7)–(10).

  • a variable that gives the maximal deviation between the human strategy and the repaired strategy.

(12)
(13)
(14)
(15)
(16)
Proof.

For any solution to the optimization problem above, the constraints in (13)–(15) ensure that the strategy computed by (11) satisfies the specification. We now show that, by minimizing the deviation variable in (12), we minimize the maximal deviation between the human strategy and the repaired strategy.

As in Definition IV-B1, we perturb the human strategy to the repaired strategy by

(17)

Note that this constraint is not a function of the occupancy measure of the repaired strategy. By expressing the repaired strategy in terms of its occupancy measure as in (11), we reformulate the constraint in (17) into the constraint

(18)

or equivalently to the constraint

(19)

Since we are interested in minimizing the maximal deviation, we introduce a common variable that upper-bounds the deviation for all state-action pairs in the MDP:

(20)

Therefore, by minimizing the deviation variable subject to the constraints in (13)–(16), we ensure that the repaired strategy deviates minimally from the human strategy. ∎
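The chain of constraints (17)–(20) can be sketched as follows, writing σ_h for the human strategy, σ_ha for the repaired strategy, x(s,a) for its occupancy variables, and \bar\delta for the common deviation bound (our notation):

\[
  \big|\sigma_{ha}(s)(a) - \sigma_h(s)(a)\big| \;\le\; \bar{\delta},
  \qquad
  \sigma_{ha}(s)(a) \;=\; \frac{x(s, a)}{\sum_{a' \in A} x(s, a')},
\]

which, after multiplying through by the nonnegative denominator, becomes

\[
  \Big|\, x(s, a) \;-\; \sigma_h(s)(a) \sum_{a' \in A} x(s, a') \,\Big|
  \;\le\; \bar{\delta} \sum_{a' \in A} x(s, a')
  \qquad \forall s \in S,\ a \in A .
\]

The product of \bar\delta and the occupancy sum on the right-hand side is the quadratic term discussed next; for a fixed \bar\delta the constraint is linear in the occupancy variables.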

The constraint in (20) is nonlinear; in fact, it is quadratic due to the multiplication of the deviation variable and the occupancy variables. However, we show that the problem in (12)–(16) is a quasiconvex programming problem, which can be solved efficiently using bisection over the deviation variable [4].

The constraint in (20) is quasiconvex, therefore the nonlinear programming problem in (12)–(16) is a quasiconvex programming problem.

Proof.

For a fixed value of the deviation variable, the set described by the inequality in (20) is convex, that is, the sublevel sets of the constraint function are convex [4, Section 3.4]. Therefore, the constraint in (20) is quasiconvex and the nonlinear program in (12)–(16) is a quasiconvex programming problem (QCP). ∎

We solve the QCP in (12)–(16) by bisection over the deviation variable. We initialize a lower and an upper bound on the maximal deviation between the human strategy and the repaired strategy. Then, we iteratively refine the bounds by solving a sequence of convex feasibility problems. A method for solving quasiconvex optimization problems is given in [4, Algorithm 4.1]; our approach, given in Algorithm 1, is based on it. We now state the main result of the paper.

given a lower bound and an upper bound on the deviation, and a tolerance. repeat  1. Set the candidate deviation to the midpoint of the lower and upper bounds.  2. Solve the convex feasibility problem in (13)–(16) for the candidate deviation.  3. if the problem in (13)–(16) is feasible, then set the upper bound to the candidate, else set the lower bound to the candidate. until the gap between the upper and lower bounds is at most the tolerance.
Algorithm 1 Bisection method to synthesize an optimal repaired strategy for the shared control synthesis problem.
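A minimal Python sketch of Algorithm 1, assuming a function solve_feasibility(delta) that builds and solves the convex feasibility problem in (13)–(16) for a fixed deviation bound delta (e.g., via an LP solver such as Gurobi) and returns a solution or None; the function name and interface are hypothetical:

def bisect_repair(solve_feasibility, lo=0.0, hi=1.0, tol=1e-4):
    """Bisection over the maximal deviation (sketch of Algorithm 1)."""
    best = None
    while hi - lo > tol:
        delta = 0.5 * (lo + hi)              # step 1: candidate deviation
        solution = solve_feasibility(delta)  # step 2: convex feasibility check
        if solution is not None:             # step 3: feasible, tighten upper bound
            hi, best = delta, solution
        else:                                #         infeasible, raise lower bound
            lo = delta
    return hi, best                          # smallest feasible deviation (up to tol)

Each call checks one convex problem, which is an LP for fixed delta, so the number of solver calls is logarithmic in the ratio of the initial interval length to the tolerance.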

The repaired strategy obtained from Algorithm 1 satisfies the task specifications, deviates minimally from the human strategy, and is therefore an optimal solution to the shared control synthesis problem.

Proof.

From a satisfying assignment to the constraints in (12)–(16), we compute a strategy that satisfies the specification using (11). Using Algorithm 1, we compute the repaired strategy that deviates minimally from the human strategy up to the given tolerance within a number of bisection iterations that is logarithmic in the ratio of the initial gap to the tolerance. Therefore, Algorithm 1 computes an optimal strategy for the shared control synthesis problem. ∎

Algorithm 1 computes the minimally deviating repaired strategy that satisfies the LTL specification. In [18], we computed an autonomy protocol with a greedy approach, which may require solving an unbounded number of LPs and yields a feasible strategy that is not necessarily optimal. In contrast, Algorithm 1 only needs to check the feasibility of a number of LPs that can be bounded in advance, and it computes an optimal strategy. Note that we do not compute the autonomy strategy with the QCP in (12)–(16) directly; after computing the repaired strategy, we compute the autonomy strategy from the repaired strategy, the human strategy, and the blending function according to Definition IV-A.

Computationally, the most expensive step of Algorithm 1 is checking the feasibility of the optimization problem in (13)–(16). The number of variables and constraints in this problem is linear in the number of states and actions of the MDP; therefore, checking feasibility can be done in time polynomial in the size of the MDP with interior-point methods [24]. Algorithm 1 terminates after a number of iterations that is logarithmic in the ratio of the initial gap to the tolerance; therefore, we can compute an optimal strategy up to the given accuracy in time polynomial in the size of the MDP.

IV-B4 Additional specifications

The QCP in (12)–(16) computes an optimal strategy for a single LTL specification. Suppose that, in addition to the LTL specification, we are given a reachability specification with its own probability threshold. We can handle this specification by appending the constraint

(21)

to the QCP in (12)–(16). The constraint in (21) ensures that the probability of reaching the given target set exceeds the required threshold.

We handle an expected cost specification over a set of goal states by adding the constraint

(22)

to the QCP in (12)–(16). The constraint in (22) ensures that the expected cost of reaching the goal set is less than the given bound.
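In the assumed notation, the two additional constraints (21) and (22) can be sketched as

\[
  \sum_{s \in G} y_G(s) \;\geq\; \lambda
  \qquad\text{and}\qquad
  \sum_{s \in S} \sum_{a \in A} C(s, a)\, x(s, a) \;\leq\; \kappa,
\]

where y_G(s) is the probability of reaching state s of the target set G (with the states of G made absorbing), λ is the reachability threshold, C is the cost function, and κ is the bound of the expected cost specification.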

IV-B5 Additional measures

We discuss additional measures, based on the occupancy measure of a strategy, that can be used to render the behavior of the human and the autonomy protocol similar. Instead of minimizing the maximal deviation between the human strategy and the repaired strategy, we can minimize the maximal difference between their occupancy measures. In this case, the difference between the human strategy and the repaired strategy will be smaller in states that are expected to be visited more often, and larger in states that are visited infrequently. We minimize the maximal difference of the occupancy measures by exchanging the objective while keeping the constraints in (13)–(15); a sketch is given below. The occupancy measure of the human strategy can be computed by finding a feasible solution to the constraints in (13)–(14) for the MC induced by the human strategy. We can also minimize other convex norms of the difference between the human strategy and the repaired strategy, such as the 1-norm or the 2-norm.
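A sketch of the alternative objective mentioned above (with x_h the occupancy measure of the human strategy; our notation):

\[
  \underset{x}{\text{minimize}}\ \max_{s \in S,\, a \in A} \big| x_h(s, a) - x(s, a) \big|,
  \qquad\text{or equivalently}\qquad
  \underset{x,\,\hat{\delta}}{\text{minimize}}\ \hat{\delta}
  \ \ \text{subject to}\ \
  \big| x_h(s, a) - x(s, a) \big| \le \hat{\delta} \ \ \forall s, a,
\]

together with the constraints in (13)–(15); in contrast to (20), this is a linear program.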

V Case study and experiments

We present two numerical examples that illustrate the efficacy of the proposed approach. In the first example, we consider the wheelchair scenario from Fig. 1, where the goal is to reach the target state without crashing into the obstacle. In the second example, we consider an unmanned aerial vehicle (UAV) mission, where the objective is to survey certain regions while avoiding enemy agents.

We require an abstract representation of the human’s commands as a strategy to use our synthesis approach in a shared control scenario. We first discuss how such strategies may be obtained using inverse reinforcement learning and report on case study results.

Fig. 4: The setting of the case study for the shared control simulation. We collect sample data from a simulation environment and compute the human strategy using maximum-entropy inverse reinforcement learning (MEIRL). From the human strategy, we compute an autonomy strategy based on our approach to the shared control synthesis problem.

V-A Experimental setting

We give an overview of the workflow of the experiments in Fig. 4. In a simulation environment, we collect sample data from the human's commands. Based on these commands, we compute a human strategy using maximum-entropy inverse reinforcement learning (MEIRL) [30]. After computing the human strategy, we synthesize the repaired strategy using the procedure in Algorithm 1, and then compute the autonomy strategy using (5). We can further refine our representation of the human strategy by collecting more sample data from the human's commands before blending with the autonomy strategy.

We model the wheelchair scenario inside an interactive Python environment. In the second scenario, we use the UAV simulation environment AMASE (https://github.com/afrl-rq/OpenAMASE), developed at the Air Force Research Laboratory. AMASE can be used to simulate multi-UAV missions. Its graphical user interfaces allow humans to send commands to one or multiple vehicles at run time. It includes three main programs: a simulator, a data playback tool, and a scenario setup tool.

We use the model checker PRISM [22] to verify whether the computed strategies satisfy the specification, and the LP solver Gurobi [16] to check the feasibility of the LP problems given in Section IV. We also implemented the greedy approach for strategy repair from [18]. In this section, we refer to the procedure given by Algorithm 1 as the QCP method and to the procedure from [18] as the greedy method.

V-B Data collection

We asked five participants to accomplish tasks in the wheelchair scenario; the goal is to move the wheelchair to a target cell in the gridworld while never occupying the same cell as the moving obstacle. Similarly, three participants performed the surveillance task in the AMASE environment.

From the data obtained from each participant, we compute an individual randomized human strategy via MEIRL. Reference [19] uses inverse reinforcement learning to reason about the human's commands in a shared control scenario from the human's demonstrations; however, that approach lacks formal guarantees on the robot's execution.

In our setting, each sample is one particular command of the participant, and we assume that the participant issues commands in order to satisfy the specification. Under this assumption, we can bound the probability of a possible deviation from the actual intent with respect to the number of samples using Hoeffding's inequality for the resulting strategy; see [29] for details. Using such bounds, we can determine the number of commands required to obtain an approximation of typical human behavior: the bound on the deviation probability decreases exponentially in the number of commands from the human and in the square of the allowed deviation between the probability of satisfying the specification with the true human strategy and the probability obtained with the strategy computed by inverse reinforcement learning. In this way, a desired accuracy and confidence determine the required number of demonstrations from the human.
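As an illustration of this kind of sample-size argument, a Hoeffding-style calculation in Python; the exact bound used via [29] may differ, so the function and the numbers below are only indicative:

import math

def required_demonstrations(eps, gamma):
    """Smallest n with 2*exp(-2*n*eps**2) <= gamma (two-sided Hoeffding bound)."""
    return math.ceil(math.log(2.0 / gamma) / (2.0 * eps ** 2))

# e.g., deviation of at most 0.05 in the satisfaction probability with 95% confidence
print(required_demonstrations(0.05, 0.05))  # 738 demonstrations under this bound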

We design the blending function by assigning a low weight to the human strategy at states where it yields a low probability of reaching the target set. Using this function, we create the autonomy strategy and pass it (together with the blending function) back to the environment. Note that the repaired strategy satisfies the specification, by Theorem IV-B3.

V-C Gridworld

The size of the gridworld in Fig. 1 is variable, and we generate a number of randomly moving (e.g., the vacuum cleaner) and stationary obstacles. An agent (e.g., the wheelchair) moves in the gridworld according to the commands from a human. For the gridworld scenario, we construct an MDP where the states represent the positions of the agent and the obstacles and the actions induce changes in the agent position.

The safety specification states that the agent has to reach the target cell without crashing into an obstacle, with at least a certain probability.

First, we report results for one particular participant in a gridworld scenario with a fixed grid size and one moving obstacle. The states of the MDP are given by the Cartesian product of the positions of the agent and the obstacle. The agent and the obstacle have four actions in all states, namely left, right, up, and down. At each state, a transition in the chosen direction occurs with a probability of 0.7, and the agent transitions to each of the two states adjacent to the chosen direction with a probability of 0.15. If a transition into a wall occurs, the agent remains in the same state. We fix a particular strategy for the obstacle and determine the transition probabilities between states as the product of the transition probabilities of the agent and the obstacle. We compute the human strategy using MEIRL, where the features are the components of the human's cost function, for instance the distance to the obstacle and the distance to the goal state.
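A Python sketch of one common reading of this slip model for a single move of the agent; the grid size, coordinate convention, and the interpretation of the 0.15 transitions as the two cells perpendicular to the chosen direction are our assumptions:

def move_distribution(pos, direction, width, height):
    """Distribution over successor cells: 0.7 in the chosen direction, 0.15 to each
    of the two cells perpendicular to it; moves into a wall leave the agent in place."""
    x, y = pos
    step = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
    side = {"up": [(-1, 0), (1, 0)], "down": [(-1, 0), (1, 0)],
            "left": [(0, -1), (0, 1)], "right": [(0, -1), (0, 1)]}
    dist = {}
    for (dx, dy), p in [(step[direction], 0.7)] + [(d, 0.15) for d in side[direction]]:
        nx, ny = x + dx, y + dy
        target = (nx, ny) if 0 <= nx < width and 0 <= ny < height else (x, y)
        dist[target] = dist.get(target, 0.0) + p
    return dist

print(move_distribution((0, 0), "up", 5, 5))  # {(0, 1): 0.7, (0, 0): 0.15, (1, 0): 0.15}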

We instantiate the safety specification with a fixed probability threshold, meaning that the target should be reached safely with at least that probability. The human strategy induces a satisfaction probability below the threshold; that is, it does not satisfy the specification.

We compute the repaired strategy using both the greedy and the QCP approach, and both resulting strategies satisfy the specification with a probability above the threshold. On the one hand, the maximal deviation between the human strategy and the repaired strategy is 0.15 with the greedy approach, which means that the strategies of the human and the autonomy protocol differ by at most 15% over all states and actions. On the other hand, the maximal deviation is 0.03 with the QCP approach. The results show that the QCP approach computes a repaired strategy that is more similar to the human strategy than the one computed by the greedy approach.

Fig. 5: Graphical representation of the obtained human, blended, and autonomy strategies in the grid (one panel per strategy).

We give a graphical representation of the human strategy, the repaired strategy, and the autonomy strategy in Fig. 5. For each strategy, we indicate the average probability of safely reaching the target under the QCP approach. Note that the probability of reaching the target depends on the current position of the obstacle; therefore, the probability of satisfying the specification could be higher or lower than shown in Fig. 5. In Fig. 5, the probability of reaching the target increases with a darker color, with black indicating the highest probability. We observe that the human strategy induces a lower probability of reaching the target in most of the states, while for the repaired strategy the probability of reaching the target is higher in all cells. The autonomy strategy induces a very high probability of reaching the target in each cell, but it may be too conservative and not similar to the human strategy.

TABLE I: Scalability results for the gridworld example. Columns: gridworld size, number of states, number of transitions, synthesis time with the greedy approach (sec), synthesis time with the QCP approach (sec), and the maximal deviations of the greedy and the QCP approach.

To assess the scalability of our approach, consider Table I. We generated MDPs for gridworlds of different sizes and with different numbers of obstacles. We list the number of states and transitions of each MDP. We report the time that the synthesis process took with the greedy and the QCP approach, measured in seconds; this includes the time for solving the LPs in the greedy method or the QCP, and, for the greedy approach, the model checking times using PRISM. To assess the optimality of the synthesis, we list the maximal deviation between the repaired strategy and the human strategy for the greedy and the QCP approach. In all examples, we observe that the QCP approach yields strategies with less deviation from the human strategy while having computation times similar to the greedy approach.

(a) Snapshot of a simulation using the AMASE simulator. The objective of the agent is to keep surveilling the green regions while avoiding enemy agents and restricted operating zones.
(b) The graphical user interface of the AMASE simulator for a UAV mission. The user interface contains various information about the vehicles such as the speed and the heading.
Fig. 6: An example of UAV mission that is simulated on AMASE.

V-D UAV mission planning

Similar to the gridworld scenario, we generate an MDP whose states encode the positions of the agent and the enemy agents in an AMASE scenario. Consider the example scenario in Fig. 6: the specification (or mission) of the agent (blue UAV) is to keep surveilling the green regions while avoiding restricted operating zones (ROZs) and enemy agents (purple and green UAVs). We asked the participants to visit the regions in a sequence, i.e., to visit the first region, then the second, and then the third region. After visiting the third region, the task is to visit the first region again to continue the surveillance.

For example, if the last visited region is a given one, the safety specification in this scenario requires reaching the next region to be visited while avoiding the ROZ areas, where the atomic proposition ROZ labels being inside a restricted operating zone and target labels visiting the next region.

We synthesize the autonomy protocol for the AMASE scenario with two enemy agents. The underlying MDP has 15625 states. We use the same blending function and the same threshold as in the gridworld example. The features used to compute the human strategy with MEIRL are the distances to the closest ROZ, to the enemy agents, and to the target region.

The human strategy violates the specification with a probability of 0.496. Again, we compute the repaired strategy with the greedy and the QCP approach, and both strategies satisfy the specification. On the one hand, the maximal deviation between the human strategy and the repaired strategy is 0.418 with the greedy approach, which means the strategies of the human and the autonomy protocol are significantly different in some states of the MDP. On the other hand, the QCP approach yields a repaired strategy that is more similar to the human strategy, with a maximal deviation of 0.038. The synthesis time is 481.31 seconds with the greedy approach and 749.18 seconds with the QCP approach, showing the trade-off between the two approaches: the greedy approach can compute a feasible solution slightly faster, but the resulting repaired strategy may be less similar to the human strategy than the one computed by the QCP approach.

TABLE II: Results for different specification thresholds for the satisfaction probability and the expected time in the AMASE example. Columns: the probability threshold, the expected-time threshold, the synthesis time with the QCP approach (sec), and the maximal deviation between the human strategy and the repaired strategy.

To assess the effect of changing the threshold for satisfying the specification, we use a different, higher threshold. The greedy approach did not terminate within one hour and could not find a repaired strategy that satisfies the specification after 45 iterations. The QCP approach computes a repaired strategy with a maximal deviation of 0.093 in 779.81 seconds, showing that it does not take significantly more time even with a higher threshold. We conclude that the greedy approach may not find a feasible strategy efficiently if most of the strategies in an MDP do not satisfy the specification.

We also assess the effect of adding constraints to the task, namely surveilling the next green region within a certain number of time steps. We synthesize different strategies for different expected times until the UAV reaches the next region and summarize the results in Table II. For each probability threshold and expected time to complete the mission, we report the synthesis time and the maximal deviation. The results in Table II illustrate that the maximal deviation increases with an increasing probability threshold and a decreasing expected time to complete the mission. For example, with a high threshold and a short expected time, the maximal deviation between the human strategy and the repaired strategy shows that the strategies of the human and the autonomy protocol can be significantly different in some states, whereas with a lower threshold and a longer expected time the maximal deviation is significantly smaller. We also note that there is no significant difference in synthesis time across the different thresholds and expected times.

TABLE III: Results for different perturbations of the human strategy in the AMASE example. Columns: the maximal perturbation introduced to the human strategy, the maximal deviation between the autonomy protocols synthesized from the original and the perturbed human strategies, the synthesis time with the QCP approach (sec), and the maximal deviation between the (perturbed) human strategy and the repaired strategy.

V-E Effect of changing the human strategy

In this section, we investigate how changing the human strategy changes the strategy of the autonomy protocol. We perturb the human strategy from the previous example using (6) with different perturbation functions. We use three different values for the maximal perturbation over all states and actions and two different thresholds for satisfying the specification.

We summarize our results in Table III. We generated three different perturbed human strategies with perturbation functions that have different maximal perturbations. We report the maximal deviation between the autonomy protocols synthesized from the original and the perturbed human strategies, the time that the synthesis process took with the QCP approach (labeled as "QCP synth."), and the maximal deviation between the perturbed human strategies and the repaired strategies.

The results in Table III show that the maximal deviation between the repaired strategy and the human strategy does not depend on the perturbation but on the threshold for satisfying the specification. The maximal deviation between the repaired strategies obtained from different perturbed human strategies increases with larger perturbations of the human strategy and with a larger threshold. These values show that the maximal deviation between the human strategy and the repaired strategy does not depend heavily on the specific human strategy but mostly on the threshold. We also note that the synthesis time is similar in all cases.

VI Conclusion and Critique

We introduced a formal approach to synthesize an autonomy protocol in a shared control setting subject to probabilistic temporal logic specifications. The proposed approach utilizes inverse reinforcement learning to compute an abstraction of a human’s behavior as a randomized strategy in a Markov decision process. We designed an autonomy protocol such that the resulting robot strategy satisfies safety and performance specifications. We also ensured that the resulting robot behavior is as similar to the behavior induced by the human’s commands as possible. We synthesized the robot behavior using quasiconvex programming. We showed the practical usability of our approach through case studies involving autonomous wheelchair navigation and unmanned aerial vehicle planning.

There are a number of limitations and also possible extensions of the proposed approach. First of all, we compute a globally optimal strategy by bisection, which requires checking the feasibility of a number of linear programming problems. A direct convex formulation of the shared control synthesis problem would make computing the globally optimal strategy more efficient.

We assumed that the human’s commands are consistent through the whole execution, , the human issues each command to satisfy the specification. Also, this assumption implies the human does not consider assistance from the robot while providing commands - and in particular, the human does not adapt the strategy to the assistance. It may be possible to extend the approach to handle non-consistent commands by utilizing additional side information, such as the task specifications.

Finally, in order to generalize the proposed approach to other task domains, it is worth exploring transfer learning techniques [25]. Such techniques would allow us to handle different scenarios without having to relearn the human strategy from the human's commands.

References

  • [1] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, page 1. ACM, 2004.
  • [2] Christel Baier and Joost-Pieter Katoen. Principles of Model Checking. The MIT Press, 2008.
  • [3] Mihir Bellare and Phillip Rogaway. The complexity of approximating a nonlinear program. In Complexity in numerical optimization, pages 16–32. World Scientific, 1993.
  • [4] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.
  • [5] Taolue Chen, Yuan Feng, David S. Rosenblum, and Guoxin Su. Perturbation analysis in verification of discrete-time Markov chains. In CONCUR, volume 8704 of LNCS, pages 218–233. Springer, 2014.
  • [6] Luca De Alfaro. Formal verification of probabilistic systems. Number 1601. Citeseer, 1997.
  • [7] Christian Dehnert, Sebastian Junges, Joost-Pieter Katoen, and Matthias Volk. A storm is coming: A modern probabilistic model checker. In International Conference on Computer Aided Verification, pages 592–600. Springer, 2017.
  • [8] Anca D. Dragan and Siddhartha S. Srinivasa. Formalizing assistive teleoperation. In Robotics: Science and Systems, 2012.
  • [9] Anca D. Dragan and Siddhartha S. Srinivasa. A policy-blending formalism for shared control. I. J. Robotic Res., 32(7):790–805, 2013.
  • [10] Andrew Fagg, Michael Rosenstein, Robert Platt, and Roderic Grupen. Extracting user intent in mixed initiative teleoperator control. In Intelligent Systems Technical Conference, page 6309, 2004.
  • [11] Lu Feng, Clemens Wiltsche, Laura Humphrey, and Ufuk Topcu. Synthesis of human-in-the-loop control protocols for autonomous systems. IEEE Transactions on Automation Science and Engineering, 13(2):450–462, 2016.
  • [12] Vojtěch Forejt, Marta Kwiatkowska, Gethin Norman, David Parker, and Hongyang Qu. Quantitative multi-objective verification for probabilistic systems. In TACAS, pages 112–127. Springer, 2011.
  • [13] Roland Fryer and Matthew O Jackson. A categorical model of cognition and biased decision making. The BE Journal of Theoretical Economics, 8(1).
  • [14] Jie Fu and Ufuk Topcu. Synthesis of shared autonomy policies with temporal logic specifications. IEEE Transactions on Automation Science and Engineering, 13(1):7–17, 2016.
  • [15] F. Galán, M. Nuttin, E. Lew, P. W. Ferrez, G. Vanacker, J. Philips, and J. del R. Millán. A brain-actuated wheelchair: Asynchronous and non-invasive brain-computer interfaces for continuous control of robots. Clinical Neurophysiology, 119(9):2159–2169, 2008.
  • [16] Gurobi Optimization, Inc. Gurobi optimizer reference manual. http://www.gurobi.com, 2013.
  • [17] Ernst Moritz Hahn, Mateo Perez, Sven Schewe, Fabio Somenzi, Ashutosh Trivedi, and Dominik Wojtczak. Omega-regular objectives in model-free reinforcement learning. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 395–412. Springer, 2019.
  • [18] Nils Jansen, Murat Cubuktepe, and Ufuk Topcu. Synthesis of shared control protocols with provable safety and performance guarantees. In ACC, pages 1866–1873. IEEE, 2017.
  • [19] Shervin Javdani, J Andrew Bagnell, and Siddhartha Srinivasa. Shared autonomy via hindsight optimization. In Robotics: Science and Systems, 2015.
  • [20] Dae-Jin Kim, Rebekah Hazlett-Knudsen, Heather Culver-Godfrey, Greta Rucks, Tara Cunningham, David Portee, John Bricout, Zhao Wang, and Aman Behal. How autonomy impacts performance and satisfaction: Results from a study with spinal cord injured subjects using an assistive robot. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 42(1):2–14, 2012.
  • [21] Jonathan Kofman, Xianghai Wu, Timothy J Luu, and Siddharth Verma. Teleoperation of a robot manipulator using a vision-based human-robot interface. IEEE transactions on industrial electronics, 52(5):1206–1219, 2005.
  • [22] Marta Kwiatkowska, Gethin Norman, and David Parker. PRISM 4.0: Verification of probabilistic real-time systems. In CAV, volume 6806 of LNCS, pages 585–591. Springer, 2011.
  • [23] Adam Leeper, Kaijen Hsiao, Matei Ciocarlie, Leila Takayama, and David Gossow. Strategies for human-in-the-loop robotic grasping. In HRI, pages 1–8. IEEE, 2012.
  • [24] Yurii Nesterov and Arkadii Nemirovskii. Interior-point polynomial algorithms in convex programming, volume 13. SIAM, 1994.
  • [25] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
  • [26] Shashank Pathak, Erika Ábrahám, Nils Jansen, Armando Tacchella, and Joost-Pieter Katoen. A greedy approach for the efficient repair of stochastic models. In NFM, volume 9058 of LNCS, pages 295–309. Springer, 2015.
  • [27] Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
  • [28] Jian Shen, Javier Ibanez-Guzman, Teck Chew Ng, and Boon Seng Chew. A collaborative-shared control system with safe obstacle avoidance capability. In Robotics, Automation and Mechatronics, volume 1, pages 119–123. IEEE, 2004.
  • [29] Brian D Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. PhD thesis, Carnegie Mellon University, 2010.
  • [30] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, 2008.