## 1 Introduction

Solution diversity has value in numerous planning applications, including collaborative systems, reinforcement learning, and preference-based planning. In human and, more generally, animal groups, behavioral diversity leads to heterogeneous behavior among group members. This heterogeneity ensures that the members learn complementary skills, thus improving the group’s overall performance. An agent learning a task in an unknown environment may benefit from inducing diversity in its decisions to explore the environment more efficiently. In planning with unknown preferences, one can use diversity to construct a set of behaviors that are suitable for different preferences.

Algorithms that use notions of diversity to address one or more of these applications are known as quality diversity (QD) algorithms. A key component of QD algorithms is a way to summarize the important properties of different solutions. This description, known as a behavior characterization, is used to define diversity-based metrics. Without proper behavior characterization, solutions with trivial differences can have high values of diversity as measured by the resulting metric.

Our work is motivated by planning in settings where, in addition to a known objective, there exist some unknown objectives. The unknown objectives may represent a human user or designer’s preference, which is either private or complex to model. In these settings, we propose a QD-based approach to construct a “representative” — small and diverse — set of near-optimal policies with respect to the known objective and then present that to the human to select from according to their unknown objectives. This approach allows the human to have the ultimate control over the behavior, without requiring prior knowledge of the human’s preferences.

Formally, we consider the multi-objective optimization problem of returning a set of feasible policies for an infinite horizon Markov decision process (MDP) that is both near-optimal and diverse. We define the optimality of a set of policies as the sum of each policy’s expected average reward in the set. Diversity captures the representativeness of a set of policies. We characterize the behavior of policies using their state-action occupancy measures and quantify diversity by the sum of pairwise divergences between the state-action occupancy measures of the policies in the set.

A key element of our approach is the behavior characterization of policies using their state-action occupancy measures. This characterization is domain-independent and fully encapsulates the dynamics of a given policy. We use this characterization to define the diversity of a set of policies through the pairwise Jensen-Shannon divergences between the occupancy measures. We then define the objective as a linear combination of the sum of the policies’ rewards and their diversity. By utilizing the dual of the average cost linear program, we recast our formulation as a constrained optimization problem. We then show that, due to the constraints’ linearity, the problem can be solved efficiently using the Frank-Wolfe algorithm. We also prove that the algorithm is guaranteed to converge to a stationary point. Furthermore, in a series of simulations, we evaluate the proposed algorithm’s performance and show its efficacy.

## 2 Related Work

Research on the development of QD algorithms has occurred within two different communities. In the field of optimization, perspectives on evolution as a process that finds distinct niches for different species have motivated the use of diversity. Simultaneously, there has been significant interest in the use of diversity to provide high-quality solutions for unknown objectives within the planning community.

In the optimization community, recent interest in QD algorithms has been driven by the success of the Novelty Search algorithm (lehman2008exploiting). The original Novelty Search algorithm eschews notions of solution quality entirely; its sole goal is to find a set of solutions that are diverse with respect to some distance measure. Surprisingly, this approach is able to find solutions with better performance on difficult tasks, such as maze navigation, than algorithms relying on an objective function. This result has led to considerable interest in the development of new QD algorithms to address tasks that were previously considered too difficult. For a review, see Pugh, Soros, and Stanley (pugh2016quality).

The type of behavior characterization used in these works varies and can be domain-dependent. For example, in navigation problems, diversity can be defined using Euclidean distances between points visited. Another approach, used by the popular MAP elites algorithm (mouret2015illuminating), is to assume that a domain-dependent behavior characterization is given. A promising area of research is the development of new approaches to behavior characterization (gaier2020automating).

The success of the Novelty Search and MAP elites algorithms has inspired the use of diversity in reinforcement learning, with the hope that diversity can help avoid poor local minima. Different methods of behavior characterization for policies have been used, including methods based on sequences of actions (jackson2019novelty), state trajectories (eysenbach2018diversity), or determinants of actions in states (parker2020effective). Similarly to our work, Parker-Holder et al. consider an explicit tradeoff between the quality and diversity of the policies. However, our approach differs in that we leverage knowledge of the system dynamics to characterize policies in a way that includes information about both the states visited and the policy actions, and to develop a solution algorithm with guaranteed convergence to a local optimum.

Behavior characterization has also been a key focus of QD-based work in the planning community. For example, in an approach similar to MAP elites, Myers and Lee (myers1999generating) and Myers (myers2006metatheoretic) assume that there is a meta-description of the planning domain. They then define an approach that obtains solutions that are diverse with respect to the meta-description. Another approach to behavior characterization is through the use of domain landmarks, which are disjunctive sets of propositions that plans must satisfy, such as a set of states that a trajectory must reach before the goal state (hoffmann2001ff). If the set of landmarks can be computed, a greedy algorithm can be used to iteratively select landmarks from the set and find a plan that satisfies each landmark, e.g., reaches a certain state (bryce2014landmark). Behavior characterization based on the plan actions, as in the RL community, is also a common technique (coman2011generating; nguyen2012generating; katz2020reshaping).

The way behavioral characterization and diversity metrics are incorporated into planning algorithms varies. In some cases, the problem is formulated as maximizing the diversity of the set of solutions (coman2011generating), or as finding a set of solutions that satisfy a diversity threshold (nguyen2012generating; srivastava2007domain). In other cases, like our work, there exist both an unknown objective and a known objective, and the problem is formulated in terms of a tradeoff between the diversity of the solution set and the optimality of each of the candidate solutions (coman2011generating; katz2020reshaping; petit2015finding). Our work is distinct from these approaches because we develop a new method for behavior characterization and consider a stochastic setting modeled as an MDP. In addition, unlike many QD-based planning algorithms, our approach does not rely on greedy strategies. While greedy algorithms have near-optimality guarantees in some settings, such as when the problem is submodular (bach_2013), in general no such guarantee exists.

## 3 Problem Formulation

We now overview the required background related to Markov decision processes, occupancy measures, and divergence metrics. Then, we present the main problem as a nonlinear optimization problem over the space of occupancy measures.

### 3.1 Preliminaries

We consider systems whose behavior is modeled by a Markov decision process (MDP). An MDP is a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r)$, where $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ is a probabilistic transition function such that for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$, $\sum_{s' \in \mathcal{S}} P(s, a, s') = 1$, and $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a reward function.

A stationary stochastic policy $\pi$ on an MDP is a mapping from the state space to a probability distribution over the actions, formally defined as $\pi : \mathcal{S} \to \Delta(\mathcal{A})$. Here we consider only stationary stochastic policies and denote the set of all such policies as $\Pi$. We focus on the class of problems defined over MDPs that aim to maximize the long-run average reward. The long-run average reward of a policy $\pi$ is

$$\rho(\pi) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\left[ \sum_{t=0}^{T-1} r(s_t, a_t) \right], \tag{1}$$

where the expectation is over all possible trajectory realizations from policy $\pi$, and $s_t$ and $a_t$ are time-indexed states and actions according to a trajectory $\tau$. We assume that the MDP satisfies the weak accessibility (WA) condition.

The occupancy measure of a policy $\pi$, denoted $\mu^\pi$, is defined as the distribution induced by the execution of that policy over the state-action pairs, asymptotically, i.e.,

$$\mu^\pi(s, a) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \Pr(s_t = s,\, a_t = a \mid \pi). \tag{2}$$

The long-run behavior of a stationary stochastic policy can be represented using its corresponding occupancy measure. An optimal stationary stochastic policy is a policy that maximizes the long-run average reward. It has been shown that under the WA condition, an optimal policy can be obtained by solving the Bellman equation, which can be reformulated as the dual form of a linear program (see Section 4.5 in Volume II of bertsekas1995dynamic),

$$\begin{aligned} \max_{\mu} \quad & \langle r, \mu \rangle \\ \text{subject to} \quad & \sum_{a \in \mathcal{A}} \mu(s', a) = \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} P(s, a, s')\, \mu(s, a), \quad \forall s' \in \mathcal{S}, \\ & \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} \mu(s, a) = 1, \qquad \mu(s, a) \geq 0, \quad \forall s \in \mathcal{S},\, a \in \mathcal{A}, \end{aligned} \tag{3}$$

over occupancy measures $\mu$, where $\langle \cdot, \cdot \rangle$ denotes the inner product in the space of $\mathcal{S} \times \mathcal{A}$, i.e., $\langle r, \mu \rangle = \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} r(s, a)\, \mu(s, a)$.

In particular, an optimal policy corresponding to the solution $\mu^*$ from solving the above linear program can be computed by defining

$$\pi^*(a \mid s) = \frac{\mu^*(s, a)}{\sum_{a' \in \mathcal{A}} \mu^*(s, a')} \tag{4}$$

for all non-transient states. We note that the optimal policy corresponding to $\mu^*$ is not uniquely defined, as the choice of action in transient states does not affect the long-run behavior.
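As a concrete illustration of (3) and (4), the following sketch solves the occupancy-measure linear program for a toy two-state, two-action MDP with an off-the-shelf LP solver and then recovers a policy; the MDP's numbers are hypothetical:

```python
import numpy as np
from scipy.optimize import linprog

# A toy 2-state, 2-action MDP (hypothetical numbers, for illustration).
# P[s, a, s'] is the transition probability, r[s, a] the reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
nS, nA = r.shape

# Flatten mu(s, a) into a vector of length nS * nA. Balance constraints of
# (3): sum_a mu(s', a) = sum_{s,a} P(s, a, s') mu(s, a) for every s'.
A_eq = np.zeros((nS + 1, nS * nA))
for sp in range(nS):
    for s in range(nS):
        for a in range(nA):
            A_eq[sp, s * nA + a] -= P[s, a, sp]
    for a in range(nA):
        A_eq[sp, sp * nA + a] += 1.0
A_eq[nS, :] = 1.0                       # normalization: mu sums to one
b_eq = np.zeros(nS + 1)
b_eq[nS] = 1.0

# linprog minimizes, so negate the reward to maximize <r, mu>.
res = linprog(c=-r.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
mu = res.x.reshape(nS, nA)

# Recover a policy via (4): pi(a|s) = mu(s, a) / sum_a' mu(s, a').
# States with (numerically) zero occupancy are transient; any choice works,
# so we default to a uniform distribution there.
row = mu.sum(axis=1, keepdims=True)
pi = np.divide(mu, row, out=np.full_like(mu, 1.0 / nA), where=row > 1e-12)
print(pi)
```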

In our problem context, we seek to find a set of policies such that each policy in the set is near-optimal, and the set is representative of the diverse range of near-optimal behaviors. Specifically, we aim to find a small set of policies with cardinality $n$, and we use $\Pi_n = \{\pi_1, \ldots, \pi_n\}$ to denote a set of $n$ stationary stochastic policies.

The state-action occupancy measures provide a natural and domain-independent way to characterize the behavior of policies and ensure diversity. In particular, by using a pairwise metric of the distance between occupancy measures, we can define a diversity metric for a set of policies. Given that the occupancy measures are probability distributions, a natural choice is the Jensen–Shannon divergence (briet2009properties). The Jensen–Shannon divergence between two probability distributions $p$ and $q$ is expressed as

$$\mathrm{JS}(p \,\|\, q) = \frac{1}{2}\, \mathrm{KL}(p \,\|\, m) + \frac{1}{2}\, \mathrm{KL}(q \,\|\, m), \tag{5}$$

where $m = \frac{1}{2}(p + q)$ is the average distribution, and $\mathrm{KL}(\cdot \,\|\, \cdot)$ denotes the Kullback–Leibler divergence. The Kullback–Leibler divergence for two probability distributions $p$ and $q$, over the same discrete probability space $\mathcal{X}$, is defined as

$$\mathrm{KL}(p \,\|\, q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}. \tag{6}$$

We choose the Jensen–Shannon divergence over other probability distribution-based measures because it is symmetric and bounded between zero and one.
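A minimal sketch of (5) and (6), using base-2 logarithms so that the divergence is bounded by one:

```python
import numpy as np

def kl(p, q):
    # KL divergence (6), with the convention 0 * log(0/q) = 0.
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def js(p, q):
    # JS divergence (5): average the KL to the mixture m = (p + q) / 2.
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.0, 0.5, 0.5])
print(js(p, q))  # 0.5 — symmetric, and in [0, 1] with base-2 logs
```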

### 3.2 Problem Statement

We aim to design an algorithm that can provide a representative set of policies over an MDP that are near-optimal with respect to a known reward function. In particular, given the stated definitions, the objective is to construct $n$ policies that, cumulatively, have high reward and diversity. We define the cumulative reward of a set of policies as the sum of their individual accumulated rewards, i.e.,

$$R(\Pi_n) = \sum_{i=1}^{n} \rho(\pi_i), \tag{7}$$

and their cumulative diversity as the sum of the pairwise Jensen–Shannon divergences between their occupancy measures, i.e.,

$$D(\Pi_n) = \sum_{i=1}^{n} \sum_{j=i+1}^{n} \mathrm{JS}\big(\mu^{\pi_i} \,\|\, \mu^{\pi_j}\big). \tag{8}$$

Therefore, given an MDP $\mathcal{M}$ and a parameter $n$, the goal is to find a set of $n$ policies $\Pi_n$ with high cumulative reward $R(\Pi_n)$ and high diversity $D(\Pi_n)$.

## 4 Proposed Solution

Our problem statement defines a multi-objective optimization problem that aims to maximize a reward-based objective and a diversity-based objective. A standard method for tackling multi-objective problems is to linearly combine the objectives using judiciously chosen weights. To that end, we first note that the objectives should be independent of the cardinality of the solution set, i.e., the number of policies should not affect the quality of the solution. We address this point by normalizing the reward term by the number of policies, $n$, and the diversity term by the number of unique policy pairs, $\binom{n}{2}$. Then, we can define the compound objective function as a linear combination of the normalized reward and diversity. The problem of finding $\Pi_n$ can thus be cast as finding a solution to the following optimization problem:

$$\max_{\Pi_n} \; \frac{1}{n}\, R(\Pi_n) + \frac{\lambda}{\binom{n}{2}}\, D(\Pi_n), \tag{9}$$

where $\lambda \geq 0$ is the tradeoff parameter that controls the relative weightings of the reward and diversity. Using the dual of the linear program for finding an optimal policy, we reformulate the above problem as

$$\begin{aligned} \max_{\mu_1, \ldots, \mu_n} \quad & f(\mu_1, \ldots, \mu_n) = \frac{1}{n} \sum_{i=1}^{n} \langle r, \mu_i \rangle + \frac{\lambda}{\binom{n}{2}} \sum_{i=1}^{n} \sum_{j=i+1}^{n} \mathrm{JS}(\mu_i \,\|\, \mu_j) \\ \text{subject to} \quad & \sum_{a \in \mathcal{A}} \mu_i(s', a) = \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} P(s, a, s')\, \mu_i(s, a), \quad \forall s' \in \mathcal{S},\; i \in [n], \\ & \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} \mu_i(s, a) = 1, \qquad \mu_i(s, a) \geq 0, \quad \forall s \in \mathcal{S},\, a \in \mathcal{A},\; i \in [n], \end{aligned} \tag{10}$$

where $\mu_1, \ldots, \mu_n$ denote the occupancy measures corresponding to the $n$ policies, and $[n] = \{1, \ldots, n\}$.

The reformulated version is a constrained optimization problem with linear constraints and a nonlinear (and nonconcave) objective function. In general, this problem does not have a unique global solution. For instance, any permutation of the policies in an optimal solution will result in another optimal solution. Nonetheless, one can seek solutions that can at least converge to local stationary points.
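The compound objective can be sketched as follows; the exact weighting of the two terms is an assumption of this sketch, following the normalization by $n$ and $\binom{n}{2}$ described above:

```python
import numpy as np
from itertools import combinations

def js(p, q):
    # Jensen-Shannon divergence with base-2 logs (bounded in [0, 1]).
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def compound_objective(mus, r, lam):
    """Normalized reward plus lam-weighted normalized pairwise diversity.

    mus: list of n occupancy measures, each a flat array over (s, a) pairs.
    r:   flat reward vector over (s, a) pairs.
    lam: tradeoff parameter (this weighting is a sketch of objective (9)).
    """
    n = len(mus)
    reward = sum(float(r @ mu) for mu in mus) / n
    pairs = list(combinations(range(n), 2))
    diversity = sum(js(mus[i], mus[j]) for i, j in pairs) / len(pairs)
    return reward + lam * diversity

# Two maximally distinct "occupancy measures" (hypothetical numbers).
r = np.array([1.0, 0.0])
mus = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(compound_objective(mus, r, lam=1.0))  # 0.5 reward + 1.0 diversity = 1.5
```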

### 4.1 Projected Gradient Ascent

The first optimization method that we consider is projected gradient ascent (PGA) (boyd2004convex). PGA iteratively applies a gradient update followed by a projection step. Let $\mathcal{F}$ represent the space of feasible occupancy measures defined by the constraints in (10), and let $\mathrm{proj}_{\mathcal{F}}(\cdot)$ denote the projection operator onto $\mathcal{F}$, i.e., given a point $y$, it returns the solution to the optimization problem

$$\min_{\mu} \; d(\mu, y) \quad \text{subject to} \quad \mu \in \mathcal{F}. \tag{11}$$

We choose the projection metric to be the $\ell_2$-norm, i.e., $d(\mu, y) = \|\mu - y\|_2$. The details of the PGA algorithm are outlined in Algorithm 1. We initialize the occupancy measures by first defining random policies for the given MDP, i.e., policies with random probability distributions over actions in each state. The algorithm terminates once the convergence criteria are met, e.g., the gradient mapping (nesterov2013introductory), defined as

$$G_\eta(\mu) = \frac{1}{\eta} \left( \mathrm{proj}_{\mathcal{F}}\big(\mu + \eta\, \nabla f(\mu)\big) - \mu \right),$$

hits a target threshold or the number of iterations exceeds a prespecified number.
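As an illustration of the PGA scheme, the following sketch runs projected gradient ascent over the probability simplex, used here as a stand-in for the occupancy-measure polytope (the full feasible set would also include the balance constraints of (10)); the linear objective and all numbers are hypothetical:

```python
import numpy as np

def project_simplex(y):
    # Euclidean projection onto the probability simplex — a stand-in for
    # the MDP's occupancy-measure polytope (which would also include the
    # balance constraints of (10)).
    u = np.sort(y)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(y) + 1) > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(y - theta, 0.0)

def pga(grad, x0, eta=0.1, iters=200):
    # Projected gradient ascent: gradient step, then projection.
    x = project_simplex(x0)
    for _ in range(iters):
        x = project_simplex(x + eta * grad(x))
    return x

# Maximize <r, x> over the simplex; the optimum puts all mass on argmax r.
r = np.array([1.0, 3.0, 2.0])
x = pga(lambda x: r, np.ones(3) / 3)
print(np.round(x, 3))  # converges to the vertex [0, 1, 0]
```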

### 4.2 Frank-Wolfe Algorithm

Even though PGA can decouple the projection step for each policy, it still has to solve a convex optimization problem for each policy at each iteration. To avoid this complexity, we propose the use of the Frank-Wolfe (FW) algorithm (frank1956algorithm). Every iteration of FW aims to move toward an optimizer of the linear approximation of the original objective function at the current point. For this reason, it is popular for optimization problems with structured constraint sets. In particular, the linearity of the constraints in (10) turns every iteration of FW into a linear optimization problem. We implement FW with adaptive step sizes (lacoste2016convergence) and backtracking line search, as presented in Algorithm 2. At iteration $k$, the algorithm finds a feasible point $v_k$ within the set of feasible occupancy measures, $\mathcal{F}$, that maximizes the linear approximation of $f$ at the current point. Then, it moves in the direction of $v_k - \mu_k$ by a step size $\gamma_k$ that is computed using a line search. We efficiently implement the line search using backtracking. The algorithm terminates once the FW gap, defined as

$$g_k = \max_{v \in \mathcal{F}} \; \big\langle \nabla f(\mu_k),\, v - \mu_k \big\rangle,$$

falls below a given tolerance or the number of iterations reaches a prespecified number.
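The FW iteration can be sketched on the same kind of toy feasible set; over the probability simplex the linear subproblem has a closed-form vertex solution, whereas in our setting it would be the linear program over $\mathcal{F}$. The quadratic objective is illustrative, and we use the standard $2/(k+2)$ schedule rather than the paper's backtracking line search:

```python
import numpy as np

def frank_wolfe(grad, x0, iters=200):
    # Frank-Wolfe over the probability simplex: the linear subproblem
    # max_v <grad, v> is attained at a vertex, i.e. a basis vector.
    x = x0.copy()
    for k in range(iters):
        g = grad(x)
        v = np.zeros_like(x)
        v[np.argmax(g)] = 1.0           # LP solution at a simplex vertex
        gap = float(g @ (v - x))        # Frank-Wolfe gap
        if gap < 1e-8:
            break
        x += 2.0 / (k + 2.0) * (v - x)  # standard diminishing step size
    return x

# Maximize f(x) = -||x - c||^2 over the simplex; grad f(x) = -2 (x - c).
c = np.array([0.2, 0.5, 0.3])
x = frank_wolfe(lambda x: -2.0 * (x - c), np.array([1.0, 0.0, 0.0]))
print(np.round(x, 3))  # approaches c, which lies inside the simplex
```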

## 5 Theoretical Guarantees

Next, we prove that by applying PGA and FW on a slightly revised problem one can establish non-asymptotic convergence rates to a stationary point.

Let $\mathcal{F}_\delta = \{\mu \in \mathcal{F} : \mu(s,a) \geq \delta \text{ for all } (s,a)\}$ for some $\delta > 0$ represent a restricted space for occupancy measures. In the next lemma, we prove that the gradient of the objective function is Lipschitz continuous over the restricted space $\mathcal{F}_\delta$.

###### Lemma 1.

Let $L = \frac{\lambda}{n\delta}$. The gradient of the objective function $f$ defined in (10) is $L$-Lipschitz over $\mathcal{F}_\delta^n$. That is, for all $\boldsymbol{\mu}, \boldsymbol{\mu}' \in \mathcal{F}_\delta^n$, $\|\nabla f(\boldsymbol{\mu}) - \nabla f(\boldsymbol{\mu}')\|_2 \leq L\, \|\boldsymbol{\mu} - \boldsymbol{\mu}'\|_2$.

###### Proof.

First, we note that the linear term of $f$ does not contribute to the Lipschitzness. Moreover, the diversity term has been normalized by $\binom{n}{2}$, and each $\mu_i$ appears in $n-1$ of the pairwise divergence terms. Therefore, if we can show that $\mathrm{JS}(p \,\|\, q)$ has an $\ell$-Lipschitz gradient over $\mathcal{F}_\delta \times \mathcal{F}_\delta$, then we can conclude that $f$ has a $\frac{\lambda (n-1)}{\binom{n}{2}}\, \ell = \frac{2\lambda}{n}\, \ell$-Lipschitz gradient. To show that $\mathrm{JS}$ has a Lipschitz gradient, we start by computing an entry of its Hessian $H$. Let $(s,a)$ be an arbitrary state-action pair, where $p(s,a), q(s,a) \geq \delta$. Then, we have

$$\frac{\partial\, \mathrm{JS}(p \,\|\, q)}{\partial p(s,a)} = \frac{1}{2} \log \frac{2\, p(s,a)}{p(s,a) + q(s,a)}.$$

Taking the derivative with respect to an arbitrary $q(s',a')$, we obtain

$$\frac{\partial^2\, \mathrm{JS}(p \,\|\, q)}{\partial q(s',a')\, \partial p(s,a)} = \begin{cases} -\dfrac{1}{2\,\big(p(s,a) + q(s,a)\big)}, & (s',a') = (s,a), \\ 0, & \text{otherwise}. \end{cases}$$

By some straightforward calculation, one can see that the Hessian is sparse. More specifically, it holds that

$$\frac{\partial^2\, \mathrm{JS}(p \,\|\, q)}{\partial p(s,a)^2} = \frac{q(s,a)}{2\, p(s,a)\, \big(p(s,a) + q(s,a)\big)},$$

while all remaining entries of the corresponding column vanish. Therefore, given that only two entries of each column of the Hessian are nonzero, using the Gershgorin circle theorem (horn2012matrix), we can show that

$$\|H\|_2 \;\leq\; \max_{(s,a)} \left( \frac{q(s,a)}{2\, p(s,a)\, \big(p(s,a)+q(s,a)\big)} + \frac{1}{2\,\big(p(s,a)+q(s,a)\big)} \right) \;=\; \max_{(s,a)} \frac{1}{2\, p(s,a)} \;\leq\; \frac{1}{2\delta}.$$

Hence, $\nabla \mathrm{JS}$ is $\frac{1}{2\delta}$-Lipschitz and consequently, $\nabla f$ is $\frac{2\lambda}{n} \cdot \frac{1}{2\delta} = \frac{\lambda}{n\delta}$-Lipschitz. ∎

The following two theorems establish that since the gradient is Lipschitz, both PGA and FW are guaranteed to converge to a stationary point.

###### Theorem 1 (Theorem 6.5 of lan2020first).

Define the minimal gradient mapping of the PGA algorithm as $\hat{G}_K = \min_{1 \leq k \leq K} \big\| G_{\eta_k}(\boldsymbol{\mu}_k) \big\|_2$, encountered by the iterates during the algorithm until the $K$-th iteration. Suppose that the stepsizes in the PGA scheme are chosen such that $\eta_k \leq \frac{1}{L}$, where $L$ is the Lipschitz constant of the gradient of $f$ on $\mathcal{F}_\delta^n$. Then it holds that

$$\hat{G}_K^2 \;\leq\; \frac{f(\boldsymbol{\mu}^*) - f(\boldsymbol{\mu}_1)}{\sum_{k=1}^{K} \eta_k \left(1 - \frac{L \eta_k}{2}\right)}, \tag{12}$$

where $\boldsymbol{\mu}^*$ denotes the optimal solution of (10) over the restricted domain. In particular, if $\eta_k = \frac{1}{L}$ for all $k$, then

$$\hat{G}_K^2 \;\leq\; \frac{2 L\, \big( f(\boldsymbol{\mu}^*) - f(\boldsymbol{\mu}_1) \big)}{K}. \tag{13}$$

###### Theorem 2 (lacoste2016convergence).

Define the minimal FW gap as $\tilde{g}_K = \min_{0 \leq k \leq K} g_k$, encountered by the iterates during the algorithm until the $K$-th iteration. Consider running the FW algorithm with the adaptive stepsize strategy specified in Line 11 of Algorithm 2. Then, it holds that

$$\tilde{g}_K \;\leq\; \frac{\max\!\big\{ 2\,\big(f(\boldsymbol{\mu}^*) - f(\boldsymbol{\mu}_0)\big),\; L\, \mathrm{diam}\big(\mathcal{F}_\delta^n\big)^2 \big\}}{\sqrt{K+1}}, \tag{14}$$

where $L$ is the Lipschitz constant of the gradient of $f$, and $\boldsymbol{\mu}^*$ denotes the optimal solution of (10) over the restricted domain.

## 6 Experiments and Results

We evaluate the performance of our proposed approach using grid worlds.

### 6.1 Grid World Design

The grid worlds are two-dimensional nineteen-by-nineteen rectangular grids. The state of the agent is its current position. The agent receives a large reward for reaching a defined goal state, after which it transitions back to an initial state. In all states other than the goal, the agent has five choices of actions that correspond to moving down, left, up, right, or stopping and staying in the same place. The ‘stop’ action is deterministic; for all other actions, the probability of transitioning to the desired successor state is given by a ‘correct transition’ hyperparameter, denoted $\epsilon$. If the transition is not successful, the agent transitions to another neighboring state at random. The environment is filled with obstacles, and reaching an obstacle state results in a large penalty. We note that this design could represent, for example, the environment of a robot vacuum or a guard robot.
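A sketch of one cell's transition distribution under these dynamics; the text does not specify exactly how the failure probability is split among neighbors, so the uniform split below (and the move encoding) is an assumption:

```python
def step_distribution(pos, action, eps, size=19):
    """Transition distribution for one grid cell (a sketch of the paper's
    dynamics; eps is the 'correct transition' parameter).

    Moves: 0=down, 1=left, 2=up, 3=right, 4=stop. 'stop' is deterministic;
    otherwise the intended neighbor gets probability eps and the remaining
    1 - eps is split uniformly among the other neighboring cells (an
    assumption of this sketch).
    """
    moves = [(1, 0), (0, -1), (-1, 0), (0, 1)]

    def clip(r, c):
        # Keep positions inside the grid (border cells bounce back).
        return (min(max(r, 0), size - 1), min(max(c, 0), size - 1))

    if action == 4:
        return {pos: 1.0}
    dist = {}
    dr, dc = moves[action]
    targets = [(clip(pos[0] + dr, pos[1] + dc), eps)]
    others = [m for i, m in enumerate(moves) if i != action]
    targets += [(clip(pos[0] + dr, pos[1] + dc), (1 - eps) / len(others))
                for dr, dc in others]
    for s, p in targets:
        dist[s] = dist.get(s, 0.0) + p
    return dist

d = step_distribution((5, 5), 3, eps=0.85)  # try to move right
print(d)
```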

We define two different grid world types. In the first, which we refer to as the four-room grid world, the environment consists of four eight-by-eight rooms arranged in a two-by-two grid. There is a -200 reward for reaching states with walls and obstacles, a reward of 400 for reaching the goal, and a reward of -4 in all other states to shape the agent's behavior. In the second grid world, there are nine five-by-five rooms arranged in a three-by-three grid. We refer to this setup as the nine-room grid world. In this setup, there is a -40 reward for reaching walls and obstacles, a reward of 200 for reaching the goal, and a reward of -1.2 in all other states. Both environments have walls separating adjacent rooms, with a single door, represented by a hole in the wall, linking the rooms. There is a single additional obstacle placed within each room. In both worlds, the initial state is located in the top left room, and the goal state is located in the bottom right room.

To evaluate the robustness of our proposed approach, we test the performance over a range of trials. In each trial, the location of the agent and the goal state within their respective rooms, the locations of the doors in each wall, and the locations of the obstacles within the rooms are randomized. Figure 1 shows a single trial grid world for both the four- and nine-room setups.

### 6.2 Frank-Wolfe and Projected Gradient Ascent

| Opt. Alg. | Reward/policy | Diversity | Runtime (s) |
|---|---|---|---|
| PGA | -39.02 | 0.34 | 1488.76 |
| FW | 13.24 | 0.50 | 26.90 |

Here we compare the performance of the FW and PGA optimization algorithms. We terminate PGA when the maximum iteration number is reached or when the difference in norms between consecutive solutions falls below a tolerance threshold of 0.01. We use Sequential Least Squares Programming (nocedal2006numerical) to solve the projection step at each iteration. The Sequential Least Squares Programming algorithm terminates after ten iterations or when a stationarity condition is met. We implement the FW algorithm with a fixed shrinkage factor for the backtracking line search. The FW algorithm terminates when the Frank-Wolfe gap falls below a tolerance of 0.001 or when the maximum iteration number is reached. We set the same maximum iteration number for both approaches.

We evaluate performance using the four-room grid world with fixed values of the correct transition parameter $\epsilon$, the number of policies in the return set $n$, and the tradeoff parameter $\lambda$. Figure 2 shows the average performance over ten trials. FW is clearly superior to PGA in both performance and computational efficiency. This is because PGA must solve a constrained least-squares optimization problem for each policy at each iteration to project the policies back onto the feasible space. Even small errors in the projection can considerably deteriorate the near-optimality and diversity of the policies. In contrast, FW only requires a linear program to be solved at each iteration. The solution to the linear program lies in the feasible space by construction, and thus there are no issues with stability. We use FW as the optimization algorithm in the subsequent experiments.

### 6.3 Role of the Tradeoff Parameter

**Figure 3.** The reward and diversity of the policies found as a function of the tradeoff parameter $\lambda$ in the nine-room grid world. Mean and standard deviation are computed over ten trials. (a) Average reward per policy; the reward of the optimal policy is displayed in grey. (b) Average pairwise diversity.

The tradeoff parameter $\lambda$ plays a crucial role in ensuring a proper balance between the near-optimality of the candidate solutions and the diversity of the set of solutions. Testing the performance as a function of the tradeoff parameter provides important insights into the performance and properties of our proposed approach. Here we evaluate the performance for a range of tradeoff parameters using the nine-room grid world. We fix the correct transition parameter $\epsilon$ and set the number of policies in the return set, $n$, to the number of unique door combinations the agent can take to reach the goal without cycling or other undesirable behavior.

Figure 3 shows the average reward per policy and the average pairwise diversity over ten trials. As expected, the pairwise diversity shows a marked increase as a function of the tradeoff parameter. The average reward decreases slightly as the tradeoff parameter increases until it begins to fall sharply at a threshold value of $\lambda$. This is the point at which, in some trials, it becomes optimal to find policies that do not reach the goal but have maximal diversity. Up to this point, our approach is still able to find increasingly diverse near-optimal solutions.

This behavior can also be observed in Figure 4, which shows sample state occupancy maps for a small range of $\lambda$ values and a single trial. The state occupancy measure is the long-run expected probability of being in a given state $s$, i.e., $d^\pi(s) = \sum_{a \in \mathcal{A}} \mu^\pi(s, a)$. With the smallest value of $\lambda$, our approach finds several policies with nearly identical behavior. As $\lambda$ increases, the algorithm finds policies that utilize increasingly diverse strategies to reach the goal and traverse through many of the doors and rooms in the grid world. Note, however, that even at the largest value of $\lambda$, policies one and three and policies four and six have relatively similar behavior, as they utilize the same door combinations to reach the goal. This can be explained by the fact that our approach finds only a local optimum in the loss landscape, and by the fact that even with high values of $\lambda$ the configuration of the doors and obstacles can limit the number of meaningfully distinct near-optimal policies.
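Marginalizing the state-action occupancy measure over actions gives the state occupancy used in these maps; a minimal sketch with hypothetical numbers:

```python
import numpy as np

# Hypothetical occupancy measure over 2 states x 2 actions.
mu = np.array([[0.1, 0.3],
               [0.4, 0.2]])
state_occupancy = mu.sum(axis=1)   # d(s) = sum_a mu(s, a)
print(state_occupancy)             # [0.4 0.6]
```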

### 6.4 Finding More Policies

We show how varying the desired number of policies in the return set, $n$, affects performance using the nine-room grid world. Here we set the value of the tradeoff parameter based on the results in the previous section and again fix the correct transition parameter $\epsilon$. Figure 5 shows the average reward per policy and the average pairwise diversity of the policies found as a function of the size of the return set. As shown, there is a marked decrease in the average pairwise diversity as the size of the return set grows. This provides further evidence that the environment places a natural limit on the number of meaningfully diverse policies that can be obtained.

Figure 6 shows the state occupancy maps for the policies learned for a single trial and two different return set sizes. With the smaller return set, our approach is able to find four policies that utilize distinct doors and rooms to reach the goal. With the larger return set, our approach finds five clearly distinct policies, including ones extremely similar to the four found with the smaller set. However, policies 1 and 3 are very similar to each other. This behavior of increasingly diverse policies up to a threshold is consistent across the trials. We conclude that, depending on the location of the obstacles and doors, our approach is able to find a number of distinct policies before reaching a limit that is related to the environment design.

### 6.5 Correct Transition Parameter

Here we investigate the role of stochasticity in the performance of our approach. Figure 7 shows the average reward per policy and the average pairwise diversity as a function of the correct transition parameter, $\epsilon$, in the four-room grid world. With low values of the correct transition parameter, the probability of hitting obstacles on the way to the goal is high, and the algorithm finds stationary policies with near-maximal diversity but a low reward. As the correct transition parameter increases, it becomes optimal to find policies that reach the goal, and the average diversity decreases before reaching a minimum. The average diversity then begins to increase past this point, as the decreased stochasticity leads to more distinct policies.

## 7 Conclusion and Future Work

In this work, we considered the problem of stochastic planning in situations where the objective function is known to be only partially specified. In this setting, we proposed generating a representative set of near-optimal policies with respect to the known objective. To that end, we formulated a nonlinear optimization problem that finds a small set of near-optimal and diverse policies. We showed that it is possible to efficiently solve the optimization problem using the Frank-Wolfe method and proved non-asymptotic convergence rates. We then compared the performance of the Frank-Wolfe method with projected gradient ascent and investigated the role of the hyperparameters using a series of navigation problems.

Our results show that the choice of the tradeoff parameter and the size of the return set play an important role in the performance of our approach. As the tradeoff parameter and the size of the return set increase, our approach is able to find increasing numbers of meaningfully distinct near-optimal policies up to a limit that is related to the structure of the environment. An interesting future extension of our approach would be investigating the utility of these near-optimal diverse strategies in generating effective collaboration between groups of autonomous agents.
