
# An Intrinsically-Motivated Approach for Learning Highly Exploring and Fast Mixing Policies

What is a good exploration strategy for an agent that interacts with an environment in the absence of external rewards? Ideally, we would like to get a policy driving towards a uniform state-action visitation (highly exploring) in a minimum number of steps (fast mixing), in order to ease efficient learning of any goal-conditioned policy later on. Unfortunately, it is remarkably arduous to directly learn an optimal policy of this nature. In this paper, we propose a novel surrogate objective for learning highly exploring and fast mixing policies, which focuses on maximizing a lower bound to the entropy of the steady-state distribution induced by the policy. In particular, we introduce three novel lower bounds, leading to as many optimization problems, that trade off theoretical guarantees against computational complexity. Then, we present a model-based reinforcement learning algorithm, IDE$^3$AL, to learn an optimal policy according to the introduced objective. Finally, we provide an empirical evaluation of this algorithm on a set of hard-exploration tasks.


## 1 Introduction

In general, the Reinforcement Learning (RL) framework sutton2018reinforcement assumes the presence of a reward signal coming from a, potentially unknown, environment to a learning agent. When this signal is sufficiently informative about the utility of the agent’s decisions, RL has proved to be rather successful in solving challenging tasks, even at a super-human level (e.g., mnih2015human; silver2017mastering). However, in most real-world scenarios, we cannot rely on a well-shaped, complete reward signal. This may prevent the agent from learning anything until, while performing random actions, it eventually stumbles into some sort of external reward. Thus, what is a good objective for a learning agent to pursue, in the absence of an external reward signal, to prepare itself to learn efficiently, eventually, a goal-conditioned policy?

Intrinsic motivation chentanez2005intrinsically; oudeyer2009topology traditionally tries to answer this pressing question by designing self-motivated goals that favor exploration. In a curiosity-driven approach, first proposed in schmidhuber1991possibility, the intrinsic objective encourages the agent to explore novel states by rewarding prediction errors (e.g., stadie2015incentivizing; pathak2017curiosity; burda2018large; burda2018exploration). In a similar vein, other works propose to relate an intrinsic reward to some sort of learning progress (e.g., lopes2012exploration) or information gain mohamed2015variational; houthooft2016vime, stimulating the agent's empowerment over the environment. Count-based approaches (e.g., bellemare2016unifying; tang2017exploration; ostrovski2017count) consider exploration bonuses that decrease with the state visitation frequencies, assigning high rewards to rarely visited states. Although the mentioned approaches have been relatively effective in solving sparse-reward, hard-exploration tasks (e.g., pathak2017curiosity; burda2018exploration), they have some common limitations that may affect their ability to methodically explore an environment in the absence of external rewards, as pointed out in ecoffet2019go. In particular, due to the consumable nature of their intrinsic bonuses, the learning agent could prematurely lose interest in a frontier of high rewards (detachment). Furthermore, the agent may suffer from derailment when trying to return to a promising, previously discovered state, if a naïve exploratory mechanism, such as $\epsilon$-greedy, is combined with the intrinsic motivation mechanism (which is often the case). To overcome these limitations, recent works suggest alternative approaches to motivate the agent towards a more systematic exploration of the environment (e.g., hazan2018provably; ecoffet2019go).
Especially, in hazan2018provably the authors consider an intrinsic objective which is directed to the maximization of an entropic measure over the state distribution induced by a policy. Then, they provide a provably efficient algorithm to learn a mixture of deterministic policies that is overall optimal w.r.t. the maximum-entropy exploration objective. To the best of our knowledge, none of the mentioned approaches explicitly address the related aspect of the mixing time of an exploratory policy, which represents the time it takes for the policy to reach its full capacity in terms of exploration. Nonetheless, in many cases we would like to reach target states in the environment in the minimum possible time, limiting the number of interactions to get there.

In this paper, we present a novel approach to learn exploratory policies that are, at the same time, highly exploring and fast mixing. In Section 3, we propose a surrogate objective to address the problem of maximum-entropy exploration over both the state space (Section 3.1) and the action space (Section 3.2). The idea is to search for a policy that maximizes a lower bound to the entropy of the induced steady-state distribution. We introduce three new lower bounds and the corresponding optimization problems, discussing their pros and cons. Furthermore, we discuss how to complement the introduced objective to account for the mixing time of the learned policy (Section 3.3). In Section 4, we present the Intrinsically-Driven Effective and Efficient Exploration ALgorithm (IDE$^3$AL), a novel, model-based reinforcement learning method to learn highly exploring and fast mixing policies through iterative optimizations of the introduced objective. Then, in Section 5, we provide an empirical evaluation to illustrate the merits of our approach on hard-exploration, finite domains, and to show how it fares in comparison to count-based and maximum-entropy approaches. Finally, in Section 6, we discuss some related works. The proofs of the theorems are reported in Appendix A.

## 2 Preliminaries

A discrete-time Markov Decision Process (MDP) (puterman2014markov) is defined as a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, d_0)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P(s'|s,a)$ is a Markovian transition model defining the distribution of the next state $s'$ given the current state $s$ and action $a$, $R$ is the reward function, such that $R(s,a)$ is the expected immediate reward when taking action $a$ from state $s$, and $d_0$ is the initial state distribution. A policy $\pi(a|s)$ defines the probability of taking an action $a$ in state $s$.

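In the following we will use scalar and matrix notation interchangeably, where $\mathbf{v}$ denotes a vector, $\mathbf{M}$ denotes a matrix, and $\mathbf{v}^T$, $\mathbf{M}^T$ denote their transpose. A matrix is row (column) stochastic if it has non-negative entries and all of its rows (columns) sum to one. A matrix is doubly stochastic if it is both row and column stochastic. We denote with $\mathbb{P}$ the space of doubly stochastic matrices. The $L_\infty$-norm $\|\mathbf{M}\|_\infty$ of a matrix is its maximum absolute row sum, while $\|\mathbf{M}\|_2$ and $\|\mathbf{M}\|_F$ are its $L_2$ and Frobenius norms respectively. We denote with $\mathbf{1}_n$ a column vector of $n$ ones and with $\mathbf{1}_{n \times m}$ a matrix of ones with $n$ rows and $m$ columns. Using matrix notation, $\mathbf{d}_0$ is a column vector of size $|\mathcal{S}|$ having elements $d_0(s)$, $\mathbf{P}$ is a row stochastic matrix of size $|\mathcal{S}||\mathcal{A}| \times |\mathcal{S}|$ that describes the transition model $P(s'|s,a)$, $\boldsymbol{\Pi}$ is a row stochastic matrix of size $|\mathcal{S}| \times |\mathcal{S}||\mathcal{A}|$ that contains the policy $\pi(a|s)$, and $\mathbf{P}^\pi = \boldsymbol{\Pi}\mathbf{P}$ is a row stochastic matrix of size $|\mathcal{S}| \times |\mathcal{S}|$ that represents the state transition matrix under policy $\pi$. We denote with $\Pi$ the space of all the stationary Markovian policies.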

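In the absence of external rewards, i.e., when $R(s,a) = 0$ for every $(s,a)$, a policy $\pi$ induces, over the MDP $\mathcal{M}$, a Markov Chain (MC) bremaud2013markov defined by $(\mathcal{S}, P^\pi, d_0)$, where $P^\pi(s'|s) = \sum_{a \in \mathcal{A}} \pi(a|s) P(s'|s,a)$ is the state transition model. Having defined the $t$-step state transition matrix as $\mathbf{P}^\pi_t = (\mathbf{P}^\pi)^t$, the state distribution of the MC at time step $t$ is $\mathbf{d}^\pi_t = (\mathbf{P}^\pi_t)^T \mathbf{d}_0$, while $\mathbf{d}^\pi = \lim_{t \to \infty} \mathbf{d}^\pi_t$ is the steady state distribution. If the MC is ergodic, i.e., aperiodic and recurrent, it admits a unique steady state distribution, such that $\mathbf{d}^\pi = (\mathbf{P}^\pi)^T \mathbf{d}^\pi$. The mixing time of the MC describes how fast the state distribution converges to the steady state distribution: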

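$$t_{\mathrm{mix}} = \min\left\{ t \in \mathbb{N} : \sup_{d_0} \left\| \mathbf{d}^\pi_t - \mathbf{d}^\pi \right\|_\infty \le \epsilon \right\}, \tag{1}$$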

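where $\epsilon$ is the mixing threshold. An MC is reversible if the detailed balance condition $d^\pi(s) P^\pi(s'|s) = d^\pi(s') P^\pi(s|s')$ holds for every pair of states $s, s'$. Let $\lambda_\pi(i)$ be the eigenvalues of $\mathbf{P}^\pi$. For ergodic reversible MCs the largest eigenvalue is 1 with multiplicity 1. Then, we can define the second largest eigenvalue modulus and the spectral gap as:

$$\lambda_\pi(2) = \max_{\lambda_\pi(i) \neq 1} |\lambda_\pi(i)|, \qquad \gamma_\pi = 1 - \lambda_\pi(2). \tag{2}$$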
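To make the definition of mixing time in Equation (1) concrete, the following is a minimal sketch, not taken from the paper, that computes it by brute force for a small chain. The function names `step` and `mixing_time` and the two-state example are illustrative assumptions. The supremum over initial distributions is evaluated only at point masses, which is sufficient because the distance is a convex function of $d_0$ and is therefore maximized at a vertex of the simplex.

```python
def step(d, P):
    """One step of the chain: returns the row vector d P."""
    n = len(P)
    return [sum(d[s] * P[s][s2] for s in range(n)) for s2 in range(n)]


def mixing_time(P, d_star, eps, max_steps=10_000):
    """Smallest t such that the worst-case L-infinity distance between the
    t-step state distribution and the steady state d_star is at most eps.

    Assumption: the chain is ergodic and d_star is its steady state.
    """
    n = len(P)
    # One point-mass initial distribution per state (vertices of the simplex).
    dists = [[1.0 if i == s else 0.0 for i in range(n)] for s in range(n)]
    for t in range(max_steps + 1):
        worst = max(
            max(abs(d[i] - d_star[i]) for i in range(n)) for d in dists
        )
        if worst <= eps:
            return t
        dists = [step(d, P) for d in dists]
    raise ValueError("chain did not mix within max_steps")


# Hypothetical two-state chain with steady state (2/3, 1/3).
P_slow = [[0.9, 0.1], [0.2, 0.8]]
# Uniform transition matrix: every row is already the steady state.
P_uniform = [[0.5, 0.5], [0.5, 0.5]]

print(mixing_time(P_slow, [2 / 3, 1 / 3], eps=0.01))
print(mixing_time(P_uniform, [0.5, 0.5], eps=0.01))
```

In the first chain the second eigenvalue is 0.7, so the distance decays as $0.7^t$; the uniform chain reaches its steady state after a single step from any start.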

## 3 Optimization Problems for Highly Exploring and Fast Mixing Policies

In this section, we define a set of optimization problems whose goal is to identify a stationary Markovian policy that effectively explores the state-action space. The optimization problem is introduced in three steps: first we ask for a policy that maximizes some lower bound to the steady-state distribution entropy, then we foster exploration over the action space by adding a constraint on the minimum action probability, and finally we add another constraint to reduce the mixing time of the Markov chain induced by the policy.

### 3.1 Highly Exploring Policies over the State Space

Intuitively, a good exploration policy should guarantee to visit the state space as uniformly as possible. In this view, a potential objective function is the entropy of the steady-state distribution induced by a policy over the MDP hazan2018provably. The resulting optimal policy is:

$$\pi^* \in \arg\max_{\pi \in \Pi} H(d^\pi), \tag{3}$$

where $H(d^\pi) = -\sum_{s \in \mathcal{S}} d^\pi(s) \log d^\pi(s)$ is the state distribution entropy. Unfortunately, a direct optimization of this objective is particularly arduous since the steady-state distribution entropy is not a concave function of the policy hazan2018provably. To overcome this issue, a possible solution (see hazan2018provably) is to use the conditional gradient method, such that the gradients of the steady-state distribution entropy become the intrinsic reward in a sequence of approximate dynamic programming problems bertsekas1995dynamic.

In this paper, we follow an alternative route that consists in maximizing a lower bound to the policy entropy. In particular, in the following we will consider three lower bounds that lead to as many optimization problems (named Infinity, Frobenius, Column Sum) that show different trade-offs between theoretical guarantees and computational complexity.

**Infinity.** From the theory of Markov chains bremaud2013markov, we know a necessary and sufficient condition for a policy to induce a uniform steady-state distribution (i.e., to achieve the maximum possible entropy). We report this result in the following theorem.

**Theorem.** Let $\mathbf{P}$ be the transition matrix of a given MDP. The steady-state distribution $d^\pi$ induced by a policy $\pi$ is uniform over $\mathcal{S}$ iff the matrix $\mathbf{P}^\pi = \boldsymbol{\Pi}\mathbf{P}$ is doubly stochastic.

Unfortunately, given the constraints specified by the transition matrix $\mathbf{P}$ of the MDP, a stationary Markovian policy that induces a doubly stochastic $\mathbf{P}^\pi$ may not exist. On the other hand, it is possible to lower bound the entropy of the steady-state distribution induced by policy $\pi$ as a function of the minimum $L_\infty$-norm between $\mathbf{P}^\pi$ and any doubly stochastic matrix.

**Theorem.** Let $\mathbf{P}$ be the transition matrix of a given MDP and $\mathbb{P}$ the space of doubly stochastic matrices. The entropy of the steady-state distribution $d^\pi$ induced by a policy $\pi$ is lower bounded by:

$$H(d^\pi) \ge \log|\mathcal{S}| - |\mathcal{S}| \inf_{\mathbf{P}^u \in \mathbb{P}} \left\| \mathbf{P}^u - \boldsymbol{\Pi}\mathbf{P} \right\|_\infty^2.$$

The maximization of this lower bound leads to the following constrained optimization problem:

$$\underset{\mathbf{P}^u \in \mathbb{P},\ \boldsymbol{\Pi} \in \Pi}{\text{minimize}} \quad \left\| \mathbf{P}^u - \boldsymbol{\Pi}\mathbf{P} \right\|_\infty \tag{4}$$

It is worth noting that this optimization problem can be reformulated as a linear program (the linear program formulation can be found in Appendix B.1). However, the number of constraints of this linear program grows exponentially with the number of states. In order to avoid this growth, we introduce alternative optimization problems.

**Frobenius.** It is worth noting that different transition matrices $\mathbf{P}^\pi$ having equal $\|\mathbf{P}^u - \boldsymbol{\Pi}\mathbf{P}\|_\infty$ might lead to significantly different state distribution entropies $H(d^\pi)$, as the $L_\infty$-norm only accounts for the state corresponding to the maximum absolute row sum. The Frobenius norm can better capture the distance between $\mathbf{P}^u$ and $\mathbf{P}^\pi$ over all the states, as discussed in Appendix C. For this reason, we have derived a lower bound to the policy entropy that replaces the $L_\infty$-norm with the Frobenius one.

**Theorem.** Let $\mathbf{P}$ be the transition matrix of a given MDP and $\mathbb{P}$ the space of doubly stochastic matrices. The entropy of the steady-state distribution $d^\pi$ induced by a policy $\pi$ is lower bounded by:

$$H(d^\pi) \ge \log|\mathcal{S}| - |\mathcal{S}|^2 \inf_{\mathbf{P}^u \in \mathbb{P}} \left\| \mathbf{P}^u - \boldsymbol{\Pi}\mathbf{P} \right\|_F^2.$$

It can be shown (see Corollary A in Appendix A) that the lower bound based on the Frobenius norm cannot be better (i.e., larger) than the one based on the $L_\infty$-norm. However, we have the advantage that the resulting optimization problem has significantly fewer constraints than Problem (4):

$$\underset{\mathbf{P}^u \in \mathbb{P},\ \boldsymbol{\Pi} \in \Pi}{\text{minimize}} \quad \left\| \mathbf{P}^u - \boldsymbol{\Pi}\mathbf{P} \right\|_F. \tag{5}$$

The above problem is a linearly constrained quadratic program, whose inequality and equality constraints only enforce that $\mathbf{P}^u$ is doubly stochastic and that $\boldsymbol{\Pi}$ is a valid policy.

**Column Sum.** Problems (4) and (5) aim at finding a policy associated with a state transition matrix that is doubly stochastic. To achieve this result it is enough to guarantee that the column sums of the matrix $\mathbf{P}^\pi$ are all equal to one kirkland2010column. A measure that can be used to evaluate the distance to a doubly stochastic matrix is the absolute sum of the differences between one and the column sums: $\left\| \left( \mathbf{I} - (\boldsymbol{\Pi}\mathbf{P})^T \right) \cdot \mathbf{1}_{|\mathcal{S}|} \right\|_1$. The following theorem provides a lower bound to the policy entropy as a function of this measure.

**Theorem.** Let $\mathbf{P}$ be the transition matrix of a given MDP. The entropy of the steady-state distribution $d^\pi$ induced by a policy $\pi$ is lower bounded by:

$$H(d^\pi) \ge \log|\mathcal{S}| - |\mathcal{S}| \left\| \left( \mathbf{I} - (\boldsymbol{\Pi}\mathbf{P})^T \right) \cdot \mathbf{1}_{|\mathcal{S}|} \right\|_1^2.$$

The optimization of this lower bound leads to the following linear program:

$$\underset{\boldsymbol{\Pi} \in \Pi}{\text{minimize}} \quad \left\| \left( \mathbf{I} - (\boldsymbol{\Pi}\mathbf{P})^T \right) \cdot \mathbf{1}_{|\mathcal{S}|} \right\|_1 \tag{6}$$

Unlike the other optimization problems presented, Problem (6) does not require optimizing over the space of all the doubly stochastic matrices, thus significantly reducing the number of optimization variables and constraints. The linear program formulation of Problem (6) can be found in Appendix B.2.
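The Column Sum bound is the only one of the three that can be evaluated for a given policy without solving an optimization problem, so it lends itself to a quick numerical check. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the helper names, the nested-list layout `P[s][a][s2]` and `pi[s][a]`, and the two-state MDP are all hypothetical, and the steady state is approximated by power iteration, which presumes an ergodic chain.

```python
import math


def state_transition_matrix(pi, P):
    """P_pi[s][s2] = sum_a pi[s][a] * P[s][a][s2]."""
    n_states, n_actions = len(P), len(P[0])
    return [
        [sum(pi[s][a] * P[s][a][s2] for a in range(n_actions))
         for s2 in range(n_states)]
        for s in range(n_states)
    ]


def steady_state(P_pi, iters=10_000):
    """Approximate the steady state by power iteration (ergodic chain assumed)."""
    n = len(P_pi)
    d = [1.0 / n] * n
    for _ in range(iters):
        d = [sum(d[s] * P_pi[s][s2] for s in range(n)) for s2 in range(n)]
    return d


def entropy(d):
    """Shannon entropy in nats, skipping zero-probability states."""
    return -sum(p * math.log(p) for p in d if p > 0)


def column_sum_bound(P_pi):
    """log|S| - |S| * ||(I - P_pi^T) 1||_1^2, the Column Sum lower bound."""
    n = len(P_pi)
    col_sums = [sum(P_pi[s][s2] for s in range(n)) for s2 in range(n)]
    c_norm1 = sum(abs(1.0 - cs) for cs in col_sums)
    return math.log(n) - n * c_norm1 ** 2


# Hypothetical MDP: action a deterministically moves the agent to state a.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]]
pi = [[0.9, 0.1], [0.2, 0.8]]

P_pi = state_transition_matrix(pi, P)
d = steady_state(P_pi)
print(entropy(d), column_sum_bound(P_pi))
```

For this example the column sums are 1.1 and 0.9, so the penalty term is $2 \cdot 0.2^2 = 0.08$ and the bound sits slightly below the true entropy of the steady state $(2/3, 1/3)$.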

### 3.2 Highly Exploring Policies over the State and Action Space

Although the policy resulting from the optimization of one of the above problems may lead to the most uniform exploration of the state space, the actual goal of the exploration phase is to collect enough information on the environment to optimize, at some point, a goal-conditioned policy (pong2019skew). To this end, it is essential to have an exploratory policy that adequately covers the action space in any visited state. Unfortunately, the optimization of Problems (4), (5), (6) does not even guarantee that the obtained policy is stochastic. Thus, we need to embed in the problem a secondary objective that takes into account the exploration over $\mathcal{A}$. This can be done by enforcing a minimal entropy over actions in the policy to be learned, adding to (4), (5), and (6) the following constraints:

$$\pi(a|s) \ge \xi, \quad \forall s \in \mathcal{S},\ \forall a \in \mathcal{A}, \tag{7}$$

where $\xi \in [0, 1/|\mathcal{A}|]$. This secondary objective is actually in competition with the objective of uniform exploration over states. Indeed, an excessive incentive to explore over actions may limit the state distribution entropy of the optimal policy. Conversely, having a low probability of visiting a state decreases the likelihood of sampling an action from that state, hence also reducing the exploration over actions. To illustrate this, Figure 1 (left) shows the state distribution entropies and the state-action distribution entropies achieved by the optimal policy w.r.t. Problem (5) for different values of $\xi$.

### 3.3 An Objective to Make Highly Exploring Policies Mix Faster

Although all doubly stochastic matrices are equally valid in terms of steady-state distribution, they are certainly not equivalent in terms of mixing time. Indeed, while an MC with a uniform transition matrix, i.e., transition probabilities $P^\pi(s'|s) = 1/|\mathcal{S}|$ for any $s$, $s'$, mixes in no time, an MC with probability one on the self-loops never converges to a steady state. This is evident considering that the mixing time of an MC is bounded as follows (levin2017markov, Theorems 12.3 and 12.4):

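$$\frac{1 - \gamma_\pi}{\gamma_\pi} \log\frac{1}{2\epsilon} \le t_{\mathrm{mix}} \le \frac{1}{\gamma_\pi} \log\frac{1}{d^\pi_{\min}\,\epsilon}, \tag{8}$$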

where $\epsilon$ is the mixing threshold, $d^\pi_{\min}$ is a minorization of $d^\pi$, and $\gamma_\pi$ is the spectral gap of $\mathbf{P}^\pi$ (2). The choice of the target $\mathbf{P}^u$ strongly affects the mixing properties of the $\mathbf{P}^\pi$ induced by the learned policy. In many cases, such as in episodic tasks where the horizon for exploration is capped, we may be interested in a policy that leads to a less uniform distribution at convergence in exchange for faster mixing to the steady state. From the literature on MCs, we know that a variant of Problems (4), (5) having the uniform transition matrix as target and the $L_2$-norm as matrix norm is equivalent to the problem of finding the fastest mixing transition matrix boyd2004fastest. However, the choice of this target may overly limit the entropy over the state distribution induced by the optimal policy. Instead, we look for a generalization of the fastest mixing problem that allows us to prioritize fast exploration at will. This can be done by considering a continuum of relaxations in the fastest mixing objective reported in boyd2004fastest. Therefore, to set fast exploration as a secondary objective, we can embed in the optimization Problems (4) and (5) (but not in Problem (6), which has no target matrix $\mathbf{P}^u$) the following constraints:

$$P^u(s, s') \le \zeta, \quad \forall s, s' \in \mathcal{S}, \tag{9}$$

where $\zeta \in [1/|\mathcal{S}|, 1]$. By setting $\zeta = 1/|\mathcal{S}|$, we force the optimization problem to consider the uniform transition matrix as a target, thus aiming to reduce the mixing time, while larger values of $\zeta$ relax this objective, allowing us to get a higher steady-state distribution entropy. In Figure 1 we show how the parameter $\zeta$ affects the trade-off between high steady-state entropy and low mixing times (i.e., high spectral gaps), reporting the values obtained by optimal policies w.r.t. Problem (5) for different $\zeta$.

## 4 A Model-Based Algorithm for Highly Exploring and Fast Mixing Policies

In this section, we present an approach to incrementally learn a highly exploring and fast mixing policy through interactions with an unknown environment, developing a novel model-based exploration algorithm called Intrinsically-Driven Effective and Efficient Exploration ALgorithm (IDE$^3$AL). Since Problems (4), (5), (6) require an explicit representation of the matrix $\mathbf{P}$, we need to estimate the transition model from samples before optimizing the objective (model-based approach sutton1990integrated). In tabular settings, this can be easily done by adopting the transition frequency as a proxy for the (unknown) transition probabilities, obtaining an estimated transition model $\hat{\mathbf{P}}$. However, in hard-exploration tasks, it can be arbitrarily arduous to sample transitions from the most difficult to reach states by relying on naïve exploration mechanisms, such as a random policy. To address this issue, we rely on an iterative approach in which we alternate model estimation phases with optimization sweeps of the objectives (4), (5) or (6). In this way, we combine the benefit of collecting samples with highly exploring policies to better estimate the transition model and the benefit of having a better-estimated model to learn superior exploratory policies. In order to drive the policy towards state-action pairs that have never been sampled, we keep their corresponding next-state distribution uniform over all possible states, thus making those pairs particularly valuable from the perspective of the optimization problem. The algorithm converges whenever the exploratory policy remains unchanged during consecutive optimization sweeps and, if we know the size of the MDP, when all state-action pairs have been sufficiently explored. In Algorithm 1 we report the pseudo-code of IDE$^3$AL. Finally, in Figure 2 we compare the iterative formulation against a non-iterative one, i.e., an approach that collects samples with a random policy and then optimizes the exploration objective off-line. Considering an exploration task on the Double Chain environment furmston2010variational, we show that the iterative form has a clear edge over the non-iterative one in reducing the model estimation error. Both approaches employ the Frobenius formulation.
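The model estimation step described above can be sketched as follows. This is a minimal illustration, assuming a tabular setting with known numbers of states and actions; the function name `estimate_model` and the nested-list layout `counts[s][a][s2]` are hypothetical choices, and the policy optimization sweep that IDE$^3$AL alternates with this step is deliberately left out.

```python
def estimate_model(counts):
    """Estimate P_hat[s][a][s2] from transition counts.

    counts[s][a][s2] is the number of observed transitions (s, a) -> s2.
    Visited pairs get the empirical transition frequencies; pairs that were
    never sampled keep a uniform next-state distribution, as in the text.
    """
    n_states = len(counts)
    P_hat = []
    for s_counts in counts:
        rows = []
        for sa_counts in s_counts:
            total = sum(sa_counts)
            if total == 0:
                # Never-sampled pair: uniform over all possible next states.
                rows.append([1.0 / n_states] * n_states)
            else:
                rows.append([c / total for c in sa_counts])
        P_hat.append(rows)
    return P_hat


# Hypothetical counts for a 2-state, 2-action MDP; pair (0, 1) is unvisited.
counts = [[[3, 1], [0, 0]],
          [[0, 2], [1, 1]]]
P_hat = estimate_model(counts)
print(P_hat[0][0], P_hat[0][1])
```

Every row of the estimate is a valid probability distribution, so the result can be stacked into the row stochastic matrix that Problems (4), (5) and (6) expect.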

## 5 Experimental Evaluation

In this section, we provide the experimental evaluation of IDE$^3$AL. First, we show a set of experiments on the illustrative Single Chain and Double Chain domains (see furmston2010variational; peters2010relative). The Single Chain consists of a sequence of states with two possible actions: one to climb up the chain to the next state, and the other to fall directly to the initial state. The two actions are flipped with a fixed probability, making the environment stochastic and reducing the probability of visiting the higher states. The Double Chain concatenates two Single Chains into a bigger one sharing the central state, which is the initial state. Thus, the chain can be climbed in two directions. These two domains, albeit rather simple from a dimensionality standpoint, are actually hard to explore uniformly, due to the high share of actions returning to the initial state and preventing the agent from consistently reaching the higher states (especially the last state of the Single Chain and the two end states of the Double Chain). Then, we present an experiment on the much more complex Knight Quest environment (see fruit2018efficient, Appendix). This domain takes inspiration from classical arcade games, in which a knight has to rescue a princess in the shortest possible time without being killed by the dragon. To accomplish this feat, the knight has to perform an intricate sequence of actions. In the absence of any reward, it is a fairly challenging environment for exploration. On these domains, we address the task of learning the best exploratory policy in a limited number of samples. In particular, we evaluate these policies in terms of the induced state entropy $H(d^\pi)$ and the probability of visiting the least favorable state under the policy, i.e., $\min_{s \in \mathcal{S}} d^\pi(s)$. At the same time, we consider the error in the transition model estimation while learning the exploratory policy.

We compare our approach with MaxEnt hazan2018provably, the model-based algorithm to learn maximum entropy exploration that we have previously discussed in the paper, and a count-based approach inspired by the exploration bonuses of MBIE-EB strehl2008analysis, which we refer to as CountBased in the following. The latter shares the same structure as our algorithm, but replaces the policy optimization sweeps with approximate value iterations bertsekas1995dynamic, where the reward for a given state is inversely proportional to the visit count of that state. It is worth noting that the results reported for the MaxEnt algorithm are related to the mixture policy, which is defined by a set of $\epsilon$-deterministic policies and a probability distribution over that set. For the sake of simplicity, we have equipped all the approaches with a little domain knowledge, i.e., the cardinality of $\mathcal{S}$ and $\mathcal{A}$, which allows us to build full-dimensional matrices even in the early stages. However, this can be avoided without a significant impact on the presented results. For every experiment, we will report the batch size and the parameters $\xi$, $\zeta$ of IDE$^3$AL. CountBased and MaxEnt employ $\epsilon$-greedy policies in all the experiments. More detailed information about the presented results, along with an additional experiment, can be found in Appendix D.

First, in Figure 3, we compare Problems (4), (5), (6) on the Single Chain environment. On one hand, we show the performance achieved by the exact solutions, i.e., computed with a full knowledge of $\mathbf{P}$. While the plain formulations, without the additional constraints, are remarkably similar, adding the constraint (7) over the action entropy has a significantly different impact on each of them. On the other hand, we illustrate the performance of IDE$^3$AL, equipped with the alternative optimization objectives, in learning a good exploratory policy from samples. In this case, the Frobenius formulation clearly achieves a better performance. In the following, we will report the results of IDE$^3$AL considering only the best-performing formulation, which, for all the presented experiments, corresponds to the Frobenius one.

In Figure (a), we show that IDE$^3$AL compares well against the other approaches in exploring the Double Chain domain. It achieves a superior state entropy and converges faster to the optimum. It also displays a higher probability of visiting the least favorable state, and it performs well in the estimation of $\mathbf{P}$. Notably, the CountBased algorithm fails to reach high exploration due to a detachment problem ecoffet2019go, since it fluctuates between two exploratory policies that are greedy towards the two directions of the chain. By contrast, in a domain having a clear direction for exploration, such as the simpler Single Chain domain, CountBased matches the exploration performance of IDE$^3$AL (Figure (b)). On the other hand, MaxEnt is effective in terms of exploration performance, but much slower to converge, both in the Double Chain and the Single Chain. Note that in Figure (a), the model estimation error of MaxEnt starts higher than the others, since it employs a different strategy to fill the transition probabilities and the intrinsic rewards of never reached states, inspired by brafman2002r. In Figure (c), we present an experiment on the higher-dimensional Knight Quest environment. IDE$^3$AL achieves a remarkable state distribution entropy, while MaxEnt struggles to converge towards a satisfying exploratory policy. The CountBased algorithm (not reported in Figure (c), see Appendix D) fails to explore the environment altogether, oscillating between policies with low entropy values.

In Figure (d), we illustrate how the exploratory policies learned in the exploration of the Double Chain environment ease the subsequent learning of any possible goal-conditioned policy. To this end, the exploratory policies, learned by the three approaches through 3000 samples (Figure (a)), are employed to collect samples in a fixed horizon (within a range from 10 to 100 steps). Then, a goal-conditioned policy is learned off-line through approximate value iteration bertsekas1995dynamic, to optimize a reward function that is 1 for the hardest state to reach and 0 in all the other states. In this setting, all the methods prove to be rather successful, though IDE$^3$AL compares favorably with the other strategies.

## 6 Related Work

As discussed in the previous sections, the work of Hazan et al. hazan2018provably considers an objective not very dissimilar to the one presented in this paper, even if they propose a completely different solution to achieve this goal. In particular, their method learns a mixture of deterministic policies instead of a single stochastic policy. In tarbouriech2019active, the authors, albeit addressing the different problem of active exploration in an MDP, develop an approach with some analogies w.r.t. the one presented in hazan2018provably. In their case, while the gradient of an estimation loss replaces the gradient of the state distribution entropy in the design of the reward functions, the solution is a mixture of, much more practical, stochastic policies.

Other works propose to intrinsically motivate the agent towards learning to reach all possible states in the environment lim2012autonomous. To extend this same idea from the tabular setting to the context of a continuous, high-dimensional state space, Pong et al. pong2019skew employ a generative model to seek a maximum-entropy goal distribution. In ecoffet2019go, the authors propose an approach, called Go-Explore, to methodically reach any state by keeping an archive of every visited state and the best trajectory that brought the agent there. At each iteration, the agent draws a promising state from the archive, returns there by replicating the stored trajectory (Go), then explores from this state trying to discover new states (Explore).

Another promising intrinsic objective is to make value out of the exploration phase by acquiring a set of reusable skills, typically formulated by means of the option framework sutton1999between, which can be combined hierarchically to achieve challenging goals. In barto2004intrinsically, a set of options is learned by maximizing an intrinsic reward that is generated at the occurrence of some user-defined salient event. The approach proposed by Bonarini et al. bonarini2006incremental, which presents some similarities with the work in ecoffet2019go, is based on learning a set of options to return with high probability to promising states. In their context, a promising state is both a hard state to reach and a doorway to reach many other states. In this way, the learned options heuristically favor an even exploration of the state space.

## 7 Discussion and Conclusions

In this paper, we proposed a new model-based algorithm, IDE$^3$AL, to learn highly exploring and fast mixing policies. The algorithm outputs a sequence of policies that maximize a lower bound to the entropy of their steady-state distributions. We presented three formulations of the lower bound that trade off tightness against the computational complexity of the optimization in different ways. The experimental evaluation showed that IDE$^3$AL achieves performance superior to other approaches striving for uniform exploration of the environment, such as maximum-entropy and count-based algorithms. Furthermore, in contrast with the state-of-the-art approaches in intrinsic motivation, IDE$^3$AL avoids the risk of detachment and derailment ecoffet2019go during exploration. Future works could focus on extending the applicability of the presented approach to non-tabular environments, following the blueprint in bellemare2016unifying. Promising future directions also include the problem of learning a policy that achieves high exploration over multiple domains within a class of environments. We believe that this work provides a valuable contribution towards solving the conundrum of what a reinforcement learning agent should learn in the absence of any reward coming from the environment.

## Appendix A Proofs

See the theorem on uniform steady-state distributions in Section 3.1 (Infinity).

###### Proof.

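Let us recall the definition of the steady state distribution $d^\pi$ of the MC induced by the policy $\pi$ over the MDP:

$$d^\pi(s) = \sum_{s' \in \mathcal{S}} P^\pi(s|s')\, d^\pi(s'), \quad \forall s \in \mathcal{S}.$$

If $d^\pi$ is a uniform distribution we have:

$$\sum_{s' \in \mathcal{S}} P^\pi(s|s') = 1, \quad \forall s \in \mathcal{S}, \tag{10}$$

then, the state transition matrix $\mathbf{P}^\pi$ is column stochastic, while it is also row stochastic by definition. Conversely, if the matrix $\mathbf{P}^\pi$ is doubly stochastic, we aim to prove that a $d^\pi$ that is not uniform causes an inconsistency in the stationary condition $\mathbf{d}^\pi = (\mathbf{P}^\pi)^T \mathbf{d}^\pi$. Let us consider a perturbation of the uniform $d^\pi$, such that $d^\pi(s) = 1/|\mathcal{S}|$ for all the states in $\mathcal{S}$ outside of:

$$d^\pi(s_h) = \frac{1}{|\mathcal{S}|} + \alpha, \qquad d^\pi(s_l) = \frac{1}{|\mathcal{S}|} - \alpha, \tag{11}$$

where $\alpha$ is a sufficiently small positive constant. Since $\mathbf{P}^\pi$ is doubly stochastic, the sum:

$$d^\pi(s_h) = \sum_{s' \in \mathcal{S}} P^\pi(s_h|s')\, d^\pi(s'), \tag{12}$$

is a convex combination of the elements in $\mathbf{d}^\pi$. Since $d^\pi(s_h)$ is strictly larger than every other element, for the stationary condition to hold we must have $P^\pi(s_h|s_h) = 1$ and $P^\pi(s_h|s') = 0$ for all $s'$ different from $s_h$. Nevertheless, a state with probability one on the self-loop cannot have a stationary probability different from $0$ or $1$. ∎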

See 3.1

###### Proof.

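We start with rewriting the entropy of $d^\pi$ as follows:

$$H(d^\pi) = -\sum_{s \in \mathcal{S}} d^\pi(s) \log\big(d^\pi(s)\big) = -\sum_{s \in \mathcal{S}} d^\pi(s) \log\left(\frac{d^\pi(s)\,|\mathcal{S}|}{|\mathcal{S}|}\right) = \log(|\mathcal{S}|) - D_{KL}(d^\pi \,\|\, d_u), \tag{13}$$

where $d_u$ is the uniform distribution over the state space (all the entries equal to $1/|\mathcal{S}|$) and $D_{KL}(p \,\|\, q)$ is the Kullback-Leibler (KL) divergence between distributions $p$ and $q$.
Using the reverse Pinsker inequality [csiszar2006context, p. 1012 and Lemma 6.3], we can upper bound the KL divergence between $d^\pi$ and $d_u$:

$$D_{KL}(d^\pi \,\|\, d_u) \le \frac{\| d_u - d^\pi \|_1^2}{\min_{s \in \mathcal{S}} d_u(s)} = |\mathcal{S}| \cdot \| d_u - d^\pi \|_1^2. \tag{14}$$

The total variation between the two steady-state distributions $d^\pi$ and $d_u$ can in turn be upper bounded by (see [schweitzer1968perturbation]):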

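$$\| d_u - d^\pi \|_1 \le \| Z^{P_u} \|_\infty \, \| P_u - \Pi P \|_\infty, \tag{15}$$

where $Z^{P_u}$ is the fundamental matrix and $P_u$ is any doubly-stochastic matrix ($P_u \in \mathcal{P}$). Since the fundamental matrix associated to any doubly-stochastic matrix is row stochastic [hunter2010some], then $\| Z^{P_u} \|_\infty = 1$. Furthermore, since the bound in Equation (15) holds for any $P_u \in \mathcal{P}$, we can rewrite the bound as follows:

$$\| d_u - d^\pi \|_1 \le \inf_{P_u \in \mathcal{P}} \| P_u - \Pi P \|_\infty. \tag{16}$$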

Combining Equations (14) and (16) we get an upper bound to the KL divergence, which, once replaced in Equation (13), provides the lower bound in the statement and concludes the proof.
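As a sanity check on the first two steps of this proof, the sketch below evaluates the entropy decomposition of Equation (13) and the reverse Pinsker bound of Equation (14) on an arbitrary distribution. The distribution `d` and the helper names are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def entropy(d):
    """Shannon entropy (natural log) of a discrete distribution."""
    d = np.asarray(d, dtype=float)
    nz = d > 0
    return -np.sum(d[nz] * np.log(d[nz]))

def kl_to_uniform(d):
    """KL divergence between d and the uniform distribution on the same support size."""
    d = np.asarray(d, dtype=float)
    nz = d > 0
    return np.sum(d[nz] * np.log(d[nz] * d.size))

# Illustrative (hypothetical) steady-state distribution over four states.
d = np.array([0.5, 0.25, 0.15, 0.10])
n = d.size

H = entropy(d)
KL = kl_to_uniform(d)
l1_distance = np.abs(d - 1.0 / n).sum()

# Equation (13): H(d) = log|S| - KL(d || d_u).
identity_holds = np.isclose(H, np.log(n) - KL)

# Equation (14): KL(d || d_u) <= |S| * ||d_u - d||_1^2.
pinsker_holds = KL <= n * l1_distance ** 2

# Resulting lower bound on the entropy in terms of the L1 distance.
entropy_lower_bound = np.log(n) - n * l1_distance ** 2
```

The bound is loose for this particular `d`, which is expected: it is meant to be optimized, not to be tight pointwise.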

See 3.1

###### Proof.

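From the properties of the matrix norms [petersen2008matrix], we have that for any $n \times n$ matrix $M$ it holds:

$$\| M \|_F \ge \frac{1}{\sqrt{n}} \| M \|_\infty.$$

As a consequence:

$$\inf_{P_u \in \mathcal{P}} \| P_u - \Pi P \|_F^2 \ge \frac{1}{|\mathcal{S}|} \big\| \bar{P}_u - \Pi P \big\|_\infty^2 \ge \frac{1}{|\mathcal{S}|} \inf_{P_u \in \mathcal{P}} \| P_u - \Pi P \|_\infty^2,$$

where $\bar{P}_u \in \arg\inf_{P_u \in \mathcal{P}} \| P_u - \Pi P \|_F$. Combining this inequality with the result in Theorem 3.1 concludes the proof. ∎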

See 3.1

###### Proof.

We start with defining the vector $c$ that results from the difference between the vector of ones and the vector of the column sums of $P^\pi$: $c(s) = 1 - \sum_{s' \in \mathcal{S}} P^\pi(s|s')$. We denote with $\hat{P}_x$ the matrix obtained from $P^\pi$ by adding $c^\top$ to the row corresponding to state $x$.

It is worth noting that, since $\sum_{s \in \mathcal{S}} c(s) = 0$, the column sums and the row sums of matrix $\hat{P}_x$ are all equal to $1$. Nonetheless, $\hat{P}_x$ is not guaranteed to be doubly stochastic since its entries can be lower than $0$. However, it is possible to show that

$$\inf_{P_u \in \mathcal{P}} \| P_u - P^\pi \|_\infty \le \big\| \hat{P}_x - P^\pi \big\|_\infty = \| c \|_1.$$

When $\hat{P}_x$ is doubly stochastic, the above inequality holds by definition. When $\hat{P}_x$ has negative entries, it is always possible to transform it to a doubly stochastic matrix without increasing the distance from $P^\pi$. In order to remove the negative entries of $\hat{P}_x$, we need to trade probability with the other states, so as to preserve the row sum. Each state that gives probability to state $x$ will receive the same amount of probability taken from the columns corresponding to positive values of the vector $c$. In order to illustrate this procedure, we consider a four-state MDP and a policy that leads to the following state transition matrix:

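$$P^\pi = \begin{bmatrix} 0.8 & 0.2 & 0 & 0 \\ 0 & 0.9 & 0.1 & 0 \\ 0.3 & 0.5 & 0.1 & 0.1 \\ 0.8 & 0.1 & 0.1 & 0 \end{bmatrix}.$$

The corresponding vector $c$ is

$$c = \begin{bmatrix} -0.9 & -0.7 & 0.7 & 0.9 \end{bmatrix}^\top.$$

Summing $c^\top$ to the first row of $P^\pi$ we get:

$$\hat{P}_{s_1} = \begin{bmatrix} -0.1 & -0.5 & 0.7 & 0.9 \\ 0 & 0.9 & 0.1 & 0 \\ 0.3 & 0.5 & 0.1 & 0.1 \\ 0.8 & 0.1 & 0.1 & 0 \end{bmatrix}.$$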

Since we have two negative elements, to get a doubly stochastic matrix we can modify the matrix as follows:

• move $0.1$ from element $(3,1)$ to $(1,1)$ and (to keep the row sum equal to 1) move $0.1$ from $(1,3)$ to $(3,3)$

• move $0.5$ from element $(2,2)$ to $(1,2)$ and (to keep the row sum equal to 1) move $0.5$ from $(1,3)$ to $(2,3)$

The resulting matrix is:

$$\hat{P} = \begin{bmatrix} 0 & 0 & 0.1 & 0.9 \\ 0 & 0.4 & 0.6 & 0 \\ 0.2 & 0.5 & 0.2 & 0.1 \\ 0.8 & 0.1 & 0.1 & 0 \end{bmatrix} \in \mathcal{P}.$$

The described procedure yields a doubly stochastic matrix $\hat{P}$ such that $\| \hat{P} - P^\pi \|_\infty \le \| c \|_1$. Combining this upper bound with the result in Theorem 3.1 concludes the proof. ∎
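The four-state example above can be reproduced in a few lines. This is only an illustrative check of the numbers reported in the proof; the helper names `inf_norm` and `is_doubly_stochastic` are hypothetical, and the repaired matrix is copied from the text rather than computed by a general procedure.

```python
import numpy as np

def inf_norm(M):
    """Matrix infinity norm: maximum absolute row sum."""
    return np.abs(M).sum(axis=1).max()

def is_doubly_stochastic(M, tol=1e-9):
    """Non-negative entries with unit row sums and unit column sums."""
    return bool((M >= -tol).all()
                and np.allclose(M.sum(axis=0), 1.0)
                and np.allclose(M.sum(axis=1), 1.0))

P_pi = np.array([[0.8, 0.2, 0.0, 0.0],
                 [0.0, 0.9, 0.1, 0.0],
                 [0.3, 0.5, 0.1, 0.1],
                 [0.8, 0.1, 0.1, 0.0]])

# c = vector of ones minus the column sums of P_pi.
c = 1.0 - P_pi.sum(axis=0)

# Add c to the row of the first state: unit row and column sums, but negative entries.
P_hat_s1 = P_pi.copy()
P_hat_s1[0] += c

# Repaired matrix reported in the text after trading probability mass.
P_hat = np.array([[0.0, 0.0, 0.1, 0.9],
                  [0.0, 0.4, 0.6, 0.0],
                  [0.2, 0.5, 0.2, 0.1],
                  [0.8, 0.1, 0.1, 0.0]])

dist_unrepaired = inf_norm(P_hat_s1 - P_pi)  # equals ||c||_1
dist_repaired = inf_norm(P_hat - P_pi)       # does not exceed ||c||_1
c_l1 = np.abs(c).sum()
```

In this instance the repair step strictly decreases the distance from $P^\pi$, which is consistent with the claim that it never increases it.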

The bound in Theorem 3.1 is never less than the bound in Theorem 3.1.

###### Proof.

From the properties of the matrix norms [petersen2008matrix], we have that for any $n \times n$ matrix $M$ it holds:

$$\frac{\| M \|_\infty}{\sqrt{n}} \le \| M \|_F \le \sqrt{n}\, \| M \|_\infty. \tag{17}$$

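As a consequence:

$$\inf_{P_u \in \mathcal{P}} \| P_u - \Pi P \|_F^2 \ge \frac{1}{|\mathcal{S}|} \big\| \bar{P}_u - \Pi P \big\|_\infty^2 \ge \frac{1}{|\mathcal{S}|} \inf_{P_u \in \mathcal{P}} \| P_u - \Pi P \|_\infty^2,$$

where $\bar{P}_u \in \arg\inf_{P_u \in \mathcal{P}} \| P_u - \Pi P \|_F$. It follows that

$$\log |\mathcal{S}| - |\mathcal{S}|^2 \inf_{P_u \in \mathcal{P}} \| P_u - \Pi P \|_F^2 \le \log |\mathcal{S}| - |\mathcal{S}| \inf_{P_u \in \mathcal{P}} \| P_u - \Pi P \|_\infty^2.$$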
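The norm relation in Equation (17) is easy to probe empirically. The sketch below tests it on random square matrices; the function name `norm_sandwich_holds` and the sampling scheme are illustrative assumptions, and a numerical check of this kind is of course not a proof.

```python
import numpy as np

def norm_sandwich_holds(M, tol=1e-12):
    """Check ||M||_inf / sqrt(n) <= ||M||_F <= sqrt(n) * ||M||_inf for a square M."""
    n = M.shape[0]
    inf_norm = np.linalg.norm(M, np.inf)  # maximum absolute row sum
    fro_norm = np.linalg.norm(M, "fro")   # Frobenius norm
    lower_ok = inf_norm / np.sqrt(n) <= fro_norm + tol
    upper_ok = fro_norm <= np.sqrt(n) * inf_norm + tol
    return bool(lower_ok and upper_ok)

rng = np.random.default_rng(0)
checks = [norm_sandwich_holds(rng.normal(size=(n, n)))
          for n in (2, 5, 10) for _ in range(100)]
all_hold = all(checks)
```

The identity matrix attains the right-hand inequality with equality, which shows that the factor $\sqrt{n}$ cannot be improved in general.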

## Appendix B Optimization Problems

### B.1 Linear program formulation of Problem (4)

Problem (4) can be rewritten as follows:

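$$
\begin{aligned}
\underset{P_u,\, \Pi,\, v}{\text{minimize}} \quad & v \\
\text{subject to} \quad & \sum_{s' \in \mathcal{S}} \big| P_u(s'|s) - P^\pi(s'|s) \big| \le v, && \forall s \in \mathcal{S}, \\
& \Pi(s, (s,a)) \ge 0, && \forall s \in \mathcal{S},\ \forall a \in \mathcal{A}, \\
& P_u(s'|s) \ge 0, && \forall s \in \mathcal{S},\ \forall s' \in \mathcal{S}, \\
& \sum_{a \in \mathcal{A}} \Pi(s, (s,a)) = 1, && \forall s \in \mathcal{S}, \\
& \sum_{s' \in \mathcal{S}} P_u(s'|s) = 1, && \forall s \in \mathcal{S}, \\
& \sum_{s' \in \mathcal{S}} P_u(s|s') = 1, && \forall s \in \mathcal{S}.
\end{aligned}
\tag{18}
$$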

The first set of inequality constraints can be transformed into a set of linear inequality constraints. Each constraint is obtained by removing the absolute values and considering a different permutation of the signs in front of the terms in the summation. As a result, if the original summation contains $n$ elements, the number of linear constraints is $2^n$. Since each summation ranges over $|\mathcal{S}|$ elements and this process needs to be done for each state $s \in \mathcal{S}$, the first set of constraints can be replaced by $|\mathcal{S}| \cdot 2^{|\mathcal{S}|}$ linear constraints.

### B.2 Linear program formulation of Problem (6)

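Let $v$ be a vector of length $|\mathcal{S}|$. Problem (6) can be rewritten as follows:

$$
\begin{aligned}
\underset{\Pi,\, v}{\text{minimize}} \quad & \sum_{s \in \mathcal{S}} v(s) \\
\text{subject to} \quad & 1 - \sum_{s' \in \mathcal{S}} P^\pi(s|s') \le v(s), && \forall s \in \mathcal{S}, \\
& \sum_{s' \in \mathcal{S}} P^\pi(s|s') - 1 \le v(s), && \forall s \in \mathcal{S}, \\
& \Pi(s, (s,a)) \ge 0, && \forall s \in \mathcal{S},\ \forall a \in \mathcal{A}, \\
& \sum_{a \in \mathcal{A}} \Pi(s, (s,a)) = 1, && \forall s \in \mathcal{S}.
\end{aligned}
\tag{19}
$$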
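To make the structure of Problem (19) concrete, the following sketch solves it with SciPy's `linprog` on a toy model. This is an assumption-laden illustration rather than the authors' implementation: it swaps in SciPy for the solver, assumes a known transition model stored as `P[s_prev, a, s_next]`, writes $P^\pi(s|s') = \sum_a \pi(a|s') P(s|s',a)$, and uses the hypothetical function name `solve_column_sum_lp`.

```python
import numpy as np
from scipy.optimize import linprog

def solve_column_sum_lp(P):
    """Sketch of Problem (19). P[s_prev, a, s_next] is an assumed known model.

    Decision variables: the flattened policy pi(a|s) followed by the vector v.
    Returns the policy as an (S, A) array and the optimal objective value.
    """
    S, A, _ = P.shape
    # G[s, s_prev * A + a] = P(s | s_prev, a), so G @ pi_flat gives the column sums of P^pi.
    G = P.reshape(S * A, S).T
    I = np.eye(S)
    cost = np.concatenate([np.zeros(S * A), np.ones(S)])
    # 1 - colsum(s) <= v(s)  and  colsum(s) - 1 <= v(s), written as A_ub @ x <= b_ub.
    A_ub = np.block([[-G, -I], [G, -I]])
    b_ub = np.concatenate([-np.ones(S), np.ones(S)])
    # Each row of the policy sums to one.
    A_eq = np.hstack([np.kron(I, np.ones((1, A))), np.zeros((S, S))])
    b_eq = np.ones(S)
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:S * A].reshape(S, A), res.fun

# Toy two-state model (hypothetical): action 0 stays, action 1 switches state.
P_toy = np.zeros((2, 2, 2))
P_toy[0, 0, 0] = P_toy[1, 0, 1] = 1.0
P_toy[0, 1, 1] = P_toy[1, 1, 0] = 1.0

policy, objective = solve_column_sum_lp(P_toy)
P_pi = np.einsum("sa,sat->st", policy, P_toy)  # induced transition matrix
```

In the toy model a doubly stochastic $P^\pi$ is attainable, so the optimal objective is zero; when every action leads to the same state, no policy can balance the column sums and the objective stays strictly positive.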

## Appendix C Illustrative Example

The example in Figure 9 shows that the Frobenius norm can capture the distance between a transition matrix and a doubly stochastic matrix better than the $\infty$-norm. Indeed, the $\infty$-norm only accounts for the state which corresponds to the maximum absolute row sum of the difference between the two matrices, while the Frobenius norm considers the difference across all the states. In the example, we see two transition matrices that are equally bad in the worst state and thus have equal $\infty$-norm. However, one of them is fairly unbalanced also in the other states, where the other is uniform instead, and so the latter is clearly preferable in view of the uniform exploration objective.

## Appendix D Experimental Evaluation: Further Details

In the following, we provide further details on the experimental evaluation covered by Section 5. First, for the sake of clarity, we report the pseudo-code of MaxEnt and CountBased algorithms, which we have compared with our approach. Then, for any presented experiment, we recap the full set of parameters employed, along with a characterization of the time consumption in solving the optimization problems. Finally, we provide an additional experiment, not covered by the main paper, in the River Swim environment [see, strehl2008analysis].

As a side note, it is worth reporting that our implementation of the optimization Problems (4), (5), (6) is based on the CVXPY framework [cvxpy] and makes use of the MOSEK optimizer.

### D.1 Algorithms: Pseudo-Code

In Algorithm 2, we report the pseudo-code of the MaxEnt algorithm [hazan2018provably]. In Algorithm 3, we report the pseudo-code of the CountBased algorithm, which is inspired by the exploration bonus of MBIE-EB [strehl2008analysis].

### D.2 Experiments

For every experiment covered by Section 5, we provide, in the table below, the cardinality of the state space $|\mathcal{S}|$, the cardinality of the action space $|\mathcal{A}|$, the values of the two parameters of IDE$^3$AL, the parameter of MaxEnt and CountBased, and the number of iterations and the batch size, which are shared by all the approaches. In the Goal-conditioned experiment, we employ the exploratory policies output by the Double Chain experiment to collect small batches of samples, on which we optimize a goal-conditioned policy. For every experiment (except Knight Quest), we provide additional figures reporting the performance of all the formulations of IDE$^3$AL.

The River Swim environment [strehl2008analysis] mimics the task of crossing a river by swimming either upstream or downstream. The action of swimming upstream fails with high probability, while the action of swimming downstream is deterministic. Due to this imbalance in the effort needed to cross the environment in the two directions, it is a fairly hard task in view of uniform exploration.