# Sparse Markov Decision Processes with Causal Sparse Tsallis Entropy Regularization for Reinforcement Learning

In this paper, a sparse Markov decision process (MDP) with novel causal sparse Tsallis entropy regularization is proposed. The proposed policy regularization induces a sparse and multi-modal optimal policy distribution of a sparse MDP. The full mathematical analysis of the proposed sparse MDP is provided. We first analyze the optimality condition of a sparse MDP. Then, we propose a sparse value iteration method which solves a sparse MDP and prove the convergence and optimality of sparse value iteration using the Banach fixed point theorem. The proposed sparse MDP is compared to soft MDPs, which utilize causal entropy regularization. We show that the performance error of a sparse MDP has a constant bound, while the error of a soft MDP increases logarithmically with respect to the number of actions, where this performance error is caused by the introduced regularization term. In experiments, we apply sparse MDPs to reinforcement learning problems. The proposed method outperforms existing methods in terms of convergence speed and performance.


## I Introduction

Markov decision processes (MDPs) have been widely used as a mathematical framework to solve stochastic sequential decision problems, such as autonomous driving [1], path planning [2], and quadrotor control [3]. In general, the goal of an MDP is to find the optimal policy function which maximizes the expected return. The expected return is a performance measure of a policy function and it is often defined as the expected sum of discounted rewards. An MDP is often used to formulate reinforcement learning (RL) [4], which aims to find the optimal policy without an explicit specification of the stochasticity of an environment, and inverse reinforcement learning (IRL) [5], whose goal is to search for a proper reward function that can explain the behavior of an expert who follows the underlying optimal policy.

Since the optimal solution of an MDP is a deterministic policy, it is not desirable to apply an MDP to problems with multiple optimal actions. From the perspective of RL, the knowledge of multiple optimal actions makes it possible to cope with unexpected situations. For example, suppose that an autonomous vehicle has multiple optimal routes to reach a given goal. If a traffic accident occurs on the currently selected optimal route, it is possible to avoid the accident by choosing another safe optimal route without additional computation of a new optimal route. For this reason, it is more desirable to learn all possible optimal actions in terms of robustness of a policy function. From the perspective of IRL, since experts often make multiple decisions in the same situation, a deterministic policy is limited in expressing the expert's behavior. For this reason, it is indispensable to model the policy function of an expert as a multi-modal distribution. These reasons give rise to the necessity of a multi-modal policy model.

In order to address the issues with a deterministic policy function, a causal entropy regularization method has been utilized [6, 7, 8, 9, 10]. This is mainly due to the fact that the optimal solution of an MDP with causal entropy regularization becomes a softmax distribution of state-action values $Q(s,a)$, i.e., $\pi(a|s) \propto \exp(Q(s,a))$, which is often referred to as a soft MDP [11]. While a softmax distribution has been widely used to model a stochastic policy, it has a weakness in modeling a policy function when the number of actions is large. In other words, the policy function modeled by a softmax distribution is prone to assign non-negligible probability mass to non-optimal actions even if the state-action values of these actions are dismissible. This tendency gets worse as the number of actions increases, as demonstrated in Figure 1.

In this paper, we propose a sparse MDP by presenting a novel causal sparse Tsallis entropy regularization method, which can be interpreted as a special case of Tsallis generalized entropy [12]. The proposed regularization method has a unique property in that the resulting policy distribution becomes a sparse distribution. In other words, the supporting action set which has a non-zero probability mass contains a sparse subset of the action space.

We provide a full mathematical analysis of the proposed sparse MDP. We first derive the optimality condition of a sparse MDP, which we call the sparse Bellman equation. We show that the sparse Bellman equation is an approximation of the original Bellman equation. Interestingly, we further find a connection between the optimality condition of a sparse MDP and the probability simplex projection problem [13]. We present a sparse value iteration method for solving a sparse MDP problem, where the optimality and convergence are proven using the Banach fixed point theorem [14]. We further analyze the performance gaps of the expected return of the optimal policies obtained by a sparse MDP and a soft MDP compared to that of the original MDP. In particular, we prove that the performance gap between the proposed sparse MDP and the original MDP has a constant bound as the number of actions increases, whereas the performance gap between a soft MDP and the original MDP grows logarithmically. Owing to this property, sparse MDPs have benefits over soft MDPs when it comes to solving problems in robotics with a continuous action space.

To validate the effectiveness of a sparse MDP, we apply the proposed method to the exploration strategy and the update rule of Q-learning and compare it to the $\epsilon$-greedy method and the softmax policy [9]. The proposed method is also compared to the deep deterministic policy gradient (DDPG) method [15], which is designed to operate in a continuous action space without discretization. The proposed method shows state-of-the-art performance compared to the other methods as the discretization level of the action space increases.

## II Background

### II-A Markov Decision Processes

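A Markov decision process (MDP) has been widely used to formulate a sequential decision making problem. An MDP can be characterized by a tuple $M = \{\mathcal{S}, \mathcal{F}, \mathcal{A}, d, T, \gamma, r\}$, where $\mathcal{S}$ is the state space, $\mathcal{F}$ is the corresponding feature space, $\mathcal{A}$ is the action space, $d(s)$ is the distribution of an initial state, $T(s'|s,a)$ is the transition probability from $s \in \mathcal{S}$ to $s' \in \mathcal{S}$ by taking $a \in \mathcal{A}$, $\gamma \in (0,1)$ is a discount factor, and $r(s,a)$ is the reward function. The objective of an MDP is to find a policy which maximizes $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t,a_t) \mid \pi, d, T\right]$, where policy $\pi$ is a mapping from the state space to the action space. For notational simplicity, we denote the expectation of a discounted summation of function $f(s,a)$, i.e., $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t f(s_t,a_t) \mid \pi, d, T\right]$, by $\mathbb{E}_\pi[f(s,a)]$, where $f(s,a)$ is a function of state and action, such as a reward function $r(s,a)$ or an indicator function $\mathbb{1}_{\{s=s',a=a'\}}$. We also denote the expectation of a discounted summation of function $f(s,a)$ conditioned on the initial state, i.e., $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t f(s_t,a_t) \mid \pi, s_0 = s, T\right]$, by $\mathbb{E}_\pi[f(s,a) \mid s_0 = s]$. Finding an optimal policy for an MDP can be formulated as follows: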

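$$
\begin{aligned}
\underset{\pi}{\text{maximize}} \quad & \mathbb{E}_\pi[r(s_t,a_t)] \\
\text{subject to} \quad & \forall s \;\; \sum_{a'} \pi(a'|s) = 1, \quad \forall s,a \;\; \pi(a|s) \ge 0.
\end{aligned} \tag{1}
$$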

The necessary condition for the optimal solution of (1) is called the Bellman equation. The Bellman equation is derived from Bellman's principle of optimality as follows:

$$
\begin{aligned}
Q_\pi(s,a) &= r(s,a) + \gamma \sum_{s'} V_\pi(s') T(s'|s,a) \\
V_\pi(s) &= \max_{a'} Q_\pi(s,a') \\
\pi(s) &= \arg\max_{a'} Q_\pi(s,a'),
\end{aligned}
$$

where $V_\pi(s)$ is a value function of $\pi$, which is the expected sum of discounted rewards when the initial state is given as $s$, and $Q_\pi(s,a)$ is a state-action value function of $\pi$, which is the expected sum of discounted rewards when the initial state and action are given as $s$ and $a$, respectively. Note that the optimal solution is a deterministic function, which is referred to as a deterministic policy.

### II-B Entropy Regularized Markov Decision Processes

In order to obtain a multi-modal policy function, an entropy-regularized MDP, also known as a soft MDP, has been widely used [9, 11, 8, 10]. In a soft MDP, causal entropy regularization over $\pi$ is introduced to obtain a multi-modal policy distribution, i.e., $\pi(a|s)$. Since causal entropy regularization penalizes a deterministic distribution, it makes the optimal policy of a soft MDP a softmax distribution. A soft MDP is formulated as follows:

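$$
\begin{aligned}
\underset{\pi}{\text{maximize}} \quad & \mathbb{E}_\pi[r(s_t,a_t)] + \alpha H(\pi) \\
\text{subject to} \quad & \forall s \;\; \sum_{a'} \pi(a'|s) = 1, \quad \forall s,a \;\; \pi(a|s) \ge 0,
\end{aligned} \tag{2}
$$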

where $H(\pi) \triangleq \mathbb{E}_\pi[-\log(\pi(a_t|s_t))]$ is a $\gamma$-discounted causal entropy and $\alpha$ is a regularization coefficient. This problem (2) has been extensively studied in [6, 11, 8]. In [11], a soft Bellman equation and the optimal policy distribution are derived from the Karush-Kuhn-Tucker (KKT) conditions as follows:

$$
\begin{aligned}
Q^{soft}_\pi(s,a) &= r(s,a) + \gamma \sum_{s'} V^{soft}_\pi(s') T(s'|s,a) \\
V^{soft}_\pi(s) &= \alpha \log\left(\sum_{a'} \exp\left(\frac{Q^{soft}_\pi(s,a')}{\alpha}\right)\right) \\
\pi(a|s) &= \frac{\exp\left(\frac{Q^{soft}_\pi(s,a)}{\alpha}\right)}{\sum_{a'} \exp\left(\frac{Q^{soft}_\pi(s,a')}{\alpha}\right)},
\end{aligned}
$$

where

$$
\begin{aligned}
V^{soft}_\pi(s) &= \mathbb{E}_\pi\left[r(s_t,a_t) - \alpha \log(\pi(a_t|s_t)) \mid s_0 = s\right] \\
Q^{soft}_\pi(s,a) &= \mathbb{E}_\pi\left[r(s_t,a_t) - \alpha \log(\pi(a_t|s_t)) \mid s_0 = s, a_0 = a\right].
\end{aligned}
$$

$V^{soft}_\pi(s)$ is a soft value of $\pi$ indicating the expected sum of rewards including the entropy of a policy, obtained by starting at state $s$, and $Q^{soft}_\pi(s,a)$ is a soft state-action value of $\pi$, which is the expected sum of rewards obtained by starting at state $s$ and taking action $a$. Note that the optimal policy distribution is a softmax distribution. In [11], a soft value iteration method is also proposed and the optimality of soft value iteration is proved. By using causal entropy regularization, the optimal policy distribution of a soft MDP is able to represent a multi-modal distribution.

The causal entropy regularization has the effect of making the resulting policy of a soft MDP closer to a uniform distribution as the number of actions increases. To handle this issue, we propose a novel regularization method whose resulting policy distribution still has multiple modes (a stochastic policy) but whose performance loss is smaller than that of a softmax policy distribution.
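As a rough illustration of the soft Bellman quantities above, the following sketch computes the soft value and the softmax policy for a single state. The function names and the toy action values are assumptions made for this example only. With one good action and a growing number of poor ones, the probability assigned to the good action shrinks, which is the drift toward a uniform distribution described above.

```python
import math

def soft_value(q_values, alpha):
    """Soft value alpha * log(sum_a exp(Q(s,a) / alpha)), computed stably."""
    m = max(q_values)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))

def softmax_policy(q_values, alpha):
    """Softmax policy pi(a|s) = exp((Q(s,a) - V_soft(s)) / alpha)."""
    v = soft_value(q_values, alpha)
    return [math.exp((q - v) / alpha) for q in q_values]

# Toy state (hypothetical values): one good action (Q = 1), n - 1 poor ones (Q = 0).
for n in (2, 10, 100):
    q = [1.0] + [0.0] * (n - 1)
    pi = softmax_policy(q, alpha=1.0)
    print(n, round(pi[0], 3))
```

Every action keeps a strictly positive probability here, however small its action value is.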

## III Sparse Markov Decision Processes

We propose a sparse Markov decision process by introducing a novel causal sparse Tsallis entropy regularizer:

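$$
W(\pi) \triangleq \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \frac{1}{2}\left(1 - \pi(a_t|s_t)\right) \,\middle|\, \pi, d, T\right] = \mathbb{E}_\pi\left[\frac{1}{2}\left(1 - \pi(a|s)\right)\right].
$$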

By adding $\alpha W(\pi)$ to the objective function of (1), we aim to solve the following optimization problem:

$$
\begin{aligned}
\underset{\pi}{\text{maximize}} \quad & \mathbb{E}_\pi[r(s,a)] + \alpha W(\pi) \\
\text{subject to} \quad & \forall s \;\; \sum_{a'} \pi(a'|s) = 1, \quad \forall s,a \;\; \pi(a|s) \ge 0,
\end{aligned} \tag{3}
$$

where $\alpha$ is a regularization coefficient. We will first derive the sparse Bellman equation from the necessary condition of (3). Then, by observing the connection between the sparse Bellman equation and the probability simplex projection, we show that the optimal policy becomes a sparsemax distribution, where the sparsity can be controlled by $\alpha$. In addition, we present a sparse value iteration algorithm whose optimality is guaranteed by the Banach fixed point theorem. The detailed derivations of lemmas and theorems in this paper can be found in Appendix A.

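### III-A Notations and Properties

We first introduce notations and properties used in the paper. All notations and definitions are summarized in Table I. The utility, value, and state visitation can be compactly expressed in terms of vectors and matrices as follows:

$$
\begin{aligned}
J^{sp}_\pi &= d^\intercal G_\pi^{-1} r^{sp}_\pi, & V^{sp}_\pi &= G_\pi^{-1} r^{sp}_\pi, \\
J^{soft}_\pi &= d^\intercal G_\pi^{-1} r^{soft}_\pi, & V^{soft}_\pi &= G_\pi^{-1} r^{soft}_\pi, \\
\rho_\pi^\intercal &= d^\intercal G_\pi^{-1},
\end{aligned}
$$

where $x^\intercal$ is the transpose of vector $x$, $G_\pi = I - \gamma T_\pi$, $sp$ indicates a sparse MDP problem, and $soft$ indicates a soft MDP problem.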

### III-B Sparse Bellman Equation from Karush-Kuhn-Tucker conditions

The sparse Bellman equation can be derived from the necessary conditions of an optimal solution of a sparse MDP. We carefully investigate the Karush-Kuhn-Tucker (KKT) conditions, which are necessary conditions for a solution to be optimal when some regularity conditions on the feasible set are satisfied. The feasible set of a sparse MDP satisfies the linearity constraint qualification [16] since the feasible set is defined by affine functions. In this regard, the optimal solution of a sparse MDP necessarily satisfies the KKT conditions as follows.

###### Theorem 1.

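If a policy distribution $\pi$ is the optimal solution of a sparse MDP (3), then $\pi$ and the corresponding sparse value function $V^{sp}_\pi$ necessarily satisfy the following equations for all state and action pairs:

$$
\begin{aligned}
Q^{sp}_\pi(s,a) &= r(s,a) + \gamma \sum_{s'} V^{sp}_\pi(s') T(s'|s,a) \\
V^{sp}_\pi(s) &= \alpha\left[\frac{1}{2} \sum_{a \in S(s)} \left(\left(\frac{Q^{sp}_\pi(s,a)}{\alpha}\right)^2 - \tau\left(\frac{Q^{sp}_\pi(s,\cdot)}{\alpha}\right)^2\right) + \frac{1}{2}\right] \\
\pi(a|s) &= \max\left(\frac{Q^{sp}_\pi(s,a)}{\alpha} - \tau\left(\frac{Q^{sp}_\pi(s,\cdot)}{\alpha}\right), 0\right),
\end{aligned} \tag{4}
$$

where $\tau\left(\frac{Q^{sp}_\pi(s,\cdot)}{\alpha}\right) = \frac{\sum_{a \in S(s)} \frac{Q^{sp}_\pi(s,a)}{\alpha} - 1}{K_s}$, $S(s)$ is a set of actions satisfying $1 + i \frac{Q^{sp}_\pi(s,a_{(i)})}{\alpha} > \sum_{j=1}^{i} \frac{Q^{sp}_\pi(s,a_{(j)})}{\alpha}$ with $a_{(i)}$ indicating the action with the $i$th largest action value $Q^{sp}_\pi(s,a_{(i)})$, and $K_s$ is the cardinality of $S(s)$.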

The full proof of Theorem 1 is provided in Appendix A-A. The proof depends on the stationarity condition of the KKT conditions, where the derivative of the Lagrangian with respect to the policy becomes zero at the optimal solution. From (4), it can be shown that the optimal solution obtained from the sparse MDP assigns zero probability to any action whose action value is below the threshold $\tau$, and assigns positive probability to near optimal actions in proportion to their action values, where the threshold determines the range of near optimal actions. This property gives the optimal policy a sparse distribution and prevents the performance drop caused by assigning non-negligible positive probabilities to non-optimal actions, which often occurs in a soft MDP.
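To make the thresholding in (4) concrete, here is a minimal sketch for a single state. The helper names and the toy action values are assumptions for illustration only. The threshold is found by sorting the scaled action values and keeping the largest prefix that passes the supporting-set test.

```python
def sparsemax_threshold(z):
    """Return (tau, K): the threshold and the support size for vector z."""
    z_sorted = sorted(z, reverse=True)
    running_sum, tau, k = 0.0, 0.0, 0
    for i, z_i in enumerate(z_sorted, start=1):
        running_sum += z_i
        if 1.0 + i * z_i > running_sum:  # membership test for the supporting set
            k, tau = i, (running_sum - 1.0) / i
    return tau, k

def sparsemax_policy(q_values, alpha):
    """Policy of (4): pi(a|s) = max(Q(s,a)/alpha - tau, 0)."""
    z = [q / alpha for q in q_values]
    tau, _ = sparsemax_threshold(z)
    return [max(z_a - tau, 0.0) for z_a in z]

# Toy action values (hypothetical): the third action falls below the threshold.
print(sparsemax_policy([1.0, 0.5, -1.0], alpha=1.0))  # [0.75, 0.25, 0.0]
```

The action with value -1.0 receives exactly zero probability, whereas a softmax policy would still assign it positive mass.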

From the definitions accompanying (4), we can further observe an interesting connection between the sparse Bellman equation and the probability simplex projection problem [13].

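### III-C Probability Simplex Projection and SparseMax Operation

The probability simplex projection [13] is the well known problem of projecting a $d$-dimensional vector onto the $(d-1)$-dimensional probability simplex in the Euclidean metric sense. A probability simplex projection problem is defined as follows:

$$
\begin{aligned}
\underset{p}{\text{minimize}} \quad & \frac{1}{2}\|p - z\|_2^2 \\
\text{subject to} \quad & \sum_{i=1}^{d} p_i = 1, \quad p_i \ge 0, \; \forall i = 1, \cdots, d,
\end{aligned} \tag{5}
$$

where $z$ is a given $d$-dimensional vector, $d$ is the dimension of $p$ and $z$, and $p_i$ is the $i$th element of $p$. Let $z_{(i)}$ be the $i$th largest element of $z$ and $S(z)$ be the supporting set of the optimal solution as defined by $S(z) = \{i \mid 1 + i z_{(i)} > \sum_{j=1}^{i} z_{(j)}\}$. It is a well known fact that the problem (5) has a closed form solution, which is $p_i^*(z) = \max(z_i - \tau(z), 0)$, where $i$ indicates the $i$th dimension, $p_i^*(z)$ is the $i$th element of the optimal solution for fixed $z$, and $\tau(z) = \frac{\sum_{i \in S(z)} z_{(i)} - 1}{K}$ with $K = |S(z)|$ [13, 17].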

Interestingly, the optimal solution $p^*(z)$, the threshold $\tau(z)$, and the supporting set $S(z)$ of (5) can be precisely matched to those of the sparse Bellman equation (4). From this observation, it can be shown that the optimal policy distribution of a sparse MDP is the projection of $\frac{Q^{sp}_\pi(s,\cdot)}{\alpha}$ onto a probability simplex. Note that we refer to $p^*(z)$ as a sparsemax distribution.
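The closed form solution of the projection problem can be checked numerically. The sketch below, with hypothetical helper names and an arbitrary test vector, compares the closed form against randomly drawn points of the simplex; no random point should be closer to the input vector than the closed form solution.

```python
import random

def project_to_simplex(z):
    """Closed form Euclidean projection of z onto the probability simplex."""
    z_sorted = sorted(z, reverse=True)
    running_sum, tau = 0.0, 0.0
    for i, z_i in enumerate(z_sorted, start=1):
        running_sum += z_i
        if 1.0 + i * z_i > running_sum:  # index i belongs to the supporting set
            tau = (running_sum - 1.0) / i
    return [max(z_i - tau, 0.0) for z_i in z]

def squared_distance(p, z):
    return sum((p_i - z_i) ** 2 for p_i, z_i in zip(p, z))

def random_simplex_point(d, rng):
    """Draw a random point of the simplex by normalizing positive weights."""
    weights = [rng.expovariate(1.0) for _ in range(d)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
z = [0.9, 0.4, -0.3, 0.1]  # arbitrary test vector
p_star = project_to_simplex(z)
best = squared_distance(p_star, z)
never_beaten = all(
    squared_distance(random_simplex_point(len(z), rng), z) >= best - 1e-12
    for _ in range(1000)
)
print(p_star, never_beaten)
```

This is only a sanity check by sampling, not a proof of optimality.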

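More surprisingly, $V^{sp}_\pi$ can be represented as an approximation of the max operation derived from $p^*(z)$. A differentiable approximation of the max operation is defined as follows:

$$
\text{spmax}(z) \triangleq \frac{1}{2} \sum_{i=1}^{K} \left(z_{(i)}^2 - \tau(z)^2\right) + \frac{1}{2}. \tag{6}
$$

We call $\text{spmax}(z)$ sparsemax. In [17], it is proven that $\text{spmax}(z)$ is an indefinite integral of $p^*(z)$, i.e., $\text{spmax}(z) = \int \left(p^*(z)\right)^\intercal dz + C$, where $C$ is a constant and, in our case, $C = \frac{1}{2}$. We provide simple upper and lower bounds of $\alpha\,\text{spmax}\left(\frac{z}{\alpha}\right)$ with respect to $\max(z)$:

$$
\max(z) \le \alpha\,\text{spmax}\left(\frac{z}{\alpha}\right) \le \max(z) + \alpha \frac{d-1}{2d}. \tag{7}
$$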

The lower bound of sparsemax is shown in [17]. However, we provide another proof of the lower bound and the proof for the upper bound in Appendix A-B.
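The bounds can be checked on a few vectors with a short sketch. The function names and the test vectors are assumptions for this example. The lower bound is attained when a single element dominates, and the upper bound is attained when all elements are equal.

```python
def spmax(z):
    """Sparsemax value: 0.5 * sum_{i<=K} (z_(i)^2 - tau^2) + 0.5."""
    z_sorted = sorted(z, reverse=True)
    running_sum, tau, k = 0.0, 0.0, 0
    for i, z_i in enumerate(z_sorted, start=1):
        running_sum += z_i
        if 1.0 + i * z_i > running_sum:  # supporting-set test
            k, tau = i, (running_sum - 1.0) / i
    return 0.5 * sum(z_i ** 2 - tau ** 2 for z_i in z_sorted[:k]) + 0.5

def scaled_spmax(z, alpha):
    """alpha * spmax(z / alpha), the quantity bounded above and below."""
    return alpha * spmax([z_i / alpha for z_i in z])

# Arbitrary test cases: (vector, alpha).
cases = [([1.0, 0.5, -1.0], 1.0), ([0.0, 0.0, 0.0, 0.0], 2.0), ([3.0, -2.0], 0.1)]
for z, alpha in cases:
    lower = max(z)
    upper = max(z) + alpha * (len(z) - 1) / (2 * len(z))
    print(lower, scaled_spmax(z, alpha), upper)
```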

The bounds (7) show that sparsemax is a bounded and smooth approximation of max and, from this fact, (4) can be interpreted as an approximation of the original Bellman equation. Using this notation, $V^{sp}_\pi$ can be rewritten as

$$
V^{sp}_\pi(s) = \alpha\,\text{spmax}\left(\frac{Q^{sp}_\pi(s,\cdot)}{\alpha}\right).
$$

### III-D Supporting Set of Sparse Optimal Policy

The supporting set $S(s)$ of a sparse MDP is the set of actions with nonzero probabilities, and the cardinality of $S(s)$ can be controlled by the regularization coefficient $\alpha$, while the supporting set of a soft MDP is always the same as the entire action space. In a sparse MDP, actions assigned non-zero probability must satisfy the following inequality:

$$
\alpha + i\,Q^{sp}_\pi(s,a_{(i)}) > \sum_{j=1}^{i} Q^{sp}_\pi(s,a_{(j)}), \tag{8}
$$

where $a_{(i)}$ indicates the action with the $i$th largest action value. From this inequality, it can be shown that $\alpha$ controls the margin between the largest action value and the others included in the supporting set. In other words, as $\alpha$ increases, the cardinality of the supporting set increases since more action values satisfy (8). Conversely, as $\alpha$ decreases, the supporting set shrinks. In the extreme cases, if $\alpha$ goes to zero, only $a_{(1)}$ will be included in $S(s)$, and if $\alpha$ goes to infinity, all actions will be included in $S(s)$. On the other hand, in a soft MDP, the supporting set of a softmax distribution cannot be controlled by the regularization coefficient, even though the sharpness of the softmax distribution can be adjusted. This property gives sparse MDPs an advantage over soft MDPs, since we can assign zero probability to non-optimal actions by controlling $\alpha$.
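The effect of the regularization coefficient on the supporting set can be seen in a small sketch. The function name and the action values below are hypothetical; the membership test is inequality (8) applied to the sorted action values.

```python
def support_size(q_values, alpha):
    """Count the actions that satisfy inequality (8) for a given alpha."""
    q_sorted = sorted(q_values, reverse=True)
    running_sum, k = 0.0, 0
    for i, q_i in enumerate(q_sorted, start=1):
        running_sum += q_i
        if alpha + i * q_i > running_sum:  # inequality (8)
            k = i
    return k

# Hypothetical action values for one state.
q = [2.0, 1.5, 1.0, 0.0]
for alpha in (0.1, 1.0, 2.0, 10.0):
    print(alpha, support_size(q, alpha))
# prints support sizes 1, 2, 3, 4 for alpha = 0.1, 1.0, 2.0, 10.0
```

A small coefficient keeps only the best action, and a large one admits every action.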

### III-E Connection to Tsallis Generalized Entropy

The notion of the Tsallis entropy was introduced by C. Tsallis as a general extension of entropy [12] and the Tsallis entropy has been widely used to describe thermodynamic systems and molecular motions. Surprisingly, the proposed regularization is closely related to a special case of the Tsallis entropy. The Tsallis entropy is defined as follows:

$$
S_{q,k}(p) = \frac{k}{q-1}\left(1 - \sum_i p_i^q\right),
$$

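where $p$ is a probability mass function, $q$ is a parameter called the entropic index, and $k$ is a positive real constant. Note that, if $q \to 1$ and $k = 1$, $S_{1,1}(p)$ is the same as entropy, i.e., $-\sum_i p_i \log(p_i)$. In [18, 11], it is shown that $H(\pi)$ is an extension of $S_{1,1}(\pi(\cdot|s))$ since $H(\pi) = \mathbb{E}_\pi[-\log(\pi(a|s))] = \mathbb{E}_\pi[S_{1,1}(\pi(\cdot|s))]$.

We discover the connection between the Tsallis entropy and the proposed regularization when $q = 2$ and $k = \frac{1}{2}$.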

###### Theorem 2.

The proposed policy regularization $W(\pi)$ is an extension of the Tsallis entropy with parameters $q = 2$ and $k = \frac{1}{2}$ to the version of causal entropy, i.e.,

$$
W(\pi) = \mathbb{E}_\pi\left[S_{2,\frac{1}{2}}(\pi(\cdot|s))\right].
$$

The proof is provided in Appendix A-D.

From this theorem, $W(\pi)$ can be interpreted as an extension of $S_{2,\frac{1}{2}}(p)$ to the case of a causally conditioned distribution, similarly to the causal entropy.
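The per-state identity behind this connection can be written out directly. The following is a short sketch of the key step for a single probability mass function $p$, not a substitute for the proof in the appendix:

```latex
\begin{aligned}
S_{2,\frac{1}{2}}(p)
  &= \frac{1/2}{2-1}\Bigl(1 - \sum_i p_i^2\Bigr)
   = \frac{1}{2}\Bigl(\sum_i p_i - \sum_i p_i^2\Bigr) \\
  &= \sum_i p_i \cdot \frac{1}{2}\,(1 - p_i)
   = \mathbb{E}_{i \sim p}\Bigl[\tfrac{1}{2}\,(1 - p_i)\Bigr],
\end{aligned}
```

where the second equality uses $\sum_i p_i = 1$. Taking $p = \pi(\cdot|s)$ gives the expected value of $\frac{1}{2}(1 - \pi(a|s))$ over actions at state $s$, which is the term accumulated with discounting in the definition of the proposed regularizer.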

## IV Sparse Value Iteration

In this section, we propose an algorithm for solving a causal sparse Tsallis entropy regularized MDP problem. Similar to the original MDP and a soft MDP, the sparse version of value iteration can be induced from the sparse Bellman equation. We first define a sparse Bellman operation $U^{sp} : \mathbb{R}^{|\mathcal{S}|} \to \mathbb{R}^{|\mathcal{S}|}$: for all $s \in \mathcal{S}$,

$$
U^{sp}(x)(s) = \alpha\,\text{spmax}\left(\frac{r(s,\cdot) + \gamma \sum_{s'} x(s') T(s'|s,\cdot)}{\alpha}\right),
$$

where $x$ is a vector in $\mathbb{R}^{|\mathcal{S}|}$, $U^{sp}(x)$ is the resulting vector after applying $U^{sp}$ to $x$, and $U^{sp}(x)(s)$ is the element for state $s$ in $U^{sp}(x)$. Then, the sparse value iteration algorithm can be described simply as

$$
x_{i+1} = U^{sp}(x_i),
$$

where $i$ is the number of iterations. In the following section, we show the convergence and the optimality of the proposed sparse value iteration method.
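A minimal tabular sketch of this iteration is given below. The two-state, two-action MDP, the function names, and the stopping tolerance are all assumptions for illustration. The operator applies the scaled sparsemax to the one-step lookahead values of every state.

```python
def spmax(z):
    """Sparsemax value: 0.5 * sum_{i<=K} (z_(i)^2 - tau^2) + 0.5."""
    z_sorted = sorted(z, reverse=True)
    running_sum, tau, k = 0.0, 0.0, 0
    for i, z_i in enumerate(z_sorted, start=1):
        running_sum += z_i
        if 1.0 + i * z_i > running_sum:
            k, tau = i, (running_sum - 1.0) / i
    return 0.5 * sum(z_i ** 2 - tau ** 2 for z_i in z_sorted[:k]) + 0.5

def sparse_bellman_operator(x, rewards, transitions, gamma, alpha):
    """Apply the sparse Bellman operation once.

    rewards[s][a] is r(s,a); transitions[s][a][s2] is T(s2|s,a).
    """
    new_x = []
    for s in range(len(x)):
        q = [
            rewards[s][a]
            + gamma * sum(p * x[s2] for s2, p in enumerate(transitions[s][a]))
            for a in range(len(rewards[s]))
        ]
        new_x.append(alpha * spmax([q_a / alpha for q_a in q]))
    return new_x

def sparse_value_iteration(rewards, transitions, gamma, alpha,
                           tol=1e-10, max_iter=10000):
    """Iterate the operator from zero until successive iterates agree."""
    x = [0.0] * len(rewards)
    for _ in range(max_iter):
        new_x = sparse_bellman_operator(x, rewards, transitions, gamma, alpha)
        if max(abs(a - b) for a, b in zip(new_x, x)) < tol:
            return new_x
        x = new_x
    return x

# Hypothetical deterministic MDP: action 0 moves to state 0, action 1 to state 1.
rewards = [[1.0, 0.0], [0.0, 0.5]]
transitions = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
values = sparse_value_iteration(rewards, transitions, gamma=0.9, alpha=0.5)
print(values)
```

For this toy problem the unregularized optimal values are 10 and 9, so the printed sparse values should be at least that large.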

### IV-A Optimality of Sparse Value Iteration

In this section, we prove the convergence and optimality of the sparse value iteration method. We first show that $U^{sp}$ has monotonic and discounting properties and, by using those properties, we prove that $U^{sp}$ is a contraction. Then, by the Banach fixed point theorem, repeatedly applying $U^{sp}$ from an arbitrary initial point always converges to the unique fixed point.

###### Lemma 1.

$U^{sp}$ is monotone: for $x, y \in \mathbb{R}^{|\mathcal{S}|}$, if $x \le y$, then $U^{sp}(x) \le U^{sp}(y)$, where $\le$ indicates an element-wise inequality.

###### Lemma 2.

For any constant $c \in \mathbb{R}$, $U^{sp}(x + c\mathbf{1}) = U^{sp}(x) + \gamma c \mathbf{1}$, where $\mathbf{1}$ is a vector of all ones.

The full proofs can be found in Appendix A-E. The proofs of Lemma 1 and Lemma 2 rely on the bounded property of the sparsemax operation. It is possible to prove that the sparse Bellman operator is a contraction using Lemma 1 and Lemma 2 as follows:

###### Lemma 3.

$U^{sp}$ is a $\gamma$-contraction mapping and has a unique fixed point, where $\gamma$ is in $(0,1)$ by definition.

Using Lemma 1, Lemma 2, and Lemma 3, the optimality and convergence of sparse value iteration can be proven.
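The discounting property and the contraction argument can be sketched from the definition of sparsemax. This outline is a reconstruction from the definitions in the text and is not claimed to match the appendix proof step by step. Adding a constant $c$ to every element of $z$ leaves the supporting set unchanged and shifts the threshold, so that

```latex
\begin{aligned}
\tau(z + c\mathbf{1}) &= \tau(z) + c, \\
\text{spmax}(z + c\mathbf{1})
  &= \frac{1}{2}\sum_{i=1}^{K}\Bigl((z_{(i)} + c)^2 - (\tau(z) + c)^2\Bigr) + \frac{1}{2} \\
  &= \text{spmax}(z) + c\sum_{i=1}^{K}\bigl(z_{(i)} - \tau(z)\bigr)
   = \text{spmax}(z) + c,
\end{aligned}
```

since $\sum_{i=1}^{K}(z_{(i)} - \tau(z)) = 1$ by the definition of $\tau(z)$. Because $\sum_{s'} T(s'|s,a) = 1$, shifting $x$ by $c\mathbf{1}$ shifts every lookahead value by $\gamma c$, so the sparse Bellman operation shifts by $\gamma c$. Combining this with monotonicity, for any $x$ and $y$,

```latex
x \le y + \|x - y\|_\infty \mathbf{1}
\;\Longrightarrow\;
U^{sp}(x) \le U^{sp}(y) + \gamma \|x - y\|_\infty \mathbf{1},
```

and the symmetric argument gives $\|U^{sp}(x) - U^{sp}(y)\|_\infty \le \gamma \|x - y\|_\infty$.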

###### Theorem 3.

Sparse value iteration converges to the optimal value of (3).

The proof can be found in Appendix A-E. Theorem 3 is proven using the uniqueness of the fixed point of $U^{sp}$ and the sparse Bellman equation.

## V Performance Error Bounds for Sparse Value Iteration

We prove bounds on the performance gap between the policy obtained by a sparse MDP and the policy obtained by the original MDP, where this performance error is caused by regularization. The boundedness of sparsemax in (7) plays a crucial role in proving the error bounds, since the performance bounds can be derived from the bounds of sparsemax. A similar approach can be applied to prove the error bounds of a soft MDP since a log-sum-exp function is also a bounded approximation of the max operation. A comparison of the log-sum-exp and sparsemax operations is provided in Appendix A-C.

Before explaining the performance error bounds, we introduce two useful propositions which are employed to prove the performance error bounds of a sparse MDP and a soft MDP. We first prove an important fact which shows that the optimal values of sparse value iteration and soft value iteration are greater than that of the original MDP.

###### Lemma 4.

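Let $U$ and $U^{soft}$ be the Bellman operations of an original MDP and a soft MDP, respectively, such that, for state $s$ and $x \in \mathbb{R}^{|\mathcal{S}|}$,

$$
\begin{aligned}
U(x)(s) &= \max_{a'}\left(r(s,a') + \gamma \sum_{s'} x(s') T(s'|s,a')\right) \\
U^{soft}(x)(s) &= \alpha \log \sum_{a'} \exp\left(\frac{r(s,a') + \gamma \sum_{s'} x(s') T(s'|s,a')}{\alpha}\right).
\end{aligned}
$$

Then the following inequalities hold for every integer $n > 0$:

$$
U^n(x) \le (U^{sp})^n(x), \quad U^n(x) \le (U^{soft})^n(x),
$$

where $U^n$ (resp., $(U^{sp})^n$ and $(U^{soft})^n$) is the result of applying $U$ (resp., $U^{sp}$ and $U^{soft}$) $n$ times. In addition, let $x_*$, $x^{sp}_*$, and $x^{soft}_*$ be the fixed points of $U$, $U^{sp}$, and $U^{soft}$, respectively. Then, the following inequalities also hold:

$$
x_* \le x^{sp}_*, \quad x_* \le x^{soft}_*.
$$

The detailed proof is provided in Appendix A-F. Lemma 4 shows that the optimal values $x^{sp}_*$ and $x^{soft}_*$, obtained by sparse value iteration and soft value iteration, are always greater than or equal to the original optimal value $x_*$. Intuitively speaking, this inequality is due to the regularization term, i.e., $W(\pi)$ or $H(\pi)$, added to the objective function.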

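Now, we discuss other useful properties of the proposed causal sparse Tsallis entropy regularization $W(\pi)$ and the causal entropy regularization $H(\pi)$.

###### Lemma 5.

$W(\pi)$ and $H(\pi)$ have the following upper bounds:

$$
W(\pi) \le \frac{1}{1-\gamma} \frac{|\mathcal{A}| - 1}{2|\mathcal{A}|}, \quad H(\pi) \le \frac{\log(|\mathcal{A}|)}{1-\gamma},
$$

where $|\mathcal{A}|$ is the cardinality of the action space $\mathcal{A}$.

The proof is provided in Appendix A-F. Lemma 5 can be obtained by extending the upper bounds of $S_{1,1}$ and $S_{2,\frac{1}{2}}$ to the causal entropy and the causal sparse Tsallis entropy, respectively.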

By using Lemma 4 and Lemma 5, the performance bounds for a sparse MDP and a soft MDP can be derived as follows.

###### Theorem 4.

The following inequalities hold:

$$
\mathbb{E}_{\pi^*}(r(s,a)) - \frac{\alpha}{1-\gamma} \frac{|\mathcal{A}| - 1}{2|\mathcal{A}|} \le \mathbb{E}_{\pi^{sp}}(r(s,a)) \le \mathbb{E}_{\pi^*}(r(s,a)),
$$

where $\pi^*$ and $\pi^{sp}$ are the optimal policies obtained by the original MDP and a sparse MDP, respectively.

###### Theorem 5.

The following inequalities hold:

$$
\mathbb{E}_{\pi^*}(r(s,a)) - \frac{\alpha}{1-\gamma} \log(|\mathcal{A}|) \le \mathbb{E}_{\pi^{soft}}(r(s,a)) \le \mathbb{E}_{\pi^*}(r(s,a)),
$$

where $\pi^*$ and $\pi^{soft}$ are the optimal policies obtained by the original MDP and a soft MDP, respectively.

The proofs of Theorem 4 and Theorem 5 can be found in Appendix A-F. These error bounds show that the expected return of the optimal policy of a sparse MDP always has tighter error bounds than that of a soft MDP. Moreover, the bound for the proposed sparse MDP converges to a constant as the number of actions increases, whereas the error bound of a soft MDP grows logarithmically.

This property has a clear benefit when a sparse MDP is applied to a robotic problem with a continuous action space. To apply an MDP to a continuous action space, a discretization of the action space is essential and a fine discretization is required to obtain a solution which is closer to the underlying continuous optimal policy. Accordingly, the number of actions becomes larger as the level of discretization increases. In this case, a sparse MDP has advantages over a soft MDP in that the performance error of a sparse MDP is bounded by a constant factor as the number of actions increases, whereas performance error of optimal policy of a soft MDP grows logarithmically.
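The different growth rates can be tabulated with a few lines of code. The sketch below scales the upper bounds of Lemma 5 by the regularization coefficient; the function names and the chosen values of the coefficient and discount factor are assumptions for illustration.

```python
import math

def sparse_gap_bound(num_actions, alpha, gamma):
    """alpha / (1 - gamma) * (|A| - 1) / (2 |A|): bounded by alpha / (2 (1 - gamma))."""
    return alpha / (1.0 - gamma) * (num_actions - 1) / (2.0 * num_actions)

def soft_gap_bound(num_actions, alpha, gamma):
    """alpha / (1 - gamma) * log(|A|): grows without bound in |A|."""
    return alpha / (1.0 - gamma) * math.log(num_actions)

# Finer discretization of a continuous action space means more actions.
for n in (2, 10, 100, 1000, 10000):
    print(n, round(sparse_gap_bound(n, 1.0, 0.9), 3), round(soft_gap_bound(n, 1.0, 0.9), 3))
```

The first column of bounds levels off below a fixed constant, while the second keeps increasing with the number of actions.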

## VI Sparse Exploration and Update Rule for Sparse Deep Q-Learning

In this section, we first propose sparse Q-learning and then extend it to sparse deep Q-learning, where a sparsemax policy and the sparse Bellman equation are employed as an exploration method and an update rule, respectively.

Sparse Q-learning is a model-free method for solving the proposed sparse MDP without knowledge of the transition probabilities. In other words, when the transition probability $T$ is unknown but sampling from $T$ is possible, sparse Q-learning estimates an optimal $Q^{sp}$ of the sparse MDP using sampling, just as Q-learning finds an approximate value of an optimal $Q$ of the conventional MDP. Similar to Q-learning, the update equation of sparse Q-learning is derived from the sparse Bellman equation,

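$$
Q^{sp}(s_i,a_i) \leftarrow Q^{sp}(s_i,a_i) + \eta(i)\left[r(s_i,a_i) + \gamma \alpha\,\text{spmax}\left(\frac{Q^{sp}(s_{i+1},\cdot)}{\alpha}\right) - Q^{sp}(s_i,a_i)\right],
$$

where $i$ indicates the number of iterations and $\eta(i)$ is a learning rate. If the learning rate $\eta(i)$ satisfies $\sum_{i=0}^{\infty} \eta(i) = \infty$ and $\sum_{i=0}^{\infty} \eta(i)^2 < \infty$, then, as the number of samples increases to infinity, sparse Q-learning converges to the optimal solution of a sparse MDP. The proof of the convergence and optimality of sparse Q-learning is the same as that of standard Q-learning [20].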
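A tabular version of this update can be sketched in a few lines. The table layout, the function names, and the sample transition below are hypothetical; the only change from standard Q-learning is that the max over next-state action values is replaced by the scaled sparsemax.

```python
def spmax(z):
    """Sparsemax value: 0.5 * sum_{i<=K} (z_(i)^2 - tau^2) + 0.5."""
    z_sorted = sorted(z, reverse=True)
    running_sum, tau, k = 0.0, 0.0, 0
    for i, z_i in enumerate(z_sorted, start=1):
        running_sum += z_i
        if 1.0 + i * z_i > running_sum:
            k, tau = i, (running_sum - 1.0) / i
    return 0.5 * sum(z_i ** 2 - tau ** 2 for z_i in z_sorted[:k]) + 0.5

def sparse_q_update(q_table, state, action, reward, next_state,
                    alpha, gamma, learning_rate):
    """One sparse Q-learning step on a dict mapping state -> list of action values."""
    next_q = q_table[next_state]
    target = reward + gamma * alpha * spmax([q / alpha for q in next_q])
    q_table[state][action] += learning_rate * (target - q_table[state][action])
    return q_table[state][action]

# Hypothetical table and a single sampled transition (s=0, a=1, r=0.5, s'=1).
q_table = {0: [0.0, 0.0], 1: [1.0, 0.0]}
updated = sparse_q_update(q_table, state=0, action=1, reward=0.5, next_state=1,
                          alpha=1.0, gamma=0.9, learning_rate=0.1)
print(updated)
```

In this transition the next state has one clearly dominant action, so the sparsemax value equals the max and the target is 0.5 + 0.9 * 1.0.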

The proposed sparse Q-learning can be easily extended to sparse deep Q-learning using a deep neural network as an estimator of the sparse Q value. In each iteration, sparse deep Q-learning performs a gradient descent step to minimize the squared loss $(y - Q(s,a;\theta))^2$, where $\theta$ is the parameter of the Q network. Here, $y$ is the target value defined as follows:

$$
y = r(s,a) + \gamma \alpha\,\text{spmax}\left(\frac{Q(s',\cdot;\theta)}{\alpha}\right),
$$

where $s'$ is the next state sampled by taking action $a$ at the state $s$ and $\theta$ indicates the network parameters.

Moreover, we employ the sparsemax policy as the exploration strategy where the policy distribution is computed by (4) with action values estimated by a deep Q network. The sparsemax policy excludes the action whose estimated action value is too low to be re-explored, by assigning zero probability mass. The effectiveness of the sparsemax exploration is investigated in Section VII.
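Sparsemax exploration can be sketched as sampling from the sparsemax policy computed from the current action-value estimates. The function names and the toy estimates are assumptions for this example, and a plain list stands in for the output of a Q network.

```python
import random

def sparsemax_policy(q_values, alpha):
    """Sparsemax distribution over actions for the given action values."""
    z = [q / alpha for q in q_values]
    z_sorted = sorted(z, reverse=True)
    running_sum, tau = 0.0, 0.0
    for i, z_i in enumerate(z_sorted, start=1):
        running_sum += z_i
        if 1.0 + i * z_i > running_sum:
            tau = (running_sum - 1.0) / i
    return [max(z_a - tau, 0.0) for z_a in z]

def sample_action(q_values, alpha, rng):
    """Draw one action index according to the sparsemax policy."""
    probs = sparsemax_policy(q_values, alpha)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

# Toy estimates (hypothetical): the last action is excluded from exploration.
rng = random.Random(0)
samples = [sample_action([1.0, 0.5, -1.0], 1.0, rng) for _ in range(1000)]
print(sorted(set(samples)))
```

Actions with zero probability are never drawn, so exploration is concentrated on the actions whose estimated values are within the margin set by the regularization coefficient.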

For stable convergence of a Q network, we utilize double Q-learning [21], where the parameters for obtaining a policy and the parameters for computing the target value are kept separate, and the latter are updated to the former every predetermined number of iterations. In other words, double Q-learning prevents instability of deep Q-learning by slowly updating the target value. Prioritized experience replay [19] is also applied, where the optimization of a network takes the importance of each experience into account. The whole process of sparse deep Q-learning is summarized in Algorithm 1.

## VII Experiments

We first verify Theorem 4, Theorem 5, and the effect of (8) in simulation. For verification of Theorem 4 and Theorem 5, we measure the expected return while increasing the number of actions, $|\mathcal{A}|$. For verification of the effect of (8), the cardinalities of the supporting sets of the optimal policies of sparse and soft MDPs are compared at different values of $\alpha$.

To investigate the effectiveness of the proposed method, we test sparsemax exploration and the sparse Bellman update rule on reinforcement learning with a continuous action space. To apply Q-learning to a continuous action space, a fine discretization is necessary to obtain a solution which is closer to the original continuous optimal policy. As the level of discretization increases, the number of actions to be explored becomes larger. In this regard, an efficient exploration method is required to obtain high performance. We compare our method to other exploration methods with respect to the convergence speed and the expected sum of rewards. We further check the effect of the update rule.

### VII-A Experiments on Performance Bounds and Supporting Set

To verify our theorems about performance error bounds, we create a transition model by discretizing unicycle dynamics defined in a continuous state and action space, and solve the original MDP, a soft MDP, and a sparse MDP under predefined rewards while increasing the discretization level of the action space. The reward function is defined as a linear combination of two squared exponential functions of the unicycle's location, one centered at a goal point and the other at a point to avoid, each with its own scale parameter. The reward function is designed to let an agent navigate towards the goal point while avoiding the other point. The absolute value of the difference between the expected return of the original MDP and that of the sparse MDP (or soft MDP) is measured. As shown in Figure 2(a), the performance gap of the sparse MDP converges to a constant bound while the performance gap of the soft MDP grows logarithmically. Note that the performance gaps of the sparse MDP and soft MDP are always smaller than their error bounds. Supporting set experiments are conducted using the discretized unicycle dynamics. The cardinality of the supporting set of the optimal policies is measured while $\alpha$ is varied. In Figure 2(b), the ratio of the supporting set to the action space is compared for a soft MDP and a sparse MDP over this range of $\alpha$, demonstrating the sparseness of the proposed sparse MDPs compared to soft MDPs.

### VII-B Reinforcement Learning in a Continuous Action Space

We test our method in MuJoCo [22], a physics-based simulator, using two problems with a continuous action space: Inverted Pendulum and Reacher. The action space is discretized to apply Q-learning to a continuous action space and experiments are conducted with four different discretization levels to validate the effectiveness of sparsemax exploration and the sparse Bellman update rule.

We compare the sparsemax exploration method to the $\epsilon$-greedy method and softmax exploration [10], and further compare the sparse Bellman update rule to the original Bellman update rule [20] and the soft Bellman update rule [11]. In addition, three different regularization coefficient settings are tested. In total, we test 27 variants of deep Q-learning by combining three exploration methods, three update rules, and three different regularization coefficients. The deep deterministic policy gradient (DDPG) method [15], which operates in a continuous action space without discretization of the action space, is also compared (to test DDPG, we used the code from OpenAI available at https://github.com/openai/baselines). Hence, a total of 28 algorithms are tested.

Results are shown in Figure 3 and Figure 4 for inverted pendulum and reacher, respectively, where only the top five algorithms are plotted and each point in a graph is obtained by averaging the values from three independent runs with different random seeds. Results of all algorithms are provided in Appendix B. A Q network with two 512-dimensional hidden layers is used for the inverted pendulum problem and a network with four 256-dimensional hidden layers is used for the reacher problem. Each Q-learning algorithm utilizes the same network topology. For inverted pendulum, since the problem is easier than the reacher problem, most of the top five algorithms converge to the maximum return at each discretization level, as shown in Figure 3(a). Four of the top five algorithms utilize the proposed sparsemax exploration. Only one of the top five methods utilizes softmax exploration. In Figure 3(b), the number of episodes required to reach a near optimal return, 980, is shown. Sparsemax exploration requires fewer episodes to obtain a near optimal value than $\epsilon$-greedy and softmax exploration.

For the reacher problem, the algorithms with sparsemax exploration slightly outperform the $\epsilon$-greedy methods, and softmax exploration does not appear among the top five, as shown in Figure 4(a). In terms of the number of required episodes, sparsemax exploration outperforms the $\epsilon$-greedy methods, as shown in Figure 4(b), where the number of episodes is measured against a fixed threshold return. DDPG shows poor performance in both problems since the number of sampled episodes is insufficient. In this regard, deep Q-learning with sparsemax exploration outperforms DDPG with fewer episodes. These experiments indicate that the sparsemax exploration method has an advantage over softmax exploration, the $\epsilon$-greedy method, and DDPG with respect to the number of episodes required to reach the optimal performance.

## VIII Conclusion

In this paper, we have proposed a new MDP with novel causal sparse Tsallis entropy regularization which induces a sparse and multi-modal optimal policy distribution. In addition, we have provided the full mathematical analysis of the proposed sparse MDPs: the optimality condition of sparse MDPs given as the sparse Bellman equation, sparse value iteration and its convergence and optimality properties, and the performance bounds between the proposed MDP and the original MDP. We have also proven that the performance gap of a sparse MDP is strictly smaller than that of a soft MDP. In experiments, we have verified that the theoretical performance gaps of a sparse MDP and a soft MDP from the original MDP are correct. We have applied the sparsemax policy and the sparse Bellman equation to deep Q-learning as the exploration strategy and update rule, respectively, and shown that the proposed exploration method performs significantly better than $\epsilon$-greedy, softmax exploration, and DDPG as the number of actions increases. From the analysis and experiments, we have demonstrated that the proposed sparse MDP can be an efficient alternative for problems with a large number of possible actions and even a continuous action space.

## Appendix A

### A-A Sparse Bellman Equation from Karush-Kuhn-Tucker conditions

The following proof explains the optimality condition of the sparse MDP from Karush-Kuhn-Tucker (KKT) conditions.

###### Proof of Theorem 1.

The KKT conditions of (3) are as follows:

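$$
\begin{aligned}
\forall s,a \quad & \sum_{a'} \pi(a'|s) - 1 = 0, \quad -\pi(a|s) \le 0 && (9) \\
\forall s,a \quad & \lambda_{sa} \ge 0 && (10) \\
\forall s,a \quad & \lambda_{sa}\,\pi(a|s) = 0 && (11) \\
\forall s,a \quad & \frac{\partial L(\pi,c,\lambda)}{\partial \pi(a|s)} = 0 && (12)
\end{aligned}
$$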

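where $c$ and $\lambda$ are Lagrangian multipliers for the equality and inequality constraints, respectively, and (9) is the feasibility of primal variables, (10) is the feasibility of dual variables, (11) is the complementary slackness, and (12) is the stationarity condition. The Lagrangian function of (3) is written as follows:

$$
L(\pi,c,\lambda) = -J^{sp}_\pi + \sum_s c_s\left(\sum_{a'} \pi(a'|s) - 1\right) - \sum_{s,a} \lambda_{sa}\,\pi(a|s),
$$

where the maximization of (3) is changed into a minimization problem, i.e., $\min_\pi -J^{sp}_\pi$. First, the derivative of $J^{sp}_\pi$ can be obtained by using the chain rule.

$$
\begin{aligned}
\frac{\partial J^{sp}_\pi}{\partial \pi(a|s)}
&= d^\intercal G_\pi^{-1} \frac{\partial r^{sp}_\pi}{\partial \pi(a|s)} + \gamma\, d^\intercal G_\pi^{-1} \frac{\partial T_\pi}{\partial \pi(a|s)} G_\pi^{-1} r^{sp}_\pi \\
&= \rho_\pi^\intercal \frac{\partial r^{sp}_\pi}{\partial \pi(a|s)} + \gamma\, \rho_\pi^\intercal \frac{\partial T_\pi}{\partial \pi(a|s)} V^{sp}_\pi \\
&= \rho_\pi(s)\left(r(s,a) + \frac{\alpha}{2} - \alpha \pi(a|s) + \gamma \sum_{s'} V^{sp}_\pi(s') T(s'|s,a)\right) \\
&= \rho_\pi(s)\left(Q^{sp}_\pi(s,a) + \frac{\alpha}{2} - \alpha \pi(a|s)\right).
\end{aligned}
$$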

Here, the partial derivative of Lagrangian is obtained as follows:

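$$
\frac{\partial L(\pi,c,\lambda)}{\partial \pi(a|s)} = -\rho_\pi(s)\left(Q^{sp}_\pi(s,a) + \frac{\alpha}{2} - \alpha \pi(a|s)\right) + c_s - \lambda_{sa} = 0.
$$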

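First, consider a positive $\pi(a|s)$, for which the corresponding Lagrangian multiplier $\lambda_{sa}$ is zero due to the complementary slackness. By summing the stationarity condition with respect to action $a$ over the actions with positive probability and using $\sum_{a'} \pi(a'|s) = 1$, the Lagrangian multiplier $c_s$ can be obtained as follows:

$$
c_s = \rho_\pi(s)\left[\frac{\sum_{\pi(a'|s)>0} Q^{sp}_\pi(s,a')}{K} + \frac{\alpha}{2} - \frac{\alpha}{K}\right],
$$

where $K$ is the number of positive elements of $\pi(\cdot|s)$. By replacing $c_s$ with this result, the optimal policy distribution is induced as follows.

$$
\pi(a|s) = \frac{Q^{sp}_\pi(s,a)}{\alpha} - \frac{\sum_{\pi(a'|s)>0} \frac{Q^{sp}_\pi(s,a')}{\alpha} - 1}{K}.
$$

This equation is derived under the assumption that $\pi(a|s)$ is positive. For $\pi(a|s) > 0$, the following condition is necessarily fulfilled,

$$
\frac{Q^{sp}_\pi(s,a)}{\alpha} > \frac{\sum_{\pi(a'|s)>0} \frac{Q^{sp}_\pi(s,a')}{\alpha} - 1}{K}.
$$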

We notate this supporting set as