## Introduction

Reinforcement learning (RL), a framework for learning and control in which agents search for proper actions in an environment through trial and error, has witnessed rapid development in recent years, as evidenced by the super-human performances of deep Q-networks (DQN) [1] in video game playing and AlphaGo [2] in the game of Go. Moreover, the application range of RL extends not only to more complicated tasks on computers but also to the control of robots [3] and unmanned aerial vehicles (UAVs) [4] in the real world.

As RL algorithms are being applied to increasingly complicated and realistic tasks, the limits of sensors, processors, and actuators of agents are posing serious obstacles for conventional optimization algorithms. Simon proposed the notion of bounded rationality as the principle underlying agents’ behavior under resource limits [5]. A bounded rational agent may appear to behave irrationally, but by considering the limits and constraints, the agent’s behavior can be understood as rational. Bounded rationality has attracted considerable attention in recent years. Computational rationality [6], which has been claimed to integrate the three fields of neuroscience (brain), cognitive science (mind), and artificial intelligence (machine) [7], is an updated form of bounded rationality. Further, it has been proposed that abstraction and hierarchy, which have been considered to enable flexible and efficient cognition of humans [8], result from the above-mentioned limitations and are bounded rational [9].

The representative decision-making policy in the theory of bounded rationality is satisficing [10, 11]. Satisficing agents do not keep searching for the optimal action; instead, they stop searching when an action whose quality is above a certain level (aspiration) is found. The satisficing strategy has not attracted much attention in reinforcement learning, except for a few studies [12, 13] (to be discussed later). In previous studies [14, 15], one of the authors proposed a simple satisficing value function called risk-sensitive satisficing (RS) and empirically validated its effectiveness through numerical simulations of reinforcement learning tasks.

In this paper, we apply RS to the $K$-armed bandit problems, which constitute the most basic class of reinforcement learning tasks, and prove two propositions. First, we prove that RS is guaranteed to find a satisfactory action: if the agent chooses an action in each trial and the number of trials is sufficient, the agent can stably choose an action whose value is above the aspiration level. Second, we prove the finiteness of the regret of RS. In general, the performance of algorithms in the $K$-armed bandit problems is measured by how small their regret (expected loss) is. It is known that the regret increases at least logarithmically with the number of trials [16]. Therefore, the regret grows without bound as the trials are repeated. However, we prove that if a small amount of information on the reward distributions is available so that the aspiration level is set to an “optimal level” (hence, satisficing entails optimizing), then the regret of RS is upper bounded by a finite value. We confirm these results by numerical simulations and compare the performance of RS with that of other representative algorithms for the $K$-armed bandit problems. Finally, we conclude the paper with a discussion on the possible applications of RS and the theoretical significance of this work.

## Methods

### $K$-armed Bandit Problems

The $K$-armed bandit problems that we deal with in this paper are as follows. Let there be $K$ actions $a_1, \dots, a_K$ that lead to a reward of 1 or 0 according to the reward probabilities $p_1, \dots, p_K$, which are unknown to the agent. If the agent chooses action $a_i$, it acquires a reward of 1 with probability $p_i$ or a reward of 0 with probability $1 - p_i$. The goal of the repeated choices is to maximize the expected accumulated reward, which is measured by minimizing the regret (the expected cumulative loss). Let $a_{i^*}$ denote the action with the maximal reward probability (i.e., $p_{i^*} = \max_i p_i$). The regret when the $N$-th step (one step means one trial) ends is defined as follows.

$$\mathrm{regret}(N) = \sum_{i=1}^{K} \left(p_{i^*} - p_i\right) \mathbb{E}\left[n_i(N)\right] \tag{1}$$

where $n_i(N)$ is the number of times action $a_i$ is chosen from the first to the $N$-th step (simply written as $n_i$ when the number of steps is not explicitly indicated) and $\mathbb{E}[\cdot]$ is the expectation. Regret represents the expected loss, i.e., how inferior the cumulative expected reward from the actually chosen actions is to the cumulative expected reward obtained when the optimal action is chosen from the first step onward. The smaller the regret, the better the performance of the algorithm. The minimum value of the regret is zero, attained when the optimal action has been chosen in all the steps. It has been proven that the regret increases at least on the order of $\log N$ with the number of steps $N$ [16].
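The regret described above can be illustrated with a minimal sketch. It assumes the form "sum over actions of (best reward probability minus the action's reward probability) times the expected number of times the action is chosen"; the function name `expected_regret` and the example numbers are hypothetical.

```python
from typing import Sequence


def expected_regret(probs: Sequence[float], counts: Sequence[float]) -> float:
    """Regret after sum(counts) steps.

    probs[i]  : reward probability of action i (known only to the evaluator)
    counts[i] : expected number of times action i was chosen
    """
    p_best = max(probs)
    # Each choice of action i loses (p_best - probs[i]) in expectation.
    return sum((p_best - p) * n for p, n in zip(probs, counts))


# Hypothetical example: 100 steps, mostly spent on the best action.
probs = [0.7, 0.5, 0.2]
counts = [80, 15, 5]
print(expected_regret(probs, counts))  # roughly 0.2 * 15 + 0.5 * 5 = 5.5
```

Choosing only the best action gives a regret of zero, which matches the minimum stated in the text.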

As for action selection by the agent, the basic policy is to take the action with the highest value (the greedy method). The basic valuation of action $a_i$ is based on its mean reward:

$$E_i = \frac{m_i}{n_i} \tag{2}$$

where $m_i$ is the number of times $a_i$ is chosen and the reward is acquired. $n_i$, i.e., the number of times the action $a_i$ is chosen, satisfies $n_i \ge 0$ and $\sum_{i=1}^{K} n_i = N$. Under the greedy method with the mean reward valuation, if there is a non-optimal action that has a high value in early trials, there is a risk of it being chosen all along. Each of the other actions must be tried an appropriate number of times so that the optimal action is found in a timely manner. Merely choosing the action with the highest value based on the accumulated knowledge (exploitation) does not suffice, and various actions must be tried (exploration). Various algorithms have been proposed to balance exploitation and exploration.
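As a rough illustration of the greedy method with the mean reward valuation, the sketch below computes the mean reward as (number of rewarded choices) / (number of choices) and picks the action with the highest value. The names `mean_rewards` and `greedy_action`, the value 0.0 for untried actions, and the tie-breaking rule are assumptions made for this example.

```python
from typing import List


def mean_rewards(m: List[int], n: List[int]) -> List[float]:
    """Mean reward per action; untried actions get 0.0 (an assumption)."""
    return [mi / ni if ni > 0 else 0.0 for mi, ni in zip(m, n)]


def greedy_action(values: List[float]) -> int:
    """Index of the highest value; ties go to the lowest index (an assumption)."""
    return max(range(len(values)), key=lambda i: values[i])


m = [3, 1, 0]  # number of rewards obtained per action
n = [4, 4, 0]  # number of times each action was chosen
E = mean_rewards(m, n)
print(E, greedy_action(E))
```

In this example the third action is never tried, so a purely greedy agent has no way to discover whether it is better, which is the exploration problem described above.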

### Models of Satisficing

We introduce two models of satisficing at the levels of policy and value function. The policy model follows the standard description of satisficing. The second model is the risk-sensitive value function that we analyze and test in this paper. The former is tested through simulations for comparison with the latter.

#### Policy Satisficing (PS) Model

A standard definition of satisficing is to keep exploring until an action whose value is above the aspiration level is found and to then stop searching and keep choosing that action (exploit). Satisficing, unlike optimization, can reduce the search cost because it does not involve searching through all actions and deciding on the optimal one. This is formulated as a policy (of reinforcement learning) as follows. If there exists at least one action whose mean reward is above the aspiration level $\aleph$, exploitation (following the greedy method) is executed. Otherwise, when the mean reward of all the actions is below the aspiration level $\aleph$, an action is randomly chosen. We refer to this algorithm as policy satisficing (PS).
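The policy just described might be sketched as follows. The function name `ps_action`, the treatment of a mean reward exactly equal to the aspiration level as unsatisfactory, and the injectable random generator are assumptions for illustration.

```python
import random
from typing import List, Optional


def ps_action(E: List[float], aspiration: float,
              rng: Optional[random.Random] = None) -> int:
    """Policy satisficing: exploit if satisfied, otherwise explore at random."""
    rng = rng or random.Random()
    if max(E) > aspiration:
        # At least one action is satisfactory: follow the greedy method.
        return max(range(len(E)), key=lambda i: E[i])
    # No action is satisfactory: choose uniformly at random.
    return rng.randrange(len(E))


print(ps_action([0.2, 0.8, 0.6], aspiration=0.5))               # greedy choice
print(ps_action([0.1, 0.2, 0.3], aspiration=0.5, rng=random.Random(0)))
```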

#### Risk-sensitive Satisficing (RS) Value Function

One of the authors has proposed a value function called risk-sensitive satisficing (RS) that realizes satisficing action selection behavior when operated under the greedy policy [14, 15] (see Supplementary Information for its relationship with other models). Before introducing the model, we first define the difference between the mean reward $E_i$ of action $a_i$ and the aspiration level $\aleph$:

$$\delta_i = E_i - \aleph \tag{3}$$

If there exists a positive $\delta_i$, then the agent will choose such $a_i$ and be satisfied; otherwise, it will be unsatisfied. RS is defined as follows [14]:

$$RS_i = \frac{n_i}{N}\,\delta_i \tag{4}$$

This value is used under the greedy policy: the agent chooses the action with the maximal RS value.
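A minimal sketch of this value function under the greedy policy is given below. It assumes that the RS value of an action is (n_i / N) times (mean reward minus aspiration level), with n_i the number of times the action was chosen and N the total number of steps; the function name `rs_values` and the value 0.0 for untried actions are assumptions for illustration.

```python
from typing import List


def rs_values(m: List[int], n: List[int], aspiration: float) -> List[float]:
    """RS value per action, assuming RS_i = (n_i / N) * (m_i / n_i - aspiration)."""
    N = sum(n)
    values = []
    for mi, ni in zip(m, n):
        if ni == 0:
            values.append(0.0)  # untried action: neutral value (an assumption)
        else:
            values.append((ni / N) * (mi / ni - aspiration))
    return values


m = [6, 1]   # rewards obtained per action
n = [10, 4]  # times each action was chosen
values = rs_values(m, n, aspiration=0.5)
chosen = max(range(len(values)), key=lambda i: values[i])  # greedy policy
print(values, chosen)
```

Here the first action has a mean reward of 0.6, above the aspiration level 0.5, so its RS value is positive and it is selected.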

RS integrates two risk-sensitive satisficing behaviors. When unsatisfied, RS is risk-seeking, leading to optimistic exploration. If $\delta_i < 0$ for all $i$, then actions with smaller $n_i$ are prioritized. Let there be two unsatisfactory actions $a_i$ and $a_j$ with $\delta_i = \delta_j < 0$ and $n_i < n_j$. Then, $RS_i > RS_j$; hence, $a_i$ is chosen. This preference for a less tried action can be interpreted as the optimistic expectation of the action’s actual reward probability being set above $E_i$. There might be some $p_i > \aleph$; however, thus far, $E_i < \aleph$ for all the actions. In terms of looking for a satisfactory action, it is rational to try actions with smaller $n_i$. This accords with the motto “optimism in the face of uncertainty,” which is considered a general and rational exploration strategy in reinforcement learning [17]. The UCB model described later implements this idea [18].

When satisfied, RS is risk-averse, performing pessimistic exploitation. If there is only one $a_i$ for which $\delta_i$ is positive, the agent will keep choosing it. If there are multiple actions with positive $\delta_i$, then the actions with larger $n_i$ are prioritized. Let there be two satisfactory actions $a_i$ and $a_j$ with $\delta_i = \delta_j > 0$ and $n_i < n_j$, mirroring the example above. Then, $RS_j > RS_i$; hence, $a_j$ is chosen. In this case, a more tried action is preferred. This can be interpreted as the pessimistic expectation of the action’s actual reward probability being set below $E_i$. It is possible that $a_i$ is a spuriously satisfactory action with $p_i < \aleph$; however, $E_i > \aleph$. In terms of looking for a truly satisfactory action and avoiding spuriously satisfactory ones, it is rational to try actions with $\delta_i > 0$ for a larger $n_i$.
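The two risk attitudes can be checked numerically with a small sketch. It assumes the RS value has the form (n_i / N) times (mean reward minus aspiration level); the helper `rs` and all numbers are hypothetical and are not taken from the original examples.

```python
def rs(m_i: int, n_i: int, N: int, aspiration: float) -> float:
    """Assumed RS value: (n_i / N) * (m_i / n_i - aspiration)."""
    return (n_i / N) * (m_i / n_i - aspiration)


aspiration = 0.5
N = 15  # total steps: 5 on the first action, 10 on the second

# Unsatisfied case: both mean rewards are 0.4 (below the aspiration level).
less_tried = rs(2, 5, N, aspiration)
more_tried = rs(4, 10, N, aspiration)
print(less_tried > more_tried)  # the less tried action is preferred

# Satisfied case: both mean rewards are 0.6 (above the aspiration level).
less_tried = rs(3, 5, N, aspiration)
more_tried = rs(6, 10, N, aspiration)
print(more_tried > less_tried)  # the more tried action is preferred
```

With equal mean rewards, the trial count alone decides the preference, and its direction flips with the sign of the difference from the aspiration level.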

#### Setting of the Aspiration Level

The aspiration level $\aleph$ defines the boundary between satisfactory and unsatisfactory, analogous to the break-even point between gain and loss or the neutral reference outcome in prospect theory [19]. It can be set according to the agent’s internal needs or its knowledge of the environment. As an ecological example, let the agent be an animal, and let the rewards 1 and 0 represent the presence and absence of food. If the action is to look for food at a feeding ground chosen from among multiple grounds and the agent has to obtain food around once every two days for survival, then $\aleph$ would be 0.5 or higher.

Optimization can be viewed as a special case of satisficing. If $\aleph$ lies between the two reward probabilities of the optimal and second-optimal actions, then satisficing above $\aleph$ means optimizing. Let us call such $\aleph$ “an optimal aspiration level”. Let the highest reward probability be $p_{\mathrm{first}}$ and the second-highest one be $p_{\mathrm{second}}$. $\aleph$ can be set optimally as follows:

$$\aleph = \frac{p_{\mathrm{first}} + p_{\mathrm{second}}}{2} \tag{5}$$

It is known that the regret increases at least on the order of $\log N$ with the number of steps $N$ [16]. This is the result of assuming that the agent has no knowledge of the reward distribution. By relaxing this assumption and allowing $\aleph$ to be set as in Eq. 5, it will be shown that the regret is upper bounded by a finite value as in Proposition 2 described later.

Note that having an optimal aspiration level does not make a $K$-armed bandit problem trivial. Even if we know a point between the reward probabilities of the optimal and second-optimal actions, we do not know which action is optimal. Efficient identification of such an action is not trivial. In the next section, RS will be compared in terms of its performance with other algorithms, one of which needs some similar information on the reward distribution to be optimal.
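For simulation purposes, an optimal aspiration level might be computed as in the sketch below. It assumes the midpoint between the highest and second-highest reward probabilities, and it uses the true probabilities, which the agent itself does not know; the function name `optimal_aspiration` is hypothetical.

```python
from typing import Sequence


def optimal_aspiration(probs: Sequence[float]) -> float:
    """Midpoint between the highest and second-highest reward probabilities."""
    if len(probs) < 2:
        raise ValueError("need at least two actions")
    p_first, p_second = sorted(probs, reverse=True)[:2]
    if p_first == p_second:
        raise ValueError("the optimal action is not unique")
    return (p_first + p_second) / 2


probs = [0.2, 0.7, 0.5]
aleph = optimal_aspiration(probs)
print(aleph)  # lies strictly between 0.5 and 0.7
```

Any other point strictly between the two probabilities would serve the same purpose; the midpoint is only one convenient choice.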

### Data availability

All data are generated by numerical simulations and they have all been reported in the paper.

## Results

### Analysis

We perform a theoretical analysis of the basic satisficing and optimizing properties of RS. First, in Proposition 1, we prove that RS can stably choose actions above the aspiration level after a sufficient number of steps. Second, in Proposition 2, we prove that the regret of RS is upper bounded when an optimal aspiration level is given and satisficing becomes optimizing.

#### Guarantee of Satisficing

In the proof of Proposition 1, we adopt symbols clearly indicating the step number () and the chosen action () as follows. Both of the following represent values after steps: the mean reward

(6) |

and the value

(7) |

###### Proposition 1 (Theoretical Guarantee of Satisficing).

Let be the reward probability of action . Let be the set of actions whose reward probability is not smaller than the aspiration level , and let be the set of actions whose reward probability is smaller than . Let , and , , where is supposed to be a non-empty set. Then, the following holds for .

After a sufficient number of steps, a satisfactory action with will be always chosen, and this state is stable.

In other words, by letting be the probability that event will occur,

(8) |

Subsequently, by , we denote the set of steps in which action is chosen. Let be the number of elements in set . First, we prove two claims.

###### Claim A.

(9) |

###### Proof.

(Claim A) () Suppose that and . If , is constant for greater than or equal to some number. This is a contradiction; hence, we have . () Suppose that and

. By the law of large numbers, for any positive number

, there exists some such that we have for any integer greater than . Now, if , we have(10) |

As , we have ; hence, . Therefore, . Since is arbitrary, we obtain . ∎

###### Claim B.

(11) |

###### Proof.

(Claim B) We assume that for any , . Then, for any , is constant for any greater than or equal to some number. Furthermore, for some , we have . Hence, by Claim A, we have

(12) |

However, the following statements contradict each other: (i) , (ii) , for any greater than or equal to some number. Hence, we obtain

(13) |

Now, the following formula holds.

(14) |

Therefore, we must have . ∎

###### Proposition 1 (again).

(8) |

###### Proof.

(Proposition1) By Claim B, we have , . By the law of large numbers, for any positive number , there exists some such that we have for any integer greater than . Now, if , we have

(15) |

Hence, we have . Since is arbitrary, we obtain .

Here, we assume that there exists such that . Then, we may have by Claim A. On the other hand, follows from because for any sufficiently large . However, and contradict each other, which means that the initial assumption must be false. Hence, for any , holds. Therefore, the results obtained are summarized as , and , . From these results, the following follows immediately. . ∎

#### Theoretical Analysis of Regret

We prove that the regret of RS is upper bounded by a finite value when the aspiration level $\aleph$ is set to the optimal aspiration level.

###### Proposition 2 (Finiteness of Regret of RS).

Let the highest reward probability of all the actions be $p_{\mathrm{first}}$ and the second-highest reward probability be $p_{\mathrm{second}}$. Further, we set $\aleph$ as $(p_{\mathrm{first}} + p_{\mathrm{second}})/2$ (an optimal aspiration level). Then, the following holds for RS:

“There exists a monotonically increasing function for step number such that . Then, , where is constant. Thus, ”.

We conceived the following proof by referring to the papers [20, 21, 22] on the TOW (tug-of-war) dynamics model (hereinafter simply referred to as TOW). TOW is similar to RS (see Supplementary Information for the similarities and differences between RS and TOW). However, in those papers, the analysis of the finiteness of the regret of TOW was strictly limited to cases in which there are only two actions and the variances of the rewards are equal. In the case of the bandit problems with the reward following the Bernoulli distributions, equal variance implies $p_1 = p_2$ or $p_1 = 1 - p_2$. (Let $\sigma_i^2 = p_i(1 - p_i)$ be the variance of the reward of action $a_i$; then $\sigma_1^2 = \sigma_2^2$ holds only if $p_1 = p_2$ or $p_1 + p_2 = 1$.) Thus, equal variance is a strong assumption. Here, we generalize the proof to show finite regret with $K$ arms ($K \ge 2$) and without assuming equal variance.

###### Proof.

(Proposition2) Suppose that . Let . The expectation and the variance of are and , respectively, where .

Note that

(16) |

holds, where , indicating the reward when action was chosen in the -th time. Let . Then,

(17) | ||||

(18) |

Since ,

(19) |

By Proposition 1, if the step number is sufficiently large, then with probability 1. Hence,

(20) | ||||

(21) |

By Eq. (16) and the central limit theorem, follows the normal distribution with expectation and variance . The probability that is . Here, $Q(\cdot)$ is the $Q$-function, which represents the tail distribution function of the standard normal distribution. Thus, . Let be the probability that action is chosen in the -th step. Then, is given by

(22) | ||||

(23) |

where we set .

By using the Chernoff bound , we evaluate the upper bound of the regret.

(24) | ||||

(25) |

Therefore,

(26) | ||||

(27) |

This concludes the proof.

∎

#### Empirical Verification

We verify the proven properties through simulations. As in Proposition 2, , where . All the results below are the averaged results of 1,000 simulations. As an additional performance index, we consider accuracy, which is the proportion of the simulations in which the algorithm chose the optimal action in each step. Thus, the accuracy in the -th step is as follows.

accuracy = (Number of times action with the highest reward probability is chosen in the -th step) / (Total number of simulations).
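The accuracy index can be computed from simulation logs as in the following sketch. The data layout (one list of chosen action indices per simulation run) and the function name `accuracy_per_step` are assumptions for illustration.

```python
from typing import List


def accuracy_per_step(choices: List[List[int]], optimal: int) -> List[float]:
    """Fraction of runs that chose the optimal action at each step.

    choices[r][t] is the action index chosen in run r at step t.
    """
    n_runs = len(choices)
    n_steps = len(choices[0])
    return [
        sum(run[t] == optimal for run in choices) / n_runs
        for t in range(n_steps)
    ]


# Hypothetical log: 4 runs, 3 steps, optimal action index 0.
choices = [
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
]
print(accuracy_per_step(choices, optimal=0))  # [0.5, 0.5, 1.0]
```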

First, we test whether the difference in reward probabilities can be detected, even if the difference is small, when the optimal aspiration level is set for . We test it with where . The result is shown in Fig. 1. The dotted line at the top in Fig. 1 (b) represents the upper bound of the regret shown by Proposition 2. We see that the accuracy nearly reaches 1 after steps, even if the difference is only 0.002 as in . Moreover, we see that the regret does not exceed the upper bound (Eq. (27)) calculated by Proposition 2.

Next, we conduct simulations to confirm the propositions with . The reward probability of each action is generated uniformly randomly from . The result is shown in Fig. 2. We can see that the accuracy converges to 1 and the regret does not exceed the upper bound (Eq. (27)) calculated by Proposition 2. Here, the calculated upper bound of the regret for is considerably higher than the actual regret compared with the case of . As we evaluate the probability of choosing action only by comparing with action having the highest reward probability as shown in Eq. (22) in the proof of Proposition 2, the probability of choosing is increasingly overestimated as the number of actions increases.

### Comparison with Other Algorithms

Here, we clarify the performance and properties of RS by comparing it with some representative algorithms for the $K$-armed bandit problems, namely UCB1-Tuned and $\varepsilon_n$-greedy [18].

#### UCB1-Tuned

Upper confidence bound (UCB) is an algorithm based on the idea that the value of relatively less tried (more uncertain) actions is potentially high, similar to RS’s risk-seeking evaluation when unsatisfied [18]. The regret of UCB is guaranteed to increase in the logarithmic order, which is the theoretical limit [16]. We include the result of UCB1-Tuned (hereinafter referred to as UCB1T), which shows better performance compared to UCB1.

$$\mathrm{UCB1T}_i = E_i + \sqrt{\frac{\ln N}{n_i}\,\min\left\{\frac{1}{4},\; V_i(n_i)\right\}} \tag{28}$$

Here, $V_i(n_i) = \sigma_i^2 + \sqrt{\frac{2 \ln N}{n_i}}$, and $\sigma_i^2$ is the variance of the reward from choosing action $a_i$. Further, 1/4 is the upper bound of the variance of a Bernoulli random variable. In the algorithm, the action with the highest UCB1T value is chosen (the greedy method). The first term $E_i$ of UCB1T, which is the mean reward, represents the already acquired knowledge (and its exploitation), whereas the second term, which decreases as action $a_i$ is tried more, expresses the (un-)reliability of $E_i$ (which leads to exploration). When $n_i = 0$, the second term cannot be calculated, but in the first $K$ steps, each action is chosen once so that the value of the second term for all the actions is subsequently finite.

To set the level $\aleph$ such that satisficing implies optimization, it is necessary to have some point in the interval between the highest and second-highest reward probabilities, which is usually unknown to the agent. Thus, having such an “optimal” $\aleph$ is a type of “cheating”. However, when such information is available, it should be utilized well, and RS does so.
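The UCB1-Tuned index discussed in this section can be sketched as below, following the usual formulation of Auer et al. [18]: mean reward plus the square root of (ln N / n_i) times min(1/4, V_i), where V_i is the empirical variance plus sqrt(2 ln N / n_i). The function name `ucb1_tuned` and the use of infinity for untried actions (which forces each action to be tried once) are assumptions for illustration.

```python
import math


def ucb1_tuned(mean: float, var: float, n_i: int, N: int) -> float:
    """UCB1-Tuned index of one action after N >= 1 total steps."""
    if n_i == 0:
        return math.inf  # untried actions are chosen first (an assumption)
    v = var + math.sqrt(2.0 * math.log(N) / n_i)
    return mean + math.sqrt(math.log(N) / n_i * min(0.25, v))


# For 0/1 rewards the empirical variance is mean * (1 - mean).
mean = 0.5
var = mean * (1.0 - mean)
print(ucb1_tuned(mean, var, n_i=5, N=100))   # less tried: larger bonus
print(ucb1_tuned(mean, var, n_i=50, N=100))  # more tried: smaller bonus
```

With equal mean rewards, the less tried action receives the higher index, which is the optimism under uncertainty that the text attributes to UCB.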

Furthermore, there is another algorithm, namely $\varepsilon_n$-greedy [18], which requires similar information for optimal performance. In this algorithm, the probability of random action selection, $\varepsilon_n$, is gradually reduced by annealing so that the regret of $\varepsilon_n$-greedy is guaranteed to be of the logarithmic order. It starts with maximal exploration (random action selection) and then gradually shifts to more exploitation as information about the environment accumulates. In $\varepsilon_n$-greedy, there are two parameters $c$ and $d$ that are set as $c > 0$ and $0 < d < 1$. When there are $K$ arms, the stepwise decreasing sequence $\varepsilon_n$ is defined as follows:

$$\varepsilon_n = \min\left\{1,\; \frac{cK}{d^2 n}\right\} \tag{29}$$

The agent chooses the action with the highest mean reward with probability $1 - \varepsilon_n$, and it chooses a random action with probability $\varepsilon_n$, for $n = 1, 2, \dots$. Let $p_{\mathrm{first}}$ be the highest reward probability, and define $\Delta_i = p_{\mathrm{first}} - p_i$. Then, the parameter $d$ needs to satisfy

$$0 < d \le \min_{i:\, p_i < p_{\mathrm{first}}} \Delta_i \tag{30}$$

Further, $d$ needs to be known in advance. Thus, some information about the reward probabilities is required, as in the case of RS with the optimal aspiration level. In addition, the performance of $\varepsilon_n$-greedy is sensitive to the value of the parameter $c$, and it is difficult to find the optimal value of $c$ [18].

On the other hand, determining the optimal aspiration level for RS may be easier. It does not require a parameter like $c$, and $\aleph = (p_{\mathrm{first}} + p_{\mathrm{second}})/2$ is sufficient. More generally, it is sufficient to obtain the interval $(p_{\mathrm{second}}, p_{\mathrm{first}})$ or the value of any point within the interval.
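The annealing schedule of the exploration probability can be sketched as follows, assuming the form min(1, cK / (d² n)) from Auer et al. [18]. The function names and the parameter values c = 0.15 and d = 0.1 are hypothetical and are not the values used in the simulations of this paper.

```python
import random
from typing import List, Optional


def epsilon_n(n: int, c: float, d: float, K: int) -> float:
    """Exploration probability at step n >= 1: min(1, c*K / (d^2 * n))."""
    return min(1.0, c * K / (d * d * n))


def epsilon_greedy_action(E: List[float], n: int, c: float, d: float,
                          rng: Optional[random.Random] = None) -> int:
    """Random action with probability epsilon_n, greedy action otherwise."""
    rng = rng or random.Random()
    if rng.random() < epsilon_n(n, c, d, len(E)):
        return rng.randrange(len(E))
    return max(range(len(E)), key=lambda i: E[i])


for step in (1, 100, 1000, 10000):
    print(step, epsilon_n(step, c=0.15, d=0.1, K=4))
```

The schedule stays at 1 (pure exploration) for the early steps and then decays in proportion to 1/n, so the amount of exploration depends strongly on c and d.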

#### Existing Satisficing Models

Here, we introduce the existing satisficing models and briefly explain the difference between those models and RS. First, the framework that is the closest to ours is that of Bendor et al. on the heuristics of satisficing [12], which analyzes the two-armed bandit problems when the rewards are Bernoulli distributed. They mainly analyzed the limiting behavior of a policy model similar to PS. Their model is different from PS in that it has a probability parameter so that, when unsatisfied, the agent switches actions only with a certain probability (not always). Therefore, the performance of their model is lower than that of PS.

The most recent and comprehensive study was conducted by Reverdy et al. [13]. They decomposed satisficing into “satisfy” and “suffice” (from which the word “satisfice” is formed) and presented general problem settings that include the standard bandit problems and algorithms with optimal order. As their algorithm is an adaptation of the standard UCB [18], the difference between RS and their algorithm is similar to the difference between RS and UCB as described above. Furthermore, their analysis is limited to the bandit problems where the reward distributions are Gaussian. In their study, they extended the concept of regret and developed an algorithm that searches for actions that exceed the aspiration level with probability . They proved the finiteness of the regret for their algorithm when .

However, it should be noted that in their study, the definition of regret is changed. Specifically, the regret of their algorithm is calculated according to whether or not the expected reward exceeds the aspiration level with probability , and the definition that regards the regret occurring with probability as zero is adopted. If , their regret is calculated according to whether the expected reward always exceeds the aspiration level or not; therefore, it becomes the same framework as that of the ordinary bandit problems. In such cases, the regret of their algorithm increases in the logarithmic order, which is the theoretical limit, and it does not become finite. On the other hand, can achieve the finite regret without changing the definition of regret. Therefore, the purposes and problem settings are different in our study and their study.

According to the above-mentioned discussion, it is difficult to compare our study with other satisficing algorithms for reinforcement learning proposed in previous studies because the purposes and frameworks are different. It is sufficient to compare our approach with and UCB1. Accordingly, the other algorithms will not be handled directly hereafter.

#### Performance Comparison

We compare the performance of UCB1T, , , and with through numerical simulations. Furthermore, the reward probabilities are uniformly randomly selected from , and the average is over 1,000 simulations. As mentioned above, it is difficult to determine the parameter of . In this simulation, the regret of in the 10,000-th step is taken as a reference. It is empirically found by a long parameter sweep such that the regret of in the 10,000-th step is minimized at around . Hence, the results of are shown as comparison targets. We set as . As for and , we set the aspiration level to an optimal level, , so that we can evaluate the efficiency when satisficing implies optimization.

The results are shown in Fig. 3. As for accuracy, RS approaches 1 the fastest among these algorithms. As for regret, that of PS increases rapidly because PS chooses actions randomly until an action whose mean reward is above $\aleph$ is found. The regret of RS remains small (and finitely bounded), whereas the regrets of UCB1T and $\varepsilon_n$-greedy diverge at a logarithmic order. In summary, we can see that RS with the optimal aspiration level shows better performance than UCB1T, $\varepsilon_n$-greedy, and PS.
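A single simulation run of this kind might look like the sketch below. It is not the authors' simulation code: it assumes an RS value of the form (n_i / N) times (mean reward minus aspiration level), a value of 0 for untried actions, random tie-breaking, and hypothetical reward probabilities. It records the cumulative expected loss of the actions actually chosen; averaging such curves over many seeds would approximate the regret.

```python
import random
from typing import List, Sequence


def run_rs(probs: Sequence[float], aspiration: float,
           steps: int, seed: int = 0) -> List[float]:
    """One RS run on a Bernoulli bandit; returns the cumulative loss curve."""
    rng = random.Random(seed)
    K = len(probs)
    m = [0] * K  # rewards obtained per action
    n = [0] * K  # times each action was chosen
    p_best = max(probs)
    loss, curve = 0.0, []
    for N in range(steps):  # N = number of steps completed so far
        values = [
            (n[i] / N) * (m[i] / n[i] - aspiration) if n[i] > 0 else 0.0
            for i in range(K)
        ]
        best = max(values)
        action = rng.choice([i for i, v in enumerate(values) if v == best])
        reward = 1 if rng.random() < probs[action] else 0
        m[action] += reward
        n[action] += 1
        loss += p_best - probs[action]
        curve.append(loss)
    return curve


probs = [0.2, 0.7, 0.5]
aspiration = 0.6  # midpoint of the two highest probabilities
curve = run_rs(probs, aspiration, steps=200, seed=0)
print(curve[-1])
```

The same loop could be reused for the other algorithms by swapping the line that computes `values`.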

#### Analysis of the Expected Change in Value Functions

Here, we qualitatively consider why with the optimal aspiration level performs better than the other algorithms. Let us consider how the value of in the -th step changes when action is chosen in the -th step. In the following formula,

(31) |

is the number of times a reward of 1 is obtained in the choice of action from the first to the -th step. In the -th step, the value of changes with probability to

(32) |

whereas it otherwise changes with probability to

(33) |

Let . Then, the expected value of the change, , is as follows:

(34) |

Thus, we see that the following relationships hold in any step:

(35) | ||||

(36) |

Let be set to an optimal level. Then, relationship 35 means that once the optimal action is chosen, will keep increasing on average, and it will continue to be chosen. On the other hand, relationship 36 means that if a non-optimal action has the highest value, and continues to be chosen for a while, then the value keeps decreasing on average. The value for other actions remains invariant. Therefore, at some point, another action than will start to be chosen. Further, note that the value decreases at an average rate of . Therefore, on average, the lower the reward probability of an action, the faster the action will stop being chosen, and another action will start being chosen.

To clarify the idiosyncrasies of , we carry out similar analyses for other value functions. First, let us analyze the mean reward. The value function is . When action is chosen, is given by

(37) |

whereas the values for other actions do not change. Further, is positive if and negative if , and both cases may occur regardless of the reward probability because is a variable, in contrast to the constant for . If action is chosen for a sufficient number of times, holds. Then, it leads to , and remains nearly unchanged. This implies that there is a possibility that a non-highest action keeps to be chosen (trapped into a local optimum). Let us consider the simplest example where there are only two actions (with ), and choosing the optimal action does not give much rewards, leading to and . As increases, converges to , and the relationship of becomes fixed because of . This leads to being chosen constantly. To avoid the local optima, prevents a non-highest action from being continuously chosen by randomly choosing actions with probability . With the mean reward, unlike , we cannot say that the smaller the reward probability of the action chosen once, the faster on average is the switching of the agent to choose another action.

Next, let us analyze UCB1, which is the simplest algorithm in the UCB family.

$$\mathrm{UCB1}_i = E_i + \sqrt{\frac{2 \ln N}{n_i}} \tag{38}$$

When action is chosen, the expected change in the UCB1 value is

(39) |

whereas the expected change of non-chosen action is as follows:

(40) |

In Eq. (39), the first term is the same as that in Eq. (37). In Eq. (39), the second and third terms approach zero if action continues to be chosen. Hence, if we consider only Eq. (39), there is a possibility that the non-highest action continues to be chosen, as with Eq. (37). However, in UCB1, the value function of non-chosen action also changes, as in Eq. (40). Moreover, we can see that the value of the non-chosen action increases infinitely because of the second term of Eq. (38). As a result, a non-highest action does not continue to be chosen.

In Eq. (39), the first term is positive if and negative if , and both cases may occur regardless of the reward probability because is a variable, as it is for above. On the other hand, the second term between the parentheses is negative if , which results from the fact that monotonically decreases with . As a result, may be positive or negative, regardless of the reward probability. Therefore, UCB1 does not have the property of whereby the action with a lower reward probability will be switched from earlier.

Based on the analyses presented above, let us reconsider the form of RS. Starting from the most basic value function of the mean reward, $E_i$, RS is formed through two operations, subtracting $\aleph$ and multiplying by $n_i/N$. If it is merely $E_i - \aleph$, the value function works exactly as the original $E_i$ under the greedy policy. On the other hand, if only the multiplication by $n_i/N$ is applied, the value function is $\frac{n_i}{N} E_i$, and it is a special case of
