In real-life decision making, from deciding where to have lunch to finding an apartment when moving to a new city, people often face different levels of information dependency. In the simplest case, you are given a set of possible actions (“arms”), each associated with a fixed, unknown and independent reward probability distribution, and the goal is to trade off between following a good action chosen previously (exploitation) and obtaining more information about the environment, which may lead to better actions in the future (exploration). The multi-armed bandit (MAB) (or simply, bandit) typically models this level of exploration-exploitation trade-off [24, 3]. In many scenarios, the best strategy may depend on a context from the current environment, such that the goal is to learn the relationship between the context vectors and the rewards, in order to better predict which action to choose given the context. This setting is modeled as the contextual bandit (CB) [2, 25], where the context can be attentive [11, 28]. In more complicated environments, there is an additional dependency between the environmental contexts given the action an agent takes, and that is modeled as a Markov decision process (MDP) in a reinforcement learning (RL) problem.
To better model and understand human decision-making behavior, scientists usually investigate reward processing mechanisms in healthy subjects. However, neurodegenerative and psychiatric disorders, often associated with reward processing disruptions, can provide an additional resource for a deeper understanding of human decision-making mechanisms. From the perspective of evolutionary psychiatry, various mental disorders, including depression, anxiety, ADHD, addiction and even schizophrenia, can be considered as “extreme points” in a continuous spectrum of behaviors and traits developed for various purposes during evolution, and somewhat less extreme versions of those traits can actually be beneficial in specific environments. Thus, modeling decision-making biases and traits associated with various disorders may enrich the existing computational decision-making models, leading to potentially more flexible and better-performing algorithms. In this paper, we extend previous pursuits of human behavioral agents in MAB and RL [29, 30] into CB, building upon Contextual Thompson Sampling (CTS), a state-of-the-art approach to the CB problem, and unify all three levels as a parametric family of models, where the reward information is split into two streams, positive and negative.
2 Problem Setting
In this section, we briefly outline the three problem settings:
Multi-Armed Bandit (MAB).
The multi-armed bandit (MAB) problem models a sequential decision-making process, where at each time point a player selects an action from a given finite set of possible actions, attempting to maximize the cumulative reward over time.
Optimal solutions have been provided using a stochastic formulation [24, 3], or using an adversarial formulation [6, 4, 10].
Recently, there has been a surge of interest in a Bayesian formulation, involving the algorithm known as Thompson sampling. Prior theoretical analysis shows that Thompson sampling for Bernoulli bandits is asymptotically optimal.
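As an illustration of the Bayesian formulation mentioned above, the following is a minimal sketch of Thompson sampling for a Bernoulli bandit with Beta posteriors. The function name, the uniform Beta(1, 1) prior and the simulated arm probabilities are assumptions made for this example, not details taken from the experiments in this paper.

```python
import numpy as np

def thompson_sampling(true_probs, horizon, seed=0):
    """Beta-Bernoulli Thompson sampling on a simulated bandit (illustrative)."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_probs)
    successes = np.ones(n_arms)  # assumed Beta(1, 1) prior on every arm
    failures = np.ones(n_arms)
    pulls = np.zeros(n_arms, dtype=int)
    for _ in range(horizon):
        # Sample one plausible success probability per arm from its posterior.
        samples = rng.beta(successes, failures)
        arm = int(np.argmax(samples))
        # Simulated binary reward from the (unknown to the agent) true probability.
        reward = int(rng.random() < true_probs[arm])
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_sampling([0.1, 0.9], horizon=2000)
```

In this sketch the posterior of a rarely rewarded arm concentrates on low values, so that arm is sampled less and less often as evidence accumulates.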
Contextual Bandit (CB). Following prior work, this problem is defined as follows. At each time point (iteration) $t \in \{1, \dots, T\}$, an agent is presented with a context (feature vector) $x_t \in \mathbb{R}^N$ before choosing an arm $k \in A = \{1, \dots, K\}$. We will denote by $X = \{X_1, \dots, X_N\}$ the set of features (variables) defining the context. Let $r_t = (r_t^1, \dots, r_t^K)$ denote a reward vector, where $r_t^k \in [0, 1]$ is a reward at time $t$ associated with the arm $k \in A$. Herein, we will primarily focus on the Bernoulli bandit with binary reward, i.e. $r_t^k \in \{0, 1\}$. Let $\pi: X \to A$ denote a policy. Also, $D_{c,r}$ denotes a joint distribution over $(x, r)$. We will assume that the expected reward is a linear function of the context, i.e. $E[r_t^k \mid x_t] = \mu_k^\top x_t$, where $\mu_k$ is an unknown weight vector associated with the arm $k$.
Reinforcement Learning (RL). Reinforcement learning defines a class of algorithms for solving problems modeled as Markov decision processes (MDP). An MDP is defined by the tuple $(S, A, T, R, \gamma)$, where $S$ is a set of possible states, $A$ is a set of actions, $T$ is a transition function defined as $T(s, a, s') = \Pr(s' \mid s, a)$, where $s, s' \in S$ and $a \in A$, $R: S \times A \times S \to \mathbb{R}$ is a reward function, and $\gamma$ is a discount factor that decreases the impact of rewards received further in the future on the current action choice. Typically, the objective is to maximize the discounted long-term reward, assuming an infinite-horizon decision process, i.e. to find a policy function $\pi: S \to A$ which specifies the action to take in a given state, so that the cumulative reward is maximized:
$$\max_{\pi} \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1}).$$
3 Background: Contextual Thompson Sampling (CTS)
As pointed out in the introduction, the main methodological contribution of this work is two-fold: (1) to fill in the missing piece of split reward processing in the contextual bandit problem, and (2) to unify bandits, contextual bandits, and reinforcement learning under the same framework of split reward processing. We first introduce the theoretical model we build upon for the contextual bandit problem: Contextual Thompson Sampling.
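In general Thompson Sampling, the reward $r_k(t)$ for choosing action $k$ at time $t$ follows a parametric likelihood function $\Pr(r(t) \mid \tilde{\mu})$. Following prior work, the posterior distribution at time $t+1$, $\Pr(\tilde{\mu} \mid r(t)) \propto \Pr(r(t) \mid \tilde{\mu}) \Pr(\tilde{\mu})$, is given by a multivariate Gaussian distribution, $\mathcal{N}(\hat{\mu}_k(t+1), v^2 B_k(t+1)^{-1})$, where $B_k(t) = I_d + \sum_{\tau} x_\tau x_\tau^\top$, and where $d$ is the context size (the dimension of $x_t$), $v = R\sqrt{\frac{24}{\epsilon} d \ln\frac{1}{\delta}}$ with $R > 0$, $\epsilon \in (0, 1]$, $\delta \in (0, 1]$ constants, and $\hat{\mu}_k(t) = B_k(t)^{-1} \sum_{\tau} x_\tau r_k(\tau)$; both sums run over the steps $\tau < t$ at which arm $k$ was chosen. At every step $t$, the algorithm generates a $d$-dimensional sample $\tilde{\mu}_k$ from $\mathcal{N}(\hat{\mu}_k(t), v^2 B_k(t)^{-1})$ for each arm, selects the arm $k$ that maximizes $x_t^\top \tilde{\mu}_k$, and obtains reward $r_t$.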
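The sampling and update steps described above can be sketched as follows. This is a minimal illustration of Thompson sampling with a linear-Gaussian posterior per arm; the class name, the fixed value of `v` and the attribute names (`B`, `f`, `mu_hat`) are assumptions chosen for readability, and the constant `v` is passed in directly rather than derived from the other constants.

```python
import numpy as np

class ContextualThompsonSampling:
    """Illustrative per-arm linear Thompson sampling (not the authors' code)."""

    def __init__(self, n_arms, dim, v=0.5, seed=0):
        self.rng = np.random.default_rng(seed)
        self.v = v  # assumed fixed posterior scale
        self.B = np.stack([np.eye(dim) for _ in range(n_arms)])  # precision-like matrices
        self.f = np.zeros((n_arms, dim))       # accumulated reward-weighted contexts
        self.mu_hat = np.zeros((n_arms, dim))  # posterior means

    def select(self, x):
        scores = []
        for B_k, mu_k in zip(self.B, self.mu_hat):
            cov = self.v ** 2 * np.linalg.inv(B_k)
            mu_tilde = self.rng.multivariate_normal(mu_k, cov)  # posterior sample
            scores.append(x @ mu_tilde)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        # Only the chosen arm's statistics change.
        self.B[arm] += np.outer(x, x)
        self.f[arm] += reward * x
        self.mu_hat[arm] = np.linalg.solve(self.B[arm], self.f[arm])

agent = ContextualThompsonSampling(n_arms=2, dim=2)
x = np.array([1.0, 0.0])
agent.update(0, x, 1.0)
```

After this single update, the posterior mean of arm 0 moves halfway toward the observed reward along the first feature, because the identity prior counts as one pseudo-observation.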
4 Two-Stream Split Models in MAB, CB and RL
We now outline the split models evaluated in our three settings: the MAB case with Human-Based Thompson Sampling (HBTS), the CB case with Split Contextual Thompson Sampling (SCTS), and the RL case with Split Q-Learning [29, 30]. All three split agent classes share a standardized parametric notation (see Table 1 for a complete parametrization and Appendix 0.A for a further literature review of these clinically inspired reward-processing biases).
Split Multi-Armed Bandit Model. The split MAB agent is built upon Human-Based Thompson Sampling (HBTS, Algorithm 1). The positive and negative streams are stored in the success and failure counts, $S_k$ and $F_k$, respectively.
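To make the two-stream bookkeeping concrete, here is a hedged sketch of a split Thompson sampling agent. It is not a transcription of Algorithm 1: the update rule (discount the old count, then add the weighted new reward), the initial counts of 1, the small floor that keeps the Beta parameters valid, and all names (`lam_pos`, `w_pos`, `lam_neg`, `w_neg`) are assumptions made for this example.

```python
import numpy as np

class SplitThompsonSampling:
    """Illustrative two-stream Thompson sampling agent (assumed update rule)."""

    def __init__(self, n_arms, lam_pos=1.0, w_pos=1.0, lam_neg=1.0, w_neg=1.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.lam_pos, self.w_pos = lam_pos, w_pos
        self.lam_neg, self.w_neg = lam_neg, w_neg
        self.S = np.ones(n_arms)  # positive stream: success counts
        self.F = np.ones(n_arms)  # negative stream: failure counts

    def select(self):
        eps = 1e-6  # floor so the Beta parameters stay strictly positive
        theta = self.rng.beta(np.maximum(self.S, eps), np.maximum(self.F, eps))
        return int(np.argmax(theta))

    def update(self, arm, r_pos, r_neg):
        # Each stream is discounted and reweighted independently.
        self.S[arm] = self.lam_pos * self.S[arm] + self.w_pos * r_pos
        self.F[arm] = self.lam_neg * self.F[arm] + self.w_neg * abs(r_neg)

agent = SplitThompsonSampling(n_arms=2, lam_pos=0.5, w_pos=2.0)
agent.update(0, r_pos=1.0, r_neg=0.0)
```

With all four parameters set to 1 the counts simply accumulate, which corresponds to the standard split setting.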
Split Contextual Bandit Model. Similarly, we now extend Contextual Thompson Sampling (CTS) to a more flexible framework, inspired by a wide range of reward-processing biases discussed in Appendix 0.A. The proposed Split CTS (Algorithm 2) treats positive and negative rewards in two separate streams. It introduces four hyper-parameters which represent, for both positive and negative streams, the reward processing weights (biases), as well as discount factors for the past rewards: $\lambda_+$ and $\lambda_-$ are the discount factors applied to the previously accumulated positive and negative rewards, respectively, while $w_+$ and $w_-$ represent the weights on the positive and negative rewards at the current iteration. We assume that at each step, an agent receives both positive and negative rewards, denoted $r_t^+$ and $r_t^-$, respectively (either one of them can be zero, of course). As in HBTS, the two streams are independently updated.
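A rough sketch of how the two streams might be maintained in a contextual agent is given below. It follows the description above (two independently updated linear posteriors, with a discount on accumulated statistics and a weight on the incoming reward), but the exact update equations, the use of a pseudo-inverse for numerical safety, the convention that `r_neg` is non-positive, and every identifier are assumptions rather than a reproduction of Algorithm 2.

```python
import numpy as np

class SplitCTS:
    """Illustrative two-stream contextual Thompson sampling (assumed updates)."""

    def __init__(self, n_arms, dim, lam_pos=1.0, w_pos=1.0,
                 lam_neg=1.0, w_neg=1.0, v=0.5, seed=0):
        self.rng = np.random.default_rng(seed)
        self.v = v
        self.params = {"pos": (lam_pos, w_pos), "neg": (lam_neg, w_neg)}
        streams = ("pos", "neg")
        self.B = {s: np.stack([np.eye(dim)] * n_arms) for s in streams}
        self.f = {s: np.zeros((n_arms, dim)) for s in streams}
        self.mu = {s: np.zeros((n_arms, dim)) for s in streams}

    def select(self, x):
        n_arms = self.mu["pos"].shape[0]
        scores = np.zeros(n_arms)
        for s in ("pos", "neg"):
            for k in range(n_arms):
                cov = self.v ** 2 * np.linalg.pinv(self.B[s][k])
                # Sum the sampled estimates of both streams for each arm.
                scores[k] += x @ self.rng.multivariate_normal(self.mu[s][k], cov)
        return int(np.argmax(scores))

    def update(self, arm, x, r_pos, r_neg):
        # r_neg is assumed to be non-positive (a penalty).
        for s, r in (("pos", r_pos), ("neg", r_neg)):
            lam, w = self.params[s]
            self.B[s][arm] = lam * self.B[s][arm] + np.outer(x, x)
            self.f[s][arm] = lam * self.f[s][arm] + w * r * x
            self.mu[s][arm] = np.linalg.pinv(self.B[s][arm]) @ self.f[s][arm]

x = np.array([1.0, 0.0])
standard = SplitCTS(n_arms=2, dim=2)
standard.update(0, x, r_pos=1.0, r_neg=-1.0)
biased = SplitCTS(n_arms=2, dim=2, lam_pos=0.5, w_pos=2.0)
biased.update(0, x, r_pos=1.0, r_neg=-1.0)
```

The `biased` agent discounts its prior statistics and doubles the incoming positive reward, so its positive-stream estimate for arm 0 ends up larger than that of the `standard` agent after the same observation.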
Split Reinforcement Learning Model. The split RL agent is built upon Split Q-Learning (SQL, Algorithm 3) by [29, 30] (and its variant, MaxPain). The processing of the positive and negative streams is handled by two independently updated Q functions, $Q^+$ and $Q^-$.
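The idea of two independently updated Q functions can be sketched as a tabular agent. The update rule below (discount the stored value, then apply a standard temporal-difference step on the weighted stream reward) and the greedy rule over the sum of the two tables are assumptions for illustration and may differ from Algorithm 3; all identifiers are hypothetical.

```python
import numpy as np

class SplitQLearning:
    """Illustrative tabular agent with separate positive and negative Q-tables."""

    def __init__(self, n_states, n_actions, lam_pos=1.0, w_pos=1.0,
                 lam_neg=1.0, w_neg=1.0, alpha=0.1, gamma=0.95):
        self.q_pos = np.zeros((n_states, n_actions))
        self.q_neg = np.zeros((n_states, n_actions))
        self.pos = (lam_pos, w_pos)
        self.neg = (lam_neg, w_neg)
        self.alpha, self.gamma = alpha, gamma

    def greedy_action(self, s):
        # Act on the combined estimate of both streams.
        return int(np.argmax(self.q_pos[s] + self.q_neg[s]))

    def update(self, s, a, r_pos, r_neg, s_next):
        streams = ((self.q_pos, self.pos, r_pos), (self.q_neg, self.neg, r_neg))
        for q, (lam, w), r in streams:
            target = w * r + self.gamma * q[s_next].max()
            q[s, a] = lam * q[s, a] + self.alpha * (target - q[s, a])

agent = SplitQLearning(n_states=2, n_actions=2, alpha=0.5, gamma=0.9)
agent.update(s=0, a=1, r_pos=2.0, r_neg=-4.0, s_next=1)
```

Here action 1 in state 0 earned a positive reward of 2 and a penalty of 4, so its combined value is negative and the greedy choice in state 0 falls back to action 0.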
Clinically Inspired Reward Processing Biases in Split Models. For each agent, we set the four parameters: $\lambda_+$ and $\lambda_-$ as the weights of the previously accumulated positive and negative rewards, respectively, and $w_+$ and $w_-$ as the weights on the positive and negative rewards at the current iteration. DISCLAIMER: while we use disorder names for the models, we are not claiming that the models accurately capture all aspects of the corresponding disorders.
In the following section we describe how specific constraints on the model parameters in the proposed method can generate a range of reward processing biases, and introduce several instances of the split models associated with those biases; the corresponding parameter settings are presented in Table 1. As we demonstrate later, specific biases may actually be beneficial in some settings, and our parametric approach often outperforms the standard baselines due to the increased generality and flexibility of our two-stream, multi-parametric formulation.
Note that the standard split approach corresponds to setting the four (hyper)parameters used in our model to 1. We also introduce two variants which learn from only one of the two reward streams: negative split models (algorithms that start with N) and positive split models (algorithms that start with P), obtained by setting to zero the positive-stream parameters $\lambda_+$ and $w_+$, or the negative-stream parameters $\lambda_-$ and $w_-$, respectively. Next, we introduce a model which incorporates some mild forgetting of the past rewards or losses (0.5 weights), and calibrate the other models with respect to this one; we refer to this model as M for “moderate” forgetting.
We also distinguish the mental agents by prefix: “b-” refers to the MAB version of the split models (as in “bandits”), “cb-” refers to the CB version, and no prefix denotes the RL version (for its general purposes).
We now introduce several models inspired by certain reward-processing biases in a range of behaviors resembling mental disorders, listed in Table 1.
Recall that PD patients are typically better at learning to avoid negative outcomes than at learning to achieve positive outcomes; one way to model this is to over-emphasize negative rewards, by placing a high weight on them, as compared to the reward processing in healthy individuals. Specifically, we will assume the weight on current negative rewards, $w_-$, to be much higher than normal for PD patients (the value used here is listed in Table 1), while the rest of the parameters will be in the same range for both healthy and PD individuals.
Patients with bvFTD are prone to overeating, which may represent increased reward representation. To model this impairment in bvFTD patients, the weight on current positive rewards, $w_+$, is set much higher than normal (as shown in Table 1), while the rest of the parameters are equal to the normal ones.
To model apathy in patients with Alzheimer’s, including downplaying rewards and losses, we will assume that the weights on previously accumulated rewards, $\lambda_+$ and $\lambda_-$, are somewhat smaller than normal (e.g., set to 0.1 in Table 1), which models the tendency to forget both positive and negative rewards.
Recall that ADHD may involve impairments in storing stimulus-response associations. In our ADHD model, the parameters $\lambda_+$ and $\lambda_-$ are smaller than normal, which models forgetting of both positive and negative rewards. Note that while this model appears similar to the Alzheimer’s model described above, the forgetting will be less pronounced, i.e. the $\lambda_+$ and $\lambda_-$ parameters are larger than those of the Alzheimer’s model (e.g., 0.2 instead of 0.1, as shown in Table 1).
As mentioned earlier, addiction is associated with an inability to properly forget (positive) stimulus-response associations; we model this by setting the weight on previously accumulated positive reward (“memory”), $\lambda_+$, higher than normal (see Table 1 for the values used).
We model the reduced responsiveness to rewards in chronic pain by setting $w_+$ lower than normal, so that there is a decrease in the reward representation, and $\lambda_-$ higher than normal, so that the negative rewards are not forgotten (see Table 1).
Of course, the above models should be treated only as first approximations of the reward processing biases in mental disorders, since the actual changes in reward processing are much more complicated, and the parametric setting must be learned from actual patient data, which is a nontrivial direction for future work. Herein, we simply consider those models as specific variations of our general method, inspired by certain aspects of the corresponding diseases, and focus primarily on the computational aspects of our algorithm, demonstrating that the proposed parametric extension of standard algorithms can learn better than the baselines due to added flexibility.
|“Chronic pain” (CP)|
|Standard (HBTS, SCTS, SQL)||1||1||1||1|
|Positive (PTS, PCTS, PQL)||1||1||0||0|
|Negative (NTS, NCTS, NQL)||0||0||1||1|
5 Empirical Evaluation
|Baseline||Variants of Split MAB agents|
|avg wins (%)||46.72||52.65||12.25||10.86||45.08||52.02||38.26||25.00|
|Baseline||Variants of Split CB Agents|
|avg wins (%)||43.10||57.07||5.05||39.56||38.89||18.35|
|Baseline||Variants of Split RL agents|
|avg wins (%)||38.64||39.39||40.40||37.50||36.87||35.86||22.35||31.82|
|Baseline||Variants of Split MAB agents|
|avg wins (%)||36.99||33.84||36.74||39.90||33.33||36.62||32.70||32.70|
|Baseline||Variants of Split CB Agents|
|avg wins (%)||25.42||62.12||28.11||24.92||34.85||26.60|
|Baseline||Variants of Split RL agents|
|avg wins (%)||55.05||45.20||53.41||40.53||47.98||55.68||37.75||17.93|
Empirically, we evaluated the algorithms in four settings: the gambling game of a simple MDP task, a simple MAB task, a real-life Iowa Gambling Task (IGT), and a PacMan game. There is considerable randomness in the reward, and predefined multimodality in the reward distribution of each state-action pair, in all four tasks. We ran split MAB agents in the MAB, MDP and IGT tasks, and split CB and RL agents in all four tasks.
5.1 MAB and MDP Tasks with bimodal rewards
In this simple MAB example, a player starts from initial state A and chooses between two actions: go left to reach state B, or go right to reach state C. Both states B and C reveal a zero reward. From state B, the player observes a reward from a distribution $R_B$. From state C, the player observes a reward from a distribution $R_C$. The reward distributions of states B and C are both multimodal distributions (for instance, the reward can be drawn from a bi-modal mixture of two normal distributions, the first chosen with probability $p$ and the second with probability $1 - p$). The left action (go to state B) is set by default to have a lower expected payout than the right action. However, the reward distributions can be spread across both the positive and negative domains. For split models, the reward is separated into a positive stream (if the revealed reward is positive) and a negative stream (if the revealed reward is negative).
To evaluate the robustness of the algorithms, we simulated 100 randomly generated scenarios of bi-modal distributions, where the reward can be drawn from two normal distributions with means as random integers uniformly drawn from -100 to 100, standard deviations as random integers uniformly drawn from 0 to 50, and sampling probability $p$ uniformly drawn from 0 to 1 (assigning $p$ to one normal distribution and $1 - p$ to the other one). Each scenario was repeated 50 times, with standard errors as bounds. In all experiments, the discount factor $\gamma$ was set to 0.95. For approaches without built-in exploration, exploration is included through the $\epsilon$-greedy algorithm with $\epsilon$ set to 0.05. The learning rate was polynomial, which is better in theory and in practice.
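The scenario generation described above can be sketched as follows. The ranges come from the text; the function names and the convention that a positive reward goes entirely to the positive stream (and a negative one entirely to the negative stream) are assumptions for this illustration.

```python
import numpy as np

def make_scenario(rng):
    """Draw the parameters of one bimodal reward distribution."""
    means = rng.integers(-100, 101, size=2)  # integers in [-100, 100]
    stds = rng.integers(0, 51, size=2)       # integers in [0, 50]
    p = rng.uniform(0.0, 1.0)                # probability of the first mode
    return means, stds, p

def sample_reward(rng, means, stds, p):
    """Sample one reward from the two-component normal mixture."""
    mode = 0 if rng.random() < p else 1
    return float(rng.normal(means[mode], stds[mode]))

def split_reward(r):
    """Split a scalar reward into (positive stream, negative stream)."""
    return max(r, 0.0), min(r, 0.0)

rng = np.random.default_rng(0)
means, stds, p = make_scenario(rng)
r_pos, r_neg = split_reward(sample_reward(rng, means, stds, p))
```

By construction, exactly one of the two streams is non-zero for any non-zero reward, and their sum recovers the original scalar reward used by the non-split baselines.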
We compared the following algorithms. In the MAB setting, we have Thompson Sampling (TS), Upper Confidence Bound (UCB), epsilon Greedy (eGreedy), EXP3 (and gEXP3 for the pure greedy version of EXP3), and Human-Based Thompson Sampling (HBTS). In the CB setting, we have Contextual Thompson Sampling (CTS), LinUCB, EXP4 and Split Contextual Thompson Sampling (SCTS). In the RL setting, we have Q-Learning (QL), Double Q-Learning (DQL), State–action–reward–state–action (SARSA), Standard Split Q-Learning (SQL) [29, 30], MaxPain (MP), Positive Q-Learning (PQL) and Negative Q-Learning (NQL).
In order to evaluate the performance of the algorithms, we need a scenario-independent measure which does not depend on the specific selection of reward distribution parameters or on the pool of algorithms being considered. The final cumulative rewards might be subject to outliers because they are scenario-specific. The ranking of each algorithm might be subject to selection bias due to the different pools of algorithms being considered. The pairwise comparison of the algorithms, however, is independent of the selection of scenario parameters and of the selection of algorithms. For example, in the 100 randomly generated scenarios, algorithm X beats Y $n$ times while Y beats X $m$ times. We may then compare the robustness of each pair of algorithms with the proportion $n : m$.
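The pairwise measure described above can be computed as in the sketch below. The input layout (one final score per scenario for each algorithm), the treatment of ties as wins for neither side, and the function names are assumptions for illustration.

```python
import numpy as np

def pairwise_wins(scores):
    """Count, for each ordered pair (X, Y), the scenarios in which X beats Y.

    scores maps an algorithm name to a sequence of final scores, one per scenario.
    """
    wins = {}
    for x in scores:
        for y in scores:
            if x != y:
                better = np.asarray(scores[x]) > np.asarray(scores[y])
                wins[(x, y)] = int(np.sum(better))
    return wins

def average_win_rate(wins, name, n_scenarios):
    """Mean win rate of one algorithm against every other algorithm in the pool."""
    rates = [n / n_scenarios for (x, _), n in wins.items() if x == name]
    return float(np.mean(rates))

scores = {"X": [3, 1, 2, 5], "Y": [1, 2, 2, 4]}
wins = pairwise_wins(scores)
```

In this toy example X beats Y in two of the four scenarios and Y beats X in one, with one tie, so the pair would be reported as 2 : 1.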
Results. Figure 1 and Figure 2 show two example scenarios, plotting the reward distributions, the percentage of choosing the better action (go right), the cumulative rewards and the changes of the two Q-tables (the weights stored in $Q^+$ and $Q^-$) over the number of iterations, drawn with standard errors over multiple runs. Each trial consisted of a synchronous update of all 100 actions. With polynomial learning rates, we see that split models (HBTS in the bandit agent pool, SCTS in the contextual bandit agent pool, and SQL in the RL agent pool) converged much more quickly than the baselines.
Tables 3 and 5 summarize the pairwise comparisons between the agents, with the row labels as algorithm X and the column labels as algorithm Y, each cell giving $n : m$ to denote that X beats Y $n$ times and Y beats X $m$ times. For each cell in the $i$-th row and $j$-th column, the first number indicates the number of rounds in which agent $i$ beats agent $j$, and the second number the number of rounds in which agent $j$ beats agent $i$. The average wins of each agent is computed as the mean of its win rates against the other agents in the pool of agents in the rows. Bold face indicates that the performance of the agent in column $j$ is the best among the agents, or the better one. Among the algorithms, split models never seem to fail catastrophically, maintaining an overall advantage over the other algorithms.
For instance, in the MAB task, among the MAB agent pool, HBTS beats the non-split version of TS with a winning rate of 52.65% over 46.72%. In the CB agent pool, LinUCB performed the best with a winning rate of 57.07%. This suggests that upper confidence bound (UCB)-based approaches are more suitable for the two-armed MAB task that we proposed, although prior theoretical analysis shows that Thompson sampling models for Bernoulli bandits are asymptotically optimal. Further analysis is worth pursuing to explore UCB-based split models. In the RL agent pool, we observe that the SARSA algorithm is the most robust among all agents, suggesting a potential benefit of on-policy learning in the two-armed MAB problem that we proposed. Similarly, in the MDP task, the behavior varies. In the MAB agent pool, despite not being built with a state representation, gEXP3, an adversarial bandit algorithm with epsilon-greedy exploration, performed the best. We suspect that our non-Gaussian reward distribution might resemble the nonstationary or adversarial setting that the EXP3 algorithm is designed for. In the CB agent pool, we observed that LinUCB performed the best, which matches our finding in the similar MAB task above. In the RL agent pool, one of the split models, MP, performed the best against all baselines, suggesting a benefit of the split mechanism in the MDP environments that we generated.
To explore the variants of split models representing different mental disorders, we also performed the same experiments on the 7 disease models proposed above. Tables 3 and 5 summarize their pairwise comparisons with the standard ones, where the average wins are computed against the three standard baseline models. Overall, PD (“Parkinson’s”), CP (“chronic pain”), ADHD and M (“moderate”) performed relatively well. In the MAB setting, the optimal reward biases are PD and M for the split MAB models, ADHD and CP for the split CB models, and bvFTD and M for the split RL models. In the MDP setting, the optimal reward biases are PD and M for the split MAB models, ADHD and bvFTD for the split CB models, and ADHD and CP for the split RL models.
|Decks||win per card||loss per card||expected value||scheme|
|A (bad)||+100||Frequent: -150 (p=0.1), -200 (p=0.1), -250 (p=0.1), -300 (p=0.1), -350 (p=0.1)||-25||1|
|B (bad)||+100||Infrequent: -1250 (p=0.1)||-25||1|
|C (good)||+50||Frequent: -25 (p=0.1), -75 (p=0.1),-50 (p=0.3)||+25||1|
|D (good)||+50||Infrequent: -250 (p=0.1)||+25||1|
|A (bad)||+100||Frequent: -150 (p=0.1), -200 (p=0.1), -250 (p=0.1), -300 (p=0.1), -350 (p=0.1)||-25||2|
|B (bad)||+100||Infrequent: -1250 (p=0.1)||-25||2|
|C (good)||+50||Infrequent: -50 (p=0.5)||+25||2|
|D (good)||+50||Infrequent: -250 (p=0.1)||+25||2|
5.2 Iowa Gambling Task
The original Iowa Gambling Task (IGT) studies decision making in which the participant needs to choose one out of four card decks (named A, B, C, and D), and can win or lose money with each card when choosing a deck to draw from, over around 100 actions. In each round, the participant receives feedback about the win (the money he/she wins), the loss (the money he/she loses), and the combined gain (win minus loss). In the MDP setup, from initial state I, the player selects one of the four decks to go to state A, B, C, or D, which reveals the positive reward (the win), the negative reward (the loss) and the combined reward simultaneously. Decks A and B are set by default to have an expected payout (-25) lower than the better decks, C and D (+25). For baselines, the combined reward is used to update the agents. For split models, the positive and negative streams are fed and learned independently, given the positive reward $r^+$ and the negative reward $r^-$.
There are two major payoff schemes in IGT. In the traditional payoff scheme, the net outcome of every 10 cards from the bad decks (i.e., decks A and B) is -250, and +250 in the case of the good decks (i.e., decks C and D). There are two decks with frequent losses (decks A and C), and two decks with infrequent losses (decks B and D). All decks have consistent wins (+100 for A and B, +50 for C and D) and variable losses (summarized in Table 6, where scheme 1 has more variable losses for deck C than scheme 2). We ran each scheme 200 times, each over 500 actions.
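As a sanity check on the payoff schemes, the sketch below encodes scheme 1 from Table 6 and verifies that the bad decks have an expected value of -25 per card and the good decks +25. The data layout and function names are assumptions for this example; the `draw` helper returns the win and the loss separately, which is the form a split agent would consume.

```python
import random

# (win per card, [(loss, probability), ...]) as listed for scheme 1 in Table 6.
SCHEME_1 = {
    "A": (100, [(-150, 0.1), (-200, 0.1), (-250, 0.1), (-300, 0.1), (-350, 0.1)]),
    "B": (100, [(-1250, 0.1)]),
    "C": (50, [(-25, 0.1), (-75, 0.1), (-50, 0.3)]),
    "D": (50, [(-250, 0.1)]),
}

def expected_value(deck):
    """Expected combined gain per card: the fixed win plus the expected loss."""
    win, losses = deck
    return win + sum(loss * p for loss, p in losses)

def draw(deck, rng):
    """Draw one card and return (win, loss); the loss is 0 when no loss occurs."""
    win, losses = deck
    u = rng.random()
    cumulative = 0.0
    for loss, p in losses:
        cumulative += p
        if u < cumulative:
            return win, loss
    return win, 0

rng = random.Random(0)
win, loss = draw(SCHEME_1["A"], rng)
values = {name: expected_value(deck) for name, deck in SCHEME_1.items()}
```

Over 10 cards these expectations give the net outcomes of -250 for decks A and B and +250 for decks C and D stated above.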
Results. Among the variants of split models and baselines, the split contextual bandit (SCTS) performs best in scheme 1, with an averaged final cumulative reward of 1200.76 over 500 draws of cards, higher than the MAB baseline TS (991.26), the CB baseline LinUCB (1165.23) and the RL baseline QL (1086.33). Mental variants of SCTS, such as CP (“chronic pain”, 1136.38), also performed quite well. This is consistent with the clinical picture of chronic pain patients, who tend to forget about positive reward information (as modeled by a smaller $\lambda_+$) and lack the drive to pursue rewards (as modeled by a smaller $w_+$). In scheme 2, eGreedy performs best with a final score of 1198.32, followed by CP (1155.84) and SCTS (1150.22). These examples suggest that the proposed framework has the flexibility to map out different behavior trajectories in real-life decision making (such as IGT). Figure 3 shows the short-term (in 100 actions) and long-term behaviors of different mental agents, which match clinical discoveries. For instance, ADD (“addiction”) quickly learns the actual values of each deck (as reflected by the short-term curve) but in the long term sticks with the decks with larger wins (despite their even larger losses). At around 20 actions, ADD performs better than the baselines in learning about the decks with the better gains. In all three agent pools (MAB agents, CB agents, RL agents), we observed interesting trajectories revealed by the short-term dynamics (Figure 3), suggesting a promising next step of mapping from behavioral trajectories to the clinically relevant reward processing biases of human subjects.
5.3 PacMan game across various stationarities
We demonstrate the merits of the proposed algorithm using the classic game of PacMan. The goal of the agent is to eat all the dots in the maze, known as Pac-Dots, as soon as possible while simultaneously avoiding collision with ghosts, which roam the maze trying to kill PacMan. The rules for the environment (adopted from Berkeley AI PacMan, http://ai.berkeley.edu/project_overview.html) are as follows. There are two types of negative rewards: on collision with a ghost, PacMan loses the game and gets a negative reward of $-500$; and at each time frame, there is a constant time-penalty of $-1$ for every step taken. There are three types of positive rewards. On eating a Pac-Dot, the agent obtains a reward of $+10$. On successfully eating all the Pac-Dots, the agent wins the game and obtains a reward of $+500$. The game also has two special dots called Power Pellets in the corners of the maze, which, on consumption, give PacMan the temporary ability of “eating” ghosts. During this phase, the ghosts are in a “scared” state for 40 frames and move at half their speed. On eating a “scared” ghost, the agent gets a reward of $+200$, and the ghost returns to the center box and to its normal “unscared” state. As a more realistic scenario for real-world agents, we define the agents to receive their rewards in separate positive and negative streams. Traditional agents sum the two streams as a regular reward, while split agents use the two streams separately.
We applied several types of stationarities to PacMan, following prior work. In order to simulate a lifelong learning setting, we assume that the environmental settings arrive in batches (or stages) of episodes, and that the specific rule of the game (i.e., the reward distributions) may change across batches, while remaining stationary within each batch. The change is defined by a stochastic process over the game setting, in which one event is defined for the positive stream and another event is defined for the negative stream, independent of each other. The stochastic process is resampled every 10 rounds (i.e. a batch size of 10).
Stochastic reward muting. To simulate turning a certain reward stream on or off, we define the positive-stream event as turning off the positive reward stream (i.e. all the positive rewards are set to zero) and the negative-stream event as turning off the negative reward stream (i.e. all the penalties are set to zero). Both events occur with fixed probabilities in the experiments.
Stochastic reward scaling. To simulate scaling up a certain reward stream, we define the positive-stream event as scaling up the positive reward stream by 100 (i.e. all the positive rewards are multiplied by 100) and the negative-stream event as scaling up the negative reward stream (i.e. all the penalties are multiplied by 100). Both events occur with fixed probabilities in the experiments.
Stochastic reward flipping. To simulate flipping a certain reward stream, we define the positive-stream event as flipping the positive reward stream (i.e. all the positive rewards are multiplied by -1 and considered penalties) and the negative-stream event as flipping the negative reward stream (i.e. all the penalties are multiplied by -1 and considered positive rewards). Both events occur with fixed probabilities in the experiments.
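The three perturbations above can be summarized in one small function. This is a sketch under stated assumptions: the function name, the `mode` strings and the boolean event flags are hypothetical, and the events are passed in explicitly rather than sampled, since the event probabilities are not specified here.

```python
def transform_streams(r_pos, r_neg, mode, event_pos, event_neg, scale=100.0):
    """Apply one perturbation to a (positive, negative) reward pair.

    r_pos is assumed non-negative and r_neg non-positive. event_pos and
    event_neg say whether the positive-stream and negative-stream events
    are active in the current batch.
    """
    if mode == "muting":
        return (0.0 if event_pos else r_pos), (0.0 if event_neg else r_neg)
    if mode == "scaling":
        return (r_pos * scale if event_pos else r_pos,
                r_neg * scale if event_neg else r_neg)
    if mode == "flipping":
        new_pos, new_neg = 0.0, 0.0
        # A flipped positive reward becomes a penalty in the negative stream.
        if event_pos:
            new_neg += -r_pos
        else:
            new_pos += r_pos
        # A flipped penalty becomes a reward in the positive stream.
        if event_neg:
            new_pos += -r_neg
        else:
            new_neg += r_neg
        return new_pos, new_neg
    raise ValueError(f"unknown mode: {mode}")

muted = transform_streams(10.0, -1.0, "muting", True, False)
scaled = transform_streams(10.0, -1.0, "scaling", True, True)
flipped = transform_streams(10.0, -1.0, "flipping", True, False)
```

In a batch setting, the two event flags would be redrawn once per batch of 10 episodes and held fixed within the batch.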
We ran the proposed agents across these different stationarities for 200 episodes over multiple runs and plotted their average final scores with standard errors.
Results. As shown in Figure 4, in all four scenarios the split models demonstrated competitive performance against their baselines. In the CB agent pools, where the state-less agents were not designed for such a complicated gaming environment, we still observe converging learning behavior from these agents. LinUCB, as a CB baseline, performed better than SCTS, which suggests that a different theoretical model may be better suited for integrating the split mechanism in this game environment. However, it is worth noting that in the reward flipping scenario, several mental agents are even more advantageous than the standard split models, as shown in Figure 4(d), which matches clinical discoveries and the theory of evolutionary psychiatry. For instance, ADHD-like fast-switching attention seems to be especially beneficial in this very non-stationary setting of flipping reward streams. Even in a fully stationary setting, the behaviors of these mental agents can have interesting clinical implications. For instance, the video of a CP (“chronic pain”) agent playing PacMan shows a clear avoidance of penalties, by staying at a corner very distant from the ghosts, and a comparative lack of interest in reward pursuit, by not eating nearby Pac-Dots, matching the clinical characteristics of chronic pain patients. From the video, we observe that the agent ignored all the rewards in front of it and spent its life hiding from the ghosts, trying to prolong its life span at all costs, even if the constant time penalty leads to a very negative final score. (Videos of the mental agents playing PacMan after training are available at https://github.com/doerlbh/mentalRL/tree/master/video.)
This research proposes a novel parametric family of algorithms for multi-armed bandits, contextual bandits and RL problems, extending the classical algorithms to model a wide range of potential reward processing biases. Our approach draws inspiration from the extensive literature on decision-making behavior in neurological and psychiatric disorders stemming from disturbances of the reward processing system, and demonstrates the high flexibility of our multi-parameter model, which allows tuning the weights on incoming two-stream rewards and on memories of the prior reward history. Our preliminary results support multiple prior observations about reward processing biases in a range of mental disorders, thus indicating the potential of the proposed model and its future extensions to capture reward-processing aspects across various neurological and psychiatric conditions.
The contribution of this research is two-fold: from the machine learning perspective, we propose a simple yet powerful and more adaptive approach to MAB, CB and RL problems; from the neuroscience perspective, this work is the first attempt at a general, unifying model of reward processing and its disruptions across a wide population including both healthy subjects and those with mental disorders, which has the potential to become a useful computational tool for neuroscientists and psychiatrists studying such disorders. Among the directions for future work, we plan to investigate the optimal parameters in a series of computer games evaluated on different criteria, for example, longest survival time vs. highest final score. Further work includes exploring multi-agent interactions under different reward processing biases. These discoveries can help build more interpretable real-world humanoid decision-making systems. On the neuroscience side, the next steps would include further tuning and extending the proposed model to better capture observations in the modern literature, as well as testing the model on both healthy subjects and patients with mental conditions.
References
-  (2012) Analysis of Thompson sampling for the multi-armed bandit problem. In COLT 2012 - The 25th Annual Conference on Learning Theory, June 25-27, 2012, Edinburgh, Scotland, pp. 39.1–39.26.
-  (2013) Thompson sampling for contextual bandits with linear payoffs. In ICML (3), pp. 127–135.
-  (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learning 47 (2-3), pp. 235–256.
-  (2002) The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32 (1), pp. 48–77.
-  (2002) The nonstochastic multiarmed bandit problem. SIAM Journal on Computing 32 (1), pp. 48–77.
-  (1998) On-line learning with malicious noise and the closure algorithm. Ann. Math. Artif. Intell. 23 (1-2), pp. 83–99.
-  Midbrain Dopamine Neurons Encode a Quantitative Reward Prediction Error Signal. Neuron 47 (1), pp. 129–141.
-  (1994) Insensitivity to future consequences following damage to human prefrontal cortex. Cognition 50 (1-3), pp. 7–15.
-  Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 19–26.
-  (2016) Multi-armed bandit problem with known trend. Neurocomputing 205, pp. 16–21.
-  (2017) Context attentive bandits: contextual bandit with restricted context. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 1468–1475.
-  (2017) Bandit models of human behavior: reward processing in mental disorders. In International Conference on Artificial General Intelligence, pp. 237–248.
-  (2011) An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pp. 2249–2257.
-  (2008) Reinforcement learning: the good, the bad and the ugly. Current Opinion in Neurobiology 18 (2), pp. 185–196.
-  (2017) Parallel reward and punishment control in humans and robots: safe reinforcement learning using the MaxPain algorithm. In 2017 Joint IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), pp. 140–147.
-  (2003) Learning rates for Q-learning. Journal of Machine Learning Research 5 (Dec), pp. 1–25.
-  (2006) A Mechanistic Account of Striatal Dopamine Function in Human Cognition: Psychopharmacological Studies With Cabergoline and Haloperidol. Behavioral Neuroscience 120 (3), pp. 497–517.
-  (2004) By carrot or by stick: cognitive reinforcement learning in parkinsonism. Science 306 (5703), pp. 1940–1943.
-  (2010) Cognitive mechanisms underlying risky decision-making in chronic cannabis users. Journal of Mathematical Psychology 54 (1), pp. 28–38.
-  (2014) Phasic Dopamine Release in the Rat Nucleus Accumbens Symmetrically Encodes a Reward Prediction Error Term. Journal of Neuroscience 34 (3), pp. 698–704.
-  (2010) Double Q-learning. In Advances in Neural Information Processing Systems, pp. 2613–2621.
-  (2018-02) The Myth of Optimality in Clinical Neuroscience. Trends in Cognitive Sciences 22 (3), pp. 241–257.
-  (2012) Iowa gambling task: there is more to consider than long-term outcome. Using a linear equation model to disentangle the impact of outcome and frequency of gains and losses. Frontiers in Neuroscience 6, pp. 61.
-  (1985) Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6 (1), pp. 4–22.
-  (Website)
-  The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pp. 817–824.
-  (2011) Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In WSDM, I. King, W. Nejdl, and H. Li (Eds.), pp. 297–306.
-  Contextual bandit with adaptive feature extraction. In 2018 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 937–944.
-  (2019) Split Q learning: reinforcement learning with two-stream rewards. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 6448–6449.
-  (2020-05) A story of two streams: reinforcement learning models from human behavior and neuropsychiatry. In Proceedings of the Nineteenth International Conference on Autonomous Agents and Multi-Agent Systems, AAMAS-20, pp. 744–752.
-  (2020) Diabolical games: reinforcement learning environments for lifelong learning.
-  (2011) From reinforcement learning models to psychiatric and neurological disorders. Nature Neuroscience 14 (2), pp. 154–162.
-  (2004) Dissociable Roles of Ventral and Dorsal Striatum in Instrumental. Science 304 (16 April), pp. 452–454.
-  (2015) Reward processing in neurodegenerative disease. Neurocase 21 (1), pp. 120–133.
-  (1994) On-line Q-learning using connectionist systems. Vol. 37, University of Cambridge, Department of Engineering, Cambridge, England.
-  (1997-03) A Neural Substrate of Prediction and Reward. Science 275 (5306), pp. 1593–1599.
-  (2007-04) The neurobiology of punishment. Nature Reviews Neuroscience 8 (4), pp. 300–311.
-  (2015) Data from 617 healthy participants performing the Iowa gambling task: a "many labs" collaboration. Journal of Open Psychology Data 3 (1), pp. 340–353.
-  (1998) Introduction to reinforcement learning. Vol. 135, MIT Press, Cambridge.
-  (1998) Introduction to reinforcement learning. 1st edition, MIT Press, Cambridge, MA, USA.
-  (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25, pp. 285–294.
-  (1981) The Framing of Decisions and the Psychology of Choice. Science 211 (4481), pp. 453–458.
Appendix A Further Motivation from Neuroscience
In this section, we provide further discussion and a literature review of the neuroscience and clinical studies related to reward processing systems.
Cellular computation of reward and reward violation. Decades of evidence have linked dopamine function to reinforcement learning via neurons in the midbrain and their connections in the basal ganglia, limbic regions, and cortex. Firing rates of dopamine neurons computationally represent reward magnitude, expectancy, and violations (prediction error), as well as other value-based signals. This allows an animal to update and maintain value expectations associated with particular states and actions. When functioning properly, this helps an animal develop a policy to maximize outcomes by approaching or choosing cues with higher expected value and avoiding cues associated with loss or punishment. The mechanism is conceptually similar to reinforcement learning as widely used in computing and robotics, suggesting mechanistic overlap between humans and AI. Evidence consistent with Q-learning and actor-critic models has been observed in the spiking activity of midbrain dopamine neurons in primates and in the human striatum using the BOLD signal.
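As an illustration of the prediction-error computation described above, the following minimal sketch applies one tabular Q-learning update. It is a generic textbook-style update, not the model proposed in this paper; the function name `q_learning_step`, the state and action labels, and the values of `alpha` and `gamma` are hypothetical choices made only for this example.

```python
from collections import defaultdict


def q_learning_step(q, state, action, reward, next_state, actions,
                    alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update and return the prediction error.

    alpha (learning rate) and gamma (discount factor) are illustrative values.
    """
    # Best value expected from the next state under the current estimates.
    best_next = max(q[(next_state, a)] for a in actions)
    # Prediction error: received reward plus discounted future value,
    # minus what was expected for this state-action pair.
    delta = reward + gamma * best_next - q[(state, action)]
    # Move the stored expectation a fraction alpha toward the new evidence.
    q[(state, action)] += alpha * delta
    return delta


# Hypothetical example: a cue that is approached and followed by a reward.
q = defaultdict(float)  # all value expectations start at zero
actions = ["approach", "avoid"]
delta = q_learning_step(q, "cue", "approach", 1.0, "end", actions)
```

An unexpected reward yields a positive prediction error and raises the stored value of the chosen action. Repeating the same experience produces a smaller error, mirroring the way the signal shrinks as a reward becomes expected.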
Positive vs. negative learning signals. Phasic dopamine signaling represents bidirectional (positive and negative) coding of prediction error signals, but the underlying mechanisms differ for reward learning relative to punishment learning. Though the representation of cellular-level aversive error signaling has been debated, it is widely thought that rewarding, salient information is represented by phasic dopamine signals, whereas reward omission or punishment signals are represented by dips or pauses in baseline dopamine firing. These mechanisms have downstream effects on motivation, approach behavior, and action selection. Reward signaling in a direct pathway links striatum to cortex via dopamine neurons that disinhibit the thalamus via the internal segment of the globus pallidus and facilitate action and approach behavior. Alternatively, aversive signals may have the opposite effect in the indirect pathway, mediated by D2 neurons that inhibit thalamic function and, ultimately, action. Manipulating these circuits through pharmacological measures or disease has demonstrated computationally predictable effects that bias learning from positive or negative prediction errors in humans, and contributes to our understanding of perceptible differences in human decision making when differentially motivated by loss or gain.
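The bias toward learning from positive or negative prediction errors can be sketched with a value update that uses separate learning rates for the two signs of the error. This is a simplified illustration under assumed parameters, not the algorithm proposed in this paper; the names `biased_update`, `run`, `alpha_pos` and `alpha_neg`, and the reward sequence, are hypothetical.

```python
def biased_update(value, reward, alpha_pos=0.2, alpha_neg=0.05):
    """Update a value estimate with sign-dependent learning rates."""
    delta = reward - value  # prediction error
    # Positive and negative errors are weighted by different rates.
    alpha = alpha_pos if delta > 0 else alpha_neg
    return value + alpha * delta


def run(rewards, alpha_pos, alpha_neg):
    """Return the final value estimate after a sequence of outcomes."""
    value = 0.0
    for r in rewards:
        value = biased_update(value, r, alpha_pos, alpha_neg)
    return value


# Hypothetical environment: gains and losses alternate with equal magnitude,
# so the true average outcome is zero.
rewards = [1.0, -1.0] * 50
reward_biased = run(rewards, alpha_pos=0.2, alpha_neg=0.05)
punish_biased = run(rewards, alpha_pos=0.05, alpha_neg=0.2)
```

Both agents see identical outcomes, yet the agent that weights positive errors more heavily ends with a positive value estimate and the agent that weights negative errors more heavily ends with a negative one.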
Clinical implications. Many symptoms of neurological and psychiatric disease are related to biases in learning from positive and negative feedback, which highlights the importance of using computational models to understand and predict disease outcomes. Studies in humans have shown that when reward signaling in the direct pathway is over-expressed, this may enhance the value associated with a state and lead to pathological reward-seeking behavior, such as gambling or substance use. Conversely, when aversive error signals are enhanced, this dampens the experience of reward and increases motor inhibition, causing symptoms that decrease motivation, such as apathy, social withdrawal, fatigue, and depression. Further, it has been proposed that exposure to a particular distribution of experiences during critical periods of development can biologically predispose an individual to learn from positive or negative outcomes, making them more or less susceptible to brain-based illnesses. These points highlight the need for a greater understanding of how intelligent systems differentially learn from rewards or punishments, and how experience sampling may impact reinforcement learning during influential training periods.