
Modeling Mobile Health Users as Reinforcement Learning Agents

by Eura Shin, et al.
University of California, San Diego

Mobile health (mHealth) technologies empower patients to adopt and maintain healthy behaviors in their daily lives by providing interventions (e.g. push notifications) tailored to the user's needs. In these settings, without intervention, human decision making may be impaired (e.g. valuing near-term pleasure over their own long-term goals). In this work, we formalize this relationship with a framework in which the user optimizes a (potentially impaired) Markov Decision Process (MDP) and the mHealth agent intervenes on the user's MDP parameters. We show that different types of impairments imply different types of optimal intervention. We also provide analytical and empirical explorations of these differences.


1 Introduction

In digital health, a human user and a mobile health (mHealth) application work together to achieve user-specified behavioral goals. For example, the user may own a physical therapy (PT) app that guides them through a physician-recommended daily exercise routine. To plan effective intervention, the app agent maintains a model of the user’s behavior. In our paper, the app agent models the user’s decision making process as that of a reinforcement learning (RL) agent. As a result, there are two RL agents that operate in this scenario. The first RL agent is the autonomous app agent, whose policy provides personalized interventions (e.g. a push notification about the importance of exercising daily) to the user in order to maintain healthy behavior (e.g. the PT routine). The second RL agent is the user, whose action space is binary: they choose whether or not to engage in the suggested behavior (e.g. do the PT routine).

Even though the user and app agents share the same goal (long-term behavior change), without intervention the user’s default decision may not be to engage in the target behavior, due to systematic impairments in the human’s decision making. For example, the user may be myopic; that is, they may heavily discount future rewards (Story et al., 2014). In our PT example, a fully rehabilitated shoulder may seem too far in the future to motivate the user toward the goal. Prior work has tried to infer the user’s impairment from demonstration. In this paper, we explore a complementary problem: assuming we know the user’s impairment, what should the app agent do about it?

Our contributions. In this work, we explore effective ways for the app agent to intervene to maintain users’ goal-oriented progress, in situations where the user would have otherwise disengaged. To do this, we propose a formal framework that represents the user as an RL agent, wherein different parametrizations of the user’s Markov Decision Process (MDP) capture a range of commonly observed user behaviors in mHealth: state-dependent motivation, disengagement, and difficulty of adherence. For example, a myopic user is represented as an agent planning with a small discount factor. Furthermore, our framework formalizes the mechanism through which the app agent’s interventions affect the user’s decisions; namely, the app agent intervenes on the user’s MDP parameters. For example, intervening on a myopic user corresponds to increasing the user’s discount factor. Finally, as a precursor to user studies, we use our framework to extract concrete intervention strategies that are expected to work well for a given type of user. We end with a discussion of the interesting behavioral and computational open questions that arise within this framework.

Related work. Reinforcement learning is frequently used to model the complex mechanisms underlying human behavior, from the firing of dopaminergic neurons in the brain (Niv, 2009; Shteingart and Loewenstein, 2014) to disorders in computational psychiatry (Maia and Frank, 2011; Chen et al., 2015). In digital health, RL has been used to model maladaptive eating behaviors (Taylor et al., 2021). Although these settings use RL to produce models of human decision making, the models themselves are not used to enrich planning for an autonomous agent.

In some settings, such as human-robot interaction (Tejwani et al., 2022; Xie et al., 2021; Losey and Sadigh, 2019) or assistive AI (Chen et al., 2022; Reddy et al., 2018), the human is modeled to inform the decisions of another RL algorithm. Though these applications can be formalized as a multi-agent RL problem (i.e. a two-player cooperative game), we formalize ours as a single-agent problem, where we must solve for the mHealth agent’s policy. This choice is motivated by our setting, where the strongest detectable effect of the app’s intervention on the user tends to be immediate and transitory.

Modeling the actions of an external, human RL agent requires inferring the parameters that drive their policy. In Inverse Reinforcement Learning (IRL), this means inferring the user agent’s reward function (Ziebart et al., 2008). IRL has been applied in digital health to infer user preferences: Zhou et al. (2018) infer the user’s utility function in order to set adaptive daily step goals. Similar to Herman et al. (2016) and Reddy et al. (2018), whose goal is to infer the transition dynamics, we are interested in modeling the user’s entire decision making process, beyond the rewards.

Despite evidence of humans demonstrating systematic behavioral impairments (e.g. myopic planning), most IRL approaches assume that human agents act optimally, or near-optimally, with respect to a task. To improve collaboration with humans, prior work has focused on representing and inferring these impairments from human actions (Evans et al., 2016; Laidlaw and Dragan, 2022; Shah et al., 2019; Jarrett et al., 2021). However, the goal of that work has been for the autonomous agent to function in collaborative settings despite these impairments, not to intervene on them.

2 Computational Framework for Behavior Change in Digital Health

In this section, we capture the basic dynamics of a digital health application in a one-dimensional gridworld (visualized in fig. 1): this is the environment in which the user and app agents operate. Although simple, this environment captures many basic elements of behavior change in digital health, such as the difficulty of making progress, potential for disengagement, and goals.

The single dimension represents the user’s current progress toward the behavior goal, for example, the current strength of the rehabilitated joint. Let represent this world state. There are two absorbing states: one for disengagement (), which happens when the user stops doing PT and/or deletes the app, and another for reaching a “goal” (), when the joint meets the desired level of strength. When the user practices PT, they go right with probability , which means they increase the strength of the joint. If (and only if) the user abstains from PT, they disengage with probability . For now, it is impossible for the user to lose progress (go backwards).
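The world dynamics described above can be sketched as a single sampling function. This is a minimal illustration, not the authors' implementation: the names `world_step`, `p_progress`, `p_disengage`, and the `DISENGAGED` label are hypothetical stand-ins, because the original symbols are not recoverable here.

```python
import random

DISENGAGED = -1  # hypothetical label for the absorbing disengagement state


def world_step(state, action, goal, p_progress, p_disengage, rng):
    """Sample the next world state of the one-dimensional progress chain.

    state: integer progress in 0..goal, or DISENGAGED.
    action: 1 if the user does the behavior (e.g. PT), 0 if they abstain.
    """
    if state == DISENGAGED or state == goal:
        return state  # both states are absorbing
    if action == 1:
        # Doing the behavior moves the user right or leaves them in place.
        return state + 1 if rng.random() < p_progress else state
    # Abstaining never loses progress, but risks disengagement.
    return DISENGAGED if rng.random() < p_disengage else state


# Example rollout with assumed values: a user who does PT every day.
rng = random.Random(0)
state = 0
for _ in range(30):
    state = world_step(state, action=1, goal=5,
                       p_progress=0.5, p_disengage=0.1, rng=rng)
print("final state:", state)
```

Because progress is never lost, the only ways to leave a non-absorbing state are one step to the right or a jump to disengagement.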

The two RL entities, the app agent and user agent, interface with this world in different ways. The user directly interacts with the world: they ultimately decide when and how to move between states (e.g. the app cannot perform the PT exercises for the user). The app indirectly interacts with the world through the user; by influencing the user’s decision making process, the app influences how the user moves in the world. In the next two sections, we formalize how the two agents perceive and plan actions within this world.

Figure 1: Visualization of states and transitions in digital health gridworld. Arrows are marked with the required action and probability of transitioning between states.

| Interv. | Effect | Example |
| | Increases by | Use of implementation intentions (Wieber et al., 2015) to reduce the effort of starting the PT. |
| | Decreases by | Use of multi-day “streaks”; disengaging means losing the streak. |
| | Increases by | Magen et al. (2008)’s hidden-zero framing: “Skip PT today and lose shoulder mobility in the future OR do PT today and regain shoulder mobility in the future.” |
| | Increases by | Prompt self-monitoring of progress through graphs and metrics. |

Table 1: Definitions of app interventions. The magnitude of can be interpreted as the effectiveness of the app’s intervention on . We assume the ’s are fixed over time. Examples are purely demonstrative and not final interventions.

2.1 User agent: Model of user’s decision making

In this section we model the user’s internal decision making process. The user’s decision making is guided by a policy. The policy is optimal for a Markov Decision Process , where is a set of user specific parameters defined below:

State. The user observes the app’s intervention, which we call and the current progress towards the goal, .

Action. At each time step, the user decides to perform () or not perform () the behavior. For example, the user decides daily whether or not to do PT.

Rewards. The user anticipates that actions in certain states will have consequences. Doing PT may incur burden () which the user perceives as a negative consequence. Reaching the goal state (rehabilitating the shoulder) may be of great positive consequence (). Finally, disengaging could either have a positive or negative consequence (). The user’s perceived rewards are:


Transitions. The user believes that there is probability that doing the behavior will make them progress toward the goal. Note that there can be a difference between the user’s perceived and true probability of making progress, . For example, the user may think there is a low probability, , that doing PT will make them stronger. In actuality, the PT may be very effective, and the user progresses toward the goal at a much higher rate, . In this model, only the value of affects the user’s decision on whether or not to take an action.

Discount. The user exponentially discounts future rewards via .

To recap, the user’s behavior is governed by the following parameters: . Learning these parameters corresponds to learning the user’s decision making process. In practice, all of these parameters are internal to the user and must be inferred.
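As a rough illustration of how these parameters drive behavior, the sketch below solves a user MDP by value iteration. It is not the authors' code: the field names in `UserParams` are hypothetical, and the reward timing (burden paid on every attempt, goal reward collected on the transition into the goal, disengagement reward collected on the transition into disengagement) is an assumption based on the description above and in appendix A. It also assumes the user knows the disengagement probability.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class UserParams:
    """Hypothetical container for the user's MDP parameters."""
    r_burden: float      # reward for doing the behavior (typically negative)
    r_goal: float        # reward for reaching the goal state
    r_disengage: float   # reward (or cost) of disengaging
    p_progress: float    # user's perceived probability of progressing
    p_disengage: float   # probability of disengaging when abstaining
    gamma: float         # discount factor


def user_q_values(params, max_distance, tol=1e-12, max_iter=10_000):
    """Value iteration over distance-to-goal d = 1..max_distance.

    Index 0 is the goal; both absorbing states have value 0 (assumption).
    Returns (q_act, q_skip), each indexed by d.
    """
    v = [0.0] * (max_distance + 1)
    q_act, q_skip = list(v), list(v)
    for _ in range(max_iter):
        q_act, q_skip = [0.0], [0.0]
        for d in range(1, max_distance + 1):
            goal_bonus = params.r_goal if d == 1 else 0.0
            q_act.append(
                params.r_burden + params.p_progress * goal_bonus
                + params.gamma * (params.p_progress * v[d - 1]
                                  + (1.0 - params.p_progress) * v[d]))
            q_skip.append(
                params.p_disengage * params.r_disengage
                + params.gamma * (1.0 - params.p_disengage) * v[d])
        new_v = [max(a, s) for a, s in zip(q_act, q_skip)]
        delta = max(abs(x - y) for x, y in zip(new_v, v))
        v = new_v
        if delta < tol:
            break
    return q_act, q_skip


def user_policy(params, max_distance):
    """1 = do the behavior, 0 = abstain, for d = 1..max_distance."""
    q_act, q_skip = user_q_values(params, max_distance)
    return [1 if a > s else 0 for a, s in zip(q_act[1:], q_skip[1:])]


# Assumed illustrative values, not the paper's defaults.
myopic = UserParams(r_burden=-1.0, r_goal=10.0, r_disengage=0.0,
                    p_progress=1.0, p_disengage=0.1, gamma=0.5)
farsighted = replace(myopic, gamma=0.95)
print("myopic:    ", user_policy(myopic, 5))
print("farsighted:", user_policy(farsighted, 5))
```

With these assumed numbers, the myopic user only acts when close to the goal, while the farsighted user acts from every distance considered.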

2.2 App agent: digital intervention policy

In this section, we define the app agent, who helps the user achieve their goal by maximizing the frequency at which the user performs the behavior.

State. The app observes the world state and the user’s . Unlike , which is the user’s intended action, is the observed action. This accounts for some degree of randomness in the user’s life: although they had planned to start PT for the day (), the user may receive an urgent phone call that prevents it from happening (). If the user decides , then the app observes with probability and with probability . If the user decides , then always.

Action. The action space consists of a separate intervention for each user parameter, described in table 1. We exclude an intervention on because it is similar to an intervention on (both increase motivation toward the long-term goal). We assume app actions affect the user’s decision making in the moment, and not permanently. For example, an intervention may increase in one timestep, but it will revert to the original value in the next timestep, when the user has to decide again whether to perform the action.
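One way to picture a transient intervention is as a function that returns a modified copy of the user's parameters for a single decision, leaving the stored parameters untouched. The sketch below is illustrative only: the parameter names, the mapping of table 1's four rows onto them, and the clipping of the discount factor and probability at 1 are assumptions.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class UserParams:
    """Hypothetical subset of user parameters that the app can target."""
    r_burden: float
    r_disengage: float
    gamma: float
    p_progress: float


def intervene(params, target, delta):
    """Return the parameters the user plans with for this one decision.

    The input object is frozen, so the user's baseline is never modified;
    the next timestep starts again from the baseline.
    """
    if target == "r_burden":      # make the behavior feel less costly
        return replace(params, r_burden=params.r_burden + delta)
    if target == "r_disengage":   # make disengaging feel more costly
        return replace(params, r_disengage=params.r_disengage - delta)
    if target == "gamma":         # reduce discounting, capped at 1
        return replace(params, gamma=min(1.0, params.gamma + delta))
    if target == "p_progress":    # raise perceived efficacy, capped at 1
        return replace(params, p_progress=min(1.0, params.p_progress + delta))
    raise ValueError(f"unknown intervention target: {target}")


baseline = UserParams(r_burden=-1.0, r_disengage=0.0, gamma=0.5, p_progress=0.3)
today = intervene(baseline, "gamma", 0.7)   # effect applies to today's decision
print(today.gamma, baseline.gamma)          # baseline is unchanged
```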

Rewards. The app receives a negative reward if the user takes no action and a positive reward if the user takes an action following intervention: .

Transitions. The app experiences transitions according to the true probability of making progress ( instead of ) and the true probability that the user executes an action ( instead of ).
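The app-side transition model can be sketched as a small probability table. This is a hedged illustration: `p_execute`, `p_true`, and `p_disengage` are hypothetical names, and the sketch assumes that an intended action that fails to execute is treated by the world like abstaining, which the text does not state explicitly.

```python
def app_transition_probs(intended_action, p_execute, p_true, p_disengage):
    """One-step outcome probabilities from the app's point of view.

    intended_action: the user's planned action (1 = do the behavior).
    p_execute: probability that a planned action is actually carried out.
    p_true: true probability that a carried-out action yields progress.
    p_disengage: probability of disengaging when no action is carried out.
    """
    p_observed_act = p_execute if intended_action == 1 else 0.0
    advance = p_observed_act * p_true
    disengage = (1.0 - p_observed_act) * p_disengage
    stay = 1.0 - advance - disengage
    return {"advance": advance, "stay": stay, "disengage": disengage}


# Assumed illustrative values.
print(app_transition_probs(1, p_execute=0.8, p_true=0.5, p_disengage=0.1))
print(app_transition_probs(0, p_execute=0.8, p_true=0.5, p_disengage=0.1))
```

Note that the user's perceived probability of progress does not appear here; it only matters for which action the user intends.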

2.3 Intervening on the user’s parameters to change user’s value function

The app agent’s reward and transition functions depend on the actions of the user. Therefore it is critical to understand how potential changes to the user’s MDP parameters, , affect the user’s policy. At a given state, the user’s policy is to perform the behavior if:


where is the user’s current distance from the goal state and . The derivation of eq. 2 is in appendix A.

At a high level, component [1] is the burden the user would accumulate in order to reach the goal (and so depends on ), component [2] is the temporally-discounted value of the goal , and component [3] is the relative consequence of disengagement. From the user’s perspective, it is worth trying to reach the goal if the net benefit on the left side outweighs the potential consequence of disengaging on the right side of eq. 2. Two key insights from this equation provide intuition for the intervention strategies we discover in section 3.2:

  1. Interventions on , , and are technically “unbounded” in effectiveness because they can be any real number. An intervention that makes could make the inequality hold (thus ) for any relatively small values of and .

  2. Interventions on and are “bounded” in this model because they cannot exceed a value of . Even if an effective app intervention causes and , there could still be a situation in which the user is so far from the goal state that they choose not to take action. That is, .
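The bounded versus unbounded distinction can be checked numerically. The sketch below does not reproduce eq. 2; it compares the value of always acting with the value of always abstaining under an assumed reward timing (burden on each attempt, goal reward on entering the goal, disengagement reward on entering disengagement). All function and argument names are hypothetical.

```python
def value_always_act(d, r_burden, r_goal, p, gamma):
    """Value of acting at every step, starting d steps from the goal."""
    c = 1.0 - gamma * (1.0 - p)        # accounts for failed attempts
    v = (r_burden + p * r_goal) / c    # one step from the goal
    for _ in range(d - 1):
        v = (r_burden + gamma * p * v) / c
    return v


def value_always_skip(r_disengage, p_disengage, gamma):
    """Value of abstaining forever (a geometric series in closed form)."""
    return p_disengage * r_disengage / (1.0 - gamma * (1.0 - p_disengage))


def user_acts(d, r_burden, r_goal, r_disengage, p, p_disengage, gamma):
    return (value_always_act(d, r_burden, r_goal, p, gamma)
            > value_always_skip(r_disengage, p_disengage, gamma))


# Bounded: even with discount factor and probability pushed to 1,
# a user 12 steps away pays 12 units of burden for a goal worth 10.
far_maxed = user_acts(12, -1.0, 10.0, 0.0, p=1.0, p_disengage=0.1, gamma=1.0)

# Unbounded: making disengagement costly enough flips the same decision.
far_costly = user_acts(12, -1.0, 10.0, -5.0, p=1.0, p_disengage=0.1, gamma=1.0)

print(far_maxed, far_costly)
```

Under these assumed values, no amount of intervention on the two bounded parameters helps the distant user, while a sufficiently large change to a reward parameter does.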

3 Insights from our model

We are now prepared to use our model to represent different kinds of users and to gain insight on (1) whether different users warrant different intervention strategies, and if so, (2) how these strategies differ.

3.1 Representing common user behaviors

We represent two common decision making impairments (extreme discounting and mis-estimation of probabilities) as user agents in our framework. Unless otherwise noted, default parameters for all users in the experiments are:

. We also set , meaning that . In appendix C, we include a sensitivity analysis of the empirical results below when the default parameters are sampled from a range.

Users with extreme discounting. The role of extreme temporal discounting in unhealthy behaviors has been explored in smoking cessation, alcohol use, obesity, and many more health applications (Story et al., 2014). We represent a myopic user with a low discount factor () and a farsighted user with a high discount factor ().

Underconfident and overconfident users. On certain tasks, human decision-makers tend to report overly-extreme values in estimating probabilities (Brenner et al., 1996). In our setting, the user could misestimate the probability of going right. Since, in our experiments, we represent an underconfident user with and an overconfident user with .

Figure 2: The app agent’s intervention strategy differs for different types of users. Panels: (a) myopic user, (b) farsighted user, (c) underconfident user, (d) overconfident user. App’s policy is plotted. The represents “ or .” In some states, no intervention could make the user move (when ). In other states, the user’s default policy is to move, even without intervention (when in fig. 2(d)).

3.2 Intervention strategies

Figure 3: Within a user type, some parameters may be easier to intervene on than others. Panels: (a) underconfident user, (b) overconfident user. Lines represent the minimum effectiveness the intervention must have in order to change the user’s behavior, where is the maximum feasible effect on a parameter. For example, and . Lower is better; low means the intervention can have a small but still change behavior. Remaining users in appendix B.

What interventions to use and when?

Inspection of fig. 2 reveals a recurring pattern in the intervention strategies. The app’s policy always contains two “windows”: Window 1 covers states that are far from the reward and consists of interventions on or . Window 2 covers states that are closer to the reward and consists of interventions only on or only on , depending on the type of user. Below, we give the intuition for why such windows exist.

Window 1: interventions on or . Interventions in this window target user parameters that can affect user behavior regardless of distance from the goal state. Mathematically, this is due to the observation we made about the unboundedness of the value of , , and in section 2.3. Intuitively, if we can make the user feel enjoyment during PT, then they will work out even if it does not help them move toward the goal. In our model, this corresponds to an intervention that makes positive. If we make the consequence of disengagement very high, such as a reminder that the consequence of quitting PT is to never regain full mobility of the shoulder, then the user will work out. In our model, this corresponds to an intervention which makes the right hand threshold in eq. 2 very negative.

Window 2: depending on user type, interventions on or . For myopic and overconfident users, Window 2 interventions are on . For underconfident and farsighted users, Window 2 interventions are on . The respective interventions on and make sense for the myopic and underconfident users, since these are the parameters that are directly impairing the decision making. For overconfident users, intuition comes from realizing that when in eq. 2, the relative values of affect the decision making. Since and are considered in Window 1, this leaves in Window 2. The same holds by considering for farsighted users.

For each user, the best intervention depends on the effectiveness of the intervention.

All of the policies shown in fig. 2 assume that the interventions are maximally effective. In reality, the effect of an intervention may vary across users, even within the same user type. For example, one underconfident user may respond well to interventions on (i.e. is large), while another may not be impacted by such interventions at all (i.e. ). Consider fig. 3(a). Suppose the underconfident user is in state . Although intervening on , , or is valid, the interventions on and would require a high level of effectiveness near , while the intervention on must be at least effective. To account for variability in effectiveness for different underconfident users, it may be best to intervene on in this scenario. However, for the overconfident user in fig. 3(b), all interventions seem equally preferable, since they require similar levels of effectiveness.
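A minimum required effectiveness of this kind can be estimated by bisection, provided a larger effect never makes the user less likely to act. The sketch below is a generic search, not the procedure used for the figures. The decision rule in the example is a hypothetical stand-in: a user 12 steps from the goal who acts only if accumulated burden plus goal reward exceeds the disengagement reward.

```python
def min_effect_to_act(acts_with_delta, delta_max, tol=1e-6):
    """Smallest effect in [0, delta_max] that makes the user act.

    acts_with_delta: function mapping an effect size to True/False.
    Assumes monotonicity: if an effect works, any larger effect works too.
    Returns None if even delta_max is not enough.
    """
    if acts_with_delta(0.0):
        return 0.0                  # user already acts without intervention
    if not acts_with_delta(delta_max):
        return None                 # no feasible effect changes behavior
    lo, hi = 0.0, delta_max         # invariant: lo fails, hi works
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if acts_with_delta(mid):
            hi = mid
        else:
            lo = mid
    return hi


# Hypothetical decision rule with assumed values: burden -1 per step,
# goal reward 10, disengagement reward 0, distance 12.
def acts_if_burden_reduced(delta):
    return 12 * (-1.0 + delta) + 10.0 > 0.0


needed = min_effect_to_act(acts_if_burden_reduced, delta_max=1.0)
print("minimum burden reduction:", needed)
```

Dividing the result by `delta_max` gives a normalized effectiveness between 0 and 1, which is one plausible reading of the vertical axis in fig. 3.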

4 Conclusion and Future Work

In this work, we represented the user’s decision making process with reinforcement learning and formalized how interventions from the app agent affect the user’s MDP parameters. Our framework exposes interesting research questions that span computer and behavioral science.

Computational questions. Do these computational insights generalize to real world users? Assuming we can infer a user’s parameters when they download the mHealth app, how do we incorporate knowledge of these parameters in guiding the app’s interventions? How do we modify this framework to represent interventions that have a permanent, or at least a multi-timestep effect, on the user’s parameters?

Behavioral questions. What are the most worthwhile ways to add complexity to the world model in order to make it more representative of human behavior (e.g. making probability of disengagement depend on state, adding more states, etc.)? What set of questions, cognitive tasks, or sensor readings can we use to estimate the user’s MDP parameters? How do we design effective interventions for each of these MDP parameters?

5 Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. IIS-2107391 and the National Institute of Biomedical Imaging and Bioengineering and the Office of the Director of the National Institutes of Health under award number P41EB028242. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


References

  • L. A. Brenner, D. J. Koehler, V. Liberman, and A. Tversky (1996) Overconfidence in probability and frequency judgments: a critical examination. Organizational Behavior and Human Decision Processes 65 (3), pp. 212–219.
  • C. Chen, T. Takahashi, S. Nakagawa, T. Inoue, and I. Kusumi (2015) Reinforcement learning in depression: a review of computational research. Neuroscience & Biobehavioral Reviews 55, pp. 247–267.
  • K. Chen, J. Fong, and H. Soh (2022) MIRROR: differentiable deep social projection for assistive human-robot communication. arXiv preprint arXiv:2203.02877.
  • O. Evans, A. Stuhlmüller, and N. Goodman (2016) Learning the preferences of ignorant, inconsistent agents. In Thirtieth AAAI Conference on Artificial Intelligence.
  • M. Herman, T. Gindele, J. Wagner, F. Schmitt, and W. Burgard (2016) Inverse reinforcement learning with simultaneous estimation of rewards and dynamics. In Artificial Intelligence and Statistics, pp. 102–110.
  • D. Jarrett, A. Hüyük, and M. Van Der Schaar (2021) Inverse decision modeling: learning interpretable representations of behavior. In International Conference on Machine Learning, pp. 4755–4771.
  • C. Laidlaw and A. Dragan (2022) The Boltzmann policy distribution: accounting for systematic suboptimality in human models. arXiv preprint arXiv:2204.10759.
  • D. P. Losey and D. Sadigh (2019) Robots that take advantage of human trust. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7001–7008.
  • E. Magen, C. S. Dweck, and J. J. Gross (2008) The hidden zero effect: representing a single choice as an extended sequence reduces impulsive choice. Psychological Science 19 (7), pp. 648.
  • T. V. Maia and M. J. Frank (2011) From reinforcement learning models to psychiatric and neurological disorders. Nature Neuroscience 14 (2), pp. 154–162.
  • Y. Niv (2009) Reinforcement learning in the brain. Journal of Mathematical Psychology 53 (3), pp. 139–154.
  • S. Reddy, A. Dragan, and S. Levine (2018) Where do you think you’re going?: inferring beliefs about dynamics from behavior. Advances in Neural Information Processing Systems 31.
  • R. Shah, N. Gundotra, P. Abbeel, and A. Dragan (2019) On the feasibility of learning, rather than assuming, human biases for reward inference. In International Conference on Machine Learning, pp. 5670–5679.
  • H. Shteingart and Y. Loewenstein (2014) Reinforcement learning and human behavior. Current Opinion in Neurobiology 25, pp. 93–98.
  • G. W. Story, I. Vlaev, B. Seymour, A. Darzi, and R. J. Dolan (2014) Does temporal discounting explain unhealthy behavior? A systematic review and reinforcement learning perspective. Frontiers in Behavioral Neuroscience 8, pp. 76.
  • V. A. Taylor, I. Moseley, S. Sun, R. Smith, A. Roy, V. U. Ludwig, and J. A. Brewer (2021) Awareness drives changes in reward value which predict eating behavior change: probing reinforcement learning using experience sampling from mobile mindfulness training for maladaptive eating. Journal of Behavioral Addictions 10 (3), pp. 482–497.
  • R. Tejwani, Y. Kuo, T. Shu, B. Katz, and A. Barbu (2022) Social interactions as recursive MDPs. In Conference on Robot Learning, pp. 949–958.
  • F. Wieber, J. L. Thürmer, and P. M. Gollwitzer (2015) Promoting the translation of intentions into action by implementation intentions: behavioral effects and physiological correlates. Frontiers in Human Neuroscience, pp. 395.
  • A. Xie, D. Losey, R. Tolsma, C. Finn, and D. Sadigh (2021) Learning latent representations to influence multi-agent interaction. In Conference on Robot Learning, pp. 575–588.
  • M. Zhou, Y. Mintz, Y. Fukuoka, K. Goldberg, E. Flowers, P. Kaminsky, A. Castillejo, and A. Aswani (2018) Personalizing mobile fitness apps using reinforcement learning. In CEUR Workshop Proceedings, Vol. 2068.
  • B. D. Ziebart, A. L. Maas, J. A. Bagnell, A. K. Dey, et al. (2008) Maximum entropy inverse reinforcement learning. In AAAI, Vol. 8, pp. 1433–1438.

Appendix A Derivation of the user’s optimal value function

In this section, we derive the value function of the user’s optimal policy, , with respect to a user’s MDP parameters. For brevity, we will refer to this as . Similarly, while the user’s MDP parameters are subscripted in the main text (e.g. ), we will forgo these subscripts here ().

In our setting, once the optimal policy is to go right in a given state, the best strategy is to continue going right in subsequent states that are closer to the goal. That is, if , then . The opposite is also true; if the optimal policy is to stay in place in a given state, then the best strategy in a state that is farther away from the goal is also to stay in place — if , then .

The optimal policy chooses the maximum value between the value of a policy that always chooses to go right, , and the value of a policy that always chooses to stay in place, :


The derivation for is as follows:


We will derive for specific states, and generalize these findings to a general equation. First, we will derive the value of a state which is right before the goal state, :


In eq. 9, the term represents the expected reward of taking action in state . Recall that when the user moves right with probability and stays in place with probability . The user always receives a reward of for choosing and potentially a reward of if they transition to the goal state. Then the calculation for the expected reward is:


Next, we derive the value of a state which is two spaces away from the goal state, :


Notice that in eq. 11, we can apply the Bellman equation to “recursively” expand the form of the value function, so that it can be written as an infinite geometric series:


We will apply eq. 12 to our final derivation of :


In general, for any state , the value function is:


where .
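For orientation, the recursion and geometric-series argument described in this appendix can be sketched as follows. This is a reconstruction under assumptions, not the original equations: the symbols $b$ (burden reward), $g$ (goal reward), $p$ (perceived probability of progress), $\gamma$ (discount factor), $\rho$, and $V_d$ (value of always going right from $d$ steps before the goal, with the goal itself worth $0$ afterwards) are hypothetical notation, and the closed form assumes $0 < \gamma < 1$.

```latex
% One step before the goal: pay b, reach the goal with probability p,
% otherwise stay in place and try again.
V_1 = b + p\,g + \gamma (1-p)\, V_1
\quad\Longrightarrow\quad
V_1 = \frac{b + p\,g}{1 - \gamma(1-p)}

% d >= 2 steps before the goal: the same Bellman recursion, no immediate goal reward.
V_d = b + \gamma \bigl[ p\, V_{d-1} + (1-p)\, V_d \bigr]
\quad\Longrightarrow\quad
V_d = \frac{b + \gamma p\, V_{d-1}}{1 - \gamma(1-p)}

% Unrolling the recursion gives a finite geometric series in rho.
\rho = \frac{\gamma p}{1 - \gamma(1-p)},
\qquad
V_d = \frac{b\,(1 - \rho^{d})}{1 - \gamma} + \frac{\rho^{d}\, g}{\gamma}
```

As a sanity check under these assumptions, setting $d = 1$ in the last expression recovers the first line, since $1 - \rho = (1-\gamma)/(1-\gamma(1-p))$ and $\rho/\gamma = p/(1-\gamma(1-p))$.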

Recall eq. 3, where we defined the optimal value function with respect to and . Replacing these terms with their analytical counterparts, we arrive at:


Along the same lines, the user’s optimal policy is to follow the action of the policy that maximizes the value in the current state:


Appendix B Required intervention effect for all user types

Figure 4: Within a user type, some parameters may be easier to intervene on than others. Panels: (a) myopic users, (b) farsighted users, (c) underconfident users, (d) overconfident users. Lines represent the minimum effect the intervention must have in order to change the user’s behavior, where is the maximum feasible effect on a parameter. For example, and . Lower is better; it means the intervention can have a small effect but still achieve the same behavior change.

Appendix C Sensitivity analysis: sampling the default parameters

The “default” parameters for the user MDP from the main experiments are recapped in the first column of table 2. We checked that the main results of the paper are not sensitive to these parameter settings. To do so, we generated multiple random samples of different default parameters and confirmed that the main intervention patterns from the results still hold. To form the default parameter set, each parameter is uniformly sampled from the ranges in the second column of table 2.

We kept the values of and fixed in this analysis. We fix because the app cannot intervene on and the user’s policy does not depend on , so we do not expect to affect the app’s policy. We fix the value of because we do not consider interventions on and because the value of is meaningful only in relation to the value of , which we do sample.

| Default parameters in main body | Sampled parameter range |
| | not sampled |
| | not sampled |

Table 2: Default user MDP parameters used in the experiments in the main body (left) and the range from which they are sampled in the sensitivity analysis (right).

The main patterns that we expect will continue to hold across different samples of default parameters are:

  1. The myopic and overconfident users will require intervention on or in Window 1 and then intervention on in Window 2 (the sizes of these windows will vary).

  2. The underconfident and farsighted users will require intervention on or in Window 1 and then intervention on in Window 2 (the sizes of these windows will vary).

  3. Right before the goal (i.e. in Window 3), it may be the case that an intervention on any parameter makes enough of a difference to nudge the user to take action.

These patterns hold across different samples of the default parameters, as shown in figs. 5 to 9.

Figure 5: The expected results hold for trial 1 with sampled parameters. Panels show the app policy for the (a) myopic, (b) overconfident, (c) underconfident, and (d) farsighted user.

Figure 6: The expected results hold for trial 2 with sampled parameters. Panels show the app policy for the (a) myopic, (b) overconfident, (c) underconfident, and (d) farsighted user.

Figure 7: For trial 3, all users’ policies are to take action by default. This is likely due to the small value of . The parameters are: Panels show the app policy for the (a) myopic, (b) overconfident, (c) underconfident, and (d) farsighted user.

Figure 8: The expected results hold for trial 4 with sampled parameters. Panels show the app policy for the (a) myopic, (b) overconfident, (c) underconfident, and (d) farsighted user.

Figure 9: The expected results hold for trial 5 with sampled parameters. Panels show the app policy for the (a) myopic, (b) overconfident, (c) underconfident, and (d) farsighted user.