Robust Actor-Critic Contextual Bandit for Mobile Health (mHealth) Interventions

02/27/2018 · by Feiyun Zhu, et al.

We consider the actor-critic contextual bandit for mobile health (mHealth) interventions. State-of-the-art decision-making algorithms generally ignore outliers in the dataset. In this paper, we propose a novel robust contextual bandit method for mHealth. It achieves the conflicting goals of reducing the influence of outliers while seeking a solution similar to that of state-of-the-art contextual bandit methods on datasets without outliers. This performance relies on two techniques: (1) the capped-ℓ_2 norm; (2) a reliable method to set the thresholding hyper-parameter, inspired by one of the most fundamental techniques in statistics. Although the model is non-convex and non-differentiable, we propose an effective reweighted algorithm and provide solid theoretical analyses. We prove that the proposed algorithm finds sufficiently decreasing points after each iteration and finally converges after a finite number of iterations. Extensive experimental results on two datasets demonstrate that our method achieves almost identical results to state-of-the-art contextual bandit methods on datasets without outliers, and significantly outperforms those methods on badly noised datasets with outliers in a variety of parameter settings.





Due to the explosive growth of smart device users globally (e.g., smartphones and wearable devices such as Fitbit), mobile health (mHealth) technologies draw increasing interest from the scientific community [Liao, Tewari, and Murphy2015, Murphy et al.2016]. The goal of mHealth is to deliver in-time interventions to device users, guiding them to lead healthier lives, for example by reducing alcohol abuse [Gustafson et al.2014, Witkiewitz et al.2014] and increasing physical activity [Abby et al.2013]. With advanced smart technologies, mHealth interventions can be formed according to the users’ ongoing statuses and changing needs, which is more portable and flexible than traditional treatments. Therefore, mHealth technologies are widely used in many health-related applications, such as eating disorders, alcohol abuse, mental illness, obesity management and HIV medication adherence [Murphy et al.2016, Liao, Tewari, and Murphy2015].

Formally, the tailoring of mHealth interventions is modeled as a sequential decision-making (SDM) problem. The contextual bandit algorithm provides a framework for the SDM [Tewari and Murphy2017]. In 2014, Lei [Lei, Tewari, and Murphy2014] proposed the first contextual bandit algorithm for the mHealth study. It is in the actor-critic setting [Sutton and Barto2012], which has two advantages compared with the critic-only contextual bandit methods for internet advertising [Li et al.2010]: (a) Lei’s method has an explicit parameterized model for the stochastic policy. By analyzing the estimated parameters in the learned policy, we can identify the key features that contribute most to the policy, which is important to behavioral scientists for the state (feature) design. (b) From the perspective of optimization, the actor-critic algorithm has the desirable property of quick convergence with low variance [Grondman et al.2012].

However, Lei’s method assumes that the states at different decision points are i.i.d. and that the current action only influences the immediate reward [Lei2016]. This assumption rarely holds in real situations. Take the delayed effect in the SDM or mHealth for example: the current action influences not only the immediate reward but also the next state and, through that, all the subsequent rewards [Sutton and Barto2012]. Accordingly, Lei proposed a new method [Lei2016] that emphasizes exploration and reduces exploitation.

Although those two methods serve as a good start for the mHealth study, they assume that the noise in the trajectory follows a Gaussian distribution, and a least-squares-based algorithm is employed to estimate the expected reward. In reality, however, various kinds of complex noise can badly degrade the collected data, for example: (1) the wearable device is unreliable at accurately recording the states and rewards of users under different conditions; (2) the mobile network is unavailable in some areas, which hinders the collection of users’ states as well as the sending of interventions; (3) mHealth relies on self-reported information (Ecological Momentary Assessments, i.e. EMAs) [Firth, Torous, and Yung2016] to deliver effective interventions, but some users are annoyed by the EMAs and either fill them out via random selections or leave some or all of them blank. We consider such badly noised observations in the trajectory as outliers.

There are several robust methods for the SDM problem [Dudík, Langford, and Li2011, Zhang et al.2012, Xu2009]. However, those methods are neither in the actor-critic setting nor focused on the outlier problem; thus, they differ from this paper’s focus. In the general machine learning literature, there are robust learning methods that deal with the outlier problem [Sun, Xiang, and Ye2013, Jiang, Nie, and Huang2015]. However, none of them are contextual bandit algorithms—it may require substantial work to transfer them to the (actor-critic) contextual bandit setting. Besides, those methods seldom pay attention to datasets without outliers. In practice, however, we do not know whether a given dataset contains outliers or not. It is therefore necessary to propose a robust learning method that works well on datasets both with and without outliers.

To alleviate the above problems, we propose a robust contextual bandit method for mHealth. The capped-ℓ_2 norm is used to measure the learning error for the expected reward estimation (i.e. the critic updating). It prevents outlier observations from dominating our objective. Besides, the weights learned in the critic updating are reused in the actor updating. As a result, the robustness against outliers is greatly boosted in both the critic and actor updatings. There is an important thresholding parameter in our method. We propose a solid method to set its value according to the distribution of samples, based on one of the most fundamental ideas in statistics. It has two benefits: (1) the setting of this parameter becomes very easy and reliable; (2) we may achieve the conflicting goals of reducing the influence of outliers when the dataset indeed contains outliers, while achieving almost identical results to the state-of-the-art contextual bandit methods on datasets without outliers. Although the objective is non-convex and non-differentiable, we derive an effective algorithm. As a theoretical contribution, we prove that our algorithm finds a sufficiently decreasing point after each iteration and finally converges after a finite number of iterations. Extensive experimental results on two datasets verify that our methods achieve clear gains over the state-of-the-art contextual bandit methods for mHealth.


Multi-armed bandit (MAB) is the simplest algorithm for the sequential decision-making (SDM) problem. The contextual bandit is a more practical extension of the MAB that considers extra information helpful for the SDM problem [Tewari and Murphy2017]. The use of context information allows for many interesting applications, such as internet advertising and health-care tasks [Dudík, Langford, and Li2011, Tewari and Murphy2017].

In the contextual bandit, the expected reward is a core concept that measures the reward we may get on average when the system is in state s and chooses action a, i.e., E[r | s, a]. Since the state space is usually very large or even infinite in mHealth tasks, a parameterized model is employed to approximate the expected reward: E[r | s, a] ≈ x(s, a)^T w, where x(s, a) is a feature processing step that combines the information in the state s and the action a, and w is the unknown coefficient vector.

In 2014, Lei [Lei, Tewari, and Murphy2014] proposed the first contextual bandit method for the mHealth study. It is in the actor-critic learning setting. The actor updating is the overall optimization goal. It aims to learn an optimal policy that maximizes the average reward over all the states and actions [Grondman et al.2012]. The objective function for the n-th user is

J(θ_n) = Σ_s d_n(s) Σ_a π_{θ_n}(a | s) E[r | s, a],   (1)

where π_{θ_n}(a | s) is the parameterized stochastic policy and d_n(s) is a reference distribution over states for user n.

Obviously, we need the estimation of expected rewards to define the objective (1) for the actor updating. This procedure is called the critic updating. State-of-the-art methods generally employ ridge regression to learn the expected reward from the observations. The objective is

min_w Σ_{i=1}^{T} [ r_i − x(s_i, a_i)^T w ]^2 + λ ‖w‖_2^2,   (2)

where D = {(s_i, a_i, r_i)}_{i=1}^{T} is the trajectory of T observed tuples from the user and (s_i, a_i, r_i) is the i-th tuple in D; λ is a tuning parameter to control the strength of the constraint. The closed-form solution of (2) is

w = (X^T X + λ I)^{-1} X^T r,   (3)

where I is the identity matrix, X = [x(s_1, a_1), …, x(s_T, a_T)]^T stacks the features and r = [r_1, …, r_T]^T stacks the rewards. Unfortunately, similar to existing least-squares-based models in machine learning and statistics, the objective function in (2) is prone to the presence of outliers [Nie et al.2010, Zhu et al.2015].
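As a concrete illustration, the closed-form ridge critic above can be computed directly. This is a minimal sketch with NumPy; the names `ridge_critic`, `X`, `r` and `lam` are ours, not from the paper.

```python
import numpy as np

def ridge_critic(X, r, lam=1.0):
    """Closed-form ridge estimate: w = (X^T X + lam * I)^{-1} X^T r."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ r)

# noise-free example: rewards generated by w_true = [2, 3]
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
r = X @ np.array([2.0, 3.0])
w = ridge_critic(X, r, lam=1e-8)  # recovers w_true almost exactly
```

With a tiny `lam` the estimate matches the generating coefficients; larger values shrink w toward zero, which is the regularization role of λ in (2).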

Robust Actor-Critic Contextual Bandit via the Capped-ℓ_2 Norm

To enhance the robustness of the critic updating, a capped-ℓ_2 norm based measure is used for the estimation of expected rewards. By imposing the weights learned in the critic updating on the actor updating, we propose a robust objective for the actor updating.

Robust Critic Updating via the Capped-ℓ_2 Norm

To simplify the notation, we drop the subscript index n, which indicates the model for the n-th user. The new objective for the critic updating (i.e. policy evaluation) is

min_w Σ_{i=1}^{T} min{ (r_i − x_i^T w)^2, ε } + λ ‖w‖_2^2,   (4)

where min{‖v‖_2^2, ε} is the capped-ℓ_2 norm of a vector v; ε is the thresholding hyper-parameter that chooses the effective observations for the critic updating; x_i = x(s_i, a_i) is the feature for the estimation of expected rewards.

If the squared residual of the i-th tuple satisfies (r_i − x_i^T w)^2 ≥ ε, we treat the tuple as an outlier. Its residual is capped at the fixed value ε. That is, the influence of such a tuple is fixed [Gao et al.2015], so it cannot badly influence the learning procedure. The tuples whose squared residuals satisfy (r_i − x_i^T w)^2 < ε are considered effective observations and are kept as they are in the optimization process.
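To make the capping concrete, here is a minimal sketch (the helper name `capped_losses` is ours): each tuple contributes its squared residual when it is below the threshold, and exactly the threshold value otherwise.

```python
import numpy as np

def capped_losses(residuals, eps):
    """min(residual^2, eps): effective tuples keep their squared
    residual; outlier tuples contribute the fixed value eps."""
    return np.minimum(np.asarray(residuals) ** 2, eps)

# the third residual would contribute 100 without capping
losses = capped_losses([1.0, 2.0, 10.0], eps=9.0)
```

However far an outlier drifts, its contribution to the objective stays at eps, so it cannot dominate the fit.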

Therefore, it is extremely important to properly set the value of ε. When ε is too large, outliers that lie far away from the majority of tuples are treated as effective samples, badly influencing the learning procedure. When ε is too small, most tuples are treated as outliers, leaving very few effective samples for the critic learning. This easily leads to unstable policies with large variance. In particular, if ε → ∞, our objective is equivalent to the least-squares objective in (2).

As a profound contribution, we propose a reliable method to properly set the value of ε. Our method does not need any specific assumption on the data distribution. It is derived from the boxplot—one of the most fundamental ideas in statistics [Dawson2011, Williamson, Parker, and Kendrick1989]. To give a descriptive illustration of the data distribution, the boxplot is widely used by specifying five points: the minimum, lower quartile Q1, median, upper quartile Q3 and maximum. Based on these five points, the boxplot provides a method to detect outliers. Following this idea, we set the value of ε as

ε = Q3 + c · IQR,   (5)

where IQR = Q3 − Q1 is the interquartile range of the residual distribution; c is introduced only for the experiment setting S2, otherwise we may ignore this parameter by keeping it fixed at c = 1. Intuitively, the data points that are more than c · IQR above the third quartile are detected as outliers. Compared with the state-of-the-art robust learning methods [Gao et al.2015, Sun, Xiang, and Ye2013, Jiang, Nie, and Huang2015] in other fields, which have to manually set the thresholding hyper-parameter, we provide an adaptive method to set ε that is well adapted to the data distribution.
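A sketch of the boxplot rule in (5), assuming it is applied to the distribution of squared residuals (the function and variable names here are illustrative):

```python
import numpy as np

def boxplot_threshold(sq_residuals, c=1.0):
    """eps = Q3 + c * IQR, computed over the squared residuals."""
    q1, q3 = np.percentile(sq_residuals, [25, 75])
    return q3 + c * (q3 - q1)

sq_res = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
eps = boxplot_threshold(sq_res)   # Q3 = 4, IQR = 2, so eps = 6
mask = sq_res < eps               # only the last tuple is flagged as an outlier
```

Because the quartiles are order statistics, the threshold adapts to the bulk of the residuals and is itself insensitive to the outliers it is meant to detect.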

With the capped-ℓ_2 norm and the method to set ε in (5), we can achieve the conflicting goals of (a) reducing the influence of outliers when the dataset has outliers, while (b) seeking a solution similar to that of the state-of-the-art method if there is no outlier in the dataset. As a result, our method can deal with various datasets, regardless of whether they contain outliers or not.

Derivation of a General Objective Function for (4)

To derive an efficient algorithm, we consider a more general capped-ℓ_2 norm based objective for (4) as follows:

min_w Σ_{i=1}^{T} min{ f_i(w), ε } + g(w),   (6)

where f_i(w) and g(w) are both scalar functions of w; for the critic objective (4), f_i(w) = (r_i − x_i^T w)^2 and g(w) = λ ‖w‖_2^2. In this section, we propose an iteratively re-weighted method to simplify the objective (6).

Due to the non-smooth and non-differentiable property of (6), we can only obtain a sub-gradient of (6):

∂/∂w { Σ_{i=1}^{T} min[ f_i(w), ε ] + g(w) }.   (7)

Setting the sub-gradient to zero gives

Σ_{i=1}^{T} u_i ∇f_i(w) + ∇g(w) = 0,   (8)

where u_i = 1 if f_i(w) < ε and u_i = 0 otherwise. For the sake of easy optimization, we provide a compact expression that satisfies the sub-gradient condition in (8) by introducing the weight vector u = [u_1, …, u_T]^T. Then Eq. (7) is rewritten as

Σ_{i=1}^{T} u_i ∇f_i(w) + ∇g(w).   (9)

Since u depends on w, it is very challenging to directly solve the objective via (9). However, once u_i is given for every i, the objective (7) is equivalent to the following problem

min_w Σ_{i=1}^{T} u_i f_i(w) + g(w)   (10)

in the sense that they have the same partial derivative.
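The equivalence between the capped objective and its weighted form can also be seen through a standard variational identity (a sketch in our notation, with u_i the binary weight):

```latex
\min\{f_i(w), \varepsilon\}
  \;=\; \min_{u_i \in \{0,1\}} \; u_i\, f_i(w) + (1 - u_i)\,\varepsilon ,
\qquad u_i^\star = \mathbb{1}\left[\, f_i(w) < \varepsilon \,\right].
```

Fixing each u_i at its minimizer and then optimizing over w recovers exactly the weighted subproblem above, which is why alternating the two steps decreases the capped objective.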

Robust Algorithm for the Critic Updating

In this section, we provide an effective updating rule for the objective function (4) (cf. Proposition 1 and Algorithm 1). We prove that our algorithm finds a sufficiently decreasing point after each iteration (cf. Lemma 2) and finally converges after a finite number of iterations (cf. Theorem 4).

Proposition 1.

The iterative updating rule for (4) at iteration k is

w^{(k+1)} = (X^T U^{(k)} X + λ I)^{-1} X^T U^{(k)} r,   (11)

where U^{(k)} is a nonnegative diagonal matrix. Its i-th diagonal element is u_i^{(k)} = 1 if (r_i − x_i^T w^{(k)})^2 < ε and u_i^{(k)} = 0 otherwise.
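Assuming the rule takes the standard reweighted ridge form (squared-residual loss, ridge penalty), the alternation between the 0/1 weights and the weighted solve can be sketched as follows; all names here are illustrative, not from the paper's code.

```python
import numpy as np

def robust_critic(X, r, eps, lam=1e-6, max_iter=100):
    """Iteratively reweighted sketch of the capped-l2 critic:
    u_i = 1 if the squared residual is below eps, else 0;
    w   = (X^T U X + lam*I)^{-1} X^T U r  given the weights."""
    T, d = X.shape
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ r)  # ridge start
    u = np.ones(T)
    for _ in range(max_iter):
        u_new = ((r - X @ w) ** 2 < eps).astype(float)
        U = np.diag(u_new)
        w = np.linalg.solve(X.T @ U @ X + lam * np.eye(d), X.T @ U @ r)
        if np.array_equal(u_new, u):  # weights stopped changing: converged
            break
        u = u_new
    return w, u

# one gross outlier among points on the line r = 2 * x
X = np.arange(1.0, 6.0).reshape(-1, 1)
r = np.array([2.0, 4.0, 6.0, 8.0, 100.0])
w, u = robust_critic(X, r, eps=30.0)
```

On this toy trajectory the final weight of the corrupted tuple is 0 and the recovered slope is the clean value 2, whereas the plain ridge start is pulled far off by the outlier.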

Lemma 2.

The updating rule in Proposition 1 leads to a sufficient decrease of the objective function in (4):

F(w^{(k+1)}) ≤ F(w^{(k)}) − λ ‖w^{(k+1)} − w^{(k)}‖_2^2,

where the bivariate function G is defined as

G(w, u) = Σ_{i=1}^{T} [ u_i (r_i − x_i^T w)^2 + (1 − u_i) ε ] + λ ‖w‖_2^2,

and F(w) = min_u G(w, u), which is the same as the objective in (4).

Lemma 3.

For the sequence {w^{(k)}} in Lemma 2, we show that Σ_k ‖w^{(k+1)} − w^{(k)}‖_2^2 < ∞ and consequently ‖w^{(k+1)} − w^{(k)}‖_2 → 0 as k → ∞.

Theorem 4.

The updating rule in Proposition 1 converges after a finite number of iterations.

Robust Actor Updating for the Stochastic Policy

Besides the critic updating, outliers can also badly influence the actor updating in (1), which is our ultimate objective. To boost its robustness, the weights learned in the critic updating are incorporated. Since the reference distribution d(s) is usually unavailable in reality, the T-trial based objective [Chou et al.2014] is widely used. Thus, the objective (1) is rewritten as

max_θ (1/T) Σ_{i=1}^{T} u_i Σ_a π_θ(a | s_i) x(s_i, a)^T w − ρ ‖θ‖_2^2,   (12)

where x(s_i, a)^T w is the estimated expected reward; ‖θ‖_2^2 is the least-squares constraint that makes the objective (12) a well-posed problem and ρ is a balancing parameter that controls the penalization strength [Lei, Tewari, and Murphy2014].

Compared with the current objective for the actor updating in (1), our objective has an extra weight term u_i, which gives zero weight to those tuples whose residuals are very large in the critic updating. As a result, the outlier tuples that lie far away from the majority of tuples are removed from the actor updating, enhancing its robustness. The actor updating maximizes (12) over θ. This is solved via the Sequential Quadratic Programming (SQP) algorithm. We utilize the implementation of SQP with a finite-difference approximation to the gradient in the fmincon function of MATLAB.
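The weighted actor step can be sketched as follows for binary actions, replacing fmincon's SQP with plain finite-difference gradient ascent for illustration; `phi`, `q_hat`, `u` and all function names are our own assumptions.

```python
import numpy as np

def boltzmann(theta, phi):
    """Boltzmann (softmax) policy over per-action features phi
    of shape (T, n_actions, d)."""
    logits = phi @ theta                                    # (T, n_actions)
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)

def actor_objective(theta, phi, q_hat, u, rho=0.01):
    """Weighted T-trial value minus the well-posedness penalty;
    q_hat[i, a] is the estimated expected reward, u the critic's 0/1 weights."""
    pi = boltzmann(theta, phi)
    value = (u * (pi * q_hat).sum(axis=1)).sum() / max(u.sum(), 1.0)
    return value - rho * theta @ theta

def actor_update(phi, q_hat, u, rho=0.01, lr=0.5, iters=200, h=1e-5):
    """Maximize the objective via finite-difference gradient ascent."""
    theta = np.zeros(phi.shape[-1])
    for _ in range(iters):
        grad = np.zeros_like(theta)
        for j in range(theta.size):
            e = np.zeros_like(theta)
            e[j] = h
            grad[j] = (actor_objective(theta + e, phi, q_hat, u, rho)
                       - actor_objective(theta - e, phi, q_hat, u, rho)) / (2 * h)
        theta += lr * grad
    return theta

# toy problem: action 1 always has a higher estimated reward than action 0
T, d = 3, 1
phi = np.zeros((T, 2, d)); phi[:, 1, 0] = 1.0   # feature fires for action 1
q_hat = np.tile([0.0, 1.0], (T, 1))
theta = actor_update(phi, q_hat, u=np.ones(T))
pi = boltzmann(theta, phi)                       # policy now prefers action 1
```

Tuples with u_i = 0 simply drop out of the sum, which is exactly how the outlier tuples are excluded from the actor updating.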


1:  Initialize the states and the policy parameter θ.
2:  repeat
3:     /* Critic updating for the expected reward */
4:     repeat
5:        Update w for the expected reward via (11).
6:        Update the weights u via Proposition 1.
7:     until convergence
8:     /* Actor updating */: update θ by maximizing (12) via SQP.
9:  until convergence

Output: the stochastic policy for the user, i.e. π(a | s; θ).

Algorithm 1: Robust actor-critic contextual bandit for one user.


Two Datasets

Two datasets are used to verify the performance of our method. The first dataset concerns personalized treatment delivery in mobile health, a common application of sequential decision-making algorithms [Murphy et al.2016]. In this paper, we focus on HeartSteps, where the participants are periodically sent activity suggestions aimed at decreasing sedentary behavior [Klasnja et al.2015]. Specifically, HeartSteps is a 42-day trial study involving 50 participants. For each participant, there are 210 decision points—five decisions per participant per day. At each decision point, the intervention action can specify the intervention type as well as whether or not to send an intervention. The intervention actions generally depend on the state of the participant as well as the formerly sent interventions. Interventions can be sent via smartphones, or via other wearable devices like a wristband [Dempsey et al.2016].

Table 1: The ElrAR of nine contextual bandit methods on the two datasets: Heartsteps and chain walk. (experiment setting S1)

HeartSteps is a common application for contextual bandit algorithms [Lei2016, Murphy et al.2016]. The goal is to learn an optimal SDM algorithm that decides what type of intervention action to send to each user so as to maximize the cumulative steps each user takes. The resulting data for each participant is a sequence of tuples {(s_i, a_i, r_i)}, where s_i is the participant’s state at time i. It is a three-dimensional vector that consists of (1) the weather condition, (2) the engagement of the participant, and (3) the treatment fatigue of the participant. The binary action a_i indicates whether or not to send the intervention to the user; since the goal of HeartSteps is to increase the participants’ activity, we define the reward r_i as the step count for the 3 hours following a decision point.
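The per-participant data described above can be represented directly; this is a sketch, and the field names are ours.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    """One HeartSteps decision point: a (state, action, reward) tuple."""
    state: np.ndarray   # 3-dim: weather, engagement, treatment fatigue
    action: int         # 1 = send the intervention, 0 = do not send
    reward: float       # step count in the 3 hours after the decision point

# a trajectory is a sequence of such tuples, e.g. 210 per participant
trajectory = [Observation(np.zeros(3), 1, 1250.0) for _ in range(210)]
```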

The second dataset is the 4-state chain walk, a benchmark for (contextual) bandit and reinforcement learning studies. Please refer to [Lagoudakis and Parr2003] for the details of the chain walk dataset; a fixed reward vector over the four states is used in this paper.

Table 3: The ElrAR of six methods vs. outlier strength on two datasets: (1) Heartsteps and (2) chain walk. (experiment S3).
Table 2: The ElrAR of six methods vs. outlier ratio on two datasets: (1) Heartsteps and (2) chain walk. (experiment S2).

Nine Compared Methods

There are nine contextual bandit methods compared in the experiment, including three state-of-the-art methods: (1) the Linear Upper Confidence Bound algorithm (LinUCB), one of the most famous contextual bandit methods, used in personalized news recommendation [Li et al.2010, Ho and Lin2015]; (2) the actor-critic contextual bandit (ACCB), the first SDM method for the mHealth intervention [Lei, Tewari, and Murphy2014]; (3) the stochasticity-constrained actor-critic contextual bandit (SACCB) for mHealth [Lei2016]. Methods (4)–(6) improve the above three methods by first using a state-of-the-art outlier filter [Liu, Shah, and Jiang2004, Gustafson et al.2014] to remove outliers, then employing the three contextual bandit algorithms for the SDM task. In the last three compared methods, we apply the proposed robust model and optimization algorithm to the three state-of-the-art contextual bandit methods, leading to (7) Robust LinUCB (Ro-LinUCB for short), (8) Robust ACCB (Ro-ACCB), and (9) Robust SACCB (Ro-SACCB).

Evaluation Methodology and Parameter Setting

It has been a challenging problem to reliably evaluate sequential decision-making (e.g., bandit and reinforcement learning) algorithms when a simulator (like the Atari games) is unavailable [Li et al.2010, Li et al.2015]. Ideally, after the model is trained, we would use it to interact with the environment (or simulator) to collect thousands of immediate rewards for the calculation of the long-term reward as the evaluation metric. However, for a wide variety of applications, including our HeartSteps, such a simulator is unavailable. In this paper, we use a benchmark evaluation method [Li et al.2015]. The main idea is to make use of the collected dataset to build a simulator, based on which we train and evaluate the contextual bandit methods. This also makes it impractical to train and evaluate the contextual bandit algorithms on up to ten datasets as in general supervised learning tasks.

Figure 1: The ElrAR of two contextual bandit methods vs. the threshold parameter c in Eq. (5), the outlier ratio and the outlier strength, respectively, on the two datasets. The top row of sub-figures illustrates the results on the HeartSteps, and the bottom row shows the results on the chain walk. The three columns of sub-figures show the results in the experiment settings (S4), (S2) and (S3) respectively.

In the HeartSteps study, there are 50 users; the simulator for each user is as follows: the initial state is drawn from a Gaussian distribution whose covariance matrix has pre-defined elements. For t ≥ 1, the action a_t is drawn from the learned policy during the evaluation procedure. The state and immediate reward are then generated via the linear models (13) and (14), where β = [0.4, 0.3, 0.4, 0.7, 0.05, 0.6, 3, 0.25, 0.25, 0.4, 0.1, 0.5, 500] is the main parameter of the MDP system, and the noise terms in the state equation (13) and the reward equation (14) are Gaussian.

The parameterized policy is assumed to follow the Boltzmann distribution π_θ(a | s) = exp{θ^T φ(s, a)} / Σ_{a'} exp{θ^T φ(s, a')}, where φ(s, a) is the policy feature and θ is the unknown coefficient vector. The feature x(s, a) for the estimation of expected rewards and the constraint parameter for the actor-critic learning follow the standard settings, and fixed default values are used for the outlier ratio and strength. In our methods, c is set to 1 by default.

The expected long-run average reward (ElrAR) [Murphy et al.2016] is used to quantify the quality of the estimated policy. Intuitively, ElrAR measures the average steps users take per day in the long run when we use the estimated policy to send interventions to users. There are two steps to obtain the ElrAR: (a) get the average reward for each user by averaging the rewards over the last decision points of a long trajectory of tuples generated under the policy; (b) the ElrAR is obtained by averaging these per-user average rewards over all users.
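Steps (a) and (b) can be sketched as follows; `burn_in` and all names are our assumptions, with the actual trajectory lengths following the paper's protocol.

```python
import numpy as np

def elrar(reward_trajectories, burn_in=1000):
    """ElrAR sketch: average each user's rewards after a burn-in period,
    then average those per-user values across all users."""
    per_user = [float(np.mean(traj[burn_in:])) for traj in reward_trajectories]
    return float(np.mean(per_user))

# two users whose long-run rewards settle at 2 and 4 steps respectively
u1 = np.concatenate([np.zeros(1000), np.full(500, 2.0)])
u2 = np.concatenate([np.zeros(1000), np.full(500, 4.0)])
avg = elrar([u1, u2])
```

Discarding the early portion of each trajectory removes the transient before the policy's state distribution settles, so the metric reflects long-run rather than start-up behavior.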

Comparisons in the Four Experiment Settings

We carry out the following experiments to verify four aspects of the contextual bandit methods:

(S1) To verify the significance of the proposed method, we compare nine contextual bandit methods on two datasets: (1) HeartSteps and (2) chain walk. The experiment results are summarized in Table 1, which contains three sub-tables; each sub-table displays three methods of one type: (a) a state-of-the-art contextual bandit, like ACCB; (b) “OutlierFilter + ACCB”, meaning that we first use a state-of-the-art outlier filter [Liu, Shah, and Jiang2004, Suomela2014] to get rid of the outliers, then employ the state-of-the-art contextual bandit method ACCB for the SDM task; (c) the proposed robust contextual bandit method (Ro-ACCB). As we shall see, “OutlierFilter” is helpful for improving the performance of the state-of-the-art contextual bandit methods. However, our three methods (i.e., Ro-LinUCB, Ro-ACCB and Ro-SACCB) always obtain the best results among all methods of their type on both datasets. Compared with the best state-of-the-art method, our three methods improve the ElrAR by 131.4, 136.2 and 139.9 steps respectively on the HeartSteps. Although there are many general outlier detection or filtering methods that can help relieve the bad influence of outliers, it is still meaningful to propose a specifically robust contextual bandit algorithm.

(S2) In this part, the ratio of tuples that contain outliers is gradually increased. The experiment results are summarized in Table 2 and Figs. 1b, 1e. In Table 2, there are two sub-tables, displaying the ElrARs on the HeartSteps at the top and the ElrARs on the chain walk at the bottom. When the outlier ratio is zero, there is no outlier in the trajectory. In this case, our results are (almost) identical to those of LinUCB, ACCB and SACCB on both datasets. When the outlier ratio rises, our results stay stable, while the ElrARs of LinUCB, ACCB and SACCB decrease dramatically. Compared with ACCB, Ro-ACCB achieves a clear average improvement on both the HeartSteps and the chain walk datasets. These results demonstrate that our method is able to deal with badly noised datasets containing a large percentage of outliers.

(S3) In this part, the strength of outliers is varied as a multiple of the average value in the trajectory. The experiment results are summarized in Table 3 and Figs. 1c, 1f. We have three observations based on the experiment results: (1) when the outlier strength is zero, there is no outlier in the trajectory, and our methods achieve results similar to those of LinUCB, ACCB and SACCB; (2) when the outlier strength rises, our method stays stable on the HeartSteps and decreases only slightly on the chain walk, whereas the results of LinUCB, ACCB and SACCB decrease obviously; (3) our method may therefore be used on datasets with various strengths of outliers. These phenomena verify that our method is able to deal with datasets both with and without outliers.

(S4) In this part, the value of c in (5) is varied over a wide range on the HeartSteps dataset and on the chain walk dataset. The experiment results are displayed in Figs. 1a and 1d. As we shall see, the proposed method obtains a clear advantage over the state-of-the-art method, i.e., ACCB [Lei, Tewari, and Murphy2014], in a wide range of settings. On average, our method improves the ElrAR on both the HeartSteps and the chain walk compared with ACCB. These results verify that the proposed method to set ε is very promising. It adapts to the data distribution and selects the effective tuples—neither too few nor too many—in the trajectory for the actor-critic updating. Note that ACCB does not have the parameter c; thus the result of ACCB remains unchanged as c rises.

Conclusion and Discussion

To deal with outliers in the trajectory, we propose a robust actor-critic contextual bandit for the mHealth intervention. The capped-ℓ_2 norm is employed to boost the robustness of the critic updating. With the weights learned in the critic updating, we propose a new objective for the actor updating, enhancing its robustness. Besides, we propose a solid method to set the important thresholding parameter ε in the capped-ℓ_2 norm. With it, we can achieve the conflicting goals of boosting the robustness of our algorithm on datasets with outliers, while achieving almost identical results to the state-of-the-art method on datasets without outliers. Additionally, we provide theoretical guarantees for our algorithm, showing that it finds a sufficiently decreasing point after each iteration and finally converges after a finite number of iterations. Extensive experimental results show that our method achieves significant improvements in a variety of parameter settings.

Appendix 1: the proof of Proposition 1


According to the analyses in Eqs. (6) and (10), we simplify (4) into the following objective:

min_w Σ_{i=1}^{T} u_i (r_i − x_i^T w)^2 + λ ‖w‖_2^2.   (15)

Taking the partial derivative with respect to w and setting it to zero gives the updating rule

w = (X^T U X + λ I)^{-1} X^T U r,

where U is a nonnegative diagonal matrix whose i-th diagonal element is u_i.

Appendix 2: the proof of Lemma 2


When fixing u = u^{(k)}, we find that G(w, u^{(k)}) is a quadratic function of w, which is strongly convex. The updating rule in Proposition 1 minimizes G(w, u^{(k)}) globally over w. Via the strong convexity of G(·, u^{(k)}), we have

G(w^{(k+1)}, u^{(k)}) ≤ G(w^{(k)}, u^{(k)}) − λ ‖w^{(k+1)} − w^{(k)}‖_2^2.   (16)

When fixing w = w^{(k+1)}, updating u gives

G(w^{(k+1)}, u^{(k+1)}) ≤ G(w^{(k+1)}, u^{(k)}).   (17)

Finally, combining the two inequalities and noting that F(w^{(k)}) = G(w^{(k)}, u^{(k)}), we conclude the following inequality:

F(w^{(k+1)}) ≤ F(w^{(k)}) − λ ‖w^{(k+1)} − w^{(k)}‖_2^2.


Appendix 3: the proof of Lemma 3


We sum up the function descent inequality (16) for k = 1, …, K:

λ Σ_{k=1}^{K} ‖w^{(k+1)} − w^{(k)}‖_2^2 ≤ F(w^{(1)}) − F(w^{(K+1)}) ≤ F(w^{(1)}).   (19)

From (16), the sequence {F(w^{(k)})} is nonincreasing and bounded below by 0. Taking the limit K → ∞ on both sides of (19), we get

Σ_{k=1}^{∞} ‖w^{(k+1)} − w^{(k)}‖_2^2 < ∞,

and thus ‖w^{(k+1)} − w^{(k)}‖_2 → 0 as k → ∞.

Remark 5.

With Lemma 2 and Lemma 3, one can actually show that, given a fixed outlier threshold ε, the algorithm converges after a finite number of iterations.

Appendix 4: the proof of Theorem 4


We first show that the sequence {w^{(k)}} is bounded. It is easy to see that g(w) = λ ‖w‖_2^2 maps an unbounded set to an unbounded range. If {w^{(k)}} were unbounded, the objective values {F(w^{(k)})} would also be unbounded, contradicting the fact that they are nonincreasing. So the sequence must be bounded, i.e., ‖w^{(k)}‖_2 ≤ B for some B > 0.

Now, note that the weight vector u^{(k)} takes values in the finite set {0, 1}^T. Via Lemma 3, for a given fixed outlier threshold parameter ε, we deduce that there exists k_0 such that for all k ≥ k_0 the weights u^{(k)} remain unchanged, and the problem becomes a least-squares problem. Thus after k_0 steps, the updating rule in Proposition 1 converges at the closed-form solution

w* = (X^T U* X + λ I)^{-1} X^T U* r

and its corresponding u*. ∎


  • [Abby et al.2013] Abby, K.; Eric, H.; Lauren, G.; Sandra, W.; Jylana, S.; Matthew, B.; …; and Jesse, C. 2013. Harnessing different motivational frames via mobile phones to promote daily physical activity and reduce sedentary behavior in aging adults. Plos ONE 8(4).
  • [Chou et al.2014] Chou, K.; Lin, H.; Chiang, C.; and Lu, C. 2014. Pseudo-reward algorithms for contextual bandits with linear payoff functions. In JMLR: Workshop and Conference Proceedings, 1–19.
  • [Dawson2011] Dawson, R. 2011. How significant is a boxplot outlier. Journal of Statistics Education 19(2):1–12.
  • [Dempsey et al.2016] Dempsey, W.; Liao, P.; Klasnja, P.; Nahum-Shani, I.; and Murphy, S. A. 2016. Randomised trials for the fitbit generation. Significance 12(6):20 – 23.
  • [Dudík, Langford, and Li2011] Dudík, M.; Langford, J.; and Li, L. 2011. Doubly robust policy evaluation and learning. In ICML, 1097–1104.
  • [Firth, Torous, and Yung2016] Firth, J.; Torous, J.; and Yung, A. 2016. Ecological momentary assessment and beyond: the rising interest in e-mental health research. Journal of psychiatric research 80:3–4.
  • [Gao et al.2015] Gao, H.; Nie, F.; Cai, W.; and Huang, H. 2015. Robust capped norm nonnegative matrix factorization: Capped norm nmf. In ACM International Conference on Information and Knowledge, 871–880.
  • [Grondman et al.2012] Grondman, I.; Busoniu, L.; Lopes, G. A. D.; and Babuska, R. 2012. A survey of actor-critic reinforcement learning: Standard and natural policy gradients. IEEE Trans. Systems, Man, and Cybernetics 42(6):1291–1307.
  • [Gustafson et al.2014] Gustafson, D.; McTavish, F.; Chih, M.; Atwood, A.; …; and Shah, D. 2014. A smartphone application to support recovery from alcoholism: a randomized clinical trial. JAMA Psychiatry 71(5):566–572.
  • [Ho and Lin2015] Ho, C.-Y., and Lin, H.-T. 2015. Contract bridge bidding by learning. In AAAI Workshop: Computer Poker and Imperfect Information.
  • [Jiang, Nie, and Huang2015] Jiang, W.; Nie, F.; and Huang, H. 2015. Robust dictionary learning with capped l1-norm. In IJCAI, 3590–3596.
  • [Klasnja et al.2015] Klasnja, P.; Hekler, E. B.; Shiffman, S.; Boruvka, A.; Almirall, D.; Tewari, A.; and Murphy, S. A. 2015. Microrandomized trials: An experimental design for developing just-in-time adaptive interventions. Health Psychology 34(S):1220.
  • [Lagoudakis and Parr2003] Lagoudakis, M. G., and Parr, R. 2003. Least-squares policy iteration. Journal of machine learning research 4(Dec):1107–1149.
  • [Lei, Tewari, and Murphy2014] Lei, H.; Tewari, A.; and Murphy, S. 2014. An actor-critic contextual bandit algorithm for personalized interventions using mobile devices. In NIPS 2014 Workshop: Personalization: Methods and Applications, 1 – 9.
  • [Lei2016] Lei, H. 2016. An Online Actor Critic Algorithm and a Statistical Decision Procedure for Personalizing Intervention. Ph.D. Dissertation, University of Michigan.
  • [Li et al.2010] Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A contextual-bandit approach to personalized news article recommendation. In International Conference on World Wide Web (WWW), 661–670.
  • [Li et al.2015] Li, X.; Li, L.; Gao, J.; He, X.; Chen, J.; Deng, L.; and He, J. 2015. Recurrent reinforcement learning: a hybrid approach. arXiv:1509.03044.
  • [Liao, Tewari, and Murphy2015] Liao, P.; Tewari, A.; and Murphy, S. 2015. Constructing just-in-time adaptive interventions. Phd Section Proposal 1–49.
  • [Liu, Shah, and Jiang2004] Liu, H.; Shah, S.; and Jiang, W. 2004. On-line outlier detection and data cleaning. Computers & chemical engineering 28(9):1635–1647.
  • [Murphy et al.2016] Murphy, S. A.; Deng, Y.; Laber, E. B.; Maei, H. R.; Sutton, R. S.; and Witkiewitz, K. 2016. A batch, off-policy, actor-critic algorithm for optimizing the average reward. CoRR abs/1607.05047.
  • [Nie et al.2010] Nie, F.; Huang, H.; Cai, X.; and Ding, C. H. 2010. Efficient and robust feature selection via joint ℓ_2,1-norms minimization. In NIPS, 1813–1821.
  • [Sun, Xiang, and Ye2013] Sun, Q.; Xiang, S.; and Ye, J. 2013. Robust principal component analysis via capped norms. In ACM SIGKDD, 311–319.
  • [Suomela2014] Suomela, J. 2014. Median filtering is equivalent to sorting. arXiv:1406.1717.
  • [Sutton and Barto2012] Sutton, R. S., and Barto, A. G. 2012. Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2nd edition.
  • [Tewari and Murphy2017] Tewari, A., and Murphy, S. A. 2017. From ads to interventions: Contextual bandits in mobile health. In Rehg, J.; Murphy, S. A.; and Kumar, S., eds., Mobile Health: Sensors, Analytic Methods, and Applications. Springer.
  • [Williamson, Parker, and Kendrick1989] Williamson, D. F.; Parker, R. A.; and Kendrick, J. S. 1989. The box plot: a simple visual method to interpret data. Annals of internal medicine 110(11):916–921.
  • [Witkiewitz et al.2014] Witkiewitz, K.; Desai, S.; Bowen, S.; Leigh, B.; Kirouac, M.; and Larimer, M. 2014. Development and evaluation of a mobile intervention for heavy drinking and smoking among college students. Psychology of Addictive Behaviors 28(3):639–650.
  • [Xu2009] Xu, H. 2009. Robust decision making and its applications in machine learning. McGill University.
  • [Zhang et al.2012] Zhang, B.; Tsiatis, A. A.; Laber, E. B.; and Davidian, M. 2012. A robust method for estimating optimal treatment regimes. Biometrics 68(4):1010–1018.
  • [Zhu et al.2015] Zhu, F.; Fan, B.; Zhu, X.; Wang, Y.; Xiang, S.; and Pan, C. 2015. 10,000+ times accelerated robust subset selection (ARSS). In Proc. Assoc. Adv. Artif. Intell. (AAAI), 3217–3224.