Introduction
Due to the explosive growth of smart-device users globally (e.g., smartphones and wearables such as the Fitbit), mobile health (mHealth) technologies draw increasing interest from the scientific community [Liao, Tewari, and Murphy2015, Murphy et al.2016]. The goal of mHealth is to deliver timely interventions to device users, guiding them to lead healthier lives, for example by reducing alcohol abuse [Gustafson et al.2014, Witkiewitz et al.2014] and increasing physical activity [Abby et al.2013]. With advanced smart technologies, mHealth interventions can be formed according to users' ongoing statuses and changing needs, making them more portable and flexible than traditional treatments. Therefore, mHealth technologies are widely used in many health-related applications, such as eating disorders, alcohol abuse, mental illness, obesity management and HIV medication adherence [Murphy et al.2016, Liao, Tewari, and Murphy2015].
Formally, the tailoring of mHealth interventions is modeled as a sequential decision-making (SDM) problem, for which the contextual bandit algorithm provides a framework [Tewari and Murphy2017]. In 2014, Lei [Lei, Tewari, and Murphy2014] proposed the first contextual bandit algorithm for the mHealth study. It is in the actor-critic setting [Sutton and Barto2012], which has two advantages over the critic-only contextual bandit methods used in internet advertising [Li et al.2010]: (a) Lei's method has an explicit parameterized model for the stochastic policy; by analyzing the estimated parameters of the learned policy, we can identify the key features that contribute most to the policy, which is important to behavioral scientists for the state (feature) design. (b) From the perspective of optimization, the actor-critic algorithm has the desirable properties of quick convergence with low variance [Grondman et al.2012]. However, Lei's method assumes that the states at different decision points are i.i.d. and that the current action only influences the immediate reward [Lei2016]. This assumption is infeasible in real situations. Take the delayed effect in SDM or mHealth as an example: the current action influences not only the immediate reward but also the next state and, through that, all subsequent rewards [Sutton and Barto2012]. Accordingly, Lei proposed a new method [Lei2016] that emphasizes exploration and reduces exploitation.
Although those two methods are a good start for the mHealth study, they assume that the noise in the trajectory follows a Gaussian distribution, and a least-squares-based algorithm is employed to estimate the expected reward. In reality, however, various kinds of complex noise can badly degrade the collected data. For example: (1) the wearable device may fail to accurately record the states and rewards of users under different conditions; (2) the mobile network is unavailable in some areas, which hinders both the collection of users' states and the sending of interventions; (3) mHealth relies on self-reported information (Ecological Momentary Assessments, EMAs) [Firth, Torous, and Yung2016] to deliver effective interventions, but some users are annoyed by the EMAs and either fill them out with random selections or leave some or all of them blank. We regard such badly noised observations in the trajectory as outliers.

There are several robust methods for the SDM problem [Dudík, Langford, and Li2011, Zhang et al.2012, Xu2009]. However, those methods are neither in the actor-critic setting nor focused on the outlier problem; thus they differ from the focus of this paper. In general machine learning, there are robust learning methods that deal with the outlier problem [Sun, Xiang, and Ye2013, Jiang, Nie, and Huang2015]. However, none of them are contextual bandit algorithms, and it may take a lot of work to transfer them to the (actor-critic) contextual bandit setting. Besides, those methods seldom pay attention to datasets without outliers. In practice, however, we do not know whether a given dataset contains outliers or not. It is necessary to propose a robust learning method that works well on datasets both with and without outliers.

To alleviate the above problems, we propose a robust contextual bandit method for mHealth. The capped-ℓ2 norm is used to measure the learning error in the expected-reward estimation (i.e., the critic updating). It prevents outlier observations from dominating our objective. Besides, the weights learned in the critic updating are carried into the actor updating. As a result, the robustness against outliers is greatly boosted in both the actor and critic updatings. There is an important thresholding parameter ε in our method. We propose a solid way to set its value according to the distribution of samples, based on one of the most fundamental ideas in statistics. It has two benefits: (1) the setting of ε becomes very easy and reliable; (2) we may achieve the conflicting goals of reducing the influence of outliers when the dataset indeed contains outliers, while achieving almost identical results to the state-of-the-art contextual bandit method on datasets without outliers. Although the objective is non-convex and non-differentiable, we derive an effective algorithm. As a theoretical contribution, we prove that our algorithm finds a sufficiently decreasing point after each iteration and finally converges after a finite number of iterations. Extensive experiment results on two datasets verify that our methods achieve clear gains over state-of-the-art contextual bandit methods for mHealth.
Preliminaries
The multi-armed bandit (MAB) is the simplest algorithm for the sequential decision-making (SDM) problem. The contextual bandit is a more practical extension of the MAB that considers extra information helpful for the SDM problem [Tewari and Murphy2017]. The use of context information allows for many interesting applications, such as internet advertising and healthcare tasks [Dudík, Langford, and Li2011, Tewari and Murphy2017].
In the contextual bandit, the expected reward is a core concept: it measures the reward we get on average when the system is in state $s$ and chooses action $a$, i.e., $\mathbb{E}(r \mid s, a)$. Since the state space is usually very large or even infinite in mHealth tasks, a parameterized model is employed to approximate the expected reward: $\mathbb{E}(r \mid s, a) \approx x(s, a)^{\top} w$, where $x(s, a)$ is a feature processing step that combines information in the state $s$ and the action $a$, and $w$ is the unknown coefficient vector.
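As a rough illustration of this parameterized model, the sketch below builds a feature map $x(s,a)$ and evaluates the linear reward estimate. The feature construction here (bias, state, state-action interactions) is a hypothetical choice for illustration, not the one used in the paper.

```python
import numpy as np

def reward_feature(state, action):
    """Hypothetical feature map x(s, a): a bias term, the raw state,
    and action-state interaction terms (an illustrative choice)."""
    return np.concatenate(([1.0], state, action * state))

def expected_reward(state, action, w):
    """Linear approximation of the expected reward: E[r | s, a] ~ x(s, a)^T w."""
    return reward_feature(state, action) @ w

state = np.array([0.2, -0.5, 1.0])
w = np.zeros(7)               # 1 bias + 3 state + 3 interaction features
r_hat = expected_reward(state, 1, w)
```

With the zero coefficient vector above, the estimate is trivially 0; in the critic updating below, $w$ is learned from the observed tuples.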
In 2014, Lei [Lei, Tewari, and Murphy2014] proposed the first contextual bandit method for the mHealth study. It is in the actor-critic learning setting. The actor updating is the overall optimization goal: it aims to learn an optimal policy that maximizes the average reward over all states and actions [Grondman et al.2012]. The objective function for each user is

$$J(\theta) = \sum_{s} d(s) \sum_{a} \pi_{\theta}(a \mid s)\, \mathbb{E}(r \mid s, a), \qquad (1)$$

where $\pi_{\theta}(a \mid s)$ is the parameterized stochastic policy and $d(s)$ is a reference distribution over states for the user.
Obviously, we need an estimate of the expected reward to define objective (1) for the actor updating. This procedure is called the critic updating. State-of-the-art methods generally employ ridge regression to learn the expected reward from the observations. The objective is

$$\min_{w} \sum_{i=1}^{T} \left( r_i - x_i^{\top} w \right)^2 + \zeta \|w\|_2^2, \qquad (2)$$

where $\mathcal{D} = \{(s_i, a_i, r_i)\}_{i=1}^{T}$ is the trajectory of observed tuples from the user, $(s_i, a_i, r_i)$ is the $i$-th tuple in $\mathcal{D}$, $x_i = x(s_i, a_i)$, and $\zeta$ is a tuning parameter that controls the strength of the constraint. Objective (2) has the closed-form solution

$$w = \Big( \sum_{i=1}^{T} x_i x_i^{\top} + \zeta I \Big)^{-1} \sum_{i=1}^{T} r_i x_i, \qquad (3)$$

where $I$ is the $d \times d$ identity matrix. Unfortunately, like existing least-squares-based models in machine learning and statistics, the objective in (2) is vulnerable to the presence of outliers [Nie et al.2010, Zhu et al.2015].
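To make this sensitivity concrete, the following sketch solves the closed-form ridge solution (3) on synthetic data and shows how a single corrupted reward drags the least-squares estimate away from the truth. All data and constants here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic critic data: features x_i and rewards r_i = x_i^T w* + noise.
n, d = 200, 5
X = rng.normal(size=(n, d))            # rows are x(s_i, a_i)^T
w_true = rng.normal(size=d)
r = X @ w_true + 0.1 * rng.normal(size=n)

def ridge(X, r, zeta=1.0):
    """Closed-form ridge solution of (2)-(3): w = (X^T X + zeta I)^{-1} X^T r."""
    return np.linalg.solve(X.T @ X + zeta * np.eye(X.shape[1]), X.T @ r)

w_clean = ridge(X, r)

# A single extreme outlier can dominate the squared-error objective.
r_bad = r.copy()
r_bad[0] += 1e4
w_dirty = ridge(X, r_bad)
```

Running this, `w_clean` sits close to `w_true` while `w_dirty` is pulled far away by the one corrupted tuple, which is exactly the failure mode the capped norm below addresses.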
Robust Actor-Critic Contextual Bandit via the Capped-ℓ2 Norm
To enhance the robustness of the critic updating, a capped-ℓ2-norm-based measure is used for the estimation of expected rewards. By carrying the weights learned there into the actor updating, we also propose a robust objective for the actor updating.
Robust Critic Updating via the Capped-ℓ2 Norm
To simplify the notation, we drop the subscript that indexes the model for each user. The new objective for the critic updating (i.e., policy evaluation) is

$$\min_{w} \sum_{i=1}^{T} \min\!\left\{ \left( r_i - x_i^{\top} w \right)^2,\ \epsilon^2 \right\} + \zeta \|w\|_2^2, \qquad (4)$$

where $\min\{\|v\|_2, \epsilon\}$ is the capped-ℓ2 norm of a vector $v$; $\epsilon$ is the thresholding hyperparameter that selects the effective observations for the critic updating; $x_i = x(s_i, a_i)$ is the feature for the estimation of expected rewards.

If the residual $e_i = |r_i - x_i^{\top} w|$ of the $i$-th tuple exceeds $\epsilon$, we treat the tuple as an outlier. Its loss is capped at the fixed value $\epsilon^2$. That is, the influence of the tuple is fixed [Gao et al.2015], so it cannot badly distort the learning procedure. Tuples whose residuals satisfy $e_i \le \epsilon$ are considered effective observations and are kept as they are in the optimization process.
Therefore, it is extremely important to properly set the value of $\epsilon$. When $\epsilon$ is too large, outliers that lie far away from the majority of tuples are treated as effective samples, badly influencing the learning procedure. When $\epsilon$ is too small, most tuples are treated as outliers, leaving very few effective samples for the critic learning. That case easily leads to unstable policies with large variance. In particular, when $\epsilon = \infty$, our objective is equivalent to the least-squares objective in (2).
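The capping and the resulting effective/outlier split can be sketched as follows. The squared form of the capped loss is one standard reading of the (partly garbled) source; the paper's exact form may differ slightly.

```python
import numpy as np

def capped_losses(residuals, eps):
    """Capped loss per tuple: min(e_i^2, eps^2).  A tuple whose residual
    exceeds eps contributes the constant eps^2, so it cannot dominate
    the objective (one standard form of the capped-l2 loss)."""
    return np.minimum(np.square(residuals), eps ** 2)

def effective_mask(residuals, eps):
    """Tuples with |e_i| <= eps are kept as effective observations
    (weight 1); the rest are flagged as outliers (weight 0)."""
    return (np.abs(residuals) <= eps).astype(float)

res = np.array([0.1, -0.3, 5.0])   # the last residual is an outlier
losses = capped_losses(res, eps=1.0)
mask = effective_mask(res, eps=1.0)
```

Here the third tuple's loss is clipped to `eps**2 = 1.0` and its mask entry is 0, while the two small-residual tuples are untouched.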
As a profound contribution, we propose a reliable method to properly set the value of $\epsilon$. Our method does not need any specific assumption on the data distribution. It is derived from the boxplot, one of the most fundamental ideas in statistics [Dawson2011, Williamson, Parker, and Kendrick1989]. To give a descriptive illustration of the data distribution, the boxplot specifies five points: the minimum, the lower quartile $Q_1$, the median, the upper quartile $Q_3$, and the maximum. Based on these five points, the boxplot provides a method to detect outliers. Following this idea, we set the value of $\epsilon$ as

$$\epsilon = Q_3 + \beta \left( Q_3 - Q_1 \right), \qquad (5)$$

where $Q_3 - Q_1$ is the interquartile range (IQR); $\beta$ is introduced only for the experiment setting S2, and otherwise we keep it fixed at $\beta = 1$. Intuitively, the data points that lie more than $\beta \cdot \mathrm{IQR}$ above the third quartile are detected as outliers. Compared with the state-of-the-art robust learning methods [Gao et al.2015, Sun, Xiang, and Ye2013, Jiang, Nie, and Huang2015] in other fields, which have to set the thresholding hyperparameter manually, we provide an adaptive method to set $\epsilon$ that adjusts to the data distribution.
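The boxplot-style rule (5) is straightforward to compute from the residual distribution. The sketch below applies it to absolute residuals; the default `beta = 1.0` mirrors our reading of the (partly garbled) source, so treat the exact constant as an assumption.

```python
import numpy as np

def capped_threshold(residuals, beta=1.0):
    """Boxplot-style threshold from (5): eps = Q3 + beta * IQR, computed
    on the absolute residuals.  beta = 1.0 is the assumed default."""
    q1, q3 = np.percentile(np.abs(residuals), [25, 75])
    return q3 + beta * (q3 - q1)

# 99 well-behaved residuals and one obvious outlier:
res = np.concatenate([np.full(99, 1.0), [100.0]])
eps = capped_threshold(res)
```

Because the quartiles ignore the extreme value, the single residual of 100 lands far above `eps` and is capped, while the bulk of the data stays effective. This is why the rule adapts to the data distribution without manual tuning.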
With the capped norm and the method (5) to set $\epsilon$, we can achieve the conflicting goals of (a) reducing the influence of outliers when the dataset has outliers, while (b) seeking a solution similar to that of the state-of-the-art method when there is no outlier in the dataset. As a result, our method can deal with various datasets, regardless of whether they contain outliers or not.
Derivation of a General Objective Function for (4)
To derive an efficient algorithm, we consider a more general capped-norm-based objective than (4):

$$\min_{w} \sum_{i=1}^{T} \min\!\left\{ \| g_i(w) \|_2^2,\ \epsilon^2 \right\} + f(w), \qquad (6)$$

where $\|\cdot\|_2$ is the $\ell_2$ norm of a vector, and $g_i(w)$ and $f(w)$ are both functions of $w$. In this section, we propose an iteratively reweighted method to simplify objective (6).
Due to the non-smooth and non-differentiable property of (6), we can only obtain a subgradient of (6):

$$\sum_{i \in \mathcal{I}} \frac{\partial \| g_i(w) \|_2^2}{\partial w} + \frac{\partial f(w)}{\partial w}, \qquad (7)$$

where $\mathcal{I} = \{ i : \| g_i(w) \|_2 \le \epsilon \}$ is the set of effective tuples; the capped tuples contribute the constant $\epsilon^2$ and thus a zero (sub)gradient. Setting the subgradient to zero gives

$$\sum_{i \in \mathcal{I}} \frac{\partial \| g_i(w) \|_2^2}{\partial w} + \frac{\partial f(w)}{\partial w} = 0. \qquad (8)$$

For the sake of easy optimization, we provide a compact expression that satisfies the subgradient condition in (8) by introducing a variable $u_i = \mathbb{1}\{ \| g_i(w) \|_2 \le \epsilon \}$. Then (7) is rewritten as

$$\sum_{i=1}^{T} u_i \frac{\partial \| g_i(w) \|_2^2}{\partial w} + \frac{\partial f(w)}{\partial w} = 0. \qquad (9)$$

Since $u_i$ depends on $w$, it is very challenging to directly solve (9). However, once $u_i$ is given for every $i$, objective (6) is equivalent to the following problem

$$\min_{w} \sum_{i=1}^{T} u_i \| g_i(w) \|_2^2 + f(w) \qquad (10)$$

in the sense that they have the same partial derivative.
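The derivation above suggests an alternating scheme: fix $w$ to recompute the weights $u_i$, then solve the weighted problem (10) for $w$. The sketch below instantiates this for the critic objective (4), where each weighted solve is a ridge regression. It is a simplified stand-in for the paper's updating rule (Proposition 1 / Algorithm 1), under our reading of the objective.

```python
import numpy as np

def robust_critic(X, r, eps, zeta=1.0, max_iter=50):
    """Iteratively reweighted critic update for a capped-norm objective:
    (a) u_i = 1{|r_i - x_i^T w| <= eps} given w, then
    (b) w = argmin sum_i u_i (r_i - x_i^T w)^2 + zeta ||w||^2 given u.
    Step (b) is a weighted ridge solve; the loop stops once the weights
    stop changing (a finite number of iterations, as the theory suggests)."""
    n, d = X.shape
    u = np.ones(n)
    w = np.zeros(d)
    for _ in range(max_iter):
        Xu = X * u[:, None]                       # rows scaled by u_i
        w = np.linalg.solve(Xu.T @ X + zeta * np.eye(d), Xu.T @ r)
        u_new = (np.abs(r - X @ w) <= eps).astype(float)
        if np.array_equal(u_new, u):
            break
        u = u_new
    return w, u
```

On data with a few grossly corrupted rewards, the corrupted tuples end up with zero weight after a couple of sweeps and the final estimate is close to the clean least-squares fit.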
Robust Algorithm for the Critic Updating
In this section, we provide an effective updating rule for the objective function (4) (cf. Proposition 1 and Algorithm 1). We prove that our algorithm finds a sufficiently decreasing point after each iteration (cf. Lemma 2) and finally converges after a finite number of iterations (cf. Theorem 4).
Proposition 1.
Lemma 2.
The updating rule in Proposition 1 leads to sufficient decrease of the objective function in (4):
where the bivariate function is defined as
which is the same as in (4).
Lemma 3.
For in Lemma 2, we show that and consequently
Theorem 4.
The updating rule in Proposition 1 converges after a finite number of iterations.
Robust Actor Updating for the Stochastic Policy
Besides the critic updating, outliers can also badly influence the actor updating in (1), which is our ultimate objective. To boost its robustness, the weights $\{u_i\}$ learned in the critic updating are incorporated. Since the reference distribution over states is usually unavailable in reality, the trial-based objective [Chou et al.2014] is widely used. Thus, objective (1) is rewritten as

$$\widehat{J}(\theta) = \sum_{i=1}^{T} u_i \sum_{a} \pi_{\theta}(a \mid s_i)\, x(s_i, a)^{\top} \widehat{w} - \lambda \|\theta\|_2^2, \qquad (12)$$

where $x(s_i, a)^{\top} \widehat{w}$ is the estimated expected reward; $\|\theta\|_2^2$ is the least-squares constraint that makes objective (12) well-posed, and $\lambda$ is a balancing parameter that controls the penalization strength [Lei, Tewari, and Murphy2014].

Compared with the current objective for the actor updating in (1), our objective has an extra weight term $u_i$, which gives zero weight to tuples whose residuals are very large in the critic updating. As a result, outlier tuples that lie far away from the majority are removed from the actor updating, enhancing its robustness. The actor updating maximizes (12) over $\theta$, which is learned via the Sequential Quadratic Programming (SQP) algorithm. We utilize the SQP implementation, with a finite-difference approximation to the gradient, in the fmincon function of Matlab.
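The structure of this weighted actor step can be sketched in a few lines. The Boltzmann policy and the weighted objective follow the text; the toy features `phi`, reward estimate `q_hat`, and the plain finite-difference gradient ascent are all illustrative stand-ins (the paper uses Matlab fmincon's SQP, also with finite-difference gradients).

```python
import numpy as np

def boltzmann(theta, phis):
    """pi_theta(a | s) proportional to exp(phi(s, a)^T theta)."""
    logits = phis @ theta
    p = np.exp(logits - logits.max())
    return p / p.sum()

def actor_value(theta, states, u, q_hat, phi, lam):
    """Weighted actor objective in the spirit of (12): tuples flagged as
    outliers in the critic step (u_i = 0) are skipped, and lam*||theta||^2
    keeps the problem well-posed.  q_hat and phi are illustrative."""
    val = 0.0
    for s, ui in zip(states, u):
        if ui == 0.0:
            continue
        phis = np.stack([phi(s, a) for a in (0, 1)])
        p = boltzmann(theta, phis)
        val += ui * (p[0] * q_hat(s, 0) + p[1] * q_hat(s, 1))
    return val - lam * theta @ theta

def maximize_fd(f, theta0, steps=300, lr=0.5, h=1e-5):
    """Gradient ascent with central finite-difference gradients: a much
    simpler stand-in for fmincon's SQP solver."""
    theta = theta0.astype(float).copy()
    for _ in range(steps):
        g = np.zeros_like(theta)
        for j in range(theta.size):
            e = np.zeros_like(theta)
            e[j] = h
            g[j] = (f(theta + e) - f(theta - e)) / (2 * h)
        theta += lr * g
    return theta

# Toy instantiation (all choices illustrative):
phi = lambda s, a: a * np.asarray(s)                  # policy feature
q_hat = lambda s, a: np.sum(s) if a == 1 else -np.sum(s)
states = [np.array([1.0, 0.5]), np.array([0.8, 0.2])]
u = np.array([1.0, 1.0])                              # no outliers flagged here
theta_hat = maximize_fd(lambda t: actor_value(t, states, u, q_hat, phi, 0.05),
                        np.zeros(2))
```

Since action 1 has the higher estimated reward in every state of this toy problem, the learned policy puts most of its probability mass on action 1, with the penalty term keeping $\theta$ finite.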
Algorithm 1 takes the user's trajectory as input and outputs the stochastic policy for the user, i.e., $\pi_{\theta}(a \mid s)$.
Experiment
Two Datasets
Two datasets are used to verify the performance of our method. The first concerns personalized treatment delivery in mobile health, a common application of sequential decision-making algorithms [Murphy et al.2016]. In this paper, we focus on HeartSteps, where participants are periodically sent activity suggestions aimed at decreasing sedentary behavior [Klasnja et al.2015]. Specifically, HeartSteps is a 42-day trial involving 50 participants. For each participant, there are 210 decision points: five decisions per participant per day. At each decision point, the intervention action specifies the intervention type as well as whether or not to send an intervention. The intervention actions generally depend on the state of the participant and the formerly sent interventions. Interventions can be sent via smartphones or via other wearable devices such as a wristband [Dempsey et al.2016].
HeartSteps is a common application for the contextual bandit algorithm [Lei2016, Murphy et al.2016]. The goal is to learn an optimal SDM algorithm that decides what type of intervention action to send to each user so as to maximize the cumulative steps the user takes. The resulting data for each participant is a sequence of tuples $(s_i, a_i, r_i)$, where $s_i$ is the participant's state at decision point $i$: a three-dimensional vector consisting of (1) the weather condition, (2) the engagement of the participant, and (3) the treatment fatigue of the participant. The action $a_i$ indicates whether or not to send the intervention. Since the goal of HeartSteps is to increase the participant's activity, we define the reward $r_i$ as the step count in the 3 hours following a decision point.
The second dataset is the 4-state chain walk, a benchmark for (contextual) bandit and reinforcement learning study; please refer to [Lagoudakis and Parr2003] for its details. We specify the reward vector over the four states in this paper.

Nine Compared Methods
There are nine contextual bandit methods compared in the experiment, including three state-of-the-art methods: (1) the Linear Upper Confidence Bound algorithm (LinUCB), one of the most famous contextual bandit methods, used in personalized news recommendation [Li et al.2010, Ho and Lin2015]; (2) the actor-critic contextual bandit (ACCB), the first SDM method for the mHealth intervention [Lei, Tewari, and Murphy2014]; (3) the stochasticity-constrained actor-critic contextual bandit (SACCB) for mHealth [Lei2016]. Methods (4)-(6) improve the above three by first using a state-of-the-art outlier filter [Liu, Shah, and Jiang2004, Gustafson et al.2014] to remove outliers and then employing the three contextual bandit algorithms for the SDM task. In the last three compared methods, we apply the proposed robust model and optimization algorithms to the three state-of-the-art contextual bandit methods, leading to (7) Robust LinUCB (RoLinUCB for short), (8) Robust ACCB (RoACCB), and (9) Robust SACCB (RoSACCB).
Evaluation Methodology and Parameter Setting
It has been a challenging problem to reliably evaluate sequential decision-making (e.g., bandit and reinforcement learning) algorithms when a simulator (like the Atari games) is unavailable [Li et al.2010, Li et al.2015]. Normally, after a model is trained, we would use it to interact with the environment (or simulator) to collect thousands of immediate rewards, from which the long-term reward is computed as the evaluation metric. However, for a wide variety of applications, including our HeartSteps, such a simulator is unavailable. In this paper, we therefore use a benchmark evaluation method [Li et al.2015]: the main idea is to use the collected dataset to build a simulator, on which we train and evaluate the contextual bandit methods. This also makes it impractical to train and evaluate the contextual bandit algorithms on up to ten datasets, as in general supervised learning tasks.
In the HeartSteps study, there are 50 users; the simulator for each user is as follows. The initial state is drawn from a Gaussian distribution whose covariance matrix has predefined elements. At each subsequent decision point, the action is drawn from the learned policy during the evaluation procedure, and the state and immediate reward are generated as

(13)

(14)

where the vector [0.4, 0.3, 0.4, 0.7, 0.05, 0.6, 3, 0.25, 0.25, 0.4, 0.1, 0.5, 500] collects the main parameters of the MDP system, and the noise terms in the state transition (13) and the reward (14) are Gaussian.
The parameterized policy is assumed to follow the Boltzmann distribution $\pi_{\theta}(a \mid s) \propto \exp\{\phi(s, a)^{\top} \theta\}$, where $\phi(s, a)$ is the policy feature and $\theta$ is the unknown coefficient vector. The feature for the estimation of expected rewards, the constraint for the actor-critic learning, and the outlier ratio and strength are fixed in the experiment configuration. In our methods, $\beta$ is set to 1 by default.
The expected long-run average reward (ElrAR) [Murphy et al.2016] is used to quantify the quality of the estimated policy. Intuitively, ElrAR measures the average steps users take per day in the long-run HeartSteps study when the estimated policy is used to send interventions. There are two steps to obtain the ElrAR: (a) get the average reward for each user by averaging the rewards over the last decision points of a long trajectory of tuples generated under the policy $\pi_{\theta}$; (b) average these per-user values over all 50 users.
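The two-step ElrAR computation can be sketched directly. The horizon `last_k` below is a placeholder, since the exact number of trailing decision points did not survive extraction.

```python
import numpy as np

def elrar(trajectories, last_k=1000):
    """Expected long-run average reward: average each user's rewards over
    the last `last_k` decision points of their trajectory (generated under
    the learned policy), then average across users.  `last_k` is an assumed
    placeholder for the paper's unspecified horizon."""
    per_user = [np.mean(rewards[-last_k:]) for rewards in trajectories]
    return float(np.mean(per_user))

# Two toy users with constant rewards 2.0 and 4.0 steps:
trajs = [np.full(5000, 2.0), np.full(5000, 4.0)]
score = elrar(trajs)
```

For the toy trajectories above, the per-user averages are 2.0 and 4.0, so the ElrAR is 3.0.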
Comparisons in the Four Experiment Settings
We carry out the following experiments to verify four aspects of the contextual bandit methods:
(S1) To verify the significance of the proposed method, we compare nine contextual bandit methods on two datasets: (1) HeartSteps and (2) chain walk. The experiment results are summarized in Table 1, which contains three sub-tables; each sub-table displays the three methods of one type: (a) the state-of-the-art contextual bandit, e.g., ACCB; (b) "OutlierFilter + ACCB", which first uses the state-of-the-art outlier filter [Liu, Shah, and Jiang2004, Suomela2014] to get rid of outliers and then employs the state-of-the-art contextual bandit method ACCB for the SDM task; (c) the proposed robust contextual bandit method, e.g., RoACCB. As we shall see, "OutlierFilter" helps to improve the performance of the state-of-the-art contextual bandit methods. However, our three methods (RoLinUCB, RoACCB and RoSACCB) always obtain the best results among the methods of their type on both datasets. Compared with the best state-of-the-art method, our three methods improve ElrAR by 131.4, 136.2 and 139.9 steps respectively on HeartSteps. Although many general outlier detection or filtering methods can help to relieve the bad influence of outliers, it is still meaningful to propose a specifically robust contextual bandit algorithm.
(S2) In this part, the ratio of tuples containing outliers rises from zero to a large fraction of the trajectory. The experiment results are summarized in Table 3 and Figs. 1b, 1e. Table 3 has two sub-tables, displaying the ElrARs on HeartSteps at the top and on the chain walk at the bottom. When the outlier ratio is zero, there is no outlier in the trajectory; in this case, our results are (almost) identical to those of LinUCB, ACCB and SACCB on both datasets. As the outlier ratio rises, our results stay stable, while the ElrARs of LinUCB, ACCB and SACCB decrease dramatically. Compared with ACCB, RoACCB achieves a clear average improvement on both the HeartSteps and the chain walk datasets. These results demonstrate that our method can deal with badly noised datasets containing a large percentage of outliers.
(S3) In this part, the strength of outliers ranges over several multiples of the average value in the trajectory. The experiment results are summarized in Table 3 and Figs. 1c, 1f. We make the following observations: (1) when the outlier strength is zero, there is no outlier in the trajectory, and our methods achieve results similar to those of LinUCB, ACCB and SACCB; (2) as the strength rises, our method stays stable on HeartSteps and decreases only slightly on the chain walk, whereas the results of LinUCB, ACCB and SACCB degrade markedly. These phenomena verify that our method can deal with datasets both with and without outliers, and across various outlier strengths.
(S4) In this part, the thresholding-related parameter $\beta$ in (5) ranges over a wide interval on both the HeartSteps and the chain walk datasets. The experiment results are displayed in Figs. 1a and 1d. As we shall see, the proposed method obtains a clear advantage over the state-of-the-art method, i.e., ACCB [Lei, Tewari, and Murphy2014], over a wide range of settings. On average, our method improves the ElrAR by a clear margin on both HeartSteps and the chain walk, compared with ACCB. These results verify that the proposed method to set $\epsilon$ is very promising: it adapts to the data distribution and selects the effective tuples, neither too few nor too many, in the trajectory for the actor-critic updating. Note that ACCB does not have the parameter $\epsilon$; thus its result remains unchanged as $\beta$ rises.
Conclusion and Discussion
To deal with outliers in the trajectory, we propose a robust actor-critic contextual bandit for the mHealth intervention. The capped-ℓ2 norm is employed to boost the robustness of the critic updating. With the weights learned in the critic updating, we propose a new objective for the actor updating, enhancing its robustness as well. Besides, we propose a solid method to set the important thresholding parameter in the capped norm. With it, we achieve the conflicting goals of boosting the robustness of our algorithm on datasets with outliers while achieving almost identical results to the state-of-the-art method on datasets without outliers. We also provide a theoretical guarantee for our algorithm: it finds a sufficiently decreasing point after each iteration and finally converges after a finite number of iterations. Extensive experiment results show that our method achieves significant improvements across a variety of parameter settings.
Appendix 1: the proof of Proposition 1
Appendix 2: the proof of Lemma 2
Proof.
When the weights are fixed, the objective is a quadratic function of $w$ and hence strongly convex. The updating rule in Proposition 1 minimizes this function globally over $w$; via strong convexity, we have

(16)

When $w$ is fixed, updating the weights gives

(17)

Finally, combining the two steps, we conclude the following inequality:

(18)
∎
Appendix 3: the proof of Lemma 3
Proof.
Appendix 4: the proof of Theorem 4
Proof.
We first show that the sequence of iterates is bounded. It is easy to see that the objective maps an unbounded set to an unbounded range; if the iterates were unbounded, the critic update (the updating rule in Proposition 1) would stop. So the sequence must be bounded. Then, via Lemma 3, for a given fixed outlier threshold parameter $\epsilon$, we deduce that after finitely many iterations the weights remain unchanged, and the problem becomes a least-squares problem. Thus, after finitely many steps, the updating rule in Proposition 1 converges at a closed-form solution and its corresponding weights. ∎
References
 [Abby et al.2013] Abby, K.; Eric, H.; Lauren, G.; Sandra, W.; Jylana, S.; Matthew, B.; …; and Jesse, C. 2013. Harnessing different motivational frames via mobile phones to promote daily physical activity and reduce sedentary behavior in aging adults. Plos ONE 8(4).
 [Chou et al.2014] Chou, K.; Lin, H.; Chiang, C.; and Lu, C. 2014. Pseudoreward algorithms for contextual bandits with linear payoff functions. In JMLR: Workshop and Conference Proceedings, 1–19.
 [Dawson2011] Dawson, R. 2011. How significant is a boxplot outlier. Journal of Statistics Education 19(2):1–12.
 [Dempsey et al.2016] Dempsey, W.; Liao, P.; Klasnja, P.; NahumShani, I.; and Murphy, S. A. 2016. Randomised trials for the fitbit generation. Significance 12(6):20 – 23.
 [Dudík, Langford, and Li2011] Dudík, M.; Langford, J.; and Li, L. 2011. Doubly robust policy evaluation and learning. In ICML, 1097–1104.
 [Firth, Torous, and Yung2016] Firth, J.; Torous, J.; and Yung, A. 2016. Ecological momentary assessment and beyond: the rising interest in emental health research. Journal of psychiatric research 80:3–4.
 [Gao et al.2015] Gao, H.; Nie, F.; Cai, W.; and Huang, H. 2015. Robust capped norm nonnegative matrix factorization: Capped norm nmf. In ACM International Conference on Information and Knowledge, 871–880.
 [Grondman et al.2012] Grondman, I.; Busoniu, L.; Lopes, G. A. D.; and Babuska, R. 2012. A survey of actorcritic reinforcement learning: Standard and natural policy gradients. IEEE Trans. Systems, Man, and Cybernetics 42(6):1291–1307.
 [Gustafson et al.2014] Gustafson, D.; McTavish, F.; Chih, M.; Atwood, A.; …; and Shah, D. 2014. A smartphone application to support recovery from alcoholism: a randomized clinical trial. JAMA Psychiatry 71(5):566–572.
 [Ho and Lin2015] Ho, C.Y., and Lin, H.T. 2015. Contract bridge bidding by learning. In AAAI Workshop: Computer Poker and Imperfect Information.
 [Jiang, Nie, and Huang2015] Jiang, W.; Nie, F.; and Huang, H. 2015. Robust dictionary learning with capped l1norm. In IJCAI, 3590–3596.
 [Klasnja et al.2015] Klasnja, P.; Hekler, E. B.; Shiffman, S.; Boruvka, A.; Almirall, D.; Tewari, A.; and Murphy, S. A. 2015. Microrandomized trials: An experimental design for developing justintime adaptive interventions. Health Psychology 34(S):1220.
 [Lagoudakis and Parr2003] Lagoudakis, M. G., and Parr, R. 2003. Leastsquares policy iteration. Journal of machine learning research 4(Dec):1107–1149.
 [Lei, Tewari, and Murphy2014] Lei, H.; Tewari, A.; and Murphy, S. 2014. An actorcritic contextual bandit algorithm for personalized interventions using mobile devices. In NIPS 2014 Workshop: Personalization: Methods and Applications, 1 – 9.
 [Lei2016] Lei, H. 2016. An Online Actor Critic Algorithm and a Statistical Decision Procedure for Personalizing Intervention. Ph.D. Dissertation, University of Michigan.
 [Li et al.2010] Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A contextualbandit approach to personalized news article recommendation. In International Conference on World Wide Web (WWW), 661–670.
 [Li et al.2015] Li, X.; Li, L.; Gao, J.; He, X.; Chen, J.; Deng, L.; and He, J. 2015. Recurrent reinforcement learning: a hybrid approach. arXiv:1509.03044.
 [Liao, Tewari, and Murphy2015] Liao, P.; Tewari, A.; and Murphy, S. 2015. Constructing justintime adaptive interventions. Phd Section Proposal 1–49.
 [Liu, Shah, and Jiang2004] Liu, H.; Shah, S.; and Jiang, W. 2004. Online outlier detection and data cleaning. Computers & chemical engineering 28(9):1635–1647.
 [Murphy et al.2016] Murphy, S. A.; Deng, Y.; Laber, E. B.; Maei, H. R.; Sutton, R. S.; and Witkiewitz, K. 2016. A batch, offpolicy, actorcritic algorithm for optimizing the average reward. CoRR abs/1607.05047.

 [Nie et al.2010] Nie, F.; Huang, H.; Cai, X.; and Ding, C. H. 2010. Efficient and robust feature selection via joint ℓ2,1-norms minimization. In NIPS, 1813–1821.
 [Sun, Xiang, and Ye2013] Sun, Q.; Xiang, S.; and Ye, J. 2013. Robust principal component analysis via capped norms. In ACM SIGKDD, 311–319.
 [Suomela2014] Suomela, J. 2014. Median filtering is equivalent to sorting. arXiv:1406.1717.
 [Sutton and Barto2012] Sutton, R. S., and Barto, A. G. 2012. Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2nd edition.
 [Tewari and Murphy2017] Tewari, A., and Murphy, S. A. 2017. From ads to interventions: Contextual bandits in mobile health. In Rehg, J.; Murphy, S. A.; and Kumar, S., eds., Mobile Health: Sensors, Analytic Methods, and Applications. Springer.
 [Williamson, Parker, and Kendrick1989] Williamson, D. F.; Parker, R. A.; and Kendrick, J. S. 1989. The box plot: a simple visual method to interpret data. Annals of internal medicine 110(11):916–921.
 [Witkiewitz et al.2014] Witkiewitz, K.; Desai, S.; Bowen, S.; Leigh, B.; Kirouac, M.; and Larimer, M. 2014. Development and evaluation of a mobile intervention for heavy drinking and smoking among college students. Psychology of Addictive Behaviors 28(3):639–650.
 [Xu2009] Xu, H. 2009. Robust decision making and its applications in machine learning. McGill University.
 [Zhang et al.2012] Zhang, B.; Tsiatis, A. A.; Laber, E. B.; and Davidian, M. 2012. A robust method for estimating optimal treatment regimes. Biometrics 68(4):1010–1018.
 [Zhu et al.2015] Zhu, F.; Fan, B.; Zhu, X.; Wang, Y.; Xiang, S.; and Pan, C. 2015. 10,000+ times accelerated robust subset selection (ARSS). In Proc. Assoc. Adv. Artif. Intell. (AAAI), 3217–3224.