1 Introduction
Nowadays, billions of people frequently use various kinds of smart devices, such as smartphones and wearable activity sensors [13, 14, 26, 25]. It is increasingly popular among the scientific community to make use of state-of-the-art artificial intelligence technology to leverage supercomputers and big data to facilitate the prediction of healthcare tasks [21, 29, 20]. In this paper, we use mobile health (mHealth) technologies to collect and analyze real-time data from users. Based on that, the goal of mHealth is to decide when, where, and how to deliver in-time interventions that best serve users, helping them lead healthier lives. For example, mHealth guides people to reduce alcohol abuse, increase physical activity, and regain control over eating disorders and obesity/weight management [13, 14, 9].

The tailoring of mHealth interventions is generally modeled as a sequential decision-making (SDM) problem, and the contextual bandit provides a paradigm for SDM [18, 22, 25, 26]. In mHealth, the first contextual bandit [10]
was proposed in 2014. It works in an actor-critic setting and has an explicit parameterized stochastic policy. Such a setting has two advantages: (1) the actor-critic algorithm has good convergence properties with low variance [8]; (2) we can understand which key features contribute most to the policy by analyzing the estimated parameters, which is important for behavioral scientists when designing the state (feature) representation. Later, Lei [9] improved the method by emphasizing exploration and introducing a stochasticity constraint on the policy coefficients.

Those two methods serve as a good start for mHealth. However, they assume that there are no outliers in the data: they use least-squares-based algorithms to learn the expected reward, which are prone to be skewed by the presence of outliers [24, 27, 23, 19, 5]. In practice, there are various kinds of complex noise in an mHealth system. For example, wearable devices may fail to accurately record states and rewards under various conditions. mHealth also relies on self-reports to deliver effective interventions, but some users are unwilling to complete them and sometimes fill out the reports randomly to save time. We treat these various complex noises in the system as outliers and want to get rid of such extreme observations.
In this paper, a novel robust actor-critic contextual bandit is proposed to deal with the outlier issue in the mHealth system. The capped norm is used in the estimation of the expected reward in the critic updating. As a result, we obtain a set of weights. With them, we propose a weighted objective for the actor updating, which gives zero weight to the samples that are ineffective for the critic updating. As a result, the robustness of both the actor and critic updating is greatly enhanced. The capped norm involves a key thresholding parameter; we propose a principled method to set it, based on a widely accepted outlier-detection rule in statistics. With it, we can achieve two seemingly conflicting goals: enhancing the robustness of our algorithm while obtaining almost identical results to the state-of-the-art method on datasets without outliers. Extensive experimental results show that, in a variety of parameter settings, our method obtains clear gains compared with the state-of-the-art methods.
2 Preliminaries
The expected reward is a core concept in the contextual bandit to evaluate the policy for the dynamic system. In the case of large state or action spaces, a parameterized approximation is widely accepted: $\overline{R}(s,a)\approx x(s,a)^{\top}w$, which is assumed to lie in a low-dimensional space, where $w$ is the vector of unknown coefficients and $x(s,a)$ is the contextual feature for the state–action pair $(s,a)$.
The aim of the actor-critic algorithm is to learn an optimal policy $\pi_{\theta}(a\mid s)$ that maximizes the reward over all state–action pairs. The objective is $\max_{\theta}\,\mathbb{E}_{s\sim d}\big[\sum_{a}\pi_{\theta}(a\mid s)\,\overline{R}(s,a)\big]$, the average reward over all possible states and actions, where $d(s)$ is a reference distribution over states. To make the actor updating a well-posed objective, various constraints on $\theta$ are considered [10]. Specifically, the stochasticity constraint is introduced to reduce habituation and facilitate learning [9]. It specifies that the probability of selecting each of the two actions is at least $p_{0}$ for a large fraction of contexts. Via the Markov inequality, a relaxed and smoother stochasticity constraint is obtained [9], leading to the objective

$$\widehat{\theta}=\arg\max_{\theta}\ \mathbb{E}_{s\sim d}\Big[\sum_{a}\pi_{\theta}(a\mid s)\,\overline{R}(s,a)\Big]-\lambda\,\theta^{\top}\,\mathbb{E}_{s\sim d}\big[g(s)g(s)^{\top}\big]\,\theta,\qquad(1)$$

where $\lambda$ is the tuning parameter balancing reward and stochasticity, and $g(s)$ is the feature for the policy [9].
According to (1), we need an estimate of the expected reward to form the objective. This process is called the critic updating [8]. Current methods generally use ridge regression to learn it, with the objective

$$\widehat{w}=\arg\min_{w}\ \sum_{i=1}^{T}\big[r_{i}-x(s_{i},a_{i})^{\top}w\big]^{2}+\zeta\|w\|_{2}^{2}.\qquad(2)$$

It has a closed-form solution $\widehat{w}=\big(XX^{\top}+\zeta I\big)^{-1}X\mathbf{r}$, where $X$ is the design matrix whose $i$-th column is $x(s_{i},a_{i})$, and $\mathbf{r}$ consists of all the immediate rewards. However, similar to existing least-squares-based algorithms, this objective is sensitive to the existence of outliers [16, 15].
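As a concrete illustration, the closed-form ridge-regression critic update can be sketched in a few lines. The function and variable names here are ours, and the regularization value is arbitrary — this is a minimal sketch, not the paper's implementation.

```python
import numpy as np

def critic_ridge(X, r, zeta=0.01):
    """Closed-form ridge-regression critic: w = (X X^T + zeta I)^{-1} X r.

    X    : (u, T) design matrix whose i-th column is x(s_i, a_i)
    r    : (T,) vector of immediate rewards
    zeta : ridge regularization strength
    """
    u = X.shape[0]
    return np.linalg.solve(X @ X.T + zeta * np.eye(u), X @ r)

# Tiny usage example with synthetic, nearly noise-free data.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 50))            # 3 features, 50 tuples
w_true = np.array([1.0, -2.0, 0.5])
r = X.T @ w_true + 0.01 * rng.normal(size=50)
w_hat = critic_ridge(X, r, zeta=1e-3)   # recovers w_true closely
```

With clean data the estimate tracks the true coefficients, which is exactly the property that breaks down once outliers enter the rewards.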
3 Robust Contextual Bandit with the Capped Norm
To boost the robustness of the actor-critic learning, the capped norm is used to measure the approximation error:

$$\widehat{w}=\arg\min_{w}\ \sum_{i=1}^{T}\min\Big\{\big[r_{i}-x(s_{i},a_{i})^{\top}w\big]^{2},\ \varepsilon^{2}\Big\}+\zeta\|w\|_{2}^{2}.\qquad(3)$$

By properly setting the value of $\varepsilon$, we can get rid of the outliers that lie far away from the majority of samples while keeping the effective samples. When $\varepsilon$ is too large, outliers remain in the data; when $\varepsilon$ is too small, many effective samples are removed, leading to unstable estimations.
It is therefore important to set the value of $\varepsilon$ properly. We propose an effective method to do so, derived from one of the most widely accepted outlier definitions in the statistics community: when a boxplot is used to give a descriptive illustration of the distribution of a dataset, the samples that lie more than 1.5 interquartile ranges above the third quartile are treated as outliers. Thus, we set $\varepsilon$ as

$$\varepsilon=Q_{3}+\alpha\,(Q_{3}-Q_{1}),\qquad(4)$$

where $Q_{3}-Q_{1}$ is the interquartile range (IQR) and $\alpha$ is a tuning parameter that gives us a flexible setting of $\varepsilon$.
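A minimal sketch of this boxplot-style rule, assuming the quartiles are taken over the absolute approximation errors; the function name and the example values are ours.

```python
import numpy as np

def capped_norm_threshold(residuals, alpha=1.0):
    """Boxplot-style threshold (4): epsilon = Q3 + alpha * IQR.

    residuals : absolute approximation errors |r_i - x_i^T w|
    alpha     : tuning parameter for a flexible setting of epsilon
    """
    q1, q3 = np.percentile(residuals, [25, 75])
    return q3 + alpha * (q3 - q1)

# A single heavy outlier barely shifts the quartile-based threshold,
# so the outlier itself ends up well above epsilon.
res = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 100.0])
eps = capped_norm_threshold(res, alpha=1.0)   # approximately 0.725
```

Because quartiles are insensitive to extreme values, the threshold stays near the bulk of the residuals even when gross outliers are present — the property the setting of $\varepsilon$ relies on.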
3.1 Algorithm for the Critic Updating
Proposition 1
The critic objective (3) is equivalent to the following objective:

$$\min_{w}\ \sum_{i=1}^{T}u_{i}\big[r_{i}-x(s_{i},a_{i})^{\top}w\big]^{2}+\zeta\|w\|_{2}^{2},\qquad(5)$$

where the weight $u_{i}$ is dependent on the unknown variable $w$.
According to Proposition 1, we have a simplified objective for the critic updating. However, it is still complex to minimize (5), since the weight term $u_{i}$ depends on the unknown variable $w$. In this section, an iteratively reweighted algorithm is proposed for the optimization of (5) (cf. Algorithm 1). It assumes the weights are fixed when seeking the optimal $w$, and vice versa. When the weights are fixed, the objective (5) is convex over $w$; we may obtain the solution by differentiating (5) and setting the derivative to zero, leading to the linear system

$$\Big(\sum_{i=1}^{T}u_{i}^{(k)}\,x_{i}x_{i}^{\top}+\zeta I\Big)\,w^{(k+1)}=\sum_{i=1}^{T}u_{i}^{(k)}\,r_{i}\,x_{i},\qquad(6)$$

where $u_{i}^{(k)}$ is the weight at the $k$-th iteration and $x_{i}=x(s_{i},a_{i})$. Then we update the weight term as $u_{i}^{(k+1)}=\mathbb{I}\big\{\big|r_{i}-x_{i}^{\top}w^{(k+1)}\big|\le\varepsilon\big\}$ for $i=1,\dots,T$.
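The alternating scheme above can be sketched as follows. This is a simplified stand-in for Algorithm 1, assuming 0/1 weights from an indicator on the residuals; all names and the test values are ours.

```python
import numpy as np

def robust_critic(X, r, eps, zeta=0.01, n_iter=20):
    """Iteratively reweighted critic for the capped-norm objective (5).

    Alternates between the weighted ridge linear system (6) and the
    0/1 weight update u_i = 1{|r_i - x_i^T w| <= eps}.
    """
    u_dim, T = X.shape
    u = np.ones(T)                       # start with all samples active
    w = np.zeros(u_dim)
    for _ in range(n_iter):
        Xu = X * u                       # scale column i by weight u_i
        w = np.linalg.solve(Xu @ X.T + zeta * np.eye(u_dim), Xu @ r)
        u = (np.abs(r - X.T @ w) <= eps).astype(float)
    return w, u

# Outliers receive zero weight, so they no longer distort the fit.
rng = np.random.default_rng(1)
X = rng.normal(size=(2, 40))
w_true = np.array([2.0, -1.0])
r = X.T @ w_true
r[:4] += 50.0                            # inject four outlier rewards
w_hat, weights = robust_critic(X, r, eps=5.0, zeta=1e-3)
```

After a couple of iterations the four corrupted tuples are screened out and the coefficients are recovered from the clean samples alone; a plain ridge fit on the same data would be pulled far off by the injected rewards.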
3.2 Algorithm for the Actor Updating
Since the distribution $d(s)$ in the objective (1) is generally unavailable, we consider the trial-based objective

$$\widehat{\theta}=\arg\max_{\theta}\ \frac{1}{T}\sum_{i=1}^{T}u_{i}\Big[\sum_{a}\pi_{\theta}(a\mid s_{i})\,\widehat{R}(s_{i},a)\Big]-\lambda\,\theta^{\top}\Big[\frac{1}{T}\sum_{i=1}^{T}g(s_{i})g(s_{i})^{\top}\Big]\theta,\qquad(7)$$

where $u_{i}$ is the weight learned from the critic updating (cf. Section 3.1). With the weights $\{u_{i}\}$, the outlier tuples that have large approximation errors are removed from the actor updating; as a result, the robustness is boosted. The actor updating aims to maximize the objective (7) over $\theta$. We use the Sequential Quadratic Programming (SQP) algorithm for the optimization. Specifically, the implementation of SQP with a finite-difference approximation to the gradient in fmincon is utilized in our algorithm (cf. Algorithm 1).
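A rough sketch of the weighted actor step, assuming a binary-action Boltzmann policy and using SciPy's SLSQP routine as a stand-in for the fmincon SQP implementation mentioned above; the names, shapes, and the tiny instance below are ours, not the paper's.

```python
import numpy as np
from scipy.optimize import minimize

def actor_update(G, R_hat, u, lam=0.1, theta0=None):
    """Maximize a weighted actor objective in the spirit of (7).

    G     : (T, q) policy features g(s_i)
    R_hat : (T, 2) estimated rewards R_hat(s_i, a) for actions a in {0, 1}
    u     : (T,) 0/1 weights from the robust critic
    lam   : stochasticity tuning parameter lambda
    """
    T, q = G.shape
    theta0 = np.zeros(q) if theta0 is None else theta0

    def neg_J(theta):
        p1 = 1.0 / (1.0 + np.exp(-G @ theta))   # Boltzmann P(a=1 | s)
        pi = np.stack([1.0 - p1, p1], axis=1)   # (T, 2) policy probabilities
        value = np.mean(u * np.sum(pi * R_hat, axis=1))
        penalty = lam * theta @ (G.T @ G / T) @ theta
        return -(value - penalty)               # minimize the negative

    # SLSQP uses a finite-difference gradient, mirroring the fmincon setup.
    res = minimize(neg_J, theta0, method="SLSQP")
    return res.x

# Tiny instance: action 1 is always better, so theta should turn positive.
G = np.ones((100, 1))
R_hat = np.column_stack([np.zeros(100), np.ones(100)])
u = np.ones(100)
theta_hat = actor_update(G, R_hat, u, lam=0.01)
```

The stochasticity penalty keeps the learned policy from saturating: even though action 1 strictly dominates here, the quadratic term bounds how deterministic the optimizer lets the policy become.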
4 Experiments
4.1 Datasets
To evaluate the performance, we utilize a dataset from an mHealth study (called HeartSteps) to approximate the generative model. HeartSteps is a 42-day mHealth study, resulting in 210 decision points per user. It aims to increase users' daily activity (i.e., steps) by sending them positive interventions, for example, suggesting a hike on a sunny weekend.
For each user, a trajectory of $T$ tuples of observations $\{(s_{i},a_{i},r_{i})\}_{i=1}^{T}$ is generated via the micro-randomized trial [14, 13]. The initial state is drawn from a Gaussian distribution with a pre-defined covariance matrix $\Sigma$. The random policy provides a method to select actions, choosing each action with a fixed probability for all states. Subsequent states and immediate rewards are then generated by a linear Gaussian state-transition model and a linear Gaussian reward model, whose main coefficient vector $\beta$ for the dynamic system is set as [0.4, 0.3, 0.4, 0.7, 0.05, 0.6, 0.25, 3, 0.25, 0.25, 0.4, 0.1, 0.5, 500].
To simulate outliers in the trajectory, there are two processing steps: (a) a fixed ratio of tuples is randomly selected in each user's trajectory; (b) we add a large noise (a multiple of the average value in the trajectory) to the states and rewards in the selected tuples. Additionally, the actions in the selected tuples are randomly reset to simulate the random failure of sending interventions due to a weak mobile network.
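The two corruption steps can be sketched as follows; the ratio and strength values, like all names here, are illustrative stand-ins for the elided experimental settings.

```python
import numpy as np

def inject_outliers(states, rewards, actions, ratio=0.1, strength=5.0, seed=0):
    """Corrupt a simulated trajectory as in the two steps of Section 4.1.

    (a) randomly pick a fixed ratio of tuples;
    (b) add a large noise -- `strength` times the trajectory's average
        magnitude -- to their states and rewards, and randomize their
        actions to mimic failed intervention delivery.
    """
    rng = np.random.default_rng(seed)
    T = len(rewards)
    idx = rng.choice(T, size=int(round(ratio * T)), replace=False)
    states, rewards, actions = states.copy(), rewards.copy(), actions.copy()
    states[idx] += strength * np.mean(np.abs(states))
    rewards[idx] += strength * np.mean(np.abs(rewards))
    actions[idx] = rng.integers(0, 2, size=len(idx))   # random binary actions
    return states, rewards, actions, idx

# Usage: corrupt 10% of a 50-step trajectory of unit rewards.
s = np.zeros((50, 3)); r = np.ones(50); a = np.zeros(50, dtype=int)
s2, r2, a2, idx = inject_outliers(s, r, a, ratio=0.1, strength=5.0)
```

Only the selected tuples are touched, so the clean majority of the trajectory remains available for the robust critic to fit.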
4.2 Experiment Settings
In the experiment, three contextual bandit methods are compared: (1) LinUCB (linear upper confidence bound), a famous contextual bandit method that has achieved great success in Internet advertising [12, 6, 18]; (2) SACCB, the stochasticity-constrained actor-critic contextual bandit for mHealth [9]; (3) RSACCB, the proposed Robust ACCB with the stochasticity constraint.
We use the expected long-run average reward (ElrAR) [14] to evaluate the estimated policies. There are two processing steps to obtain the ElrAR: (a) get the average reward for the $n$-th user by averaging the rewards over the final portion of a long trajectory of tuples generated under the policy $\pi_{\widehat{\theta}}$; (b) the ElrAR is obtained by averaging those per-user average rewards.
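A minimal sketch of the two ElrAR steps; `tail` is an assumed name standing in for the elided trajectory-suffix length.

```python
import numpy as np

def elrar(reward_trajectories, tail=100):
    """Expected long-run average reward (ElrAR) [14].

    (a) per-user average reward over the last `tail` elements of each
        trajectory run under the learned policy;
    (b) average those per-user values across users.
    """
    per_user = [np.mean(traj[-tail:]) for traj in reward_trajectories]
    return float(np.mean(per_user))

# Usage: two users with constant rewards of 2.0 and 4.0 steps.
score = elrar([np.full(200, 2.0), np.full(200, 4.0)], tail=100)
```

Averaging over the tail of each trajectory discards the transient behavior at the start of a run, so the metric reflects the policy's long-run performance.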
A number of users' MDPs are used in the experiment, each with a trajectory of $T$ tuples; the state contains several variables, and Gaussian noises are added to the state and reward models. The parameterized policy is assumed to be the Boltzmann distribution [14], $\pi_{\theta}(a\mid s)=\exp\{a\,g(s)^{\top}\theta\}\big/\big(1+\exp\{g(s)^{\top}\theta\}\big)$ for $a\in\{0,1\}$, where $\theta$ is the vector of unknown coefficients and $g(s)$ is the policy feature. The feature vector $x(s,a)$ for the estimation of expected rewards is constructed from the state and action. The tuning parameters for the actor-critic learning, as well as the outlier ratio and strength, are set to fixed values. In our algorithm, the parameter $\alpha$ in (4) is set as 1.

4.3 Results and Discussion
In this section, experiments are carried out to compare the three contextual bandit methods from the following two aspects:

(S1) We vary the ratio of tuples that contain outliers. The experimental results are displayed in the left sub-table of Table 1 and Fig. 1(a). When the ratio is zero, there are no outliers in the dataset; under this condition, our method achieves almost identical results to SACCB [9]. This verifies that although our method aims at robust learning, it is well adapted to datasets without outliers. As the ratio rises, the performance of both LinUCB and SACCB drops obviously, while their standard deviations increase dramatically. Compared with those two methods, both the performance and the standard deviation of our method remain stable. As a result, our method on average improves the performance by 146.8 steps, i.e., 10.26%, compared with the best of the state-of-the-art methods.

(S2) The strength of outliers is varied as a multiple of the average value in the trajectory. The right sub-table of Table 1 and Fig. 1(b) summarize the experimental results. We have the following observations: (1) when there are no outliers in the trajectory, our method achieves results similar to SACCB; (2) as the outlier strength rises, the performance of SACCB and LinUCB decreases obviously and their standard deviations increase dramatically; (3) as it rises, both the performance and the standard deviation of our method remain stable. Compared with the state-of-the-art methods, our method obtains clear gains in a variety of parameter settings. On average, it improves the performance by 139.3 steps and 143.3 steps compared with LinUCB and SACCB, respectively.
Table 1: Average reward (mean ± std) under varying outlier ratio (left) and outlier strength (right).

      Average reward vs. outlier ratio           |  Average reward vs. outlier strength
      LinUCB        SACCB         RSACCB         |  LinUCB        SACCB         RSACCB
      1578.7±13.75  1578.3±12.70  1578.3±12.55   |  1578.7±13.75  1578.3±12.70  1578.3±12.55
      1462.5±40.24  1462.9±39.88  1578.4±12.61   |  1535.6±21.94  1527.7±30.71  1578.3±12.68
      1428.1±49.69  1429.5±45.79  1578.2±12.57   |  1431.7±44.13  1424.7±46.53  1578.2±12.65
      1391.0±49.42  1383.2±50.40  1578.6±12.66   |  1380.8±49.03  1377.2±48.83  1578.2±12.62
      1370.6±50.20  1365.0±49.02  1578.7±12.62   |  1359.8±49.76  1357.1±48.51  1578.2±12.63
      1358.9±48.43  1365.0±49.02  1578.7±12.62   |  1346.8±48.83  1344.9±46.94  1578.2±12.64
Avg   1431.6        1430.7        1578.5         |  1438.9        1435.0        1578.2
5 Conclusions and Future Directions
To alleviate the influence of outliers in mHealth studies, a robust actor-critic contextual bandit method is proposed to form robust interventions. We use the capped norm to boost the robustness of the critic updating, which results in a set of weights. With them, we propose a weighted objective for the actor updating: it gives zero weight to the tuples that have large approximation errors, enhancing the robustness against those tuples. Additionally, a principled method is provided to properly set the thresholding parameter $\varepsilon$ in the capped norm. With it, we can achieve two seemingly conflicting goals: enhancing the robustness of the actor-critic algorithm while obtaining almost identical results to the state-of-the-art method on datasets without outliers. Extensive experimental results show that, in a variety of parameter settings, the proposed method obtains significant improvements compared with the state-of-the-art contextual bandit methods. In the future, we may explore robust learning for reinforcement learning methods, in both the discounted-reward setting and the average-reward setting [8, 14]. Those two directions are much more challenging, since estimating the value function is not a general regression task. Besides, mining the cohesion information among users would help enrich the data (or restrict the parameter space) [28, 11, 1, 2, 4, 3].

Appendix: the proof of Proposition 1
Proof
The objective (3) is non-convex and non-differentiable [17, 7]. We can obtain its subgradient:

$$\frac{\partial}{\partial w}=-2\sum_{i=1}^{T}d_{i}\,\big[r_{i}-x_{i}^{\top}w\big]\,x_{i}+2\zeta w,\quad\text{where}\quad d_{i}=\begin{cases}1,&\text{if }\big|r_{i}-x_{i}^{\top}w\big|\le\varepsilon,\\0,&\text{otherwise}.\end{cases}\qquad(10)$$

Letting $u_{i}=d_{i}$ for $i=1,\dots,T$ gives a simplified partial derivative of (3) that satisfies the subgradient (10), namely $-2\sum_{i=1}^{T}u_{i}\big[r_{i}-x_{i}^{\top}w\big]x_{i}+2\zeta w$, which is equivalent to the partial derivative of the following objective:

$$\min_{w}\ \sum_{i=1}^{T}u_{i}\big[r_{i}-x_{i}^{\top}w\big]^{2}+\zeta\|w\|_{2}^{2}.\qquad(11)$$

From the perspective of optimization, the objective (11) is equivalent to (3).
References
 [1] G. Cheng, Y. Wang, Y. Gong, F. Zhu, and C. Pan. Urban road extraction via graph cuts based probability propagation. In Image Processing (ICIP), 2014 IEEE International Conference on, pages 5072–5076. IEEE, 2014.
 [2] G. Cheng, Y. Wang, F. Zhu, and C. Pan. Road extraction via adaptive graph cuts with multiple features. In Image Processing (ICIP), IEEE International Conference on, pages 3962–3966. IEEE, 2015.
 [3] G. Cheng, F. Zhu, S. Xiang, and C. Pan. Road centerline extraction via semi-supervised segmentation and multi-direction non-maximum suppression. IEEE Geoscience and Remote Sensing Letters, 13(4):545–549, 2016.
 [4] G. Cheng, F. Zhu, S. Xiang, Y. Wang, and C. Pan. Accurate urban road centerline extraction from VHR imagery via multiscale segmentation and tensor voting. Neurocomputing, 205:407–420, 2016.
 [5] G. Cheng, F. Zhu, S. Xiang, Y. Wang, and C. Pan. Semi-supervised hyperspectral image classification via discriminant analysis and robust regression. IEEE J. of Selected Topics in Applied Earth Observations and Remote Sensing, 9(2):595–608, 2016.
 [6] M. Dudík, J. Langford, and L. Li. Doubly robust policy evaluation and learning. In ICML, pages 1097–1104, 2011.
 [7] H. Gao, F. Nie, T. W. Cai, and H. Huang. Robust capped norm nonnegative matrix factorization: Capped norm NMF. In ACM International Conference on Information and Knowledge Management (CIKM), pages 871–880, 2015.
 [8] I. Grondman, L. Busoniu, G. A. D. Lopes, and R. Babuska. A survey of actor-critic reinforcement learning: Standard and natural policy gradients. IEEE Trans. Systems, Man, and Cybernetics, 42(6):1291–1307, 2012.
 [9] H. Lei. An Online Actor Critic Algorithm and a Statistical Decision Procedure for Personalizing Intervention. PhD thesis, University of Michigan, 2016.
 [10] H. Lei, A. Tewari, and S. Murphy. An actor-critic contextual bandit algorithm for personalized interventions using mobile devices. In NIPS 2014 Workshop: Personalization: Methods and Applications, pages 1–9, 2014.
 [11] H. Li, Y. Wang, S. Xiang, J. Duan, F. Zhu, and C. Pan. A label propagation method using spatialspectral consistency for hyperspectral image classification. International Journal of Remote Sensing, 37(1):191–211, 2016.
 [12] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextualbandit approach to personalized news article recommendation. In International Conference on World Wide Web (WWW), pages 661–670, 2010.
 [13] P. Liao, A. Tewari, and S. Murphy. Constructing just-in-time adaptive interventions. PhD Section Proposal, pages 1–49, 2015.
 [14] S. A. Murphy, Y. Deng, E. B. Laber, H. R. Maei, R. S. Sutton, and K. Witkiewitz. A batch, off-policy, actor-critic algorithm for optimizing the average reward. CoRR, abs/1607.05047, 2016.

 [15] F. Nie, H. Huang, X. Cai, and C. H. Ding. Efficient and robust feature selection via joint $\ell_{2,1}$-norms minimization. In Advances in Neural Information Processing Systems (NIPS), pages 1813–1821. Curran Associates, Inc., 2010.
 [16] F. Nie, H. Wang, X. Cai, H. Huang, and C. Ding. Robust matrix completion via joint Schatten p-norm and $\ell_{p}$-norm minimization. In IEEE International Conference on Data Mining (ICDM), pages 566–574, Washington, DC, USA, 2012.

 [17] Q. Sun, S. Xiang, and Y. Ye. Robust principal component analysis via capped norms. In ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 311–319, 2013.
 [18] A. Tewari and S. A. Murphy. From ads to interventions: Contextual bandits in mobile health. In J. Rehg, S. A. Murphy, and S. Kumar, editors, Mobile Health: Sensors, Analytic Methods, and Applications. Springer, 2017.
 [19] Y. Wang, C. Pan, S. Xiang, and F. Zhu. Robust hyperspectral unmixing with correntropybased metric. IEEE Transactions on Image Processing, 24(11):4027–4040, 2015.
 [20] Z. Xu, S. Wang, F. Zhu, and J. Huang. Seq2seq fingerprint: An unsupervised deep molecular embedding for drug discovery. In ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACMBCB), 2017.
 [21] J. Yao, X. Zhu, F. Zhu, and J. Huang. Deep correlational learning for survival prediction from multi-modality data. In International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), 2017.
 [22] L. Zhou and E. Brunskill. Latent contextual bandits and their application to personalized recommendations for new users. In International Joint Conference on Artificial Intelligence, pages 3646–3653, 2016.
 [23] F. Zhu. Unsupervised Hyperspectral Unmixing Methods. PhD thesis, 2015.
 [24] F. Zhu, B. Fan, X. Zhu, Y. Wang, S. Xiang, and C. Pan. 10,000+ times accelerated robust subset selection (ARSS). In Proc. Assoc. Adv. Artif. Intell. (AAAI), pages 3217–3224, 2015.
 [25] F. Zhu and P. Liao. Effective warm start for the online actor-critic reinforcement learning based mHealth intervention. In The Multidisciplinary Conference on Reinforcement Learning and Decision Making, pages 6–10, 2017.
 [26] F. Zhu, P. Liao, X. Zhu, Y. Yao, and J. Huang. Cohesion-based online actor-critic reinforcement learning for mHealth intervention. arXiv:1703.10039, 2017.
 [27] F. Zhu, Y. Wang, B. Fan, G. Meng, and C. Pan. Effective spectral unmixing via robust representation and learning-based sparsity. CoRR, abs/1409.0685, 2014.
 [28] F. Zhu, Y. Wang, S. Xiang, B. Fan, and C. Pan. Structured sparse method for hyperspectral unmixing. ISPRS Journal of Photogrammetry and Remote Sensing, 88:101–118, 2014.

 [29] X. Zhu, J. Yao, F. Zhu, and J. Huang. WSISA: Making survival prediction from whole slide histopathological images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7234–7242, 2017.