Robust Contextual Bandit via the Capped-ℓ_2 norm

08/17/2017 · by Feiyun Zhu, et al.

This paper considers the actor-critic contextual bandit for mobile health (mHealth) interventions. State-of-the-art decision-making methods in mHealth generally assume that the noise in the dynamic system follows a Gaussian distribution. They use least-square-based algorithms to estimate the expected reward, which are sensitive to outliers. To deal with outliers, we propose a novel robust actor-critic contextual bandit method for the mHealth intervention. In the critic updating, the capped-ℓ₂ norm is used to measure the approximation error, which prevents outliers from dominating the objective and yields a set of sample weights. These weights give a weighted objective for the actor updating, assigning zero weight to the samples that are badly corrupted by noise. As a result, the robustness of both the actor and critic updating is enhanced. The capped-ℓ₂ norm has a key thresholding parameter; we provide a reliable method to set it properly, based on one of the most fundamental definitions of outliers in statistics. Extensive experiment results demonstrate that our method achieves almost identical results to the state-of-the-art methods on datasets without outliers and dramatically outperforms them on datasets corrupted by outliers.




1 Introduction

Nowadays, billions of people frequently use various kinds of smart devices, such as smartphones and wearable activity sensors [13, 14, 26, 25]. It is increasingly popular in the scientific community to use state-of-the-art artificial intelligence technology, supercomputers and big data to facilitate the prediction of healthcare tasks [21, 29, 20]. In this paper, we use mobile health (mHealth) technologies to collect and analyze real-time data from users. Based on that data, the goal of mHealth is to decide when, where, and how to deliver in-time interventions that best serve users, helping them lead healthier lives. For example, mHealth interventions can guide people to reduce alcohol abuse, increase physical activity, and regain control over eating disorders and obesity/weight management [13, 14, 9].

The tailoring of mHealth interventions is generally modeled as a sequential decision-making (SDM) problem, and the contextual bandit provides a paradigm for SDM [18, 22, 25, 26]. In mHealth, the first contextual bandit [10] was proposed in 2014. It works in an actor-critic setting with an explicitly parameterized stochastic policy. This setting has two advantages: (1) the actor-critic algorithm has good convergence properties with low variance [8]; (2) by analyzing the estimated parameters, we can understand which key features contribute most to the policy, which is important for behavioral scientists when designing the state (feature). Later, Lei [9] improved the method by emphasizing exploration and introducing a stochasticity constraint on the policy coefficients.

Those two methods serve as a good start for mHealth. However, they assume that there are no outliers in the data. They use least-square-based algorithms to learn the expected reward, which are prone to the presence of outliers [24, 27, 23, 19, 5]. In practice, there are various kinds of complex noise in the mHealth system. For example, wearable devices may be unable to accurately record the states and rewards of users under various conditions. mHealth also requires self-reports to deliver effective interventions, but some users are unwilling to complete them and sometimes fill them out randomly to save time. We treat the various complex noises in the system as outliers, and want to get rid of such extreme observations.

In this paper, a novel robust actor-critic contextual bandit is proposed to deal with the outlier issue in the mHealth system. The capped-ℓ₂ norm is used in the estimation of the expected reward in the critic updating. As a result, we obtain a set of sample weights. With them, we propose a weighted objective for the actor updating, which gives zero weight to the samples that are ineffective for the critic updating. As a result, the robustness of both the actor and critic updating is greatly enhanced. The capped-ℓ₂ norm has a key thresholding parameter; we propose a principled method to set it, based on a fundamental definition of outliers in statistics. With it, we achieve the conflicting goals of enhancing the robustness of our algorithm while obtaining almost the same results as the state-of-the-art method on datasets without outliers. Extensive experiment results show that, across a variety of parameter settings, our method obtains clear gains compared with the state-of-the-art methods.

2 Preliminaries

The expected reward is a core concept in the contextual bandit to evaluate the policy for the dynamic system. In the case of large state or action spaces, a parameterized approximation is widely adopted: r̄(s, a) ≈ x(s, a)ᵀw, which is assumed to lie in a low-dimensional space, where w is the unknown coefficient vector and x(s, a) is the contextual feature for the state-action pair.

The aim of the actor-critic algorithm is to learn an optimal policy that maximizes the reward over all the state-action pairs. The objective is θ* = argmax_θ J(θ), where J(θ) = Σ_s d(s) Σ_a π_θ(a|s) r̄(s, a) is the average reward over all the possible states and actions, and d(s) is a reference distribution over states. To make the actor updating a well-posed objective, various constraints on θ are considered [10]. Specifically, the stochasticity constraint is introduced to reduce habituation and facilitate learning [9]. The stochasticity constraint specifies that the probability of selecting each action is at least p₀ for more than a 1 − α₀ fraction of contexts: P{s : min_a π_θ(a|s) ≥ p₀} ≥ 1 − α₀. Via the Markov inequality, a relaxed and smoother stochasticity constraint is obtained [9], leading to the objective

J(θ) = Σ_s d(s) Σ_a π_θ(a|s) r̄(s, a) − λ θᵀ E_d[g(s) g(s)ᵀ] θ,    (1)

where g(s) = φ(s, 1) − φ(s, 0) and φ(s, a) is the feature for the policy [9].

According to (1), we need an estimate of the expected reward to form the objective. This process is called the critic updating [8]. Current methods generally use ridge regression to learn it. The objective is defined as

min_w Σ_{i=1}^{T} (r_i − x_iᵀw)² + μ ‖w‖₂²,    (2)

which has a closed-form solution w = (XXᵀ + μI)⁻¹Xr, where X is the designed matrix whose i-th column is x_i = x(s_i, a_i), and r consists of all the immediate rewards. However, like the existing least-square-based algorithms, objective (2) is sensitive to the existence of outliers [16, 15].
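As a concrete illustration of the ridge-regression critic above, the sketch below solves the closed-form system w = (XXᵀ + μI)⁻¹Xr on synthetic data. All names (X, r, mu) and the data-generation step are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
p, T = 5, 200                        # feature dimension, number of tuples
w_true = rng.normal(size=p)          # ground-truth coefficients (synthetic)
X = rng.normal(size=(p, T))          # designed matrix, columns are x_i
r = X.T @ w_true + 0.01 * rng.normal(size=T)   # immediate rewards + noise

mu = 0.1                             # ridge regularizer
# closed-form ridge solution: solve (X X^T + mu*I) w = X r
w_hat = np.linalg.solve(X @ X.T + mu * np.eye(p), X @ r)
```

With small noise and no outliers the estimate recovers the true coefficients closely; a single corrupted reward, however, shifts w_hat noticeably, which motivates the capped loss in Section 3.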

3 Robust Contextual Bandit with the Capped-ℓ₂ Norm

To boost the robustness of the actor-critic learning, the capped-ℓ₂ norm is used to measure the approximation error:

min_w Σ_{i=1}^{T} min{(r_i − x_iᵀw)², ε} + μ ‖w‖₂².    (3)

By properly setting the value of ε, we can get rid of the outliers that lie far away from the majority of samples while keeping the effective samples. When ε is too large, outliers are left in the data; when ε is too small, many effective samples are removed, leading to unstable estimations.
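The effect of the capped loss can be seen element-wise: any residual whose square exceeds the threshold contributes the same constant, so a single outlier cannot dominate the sum. A minimal sketch (the value of the threshold here is arbitrary):

```python
import numpy as np

eps = 4.0
residuals = np.array([0.5, -1.0, 1.5, 100.0])   # the last entry is an outlier
# capped-ell_2 loss per sample: min{e_i^2, eps}
capped = np.minimum(residuals ** 2, eps)
```

The clean residuals keep their squared loss, while the outlier's contribution is capped at eps instead of 10000.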

It is important to properly set the value of ε. We propose an effective method to do so, derived from one of the most widely accepted outlier definitions in statistics: when a boxplot is used to give a descriptive illustration of the distribution of a dataset, the samples that lie more than 1.5 × IQR above the third quartile are treated as outliers. Thus, we set

ε = Q₃ + β (Q₃ − Q₁),    (4)

where Q₁ and Q₃ are the first and third quartiles of the squared approximation errors, IQR = Q₃ − Q₁ is the interquartile range, and β is a tuning parameter that gives us a flexible setting of ε; it is set to 1 by default.
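The quartile rule above can be sketched directly. As an assumption (matching the capped term in (3)), the rule is applied to the squared residuals; the helper name `set_eps` is illustrative.

```python
import numpy as np

def set_eps(sq_residuals, beta=1.0):
    """Threshold eps = Q3 + beta * IQR over the squared residuals."""
    q1, q3 = np.percentile(sq_residuals, [25, 75])
    return q3 + beta * (q3 - q1)       # IQR = Q3 - Q1

# 99 well-fit samples and one huge squared residual
sq = np.concatenate([np.ones(99), [1e6]])
eps = set_eps(sq)
```

Because quartiles are insensitive to a few extreme values, the threshold stays near the bulk of the data and the outlier lands far above it.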

3.1 Algorithm for the Critic Updating

Proposition 1

The critic objective (3) is equivalent to the following objective:

min_w Σ_{i=1}^{T} u_i (r_i − x_iᵀw)² + μ ‖w‖₂²,    (5)

where the weight u_i depends on the unknown variable w.

According to Proposition 1, we have a simplified objective for the critic updating. However, it is still hard to minimize (5), since the weight term u_i depends on the unknown variable w. In this section, an iteratively re-weighted algorithm is proposed for the optimization of (5) (cf. Algorithm 1). It assumes that the weights {u_i} are fixed when seeking the optimal w, and vice versa. When {u_i} is fixed, objective (5) is convex over w. We obtain the solution by differentiating (5) and setting the derivative to zero, leading to the following linear system:

(Σ_{i=1}^{T} u_i^(k) x_i x_iᵀ + μI) w^(k+1) = Σ_{i=1}^{T} u_i^(k) r_i x_i,    (6)

where u_i^(k) is the weight at the k-th iteration. Then we update the weight term as u_i^(k+1) = 1{(r_i − x_iᵀw^(k+1))² ≤ ε} for i = 1, …, T.
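The inner loop of the critic can be sketched as follows: alternate the weighted ridge system (6) with the 0/1 weight refresh. The function name, data generation, and iteration count are illustrative assumptions.

```python
import numpy as np

def robust_critic(X, r, eps, mu=0.1, iters=20):
    """Iteratively re-weighted critic: returns (w_hat, weights u)."""
    p, T = X.shape
    u = np.ones(T)                                   # start with all samples
    for _ in range(iters):
        Xu = X * u                                   # scale columns by u_i
        w = np.linalg.solve(Xu @ X.T + mu * np.eye(p), Xu @ r)
        # weight update: u_i = 1{ (r_i - x_i^T w)^2 <= eps }
        u = ((r - X.T @ w) ** 2 <= eps).astype(float)
    return w, u

rng = np.random.default_rng(1)
p, T = 4, 300
w_true = rng.normal(size=p)
X = rng.normal(size=(p, T))
r = X.T @ w_true + 0.05 * rng.normal(size=T)
r[:10] += 50.0                                       # inject 10 outlier rewards
w_hat, u = robust_critic(X, r, eps=1.0)
```

After a few alternations the corrupted tuples receive zero weight and the fit is driven by the clean samples only.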

3.2 Algorithm for the Actor Updating

Since the reference distribution d(s) in objective (1) is generally unavailable, we consider the following T-trial based objective:

Ĵ(θ) = Σ_{i=1}^{T} u_i Σ_a π_θ(a|s_i) x(s_i, a)ᵀw − λ θᵀ [Σ_{i=1}^{T} u_i g(s_i) g(s_i)ᵀ] θ,    (7)

where u_i is the weight learned from the critic updating (cf. Section 3.1). With the weights {u_i}, the outlier tuples that have large approximation errors are removed from the actor updating. As a result, the robustness is boosted. The actor updating maximizes objective (7) over θ. We use the Sequential Quadratic Programming (SQP) algorithm for the optimization. Specifically, the implementation of SQP with a finite-difference approximation to the gradient in fmincon is utilized in our algorithm (cf. Algorithm 1).
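A minimal sketch of the weighted actor step: maximize the u_i-weighted expected reward under a two-action softmax policy, using SciPy's SLSQP (a gradient-free analogue of fmincon's SQP used in the paper). The reward model (a constant treatment effect) and the plain ridge penalty standing in for the stochasticity term are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def actor_objective(theta, S, u, w, lam=0.01):
    """Negative weighted actor objective (scipy minimizes)."""
    logits = S @ theta                     # preference for action a = 1
    pi1 = 1.0 / (1.0 + np.exp(-logits))    # softmax over two actions
    r1 = S @ w + 1.0                       # assumed reward if a = 1
    r0 = S @ w                             # assumed reward if a = 0
    J = np.sum(u * (pi1 * r1 + (1 - pi1) * r0)) / np.sum(u)
    return -(J - lam * theta @ theta)      # penalty in place of constraint

rng = np.random.default_rng(2)
T, p = 100, 3
S = rng.normal(size=(T, p))                # states
u = np.ones(T)                             # critic weights (all clean here)
w = rng.normal(size=p)                     # critic estimate
res = minimize(actor_objective, np.zeros(p), args=(S, u, w), method="SLSQP")
theta_hat = res.x
```

SLSQP approximates the gradient by finite differences when no Jacobian is supplied, mirroring the fmincon configuration described above.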


1:  Initialize the state and the policy parameter θ.
2:  repeat
3:     /* Critic updating to estimate the expected reward */
4:     repeat
5:        Update the parameter w for the expected reward via (6).
6:        Update the weight terms {u_i} according to the estimated w.
7:     until convergence
8:     Actor updating to estimate the policy parameter: θ ← argmax_θ Ĵ(θ), where Ĵ(θ) is defined in (7).
9:  until convergence

Output: the stochastic policy π_θ(a|s).

Algorithm 1 Robust actor-critic contextual bandit (RS-ACCB).

4 Experiments

4.1 Datasets

To evaluate the performance, we utilize a dataset from an mHealth study (called HeartSteps) to approximate the generative model. HeartSteps is a 42-day mHealth study, resulting in 210 decision points per user. It aims to increase the users' daily activity (i.e. steps) by sending them positive interventions, for example, suggesting that they go for a hike on a sunny weekend.

For each user, a trajectory of tuples (s_i, a_i, r_i) is generated via the micro-randomized trials [14, 13]. The initial state is drawn from the Gaussian distribution N(0, Σ_s) with a pre-defined covariance matrix Σ_s. The random policy provides a method to select actions: each action is chosen with equal probability, i.e. μ(a|s) = 0.5 for all states s. For t ≥ 1, the state and immediate reward are generated as

where β is the main coefficient vector of the dynamic system, set as [0.4, 0.3, 0.4, 0.7, 0.05, 0.6, 0.25, 3, 0.25, 0.25, 0.4, 0.1, 0.5, 500]; ξ_t is the Gaussian noise in the state model (8) and ς_t is the Gaussian noise in the reward model (9).

To simulate outliers in the trajectory, there are two processing steps: (a) a fixed ratio of tuples is randomly selected in each user's trajectory; (b) we add a large noise (a multiple of the average value in the trajectory) to the states and rewards in the selected tuples. Additionally, the actions in the selected tuples are randomly reset, to simulate the random failure of sending interventions due to a weak mobile network.
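The two-step corruption above can be sketched as follows. The helper name `inject_outliers` and the scaling by the trajectory's mean absolute value are illustrative assumptions about the paper's procedure.

```python
import numpy as np

def inject_outliers(states, rewards, actions, ratio, strength, rng):
    """(a) pick a ratio of tuples; (b) corrupt their states/rewards/actions."""
    T = rewards.shape[0]
    idx = rng.choice(T, size=int(ratio * T), replace=False)
    states, rewards, actions = states.copy(), rewards.copy(), actions.copy()
    states[idx] += strength * np.abs(states).mean()    # large state noise
    rewards[idx] += strength * np.abs(rewards).mean()  # large reward noise
    actions[idx] = rng.integers(0, 2, size=idx.size)   # random failed actions
    return states, rewards, actions, idx

rng = np.random.default_rng(3)
S = rng.normal(size=(200, 3))
r = rng.normal(size=200)
a = np.zeros(200, dtype=int)
S2, r2, a2, idx = inject_outliers(S, r, a, ratio=0.1, strength=10.0, rng=rng)
```

Only the selected tuples are altered; the rest of the trajectory is untouched.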

4.2 Experiments Settings

In the experiment, three contextual bandit methods are compared: (1) Lin-UCB (linear upper confidence bound) is a famous contextual bandit method that has achieved great success in Internet advertising [12, 6, 18]; (2) S-ACCB is the stochasticity-constrained actor-critic contextual bandit for mHealth [9]; (3) RS-ACCB is the proposed robust ACCB with the stochasticity constraint.

We use the expected long-run average reward (ElrAR) [14] to evaluate the estimated policies. There are two processing steps to obtain the ElrAR: (a) get the average reward for the i-th user by averaging the rewards over the last elements of a long trajectory of tuples generated under the policy π_θ; (b) the ElrAR is obtained by averaging over all the users.
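The two-step ElrAR computation can be sketched directly; here the per-user reward trajectories are faked with random numbers purely to exercise the averaging, and the tail length is an assumption.

```python
import numpy as np

def elrar(reward_trajs, tail=100):
    """Expected long-run average reward over a set of user trajectories."""
    per_user = [traj[-tail:].mean() for traj in reward_trajs]   # step (a)
    return float(np.mean(per_user))                             # step (b)

rng = np.random.default_rng(4)
# 10 synthetic users, long trajectories fluctuating around 1000 steps
trajs = [1000.0 + rng.normal(size=500) for _ in range(10)]
score = elrar(trajs)
```

Averaging over the tail of a long rollout approximates the long-run behavior of the policy rather than its transient start.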

There are users’ MDPs used in the experiment. Each user has a trajectory of tuples. There are variables in the state. The noises in the MDP are set as and respectively. The parameterized policy is assumed to be the Boltzmann distribution  [14], where is the unknown coefficients, is the policy feature and

. The feature vector for the estimation of expected rewards is set as

, where . The tuning parameters for the actor-critic learning are set as . The outlier ratio and strength are set and respectively. In our algorithm, is set as 1.
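The Boltzmann policy used above can be sketched for the binary-action case. The particular policy feature map below (action times an intercept-augmented state) is an assumption for illustration.

```python
import numpy as np

def boltzmann_policy(theta, s):
    """Return [P(a=0|s), P(a=1|s)] under the softmax policy."""
    def phi(s, a):
        return a * np.concatenate(([1.0], s))   # assumed policy feature
    prefs = np.array([theta @ phi(s, a) for a in (0, 1)])
    e = np.exp(prefs - prefs.max())             # numerically stable softmax
    return e / e.sum()

theta = np.array([0.0, 0.0, 0.0])
probs = boltzmann_policy(theta, np.array([1.0, -1.0]))
```

With a zero coefficient vector both actions are equally likely; as the intercept coefficient grows, the policy concentrates on action 1.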

Figure 1: Average reward of the three contextual bandit methods. The left sub-figure shows the results when the trajectory is short; the right one shows the results for a longer trajectory. RS-ACCB is our method. A larger value is better.

4.3 Results and Discussion

In this section, the experiments are carried out to verify the performance of three contextual bandit methods from the following two aspects:

(S1) We change the ratio of tuples that contain outliers. The experiment results are displayed in the left sub-table of Table 1 and in Fig. 1(a). When the ratio is zero, there are no outliers in the dataset; under this condition, our method achieves almost identical results to S-ACCB [9]. These results verify that although our method aims at robust learning, it is well adapted to datasets without outliers. As the outlier ratio rises, the performance of both Lin-UCB and S-ACCB drops noticeably, while their standard deviations increase dramatically. In contrast, both the performance and the standard deviation of our method remain stable. On average, our method improves the performance by 146.8 steps, i.e. 10.26%, compared with the best of the state-of-the-art methods.

(S2) The strength of outliers (a multiple of the average value in the trajectory) is varied. The right sub-table of Table 1 and Fig. 1(b) summarize the experiment results. We have the following observations: (1) when there is no outlier in the trajectory, our method achieves results similar to S-ACCB; (2) as the outlier strength rises, the performance of S-ACCB and Lin-UCB decreases noticeably and their standard deviations increase dramatically; (3) in contrast, both the performance and the standard deviation of our method remain stable. Compared with the state-of-the-art methods, our method obtains clear gains across a variety of parameter settings. On average, it improves the performance by 139.3 steps and 143.3 steps compared with Lin-UCB and S-ACCB respectively.


Average reward (± std) vs. outlier ratio (left) and outlier strength (right):

Lin-UCB | S-ACCB | RS-ACCB || Lin-UCB | S-ACCB | RS-ACCB
1578.7±13.75 | 1578.3±12.70 | 1578.3±12.55 || 1578.7±13.75 | 1578.3±12.70 | 1578.3±12.55
1462.5±40.24 | 1462.9±39.88 | 1578.4±12.61 || 1535.6±21.94 | 1527.7±30.71 | 1578.3±12.68
1428.1±49.69 | 1429.5±45.79 | 1578.2±12.57 || 1431.7±44.13 | 1424.7±46.53 | 1578.2±12.65
1391.0±49.42 | 1383.2±50.40 | 1578.6±12.66 || 1380.8±49.03 | 1377.2±48.83 | 1578.2±12.62
1370.6±50.20 | 1365.0±49.02 | 1578.7±12.62 || 1359.8±49.76 | 1357.1±48.51 | 1578.2±12.63
1358.9±48.43 | 1365.0±49.02 | 1578.7±12.62 || 1346.8±48.83 | 1344.9±46.94 | 1578.2±12.64
Avg 1431.6 | 1430.7 | 1578.5 || Avg 1438.9 | 1435.0 | 1578.2

Table 1: Average reward vs. outlier ratio (setting S1) and outlier strength (setting S2) in the two sub-tables. The three methods are (a) Lin-UCB [12], (b) S-ACCB [9] and (c) RS-ACCB (our method). A larger value is better.

5 Conclusions and Future Directions

To alleviate the influence of outliers in the mHealth study, a robust actor-critic contextual bandit method is proposed to form robust interventions. We use the capped-ℓ₂ norm to boost the robustness of the critic updating, which yields a set of sample weights. With them, we propose a weighted objective for the actor updating, which gives zero weight to the tuples that have large approximation errors, enhancing the robustness against those tuples. Additionally, a principled method is provided to properly set the thresholding parameter ε in the capped-ℓ₂ norm. With it, we achieve the conflicting goals of enhancing the robustness of the actor-critic algorithm while obtaining almost identical results to the state-of-the-art method on datasets without outliers. Extensive experiment results show that, across a variety of parameter settings, the proposed method obtains significant improvements compared with the state-of-the-art contextual bandit methods. In the future, we may explore robust learning for reinforcement learning methods, in both the discounted reward setting and the average reward setting [8, 14]. These directions are much more challenging, since estimating the value function is no longer a general regression task. Besides, mining the cohesion information among users could help enrich the data (or restrict the parameter space) [28, 11, 1, 2, 4, 3].

Appendix: the proof of Proposition 1


The objective (3) is non-convex and non-differentiable [17, 7]. We can obtain its sub-gradient:

∂/∂w [ Σ_{i=1}^{T} min{(r_i − x_iᵀw)², ε} + μ ‖w‖₂² ] = −2 Σ_{i=1}^{T} u_i x_i (r_i − x_iᵀw) + 2μw,    (10)

where

u_i = 1 if (r_i − x_iᵀw)² < ε;  u_i ∈ [0, 1] if (r_i − x_iᵀw)² = ε;  u_i = 0 otherwise.

Letting u_i = 1{(r_i − x_iᵀw)² ≤ ε} for i = 1, …, T gives a simplified partial derivative of (3) that satisfies the sub-gradient (10). It is defined as

−2 Σ_{i=1}^{T} u_i x_i (r_i − x_iᵀw) + 2μw,

which is equivalent to the partial derivative of the following objective:

Σ_{i=1}^{T} u_i (r_i − x_iᵀw)² + μ ‖w‖₂².    (11)

From the perspective of optimization, objective (11) is equivalent to (3).


  • [1] G. Cheng, Y. Wang, Y. Gong, F. Zhu, and C. Pan. Urban road extraction via graph cuts based probability propagation. In Image Processing (ICIP), 2014 IEEE International Conference on, pages 5072–5076. IEEE, 2014.
  • [2] G. Cheng, Y. Wang, F. Zhu, and C. Pan. Road extraction via adaptive graph cuts with multiple features. In Image Processing (ICIP), IEEE International Conference on, pages 3962–3966. IEEE, 2015.
  • [3] G. Cheng, F. Zhu, S. Xiang, and C. Pan. Road centerline extraction via semisupervised segmentation and multidirection nonmaximum suppression. IEEE Geoscience and Remote Sensing Letters, 13(4):545–549, 2016.
  • [4] G. Cheng, F. Zhu, S. Xiang, Y. Wang, and C. Pan. Accurate urban road centerline extraction from VHR imagery via multiscale segmentation and tensor voting. Neurocomputing, 205:407–420, 2016.
  • [5] G. Cheng, F. Zhu, S. Xiang, Y. Wang, and C. Pan. Semisupervised hyperspectral image classification via discriminant analysis and robust regression. IEEE J. of Selected Topics in Applied Earth Observations and Remote Sensing, 9(2):595–608, 2016.
  • [6] M. Dudík, J. Langford, and L. Li. Doubly robust policy evaluation and learning. In ICML, pages 1097–1104, 2011.
  • [7] H. Gao, F. Nie, T. W. Cai, and H. Huang. Robust capped norm nonnegative matrix factorization: Capped norm nmf. In ACM International Conference on Information and Knowledge (CIKM), pages 871–880, 2015.
  • [8] I. Grondman, L. Busoniu, G. A. D. Lopes, and R. Babuska. A survey of actor-critic reinforcement learning: Standard and natural policy gradients. IEEE Trans. Systems, Man, and Cybernetics, 42(6):1291–1307, 2012.
  • [9] H. Lei. An Online Actor Critic Algorithm and a Statistical Decision Procedure for Personalizing Intervention. PhD thesis, University of Michigan, 2016.
  • [10] H. Lei, A. Tewari, and S. Murphy. An actor-critic contextual bandit algorithm for personalized interventions using mobile devices. In NIPS 2014 Workshop: Personalization: Methods and Applications, pages 1 – 9, 2014.
  • [11] H. Li, Y. Wang, S. Xiang, J. Duan, F. Zhu, and C. Pan. A label propagation method using spatial-spectral consistency for hyperspectral image classification. International Journal of Remote Sensing, 37(1):191–211, 2016.
  • [12] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In International Conference on World Wide Web (WWW), pages 661–670, 2010.
  • [13] P. Liao, A. Tewari, and S. Murphy. Constructing just-in-time adaptive interventions. Phd Section Proposal, pages 1–49, 2015.
  • [14] S. A. Murphy, Y. Deng, E. B. Laber, H. R. Maei, R. S. Sutton, and K. Witkiewitz. A batch, off-policy, actor-critic algorithm for optimizing the average reward. CoRR, abs/1607.05047, 2016.
  • [15] F. Nie, H. Huang, X. Cai, and C. H. Ding. Efficient and robust feature selection via joint ℓ2,1-norms minimization. In Advances in Neural Information Processing Systems (NIPS), pages 1813–1821. Curran Associates, Inc., 2010.
  • [16] F. Nie, H. Wang, X. Cai, H. Huang, and C. Ding. Robust matrix completion via joint schatten p-norm and lp-norm minimization. In IEEE International Conference on Data Mining (ICDM), pages 566–574, Washington, DC, USA, 2012.
  • [17] Q. Sun, S. Xiang, and Y. Ye. Robust principal component analysis via capped norms. In ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 311–319, 2013.
  • [18] A. Tewari and S. A. Murphy. From ads to interventions: Contextual bandits in mobile health. In J. Rehg, S. A. Murphy, and S. Kumar, editors, Mobile Health: Sensors, Analytic Methods, and Applications. Springer, 2017.
  • [19] Y. Wang, C. Pan, S. Xiang, and F. Zhu. Robust hyperspectral unmixing with correntropy-based metric. IEEE Transactions on Image Processing, 24(11):4027–4040, 2015.
  • [20] Z. Xu, S. Wang, F. Zhu, and J. Huang. Seq2seq fingerprint: An unsupervised deep molecular embedding for drug discovery. In ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB), 2017.
  • [21] J. Yao, X. Zhu, F. Zhu, and J. Huang. Deep correlational learning for survival prediction from multi-modality data. In International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), 2017.
  • [22] L. Zhou and E. Brunskill. Latent contextual bandits and their application to personalized recommendations for new users. In International Joint Conference on Artificial Intelligence, pages 3646–3653, 2016.
  • [23] F. Zhu. Unsupervised Hyperspectral Unmixing Methods. PhD thesis, 2015.
  • [24] F. Zhu, B. Fan, X. Zhu, Y. Wang, S. Xiang, and C. Pan. 10,000+ times accelerated robust subset selection (ARSS). In Proc. Assoc. Adv. Artif. Intell. (AAAI), pages 3217–3224, 2015.
  • [25] F. Zhu and P. Liao. Effective warm start for the online actor-critic reinforcement learning based mhealth intervention. In The Multi-disciplinary Conference on Reinforcement Learning and Decision Making, pages 6 – 10, 2017.
  • [26] F. Zhu, P. Liao, X. Zhu, Y. Yao, and J. Huang. Cohesion-based online actor-critic reinforcement learning for mhealth intervention. arXiv:1703.10039, 2017.
  • [27] F. Zhu, Y. Wang, B. Fan, G. Meng, and C. Pan. Effective spectral unmixing via robust representation and learning-based sparsity. CoRR, abs/1409.0685, 2014.
  • [28] F. Zhu, Y. Wang, S. Xiang, B. Fan, and C. Pan. Structured sparse method for hyperspectral unmixing. {ISPRS} Journal of Photogrammetry and Remote Sensing, 88(0):101–118, 2014.
  • [29] X. Zhu, J. Yao, F. Zhu, and J. Huang. WSISA: Making survival prediction from whole slide histopathological images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7234–7242, 2017.