1 Introduction
In numerous real-world settings, acquiring useful information is costly. In many applications, such as recommender systems, finance, or healthcare, the decision-maker performs costly research or experimentation to learn valuable information. For instance, a website must pay costs to observe (e.g., through cookies) the contextual information of its online users. In doing so, it must decide which information is best to observe in order to minimize informational costs while also achieving high rewards. However, classical contextual Multi-Armed Bandit (MAB) formulations (Chu et al. [2011], Slivkins [2011], Lu et al. [2010], Dudik et al. [2011], Langford and Zhang [2007]) have not previously considered these informational costs and are thus unable to provide satisfactory performance in such settings. This paper presents new methods and algorithms for Contextual MAB with Costly Observations (CMABCO). We show numerically that our algorithms achieve significant performance gains in a breast cancer setting, and we note that the methods and algorithms we develop are widely applicable (perhaps with some modifications) to a broad range of other settings as well, from recommender systems to finance.
A major challenge in these settings is learning both optimal observations and optimal actions. Current MAB methods could potentially be modified to address this issue by combining the choice of the context to observe and the action to be taken into a single meta-action and folding the costs of observations into the rewards. However, the regret of such an approach can be shown to be exponential in the number of actions and the number of possible context states; it is therefore so inefficient as to be impractical for any realistic problem. Hence there is a strong need for new algorithms that achieve better performance.
To overcome the limitations and challenges discussed above, we propose an alternative approach. We formalize the CMABCO problem and show that it can be reduced to a two-stage Markov Decision Process (MDP) with a canonical start state. We propose two different algorithms for this dual learning problem: SimOOS and SeqOOS, where observations are made simultaneously and sequentially, respectively. These algorithms build upon the UCRL2 algorithm of Jaksch et al. [2010] to efficiently learn optimal observations and actions. We show that both SimOOS and SeqOOS achieve regret that is sublinear in time. These algorithms thus perform well when the number of observations is small, which represents a significant improvement over existing algorithms, whose regret would be exponential in the number of observations as well as actions. Our main contributions can be summarized as follows:

We formalize the CMABCO problem as a two-stage MDP.

We propose two algorithms under two assumptions: simultaneous and sequential observation selection. We prove regret bounds that are sublinear in time for both algorithms.

We use a breast cancer dataset and show that our algorithms achieve a significant improvement in performance with respect to an important benchmark.
As we have noted, our algorithms apply in many settings with different observations, actions, and rewards. In the medical context, the observations might consist of different types of (costly) medical tests (e.g., blood tests, MRI, etc.), actions might consist of choices of treatment, and rewards might consist of survival rates. Hence, an important aspect of the decision-making is which medical tests to conduct and which treatment option to recommend. In the recommender system context, the observations might consist of (costly) information about the user (e.g., previous search records, likes in social media, etc.), actions might consist of item choices, and rewards might consist of click rates. In financial applications, the observations might represent (costly) research and information gathering about specific assets (stocks, loans, IPOs, etc.), actions might represent investment decisions, and rewards might represent investment returns. Indeed, the financial literature has studied the costs (and incentives) associated with information gathering in a variety of settings (Campbell and Kracaw [1980], Chemmanur [1993]).
2 Related Work
Our paper contributes to multiple strands of literature, including MAB, MDP and budgeted learning. We describe the contributions of our work to each topic in turn.
2.1 MAB Literature
This work relates to various strands of research in the MAB literature (Chu et al. [2011], Slivkins [2011], Lu et al. [2010], Dudik et al. [2011], Langford and Zhang [2007], Tekin and Van Der Schaar [2014]). For example, Tekin and Van Der Schaar [2014] focus on learning the optimal actions by discovering relevant information. However, that work does not consider the costs associated with gathering information and is thus unable to provide satisfactory performance in the considered setting. The CMABCO problem is similar to combinatorial semi-bandits, since multiple actions (observations and real actions) are selected and the feedback for all selected actions (observation costs and real-action rewards) is received in our setting. However, combinatorial semi-bandits do not utilize the observed states when taking the action.
Our work is also closely related to online probing (Zolghadr et al. [2013]). However, the goal in Zolghadr et al. [2013] is to learn the optimal observations and a single best function that maps observed features to labels in order to jointly minimize the loss and the observation cost. Unlike in the considered CMABCO setting, an adversarial setup is assumed and complete loss feedback (the loss associated with all the various actions) is obtained at each stage.
2.2 MDP literature
The CMABCO problem which we consider can be formalized as a two-stage MDP (Jaksch et al. [2010], Ortner and Auer [2007], Osband et al. [2016]) with a canonical start state. The action set available in the start state is the set of observations. Following an observation action in the start state, the decision-maker moves to a new state (which consists of the realized states of the selected observations), from which the decision-maker selects a real action and moves back to the start state. The reward in the first step is the (negative) observation cost, and the reward in the second step is the random reward obtained by taking the real action. Stemming from this and building upon the UCRL2 algorithm of (Ortner and Auer [2007], Jaksch et al. [2010]), we construct efficient algorithms by exploiting the structure of the CMABCO problem: sparse observation probabilities and known costs.
2.3 Budgeted Learning
The CMABCO problem is also similar to budgeted learning, where the decision-maker's goal is to adaptively choose which features to observe in order to minimize the loss. For example, (Cesa-Bianchi et al. [2011], Hazan and Koren [2012]) adaptively choose the features of the next training example in order to train a linear regression model while having restricted access to only a subset of the features. However, these problems do not consider information costs and are restricted to batch learning.
Another related work is adaptive submodularity (Golovin and Krause [2010]), which aims to maximize rewards by selecting a bounded number of observations/actions. However, their approach assumes that observation states are statistically independent and that rewards have a submodular structure in the observations.
3 Contextual Multi-armed Bandits with Costly Observations
3.1 Problem Formulation
Next, we present our problem formulation and illustrate it with a specific example from the medical context. Let be a finite set of observations (types of medical tests such as MRI, mammogram, ultrasound, etc.). Each observation is in an (initially unknown) state from a finite set of possible values (describing the outcomes of the medical tests, such as the BI-RADS score associated with a mammogram). Let
represent the set of all possible state vectors. The state vector is
where is the state of observation , which represents the context in the CMAB formulation. We assume that the state vector is drawn according to a fixed but unknown distribution. We write to denote a random state vector and to denote the probability of state vector being drawn. In the medical context, models a joint probability over the results of the medical tests. We assume that only the states of the observations that are selected by the decision-maker are revealed at each time instant. Let denote a partial state vector, which only contains the states of a subset of the selected observations. For example, for selected observations , the partial state vector is with
where denotes our symbol for missing observation states. We use the notation to refer to the domain of (i.e., the set of the medical test outcomes realized in ). Let denote the set of all possible partial state vectors with observations from (i.e., the set of all possible medical test outcomes of ). Let denote the set of all possible partial state vector states. We say is consistent with if they are equal everywhere in the domain of , i.e., for all . In this case, we write . If and are both consistent with some , and , we say is a substate of . In this case, we write .
We illustrate these definitions on a simple example. Let be a state vector, and and be partial state vectors. Then, all of the following claims are true:
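The consistency and substate relations can also be sketched in code. The following is an illustrative encoding, not the paper's notation: a partial state vector is represented as a dict mapping an observation index to its realized value, and a missing key plays the role of the symbol for missing observation states.

```python
# Sketch (assumed encoding): partial state vectors as dicts; a missing
# key stands for an unobserved state.

def consistent(psi_sub, psi):
    """psi_sub is consistent with psi if they agree on psi_sub's domain."""
    return all(psi.get(i) == v for i, v in psi_sub.items())

def substate(psi_a, psi_b):
    """psi_a is a substate of psi_b: psi_a's domain is contained in
    psi_b's, and the two agree everywhere on psi_a's domain."""
    return set(psi_a) <= set(psi_b) and consistent(psi_a, psi_b)

full = {0: 1, 1: 0, 2: 1}    # a full state vector over 3 observations
partial = {0: 1, 2: 1}       # only observations 0 and 2 were paid for
assert consistent(partial, full)
assert substate({0: 1}, partial)
```

Note that consistency checks agreement only on the smaller domain, whereas the substate relation additionally requires domain containment.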
We consider a MAB setting with costly observations where the following sequence of events takes place at each time :

The environment draws a state vector according to an unknown distribution . The state vector is initially unknown to the decision-maker.

The decision-maker is allowed to select at most observations at time , denoted as , paying a known cost of for each observation in the set . We assume that the decision-maker has an upper bound on the maximum number of observations that can be made at each time . Let denote the subsets of the observations with cardinality less than , i.e., . The partial state vector from the observations is revealed to the decision-maker, while the remaining states stay unknown to the decision-maker.

Based on its available information , the decision-maker takes an action from a finite set of actions and observes a random reward with support , where is an unknown expected reward function.
We overload the definition of and to denote marginal probabilities and expected rewards of partial state vectors. We write to denote the marginal probability of being realized and to denote the marginal expected reward of action when the partial state vector is . Observe that .
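The marginalization above can be made concrete with a small sketch (the encoding is an assumption, not the paper's notation): the marginal probability of a partial state vector is the total probability of all full state vectors consistent with it.

```python
# Sketch: the joint distribution over full state vectors is a dict from
# tuples to probabilities; a partial state is a dict index -> value.
def marginal(joint, partial):
    """Marginal probability that the full state agrees with `partial`."""
    return sum(p for phi, p in joint.items()
               if all(phi[i] == v for i, v in partial.items()))

# Two binary observations with a simple (illustrative) joint distribution.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
assert abs(marginal(joint, {0: 1}) - 0.5) < 1e-12   # first observation = 1
assert abs(marginal(joint, {}) - 1.0) < 1e-12       # empty partial state
```

As the last assertion shows, the empty partial state is consistent with every full state vector, so its marginal probability is one.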
A policy for selecting observations and associated actions consists of a set of observations and an adaptive action strategy , which maps each possible partial state vector from to an action (e.g., a policy consists of a subset of medical tests and a treatment recommendation for each possible set of test results from ). The expected gain of the policy is given by
(1) 
where is the gain parameter, which balances the tradeoff between the rewards and the observation costs. For example, represents the revenue made by one click in the recommender system context. The expected gain of the policy is the expected reward of minus the observation cost incurred by . Without loss of generality, we assume that the decision-maker is allowed to make at most observations. Let denote the set of all possible policies. The oracle policy is given by .
The expected gain of the oracle policy is given by . Note that our oracle is different from the oracle used in the contextual bandit literature. To illustrate the difference, define to be the expected reward of the best action when the partial state vector is . We refer to the policy that selects observations and the best actions for all as the fixed oracle policy. The expected reward of the fixed oracle policy is given by
It can be shown that the oracle policy is given by and . Note that . Therefore, the oracle defined in our setting achieves the best expected reward among all the fixed oracle policies.
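At small scale, the oracle described above can be computed by brute force: enumerate every observation subset, pair each resulting partial state with its best action, and keep the subset with the largest expected gain. The sketch below uses an illustrative encoding and a hypothetical one-test instance; none of the names come from the paper.

```python
from itertools import combinations

# Brute-force oracle sketch: best observation subset of size at most M,
# with the best action per resulting partial state.
def oracle(D, partial_states, prob, reward, costs, actions, M):
    best_gain, best_subset = float("-inf"), None
    for m in range(M + 1):
        for I in combinations(range(D), m):
            gain = -sum(costs[i] for i in I)     # pay for the subset
            for psi in partial_states[I]:        # best action per state
                gain += prob[psi] * max(reward[psi][a] for a in actions)
            if gain > best_gain:
                best_gain, best_subset = gain, I
    return best_gain, best_subset

# One binary test (cost 0.2): observing lets us match the action to the
# result, beating the blind policy (expected gain 0.8 vs 0.5).
partial_states = {(): ["empty"], (0,): ["x0=0", "x0=1"]}
prob = {"empty": 1.0, "x0=0": 0.5, "x0=1": 0.5}
reward = {"empty": {"a": 0.5, "b": 0.5},
          "x0=0": {"a": 1.0, "b": 0.0},
          "x0=1": {"a": 0.0, "b": 1.0}}
print(oracle(1, partial_states, prob, reward, {0: 0.2}, ["a", "b"], M=1))
```

This enumeration is exactly what makes the naive approach intractable at realistic sizes, which motivates the structured algorithms developed below.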
Consider an adaptive policy , which takes observation action , observes , uses this observation to take an action , and receives the reward . The cumulative reward of is . The time regret of the policy is given by
The goal here is to compute the policy to minimize this regret by selecting at most observations.
Current online learning methods could be modified to address the CMABCO problem by defining a set of meta-actions that comprises all combinations of observation subsets and actions taken based on these observations, and then applying a standard MAB algorithm (such as the UCB algorithm, Auer et al. [2002]) with these meta-actions as the action space. While this approach is straightforward to implement, its regret scales linearly with the total number of policies , which is exponential in the number of state vectors. This makes such algorithms computationally infeasible and suboptimal (compared to the lower bound) even when the numbers of actions and partial states are small. This poor scaling is due to the fact that the algorithm does not take into account that selecting an action yields information about many policies.
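The blow-up described above can be illustrated with a back-of-the-envelope count (all numbers below are illustrative, not from the paper): each observation subset admits one policy per assignment of an action to each of its realizable partial states, and a naive MAB treats every such policy as an independent arm.

```python
from itertools import combinations

# Count the meta-actions (policies) for D observations with S states
# each, A actions, and at most M observations per round.
def num_meta_actions(D, S, A, M):
    total = 0
    for m in range(M + 1):
        for subset in combinations(range(D), m):
            # one action for every partial state of the chosen subset
            total += A ** (S ** len(subset))
    return total

print(num_meta_actions(D=4, S=3, A=4, M=2))  # over a million arms
```

Even this tiny instance yields more than a million arms, so the per-arm exploration of a standard MAB algorithm is hopeless here.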
3.2 Simultaneous Optimistic Observation Selection (SimOOS) Algorithm
To address the above-mentioned limitations of such MAB algorithms, we develop a new algorithm, which we refer to as Simultaneous Optimistic Observation Selection (SimOOS). SimOOS operates in rounds . Let denote the time at the beginning of round . The decision-maker keeps track of estimates of the mean rewards and the observation probabilities. Note that when the partial state vector from observation set is revealed, the decision-maker can use this information not only to update the observation probability estimate of but also to update the observation probability estimates of all substates of . However, the decision-maker cannot update the mean reward estimates of pairs of and substates of , since this would bias the mean reward estimates. Therefore, at each round , we define , and if and if . We define the following counters: , , . In addition to these counters, we also keep counters of partial state-action pair visits in a specific round . Let denote the number of times action is taken when partial state is observed in round . Furthermore, we can express the mean reward estimate and observation probability estimates as follows:
provided that and . Since these estimates can deviate from their true values, we need to add appropriate confidence intervals when optimizing the policy. At the beginning of each round , SimOOS computes the policy of round by solving the optimization problem given in (2). The optimization problem with the mean reward estimate and observation probability estimates is given by
(2)
where and are the confidence bounds on the estimators at time . We will set these confidence bounds later in order to achieve provable regret guarantees with high probability. Let denote the policy computed by the SimOOS.
SimOOS follows policy in round . At time in round , SimOOS selects and observes the partial state vector from observations , and on the basis of this, it takes an action . A round ends when the number of visits in round to some partial state vector-action pair equals the total number of observations of that partial state-action pair from previous rounds. This ensures that the optimization problem given in (2) is only solved when the estimates and confidence bounds have improved.
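The round-termination rule above can be sketched schematically (assumed names, not the paper's pseudocode): a round ends once some pair's in-round visit count matches its total count from all previous rounds, the doubling criterion inherited from UCRL2.

```python
from collections import defaultdict

def run_round(select, N_total):
    """Play one round; select() returns the (partial_state, action)
    pair visited at the current step."""
    nu = defaultdict(int)   # in-round visit counts
    steps = 0
    while True:
        key = select()
        nu[key] += 1
        steps += 1
        if nu[key] >= max(1, N_total[key]):
            break           # counts for `key` doubled: re-optimize
    for key, n in nu.items():
        N_total[key] += n
    return steps

# With a single pair visited every step, round lengths double.
N = defaultdict(int)
lengths = [run_round(lambda: ("psi", "a"), N) for _ in range(5)]
print(lengths)  # -> [1, 1, 2, 4, 8]
```

The doubling schedule keeps the number of rounds (and hence re-optimizations) logarithmic in the horizon while guaranteeing that each re-optimization sees substantially sharper estimates.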
The optimization problem in (2) can be reduced to a set of convex optimization problems which can be solved efficiently in polynomial time (Boyd and Vandenberghe [2004]) (the details of this reduction are discussed in the supplementary material). In round , let be the optimistic reward value of the partial state vector in round of . The optimistic gain of a fixed oracle in round , denoted by , is defined as the maximizer of the following optimization problem:
(3) 
At any time of round , it can be shown that the optimization in (2) can be solved as: and . The pseudocode for SimOOS is given in Algorithm 1. It can easily be shown that the computational complexity of the SimOOS algorithm for instances is .
3.3 Regret Bounds for the SimOOS algorithm
In this subsection, we provide distribution-independent regret bounds for the SimOOS algorithm. Let denote the number of all possible states (all possible results from at most distinct medical tests).
Theorem 1.
Suppose . For any , set
and
Then, with probability at least , the regret of the SimOOS satisfies
The proof of Theorem 1 and all other results can be found in the supplementary material. UCRL2 (Jaksch et al. [2010]) is designed for general MDP problems and achieves a regret of . Hence, our regret bounds are better than those obtained by UCRL2. This is an important result, since it demonstrates that SimOOS can effectively exploit the structure of our CMABCO problem to achieve regret bounds that scale better than those that can be obtained for general MDP problems.
We illustrate this bound with an example. Suppose for all and . The upper bound given in Theorem 1 is of the order of .
The SimOOS algorithm performs well for smaller values of , which is the case in the medical setting; for instance, in breast cancer screening the imaging tests are limited to a small set: mammogram, MRI, and ultrasound (Saslow et al. [2007]). In this context, the observations are usually selected sequentially. To address such settings, we next propose the SeqOOS algorithm, which selects observations sequentially.
4 Multi-armed Bandits with Sequential Costly Observations
4.1 Problem Formalism
Our current setting assumes that the decision-maker makes all the observations simultaneously. If the decision-maker is allowed to make observations sequentially, she can use the partial state from already-selected observations to inform the selection of future observations. For example, in medical settings, a positive result in a medical test is usually followed by an additional medical test for validation, whereas a negative result is usually not followed by additional medical tests. Since any simultaneous observation policy can be realized by a sequential observation policy, the oracle defined with sequential observations achieves a higher expected reward than that with simultaneous observations. At each time , the following sequence of events takes place:

The decision-maker initially has no observations. In phase , we denote the empty partial state as , where .

At each phase , if the partial state is and observation is made, the resulting partial state is where if and otherwise.

The decision-maker takes an action when either the empty observation is made or the final phase is reached, and observes a random reward .
Let be the set of resulting partial states when observation is made at the previous partial state , i.e., . In this section, we define as the probability of the resulting partial state when observation is made at the previous partial state , which we refer to as the partial state transition probability. For all , the partial state transition probability is defined as if and otherwise. In the medical example, this is the probability of observing test 's result as , given the previous test results (records) . We define and for all . Let denote the partial state transition probability matrix.
A sequential policy consists of an observation function and an action function , where and (e.g., refers to the next medical test applied to a patient with previous records (test results) , and refers to the treatment recommendation for a patient with previous records (test results) ). A sequential policy works as follows. The decision-maker keeps making observations until either observations are made or an empty observation is picked, and then takes an action in a terminal state, where a terminal partial state of policy is a state with either cardinality or with .
We illustrate these definitions in a medical example. Assume that there are different tests, with possible outcomes of a positive and a negative result, and different possible treatments. Consider a sequential policy with , , . This policy initially picks medical test for all patients (). If the result of the medical test is positive , the policy picks medical test (). On the other hand, if the result of medical test is negative , the policy does not make any additional test. In this example, the terminal partial states of policy are .
Given a sequential policy , let denote the random partial state in phase and denote the random cost incurred in phase by making observation . Note that is random since the partial state in phase is random. Similarly, let denote the random reward revealed by taking action in the terminal partial state. Then, for each sequential policy , we define a value function for :
(4) 
where the expectation is taken with respect to the randomness of the states and rewards. In the terminal phase, we define the value function as . The optimal value function is defined by . A policy is said to be optimal if . It is also useful to define the partial state-observation optimal value function for :
A sequential policy is optimal if and only if , .
Consider a sequential learning algorithm . The algorithm makes observation and incurs a cost in phase of time , and then selects action and receives a random reward . To quantify the performance of a sequential learning algorithm, we define the cumulative regret of up to time as
where denotes the empty state. In the next subsection, we propose a sequential learning algorithm which aims to minimize this regret.
4.2 Sequential Optimistic Observation Selection (SeqOOS)
In addition to the observation sets that are tracked by SimOOS, SeqOOS keeps track of the following sets at each round : , . Let and . In addition to these counters, we also keep counters of visits to partial state-action pairs and state-observation pairs in a particular round . Let denote the number of times observation is made when partial state is realized in round . We can express the estimated transition probabilities as , provided that .
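The empirical transition estimates can be sketched as follows (a hedged sketch with assumed names, not the paper's code): count visits to each (partial state, observation) pair and to each resulting partial state, and take the ratio once the denominator is positive.

```python
from collections import defaultdict

class TransitionEstimator:
    """Empirical partial state transition probabilities."""
    def __init__(self):
        self.n_si = defaultdict(int)    # visits to (psi, observation i)
        self.n_sis = defaultdict(int)   # ... that led to psi_next

    def update(self, psi, i, psi_next):
        self.n_si[(psi, i)] += 1
        self.n_sis[(psi, i, psi_next)] += 1

    def estimate(self, psi, i, psi_next):
        n = self.n_si[(psi, i)]
        return self.n_sis[(psi, i, psi_next)] / n if n > 0 else 0.0

est = TransitionEstimator()
est.update("empty", 1, "x1=pos")
est.update("empty", 1, "x1=neg")
est.update("empty", 1, "x1=pos")
print(est.estimate("empty", 1, "x1=pos"))  # 2 of 3 visits -> 2/3
```

Returning 0.0 for unvisited pairs is a placeholder choice; the algorithm itself compensates for undersampled pairs through the optimistic confidence intervals rather than through the point estimate.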
The SeqOOS works in rounds . At the beginning of round ( denotes the time at the beginning of round ), SeqOOS solves an Optimistic Dynamic Program (ODP), which takes the estimates and as input and outputs a policy . The ODP first orders the partial states with respect to the size of their domains. Let denote the partial states with observations, defined by (e.g., all possible results from distinct medical tests). Since the decision-maker is not allowed to make any more observations from a state , the estimated value of state is computed by , where is the confidence interval for the partial state-action pair in round . The action and observation functions on partial state computed by the ODP are given by and . After computing the values and policy in partial states , the ODP solves a convex optimization problem to compute the optimistic value function for each partial state-observation pair and . Let denote the optimistic value function for making observation in partial state in round of phase , which is the solution of the following convex optimization problem:
(5) 
Note that the variables () used in the convex optimization problem given in (5) are computed in the previous step by the ODP. The optimistic value of the empty observation in partial state in round is computed by . Based on the optimistic values of the partial state-observation pairs , the ODP computes the optimistic value of partial state and the action and observation functions of partial state as , , . These computations are repeated for to find the complete policy .
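The backward sweep performed by the ODP can be sketched in its plain (non-optimistic) form; the real ODP replaces the rewards and transition probabilities below with their optimistic upper-confidence versions. The encoding and the tiny one-test instance are illustrative assumptions.

```python
def backward_induction(states_by_phase, rewards, trans, costs):
    """states_by_phase[k]: partial states with k observations;
    rewards[psi][a]: expected reward of action a in partial state psi;
    trans[(psi, i)]: dict psi_next -> probability after observing i;
    costs[i]: cost of observation i. Returns the value function V."""
    V = {}
    K = len(states_by_phase) - 1
    for psi in states_by_phase[K]:        # terminal phase: must act
        V[psi] = max(rewards[psi].values())
    for k in range(K - 1, -1, -1):        # sweep backwards over phases
        for psi in states_by_phase[k]:
            stop = max(rewards[psi].values())   # empty observation: act now
            go = [-costs[i] + sum(p * V[s] for s, p in nxt.items())
                  for (s0, i), nxt in trans.items() if s0 == psi]
            V[psi] = max([stop] + go)
    return V

# One binary test (cost 0.1): observing first is worth ~0.9 > 0.5.
V = backward_induction(
    states_by_phase=[["empty"], ["pos", "neg"]],
    rewards={"empty": {"a": 0.5}, "pos": {"a": 1.0, "b": 0.0},
             "neg": {"a": 0.0, "b": 1.0}},
    trans={("empty", 1): {"pos": 0.5, "neg": 0.5}},
    costs={1: 0.1})
print(V["empty"])
```

Each state's value is the better of acting immediately and paying for one more observation, which is exactly the stop-or-observe choice the ODP resolves optimistically at every phase.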
Given , at each time of round , SeqOOS follows the policy . If the state at phase is , SeqOOS makes the observation and observes the state . If the state at phase is and the observation computed by the ODP is the empty set, i.e., , then SeqOOS takes action . If it is a terminal phase, i.e., , SeqOOS takes an action .
4.3 Regret Bounds of the SeqOOS
The analysis of the regret of SeqOOS exhibits similarities to that of SimOOS. SeqOOS has at most phases in which it makes observations sequentially, followed by an action, while SimOOS has phases in which it makes simultaneous observations at once, followed by an action. The difference is that we need to decompose the regret of SeqOOS into regret due to phases with suboptimal observations and regret due to suboptimal actions. Let . The next theorem bounds the distribution-independent regret.
Theorem 2.
Suppose . For , set
and
Then, with probability at least , the regret of SeqOOS satisfies .
The difference in the regret bounds of SimOOS and SeqOOS arises because SimOOS estimates the observation probabilities for each , whereas SeqOOS estimates the partial state transition probabilities for each and .
We now illustrate and compare the regret bounds of our algorithms. Suppose that for all and . In this case, we have a distribution-independent regret of for SimOOS and for SeqOOS, with probability at least . Our algorithms become computationally feasible when is small.
5 Illustrative Results
We evaluate SimOOS and SeqOOS on a dataset of 10,000 records of breast cancer patients participating in the National Surgical Adjuvant Breast and Bowel Project (NSABP) by ["removed for anonymous submission"]. Each instance consists of the following information about the patient: age, estrogen receptor, tumor stage, WHO score. The treatment is a choice among four chemotherapy regimens: AC, ACT, CAF, CEF. The outcomes for these regimens were derived based on 32 references from PubMed Clinical Queries; this is a medically accepted procedure. Hence, the data contains the feature vector and all derived outcomes for each treatment. The details are given in ["removed for anonymous submission"]. We generate instances by randomly selecting samples from the breast cancer dataset. In each instance, we set the observations as , and the reward as if the treatment with the highest outcome is given to the patient and otherwise. For the experimental results, we set and .
We compare the SimOOS and SeqOOS algorithms with a contextual bandit algorithm that observes the realization of all observation states by paying a cost of , referred to as ContextualUCB. We define the following Gain metric for our algorithms, which make observations and receive a reward of by taking action at each time , over time steps: .
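The Gain metric can be sketched as follows (the function and parameter names, including the gain parameter gamma, are assumptions for illustration): the per-step reward, weighted by the gain parameter, minus the total cost of the observations made at that step, accumulated over the horizon.

```python
def cumulative_gain(rewards, observation_sets, costs, gamma=1.0):
    """Sum over steps of gamma * reward minus the step's observation cost."""
    return sum(gamma * r - sum(costs[i] for i in obs)
               for r, obs in zip(rewards, observation_sets))

# Three steps: observe both tests, one test, then none.
g = cumulative_gain(rewards=[1, 0, 1],
                    observation_sets=[{0, 1}, {0}, set()],
                    costs={0: 0.2, 1: 0.3})
print(g)  # total reward 2 minus total observation cost 0.7
```

Under this metric, an algorithm that observes less but acts nearly as well can outscore one that always pays for the full context, which is the comparison made against ContextualUCB below.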
Performance of SimOOS and SeqOOS with Different Costs: We set the cost of each observation to . We illustrate the gain of the SimOOS, SeqOOS, and ContextualUCB algorithms for increasing values of the cost . As Figure 1 illustrates, the gain of the SimOOS and SeqOOS algorithms decreases as the observation cost increases. However, it should be noted that these algorithms learn the best simultaneous and sequential policies, while simultaneously taking actions, irrespective of the observation costs. The figure shows that as the observation cost increases, SimOOS and SeqOOS achieve better gains than ContextualUCB by observing less information and hence paying less cost. Therefore, the slope of the gain-cost curve of SimOOS and SeqOOS illustrated in Figure 1 decreases as the observation cost increases.
6 Conclusions
In this paper, we introduced the novel yet ubiquitous problem of contextual MAB with costly observations: selecting which information (contexts) to observe in order to inform the decision-making process. To address this problem, we developed two different algorithms, SimOOS and SeqOOS, and proved that these algorithms achieve distribution-independent regret bounds that are sublinear in time. Future work will be dedicated to exploring algorithms with regret bounds that are polynomial in the number of observations.
References
 Auer et al. [2002] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47:235–256, 2002.
 Boyd and Vandenberghe [2004] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004.
 Campbell and Kracaw [1980] T. S. Campbell and W. A. Kracaw. Information production, market signalling, and the theory of financial intermediation. The Journal of Finance, 35(4):863–882, 1980.
 Cesa-Bianchi et al. [2011] N. Cesa-Bianchi, S. Shalev-Shwartz, and O. Shamir. Efficient learning with partially observed attributes. The Journal of Machine Learning Research, 12:2857–2878, 2011.
 Chemmanur [1993] T. J. Chemmanur. The pricing of initial public offerings: A dynamic model with information production. The Journal of Finance, 48(1):285–304, 1993.
 Chu et al. [2011] W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
 Dudik et al. [2011] M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang. Efficient optimal learning for contextual bandits. arXiv preprint arXiv:1106.2369, 2011.
 Golovin and Krause [2010] D. Golovin and A. Krause. Adaptive submodularity: A new approach to active learning and stochastic optimization. In COLT, pages 333–345, 2010.
 Hazan and Koren [2012] E. Hazan and T. Koren. Linear regression with limited observation. In Proc. 29th Int. Conf. on Machine Learning, pages 807–814, 2012.
 Jaksch et al. [2010] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.
 Langford and Zhang [2007] J. Langford and T. Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. Advances in Neural Information Processing Systems (NIPS), 20:1096–1103, 2007.
 Lu et al. [2010] T. Lu, D. Pál, and M. Pál. Contextual multi-armed bandits. In International Conference on Artificial Intelligence and Statistics, pages 485–492, 2010.
 Ortner and Auer [2007] R. Ortner and P. Auer. Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems, 2007.
 Osband et al. [2016] I. Osband, B. Van Roy, and Z. Wen. Generalization and exploration via randomized value functions. In International Conference on Machine Learning, 2016.
 Saslow et al. [2007] D. Saslow, C. Boetes, W. Burke, S. Harms, M. O. Leach, C. D. Lehman, E. Morris, E. Pisano, M. Schnall, S. Sener, et al. American cancer society guidelines for breast screening with mri as an adjunct to mammography. CA: a cancer journal for clinicians, 57(2):75–89, 2007.
 Slivkins [2011] A. Slivkins. Contextual bandits with similarity information. In 24th Annual Conference On Learning Theory, 2011.
 Tekin and Van Der Schaar [2014] C. Tekin and M. Van Der Schaar. Discovering, learning and exploiting relevance. In Advances in Neural Information Processing Systems, pages 1233–1241, 2014.
 Zolghadr et al. [2013] N. Zolghadr, G. Bartók, R. Greiner, A. György, and C. Szepesvári. Online learning with costly features and labels. In Advances in Neural Information Processing Systems (NIPS), pages 1241–1249, 2013.