In many real-world settings, acquiring useful information is costly. In applications such as recommender systems, finance, and healthcare, the decision-maker performs costly research or experimentation to learn valuable information. For instance, a website must pay costs to observe (e.g., through cookies) the contextual information of its online users. In doing so, it must decide which information to observe in order to minimize informational costs while still achieving high rewards. However, classical contextual Multi-Armed Bandit (MAB) formulations (Chu et al., Slivkins, Lu et al., Dudik et al., Langford and Zhang) do not consider these informational costs and are thus unable to provide satisfactory performance in such settings. This paper presents methods and algorithms for Contextual MAB with Costly Observations (CMAB-CO). We show numerically that our algorithms achieve significant performance gains in a breast cancer setting, and we note that the methods and algorithms we develop are applicable (perhaps with some modifications) to a wide range of other settings as well, from recommender systems to finance.
A major challenge in these settings is learning both the optimal observations and the optimal actions. Current MAB methods could potentially be modified to address this issue by combining the choice of the context to observe and the action to be taken into a single meta-action and folding the observation costs into the rewards. However, the regret of such an approach can be shown to be exponential in the number of actions and the number of possible context states, which makes it impractical for any realistic problem. There is therefore a strong need for new algorithms that achieve better performance.
To overcome the limitations and challenges discussed above, we propose an alternative approach. We formalize the CMAB-CO problem and show that it can be reduced to a two-stage Markov Decision Process (MDP) problem with a canonical start state. We propose two different algorithms for this dual learning problem: Sim-OOS and Seq-OOS, in which observations are made simultaneously and sequentially, respectively. These algorithms build upon the UCRL2 algorithm of (Jaksch et al.) to efficiently learn optimal observations and actions. We show that both Sim-OOS and Seq-OOS achieve a regret that is sublinear in time. These algorithms thus perform well when the number of observations is small, and they represent a significant improvement over existing algorithms, which would be exponential in the number of observations as well as actions.
Our main contributions can be summarized as follows:
We formalize the CMAB-CO problem as a two-stage MDP.
We propose two algorithms under two assumptions: simultaneous and sequential observation selection. We show sublinear in time regret bounds for both algorithms.
We use a breast cancer dataset and show that our algorithms achieve a significant improvement in performance with respect to an important benchmark.
As we have noted, our algorithms apply in many settings with different observations, actions and rewards. In the medical context, the observations might consist of different types of (costly) medical tests (e.g., blood tests, MRI, etc.), actions might consist of choices of treatment, and rewards might consist of year survival rates. Hence, an important aspect of the decision-making is which medical tests to conduct and which treatment option to recommend. In the recommendation system context, the observations might consist of (costly) information about the user (e.g., previous search records, likes in social media, etc.), actions might consist of item choices and rewards might consist of click rates. In financial applications, the observations might represent (costly) research and information gathering about specific assets (stocks, loans, IPOs, etc.), actions might represent investment decisions, and rewards might represent investment returns. Indeed, the financial literature has studied the costs (and incentives) associated with information gathering in a variety of settings (Campbel and Kracaw , Chemmanur ).
2 Related Work
Our paper contributes to multiple strands of literature, including MAB, MDP and budgeted learning. We describe the contributions of our work to each topic in turn.
2.1 MAB Literature
This work relates to various strands of research in the MAB literature (Chu et al., Slivkins, Lu et al., Dudik et al., Langford and Zhang, Tekin and Van Der Schaar). For example, Tekin and Van Der Schaar focus on learning the optimal actions by discovering relevant information. However, that work does not consider the costs associated with gathering information and is thus unable to provide satisfactory performance in the considered setting. The CMAB-CO problem is similar to combinatorial semi-bandits, since multiple actions (observations and real actions) are selected and the rewards of all selected actions (observation costs and real action rewards) are revealed in our setting. However, combinatorial semi-bandits do not utilize the observed states when taking the action.
Our work is also closely related to online probing (Zolghadr et al.). However, the goal in (Zolghadr et al.) is to learn the optimal observations and a single best function that maps observed features to labels in order to minimize the loss and the observation cost jointly. Unlike in the considered CMAB-CO setting, an adversarial setup is assumed and complete loss feedback (the loss associated with all the various actions) is obtained at each stage.
2.2 MDP literature
The CMAB-CO problem which we consider can be formalized as a two-stage MDP (Jaksch et al., Ortner and Auer, Osband et al.) with a canonical start state. The action set available in the start state is the set of observations. Following an observation action in the start state, the decision-maker moves to a new state (which consists of the realized states of the selected observations), from which the decision-maker selects a real action and moves back to the start state. The reward in the first step is the (negative) observation cost, and the reward in the second step is the random reward obtained by taking the real action. Building upon the UCRL2 algorithm of (Ortner and Auer, Jaksch et al.), we construct efficient algorithms by exploiting the structure of the CMAB-CO problem: sparse observation probabilities and known costs.
2.3 Budgeted Learning
The CMAB-CO problem is also similar to budgeted learning, where the decision-maker's goal is to adaptively choose which features to observe in order to minimize the loss. For example, (Cesa Bianchi et al., Hazan and Koren) adaptively choose the features of the next training example in order to train a linear regression model while having restricted access to only a subset of the features. However, these problems do not consider information costs and are restricted to batch learning.
Another related work is adaptive submodularity (Golovin and Krause ) which aims to maximize rewards by selecting at most observations/actions. However, their approach assumes that observation states are statistically independent and rewards have a submodular structure in observations.
3 Contextual Multi-armed Bandits with Costly Observations
3.1 Problem Formulation
Next, we present our problem formulation and illustrate it with a specific example from the medical context. Let be a finite set of observations (types of medical tests such as MRI, mammogram, ultrasound, etc.). Each observation is in a particular (initially unknown) state from a finite set of possible values (describing the outcomes of the medical tests, such as the BIRADS score associated with a mammogram). Let
represent the set of all possible state vectors. The state vector is where is the state of observation , which represents the context in the CMAB formulation. We assume that the state vector is drawn according to a fixed but unknown distribution. We write to denote a random state vector and to denote the probability of state vector being drawn. In the medical context, models a joint probability over the results of the medical tests.
We assume that only the states of the observations that are selected by the decision-maker are revealed in each time instance. Let denote a partial state vector, which only contains the state of a subset of the selected observations. For example, for selected observations , the partial state vector is with
where denotes our symbol for missing observation states. We use the notation to refer to the domain of (i.e., the set of the medical test outcomes realized in ). Let denote the set of all possible partial state vectors with observations from (i.e., the set of all possible medical test outcomes of ). Let denote the set of all possible partial state vector states. We say is consistent with if they are equal everywhere in the domain of , i.e., for all . In this case, we write . If and are both consistent with some , and , we say is a substate of . In this case, we write .
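The consistency and substate relations defined above can be sketched in a few lines of Python. This is only an illustration under assumptions: partial state vectors are represented as dictionaries (an absent key plays the role of the missing-observation symbol), the substate condition is read as "the domain of the first is contained in the domain of the second and the two agree on it", and the test names and outcome values are hypothetical.

```python
from typing import Dict, Hashable

# A full state vector assigns a value to every observation; a partial state
# vector only to the observed ones (absent keys mean "state not observed").
State = Dict[str, Hashable]


def domain(psi: State) -> frozenset:
    """Observations whose states are realized in the partial state vector."""
    return frozenset(psi)


def is_consistent(psi: State, phi: State) -> bool:
    """True if psi agrees with phi everywhere on the domain of psi."""
    return all(obs in phi and phi[obs] == val for obs, val in psi.items())


def is_substate(psi: State, psi_prime: State) -> bool:
    """True if psi is a substate of psi_prime (assumed reading of the text).

    The domain of psi must be contained in the domain of psi_prime and both
    must agree on it, so that a common full state vector extends both.
    """
    return domain(psi) <= domain(psi_prime) and is_consistent(psi, psi_prime)


# Hypothetical example with three tests and made-up outcomes.
phi = {"mammogram": 4, "mri": 1, "ultrasound": 0}
psi_small = {"mammogram": 4}
psi_large = {"mammogram": 4, "mri": 1}
psi_other = {"mammogram": 2}

print(is_consistent(psi_large, phi))      # both observed entries match phi
print(is_substate(psi_small, psi_large))  # smaller domain, same values
print(is_consistent(psi_other, phi))      # disagrees on the mammogram outcome
```

Representing the missing-observation symbol by an absent key keeps the domain of a partial state equal to its key set, which is why `domain` needs no extra bookkeeping.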
We illustrate these definitions on a simple example. Let be a state vector, and and be partial state vectors. Then, all of the following claims are true:
We consider a MAB setting with costly observations in which the following sequence of events takes place at each time :
The environment draws a state vector according to unknown distribution . The state vector is initially unknown to the decision-maker.
The decision-maker is allowed to select at most observations at time , denoted as , paying a known cost of for each observation in the set . We assume that the decision-maker has an upper bound on the maximum number of observations that can be made at each time . Let denote the set of subsets of the observations with cardinality at most , i.e., . The partial state vector from the observations is revealed to the decision-maker, while the remainder of the states remain unknown to the decision-maker.
Based on its available information , the decision-maker takes an action from a finite set of actions and observes a random reward with support and where is an unknown expected reward function.
We overload the definition of and to denote marginal probabilities and expected rewards of partial state vectors. We write to denote the marginal probability of being realized and to denote the marginal expected reward of action when the partial state vector is . Observe that .
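As a sketch of the overloaded definitions above, the following Python snippet computes the marginal probability of a partial state vector by summing the probabilities of all full state vectors consistent with it, and the marginal expected reward as the corresponding conditional average. The joint distribution `P`, the reward table `R_BAR`, and the observation names are hypothetical values chosen only for illustration; the conditional-average reading of the marginal reward is an assumption, since the displayed identity is not recoverable from the text.

```python
# Hypothetical joint distribution over two binary observations ("t1", "t2").
OBS = ("t1", "t2")
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Hypothetical expected rewards, indexed as R_BAR[action][full state vector].
R_BAR = {
    "a": {(0, 0): 0.9, (0, 1): 0.6, (1, 0): 0.3, (1, 1): 0.1},
    "b": {(0, 0): 0.2, (0, 1): 0.5, (1, 0): 0.7, (1, 1): 0.8},
}


def consistent(psi, phi):
    """True if the partial state psi (a dict) agrees with the full state phi."""
    return all(phi[OBS.index(obs)] == val for obs, val in psi.items())


def marginal_prob(psi):
    """Probability that psi is realized: sum over consistent full states."""
    return sum(p for phi, p in P.items() if consistent(psi, phi))


def marginal_reward(action, psi):
    """Expected reward of `action` given that only psi has been observed."""
    mass = marginal_prob(psi)
    if mass == 0:
        raise ValueError("partial state has zero probability")
    weighted = sum(
        p * R_BAR[action][phi] for phi, p in P.items() if consistent(psi, phi)
    )
    return weighted / mass


print(marginal_prob({"t1": 1}))
print(marginal_reward("b", {"t1": 1}))
```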
The policy for selecting observations and associated actions consists of a set of observations and an adaptive action strategy , which maps each possible partial state vector from to an action (e.g., a policy consists of a subset of medical tests and a treatment recommendation for each possible test result from ). The expected gain of the policy is given by
where is the gain parameter, which balances the trade-off between the rewards and observation costs. For example, represents the revenue made by one click in the recommendation system context. The expected gain of the policy is the expected reward of minus the observation cost incurred by . Without loss of generality, we assume that the decision-maker is allowed to make at most observations. Let denote the set of all possible policies. The oracle policy is given by .
The expected gain of the oracle policy is given by . Note that our oracle is different from the oracle used in the contextual bandit literature. To illustrate the difference, define to be the expected reward of the best action when the partial state vector is . We refer to the policy that selects observations and the best actions for all as the fixed -oracle policy. The expected reward of the fixed -oracle policy is given by
It can be shown that the oracle policy is given by and . Note that . Therefore, the oracle defined in our setting achieves the best expected reward among all the fixed -oracle policies.
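A brute-force version of this oracle can be sketched when the model is fully known. The snippet below enumerates every observation subset of size at most `M`, computes the expected reward of the fixed-subset oracle (best action for each realized partial state), and subtracts the observation costs. The gain is assumed here to have the form "gain parameter times expected reward minus total cost"; `P`, `R_BAR`, `COST`, `ALPHA`, and `M` are hypothetical illustration values, not quantities from the paper.

```python
from itertools import combinations, product

OBS = ("t1", "t2")
VALUES = (0, 1)
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
R_BAR = {
    "a": {(0, 0): 0.9, (0, 1): 0.6, (1, 0): 0.3, (1, 1): 0.1},
    "b": {(0, 0): 0.2, (0, 1): 0.5, (1, 0): 0.7, (1, 1): 0.8},
}
COST = {"t1": 0.05, "t2": 0.20}  # hypothetical per-observation costs
ALPHA = 1.0                      # hypothetical gain parameter
M = 2                            # hypothetical bound on observations per step


def fixed_oracle_reward(subset):
    """Expected reward when `subset` is observed and the best action is taken
    for every realized partial state on that subset."""
    idx = [OBS.index(obs) for obs in subset]
    total = 0.0
    for vals in product(VALUES, repeat=len(subset)):
        matching = [phi for phi in P if all(phi[i] == v for i, v in zip(idx, vals))]
        # Unnormalized: sum_phi p(phi) * r_bar(a, phi), maximized over actions.
        total += max(sum(P[phi] * R_BAR[a][phi] for phi in matching) for a in R_BAR)
    return total


def gain(subset):
    """Assumed gain: ALPHA * expected reward minus total observation cost."""
    return ALPHA * fixed_oracle_reward(subset) - sum(COST[obs] for obs in subset)


def oracle():
    """Observation subset of size at most M with the highest gain."""
    subsets = [s for k in range(M + 1) for s in combinations(OBS, k)]
    return max(subsets, key=gain)


best = oracle()
print(best, gain(best))
```

In this toy instance observing both tests yields the same expected reward as observing `t1` alone, so the extra cost of `t2` is not worth paying; this is the kind of trade-off the oracle resolves.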
Consider an adaptive policy , which takes observation-action , observes , uses this observation to take an action and receives the reward of . The cumulative reward of is . The -time regret of the policy is given by
The goal here is to compute the policy to minimize this regret by selecting at most observations.
Current online learning methods could be modified to address the CMAB-CO problem by defining a set of meta-actions that comprises all the combinations of observation subsets and actions taken based on these observations, and then applying a standard MAB algorithm (such as the UCB algorithm of Auer et al.) with these meta-actions as the action space. While this algorithm is straightforward to implement, it scales linearly with the total number of policies , which is exponential in the number of state vectors. This makes such algorithms computationally infeasible and suboptimal (compared to the lower bound) even when the numbers of actions and partial states are small. This poor scaling is due to the fact that the algorithm does not take into account that selecting an action yields information for many policies.
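To get a feeling for how quickly the meta-action space grows, the following sketch counts the policies under the assumption that a policy is an observation subset of size at most `max_obs` together with one action for each partial state on that subset. The function name and the example sizes are hypothetical.

```python
from itertools import combinations
from math import prod


def num_policies(state_sizes, num_actions, max_obs):
    """Count (observation subset, action map) pairs.

    state_sizes[i] is the number of possible states of observation i. For a
    subset, the number of partial states is the product of its state sizes,
    and each partial state can be mapped to any of the actions.
    """
    total = 0
    for k in range(max_obs + 1):
        for subset in combinations(range(len(state_sizes)), k):
            num_partial_states = prod(state_sizes[i] for i in subset)
            total += num_actions ** num_partial_states
    return total


# Two binary observations, two actions.
print(num_policies([2, 2], num_actions=2, max_obs=2))
# Three ternary observations, four actions.
print(num_policies([3, 3, 3], num_actions=4, max_obs=3))
```

Even the second, still tiny, configuration produces more than 10^16 meta-actions, which illustrates why treating each policy as a separate arm is impractical.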
3.2 Simultaneous Optimistic Observation Selection (Sim-OOS) Algorithm
To address the above-mentioned limitations of such MAB algorithms, we develop a new algorithm, which we refer to as Simultaneous Optimistic Observation Selection (Sim-OOS). Sim-OOS operates in rounds . Let denote the time at the beginning of round . The decision-maker keeps track of the estimates of the mean rewards and the observation probabilities. Note that when the partial state vector from observation set is revealed, the decision-maker can use this information to update not only the observation probability estimate of but also the observation probability estimates of all substates of . However, the decision-maker cannot update the mean reward estimates of pairs of and substates of , since this would bias the mean reward estimates. Therefore, at each round , we define , and if and if .
We define the following counters: , , . In addition to these counters, we also keep counters of partial state-action pair visits in a specific round . Let denote the number of times action is taken when partial state is observed in round . Furthermore, we can express the mean reward estimate and observation probability estimates as follows:
provided that and . Since these estimates can deviate from their true mean values, we need to add appropriate confidence intervals when optimizing the policy. At the beginning of each round, the Sim-OOS computes the policy of round by solving the optimization problem given in (2). The optimization problem with the mean reward estimate and observation probability estimates is given by
where and are the confidence bounds on the estimators at time . We will set these confidence bounds later in order to achieve provable regret guarantees with high probability. Let denote the policy computed by the Sim-OOS.
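The bookkeeping behind these estimates can be sketched as follows. Under the reading of the text adopted here, a revealed partial state updates the observation counts of all of its substates, while the reward statistics are updated only for the exact partial state on which the action was taken. The class and method names are hypothetical, and the confidence bounds are deliberately omitted because their exact form is not recoverable from the text.

```python
from collections import defaultdict
from itertools import combinations


class Estimates:
    """Empirical observation probabilities and mean rewards (no confidence bounds)."""

    def __init__(self):
        self.obs_set_count = defaultdict(int)  # times a set of observations was revealed
        self.state_count = defaultdict(int)    # times a partial state was revealed
        self.pair_count = defaultdict(int)     # times an action was taken at a partial state
        self.reward_sum = defaultdict(float)   # summed rewards per (partial state, action)

    @staticmethod
    def _key(psi):
        return tuple(sorted(psi.items()))

    def update(self, psi, action, reward):
        names = sorted(psi)
        # Every substate of psi (its restriction to a subset of its domain)
        # counts as revealed, so its probability estimate can be refined.
        for k in range(len(names) + 1):
            for subset in combinations(names, k):
                sub = {obs: psi[obs] for obs in subset}
                self.obs_set_count[subset] += 1
                self.state_count[self._key(sub)] += 1
        # Rewards are credited only to the exact pair (psi, action); crediting
        # substates would bias their mean reward estimates.
        pair = (self._key(psi), action)
        self.pair_count[pair] += 1
        self.reward_sum[pair] += reward

    def prob(self, psi):
        n = self.obs_set_count[tuple(sorted(psi))]
        return self.state_count[self._key(psi)] / n if n else None

    def mean_reward(self, psi, action):
        pair = (self._key(psi), action)
        n = self.pair_count[pair]
        return self.reward_sum[pair] / n if n else None


est = Estimates()
est.update({"t1": 1, "t2": 0}, "a", 1.0)
est.update({"t1": 1}, "b", 0.0)
est.update({"t1": 0}, "a", 1.0)
print(est.prob({"t1": 1}), est.mean_reward({"t1": 1}, "a"))
```

The last call returns no estimate for action `a` at the partial state `{"t1": 1}`, because `a` was only ever taken at the larger partial state `{"t1": 1, "t2": 0}`.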
The Sim-OOS follows policy in round . At time in round (), the Sim-OOS selects and observes the partial state vector from observations and on the basis of this, it takes an action . Round ends when one of the visits to the partial state vector-action pair in round is the same as (the total observations of the partial state-action pair from previous rounds ). This ensures that the optimization problem given in (2) is only solved when the estimates and confidence bounds are improved.
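The stopping rule for a round can be illustrated with a small simulation. The sketch below assumes the UCRL2-style convention that a round ends as soon as the in-round visit count of some partial state-action pair reaches `max(1, N)`, where `N` is the number of visits to that pair in all previous rounds; the `max(1, ·)` guard for unvisited pairs is borrowed from UCRL2 and is an assumption here. The function name and the input sequence are hypothetical.

```python
from collections import defaultdict


def split_into_rounds(pairs):
    """Split a sequence of visited (partial state, action) pairs into rounds."""
    total = defaultdict(int)     # visits accumulated over completed rounds
    in_round = defaultdict(int)  # visits in the current round
    rounds, current = [], []
    for pair in pairs:
        current.append(pair)
        in_round[pair] += 1
        # End the round once some pair has been visited as often within the
        # round as in all previous rounds combined (at least once).
        if in_round[pair] >= max(1, total[pair]):
            for visited, count in in_round.items():
                total[visited] += count
            in_round = defaultdict(int)
            rounds.append(current)
            current = []
    if current:  # trailing, unfinished round
        rounds.append(current)
    return rounds


lengths = [len(r) for r in split_into_rounds(["x"] * 8)]
print(lengths)
```

When a single pair is visited repeatedly, the round lengths double, so the number of times the optimization problem has to be solved grows only logarithmically in the number of visits.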
The optimization problem in (2) can be reduced to a set of convex optimization problems which can be solved efficiently in polynomial time complexity (Boyd and Vandenberghe ) (the details of this reduction are discussed in the supplementary material). In round , let be the optimistic reward of value of the partial state vector in round of . The optimistic gain of a fixed -oracle in round , denoted by , is defined as the maximizer of the following optimization problem:
At any time of round , it can be shown that the optimization in (2) can be solved as: and . The pseudocode for the Sim-OOS is given in Algorithm 1. It can be easily shown that the computational complexity of the Sim-OOS algorithm for instances is .
3.3 Regret Bounds for the Sim-OOS algorithm
In this subsection, we provide distribution-independent regret bounds for the Sim-OOS algorithm. Let denote the number of all possible states (all possible results from at most distinct medical tests).
Suppose . For any , set
Then, with probability at least , the regret of the Sim-OOS satisfies
The proof of Theorem 1 and all the other results can be found in the supplementary material. UCRL2 (Jaksch et al.) is designed for general MDP problems and achieves a regret of . Hence, these regret results are better than those obtained by UCRL2. This is an important result since it demonstrates that the Sim-OOS can effectively exploit the structure of our CMAB-CO problem to achieve regret bounds which scale better than those that can be obtained for general MDP problems.
We illustrate this bound using the same example above. Suppose for all and . The upper bound given in Theorem 1 is in the order of .
The Sim-OOS algorithm performs well for smaller values of , which is the case in the medical setting. For instance, in breast cancer screening, imaging tests are limited to a small set: mammogram, MRI and ultrasound (Saslow et al.). In this context, the observations are usually selected sequentially. To address such settings, we next propose the Seq-OOS algorithm, which selects observations sequentially.
4 Multi-armed Bandits with Sequential Costly Observations
4.1 Problem Formalism
Our setting so far assumes that the decision-maker makes all the observations simultaneously. If the decision-maker is allowed to make observations sequentially, she can use the partial state from already selected observations to inform the selection of future observations. For example, in medical settings, a positive result in a medical test is usually followed by additional medical tests for validation, whereas a negative result is usually not. Since any simultaneous observation policy can be implemented by a sequential observation policy, the oracle defined with sequential observations achieves an expected reward at least as high as that with simultaneous observations. At each time , the following sequence of events takes place:
The decision-maker has initially no observations. In phase , we denote the empty partial state as where .
At each phase , if the partial state is and observation is made, the resulting partial state is where if and otherwise.
The decision-maker takes an action when either observation is made or the final phase is reached and observes a random reward .
Let be the set of resulting partial states when observation is made at previous partial state , i.e., . In this section, we define as the probability of the resulting partial state when observation is made at previous partial state , which is referred to as the partial state transition probability. For all , the partial state transition probability is defined as if and otherwise. In the medical example, this is the probability of observing test ’s result as given the previous test results (records) . We define and for all . Let denote the partial state transition probability matrix.
A sequential policy consists of observation function and action function where and (e.g., refers to the next medical test applied on a patient with previous records (test results) and refers to the treatment recommendation for a patient with previous records (test results) ). A sequential policy works as follows. The decision-maker keeps making observations until either observations are made or an empty observation is picked, and then takes an action in a terminal state, where the terminal partial states of policy are the states with either cardinality or with .
We illustrate these definitions in a medical example. Assume that there are different tests with possible outcomes of positive and negative result and different possible treatments. Suppose that a sequential policy with , , . Basically, this policy initially picks the medical test for all patients (). If the result of the medical test is positive , the policy picks medical test (). On the other hand, if the result of medical test is negative , the policy does not make any additional test. In this example, terminal partial states of policy are .
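The policy in this example can be written down as two lookup tables and executed on a patient. The sketch below is an illustration only: the test names `test1` and `test2`, the treatment labels, and the limit of two tests are hypothetical stand-ins for the symbols lost in the text, and an absent entry in the observation table plays the role of the empty observation.

```python
def run_sequential_policy(next_test, treatment, patient, max_tests):
    """Apply tests one at a time, then return the final partial state and treatment.

    next_test maps a partial state (frozenset of (test, result) pairs) to the
    next test; a missing entry means that no further test is requested.
    treatment maps each terminal partial state to a treatment.
    """
    psi = {}
    while len(psi) < max_tests:
        test = next_test.get(frozenset(psi.items()))
        if test is None:  # empty observation: stop testing
            break
        psi[test] = patient[test]  # reveal the result of the chosen test
    return psi, treatment[frozenset(psi.items())]


# Test 1 for everyone; test 2 only after a positive result on test 1.
next_test = {
    frozenset(): "test1",
    frozenset({("test1", "pos")}): "test2",
}
# One (hypothetical) treatment per terminal partial state.
treatment = {
    frozenset({("test1", "neg")}): "A",
    frozenset({("test1", "pos"), ("test2", "pos")}): "B",
    frozenset({("test1", "pos"), ("test2", "neg")}): "C",
}

patient_negative = {"test1": "neg", "test2": "pos"}
patient_positive = {"test1": "pos", "test2": "neg"}
print(run_sequential_policy(next_test, treatment, patient_negative, max_tests=2))
print(run_sequential_policy(next_test, treatment, patient_positive, max_tests=2))
```

For the first patient the second test is never performed, so its result stays hidden and no cost would be paid for it.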
Given a sequential policy , let denote the random partial state in phase and denote the random cost in phase by making observation . Note that is random since partial state in phase is random. Similarly, let denote random reward revealed by taking action in terminal partial state. Then, for each sequential policy , we define a value function for :
where expectation is taken with respect to randomness of the states and rewards. In the terminal phase, we define value function as . The optimal value function is defined by . A policy is said to be optimal if . It is also useful to define partial state-observation optimal value function for :
A sequential policy is optimal if and only if , .
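When the model is known, the optimal value function can be computed by backward induction over partial states. The sketch below does this for a toy instance and compares the result with the best simultaneous policy found by brute force. All numbers (`P`, `R_BAR`, `COST`, `ALPHA`, `M`) are hypothetical, and the recursion assumes that the value of a partial state is the larger of "stop and take the best action" and "pay for one more observation and continue", with the gain parameter multiplying the rewards.

```python
from functools import lru_cache
from itertools import combinations, product

OBS = ("t1", "t2")
VALUES = (0, 1)
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
R_BAR = {
    "a": {(0, 0): 0.9, (0, 1): 0.8, (1, 0): 0.9, (1, 1): 0.1},
    "b": {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.9},
}
COST = {"t1": 0.05, "t2": 0.10}
ALPHA, M = 1.0, 2


def weighted(psi, action=None):
    """Sum of p(phi), times r_bar(action, phi) if given, over phi consistent with psi."""
    total = 0.0
    for phi, p in P.items():
        if all(phi[OBS.index(obs)] == val for obs, val in psi.items()):
            total += p if action is None else p * R_BAR[action][phi]
    return total


def stop_value(psi):
    """Value of stopping at psi and taking the best action."""
    return ALPHA * max(weighted(psi, a) for a in R_BAR) / weighted(psi)


@lru_cache(maxsize=None)
def value(psi_items=frozenset()):
    """Optimal value of a partial state, given as a frozenset of (obs, value) pairs."""
    psi = dict(psi_items)
    best = stop_value(psi)
    if len(psi) >= M:  # terminal phase: no further observations allowed
        return best
    for obs in OBS:
        if obs in psi:
            continue
        cont = -COST[obs]
        for val in VALUES:
            nxt = {**psi, obs: val}
            q = weighted(nxt) / weighted(psi)  # partial state transition probability
            if q > 0:
                cont += q * value(frozenset(nxt.items()))
        best = max(best, cont)
    return best


def best_simultaneous_gain():
    """Best gain over all policies that fix the observation subset in advance."""
    best = float("-inf")
    for k in range(M + 1):
        for subset in combinations(OBS, k):
            reward = sum(
                max(weighted(dict(zip(subset, vals)), a) for a in R_BAR)
                for vals in product(VALUES, repeat=k)
            )
            best = max(best, ALPHA * reward - sum(COST[o] for o in subset))
    return best


print(value(), best_simultaneous_gain())
```

In this instance the sequential value is approximately 0.79 against approximately 0.74 for the best simultaneous policy: the sequential policy requests the second test only after a result of 1 on the first, where the second test actually changes the best action.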
Consider a sequential learning algorithm . The algorithm makes observation and realizes a cost in phase of time and then selects action and realizes a random reward , which realizes a reward of . To quantify the performance of sequential learning algorithm, we define cumulative regret of sequential learning algorithm up to time as
where denotes empty state. In the next subsection, we propose a sequential learning algorithm, which aims to minimize regret.
4.2 Sequential Optimistic Observation Selection (Seq-OOS)
In addition to observation sets that are tracked by Sim-OOS, Seq-OOS keeps track of the following sets at each round : , . Let and . In addition to these counters, we also keep counters of visits in partial state-action pairs and state-observation pairs in a particular round . Let denote the number of times observation is made when partial state is realized in round . We can express the estimated transition probabilities as , provided that .
The Seq-OOS works in rounds . At the beginning of round ( denotes the time at the beginning of round ), the Seq-OOS solves Optimistic Dynamic Programming (ODP), which takes the estimates and as input and outputs a policy . The ODP first orders the partial states with respect to the size of their domains. Let denote the partial states with observations, which is defined by (e.g., all possible results from distinct medical tests). Since the decision-maker is not allowed to make any more observations in any state , the estimated value of state is computed by where is the confidence interval for the partial state-action pair in round . The action and observation functions on partial state computed by the ODP are given by and . After computing the values and the policy in partial states , the ODP solves a convex optimization problem to compute the optimistic value function for each partial state-observation pair and . Let denote the optimistic value function for making observation in partial state in round of phase , which is the solution of the following convex optimization problem:
Note that the variables () used in the convex optimization problem given in (5) are computed in the previous step by the ODP. The optimistic value of the empty observation in partial state in round is computed by . Based on the optimistic values of the partial state-observation pairs , the ODP computes the optimistic value of partial state and the action and observation functions of partial state as , , . These computations are repeated for to find the complete policy .
Given , at each time of round (), the Seq-OOS follows the policy . Basically, if the state at phase is , the Seq-OOS decides to make the observation and observes the state . If the state is at phase and observation computed by the ODP is empty set, i.e., , then Seq-OOS takes action . If it is a terminal phase, i.e., , Seq-OOS takes an action .
4.3 Regret Bounds of the Seq-OOS
The analysis of the regret of the Seq-OOS exhibits similarities to the analysis of the regret of the Sim-OOS. The Seq-OOS has at most phases in which it makes observations sequentially followed by an action while Sim-OOS has phases in which it makes simultaneous observations at once followed by an action. The difference is that we need to decompose the regret of the Seq-OOS into regret due to phases with suboptimal observations and regret due to suboptimal actions. Let . The next theorem bounds the distribution-independent regret.
Suppose . For , set
Then, with probability at least , regret of the Seq-OOS satisfies
The difference in the regret bounds of Sim-OOS and Seq-OOS is because Sim-OOS estimates the observation probabilities for each whereas Seq-OOS estimates observation transition probabilities for each and .
Now, we illustrate and compare the regret bounds on our algorithms. Suppose that for all and . In this case, we have the distribution independent regret of for Sim-OOS and for Seq-OOS with probability at least . Our algorithms become computationally feasible when is small.
5 Illustrative Results
We evaluate the Sim-OOS and Seq-OOS on a dataset of 10,000 records of breast cancer patients participating in the National Surgical Adjuvant Breast and Bowel Project (NSABP) by ["removed for anonymous submission"]. Each instance consists of the following information about the patient: age, estrogen receptor, tumor stage, WHO score. The treatment is a choice among four chemotherapy regimens: AC, ACT, CAF, and CEF. The outcomes for these regimens were derived based on 32 references from PubMed Clinical Queries; this is a medically accepted procedure. Hence, the data contains the feature vector and all derived outcomes for each treatment. The details are given in ["removed for anonymous submission"]. We generate instances by randomly selecting a sample from the breast cancer dataset. In each instance, we set the observations as , and the rewards as if the treatment with the highest outcome is given to the patient and otherwise. For the experimental results, we set and .
We compare the Sim-OOS and Seq-OOS algorithms with a contextual bandit algorithm that observes the realizations of all observation states by paying a cost of , referred to as Contextual-UCB. We define the Gain of our algorithms, which make observations and receive a reward of by taking action at each time , over time steps by .
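A gain metric of this kind can be computed from a logged run as sketched below. Since the defining formula is not recoverable from the text, the sketch assumes a per-step average of "gain parameter times reward minus the cost of the observations made at that step"; the function name, the normalization by the number of steps, and the example costs are all assumptions.

```python
def average_gain(rewards, observed_sets, cost, alpha=1.0):
    """Average per-step gain of a run (assumed form of the metric).

    rewards[t] is the reward received at step t, observed_sets[t] the
    observations made at step t, and cost[i] the cost of observation i.
    """
    if len(rewards) != len(observed_sets):
        raise ValueError("rewards and observed_sets must have the same length")
    if not rewards:
        raise ValueError("at least one time step is required")
    total = 0.0
    for reward, observations in zip(rewards, observed_sets):
        total += alpha * reward - sum(cost[obs] for obs in observations)
    return total / len(rewards)


# Hypothetical three-step run with two of the patient features as observations.
cost = {"age": 0.1, "tumor stage": 0.2}
rewards = [1, 0, 1]
observed_sets = [("age",), (), ("age", "tumor stage")]
print(average_gain(rewards, observed_sets, cost))
```

Under this definition an algorithm that observes every feature at every step pays the full cost each time, which is the behaviour attributed to the Contextual-UCB baseline.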
Performance of the Sim-OOS and Seq-OOS with Different Costs: We consider that the cost of each observation . We illustrate the gain of the Sim-OOS, Seq-OOS and Contextual-UCB algorithms for increasing values of cost . As Figure 1 illustrates, the gain of the Sim-OOS and Seq-OOS algorithms decreases as the observation cost increases. However, it should be noted that these algorithms learn the best simultaneous and sequential policies while simultaneously taking actions, irrespective of the costs of observation. These figures show that when the observation cost increases, the Sim-OOS and Seq-OOS achieve better gains than Contextual-UCB by observing less information and hence paying less cost. Therefore, the slope of the gain-cost curve of the Sim-OOS and Seq-OOS illustrated in Figure 1 decreases as the observation cost increases.
6 Conclusion
In this paper, we introduced the novel, yet ubiquitous, problem of contextual MAB with costly observations: selecting what information (contexts) to observe to inform the decision-making process. To address this problem, we developed two different algorithms, Sim-OOS and Seq-OOS, and proved that these algorithms achieve distribution-independent regret bounds that are sublinear in time. Future work will be dedicated to exploring algorithms with regret bounds that are polynomial in the number of observations.
References
- Auer et al.  P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47:235–256, 2002.
- Boyd and Vandenberghe  S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004.
- Campbel and Kracaw  T. S. Campbel and W. A. Kracaw. Information production, market signalling, and the theory of financial intermediation. The Journal of Finance, 35(4):863–882, 1980.
- Cesa Bianchi et al.  N. Cesa Bianchi, S. Shalev Shwartz, and O. Shamir. Efficient learning with partially observed attributes. The Journal of Machine Learning Research, 12:2857–2878, 2011.
- Chemmanur  T. J. Chemmanur. The pricing of initial public offerings: A dynamic model with information production. The Journal of Finance, 48(1):285–304, 1993.
- Chu et al.  W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
- Dudik et al.  M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang. Efficient optimal learning for contextual bandits. arXiv preprint arXiv:1106.2369, 2011.
- Golovin and Krause  D. Golovin and A. Krause. Adaptive submodularity: A new approach to active learning and stochastic optimization. In COLT, pages 333–345, 2010.
- Hazan and Koren  E. Hazan and T. Koren. Linear regression with limited observation. In Proc. 29th Int. Conf. on Machine Learning, pages 807–814, 2012.
- Jaksch et al.  T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.
- Langford and Zhang  J. Langford and T. Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. Advances in Neural Information Processing Systems (NIPS), 20:1096–1103, 2007.
- Lu et al.  T. Lu, D. Pál, and M. Pál. Contextual multi-armed bandits. In International Conference on Artificial Intelligence and Statistics, pages 485–492, 2010.
- Ortner and Auer  R. Ortner and P. Auer. Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems, 2007.
- Osband et al.  I. Osband, B. Van Roy, and Z. Wen. Generalization and exploration via randomized value functions. In International Conference on Machine Learning, 2016.
- Saslow et al.  D. Saslow, C. Boetes, W. Burke, S. Harms, M. O. Leach, C. D. Lehman, E. Morris, E. Pisano, M. Schnall, S. Sener, et al. American cancer society guidelines for breast screening with mri as an adjunct to mammography. CA: a cancer journal for clinicians, 57(2):75–89, 2007.
- Slivkins  A. Slivkins. Contextual bandits with similarity information. In 24th Annual Conference On Learning Theory, 2011.
- Tekin and Van Der Schaar  C. Tekin and M. Van Der Schaar. Discovering, learning and exploiting relevance. In Advances in Neural Information Processing Systems, pages 1233–1241, 2014.
- Zolghadr et al.  N. Zolghadr, G. Bartók, R. Greiner, A. György, and C. Szepesvári. Online learning with costly features and labels. In Advances in Neural Information Processing Systems (NIPS), pages 1241–1249, 2013.