I Introduction
One class of decision-making models is the multi-armed bandit (MAB) framework, in which decision makers learn the unknown models of different arms and actions do not change the state of the arms [1]. The MAB problem was originally proposed by Robbins [2] and has a wide range of applications in finance [3, 4], communication and networks [5, 6], healthcare [7], autonomous vehicles [8, 9], and energy management [10, 11], to name but a few. In the classical MAB problem, the decision maker sequentially selects an arm (action) with an unknown reward distribution out of a set of independent arms. The noisy reward of the selected arm is then revealed, while the values of the other arms remain unknown. At each step, the decision maker faces a dilemma between exploiting the best identified arm and exploring alternative arms. The goal in the classical multi-armed bandit model is to maximize the expected cumulative reward over the time horizon.
In this paper, we focus on a setting where a player is allowed to explore different arms in an exploration (or experimentation, used interchangeably) phase before committing to the best identified arm for exploitation once, or a given finite number of times. Interest in this setting is motivated by several application domains such as personalized healthcare and one-time investment. In such applications, exploitation is costly and/or it is infeasible to exploit a large number of times, but arms can be experimented with, by simulation and/or on historical data, many times at negligible cost [12]. A major step in personalized healthcare is to provide an individual patient with his/her disease risk profile based on his/her electronic medical record and personalized assessments [13, 14]. Different treatments (arms) can be evaluated for a person by simulation or trials on mice many times at low cost, but a single personalized treatment is ultimately exploited once for the patient [15, 16]. Another example of one-time exploitation is a one-time investment, where an investor chooses a factory out of multiple candidates. Based on experimentation on historical data, he/she selects one factory to invest in once. The common theme in both examples is to identify the best arm for one-time exploitation after an experimentation phase of pure exploration.
The above setting falls into the class of MAB problems called explore-then-commit. To the best of our knowledge, the previous work [12, 17, 18, 19, 10, 20] on explore-then-commit bandits tries to identify the arm with an optimal risk-return criterion in an expectation sense, up to a hyperparameter. Even though this objective is desirable in settings with infinitely many exploitations, it is not necessarily the best objective in the explore-then-commit setting with a single exploitation or finitely many exploitations. We elaborate on this observation with an illustrative example in Section III.
We advocate an alternative approach in which the objective is to select the arm that is most probable to yield the highest reward. Note that it has been recognized that, in many multi-armed bandit scenarios, selecting an arm by maximum expected reward is not the best strategy. In such scenarios, players not only aim to achieve the maximum cumulative reward, but they also want to minimize the uncertainty, such as the risk, in the outcome [21]; such approaches are known as risk-averse MAB. In the literature, there are several approaches to risk-averse MAB, including mean-variance (MV) [20] and conditional value at risk (CVaR) [10]. The performance of both MV and CVaR depends heavily on a single scalar hyperparameter, and selecting an inappropriate hyperparameter can degrade performance substantially. More details on the MV and CVaR criteria are given in Section II, and the negative impact of hyperparameter mismatch is studied in Section V.

Contributions: We propose a class of hyperparameter-free risk-averse algorithms (called OTE/FTE-MAB) for explore-then-commit bandits with finite-time exploitations. The goal of the algorithms is to select the arm that is most probable to give the player the highest reward. To analyze the algorithms, we define a new notion of finite-time exploitation regret for our setting of interest. We provide concrete mathematical support for an upper bound on the minimum number of experiments needed to guarantee an upper bound on the regret. More specifically, our results show that, using the proposed algorithm, the regret can be made arbitrarily small with a sufficient amount of experimentation. As a salient feature, the OTE/FTE-MAB algorithm is hyperparameter-free, so it is not prone to errors due to hyperparameter mismatch.
Organization of the Paper: Section II discusses related work. In Section III, the one/finite-time exploitation multi-armed bandit problem after an experimentation phase is formally described. We define a new notion of one/finite-time exploitation regret for our problem setup, and an example is provided to clarify the motivation of our work. In Section IV, we propose the OTE-MAB and FTE-MAB algorithms and find an upper bound on the minimum number of pure explorations needed to guarantee an upper bound on the regret. In Section V, we evaluate the OTE-MAB algorithm against risk-averse baselines and compare the minimum number of experiments needed to guarantee an upper bound on the regret for both the OTE-MAB and FTE-MAB algorithms. We conclude the paper with a discussion of opportunities for future work in Section VI.
II Related Work
Explore-then-commit is a class of multi-armed bandit problems with two consecutive phases, exploration (experimentation) and commitment. The decision maker can arbitrarily explore each arm in the experimentation phase; however, he/she needs to commit to one selected arm in the commitment phase. Several studies on explore-then-commit bandits exist in the literature. Bui et al. [12] studied the optimal number of explorations when a cost is incurred in both phases. Liau et al. [22] designed an explore-then-commit algorithm for the case where there is limited space to record the arm reward statistics. Perchet et al. [23] studied the explore-then-commit policy under the assumption that the employed policy must split explorations into a number of batches. None of these works has addressed the risk-averse issue in explore-then-commit bandits. In the following, we present an overview of risk-averse bandits.
There are several criteria to measure and to model risk in the risk-averse multi-armed bandit problem. One of the common risk measures is the mean-variance paradigm [24]. The two algorithms MV-LCB and ExpExp proposed by Sani et al. [20] are based on the mean-variance concept. They define the mean-variance of an arm with mean μ and variance σ² as MV = σ² − ρμ, where ρ is the absolute risk tolerance coefficient. In an infinite-horizon multi-armed bandit problem, MV-LCB plays the arm with the minimum lower confidence bound on the estimate of MV. In a best-arm identification setting, the ExpExp algorithm explores each of the arms the same number of times and selects the arm with the minimum estimated MV. This approach is followed by numerous researchers in risk-averse multi-armed bandit problems [21, 25, 26, 27].

Another way of considering risk in multi-armed bandit problems is to use the conditional value at risk at level α, CVaR_α, which is the expected policy return in a specified quantile. CVaR is utilized by Galichet et al. [10] in risk-aware multi-armed bandit problems. They presented the Multi-Armed Risk-Aware Bandit (MaRaB) algorithm, which aims to select the arm with the maximum conditional value at risk at level α. Formally, let α be the target quantile level and ν_α the associated quantile value, satisfying P(X ≤ ν_α) = α, where X is the arm reward. The conditional value at risk is then defined as CVaR_α = E[X | X ≤ ν_α]. CVaR is also followed by researchers in multi-armed bandit problems [21, 28, 29, 30, 31].

III Problem Statement
Consider N arms whose rewards are random variables X_1, …, X_N that have unknown distributions with unknown finite expected values μ_1, …, μ_N, respectively. The goal is to identify the best arm at the end of an experimentation phase that is followed by an exploitation phase, in which the best arm is exploited a given number of times, m. In the experimentation phase, each arm is sampled n independent times. Denote the observed reward of arm i at sample t of the experimentation phase by X_i^(t).

Let S_i = X_i^(1) + X_i^(2) + ⋯ + X_i^(m), where X_i^(1), …, X_i^(m) are independent and identically distributed random variables. The optimum arm for m exploitations, in the sense that it maximizes the probability of receiving the highest reward, is

i*_m = argmax_{i ∈ {1, …, N}} P(S_i ≥ [S_1, S_2, …, S_N]),   (1)

where by a scalar being greater than or equal to a vector we mean that it is greater than or equal to all elements of the vector. Let p_i = P(S_i ≥ [S_1, …, S_N]).
Given the above preliminaries, the finite-time exploitation regret is defined below.

Definition 1
The finite-time exploitation regret, r_m(n), is defined as a function of the number of explorations n per arm, for the selected arm î_m, as

r_m(n) = p_{i*_m} − E[p_{î_m}].   (2)

Note that the above definition of regret is different from the regret commonly used in bandit problems. In the following, we present an example that motivates defining this new notion of regret for the finite-time exploitation setting.
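The distinction driving this new notion of regret — the arm with the highest mean need not be the arm most likely to pay the most — can be made concrete with a small Monte Carlo sketch. The two reward distributions below are hypothetical stand-ins chosen for illustration, not the ones used later in Example 1:

```python
import random

random.seed(0)

def sample_safe_arm():
    # Hypothetical arm: tightly concentrated around 1.0 (low mean, low risk).
    return random.gauss(1.0, 0.1)

def sample_risky_arm():
    # Hypothetical arm: pays 10 with probability 0.3, else 0 (mean 3.0).
    return 10.0 if random.random() < 0.3 else 0.0

def prob_safe_wins(trials=100_000):
    # Monte Carlo estimate of P(safe arm's reward >= risky arm's reward).
    wins = sum(sample_safe_arm() >= sample_risky_arm() for _ in range(trials))
    return wins / trials

# The risky arm has triple the mean, yet the safe arm pays more about
# 70% of the time, so it is the better choice for a one-time exploitation.
print(round(prob_safe_wins(), 2))
```

Here a decision maker who exploits only once should prefer the "safe" arm, even though its expected reward is three times smaller.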
III-A Illustrative Example
As mentioned in the Introduction, although the arm with the highest expected reward is the optimum arm to use over an infinite number of exploitations, it is not necessarily the arm that is most probable to have the highest reward in a single, or some finite number of, exploitations. In the following example, two arms are considered such that the second arm has the larger expected reward, but it is more probable that a one-time exploitation of the first arm rewards us more than a one-time exploitation of the second arm. Hence, the arm with the highest expected reward is not necessarily the ideal arm for one-time exploitation, let alone the arm with the maximum empirical mean.
Example 1
Consider two arms with the following independent reward distributions:
where the constants are such that each of the two distributions integrates to one and 1{·} is the indicator function.
In Example 1, although the second arm has a larger mean than the first one, the variance of the reward received from the second arm is larger than that of the first, which increases the risk of choosing the second arm for a one-time exploitation application. In fact, the first arm, despite its lower mean, is more probable to reward us more than the second arm. In general, a larger variance of the received reward goes against the principle of risk aversion, where the objective is to keep a balance in the trade-off between the expected return and the risk of an action [20]. Mean-variance is an existing approach to tackle this scenario; however, it has some drawbacks that are explained in detail in the following.
The mean-variance (MV) of an arm depends on the hyperparameter ρ, the absolute risk tolerance coefficient. The trade-off on ρ is that if it is set to zero, the arm with the minimum variance is selected; on the other hand, if ρ goes to infinity, the arm with the maximum expected reward is selected, which coincides with the classical multi-armed bandit approach. Although the behavior of the mean-variance trade-off is known for these extreme values of ρ, it is not obvious what value of the hyperparameter keeps a desirable balance between return and risk. The choice of this hyperparameter can be tricky and, as will be shown in Section V, a bad choice can increase the regret dramatically. As a simple example, consider two arms with unknown means and variances, and form the mean-variance index from the empirical estimates of the variance and mean of each arm. Since the empirical means and variances converge to the true values, the arm that performs worse is selected with probability one whenever ρ falls on the wrong side of a threshold determined by the true means and variances. In order to address this issue, we alternatively propose the following best-arm identification algorithm for One-Time (Finite-Time) Exploitation in a Multi-Armed Bandit problem (the OTE/FTE-MAB algorithm), which has concrete mathematical support for its action and is hyperparameter-free.
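The threshold behavior described above can be sketched in a few lines. The (mean, variance) pairs are hypothetical, and selecting the arm with the minimum index variance − ρ·mean follows the description of the mean-variance criterion in Section II:

```python
def mv_choice(arms, rho):
    """Select the arm minimizing the mean-variance index variance - rho * mean,
    where rho is the absolute risk tolerance coefficient."""
    return min(arms, key=lambda name: arms[name][1] - rho * arms[name][0])

# Hypothetical (mean, variance) pairs in the spirit of Example 1.
arms = {"safe": (1.0, 0.01), "risky": (3.0, 25.0)}

print(mv_choice(arms, rho=1.0))   # small rho favors the low-variance arm
print(mv_choice(arms, rho=20.0))  # large rho favors the high-mean arm
```

With these numbers the selection flips at ρ ≈ 12.5, so any fixed ρ is only appropriate for some problem instances — exactly the sensitivity studied in Section V.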
IV One/Finite-Time Exploitation in a Multi-Armed Bandit Problem after an Experimentation Phase
In this section, we propose the OTE-MAB and FTE-MAB algorithms. The OTE-MAB algorithm is a specific case of the FTE-MAB algorithm. Since the proof of the theorem related to the FTE-MAB algorithm is notationally heavy, we first propose the OTE-MAB algorithm in Subsection IV-A and postpone the FTE-MAB algorithm to Subsection IV-B.
IV-A The OTE-MAB Algorithm
The OTE-MAB algorithm aims to play the arm that is most probable to reward the most for the case m = 1, namely

i* = argmax_{i ∈ {1, …, N}} P(X_i ≥ [X_1, X_2, …, X_N]),   (3)

which is a specific case of Equation (1). For simplicity of notation, the index m is dropped in this subsection.
Remark 1
If there is any hard constraint on the minimum required reward in the one-time exploitation, we can concatenate that hard constraint to the reward vector [X_1, …, X_N] as an additional element.
Since the reward distributions of the independent arms are not known, the exact values of the p_i are unknown. Hence, we evaluate estimates of those probabilities, p̂_i, based on our observations in the experimentation phase as follows:

p̂_i = (1/n^N) Σ_{t_1=1}^n ⋯ Σ_{t_N=1}^n 1{X_i^(t_i) ≥ X_j^(t_j), ∀ j ∈ {1, …, N}},   (4)

where 1{·} is the indicator function and we assume independence of the rewards of different arms.
Remark 2
If the rewards of different arms are dependent, we need to have instantaneous observations of all arms at the same time for n rounds and calculate p̂_i as follows:

p̂_i = (1/n) Σ_{t=1}^n 1{X_i^(t) ≥ X_j^(t), ∀ j ∈ {1, …, N}}.   (5)
The OTE-MAB algorithm selects the arm î = argmax_i p̂_i as the best arm in terms of rewarding the most with the highest probability in a one-time exploitation. The one-time exploitation regret, r(n), which is a specific case of Definition 1, is

r(n) = p_{i*} − E[p_{î}].   (6)

The OTE-MAB algorithm is summarized in Algorithm 1. We next present a theorem on an upper bound of the minimum number of experiments needed to guarantee an upper bound on the regret of Algorithm 1.
Theorem 1
For any ε > 0, if each of the N arms is experimented for a sufficiently large number of times n(ε) in the experimentation phase, then the one-time exploitation regret defined in Equation (6) is bounded by ε, i.e., r(n) ≤ ε. Note that simultaneous exploration of the arms is required in the experimentation phase if the rewards of different arms are dependent.
Consider the Bernoulli random variables indicating that an arm's reward is the largest, with unknown means p_i for i ∈ {1, …, N}. Possessing n independent observations from each of the N independent or dependent arms in pure exploration, the confidence interval for estimating p_i based on Equation (4) or (5) with confidence level 1 − δ has the property that

P(p_i ∈ [p̂_i − c, p̂_i + c]) ≥ 1 − δ,   (7)

where c is the Hoeffding half-width associated with n independent samples.
Note that for the case of dependent arms, there are n tuples containing the instantaneous observations of the N arm rewards, which are used for the estimation of p_i in Equation (5). On the other hand, for the case of independent arms, we can use any of the orderings of the observations of the arm rewards for the estimation of p_i, as we do in Equation (4). However, we cannot shrink the confidence interval accordingly for confidence level 1 − δ. The reason is that, although p̂_i is derived from all combinations of the samples, not all of those combinations are independent; exactly n of the samples are independent. In fact, the observed independent rewards can be classified into tuples of the arm rewards with independent elements in many different ways. None of these classifications has any priority over the others for estimating p_i, so we can compute p̂_i based on any of them. The estimate of p_i derived from any such classification falls in the confidence interval with probability at least 1 − δ, so the average of those estimates is again in the mentioned interval with probability at least 1 − δ. Note that this average over all classifications is equal to the p̂_i derived from Equation (4): each observation is repeated the same number of times across the classifications, so averaging over the distinct elements gives the same answer as averaging the individual estimates. As a result, we can use the same half-width for the confidence interval of the estimators obtained from Equations (4) and (5), for independent and dependent arms, respectively; the price is a higher computational complexity.

In order to find a bound on the regret defined in Equation (6), we note that
(8) 
where the event on the right-hand side occurs if the selected arm differs from i*. By using the union bound and Equation (7), the probability of the right-hand side of the above equation can be bounded as follows, which results in the following bound on the regret:
(9) 
The above upper bound on the regret holds under a condition on the number of experiments, which by simple algebraic calculations is equivalent to the bound on n stated in the theorem.
According to Theorem 1, the arm î selected by Algorithm 1 satisfies r(n) ≤ ε for any ε > 0 if each of the arms is explored sufficiently many times in the experimentation phase. Hence, p_{î} can get arbitrarily close to p_{i*} by increasing the number of pure explorations in the experimentation phase.
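For intuition on how the required number of explorations scales, the following applies the standard Hoeffding inequality with a union bound over the arms; the exact constants in Theorem 1 may differ, so this is an illustrative calculation rather than the paper's bound:

```python
import math

def min_explorations(epsilon, delta, num_arms):
    """Smallest n with num_arms * 2 * exp(-2 * n * epsilon**2) <= delta,
    i.e. every estimate p_hat_i is within epsilon of p_i simultaneously
    with probability at least 1 - delta (Hoeffding + union bound;
    the constants here are illustrative, not Theorem 1's)."""
    return math.ceil(math.log(2 * num_arms / delta) / (2 * epsilon ** 2))

n = min_explorations(epsilon=0.1, delta=0.05, num_arms=2)
print(n)  # → 220
```

The logarithmic dependence on the number of arms and the 1/ε² dependence on the accuracy are the qualitative behaviors to take away.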
Let p_(1) ≥ p_(2) ≥ ⋯ ≥ p_(N) be the list of the p_i ordered in descending order. Note that the arm achieving p_(1) is the arm i* defined in Equation (3). Define the difference between the two largest p_i's as Δ = p_(1) − p_(2), which without loss of generality we assume to be nonzero. Having the knowledge of Δ, or a lower bound on it, we can define a stronger notion of regret as
(10) 
and have the following corollary.
Corollary 1
From a theoretical point of view, given the knowledge of Δ or a lower bound on it, for any ε > 0, the regret defined in Equation (10) is bounded by ε if each of the arms is explored for a sufficiently large number of times depending on Δ. If the arms are dependent, instantaneous explorations of the arms are needed.
IV-B The FTE-MAB Algorithm
Consider the case where an arm is going to be exploited for a finite number of times, m. The best arm for m-time exploitation is defined in Equation (1). Since the reward distributions are unknown, we need to estimate the p_i's based on the observations in the pure exploration phase. In the case of independent arms, partition the n samples of arm i into L = ⌊n/m⌋ disjoint groups of size m and define the group sums

S_i^(l) = Σ_{t=(l−1)m+1}^{lm} X_i^(t),  l ∈ {1, …, L},   (11)

which emulate independent m-time exploitations of arm i. Let p̂_i be the estimate of p_i; then, in analogy with Equation (4), we can compute

p̂_i = (1/L^N) Σ_{l_1=1}^L ⋯ Σ_{l_N=1}^L 1{S_i^(l_i) ≥ S_j^(l_j), ∀ j ∈ {1, …, N}}.   (12)

In the case of dependent arms, the S_i^(l)'s are defined in the same way as for independent arms, but the same set of sample indices is used for generating the group sums of all arms. Hence, in analogy with Equation (5), p̂_i is defined as follows for dependent arms:

p̂_i = (1/L) Σ_{l=1}^L 1{S_i^(l) ≥ S_j^(l), ∀ j ∈ {1, …, N}}.   (13)
The FTE-MAB algorithm selects the arm î_m = argmax_i p̂_i for m-time exploitation. This algorithm is summarized in Algorithm 2. We next present a theorem, generalizing Theorem 1, on an upper bound of the minimum number of experiments needed to guarantee an upper bound on the regret of Algorithm 2.
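A sketch of the estimation step for m-time exploitation, under the grouping interpretation described above (disjoint blocks of m samples are summed to emulate m-time exploitations; the block sums are then compared round by round as in the dependent-arm case):

```python
def fte_mab_select(samples, m):
    """Sketch of the m-time-exploitation variant: each arm's n samples are
    partitioned into disjoint blocks of m, each block is summed to emulate
    one m-time exploitation, and the block sums are compared round-by-round
    exactly as in the one-time case."""
    blocks = [[sum(s[j * m:(j + 1) * m]) for j in range(len(s) // m)]
              for s in samples]
    k, n_blocks = len(samples), len(blocks[0])
    p_hat = [0.0] * k
    for t in range(n_blocks):
        sums = [blocks[i][t] for i in range(k)]
        p_hat[sums.index(max(sums))] += 1.0 / n_blocks
    return max(range(k), key=lambda i: p_hat[i]), p_hat
```

With m = 1 this reduces to the one-time selection rule. Since only ⌊n/m⌋ effective comparisons remain, the required number of explorations grows linearly with m, consistent with the factor-of-m scaling discussed in Section V.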
Theorem 2
For any ε > 0, if each of the N arms is explored for a sufficiently large number of times n in the experimentation phase, with n a multiple of m, then the finite-time exploitation regret defined in Definition 1 is bounded by ε, i.e., r_m(n) ≤ ε. If the rewards of different arms are dependent, simultaneous explorations of the arms are required for the same bound on the regret.
Let p_(1) ≥ p_(2) ≥ ⋯ ≥ p_(N) be the list of the p_i ordered in descending order, where the p_i are now defined with respect to the m-fold sums S_i. Note that the arm achieving p_(1) is the arm i*_m defined in Equation (1). Define the difference between the two largest p_i's as Δ_m = p_(1) − p_(2), which without loss of generality we assume to be nonzero. Having the knowledge of Δ_m, or a lower bound on it, we can define a stronger notion of regret as
(14) 
and have the following corollary.
Corollary 2
From a theoretical point of view, given the knowledge of Δ_m or a lower bound on it, for any ε > 0, the regret defined in Equation (14) is bounded by ε if each of the arms is explored for a sufficiently large number of times n depending on Δ_m, where n is a multiple of m. If the arms are dependent, instantaneous explorations of the arms are needed.
Corollary 3
If we let m go to infinity, the problem becomes the classical multi-armed bandit problem: by the law of large numbers, the normalized m-fold reward of each arm converges to its expected value as m → ∞, so maximizing the probability of receiving the highest reward becomes the same as maximizing the expected reward. Hence, the FTE-MAB algorithm selects the arm with the maximum expected reward if the arm is going to be exploited infinitely many times and the cumulative reward is to be maximized.

V Simulation Results
In this section, we report numerical simulations validating the theoretical results presented in this paper. We compare our proposed OTE-MAB algorithm with the Upper Confidence Bound (UCB) [32], ExpExp [20], and MaRaB [10] algorithms. Consider two arms with the reward distributions given in Example 1. The regret defined in Equation (10) versus the number of pure explorations per arm, n, is averaged over 100,000 runs. The result is plotted in Figure 1; as we see, OTE-MAB outperforms the state-of-the-art algorithms for the purpose of risk aversion in terms of the regret defined in this paper. Note that the UCB algorithm aims at selecting the arm that maximizes the expected received reward, but in Example 1 the arm with the higher expected reward is less probable to have the highest reward, which is why UCB performs poorly in this example. However, in the following example, where the arm that rewards more in expectation is also more probable to reward more, the UCB and ExpExp algorithms perform as well as the OTE-MAB algorithm.
Example 2
Consider two arms with the following unknown independent reward distributions:
where the constants are chosen so that the two probability density functions integrate to one.
Note that in Example 2, the arm with the larger expected reward is also the arm more probable to yield the highest reward. For this scenario, the regret defined in Equation (10) versus the number of pure explorations per arm, n, averaged over 100,000 runs, is plotted in Figure 2.
In another experiment, the multi-armed bandit of Example 1 is simulated and the probability that the selected arm has the higher reward is calculated over 500,000 runs for the different algorithms. The result is shown in Figure 3. This result confirms the motivation of our study of risk-averse finite-time exploitations in multi-armed bandits.
In the above comparison of OTE-MAB with the state-of-the-art algorithms, three different choices of hyperparameters for the ExpExp and MaRaB algorithms were tested and the best performance is presented. Note, however, that the performance of these algorithms depends on the choice of hyperparameter. In Figure 4, the sensitivity of the performance of the ExpExp algorithm to the choice of its hyperparameter is depicted for Example 1 and for a third example, where the variance of the best arm is larger than the variance of the arm with the lower expected reward. The two plots show the regret, averaged over 100,000 runs, versus the value of ρ for the ExpExp algorithm for the two different multi-armed bandit problems. As depicted in Figure 4, a choice of ρ can be good for one multi-armed bandit problem but not for another. Based on our observations, the sensitivity of the MaRaB algorithm to its hyperparameter can be even more complex. Figure 5 depicts the regret, averaged over 100,000 runs, versus the value of the MaRaB hyperparameter, α. This figure is plotted for Example 1 and for a fourth example, where the reward of the first arm has a truncated Gaussian distribution with mean three and variance two over a bounded interval, and the second arm is the same as the one in Example 1.

In another experiment, we compare the minimum number of explorations needed to guarantee a bound on the regret for the two cases of one-time and two-time exploitations. Theorems 1 and 2 suggest that, for a given regret bound, the upper bound on the minimum number of explorations needed for m-time exploitation is m times that of one-time exploitation. We design two examples of two-armed bandits and plot the minimum number of explorations needed to guarantee a bounded regret in Figure 6. The dotted plot is the plot for the OTE-MAB algorithm multiplied by two, which is close to that of the FTE-MAB algorithm for two-time exploitation. This observation validates our theoretical results.
Taking a closer look at Example 1, we note that the regret defined in (10) can be found in closed form as
(15)  
Deriving the regret from the above equation, we get the same regret obtained through simulation for the OTE-MAB algorithm, plotted in Figure 1. In this paper, we assume that experimentation has zero cost, which is often a valid assumption. However, if experimentation is time-consuming, there is a cost to postponing the exploitation of the best identified arm. For example, with more experimentation, a patient receives medication with delay, or an investor keeps his/her money on hold with zero interest, both of which incur costs. Let such a cost be formulated by an increasing function C(n), where C(n) is the incurred cost of n experiments. Then, a trade-off emerges between more exploration, for higher accuracy of best-arm identification, and a lower incurred cost of experimentation. We can formalize such a trade-off by solving
n* = argmin_n { r(n) + C(n) },   (16)

where r(n) + C(n) is the regret-cost trade-off and r(n) is calculated by Equation (15) based on an estimate of the arm distributions that is updated after each experiment. Figure 7 plots the regret-cost trade-off under Example 1 for a given cost function and the real value of the regret. The rigorous analysis of (16) is postponed to future work and is beyond the scope of this paper.
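The trade-off in Equation (16) can be explored numerically with stand-in functions; the exponentially decaying regret estimate and the linear cost below are assumptions for illustration, not the quantities computed in the paper:

```python
import math

def best_exploration_length(regret, cost, n_max=2000):
    """Grid search for the n minimizing regret(n) + cost(n), the trade-off
    of Equation (16). Both inputs are hypothetical stand-ins, not the
    closed form of Equation (15)."""
    return min(range(1, n_max + 1), key=lambda n: regret(n) + cost(n))

regret = lambda n: math.exp(-0.01 * n)  # assumed exponentially decaying regret
cost = lambda n: 1e-4 * n               # assumed linear experimentation cost
n_star = best_exploration_length(regret, cost)
```

With these stand-ins the optimum lies where the marginal drop in regret equals the marginal cost of one more experiment, which is the qualitative shape one would expect Figure 7 to exhibit.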
VI Conclusion and Future Work
The focus of this work is on application domains, such as personalized healthcare and one-time investment, where an experimentation phase of pure arm exploration is followed by a given finite number of exploitations of the best identified arm. We show through an example that the arm with the maximum expected reward does not necessarily maximize the probability of receiving the maximum reward. The OTE-MAB and FTE-MAB algorithms presented in this paper aim to select the arm that maximizes the probability of receiving the maximum reward. We define a new notion of regret for our problem setup and find an upper bound on the minimum number of experiments needed to guarantee an upper bound on the regret. The cost of experimentation is assumed to be negligible in this paper; if this assumption is violated in an application domain, the cost-regret trade-off is a promising direction for future work.
References
 [1] L. Zhou, “A survey on contextual multi-armed bandits,” arXiv preprint arXiv:1508.03326, 2015.
 [2] H. Robbins, “Some aspects of the sequential design of experiments,” Bulletin of the American Mathematical Society, vol. 58, no. 5, pp. 527–535, 1952.
 [3] D. Bergemann and U. Hege, “Dynamic venture capital financing, learning and moral hazard,” Journal of Banking and Finance, vol. 22, no. 6–8, pp. 703–735, 1998.
 [4] ——, “The financing of innovation: Learning and stopping,” RAND Journal of Economics, pp. 719–752, 2005.
 [5] O. Avner and S. Mannor, “Multi-user lax communications: a multi-armed bandit approach,” in IEEE INFOCOM 2016 - The 35th Annual IEEE International Conference on Computer Communications. IEEE, 2016, pp. 1–9.
 [6] A. Yekkehkhany and R. Nagi, “Blind GB-PANDAS: A blind throughput-optimal load balancing algorithm for affinity scheduling,” arXiv preprint arXiv:1901.04047, 2019.
 [7] D.-S. Zois, “Sequential decision-making in healthcare IoT: Real-time health monitoring, treatments and interventions,” in 2016 IEEE 3rd World Forum on Internet of Things (WF-IoT). IEEE, 2016, pp. 24–29.
 [8] N. Musavi, D. Onural, K. Gunes, and Y. Yildiz, “Unmanned aircraft systems airspace integration: A game theoretical framework for concept evaluations,” Journal of Guidance, Control, and Dynamics, pp. 96–109, 2016.
 [9] N. Musavi, K. B. Tekelioğlu, Y. Yildiz, K. Gunes, and D. Onural, “A game theoretical modeling and simulation framework for the integration of unmanned aircraft systems in to the national airspace,” in AIAA Infotech@ Aerospace, 2016, p. 1001.

 [10] N. Galichet, M. Sebag, and O. Teytaud, “Exploration vs exploitation vs safety: Risk-aware multi-armed bandits,” in Asian Conference on Machine Learning, 2013, pp. 245–260.
 [11] S. Maghsudi and E. Hossain, “Distributed user association in energy harvesting dense small cell networks: A mean-field multi-armed bandit approach,” IEEE Access, vol. 5, pp. 3513–3523, 2017.
 [12] L. X. Bui, R. Johari, and S. Mannor, “Committing bandits,” in Advances in Neural Information Processing Systems, 2011, pp. 1557–1565.
 [13] N. V. Chawla and D. A. Davis, “Bringing big data to personalized healthcare: a patient-centered framework,” Journal of General Internal Medicine, vol. 28, no. 3, pp. 660–665, 2013.
 [14] D. E. Pritchard, F. Moeckel, M. S. Villa, L. T. Housman, C. A. McCarty, and H. L. McLeod, “Strategies for integrating personalized medicine into healthcare practice,” Personalized medicine, vol. 14, no. 2, pp. 141–152, 2017.
 [15] K. Priyanka and N. Kulennavar, “A survey on big data analytics in health care,” International Journal of Computer Science and Information Technologies, vol. 5, no. 4, pp. 5865–5868, 2014.
 [16] E. Abrahams, G. S. Ginsburg, and M. Silver, “The personalized medicine coalition,” American Journal of Pharmacogenomics, vol. 5, no. 6, pp. 345–355, 2005.
 [17] A. Garivier, T. Lattimore, and E. Kaufmann, “On explore-then-commit strategies,” in Advances in Neural Information Processing Systems, 2016, pp. 784–792.
 [18] A. Garivier, P. Ménard, and G. Stoltz, “Explore first, exploit next: The true shape of regret in bandit problems,” Mathematics of Operations Research, 2018.
 [19] L. Prashanth, “CS6046: Multi-armed bandits,” 2018.
 [20] A. Sani, A. Lazaric, and R. Munos, “Risk-aversion in multi-armed bandits,” in Advances in Neural Information Processing Systems, 2012, pp. 3275–3283.
 [21] S. Vakili and Q. Zhao, “Risk-averse multi-armed bandit problems under mean-variance measure,” IEEE Journal of Selected Topics in Signal Processing, vol. 10, no. 6, pp. 1093–1111, 2016.
 [22] D. Liau, E. Price, Z. Song, and G. Yang, “Stochastic multi-armed bandits in constant space,” arXiv preprint arXiv:1712.09007, 2017.
 [23] V. Perchet, P. Rigollet, S. Chassang, E. Snowberg et al., “Batched bandit problems,” The Annals of Statistics, vol. 44, no. 2, pp. 660–681, 2016.
 [24] H. M. Markowitz, “Portfolio selection,” The Journal of Finance, vol. 7, no. 1, pp. 77–91, 1952.
 [25] S. Vakili and Q. Zhao, “Mean-variance and value at risk in multi-armed bandit problems,” in 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2015, pp. 1330–1335.

 [26] J. Y. Yu and E. Nikolova, “Sample complexity of risk-averse bandit-arm selection,” in Twenty-Third International Joint Conference on Artificial Intelligence, 2013.
 [27] S. Vakili and Q. Zhao, “Risk-averse online learning under mean-variance measures,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 1911–1915.
 [28] J. Xu, W. B. Haskell, and Z. Ye, “Index-based policy for risk-averse multi-armed bandit,” arXiv preprint arXiv:1809.05385, 2018.
 [29] N. Galichet, “Contributions to multi-armed bandits: Risk-awareness and sub-sampling for linear contextual bandits,” Ph.D. dissertation, Université Paris Sud - Paris XI, 2015.
 [30] A. Cassel, S. Mannor, and A. Zeevi, “A general approach to multi-armed bandits under risk criteria,” arXiv preprint arXiv:1806.01380, 2018.
 [31] R. K. Kolla, K. Jagannathan et al., “Risk-aware multi-armed bandits using conditional value-at-risk,” arXiv preprint arXiv:1901.00997, 2019.
 [32] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multiarmed bandit problem,” Machine Learning, vol. 47, no. 2–3, pp. 235–256, 2002.