Multi-armed bandit processes (MAB for short) model resource allocation problems under uncertainty, in which a decision maker attempts to optimize decisions based on existing knowledge so as to maximize the expected total reward over time (Gittins et al., 2011). They have extensive applications in clinical trials, design of experiments, manufacturing systems, economics, queueing and communication networks, control theory, search theory, scheduling, and machine learning/reinforcement learning.
In this paper we are concerned with a general multi-armed bandit problem with restricted random stopping time sets, which can roughly be described as follows: a multi-armed bandit process consists of a set of statistically independent arms evolving in continuous time, among which a resource (time, effort) is to be allocated. Each arm is associated with a restricted stopping time set, in the sense that the arm must be engaged exclusively whenever its operation time does not belong to its stopping time set. The allocation respects these restrictions, and any engaged arm accrues rewards represented by a general stochastic process. The objective is to maximize the total expected discounted reward over an infinite time horizon.
The early versions of discrete-time MAB processes in Markovian and semi-Markovian fashions have been well understood thanks to the pioneering work of Gittins and Jones (1972) and the subsequent seminal contributions of Gittins (1979, 1989) and Whittle (1980, 1982). The significance of Gittins’ contribution is the drastic dimension reduction: instead of solving the optimization problems of the Markov (or semi-Markov) decision models formed by all arms, one only needs to compute an index function of the states based merely on the information delivered by each arm itself, and then pick an arm with the highest index to operate. That index function, generally known today as the Gittins index, was defined by Gittins as the maximum reward rate over all arm-specific stopping times. Whittle (1980) provided a mathematically elegant proof by showing that Gittins index policies solve the optimality equations of the corresponding dynamic programming model of the multi-armed bandit process. For general reward processes in integer time (without the Markovian assumption), Varaiya et al. (1984) defined an optimal policy in abstract terms by reducing every -armed problem to independent stopping problems of the type solved by Snell (1952). Mandelbaum (1986) proposed a technically convenient framework by formulating a control problem with time parameters in a multidimensional, partially ordered set. El Karoui and Karatzas (1993) presented a mathematically rigorous proof of the optimality of Gittins index policies for arbitrary stochastic processes evolving in integer time by combining the formulation of Mandelbaum (1986) with ideas from Whittle (1980). The most general treatments in the discrete time setting can be found in Cai et al. (2014, Section 6.1) and Cowan and Katehakis (2015), which drop the Markovian property from the semi-Markovian model so that switches from one arm to another can only take place at certain time points, the intervals between consecutive points being random quantities.
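For orientation, the classical index of the Markovian discrete time model can be written, in standard notation (which may differ from the symbols used later in this paper), as a maximal reward rate over stopping times:

```latex
% Classical Gittins index of an arm in state x, discount factor beta in (0,1):
% the supremum over stopping times tau >= 1 of the expected discounted reward
% per unit of expected discounted time, computed from the arm's own dynamics alone.
\nu(x) \;=\; \sup_{\tau \ge 1}
\frac{\mathbb{E}\!\left[\sum_{t=0}^{\tau-1} \beta^{t} R(x_t) \,\middle|\, x_0 = x\right]}
     {\mathbb{E}\!\left[\sum_{t=0}^{\tau-1} \beta^{t} \,\middle|\, x_0 = x\right]}
```

The index policy then always engages an arm whose current state attains the highest index.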
One key feature of the discrete time setting is that switches from any arm can occur only at countably many time instants, even though the arms can evolve continuously over the time horizon. This type of problem is referred to as the general discrete time setting. Some aspects of the theory in the discrete time version, along with applications in searching, job scheduling, etc., can also be found in the comprehensive monograph by Gittins et al. (2011).
The parallel theory for MAB processes in continuous time was not developed until later, mainly due to the technical intricacies in mathematics. Here the term “continuous time” emphasizes not only that rewards can be collected continuously but, most significantly in mathematics, that switches from one arm to another are allowed at arbitrary time points, so that the time set of an arm from which switches can be made is the whole positive axis, hence essentially uncountable, in sharp contrast to the discrete time version in which the switch times are essentially countable. It is a consensus that continuous time stochastic processes are far more difficult to attack than their discrete time versions, owing to the difficulties in dealing with the measurability of the quantities involved. For the continuous time version of the problem in the Markovian case, relevant results were first obtained by Karatzas (1984) and Eplett (1986). By insightfully formulating the model as a stochastic control problem for certain multi-parameter processes, Mandelbaum (1987) extended the problem to a general dynamic setting. Based on Mandelbaum’s formulation, El Karoui and Karatzas (1994) derived general results by combining martingale-based methodologies with the retirement option designed by Whittle (1980) for his elegant proof of the optimality of Gittins index policies in discrete time. These results were further revisited by Kaspi and Mandelbaum (1998), who gave a relatively short and rigorous proof by means of excursion theory.
To sum up, studies on MAB processes have treated, with technically different methods, only the two regular ends: the discrete time version (including the general discrete time setting), in which switches from any arm to another are at most countably infinite, and the continuous time version, in which the controller can switch from one arm to another at any time point in the positive time horizon.
Clearly, in between the two regular ends there exist many real-life situations that cannot be put in the framework of either of the two versions alone, especially when there are technical restrictions on the switch times of the arms. As an example, consider a simple job scheduling scenario subject to machine breakdowns (see, e.g., Cai et al., 2014), in which a single unreliable machine is to process a set of jobs and, in serving the jobs, the machine may suffer breakdowns from time to time, caused, for instance, by damage to machine components or to the power supply. When the machine is workable, a job can be processed and the processing can be preempted so as to switch the machine to any one of the unfinished jobs. Once the machine breaks down, it must be repaired continuously until it can resume operation. In this scenario, the stopping times at which the machine can be switched from one job to another are restricted to the time intervals in which the machine is in good condition. By associating the repair duration of the machine with the job being processed, this problem can be modeled by a multi-armed bandit process. This bandit process, however, fits in neither the discrete time nor the continuous time framework, owing to two significant features: first, for any job, the set of its potential switching times forms essentially a continuum within the intervals in which the machine is workable, so the framework cannot be the discrete time version; second, during the intervals of machine repair, a switch from the job is prohibited, so the framework cannot be the continuous time version.
As another example that the classical MAB models cannot accommodate, consider a second job scheduling problem in which some of the jobs can be preempted at any time point, whereas the other jobs consist of a number of nonpreemptable components so that, once such a job is selected for processing, it cannot be preempted until the completion of a component. This problem translates to an MAB formulation in which some arms evolve in a continuous time setting while the others follow a discrete time mechanism. Furthermore, one can even imagine situations where jobs consist of both preemptable and nonpreemptable components, so that, represented as MAB models, the arms can be in the continuous time version, the discrete time version, or a mixed mode in which the switch times contain both continuum and discrete parts. Clearly, the existing optimality theory of MAB processes is not applicable to these situations.
This paper is dedicated to proposing a new MAB process model that accommodates these situations. This is accomplished by introducing a type of restriction on switch times, or equivalently on the arm-specific stopping times, as discussed recently in Bao et al. (2017) for restricted optimal stopping problems. First, it turns out that this new model also unifies the existing versions of MAB processes. Specifically, for the pure discrete time version, the switching times of every arm are just the integer times; for the general discrete time version, the switching times are the end points of the intervals during which no switch is allowed; and the purely continuous time setting corresponds simply to the case of no restriction (see Section 2 for details). Moreover, an obvious merit of this new framework is that, by imposing different restrictions on different arms, it yields the optimal solution to irregular cases in which some of the arms follow continuous time, some others follow discrete time, and still others respect even more complicated mixtures; see the examples above. Such important types of MAB processes have not yet been touched in the existing literature.
To successfully tackle this problem, we combine the martingale techniques employed by El Karoui and Karatzas (1994) with an excursion method similar to that used by Kaspi and Mandelbaum (1998), but now under the new framework of general -armed bandit processes with each arm equipped with a restricted stopping time set.
The main contribution of this paper consists of the following:
We develop a new and general framework of MAB processes, suggest a corresponding general definition of Gittins indices, and demonstrate their optimality in arm allocation under switch time restrictions. This framework generalizes and unifies the models, methodologies and theory for all versions of MAB processes and applies to many other situations.
While the proof follows ideas partly from El Karoui and Karatzas (1994) and partly from Kaspi and Mandelbaum (1998), new techniques (e.g., the discounted gain process (3.2) and Lemma 4.1) are introduced so that the proof is drastically shorter than those for the unrestricted MAB processes in continuous time.
The remainder of the paper is organized as follows. Section 2 formulates the restricted MAB processes with each arm associated with a restriction on stopping times. After a concise review of the theory of optimal stopping times with restrictions in Section 3.1, which prepares the necessary theoretical foundation, Section 3.2 associates each arm with a Gittins index process defined under the restrictions on stopping times, which unifies and extends the classical definitions for the discrete time, continuous time and semi-Markovian settings. The properties of the Gittins index process are also addressed there. Section 4 demonstrates the optimality of Gittins index policies. The paper is concluded in Section 5 with a few remarks.
2 Model Specification
The MAB processes for which the switches among arms are subject to restrictions are referred to as “restricted multi-armed bandit processes” (RMAB processes).
In this paper, an RMAB process refers to a stochastic control process governed by the following mechanism. The primitives are stochastic processes , evolving on , all of which are defined on a common probability space to represent arms, meeting the following formulation:
Filtrations. For every , is a quasi-left-continuous filtration satisfying the usual conditions and mod . The filtrations are assumed to be mutually independent.
Rewards. For every ,
, the instant reward rate obtained at the moment when arm has just been pulled for units of time, is assumed to be -progressive and, with no loss of generality, satisfies .
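The integrability condition referred to here is presumably of the standard form imposed on continuous time bandits (cf. Kaspi and Mandelbaum, 1998); in generic notation, writing $h_i$ for the reward rate of arm $i$ and $\alpha$ for the interest rate, it would read:

```latex
% Standard integrability of discounted rewards; it guarantees that the
% expected present value of operating any single arm forever is finite.
\mathbb{E}\!\left[\int_0^{\infty} e^{-\alpha t}\,\bigl|h_i(t)\bigr|\,dt\right] < \infty .
```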
Restrictions. Let be an -adapted random time set, referred to as the feasible time set of arm , satisfying and is closed for every . For an -stopping time , also write if almost surely; the symbol refers to both a random set and the set of stopping times with a.s.. Here may vary over , subject to different requirements.
Policies under restrictions. An allocation policy is characterized by a -dimensional stochastic process where is the total amount of time that allocates to arm during the first units of calendar time, satisfying the following technical requirements:
is component-wise nondecreasing in with .
for every .
For any nonnegative vector, .
if , where indicates the right derivative.
Objective. With any policy , the total reward of the bandit in calendar time interval is , so that the total expected present value of this -armed bandit system is
where indicates the interest rate. The objective is to find a policy such that , where the maximization is taken over all the policies characterized above.
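In generic notation (the symbols here are illustrative rather than the paper's own), the objective functional of an allocation policy $T=(T_1,\dots,T_N)$ would take the multiparameter form used by Kaspi and Mandelbaum (1998):

```latex
% Total expected discounted reward of policy T: arm i, having been engaged for
% T_i(t) units of time by calendar time t, accrues reward at rate h_i(T_i(t))
% against the allocation increment dT_i(t), discounted at interest rate alpha.
V(T) \;=\; \mathbb{E}\!\left[\int_0^{\infty} e^{-\alpha t}
      \sum_{i=1}^{N} h_i\bigl(T_i(t)\bigr)\, dT_i(t)\right],
\qquad V^{*} \;=\; \sup_{T} V(T).
```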
The following remarks give more details on the formulation of RMAB processes.
For the reward processes, the requirement makes the problem nontrivial because, if it fails for some , one can obtain an infinite expected reward simply by operating arm all the time.
From a practical point of view, policies satisfying for every allow the machine to idle; they are practically feasible and form a larger class than the policies defined by condition (2) in “Policies under restrictions”, which does not allow machine idling. Nevertheless, by introducing a dummy arm with constantly zero reward rate, constant filtration and the trivial feasible random time set , the setting in condition (2) can model this more realistic situation.
Conditions (1) – (3) in “Policies under restrictions” are similar to those in Kaspi and Mandelbaum (1998), whereas condition (4), which is new, captures the feature of restricted policies that the machine can operate arm at a rate strictly less than only when its operation time is in ; in other words, if , then at time the machine must be occupied by arm exclusively.
Clearly, the setting we have just formulated subsumes classical versions in discrete time, continuous time and general discrete time setting, as discussed below:
Because indicates that arm can be switched only at integer times, an integer time MAB process corresponds to an RMAB process in which for every .
In the case of a semi-Markov process, let be the state of the process and denote by , , the time instants at which makes transitions, with . Arm can be switched only at the time instants , so that
A semi-Markovian MAB corresponds to a RMAB process with every of the form in (2.2).
In this item we show how RMAB processes reduce to general discrete time MAB processes. Let be a sequence of increasing -stopping times at which arm can be stopped to switch to another arm, satisfying for all and a.s. Clearly, for this example,
Also, a general discrete time MAB corresponds to an RMAB process with every having the form in Equation (2.3). This model extends the semi-Markov model by dropping the Markovian property of the transitions. Note that this model essentially covers MAB in discrete time, because the evolution of the process in between and is irrelevant for the purpose of making decisions on stopping at those stopping times . It was discussed in Cai et al. (2014, Section 6.1) and Cowan and Katehakis (2015) in the context of their multi-armed bandit processes. The RMAB process clearly covers the general discrete time model as a special case, but not vice versa because, as just stated, the RMAB process covers the continuous time version of MAB whereas the general discrete time version does not.
If , arm is an arm in continuous time, from which one can stop at any time point (while corresponds to the optimal stopping problem in discrete time).
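The classical settings just reviewed can thus be summarized by the form of the feasible time set; writing $S_i$ for the feasible set of arm $i$ and $T_0 < T_1 < T_2 < \cdots$ for its transition or switch epochs (illustrative notation, chosen here for exposition):

```latex
\begin{aligned}
S_i &= \{0,1,2,\dots\}          && \text{(discrete time MAB)}\\
S_i &= \{T_0,T_1,T_2,\dots\}    && \text{(semi-Markovian / general discrete time MAB)}\\
S_i &= [0,\infty)               && \text{(continuous time MAB: no restriction)}
\end{aligned}
```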
Moreover, the restrictions allow one to tackle many more situations. Here is a selection of some examples, for all of which but the first the existing theory for MAB processes cannot apply.
In the case , we have , so that the arm will be operated exclusively forever once it is picked. Obviously, this corresponds to a nonpreemptable arm.
If , where is an -stopping time, so that , then the switch times of arm consist of all time points no larger than together with the integer time points larger than .
Let be a sequence of -stopping times increasing in and . Then arm can be switched from only during its private random time intervals , whereas during its private time intervals the occupation of the machine by this arm is exclusive.
One can treat MAB processes with multiple types of arms, where some arms can be switched from at any time (corresponding to the continuous time setting), while some other arms can be switched from only when they have been served for an integer amount of time (the discrete time setting) or when the state of the arm has just made a transition (the semi-Markovian setting). Some arms can even be nonpreemptable.
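A mixed system of the kind described in this item is encoded simply by assigning feasible sets of different shapes to different arms; for instance (with $\sigma$ a stopping time, all notation illustrative):

```latex
\begin{aligned}
S_1 &= [0,\infty)               && \text{(preemptable at any time)}\\
S_2 &= \{0,1,2,\dots\}          && \text{(preemptable at integer times only)}\\
S_3 &= \{0\}                    && \text{(nonpreemptable once engaged)}\\
S_4 &= \{0\}\cup[\sigma,\infty) && \text{(nonpreemptable until } \sigma\text{, freely switchable afterwards)}
\end{aligned}
```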
3 Gittins Indices for a Single-Arm Process
Having formulated RMAB processes in the last section, we now associate each arm with an appropriately defined Gittins index process, which unifies and extends the classical definitions for the discrete time, continuous time and general discrete time settings. Because only a single arm is considered in defining the associated Gittins index process and demonstrating its desired properties, the arm identifier is suppressed for the time being for notational convenience. Hence we work only with a single stochastic process that is -adapted on a filtered probability space , equipped with a quasi-left-continuous filtration satisfying the usual conditions of right continuity and augmentation by the null sets of , where . To is associated a random set representing the restricted feasibility of the stopping times, as defined in Section 2.
This section consists of two parts: Section 3.1 gives a concise review of restricted optimal stopping times, with some material taken from Bao et al. (2017) and put here for easy reference, and Section 3.2 defines the Gittins index process induced by a single arm and gives its details.
3.1 Optimal stopping times under the restrictions
The optimal stopping time problem with restrictions, denoted by , is defined as follows: for an arbitrary stopping time (not necessarily in ), find an optimal stopping time such that
where esssup stands for the operation of essential supremum, and is assumed to satisfy the following assumptions:
(1). has almost surely right continuous paths.
By Bao et al. (2017), problem (3.1) is solved by the following two theorems that are cited here for later reference. The first theorem characterizes the optimal stopping times should they exist.
The following three statements are equivalent for any :
(a) is optimal for problem (3.1), i.e., ;
(b) The stochastic process is an -martingale and ;
(c) and .
For any and stopping time , define and . The following theorem indicates the existence of the required stopping time .
If is quasi-left continuous, then
(1) is optimal for the stopping problem (3.1), that is, ,
(2) a.s., and
(3) is also quasi-left-continuous.
3.2 Gittins index process
For the instant reward rate process and an arbitrary stochastic process that is -adapted, pathwise right continuous, nonincreasing, bounded and nonnegative, introduce a discounted gain process
Note that gives the well-known gain process with retirement option , which was introduced by Whittle (1980). To any finite -stopping time , associate a class of optimal stopping problems
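In generic notation (the symbols are illustrative, not the paper's own), Whittle's discounted gain process with retirement option $M$ would read:

```latex
% Run the arm up to time t collecting discounted rewards at rate h, then
% retire and collect the lump sum M, everything discounted to time zero.
\Gamma(t, M) \;=\; \int_0^{t} e^{-\alpha s}\, h(s)\, ds \;+\; e^{-\alpha t} M ,
```

the associated stopping problem being $\sup_{\tau}\mathbb{E}[\Gamma(\tau,M)]$, with the supremum here taken over the feasible stopping times only.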
indexed by , indicating the optimal expected rewards from onwards. Then, for every fixed , the optimal stopping time theory reviewed in Section 3.1 can be translated for to:
The process is a quasi-left-continuous supermartingale.
The feasible stopping time
is an optimal solution for .
is a martingale family.
Moreover, for any finite and , write
It is then immediate that
Given a stopping time , owing to the esssup operation, even though for any couple of nonnegative numbers and it is clear that
definition (3.5) does not necessarily ensure pathwise monotonicity and convexity of in . This difficulty can be overcome by the following procedure. First, order the rationals arbitrarily as and write . Let . For , denote . Then is decreasing in and . Let be such that and, for every , is decreasing along the set . For the other (real) numbers , take as the limit of along , so that defined in this way is a decreasing and convex function of for every . That is, we obtain a version of that is pathwise decreasing and convex in almost surely. We will work throughout with this version of .
The following is a fundamental property of .
Given a stopping time , is nonincreasing and right-continuous in .
Proof. The monotonicity of follows from the fact that
so that . For the right-continuity of in , consider a decreasing sequence of real numbers. By the monotonicity above, the sequence is a nondecreasing sequence dominated by . Then there exists such that . On the other hand, thanks to the quasi-left-continuity of (implied by that of , cf. Theorem 3.2 (3)) and the fact that for any , we see that . Hence, the continuity of in implies that , which in turn implies . Consequently, , that is, .
This completes the proof.
Thanks to this lemma, with a procedure similar to Remark 3.1, we can work with the version of that is nonincreasing and right continuous in for every , so that we can speak of its pathwise inverse
and write particularly
The following lemma explains what these quantities indicate and states that is a direct extension of the classical Gittins index to the setting with restricted stopping times.
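As a reference point, in the unrestricted case ($S = [0,\infty)$) the quantity just defined should reduce to the classical continuous time Gittins index of El Karoui and Karatzas (1994), which in standard notation (not necessarily this paper's) reads:

```latex
% Classical continuous-time Gittins index at a stopping time sigma:
% the maximal expected discounted reward per unit of expected discounted time,
% over stopping times tau strictly after sigma.
G(\sigma) \;=\; \sup_{\tau > \sigma}
\frac{\mathbb{E}\!\left[\int_{\sigma}^{\tau} e^{-\alpha t}\, h(t)\, dt \,\middle|\, \mathcal{F}_{\sigma}\right]}
     {\mathbb{E}\!\left[\int_{\sigma}^{\tau} \alpha\, e^{-\alpha t}\, dt \,\middle|\, \mathcal{F}_{\sigma}\right]}
```

The restricted index would then be obtained by taking the supremum over feasible stopping times $\tau \in S$ only.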
Given , the following properties hold for the stochastic process :
(a). is -adapted.
(c). for .
Proof. (a). For any finite and , it follows that
(b). For , it is clear that
(c). Note that, by (3.2), for ,
That is, re-expressing this in terms of stopping times leads to the desired equality for and .
(d). It is obvious that for , where
The assertion in (d) thus follows from the equivalence
The proof is thus completed.
The following lemma establishes a crucial expression for by means of the right derivative of with respect to .
For any stopping time , is increasing in with right-hand derivative
As a result,
On the other hand, the relationship
which is obtained from the supermartingale property of , implies that
By (3.12) and the equality , it follows that
Noting that , it is immediate that
Due to the relationship
it follows by interchanging the integrations in (3.16) that
Thus the desired equality in (3.13) follows.
We will need to treat the case where one has an extra -algebra that is independent of the filtration . This induces a new filtration , generally called an initial enlargement (or augmentation) of by . Denote the set of all -stopping times taking values a.s. in by , and the set of those taking values in and no smaller than by . Consider the setting in which
is -adapted and
is -adapted, almost surely right continuous, and right decreasing at such time with .
Under the augmented filtration , taking the right continuous version of , we can extend the notation to any -stopping time by . Define a new optimization problem . Then it is straightforward that for any -stopping time , which states that the optimal stopping problem remains essentially unchanged under the enlargement of the domain of stopping times by initially introduced extra information, provided the additional information is independent of the original information filtration and is -adapted. The following lemma holds for any -adapted, right continuous that is right decreasing only when .
Let be an arbitrary -adapted process and be -adapted, right continuous, and right decreasing at time . Then, for any -stopping times , the inequality implies
Proof. Introduce the right continuous inverse . Then, for any , is a -stopping time because . In addition,
for any , the relationship