# Learn and Pick Right Nodes to Offload

Task offloading is a promising technology to exploit the benefits of fog computing. An effective task offloading strategy is needed to utilize the computation resources efficiently. In this paper, we endeavor to seek an online task offloading strategy to minimize the long-term latency. In particular, we formulate a stochastic programming problem, where the expectations of the system parameters change abruptly at unknown time instants. Meanwhile, we consider the fact that the queried nodes can only feed back the processing results after finishing the tasks. We then put forward an effective algorithm to solve this challenging stochastic programming under the non-stationary bandit model. We further prove that our proposed algorithm is asymptotically optimal in a non-stationary fog-enabled network. Numerical simulations are carried out to corroborate our designs.


## I Introduction

With the ever-increasing demands for intelligent services, devices such as smartphones are facing challenges in both battery life and computing power [1]. Rather than offloading computation to remote clouds, fog computing distributes computing, storage, control, and communication services along the Cloud-to-Thing continuum [3, 2].

The aforementioned task offloading schemes all assume the availability of perfect knowledge of the system parameters. However, in some cases these parameters are unknown or only partially known at the user. For example, some particular values (a.k.a. bandit feedback) are revealed only for the nodes that are queried. Specifically, the authors in [14] treated the communication delay and the computation delay of each task as a posteriori information. In [15], the mobility of each user was assumed to be unpredictable. When the number of nodes that can be queried is limited due to the finite available resources, there exists a tradeoff between exploiting the empirically best node as often as possible and exploring other nodes to find more profitable actions [16, 17]. One popular approach to balancing this tradeoff is to model the exploration-versus-exploitation dilemma as a multi-armed bandit (MAB) problem, which has been extensively studied in statistics [18].

The rest of this paper is organized as follows. Section II introduces the task offloading model and system assumptions. Section III presents one efficient algorithm and the corresponding performance guarantee. Numerical results are presented in Section IV and Section V concludes the paper.

Notations: $|\mathcal{A}|$, $\mathcal{U}(\mathcal{A})$, $\mathbb{E}[X]$, and $\mathbb{P}(E)$ stand for the cardinality of set $\mathcal{A}$, the uniform distribution on $\mathcal{A}$, the expectation of random variable $X$, and the probability of event $E$, respectively. Notation $x_n \overset{a.s.}{\longrightarrow} x$ indicates the sequence $\{x_n\}$ converges almost surely towards $x$. The indicator function $\mathbb{1}\{\cdot\}$ takes the value of $1$ when the specified condition is met ($0$ otherwise).

## II System Model

### II-A Network Model

Our goal is to minimize the long-term latency at a particular task node. In particular, the set of $K$ fog nodes can be classified as

$$\mathcal{I} := \big\{\underbrace{1, 2, \cdots, K-1}_{\text{Helper nodes}},\ \underbrace{K}_{\text{Task node}}\big\}. \tag{1}$$

In this paper, we assume the task node cannot offload tasks to a helper node when it is communicating with others. We also assume each task is generated independently and the task nodes do not cooperate with each other (the cooperation among multiple task nodes is beyond the scope of the current paper and is left for our future work).

We use $T(i)$ to represent the amount of time needed to deliver one bit of information to node-$i$. It is a distance-dependent value and can be measured before transmission (we assume different task nodes occupy pre-allocated orthogonal time or spectrum resources for the communication to the helper nodes, e.g. TDMA or FDMA; note that optimal time/spectrum reuse is itself a non-trivial research problem [21]). Denote the data length of task-$t$ by $L_t$. We also assume the task size is such that the transmission delay is no more than one time slot. Note the transmission delay is zero for a locally processed task, i.e. $T(K) = 0$.

Let $Q_t(i)$ denote the queue length of node-$i$ at the beginning of time slot-$t$. Meanwhile, we denote the time needed to process one bit waiting in the queue at node-$i$ by $W_t(i)$, and denote the time needed to process one bit of task-$t$ at node-$i$ by $P_t(i)$ once all the tasks ahead in the queue are completed. Furthermore, we treat $W_t(i)$ and $P_t(i)$ as random variables in this paper. Accordingly, the expectations are defined as:

$$\mu_t^W(i) := \mathbb{E}[W_t(i)], \qquad \mu_t^P(i) := \mathbb{E}[P_t(i)]. \tag{2}$$

We assume the total latency of each task is dominated by the delays mentioned above, i.e. the transmission delay $L_t T(i)$, the waiting delay $Q_t(i) W_t(i)$ in the queue, and the processing delay $L_t P_t(i)$. We ignore the latency introduced during the transmission of the computing results. Therefore, the total latency when allocating task-$t$ to node-$i$ can be written as follows:

$$U_t(i) := L_t T(i) + Q_t(i) W_t(i) + L_t P_t(i). \tag{3}$$
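As a quick illustration, the latency model in (3) can be sketched in a few lines of Python; the function and argument names below are ours, not from the paper:

```python
def total_latency(L_t, T_i, Q_i, W_i, P_i):
    """Total latency of allocating task-t to node-i, per Eq. (3):
    transmission delay + queue waiting delay + processing delay."""
    return L_t * T_i + Q_i * W_i + L_t * P_i

# Example: a 2-bit task, per-bit transmission delay 1, a queue of 3 bits
# with per-bit waiting delay 0.5, and per-bit processing delay 0.25.
print(total_latency(2, 1, 3, 0.5, 0.25))  # 2 + 1.5 + 0.5 = 4.0
```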

Before we proceed further, here we make the following assumptions:

• AS-1: The total latency $U_t(i)$ is unknown before the task is completed;

• AS-2: The queue length $Q_t(i)$ is broadcast by node-$i$ at the beginning of each slot and is available to all the nearby fog nodes;

• AS-3: The waiting delay and the processing delay, i.e. $W_t(i)$ and $P_t(i)$, follow unknown distributions. The corresponding expectations, i.e. $\mu_t^W(i)$ and $\mu_t^P(i)$, change abruptly at unknown time instants (a.k.a. breakpoints).

In our paper, the processing delay and the waiting delay are reported only after the corresponding task is finished. Namely, the observations of the waiting delay and the processing delay are treated as posterior information. Note these delays can be obtained via the timestamp feedback from the corresponding node after finishing task-$t$. Denoting the measured waiting time and processing time of task-$t$ by $\tau_t^W$ and $\tau_t^P$, we obtain the realizations of $W_t(i)$ and $P_t(i)$ as

$$w_t(i) = \frac{\tau_t^W}{Q_t(i)} \mathbb{1}\{I_t = i\}, \qquad p_t(i) = \frac{\tau_t^P}{L_t} \mathbb{1}\{I_t = i\}. \tag{4}$$
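The bandit feedback in (4) can be mimicked as follows (a minimal sketch with our own naming; `chosen` plays the role of the indicator $\mathbb{1}\{I_t = i\}$):

```python
def delay_realizations(tau_W, tau_P, Q_i, L_t, chosen):
    """Per-bit waiting/processing delay realizations, per Eq. (4).
    Both are zero unless node-i actually served the task (bandit feedback)."""
    ind = 1.0 if chosen else 0.0
    return (tau_W / Q_i) * ind, (tau_P / L_t) * ind
```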

### II-B Problem Formulation

The general minimization of the long-term average latency of tasks can be formulated as follows:

$$\underset{\{I_t, \forall t\}}{\text{minimize}} \ \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{K} U_t(i) \mathbb{1}\{I_t = i\} \quad \text{subject to} \ \ I_t \in \mathcal{I},\ t = 1, 2, \cdots, T, \tag{5}$$

where $I_t$ represents the index of the node selected to process task-$t$. There are two difficulties in solving the above problem. Firstly, it is a stochastic programming problem: the exact information about the latency $U_t(i)$ is not available before the $t$-th task is completed. Additionally, even if $U_t(i)$ were known a priori, this problem would still be a combinatorial optimization problem whose complexity grows exponentially with the horizon $T$. We therefore consider the per-task subproblem

$$\underset{I_t \in \mathcal{I}}{\text{minimize}}\ \sum_{i \in \mathcal{I}} \mathbb{E}[U_t(i)] \mathbb{1}\{I_t = i\}. \tag{6}$$

However, the above formulation is still a stochastic programming problem. Although the tasks offloaded previously do enable an empirical average as an estimate of the expectation $\mathbb{E}[U_t(i)]$, this information may be inaccurate due to the limited number of observations. Note the information about node-$i$ comes from the feedback of node-$i$ when it finishes the corresponding tasks. In order to get more information about one specific node, the task node has to offload more tasks to that node even though it may not be the empirically best node to offload to. Therefore, an exploration-exploitation tradeoff exists in this problem. In the following parts, we endeavor to find one efficient scheme to solve the problem in (6).

To strike a balance between the aforementioned exploration and exploitation, we model the task offloading as a non-stationary multi-armed bandit (MAB) problem [17], where each node in is regarded as one arm. When a particular task is generated, we need to determine one fog node, either one helper node or the task node, to deal with it. This corresponds to choosing one arm to play in the MAB.

Recall that the task node generates one task at the beginning of each time slot. Let $\tau_t$ be the time when the feedback of the $t$-th task is received, and let $\tau_{\max}$ be the maximum permitted latency. If $\tau_t - t > \tau_{\max}$, the task fails and is discarded. According to [17], we can estimate $\mu_t^W(i)$ and $\mu_t^P(i)$ with the discounted UCB policy as

$$\bar{W}_t(\gamma, i) := \frac{1}{N_t(\gamma, i)} \sum_{s=1}^{t} \gamma^{t - \tau_s} w_s(i) \mathbb{1}\{I_s = i, \tau_s \le t\}, \qquad \bar{P}_t(\gamma, i) := \frac{1}{N_t(\gamma, i)} \sum_{s=1}^{t} \gamma^{t - \tau_s} p_s(i) \mathbb{1}\{I_s = i, \tau_s \le t\}, \tag{7}$$

where $\gamma \in (0, 1)$ represents the discount factor, and

$$N_t(\gamma, i) := \sum_{s=1}^{t} \gamma^{t - \tau_s} \mathbb{1}\{I_s = i, \tau_s \le t\}. \tag{8}$$

Then the latency can be estimated as

$$\bar{\mu}_t(\gamma, i) := L_t T(i) + Q_t(i) \bar{W}_t(\gamma, i) + L_t \bar{P}_t(\gamma, i). \tag{9}$$

Note that the latency in (9) is estimated based on the history of $w_s(i)$ and $p_s(i)$ instead of the previous latency values $U_s(i)$. This is due to the fact that the individual latency closely depends on the queue length $Q_t(i)$ and the task length $L_t$, which may vary significantly for different types of tasks. Thus it is not trustworthy to estimate $U_t(i)$ with the previous latency values directly. On the other hand, the time needed to process one bit of a task is typically determined by the node capability, which is relatively stable and thus suitable to be estimated with the sample mean.
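For concreteness, the discounted estimators (7)-(8) can be computed from a feedback log as below; the tuple layout of `history` is our own assumption:

```python
def discounted_estimates(history, t, gamma, i):
    """Discounted counts and averages per Eqs. (7)-(8).
    history: iterable of (i_s, tau_s, w_s, p_s), where tau_s is the time
    the feedback of task-s was received."""
    N = sumW = sumP = 0.0
    for (i_s, tau_s, w_s, p_s) in history:
        if i_s == i and tau_s <= t:   # only feedback already received counts
            d = gamma ** (t - tau_s)  # discount by the age of the feedback
            N += d
            sumW += d * w_s
            sumP += d * p_s
    if N == 0.0:
        return 0.0, 0.0, 0.0          # node-i not observed yet
    return N, sumW / N, sumP / N
```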

At node-$i$, the total amount of time utilized to process task-$t$ is compared with the maximal tolerable latency $\tau_{\max}$, and the time difference is defined as a reward, i.e. $X_t(i) := \tau_{\max} - U_t(i)$. Clearly, a negative reward indicates a task failure. Based on the estimated latency in (9), the estimated reward is given by

$$\bar{X}_t(\gamma, i) := \tau_{\max} - \bar{\mu}_t(\gamma, i). \tag{10}$$

The parameters $N_t(\gamma, i)$, $\bar{W}_t(\gamma, i)$, and $\bar{P}_t(\gamma, i)$ can be updated iteratively with low complexity. Particularly, let $\mathcal{S}_t$ denote the set of indices of tasks whose feedback arrives within the interval $(t, t+1]$, and we have

$$N_{t+1}(\gamma, i) = \gamma N_t(\gamma, i) + \sum_{s \in \mathcal{S}_t} \gamma^{t+1-\tau_s} \mathbb{1}\{I_s = i\}, \tag{11}$$

$$\bar{W}_{t+1}(\gamma, i) = \frac{1}{N_{t+1}(\gamma, i)} \Big[ \gamma N_t(\gamma, i) \bar{W}_t(\gamma, i) + \sum_{s \in \mathcal{S}_t} \gamma^{t+1-\tau_s} w_s(i) \mathbb{1}\{I_s = i\} \Big], \tag{12}$$

$$\bar{P}_{t+1}(\gamma, i) = \frac{1}{N_{t+1}(\gamma, i)} \Big[ \gamma N_t(\gamma, i) \bar{P}_t(\gamma, i) + \sum_{s \in \mathcal{S}_t} \gamma^{t+1-\tau_s} p_s(i) \mathbb{1}\{I_s = i\} \Big]. \tag{13}$$
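The recursions (11)-(13) admit a direct implementation; a sketch under our own naming, where `finished` collects the feedback tuples arriving in $(t, t+1]$:

```python
def ducb_update(N, W_bar, P_bar, gamma, t, finished, i):
    """One recursive update step per Eqs. (11)-(13) for node-i."""
    N_new = gamma * N              # Eq. (11): discount the old mass
    numW = gamma * N * W_bar       # discounted old numerators
    numP = gamma * N * P_bar
    for (i_s, tau_s, w_s, p_s) in finished:
        if i_s == i:
            d = gamma ** (t + 1 - tau_s)
            N_new += d
            numW += d * w_s
            numP += d * p_s
    if N_new == 0.0:
        return 0.0, 0.0, 0.0
    return N_new, numW / N_new, numP / N_new
```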

The exploration-exploitation tradeoff is then handled by applying the UCB policies as in [17]. An upper confidence bound is constructed as $\bar{X}_t(\gamma, i) + c_t(\gamma, i)$, where $c_t(\gamma, i)$ characterizes the exploration bonus, which is defined as

$$c_t(\gamma, i) := 2 \tau_{\max} \sqrt{\frac{\xi \log n_t(\gamma)}{N_t(\gamma, i)}}, \tag{14}$$

where $\xi$ stands for an exploration constant and

$$n_t(\gamma) := \sum_{i=1}^{K} N_t(\gamma, i). \tag{15}$$

The node selected to process task-$t$ is then determined by

$$I_t = \underset{i \in \mathcal{I}}{\arg\max}\ \bar{X}_t(\gamma, i) + c_t(\gamma, i). \tag{16}$$
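Putting (14)-(16) together, the per-task decision of the discounted-UCB policy can be sketched as follows (argument names are ours; all discounted counts are assumed positive):

```python
import math

def select_node(X_bar, N, tau_max, xi):
    """Pick the node maximizing estimated reward plus exploration bonus,
    per Eqs. (14)-(16). X_bar[i] and N[i] are the estimated reward and
    discounted count of node-i."""
    n = sum(N)                                                  # Eq. (15)
    best_i, best_ucb = 0, -math.inf
    for i, (x, N_i) in enumerate(zip(X_bar, N)):
        bonus = 2 * tau_max * math.sqrt(xi * math.log(n) / N_i)  # Eq. (14)
        if x + bonus > best_ucb:
            best_i, best_ucb = i, x + bonus
    return best_i                                               # Eq. (16)

# A barely-explored node (small N_i) gets a large bonus and is probed
# even though its estimated reward is slightly lower.
print(select_node([1.0, 0.9], [10.0, 1.0], tau_max=1.0, xi=0.5))  # 1
```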

Although the above proposed task offloading model is essentially a non-stationary MAB model, there are two main differences compared with the conventional model in [17]. First, in the conventional model the feedback is obtained instantaneously upon decision making, while in our model, as indicated in (7), the feedback is not available until the task is finished. The corresponding latency should not be ignored, since it is exactly the information we need. Note that delayed feedback affects the performance analysis, as discussed in [19]. Second, the best arm is assumed to change only at the breakpoints in [17], whereas our model allows the best node to vary when processing different tasks. Therefore, the performance guarantee for the conventional discounted-UCB algorithm cannot be applied directly to our proposed TOD.

### III-B Performance Analysis

According to (2) and (3), the expected latency can be expressed as

$$\mu_t(i) := \mathbb{E}[U_t(i)] = L_t T(i) + Q_t(i) \mu_t^W(i) + L_t \mu_t^P(i). \tag{17}$$

Let $i_t^*$ denote the node with the minimal expected latency for task-$t$. We use

$$\tilde{N}_T(i) := \sum_{t=1}^{T} \mathbb{1}\{I_t = i \ne i_t^*\}$$

to denote the number of tasks offloaded to node-$i$ while it is not the best node during the first $T$ time slots. From AS-3, we know the expectations of the system parameters can change abruptly at each breakpoint. We use $\Upsilon_T$ to denote the number of breakpoints before time $T$. The following proposition provides an upper bound for $\mathbb{E}[\tilde{N}_T(i)]$.

###### Proposition 1.

Assume the discount factor $\gamma$ satisfies

$$\frac{\gamma^{\tau_{\max}} \big( 1 - \gamma^{1/(1-\gamma)} \big)}{1 - \gamma} > e.$$

For each node $i \in \mathcal{I}$, we have the following upper bound for $\mathbb{E}[\tilde{N}_T(i)]$:

$$\mathbb{E}[\tilde{N}_T(i)] \le 1 + T(1-\gamma) B(\gamma) + \Upsilon_T C(\gamma) + \frac{2}{1-\gamma}, \tag{18}$$

where

$$C(\gamma) := \log_\gamma \big( (1-\gamma)\, \xi \log n_K(\gamma) \big) + \tau_{\max},$$

$$B(\gamma) := \left( \frac{-16 \tau_{\max}^2 \xi \log\!\big[ \gamma^{\tau_{\max}} (1-\gamma) \big]}{(\Delta\mu_T(i))^2} + \tau_{\max} \right) \cdot \frac{\lceil T(1-\gamma) \rceil}{T(1-\gamma)}\, \gamma^{-\frac{1}{1-\gamma}} + \frac{2 \gamma^{\tau_{\max}} \log_\gamma \tau_{\max}}{1-\gamma},$$

$$\Delta\mu_T(i) := \min_{t \in \{1, \cdots, T\},\, i_t^* \ne i}\ \mu_t(i) - \mu_t(i_t^*).$$

Detailed proof for the above proposition can be found in the Appendix (see also the extended version, arXiv:1804.08416, https://arxiv.org/abs/1804.08416). Clearly, the upper bound depends on the total number of tasks $T$, the number of breakpoints $\Upsilon_T$, and the choice of the discount factor $\gamma$. From (18), we see the term $T(1-\gamma) B(\gamma)$ decreases as the feasible $\gamma$ increases. On the other hand, the last two terms, i.e. $\Upsilon_T C(\gamma)$ and $\frac{2}{1-\gamma}$, increase as the feasible $\gamma$ increases. This is consistent with our intuition that a higher discount factor contributes to a better estimation in the stationary case, while it results in slow reaction to abrupt changes of the environment. Therefore, there is a tradeoff between the different terms in (18). To strike a balance between the stable and the abruptly-changing environments, similar to [17], we choose $\gamma$ as

$$\gamma = 1 - (4 \tau_{\max})^{-1} \sqrt{\Upsilon_T / T}. \tag{19}$$
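The choice in (19) is a one-liner; for instance, under the illustrative (hypothetical) values $\tau_{\max} = 10$ slots, $\Upsilon_T = 4$ breakpoints, and $T = 10^4$ tasks, it yields $\gamma \approx 0.9995$:

```python
def discount_factor(tau_max, Upsilon_T, T):
    """Discount factor balancing stability and adaptivity, per Eq. (19)."""
    return 1.0 - (4.0 * tau_max) ** -1 * (Upsilon_T / T) ** 0.5
```

Fewer breakpoints relative to the horizon push $\gamma$ toward $1$ (slow forgetting), while frequent breakpoints shrink $\gamma$ for faster adaptation.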

Accordingly, we can establish the following proposition.

###### Proposition 2.

When $T \to \infty$ and $\gamma$ is chosen as in (19), the value of $\mathbb{E}[\tilde{N}_T(i)]$ is in the order of

$$\mathbb{E}[\tilde{N}_T(i)] = O\big( \sqrt{T \Upsilon_T} \log T \big).$$
###### Proof.

Let $\gamma = 1 - (4 \tau_{\max})^{-1} \sqrt{\Upsilon_T / T}$; then the three terms in (18), i.e. $T(1-\gamma) B(\gamma)$, $\Upsilon_T C(\gamma)$, and $\frac{2}{1-\gamma}$, are in the order of $O(\sqrt{T \Upsilon_T} \log T)$, $O(\sqrt{T \Upsilon_T} \log T)$, and $O(\sqrt{T / \Upsilon_T})$, respectively. Thus $\mathbb{E}[\tilde{N}_T(i)]$ is in the order of $O(\sqrt{T \Upsilon_T} \log T)$. ∎

To show the optimality of our proposed Algorithm 1, we define the pseudo-regret in offloading the first $T$ tasks as [20]

$$\zeta_T := \frac{1}{T} \mathbb{E}\left[ \sum_{t=1}^{T} \big( \mu_t(I_t) - \mu_t(i_t^*) \big) \right]. \tag{20}$$

We have the following result regarding the pseudo-regret $\zeta_T$.

###### Proposition 3.

When $\Upsilon_T = o(T / \log^2 T)$, the proposed approach in Algorithm 1 is asymptotically optimal in the sense that $\zeta_T \overset{a.s.}{\longrightarrow} 0$.

###### Proof.

Note $\mu_t(I_t) - \mu_t(i_t^*) \le \tau_{\max}$. We have

$$\zeta_T \le \frac{1}{T} \mathbb{E}\left[ \sum_{t=1}^{T} \tau_{\max} \mathbb{1}\{I_t \ne i_t^*\} \right] \le \frac{\tau_{\max}}{T} \sum_{i \in \mathcal{I}} \mathbb{E}[\tilde{N}_T(i)]. \tag{21}$$

According to Proposition 1 and Proposition 2, we obtain

$$\zeta_T = O\big( \sqrt{\Upsilon_T / T}\, \log T \big) \longrightarrow 0 \quad \text{as } T \to \infty. \tag{22}$$

Then for any $\varepsilon > 0$, there exists a finite integer $N_\varepsilon$ such that

$$\mathbb{P}(|\zeta_T| \ge \varepsilon) = 0, \quad \forall T \ge N_\varepsilon. \tag{23}$$

Therefore,

$$\sum_{T=1}^{\infty} \mathbb{P}(|\zeta_T| \ge \varepsilon) \le N_\varepsilon < \infty. \tag{24}$$

By the Borel–Cantelli lemma, the above inequality indicates $\zeta_T \overset{a.s.}{\longrightarrow} 0$. ∎

## IV Numerical Results

The simulation settings are as follows:

• The network consists of one task node and $K-1$ helper nodes;
• Each time slot has a fixed duration (in ms), and the data size $L_t$ follows a fixed distribution (in KB);
• The maximal latency $\tau_{\max}$ is a fixed number of slots;
• The delay of processing one bit of task-$t$ at node-$i$ is simulated as the product of two random variables: one characterizes the complexity of task-$t$, and the other reflects the CPU capability of node-$i$;
• The CPU capability of node-$i$ is scaled up or down at each breakpoint.

We compare the performance of TOD with two other schemes, i.e. Greedy and Round-Robin. In the greedy scheme, we assume full information of every realization and offload the task to the node achieving minimal latency in each time slot. Note that the greedy scheme is not causal and cannot be applied in practice. In the round-robin scheme, each task is offloaded to the fog nodes in a cyclic way with equal chances.
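The two baselines can be reproduced in a few lines (a sketch under our own naming; `latencies[t][i]` is assumed to hold the realized latency of slot-$t$ on node-$i$):

```python
import itertools

def round_robin(num_nodes, num_tasks):
    """Round-Robin baseline: offload tasks cyclically with equal shares."""
    cycle = itertools.cycle(range(num_nodes))
    return [next(cycle) for _ in range(num_tasks)]

def greedy(latencies):
    """Non-causal Greedy oracle: per slot, pick the node whose realized
    latency is minimal (requires full information, so not practical)."""
    return [min(range(len(row)), key=row.__getitem__) for row in latencies]

print(round_robin(3, 6))         # [0, 1, 2, 0, 1, 2]
print(greedy([[3, 1], [2, 5]]))  # [1, 0]
```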

In Fig. 2, with different numbers of breakpoints $\Upsilon_T$, we demonstrate the effectiveness and robustness of TOD by showing the cumulative distribution functions (CDFs) of the latency of the processed tasks under different schemes. TOD-Opt and TOD-Cal in Fig. 2 represent two different criteria to choose $\gamma$ in TOD: the discount factor in the former is searched over $(0, 1)$ to achieve the minimal average latency, while in the latter it is calculated following (19). Both the left and the right parts of Fig. 2 show the proposed TOD algorithm performs much better than the round-robin scheme, and performs close to the greedy method, which achieves the minimal realization of latency in each round. Additionally, we can learn from Fig. 2 that the $\gamma$ calculated following (19) performs as well as the optimal one. In Fig. 2(a), breakpoints occur frequently, which indicates TOD is able to learn the system under frequent changes of the parameter distributions. It is also worth noting that, in Fig. 2(b), TOD achieves even lower average latency than the greedy scheme in the case of limited abrupt changes. This phenomenon reveals that the decision with minimal latency in each time slot may not be the global optimum of (5). It also corroborates our previous analysis that every choice affects the future state of the node, and further affects the following offloading decisions.

Fig. 3 presents the ratio of the number of successfully processed tasks to the total number of tasks. A task is successful if its latency is less than $\tau_{\max}$. Although the success ratios of TOD are lower than those of the greedy scheme due to the exploration of nodes, they tend to approach the greedy scheme as time goes on.

Fig. 4 depicts the regrets of different schemes. Note the regret is computed against the optimal realization (the greedy method), while the pseudo-regret is computed against the optimal expectation. In the IIR scheme, we separate exploration and exploitation into two phases: in the exploration phase, the round-robin method is adopted; in the exploitation phase, we maximize the estimated reward defined in (10), which is effectively an estimate based on an infinite impulse response (IIR) filter. The ratio between the two phases is searched to achieve the minimal regret. It can be observed that, in the sense of either the regret or the pseudo-regret, the proposed TOD algorithm achieves much lower regrets than the round-robin scheme and the IIR scheme. This shows that our proposed method deals well with the exploration-exploitation tradeoff. Besides, when the exploration bonus in (14) is disabled, the TOD performance deteriorates considerably, which further indicates the importance of the exploration bonus.

## References

• [1] H. T. Dinh, C. Lee, D. Niyato, and P. Wang, “A survey of mobile cloud computing: Architecture, applications, and approaches,” Wireless Commun. Mobile Comput., vol. 13, no. 18, pp. 1587–1611, Dec. 2013.
• [2] M. Chiang and T. Zhang, “Fog and IoT: An overview of research opportunities,” IEEE Internet Things J., vol. 3, no. 6, pp. 854–864, Dec. 2016.
• [3] M. Satyanarayanan, P. Bahl, R. Caceres, and N. Davies, “The case for VM-based cloudlets in mobile computing,” IEEE Pervasive Comput., vol. 8, no. 4, pp. 14–23, Oct. 2009.
• [4] M. V. Barbera, S. Kosta, A. Mei, and J. Stefa, “To offload or not to offload? The bandwidth and energy costs of mobile cloud computing,” in Proc. IEEE INFOCOM, Turin, Italy, Apr. 2013, pp. 1285–1293.
• [5] S. Kosta, A. Aucinas, P. Hui, R. Mortier, and X. Zhang, “ThinkAir: Dynamic resource allocation and parallel execution in the cloud for mobile code offloading,” in Proc. IEEE INFOCOM, Orlando, FL, USA, Mar. 2012, pp. 945–953.
• [6] Y. Yang, K. Wang, G. Zhang, X. Chen, X. Luo, and M. Zhou, “MEETS: Maximal energy efficient task scheduling in homogeneous fog networks,” submitted to IEEE Internet Things J., 2017.
• [7] T. Q. Dinh, J. Tang, Q. D. La, and T. Q. S. Quek, “Offloading in mobile edge computing: Task allocation and computational frequency scaling,” IEEE Trans. on Commun., vol. 65, no. 8, pp. 3571–3584, Aug. 2017.
• [8] C. You, K. Huang, H. Chae, and B.-H. Kim, “Energy-efficient resource allocation for mobile-edge computation offloading,” IEEE Trans. Wireless Commun., vol. 16, no. 3, pp. 1397–1411, Mar. 2017.
• [9] J. Kwak, Y. Kim, J. Lee, and S. Chong, “DREAM: Dynamic resource and task allocation for energy minimization in mobile cloud systems,” IEEE J. Sel. Areas Commun., vol. 33, no. 12, pp. 2510–2523, Dec. 2015.
• [10] Y. Mao, J. Zhang, S. H. Song, and K. B. Letaief, “Stochastic joint radio and computational resource management for multi-user mobile-edge computing systems,” IEEE Trans. Wireless Commun., vol. 16, no. 9, pp. 5994–6009, Sept. 2017.
• [11] Y. Yang, S. Zhao, W. Zhang, Y. Chen, X. Luo, and J. Wang, “DEBTS: Delay energy balanced task scheduling in homogeneous fog networks,” IEEE Internet Things J., in press.
• [12] L. Pu, X. Chen, J. Xu, and X. Fu, “D2D fogging: An energy-efficient and incentive-aware task offloading framework via network-assisted D2D collaboration,” IEEE J. Sel. Areas Commun., vol. 34, no. 12, pp. 3887–3901, Dec. 2016.
• [13] X. Chen, “Decentralized computation offloading game for mobile cloud computing,” IEEE Trans. Parallel Distrib. Syst., vol. 26, no. 4, pp. 974–983, Apr. 2015.
• [14] T. Chen and G. B. Giannakis, “Bandit convex optimization for scalable and dynamic IoT management”, arXiv preprint arXiv:1707.09060, 2017.
• [15] C. Tekin and M. van der Schaar, “An experts learning approach to mobile service offloading,” in Proc. Annu. Allerton Conf. Commun., Control, Comput., 2014, pp. 643–650.
• [16] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multiarmed bandit problem,” Mach. Learn., vol. 47, no. 2, pp. 235–256, May 2002.
• [17] A. Garivier and E. Moulines, “On upper-confidence bound policies for switching bandit problems,” in Proc. Int. Conf. Algorithmic Learn. Theory, Espoo, Finland, Oct. 2011, pp. 174–188.
• [18] D. A. Berry and B. Fristedt, Bandit Problems: Sequential Allocation of Experiments. London, U.K.: Chapman & Hall, 1985.
• [19] P. Joulani, A. Gyorgy, and C. Szepesvari, “Online learning under delayed feedback,” in Proc. Int. Conf. Mach. Learn., Atlanta, GA, USA, Jun. 2013, pp. 1453–1461.
• [20] S. Bubeck and N. Cesa-Bianchi, “Regret analysis of stochastic and nonstochastic multi-armed bandit problems,” Found. Trends Mach. Learn., vol. 5, no. 1, pp. 1–122, 2012.
• [21] Z. Zhu, S. Jin, Y. Yang, H. Hu, and X. Luo, “Time reusing in D2D-enabled cooperative networks,” IEEE Trans. Wireless Commun., in press.

## VI Appendix

###### Proof of Proposition 1.

According to the definition of $\tilde{N}_T(i)$, it can be decomposed as

$$\tilde{N}_T(i) \le 1 + \sum_{t=K+1}^{T} \mathbb{1}\{I_t = i \ne i_t^*,\ N_t(\gamma, i) < A(\gamma, i)\} + \sum_{t=K+1}^{T} \mathbb{1}\{I_t = i \ne i_t^*,\ N_t(\gamma, i) \ge A(\gamma, i)\}, \tag{25}$$

where $A(\gamma, i)$ is a particular function with respect to $\gamma$ to be specified later. The number of missing feedbacks when task-$t$ is offloaded can be defined as

$$G_t(i) := \sum_{s=1}^{t-1} \mathbb{1}\{I_s = i\} - N_t(1, i). \tag{26}$$

Clearly, the number of missing feedbacks is no larger than $\tau_{\max}$, i.e. $G_t(i) \le \tau_{\max}$. According to Lemma 1 in [17], for any positive integer $m$ and window length $\tau$, the following inequality is derived:

$$\sum_{t=K+1}^{T} \mathbb{1}\Big\{ I_t = i,\ \sum_{s=t-\tau}^{t-1} \mathbb{1}\{I_s = i\} < m \Big\} \le \lceil T/\tau \rceil\, m. \tag{27}$$

Due to the fact that

$$N_t(\gamma, i) + \tau_{\max} \ge \gamma^{\tau} \sum_{s=t-\tau}^{t-1} \mathbb{1}\{I_s = i\}, \tag{28}$$

we have

$$\sum_{t=K+1}^{T} \mathbb{1}\{ I_t = i,\ N_t(\gamma, i) + \tau_{\max} < \gamma^{\tau} m \} \le \lceil T/\tau \rceil\, m. \tag{29}$$

Let $m = \gamma^{-\tau} \big( A(\gamma, i) + \tau_{\max} \big)$; we have

$$\sum_{t=K+1}^{T} \mathbb{1}\{ I_t = i \ne i_t^*,\ N_t(\gamma, i) < A(\gamma, i) \} \le \lceil T/\tau \rceil\, \gamma^{-\tau} \big( A(\gamma, i) + \tau_{\max} \big). \tag{30}$$

Recall that $\Upsilon_T$ denotes the number of breakpoints before time $T$, and let $\mathcal{T}(\gamma)$ denote the set of "well offloaded" tasks. Mathematically, these tasks are defined as follows:

$$\mathcal{T}(\gamma) := \big\{ t \mid t \in \{K+1, \cdots, T\};\ \mu_s(j) = \mu_t(j),\ \forall s \in (t - C(\gamma), t),\ \forall j \in \mathcal{I} \big\}, \tag{31}$$

where the complement $\{K+1, \cdots, T\} \setminus \mathcal{T}(\gamma)$, whose cardinality is at most $\Upsilon_T C(\gamma)$, collects the tasks whose delays are poorly estimated. Because of this, the D-UCB policy may not offload these tasks to the optimal node, which leads to the following bound:

$$\sum_{t=K+1}^{T} \mathbb{1}\{ I_t = i \ne i_t^*,\ N_t(\gamma, i) \ge A(\gamma, i) \} \le \Upsilon_T C(\gamma) + \sum_{t \in \mathcal{T}(\gamma)} \mathbb{1}\{ I_t = i \ne i_t^*,\ N_t(\gamma, i) \ge A(\gamma, i) \}. \tag{32}$$

Next, we need to upper-bound the last term in (32). Since node-$i$ is selected by the index policy (16) only when its upper confidence bound is the largest, the standard D-UCB argument [17] yields the following inclusion:

$$\{ I_t = i \ne i_t^*,\ N_t(\gamma, i) \ge A(\gamma, i) \} \subseteq \{ \bar{\mu}_t(\gamma, i_t^*) - c_t(\gamma, i_t^*) \ge \mu_t(i_t^*) \} \cup \{ \bar{\mu}_t(\gamma, i) + c_t(\gamma, i) < \mu_t(i) \} \cup \{ \mu_t(i) - \mu_t(i_t^*) < 2 c_t(\gamma, i),\ N_t(\gamma, i) \ge A(\gamma, i) \}. \tag{33}$$

Namely, when node-$i$ has been tested enough times by the task node, the event $\{I_t = i \ne i_t^*\}$ occurs only under three circumstances: i) the delay of the optimal node $i_t^*$ is substantially overestimated; ii) the delay of node-$i$ is substantially underestimated; iii) the two delay expectations, i.e. $\mu_t(i)$ and $\mu_t(i_t^*)$, are close enough.

However, if $A(\gamma, i)$ is chosen appropriately, the third event never occurs. Denote the minimal difference between the expected delay of node-$i$ and the expected delay of the best node-$i_t^*$ by $\Delta\mu_T(i)$, i.e.

$$\Delta\mu_T(i) := \min_{t \in \{1, \cdots, T\},\, i_t^* \ne i}\ \mu_t(i) - \mu_t(i_t^*). \tag{34}$$

Let $A(\gamma, i) := \frac{16 \tau_{\max}^2 \xi \log n_{t^*}(\gamma)}{(\Delta\mu_T(i))^2}$, where $t^* := \arg\max_{t \le T} n_t(\gamma)$. Recalling $N_t(\gamma, i) \ge A(\gamma, i)$, we have

$$\frac{\Delta\mu_T(i)}{2} = 2 \tau_{\max} \sqrt{\frac{\xi \log n_{t^*}(\gamma)}{A(\gamma, i)}} \ge c_t(\gamma, i). \tag{35}$$

However, if the third event in (33) occurred, from the definition of $\Delta\mu_T(i)$ we would obtain

$$\frac{\Delta\mu_T(i)}{2} \le \frac{\mu_t(i) - \mu_t(i_t^*)}{2} < c_t(\gamma, i), \tag{36}$$

which contradicts (35). Thus the events $\{\mu_t(i) - \mu_t(i_t^*) < 2 c_t(\gamma, i)\}$ and $\{N_t(\gamma, i) \ge A(\gamma, i)\}$ never occur simultaneously, which indicates that we only need to upper-bound the probabilities of the first two events in (33). Define $M_t(\gamma, i)$ as

$$M_t(\gamma, i) := \sum_{s=1}^{t} \gamma^{t - \tau_s} m_t(s, i) \mathbb{1}\{I_s = i, \tau_s \le t\}, \tag{37}$$

where $m_t(s, i)$ denotes the expectation of the corresponding delay observation; then, for $t \in \mathcal{T}(\gamma)$,

$$\left| \frac{M_t(\gamma, i)}{N_t(\gamma, i)} - \mu_t(i) \right| \le \tau_{\max} \min\left( 1,\ \frac{\gamma^{C(\gamma) - \tau_{\max}}}{(1-\gamma) N_t(\gamma, i)} \right). \tag{38}$$

Combining with the following two facts:

$$\left| \frac{M_t(\gamma, i)}{N_t(\gamma, i)} - \mu_t(i) \right| \le \tau_{\max}, \qquad \min(1, x) \le \sqrt{x},\ \forall x \ge 0, \tag{39}$$

we obtain

$$\left| \frac{M_t(\gamma, i)}{N_t(\gamma, i)} - \mu_t(i) \right| \le \tau_{\max} \sqrt{\frac{\gamma^{C(\gamma) - \tau_{\max}}}{(1-\gamma) N_t(\gamma, i)}}. \tag{40}$$

Choosing

$$C(\gamma) := \log_\gamma \big( (1-\gamma)\, \xi \log n_K(\gamma) \big) + \tau_{\max}, \tag{41}$$

the inequality in (40) becomes

$$\left| \frac{M_t(\gamma, i)}{N_t(\gamma, i)} - \mu_t(i) \right| \le \frac{1}{2} c_t(\gamma, i). \tag{42}$$

Defining $Y_t(\gamma, i) := N_t(\gamma, i)\, \bar{\mu}_t(\gamma, i)$, the following chain of inequalities can be deduced:

$$\begin{aligned} \mathbb{P}\big( \mu_t(i) - \bar{\mu}_t(\gamma, i) > c_t(\gamma, i) \big) &\le \mathbb{P}\left( \mu_t(i) - \bar{\mu}_t(\gamma, i) > \frac{1}{2} c_t(\gamma, i) + \left| \frac{M_t(\gamma, i)}{N_t(\gamma, i)} - \mu_t(i) \right| \right) \\ &\le \mathbb{P}\left( \frac{M_t(\gamma, i)}{N_t(\gamma, i)} - \bar{\mu}_t(\gamma, i) > \tau_{\max} \sqrt{\frac{\xi \log n_t(\gamma)}{N_t(\gamma, i)}} \right) \\ &= \mathbb{P}\left( \frac{M_t(\gamma, i) - Y_t(\gamma, i)}{\sqrt{N_t(\gamma^2, i)}} > \tau_{\max} \sqrt{\frac{\xi N_t(\gamma, i) \log n_t(\gamma)}{N_t(\gamma^2, i)}} \right) \\ &\le \mathbb{P}\left( \frac{M_t(\gamma, i) - Y_t(\gamma, i)}{\sqrt{N_t(\gamma^2, i)}} > \tau_{\max} \sqrt{\xi \log n_t(\gamma)} \right) \\ &\overset{(a)}{\le} \left\lceil \frac{\log n_t(\gamma)}{\log(1+\eta)} \right\rceil \exp\left( -2 \xi \log n_t(\gamma) \Big( 1 - \frac{\eta^2}{16} \Big) \right), \end{aligned} \tag{43}$$

where $(a)$ holds due to Theorem 4 in [17], with $\eta > 0$ a free parameter. Choosing $\eta$ such that $2 \xi \big( 1 - \frac{\eta^2}{16} \big) = 1$, we further obtain:

$$\mathbb{P}\big( \mu_t(i) - \bar{\mu}_t(\gamma, i) > c_t(\gamma, i) \big) \le \left\lceil \frac{\log n_t(\gamma)}{\log(1+\eta)} \right\rceil n_t(\gamma)^{-1}. \tag{44}$$

Till now, the expectation of $\tilde{N}_T(i)$ can be upper-bounded as

$$\mathbb{E}[\tilde{N}_T(i)] \le 1 + \lceil T/\tau \rceil\, \gamma^{-\tau} \big( A(\gamma, i) + \tau_{\max} \big) + \Upsilon_T C(\gamma) + 2 \sum_{t \in \mathcal{T}(\gamma)} \left\lceil \frac{\log n_t(\gamma)}{\log(1+\eta)} \right\rceil n_t(\gamma)^{-1}. \tag{45}$$

Assuming

$$\sum_{s=1}^{\tau} \gamma^{\tau - s + \tau_{\max}} = \frac{\gamma^{\tau_{\max}} (1 - \gamma^{\tau})}{1 - \gamma} > e, \qquad \tau = (1-\gamma)^{-1}, \tag{46}$$

we have

$$n_t(\gamma) \ge \frac{\gamma^{\tau_{\max}} (1 - \gamma^{\tau})}{1 - \gamma} =: \tilde{n}(\gamma), \quad \forall t \ge \tau, \tag{47}$$

and

$$\left\lceil \frac{\log n_t(\gamma)}{\log(1+\eta)} \right\rceil n_t(\gamma)^{-1} \le \left\lceil \frac{\log \tilde{n}(\gamma)}{\log(1+\eta)} \right\rceil \tilde{n}(\gamma)^{-1}, \quad \forall t \ge \tau. \tag{48}$$

Then the following inequality holds:

$$\sum_{t \in \mathcal{T}(\gamma)} \left\lceil \frac{\log n_t(\gamma)}{\log(1+\eta)} \right\rceil n_t(\gamma)^{-1} \le \tau - K + \sum_{t=\tau}^{T} \left\lceil \frac{\log \tilde{n}(\gamma)}{\log(1+\eta)} \right\rceil \tilde{n}(\gamma)^{-1}.$$