In recent years there has been growing research interest in the age of information. The age of information is motivated by a variety of network applications requiring timely information. Examples range from information updates for network users, e.g., live traffic, transportation, air quality, and weather, to status updates for smart systems, e.g., smart home systems, smart transportation systems, and smart grid systems.
Fig. 1 shows an example network, where network users are running applications that need some timely information (e.g., a user needs both traffic and transportation information for planning the best route), while at some epochs, snapshots of the information are generated at the sources and sent to the users in the form of packets over wired or wireless networks. The users are being updated and keep the latest information only. Since the information at the end-users is expected to be as timely as possible, the age of information is proposed to capture the freshness of the information at the end-users; more precisely, it measures the elapsed time since the generation of the information. In addition to the timely information for the network users, the smart systems also need timely status (e.g., locations and velocities in smart transportation systems) to accomplish some tasks (e.g., collision-free smart transportation systems). As such, the age of information is a good metric to evaluate these networks supporting age-sensitive applications.
Next, we characterize the age-sensitive networks in two aspects. First, while packet delay is usually referred to as the elapsed time from the generation to its delivery, the age of information includes not only the packet delay but also the inter-delivery time, because the age of information keeps increasing until the information at the end-users is updated. We hence need to jointly consider the two parameters so as to design an age-optimal network. Moreover, while traditional relays (i.e., intermediate nodes) need buffers to keep all packets that are not served yet, the relays in the network of Fig. 1 for timely information at most store the latest information and discard out-of-date packets. The buffers for minimizing the age here are no longer as useful as those in traditional relay networks.
In this paper, we consider a wireless broadcast network, where a base-station (BS) is updating many network users on timely information. The new information is randomly generated at its source. We assume that the BS can serve at most one user for each transmission opportunity. Under the transmission constraint, a transmission scheduling algorithm manages how the channel resources are allocated for each time, depending on the packet arrivals at the BS and the ages of the information at the end-users. The scheduling design is a critical issue to optimize network performance. In this paper we hence develop scheduling algorithms for minimizing the long-run average age.
We study the age-optimal scheduling problem in the wireless broadcast network without the buffers at the BS. Our main contributions lie in designing novel scheduling algorithms and analyzing their age-optimality. For the case when the arrival statistics are available at the BS as prior information, we develop two offline scheduling algorithms, leveraging a Markov decision process (MDP) and the Whittle index. However, the MDP and the Whittle index in our problem are difficult to analyze as they involve long-run average cost optimization problems with infinite state spaces and unbounded immediate costs. Moreover, it is in general very challenging to obtain the Whittle index in closed form. By investigating some structural results, we not only resolve these issues but also simplify the calculation of the Whittle index. It turns out that our index scheduling algorithm is very simple. When the arrival statistics are unknown, we develop online versions of the two offline algorithms. We show that both offline and online MDP-based scheduling algorithms are asymptotically age-optimal, and the index scheduling algorithm is age-optimal when the information arrival rates for all users are the same. Finally, we compare these algorithms via extensive computer simulations, and further investigate the impact of the buffers storing the latest information.
1.2 Related works
The general idea of age was proposed in  to study how to refresh a local copy of an autonomous information source to keep the local copy up-to-date. The age defined in  is associated with discrete events at the information source, where the age is zero until the source is updated. Differently, the age of information in  measures the age of a sample of continuous events; therefore, the sample becomes old immediately after it is generated. Many previous works, e.g., [3, 6, 7, 8, 9, 10], studied the age of information for a single link. The papers [3, 6] considered buffers to store all unserved packets (i.e., out-of-date packets are also stored) and analyzed the long-run average age, based on various queueing models. They showed that neither the throughput-optimal sampling rate nor the delay-optimal sampling rate can minimize the average age. The paper  considered a smart update and showed that the always-update scheme might not minimize the average age. Moreover, [8, 9] developed power-efficient updating algorithms for minimizing the average age. The model in  considered no buffer or a buffer to store the latest information.
The most relevant works on scheduling multiple users are [11, 12, 13, 14, 15]. The works [11, 12, 13] considered buffers at a BS to store all out-of-date packets. The paper  considered a buffer to store the latest information with periodic arrivals, while information updates in  can be generated at will. In contrast, our work is the first to develop both offline and online scheduling algorithms for random information arrivals, with the purpose of minimizing the long-run average age.
2 System overview
2.1 Network model
We consider a wireless broadcast network in Fig. 2 consisting of a base-station (BS) and $N$ wireless users $u_1, \dots, u_N$. Each user $u_i$ is interested in timely information generated by a source $s_i$, while the information is transmitted through the BS in the form of packets. We consider a discrete-time system with slot $t = 0, 1, \dots$. Packets from the sources (if any) arrive at the BS at the beginning of each slot. The packet arrivals at the BS for different users are independent of each other and also independent and identically distributed (i.i.d.) over slots, following a Bernoulli distribution. Precisely, by $\Lambda_i(t)$, we indicate if a packet from source $s_i$ arrives at the BS in slot $t$, where $\Lambda_i(t) = 1$ if there is a packet; otherwise, $\Lambda_i(t) = 0$. We denote the probability $P[\Lambda_i(t) = 1]$ by $p_i$.
Suppose that the BS can successfully transmit at most one packet during each slot, i.e., the BS can update at most one user in each slot. By $D(t) \in \{0, 1, \dots, N\}$ we denote a decision of the BS in slot $t$, where $D(t) = 0$ if the BS does not transmit any packet and $D(t) = i$ for $i = 1, \dots, N$ if user $u_i$ is scheduled to be updated in slot $t$.
In this paper we focus on the scenario without deploying any buffer at the BS, where an arriving packet is discarded if it is not transmitted in its arriving slot. The no-buffer network is not only simple to implement in practical systems, but was also shown to achieve good performance in a single link (see ). In Section 6, we will numerically study networks with buffers in general.
2.2 Age of information model
We initialize the ages of all arriving packets at the BS to be zero. The age of information at a user becomes one on receiving a new packet, due to one slot of the transmission time. Let $X_i(t)$ be the age of information at user $u_i$ in slot $t$ before the BS makes a scheduling decision. Suppose that the age of information at a user increases linearly with slots if the user is not updated. Then, the dynamics of the age of information for user $u_i$ is
$$X_i(t+1) = \begin{cases} 1 & \text{if } \Lambda_i(t) = 1 \text{ and } D(t) = i; \\ X_i(t) + 1 & \text{otherwise,} \end{cases} \tag{1}$$
where the age of information in the next slot is one if the user gets updated on the new information (a packet for user $u_i$ arrives in slot $t$ and the BS schedules that user); otherwise, the age increases by one. Let $\mathbf{X}(t) = (X_1(t), \dots, X_N(t))$ be the age vector in slot $t$.
Let $\mathcal{X}$ be the set consisting of all age vectors $(x_1, \dots, x_N)$ where the ages satisfy $x_i \ge 1$ for all $i$ and $x_i \ne x_j$ for all $i \ne j$. Since the BS can update at most one user in each slot, if an initial age vector is outside $\mathcal{X}$, then eventually the age vector will enter $\mathcal{X}$ and stay in $\mathcal{X}$ onwards; otherwise, some user is never updated and its age approaches infinity. In other words, an age vector outside $\mathcal{X}$ is transient. Without loss of generality, we assume that the initial age vector is within $\mathcal{X}$. Later, in the proof of Lemma 6, we will show that any transmission decision before the age vector enters $\mathcal{X}$ will not affect the minimum average age (defined in the next section).
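As an illustration of the age dynamics and of why the ages become distinct once every user has been updated, the following minimal Python sketch applies the one-slot update rule. The function name `step_ages` and the list-based encoding (decision 0 for idle, a 1-based user index otherwise) are assumptions made for this example, not notation from the paper.

```python
from typing import List


def step_ages(ages: List[int], arrivals: List[int], decision: int) -> List[int]:
    """Advance all ages by one slot.

    decision == 0 means the BS idles; decision == i (1-based) means user i is
    scheduled. A user is updated only if it is scheduled AND a packet arrived.
    """
    nxt = []
    for i, (x, lam) in enumerate(zip(ages, arrivals), start=1):
        updated = decision == i and lam == 1
        nxt.append(1 if updated else x + 1)
    return nxt


# Start from a vector with two equal ages and update each user once.
ages = [3, 3, 5]
ages = step_ages(ages, arrivals=[1, 0, 1], decision=1)  # user 1 updated
ages = step_ages(ages, arrivals=[0, 1, 0], decision=2)  # user 2 updated
ages = step_ages(ages, arrivals=[0, 1, 1], decision=3)  # user 3 updated
print(ages)
```

Since at most one user is updated per slot, the last update slots of different users differ, which is why the ages end up pairwise distinct in this run.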
2.3 Problem formulation
A scheduling algorithm $\pi$ specifies a transmission decision for each slot. We define the average age under scheduling algorithm $\pi$ by
$$\limsup_{T \to \infty} \frac{1}{T+1} E_\pi\left[\sum_{t=0}^{T} \sum_{i=1}^{N} X_i(t)\right],$$
where $E_\pi$ represents the conditional expectation, given that scheduling algorithm $\pi$ is employed. We remark that this paper focuses on the total age of the users to keep the results clean, although our design and analysis also apply to the weighted sum of the ages. Our goal is to develop age-optimal scheduling algorithms, defined below.
A scheduling algorithm is age-optimal if it minimizes the average age.
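The average age of a given scheduling rule can be estimated empirically by simulating the no-buffer model over a finite horizon. The sketch below is only an illustration under stated assumptions: the names `average_age`, `always_idle`, and `oldest_arrival` are hypothetical, the initial ages are chosen to be distinct, and the finite-horizon average merely approximates the long-run limit.

```python
import random
from typing import Callable, List, Sequence

Policy = Callable[[List[int], List[int]], int]


def average_age(policy: Policy, p: Sequence[float], horizon: int, seed: int = 0) -> float:
    """Empirical average of the total age over `horizon` slots."""
    rng = random.Random(seed)
    n = len(p)
    ages = list(range(1, n + 1))  # distinct initial ages (assumed starting point)
    total = 0
    for _ in range(horizon):
        total += sum(ages)  # total age observed at the start of this slot
        arrivals = [1 if rng.random() < pi else 0 for pi in p]
        d = policy(ages, arrivals)  # 0 = idle, i = schedule user i (1-based)
        ages = [1 if (d == i + 1 and arrivals[i] == 1) else ages[i] + 1
                for i in range(n)]
    return total / horizon


def always_idle(ages: List[int], arrivals: List[int]) -> int:
    """Baseline that never transmits."""
    return 0


def oldest_arrival(ages: List[int], arrivals: List[int]) -> int:
    """Serve the arriving packet whose user currently has the largest age."""
    candidates = [i for i, lam in enumerate(arrivals) if lam == 1]
    if not candidates:
        return 0
    return max(candidates, key=lambda i: ages[i]) + 1


print(average_age(oldest_arrival, [0.5, 0.5], horizon=2000))
print(average_age(always_idle, [0.5, 0.5], horizon=2000))
```

Comparing the two printed values shows how much a simple scheduling rule lowers the average age relative to never updating.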
In this paper, we will develop two offline scheduling algorithms and two online scheduling algorithms. Leveraging Markov decision process (MDP) techniques and Whittle’s methodology, we develop two offline scheduling algorithms in Sections 3 and 4, respectively, when the arrival statistics are available to the BS; later, two online versions of the offline algorithms are proposed in Section 5 for the case when the arrival statistics are unknown to the BS.
3 A structural MDP scheduling algorithm
Our first scheduling algorithm is driven by the MDP techniques. To that end, we formulate our problem as an MDP with the components  below.
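States: We define the state $S(t)$ of the MDP in slot $t$ by $S(t) = (X_1(t), \dots, X_N(t), \Lambda_1(t), \dots, \Lambda_N(t))$. Let $\mathbf{S}$ be the state space consisting of all states $(x_1, \dots, x_N, \lambda_1, \dots, \lambda_N)$ where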
or for all ;
for all .
The state space includes some transient age vectors. This is needed to fit the truncated states in Section 3.2. We will show later in Lemma 6 that adding these transient states will not change the minimum average age. Note that the state space is a countably infinite set because the ages are possibly unbounded.
Actions: We define the action of the MDP in slot $t$ to be the scheduling decision $D(t)$. Note that the action space is finite.
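Transition probabilities: By $P_{s,s'}(d)$ we denote the transition probability of the MDP from state $s = (x_1, \dots, x_N, \lambda_1, \dots, \lambda_N)$ to state $s' = (x'_1, \dots, x'_N, \lambda'_1, \dots, \lambda'_N)$ under action $D(t) = d$. According to the age dynamics in Eq. (1) and the i.i.d. assumption of the arrivals, we can describe the non-zero $P_{s,s'}(d)$ as
$$P_{s,s'}(d) = \prod_{i: \lambda'_i = 1} p_i \prod_{i: \lambda'_i = 0} (1 - p_i),$$
if $x'_i = (x_i + 1) - x_i 1_{\{i = d \text{ and } \lambda_i = 1\}}$ for all $i = 1, \dots, N$, where $1_{\{\cdot\}}$ is the indicator function.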
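Cost: Let $C(S(t), D(t) = d)$ be the immediate cost of the MDP if action $D(t) = d$ is taken in slot $t$ under state $S(t)$, representing the resulting total age in the next slot:
$$C(S(t), D(t) = d) = \sum_{i=1}^{N} \big(X_i(t) + 1\big) - X_d(t) \Lambda_d(t),$$
where we define $X_0(t) = 0$ and $\Lambda_0(t) = 0$ for all $t$ (for the no update case $d = 0$), while the last term indicates that user $u_d$ is updated in slot $t$.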
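The MDP components above can be written down directly. The following is a minimal sketch, assuming a state is encoded as a pair of tuples (ages, arrival indicators) and an action as an integer with 0 meaning idle; the helper names `next_ages`, `immediate_cost`, and `transition_probs` are hypothetical.

```python
from itertools import product
from typing import Dict, Sequence, Tuple

State = Tuple[Tuple[int, ...], Tuple[int, ...]]  # (ages, arrival indicators)


def next_ages(ages: Sequence[int], arrivals: Sequence[int], d: int) -> Tuple[int, ...]:
    """Age vector in the next slot under action d (0 = idle, i = user i)."""
    return tuple(1 if (d == i + 1 and arrivals[i] == 1) else x + 1
                 for i, x in enumerate(ages))


def immediate_cost(ages: Sequence[int], arrivals: Sequence[int], d: int) -> int:
    """Total age in the next slot: sum of (x_i + 1) minus the age removed by an update."""
    saved = ages[d - 1] * arrivals[d - 1] if d > 0 else 0
    return sum(x + 1 for x in ages) - saved


def transition_probs(ages: Sequence[int], arrivals: Sequence[int], d: int,
                     p: Sequence[float]) -> Dict[State, float]:
    """Non-zero transition probabilities: the next ages are deterministic,
    and the next arrival vector is a product of independent Bernoulli draws."""
    nxt = next_ages(ages, arrivals, d)
    probs: Dict[State, float] = {}
    for lam in product((0, 1), repeat=len(p)):
        q = 1.0
        for l, pi in zip(lam, p):
            q *= pi if l == 1 else 1.0 - pi
        probs[(nxt, lam)] = q
    return probs


probs = transition_probs((2, 5), (1, 1), 2, p=(0.5, 0.25))
print(immediate_cost((2, 5), (1, 1), 2), probs)
```

The cost of an action equals the sum of the resulting next-slot ages, which is a quick consistency check between the cost and the age dynamics.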
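The objective of the MDP is to find a policy $\pi$ (with the same definition as the scheduling algorithm) that minimizes the average cost $V(\pi)$ defined by
$$V(\pi) = \limsup_{T \to \infty} \frac{1}{T+1} E_\pi\left[\sum_{t=0}^{T} C\big(S(t), D(t)\big)\right].$$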
A policy of the MDP is -optimal if it minimizes the average cost .
-optimal policy is an age-optimal scheduling algorithm. Moreover, policies of the MDP can be classified as follows. A policy of the MDP is history dependent if depends on and . A policy is stationary if when for any . A randomized policy specifies a probability distribution on the set of decisions, while a deterministic policy makes a decision with certainty. A policy in general belongs to one of the following sets :
: a set of randomized history dependent policies;
: a set of randomized stationary policies;
: a set of deterministic stationary policies.
Note that , while the complexity of searching for a -optimal policy increases from left to right. According to , there may exist neither a nor a policy that is -optimal. Hence, we aim to identify a regime under which a -optimal policy lies in a smaller policy set , and to investigate its structure.
3.1 Characterization of the -optimality
To characterize the -optimality, we start with introducing an infinite horizon $\alpha$-discounted cost, where $0 < \alpha < 1$ is a discount factor. We subsequently connect the discounted cost case to the average cost case, because structures of a -optimal policy usually depend on its discounted cost case.
Given initial state $S(0) = s$, the expected total $\alpha$-discounted cost under scheduling algorithm $\pi$ (that can be history dependent) is
$$J_\alpha(s; \pi) = \lim_{T \to \infty} E_\pi\left[\sum_{t=0}^{T} \alpha^t C\big(S(t), D(t)\big) \,\middle|\, S(0) = s\right].$$
A policy of the MDP is -optimal if it minimizes the expected total $\alpha$-discounted cost $J_\alpha(s; \pi)$. In particular, we define $V_\alpha(s) = \min_\pi J_\alpha(s; \pi)$.
Moreover, by $h_\alpha(s)$ we define the relative cost function, which is the difference of the minimum discounted costs $V_\alpha(\cdot)$ between state $s$ and a reference state . We can arbitrarily choose the reference state, e.g., in this paper. We then introduce the discounted cost optimality equation of $V_\alpha(s)$ below.
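The minimum expected total $\alpha$-discounted cost $V_\alpha(s)$, for initial state $s$, satisfies the following discounted cost optimality equation:
$$V_\alpha(s) = \min_{d \in \{0, 1, \dots, N\}} C(s, d) + \alpha E[V_\alpha(s')], \tag{2}$$
where the expectation is taken over all possible next states $s'$ reachable from the state $s$, i.e., $E[V_\alpha(s')] = \sum_{s'} P_{s,s'}(d) V_\alpha(s')$. A deterministic stationary policy that realizes the minimum of the right-hand-side (RHS) of the discounted cost optimality equation in Eq. (2) is a -optimal policy. Moreover, we can define a value iteration $V_{\alpha,n}(s)$ by $V_{\alpha,0}(s) = 0$ and for any $n \ge 0$,
$$V_{\alpha,n+1}(s) = \min_{d \in \{0, 1, \dots, N\}} C(s, d) + \alpha E[V_{\alpha,n}(s')]. \tag{3}$$
Then, $V_{\alpha,n}(s) \to V_\alpha(s)$ as $n \to \infty$, for every $s$ and $\alpha$.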
Please see Appendix A. ∎
The value iteration in Eq. (3) is helpful for characterizing , e.g., showing that is a non-decreasing function in the following.
is a non-decreasing function in , for given and .
Please see Appendix B. ∎
There exists a deterministic stationary policy that is -optimal. Moreover, there exists a finite constant for every state such that the minimum average cost is , independent of initial state .
Please see Appendix C. ∎
We want to further elaborate on Lemma 6.
First, note that there is no condition for the existence of a deterministic stationary policy that is -optimal. In general, we need some conditions to ensure that the reduced Markov chain by a deterministic stationary policy is positive recurrent. Intuitively, we can think of the age of our problem as an age-queueing system, consisting of an age-queue, input to the queue, and a server. The input rate is one per slot since the age increases by one for each slot, while the server can serve an infinite number of age-packets for each service opportunity. As such, we can always find a scheduling algorithm such that the average arrival rate is less than the service rate and thus the reduced Markov chain is positive recurrent. Please see the proof in Appendix C for details.
Second, since our MDP involves a long-run average cost optimization with a countably infinite state space and unbounded immediate cost, a -optimal policy of such an MDP might not satisfy the average cost optimality equation like Eq. (2) (see  for a counter-example), even though the optimality of a deterministic stationary policy is established in Lemma 6.
In addition to the optimality of deterministic stationary policies, we show that a -optimal policy has a nice structure. Investigating such structural results not only facilitates the scheduling algorithm design in Section 3, but also simplifies the calculation of the Whittle index in Section 4.
A switch-type policy is a special deterministic stationary policy of the MDP : for given and , if the action of the policy for state is , then the action for state is as well.
In general, showing that a -optimal policy satisfies a structure relies on an optimality equation; however, as discussed, the average cost optimality equation for the MDP might not be available. To resolve this issue, we first investigate the discounted cost case by the well-established value iteration in Eq. (3), and then extend to the average cost case.
There exists a switch-type policy of the MDP that is -optimal.
First, we start with the discounted cost case, and show that a -optimal scheduling algorithm is switch-type. Let . Then, . Without loss of generality, we suppose that a -optimal action at state is to update the user with . Then, according to the optimality of ,
for all .
Let be the vector with all entries being one. Let be the zero vector except for the -th entry being replaced by . To demonstrate the switch-type structure, we consider the following two cases:
For any other user with : Since is a non-decreasing function in (see Proposition 5), we have
where is the next arrival vector.
For any other user with : Similarly, we have
Considering the two cases, a -optimal action for state is still to update , yielding the switch-type structure.
Then, we proceed to establish the optimality for the average cost case. Let be a sequence of the discount factors. According to , if both conditions in Appendix C hold, then there exists a subsequence such that a -optimal algorithm is the limit point of the -optimal policies. By induction on again, we obtain that a -optimal policy is switch-type as well. ∎
3.2 Finite-state MDP approximations
The classical method for solving an MDP is to apply a value iteration method . However, as mentioned, the average cost optimality equation might not exist. Even though average cost value iteration holds like Eq. (3), the value iteration cannot work in practice, as we need to update an infinite number of states for each iteration. To address the issue, we propose a sequence of finite-state approximate MDPs. In general, a sequence of approximate MDPs might not converge to the original MDP according to . Thus, we will rigorously show the convergence of the proposed sequence.
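Let $A_i(t)$ be a virtual age of information for user $u_i$ in slot $t$, with the dynamic being
$$A_i(t+1) = \begin{cases} 1 & \text{if } \Lambda_i(t) = 1 \text{ and } D(t) = i; \\ [A_i(t) + 1]^+_m & \text{otherwise,} \end{cases}$$
where we define the notation $[x]^+_m$ by $[x]^+_m = x$ if $x \le m$ and $[x]^+_m = m$ if $x > m$, i.e., we truncate the real age by $m$. This is different from Eq. (1). While the real age can go beyond $m$, the virtual age is at most $m$. Here, we reasonably choose the truncation $m$ to be greater than the number $N$ of users, i.e., $m > N$. Later, in Appendix D (see Remark 23), we will discuss some mathematical reasons for the choice.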
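To make the truncation concrete, the short sketch below tracks a real age and its truncated (virtual) counterpart side by side. The function names `truncate` and `step_virtual_age` and the value of the truncation level are assumptions chosen for illustration.

```python
def truncate(x: int, m: int) -> int:
    """Cap an age at the truncation level m."""
    return x if x <= m else m


def step_virtual_age(a: int, arrival: int, scheduled: bool, m: int) -> int:
    """One slot of the virtual age: reset to 1 on an update, else grow up to m."""
    if arrival == 1 and scheduled:
        return 1
    return truncate(a + 1, m)


# Five slots without any update, truncation level m = 3 (illustrative value).
real_age, virtual_age, m = 1, 1, 3
for _ in range(5):
    real_age += 1
    virtual_age = step_virtual_age(virtual_age, arrival=0, scheduled=False, m=m)
print(real_age, virtual_age)
```

The real age keeps growing while the virtual age saturates at the truncation level, which is what keeps the approximate state space finite.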
By we define a sequence of approximate MDPs for , where each MDP is the same as the original MDP except:
States: The state in slot is . Let be the state space.
Transition probabilities: Under action , the transition probability of the MDP from state to state is
if for all .
Remember that the state space of the MDP includes some transient age vectors, e.g., . That is because, if not, the truncated state space would not be a subset of original state space .
Next, we show that the proposed sequence of approximate MDPs converges to the -optimum.
Let be the minimum average cost for the MDP . Then, as .
Please see Appendix D. ∎
3.3 Structural MDP scheduling algorithm
Now, for a given truncation , we are ready to propose a practical algorithm to solve the MDP . The traditional relative value iteration algorithm (RVIA), as follows, can be applied to obtain an optimal deterministic stationary policy for :
for all where the initial value function is . For each iteration, we need to update actions for all virtual states by minimizing the RHS of Eq. (4) as well as update for all . As the size of the state space is , the computational complexity of updating all virtual states in each iteration of Eq. (4) is more than . The complexity primarily results from the truncation of the MDP and the number of users. In this section, we focus on dealing with large values of for the case of a small number of users. In the next section we will address the case of many users.
To develop a low-complexity scheduling algorithm for a small number of users, we propose structural RVIA in Alg. 1, which is an improved RVIA by leveraging the switch-type structure. In Alg. 1, we seek an optimal action for each virtual state by iteration. For each iteration, we update both the optimal action and for all virtual states. If the switch property holds (the optimal policy for the truncated MDPs is switch-type as well, according to the same proof as Theorem 8), we can determine an optimal action immediately in Line 1; otherwise we find an optimal action according to Line 1. By in Line 1 we temporarily keep the updated value, which will replace in Line 1. Using the switch structure to avoid the minimum operations on all virtual states in the conventional RVIA, we can reduce the computational complexity resulting from the size . Next, we establish the optimality of the structural RVIA for the approximate MDP .
Based on the structural RVIA in Alg. 1, we propose the structural MDP scheduling algorithm: Given the actions for every state from Alg. 1, for each slot the scheduling algorithm makes a decision according to the virtual age for all , instead of the real age . Then, combining Theorems 9 and 10 yields that the proposed algorithm is asymptotically -optimal as approaches infinity.
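For a small instance, a conventional relative value iteration on the truncated MDP can be sketched as follows. This is the plain RVIA, not the structural variant of Alg. 1; it assumes two users, a small truncation level, the all-ones age vector with no arrivals as the reference state, and it subtracts the updated reference value in each iteration (a common RVIA variant). The names `rvia` and `is_switch_type` are hypothetical. The second function checks the switch-type property on the computed policy.

```python
from itertools import product


def rvia(p, m, iters=300, tol=1e-9):
    """Plain relative value iteration on the truncated MDP (small instances only)."""
    n = len(p)
    age_vecs = list(product(range(1, m + 1), repeat=n))
    lam_vecs = list(product((0, 1), repeat=n))
    lam_prob = {}
    for lam in lam_vecs:
        q = 1.0
        for l, pi in zip(lam, p):
            q *= pi if l else 1.0 - pi
        lam_prob[lam] = q
    V = {(a, lam): 0.0 for a in age_vecs for lam in lam_vecs}
    ref = (age_vecs[0], lam_vecs[0])  # assumed reference state
    policy = {}
    for _ in range(iters):
        # Expected value of reaching age vector a before the next arrivals are drawn.
        EV = {a: sum(lam_prob[lam] * V[(a, lam)] for lam in lam_vecs) for a in age_vecs}
        new_V = {}
        for (a, lam) in V:
            best_val, best_d = None, 0
            for d in range(n + 1):  # 0 = idle, i = schedule user i
                updated = d > 0 and lam[d - 1] == 1
                cost = sum(x + 1 for x in a) - (a[d - 1] if updated else 0)
                nxt = tuple(1 if (updated and i == d - 1) else min(x + 1, m)
                            for i, x in enumerate(a))
                val = cost + EV[nxt]
                if best_val is None or val < best_val - tol:
                    best_val, best_d = val, d  # ties keep the earlier action
            new_V[(a, lam)] = best_val
            policy[(a, lam)] = best_d
        offset = new_V[ref]
        V = {s: v - offset for s, v in new_V.items()}
    return policy, V


def is_switch_type(policy, m):
    """If user d is updated in a state, it is still updated when its age grows by one."""
    for (a, lam), d in policy.items():
        if d == 0 or a[d - 1] == m:
            continue
        bumped = a[:d - 1] + (a[d - 1] + 1,) + a[d:]
        if policy[(bumped, lam)] != d:
            return False
    return True


policy, V = rvia(p=(0.5, 0.5), m=6)
print(policy[((2, 5), (1, 1))], is_switch_type(policy, 6))
```

Every state is visited and every action is evaluated in each iteration here, which is exactly the per-iteration work that exploiting the switch-type structure is meant to reduce.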
4 An index scheduling algorithm
By means of the MDP techniques, we have developed the structural MDP scheduling algorithm. The scheduling algorithm not only reduces the complexity of the traditional RVIA, but was also shown to be asymptotically age-optimal. However, the scheduling algorithm might not be feasible for many users; thus, a low-complexity scheduling algorithm for many users is still needed. To fill this gap, we investigate the scheduling problem from the perspective of restless bandits . A restless bandit generalizes a classic bandit by allowing the bandit to keep evolving under a passive action, but in a distinct way from its continuation under an active action.
The restless bandits problem, in general, is PSPACE-hard . Whittle hence investigated a relaxed version, where a constraint on the number of active bandits for each slot is replaced by the expected number. With this relaxation, Whittle then applied a Lagrangian approach to decouple the multi-armed bandit problem into multiple sub-problems, while proposing an index policy and a concept of indexability. The index policy is optimal for the relaxed problem; moreover, in many practical systems, the low-complexity index policy performs remarkably well, e.g., see .
With the success of the Whittle index policy in solving the restless bandit problem, we apply Whittle’s approach to develop a low-complexity scheduling algorithm. However, obtaining the Whittle index in closed form and establishing the indexability can be very challenging . To address the issues, we simplify the derivation of the Whittle index by investigating structural results as in Section 3.
4.1 Decoupled sub-problem
We note that each user in our scheduling problem can be viewed as a restless bandit. Then, applying Whittle’s approach, we can decouple our problem into $N$ sub-problems. Each sub-problem consists of a single user and adheres to the network model in Section 2 with $N = 1$, except for an additional cost $c$ for updating the user. In fact, the cost $c$ is a scalar Lagrange multiplier in the Lagrangian approach. In each decoupled sub-problem, we aim at determining whether or not the BS updates the user in each slot, striking a balance between the updating cost and the cost incurred by age. Since each sub-problem consists of a single user only, hereafter in this section we omit the index $i$ for simplicity.
Similarly, we cast the sub-problem into an MDP , which is the same as the MDP in Section 3 with a single user except:
Actions: Let $A(t) \in \{0, 1\}$ be an action of the MDP in slot $t$ indicating the BS’s decision, where $A(t) = 1$ if the BS decides to update the user and $A(t) = 0$ if the BS decides to idle. Note that the action $A(t)$ is different from the scheduling decision $D(t)$. The action $A(t)$ is used for the decoupled sub-problem. In Section 4.4, we will use the MDP to decide $D(t)$.
Cost: Let $C(S(t), A(t) = a)$ be an immediate cost if action $A(t) = a$ is taken in slot $t$ under state $S(t) = (X(t), \Lambda(t))$, with the definition as follows.
$$C(S(t), A(t) = a) = \big(X(t) + 1 - X(t) \Lambda(t) \, a\big) + c \cdot a, \tag{5}$$
where the first part is the resulting age in the next slot and the second part is the incurred cost for updating the user.
A policy of the MDP specifies an action for each slot . The average cost under policy is defined by
Again, the objective of the MDP is to find an -optimal policy defined as follows.
A policy of the MDP is -optimal if it minimizes the average cost .
Traditionally, the Whittle index might be obtained by solving the optimality equation of , e.g. [20, 22]. However, as discussed, the average cost optimality equation for the MDP might not exist, and even if it exists, solving an optimality equation might be tedious. To look for a simpler way of obtaining the Whittle index, we investigate structures of an -optimal policy instead, by looking at its discounted case again. It turns out that our structural results will further simplify the derivation of the Whittle index.
4.2 Characterization of the -optimality
First, we show that an -optimal policy is stationary deterministic as follows.
There exists a deterministic stationary policy that is -optimal, independent of the initial state.
Please see Appendix F. ∎
Next, we show that an -optimal policy is a special type of deterministic stationary policies.
A threshold-type policy is a special deterministic stationary policy of the MDP . The action for state $(x, 0)$ is to idle, for all $x$. Moreover, if the action for state $(x, 1)$ is to update, then the action for state $(x + 1, 1)$ is to update as well. In other words, there exists a threshold $X \in \{1, 2, \dots\}$ such that the action is to update if there is an arrival and the age is greater than or equal to $X$; otherwise, the action is to idle.
If the update cost , then there exists a threshold-type policy that is -optimal.
It is obvious that an optimal action for state is to idle if . To establish the optimality of the threshold structure for state , we need the discounted cost optimality equation for , similar to Proposition 4:
Similar to the proof of Theorem 8, we can focus on the discounted cost case and show that an -optimal policy is the threshold type. Let . Then, . Moreover, an -optimal action for state is . Suppose that an -optimal action for state is to update, i.e.,
Then, an -optimal action for state is still to update since
where (a) results from the non-decreasing function of in given (similar to Proposition 5). Hence, an -optimal policy is threshold-type. ∎
Thus far, we have successfully identified the threshold structure of an -optimal policy. The MDP then can be reduced to a two-dimensional discrete-time Markov chain (DTMC) by applying a threshold-type policy. To find an optimal threshold for minimizing the average cost, in the next lemma we explicitly derive the average cost for a threshold-type policy.
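Given the threshold-type policy with the threshold $X \in \{1, 2, \dots\}$, then the average cost, denoted by $J(X)$, under the policy is
$$J(X) = \frac{\frac{X^2}{2} + \left(\frac{1}{p} - \frac{1}{2}\right) X + \left(\frac{1}{p^2} - \frac{1}{p}\right) + c}{X + \frac{1-p}{p}}. \tag{6}$$
Let $Y(t)$ be the age after an action in slot $t$; precisely, if the user is updated in slot $t$ (i.e., $\Lambda(t) = 1$ and $A(t) = 1$), then $Y(t) = 0$; otherwise, $Y(t) = X(t)$. Note that $Y(t)$, called post-action age (similar to the post-decision state [23, 24]), is different from the pre-action age $X(t)$. Then, the post-action age by the threshold-type policy forms a one-dimensional DTMC in Fig. 3, with the transition probabilities being
$$P[Y(t+1) = y + 1 \mid Y(t) = y] = 1 \quad \text{for } y = 0, \dots, X - 2;$$
$$P[Y(t+1) = y + 1 \mid Y(t) = y] = 1 - p, \quad P[Y(t+1) = 0 \mid Y(t) = y] = p \quad \text{for } y \ge X - 1.$$
To calculate the average cost of the policy, we associate each state in the DTMC with a cost. The DTMC incurs the cost of $c + 1$ in slot $t$ when the post-action age in slot $t$ is $Y(t) = 0$. That is because the post-action age $Y(t) = 0$ implies that the BS updates the user. In addition, the DTMC incurs the age cost of $y + 1$ in slot $t$ when the post-action age is $Y(t) = y \ne 0$.
The steady-state distribution $(\xi_0, \xi_1, \dots)$ of the DTMC can be solved as
$$\xi_y = \begin{cases} \dfrac{1}{X + \frac{1-p}{p}} & \text{if } y = 0, \dots, X - 1; \\[2mm] \dfrac{(1-p)^{y - X + 1}}{X + \frac{1-p}{p}} & \text{if } y \ge X. \end{cases}$$
Therefore, the average cost of the DTMC is
$$(c + 1)\, \xi_0 + \sum_{y=1}^{\infty} (y + 1)\, \xi_y = \frac{\frac{X(X+1)}{2} + \frac{(1-p) X}{p} + \frac{1-p}{p^2} + c}{X + \frac{1-p}{p}},$$
which equals $J(X)$ in Eq. (6).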
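The average cost of a threshold-type policy can be cross-checked numerically. The sketch below evaluates a closed-form expression obtained by summing the per-state costs over the steady-state distribution of the post-action age chain, and compares it with a direct simulation of the single-user sub-problem. The function names, the parameter values, and the simulation length are assumptions for illustration; the simulated value is only an estimate.

```python
import random


def avg_cost_closed_form(X: int, p: float, c: float) -> float:
    """Average cost of the threshold policy with threshold X, arrival
    probability p, and update cost c (steady-state average of per-slot costs)."""
    num = X * X / 2 + (1 / p - 0.5) * X + (1 / p ** 2 - 1 / p) + c
    return num / (X + (1 - p) / p)


def avg_cost_simulated(X: int, p: float, c: float,
                       slots: int = 200_000, seed: int = 1) -> float:
    """Monte Carlo estimate: update iff a packet arrives and the age is >= X."""
    rng = random.Random(seed)
    age, total = 1, 0.0
    for _ in range(slots):
        arrival = rng.random() < p
        update = arrival and age >= X
        # Per-slot cost: next-slot age plus the update cost when updating.
        total += (1 + c) if update else (age + 1)
        age = 1 if update else age + 1
    return total / slots


print(avg_cost_closed_form(3, 0.5, 2.0), avg_cost_simulated(3, 0.5, 2.0))
```

The two numbers should be close for a long enough simulation, which gives some confidence in the closed form before it is used to search for an optimal threshold.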
We remark that the post-action age introduced in the above proof is beneficial in many aspects:
The post-action age can form a one-dimensional DTMC, instead of the original two-dimensional state .
We cannot associate each pre-action age with a fixed cost, since the cost in Eq. (5) depends on not only state but also action. Instead, the cost for each post-action age is determined by its age only.
The post-action age will facilitate the online algorithm design in Section 5.
4.3 Derivation of the Whittle index
Now, we are ready to define and derive the Whittle index as follows.
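We define the Whittle index $I(s)$ by the cost $c$ that makes both actions, to update and to idle, for state $s$ equally desirable.
The Whittle index of the sub-problem for state $s = (x, \lambda)$ is
$$I(s) = \begin{cases} \dfrac{x^2}{2} - \dfrac{x}{2} + \dfrac{x}{p} & \text{if } \lambda = 1; \\[2mm] 0 & \text{if } \lambda = 0. \end{cases} \tag{7}$$
It is obvious that the Whittle index for state $(x, 0)$ is $0$, as both actions result in the same immediate cost and the same age of next slot if $c = 0$.
Let $g(y) = J(y)$ in Eq. (6) for a real-valued domain of $y$. Note that $g(y)$ is strictly convex in the domain. Let $y^*$ be the minimizer of $g(y)$. Then, an optimal threshold for minimizing the average cost $J(X)$ is either $\lfloor y^* \rfloor$ or $\lceil y^* \rceil$: the optimal threshold is $\lfloor y^* \rfloor$ if $J(\lfloor y^* \rfloor) < J(\lceil y^* \rceil)$ and $\lceil y^* \rceil$ if $J(\lceil y^* \rceil) < J(\lfloor y^* \rfloor)$. If there is a tie, both choices are optimal, i.e., equally desirable.
Hence, both actions for state $(x, 1)$ are equally desirable if and only if the age $x$ satisfies
$$J(x) = J(x + 1), \tag{8}$$
i.e., $x = \lfloor y^* \rfloor$ and both thresholds of $x$ and $x + 1$ are optimal. By solving Eq. (8), we obtain the cost, as stated in the theorem, to make both actions equally desirable. ∎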
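The defining property of the index, namely that at this cost the thresholds at the current age and at the next age give the same average cost, can be verified numerically. The sketch below is an illustration with hypothetical function names `avg_cost` and `whittle_index`; the closed forms used are the average cost of a threshold policy and the index expression obtained by equating the costs of two adjacent thresholds.

```python
def avg_cost(X: int, p: float, c: float) -> float:
    """Average cost of the threshold policy with threshold X."""
    num = X * X / 2 + (1 / p - 0.5) * X + (1 / p ** 2 - 1 / p) + c
    return num / (X + (1 - p) / p)


def whittle_index(x: int, lam: int, p: float) -> float:
    """Cost that makes updating and idling equally desirable in state (x, lam)."""
    if lam == 0:
        return 0.0
    return x * x / 2 - x / 2 + x / p


# At c equal to the index of (x, 1), thresholds x and x + 1 tie.
for p in (0.2, 0.5, 0.9):
    for x in range(1, 8):
        c = whittle_index(x, 1, p)
        assert abs(avg_cost(x, p, c) - avg_cost(x + 1, p, c)) < 1e-9

print(whittle_index(3, 1, 0.5))
```

The index grows with the age and with the inverse of the arrival probability, so a user whose packets arrive rarely is valued more highly when a packet does arrive.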
According to Theorem 17, both actions might have a tie. If there is a tie, we break the tie in favor of idling. Then, we can explicitly express the optimal threshold in the next theorem.
The optimal threshold for minimizing the average cost $J(X)$ is $x$ if the cost $c$ satisfies $I(x - 1, 1) \le c < I(x, 1)$, for all $x = 1, 2, \dots$.
Please see Appendix G. ∎
Next, according to , we have to demonstrate the indexability such that the Whittle index is feasible.
For a given cost , let be the set of states such that the optimal actions for the states are to idle. The sub-problem is indexable if the set monotonically increases from the empty set to the entire state space, as increases from to .
The sub-problem is indexable.
If , the optimal action for every state is to update; as such, . If , then is composed of the set and a set of for some ’s. According to Lemma 18, the optimal threshold monotonically increases to infinity as increases, and hence the set monotonically increases to the entire state space. ∎
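The monotone growth of the optimal threshold with the update cost can be observed by brute force. The sketch below searches over a finite range of thresholds, breaking ties in favor of idling (the larger threshold). The names `avg_cost` and `optimal_threshold`, the search bound, and the tested cost values are assumptions for illustration.

```python
def avg_cost(X: int, p: float, c: float) -> float:
    """Average cost of the threshold policy with threshold X."""
    num = X * X / 2 + (1 / p - 0.5) * X + (1 / p ** 2 - 1 / p) + c
    return num / (X + (1 - p) / p)


def optimal_threshold(c: float, p: float, x_max: int = 200, tol: float = 1e-12) -> int:
    """Brute-force minimizer over thresholds 1..x_max; ties go to the larger threshold."""
    costs = {X: avg_cost(X, p, c) for X in range(1, x_max + 1)}
    best = min(costs.values())
    return max(X for X, v in costs.items() if v <= best + tol)


thresholds = [optimal_threshold(c, 0.5) for c in (0, 1, 2, 3, 5, 9)]
print(thresholds)
```

With threshold X, the states in which idling is optimal are all states without an arrival together with the arrival states whose age is below X, so a non-decreasing threshold means this set only grows as the cost increases.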
4.4 Index scheduling algorithm
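Now, we are ready to propose a low-complexity index scheduling algorithm based on the Whittle index. For each slot $t$, the BS observes age $X_i(t)$ and arrival indicator $\Lambda_i(t)$ for every user $u_i$; then, it updates the user with the highest value of the Whittle index $I_i(X_i(t), \Lambda_i(t))$ (computed with that user's arrival probability $p_i$), i.e., $D(t) = \arg\max_{i = 1, \dots, N} I_i(X_i(t), \Lambda_i(t))$. We can think of the index $I_i(X_i(t), \Lambda_i(t))$ as a value of updating user $u_i$. The intuition of the index scheduling algorithm is that the BS intends to send the most valuable packet.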
The optimality of the index scheduling algorithm for the relaxed version is known . Next, we show that the proposed index scheduling algorithm is age-optimal for the original problem (without relaxation), when the packet arrivals for all users are stochastically identical.
If the arrival rates of all information sources are the same, i.e., $p_i = p_j$ for all $i \ne j$, then the index scheduling algorithm is age-optimal.
Note that, for this case, the index scheduling algorithm sends an arriving packet with the largest age of information, i.e., $D(t) = \arg\max_i X_i(t) \Lambda_i(t)$ for each slot $t$. Then, in Appendix H we show that the policy is -optimal. ∎
In Section 6 we will further validate the index scheduling algorithm for stochastically non-identical arrivals by simulations.
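The per-slot decision of the index scheduling algorithm is simple enough to state in a few lines. The sketch below is a minimal illustration with hypothetical names `whittle_index` and `index_schedule`; returning 0 (idle) when no packet has arrived is an assumption, since every decision is then equivalent in the no-buffer model.

```python
from typing import Sequence


def whittle_index(x: int, lam: int, p: float) -> float:
    """Index of state (x, lam) for a user with arrival probability p."""
    if lam == 0:
        return 0.0
    return x * x / 2 - x / 2 + x / p


def index_schedule(ages: Sequence[int], arrivals: Sequence[int],
                   p: Sequence[float]) -> int:
    """Return the 1-based user with the largest index, or 0 if nothing arrived."""
    indices = [whittle_index(x, lam, pi) for x, lam, pi in zip(ages, arrivals, p)]
    best = max(range(len(indices)), key=lambda i: indices[i])
    return best + 1 if indices[best] > 0 else 0


print(index_schedule([3, 2], [1, 1], [0.5, 0.5]))   # equal rates: the older user wins
print(index_schedule([2, 2], [1, 1], [0.9, 0.1]))   # equal ages: the rarer source wins
```

With identical arrival probabilities the rule reduces to serving the arriving packet of the user with the largest age, while with unequal probabilities it favors sources whose packets arrive less often.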
5 Online scheduling algorithm design
Thus far, we have developed two scheduling algorithms in Sections 3 and 4. Both algorithms are offline, as the structural MDP scheduling algorithm and the index scheduling algorithm need the arrival statistics as prior information to pre-compute an optimal action for each virtual state and the Whittle index, respectively. To solve the more challenging case when the arrival statistics are unavailable, in this section we develop online versions for both offline algorithms.
5.1 An MDP-based online scheduling algorithm
We first develop an online version of the MDP scheduling algorithm by leveraging stochastic approximation techniques . The intuition is that, instead of updating for all virtual states in each iteration of Eq. (4), we update by following a sample path, which is a set of outcomes of the arrivals over slots. It turns out that the sample-path updates will converge to the -optimal solution. To that end, we need a stochastic version of the RVIA. However, the RVIA in Eq. (4) is not suitable because the expectation is inside the minimization (see  for details). While minimizing the RHS of Eq. (4) for a given current state, we would need the transition probabilities to calculate the expectation. To tackle this, we design post-action states for our problem, similar to the proof of Lemma 15.
We define post-action state as the ages and the arrivals after an action. The state we used before is referred to as the pre-action state. If is a virtual state of the MDP , then the virtual post-action state after action is with
and for all .
Let be the value function based on the post-action states defined by
where the expectation is taken over all possible pre-action states reachable from the post-action state. We can then write down the post-action average cost optimality equation  for the virtual post-action state :
where is the next arrival vector; denotes the zero vector except for the -th entry being replaced by ; denotes the zero vector except for the -th entry being replaced by ; the vector is the unit vector. From the above optimality equation, the RVIA is as follows:
Subsequently, we propose the MDP-based online scheduling algorithm in Alg. 2 based on the stochastic version of the RVIA. In Lines 2-2, we initialize of all virtual post-action states and start from the reference point. Moreover, by we record of the current virtual post-action state. By observing the current arrivals and plugging in Eq. (9), the expectation in Eq. (9) can be removed; as such, in Line 2 we optimally update a user by minimizing Eq. (10). Then, we update of the current virtual post-action state in Line 2, where is a stochastic step-size in slot to strike a balance between the previous and the updated value . Finally, the next virtual post-action state is updated in Lines 2 and 2