I Introduction
We consider a source node that communicates the most up-to-date status packets to multiple users (see Figure 1). We are interested in the average age of information (AoI) [1, 2, 3] at the users, for a system in which the source node samples an underlying time-varying process and schedules the transmission of the sample values over imperfect links. The AoI at each user at any point in time can simply be defined as the amount of time elapsed since the most recent status update received by that user was generated. Most of the earlier work on AoI considers queue-based models, in which the status updates arrive at the source node randomly following a memoryless Poisson process, and are stored in a buffer before being transmitted to the destination [2, 3]. Instead, in the so-called generate-at-will model [4, 1, 5, 6, 7], also considered in this paper, the status updates of the underlying process of interest can be generated at any time by the source node.
AoI in multi-user networks has been studied in [8, 7, 6, 9, 10, 11]. It is shown in [8] that the scheduling problem for age minimization is NP-hard in general. Scheduling transmissions to multiple receivers is investigated in [7], focusing on a perfect transmission medium, and the optimal scheduling algorithm is shown to be of threshold type. The average AoI has also been studied for status updates over unreliable multi-access channels [10] and multicast networks [11]. A base station sending time-sensitive information to a number of users through unreliable channels is considered in [6], where the problem is formulated as a restless multi-armed bandit. AoI in the presence of retransmissions has been considered in [12, 9]. The status update system is modeled as an M/G/1/1 queue in [12], where the status update arrivals are assumed to be memoryless and random. Maximum distance separable (MDS) coding is considered in [12], and the successful decoding probabilities are derived in closed form.
In this paper, we address the scheduling of status updates in a multi-user network for both the standard ARQ and HARQ protocols. Our goal is to minimize the expected average AoI under an average transmission-rate constraint. This constraint is motivated by the fact that sensors sending status updates usually have limited energy supplies (e.g., they are powered via energy harvesting [13]); hence, they can afford neither to send an unlimited number of updates, nor to increase the signal-to-noise ratio of the transmission. First, we assume that the success probability of each transmission attempt is known; hence, the source can judiciously decide when to retransmit, and when to discard failed information and send a fresh update. Then, we consider scheduling status updates over unknown channels, in which the success probabilities of transmission attempts are not known a priori, and must be learned in an online fashion using the ACK/NACK feedback signals.
In previous work [14], we studied a point-to-point status update system in the presence of transmission errors and a resource constraint. Here, the results obtained in [14] are extended to the multi-user setting; in addition, more sophisticated reinforcement learning (RL) algorithms are proposed to minimize the average AoI, and are demonstrated to perform very close to a lower bound.
The rest of the paper is organized as follows. In Section II, the system model is presented and the problem of minimizing the average AoI in multi-user networks under a resource constraint is formulated as a
constrained Markov decision process
(CMDP). After determining the structure of the optimal policy, a primal-dual algorithm is proposed to solve this CMDP in Section III. Minimization of the AoI for the standard ARQ protocol is investigated in Section IV, and a lower bound on the average AoI is presented. Section V introduces RL algorithms to minimize the AoI in an unknown environment. Simulation results are presented in Section VI, and the paper is concluded in Section VII.
II System Model and Problem Formulation
We consider a slotted status update system in which multiple users await time-sensitive information regarding a time-varying process. The source monitors the underlying time-varying process, and is able to generate a status update for it at the beginning of each time slot. The source can transmit the status update to only a single user in each time slot. This can be either because of dedicated orthogonal links to the users, e.g., a wired network, or because the users are interested in distinct processes. A transmission attempt of a status update to a single user takes constant time, which is assumed to be equal to the duration of one time slot.
We assume that the channel state changes randomly from one time slot to the next in an independent and identically distributed fashion. We further assume the availability of an error- and delay-free single-bit ACK/NACK feedback link from each user to the source node.
Let N denote the number of users and i ∈ {1, …, N} the index of each user. The AoI for each user is defined as the time elapsed since the most up-to-date packet it has received was generated at the source. Assume that the most up-to-date packet at user i at time t has generation time stamp U_i(t); then the AoI of user i at the beginning of time slot t, denoted by Δ_i(t), is defined as Δ_i(t) = t − U_i(t). Therefore, Δ_i(t) increases by one when the source chooses not to transmit to user i or when a transmission fails, while it drops to one (or, to the number of retransmissions in the case of HARQ) when a status update is successfully decoded.
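The AoI dynamics described above can be sketched in a few lines. The following is a minimal simulation for the standard ARQ case, in which a decoded update resets the age to one; all function and variable names are illustrative, not taken from the paper:

```python
# Minimal AoI bookkeeping for a multi-user status update system.
# Assumption (illustrative): standard ARQ, so a successfully decoded
# update resets that user's age to 1; all other ages grow by one slot.

def step_aoi(ages, scheduled_user, success):
    """Advance the per-user AoI by one slot.

    ages: list of current AoI values, one per user.
    scheduled_user: index of the user served this slot, or None for idle.
    success: whether the transmission was decoded (ACK received).
    """
    next_ages = [a + 1 for a in ages]          # everyone ages by one slot
    if scheduled_user is not None and success:
        next_ages[scheduled_user] = 1          # fresh update delivered
    return next_ages

ages = [1, 1, 1]
ages = step_aoi(ages, scheduled_user=0, success=True)   # user 0 decodes
ages = step_aoi(ages, scheduled_user=1, success=False)  # user 1 fails
print(ages)  # [2, 3, 3]
```

Idling and a failed transmission are indistinguishable from the AoI perspective: in both cases every age simply increments.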
In the classical ARQ protocol, a packet is retransmitted after each NACK feedback until it is successfully decoded. However, in the AoI framework there is no point in retransmitting a failed, out-of-date status packet if it has the same error probability as a fresh status update. Hence, the source always discards a failed packet and transmits a fresh status update. On the other hand, in the HARQ protocol, the signals from all previous transmission attempts are combined for decoding; therefore, the probability of error decreases with every retransmission [15].
Let r_i denote the number of previous transmission attempts of the packet intended for user i. Then, the state of the system can be described by the vector s = (Δ_1, …, Δ_N, r_1, …, r_N). At each time slot, the source node takes one of several actions, denoted by a ∈ A, where A denotes the set of possible actions. It can i) remain idle (a = 0); ii) generate and transmit a new status update packet to user i (a = (i, new)); or iii) retransmit the previously failed packet to user i (a = (i, retx)). Without loss of generality, each user in the network is assumed to have a different priority level, represented by the weight w_i for user i. For the i-th user, the probability of error after r retransmissions, denoted by g_i(r), depends on r, the particular HARQ scheme used for combining multiple transmission attempts, and the channel quality between the source and user i. An empirical method to estimate g_i(r) is presented in [15]. As in any reasonable HARQ strategy, g_i(r) is nonincreasing in r, i.e., g_i(r+1) ≤ g_i(r) for all r. To simplify the analysis and meet practical constraints, we assume that there is a maximum number of retransmissions, r_max. Note that if no resource constraint is imposed on the source, remaining idle is clearly a suboptimal action, since it does not contribute to decreasing the AoI. However, continuous transmission is typically not possible in practice due to energy or interference constraints. To model these situations, we impose a constraint on the average number of transmissions, denoted by C_max.
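The only structural requirements imposed on the error probabilities are monotonicity in the number of retransmissions and the cap r_max. A toy model satisfying both can be sketched as follows; the geometric decay is an illustrative assumption, not the combining model of any particular HARQ scheme:

```python
# A toy HARQ error-probability model satisfying the paper's only two
# requirements: the error probability is nonincreasing in the number of
# prior attempts, and the attempt count is capped. The geometric decay
# lam**r is purely an illustrative assumption.

def harq_error_prob(g0, lam, r, r_max):
    """Error probability after r prior attempts (0 <= r <= r_max)."""
    if not (0 <= r <= r_max):
        raise ValueError("retransmission count out of range")
    return g0 * (lam ** r)

g = [harq_error_prob(0.5, 0.4, r, r_max=3) for r in range(4)]
assert all(g[r + 1] <= g[r] for r in range(3))  # nonincreasing in r
print(g)  # approximately [0.5, 0.2, 0.08, 0.032]
```

Any nonincreasing sequence (e.g., one estimated empirically as in [15]) can replace the geometric model without changing anything downstream.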
This leads to a CMDP formulation, defined by the 5-tuple (S, A, P, c, d) [16]: the countable set of states S and the finite set of actions A have already been defined. P refers to the transition kernel, where P(s′ | s, a) is the probability that action a in state s at time t leads to state s′ at time t+1, and will be explicitly defined in (1). The instantaneous cost function c(s, a), which models the weighted sum of AoI over the users, is defined as c(s, a) = Σ_{i=1}^{N} w_i Δ_i for any s ∈ S, independently of a. The instantaneous transmission cost d(a) related to the constraint is independent of the state and depends only on the action: d(a) = 0 if a = 0, and d(a) = 1 otherwise. The transition probabilities of the CMDP are given below; P(s′ | s, a) is zero for all transitions not listed.
(1) For a = 0 (idle): Δ_i(t+1) = Δ_i(t) + 1 for all i, with probability 1. For a = (i, new): Δ_i(t+1) = 1, r_i(t+1) = 0 with probability 1 − g_i(0); and Δ_i(t+1) = Δ_i(t) + 1, r_i(t+1) = 1 with probability g_i(0). For a = (i, retx): Δ_i(t+1) = r_i(t) + 1, r_i(t+1) = 0 with probability 1 − g_i(r_i(t)); and Δ_i(t+1) = Δ_i(t) + 1, r_i(t+1) = r_i(t) + 1 with probability g_i(r_i(t)). In all cases, Δ_j(t+1) = Δ_j(t) + 1 and r_j(t+1) = r_j(t) for every user j other than the scheduled one.
A stationary policy is a decision rule π that maps the state s to action a with probability π(a | s), where Σ_{a∈A} π(a | s) = 1. We will use {S_t} and {A_t} to denote the sequences of states and actions, respectively, induced by policy π with initial state s_0. Let J(π, s_0) denote the infinite-horizon average age, and C(π, s_0) denote the expected average number of transmissions, when π is employed with initial state s_0. We can state the CMDP optimization problem as follows:
Problem 1.
(2a) minimize over π:  J(π, s_0) ≜ lim sup_{T→∞} (1/T) E[ Σ_{t=1}^{T} c(S_t, A_t) ]
(2b) subject to:  C(π, s_0) ≜ lim sup_{T→∞} (1/T) E[ Σ_{t=1}^{T} d(A_t) ] ≤ C_max
where C_max ∈ (0, 1] is the transmission budget. A policy π* is called optimal if it solves Problem 1 for every initial state s_0. For a deterministic policy π, we will use π(s) to denote the action taken with probability one in state s. Also, without loss of generality, we assume a fixed initial state s_0 at the beginning of the problem, which will be omitted from the notation for simplicity. We also assume throughout this paper that the Markov decision process (MDP) is unichain [16], similarly to [14].
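The two long-run averages in Problem 1 (the average weighted AoI and the average transmission rate) can be estimated empirically for any fixed policy. The following is a minimal Monte Carlo sketch for the standard ARQ case; the always-transmit greedy policy and all names are illustrative placeholders:

```python
import random

# Empirically estimate the average weighted AoI J(pi) and the average
# transmission rate C(pi) of a fixed policy by simulation, under the
# standard ARQ reset-to-one dynamics. Illustrative sketch only.

def evaluate(policy, err_prob, weights, horizon=100_000, seed=0):
    rng = random.Random(seed)
    ages = [1] * len(err_prob)
    age_sum = 0.0
    tx_count = 0
    for _ in range(horizon):
        age_sum += sum(w * a for w, a in zip(weights, ages))
        user = policy(ages)                 # None means stay idle
        nxt = [a + 1 for a in ages]
        if user is not None:
            tx_count += 1
            if rng.random() > err_prob[user]:
                nxt[user] = 1               # update decoded, age resets
        ages = nxt
    return age_sum / horizon, tx_count / horizon

# Placeholder policy: always transmit to the user with the largest age.
greedy = lambda ages: max(range(len(ages)), key=lambda i: ages[i])
J, C = evaluate(greedy, err_prob=[0.1, 0.3], weights=[1.0, 1.0])
print(round(C, 3))  # the greedy policy transmits every slot, so C is 1.0
```

A policy is feasible for Problem 1 exactly when the estimated C stays below C_max; the greedy placeholder above is feasible only in the unconstrained case C_max = 1.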
III Primal-Dual Algorithm to Minimize AoI
In this section, we derive the solution of Problem 1, based on [16]. While there exists a stationary and deterministic optimal policy for countable-state finite-action average-cost MDPs [17], this is not necessarily true for CMDPs [16].
To solve the constrained MDP, we start by rewriting Problem 1 in its Lagrangian form. The average Lagrangian cost of a policy π with Lagrange multiplier η ≥ 0, denoted by L^η(π), is defined as
(3) L^η(π) ≜ lim sup_{T→∞} (1/T) E[ Σ_{t=1}^{T} ( c(S_t, A_t) + η d(A_t) ) ]
and, for any η ≥ 0, the optimal achievable cost is defined as L^η ≜ inf_π L^η(π). This formulation is equivalent to an unconstrained average-cost MDP in which the instantaneous overall cost becomes c(s, a) + η d(a). It is well known that there exists an optimal stationary deterministic policy for this problem. In particular, there exists a function h^η(s), called the differential cost function, satisfying the so-called Bellman optimality equations:
(4) h^η(s) + L^η = min_{a ∈ A} { c(s, a) + η d(a) + Σ_{s′} P(s′ | s, a) h^η(s′) }
where s′ is the next state obtained from s after taking action a. Then the optimal policy μ_η, for any s ∈ S, is given by the action achieving the minimum in (4):
(5) μ_η(s) ∈ argmin_{a ∈ A} { c(s, a) + η d(a) + Σ_{s′} P(s′ | s, a) h^η(s′) }
The relative value iteration (RVI) algorithm can be employed to solve (4) for any given η, and hence to find the policy μ_η (more precisely, an arbitrarily close approximation of it) [17].
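As a concrete illustration, the RVI recursion for the Lagrangian cost can be sketched on a truncated single-user version of the problem (the paper's state space is multi-user; the age cap, the multiplier value, and all names below are assumptions made for readability):

```python
# Relative value iteration (RVI) for the single-user Lagrangian MDP, as
# a readable stand-in for the multi-user case. The state is the current
# AoI (truncated at max_age); actions: 0 = idle, 1 = transmit (ARQ).
# eta is the Lagrange multiplier penalizing each transmission.

def rvi(err_prob, eta, max_age=100, iters=2000):
    h = [0.0] * (max_age + 1)            # differential cost; index 0 unused
    for _ in range(iters):
        new_h = [0.0] * (max_age + 1)
        for age in range(1, max_age + 1):
            nxt = min(age + 1, max_age)
            idle = age + h[nxt]
            tx = age + eta + (1 - err_prob) * h[1] + err_prob * h[nxt]
            new_h[age] = min(idle, tx)
        ref = new_h[1]                   # subtract reference-state value
        h = [v - ref for v in new_h]
    # Greedy policy w.r.t. the converged h: transmit iff it is cheaper.
    policy = []
    for age in range(1, max_age + 1):
        nxt = min(age + 1, max_age)
        idle = age + h[nxt]
        tx = age + eta + (1 - err_prob) * h[1] + err_prob * h[nxt]
        policy.append(1 if tx < idle else 0)
    return policy

policy = rvi(err_prob=0.2, eta=5.0)
# With a positive eta the resulting policy is of threshold type: idle at
# small ages, transmit once the age exceeds a threshold.
```

The threshold structure that emerges here mirrors the threshold-type optimal policies reported for related AoI problems, e.g., in [7].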
Similarly to Corollary 1 in [14], it is possible to characterize optimal policies for our CMDP problem using the deterministic policies μ_η: specializing Theorem 4.4 of [16] to Problem 1 (since it has a single global constraint), the optimal policy can be viewed as a randomization between two deterministic policies. In any state s, the optimal policy of the CMDP chooses action μ_{η_1}(s) with probability ξ and μ_{η_2}(s) with probability 1 − ξ, independently in each time slot, where μ_η denotes the deterministic choice of the optimal policy in the unconstrained MDP with Lagrange multiplier η, and η_1, η_2 are two multiplier values around the optimal one.
For any η, let C(μ_η) denote the average resource consumption under the optimal policy μ_η (note that J(μ_η) and C(μ_η) can be computed directly by finding the stationary distribution of the induced chain, but can also be estimated empirically simply by running the MDP with policy μ_η). Obviously, J(μ_η) and C(μ_η) are monotone functions of η. Therefore, given the constraint C_max, one can find a multiplier, denoted by η*, by solving C(μ_{η*}) = C_max, whenever this equation has a solution.
Next, we present a heuristic method to find η* and ξ: with the aim of finding a single value η* such that C(μ_{η*}) = C_max, starting with an initial parameter η_0, we run an iterative algorithm updating η_n as η_{n+1} = η_n + γ_n (C(μ_{η_n}) − C_max), for some step-size parameter γ_n. We continue this iteration until |C(μ_{η_n}) − C_max| is smaller than a given ε > 0, and denote the resulting value by η*. Then, we approximate the two deterministic policies by μ_{η*−κ} and μ_{η*+κ}, where κ > 0 is a small perturbation, and the randomization probability ξ of the mixture policy can be obtained from:
(6) ξ C(μ_{η*−κ}) + (1 − ξ) C(μ_{η*+κ}) = C_max
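The multiplier iteration can be sketched as follows. For simplicity a constant step size replaces the general step-size sequence of the text, and a closed-form consumption function stands in for C(μ_η), which in the actual algorithm would be obtained by solving the unconstrained MDP at each η (e.g., via RVI) and measuring its average transmission rate:

```python
# Sketch of the dual multiplier search: raise eta when the induced
# policy consumes more than the budget C_max, lower it otherwise.
# consumption(eta) stands in for C(mu_eta); the closed form used in the
# demo below is purely illustrative.

def find_eta(consumption, c_max, eta0=1.0, step=2.0, eps=1e-4,
             max_iter=10_000):
    eta = eta0
    for _ in range(max_iter):
        gap = consumption(eta) - c_max
        if abs(gap) < eps:
            break                            # consumption matches budget
        eta = max(0.0, eta + step * gap)     # dual (multiplier) update
    return eta

toy_consumption = lambda eta: 1.0 / (1.0 + eta)  # decreasing in eta
eta_star = find_eta(toy_consumption, c_max=0.25)
print(round(toy_consumption(eta_star), 3))  # ~0.25 at convergence
```

The update direction matches the monotonicity noted above: consumption decreases in η, so an over-budget policy pushes η up until the budget binds.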
IV AoI with the Classical ARQ Protocol
Now, assume that the system adopts the classical ARQ protocol; that is, failed transmissions are discarded at the destination. In this case, there is no point in retransmitting a failed packet, since the successful transmission probabilities are the same for a retransmission and for the transmission of a new update. The state space reduces to (Δ_1, …, Δ_N), as r_i = 0 for all i, and the action space to A = {0, 1, …, N}, where a = i denotes transmitting a fresh update to user i. The probability of error of each status update is p_i ≜ g_i(0) for user i. The state transitions in (1), the Bellman optimality equations, and the RVI algorithm can all be simplified accordingly. Thanks to these simplifications, we are able to provide a closed-form lower bound for the constrained MDP.
IV-A Lower Bound on the AoI under a Resource Constraint
In this section, we derive a lower bound on the average AoI for the multi-user network with the standard ARQ protocol.
Theorem 1.
For a given network setup, we have J(π) ≥ J_LB for every feasible policy π, where J_LB is given by
(7)  
Proof.
The proof will be provided in the extended version of the paper. ∎
Previously, [6] proposed a universal lower bound on the average AoI for the broadcast channel with multiple users for the special unconstrained case C_max = 1. Differently from [6], the lower bound derived in this paper captures the effect of the constraint C_max, and even for C_max = 1 it is tighter than the lower bound provided in [6].
V Learning to Minimize AoI in an Unknown Environment
In most practical scenarios, the channel error probabilities for retransmissions may not be known at the time of deployment, or may change over time; that is, the source node does not have a priori information about the decoding error probabilities and has to learn them over time. We employ online learning algorithms to learn the error probabilities without significantly degrading the performance.
Upper Confidence RL (UCRL2) [18] is a well-known RL algorithm for generic MDP problems, with strong theoretical guarantees in the form of high-probability regret bounds. However, its computational complexity scales quadratically with the size of the state space, which makes the algorithm unsuitable for large state spaces. UCRL2 was originally proposed for generic MDPs with unknown rewards and transition probabilities, which therefore need to be learned for each state-action pair. For the average AoI problem, on the other hand, the number of parameters to be learned reduces to the transmission error probabilities of the users; thus, the computational complexity can be reduced significantly. In addition, the constrained structure of the average AoI problem requires further modifications to the UCRL2 algorithm, which we achieve in this paper by updating the Lagrange multiplier according to the empirical resource consumption.
V-A UCRL2 with Standard ARQ
In this section, we consider a multi-user network with standard ARQ, in which the source node transmits to multiple users with unknown and distinct error probabilities p_1, …, p_N. UCRL2 exploits the optimistic MDP, characterized by the optimistic estimates of the error probabilities within a certain confidence interval. The details of the algorithm are given in Algorithm 1, where p̂_i and p̃_i represent the empirical and the optimistic estimate, respectively, of the error probability for user i. We propose several methods to find the optimal policy using the optimistic estimates defined in steps 4 and 5 of Algorithm 1. In generic UCRL2, extended value iteration is used for steps 4 and 5, which has high computational complexity for large networks. For the average AoI problem, the computational complexity can be reduced, since the optimistic MDP can be found easily by taking the lower confidence bound on the error probabilities, and value iteration can then be adopted in step 5 to compute the policy induced by the optimistic MDP. The resulting algorithm is called UCRL2-VI.
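The optimistic estimate can be sketched as a lower confidence bound on each user's empirical error rate, since smaller error probabilities yield a more favorable MDP. The Hoeffding-style radius below is a generic stand-in; UCRL2's exact confidence bonus differs in its constants:

```python
import math

# Optimistic error-probability estimate in the spirit of UCRL2: take
# the lower edge of a confidence interval around the empirical NACK
# rate. The radius below is a generic Hoeffding-style stand-in, not
# UCRL2's exact confidence bonus.

def optimistic_error_prob(failures, attempts, t, delta=0.05):
    """Lower confidence bound on a user's error probability.

    failures/attempts: NACK count and transmission count for this user.
    t: current time slot (enters the radius so intervals keep tightening).
    """
    if attempts == 0:
        return 0.0                       # no data yet: be fully optimistic
    p_hat = failures / attempts
    radius = math.sqrt(math.log(2 * t / delta) / (2 * attempts))
    return max(0.0, p_hat - radius)

p_tilde = optimistic_error_prob(failures=30, attempts=100, t=1000)
print(0.0 <= p_tilde <= 0.3)  # True: never above the empirical mean
```

Feeding these per-user lower bounds into value iteration directly yields the optimistic MDP used in step 5 of UCRL2-VI.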
In order to further reduce the computational complexity, we can also adopt the suboptimal Whittle index policy proposed in [6] in step 5 of the algorithm. The resulting algorithm is called UCRL2-Whittle in this paper, and the policy in step 5 is found as follows:

Compute the Whittle index W_i for each user i (similarly to [6]):
(8) 
Compare the highest index with the Lagrange parameter η: if η is smaller than the highest index, the source transmits to the user with the highest index; otherwise, the source idles.
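The two steps above combine into the following decision rule. Only the compare-to-η logic follows the text; the quadratic index used as a default is a hypothetical placeholder for the actual index in (8):

```python
# Decision rule of the index policy in step 5: serve the user with the
# highest index if that index exceeds the Lagrange parameter eta,
# otherwise idle. The default index (weight * triangular age) is a
# hypothetical placeholder for the actual index formula in (8).

def schedule(ages, weights, eta, index_fn=None):
    if index_fn is None:
        index_fn = lambda w, age: w * age * (age + 1) / 2  # placeholder
    indices = [index_fn(w, a) for w, a in zip(weights, ages)]
    best = max(range(len(ages)), key=lambda i: indices[i])
    return best if indices[best] > eta else None   # None = stay idle

print(schedule([5, 2, 3], [1.0, 1.0, 1.0], eta=4.0))   # serves user 0
print(schedule([1, 1, 1], [1.0, 1.0, 1.0], eta=4.0))   # None: idles
```

Because only one scalar per user is compared against η, this rule avoids any value-iteration sweep, which is what makes UCRL2-Whittle cheaper than UCRL2-VI.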
V-B UCRL2 with HARQ
The pseudocode of the algorithm is given in Algorithm 2, where ĝ_i(r) and g̃_i(r) represent the empirical and the optimistic estimate, respectively, of the error probability for user i after r retransmissions.
VI Numerical Results
First, we analyze the average AoI in a multi-user setting with the standard ARQ protocol. The average AoI for a given resource constraint is illustrated in Figure 2 for a 3-user network with distinct error probabilities. It can be seen from Figure 2 that both UCRL2-VI and UCRL2-Whittle perform very close to the lower bound, particularly when C_max is low, i.e., when the system is more constrained. Although the UCRL2-Whittle algorithm has a significantly lower computational complexity, it performs very similarly to UCRL2-VI for all C_max values.
Figure 3 illustrates the average AoI with standard ARQ as a function of the network size when there is no constraint on the average number of transmissions (i.e., C_max = 1); the performance of the UCRL2 algorithm is compared with the lower bound, since the computational cost of value/policy iteration is very high for large networks. The learning algorithm performs close to the lower bound, and very close to the Whittle index policy [6], which assumes a priori knowledge of the error probabilities. Moreover, the UCRL2 algorithm outperforms the greedy benchmark policy, which always transmits to the user with the highest age, and the round-robin policy, which transmits to each user in turn.
The performance of UCRL2-Whittle and average-cost SARSA is shown in Figure 4. UCRL2-Whittle converges much faster than the standard average-cost SARSA algorithm, and it performs very close to the optimal policy computed by value iteration (VI) with known error probabilities. Figure 5 shows the performance of the learning algorithms for the HARQ protocol in a 2-user scenario. It is worth noting that although UCRL2-VI converges to the optimal policy in fewer iterations than average-cost SARSA, each iteration of UCRL2-VI is computationally more demanding, since it runs value iteration in each episode. Therefore, UCRL2-VI is not practical for problems with large state spaces, in our case for large networks.
VII Conclusion
Scheduling the transmission of status updates to multiple destination nodes has been considered with the average AoI as the performance measure. Under a resource constraint, the problem has been modeled as a CMDP, considering both the classical ARQ and the HARQ protocols, and an online scheduling policy has been proposed. A lower bound on the average AoI has been derived for the standard ARQ protocol. RL algorithms have been presented for scenarios in which the error probabilities are not known in advance, and demonstrated to perform very close to the optimum in the scenarios investigated through numerical simulations. The algorithms adopted in this paper are also relevant to other multi-user systems concerned with the timeliness of information, and the proposed methodology can be used in other CMDP problems.
References
 [1] E. Altman, R. E. Azouzi, D. S. Menasché, and Y. Xu, “Forever young: Aging control in DTNs,” CoRR, abs/1009.4733, 2010.
 [2] S. Kaul, M. Gruteser, V. Rai, and J. Kenney, “Minimizing age of information in vehicular networks,” in IEEE Coms. Society Conf. on Sensor, Mesh and Ad Hoc Coms. and Nets., 2011.
 [3] S. Kaul, R. Yates, and M. Gruteser, “Realtime status: How often should one update?” in Proc. IEEE INFOCOM, March 2012, pp. 2731–2735.
 [4] Y. Sun, E. UysalBiyikoglu, R. Yates, C. E. Koksal, and N. B. Shroff, “Update or wait: How to keep your data fresh,” in IEEE Int’l Conf. on Comp. Comms. (INFOCOM), April 2016, pp. 1–9.
 [5] B. T. Bacinoglu, E. T. Ceran, and E. UysalBiyikoglu, “Age of information under energy replenishment constraints,” in Inf. Theory and Applications Workshop (ITA), Feb 2015, pp. 25–31.
 [6] I. Kadota, E. UysalBiyikoglu, R. Singh, and E. Modiano, “Scheduling policies for minimizing age of information in broadcast wireless networks,” CoRR, 2018.
 [7] Y. P. Hsu, E. Modiano, and L. Duan, “Age of information: Design and analysis of optimal scheduling algorithms,” in IEEE Int’l Symp. on Inf. Theory (ISIT), June 2017, pp. 561–565.
 [8] Q. He, D. Yuan, and A. Ephremides, “Optimal link scheduling for age minimization in wireless systems,” IEEE Trans. on Inf. Theory, vol. PP, no. 99, pp. 1–1, 2017.
 [9] R. D. Yates, E. Najm, E. Soljanin, and J. Zhong, “Timely updates over an erasure channel,” in IEEE Int’l Symp. on Inf. Theory (ISIT), June 2017, pp. 316–320.
 [10] R. D. Yates and S. K. Kaul, “Status updates over unreliable multiaccess channels,” in IEEE Int’l Symp. on Inf. Theory (ISIT), June 2017, pp. 331–335.
 [11] J. Zhong, E. Soljanin, and R. D. Yates, “Status updates through multicast networks,” CoRR, vol. abs/1709.02427, 2017.
 [12] E. Najm, R. Yates, and E. Soljanin, “Status updates through M/G/1/1 queues with HARQ,” in IEEE International Symposium on Information Theory (ISIT), June 2017, pp. 131–135.
 [13] D. Gunduz, K. Stamatiou, N. Michelusi, and M. Zorzi, “Designing intelligent energy harvesting communication systems,” IEEE Communications Magazine, vol. 52, pp. 210–216, 2014.
 [14] E. T. Ceran, A. György, and D. Gündüz, “Average age of information with hybrid ARQ under a resource constraint,” in IEEE Wireless Comms. and Netw. Conf. (WCNC), April 2018.
 [15] V. Tripathi, E. Visotsky, R. Peterson, and M. Honig, “Reliability-based type II hybrid ARQ schemes,” in IEEE Int’l Conf. on Communications, vol. 4, May 2003, pp. 2899–2903.
 [16] E. Altman, Constrained Markov Decision Processes, ser. Stochastic modeling. Chapman & Hall/CRC, 1999.
 [17] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. NY, USA: John Wiley & Sons, 1994.
 [18] P. Auer, T. Jaksch, and R. Ortner, “Nearoptimal regret bounds for reinforcement learning,” in Advances in Neural Inf. Processing Systems 21. Curran Associates, Inc., 2009, pp. 89–96.