A Reinforcement Learning Approach to Age of Information in Multi-User Networks

by Elif Tuğçe Ceran, et al.
Imperial College London

Scheduling the transmission of time-sensitive data to multiple users over error-prone communication channels is studied with the goal of minimizing the long-term average age of information (AoI) at the users under a constraint on the average number of transmissions at the source node. After each transmission, the source receives instantaneous ACK/NACK feedback from the intended receiver and decides when and to which user to transmit the next update. The optimal scheduling policy is first studied under different feedback mechanisms when the channel statistics are known; in particular, the standard automatic repeat request (ARQ) and hybrid ARQ (HARQ) protocols are considered. Then a reinforcement learning (RL) approach is introduced, which does not assume any a priori information on the random processes governing the channel states. Different RL methods are verified and compared through numerical simulations.




I Introduction

We consider a source node that communicates the most up-to-date status packets to multiple users (see Figure 1). We are interested in the average age of information (AoI) [1, 2, 3] at the users, for a system in which the source node samples an underlying time-varying process and schedules the transmission of the sample values over imperfect links. The AoI at each user at any point in time can simply be defined as the amount of time elapsed since the most recent status update at that user was generated. Most of the earlier work on AoI considers queue-based models, in which the status updates arrive at the source node randomly following a memoryless Poisson process, and are stored in a buffer before being transmitted to the destination [2, 3]. Instead, in the so-called generate-at-will model [4, 1, 5, 6, 7], also considered in this paper, the status updates of the underlying process of interest can be generated at any time by the source node.

AoI in multi-user networks has been studied in [8, 7, 6, 9, 10, 11]. It is shown in [8] that the scheduling problem for the age minimization is NP-hard in general. Scheduling transmissions to multiple receivers is investigated in [7], focusing on a perfect transmission medium, and the optimal scheduling algorithm is shown to be threshold-type. Average AoI has also been studied when status updates over unreliable multi-access channels [10] and multi-cast networks [11] are considered. A base station sending time-sensitive information to a number of users through unreliable channels is considered in [6], where the problem is formulated as a multi-armed restless bandit. AoI in the presence of retransmissions has been considered in [12, 9]. The status update system is modeled as an M/G/1/1 queue in [12], where the status update arrivals are assumed to be memoryless and random. Maximum distance separable (MDS) coding is considered in [12], and the successful decoding probabilities are derived in closed form.

Figure 1: The system model of a status update system over error prone links in a multi-user network.

In this paper, we address the scheduling of status updates in a multi-user network for both the standard ARQ and HARQ protocols. Our goal is to minimize the expected average AoI under an average transmission-rate constraint. This constraint is motivated by the fact that sensors sending status updates usually have limited energy supplies (e.g., are powered via energy harvesting [13]); hence, they cannot afford to send an unlimited number of updates, or increase the signal-to-noise ratio in the transmission. First, we assume that the success probability before each transmission attempt is known; hence, the source can judiciously decide when to retransmit, and when to discard failed information and send a fresh update. Then, we consider scheduling status updates over unknown channels, in which the success probabilities of transmission attempts are not known a priori, and must be learned in an online fashion using the ACK/NACK feedback signals.

In previous work [14], we studied a point-to-point status update system in the presence of transmission errors and a resource constraint. Here, the results obtained in [14] are extended to the multi-user setting; in addition, more sophisticated reinforcement learning (RL) algorithms are proposed to minimize the average AoI and are demonstrated to perform very close to a lower bound.

The rest of the paper is organized as follows. In Section II, the system model is presented and the problem of minimizing the average AoI in multi-user networks under a resource constraint is formulated as a constrained Markov decision process (CMDP). After determining the structure of the optimal policy, a primal-dual algorithm is proposed to solve this CMDP in Section III. Minimization of the AoI for the standard ARQ protocol is investigated in Section IV, and a lower bound on the average AoI is presented. Section V introduces RL algorithms to minimize the AoI in an unknown environment. Simulation results are presented in Section VI, and the paper is concluded in Section VII.

II System Model and Problem Formulation

We consider a slotted status update system where multiple users await time-sensitive information regarding a time-varying process. The source monitors the underlying time-varying process, for which it is able to generate a status update at the beginning of each time slot. The source can only transmit the status update to a single user at each time slot. This can be either because of dedicated orthogonal links to the users, e.g., a wired network, or because the users are interested in distinct processes. A transmission attempt of a status update to a single user takes constant time, which is assumed to be equal to the duration of one time slot.

We assume that the channel state changes randomly from one time slot to the next in an independent and identically distributed fashion. We further assume the availability of an error- and delay-free single-bit ACK/NACK feedback from each user to the source node.

Let $M$ denote the number of users and $j \in \{1, \ldots, M\}$ denote the index for each user. The AoI for each user is defined as the time elapsed since the most up-to-date packet it has received was generated at the source. Assume that the most up-to-date packet at the destination at time $t$ has a generation time stamp of $U_j(t)$ for the $j$th user; then the AoI for user $j$ at the beginning of time slot $t$, denoted by $\delta_{j,t}$, is defined as $\delta_{j,t} \triangleq t - U_j(t)$. Therefore, $\delta_{j,t}$ increases by one when the source chooses not to transmit to user $j$ or a transmission fails, while it decreases to one (or, to the number of retransmissions in the case of HARQ) when a status update is successfully decoded.
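The age dynamics described above can be illustrated with a minimal Python sketch. The function name `next_age` and the argument `prev_attempts` are hypothetical, and the sketch assumes one particular reading of the HARQ case: a packet decoded after `prev_attempts` earlier failed attempts was generated `prev_attempts + 1` slots ago, which reduces to an age of one for a fresh update.

```python
def next_age(age: int, transmitted: bool, decoded: bool, prev_attempts: int = 0) -> int:
    """Return the AoI of one user at the start of the next slot.

    prev_attempts is the number of earlier failed attempts of the packet
    (0 for a fresh update); this bookkeeping is an assumption of the sketch.
    """
    if transmitted and decoded:
        # The decoded packet was generated prev_attempts + 1 slots ago.
        return prev_attempts + 1
    # No transmission to this user, or a failed one: the age grows by one.
    return age + 1


# Trace: idle slot, failed fresh update, successful retransmission of that packet.
age = 3
trace = []
for transmitted, decoded, prev in [(False, False, 0), (True, False, 0), (True, True, 1)]:
    age = next_age(age, transmitted, decoded, prev)
    trace.append(age)
```

Starting from an age of 3, the trace grows during the idle and failed slots and drops only when the retransmission is decoded.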

In the classical ARQ protocol, a packet is retransmitted after each NACK feedback, until it is successfully decoded. However, in the AoI framework there is no point in retransmitting a failed out-of-date status packet if it has the same error probability as a fresh status update. Hence, the source always discards a failed status update, and transmits a fresh one. On the other hand, in the HARQ protocol, signals from all previous transmission attempts are combined for decoding; therefore, the probability of error decreases with every retransmission [15].


Let $r_{j,t} \in \{0, \ldots, r_{max}\}$ denote the number of previous transmission attempts of the same packet to user $j$. Then, the state of the system can be described by the vector $s_t \triangleq (\delta_{1,t}, r_{1,t}, \ldots, \delta_{M,t}, r_{M,t})$. At each time slot, the source node takes one of several actions, denoted by $a \in \mathcal{A}$, where $\mathcal{A}$ denotes the set of possible actions. It can i) remain idle ($a = \mathrm{i}$); ii) generate and transmit a new status update packet to the $j$th user ($a = \mathrm{n}_j$); or iii) retransmit the previously failed packet to the $j$th user ($a = \mathrm{x}_j$). Without loss of generality, each user in the network is assumed to have a different priority level, represented by the weight $w_j$ for user $j$.

For the $j$th user, the probability of error after $r$ retransmissions, denoted by $g_j(r)$, depends on $r$, the particular HARQ scheme used for combining multiple transmission attempts, and the channel quality between the source and user $j$. An empirical method to estimate $g_j(r)$ is presented in [15]. As in any reasonable HARQ strategy, $g_j(r)$ is non-increasing in $r$, i.e., $g_j(r_1) \geq g_j(r_2)$ for all $r_1 \leq r_2$. To simplify the analysis and to comply with practical constraints, we assume that there is a maximum number of retransmissions $r_{max}$.
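As an illustration of an error probability that is non-increasing in the number of retransmissions, the following sketch uses a geometric decay. This functional form, the function names, and the default parameter values are assumptions chosen for the example; they are not the empirical model of [15].

```python
def harq_error_prob(r: int, g0: float = 0.5, decay: float = 0.5, r_max: int = 3) -> float:
    """Hypothetical error probability after r retransmissions: g0 * decay**r."""
    if not 0 <= r <= r_max:
        raise ValueError("r must lie in {0, ..., r_max}")
    return g0 * decay ** r


def is_non_increasing(values) -> bool:
    """Check the monotonicity property required of a reasonable HARQ scheme."""
    return all(a >= b for a, b in zip(values, values[1:]))


probs = [harq_error_prob(r) for r in range(4)]
```

With `decay = 1` the sketch reduces to standard ARQ, where every attempt fails with the same probability.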

Note that if no resource constraint is imposed on the source, remaining idle is clearly a suboptimal action since it does not contribute to decreasing the AoI. However, continuous transmission is typically not possible in practice due to energy or interference constraints. To model these situations, we impose a constraint on the average number of transmissions, denoted by $C_{max} \in (0, 1]$.

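This leads to the CMDP formulation, defined by the 5-tuple $(\mathcal{S}, \mathcal{A}, P, c, d)$ [16]: The countable set of states $\mathcal{S}$ and the finite set of actions $\mathcal{A}$ have already been defined. $P$ refers to the transition kernel, where $P_{s,s'}(a) = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$ is the probability that action $a$ in state $s$ at time $t$ will lead to state $s'$ at time $t+1$, which will be explicitly defined in (1). The instantaneous cost function $c: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, which models the weighted sum of AoI for multiple users, is defined as $c(s, a) = \Delta \triangleq \sum_{j=1}^{M} w_j \delta_j$ for any $s \in \mathcal{S}$, independently of $a$, where $w_j$ and $\delta_j$ are the weight and the AoI of user $j$. The instantaneous transmission cost related to the constraint, $d: \mathcal{A} \to \{0, 1\}$, is independent of the state and depends only on the action $a$, where $d(a) = 0$ if $a$ is the idle action, and $d(a) = 1$ otherwise. The transition probabilities of the CMDP are given below, where $P_{s,s'}(a)$ is zero elsewhere.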
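The two instantaneous costs of the CMDP can be written down directly. In the sketch below, the action encoding (the string `"i"` for idling, any other value for a transmission) and the function names are hypothetical choices made for illustration.

```python
IDLE = "i"  # hypothetical label for the idle action


def age_cost(ages, weights) -> float:
    """Weighted sum of the users' AoI; it does not depend on the action."""
    return sum(w * d for w, d in zip(weights, ages))


def transmission_cost(action) -> int:
    """Zero when the source idles, one for any (re)transmission."""
    return 0 if action == IDLE else 1


example_cost = age_cost(ages=[3, 1, 4], weights=[1, 2, 1])
```

Keeping the two costs separate mirrors the constrained formulation: the first is minimized, while the long-run average of the second is bounded.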


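A stationary policy is a decision rule represented by $\pi: \mathcal{S} \times \mathcal{A} \to [0, 1]$, which maps the state $s \in \mathcal{S}$ into action $a \in \mathcal{A}$ with some probability $\pi(a|s)$, where $\sum_{a \in \mathcal{A}} \pi(a|s) = 1$. We will use $s_t^\pi$ and $a_t^\pi$ to denote the sequences of states and actions, respectively, induced by policy $\pi$ with initial state $s_0$. Let $J^\pi(s_0)$ denote the infinite horizon average age, and $C^\pi(s_0)$ denote the expected average number of transmissions, when $\pi$ is employed with initial state $s_0$. We can state the CMDP optimization problem as follows:

Problem 1.

$$\min_{\pi} \; J^\pi(s_0) \triangleq \limsup_{T \to \infty} \frac{1}{T} \mathbb{E}\left[ \sum_{t=1}^{T} \Delta_t^\pi \,\middle|\, s_0 \right] \quad \text{subject to} \quad C^\pi(s_0) \triangleq \limsup_{T \to \infty} \frac{1}{T} \mathbb{E}\left[ \sum_{t=1}^{T} \mathbb{1}[a_t^\pi \neq \mathrm{i}] \,\middle|\, s_0 \right] \leq C_{max},$$

where $\Delta_t^\pi$ is the weighted sum of the users' AoI at time $t$ under policy $\pi$, and $\mathrm{i}$ denotes the idle action. A policy $\pi^*$ is called optimal if $J^{\pi^*}(s_0) \leq J^\pi(s_0)$ for all policies $\pi$ satisfying the constraint. For a deterministic policy, we will use $\pi(s)$ to denote the action taken with probability one in state $s$. Also, without loss of generality, we assume that the initial state $s_0$ at the beginning of the problem is fixed, and $s_0$ will be omitted from the notation for simplicity. We also assume throughout this paper that the Markov decision process (MDP) is unichain [16], similarly to [14].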

III Primal-Dual Algorithm to Minimize AoI

In this section, we derive the solution for Problem 1, based on [16]. While there exists a stationary and deterministic optimal policy for countable-state finite-action average-cost MDPs [17], this is not necessarily true for CMDPs [16].

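To solve the constrained MDP, we start by rewriting Problem 1 in its Lagrangian form. The average Lagrangian cost of a policy $\pi$ with Lagrange multiplier $\eta \geq 0$, denoted by $L_\eta^\pi$, is defined as

$$L_\eta^\pi = \limsup_{T \to \infty} \frac{1}{T} \mathbb{E}\left[ \sum_{t=1}^{T} \left( \Delta_t^\pi + \eta \, \mathbb{1}[a_t^\pi \neq \mathrm{i}] \right) \right],$$

where $\Delta_t^\pi$ is the weighted sum of the users' AoI at time $t$ and $\mathrm{i}$ denotes the idle action, and, for any $\eta$, the optimal achievable cost is defined as $L_\eta^* \triangleq \min_\pi L_\eta^\pi$. This formulation is equivalent to an unconstrained average-cost MDP, in which the instantaneous overall cost becomes $\Delta + \eta \, \mathbb{1}[a \neq \mathrm{i}]$, i.e., the weighted AoI plus $\eta$ times the transmission cost. It is well known that there exists an optimal stationary deterministic policy for this problem. In particular, there exists a function $h_\eta(s)$, called the differential cost function, satisfying the so-called Bellman optimality equations:

$$h_\eta(s) + L_\eta^* = \min_{a \in \mathcal{A}} \left( \Delta + \eta \, \mathbb{1}[a \neq \mathrm{i}] + \mathbb{E}\left[ h_\eta(s') \right] \right), \quad \forall s \in \mathcal{S}, \qquad (4)$$

where $s'$ is the next state obtained from $s$ after taking action $a$, and $\Delta$ is the weighted AoI in state $s$. Then the optimal policy, for any $\eta$, is given by the action achieving the minimum in (4):

$$\pi_\eta^*(s) \in \arg\min_{a \in \mathcal{A}} \left( \Delta + \eta \, \mathbb{1}[a \neq \mathrm{i}] + \mathbb{E}\left[ h_\eta(s') \right] \right).$$

The relative value iteration (RVI) algorithm can be employed to solve (4) for any given $\eta$; and hence, to find the policy $\pi_\eta^*$ (more precisely, an arbitrarily close approximation) [17].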
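To make the RVI step concrete, the following sketch solves the Lagrangian MDP for the simplest instance: a single user with standard ARQ, unit weight, and error probability `p_err`. The truncation of the age at `max_age`, the choice of the age-one state as reference, and all names are assumptions of this sketch rather than details taken from the paper.

```python
import numpy as np


def rvi_single_user_arq(p_err, eta, max_age=50, tol=1e-9, max_iter=100_000):
    """Relative value iteration for a single-user ARQ link.

    States are ages 1..max_age (truncated); actions are idle (0) and
    transmit (1). The per-slot cost is age + eta * action.
    Returns (average Lagrangian cost, differential costs, policy).
    """
    ages = np.arange(1, max_age + 1)
    # Array index of the state min(age + 1, max_age) for every state.
    nxt = np.minimum(ages, max_age - 1)
    h = np.zeros(max_age)
    for _ in range(max_iter):
        q_idle = ages + h[nxt]
        # A success (prob. 1 - p_err) resets the age to 1, i.e. index 0.
        q_tx = ages + eta + p_err * h[nxt] + (1 - p_err) * h[0]
        th = np.minimum(q_idle, q_tx)
        gain = th[0]              # value at the reference state (age 1)
        h_new = th - gain         # keep the reference state pinned at zero
        converged = np.max(np.abs(h_new - h)) < tol
        h = h_new
        if converged:
            break
    policy = (q_tx < q_idle).astype(int)
    return float(gain), h, policy


gain, h, policy = rvi_single_user_arq(p_err=0.2, eta=0.0)
```

With `eta = 0` transmitting is free, so the computed policy transmits in every state; increasing `eta` makes the policy idle at small ages, which is the mechanism used to meet the transmission constraint.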

Similarly to Corollary 1 in [14], it is possible to characterize optimal policies for our CMDP problem using the deterministic policies $\pi_\eta^*$, $\eta > 0$: Specializing Theorem 4.4 of [16] to Problem 1 (since it has a single global constraint), one can think of the optimal policy as a randomized mixture of two deterministic policies $\pi_{\eta_1}^*$ and $\pi_{\eta_2}^*$: in any state $s$, the optimal policy in the CMDP problem chooses action $\pi_{\eta_1}^*(s)$ with probability $\mu$ and $\pi_{\eta_2}^*(s)$ with probability $1 - \mu$, independently for each time slot, where $\pi_\eta^*(s)$ describes the deterministic choice of the optimal policy in the unconstrained MDP with Lagrange multiplier $\eta$.

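For any $\eta$, let $C_\eta$ and $J_\eta$ denote the average resource consumption and the average AoI under the optimal policy $\pi_\eta^*$ (note that $C_\eta$ and $J_\eta$ can be computed directly through finding the stationary distribution of the chain, but can also be estimated empirically just by running the MDP with policy $\pi_\eta^*$). Obviously, $C_\eta$ and $J_\eta$ are monotone functions of $\eta$. Therefore, given $\eta_1$ and $\eta_2$, one can find a weight, denoted by $\mu \in [0, 1]$, by solving $\mu C_{\eta_1} + (1 - \mu) C_{\eta_2} = C_{max}$, which has a solution if $C_{max}$ lies between $C_{\eta_1}$ and $C_{\eta_2}$.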

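Next, we present a heuristic method to find $\eta_1$ and $\eta_2$: With the aim of finding a single $\eta$ value such that $C_\eta \approx C_{max}$, starting with an initial parameter $\eta^0$, we run an iterative algorithm updating $\eta$ as $\eta^{m+1} = \eta^m + \alpha_m (C_{\eta^m} - C_{max})$ for some step size parameter $\alpha_m$. We continue this iteration until $|C_{\eta^m} - C_{max}|$ is smaller than a given $\epsilon$, and denote the resulting value as $\eta^*$. Then, we approximate the values of $\eta_1$ and $\eta_2$ by $\eta^* \pm \xi$, where $\xi$ is a small perturbation, and the mixture policy can be obtained as:

$$\pi^*(s) = \begin{cases} \pi^*_{\eta^* - \xi}(s) & \text{with probability } \mu, \\ \pi^*_{\eta^* + \xi}(s) & \text{with probability } 1 - \mu. \end{cases}$$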
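The multiplier search and the mixing weight can be sketched as follows. The consumption curve passed to `find_multiplier` stands in for the average number of transmissions of the optimal Lagrangian policy; the smooth toy curve `1 / (1 + eta)` used in the example, the constant step size, and all function names are assumptions made for illustration. In the actual problem the consumption is obtained from the stationary distribution or from simulation and need not be smooth, which is why the method is only a heuristic.

```python
def find_multiplier(consumption, c_max, eta0=1.0, step=1.0, eps=1e-3, max_iter=10_000):
    """Raise eta when too many transmissions are used, lower it otherwise."""
    eta = eta0
    for _ in range(max_iter):
        gap = consumption(eta) - c_max
        if abs(gap) < eps:
            break
        # The multiplier is kept non-negative.
        eta = max(0.0, eta + step * gap)
    return eta


def mixing_weight(c_eta1, c_eta2, c_max):
    """Solve mu * c_eta1 + (1 - mu) * c_eta2 = c_max for mu."""
    return (c_max - c_eta2) / (c_eta1 - c_eta2)


def toy_consumption(eta):
    """Hypothetical decreasing consumption curve, for illustration only."""
    return 1.0 / (1.0 + eta)


eta_star = find_multiplier(toy_consumption, c_max=0.4)
xi = 0.05
mu = mixing_weight(toy_consumption(eta_star - xi), toy_consumption(eta_star + xi), 0.4)
```

The weight `mu` then randomizes between the two policies obtained at the perturbed multipliers so that the constraint is met with equality on average.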


IV AoI with Classical ARQ Protocol

Now, assume that the system adopts the classical ARQ protocol; that is, failed transmissions are discarded at the destination. In this case, there is no point in retransmitting a failed packet since the successful transmission probabilities are the same for a retransmission and the transmission of a new update. The state space reduces to $(\delta_1, \ldots, \delta_M)$, as the retransmission count is always zero, and the action space to idling or transmitting a new update to one of the $M$ users. The probability of error of each status update is $p_j \triangleq g_j(0)$ for user $j$. State transitions in (1), Bellman optimality equations and the RVI algorithm can all be simplified accordingly. Thanks to these simplifications, we are able to provide a closed-form lower bound to the constrained MDP.

IV-A Lower Bound on the AoI under Resource Constraint

In this section, we derive a lower bound to the average AoI for the multi-user network with standard ARQ protocol.

Theorem 1.

For a given network setup, we have , , where


The proof will be provided in the extended version of the paper. ∎

Previously, [6] proposed a universal lower bound on the average AoI for the broadcast channel with multiple users for the special case of $C_{max} = 1$. Differently from [6], the lower bound derived in this paper shows the effect of the constraint ($C_{max}$), and even for $C_{max} = 1$ it is tighter than the lower bound provided in [6].

V Learning to minimize AoI in an unknown environment

In most practical scenarios, the channel error probabilities for retransmissions may not be known at the time of deployment, or may change over time. The source node then has no a priori information about the decoding error probabilities and has to learn them over time. We employ online learning algorithms to learn the error probabilities over time without degrading the performance significantly.

The Upper Confidence RL (UCRL2) [18] is a well-known RL algorithm for generic MDP problems which has strong theoretical guarantees with regard to high probability regret bounds. However, the computational complexity of the algorithm scales quadratically with the size of the state space, which makes the algorithm unsuitable for large state spaces. UCRL2 has been initially proposed for generic MDPs with unknown rewards and transition probabilities: thus, they need to be learned for each state-action pair. On the other hand, for the average AoI problem, the number of parameters to be learned can be reduced to the number of transmission error probabilities to each user; thus, the computational complexity can be reduced significantly. In addition, the constrained structure of the average AoI problem requires additional modifications to the UCRL2 algorithm, which is achieved in this paper by updating the Lagrange multiplier according to the empirical resource consumption.

V-A UCRL2 with standard ARQ

In this section, we consider a multi-user network with standard ARQ where a source node transmits to multiple users with unknown and distinct error probabilities $p_j$. UCRL2 exploits the optimistic MDP characterized by the optimistic estimation of error probabilities within a certain confidence interval. The details of the algorithm are given in Algorithm 1, where $\hat{p}_j$ and $\tilde{p}_j$ represent the empirical and the optimistic estimates of the error probability for user $j$.

We propose several methods to find the optimal policy using the optimistic estimate $\tilde{p}_j$ defined in steps 4 and 5 of Algorithm 1. In the generic UCRL2, extended value iteration (VI) is used for steps 4 and 5, which has high computational complexity for large networks. For the average AoI problem, the computational complexity can be reduced since the optimistic MDP can be found easily using the lower bound for the error probabilities, and value iteration can be adopted to compute the policy induced by $\tilde{p}_j$ in step 5. The resulting algorithm is called UCRL2-VI.

In order to further reduce the computational complexity, we can also adopt a suboptimal Whittle index policy, proposed in [6], in step 5 of the algorithm. The resulting algorithm is called UCRL2-Whittle in this paper, and the policy in step 5 can be found as follows:

  • Compute the index for each user (similarly to [6]),

  • Compare the highest index with the Lagrange parameter $\eta$: if $\eta$ is smaller, then the source transmits to the user with the highest index; otherwise the source idles.
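The two-step rule above can be sketched in a few lines. The index values are taken as given, since their closed form is derived in [6] and not reproduced here; the function name and the convention of returning `None` for idling are hypothetical.

```python
from typing import Optional, Sequence


def whittle_action(indices: Sequence[float], eta: float) -> Optional[int]:
    """Return the user to transmit to, or None when the source should idle.

    indices[j] is the (precomputed) index of user j; eta is the Lagrange
    parameter acting as a price on transmissions.
    """
    best = max(range(len(indices)), key=lambda j: indices[j])
    # Transmit only if the best index exceeds the transmission price.
    return best if indices[best] > eta else None


choice_cheap = whittle_action([0.5, 2.0, 1.0], eta=1.5)
choice_costly = whittle_action([0.5, 2.0, 1.0], eta=2.5)
```

Because only one comparison per slot is needed once the indices are known, this rule avoids running value iteration over the joint state space.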

0:  A confidence parameter , an update parameter , , confidence bound , , .
1:  , and observe the initial state .
2:  for episodes  do
3:      Set ,,,,.
4:      Compute the optimistic error probabilities
5:      Use to find a policy
6:      Execute policy
7:      while  do
8:          Choose an action ,Obtain cost and observe Update , Set ;
9:      end while
10:  end for
Algorithm 1 UCRL2 for the average AoI with standard ARQ.
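Step 4 of Algorithm 1 can be illustrated as follows. Since a lower error probability is the favorable case, the optimistic estimate is the lower end of a confidence interval around the empirical error rate. The Hoeffding-style radius used below is an assumption chosen for illustration; it is not necessarily the confidence bound used in the algorithm, and the function and parameter names are hypothetical.

```python
import math


def optimistic_error_prob(failures: int, attempts: int, confidence: float = 0.05) -> float:
    """Lower confidence bound on a user's error probability.

    failures / attempts is the empirical error rate from ACK/NACK feedback.
    The radius sqrt(log(1 / confidence) / (2 * attempts)) is a Hoeffding-type
    bound assumed for this sketch.
    """
    if attempts == 0:
        # Nothing observed yet: be fully optimistic.
        return 0.0
    p_hat = failures / attempts
    radius = math.sqrt(math.log(1.0 / confidence) / (2.0 * attempts))
    return max(0.0, p_hat - radius)


few_samples = optimistic_error_prob(failures=30, attempts=100)
many_samples = optimistic_error_prob(failures=300, attempts=1000)
```

As more feedback is collected, the radius shrinks and the optimistic estimate approaches the empirical error rate from below.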


The pseudocode of the algorithm for the HARQ protocol is given in Algorithm 2, where $\hat{g}_j(r)$ and $\tilde{g}_j(r)$ represent the empirical and the optimistic estimates of the error probability for user $j$, after $r$ retransmissions.

0:  A confidence parameter , an update parameter , , , .
1:  , and observe the initial state .
2:  for episodes  do
3:      Set ,,,,,.
4:      Compute the optimistic error probabilities
5:      Use and value iteration to find a policy
6:      Execute policy
7:      while  do
8:          Choose an action ,Obtain cost and observe Update , Set ;
9:      end while
10:  end for
Algorithm 2 UCRL2 for the average AoI with HARQ.

VI Numerical Results

First, we analyze the average AoI in a multi-user setting with the standard ARQ protocol. The average AoI for a given resource constraint is illustrated in Figure 2 for a 3-user network with given error probabilities. It can be seen from Figure 2 that both UCRL2-VI and UCRL2-Whittle perform very close to the lower bound, particularly when $C_{max}$ is low, i.e., when the system is more constrained. Although the UCRL2-Whittle algorithm has a significantly lower computational complexity, it performs very similarly to UCRL2-VI for all $C_{max}$ values.

Figure 2: Average AoI with respect to for a 3-user network with and error probabilities , . Time horizon is set to , and the results are averaged over runs.

Figure 3 illustrates the average AoI with standard ARQ with respect to the size of the network when there is no constraint on the average number of transmissions (i.e., $C_{max} = 1$). The performance of the UCRL2 algorithm is compared with the lower bound, since the computational cost of value/policy iteration algorithms is very high. The learning algorithm performs close to the lower bound and very close to the Whittle index policy [6], which assumes a priori knowledge of the error probabilities. Moreover, the UCRL2 algorithm outperforms the greedy benchmark policy, which always transmits to the user with the highest age, and the round-robin policy, which transmits to each user in turn.
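The two benchmark policies are simple enough to state in code. The sketch below also includes a small simulator for the unconstrained ARQ setting with equal weights; the simulator, its normalization (average age per user per slot), and all names are assumptions made for illustration and do not reproduce the simulation setup of the figures.

```python
import random


def greedy_user(ages):
    """Greedy benchmark: the user with the highest age (first one on ties)."""
    return max(range(len(ages)), key=lambda j: ages[j])


def round_robin_user(t, num_users):
    """Round robin: users are served in turn."""
    return t % num_users


def simulate(policy, error_probs, horizon=10_000, seed=0):
    """Average per-user AoI when one user is served in every slot (ARQ)."""
    rng = random.Random(seed)
    ages = [1] * len(error_probs)
    total = 0
    for t in range(horizon):
        total += sum(ages)
        user = policy(t, ages)
        success = rng.random() >= error_probs[user]
        # The served user's age resets to 1 on success; all others grow.
        ages = [1 if (j == user and success) else a + 1 for j, a in enumerate(ages)]
    return total / (horizon * len(error_probs))


rr_avg = simulate(lambda t, ages: round_robin_user(t, len(ages)), [0.0, 0.0, 0.0])
greedy_avg = simulate(lambda t, ages: greedy_user(ages), [0.0, 0.0, 0.0])
```

On error-free links the two benchmarks coincide, since the user served least recently is also the one with the highest age; they differ once the error probabilities are unequal.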

Figure 3: Average AoI for networks with different sizes where , and .The simulation results are averaged over 100 runs.

The performance of UCRL2-Whittle and average-cost SARSA is shown in Figure 4. UCRL2-Whittle converges much faster than the standard average-cost SARSA algorithm, and it performs very close to the optimal policy computed by value iteration (VI) with known error probabilities. Figure 5 shows the performance of the learning algorithms with the HARQ protocol for a 2-user scenario. It is worth noting that although UCRL2-VI converges to the optimal policy in fewer iterations than average-cost SARSA, each iteration of UCRL2-VI is computationally more demanding since it uses value iteration in each episode. Therefore, UCRL2-VI is not practical for problems with large state spaces, in our case for large networks.

Figure 4: Average AoI for networks for a 3-user ARQ network with and error probabilities where and . The simulation results are averaged over 100 runs.
Figure 5: Average AoI for networks for a 2-user HARQ network with and error probabilities and where and . The simulation results are averaged over 100 runs.

VII Conclusion

Scheduling the transmission of status updates to multiple destination nodes has been considered with the average AoI as the performance measure. Under a resource constraint, the problem is modeled as a CMDP considering both the classical ARQ and the HARQ protocols, and an online scheduling policy has been proposed. A lower bound on the average AoI has been shown for the standard ARQ protocol. RL algorithms are presented for scenarios in which the error probabilities may not be known in advance, and are demonstrated to perform very close to optimal in the scenarios investigated in numerical simulations. The algorithms adopted in this paper are also relevant to different multi-user systems concerning the timeliness of information, and the proposed methodology can be used in other CMDP problems.


  • [1] E. Altman, R. E. Azouzi, D. S. Menasché, and Y. Xu, “Forever young: Aging control in DTNs,” CoRR, abs/1009.4733, 2010.
  • [2] S. Kaul, M. Gruteser, V. Rai, and J. Kenney, “Minimizing age of information in vehicular networks,” in IEEE Coms. Society Conf. on Sensor, Mesh and Ad Hoc Coms. and Nets., 2011.
  • [3] S. Kaul, R. Yates, and M. Gruteser, “Real-time status: How often should one update?” in Proc. IEEE INFOCOM, March 2012, pp. 2731–2735.
  • [4] Y. Sun, E. Uysal-Biyikoglu, R. Yates, C. E. Koksal, and N. B. Shroff, “Update or wait: How to keep your data fresh,” in IEEE Int’l Conf. on Comp. Comms. (INFOCOM), April 2016, pp. 1–9.
  • [5] B. T. Bacinoglu, E. T. Ceran, and E. Uysal-Biyikoglu, “Age of information under energy replenishment constraints,” in Inf. Theory and Applications Workshop (ITA), Feb 2015, pp. 25–31.
  • [6] I. Kadota, E. Uysal-Biyikoglu, R. Singh, and E. Modiano, “Scheduling policies for minimizing age of information in broadcast wireless networks,” CoRR, 2018.
  • [7] Y. P. Hsu, E. Modiano, and L. Duan, “Age of information: Design and analysis of optimal scheduling algorithms,” in IEEE Int’l Symp. on Inf. Theory (ISIT), June 2017, pp. 561–565.
  • [8] Q. He, D. Yuan, and A. Ephremides, “Optimal link scheduling for age minimization in wireless systems,” IEEE Trans. on Inf. Theory, vol. PP, no. 99, pp. 1–1, 2017.
  • [9] R. D. Yates, E. Najm, E. Soljanin, and J. Zhong, “Timely updates over an erasure channel,” in IEEE Int’l Symp. on Inf. Theory (ISIT), June 2017, pp. 316–320.
  • [10] R. D. Yates and S. K. Kaul, “Status updates over unreliable multiaccess channels,” in IEEE Int’l Symp. on Inf. Theory (ISIT), June 2017, pp. 331–335.
  • [11] J. Zhong, E. Soljanin, and R. D. Yates, “Status updates through multicast networks,” CoRR, vol. abs/1709.02427, 2017.
  • [12] E. Najm, R. Yates, and E. Soljanin, “Status updates through M/G/1/1 queues with HARQ,” in IEEE International Symposium on Information Theory (ISIT), June 2017, pp. 131–135.
  • [13] D. Gunduz, K. Stamatiou, N. Michelusi, and M. Zorzi, “Designing intelligent energy harvesting communication systems,” IEEE Communications Magazine, vol. 52, pp. 210–216, 2014.
  • [14] E. T. Ceran, A. György, and D. Gündüz, “Average age of information with hybrid ARQ under a resource constraint,” in IEEE Wireless Comms. and Netw. Conf. (WCNC), April 2018.
  • [15] V. Tripathi, E. Visotsky, R. Peterson, and M. Honig, “Reliability-based type II hybrid ARQ schemes,” in IEEE Int’l Conf. on Communications, vol. 4, May 2003, pp. 2899–2903.
  • [16] E. Altman, Constrained Markov Decision Processes, ser. Stochastic modeling.   Chapman & Hall/CRC, 1999.
  • [17] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming.   NY, USA: John Wiley & Sons, 1994.
  • [18] P. Auer, T. Jaksch, and R. Ortner, “Near-optimal regret bounds for reinforcement learning,” in Advances in Neural Inf. Processing Systems 21.   Curran Associates, Inc., 2009, pp. 89–96.