I Introduction
Cyber-Physical Systems (CPS), the Internet of Things (IoT), remote automation, and the Haptic Internet are all examples of systems and applications that require real-time monitoring and low-latency information delivery. The growth of time-sensitive traffic has led to a new data-freshness measure named Age of Information (AoI). AoI measures packet freshness at the destination, accounting for the time elapsed since the last update generated by a given source [1].
Consider a cyber-physical system such as an automated factory, where a number of sensors located at a remote site transmit time-sensitive information to a remote observer through the Internet (a multi-hop wired and wireless network). Each sensor's job is to sample measurements of a physical phenomenon and transmit them to the monitoring site. Because the network bandwidth varies for many reasons, such as resource competition, packet corruption, and wireless channel variations, maintaining fresh data at the monitoring side becomes a challenging problem.
Recently, several papers have tackled the problem of minimizing the AoI of a number of sources competing for the available resources. The work in [2] considers many sensors connected wirelessly to a single monitoring node and formulates an optimization problem that minimizes the weighted expected AoI of the sensors at the monitoring node. The authors of [3] consider the sum expected AoI minimization problem when constraints on packet deadlines are imposed. In [4], sum expected AoI minimization is considered in a cognitive shared-access network.
However, AoI minimization for ultra-reliable low-latency communication (URLLC) systems should pay attention to the tail behavior in addition to optimizing average metrics [5]. In URLLC, the probability that the AoI of each node sharing the resources exceeds a predetermined threshold should be minimized. The AoI minimization metric for URLLC should therefore capture the trade-off between minimizing the expected AoI of each node and keeping the probability that each node's AoI exceeds a predefined threshold at its minimum. Hence, in this paper, we optimize a metric that is a weighted sum of two terms: (i) the expected AoI, and (ii) the probability of threshold violation.
Introducing the second metric into the objective function, accounting for the fact that different nodes generate packets of different sizes and can tolerate different threshold-violation percentages, accounting for the lack of non-causal knowledge of the available bandwidth, and including the non-zero packet drop probability that might be encountered in the network all contribute to the complexity of the problem. In other words, the proposed scheduling problem is a stochastic optimization problem with integer (non-convex) constraints, which is in general hard to solve in polynomial time.
Motivated by the success of machine learning, when trained offline, in solving many online large-scale networking problems, we propose in this paper an algorithm based on Reinforcement Learning (RL) to solve the proposed formulation. RL is defined by three components: state, action, and reward. Given a state, the RL agent is trained offline to choose the action that maximizes the system reward. The RL agent interacts with the environment continuously and tries to find the best policy based on the reward/cost fed back from that environment [6]. In other words, the RL agent tries to choose the trajectory of actions that leads to the maximum average reward.
The rest of the paper is organized as follows. In Section II, we describe our system model and problem formulation. In Section III, we explain the reinforcement-learning-based approach we propose for solving the formulation. In Section IV, we describe our implementation in detail and discuss the results. Finally, in Section V, we conclude the paper.
II System Model and Problem Formulation
The system model considered in this work focuses on a remote monitoring/automation scenario (Fig. 1) in which the monitor/controller resides in a remote site and allows one sensor at a time to update its state by generating a fresh packet and sending it to the monitor/controller. We assume that there are $N$ sensors, and only one sensor $n$ is chosen to update its state at job number $i$ by sending a packet of size $B_n$ bytes at a rate $r(i)$ to the controller, where $r(i)$ is the average rate at the time of job number $i$.
Let $x_n(i)$ be an indicator variable for the selection of sensor $n$ by the controller at job id $i$, i.e., $x_n(i) = 1$ if the controller selects sensor $n$ at job id $i$, and $x_n(i) = 0$ otherwise. When $x_n(i) = 1$, sensor $n$ samples a new state, generates a packet accordingly, and sends it to the controller. The packet is successfully received by the controller with probability $p_n$; therefore, the packet is corrupted with probability $1 - p_n$. Let $K_n(i)$ be the number of times sensor $n$ transmits its packet at job id $i$ before it is successfully received at the controller. For example, if the packet fails at the first transmission and is then successfully received at the second transmission, then $K_n(i) = 2$. Since the controller selects only one sensor at any job id $i$, the following constraints must hold for all $i$:
$$\sum_{n=1}^{N} x_n(i) = 1, \quad \forall i \qquad (1)$$
$$x_n(i) \in \{0, 1\}, \quad \forall n, i \qquad (2)$$
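If transmission failures are assumed independent across attempts with per-attempt success probability $p_n$, the transmission count described above is a geometric random variable. A minimal sampling sketch (function name hypothetical):

```python
import random

def transmissions_until_success(p_success, rng=None):
    """Sample the total number of transmission attempts until the packet
    is first received, assuming independent failures per attempt."""
    rng = rng or random.Random()
    attempts = 1
    while rng.random() >= p_success:
        attempts += 1
    return attempts
```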
Finally, let $A_n(i)$ be the age of the $n$th sensor's information, in seconds, at the time of the controller's job $i$, and let $T(i) = \sum_{m=1}^{N} x_m(i) K_m(i) B_m / r(i)$ denote the duration of job $i$, i.e., the time spent transmitting the selected sensor's packet until it is successfully received. For example, if the $n$th sensor successfully transmits a packet to the controller at job $i$, the age of the $n$th sensor's information at the controller will be $T(i)$. Mathematically, the AoI of sensor $n$ at the time of job $i$ evolves according to the following equation:
$$A_n(i) = \begin{cases} T(i), & x_n(i) = 1 \\ A_n(i-1) + T(i), & x_n(i) = 0 \end{cases} \qquad (3)$$
Equation (3) simply states that the AoI of sensor $n$ at the time of job $i$ equals the time spent transmitting a packet from sensor $n$ to the controller if sensor $n$ was selected in job $i$. Otherwise, if another sensor was selected in job $i$, the AoI of sensor $n$ at the controller at the time of job $i$ equals its AoI at the time of job $i-1$ plus the time spent transmitting the selected sensor's packet to the controller.
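The per-job AoI recursion described above can be sketched in a few lines of Python (names hypothetical; `duration` is the selected sensor's total transmission time for the job):

```python
def update_aoi(aoi, selected, duration):
    """One step of the AoI recursion: the chosen sensor's age resets to the
    job duration; every other sensor's age grows by that duration."""
    return [duration if n == selected else a + duration
            for n, a in enumerate(aoi)]
```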
To account for latency and reliability, we minimize an objective function that is a weighted sum of two terms: the first is the sum average AoI, and the second is the sum over sensors of the probability that the AoI at any job exceeds a predefined threshold. The overall AoI minimization problem is therefore:
Minimize:
$$\lim_{I \to \infty} \frac{1}{I} \sum_{i=1}^{I} \sum_{n=1}^{N} A_n(i) \;+\; \sum_{n=1}^{N} w_n \Pr\big(A_n(i) > d_n\big) \qquad (4)$$
subject to constraints (1) and (2),
where $d_n$ is the predefined threshold of sensor $n$. In our objective function (4), we assume that different sensors can tolerate different AoI thresholds. $w_n$ is the weight of sensor $n$: the higher the weight, the greater the penalty for exceeding the sensor's AoI threshold. Finally, in order to optimize the tail performance, we should choose $w_n \gg 1$, so that exceeding $d_n$ carries a very high penalty and reduces the performance significantly.
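As a sanity check on the objective, its empirical value over a finite trace of per-job AoI vectors can be computed as follows (a sketch; function and argument names are hypothetical):

```python
def empirical_objective(aoi_trace, thresholds, weights):
    """Empirical objective: time-averaged sum AoI plus the weighted
    frequency of threshold violations, one term per sensor."""
    jobs = len(aoi_trace)
    avg_sum_aoi = sum(sum(row) for row in aoi_trace) / jobs
    violation = sum(
        w * sum(row[n] > d for row in aoi_trace) / jobs
        for n, (d, w) in enumerate(zip(thresholds, weights))
    )
    return avg_sum_aoi + violation
```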
Structure of the problem: The proposed problem is a stochastic optimization problem with integer (non-convex) constraints. Integer-constrained problems, even in deterministic settings, are known to be NP-hard in general; very few problems in this class of discrete optimization are known to be solvable in polynomial time. Moreover, neither the packet drop probability nor the available bandwidth is known non-causally, and all sensors compete for the available bandwidth. Therefore, we propose a learning-based algorithm to learn and solve the proposed problem. In particular, we investigate the use of Reinforcement Learning (RL). In the next section, we describe our proposed algorithm for solving our formulated problem using RL.
III Proposed Algorithm Based on Reinforcement Learning
In this paper, we consider a learning-based approach that finds the scheduling policy from real observations. Our approach is based on reinforcement learning (RL). In RL, an agent interacts with an environment. At each job $i$, the agent observes some state $s_i$ and performs an action $a_i$. After performing the action, the state of the environment transitions to $s_{i+1}$, and the agent receives a reward $r_i$. The objective of learning is to maximize the expected cumulative discounted reward, defined as $\mathbb{E}\big[\sum_{i} \gamma^i r_i\big]$, where $\gamma \in (0, 1]$ is the discount factor. The discounted reward is considered to ensure that the current action accounts for the long-term system reward.
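For a finite trajectory, the discounted return defined above can be computed with a single backward pass (a minimal sketch):

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted reward sum_i gamma^i * r_i, accumulated
    backwards so each step costs one multiply-add."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```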
Our reinforcement learning approach is described in Fig. 2. As shown, the scheduling policy is obtained by training a neural network. The agent observes a set of metrics, including the current AoI of every sensor and the throughput achieved in the last jobs, and feeds these values to the neural network, which outputs the action. The action specifies which sensor to choose for the next job. The reward is then observed and fed back to the agent, which uses the reward information to train and improve its neural network model. We explain the training algorithm later in this section. Our reward function, state, and action spaces are defined as follows:
Reward Function: The reward at the end of job $i$ is defined by the following equation, which combines two terms: the sum of all sensors' AoIs at the end of job $i$, and a weighted sum of the penalties incurred when AoI thresholds are exceeded.
$$r_i = -\sum_{n=1}^{N} A_n(i) \;-\; \sum_{n=1}^{N} w_n \mathbb{1}\big(A_n(i) > d_n\big) \qquad (5)$$
where $\mathbb{1}(\cdot)$ is the indicator function.
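A minimal sketch of this per-job reward (names hypothetical; maximizing the reward minimizes the objective, hence the negation):

```python
def reward(aoi, thresholds, weights):
    """Per-job reward: negative sum of AoIs minus the weighted
    threshold-violation penalties."""
    penalty = sum(w for a, d, w in zip(aoi, thresholds, weights) if a > d)
    return -(sum(aoi) + penalty)
```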

State: The state $s_i$ at job number $i$ consists of:
- the age of every sensor at job $i$;
- the throughput that was achieved for the last assigned jobs;
- the time that was spent downloading the last packet, which reflects the packet size as well as the current network conditions (packet drop and throughput).

Action: The action is represented by a probability vector whose length equals the number of sensors, such that if the $n$th element is the maximum, sensor $n$ is scheduled for job id $i$.
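At inference time, mapping the policy's probability vector to a sensor choice is a single argmax (a sketch; the function name is hypothetical):

```python
def select_sensor(action_probs):
    """Pick the sensor whose entry in the policy's probability
    vector is largest (greedy selection at inference time)."""
    return max(range(len(action_probs)), key=action_probs.__getitem__)
```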
The first step toward generating the scheduling algorithm using RL is to run a training phase in which the learning agent explores the network environment. To train our RL-based system, we use A3C [7], a state-of-the-art actor-critic algorithm that involves training two neural networks.
Given a state $s_i$ as described in Fig. 2, the RL agent takes an action $a_i$ that corresponds to choosing one sensor for job $i$. The agent selects an action based on a policy, defined as a probability distribution over actions: $\pi(s_i, a_i)$ is the probability of choosing action $a_i$ given state $s_i$. The policy is a function of a parameter $\theta$, referred to as the policy parameter in reinforcement learning. Therefore, for each choice of $\theta$, we have a parameterized policy $\pi_\theta(s_i, a_i)$.
After performing an action at job id $i$, a reward $r_i$ is observed by the RL agent. The reward reflects the performance of each action (sensor selection) with respect to the metric we need to optimize. The policy gradient method [8] is used to train the actor-critic policy. Here, we describe the main steps of the algorithm. The main job of policy gradient methods is to estimate the gradient of the expected total reward. The gradient of the cumulative discounted reward with respect to the policy parameters $\theta$ is computed as:
$$\nabla_\theta \, \mathbb{E}\Big[\sum_i \gamma^i r_i\Big] = \mathbb{E}\big[\nabla_\theta \log \pi_\theta(s, a)\, A^{\pi_\theta}(s, a)\big] \qquad (6)$$
where $A^{\pi_\theta}(s, a)$ is the advantage function [7], which reflects how much better an action is compared to the average action chosen according to the policy. Since the exact $A^{\pi_\theta}(s, a)$ is not known, the agent samples a trajectory of scheduling decisions and uses the empirically computed advantage $A(s_i, a_i)$ as an unbiased estimate of $A^{\pi_\theta}(s_i, a_i)$. The update of the actor network parameter $\theta$ follows the policy gradient:
$$\theta \leftarrow \theta + \alpha \sum_i \nabla_\theta \log \pi_\theta(s_i, a_i)\, A(s_i, a_i) \qquad (7)$$
where $\alpha$ is the learning rate. In order to compute the advantage for a given action, we need to estimate the value function $V^{\pi_\theta}(s)$ of the current state, which is the expected total reward starting at state $s$ and following the policy $\pi_\theta$. Estimating the value function from the empirically observed rewards is the task of the critic network. To train the critic network, we follow the standard temporal difference method [9]. In particular, the update of the critic parameter $\theta_v$ follows:
$$\theta_v \leftarrow \theta_v - \alpha' \sum_i \nabla_{\theta_v} \big(r_i + \gamma V^{\pi_\theta}(s_{i+1}; \theta_v) - V^{\pi_\theta}(s_i; \theta_v)\big)^2 \qquad (8)$$
where $\alpha'$ is the learning rate of the critic, and $V^{\pi_\theta}(s; \theta_v)$ is the estimate of $V^{\pi_\theta}(s)$. Therefore, for a given experience $(s_i, a_i, r_i, s_{i+1})$, the advantage function is estimated as $A(s_i, a_i) = r_i + \gamma V^{\pi_\theta}(s_{i+1}; \theta_v) - V^{\pi_\theta}(s_i; \theta_v)$.
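The one-step advantage estimate and the critic's temporal-difference step can be sketched as follows (a tabular simplification for illustration; in the paper the critic is a neural network trained by gradient descent on the same TD error):

```python
def td_advantage(r, v_s, v_next, gamma=0.99):
    """One-step advantage estimate: A(s,a) ~ r + gamma*V(s') - V(s)."""
    return r + gamma * v_next - v_s

def critic_td_step(v_s, r, v_next, lr=1e-3, gamma=0.99):
    """Move the value estimate V(s) toward the TD target r + gamma*V(s')."""
    return v_s + lr * (r + gamma * v_next - v_s)
```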
Finally, we note that the critic is used only during the training phase, in order to help the actor converge to the optimal policy. The actor network alone is then used to make the scheduling decisions.
IV Simulation
IV-A Implementation
To generate our scheduling algorithm, we train an RL agent that consists of an actor-critic pair. Both the actor and critic networks use the same structure, except that the final output of the critic network is a single linear neuron with no activation function. We pass the throughput achieved in the last 5 jobs to a 1D convolution layer (CNN) with 128 filters, each of size 4 with stride 1. The output of this layer is aggregated with all other inputs in a hidden layer of 128 neurons with a ReLU activation function. Finally, the output layer (10 neurons) applies the softmax function. To account for the discounted reward, we choose the discount factor accordingly, and we set fixed learning rates for both the actor and the critic. We implemented this architecture in Python using TensorFlow [10]. Moreover, we used real bandwidth traces [11, 12, 13] to train our RL agent, and we set a non-zero packet drop probability. We assume that there are 10 sensors ($N = 10$) generating packets of increasing sizes, with AoI thresholds (in ms) that increase with the sensor ID. Therefore, the first sensor (sensor 0) imposes the strictest AoI threshold, and the last one (sensor 9) has the loosest threshold and the largest packet size. We set the weight $w_n$ to be inversely proportional to the sensor ID $n$, so that we incur a larger penalty when the AoI of a sensor with a tighter threshold exceeds its target.
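The actor architecture described above can be sketched as a pure-NumPy forward pass with random, untrained weights (dimensions follow the description: a length-5 throughput history, a width-4 conv with 128 filters and stride 1, a 128-unit ReLU hidden layer, and a 10-way softmax; names and the shape of the remaining state features are assumptions):

```python
import numpy as np

def actor_forward(past_tput, other_feats, rng=None):
    """Untrained forward pass: 1-D conv (128 filters, width 4, stride 1)
    over the last-5-jobs throughput, concatenation with the remaining
    state, a 128-unit ReLU layer, and a 10-way softmax output."""
    rng = rng or np.random.default_rng(0)
    w_conv = rng.standard_normal((128, 4)) * 0.1
    conv = np.maximum(0.0, np.stack(
        [w_conv @ past_tput[i:i + 4] for i in range(len(past_tput) - 3)]
    ))                                    # shape (2, 128) for a length-5 input
    merged = np.concatenate([conv.ravel(), other_feats])
    w_h = rng.standard_normal((128, merged.size)) * 0.1
    hidden = np.maximum(0.0, w_h @ merged)
    w_out = rng.standard_normal((10, 128)) * 0.1
    logits = w_out @ hidden
    e = np.exp(logits - logits.max())
    return e / e.sum()                    # probability vector over 10 sensors
```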
For comparison, we consider two baselines. Baseline 1 always chooses the sensor with the maximum AoI to transmit; we refer to it as the "Maximum AoI" algorithm. Baseline 2, on the other hand, randomly selects a sensor for each job with a selection probability inversely proportional to its AoI threshold, so the sensors with stricter deadlines are chosen more frequently; we refer to it as the "Proportional Fair" algorithm.
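The two baselines can be sketched in a few lines each (function names hypothetical):

```python
import random

def max_aoi_baseline(aoi):
    """Baseline 1: always schedule the sensor with the largest current AoI."""
    return max(range(len(aoi)), key=aoi.__getitem__)

def proportional_fair_baseline(thresholds, rng=None):
    """Baseline 2: pick a sensor at random with probability inversely
    proportional to its AoI threshold (stricter deadlines chosen more often)."""
    rng = rng or random.Random()
    weights = [1.0 / d for d in thresholds]
    return rng.choices(range(len(thresholds)), weights=weights)[0]
```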
IV-B Discussion
We now report our results using the test traces. Throughout this section, we refer to our proposed algorithm as the "RL" algorithm. The results are presented in both Table I and Fig. 3. From the total normalized objective function shown in Table I, computed as per equation (4) and normalized with respect to the "RL" algorithm, we clearly see that our proposed RL-based algorithm significantly outperforms the baselines. The RL algorithm achieves the minimum objective value among the three algorithms: the objectives of the Maximum AoI (baseline 1) and Proportional Fair (baseline 2) algorithms are 1.25x and 1.7x higher, respectively. Moreover, the RL algorithm achieves the minimum threshold violation for almost every sensor. For example, the AoI of the first sensor exceeds its threshold only 0.23% of the time under the proposed algorithm, whereas the AoI of the same sensor exceeds its threshold more than 16% of the time under baselines 1 and 2. Furthermore, the RL algorithm completely eliminates violations for sensors 5 onward (except the last sensor). Baseline 2, on the other hand, continues to experience a considerably high threshold-violation rate for all sensors.
We also notice that the RL algorithm maintains a trade-off between minimizing the average AoI of each sensor and minimizing the threshold violation, which is a very important requirement for URLLC. For example, for the first three sensors (sensors 0, 1, and 2), the RL algorithm achieves both the minimum AoI and the minimum threshold violation. Note that the penalty of an AoI violation is inversely proportional to the sensor's index, i.e., sensor 0's threshold violation degrades the performance of the algorithm much more than sensor 1's violation, and so on. For some sensors, the maximum-AoI approach (baseline 1) achieves a lower average AoI, but this comes at the cost of more threshold violations for sensors with stricter AoI thresholds and higher violation-penalty weights (e.g., sensor 0), which degrades performance more significantly than a higher average AoI does. Therefore, for the considered objective function, which assigns a higher penalty to threshold violations, our proposed RL-based algorithm is able to learn this structure and significantly outperform the other two algorithms.
Fig. 3(a-j) plots the CDF of each sensor's AoI for the three algorithms. The CDF plots, alongside the results reported in Table I, show that the RL algorithm outperforms the baselines in both average AoI and probability of exceeding the AoI threshold for the sensors with stricter deadlines (Fig. 3(a-c)). For example, Fig. 3(a) plots the AoI CDF of sensor 0, which has the strictest threshold among all sensors; we see that the RL algorithm achieves the minimum AoI for this sensor all the time. The RL algorithm incurs a higher average AoI than the Maximum AoI algorithm (baseline 1) for sensors with looser AoI thresholds, but this comes with the gain of achieving the minimum threshold violation for most of the sensors. In conclusion, we clearly see that the RL-based approach proposed in this paper learns to minimize the average AoI of every sensor while keeping the probability of exceeding each sensor's AoI threshold as low as possible. Moreover, it learns to respect the weights specified by the objective function, which penalizes threshold violations much more heavily than it rewards a lower average AoI.
Table I

Metric                          RL       Baseline 1  Baseline 2
Normalized objective, Eq. (4)   1        1.25        1.7
Pr(AoI > threshold), sensor 0   0.23%    16.24%      16.18%
Pr(AoI > threshold), sensor 1   0.06%    6.04%       7.80%
Pr(AoI > threshold), sensor 2   0.016%   2.46%       7.02%
Pr(AoI > threshold), sensor 3   0.06%    0.04%       5.55%
Pr(AoI > threshold), sensor 4   0.043%   0%          3.98%
Pr(AoI > threshold), sensor 5   0%       0%          4.32%
Pr(AoI > threshold), sensor 6   0%       0%          3.21%
Pr(AoI > threshold), sensor 7   0%       0%          3.63%
Pr(AoI > threshold), sensor 8   0%       0%          3.82%
Pr(AoI > threshold), sensor 9   0.016%   0%          4.42%
Avg AoI, sensor 0               9.82     16.93       15.89
Avg AoI, sensor 1               14.39    19.12       19.34
Avg AoI, sensor 2               17.30    20.83       26.31
Avg AoI, sensor 3               22.53    22.05       30.90
Avg AoI, sensor 4               26.07    22.82       35.35
Avg AoI, sensor 5               23.86    23.12       41.90
Avg AoI, sensor 6               25.51    22.83       45.09
Avg AoI, sensor 7               31.25    22.08       51.75
Avg AoI, sensor 8               26.28    20.83       61.27
Avg AoI, sensor 9               39.14    19.09       67.14
V Conclusion
In this paper, we investigated the use of machine learning in solving network resource allocation/scheduling problems. In particular, we developed a reinforcement-learning-based algorithm to solve the problem of AoI minimization for URLLC networks. We considered a system in which a number of sensor nodes transmit time-sensitive data to a remote monitoring side. We optimized a metric that maintains a trade-off between minimizing the sum of the expected AoI of all sensors and minimizing the probability of exceeding a certain AoI threshold for each sensor. We trained our reinforcement learning agent using the state-of-the-art actor-critic algorithm over a set of public bandwidth traces with a non-zero probability of packet drop. Simulation results show that the proposed algorithm outperforms the considered baselines in terms of optimizing the considered metric.
References
[1] S. Kaul, R. Yates, and M. Gruteser, "Real-time status: How often should one update?" in 2012 Proceedings IEEE INFOCOM, Mar. 2012, pp. 2731–2735.
[2] I. Kadota, A. Sinha, and E. Modiano, "Optimizing age of information in wireless networks with throughput constraints," in IEEE INFOCOM 2018 - IEEE Conference on Computer Communications, Apr. 2018, pp. 1844–1852.
[3] C. Kam, S. Kompella, G. D. Nguyen, J. E. Wieselthier, and A. Ephremides, "On the age of information with packet deadlines," IEEE Transactions on Information Theory, vol. 64, no. 9, pp. 6419–6428, Sept. 2018.
[4] A. Kosta, N. Pappas, A. Ephremides, and V. Angelakis, "Age of Information and Throughput in a Shared Access Network with Heterogeneous Traffic," arXiv e-prints, Jun. 2018.
[5] M. K. Abdel-Aziz, C.-F. Liu, S. Samarakoon, M. Bennis, and W. Saad, "Ultra-Reliable Low-Latency Vehicular Networks: Taming the Age of Information Tail," arXiv e-prints, Nov. 2018.
[6] Y. Sun, M. Peng, Y. Zhou, Y. Huang, and S. Mao, "Application of Machine Learning in Wireless Networks: Key Techniques and Open Issues," arXiv e-prints, Sep. 2018.
[7] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning, 2016, pp. 1928–1937.
[8] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Advances in Neural Information Processing Systems, 2000, pp. 1057–1063.
[9] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2011.
[10] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "TensorFlow: a system for large-scale machine learning," in OSDI, vol. 16, 2016, pp. 265–283.
[11] H. Riiser, P. Vigmostad, C. Griwodz, and P. Halvorsen, "Commute path bandwidth traces from 3G networks: analysis and applications," in Proceedings of the 4th ACM Multimedia Systems Conference. ACM, 2013, pp. 114–118.
[12] "Federal Communications Commission. 2016. Raw Data - Measuring Broadband America. (2016)." https://www.fcc.gov/reportsresearch/reports/measuringbroadbandamerica/rawdatameasuringbroadbandamerica2016.
[13] H. Mao, R. Netravali, and M. Alizadeh, "Neural adaptive video streaming with Pensieve," in Proceedings of the Conference of the ACM Special Interest Group on Data Communication. ACM, 2017, pp. 197–210.