Phasic Policy Gradient Based Resource Allocation for Industrial Internet of Things

by Lokesh Bommisetty, et al.

The Time Slotted Channel Hopping (TSCH) behavioural mode was introduced in the IEEE 802.15.4e standard to address the ultra-high reliability and ultra-low power communication requirements of Industrial Internet of Things (IIoT) networks. Scheduling packet transmissions in IIoT networks is difficult owing to limited resources and dynamic topology. In this paper, we propose a phasic policy gradient (PPG) based TSCH schedule learning algorithm. The proposed algorithm overcomes the drawbacks of fully distributed and fully centralized deep reinforcement learning-based scheduling algorithms by employing an actor-critic policy gradient method that learns the schedule in two phases, namely the policy phase and the auxiliary phase.



I Introduction

The Industrial Internet of Things (IIoT) has revolutionised manufacturing processes through its low-cost solutions for massive data collection [1]. IIoT devices are mostly battery operated and hence cannot afford to keep their radios on for long periods. The transmission, reception and sleep routines of the nodes therefore have to be scheduled, both to save energy and to keep the network synchronized. The nodes in a TSCH network communicate using a scheduling matrix composed of cells, each designated by its time-slot index and channel index. Recently, learning methods such as machine learning (ML) and reinforcement learning (RL) have been adopted to address resource allocation problems in wireless networks [2]. Cobbe et al. proposed PPG by modifying the traditional actor-critic policy gradient method [3]. In PPG, the policy and value functions are learnt independently, so that the sharing of network parameters between the value optimization and policy optimization networks can be regulated. PPG thereby overcomes the disadvantages of sharing network parameters in traditional actor-critic policy gradient methods.

II Problem Formulation

We consider a TSCH network consisting of a set of nodes communicating with a single border router that collects data from the nodes over multiple hops. In TSCH, time is discretized into slots, each identified by a time-slot index. The communication bandwidth is divided into a number of channels, each identified by a channel index. Let us define an indicator random variable that takes the value 1 if a node is scheduled in a given cell and 0 otherwise. A fixed number of timeslots together constitute a slotframe [4].
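To make the cell notation concrete, the following minimal Python sketch stores the schedule as a nested list of 0/1 indicators. The layout `schedule[node][slot][channel]` and the helper names `empty_schedule`, `assign_cell` and `cells_of` are assumptions made for illustration; they do not come from the paper or from any TSCH library.

```python
from typing import List, Tuple

# Hypothetical layout: schedule[node][slot][channel] is 1 if the node is
# scheduled in cell (slot, channel) and 0 otherwise.
Schedule = List[List[List[int]]]


def empty_schedule(num_nodes: int, num_slots: int, num_channels: int) -> Schedule:
    """Create an all-zero schedule covering one slotframe."""
    return [[[0] * num_channels for _ in range(num_slots)] for _ in range(num_nodes)]


def assign_cell(schedule: Schedule, node: int, slot: int, channel: int) -> None:
    """Set the indicator for the given node and cell to 1."""
    schedule[node][slot][channel] = 1


def cells_of(schedule: Schedule, node: int) -> List[Tuple[int, int]]:
    """List the (slot, channel) cells in which the node is scheduled."""
    return [
        (slot, channel)
        for slot, channels in enumerate(schedule[node])
        for channel, flag in enumerate(channels)
        if flag == 1
    ]


# Three nodes, a slotframe of four time slots, two channels.
schedule = empty_schedule(3, 4, 2)
assign_cell(schedule, 0, 1, 0)
assign_cell(schedule, 2, 3, 1)
```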

Each node has a deadline, defined as the maximum delay that a packet generated by that node can experience. A packet is dropped if its waiting time in the queue exceeds this deadline. Accordingly, we define the probability that a packet generated by a node is dropped in the network due to violation of the deadline constraint. We likewise define the probability that a packet is discarded due to co-channel interference, which is determined by the instantaneous signal to interference and noise ratio (SINR) of the node transmitting on a given channel and an SINR threshold. Given these two causes of packet failure, the success probability of a packet generated by a node and transmitted on a channel follows from the two probabilities above. The network throughput is then determined from these success probabilities.
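The exact expressions are not reproduced here, so the sketch below is only one plausible reading. It assumes that the two failure events are independent, so that the success probability is the product of the two complements, and that the throughput contribution of a slotframe is the sum of success probabilities over the scheduled cells. The function and argument names are hypothetical.

```python
from typing import Dict, List, Tuple


def success_probability(p_deadline_drop: float, p_interference: float) -> float:
    """Probability that a packet is neither dropped for missing its deadline
    nor discarded due to co-channel interference (independence assumed)."""
    return (1.0 - p_deadline_drop) * (1.0 - p_interference)


def expected_deliveries(
    scheduled: List[Tuple[int, int]],
    p_deadline_drop: Dict[int, float],
    p_interference: Dict[Tuple[int, int], float],
) -> float:
    """Expected number of successful packets over the scheduled (node, channel) pairs."""
    return sum(
        success_probability(p_deadline_drop[node], p_interference[(node, channel)])
        for node, channel in scheduled
    )
```

For example, a node with a 10% deadline-drop probability and a 20% interference-loss probability would have a success probability of roughly 0.72 under this assumption.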


The energy efficiency of the network is the transmission rate achieved by the network per unit of transmission power spent, where each node has its own transmitting power.
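As a rough illustration of "rate per unit transmission power", the sketch below divides the total achieved rate by the total transmit power. Whether the original definition is a ratio of sums or a sum of per-node ratios cannot be recovered from the text, so the ratio-of-sums form here is an assumption, as are the names `energy_efficiency`, `rates` and `powers`.

```python
from typing import Sequence


def energy_efficiency(rates: Sequence[float], powers: Sequence[float]) -> float:
    """Total achieved rate divided by total transmit power (assumed form)."""
    total_power = sum(powers)
    if total_power <= 0.0:
        raise ValueError("total transmit power must be positive")
    return sum(rates) / total_power
```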

The resource allocation is given by the schedule and the power allocation vector. In this paper, the objective is to jointly optimize the cell allocation and power allocation of the nodes in the network so as to maximise the network throughput and energy efficiency while guaranteeing the QoS requirements. The resource allocation problem is therefore formulated as the constrained optimization problem (3)-(6).


The objective function (3) weights throughput and energy efficiency by two optimization parameters, and its maximisation is subject to the constraints (4), (5) and (6). Constraint (4) states that a node cannot transmit on more than one channel in a given time slot. Constraints (5) and (6) ensure the QoS requirements of the network in terms of deadline-based delivery and error probability, respectively.
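The single-channel constraint and the weighted objective can be sketched as follows. The weighted-sum form of the utility is an assumption based on the description of the two weights, and the nested-list layout `schedule[node][slot][channel]` and all function names are hypothetical.

```python
from typing import List

Schedule = List[List[List[int]]]  # assumed layout: schedule[node][slot][channel]


def violates_single_channel(schedule: Schedule) -> bool:
    """True if any node is assigned more than one channel in the same time slot."""
    return any(sum(channels) > 1 for node in schedule for channels in node)


def utility(
    throughput: float,
    energy_efficiency: float,
    w_throughput: float,
    w_energy: float,
) -> float:
    """Weighted sum of throughput and energy efficiency (assumed form of the objective)."""
    return w_throughput * throughput + w_energy * energy_efficiency
```

A candidate schedule that fails `violates_single_channel` would be infeasible regardless of its utility; the deadline and error-probability constraints would need similar checks.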

III PPG for TSCH Resource Allocation

In this section, we discuss the use of the PPG method to solve the resource allocation problem formulated above for TSCH networks. The state space is defined by the set of possible queue length vectors together with the set of network topologies. The action space is each node's choice of cells and transmission power. The reward obtained at each step is the utility function of the optimization problem. In PPG, the optimization of the objective function occurs in two phases, namely the policy phase and the auxiliary phase. During the policy phase, the agent is trained using proximal policy optimization (PPO) [5]. During the auxiliary phase, features from the value function are distilled into the policy network, so that future policy phases improve.
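The alternation between the two phases can be sketched as a simple schedule. Following the general structure described by Cobbe et al. [3], several policy-phase iterations are followed by one auxiliary phase; the function name and the specific counts used below are illustrative assumptions, not settings reported in this paper.

```python
from typing import List


def ppg_phase_schedule(num_cycles: int, policy_iters_per_cycle: int) -> List[str]:
    """Return the order of training phases: a run of policy-phase iterations
    followed by one auxiliary phase, repeated for each cycle."""
    phases: List[str] = []
    for _ in range(num_cycles):
        phases.extend(["policy"] * policy_iters_per_cycle)
        phases.append("auxiliary")
    return phases


# Two cycles with three policy iterations each.
example_phases = ppg_phase_schedule(2, 3)
```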

III-1 Policy Phase

During the policy phase, we update the policy network by optimizing a PPO-based objective function [5]. This objective uses the ratio of the target policy to the old policy, together with the advantage estimator at each time step [5]. It also involves the Kullback-Leibler divergence, also known as relative entropy, which measures how one probability distribution differs from another. Similar to the policy network, the value function network is trained by optimizing its own objective function.


The value objective has its own training parameters, analogous to the parameters of the policy network, and compares the target value function with the estimated value function.
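For reference, the clipped surrogate objective from the PPO paper [5] and a squared-error value loss can be sketched in plain Python as below. This is not necessarily the exact objective used in this work, which also mentions a Kullback-Leibler term; the function names and the default `clip_eps=0.2` are assumptions for illustration.

```python
import math
from typing import Sequence


def clipped_surrogate(
    logp_new: Sequence[float],
    logp_old: Sequence[float],
    advantages: Sequence[float],
    clip_eps: float = 0.2,
) -> float:
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A), where r is the
    probability ratio between the new and the old policy."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


def value_loss(values: Sequence[float], targets: Sequence[float]) -> float:
    """Mean of 0.5 * (estimated value - target value)^2."""
    return sum(0.5 * (v - t) ** 2 for v, t in zip(values, targets)) / len(values)
```

The surrogate is maximised with respect to the policy parameters, while the value loss is minimised with respect to the value network parameters.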

III-2 Auxiliary Phase

In the auxiliary phase, a joint objective function that combines a behavioral cloning loss with an arbitrary auxiliary loss is used to optimize the policy network. The behavioral cloning term is computed against the policy as it stands right after the policy phase and just before the auxiliary phase begins. A hyper-parameter regulates the trade-off between the two terms; without the auxiliary objective, the optimization simply preserves the original policy. The auxiliary loss can be any objective function. In our case, we use the value optimization function as the auxiliary objective, as done by Cobbe et al. in the paper proposing PPG [3].
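A minimal sketch of such a joint objective for discrete action distributions is given below, following the form in Cobbe et al. [3]: the auxiliary loss plus a weighted Kullback-Leibler term between the pre-auxiliary policy and the current policy. The names `kl_divergence`, `joint_loss` and `beta_clone`, and the default weight of 1.0, are assumptions for illustration.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same actions.
    Terms with p_i == 0 contribute nothing; q_i must be positive where p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def joint_loss(
    aux_loss: float,
    old_policy_probs: Sequence[Sequence[float]],
    new_policy_probs: Sequence[Sequence[float]],
    beta_clone: float = 1.0,
) -> float:
    """Auxiliary loss plus beta_clone times the mean KL divergence between the
    policy saved before the auxiliary phase and the current policy."""
    kl_terms = [
        kl_divergence(p_old, p_new)
        for p_old, p_new in zip(old_policy_probs, new_policy_probs)
    ]
    return aux_loss + beta_clone * sum(kl_terms) / len(kl_terms)
```

When the current policy equals the saved policy, the cloning term vanishes and only the auxiliary loss remains.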

IV Simulation Results

Fig. 1: (left) Convergence Performance of PPG and PPO learning algorithms for TSCH schedule. (right) System Throughput performance against the number of nodes in the network.

To evaluate the performance of the PPG based scheduling algorithm in TSCH networks, we simulated the network using the 6TiSCH simulator. The network consists of 70 nodes communicating with a border router over a maximum of 3 hops. We compare the performance of the proposed scheduling algorithm with that of a PPO [5] based algorithm and the Minimal Scheduling Function (MSF) [6], which is the default scheduling algorithm in TSCH networks.

First, we compare the convergence of the proposed PPG based TSCH scheduling algorithm with that of the PPO learning algorithm in Fig. 1 (left). PPG achieves a better reward than PPO from the start of training and converges quickly. Next, we examine the system throughput under the different scheduling schemes. Fig. 1 (right) shows the throughput obtained with the PPG and PPO based scheduling algorithms in comparison with the default scheduling algorithm MSF. As the number of nodes in the network increases, the interference on each link increases, which raises the error probability of transmitted packets and thus reduces the system throughput. PPG improves the system throughput compared to both the default scheduling algorithm MSF and the proximal policy optimization method (PPO). As future work, we wish to explore auxiliary functions that account for packet drops by employing a penalty in the policy network, to further improve the convergence performance.


  • [1] S. Longo, T. Su, G. Herrmann, and P. Barber, Optimal and robust scheduling for networked control systems. CRC Press, 2018.
  • [2] B. Cunha, A. Madureira, B. Fonseca, and J. Matos, “Intelligent scheduling with reinforcement learning,” Applied Sciences, vol. 11, 2021.
  • [3] K. W. Cobbe, J. Hilton, O. Klimov, and J. Schulman, “Phasic policy gradient,” in International Conference on Machine Learning. PMLR, 2021.
  • [4] A. F. Molisch, K. Balakrishnan, C.-C. Chong, S. Emami, A. Fort, J. Karedal, J. Kunisch, H. Schantz, U. Schuster, and K. Siwiak, “IEEE 802.15.4a channel model - final report,” IEEE P802, 2004.
  • [5] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv:1707.06347, 2017.
  • [6] T. Chang, M. Vucinic, X. Vilajosana, S. Duquennoy, and D. Dujovne, “6TiSCH minimal scheduling function (MSF),” Internet Engineering Task Force, Internet-Draft draft-ietf-6tisch-msf-02, 2019.