
A Learning Approach for Joint Design of Event-triggered Control and Power-Efficient Resource Allocation

by Atefeh Termehchi, et al.

In emerging Industrial Cyber-Physical Systems (ICPSs), the joint design of communication and control sub-systems is essential, as these sub-systems are interconnected. In this paper, we study the joint design problem of an event-triggered control and an energy-efficient resource allocation in a fifth generation (5G) wireless network. We formally state the problem as a multi-objective optimization one, aiming to minimize the number of updates on the actuators' input and the power consumption in the downlink transmission. To address the problem, we propose a model-free hierarchical reinforcement learning approach with uniformly ultimate boundedness stability guarantee that learns four policies simultaneously. These policies contain an update time policy on the actuators' input, a control policy, and energy-efficient sub-carrier and power allocation policies. Our simulation results show that the proposed approach can properly control a simulated ICPS and significantly decrease the number of updates on the actuators' input as well as the downlink power consumption.





I Introduction

Emerging ICPSs, such as the smart grid, smart manufacturing, and smart transportation, are spatially distributed and high-dimensional. These systems require high reliability, communication between numerous devices, low latency, power-efficient communication, and high computational capacity [1, 2]. To meet these requirements, 5G and beyond-5G networks offer a wide range of services, classified as 1) enhanced mobile broadband (eMBB), 2) ultra-reliable and low-latency communication (URLLC), and 3) massive machine-type communication (mMTC). The eMBB, URLLC, and mMTC services provide, respectively, a high data rate with moderate latency, communication with low end-to-end delay, and connectivity for many devices. Because most communication in control sub-systems is of the URLLC type, a 5G network is a good choice for exchanging data between a controller and sensors or actuators.

However, there are some serious challenges to deploying 5G networks in ICPSs. Specifically, ICPSs with 5G networks have limited network resources and lack the desired stability and performance guarantees [3, 4]. The performance of the control sub-system is defined as achieving the required dynamic response, which is specified by measures of performance such as a desirable steady-state tracking error. The stability and performance of a control sub-system may be guaranteed through periodic transmissions with a high data rate. This, however, comes at the cost of a higher packet loss rate due to the limited resources of wireless networks [5]. Furthermore, in many ICPS applications, most wireless devices rely on batteries, and their battery life may be significantly reduced by an increased transmission rate [6]. Consequently, the event-triggered control (ETC) method has been proposed, in which the transmission times of the control sub-system are triggered by a predefined event instead of occurring periodically. This event is characterized according to the stability and performance requirements of the control sub-system.
In recent years, extensive research has concentrated on different classes of ETC strategies; see [7, 8] and the references therein. In this context, there is also substantial work demonstrating the better energy saving and performance of ETC in comparison with traditional periodic control [9, 10]. Nevertheless, these works analyze only low-dimensional or linear models of control sub-systems [6, 11]. Moreover, the analysis of event-triggered control becomes highly complicated when the volatile properties of wireless communication, such as delay, limited resources, packet drops, and unreliable links, are considered.
The design of event-triggered control in the presence of unreliable links and packet losses has recently drawn a lot of attention [12, 13, 11]. However, in addition to packet drops, many other features of wireless communication, such as delay and limited resources, directly impact the stability and performance of control sub-systems. To deal with these interconnections between the control and communication sub-systems, a joint design method is adopted in ICPSs [14, 15, 6, 16]. However, developing an analytical model of all control and network features is a fundamental challenge for this method, because the sub-systems are typically high-dimensional and the conditions of radio resources change continuously and randomly.
Therefore, researchers have used model-free reinforcement learning (RL) in the joint design of ICPS sub-systems [17, 18, 6, 14, 19]. In [17], RL is used to propose a sensor scheduler while the controller is designed beforehand. The actor-critic RL method is also used in [18] to learn event-triggered control. In [6], the option method of deep RL (DRL) is used for joint optimization of an event policy and a control policy; the event policy determines when the control input should be transmitted, and the control policy determines what control input value should be sent. Nonetheless, the varying characteristics of the wireless network are not considered in [6, 18]. In [14], an RL approach is used to jointly design the sampling rate of the control sub-system and the modulation type of the wireless network.
Although stability is an essential property of every control sub-system, RL methods can hardly guarantee the stability and reliability of a learning-based controller [20]. Nonetheless, in [20, 21, 22], a learning-based controller with a uniformly ultimate boundedness (UUB) stability guarantee is proposed, which can be usefully employed in ICPSs with safety constraints. In general, UUB stability says that if the norm of the starting state of a control sub-system is less than a specified value, then the state will eventually enter a neighborhood of the sub-system's equilibrium within finite time and never escape from this neighborhood afterwards [21].
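To make the UUB notion concrete, the following sketch checks the property numerically on a sampled trajectory: after a settling time, the state norm must stay inside the ultimate bound. The dynamics, bound, and settling time below are illustrative values of our own choosing, not taken from the paper.

```python
import numpy as np

def is_uub(trajectory, bound, settle_time):
    """Check uniform ultimate boundedness on a sampled trajectory:
    after `settle_time` steps, the state norm must stay within `bound`."""
    norms = np.linalg.norm(np.asarray(trajectory, dtype=float), axis=1)
    return bool(np.all(norms[settle_time:] <= bound))

# A contractive linear system x_{k+1} = 0.9 x_k + w_k pulls the state toward
# the origin, so the trajectory enters and stays in a neighborhood of 0.
rng = np.random.default_rng(0)
x, traj = np.array([10.0, -8.0]), []
for _ in range(200):
    traj.append(x.copy())
    x = 0.9 * x + rng.uniform(-0.05, 0.05, size=2)

print(is_uub(traj, bound=1.0, settle_time=100))  # True
```

A trajectory that never enters the bound (e.g. a constant large state) fails the check, matching the intuition that UUB demands eventual, permanent containment.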
The goal of this paper is to jointly design the event-triggered control and the energy-efficient allocation of radio resources in an ICPS. To the best of our knowledge, this joint design problem has not yet been studied. We propose to use a novel Hierarchical RL (HRL) approach with UUB stability guarantee to solve the problem. Our contributions are as follows.

  • We assume an ICPS containing multiple eMBB users and a control plant with multiple URLLC users sharing a single-cell Orthogonal Frequency-Division Multiple Access (OFDMA) network. We formulate the joint design of the event-triggered control and the energy-efficient resource allocation in the ICPS as a multi-objective optimization problem whose goals are to minimize both the number of updates on the actuators' input and the energy consumption in the downlink. The constraints of this problem comprise the dynamics and UUB stability of the control plant, the minimum Quality of Service (QoS) demands of the eMBB and URLLC users, and the power and sub-carrier constraints of the OFDMA network.

  • The problem is high-dimensional, complicated, and associated with a hybrid action space. To handle these properties, we combine the Cascade Attribute Learning Network (CAN) method and the option-critic method to develop a novel model-free HRL approach with a UUB stability guarantee. First, we use the CAN method to decouple the problem into two low-dimensional sub-problems of control and resource allocation; we show that this decoupling leads to a Pareto solution of the optimization problem. In the second step, we use the option-critic method, reformulated as the Double Actor-Critic (DAC) architecture, to address each sub-problem with a hybrid action space.

  • The novel model-free HRL with UUB stability guarantee can simultaneously learn four policies: 1) update time policy on the actuators’ input, 2) control policy, which determines the value of control input, 3) energy-efficient sub-carrier allocation policy, and 4) energy-efficient power allocation policy.

  • We demonstrate the effectiveness and capability of the proposed approach through several simulations. In comparison with a disjoint, model-based method, our numerical results show that both the number of updates on the actuators' input and the downlink energy consumption are reduced significantly by applying the proposed approach. Moreover, we show the capability of the proposed approach compared with the soft actor-critic algorithm.

This paper is outlined as follows. The system model and problem formulation are described in Section II. The proposed approach is presented in Section III. In Section IV, simulation results are discussed. In Section V, the paper's conclusion and future work are given.

Fig. 1: System model of the considered ICPS


II-A System Model

Consider a model of an ICPS that consists of 1) a control plant, 2) a central event-triggered controller, and 3) a downlink model of an OFDMA cellular network (Fig. 1). The state values of the control plant are measured by multiple sensors and sent to the event-triggered learning-based controller. Next, whenever required, the controller calculates and sends the control input to the actuators through the single-cell OFDMA network. Following the 5G architecture described by the International Telecommunication Union (ITU), the central learning-based controller is assumed to run on dedicated or shared hardware in the central-office data-center layer, which is placed near the network's Base Station (BS) [23].
Control Plant: We suppose the dynamics of the control plant is unknown, that is:

$x_{k+1} = f(x_k, u_k) + \omega_k, \qquad y_k = h(x_k) \qquad (1)$

where $f$ and $h$ are unknown functions, and $x_k$, $u_k$, and $y_k$ denote the vectors of the control state, control input, and sensors' output at discrete time $k$ ($k = 0, 1, 2, \dots$) respectively. Also, vector $\omega_k$ is the actuation disturbance at discrete time $k$. We assume the control plant, described by dynamics (1), is completely state observable, as is regularly assumed in the related literature, e.g. [24].
Event-triggered Controller: When the sensors' output vector $y_k$ is received by the central event-triggered controller, it decides whether the actuators' input should be updated ($\delta_k = 1$) or the update should be skipped to save wireless resources ($\delta_k = 0$). This decision is taken based on the UUB stability guarantee of the control plant, defined in what follows.

Definition 1 [25].

A control plant is uniformly ultimately bounded with ultimate bound $b$ if there exist positive constants $b$ and $c$, and for every $a \in (0, c)$ there is a time $T = T(a, b)$, such that $\|x_0\| \le a$ implies $\|x_k\| \le b$ for all $k \ge T$. If this holds for arbitrarily large $a$, then the control plant is globally uniformly ultimately bounded.

In addition, if the update variable equals 1, then the controller calculates the control input value considering UUB stability. We assume that a Zero-Order Hold (ZOH) keeps the actuators' input constant between two consecutive updates. This can be mathematically given by:
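The ZOH update rule above can be sketched as a one-line selection between the freshly computed input and the held previous input. The function name and arguments are ours, for illustration only:

```python
def zoh_actuator_input(delta_k, u_new, u_prev):
    """Zero-order-hold update: apply the freshly computed input only when
    the update variable delta_k is 1; otherwise hold the previous input."""
    return u_new if delta_k == 1 else u_prev

u = 0.0
u = zoh_actuator_input(1, 2.5, u)   # update triggered -> actuator gets 2.5
u = zoh_actuator_input(0, -1.0, u)  # no update -> actuator holds 2.5
print(u)  # 2.5
```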


OFDMA Network: We assume the downlink model of a single-cell OFDMA network with one BS. The model has $N$ downlink users forming the set $\mathcal{N}$. The downlink users comprise a set of control plant users (URLLC users), $\mathcal{N}_u$, and a set of moving eMBB users (coexisting with the control plant users), $\mathcal{N}_e$. Note that the terms URLLC user and control plant user are used interchangeably from here on. We consider the URLLC users to be fixed, while the eMBB users move within the BS coverage area. The total bandwidth of the network is divided into $M$ sub-carriers forming the set $\mathcal{M}$. Also, let $p_{n,m}(k)$ be the base station's transmit power for communicating with downlink user $n$ on sub-carrier $m$ at discrete time $k$; $p_{n,m}(k)$ is assumed continuous. The overall transmit power of the BS is limited to a maximum value $P_{\max}$, i.e., $\sum_{n}\sum_{m} p_{n,m}(k) \le P_{\max}$. Moreover, the BS's total power usage in the considered ICPS is calculated as [26]:

$P_{\mathrm{tot}}(k) = P_c + \epsilon \sum_{n \in \mathcal{N}} \sum_{m \in \mathcal{M}} \alpha_{n,m}(k)\, p_{n,m}(k) \qquad (3)$

where $P_c$ is a constant power used by the BS circuit, $\epsilon$ is the amplifier inefficiency constant, and $\alpha_{n,m}(k)$ is the binary sub-carrier allocation variable: $\alpha_{n,m}(k) = 1$ if sub-carrier $m$ is allocated to downlink user $n$ at discrete time $k$, and $\alpha_{n,m}(k) = 0$ otherwise. Also, $\mathbf{P}(k)$ and $\boldsymbol{\alpha}(k)$ are the power and sub-carrier allocation matrices at discrete time $k$ respectively.
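The BS power model (circuit power plus amplifier-inefficiency-scaled transmit power over all allocated user/sub-carrier pairs) can be sketched as follows; the numeric values are illustrative, not from the paper:

```python
import numpy as np

def bs_total_power(P_c, eps, alpha, p):
    """Total BS power usage: circuit power P_c plus the amplifier
    inefficiency eps times the transmit power summed over every
    allocated (user, sub-carrier) pair.
    alpha: binary N x M allocation matrix; p: N x M transmit powers (W)."""
    return P_c + eps * float(np.sum(alpha * p))

alpha = np.array([[1, 0], [0, 1]])       # each user holds one sub-carrier
p = np.array([[0.5, 0.0], [0.0, 0.3]])   # watts
print(bs_total_power(P_c=1.0, eps=2.0, alpha=alpha, p=p))  # 1.0 + 2*(0.5+0.3) = 2.6
```

Multiplying by the binary matrix `alpha` ensures that power assigned to unallocated sub-carriers never counts toward consumption.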

The downlink Signal-to-Noise Ratio (SNR) for user $n$ on sub-carrier $m$ is given by [27]:

$\Gamma_{n,m}(k) = \dfrac{p_{n,m}(k)\, g_{n,m}(k)}{\sigma_{n,m}^2} \qquad (4)$

where $g_{n,m}(k)$ is the channel gain for user $n$ on sub-carrier $m$ at discrete time $k$ and $\sigma_{n,m}^2$ denotes the corresponding additive white Gaussian noise power at the receiver of user $n$. In accordance with Shannon's formula, the achievable instantaneous transmission rate of each eMBB user is computed in bit/s as:

$r_n(k) = \sum_{m \in \mathcal{M}} \alpha_{n,m}(k)\, W_m \log_2\!\big(1 + \Gamma_{n,m}(k)\big) \qquad (5)$
where $W_m$ is the bandwidth of sub-carrier $m$. Moreover, the QoS requirement of each eMBB user is expressed in terms of a minimum transmission rate [27]. Therefore, the required QoS of the eMBB users is represented by:

$r_n(k) \ge r_n^{\min}(k), \quad \forall n \in \mathcal{N}_e \qquad (6)$

where $r_n^{\min}(k)$ is the minimum required QoS of eMBB user $n$ at discrete time $k$. The packet size of URLLC users is generally short, so Shannon's formula cannot exactly describe their transmission rate [28, 27]. The achievable transmission rate of URLLC users with the finite blocklength channel coding method is derived in [28] as:

$r_n(k) = \sum_{m \in \mathcal{M}} \alpha_{n,m}(k)\, W_m \left[ \log_2\!\big(1 + \Gamma_{n,m}(k)\big) - \sqrt{\dfrac{V_{n,m}(k)}{L}}\, \dfrac{Q^{-1}(\varepsilon)}{\ln 2} \right] \qquad (7)$

where $L$ is the number of symbols in each codeword block, $Q^{-1}(\cdot)$ is the inverse of the Gaussian Q-function, $\varepsilon$ is the error probability, and $V_{n,m}(k)$ is the dispersion of sub-carrier $m$ for user $n$, given by:

$V_{n,m}(k) = 1 - \big(1 + \Gamma_{n,m}(k)\big)^{-2} \qquad (8)$
In a single time slot, to satisfy the required QoS of the URLLC users, it is necessary to provide the achievable instantaneous data rate condition below:

$r_n(k) \ge \dfrac{B}{D_t}, \quad \forall n \in \mathcal{N}_u \qquad (9)$

where $B$ is the length of the actuator's packet in bits and $D_t$ is the maximum tolerable transmission delay for the packet. We calculate $D_t$ according to the given maximum tolerable end-to-end (e2e) delay between the controller and the actuators. Let $D_q$ denote the maximum queuing and computation delay, and assume the propagation delay is negligible. Thus, we conservatively assume the e2e delay is:

$D_{\mathrm{e2e}} = D_t + D_q \qquad (10)$
Notably, we assume the minimum reliability requirement of the URLLC users is satisfied through enabler techniques such as low-rate codes.
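The contrast between the Shannon rate used for eMBB users and the finite-blocklength normal approximation used for URLLC users can be sketched numerically. The bandwidth, SNR, blocklength, and error-probability values below are illustrative assumptions, not the paper's parameters:

```python
import math
from statistics import NormalDist

def q_inv(eps):
    """Inverse Gaussian Q-function: Q^{-1}(eps) = Phi^{-1}(1 - eps)."""
    return NormalDist().inv_cdf(1.0 - eps)

def embb_rate(W, snr):
    """eMBB achievable rate on one sub-carrier (bit/s), Shannon's formula."""
    return W * math.log2(1.0 + snr)

def urllc_rate(W, snr, blocklength, eps):
    """Finite-blocklength normal approximation: the Shannon term minus a
    dispersion penalty that grows as the codeword block shrinks or the
    target error probability eps tightens."""
    V = 1.0 - (1.0 + snr) ** -2  # channel dispersion
    return W * (math.log2(1.0 + snr)
                - math.sqrt(V / blocklength) * q_inv(eps) / math.log(2))

W, snr = 180e3, 10.0  # illustrative sub-carrier bandwidth (Hz) and SNR
r_embb = embb_rate(W, snr)
r_urllc = urllc_rate(W, snr, blocklength=200, eps=1e-5)
print(r_urllc < r_embb)  # True: short blocks pay a rate penalty
```

As the blocklength grows, the dispersion penalty vanishes and the URLLC rate approaches the Shannon rate, which is why the classical formula suffices for eMBB traffic.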

II-B Problem Formulation

We now formally state the joint design problem of the event-triggered control and the energy-efficient resource allocation of the OFDMA network as a multi-objective optimization problem. It aims to minimize both the number of updates on the actuators' input and the total downlink power usage, subject to the dynamics and UUB stability of the control plant, the QoS demands of the eMBB and URLLC users, the power and sub-carrier constraints, and the maximum practicable level of the BS's transmit power. This problem is formulated as:


where the first three constraints represent the plant dynamics, the event-triggered controller function, and the UUB stability requirement of the control plant respectively. The next constraint shows that the update variable takes binary values. Two further constraints represent the required QoS of the eMBB and URLLC users respectively, and two constraints are related to the exclusive assignment of sub-carriers in the OFDMA network. The last constraint shows the maximum allowable transmit power of the BS.
In the multi-objective optimization problem (11), minimizing the second objective reduces the transmit power toward the control plant's users. Consequently, the downlink transmission rates are reduced and the transmission delay is increased. Accordingly, to guarantee UUB stability of the control plant, the number of updates on the actuators' input must increase in future time steps, and the first objective function grows. Due to the trade-off between these two objective functions, the idea of Pareto optimality is employed as a solution concept for problem (11) [29]. The Pareto optimal solution is defined as follows.

Definition 2 [29].

Consider a multi-objective optimization problem with objective functions $f_1, \dots, f_q$, all of which are to be minimized. A feasible solution $z_1$ can dominate another feasible solution $z_2$ (or $z_1$ is better than $z_2$) if:

  1. $f_i(z_1) \le f_i(z_2)$ for all $i \in \{1, \dots, q\}$, and

  2. $f_i(z_1) < f_i(z_2)$ for at least one $i \in \{1, \dots, q\}$.

$z^*$ is named a Pareto optimal solution when no other solution can be found that dominates $z^*$. In other words, $z^*$ is a Pareto optimal solution if and only if it is a feasible solution and there exists no better feasible solution.
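Definition 2 translates directly into code. The following sketch (our own illustration, with two objectives both minimized) checks dominance and filters a set of candidate solutions down to its Pareto front:

```python
def dominates(u, v):
    """u dominates v (all objectives minimized): u is no worse in every
    objective and strictly better in at least one."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def pareto_front(solutions):
    """Keep exactly the solutions not dominated by any other solution."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t is not s)]

pts = [(1, 5), (2, 2), (3, 1), (3, 3)]
print(pareto_front(pts))  # [(1, 5), (2, 2), (3, 1)] -- (3, 3) is dominated by (2, 2)
```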

III The Proposed Approach

In optimization problem (11), the dynamics model of the control plant and its interconnection with the network are unknown. To address this problem, we propose a novel model-free HRL approach. Specifically, a Markov Decision Process (MDP) is first constructed for problem (11). Because the state and action spaces of this MDP are large, we apply the CAN method and decompose problem (11) into two sub-problems. Then, the DAC architecture is used to solve each sub-problem with a hybrid action space.

III-A RL-related Definitions

The joint design problem can be described by an MDP consisting of the set of possible states, the set of actions, a reward function, an initial state distribution, and the state transition probability. The state at time step $k$ is defined as:


where the status variable of each URLLC and eMBB user at environment time step $k$ equals 1 if the user receives its minimum required rate, and 0 otherwise. We consider the learning agent's action at time step $k$ as follows:


An action is taken at each time step on the basis of a policy, which gives the likelihood of each action in every possible state. By choosing an action, the environment transitions from the current state to the next state according to the transition probability, and a reward is received. Denoting a transition trajectory by $\tau$, the goal of RL is to obtain a policy that maximizes the expected cumulative reward over the trajectory, where the discount factor $\gamma \in [0, 1)$ weights the importance of future rewards and the cumulative reward of an episode is accumulated between the initial step and the terminal step.
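The discounted cumulative reward the policy maximizes can be sketched in a few lines; the reward sequence and discount factor below are illustrative:

```python
def discounted_return(rewards, gamma):
    """Cumulative discounted reward of an episode: sum_t gamma^t * r_t,
    accumulated backwards for numerical clarity."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81
```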

III-B Applying the CAN Method and Decomposing the Problem

It is obvious that the size of the state and action spaces of the joint design problem may be too large in practical cases. In such a high-dimensional and complex problem, the speed of learning is considerably reduced, and the training process generally consumes an unreasonable amount of computational power. To manage these challenges, the CAN method is used, as explained in [30]. In the CAN method, the learning process of a complicated problem is decomposed into low-dimensional attribute modules, which are linked in a cascade. The state space of each attribute is kept as small as possible while still completely describing the attribute, and every attribute has its own reward function and transition probability distribution. Although [30] shows that the CAN method makes the training process significantly faster and simpler, it is not mathematically proven there that applying the decoupling method results in an optimal or sub-optimal solution. Here, however, we demonstrate this through the following lemmas in the case of problem (11).

Lemma 1.

The second objective of problem (11) is decreasing with respect to the number of updates on the actuators' input.


By decreasing the number of updates, the number of control users that need to communicate decreases, through which the number of downlink users, N, decreases. Consequently, the total power consumption of the BS decreases in accordance with equation (3). ∎

Lemma 2.

Let a set of feasible solutions of problem (11) minimize the first objective function; then this set is a subset of the Pareto solution set of optimization problem (11).

Fig. 2: Proposed approach

Lemma 2 is demonstrated by contradiction. Assume there is a feasible solution that dominates a solution of the set minimizing the first objective function. According to Lemma 1, the second objective function decreases as the number of updates decreases, and the first objective is already optimized over the set. Thus, the conditions presented in Definition 2 are not fulfilled, and the assumed solution does not dominate. As a result, the initial presumption that there is a feasible point dominating the set is contradicted. ∎

Lemma 2 allows us to decouple optimization problem (11) into two sub-problems as:




The architecture of the proposed approach applying the CAN method is shown in Fig. 2. The training process of the proposed approach has two parts. In the first part, the DRL policy of the base attribute module is trained to address sub-problem (14). The base module takes the control-related state as input and, considering its own reward function, outputs the update variable and the control input; notably, the control input is a continuous variable, while the update variable is binary. Having decided the update variable, the DRL policy of the first attribute module is trained subsequently, which is responsible for solving sub-problem (15). This module takes the network-related state as input and, considering its own reward function, outputs the power matrix along with the sub-carrier matrix.
The action space of each sub-problem is hybrid, and the majority of regular RL-based solutions are not appropriate for such hybrid problems [6]. Therefore, to address each sub-problem, we propose to use the option-critic method, reformulated as the DAC architecture in [31], since it is well suited to hybrid action spaces [32, 6].

III-C The Base Attribute Module

To handle sub-problem (14), the state and action spaces of the base module are defined as:


The base module is responsible for learning a policy over its state and action spaces. The policy aims to maximize the expected cumulative reward along the transition trajectory. The reward function of the base module is defined as:


where the first term is the control reward and the second term serves to minimize the number of updates on the actuators' input. The control reward is defined to encourage the control plant to reach its specified targets. Also, a hyper-parameter denotes the penalty weight on the number of actuator updates.
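The shape of this reward (control performance minus a weighted update penalty) can be sketched as follows; the function name and the penalty weight `kappa` are our own illustrative choices, not the paper's notation:

```python
def base_module_reward(control_reward, delta_k, kappa):
    """Reward of the base attribute module: the control-performance term
    minus a penalty of kappa whenever an actuator update fires (delta_k = 1)."""
    return control_reward - kappa * delta_k

print(base_module_reward(1.0, 1, kappa=0.2))  # 0.8: update spent, penalty paid
print(base_module_reward(1.0, 0, kappa=0.2))  # 1.0: no update, no penalty
```

Raising `kappa` biases the learned policy toward fewer actuator updates, trading control performance against communication savings.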
To guarantee UUB stability of the learning-based controller, we use a more general definition of UUB stability presented in [21]. Indeed, in [21], the classical definition of UUB stability (Definition 1) is extended to general cases in which the stability constraint functions are not necessarily the norm of the control state. The constraint function under the policy is a continuous nonnegative function defined to measure how good or bad a state-action pair of the base module is. The general definition of UUB stability with respect to this constraint function is stated in what follows.

Definition 3 [21].

A control plant is UUB with respect to the constraint function if there exist positive constants, analogous to those of Definition 1, such that whenever the initial value of the constraint function is below a given threshold, the constraint function eventually remains within the ultimate bound.

It is shown that Definition 3 is an inherent feature of the control plant when it is UUB stable. Thus, if the control plant is UUB with respect to the constraint function, then the closed-loop control is UUB [21, 22]. Note that UUB refers to the property defined by Definition 3 from here on.

Theorem 1 [21].

Assuming that the Markov chain induced by the policy is ergodic, if there exist a nonnegative function and positive constants satisfying the sampled-energy-decrease conditions of [21], where the average distribution of the states over the finite time steps is considered, then the policy guarantees UUB stability of the control plant with the corresponding ultimate bound.

Similar to [21], a fully connected deep neural network is used to construct this function, which is parameterized by the network weights. A ReLU activation function is employed in the output layer of the deep neural network to guarantee a positive output. To update the parameters, the following objective function is minimized:


where the average is taken over a mini-batch of samples collected from the sampling distribution.
In the following, an approach based on the option-critic method, reformulated as the DAC architecture, is proposed to obtain the policy. In obtaining the policy, we employ Theorem 1 to guarantee UUB stability.
Option-Critic Method: The option-critic method is an HRL method with three policies: a master policy, an intra-option policy, and an option termination function [31, 6]. The master policy decides which option should be performed. On the basis of this decision, an action is taken through the intra-option policy until the option is terminated by the termination function. Accordingly, in the context of sub-problem (14), the master policy specifies the probability of choosing the update variable at each time step, and then the control input is determined by the intra-option policy. Furthermore, the termination function is omitted (similar to [32] and [6]) because of the binary type of the update variable: when the master policy chooses one option, it simultaneously terminates the other. Considering this operating model, we have:


In [31], it is demonstrated that the option-critic method can be reformulated as the DAC architecture, which contains two augmented MDPs: the high-level MDP and the low-level MDP, employed for choosing the option and the action respectively. Consequently, the high-level MDP of the base module is defined as:


where the indicator function selects the active option. Also, the high-level policy on the augmented state space is defined as:


The low-level MDP and policy of the base module are respectively stated as:




Considering the trajectories of the original MDP and the two augmented MDPs, two bijection functions are obtained, which map the original trajectories to the high-level and low-level trajectories respectively. Here, the following lemmas hold, similar to [31]:

Lemma 3.

Assuming the first bijection function, the high-level augmented MDP yields the same trajectory probabilities and expected returns as the original MDP.

Fig. 3: DAC architecture of the base module
Lemma 4.

Assuming the second bijection function, the low-level augmented MDP yields the same trajectory probabilities and expected returns as the original MDP.

The proofs of the above lemmas are provided in Appendices A and B respectively. These lemmas specify that the high-level and low-level policies can share the same samples with the original MDP. In the same way as the provided proofs, Theorem 2 can be simply derived as follows.

Theorem 2.

Following Lemma 3, Lemma 4, and Theorem 2, to handle sub-problem (14), the learning agent alternately optimizes the high-level policy (deciding the binary option, i.e., the update variable) and the low-level policy (deciding the continuous control input). Thereby the option-critic method is reformulated into the DAC architecture (see Fig. 3). To optimize the policies in each augmented MDP, the Proximal Policy Optimization (PPO) method is used, similar to [31].
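One hierarchical decision step of this DAC-style scheme can be sketched as follows: the high-level actor picks the binary update option, and the low-level actor supplies a continuous input only when an update fires. Both policy functions below are hypothetical placeholders for trained networks, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def high_level_policy(state):
    """Hypothetical stand-in for the learned master policy: returns the
    probability of choosing the update option (delta = 1)."""
    return 0.5  # a trained network would map the state to this probability

def low_level_policy(state):
    """Hypothetical stand-in for the learned intra-option policy: returns
    a continuous control input when the update option is active."""
    return float(np.tanh(np.mean(state)))  # placeholder for a Gaussian actor mean

def hierarchical_step(state, u_prev):
    """One DAC-style decision: the high-level actor picks the option, and
    the low-level actor acts only if the update option fires; otherwise
    the previous input is held (the ZOH behavior)."""
    delta = int(rng.random() < high_level_policy(state))
    u = low_level_policy(state) if delta == 1 else u_prev
    return delta, u

state = np.array([0.2, -0.1, 0.4])
delta, u = hierarchical_step(state, u_prev=0.0)
```

In training, both actors would be updated from the same sampled transitions, which is exactly what Lemmas 3 and 4 justify.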
In the DAC architecture, the two parameterized policies (the high-level and low-level policies) are estimated in two actor neural networks. Additionally, the parameterized value function and the parameterized constraint-related function are estimated in two critic neural networks. Therefore, to update the two sets of policy parameters, considering the UUB stability constraint, the following objective functions are minimized respectively:




where the average is taken over a mini-batch of samples, the clip function constrains the policy probability ratio within the clipping interval, and the clipping range is a hyper-parameter. A positive Lagrangian multiplier, adjusted via gradient ascent, maximizes the following objective function [21]:


In equations (27) and (28), the advantage function at time step $k$ is estimated via Generalized Advantage Estimation (GAE) as:

$\hat{A}_k = \sum_{l=0}^{\infty} (\gamma \lambda)^l\, \delta^{V}_{k+l} \qquad (30)$

where $\lambda$ is the GAE parameter and $\delta^{V}_k$ is the temporal-difference (TD) error, given by $\delta^{V}_k = r_k + \gamma V(s_{k+1}) - V(s_k)$.
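GAE, as used here, is a discounted backward accumulation of TD errors. The following sketch (with illustrative rewards and values) implements it; setting the GAE parameter to zero recovers the plain TD error:

```python
import numpy as np

def gae_advantages(rewards, values, gamma, lam):
    """Generalized Advantage Estimation: a (gamma*lam)-discounted sum of
    TD errors delta_t = r_t + gamma * V(s_{t+1}) - V(s_t). `values` carries
    one extra entry, the bootstrap value of the final state."""
    T = len(rewards)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
    adv, acc = np.zeros(T), 0.0
    for t in reversed(range(T)):          # accumulate backwards in time
        acc = deltas[t] + gamma * lam * acc
        adv[t] = acc
    return adv

adv = gae_advantages([1.0, 1.0], [0.5, 0.5, 0.5], gamma=0.99, lam=0.95)
```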

The critic's value-function parameters are updated by a Stochastic Gradient Descent (SGD) algorithm as:


where the learning rate is the SGD step size and the objective function is calculated as:


where the target value of the temporal-difference error is used.
In summary, at each time step, each actor network selects its action according to its current state using its policy. This leads to the state transition and a new reward value, which is evaluated by the critic network via the value function. Afterward, the TD error is calculated, which is the critic network's feedback for optimizing the two policies via optimization problems (27) and (28). In addition, the selected actions and the state transition lead to updating the value function and the constraint-related function through (20) and (29). Then, the updated TD error is sent to the actor networks as feedback from the critic network to optimize the two policies.

III-D The First Attribute Module

Having decided the update variable, the DRL policy of the first attribute module is trained to address sub-problem (15). The state and action spaces of this module are given by:


Considering the objective and the constraints of sub-problem (15), the reward function of the first module is calculated as:


where one hyper-parameter denotes the penalty weight on exceeding the limit of the BS's power consumption, and another hyper-parameter indicates the reward weight on the number of users that receive their required rate.
Option-Critic Method: The action space of this module is also hybrid. Accordingly, to address sub-problem (15), the option-critic method reformulated as the DAC architecture is employed as well. Assuming the sub-carrier allocation matrix is the option variable, we have:


The option-critic method can be reformulated as two augmented MDPs: the high-level MDP is used for the sub-carrier assignment and the low-level MDP for the power allocation. The high-level MDP of the first module is given as: