I. Introduction
Emerging ICPSs, such as the smart grid, smart manufacturing, and smart transportation, are spatially distributed and high-dimensional. These systems require high reliability, communication between numerous devices, low latency, and power-efficient communication, and they involve a high computational load [1, 2]. To manage these requirements, 5G and beyond-5G networks offer a wide range of services classified as 1) enhanced mobile broadband (eMBB), 2) ultra-reliable and low-latency communication (URLLC), and 3) massive machine-type communication (mMTC). The eMBB, URLLC, and mMTC services provide a high data rate with moderate latency, communication with low end-to-end delay, and connectivity for many devices, respectively. Because most communications in control subsystems are of the URLLC type, a 5G network is a good choice for exchanging data between a controller and sensors or actuators.
However, there are serious challenges to deploying 5G networks in ICPSs. Specifically, ICPSs with 5G networks have limited network resources and lack the desired stability and performance guarantees [3, 4]. The performance of the control subsystem is defined as achieving the required dynamic response, specified by measures of performance such as a desirable steady-state tracking error. The stability and performance of a control subsystem may be guaranteed through periodic transmissions with a high data rate. This, however, comes at the cost of a higher packet loss rate due to the limited resources of wireless networks [5]. Furthermore, in many ICPS applications, most wireless devices rely on batteries, and their battery life may be significantly reduced by an increased transmission rate [6]. Consequently, the event-triggered control (ETC) method has been proposed, in which the transmission times of the control subsystem are triggered by a predefined event instead of following a periodic schedule. This event is characterized according to the stability and performance requirements of the control subsystem.
In recent years, extensive research has concentrated on different classes of ETC strategies; see [7, 8] and the references therein. In this context, there is also substantial work demonstrating the superior energy saving and performance of ETC compared with traditional periodic control [9, 10]. Nevertheless, these works analyze only low-dimensional or linear models of control subsystems [6, 11]. Moreover, the analysis of event-triggered control becomes far more complicated when the volatile properties of wireless communication, such as delay, limited resources, packet drops, and unreliable links, are considered.
The design of event-triggered control in the presence of unreliable links and packet losses has recently drawn considerable attention [12, 13, 11]. However, in addition to packet drops, many other features of wireless communication, such as delay and limited resources, have a direct impact on the stability and performance of control subsystems. To deal with these interconnections between the control and communication subsystems, a joint design method is adopted in ICPSs [14, 15, 6, 16]. However, developing an analytical model of all control and network features is a fundamental challenge for this method, because the subsystems are typically high-dimensional and the conditions of radio resources change continuously and randomly.
Therefore, researchers have used model-free reinforcement learning (RL) in the joint design of ICPS subsystems [17, 18, 6, 14, 19]. In [17], RL is used to propose a sensor scheduler while the controller is designed beforehand. The actor-critic RL method is also used in [18] to learn event-triggered control. In [6], the option method of deep RL (DRL) is used for the joint optimization of an event policy and a control policy; the event policy determines when the control input should be transmitted, and the control policy determines what value of the control input should be sent. Nonetheless, the varying characteristics of the wireless network are not considered in [6, 18]. In [14], an RL approach is used to jointly design the sampling rate of the control subsystem and the modulation type of the wireless network.
Although stability is an essential property of every control subsystem, RL methods can hardly guarantee the stability and reliability of a learning-based controller [20]. Nonetheless, in [20, 21, 22], a learning-based controller with a uniformly ultimate boundedness (UUB) stability guarantee is proposed, which can be usefully employed in ICPSs with safety constraints. In general, UUB stability says that if the norm of the starting state variables of a control subsystem is less than a specified value, then the state variables will eventually enter a neighborhood of the subsystem's equilibrium within finite time and will never escape from this neighborhood afterwards [21].
The goal of this paper is to jointly design the event-triggered control and the energy-efficient allocation of radio resources in an ICPS. To the best of our knowledge, this joint design problem has not yet been studied. We propose a novel hierarchical RL (HRL) approach with a UUB stability guarantee to solve the problem. Our contributions are as follows.

We consider an ICPS containing multiple eMBB users and a control plant with multiple URLLC users sharing a single-cell Orthogonal Frequency-Division Multiple Access (OFDMA) network. We formulate the joint design of the event-triggered control and the energy-efficient resource allocation in the ICPS as a multi-objective optimization problem. The goals of the problem are to minimize both the number of updates on the actuators' input and the energy consumption in the downlink. The constraints of this problem comprise the dynamics and UUB stability of the control plant, the minimum Quality of Service (QoS) demands of the eMBB and URLLC users, and the power and subcarrier constraints of the OFDMA network.

The problem is high-dimensional, complicated, and associated with a hybrid action space. To handle these properties, we combine the Cascade Attribute Learning Network (CAN) method and the option-critic method to develop a novel model-free HRL approach with a UUB stability guarantee. First, we use the CAN method to decouple the problem into two low-dimensional subproblems of control and resource allocation. We show that this decoupling leads to a Pareto solution of the optimization problem. In the second step, we use the option-critic method, reformulated as the Double Actor-Critic (DAC) architecture, to address each subproblem with its hybrid action space.

The novel model-free HRL with a UUB stability guarantee can simultaneously learn four policies: 1) the update time policy on the actuators' input, 2) the control policy, which determines the value of the control input, 3) the energy-efficient subcarrier allocation policy, and 4) the energy-efficient power allocation policy.

We demonstrate the effectiveness and capability of the proposed approach through several simulations. In comparison with a disjoint, model-based method, our numerical results show that both the number of updates on the actuators' input and the downlink energy consumption are reduced significantly by the proposed approach. Moreover, we show the capability of the proposed approach compared with the soft actor-critic algorithm.
This paper is outlined as follows. The system model and problem formulation are described in Section II. The proposed approach is presented in Section III. In Section IV, simulation results are discussed. In Section V, the paper's conclusion and future work are given.
II. System Model and Problem Formulation
II-A. System Model
Consider a model of an ICPS that consists of 1) a control plant, 2) a central event-triggered controller, and 3) a downlink model of an OFDMA cellular network (Fig. 1). The state values of the control plant are measured by multiple sensors and sent to the event-triggered learning-based controller. Next, the controller calculates and sends the control input to the actuators, whenever required, through the single-cell OFDMA network.
Following the 5G architecture described by the International Telecommunication Union (ITU), the central learning-based controller is assumed to run on dedicated or shared hardware in the central office data center layer, which is placed near the network's Base Station (BS) [23].
Control Plant: We suppose the dynamics of the control plant are unknown, that is:
(1) 
where and are unknown functions, and , , and denote the vectors of the control state, control input, and sensors' output at discrete time (), respectively. Also, the vector denotes the actuation disturbance at discrete time . We assume the control plant, described by dynamics (1), is completely state observable, as is regularly assumed in the related literature, e.g., [24].

Event-triggered Controller: When the sensors' output vector () is received by the central event-triggered controller, it decides whether the actuators' input should be updated () or whether to ignore the update and save wireless resources (). This decision is made based on the UUB stability guarantee of the control plant, defined in what follows.
Definition 1 [25].
A control plant is uniformly ultimately bounded with ultimate bound , if there are positive constants and , such that . If can be arbitrarily large, then the control plant is globally uniformly ultimately bounded.
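The definition can be checked empirically on a sampled trajectory. The sketch below is our own illustration (the system, disturbance model, and bound are invented for the example; a finite trajectory can only suggest, not prove, UUB):

```python
import numpy as np

def enters_and_stays(trajectory, bound):
    """Empirical check of Definition 1 on a finite trajectory: return the
    first step T after which ||x_k|| <= bound for every remaining k, or
    None if no such step exists within the sample."""
    norms = np.linalg.norm(trajectory, axis=1)
    for T in range(len(norms)):
        if np.all(norms[T:] <= bound):
            return T
    return None

# Toy stable system x_{k+1} = 0.5 x_k + w_k with bounded disturbance w_k:
rng = np.random.default_rng(0)
x = np.array([10.0, -8.0])
traj = [x.copy()]
for _ in range(50):
    x = 0.5 * x + rng.uniform(-0.1, 0.1, size=2)
    traj.append(x.copy())
T = enters_and_stays(np.array(traj), bound=1.0)  # some finite entry time T
```

Because the contraction factor 0.5 dominates the bounded disturbance, the state norm shrinks geometrically until it settles inside the ultimate bound and never leaves it again.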
In addition, if the update variable , then the controller calculates the control input variable () considering UUB stability. We assume that a zero-order hold (ZOH) keeps the actuators' input constant between two consecutive updates. This can be expressed mathematically as:
(2) 
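Equation (2) amounts to the following hold logic (a sketch with our own variable names):

```python
def zoh_actuator_input(u_prev, u_new, delta):
    """Eq. (2) sketch: the actuator input is refreshed only when the
    update variable delta = 1; the zero-order hold keeps it constant
    between two consecutive updates."""
    return u_new if delta == 1 else u_prev

u, updates, history = 0.0, 0, []
schedule = [(1, 2.0), (0, 9.9), (0, -1.0), (1, 0.5)]  # (delta_k, candidate input)
for delta, u_new in schedule:
    u = zoh_actuator_input(u, u_new, delta)
    updates += delta
    history.append(u)
# Only 2 of the 4 candidate inputs are actually transmitted.
```

Skipped updates cost nothing over the air, which is exactly the resource saving the event-triggered controller exploits.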
OFDMA Network:
We assume the downlink model of a single-cell OFDMA network with one BS. The model has downlink users denoted by . The downlink users comprise a set of control plant users (URLLC users) defined by and a set of moving eMBB users (coexisting with the control plant users) defined by . Note that the terms URLLC users and control plant users are used interchangeably from here on. We consider that the URLLC users are fixed and the eMBB users move within the BS coverage area. Let the total bandwidth of the network be divided into subcarriers forming set . Also, let be the base station's transmit power for communicating with downlink user on subcarrier at discrete time ; this variable is assumed to be continuous. The overall transmit power of the BS is limited to a maximum value represented by , which means . Moreover, the BS's total power usage in the considered ICPS is calculated as [26]:
(3) 
where is a constant power used by the BS circuit, is the amplifier inefficiency constant, and is the binary subcarrier allocation variable: if subcarrier is allocated to downlink user at discrete time , and otherwise. Also, and are the power and subcarrier allocation matrices at discrete time , respectively ( and ).

The downlink Signal-to-Noise Ratio (SNR) for user on subcarrier is given by [27]:

(4)

where is the channel gain for each user on subcarrier at discrete time and denotes the corresponding additive white Gaussian noise power at the receiver of user .
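As an illustration, a minimal NumPy sketch of the power model (3) and SNR (4) — the names `P_circuit`, `xi`, `rho`, `p`, and the toy values are our own placeholders for the paper's notation, and the exact placement of the amplifier-inefficiency factor is an assumption:

```python
import numpy as np

# Binary subcarrier allocation rho[n][k] and transmit power p[n][k]
# for 2 downlink users over 2 subcarriers (toy values).
rho = np.array([[1, 0],
                [0, 1]])
p = np.array([[0.5, 0.0],
              [0.0, 0.3]])  # watts

def bs_total_power(P_circuit, xi, rho, p):
    """Eq. (3) sketch: constant circuit power plus amplifier-scaled
    transmit power summed over the allocated (user, subcarrier) pairs."""
    return P_circuit + xi * np.sum(rho * p)

def downlink_snr(p_nk, h_nk, noise_nk):
    """Eq. (4) sketch: received power over noise for one (user, subcarrier)."""
    return p_nk * h_nk / noise_nk

total = bs_total_power(P_circuit=1.0, xi=2.0, rho=rho, p=p)  # 1 + 2*(0.5+0.3)
snr = downlink_snr(0.5, 0.8, 0.04)
```

The element-wise product `rho * p` zeroes out the power on unallocated subcarriers, matching the exclusive-assignment constraints introduced later in problem (11).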
In accordance with Shannon's formula, the achievable instantaneous transmission rate for each eMBB user is computed in bit/s as:
(5) 
where is the bandwidth of subcarrier . Moreover, the QoS requirement of each eMBB user is expressed in terms of a minimum transmission rate [27]. Therefore, the required QoS of eMBB users is represented by:
(6) 
where is the minimum required QoS of eMBB user at discrete time . The packet sizes of URLLC users are generally short, so Shannon's formula cannot accurately describe their transmission rate [28, 27]. The achievable transmission rate of URLLC users under the finite blocklength channel coding regime is derived in [28] as:
(7) 
where is the number of symbols in each codeword block, is the inverse of the Gaussian Q-function, is the error probability, and is the dispersion of subcarrier for user , given by:

(8)
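A small Python sketch contrasting the two rate formulas. It uses the standard normal-approximation form of the finite-blocklength rate with dispersion V = 1 − (1 + SNR)⁻²; variable names and numbers are ours, not the paper's:

```python
import math
from statistics import NormalDist

def embb_rate(bw, snr):
    """Eq. (5): Shannon rate in bit/s on one subcarrier of bandwidth bw."""
    return bw * math.log2(1.0 + snr)

def urllc_rate(bw, snr, m, eps):
    """Eqs. (7)-(8) sketch (normal approximation): with channel dispersion
    V = 1 - (1 + snr)^-2, blocklength m symbols, and error probability eps,
    rate ~= bw * (log2(1 + snr) - sqrt(V / m) * Qinv(eps) / ln 2)."""
    V = 1.0 - (1.0 + snr) ** -2
    q_inv = NormalDist().inv_cdf(1.0 - eps)  # inverse Gaussian Q-function
    return bw * (math.log2(1.0 + snr) - math.sqrt(V / m) * q_inv / math.log(2))

# Short blocklengths pay a rate penalty relative to the Shannon limit:
r_embb = embb_rate(15e3, 10.0)
r_urllc = urllc_rate(15e3, 10.0, m=100, eps=1e-5)  # strictly smaller
```

The penalty term vanishes as the blocklength m grows or as eps approaches 0.5, recovering the Shannon rate, which is why (5) suffices for eMBB traffic but not for short URLLC packets.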
In a single time slot , to satisfy the required QoS of URLLC users, it is necessary to satisfy the following condition on the achievable instantaneous data rate:
(9) 
where is the length of the actuator's packet in bits and is the maximum tolerable transmission delay for the packet. We calculate according to the given maximum tolerable end-to-end (e2e) delay between the controller and the actuators. Let be the maximum queuing and computation delay, and assume the propagation delay is negligible. Thus, we conservatively take the e2e delay to be:
(10) 
Notably, we assume that the minimum reliability requirement of the URLLC users is satisfied through enabler techniques such as low-rate codes.
II-B. Problem Formulation
We now formally state the joint design problem of the event-triggered control and the energy-efficient resource allocation of the OFDMA network as a multi-objective optimization problem. It aims to minimize both the number of updates on the actuators' input and the total downlink power usage, subject to the dynamics and UUB stability of the control plant, the QoS demands of the eMBB and URLLC users, the power and subcarrier constraints, and the maximum practicable level of the BS's transmit power. This problem is formulated as:
(11) 
where constraints , , and represent the plant dynamics, the event-triggered controller function, and the UUB stability requirement of the control plant, respectively. Constraint indicates that the update variable takes binary values. Constraints and represent the required QoS of the eMBB and URLLC users, respectively. Constraints and enforce the exclusive assignment of subcarriers in the OFDMA network, and constraint bounds the transmit power of the BS by its maximum allowable value.
In multi-objective optimization problem (11), minimizing the second objective reduces the transmit power of the control plant's users. Consequently, the downlink transmission rates decrease and the transmission delay increases. Accordingly, to guarantee UUB stability of the control plant (), the number of updates on the actuators' input increases in future time steps, and the first objective function grows. Due to this trade-off between the two objective functions, the idea of Pareto optimality is employed as a solution concept for problem (11) [29]. The Pareto-optimal solution is defined as follows.
Definition 2 [29].
Given a multi-objective optimization problem with objective functions , and considering that all objectives are to be minimized, a feasible solution can dominate another one (or is better than ) if:

for all and

for at least one .
is called a Pareto-optimal solution when no other solution can be found that dominates . In other words, is a Pareto-optimal solution if and only if it is feasible and no better feasible solution exists.
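Definition 2 translates directly into code; a minimal sketch under the minimization convention (function names and the toy points are ours):

```python
def dominates(a, b):
    """a dominates b if a is no worse in every objective
    and strictly better in at least one (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(solutions):
    """A solution is Pareto-optimal iff no other feasible solution dominates it."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t != s)]

points = [(1, 5), (2, 2), (5, 1), (3, 3)]  # (objective 1, objective 2)
front = pareto_front(points)               # (3, 3) is dominated by (2, 2)
```

With conflicting objectives, as in problem (11), the front contains several incomparable trade-off points rather than a single optimum.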
III. The Proposed Approach
In optimization problem (11), the dynamics model of the control plant and its interconnection with the network are unknown. To address this problem, we propose a novel model-free HRL approach. Specifically, a Markov Decision Process (MDP) associated with problem (11) is first constructed. Because the state and action spaces of the MDP are large, we first apply the CAN method and decompose problem (11) into two subproblems. Then, the DAC architecture is used to solve each subproblem with its hybrid action space.

III-A. RL-Related Definitions
The joint design problem can be described by an MDP , where is the set of possible states, is the set of actions, is the reward function (), is the initial state distribution (), and is the state-transition probability (). The state at time step , , is defined as:
(12) 
where denotes the status of the URLLC and eMBB users at environment time step : if user receives its minimum required rate, and otherwise. We define the learning agent's action at time step , , as follows:
(13) 
An action is taken at each time step on the basis of policy , which gives the likelihood of each action for every possible state. By choosing , the environment transitions from the current state to according to the transition probability , and a reward is obtained (). Denoting the transition trajectory by , the goal of RL is to obtain a policy () that maximizes the expected cumulative reward along the trajectory, given by , where denotes the discount factor weighting the importance of future rewards and is the cumulative reward of an episode between step and the terminal step .
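The cumulative discounted reward the agent maximizes can be accumulated backwards over an episode, e.g.:

```python
def discounted_return(rewards, gamma):
    """G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...,
    computed by folding the episode's rewards from the terminal step back."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

G0 = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
```

The backward recursion G = r + gamma * G avoids recomputing powers of gamma and is the same pattern the GAE estimator in Section III-C uses for advantages.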
III-B. Applying the CAN Method and Decomposing the Problem
Clearly, the size of the state and action spaces of the joint design problem may be too large in practical cases. In such a high-dimensional and complex problem, the speed of learning is considerably reduced. Furthermore, the training process generally consumes an unreasonable amount of computational power. To manage these challenges, the CAN method is used as explained in [30]. In the CAN method, the learning process of a complicated problem is decomposed into low-dimensional attribute modules, which are linked in a cascade series. The state space of each attribute is kept as small as possible provided that it can completely describe the attribute, indicated by . Also, each attribute has its own reward function (), and the transition probability distribution of each attribute is indicated by . Although it is shown in [30] that the CAN method makes the training process significantly faster and simpler, it is not mathematically proven there that applying the decoupling method yields an optimal or suboptimal solution. Here, however, we demonstrate this for problem (11) through the following lemmas.

Lemma 1.
The second objective of problem (11) is decreasing with respect to .
Proof:
By decreasing , the number of control users that need to communicate decreases, so the number of downlink users, N, decreases. Consequently, the total power consumption of the BS decreases in accordance with equation (3). ∎
Lemma 2.
Proof:
Lemma 2 is demonstrated by contradiction. Assume that there is a feasible solution that dominates (i.e., minimizes the first objective function). However, in accordance with Lemma 1, the second objective function decreases with decreasing , and the first objective is already optimized at . Thus, the conditions presented in Definition 2 are not fulfilled, and does not dominate . As a result, the initial presumption that there is a feasible point dominating is contradicted. ∎
Lemma 2 allows us to decouple optimization problem (11) into two subproblems as:
(14)  
and
(15)  
The architecture of the proposed approach applying the CAN method is shown in Fig. 2. The training process of the proposed approach has two parts. In the first part, the DRL policy of the base attribute module is trained to address subproblem (14). The base module is fed with and outputs , considering reward function . Notably, contains , a continuous variable, and , a binary variable. Having decided , the DRL policy of the first attribute module is trained subsequently, which is responsible for solving subproblem (15). This module is fed with and outputs the power matrix along with the subcarrier matrix , considering reward function .
The action space of each subproblem is hybrid, and the majority of standard RL-based solutions are not appropriate for such hybrid problems [6]. Therefore, to address each subproblem, we propose to use the option-critic method, reformulated as the DAC architecture in [31], since it is well suited to hybrid action spaces [32, 6].
III-C. The Base Attribute Module
To handle subproblem (14), the state and action spaces of the base module are defined as:
(16) 
The base module is responsible for learning a policy () over and . The policy aims to maximize the expected cumulative reward along the transition trajectory . The reward function of the base module is defined as:
(17) 
where the first term () is the control reward and the second term () penalizes the number of updates on the actuators' input. The control reward is defined to encourage the control plant to reach its specified targets. Also, is a hyperparameter denoting the penalty weight on the number of actuator updates.
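A sketch of the reward in (17), with `rho_u` as our placeholder for the update-penalty hyperparameter and a simple quadratic tracking term standing in for the control reward (the concrete form of the control reward is an assumption, not the paper's):

```python
def base_reward(state_error, delta, rho_u):
    """Eq. (17) sketch: a control reward (here, negative squared tracking
    error) minus a penalty of rho_u for every actuator update (delta = 1)."""
    control_reward = -state_error ** 2
    return control_reward - rho_u * delta

r_hold = base_reward(state_error=0.2, delta=0, rho_u=0.5)    # no update cost
r_update = base_reward(state_error=0.2, delta=1, rho_u=0.5)  # pays rho_u
```

Tuning `rho_u` trades tracking quality against transmission count: a larger penalty pushes the learned policy toward fewer actuator updates.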
To guarantee UUB stability of the learning-based controller with policy , we use a more general definition of UUB stability presented in [21]. Indeed, in [21], the classical definition of UUB stability (Definition 1) is extended to general cases in which the stability constraint functions are not necessarily the norm of the control state ().
Let be the constraint function under the policy and be a continuous nonnegative constraint function, defined to measure how good or bad a state-action pair of the base module is. The general definition of UUB stability with respect to is stated in what follows.
Definition 3 [21].
A control plant is UUB with respect to , if there are positive constants and : , such that .
It is shown that Definition 3 is an inherent feature of the control plant when it is UUB stable. Thus, if the control plant is UUB with respect to , then the closed-loop control is UUB [21, 22]. Note that UUB refers to the property defined by Definition 3 from here on.
Theorem 1 [21].
Assume that the Markov chain induced by policy is ergodic, , and . If there are a function and positive constants , , and , such that

(18)
and
(19)  
where denotes the average distribution of over the finite time steps, , and , then guarantees UUB stability of the control plant with ultimate bound . If for any there is a such that , then .
Similar to [21], a fully connected deep neural network is used to construct the function , which satisfies , and the function is parameterized by . A ReLU activation function is employed in the output layer of the deep neural network to guarantee a positive output. To update , the following objective function is minimized:

(20)
where denotes the average over a mini-batch of samples collected from the sampling distribution .
In the following, an approach based on the option-critic method, reformulated as the DAC architecture, is proposed to obtain . In obtaining policy , we employ Theorem 1 to guarantee UUB stability.
Option-Critic Method:
Option-critic is an HRL method with three policies: a master policy, an intra-option policy, and an option termination function [31, 6]. The master policy decides which option should be performed. On the basis of this decision, an action is taken through the intra-option policy until the option is terminated by the termination function. Accordingly, in the context of subproblem (14), the master policy specifies the probability of choosing the update variable at each time step, and the control input is then determined by the intra-option policy. Furthermore, the termination function is omitted (similar to [32] and [6]) because of the binary nature of the update variable : when the master policy chooses one option ( or ), it terminates the other option simultaneously. Considering this execution model, we have:
(21) 
In [31], it is demonstrated that the option-critic method can be reformulated as the DAC architecture, which contains two augmented MDPs: the high-level MDP, , and the low-level MDP, , which are employed for choosing the option and the action, respectively. Consequently, the high-level MDP of the base module is defined as:
(22) 
where is the indicator function. Also, the high-level policy on is defined as:
(23) 
The low-level MDP and policy of the base module are respectively stated as:
(24) 
and
(25) 
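The factorization in (21)–(25) corresponds to sampling the hybrid action in two stages. The sketch below is our own illustration: the policy callables and the toy control law are placeholders, not the paper's learned networks:

```python
import random

def sample_hybrid_action(pi_high, pi_low, state, rng=random):
    """DAC-style two-stage sampling: the high-level policy picks the
    binary option delta; only when delta = 1 does the low-level policy
    produce a new continuous control input (otherwise the ZOH holds)."""
    delta = 1 if rng.random() < pi_high(state) else 0
    u = pi_low(state, delta) if delta == 1 else None  # None = hold previous input
    return delta, u

# Placeholder policies: always update, and a trivial linear control law.
delta, u = sample_hybrid_action(lambda s: 1.0, lambda s, d: -0.5 * s, state=2.0)
```

Splitting the hybrid action this way lets each actor network live in a simple space — one Bernoulli head for the option, one continuous head for the control input — instead of a single mixed discrete-continuous output.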
Considering the trajectories of , , and , two bijection functions and are obtained, which map to and to , respectively. Here, the following lemmas hold, similar to [31]:
Lemma 3.
Assuming the bijection function , we have and .
Lemma 4.
Assuming the bijection function , we have and .
The proofs of the above lemmas are provided in Appendices A and B, respectively. These lemmas specify that and can share the same samples with . In the same way as the provided proofs, Theorem 2 can be readily derived as follows.
Theorem 2.
(26) 
Following Lemma 3, Lemma 4, and Theorem 2, to handle subproblem (14), the learning agent alternately optimizes (deciding on the option variable ) and (deciding on the continuous variable ). Thereby, the option-critic method is reformulated as the DAC architecture (see Fig. 3). To optimize the policies in each augmented MDP (, ), the Proximal Policy Optimization (PPO) method is used, similar to [31].
In the DAC architecture, two parameterized policies, , summarized as , and , summarized as , are estimated by two actor neural networks. Additionally, the parameterized value function () and the parameterized function are estimated by two critic neural networks. Therefore, to update the two parameters and under the UUB stability constraint, the following objective functions are minimized, respectively:

(27)
and
(28) 
where is the average over a mini-batch of samples (the size of the mini-batch is ), the function clips the ratio to the interval , and is a hyperparameter. is a positive Lagrangian multiplier, which is adjusted via gradient ascent to maximize the following objective function [21]:
(29) 
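Adjusting the multiplier by dual gradient ascent can be sketched as follows; the projection at zero keeping the multiplier nonnegative, and all names here, are our own assumptions about the update in (29):

```python
def update_multiplier(lam, constraint_violation, lr):
    """Dual-ascent sketch for the UUB safety constraint: the Lagrangian
    multiplier grows while the constraint is violated (violation > 0)
    and decays toward zero once it is satisfied (violation < 0)."""
    return max(0.0, lam + lr * constraint_violation)

lam = update_multiplier(1.0, 0.5, lr=0.1)    # violated -> multiplier grows
lam = update_multiplier(0.02, -1.0, lr=0.1)  # satisfied -> projected to 0.0
```

This gives the constraint a price: the larger the accumulated violation, the more heavily the stability term weighs in the policy objectives (27) and (28).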
In equations (27) and (28), is the advantage function at time step , estimated via Generalized Advantage Estimation (GAE) as:
(30) 
where is the GAE parameter and is the temporal-difference (TD) error, given by .
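The GAE recursion in (30) can be computed backwards over a rollout; a sketch with a bootstrap value appended to `values` (function and argument names are ours):

```python
def gae_advantages(rewards, values, gamma, lam):
    """Eq. (30) sketch: A_t = sum_k (gamma * lam)^k * delta_{t+k}, where
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) is the TD error.
    `values` holds one extra entry: the bootstrap value of the last state."""
    assert len(values) == len(rewards) + 1
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        td_error = rewards[t] + gamma * values[t + 1] - values[t]
        running = td_error + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting the GAE parameter to 0 reduces each advantage to the one-step TD error, while 1 recovers the full Monte Carlo return minus the baseline, trading bias against variance.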
is updated by a stochastic gradient descent (SGD) algorithm as:
(31) 
where is the learning rate and is the objective function, calculated as:
(32) 
where is the target value of the temporal-difference error.
In summary, at time step , each actor network selects its action according to its current state using its policy. This leads to the state transition to and a new reward value, which is estimated by the critic network via the value function . Afterward, the TD error is calculated, which is the critic-network feedback used to optimize and through objectives (27) and (28). In addition, the selected actions and the state transition ( and ) lead to updating and through (20) and (29). Then, the updated is sent to the actor networks as critic-network feedback to optimize and .
III-D. The First Attribute Module
Having decided the update variable , the DRL policy of the first attribute module is trained to address subproblem (15). The state and action spaces of this module are given by:
(33) 
Considering the objective and constraints of subproblem (15), the reward function of the first module is calculated as:
(34) 
where is a hyperparameter denoting the penalty weight for exceeding the BS's power limit. Also, is a hyperparameter indicating the weight on the number of users receiving their required rate.
Option-Critic Method: The action space of this module is also hybrid. Accordingly, to address subproblem (15), the option-critic method reformulated as the DAC architecture is employed as well. Taking the subcarrier allocation matrix as the option variable, we have:
(35) 
The option-critic method can be reformulated as two augmented MDPs. The high-level MDP () is used for the subcarrier assignment () and the low-level MDP () is used for the power allocation (). The high-level MDP of the first module is given as: