I Introduction
Mobile edge computing (MEC) is an emerging technology that provides cloud computing capabilities at the edge of mobile networks, in close proximity to mobile subscribers. Compared with mobile cloud computing (MCC), MEC can reduce latency and offer an improved user experience. Meanwhile, the Internet of Things (IoT) comprises devices with sensing, actuating, computation, and communication capabilities, which are connected to the Internet and collaboratively enable a wide variety of new applications, including smart city/home, e-health, and industrial automation. As IoT devices normally have very limited computation and storage capabilities, MEC enables latency-sensitive IoT applications to offload large amounts of sensed data to MEC servers, which are deployed near the base stations (BSs) and offer large storage and computation facilities [1, 2, 3, 4]. To upload the sensed data from the IoT devices to the MEC server, narrowband IoT (NB-IoT) cellular transmission technology is an attractive option. NB-IoT was introduced in Third Generation Partnership Project (3GPP) Release 13 and is a Long-Term Evolution (LTE) variant designed specifically for IoT [5, 6]. It enables mobile operators to efficiently support a massive number of IoT devices with low data rate transmissions and improved coverage, using a small portion of their existing licensed spectrum. NB-IoT has received great interest from major industrial partners in 3GPP, such as Ericsson, Nokia, Intel, and Huawei [7].
In this paper, we consider an NB-IoT edge computing system, where MEC servers are deployed at NB-IoT-enabled BSs. Based on this system, mobile operators can provide an efficient solution for IoT applications by jointly optimizing the radio and computational resources. One important challenge in the resource control of such a system is the offloading problem, which decides whether or not an IoT device should offload a chunk of sensed data to the MEC server. Offloading reduces the data computation delay, as the central processing units (CPUs) of the MEC servers are much faster than those of the IoT devices, but it also incurs additional delay from data transmission. Moreover, the power consumption of local computation versus wireless transmission usually needs to be considered as well, as many IoT devices have limited energy (e.g., they are powered by batteries). In addition, the radio resource allocation decisions in NB-IoT have significant effects on the data transmission delay and power consumption, which in turn affect the offloading performance.
I-A Related Work
The joint radio and computational resource control problem in multiuser MEC systems, where several mobile devices share the same MEC server, has been studied in several recent works. A survey is provided in [8], where the computation task models considered in existing research are divided into deterministic and stochastic models. The deterministic task models assume that no new task arrives until the old task is executed or discarded, so that the resource control decision for a particular task is made solely based on the information of the current task [9, 10]. The stochastic task models are more practical: tasks arrive according to a stochastic process and are buffered in a queue if they cannot be processed immediately upon arrival. The resource control decisions for a particular task under the stochastic task models need to account for their impact on future tasks in terms of the long-term average performance of the system. Therefore, the problem is more complex under the stochastic task models, especially in the multiuser scenario, due to the large dimensionality of the problem. A solution using the Lyapunov optimization method is given in [11], which considers a general wireless network and optimizes the energy consumption. In [12], a perturbed Lyapunov function is designed to stochastically maximize a network utility balancing throughput and fairness, and a knapsack problem is solved per slot to obtain the optimal offloading schedule.
The Markov decision process (MDP) is a powerful dynamic optimization framework for obtaining the optimal resource control policy under the stochastic task arrival model in terms of the long-term average performance. However, solving the MDP model for the multiuser system is difficult due to the well-known curse-of-dimensionality problem, where the state space grows exponentially with the number of users [13, 14]. For this reason, previous studies based on MDP models are mainly restricted to the single-user MEC system [15, 17, 16]. Reinforcement learning (RL), especially deep reinforcement learning (DRL), provides a class of solution methods to address the curse-of-dimensionality problem in MDPs, where the agents interact with the environment to learn optimal policies that map states to actions [18]. DRL algorithms can be broadly classified into value-based methods, such as the deep Q-network (DQN) [19]; policy gradient methods; and actor-critic methods, which can be considered a combination of value-based and policy gradient methods. DRL algorithms enable RL to scale to problems that were previously intractable. Recent years have seen increasing applications of RL [14, 15, 17, 16, 22] and DRL [23, 24, 25, 29, 27, 28, 26, 30] algorithms to the resource control problems in MEC and IoT systems.
Specifically, DRL algorithms for multiuser MEC systems have been considered in several existing works. The works in [27] and [28] focus on the offloading and resource allocation problems under deterministic task models, where a fixed number of tasks per user need to be processed either locally or offloaded to the edge server. DQN-based techniques are applied to solve the respective problems. This is different from the stochastic task model considered in this paper. In [26], distributed power allocation policies for local execution and computation offloading are derived under a stochastic task model with a dynamic task arrival process by applying the deep deterministic policy gradient (DDPG) algorithm. It is assumed in [26] that all the users can transmit simultaneously by leveraging multiuser MIMO. This is different from the NB-IoT system considered in this paper, where only one user can be scheduled for transmission over the kHz bandwidth. The mutual exclusion nature of multiuser resource allocation makes it hard to design a fully distributed solution as in [26], where each user makes independent decisions according to its local state information. Moreover, the offloading and resource allocation problem in [26] is reduced to a power allocation problem by considering a data-partition task model [8], which results in a continuous action space that favors policy gradient or actor-critic algorithms over value-based algorithms. In this paper, we adopt a value-based algorithm, as the action space is discrete.
Another thread of related research is multi-agent RL [31], which typically involves multiple agents learning individual policies. The state transitions and rewards depend on the joint actions of all the agents. Compared with single-agent RL, multi-agent RL can solve the action space explosion problem, i.e., the cardinality of the action space grows exponentially with the number of agents. For example, independent Q-learning is a popular algorithm in which each agent independently learns its own policy, treating the other agents as part of the environment [32]. However, a problem with independent Q-learning is that the environment becomes non-stationary [33]. There are several survey papers on multi-agent RL that introduce the challenges and solutions [34, 35, 36]. In this paper, multi-agent RL algorithms cannot be applied directly because of the mutual exclusion nature of the resource allocation problem. As the radio resources can only be allocated to at most one user at a time, an agent cannot make individual decisions while ignoring the decisions of the other agents. Moreover, due to the semi-Markov characteristics of the RL model, the action space does not grow exponentially with the number of users as in multi-agent RL. At each decision epoch, only the offloading decision of one user needs to be considered upon the arrival of a new task.
I-B Contributions
In this paper, we propose a deep reinforcement learning method with a value function approximation architecture based on artificial neural networks (ANNs) for the multiuser resource control problem of the NB-IoT edge computing system. We formulate the dynamic optimization problem as an infinite-horizon average-reward continuous-time Markov decision process (CTMDP) model. In the CTMDP model, the global reward function can be represented as the sum of per-user local reward functions. This corresponds to a typical optimization objective for multiuser resource control problems, where the overall system performance (e.g., delay or power consumption) is the sum or average of the per-user performance. Moreover, the resource control action includes the offloading action and the multiuser scheduling action. The latter is subject to the constraint that at most one user can be scheduled for data transmission at a time. This is a typical intra-cell resource allocation consideration in cellular networks, which makes it difficult to directly apply existing multi-agent RL algorithms.
The main contribution of this paper lies in the design of a neural network architecture for function approximation that facilitates a semi-distributed implementation of the learning algorithm in the multiuser environment. Specifically, the edge server and BS make the resource control decisions with an auction-based mechanism, where the large number of IoT devices compute their bids in a distributed manner and submit them to the BS and edge server.
The motivation for the semi-distributed implementation is twofold. Firstly, although the proposed algorithm can be implemented centrally at the BS, the computational complexity and required storage capacity increase with the number of IoT devices. Therefore, through efficient collaboration between the BS and the IoT devices, the IoT devices can help to alleviate the computational and storage burdens on the BS. This is in accordance with the design principles for the new generation of wireless networks, namely making use of smart user equipment (UEs) to help the BS. Secondly, although a fully distributed implementation seems attractive from a performance perspective, the mobile operators need to be able to control the scarce spectrum resources in the licensed band [37]. Therefore, in the proposed semi-distributed implementation, the BS makes the control decisions while the IoT devices submit individual bids.
In the design of the neural network architecture, we propose several novel features to facilitate a semi-distributed implementation with good performance and limited communication overhead. Firstly, we approximate the global value function by the sum, over all users, of the product of each user's local value function and local feature. The local value function depends solely on the local system state of a user, whereas the local feature depends on the global system state to improve the accuracy of the approximation. Secondly, we adopt a convolutional layer to compress the local system state of every user into a single scalar. This greatly reduces the signaling overhead for the BS to inform the IoT devices of the global system state, and also improves the performance of the learning algorithm. Thirdly, we insert a multiplication layer before the output layer, so that only the local value function associated with the current local system state needs to be updated per decision epoch for each user. This greatly reduces the computational complexity and signaling overhead associated with parameter updates. Finally, with the auction-based mechanism in the implementation, each IoT device submits a bid per local action, and the BS selects the joint action that results in the optimum global value function. In this way, the global optimum is ensured through the semi-distributed implementation. The proposed function approximation architecture can be adopted by other multiuser resource control problems that share a similar problem structure.
The rest of the paper is organized as follows. In Section II, the system model is introduced. Section III formulates the CTMDP problem, which is solved in Section IV using value function approximation, neural networks, and reinforcement learning techniques. The semi-distributed implementation procedure is also discussed in Section IV. In Section V, the performance of the proposed algorithm is compared by simulation with that of the baseline algorithms as well as other DRL algorithms. Section VI concludes the paper.
II System Model
We consider an IoT edge computing system, where a BS with an MEC server serves IoT devices in a single cell [22]. For each IoT device, the sensed data arrives in packets according to a Poisson process with its own mean arrival rate. There are two queues at each IoT device to buffer the sensed packets. One is the transmission queue for the packets that are to be offloaded to the MEC server for remote computation, and the other is the processing queue for the packets that are to be processed locally by the IoT device. When a new packet arrives at an IoT device, the offloading function decides whether to place it in the transmission queue for offloading or in the processing queue for local processing. Moreover, the multiuser scheduling function in the wireless network decides how to allocate the radio resources to different IoT devices for the transmission of the offloaded packets. The system model for the IoT edge computing system considered in this paper is illustrated in Fig. 1.
Assumption 1 (Resource Unit (RU) configuration in NB-IoT)
We consider that the RU configuration with subcarriers time slots is always selected for every IoT device [38]. Therefore, only one IoT device can be scheduled for transmission at the same time.
Assumption 2 (Link adaptation in NB-IoT)
In the LTE system, link adaptation is performed dynamically per subframe to adapt the Modulation and Coding Scheme (MCS) level according to the instantaneous channel quality. As a narrowband transmission technology with a relatively low data rate, the transmission of a transport block (TB) in NB-IoT can occupy multiple consecutive subframes, i.e., a TB may be mapped to multiple RUs in time [38]. This means that the transmission duration can be larger than the coherence time of the wireless channel. Therefore, in this paper, we consider that link adaptation is performed according to the time-average wireless channel conditions of the IoT devices, which are determined only by the large-scale fading effects, i.e., path loss and shadowing. Moreover, we focus on those IoT applications where the locations of the IoT devices do not often change once they are deployed, e.g., smart metering. Therefore, the MCS level and the corresponding transmission data rate for an IoT device remain the same as long as it does not change its location.
As a narrowband transmission technology with a relatively low data rate, the transmission of a transport block in NB-IoT can occupy multiple consecutive subframes [38]. In this paper, we consider that the transmission duration of a packet from an IoT device is exponentially distributed, with a mean equal to the reciprocal of that device's mean transmission rate in packets per second. Moreover, the transmission power consumption is a constant value for any IoT device [39].
We consider that the local processing time of a packet at an IoT device is exponentially distributed, with a mean equal to the reciprocal of that device's mean processing rate in packets per second. The power consumption for processing the sensed data locally at the IoT device is also a constant value [17].
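The traffic and service assumptions above can be illustrated with a short simulation sketch for a single IoT device. The function name and all rate values below are hypothetical and chosen only for illustration; the sketch simply draws Poisson arrivals (through exponential inter-arrival times) together with exponentially distributed transmission and local processing durations.

```python
import random


def sample_device_trace(arrival_rate, tx_rate, proc_rate, num_packets, seed=0):
    """Draw arrival instants and candidate service durations for one device.

    All rates are in packets per second and are illustrative values,
    not parameters taken from the paper.
    """
    rng = random.Random(seed)
    t = 0.0
    trace = []
    for _ in range(num_packets):
        # Poisson arrivals are equivalent to exponential inter-arrival times.
        t += rng.expovariate(arrival_rate)
        tx_time = rng.expovariate(tx_rate)      # duration if the packet is offloaded
        proc_time = rng.expovariate(proc_rate)  # duration if processed locally
        trace.append((t, tx_time, proc_time))
    return trace


trace = sample_device_trace(arrival_rate=2.0, tx_rate=5.0, proc_rate=1.0,
                            num_packets=20000)
mean_tx = sum(tx for _, tx, _ in trace) / len(trace)
mean_proc = sum(p for _, _, p in trace) / len(trace)
print(f"empirical mean transmission time: {mean_tx:.3f} s")
print(f"empirical mean processing time:   {mean_proc:.3f} s")
```

With enough samples, the empirical means should approach the reciprocals of the assumed rates, which is the property the model relies on.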
In this paper, we aim to jointly derive the optimal scheduling policy and computation offloading policy that minimize the weighted sum of the average delay and power consumption over all the IoT devices. Specifically, the average delay depends on the delay values of both the offloaded and the locally processed packets. The delay of an offloaded packet includes three parts, i.e., the uplink transmission delay, the remote computation delay, and possibly the downlink transmission delay. We make the following assumption when deriving the average delay for the offloaded packets.
Assumption 3 (Delay for offloaded packets)
We assume that the average delay of the offloaded packets equals their average uplink transmission delay. This is because the sum of the remote computation delay and the downlink transmission delay is usually negligible compared with the uplink transmission delay and the local computation delay, due to the much more powerful CPUs of the MEC servers and the much heavier uplink IoT traffic.
III CTMDP Model
In this section, we shall formulate an infinite horizon average reward Continuous Time Markov Decision Process (CTMDP) problem to minimize the weighted sum of the average delay and power consumption for the IoT devices.
III-A Global System State
We formulate a CTMDP model where the global system states are observed at each packet arrival and departure event. We denote the global system state at the th decision epoch, , by .
Transmission queue state
is the vector of transmission queue lengths observed at the beginning of the th decision epoch, when the packet arrival/departure event has just occurred. , denotes the transmission queue length of IoT device , where is the maximum transmission queue length.
Processing queue state
is the vector of processing queue lengths observed at the beginning of the th decision epoch, where , denotes the processing queue length of IoT device . is the maximum processing queue length.
Event
Let indicate the event occurred at the beginning of the th decision epoch which triggers the state transition from to .


represents a packet arrival at IoT device ;

represents a packet departure from the scheduled transmission queue;

represents a packet departure from the processing queue of IoT device , where .
Scheduled transmission queue
is the scheduling action at the last (i.e., th) decision epoch.


represents the index of the scheduled transmission queue at the last (i.e., th) decision epoch;

means no transmission queue is scheduled.
Example 1 (Definition of global system state)
The global system state ( denotes a vector of zeros) indicates that the system state transits to the current state due to a packet arrival at IoT device . At the beginning of the current system state, all the transmission queues and processing queues are empty. No transmission queue is scheduled at the previous system state.
The cardinality of the global system state space is , which grows exponentially with the number of IoT devices .
III-B Action
When a system state transition occurs due to a packet arrival or departure event, an action will be taken in the CTMDP model. Define the action at the th decision epoch as .
Offloading action
represents the offloading action, which is only performed when there is a packet arrival.


means the newly arrived packet is offloaded;

means the newly arrived packet is not offloaded; or an offloading action is not applicable in the current system state (i.e., when in the current system state is a packet departure event);

means the newly arrived packet is dropped because both the transmission queue and processing queue of the IoT device are saturated.
From the above definition, the offloading action space clearly depends on the system state. This dependency is further demonstrated by the fact that if one of the two queues (i.e., the transmission queue and the processing queue) of the IoT device at which the packet arrived is saturated, the packet can only be dispatched to the other queue, and thus the offloading action is determined. Therefore, the state-dependent offloading action space is given as
(1) 
After the offloading action is made in the th decision epoch, the arrived packet will be dropped or added to the transmission queue or processing queue of IoT device depending on the offloading action. Therefore, the processing and transmission queue length of IoT device can be different from the values of and in the system state. Let and denote the processing and transmission queue length of any IoT device during the th decision epoch after the offloading decision is made with system state , we have
(2) 
and
(3) 
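As an illustration of the state dependency described above, the following minimal sketch enumerates the admissible offloading actions and applies the corresponding post-decision queue update for one IoT device. The numeric action encodings, function names, and argument names are hypothetical, chosen for this example only, and are not the notation of the model.

```python
# Hypothetical encodings of the three offloading actions described in the text.
OFFLOAD = 1        # the newly arrived packet is offloaded
LOCAL_OR_NA = 0    # not offloaded, or offloading not applicable (departure event)
DROP = -1          # both queues saturated, so the packet is dropped


def offloading_action_set(is_arrival, tx_len, proc_len, tx_max, proc_max):
    """Return the admissible offloading actions for the device hit by the event."""
    if not is_arrival:
        return [LOCAL_OR_NA]
    tx_full = tx_len >= tx_max
    proc_full = proc_len >= proc_max
    if tx_full and proc_full:
        return [DROP]
    if tx_full:
        return [LOCAL_OR_NA]   # only the processing queue has room
    if proc_full:
        return [OFFLOAD]       # only the transmission queue has room
    return [OFFLOAD, LOCAL_OR_NA]


def post_decision_queues(is_arrival, action, tx_len, proc_len):
    """Queue lengths of the device right after the offloading decision."""
    if is_arrival and action == OFFLOAD:
        return tx_len + 1, proc_len
    if is_arrival and action == LOCAL_OR_NA:
        return tx_len, proc_len + 1
    return tx_len, proc_len    # departure event or dropped packet


print(offloading_action_set(True, tx_len=2, proc_len=5, tx_max=5, proc_max=5))
print(post_decision_queues(True, OFFLOAD, tx_len=2, proc_len=5))
```

Only a free choice between the two queues leaves a genuine decision; in every other case the action set collapses to a single element.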
Define the post-decision transmission and processing queue vectors at the th decision epoch as and , based on the values of which the scheduling action will be made for the th decision epoch.
Scheduling action
is the scheduling action.


represents the index of the transmission queue that is scheduled at the current (th) decision epoch;

 means that no queue is scheduled, which only happens when all the transmission queues are empty at the time the scheduling action is determined, i.e., .
In this paper, we consider non-preemptive scheduling. Therefore, the scheduling action is only updated when (1) there is a packet departure from a transmission queue (i.e., ); or (2) no queue is scheduled at the time that an arrival event occurs (i.e., and ). In either case, the scheduled IoT device is selected from the set of IoT devices with non-empty transmission queues, i.e., . Otherwise, the scheduling action remains the same as in the previous decision epoch (i.e., ). Therefore, the state-dependent scheduling action set is given as
(4) 
Note that the cardinalities of the offloading action space and scheduling action space are and , respectively.
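The non-preemptive scheduling rule described above can be sketched as follows. The event labels, the use of `None` for "no queue scheduled", and the function name are assumptions made for this example only.

```python
def scheduling_action_set(event_kind, prev_scheduled, post_tx_queues):
    """Return the admissible scheduling actions under non-preemptive scheduling.

    event_kind:      "arrival", "tx_departure" or "proc_departure" (hypothetical labels)
    prev_scheduled:  index of the queue scheduled at the last epoch, or None
    post_tx_queues:  post-decision transmission queue lengths, one per device
    """
    reschedule = (
        event_kind == "tx_departure"
        or (event_kind == "arrival" and prev_scheduled is None)
    )
    if not reschedule:
        # An ongoing transmission is never pre-empted.
        return [prev_scheduled]
    candidates = [i for i, q in enumerate(post_tx_queues) if q > 0]
    # None means that no queue is scheduled (all transmission queues are empty).
    return candidates if candidates else [None]


print(scheduling_action_set("tx_departure", 0, [0, 2, 1]))
print(scheduling_action_set("arrival", 1, [1, 3]))
```

In this sketch the scheduler has a real choice only at a transmission departure, or at an arrival into an idle channel; in all other cases the action set contains exactly one element.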
III-C Post-Decision Global System State
We define the post-decision global system state at the th decision epoch as , which is a deterministic function of the global system state and the action at the th decision epoch as below:
(5) 
Note that the state space of the post-decision global system states is the same as that of the global system states, as denoted by .
III-D Transition Probability
Given the global system state and action at the th decision epoch, the transition to the global system state at the th decision epoch can be described in two phases.

 Phase 1

where can be derived by (5) as a deterministic function of and ;
 Phase 2

where is a deterministic function of and as below:
(6) 
where , .
Note that the event at the th decision epoch occurs when there is a packet arrival at any of the IoT devices, or there is a packet departure from the scheduled transmission queue, or from any of the non-empty processing queues. Therefore, the transition probabilities correspond to the probabilities that event happens:
(7) 
where we set as will not happen when no transmission queue is scheduled during the th decision epoch, i.e., .
The duration of the th decision epoch or equivalently, the sojourn time of the CTMDP in state given action is exponentially distributed with parameter as
(8) 
where can also be expressed as a function of the post-decision state .
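The structure of the transition probabilities and the sojourn time can be illustrated with the standard competing-exponentials argument: the total event rate is the sum of the rates of all events that can currently occur, and each event occurs with probability equal to its share of that total. The sketch below follows this textbook reasoning under the arrival and service assumptions of Section II; it is not a transcription of the paper's expressions, and all names and rate values are hypothetical.

```python
def event_rates(arrival_rates, tx_rates, proc_rates, scheduled, post_proc_queues):
    """Rates of the events that can trigger the next decision epoch."""
    rates = {}
    for i, lam in enumerate(arrival_rates):
        rates[("arrival", i)] = lam                # arrivals can occur at any device
    if scheduled is not None:
        rates[("tx_departure", scheduled)] = tx_rates[scheduled]
    for i, q in enumerate(post_proc_queues):
        if q > 0:                                  # only non-empty processing queues
            rates[("proc_departure", i)] = proc_rates[i]
    return rates


def transition_distribution(rates):
    """Total rate (parameter of the exponential sojourn time) and event probabilities."""
    total = sum(rates.values())
    probs = {event: rate / total for event, rate in rates.items()}
    return total, probs


rates = event_rates(arrival_rates=[1.0, 2.0], tx_rates=[4.0, 5.0],
                    proc_rates=[0.5, 0.5], scheduled=1, post_proc_queues=[0, 3])
total, probs = transition_distribution(rates)
print(f"mean sojourn time: {1.0 / total:.4f} s")
for event, p in sorted(probs.items()):
    print(event, round(p, 4))
```

When no transmission queue is scheduled, the transmission departure event is simply absent from the dictionary, which matches the statement that this event cannot happen in that case.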
III-E Reward Function
In order to derive the reward function of the CTMDP model, we first examine the optimization objective, which is to find the policy that minimizes the weighted sum of the average delay and power consumption over all the IoT devices. Note that a policy in an MDP model is a function that specifies the action that the decision maker will choose when in state . We formulate the above dynamic optimization problem as an average reward CTMDP problem.
Problem 1 (average reward CTMDP problem to minimize the weighted sum of average delay and power consumption)
(9)  
where and on the RHS of the first equality are the weights for the average delay and power consumption of IoT device , respectively. The weights and indicate the relative importance of the average delay and power consumption of IoT device in the optimization problem. The RHS of the second equality is the classical form of an average reward CTMDP problem, where and are the starting time and duration of the th decision epoch. represents that a reward is incurred at this rate when the system is in state and action is chosen at the th decision epoch.
From (9) and using Little’s Law to derive the delay, we can derive the expression of as below:
(10) 
where the indicator function takes the value of 1 if the condition in the subscript is true, and 0 otherwise. The detailed derivation is given in Appendix A.
According to the CTMDP theory [40], the reward function of the CTMDP model can be derived as
(11) 
where can also be expressed as a function of the post-decision state .
The optimal policy of the above CTMDP problem can be derived by solving the post-decision Bellman equation as
(12) 
where is the post-decision global value function, and is the optimal average reward rate.
IV Solution by DRL
IV-A Local System State
Define the local system state for IoT device at the th decision epoch as .
 Local event

, where


indicates a packet arrives at IoT device ;

indicates a packet departs from the transmission queue of IoT device ;

indicates a packet departs from the processing queue of IoT device ;

indicates the event at the th decision epoch does not happen at IoT device .

 Local schedule

, where


indicates that IoT device is scheduled at the th decision epoch;

indicates that IoT device is not scheduled at the th decision epoch.

Given the global system state at the th decision epoch, the local system state at IoT device can be derived by
(13) 
(14) 
The global system state corresponds to the aggregation of the local system states .
Example 2 (Definition of local system state)
Consider there are IoT devices in the system, and the global system state is . Thus, the local system states at the IoT devices are , , and , respectively.
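The mapping from a global system state to the local system state of one device can be sketched as follows. The event labels, the tuple layout, and the assumption that the local state consists of the device's two queue lengths together with its local event and local schedule are illustrative choices for this example, not the paper's exact notation.

```python
def local_state(i, tx_queues, proc_queues, event, prev_scheduled):
    """Project a global system state onto IoT device i.

    event is a hypothetical pair (kind, device), with kind in
    {"arrival", "tx_departure", "proc_departure"}; for "tx_departure" the
    departing queue is the one scheduled at the previous epoch.
    """
    kind, device = event
    if kind == "arrival" and device == i:
        local_event = "arrival"
    elif kind == "tx_departure" and prev_scheduled == i:
        local_event = "tx_departure"
    elif kind == "proc_departure" and device == i:
        local_event = "proc_departure"
    else:
        local_event = "none"   # the event did not happen at device i
    local_schedule = 1 if prev_scheduled == i else 0
    return (tx_queues[i], proc_queues[i], local_event, local_schedule)


# Three devices; a packet arrives at device 1 while device 2 is scheduled.
global_tx, global_proc = [0, 2, 1], [1, 0, 0]
for i in range(3):
    print(i, local_state(i, global_tx, global_proc, ("arrival", 1), prev_scheduled=2))
```

Each device sees only its own queue lengths plus two small flags, so the size of its local state space does not depend on the number of devices.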
Given the local system state of IoT device at the th decision epoch, and the action , we define the post-decision local system state , which can be derived by a deterministic function as
(15) 
Given the post-decision local system state of IoT device at the th decision epoch, and the event at the th decision epoch, the local system state of IoT device at the th decision epoch can be derived by a deterministic function as below by combining (IIID) with the definitions of (post-decision) local system states:
(16) 
As a remark, note that the cardinality of the local state space for any IoT device is , which does not grow with the number of IoT devices . In contrast, the cardinality of the global state space grows exponentially with the number of IoT devices .
IV-B Value Function Approximation
First, the local reward function is given as
(17) 
so that .
Moreover, we decompose the optimal average reward rate in (IIIE) as the sum of optimal local average reward rates of IoT , i.e.,
(18) 
In order to formulate our approximation architecture, we first introduce some notation to efficiently describe the mapping relations between the post-decision global system states and the post-decision local system states. Specifically, denote as the th post-decision global system state in the state space. We introduce a mapping function which denotes the index of the post-decision local system state of device within its local state space when the post-decision global system state is . Therefore, let denote the local system state of device when the global system state is . In other words, we have .
The approximation architecture for the post-decision global value function is given as
(19) 
where is the cardinality of the local system space of any device , and is the post-decision per-node value function of IoT device for its post-decision local system state . In the following discussion, we will omit the term "post-decision" before the global value function and per-node value function for simplicity. is the feature vector of the post-decision global system state . The simplest method is to set for any . However, the feature values of the local system state are the same for all the global system states that belongs to, i.e., the component values within the set are the same. This can lead to inaccuracy in the approximation by (19). In this paper, we will use an ANN to train the values of and simultaneously as shown in Fig. 2.
IV-B1 Input layer
The input of the neural network is a post-decision global system state , . The input layer has neurons, where the th neuron corresponds to the th post-decision local system state of device , i.e., .

Activation : The activation of the neuron is denoted by , which is a binary variable indicating whether is a component of the input post-decision global system state , i.e., . Let be the dimensional matrix of input neurons where the element in the th row and th column is .
IV-B2 Convolutional layer
The number of neurons in the convolutional layer is . The th neuron corresponds to the device .

Activation : The activation of the th neuron is denoted by , which represents the convolutional layer feature extracted from the local system state that is encoded by the th row of . Let be the dimensional activation vector of the convolutional layer whose th element is .
C-weight/Filter : The filter is a dimensional vector , where each element is referred to as the c-weight.
Thus, we have
(20) 
where denotes the transpose of the matrix .
IV-B3 Fully connected layer
The number of neurons in the fully connected layer is the same as that in the input layer, i.e., , where the th neuron corresponds to the post-decision local system state .

Activation : The activation of the neuron is denoted by . Let be the dimensional activation vector of the fully connected layer whose th element is .

F-weight : The f-weight is the weight of the link from the th neuron in the convolutional layer to the neuron in the fully connected layer. Let denote the dimensional matrix of f-weights, whose element in the th row and the th column is .
To derive the activation vector of the fully connected layer, we have
(21) 
where we set to be the Tanh function defined by .
IV-B4 Multiplication layer
Note that when , we would like according to (19). In order to guarantee this, we add a multiplication layer between the fully connected layer and the output layer. There are also neurons in the multiplication layer.

Activation : The activation of the th neuron is the feature in (19). The dimensional activation vector whose th element is can be derived by
(22) 
where is the Hadamard product, i.e., the element-wise product of two vectors.
Remark 1 (Discussion on the multiplication layer)
Note that the neural network and the DRL algorithm still work without the multiplication layer. The advantage of the multiplication layer is that it greatly reduces the number of weights or parameters that need to be updated at each decision epoch. Specifically, among the neurons in the multiplication layer, only neurons are active per decision epoch, which means that only the f-weights and per-node value functions associated with these active neurons need to be updated. This is similar to dropout in neural networks, which can prevent overfitting of the value function. Therefore, the multiplication layer not only greatly reduces the computational complexity of the DRL algorithm, but also improves its performance.
IV-B5 Output layer

Activation : The output of the neural network is the global value function of the input post-decision global system state , . Therefore, there is only one output neuron whose activation is .

Weight/Per-node value function : The purpose of the output layer is to derive the global value function of according to (19). The per-node value function is the weight of the link from the neuron in the fully connected layer to the output neuron. Let be the dimensional per-node value function (weight) vector with the th element as .
Then, (19) can be written as below
(23) 
(24) 
In the following discussion, we will refer to our proposed algorithm as the neural-ICFMO algorithm, which takes the first letter of each neural network layer.
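To make the layer-by-layer description concrete, the following pure-Python sketch runs one forward pass through the five layers (input, convolutional, fully connected, multiplication, output). The variable names, the nested-list layout of the weights, the absence of bias terms, and the toy numbers are assumptions made for this example; it is meant as a reading aid rather than the reference implementation.

```python
import math


def global_value(x, c, w, v):
    """Forward pass of the five-layer architecture for one post-decision state.

    x: one-hot input matrix, x[i][k] = 1 if device i is in local state k
    c: convolutional filter (c-weights), one entry per local state
    w: f-weights, w[m][i][k] links conv neuron m to fully connected neuron (i, k)
    v: per-node value functions, v[i][k] for device i in local state k
    """
    n, k = len(x), len(c)
    # Convolutional layer: compress each device's local state to a single scalar.
    y = [sum(x[i][j] * c[j] for j in range(k)) for i in range(n)]
    # Fully connected layer with tanh activation, one neuron per (device, state).
    z = [[math.tanh(sum(w[m][i][j] * y[m] for m in range(n))) for j in range(k)]
         for i in range(n)]
    # Multiplication layer: Hadamard product with the input keeps only the
    # neurons that match the current local states.
    f = [[z[i][j] * x[i][j] for j in range(k)] for i in range(n)]
    # Output layer: weighted sum with the per-node value functions.
    return sum(v[i][j] * f[i][j] for i in range(n) for j in range(k))


x = [[1, 0], [0, 1]]                  # device 0 in state 0, device 1 in state 1
c = [0.5, -1.0]
w = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
v = [[2.0, 7.0], [9.0, 3.0]]
print(global_value(x, c, w, v))
```

Because of the multiplication layer, the entries `v[0][1]` and `v[1][0]` have no influence on the output for this input, which is why only the parameters tied to the active neurons need to be updated at a decision epoch.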
IV-C Optimal Control Action
To illustrate the structure of our solution, we first assume that we could obtain the per-node value function vector , the f-weight matrix , the c-weight vector , and the local optimal average reward rate for every IoT device by some means. At the th decision epoch, we focus on deriving the optimal action under the current system state to minimize the value of the RHS of (24) as below:
(25) 
IV-D Per-Node Value Function and Weight Update
In the above discussion, we consider that all the per-node value functions, f-weights, c-weights, and are known for every IoT device, and derive the optimal control actions based on these values. In this section, we will discuss how to derive the above values using the stochastic gradient descent (SGD) TD(0) method under function approximation for the average-reward problem [18]. The loss function at the th decision epoch is defined as
(26) 
where is the average reward rate of IoT device up to the th decision epoch. The gradient of the loss function is derived by the well-known backpropagation algorithm for the ANN. The detailed procedures and equations to update the value functions and weights are given in Appendix B and summarized in Algorithm 2 below. Note that at the th decision epoch, is used in place of in (IVC) to derive the optimal action . In the following discussion, we add a subscript to the notations described in Section III.B to represent the parameter values at the th decision epoch.
IV-E Semi-Distributed Implementation of the Solution
The proposed deep reinforcement learning algorithm (i.e., Algorithm 2) can be implemented centrally at the BS. In this case, the BS needs to store the per-node value function vectors, the f-weights, and the c-weights for all the IoT devices, whose number grows quadratically instead of exponentially with the number of IoT devices due to the function approximation. Moreover, all the computational tasks for deriving control actions and maintaining the per-node value function vectors, the f-weights, and the c-weights need to be performed at the BS. On the other hand, the proposed algorithm also allows a semi-distributed implementation, in which the BS and the IoT devices collaboratively determine the optimal policy, as illustrated in Fig. 3.
Specifically, we consider that each IoT device stores and updates its own per-node value function vector and the f-weight vectors