I Introduction
Since multimedia and real-time applications such as augmented reality require powerful computational capability [1], mobile devices with limited computational capability may not be able to run these novel applications [2]. To overcome this issue, mobile edge computing (MEC) servers can be deployed at the wireless base stations (BSs) to help mobile devices process their computational tasks [3]. However, the deployment of MEC servers over wireless networks also faces a number of challenges such as the optimization of MEC server deployment, task allocation, and energy efficiency [4].
A number of existing works, such as [5]–[13], have studied important problems related to wireless and computational resource allocation. In [5], the authors maximized the spectrum efficiency via optimizing computational task allocation. The authors in [6] studied the minimization of the energy consumption of all users using MEC. In [7], the authors optimized the energy efficiency of each user in MEC-based networks. However, the existing works [5]–[7] only optimized the resource allocation for one BS and, hence, may not be suitable for a network with several BSs. The authors in [8] studied the multiuser computational task offloading problem to minimize the users’ energy consumption. In [9], the authors proposed a binary computational task offloading scheme to maximize the total throughput. The work in [10] developed a resource management algorithm to minimize the long-term system energy cost. The authors in [11] developed a task offloading scheme to minimize the energy consumption. However, the works in [8]–[11] assumed that all users request a computational task that can be offloaded to the MEC servers. They did not consider the scenario in which the types of requested computational tasks differ (e.g., some users must process the computational task locally and other users can process the computational tasks with the help of MEC servers). The computational and communication resources required for processing different types of tasks are different [12]. For example, a task that is performed by both the user and the MEC server needs more communication resources than a task that is processed by the user itself [13]. Meanwhile, as the data size of each computational task requested by each user varies, the BSs need to rerun their optimization algorithms to cope with this change, thus resulting in additional overhead and delay for computational task processing [14].
To solve this problem, one promising solution is to use a reinforcement learning (RL) approach, since RL algorithms can learn a relationship between the users’ computational tasks and the resource allocation policy and can thus directly generate the resource allocation policy without spending time searching for the optimal resource allocation strategy [15].
The existing literature in [16]–[19] studied the use of RL algorithms for solving MEC-related problems. The work in [16] developed a federated RL algorithm to minimize the sum of the energy consumption of the devices. In [17], the authors used a federated deep RL approach to optimize the caching strategy in an MEC-based network. However, the works in [16] and [17] require the BSs to exchange the resource allocation scheme and each user’s state, thus increasing communication overhead. In [18], the authors proposed a model-free RL task offloading mechanism to minimize the energy consumption of users. An RL algorithm is used in [19] to maximize the throughput of the BSs under the constraint of the communication cost of each user. However, the RL algorithms in [18] and [19] may repeatedly learn the same resource allocation scheme during the training process, thus increasing the RL convergence time. Therefore, it is necessary to develop a novel algorithm that can avoid learning the same resource allocation scheme and improve the learning efficiency.
The main contribution of this paper is a novel resource allocation framework for an MEC-based network in which users can request different computational tasks. In summary, our key contributions are:

We consider an MEC-based network in which each user can request different types of computational tasks, in contrast to the existing works that consider only a single type of computational task [8]–[11]. To effectively serve the users, a novel resource allocation scheme must be developed. This problem is formulated as an optimization problem that aims to minimize the maximum computation and transmission delay among all users.

To solve the proposed problem, we develop a multi-stack RL method. Compared to the conventional RL algorithms in [16]–[19], the proposed algorithm uses multiple stacks to record historical resource allocation schemes and users’ states, which avoids learning the same information repeatedly and thus improves the convergence speed and the learning efficiency.

We perform a fundamental analysis of the gains that stem from the change of the transmit power and the subcarriers over uplink and downlink for each user. The analytical result shows that, to reduce the maximum delay among all users, each BS prefers to allocate more downlink subcarriers and downlink transmit power to a user with a task that must be processed by the MEC server. In contrast, each BS prefers to allocate more uplink subcarriers and uplink transmit power to a user with a task that must be locally processed.
Simulation results illustrate that the proposed RL algorithm can reduce the number of iterations needed for convergence and the maximal delay among all users by up to 18% and 11.1%, respectively, compared to Q-learning. To the best of our knowledge, this is the first work that studies the use of a multi-stack RL method to optimize the resource allocation in an MEC-based network.
The rest of this paper is organized as follows. The system model and the problem formulation are described in Section II. The multiple stack RL method for resource and task allocation is presented in Section III. In Section IV, numerical results are presented and discussed. Finally, conclusions are drawn in Section V.
II System Model and Problem Formulation
We consider an MEC-based network with a set of BSs serving a set of users, as shown in Fig. 1. In our model, each user can only connect to one BS for task processing and each BS can simultaneously execute multiple computational tasks requested by its associated users [20].
II-A Transmission Model
The orthogonal frequency division multiple access (OFDMA) transmission scheme is adopted for each BS [21]. Let and be the set of uplink orthogonal subcarriers and downlink orthogonal subcarriers, respectively. Given a bandwidth for each uplink or downlink subcarrier, the uplink and downlink data rates of user associated with BS over uplink subcarrier and downlink subcarrier can be given by (in bits/s) [22]:
(1) 
(2) 
respectively, where is user ’s transmit power on uplink subcarrier and is BS ’s transmit power on downlink subcarrier . and is the channel gain between user and BS over subcarrier and , respectively. Here, and are the Rayleigh fading parameters, is the distance between user and BS , and is the path loss exponent. is the power of the Gaussian noise. is the uplink subcarrier allocation index with indicating that user associates with BS n using subcarrier , and otherwise, we have . is the downlink subcarrier allocation index with indicating that BS connects to user using subcarrier , and , otherwise.
The sum transmission rate over uplink and downlink between user m and BS n is:
(3) 
(4) 
where , , , and
.
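To make the rate model concrete, the following Python sketch computes a per-subcarrier Shannon rate and the sum rate over the subcarriers allocated to one user. It is an illustration under stated assumptions rather than the exact form of (1)–(4): the channel gain is assumed to be the fading coefficient multiplied by the distance raised to the negative path loss exponent, interference from other BSs is ignored, and all function and parameter names are hypothetical.

```python
import math


def channel_gain(fading, distance_m, path_loss_exp):
    """Assumed gain model: fading coefficient times distance^(-path loss exponent)."""
    return fading * distance_m ** (-path_loss_exp)


def subcarrier_rate(bandwidth_hz, tx_power_w, gain, noise_power_w):
    """Shannon rate in bits/s on a single subcarrier (no interference assumed)."""
    return bandwidth_hz * math.log2(1.0 + tx_power_w * gain / noise_power_w)


def sum_rate(allocation, bandwidth_hz, tx_powers_w, gains, noise_power_w):
    """Sum rate over the subcarriers whose allocation index equals 1."""
    return sum(
        subcarrier_rate(bandwidth_hz, power, gain, noise_power_w)
        for index, power, gain in zip(allocation, tx_powers_w, gains)
        if index == 1
    )
```

Only subcarriers with allocation index 1 contribute to the sum, which mirrors the role of the binary subcarrier allocation indices in the text.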
II-B Computation Model
We assume that each user can request one computational task from three types of computational tasks, specified as follows:

Edge task: Edge tasks requested by users must be completely computed by MEC servers [23]. Then, the computational result must be transmitted to the users. For example, when a user wants to watch a movie on the mobile device, the BS must compress the video before this video is transmitted to the user [24]. The time used to process an edge task requested by user is given by:
(5) where F is the CPU clock frequency of an MEC server. represents the number of CPU cycles used to compute one bit data at an MEC server. is the data size of the computational task of user m. is a constant to represent the ratio between the data size of each computational task before processing and the data size of the computational result after processing. F, , and are assumed to be equal for all MEC servers. The first term represents the time consumption for computing the task requested by user in the MEC server and the second term represents the time consumption for transmitting the computational result to user m.

Local task: A local task must be completely computed at mobile devices and then transmitted to the BS [25]. For example, when a user wants to upload photos to Twitter, the user must compress the images locally before they are transmitted to the BS [26]. The time that user uses to compute its local task is given by:
(6) where f is the CPU clock frequency of each user and is the number of CPU cycles used to compute one bit data at each user m. The first term implies the time consumption for computing the task locally and the second term implies the time consumption for transmitting the computational result to BS n.

Collaborative task: Each collaborative task can be divided into a local computational task processed by a user and an edge computational task processed by an MEC server [27]. For example, when a user plays a virtual reality (VR) online game, the BS must collect the tracking information from the user and then transmit the generated VR image to the user [28]. The time consumption for processing the collaborative task can be given by:
(7) where is the fraction of the task that user processes locally (called local computing) with being the task division parameter. represents the computational time of user m, represents the time consumption for computing the offloaded task in the MEC server, and represent the time for the computational task transmission over uplink and downlink, respectively. In our model, each BS cannot simultaneously communicate with the users and compute the tasks that are offloaded from the users. This is because each BS must first communicate with the users to receive each user’s offloaded task and then compute these tasks. Since the collaborative task can be processed by the MEC server and the user simultaneously, depends on the maximum time between the local computing time and the edge computing time , as shown in (7).
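The three delay models can be sketched in Python as follows. This is a hedged illustration, not a reproduction of (5)–(7). It assumes that `result_fraction` is the size of the computational result divided by the size of the input. It also assumes that, for a collaborative task, only the offloaded part produces a result that is sent over the downlink. Finally, it assumes that the collaborative delay is the larger of the local computing time and the uplink-plus-edge computing time, followed by the downlink transmission. All names are hypothetical.

```python
def edge_delay(data_bits, mec_cycles_per_bit, mec_freq_hz,
               result_fraction, downlink_rate):
    """Edge task: compute at the MEC server, then send the result downlink."""
    compute_time = mec_cycles_per_bit * data_bits / mec_freq_hz
    transmit_time = result_fraction * data_bits / downlink_rate
    return compute_time + transmit_time


def local_delay(data_bits, user_cycles_per_bit, user_freq_hz,
                result_fraction, uplink_rate):
    """Local task: compute at the user, then send the result uplink."""
    compute_time = user_cycles_per_bit * data_bits / user_freq_hz
    transmit_time = result_fraction * data_bits / uplink_rate
    return compute_time + transmit_time


def collaborative_delay(data_bits, local_share, user_cycles_per_bit,
                        user_freq_hz, mec_cycles_per_bit, mec_freq_hz,
                        result_fraction, uplink_rate, downlink_rate):
    """Collaborative task under the assumed combination rule described above."""
    local_time = user_cycles_per_bit * local_share * data_bits / user_freq_hz
    offloaded_bits = (1.0 - local_share) * data_bits
    uplink_time = offloaded_bits / uplink_rate
    edge_time = mec_cycles_per_bit * offloaded_bits / mec_freq_hz
    downlink_time = result_fraction * offloaded_bits / downlink_rate
    # The user and the MEC server work in parallel, so the slower branch dominates.
    return max(local_time, uplink_time + edge_time) + downlink_time


def max_delay(delays):
    """Largest delay among all users, i.e., the quantity to be minimized."""
    return max(delays)
```

Under these assumptions, increasing `local_share` shortens the offloading branch and lengthens the local branch, so the collaborative delay is smallest when the two branches are balanced.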
II-C Problem Formulation
Next, we formulate the optimization problem that aims to minimize the maximal computational and transmission delay among all users. The minimization problem involves determining uplink subcarrier allocation indicator , downlink subcarrier allocation indicator , the uplink transmit power , the downlink transmit power , and the task allocation indicator of each user . The optimization problem can be formulated as follows:
(8)  
(8a)  
(8b)  
(8c)  
(8d)  
(8e)  
(8f)  
(8g)  
(8h)  
(8i) 
where , , , , and . (8a) implies that each user can request one of three types of computational tasks. (8b) indicates the uplink and downlink subcarrier allocation between user and BS . (8c) and (8d) guarantee that each uplink or downlink subcarrier can be allocated to at most one user. (8e) and (8f) ensure that each user can connect to at most one BS for data transmission. (8g) and (8h) are the constraints on the maximum transmit power of each BS and each user , respectively. (8i) indicates that the collaborative tasks can be cooperatively processed by both BSs and users. Problem (8) is a mixed integer nonlinear programming problem with discrete variables and and continuous variables , , and . Hence, it is difficult to solve problem (8) directly by traditional algorithms such as the dual method [29]. Moreover, as the data size of each computational task requested by each user varies, the BSs must rerun their optimization algorithms to cope with this change, thus resulting in additional overhead and delay for computational task processing [30]. Consequently, we develop a novel RL approach that can learn a relationship between the users’ computational tasks and the resource allocation policy so as to directly generate the resource allocation policy without spending time searching for the optimal resource allocation strategy.
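The feasibility conditions described for (8c)–(8h) can be checked mechanically. The Python sketch below is one possible reading and not the exact constraint set: it assumes binary allocation matrices indexed as `alloc[m][k]` (user m, subcarrier k of one BS) and `assoc[n][m]` (BS n, user m), and it assumes that the power constraint bounds the sum of the per-subcarrier transmit powers. All names are hypothetical.

```python
def subcarriers_exclusive(alloc):
    """True if every subcarrier of one BS is allocated to at most one user."""
    num_subcarriers = len(alloc[0]) if alloc else 0
    return all(
        sum(user_row[k] for user_row in alloc) <= 1
        for k in range(num_subcarriers)
    )


def single_bs_per_user(assoc):
    """True if every user is connected to at most one BS."""
    num_users = len(assoc[0]) if assoc else 0
    return all(
        sum(bs_row[m] for bs_row in assoc) <= 1
        for m in range(num_users)
    )


def within_power_budget(powers, max_power):
    """True if the total transmit power stays within the budget (assumed sum form)."""
    return sum(powers) <= max_power
```

A candidate action can be rejected before its delay is evaluated whenever one of these checks fails, which keeps infeasible allocations out of the search.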
III Reinforcement Learning for Optimization of Resource Allocation
Next, we introduce a novel RL approach to solve the optimization problem in (8). First, the components of the proposed learning algorithm are introduced. Then, we explain the use of the learning algorithm to solve (8). Finally, the convergence and implementation of the proposed algorithm are analyzed.
III-A Components of Multi-stack RL Method
A multi-stack RL algorithm consists of three components: a) state, b) action, and c) reward. In particular, is the discrete space of environment states, is the discrete set of available actions for BS n at step k, and is the reward function of BS n. The components of the multi-stack RL algorithm are specified as follows:

State: The environment state consists of three components, , where represents the maximal computational and transmission delay among all users, represents the user whose time consumption is maximal among all users, and represents the notion of the user that requests computational and transmission resource at current step. Note that, is determined by the finite and discrete actions , , , and . Since and , the defined environment states are finite and discrete.

Action: Since each BS jointly optimizes task, subcarrier, and transmit power allocation scheme, the action , where , , , and . The uplink transmit power and downlink transmit power are separately divided into levels. Hence, we assume that and . To find the optimal task allocation , we present the following result:
TABLE I: Summarization of the Time Consumption.
Rows (the type of computational tasks): edge task, local task, collaborative task.
Columns (the variation of resource allocation): downlink subcarriers, uplink subcarriers, downlink transmit power, uplink transmit power.
Theorem 1.
For the collaborative task, the optimal task allocation is given by:
(9) where
Proof:
See Appendix A.
Theorem 1 shows that the task allocation depends on the transmit power and the subcarrier allocation. In particular, as the transmit power and the number of subcarriers over uplink and downlink allocated to each user increase, the part of a task computed by the MEC server increases. Consequently, the computational time decreases.
Substituting (9) into (7), we have:
(10) where
In Theorem 1, we build the relationship between the task allocation and the transmit power and the subcarrier allocation. Next, we analyze the gain that stems from the change of the transmit power and the number of the subcarriers over uplink and downlink allocated to user m. To present the reduction of the delay due to the change of the number of the subcarriers and transmit power allocated to each user, we first summarize the time consumption notations, as shown in Table I. In Table I, represents the variation of time consumption for processing an edge task when the resource allocation scheme changes. In particular, , , , and , respectively, represent the variation of time consumption for processing an edge task due to the change of the number of downlink subcarriers, the number of uplink subcarriers, downlink transmit power, and uplink transmit power. Similarly, and represent the variation of time consumption for processing a local task and a collaborative task when the resource allocation scheme changes, respectively. Given the time consumption notations, we present the relationship between the time consumption and the change of the number of subcarriers and transmit power allocated to each user.
Theorem 2.
The reduction of the delay due to the change of the number of the subcarriers and transmit power allocated to user m is:

The gain due to the change of the number of downlink subcarriers allocated to user m that requests an edge task, , is:
(11) where represents the variation of downlink subcarriers allocation. indicates that BS allocates downlink subcarrier to user , otherwise, we have . is the module of , which indicates the number of downlink subcarriers that will be allocated to user . Similarly, represents the number of downlink subcarriers that are already allocated to user .

The gain that stems from the change of the number of uplink subcarriers allocated to user m that requests a local task, , is:
(12) where represents the variation of uplink subcarriers allocation. Similarly, indicates that BS allocates uplink subcarrier to user and , otherwise. indicates the number of uplink subcarriers that will be allocated to user . is the number of uplink subcarriers that are already allocated to user .

The gain that stems from the change of the number of downlink subcarriers allocated to user m that requests a collaborative task, , is:
(13) where and

The gain that stems from the change of the number of uplink subcarriers allocated to user m that requests a collaborative task, , is:
(14) where and

The gain that stems from the change of the downlink transmit power of user m that requests an edge task, , is:
(15)

The gain that stems from the change of the uplink transmit power of user m that requests a local task, , is:
(16)

The gain that stems from the change of the downlink transmit power of user m that requests a collaborative task, , is:
(17) where and

The gain that stems from the change of the uplink transmit power of user m that requests a collaborative task, , is:
(18) where and
Proof:
See Appendix B.
From Theorem 2, we can see that the number of subcarriers and the transmit power allocated to each user m directly affect the delay of user m. Therefore, to minimize the maximal transmission and computational delay among users, we can increase the number of subcarriers as well as the transmit power allocated to each user according to the type of task that each user requests. Although increasing the number of subcarriers or the transmit power allocated to a user decreases the delay of that user, the gain obtained from the same increase differs across the types of computational tasks that the user may request. To capture the maximum gain that stems from the same change in the number of subcarriers and the transmit power when a given user has different types of computational tasks, we state the following result:
Corollary 1.
The relationships among the gains that stem from the change of the same number of subcarriers or transmit power for a user that has different computational tasks are:

The relationship among the gains that stem from the change of the number of downlink subcarriers allocated to user m is: .

The relationship among the gains that stem from the change of the number of uplink subcarriers allocated to user m is: .

The relationship among the gains that stem from the change of the downlink transmit power allocated to user m is: .

The relationship among the gains that stem from the change of the uplink transmit power allocated to user m is: .
Proof:
See Appendix C.
From Corollary 1, we can see that the gain that stems from increasing the number of subcarriers and the transmit power of a user that has a collaborative task is less than that of a user that requests an edge task or a local task. This is because, as the number of subcarriers or the transmit power for uplink (downlink) increases, the data rate for uplink (downlink) increases, thus decreasing the uplink (downlink) transmission delay. Meanwhile, due to the increase of the uplink (downlink) transmission rate, the user will send more data to the MEC server, which can use its high-performance CPUs to process the data. Thus, the downlink (uplink) transmission delay increases. In particular, the increase of the downlink (uplink) transmission delay is larger than the decrease of the computational delay. Based on Theorem 2 and Corollary 1, to minimize the maximal computation and transmission delay among all users, BS n prefers to allocate more downlink subcarriers and downlink transmit power to a user that requests an edge task and to allocate more uplink subcarriers and uplink transmit power to a user that requests a local task.


Reward: Given the current environment state and the selected action , the reward function of each BS is given by:
(19) where with being the maximal time consumption of all users to process their own tasks locally and being the maximal transmission and computational time of all users. To calculate , each BS must exchange its maximal delay among its associated users with other BSs so as to adjust the resource allocation scheme to minimize the maximal computational and transmission delay among all users.
III-B Multi-stack RL for Optimization of Resource Allocation
Given the components of the proposed learning algorithm (summarized in Algorithm 1), we next present the use of the proposed learning algorithm to solve problem (8). In particular, each BS n first selects an action a from its set of available actions at each step k. After the selected action a is performed by BS n, the environment state changes and BS n records the obtained reward in its Q-table entry for the current state and action a. To ensure that any action can be chosen with a nonzero probability, an ε-greedy exploration [18] is adopted. This mechanism is responsible for action selection during the learning process and balances the tradeoff between exploration and exploitation. Here, exploration refers to the case in which each BS explores actions to find a better strategy, whereas exploitation refers to the case in which each BS adopts the action with the maximum reward. Therefore, the probability of BS n selecting action a can be given by:
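A minimal Python sketch of a standard ε-greedy rule is given below for illustration. It is the common textbook form, in which the action with the largest Q-value is chosen with probability 1 − ε plus a uniform share of ε, and it may differ from the exact expression used for the proposed algorithm. The Q-table is assumed to be a dictionary from actions to Q-values for the current state, and all names are hypothetical.

```python
import random


def selection_probabilities(q_values, epsilon):
    """Probability of choosing each action under the assumed epsilon-greedy rule."""
    actions = list(q_values)
    best_action = max(actions, key=q_values.get)
    explore_share = epsilon / len(actions)
    return {
        action: explore_share + (1.0 - epsilon if action == best_action else 0.0)
        for action in actions
    }


def epsilon_greedy(q_values, epsilon, rng=random):
    """Sample one action: explore uniformly with probability epsilon, else exploit."""
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=q_values.get)
```

With ε > 0, every action keeps a selection probability of at least ε divided by the number of actions, which is what guarantees that any action can be chosen with a nonzero probability.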