I Introduction
It is envisioned that 5G and beyond will enable an unprecedented proliferation of data-intensive and computationally-intensive applications such as face recognition, location-based augmented/virtual reality (AR/VR), and online 3D gaming [4, 3, 14, 13, 12]. However, adoption of these resource-hungry applications will be negatively affected by limited on-board computing and energy resources. In addition to the computationally-intensive applications, billions of IoT devices are expected to be deployed for various applications such as health monitoring, environmental monitoring, and smart cities, to name a few. These applications require a large number of low-power and resource-constrained wireless nodes to collect, preprocess, and analyze huge amounts of sensory data [2], which may not be feasible due to the limited on-board computing resources.

In order to bridge the gap between the increasing demand for mobile computational power and constrained on-board resources, mobile edge computing (MEC) has been contemplated as a solution to supplement the computing capabilities of the end-users [15, 5, 8, 6, 1]. In contrast to traditional cloud computing architectures, such as Amazon Web Services (AWS) and Microsoft Azure, MEC leverages the radio access networks to boost the computing power in close proximity to end-users, thus enabling users to offload their computations to MEC servers, as shown in Figure 1.
Under the MEC model, each user either offloads its computation to the server or uses its own resources to perform the computation locally. In this way, users can save energy and prolong the overall lifetime of the system by offloading to the central node (assuming the central node is not energy-sensitive). However, if all users offload their computations to the central node, on the one hand the communication resources need to be divided among all users, which decreases the effective uplink throughput, and on the other hand, the queuing delay and computation time at the central node increase. Therefore, a dynamic policy to select the “best” offloading user is needed in order to strike the optimal tradeoff between the lifetime of the system and the computation time. We thus note that before a practical MEC architecture becomes a reality, it faces several challenges, including efficient management of communication and computing resources and coordination among distributed users and several base stations.
In practical MEC scenarios, the system is partially observable in the sense that users are distributed and the central node only observes the state (e.g., energy level and computation load) of those users that have offloaded so far. In addition, imperfect and delayed channel state information (CSI) makes the problem even more challenging, since the central node needs to optimally balance the intricate “exploration and exploitation” tradeoff, i.e., whether to exploit those users with more up-to-date information or to explore those users which have not offloaded yet or whose state information is not fresh.
In this paper, we consider a MEC architecture involving multiple users and multiple MEC servers. We focus on a multi-server architecture because densification of small cells with abundant amounts of computational power is a key technique for improving the system throughput in 5G networks and beyond [11, 10]. In such a scenario, we develop an autonomous and energy-aware distributed computing platform via multi-agent deep reinforcement learning, whose objective is to increase the lifetime of the system, as well as to decrease the average time needed to compute the tasks arriving at the users. We show, through simulation results, that our proposed approach strikes the right tradeoff between the aforementioned metrics, outperforming two greedy baseline algorithms.
II Background
II-A Reinforcement Learning
Reinforcement learning (RL) is a type of machine learning in which an agent or a group of agents interact(s) with an environment by collecting observations, taking actions, and receiving rewards. The agent's experience is given by the tuple $(s_t, a_t, r_t, s_{t+1})$, such that at time step $t$, the agent observes the current state of the environment, denoted by $s_t$, and chooses action $a_t$, which results in a reward $r_t$. The state then transitions to $s_{t+1}$ according to the transition probability $p(s_{t+1} \mid s_t, a_t)$. The ultimate goal of the agent is to learn what action to take given each observation so as to maximize its cumulative reward over time.

Deep reinforcement learning has been proposed as an enhancement to more traditional RL approaches, where the agent uses a deep neural network as a function approximator to represent its policy and/or value function. This enables the observation space (and potentially the action space) to be continuous and uncountable. Deep Q-Network (DQN) [9] is a specific deep RL agent, whose state-action value function is updated by minimizing the following loss, derived through the Bellman equation:

$$L(\theta) = \mathbb{E}\left[\left(r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta) - Q(s_t, a_t; \theta)\right)^2\right],$$

where $Q(s, a; \theta)$ represents the estimated state-action value function for state $s$ and action $a$, with the set of DQN parameters denoted by $\theta$.

II-B Related Work
Recently, there has been an extensive amount of work investigating the mobile edge computing paradigm. Depending on the number of users and servers, several architectures have been investigated.
The work in [13] considers offloading with one base station (BS) and one user, where the user may offload a set of virtual reality (VR) tasks for computation at the BS, or compute them locally. An optimization problem is solved to schedule the tasks for computation at the user or the BS in order to minimize the average transmitted data per task. Similarly, the authors in [8] consider a single-user scenario with multiple tasks, some of which can be offloaded to a central server. The tasks have dependencies, which are represented by a graph. The graph is partitioned into multiple clusters, and an integer programming problem is then formulated to determine whether or not to offload each cluster such that the total execution time of the tasks is minimized under energy constraints.
The authors in [15] consider a multi-user offloading problem with different computing tasks, each of which can be partially computed by the user, with the rest offloaded to a central server. The objective is to minimize the total weighted energy consumption (local computing plus offloading to the central BS), subject to a fixed total delay constraint and computation capacity constraints at all users and the central BS. In this case, the weights are set arbitrarily, and users offload only using TDMA. In addition, the communication time is ignored compared to the computation time. The work in [5] considers TDMA and FDMA methods for the users to offload to the central server, while the downlink delay is ignored. However, contrary to [15], the system considered in [5] does not necessarily assign computation resources proportionally to the size of the offloaded tasks. The objective is to minimize the total computation delay (the maximum of the local computation delays and the central computation plus offloading delays).
The authors in [1] apply deep reinforcement learning to obtain efficient computation offloading policies independently at each mobile user. In this case, a continuous state space is defined, and a deep deterministic policy gradient (DDPG) agent is adopted to handle the high-dimensional action space. Moreover, in [7], an energy minimization offloading problem with a time constraint is tackled. A game-theoretic approach is used to decompose the problem into two subproblems: first, the access point (AP) receives the offloading decisions from the users, and then it optimizes the communication and computation resources (e.g., the channel access time and the computation power allocated to each user). Based on the assigned resources, each user autonomously decides between local computation, offloading to the AP, or offloading to the cloud. Users then report their decisions to the AP.
III System Model
We consider a network with $M$ MEC servers and $N$ users, which are located randomly within the network area. The network operates in a time-slotted fashion, where the duration of each time interval is denoted by $\tau_0$. The users receive multiple computation tasks to complete over time. In order to do so, they have two options: i) compute the tasks locally, or ii) offload the tasks to be computed at one of the MEC servers. We assume that all users start from a full energy level, and then gradually consume energy over time until depletion, at which point the system's lifetime is over. We use $E_u(t)$ to denote the energy level of user $u$ at the beginning of time interval $t$.
III-A Task Arrival Process
At each time interval $t$, each individual user $u$ receives a set of computation tasks, denoted by $\mathcal{A}_u(t)$, to compute. We assume a Poisson arrival process for the tasks, where the number of incoming tasks at each time interval follows an i.i.d. Poisson distribution with rate $\lambda$; i.e.,

$$\Pr\left[|\mathcal{A}_u(t)| = k\right] = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots \quad (1)$$

The tasks are buffered in the user's queue and served on a first-in-first-out basis. We assume the tasks are homogeneous in size, implying that for any user $u$ at any time interval $t$, every task in $\mathcal{A}_u(t)$ has a fixed size of $L$ bits.
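As a concrete illustration, the arrival and buffering process described above can be simulated in a few lines. The function and variable names below are ours, not the paper's; the fixed task size plays the role of the homogeneous task length.

```python
import numpy as np
from collections import deque

def simulate_arrivals(num_slots, rate, task_bits, seed=0):
    """Draw i.i.d. Poisson(rate) task counts per slot and buffer
    fixed-size tasks in a first-in-first-out queue."""
    rng = np.random.default_rng(seed)
    queue = deque()
    for _ in range(num_slots):
        for _ in range(rng.poisson(rate)):
            queue.append(task_bits)  # homogeneous task size in bits
    return queue
```

Over many slots, the queue length grows at roughly `rate` tasks per slot minus whatever the user manages to serve.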
III-B Local Computation Model
As mentioned above, one way for each user to serve its incoming tasks is to compute them using its local processor. We adopt a local computation model similar to [1], where the user first computes its maximum feasible computing power at any interval, and uses that to derive the maximum number of bits it can compute. To be precise, for user $u$, the maximum feasible local computation power at time interval $t$ is calculated as

$$p_u(t) = \frac{E_u(t)}{\tau_0}. \quad (2)$$

Then, the maximum feasible CPU frequency is computed as

$$f_u(t) = \min\left\{\left(\frac{p_u(t)}{\kappa_u}\right)^{1/3}, f_u^{\max}\right\}, \quad (3)$$

where $f_u^{\max}$ denotes the absolute maximum CPU frequency for user $u$, and $\kappa_u$ represents the effective switched capacitance. This leads to the maximum number of bits that can be computed by user $u$ at time $t$ as

$$b_u^{\mathrm{loc}}(t) = \frac{\tau_0 f_u(t)}{c_u}, \quad (4)$$

where $c_u$ denotes the number of CPU cycles per bit at user $u$. The user then checks its task buffer and computes the tasks at the head of the queue one by one, as long as the total number of computed bits does not exceed $b_u^{\mathrm{loc}}(t)$. Note that if the size of the first task is already larger than $b_u^{\mathrm{loc}}(t)$, then the user remains idle and does not do any local computation at that step. We denote the effective consumed energy for the local computation of user $u$ at time interval $t$ by $e_u^{\mathrm{loc}}(t)$.
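A minimal sketch of this local-computation budget follows, assuming the common dynamic-voltage-scaling model in which computing power scales with the cube of the CPU frequency; the exact formulas in [1] may differ, and all names here are illustrative.

```python
def local_bits(energy, tau, f_max, kappa, cycles_per_bit):
    """Maximum bits a user can compute in one slot of length tau,
    given its current battery energy (assumed model, see lead-in)."""
    p = energy / tau                             # max feasible computing power
    f = min((p / kappa) ** (1.0 / 3.0), f_max)   # max feasible CPU frequency
    return tau * f / cycles_per_bit              # max bits computable this slot

def serve_fifo(queue, bit_budget):
    """Pop whole tasks from the head of the FIFO queue while the budget
    allows; a too-large head task leaves the user idle this slot."""
    done = []
    while queue and queue[0] <= bit_budget:
        task = queue.pop(0)
        bit_budget -= task
        done.append(task)
    return done
```

Note that `serve_fifo` never computes a partial task, matching the whole-task scheduling described in the text.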
III-C Task Offloading Model
The other option for the users to compute their incoming tasks is to offload them to the MEC servers. We assume that before the task arrival process begins, each user is associated with the MEC server that has the strongest long-term channel gain to it. We denote by $m(u)$ the MEC server with which user $u$ is associated, and by $\mathcal{U}_m$ the set of users associated with MEC server $m$. The local user pools of the MEC servers are disjoint; i.e., $\mathcal{U}_m \cap \mathcal{U}_{m'} = \emptyset$ for $m \neq m'$.
For user $u$ to offload its computation tasks to server $m$ at time interval $t$, it first calculates its maximum feasible transmit power based on its instantaneous energy level, as in (2). It then obtains its maximum achievable uplink rate as

$$r_{u,m}(t) = W_{u,m}(t) \log_2\left(1 + \frac{\min\{p_u(t), p_u^{\max}\}}{p_u^{\max}}\,\gamma_{u,m}(t)\right),$$

where $W_{u,m}(t)$ denotes the amount of bandwidth allocated to the uplink transmission between user $u$ and server $m$ at time interval $t$, $\gamma_{u,m}(t)$ denotes the received signal-to-noise ratio (SNR) from user $u$ to server $m$ at time interval $t$, and $p_u^{\max}$ denotes the absolute maximum transmit power of user $u$. The uplink transmissions of users to their respective MEC servers at each time interval may share the spectrum using multiple access techniques, such as FDMA or TDMA. Therefore, the maximum number of bits that user $u$ can transmit to server $m$ at time interval $t$ can be computed as

$$b_{u,m}^{\mathrm{off}}(t) = \tau_0\, r_{u,m}(t). \quad (5)$$

Similar to local computation, the user offloads tasks from the head of its task buffer whose total number of bits does not exceed $b_{u,m}^{\mathrm{off}}(t)$. We denote the effective energy consumed by user $u$ to offload its tasks to its associated server at time interval $t$ by $e_u^{\mathrm{off}}(t)$.
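The uplink budget can be sketched as follows. The power-to-SNR handling is an assumption of ours: we take the received SNR as measured at full transmit power and scale it linearly with the actual feasible power; the names are illustrative.

```python
import math

def offload_bits(energy, tau, p_max, bandwidth, snr_at_pmax):
    """Maximum bits a user can offload in one slot of length tau:
    feasible transmit power (capped at p_max), Shannon rate over the
    allocated bandwidth, times the slot duration. Assumed model."""
    p = min(energy / tau, p_max)                       # feasible transmit power
    rate = bandwidth * math.log2(1.0 + snr_at_pmax * p / p_max)  # bits/s
    return tau * rate                                  # max offloadable bits
```

The same `serve_fifo`-style whole-task scheduling as in the local case then decides which head-of-queue tasks actually get transmitted.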
III-D Energy Model
We assume that at each interval, each user either stays idle, performs local computation of tasks, or offloads some tasks to its serving MEC server. Denoting the action taken by user $u$ at time interval $t$ by $a_u(t)$, the energy level of the user evolves over time as follows:

$$E_u(t+1) = E_u(t) - e_u^{\mathrm{loc}}(t)\,\mathbb{1}\{a_u(t) = \text{local}\} - e_u^{\mathrm{off}}(t)\,\mathbb{1}\{a_u(t) = \text{offload}\} - e_0,$$

where $e_0$ denotes the unit standby energy consumption for each user at every time interval, and $e_u^{\mathrm{loc}}(t)$ and $e_u^{\mathrm{off}}(t)$ denote the energy consumed by local computation and offloading, respectively.
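One slot of these battery dynamics can be written as a small update rule; the action labels and the flooring at zero are our illustrative choices.

```python
def energy_step(energy, action, e_local, e_offload, e_standby):
    """Battery update for one slot: every user pays the fixed standby
    cost plus the cost of its chosen action ('idle', 'local', or
    'offload'), with the level floored at zero (depletion)."""
    cost = {"idle": 0.0, "local": e_local, "offload": e_offload}[action]
    return max(0.0, energy - cost - e_standby)
```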
III-E Problem Statement
As mentioned before, we assume that the system crashes once at least one of the users runs out of energy. This leads to the definition of the system lifetime, denoted by $T$, as follows:

$$T = \min_{u} \min \left\{ t : E_u(t) \le 0 \right\}. \quad (6)$$

Furthermore, for any incoming task $j$, let $t_j^{\mathrm{a}}$ and $t_j^{\mathrm{f}}$ respectively denote the time intervals when the task arrives and when the task computation is completed, either through local computation or offloading to the servers. We define the mean task completion time, denoted by $\bar{D}$, as the average time it takes for a task to be computed before the system crashes; i.e.,

$$\bar{D} = \frac{1}{|\mathcal{C}|} \sum_{j \in \mathcal{C}} \left( t_j^{\mathrm{f}} - t_j^{\mathrm{a}} \right), \quad (7)$$

where $\mathcal{C}$ denotes the set of all completed tasks within the system lifetime, defined as

$$\mathcal{C} = \left\{ j : t_j^{\mathrm{f}} \le T \right\}. \quad (8)$$
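The two metrics above can be computed directly from simulation traces, as in the following sketch; the data layout (per-user energy traces, per-task arrival/finish slots) is our illustrative choice.

```python
def system_metrics(energy_traces, tasks):
    """Compute system lifetime and mean task completion time.

    energy_traces: dict mapping each user to its per-slot energy levels
    tasks: list of (arrival_slot, finish_slot) pairs; finish_slot is
           None for tasks that were never completed
    """
    def depletion_slot(trace):
        for t, e in enumerate(trace):
            if e <= 0:
                return t
        return len(trace)  # not depleted within the trace

    # Lifetime: first slot at which any user's energy is depleted.
    lifetime = min(depletion_slot(tr) for tr in energy_traces.values())

    # Mean completion time over tasks finished within the lifetime.
    completed = [(a, f) for a, f in tasks if f is not None and f <= lifetime]
    mean_completion = sum(f - a for a, f in completed) / len(completed)
    return lifetime, mean_completion
```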
Having defined these metrics, our goal is to minimize the mean task completion time, while increasing the system lifetime as much as possible. Note that there is an inherent tradeoff between these two metrics since reducing the mean task completion time requires more local computation and offloading to the MEC servers, which depletes the users’ energy levels more quickly, hence reducing the system lifetime.
IV Proposed Multi-Agent Deep Reinforcement Learning Approach
In order to enhance the tradeoff between system lifetime and task completion time, we propose to equip each MEC server with a DQN agent, which, at each time interval, selects the best user among its associated users to offload its tasks to the server. The proposed model is shown in Figure 3.
In particular, we consider an episodic time frame, where at the beginning of each episode, the user and server nodes are dropped randomly within the network area, with user nodes at their maximum energy level. We then run the system until at least one of the nodes runs out of energy, in which case the episode terminates and the node locations, task buffers, and energy levels are reset for the next episode.
IV-A Observations and Actions
We assume that at the beginning of each time interval, the DQN agent at each MEC server receives a partial observation of the environment, including the queue length, energy level, mean task waiting time, and uplink SNR of its associated users, and then it decides which user from its local associated user pool should offload its tasks to the server. The rest of the users in the pool perform local computation of their tasks at that step provided that they have sufficient energy to do so.
IV-B Rewards
As mentioned in Section III, our ultimate goal is to increase the system lifetime and decrease the average time it takes to compute an incoming task. In order to do that, at each time interval, after the agents take their actions, we provide each agent with an individual reward in the form of energy efficiency, i.e., the ratio of the selected user’s computed bits (which were offloaded to the server) to the selected user’s consumed energy for offloading.
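The per-server observation and reward described above can be assembled as follows; the dictionary keys and vector layout are illustrative, not the paper's exact encoding.

```python
import numpy as np

def server_observation(users):
    """Partial observation for one MEC server: for each associated
    user, its queue length, energy level, mean task waiting time,
    and uplink SNR, flattened into a single vector."""
    return np.array(
        [[u["queue_len"], u["energy"], u["mean_wait"], u["snr"]]
         for u in users],
        dtype=float,
    ).flatten()

def offload_reward(bits_offloaded, energy_spent):
    """Energy-efficiency reward: bits the selected user offloaded to
    the server divided by the energy it spent offloading them."""
    return bits_offloaded / energy_spent
```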
IV-C Numerical Results
We have conducted extensive simulations in order to evaluate the performance of our proposed approach. We consider a network area of size . We assume the maximum energy level of each user at the beginning of each episode is selected uniformly at random from the interval . The maximum transmit power of each user is taken to be dBm. We assume a time interval length of . The server and user CPU frequencies are taken to be GHz and GHz, respectively, with respective cycles per bit of and . The effective switched capacitance is set to . The noise variance is taken to be dBm/Hz, the total system bandwidth is set to MHz, and the transmissions are assumed to use FDMA. The mean task arrival rate is taken to be , the task length is equal to KB, and the unit standby energy is set to .

As for the DQN agent, we use a layer neural network with nodes per layer and a tanh activation function. We use an ε-greedy policy, with the probability of random actions staying at for the initial pretraining episodes, and then decaying to over time intervals. The experience buffer size is set to samples, and a discount factor of is utilized. The agent is updated at the end of every episode, with a batch of size from the buffer. The learning rate also starts from and is cut in half every episodes.

The plots in Figure 4 show the impact of the number of servers on the system performance in terms of lifetime and task completion time for a system with 5 users. As the plots show, the training process converges after around 1000 episodes. Moreover, our proposed approach confirms that densifying the network with more MEC servers allows superior load balancing among them, hence improving the overall system performance.
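The exploration schedule described above (random actions held constant during pretraining, then decayed) can be sketched as a simple linear decay; all constants below are placeholders of ours, not the paper's values.

```python
def epsilon(step, warmup_steps, decay_steps, eps_start=1.0, eps_end=0.05):
    """Epsilon-greedy exploration probability at a given step: hold
    eps_start for warmup_steps, then decay linearly to eps_end over
    decay_steps and stay there."""
    if step < warmup_steps:
        return eps_start
    frac = min(1.0, (step - warmup_steps) / decay_steps)
    return eps_start + frac * (eps_end - eps_start)
```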
In order to investigate the performance of our framework after training is complete, we define the following two greedy baseline agents:

Time-Greedy Agent: This agent aims to minimize the task completion time by selecting the user with the largest average queue waiting time at each time interval.

Energy-Greedy Agent: This agent is used to enhance the lifetime of the system by selecting the user with the lowest energy level at each time interval.
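The two baseline selection rules above amount to a single argmax/argmin each; the dictionary keys are illustrative.

```python
def time_greedy(users):
    """Time-Greedy baseline: index of the associated user with the
    largest mean queue waiting time."""
    return max(range(len(users)), key=lambda i: users[i]["mean_wait"])

def energy_greedy(users):
    """Energy-Greedy baseline: index of the associated user with the
    lowest energy level, letting it offload rather than drain its
    battery on local computation."""
    return min(range(len(users)), key=lambda i: users[i]["energy"])
```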
In Figure 5, we fix the network size to have 3 servers and 5 users, and compare the performance of our proposed DRL-based scheme with the Time-Greedy and Energy-Greedy approaches. As the results show, our approach achieves a better tradeoff between the mean task computation time and system lifetime compared to the aforementioned greedy agents.
V Concluding Remarks
In this paper, we considered the problem of computation offloading in a mobile edge computing (MEC) architecture, where multiple energy-constrained users compete to offload their computational tasks to multiple servers. We developed a deep reinforcement learning framework in which each server is equipped with a deep Q-network agent to select the best user for offloading at each time interval. Numerical results demonstrated the superiority of our approach over baseline algorithms in terms of the tradeoff between task computation time and system lifetime.
References
[1] (2018) Decentralized computation offloading for multi-user mobile edge computing: a deep reinforcement learning approach. arXiv preprint arXiv:1812.07394.
[2] (2018) Mobile-edge computation offloading for ultra-dense IoT networks. IEEE Internet of Things Journal 5 (6), pp. 4977–4988.
[3] (2017) Out-of-band millimeter wave beamforming and communications to achieve low latency and high energy efficiency in 5G systems. IEEE Transactions on Communications 66 (2), pp. 875–888.
[4] (2018) Efficient beam alignment in millimeter wave systems using contextual bandits. In IEEE Conference on Computer Communications (INFOCOM), pp. 2393–2401.
[5] (2017) Efficient resource allocation in mobile-edge computation offloading: completion time minimization. In 2017 IEEE International Symposium on Information Theory (ISIT), pp. 2513–2517.
[6] (2018) An incentive-aware job offloading control framework for mobile edge computing. arXiv preprint arXiv:1812.05743.
[7] (2018) A distributed algorithm for multi-stage computation offloading. In 2018 IEEE 7th International Conference on Cloud Networking (CloudNet), pp. 1–6.
[8] (2017) On using edge computing for computation offloading in mobile network. In GLOBECOM 2017 - 2017 IEEE Global Communications Conference, pp. 1–7.
[9] (2013) Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
[10] (2018) Feedback-based interference management in ultra-dense networks via parallel dynamic cell selection and link scheduling. In 2018 IEEE International Conference on Communications (ICC), pp. 1–6.
[11] (2017) Ultra-dense networks in 5G: interference management via non-orthogonal multiple access and treating interference as noise. In 2017 IEEE 86th Vehicular Technology Conference (VTC-Fall), pp. 1–6.
[12] (2018) Demonstration of VR/AR offloading to mobile edge cloud for low latency 5G gaming application. In 2018 15th IEEE Annual Consumer Communications & Networking Conference (CCNC), pp. 1–3.
[13] (2018) Communication-constrained mobile edge computing systems for wireless virtual reality: scheduling and tradeoff. IEEE Access 6, pp. 16665–16677.
[14] (2019) EdgeFlow: open-source multi-layer data flow processing in edge computing for 5G and beyond. IEEE Network 33 (2), pp. 166–173.
[15] (2016) Multi-user resource allocation for mobile-edge computation offloading. In 2016 IEEE Global Communications Conference (GLOBECOM), pp. 1–6.