It is envisioned that 5G-and-beyond will enable an unprecedented proliferation of data-intensive and computationally-intensive applications such as face recognition, location-based augmented/virtual reality (AR/VR), and online 3D gaming [4, 3, 14, 13, 12]. However, adoption of these resource-hungry applications will be negatively affected by limited on-board computing and energy resources. In addition to the computationally-intensive applications, billions of IoT devices are expected to be deployed for various applications such as health monitoring, environmental monitoring, and smart cities, to name a few. These applications require a large number of low-power and resource-constrained wireless nodes to collect, pre-process, and analyze huge amounts of sensory data, which may not be feasible due to the limited on-board computing resources.
In order to bridge the gap between increasing demand for mobile computational power and constrained on-board resources, mobile edge computing (MEC) has been contemplated as a solution to supplement the computing capabilities of the end-users [15, 5, 8, 6, 1]. In contrast to the traditional cloud computing architectures, such as Amazon Web Services (AWS) and Microsoft Azure, MEC leverages the radio access networks to boost the computing power in close proximity to end-users, thus enabling users to offload their computations to MEC servers, as shown in Figure 1.
Under the MEC model, each user either offloads its computation to the server or performs the computation locally using its own resources. By offloading to the central node, users can save energy and prolong the overall lifetime of the system (assuming the central node is not energy-sensitive). However, if all users offload their computations to the central node, the communication resources must be divided among all users, which decreases the effective uplink throughput, and the queuing delay and computation time at the central node increase. Therefore, a dynamic policy to select the “best” offloading user is needed in order to strike the optimal trade-off between the lifetime of the system and the computation time. Before a practical MEC architecture becomes a reality, it must therefore overcome several challenges, including efficient management of communication and computing resources and coordination among distributed users and several base stations.
In practical MEC scenarios, the system is partially observable in the sense that users are distributed and the central node only observes the state (e.g., energy level and computation load) of those users that have offloaded so far. In addition, imperfect and delayed channel state information (CSI) makes the problem even more challenging, since the central node needs to balance the intricate “exploration and exploitation” trade-off: whether to exploit those users with more up-to-date information, or to explore those users which have not offloaded yet or whose state information is not fresh.
In this paper, we consider a MEC architecture involving multiple users and multiple MEC servers. We focus on a multi-server architecture because densification of small cells with abundant amounts of computational power is a key technique for improving the system throughput in 5G networks and beyond [11, 10]. In such a scenario, we develop an autonomous and energy-aware distributed computing platform via multi-agent deep reinforcement learning, whose objective is to increase the lifetime of the system, as well as to decrease the average time needed to compute the tasks arriving at the users. We show, through simulation results, that our proposed approach achieves a better trade-off between these two metrics than two greedy baseline algorithms.
II-A Reinforcement Learning
Reinforcement learning (RL) is a type of machine learning in which an agent or a group of agents interact(s) with an environment by collecting observations, taking actions, and receiving rewards. The agent’s experience is given by the tuple $(s_t, a_t, r_t, s_{t+1})$ such that at time step $t$, the agent observes the current state of the environment, denoted by $s_t$, and chooses action $a_t$, which results in a reward $r_t$. The state will transition to $s_{t+1}$ according to the transition probability $P(s_{t+1} \mid s_t, a_t)$. The ultimate goal for the agent is to learn what action to take given each observation to maximize its cumulative reward over time.
Deep reinforcement learning has been proposed as an enhancement to more traditional RL approaches, where the agent uses a deep neural network as a function approximator to represent its policy and/or value function. This enables the observation space (and potentially the action space) to be continuous and uncountable. Deep Q-Network (DQN) is a specific deep RL agent, where its state-action value function is updated by minimizing the following loss, which is derived through the Bellman equation:
$$L(\theta) = \left( r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta) - Q(s_t, a_t; \theta) \right)^2,$$
where $Q(s, a; \theta)$ represents the estimated state-action value function for state $s$ and action $a$ under the set of DQN parameters denoted by $\theta$, and $\gamma$ denotes the discount factor.
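To make the loss concrete, the following minimal sketch evaluates the squared temporal-difference error for a single transition. It assumes the standard one-step DQN target (reward plus discounted maximum next-state value); the function name `dqn_td_loss`, its arguments, and the `done` flag for terminal transitions are illustrative choices and are not taken from the paper.

```python
def dqn_td_loss(q_values, next_q_values, action, reward, gamma, done=False):
    """Squared temporal-difference error for one transition.

    q_values      -- estimated Q(s_t, a) for every action a (list of floats)
    next_q_values -- estimated Q(s_{t+1}, a') for every action a'
    action        -- index of the action actually taken at time t
    gamma         -- discount factor (assumed to lie in [0, 1])
    done          -- hypothetical flag: True if s_{t+1} is terminal
    """
    # Bellman target: bootstrap from the best next action unless the episode ended.
    target = reward if done else reward + gamma * max(next_q_values)
    td_error = target - q_values[action]
    return td_error ** 2


# Toy transition with two actions.
loss = dqn_td_loss(q_values=[1.0, 2.0], next_q_values=[0.5, 3.0],
                   action=0, reward=1.0, gamma=0.5)
```

In a full agent this quantity would be averaged over a batch sampled from the experience buffer and minimized with respect to the network parameters by gradient descent.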
II-B Related Work
Recently, there has been an extensive amount of work investigating the mobile edge computing paradigm. Based on the number of users and servers, there are several architectures that have been investigated.
The work in  considers offloading with one base station (BS) and one user, where the user may offload a set of virtual reality (VR) tasks for computation at the BS, or it may compute them locally as well. An optimization problem is solved to schedule the tasks for computation at the user or the BS in order to minimize the average transmitted data per task. Similarly, the authors in  consider a single-user scenario with multiple tasks, some of which can be offloaded to a central server. The tasks have dependency, which is represented by a graph. The graph is partitioned to multiple clusters, and then an integer programming problem is formulated to determine whether to offload each cluster or not such that the total execution time of tasks is minimized given energy constraints.
The authors in  consider a multi-user offloading problem with different computing tasks, each of which can be partially done by the user and the rest to be offloaded to a central server. The objective is to minimize the total weighted energy consumption (local computing plus offloading to central BS), subject to a total fixed delay constraint and computation capacity constraint at all users and central BS. In this case, weights are set arbitrarily, and users offload only using TDMA. In addition, communication time is ignored compared to computation time. The work in  considers TDMA and FDMA methods for the users to offload to the central server such that downlink delay is ignored. However, contrary to , the system considered in  does not necessarily assign computation resources proportionally to the size of offloaded tasks. The objective is to minimize the total computation delay (maximum of local computation delays and central computation plus offloading delays).
The authors in  apply deep reinforcement learning to obtain efficient computation offloading policies independently at each mobile user. In this case, a continuous state space is defined and a deep deterministic policy gradient (DDPG) agent is adopted to handle the high-dimensional action space. Moreover, in , an energy minimization offloading problem with a time constraint is tackled. A game theory approach is used to decompose the problem into two sub-problems such that first the access point (AP) receives the offloading decisions from users, and then it optimizes the communication and computation resources (e.g., channel access time and computation power allocated to each user). Based on the assigned resources, each user autonomously decides between local computation, offloading to the AP, or offloading to the cloud. Users then report their decisions to the AP.
III System Model
We consider a network with MEC servers and users, which are located randomly within an network area. The network operates in a time-slotted fashion, where the duration of each time interval is denoted by . The users receive multiple computation tasks to complete over time. In order to do that, they have two options: i) compute the tasks locally, or ii) offload the tasks to be computed at one of the MEC servers. We assume that all users start from a full energy level of , and then gradually consume energy over time until depletion, in which case the system’s lifetime is over. We use to denote the energy level of user at the beginning of time interval .
III-A Task Arrival Process
At time interval , each individual user receives a set of computation tasks, denoted by to compute. We assume a Poisson arrival process for the tasks, where the number of incoming tasks at each time interval follows an i.i.d. Poisson distribution with rate ; i.e.,
The tasks will be buffered in the user’s queue and served on a first-in-first-out basis. We assume the tasks are homogeneous in size, implying that for any user at any time interval , every task in has a fixed size of bits.
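The arrival process and FIFO buffering described above can be sketched as follows. This is only an illustration: the rate of 2 tasks per interval, the task size of 8000 bits, and all function and field names are hypothetical placeholders, and Knuth's multiplication method is used simply to keep the example free of external dependencies.

```python
import math
import random
from collections import deque


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) sample using Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def enqueue_arrivals(buffer, interval, rate, task_bits, rng):
    """Append this interval's newly arrived tasks to the tail of the FIFO buffer."""
    n_new = sample_poisson(rate, rng)
    for _ in range(n_new):
        buffer.append({"arrival": interval, "bits": task_bits})
    return n_new


rng = random.Random(0)
buffer = deque()  # head of the queue is buffer[0]
arrivals = [enqueue_arrivals(buffer, t, rate=2.0, task_bits=8000, rng=rng)
            for t in range(1000)]
```

Because tasks are only ever appended at the tail and served from the head, the buffer stays sorted by arrival interval, which is what first-in-first-out service requires.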
III-B Local Computation Model
As mentioned above, one way for each user to serve its incoming tasks is to compute the task using its local processor. We adopt a local computation model similar to , where the user first computes its maximum feasible computing power at any interval, and uses that to compute the maximum number of bits it can compute. To be precise, for user , the maximum feasible local computation power at time interval is calculated as:
Then, the maximum feasible CPU frequency is computed as:
where denotes the absolute maximum CPU frequency for user , and represents the effective switched capacitance. This will lead to the maximum number of bits that can be computed by user at time as
where denotes the number of CPU cycles per bit at user . The user then checks its task buffer, and computes the tasks at the head of the queue one by one as long as the total number of computed bits does not exceed . Note that if the size of the first task is already larger than , then the user remains idle and does not do any local computation at that step. We denote the effective consumed energy for the local computation of user at time interval by .
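The three-step local computation procedure above (power, then frequency, then bits) can be illustrated with a short sketch. Since the exact expressions are not reproduced here, the code makes two explicit assumptions: the feasible power is the remaining energy spread over one interval, and power relates to frequency through the common dynamic-power model p = kappa * f^3. All names and numerical values are hypothetical.

```python
from collections import deque


def max_local_bits(energy, tau, f_max, kappa, cycles_per_bit):
    """Upper bound on the bits a user can compute locally in one interval."""
    power = energy / tau                                # assumption: spend at most the remaining energy
    freq = min(f_max, (power / kappa) ** (1.0 / 3.0))   # assumption: p = kappa * f^3, capped at f_max
    return tau * freq / cycles_per_bit


def serve_head_of_queue(buffer, bit_budget):
    """Pop tasks from the head while the cumulative size stays within the budget."""
    served, used = [], 0
    while buffer and used + buffer[0] <= bit_budget:
        used += buffer[0]
        served.append(buffer.popleft())
    return served


budget = max_local_bits(energy=1.0, tau=0.1, f_max=1e9, kappa=1e-27, cycles_per_bit=500)
queue = deque([80000, 80000, 80000])        # task sizes in bits
done = serve_head_of_queue(queue, budget)   # serves tasks until the next one would exceed the budget
```

If the task at the head of the queue is already larger than the budget, `serve_head_of_queue` returns an empty list, which corresponds to the user staying idle in that interval.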
III-C Task Offloading Model
The other option for the users to compute their incoming tasks is to offload the tasks to the MEC servers. We assume that before the task arrival process begins, each user is associated with the MEC server which has the strongest long-term channel gain to it. We denote by the MEC server to whom user is associated, and by the set of associated users to MEC server . The local user pools of the MEC servers are disjoint; i.e., .
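The association rule above amounts to a per-user argmax over servers. A minimal sketch, assuming the long-term channel gains are available as a user-by-server table (the `gains` layout and the function name are hypothetical):

```python
def associate_users(gains):
    """gains[u][s] is the long-term channel gain from user u to server s.

    Returns (serving, pools): serving[u] is the server chosen by user u,
    and pools[s] lists the users associated with server s.
    """
    n_servers = len(gains[0])
    pools = {s: [] for s in range(n_servers)}
    serving = []
    for user, row in enumerate(gains):
        best = max(range(n_servers), key=row.__getitem__)  # strongest gain wins
        serving.append(best)
        pools[best].append(user)
    return serving, pools


serving, pools = associate_users([[0.9, 0.1], [0.2, 0.7], [0.6, 0.3]])
```

Each user appears in exactly one pool, so the pools are disjoint by construction.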
For user to offload its computation tasks to server at time interval , it first calculates its maximum feasible transmit power based on its instantaneous energy level as in (2). It then obtains its maximum uplink achievable rate as
where denotes the amount of bandwidth allocated to the uplink transmission between user and server at time interval , denotes the received signal-to-noise ratio (SNR) from user to server at time interval , and denotes the absolute maximum transmit power of user . The uplink transmissions of users to their respective MEC servers at each time interval may share the spectrum using multiple access techniques, such as FDMA or TDMA. Therefore, the maximum number of bits that user can transmit to server at time interval can be computed as
Similar to local computation, the user offloads tasks from head of its task buffer whose total number of bits does not exceed . We denote the effective consumed energy by user to offload its tasks to its associated server at time interval by .
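A rough sketch of the offloading budget is given below. It assumes a Shannon-type rate in which the SNR, measured at the absolute maximum transmit power, scales linearly with the fraction of that power the user can actually afford; the energy-limited power cap and every name and value in the code are assumptions made for illustration rather than the paper's exact formulation.

```python
import math


def max_offload_bits(energy, tau, p_max, bandwidth_hz, snr_at_p_max):
    """Upper bound on the bits a user can upload in one interval."""
    power = min(p_max, energy / tau)    # assumption: feasible power is energy-limited
    rate = bandwidth_hz * math.log2(1.0 + snr_at_p_max * power / p_max)  # bits per second
    return tau * rate


rich = max_offload_bits(energy=10.0, tau=0.1, p_max=0.25,
                        bandwidth_hz=1e6, snr_at_p_max=3.0)
poor = max_offload_bits(energy=0.001, tau=0.1, p_max=0.25,
                        bandwidth_hz=1e6, snr_at_p_max=3.0)
```

A user with little remaining energy cannot sustain full transmit power, so its achievable rate, and hence the number of bits it can offload, shrinks accordingly.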
III-D Energy Model
We assume that at each interval, each user either stays idle, does local computation of tasks, or offloads some tasks to its serving MEC server. Denoting the action taken by user at time interval by , the energy level of the user evolves over time as follows:
where denotes the unit stand-by energy consumption for each user at every time interval.
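The energy evolution described above can be written as a small update function. This sketch assumes that the stand-by energy is charged in every interval regardless of the action and that the energy level is clamped at zero; the `Action` enum and the argument names are hypothetical.

```python
from enum import Enum


class Action(Enum):
    IDLE = 0
    LOCAL = 1
    OFFLOAD = 2


def next_energy(energy, action, e_local, e_offload, e_standby):
    """Energy level at the start of the next interval."""
    spent = e_standby                  # assumption: stand-by cost applies to every action
    if action is Action.LOCAL:
        spent += e_local
    elif action is Action.OFFLOAD:
        spent += e_offload
    return max(0.0, energy - spent)    # assumption: energy cannot go negative


level = next_energy(10.0, Action.OFFLOAD, e_local=2.0, e_offload=3.0, e_standby=0.5)
```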
III-E Problem Statement
As mentioned before, we assume that the system crashes once at least one of the users runs out of energy. This leads to the definition of the system lifetime, denoted by , as follows:
Furthermore, for any incoming task , let and respectively denote the time intervals when the task arrives and when the task computation is completed, either through local computation or offloading to the servers. We define the mean task completion time, denoted by , as the average time it takes for a task to be computed before the system crashes; i.e.,
where denotes the set of all completed tasks within the system lifetime, defined as:
Having defined these metrics, our goal is to minimize the mean task completion time, while increasing the system lifetime as much as possible. Note that there is an inherent trade-off between these two metrics since reducing the mean task completion time requires more local computation and offloading to the MEC servers, which depletes the users’ energy levels more quickly, hence reducing the system lifetime.
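Both metrics can be computed from simulation traces, as in the sketch below. It assumes that the lifetime is the first interval at which some user's energy is non-positive and that a task counts as completed only if it finishes no later than that interval; the data layouts (`energy_traces`, `(arrival, completion)` pairs) are hypothetical.

```python
def system_lifetime(energy_traces):
    """energy_traces[u][t] is the energy of user u at the start of interval t."""
    horizon = min(len(trace) for trace in energy_traces)
    for t in range(horizon):
        if any(trace[t] <= 0 for trace in energy_traces):
            return t               # first interval in which some user is depleted
    return horizon                 # nobody ran out within the recorded trace


def mean_completion_time(tasks, lifetime):
    """tasks is a list of (arrival, completion) pairs; completion is None if unfinished."""
    durations = [done - arrived for arrived, done in tasks
                 if done is not None and done <= lifetime]
    return sum(durations) / len(durations) if durations else float("nan")


lifetime = system_lifetime([[3, 2, 1, 0], [3, 3, 3, 3]])
avg_time = mean_completion_time([(0, 1), (1, 3), (2, 5), (2, None)], lifetime)
```

The example also hints at the trade-off: serving tasks faster lowers `avg_time`, but the extra energy spent brings the first zero in the energy traces forward.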
IV Proposed Multi-Agent Deep Reinforcement Learning Approach
In order to enhance the trade-off between system lifetime and task completion time, we propose to equip each MEC server with a DQN agent, which selects the best user (across its associated users) for offloading its tasks to the server at each time interval. The proposed model is shown in Figure 3.
In particular, we consider an episodic time frame, where at the beginning of each episode, the user and server nodes are dropped randomly within the network area, with user nodes at their maximum energy level. We then run the system until at least one of the nodes runs out of energy, in which case the episode terminates and the node locations, task buffers, and energy levels are reset for the next episode.
IV-A Observations and Actions
We assume that at the beginning of each time interval, the DQN agent at each MEC server receives a partial observation of the environment, including the queue length, energy level, mean task waiting time, and uplink SNR of its associated users, and then it decides which user from its local associated user pool should offload its tasks to the server. The rest of the users in the pool perform local computation of their tasks at that step provided that they have sufficient energy to do so.
As mentioned in Section III, our ultimate goal is to increase the system lifetime and decrease the average time it takes to compute an incoming task. In order to do that, at each time interval, after the agents take their actions, we provide each agent with an individual reward in the form of energy efficiency, i.e., the ratio of the selected user’s computed bits (which were offloaded to the server) to the selected user’s consumed energy for offloading.
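As a rough illustration of the agent interface, the sketch below assembles a per-server observation from the four per-user features listed earlier and computes the energy-efficiency reward for the selected user. The feature ordering, the dictionary keys, and the zero reward when no energy is spent are assumptions made for this example.

```python
def build_observation(users):
    """Flatten the per-user features of one server's associated pool into a vector."""
    obs = []
    for user in users:
        obs.extend([user["queue_len"], user["energy"], user["mean_wait"], user["snr"]])
    return obs


def energy_efficiency_reward(offloaded_bits, offload_energy):
    """Ratio of offloaded (computed) bits to the energy the selected user spent offloading."""
    if offload_energy <= 0:
        return 0.0                 # assumption: nothing was offloaded, so no reward
    return offloaded_bits / offload_energy


pool = [
    {"queue_len": 3, "energy": 8.0, "mean_wait": 1.5, "snr": 12.0},
    {"queue_len": 0, "energy": 9.5, "mean_wait": 0.0, "snr": 4.0},
]
observation = build_observation(pool)
reward = energy_efficiency_reward(offloaded_bits=16000, offload_energy=0.5)
```

The agent's action is then simply an index into `pool`, identifying the user that offloads in the current interval.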
IV-C Numerical Results
We have conducted extensive simulations in order to evaluate the performance of our proposed approach. We consider a network area of size . We assume the maximum energy level of each user at the beginning of each episode is selected uniformly at random from the interval . The maximum transmit power of each user is taken to be dBm. We assume a time interval length of . The server and user CPU frequencies are taken to be GHz and GHz, respectively, with respective cycles per bit of and . The effective switched capacitance is set to . The noise variance is taken to be dBm/Hz, the total system bandwidth is set to MHz, and the transmissions are assumed to use FDMA. The mean task arrival rate is taken to be , the task length is equal to KB, and the unit stand-by energy is set to .
As for the DQN agent, we use a -layer neural network with nodes per layer and tanh activation function. We use an $\epsilon$-greedy policy, with probability of random actions staying at for initial pre-training episodes, and then decaying to over time intervals. The experience buffer size is set to samples, and a discount factor of is utilized. The agent is updated at the end of every episode, with a batch of size from the buffer. The learning rate also starts from and is cut in half every episodes.
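The exploration and learning-rate schedules can be sketched as below. The numerical settings are not reproduced here, so every value in the example is a placeholder; the linear shape of the decay and the use of a single generic step counter are also assumptions, since only the start value, end value, and decay horizon are described.

```python
import random


def epsilon_at(step, pretrain_steps, eps_start, eps_end, decay_steps):
    """Exploration probability: constant during pre-training, then decayed (linearly, by assumption)."""
    if step < pretrain_steps:
        return eps_start
    progress = min(1.0, (step - pretrain_steps) / decay_steps)
    return eps_start + progress * (eps_end - eps_start)


def learning_rate_at(episode, lr_start, halve_every):
    """Learning rate that is cut in half every `halve_every` episodes."""
    return lr_start * 0.5 ** (episode // halve_every)


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


rng = random.Random(0)
eps_mid = epsilon_at(step=60, pretrain_steps=10, eps_start=1.0, eps_end=0.0, decay_steps=100)
lr_late = learning_rate_at(episode=250, lr_start=0.01, halve_every=100)
greedy_choice = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0, rng=rng)
```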
The plots in Figure 4 show the impact of the number of servers on the system performance in terms of lifetime and task completion time for a system with 5 users. As the plots show, the training process converges after around 1000 episodes. Moreover, our proposed approach confirms the fact that densifying the network with more MEC servers allows superior load balancing among them, hence improving the overall system performance.
In order to investigate the performance of our framework after training is complete, we define the following two greedy baseline agents:
Time-Greedy Agent: This agent aims to minimize the task completion time by selecting the user with the largest average queue waiting time at each time interval.
Energy-Greedy Agent: This agent is used to enhance the lifetime of the system by selecting the user with the lowest energy level at each time interval.
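Both baselines reduce to a one-line selection rule over the server's associated users. A minimal sketch, assuming each user's state is available as a dictionary with hypothetical `mean_wait` and `energy` fields:

```python
def time_greedy(users):
    """Select the user with the largest average queue waiting time."""
    return max(users, key=lambda name: users[name]["mean_wait"])


def energy_greedy(users):
    """Select the user with the lowest energy level."""
    return min(users, key=lambda name: users[name]["energy"])


users = {
    "u1": {"mean_wait": 2.0, "energy": 7.0},
    "u2": {"mean_wait": 5.0, "energy": 9.0},
    "u3": {"mean_wait": 1.0, "energy": 3.0},
}
time_pick = time_greedy(users)
energy_pick = energy_greedy(users)
```

In this toy pool the two rules disagree: each baseline optimizes one metric in isolation, which is why neither is expected to balance completion time against lifetime.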
In Figure 5, we fix the network size to have 3 servers and 5 users, and compare the performance of our proposed DRL-based scheme with Time-Greedy and Energy-Greedy approaches. As the results show, our approach achieves a better trade-off between the mean task computation time and system lifetime compared to the aforementioned greedy agents.
V Concluding Remarks
In this paper, we considered the problem of computation offloading in a mobile edge computing (MEC) architecture, where multiple energy-constrained users compete to offload their computational tasks to multiple servers. We developed a deep reinforcement learning framework in which each server is equipped with a deep Q-network agent to select the best user for offloading at each time interval. Numerical results demonstrated the superiority of our approach over baseline algorithms in terms of the trade-off between task computation time and system lifetime.
- [1] (2018) Decentralized computation offloading for multi-user mobile edge computing: a deep reinforcement learning approach. arXiv preprint arXiv:1812.07394.
- [2] (2018) Mobile-edge computation offloading for ultradense IoT networks. IEEE Internet of Things Journal 5 (6), pp. 4977–4988.
- [3] (2017) Out-of-band millimeter wave beamforming and communications to achieve low latency and high energy efficiency in 5G systems. IEEE Transactions on Communications 66 (2), pp. 875–888.
- [4] (2018) Efficient beam alignment in millimeter wave systems using contextual bandits. In IEEE Conference on Computer Communications (INFOCOM), pp. 2393–2401.
- [5] (2017) Efficient resource allocation in mobile-edge computation offloading: completion time minimization. In 2017 IEEE International Symposium on Information Theory (ISIT), pp. 2513–2517.
- [6] (2018) An incentive-aware job offloading control framework for mobile edge computing. arXiv preprint arXiv:1812.05743.
- [7] (2018) A distributed algorithm for multi-stage computation offloading. In 2018 IEEE 7th International Conference on Cloud Networking (CloudNet), pp. 1–6.
- [8] (2017) On using edge computing for computation offloading in mobile network. In GLOBECOM 2017-2017 IEEE Global Communications Conference, pp. 1–7.
- [9] (2013) Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
- [10] (2018) Feedback-based interference management in ultra-dense networks via parallel dynamic cell selection and link scheduling. In 2018 IEEE International Conference on Communications (ICC), pp. 1–6.
- [11] (2017) Ultra-dense networks in 5G: interference management via non-orthogonal multiple access and treating interference as noise. In 2017 IEEE 86th Vehicular Technology Conference (VTC-Fall), pp. 1–6.
- [12] (2018) Demonstration of VR/AR offloading to mobile edge cloud for low latency 5G gaming application. In 2018 15th IEEE Annual Consumer Communications Networking Conference (CCNC), pp. 1–3.
- [13] (2018) Communication-constrained mobile edge computing systems for wireless virtual reality: scheduling and tradeoff. IEEE Access 6, pp. 16665–16677.
- [14] EdgeFlow: open-source multi-layer data flow processing in edge computing for 5G and beyond. IEEE Network 33 (2), pp. 166–173.
- [15] (2016) Multiuser resource allocation for mobile-edge computation offloading. In 2016 IEEE Global Communications Conference (GLOBECOM), pp. 1–6.