Next-generation wireless networks are expected to rely significantly on edge applications and functions that include edge computing and edge artificial intelligence (edge AI) [1, 2, 3, 4, 5, 6, 7]. To successfully support such edge services within a wireless network with mobile edge computing (MEC) capabilities, energy management (i.e., demand and supply) is one of the most critical design challenges. In particular, it is imperative to equip next-generation wireless networks with alternative energy sources, such as renewable energy, in order to provide extremely reliable energy dispatch at a lower energy consumption cost [8, 9, 11, 10, 12, 13, 14, 15]. An efficient energy dispatch design requires energy sustainability, which not only reduces the energy consumption cost but also fulfills the energy demand of edge computing through the network's own renewable energy sources. Specifically, sustainable energy is the practice of providing a seamless energy flow to the MEC system that meets the energy demand without compromising future energy generation. Furthermore, to ensure sustainable MEC operation, it is essential to account for the uncertainty in both energy consumption and generation.
To provide sustainable edge computing for next-generation wireless systems, each base station (BS) unit with MEC capabilities can be equipped with renewable energy sources. Thus, such a BS unit relies not only on the power grid but also on its own renewable energy sources. In particular, in a self-powered network, each wireless BS with MEC capabilities is equipped with its own renewable energy sources and can generate, consume, store, and share energy with other BS units.
Delivering a seamless energy flow at a low energy consumption cost in a self-powered wireless network with MEC capabilities is complicated by uncertainty in both energy demand and generation. In particular, the randomness of the energy demand is induced by the uncertain resources (i.e., computation and communication) requested by the edge services and applications. Meanwhile, the energy generation of a renewable source (e.g., a solar panel) at each self-powered BS unit varies with the time of day. In other words, the pattern of energy demand and generation will differ from one self-powered BS unit to another. Thus, such fluctuating energy demand and generation patterns induce a non-independent and identically distributed (non-iid) energy dispatch at each BS over time. As such, when designing self-powered wireless networks, it is necessary to take into account this uncertainty in the energy patterns.
I-A Related Works
The problem of energy management for MEC-enabled wireless networks has been studied in [16, 17, 18, 19, 20, 21, 22]. In , the authors proposed a joint mechanism for radio resource management and user task offloading with the goal of minimizing the long-term power consumption for both mobile devices and the MEC server. The authors in  proposed a heuristic to solve the joint problem of computational resource allocation, uplink transmission power, and user task offloading. The work in studied the tradeoff between communication and computation for an MEC system, and the authors proposed an MEC server CPU scaling mechanism for reducing the energy consumption. Further, the work in  proposed an energy-aware mobility management scheme for MEC in ultra-dense networks and addressed the problem using Lyapunov optimization and multi-armed bandits. Recently, the authors in  proposed a distributed power control scheme for a small cell network by using the concept of a multi-agent calibrate learning. Further, the authors in  studied the problem of energy storage and energy harvesting (EH) for a wireless network using deviation theory and Markov processes. However, all of these existing works assume that the consumed energy is available from the energy utility source to the wireless network system [16, 17, 18, 19, 20, 21, 22]. Since the assumed models often focus on energy management and user task offloading in network resource allocation, the random demand for the computational (e.g., CPU computation and memory) and communication requirements of the edge applications and services is not considered. In fact, even if enough energy supply is available, the energy cost related to network operation can be significant because of the usage of non-renewable sources (e.g., coal, petroleum, natural gas). Indeed, it is necessary to include renewable energy sources in the next-generation wireless networking infrastructure.
Recently, some of the challenges of renewable energy powered wireless networks have been studied in [8, 9, 11, 10, 12, 13, 14, 24, 23]. In , the authors proposed an online optimization framework to analyze the activation and deactivation of BSs in a self-powered network. In , a hybrid power source infrastructure was proposed to support heterogeneous networks (HetNets), and a model-free deep reinforcement learning (RL) mechanism was proposed for user scheduling and network resource management. In , the authors developed an RL scheme for edge resource management while incorporating renewable energy in the edge network. In particular, the goal of  is to minimize a long-term system cost by load balancing between the centralized cloud and the edge server. The authors in  introduced a microgrid-enabled edge computing system. A joint optimization problem is studied for MEC task assignment and energy demand-response (DR) management. The authors in  developed a model-based deep RL framework to tackle the joint problem. In , the authors proposed risk-sensitive energy profiling for a microgrid-powered MEC network to ensure a sustainable energy supply for green edge computing by capturing the conditional value at risk (CVaR) tail distribution of the energy shortfall. The authors in  proposed a multi-agent RL system to solve the energy scheduling problem. In , the authors proposed self-sustainable mobile networks, using a graph-based approach for intelligent energy management with a microgrid. The authors in  proposed a smart grid-enabled wireless network and minimized grid energy consumption by applying energy sharing among the BSs. Furthermore, in , the authors addressed the challenges of non-coordinated energy shedding and misaligned incentives for mixed-use buildings (i.e., buildings and data centers) using auction theory to reduce energy usage.
However, these works [9, 11, 10, 12, 13, 14, 23] do not investigate the problem of energy dispatch, nor do they account for the energy cost of MEC-enabled, self-powered networks when the demand and generation of each self-powered BS are non-iid. Dealing with non-iid energy demand and generation among self-powered BSs is challenging because the intrinsic energy requirements of each BS evolve under uncertainty. To overcome this unique energy dispatch challenge, we propose a multi-agent meta-reinforcement learning framework that can adapt to new uncertain environments without considering the entire past experience.
The main contribution of this paper is a novel energy management framework for next-generation MEC in self-powered wireless networks that is reliable under extremely uncertain energy demand and generation. We formulate a two-stage stochastic energy cost minimization problem that can balance renewable, non-renewable, and storage energy without knowing the actual demand. In fact, the formulated problem also investigates the realization of renewable energy generation after receiving the uncertain energy demand from the MEC applications and service requests. To solve this problem, we propose a multi-agent meta-reinforcement learning (MAMRL) framework that dynamically observes the non-iid behavior of time-varying features in both energy demand and generation at each BS and then transfers those observations to obtain an energy dispatch decision and execute the energy dispatch policy at the self-powered BS. Fig. 1 illustrates how we propose to dispatch energy to ensure sustainable edge computing over a self-powered network using the MAMRL framework. As we can see, each BS, whether a small cell base station (SBS) or a macro base station (MBS), will act as a local agent and transfer its own decision (reward and action) to the meta-agent. Then, the meta-agent accumulates all of the non-iid observations from each local agent (i.e., SBSs and MBS) and optimizes the energy dispatch policy. The proposed MAMRL framework then provides feedback to each BS agent so that it explores efficiently and reaches the right decision more quickly. Thus, the proposed MAMRL framework ensures autonomous decision making under an uncertain and unknown environment. Our key contributions include:
We formulate a self-powered energy dispatch problem for an MEC-supported wireless network, in which the objective is to minimize the total energy consumption cost of the network while considering the uncertainty of both energy consumption and generation. The formulated problem is, thus, a two-stage linear stochastic program. In particular, the first stage makes a decision when the energy demand is unknown, and the second stage discretizes the realization of renewable energy generation after the energy demand of the network becomes known.
between each local agent (i.e., self-powered BS) and meta-agent. In this MAMRL scheme, each local agent explores its own energy dispatch decision using Markovian properties for capturing the time-varying features of both energy demand and generation. Meanwhile, the meta-agent evaluates (exploits) that decision for each local agent and optimizes the energy dispatch decision. In particular, we design a long short-term memory (LSTM) as a meta-agent (i.e., run at MBS) that is capable of avoiding the incompetent decision from each local agent and learns the right features more quickly by maintaining its own state information.
We develop the proposed MAMRL energy dispatch framework in a semi-distributed manner. Each local agent (i.e., self-powered BS) estimates its own energy dispatch decision using local energy data (i.e., demand and generation), and provides observations to the meta-agent individually. Consequently, the meta-agent optimizes the decision centrally and assists the local agent toward a globally optimized decision. Thus, this approach not only reduces the computational complexity and communication overhead but it also mitigates the curse of dimensionality under the uncertainty by utilizing non-iid energy demand and generation from each local agent.
Experimental results using real datasets establish a significant performance gain of the energy dispatch under the deterministic, asymmetric, and stochastic environments. Particularly, the results show that the proposed MAMRL model saves up to of energy consumption cost over a baseline approach while achieving an average accuracy of around in a stochastic environment. Our approach also decreases the usage of non-renewable energy up to of total consumed energy.
The rest of the paper is organized as follows. Section II presents the system model of self-powered edge computing. The problem formulation is described in Section III. Section IV provides the MAMRL framework for solving the energy dispatch problem. Experimental results are analyzed in Section V. Finally, conclusions are drawn in Section VI.
II System Model of Self-Powered Edge Computing
|Set of BSs (SBSs and MBS)|
|Set of active servers under the BS|
|Set of user tasks|
|Set of renewable energy sources|
|Server utilization in BS|
|Energy coefficient for BS|
|Renewable energy cost per unit|
|Non-renewable energy cost per unit|
|Storage energy cost per unit|
|Amount of renewable energy|
|Amount of non-renewable energy|
|Amount of surplus energy|
|Energy demand at time slot|
|Random variable for energy demand|
Consider a self-powered wireless network that is connected with a smart grid controller as shown in Fig. 2. Such a wireless network enables edge computing services for various MEC applications and services. The energy consumption of the network depends on the energy consumed by network operations along with the task loads of the MEC applications. Meanwhile, the energy supply of the network relies on the energy generation from renewable sources that are attached to the BSs, as well as both renewable and non-renewable sources of the smart grid. Therefore, we will first discuss the energy demand model that includes the MEC server energy consumption and the network communication energy consumption. We will then describe the energy generation model that consists of the non-renewable energy generation cost, surplus energy storage cost, and total energy generation cost. Table I summarizes the notation.
II-A Energy Demand Model
Consider a set of ( for MBS) BSs that encompass SBSs overlaid over a single MBS. Each BS includes a set of MEC application servers. We consider a finite time horizon with each time slot being indexed by and having a duration of 15 minutes . The observational period of each time slot ends at the -th minute and is capable of capturing the changes of network dynamics [11, 12, 31]. A set of heterogeneous MEC application task requests from users will arrive at BS with an average task arrival rate (bits/s) at time . The task arrival rate at BS follows a Poisson process at time slot . BS integrates heterogeneous active MEC application servers that have (bits/s) processing capacity. Thus, computational task requests will be accumulated into the service pool with an average traffic size (bits) at time slot . The average traffic arrival rate is defined as . Therefore, an M/M/K queuing model is suitable to model these user tasks using MEC servers at BS and time [32, 33]. The task size of this queuing model is exponentially distributed since the average traffic size is already known. Hence, the service rate of the BS is determined by . At any given time , we assume that all of the tasks in are uniformly distributed at each BS. Thus, for a given MEC server task association indicator if task is assigned to server at BS , and otherwise, the average MEC server utilization is defined as follows :
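As a minimal sketch of the utilization described above, the following Python snippet computes the textbook M/M/K utilization, i.e., the arrival rate divided by the aggregate service rate of all servers. The function name, the argument names, and the clipping at 1 are illustrative assumptions; the expression in (1) additionally involves the task association indicator and may differ in form.

```python
def server_utilization(arrival_rate: float, service_rate: float, num_servers: int) -> float:
    """Textbook M/M/K utilization: rho = lambda / (K * mu).

    arrival_rate  -- average task arrival rate at the BS (assumed name)
    service_rate  -- service rate of a single MEC server (assumed name)
    num_servers   -- number of active MEC servers K at the BS
    """
    if num_servers <= 0 or service_rate <= 0:
        raise ValueError("num_servers and service_rate must be positive")
    if arrival_rate < 0:
        raise ValueError("arrival_rate must be non-negative")
    rho = arrival_rate / (num_servers * service_rate)
    # Assumption: an overloaded pool is reported as fully utilized.
    return min(rho, 1.0)


if __name__ == "__main__":
    # Eight servers, each serving 1 Mbit/s, with 4 Mbit/s of arrivals.
    print(server_utilization(4e6, 1e6, 8))
```

A utilization close to 1 indicates that the queue is near saturation, which is the regime in which the computational energy demand is largest.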
II-A1 MEC Server Energy Consumption
In the case of MEC server energy consumption, the computational energy consumption (dynamic energy) will depend on the CPU activity for executing computational tasks [17, 34, 16]. Further, such dynamic energy also accounts for the thermal design power (TDP), memory, and disk I/O operations of the MEC server [17, 34, 16], and we denote it as . Meanwhile, static energy includes the idle state power of CPU activities [16, 18]. We consider a single-core CPU with a processor frequency (cycles/s), an average server utilization (using (1)) at time slot , and a switching capacitance (farad) . The dynamic power consumption of such a single-core CPU can be calculated by applying a quadratic formula [18, 35]. Thus, the energy consumption of MEC servers with CPU cores at BS is defined as follows:
where denotes a scaling factor of the heterogeneous CPU cores of the MEC server. Thus, the value of depends on the processor architecture that assures the heterogeneity of the MEC servers.
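The quadratic dependence on processor frequency can be illustrated with the short Python sketch below. It assumes that the dynamic term per core is the product of a scaling factor, the utilization, the switching capacitance, and the squared frequency, and that the static term is added once per server. This combination, and every name in the code, is an assumption made for illustration and is not necessarily the exact form of (2).

```python
def mec_server_energy(utilization: float,
                      frequency: float,
                      capacitance: float,
                      num_cores: int,
                      static_energy: float,
                      scaling: float = 1.0) -> float:
    """Illustrative MEC server energy with a frequency-squared dynamic term."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must lie in [0, 1]")
    if num_cores <= 0:
        raise ValueError("num_cores must be positive")
    # Dynamic part of one core: grows quadratically with the frequency.
    dynamic_per_core = scaling * utilization * capacitance * frequency ** 2
    # Assumption: identical cores, plus one static (idle) term per server.
    return num_cores * dynamic_per_core + static_energy


if __name__ == "__main__":
    # Doubling the frequency quadruples the dynamic part.
    print(mec_server_energy(0.5, 2.0, 1.0, 4, 3.0))
    print(mec_server_energy(0.5, 4.0, 1.0, 4, 3.0))
```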
II-A2 Base Station Energy Consumption
The energy consumption needed for the operation of the network base stations (i.e., SBSs and MBS) includes two types of energy: dynamic and static energy consumption . On the one hand, the static energy consumption includes the energy for maintaining the idle state of any BS, a constant power consumption for receiving packets from users, and the energy for wired transmission among the BSs. On the other hand, the dynamic energy consumption of the BSs depends on the amount of data transferred from BSs to users, which essentially relates to the downlink transmit energy. Thus, we consider that each BS operates at a fixed channel bandwidth and constant transmission power . Then the average downlink data of BS will be given by :
where represents the downlink channel gain between user task and BS , determines the variance of the additive white Gaussian noise (AWGN), and denotes the co-channel interference [39, 40] among the BSs. Here, the co-channel interference relates to the transmissions from other BSs that use the same subchannels of . and represent, respectively, the transmit power and the channel gain of the BS . Therefore, the downlink energy consumption of the data transfer of BS is defined by [watt-seconds or joule], where [seconds] determines the duration of transmit power [watt]. Thus, the network energy consumption for BS at time is defined as follows [37, 19]:
where determines the energy coefficient for transferring data through the network. In fact, the value of depends on the type of the network device (e.g., for a unit transceiver remote radio head ).
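The downlink quantities described above can be sketched in Python as follows. The rate uses the standard Shannon capacity expression with the signal-to-interference-plus-noise ratio, the interference is the sum of received powers from other co-channel BSs, and the transmit energy is power multiplied by duration. All function and parameter names are illustrative assumptions, and the sketch is not claimed to reproduce (3) or (4) exactly.

```python
import math
from typing import Iterable, Tuple


def co_channel_interference(interferers: Iterable[Tuple[float, float]]) -> float:
    """Sum of (transmit power * channel gain) over the other co-channel BSs."""
    return sum(power * gain for power, gain in interferers)


def downlink_rate(bandwidth_hz: float,
                  tx_power_w: float,
                  channel_gain: float,
                  noise_variance_w: float,
                  interference_w: float = 0.0) -> float:
    """Shannon rate in bits/s: W * log2(1 + P*g / (sigma^2 + I))."""
    sinr = tx_power_w * channel_gain / (noise_variance_w + interference_w)
    return bandwidth_hz * math.log2(1.0 + sinr)


def downlink_energy_joule(tx_power_w: float, duration_s: float) -> float:
    """Transmit energy in watt-seconds (joule)."""
    return tx_power_w * duration_s


if __name__ == "__main__":
    interference = co_channel_interference([(1.0, 0.5), (2.0, 0.25)])
    print(downlink_rate(1e6, 1.0, 6.0, 1.0, interference))
    print(downlink_energy_joule(2.0, 15 * 60))
```

Raising the interference term lowers the achievable rate for the same transmit power, so more transmit time (and hence energy) is needed to deliver the same amount of data.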
II-A3 Total Energy Demand
The total energy consumption (demand) of the network consists of both the MEC server computational energy consumption (in (2)) and the network operational energy (i.e., the BS energy consumption in (4)). Thus, the overall energy demand of the network at time slot is given as follows:
The demand is random over time and completely depends on the computational tasks load of the MEC servers.
II-B Energy Generation Model
The energy supply of the self-powered wireless network with MEC capabilities relates to the network’s own renewable sources (e.g., solar, wind, biofuels) as well as the main grid’s non-renewable energy sources (e.g., diesel generators, coal power) [8, 41]. In this energy generation model, we consider a set of renewable energy sources of the network, with each element representing the set of renewable energy sources of BS . The amount of renewable energy generation is defined by . The total renewable energy generation at time is defined as . Further, the self-powered wireless network is able to get an additional non-renewable energy amount from the main grid at time . The per unit renewable and non-renewable energy costs are defined by and , respectively. In general, the renewable energy cost only depends on the maintenance cost of the renewable energy sources [42, 43, 44]. Therefore, the per unit non-renewable energy cost is greater than the renewable energy cost . Additionally, the surplus amount of energy at time can be stored in an energy storage medium for future use [43, 44], and the storage cost per unit of stored energy is denoted by .
II-B1 Non-renewable Energy Generation Cost
In order to fulfill the energy demand when it is greater than the generated renewable energy , the main grid can provide an additional amount of energy from its non-renewable sources. Thus, the non-renewable energy generation cost of the network is determined as follows:
where represents a unit energy cost.
II-B2 Surplus Energy Storage Cost
The surplus amount of energy is stored in a storage medium when (i.e., energy demand is smaller than the renewable energy generation) at time . We consider the per unit energy storage cost . This storage cost depends on the storage medium and the amount of energy stored at time [43, 45, 23, 46]. With the per unit energy storage cost , the total storage cost at time is defined as follows:
II-B3 Total Energy Generation Cost
The total energy generation cost includes renewable, non-renewable, and storage energy cost. Naturally, this total energy generation cost will depend on the energy demand of the network at time . Therefore, the total energy generation cost at time is defined as follows:
where the energy costs of the renewable, non-renewable, and storage energy are given by , , and , respectively. In (8), energy demand and renewable energy generation are stochastic in nature. The energy costs of non-renewable energy (6) and storage energy (7) completely rely on energy demand and renewable energy generation . Hence, to address the uncertainty of both energy demand and renewable energy generation in a self-powered wireless network, we formulate a two-stage stochastic programming problem. In particular, the first stage makes a decision of the energy dispatch without knowing the actual demand of the network. Then we make further energy dispatch decisions by analyzing the uncertainty of the network demand in the second stage. A detailed discussion of the problem formulation is given in the following section.
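The three cost components described in this subsection can be sketched in a few lines of Python. The sketch follows the verbal description only: non-renewable energy covers the shortfall when demand exceeds renewable generation, storage absorbs the surplus otherwise, and the total adds the renewable cost. Function names, parameter names, and the numerical values are illustrative assumptions.

```python
def non_renewable_cost(demand: float, renewable: float, unit_cost: float) -> float:
    """Cost of grid (non-renewable) energy covering the shortfall, if any."""
    return unit_cost * max(demand - renewable, 0.0)


def storage_cost(demand: float, renewable: float, unit_cost: float) -> float:
    """Cost of storing the surplus renewable energy, if any."""
    return unit_cost * max(renewable - demand, 0.0)


def total_generation_cost(demand: float, renewable: float,
                          c_ren: float, c_non: float, c_sto: float) -> float:
    """Renewable + non-renewable + storage cost for one time slot."""
    return (c_ren * renewable
            + non_renewable_cost(demand, renewable, c_non)
            + storage_cost(demand, renewable, c_sto))


if __name__ == "__main__":
    # Shortfall case: demand 10, renewable 6.
    print(total_generation_cost(10.0, 6.0, c_ren=1.0, c_non=3.0, c_sto=0.5))
    # Surplus case: demand 10, renewable 14.
    print(total_generation_cost(10.0, 14.0, c_ren=1.0, c_non=3.0, c_sto=0.5))
```

In any single time slot at most one of the two recourse terms is non-zero, which is what makes the total cost piecewise linear in the renewable amount.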
III Problem Formulation with a Two-Stage Stochastic Model
We now consider the case in which the non-renewable energy cost is greater than the renewable energy cost, which is often the case in a practical smart grid as discussed in , , , and . Here, and are continuous variables over the observational duration . The objective is to minimize the total energy consumption cost . is the decision variable and the energy demand is a parameter. When the energy demand is known, the optimization problem will be:
In problem (9), after removing the non-negativity constraints , we can rewrite the objective function in the form of piecewise linear functions as follows:
where and determine the cost of non-renewable (i.e., ) and storage (i.e., ) energy, respectively. Therefore, we have to choose one out of the two cases. In fact, if the energy demand is known and the amount of renewable energy is the same as the energy demand, then problem (10) provides the optimal decision, which is to meet the exact amount of demand . However, the challenge here is to make a decision about the renewable energy usage before the demand becomes known. To overcome this challenge, we consider the energy demand as a random variable whose probability distribution can be estimated from the previous history of the energy demand. We can rewrite problem (9) using the expectation of the total cost as follows:
The solution of problem (11) will provide a result that is optimal on average. However, in a practical scenario, we need to solve problem (11) repeatedly over the uncertain energy demand . Thus, this solution approach is not effective when the BSs generate large (i.e., non-iid) variations of the energy demand over the observational period of .
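To make the expectation-based formulation concrete, the following Python sketch evaluates the expected cost of a renewable dispatch decision over a finite set of demand scenarios and picks the cheapest candidate. It assumes non-negative unit costs, in which case the expected cost is piecewise linear and convex in the renewable amount, so it suffices to check zero and the demand values themselves. The scenario values, probabilities, and all names are illustrative assumptions and are not taken from the paper's datasets.

```python
from typing import Sequence


def dispatch_cost(renewable: float, demand: float,
                  c_ren: float, c_non: float, c_sto: float) -> float:
    """Cost for one realized demand: renewable + shortfall + surplus terms."""
    shortfall = max(demand - renewable, 0.0)
    surplus = max(renewable - demand, 0.0)
    return c_ren * renewable + c_non * shortfall + c_sto * surplus


def expected_cost(renewable: float, demands: Sequence[float],
                  probs: Sequence[float],
                  c_ren: float, c_non: float, c_sto: float) -> float:
    """Expectation of the dispatch cost over finitely many demand scenarios."""
    return sum(p * dispatch_cost(renewable, d, c_ren, c_non, c_sto)
               for d, p in zip(demands, probs))


def best_renewable(demands: Sequence[float], probs: Sequence[float],
                   c_ren: float, c_non: float, c_sto: float) -> float:
    """Minimize the expected cost over the breakpoints (0 and each demand)."""
    candidates = [0.0, *demands]
    return min(candidates,
               key=lambda x: expected_cost(x, demands, probs, c_ren, c_non, c_sto))


if __name__ == "__main__":
    demands = [4.0, 8.0, 12.0]
    probs = [0.25, 0.5, 0.25]
    x = best_renewable(demands, probs, c_ren=1.0, c_non=3.0, c_sto=0.5)
    print(x, expected_cost(x, demands, probs, 1.0, 3.0, 0.5))
```

The decision is made once, before the demand is realized, and the shortfall and surplus terms play the role of the recourse actions taken afterwards.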
We consider the moment of random variable that has a finitely supported distribution and takes values with respective probabilities of BSs . The cumulative distribution function (CDF) of energy demand is a step function with jumps of size at each demand . Therefore, the probability distribution of each BS energy demand belongs to the CDF of historical observation of energy demand . In this case, we can convert problem (11) into a deterministic optimization problem and the expectation of energy usage cost is determined by . Thus, we can rewrite problem (9) as a linear programming problem using the representation in (10) as follows:
For a fixed value of the renewable energy , problem (12) is equivalent to problem (10). Meanwhile, problem (12) is equal to . We have converted the piecewise linear function from problem (10) into the inequality constraints (12a) and (12b). We consider as the highest probability of energy demand at each BS . Therefore, for BSs, we define as the probability of energy demand with respect to BSs . Thus, we can rewrite problem (11) for BSs as follows:
where represents the highest probability (close to ) of energy demand at BS and estimates a probability distribution from the -quantile of the empirical CDF of the historical demand observation. Thus, for a fixed value of , this problem is almost separable. Hence, we can decompose problem (13) into a two-stage linear stochastic programming problem [48, 49].
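The empirical CDF and its quantile, which the discussion above relies on, can be sketched in Python as follows. The step function jumps by the relative frequency of each observed demand value, and the quantile is the smallest observed value whose cumulative probability reaches the requested level. The function names and sample values are illustrative assumptions.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def empirical_cdf(samples: Sequence[float]) -> List[Tuple[float, float]]:
    """Return sorted (value, cumulative probability) pairs of a step CDF."""
    if not samples:
        raise ValueError("samples must be non-empty")
    n = len(samples)
    counts = Counter(samples)
    cdf = []
    cumulative = 0
    for value in sorted(counts):
        cumulative += counts[value]          # jump of size count / n
        cdf.append((value, cumulative / n))
    return cdf


def empirical_quantile(samples: Sequence[float], level: float) -> float:
    """Smallest observed value v with F(v) >= level, for 0 < level <= 1."""
    if not 0.0 < level <= 1.0:
        raise ValueError("level must lie in (0, 1]")
    for value, prob in empirical_cdf(samples):
        if prob >= level:
            return value
    return max(samples)  # unreachable in exact arithmetic; kept as a guard


if __name__ == "__main__":
    history = [5.0, 3.0, 5.0, 9.0]
    print(empirical_cdf(history))
    print(empirical_quantile(history, 0.9))
```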
To find an approximation for a random variable with a finite probability distribution, we decompose problem (13) into a two-stage linear stochastic program under uncertainty. The decision is made using historical data of energy demand, which is fully independent of future observations. As a result, the first stage of the self-powered energy dispatch problem for sustainable edge computing is formulated as follows:
where determines an optimal value of the second stage problem. In problem (14), the decision variable is calculated before the realization of uncertain energy demand . Meanwhile, at the first stage of the formulated problem (14), the cost is minimized for the decision variable which then allows us to estimate the expected energy cost for the second stage decision. Constraint (14a) provides a boundary for the maximum allowable renewable energy usage. Thus, based on the decision of the first stage problem, the second stage problem can be defined as follows:
In the second stage problem , the decision variables and depend on the realization of the energy demand of the first stage problem , where determines the amount of renewable energy usage at time . The first constraint is an equality constraint that requires the surplus amount of energy to be equal to the absolute difference between the usage of renewable and non-renewable energy. The second constraint is an inequality constraint that uses the optimal demand value from the first stage realization. In particular, the value of demand comes from , that is, the historical observation of energy demand. Finally, the constraint enforces non-negativity of the non-renewable energy usage.
The formulated problems and can characterize the uncertainty between network energy demand and renewable energy generation. Particularly, the second stage problem contains a random demand , which makes the optimal cost a random variable. As a result, we can rewrite the problems and as one large linear programming problem for BSs, and the problem formulation is as follows:
In problem , for BSs, energy demand happens with positive probabilities and . The decision variables are , and , which represent the amount of renewable, non-renewable, and storage energy, respectively. Constraint defines a relationship among all of the decision variables , and . In essence, this constraint discretizes the surplus amount of energy for storage. Constraint ensures the utilization of non-renewable energy based on the energy demand of the network. Constraint ensures that the decision variable will not take a negative value. Finally, constraint restricts the renewable energy usage to the maximum capacity at time . Problem is an integrated form of the first-stage problem in and the second-stage problem in , where the solution of and completely depends on the realization of demand for all BSs. The decision of comes before the realization of demand and, thus, the estimation of renewable energy generation will be independent and random. Therefore, problem holds the property of relatively complete recourse. In problem , the number of variables and constraints is proportional to the number of BSs, . Additionally, the complexity of the decision problem leads to due to the combinatorial properties of the decisions and constraints [48, 49, 50].
The goal of the self-powered energy dispatch problem is to find an optimal energy dispatch policy that includes the amount of renewable , non-renewable , and storage energy of each BS while minimizing the energy consumption cost. Meanwhile, such an energy dispatch policy relies on an empirical probability distribution of historical demand at each BS at time . In order to solve problem , we choose an approach that does not rely on the conservativeness of a theoretical probability distribution of energy demand in problem , and that also captures the uncertainty of renewable energy generation from the historical data. A data-driven approach can eliminate the conservativeness of theoretical probability distributions as the amount of historical data goes to infinity. Eventually, non-iid energy demand and generation will also be captured at each BS when the time-variant features of both energy demand and generation are characterized by the Markovian properties of the historical data. To address the aforementioned challenges, we propose a multi-agent meta-reinforcement learning framework that can explore the Markovian behavior from the historical energy demand and generation of each BS . Meanwhile, the meta-agent can use such time-varying features to reach a globally optimal energy dispatch policy for each BS .
We design an MAMRL framework by converting the cost minimization problem to a reward maximization problem that we then solve with a data-driven approach. In the MAMRL setting, each agent works as a local agent for each BS and determines an observation (i.e., exploration) for the decision variables, renewable , non-renewable , and storage energy. The goal of this exploration is to find time-varying features from the local historical data so that the energy demand of the network is satisfied. Furthermore, using these observations and current state information, a meta-agent is used to determine a stochastic energy dispatch policy. Thus, to obtain such a dispatch policy, the meta-agent only requires the observations (behavior) from each local agent. Then, the meta-agent can evaluate (exploit) this behavior toward an optimal decision for dispatching energy. Further, the MAMRL approach is capable of capturing the exploration-exploitation tradeoff in a way that the meta-agent optimizes the decisions of each self-powered BS under uncertainty. A detailed discussion of the MAMRL framework is given in the following section.
IV Energy Dispatch with Multi-Agent Meta-Reinforcement Learning Framework
In this section, we develop our proposed multi-agent meta-reinforcement learning framework (as seen in Fig. 3) for energy dispatch in the considered network. The proposed MAMRL framework includes two types of agents: a local agent that acts as a local learner at each self-powered BS with MEC capabilities, and a meta-agent that learns the global energy dispatch policy. In particular, each local BS agent can discretize the Markovian dynamics of the energy demand and generation of each BS (i.e., both SBSs and MBS) separately by applying deep reinforcement learning. Meanwhile, we train a long short-term memory (LSTM) [53, 54] as a meta-agent at the MBS that optimizes the accumulated energy dispatch of the local agents. As a result, the meta-agent can handle the non-iid energy demand and generation of each local agent using the LSTM's own state information. To this end, MAMRL mitigates the curse of dimensionality for the uncertainty of energy demand and generation while providing an energy dispatch solution with lower computational and communication complexity (i.e., less message passing between the local agents and the meta-agent).
IV-A Preliminary Setup
In the MAMRL setting, each BS acts as a local agent, and the number of local agents is the same as the number of BSs (i.e., MBS and SBSs) in the network. We define a set of state spaces and a set of actions for the agents. The state space of a local agent is defined by , where , and represent the amount of energy demand, renewable generation, storage cost, and non-renewable energy cost, respectively, at time . We execute Algorithm 1 to generate the state space for every BS , individually. In Algorithm 1, lines to calculate the individual energy consumption of the MEC computation and network operation using (2) and (4), respectively. Overall, the energy demand of the BS is computed in line and the self-powered energy generation is estimated by line in Algorithm 1. Non-renewable and storage energy costs are calculated in lines and for time slot . Finally, line creates the state space tuple (i.e., ) for time in Algorithm 1.
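The state generation steps described above can be sketched in Python as follows. The sketch mirrors the verbal description of Algorithm 1 only: the demand is the sum of the MEC and network energy, the two costs follow from the gap between demand and renewable generation, and the four quantities form the state tuple. The class name, field names, and inputs are illustrative assumptions, since the pseudocode of Algorithm 1 is not reproduced here.

```python
from typing import NamedTuple


class BSState(NamedTuple):
    """State of one BS agent at one time slot (assumed field names)."""
    demand: float
    renewable: float
    storage_cost: float
    non_renewable_cost: float


def build_state(mec_energy: float, network_energy: float,
                renewable_generation: float,
                c_sto: float, c_non: float) -> BSState:
    """Assemble the state tuple from energy terms and per unit costs."""
    demand = mec_energy + network_energy
    # Surplus renewable energy is stored; a shortfall is bought from the grid.
    storage = c_sto * max(renewable_generation - demand, 0.0)
    non_renewable = c_non * max(demand - renewable_generation, 0.0)
    return BSState(demand, renewable_generation, storage, non_renewable)


if __name__ == "__main__":
    print(build_state(6.0, 4.0, 7.0, c_sto=0.5, c_non=3.0))
```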
IV-B Local Agent Design
Consider each local BS agent that can take two types of actions: the amount of storage energy and the amount of non-renewable energy at time . Since the state and action both contain time-varying information of the agent , we consider Markovian dynamics and represent problem as a discounted reward maximization problem for each agent (i.e., each BS). Thus, the objective function of the discounted reward maximization problem of agent is defined as follows :
where is a discount factor and each reward is considered as,
In (18), determines a ratio between renewable energy generation and energy demand (supply-demand ratio) of the BS agent at time . When renewable energy generation-demand ratio is larger than then the BS agent achieves a reward of because the amount of renewable energy exceeds the demand that can be stored in the storage unit.
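The discounted reward objective can be illustrated with the Python sketch below. The discounted sum is the standard definition. The reward rule, however, is an assumption: the threshold and reward values of (18) are not available in this text, so the sketch uses a hypothetical threshold of 1 and rewards of 1 and 0 purely for illustration.

```python
from typing import Sequence


def supply_demand_ratio(renewable: float, demand: float) -> float:
    """Ratio between renewable generation and energy demand."""
    if demand <= 0:
        raise ValueError("demand must be positive")
    return renewable / demand


def reward(ratio: float, threshold: float = 1.0) -> float:
    """Hypothetical reward: 1 when generation exceeds demand, else 0."""
    return 1.0 if ratio > threshold else 0.0


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum over t of gamma**t * r_t for a discount factor in [0, 1]."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


if __name__ == "__main__":
    slots = [(12.0, 10.0), (8.0, 10.0), (11.0, 10.0)]  # (renewable, demand)
    rewards = [reward(supply_demand_ratio(g, d)) for g, d in slots]
    print(rewards, discounted_return(rewards, gamma=0.5))
```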
Each action of BS agent determines a stochastic policy . is a parameter of and the energy dispatch policy is defined by . Policy decides a state transition function for the next state . Thus, the state transition function of BS agent is determined by a reward function , where . As a result, for a given state , the state value function with a cumulative discounted reward will be:
where is a discount factor that ensures the convergence of the state value function over the infinite time horizon. Thus, for a given state , the optimal policy for the next state can be determined by an optimal state value function while a Markovian property is imposed. Therefore, the optimal value function is given as follows:
In this setting, the policy of energy dispatch is determined by choosing an action in that can be seen as an actor of BS agent while the estimated value function plays the role of a critic. Thus, the critic criticizes actions made by the actor using a temporal difference (TD) error  that determines an energy dispatch policy. The TD error is considered as an advantage function and the advantage function of agent is defined as follows:
Thus, the policy gradient is determined as,
Using (22), we can discretize the energy dispatch decision for each self-powered BS in the network. In fact, we can achieve a centralized solution for when the state information (i.e., demand and generation) of all BSs is known. However, the space complexity for computation increases as and the computational complexity becomes . Further, the solution does not resolve the exploration-exploitation dilemma since the centralized (i.e., single agent) method ignores the interactions and energy dispatch decision strategies of other agents (i.e., BSs), which creates an imbalance between exploration and exploitation. Next, we propose an approach that not only reduces the complexity but also explores alternative energy dispatch decisions to achieve the highest expected reward in (17).
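The role of the TD error as the critic's feedback can be sketched in Python as follows. The TD error uses its standard one-step definition (reward plus discounted next-state value minus current-state value). The tabular value store and the learning rate are simplifying assumptions made for illustration; the framework described in this paper uses learned function approximators instead of a table.

```python
from typing import Dict, Hashable


def td_error(reward: float, value_current: float,
             value_next: float, gamma: float) -> float:
    """One-step temporal difference error, used here as the advantage."""
    return reward + gamma * value_next - value_current


def critic_update(values: Dict[Hashable, float], state: Hashable,
                  next_state: Hashable, reward: float,
                  gamma: float, learning_rate: float) -> float:
    """Move V(state) toward the TD target and return the TD error."""
    delta = td_error(reward, values[state], values[next_state], gamma)
    values[state] += learning_rate * delta
    return delta


if __name__ == "__main__":
    values = {"s0": 2.0, "s1": 4.0}
    delta = critic_update(values, "s0", "s1", reward=1.0,
                          gamma=0.5, learning_rate=0.5)
    print(delta, values)
```

A positive TD error indicates that the chosen action performed better than the critic expected, so an actor would increase the probability of that action, while a negative error has the opposite effect.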
IV-C Multi-Agent Meta-Reinforcement Learning Modeling
We consider a set of observations [56, 27] and for a BS agent , a single observation tuple is given by . For a given state , the observation of the next state consists of , where , , , and are the next-state discounted rewards, current-state discounted rewards, next action, current action, time slot, and TD error, respectively. Here, the complete information of the observation is correlated with the state space, while the observation does not require the complete state information of the previous states. Thus, the space complexity for computation is and is the communication complexity of each agent .
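The observation tuple listed above can be sketched in Python as a small record that a local agent passes to the meta-agent. The six fields follow the order given in the text, and each action is assumed to be a pair of storage and non-renewable energy amounts, as in the local agent design. The class name, field names, and the flattening helper are illustrative assumptions.

```python
from typing import List, NamedTuple, Tuple

Action = Tuple[float, float]  # (storage energy, non-renewable energy)


class Observation(NamedTuple):
    """What one local agent reports for one time slot (assumed names)."""
    next_reward: float
    reward: float
    next_action: Action
    action: Action
    time_slot: int
    td_error: float


def flatten(obs: Observation) -> List[float]:
    """Turn an observation into a flat feature vector of eight numbers."""
    return [obs.next_reward, obs.reward,
            *obs.next_action, *obs.action,
            float(obs.time_slot), obs.td_error]


if __name__ == "__main__":
    obs = Observation(next_reward=1.0, reward=0.0,
                      next_action=(0.0, 2.0), action=(1.0, 0.0),
                      time_slot=3, td_error=0.5)
    print(flatten(obs))
```

Because the record has a fixed size, the amount of data each agent sends per time slot does not grow with the length of its state history.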
In the MAMRL framework, the local agents work as optimizees and the meta-agent performs the role of optimizer . To model our meta-agent, we consider an LSTM architecture [53, 54] that stores its own state information (i.e., parameters), and the local agent (i.e., optimizee) only provides the observation of the current state. In the proposed MAMRL framework, a policy is determined by updating the parameters . Therefore, we can represent the state value function (20) for time as follows: , and the advantage (temporal difference) function (21) is given by . As a result, the parameterized policy is defined by . Considering all of the BS agents , the advantage function is rewritten as,