I. Introduction
Next-generation wireless networks are expected to rely significantly on edge applications and functions, including edge computing and edge artificial intelligence (edge AI) [1, 2, 3, 4, 5, 6, 7]. To successfully support such edge services within a wireless network with mobile edge computing (MEC) capabilities, energy management (i.e., demand and supply) is one of the most critical design challenges. In particular, it is imperative to equip next-generation wireless networks with alternative energy sources, such as renewable energy, in order to provide extremely reliable energy dispatch with a lower energy consumption cost [8, 9, 11, 10, 12, 13, 14, 15]. An efficient energy dispatch design requires energy sustainability, which not only saves energy consumption cost but also fulfills the energy demand of edge computing by enabling its own renewable energy sources. Specifically, sustainable energy is the practice of seamless energy flow to the MEC system that meets the energy demand without compromising the ability of future energy generation. Furthermore, to ensure sustainable MEC operation, it is essential to account for the uncertainty of both energy consumption and generation.
To provide sustainable edge computing for next-generation wireless systems, each base station (BS) unit with MEC capabilities can be equipped with renewable energy sources. Thus, the energy supply of such a BS unit does not rely solely on the power grid, but also on the equipped renewable energy sources. In particular, in a self-powered network, each wireless BS with MEC capabilities is equipped with its own renewable energy sources and can generate, consume, store, and share energy with other BS units.
Delivering seamless energy flow with a low energy consumption cost in a self-powered wireless network with MEC capabilities can lead to uncertainty in both energy demand and generation. In particular, the randomness of the energy demand is induced by the uncertain resources (i.e., computation and communication) requested by the edge services and applications. Meanwhile, the energy generation of a renewable source (i.e., a solar panel) at each self-powered BS unit varies with the time of day. In other words, the pattern of energy demand and generation will differ from one self-powered BS unit to another. Thus, such fluctuating energy demand and generation patterns induce non-independent and identically distributed (non-i.i.d.) energy dispatch at each BS over time. As such, when designing self-powered wireless networks, it is necessary to take this uncertainty in the energy patterns into account.
I-A. Related Works
The problem of energy management for MEC-enabled wireless networks has been studied in [16, 17, 18, 19, 20, 21, 22]. In [16], the authors proposed a joint mechanism for radio resource management and user task offloading with the goal of minimizing the long-term power consumption of both the mobile devices and the MEC server. The authors in [17] proposed a heuristic to solve the joint problem of computational resource allocation, uplink transmission power control, and user task offloading. The work in [18] studied the tradeoff between communication and computation in a MEC system, and the authors proposed a MEC server CPU scaling mechanism for reducing energy consumption. Further, the work in [19] proposed an energy-aware mobility management scheme for MEC in ultra-dense networks and addressed the problem using Lyapunov optimization and multi-armed bandits. Recently, the authors in [21] proposed a distributed power control scheme for a small cell network using the concept of multi-agent calibrated learning. Further, the authors in [22] studied the problem of energy storage and energy harvesting (EH) for a wireless network using deviation theory and Markov processes. However, all of these existing works assume that the consumed energy is readily available from the energy utility source to the wireless network system [16, 17, 18, 19, 20, 21, 22]. Since these models often focus on energy management and user task offloading for network resource allocation, the random demand for the computational (e.g., CPU computation, memory, etc.) and communication requirements of edge applications and services is not considered. In fact, even if enough energy supply is available, the energy cost of network operation can be significant because of the usage of non-renewable sources (e.g., coal, petroleum, and natural gas). Hence, it is necessary to incorporate renewable energy sources into the next-generation wireless networking infrastructure.
Recently, some of the challenges of renewable energy powered wireless networks have been studied in [8, 9, 11, 10, 12, 13, 14, 24, 23]. In [8], the authors proposed an online optimization framework to analyze the activation and deactivation of BSs in a self-powered network. In [9], the authors proposed a hybrid power source infrastructure to support heterogeneous networks (HetNets), in which a model-free deep reinforcement learning (RL) mechanism was used for user scheduling and network resource management.
In [10], the authors developed an RL scheme for edge resource management while incorporating renewable energy in the edge network. In particular, the goal of [10] is to minimize a long-term system cost by load balancing between the centralized cloud and the edge server. The authors in [11] introduced a microgrid-enabled edge computing system, in which a joint optimization problem is studied for MEC task assignment and energy demand-response (DR) management; a model-based deep RL framework was developed to tackle this joint problem. In [12], the authors proposed a risk-sensitive energy profiling scheme for a microgrid-powered MEC network that ensures a sustainable energy supply for green edge computing by capturing the conditional value at risk (CVaR) tail distribution of the energy shortfall, and a multi-agent RL system was proposed to solve the energy scheduling problem. In [13], the authors proposed a self-sustainable mobile network that uses a graph-based approach for intelligent energy management with a microgrid. The authors in [14] proposed a smart grid-enabled wireless network and minimized grid energy consumption by applying energy sharing among the BSs. Furthermore, in [23], the authors addressed the challenges of non-coordinated energy shedding and misaligned incentives for mixed-use buildings (i.e., buildings and data centers) using auction theory to reduce energy usage. However, these works [9, 11, 10, 12, 13, 14, 23] do not investigate the problem of energy dispatch, nor do they account for the energy cost of MEC-enabled, self-powered networks when the demand and generation of each self-powered BS are non-i.i.d. Dealing with non-i.i.d. energy demand and generation among self-powered BSs is challenging because the intrinsic energy requirements of each BS evolve under uncertainty.
To overcome this unique energy dispatch challenge, we propose a multi-agent meta-reinforcement learning framework that can adapt to a new uncertain environment without revisiting the entire past experience.
I-B. Contributions
The main contribution of this paper is a novel energy management framework for next-generation MEC in a self-powered wireless network that is robust to extremely uncertain energy demand and generation. We formulate a two-stage stochastic energy cost minimization problem that can balance renewable, non-renewable, and storage energy without knowing the actual demand. In fact, the formulated problem also investigates the realization of renewable energy generation after receiving the uncertain energy demand from the MEC applications and service requests. To solve this problem, we propose a multi-agent meta-reinforcement learning (MAMRL) framework that dynamically observes the non-i.i.d. behavior of time-varying features in both energy demand and generation at each BS and then transfers those observations to obtain an energy dispatch decision and execute the energy dispatch policy at the self-powered BS. Fig. 1 illustrates how we propose to dispatch energy to ensure sustainable edge computing over a self-powered network using the MAMRL framework. As we can see, each BS, which includes small cell base stations (SBSs) and a macro base station (MBS), acts as a local agent and transfers its own decision (reward and action) to the meta-agent. Then, the meta-agent accumulates all of the non-i.i.d. observations from each local agent (i.e., SBSs and MBS) and optimizes the energy dispatch policy. The proposed MAMRL framework then provides feedback to each BS agent so that it can explore efficiently and acquire the right decision more quickly. Thus, the proposed MAMRL framework ensures autonomous decision making under an uncertain and unknown environment. Our key contributions include:

We formulate a self-powered energy dispatch problem for a MEC-supported wireless network, in which the objective is to minimize the total energy consumption cost of the network while considering the uncertainty of both energy consumption and generation. The formulated problem is, thus, a two-stage linear stochastic program. In particular, the first stage makes a decision when the energy demand is unknown, and the second stage discretizes the realization of renewable energy generation after the energy demand of the network becomes known.

To solve the formulated problem, we propose a new multi-agent meta-reinforcement learning framework by considering the skill transfer mechanism [27, 28, 29] between each local agent (i.e., self-powered BS) and the meta-agent. In this MAMRL scheme, each local agent explores its own energy dispatch decision using Markovian properties to capture the time-varying features of both energy demand and generation. Meanwhile, the meta-agent evaluates (exploits) that decision for each local agent and optimizes the energy dispatch decision. In particular, we design a long short-term memory (LSTM) network as the meta-agent (i.e., run at the MBS) that is capable of discarding incompetent decisions from each local agent and learning the right features more quickly by maintaining its own state information.

We develop the proposed MAMRL energy dispatch framework in a semi-distributed manner. Each local agent (i.e., self-powered BS) estimates its own energy dispatch decision using local energy data (i.e., demand and generation) and provides observations to the meta-agent individually. Consequently, the meta-agent optimizes the decision centrally and assists each local agent toward a globally optimized decision. Thus, this approach not only reduces the computational complexity and communication overhead, but also mitigates the curse of dimensionality under uncertainty by utilizing the non-i.i.d. energy demand and generation from each local agent.

Experimental results using real datasets establish a significant performance gain of the energy dispatch under deterministic, asymmetric, and stochastic environments. In particular, the results show that the proposed MAMRL model saves up to of the energy consumption cost over a baseline approach while achieving an average accuracy of around in a stochastic environment. Our approach also decreases the usage of non-renewable energy to up to of the total consumed energy.
The rest of the paper is organized as follows. Section II presents the system model of self-powered edge computing. The problem formulation is described in Section III. Section IV presents the MAMRL framework for solving the energy dispatch problem. Experimental results are analyzed in Section V. Finally, conclusions are drawn in Section VI.
II. System Model of Self-Powered Edge Computing
TABLE I: Summary of notations
Notation  Description

Set of BSs (SBSs and MBS)  
Set of active server under the BS  
Set of user tasks  
Set of renewable energy sources  
Server utilization in BS  
Energy coefficient for BS  
Renewable energy cost per unit  
Nonrenewable energy cost per unit  
Storage energy cost per unit  
Amount of renewable energy  
Amount of nonrenewable energy  
Amount of surplus energy  
Energy demand at time slot  
Random variable for energy demand 
Consider a self-powered wireless network that is connected to a smart grid controller, as shown in Fig. 2. Such a wireless network enables edge computing services for various MEC applications and services. The energy consumption of the network depends on the network's operational energy consumption along with the task loads of the MEC applications. Meanwhile, the energy supply of the network relies on the energy generation of the renewable sources attached to the BSs, as well as on both the renewable and non-renewable sources of the smart grid. Therefore, we will first discuss the energy demand model, which includes the MEC server energy consumption and the network communication energy consumption. We will then describe the energy generation model, which consists of the non-renewable energy generation cost, the surplus energy storage cost, and the total energy generation cost. Table I summarizes our notation.
II-A. Energy Demand Model
Consider a set of ( for the MBS) BSs that encompass SBSs overlaid on a single MBS. Each BS includes a set of MEC application servers. We consider a finite time horizon, with each time slot indexed by and having a duration of 15 minutes [30]. The observational period of each time slot ends at the -th minute and is capable of capturing the changes in network dynamics [11, 12, 31]. A set of heterogeneous MEC application task requests from users arrives at BS with an average task arrival rate (bits/s) at time . The task arrival rate at BS follows a Poisson process at time slot . BS integrates heterogeneous active MEC application servers, each with a processing capacity of (bits/s). Thus, computational task requests are accumulated into the service pool with an average traffic size (bits) at time slot . The average traffic arrival rate is defined as . Therefore, an M/M/K queuing model is suitable for modeling these user tasks on the MEC servers at BS and time [32, 33]. The task size of this queuing model is exponentially distributed since the average traffic size is already known. Hence, the service rate of BS is determined by . At any given time , we assume that all of the tasks are uniformly distributed among the BSs. Thus, for a given MEC server task association indicator, if task is assigned to server at BS , and otherwise, the average MEC server utilization is defined as follows [11]:(1)
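Since the symbols of equation (1) did not survive rendering, the following minimal Python sketch illustrates one plausible reading of it: the load assigned through the task association indicator, normalized by the total service capacity of the M/M/K server pool. All names here (task_rates, assignment, mu, num_servers) are assumptions for illustration, not the paper's notation.

```python
def server_utilization(task_rates, assignment, mu, num_servers):
    """Average utilization of an M/M/K MEC server pool at one BS.

    task_rates[j]    -- average arrival rate of task j (bits/s)
    assignment[j][k] -- 1 if task j is assigned to server k, else 0
    mu               -- per-server service rate (bits/s)
    num_servers      -- number of active servers K
    """
    assigned_load = sum(task_rates[j] * assignment[j][k]
                        for j in range(len(task_rates))
                        for k in range(num_servers))
    # Utilization is the assigned load over the total capacity K * mu.
    return assigned_load / (num_servers * mu)
```

For example, two tasks of 10 and 20 bits/s assigned to two servers of 100 bits/s each yield a utilization of 0.15.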
II-A1. MEC Server Energy Consumption
For the MEC server, the computational energy consumption (dynamic energy) depends on the CPU activity for executing computational tasks [17, 34, 16]. Further, such dynamic energy also accounts for the thermal design power (TDP), memory, and disk I/O operations of the MEC server [17, 34, 16], and we denote it as . Meanwhile, static energy includes the idle-state power of CPU activities [16, 18]. We consider a single-core CPU with a processor frequency (cycles/s), an average server utilization (using (1)) at time slot , and a switching capacitance (farad) [17]. The dynamic power consumption of such a single-core CPU can be calculated by applying a quadratic formula [18, 35]. Thus, the energy consumption of MEC servers with CPU cores at BS is defined as follows:
(2) 
where denotes a scaling factor for the heterogeneous CPU cores of the MEC server. Thus, the value of depends on the processor architecture [36], which accounts for the heterogeneity of the MEC servers.
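As a hedged sketch of the quadratic CPU power model described above: the per-core dynamic power is taken as switching capacitance times frequency squared, scaled by utilization, then scaled to the multi-core server by the core count and the architecture-dependent factor. The parameter names and the placement of the static term are assumptions, since the rendered equation (2) is incomplete.

```python
def mec_server_energy(util, freq, cap, num_cores, eta, p_static, tau):
    """Per-slot energy (joules) of a MEC server with heterogeneous CPU cores.

    util     -- average server utilization from Eq. (1)
    freq     -- processor frequency (cycles/s)
    cap      -- switching capacitance (farads)
    num_cores, eta -- core count and architecture-dependent scaling factor
    p_static -- idle-state (static) power (watts), an assumed extra term
    tau      -- slot duration (seconds)
    """
    # Quadratic dynamic power model for one core, scaled by utilization.
    p_dynamic = util * cap * freq ** 2
    # Scale to the heterogeneous multi-core server and add static power.
    return tau * (eta * num_cores * p_dynamic + p_static)
```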
II-A2. Base Station Energy Consumption
The energy consumption needed for the operation of the network base stations (i.e., SBSs and MBS) includes two types of energy: dynamic and static energy consumption [37]. On one hand, the static energy consumption includes the energy for maintaining the idle state of a BS, a constant power consumption for receiving packets from users, and the energy for wired transmission among the BSs. On the other hand, the dynamic energy consumption of the BSs depends on the amount of data transferred from the BSs to the users, which essentially relates to the downlink [38] transmit energy. Thus, we consider that each BS operates with a fixed channel bandwidth and constant transmission power [38]. Then the average downlink data rate of BS is given by [11]:
(3) 
where represents the downlink channel gain between user task and BS , denotes the variance of the additive white Gaussian noise (AWGN), and denotes the co-channel interference [39, 40] among the BSs. Here, the co-channel interference relates to the transmissions from other BSs that use the same subchannels as . and represent, respectively, the transmit power and the channel gain of BS . Therefore, the downlink energy consumption of the data transfer of BS is defined by [watt-seconds, or joules], where [seconds] denotes the duration of the transmit power [watts]. Thus, the network energy consumption of BS at time is defined as follows [37, 19]:(4)
where determines the energy coefficient for transferring data through the network. In fact, the value of depends on the type of the network device (e.g., for a unit transceiver remote radio head [37]).
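The rate and energy relations of Eqs. (3)-(4) can be sketched as follows. The Shannon-style rate stands in for the average downlink data of Eq. (3), and w_coeff plays the role of the network energy coefficient; the exact composition of the static term is an assumption, since the rendered equations lost their symbols.

```python
import math

def bs_network_energy(bandwidth, tx_power, gain, noise_var, interference,
                      tau, w_coeff, e_static):
    """Per-slot BS energy: static part plus a downlink-data-dependent part.

    bandwidth (Hz), tx_power (W), gain, noise_var, interference -- SINR terms
    tau (s) -- slot duration; w_coeff -- energy per bit moved; e_static (J)
    """
    sinr = tx_power * gain / (noise_var + interference)
    rate = bandwidth * math.log2(1.0 + sinr)   # average downlink rate, bits/s
    e_downlink = tx_power * tau                # transmit energy, joules
    # Dynamic network energy scales with the data moved during the slot.
    return e_static + e_downlink + w_coeff * rate * tau
```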
II-A3. Total Energy Demand
The total energy consumption (demand) of the network consists of both the MEC server computational energy consumption (in (2)) and the network operational energy (i.e., the BS energy consumption in (4)). Thus, the overall energy demand of the network at time slot is given as follows:
(5) 
The demand is random over time and depends entirely on the computational task load of the MEC servers.
II-B. Energy Generation Model
The energy supply of the self-powered wireless network with MEC capabilities relates to the network's own renewable (e.g., solar, wind, biofuels, etc.) sources as well as the main grid's non-renewable (e.g., diesel generator, coal power, and so on) energy sources [8, 41]. In this energy generation model, we consider a set of renewable energy sources of the network, with each element representing the set of renewable energy sources of BS . The amount of renewable energy generation is defined by . The total renewable energy generation at time is defined as . Further, the self-powered wireless network is able to obtain an additional non-renewable energy amount from the main grid at time . The per-unit renewable and non-renewable energy costs are defined by and , respectively. In general, the renewable energy cost depends only on the maintenance cost of the renewable energy sources [42, 43, 44]. Therefore, the per-unit non-renewable energy cost is greater than the renewable energy cost . Additionally, the surplus amount of energy at time can be stored in an energy storage medium for future use [43, 44], and the per-unit energy storage cost is denoted by .
II-B1. Non-Renewable Energy Generation Cost
In order to fulfill the energy demand when it is greater than the generated renewable energy , the main grid can provide an additional amount of energy from its non-renewable sources. Thus, the non-renewable energy generation cost of the network is determined as follows:
(6) 
where represents a unit energy cost.
II-B2. Surplus Energy Storage Cost
The surplus amount of energy is stored in a storage medium when (i.e., the energy demand is smaller than the renewable energy generation) at time . We consider a per-unit energy storage cost . This storage cost depends on the storage medium and the amount of energy stored at time [43, 45, 23, 46]. With the per-unit energy storage cost , the total storage cost at time is defined as follows:
(7) 
II-B3. Total Energy Generation Cost
The total energy generation cost includes the renewable, non-renewable, and storage energy costs. Naturally, this total energy generation cost depends on the energy demand of the network at time . Therefore, the total energy generation cost at time is defined as follows:
(8) 
where the energy costs of the renewable, non-renewable, and storage energy are given by , , and , respectively. In (8), the energy demand and the renewable energy generation are stochastic in nature. The energy costs of non-renewable energy (6) and storage energy (7) rely completely on the energy demand and the renewable energy generation . Hence, to address the uncertainty of both energy demand and renewable energy generation in a self-powered wireless network, we formulate a two-stage stochastic programming problem. In particular, the first stage makes an energy dispatch decision without knowing the actual demand of the network. Then we make further energy dispatch decisions by analyzing the uncertainty of the network demand in the second stage. A detailed discussion of the problem formulation is given in the following section.
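The cost structure of Eqs. (6)-(8) can be sketched in a few lines: the non-renewable cost covers the shortfall when demand exceeds renewable generation, and the storage cost covers the surplus in the opposite case. The parameter names (a_ren, b_nonren, c_store) are placeholders for the per-unit costs, since the original symbols were lost in rendering.

```python
def generation_cost(demand, renewable, a_ren, b_nonren, c_store):
    """Total per-slot energy generation cost in the spirit of Eq. (8).

    a_ren, b_nonren, c_store -- per-unit costs of renewable, non-renewable,
    and stored energy, with b_nonren > a_ren as stated in the model.
    """
    shortfall = max(0.0, demand - renewable)   # met by non-renewable energy, Eq. (6)
    surplus = max(0.0, renewable - demand)     # stored for future use, Eq. (7)
    return a_ren * renewable + b_nonren * shortfall + c_store * surplus
```

Only one of the shortfall and surplus terms is nonzero in any slot, which is exactly the piecewise structure exploited in Section III.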
III. Problem Formulation with a Two-Stage Stochastic Model
We now consider the case in which the non-renewable energy cost is greater than the renewable energy cost, which is often the case in a practical smart grid, as discussed in [42], [43], [44], and [47]. Here, and are continuous variables over the observational duration . The objective is to minimize the total energy consumption cost . is the decision variable, and the energy demand is a parameter. When the energy demand is known, the optimization problem will be:
(9) 
In problem (9), after removing the non-negativity constraints , we can rewrite the objective function in the form of piecewise linear functions as follows:
(10) 
where and determine the cost of the non-renewable (i.e., ) and storage (i.e., ) energy, respectively. Therefore, we have to choose one out of the two cases. In fact, if the energy demand is known and the amount of renewable energy is the same as the energy demand, then problem (10) provides the optimal decision that exactly meets the demand . However, the challenge here is to make a decision about the renewable energy usage before the demand becomes known. To overcome this challenge, we consider the energy demand as a random variable whose probability distribution can be estimated from the previous history of the energy demand. We can rewrite problem (9) using the expectation of the total cost as follows:(11)
The solution of problem (11) will provide an optimal result on average. However, in a practical scenario, we would need to solve problem (11) repeatedly over the uncertain energy demand . Thus, this solution approach is not effective when the energy demand generated by the BSs exhibits large (i.e., non-i.i.d.) variations over the observational period .
We consider the moments of the random variable , which has a finitely supported distribution and takes values with respective probabilities for the BSs. The cumulative distribution function (CDF) of the energy demand is a step function with a jump of size at each demand . Therefore, the probability distribution of each BS's energy demand belongs to the CDF of the historical observations of the energy demand . In this case, we can convert problem (11) into a deterministic optimization problem, where the expectation of the energy usage cost is determined by . Thus, we can rewrite problem (9) as a linear programming problem using the representation in (10) as follows:(12)
s.t.  (12a)  
(12b)  
(12c) 
For a fixed value of the renewable energy , problem (12) is equivalent to problem (10). Meanwhile, problem (12) is equal to . We have converted the piecewise linear function of problem (10) into the inequality constraints (12a) and (12b). We consider as the highest probability of energy demand at each BS . Therefore, for BSs, we define as the probability of energy demand with respect to BS . Thus, we can rewrite problem (11) for BSs as follows:
(13)  
(13a)  
(13b)  
(13c) 
where represents the highest probability (close to ) of the energy demand at BS , estimated from the quantile of the empirical CDF of the historical demand observations. For a fixed value of , this problem is almost separable. Hence, we can decompose problem (13) with the structure of a two-stage linear stochastic programming problem [48, 49]. To find an approximation for a random variable with a finite probability distribution, we decompose problem (13) into a two-stage linear stochastic program under uncertainty. The decision is made using historical energy demand data, which is fully independent of future observations. As a result, the first stage of the self-powered energy dispatch problem for sustainable edge computing is formulated as follows:
(14)  
s.t.  (14a) 
where determines the optimal value of the second-stage problem. In problem (14), the decision variable is calculated before the realization of the uncertain energy demand . Meanwhile, at the first stage of the formulated problem (14), the cost is minimized over the decision variable , which then allows us to estimate the expected energy cost for the second-stage decision. Constraint (14a) provides a bound on the maximum allowable renewable energy usage. Thus, based on the decision of the first-stage problem, the second-stage problem can be defined as follows:
(15)  
s.t.  (15a)  
(15b)  
(15c) 
In the second-stage problem , the decision variables and depend on the realization of the energy demand in the first-stage problem , where determines the amount of renewable energy usage at time . The first constraint is an equality constraint ensuring that the surplus amount of energy equals the absolute difference between the renewable and non-renewable energy usage. The second constraint is an inequality constraint that uses the optimal demand value from the first-stage realization. In particular, the value of the demand comes from , the historical observation of the energy demand. Finally, the constraint ensures the non-negativity of the non-renewable energy usage.
The formulated problems and can characterize the uncertainty between the network energy demand and the renewable energy generation. In particular, the second-stage problem contains the random demand , which makes the optimal cost a random variable. As a result, we can rewrite the problems and as one large linear programming problem for BSs as follows:
(16)  
s.t.  (16a)  
(16b)  
(16c)  
(16d) 
In problem , for BSs, the energy demand occurs with positive probabilities and . The decision variables are , , and , which represent the amounts of renewable, non-renewable, and storage energy, respectively. Constraint defines a relationship among all of the decision variables , , and . In essence, this constraint discretizes the surplus amount of energy for storage. Constraint ensures the utilization of non-renewable energy based on the energy demand of the network. Constraint ensures that the decision variable will not take a negative value. Finally, constraint restricts the renewable energy usage to its maximum capacity at time . Problem is an integrated form of the first-stage problem in and the second-stage problem in , where the solutions of and depend completely on the realization of the demand for all BSs. The decision of comes before the realization of the demand and, thus, the estimation of the renewable energy generation will be independent and random. Therefore, problem holds the property of relatively complete recourse. In problem , the number of variables and constraints is proportional to the number of BSs, . Additionally, the complexity of the decision problem leads to due to the combinatorial properties of the decisions and constraints [48, 49, 50].
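For a single BS with a finitely supported demand, the deterministic equivalent above reduces to a small scenario-based program that can be sketched directly. The sketch below assumes generic per-unit costs (a_ren, b_nonren, c_store) and scenario demands with probabilities; it exploits the fact that a piecewise-linear objective attains its minimum at a breakpoint, so only the scenario demands (and the renewable capacity) need to be tested.

```python
def two_stage_cost(x, demands, probs, a_ren, b_nonren, c_store):
    """First-stage renewable commitment x plus the probability-weighted
    recourse cost over the finitely supported demand scenarios."""
    recourse = sum(p * (b_nonren * max(0.0, d - x) + c_store * max(0.0, x - d))
                   for d, p in zip(demands, probs))
    return a_ren * x + recourse

def best_commitment(demands, probs, a_ren, b_nonren, c_store, cap):
    """Minimize the piecewise-linear objective over its breakpoints:
    the scenario demands within capacity, the capacity itself, and zero."""
    candidates = [d for d in demands if d <= cap] + [cap, 0.0]
    return min(candidates,
               key=lambda x: two_stage_cost(x, demands, probs,
                                            a_ren, b_nonren, c_store))
```

With expensive non-renewable energy and cheap storage, the best commitment leans toward the high-demand scenario, mirroring the recourse structure of the two-stage program.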
The goal of the self-powered energy dispatch problem is to find an optimal energy dispatch policy that includes the amounts of renewable , non-renewable , and storage energy of each BS while minimizing the energy consumption cost. Meanwhile, such an energy dispatch policy relies on an empirical probability distribution of the historical demand at each BS at time . In order to solve problem , we choose an approach that does not rely on the conservativeness of a theoretical probability distribution of the energy demand in problem , and that also captures the uncertainty of the renewable energy generation from historical data. The conservativeness of such a data-driven approach vanishes as the amount of historical data goes to infinity. Eventually, the non-i.i.d. energy demand and generation will also be captured at each BS when the time-variant features of both energy demand and generation are characterized by the Markovian properties of the historical data. To overcome these challenges, we propose a multi-agent meta-reinforcement learning framework that can explore the Markovian behavior of the historical energy demand and generation of each BS . Meanwhile, the meta-agent copes with such time-varying features to derive a globally optimal energy dispatch policy for each BS .
We design the MAMRL framework by converting the cost minimization problem into a reward maximization problem, which we then solve with a data-driven approach. In the MAMRL setting, each agent works as a local agent for each BS and determines an observation (i.e., exploration) for the decision variables: renewable , non-renewable , and storage energy. The goal of this exploration is to find time-varying features from the local historical data so that the energy demand of the network is satisfied. Furthermore, using these observations and current state information, a meta-agent is used to determine a stochastic energy dispatch policy. To obtain such a dispatch policy, the meta-agent only requires the observations (behavior) from each local agent. Then, the meta-agent can evaluate (exploit) this behavior toward an optimal decision for dispatching energy. Further, the MAMRL approach is capable of capturing the exploration-exploitation tradeoff in a way that lets the meta-agent optimize the decisions of each self-powered BS under uncertainty. A detailed discussion of the MAMRL framework is given in the following section.
IV. Energy Dispatch with Multi-Agent Meta-Reinforcement Learning Framework
In this section, we develop our proposed multi-agent meta-reinforcement learning framework (shown in Fig. 3) for energy dispatch in the considered network. The proposed MAMRL framework includes two types of agents: a local agent that acts as a local learner at each self-powered BS with MEC capabilities, and a meta-agent that learns the global energy dispatch policy. In particular, each local BS agent can discretize the Markovian dynamics of the energy demand-generation of each BS (i.e., both SBSs and MBS) separately by applying deep reinforcement learning. Meanwhile, we train a long short-term memory (LSTM) network [53, 54] as the meta-agent at the MBS, which optimizes [55] the accumulated energy dispatch of the local agents. As a result, the meta-agent can handle the non-i.i.d. energy demand-generation of each local agent with the LSTM's own state information. To this end, MAMRL mitigates the curse of dimensionality for the uncertainty of energy demand and generation while providing an energy dispatch solution with low computational and communication complexity (i.e., less message passing between the local agents and the meta-agent).
IV-A. Preliminary Setup
In the MAMRL setting, each BS acts as a local agent, and the number of local agents equals the number of BSs (i.e., MBS and SBSs) in the network. We define a set of state spaces and a set of actions for the agents. The state space of a local agent is defined by , where , , , and represent the amounts of energy demand, renewable generation, storage cost, and non-renewable energy cost, respectively, at time . We execute Algorithm 1 to generate the state space for every BS individually. In Algorithm 1, lines to calculate the individual energy consumption of the MEC computation and the network operation using (2) and (4), respectively. Overall, the energy demand of the BS is computed in line , and the self-powered energy generation is estimated in line of Algorithm 1. The non-renewable and storage energy costs are calculated in lines and for time slot . Finally, line creates the state space tuple (i.e., ) for time in Algorithm 1.
IV-B. Local Agent Design
Consider each local BS agent , which can take two types of actions: the amount of storage energy and the amount of non-renewable energy at time . Since the state and action both contain time-varying information of the agent , we consider Markovian dynamics and represent problem as a discounted reward maximization problem for each agent (i.e., each BS). Thus, the objective function of the discounted reward maximization problem of agent is defined as follows [51]:
$\max_{\pi_b} \; \mathbb{E}_{\pi_b}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t^b\right],$ (17)
where $\gamma \in [0,1)$ is a discount factor and each reward $r_t^b$ is considered as,
$r_t^b = \begin{cases} 1, & \text{if } \eta_t^b \geq 1, \\ \eta_t^b, & \text{otherwise}, \end{cases}$ (18)
In (18), $\eta_t^b$ denotes the ratio between renewable energy generation and energy demand (the supply-demand ratio) of BS agent $b$ at time $t$. When the renewable generation-demand ratio is larger than $1$, the BS agent achieves the full reward because the amount of renewable energy exceeds the demand, and the surplus can be stored in the storage unit.
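The reward structure above can be sketched in a few lines. This assumes the piecewise form reconstructed in (18), with a full reward of 1 once renewable generation covers demand and the supply-demand ratio itself otherwise; the function names are illustrative.

```python
def supply_demand_ratio(generation: float, demand: float) -> float:
    """Ratio eta between renewable generation and energy demand of a BS agent."""
    # With zero demand, any generation trivially covers it.
    return generation / demand if demand > 0 else 1.0

def reward(generation: float, demand: float) -> float:
    """Assumed piecewise reward of (18): saturates at 1 when supply meets demand,
    and degrades smoothly with the deficit otherwise."""
    eta = supply_demand_ratio(generation, demand)
    return 1.0 if eta >= 1.0 else eta
```

For example, a BS generating half of its demand would receive a reward of 0.5 under this assumed form.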
Each action of BS agent $b$ is chosen by a stochastic policy $\pi_{\theta_b}$, where $\theta_b$ is the parameter vector that defines the energy dispatch policy. The policy determines the transition to the next state $s_{t+1}^b$; thus, the state transitions of BS agent $b$ are governed by the environment dynamics together with the reward function $r_t^b$. As a result, for a given state $s_t^b$, the state value function with the cumulative discounted reward will be:
$V^{\pi_b}(s_t^b) = \mathbb{E}_{\pi_b}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k}^b \,\middle|\, s_t^b\right],$ (19)
where the discount factor $\gamma \in [0,1)$ ensures the convergence of the state value function over the infinite time horizon. Thus, for a given state $s_t^b$, the optimal policy for the next state can be determined by an optimal state value function while the Markovian property is imposed. Therefore, the optimal value function is given as follows:
$V^{*}(s_t^b) = \max_{a_t^b} \mathbb{E}\!\left[ r_t^b + \gamma V^{*}(s_{t+1}^b) \,\middle|\, s_t^b, a_t^b \right].$ (20)
In this setting, the energy dispatch policy is determined by choosing an action $a_t^b$, which can be seen as the actor of BS agent $b$, while the estimated value function plays the role of a critic. Thus, the critic criticizes the actions made by the actor using a temporal difference (TD) error [52], which determines the energy dispatch policy. The TD error serves as an advantage function, and the advantage function of agent $b$ is defined as follows:
$A(s_t^b, a_t^b) = r_t^b + \gamma V(s_{t+1}^b) - V(s_t^b).$ (21)
Thus, the policy gradient is determined as,
$\nabla_{\theta_b} J(\theta_b) = \mathbb{E}_{\pi_{\theta_b}}\!\left[ \nabla_{\theta_b} \log \pi_{\theta_b}(a_t^b \mid s_t^b) \, A(s_t^b, a_t^b) \right].$ (22)
Using (22), we can discretize the energy dispatch decision for each self-powered BS in the network. In fact, a centralized solution can be achieved when the state information (i.e., demand and generation) of all BSs is known. However, both the space complexity and the computational complexity of such a centralized solution grow with the number of BSs and the sizes of their state and action spaces [21]. Further, this solution does not resolve the exploration-exploitation dilemma, since the centralized (i.e., single-agent) method ignores the interactions and energy dispatch strategies of the other agents (i.e., BSs), which creates an imbalance between exploration and exploitation. Next, we propose an approach that not only reduces this complexity but also explores alternative energy dispatch decisions to achieve the highest expected reward in (17).
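The per-agent actor-critic update of (21)-(22) can be sketched as a tabular softmax actor with a TD critic. Everything here is illustrative: the tiny stand-in environment, the state/action counts, and the learning rates are assumptions, not the paper's setting; the TD error and log-policy gradient do follow the standard forms used in (21) and (22).

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9

theta = np.zeros((n_states, n_actions))  # actor: softmax policy parameters theta_b
V = np.zeros(n_states)                   # critic: state-value estimates

def policy(s):
    """Softmax energy-dispatch policy pi_theta(a | s)."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def step(s, a):
    """Toy stand-in for the BS environment: action 1 yields the higher reward."""
    return (s + 1) % n_states, (1.0 if a == 1 else 0.1)

alpha_v, alpha_pi = 0.1, 0.1
s = 0
for _ in range(2000):
    a = rng.choice(n_actions, p=policy(s))
    s_next, r = step(s, a)
    td = r + gamma * V[s_next] - V[s]    # advantage estimate, as in (21)
    V[s] += alpha_v * td                 # critic update from the TD error
    grad = -policy(s)                    # d/d theta of log softmax ...
    grad[a] += 1.0                       # ... for the taken action
    theta[s] += alpha_pi * td * grad     # policy-gradient ascent, as in (22)
    s = s_next
```

After training, the actor concentrates its probability mass on the higher-reward action in every state, which is exactly the behavior the TD-driven critic is meant to induce.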
IV-C Multi-Agent Meta-Reinforcement Learning Modeling
We consider a set of observations [56, 27] where, for a BS agent $b$, a single observation tuple for a given state consists of the next-state discounted reward, the current-state discounted reward, the next action, the current action, the time slot, and the TD error. Here, the complete information of an observation is correlated with the state space, while the observation does not require the complete state information of the previous states. Thus, both the space complexity for computation and the per-slot communication complexity of each agent $b$ remain small.
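The fixed-size observation that a local agent ships to the meta-agent can be made concrete with a small record type. The field names below are illustrative; the six quantities are exactly those listed above, which is why the per-slot communication payload stays constant.

```python
from collections import namedtuple

# One observation tuple sent from a local BS agent to the meta-agent.
# Field names are hypothetical; the six fields mirror the tuple in the text.
Observation = namedtuple(
    "Observation",
    ["reward_next", "reward_now",  # next-/current-state discounted rewards
     "action_next", "action_now",  # next/current energy-dispatch actions
     "t",                          # time slot index
     "td_error"])                  # temporal-difference (advantage) value

obs = Observation(reward_next=0.8, reward_now=0.5,
                  action_next=1, action_now=0, t=7, td_error=0.12)
```

Because the tuple has a fixed arity regardless of the episode length, each agent's message to the meta-agent is O(1) per time slot.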
In the MAMRL framework, the local agents act as optimizees and the meta-agent performs the role of the optimizer [55]. To model our meta-agent, we consider an LSTM architecture [53, 54] that stores its own state information (i.e., its parameters), while each local agent (i.e., optimizee) only provides the observation of its current state. In the proposed MAMRL framework, a policy is determined by updating the local agents' parameters $\theta_b$ under the guidance of the meta-agent. Accordingly, the state value function (20), the advantage (temporal difference) function (21), and the parameterized policy can all be expressed as functions of these parameters. Considering all of the BS agents, the advantage function is rewritten as,
$A(s_t, a_t) = \sum_{b \in \mathcal{B}} \left[ r_t^b + \gamma V(s_{t+1}^b; \theta_b) - V(s_t^b; \theta_b) \right],$ (23)
where $\mathcal{B}$ denotes the set of all BS agents (i.e., the MBS and SBSs).
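The optimizer role of the meta-agent can be sketched with a minimal LSTM cell: it consumes a local agent's feedback (e.g., a gradient or TD error) and emits a proposed parameter update, carrying its own hidden and cell state across steps in the spirit of learning-to-optimize [55]. All sizes, weight initializations, and the toy quadratic loss below are illustrative assumptions; in MAMRL the LSTM weights themselves would be meta-trained on the agents' observations rather than used untrained as here.

```python
import numpy as np

class MetaAgentLSTM:
    """Minimal LSTM cell standing in for the meta-agent (optimizer).
    It keeps its own (hidden, cell) state, so local agents need only send
    their current observation/gradient, not their history."""

    def __init__(self, hidden=8, seed=0):
        rng = np.random.default_rng(seed)
        self.h = hidden
        # One stacked weight matrix for the four gates (input, forget, cell, output).
        self.W = rng.normal(0.0, 0.1, (4 * hidden, 1 + hidden))
        self.b = np.zeros(4 * hidden)
        self.w_out = rng.normal(0.0, 0.1, hidden)  # projects h -> scalar update

    @staticmethod
    def _sig(x):
        return 1.0 / (1.0 + np.exp(-x))

    def step(self, grad, state):
        h, c = state
        z = self.W @ np.concatenate(([grad], h)) + self.b
        i, f, g, o = np.split(z, 4)                 # gate pre-activations
        c = self._sig(f) * c + self._sig(i) * np.tanh(g)
        h = self._sig(o) * np.tanh(c)
        return self.w_out @ h, (h, c)               # proposed update, new own-state

meta = MetaAgentLSTM()
state = (np.zeros(meta.h), np.zeros(meta.h))        # meta-agent's own memory
theta = 5.0                                         # one local agent's parameter
for _ in range(3):
    grad = 2.0 * theta                              # gradient of a toy loss theta^2
    update, state = meta.step(grad, state)          # optimizer proposes the step
    theta += update
```

The key structural point, mirroring the text, is that the per-agent history lives in the meta-agent's own `(h, c)` state, so each interaction only exchanges a constant-size message.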