I Introduction
Microgrid networks are collections of small-scale renewable energy resources that fulfil local consumer demands. They may function independently or in collaboration with other microgrids by trading energy. The main advantage of microgrids is their ability to decentralize power distribution from the central grid, thus providing a more efficient architecture for energy distribution by targeting smaller areas and serving as reliable power sources when the central grid has a deficiency. In addition, they reduce the losses incurred by long-distance energy transmission and prove to be a more cost-effective and eco-friendly alternative to traditional resources such as fossil fuels, which cause more pollution and are depleting at an alarming rate. Their main tasks include local power generation, energy storage, trading power with other microgrids and satisfying local consumer demands.
On the demand side, customers have certain flexible demands that can be satisfied at any time within a given time window during the day. For example, if a washing machine is to be operated at some point between 2 pm and 6 pm in a particular household, the microgrid can intelligently provide the energy required to run it at any time during this period. These demands are classified as Activities of Daily Living (ADL). Each microgrid can schedule these ADL demands depending on the peak demand as well as the local energy generation. ADL scheduling does not reduce power consumption; it merely helps in reducing the peak load at any time instance.
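The point that ADL scheduling lowers the peak load without changing total consumption can be illustrated with a minimal sketch. The load values, the window and the function names below are hypothetical and chosen only for illustration; they are not taken from the experiments.

```python
def peak_after_scheduling(base_load, flexible_units, slot):
    """Peak of the load profile once the flexible job is placed in `slot`."""
    profile = list(base_load)
    profile[slot] += flexible_units
    return max(profile)


def least_peak_slot(base_load, flexible_units, window):
    """Slot inside the allowed window that gives the smallest peak load."""
    return min(window, key=lambda s: peak_after_scheduling(base_load, flexible_units, s))


# Hypothetical numbers, chosen only for illustration.
base_load = [3, 6, 4, 2]   # inflexible demand per time slot
flexible_units = 2         # energy needed by one ADL job
window = [1, 2, 3]         # slots in which the job is allowed to run

naive_peak = peak_after_scheduling(base_load, flexible_units, window[0])
chosen_slot = least_peak_slot(base_load, flexible_units, window)
shifted_peak = peak_after_scheduling(base_load, flexible_units, chosen_slot)
total_energy = sum(base_load) + flexible_units  # the same for every choice of slot
```

Running the job in the first allowed slot stacks it on top of the existing peak, whereas deferring it within the window lowers the peak while the total energy consumed stays the same.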
Energy trading plays a vital role in the decentralization of power generation and in maintaining stability at the microgrid sites. It involves buying and selling power among neighboring microgrids at favorable prices. The main focus of our paper is to highlight the advantages of using a dynamic pricing scheme (a scheme that allows the microgrids to select the prices at which they sell power) in tandem with ADL scheduling. The dynamic pricing scheme not only gives the microgrids autonomy in selecting prices according to their convenience (based on their current state), but also encourages energy trading amongst them. This promotes more cooperation amongst microgrids, thereby lowering their dependency on the central grid for fulfilling local energy requirements. This in turn enables better decentralization and, as our results show, helps individual microgrids obtain higher rewards than if they were to follow a constant pricing policy. ADL scheduling not only helps reduce the peak load, it also allows the microgrid to intelligently defer certain loads while selling the remaining energy in order to optimize the overall reward it receives.
The literature on energy trading between microgrid networks is vast. The problem has been considered primarily from three different points of view. In [lee2015distributed, wang2016incentivizing, li2017risk], game theoretic models have been proposed along with the equilibrium analysis of solutions. In [chaouachi2012multiobjective, nunna2013energy, gregoratti2014distributed, shi2014distributed, liu2018dynamic], energy trading has been formulated as an optimization problem, and models such as convex programming and linear programming have been used to compute optimal solutions. The third popular framework for energy trading in microgrids is Reinforcement Learning (RL). RL is a popular paradigm that provides learning algorithms for computing the solution when the model information is not known. We now discuss some of the works that propose RL algorithms to solve the energy trading problem among microgrids. In [xiao2017energy], an energy trading game using RL techniques has been proposed. In their model, each microgrid, based on its current state configuration, computes the amount of energy to be traded with neighboring microgrids in order to maximize its rewards. However, the prices in this model are the market prices and are not dynamic. In [kim2015dynamic, lu2018dynamic], a dynamic pricing problem for a single microgrid is considered. Based on the consumption pattern of its customers, the microgrid decides the price of the power to be sold to its customers. In [wang2016reinforcement], a novel energy trading model for microgrid networks is proposed that considers dynamic pricing. However, the dynamic scheduling of customer demand is not considered. Deep Reinforcement Learning algorithms have been successfully applied for computing optimal solutions in the context of energy trading between microgrids in [xiao2018reinforcement, chen2018local], for storage device management in [franccois2016deep], and for energy management in [lu2019incentive, ji2019real]. The closest work to ours is [diddigi2017unified], where an energy trading model for a microgrid network has been proposed that also considers job scheduling for customers. We extend this model considerably to include dynamic pricing for transactions between microgrids and apply the independent learners Deep Q-learning algorithm, which has been shown to have good empirical performance in the literature [tampuu2017multiagent]. In [vazquez2019reinforcement], an extensive survey of RL algorithms for demand response is carried out. The authors also identify the need for RL algorithms to consider demand response in multi-agent scenarios with demand-dependent dynamic prices. Our work is a step in this direction.
Our main contributions in this paper are as follows:

We construct a Multi-Agent Reinforcement Learning framework that addresses the supply-side management problems of dynamic pricing and battery scheduling as well as the demand-side management problem of scheduling ADL jobs.

To the best of our knowledge, ours is the first work that uses a novel DQN approach to solve both these problems by creating two separate neural networks (for handling the tasks of stochastic job scheduling as well as energy trading) both working as ingredients to the same Markov Decision Process.

We perform experiments to show that the reward obtained by a microgrid is lower if it employs a constant pricing policy instead of a dynamic pricing policy as the latter ensures better participation in energy trading.

We empirically show that the proposed dynamic pricing setup ensures more reward for most of the participating microgrids.

Based on the results of our experiments, we also provide a detailed analysis of the behaviour of microgrids under various setups.
II Problem Formulation
In this section, we describe the model of the microgrids that enables energy trading and job scheduling. Our solution is based on a framework that consists of independent microgrids, interconnected by multiple transmission lines, in the presence of a central grid. Each microgrid has the ability to locally generate renewable energy and also has the provision of storing energy in a battery unit. We divide each day into several time steps of equal duration, for better granularity of the decision-making process. At each time step, the microgrids have information about their current local demand, the renewable energy generated, the amount of energy stored in the battery as well as the remaining ADL demands that are to be fulfilled in that day. Depending on this information, the microgrids make decisions regarding their demand and supply management at regular time intervals. These decisions are as follows:

The scheduling and fulfillment of ADL demands.

The fulfillment of non-ADL demands.

The amount of electricity to buy or sell, and also the price at which to sell electricity.

The amount of energy to be stored in the battery.
We formulate this problem in the framework of stochastic games. A stochastic game is a popular framework for modeling competing or cooperative agents in a stochastic environment [bowling2000analysis]. The main ingredient of a stochastic game is the tuple $(n, S, A, P, r, \gamma)$, where $n$ is the number of agents, $S = S^1 \times \dots \times S^n$ denotes the joint state space where $S^i$ is the state space of agent $i$, and $A = A^1 \times \dots \times A^n$ denotes the joint action space with $A^i$ representing the action space of agent $i$. $P$ is the probability transition rule that gives the probability of moving to the next state $s'$ when the action tuple $a = (a^1, \dots, a^n)$ is taken in state $s$. Note that in our model, agent $i$ can only observe its own state $s^i$ and picks an action $a^i \in A^i$. Finally, $r^i$ is the single-stage reward function of agent $i$ that gives the reward value obtained when the joint action tuple $a$ is taken in state $s$, and $\gamma$ is the discount factor. The objective of agent $i$ is to compute a policy $\pi^{i*}$ that maximizes its total discounted reward, given the optimal policies $\pi^{-i*}$ of the other agents. That is,
$$\pi^{i*} = \arg\max_{\pi^i \in \Pi^i} E\left[\sum_{t=0}^{\infty} \gamma^t \, r^i\big(s_t, \pi^i(s^i_t), \pi^{-i*}(s^{-i}_t)\big) \,\Big|\, s_0 \sim d_0\right], \tag{1}$$
where $\Pi^i$ is the set of all policies of agent $i$ and $E$ is the expectation over states with the initial state $s_0$ sampled from a known distribution $d_0$.
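The quantity each agent maximizes, the total discounted reward, can be made concrete with a small sketch. The function name and the numbers are illustrative assumptions; the sketch evaluates a finite reward sequence rather than the infinite-horizon expectation.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t], accumulated from the last step backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three single-stage rewards of 1 with discount factor 0.5: 1 + 0.5 + 0.25.
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

A smaller discount factor makes the agent weigh immediate rewards more heavily than rewards obtained later in the day.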
We now describe in detail the states, actions and singlestage rewards of each of the microgrids.
II-A States
The state of the microgrid at time $t$ is given by $s_t = (t, nd_t, d_t, adl_t, gp_t)$, where:

Time state ($t$): This is the time interval of the day at which the decision is taken by the microgrid.

The Net Energy ($nd_t$): This is the sum of the battery value ($b_t$) and the generated renewable energy ($g_t$), minus the non-ADL demand ($d_t$). Thus $nd_t = b_t + g_t - d_t$.

Non-ADL Demand ($d_t$): This signifies the local consumer demand pertaining to that microgrid. It is provided in addition to the net energy so that the agent is able to estimate the cumulative sum of the battery and the renewable generation at that time step.

ADL state ($adl_t$): This state component jointly signifies the ADL loads that are remaining and the ADL loads that have been completed by the ADL agent up to the current time step.

Grid Price ($gp_t$): This is the price at which a microgrid buys power from the central grid at that time step. (Note that when the microgrids sell power to the central grid, the selling price is $gp_t - k$, where $k$ is a positive integer. The reason for this is made clear in the 'Design Constraints' section.)
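The state components listed above can be collected in a small container. This is a minimal sketch: the field names, the integer types and the encoding of the ADL state as a tuple of pending/completed flags are assumptions made for illustration, not the representation used in the released code.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MicrogridState:
    time_step: int                 # decision interval of the day
    net_energy: int                # battery + renewable generation - non-ADL demand
    non_adl_demand: int            # instantaneous local consumer demand
    adl_state: Tuple[int, ...]     # assumed encoding: 1 = job still pending, 0 = completed
    grid_price: int                # price for buying from the central grid


def make_state(time_step, battery, renewable, non_adl_demand, adl_state, grid_price):
    """Build the state, computing net energy from its three ingredients."""
    net_energy = battery + renewable - non_adl_demand
    return MicrogridState(time_step, net_energy, non_adl_demand, tuple(adl_state), grid_price)


state = make_state(time_step=1, battery=4, renewable=3, non_adl_demand=5,
                   adl_state=[1, 0, 1], grid_price=20)
# Because the demand is part of the state, battery + renewable can be recovered.
available = state.net_energy + state.non_adl_demand
```

Keeping the non-ADL demand alongside the net energy is what lets the agent recover the combined battery and renewable energy, as the description of the demand component notes.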
II-B Actions
As discussed above, the actions of the agents involve deciding the ADL demands to be scheduled and the amount of energy to be sold/bought. Additionally, we integrate the pricing model where the microgrids also decide the price at which energy trading takes place. In particular, the actions of the microgrid are as follows:

ADL action ($a_t$): This signifies the ADL demands that a microgrid plans to fulfil in the current time step.

Electricity to be traded ($u_t$): This denotes the amount of power that the microgrid decides to trade with other microgrids as well as the central grid. It is governed by a set of constraints that are derived from the net energy, the non-ADL demand, the ADL action and the battery capacity to ensure that the microgrid remains stable. A negative value of $u_t$ signifies that the microgrid is buying electricity, whereas a positive value of $u_t$ signifies that the microgrid is selling electricity.

Price Chosen ($p_t$): This signifies the price chosen by the microgrid at which power is sold. Sellers quote a price while buyers are assumed to adhere to the price determined by the sellers.
After the microgrids select their respective actions, they are divided into two groups, namely buyer microgrids and seller microgrids. A microgrid is classified as a buyer if the amount of electricity it selected to trade is negative, and as a seller if that amount is positive.
Once the microgrids are divided into groups of buyers and sellers, energy trading happens in the following way. First, the microgrid from the seller group that quotes the lowest price is selected (let us call it the leader). The amount of energy that the leader microgrid is willing to sell is shared amongst the buyer microgrids in proportion to the energy they demand. This ensures that there is no bias amongst buyer microgrids. Once the leader microgrid has sold all of its energy, a new leader is chosen, i.e., the one quoting the next lowest price. This chain continues until there are no seller microgrids or no buyer microgrids left in the process.
If certain demands of the buyer microgrids remain unfulfilled even after these transactions, the remaining amount of energy is bought from the central grid at the grid price. Conversely, if all the demands of the buyer microgrids are satisfied, the seller microgrids sell their remaining energy to the central grid at the grid price minus $k$.
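The trading procedure described above can be sketched as a small market-clearing routine. This is an illustrative sketch rather than the implementation in the released code: the function name, the dictionary-based interface, the tie-breaking between equal prices and the example numbers are all assumptions.

```python
def clear_market(trades, prices, grid_price, k):
    """Return the cash flow of every microgrid (positive = earnings, negative = cost).

    trades: name -> units traded (positive = selling, negative = buying)
    prices: name -> price quoted by each seller
    """
    eps = 1e-9
    supply = {m: u for m, u in trades.items() if u > 0}
    demand = {m: -u for m, u in trades.items() if u < 0}
    cash = {m: 0.0 for m in trades}

    # Sellers take the lead in order of increasing quoted price.
    for seller in sorted(supply, key=lambda m: prices[m]):
        total_demand = sum(demand.values())
        if total_demand <= eps:
            break
        sold = min(supply[seller], total_demand)
        # The leader's energy is shared in proportion to each buyer's demand.
        for buyer in demand:
            share = sold * demand[buyer] / total_demand
            demand[buyer] -= share
            cash[buyer] -= share * prices[seller]
        cash[seller] += sold * prices[seller]
        supply[seller] -= sold

    # Leftover demand is bought from the central grid at the grid price.
    for buyer, units in demand.items():
        if units > eps:
            cash[buyer] -= units * grid_price
    # Leftover supply is sold to the central grid at the grid price minus k.
    for seller, units in supply.items():
        if units > eps:
            cash[seller] += units * (grid_price - k)
    return cash


# Hypothetical example: one seller undercuts the other by one price unit.
result = clear_market(
    trades={"dynamic": 4, "constant": 4, "buyer": -6},
    prices={"dynamic": 19, "constant": 20},
    grid_price=20,
    k=5,
)
```

In this example the seller quoting 19 sells all of its energy to the buyer, while the seller quoting 20 sells only the residual demand and must offload the rest to the central grid at the lower price, so the cheaper seller ends up with the larger cash flow.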
II-C Reward Function
The goal of each microgrid is to obtain adequate profits by selling electricity while satisfying local consumer demand, which consists of both ADL and non-ADL demand. The reward takes care of both of these conditions: it gives positive reinforcement to the agents when electricity is sold, charges the microgrid when electricity is bought, and penalises it when the instantaneous local consumer demand (non-ADL demand) or the ADL demand is not met. (Note that the reward function does not take into account the profits obtained by selling electricity to the local customers.)
Here $k_1$ is a positive constant that weights the penalty for unmet demand. Changing the value of $k_1$ leads to the microgrids exhibiting different behaviors. When the gain from selling energy is much larger than $k_1$, the microgrids favour selling energy over satisfying their local consumer demands (both non-ADL and ADL demand). Conversely, when $k_1$ is much larger, the microgrids prefer to satisfy their local consumer demands (both non-ADL and ADL) over selling energy. This can be explained as follows: each microgrid is tasked with optimising its reward function. Changing $k_1$ changes the weights given to selling energy and to satisfying local consumer demand, which in turn changes the reward function. To emulate a real-world scenario, we have given higher importance to satisfying local consumer demand than to selling energy. In our experiments, $k_1$ is set to 30.
If the amount of electricity traded is positive, the microgrid is selling electricity and hence receives a profit. If it is negative, the microgrid is buying energy and hence incurs a cost. As $k_1$ is a positive constant, the microgrid is penalised for not satisfying the non-ADL as well as the ADL demand. If the local consumer demands are met, the microgrid receives no penalty.
Note that the reward of each microgrid depends not only on its own action, but also on the action of other microgrids (as the energy being traded by other microgrids as well as the price they quote implicitly affects the reward of that microgrid) and hence, this structure induces a stochastic game amongst the microgrids.
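The qualitative behaviour of the reward described in this section can be sketched as follows. The linear penalty form, the function name and the argument names are assumptions made purely for illustration; this is not the paper's exact expression, and the default weight of 30 is assumed to correspond to the penalty constant used in the experiments.

```python
def sketch_reward(cash_flow, unmet_non_adl, unmet_adl, penalty_weight=30):
    """Illustrative reward: trading cash flow minus a weighted penalty for unmet demand.

    cash_flow: money earned (positive) or spent (negative) through energy trading
    unmet_non_adl, unmet_adl: units of demand left unsatisfied in this time step
    penalty_weight: assumed to play the role of the positive penalty constant
    """
    return cash_flow - penalty_weight * (unmet_non_adl + unmet_adl)


selling_and_serving = sketch_reward(cash_flow=76.0, unmet_non_adl=0, unmet_adl=0)
buying_with_shortfall = sketch_reward(cash_flow=-40.0, unmet_non_adl=1, unmet_adl=0)
```

With a large penalty weight, leaving even one unit of demand unmet costs more than most trades can earn, which matches the stated preference for satisfying local demand over selling energy.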
III Proposed Algorithm
To fulfill the demand-side management tasks as well as the supply-side tasks, each microgrid employs two agents. The first agent (also called the ADL agent) is responsible for the demand-side management. It decides which ADL tasks are scheduled in the current time step, and this information is then provided to the second agent. The second agent (also called the Energy Trading (ET) agent) is responsible for the supply-side management. It decides the units of electricity to buy or sell, and also sets the transaction prices, i.e., the prices at which the energy trading happens.
Based on the actions taken by the ADL and ET agents, a common reward is obtained by both the agents. This can be justified by the reasoning that both the agents are cohesively working in order to fulfill a common higher goal. Hence, the same credit will be assigned to both of these agents. Due to the interplay between the ADL and ET agents, a single MDP is created which models the state transitions, action selection as well as the reward computation for both the agents. This interplay is shown in Figure 1.
The advantages of using two separate networks that share the same reward are as follows: (a) By creating two networks that perform two different tasks in service of a common goal, we have devised a method to model the execution of sequential tasks using RL. Moreover, by propagating the same reward to both networks, we have empirically shown that sharing the same reward for modeling sequential tasks does lead to network learning. Such a sequential learning approach can be used in many real-world fields such as robotics and auctions. (b) By creating two networks instead of one very large network, we reduce the number of iterations needed to obtain the optimal policy, as this enables better exploration of the action space. (c) The interplay between the networks as shown in the paper is also novel, to the best of our knowledge.
Note that the ADL agent and the ET agent have the same state space except for one parameter. The ADL agent has a parameter known as the ADL state (which signifies which ADL demands remain to be fulfilled). In place of this parameter, the ET agent has a parameter known as the ADL action (the action chosen by the ADL agent). Hence the replay buffers of the two agents are similar. By sharing a similar state space and the same reward, the agents cooperate: since both the ADL and ET agents have to maximize their rewards, they implicitly cooperate to obtain an optimal policy.
To optimise the long-term discounted rewards obtained by each microgrid, each agent uses the Deep Q-learning algorithm [mnih2013playing]. This is described in detail in Algorithm 1.
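The interplay between the two agents, where the ADL action becomes part of the ET agent's observation and both agents are updated with the same reward, can be sketched in a few lines. This sketch deliberately swaps the deep Q-networks for lookup tables and treats every decision as a one-step episode; the class, the function names and the toy reward are assumptions for illustration and do not reproduce Algorithm 1.

```python
import random
from collections import defaultdict


class TabularQAgent:
    """Tabular stand-in for one Q-network: same update target, table instead of a network."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.q = defaultdict(float)
        self.rng = random.Random(seed)

    def act(self, obs):
        # Epsilon-greedy action selection.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(obs, a)])

    def update(self, obs, action, reward, next_obs=None):
        # Q-learning target; next_obs=None marks a terminal transition.
        future = 0.0 if next_obs is None else max(self.q[(next_obs, a)] for a in self.actions)
        target = reward + self.gamma * future
        self.q[(obs, action)] += self.alpha * (target - self.q[(obs, action)])


def joint_step(adl_agent, et_agent, common_obs, adl_state, reward_fn):
    """One decision step: ADL agent acts first, ET agent sees the ADL action."""
    adl_obs = common_obs + (adl_state,)
    adl_action = adl_agent.act(adl_obs)
    et_obs = common_obs + (adl_action,)      # ADL state replaced by ADL action
    et_action = et_agent.act(et_obs)
    reward = reward_fn(adl_action, et_action)
    adl_agent.update(adl_obs, adl_action, reward)   # the same reward goes to both agents
    et_agent.update(et_obs, et_action, reward)
    return adl_action, et_action, reward


def toy_reward(adl_action, et_action):
    """Hypothetical reward: only one joint choice pays off."""
    return 1.0 if (adl_action, et_action) == (1, 2) else 0.0


adl_agent = TabularQAgent(actions=[0, 1], epsilon=0.5, seed=0)
et_agent = TabularQAgent(actions=[0, 1, 2], epsilon=0.5, seed=1)
for _ in range(2000):
    joint_step(adl_agent, et_agent, common_obs=(0,), adl_state=0, reward_fn=toy_reward)
```

Because the reward depends on both choices, neither agent can be credited separately, which is the motivation given above for propagating one common reward to both networks.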
IV Design Constraints
In this section, we describe the constraints that we impose on the proposed model.
IV-A Price Constraints
To ensure that transactions occur between microgrids, the microgrids are allowed to sell energy within a price range of $[gp_t - k, gp_t]$ (where $gp_t$ is the central grid price and $k$ is a positive constant). This can be justified as follows. If a microgrid quotes a price higher than that quoted by the central grid, the transactions would not occur, as the other microgrids would prefer to buy directly from the central grid.
Moreover, to ensure that energy trading occurs between microgrids, they are allowed to sell power to the central grid only at the least price a microgrid can quote, i.e., $gp_t - k$. This ensures that a microgrid prefers to sell to another microgrid rather than to the central grid. Next, we describe the constraints that are required to maintain stability at the microgrids.
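A short sketch of the price constraints follows. It assumes that prices are integers and that both ends of the range are allowed; the function names are hypothetical.

```python
def price_actions(grid_price, k):
    """Assumed discrete price choices: every integer from grid_price - k up to grid_price."""
    return list(range(grid_price - k, grid_price + 1))


def central_grid_buyback_price(grid_price, k):
    """Price paid by the central grid: the lowest price a microgrid may quote."""
    return grid_price - k


choices = price_actions(grid_price=20, k=5)
buyback = central_grid_buyback_price(grid_price=20, k=5)
```

Since the buyback price never exceeds any quotable price, selling to another microgrid is never worse than selling to the central grid, which is the incentive the constraint is meant to create.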
IV-B Energy Trading Constraints
To emulate a real-world scenario, it is imperative that real-world constraints are imposed on the amount of energy bought or sold. These constraints depend on physical limitations such as the maximum battery capacity and the maximum energy that can be handled by each microgrid. The constraints are imposed as follows:
a. Lower bound on the amount of electricity traded:
$$u_t \ge \max\big(-M,\; nd_t - e(a_t) - B\big) \tag{2}$$
Here $u_t$ is the amount of electricity traded (negative when buying), $nd_t$ is the net energy, $e(a_t)$ is the energy required by the selected ADL action, $B$ is the maximum battery capacity and $M$ is the maximum amount of electricity that a microgrid may buy in one time step. The first term states that a microgrid cannot buy more than $M$ units of electricity, which prevents the microgrid circuits from being overloaded by the inflow of excess energy. Thus $u_t \ge -M$.
After each transaction, the amount of energy stored in the battery of each microgrid (after factoring in the energy generated, the non-ADL demand, the ADL action selected and the energy present in the battery prior to the transaction) must be less than or equal to the maximum battery capacity, which prevents the microgrids from buying excess energy and then wasting it. Thus,
$$nd_t - e(a_t) - u_t \le B \tag{3}$$
$$u_t \ge nd_t - e(a_t) - B \tag{4}$$
where $e(a_t)$ represents the units of energy that are required to fulfill the selected ADL action.
The second term in the max function in (2) ensures that the ADL action selected by the ADL network is fulfilled.
The maximum of the above two terms is taken to allow the microgrid to trade the maximum energy possible whilst fulfilling the decided ADL actions and taking the microgrid stability into consideration.
b. Upper bound on the amount of electricity traded:
$$u_t \le nd_t - e(a_t) \tag{5}$$
The upper bound is derived from the fact that once an ADL action has been chosen, it has to be satisfied by the microgrid. Thus, the amount of energy that the microgrid possesses after trading should be greater than or equal to $e(a_t)$.
Thus,
$$nd_t - u_t \ge e(a_t) \tag{6}$$
$$u_t \le nd_t - e(a_t) \tag{7}$$
After the transactions are completed, the excess energy that remains is stored in the battery for future use. The battery state ($b_{t+1}$) is updated as follows:
$$b_{t+1} = nd_t - e(a_t) - u_t \tag{8}$$
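The two bounds and the battery update can be checked numerically with a small sketch. It assumes that the lower bound is the larger of the purchase limit and the battery-capacity limit, that the upper bound is the net energy left after serving the selected ADL action, and that the battery stores whatever remains after trading. All names and numbers are hypothetical.

```python
def trade_bounds(net_energy, adl_energy, battery_capacity, max_purchase):
    """Feasible range for the traded amount (negative = buying, positive = selling)."""
    lower = max(-max_purchase, net_energy - adl_energy - battery_capacity)
    upper = net_energy - adl_energy
    return lower, upper


def next_battery(net_energy, adl_energy, traded):
    """Energy left for the battery after serving the ADL action and trading."""
    return net_energy - adl_energy - traded


# Hypothetical values: 6 units of net energy, an ADL action needing 2 units,
# a battery capacity of 8 units and a purchase limit of 5 units.
lower, upper = trade_bounds(net_energy=6, adl_energy=2, battery_capacity=8, max_purchase=5)
battery_if_max_buy = next_battery(6, 2, lower)    # buying as much as allowed
battery_if_max_sell = next_battery(6, 2, upper)   # selling as much as allowed
```

Buying at the lower bound fills the battery exactly to capacity, and selling at the upper bound empties it, so every trade inside the range keeps the battery level between zero and its capacity. When the lower bound exceeds the upper bound, no trade satisfies both limits; this sketch does not cover how that case is resolved.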
V Simulation Setup
In this section, we describe the simulation setups for our experiments and appropriate models used for comparison purposes. The microgrids used in the experiments either use wind or solar renewable energies as their source. In order to simulate the renewable energy generation for all our experiments, we use the RAPsim software [rapsim].
For comparison purposes, we also implement the constant pricing model described below:

Constant Price Model: The microgrids considered in this case sell energy at the constant grid price decided by the central grid. However, as highlighted in the price constraints, the energy is sold to the central grid at the grid price minus $k$. Please note that this model is currently being utilised in some power markets, where the price of the transaction is decided entirely by the central grid.
We implement our proposed dynamic pricing model and the constant pricing model on the following three setups:

Setup 1: We first consider a simple three-microgrid setup where two microgrids operate on solar while the third operates on the wind renewable source. Moreover, two microgrids adopt the proposed dynamic pricing scheme while the third microgrid employs the constant pricing scheme. The objective of this setup is to understand the dynamics of energy transactions between the three agents and to demonstrate the advantage of dynamic pricing over constant pricing.

Setup 2: Next we consider a more practical setup with 8 microgrids: four generate energy via solar farms and four via wind farms. In this setup, all microgrids generate less power than their demand at most times. We have run this setup under both models: the proposed dynamic pricing model and the constant pricing model.

Setup 3: This setup is similar to setup 2, the main difference being that the total renewable energy generated by the microgrids is generally higher than in setup 2, while the demands are kept the same. This ensures that the microgrids have more energy to sell than in setup 2. We consider eight microgrids, two of them operating on solar and six on wind renewable sources. Such a configuration is considered to simulate the case where the majority of the microgrids generate more electricity than the microgrids of setup 2 without violating the stability constraints.
As mentioned above, we use the RAPsim simulator to generate the necessary per-hour renewable energy data for all the microgrids. We then fit a Poisson distribution to this data and sample renewable energy units from this distribution during our experiments. We limit the maximum amount of electricity that can be generated from renewable sources to  units and consider four decision time intervals in each day. At each epoch, the non-ADL demand can be one of the four units: , , or . We consider three ADL jobs at the start of the day. (Note that we have also carried out experiments under the stochastic ADL setting, where ADL demands appear in stochastic fashion; please refer to https://github.com/marlsmartgrids/energytrading/blob/master/Supplementary.pdf for additional results.) The maximum amount of energy that can be stored in the battery is limited to  units. Similarly, the maximum amount of energy that can be bought from other microgrids or the central grid in a single time period is also limited to  units. We consider a constant central grid price of 20 (price units per electricity unit) for our experiments. Recall that, in order to ensure cooperation among microgrids, the selling price to the central grid is fixed at the grid price minus $k$. In our experiments, we set the value of $k$ to be 5. Therefore, the action space for the dynamic pricing strategy is  price units. Both the ADL and ET agents of the microgrid use a feed-forward neural network model with three layers. The complete configuration of all the microgrids along with the detailed description of the neural network and the code for all our experiments is available at the anonymous GitHub link:
https://github.com/marlsmartgrids/energytrading.
VI Results and Discussions
In this section, we discuss the results of our experiments. We present and discuss them under three setups as defined above.
Table I: Rewards obtained by each microgrid in Setup 2 under the Dynamic Pricing Policy (DPP) and the Constant Pricing Policy (CPP).

| Microgrid | Reward under DPP | Reward under CPP | Difference in Rewards | Winning Policy |
|---|---|---|---|---|
| 1 | 13.094589 | 13.099846 | 0.005257 | DPP |
| 2 | 16.335466 | 16.901891 | 0.566425 | DPP |
| 3 | 37.476378 | 37.532283 | 0.055905 | DPP |
| 4 | 54.792268 | 54.914326 | 0.122058 | DPP |
| 5 | 47.218179 | 47.913392 | 0.695213 | DPP |
| 6 | 38.744498 | 39.517351 | 0.772853 | DPP |
| 7 | 55.353694 | 54.785216 | 0.568478 | CPP |
| 8 | 64.276123 | 63.844191 | 0.431932 | CPP |

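Table II: Rewards obtained by each microgrid in Setup 3 under the Dynamic Pricing Policy (DPP) and the Constant Pricing Policy (CPP).

| Microgrid | Reward under DPP | Reward under CPP | Difference in Rewards | Winning Policy |
|---|---|---|---|---|
| 1 | 8.83435 | 10.03937 | 1.20502 | DPP |
| 2 | 1.47279 | 1.03270 | 0.44006 | CPP |
| 3 | 9.11136 | 9.75535 | 0.64399 | CPP |
| 4 | 14.65479 | 14.26748 | 0.38731 | DPP |
| 5 | 37.28861 | 37.72918 | 0.44057 | DPP |
| 6 | 47.02739 | 47.52489 | 0.49750 | DPP |
| 7 | 39.48666 | 42.19527 | 2.70861 | DPP |
| 8 | 16.90946 | 17.86470 | 0.95524 | DPP |
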
Setup 1: From the results of the experiments, we make the following conclusions (due to space constraints, not all the results of the experiments are included here; the full set is available at https://github.com/marlsmartgrids/energytrading/blob/master/Supplementary.pdf):

All the 3 microgrids converge to a policy that gives higher rewards than random exploration.

We can see from Figure 4 that the agent which follows dynamic pricing (let us call it the dynamic grid) obtains a higher profit than the microgrid which sells at a constant price (the constant grid) when there is sufficient power generation. This may be counterintuitive at first sight, because the selling price of the dynamic grid is always lower than the price of the constant grid (which is fixed at all times). However, the third microgrid, when presented with two options while buying power, prefers to buy from the microgrid that quotes the lower price, i.e., from the dynamic grid. In this way the dynamic grid successfully sells its power to the third microgrid at most times. The constant grid is left with no choice but to sell to the central grid at a much lower selling price (the grid price minus $k$), which yields lower profits.

We can observe from Figure 2 that the agents learn to schedule the ADL demands at different times, which shows that our model is capable of shifting power consumption away from the peak demand time. We can also observe that the ADL agent frequently picks a certain ADL action at each time step, which shows the convergence of the ADL agent's policy. In the figure, frequency denotes the number of times an ADL demand is fulfilled at that particular time step over the last 10,000 iterations (after convergence).

Figure 3 shows the dynamic nature of the prices decided by the microgrid at different times. These prices depend on the current state of the microgrid and are decided at each time period. In the figure, frequency denotes the number of times a particular price is selected over the last 10,000 iterations (after convergence). From this figure, we can observe that the dynamic grid has learned to quote a price of 19 the majority of the time. This implies that the dynamic grid has learned to adapt to the constant grid and to quote a price lower than the one quoted by the constant grid, which is 20. Therefore, the dynamic grid successfully sells its power to the third microgrid, leading to more profits.
Setup 2: From the results of our experiments, we draw the following observations:

In Table I, we report the average rewards obtained by the dynamic pricing and constant pricing policies over the last iterations (after convergence).

From Table I, we observe that the proposed dynamic pricing model performs better than the constant pricing model for the majority (six out of eight) of the microgrids.

This empirically shows that following a DPP as compared to a CPP proves to be more advantageous to a majority of the microgrids, when energy trading occurs between microgrids.
Setup 3: Through the results of our experiments, we draw the following observations:

From Table II, we observe that most of the microgrids receive higher rewards through dynamic pricing than they would have accumulated through constant pricing. Therefore, our proposed dynamic pricing model proves to be more advantageous than the constant pricing model for the majority of the microgrids.
It can be observed that the microgrids achieve better rewards in setup 3 than in setup 2 (the differences in rewards are higher in setup 3 than in setup 2). We attribute this to the fact that the majority of the microgrids have higher energy generation in setup 3 than in setup 2, which enables them to sell more energy. Moreover, the effect of dynamic pricing becomes more prominent when they generate more power, as seen in the difference between their dynamic pricing and constant pricing rewards. The differences (column 4 of Tables I and II) are more in favour of dynamic pricing in setup 3 because our proposed model enables each microgrid to quote prices judiciously throughout the day and thus to sell intelligently; the more energy a microgrid possesses, the more this intelligent selling reflects positively in its overall reward.
Remark 1
The objective of our experiments is to understand the behavior of the microgrids learning their strategies together in a network. We wanted our models to be as close to real scenarios as possible, where all the microgrids learn their policies in parallel.
Remark 2
It is to be noted that in setup 1, we had two microgrids that generally generated more renewable energy than their demands. This was done to understand the dynamics of energy transactions between the agents. Since all the demands were satisfied and the microgrids were able to sell energy, the rewards were positive. In setups 2 and 3, we implemented more realistic scenarios (where most microgrids generally generated less energy than their demand) and compared the performance of microgrid networks following dynamic pricing and constant pricing policies. Under these setups, the microgrids may have to buy energy from other microgrids or the main grid to fulfill some of their demands, which in turn results in a negative reward.
From these three setups, it is clear that the agents which follow our dynamic pricing strategy generally perform better than those which follow the constant pricing model. Moreover, we have also shown that, besides dynamic pricing, the microgrids learn to intelligently schedule the ADL demands in a way that shifts the energy consumption away from peak demand.
VII Conclusion
In this work, we have constructed a stochastic game framework for a network of microgrids that enables energy trading, dynamic pricing and job scheduling. To solve this problem, we have devised a novel two-network model (ET and ADL networks) that performs both dynamic pricing and demand scheduling at the same time. We have applied our proposed algorithm to compute policies under various setups and have shown that our proposed dynamic pricing model yields greater rewards for the majority of the microgrids. We believe that such a modelling scheme can be applied to other sequential learning tasks.
As future work, we would like to introduce an auction mechanism to enable transactions between microgrids in which buyer microgrids can negotiate the prices decided by the seller microgrids.