A microgrid is a networked group of distributed energy sources that generates, converts and stores energy. While the main power stations are highly connected, microgrids with local power generation, storage and conversion capabilities act locally or share power with a few neighboring microgrid nodes. This scenario is envisaged as an important alternative to the conventional scheme in which large power stations transmit energy over long distances.
To take full advantage of the modularity and flexibility of microgrid technologies, smart control mechanisms are required to manage and coordinate these distributed energy systems so as to minimize the costs of energy production, conversion and storage without jeopardizing the stability of the central smart grid. Augmenting microgrids with smart controls, however, involves addressing many problems. In this paper, we address two of them: (i) the supply-side management (SSM) problem: energy sharing among microgrids under stochastic supply and demand, along with optimal battery scheduling at each microgrid; and (ii) the demand-side management (DSM) problem: efficiently scheduling the time-adjustable demand from smart appliances in a smart home environment along with the non-adjustable demand. Our goal is to maximize the profit earned by microgrids from selling excess energy while maintaining a low gap between demand and supply. We address these learning and scheduling problems by modeling them in the framework of Markov decision processes (MDPs).
I-A Supply-side management (SSM) problem
Cooperative energy exchange among microgrids is a popular technique in SSM for efficient energy distribution. Local energy sharing/exchange between microgrids has the following advantages: (a) it can significantly reduce power wastage that would otherwise result over long-distance transmission lines, and (b) it helps satisfy demand and reduce reliance on the main grid. Figure 1 shows a cooperative energy exchange model with multiple microgrids (on the distribution side of the network) that can cater to their individual local loads. Each microgrid controls its local sub-network through its controller (labelled , etc.) that mainly has access to its local state information.
In classical power grids, system-level optimization is based on a centralized objective function, whereas a microgrid network is heterogeneous, from the manner in which electricity is generated (such as wind turbines, solar farms and diesel generators) to the energy storage devices used (such as batteries and capacitors). Because of this heterogeneity and the fact that energy can be shared between microgrids depending on requirements, one needs to consider distributed techniques to control and optimize a smart grid system with a microgrid distribution network.
Related work: Saad et al. provide a survey of game-theoretic approaches for microgrids, in which both cooperative energy sharing models and non-cooperative game models for distributed control of microgrids are examined when the system model is known. Since models for energy dynamics are very unreliable, one has to use model-free algorithms to address these problems. Because of their model-free nature, reinforcement learning (RL) approaches, which are primarily data-driven control techniques, play a significant role in these problems.
Zifa et al. propose a distributed RL algorithm for coordinated energy sharing and voltage restoration in an islanded DC microgrid. Leo et al. apply reinforcement learning to optimal battery scheduling under a dynamic load environment and solar power. In this paper, one of the problems we consider is coordinated energy sharing among grid-connected microgrids with optimal battery scheduling under stochastic supply and adjustable stochastic demand.
I-B Demand-side management (DSM) problem
Load shifting is a popular technique used in demand-side management (DSM). It involves moving load consumption to different times within an hour, a day, or a week. It does not reduce the net quantity of energy consumed; it only changes the time at which the energy is consumed. While load shifting helps the customer reduce the energy consumption cost, it helps the smart grid manage the peak load.
With the increased use of smart appliances and smart home environments, load shifting is becoming increasingly popular for the smart grid, as the demand from smart appliances is in general time adjustable. One or more of these smart appliances collectively achieve some activity in the smart home environment, called an ADL (activity of daily living). It is possible to monitor and identify the ADLs in smart home environments. With the help of smart home technology, it is possible to find the amount of load each ADL puts on the grid, and also the allowed time window during which the ADL would perform the activity (e.g., scheduling a washing machine for an hour to clean the clothes at any time between 3 PM and 6 PM). The demand from ADLs need not be met during a fixed time period; instead, it can be met during any time period within a flexible time window. With the help of the advanced metering infrastructure (AMI), which provides two-way communication between the utility and customers, it is possible to decide at the smart grid when to schedule the ADL demand and to convey that decision to the customer’s smart meter.
There is regular demand that needs to be met at fixed time periods, apart from the ADL related demand associated with any customer. This regular demand of a smart home will be called non-ADL demand in the rest of the paper. Similarly, the demand from ADLs of the smart home will be called ADL demand.
There is prior work on scheduling the ADL demand using the load shifting technique to handle peak load scenarios. However, these approaches assume that the supply profile is known precisely at scheduling time, which is an unrealistic assumption. In this paper, we propose scheduling of ADL demand using the load shifting technique under uncertainty in the generated supply profile (e.g., when renewable energy sources such as solar or wind are the primary sources of power generation).
Our main contributions:
(i) To the best of our knowledge, we are the first to integrate both the demand-side and supply-side management problems in a unified Markov decision process framework. We apply reinforcement learning (RL) algorithms which do not require knowledge of the underlying system model to address these problems. Our algorithms are easy to implement and also scalable.
(ii) The optimal scheduling of ADL demand at the microgrid level, where both demand and power generation are stochastic, is introduced for the first time through our work.
The rest of the paper is organized as follows. In section II, we discuss the problem formulation in detail using the MDP framework. In section III, we present the Q-learning algorithm. In section IV, we present simulation experiments along with other algorithms for comparison. Finally, in section V, we provide concluding remarks.
II Problem Formulation and MDP Model
We consider microgrids denoted by , which are interconnected through the central electric grid distribution network. Each microgrid comprises distributed small-scale renewable power generation sources that are equipped with energy storage devices. Let be the maximum energy storage capacity of microgrid . At every time step of a day, the microgrid controller has access to the following information:
Total energy () generated from all its energy sources.
Accumulated non-ADL demand () from each load.
Set of all ADL jobs (). has the form , where the ADL job . Here, represents the number of units of energy required to finish the job, and represents the number of future time slots remaining by which one can schedule the job without incurring a penalty.
Total energy available () in its storage device.
From the above information, the microgrid controller has to decide at every time step on the following: (a) the amount of energy it needs to buy (sell) from (to) the main grid, (b) the amount of energy it needs to buy (sell) from (to) the neighboring microgrids, (c) the amount of energy it needs to store (retrieve) into (from) the storage device, and (d) the subset of ADL jobs it needs to schedule. Both the demand and the energy generated at each microgrid are uncertain due to the random nature of the loads ( and ) and the renewable energy generation ().
MDP is a general framework for modeling problems of dynamic optimal decision making under uncertainty. An MDP is a tuple , where is the set of all states, is the set of feasible actions, is the single-stage reward function and is the transition probability matrix. In RL, an agent interacts with the environment by observing state and making decisions . The new state is obtained from the transition probability and yields a reward . The goal of the RL agent is to learn the optimal sequence of actions so as to maximize its total expected return. In the next subsection we provide the details of our MDP model.
II-A MDP framework
We begin by specifying the states, actions and single-stage rewards, for the MDP model.
II-A1 State space
The state at time instant for the microgrid is as follows:
where the net demand . If , then there is excess of power after meeting the non-ADL demand and if , there is a deficit in power even to meet the non-ADL demand. Also denotes the price per unit energy. The state also includes time since optimal action can depend on it. For example, a microgrid operating on solar renewable generation can sell excess power during the morning as the solar power will be available even during afternoon. But it may not be a good choice to sell it in the evening as there will be no solar power during the night.
II-A2 Action space
At each time instant , the microgrid controller needs to make two decisions and . The first action , if positive, denotes the number of units that the microgrid is willing to sell and if negative, represents the number of units that the microgrid is willing to buy. The second action pertains to the scheduling decision of ADL jobs taken by microgrid .
Let be the power set of , which consists of all possible combinations of the ADL jobs that can be scheduled at time instant at microgrid . Let be a set, where each element denotes the total aggregated ADL demand of an element in . For example, element , where is the element in and is the total number of elements in . The feasible region for action is bounded as follows:
where denotes the maximum amount of power the main grid can give to microgrid . This constraint is to maintain stability of the main grid. The above bounds indicate that the microgrids can sell energy if there is surplus; or can buy energy either to meet the non-ADL demand, or to store in the battery, or to meet ADL demand. The energy bought is either stored or used to meet the ADL demand only after satisfying the non-ADL demand. There is flexibility for microgrids to buy (sell) this power from (to) the neighboring microgrids. If it needs to buy (sell) more power, only then it buys (sells) it from (to) the main grid.
After the controller picks action , we construct the feasible set (subset of ) which consists of all possible subsets of ADL jobs that can be scheduled with . More formally, each element of has to satisfy the following condition: , where is the total energy required to finish all the ADL jobs in it. The controller picks action which is an element of , which results in scheduling all the ADL jobs in that subset. The remaining power is used to meet the non-ADL demand or for storage in the battery.
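The construction of the feasible set can be illustrated with a small sketch. It assumes that each ADL job is a pair (energy units, remaining slots) and that a single hypothetical number, `energy_budget`, stands for the energy left for ADL jobs once the trading action has been fixed; the exact feasibility condition is the one stated above, and the names used here are illustrative only.

```python
from itertools import combinations
from typing import List, Tuple

# Assumed job layout: (energy units required, remaining slots before a penalty applies).
Job = Tuple[int, int]


def feasible_job_subsets(jobs: List[Job], energy_budget: int) -> List[Tuple[int, ...]]:
    """Enumerate the power set of jobs (by index) and keep the subsets
    whose total energy requirement fits within the budget."""
    feasible = []
    for size in range(len(jobs) + 1):
        for subset in combinations(range(len(jobs)), size):
            total_energy = sum(jobs[i][0] for i in subset)
            if total_energy <= energy_budget:
                feasible.append(subset)
    return feasible


# Hypothetical example: three pending jobs and three units of energy to spend.
jobs = [(2, 1), (3, 2), (1, 0)]
subsets = feasible_job_subsets(jobs, energy_budget=3)
```

Enumerating the power set is exponential in the number of jobs, which is acceptable for the handful of ADL jobs per microgrid considered here but would need pruning for larger job sets.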
Let be the new set of ADL jobs received by controller at time instant . Depending on action , some of the ADL jobs will not get scheduled. These are then considered in time step , if they can be scheduled without incurring any penalty. The set of all ADL jobs at time instant is the union of the new and old ADL jobs which are not scheduled even after reducing by one (number of future time slots remaining by which one can schedule that job without incurring penalty). We have , where , and . Further, .
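The job-set update described above can be sketched as follows. This is a minimal illustration, assuming each job is a pair (energy units, remaining slots) and that a job whose remaining slots have reached zero cannot be deferred further, so it is dropped from the queue if left unscheduled; the function and argument names are hypothetical.

```python
from typing import List, Set, Tuple

# Assumed job layout: (energy units required, remaining slots before a penalty applies).
Job = Tuple[int, int]


def next_job_set(current: List[Job], scheduled: Set[int], arrivals: List[Job]) -> List[Job]:
    """Carry over unscheduled jobs that can still wait, with one slot less,
    and append the newly arrived jobs."""
    carried = []
    for index, (units, slots_left) in enumerate(current):
        if index in scheduled:
            continue  # job was served in this time step
        if slots_left > 0:
            carried.append((units, slots_left - 1))
        # jobs with no slots left are not carried over (assumption)
    return carried + list(arrivals)


# Hypothetical example: job 2 is scheduled, job 1 has run out of slack.
pending = next_job_set([(2, 1), (3, 0), (1, 2)], scheduled={2}, arrivals=[(4, 3)])
```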
The battery information is updated as follows:
which denotes the power available after meeting the non-ADL demand and after meeting part of the ADL demand.
II-A3 Single-stage reward function
We want to maximize the profit of each microgrid obtained by selling power while reducing the demand and supply deficit. Our single-stage reward function has both the reward obtained by selling power and penalty for unmet demand. The single-stage reward function for our MDP is as follows :
The first term represents the loss/gain incurred for buying/selling power, while the second and third terms represent the penalty incurred for not meeting the non-ADL demand and the ADL demand, respectively. Here, is the penalty per unit of unmet demand and is the indicator random variable, which equals one when its condition holds and zero otherwise. Next, we provide the long-run average cost objective function.
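To make the three-term structure concrete, the following sketch computes a reward of the same shape: a trading term plus penalties for unmet non-ADL demand and for ADL demand that could not be deferred any longer. The argument names, and the use of `max(0, ...)` in place of the indicator, are assumptions made for illustration; the sketch is not a transcription of the reward function above.

```python
def single_stage_reward(
    price: float,
    units_traded: int,
    unmet_non_adl: int,
    expired_adl_units: int,
    penalty: float,
) -> float:
    """Trading gain (positive units = sold, negative = bought) minus a
    per-unit penalty on unmet non-ADL demand and on expired ADL demand."""
    trade_term = price * units_traded
    shortfall = max(0, unmet_non_adl) + max(0, expired_adl_units)
    return trade_term - penalty * shortfall


# Hypothetical examples with a price of 10 and a penalty of 5 per unit.
selling = single_stage_reward(10, 3, 0, 2, 5)   # sells 3 units, 2 ADL units expire
buying = single_stage_reward(10, -2, 1, 0, 5)   # buys 2 units, 1 non-ADL unit unmet
```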
II-B Average cost setting
The long-run average cost objective function of the microgrid for a given policy is given as follows:
where denotes the expected value. Here we view a policy as the map which assigns for any state , a feasible action . The goal of our RL agent is to find , where is the set of all feasible policies.
In this paper, we do not assume any model of the system (i.e., a probability transition model of the demand, supply and reward structure) due to the uncertainty of renewable energy generation. We employ RL algorithms, which do not assume any model, to obtain an optimal solution.
We employ the Q-learning algorithm, a popular RL method, to solve the average cost problem of section II-B. Our objective is to obtain an optimal policy . We apply the Relative Value Iteration (RVI) based Q-learning algorithm described by Abounadi et al. In this algorithm, we update the Q-values in each iteration according to the following rule:
where is the learning rate, is the reward obtained by taking an action in the state and transitioning to the state and is any prescribed state. Also, represents the th estimate of the Q-value obtained in state by taking action . Abounadi et al. show that, under an appropriate learning rate, the algorithm converges to the optimal policy. Each microgrid runs a version of this algorithm independently until convergence. The optimal policy of microgrid is obtained as follows:
that is, the optimal action in state is obtained by taking the maximum over all actions of the Q-values in state .
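A tabular version of this procedure can be sketched as below. The sketch assumes one common form of the RVI Q-learning update, in which the offset subtracted from the target is the maximum Q-value at the prescribed reference state; the table layout, function names and toy states are hypothetical and are not taken from the released implementation.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Sequence, Tuple

QTable = DefaultDict[Tuple[Hashable, Hashable], float]


def rvi_q_update(
    q: QTable,
    state: Hashable,
    action: Hashable,
    reward: float,
    next_state: Hashable,
    next_actions: Sequence[Hashable],
    ref_state: Hashable,
    ref_actions: Sequence[Hashable],
    lr: float,
) -> float:
    """One RVI Q-learning step: the target is the reward plus the best
    next-state value, offset by the best value at the reference state."""
    best_next = max(q[(next_state, b)] for b in next_actions)
    ref_value = max(q[(ref_state, u)] for u in ref_actions)
    td_error = reward + best_next - ref_value - q[(state, action)]
    q[(state, action)] += lr * td_error
    return q[(state, action)]


def greedy_action(q: QTable, state: Hashable, actions: Sequence[Hashable]) -> Hashable:
    """Pick the action with the largest Q-value in the given state."""
    return max(actions, key=lambda a: q[(state, a)])


# Hypothetical two-state, two-action example.
q: QTable = defaultdict(float)
q[("s1", "sell")] = 2.0
q[("s0", "buy")] = 1.0
actions = ["buy", "sell"]
updated = rvi_q_update(q, "s0", "sell", 4.0, "s1", actions, "s0", actions, lr=0.5)
best = greedy_action(q, "s0", actions)
```

In the multi-microgrid setting each controller would keep its own table and run these two functions independently, in line with the description above.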
IV Simulation Experiments
We used the RAPsim simulator for the evaluation of our algorithms. RAPsim allows users to simulate microgrid networks (involving the main grid, microgrids with solar and wind power generating capabilities, and individual homes with solar panels). We implement our models on a network with three microgrids (see Figure 2), of which two operate on solar power and one (in the middle) on wind power. The solar microgrid on the right has more generation capacity than the one on the left. Each microgrid can serve power only to the homes that are connected to it.
We implement our ADL-sharing model described in section II. For comparison purposes, we also implement the following models. The implementation is available at https://github.com/saikotireddy/PES-GM---Smart-Grid/archive/master.zip.
Greedy-ADL model: In this model, microgrids exhibit greedy behavior. They share power only after filling their respective batteries fully. The action at each time instant is bounded as follows:
Thus, if , decision is taken on the amount of power to buy in order to satisfy the demand and to fill the battery. If , then it is first used to fill the battery fully and if any excess power is left, it will be sold to the other microgrids.
Non-ADL model: In this model, ADL demand is treated as normal demand. Penalty is levied immediately if the demand is not met in the current time slot.
IV-B Simulation setup
We used the RAPsim simulator to generate real-world hourly renewable energy data for each of the microgrids for the month of September 2017. We used this data to fit a Poisson distribution for energy generation at each microgrid. The number of decision time periods in a day is taken to be 4. The means of the Poisson distributions for the two solar microgrids and the wind microgrid, respectively, are as follows:
where the element represents the Poisson mean of microgrid at time . For each time period, the non-ADL demand () at each of the microgrids can take one of three values: 2, 4 and 6 units. The price () per unit of energy takes one of the values 5, 10 and 15. The transition probability matrices for the non-ADL demand and the price values are generated randomly.
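A setup of this kind can be simulated with the standard library alone, as sketched below. The Poisson sampler uses Knuth's multiplication method, and the demand and price levels follow the values stated above; the per-period means in `hypothetical_means`, the seed and all function names are placeholders chosen for illustration, not the fitted values.

```python
import math
import random
from typing import List


def sample_poisson(mean: float, rng: random.Random) -> int:
    """Draw a Poisson variate with Knuth's method (adequate for small means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def random_transition_matrix(size: int, rng: random.Random) -> List[List[float]]:
    """Generate a random row-stochastic matrix by normalising positive weights."""
    rows = []
    for _ in range(size):
        weights = [rng.random() + 1e-9 for _ in range(size)]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


def next_index(current: int, matrix: List[List[float]], rng: random.Random) -> int:
    """Sample the next level of a Markov chain from the current row."""
    return rng.choices(range(len(matrix)), weights=matrix[current])[0]


rng = random.Random(0)
demand_levels = [2, 4, 6]    # non-ADL demand values from the text
price_levels = [5, 10, 15]   # price values from the text
hypothetical_means = [1.0, 3.0, 3.0, 1.0]  # placeholder means, one per decision period

demand_matrix = random_transition_matrix(len(demand_levels), rng)
generation = [sample_poisson(m, rng) for m in hypothetical_means]
demand_index = next_index(0, demand_matrix, rng)
```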
The maximum size of the battery () and the maximum power that a microgrid can obtain from the main grid () are taken to be 8 and 10 units, respectively. At each microgrid, we consider 3 ADL jobs at the start of the day, where ADL job requires units of energy within time slots. In the model, the ADL demand is added to the demand at each day. We ran all our models for each of the following (penalty per unit of unmet demand) values: 0, 5, 10 and 30.
The algorithms are trained for cycles. We used the average profit obtained by each microgrid as a performance metric to evaluate the models. Figure 3 plots the average profit obtained for each microgrid versus the number of iterations, when for all the three models.
Figure 4 plots the average profit obtained for each microgrid versus for all the three models. We run the trained models for 1000 runs to obtain the average reward.
In the Greedy-ADL model there is less buying and selling of power compared to the other models. Therefore, the overall profit obtained is not high. Thus, intelligent sharing of power among microgrids, as with the RL technique, yields more profit than in the Greedy-ADL case.
We also observe that the ADL-sharing model outperforms the Non-ADL model. In the ADL-sharing model, there is flexibility to intelligently schedule the ADL jobs according to the non-ADL demand and price. Hence, we conclude that intelligently scheduling the ADL demand results in better performance.
Providing a unified solution framework for modeling both the demand-side management problem (scheduling ADL jobs) and the supply-side management problem (enabling cooperative energy exchange among the microgrids) is a challenging task, particularly when both demand and supply are stochastic. We have, for the first time in the literature, studied these two problems in a unified framework by using MDPs. Also for the first time in the literature, we have proposed scheduling ADL demand at the microgrid level as a load shifting technique. RL algorithms provide an optimal solution methodology for solving MDPs when the underlying model is not available. We apply the Q-learning algorithm to maximize the profit earned by microgrids from selling excess energy while maintaining a low gap between demand and supply. Based on the simulation experiments, we show that our model consistently outperforms the other models.
As future work, we would like to consider the pricing mechanism for microgrids. In the current model, the transaction of power is carried out at the price decided by the main grid. The pricing mechanism allows microgrids to bid for the selling price as well as buying price. One can use RL agents to bid for adaptive prices in such a way that microgrids maximize their profits. Another important future work is to use efficient RL algorithms with function approximation to scale the proposed algorithms. The challenge here is to select the appropriate features to obtain an optimal policy.
-  H. Farhangi, “The path of the smart grid,” IEEE power and energy magazine, vol. 8, no. 1, 2010.
-  M. L. Puterman, Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
-  W. Saad, Z. Han, H. V. Poor, and T. Basar, “Game-theoretic methods for the smart grid: An overview of microgrid systems, demand-side management, and smart grid communications,” IEEE Signal Processing Magazine, vol. 29, no. 5, pp. 86–105, 2012.
-  R. Zamora and A. K. Srivastava, “Controls for microgrids with storage: Review, challenges, and research needs,” Renewable and Sustainable Energy Reviews, vol. 14, no. 7, pp. 2009–2018, 2010.
-  R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT Press, Cambridge, 1998.
-  L. Zifa, L. Ya, Z. Ranqun, and J. Xianlin, “Distributed reinforcement learning to coordinate current sharing and voltage restoration for islanded DC microgrid,” Journal of Modern Power Systems and Clean Energy, pp. 1–11.
-  R. Leo, R. Milton, and S. Sibi, “Reinforcement learning for optimal energy management of a solar microgrid,” in Global Humanitarian Technology Conference-South Asia Satellite (GHTC-SAS), 2014 IEEE. IEEE, 2014, pp. 183–188.
-  B. Davito, H. Tai, and R. Uhlaner, “The smart grid and the promise of demand-side management,” McKinsey, 2010.
-  G. Baryannis, P. Woznowski, and G. Antoniou, “Rule-based real-time ADL recognition in a smart home environment,” in Rule Technologies. Research, Tools, and Applications - 10th International Symposium, RuleML 2016, Stony Brook, NY, USA, July 6-9, 2016. Proceedings, ser. Lecture Notes in Computer Science, vol. 9718. Springer, 2016, pp. 325–340.
-  C. O. Adika and L. Wang, “Smart charging and appliance scheduling approaches to demand side management,” International Journal of Electrical Power & Energy Systems, vol. 57, pp. 232–240, 2014.
-  J. Abounadi, D. Bertsekas, and V. S. Borkar, “Learning algorithms for Markov decision processes with average cost,” SIAM Journal on Control and Optimization, vol. 40, no. 3, pp. 681–698, 2001.
-  M. Pochacker, T. Khatib, and W. Elmenreich, “The microgrid simulation tool RAPSim: description and case study,” in Innovative Smart Grid Technologies-Asia (ISGT Asia), 2014 IEEE. IEEE, 2014, pp. 278–283.