
Delayed Q-update: A novel credit assignment technique for deriving an optimal operation policy for the Grid-Connected Microgrid

A microgrid is an innovative system that integrates distributed energy resources to supply electricity demand within electrical boundaries. This study proposes an approach for deriving a desirable microgrid operation policy that enables sophisticated controls in the microgrid system using the proposed novel credit assignment technique, delayed-Q update. The technique employs novel features such as the ability to tackle and resolve the delayed effective property of the microgrid, which prevents learning agents from deriving a well-fitted policy under sophisticated controls. The proposed technique tracks the history of the charging period and retroactively assigns an adjusted value to the ESS charging control. The operation policy derived using the proposed approach is well-fitted for the real effects of ESS operation because of the process of the technique. Therefore, it supports the search for a near-optimal operation policy under a sophisticatedly controlled microgrid environment. To validate our technique, we simulate the operation policy under a real-world grid-connected microgrid system and demonstrate the convergence to a near-optimal policy by comparing performance measures of our policy with benchmark policy and optimal policy.



I Introduction

A microgrid is a system that integrates distributed energy resources (DERs) to supply the local electricity demand. The DERs include several different types of resources, such as renewable energy sources (RES), energy storage systems (ESS), and small dispatchable generators (DGs; e.g., diesel engines, gas turbines). In addition, in the grid-connected mode, electricity is tradable with external power networks (e.g., the utility grid). The benefits of a microgrid are as follows: (1) the utilization of RES, which efficiently reduces carbon emissions, (2) an improvement in local energy delivery with low investment in capacity for both generation and transmission, (3) an improvement in the reliability and resilience of the local energy supply, and (4) a reduction in the costs associated with long-term infrastructure investment [1, 2]. In response to these benefits, the installed capacity of microgrids is growing fast and is expected to reach 8.8 GW by 2024 [3].

One general goal of microgrid operation is to minimize operation costs by determining the optimal output of each DER while satisfying the constraints of the system. The optimal operation problem involves a sequential optimization process under uncertain variables such as time-varying demand and electricity price. Therefore, many previous studies have attempted to use methods for sequential decision problems, such as dynamic programming (DP) and stochastic programming (SP), to derive the optimal operation policy [4, 5, 6]. However, most of the proposed methods impose too heavy a computational burden for practical application. Therefore, some studies proposed meta-heuristic algorithms to derive the optimal operation policy within a reasonable time [7, 8]. However, these studies cannot fully capture the stochastic characteristics of real system dynamics.

Responding to the limitations of previous studies, some studies have attempted to derive the optimal operation policy of microgrids based on reinforcement learning (RL) methods. RL based methods are a useful machine learning approach wherein the learning agent optimizes policies through sequential interactions with the environment without requiring the learning agent to know any information about the system dynamics.


[9] suggested an effective method for deriving a microgrid operation policy using multi-agent RL and relaxing the curse of dimensionality. Using multi-step Q-learning, [10] developed an RL framework for autonomous multi-state and multi-criteria decision-making for medium-term scenarios of energy storage management. Similarly, [11] proposed a forecasting-based multi-step-ahead Q-learning algorithm to manage microgrid systems. [12] introduced action-dependent heuristic dynamic programming with decision tree rules and applied an evolutionary algorithm to enhance the convergence speed for finding a near-optimal control policy. [13] proposed an algorithm to derive a policy for smart energy building management using the tabular Q-learning method. [14] proposed a simple rule-based microgrid control policy by applying a decision tree algorithm to a Q-table derived using Q-learning. [15] derived a continuous policy for dispatching ESS using a novel RL algorithm combined with Monte Carlo tree search methods, reducing the size of the search space needed to conduct multi-step bootstrapping.

More recently, following the development of deep RL (DRL) methods that combine RL and deep neural networks (DNNs), studies have suggested DRL-based methods for the optimal operation of microgrids. In [16], a dual iterative Q-learning algorithm was proposed to derive an optimal ESS management policy in smart residential energy systems of a periodic nature. [17] applied a deep Q-learning (DQL) algorithm with a novel Q-network structure to derive a microgrid management strategy. [18] derived prosumers' electricity trading strategy using DQL in the local energy market. [19] formulated a Markov Decision Process (MDP) model for a microgrid interacting with the electricity market and derived an operation policy for the system using a deep Q-network (DQN) algorithm to conduct real-time energy management. [20] suggested a method that derives the operation strategy of a community ESS using double DQL in both grid-connected and islanded modes with different objectives.

In previous studies proposing Q-learning based algorithms [17, 18, 13, 20], the defined problems contain only one action, either charging/discharging of the ESS or buying/selling electricity from the external grid. These approaches are insufficient for real microgrid applications because they cannot handle problems with sophisticated action spaces. That is, in addition to determining the direction of each DER's operation, the algorithm needs to determine its level. For example, the algorithm should determine the amount of electricity charged to or discharged from the ESS. Although [19] considered a sophisticated action space in the microgrid system, they assumed that the microgrid system had only one controllable DER, the ESS. To the best of our knowledge, only limited research has proposed DRL-based methods for several controllable DERs with sophisticated controls in a microgrid system.

Of course, adopting a sophisticated action space for controlling several DERs simultaneously raises an issue: the number of state-action pairs increases exponentially. In particular, when Q-learning based methods are utilized, the formation of a large discrete action space is inevitable. With a large discrete action space, the Q-values of numerous state-action pairs must be explored sufficiently to approximate the true values, but a learning agent cannot achieve this in a reasonable time. As a result, the learning agent is unable to approximate the future value in the learning phase. In the context of the microgrid operation problem, the agent fails to capture the delayed effective property. That is, the current ESS charging control initially incurs costs, but its benefit becomes effective only in the future, when the discharging control is conducted. Consequently, the agent follows only the myopic goal instead of planning for the long-term objective.

Although there are several typical approximation techniques in the Q-learning based approach to support the agent in adapting well to the complex system, these approaches have not been applied well in the microgrid operation problem. Some studies strived to remedy the agent’s myopic perspective by adopting the multi-step ahead approximation techniques of the action-value with the Q-learning algorithm [10, 11]. However, these techniques require prediction-based scenarios for future steps. Therefore, these techniques can only be used for simple environmental settings with a short-term planning horizon. When the size of the action space for sophisticated control of the system increases, there will still be limits that keep the techniques from being a complete solution to the issue of the agent’s myopic perspective. To overcome the issue of the large action space, this study proposes a novel credit assignment technique, delayed Q-update, to derive a well-fitted operation policy under the sophisticated action space for controlling the microgrid system.

In this context, the purpose of this study is to create a technique that allows a Q-learning based algorithm to derive an optimal microgrid operation policy of a grid-connected microgrid system under the sophisticated action space. To apply the proposed technique, we formulate an MDP, emulating the real microgrid control system, that conducts the sophisticated controls of several DERs. To demonstrate the optimality of an operation policy derived using our proposed algorithm, we conduct a simulation of our operation policy with real-world microgrid data and compare the result with an optimal operation policy derived using DP. The main contributions of this study can be summarized as follows:

  • This is the first study investigating an algorithm for the optimal operation of a microgrid system based on a Q-learning method with sophisticated controls of several controllable DERs, which is much more applicable in the real-world microgrid system.

  • We devise the delayed Q-update technique and corresponding Q-learning algorithm that supports the learning agent in deriving a well-fitted operation policy by overcoming the delayed effective property in the microgrid operation problem.

The remainder of this paper is organized as follows. Section 2 describes the definition of our problem and formulates the MDP model, and Section 3 explains the delayed Q-update technique with the Q-learning algorithm we used. In Section 4, we provide experimental results to validate the advantages of our approach. Finally, we conclude in Section 5 by providing relevant implications and identifying directions for future research.

II Problem definition

In this section, we introduce the mathematical models of the microgrid operation problem. The considered microgrid system involves several DERs, such as solar photovoltaic (PV), a dispatchable generator, ESS, and demand response, and it is connected to the utility grid. In the grid-connected mode, electricity can be traded with the external utility grid to cover a surplus or shortage of electricity in the system. We assume that the price of electricity traded with the utility grid is determined a day in advance in the day-ahead market. We aim to derive an optimal operation policy of the microgrid that minimizes the long-run operation costs under the system constraints. The mathematical formulation in Subsection II-A describes our microgrid operation problem, similar to our previous study [21]. Then, we explain how to build an MDP model for our RL-based algorithm.

index for time
unit time period (e.g., hour)
length of planning horizon
the number of episodes
dispatchable generation unit cost (e.g., KRW/kWh)
load curtailment unit cost (e.g., KRW/kWh)
discharging unit cost of the ESS (e.g., KRW/kWh)
transmission capacity between the microgrid and the utility grid (e.g., kW)
capacity of the dispatchable generation (e.g., kW)
minimum generation of the dispatchable generation (e.g., kW)
maximum ramp up & down capacity for dispatchable generator (e.g., kW)
ESS charging & discharging capacity (e.g., kW)
ESS storage capacity (e.g., kWh)
controllable demand rate (%)
efficiency of the ESS (%)
state of charge of the ESS at time (e.g., kWh)
electricity price of the utility grid at time (e.g., KRW/kWh)
electricity demand at time (e.g., kWh)
renewable generation at time (e.g., kWh)
amount of electricity traded with the utility grid at time (e.g., kWh)
amount of electricity generated by the dispatchable generation at time (e.g., kWh)
amount of electricity charged or discharged in the ESS at time (e.g., kWh)
amount of curtailed electricity demand at time (e.g., kWh)
TABLE I: Summary of notations

II-A Mathematical modeling

The mathematical model for our microgrid operation problem is described by Equations (1)-(6), and all notations for the model are summarized in Table I.

s.t. (2)

Equation (1) represents the objective of our microgrid operation problem, minimizing total operation costs over the planning horizon. The operation costs of each time step can be defined as below (Equation (7)):


where denotes a non-negative value (i.e., ). All other equations describe the system constraints. Equation (2) represents the balance constraint between the supply and demand of electricity. As constraints on dispatchable generators, Equation (3) represents the upper and lower bounds on their generation, and Equation (4) limits the change from the previous output within a unit time (i.e., the ramp-up/down constraint of dispatchable generators). In Equation (5), the upper bound means that the amount of electricity discharged from the ESS is limited by a certain capacity and also by the remaining capacity of the ESS storage, taking the efficiency rate into account. Meanwhile, the lower bound means that the amount of electricity charged into the ESS is limited by a certain capacity and also by the current state of charge (SOC) level, again taking the efficiency rate into account. Equation (6) implies that the amount of demand response must be non-negative and cannot exceed the maximum controllable demand. Although the transmission capacity from/to the utility grid is limited, we ignore this constraint because the capacity is large enough to transmit the amount required in the problem.

II-B MDP modeling

In our microgrid operation problem, the RL agent can consider five key features in the microgrid system. The first feature is the output of the dispatchable generators in the previous hour (i.e., ), which is used for the ramp-up/down constraints (i.e., Equation (4)). Second and third are the output of renewable energy generators and the amount of demand currently occurring (i.e., ). Finally, the current SOC level and electricity price of the utility grid are needed for configuring the current situation of the microgrid (i.e., ). In summary, the state space of the RL agent can be represented as follows (Equation (8)).


The RL agent needs to determine the outputs of the controllable DERs and the amount of electricity traded with the utility grid. As a result, the action space of the RL agent has four actions, as shown by Equation (9). The first action is the amount of electricity from/to the utility grid. If this action has a negative value, the microgrid sells electricity to the external grid. The output of the small DGs and the amount of demand response are also considered as actions. The last action is the charging/discharging amount of the ESS. If this action has a negative value, the microgrid charges electricity into the ESS. Here, the agent selects only the last three elements of the action space, because the first element (i.e., ) is automatically determined by the balancing constraint between supply and demand (i.e., ). As a result, any action selected by the agent satisfies the first constraint (Equation (2)), and the dimension of the control variable decreases to a tractable size.
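The way the grid trade follows from the other three actions can be sketched in a few lines. This is only an illustration under the sign conventions stated above (negative grid trade means selling, negative ESS action means charging); the function and argument names are hypothetical, and the exact form of Equation (2) is assumed to be a plain supply-demand balance.

```python
def grid_trade(demand, renewable, dg_output, ess_action, curtailment):
    """Grid trade implied by the supply-demand balance (assumed form).

    Sign conventions follow the text: a negative return value means the
    microgrid sells to the utility grid, and a negative ess_action means
    the ESS is charging (which adds to the net demand).
    """
    net_demand = demand - curtailment
    local_supply = renewable + dg_output + ess_action
    return net_demand - local_supply


# Shortage: the microgrid buys 30 units while charging 10 units into the ESS.
bought = grid_trade(demand=80, renewable=20, dg_output=30,
                    ess_action=-10, curtailment=10)

# Surplus: the microgrid sells 20 units while discharging 10 units.
sold = grid_trade(demand=40, renewable=30, dg_output=20,
                  ess_action=10, curtailment=0)
```

Because the grid trade is computed rather than chosen, every candidate action satisfies the balance constraint by construction.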


The reward in the MDP model should reflect how much the agent's action contributed to minimizing the microgrid operation costs. This reward can be simply defined as the negative of the microgrid operation costs per time step (i.e., one hour). However, if the reward is defined in this simple way, the reward criteria become non-stationary across demand phases (e.g., peak demand and base demand). For example, during the base demand phase (i.e., the periods with relatively low demand), the agent obtains a relatively high reward no matter how undesirable its action is. In contrast, in the peak demand phase (i.e., the periods with relatively high demand), the agent obtains a relatively low reward regardless of how desirable its action is. Therefore, the reward formula should be robust under non-stationary conditions and provide consistent criteria. In addition, for stable learning, the scale of the reward must be adjusted to a reasonable range. A rescaling approach for the Q-learning reward is formulated in a related study [19]. Similarly, to keep the reward on a reasonable scale, we measure how desirable the current action is relative to the cost of the worst operation that the agent can perform in each hour. This reward formula is similar to that of [22] in that it is also a rate of change relative to a baseline. The worst-case operation in this problem is to satisfy the residual demand, which remains after the RES production is used, entirely with dispatchable generators (i.e., the unit cost of the dispatchable generator is higher than those of the other resources in the experiment). Therefore, we define the reward as the negative of the rate of change in the hourly operation costs relative to the costs that result from the worst-case action (Equation (10)).
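As a rough illustration of this reward definition, the sketch below computes the negative rate of change of the hourly cost relative to the worst-case cost. The function and argument names are hypothetical, and the worst-case cost is assumed to be the dispatchable generation unit cost times the residual demand, following the verbal description above rather than the exact Equation (10).

```python
def relative_reward(hourly_cost, demand, renewable, dg_unit_cost):
    """Negative rate of change of the hourly cost relative to the worst case.

    Assumption: the worst case serves the whole residual demand
    (demand minus renewable generation) with the dispatchable generator.
    """
    residual_demand = demand - renewable
    if residual_demand <= 0:
        raise ValueError("worst-case cost is undefined without residual demand")
    worst_cost = dg_unit_cost * residual_demand
    return -(hourly_cost - worst_cost) / worst_cost


# An hour that costs two thirds of the worst case earns a reward of about 1/3.
good_hour = relative_reward(hourly_cost=20000, demand=80,
                            renewable=20, dg_unit_cost=500)

# An hour that matches the worst case earns a reward of zero.
worst_hour = relative_reward(hourly_cost=30000, demand=80,
                             renewable=20, dg_unit_cost=500)
```

Normalizing by the worst-case cost of the same hour is what keeps the reward comparable between base-demand and peak-demand periods.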


The system dynamics of the MDP model only consider the SOC level of the ESS because the other elements of state do not depend on the actions in the previous time step. Equation (11) represents the dynamics of SOC level dependent on the action . In the equation, SOC level decreases by multiplied with the efficiency rate when the ESS is discharged, otherwise SOC level increases by divided by the efficiency rate when the ESS is charged.
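The SOC transition described above can be written as a small function. This is a sketch that mirrors the wording of the text (discharge scaled by multiplication with the efficiency rate, charge scaled by division), not a verified transcription of Equation (11); the names are hypothetical and the efficiency is taken as a fraction in (0, 1].

```python
def next_soc(soc, ess_action, efficiency):
    """SOC after one time step, following the verbal description.

    ess_action > 0: discharging, SOC decreases by ess_action * efficiency.
    ess_action < 0: charging, SOC increases by |ess_action| / efficiency.
    """
    if ess_action >= 0:
        return soc - ess_action * efficiency
    return soc - ess_action / efficiency


# With a lossless ESS the SOC simply moves by the action amount.
after_discharge = next_soc(soc=30, ess_action=10, efficiency=1.0)
after_charge = next_soc(soc=30, ess_action=-10, efficiency=1.0)
```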


III Methodology

In this section, we introduce our proposed approach for deriving the optimal operation policy of a microgrid using Q-learning with a novel technique, delayed Q-update. We leverage the advantage that the Q-learning algorithm, unlike policy-based RL algorithms, keeps the agent from being trapped in local minima [20].

III-A Description of Q-Learning Agent Model

Q-learning is a value-based RL method that approximates an action value (i.e., a Q-value) for each state-action pair. The Q-value represents the expected value of the discounted cumulative reward obtained by starting from the current state, taking the selected action, and then following a policy (Equation (12)). The optimal Q-value is induced by an optimal policy, which maximizes the Q-value of every state (Equation (13)), and this value is expressed using the Bellman equation (Equation (14)). In Q-learning, the optimal Q-value is approximated by bootstrapping, meaning that the target of the Q-value is built from the previously approximated value, which is known as temporal difference learning (Equation (15)). Here denotes the learning rate and denotes the discount factor for discounting time value. As revealed in Equation (15), this algorithm is a model-free method: even if the agent has no knowledge of environmental factors such as the transition probability, it can develop a policy from repeated experience by following a behavior policy (i.e., epsilon-greedy). In addition, Q-learning is an off-policy algorithm, that is, the behavior policy for selecting the agent's action is not the same as the target policy for selecting an action in the target value.
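For reference, the standard tabular Q-learning update with an epsilon-greedy behavior policy can be sketched as follows. This is the textbook form of the update, shown here with hypothetical names (`alpha` for the learning rate, `gamma` for the discount factor); it is not claimed to reproduce the authors' implementation.

```python
import random
from collections import defaultdict


def epsilon_greedy(q_table, state, actions, epsilon, rng=random):
    """Behavior policy: explore with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q_table[(state, a)])


def q_learning_update(q_table, state, action, reward,
                      next_state, next_actions, alpha, gamma):
    """One temporal-difference update toward the greedy bootstrapped target."""
    best_next = max(q_table[(next_state, a)] for a in next_actions)
    td_target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]


q = defaultdict(float)          # unseen state-action pairs start at 0
q[("s1", "a")] = 1.0
updated = q_learning_update(q, "s0", "a", reward=0.5, next_state="s1",
                            next_actions=["a", "b"], alpha=0.3, gamma=0.9)
greedy = epsilon_greedy(q, "s0", ["a", "b"], epsilon=0.0)
```

The target uses the maximum over next actions regardless of which action the behavior policy takes next, which is what makes the method off-policy.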


When adopting a Q-learning algorithm, we should define the discrete range of the agent’s state and action spaces. For the discretization of the agent’s action space, we define each control variable divided by 10 units. By following this discretization rule, the amount of dispatchable generators contains discrete cases which are achieved by dividing the possible range () by 10. Likewise, the charging/discharging amount of the ESS is discretized to cases by dividing the possible range () by 10 units, and the amount of the demand response is discretized to cases by dividing the possible range () by 10 units. The amount of electricity from/to the utility grid is determined when all other control variables are fixed, so it does not need any consideration. Unlike the action space, the discretization for the state space cannot be directly applied because the observed features of the system cannot be exactly divided into a discrete unit. The issue of the state space discretization can be resolved by introducing function approximation-based Q-learning, and our proposed method can be extended to the research line of function-approximated Q-value. Since tabular-based Q-learning is utilized in this study, we had to conduct a process of relaxation for the system environment, which makes the system tractable using tabular Q-learning and we describe the details in Section IV-A.

As defined in Section II-A, the system we handle has several constraints. Therefore, our agent model has a constrained action space. Although RL with a constrained action space is a difficult issue and has recently been the focus of much research, this topic is not the focus of this study. Therefore, we handle the constrained action space of our agent by adopting the simple rule of taking the action that has the highest Q-value among the feasible actions (i.e., masking all infeasible actions).
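The masking rule can be illustrated with a short sketch. The feasibility test is passed in as a hypothetical callback `is_feasible(state, action)`; in the setting of this paper it would encode the constraints of Section II-A, which are not reproduced here.

```python
def greedy_feasible_action(q_table, state, actions, is_feasible):
    """Pick the highest-valued action among those that pass the feasibility mask."""
    feasible = [a for a in actions if is_feasible(state, a)]
    if not feasible:
        raise ValueError("no feasible action in this state")
    return max(feasible, key=lambda a: q_table.get((state, a), 0.0))


# Toy example: the state is the SOC level and the action is an ESS discharge
# amount, so discharging more than the stored energy is infeasible.
toy_q = {(20, 30): 9.0, (20, 20): 4.0, (20, 10): 5.0, (20, 0): 1.0}
chosen = greedy_feasible_action(
    toy_q, state=20, actions=[0, 10, 20, 30],
    is_feasible=lambda soc, discharge: discharge <= soc,
)
```

In the toy example the action with the highest Q-value (discharging 30) is masked out, so the rule falls back to the best feasible action.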

III-B Delayed Q-update technique for ESS Controls

In previous research on applying Q-learning to operation cost minimization in microgrid systems, hourly operation costs were provided as part of an immediate reward. However, the effect of the ESS charging control in the current period is not reflected in the current immediate reward but in the value of the future period with discharging control. Although in the updated formula of temporal-difference (TD), the expected future reward may reflect the value of charging control, the approximation for the value of charging control is incorrect for the true value because TD has an issue with the biased value-function [23]. This bias issue occurs more seriously under the large action space with the delayed effective property of the microgrid. Thus, the values of the ESS charging control are depreciated, resulting in undesirable operations utilizing ESS operation inefficiently. This issue does not often occur in previous studies because most of them consider only simple action spaces of an agent. As the degree of the action space’s fineness increases, the learning agent needs more iterations to approximate the true value of charging control. To achieve precise Q-value assignment with a reasonable number of iterations, the delayed Q-update technique is proposed to relax this issue by conducting a delayed credit assignment to the ESS charging control.

The delayed Q-update technique requires a first-in-first-out (FIFO) queue for storing and tracking the history of the ESS charging amount and the state-action pair in the periods with charging control. Thereafter, when a discharging control is conducted using the stored electricity, an adjusted value is assigned to the corresponding previous charging actions. This adjusted value is proportional to the deviation between the electricity price in the previous charging period and the price in the current discharging period. The deviation of prices between these two periods measures how low the price in the charging period is and how high the price in the discharging period is. Thus, the deviation can serve as an indicator of the desirability of ESS charging/discharging pairs (i.e., a desirable policy entails charging the ESS in a period with a low price and discharging it in a period with a high price). In this technique, we assume that electricity charged earlier has a higher priority for discharging, which is why we employ the FIFO queue. Equation (16) represents the updating formula of the Q-value of charging control in the previous period . Here, represents the charging amount at period which is utilized by the discharging control at period and denotes the adaptation rate for the delayed Q-update. The detailed flow of the delayed Q-update technique with Q-learning is described in Algorithm 1. The time complexity of the proposed algorithm is , which is somewhat higher than that of the original Q-learning algorithm (i.e., ). However, as mentioned in Subsection III-A, the control variable of the ESS operation is discretized by 10 units and the charging/discharging controls are bounded by a certain amount of the capacity, so the time complexity of the proposed algorithm is reduced to , which is the same as that of the Q-learning algorithm.

1:Initialize Q-function:
2:Initialize FIFO Queue:
3:for episode  do
4:     receive initial observation state
5:     for =1, do
6:         if with probability  then
7:              select random
8:         else
10:         execute and observe and
11:         if  then
12:              insert to
13:         else if  then
15:              while  do
18:                  if  then
21:                  else
Algorithm 1 Q-learning algorithm with Delayed Q-update
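The FIFO bookkeeping behind the delayed Q-update can be sketched as follows. Since Equation (16) is not reproduced above, the credit is assumed here to be the adaptation rate times the amount of previously charged electricity consumed by the discharge times the price difference between the discharging and charging periods; all names (`record_charge`, `delayed_q_update`, `beta`) are hypothetical.

```python
from collections import defaultdict, deque


def record_charge(queue, state, action, amount, price):
    """Store a charging step so that credit can be assigned to it later."""
    queue.append([state, action, amount, price])


def delayed_q_update(q_table, queue, discharge_amount, price_now, beta):
    """Assign delayed credit to earlier charging actions, oldest first.

    Assumed credit: beta * used_amount * (price_now - price_at_charge).
    Returns the part of the discharge not matched by any recorded charge.
    """
    remaining = discharge_amount
    while remaining > 0 and queue:
        entry = queue[0]
        state_k, action_k, charged, price_k = entry
        used = min(charged, remaining)
        q_table[(state_k, action_k)] += beta * used * (price_now - price_k)
        if used < charged:
            entry[2] = charged - used   # partially consumed, keep at the front
        else:
            queue.popleft()             # fully consumed, drop from the queue
        remaining -= used
    return remaining


q = defaultdict(float)
history = deque()
record_charge(history, "s0", -20, amount=20, price=70)    # cheap period
record_charge(history, "s1", -10, amount=10, price=130)   # expensive period
leftover = delayed_q_update(q, history, discharge_amount=25,
                            price_now=140, beta=0.1)
```

In this example the charge made at the low price receives a much larger credit than the one made at the high price, which is the behavior the technique is meant to encourage.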

IV Experimental results

In this section, we demonstrate that the policy derived using our proposed technique with the Q-learning algorithm outperforms the benchmark operation policy derived using the original method of the Q-learning algorithm. In addition, to validate the optimality of the policy, we compare the performance of our policy with that of an optimal operation policy derived using dynamic programming. We conduct a simulation for real-world grid-connected microgrid systems using both our proposed approach and original Q-learning approach as a benchmark, and we verify that the derived operation policy is relatively superior to the benchmark policy based on the proposed performance measures.

IV-A Experimental setting

As discussed in [24], we experiment with real-world data for a campus microgrid installed in a university in South Korea. The microgrid has solar PV and ESS, and it covers part of the electricity demand of the university library. We use the actual data on the hourly electricity demand and solar PV generation of the microgrid. In addition, we assume that the microgrid can trade electricity with the utility grid at the system marginal price (SMP), and we obtain the hourly SMP data from the website of the Korea Power Exchange. The average electricity retail price in South Korea is reported to be approximately 100 KRW/kWh [25]. To create conditions in which the microgrid uses the ESS actively, we amplify the fluctuation of the SMP by multiplying the SMP values by a scale-up (discount) factor. Lastly, we refer to several previous studies to set appropriate values for the other parameters of our models described in Section II, as summarized in Table II (the unit generation cost of the dispatchable generator in South Korea is approximately 500 KRW/kWh [26], and the unit cost of the demand response is set at 200 KRW/kWh [21]).

parameter value   parameter value
500   200
50   30
60   0
50   50
1.0   0.2
TABLE II: Parameters of grid-connected microgrid system

As previously mentioned, a relaxation process is conducted to convert the observed features of the system into a discretized state space. In this process, we round the hourly PV generation, electricity demand, and SMP to the tens place. As a result, the PV generation takes 4 discrete values (0,10,20,30), the electricity demand takes 8 discrete values (40,50,60,70,80,90,100,110), and the SMP takes 3 discrete values (70,130,140). Furthermore, we apply the discretization rule to the microgrid system elements of the state space. The output of the dispatchable generators at the previous hour, , is divided in the same way as the discrete controls of the action space for the dispatchable generators. The SOC level () takes the same number of discrete values as the discretized ESS storage capacity, which is divided by 10 units. Therefore, the SOC level takes (=6) discrete values, obtained by dividing the possible range () by 10 units. Considering the discrete values for the relaxed observations of the system, the size of the state space is 3024.

By following this discretization rule for the action space of the agent, the dispatchable generators are discretized into 7 controls. Likewise, the operation of the ESS is discretized into 11 controls and the demand response is discretized into 3 controls. Thus, the size of the action space is 231, achieved by multiplying the sizes of discrete controls within every control variable.
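The size of the action space can be checked by enumerating the discrete controls. The counts (7, 11, and 3) come from the text; the concrete ranges used below (0 to 60 for the dispatchable generator, -50 to 50 for the ESS, 0 to 20 for the demand response) are assumptions chosen to reproduce those counts with a step of 10 units.

```python
from itertools import product

STEP = 10  # discretization unit used throughout the text

# Assumed ranges; only the number of levels per variable is given in the text.
dg_levels = list(range(0, 60 + STEP, STEP))       # dispatchable generator
ess_levels = list(range(-50, 50 + STEP, STEP))    # negative values mean charging
dr_levels = list(range(0, 20 + STEP, STEP))       # demand response

# Each action is a (dg, demand_response, ess) tuple; the grid trade is not
# enumerated because it follows from the balance constraint.
action_space = list(product(dg_levels, dr_levels, ess_levels))
```

Infeasible combinations are not removed at this stage, since the feasibility of an action depends on the current state and is handled by masking at selection time.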

By conducting several rounds of tuning, we fix the hyper-parameters for the learning agent including the number of episodes (), learning rate (), adaptation rate for delayed Q-update (), and discount factor (). These hyper-parameter values are 3000, 0.3, 0.00001, and 0.9, respectively.

We derive an operation policy using Q-learning with the delayed Q-update technique, trained on data from March 2018 (a one-month period). After the policy training step, we validate the derived operation policy over a validation period of approximately 24 hours ().

IV-B Performance measure

For the validation, we define one of the performance measures as the average cost of microgrid operation, which is the daily average cost in the valid horizon. The formula for this measure is defined in Equation (17). Here, denotes the last index of the valid period.


Furthermore, to focus solely on the performance of the ESS operation, we define a benefit measure of ESS operation, which is the summation of the cost of ESS charging when ESS is charged and the opportunity cost of ESS discharging when ESS is discharged. The formula for the benefit measure of the ESS operation is defined in Equation (18).


To identify the degree of convergence of the updated policy, we define a convergence measure, the Q-value difference, which is the root mean square error between the Q-values of the current and previous epochs. The formula for this performance measure is defined in Equation (19). Here, denotes the Q-values of state-action pairs after iterations (epoch) are finished. The measure expresses the degree of variation between the Q-values of all state-action pairs, so it can assess the degree of convergence.
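A minimal sketch of this convergence measure is given below, assuming the Q-tables of two consecutive epochs are stored as dictionaries keyed by state-action pair and that pairs missing from either table have a Q-value of zero. The names are hypothetical.

```python
import math


def q_value_difference(q_previous, q_current):
    """Root mean square error between two Q-tables over all known pairs."""
    keys = set(q_previous) | set(q_current)
    if not keys:
        return 0.0
    squared_error = sum(
        (q_current.get(k, 0.0) - q_previous.get(k, 0.0)) ** 2 for k in keys
    )
    return math.sqrt(squared_error / len(keys))


previous_epoch = {("s0", "a"): 0.0, ("s0", "b"): 0.0}
current_epoch = {("s0", "a"): 3.0, ("s0", "b"): 4.0}
difference = q_value_difference(previous_epoch, current_epoch)
```

A value that shrinks toward zero over epochs indicates that the Q-table, and hence the greedy policy derived from it, has stopped changing.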


We adopt a regret value for the average cost, which is a relative measure that evaluates a policy by comparing it with an optimal policy on the average cost measure (Equation (20)). Applying the same concept to the ESS benefit measure, the regret for the benefit measure of ESS operation is formulated in Equation (21). These regret values evaluate how much each measure of the given policy deviates from the optimal measure, so a policy with a large regret is undesirable.


IV-C Result

We simulate the operation policy derived using Q-learning with the delayed Q-update technique in a grid-connected microgrid system. To validate the effect of the delayed Q-update technique, we conduct a comparative analysis of the operation policies derived using Q-learning with and without the delayed Q-update. Figure 1 shows the trends of the average cost plotted against operation policy updates using the proposed approach and the original Q-learning. In Figure 1(a), the average cost shows a declining trend that stabilizes as policy updating continues and converges to a certain point. In Figure 1(b), the trend is more unstable than that of our proposed method, and it may not converge.

Fig. 1: Trend of average cost measure as the learning goes on using (a) Q-learning with delayed Q-update technique and (b) original Q-learning

Figure 2 shows the trends of the ESS benefit measure versus operation policy updates using the proposed approach and the original Q-learning. To clarify the direction of each trend, we overlay a smoothed trend, denoted by the red line, on each result. As shown in Figure 2(a), the ESS benefit measure improves as learning goes on using the proposed approach. Figure 2(b) shows that, with the original Q-learning, the ESS benefit measure does not improve and even decreases over a certain phase.

Fig. 2: Trend of the ESS benefit measure as the learning goes on using (a) Q-learning with delayed Q-update technique and (b) original Q-learning
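The text does not state which smoother produces the red trend line. As one possible sketch, a trailing moving average could serve that purpose; the function name, the window size, and the sample values below are assumptions made for illustration only.

```python
def moving_average(values, window):
    """Trailing moving average; early points average over the values seen so far."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            running_sum -= values[i - window]  # drop the value leaving the window
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


# Hypothetical ESS benefit values recorded after successive policy updates.
benefit_history = [-300, -100, 0, 100, 200]
trend = moving_average(benefit_history, window=2)
print(trend)
```

A larger window gives a flatter line that shows the long-run direction more clearly, at the cost of reacting more slowly to recent updates.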

Figure 3 shows the trends of the Q-value difference measure plotted against operation policy updates using the proposed approach and the original Q-learning. In Figure 3(a), the Q-value difference declines after a warm-up phase and keeps decreasing as policy updating goes on, until it converges to a certain level. In Figure 3(b), however, the convergence is slower and the decrease is less efficient than with our proposed method.

Fig. 3: Trend of the Q-value difference measure as the learning goes on using (a) Q-learning with delayed Q-update technique and (b) original Q-learning

For the validation data, Figure 4(a) shows the operation result of the policy derived by the proposed approach, and Figure 4(b) shows the operation result of the policy derived by the original Q-learning algorithm. As the results show, the policy derived by the original Q-learning conducts undesirable ESS control: it frequently charges the ESS in periods with a high SMP price and discharges it in periods with a low SMP price. Therefore, our delayed Q-update technique is necessary to derive a reasonable operation policy under our problem setting.

Fig. 4: Microgrid operation result using a policy derived by (a) Q-learning with delayed Q-update technique and (b) the original Q-learning algorithm

To verify the optimality of our proposed approach, we conduct a simulation using a DP-driven operation policy (i.e., the optimal policy) under the predefined system. Figure 5 shows the operation result based on the optimal operation policy. Under the optimal operation policy, the optimal average cost is 8183.33 and the optimal ESS benefit is 375 on the validation data. Although our operation policy differs somewhat from the optimal policy, we confirm that it is similar to the optimal policy in that it charges the ESS in periods with a low SMP price and discharges the ESS in periods with a high SMP price. Thus, we conclude that the operation policy derived using Q-learning with the delayed Q-update technique converges to a near-optimal operation policy.

Fig. 5: Microgrid operation result using optimal policy

To analyze the above performance measures, we summarize the final performance of each operation policy in Table III. As the comparative analysis shows, applying the delayed Q-update technique allows us to derive an operation policy that outperforms the policy generated by the original Q-learning algorithm. Thus, we verify that the delayed Q-update technique is beneficial in situations where sophisticated control of the microgrid system is required.

| Delayed Q-update | Average cost | ESS benefit | AC-regret | EB-regret |
|---|---|---|---|---|
| without | 9789.58 | -295.83 | 1606.25 | 670.83 |
| with | 8462.5 | 191.66 | 297.17 | 183.34 |
| difference | -13.55% | 164.78% | -81.5% | -72.66% |
TABLE III: Comparative analysis based on average cost, ESS benefit, and regret measures for the final derived policy under valid data between the Q-learning with the delayed Q-update technique and the original Q-learning methods
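The difference row of Table III can be read as a signed relative change from the policy without the delayed Q-update to the policy with it. The sketch below assumes the change is normalized by the absolute value of the baseline, which is needed for the negative ESS benefit; the function and variable names are hypothetical.

```python
def relative_change_pct(without, with_):
    """Signed change from `without` to `with_`, as a percentage of |without|."""
    if without == 0:
        raise ValueError("baseline must be non-zero")
    return (with_ - without) / abs(without) * 100.0


# (without, with) pairs as reported in Table III.
table_iii = {
    "average cost": (9789.58, 8462.5),
    "ESS benefit": (-295.83, 191.66),
    "AC-regret": (1606.25, 297.17),
    "EB-regret": (670.83, 183.34),
}

changes = {name: relative_change_pct(a, b) for name, (a, b) in table_iii.items()}
for name, pct in changes.items():
    print(f"{name}: {pct:+.2f}%")
```

Under this assumption the computed percentages agree with the reported difference row to within about 0.01 percentage points, a gap that is consistent with rounding or truncation of the last digit.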

V Conclusion

The main contribution of our study is the delayed Q-update technique, which derives operation policies more efficiently in a real-world grid-connected microgrid. Sophisticated controls are required in a real-world microgrid system, so the action space of the Q-learning agent is constructed as a large discrete action space that determines the level of every DER. Typical Q-learning is limited in deriving an operation policy in such a large action space, especially because the ESS has a delayed-effect property that prevents the algorithm from deriving a desirable policy efficiently. To address this limitation, we introduce a novel delayed credit assignment technique, the delayed Q-update, which supports the Q-learning algorithm in deriving desirable operation policies under the detailed control setting. As the experimental results demonstrate, the operation policy derived using the proposed approach outperforms the benchmark policy derived using the original Q-learning under the adopted performance measures. In addition, these results indicate that our operation policy converges to a near-optimal operation policy in the real-world grid-connected microgrid system.

One limitation of this study is that the agent and environment are formulated in a simple setting with discretized electricity demand, SMP price, and PV generation amount, adjusted by the relaxation process. However, this simple version of the environment can be extended to a real-world version with continuous variables by adopting a recent RL paradigm that uses a DNN for function approximation in Q-learning, the so-called DQN [27, 28]. In addition to incorporating a DNN, we plan in future research to formulate a model that accounts for battery maintenance costs, in order to reduce the rapid switching of ESS controls in our derived operation policy, and to propose a corresponding solution approach.


  • [1] K. Milis, H. Peremans, and S. V. Passel, “The impact of policy on microgrid economics: A review,” Renewable and Sustainable Energy Reviews, vol. 81, pp. 3111–3119, 2018.
  • [2] T. S. Ustun, C. Ozansoy, and A. Zayegh, “Recent developments in microgrids and example cases around the world-a review,” Renewable and Sustainable Energy Reviews, vol. 15, pp. 4030–4041, 2011.
  • [3] S. Mishra, K. Anderson, B. Miller, K. Boyer, and A. Warren, “Microgrid Resilience: A holistic approach for assessing threats, identifying vulnerabilities, and designing corresponding mitigation strategies,” Applied Energy, vol. 264, 2020.
  • [4] W. Su, J. Wang, and J. Roh, “Stochastic energy scheduling in microgrids with intermittent renewable energy resources,” IEEE Transactions on Smart Grid, vol. 5, pp. 1876–1883, 2013.
  • [5] S. Talari, M. Yazdaninejad, and M. Haghifam, “Stochastic-based scheduling of the microgrid operation including wind turbines, photovoltaic cells, energy storages and responsive loads,” IET Generation, Transmission & Distribution, vol. 9, pp. 1498–1509, 2015.
  • [6] T. A. Nguyen and M. Crow, “Stochastic optimization of renewable-based microgrid operation incorporating battery operating cost,” IEEE Transactions on Power Systems, vol. 3, pp. 2289–2296, 2015.
  • [7] S. A. Pourmousavi, M. H. Nehrir, M. C. Christopher, and C. Wang, “Real-Time Energy Management of a Stand-Alone Hybrid Wind-Microturbine Energy System Using Particle Swarm Optimization,” IEEE Transactions on Sustainable Energy, vol. 1, pp. 193–201, 2010.
  • [8] H. Karami, M. J. Sanjari, S. H. Hosseinian, and G. B. Gharehpetian, “An optimal dispatch algorithm for managing residential distributed energy resources,” IEEE Transactions on Smart Grid, vol. 5, pp. 2360–2367, 2014.
  • [9] F. D. Li, M. Wu, Y. He, and X. Chen, “Optimal control in microgrid using multi-agent reinforcement learning,” ISA Transactions, vol. 51, pp. 743–751, 2012.
  • [10] E. Kuznetsova, Y. F. Li, C. Ruiz, E. Zio, G. Ault, and K. Bell, “Reinforcement learning for microgrid energy management,” Energy, vol. 59, pp. 133–146, 2013.
  • [11] R. Leo, R. S. Milton, and S. Sibi, “Reinforcement learning for optimal energy management of a solar microgrid,” 2014 IEEE Global Humanitarian Technology Conference-South Asia Satellite (GHTC-SAS), pp. 183–188, 2014.
  • [12] G. K. Venayagamoorthy and P. K. Gautam, “Dynamic Energy Management System for a Smart Microgrid,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, pp. 1643–1656, 2016.
  • [13] S. Kim and H. Lim, “Reinforcement learning based energy management algorithm for smart energy buildings,” Energies, vol. 11, p. 2010, 2018.
  • [14] T. Levent, P. Preux, E. Le Pennec, J. Badosa, G. Henri, and Y. Bonnassieux, “Energy Management for Microgrids: a Reinforcement Learning Approach,” 2019 IEEE PES Innovative Smart Grid Technologies Europe (ISGT-Europe), pp. 1–5, 2019.
  • [15] Y. Shang, W. Wu, J. Guo, Z. Lv, Z. Ma, W. Sheng, and R. Chen, “Stochastic Dispatch of Energy Storage in Microgrids: A Reinforcement Learning Approach Incorporated with MCTS,” arXiv preprint arXiv:1910.04541, 2019.
  • [16] Q. Wei, D. Liu, and G. Shi, “A novel dual iterative Q-learning method for optimal battery management in smart residential environments,” IEEE Transactions on Industrial Electronics, vol. 62, pp. 2509–2518, 2014.
  • [17] V. François-Lavet, D. Taralla, D. Ernst, and R. Fonteneau, “Deep reinforcement learning solutions for energy microgrids management,” European Workshop on Reinforcement Learning (EWRL 2016), 2016.
  • [18] T. Chen and W. Su, “Local Energy Trading Behavior Modeling With Deep Reinforcement Learning,” IEEE Access, vol. 6, pp. 62 806–62 814, 2018.
  • [19] Y. Ji, J. Wang, J. Xu, X. Fang, and H. Zhang, “Real-Time Energy Management of a Microgrid Using Deep Reinforcement Learning,” Energies, vol. 12, p. 2291, 2019.
  • [20] V. H. Bui, A. Hussain, and H. M. Kim, “Double Deep Q-Learning-Based Distributed Operation of Battery Energy Storage System Considering Uncertainties,” IEEE Transactions on Smart Grid, 2019.
  • [21] D. G. Choi, D. Min, and J. H. Ryu, “Effective subsidy policy for a Grid-connected Microgrid: A Korean case study,” Submitted, 2019.
  • [22] H. Park, M. K. Sim, and D. G. Choi, “An intelligent financial portfolio trading strategy using deep Q-learning,” Expert Systems with Applications, p. 113573, 2020.
  • [23] R. S. Sutton and S. P. Singh, “On step-size and bias in temporal-difference learning,” Proceedings of the Eighth Yale Workshop on Adaptive and Learning Systems, pp. 91–96, 1994.
  • [24] B. Jeong, D. Shin, J. Im, J. Park, and Y. Kim, “Implementation of Optimal Two-Stage Scheduling of Energy Storage System Based on Big-Data-Driven Forecasting—An Actual Case Study in a Campus Microgrid,” Energies, vol. 12, p. 1124, 2019.
  • [25] K. E. E. Institute. (2018) Korea retail price: Electricity: Average. [Online]. Available:
  • [26] J. Song, S. Oh, Y. Yoo, S. Seo, I. Paek, Y. Song, and S. J. Song, “System design and policy suggestion for reducing electricity curtailment in renewable power systems for remote islands,” Applied Energy, vol. 225, pp. 195–208, 2018.
  • [27] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, et al., “Playing Atari with Deep Reinforcement Learning,” arXiv preprint arXiv:1312.5602, 2013.
  • [28] V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, pp. 529–533, 2015.