I Introduction
A microgrid is a system that integrates distributed energy resources (DERs) to supply the local electricity demand. The DERs include several different types of resources, such as renewable energy sources (RES), energy storage systems (ESS), and small dispatchable generators (DGs; e.g., diesel engines and gas turbines). In addition, in the grid-connected mode, electricity is tradable with external power networks (e.g., the utility grid). The benefits of a microgrid are as follows: (1) the utilization of RES, which efficiently reduces carbon emissions, (2) an improvement in local energy delivery with low investment in both generation and transmission capacity, (3) an improvement in the reliability and resilience of the local energy supply, and (4) a reduction in the costs associated with long-term infrastructure investment [1, 2]. In response to these benefits, the installed capacity of microgrids is growing fast and is expected to reach 8.8 GW by 2024 [3].
One general goal of microgrid operation is to minimize operation costs by generating an optimal schedule of the DERs' outputs while satisfying the constraints of the system. The optimal operation problem involves a sequential optimization process under uncertain variables such as time-varying demand and electricity prices. Therefore, many previous studies have attempted to use methods for sequential decision problems, such as dynamic programming (DP) and stochastic programming (SP), to derive the optimal operation policy [4, 5, 6]. However, most proposed methods impose too heavy a computational burden for practical application. Therefore, some studies proposed metaheuristic algorithms to derive the optimal operation policy within a reasonable time [7, 8]. However, these studies cannot fully capture the stochastic characteristics of real system dynamics.
Responding to the limitations of previous studies, some studies have attempted to derive the optimal operation policy of microgrids based on reinforcement learning (RL) methods. RL-based methods are a useful machine learning approach wherein the learning agent optimizes its policy through sequential interactions with the environment, without requiring any knowledge of the system dynamics.
[9] suggested an effective method for deriving a microgrid operation policy using multi-agent RL, relaxing the curse-of-dimensionality issue. Using multi-step Q-learning, [10] developed an RL framework for autonomous multi-state and multi-criteria decision-making for medium-term scenarios of energy storage management. Similarly, [11] proposed a forecasting-based multi-step-ahead Q-learning algorithm to manage microgrid systems. [12] introduced action-dependent heuristic dynamic programming with decision tree rules and applied an evolutionary algorithm to enhance the convergence speed for finding a near-optimal control policy.
[13] proposed an algorithm to derive a policy for smart energy building management using the tabular Q-learning method. [14] proposed a simple rule-based microgrid control policy by applying a decision tree algorithm to a Q-table derived using Q-learning. [15] derived a continuous policy for dispatching the ESS using a novel RL algorithm combined with Monte Carlo tree search methods, reducing the size of the search space needed to conduct multi-step bootstrapping.
More recently, following the development of deep RL (DRL) methods that combine RL and deep neural networks (DNNs), studies have suggested DRL-based methods for the optimal operation of microgrids. In
[16], a dual iterative Q-learning algorithm was proposed to derive an optimal ESS management policy in smart residential energy systems of a periodic nature. [17] applied a deep Q-learning (DQL) algorithm with a novel Q-network structure to derive a microgrid management strategy. [18] derived prosumers' electricity trading strategies using DQL in the local energy market. [19] formulated a Markov decision process (MDP) model for a microgrid interacting with the electricity market and derived an operation policy for the system using a deep Q-network (DQN) algorithm to conduct real-time energy management. [20] suggested a method that derives the operation strategy of a community ESS using double DQL in both grid-connected and islanded modes with different objectives.
In previous studies proposing Q-learning-based algorithms [17, 18, 13, 20], the defined problems contain only one action, either charging/discharging of the ESS or buying/selling electricity from the external grid. However, these approaches are insufficient for real microgrid applications because they cannot handle problems with sophisticated action spaces. That is, in addition to determining the direction of each DER, the algorithm needs to be able to determine the level along that direction. For example, the algorithm should determine the amount of electricity charged into or discharged from the ESS. Although [19] considered a sophisticated action space in the microgrid system, they assumed that the microgrid had only one controllable DER, the ESS. To the best of our knowledge, little research has proposed DRL-based methods for several controllable DERs with sophisticated controls in a microgrid system.
Of course, adopting a sophisticated action space for controlling several DERs simultaneously causes an issue: the number of state-action pairs increases exponentially. In particular, when Q-learning-based methods are utilized, the formation of a large discrete action space is inevitable. With such a large discrete action space, the numerous Q-values of the state-action pairs must be explored sufficiently to approximate their true values, which a learning agent cannot achieve in a reasonable time. This issue leads to the learning agent's inability to approximate the future value in the learning phase. Particularly in the context of the microgrid operation problem, the agent fails to capture the delayed-effect property: the current ESS charging control initially incurs costs, but its benefit materializes in the future when the discharging control is conducted. Consequently, the agent pursues only the myopic goal instead of planning for the long-term objective.
Although there are several typical approximation techniques in the Q-learning-based approach to help the agent adapt to a complex system, these approaches have not been applied well to the microgrid operation problem. Some studies strove to remedy the agent's myopic perspective by adopting multi-step-ahead approximation of the action value within the Q-learning algorithm [10, 11]. However, these techniques require prediction-based scenarios for future steps. Therefore, they can only be used for simple environmental settings with a short-term planning horizon. When the size of the action space for sophisticated control of the system increases, these techniques still fall short of resolving the agent's myopic perspective. To overcome the issue of the large action space, this study proposes a novel credit assignment technique, delayed Q-update, to derive a well-fitted operation policy under the sophisticated action space for controlling the microgrid system.
In this context, the purpose of this study is to create a technique that allows a Q-learning-based algorithm to derive an optimal operation policy for a grid-connected microgrid system under the sophisticated action space. To apply the proposed technique, we formulate an MDP, emulating the real microgrid control system, that conducts sophisticated controls of several DERs. To demonstrate the optimality of an operation policy derived using our proposed algorithm, we conduct a simulation of our operation policy with real-world microgrid data and compare the result with an optimal operation policy derived using DP. The main contributions of this study can be summarized as follows:

- This is the first study investigating an algorithm for the optimal operation of a microgrid system based on a Q-learning method with sophisticated controls of several controllable DERs, which is much more applicable to real-world microgrid systems.

- We devise the delayed Q-update technique and a corresponding Q-learning algorithm that supports the learning agent in deriving a well-fitted operation policy by overcoming the delayed-effect property of the microgrid operation problem.
The remainder of this paper is organized as follows. Section 2 defines our problem and formulates the MDP model, and Section 3 explains the delayed Q-update technique with the Q-learning algorithm we use. In Section 4, we provide experimental results to validate the advantages of our approach. Finally, we conclude in Section 5 by providing relevant implications and identifying directions for future research.
II Problem definition
In this section, we introduce the mathematical models of the microgrid operation problem. The considered microgrid system involves several DERs, namely solar photovoltaic (PV) generation, a dispatchable generator, an ESS, and demand response, and it is connected to the utility grid. In the grid-connected mode, electricity can be traded with the external utility grid to absorb surplus electricity or cover a shortage of supply in the system. We assume that the price of electricity traded with the utility grid is determined a day before in the day-ahead market. We would like to derive an optimal operation policy for the microgrid, which minimizes the long-run operation costs under the system constraints. The mathematical formulation in Subsection II-A describes our microgrid operation problem, similar to our previous study [21]. Then, we explain how to build an MDP model for our RL-based algorithm.
Indices  
index for time  
Parameters  
unit time period (e.g., hour)  
length of planning horizon  
the number of episodes  
dispatchable generation unit cost (e.g., KRW/kWh)  
load curtailment unit cost (e.g., KRW/kWh)  
discharging unit cost of the ESS (e.g., KRW/kWh)  
transmission capacity between the microgrid and the utility grid (e.g., kW)  
capacity of the dispatchable generation (e.g., kW)  
minimum generation of the dispatchable generation (e.g., kW)  
maximum ramp-up/down capacity of the dispatchable generator (e.g., kW)  
ESS charging/discharging capacity (e.g., kW)  
ESS storage capacity (e.g., kWh)  
controllable demand rate (%)  
efficiency of the ESS (%)  
Variables  
state of charge of the ESS at time (e.g., kWh)  
electricity price of the utility grid at time (e.g., KRW/kWh)  
electricity demand at time (e.g., kWh)  
renewable generation at time (e.g., kWh)  
amount of electricity traded with the utility grid at time (e.g., kWh)  
amount of electricity generated by the dispatchable generation at time (e.g., kWh)  
amount of electricity charged or discharged in the ESS at time (e.g., kWh)  
amount of curtailed electricity demand at time (e.g., kWh) 
II-A Mathematical modeling
The mathematical model for our microgrid operation problem is described by Equations (1)-(6), and all notations for the model are summarized in Table I.
(1)  
s.t.  (2)  
(3)  
(4)  
(5)  
(6) 
Equation (1) represents the objective of our microgrid operation problem, minimizing total operation costs over the planning horizon. The operation costs of each time step can be defined as below (Equation (7)):
(7) 
where (·)+ denotes the nonnegative part (i.e., max(·, 0)). All other equations describe the system constraints. Equation (2) represents the balance constraint between the supply and demand of electricity. As constraints on the dispatchable generator, Equation (3) represents the upper/lower bounds on its generation, and Equation (4) limits the difference from its previous output within a unit time (i.e., the ramp-up/down constraint of the dispatchable generator). In Equation (5), the upper bound means that the amount of electricity discharged from the ESS is limited by the rated capacity and by the currently stored energy, accounting for the efficiency rate; the lower bound means that the amount of electricity charged into the ESS is limited by the rated capacity and by the remaining storage headroom above the current state of charge (SOC), again accounting for the efficiency rate. Equation (6) implies that the amount of demand response must be nonnegative and lower than the maximum controllable demand. Although the maximum transmission capacity from/to the utility grid is constrained, we can ignore this constraint because the capacity is large enough to transmit the required amounts in our problem.
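As a concrete illustration, the per-step constraints (3)-(6) can be sketched as a feasibility check. All function and parameter names here are illustrative assumptions (the paper's notation is not reproduced), and the default values mirror the experimental setting of Section IV; the efficiency handling is one plausible reading of Equation (5).

```python
def is_feasible(a_dg, a_ess, a_dr, prev_dg, soc, demand, renewable,
                dg_min=0.0, dg_cap=60.0, ramp=30.0,
                ess_rate=50.0, ess_cap=50.0, eta=1.0, dr_rate=0.2):
    """Check constraints (3)-(6) for one time step.

    Sign conventions follow the paper: a_ess > 0 discharges the ESS,
    a_ess < 0 charges it. The grid exchange is fixed by the balance
    constraint (2), so it is not checked here.
    """
    # (3) dispatchable generator output bounds
    if not dg_min <= a_dg <= dg_cap:
        return False
    # (4) ramp-up/down limit relative to the previous hour's output
    if abs(a_dg - prev_dg) > ramp:
        return False
    # (5) discharge limited by the rate and the stored energy; charge
    #     limited by the rate and the remaining headroom (with losses)
    if not -min(ess_rate, (ess_cap - soc) * eta) <= a_ess <= min(ess_rate, soc / eta):
        return False
    # (6) demand response bounded by the controllable share of demand
    if not 0.0 <= a_dr <= dr_rate * demand:
        return False
    return True
```

A check like this is what the masking rule of Section III-A would consult before selecting an action.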
II-B MDP modeling
In our microgrid operation problem, the RL agent considers five key features of the microgrid system. The first feature is the output of the dispatchable generator in the previous hour, which is used for the ramp-up/down constraint (i.e., Equation (4)). The second and third are the output of the renewable energy generators and the amount of demand currently occurring. Finally, the current SOC level and the electricity price of the utility grid are needed to configure the current situation of the microgrid. In summary, the state space of the RL agent can be represented as follows (Equation (8)).
(8) 
The RL agent needs to determine the outputs of the controllable DERs and the amount of electricity traded with the utility grid. As a result, the action space of the RL agent has four actions, as shown by Equation (9). The first action is the amount of electricity traded with the utility grid; a negative value means that the microgrid sells electricity to the external grid. The outputs of the small DGs and the demand response are also considered as actions. The last one is the charging/discharging amount of the ESS; a negative value means that the microgrid charges electricity into the ESS. Here, the agent selects only the last three elements of the action space, as the first element is automatically determined by the balance constraint between supply and demand. By doing this, any action selected by the agent satisfies the first constraint (Equation (2)), and the dimension of the control variables decreases to a tractable size.
(9) 
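Because the grid exchange is pinned down by the balance constraint (Equation (2)), it can be computed once the other controls are chosen. A minimal sketch with illustrative names, using the paper's sign conventions (positive ESS action discharges; demand response reduces the demand to be served):

```python
def grid_trade(demand, renewable, a_dg, a_dr, a_ess):
    """Grid exchange implied by the supply-demand balance (Eq. 2).

    A negative result means the microgrid sells surplus electricity to
    the utility grid. All variable names are illustrative.
    """
    return (demand - a_dr) - renewable - a_dg - a_ess
```

This is why the agent's effective action space has only three free dimensions.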
The reward in the MDP model should reflect how much the agent's action contributed to minimizing the microgrid operation costs. This reward could simply be defined as the negative of the microgrid operation cost per time step (i.e., one hour). However, with this simple definition, nonstationary reward criteria may arise depending on the demand phase (e.g., peak demand versus base demand). For example, during the base demand phase (i.e., periods with relatively low demand), no matter how undesirable the conducted action is, the agent obtains a relatively positive reward. In contrast, during the peak demand phase (i.e., periods with relatively high demand), regardless of how desirable the conducted action is, the agent obtains a relatively negative reward. Therefore, the reward formula should be robust under nonstationary conditions and constitute consistent criteria. In addition, to conduct stable learning, the scale of the reward must be adjusted to a reasonable range. A rescaling approach for the reward in Q-learning is formulated in a related study [19]. Similarly, for a reasonable reward scale, we define how desirable the current action is relative to the cost of the worst operation that the agent can perform in each hour. This reward formula is similar to that of [22] in that it is also a change rate relative to a baseline. The worst operation that the agent can perform in this problem is to satisfy the residual demand remaining after the RES production is consumed entirely with the dispatchable generator (i.e., the unit cost of the dispatchable generator is higher than those of the other resources in the experiment). Therefore, we define the reward as the negation of the change rate of the hourly operation costs relative to the cost resulting from the worst action (Equation (10)).
(10) 
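A sketch of this reward, reconstructed from the description above: the baseline is the cost of serving the residual demand entirely with the dispatchable generator, and the reward is the negated change rate of the actual hourly cost relative to that baseline. The function name, the unit cost default, and the guard for zero residual demand are illustrative assumptions.

```python
def reward(cost, demand, renewable, c_dg=500.0):
    """Reward (Eq. 10): -(cost - worst_cost) / worst_cost, where
    worst_cost is the cost of covering the residual demand with the
    dispatchable generator alone (the most expensive resource)."""
    worst_cost = c_dg * max(demand - renewable, 0.0)
    if worst_cost == 0.0:   # no residual demand to serve this hour
        return 0.0
    return (worst_cost - cost) / worst_cost
```

Under this form, an hour operated at zero cost yields a reward of 1, and matching the worst case yields 0, which keeps the reward in a consistent range across demand phases.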
The system dynamics of the MDP model only consider the SOC level of the ESS because the other elements of the state do not depend on the actions of the previous time step. Equation (11) represents the dynamics of the SOC level as a function of the ESS action. In the equation, the SOC level decreases by the discharged amount multiplied by the efficiency rate when the ESS is discharged; otherwise, the SOC level increases by the charged amount divided by the efficiency rate when the ESS is charged.
(11) 
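The SOC transition can be sketched directly from the sentence above; the function name is illustrative, and the efficiency placement follows the paper's description verbatim.

```python
def next_soc(soc, a_ess, eta=1.0):
    """SOC transition (Eq. 11): a_ess > 0 (discharging) decreases the
    SOC by the discharged amount multiplied by the efficiency rate;
    a_ess < 0 (charging) increases the SOC by the charged amount
    divided by the efficiency rate."""
    if a_ess >= 0:                  # discharging
        return soc - a_ess * eta
    return soc + (-a_ess) / eta     # charging
```

With the experiment's efficiency of 1.0, the transition reduces to simple bookkeeping of the charged/discharged amounts.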
III Methodology
In this section, we introduce our proposed approach for deriving the optimal operation policy of a microgrid using Q-learning with a novel technique, delayed Q-update. We leverage the advantage of the Q-learning algorithm in keeping the agent from being trapped in local minima, unlike policy-based RL algorithms [20].
III-A Description of the Q-Learning Agent Model
Q-learning is a value-based RL method that approximates an action value (i.e., a Q-value) for each state-action pair. The Q-value represents the expected value of the discounted cumulative reward, initiated with the current state and selected action under a policy (Equation (12)). The optimal Q-value is induced by an optimal policy, which maximizes the Q-value of every state (Equation (13)), and this value is expressed using the Bellman equation (Equation (14)). In Q-learning, the optimal Q-value can be approximated by bootstrapping, meaning that the target value is built from the previously approximated value, so-called temporal-difference learning (Equation (15)). Here, α denotes the learning rate and γ denotes the discount factor for discounting the time value. As revealed in Equation (15), this algorithm is a model-free method: even if the agent has no knowledge of environmental factors such as the transition probability, it can develop a policy through repeated experience by following a behavior policy (e.g., epsilon-greedy). In addition, Q-learning is an off-policy algorithm; that is, the behavior policy for selecting the agent's action is not the same as the target policy for selecting the action in the target value.
(12)  
(13)  
(14)  
(15) 
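The TD update of Equation (15) and an epsilon-greedy behavior policy can be sketched in a few lines. The dictionary-based Q-table and the function names are illustrative; the hyperparameter defaults match the values reported in Section IV.

```python
import random
from collections import defaultdict

def td_update(Q, s, a, r, s_next, next_actions, alpha=0.3, gamma=0.9):
    """One temporal-difference update (Eq. 15):
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_b Q(s',b) - Q(s,a)).
    Q maps (state, action) pairs to values; next_actions enumerates
    the feasible actions in s_next."""
    target = r + gamma * max(Q[(s_next, b)] for b in next_actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Behavior policy: a random action with probability eps, otherwise
    the greedy action under the current Q-values. The learning itself
    is off-policy, since the target always uses the greedy maximum."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda b: Q[(s, b)])

# Q-table with a default value of 0 for unseen state-action pairs
Q = defaultdict(float)
```

The `defaultdict` plays the role of the tabular Q-function, returning 0 for state-action pairs that have not yet been visited.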
When adopting a Q-learning algorithm, we should define discrete ranges for the agent's state and action spaces. For the discretization of the action space, we divide each control variable into 10-unit levels. Following this discretization rule, the output of the dispatchable generator is discretized by dividing its possible range by 10 units. Likewise, the charging/discharging amount of the ESS and the amount of demand response are discretized by dividing their possible ranges by 10 units. The amount of electricity traded with the utility grid is determined once all other control variables are fixed, so it needs no further consideration. Unlike the action space, the state space cannot be discretized directly because the observed features of the system cannot be exactly divided into discrete units. This issue can be resolved by introducing function-approximation-based Q-learning, and our proposed method can be extended to the line of research on function-approximated Q-values. Since tabular Q-learning is utilized in this study, we conduct a relaxation process for the system environment, which makes the system tractable with tabular Q-learning; we describe the details in Section IV-A.
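The 10-unit discretization rule can be sketched as a small helper. The function name is illustrative, and the example ranges are assumptions taken from the experimental setting of Section IV, where the generator, ESS, and demand response yield 7, 11, and 3 discrete controls, respectively.

```python
def discretize_range(lo, hi, step=10.0):
    """Discrete control levels obtained by dividing [lo, hi] into
    step-unit increments (the paper's 10-unit discretization)."""
    n = int(round((hi - lo) / step))
    return [lo + i * step for i in range(n + 1)]

dg_levels = discretize_range(0, 60)     # 7 generator controls
ess_levels = discretize_range(-50, 50)  # 11 charge/discharge controls
dr_levels = discretize_range(0, 20)     # 3 demand-response controls
```

Multiplying the level counts (7 x 11 x 3) reproduces the action-space size of 231 reported in the experiments.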
As defined in Section II-A, the system we handle has several constraints. Therefore, our agent model has a constrained action space. Although RL with a constrained action space is a tricky issue and has recently been the focus of much research, this topic is not the focus of this study. Therefore, we handle the constrained action space by adopting the simple rule of taking the action with the highest Q-value among the feasible actions (i.e., masking all infeasible actions).
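The masking rule amounts to restricting the greedy argmax to the feasible set; a minimal sketch with illustrative names, taking the feasibility test as a caller-supplied predicate:

```python
def masked_greedy_action(Q, s, all_actions, is_feasible):
    """Greedy action restricted to the feasible set: infeasible actions
    are masked out entirely rather than penalized, so their Q-values
    never influence the choice."""
    feasible = [a for a in all_actions if is_feasible(s, a)]
    return max(feasible, key=lambda a: Q.get((s, a), 0.0))
```

Masking (rather than assigning large negative rewards) keeps the Q-values of infeasible actions from distorting the bootstrapped targets.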
III-B Delayed Q-update Technique for ESS Controls
In previous research applying Q-learning to operation cost minimization in microgrid systems, hourly operation costs were provided as part of an immediate reward. However, the effect of the ESS charging control in the current period is not reflected in the current immediate reward but in the value of the future period with the discharging control. Although the expected future reward in the temporal-difference (TD) update formula may reflect the value of the charging control, this approximation deviates from the true value because TD suffers from a biased value function [23]. This bias issue occurs more severely under a large action space with the delayed-effect property of the microgrid. Thus, the values of the ESS charging controls are depreciated, resulting in undesirable operation that utilizes the ESS inefficiently. This issue does not often occur in previous studies because most of them consider only simple action spaces. As the fineness of the action space increases, the learning agent needs more iterations to approximate the true value of the charging control. To achieve precise Q-value assignment within a reasonable number of iterations, we propose the delayed Q-update technique, which relaxes this issue by conducting a delayed credit assignment to the ESS charging controls.
The delayed Q-update technique requires a first-in-first-out (FIFO) queue for storing and tracking the history of the ESS charging amounts and the state-action pairs of the periods with charging controls. When a discharging control later consumes the stored electricity, an adjusted value is assigned to the corresponding previous charging actions. This adjusted value is proportional to the deviation between the electricity price in the previous charging period and the price in the current discharging period. This price deviation measures how low the price was in the charging period and how high it is in the discharging period. Thus, the deviation can serve as an indicator of the desirability of ESS charging/discharging pairs (i.e., a desirable policy entails charging the ESS in periods with low prices and discharging it in periods with high prices). In this technique, we assume that electricity charged earlier has a higher priority for discharging, which is why we employ the FIFO queue. Equation (16) represents the update formula for the Q-value of a charging control in a previous period; it uses the charging amount from that period that is utilized by the discharging control in the current period, scaled by the adaptation rate for the delayed Q-update. The detailed flow of the delayed Q-update technique with Q-learning is described in Algorithm 1. The time complexity of the proposed algorithm is somewhat higher than that of the original Q-learning algorithm. However, as mentioned in Subsection III-A, the ESS control variable is discretized into 10-unit levels and the charging/discharging controls are bounded by the capacity, so the time complexity of the proposed algorithm reduces to that of the Q-learning algorithm.
(16) 
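The bookkeeping described above can be sketched as follows. Each charging control pushes its state-action pair, amount, and price onto a FIFO queue; a later discharging control consumes the stored amounts earliest-first and credits each matching charging control in proportion to the price gap between the two periods. The class name and the exact form of the extra credit are illustrative reconstructions of Equation (16), not the paper's notation.

```python
from collections import defaultdict, deque

class DelayedQUpdate:
    """Sketch of the delayed credit assignment for ESS charging."""

    def __init__(self, Q, beta=1e-5):
        self.Q = Q            # Q-table: (state, action) -> value
        self.beta = beta      # adaptation rate for the delayed update
        self.queue = deque()  # FIFO of [state, action, amount, price]

    def on_charge(self, s, a, amount, price):
        self.queue.append([s, a, amount, price])

    def on_discharge(self, amount, price):
        remaining = amount
        while remaining > 1e-9 and self.queue:
            s, a, stored, p_charge = self.queue[0]
            used = min(stored, remaining)
            # charging at a low price that enables discharging at a
            # high price earns positive delayed credit (Eq. 16)
            self.Q[(s, a)] += self.beta * used * (price - p_charge)
            remaining -= used
            if used >= stored:
                self.queue.popleft()          # entry fully consumed
            else:
                self.queue[0][2] = stored - used  # partial consumption
```

The FIFO order implements the assumption that the earliest-charged electricity is discharged first, so each unit of discharged energy is matched to exactly one earlier charging control.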
IV Experimental results
In this section, we demonstrate that the policy derived using our proposed technique with the Q-learning algorithm outperforms the benchmark operation policy derived using the original Q-learning algorithm. In addition, to validate the optimality of the policy, we compare its performance with that of an optimal operation policy derived using dynamic programming. We conduct a simulation of a real-world grid-connected microgrid system using both our proposed approach and the original Q-learning approach as a benchmark, and we verify that the derived operation policy is superior to the benchmark policy based on the proposed performance measures.
IV-A Experimental setting
As discussed in [24], we experiment with real-world data for a campus microgrid installed at a university in South Korea. The microgrid has solar PV and an ESS, and it covers part of the electricity demand of the university library. We use actual data on the hourly electricity demand and solar PV generation of the microgrid. In addition, we assume that the microgrid can trade electricity with the utility grid at the system marginal price (SMP), and we obtain the hourly SMP data from the website of the Korea Power Exchange. The average electricity retail price in South Korea is reported at approximately 100 KRW/kWh [25]. To encourage active utilization of the ESS, we make the SMP fluctuate more by multiplying it by a scale-up (or discount) factor. Lastly, we refer to several previous studies to set appropriate values for the other parameters of our models described in Section II, as summarized in Table II (the unit generation cost of the dispatchable generator in South Korea is approximately 500 KRW/kWh [26], and the unit cost of the demand response is set at 200 KRW/kWh [21]).
parameter  value  parameter  value 
500  200  
50  30  
60  0  
50  50  
1.0  0.2 
As previously mentioned, the relaxation process converts the observed features of the system into a discretized state space. In the process, we round the hourly PV generation amount, electricity demand, and SMP to the nearest ten. As a result, the PV generation amount contains 4 discrete values (0, 10, 20, 30), the electricity demand contains 8 discrete values (40, 50, 60, 70, 80, 90, 100, 110), and the SMP contains 3 discrete values (70, 130, 140). Furthermore, we apply the discretization rule to the microgrid system elements of the state space. The output of the dispatchable generator in the previous hour is discretized in the same way as the discrete controls of the action space for the dispatchable generator. The SOC level contains the same number of discrete values as the ESS storage capacity divided into 10-unit levels; therefore, the SOC level contains 6 discrete values, obtained by dividing its possible range by 10 units. Considering the discrete values for the relaxed observations of the system, the size of the state space is 3024.
By following this discretization rule for the action space of the agent, the dispatchable generators are discretized into 7 controls. Likewise, the operation of the ESS is discretized into 11 controls and the demand response is discretized into 3 controls. Thus, the size of the action space is 231, achieved by multiplying the sizes of discrete controls within every control variable.
By conducting several rounds of tuning, we fix the hyperparameters for the learning agent, including the number of episodes, the learning rate, the adaptation rate for the delayed Q-update, and the discount factor. These hyperparameter values are 3000, 0.3, 0.00001, and 0.9, respectively.
We derived an operation policy using Q-learning with the delayed Q-update technique, trained on data from March 2018, a one-month period. After the policy training step, we validate the derived operation policy over a validation period of approximately 24 hours.
IV-B Performance measure
For the validation, we define one of the performance measures as the average cost of microgrid operation, which is the daily average cost over the validation horizon. The formula for this measure is defined in Equation (17), where the summation runs to the last index of the validation period.
(17) 
Furthermore, to focus solely on the performance of the ESS operation, we define a benefit measure of the ESS operation, which sums the cost incurred by ESS charging when the ESS is charged and the opportunity cost saved by ESS discharging when the ESS is discharged. The formula for the benefit measure of the ESS operation is defined in Equation (18).
(18) 
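A plausible reconstruction of this measure, consistent with the description and the sign conventions of Section II-B (positive ESS action discharges): charging at price p incurs a cost of p times the charged amount, while discharging avoids purchasing at the current price, saving p times the discharged amount. Both terms then reduce to price times the signed ESS action. The function name and exact form are assumptions, not the paper's notation.

```python
def ess_benefit(operations):
    """ESS benefit measure (Eq. 18), sketched: sum of price * a_ess
    over the horizon, where a_ess > 0 is discharging (saving) and
    a_ess < 0 is charging (cost)."""
    return sum(price * a_ess for price, a_ess in operations)
```

Under this form, charging at low prices and discharging at high prices yields a positive benefit, and the reverse yields a negative one, which matches the qualitative discussion of desirable ESS operation in Section III-B.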
To identify the degree of convergence of the updated policy, we define a convergence measure, the Q-value difference, which is the root mean square error between the Q-values of the current and previous epochs. The formula for this performance measure is defined in Equation (19), where the Q-values of all state-action pairs after each epoch are compared. The measure expresses the degree of variation between the Q-values of all state-action pairs, so it can assess the degree of convergence.
(19) 
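This measure is straightforward to compute over two dictionary-based Q-tables; the function name is illustrative:

```python
import math

def q_value_difference(Q_prev, Q_curr):
    """Convergence measure (Eq. 19): root mean square error between
    the Q-values of the previous and current epochs over all
    state-action pairs (missing pairs default to 0)."""
    keys = set(Q_prev) | set(Q_curr)
    total = sum((Q_curr.get(k, 0.0) - Q_prev.get(k, 0.0)) ** 2 for k in keys)
    return math.sqrt(total / len(keys))
```

A value approaching zero indicates that successive epochs barely change the Q-table, i.e., the policy has converged.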
We adopt a regret value for the average cost, a relative measure that evaluates a policy by comparing it with an optimal policy based on the average cost measure (Equation (20)). Applying the same concept to the ESS benefit measure, the regret for the benefit measure of the ESS operation is formulated in Equation (21). These regret values evaluate how much each measure of the given policy deviates from the optimal measure; a policy with a large regret is undesirable.
(20)  
(21) 
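Under the plausible reading that regret is the signed gap to the optimal measure (excess cost, or shortfall in benefit), the two measures can be sketched as follows; the function names are illustrative, and the test values below are the optimal average cost (8183.33) and ESS benefit (375) reported in Section IV-C.

```python
def cost_regret(avg_cost, optimal_avg_cost):
    """Regret for the average-cost measure (Eq. 20): how much the
    policy's cost exceeds the optimal policy's cost."""
    return avg_cost - optimal_avg_cost

def benefit_regret(ess_benefit, optimal_ess_benefit):
    """Regret for the ESS benefit measure (Eq. 21): how far the
    policy's benefit falls short of the optimal policy's benefit."""
    return optimal_ess_benefit - ess_benefit
```

Both regrets are nonnegative for any feasible policy when the comparison policy is truly optimal, and zero only for a policy matching the optimum on that measure.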
IV-C Results
We simulate our derived operation policy using Q-learning with the delayed Q-update technique in a grid-connected microgrid system. To validate the effect of the delayed Q-update technique, we conduct a comparative analysis of the operation policies derived using Q-learning with and without the delayed Q-update. Figure 1 shows the trends of the average cost plotted against operation policy updates using the proposed approach and the original Q-learning. In Figure 1(a), the average cost shows a declining trend and stabilizes as policy updating continues, converging to a certain point. In Figure 1(b), the trend is more unstable than that of our proposed method and may not converge.
Figure 2 shows the trends of the ESS benefit measure versus operation policy updates using the proposed approach and the original Q-learning. To clarify the direction of each trend, we add a smoothed trend, denoted by the red line, to each result. As shown in Figure 2(a), the ESS benefit measure improves as learning proceeds using the proposed approach. Figure 2(b) shows that the ESS benefit measure does not improve and even decreases over a certain phase.
Figure 3 shows the trends of the Q-value difference measure plotted against operation policy updates using the proposed approach and the original Q-learning. In Figure 3(a), the Q-value difference shows a declining trend after a warm-up phase, decreasing efficiently as policy updating proceeds before converging to a certain level. However, in Figure 3(b), convergence is slower and less efficient than with our proposed method.
In the verification process on the validation data, Figure 4(a) shows the operation result of the policy based on the proposed approach, and Figure 4(b) shows the operation result of the policy derived by the original Q-learning algorithm. As seen in the results, the policy derived by the original Q-learning conducts undesirable ESS controls, frequently charging the ESS in periods with high SMP prices and discharging it in periods with low SMP prices. Therefore, our delayed Q-update technique is necessary to derive a reasonable operation policy under our problem setting.
To verify the optimality of our proposed approach, we conduct a simulation using a DP-driven operation policy (i.e., the optimal policy) under the predefined system. Figure 5 shows the operation result based on the optimal operation policy, which attains an average cost of 8183.33 and an ESS benefit of 375 on the validation data. Although our operation policy differs somewhat from the optimal policy, it is similar in that it charges the ESS in periods with low SMP prices and discharges it in periods with high SMP prices. Thus, we identify that the operation policy derived using Q-learning with the delayed Q-update technique converges to a near-optimal operation policy.
To analyze the above performance measures, we summarize the final performance measures of each operation policy in table III. As the comparative analysis shows, applying the delayed Q-update technique yields an operation policy that outperforms the policy generated by the original Q-learning algorithm. Thus, we verify that the delayed Q-update technique is beneficial in situations where sophisticated control of the microgrid system is required.
TABLE III: Final performance measures with and without the delayed Q-update

Delayed Q-update   average cost   ESS benefit   AC-regret   EB-regret
without            9789.58        -295.83       1606.25     670.83
with               8462.5         191.66        297.17      183.34
difference         13.55%         164.78%       81.5%       72.66%
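The regret measures in table III can be reproduced under a natural reading of their definitions; this sketch assumes AC-regret is the policy's average cost minus the optimal average cost and EB-regret is the optimal ESS benefit minus the policy's ESS benefit (the defaults below are the optimal values reported for the validation data). The "without" row's AC-regret of 1606.25 follows directly from this definition.

```python
def regrets(avg_cost, ess_benefit,
            optimal_avg_cost=8183.33, optimal_ess_benefit=375.0):
    """AC-regret and EB-regret relative to the DP-derived optimal policy.

    ASSUMPTION: AC-regret = policy cost - optimal cost and
    EB-regret = optimal benefit - policy benefit; the paper does not
    spell out the formulas, but this reading matches the reported
    AC-regret of the policy without the delayed Q-update.
    """
    ac_regret = avg_cost - optimal_avg_cost
    eb_regret = optimal_ess_benefit - ess_benefit
    return ac_regret, eb_regret
```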
V Conclusion
The main contribution of our study is the delayed Q-update technique, which derives operation policies more efficiently in a real-world grid-connected microgrid. Because sophisticated control is required in a real-world microgrid system, the action space of the Q-learning agent is constructed as a large discrete action space that determines the level of every DER. Typical Q-learning has difficulty deriving an operation policy in such a large action space, especially when the ESS exhibits delayed effects, which prevent the algorithm from deriving a desirable policy efficiently. To address this limitation, we introduce a novel delayed credit assignment technique, the delayed Q-update, which supports the Q-learning algorithm in deriving desirable operation policies under this detailed control setting. As the experimental results demonstrate, the operation policy derived using the proposed approach outperforms the benchmark policy derived using the original Q-learning under the adopted performance measures. In addition, the results verify that our operation policy converges to a near-optimal operation policy in the real-world grid-connected microgrid system.
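For reference, the standard tabular Q-learning update that the delayed Q-update technique modifies can be sketched as follows; the deferral rule itself is the paper's contribution and is not reproduced here, and the hyperparameters are illustrative assumptions.

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One standard tabular Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).

    The paper's delayed Q-update defers this credit assignment until
    the delayed effect of an ESS action is observed; that deferral
    rule is not shown here.
    """
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```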
One limitation of this study is that the agent and environment are formulated in a simplified setting involving discrete levels of electricity demand, SMP price, and PV generation, adjusted by a relaxation process. However, this simplified environment can be extended to a real-world version with continuous variables by adopting a recent RL paradigm that applies function approximation to Q-learning using a DNN, the so-called DQN [27, 28]. Along with incorporating a DNN, to relax our derived operation policy's swift switching of ESS controls, we plan to formulate a model that accounts for battery maintenance costs and propose a corresponding solution approach in future research.
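As a stepping stone toward such a DQN extension, the sketch below shows Q-learning with linear function approximation over a continuous state vector (e.g., demand, SMP price, PV output as features); a DQN would replace the linear model with a deep network and add experience replay and a target network [27, 28]. The feature layout and hyperparameters are illustrative assumptions.

```python
import numpy as np

class LinearQ:
    """Q(s, a) = w[a] . s for a continuous state vector s.

    A minimal function-approximation sketch, not the paper's method;
    a DQN replaces this linear model with a deep network plus
    experience replay and a target network.
    """

    def __init__(self, n_features, n_actions, alpha=0.01, gamma=0.95):
        self.w = np.zeros((n_actions, n_features))
        self.alpha, self.gamma = alpha, gamma

    def q_values(self, s):
        # One Q-value per action for state s.
        return self.w @ s

    def update(self, s, a, r, s_next):
        # Semi-gradient TD(0) update for the chosen action's weights.
        td_target = r + self.gamma * np.max(self.q_values(s_next))
        td_error = td_target - self.q_values(s)[a]
        self.w[a] += self.alpha * td_error * s
```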
References
 [1] K. Milis, H. Peremans, and S. V. Passel, “The impact of policy on microgrid economics: A review,” Renewable and Sustainable Energy Reviews, vol. 81, pp. 3111–3119, 2018.
 [2] T. S. Ustun, C. Ozansoy, and A. Zayegh, “Recent developments in microgrids and example cases around the world: A review,” Renewable and Sustainable Energy Reviews, vol. 15, pp. 4030–4041, 2011.
 [3] S. Mishra, K. Anderson, B. Miller, K. Boyer, and A. Warren, “Microgrid Resilience: A holistic approach for assessing threats, identifying vulnerabilities, and designing corresponding mitigation strategies,” Applied Energy, vol. 264, 2020.
 [4] W. Su, J. Wang, and J. Roh, “Stochastic energy scheduling in microgrids with intermittent renewable energy resources,” IEEE Transactions on Smart Grid, vol. 5, pp. 1876–1883, 2013.
 [5] S. Talari, M. Yazdaninejad, and M. Haghifam, “Stochastic-based scheduling of the microgrid operation including wind turbines, photovoltaic cells, energy storages and responsive loads,” IET Generation, Transmission & Distribution, vol. 9, pp. 1498–1509, 2015.
 [6] T. A. Nguyen and M. Crow, “Stochastic optimization of renewable-based microgrid operation incorporating battery operating cost,” IEEE Transactions on Power Systems, vol. 3, pp. 2289–2296, 2015.
 [7] S. A. Pourmousavi, M. H. Nehrir, M. C. Christopher, and C. Wang, “Real-Time Energy Management of a Stand-Alone Hybrid Wind-Microturbine Energy System Using Particle Swarm Optimization,” IEEE Transactions on Sustainable Energy, vol. 1, pp. 193–201, 2010.
 [8] H. Karami, M. J. Sanjari, S. H. Hosseinian, and G. B. Gharehpetian, “An optimal dispatch algorithm for managing residential distributed energy resources,” IEEE Transactions on Smart Grid, vol. 5, pp. 2360–2367, 2014.
 [9] F. D. Li, M. Wu, Y. He, and X. Chen, “Optimal control in microgrid using multi-agent reinforcement learning,” ISA Transactions, vol. 51, pp. 743–751, 2012.
 [10] E. Kuznetsova, Y. F. Li, C. Ruiz, E. Zio, G. Ault, and K. Bell, “Reinforcement learning for microgrid energy management,” Energy, vol. 59, pp. 133–146, 2013.
 [11] R. Leo, R. S. Milton, and S. Sibi, “Reinforcement learning for optimal energy management of a solar microgrid,” 2014 IEEE Global Humanitarian Technology Conference - South Asia Satellite (GHTC-SAS), pp. 183–188, 2014.
 [12] G. K. Venayagamoorthy and P. K. Gautam, “Dynamic Energy Management System for a Smart Microgrid,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, pp. 1643–1656, 2016.
 [13] S. Kim and H. Lim, “Reinforcement learning based energy management algorithm for smart energy buildings,” Energies, vol. 11, p. 2010, 2018.
 [14] T. Levent, P. Preux, E. Le Pennec, J. Badosa, G. Henri, and Y. Bonnassieux, “Energy Management for Microgrids: a Reinforcement Learning Approach,” 2019 IEEE PES Innovative Smart Grid Technologies Europe (ISGT-Europe), pp. 1–5, 2019.
 [15] Y. Shang, W. Wu, J. Guo, Z. Lv, Z. Ma, W. Sheng, and R. Chen, “Stochastic Dispatch of Energy Storage in Microgrids: A Reinforcement Learning Approach Incorporated with MCTS,” arXiv preprint arXiv:1910.04541, 2019.
 [16] Q. Wei, D. Liu, and G. Shi, “A novel dual iterative Q-learning method for optimal battery management in smart residential environments,” IEEE Transactions on Industrial Electronics, vol. 62, pp. 2509–2518, 2014.
 [17] V. FrançoisLavet, D. Taralla, D. Ernst, and R. Fonteneau, “Deep reinforcement learning solutions for energy microgrids management,” European Workshop on Reinforcement Learning (EWRL 2016), 2016.
 [18] T. Chen and W. Su, “Local Energy Trading Behavior Modeling With Deep Reinforcement Learning,” IEEE Access, vol. 6, pp. 62806–62814, 2018.
 [19] Y. Ji, J. Wang, J. Xu, X. Fang, and H. Zhang, “RealTime Energy Management of a Microgrid Using Deep Reinforcement Learning,” Energies, vol. 12, p. 2291, 2019.
 [20] V. H. Bui, A. Hussain, and H. M. Kim, “Double Deep Q-Learning-Based Distributed Operation of Battery Energy Storage System Considering Uncertainties,” IEEE Transactions on Smart Grid, 2019.
 [21] D. G. Choi, D. Min, and J. H. Ryu, “Effective subsidy policy for a Grid-connected Microgrid: A Korean case study,” Submitted, 2019.
 [22] H. Park, M. K. Sim, and D. G. Choi, “An intelligent financial portfolio trading strategy using deep Q-learning,” Expert Systems with Applications, p. 113573, 2020.
 [23] R. S. Sutton and S. P. Singh, “On step-size and bias in temporal-difference learning,” Proceedings of the Eighth Yale Workshop on Adaptive and Learning Systems, pp. 91–96, 1994.
 [24] B. Jeong, D. Shin, J. Im, J. Park, and Y. Kim, “Implementation of Optimal Two-Stage Scheduling of Energy Storage System Based on Big-Data-Driven Forecasting—An Actual Case Study in a Campus Microgrid,” Energies, vol. 12, p. 1124, 2019.
 [25] K. E. E. Institute. (2018) Korea retail price: Electricity: Average. [Online]. Available: https://www.ceicdata.com/en/korea/energyretailprice/retailpriceelectricityaverage
 [26] J. Song, S. Oh, Y. Yoo, S. Seo, I. Paek, Y. Song, and S. J. Song, “System design and policy suggestion for reducing electricity curtailment in renewable power systems for remote islands,” Applied energy, vol. 225, pp. 195–208, 2018.
 [27] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and et al., “Playing Atari with Deep Reinforcement Learning,” arXiv preprint arXiv:1312.5602, 2013.
 [28] V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare, and et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, pp. 529–533, 2015.