I Introduction
The rapid decarbonization of modern power systems in recent decades presents a variety of operational challenges [6]. These systems are growing rapidly as they integrate more intermittent renewable energy sources, energy storage systems, and flexible loads. The growing number and complexity of system components make it significantly more difficult to maintain secure and cost-effective operation. Under this trend, traditional methods adapt and perform poorly because they are inflexible and fit only typical working conditions. There is thus an urgent need to develop next-generation dispatch methodologies for modern power systems.
Reinforcement learning (RL) provides an emerging and promising option to tackle these challenges [7]. In an RL framework, an agent is trained to mimic the decision process of real system operators: it interacts with a virtual environment of the power system and learns from past decisions and rewards [14]. By exploiting hidden patterns in the data, a well-trained agent can accommodate multiple resources across a reasonable variety of scenarios. Many scholars have explored the viability of applying RL algorithms to power system dispatch. The authors of [10] applied a deep deterministic policy gradient (DDPG) algorithm to a joint dispatch problem considering both traditional thermal units and renewable resources. In [3], a cooperative RL algorithm was proposed to achieve efficient distributed economic dispatch in a microgrid with energy storage systems. An improved DDPG algorithm was developed in [13] to solve the dynamic economic dispatch problem for integrated energy systems.
Despite the growing body of research on RL-based power system economic dispatch, the evaluation approach has not been fully studied. The authors of [4] focused on the security evaluation issue for renewable-rich power systems using SWOT analysis. In [11], a simulation approach was developed to improve the evaluation by taking uncertain wind speed into consideration. Often, performance evaluation is only one step in work that proposes a novel economic dispatch algorithm, where the algorithm is compared with traditional baselines to validate its effectiveness [5, 8]. Some research has addressed evaluation itself. In [1], multiple data-driven algorithms for economic dispatch were evaluated in terms of optimality and computational complexity, and the advantages of these algorithms were discussed as well. In [2], the influence of data corruption on look-ahead economic dispatch decisions was evaluated using sensitivity analysis, and a linear sensitivity matrix was derived to enable fast evaluation. To sum up, the evaluation approach for RL in power system economic dispatch remains underexplored, especially concerning its unique ability to adapt to various scenarios.
In this paper, we propose an evaluation approach for look-ahead economic dispatch to analyze the performance of RL agents faced with multiple scenarios. Specifically, the agent is designed for N-1 contingency management, and the network scenarios for evaluation are selected from the N-1 cases by network clustering. Based on the agents’ dispatch decisions, several metrics are defined to evaluate their performance. The evaluation approach can be run with different parameters and can inform the design of learning strategies. The contributions of this paper are summarized as follows:

A network clustering method is proposed to cluster and aggregate the network structures before finalizing the network scenarios. The average change rates of power flow on critical transmission lines are utilized as features of the network structures for clustering, representing the consequences and influence of line outage.

A group of evaluation metrics is developed to comprehensively analyze the agents’ performance. The metrics for security are defined based on the limits of the constraints, while the metrics for economy are defined with respect to the baseline of dispatch decisions.

The effectiveness of the proposed evaluation approach is validated by showing the adaptation ability of the past agents in various scenarios. In addition, the proposed approach can be utilized to offer suggestions for learning strategies.
The remainder of this paper is organized as follows. The scenario generation method based on network clustering is formulated in Section II. Section III introduces the evaluation metrics and the corresponding baseline generation approach. A case study based on a modified IEEE 30-bus system is presented in Section IV, and Section V concludes this paper.
II Scenario Generation based on Network Clustering
In this section, multiple scenarios are generated for evaluation of the agent by combining network scenarios and demand scenarios. All the scenarios are based on the original conditions used for training the agent. The generation of demand scenarios is achieved by imposing a series of random variations upon the original demand level, while the generation of network scenarios is accomplished by aggregating network structures with similar features.
As the agent is designed for N-1 contingency, the potential network scenarios are limited to the N-1 cases, where exactly one transmission line is out of service. Relative to the N case, where all transmission lines are functional, the power flow on the outaged line is redistributed to the other lines, which shifts the power flow distribution in accordance with the power transfer distribution factors [9]. This shift is used here as the feature of each N-1 case and its corresponding network for network clustering.
For an N-1 case with an outage in a given line, the observation vector is defined as follows:
(1) 
(2)  
(3)  
(4) 
The observation vector (1) has one dimension for each key transmission line being observed. The corresponding element (2) denotes the average change rate of the power flow on the observed line over all look-ahead windows before and after the outage, and it depends on the total number of time slots and the duration of a look-ahead window. The average change rate is ignored when it is below 1, since the outage then makes the case less challenging in terms of power flow limits. Equations (3) and (4) give the power flow on the observed line with and without the outage. They are written in terms of the total numbers of thermal units and buses, the power transfer distribution factors, a function mapping each thermal unit to its bus number, the net demand of each bus at each time, and the real power output of each thermal unit at each time, which can be obtained from historical data or by solving the DC optimal power flow (DCOPF) problem in the following section.
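The computation described above can be sketched in a few lines. This is only an illustration under stated assumptions, since the original expressions (1)-(4) are not reproduced here: the flow on a line is taken as the PTDF-weighted generation minus the PTDF-weighted net demand, the change rate of a window is the ratio of summed absolute flows after and before the outage, and "ignored when below 1" is read as flooring the result at 1. The function and argument names are hypothetical.

```python
def line_flow(ptdf_row, unit_bus, unit_output, net_demand):
    """DC flow on one line: PTDF-weighted generation minus PTDF-weighted demand.

    ptdf_row[b]  : assumed PTDF of this line with respect to bus b
    unit_bus[g]  : bus index of thermal unit g
    unit_output  : real power output of each thermal unit at one time slot
    net_demand   : net demand of each bus at the same time slot
    """
    gen = sum(ptdf_row[unit_bus[g]] * p for g, p in enumerate(unit_output))
    load = sum(ptdf_row[b] * d for b, d in enumerate(net_demand))
    return gen - load


def average_change_rate(flows_before, flows_after, window):
    """Mean over all look-ahead windows of |flow after| / |flow before|.

    Results below 1 are floored at 1 (an assumed reading of "ignored").
    """
    n_windows = len(flows_before) - window + 1
    ratios = []
    for t in range(n_windows):
        before = sum(abs(f) for f in flows_before[t:t + window])
        after = sum(abs(f) for f in flows_after[t:t + window])
        if before > 0:  # skip windows with no pre-outage flow
            ratios.append(after / before)
    mean_ratio = sum(ratios) / len(ratios) if ratios else 1.0
    return max(mean_ratio, 1.0)
```

Stacking `average_change_rate` over the key transmission lines would give one observation vector per N-1 case.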
With the observation vectors for all the N-1 cases, a variety of clustering algorithms, such as k-means and hierarchical clustering, can be applied to aggregate them [12]. After the clustering step, the cluster to which the original N-1 case of the agent belongs becomes the network scenario set for further evaluation, and the number of cases it contains is the total number of network scenarios. Combined with the demand scenarios generated earlier, this yields the scenarios for performance evaluation of the agent.
III Multi-Scenario Performance Evaluation
III-A Baseline Generation
In order to properly evaluate the performance of the agent, a baseline for comparison is necessary. In this paper, the baseline of dispatch decisions is generated by solving a DCOPF problem. In practice, such a baseline can also be obtained from historical data or the experience of system operators.
The DCOPF model for look-ahead economic dispatch is formulated as follows:
(5) 
subject to:
(6) 
(7)  
(8)  
(9) 
Here the decision variable is the vector of baseline real power outputs of all thermal units at a given time in the look-ahead window and in a given scenario, and the cost function gives the total operation cost of all thermal units. The net demand of each bus is taken at each time in the corresponding demand scenario. Each thermal unit has upper and lower limits on its power output as well as ramping limits. Each transmission line has a power flow limit, and the power transfer distribution factors are those of the corresponding network scenario. Unless noted otherwise, the time index ranges over all time slots, the window index over the entire look-ahead window, the unit index over all thermal units, and the scenario index over all combined scenarios of the network and demand.
The objective (5) of the DCOPF model is to minimize the total operation cost in each look-ahead window in all scenarios with respect to the power output of thermal units. The constraints consist of power balance constraints (6), power output constraints of thermal units (7), ramping rate constraints of thermal units (8) and power flow constraints (9). The DCOPF model above is a transformation of the RL framework of the agent. Therefore, the baseline of power output of thermal units can be obtained by solving the model above in all the scenarios.
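For orientation, a generic PTDF-based look-ahead DCOPF with the objective and the four constraint types listed above might be written as follows. This is a sketch in assumed notation, not the paper's own equations: $P^{s}_{g,t+\tau}$ is the output of unit $g$ at step $\tau$ of the window starting at $t$ in scenario $s$, $D^{s}_{b,t+\tau}$ the net demand of bus $b$, $H^{s}_{l,b}$ the PTDF of line $l$ with respect to bus $b$, $b(g)$ the bus of unit $g$, $T_w$ the window length, and the separate upward and downward ramp limits are likewise assumptions.

```latex
\begin{align}
\min_{P^{s}} \quad & \sum_{\tau=1}^{T_w} C\!\left(P^{s}_{t+\tau}\right) \\
\text{s.t.} \quad
& \sum_{g=1}^{G} P^{s}_{g,t+\tau} = \sum_{b=1}^{B} D^{s}_{b,t+\tau}, \\
& \underline{P}_{g} \le P^{s}_{g,t+\tau} \le \overline{P}_{g}, \\
& -R^{\mathrm{down}}_{g} \le P^{s}_{g,t+\tau} - P^{s}_{g,t+\tau-1} \le R^{\mathrm{up}}_{g}, \\
& -\overline{F}_{l} \le \sum_{g=1}^{G} H^{s}_{l,b(g)}\, P^{s}_{g,t+\tau}
   - \sum_{b=1}^{B} H^{s}_{l,b}\, D^{s}_{b,t+\tau} \le \overline{F}_{l},
   \quad l = 1,\dots,L .
\end{align}
```

With a linear or convex quadratic cost function, a problem of this form is a linear or quadratic program and can be handed to a standard solver for each window and scenario.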
III-B Evaluation Metrics
The multi-scenario evaluation metrics are divided into economy and security metrics, and all the metrics are aggregated from the simulation results of the individual scenarios. The evaluation metric for economy is the relative cost error (RCE), defined as follows:
(10)  
(11) 
Here the agent's dispatch decision in each scenario is the vector of real power outputs of all the thermal units. The RCE of the dispatch decisions generated by the agent is defined as the relative mean error of the total operation cost over all look-ahead windows in each scenario with respect to the baseline.
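A minimal sketch of the RCE calculation is given below. It assumes that the per-window operation costs of the agent and of the baseline are already available, that the per-scenario error is the relative difference of the mean window costs, and that the overall value is the plain average over scenarios; the exact form of (10) and (11) may differ. All names are hypothetical.

```python
def scenario_rce(agent_costs, baseline_costs):
    """Relative error of the agent's mean window cost against the baseline's.

    agent_costs / baseline_costs: operation cost of each look-ahead window
    in one scenario, for the agent and for the baseline respectively.
    """
    mean_agent = sum(agent_costs) / len(agent_costs)
    mean_base = sum(baseline_costs) / len(baseline_costs)
    return (mean_agent - mean_base) / mean_base


def overall_rce(per_scenario):
    """Average the per-scenario RCE over all scenarios.

    per_scenario: list of (agent_costs, baseline_costs) pairs.
    """
    values = [scenario_rce(agent, base) for agent, base in per_scenario]
    return sum(values) / len(values)
```

A positive value means the agent's dispatch is more expensive than the baseline; multiplying by 100 gives a percentage.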
The evaluation metrics for security include the total relative violation of constraints (RVS), maximum relative violation of constraints (RVM), average number of violated constraints (NVC), average number of time slots with violated constraints (NVT) and availability rate:
(12)  
(13)  
(14)  
(15)  
(16) 
RVS (12) and RVM (13) characterize the overall extent of constraint violation. The RVS of the dispatch decisions is defined as the sum over all scenarios, and the RVM as the maximum over all scenarios. In each scenario, the RVS is the total relative violation of the power output constraints, ramping rate constraints and power flow constraints in all look-ahead windows, and the RVM is the corresponding maximum value. The relative violation values of these three types of constraints are measured against the limits of the respective constraints:
(17)  
(18)  
(19) 
(20)  
(21) 
NVC (14), NVT (15) and availability rate (16) are metrics characterizing the proportion of unusable dispatch decisions, regardless of the extent of violation. The NVC and NVT metrics for dispatch decisions are derived by averaging over all scenarios. In each scenario, the NVC is defined as the total number of violated constraints in all look-ahead windows, and the NVT is the total number of time slots with violated constraints. Availability rate is defined as the proportion of scenarios without any violation of constraints.
(22)  
(23) 
The above metrics constitute a comprehensive evaluation framework for the agent. By calculating the value of the metrics after obtaining the simulation results of the agent in all scenarios, the performance of the agent can be properly evaluated.
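The security metrics can be sketched as follows. This is an illustration under assumptions rather than the exact definitions (12)-(23): every constraint is represented as a hypothetical `(value, lower, upper)` tuple, the relative violation is the excess beyond the violated limit divided by the magnitude of that limit, and NVC and NVT are returned as raw averages rather than percentages.

```python
def relative_violation(value, lower, upper):
    """Excess beyond the violated limit, scaled by that limit's magnitude."""
    if value > upper:
        return (value - upper) / (abs(upper) or 1.0)  # guard against a zero limit
    if value < lower:
        return (lower - value) / (abs(lower) or 1.0)
    return 0.0


def scenario_metrics(slots):
    """slots: list of time slots, each a list of (value, lower, upper) tuples."""
    violations = [[relative_violation(*c) for c in slot] for slot in slots]
    flat = [v for slot in violations for v in slot]
    return {
        "rvs": sum(flat),                      # total relative violation
        "rvm": max(flat, default=0.0),         # maximum relative violation
        "nvc": sum(1 for v in flat if v > 0),  # number of violated constraints
        "nvt": sum(1 for slot in violations if any(v > 0 for v in slot)),
    }


def aggregate(scenarios):
    """Combine per-scenario results: sum, max, averages and availability."""
    per = [scenario_metrics(s) for s in scenarios]
    n = len(per)
    return {
        "RVS": sum(m["rvs"] for m in per),
        "RVM": max(m["rvm"] for m in per),
        "NVC": sum(m["nvc"] for m in per) / n,
        "NVT": sum(m["nvt"] for m in per) / n,
        "availability": sum(1 for m in per if m["nvc"] == 0) / n,
    }
```

Keeping the per-scenario dictionary separate from the aggregation makes it easy to inspect a single scenario, as is done for one network scenario in the case study.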
IV Case Study
The evaluation of an RL agent is conducted on a modified IEEE 30-bus power system. The agent is designed for look-ahead economic dispatch and is tested for a total of 40 days. The interval between time slots is set to 15 min, and each look-ahead window covers a fixed number of time slots.
IV-A Scenarios
To fully evaluate agent performance, 123 scenarios are designed by combining 3 network scenarios with 41 demand scenarios. As mentioned above, the demand scenarios are generated by randomly shifting the original demand level used to train the agent. Specifically, the original demand curve is multiplied by 41 coefficients from 80% to 120% in constant steps of 1%, each perturbed by normally distributed random factors for all time slots.
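The demand scenario generation can be sketched as below. It is a minimal illustration that assumes the random factor is added to the scaling coefficient independently for each time slot; the standard deviation `noise_std` and the seed are hypothetical, as the source does not state them.

```python
import random


def demand_scenarios(base_curve, low=0.80, high=1.20, step=0.01,
                     noise_std=0.01, seed=0):
    """Scale a demand curve by evenly spaced coefficients plus Gaussian noise.

    base_curve: original demand level for each time slot.
    Returns one perturbed curve per coefficient (41 with the defaults).
    """
    rng = random.Random(seed)                 # seeded for reproducibility
    n_coeffs = round((high - low) / step) + 1
    scenarios = []
    for i in range(n_coeffs):
        coeff = low + i * step
        # Assumed form: each time slot gets its own random term on the coefficient.
        scenarios.append(
            [d * (coeff + rng.gauss(0.0, noise_std)) for d in base_curve]
        )
    return scenarios
```

With the default range, 41 demand scenarios are produced, and pairing each with 3 network scenarios gives 123 combined scenarios.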
The network scenarios, on the other hand, are generated with network clustering. For all 41 N-1 cases, the average change rates of power flow on key transmission lines 5, 11 and 22 are depicted in Fig. 1. Here, the N-1 cases are aggregated into 4 clusters. The first cluster (inside the semicube area) represents the cases whose power flow changes little or becomes less challenging, while the others (each represented with an ellipse) correspond to a significant increase in the power flow on one key transmission line, indicating a greater challenge to power flow limits. As the agent in this section is designed for the N-1 case where line 10 becomes unavailable, the cluster it belongs to (the red ellipse) becomes the network scenario set, which contains the N-1 cases of lines 10, 13, and 14.
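The selection of the network scenario set can be illustrated with a small k-means sketch. The source does not state which clustering algorithm was used for Fig. 1, so k-means with a deterministic farthest-point initialization is an assumption here, and the feature values in the usage example are invented for illustration only.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=100):
    """Plain k-means; returns a cluster label for each point."""
    # Farthest-point initialization keeps the result deterministic.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def network_scenario_set(features, reference_line, k):
    """Return the outaged lines that share a cluster with the reference case.

    features: dict mapping an outaged line to its observation vector.
    """
    lines = sorted(features)
    labels = kmeans([features[line] for line in lines], k)
    ref_label = labels[lines.index(reference_line)]
    return [line for line, lab in zip(lines, labels) if lab == ref_label]
```

For example, with hypothetical change rates on three key lines, `network_scenario_set({1: [1.0, 1.0, 1.0], 10: [2.0, 1.0, 1.0], 13: [2.1, 1.0, 1.0], 20: [1.0, 2.0, 1.0]}, reference_line=10, k=3)` groups lines 10 and 13 together.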
IV-B Evaluation Results
Based on the scenarios generated, the performance of the agent is evaluated using the metrics proposed in this paper. The evaluation results are listed in Table I, and the detailed results in one network scenario are depicted in Fig. 2, where line 13 becomes unavailable instead of the original line 10.
RCE (%)  RVS (%)   RVM (%)  NVC (%)  NVT (%)  Availability rate (%)
11.0     545763.1  13.2     0.035    0.33     84.6
The evaluation results indicate that the agent is capable of adapting to a certain level of environment disturbance. In Fig. 2, with demand varying between -20% and +20%, the agent produces feasible dispatch decisions in 37 out of 41 scenarios. On the one hand, when demand is too high, the dispatch decisions of the agent violate the power flow constraints in a few look-ahead windows, and the violation becomes more severe in terms of both the proportion of unusable decisions and the extent of violation as demand increases. On the other hand, when demand becomes too low, although there is no violation of constraints, the operation cost of the dispatch decisions increases gradually with respect to the baseline, suggesting decreased effectiveness. In addition, while the violation in some extreme scenarios (near 20% demand variation) is significant, the majority remains acceptable, which is reflected in Table I as well. In the table, despite the high RVS and RVM values suggesting dramatic violation in certain scenarios, the NVC and NVT values are relatively low, suggesting general usability. Judging from all the metric values, the evaluated agent possesses the basic ability to adapt to different scenarios, while under some circumstances it loses this ability and needs further improvement.
IV-C Comparison Between Different Agents
The evaluation approach is further applied to several other RL agents in addition to the previous one, and the results are listed in Table II. The metric values in the table are all percentages, as in Table I. In the first column of the table, the names of the agents indicate the number of episodes in their training process, and RL50 corresponds to the agent used in the previous section.
Agent name  RCE   RVS       RVM   NVC    NVT   Availability rate
RL50        11.0  545763.1  13.2  0.035  0.33  84.6
RL100       11.2  505855.3  12.9  0.033  0.31  82.9
RL150       11.3  587842.5  12.7  0.038  0.33  81.3
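Table II can also be read metric by metric with a few lines of code. The sketch below simply copies the values from the table and picks the best agent per metric, assuming that lower is better for every metric except the availability rate; the dictionary layout and function name are hypothetical.

```python
# Values copied from Table II (all in percent).
TABLE_II = {
    "RL50":  {"RCE": 11.0, "RVS": 545763.1, "RVM": 13.2,
              "NVC": 0.035, "NVT": 0.33, "availability": 84.6},
    "RL100": {"RCE": 11.2, "RVS": 505855.3, "RVM": 12.9,
              "NVC": 0.033, "NVT": 0.31, "availability": 82.9},
    "RL150": {"RCE": 11.3, "RVS": 587842.5, "RVM": 12.7,
              "NVC": 0.038, "NVT": 0.33, "availability": 81.3},
}


def best_agent(table, metric):
    """Agent with the best value: highest availability, lowest otherwise."""
    pick = max if metric == "availability" else min
    return pick(table, key=lambda name: table[name][metric])


if __name__ == "__main__":
    for metric in ("RCE", "RVS", "RVM", "NVC", "NVT", "availability"):
        print(metric, best_agent(TABLE_II, metric))
```

On these values, RL100 is best for RVS, NVC and NVT, RL150 for RVM, and RL50 for RCE and availability rate.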
Here, the evaluation results of the agents provide a reference for the selection of training episodes. In Table II, the evaluation results change slightly as the number of training episodes increases. For RVS, NVC and NVT, the lowest (best) values occur for RL100, and the dispatch decisions of both RL50 and RL150 are less secure in terms of those metrics. RVM, by contrast, decreases slightly with more episodes, from 13.2 for RL50 to 12.7 for RL150. This behavior is probably because more episodes improve the ability to capture the characteristics of the training set while also increasing the risk of overfitting. In addition, judging from RCE and availability rate, further training compromises the accuracy and economy of the dispatch decisions of the agent. In summary, the evaluation results suggest that 100 episodes is a relatively appropriate choice for training in this case.
V Conclusion
This paper proposes an evaluation approach for look-ahead economic dispatch in order to properly analyze the performance of RL agents under multiple scenarios. A network clustering method is developed to generate the scenarios, and a series of evaluation metrics are designed for each scenario considering both economy and security. Evaluation results of multiple agents for a modified IEEE 30-bus system show that the proposed approach can effectively assess the adaptability of an agent. In addition, this approach can be utilized to offer suggestions for the value of key parameters in the RL algorithm by comparing results among different agents. In brief, this work will contribute to the improvement of RL algorithms in look-ahead dispatch, thereby increasing the intelligence of power system operation.
References
[1] (2018) Evaluating the performance of an artificial bee colony algorithm for solving a economic dispatch problem. In 2018 Simposio Brasileiro de Sistemas Eletricos (SBSE), pp. 1–6.
[2] (2017) Data Perturbation-Based Sensitivity Analysis of Real-Time Look-Ahead Economic Dispatch. IEEE Transactions on Power Systems 32 (3), pp. 2072–2082.
[3] (2018) Distributed Economic Dispatch in Microgrids Based on Cooperative Reinforcement Learning. IEEE Transactions on Neural Networks and Learning Systems 29 (6), pp. 2192–2203.
[4] (2022) New energy power system operation security evaluation based on the SWOT analysis. Scientific Reports 12 (1), pp. 1–14.
[5] (2018-01) Optimization of unit commitment and economic dispatch in microgrids based on genetic algorithm and mixed integer linear programming. Applied Energy 210, pp. 944–963.
[6] (2020) Challenges in the decarbonization of the energy sector. Energy 205, pp. 118025.
[7] (2020) Review of learning-assisted power system optimization. CSEE Journal of Power and Energy Systems 7 (2), pp. 221–231.
[8] (2014-02) Solving the Power Economic Dispatch Problem With Generator Constraints by Random Drift Particle Swarm Optimization. IEEE Transactions on Industrial Informatics 10 (1), pp. 222–232.
[9] (2018) Security Constrained Unit Commitment Using Line Outage Distribution Factors. IEEE Transactions on Power Systems 33 (1), pp. 329–337.
[10] (2019) Joint Optimization Dispatching for Hybrid Power System Based on Deep Reinforcement Learning. In 2019 IEEE 8th International Conference on Advanced Power System Automation and Protection (APAP), pp. 1289–1293.
[11] (2011) Simulation of correlated wind speed data for economic dispatch evaluation. IEEE Transactions on Sustainable Energy 3 (1), pp. 142–149.
[12] (2005-05) Survey of clustering algorithms. IEEE Transactions on Neural Networks 16 (3), pp. 645–678.
[13] (2021) Dynamic energy dispatch strategy for integrated energy system based on improved deep reinforcement learning. Energy 235, pp. 121377.
[14] (2020) Deep reinforcement learning for power system applications: An overview. CSEE Journal of Power and Energy Systems 6 (1), pp. 213–225.