A smart grid is a power network that enables a two-way flow of electricity and data through digital communication technologies, allowing it to detect, react to and proactively act on changes in usage and multiple issues. It plays a crucial role in the smart society and the upcoming carbon-neutral society. The core technologies of the smart grid include sensing, communication, optimization and data analysis for various purposes.
Fault detection of smart grid is an important research problem that has attracted increasing attention from both academia and industry. It is essential to improve the performance and reduce disruptions of smart power systems. The presence of faults or incipient faults in smart grid can be determined by measuring and analyzing changes in electrical power (e.g., current and voltage), environmental and equipment parameters, a process known as smart grid fault detection. Achieving autonomous smart grid fault detection is crucial to system status awareness, maintenance and operation.
With the continuous innovation of power systems and equipment, and the pursuit of carbon neutrality goals, more devices and technologies are being used for fault detection. Together, these developments make fault detection in the autonomous smart grid challenging.
More complex operating conditions of the new power system have put forward higher requirements for the state awareness, operation and maintenance of power equipment:
The operational safety of power equipment is the basis for reliable operation of the entire power grid. Comprehensive, timely and accurate sensing and adaptive adjustment are prerequisites for ensuring the safety of power equipment. With a high proportion of renewable energy accommodation, the new power system, with its strong uncertainty, high volatility and many harmonics, will experience more extreme and variable operating conditions, which puts forward higher requirements for safe and reliable operation. It is necessary to solve the problems of real-time sensing, accurate assessment of equipment status and timely warning of potential faults under complex and variable conditions, and to study strategies for long-term service maintenance, so as to realize lean management and efficient maintenance of power equipment and ensure its safe and reliable long-term operation under the complex operating conditions of the new power system.
The massive adoption of new power equipment has brought new challenges to operation and maintenance:
New power equipment is the key infrastructure used to support the construction and development of new power systems, such as power conversion and control devices based on power electronics technology, large-scale energy storage equipment, and offshore wind power access-related equipment. These new types of power equipment have a relatively short history and have seen relatively little use in traditional power grids. They lack effective condition assessment methods that have been tested in practice. The operation and maintenance technology of the equipment is still immature, and basic scientific issues such as the mechanism and evolution law of equipment failure need to be studied in depth.
Carbon neutrality puts forward new requirements for the operation of power equipment:
Green operation and high efficiency are among the main features of the new power system. While the power grid strongly supports the access and consumption of renewable energy resources, it should further reduce its own carbon emission level and realize the low-carbon transformation of planning, operation and maintenance. Considering the huge number and rapid growth of power equipment, improving the utilization efficiency of existing power equipment and reducing equipment operating losses are important goals for the efficient operation of new power systems. The key issues that need to be studied mainly include real-time identification and accurate prediction methods for key parameters that affect the utilization and service life of equipment, factors affecting the comprehensive energy consumption of power grid equipment, and control strategies.
To the best of our knowledge, there are still many unsolved research issues in autonomous smart grid fault detection. To this end, this paper first presents the basic principles of smart grid fault detection. Then, we explain the new requirements for autonomous smart grid fault detection, the technical challenges and their possible solutions. A case study is conducted and shows the great advantages of the proposed solution. Furthermore, we highlight relevant directions for future research.
This paper is structured as follows. Section II introduces smart grid and smart grid fault detection. Section III introduces the challenges and solutions for autonomous smart grid fault detection. A real case deployed in the ultra high voltage (UHV) converter substation with the highest voltage level in the world is presented in Section IV. In Section V, we look ahead to the problems that need to be solved in the future. Finally, the manuscript is concluded in Section VI.
II Smart Grid and Smart Grid Fault Detection
In this section, we overview the key components of the smart grid and smart grid fault detection.
II-A Overview of Smart Grid and Fault Detection
The key components of a smart grid system are shown in Fig. 1. Autonomous smart grid fault detection is needed from the perspectives of power transmission, power distribution and power consumption.
II-A1 Power Transmission
As UHV AC and DC transmission systems become larger, more complex and more intelligent, the operating characteristics of power grid have undergone profound changes. On one hand, UHV power transmission is affected by factors such as changing regional climate and complex operating environment, and it is urgent to control the operation situation of the power grid synchronously. On the other hand, the transmission and reception of UHV power grids are closely coupled with AC and DC, which can easily lead to global security risks. Therefore, real-time early cross-regional risk warning and fault detection at the network-wide level is extremely important to assist in scheduling operations.
II-A2 Power Distribution
With the increase of penetration rate, a large number of distributed generators, electric vehicle charging devices, energy storage devices, and microgrids are connected to the distribution network. The randomness of renewable energy sources and loads, as well as the distributed multi-agent control characteristics, are likely to cause large power fluctuations and voltage limit violations, which pose severe challenges to the operation of the distribution network. Meanwhile, the tie-switch-based network reconfiguration method is limited by response speed, operational life and inrush current, and cannot meet the needs of future high-reliability users.
II-A3 Power Consumption
The structure of the low-voltage consumption power grid is extremely complex. Electrical faults, such as leakage faults, short circuit faults, overload faults and large contact resistance, are prone to occur during operation and affect users’ daily applications. The in-depth advancement of power consumption information collection provides professional analysis data for power supply voltage monitoring, power quality management, and power fault detection. Strengthening operation and maintenance work is necessary to ensure the stable operation of the collection system and the improvement of various assessment indicators.
II-B Advanced Technologies in Autonomous Fault Detection
As illustrated in Fig. 2, we describe some advanced technologies playing important roles in autonomous smart grid fault detection.
II-B1 Advanced Sensing and Communication
With the continuous development of the smart grid, its fault detection technology is also constantly developed and updated. In terms of sensing technology, in addition to traditional current and voltage transformers, sensing devices such as phasor measurement units, smart meters, acoustic sensors, vibration sensors, environmental sensors, and visible and infrared cameras have also been deployed in the grid system. The massive data collected by sensors needs to be sent to cloud servers or edge devices for processing, so low-latency communication with cloud servers and edge devices also needs to be considered. The smart grid uses many communication standards (e.g., Ethernet, optical fiber Ethernet, LTE, WiFi, ZigBee). With the realization of 5G technology and throughput-optimized communication resource allocation, Ultra-Reliable Low-Latency Communications (URLLC) and massive Machine-Type Communication (mMTC) make data transmission more reliable and faster, and provide support for real-time fault detection.
II-B2 Artificial Intelligence and Machine Learning
Advanced artificial intelligence (AI) technology has now entered the stage of independent learning from big data and, with further development, it will not require manual control throughout the process. This is particularly evident in power system fault detection.
The operation structure of conventional power system is complex, with extremely cumbersome internal links. If the power system fails during operation, it will be very difficult and error-prone to use the traditional manual troubleshooting methods. With the popularization and improvement of AI technology, machine learning algorithms can be used to deeply integrate and analyze the massive online and offline multi-source heterogeneous data of power systems, and extract the mapping relationships, which can quickly, efficiently and accurately detect and intelligently handle most power system faults. Different models are suitable for processing different types of data, and deep learning can also integrate multi-source heterogeneous data, which greatly improves the efficiency and ensures safe and reliable operation of the power grid.
II-B3 Edge Computing and Cloud-Edge Collaboration
Despite the advances in AI and computing technologies, there remains a need for instant availability of the required computational resources, due to the ever-increasing number of smart devices and the evolution of complex machine learning algorithms. With the help of communication and cloud computing, massive data processing and storage can be performed in cloud servers. However, cloud computing-based systems also suffer from network latency with high dynamic jitter, which makes it hard to meet the stringent delay requirement of fault detection in the smart grid.
Edge computing has recently emerged as a way to reduce the computing pressure of the end devices and the total delay of transmission and monitoring. By shifting part of the work in cloud servers to edge devices, a cloud-edge collaboration system can be formed. Through the study of data scheduling and computational resource allocation, the performance of such system can meet the delay bound requirements of fault detection and process massive data at the same time.
III Challenges and Solutions in Autonomous Smart Grid Fault Detection
This section overviews the technical challenges and possible solutions for autonomous smart grid fault detection.
III-A Heterogeneous Sensing
Sensing data is the basis for evaluating the status of power grid hardware infrastructures. According to the sampling frequency, it can be divided into three categories: static data, dynamic data and quasi-dynamic data. Static data mainly includes equipment account, nameplate parameters, test data before commissioning, geographic location, etc. Dynamic data is usually updated periodically by seconds, minutes or hours, mainly including equipment online monitoring data, operation data, inspection records, live detection data, environmental meteorological data, etc. And the quasi-dynamic data is usually updated regularly or irregularly on a monthly or yearly basis, mainly including fault defect data, equipment hidden danger records, and maintenance records.
With the accumulation of a large amount of data collected by the power system, the main characteristics of sensing data are as follows:
The amount of data is large and grows quickly. Electrical equipment is of many types and huge in number. Moreover, with the deployment of a large number of online monitoring sensors, the growth of condition monitoring data volume is becoming more and more pronounced.
The sources of sensing data are heterogeneous. With the continuous development of digital power grids, information platforms such as power equipment status monitoring systems, energy management systems, geographic information systems, and meteorological systems have been widely used, making the relevant information sources of sensing data more extensive.
Solution: To solve the challenges of heterogeneous sensing, one possible solution is data fusion. A large number of multi-source sensors collect various information about power equipment; some of these measurements are related to each other, while others are independent. Data fusion methods can automatically analyze and optimize information according to certain criteria, to obtain a consistent and complete characterization of the equipment and complete the required logical processing and decision-making. Through information-level fusion, feature-level fusion and decision-level fusion, the redundancy of the heterogeneous sensing data can be eliminated, which can improve the accuracy of fault detection.
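As a minimal illustration of two of the fusion levels mentioned above, the following Python sketch concatenates per-sensor feature vectors (feature-level fusion) and combines per-sensor fault probabilities by a weighted average (decision-level fusion). The sensor names, weights, probabilities and the 0.5 decision threshold are hypothetical values chosen for illustration; the weighted average is only one of many possible fusion rules and is not claimed to be the method used in this paper.

```python
from typing import Dict, List


def fuse_features(feature_sets: Dict[str, List[float]]) -> List[float]:
    """Feature-level fusion: concatenate per-sensor feature vectors.

    Sensors are visited in sorted name order so the fused vector layout is stable.
    """
    fused: List[float] = []
    for name in sorted(feature_sets):
        fused.extend(feature_sets[name])
    return fused


def fuse_decisions(fault_probs: Dict[str, float], weights: Dict[str, float]) -> float:
    """Decision-level fusion: weighted average of per-sensor fault probabilities."""
    total_weight = sum(weights[name] for name in fault_probs)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[name] * p for name, p in fault_probs.items())
    return weighted_sum / total_weight


# Hypothetical per-sensor outputs for one transformer (illustrative values only).
probs = {"acoustic": 0.9, "infrared": 0.6, "vibration": 0.3}
weights = {"acoustic": 2.0, "infrared": 1.0, "vibration": 1.0}

fused_prob = fuse_decisions(probs, weights)
is_fault = fused_prob >= 0.5  # assumed decision threshold
fused_vector = fuse_features({"vibration": [0.1, 0.2], "acoustic": [1.5]})
```

In practice the weights would reflect how reliable each sensor type is for a given fault class, which is itself something that has to be estimated from data.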
III-B Stringent Delay Requirement
Autonomous smart grid fault detection has stringent delay requirements. However, current studies of cloud-edge collaborative smart grid detection do not fully consider the delay requirement and the flexibility of the detection, and the resource allocation is complicated. Edge computing can provide relatively low latency. However, the limited computational resources of edge devices may not be sufficient for local detection on all data, and the computation time cannot be neglected. Thus, lightweight neural networks that differ in computational cost, computing time and detection accuracy can be designed for fault detection. Having to select among neural networks with different detection accuracy and complexity makes the resource allocation even more complicated. To the best of our knowledge, these issues remain unsolved.
For smart grid fault detection with delay and accuracy requirements, one possible solution is a cloud-edge collaborative fault detection system, which runs a high-performance neural network on the cloud server and a lower-accuracy lightweight neural network on the edge device to achieve fault detection. The inherent research problem is how to solve the resource allocation problem of such systems in an efficient and timely manner, either by optimization methods or by reinforcement learning-based solutions.
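The trade-off described above can be sketched as a simple per-sample routing rule: send the data to whichever detector is most accurate among those whose end-to-end delay (network plus computation) still meets the delay bound. This is a deliberately simplified sketch, not the resource allocation algorithm of this paper; the `Detector` class, its fields and all numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Detector:
    name: str
    accuracy: float          # expected detection accuracy in [0, 1]
    compute_delay_ms: float  # inference time on its host


def choose_detector(
    cloud: Detector,
    edge: Detector,
    cloud_net_delay_ms: float,
    edge_net_delay_ms: float,
    delay_bound_ms: float,
) -> Optional[Detector]:
    """Return the most accurate detector that meets the delay bound, or None."""
    feasible = []
    for detector, net_delay in ((cloud, cloud_net_delay_ms), (edge, edge_net_delay_ms)):
        if net_delay + detector.compute_delay_ms <= delay_bound_ms:
            feasible.append(detector)
    return max(feasible, key=lambda d: d.accuracy, default=None)


# Hypothetical detectors and delays (illustrative values only).
cloud_net = Detector("cloud-high-precision", accuracy=0.98, compute_delay_ms=5.0)
edge_net = Detector("edge-lightweight", accuracy=0.92, compute_delay_ms=20.0)

low_latency_choice = choose_detector(cloud_net, edge_net, 30.0, 2.0, delay_bound_ms=50.0)
high_latency_choice = choose_detector(cloud_net, edge_net, 80.0, 2.0, delay_bound_ms=50.0)
```

With a small backbone delay the rule picks the cloud model, and once the backbone delay exceeds the bound it falls back to the edge model. A real system must additionally account for the finite computational capacity of the edge device, which is what turns the problem into a joint scheduling and resource allocation problem.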
III-C Incipient Fault Detection
Before power equipment fails, pre-occurring abnormalities called incipient faults appear. An incipient fault is usually a self-clearing fault that lasts for a short time and recurs. Incipient faults are usually accompanied by uncertain and random arcs, and their duration ranges from a quarter cycle to up to four cycles, which makes them difficult to analyze. Most traditional incipient fault detection methods analyze the characteristics of the waveform and then classify the fault. For example, a method based on human-level concept learning has been proposed to detect incipient faults; it involves two steps, human-level waveform decomposition and hierarchical probability learning. However, these manually selected features have insufficient representation ability and separability for incipient faults, leading to poor fault detection performance. In addition, fault signals in power systems are strongly non-stationary, which makes it difficult to set appropriate metrics for autonomous smart grid fault detection.
AI methods can quickly detect incipient faults and readily adapt to changes in waveform characteristics. For example, a hybrid method for incipient fault monitoring and classification has been proposed, which uses machine learning algorithms to detect the faults. However, incipient faults rarely occur, and their duration is very short. When they appear, the protective devices deployed in the power system may not operate, and the fault recorder may not record the signals. Therefore, too few training samples are available to train accurate AI models.
One possible solution to incipient fault detection is to extract the time and frequency characteristics of the non-stationary signal, which can be achieved through the wavelet transform. The time-frequency information can then be embedded into recurrent neural networks to achieve better performance in non-stationary signal analysis. In addition, data augmentation methods, such as faulty signal switch, temporal extension, generative adversarial networks (GANs) and semi-supervised learning, can be designed to address the shortage of training samples.
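To make the wavelet-based feature extraction concrete, the sketch below implements a multi-level Haar discrete wavelet transform in plain Python and computes the energy in each frequency band. The Haar wavelet is chosen here only because it is the simplest to write down; it is an assumption of this sketch, as are the synthetic test signal and the use of band energies as features. Such features could then be fed to a recurrent neural network, which is not shown.

```python
import math
from typing import List, Tuple


def haar_dwt(signal: List[float]) -> Tuple[List[float], List[float]]:
    """One level of the orthonormal Haar DWT: returns (approximation, detail)."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return approx, detail


def haar_decompose(signal: List[float], levels: int) -> Tuple[List[float], List[List[float]]]:
    """Repeatedly split the approximation; details[0] is the finest scale."""
    details: List[List[float]] = []
    current = list(signal)
    for _ in range(levels):
        current, detail = haar_dwt(current)
        details.append(detail)
    return current, details


def band_energies(signal: List[float], levels: int) -> List[float]:
    """Energy per detail band (fine to coarse), followed by the approximation energy."""
    approx, details = haar_decompose(signal, levels)
    energies = [sum(c * c for c in band) for band in details]
    energies.append(sum(c * c for c in approx))
    return energies


# Synthetic example: one sine cycle with a short spike standing in for a transient.
waveform = [math.sin(2.0 * math.pi * k / 16.0) for k in range(16)]
waveform[8] += 5.0  # hypothetical short-duration disturbance
features = band_energies(waveform, levels=3)
```

Because the Haar transform is orthonormal, the band energies sum to the energy of the original signal, which gives a quick sanity check on the decomposition.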
III-D Resource Constrained Edge Computing
Edge computing can reduce the computing pressure on the cloud servers and the total delay of communication and monitoring. The combination of edge computing and intelligence, i.e., edge intelligence, allows the sensing data of the power system to be processed and analyzed locally and intelligently, which effectively reduces the delay and saves bandwidth resources. However, most edge devices have limited computational resources, and may not be able to perform local detection on all data with negligible computing time. Accurate but large AI models for fault detection, with billions of parameters, cannot be deployed on resource-constrained edge devices.
Furthermore, given that there are strict delay constraints in smart grid fault detection systems, including communication and computing delays, it is important to schedule data to obtain a large throughput while taking into account the stringent delay requirements [6, 8]. However, data scheduling and computational resource allocation can be formulated as a mixed integer optimization problem, which is NP-hard. Solving such a problem usually takes a long time and therefore cannot meet the delay requirements either.
Solution: One possible solution of the resource constrained edge computing is scalable AI models. To meet the delay requirements, we can design several lightweight neural networks with different structures, computing time, consumed resources, providing different detection accuracy. By allocating appropriate computational resources of edge device, the system throughput can be maximized and resource utilization can be improved while meeting the requirements of data transmission and processing delays. Besides, we can utilize deep reinforcement learning methods to solve the NP-hard data offloading and computational resource allocation problem in quasi-real-time with near optimal performance.
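The idea of scalable AI models can be sketched as follows: given a family of lightweight networks with different costs and accuracies, pick the most accurate one that the edge device can run within the computation time budget while keeping up with the data arrival rate. The model names, accuracies, per-sample costs and edge capacity below are hypothetical, and the linear cost model (computation time equals cost divided by capacity) is a simplifying assumption.

```python
from typing import NamedTuple, Optional, Sequence


class Model(NamedTuple):
    name: str
    accuracy: float           # expected detection accuracy in [0, 1]
    gflops_per_sample: float  # computational cost of one inference


def pick_model(
    models: Sequence[Model],
    edge_gflops_per_s: float,
    arrival_rate_hz: float,
    compute_budget_ms: float,
) -> Optional[Model]:
    """Most accurate model that meets both the per-sample budget and the arrival rate."""
    feasible = []
    for model in models:
        compute_ms = 1000.0 * model.gflops_per_sample / edge_gflops_per_s
        max_rate_hz = edge_gflops_per_s / model.gflops_per_sample
        if compute_ms <= compute_budget_ms and max_rate_hz >= arrival_rate_hz:
            feasible.append(model)
    return max(feasible, key=lambda m: m.accuracy, default=None)


# Hypothetical model family (illustrative values only).
family = [
    Model("tiny", accuracy=0.85, gflops_per_sample=0.5),
    Model("small", accuracy=0.90, gflops_per_sample=2.0),
    Model("large", accuracy=0.97, gflops_per_sample=20.0),
]

moderate_load = pick_model(family, edge_gflops_per_s=100.0, arrival_rate_hz=40.0, compute_budget_ms=50.0)
heavy_load = pick_model(family, edge_gflops_per_s=100.0, arrival_rate_hz=100.0, compute_budget_ms=50.0)
```

As the load on the edge device grows, the rule degrades gracefully from the more accurate model to the cheaper one instead of violating the delay bound. The joint problem over many data streams is combinatorial, which is why optimization or deep reinforcement learning methods are considered in the text.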
III-E Autonomous Inspection and Fault Detection
With the development of the power system, the requirements for the safety and reliability of power supply are getting higher. To ensure the normal operation of the power system, power inspection has become one of the important daily tasks of electric utilities. With traditional manual methods, operation and maintenance personnel need to inspect the various pieces of equipment in sequence along a planned route, which consumes a lot of manpower and time.
In the inspection and fault detection of the power system, outdoor substations and indoor power equipment are the focus. Regular inspections are required for checking meter data, switch positions, and equipment temperatures at different locations. In addition, substations and underground pipe corridors have problems such as long inspection distances, highly enclosed tunnels, inconvenient communication and harmful gases, which pose a threat to the personal safety of inspection personnel. For these reasons, the shift from traditional manual inspection to inspection and fault detection aided by autonomous robots has become an inevitable trend.
Solution: With the development of information and communication technologies, robots and unmanned aerial vehicles (UAVs) are becoming possible solutions for autonomous inspection and fault detection. In particular, power inspection robots are mainly used in indoor substations, outdoor substations, underground power pipe corridors and other scenarios, while high-altitude power racks, transmission lines, and towers are mainly monitored by drones. On this basis, a complete closed loop of real-time maintenance monitoring of the power system can be formed. The mobile robots and UAVs can perform data collection and transmission. After receiving the data, such platforms can also perform fault detection by integrating edge intelligence. The environmental adaptability of inspection robots is much higher than that of humans, which saves manpower and greatly improves the quality of inspection.
IV Case Study
This section introduces the case study, as a preliminary study for the autonomous smart grid fault detection.
IV-A System Overview
We design a cloud-edge collaborative fault detection system and deploy it in the UHV converter substation at Guquan, China, the receiving end of the Changji-Guquan ±1,100 kV UHV DC Transmission Project, which has the highest voltage level in the world. As shown in Fig. 3, the system is mainly composed of smart sensors, acoustic sensors, infrared cameras, a 5G base station, an edge intelligence empowered mobile robot, and a cloud server. The smart sensors collect heterogeneous sensing data in real time, such as voltage, current, acoustic fingerprint and vibration signals of transformers, visible and infrared images, as well as other environmental parameters. All the data is transmitted to the cloud server through the 5G base station, or to the edge intelligence robot through WiFi. In a cloud-edge collaborative manner, the high-precision neural network deployed on the cloud server and the lightweight neural networks deployed on the mobile robot analyze the sensing data to detect whether there are any faults in the power infrastructure of the substation and, if so, their type.
Although a 5G network with low air interface latency is used, the delay of transmitting data to the cloud server through the backbone network remains significant and is highly dynamic and jittery, and may exceed the stringent delay bound. Moreover, the computational resources of mobile robots are limited, and the computation time is not negligible even when a lightweight neural network is deployed. To this end, we design an algorithm for data scheduling and computational resource allocation in the cloud-edge collaborative system, which maximizes the system throughput while ensuring accuracy and delay, taking the real-time network delay, the available computational resources at edge servers and the detection accuracy of the neural networks into account. The process of scheduling and resource allocation is equivalent to solving a nonlinear integer programming problem. We prove that the continuous relaxation is convex, and use the Karush-Kuhn-Tucker (KKT) conditions to find its global optimal solution. Finally, the branch-and-bound method is used to find the binary solution of the original problem. Due to the page limit, we omit the details of the solution and refer the reader to the work on resource orchestration of cloud-edge based smart grid fault detection listed in the references for a similar example.
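The relax-then-branch pattern mentioned above can be illustrated on a much simpler toy problem. The sketch below is not the algorithm of this paper: it replaces the nonlinear objective and KKT-based relaxation with a linear 0/1 selection problem (which data streams to process on the edge, given a compute capacity), whose continuous relaxation is solved by a greedy fractional fill. All stream sizes, costs and the capacity are hypothetical, and costs are assumed to be positive.

```python
from typing import List, Sequence, Tuple


def relaxed_bound(items: List[Tuple[float, float]], capacity: float, start: int, value: float) -> float:
    """Upper bound from the continuous relaxation: fill the remaining capacity fractionally."""
    bound, remaining = value, capacity
    for item_value, item_cost in items[start:]:
        if item_cost <= remaining:
            bound += item_value
            remaining -= item_cost
        else:
            bound += item_value * remaining / item_cost
            break
    return bound


def branch_and_bound(values: Sequence[float], costs: Sequence[float], capacity: float) -> Tuple[float, List[int]]:
    """Maximize total value of streams scheduled to the edge subject to a compute capacity."""
    # Sort by value density so the greedy fractional fill solves the relaxation exactly.
    order = sorted(range(len(values)), key=lambda i: values[i] / costs[i], reverse=True)
    items = [(values[i], costs[i]) for i in order]
    best_value = 0.0
    best_choice = [0] * len(items)

    def visit(idx: int, value: float, remaining: float, choice: List[int]) -> None:
        nonlocal best_value, best_choice
        if value > best_value:
            best_value = value
            best_choice = choice + [0] * (len(items) - len(choice))
        if idx == len(items):
            return
        if relaxed_bound(items, remaining, idx, value) <= best_value:
            return  # prune: the relaxation cannot beat the incumbent
        item_value, item_cost = items[idx]
        if item_cost <= remaining:
            visit(idx + 1, value + item_value, remaining - item_cost, choice + [1])
        visit(idx + 1, value, remaining, choice + [0])

    visit(0, 0.0, capacity, [])
    assignment = [0] * len(values)
    for position, original_index in enumerate(order):
        assignment[original_index] = best_choice[position]
    return best_value, assignment


# Hypothetical streams: data volume gained (value) and edge compute needed (cost).
throughput, on_edge = branch_and_bound([60.0, 100.0, 120.0], [10.0, 20.0, 30.0], capacity=50.0)
```

Here `on_edge[i] == 1` means stream `i` is processed on the edge device. The pruning step is what keeps branch-and-bound tractable in practice: a subtree is skipped whenever the relaxed bound shows it cannot improve on the best binary solution found so far.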
IV-B Experimental Results
We conducted experiments to compare the proposed scheme with the cloud-only scheme (all data is transmitted to the cloud server for processing) and the edge-greedy scheme (as much data as possible is greedily scheduled to the edge side, and the remaining data is transmitted to the cloud platform). The results are shown in Fig. 4. It can be seen that when the network delay is small, data tends to be transmitted to the cloud server for processing. The cloud server has greater computing power and can provide higher detection accuracy under the delay requirement. As the network delay increases, data tends to be transmitted to edge devices. The transmission delay to the edge device is short and the successful transmission rate is high, but the detection accuracy is lower than that of the cloud server. Under the cloud-only and edge-greedy schemes, the amount of data transmitted to the cloud and edge is unchanged, so the average detection accuracy remains unchanged and the successful transmission rate decreases. The average detection accuracy of the cloud-edge collaboration scheme also gradually decreases. Because data tends to be transmitted to the cloud when the network delay is small and to the edge device when the network delay is large, the successful transmission rate of the cloud-edge collaboration scheme first decreases and then increases. Once the computational resources of the edge device are exhausted, excess data can only be transmitted to the cloud for processing, so the accuracy of the cloud-edge collaboration scheme remains unchanged and the successful transmission rate decreases. As the network delay increases, the throughput of the cloud-only, edge-greedy and cloud-edge collaboration schemes all gradually decreases, but the cloud-edge collaboration scheme has the largest throughput.
V Future Work
We envision that the autonomous smart grid fault detection will play a vital role in future society by offering a green, safe and efficient smart grid. This area opens up many exciting and critical future research directions.
Smart grid requires a large number of on-site measurement and control devices of different models and functions to be widely deployed in the whole network. If there is no unified requirement in design, production and use, different measurement and control devices may adopt different information formats and communication protocols. The collected information can only meet the application requirements of local sub-nets or some functional systems, but cannot be applied by all functional systems in the entire smart grid. This is exactly a key factor affecting autonomous smart grid fault detection. Therefore, it is urgent to carry out standardization work.
The success of autonomous smart grid fault detection cannot be achieved without standardization, through the implementation and development of technical standards based on consensus of all parties, including companies, users, interest groups, standards organizations, and governments. To the authors’ knowledge, this part of the work is still in progress and the related standardization efforts need further work.
V-B Network for Artificial Intelligence
Artificial intelligence algorithms can be used to deeply integrate and analyze the massive online and offline multi-source heterogeneous data of power systems, and extract the mapping relationships, key information and judgment rules to establish data-driven prediction, evaluation and diagnosis models, thus improving the response time and accuracy of system fault diagnosis and prediction.
To better support the application of artificial intelligence in autonomous smart grid fault detection, smart networks are expected. There are several studies on improving networks for artificial intelligence, but there are still many unresolved issues in autonomous smart grid fault detection, from the aspects of network architecture, network protocol to system optimization objective.
Offloading sensor data to edge devices or cloud servers exposes the smart grid to the risk of leaking node information, which may cause security problems. Security concerns for the smart grid are discussed in a recent study. Particular issues of security assurance and privacy preservation in autonomous smart grid fault detection must be further investigated as well.
In addition, distributed deep learning has recently become a research hotspot that allows learning without sharing raw data and provides an option for secure automated smart grid fault detection.
V-D Integration with Emerging Techniques
UAV systems are generating increasing interest in academia and industry. Due to their mobility, flexibility and maneuverability, they are effective options for a variety of uses such as surveillance and mobile edge computing. Network function virtualization (NFV) and software defined networking (SDN) enable the creation of automated, scalable and customizable networks. Caching facilitates popular content delivery and dramatically reduces network load. In-network computing enables programs that normally run on end hosts to be executed in network devices. That is, in-network computing focuses on computation within the network, using devices that already exist in the network system and are already used to forward traffic. These emerging technologies are already being used in smart grid systems and have improved their performance.
However, to the best of our knowledge, there are still many unresolved issues for optimal integration of these technologies in autonomous smart grid fault detection, such as the joint investigation of caching, task offloading with non-negligible input/output sizes, and aerial computing constraints.
VI Conclusion
This paper focuses on the autonomous fault detection in smart grids. We first present the basic principles of smart grid fault detection. Then we explain the new requirements for autonomous smart grid fault detection, the technical challenges and their possible solutions. A case study is conducted and shows the great advantages of the proposed solution. Furthermore, we highlight relevant directions for future research.
This work is supported in part by grants from the National Natural Science Foundation of China (52077049, 51877060, 62173120), the Anhui Provincial Natural Science Foundation (2008085UD04), and the 111 Project (BP0719039).
- (2019) A hybrid intelligent approach for classification of incipient faults in transmission network. IEEE Transactions on Power Delivery 34 (4), pp. 1785–1794.
- (2020) Low-carbon operation of multiple energy systems based on energy-carbon integrated prices. IEEE Transactions on Smart Grid 11 (2), pp. 1307–1318.
- (2011) Smart grid—the new and improved power grid: a survey. IEEE Communications Surveys & Tutorials 14 (4), pp. 944–980.
- (2019) Privacy-aware authenticated key agreement scheme for secure smart grid communication. IEEE Transactions on Smart Grid 10 (4), pp. 3953–3962.
- (2021) Decision-dependent uncertainty modeling in power system operational reliability evaluations. IEEE Transactions on Power Systems 36 (6), pp. 5708–5721.
- (2014) Communication infrastructure design in cyber physical systems with applications in smart grids: a hybrid system framework. IEEE Communications Surveys & Tutorials 16 (3), pp. 1689–1708.
- (2022) Resource orchestration of cloud-edge based smart grid fault detection. ACM Transactions on Sensor Networks (TOSN).
- (2021) Optimal resource allocation of 5G machine-type communications for situation awareness in active distribution networks. IEEE Systems Journal.
- (2021) Robust edge computing in UAV systems via scalable computing and cooperative computing. IEEE Wireless Communications.
- (2021) Buffered-microgrid structure for future power networks; a seamless microgrid control. IEEE Transactions on Smart Grid 12 (1), pp. 131–140.
- (2016) Edge computing: vision and challenges. IEEE Internet of Things Journal 3 (5), pp. 637–646.
- (2010) Detection of incipient faults in distribution underground cables. IEEE Transactions on Power Delivery 25 (3), pp. 1363–1371.
- (2020) Novel fault models for electronically coupled distributed energy resources and their laboratory validation. IEEE Transactions on Power Systems 35 (2), pp. 1209–1217.
- (2021) The post-fault current model of voltage source converter and its application in fault diagnosis. IEEE Transactions on Power Electronics 36 (2), pp. 1209–1214.
- (2020) Incipient fault identification in power distribution systems via human-level concept learning. IEEE Transactions on Smart Grid PP, pp. 1–1.