Detection and Estimation of the Invisible Units Using Utility Data Based on Random Matrix Theory

10/30/2017 ∙ by Xing He, et al. ∙ 0

Invisible units refer mainly to small-scale units that are not monitored, and thus are invisible to utilities and system operators, e.g., small-scale distributed units like unauthorized roof-top photovoltaics (PVs), and plug-and-play units like electric vehicles (EVs). Massive integration of invisible units into power systems could significantly affect the way in which the distribution grid is planned and operated. Based on random matrix theory (RMT), this paper proposes a data-driven approach for the detection, identification, and estimation of existing invisible units using only easily accessible utility data. Concatenated matrices and linear eigenvalue statistic (LES) indicators are suggested as the main ingredients of this solution. Furthermore, hypothesis testing is formulated for anomaly detection according to the statistical characteristics of the LES indicators. The proposed approach is promising for anomaly detection in a complex grid: it is able to detect invisible power usage and fraud behavior, and even to locate the suspect. Case studies using both simulated data and actual data validate the proposed method.




I Introduction

Future grids are fundamentally different from current ones [1]. Technology development, environmental pressure, and market reform have greatly spurred the deployment and penetration of distributed, renewable, and even plug-and-play units, on both the power generation side and the power consumption side. The worldwide small-scale roof-top photovoltaics (PVs) installation reached 23 GW at the end of 2013, and the growth is predicted to be 20 GW per year until 2018 [2]. The uptake of electric vehicles (EVs) also continues to increase. At least 665,000 electric-driven light-duty vehicles, 46,000 electric buses, and 235 million electric two-wheelers were in the worldwide market in early 2015 [3].

These distributed units are mostly invisible to utilities, i.e., they are not monitored by, and thus not visible to, power system operators, for three main reasons. 1) Bringing the operation data of distributed units into utility systems incurs enormous costs for data acquisition, communication, storage, calculation, and security [4]. 2) It is hard to describe these units using a fixed model or in a unified way; they are small-scale and mostly exhibit high uncertainty or individuality. 3) Some anomalous behaviors are essentially invisible. In 2009, over 20% of the total electricity generated was lost to theft in India alone [5]. In 2014, the system in Hawaii, which has the highest penetration of PVs in the U.S., recognized a large number of unauthorized PV installations [2].

Lack of visibility may result in incorrect planning and operation of power systems and, even worse, damage to system equipment such as transformers, voltage regulators, and customer appliances. In an environment with a high penetration of distributed energy resources, utilities are facing technical problems related to overvoltage, frequency control, and back-feeding flow, as well as other issues such as a rapid decrease in revenue. The prosumers are also bringing many unknowns and risks that need to be identified and managed [3].

To solve the above problems, many distribution utilities have begun deploying high-precision distribution phasor measurement units (PMUs) for monitoring, diagnostic, and control purposes [6]. High resolution voltage and current phasor measurements can be used in a plethora of applications concerning real-time system operation and long-term planning, such as state estimation, model validation, load characterization, and event detection and localization [7].

Many researchers have studied the impacts and risks of invisible units, especially PVs, on distribution systems [8]; little attention, however, has been paid to the detection and estimation of the invisible units, especially in a complex distributed grid. Some related research is found in the special issue of “Big Data Analytics for Grid Modernization” [9]. Reference [10] proposes a change-point detection algorithm for a time series. The change-point concept is relevant to our paper in spirit. The proposed algorithm, however, is effective only if the characteristics of all other units before and after the change point are similar. In addition, the spatial information of the utility data (data distributed across nodes) is not used. Reference [2] takes the uncertainty in PV sites into account, and estimates the power generation of invisible solar photovoltaic sites using the data generated by a small set of selected representative sites. Reference [11] proposes an approach of big data characterization for smart grids and a two-layer dynamic optimal synchrophasor measurement device selection algorithm for fault detection, identification, and causal impact analysis. Our previous work [1, 12, 13, 14, 15], based on random matrix theory (RMT), also outlines a data-driven methodology to conduct big data analytics for power systems. Our approach utilizes the temporal-spatial statistics.

I-A Contribution

This paper proposes an approach aimed at detection and estimation of the invisible units in a complex distribution grid; the analysis of these results will give insight into distribution network characteristics and consumer behaviors. Based on RMT, the proposed approach handles raw data in an unsupervised way and obtains Linear Eigenvalue Statistic (LES) indicators, which are in high-dimensional vector space and thus robust when considering data errors (e.g., data loss, data out-of-synchronization) [13]. Furthermore, using the statistical characteristics of LES indicators, hypothesis testing can be formulated for anomaly detection. The data analytics only rely on easily accessible utility data such as node voltage magnitudes and node power injections. Finally, the proposed method is validated using both simulated data of a complex grid and field data of a certain distribution system in China. The heart of the method is presented in Sec. III-B2.

II Problem Formulation

This paper attempts to conduct situation awareness in a non-omniscient distribution network. More precisely, we try to obtain the load/generator ingredients and their weights, and the power usage behaviors at the node level.

For any node, its customers are divided into two categories—typical load pattern units (TLPs) and uncertain load pattern ones (ULPs).

  1. The TLPs operate according to a well-defined profile, and are denoted as vectors . For instance, street-lamps are turned on at 18:00 and turned off at 6:00; their load pattern is modeled as

    If the sampling interval is 6 hours,

  2. The ULPs are also denoted as vectors, and might be further divided into three categories: completely random behavior, invisible behavior, and fraudulent behavior. We have already successfully distinguished completely random behavior from the others in our previous work [1, 13] by using random matrix tools. Next, we will focus on the detection and identification of invisible and fraudulent behavior. The former often causes a chain reaction and has an impact on other parameters. For instance, unauthorized residential PV installation and plug-in EV charging change the power flow. The latter often causes parameter deviation in isolation. For instance, a metering error or cyber attack might merely reduce the recorded value of power consumption without affecting the voltage.

Motivated by the above observations, we propose to study a general model for each node:


where vectors and are the daily patterns of TLPs and ULPs, with coefficients respectively. Thus, for vector is the daily power usage for the -th TLP, and similarly vector is the daily power usage for the -th ULP.

If all unit patterns and behaviors are known in advance, i.e., no ULP exists, or if the ULPs can be modeled as known patterns rather than uncertain ones, then Eq. (1) can be rewritten as


Our first step is to formulate the problem in terms of a classical optimization


where vectors and are the power injections of nodes and power losses of nodes, respectively, which are measurable and calculable. In addition, it is worth mentioning that the analysis for the reactive power may be conducted similarly.

For the modern distribution network, as described in Sec. I, ULPs play an important role: they are present and their influences need to be considered. They violate the prerequisites of most algorithms (e.g., the least squares method) and have significant effects on the final values of the coefficients in Eq. (1). In most cases, it is reasonable to model a ULP as a step signal. This is the case when plug-in EVs charge and/or unauthorized PVs generate over a limited time interval. Determining the start point and the end point of the step signal is at the heart of the problem. Based on random matrix theory (RMT) and linear eigenvalue statistics (LES), a statistical, data-driven solution, rather than its deterministic, empirical or model-based counterpart, is proposed to solve the problem.
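The sensitivity of a least-squares fit to an unmodeled step signal can be illustrated with a small sketch. It is not the paper's formulation: the profiles `office` and `evening`, the coefficient values, the step size, and the helper `estimate_coefficients` are hypothetical, and each of the four samples is assumed to represent the following 6 hours. Only the street-lamp pattern (on from 18:00 to 6:00) comes from the text.

```python
import numpy as np

def estimate_coefficients(patterns, usage):
    """Ordinary least-squares fit of usage ~ patterns @ coeffs.

    patterns: (T, k) array, one known daily TLP profile per column.
    usage:    (T,) array, daily power usage of one node.
    """
    coeffs, *_ = np.linalg.lstsq(patterns, usage, rcond=None)
    return coeffs

# Samples at 0:00, 6:00, 12:00 and 18:00 (6-hour sampling interval).
street_lamp = np.array([1.0, 0.0, 0.0, 1.0])  # on from 18:00 to 6:00
office = np.array([0.0, 1.0, 1.0, 0.0])       # hypothetical daytime load
evening = np.array([0.0, 0.0, 0.5, 1.0])      # hypothetical evening load
patterns = np.column_stack([street_lamp, office, evening])

true_coeffs = np.array([2.0, 5.0, 3.0])       # hypothetical weights
clean_usage = patterns @ true_coeffs

# Hypothetical ULP: a step of 4 units present only in the 12:00 sample.
ulp_step = np.array([0.0, 0.0, 4.0, 0.0])

clean_fit = estimate_coefficients(patterns, clean_usage)
biased_fit = estimate_coefficients(patterns, clean_usage + ulp_step)

print("fit without ULP:", clean_fit)
print("fit with ULP:   ", biased_fit)
```

Without the step, the fit returns the assumed coefficients; with the step, the estimates shift because part of the unmodeled signal is absorbed by the known profiles.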

III Mathematical Foundation

III-A Random Matrix Theory

III-A1 Statistics based on Random Matrix Theory

Random matrices have been an important issue in multivariate statistical analysis since the landmark work of Wishart on fixed-size Gaussian matrices. The asymptotic theory on the limiting spectrum of large random matrices was initially proposed in several works [16] by Wigner in the 1950s, motivated by problems in quantum physics. Since then, research on the finite spectral analysis of high-dimensional random matrices has been actively pursued by scholars in numerous disciplines. The RMT, as a statistical tool with a profound theoretical basis, is well suited to multivariate analysis. It can help model many intractable practical systems, especially those with numerous variables.

III-A2 Laws for Spectral Analysis

RMT mainly concerns two random matrix ensembles: the Gaussian unitary ensemble (GUE) and the Laguerre unitary ensemble (LUE).


where is the standard Gaussian Random Matrix.

Let be the empirical density of , and define its empirical spectral distribution (ESD) :


where is GUE or LUE matrix, represents the event indicator function. We investigate the rate of convergence of the expected ESD to Wigner’s Semicircle Law or Wishart’s M-P Law.

Let and denote the empirical eigenvalue density and ESD of , and the Wigner’s Semicircle Law [16] and Wishart’s Marchenko-Pastur (M-P) Law [17] say:


where .


Then, we denote the Kolmogorov distance between and as :


Götze and Tikhomirov, in their work [18], prove an optimal bound on the order of this Kolmogorov distance.
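As a rough numerical illustration of the M-P Law and of the Kolmogorov distance, the sketch below compares the eigenvalues of one sample covariance matrix with the M-P distribution. The matrix sizes, the seed, and the helper names `mp_density` and `kolmogorov_distance` are assumptions for illustration; the distance is approximated on a grid over the M-P support and is computed for a single realization rather than for the expected ESD.

```python
import numpy as np

def mp_density(x, c):
    """Marchenko-Pastur density for ratio c = N/T in (0, 1], unit variance."""
    a, b = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
    rho = np.zeros_like(x, dtype=float)
    inside = (x > a) & (x < b)
    xi = x[inside]
    rho[inside] = np.sqrt((b - xi) * (xi - a)) / (2 * np.pi * c * xi)
    return rho

def kolmogorov_distance(eigs, c, grid_size=20001):
    """Grid approximation of sup_x |empirical CDF - M-P CDF|."""
    a, b = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
    grid = np.linspace(a, b, grid_size)
    dens = mp_density(grid, c)
    # Trapezoidal cumulative integral of the density gives the M-P CDF.
    increments = (dens[1:] + dens[:-1]) / 2 * np.diff(grid)
    cdf = np.concatenate(([0.0], np.cumsum(increments)))
    # Fraction of eigenvalues less than or equal to each grid point.
    empirical = np.searchsorted(np.sort(eigs), grid, side="right") / eigs.size
    return float(np.max(np.abs(empirical - cdf)))

rng = np.random.default_rng(0)
N, T = 200, 800                      # hypothetical sizes, c = N/T = 0.25
X = rng.standard_normal((N, T))      # standard Gaussian random matrix
eigs = np.linalg.eigvalsh(X @ X.T / T)

distance = kolmogorov_distance(eigs, N / T)
print("approximate Kolmogorov distance:", distance)
```

Repeating the experiment with larger `N` and `T` at the same ratio should give a smaller distance, which is the qualitative behavior the convergence results describe.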

III-B Linear Eigenvalue Statistics and its Central Limit Theorem

The LES of an arbitrary matrix is defined in [19, 20] via the continuous test function


where the trace of the function of a random matrix is involved.

III-B1 Law of Large Numbers

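The Law of Large Numbers tells us that the LES, normalized by the matrix dimension, converges in probability to the limit given by the integral of the test function against the limiting eigenvalue density, i.e., the probability density function of the eigenvalues.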
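The convergence of the normalized LES can be checked numerically with a minimal sketch. It assumes a sample covariance matrix built from a standard Gaussian matrix and the test function 2x^2 - 1 (the Chebyshev polynomial of degree two, chosen here as an assumption); the names `chebyshev_t2` and `les` and the matrix sizes are hypothetical. Under the M-P Law with ratio c, the second moment is 1 + c, so the limit for this test function is 1 + 2c.

```python
import numpy as np

def chebyshev_t2(x):
    """Assumed test function: Chebyshev polynomial of degree two."""
    return 2.0 * x ** 2 - 1.0

def les(X, test_function=chebyshev_t2):
    """Sum of the test function over the eigenvalues of X X^T / T."""
    T = X.shape[1]
    eigs = np.linalg.eigvalsh(X @ X.T / T)
    return float(np.sum(test_function(eigs)))

rng = np.random.default_rng(1)
c = 0.25                                  # ratio N/T, hypothetical
limit = 1.0 + 2.0 * c                     # integral of 2x^2 - 1 against the M-P density

results = {}
for N in (50, 100, 200, 400):
    T = int(N / c)
    results[N] = les(rng.standard_normal((N, T))) / N
    print(f"N={N:4d}  LES/N={results[N]:.4f}  limit={limit:.4f}")
```

The printed ratios should settle near the limit as `N` grows, while the unnormalized LES keeps fluctuating by an amount that does not grow with `N`; that fluctuation is what the Central Limit Theorem describes.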


III-B2 Central Limit Theorem

The CLT [20] as the natural second step, aims to study the LES fluctuations [21]. Consider covariance matrix . The CLT for is given as follows [20]:

Theorem III.1 (M. Shcherbina, 2009).

Let the real valued test function satisfy condition . Then defined in (10), in the limit

, converges in the distribution to the Gaussian random variable with zero mean and the variance:


where and is the -th cumulant of entries of .

Eq. (8) has been used in a power grid in our previous work [14]. This paper takes a fundamentally different approach from (8). To study the convergence as a function of the matrix dimension, we study the LES instead of the probability distribution of eigenvalues in (8). For an arbitrary test function with enough smoothness, the LES is a (positive) scalar random variable defined in (9). The asymptotic limit of its expectation is given in (10), and the asymptotic limit of its variance is given in (11). These two equations are sufficient to study this scalar random variable. This approach can be viewed as a dimensionality reduction: the random data matrix is reduced to a (positive) scalar random variable. This dimension reduction is mathematically rigorous only in the asymptotic limit. Experience demonstrates, however, that moderate matrix dimensions are accurate enough for our practical purposes.

III-B3 Change Point Detection using LES

Change-point detection began with Page’s (1954, 1955) classical formulation, which was further developed by Shiryaev (1963) and Lorden (1971) [22]. Change-point detection addresses the following problem. Suppose that a sequence of independent observations is given. Up to an unknown change point they follow one distribution, while after it they follow another. The distributions may be completely specified or may depend on unknown parameters. In the case of a fixed number of observations, we would like to test the null hypothesis of no change, and perhaps to estimate the change point.

This paper formulates the hypothesis test in terms of the statistical characteristics of LES indicators. Theorem III.1 says that the LES indicator converges, in the asymptotic limit, in distribution to a Gaussian random variable with the mean and variance given in (10) and (11). Due to the Gaussian property, following a standard procedure, the detection is modeled as a binary hypothesis test between the normal hypothesis (no anomaly present) and the abnormal one, denoted by:


where the threshold value needs to be preset based on experience.
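A minimal sketch of such a threshold test is given below. It departs from the text in one labeled way: the mean and standard deviation under the normal hypothesis are estimated by Monte Carlo simulation instead of being taken from the theoretical expressions. The test function 2x^2 - 1, the threshold of 4 standard deviations, the matrix sizes, and the names `les_t2`, `calibrate_null` and `reject_h0` are assumptions for illustration.

```python
import numpy as np

def les_t2(X):
    """LES of X X^T / T with the assumed test function 2x^2 - 1."""
    T = X.shape[1]
    eigs = np.linalg.eigvalsh(X @ X.T / T)
    return float(np.sum(2.0 * eigs ** 2 - 1.0))

def calibrate_null(N, T, trials, rng):
    """Monte Carlo estimate of the LES mean and standard deviation under H0."""
    taus = np.array([les_t2(rng.standard_normal((N, T))) for _ in range(trials)])
    return taus.mean(), taus.std(ddof=1)

def reject_h0(tau, mean0, std0, threshold):
    """True when the deviation from the H0 mean exceeds threshold * std0."""
    return bool(abs(tau - mean0) > threshold * std0)

rng = np.random.default_rng(2)
N, T = 40, 160                                     # hypothetical sizes
mean0, std0 = calibrate_null(N, T, trials=200, rng=rng)

normal = rng.standard_normal((N, T))               # no anomaly present
common = rng.standard_normal((1, T))               # hypothetical shared signal
abnormal = rng.standard_normal((N, T)) + common    # every row carries the signal

normal_flag = reject_h0(les_t2(normal), mean0, std0, threshold=4.0)
abnormal_flag = reject_h0(les_t2(abnormal), mean0, std0, threshold=4.0)
print("normal data flagged:  ", normal_flag)
print("abnormal data flagged:", abnormal_flag)
```

The shared signal creates one very large eigenvalue, so the LES of the abnormal data lies far outside the range observed under the normal hypothesis. A smaller threshold raises sensitivity at the cost of more false alarms.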

III-C Concatenation Operation

Numerous causal factors affect the system state in different ways; sensitivity analysis is a valuable and active topic. Assume that there are several state variables and several factors, whose sampling data are multiple time series. Over a fixed period of interest, the sampling data of the state variables form a matrix (i.e., the state matrix), and the sampling data of each factor form a vector (i.e., the factor vector). Two matrices with the same length can be put together to form a concatenated matrix; in this way, we obtain a new matrix from the state matrix and the factor matrix.

In order to balance the proportion (to increase the statistical correlation), a factor matrix is formed for each factor vector. First, the factor vector is duplicated a suitable number of times to construct a matrix. Then, white noise is introduced into this matrix to avoid extremely strong cross-correlations. Thus, the factor matrix for the factor vector is expressed as


where the noise amplitude is related to the signal-to-noise ratio (SNR), and the entries of the noise matrix are Gaussian random variables.

Through the trace function the SNR of the factor matrix is defined as


In parallel, we can construct the concatenated matrix with each factor , expressed as


The relationships between the causal factors and the system state can be revealed by the concatenated matrix. The concatenated model is compatible with different units and different measurements for each variable (each occupying rows of the matrix), owing to the normalization during data preprocessing. In addition, it is worth mentioning that simple mathematical methods, e.g., interpolation, may be applied to handle data sources with different sampling rates.
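The construction of a factor matrix and of a concatenated matrix can be sketched as follows. This is a sketch under stated assumptions: the SNR is taken as the ratio of the trace of the duplicated-signal Gram matrix to that of the scaled-noise Gram matrix, which may differ from the paper's exact definition, and the function names, the number of copies, and the test data are hypothetical.

```python
import numpy as np

def factor_matrix(factor, copies, snr, rng):
    """Duplicate a factor vector `copies` times and add scaled white noise.

    The noise amplitude m is chosen so that
    Tr(D D^T) / Tr((m R)(m R)^T) equals `snr` (assumed SNR definition),
    where D is the duplicated signal and R the Gaussian noise matrix.
    """
    D = np.tile(factor, (copies, 1))               # copies x T
    R = rng.standard_normal(D.shape)               # white Gaussian noise
    m = np.sqrt(np.sum(D ** 2) / (snr * np.sum(R ** 2)))  # sum of squares = trace
    return D + m * R

def concatenate(state, factor_mat):
    """Stack the state matrix (n x T) on top of a factor matrix (k x T)."""
    if state.shape[1] != factor_mat.shape[1]:
        raise ValueError("state and factor data must have the same length")
    return np.vstack([state, factor_mat])

rng = np.random.default_rng(3)
T = 100
state = rng.standard_normal((5, T))                # hypothetical state matrix
factor = np.sin(np.linspace(0.0, 4.0 * np.pi, T))  # hypothetical factor vector

C = factor_matrix(factor, copies=2, snr=10.0, rng=rng)
A = concatenate(state, C)
print("concatenated matrix shape:", A.shape)
```

A larger number of copies gives the factor more weight relative to the state variables, while a lower SNR weakens the correlation among the duplicated rows.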

III-D Experiment Design Using Variable Data of Power Systems

The operating states of power systems can be estimated by various kinds of state variables, such as frequencies, voltages, currents, and power flows. In this paper, the state matrix is made up of , and the -th factor matrix is made up of according to (13). Similar to (15), we obtain


IV Simulation Cases

IV-A Background

Simulations are based on the IEEE 33-bus system for a distribution network, as shown in Fig. 1. For each node, its gross power usage and voltage magnitude are sampled at a high rate, for example, 9600 points per day (about 0.11 Hz). Then we introduce white noise to the power injections as


where and are two standard Gaussian random variables, i.e. . In this way, the related power flow is obtained via the software package Matpower.

Fig. 1: Topology of the IEEE 33-bus distribution network.

As mentioned in Sec. II, we mainly focus on fraudulent behavior and invisible power usage. Determining the start point and the end point of the step signal is the focus of this paper. For a longstanding anomaly without any step signal in the observed data segment, long-term indicators, such as the monthly line loss rate, might be sensitive. This is another topic that will be explored elsewhere.

IV-B Fraud Events in a Simple Scenario

Fraud events often cause parameter deviation. Suppose that the active power values of each node are at their initial points with the fluctuations defined in (17). Over a certain interval, fraud events on node 6 and node 14 reduce the measured power consumption of these two nodes. The sampling data, i.e., the power consumption and voltage magnitude of each node, are shown in Fig. 2. The lines with legends data 1 to data 33 are the actual power consumption of node 1 to node 33, and the lines with data 34 and data 35 are the measured power consumption of node 14 and node 6, respectively. According to the actual power consumption, i.e., data 1 to data 33, the voltage magnitudes are obtained in Fig. 2(b). Note that, due to the fraud events, data 14 and data 6 of Fig. 2(a) are not accessible to the utility.

Fig. 2: Power demand and voltage magnitudes of each node. (a) Power demand of each node. (b) Voltage magnitude of each node.

The matrix concatenation operation and the split-window method are used to handle the sampling data. Using (9), we choose a Chebyshev polynomial as the test function. The LES indicators of the state matrix and of the concatenated matrices (see Eq. (16)) are shown in Fig. 3.
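A windowed LES computation of this kind can be sketched as follows, assuming a window that slides along the time axis, row-wise normalization inside each window, and the test function 2x^2 - 1. The window width, the stride, the synthetic data, and the names `normalize_rows` and `sliding_les` are hypothetical; the sketch only shows that a step change shared by several rows produces a spike in the indicator while the window straddles the change.

```python
import numpy as np

def normalize_rows(X):
    """Give each row zero mean and unit standard deviation."""
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, keepdims=True)
    sd[sd == 0] = 1.0                      # guard against constant rows
    return (X - mu) / sd

def sliding_les(data, width, step=1):
    """LES (test function 2x^2 - 1) of each window of `width` samples."""
    total = data.shape[1]
    starts = np.arange(0, total - width + 1, step)
    taus = np.empty(starts.size)
    for k, s in enumerate(starts):
        W = normalize_rows(data[:, s:s + width])
        eigs = np.linalg.eigvalsh(W @ W.T / width)
        taus[k] = np.sum(2.0 * eigs ** 2 - 1.0)
    return starts, taus

rng = np.random.default_rng(4)
n, total, change_at = 20, 600, 300         # hypothetical sizes
data = rng.standard_normal((n, total))
data[:8, change_at:] += 5.0                # step change shared by 8 rows

starts, taus = sliding_les(data, width=80, step=5)
peak_start = int(starts[np.argmax(taus)])
print("window with the largest LES starts at sample", peak_start)
```

Windows lying entirely before or after the change see only noise after normalization, so the indicator returns to its baseline once the window has passed the change.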

Fig. 3: LES indicator in the simple scenario

In Fig. 3, the LES indicator of the state matrix is almost constant. From a statistical view, its theoretical expectation and standard deviation are accessible via random matrix theory, or rather, via Eq. (6), (9), and (11). It is found that the experimental indicator stays within the bounds determined by these two quantities. According to Eq. (12), we should accept the normal hypothesis: there is no factor actually affecting the system state during the observation period. On the other hand, the LES indicators of the concatenated matrices have four spikes: two spikes for each of the two nodes with fraud events. Our previous work [1] tells us how long such an anomaly should last in the indicator and where its extreme point should occur. This phenomenon is observed on both curves.

IV-C Invisible Power Usage and Fraud Events in a Complex Scenario

This subsection proposes a data-driven solution for the problem given in Sec. II: determining the start point and the end point so as to model the invisible power usage as a step signal. First, we assume a complex scenario:

  1. The power usage of each bus (e.g., bus ) generally consists of four TLPs and one ULP, denoted as


    The daily load profiles of the TLPs are set as in Tab. 4 and shown in Fig. 4. Note that a blue-filled rectangle means that the load profile has a dramatic change at this time point. Following [10], these special time points are denoted as change points (CPs). The coefficients are assumed as in Table I.

  2. We assume that there exist invisible power usage events on nodes 20 and 31: the periods are 1:00–5:00 and 14:00–20:00, and the percentages are 30% and 50%, respectively.

  3. We assume that there exist fraud events on nodes 6, 14 and 27: the periods are 20:00–22:00, 14:00–17:00 and 18:00–19:00, and the percentages are 7%, 8% and 12%, respectively.

    Tab. 4: Typical loads and their 24-hour power demand (hours 0 to 23). Note: blue-filled rectangle means CP.
    Entries listed by hour: 1: 100; 3: 100; 5: 0; 6: 0; 8: 88, 40; 9: 85; 11: 95; 13: 86; 14: 100; 16: 100; 17: 35; 18: 85; 20: 0; 22: 43.
Fig. 4: Daily power demands for typical loads
TABLE I: Coefficients of TLPs and ULP of each node (nodes 1 to 33).

Using (16), we obtain the active power and then calculate the voltages for the assumed complex scenario above; the results are shown in Fig. 5(a), 5(c) and 5(b).

With a procedure similar to that of Sec. IV-B, the LES curve is obtained in Fig. 5(d). Based on the curve of Fig. 5(d), we make the following observations:

  • The brown line at the bottom is the indicator ; it is relatively smooth.

  • The results shown in Fig. 5(d) match the settings of the daily load pattern in Tab. 4. Taking one TLP as an example, Fig. 5(d) shows that the indicators of nodes 25, 24, 32, 30, etc., have bright spikes at 3:00; in fact, 3:00 is a CP of this TLP in Tab. 4. The coefficients in Table I tell us that these listed nodes are exactly the ones in which this TLP takes a dominant part.

  • For the fraud events, the limit points are located at 5553, 6856, 7655, etc. According to Sec. IV-B, the corresponding key time points are 14:00 (5600), 17:00 (6800), 19:00 (7600), etc., respectively.

  • For the invisible power usage, we can locate the events using the special time points on nodes 31 and 20. For time points 200 and 700, the change point is 400, since 400 = [(200+50-50)+(700-50-50)]/2. With a similar procedure, the other CPs are found; together, these CPs are at 1:00, 5:00, 14:00 and 20:00. These results agree with the daily load pattern of Tab. 4 and the coefficients of Table I. The step signal for the ULP is modeled based on this analysis.
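The correspondence between sample indices and clock times used in these observations follows from the sampling rate of 9600 points per day, i.e., 400 points per hour. The small helper below is an illustrative sketch; the function names are hypothetical, and the last line simply reproduces the arithmetic quoted for the change point at sample 400.

```python
SAMPLES_PER_DAY = 9600                      # sampling rate used in the simulations
MINUTES_PER_DAY = 24 * 60

def index_to_clock(index, samples_per_day=SAMPLES_PER_DAY):
    """Convert a sample index within a day to an 'H:MM' clock string."""
    minutes = round(index * MINUTES_PER_DAY / samples_per_day)
    return f"{minutes // 60}:{minutes % 60:02d}"

def clock_to_index(hour, minute=0, samples_per_day=SAMPLES_PER_DAY):
    """Convert a clock time to the nearest sample index within a day."""
    return round((hour * 60 + minute) * samples_per_day / MINUTES_PER_DAY)

# Key time points quoted for the fraud events.
for idx in (5600, 6800, 7600):
    print(idx, "->", index_to_clock(idx))

# Change point obtained from the two special time points 200 and 700.
change_point = ((200 + 50 - 50) + (700 - 50 - 50)) / 2
print(change_point, "->", index_to_clock(change_point))

sampling_rate_hz = SAMPLES_PER_DAY / (24 * 3600)
print(f"sampling rate: {sampling_rate_hz:.3f} Hz")
```

With this mapping, the observed points 5553, 6856 and 7655 fall within about ten minutes of the key time points 14:00, 17:00 and 19:00.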

V Real-World Case Studies

V-A Data

We use a power grid with 5 substations in China (Fig. 6(a)). For each substation, its three-phase voltage data and current data are recorded at a three-minute sampling interval. We take a two-day time period as the data set, depicted in Fig. 6(c), 6(d), 6(e), and 6(f).

V-B Results: Ring Law and LES Indicator

If we choose a segment of the voltage data (in Fig. 6(c)), i.e., the voltage data from 0:00 to 2:00, the ring distribution is obtained according to our previous work [1], as shown in Fig. 6(b). Most eigenvalues are distributed between the inner circle and the outer circle. This implies that the real-world data does follow the Ring Law. With a similar process, and setting the test function as a Chebyshev polynomial and the likelihood ratio function, respectively, the LES curves are obtained as shown in Fig. 7(a), 7(b), 7(c), and 7(d). The grid is relatively smooth from 0:00 to 8:00 and has dramatic changes at around 8:30 a.m., 11:30 a.m., etc. This observation agrees with common sense. For the field data, the test function influences the result in some complicated ways, although the indicators have a similar trend at most CPs.

Fig. 5: Illustration of the data and analysis of a complicated scenario for behavior analysis. (a) Active power of each node. (b) Voltage of each node. (c) Zoom-in of power. (d) LES curve.

Fig. 6: Grid network and raw data of the real case. (a) The grid network. (b) Ring Law. (c) Day 1: voltage. (d) Day 2: voltage. (e) Day 1: current. (f) Day 2: current.

Note: For each substation, the 3-phase data are quite similar and only B-phase data are chosen.

Fig. 7: Illustration of the LES indicators of field data. (a) and (b) Day 1. (c) and (d) Day 2.

VI Conclusion

This paper extends our framework of using large random matrices to model a power grid in several ways. First, a model-free, data-driven statistical approach is proposed for the detection and estimation of the invisible units, a pressing problem in industry. Behind this approach, we exploit the statistical properties of massive datasets in a high-dimensional vector space. The temporal variations (sampling instants) are observed simultaneously with the spatial variations (grid nodes). Based on mathematically rigorous random matrix theory, time and space are unified through their ratio: what matters is the ratio of the two dimensions rather than either dimension alone. This observation is valid when both dimensions are large and comparable in size, which is often true in practice.

Second, we explore numerous practical aspects. Hypothesis tests, change point detection, and concatenation operations are investigated. The statistical features of Linear Eigenvalue Statistics (LES), i.e., their expectation and variance, are studied. Based on these features, the hypothesis test is designed for the detection of fraud behavior and anomaly behavior.

Third, real-world data are tested using our algorithms. We find that the experimental LES indicators agree with the theoretical predictions: the Ring Law is valid. Both the simulated cases and real-world cases validate the proposed approach as a powerful and effective way to gain insight into the distribution network characteristics and consumer behaviors.

We pave the way for future work with this paper. First, in the context of cyber attacks in distribution networks, our approach can locate these attacks. Second, the power of our algorithms depends on the selection of the test function; more test functions need to be studied and optimized using metrics.