Future grids are fundamentally different from current ones . Technology development, environment pressure, and market reform have greatly spurred the deployment and penetration of the distributed, the renewable, and even the plug-and-play units, on both the power generation side and the power consumption side. The worldwide small-scale roof-top photovoltaics (PVs) installation reached 23 GW at the end of 2013, and the growth is predicted to be 20 GW per year until 2018 . The up-take of electric vehicles (EVs) also continues to increase. At least 665,000 electric-driven light-duty vehicles, 46,000 electric buses, and 235 million electric two-wheelers were in the worldwide market in early 2015 .
These distributed units are mostly invisible to utilities, i.e., they are not monitored by, and thus not visible to, power system operators. 1) Accessing distributed units operation data into utility systems requires an enormous amount of cost paid for data acquisition, communication, storage, calculation, and security . 2) It is hard to describe these units using a fixed model or in a united way; they are small-scale and mostly with high uncertainty or individuality. 3) Some anomaly behaviors are essentially invisible. In 2009, over 20% of total electricity generated is lost from theft in India alone . In 2014, the system in Hawaii, with the highest penetration of PVs in the U.S., recognized a large number of unauthorized PV installations .
Lack of visibility may result in incorrect planning and operation of power systems, and even worse, damaging system equipment such as transformers, voltage regulators, and customer appliances. For a highly distributed energy resource penetration environment, utilities are facing technical problems related to overvoltage, frequency control, back feeding flow, and other issues such as a rapid decrease in revenue. The prosumers are also bringing many unknowns and risks that need to be identified and managed .
To solve the above problems, many distribution utilities have begun deploying high-precision distribution phasor measurement units (PMUs) for monitoring, diagnostic, and control purposes . High resolution voltage and current phasor measurements can be used in a plethora of applications concerning real-time system operation and long-term planning, such as state estimation, model validation, load characterization, and event detection and localization .
Many researchers have studied the impacts and risks of invisible units, especially PVs, on distribution systems ; little attention, however, has been paid to the detection and estimation of the invisible units, especially in a complex distributed grid. Some related research is found in the special issue of “Big Data Analytics for Grid Modernization” . Reference  proposes a change-point detection algorithm for a time series. The change-point concept is relevant to our paper in spirit. The proposed algorithm, however, is effective only if the characteristics of all other units before and after the change point are similar. In addition, the spatial information of the utility data (data distributed across nodes) are not used. Reference  takes the uncertainty in PV sites into account, and estimates the power generation of invisible solar photovoltaic sites using the data generated by a small set of selected representative sites. Reference  proposes an approach of big data characterization for smart grids and a two-layer dynamic optimal synchrophasor measurement devices selection algorithm for fault detection, identification, and causal impact analysis. Our previous work [1, 12, 13, 14, 15], based on random matrix theory (RMT), also outlines a data-driven methodology to conduct big data analytics for power systems. Our approach utilizes the temporal-spatial statistics.
This paper proposes an approach aimed at detection and estimation of the invisible units in a complex distribution grid; the analysis of these results will give insight into distribution network characteristics and consumer behaviors. Based on RMT, the proposed approach handles raw data in an unsupervised way and obtains Linear Eigenvalue Statistic (LES) indicators, which are in high-dimensional vector space and thus robust when considering data errors (e.g., data loss, data out-of-synchronization). Furthermore, using the statistical characteristics of LES indicators, hypothesis testing can be formulated for anomaly detection. The data analytics only rely on easily accessible utility data such as node voltage magnitudes and node power injections. Finally, the proposed method is validated using both simulated data of a complex grid and field data of a certain distribution system in China. The heart of the method is presented in Sec. III-B2.
Ii Problem Formulation
This paper attempts to conduct situation awareness in a non-omniscient distribution network. More precisely, we try to obtain the load/generator ingredients and their weights, and the power usage behaviors at the node level.
For any node, its customers are divided into two categories—typical load pattern units (TLPs) and uncertain load pattern ones (ULPs).
The TLPs operate according to a well-defined profile, and are denoted as vectors . For instance, street-lamps are turned on at 18:00 and turned off at 6:00; their load pattern is modeled as
If the sampling interval is 6 hours,
The ULPs are denoted as vectors , and might be further divided into three categories—completely random behavior, invisible behavior, and fraudulent behavior. We have already successfully distinguished completely random behavior from the others in our previous work [1, 13] by using random matrix tools. Next, we will focus on the detection and identification of invisible and fraudulent behavior. The former often causes a chain reaction and has an impact on other parameters. For instance, unauthorized residential PV installation and plug-in EV charging changes the power flow. The latter often causes parameter deviation in isolation. For instance, some metering error or cyber attack might merely reduce data value of power consumption without affecting voltage .
Motivated by the above observations, we propose to study a general model for each node:
where vectors and are the daily patterns of TLPs and ULPs, with coefficients respectively. Thus, for vector is the daily power usage for the -th TLP, and similarly vector is the daily power usage for the -th ULP.
If all the units patten and behaviors are known in advance, i.e., no exists, or if ULPs are able to be modeled as instead of uncertain , then Eq. (1) can be rewritten as
Our first step is to formulate the problem in terms of a classical optimization
where vectors and are the power injections of nodes and power losses of nodes, respectively, which are measurable and calculable. In addition, it is worth mentioning that the analysis for the reactive power may be conducted similarly.
For the modern distribution network, as described in Sec I, ULPs play an important role: are present and their influences need to be considered. They violate the prerequisites of most algorithms (e.g., least square method) and have significant effects on the final values of coefficients in Eq. (1). In most cases, it is reasonable to model as a step signal. This is the case when the plug-in EVs charge and/or unauthorized PVs generate during to . Determining the start point and the end point of the step signal is at the heart of the problem. Based on random matrix theory (RMT) and linear eigenvalue statistics (LES), a statistical, data-driven solution, rather than its deterministic, empirical or model-based counterpart, is proposed to solve the problem.
Iii Mathematical Foundation
Iii-a Random Matrix Theory
Iii-A1 Statistics based on Random Matrix Theory
Random matrices have been an important issue in multivariate statistical analysis since the landmark work of Wishart on fixed size Gaussian matrices. The asymptotic theory on the limiting spectrum of large random matrices was initially proposed in several works 
by Wigner in the 1950s, motivated by problems in quantum physics. Since then, research on the finite spectral analysis of high dimensional random matrices has come under heated discussion by scholars in numerous disciplines. The RMT, as a statistical tool with profound theoretical basis, is adapted to multivariate analysis. It can help model many intractable practical systems, especially those with numerous variables.
Iii-A2 Laws for Spectral Analysis
RMT mainly concerns two ensemble random matrices—Gaussian unitary ensemble (GUE) and Laguerre unitary ensemble (LUE).
where is the standard Gaussian Random Matrix.
Let be the empirical density of , and define its empirical spectral distribution (ESD) :
where is GUE or LUE matrix, represents the event indicator function. We investigate the rate of convergence of the expected ESD to Wigner’s Semicircle Law or Wishart’s M-P Law.
Then, we denote the Kolmogorov distance between and as :
Gotze and Tikhomirov, in their work , prove an optimal bound for of order .
Iii-B Linear Eigenvalue Statistics and its Central Limit Theorem
where the trace of the function of a random matrix is involved.
Iii-B1 Law of Large Numbers
Iii-B2 Central Limit Theorem
Theorem III.1 (M. Sheherbina, 2009).
we study the LES instead of the probability distribution of eigenvalues in (8). For an arbitrary test function with enough smoothness, the LES is a (positive) scalar random variable defined in (9). As the asymptotic limit of its expectation, is given in (10). As the asymptotic limit of its variance, is given in (11). These two equations are sufficient to study the scalar random variable This approach can be viewed as a dimensionality reduction. The random data matrix of size is reduced to a (positive) scalar random variable ! This dimension reduction is mathematically rigorous only when Experiences demonstrate, however, that moderate values of and are accurate enough for our practical purposes.
Iii-B3 Change Point Detection using LES
Change-point detection began with Page’s (1954, 1955) classical formulation, which was further developed by Shiryaev (1963) and Lorden (1971) . Change-point detection is such a problem: Suppose are independent observations. For they have the distribution , while for they have the distribution . The distributions may be completely specified or may depend on unknown parameters. In the case of a fixed number
of observations, we would like to test the null hypothesis of no change, that, and perhaps to estimate .
This paper formulates the hypothesis test in terms of the statistical characteristics of LES indicators. Theorem III.1 says that the LES indicator , in the limit , converges in the distribution to a Gaussian random variable with mean and variance Due to the Gaussian property, following a standard procedure, the detection is modeled as a binary hypothesis test: normal hypothesis (no anomaly present) and abnormal one , denoted by:
where is the threshold value, that needs to be preset based on experiences.
Iii-C Concatenation Operation
Numerous causing factors affect the system state in different ways; sensitivity analysis is a valuable and hot topic. Assuming that there are state variables and factors, their sampling data are multiple time-series. In a fixed period of interest , the sampling data of state variables consist of a matrix (i.e. state matrix), and the factors consist of (i.e. factor vector). Two matrices with the same length can be put together and a concatenated matrix is formed; in such a way, we obtain a new matrix using the state matrix and the factor matrix .
In order to balance the proportion (to increase the statistic correlation), a factor matrix is formed for each factor vector. First, for the factor , we duplicate it for times111 is appropriated to to construct a matrix , written as
Then, white noise is introduced intoto avoid extremely strong cross-correlations. Thus, the factor matrix for the factor vector is expressed as
where is related to the signal-to-noise ratio (SNR), and the entries of the matrix are Gaussian random variables.
Through the trace function the SNR of the factor matrix is defined as
In parallel, we can construct the concatenated matrix with each factor , expressed as
The relationships between causing factors and system state can be revealed by the concatenated matrix . The concatenated model is compatible with different units and different measurements for each variable data (in the form of rows of
), due to the normalisation during the data preprocessing. Besides, it is worth to mention that some simple mathematical methods, e.g., interpolation, may be applied to handle data source with different sampling rates.
Iii-D Experiment Design Using Variable Data of Power Systems
Iv Simulation Cases
Simulations are based on the IEEE-33 bus system for a distribution network, shown as Fig. 1. For node , its gross power usage and voltage magnitude are sampled at a high rate, for example, 9600 points per day (0.11 Hz). Then we introduce the white noise to the power injections as
where and are two standard Gaussian random variables, i.e. . In this way, the related power flow is obtained via the software package Matpower.
As mentioned in Sec. II, we mainly focus on fraudulent behavior and invisible power usage. Determining the start point and the end point of the is the focus of this paper. For the longstanding anomaly without any step signals in the observed data segment, long-term indicators, such as monthly line loss rate, might be sensitive. This is another topic that will be explored elsewhere.
Iv-B Fraud events in a Simple Scenario
Fraud events often cause parameter deviation. Suppose that the active power values for each node are at their initial points with fluctuations defined in (17). From to , some fraud events on node-6 and node-14 cause a reduction of ( of the total , and of ). The sampling data, power consumptions and voltage magnitudes of each node, are shown as Fig. 2. The lines with legends data 1 to data 33 are for actual power consumption of node 1 to node 33, and lines with data 34 and data 35 are for measured power consumption of node 14 and node 6, respectively. According to the actual power consumption, i.e., data 1 to data 33, the voltage magnitudes are obtained in Fig. (b)b. Note that due to the fraud events, the data 14 and data 6 of Fig. (a)a are unreachable.
The matrix concatenation operation and the split window method are used to handle the sampling data. Using (9) for , we choose Chebyshev polynomials : as the test function. The LES indicators of state matrix and concatenated matrix (, referring Eq. (16)) are obtained as Fig. 3.
In Fig. 3, the LES indicator of state matrix , namely, is almost constant. From a statistic view, the theoretical expectation
and the standard deviationare accessible via random matrix theory, or rather, via Eq. (6), (9), and (11). It is found that the experimental indicator is exactly bounded between and . According to Eq. (12), we should accept the hypothesis —there is no factor actually affecting the system state during the observation period. On the other hand, of state matrix , namely, has four spikes: two spikes for and two spikes for . Our previous work  tells us that the anomaly should last time points (i.e. ) and have an extreme point at . This phenomenon is observed on the curve and curve:
Iv-C Invisible Power Usage and Fraud Events in a Complex Scenario
This subsection proposes a data-driven solution for the problem given in Sec. II—determining the start point and the end point to model the invisible power usage as a step signal. Firstly, we assume a complex scenario:
The power usage of each bus (e.g., bus ) generally consists of four TLPs and one ULP, denoted as
The daily load profiles of TLPs are set as Tab. 4 and shown as Fig. 4. Note that the blue-filled rectangle means that the load profiles have a dramatic change at this time point. According to work , these special time points are denoted as change points (CPs). The coefficients are assumed as Tab I.
We assume that there exists invisible power usage events on node 20 and 31: the periods are 1:00–5:00 and 14:00–20:00, and the percentages are 30% and 50%, respectively.
We assume that there exist fraud events on node 6, 14 and 27, the periods are 20:00–22:00, 14:00–17:00 and 18:00–19:00, and the percentages are 7%, 8% and 12%, respectively.
The brown line at the bottom is the indicator ; it is relatively smooth.
The results shown in Fig. (d)d match the settings of the daily load pattern in Tab. 4. Taking TLP as an example, Fig (d)d shows that the indicators of nodes 25, 24, 32, 30, etc, have bright spikes at 3:00; in fact, 3:00 is a CP of TLP in Tab. 4. The coefficients in Table I tell us that these listed nodes are the exact ones of which the TLP takes a dominant part.
For the fraud events, the limit points are located at 5553, 6856, 7655, etc. According to Sec IV-B, the key time points are 14:00 (5600), 17:00 (6800), 19:00 (7600), etc., respectively.
For the invisible power usage, we can locate them using the special time points and node 31, 20. For time points 200, 700, the change point is 400.222400=[(200+50-50)+(700-50-50)]/2 With similar procedure, the CPs are found as and these CPs are at 1:00, 5:00, 14:00 and 20:00. These results agree with the daily load pattern of Table 4 and the coefficients of Table I. The step signal for is modeled based on this analysis.
V Real-World Case Studies
V-B Results: Ring Law and LES Indicator
If we choose (in Fig. (c)c), i.e., the voltage data during 0 a.m. to 2 a.m., the ring distribution is obtained according to our previous work , shown as Fig. (b)b. Most eigenvalues are distributed between the inner circle and the outer circle. This implies that the real-world data does follow the Ring Law. With a similar process, and setting the test function as Chebyshev Polynomials : and the Likelihood Ratio Function , respectively, the LES curves are obtain as Fig. (a)a, (b)b, (c)c, and (d)d. The grid is relatively smooth during 0 a.m. to 8 a.m. and has dramatic changes at around 8:30 a.m., 11:30 a.m., etc. This observation agrees with our common sense. For the field data, the test function will influence the result in some complicated ways, although the indicators have a similar trend at most CPs.
This paper extends our framework of using large random matrices to model a power grid in several ways. First, a model-free, data-driven statistical approach is proposed for the detection and estimation of the invisible units, a stressing problem in industry. Behind this approach, we exploit the statistical property of massive datasets in a high-dimensional vector space. The temporal variations ( sampling instants) are simultaneously observed together with spatial variations ( grid nodes). Based on mathematically rigorously random matrix theory, time and space must be unified through their ratio What matters is the ratio rather than and ! This observation is valid when and are large and comparable in size, which is often true in practice.
Second, we explore numerous practical aspects. Hypothesis tests, change point detection, and concatenation operations are investigated. The statistical features of Linear Eigenvalue Statistics (LES), i.e. and , are studied. Based on these features, the hypothesis test is designed for the detection of fraud behavior and anomaly behavior.
Third, real-world data are tested using our algorithms. We find that the experimental LES indicators agree with the theoretical predictions: the Ring Law is valid. Both the simulated cases and real-world cases validate the proposed approach as a powerful and effective way to gain insight into the distribution network characteristics and consumer behaviors.
We pave the way for future work with this paper. First, in the context of cyber attacks in distribution networks, our approach can locate these attacks. Second, the power of our algorithms depends on the selection of the test function; more test functions need to be studied and optimized using metrics.
-  X. He, Q. Ai, R. C. Qiu, W. Huang, L. Piao, and H. Liu, “A big data architecture design for smart grids based on random matrix theory,” IEEE Transactions on Smart Grid, vol. 8, no. 2, pp. 674–686, 2017. [Online]. Available: http://arxiv.org/pdf/1501.07329.pdf
-  H. Shaker, H. Zareipour, and D. Wood, “Estimating power generation of invisible solar sites using publicly available data,” IEEE Transactions on Smart Grid, vol. 7, no. 5, pp. 2456–2465, Sept 2016.
-  Y. Parag and B. K. Sovacool, “Electricity market design for the prosumer era,” Nature Energy, vol. 1, p. 16032, 2016.
-  J. Hu and A. V. Vasilakos, “Energy big data analytics and security: Challenges and opportunities,” IEEE Transactions on Smart Grid, vol. 7, no. 5, pp. 2423–2436, Sept 2016.
-  V. Gaur and E. Gupta, “The determinants of electricity theft: An empirical analysis of indian states,” Energy Policy, vol. 93, pp. 127–136, 2016.
-  A. Von Meier, D. Culler, A. McEachern, and R. Arghandeh, “Micro-synchrophasors for distribution systems,” in Innovative Smart Grid Technologies Conference (ISGT), 2014 IEEE PES. IEEE, 2014, pp. 1–5.
-  O. Ardakanian, Y. Yuan, R. Dobbe, A. von Meier, S. Low, and C. Tomlin, “Event detection and localization in distribution grids with phasor measurement units,” arXiv preprint arXiv:1611.04653, 2016.
-  A. Samadi, L. Söder, E. Shayesteh, and R. Eriksson, “Static equivalent of distribution grids with high penetration of pv systems,” IEEE Transactions on Smart Grid, vol. 6, no. 4, pp. 1763–1774, 2015.
-  T. Hong, C. Chen, J. Huang, N. Lu, L. Xie, and H. Zareipour, “Guest editorial big data analytics for grid modernization,” IEEE Transactions on Smart Grid, vol. 7, no. 5, pp. 2395–2396, Sept 2016.
-  X. Zhang and S. Grijalva, “A data-driven approach for detection and estimation of residential pv installations,” IEEE Transactions on Smart Grid, vol. 7, no. 5, pp. 2477–2485, Sept 2016.
-  H. Jiang, X. Dai, D. W. Gao, J. J. Zhang, Y. Zhang, and E. Muljadi, “Spatial-temporal synchrophasor data characterization and analytics in smart grid fault detection, identification, and impact causal analysis,” IEEE Transactions on Smart Grid, vol. 7, no. 5, pp. 2525–2536, Sept 2016.
-  X. Xu, X. He, Q. Ai, and C. Qiu, “A correlation analysis method for power systems based on random matrix theory,” IEEE Transactions on Smart Grid, vol. PP, no. 99, pp. 1–10, 2016. [Online]. Available: http://arxiv.org/pdf/1506.04854.pdf
-  X. He, R. C. Qiu, Q. Ai, L. Chu, X. Xu, and Z. Ling, “Designing for situation awareness of future power grids: An indicator system based on linear eigenvalue statistics of large random matrices,” IEEE Access, vol. 4, pp. 3557–3568, 2016. [Online]. Available: http://arxiv.org/pdf/1512.07082.pdf
-  X. He, L. Chu, Q. Ai, R. C. Qiu, and Z. Ling, “A Data-driven Situation Awareness Method Based on Random Matrix for Future Grids,” ArXiv e-prints, Oct. 2016. [Online]. Available: https://arxiv.org/pdf/1610.05076.pdf
-  L. Chu, R. C. Qiu, X. He, Z. Ling, and Y. Liu, “Massive streaming pmu data modeling and analytics in smart grid state evaluation based on multiple high-dimensional covariance tests,” IEEE Transactions on Big Data, vol. PP, no. 99, 2017.
-  E. P. Wigner, “On the distribution of the roots of certain symmetric matrices,” Annals of Mathematics, vol. 67, no. 2, pp. 325–327, Mar. 1958. [Online]. Available: http://www.jstor.org/stable/197008
-  V. A. Marčenko and L. A. Pastur, “Distribution of eigenvalues for some sets of random matrices,” Sbornik: Mathematics, vol. 1, no. 4, pp. 457–483, 1967.
-  F. Götze and A. Tikhomirov, “The rate of convergence for spectra of gue and lue matrix ensembles,” Open Mathematics, vol. 3, no. 4, pp. 666–704, 2005.
A. Lytova, L. Pastur et al.
, “Central limit theorem for linear eigenvalue statistics of random matrices with independent entries,”The Annals of Probability, vol. 37, no. 5, pp. 1778–1840, 2009.
-  M. Shcherbina, “Central limit theorem for linear eigenvalue statistics of the wigner and sample covariance random matrices,” ArXiv e-prints, Jan. 2011. [Online]. Available: http://arxiv.org/pdf/1101.3249.pdf
-  R. Qiu and P. Antonik, Smart Grid and Big Data. John Wiley and Sons, 2015.
-  D. Siegmund, “Change-points: From sequential detection to biology and back,” Sequential Analysis, vol. 32, no. 1, pp. 2–14, 2013.