I Introduction
Future grids are fundamentally different from current ones [1]. Technology development, environment pressure, and market reform have greatly spurred the deployment and penetration of the distributed, the renewable, and even the plugandplay units, on both the power generation side and the power consumption side. The worldwide smallscale rooftop photovoltaics (PVs) installation reached 23 GW at the end of 2013, and the growth is predicted to be 20 GW per year until 2018 [2]. The uptake of electric vehicles (EVs) also continues to increase. At least 665,000 electricdriven lightduty vehicles, 46,000 electric buses, and 235 million electric twowheelers were in the worldwide market in early 2015 [3].
These distributed units are mostly invisible to utilities, i.e., they are not monitored by, and thus not visible to, power system operators. 1) Accessing distributed units operation data into utility systems requires an enormous amount of cost paid for data acquisition, communication, storage, calculation, and security [4]. 2) It is hard to describe these units using a fixed model or in a united way; they are smallscale and mostly with high uncertainty or individuality. 3) Some anomaly behaviors are essentially invisible. In 2009, over 20% of total electricity generated is lost from theft in India alone [5]. In 2014, the system in Hawaii, with the highest penetration of PVs in the U.S., recognized a large number of unauthorized PV installations [2].
Lack of visibility may result in incorrect planning and operation of power systems, and even worse, damaging system equipment such as transformers, voltage regulators, and customer appliances. For a highly distributed energy resource penetration environment, utilities are facing technical problems related to overvoltage, frequency control, back feeding flow, and other issues such as a rapid decrease in revenue. The prosumers are also bringing many unknowns and risks that need to be identified and managed [3].
To solve the above problems, many distribution utilities have begun deploying highprecision distribution phasor measurement units (PMUs) for monitoring, diagnostic, and control purposes [6]. High resolution voltage and current phasor measurements can be used in a plethora of applications concerning realtime system operation and longterm planning, such as state estimation, model validation, load characterization, and event detection and localization [7].
Many researchers have studied the impacts and risks of invisible units, especially PVs, on distribution systems [8]; little attention, however, has been paid to the detection and estimation of the invisible units, especially in a complex distributed grid. Some related research is found in the special issue of “Big Data Analytics for Grid Modernization” [9]. Reference [10] proposes a changepoint detection algorithm for a time series. The changepoint concept is relevant to our paper in spirit. The proposed algorithm, however, is effective only if the characteristics of all other units before and after the change point are similar. In addition, the spatial information of the utility data (data distributed across nodes) are not used. Reference [2] takes the uncertainty in PV sites into account, and estimates the power generation of invisible solar photovoltaic sites using the data generated by a small set of selected representative sites. Reference [11] proposes an approach of big data characterization for smart grids and a twolayer dynamic optimal synchrophasor measurement devices selection algorithm for fault detection, identification, and causal impact analysis. Our previous work [1, 12, 13, 14, 15], based on random matrix theory (RMT), also outlines a datadriven methodology to conduct big data analytics for power systems. Our approach utilizes the temporalspatial statistics.
Ia Contribution
This paper proposes an approach aimed at detection and estimation of the invisible units in a complex distribution grid; the analysis of these results will give insight into distribution network characteristics and consumer behaviors. Based on RMT, the proposed approach handles raw data in an unsupervised way and obtains Linear Eigenvalue Statistic (LES) indicators, which are in highdimensional vector space and thus robust when considering data errors (e.g., data loss, data outofsynchronization)
[13]. Furthermore, using the statistical characteristics of LES indicators, hypothesis testing can be formulated for anomaly detection. The data analytics only rely on easily accessible utility data such as node voltage magnitudes and node power injections. Finally, the proposed method is validated using both simulated data of a complex grid and field data of a certain distribution system in China. The heart of the method is presented in Sec. IIIB2.Ii Problem Formulation
This paper attempts to conduct situation awareness in a nonomniscient distribution network. More precisely, we try to obtain the load/generator ingredients and their weights, and the power usage behaviors at the node level.
For any node, its customers are divided into two categories—typical load pattern units (TLPs) and uncertain load pattern ones (ULPs).

The TLPs operate according to a welldefined profile, and are denoted as vectors . For instance, streetlamps are turned on at 18:00 and turned off at 6:00; their load pattern is modeled as
If the sampling interval is 6 hours,

The ULPs are denoted as vectors , and might be further divided into three categories—completely random behavior, invisible behavior, and fraudulent behavior. We have already successfully distinguished completely random behavior from the others in our previous work [1, 13] by using random matrix tools. Next, we will focus on the detection and identification of invisible and fraudulent behavior. The former often causes a chain reaction and has an impact on other parameters. For instance, unauthorized residential PV installation and plugin EV charging changes the power flow. The latter often causes parameter deviation in isolation. For instance, some metering error or cyber attack might merely reduce data value of power consumption without affecting voltage .
Motivated by the above observations, we propose to study a general model for each node:
(1) 
where vectors and are the daily patterns of TLPs and ULPs, with coefficients respectively. Thus, for vector is the daily power usage for the th TLP, and similarly vector is the daily power usage for the th ULP.
If all the units patten and behaviors are known in advance, i.e., no exists, or if ULPs are able to be modeled as instead of uncertain , then Eq. (1) can be rewritten as
(2) 
Our first step is to formulate the problem in terms of a classical optimization
(3) 
where vectors and are the power injections of nodes and power losses of nodes, respectively, which are measurable and calculable. In addition, it is worth mentioning that the analysis for the reactive power may be conducted similarly.
For the modern distribution network, as described in Sec I, ULPs play an important role: are present and their influences need to be considered. They violate the prerequisites of most algorithms (e.g., least square method) and have significant effects on the final values of coefficients in Eq. (1). In most cases, it is reasonable to model as a step signal. This is the case when the plugin EVs charge and/or unauthorized PVs generate during to . Determining the start point and the end point of the step signal is at the heart of the problem. Based on random matrix theory (RMT) and linear eigenvalue statistics (LES), a statistical, datadriven solution, rather than its deterministic, empirical or modelbased counterpart, is proposed to solve the problem.
Iii Mathematical Foundation
Iiia Random Matrix Theory
IiiA1 Statistics based on Random Matrix Theory
Random matrices have been an important issue in multivariate statistical analysis since the landmark work of Wishart on fixed size Gaussian matrices. The asymptotic theory on the limiting spectrum of large random matrices was initially proposed in several works [16]
by Wigner in the 1950s, motivated by problems in quantum physics. Since then, research on the finite spectral analysis of high dimensional random matrices has come under heated discussion by scholars in numerous disciplines. The RMT, as a statistical tool with profound theoretical basis, is adapted to multivariate analysis. It can help model many intractable practical systems, especially those with numerous variables.
IiiA2 Laws for Spectral Analysis
RMT mainly concerns two ensemble random matrices—Gaussian unitary ensemble (GUE) and Laguerre unitary ensemble (LUE).
(4) 
where is the standard Gaussian Random Matrix.
Let be the empirical density of , and define its empirical spectral distribution (ESD) :
(5) 
where is GUE or LUE matrix, represents the event indicator function. We investigate the rate of convergence of the expected ESD to Wigner’s Semicircle Law or Wishart’s MP Law.
Let and denote the empirical eigenvalue density and ESD of , and the Wigner’s Semicircle Law [16] and Wishart’s MarchenkoPastur (MP) Law [17] say:
(6) 
where .
(7) 
Then, we denote the Kolmogorov distance between and as :
(8) 
Gotze and Tikhomirov, in their work [18], prove an optimal bound for of order .
IiiB Linear Eigenvalue Statistics and its Central Limit Theorem
The LES of an arbitrary matrix is defined in [19, 20] via the continuous test function
(9) 
where the trace of the function of a random matrix is involved.
IiiB1 Law of Large Numbers
The Law of Large Numbers tells us that
converges in probability to the limit
(10) 
where
is the probability density function of
.IiiB2 Central Limit Theorem
The CLT [20] as the natural second step, aims to study the LES fluctuations [21]. Consider covariance matrix . The CLT for is given as follows [20]:
Theorem III.1 (M. Sheherbina, 2009).
Let the real valued test function satisfy condition . Then defined in (10), in the limit
, converges in the distribution to the Gaussian random variable with zero mean and the variance:
(11)  
where and is the th cumulant of entries of .
Eq. (8) has been used in a power grid in our previous work [14]. This paper takes a fundamentally different approach from (8). To study the convergence as a function of
we study the LES instead of the probability distribution of eigenvalues in (
8). For an arbitrary test function with enough smoothness, the LES is a (positive) scalar random variable defined in (9). As the asymptotic limit of its expectation, is given in (10). As the asymptotic limit of its variance, is given in (11). These two equations are sufficient to study the scalar random variable This approach can be viewed as a dimensionality reduction. The random data matrix of size is reduced to a (positive) scalar random variable ! This dimension reduction is mathematically rigorous only when Experiences demonstrate, however, that moderate values of and are accurate enough for our practical purposes.IiiB3 Change Point Detection using LES
Changepoint detection began with Page’s (1954, 1955) classical formulation, which was further developed by Shiryaev (1963) and Lorden (1971) [22]. Changepoint detection is such a problem: Suppose are independent observations. For they have the distribution , while for they have the distribution . The distributions may be completely specified or may depend on unknown parameters. In the case of a fixed number
of observations, we would like to test the null hypothesis of no change, that
, and perhaps to estimate .This paper formulates the hypothesis test in terms of the statistical characteristics of LES indicators. Theorem III.1 says that the LES indicator , in the limit , converges in the distribution to a Gaussian random variable with mean and variance Due to the Gaussian property, following a standard procedure, the detection is modeled as a binary hypothesis test: normal hypothesis (no anomaly present) and abnormal one , denoted by:
(12) 
where is the threshold value, that needs to be preset based on experiences.
IiiC Concatenation Operation
Numerous causing factors affect the system state in different ways; sensitivity analysis is a valuable and hot topic. Assuming that there are state variables and factors, their sampling data are multiple timeseries. In a fixed period of interest , the sampling data of state variables consist of a matrix (i.e. state matrix), and the factors consist of (i.e. factor vector). Two matrices with the same length can be put together and a concatenated matrix is formed; in such a way, we obtain a new matrix using the state matrix and the factor matrix .
In order to balance the proportion (to increase the statistic correlation), a factor matrix is formed for each factor vector. First, for the factor , we duplicate it for times^{1}^{1}1 is appropriated to to construct a matrix , written as
Then, white noise is introduced into
to avoid extremely strong crosscorrelations. Thus, the factor matrix for the factor vector is expressed as(13) 
where is related to the signaltonoise ratio (SNR), and the entries of the matrix are Gaussian random variables.
Through the trace function the SNR of the factor matrix is defined as
(14) 
In parallel, we can construct the concatenated matrix with each factor , expressed as
(15) 
The relationships between causing factors and system state can be revealed by the concatenated matrix . The concatenated model is compatible with different units and different measurements for each variable data (in the form of rows of
), due to the normalisation during the data preprocessing. Besides, it is worth to mention that some simple mathematical methods, e.g., interpolation, may be applied to handle data source with different sampling rates.
IiiD Experiment Design Using Variable Data of Power Systems
Iv Simulation Cases
Iva Background
Simulations are based on the IEEE33 bus system for a distribution network, shown as Fig. 1. For node , its gross power usage and voltage magnitude are sampled at a high rate, for example, 9600 points per day (0.11 Hz). Then we introduce the white noise to the power injections as
(17) 
where and are two standard Gaussian random variables, i.e. . In this way, the related power flow is obtained via the software package Matpower.
As mentioned in Sec. II, we mainly focus on fraudulent behavior and invisible power usage. Determining the start point and the end point of the is the focus of this paper. For the longstanding anomaly without any step signals in the observed data segment, longterm indicators, such as monthly line loss rate, might be sensitive. This is another topic that will be explored elsewhere.
IvB Fraud events in a Simple Scenario
Fraud events often cause parameter deviation. Suppose that the active power values for each node are at their initial points with fluctuations defined in (17). From to , some fraud events on node6 and node14 cause a reduction of ( of the total , and of ). The sampling data, power consumptions and voltage magnitudes of each node, are shown as Fig. 2. The lines with legends data 1 to data 33 are for actual power consumption of node 1 to node 33, and lines with data 34 and data 35 are for measured power consumption of node 14 and node 6, respectively. According to the actual power consumption, i.e., data 1 to data 33, the voltage magnitudes are obtained in Fig. (b)b. Note that due to the fraud events, the data 14 and data 6 of Fig. (a)a are unreachable.
The matrix concatenation operation and the split window method are used to handle the sampling data. Using (9) for , we choose Chebyshev polynomials : as the test function. The LES indicators of state matrix and concatenated matrix (, referring Eq. (16)) are obtained as Fig. 3.
In Fig. 3, the LES indicator of state matrix , namely, is almost constant. From a statistic view, the theoretical expectation
and the standard deviation
are accessible via random matrix theory, or rather, via Eq. (6), (9), and (11). It is found that the experimental indicator is exactly bounded between and . According to Eq. (12), we should accept the hypothesis —there is no factor actually affecting the system state during the observation period. On the other hand, of state matrix , namely, has four spikes: two spikes for and two spikes for . Our previous work [1] tells us that the anomaly should last time points (i.e. ) and have an extreme point at . This phenomenon is observed on the curve and curve:IvC Invisible Power Usage and Fraud Events in a Complex Scenario
This subsection proposes a datadriven solution for the problem given in Sec. II—determining the start point and the end point to model the invisible power usage as a step signal. Firstly, we assume a complex scenario:

The power usage of each bus (e.g., bus ) generally consists of four TLPs and one ULP, denoted as
(18) The daily load profiles of TLPs are set as Tab. 4 and shown as Fig. 4. Note that the bluefilled rectangle means that the load profiles have a dramatic change at this time point. According to work [10], these special time points are denoted as change points (CPs). The coefficients are assumed as Tab I.

We assume that there exists invisible power usage events on node 20 and 31: the periods are 1:00–5:00 and 14:00–20:00, and the percentages are 30% and 50%, respectively.

We assume that there exist fraud events on node 6, 14 and 27, the periods are 20:00–22:00, 14:00–17:00 and 18:00–19:00, and the percentages are 7%, 8% and 12%, respectively.
Using (16), we obtain the active power and then calculate the voltages for the assumed complex scenario above; the results are shown as Fig. (a)a, (c)c and (b)b.
With a similar procedure to that of Sec IVB, the curve is obtained in Fig. (d)d. Based on the curve of Fig. (d)d, we make the following observations:

The brown line at the bottom is the indicator ; it is relatively smooth.

The results shown in Fig. (d)d match the settings of the daily load pattern in Tab. 4. Taking TLP as an example, Fig (d)d shows that the indicators of nodes 25, 24, 32, 30, etc, have bright spikes at 3:00; in fact, 3:00 is a CP of TLP in Tab. 4. The coefficients in Table I tell us that these listed nodes are the exact ones of which the TLP takes a dominant part.

For the fraud events, the limit points are located at 5553, 6856, 7655, etc. According to Sec IVB, the key time points are 14:00 (5600), 17:00 (6800), 19:00 (7600), etc., respectively.

For the invisible power usage, we can locate them using the special time points and node 31, 20. For time points 200, 700, the change point is 400.^{2}^{2}2400=[(200+5050)+(7005050)]/2 With similar procedure, the CPs are found as and these CPs are at 1:00, 5:00, 14:00 and 20:00. These results agree with the daily load pattern of Table 4 and the coefficients of Table I. The step signal for is modeled based on this analysis.
V RealWorld Case Studies
Va Data
VB Results: Ring Law and LES Indicator
If we choose (in Fig. (c)c), i.e., the voltage data during 0 a.m. to 2 a.m., the ring distribution is obtained according to our previous work [1], shown as Fig. (b)b. Most eigenvalues are distributed between the inner circle and the outer circle. This implies that the realworld data does follow the Ring Law. With a similar process, and setting the test function as Chebyshev Polynomials : and the Likelihood Ratio Function , respectively, the LES curves are obtain as Fig. (a)a, (b)b, (c)c, and (d)d. The grid is relatively smooth during 0 a.m. to 8 a.m. and has dramatic changes at around 8:30 a.m., 11:30 a.m., etc. This observation agrees with our common sense. For the field data, the test function will influence the result in some complicated ways, although the indicators have a similar trend at most CPs.
Vi Conclusion
This paper extends our framework of using large random matrices to model a power grid in several ways. First, a modelfree, datadriven statistical approach is proposed for the detection and estimation of the invisible units, a stressing problem in industry. Behind this approach, we exploit the statistical property of massive datasets in a highdimensional vector space. The temporal variations ( sampling instants) are simultaneously observed together with spatial variations ( grid nodes). Based on mathematically rigorously random matrix theory, time and space must be unified through their ratio What matters is the ratio rather than and ! This observation is valid when and are large and comparable in size, which is often true in practice.
Second, we explore numerous practical aspects. Hypothesis tests, change point detection, and concatenation operations are investigated. The statistical features of Linear Eigenvalue Statistics (LES), i.e. and , are studied. Based on these features, the hypothesis test is designed for the detection of fraud behavior and anomaly behavior.
Third, realworld data are tested using our algorithms. We find that the experimental LES indicators agree with the theoretical predictions: the Ring Law is valid. Both the simulated cases and realworld cases validate the proposed approach as a powerful and effective way to gain insight into the distribution network characteristics and consumer behaviors.
We pave the way for future work with this paper. First, in the context of cyber attacks in distribution networks, our approach can locate these attacks. Second, the power of our algorithms depends on the selection of the test function; more test functions need to be studied and optimized using metrics.
References
 [1] X. He, Q. Ai, R. C. Qiu, W. Huang, L. Piao, and H. Liu, “A big data architecture design for smart grids based on random matrix theory,” IEEE Transactions on Smart Grid, vol. 8, no. 2, pp. 674–686, 2017. [Online]. Available: http://arxiv.org/pdf/1501.07329.pdf
 [2] H. Shaker, H. Zareipour, and D. Wood, “Estimating power generation of invisible solar sites using publicly available data,” IEEE Transactions on Smart Grid, vol. 7, no. 5, pp. 2456–2465, Sept 2016.
 [3] Y. Parag and B. K. Sovacool, “Electricity market design for the prosumer era,” Nature Energy, vol. 1, p. 16032, 2016.
 [4] J. Hu and A. V. Vasilakos, “Energy big data analytics and security: Challenges and opportunities,” IEEE Transactions on Smart Grid, vol. 7, no. 5, pp. 2423–2436, Sept 2016.
 [5] V. Gaur and E. Gupta, “The determinants of electricity theft: An empirical analysis of indian states,” Energy Policy, vol. 93, pp. 127–136, 2016.
 [6] A. Von Meier, D. Culler, A. McEachern, and R. Arghandeh, “Microsynchrophasors for distribution systems,” in Innovative Smart Grid Technologies Conference (ISGT), 2014 IEEE PES. IEEE, 2014, pp. 1–5.
 [7] O. Ardakanian, Y. Yuan, R. Dobbe, A. von Meier, S. Low, and C. Tomlin, “Event detection and localization in distribution grids with phasor measurement units,” arXiv preprint arXiv:1611.04653, 2016.
 [8] A. Samadi, L. Söder, E. Shayesteh, and R. Eriksson, “Static equivalent of distribution grids with high penetration of pv systems,” IEEE Transactions on Smart Grid, vol. 6, no. 4, pp. 1763–1774, 2015.
 [9] T. Hong, C. Chen, J. Huang, N. Lu, L. Xie, and H. Zareipour, “Guest editorial big data analytics for grid modernization,” IEEE Transactions on Smart Grid, vol. 7, no. 5, pp. 2395–2396, Sept 2016.
 [10] X. Zhang and S. Grijalva, “A datadriven approach for detection and estimation of residential pv installations,” IEEE Transactions on Smart Grid, vol. 7, no. 5, pp. 2477–2485, Sept 2016.
 [11] H. Jiang, X. Dai, D. W. Gao, J. J. Zhang, Y. Zhang, and E. Muljadi, “Spatialtemporal synchrophasor data characterization and analytics in smart grid fault detection, identification, and impact causal analysis,” IEEE Transactions on Smart Grid, vol. 7, no. 5, pp. 2525–2536, Sept 2016.
 [12] X. Xu, X. He, Q. Ai, and C. Qiu, “A correlation analysis method for power systems based on random matrix theory,” IEEE Transactions on Smart Grid, vol. PP, no. 99, pp. 1–10, 2016. [Online]. Available: http://arxiv.org/pdf/1506.04854.pdf
 [13] X. He, R. C. Qiu, Q. Ai, L. Chu, X. Xu, and Z. Ling, “Designing for situation awareness of future power grids: An indicator system based on linear eigenvalue statistics of large random matrices,” IEEE Access, vol. 4, pp. 3557–3568, 2016. [Online]. Available: http://arxiv.org/pdf/1512.07082.pdf
 [14] X. He, L. Chu, Q. Ai, R. C. Qiu, and Z. Ling, “A Datadriven Situation Awareness Method Based on Random Matrix for Future Grids,” ArXiv eprints, Oct. 2016. [Online]. Available: https://arxiv.org/pdf/1610.05076.pdf
 [15] L. Chu, R. C. Qiu, X. He, Z. Ling, and Y. Liu, “Massive streaming pmu data modeling and analytics in smart grid state evaluation based on multiple highdimensional covariance tests,” IEEE Transactions on Big Data, vol. PP, no. 99, 2017.
 [16] E. P. Wigner, “On the distribution of the roots of certain symmetric matrices,” Annals of Mathematics, vol. 67, no. 2, pp. 325–327, Mar. 1958. [Online]. Available: http://www.jstor.org/stable/197008
 [17] V. A. Marčenko and L. A. Pastur, “Distribution of eigenvalues for some sets of random matrices,” Sbornik: Mathematics, vol. 1, no. 4, pp. 457–483, 1967.
 [18] F. Götze and A. Tikhomirov, “The rate of convergence for spectra of gue and lue matrix ensembles,” Open Mathematics, vol. 3, no. 4, pp. 666–704, 2005.

[19]
A. Lytova, L. Pastur et al.
, “Central limit theorem for linear eigenvalue statistics of random matrices with independent entries,”
The Annals of Probability, vol. 37, no. 5, pp. 1778–1840, 2009.  [20] M. Shcherbina, “Central limit theorem for linear eigenvalue statistics of the wigner and sample covariance random matrices,” ArXiv eprints, Jan. 2011. [Online]. Available: http://arxiv.org/pdf/1101.3249.pdf
 [21] R. Qiu and P. Antonik, Smart Grid and Big Data. John Wiley and Sons, 2015.
 [22] D. Siegmund, “Changepoints: From sequential detection to biology and back,” Sequential Analysis, vol. 32, no. 1, pp. 2–14, 2013.
Comments
There are no comments yet.