I Introduction
Phasor measurement units (PMUs) have been introduced in recent years as instruments for monitoring power grids in real time. PMUs provide accurate, synchronized, real-time information on the voltage phasor at 30-60 Hz, as well as information about current flows. When processed appropriately, this data has the potential to support rapid identification of anomalies in the operation of the power system. In this paper, we use this data to detect line outage events, discriminating between outages on different lines. This discrimination capability (known in machine learning as “classification”) is made possible by the fact that the topological change to the grid resulting from a line outage leads (after a transient period during which currents and voltages fluctuate) to a new steady state of voltage and power values. The pattern of voltage and power changes is somewhat distinctive for different line outages. By gathering or simulating many samples of these changes, under different load conditions, we can train a machine-learning classifier to recognize each type of line outage. Further, given that it is not common in current practice to install PMUs on all buses, we extend our methodology to place a limited number of PMUs in the network in a way that maximizes the performance of outage detection, or to find optimal locations for additional PMUs in a network that is already instrumented with some PMUs.

Earlier works on classification of line outages from PMU data are based on a linear (DC) power flow model [1, 2], or make use only of phasor angle changes [3, 4, 5], or design a classifier that depends only linearly on the differences in sensor readings before and after an event [6]. These approaches fail to exploit fully the modeling capabilities provided by the AC power flow equations, the information supplied by PMUs, and the power of modern machine learning techniques. Neural networks have the potential to extract automatically from the observations the information that is crucial to distinguishing between outage events, transforming the raw data vectors into a form that makes the classification more accurate and reliable. Although the computational burden of training a neural-network classifier is heavy, this processing can be done “offline.” The cost of deploying the trained classifier is low. Outages can be detected and classified quickly, in real time, possibly leading to faster remedial action on the grid, and less damage to the infrastructure and to customers. The idea of using neural networks on PMU data is also studied in [7] to detect multiple simultaneous line outages, in the case that PMU data from all buses are available along with data for power injections at all buses.
The use of neural networks in deep learning is currently the subject of much investigation. Neural networks have yielded significant advances in computer vision and speech recognition, often outperforming human experts, especially when the hidden information in the raw input is not captured well by linear models. The limitations of linear models can sometimes be overcome by means of laborious feature engineering, which requires expert domain knowledge, but this process may nevertheless miss vital information hidden in the data that is not discernible even by an expert. We show below that, in this application to outage detection on power grids, even generic neural network models are effective at classifying outages accurately across wide ranges of demands and seasonal effects. Previous work on data-based methods for outage detection demonstrated the outage-detecting ability of these models only for a limited range of demand profiles. We show that neural network models can cope with a wider range of realistic demand scenarios that incorporate seasonal, diurnal, and random fluctuations. Although not explored in this paper, our methodology could incorporate various scenarios for power supply at generation nodes as well. We show too that effective outage detection can be achieved with information from PMUs at a limited set of network locations, and provide methodology for choosing these locations so as to maximize the outage detection performance.
Our approach differs from most approaches to machine learning classification in one important respect. Usually, the data used to train a classifier is historical or streaming, gathered by passive observation of the system under study. Here, instead, we are able to generate the data as required, via a high-fidelity model based on the AC power flow equations. Since we can generate enough instances of each type of line outage to make them clearly recognizable and distinguishable, we have an important advantage over traditional machine learning. The role of machine learning is thus slightly different from the usual setting. The classifier serves as a proxy for the physical model (the AC power flow equations), treating the model as a black box and performing the classification task phenomenologically based on its responses to the “stimuli” of line outages. Though the offline computational cost of training the model to classify outages is high, the neural network proxy can be deployed rapidly, requiring much less online computation than an inversion of the original model.
This work is an extension and generalization of [6], where a linear machine learning model (multiclass logistic regression, or MLR) is used to predict the relation between the PMU readings and the outage event. The neural-network scheme has MLR as its final layer, but the network contains additional “hidden layers” that perform nonlinear transformations of the raw data vectors of PMU readings. We show empirically that the neural network gives superior classification performance to MLR in a setting in which the electricity demands vary over a wider range than that considered in [6]. (The wider range of demands causes the PMU signatures of each outage to be more widely dispersed, and thus harder to classify.) A similar approach to outage detection was discussed in [8], using a linear MLR model, with PMU data gathered during the transient immediately after the outage has occurred, rather than the difference between the steady states before and after the outage, as in [6]. Data is required from all buses in [8], whereas in [6] and in the present paper, we consider too the situation in which data is available from only a subset of PMUs.

Another line of work that uses neural networks for outage detection is reported in [7] (later expanded into the report [9], which appeared after the original version of this paper was submitted). The neural networks used in [7, 9] and in our paper are similar in having a single hidden layer. However, the data used as inputs to the neural networks differs. We use the voltage angles and magnitudes reported by PMUs, whereas [7, 9] use only voltage angles along with power injection data at all buses. Moreover, [7, 9] require PMU data from all buses, whereas we focus on identifying a subset of PMU locations that optimizes classification performance. A third difference is that [7, 9] aim to detect multiple, simultaneous line outages using a multi-label classification formulation, while we aim to identify only single or simultaneous double-line outages. The latter are typically the first events to occur in a large-scale grid failure, and rapid detection enables remedial action to be taken. We note too that PMU data is simulated in [7, 9] by using a DC power flow model, rather than our AC model, and that a variety of power injections are obtained in the PMU data not by varying over a plausible range of seasonal and diurnal demand/generation variations (as we do) but rather by perturbing voltage angles randomly and inferring the effects of these perturbations on power readings at the buses.
This paper is organized as follows. In Section II, we give the mathematical formulation of the neural network model, and the regularized formulation that can be used to determine optimal PMU placement. We then discuss efficient optimization algorithms for training the models in Section III. Computational experiments are described in Section IV. A convergence proof for the optimization method is presented in the Appendix.
II Neural Network and Sparse Modeling
In this section, we discuss our approach of using neural network models to identify line outage events from PMU change data, and extend the formulation to find optimal placements of PMUs in the network. (We avoid a detailed discussion of the AC power flow model.) We use the following notation for outage events.

$y_i$ denotes the outage represented by event $i$. It takes a value in the set $\{1, 2, \dots, K\}$, where $K$ represents the total number of possible outage events (roughly equal to the number of lines in the network that are susceptible to failure).

$x_i \in \mathbb{R}^d$ is the vector of differences between the pre-outage and post-outage steady-state PMU readings.

In the parlance of machine learning, $y_i$ is known as a label and $x_i$ is a feature vector. Each $i$ indexes a single item of data; we use $n$ to denote the total number of items, which is a measure of the size of the data set.
II-A Neural Network
A neural network is a machine learning model that transforms the data vectors $x_i$ via a series of transformations (typically linear transformations alternating with simple componentwise nonlinear transformations) into another vector to which a standard linear classification operation such as MLR is applied. The transformations can be represented as a network. The nodes in each layer of this network correspond to elements of an intermediate data vector; nonlinear transformations are performed on each of these elements. The arcs between layers correspond to linear transformations, with the weights on each arc representing an element of the matrix that describes the linear transformation. The bottom layer of nodes contains the elements of the raw data vector, while a “softmax” operation applied to the outputs of the top layer indicates the probabilities of the vector belonging to each of the $K$ possible classes. The layers and nodes strictly between the top and bottom layers are called “hidden layers” and “hidden nodes.”

A neural network is trained by determining values of the parameters representing the linear and nonlinear transformations such that the network performs well in classifying the data objects $(x_i, y_i)$, $i = 1, 2, \dots, n$. More specifically, we would like the probability assigned to node $y_i$ for input vector $x_i$ to be close to $1$, for each $i = 1, 2, \dots, n$. The linear transformations between layers are learned from the data, allowing complex interactions between individual features to be captured. Although deep learning lacks a satisfying theory, the layered structure of the network is thought to mimic gradual refinement of the information, for highly complicated tasks. In our current application, we expect the input features (the PMU changes before and after an outage event) to be related to the event in complex ways, making the choice of a neural network model reasonable.
Training of the neural network can be formulated as an optimization problem as follows. Let $N$ be the number of hidden layers in the network, with $d_1, d_2, \dots, d_N$ being the number of hidden nodes in each hidden layer. ($d_0 = d$ denotes the dimension of the raw input vectors, while $d_{N+1} = K$ is the number of classes.) We denote by $W_j$ the matrix of dimensions $d_j \times d_{j-1}$ that represents the linear transformation of the output of layer $j-1$ to the input of layer $j$. The nonlinear transformation that occurs within each layer is represented by the function $\sigma$. With some flexibility of notation, we obtain $\sigma(x)$ by applying the same transformation to each component of $x$. In our model, we use the $\tanh$ function, which transforms each element $\nu \in \mathbb{R}$ as follows:

$$\nu \mapsto \frac{e^{\nu} - e^{-\nu}}{e^{\nu} + e^{-\nu}}. \tag{1}$$

(Other common choices of $\sigma$ include the sigmoid function $\nu \mapsto 1/(1 + e^{-\nu})$ and the rectified linear unit $\nu \mapsto \max(0, \nu)$.) This nonlinear transformation is not applied at the output layer $N+1$; the outputs of this layer are obtained by applying an MLR classifier to the outputs of layer $N$.

Using this notation, together with $[n] := \{1, 2, \dots, n\}$ and $[N] := \{1, 2, \dots, N\}$, we formulate the training problem as:
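$$\min_{W_1, \dots, W_{N+1}} \; f(W_1, \dots, W_{N+1}), \tag{2}$$

where the objective is defined by

$$f(W_1, \dots, W_{N+1}) := \sum_{i=1}^{n} \ell\left(x_i^{N+1}, y_i\right) + \frac{\epsilon}{2} \sum_{j=1}^{N+1} \|W_j\|_F^2, \tag{3a}$$

$$\text{subject to} \quad x_i^{N+1} = W_{N+1} x_i^{N}, \quad i \in [n], \tag{3b}$$

$$x_i^{j} = \sigma\left(W_j x_i^{j-1}\right), \quad i \in [n], \; j \in [N], \tag{3c}$$

$$x_i^{0} = x_i, \quad i \in [n], \tag{3d}$$

for some given regularization parameter $\epsilon \ge 0$ and Frobenius norm $\|\cdot\|_F$, and nonnegative convex loss function $\ell$.^1 We use the constraints in (3) to eliminate the intermediate variables $x_i^j$, $j = 1, 2, \dots, N+1$, so that indeed (2) is an unconstrained optimization problem in $W_1, \dots, W_{N+1}$. The loss function $\ell(x_i^{N+1}, y_i)$ quantifies the accuracy with which the neural network predicts the label $y_i$ for data vector $x_i$. As is common, we use the MLR loss function, which is the negative logarithm of the softmax operation, defined by

$$\ell(z, y_i) := -\log\left(\frac{e^{z_{y_i}}}{\sum_{k=1}^{K} e^{z_k}}\right) = -z_{y_i} + \log\left(\sum_{k=1}^{K} e^{z_k}\right), \tag{4}$$

where $z = (z_1, z_2, \dots, z_K)^\top$. Since for a transformed data vector $z$, the neural network assigns a probability proportional to $\exp(z_k)$ for each outcome $k \in \{1, 2, \dots, K\}$, this function is minimized when the neural network assigns zero probabilities to the incorrect labels $k \ne y_i$.

^1 We chose a small positive value of $\epsilon$ for our experiments, as a positive value is required for the convergence theory; see in particular Lemma 1 in the Appendix. The computational results were very similar for $\epsilon = 0$, however.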
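As a small illustration of the model just described, the following sketch evaluates a network with tanh hidden layers and a linear output layer, and then the MLR (negative log-softmax) loss. It is a minimal sketch under stated assumptions, not the authors' implementation: the helper names (`matvec`, `forward`, `mlr_loss`), the toy weights, and the use of 0-based class labels are all illustrative choices, and bias terms and the Frobenius-norm regularizer are omitted.

```python
import math

def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def forward(weights, x):
    """Apply tanh after every layer except the last, which stays linear."""
    a = x
    for W in weights[:-1]:
        a = [math.tanh(v) for v in matvec(W, a)]
    return matvec(weights[-1], a)

def mlr_loss(z, y):
    """Negative log of the softmax probability of class y (0-based) for scores z."""
    m = max(z)  # subtract the maximum before exponentiating, for numerical stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in z))
    return log_sum - z[y]

# Toy example (hypothetical numbers): 3 input features, 2 hidden nodes, 2 classes.
W1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
W2 = [[1.0, -1.0], [-1.0, 1.0]]
scores = forward([W1, W2], [1.0, 0.0, -1.0])
loss = mlr_loss(scores, 0)
```

The loss is always nonnegative and approaches zero only as the score of the correct class dominates all others, which matches the remark that the minimum corresponds to zero probability on the incorrect labels.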
In practice, we add “bias” terms at each layer, so that the transformations actually have the form $x_i^{j-1} \mapsto W_j x_i^{j-1} + w_j$ for some parameter $w_j \in \mathbb{R}^{d_j}$. We omit this detail from our description, for simplicity of notation.
II-B Inducing Sparsity via Group-LASSO Regularization
In current practice, PMU sensors are attached to only a subset of transmission lines, typically near buses. We can modify the formulation of neural network training to determine which PMU locations are most important in detecting line outages. Following [6], we do so with the help of a nonsmooth term in the objective that penalizes the use of each individual sensor, thus allowing the selection of only those sensors which are most important in minimizing the training loss function (3). This penalty takes the form of the sum of Frobenius norms on submatrices of the first-layer matrix $W_1$, where each submatrix corresponds to a particular sensor. Suppose that $G_r \subset \{1, 2, \dots, d\}$ is the subset of features in $x_i$ that are obtained from sensor $r$. If the columns $j \in G_r$ of the matrix $W_1$ are zero, then these entries of $x_i$ are ignored (the products $W_1 x_i$ will be independent of the values $(x_i)_j$ for $j \in G_r$), so the sensor $r$ is not needed. Denoting by $G$ a set of sensors, we define the regularization term as follows:

$$c(W_1, G) := \sum_{r \in G} \left\|(W_1)_{\cdot G_r}\right\|_F \tag{5a}$$

$$\phantom{c(W_1, G) :}= \sum_{r \in G} \sqrt{\sum_{i=1}^{d_1} \sum_{j \in G_r} (W_1)_{i,j}^2}. \tag{5b}$$

(We can take $G$ to be the full set of sensors or some subset, as discussed in Subsection III-B.) This form of regularizer is sometimes known as a group-LASSO [10, 11, 12]. With this regularization term, the objective in (2) is replaced by

$$\min_{W_1, \dots, W_{N+1}} \; f(W_1, \dots, W_{N+1}) + \tau \, c(W_1, G), \tag{6}$$

for some tunable parameter $\tau \ge 0$. A larger $\tau$ induces more zero groups (indicating fewer sensors) while a smaller value of $\tau$ tends to give lower training error at the cost of using more sensors. Note that no regularization is required on $W_j$ for $j > 1$, since $W_1$ is the only matrix that operates directly on the vectors of data from the sensors.
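The group penalty can be made concrete with a short sketch: each sensor owns a set of columns of the first-layer matrix, and the penalty is the sum over sensors of the Frobenius norm of that sensor's columns. This is illustrative only; the function name `group_penalty`, the sensor labels, and the toy matrix are assumptions, not taken from the paper.

```python
import math

def group_penalty(W1, groups):
    """Sum over sensors of the Frobenius norm of that sensor's columns of W1.

    W1 is a list of rows; groups maps a sensor id to its column indices.
    """
    total = 0.0
    for cols in groups.values():
        total += math.sqrt(sum(row[j] ** 2 for row in W1 for j in cols))
    return total

# Hypothetical example: two sensors, each contributing two features (columns).
W1 = [[3.0, 0.0, 0.0, 0.0],
      [4.0, 0.0, 0.0, 0.0]]
groups = {"bus_a": [0, 1], "bus_b": [2, 3]}

penalty = group_penalty(W1, groups)
# A sensor is in use only if at least one of its columns has a nonzero entry.
active = [s for s, cols in groups.items()
          if any(row[j] != 0.0 for row in W1 for j in cols)]
```

Because the norm of each group is not squared, the penalty is nonsmooth at zero, which is what drives entire groups of columns (entire sensors) to exactly zero rather than merely shrinking them.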
III Optimization and Selection Algorithms
Here we discuss the choice of optimization algorithms for solving the training problem (2) and its regularized version (6). We also discuss strategies that use the regularized formulation to select PMU locations, when we are only allowed to install PMUs on a prespecified number of buses.
III-A Optimization Frameworks
We solve the problem (2) with the popular L-BFGS algorithm [13]. Other algorithms for smooth nonlinear optimization can also be applied; we choose L-BFGS because it requires only function values and gradients of the objective, and because it has been shown in [14] to be efficient for solving neural network problems. To deal with the nonconvexity of the objective, we made slight changes to the original L-BFGS, following an idea in [15]. Denoting by $s_k$ the difference between the iterates at iterations $k$ and $k+1$, and by $y_k$ the difference between the gradients at these two iterations, the pair $(s_k, y_k)$ is not used in computing subsequent search directions if $s_k^\top y_k$ is too small relative to $s_k^\top s_k$. This strategy ensures that the Hessian approximation remains positive definite, so the search directions generated by L-BFGS will be descent directions.
We solve the group-regularized problem (6) using SpaRSA [12], a proximal-gradient method that requires only the gradient of $f$ and an efficient proximal solver for the regularization term. As shown in [12], the proximal problem associated with the group-LASSO regularization has a closed-form solution that is inexpensive to compute.
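The closed-form proximal step for a sum of unsquared group norms is the standard group soft-thresholding operation: each group is scaled by $\max(0, 1 - t/\|\cdot\|)$, so groups whose norm is at most the threshold $t$ are set exactly to zero. The sketch below shows this operation on the columns of a first-layer matrix. It is not SpaRSA itself; the function name and the toy data are assumptions, and in a proximal-gradient iteration the threshold would be the regularization weight divided by the current curvature (inverse step size) parameter.

```python
import math

def prox_group_lasso(W1, groups, threshold):
    """Group soft-thresholding applied to column groups of W1 (list of rows)."""
    out = [row[:] for row in W1]  # work on a copy
    for cols in groups.values():
        norm = math.sqrt(sum(row[j] ** 2 for row in W1 for j in cols))
        # Shrink the whole group toward zero; zero it out if its norm <= threshold.
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0.0 else 0.0
        for row in out:
            for j in cols:
                row[j] *= scale
    return out

# Hypothetical example: the first group survives (norm 5), the second is zeroed (norm 0.1).
W1 = [[3.0, 0.1],
      [4.0, 0.0]]
groups = {"sensor_1": [0], "sensor_2": [1]}
shrunk = prox_group_lasso(W1, groups, threshold=1.0)
```

Only one pass over the entries of the matrix is needed, which is why the proximal step is cheap compared with the gradient evaluation.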
In the following subsections, we discuss details of two bus selection approaches, and how to compute the gradient of $f$ efficiently.
III-B Two Approaches for PMU Location
We follow [6] in proposing two approaches for selecting PMU locations. In the first approach, we set $G$ in (6) to be the full set of potential PMU locations, and try different values of the parameter $\tau$ until we find a solution that has the desired number of nonzero submatrices $(W_1)_{\cdot G_r}$ for $r \in G$, which indicate the chosen PMU locations.

The second approach is referred to as the “greedy heuristic” in [6]. We initialize $G$ to be the set of candidate locations for PMUs. (We can exclude from this set locations that are already instrumented with PMUs and those that are not to be considered as possible PMU locations.) We then minimize (6) with this $G$, and select the index $r \in G$ that satisfies

$$r = \arg\max_{s \in G} \left\|(W_1)_{\cdot G_s}\right\|_F$$

as the next PMU location. This $r$ is removed from $G$, and we minimize (6) with the reduced $G$. This process is repeated until the required number of locations has been selected. The process is summarized in Algorithm 1.
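The greedy loop can be sketched as follows. This is a minimal sketch, assuming that the selection rule picks the still-penalized sensor whose group of first-layer columns has the largest norm; it is not a reproduction of Algorithm 1. The callable `fit_regularized` is a hypothetical stand-in for a solver of the regularized problem (6) that penalizes only the sensors passed to it and returns the resulting first-layer matrix.

```python
import math

def group_norm(W1, cols):
    """Frobenius norm of the columns `cols` of W1 (list of rows)."""
    return math.sqrt(sum(row[j] ** 2 for row in W1 for j in cols))

def greedy_select(candidates, groups, num_to_select, fit_regularized):
    """Pick sensors one at a time, un-penalizing each pick before refitting."""
    penalized = list(candidates)
    selected = []
    while len(selected) < num_to_select and penalized:
        W1 = fit_regularized(penalized)  # minimize (6) with the current penalized set
        best = max(penalized, key=lambda r: group_norm(W1, groups[r]))
        selected.append(best)
        penalized.remove(best)           # this sensor is no longer penalized
    return selected

# Toy stand-in for training: always returns the same first-layer matrix.
groups = {1: [0, 1], 2: [2, 3], 3: [4, 5]}
def toy_fit(penalized):
    return [[0.1, 0.0, 2.0, 0.0, 0.5, 0.5]]

chosen = greedy_select([1, 2, 3], groups, 2, toy_fit)
```

With a real solver, each refit would change the matrix, because the previously chosen sensors are free of the penalty and absorb part of the signal; the toy function here only exercises the control flow.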
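III-C Computing the Gradient of the Loss Function
In both SpaRSA and the modified L-BFGS algorithm, the gradient and the function value of $f$ defined in (3) are needed at every iteration. We show how to compute these two values efficiently given any iterate $W = (W_1, \dots, W_{N+1})$. Function values are computed exactly as suggested by the constraints in (3), by evaluating the intermediate quantities $x_i^j$ (the output of layer $j$ for data item $i$), $j = 1, 2, \dots, N+1$, by these formulas, then finally the summation in (3a). The gradient involves an adjoint calculation. By applying the chain rule to the constraints in (3), treating $x_i^j$, $j = 1, 2, \dots, N+1$, as variables alongside $W_1, \dots, W_{N+1}$, we obtain

$$\frac{\partial f}{\partial W_{N+1}} = \sum_{i=1}^{n} \nabla_{x_i^{N+1}} \ell\left(x_i^{N+1}, y_i\right) \left(x_i^{N}\right)^\top + \epsilon W_{N+1}, \tag{7a}$$

$$\frac{\partial f}{\partial x_i^{N}} = W_{N+1}^\top \nabla_{x_i^{N+1}} \ell\left(x_i^{N+1}, y_i\right), \tag{7b}$$

$$\frac{\partial f}{\partial x_i^{j}} = W_{j+1}^\top \sigma'\left(W_{j+1} x_i^{j}\right) \frac{\partial f}{\partial x_i^{j+1}}, \quad j = N-1, \dots, 1, \tag{7c}$$

$$\frac{\partial f}{\partial W_{j}} = \sum_{i=1}^{n} \sigma'\left(W_{j} x_i^{j-1}\right) \frac{\partial f}{\partial x_i^{j}} \left(x_i^{j-1}\right)^\top + \epsilon W_j, \quad j = 1, \dots, N. \tag{7d}$$

Since $\sigma$ is a pointwise operator that maps $\mathbb{R}^{d_j}$ to $\mathbb{R}^{d_j}$, $\sigma'(\cdot)$ is a diagonal matrix such that $\sigma'(z)_{k,k} = \sigma'(z_k)$. The quantities $\sigma'(W_j x_i^{j-1})$ and $x_i^j$, $j = 1, 2, \dots, N$, are computed and stored during the calculation of the objective. Then, from (7b) and (7c), the quantities $\partial f / \partial x_i^j$ for $j = N, N-1, \dots, 1$ can be computed in a reverse recursion. Finally, the formulas (7d) and (7a) can be used to compute the required derivatives $\partial f / \partial W_j$, $j = 1, 2, \dots, N+1$.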
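The reverse recursion can be checked numerically on a tiny network. The sketch below computes the MLR loss and its gradients for one sample and one tanh hidden layer, reusing the stored hidden-layer outputs in the backward pass, and compares one gradient entry against a central finite difference. It is a minimal sketch under assumptions: the function names and toy weights are illustrative, labels are 0-based, and the Frobenius-norm term and bias terms are left out.

```python
import math

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def loss_and_grads(W1, W2, x, y):
    """MLR loss and gradients w.r.t. W1 and W2 for a single sample."""
    h = [math.tanh(v) for v in matvec(W1, x)]   # forward pass; h is stored for reuse
    p = softmax(matvec(W2, h))
    loss = -math.log(p[y])
    # Gradient of the loss w.r.t. the output scores: softmax minus one-hot label.
    dz = [p[k] - (1.0 if k == y else 0.0) for k in range(len(p))]
    gW2 = [[dz[k] * h[j] for j in range(len(h))] for k in range(len(dz))]
    # Back-propagate through W2, then through tanh (derivative 1 - tanh^2).
    dh = [sum(W2[k][j] * dz[k] for k in range(len(dz))) for j in range(len(h))]
    da = [dh[j] * (1.0 - h[j] ** 2) for j in range(len(h))]
    gW1 = [[da[j] * x[i] for i in range(len(x))] for j in range(len(h))]
    return loss, gW1, gW2

# Hypothetical weights: 3 features, 2 hidden nodes, 3 classes.
W1 = [[0.2, -0.4, 0.1], [0.7, 0.3, -0.6]]
W2 = [[0.5, -0.3], [-0.2, 0.8], [0.1, 0.1]]
x, y = [1.0, -2.0, 0.5], 2
loss, gW1, gW2 = loss_and_grads(W1, W2, x, y)

# Central finite-difference check on one entry of W1.
eps = 1e-6
Wp = [row[:] for row in W1]; Wp[0][1] += eps
Wm = [row[:] for row in W1]; Wm[0][1] -= eps
fd = (loss_and_grads(Wp, W2, x, y)[0] - loss_and_grads(Wm, W2, x, y)[0]) / (2 * eps)
```

The backward pass costs about as much as the forward pass, since it reuses the stored activations instead of recomputing them.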
III-D Training and Validation Procedure
In accordance with usual practice in statistical analysis involving regularization parameters, we divide the available data into a training set and a validation set. The training set is a randomly selected subset of the available data (the pairs $(x_i, y_i)$, $i = 1, 2, \dots, n$, in the notation above) that is used to form the objective function whose solution yields the parameters in the neural network. The validation set consists of further pairs $(x_i, y_i)$ that aid in the choice of the regularization parameter, which in our case is the parameter $\tau$ in the greedy heuristic procedure of Algorithm 1, described in Sections III-A and III-B. We apply the greedy heuristic for a range of values of $\tau$ and deem the optimal value to be the one that achieves the most accurate outage identification on the validation set. We select initial points for the training randomly, so different solutions may be obtained even for a single value of $\tau$. To obtain a “score” for each value of $\tau$, we choose the best result from ten random starts. The final model is then obtained by solving (2) over the buses selected on the best of the ten validation runs, that is, fixing the elements of $W_1$ that correspond to non-selected buses at zero.
Note that validation is not needed to choose the value of $\tau$ when we solve the regularized problem (6) directly, because in this procedure, we adjust $\tau$ until a predetermined number of buses is selected.
There is also a testing set of pairs $(x_i, y_i)$. This is data that is used to evaluate the bus selections produced by the procedures above. In each case, the tuned models obtained on the selected buses are evaluated on the testing set.
IV Experiments
We perform simulations based on grids from the IEEE test set archive [16]. Many of our studies focus on the IEEE 57-bus case. Simulations of grid response to varying demand and outage conditions are performed using MATPOWER [17]. We first show that high accuracy can be achieved easily when PMU readings from all buses are used. We then focus on the more realistic (but more difficult) case in which data from only a limited number of PMUs is used. In both cases, we simulate PMU readings over a wide range of power demand profiles that encompass the profiles that would be seen in practice over different seasons and at different times of day.
IV-A Data Generation
We use the following procedure from [6] to generate the data points using a stochastic process and MATPOWER.

We consider the full grid defined in the IEEE specification, and also the modified grid obtained by removing each transmission line in turn.

For each demand node, define a baseline demand value from the IEEE test set archive as the average of the load demand over 24 hours.

To simulate different “demand averages” for different seasons, we scale the baseline demand value for each node by the values in , to yield five different baseline demand averages for each node. (Note: In [6], a narrower range of multipliers was used, specifically , but each multiplier is considered as a different independent data set.)

Simulate a 24-hour fluctuation in demand by an adaptive Ornstein-Uhlenbeck process as suggested in [18], independently and separately on each demand bus.

This fluctuation is overlaid on the demand average for each bus to generate a 24-hour load demand profile.

Obtain training, validation, and test points from these 24-hour demand profiles for each node by selecting different timepoints from this 24-hour period, as described below.

If any combination of line outage and demand profile yields a system for which MATPOWER cannot identify a feasible solution for the AC power flow equations, we do not add this point to the data set. Lines connecting the same pair of buses are considered as a single line; we take them to be all disconnected or all connected.
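The demand-profile steps above can be sketched with a plain Ornstein-Uhlenbeck process. This is a simplified illustration only: the adaptive variant of [18] is not reproduced, the Euler-Maruyama discretization, the parameters `theta` and `sigma`, the one-minute time step, and the choice to apply the fluctuation as a relative deviation from the scaled baseline are all assumptions made for the example.

```python
import math
import random

def ou_fluctuation(num_steps, dt, theta, sigma, rng):
    """Euler-Maruyama path of dX = -theta * X dt + sigma dW, starting at X = 0."""
    x, path = 0.0, []
    for _ in range(num_steps):
        x += -theta * x * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path

def demand_profile(baseline, scale, num_steps=24 * 60, dt=1.0 / 60.0,
                   theta=1.0, sigma=0.05, seed=0):
    """24-hour demand for one bus: scaled baseline plus a mean-reverting fluctuation."""
    rng = random.Random(seed)  # separate generator per bus keeps buses independent
    average = baseline * scale
    return [average * (1.0 + f)
            for f in ou_fluctuation(num_steps, dt, theta, sigma, rng)]

# Hypothetical bus with baseline demand 100 (arbitrary units) and a seasonal scaling of 1.25.
profile = demand_profile(100.0, 1.25)
first_half, second_half = profile[:12 * 60], profile[12 * 60:]
```

Splitting the profile into two 12-hour halves mirrors the way training points and validation/test points are drawn from different parts of the simulated day.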
This procedure was used to generate training, validation, and test data. In each category, we generated equal numbers of points for each feasible case at each of the five scale factors. For each feasible topology and each combination of parameters above, we generate training points from the first 12 hours of the 24-hour simulation period, and validation points and test points from the second 12-hour period. Summary information about the IEEE power systems we use in the experiments with single line outage is shown in Table I. The column “Feas.” shows the number of lines whose removal still results in a feasible topology for at least one scale factor, while the number of lines whose removal results in infeasible topologies for all scale factors, or which are duplicated, is indicated in the column “Infeas./Dup.” The next three columns show the number of data points in the training / validation / test sets. As an example: The number of training points for the 14-Bus case (which is 1,840) is approximately 19 (number of feasible line removals) times 5 (number of demand scalings) times 20 (number of training points per configuration). The difference between this calculated value of 1,900 and the 1,840 actually used arises because the numbers of feasible lines under different scaling factors are not identical; higher scaling factors resulted in more infeasible cases. The last column in Table I shows the number of components in each feature vector. There are two features for each bus, being changes in phase angle and voltage magnitude with respect to the original grid under the same demand conditions. There are another two additional features in all cases, one indicating the power generation level (expressed as a fraction of the long-term average), and the other one indicating a bias term manually added to the data.
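Table I

| System | #lines (Feas.) | #lines (Infeas./Dup.) | #Train | #Val | #Test | #Features |
|---|---|---|---|---|---|---|
| 14-Bus | 19 | 1 | 1,840 | 920 | 4,600 | 30 |
| 30-Bus | 38 | 3 | 3,680 | 1,840 | 9,200 | 62 |
| 57-Bus | 75 | 5 | 5,340 | 2,670 | 13,350 | 116 |
| 118-Bus | 170 | 16 | 16,980 | 8,490 | 42,450 | 238 |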
IV-B Neural Network Design
Configuration and design of the neural network is critical to performance in many applications. In most of our experiments, we opt for a simple design in which there is just a single hidden layer: $N = 1$ in the notation of (2). We assume that the matrices $W_1$ and $W_2$ are dense, that is, all nodes in any one layer are connected to all nodes in adjacent layers. It remains to decide how many nodes $d_1$ should be in the hidden layer. Larger values of $d_1$ lead to larger matrices $W_1$ and $W_2$ and thus more parameters to be chosen in the training process. However, larger $d_1$ can raise the possibility of overfitting the training data, producing solutions that perform poorly on the other, similar data in the validation and test sets.
We did an experiment to indicate whether overfitting could be an issue in this application. We set $d_1 = 200$, and solved the unregularized training problem (2) using the modified L-BFGS algorithm with a limited number of iterations. Figure 1 represents the output of each of the 200 nodes in the hidden layer for each of the test examples. Since the output is a result of the $\tanh$ transformation (1) of the input, it lies in the range $[-1, 1]$. We color-code the outputs on a spectrum from red to blue, with red and blue representing the two ends of this range. A significant number of columns are either solid red or solid blue. The hidden-layer nodes that correspond to these columns play essentially no role in distinguishing between different outages; similar results would be obtained if they were simply omitted from the network. The presence of these nodes indicates that the training process avoids using all nodes in the hidden layer, if fewer than 200 nodes suffice to attain a good value of the training objective. Note that overfitting is avoided at least partially because we stop the training procedure with a rather small number of iterations, which can be viewed as another type of regularization [19].
In our experiments, we used for the larger grids (57 and 118 buses) and for the smaller grids (14 and 30 buses). The maximum number of L-BFGS iterations for all neural networks is set to , while for MLR models we terminate either when the number of iterations reaches or when the gradient is smaller than a prespecified value ( in our experiments), as linear models do not suffer much from overfitting.
IV-C Results on All Buses
We first compare the results between linear multinomial logistic regression (MLR) (as considered in [6]) and a fully connected neural network with one hidden layer, where the PMUs are placed on all buses. Because we use all the buses, no validation phase is needed, because the parameter does not appear in the model. Table II shows error rates on the testing set. We see that in the difficult cases, when the linear model has error rates higher than , the neural network obtains markedly better testing error rates.
Table II

| Buses | 14 | 30 | 57 | 118 |
|---|---|---|---|---|
| Linear MLR | 0.00% | 1.76% | 4.50% | 15.19% |
| Neural network | 0.43% | 0.03% | 0.91% | 2.28% |
IV-D Results on Subset of Buses
We now focus on the 57-bus case, and apply the greedy heuristic (Algorithm 1) to select a subset of buses for PMU placement, for the neural network with one hidden layer of 200 nodes. We aim to select 10 locations. Figure 2 shows the locations selected at each run. Several values of $\tau$ were used, with ten runs performed for each value of $\tau$. On some runs, the initial point is close to a bad local optimum (or saddle point) and the optimization procedure terminates early with fewer than 20 columns of nonzeros in $W_1$ (indicating that fewer than 10 buses were selected, as each bus corresponds to 2 columns). The resulting models have poor performance, and we do not include them in the figure.
Even though the random initial points are different on each run, the groups selected for a fixed tend to be similar on all runs when . For larger values of , including the value which gives the best selection performance, the locations selected on different runs are often different. (For the largest values of , fewer than 10 buses are selected.)
Table III shows testing accuracy for the ten PMU locations selected by both the greedy heuristic and regularized optimization with a single well-chosen value of $\tau$. Both the neural network and the linear MLR classifiers were tried. The groups of selected buses are shown for each case. These differ significantly; we chose the “optimal” group from among these to be the one with the best validation score. We note the very specific choice of $\tau$ for linear MLR (group-LASSO). In this case, the number of groups selected is extremely sensitive to $\tau$. In a very small range around this value, the number of buses selected varies between 8 and 12. We report two types of error rates here. In the column “Err. (top1)” we report the rate at which the outage that was assigned the highest probability by the classifier was not the outage that actually occurred. In “Err. (top2)” we score an error only if the true outage was not assigned either the highest or the second-highest probability by the classifier. We note here that “top1” error rates are much higher than when PMU data from all buses is used, although the neural network yields significantly better results than the linear classifier. However, “top2” results are excellent for the neural network when the greedy heuristic is used to select bus locations.
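The two error measures can be stated precisely in a few lines. The sketch below is illustrative: the function name `top_k_error`, the 0-based labels, and the toy probabilities are assumptions, not data from the experiments.

```python
def top_k_error(prob_rows, labels, k):
    """Fraction of samples whose true label is not among the k most probable classes."""
    errors = 0
    for probs, y in zip(prob_rows, labels):
        # Class indices ordered from most to least probable.
        ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
        if y not in ranked[:k]:
            errors += 1
    return errors / len(labels)

# Hypothetical classifier outputs for three samples and three outage classes.
probs = [[0.70, 0.20, 0.10],
         [0.35, 0.45, 0.20],
         [0.10, 0.30, 0.60]]
labels = [0, 0, 1]

top1 = top_k_error(probs, labels, 1)
top2 = top_k_error(probs, labels, 2)
```

In this toy case the second and third samples are top-1 errors, but the true class is the runner-up in both, so the top-2 error is zero. A low top-2 error means that an operator would need to inspect at most two candidate lines.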
Table III (10 selected buses)

| Model | Buses selected | Err. (top1) | Err. (top2) |
|---|---|---|---|
| Linear MLR (greedy) | [5 16 20 31 40 43 44 51 53 57] | 29.7% | 8.4% |
| Neural Network (greedy) | [5 20 31 40 43 50 51 53 54 57] | 7.1% | 0.1% |
| Linear MLR (group-LASSO) | [2 4 5 6 7 8 18 27 28 29] | 54.4% | 39.4% |
| Neural Network (group-LASSO) | [4 5 6 7 8 18 26 27 28 55] | 24.1% | 12.9% |

Table IV (14 selected buses)

| Model | Buses selected | Err. (top1) | Err. (top2) |
|---|---|---|---|
| Linear MLR (greedy) | [5 16 17 20 26 31 39 40 43 44 51 53 54 57] | 21.8% | 3.8% |
| Neural Network (greedy) | [5 6 16 24 27 31 39 40 42 50 51 52 53 54] | 5.2% | 0.3% |
| Linear MLR (group-LASSO) | [2 4 5 7 8 17 18 27 28 29 31 32 33 34] | 42.1% | 25.3% |
| Neural Network (group-LASSO) | [4 7 8 18 24 25 26 27 28 31 32 33 39 40] | 6.2% | 0.6% |
Table IV repeats the experiment of Table III, but for 14 selected buses rather than 10. Again, we see numerous differences between the subsets of buses selected by the greedy and group-LASSO approaches, for both the linear MLR and neural networks. The neural network again gives significantly better test error rates than the linear MLR classifier, and the “top2” results are excellent for the neural network, for both group-LASSO and greedy heuristics. Possibly the most notable difference with Table III is that the buses selected by the group-LASSO approach for the neural network give much better results for 14 buses than for 10 buses. However, since it still performs worse than the greedy heuristic, the group-LASSO approach is not further considered in later experiments.
IV-E Why Do Neural Network Models Achieve Better Accuracy?
Table V

| | Full PMU Data | Selected PMUs | Selected PMUs, after neural network transformation |
|---|---|---|---|
| mean ± std dev. distance to centroid | | | |
| mean ± std dev. between-centroid distance | | | |
Reasons for the impressive effectiveness of neural networks in certain applications are poorly understood, and are a major research topic in machine learning. For this specific problem, we compare the distribution of the raw feature vectors with the distribution of feature vectors obtained after transformation by the hidden layer. The goal is to understand whether the transformed vectors are in some sense more clearly separated and thus easier to classify than the original data.
We start with some statistics of the clusters formed by feature vectors of the different classes. For purposes of discussion, we denote by $x_i$ the feature vector of the $i$th data point and by $y_i$ its label; $x_i$ could be the full set of PMU readings, the reduced set obtained after selection of a subset of PMU locations, or the transformed feature vector obtained as output from the hidden layer, according to the context. For each class $j$, we gather all those feature vectors $x_i$ with label $y_i = j$, and denote the centroid of this cluster by $c_j$. We track two statistics: the mean / standard deviation of the distance of feature vectors $x_i$ to their cluster centroids, that is, $\|x_i - c_{y_i}\|$ for all $i$; and the mean / standard deviation of distances between cluster centroids, that is, $\|c_j - c_k\|$ for all pairs $j \neq k$. We analyze these statistics for three cases, all based on the IEEE 57-Bus network: first, when the $x_i$ are vectors containing full PMU data; second, when the $x_i$ are vectors containing the PMU data from the 10 buses selected by the greedy heuristic; third, the same data vectors as in the second case, but after they have been transformed by the hidden layer of the neural network.

Results are shown in Table V. For the raw data (first and second columns of the table), the distances within clusters are typically smaller than distances between centroids. (This happens because the feature vectors within each class are “strung out” rather than actually clustered, as we see below.) For the transformed data (last column) the clusters are generally tighter and more distinct, making them easier to distinguish.
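The two cluster statistics described above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function name `cluster_statistics`, the use of the Euclidean norm, and the toy data are assumptions, not details taken from our experiments.

```python
import numpy as np

def cluster_statistics(X, y):
    """Return (mean, std) of point-to-centroid distances and
    (mean, std) of between-centroid distances.
    The Euclidean norm is an assumption of this sketch."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)  # sorted class labels
    # Centroid of each class: mean of all feature vectors with that label.
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    # Map every sample to the row of its own class centroid.
    idx = np.searchsorted(classes, y)
    within = np.linalg.norm(X - centroids[idx], axis=1)
    # Distances between all distinct pairs of centroids (upper triangle only).
    diffs = centroids[:, None, :] - centroids[None, :, :]
    pairwise = np.linalg.norm(diffs, axis=2)
    between = pairwise[np.triu_indices(len(classes), k=1)]
    return (within.mean(), within.std()), (between.mean(), between.std())

# Hypothetical toy data: two classes on a line in the plane.
X_demo = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]]
y_demo = [0, 0, 1, 1]
within_stats, between_stats = cluster_statistics(X_demo, y_demo)
```

In this toy example every point lies at distance 1 from its centroid and the two centroids are 10 apart, so the clusters are well separated in the sense used in Table V.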
Visualization of the effects of hidden-layer transformation is difficult because of the high dimensionality of the feature vectors. Nevertheless, we can gain some insight by projecting into two-dimensional subspaces that correspond to some of the leading principal components, which are the vectors obtained from the singular value decomposition of the matrix of all feature vectors $x_i$. Figure 3 shows two graphs. Both show training data for the same line outages for the IEEE 57-Bus data set, with each class coded by a particular color and shape. In both graphs, we show data vectors obtained after 10 PMU locations were selected with the greedy heuristic. In the left graph, we plot the coefficients of the first and fifth principal components of each data vector. The “strung out” nature of the data for each class reflects the nature of the training data. Recall that for each outage / class, we selected 20 points from a 12-hour period of rising demand, at 5 different scalings of overall demand level. For the right graph in Figure 3, we plot the coefficients of the first and third principal components of each data vector after transformation by the hidden layer. For both graphs, we have chosen the two principal components to plot to be those for which the separation between classes is most evident. For the left graph (raw data), the data for classes 3, 4, and 5 appear in distinct regions of space, although the border between classes 4 and 5 is thin. For the right graph (after transformation), classes 3, 4, and 5 are somewhat more distinct. Classes 1 and 2 are difficult to separate in both plots, although in the right graph, they no longer overlap with the other three classes. The effects of tighter clustering and cleaner separation after transformation, which we noted in Table V, are evident in the graphs of Figure 3.

Table VI: Number of classes and of training / validation / test points for the 14- and 30-bus networks (double-line outages).
System  #classes  #Train  #Val  #Test  #Features
14-Bus  182  16,420  8,210  41,050  30
30-Bus  715  66,160  33,080  165,400  62
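The principal-component projection used for the graphs of Figure 3 can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `pc_coefficients` is hypothetical, the data are random rather than PMU readings, and centering the data before the singular value decomposition is a choice of this sketch that the text does not specify.

```python
import numpy as np

def pc_coefficients(X, components, center=True):
    """Coefficients of each row of X along the chosen principal components.

    X          : (n_samples, n_features) matrix of feature vectors
    components : zero-based indices of the principal components to keep
    center     : subtract the column means first (an assumption of this sketch)
    """
    X = np.asarray(X, dtype=float)
    if center:
        X = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, sing, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[list(components)].T, sing

# Hypothetical data with 6 features of decreasing variance.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])
# First and fifth principal components, as in the left graph of Figure 3.
coords, sing = pc_coefficients(X_demo, components=(0, 4))
```

Each row of `coords` gives the two plotted coordinates of one data vector; a scatter plot of these rows, colored by class label, yields a graph of the kind shown in Figure 3.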
Table VII: Test error rates with PMU observations at all buses (double-line outages).
Model  14-bus  30-bus
Linear MLR  26.07%  36.32%
Neural network with one hidden layer  0%  0.65%
Table VIII: Error rates with PMUs at buses selected by the greedy heuristic (double-line outages).
Case  Number of PMUs  Model  Buses selected  Err. (top-1)  Err. (top-2)
14-bus  3  Linear MLR (greedy)  [3 5 14]  83.0%  71.7%
14-bus  3  Neural Network (greedy)  [3 12 13]  4.3%  0.9%
30-bus  5  Linear MLR (greedy)  [4 5 17 23 30]  90.6%  84.5%
30-bus  5  Neural Network (greedy)  [5 14 19 29 30]  12.7%  5.6%
IV-F Double-Line Outage Detection
We now extend our identification methodology to detect not just single-line outages, but also outages on two lines simultaneously. The number of classes that our classifier needs to distinguish between now scales with the square of the number of lines in the grid, rather than being approximately equal to the number of lines. For this much larger number of classes, we generate data in the manner described in Section IV-A, again omitting cases where the outage results in an infeasible network. Table VI shows the number of classes for the 14- and 30-bus networks, along with the number of training / validation / test points. Note in particular that there are 182 distinct outage events for the 14-bus system, and 715 distinct events for the 30-bus system.
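The quadratic growth in the number of classes can be illustrated by enumerating all unordered pairs of lines, of which there are $L(L-1)/2$ for a grid with $L$ lines, and discarding the pairs that leave the grid unusable. The sketch below uses loss of connectivity as a simplified stand-in for the infeasibility test (an assumption made here for illustration; our data generation relies on the AC power flow model), and the 4-bus network and the function names are hypothetical.

```python
from itertools import combinations

def is_connected(num_buses, lines):
    """Depth-first search check that every bus is reachable from bus 0."""
    adj = {b: [] for b in range(num_buses)}
    for u, v in lines:
        adj[u].append(v)
        adj[v].append(u)
    seen, stack = {0}, [0]
    while stack:
        for nb in adj[stack.pop()]:
            if nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return len(seen) == num_buses

def double_outage_classes(num_buses, lines):
    """All pairs of line indices whose removal leaves the grid connected."""
    classes = []
    for i, j in combinations(range(len(lines)), 2):
        remaining = [ln for k, ln in enumerate(lines) if k not in (i, j)]
        if is_connected(num_buses, remaining):
            classes.append((i, j))
    return classes

# Hypothetical 4-bus ring with one chord (not an IEEE test case).
toy_lines = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
toy_classes = double_outage_classes(4, toy_lines)
num_pairs = len(list(combinations(range(len(toy_lines)), 2)))
```

For this toy grid, 10 pairs of lines are possible, and the two pairs that isolate a bus are discarded, leaving 8 double-line outage classes.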
Table VII shows results of our classification approaches for the case in which PMU observations are made at all buses. The neural network model has a single hidden layer of 100 nodes. The neural network has dramatically better performance than the linear MLR classifier on these problems, attaining a zero error rate on the 14-bus tests.
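As a minimal sketch of a model of this shape, the following NumPy code evaluates a network with one hidden layer of 100 nodes followed by a softmax over the outage classes. The tanh activation, the random (untrained) weights, and the names `init_params`, `hidden_layer`, and `predict_proba` are assumptions made for illustration; the dimensions (30 features, 182 classes) are those of the 14-bus row of Table VI. Training is omitted.

```python
import numpy as np

def init_params(n_features, n_hidden, n_classes, seed=0):
    """Random weights for illustration only; a real model would be trained."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(scale=0.1, size=(n_features, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(scale=0.1, size=(n_hidden, n_classes)),
        "b2": np.zeros(n_classes),
    }

def hidden_layer(params, X):
    """Transformed feature vectors (tanh activation is an assumption)."""
    return np.tanh(X @ params["W1"] + params["b1"])

def predict_proba(params, X):
    """Softmax class probabilities computed from the hidden-layer output."""
    logits = hidden_layer(params, X) @ params["W2"] + params["b2"]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    expv = np.exp(logits)
    return expv / expv.sum(axis=1, keepdims=True)

params = init_params(n_features=30, n_hidden=100, n_classes=182)
X_demo = np.random.default_rng(1).normal(size=(4, 30))  # hypothetical inputs
probs = predict_proba(params, X_demo)
```

Dropping the hidden layer and applying the softmax directly to an affine function of the inputs gives a model of the linear multinomial logistic regression type.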
We repeat the experiment using a subset of buses chosen with the greedy heuristic described in Section III-B: 3 buses for the 14-bus network and 5 buses for the 30-bus network. Given the low dimensionality of the feature space and the large number of classes, these are difficult problems. (Because it was shown in the previous experiments that the group-LASSO approach has inferior performance to the greedy heuristic, we omit it from this experiment.) As we see in Table VIII, the linear MLR classifiers do not give good results, with “top-1” and “top-2” error rates all in excess of 71%. Much better results are obtained for the neural network with bus selection performed by the greedy heuristic, which obtains “top-2” error rates of less than 1% in the 14-bus case and 5.6% in the 30-bus case.
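The “top-1” and “top-2” error rates can be computed as sketched below, assuming the usual definition: a prediction counts as correct under “top-k” if the true class is among the k classes with the highest predicted probability. The function name `top_k_error` and the toy probabilities are hypothetical.

```python
import numpy as np

def top_k_error(probs, labels, k):
    """Fraction of samples whose true label is not among the k
    highest-scoring classes (k must not exceed the number of classes)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    # Indices of the k largest scores in each row; their order is irrelevant.
    top_k = np.argpartition(-probs, k - 1, axis=1)[:, :k]
    hits = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - hits.mean()

probs_demo = np.array([
    [0.6, 0.3, 0.1],  # true class 0: correct under top-1
    [0.5, 0.4, 0.1],  # true class 1: correct under top-2 only
    [0.7, 0.2, 0.1],  # true class 2: incorrect under both
    [0.1, 0.2, 0.7],  # true class 2: correct under top-1
])
labels_demo = np.array([0, 1, 2, 2])
err1 = top_k_error(probs_demo, labels_demo, k=1)
err2 = top_k_error(probs_demo, labels_demo, k=2)
```

On this toy example the top-1 error is 0.5 and the top-2 error is 0.25, which shows why the top-2 rate can never exceed the top-1 rate.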
V Conclusions
This work describes the use of neural networks to detect single- and double-line outages from PMU data on a power grid. We show significant improvements in classification performance over the linear multiclass logistic regression methods described in [6], particularly when data about the PMU signatures of different outage events is gathered over a wide range of demand conditions. By adding regularization to the model, we can determine the locations to place a limited number of PMUs in a way that optimizes classification performance. Our approach uses a high-fidelity AC model of the grid to generate data examples that are used to train the neural-network classifier. Although (as is true in most applications of neural networks) the training process is computationally heavy, the predictions can be obtained with minimal computation, allowing the model to be deployed in real time.
References
 [1] H. Zhu and G. B. Giannakis, “Sparse overcomplete representations for efficient identification of power line outages,” IEEE Transactions on Power Systems, vol. 27, no. 4, pp. 2215–2224, Nov. 2012.
 [2] J.-C. Chen, W.-T. Li, C.-K. Wen, J.-H. Teng, and P. Ting, “Efficient identification method for power line outages in the smart power grid,” IEEE Transactions on Power Systems, vol. 29, no. 4, pp. 1788–1800, Jul. 2014.
 [3] J. E. Tate and T. J. Overbye, “Line outage detection using phasor angle measurements,” IEEE Transactions on Power Systems, vol. 23, no. 4, pp. 1644–1652, Nov. 2008.
 [4] ——, “Double line outage detection using phasor angle measurements,” in 2009 IEEE Power & Energy Society General Meeting, Calgary, AB, Jul. 2009, pp. 1–5.

 [5] A. Y. Abdelaziz, S. F. Mekhamer, M. Ezzat, and E. F. El-Saadany, “Line outage detection using Support Vector Machine (SVM) based on the Phasor Measurement Units (PMUs) technology,” in 2012 IEEE Power and Energy Society General Meeting, San Diego, CA, Jul. 2012, pp. 1–8.
 [6] T. Kim and S. J. Wright, “PMU placement for line outage identification via multinomial logistic regression,” IEEE Transactions on Smart Grid, vol. PP, no. 99, 2016.
 [7] Y. Zhao, J. Chen, and H. V. Poor, “Efficient neural network architecture for topology identification in smart grid,” in Signal and Information Processing (GlobalSIP), 2016 IEEE Global Conference on. IEEE, 2016, pp. 811–815.
 [8] M. Garcia, T. Catanach, S. Vander Wiel, R. Bent, and E. Lawrence, “Line outage localization using phasor measurement data in transient state,” IEEE Transactions on Power Systems, vol. 31, no. 4, pp. 3019–3027, 2016.
 [9] Y. Zhao, J. Chen, and H. V. Poor, “A learning-to-infer method for real-time power grid topology identification,” Tech. Rep., 2017, arXiv:1710.07818.
 [10] D. Malioutov, M. Cetin, and A. S. Willsky, “A sparse signal reconstruction perspective for source localization with sensor arrays,” IEEE Transactions on Signal Processing, vol. 53, no. 8, pp. 3010–3022, 2005.
 [11] L. Meier, S. Van De Geer, and P. Bühlmann, “The group LASSO for logistic regression,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 70, no. 1, pp. 53–71, 2008.
 [12] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo, “Sparse reconstruction by separable approximation,” IEEE Transactions on Signal Processing, vol. 57, no. 7, pp. 2479–2493, 2009.
 [13] D. C. Liu and J. Nocedal, “On the limited memory BFGS method for large scale optimization,” Mathematical Programming, vol. 45, no. 1, pp. 503–528, 1989.
 [14] J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, Q. V. Le, and A. Y. Ng, “On optimization methods for deep learning,” in Proceedings of the 28th International Conference on Machine Learning, 2011, pp. 265–272.
 [15] D.-H. Li and M. Fukushima, “On the global convergence of the BFGS method for nonconvex unconstrained optimization problems,” SIAM Journal on Optimization, vol. 11, no. 4, pp. 1054–1064, 2001.
 [16] “Power systems test case archive,” 2014, [Online]. Available: http://www.ee.washington.edu/research/pstca/.
 [17] R. D. Zimmerman, C. E. Murillo-Sánchez, and R. J. Thomas, “MATPOWER: Steady-state operations, planning, and analysis tools for power systems research and education,” IEEE Transactions on Power Systems, vol. 26, no. 1, pp. 12–19, 2011.
 [18] M. Perninge, V. Knazkins, M. Amelin, and L. Söder, “Modeling the electric power consumption in a multi-area system,” European Transactions on Electrical Power, vol. 21, no. 1, pp. 413–423, 2011.

 [19] R. Caruana, S. Lawrence, and C. L. Giles, “Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping,” in Advances in Neural Information Processing Systems, 2001, pp. 402–408.
 [20] J. Nocedal and S. J. Wright, Numerical Optimization, 2nd ed. New York: Springer, 2006.
 [21] J. D. Pearson, “Variable metric methods of minimisation,” The Computer Journal, vol. 12, no. 2, pp. 171–178, 1969.