Lifeline networks such as power grids, transportation networks, and water networks are critical to the normal functioning of society and the economy. However, rare events and disruptions occur due to natural disasters (earthquakes and hurricanes), aging, and short circuits, to name a few. Knowledge of power grid transient stability after a fault is crucial for decision making; see Fig. 1 for an illustration of a power grid system with a line fault. In particular, an online tool for predicting the transient dynamics, if available, would greatly benefit the operation of a power grid that is becoming more dynamic as the share of un-dispatchable renewables grows. Computationally, the traditional approach to transient assessment is to simulate a large physical system of circuit equations. This simulation-based approach is very time-consuming and requires detailed fault information such as the type, location, and clearing time of the fault. It is therefore not applicable to online assessment, where such fault information is unavailable.
With the increasing deployment of phasor measurement units (PMUs) in power grids, high-resolution measurements provide an alternative data-driven solution, e.g., learning the post-fault system responses and predicting subsequent system trajectories from short-time post-fault observations. Machine learning (ML)-based online transient assessment methods have been developed that use real-time measurements of the initial time-series responses as input. These ML-based studies aim to predict system stability by deriving a binary stability indicator [3, 8, 18] or estimating the stability margin [12, 13, 22]. However, knowing only the system stability or the margin is often not enough. Knowledge of the post-fault transient trajectories is more important for system operators to take appropriate actions, e.g., load shedding upon a frequency or voltage violation.
Classical signal processing relies on linear time-invariant filters, such as Padé and Prony’s methods, and can be used for online prediction. However, these methods are limited to fitting responses with a rational function and require users to choose integer parameters (filter orders) from observed data. The prediction is not robust with respect to such choices. In prior work, Prony’s method was generalized to 1D convolutional neural networks (1D-CNN) with additional spatial coupling (e.g., currents in nearby power lines). Temporal (1D) convolution exists in both methods, yet 1D-CNN also contains nonlinear operations and greater depth in feature extraction. Predictions by 1D-CNN improve considerably over Prony’s method. To go further along this line, we note that transformers ([16, 15] and references therein) have achieved state-of-the-art performance in machine translation and language modeling owing to their more non-local representation power compared with convolutions. However, transformers incur quadratic costs in computational time and memory footprint with respect to the sequence length, which can be prohibitively expensive when learning long sequences. Moreover, redundant queries in transformers can degrade the prediction accuracy. Leveraging group Lasso-based structured sparsity regularization, we optimize the sparsity of the queries in the self-attention mechanism, reducing both the number of queries and the computational cost. We further integrate 1D-CNN layers into transformers with sparse queries for power grid prediction and show remarkable improvement over existing methods.
The proposed model is based on the encoder-decoder architecture with efficient attention. An overview is shown in the top panel of Fig. 2, and more details are described below.
In the voltage prediction of the post-fault power grid system, the ability to capture long-term voltage volatility patterns requires both global information like time stamps and early-stage temporal time-series features from connected neighbors. To this end, we use a uniform representation embedding layer based on padding and convolution to mitigate mismatch between time stamps and temporal inputs in the encoder and decoder.
The input signals of a single bus in a fault event are and , where is the number of input signals, counting both voltages and currents connected or adjacent to the bus, and is the length of the measured signal. The global time stamp records the temporal (positional) relationship to the fault time . The encoder input is . For the decoder input, only the features of the initial period before time are used. We apply zero padding to match the dimension and obtain , where . This embedding ensures causality and is written as
where is either or ; and $\mathrm{ELU}(x) = x$ if $x > 0$, $\mathrm{ELU}(x) = e^{x} - 1$ if $x \le 0$, applied component-wise on a vector.
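The zero padding of the decoder input and the component-wise ELU described above can be sketched in plain Python. This is only an illustration under assumptions: the function names `pad_decoder_input` and `elu`, the argument `total_len`, and the choice to append the zeros after the observed steps are hypothetical and not taken from the paper.

```python
import math


def elu(x, alpha=1.0):
    """ELU on a scalar: identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def elu_vec(v, alpha=1.0):
    """Apply ELU component-wise on a vector (list of floats)."""
    return [elu(x, alpha) for x in v]


def pad_decoder_input(observed, total_len):
    """Append all-zero feature rows after the observed time steps.

    observed  : list of time steps, each a list of feature values
    total_len : hypothetical target length (observed + steps to predict)
    """
    if total_len < len(observed):
        raise ValueError("total_len must be at least the observed length")
    n_feat = len(observed[0])
    padding = [[0.0] * n_feat for _ in range(total_len - len(observed))]
    return [row[:] for row in observed] + padding


if __name__ == "__main__":
    short_obs = [[1.0, 2.0], [3.0, 4.0]]     # two observed time steps
    print(pad_decoder_input(short_obs, 4))   # observed rows, then zero rows
    print(elu_vec([2.0, 0.0, -1.0]))
```

Because the future part of the decoder input is filled with zeros, no measurement from after the observation window can leak into the prediction.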
Encoder. After embedding and representations, the input signal is formed as , see Fig. 2 (top). In the encoder, there are two identical layers; each consists of a multi-head Group Sparse attention sub-layer and a 1-D convolution sub-layer. The first sub-layer, Group Sparse attention, is discussed in detail in Sec. 2.2; it removes redundancy in the vanilla attention module. The second sub-layer is written as
where Conv1d performs 1-D convolutional filtering in the time dimension, followed by the ELU activation function.
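As a rough illustration of a 1-D convolution along the time axis followed by ELU, the sketch below filters a single-channel sequence in plain Python. The kernel values, the zero padding, and the function names are assumptions made for the example; the actual layer in the model is multi-channel with learned weights.

```python
import math


def elu(x):
    """ELU with unit slope parameter: x for x > 0, exp(x) - 1 otherwise."""
    return x if x > 0 else math.exp(x) - 1.0


def conv1d_time(x, kernel, pad):
    """Cross-correlate a 1-D sequence with `kernel`, zero-padding both ends by `pad`.

    This follows the deep-learning convention (no kernel flip).
    """
    padded = [0.0] * pad + list(x) + [0.0] * pad
    k = len(kernel)
    return [
        sum(kernel[j] * padded[i + j] for j in range(k))
        for i in range(len(padded) - k + 1)
    ]


def conv_elu(x, kernel, pad):
    """Temporal convolution followed by a component-wise ELU."""
    return [elu(y) for y in conv1d_time(x, kernel, pad)]


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0]
    print(conv1d_time(signal, [1.0, 0.0, -1.0], pad=1))
    print(conv_elu(signal, [1.0, 0.0, -1.0], pad=1))
```

With a kernel of length 3 and `pad=1`, the output has the same length as the input, so the time resolution is preserved from one sub-layer to the next.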
Decoder. The embedding layer before the decoder shapes the input features as , where is the hidden dimension of the model. This layer enables Group Sparse attention to operate in the same way as in the encoder. After multi-head Group Sparse attention, a canonical multi-head cross attention combines the hidden feature from the encoder and . A fully connected layer after cross attention matches the output dimension with that of the prediction variables. The final output is
where Full and CrossAttn stand for the fully connected and cross attention layers, respectively, and is the hidden feature from the encoder layer.
2.2 Self-attention mechanism and query sparsification
The self-attention mechanism is used to learn long-range dependencies and enable parallel processing of the input sequence. For a given input sequence , self-attention transforms into an output sequence in the following two steps (footnote 1: for simplicity of the following discussion, we formulate the self-attention mechanism slightly differently from the original formulation):
Step 1. Project the input sequence into three matrices via the following linear transformations
where are the weight matrices. We denote matrices , , and , where the vectors for are the query, key, and value vectors, respectively.
Step 2. For each query vector for , we compute the output vector as follows
where the softmax is applied row-wise.
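For orientation, the standard scaled dot-product form of these two steps is written out below. The notation is assumed rather than recovered from the paper: $X \in \mathbb{R}^{N \times D_x}$ is the input sequence of length $N$, $D$ is the query/key dimension, and $W_Q, W_K, W_V$ are the projection weights. The paper states that its formulation differs slightly from the original transformer, so details such as transposes may not match its equations (1) and (2).

```latex
% Step 1: linear projections (assumed notation)
Q = X W_Q^{\top}, \qquad K = X W_K^{\top}, \qquad V = X W_V^{\top}

% Step 2: output for each query q_i, and the equivalent matrix form
\hat{v}_i = \sum_{j=1}^{N} \mathrm{softmax}\!\left( \frac{q_i^{\top} k_j}{\sqrt{D}} \right) v_j
\quad \Longleftrightarrow \quad
\hat{V} = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{D}} \right) V =: A V
```

In this form the attention matrix $A$ has $N \times N$ entries, which is the source of the quadratic memory and compute costs for long sequences.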
For long sequences, the computational and memory costs of transformers are dominated by (1). It is evident that the memory cost is to store the attention matrix . Also, the computational complexities of computing the matrix-matrix products and are both . In response, efficient transformers have been proposed that leverage low-rank and/or sparse approximation of [9, 4, 2, 1], locality-sensitive hashing, clustered attention, etc.
In Informer, an empirical formula based on Kullback-Leibler divergence is derived to score the likeness of and , and to select the top few row vectors in . To reduce the score computation to complexity, random sampling of vectors is used to approximate the score formula. The resulting top-score vectors are kept and the others are zeroed out. Such a group sparsification of helps remove insignificant parts of the full attention matrix and also lowers computation and storage costs to .
We observe from (2) that if the query matrix has only many nonzero rows (i.e., group sparsity), the complexity of goes down to . This amounts to replacing the empirical score formula of Informer by a data-driven row selection in . To this end, we adopt the group Lasso (GLasso) penalty to sparsify columns of so that the number of nonzero rows of is reduced to , which in turn lowers the complexity of to . Although GLasso computation without any query vector sampling does not reduce the complexity of attention computation in training (it does so at inference), this is a minor problem for the moderate-size data set and the network with a few million parameters in this study. The training times are about the same with or without random sampling of query vectors on the power grid data set (Tab. 4). More importantly, our resulting network (GLassoformer) is faster and more accurate than Informer during testing and inference; see Tabs. 1 and 2.
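The saving from zero query rows can be illustrated with a small plain-Python sketch. A zero query gives equal scores for all keys, so its softmax weights are uniform and its output is simply the mean of the value vectors; only nonzero query rows need the full dot products. The function names and the exact-zero test are assumptions for illustration and do not reproduce the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attention_row(q, K, V):
    """Scaled dot-product attention output for one query vector."""
    scale = math.sqrt(len(q))
    weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in K])
    return [sum(weights[j] * V[j][c] for j in range(len(V)))
            for c in range(len(V[0]))]


def full_attention(Q, K, V):
    """Reference: every query row attends to every key."""
    return [attention_row(q, K, V) for q in Q]


def sparse_query_attention(Q, K, V):
    """Skip the dot products for all-zero query rows.

    The mean of V is computed once and reused for every zero query.
    """
    n = len(V)
    mean_v = [sum(V[j][c] for j in range(n)) / n for c in range(len(V[0]))]
    return [mean_v[:] if all(x == 0.0 for x in q) else attention_row(q, K, V)
            for q in Q]


if __name__ == "__main__":
    Q = [[0.0, 0.0], [1.0, 0.5], [0.0, 0.0]]
    K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(full_attention(Q, K, V))
    print(sparse_query_attention(Q, K, V))
```

Both functions return the same result up to rounding, but the sparse version evaluates score rows only for the nonzero queries.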
3 Algorithm and Convergence
We present our algorithm to realize row-wise sparsity in query matrix and show its convergence guarantee.
Let denote model parameters, where is the query weights in the attention mechanism. We divide into column-wise groups: . The penalized total loss is:
is the mean squared loss function, and is the GLasso penalty. For network training, we employ the Relaxed Group-wise Splitting Method (RGSM), which outperforms standard gradient descent on the related group sparsity (channel pruning) task for convolutional networks, for loss functions of type (3). To this end, we pose the proximal problem:
where is the column index set of in group . Solution of (4) is a soft-thresholding function:
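A group-wise soft-thresholding operator of the kind referred to here can be sketched as follows. It is the textbook proximal map of `tau` times the Euclidean norm of one group; the name `group_soft_threshold` and the symbol `tau` are assumptions, since the paper's own constants were lost in this copy.

```python
import math


def group_soft_threshold(w_g, tau):
    """Proximal map of tau * ||w_g||_2 for a single group of weights.

    The whole group is set to zero when its norm is at most tau;
    otherwise every entry is shrunk by the common factor 1 - tau / norm.
    """
    norm = math.sqrt(sum(x * x for x in w_g))
    if norm <= tau:
        return [0.0] * len(w_g)
    scale = 1.0 - tau / norm
    return [scale * x for x in w_g]


if __name__ == "__main__":
    print(group_soft_threshold([3.0, 4.0], 1.0))   # shrunk, direction kept
    print(group_soft_threshold([0.3, 0.4], 1.0))   # whole group zeroed
```

Unlike entry-wise soft-thresholding, this operator zeroes a group all at once, which is what produces structured (group) sparsity.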
The RGSM algorithm minimizes a relaxation of (3):
with , learning rate , relaxation parameter .
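The alternating structure of a relaxed splitting iteration can be sketched on a toy problem. The sketch assumes the relaxed objective f(w) + lam * sum_g ||u_g|| + (beta / 2) * ||w - u||^2, a proximal step with threshold lam / beta for u, and a gradient step on w; it uses a toy quadratic loss in place of the network loss. The values lam = 0.1, eta = 0.1 and the use of 0.9 for beta are illustrative assumptions, and the exact update in the paper may differ.

```python
import math


def group_prox(w_g, tau):
    """Group soft-thresholding: proximal map of tau * ||w_g||_2."""
    norm = math.sqrt(sum(x * x for x in w_g))
    if norm <= tau:
        return [0.0] * len(w_g)
    return [(1.0 - tau / norm) * x for x in w_g]


def rgsm_toy(targets, lam=0.1, beta=0.9, eta=0.1, steps=500):
    """Relaxed splitting on the toy loss f(w) = 0.5 * ||w - targets||^2.

    Each iteration:
      u <- prox of the group penalty at w (threshold lam / beta)
      w <- w - eta * (grad f(w) + beta * (w - u))
    Returns the dense weights w and the group-sparse auxiliary weights u.
    """
    w = [[1.0] * len(t) for t in targets]
    for _ in range(steps):
        u = [group_prox(g, lam / beta) for g in w]
        w = [
            [wi - eta * ((wi - ti) + beta * (wi - ui))
             for wi, ti, ui in zip(wg, tg, ug)]
            for wg, tg, ug in zip(w, targets, u)
        ]
    u = [group_prox(g, lam / beta) for g in w]
    return w, u


if __name__ == "__main__":
    w, u = rgsm_toy([[2.0, -1.0], [0.05, -0.05]])
    print("dense w :", w)
    print("sparse u:", u)   # the second group is exactly zero
```

In this toy run the group with a small target is driven exactly to zero in `u`, while `w` itself stays dense; the sparse variable is the one that would be kept for a pruned model.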
3.2 Convergence Theory
Thanks to the ELU activation function (continuously differentiable with piece-wise bounded second derivative), the network mean squared loss function satisfies Lipschitz gradient inequality for some positive constant :
and any , . Notice that violates (7). The advantage of splitting in RGSM (6) is to overcome this lack of smoothness in the convergence theory. We state the following (the proof follows from applying (7) and (6), similar to Appendix A of ):
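For readers without access to the original equation, a Lipschitz gradient condition is commonly written in one of the two textbook forms below. The symbols $f$, $L$, $w_1$, $w_2$ are assumed notation, and the paper's inequality (7) may be stated differently.

```latex
% Lipschitz continuity of the gradient (assumed notation)
\| \nabla f(w_1) - \nabla f(w_2) \| \le L \, \| w_1 - w_2 \|

% A standard consequence, often used in descent arguments
f(w_2) \le f(w_1) + \langle \nabla f(w_1),\, w_2 - w_1 \rangle + \frac{L}{2} \, \| w_2 - w_1 \|^2
```

A group norm such as $\| w_g \|_2$ is not differentiable at $w_g = 0$, so it cannot satisfy a condition of this type, which is why the penalty is handled through a separate proximal step.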
Lemma 1 (Descent Inequality)
If is coercive ( bounded implies that of its independent variables, true when standard weight decay is present in network training) and the learning rate , then decreases monotonically in , and converges sub-sequentially to a limit point , from which is extracted to speed up inference.
4 Experimental Settings
Datasets. We carry out experiments on the simulated New York/New England 16-generator 68-bus power system. The data set covers 2248 fault events, where each event has signals 10 seconds long. These signals contain the voltage and frequency from every bus and the current from every line. The data set also records the graph structure of buses and lines; see Fig. 1. The system has 68 buses and 88 lines linking them, i.e., 68 nodes with 88 edges in the graph. To explore the prediction accuracy of the GLassoformer model, we take the voltage of a bus as the training target, and the voltage with currents from locally connected buses and lines as input features. The train/val/test split is 1000/350/750 fault events.
We implemented our model in PyTorch. Baselines: we selected several time-series prediction models for comparison, including 1D-CNN, Informer, Lasso (footnote 2: transformer with regular (un-structured) Lasso penalty), and Prony’s method (for the voltage signal of a single bus, case II in Tab. 1). Architecture: the encoder contains 2 Group Sparse attention layers, and the decoder consists of 1 layer of Group Sparse self-attention and 1 layer of cross-attention. We use Adam with a learning rate of and rate decay by after every 10 epochs. For the hyper-parameters in group Lasso, is 0.9 and is . During training, we use a batch size of 30. The maximal number of training epochs is 80, with an early stopping patience of 30. Setup: the input is zero-mean normalized. Platform: the model is trained on an Nvidia GTX-1080Ti.
I (II): input data with (without) neighbor voltage and current features.
Fig. 4 shows how group and regular Lasso affect our network training, as seen from visualizations of the structured/unstructured sparse query (left/right), where entries with magnitudes below 1e-5 are zeroed out. In Fig. 3, two sample predictions from GLassoformer (green) and Informer (orange) are compared with the blue test data (to the right of the vertical dashed line). The blue curve to the left of the dashed line is the short-time observation after the fault (network input). Informer’s prediction contains several spurious spikes that conceivably come from the random sampling procedure used in its attention module to achieve linear complexity. In Tab. 3, we see that Informer’s training time is slightly shorter even though it has three attention layers, while GLassoformer/Lassoformer has two attention layers in the encoder. The training time saving is limited on our data set. However, GLassoformer has much better prediction accuracy with and without nearby voltage and line current features, as shown in Tab. 1 in terms of mean squared error (MSE) and mean absolute error (MAE). The GLassoformer model is also smaller in parameter count and faster at inference than Informer (Tab. 2).
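The quantities used in this comparison can be computed as in the sketch below: the pruning rate as the percentage of query groups whose entries all fall below the 1e-5 threshold, and MSE and MAE as the usual averages. The function names and the list-of-groups layout are assumptions for illustration; how groups map onto the model's weight tensors is not specified here.

```python
def pruning_rate(groups, threshold=1e-5):
    """Percentage of groups whose entries are all below `threshold` in magnitude."""
    zeroed = sum(1 for g in groups if all(abs(x) < threshold for x in g))
    return 100.0 * zeroed / len(groups)


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mae(pred, target):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


if __name__ == "__main__":
    groups = [[0.0, 1e-7], [0.2, 0.0], [1e-6, -1e-6], [0.5, 0.5]]
    print(pruning_rate(groups))              # two of four groups count as zero
    print(mse([1.0, 2.0], [1.0, 4.0]))
    print(mae([1.0, 2.0], [1.0, 4.0]))
```

A group with a single entry above the threshold is not counted as pruned, which is the difference between structured sparsity and entry-wise sparsity.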
| Num of Params (M) | 5.827 | 5.707 | 7.257 | 0.706 |
| Inference Time (ms) | 18.98 | 14.92 | 29.76 | 0.5969 |

| Pruning rate (%) | 19.09 | 4.220 | 2.674 |
| Training time (s) | 9.215 | 8.975 | 8.987 |
Pruning rate: fraction of zero query vectors (threshold = 1e-5). Training time: time per epoch.

| Pruning rate (%) | 2.3 |
| Training time (s) | 8.75 |
| Inference Time (ms) | 15.32 |
We presented GLassoformer, a transformer neural network model with query vector sparsity for time-series prediction, and applied it to power grid data. The model is trained with a group Lasso penalty and a relaxed group-wise splitting algorithm that has a theoretical convergence guarantee. The model’s post-fault voltage prediction is considerably more accurate and faster than that of the recent Informer, and it also outperforms the other benchmark methods in accuracy.
[1] ETC: encoding long and structured inputs in transformers. Proc. of the 2020 Conference on Empirical Methods in Natural Language Processing, Online, pp. 268–284.
[2] (2020) Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150.
[3] (2016) Power system transient stability assessment based on big data and the core vector machine. IEEE Transactions on Smart Grid 7 (5), pp. 1–1.
[4] (2021) Rethinking attention with performers. In International Conference on Learning Representations.
[5] Sparsity meets robustness: channel pruning for the Feynman-Kac formalism principled robust deep neural nets. In Proc. of International Conference on Machine Learning, Optimization, and Data Science.
[6] (1996) Statistical digital processing and modeling. John Wiley & Sons.
[7] (2021) Post-fault power grid voltage prediction via 1d-cnn with spatial coupling. In Proc. of International Conference on AI for Industries.
[8] (2017) Intelligent time-adaptive transient stability assessment system. IEEE Transactions on Power Systems 33 (1), pp. 1049–1058.
[9] (2020) Transformers are RNNs: fast autoregressive transformers with linear attention. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, pp. 5156–5165.
[10] (2020) Reformer: the efficient transformer. In International Conference on Learning Representations.
[11] (2020) Machine-learning-based online transient analysis via iterative computation of generator dynamics. In Proc. IEEE SmartGridComm.
[12] A systematic approach for dynamic security assessment and the corresponding preventive control scheme based on decision trees. IEEE Transactions on Power Systems 29 (2), pp. 717–730.
[13] Sensitivity analysis by neural networks applied to power systems transient stability. Electric Power Systems Research 77 (7), pp. 730–738.
[14] Power system toolbox. Note: Available at https://www.ecse.rpi.edu/~chowj/
[15] (2020) Efficient content-based sparse attention with routing transformers. arXiv:2004.05997v5.
[16] (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
[17] (2020) Fast transformers with clustered attention. Advances in Neural Information Processing Systems 33.
[18] (2019) Fast transient stability batch assessment using cascaded convolutional neural networks. IEEE Transactions on Power Systems, pp. 2802–2813.
[19] (2019) Channel pruning for deep neural networks via a relaxed group-wise splitting method. In Proc. of International Conference on AI for Industries, Laguna Hills, CA.
[20] (2007) Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B 68(1), pp. 49–67.
[21] Informer: beyond efficient transformer for long sequence time-series forecasting. Proc. of the Association for the Advancement of Artificial Intelligence.
[22] Hierarchical deep learning machine for power system online transient stability prediction. IEEE Transactions on Power Systems.