I Introduction
Much of the analysis of electric power system resilience relies on describing the duration and magnitude of resilience events with quantitative metrics [1, 2, 3, 4, 5, 6, 7, 8]. The resilience events correspond to conditions of unusually high stress such as extreme weather or cascading and are either simulated [2, 3, 4] or extracted from historical data [5, 6, 7, 8]. The metrics of duration and extent describe the performance of the power system as it responds to the high stress and, sometimes indirectly, the impact of the event on our society. The metrics are broadly useful in improving the engineering of power system resilience. This paper addresses electric power transmission system metrics for the duration of resilience events and the durations of the outage and restore processes occurring within resilience events. Here “outage” refers to a component being removed from service, and “restore” refers to reenergizing a component to return it to service.
The duration of a resilience event would appear to be straightforward: the event starts with the first transmission outage at time $o_1$ and ends with the last restore at time $r_n$, so that the event duration metric is simply $r_n - o_1$. However, we will show that the timing of the last restore is so highly statistically variable that it is not meaningfully representative of the power system restoration. (A metric is highly statistically variable if its value is likely to differ greatly from its estimated value, and we quantify this by the size of a confidence interval containing the estimate.) Moreover, given the redundancy that is designed into power transmission systems, the last restore may have little or no impact on the power flowing to the distribution system and then to the customers. Therefore we analyze a variety of duration metrics to find new metrics which are less variable and more representative.
Our main approach is to develop new Poisson process models for the outage and restore processes. The new models are driven by seven years of automatic outage data collected across North America by the North American Electric Reliability Corporation (NERC) in its Transmission Availability Data System (TADS). These statistical models enable the variability of the metrics to be quantified. Moreover, parameters of the new models are closely related to some of the duration metrics.
I-A Literature review
Much of the previous work on statistical models of power system resilience events addresses distribution systems. Zapata [9] models distribution system reliability with outages as a power-law Poisson process arriving at a queue that is serviced by a power-law repair process to produce a restore process. Wei and Ji [6] analyze distribution system resilience to particular severe hurricanes with a Poisson outage process arriving at a queue that repairs the outages to produce a restore process. Both the outage process rate and the repair time distribution vary in time as the hurricane progresses. Carrington [8] shows how to extract outage and restore processes from standard distribution utility data.
Both [6] and [9] statistically model the outage process and the component repair process, and then calculate the restore process with a first-in-first-out queue model, whereas we follow the insight of [8] in extracting and directly modeling the outage and restore processes. Modeling the restore process directly from the data avoids the complexities in queuing models of explicitly modeling the component repair and assuming an order of component repair. While [8] fits the mean and standard deviation of the distribution system outage and restore processes to give a gamma distribution of restore times, it does not give statistical process models as we do in this paper. Moreover, the forms of the outage and restore processes are quite different: for transmission systems the restore process dramatically slows over time and typically extends well beyond the end of the outage process, whereas in distribution systems the outage and restore processes overlap during most of the event [8, 6].
Previous work also estimates individual component repair times from distribution utility data. For example, Jaech [11] predicts a gamma distribution of individual component outage restoration times and customer hours lost with a neural network, and Liu [12] fits generalized additive accelerated failure time models to hurricane and ice storm data.
I-B Summary of paper contributions
This paper:

1) proposes new statistical models of outage and restore processes in transmission systems, and shows that the new models describe typical North American data.

2) analyzes statistical variability and interpretation of a variety of duration metrics.

3) recommends novel and more useful duration metrics.

4) reports typical values for model parameters and duration metrics for North America transmission resilience events.
Our previous conference papers [13, 14] extract events and outage and restore processes from transmission system outage data and calculate resilience metrics. These methods are also applied to quantify resilience for the largest events in NERC reports [15, 16]. Outage and restore processes are extracted from utility data for a distribution system in [8]. We deploy these previous methods to extract events and outage and restore processes from data as reviewed in section II. Otherwise, this paper overlaps with the previous work only in part of contribution (4), and there only for some conventional metrics applied to fewer and larger events; this paper proposes, analyzes, and recommends new and improved duration metrics.
II Resilience events and processes
To obtain resilience metrics from utility outage data, we first need to automatically extract resilience events and the outage and restore processes for each event. This section explains how to do this based on previous work [8, 13, 14] and establishes the notation needed for the paper.
II-A Utility data and extracting resilience events
The detailed North American outage data from NERC’s TADS are the automatic outage data for the following bulk electric system transmission system elements: AC circuits, transformers, AC/DC back-to-back converters, and DC circuits [14]. The data include the outage and restore time to the nearest minute, the initiating cause code for each outage, and the sustaining cause code for sustained outages. In this paper we analyze the approximately 62 000 automatic outages for all elements reported in TADS from 2015 to 2021 for the Eastern, Western, and ERCOT interconnections.
A key step in resilience analysis of real data is automatically extracting resilience events. For each interconnection, the automatic outages are grouped together into resilience events based on the bunching and overlaps of their starting times and durations. We quote from [14] the algorithm used: “Every outage in an event has to either start within five minutes of a previous outage in the event or overlap in duration with at least one previous outage in the event that has a difference in starting time not exceeding one hour. In applying this algorithm, repeated momentary outages of the same element are neglected if they occur within 5 minutes of each other.” We use this algorithm to automatically group outages into resilience events (their sizes vary from 1 to 352 outages) and then analyze all the resilience events with 10 or more outages. An event that contains at least one outage with a weather-related initiating or sustained cause code is defined as a weather-related event. The weather-related TADS cause codes are lightning, weather excluding lightning, fire, and environmental. This procedure identified 352 transmission events with 10 or more outages, 329 of which are weather-related.
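The quoted grouping rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in [14]: the `Outage` record, the function names, and the choice of minutes as the time unit are hypothetical, and the rule that neglects repeated momentary outages of the same element is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    start: float  # outage time in minutes (assumed unit)
    end: float    # restore time in minutes


def _joins(outage, event):
    """True if `outage` satisfies the quoted rule with respect to `event`."""
    for prev in event:
        gap = outage.start - prev.start
        if gap <= 5:
            # starts within five minutes of a previous outage in the event
            return True
        if gap <= 60 and prev.end > outage.start:
            # overlaps a previous outage whose start is at most one hour earlier
            return True
    return False


def group_events(outages):
    """Group outages (processed in order of start time) into events."""
    events = []
    for outage in sorted(outages, key=lambda o: o.start):
        if events and _joins(outage, events[-1]):
            events[-1].append(outage)
        else:
            events.append([outage])
    return events
```

Because the outages are processed in order of starting time, an outage that cannot join the most recent event starts a new one. Events with 10 or more outages could then be selected with, for example, `[e for e in group_events(outages) if len(e) >= 10]`.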
II-B Outage, restore, and performance processes
Suppose that the resilience event has $n$ outages at times $o_1 \le o_2 \le \dots \le o_n$ and restores at times $r_1 \le r_2 \le \dots \le r_n$. Note that the outages are sorted into the order in which the outages occur, and the restore times are sorted into the order in which the restores occur. This sorting implies that the $k$th restore time $r_k$ is not usually the restore of the $k$th outage $o_k$.
For each event, the outage process $O(t)$ is the cumulative number of outages at time $t$ and the restore process $R(t)$ is the cumulative number of restores at time $t$:
(1) $O(t) = |\{k : o_k \le t\}|$
(2) $R(t) = |\{k : r_k \le t\}|$
Both processes start at zero at the beginning of the event and increase to the total number of outages $n$, as can be seen in the example in Fig. 1.
Resilience studies [1, 2, 3, 4] often define for each event a performance (or resilience) curve $P(t)$, which is the negative of the cumulative number of unrestored outages at time $t$. The performance curve decrements for each outage and increments for each restore as shown in Fig. 1. Indeed, the performance curve is related to the outage and restore processes by $P(t) = R(t) - O(t)$. The performance curve can be uniquely decomposed into its outage and restore processes, and it contains the same information as the outage and restore processes [8].
The outage and restore processes, while straightforward, are fundamental to analyzing real outage data, and they have several distinctive features [8]: (a) The outage and restore processes routinely overlap in time in real data; this differs from the customary idealized outage and restore phases of resilience that are separated in time [1, 2, 3, 4, 7]. (b) The analysis is at a systems level and is not focused on tracking individual elements: it only counts the numbers of outages and restores and it does not track which outaged element restored when or the order in which elements restore. (c) The forms of the outage and restore processes and performance curve readily lead to resilience metrics that describe each process; in particular, it is useful to have separate metrics describing the outage process and the restore process.
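As a small illustration of these definitions, the following Python sketch counts outages and restores up to a time `t` and forms the performance curve as restores minus outages, which matches its description as the negative of the number of unrestored outages. The function names are hypothetical, and counting events that occur at or before `t` is an assumption about the convention.

```python
from bisect import bisect_right


def cumulative_count(sorted_times, t):
    """Number of entries of `sorted_times` that are at or before time t."""
    return bisect_right(sorted_times, t)


def processes(outage_times, restore_times, t):
    """Return (outages so far, restores so far, performance) at time t."""
    o = sorted(outage_times)
    r = sorted(restore_times)
    outages = cumulative_count(o, t)
    restores = cumulative_count(r, t)
    # performance = -(number of unrestored outages)
    return outages, restores, restores - outages
```

Note that only counts are used: the sketch never matches a restore to the outage it repairs, in line with feature (b) above.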
III Poisson process models of outage and restore
This section introduces new Poisson process models that describe typical outage and restore processes in our transmission system data. The mean values of these Poisson processes are a useful approximation of the outage and restore processes. Moreover, parameters of the Poisson process models yield resilience metrics, and section VIII uses the Poisson process models to quantify the variability of the metrics. The fit of the Poisson models with the data is discussed in section VII, where it is shown that the model with a lognormal rate typically fits the restore process better than the model with an exponential rate.
III-A Poisson process of outage times with constant rate
The data for each event specifies that there are $n$ outages in the event and that the outages start at time $o_1$ and end at time $o_n$. Given this information, and assuming a constant rate Poisson process, we model the outage times as occurring randomly and at a constant rate $\lambda_O$ in the time interval $[o_1, o_n]$. In particular, given the number of outages in $[o_1, o_n]$, the outage times are independent samples from a uniform distribution on $[o_1, o_n]$ sorted into ascending order^{1}. A metric characterizing the outages is their rate $\lambda_O$, which is estimated for each event as^{2}
(3) $\lambda_O = \dfrac{n-1}{o_n - o_1}$
^{1} One well known property of a constant rate Poisson process is that, if there are a given number of outages in an interval, then these outage times are uniformly distributed in that interval [18, Thm. 4A, Ex. 4A], [19, Thm. 5.2].
^{2} Since there are $n-1$ time differences between the outages, the estimated average time difference between successive outages is $(o_n - o_1)/(n-1)$, and then the estimated rate is the reciprocal of the average time difference.
The average or expected cumulative number of outages at time is
(4) 
This average approximates the outage process as shown in Fig. 2. We see in Fig. 4 some typical examples in which the cumulative number of outages increases in the linear way given by (4). The total number of outages is $n$. For each event, the outage rate can be estimated from (3), and then the averaged outage process (4) approximates and describes the outage process.
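The rate estimate described in this subsection (the reciprocal of the average time difference between successive outages) can be sketched as follows. The function name is hypothetical, and the guard against zero outage duration is an added assumption; times are taken to be in hours so that the rate is per hour.

```python
def outage_rate(outage_times):
    """Estimate the outage rate as (n - 1) / (last outage - first outage)."""
    o = sorted(outage_times)
    n = len(o)
    duration = o[-1] - o[0]
    if n < 2 or duration <= 0:
        # simultaneous outages give no finite rate estimate
        raise ValueError("need at least two outages at distinct times")
    return (n - 1) / duration
```

For example, four outages at 0, 0.5, 1.0 and 2.0 hours have three time differences spread over 2 hours, giving a rate of 1.5 outages per hour.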
III-B Poisson process of restore times with lognormal rate
The data for each event specifies that there are restores in the event and that the restores start at time . We work with the restore times relative to ; that is, , . The first restore time relative to , and any other simultaneous restores at , become . Suppose that first restore that occurs at a time is . Usually and .
The restore times typically happen with a rate that varies, as can be seen in the examples in Fig. 4. In particular, the rate of restores typically slows dramatically for the final restores. We model the positive restore times , as occurring randomly in a nonhomogeneous Poisson process at a rate proportional to a lognormal distribution. In particular, given that there are outages in the time interval , the restore times are independent samples from a lognormal distribution on sorted into ascending order. There are some extremely long restore times in the data (up to a year is recorded), and this is reflected in the modeling of the process as unbounded in .
Let the lognormal distribution have parameters $\mu$ and $\sigma$ and probability density function $f(t) = \dfrac{1}{t\,\sigma\sqrt{2\pi}} \exp\!\left(-\dfrac{(\ln t - \mu)^2}{2\sigma^2}\right)$ for $t > 0$. Then the Poisson process rate is proportional to the probability density function:
(5)
By definition of the lognormal distribution, since the restore times are independent samples from a lognormal distribution, the natural logarithms of the restore times are independent samples from a normal distribution. The standard parameters characterizing the lognormal distribution are the mean $\mu$ and standard deviation $\sigma$ of the normal distribution. Therefore we estimate $\mu$ and $\sigma$ for each event by
(6)
(7)
The Poisson process restore rate is proportional to the lognormal distribution as shown in (5). Then the average or expected cumulative number of restores is
(8)
(9)
where $\Phi$ is the CDF of the standard normal distribution. Equation (8) shows that the expected cumulative number of restores is proportional to the CDF of the lognormal distribution, and (9) expresses it in terms of the parameters $\mu$ and $\sigma$. This expected value approximates the restore process as shown in Fig. 3.
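A minimal sketch of the lognormal fit follows, using only the Python standard library. It takes the mean and standard deviation of the logarithms of the positive restore times relative to the first restore, and evaluates the lognormal CDF through the standard normal CDF. The function names are hypothetical, the use of the sample standard deviation (denominator one less than the number of samples) is an assumption since the exact estimator is not reproduced here, and the function returns only the expected fraction of the positive restore times completed, leaving aside the constant of proportionality.

```python
import math
from statistics import NormalDist, mean, stdev


def fit_lognormal(restore_times):
    """Return (mu, sigma) of the log positive restore times relative to the first restore."""
    r1 = min(restore_times)
    logs = [math.log(r - r1) for r in restore_times if r > r1]
    return mean(logs), stdev(logs)  # stdev uses the n-1 denominator (assumed)


def expected_fraction_restored(t, mu, sigma):
    """Lognormal CDF at relative time t, written with the standard normal CDF."""
    if t <= 0:
        return 0.0
    return NormalDist().cdf((math.log(t) - mu) / sigma)
```

At the relative time `math.exp(mu)` the fitted curve gives one half, which is the lognormal median.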
III-C Poisson process of restore times with exponential rate
We can substitute the exponential distribution for the lognormal distribution of subsection III-B to obtain a Poisson restore process with exponential rate. That is, given the number of positive restore times, the restore times are independent samples from an exponential distribution sorted into ascending order. Let the exponential distribution have time constant $\tau$ and probability density function $\frac{1}{\tau} e^{-t/\tau}$ for $t \ge 0$. Then the Poisson process rate is
(10)
and the expected cumulative number of restores is
(11)
(12)
We estimate the exponential time constant $\tau$ by
(13)
The estimate of $\tau$ is the arithmetic mean of the positive restore times relative to the first restore. The exponential model has parameters , , and . For each event, $\tau$ can be estimated from (13), and then the averaged restore process in (12) approximates and describes the restore process. Examples of the approximating restore curves are shown by gray dashed lines in Fig. 4.
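The exponential fit can be sketched in the same way. The time constant is estimated as the arithmetic mean of the positive restore times relative to the first restore, as stated above; the function names are hypothetical, and the second function returns only the expected fraction of the positive restore times completed under the exponential model, not the scaled cumulative count.

```python
import math


def fit_exponential(restore_times):
    """Time constant: arithmetic mean of positive restore times relative to the first restore."""
    r1 = min(restore_times)
    positive = [r - r1 for r in restore_times if r > r1]
    return sum(positive) / len(positive)


def expected_fraction_restored_exp(t, tau):
    """Exponential CDF at relative time t."""
    if t <= 0:
        return 0.0
    return 1.0 - math.exp(-t / tau)
```

With restore times at 2, 3, 5 and 8 hours, the positive relative times are 1, 3 and 6 hours and the estimated time constant is 10/3 hours.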
IV Duration metrics
There are many possible metrics describing durations in resilience events. This section defines and describes a variety of these metrics.
IV-A Straightforward duration metrics
outage duration: $o_n - o_1$

time to first restore: $r_1 - o_1$

restore duration: $r_n - r_1$

restore time to $(n-1)$th restore: $r_{n-1} - r_1$

event duration: $r_n - o_1$

The outage process starts at the first outage $o_1$ and ends at the last outage $o_n$, so that the outage duration is $o_n - o_1$. The first restore is at time $r_1$ and the time to the first restore is $r_1 - o_1$. That is, the time to first restore quantifies how much the start of the restore process is delayed. The restore process starts at $r_1$ and ends at the last restore $r_n$, so that the restore duration is $r_n - r_1$. The event starts at time $o_1$ and ends at time $r_n$. The event duration can be split into the time to the first restore and the restore duration:
(14) $r_n - o_1 = (r_1 - o_1) + (r_n - r_1)$
This section discusses restore duration, but the corresponding metrics describing event duration are easily obtained from the metrics for restore duration by adding the time to first restore as in (14). The outage duration and time to first restore are useful metrics, but section V explains that the restore duration and the event duration suffer from high variability.
IV-B Restore metrics based on quantiles
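It is of interest to quantify the time to reach a given percentage of restoration, or, equivalently, a quantile of the restore times. There are many different definitions of quantiles ([17] analyzes 10 definitions used in statistics), and correspondingly many ways to define restore metrics based on quantiles. This subsection discusses two metrics of restore duration based on quantiles; the first metric quantizes to a restore time while the second metric interpolates between restore times. In both, $p$ with $0 < p \le 1$ denotes the fraction of restoration.
time to first restore with at least $100p\%$ restoration
(15) $r_{\lceil pn \rceil} - r_1$
The ceiling function $\lceil x \rceil$ is the smallest integer $\ge x$. For example, with $p = 0.95$, (15) is the time between the first restore and the first restore at which at least 95% of the restores are completed. It follows that this metric is $r_n - r_1$ for $n < 20$, $r_{n-1} - r_1$ for $20 \le n < 40$, and $r_{n-2} - r_1$ for $40 \le n < 60$. These quantum jumps as $n$ varies, which also occur as $p$ varies, are unsatisfactory when analyzing a range of events. This can be fixed with the following more elaborate quantile definition.
restore time to $100p\%$ of restoration
(16) $r_{\lfloor h \rfloor} + \{h\}\,\bigl(r_{\lceil h \rceil} - r_{\lfloor h \rfloor}\bigr) - r_1$
where (17) $h = \min\{(n + \tfrac{1}{3})\,p + \tfrac{1}{3},\; n\}$
The ceiling function $\lceil h \rceil$ is the smallest integer $\ge h$, the floor function $\lfloor h \rfloor$ is the largest integer $\le h$, and $\{h\} = h - \lfloor h \rfloor$ is the fractional part of $h$.
Eqn. (16) shows that the metric linearly interpolates between restore times $r_{\lfloor h \rfloor}$ and $r_{\lceil h \rceil}$. It uses the median-based quantile definition^{3} recommended by [17], but also limits $h$ to a maximum of $n$ in (17). When limiting applies, the metric equals the restore duration $r_n - r_1$.
^{3} implemented in R as quantile type 8, and in Mathematica by Quantile with parameters {{1/3, 1/3}, {0, 1}}
In contrast to (15), the metric (16) changes continuously as $p$ varies and with much smaller jumps as $n$ varies. For this reason, we strongly prefer (16) to (15).
The metric (16) evaluated with $p = 0.5$ reduces to the usual median. That is, with $h = (n+1)/2$,
(18) $r_{(n+1)/2} - r_1$ for $n$ odd, and $\tfrac{1}{2}\bigl(r_{n/2} + r_{n/2+1}\bigr) - r_1$ for $n$ even.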
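The two quantile-based metrics can be sketched in Python as follows. The sketch assumes the first metric takes the restore at index ceiling of `p * n` and the second uses the median-based (type 8) position `(n + 1/3) * p + 1/3` limited to a maximum of `n`, both measured from the first restore; the function names and the lower guard on the position are assumptions added here.

```python
import math


def time_to_at_least(restore_times, p):
    """Time from the first restore to the first restore with at least fraction p restored."""
    r = sorted(restore_times)
    n = len(r)
    k = math.ceil(p * n)          # 1-based index of the quantizing restore
    return r[k - 1] - r[0]


def interpolated_restore_time(restore_times, p):
    """Type 8 (median-based) quantile of the restore times, relative to the first restore."""
    r = sorted(restore_times)
    n = len(r)
    h = min((n + 1 / 3) * p + 1 / 3, n)   # position limited to at most n
    h = max(h, 1.0)                       # lower guard for very small p (assumed)
    lo, hi = math.floor(h), math.ceil(h)
    # linear interpolation between the restores at positions lo and hi
    return r[lo - 1] + (h - lo) * (r[hi - 1] - r[lo - 1]) - r[0]
```

With `p = 0.5` the interpolated metric gives the usual median of the restore times relative to the first restore, and for small events with `p = 0.95` the limit on the position makes both functions return the full restore duration.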
IV-C Metrics related to restore process models
These metrics work with the positive restore times relative to ; that is, , .^{4}^{4}4The following metric definitions require a positive outage duration () so that . If , we define the metric to be zero. Usually as explained in section IIIB.
 geometric mean of positive restore times

 arithmetic mean of log restore times

 standard deviation of log restore times

 restore time to restoration assuming lognormal

satisfies and
so that(19) Note that .
 restore time to restoration assuming exponential

satisfies and
so that
The average restoring half life is the average time for the number of unrestored outages to halve averaged over the restore process assuming exponential decay.
There are variants of and with slightly simpler formulas that describe the time to restoration of of the nonzero restore times. For these variants, becomes and becomes . We prefer the definitions of and above because the time to restoration of of all restore times seems more straightforward.
All the duration metrics in the paper (labelled with ) are given in hours so that the time unit hour. We now discuss the units of and . A more precise version of is (or ). Dividing in hours by in hours gives the required nondimensional argument of the logarithm [20]. Changing will cause a change in the value of . does not depend on the units used and gives the same value for any choice of .
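The model-based metrics can be illustrated with the standard quantile formulas of the lognormal and exponential distributions. The sketch below implements the simpler variants that refer to a fraction `p` of the nonzero restore times; the adjusted formulas for a fraction of all restore times are not reproduced here. The function names are hypothetical, `mu` and `sigma` denote the mean and standard deviation of the log restore times, and `tau` denotes the exponential time constant.

```python
import math
from statistics import NormalDist


def geometric_mean_time(mu):
    """Geometric mean of the positive restore times: exp of the mean log restore time."""
    return math.exp(mu)


def lognormal_time_to_fraction(mu, sigma, p):
    """Lognormal p-quantile: time by which fraction p of nonzero restore times is expected."""
    return math.exp(mu + sigma * NormalDist().inv_cdf(p))


def exponential_time_to_fraction(tau, p):
    """Exponential p-quantile of the nonzero restore times."""
    return -tau * math.log(1.0 - p)


def restoring_half_life(tau):
    """Time for the expected number of unrestored outages to halve under exponential decay."""
    return tau * math.log(2.0)
```

As a consistency check on units, `geometric_mean_time(1.64)` is about 5.16 hours when the restore times are measured in hours.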
V Discussing restore metrics
Table I
metric | recommend? | comment | median
number of outages/restores | Yes | useful measure of event size | 13.5
outage duration | Yes | useful description of outage process | 2.69
outage rate | Yes | useful description of outage process | 5.45
time to first restore | Yes | useful description of delay in start of restores | 0.52
event duration | No | extreme variability | 69.8
restore duration (time to last restore) | No | extreme variability | 69.1
restore time to $(n-1)$th restore | No | preferred | 31.4
first restore time with restore | No | preferred since continuous | 55.4
restore time to quantile | Yes | | 39.2
restore time to quantile | Yes | is an alternative | 65.2
mean of log restore times | | and is recommended. | 1.64
standard deviation of log restore times | | | 1.56
restore time to with lognormal | | slightly preferred; lognormal fit only typical | 67.7
exponential time constant | No | exponential fit poorer; arithmetic mean of restores | 16.4
restore time to with exponential | No | exponential fit poorer | 47.8
median restore time | | preferred | 4.27
geometric mean of restore times | Yes | best, least variable restore performance metric; also estimates median of restores | 5.15
all durations in hours, outage rate in per hour
All duration metrics of the restore process are subject to substantial statistical variability that can undermine their usefulness, especially for smaller values of event size $n$. The variabilities of the restore metrics are analyzed in section VIII by calculating the size of their confidence intervals, and only the conclusions about their variability are stated here.
The restore duration metric is straightforward, but it is typically too highly variable to be a reliable estimate. Moreover, it depends strongly on the last or last few restores, which prevents it from describing the performance throughout the entire restore process. This dependence also makes it relate poorly to transmission performance because these last restores may be unimportant for customers, or may be excessively delayed by factors out of the control of the utility, such as the difficulty of repairing transmission lines in the mountains in the winter or structural damage caused by a hurricane or tornado.
The geometric mean of the positive restore times is the best estimate of restore performance in terms of having the least variability. It is also clear that the geometric mean depends on all the restores throughout the restore process. We now discuss how the geometric mean also estimates a median of the restore process. Since the normal distribution is symmetrical about its mean value, the mean $\mu$ also estimates the median of the normal distribution, and therefore the geometric mean $e^{\mu}$ estimates the median of the lognormal distribution^{5}. In fact, the geometric mean is a better estimate (less variance) of the median than applying the standard formula (18) for the median. The detailed correspondence is that the geometric mean estimates the median of the positive restore times relative to the first restore, which is modestly greater than^{6} the median of all the restore times relative to the first restore, calculated in (18). That is, under the lognormal model, the geometric mean is a good estimate of the median of the positive restore times relative to the first restore, and approximates from above the median of all restore times relative to the first restore.
^{5} Only the symmetry of the distribution of the logarithm of the nonzero restore times relative to the first restore is needed here.
^{6} For , difference in the medians is , where .
While the geometric mean is an informative metric with the lowest variability, and can be used as more representative of the almost complete duration of the restore process, with the compromise of higher variability than the geometric mean. is a more smoothly varying quantile metric indicating the 95% completion of the restore process. is also smoothly varying. is a bit more variable than , particularly for small . Overall, we slightly prefer to because the quantile approach is less model dependent, whereas will work best in the typical lognormal restore case.
Table I summarizes the metrics and our recommendations.
VI Typical values of metrics & model parameters
Typical values of metrics and parameters are given for all the data in Table I and for each interconnection in Table II; these values are expected to be useful for modeling and assessing interconnection-specific transmission events. Due to the heavy tails in their distributions, some quantities in Table II have mean values that greatly exceed the median, and large standard deviations. In these cases, the estimated mean has substantial statistical variation and poorly indicates a typical value; the median is a better typical value. The large standard deviations arise from both the metric statistical variability and the metric variation between events.
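Table II
Metric | Eastern mean | Eastern SD | Eastern median | ERCOT mean | ERCOT SD | ERCOT median | Western mean | Western SD | Western median
number of outages/restores | 23.2 | 38.2 | 13 | 16.9 | 10.0 | 13 | 20.1 | 17.7 | 14
outage duration | 3.5 | 3.6 | 2.8 | 2.6 | 2.1 | 2.3 | 2.8 | 2.3 | 2.5
outage rate | 7.3 | 8.6 | 5.1 | 6.5 | 3.7 | 5.2 | 24.0 | 99.0 | 6.4
time to first restore | 0.78 | 1.07 | 0.53 | 1.28 | 1.34 | 0.95 | 0.65 | 0.80 | 0.43
event duration | 379 | 1088 | 73 | 154 | 227 | 53 | 219 | 494 | 62
restore duration (time to last restore) | 379 | 1088 | 72 | 153 | 228 | 50 | 218 | 494 | 61
restore time to $(n-1)$th restore | 126 | 332 | 32 | 75 | 81 | 36 | 81 | 210 | 26
first restore time with restore | 305 | 1000 | 62 | 128 | 204 | 49 | 170 | 438 | 46
restore time to quantile | 151 | 471 | 44 | 78 | 83 | 39 | 103 | 262 | 32
restore time to quantile | 294 | 945 | 67 | 122 | 182 | 49 | 180 | 442 | 48
mean of log restore times | 1.68 | 1.17 | 1.76 | 1.48 | 1.62 | 2.19 | 1.23 | 1.11 | 1.10
standard deviation of log restore times | 1.64 | 0.57 | 1.59 | 1.56 | 0.59 | 1.67 | 1.57 | 0.65 | 1.46
restore time to with lognormal | 397 | 2740 | 77 | 199 | 327 | 55 | 132 | 362 | 46
exponential time constant | 49.8 | 154 | 17.6 | 28.6 | 30.8 | 15.0 | 28.5 | 57.6 | 12.5
restore time to with exponential | 145 | 449 | 52 | 84 | 90 | 44 | 83 | 167 | 37
median restore time | 15.3 | 65.3 | 4.8 | 18.1 | 28.8 | 5.3 | 5.6 | 6.1 | 2.6
geometric mean of restore times | 12.8 | 51.6 | 5.8 | 10.0 | 9.5 | 8.9 | 5.8 | 5.6 | 3.0
all durations in hours, outage rate in per hour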
On average, events in the Eastern interconnection are larger than in the West and ERCOT. This can be explained by the fact that the largest transmission events were caused by hurricanes, and all of these events occurred in the East. For all interconnections, the mean and median outage process durations are similar, and very short compared to event durations. The mean outage rate in the West is much higher due to several events (wildfires and a lightning storm) for which all outages started almost simultaneously. These extremely short outage durations result in huge outage rates (see (3)).
The restoration usually starts very quickly after the event starts as the time to first restore indicates. In ERCOT the average time to a first restore, 1 hour 17 minutes, is statistically significantly larger than in the East and in the West, where restoration typically starts within one hour. Overall, the time to first restore is negligible compared to event duration; this makes the event duration and the restore process duration effectively equal. In contrast, the time between the th and th restores, , is sizeable and often comprises a substantial share (41% on average) of . This observation again underscores the impact of the last few restores to the event and restore durations.
The geometric mean of the positive restore times, , is a simple and stable metric. is also an approximate estimate for the time to one half of restores for the events with lognormal restore times. The largest difference between these metrics observed for the ERCOT events can be attributed to the poorer lognormal fit for the ERCOT events. On average, is 12% of the entire restore process duration .
It is interesting to compare in Table II the sample quantile restore time with the lognormal and exponential quantiles and . often overestimates due to the heavy tail of the lognormal distribution, whereas often underestimates due to the light tail of the exponential distribution.
The parameters $\mu$ and $\sigma$ for fitted lognormal distributions and $\tau$ for fitted exponential distributions are consistent in each interconnection and across interconnections. Table V shows that increases and decreases with event size $n$.
Only 23 of the 352 resilience events in the dataset are not weather-related. These 23 events vary in size from 10 to 26 outages. Except for , the medians of the duration metrics in Table III are statistically significantly higher^{7} for weather-related events than for non-weather-related events. Table III also shows for each weather type the median metrics for the 95 weather-related events with at least 18 outages. There are some statistically significant differences among the extreme weather types: the medians of and for hurricanes are greater than for other weather types, and and for hurricanes and tornadoes are greater than for other weather types. The means of the times to first restore are similar for all weather types except tornadoes; the mean for tornadoes is 1.7 hours, which is at least double the mean for the other weather types.
^{7} confirmed with a nonparametric one-way ANOVA test for medians [21]
Type (# cases)  

fire (4)  21  1.51  0.33  33.4  2.63  30.8  0.96  1.89 
hurricane (17)  55  6.53  0.58  257  20.4  109  3.02  1.50 
wind,thunder (36)  25  4.04  0.44  122  6.75  82.3  1.90  1.44 
tornado (15)  24.5  5.04  0.96  174  12.7  93.4  2.54  1.47 
winter (23)  32  4.37  0.60  49.5  4.73  41.5  1.55  1.32 
all weather (329)  14  2.80  0.52  73.4  5.76  67.7  1.75  1.56 
nonweather (23)  11  1.00  0.65  19.1  1.10  19.1  0.09  1.58 
all durations in hours 
VII Fit of Poisson process models to utility data
This section discusses the fit of the Poisson models to the observed utility data by a goodness of fit test, which allows for analysis of each of the 352 events, and by probability plots for the combined normalized data, which also show where the fit deviates. For the goodness of fit tests, there is some arbitrariness in the threshold amount of deviation corresponding to the significance level, as well as some dependence on the event size $n$, but they do give an indication of fit.
VII-A Outage process fit with uniform distribution
The Poisson process model with constant outage rate implies that for each event the outage times should be independent samples from a uniform distribution on the interval $[o_1, o_n]$. We evaluated the fit of these outage times for each event to the uniform distribution as shown in Table IV. Satisfying the test means that the ideal model is not rejected at the chosen significance level. Table IV shows that a majority of events satisfy the model.
Table IV
test (satisfies if the model is not rejected) | all | eastern | western | ercot
percent of events satisfying uniform outages
Kolmogorov-Smirnov | 69 | 70 | 72 | 50
Cramér-von Mises | 72 | 73 | 71 | 56
Anderson-Darling | 63 | 63 | 66 | 50
percent of events satisfying lognormal restores
Kolmogorov-Smirnov | 63 | 63 | 66 | 44
Cramér-von Mises | 60 | 61 | 64 | 38
Anderson-Darling | 59 | 59 | 64 | 38
percent of events satisfying exponential restores
Kolmogorov-Smirnov | 35 | 33 | 42 | 25
Cramér-von Mises | 35 | 33 | 45 | 25
Anderson-Darling | 32 | 31 | 40 | 13
The normalized outage times should be independent samples from the standard uniform distribution on the interval $[0, 1]$. The fit of the normalized outage times for all of the events to the standard uniform distribution is shown by the Q-Q plot in Fig. 5. The fit in Fig. 5 is quite close over the middle range, and the main deviations occur at the ends of the distribution and correspond to simultaneous multiple outages recorded at the beginning or end of the outage process^{8}.
^{8} While it is plausible that some outage processes start or end with outages occurring in the same minute, it is not clear that the records accurately reflect the outage timing in all these cases.
The fits of this subsection indicate that the Poisson model with uniform rate is a typical case (a majority of all events) usefully approximating the outage process.
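To make the uniformity check concrete, the following Python sketch normalizes the outage times of one event to the unit interval and computes the Kolmogorov-Smirnov statistic against the standard uniform distribution. It is an illustration only: the function names are hypothetical, the normalization by the first and last outage times (including those two endpoints in the sample) is an assumption, and the conversion of the statistic to a p-value and the other two tests in Table IV are not reproduced.

```python
def normalized_times(times):
    """Map outage times linearly so the first is 0 and the last is 1."""
    t = sorted(times)
    span = t[-1] - t[0]
    return [(x - t[0]) / span for x in t]


def ks_statistic_uniform(samples):
    """Kolmogorov-Smirnov distance between the sample and the uniform CDF on [0, 1]."""
    u = sorted(samples)
    n = len(u)
    # compare the uniform CDF with the empirical CDF just after and just before each sample
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(u))
```

A larger statistic indicates a larger deviation from uniformly distributed outage times; the model is rejected when the statistic exceeds the critical value for the chosen significance level and sample size.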
VII-B Restore process fit with lognormal distribution
As explained in section III-B, the Poisson process model with lognormal rate for the restores implies that for each event the restore times should be independent samples from a lognormal distribution. We evaluated the fit of these restore times for each event to the lognormal distribution with parameters estimated using (6), (7) at the significance level as shown in Table IV. Table IV shows that a majority of all events satisfy the model, and this also holds for the East and West interconnections.
For each event, the normalized restore times should be independent samples from the standard normal distribution. The fit of the normalized restore times for all events to the standard normal distribution is shown by the CDF and Q-Q plots in Fig. 6, which show a reasonably good fit with some modest deviations.
The fits described in this subsection indicate that the Poisson process model with lognormal rate is a typical case usefully approximating the restore process. The typical lognormal case has a heavy tail that can describe some extremely delayed final restores.
VII-C Restore process fit with exponential distribution
As explained in section III-C, the Poisson process model with exponential rate for the restores implies that for each event the restore times should be independent samples from an exponential distribution with time constant $\tau$. We evaluate the fit of the restore times for each event to the exponential distribution with time constants estimated using (13) as shown in Table IV. Table IV shows that a minority of events satisfy the model.
For each event, the normalized restore times should be independent samples from the standard exponential distribution with time constant 1. The fit of the normalized restore times for all events to the standard exponential distribution is shown by the survival function and Q-Q plots in Fig. 7. There is clear discrepancy between the exponential model and the data for the initial portion and tail of the distribution. The tail in the data is much heavier than exponential, and this discrepancy in the tail is particularly significant for our purpose here of estimating restore durations.
The fits described in this subsection indicate that the Poisson process model with exponential rate fits only a minority of the events and is a noticeably poorer approximation of the typical restore process than the model with lognormal rate.
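For comparison with the lognormal case, the exponential check can be sketched in the same way. The sketch assumes that the time constant is estimated by the mean restore time (the estimator (13) is not restated in this section) and uses hypothetical function names; the empirical survival function is included because the discrepancy of interest is in the tail.

```python
from math import exp
from statistics import mean


def fit_exponential(restore_times):
    """Assumption: the time constant is estimated by the mean restore time."""
    return mean(restore_times)


def ks_distance_standard_exponential(samples):
    """Kolmogorov-Smirnov distance to the exponential with time constant 1."""
    x = sorted(samples)
    n = len(x)
    cdf = [1.0 - exp(-v) for v in x]
    return max(max((i + 1) / n - c, c - i / n) for i, c in enumerate(cdf))


def empirical_survival(samples):
    """Return (x, fraction of samples strictly greater than x) pairs."""
    x = sorted(samples)
    n = len(x)
    return [(v, (n - i - 1) / n) for i, v in enumerate(x)]


# Hypothetical event with three restores (times in hours)
times = [1.0, 2.0, 3.0]
tau = fit_exponential(times)
normalized = [t / tau for t in times]
d = ks_distance_standard_exponential(normalized)
survival = empirical_survival(normalized)
```

Plotting the empirical survival values against exp(-x) on a logarithmic vertical axis makes a tail that is heavier than exponential show up as points lying above the straight line.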
VIII Stochastic variability of restore metrics
The restore duration metrics vary due to variation of the restore processes between events (and this of course is what we want to quantify) but also due to the inherent statistical variability of the metric used (which we want to minimize by selecting a better metric). The statistical variability makes the metric vary between events, even if the events have the same characteristics, because of random variations in the progress of the restores.
We assess the inherent statistical variability of the metrics by assuming the lognormal Poisson model for average values of and , which vary as functions of , and are estimated using (6) and (7). In this section we assume that .
VIII-A Variability of and
Since is assumed, and are estimated with samples. The sample mean of samples from a normal distribution with mean and standard deviation has normal distribution . Therefore has two-sided confidence interval with endpoints , where and is the CDF of the standard normal distribution. It follows that the geometric mean of samples from a lognormal distribution with parameters and has two-sided confidence interval with endpoints , or
(20) 
We measure the size of the confidence interval (20) by the multiplicative factor , which we call the “multiplicative half-width” of the confidence interval. More generally, we define the size of a confidence interval with endpoints as
(21) 
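The following Python sketch illustrates the two quantities just described, under stated assumptions. It assumes a 90% two-sided interval, so that the normal quantile is taken at 0.95, and it assumes that the size of an interval with endpoints a and b is the square root of b/a, which agrees with the later remark that a half-width of 2 spans from half to double of a point inside the interval. The function names and the sample count m are hypothetical.

```python
from math import exp, sqrt
from statistics import NormalDist


def geometric_mean_half_width(sigma, m, level=0.90):
    """Multiplicative half-width of the confidence interval of the geometric
    mean of m lognormal samples with log standard deviation sigma.

    Assumption: the interval is exp(mu +/- z * sigma / sqrt(m)), so the
    half-width exp(z * sigma / sqrt(m)) does not depend on mu.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return exp(z * sigma / sqrt(m))


def multiplicative_half_width(a, b):
    """Assumed size of an interval with positive endpoints a < b."""
    return sqrt(b / a)


# Hypothetical values: sigma = 1.72 estimated from 9 samples
kappa = geometric_mean_half_width(1.72, 9)    # about 2.57
check = multiplicative_half_width(0.5, 2.0)   # 2.0: from half to double
```

With these hypothetical inputs the interval spans from roughly 1/2.57 to 2.57 times the geometric mean, and the half-width shrinks in proportion to the inverse square root of the number of samples in the exponent.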
Now we obtain the size of the confidence interval for . From (19), taking ,
(22) 
The sample standard deviation has distribution where is the chi distribution with degrees of freedom. (Footnote 9: The definition of uses , so that the number of degrees of freedom is one fewer than the number of samples .)
Using (22) and the independence of and , the probability density function of is the convolution
(23) 
and the CDF of is
(24) 
We use (24), numerically integrating to evaluate the convolution, to find the confidence interval for as , then use (21) to find the multiplicative half-width of the confidence interval for .
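As a cross-check on this kind of calculation, the confidence interval can also be approximated by simulation. The sketch below swaps Monte Carlo sampling in for the numerical convolution of (23) and (24); it is not the method used above. It assumes that the logarithm of the metric has the form of the sample mean plus a constant c times the sample standard deviation, that the two estimates are independent, and that the sample standard deviation follows a scaled chi distribution with m - 1 degrees of freedom for m samples. The names c, m, and simulate_half_width are hypothetical.

```python
import random
from math import exp, sqrt
from statistics import NormalDist


def simulate_half_width(c, sigma, m, level=0.90, trials=50_000, seed=0):
    """Monte Carlo estimate of the multiplicative half-width of the
    confidence interval of exp(mu_hat + c * sigma_hat)."""
    rng = random.Random(seed)
    dof = m - 1
    draws = []
    for _ in range(trials):
        # mu is set to 0 because it cancels in the multiplicative half-width
        mu_hat = rng.gauss(0.0, sigma / sqrt(m))
        # A chi-square variate with dof degrees of freedom is Gamma(dof/2, scale 2)
        chi_square = rng.gammavariate(dof / 2, 2.0)
        sigma_hat = sigma * sqrt(chi_square / dof)
        draws.append(mu_hat + c * sigma_hat)
    draws.sort()
    alpha = (1 - level) / 2
    lower = draws[int(alpha * trials)]
    upper = draws[int((1 - alpha) * trials) - 1]
    # sqrt(b / a) for endpoints a = exp(lower) and b = exp(upper)
    return exp((upper - lower) / 2)


# With c = 0 the metric reduces to the geometric mean, which has a closed form
closed_form = exp(NormalDist().inv_cdf(0.95) * 1.72 / sqrt(9))
simulated = simulate_half_width(c=0.0, sigma=1.72, m=9)
```

Increasing c adds the variability of the standard deviation estimate to that of the mean, so the half-width grows for metrics that reach further into the tail.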
VIII-B Variability of and
Since the restore times are sorted in increasing order, corresponds to the th largest restore time and, assuming that and , is the th order statistic of the lognormally distributed restore times . We evaluate in Mathematica the inverse CDF of the th order statistic of samples of the lognormal distribution with parameters and . Then we find the confidence interval for and its multiplicative half-width from (21).
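The order-statistic calculation can be reproduced outside Mathematica with a short Python sketch. It uses the standard result that the k-th smallest of m independent samples is at most x with probability equal to the binomial tail sum of C(m, j) p^j (1 - p)^(m - j) for j from k to m, where p is the parent CDF at x, and inverts this by bisection. The names k and m, the 90% level, and the square root of b/a as the interval size are assumptions of this sketch.

```python
from math import comb, exp
from statistics import NormalDist


def order_stat_cdf(p, k, m):
    """P(k-th smallest of m iid samples <= x) when the parent CDF at x is p."""
    return sum(comb(m, j) * p**j * (1 - p) ** (m - j) for j in range(k, m + 1))


def order_stat_level(q, k, m, tol=1e-12):
    """Parent CDF level p at which the k-th order statistic has CDF q."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if order_stat_cdf(mid, k, m) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def order_stat_half_width(k, m, sigma, level=0.90):
    """Multiplicative half-width of the confidence interval of the k-th
    order statistic of m lognormal samples with log standard deviation sigma."""
    alpha = (1 - level) / 2
    inv = NormalDist().inv_cdf
    z_lower = inv(order_stat_level(alpha, k, m))
    z_upper = inv(order_stat_level(1 - alpha, k, m))
    # Endpoints are exp(mu + sigma * z); mu cancels in sqrt(b / a)
    return exp(sigma * (z_upper - z_lower) / 2)


# Hypothetical case: the largest of 9 samples with sigma = 1.72
last_restore = order_stat_half_width(k=9, m=9, sigma=1.72)   # roughly 5.4
median_like = order_stat_half_width(k=5, m=9, sigma=1.72)
```

In this hypothetical case the largest order statistic has a much wider interval than the middle one, which reflects the heavy upper tail of the lognormal distribution.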
VIII-C Results for variability of metrics
The size of the 90% confidence interval, measured by the multiplicative half-width (21), indicates the inherent statistical variability of the metrics. For example, a multiplicative half-width of 2 indicates that the interval spans from half to double the value of a point inside the interval. Table V shows results for metric variability, and there are some overall trends: All the metrics become much more variable as the event size decreases. Metrics estimating a larger fraction of the entire restore duration are much more variable (consider the sequence , , or , , , ). The quantile metrics (, , ) are always more variable than corresponding metrics related to lognormal restore (, , ), but the increase in variability is modest or small for 50.
Metric variability is worst and unacceptably large for , which always has a confidence interval size of more than a factor of 2. The high variability of the last restore and is expected due to the heavy tail of the lognormal distribution. Fig. 8 shows that the variability of is sharply reduced for , at least for larger , and further reduced for . This motivates avoiding and considering the use of , , , , which have confidence intervals with size less than a factor of 2 for and which perform more continuously by interpolating the metrics.
The pervasive problem of duration metric variability is best mitigated by , which has a confidence interval size of less than a factor of 2 for .
This section assesses metric variability assuming the lognormal model of restores. This is a good assumption for a majority of cases, and can be regarded as a stringent assumption for the remaining minority of cases due to the heavy tail of the lognormal distribution.
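Table V: Multiplicative half-width (21) of the 90% confidence interval of the restore duration metrics, by event size (first column)
10  1.18  1.72  2.57  3.19  3.56  5.09  4.29  5.40  3.83  5.40 
20  1.60  1.58  1.82  2.10  2.24  2.76  2.51  3.69  2.85  3.93 
50  2.20  1.35  1.37  1.49  1.54  1.72  1.63  1.96  2.14  2.79 
100  2.52  1.35  1.25  1.32  1.35  1.46  1.41  1.60  1.98  2.56 
200  3.15  1.33  1.17  1.21  1.23  1.30  1.27  1.39  1.85  2.37 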
IX Conclusions
We use extensive North American transmission system data to analyze the statistical variability and interpretations of a variety of metrics for the duration of processes in resilience events. Some metrics, such as the outage duration , outage rate , and the time delay before the first restore , are useful. Other duration metrics can suffer from excessive statistical variability, in which their estimated values are contained in confidence intervals that are so large that the estimated values of the metric are not representative. This variability is quantified using new Poisson models for outage and restore processes. The variability is worse for small events.
The apparently straightforward metrics of restore process duration and the event duration are extremely statistically variable and do not adequately describe the restore process, so we recommend new duration metrics and (or ) with better performance. In particular, the geometric mean of restore times has the least statistical variability, summarizes all of the restore process, and approximates a time at which half the restores are completed. The quantile-based metric indicates the time at which restoration is 95% complete, but has greater variability than . uses interpolation to vary more continuously as the data changes. Table I summarizes the metrics and their recommendations, and Tables II and III give typical values for the metrics for three interconnections and different weather conditions.
We introduce novel Poisson process models for the outage and restore processes in resilience events. These new stochastic models describe how resilience events progress in North American transmission systems, and are verified with extensive utility data to be good approximations for the majority of cases. The outages occur uniformly over a short time interval, whereas the restores occur at a lognormal rate that slows to produce the long delays often observed for the last few restores. The lognormal model for the restores is a noticeably better fit than an exponential model for the restores. We give typical values of the model parameters for three interconnections and for different weather conditions to make the new models more specific and useful to other researchers.
The Poisson process models describe probabilistic outages and restores occurring according to specified rates. Averaging the Poisson process models produces formulas for smooth, deterministic curves that approximate typical outage and restore processes. These deterministic averaged models are also of considerable interest in describing how resilience events progress in transmission systems.
References
 [1] A. Stankovic et al., Methods for analysis and quantification of power system resilience, IEEE PES working group publication submitted to IEEE Trans. Power Systems (currently responding to first review).
 [2] M. Panteli, D.N. Trakas, P. Mancarella, N.D. Hatziargyriou, Power systems resilience assessment: hardening and smart operational enhancement, Proc. IEEE, vol. 105, no. 7, July 2017, pp. 1202–1213.
 [3] C. Nan, G. Sansavini, A quantitative method for assessing resilience of interdependent infrastructures, Reliab. Eng. Syst. Safety, vol. 157, Jan. 2017, pp. 35–53.
 [4] S. Poudel, A. Dubey, A. Bose, Risk-based probabilistic quantification of power distribution system operational resilience, IEEE Systems Journal, vol. 14, no. 3, Sept. 2020, pp. 3506–3517.
 [5] D.A. Reed, K.C. Kapur, R.D. Christie, Methodology for assessing the resilience of networked infrastructure, IEEE Systems Journal, vol. 3, no. 2, June 2009, pp. 174–180.
 [6] Y. Wei, C. Ji, F. Galvan, S. Couvillon, G. Orellana, J. Momoh, Non-stationary random process for large-scale failure and recovery of power distribution, Applied Mathematics, vol. 7, no. 3, 2016, pp. 233–249.
 [7] M.R. Kelly-Gorham, P.D.H. Hines, K. Zhou, I. Dobson, Using utility outage statistics to quantify improvements in bulk power system resilience, Electric Power Systems Research, vol. 189, 106676, Dec. 2020.
 [8] N.K. Carrington, I. Dobson, Z. Wang, Extracting resilience metrics from distribution utility data using outage and restore process statistics, IEEE Trans. Power Systems, vol. 36, no. 2, Nov. 2021, pp. 5814–5823.
 [9] C.J. Zapata, S.C. Silva, H.I. Gonzalez, O.L. Burbano, J.A. Hernandez, Modeling the repair process of a power distribution system, IEEE/PES T&D Conf. & Exp.: Latin America, Bogota, Colombia, 2008.
 [10] H. Li, L.A. Treinish, J.R.M. Hosking, A statistical model for risk management of electric outage forecasts, IBM Journal Research and Development, vol. 54, no. 3, paper 8, May/Jun. 2010.
 [11] A. Jaech, B. Zhang, M. Ostendorf, D.S. Kirschen, Real-time prediction of the duration of distribution system outages, IEEE Trans. Power Systems, vol. 34, Jan. 2019, pp. 773–781.

 [12] H. Liu, R.A. Davidson, T.V. Apanasovich, Spatial generalized linear mixed models of electric power outages due to hurricanes and ice storms, Reliab. Eng. Syst. Safety, vol. 93, no. 6, 2008, pp. 897–912.
 [13] S. Ekisheva, R. Rieder, J. Norris, M. Lauby, I. Dobson, Impact of extreme weather on North American transmission system outages, IEEE PES General Meeting, Washington DC USA, July 2021.
 [14] S. Ekisheva, I. Dobson, J. Norris, R. Rieder, Assessing transmission resilience during extreme weather with outage and restore processes, Probability Methods Applied to Power Sys., Manchester UK, June 2022.
 [15] NERC, 2021 State of reliability, An assessment of 2020 bulk power system performance, July 2021. Available: www.nerc.com.
 [16] NERC, 2022 State of reliability, An assessment of 2021 bulk power system performance, July 2022. Available: www.nerc.com.
 [17] R.J. Hyndman, Y. Fan, Sample quantiles in statistical packages, The American Statistician, vol. 50, no. 4, November 1996, pp. 361–365.
 [18] E. Parzen, Stochastic Processes, Dover NY, 2015.
 [19] S.M. Ross, Introduction to Probability Models, 9th ed., Academic Press, MA 2007.
 [20] C.F. Matta, L. Massa, A.V. Gubskaya, E. Knoll, Can one take the logarithm or the sine of a dimensioned quantity or a unit? Dimensional analysis involving transcendental functions, J. Chemical Education, vol. 88, no. 1, January 2011.
 [21] M. Hollander, D.A. Wolfe, Nonparametric Statistical Methods, 2nd ed., John Wiley, 1999.