Many ML models are highly nonlinear and high-dimensional making them difficult to understand without the use of external methods (McGovern et al. 2019b). As discussed in Flora et al. (2022) (hereafter Part I), we can use post hoc techniques, known as explainability methods (defined in Part I), to gain some understanding of how these complex models work. However, in agreement with previous studies (e.g., McGovern et al. 2019b; Krishna et al. 2022), we demonstrated in Part I that explanation methods can disagree with each other. Although some disagreement is justifiable since different explanation methods are designed for different tasks (e.g., feature relevance versus feature importance; see Section 4a.1 of Part I), it is unclear whether disagreement is inflated by less faithful methods. Herein, ”faithfulness” or ”fidelity” refer to the correspondence between the assigned feature importance and the contribution of the feature to model performance. For example, permutation importance may misassign individual importance when features are correlated (e.g., Strobl et al. 2007, 2008; Nicodemus et al. 2010; Gregorutti et al. 2015, 2017), and thus produce less faithful feature rankings in many meteorological applications. However, little research has been done to compare the faithfulness of a wide range of feature ranking methods, especially in the context of atmospheric science-relevant datasets. Covert et al. (2020) is likely the first study to compare the faithfulness of multiple feature ranking methods, but their datasets were not atmospheric science-based and did not include feature ranking methods commonly used in atmospheric sciences (e.g., multi-pass permutation importance; McGovern et al. 2019b; Jergensen et al. 2020; Handler et al. 2020; Lagerquist et al. 2021).
In this study, we explore the faithfulness of the feature ranking methods introduced in Part I using approaches introduced in Covert et al. (2020) and evaluate whether the ranking disagreement can be reduced by identifying and excluding lower-fidelity methods. Since some of the explainability methods explored in this study are sensitive to correlated features, we also evaluate the impact of dimensionality reduction on the ranking results and on the fidelity of individual methods. These efforts are important, since there is a disagreement problem
for explainability methods (see Part I) where ML practitioners find it difficult to choose among explainability methods that provide different feature rankings and are forced to rely on heuristics to make decisions(Krishna et al. 2022). These disagreements can interact with confirmation bias and cause practitioners to make incorrect assumptions about the model behavior.
As in Part I, we use ML models developed for severe weather prediction (Flora et al. 2021) and sub-freezing road surface temperature prediction (Handler et al. 2020). Full descriptions of each dataset are provided in Part I and therefore are omitted from this paper. We limit our focus to the logistic regression models from Flora et al. (2021)
and the random forest trained inHandler et al. (2020). Using both datasets, we outline pertinent discussions on ML explainability and to begin to generalize the behavior of different feature ranking methods. A full description of the explainability methods used in this study are provided in Part I and is omitted from this paper. All the methods demonstrated here are available in the Python package scikit-explain (Flora and Handler 2022), which was developed by the authors.
2 Data and Machine Learning Methods
To generalize the results, this study uses two different types of meteorological datasets: severe weather prediction and sub-freezing road surface prediction. The severe weather dataset is derived from Warn-on-Forecast System (WoFS) output using an object-based approach while the road surface dataset is a grid-based nowcasting dataset based on hourly output from the High-Resolution Rapid Refresh (HRRR). More details on the datasets are provided in Part I, Handler et al. (2020), and Flora et al. (2021).
For this study, we used classification random forests and logistic regression models available in the Python sci-kit learn package (Pedregosa et al. 2011). Consistent with Handler et al. (2020) and Flora et al. (2021)
, the random forest is applied to the road surface dataset to predict whether a road will freeze while logistic regression is applied to the WoFS dataset to predict whether a storm track will be associated with a severe hail, severe wind, or tornado report, respectively. The hyperparameter optimization is described in the the appendix.
3 Reducing Model Complexity and Limiting Correlated Features
3.1 Rashomon Sets and Multi-Objective Optimization
Before discussing the model explainability, we address the fact that disparate ML models can fit the data equally well, but rely on different covariate information. This phenomenon is known as the Rashomon effect (Breiman 2001; Fisher et al. 2018; Rudin 2018; Rudin et al. 2021) and occurs when there are multiple valid descriptions of the same event. Fisher et al. (2018) formalized the idea of the Rashomon set for a given dataset as the set of models that have expected accuracy within some defined error of the best possible model. This set can be thought of as including all models that might be obtained depending on the model developer’s choices regarding data measurement, pre-processing, filtering, model class and parameterization, and so on. Although ML development is often concerned with how well the model generalizes to unseen data (Breiman 2001), less attention is paid to producing an explainable model. In line with Molnar et al. (2019), we argue that ML model development should be treated like a multi-objective optimization problem where the goal is not solely to develop the best-performing model, but to train simpler, well-performing models (ones that are within the Rashomon set) that are therefore more explainable. Often the literature presents the following dichotomy: decreasing (increasing) model complexity reduces (improves) model performance but improves (limits) explainability and interpretability. However, for some datasets, models can be simplified with little or no loss of skill (e.g., Hand 2006; Kuhle et al. 2018; Rudin 2018; Rudin et al. 2021; Christodoulou et al. 2019; Chakraborty et al. 2021; Flora et al. 2021; Molnar et al. 2021). If the Rashomon set is sufficiently large, a simpler, well-performing model is more likely to exist (Semenova et al. 2021). Therefore, our goal is not only to produce the best score on an independent testing dataset, but also to improve model explainability by limiting the number of features, spurious feature interactions, or the complexity of feature effects (Molnar et al. 2019). This philosophy is used in the following section when dimensionality reduction is applied to the datasets.
3.2 Feature Selection and Reducing Multicollinearity
One approach to reducing model complexity and improving model explainability is removing redundant features. While strong feature multicollinearity may not adversely affect model performance, parsing feature-unique model contributions becomes difficult. For example, for regression-based learning algorithms, if two features are highly correlated (e.g., ) their regression coefficients can spuriously have opposite signs, which does not impact model performance (due to compensatory effects), but it misrepresents the relationship of the affected feature to the target variable.
There are other reasons to reduce information redundancy. First, decreasing the number of features reduces the overall computation time. For example, for the real-time prediction of storm longevity, McGovern et al. (2019a) found that reducing the feature set decreased pre-processing time while having little impact on model performance. Second, shrinking the feature set can decrease the likelihood of overfitting by reducing opportunities for ML models to learn noise or spurious feature interactions. Third, explanations of model behavior become more compact as we reduce the set of features (Molnar et al. 2019).
The datasets in this study are reduced by a combination of manual feature removal and a feature selection method based on logistic regression with L1 regularization. The feature selection method trains a logistic regression model with L1 regularization, which incentivizes the model to remove irrelevant features, especially multicollinear features thereby improving model sparsity. For example, if two features are nearly identical, then L1 regularization will set the coefficient of one of them to zero. Unlike sequential selection methods, the L1-norm-based method is computationally feasible for large datasets but requires a linear model to perform well on the data (which was true for both datasets).
Our first step in the feature reduction process was to compute a linear correlation coefficient matrix for both datasets. For the WoFS dataset, we found that the spatial average intrastorm features were often highly correlated with the amplitude intrastorm features (; Fig. 1) and the removal of spatial-based intra-storm features did not adversely affect model performance. For the road surface dataset (Fig. 2), although there were fewer strong correlations, we limited the feature set to near-surface features and removed many radiative features given their correlation with temperature variables. After manually removing these features from the two datasets, we applied L1 regularization, whose strength is governed by the parameter . We found that plus a cutoff threshold of (below which a coefficient is considered zero) was sufficient for reducing the feature set. For the three severe weather hazards, approximately 30 or fewer features (of the original 113 features) were retained after the manual and L1-norm-based feature selection. The road surface dataset was reduced from 30 to 11 features. As will be discussed in Section 44.2, the final number of features for each of the four data sets is consistent with the number of features that capture most of the original model performance. For the four datasets, the final sets of features are shown in Tables 1-4.
|2-5 km Updraft Helicity ()||ML CAPE ()||Major axis length|
|0-2 km Avg. Vertical Vorticity ()||0-6 km V Shear ()||Minor axis length|
|Composite Reflectivity ()||0-1 km V Shear ()|
|Column-maximum Updraft ()||ML LCL ()|
|Hail ()||500 mb Geopotential height ()|
|ML CIN ()|
|2-5 km Updraft Helicity ()||0-3 km Storm Relative Helicity ()||Major axis length|
|2-5 km Updraft Helicity ()||ML CIN ()||Minor axis length|
|Composite Reflectivity ()||0-6 km U Shear ()|
|Composite Reflectivity ()||0-1 km U Shear ()|
|80-m Wind Speed ()||10-m U ()|
|Updraft ()||10-m V ()|
|Updraft ()||0-3 km Lapse Rate ()|
|0-1 km Updraft ()||T ()|
|10-500 m Bulk Shear ()||700 mb Geopotential Height ()|
|Cloud Top Temperature ()||500 mb Geopotential Height ()|
|Cloud Top Temperature ()||T ()|
|Downdraft ()||ML CAPE ()|
|Near-surface deficit ()||ML CIN ()|
|Near-surface deficit ()||500-700 mb Lapse Rate ()|
|500 mb Geopotential Height ()|
0-2 km Avg. Vertical Vorticity ()
|0-1 km Storm Relative Helicity ()||Major axis length|
|Composite Reflectivity ()||0-6 km U Shear ()||Minor axis length|
|80-m Wind Speed ()||0-1 km U Shear ()|
|Hail ()||ML LCL ()|
|Hail ()||10-m U ()|
|0-1 km Updraft ()||500-700 mb Lapse Rate ()|
|10-500 m Bulk Shear ()||0-3 km Lapse Rate ()|
|Near-surface deficit ()||T ()|
|0-6 km U Shear ()|
|10-m U ()|
|10-m U ()|
|850 mb Geopotential Height ()|
|500 mb Geopotential Height ()|
Surface Temperature ()
|Downward longwave radiation flux ()||Absolute difference between current date and 10 Jan (Date Marker)|
|Hours 0C||Simulated Brightness Temperature ()||Low cloud cover percentage ()|
|2-m Dewpoint Temperature||Incoming shortwave radiation ()|
|Ground Flux ()|
|Surface Latent heat flux ()|
|Surface Sensible heat flux ()|
As discussed above, our goal is not to define the most skillful model, but to produce a feature subset that does not substantially harm the model performance while greatly improving explainability. By increasing the model sparsity and reducing multicollinearity, we gain a better holistic view of how variables interact jointly rather than individually with the target variable and a more compact explanation of the relationships learned.
3.3 Quantifying Model Complexity
Recently, Molnar et al. (2019) introduced two metrics, the interaction strength statistic (IAS) and main effect complexity (MEC), for quantifying ML model complexity using the accumulated local effects (ALE; Apley and Zhu 2016). The ALE for the feature is :
where is the ML model, is the set of all features, are the values of , c is the integration constant (set equal to the average so the effect is centered). To compute the feature effect, ALE calculates the expected change in prediction over different conditional distributions and then integrates them (cumulative sum). By computing the effect of the feature from conditional rather than marginal distributions, ALE isolates the effect from the effects of all other features. More details on the ALE computation are provided in Molnar (2020).
From Molnar et al. (2019), any high-dimensional prediction function (e.g., an ML model) can be decomposed into a sum of components with increasing dimensionality:
where is the number of features. Using Equation 2, we can approximate an ML model by using the average model prediction for , the first-order ALE, and all second-order or higher effects [denoted by ]:
where is the th training example. We can define the IAS as an approximation error of the first-order effects with respect to the original model predictions:
where ( is defined as the average model prediction). If IAS = 0, then a ML model is perfectly approximated by first-order effects and has no feature interactions.
To describe the shape of the first-order ALE curves, Molnar et al. (2019) introduced the MEC statistic:
where is the number of line segments needed to approximate the curve with a piecewise linear function and is the variance of . The MEC algorithm starts by approximating the ALE curve with a linear model and measuring if the approximation is within tolerance (i.e., where is 0.05 for this study). If not, the domain is repeatedly split into two parts based on each value of , and each part is approximated with a linear model. If the tolerance is not met, the best split is maintained and then each subdomain is divided. Until the tolerance is met, the domain is divided into more parts (while maintaining the former best splits). For a fully linear model the MEC should equal 1 while higher values indicate more nonlinear first-order effects.
Using the IAS and MEC, we quantify the reduction in the ML models’ complexity with the reduced feature sets. Ideally, the more a model is simplified, the more its explainability increases. To improve the IAS and MEC estimates, ALE curves were calculated from bootstrapped training sets (= 100) for the original and reduced feature set and the mean IAS and MEC were calculated.
4.1 Comparing model performance of original feature set versus reduced feature set
compares the model performance and complexity statistics of the original and reduced feature set for all four datasets. The area under the receiver operating characteristic curve (AUC;Metz 1978) and Brier Skill Score (BSS;Hsu and Murphy 1986) are common verification metrics for probabilistic predictions, while NAUPDC and the normalized critical success index (NCSI; Flora et al. 2021; Miller et al. 2021) have recently been introduced for the prediction of rare events. The ML models trained on the reduced feature sets produced performance practically similar to those trained on the original feature sets. For the severe hail and tornado models (Fig. 3b,d), reducing the feature set resulted in an IAS (Eqn. 4) drop of approximately 0.2, while for the models of severe wind and subfreezing road temperatures (Fig. 3a,c), the IAS was relatively unchanged. We can interpret IAS as the percentage of the model prediction explained by higher-order effects, so a decrease from 0.5 to 0.3 indicates a 40 decrease in the contribution of higher-order effects to the model prediction. When we remove less important features, we might anticipate a lower IAS, as the ML model is less prone to creating spurious feature interactions by fitting noise in the training data. The drop in IAS for tornado and severe hail models coupled with similar model performance despite the feature set reduction indicates that these models are relying more on first-order effects and that some feature interactions in the original dataset were likely spurious. As for MEC (eqn. 5), a simple measure of the non-linearity of first-order effects, the impact of removing features is inconsistent. We found that in the original dataset, there were many features that had small but non-negligible ALE variance (i.e., nearly flat curves), which biased the MEC lower. This may partly explain why the MEC increased when the tornado feature set was reduced. It is possible that the first-order effects became stronger in the reduced feature set (e.g., increasing the ALE variance) and therefore increased the MEC. The MEC values between 1.5 and 2.5 indicate that the first-order effects are fairly simple in all four models, which helps to summarize the model behavior.
4.2 Assessing the Validity of Feature Ranking Methods
Fig. 4 shows the median feature ranking for the original and reduced feature sets. Despite the fact that the original and reduced feature models have similar performance (Fig. 3) there is only partial agreement among the ranking methods on the top 3-5 features and their respective ranks. The noticeable difference in the rankings between the original and reduced models illustrates that feature ranking methods are designed to explain the model rather than the data, and for our datasets, the Rashomon set contains models with a wide range of model reliance on a given feature (Fisher et al. 2018). For a given dataset, the top 3-5 features from the original or reduced models are plausible sets of important features, but either leads to a different conclusion about the dataset. For example, the original tornado and severe wind datasets, the main features are largely based on spatial low-level features (e.g., 0-2 km vertical vorticity, 10-m divergence, and 80-m wind speed; Fig. 4a, c) while the reduced feature sets are based on amplitude-based mid-level features [e.g., hail diameter, 2-5 km UH, mid-level lapse rate, and composite reflectivity; Fig. 4b,d). When using ML for knowledge discovery, it is crucial to keep this limitation in mind. Trusting any plausible explanation will lead to confirmation bias (McGovern et al. 2019b; Molnar et al. 2021; Ghassemi et al. 2021).
For the original severe wind and severe hail datasets (Fig. 4c,e), there are rather large rank uncertainties. For example, 3-5 km maximum reflectivity (3-5 km Max Refl) has a median ranking of 1 (tied for the top most important feature) for the severe hail model while some methods (e.g., the forward permutation methods) have it ranked as one of the worst features. We showed in Part I that the logistic regression had learned the opposite effect for 3-5 km Max Refl, which is due to its strong correlation with features that are more predictive of large hail (e.g., HAILCAST-predicted hail size, updraft speed, composite reflectivity). If 3-5 km Max Refl was removed from the original model, but the features it is correlated with were not, then the performance of the model would suffer due to the loss of the compensating effect, and 3-5 km Max Refl would be assigned high feature importance. However, if 3-5 km Max Refl was retained but its correlated features were removed, its compensating effect within the model would not be useful, rendering it unimportant. This result highlights that the backward and forward permutation importance methods are measuring different aspects of feature importance. Furthermore, 3-5 Max Refl was removed from the original severe hail feature set (Table 2) and model performance was largely maintained (Fig. 3 ). This result illustrates the distinction between model-specific and model-agnostic feature importance made in Part I. 3-5 km Max Refl is important to the original model due to its key compensating role, but for many other models in the Rashomon set, it does not have high importance relative to some of its correlated features (e.g., updraft speed, composite reflectivity).
To objectively assess whether dimensionality reduction decreased the uncertainty in the top feature rankings, we introduce the weighted average feature ranking uncertainty ():
where is the set of ranks for feature ( for the WoFS dataset and =10 for the road surface dataset), is the number of features, and IQR and Median are the interquartile range and median, respectively. By weighing by the median ranking, we ensure that rank uncertainty in the top features is weighed more than rank uncertainty in lower ranked features. To restrict our analysis, we only computed eqn 6 for the top 10 features based on the median rankings.
The uncertainty in the rank of the top features was smaller for the reduced than the original models for all four datasets (Fig. 4). The rank uncertainty is expected to decrease for smaller feature sets as there are fewer possible ranks. However, other factors are likely contributing to the rank uncertainty reductions. For example, the severe hail dataset had the most significant drop in rank uncertainty (30.14 to 2.82), and this owes largely to improved rank certainty of the top 2 features. The decrease in ranking uncertainty was greater for the tornado and severe hail datasets (Fig. 4b,f) than for the severe wind and road surface datasets (Fig. 4d,h), perhaps due to the interaction strength reductions (Fig. 3b,d ). Reducing the strength of the feature interactions enhances the contributions of first-order effects and thereby improving the estimate of feature-unique importance.
Given the large inter-method differences in the feature rankings highlighted in Part I, especially for the original datasets, we should question whether some feature ranking methods more faithfully assign importance than others (i.e., those features contributing strongly to the model prediction are assigned higher importance scores). Unfortunately, we do not know the true feature rankings so we cannot assess ground-truth faithfulness (Agarwal et al. 2022). One alternative is to measure how feature importance scores correspond with model performance for a given dataset using the approach in Covert et al. (2020):
Compute the feature importance/relevance score for each feature ranking method
Randomly generate feature subsets (5000 in this study; subset size [1, -1] where is the number of features)
For each feature subset, train a new model and evaluate the model performance (herein we use NAUPDC).
For each feature subset, compute the total importance score for each feature ranking method.
One drawback to this approach is that the faithfulness of the feature rankings is not being fully measured with respect to the model-specific or model-agnostic feature importance. It is not model-specific as the model performances are based on the re-trained models, but it is not fully model-agnostic as the total importance scores are with respect to the full datasets (either original or reduced feature sets). Nevertheless, this experiment does provide insight into whether higher importance is associated with higher model performance and lower importance with lower model performance.
Fig. 5 shows the normalized total importance against the model performance for each ranking method for the original tornado dataset (the results are fairly representative of the other datasets), while Fig. 6 summarizes the results for each dataset. The total importance was normalized using maximum-minimum scaling to enable comparison amongst the methods. If a feature subset contains features with high importance, then we expect the model trained on that feature subset to have higher performance than an identically sized feature subset containing less-important features.
Covert et al. (2020) only evaluated linear correlation between importance and model performance, but based on Fig. 5, we can see the relationship is highly non-linear and ranking faithfulness based on linear correlation would provide misleading results. Instead, we relied on multiple statistics to measure correspondence between the total importance and model performance: Kendall rank correlation coefficient (), linear correlation after a log-transform (), , and mean squared error (MSE). To compute the and MSE, we fit a 5-degree polynomial curve to the data and use goodness-of-fit (; also known as the coefficient of determination) as a measure for feature ranking performance (in other words, we assume that the more the relationship between total importance and model performance matches a functional form, the better the method is at assigning individual feature importance). To summarize the feature ranking faithfulness, we limited our analysis to and , as we felt they matched well with our subjective evaluation of the scatter plots for each dataset.
For the original datasets (Fig. 6a,c), there is a high correspondence between total importance and model performance for a majority of the methods and we find little separation between a majority of the feature rankings methods for both statistics. We know from Part I that the different feature ranking methods tend to have high agreement on the top features and the disagreement is largely based on exact ranking. These results suggest that those minor discrepancies in ranking do not amount to significant differences in the relationship between model performance and total importance. As for the strong correspondence between total importance and model performance, we can see the relationship is highly non-linear (Fig. 5) and most points are in the higher performance, higher importance region (e.g., NAUPDC ¿ 0.2 and total importance ¿ 0.5). In Fig. 7, we can see that model performance has an asymptotic relationship with feature subset size. The highly non-linear relationship suggests that the feature contributions to model performance follow a Pareto distribution (Davis and Feldstein 1979) such that a small subset of features are responsible for most of the model performance (Fig. 7). Thus, in a bulk sense, most of the feature ranking methods should be able to largely discriminate between strong and weak contributors to model performance. The Parteo-like distribution of feature importance also helps explain the success of the dimensionality reduction, as nearly 30 out of 113 features explain most of the model performance for the severe weather datasets (Fig. 7a-c) while approximately 10 explain the road surface dataset (Fig. 7d).
In terms of specific methods, for all four original datasets, SHAP was one of the best method while SAGE scores had the least correspondence to model performance, despite both being based on Shapley theory (Fig. 6a,c). SHAP does not assume feature independence and leverages feature clustering based on correlations, which likely explains why it performs better than SAGE. Compared to the other feature ranking methods, forward single-pass (FSP) had a lower correspondence between total importance and model performance. Recall that the forward permutation methods starts with all features permuted, which breaks up the relationships between features. Neglecting those important relationships (e.g., potential compensating effects) and attempting to isolate individual importance appears to degrade the faithfulness of FSP. In terms of model-specific ranking methods, LR coefficients were among the top performers for the severe weather datasets while tree interpreter and Gini importance were poor performers for the road surface dataset. These results reinforce that the interpretability of logistic regression models can be leveraged to explain their behavior, but random forests are uninterpretable and require outside methods to understand and explain them.
Despite the feature correlations, the backward single-pass (BSP) method was one of the most faithful, especially for the road surface dataset. This is not that surprising as the impact of correlated features on permutation importance scores is heavily dependent on the correlations between a feature and the target variable. (Gregorutti et al. 2017). If two features are correlated, but one of them is a much stronger predictor of the target variable, then permutation importance scores may only be negligibly impacted by the feature correlations. To determine whether our datasets had sufficient correlations with the target variable compared to the correlations between features, we compare the average feature correlation (i.e., the expected correlation between any two features) and the average correlation of a feature with the target variable (Table 5). For the road surface dataset, the average correlation with the target variable is similar to the average correlation with target variable while for the severe weather datasets, the features were more correlated with each than the target variable on average. The strong predictors of the target variable in the road surface dataset likely limited the negative impact of feature correlations, which helps explain the higher fidelity of BSP. For the severe weather datasets, the predictors are not nearly as strong as for the road surface dataset, but are likely sufficient to improve the fidelity of BSP for those datasets.
|Avg. Feature Correlations||Avg. Correlation with the Target|
For the reduced datasets (Fig. 6b,d), and varied approximately between 0.65-0.9 and 0.75-0.92 (depending on the dataset), respectively, which are generally lower values and wider ranges than for the original datasets (cf. Fig. 6a,c and Fig. 6b,d). After dimensionality reduction, the feature importance scores were less Pareto-like (i.e., feature contributions to model performance were more even; Fig. 8). Thus, discriminating between low- and high-importance features is more difficult, consistent with the larger differences in performance between the methods. Though there is greater variability between the ranking methods, the LR coefficients, BSP, and SHAP are consistently among the most faithful (partially consistent with Covert et al. 2020
; did not include the LR coefficients in their study), while the ALE variance, FSP, and BMP are among the worst. The poor performance of FSP and ALE variance, even after reducing the feature sets and limiting feature correlations, suggests that isolating features (by permuting all other features or only including first-order effects, respectively) is a poor strategy for estimating feature importance. FMP is a top performer for the severe hail, wind, and road surface datasets, but amongst the worst for the tornado dataset. The improved faithfulness of the BSP, LR coefficients, and SAGE for the reduced feature sets is probably due at least in part to the reduced feature correlations.
Although the differences in and between the ranking methods may seem small, they correspond to substantial differences in the explanation of the model behavior. For example, for the reduced tornado dataset model, while both BSP and BMP (cf. Fig. 9a,b) have low-level vorticity as the most important feature (which is by default), the BSP ranks some of the intrastorm features substantially higher than the BMP does. Given BMP’s lower fidelity implied by the above analysis, we should prefer the BSP rankings of these features, unless the BMP rankings are much more plausible (incidentally, in our judgment, they are not).
It is surprising that feature relevance methods tend to outperform or perform similarly to feature importance methods in assigning individual contributions to model performance, especially for the original datasets. The exceptions are Gini impurity and tree interpreter. For the Gini impurity, it is well established that the feature rankings are biased by the way the random forest is trained (Strobl et al. 2007, 2008). Tree interpreter has a known bias of assigning lower attribution values for features higher up in the tree (Lundberg et al. 2020). We suspect the success of the SHAP owes to the Owen values accounting for co-dependencies amongst features, but more work is required to determine this. For the LR coefficients, we might expect instability as the dimensionality increases and the co-linear features become more prevalent. We hypothesize that the L1 and L2 regularizations mitigate this problem. The high fidelity of the LR coefficients for the original and reduced feature sets supports the use of partially interpretable models.
4.3 Assessing the Validity of Feature Rankings for Feature Selection.
The previous results highlight the overall performance of the ranking methods, but a more granular perspective is warranted. We now compare how well the methods discriminate between the top and bottom features by comparing the predictive power of the highest and lowest ranked features from each method. We train new models with the top and bottom 15 features, respectively, identified by each method for the original datasets and evaluate the models’ performance on the training dataset (to be consistent with how feature importance is computed). Given the strong agreement among the methods for the original road surface temperature dataset (Fig. 4g; Fig. 4 from Part I) and that the dataset only has 30 features, we restrict our focus to the WoFS-based datasets. For severe hail and wind, the method that produced the highest correlation between total importance and model performance (Fig. 6a,c) also produced the greatest difference in performance between the highest and lowest ranked features (Fig. 10b,c), but the best method was hazard dependent (LIME for hail and FMP for wind). For the tornado dataset, the most faithful feature ranking methods also discriminated well between the best and worst features, but so did FMP, which had relatively low fidelity (Fig. 6c). Another surprising result is that SHAP discriminated poorly between the best and worst features (Fig. 10c) despite being a top overall performer for the severe wind dataset (Fig. 6g).
To better appreciate the results in Fig. 10, we retrain a model on the top feature, top 2 features, and so on until we reach the top 15 and compute the model performance (we repeat this for the worst feature, worst 2 features, and so on until we reach the worst 15). From a feature selection perspective, FMP performs well in determining the features that contribute the most to the model skill for the tornado dataset (Fig. 11a), but performs poorly at determining the features that contributed the least to the model performance (Fig. 11b). As for BSP and BMP, one could argue that their top features are important largely due to their association with the remaining features (more so for BMP; Lakshmanan et al. 2015) and thus would not perform well as feature selection methods. Our analysis supports this notion: BSP and BMP have a less faithful set of top five predictors compared to the other methods, although for the top 10+ features they are more similar (Fig. 11a). When examining the relationship between the top BSP/BMP features (not shown) and tornado likelihood, we found that the top BSP/BMP features have poorer discrimination between tornado and non-tornado (not shown) compared to top features from the other methods. We suspect that other, more important features are lower ranked due to their correlations with other features, a well-known issue with backward permutation importance (Gregorutti et al. 2015; Strobl et al. 2007; Molnar et al. 2020; Molnar 2020).
For severe hail (Fig. 11c,d), LIME had the greatest performance difference between the top and worst 15 features, which is due to LIME better identifying the features that contribute the least to model performance (Fig. 11d). In fact, the feature relevance methods (LIME, ALE variance, SHAP, and LR coefficients) all performed better than the feature importance methods in identifying the least important features. However, feature relevance methods such as SHAP and ALE variance underperformed feature importance methods in identifying the least important features in the severe wind dataset (Fig. 11f).
The two main takeaways from this section are that (1) the general fidelity of a feature ranking vary for more granular explanations (e.g., FMP had poor bulk fidelity for the original tornado dataset, but discriminated well between the top and worst 15 features) and (2) feature importance/relevance methods can perform well as feature selection methods, but the behavior is inconsistent. This is not a surprising result, as many of these methods are designed to measure model-specific feature importance, which may not translate to model-agnostic feature importance.
4.4 Can we reduce feature ranking uncertainty by leveraging the relative fidelity of the feature ranking methods?
Figs. 6 and 10 show that there are disparities between feature rank and contribution to model performance for the different ranking methods. Thus, we determine whether the uncertainty of the feature rank is meaningfully reduced by excluding less faithful ranking methods. To assess the improvement in rank uncertainty, we introduce the ranking uncertainty ratio:
The numerator is the feature rank uncertainty (eqn. 6) re-computed per dataset using the top 3 ranking methods based on Fig. 6, while the denominator is the expected rank uncertainty from all combinations of 3 methods from among all our ranking methods. Ideally, the ranking uncertainty ratio should be .
For the original severe wind and road surface datasets, the ratio is close to 1 while for the original severe hail and tornado datasets the ratio is 0.20 and 0.68, respectively (Fig. 12). It is not surprising that the rank uncertainty ratio for the road surface dataset is approximately 1 (1.02) as the general rank uncertainty in the original dataset was fairly low (Fig. 4) and the methods were largely in agreement on the top features (Part I). The low rank uncertainty ratios for the original tornado and severe hail datasets suggest that the uncertainty in the feature rankings of the original tornado and severe hail datasets was likely inflated by including less faithful ranking methods.
The ranking uncertainty ratios are 1 for the reduced datasets. It is not surprising that the ratios for the reduced tornado and severe hail datasets are higher than for the original datasets since the rank uncertainty across all methods had already been substantially decreased by dimensionality and interaction strength reduction (Fig. 3 and Fig. 4). These higher ratios can also be partially explained by the fact that for the reduced feature datasets, the set of possible rankings is substantially reduced, and thus the expected uncertainty (the denominator in eq. 7) is smaller. Nevertheless, these results suggest that knowing the relative faithfulness of feature ranking methods can reduce feature rank uncertainty.
ML models are becoming increasingly common in the atmospheric science community with a wide range of applications. ML models are powerful statistical tools, but their complexity often prevents a comprehensive understanding of their internal functionality. Improving model understanding is necessary, as decision makers will need to interrogate these models to begin to establish trust in them, especially in high-risk situations where decisions can be costly. In Part I, we highlighted several post hoc explainability methods (e.g., SHAP, LIME, permutation importance, partial dependence/ALE) that have been developed to improve our understanding of complex ML models.
Little has been done, however, to verify the faithfulness of the explainability methods. In this Part II of a two-part study, we evaluated and compared multiple feature ranking methods using a convection-allowing ensemble (WoFS) severe weather dataset and a nowcasting road surface dataset derived from the HRRR model. Given the sensitivity of many of these methods to correlated features, we also evaluated how the feature ranking faithfulness is impacted by dimensionality reduction. This is one of the first studies to quantify how model explainability can improve through dimensionality reduction and knowing the relative faithfulness of feature ranking methods for a given dataset.
The conclusions are as follows:
For all datasets, dimensionality reduction objectively reduced model complexity (i.e., reduced feature interaction strength and first-order effects; Molnar et al. 2019) without greatly impacting model performance. This is a promising result as increasing explainability is becoming a key issue for model development and ethical use of AI.
Feature rankings varied considerably among the explainability methods, but the various rankings were all plausible. Thus, confirmation bias could mislead domain experts into trusting the feature rankings assigned by individual methods.
The average feature ranking uncertainty (i.e., the variation in rank averaged for all features) was substantially decreased in the reduced versus the original datasets. For the severe hail dataset, the uncertainty was reduced by a factor of 10 while the other datasets had reductions of 2-3. This is one of the first attempts to quantify how dimensionality reduction improves explainability.
The average feature ranking uncertainty was also reduced by excluding lower fidelity ranking methods. This result suggests that knowing the relative faithfulness of the explainability methods for a given dataset can lead to improved explanations.
The traditional permutation importance (backward singlepass) was a reliable method for all datasets, even without dimensionality reduction despite prevalent feature correlations. As discussed in Part I, the impact of correlations between features can be mitigated if the features are strong predictors of the target variable.
Top performers for both the original and reduced severe weather datasets include SHAP and LR coefficients. The faithfulness of LR coefficients motivates the use of partially interpretable models since they can be partially understood without the use of post-hoc explainability methods. The same cannot be said for model-specific explainability (e.g., tree-based methods) as the Gini impurity and tree interpreter methods were among the least faithful methods for both the original and reduced road surface datasets.
For the datasets and models used in this study, the feature relevance methods (SHAP, LR coefficients) tended to outperform or perform similar to the feature importance methods. The forward singlepass and backward multipass were consistently amongst the worst ranking methods (often failing to improve upon the backward singlepass method) while the other flavors of permutation importance were found to have inconsistent behavior, possibly depending on whether the dataset/model was largely based on first-order effects.
The differences in fidelity between the feature ranking methods became smaller as the dimensionality increased, which is inconsistent with Covert et al. (2020). Since our datasets follow a Pareto distribution (i.e., a small subset of features are responsible for most of the model performance), we expect the differences in fidelity between feature rankings methods to be smaller as the dimensionality increases. This highlights that when dealing with large feature sets, a more granular approach to exploring model explainability fidelity is required.
It is possible to leverage feature ranking methods as feature selection methods. This is a promising result as other methods of robust feature selection are often computationally prohibitive.
This study is one of the first to explore the faithfulness of feature importance methods using geoscience-based datasets, and there are several promising avenues for future work. First, we only applied these techniques to two datasets and two different classes of ML models (i.e., logistic regression and random forests). It is unclear how well the results extend to other datasets and models (e.g., deep learning models). Second, it will be necessary to collaborate with social scientists and end users (e.g., forecasters) to determine how to balance the tradeoff between explainability and skill. Complex models should only be favored if the performance gain is practically significant —a judgment that the practitioner must make. Third, though we demonstrated a method introduced byCovert et al. (2020) for exploring the fidelity of explainability methods when the ground truth is unknown, it would be useful to assess the skill of explainability methods on benchmark datasets where the ground truth is known (Mamalakis et al. 2021; Agarwal et al. 2022).
Acknowledgements.Funding was provided by NOAA/Office of Oceanic and Atmospheric Research under NOAA-University of Oklahoma Cooperative Agreement NA21OAR4320204, U.S. Department of Commerce. The authors thank Eric Loken for informally reviewing an early version of the manuscript. We also acknowledge the team members responsible for generating the experimental WoFS output, which include Kent Knopfmeier, Brian Matilla, Thomas Jones, Patrick Skinner, Brett Roberts, and Nusrat Yussouf. This material is also based upon work supported by the National Science Foundation under Grant No. ICER-2019758. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. The experimental WoFS ensemble forecast data and road surface dataset used in this study are not currently available in a publicly accessible repository. However, the data and code used to generate the results here are available upon request from the authors. The explainability methods were computed and visualized using the scikit-explain python package (Flora and Handler 2022) developed by Dr. Montgomery Flora and Shawn Handler. The python scripts used to generate the figures in Part I and II are available at https://github.com/monte-flora/compare-explain-methods. Hyperparameter Optimization Hyperparameter selection was performed using the Bayesian hyperparameter optimization available in the hyperopt Python package (Bergstra et al. 2013). Details for hyperparameter optimization are provided in Flora et al. (2021). The hyperparameters for the models trained on the reduced feature sets are presented in Table 6.
|Hyperparameter||Tornadoes||Severe hail||Severe Wind|
|Num. of Trees||500|
|Min Sample Leaf||5|
|Min. Sample Split||8|
- OpenXAI: towards a transparent evaluation of model explanations. arXiv. External Links: Cited by: §4.2, §5.
- . arXiv. External Links: Cited by: §3.3.
- Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures.. In Proc. of the 30th International Conference on Machine Learning, Vol. 28, pp. 115–123. Cited by: §5.
- Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author). Statistical Science 16 (3), pp. 199–231. External Links: Cited by: §3.1.
- Interpretable vs. noninterpretable machine learning models for data-driven hydro-climatological process modeling. Expert Systems with Applications 170, pp. 114498. External Links: Cited by: §3.1.
- A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. Journal of Clinical Epidemiology 110, pp. 12–22. External Links: Cited by: §3.1.
- Understanding global feature contributions with additive importance measures. arXiv. External Links: Cited by: §1, §1, §4.2, §4.2, §4.2, 8th item, §5.
- The generalized pareto law as a model for progressively censored survival data. Biometrika 66 (2), pp. 299–306. External Links: Cited by: §4.2.
- All Models are Wrong, but Many are Useful: Learning a Variable’s Importance by Studying an Entire Class of Prediction Models Simultaneously. arXiv. External Links: Cited by: §3.1, §4.2.
- Scikit-Explain. Github. External Links: Cited by: §1, §5.
- Using machine learning to generate storm-scale probabilistic guidance of severe weather hazards in the warn-on-forecast system. Monthly Weather Review 149 (5), pp. 1535 – 1557. External Links: Cited by: Table 6, §1, §2, §2, Figure 1, §3.1, Figure 9, §4.1, §5.
- A Comparative Study of Explanation Methods for Traditional Machine Learning Models Part 1: Quantifying Disagreement. Artificial Intelligence for Earth Sciences, pp. in review. Cited by: §1.
- The false hope of current approaches to explainable artificial intelligence in health care. The Lancet Digital Health 3 (11), pp. e745–e750. External Links: Cited by: §4.2.
- Grouped variable importance with random forests and application to multiple functional data analysis. Computational Statistics & Data Analysis 90, pp. 15–35. External Links: Cited by: §1, §4.3.
- Correlation and variable importance in random forests. Statistics and Computing 27 (3), pp. 659–678. External Links: Cited by: §1, §4.2.
- Classifier technology and the illusion of progress. Statistical Science 21 (1). External Links: Cited by: §3.1.
- Development of a probabilistic subfreezing road temperature nowcast and forecast using machine learning. Weather and Forecasting 35 (5), pp. 1845 – 1863. External Links: Cited by: Table 6, §1, §1, §2, §2, Figure 2.
- The attributes diagram a geometrical framework for assessing the quality of probability forecasts. International Journal of Forecasting 2 (3), pp. 285–293. External Links: Cited by: §4.1.
- Classifying Convective Storms Using Machine Learning. Wea. Forecasting 35 (2), pp. 537–559. External Links: Cited by: §1.
- The Disagreement Problem in Explainable Machine Learning: A Practitioner’s Perspective. arXiv. External Links: Cited by: §1, §1.
- Comparison of logistic regression with machine learning methods for the prediction of fetal growth abnormalities: a retrospective cohort study. BMC Pregnancy and Childbirth 18 (1), pp. 333. External Links: Cited by: §3.1.
- Using deep learning to emulate and accelerate a radiative transfer model. Journal of Atmospheric and Oceanic Technology 38 (10), pp. 1673 – 1696. External Links: Cited by: §1.
- Which Polarimetric Variables Are Important for Weather/No-Weather Discrimination?. Journal of Atmospheric and Oceanic Technology 32 (6), pp. 1209–1223. External Links: Cited by: §4.3.
- From local explanations to global understanding with explainable ai for trees.. Nat Mach Intell.. External Links: Cited by: §4.2.
- Neural network attribution methods for problems in geoscience: a novel synthetic benchmark dataset. arXiv. External Links: Cited by: §5.
- Quasi-Operational Testing of Real-Time Storm-Longevity Prediction via Machine Learning. Wea. Forecasting 34 (5), pp. 1437–1451. External Links: Cited by: §3.2.
- Making the black box more transparent: Understanding the physical implications of machine learning. Bull. Amer. Meteor. Soc.. External Links: Cited by: §1, §4.2.
- Basic principles of ROC analysis. Seminars in Nuclear Medicine 8 (4), pp. 283–298. External Links: Cited by: §4.1.
- Exploring the usefulness of downscaling free forecasts from the warn-on-forecast system. Weather and Forecasting. External Links: Cited by: §4.1.
- Quantifying Model Complexity via Functional Decomposition for Better Post-Hoc Interpretability. arXiv. External Links: Cited by: §3.1, §3.2, §3.3, §3.3, §3.3, Figure 3, 1st item.
- Interpretable Machine Learning – A Brief History, State-of-the-Art and Challenges. arXiv. External Links: Cited by: §4.3.
- General pitfalls of model-agnostic interpretation methods for machine learning models. arXiv. External Links: Cited by: §3.1, §4.2.
- Interpretable machine learning: a guide for making black box models explainable.. Bookdown, New York, NY, USA. Cited by: §3.3, §4.3.
- The behaviour of random forest permutation-based variable importance measures under predictor correlation. BMC Bioinformatics 11 (1), pp. 110. External Links: Cited by: §1.
- Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, pp. 2825–2830. Cited by: §2.
- Interpretable Machine Learning: Fundamental Principles and 10 Grand Challenges. arXiv. External Links: Cited by: §3.1.
- Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. arXiv. External Links: Cited by: §3.1.
- A study in rashomon curves and volumes: a new perspective on generalization and model simplicity in machine learning. arXiv. External Links: Cited by: §3.1.
- Conditional variable importance for random forests. BMC Bioinformatics 9 (1), pp. 307. External Links: Cited by: §1, §4.2.
- Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics 8 (1), pp. 25. External Links: Cited by: §1, §4.2, §4.3.