
Sparse Interaction Additive Networks via Feature Interaction Detection and Sparse Selection

There is currently a large gap in performance between the statistically rigorous methods like linear regression or additive splines and the powerful deep methods using neural networks. Previous works attempting to close this gap have failed to fully investigate the exponentially growing number of feature combinations which deep networks consider automatically during training. In this work, we develop a tractable selection algorithm to efficiently identify the necessary feature combinations by leveraging techniques in feature interaction detection. Our proposed Sparse Interaction Additive Networks (SIAN) construct a bridge from these simple and interpretable models to fully connected neural networks. SIAN achieves competitive performance against state-of-the-art methods across multiple large-scale tabular datasets and consistently finds an optimal tradeoff between the modeling capacity of neural networks and the generalizability of simpler methods.



1 Introduction

Over the past decade, deep learning has achieved significant success in providing solutions to challenging AI problems like computer vision, language processing, and game playing he2015resnet; Vaswani2017transformer; deepmind2017alphaZero. As deep-learning-based AI models increasingly serve as important solutions to applications in our daily lives, we are confronted with some of their major disadvantages. First, current limitations in our theoretical understanding of deep learning make fitting robust neural networks a balancing act between accurate training fit and low generalization gap. How test performance varies with the choice of architecture, dataset, and training algorithm remains poorly understood and difficult to answer. Second, infamously known as blackbox models, deep neural networks greatly lack interpretability. This continues to lead to a variety of downstream consequences in applications: unexplainable decisions for stakeholders, the inability to distinguish causation from correlation, and a misunderstood sensitivity to adversarial examples.

In contrast, simpler machine learning models such as linear regression, splines, and the generalized additive model (GAM) hastie1990originalGAM naturally win in interpretability and robustness. Their relatively few parameters often have clear and meaningful interpretations, and these methods rarely succumb to overfitting the training data. The main shortcoming of these simpler methods is their inability to accurately fit more complex data distributions.

There is a growing strand of literature attempting to merge the interpretability of additive models with the streamlined differentiable power of deep neural networks (DNNs). Key works in this direction such as NAM and NODE-GAM have been able to successfully model one-dimensional main effects and two-dimensional interaction effects using differentiable training procedures agarwal2020nam; chang2022nodegam. Although many other works have found similar success in modeling one- and two-dimensional interactions, few have made practical attempts towards feature interactions of size three or greater, which we will refer to throughout as higher-order feature interactions. As a result, no existing work has been able to exploit the hallmark ability of neural networks to model higher-order interactions: amassing the influence of hundreds of pixels in computer vision and combining specific words from multiple paragraphs in language processing. In this work, we bring interpretable models one step closer to the impressive differentiable power of neural networks by developing a simple but effective selection algorithm and an efficient implementation, enabling neural additive models which fit higher-order interactions of degree three and greater. The proposed Sparse Interaction Additive Networks (SIANs) consistently achieve results that are competitive with state-of-the-art methods across multiple datasets. We summarize our contributions as follows:

  • We develop a feature interaction selection algorithm which efficiently selects from the exponential number of higher-order feature interactions by leveraging heredity and interaction detection. This allows us to construct higher-order neural additive models for medium-scale datasets unlike previous works in neural additive modeling which only consider univariate and bivariate functions.

  • We provide further insights into the tradeoffs faced by neural networks between better generalization and better functional capacity. By tuning the hyperparameters of our SIAN model, we can gradually interpolate from one-dimensional additive models to full-complexity neural networks. We observe fine-grained details about how the generalization gap increases as we add capacity to our neural network model.

  • We design a block sparse implementation of neural additive models which greatly improves the training speed and memory efficiency of neural-based additive models. These improvements over previous NAM implementations allow shape functions to be computed in parallel, enabling us to train larger and more complex additive models. We provide code for our implementation and hope the tools provided can be useful throughout the additive modeling community for faster training of neural additive models.[1]

[1] Available at

2 Related Work

The generalized additive model (GAM) hastie1990originalGAM has existed for decades as a more expressive alternative to linear regression, replacing each linear coefficient with a nonparametric function. Two-dimensional extensions appeared in the literature shortly after its introduction wahba1994ssanova. Over the years, an abundance of works have used the GAM model as an interpretable method for making predictions, with the choice of functional model typically reflecting the most popular method of the time period: regression splines, random forests, boosting machines, kernel methods, and most recently neural networks hooker2007functionalANOVA; caruana2015intelligible; kandasamy16salsa; yang2020gamiNet. Two of the most prominent neural-network-based approaches are NAM and NODE-GAM agarwal2020nam; chang2022nodegam. The former stacks multilayer perceptrons to build a one-dimensional GAM; the latter connects differentiable decision trees to build a two-dimensional GAM. Both have demonstrated competitive performance and interpretable trends learned over multiple tabular datasets. Other neural network extensions wang2021partiallyInterpretableEstimatorPIE; oneill2021regressionNetworks increase the modeling capacity to higher-order interactions by first training a two-dimensional model and then training an additional blackbox neural network to fit the residual error. While this approach does have higher modeling capacity, it foregoes interpretable insights into the higher-order feature interactions and suffers the same inclination to overfit held naturally by deep neural networks.

Other works in additive modeling instead focus on extending the univariate GAM to sparse additive models in the high-dimensional inference regime lin2006cosso; ravikumar2009spam; meier2009hdam; xu2022snam. Further extensions of these methods to sparse bivariate models for high-dimensional inference also exist tyagi2016spam2; liu2020ssam. These works extend classical high-dimensional inference techniques like LASSO and LARS from linear parameters to additive functions by shrinking the effect of minimally important variables and emphasizing sparse solutions. We note that, unlike these works, we do not use sparsity in the soft-constrained sense to shrink features from a fixed selection, but instead adaptively use feature interaction measurements to hierarchically build a feasible set of interactions. The only existing work in additive modeling which uses hierarchy to induce sparsity in the same sense is GAMI-Net, which uses a three-stage procedure to select candidate pairs under a hierarchical constraint yang2020gamiNet. Extending their procedure to three or more dimensions would require four or more stages of training and is left unexplored in their work.

Although the theoretical extension to three-dimensional additive models is clear, there is currently a lack of discussion surrounding the practical challenges faced when trying to model three-dimensional shape functions. One of the few works to pursue practical implementation of higher-order GAMs on real-world datasets is the work of SALSA kandasamy16salsa. This work uses a specialized kernel function to fit additive models of order three and higher. Their work also corroborates our finding that optimal performance is achieved by different orders for different datasets. However, the kernel-based approaches used in this work make it unsuitable beyond small-scale datasets with few samples. This makes our work one of the first to train additive models of higher-order interactions which leverage the automatic differentiation and GPU optimization toolkits which have become commonplace in modern workstations.

3 Methods


We denote a $d$-dimensional input as $x \in \mathbb{R}^d$ with its $i$-th component as $x_i$; its corresponding output is denoted $y$. We consider one-dimensional $y$ as in regression and binary classification. We will use $f$ to denote the function or model used to approximate $y$, implicitly considering the additive noise model $y = f(x) + \epsilon$ for some noise term $\epsilon$. We denote a subset of the set of features $[d] = \{1, \dots, d\}$ by $\mathcal{I} \subseteq [d]$. Its cardinality is denoted $|\mathcal{I}|$, its complement $\mathcal{I}^c$, and its power set $\mathcal{P}(\mathcal{I})$. For $x \in \mathbb{R}^d$, we define $x_{\mathcal{I}}$ such that:

$$x_{\mathcal{I}} = (x_i)_{i \in \mathcal{I}}$$

Figure 1: SIAN Pipeline Diagram. (a) We train a base DNN to learn the feature interactions from the dataset given the input features $x$. (b) We feed our DNN into the FIS algorithm to select from the $2^d$ possible subsets of $[d]$. (c) We use our selected feature subsets as the GAM array (of length 6 in the image) for our full SIAN network. The empty-set function is a constant we absorb into the last additive layer of the network. Finally, we train our SIAN neural network using the specified architecture.

3.1 Generalized Additive Models

We first consider the generalized additive model (GAM) hastie1990originalGAM, which extends linear regression by allowing each input feature to have a nonlinear relationship with the output:

$$g(\mathbb{E}[y]) = \beta_0 + \sum_{i=1}^{d} f_i(x_i)$$

Each of the $f_i$ ‘reshapes’ its respective feature and then adds the reshaped feature to the total prediction. These are hence called shape functions and were traditionally fit using regression splines. The function $g$ is the link function, which is the identity function for regression and the inverse-sigmoid for classification. $\beta_0$ is a normalizing constant. We will further refer to this original formulation, where each shape function considers only one feature, as GAM-1.
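To make the formulation concrete, the following is a minimal numpy sketch of fitting a GAM-1 by classical backfitting, using piecewise-constant (binned) shape functions rather than the splines or subnetworks discussed above; all names and the binning choice are illustrative, not the paper's implementation.

```python
import numpy as np

def fit_gam1(X, y, n_bins=16, n_iters=20):
    """Fit a GAM-1 by backfitting piecewise-constant shape functions.

    Each shape function f_i is a lookup table over quantile bins of
    feature i; beta0 is the global mean. Identifiability constants are
    left absorbed in the shape functions for simplicity.
    """
    n, d = X.shape
    # Interior quantile edges give roughly equal-count bins per feature.
    edges = [np.quantile(X[:, i], np.linspace(0, 1, n_bins + 1)[1:-1])
             for i in range(d)]
    bins = np.stack([np.digitize(X[:, i], edges[i]) for i in range(d)], axis=1)
    beta0 = y.mean()
    shapes = np.zeros((d, n_bins))
    for _ in range(n_iters):
        for i in range(d):
            contrib = shapes[np.arange(d)[None, :], bins]  # (n, d) per-feature terms
            # Partial residual: remove every contribution except feature i's.
            partial = y - beta0 - (contrib.sum(axis=1) - contrib[:, i])
            for b in range(n_bins):
                mask = bins[:, i] == b
                if mask.any():
                    shapes[i, b] = partial[mask].mean()

    def predict(X_new):
        b_new = np.stack([np.digitize(X_new[:, i], edges[i]) for i in range(d)],
                         axis=1)
        return beta0 + shapes[np.arange(d)[None, :], b_new].sum(axis=1)

    return predict
```

On a dataset whose target really is a sum of univariate effects, this simple procedure recovers the shape functions well, which is exactly the regime where GAM-1 shines.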

Feature Interactions

In order to extend this definition, we must consider the interplay which occurs between multiple input features. A ‘non-additive feature interaction’ between the features $\mathcal{I}$ for the function $f$ is said to exist when the function cannot be decomposed into a sum of arbitrary subfunctions such that each subfunction excludes one of the interacting variables $i \in \mathcal{I}$, i.e. $f(x) \neq \sum_{i \in \mathcal{I}} h_i(x_{[d] \setminus \{i\}})$ friedman2008predictive; sorokina2008detecting; tsang2018neural. In other words, the entire feature set $x_{\mathcal{I}}$ must be simultaneously known to be able to correctly predict the output $y$.

The goal of feature interaction detection is to uncover these groups of features which depend on one another. For smooth functions, this can be quantitatively done by finding the sets $\mathcal{I}$ such that the interaction strength, $\omega(\mathcal{I})$, is positive and large:

$$\omega(\mathcal{I}) = \min_{h_1, \dots, h_{|\mathcal{I}|}} \mathbb{E}\Big[\Big(f(x) - \sum_{i \in \mathcal{I}} h_i(x_{[d] \setminus \{i\}})\Big)^2\Big]$$
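A minimal sketch of this idea for a pair of features is the classic mixed-difference probe, which checks whether moving two features together differs from the sum of moving each alone; this is only a crude stand-in for the Archipelago-style detectors used later, and every name here is illustrative.

```python
import numpy as np

def pair_interaction_strength(f, x_base, x_alt, i, j):
    """Mixed-difference probe for a non-additive interaction between i and j.

    If f is additive in features i and j, changing both together equals
    the sum of changing each alone, so the score is ~0; a large score
    indicates that {i, j} must be modeled jointly.
    """
    x00 = np.array(x_base, dtype=float)
    x10, x01, x11 = x00.copy(), x00.copy(), x00.copy()
    x10[i] = x_alt[i]
    x01[j] = x_alt[j]
    x11[i], x11[j] = x_alt[i], x_alt[j]
    return abs(f(x11) - f(x10) - f(x01) + f(x00))
```

For an additive function like $x_0 + 2x_1$ the score vanishes, while a multiplicative term like $x_0 x_1$ yields a strictly positive score.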
We may now adjust the GAM definition to capture feature interactions by considering a set of specified interactions, $\mathcal{B} = \{\mathcal{I}_1, \dots, \mathcal{I}_m\}$, where each $\mathcal{I}_k$ is an interaction of size $|\mathcal{I}_k| \geq 2$:

$$g(\mathbb{E}[y]) = \beta_0 + \sum_{i=1}^{d} f_i(x_i) + \sum_{\mathcal{I} \in \mathcal{B}} f_{\mathcal{I}}(x_{\mathcal{I}})$$

The third term extends GAMs to full capacity models which can represent nonlinear dependencies of arbitrary feature sets. For instance, if our set of interactions includes the complete feature set, $[d] \in \mathcal{B}$, then our model has exactly the same capacity as the underlying functional model we choose for the shape functions (splines, random forests, deep neural networks, etc.). An abundance of previous works have used this extended version of the GAM model, often called the Smoothing Spline ANOVA model or the functional ANOVA model caruana2015intelligible; ravikumar2009spam; meier2009hdam; hooker2007functionalANOVA; wahba1994ssanova.
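The structure of this extended model can be sketched in a few lines: the prediction is an intercept plus one shape function per selected subset, with singleton subsets recovering the classic GAM-1 terms. The names below are illustrative, not an API from the paper.

```python
import numpy as np

def gam_predict(x, beta0, shape_funcs):
    """Evaluate the interaction-extended GAM on a single input x.

    shape_funcs maps each selected subset I (a tuple of feature indices)
    to a callable shape function acting on the sub-vector x_I.
    """
    return beta0 + sum(f_I(x[list(I)]) for I, f_I in shape_funcs.items())
```

For example, with one univariate term and one bivariate term, the prediction is just their sum plus the intercept.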

Throughout this work, we will use GAM-$K$ to refer to a GAM whose highest-order interaction in $\mathcal{B}$ is of cardinality $K$ (i.e. $\max_{\mathcal{I} \in \mathcal{B}} |\mathcal{I}| = K$). For instance, we will call the NODE-GAM model chang2022nodegam a GAM-2 model since the interaction sets are all possible feature pairs: $\mathcal{B} = \{\{i, j\} : i, j \in [d], i \neq j\}$. We will refer to our SIAN networks as SIAN-$K$ using the same convention.

3.2 Feature Interaction Selection

A key concern of using neural networks to fit the shape functions is keeping the number of networks low enough that training time remains reasonable. While this is typically not a problem for the GAM-1, it can quickly become an issue for the GAM-2. For instance, if we have an input variable with 30 features, then including all pairwise functions would need to cover the $\binom{30}{2} = 435$ possible pairs. Although learning hundreds of linear coefficients is reasonable, training hundreds of neural networks becomes less so. Moreover, this quantity only grows as we increase to higher-order interactions (3, 4, 5, etc.). In an effort to combat this growing complexity, we introduce a Feature Interaction Selection (FIS) algorithm which depends on two key ingredients: an interaction detection procedure and a heredity condition.
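The combinatorial growth described above is easy to quantify with the standard binomial coefficient:

```python
from math import comb

d = 30  # number of input features, as in the example above
# Candidate interactions of each order grow combinatorially with d:
counts = {k: comb(d, k) for k in range(1, 6)}
# 435 pairs, 4060 triples, 27405 quadruples, 142506 quintuples --
# exhaustively training one subnetwork per subset quickly becomes infeasible.
```

This is precisely why a selection step that prunes the candidate set before any subnetwork is trained pays off.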

Feature Interaction Detection

In recent years, there has been a growing body of work focused on detecting and measuring the strength of feature interactions from large-scale data. Three of the most popular and generally applicable of these methods are Shapley Additive Explanations (SHAP) lundberg2017shapleySHAP; lundberg2020local2global; dhamdhere2019shapley, Integrated Hessians sundararajan2017integratedGradients; janizek2020explaining, and Archipelago tsang2020archipelago. For our experiments, we primarily use an adaptation of Archipelago for higher-order interactions because of its compatibility with blackbox models. In contrast, Integrated Hessians is only applicable to sufficiently smooth networks (ReLU networks are not compatible) and SHAP only has fast implementations available for tree-based approaches. Moreover, although both SHAP and Integrated Hessians have clear ideological extensions to higher-order interactions, there are no currently available implementations. A detailed discussion surrounding Archipelago and other detection techniques can be found in the Appendix.



Heredity

The practice of only modeling the pairwise interaction effect $f_{ij}$ for some features $i, j$ when both of the main effects $f_i$ and $f_j$ are already being modeled has a long history throughout statistics peixoto1987hierarchicalPolynomials; Chipman1995BayesianVS; bien2013hierarchicalLasso. There are two main versions of this hierarchical principle explored in the literature: strong heredity and weak heredity. If we are given that $f_{ij} \neq 0$, strong heredity implies that both $f_i \neq 0$ and $f_j \neq 0$, whereas weak heredity implies that $f_i \neq 0$ or $f_j \neq 0$. For our definition of a feature interaction, strong heredity holds; however, our algorithm will instead focus on a computational version of heredity which asks that $\tau$ percent of the subsets of a pair (triple, tuple, etc.) are above the threshold $\theta$ in order for it to be considered as a possible interaction.
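The computational heredity condition can be sketched as a small predicate; the dictionary-of-scores interface below is an illustrative assumption, not the paper's data structure. Setting the fraction to 1.0 behaves like strong heredity (all subsets must pass), while any positive fraction that accepts a single passing subset mimics weak heredity.

```python
from itertools import combinations

def passes_heredity(I, strengths, tau, theta):
    """Computational heredity check for a candidate interaction I.

    At least a tau fraction of the size-(|I|-1) subsets of I must have
    a recorded interaction strength above theta. strengths maps tuples
    of feature indices to previously measured strengths.
    """
    subsets = list(combinations(sorted(I), len(I) - 1))
    n_pass = sum(strengths.get(S, 0.0) > theta for S in subsets)
    return n_pass >= tau * len(subsets)
```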

Inputs: Trained prediction model $f$ and validation dataset $\mathcal{D}$

Parameters: Cutoff index $K$, cutoff threshold $\tau$, strength threshold $\theta$

Output: $\mathcal{B}$, a family of feature interactions with index at most $K$ and strength above $\theta$

1:  Set $\mathcal{B}_1 \leftarrow \{\{i\} : i \in [d],\ \omega(\{i\}) > \theta\}$
2:  Set $k \leftarrow 1$
3:  while $k < K$ do
4:     Set $\mathcal{C} \leftarrow \emptyset$
5:     for each $\mathcal{I} \subseteq [d]$ with $|\mathcal{I}| = k + 1$ do
6:        if at least $\tau$ percent of the size-$k$ subsets of $\mathcal{I}$ are in $\mathcal{B}_k$ then
7:           $\mathcal{C} \leftarrow \mathcal{C} \cup \{\mathcal{I}\}$
8:        end if
9:     end for
10:    Set $\mathcal{B}_{k+1} \leftarrow \emptyset$
11:    for $\mathcal{I}$ in $\mathcal{C}$ do
12:       if $\omega(\mathcal{I}) > \theta$ then $\mathcal{B}_{k+1} \leftarrow \mathcal{B}_{k+1} \cup \{\mathcal{I}\}$
13:       end if
14:    end for
15:    Set $k \leftarrow k + 1$
16: end while
17: return $\mathcal{B} \leftarrow \mathcal{B}_1 \cup \dots \cup \mathcal{B}_K$
Algorithm 1 Feature Interaction Selection (FIS)


In Algorithm 1, we detail how we use Archipelago to build our FIS algorithm. A visual overview of the SIAN pipeline is also depicted in Figure 1 above. We start by training a reference DNN, which is required for the inductive insights generated by Archipelago. We then pass the trained function approximator to the FIS algorithm along with the hyperparameters $K$ (interpretability index), $\tau$ (heredity strength threshold), and $\theta$ (interaction strength threshold). This procedure efficiently searches and ranks the possible feature subsets to produce a final set of interactions which we use to specify the SIAN model architecture. Finally, we train our SIAN neural network using typical gradient descent methods. In Appendix A, we provide a further theoretical discussion in which we prove exact recovery of the true feature interactions and show how this sparse selection leads to provably lower generalization gaps in a toy additive noise model.
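The level-by-level selection loop can be sketched as follows. Here `strength(I)` is an assumed black-box detector standing in for an Archipelago-style score computed on the trained reference DNN; all names and the candidate-growth strategy are illustrative simplifications of Algorithm 1.

```python
from itertools import combinations

def feature_interaction_selection(strength, d, K, tau, theta):
    """Sketch of the FIS procedure under stated assumptions.

    strength(I) -> float measures the interaction strength of a tuple
    of feature indices; d is the number of features, K the cutoff
    index, tau the heredity fraction, theta the strength threshold.
    """
    # Level 1: keep singletons whose measured strength clears theta.
    frontier = [(i,) for i in range(d) if strength((i,)) > theta]
    selected = list(frontier)
    for k in range(1, K):
        passing = set(frontier)
        # Grow candidates one feature at a time from the passing level,
        # rather than enumerating all (k+1)-subsets of [d].
        candidates = {tuple(sorted(I + (j,)))
                      for I in frontier for j in range(d) if j not in I}
        nxt = []
        for C in candidates:
            subsets = list(combinations(C, k))
            heredity_ok = sum(S in passing for S in subsets) >= tau * len(subsets)
            if heredity_ok and strength(C) > theta:
                nxt.append(C)
        selected += sorted(nxt)
        frontier = nxt
    return selected
```

On a synthetic detector whose only true interaction support is {0, 1, 2}, the loop recovers exactly the seven subsets of that support, illustrating how heredity prunes the search.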

3.3 Block Sparse Implementation

In order to improve the time and memory efficiency of the SIAN model, we implement a block sparse construction for neural additive models. The default scheme of SIAN is to use a network module for each of the shape functions and additively combine the output features, following the implementation strategies of NAM and other popular neural additive models. However, since each shape function network is computed sequentially, this greatly bottlenecks the computation speed of the model. We instead construct a large, block sparse network which computes the hidden representations of all shape function networks simultaneously, leveraging the fact that each shape subnetwork has the same depth. Using this block sparse network allows shape features to be computed in parallel, leading to a significant improvement in training speed. The main consequence of this design is a higher memory footprint; therefore, we also develop a compressed sparse matrix implementation of the network which greatly reduces the memory required for saving network parameters. SIAN is able to interchange between these different modes with minimal overhead, allowing for faster training in the block sparse mode, a lower memory footprint in the compressed mode, and convenient visualization of shape functions in the default mode. We provide further numerical details of our gains in Section 5.1.
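The core idea can be illustrated with a numpy sketch: because every shape subnetwork shares the same depth and width, a Python loop over small matrix multiplies can be replaced by one batched contraction, which is the same computation a block sparse weight matrix exposes to the GPU. This assumes equal input dimensionality per subnetwork (smaller subsets can be zero-padded), and is a stand-in for the actual PyTorch implementation.

```python
import numpy as np

def forward_loop(x_groups, weights):
    """Sequential baseline: one small linear layer per shape subnetwork."""
    return [g @ W for g, W in zip(x_groups, weights)]

def forward_batched(x_groups, weights):
    """Batched equivalent: stack the subnetworks and compute every
    hidden representation in a single einsum call."""
    X = np.stack(x_groups)   # (n_nets, batch, in_dim)
    W = np.stack(weights)    # (n_nets, in_dim, hidden)
    return np.einsum('nbi,nih->nbh', X, W)
```

The two forward passes are numerically identical; the batched form simply exchanges Python-level iteration for one large kernel launch.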


4 Datasets

Our experiments focus on seven machine learning datasets. Two are in the classification setting: the MIMIC-III Healthcare and Higgs datasets 2016mimicIII; pierre2014exoticParticlesHiggsDataset. The other five are in the regression setting, namely the Appliances Energy, Bike Sharing, California Housing Prices, Wine Quality, and Song Year datasets candanedo2017appliancesEnergyDataset; 2013bikeSharing; 1997caliHousing; cortez2009wineQualityDataset; BertinMahieux2011millionSongYearDataset. More details about each dataset are provided in Table 1. We evaluate the regression datasets using mean-squared error (MSE). We measure performance on the classification datasets using both the area under the receiver operating characteristic (AUROC) and the area under the precision-recall curve (AUPRC) metrics. We report both metrics for the MIMIC dataset since the positive class comprises only 9% of examples, and report only AUROC for the Higgs dataset since it is relatively well-balanced.

4.1 Experiment Details

For the baseline DNNs, we use hidden layer sizes [256, 128, 64] with ReLU activations. For the GAM subnetworks, we use hidden layer sizes [16, 12, 8] with ReLU activations. We use L1 regularization of size 5e-5. In the main results section, we report the results for each dataset using only a single value of $K$ and $\theta$. The hyperparameter $\tau$ was held fixed throughout, and $\theta$ was selected from a handful of potential values using a validation set. We train all networks using Adagrad with a learning rate of 5e-3. All models are evaluated on a held-out test dataset over five folds of training-validation split unless three folds are specified. Three folds are used for NODE-GAM on all datasets, as well as on Song Year and Higgs for all models. We respect previous testing splits when applicable; otherwise we subdivide the data using an 80-20 split to generate a testing set. In addition to NODE-GA2M, we compare against the interpretable models LASSO and GAM EBM as well as the popular machine learning models of support vector machines (SVM), random forests (RF), and extreme gradient boosting (XGB) chen2016xgboost; nori2019interpretml.

Table 1: Real-world datasets (Energy Appliances, Bike Sharing, California Housing, Wine Quality, Song Year, Higgs Boson, MIMIC-III), listing for each the number of samples, the number of features, and the percentage of data samples with the positive class label.

Table 2: MIMIC-III Performance. Models are compared by AUROC (↑) and AUPRC (↑).

5 Results

Across seven different datasets, SIAN achieves an average rank of 3.00 out of the 8 models we consider. The next best performing model, NODE-GA2M, has an average rank of 3.71 out of 8. The third best performing model, DNN, has an average rank of 3.86 out of 8. We find that SIAN achieves consistent performance by being able to adapt to both the low-dimensional datasets and the high-dimensional datasets, finding a balance between good training fit and good generalization.

Model     | Appliances Energy (↓) | Bike Sharing (↓) | California Housing (↓) | Wine Quality (↓) | Song Year (↓) | Higgs Boson (↑)
LASSO     | 0.740±0.002 | 1.053±0.001 | 0.478±0.000 | 0.575±0.002 | 1.000±0.008 | 0.635±0.000
GAM EBM   | 1.053±0.138 | 0.124±0.004 | 0.265±0.002 | 0.498±0.004 | 0.894±0.001 | 0.698±0.001
NODE-GAM  | 1.064±0.056 | 0.111±0.006 | 0.222±0.005 | 0.521±0.009 | 0.806±0.001 | 0.811±0.000
SIAN-1    | 0.718±0.007 | 0.387±0.035 | 0.378±0.007 | 0.551±0.004 | 0.860±0.001 | 0.771±0.001
SIAN-2    | 0.763±0.009 | 0.127±0.008 | 0.302±0.002 | 0.497±0.003 | 0.842±0.002 | 0.795±0.001
SIAN-3    | 0.808±0.026 | 0.125±0.013 | 0.278±0.001 | 0.497±0.003 | 0.831±0.001 | 0.798±0.001
SIAN-5    | 0.801±0.031 | 0.149±0.011 | 0.272±0.003 | 0.484±0.006 | 0.821±0.001 | 0.802±0.001
RF        | 1.114±0.095 | 0.206±0.009 | 0.271±0.001 | 0.439±0.005 | 0.994±0.005 | 0.654±0.002
SVM       | 0.740±0.008 | 0.168±0.001 | 0.262±0.001 | 0.457±0.008 | 0.940±0.012 | 0.698±0.001
XGB       | 1.188±0.119 | 0.157±0.003 | 0.229±0.002 | 0.465±0.014 | 0.881±0.002 | 0.740±0.000
Table 3: Test metrics for six of seven datasets. (↑)/(↓) indicates higher/lower is better, respectively.

For the MIMIC dataset, we can see the models’ performances in terms of both AUROC and AUPRC in Table 2. In addition to the models we previously described, we also compare against the interpretable medical baselines of SOFA and SAPS II which are simple logic-based scoring methods 1993sapsII; 1997sofa. We see that the machine learning methods improve over the baseline performance achieved by the SAPS and SOFA methods.

In Table 3, we can see the combined results for our six other datasets. First, in the Appliances Energy dataset, we see that the SIAN-1 performs the best, with LASSO and SVM trailing slightly behind. The success of one-dimensional methods on this dataset could be indicative that many of the dataset’s trends are one-dimensional. As we increase the dimension of the SIAN, the test error slowly increases as the generalization gap grows; the full-complexity DNN has even worse error than all SIAN models.

Second, in the Bike Sharing dataset, we see that the best performing model is the NODE-GA2M, with the EBM, SIAN-2, and SIAN-3 trailing only slightly behind. All of these methods are two- or three-dimensional GAM models, again hinting that a significant portion of the important trends in this dataset could be bivariate. Indeed, the most important bivariate trend alone accounts for a substantial fraction of the variance in the dataset, as we explore in Figure 2 below.

For the remaining datasets, we see that the performance of the SIAN model improves as we add higher and higher-order feature interactions. For the California Housing dataset, we find the best performance using the differentiable tree method of NODE-GA2M. For the Wine Quality dataset, we find the best performance using the random forest algorithm. For the larger-scale datasets of Song Year and Higgs Boson, we find that the best performance is still obtained by a full-complexity deep neural network. These two are the only datasets where SIANs of order five or less are not sufficient to outperform vanilla deep neural networks, implying there are important feature interactions of degree greater than five.

5.1 Training Speed and Storage

In Table 4 below we see how our SIAN network compares against other popular differentiable additive models in both training time and size on disk. For fair comparison we do not utilize our interaction selection algorithm for SIAN in this section, instead training SIAN-2 with all possible pairs. We see that our implementation of different modes for the SIAN architecture allows us to outperform both NAM and NODE-GAM, with 2-8x faster training on GAM-1 models and 10-80x faster training on GAM-2 models. We reiterate that because the overhead for switching between modes is negligible, SIAN enjoys the benefits of all modes: training quickly in the block sparse mode and saving succinctly in the compressed mode.

Training Time
        | SIAN-2 (default) | SIAN-2 (block-sparse) | NODE-GA2M | SIAN-1 (default) | SIAN-1 (block-sparse) | NAM
Wine    | 17.57  | 0.69 | 55.19 | 3.03  | 0.66 | 5.45
Bike    | 159.65 | 5.38 | 60.75 | 25.65 | 4.58 | 12.09
House   | 88.67  | 6.09 | 58.03 | 22.72 | 5.88 | 14.74

Size on Disk
        | SIAN-2 (default) | SIAN-2 (block-sparse) | SIAN-2 (compressed) | NODE-GA2M | SIAN-1 (default) | SIAN-1 (block-sparse) | SIAN-1 (compressed) | NAM
Wine    | 7,113 | 7,113 | 537 | 1,040 | 182 | 182 | 86 | 1,203
Bike    | 9,727 | 9,666 | 626 | 1,040 | 218 | 212 | 92 | 308
House   | 1,526 | 1,526 | 251 | 1,040 | 83  | 83  | 59 | 1,921

Table 4: Training Time and Memory Size. Experiments are run on a machine with a GTX 1080 GPU and 16GB of RAM. The average wall-clock time over three runs is reported.

5.2 Beyond ReLU Networks

The FIS algorithm we describe can be applied to combinations of FID algorithm and functional model other than Archipelago with ReLU neural networks. To demonstrate the general applicability of our scheme, we replace the continuous, piecewise-linear functions of ReLU neural networks with the piecewise-constant, differentiable decision trees of NODE-GAM. We extend the original implementation of NODE-GAM to handle feature triplets, extending the method to a trivariate function or GAM-3 model. We run our FIS algorithm on both the Housing and Wine datasets to fit GAM-3 models using the inductive biases of the NODE architecture.

Extending from NODE-GA2M to NODE-GA3M improves performance on the Housing dataset, a further improvement over the state-of-the-art method. The same extension delivers only a small improvement on the Wine dataset; however, both the SIAN-5 and NODE-GA3M are able to improve performance over all previously available additive models. These two examples demonstrate the ability of our FIS pipeline to be applied to more general machine learning techniques to model higher-order interactions.

6 Discussion

In this section, we further explore the feature interactions learned by SIAN and FIS across multiple datasets. We provide multiple visualizations of the shape functions learned by SIAN to get a sense of the diverse analysis and interpretations which are made possible by additive models. Further discussion and graphics are provided in Appendix C and D.

Figure 2: Analysis of Bike Sharing. (a) We plot the histogram of feature interaction strength, adjusting the vertical scaling by degree to display in a single plot. (b) We plot the training and validation performance (darker and lighter, respectively) as we add feature interactions to SIAN. We use an exponentially weighted moving average to reduce variance instead of training multiple networks of each size. The blue star is the performance of the shape functions in 2c and 2d. (c) The most important shape function showing the interaction between hour of the day and work day. (d) The next shape function showing temperature’s effect on bikes rented.

In Figure 2a, we visualize the interaction strength measured for each of the singles, pairs, and triples of features from the Bike Sharing dataset. The four top rated interactions (three of which are visually separated from the main body of the histogram) are, from right to left, [“hour”, “hour X workday”, “workday”, “temperature”]. In 2b we see the effect of gradually adding feature interactions to our model. We see there is a steep jump in performance when we are able to model the first pair, depicted in 2c. Together with 2d, these two visualizations alone explain 84% of the variance in the Bike Sharing dataset. Importantly, these two trends are interpretable and agree with our intuition about when people are more likely to bike: on work days, we see peak spikes at 8 a.m. and 5 p.m., corresponding to the beginning and end of work hours; on weekends and holidays, there is a steady demand of bikes throughout the afternoon. There are also more bikers during warmer temperatures.

Figure 3: 3D feature interaction for California Housing. (a) The effect of the location only for the entire state of California and zoomed in to three selected locations. (b) The effect of only the block-level population. (c) The three-dimensional interaction effect of block-level population with location. We plot the effect for both a low and high population to highlight the differences in effect for each location. Note the different scales per panel.

In Figure 3, we set out to visualize one of the three-dimensional interactions, which occurs between the three features of latitude, longitude, and population. In 3a, we see the rise in housing price along the coast of California, especially around the metropolitan areas of Los Angeles and San Francisco. In 3b, we see that house price does not monotonically increase with population as we might expect for higher population densities. A detailed look into the dataset reveals that the ‘population’ feature being used is accumulated at the census block level, creating an inverse relationship with population density, as areas like LA and SF are subdivided more than their suburban and rural counterparts. Although it is possible location and population density might have independent effects on housing price, the dataset nevertheless induces an interaction between location and population. In 3c, we attempt to visualize this 3D interaction by viewing subsamples across two population sizes and over three regions. We see the network has learned to differentiate the urban regions of LA and SF from the rural coast of northern California, where high population becomes indicative of higher population densities and higher housing prices. We note that the trends learned by SIAN tend to be very smooth, which is a potential shortcoming in learning fine-grained block-level information in cities like Los Angeles and San Francisco. In light of these concerns, we run experiments on a three-dimensional extension of the piecewise-constant, discontinuous NODE-GAM model in Section 5.2.

Figure 4: MIMIC-III. Age Risk and GCS Risk

In Figure 4, we see two example trends from the SIAN trained to predict mortality given hospital data. On the left, we see that health risk gradually increases with age. On the right, we see the trend with respect to the Glasgow Coma Scale (GCS), a measure of alertness and consciousness; the lowest and highest scores correspond to severe head injury and full alertness, respectively. We see that very low GCS corresponds to high risk and that the risk decreases as GCS increases. At a score of 13, however, an increase to 14 or 15 actually increases mortality risk, defying the intuition that risk should be monotone in the GCS score. It is highly likely that this dip in mortality risk comes from the special care given to patients with GCS scores below 15, compared to their counterparts with perfect scores.

This trend illustrates a phenomenon occurring throughout machine learning applications in healthcare where correlation and causation are conflated with one another. Similar results have been found linking asthma to a decreased risk of death from pneumonia [caruana2015intelligible]. These issues might go unnoticed in black-box models whereas interpretable models can uncover and resolve such discrepancies before deployment. While discovering and correcting interpretable trends brings clear advantages, it is also possible for these trends to be misinterpreted by non-experts as a causal relationship. Such false causal discoveries can not only lead to physical harm in the domain of medicine, but also to larger social harm in broader AI systems and applications.

7 Limitations and Future Work

A primary limitation of the current work is its focus on multilayer perceptrons, whereas modern state-of-the-art results are dominated by geometric deep learning and transformer architectures. Extending this procedure to more general architectures is a key direction for bringing interpretability to domains like computer vision and natural language. Another important direction of research is a better theoretical understanding of how the benefits of SIAN scale with the dimensionality of the dataset and the number of samples provided, especially in the presence of noise, as well as theory which accurately models the empirically observed distributions of feature interactions.

Multiple experiments confirm that SIAN can produce powerful and interpretable machine learning models which match the performance of state-of-the-art benchmarks like deep neural networks and NODE-GA2M. Further experiments show that FIS can be applied to more general machine learning algorithms. We hope future work will further clarify this sparse interaction perspective and deepen our understanding of the generalization performance of neural networks.




Appendix A Theoretical Discussion

a.1 Introduction with pseudo-boolean function

Let us first discuss a pseudo-boolean function of the form $f : \{-1, 1\}^n \to \mathbb{R}$. Every such function has a unique decomposition as a multilinear polynomial:

$$f(x) = \sum_{S \subseteq [n]} \hat{f}(S)\, \chi_S(x),$$

where $\chi_S(x)$ is the monomial $\prod_{i \in S} x_i$. This multilinear decomposition is usually referred to as the Fourier expansion of the function, and the $\hat{f}(S)$ are called the Fourier coefficients. Instead of estimating the functional values, we can instead estimate the Fourier coefficients. Although there is no immediate benefit to this shift in perspective, an abundance of practical assumptions like submodularity, disjunctive normal forms, low-order approximations, and a decaying Fourier spectrum can be made to imply that only a sparse subset of the full coefficients is worth estimating. A further discussion of these assumptions in the context of boolean functions can be found in [donnel2008someTopicsBooleanFunctions, stobbe2012sparseFourierSet, raskhodnikova2012pseudoBooleanKDNF, odonnel2021booleanFunctionBook].

For our context, it suffices to say that only a sparse subset of the full $2^n$ coefficients is nonzero. Our problem hence becomes a sparse selection problem in the space of Fourier coefficients, which can equivalently be thought of as a Feature Interaction Selection (FIS) problem in the functional space of pseudo-boolean functions.
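As a minimal illustration of this perspective (not the paper's implementation), the Fourier coefficients of a small pseudo-boolean function over $\{-1,1\}^n$ can be enumerated directly, confirming that a sparse function concentrates its mass on few coefficients:

```python
import itertools
import numpy as np

def fourier_coefficients(f, n):
    """Enumerate the Fourier coefficients of f : {-1, 1}^n -> R.

    hat_f(S) = E_x[f(x) * prod_{i in S} x_i] under the uniform
    distribution, i.e. an average over all 2^n inputs.
    """
    points = list(itertools.product([-1, 1], repeat=n))
    values = np.array([f(x) for x in points], dtype=float)
    coeffs = {}
    for r in range(n + 1):
        for S in itertools.combinations(range(n), r):
            # chi_S(x) = prod_{i in S} x_i (empty product = 1)
            chi = np.array([np.prod([x[i] for i in S]) for x in points])
            coeffs[S] = float(np.mean(values * chi))
    return coeffs

# The 3-bit parity function concentrates all of its Fourier mass on the
# single coefficient indexed by the full set {0, 1, 2}.
parity = lambda x: x[0] * x[1] * x[2]
coeffs = fourier_coefficients(parity, 3)
nonzero = {S for S, c in coeffs.items() if abs(c) > 1e-9}
```

The enumeration is exponential in $n$, which is exactly why a tractable selection procedure over coefficients is needed in practice.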

Lemma 1.

Given exact functional values, Archipelago recovers the upper set of Fourier coefficients. Specifically, given a subset and defining:

We have that the upper cone set of is recovered by the Archipelago measure:


where and . We also write and . Continuing:

We know that the second summand will have every term equal to zero. Without loss of generality, let belong to but not ; we can always find such an element since the two sets are not equal. We can then factor out an independent term of and take the expectation over these independent quantities:

Hence, we can now focus on a given and consider what happens to:

Following a similar argument, we can partition into two summands consisting of and . Without loss of generality, pick one such in but not in . For each choice of , there is a choice of either or which can be factored out from one term and not the other; by assumption there is no . Without loss of generality, let it be .

Now, we are focusing on only a specific and , giving:

Writing the symmetric difference as , our quantity is equal to:

This expectation will be equal to zero if the product is nonempty, following the same logic as before by taking the expectation over one of the independent or . Otherwise, the expectation will be equal to one. Hence, it is only equal to one for the terms in which and our quantity of interest is equal to:

This sum will, however, be equal to zero unless . Suppose instead that there is an . Consider the mapping from pairs where to the pairs which have . These sets clearly partition , and the mapping is bijective inside the subsets. Hence, the two sets are of the same size and with odd parity of . Together, this implies the sum above will be equal to zero in such cases. In the alternate case that , we know that or , meaning we instead consider .

Altogether, our original quantity of interest has now become:

Proposition 1.

The FIS Algorithm 1 recovers the downwards closure of the true nonzero entries of the function. Let be the set of nonzero Fourier coefficients of size . Define the closure or downwards closure of as . Algorithm 1 returns exactly .


Using the lemma above, we know that the measurement on the pseudo-boolean function will be the upper cone. From this it follows that any member of the downwards closure of will have positive upper cone size and will therefore be added to the candidate set of interactions. ∎
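The downwards closure itself is straightforward to compute; a minimal sketch, assuming interactions are represented as tuples of feature indices:

```python
from itertools import chain, combinations

def downward_closure(family):
    """Downwards closure of a set family: every subset of every member."""
    closure = set()
    for S in family:
        S = tuple(sorted(S))
        # All subsets of S, from the empty set up to S itself.
        closure.update(chain.from_iterable(
            combinations(S, r) for r in range(len(S) + 1)))
    return closure

# A single third-order interaction pulls in all of its sub-interactions.
closed = downward_closure([(0, 1, 2)])
```

This mirrors the proposition's guarantee: every sub-interaction of a true nonzero interaction appears in the returned candidate set.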

a.2 Decomposition of a general function

Given a more general space and a function $f : \mathcal{X} \to \mathbb{R}$, we can define another decomposition similar to the Fourier decomposition from before. Given some fixed distribution on our input space $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$, we can define the conditional expectation functions as:

$$f_S(x_S) := \mathbb{E}\left[ f(x) \mid x_S \right],$$

and inductively define $\tilde{f}_S$ with:

$$\tilde{f}_S(x_S) := f_S(x_S) - \sum_{T \subsetneq S} \tilde{f}_T(x_T).$$

We can now decompose our original function as:

$$f(x) = \sum_{S \subseteq [n]} \tilde{f}_S(x_S),$$

where each of these functions corresponds to a subset of the entire feature space and depends only on that set of features.

Moreover, we see that these functions are orthogonal with respect to the distribution.
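This decomposition and its orthogonality can be checked numerically on a small discrete grid. The sketch below (a toy illustration, not the paper's code) computes every component under the uniform distribution and verifies that they sum back to the original function:

```python
import itertools
import numpy as np

def anova_decomposition(f):
    """Functional ANOVA of an array f over a uniform product grid.

    f has shape (k_1, ..., k_n).  Returns {S: tilde_f_S}, where f_S
    averages f over the axes outside S (a conditional expectation) and
    tilde_f_S = f_S minus the sum of tilde_f_T over strict subsets T.
    """
    n = f.ndim
    components = {}
    for r in range(n + 1):
        for S in itertools.combinations(range(n), r):
            other = tuple(i for i in range(n) if i not in S)
            f_S = f.mean(axis=other, keepdims=True)  # E[f | x_S]
            for T, g_T in components.items():
                if set(T) < set(S):
                    f_S = f_S - g_T
            components[S] = f_S
    return components

# Toy check on f(x, y) = x + y + x*y over a uniform 5x5 grid.
x = np.linspace(-1.0, 1.0, 5)
f = x[:, None] + x[None, :] + x[:, None] * x[None, :]
parts = anova_decomposition(f)
recon = sum(np.broadcast_to(p, f.shape) for p in parts.values())
```

On this example the components recover the mean, main effects, and pairwise interaction exactly, and distinct components are orthogonal under the uniform grid distribution.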

We can more enticingly write this as:


For instance, the function corresponds to the average value of the function $f$ under the input distribution. The function corresponds to the expectation of the function conditioned on knowing variable , minus the mean value. As a running example, let us consider using the uniform distribution over the square . We will have , , , and , so that , , , . In our example this means that:

Formally, let us assume that our function is measurable and integrable with respect to our probability measure, and that additionally each such subfunction is well defined and similarly integrable for all product measures. In this case, all of the following expectations (and integrations) that we write are well defined and exchangeable. Under these assumptions, we have the sum decomposition written above.


Let us first note that:

and consider an arbitrary . Let us define , , , .

Further, we see that:

Let us further focus on

We are now able to take and we see that our outside expectation can be reduced to an expectation over , because for each feature not in the set , at least one of the factors in the product is constant with respect to that variable. In more detail, anything outside of leaves the first term constant, and anything outside of leaves the second term constant. With respect to any of these variables, the linearity of expectation shows that the reduced expectation has the same value as the original expectation. Consequently, we may drop every variable except those in , and the expectation reduces to:

In greater detail, when we define , , , , we have that:

where each step is done by the linearity of expectation and the law of total expectation.

Since this is true for any , we now focus on the entire summation. Let us assume that , so that either or is nonempty. Assume without loss of generality that it is which is nonempty. It then follows that

Altogether, we can go back to our original reformulation to get our desired result:

Moreover, our model approximating the function decomposes in the same way and we divide our estimation problem into different signal estimation problems with:

In the next two subsections, we will further develop this framework to lay out why such a decomposition can succeed by estimating only a subset of all possible interactions. In particular, we will show that when is large, it becomes likely that as the effective signal-to-noise ratio shrinks as grows. In other words, noisy estimates of the signal on average perform even worse than the baseline guess of zero signal.

a.3 Higher dimensional signals decay with dimension

For this section, we will focus on a Fourier signal on the $d$-dimensional rescaled torus $\mathbb{T}^d$.

Proposition 2.

If we assume that our signal is in the class of $s$-times differentiable smooth functions, then our Fourier coefficients decay like $O(\|\omega\|^{-s})$. This is a known fact from Fourier analysis and can be derived directly from the Fourier series.
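This decay is easy to observe numerically. The sketch below (an illustration, not from the paper) computes the Fourier coefficient magnitudes of a smooth one-dimensional periodic signal with NumPy's FFT; since the chosen signal is infinitely differentiable, its coefficients decay even faster than the polynomial rate in the proposition:

```python
import numpy as np

# Smooth periodic signal on the torus: f(x) = exp(cos(x)).  Being
# infinitely differentiable, its Fourier coefficients decay faster than
# any polynomial, so high frequencies carry almost no energy.
N = 1024
x = np.linspace(0.0, 2.0 * np.pi, N, endpoint=False)
f = np.exp(np.cos(x))
coeffs = np.abs(np.fft.rfft(f)) / N  # magnitude of frequency k
```

Inspecting `coeffs` shows the magnitudes falling off rapidly with frequency, which is the one-dimensional analogue of the decay used in this section.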

Consequently, we will consider a random signal defi