Sparse Interaction Additive Networks
There is currently a large gap in performance between statistically rigorous methods like linear regression or additive splines and powerful deep methods using neural networks. Previous works attempting to close this gap have failed to fully investigate the exponentially growing number of feature combinations which deep networks consider automatically during training. In this work, we develop a tractable selection algorithm to efficiently identify the necessary feature combinations by leveraging techniques in feature interaction detection. Our proposed Sparse Interaction Additive Networks (SIAN) construct a bridge from these simple and interpretable models to fully connected neural networks. SIAN achieves competitive performance against state-of-the-art methods across multiple large-scale tabular datasets and consistently finds an optimal trade-off between the modeling capacity of neural networks and the generalizability of simpler methods.
Over the past decade, deep learning has achieved significant success in providing solutions to challenging AI problems like computer vision, language processing, and game playing [he2015resnet; Vaswani2017transformer; deepmind2017alphaZero]. As deep-learning-based AI models increasingly serve as important solutions to applications in our daily lives, we are confronted with some of their major disadvantages. First, current limitations in our theoretical understanding of deep learning make fitting robust neural networks a balancing act between an accurate training fit and a low generalization gap. Questions about how test performance will vary with the choice of architecture, dataset, and training algorithm remain poorly understood and difficult to answer. Second, infamously known as black-box models, deep neural networks greatly lack interpretability. This continues to lead to a variety of downstream consequences in applications: unexplainable decisions for stakeholders, the inability to distinguish causation from correlation, and a misunderstood sensitivity to adversarial examples.

In contrast, simpler machine learning models such as linear regression, splines, and the generalized additive model (GAM)
[hastie1990originalGAM] naturally win in interpretability and robustness. Their smaller number of parameters often have clear and meaningful interpretations, and these methods rarely succumb to overfitting the training data. The main shortcoming of these simpler methods is their inability to accurately fit more complex data distributions.

There is a growing strand of literature attempting to merge the interpretability of additive models with the streamlined differentiable power of deep neural networks (DNNs). Key works in this direction such as NAM and NODE-GAM have been able to successfully model one-dimensional main effects and two-dimensional interaction effects using differentiable training procedures [agarwal2020nam; chang2022nodegam]. Although many other works have found similar success in modeling one- and two-dimensional interactions, few have made practical attempts towards feature interactions of size three or greater, which we will refer to throughout as higher-order feature interactions. In this way, no existing works have been able to use the hallmark ability of neural networks to model higher-order interactions: amassing the influence of hundreds of pixels in computer vision and combining specific words from multiple paragraphs in language processing. In this work, we bring interpretable models one step closer towards the impressive differentiable power of neural networks by developing a simple but effective selection algorithm and an efficient implementation which together allow us to train neural additive models that fit higher-order interactions of degree three and greater. The proposed Sparse Interaction Additive Networks (SIANs) consistently achieve results that are competitive with state-of-the-art methods across multiple datasets. We summarize our contributions as follows:
We develop a feature interaction selection algorithm which efficiently selects from the exponential number of higher-order feature interactions by leveraging heredity and interaction detection. This allows us to construct higher-order neural additive models for medium-scale datasets, unlike previous works in neural additive modeling which only consider univariate and bivariate functions.
We provide further insights into the trade-offs faced by neural networks between better generalization and better functional capacity. By tuning the hyperparameters of our SIAN model, we can gradually interpolate from one-dimensional additive models to full-complexity neural networks. We observe fine-grained details about how the generalization gap increases as we add capacity to our neural network model.
We design a block sparse implementation of neural additive models which greatly improves the training speed and memory efficiency of neural-based additive models. These improvements over previous NAM implementations allow shape functions to be computed in parallel, allowing us to train larger and more complex additive models. We provide code for our implementation and hope the tools provided can be useful throughout the additive modeling community for faster training of neural additive models. (Code available at github.com/EnouenJ/sparse-interaction-additive-networks.)
The generalized additive model (GAM) [hastie1990originalGAM] has existed for decades as a more expressive alternative to linear regression, replacing each linear coefficient with a nonparametric function. Two-dimensional extensions appeared in the literature shortly after its introduction [wahba1994ssanova]. Over the years, an abundance of works have used the GAM model as an interpretable method for making predictions, with the choice of functional model typically reflecting the most popular method of the time period: regression splines, random forests, boosting machines, kernel methods, and most recently neural networks [hooker2007functionalANOVA; caruana2015intelligible; kandasamy16salsa; yang2020gamiNet]. Two of the most prominent neural-network-based approaches are NAM and NODE-GAM [agarwal2020nam; chang2022nodegam]. The former stacks multilayer perceptrons to build a one-dimensional GAM; the latter connects differentiable decision trees to build a two-dimensional GAM. Both have demonstrated competitive performance and interpretable trends learned over multiple tabular datasets. Other neural network extensions
[wang2021partiallyInterpretableEstimatorPIE; oneill2021regressionNetworks] increase the modeling capacity to higher-order interactions by first training a two-dimensional model and then training an additional black-box neural network to fit the residual error. While this approach does have higher modeling capacity, it foregoes interpretable insights on the higher-order feature interactions and suffers the same inclination to overfit held naturally by deep neural networks.

Other works in additive modeling instead focus on extending the univariate GAM to sparse additive models in the high-dimensional inference regime [lin2006cosso; ravikumar2009spam; meier2009hdam; xu2022snam]. Further extensions of these methods to sparse bivariate models for high-dimensional inference also exist [tyagi2016spam2; liu2020ssam]. These works extend classical high-dimensional inference techniques like LASSO and LARS from linear parameters to additive functions by shrinking the effect of minimally important variables and emphasizing sparse solutions. We note that, unlike these works, we do not use sparsity in the soft-constrained sense to shrink features from a fixed selection, but instead adaptively use feature interaction measurements to hierarchically build a feasible set of interactions. The only existing work in additive modeling which uses hierarchy to induce sparsity in the same sense as this work is GAMI-Net, which uses a three-stage procedure to select candidate pairs under a hierarchical constraint [yang2020gamiNet]. Extending their procedure to three or higher dimensions would require four or more stages of training and is left unexplored in their work.
Although the theoretical extension to three-dimensional additive models is clear, there is currently a lack of discussion surrounding the practical challenges faced when trying to model three-dimensional shape functions. One of the few works to pursue a practical implementation of higher-order GAMs on real-world datasets is SALSA [kandasamy16salsa], which uses a specialized kernel function to fit additive models of order three and higher. Their work also corroborates our finding that optimal performance is achieved by different orders for different datasets. However, the kernel-based approach used in their work makes it unsuitable beyond small-scale datasets with few samples. This makes our work one of the first to train additive models of higher-order interactions which leverage the automatic differentiation and GPU optimization toolkits that have become commonplace in modern workstations.
We denote a $d$-dimensional input as $x \in \mathbb{R}^d$ with its $i$th component as $x_i$; its corresponding output is denoted $y$. We consider one-dimensional $y$ as in regression and binary classification. We will use $f$ to denote the function or model used to approximate $y$, implicitly considering the additive noise model $y = f(x) + \epsilon$ for some noise term $\epsilon$. We denote a subset of the set of features $[d] = \{1, \dots, d\}$ by $\mathcal{I} \subseteq [d]$. Its cardinality is denoted $|\mathcal{I}|$, its complement $[d] \setminus \mathcal{I}$, and its power set $\mathcal{P}(\mathcal{I})$. For $\mathcal{I} = \{i_1, \dots, i_k\}$, we define $x_{\mathcal{I}}$ such that:

$$x_{\mathcal{I}} = (x_{i_1}, \dots, x_{i_k})$$
We first consider the generalized additive model (GAM) [hastie1990originalGAM], which extends linear regression by allowing each input feature to have a nonlinear relationship with the output:

$$g(\mathbb{E}[y \mid x]) = f_0 + \sum_{i=1}^{d} f_i(x_i) \qquad (1)$$
Each of the $f_i$ 'reshapes' its respective feature and then adds the reshaped feature to the total prediction. These are hence called shape functions and were traditionally fit using regression splines. The function $g$ is the link function, which will be the identity function for regression and the inverse sigmoid for classification. $f_0$ is a normalizing constant. This original formulation, where each shape function considers only one feature, we will further refer to as GAM-1.
In order to extend this definition, we must consider the interplay which occurs between multiple input features. A 'non-additive feature interaction' between the features $\mathcal{I}$ of the function $f$ is said to exist when the function cannot be decomposed into a sum of arbitrary subfunctions such that each subfunction excludes one of the interacting variables $i \in \mathcal{I}$ [friedman2008predictive; sorokina2008detecting; tsang2018neural]. In other words, the entire feature set $x_{\mathcal{I}}$ must be simultaneously known in order to correctly predict the output $y$.
The goal of feature interaction detection is to uncover these groups of features which depend on one another. For smooth functions, this can be quantitatively done by finding the sets $\mathcal{I}$ such that the interaction strength, $\omega(\mathcal{I})$, is positive and large:

$$\omega(\mathcal{I}) = \mathbb{E}_x\left[ \left( \frac{\partial^{|\mathcal{I}|} f(x)}{\partial x_{i_1} \cdots \partial x_{i_k}} \right)^2 \right] \qquad (2)$$
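As a concrete illustration of the mixed-partial criterion in Eq. (2), the strength of a pairwise interaction can be estimated numerically with a central finite-difference stencil. The sketch below (the toy function and all names are our own illustration, not part of the SIAN implementation) recovers a strong interaction between multiplicatively coupled features and a near-zero strength for an additively separable pair.

```python
import numpy as np

def pairwise_strength(f, i, j, d, n_samples=2000, h=1e-3, seed=0):
    """Monte Carlo estimate of E[(d^2 f / dx_i dx_j)^2] using the
    4-point central-difference stencil for the mixed second partial."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n_samples, d))
    ei = np.zeros(d); ei[i] = h
    ej = np.zeros(d); ej[j] = h
    mixed = (f(X + ei + ej) - f(X + ei - ej)
             - f(X - ei + ej) + f(X - ei - ej)) / (4.0 * h * h)
    return float(np.mean(mixed ** 2))

# Toy function: features 0 and 1 interact multiplicatively; feature 2 is additive.
f = lambda X: X[:, 0] * X[:, 1] + np.sin(X[:, 2])
print(pairwise_strength(f, 0, 1, d=3))  # ~1.0: a genuine interaction
print(pairwise_strength(f, 0, 2, d=3))  # ~0.0: no interaction
```

The same stencil generalizes to higher-order mixed partials by shifting each coordinate in $\mathcal{I}$, at the cost of $2^{|\mathcal{I}|}$ function evaluations per sample.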
We may now adjust the GAM definition to capture feature interactions by considering a set of specified interactions, $\mathcal{F} \subseteq \mathcal{P}([d])$, where each $\mathcal{I} \in \mathcal{F}$ is an interaction of size $|\mathcal{I}| \geq 2$:

$$g(\mathbb{E}[y \mid x]) = f_0 + \sum_{i=1}^{d} f_i(x_i) + \sum_{\mathcal{I} \in \mathcal{F}} f_{\mathcal{I}}(x_{\mathcal{I}}) \qquad (3)$$
The third term extends GAMs to full-capacity models which can represent nonlinear dependencies of arbitrary feature sets. For instance, if our set of interactions includes the complete feature set, $[d] \in \mathcal{F}$, then our model has exactly the same capacity as the underlying functional model we choose for the shape functions (splines, random forests, deep neural networks, etc.). An abundance of previous works have used this extended version of the GAM model, often called the smoothing spline ANOVA model or the functional ANOVA model [caruana2015intelligible; ravikumar2009spam; meier2009hdam; hooker2007functionalANOVA; wahba1994ssanova].
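A minimal sketch of this additive structure as code may help fix ideas; the hand-written shape functions below are illustrative stand-ins for the small neural networks a SIAN would fit, and all names are our own.

```python
import numpy as np

# Toy shape functions keyed by their feature subsets.  Each function receives
# only the columns x_I it is responsible for.
shape_functions = {
    (0,):   lambda xI: np.sin(xI[:, 0]),        # main effect of feature 0
    (1,):   lambda xI: 0.5 * xI[:, 0] ** 2,     # main effect of feature 1
    (0, 1): lambda xI: xI[:, 0] * xI[:, 1],     # pairwise interaction
}

def gam_predict(X, shapes, f0=0.0, inv_link=lambda z: z):
    """A constant plus a sum of shape functions over feature subsets,
    passed through the inverse link (identity for regression)."""
    z = np.full(X.shape[0], f0)
    for I, f_I in shapes.items():
        z = z + f_I(X[:, list(I)])   # restrict the input to x_I
    return inv_link(z)

X = np.array([[1.0, 2.0]])
print(gam_predict(X, shape_functions))  # sin(1) + 0.5 * 2^2 + 1 * 2
```

Because each term reads only its own feature subset, every prediction decomposes exactly into per-subset contributions, which is what makes the model directly interpretable.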
Throughout this work, we will use GAM-$K$ to refer to a GAM whose highest-order interaction in $\mathcal{F}$ is of cardinality $K$ (i.e., $\max_{\mathcal{I} \in \mathcal{F}} |\mathcal{I}| = K$). For instance, we will call the NODE-GAM model [chang2022nodegam] a GAM-2 model since the interaction sets are all possible feature pairs: $\mathcal{F} = \{\{i, j\} : i, j \in [d]\}$. We will refer to our SIAN networks as SIAN-$K$ using the same convention.
A key concern when using neural networks to fit the shape functions is keeping the number of networks low enough that the training time remains reasonable. While this is typically not a problem for a GAM-1, it can quickly become an issue for a GAM-2. For instance, if we have an input variable with 30 features, then including all pairwise functions would need to cover the $\binom{30}{2} = 435$ possible pairs. Although learning hundreds of linear coefficients is reasonable, training hundreds of neural networks becomes less so. Moreover, this quantity only grows exponentially as we increase to higher-order interactions (3, 4, 5, etc.). In an effort to combat this growing complexity, we introduce a Feature Interaction Selection (FIS) algorithm which depends on two key ingredients: an interaction detection procedure and a heredity condition.
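The combinatorial growth above is easy to check directly for $d = 30$ features:

```python
from math import comb

d = 30
# Number of shape functions needed if *all* interactions of order k are kept.
for k in range(1, 6):
    print(k, comb(d, k))
# 1 30
# 2 435      <- every pairwise shape function of a GAM-2
# 3 4060
# 4 27405
# 5 142506   <- clearly infeasible with one neural network per interaction
```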
In recent years, there has been a growing body of work focused on detecting and measuring the strength of feature interactions in large-scale data. Three of the most popular and generally applicable of these methods are Shapley Additive Explanations (SHAP) [lundberg2017shapleySHAP; lundberg2020local2global; dhamdhere2019shapley], Integrated Hessians [sundararajan2017integratedGradients; janizek2020explaining], and Archipelago [tsang2020archipelago]. For our experiments, we primarily use an adaptation of Archipelago for higher-order interactions because of its compatibility with black-box models. In contrast, Integrated Hessians is only applicable to sufficiently smooth networks (ReLU networks are not compatible), and SHAP only has fast implementations available for tree-based approaches. Moreover, although both SHAP and Integrated Hessians have clear ideological extensions to higher-order interactions, there are no currently available implementations. A detailed discussion surrounding Archipelago and other detection techniques can be found in Appendix
C.

The practice of only modeling the pairwise interaction effect $f_{ij}$ for some features when both of the main effects $f_i$ and $f_j$ are already being modeled has a long history throughout statistics [peixoto1987hierarchicalPolynomials; Chipman1995BayesianVS; bien2013hierarchicalLasso]. There are two main versions of this hierarchical principle explored in the literature: strong heredity and weak heredity. If we are given that $\{i, j\} \in \mathcal{F}$, strong heredity implies that both $\{i\} \in \mathcal{F}$ and $\{j\} \in \mathcal{F}$, whereas weak heredity implies that $\{i\} \in \mathcal{F}$ or $\{j\} \in \mathcal{F}$. For our definition of a feature interaction, strong heredity holds; however, our algorithm will instead focus on a computational version of heredity which asks that a certain percentage of subsets are above threshold in order for a pair (triple, tuple, etc.) to be considered as a possible interaction.
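This computational heredity condition can be sketched as a simple predicate (a hypothetical helper of our own, not the paper's exact implementation):

```python
from itertools import combinations

def passes_heredity(candidate, selected, gamma):
    """Keep `candidate` only if at least a fraction `gamma` of its immediate
    subsets were already selected.  gamma = 1.0 recovers strong heredity;
    a vanishingly small gamma behaves like weak heredity."""
    subsets = list(combinations(candidate, len(candidate) - 1))
    hits = sum(1 for s in subsets if frozenset(s) in selected)
    return hits / len(subsets) >= gamma

selected = {frozenset({0}), frozenset({1}), frozenset({0, 1}), frozenset({1, 2})}
print(passes_heredity((0, 1, 2), selected, gamma=0.5))  # 2 of 3 pairs kept -> True
print(passes_heredity((0, 1, 2), selected, gamma=1.0))  # strong heredity -> False
```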
In Algorithm 1, we detail how we use Archipelago to build our FIS algorithm. A visual overview of the SIAN pipeline is also depicted in Figure 1 above. We start by training a reference DNN, which is required for the inductive insights generated by Archipelago. We then pass the trained function approximator to the FIS algorithm along with its hyperparameters: the interpretability index, the heredity strength threshold, and the interaction strength threshold. This procedure efficiently searches and ranks the possible feature subsets to produce a final set of interactions which we use to specify the SIAN model architecture. Finally, we train our SIAN neural network using typical gradient descent methods. In Appendix A, we provide a further theoretical discussion in which we prove exact recovery of the true feature interactions and show how this sparse selection leads to provably lower generalization gaps in a toy additive noise model.
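The level-wise search at the heart of this procedure can be sketched as follows, with a `strength` callable standing in for the Archipelago scores computed on the reference DNN. All names, and the exact pruning rule, are illustrative assumptions rather than the paper's precise Algorithm 1.

```python
from itertools import combinations

def feature_interaction_selection(strength, d, max_order, tau, gamma):
    """Level-wise selection: `strength(I)` scores a feature tuple, `tau` is the
    interaction-strength threshold, `gamma` the heredity threshold."""
    # Level 1: keep the singleton features with sufficient strength.
    selected = {frozenset({i}) for i in range(d) if strength((i,)) >= tau}
    for order in range(2, max_order + 1):
        prev = [I for I in selected if len(I) == order - 1]
        # Candidates are unions of selected sets from the previous level.
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == order}
        for I in sorted(candidates, key=lambda s: tuple(sorted(s))):
            subs = list(combinations(sorted(I), order - 1))
            heredity = sum(frozenset(s) in selected for s in subs) / len(subs)
            if heredity >= gamma and strength(tuple(sorted(I))) >= tau:
                selected.add(frozenset(I))
    return selected

# Toy oracle: pretend the reference DNN exhibits exactly these interactions.
strong = {(0,), (1,), (2,), (0, 1), (1, 2), (0, 1, 2)}
F = feature_interaction_selection(lambda I: 1.0 if I in strong else 0.0,
                                  d=3, max_order=3, tau=0.5, gamma=0.5)
print(sorted(tuple(sorted(I)) for I in F))
```

Because candidates at each level are built only from survivors of the level below, the search never enumerates the full $2^d$ power set; the heredity threshold controls how aggressively that frontier is pruned.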
In order to improve the time and memory efficiency of the SIAN model, we implement a block sparse construction for neural additive models. The default scheme of SIAN is to use a network module for each of the shape functions and additively combine the output features, following the implementation strategies of NAM and other popular neural additive models. However, since each shape function network is computed sequentially, this greatly bottlenecks the computation speed of the model. We instead construct a large, block sparse network which computes the hidden representations of all shape function networks simultaneously, leveraging the fact that each shape subnetwork has the same depth. Using this block sparse network allows for shape features to be computed in parallel, leading to a significant improvement in training speed. The main consequence of this design is a higher footprint in memory; therefore, we also develop a compressed sparse matrix implementation of the network which has a greatly reduced memory footprint for saving network parameters. SIAN is able to interchange between these different modes with minimal overhead, allowing for faster training in the block sparse module, lower memory footprint in the compressed module, and convenient visualization of shape functions in the default module. We provide further numerical details of our gains in Section
5.1.

Our experiments focus on seven machine learning datasets. Two are in the classification setting, the MIMIC-III Healthcare and the Higgs datasets [2016mimicIII; pierre2014exoticParticlesHiggsDataset]. The other five are in the regression setting, namely the Appliances Energy, Bike Sharing, California Housing Prices, Wine Quality, and Song Year datasets [candanedo2017appliancesEnergyDataset; 2013bikeSharing; 1997caliHousing; cortez2009wineQualityDataset; BertinMahieux2011millionSongYearDataset]. More details about each dataset are provided in Table 2. We evaluate the regression datasets using mean-squared error (MSE). We measure performance on the classification datasets using both the area under the receiver operating characteristic (AUROC) and the area under the precision-recall curve (AUPRC). We report both metrics for the MIMIC dataset since the positive class is only 9% of examples, and report only AUROC for the Higgs dataset since it is relatively well-balanced.
For the baseline DNNs, we use hidden layer sizes [256, 128, 64] with ReLU activations. For the GAM subnetworks, we use hidden layer sizes [16, 12, 8] with ReLU activations. We use L1 regularization of size 5e-5. In the main results section, we report the results for each dataset using only a single setting of the heredity and interaction strength thresholds: one was fixed throughout and the other was selected from a handful of potential values using a validation set. We train all networks using Adagrad with a learning rate of 5e-3. All models are evaluated on a held-out test dataset over five folds of training-validation split unless three folds are specified. Three folds are used for NODE-GAM on all datasets, as well as on Song Year and Higgs for all models. We respect previous testing splits when applicable; otherwise, we subdivide the data using an 80-20 split to generate a testing set. In addition to NODE-GA2M, we compare against the interpretable models LASSO and GAM EBM as well as the popular machine learning models of support vector machines (SVM), random forests (RF), and extreme gradient boosting (XGB)
[chen2016xgboost; nori2019interpretml].

[Table: summary statistics of the seven datasets: Higgs Boson, MIMIC-III, Energy Appliances, Bike Sharing, California Housing, Wine Quality, and Song Year.]
[Table: mortality prediction results on MIMIC-III, reporting AUROC and AUPRC (higher is better) for SAPS II, SOFA, LASSO, GAM EBM, NODE-GAM, SIAN-1, SIAN-2, SIAN-3, SIAN-5, RF, SVM, XGB, and DNN.]
Across seven different datasets, SIAN achieves an average rank of 3.00 out of the 8 models we consider. The next best performing model, NODE-GA2M, has an average rank of 3.71, and the third best, DNN, an average rank of 3.86. We find that SIAN achieves consistent performance by being able to adapt to both the low-dimensional and the high-dimensional datasets, finding a balance between good training fit and good generalization.
Model | Appliances Energy (MSE) | Bike Sharing (MSE) | California Housing (MSE) | Wine Quality (MSE) | Song Year (MSE) | Higgs Boson (AUROC)
LASSO | 0.740 ± 0.002 | 1.053 ± 0.001 | 0.478 ± 0.000 | 0.575 ± 0.002 | 1.000 ± 0.008 | 0.635 ± 0.000
GAM EBM | 1.053 ± 0.138 | 0.124 ± 0.004 | 0.265 ± 0.002 | 0.498 ± 0.004 | 0.894 ± 0.001 | 0.698 ± 0.001
NODE-GAM | 1.064 ± 0.056 | 0.111 ± 0.006 | 0.222 ± 0.005 | 0.521 ± 0.009 | 0.806 ± 0.001 | 0.811 ± 0.000
SIAN-1 | 0.718 ± 0.007 | 0.387 ± 0.035 | 0.378 ± 0.007 | 0.551 ± 0.004 | 0.860 ± 0.001 | 0.771 ± 0.001
SIAN-2 | 0.763 ± 0.009 | 0.127 ± 0.008 | 0.302 ± 0.002 | 0.497 ± 0.003 | 0.842 ± 0.002 | 0.795 ± 0.001
SIAN-3 | 0.808 ± 0.026 | 0.125 ± 0.013 | 0.278 ± 0.001 | 0.497 ± 0.003 | 0.831 ± 0.001 | 0.798 ± 0.001
SIAN-5 | 0.801 ± 0.031 | 0.149 ± 0.011 | 0.272 ± 0.003 | 0.484 ± 0.006 | 0.821 ± 0.001 | 0.802 ± 0.001
RF | 1.114 ± 0.095 | 0.206 ± 0.009 | 0.271 ± 0.001 | 0.439 ± 0.005 | 0.994 ± 0.005 | 0.654 ± 0.002
SVM | 0.740 ± 0.008 | 0.168 ± 0.001 | 0.262 ± 0.001 | 0.457 ± 0.008 | 0.940 ± 0.012 | 0.698 ± 0.001
XGB | 1.188 ± 0.119 | 0.157 ± 0.003 | 0.229 ± 0.002 | 0.465 ± 0.014 | 0.881 ± 0.002 | 0.740 ± 0.000
DNN | 0.945 ± 0.054 | 0.374 ± 0.017 | 0.283 ± 0.005 | 0.495 ± 0.007 | 0.791 ± 0.002 | 0.823 ± 0.000
For the MIMIC dataset, we can see the models' performances in terms of both AUROC and AUPRC in Table 2. In addition to the models we previously described, we also compare against the interpretable medical baselines of SOFA and SAPS II, which are simple logic-based scoring methods [1993sapsII; 1997sofa]. We see that the machine learning methods improve over the baseline performance achieved by the SAPS and SOFA methods.
In Table 3, we can see the combined results for our six other datasets. First, in the Appliances Energy dataset, we see that SIAN-1 performs the best, with LASSO and SVM trailing slightly behind. The success of one-dimensional methods on this dataset could be indicative that many of the dataset's trends are one-dimensional. As we increase the dimension of the SIAN, the test error slowly increases as the generalization gap grows; the full-complexity DNN has even worse error than all SIAN models.
Second, in the Bike Sharing dataset, we see that the best performing model is NODE-GA2M, with the EBM, SIAN-2, and SIAN-3 trailing only slightly behind. All of these methods are two- or three-dimensional GAM models, again hinting that a significant portion of the important trends in this dataset could be bivariate. Indeed, the most important bivariate trend accounts for a large share of the variance in the dataset, as we explore in Figure 2 below.

For the remaining datasets, we see that the performance of the SIAN model improves as we add higher- and higher-order feature interactions. For the California Housing dataset, we find the best performance using the differentiable tree method of NODE-GA2M. For the Wine Quality dataset, we find the best performance using the random forest algorithm. For the larger-scale datasets of Song Year and Higgs Boson, we find that the best performance is still obtained by a full-complexity deep neural network. These two are the only datasets where SIANs of order five or less are not sufficient to outperform vanilla deep neural networks, implying there are important feature interactions of degree greater than five.
In Table 4 below, we see how our SIAN network compares against other popular differentiable additive models in both training time and size on disk. For fair comparison, we do not utilize our interaction selection algorithm for SIAN in this section, instead training SIAN-2 with all possible pairs. We see that our implementation of different modes for the SIAN architecture allows us to outperform both NAM and NODE-GAM, with 2-8x faster training on GAM-1 models and 10-80x faster training on GAM-2 models. We reiterate that because the overhead for switching between modes is negligible, SIAN enjoys the benefits of all modes: training quickly in the block sparse mode and saving succinctly in the compressed mode.
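The block sparse idea behind these speedups can be verified in a few lines: computing every shape subnetwork's hidden layer with a single block-diagonal matrix multiplication gives exactly the same result as running the subnetworks sequentially. The NumPy sketch below uses made-up weights and two single-hidden-layer subnetworks as an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two "shape subnetworks", each mapping its own feature slice to 4 hidden units.
W1 = rng.normal(size=(2, 4))           # subnetwork for features [0:2]
W2 = rng.normal(size=(3, 4))           # subnetwork for features [2:5]

def relu(z):
    return np.maximum(z, 0.0)

x = rng.normal(size=(5, 5))            # batch of 5 inputs with 5 features
# Sequential: run each subnetwork on its own feature slice, then concatenate.
h_seq = np.concatenate([relu(x[:, :2] @ W1), relu(x[:, 2:] @ W2)], axis=1)
# Block sparse: one big matmul against a block-diagonal weight matrix.
W_block = np.zeros((5, 8))
W_block[:2, :4] = W1
W_block[2:, 4:] = W2
h_par = relu(x @ W_block)
print(np.allclose(h_seq, h_par))       # identical hidden representations
```

One fused matrix multiplication per layer is what lets the GPU process all shape functions in parallel; the off-diagonal zeros are the memory cost that the compressed mode then avoids when saving to disk.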
Training time:
Dataset | SIAN-2 (default) | SIAN-2 (block sparse) | SIAN-2 (compressed) | NODE-GA2M | SIAN-1 (default) | SIAN-1 (block sparse) | SIAN-1 (compressed) | NAM
Wine | 17.57 | 0.69 | – | 55.19 | 3.03 | 0.66 | – | 5.45
Bike | 159.65 | 5.38 | – | 60.75 | 25.65 | 4.58 | – | 12.09
House | 88.67 | 6.09 | – | 58.03 | 22.72 | 5.88 | – | 14.74

Size on disk:
Dataset | SIAN-2 (default) | SIAN-2 (block sparse) | SIAN-2 (compressed) | NODE-GA2M | SIAN-1 (default) | SIAN-1 (block sparse) | SIAN-1 (compressed) | NAM
Wine | 7,113 | 7,113 | 537 | 1,040 | 182 | 182 | 86 | 1,203
Bike | 9,727 | 9,666 | 626 | 1,040 | 218 | 212 | 92 | 308
House | 1,526 | 1,526 | 251 | 1,040 | 83 | 83 | 59 | 1,921
The FIS algorithm we describe can be applied to other combinations of feature interaction detection algorithm and functional model besides Archipelago and ReLU neural networks. To demonstrate the general applicability of our scheme, we replace the continuous, piecewise-linear functions of ReLU neural networks with the piecewise-constant, differentiable decision trees of NODE-GAM. We extend the original implementation of NODE-GAM to handle feature triplets, extending the method to a trivariate function or GAM-3 model. We run our FIS algorithm on both the Housing and Wine datasets to fit GAM-3 models using the inductive biases of the NODE architecture.
Extending from NODE-GA2M to NODE-GA3M improves performance on the Housing dataset, a further improvement over the previous state-of-the-art method. The same extension delivers only a small improvement on the Wine dataset; however, both SIAN-5 and NODE-GA3M are able to improve performance over all previously available additive models. These two examples demonstrate the ability of our FIS pipeline to be applied to more general machine learning techniques to model higher-order interactions.
In this section, we further explore the feature interactions learned by SIAN and FIS across multiple datasets. We provide multiple visualizations of the shape functions learned by SIAN to give a sense of the diverse analyses and interpretations made possible by additive models. Further discussion and graphics are provided in Appendices C and D.
In Figure 2a, we visualize the interaction strength measured for each of the singles, pairs, and triples of features from the Bike Sharing dataset. The four top-rated interactions (three of which are visually separated from the main body of the histogram) are, from right to left, ["hour", "hour X workday", "workday", "temperature"]. In Figure 2b, we see the effect of gradually adding feature interactions to our model. There is a steep jump in performance when we are able to model the first pair, depicted in Figure 2c. Together with Figure 2d, these two visualizations alone explain 84% of the variance in the Bike Sharing dataset. Importantly, these two trends are interpretable and agree with our intuition about when people are more likely to bike: on work days, we see spikes at 8 a.m. and 5 p.m., corresponding to the beginning and end of work hours; on weekends and holidays, there is a steady demand for bikes throughout the afternoon. There are also more bikers during warmer temperatures.
In Figure 3, we set out to visualize one of the three-dimensional interactions, which occurs between the three features of latitude, longitude, and population. In Figure 3a, we see the rise in housing price along the coast of California, especially around the metropolitan areas of Los Angeles and San Francisco. In Figure 3b, we see that house price does not monotonically increase with population as we might expect for higher population densities. A detailed look into the dataset reveals that the 'population' feature being used is accumulated at the census block level, creating an inverse relationship with population density, as areas like LA and SF are subdivided more than their suburban and rural counterparts. Although it is possible that location and population density might have independent effects on housing price, the dataset nevertheless induces an interaction between location and population. In Figure 3c, we attempt to visualize this 3D interaction by viewing subsamples across two population sizes and over three regions. We see the network has learned to differentiate the urban regions of LA and SF from the rural coast of northern California, where high population becomes indicative of higher population densities and higher housing prices. We note that the trends learned by SIAN tend to be very continuous, which is a potential shortcoming in learning fine-grained block-level information in cities like Los Angeles and San Francisco. In light of these concerns, we run experiments on a three-dimensional extension of the piecewise-constant, discontinuous NODE-GAM model in Section 5.2.
In Figure 4, we see two example trends from the SIAN trained to predict mortality given hospital data. On the left, we see that health risk gradually increases with age. On the right, we see the trend with respect to the Glasgow Coma Scale indicator which is a measure of alertness and consciousness. The lowest and highest scores correspond to severe head injury and full alertness, respectively. We see that very low GCS corresponds to high risk and that the risk decreases as GCS increases. At the score of 13, however, we see that an increase to 14 or 15 actually increases mortality risk, defying the intuition that risk should be monotone in the GCS score. It is highly likely that this dip in mortality risk comes from the special care given to patients with GCS scores below 15, compared to their counterparts with perfect scores.
This trend illustrates a phenomenon occurring throughout machine learning applications in healthcare where correlation and causation are conflated with one another. Similar results have been found linking asthma to a decreased risk of death from pneumonia [caruana2015intelligible]. These issues might go unnoticed in black-box models, whereas interpretable models can uncover and resolve such discrepancies before deployment. While discovering and correcting interpretable trends brings clear advantages, it is also possible for these trends to be misinterpreted by non-experts as a causal relationship. Such false causal discoveries can not only lead to physical harm in the domain of medicine, but also to larger social harm in broader AI systems and applications.
A primary limitation of the current work is its focus on multilayer perceptrons, whereas modern state-of-the-art results are dominated by geometric deep learning and transformer architectures. Extending this procedure to more general architectures is a key direction for bringing interpretability to domains like computer vision and natural language. Another important direction of research is a better theoretical understanding of how the benefits of SIAN scale with the dimensionality of the dataset and the number of samples provided, especially in the presence of noise, as well as theory which accurately models the empirically observed distributions of feature interactions.
Multiple experiments confirm that SIAN can produce powerful and interpretable machine learning models which match the performance of state-of-the-art benchmarks like deep neural networks and NODE-GA2M. Further experiments show that FIS can be applied to more general machine learning algorithms. Hopefully, future work will be able to further clarify this sparse interaction perspective and help deepen our understanding of the generalization performance of neural networks.
Let us first discuss a pseudo-boolean function of the form $f : \{-1, 1\}^n \to \mathbb{R}$. Every such function has a unique decomposition as a multilinear polynomial:
$$f(x) = \sum_{S \subseteq [n]} \hat{f}(S)\, \chi_S(x),$$
where $\chi_S(x) = \prod_{i \in S} x_i$ is the monomial on the coordinates in $S$. This multilinear decomposition is usually referred to as the Fourier expansion of the function and the $\hat{f}(S)$ are called the Fourier coefficients. Instead of estimating the functional values, we can instead estimate the Fourier coefficients. Although there is no immediate benefit to this shift in perspective, an abundance of practical assumptions like submodularity, disjunctive normal forms, low-order approximates, and a decaying Fourier spectrum can be made to imply that only a sparse subset of the full coefficients is worth estimating. A further discussion of these assumptions in the context of boolean functions can be found in [donnel2008someTopicsBooleanFunctions, stobbe2012sparseFourierSet, raskhodnikova2012pseudoBooleanKDNF, odonnel2021booleanFunctionBook].
For our context, it suffices to say that only a small number of the full $2^n$ coefficients are nonzero. Our problem hence becomes a sparse selection problem in the space of Fourier coefficients, which can equivalently be thought of as a Feature Interaction Selection (FIS) problem in the functional space of pseudo-boolean functions.
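To make the sparse-coefficient picture concrete, here is a minimal sketch (not the paper's code; the toy function is an assumption) that brute-forces the Fourier coefficients of a pseudo-boolean function and checks that only a few survive:

```python
import itertools
import numpy as np

# Illustrative sketch (not the paper's code): recover the Fourier
# coefficients of a pseudo-boolean function f : {-1, 1}^n -> R by
# brute-force enumeration, then check that only a sparse subset survives.

def fourier_coefficients(f, n):
    """Return {S: f_hat(S)} where f_hat(S) = E_x[f(x) * prod_{i in S} x_i]."""
    cube = list(itertools.product([-1, 1], repeat=n))
    coeffs = {}
    for r in range(n + 1):
        for S in itertools.combinations(range(n), r):
            chi = [np.prod([x[i] for i in S]) for x in cube]  # monomial chi_S
            coeffs[S] = float(np.mean([f(x) * c for x, c in zip(cube, chi)]))
    return coeffs

# A sparse toy target (an assumption): only {0} and {1, 2} carry signal.
f = lambda x: 2.0 * x[0] + 3.0 * x[1] * x[2]
coeffs = fourier_coefficients(f, n=4)
nonzero = {S: v for S, v in coeffs.items() if abs(v) > 1e-9}
print(nonzero)  # only the subsets (0,) and (1, 2), out of 2^4 = 16
```

The brute-force enumeration costs $O(4^n)$ and is only feasible for tiny $n$, which is precisely why a tractable selection procedure is needed in practice.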
Given exact functional values, Archipelago recovers the upper set of Fourier coefficients. Specifically, given a subset $S \subseteq [n]$ and defining:
We have that the upper cone set of is recovered by the Archipelago measure:
where and . We also write and . Continuing:
We know that the second summand will have every term equal to zero. Without loss of generality, take an element belonging to one set but not the other; we can always find such an element since the two sets are not equal. We can then factor out an independent term and take the expectation over these independent quantities:
Hence, we can now focus on a given and consider what happens to:
Following a similar argument, we can partition into two summands consisting of and . Without loss of generality, pick one such element in one set but not in the other. For each choice of , there is a choice of either or which can be factored out from one term and not the other; by assumption there is no . Without loss of generality, let it be .
Now, we are focusing on only a specific and , giving:
Writing the symmetric difference as , our quantity is equal to:
This expectation will be equal to zero if the product is nonempty, following the same logic as before by taking the expectation over one of the independent or . Otherwise, the expectation will be equal to one. Hence, it is only equal to one for the terms in which and our quantity of interest is equal to:
This sum will, however, be equal to zero unless . Suppose instead that there is an . Consider the mapping from pairs where to the pairs which have . These sets clearly partition the collection of pairs, and the mapping is a bijection between the two subsets. Hence, the two sets are of the same size and with odd parity of . Together, this implies the sum above will be equal to zero for such cases. In the alternate case that , we know that or , meaning we instead consider .

Altogether, our original quantity of interest has now become:
∎
Using the lemma above, we know that the measurement on the pseudo-boolean function will be the upper cone. From this it follows directly that any member of the downward closure of will have positive upper-cone size and will be added to the candidate set of interactions. ∎
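The downward-closure step that builds the candidate set can be sketched as follows; `downward_closure` is a hypothetical helper for illustration, not the paper's implementation:

```python
import itertools

# Sketch (assumed helper, not the paper's code): the candidate set of
# interactions is the downward closure of the detected interactions,
# i.e. every subset of every detected feature set.

def downward_closure(detected):
    closure = set()
    for S in detected:
        for r in range(len(S) + 1):
            closure.update(itertools.combinations(sorted(S), r))
    return closure

cands = downward_closure([(1, 2, 3), (4,)])
print(sorted(cands, key=lambda s: (len(s), s)))
# contains (), all singletons of {1, 2, 3, 4} except where undetected,
# every pair inside {1, 2, 3}, and the full triple (1, 2, 3)
```

Taking the closure guarantees that whenever a high-order interaction is kept, all of its lower-order marginal effects are also available to the additive model.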
Given a more general input space $\mathcal{X}$ and a function $f : \mathcal{X} \to \mathbb{R}$, we can define another decomposition similar to the Fourier decomposition from before. Given some fixed distribution $\mathcal{D}$ on our input space $\mathcal{X}$, we can define the conditional expectation functions as:
$$\bar{f}_S(x_S) = \mathbb{E}_{x \sim \mathcal{D}}\!\left[ f(x) \mid x_S \right],$$
and inductively define $f_S$ with:
$$f_S(x_S) = \bar{f}_S(x_S) - \sum_{T \subsetneq S} f_T(x_T).$$
We can now decompose our original function as:
$$f(x) = \sum_{S \subseteq [d]} f_S(x_S),$$
where each of these functions corresponds to a subset of the entire feature space and depends only on that set of features.
Moreover, we see that these functions are orthogonal with respect to the distribution.
We can more enticingly write this as:
For instance, the function $f_\emptyset$ corresponds to the average value of the function $f$ under the input distribution, and the function $f_{\{i\}}$ corresponds to the expectation of the function conditioned on knowing variable $x_i$, minus the mean value. As a running example, one can consider a simple function using the uniform distribution over the square and compute each of the terms $f_\emptyset$, $f_{\{1\}}$, $f_{\{2\}}$, and $f_{\{1,2\}}$ of the decomposition in closed form.

Formally, let us assume that our function is measurable and integrable with respect to our probability measure, and that additionally each such subfunction is well defined and similarly integrable for all product measures. In this case, all of the following expectations (and integrations) that we write are well defined and exchangeable. Under these assumptions, we have the sum decomposition as written above.
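As a concrete check of the decomposition above, the following sketch (the toy function and grid are assumptions, not the paper's setup) computes each term $f_S$ on a finite grid under the uniform product distribution and verifies that the terms sum back to $f$:

```python
import itertools
import numpy as np

# Hedged sketch (toy function and grid are assumptions): compute each
# ANOVA term f_S = E[f | x_S] - sum over strict subsets T of f_T,
# on a finite grid under the uniform product distribution.

grid = [-1.0, 0.0, 1.0]                      # values for each of d = 2 features
points = list(itertools.product(grid, repeat=2))
f = lambda x: x[0] + 2.0 * x[0] * x[1]       # toy target function

def cond_mean(S, xS):
    """E[f(x) | x_i = xS[j] for the j-th index i in S], uniform distribution."""
    vals = [f(p) for p in points if all(p[i] == xS[j] for j, i in enumerate(S))]
    return float(np.mean(vals))

def f_S(S, xS):
    """ANOVA term: conditional mean minus all strictly smaller terms."""
    total = cond_mean(S, xS)
    for r in range(len(S)):
        for T in itertools.combinations(S, r):
            total -= f_S(T, tuple(xS[S.index(i)] for i in T))
    return total

# The terms sum back to f at every grid point, e.g. x = (1, -1).
x = (1.0, -1.0)
terms = [f_S(S, tuple(x[i] for i in S))
         for r in range(3) for S in itertools.combinations(range(2), r)]
print(sum(terms), f(x))  # the two values agree
```

For this toy target, $f_\emptyset = 0$, $f_{\{0\}}(x_0) = x_0$, $f_{\{1\}}(x_1) = 0$, and $f_{\{0,1\}}(x_0, x_1) = 2 x_0 x_1$, matching the closed-form interpretation in the text.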
Let us first note that:
and consider arbitrary subsets $S \neq T$. Let us define
, , , .
Further, we see that:
Let us further focus on
We are now able to take the expectation, and we see that our outside expectation can be reduced to an expectation over because, for each feature not in the set , at least one of the functional terms in the product is constant with respect to that variable. In more detail, anything outside of will have the first term be constant, and anything outside of will have the second term be constant. With respect to any of these variables, we are able to use the linearity of expectation to show that the reduced expectation will have the same value as the original expectation. Consequently, we may reduce this expectation over all variables except those in , and our expectation reduces to:
In greater detail, we have that when we define , , , , and that:
where each step is done by the linearity of expectation and the law of total expectation.
Since this is true for any such choice, we now focus on the entire summation. Let us assume that $S \neq T$, so that either $S \setminus T$ or $T \setminus S$ is nonempty. Let us assume without loss of generality that it is $S \setminus T$ which is nonempty. It then follows that
Altogether, we can go back to our original reformulation to get our desired result:
∎
Moreover, our model approximating the function decomposes in the same way, and we divide our estimation problem into separate signal estimation problems with:
In the next two subsections, we will further develop this framework to lay out why such a decomposition should succeed by estimating only a subset of all possible interactions. In particular, we will show that when is large, it will become likely that
as the effective signal-to-noise ratio shrinks as
grows. In other words, noisy estimates of the signal on average perform even worse than the baseline guess of zero signal.

For this section we will focus on a Fourier signal on the dimensional rescaled torus .
If we assume that our signal is in the class of differentiable smooth functions, then our Fourier coefficients decay like . This is a known fact from Fourier analysis and can be derived by applying integration by parts to the Fourier series.
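As a reminder of where such decay rates come from (a textbook fact, not specific to this paper), integrating by parts $k$ times for a $k$-times continuously differentiable $2\pi$-periodic signal $g$ gives:

```latex
\hat{g}(n)
  = \frac{1}{2\pi} \int_{-\pi}^{\pi} g(x)\, e^{-inx}\, dx
  = \frac{1}{2\pi\,(in)^{k}} \int_{-\pi}^{\pi} g^{(k)}(x)\, e^{-inx}\, dx,
\qquad\text{so}\qquad
\lvert \hat{g}(n) \rvert \le \frac{\lVert g^{(k)} \rVert_{\infty}}{\lvert n \rvert^{k}},
```

where the boundary terms vanish at each step by periodicity; smoother signals therefore have faster-decaying spectra.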
Consequently, we will consider a random signal defi