# Detection of Cooperative Interactions in Logistic Regression Models

An important problem in the field of bioinformatics is to identify interactive effects among profiled variables for outcome prediction. In this paper, a logistic regression model with pairwise interactions among a set of binary covariates is considered. Modeling the structure of the interactions by a graph, our goal is to recover the interaction graph from independently identically distributed (i.i.d.) samples of the covariates and the outcome. When viewed as a feature selection problem, a simple quantity called influence is proposed as a measure of the marginal effects of the interaction terms on the outcome. For the case when the underlying interaction graph is known to be acyclic, it is shown that a simple algorithm that is based on a maximum-weight spanning tree with respect to the plug-in estimates of the influences not only has strong theoretical performance guarantees, but can also outperform generic feature selection algorithms for recovering the interaction graph from i.i.d. samples of the covariates and the outcome. Our results can also be extended to the model that includes both individual effects and pairwise interactions via the help of an auxiliary covariate.

## Authors

• 1 publication
• 31 publications
• 10 publications
• 63 publications
• ### On marginal and conditional parameters in logistic regression models

A fundamental research question is how much a variation in a covariate i...
04/09/2018 ∙ by Elena Stanghellini, et al. ∙ 0

• ### High Dimensional Logistic Regression Under Network Dependence

Logistic regression is one of the most fundamental methods for modeling ...
10/07/2021 ∙ by Somabha Mukherjee, et al. ∙ 0

• ### A Permutation Approach to Testing Interactions in Many Dimensions

To date, testing interactions in high dimensions has been a challenging ...
06/27/2012 ∙ by Noah Simon, et al. ∙ 0

• ### Using relative weight analysis with residualization to detect relevant nonlinear interaction effects in ordinary and logistic regressions

Relative weight analysis is a classic tool for detecting whether one var...
06/26/2021 ∙ by Maikol Solís, et al. ∙ 0

• ### A general approach to detect gene (G)-environment (E) additive interaction leveraging G-E independence in case-control studies

It is increasingly of interest in statistical genetics to test for the p...
08/18/2018 ∙ by Eric J. Tchetgen Tchetgen, et al. ∙ 0

• ### Recovering Pairwise Interactions Using Neural Networks

Recovering pairwise interactions, i.e. pairs of input features whose joi...
01/24/2019 ∙ by Tianyu Cui, et al. ∙ 40

• ### Purifying Interaction Effects with the Functional ANOVA: An Efficient Algorithm for Recovering Identifiable Additive Models

Recent methods for training generalized additive models (GAMs) with pair...
11/12/2019 ∙ by Benjamin Lengerich, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Introduction

Consider a regression problem with independent covariates and a binary outcome variable

. The covariates are assumed to be uniformly distributed over

, and the conditional probabilities of the outcome given the covariates are assumed to be logistic:

 Pr(y=+1|xV=xV) =σ(∑{i,j}∈Eβ{i,j}xixj), (1) Pr(y=−1|xV=xV) =σ(−∑{i,j}∈Eβ{i,j}xixj) (2)

for some real constants ’s, where , , , and

is the sigmoid function. It is straightforward to verify that

for any , so we have for any .

For any two distinct , we say that the covariates and interact if and only if . Let be a simple graph with the vertex set and edge set . Then

captures all pairwise interactions between the covariates in determining the odds of the outcome of interest. Our goal is to recover the graph

from independently identically distributed (i.i.d.) samples of .

Our main motivation for considering the above pairwise interaction problem is from computational biology, where each covariate represents the expression of a biomarker (a gene or an environmental factor), and the variable represents the phenotypic outcome with respect to a specific phenotype. In computational biology, many complex diseases, such as cancer and diabetes, are conjectured to have complicated underlying disease mechanisms [1, 3, 4, 2, 5, 6, 7]. Multiple candidate risk factors, either genetic or environmental, and their interactions are known to play critical roles in triggering and determining the development of a large family of diseases [1, 3, 4, 2, 5, 6, 7]. Identifying interactive effects among profiled variables not only helps with more accurate identification of critical risk factors for outcome prediction, but also helps reveal functional interactions and understand aberrant system changes that are specifically related to the outcome for effective systems intervention. Our model, of course, is a simplified one from real-world situations, and is studied here since it captures some essential features of the problem (as we shall see shortly) while being relatively simple.

Note that if we let

 z{i,j}:=xixj,∀{i,j}∈E (3)

and consider ’s (instead of ’s) as the covariates, then the problem of recovering the graph can be viewed as a feature selection

problem in statistics and machine learning. In

[8], a basic approach for feature selection is to first use Shannon’s mutual information [9] to measure the marginal effects of the covariates on the outcome, and then select the features based on the ranking of the mutual information. More advanced approaches such as the immensely popular mRMR method [10]

make incremental selections while taking into account both the relevance to the outcome and the redundancy among the selected features. However, even though Shannon’s mutual information provides a compact and model-free measure of correlation between the covariates and the outcome, which is well accepted in the statistics and computational biology communities, it is a complex function of the underlying joint distribution and hence difficult to analyze and estimate from limited data samples. As a result, when applied to specific regression models, the performance of the generic feature selection algorithms is usually difficult to characterize.

Motivated by the recent progress on learning Ising models over arbitrary graphs [11], in this paper we propose a quantity called influence as a measure of the marginal effects of ’s on the outcome . Compared with Shannon’s mutual information, influence is a simple function of the low-order joint probabilities between ’s and the outcome , and hence is much easier to analyze and estimate. When the underlying graph is known to be acyclic, we show that, a simple algorithm that is based on a maximum-weight spanning tree with respect to the “plug-in” estimate of the influences and followed by simple thresholding operations, not only has strong theoretical performance guarantees, but can also outperform the generic feature selection algorithms for recovering from i.i.d. samples of .

The rest of the paper is organized as follows. In Section II, we show that any acyclic can be identified from the influences of ’s on the outcome . Building on the results from Section II, in Section III we show that any acyclic can be recovered with probability at least from i.i.d. samples of , where . In Section IV, we extend our results of the above sections to the model involving both individual effects and cooperative interactions. In Section V, we use computer simulations to demonstrate that the proposed algorithm can outperform the generic feature selection algorithms. Finally, in Section VI we conclude the paper with some remarks.

Notation

. Random variables are written in serif font, and sets are written in capital letters.

## Ii Identification of Cooperative Interactions from Low-Order Joint Probabilities

Our main result in this section is to show that any acyclic can be identified from the low-order joint probabilities . Towards this goal, let be a weight assignment over given by:

 w{i,j}: =∣∣Pr(y=+1|xi=+1,xj=+1)− Pr(y=−1|xi=+1,xj=+1)∣∣ (4) =∣∣2Pr(y=+1|xi=+1,xj=+1)−1∣∣ (5) =∣∣8Pr(xi=+1,xj=+1,y=+1)−1∣∣ (6)

for any . Here, (5) follows from the fact that

 Pr (y=+1|xi=+1,xj=+1)+ Pr(y=−1|xi=+1,xj=+1)=1

and (6) is due to the fact that .

The following proposition helps to clarify the meaning of the weight assignment as defined in (4).

###### Proposition 1 (Influence)

Assume that is acyclic. We have

 w{i,j} =∣∣Pr(y=+1|xi=+1,xj=+1)− Pr(y=+1|xi=+1,xj=−1)∣∣ (7)

for any .

See Section A-A.

By (7), indicates whether the product has any influence on the event and hence can be a useful indication on whether . This intuition is partially justified by the following proposition.

###### Proposition 2 (Direct influence)

Assume that is acyclic. We have for any .

See Section A-B.

We say that the product has a direct influence on the outcome if . The above proposition guarantees that direct influences are strictly positive when is acyclic. The following proposition provides a partial converse to Proposition 2.

###### Proposition 3 (Zero influence)

Assume that is acyclic. Then for any two distinct , we have if and are disconnected in , or the unique path between and in has an even length.

See Section A-C.

###### Theorem 1 (Union of stars)

Assume that each connected component of is a star. Then for any two distinct , we have if and only if .

This follows immediately from Propositions 2 and 3 and the fact that if each connected component of is a star (which implies that is acyclic), then any two distinct such that must be either disconnected (if they belong to two different connected components) or connected by a unique path of length two (if they belong to the same connected component) in .

The following example, however, shows that the converse of Proposition 2 is not true in general. Consider , , and . Note that the graph here is acyclic and the unique path between and is of length three. It is straightforward to calculate that

 w{1,4} =(e−1)32(e3+1)>0,

even though .

For any , we say that the product has an indirect influence on the outcome if . Due to the possible existence of indirect influences, unlike the unions of stars, a general acyclic cannot be recovered via edge-by-edge identifications.

The following proposition, however, shows that indirect influences are locally dominated by direct influences.

###### Proposition 4 (Indirect influence)

Assume that is acyclic, and let be a path of length in . Then, we have for any .

Note that when is even, by Propositions 2 and 3 we have for any . Therefore, we only need to consider the cases where is odd, for which the proof can be found in Appendix A-D.

Let . A weight assignment over is said to have strict separation between and if there exists a real constant such that for any and . The consequence of strict separation and local dominance is summarized in the following proposition.

###### Proposition 5

Assume that is acyclic, let be a weight assignment over satisfying: 1) (strict separation) there exists a real constant such that for any and ; and 2) (local dominance) for any path of length in and any . Then for any maximum-weight spanning tree with respect to the weight assignment , we have where .

See Section A-E.

The following theorem is the main result of this section.

###### Theorem 2 (Acyclic graphs)

Assume that is acyclic, and let be a maximum-weight spanning tree with respect to the weight assignment as defined in (4). Then, for any two distinct we have if and only if and .

Note that by Propositions 2 and 3, we have for any and . By Proposition 4, we have for any path of length in and any . The theorem thus follows directly from Proposition 5 with and .

## Iii Detection of Cooperative Interactions from Finite Data Samples

Let , be i.i.d. samples of . To recover the graph , we shall assign to each a weight that is based on the empirical joint probability:

 ^w{i,j} :=∣∣ ∣∣8nn∑t=11((xi[t],xj[t],y[t])=(+1,+1,+1))−1∣∣ ∣∣, (8)

where is the indicator function. By (6), for any we have converging to in probability in the limit as . The following simple proposition, which follows directly from the well-known Hoeffding’s inequality [12], provides a bound on the rate at which the weight assignment converges uniformly to .

###### Proposition 6

For any and ,

 Pr(∣∣^w{i,j}−w{i,j}∣∣≥η)≤2e−nη2/32. (9)

Note that

 ∣∣^w{i,j} −w{i,j}∣∣≤8∣∣∣Pr(xi=+1,xj=+1,y=+1) −1nn∑t=11((xi[t],xj[t],y[t])=(+1,+1,+1))∣∣∣.

We then finish the proof by applying the Hoeffding’s inequality [12].

The following propositions, which are generalizations of Propositions 2 and 4 respectively, play a key role in adapting the results of Theorem 2 from the weight assignment to .

###### Proposition 7

Assume that is acyclic and for any we have for some . Let

 γ:=√2πd[σ(λ+3μ)−σ(−λ+3μ)]>0. (10)

We have for any .

See Section A-B.

###### Proposition 8

Assume that is acyclic and for any we have for some . We have for any path of length in and any , where is defined as in (10).

See Section A-D.

Given Propositions 7 and 8, it is clear that if the estimation error is uniformly bounded by , an acyclic can be recovered from , similar to that from .

###### Theorem 3

Assume that is acyclic and for any we have for some . Let be a maximum-weight spanning tree with respect to the weight assignment . If

 ∣∣^w{i,j}−w{i,j}∣∣<γ2,∀{i,j}∈E, (11)

then for any two distinct we have if and only if and .

By Proposition 7, we have for any and . Under assumption (11), this implies that for any and . By Proposition 8, we have for any path of length in and any . Under assumption (11), this implies that for any path of length in and any . The theorem thus follows directly from Proposition 5 with and .

We then establish the following algorithm to detect based on Theorem 3.

The sample complexity of the algorithm is summarized in the following theorem.

###### Theorem 4

Assume that is acyclic and for any we have for some . Fix and let be a positive integer such that

 n ≥128γ2logd2ϵ=64πd[σ(λ+3μ)−σ(−λ+3μ)]2logd2ϵ. (12)

Then with probability at least , the algorithm can successfully detect the graph from i.i.d. samples of .

By Proposition 6 and Theorem 3, we have

 Pr(^G=G) ≥Pr(⋂{i,j}∈E{∣∣^wi,j−wi,j∣∣<γ2}) ≥1−d(d−1)e−nγ2128≥1−d2⋅ϵd2=1−ϵ.

This completes the proof of the theorem.

## Iv Extension to Models with both Individual Effects and Cooperative Interactions

In this section, we extend the results of Sections II and III to the models that include both individual effects and cooperative interactions. More specifically, we shall assume that the conditional probability of the outcome given the covariates are given by:

 Pr(y=+1|xV=xV) =σ(∑i∈Vβixi+∑{i,j}∈Eβ{i,j}xixj), (13) Pr(y=−1|xV=xV) =σ(−∑i∈Vβixi−∑{i,j}∈Eβ{i,j}xixj) (14)

for some real constants ’s and ’s. For any , we say that the covariate has an individual effect on if and only if ; for any , we say that the covariates and interact if and only if . Let , , and for all . Then, the structure of the model (including both individual effects and cooperative interactions) is fully captured by the simple graph , where . As before, our goal is to recover the graph from i.i.d. samples of .

Toward this goal, we shall introduce an additional covariate , which we assume to be uniformly over and independent of , and use it to define an auxiliary model, for which the conditional probabilities of the outcome given the covariates are given by:

 Pr(~y=+1|x~V=x~V) =σ(∑{i,j}∈~Eβ{i,j}xixj), (15) Pr(~y=−1|x~V=x~V) =σ(−∑{i,j}∈~Eβ{i,j}xixj). (16)

By the results of Section II, if the underlying graph is known to be acyclic, it can be recovered from the weight assignment:

 ~w{i,j}:=∣∣2Pr(~y=+1|xi=+1,xj=+1)−1∣∣ (17)

for all .

Note that when , we trivially have

 Pr (~y=+1|xi=+1,x0=+1)=Pr(y=+1|xi=+1),

so

 (18)

for all . On the other hand, when and , we may write

 Pr (~y=+1|xi=+1,xj=+1) =12Pr(~y=+1|xi=+1,xj=+1,x0=+1)+ 12Pr(~y=+1|xi=+1,xj=+1,x0=−1) =12Pr(y=+1|xi=+1,xj=+1)+ 12Pr(~y=+1|xi=+1,xj=+1,x0=−1).

To proceed, further note that

 2d−2Pr(~y=+1|xi=+1,xj=+1,x0=−1) =∑xV:(xi,xj)=(+1,+1)Pr(~y=+1|xV=xV,x0=−1) =∑xV:(xi,xj)=(+1,+1)σ(−∑k∈Vβkxk+∑{k,l}∈Eβ{k,l}xkxl) =∑xV:(xi,xj)=(−1,−1)σ(∑k∈Vβkxk+∑{k,l}∈Eβ{k,l}xkxl) (19) =∑xV:(xi,xj)=(−1,−1)Pr(y=+1|xV=xV) =2d−2Pr(y=+1|xi=−1,xj=−1)

where (19) follows from the simple change of variable . It thus follows that

 Pr (~y=+1|xi=+1,xj=+1,x0=−1) =Pr(y=+1|xi=−1,xj=−1)

and hence

 Pr (~y=+1|xi=+1,xj=+1) =12Pr(y=+1|xi=+1,xj=+1)+ 12Pr(y=+1|xi=−1,xj=−1)

giving

 ~w{i,j} =∣∣Pr(y=+1|xi=+1,xj=+1)+ Pr(y=+1|xi=−1,xj=−1)−1∣∣ (20)

for any .

Combining (18) and (20), we conclude that the weight assignment , and hence any acyclic , can be fully recovered from the low-order joint probabilities . The results of Section III can be extended similarly; the details are omitted.

## V Simulation Results

In our simulations, we randomly generate 5,000 logistic regression models, each including 10 independent binary covariates. For each model that we generate, we randomly choose 5 individual effects and 5 interaction pairs, resulting in an acyclic graph as its underlying structure. The model parameters for each individual effect and for each interaction pair are randomly assigned from a uniform distribution over with . In Fig. 1, we compare the detection rate of Algorithm 1 for different parameter ranges , , and , under the sample sizes of 300, 600, 900, 1,200 and 1,500, respectively. Here, we emphasize that a logistic regression model is correctly detected if and only if all 10 features, including 5 individual effects and 5 interaction pairs, are correctly detected. Clearly, the detection rate increases as the lower and upper bounds of the parameter range increase, and Algorithm 1 can achieve a high detection rate (at least 93%) with a reasonably large number of training samples (around 1200). Also in Fig. 1, we plot the fitted curves of the detection rate with respect to the increasing sample size based on the functional form that we derived in (12) in Section III. It is clear that the curves fit very well with the empirical results, thus validating the order-tightness of the lower bound in (12).

Next, we shall compare the performance of our algorithm (Algorithm 1) with three generic feature selection algorithms: mRMR forward selection [10], mutual information (MI) ranking [8], and the problem-specific -regularized logistic regression algorithm [13, 14]. For the mRMR forward selection, MI ranking, and -regularized logistic regression algorithms, we shall view each of the single variables ’s and the interaction terms ’s as a separate feature, and assume that the number of features to be selected is known. The estimation of mutual information in the mRMR forward selection and MI ranking algorithms is based on [15, 16]. For the -regularized logistic regression algorithm, we tune the regularization parameter till the desired number of nonzero coefficients is obtained.

In Fig. 2, we compare the detection rate of Algorithm 1 with that of mRMR forward selection, MI ranking, and -regularized logistic regression algorithms for under a finite number of data samples. As we can see, Algorithm 1 achieves a significantly higher detection rate than the mRMR forward selection and the MI ranking algorithms, especially when the sample size is relatively small. This is due to the facts that: 1) Algorithm 1 exploits the fact that the underlying interaction graph is acyclic while the other two algorithms do not; 2) the proposed influence measure is much easier to estimate than MI. (The performances of the mRMR forward selection and the MI ranking algorithms are nearly identical since the feature candidates are pairwise independent.) On the other hand, the performances of Algorithm 1 and the -regularized logistic regression algorithm appear to be very comparable. It is somewhat surprising that the -regularized logistic regression algorithm performs well for typical problem instances with finite sample sizes. Intuitively, this is related to the fact that acyclic graphs are “sparse” graphs. However, analyzing the sample complexity of the -regularized logistic regression algorithm appears to be very challenging.

For completeness, in Fig. 3 and Fig. 4 we also compare the miss detection/false positive rate and prediction accuracy of Algorithm 1 with those of the mRMR forward selection, the MI ranking, and the -regularized logistic regression algorithms. Note that since the number of features selected is fixed in this case, the miss detection and false positive rates are identical. The prediction accuracy is calculated as follows. For each logistic regression model that we generate, we first obtain the model structure via one of the algorithms under the consideration, followed by standard logistic regression for parameter estimation. Once each logistic regression model is reconstructed, we randomly generate 200 testing samples to estimate the accuracy of the outcome prediction. As we can see, the relative performance among these algorithms is very similar to the case with the detection rate.

## Vi Concluding Remarks

An important problem in bioinformatics is to identify interactive effects among profiled variables for outcome prediction. In this paper, a simple logistic regression model with pairwise interactions among the binary covariates was considered. Modeling the structure of the interactions by a graph , our goal was to recover from i.i.d. samples of the covariates and the outcome. When viewed as a feature selection problem, a simple quantity called influence is proposed as a measure of the marginal effects of the interaction terms on the outcome. For the case where is known to be acyclic, it is shown that a simple algorithm that is based on a maximum-weight spanning tree with respect to the plug-in estimates of the influences not only has strong theoretical performance guarantees, but can also outperform generic feature selection algorithms for recovering the graph from i.i.d. samples of the covariates and the outcome. A sample complexity analysis for detecting the interaction graph was provided, and the results were further extended to the model that includes both individual effects and pairwise interactions.

In our future work, we would like to understand the behavior of the -regularized logistic regression algorithm from a theoretical standpoint, and also extend our results to the more challenging case where the interaction graph might be cyclic.

## References

• [1] J. H. Moore, “The Ubiquitous Nature of Epistasis in Determining Susceptibility to Common Human Diseases,” Hum. Hered., vol. 56, no. 1–3, pp. 73–82, Nov. 2003.
• [2] Y. Chung, S. Y. Lee, R. C. Elston, and T. Park, “Odds Ratio Based Multifactor-Dimensionality Reduction Method for Detecting Gene-Gene Interactions,” Bioinformatics, vol. 23, no. 1, pp. 71–76, Jan. 2007.
• [3] D. Anastassiou, “Computational Analysis of the Synergy among Multiple Interacting Genes,” Mol. Syst. Biol., vol. 3, no. 83, pp. 1–8, Feb. 2007.
• [4] J. A. Eddy, J. Sung, D. Geman, and N. D. Price, “Relative Expression Analysis for Molecular Cancer Diagnosis and Prognosis,” Tech. Cancer Res. Treat., vol. 9, no. 2, pp. 149–159, Apr. 2010.
• [5] K. Hoon, J. Watkinson, and D. Anastassiou, “Biomarker Discovery Using Statistically Significant Gene Sets,” J. Comput. Biol., vol. 18, no. 10, pp. 1–10, Oct. 2011.
• [6] A. A. Adl, X. Qian, P. Xu, K. Vehik, and J. Krischer, “Feature Ranking Based on Synergy Networks to Identify Prognostic Markers in DPT-1,” EURASIP J. Bioinform. Syst. Biol., vol. 2013, no. 1, pp. 1–9, Sept. 2013.
• [7] N. A. Sakhanenko and D. J. Galas, “Biological Data Analysis as an Information Theory Problem: Multivariable Dependence Measures and the Shadows Algorithm,” J. Comput. Biol., vol. 22, no. 11, pp. 1005-1024, Nov. 2015.
• [8] Y. Saeys, I. Inza, and P. Larranaga, “A Review of Feature Selection Techniques in Bioinformatics,” Bioinformatics, vol. 23, no. 19, pp. 2507–2517, Oct. 2007.
• [9] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed., Hoboken, NJ: John Wiley & Sons, July 2006.
• [10] H. C. Peng, F. Long, and C. Ding, “Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, pp. 1226–1238, Aug. 2005.
• [11] G. Bresler, “Efficiently Learning Ising Models on Arbitrary Graphs,” Proc. 47th Annu. ACM Symp. Theory Comput., pp. 771–782, June 2015.
• [12] W. Hoeffding, “Probability Inequalities for Sums of Bounded Random Variables,” J. Amer. Stat. Assoc., vol. 58, no. 301, pp. 13–30, Mar. 1963.
• [13] S.-I. Lee, H. Lee, P. Abbeel, and A. Y. Ng, “Efficient -regularized Logistic Regression,” Proc. Nat. Conf. Artif. Intel., vol. 21, no. 1, pp. 401–408, July 2006.
• [14] M. Y. Park and T. Hastie, “-regularization Path Algorithm for Generalized Linear Models,” J. Roy. Stat. Soc. B, vol. 69, no. 4, pp. 659–677, Sept. 2007.
• [15] A. Antos and I. Kontoyiannis, “Convergence Properties of Functional Estimates for Discrete Distributions,” Random Struct. Algor., vol. 19, pp. 163–193, Nov. 2001.
• [16] L. Paninski, “Estimation of Entropy and Mutual Information,” Neural Comput., vol. 15, no. 6, pp. 1191–1253, June 2003.
• [17] J. L. Berggren, J. M. Borwein, and P. Borwein, Pi: A Source Book, 3rd ed., New York, NY: Springer, June 2004.

## Appendix A Proof of the Propositions

### A-a Proof of Proposition 1

We begin with the following lemma.

###### Lemma 1

For any acyclic and any , we have

 Pr(xi=+1,y=+1)=14.

A proof of Lemma 1 can be found in Appendix B-A. Now fix . By Lemma 1, we have

 Pr(xi=+1,xj=−1,y=+1) =14−Pr(xi=+1,xj=+1,y=+1) (a)=Pr(xi=+1,xj=+1)−Pr(xi=+1,xj=+1,y=+1) =Pr(xi=+1,xj=+1,y=−1),

and hence

 Pr(y=+1|xi=+1,xj=−1)(b)=Pr(y=−1|xi=+1,xj=+1), (21)

where and are due to the fact that . Substituting (21) into (4) completes the proof of Proposition 1.

### A-B Proof of Propositions 2 and 7

Fix . For any , let

 ζ+{i,j}(xV) :=σ(β{i,j}+∑{k,l}∈I∖{{i,j}}β{k,l}xkxl), (22) ζ−{i,j}(xV) :=σ(−β{i,j}+∑{k,l}∈I∖{{i,j}}β{k,l}xkxl), (23) ζ{i,j}(xV) :=ζ+{i,j}(xV)−ζ−{i,j}(xV). (24)

We have the following lemma, for which a proof is provided in Appendix B-B.

###### Lemma 2

Assume that is acyclic. We have

 Pr(y=+1|xi=+1,xj=+1)−Pr(y=−1|xi=+1,xj=+1) =2−(d−2)∑xV:(xi,xj)=(+1,+1)ζ{i,j}(xV). (25)

To prove Proposition 2, note that the sigmoid function is strictly monotone increasing. We thus have for any when , and for any when . It thus follows from Lemma 2 and (4) that

 w{i,j} =2−(d−2)∣∣ ∣∣∑xV:(xi,xj)=(+1,+1)ζ{i,j}(xV)∣∣ ∣∣>0.

This completes the proof of Proposition 2.

To prove Proposition 7, let be a tree that covers . With a slight abuse of notation, let

 ζ{i,j}(zT∖{{i,j}}) :=σ(β{i,j}+∑{k,l}∈T∖{{i,j}}β{k,l}z{k,l})− σ(−β{i,j}+∑{k,l}∈T∖{{i,j}}β{k,l}z{k,l}) (a)=σ(β{i,j}+∑{k,l}∈I∖{{i,j}}β{k,l}z{k,l})− σ(−β{i,j}+∑{k,l}∈I∖{{i,j}}β{k,l}z{k,l}),

where follows from the fact that for any . Note that: 1) for any ; 2) is a union of two trees where and are in different trees, such that the mapping between and is one-on-one. We thus have

 ∑x