1 Introduction
Joint probability distributions of random variables with underlying conditional dependence structure are ubiquitous in science and engineering. The dependency structure of these distributions often reflects the properties of the scientific models that generate them. Graphical models are a natural way to represent such distributions and have found use in myriad fields such as physics Brush (1967); Wang et al. (2013) and biology Mora and Bialek (2011), to name a few. Due to the importance of graphical models, the inverse problem of learning them given i.i.d. samples from their joint distribution has been a very active area of research. This problem was first addressed by Chow and Liu Chow and Liu (1968), who solved it for the case of graphical models with a tree structure. Since then there have been many works on learning graphical models under certain assumptions about the true model, with most of them focused on the special case of learning Ising models Tanaka (1998); Kappen and Rodríguez (1998); Ravikumar et al. (2010). Bresler Bresler (2015) gave the first polynomial-time greedy algorithm that learns Ising models without any underlying assumptions. But the number of samples required to learn the model (i.e. the sample complexity) using this method was still suboptimal. The first near-sample-optimal method for learning binary models with second-order interactions (Ising models) was introduced by Vuffray et al. (2016). Here the learning problem is converted to a convex optimization problem which reconstructs the neighborhood of each variable in the model. The generalization of this method to learn any discrete graphical model is known as the Generalized Regularized Interaction Screening Estimator (GRISE) Vuffray et al. (2019). This algorithm runs in polynomial time in the size of the model and its sample complexity is very close to known information-theoretic lower bounds Lokhov et al. (2018); Santhanam and Wainwright (2012).
Despite being sample optimal, the computational cost of GRISE becomes very high when trying to learn models with higher-order interactions. To learn a model that has interactions up to order $L$, GRISE will need to have all possible interactions of that order in its hypothesis space. For a $p$-variable model with order-$L$ interactions, the computational complexity of GRISE grows as $O(p^L)$, which can be very high for large $L$. Even if the true model has a high degree of symmetry or structure that reduces the effective number of parameters to be learned, GRISE will not be able to leverage this to reduce the size of its hypothesis space.
In this work we propose a way to overcome this shortcoming, which arises from the linear parameterization that GRISE uses to represent candidate models. We introduce a method, which we call NNGRISE, that elegantly combines the strengths of GRISE and the nonlinear representation power of neural networks. As neural nets are universal function approximators Barron (1993); Hornik (1991), NNGRISE has the ability to learn the true model given enough samples, just like GRISE. In addition, we demonstrate experimentally that NNGRISE is much more efficient than GRISE for learning higher-order models with some form of underlying symmetry. We exhibit practical examples where this performance gain can be exponentially large, with a parameter-space reduction of 99.4% already on small systems of only 7 variables. Another important aspect of graphical model learning is learning the structure of the conditional independence relations between random variables. While one may think at first that the use of a neural network could obfuscate the graphical model structure, we show that with a proper regularization process, the Markov random field structure reappears within the weights of the neural network. Finally, we also provide an aggregated cost function for learning a global probability distribution or energy function using a consensus of local NNGRISE reconstructions.
The paper is organized as follows. The interaction screening principle is explained in Section 2 and NNGRISE is presented in Section 3. Structure learning with NNGRISE using an appropriate initialization and regularization is illustrated in Section 4. A method for representing the energy function of the true model with a single neural net is discussed in Section 5. The supplementary material contains additional experimental results with NNGRISE.
2 Learning graphical models via interaction screening
2.1 The interaction screening method
We consider a graphical model defined over discrete variables $\sigma_i \in [q]$ for $i \in [p]$, where the notation $[k]$ refers to the set containing exactly $k$ elements. The models we consider will be positive probability distributions over the set of $p$-dimensional vectors $\sigma = (\sigma_1, \ldots, \sigma_p)$. Without loss of generality this probability distribution can be written as

$$\mu(\sigma) = \frac{\exp(H(\sigma))}{Z}. \qquad (1)$$

The function $H(\sigma)$ is called the energy function or the Hamiltonian of the graphical model. The quantity $Z = \sum_{\sigma} \exp(H(\sigma))$ is called the partition function and it ensures normalization. GRISE considers a linear parameterization of $H$ by expanding it with respect to a chosen basis,

$$H(\sigma) = \sum_{k \in \mathcal{K}} \theta_k f_k(\sigma_k), \qquad (2)$$
where the $f_k$ denote the elements of the basis, $\mathcal{K}$ denotes the index set of the basis functions, $\sigma_k$ denotes the subset of variables acted on by $f_k$, and the $\theta_k$ are the parameters of the model. Given $n$ i.i.d. samples $\{\sigma^{(t)}\}_{t=1}^{n}$, GRISE uses the convex interaction screening objective (ISO) to estimate the parameters around one variable at a time. For any $u \in [p]$, the ISO is given by

$$S_n(\theta_u) = \frac{1}{n} \sum_{t=1}^{n} \exp\!\left(-H_u(\sigma^{(t)}, \theta_u)\right), \qquad (3)$$
where $\theta_u$ are the parameters associated with the basis functions acting on $\sigma_u$. The quantity

$$H_u(\sigma, \theta_u) = \sum_{k \in \mathcal{K} \,:\, u \in k} \theta_k f_k(\sigma_k) \qquad (4)$$
denotes the partial energy function containing all terms dependent on $\sigma_u$ and is directly related to the conditional distribution $\mu(\sigma_u \mid \sigma_{\setminus u})$. In the presence of prior information in the form of an $\ell_1$ bound $\hat{\gamma}$ on the parameters, the estimation can be efficiently performed by:

$$\hat{\theta}_u \in \operatorname*{arg\,min}_{\|\theta_u\|_1 \le \hat{\gamma}} S_n(\theta_u). \qquad (5)$$
The basis functions in (5) are assumed to be centered: $\sum_{\sigma_u} f_k(\sigma_k) = 0$ for all $k$ with $u \in k$. A quick intuition behind the minimization in (5) can be obtained by considering the ISO in the limit $n \to \infty$,

$$S(\theta_u) = \mathbb{E}\left[\exp\!\left(-H_u(\sigma, \theta_u)\right)\right]. \qquad (6)$$
It is easy to verify by a simple computation that $\nabla S(\theta_u^*) = 0$, where $\theta_u^*$ are the true parameters. Since $S$ is a convex function, the minimization in (5) estimates the parameters correctly in the limit of infinite samples. We refer the reader to Vuffray et al. (2019) for a detailed finite-sample analysis of GRISE.
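As a concrete illustration of this zero-gradient property, the following sketch (not from the paper; the two-spin model and the grid search are illustrative assumptions) checks numerically that the population ISO of a two-spin Ising model is minimized exactly at the true coupling.

```python
import math
from itertools import product

# Population ISO for a 2-spin Ising model H(s) = J*s1*s2: a toy check that
# S(t) = E[exp(-t*s1*s2)] is minimized exactly at the true coupling t = J.
J = 0.7
states = list(product([-1, 1], repeat=2))
weights = [math.exp(J * s1 * s2) for s1, s2 in states]
Z = sum(weights)
mu = [w / Z for w in weights]

def iso(t):
    # infinite-sample interaction screening objective, as in Eq. (6)
    return sum(m * math.exp(-t * s1 * s2) for m, (s1, s2) in zip(mu, states))

# crude grid search for the minimizer
t_hat = min((i / 1000 for i in range(-2000, 2001)), key=iso)
print(round(t_hat, 3))  # 0.7
```

Here the minimum can be found in closed form as well, since $S(t) \propto e^{J}e^{-t} + e^{-J}e^{t}$, whose derivative vanishes only at $t = J$.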
2.2 Basis function hierarchies
The choice of basis functions in (3) plays a crucial role in the computational complexity of GRISE. Unless clearly specified by the application, one must use generic complete hierarchies of basis functions that have the ability to express any discrete function. We present two such generic hierarchies below.
Centered indicator basis:
This basis is defined using the one-dimensional centered indicator functions given by

$$\phi_s(\sigma_i) = \mathbb{1}[\sigma_i = s] - \frac{1}{q}, \qquad s \in [q]. \qquad (7)$$

For any subset $k \subseteq [p]$ a set of basis functions can be constructed as products of the one-dimensional functions,

$$f_{k, s}(\sigma_k) = \prod_{i \in k} \phi_{s_i}(\sigma_i), \qquad s \in [q]^{|k|}. \qquad (8)$$
Monomial basis:
For the special case of binary variables with $\sigma_i \in \{-1, +1\}$, we can define the monomial basis for any $k \subseteq [p]$ as

$$f_k(\sigma_k) = \prod_{i \in k} \sigma_i. \qquad (9)$$
When all subsets $k \subseteq [p]$ are included, both the centered indicator basis and the monomial basis are complete. However, this choice makes GRISE, as given in (5), clearly intractable. Without any known strong underlying structure, a natural, and perhaps the only logical, choice is to restrict the so-called interaction order to a specified value $L$ by considering only subsets with $|k| \le L$. The complexity of GRISE is driven by the number of terms in the exponent of the objective, which is now bounded by $O(p^{L-1})$ per variable. A natural hierarchy of basis functions is constructed by starting from $L = 1$ and increasing $L$ in small steps as required. We thus obtain increasing representation power at the expense of higher computational cost. This approach is highly effective when the interaction order of the true underlying model is low. However, for models with high interaction order this approach can be computationally expensive, even if the model has significant structure.
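To make the growth of the hypothesis space concrete, the short sketch below counts the free parameters of per-variable GRISE in the monomial hierarchy. Counting monomials of orders 2 through $L$ that contain a fixed variable reproduces the 3472-parameter figure quoted for the 15-variable, sixth-order experiment in Section 3.2.1; the exact counting convention (interaction terms only) is our assumption.

```python
from math import comb

# Per-variable parameter count of order-L GRISE in the monomial basis:
# a monomial of order l that contains a fixed variable u picks its l-1
# remaining variables from the other p-1 sites.  Here we count interaction
# terms of orders 2..L (single-site fields would add one more parameter).
def grise_params_per_variable(p, L):
    return sum(comb(p - 1, l - 1) for l in range(2, L + 1))

for L in range(2, 7):
    print(L, grise_params_per_variable(15, L))
```

The count grows by a factor of roughly $p/L$ with every extra order, which is the combinatorial blow-up the neural-net ansatz of Section 3 is designed to avoid.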
3 Neural NetGRISE
To deal with higher-order models more easily, we use GRISE with a neural net ansatz to represent the partial energy function. A neural net, being a universal function approximator, will eventually cover the space of models as its size is increased, and provides a natural alternative to the monomial and centered indicator hierarchies in Section 2.2. But it explores the function space in a different way, and given the ability of neural nets to find patterns in data, it is to be expected that this approach will work better in practice for learning structured models with high interaction order.
3.1 Neural Net parameterization of partial energy function
The analysis of GRISE in Vuffray et al. (2019) relies on the linearity of $H_u$ in the parameters and the centered property of the basis functions, neither of which holds for a neural net parameterization. Nevertheless, our approach using neural nets attempts to generalize the intuition regarding the zero-gradient property of the infinite-sample limit of the ISO in (6). Similar to GRISE, we propose a neural network parameterization for one variable at a time, approximating the partial energy function in (4) as

$$E_u(\sigma; w) = g_{\sigma_u}(\sigma_{\setminus u}; w) - \frac{1}{q} \sum_{s \in [q]} g_s(\sigma_{\setminus u}; w), \qquad (10)$$

where $\sigma_{\setminus u}$ denotes all variables except $\sigma_u$, and the function $g$ in (10) is a vector-valued function with $q$ outputs. The input to the neural net is the set of all variables except $\sigma_u$. We use $w$ to denote the weights of the neural net; they serve the role of the parameters $\theta_u$ in (4). The representation above is automatically centered in $\sigma_u$. Moreover, the representation does not lose any generality since the global energy function can always be written as

$$H(\sigma) = H_u(\sigma) + H_{\setminus u}(\sigma_{\setminus u}), \qquad (11)$$

where $H_{\setminus u}$ contains all terms independent of $\sigma_u$.
(11) 
The corresponding neural net interaction screening objective (NNISO) is given by

$$S_n(w) = \frac{1}{n} \sum_{t=1}^{n} \exp\!\left(-E_u(\sigma^{(t)}; w)\right). \qquad (12)$$
An intuitive justification for NNISO: a variational argument.
Consider the traditional GRISE with a complete set of the centered indicator basis defined in (8) with all terms, i.e., order $L = p$. Due to the completeness of the basis, optimizing over the parameters is equivalent to optimizing over the set of all discrete functions that are centered w.r.t. $\sigma_u$. Since GRISE is able to recover the correct energy function using this basis, it follows that the true partial energy function is a global optimum of the following variational problem using the infinite-sample ISO:

$$\min_{g \,:\, \sum_{\sigma_u} g(\sigma) = 0} \; \mathbb{E}\left[\exp\!\left(-g(\sigma)\right)\right]. \qquad (13)$$
Since the objective in (13) is convex in $g$, the partial energy function $H_u$ is the global optimum of the problem. When the size of the neural net is sufficiently large, minimizing the NNISO in (12) is almost equivalent to the variational problem in (13). Although $S_n(w)$ is a non-convex function of $w$, a property similar to the zero-gradient property of the ISO can be shown to hold for the NNISO in the limit of infinite samples and neural net size, when the function space covered by the neural net is large enough that there exists a set of weights $w^*$ such that $E_u(\sigma; w^*) = H_u(\sigma)$. In this setting, we can show that one of the minima of the infinite-sample objective $S(w)$ corresponds to $w^*$. The gradient of $S(w)$ w.r.t. $w$ can be written as

$$\nabla_w S(w) = -\,\mathbb{E}\left[\exp\!\left(-E_u(\sigma; w)\right) \nabla_w E_u(\sigma; w)\right].$$

If $E_u(\sigma; w) = H_u(\sigma)$ for all values of $\sigma$, the gradient is zero. We will call the minima for which this condition is satisfied the interaction screening minima. Using the intuition derived from (13), this optimum is the global optimum over the function space. Even for finite size, this motivates the use of SGD or its variants to optimize $S_n(w)$. The inherent noise in SGD will prevent it from being stuck in spurious local minima, and it will converge more easily to the interaction screening minima. It further comes with the usual perks of parallelizability and ease of implementation on GPUs.
NNGRISE as described so far doesn't learn the full energy function. Instead it gives neural nets which approximate the partial energy function of each variable in the model. This gives us an approximation for the conditionals of the true model. Let $w_u^*$ denote the weights of the fully trained neural net obtained by minimizing the NNISO. For the true model, the conditional probability of a variable conditioned on everything else can be estimated as

$$\mu(\sigma_u = s \mid \sigma_{\setminus u}) \approx \frac{\exp\!\left(E_u(s, \sigma_{\setminus u}; w_u^*)\right)}{\sum_{s' \in [q]} \exp\!\left(E_u(s', \sigma_{\setminus u}; w_u^*)\right)}. \qquad (14)$$
The conditionals can be used to draw samples from the learned model using Gibbs sampling Geman and Geman (1984).
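The sampling loop can be sketched as follows, with a hand-coded stand-in for the trained partial energies; the three-spin ferromagnetic chain and its coupling value are illustrative assumptions, not the paper's experimental setup.

```python
import math
import random

# Gibbs sampling with conditionals derived from partial energies: each site
# is resampled from probabilities proportional to exp(partial energy).
J = 0.5  # illustrative coupling for a 3-spin ferromagnetic chain

def partial_energy(u, sigma):
    # stand-in for a trained net E_u: all energy terms containing sigma[u]
    e = 0.0
    if u > 0:
        e += J * sigma[u] * sigma[u - 1]
    if u < len(sigma) - 1:
        e += J * sigma[u] * sigma[u + 1]
    return e

def gibbs_sweep(sigma):
    for u in range(len(sigma)):
        probs = []
        for s in (-1, 1):
            sigma[u] = s
            probs.append(math.exp(partial_energy(u, sigma)))
        sigma[u] = -1 if random.random() < probs[0] / sum(probs) else 1
    return sigma

random.seed(0)
sigma = [random.choice((-1, 1)) for _ in range(3)]
corr = 0.0
for sweep in range(3000):
    gibbs_sweep(sigma)
    if sweep >= 500:  # discard burn-in sweeps
        corr += sigma[0] * sigma[1]
corr /= 2500
print(corr > 0.2)  # neighbouring spins end up positively correlated
```

Because only partial energies are needed, this is exactly the information the per-variable NNGRISE nets provide.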
Remark:
Everything in the above discussion carries over to the case of binary models ($\sigma_i \in \{-1, +1\}$) by using

$$E_u(\sigma; w) = \sigma_u \, g(\sigma_{\setminus u}; w), \qquad (15)$$

where $g$ is a scalar-valued function.
3.2 Experiments
Now we will test NNGRISE on two highly structured graphical models. In our testing we will compare NNGRISE to GRISE. These are completely different types of algorithms, and finding the right metric to compare them is tricky. GRISE converts the learning problem into a convex optimization problem for which theoretical guarantees can be derived. On the other hand, NNGRISE is a non-convex problem, but the learning process in this case can be easily parallelized on a GPU using off-the-shelf machine learning libraries. If there are no limitations on computational power, then both methods will find the true model eventually. But on real hardware their performance will depend on the implementation and on the true model being learned. To quantify the hardness of these algorithms in a device-independent fashion, we compare the number of free parameters these algorithms optimize over per variable. Roughly, this number reflects the dimension of the hypothesis space of these algorithms. We will be using feed-forward neural nets with the swish activation function ($x \mapsto x \cdot \mathrm{sigmoid}(x)$) Ramachandran et al. (2017). We specify the size of a neural net with two numbers, which are the number of hidden layers in the model and the number of neurons in each hidden layer, respectively. All the nets were trained using the ADAM optimizer Kingma and Ba (2014).

3.2.1 Learning binary models with higher order interactions
We expect NNGRISE to work well on models with a high degree of underlying structure, even if the model has higher-order interactions. GRISE, without any prior information about the structure, will need to use the entire hierarchy described in Section 2.2 to recover the correct model.
To demonstrate this, we generate samples from a graphical model with the following energy function,

$$H(\sigma) = \sum_{l=1}^{6} J_l \sum_{i=1}^{p-l+1} \; \prod_{j=i}^{i+l-1} \sigma_j, \qquad (16)$$

and learn it using GRISE and NNGRISE (the code and data for this can be found at https://github.com/biryani/NN_GRISE_NeurIPS). The hypergraph structure of this model is one-dimensional and it has interactions up to order 6. We impose an extra symmetry here by choosing the same interaction strength $J_l$ for all terms of the same order $l$, so in effect there are only 6 parameters to be learned. But for our experiments the learning algorithms will be unaware of this property, and also of the one-dimensional nature of the model. First we compute the error in the learned conditionals averaged over all possible inputs using Eq. (14). We compare the average error in the learned conditionals in Fig. 0(a). The couplings $J_l$ are chosen at random.
We also study our learned model by expanding the learned neural nets in the monomial basis. This can be done for any binary function using standard formulae O'Donnell (2014). If the neural net learns the correct model, the leading coefficients in the monomial expansion at each order should match those of the true model at that order. The absolute values of these coefficients are compared in Fig. 0(b). Both these metrics are exponentially expensive in the number of variables to compute. The small model size makes these explicit comparisons possible without resorting to Gibbs sampling.
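The expansion step can be sketched as follows: over $\{-1,+1\}^p$ the monomials form an orthonormal basis, so each coefficient is an average of the function against the corresponding monomial (standard Boolean Fourier analysis). The test function below is an illustrative assumption, standing in for a trained net.

```python
import math
from itertools import combinations, product

# Monomial (Fourier) expansion of a function on {-1,+1}^p: the coefficient
# of the monomial indexed by a subset S is
#   c_S = 2^{-p} * sum_sigma f(sigma) * prod_{i in S} sigma_i,
# by orthonormality of the monomial basis under the uniform measure.
def monomial_coeffs(f, p):
    coeffs = {}
    for sigma in product((-1, 1), repeat=p):
        val = f(sigma)
        for r in range(p + 1):
            for S in combinations(range(p), r):
                coeffs[S] = coeffs.get(S, 0.0) + \
                    val * math.prod(sigma[i] for i in S) / 2 ** p
    return coeffs

# sanity check on a function with a known expansion
f = lambda s: 0.3 * s[0] * s[1] - 1.2 * s[2]
c = monomial_coeffs(f, 3)
print(round(c[(0, 1)], 6), round(c[(2,)], 6))  # 0.3 -1.2
```

The sum over all $2^p$ inputs is what makes this diagnostic exponentially expensive in the number of variables.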
We see from Fig. 0(a) that GRISE with all terms up to fifth order ($L = 5$), just one order lower than the true model, fails to learn the model correctly. For this algorithm the number of input samples has little effect on the error in the conditionals. This implies that the candidate models considered by the algorithm are far away from the true model in function space. Including all terms up to fifth order in GRISE requires us to optimize over 256 free parameters. A comparable neural net model has 221 free parameters to optimize. We see that this model achieves a lower error than GRISE with a similar number of parameters. Also, the error in this case decays as the number of training samples increases, unlike the error floor observed for $L = 5$ GRISE. This is an indication that the neural net manages to learn the true model.
Looking at the other neural net models, we see that most of them learn the true model well. The best algorithm according to Fig. 0(a) is GRISE with sixth-order interactions included. This is expected because it has the advantage of having the true model in its hypothesis space. On the other hand, with NNGRISE we can only get close to the true model. But the advantage of NNGRISE is that it can do this with far fewer parameters than $L = 6$ GRISE.
In Fig. 0(b), we see that the neural net learns the coefficients well up to sixth order. But it cannot completely suppress the higher-order coefficients. This happens because NNGRISE implicitly has higher-order polynomials in its hypothesis space. The agreement with the true model improves as the number of training samples increases and the spurious hypotheses are suppressed.
The efficiency of NNGRISE over GRISE is obvious for models with a higher number of variables. For larger models we have to resort to sampling from the models and computing the total variation distance (TVD) between the sampled distributions. To set a baseline for the sampling error, we take two independent sets of samples from the true model and compute the TVD between them. If the learned model is close to the true model, then the TVD between their samples should closely follow this baseline. Sixth-order GRISE for a 15-variable model is a 3472-parameter optimization problem per variable. This was intractable on the hardware used for these experiments. Instead we compare GRISE with up to fourth-order interactions ($L = 4$) with NNGRISE. This comparison is given in Fig. 2. We see that NNGRISE learns the true model well with fewer parameters than GRISE with a higher $L$, and performs better than GRISE even with fewer training samples.
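The TVD estimates above are computed from empirical frequencies; a minimal version of that comparison (with made-up toy samples, not the paper's data) looks like:

```python
from collections import Counter

# Empirical total variation distance between two sample sets:
# TVD = (1/2) * sum_x |p_hat(x) - q_hat(x)| over the observed support.
def empirical_tvd(samples_a, samples_b):
    ca, cb = Counter(samples_a), Counter(samples_b)
    na, nb = len(samples_a), len(samples_b)
    support = set(ca) | set(cb)
    return 0.5 * sum(abs(ca[x] / na - cb[x] / nb) for x in support)

# toy example: two sample sets over 2-spin configurations
a = [(1, 1)] * 60 + [(1, -1)] * 40
b = [(1, 1)] * 50 + [(1, -1)] * 50
print(round(empirical_tvd(a, b), 6))  # 0.1
```

Comparing two independent sample sets from the same model with this estimator is what sets the sampling-noise baseline described above.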
3.2.2 Learning a model with permutation symmetry
Now we test NNGRISE on graphical models over variables with a four-letter alphabet. Additionally, these distributions will have complete permutation symmetry, i.e. the probability of a string will not change under permutations of that string,

$$\mu(\sigma_1, \ldots, \sigma_p) = \mu(\sigma_{\pi(1)}, \ldots, \sigma_{\pi(p)}) \quad \forall \pi \in S_p, \qquad (17)$$

where $S_p$ is the symmetric group on $p$ elements. Distributions with this symmetry occur naturally in quantum physics. For this specific experiment we will learn the probability distribution obtained by measuring a quantum state called the GHZ state. We will work with the distribution obtained from the GHZ state by measuring it in a basis known as the tetrahedral POVM. The mathematical details of this setup can be found in Ref. Carrasquilla et al. (2019). Measuring an $n$-qubit GHZ state in this basis produces a positive distribution over strings of $n$ four-valued outcomes which is symmetric under permutations. This means that this distribution can be represented as a Gibbs distribution. Our testing on small systems shows that the energy function of this distribution has interaction terms of all orders up to the system size (Fig. 2(a)). Fig. 2(b) shows the error measured in TVD when learning the GHZ distribution. Doing full GRISE in this case is prohibitively expensive, but NNGRISE performs remarkably well with a small neural net.
4 Structure learning with input regularization
In this section we will discuss structure learning with NNGRISE, where the focus is to learn the structure of the hypergraph associated with the true model. The ensuing discussion will focus on binary models for simplicity, but the same principle applies for all alphabets.
If NNGRISE converges to an interaction screening minimum, then the trained network for variable $u$ will contain information about the sites in the neighbourhood of $u$ in the hypergraph of the true model. More precisely, for inputs in $\{-1, +1\}^{p-1}$, the output of the network will be insensitive to the inputs corresponding to sites outside the neighbourhood of $u$. By looking at which inputs influence the output, we can learn the underlying dependency structure of the graphical model. If a certain input doesn't influence the output, then we expect the input weights connecting that input to the rest of the net to be close to zero. But the trained network is a function whose full domain is $\mathbb{R}^{p-1}$, which we are restricting to $\{-1, +1\}^{p-1}$. This restriction is a many-to-one map in the space of functions, i.e. there are many functions with a continuous domain that project to the same function on the discrete domain. This means that the network could be a function that depends on all its inputs when its domain is continuous. It could very well have non-zero weights at certain inputs while not being sensitive to those inputs when they take discrete values.
This problem can be fixed in practice by taking two steps. First, at the beginning of training the input weights must be initialized to zero. Second, we must regularize the NNGRISE loss function with the $\ell_1$ norm of only the input weights. Both these steps ensure that the weights corresponding to the inputs that do not influence the output go to zero. We test this method on pairwise binary models on random graphs. The energy function here will be $H(\sigma) = \sum_{(i,j) \in E} J_{ij} \sigma_i \sigma_j$. The random graph is generated by the Erdős–Rényi model Erdős and Rényi (1960) and the interaction strengths are chosen uniformly at random from an interval. We denote by $W_i$ the array of input weights that connect the input corresponding to site $i$ to the rest of the network. In Fig. 4, we see the effect regularization and initialization have on the input weights while NNGRISE is being trained. Regularization forces the non-edge weights to zero and clearly separates them from the edge weights. This means that by plotting a histogram of the $\|W_i\|$ values at the end of training, we can distinguish between the edges and non-edges in the graph. This is demonstrated in Fig. 5, where a random graph is reconstructed perfectly from the histogram of trained weights. For a model with higher-order interactions, NNGRISE will be able to learn the neighbourhood of each variable. This information can then be used as a prior in GRISE to reduce its computational cost. More experiments with structure learning can be found in the supplementary material.
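The readout at the end of training can be sketched as follows: group the first-layer weights by input site, take a norm per group, and threshold at the gap in the histogram. The weight matrix here is synthetic (an illustrative assumption), standing in for a net trained with zero-initialized, regularized input weights.

```python
# Structure readout from input weights: after training with a sparsity
# penalty on the input weights, the per-site weight-group norms separate
# true neighbours of the conditioned variable from non-neighbours.
hidden, p_inputs = 4, 6
true_neighbours = {1, 3}

# synthetic first-layer weights W[h][i]: large for neighbours, ~0 otherwise
W = [[(0.8 if i in true_neighbours else 0.02) * ((-1) ** (h + i))
      for i in range(p_inputs)] for h in range(hidden)]

def input_weight_norms(W):
    # 2-norm of the weight group feeding each input site
    return [sum(row[i] ** 2 for row in W) ** 0.5 for i in range(len(W[0]))]

norms = input_weight_norms(W)
threshold = 0.5  # chosen from the gap in the histogram of norms
recovered = {i for i, v in enumerate(norms) if v > threshold}
print(sorted(recovered))  # [1, 3]
```

Repeating this readout for every variable assembles the edge set of the reconstructed graph.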
5 Learning the complete energy function with NNGRISE
NNGRISE as described so far learns the conditionals of the graphical model. At the level of the energy function, the learned model for a particular variable represents the partial energy function of that variable. But for some applications it would be beneficial to have the complete energy function.
Reconstructing the energy function from the partial energies is a non-trivial task. If all the partial energies are compatible, i.e. if they come from the same underlying energy function, then we can expand them in the monomial basis and reconstruct the energy function from the expansion. But this method is computationally expensive. Instead, a simple modification to the NNGRISE loss function lets us learn the complete energy function directly. We will explain this for the case of binary models, but a similar principle can be used for models with general alphabets as well. The modification to learn the energy function is based on (15), which shows that the partial energy for a variable $u$ can be written as

$$H_u(\sigma) = \frac{H(\sigma) - H(\bar{\sigma}^u)}{2}, \qquad (18)$$
where $\bar{\sigma}^u$ is $\sigma$ with the variable $\sigma_u$ flipped in sign. Using a neural net $E(\sigma; w)$ as a candidate for the energy function $H$, we can rewrite the loss in Eq. (3) as

$$S_{n,u}(w) = \frac{1}{n} \sum_{t=1}^{n} \exp\!\left(-\frac{E(\sigma^{(t)}; w) - E(\bar{\sigma}^{u,(t)}; w)}{2}\right). \qquad (19)$$
To ensure that the trained neural net gives the correct energy function, we have to sum up these individual loss functions to construct a single loss function,

$$S_n(w) = \sum_{u=1}^{p} S_{n,u}(w). \qquad (20)$$
Just as before, we can show that if the neural net has sufficient expressive power and the number of samples goes to infinity, then the global minima of this loss function correspond to the correct energy function. For GRISE this modification is not necessary, as it learns directly in the monomial basis. The results of learning a model with this loss function are given in the supplementary material due to space considerations.
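The identity behind this construction can be checked directly. The sketch below verifies, for a small hand-picked binary energy function (an illustrative assumption), that half the energy difference under a single spin flip recovers the partial energy, and assembles the summed screening loss.

```python
import math

# For binary models, the partial energy of variable u equals half the energy
# difference between a configuration and the same configuration with
# sigma_u flipped; summing per-variable screening losses gives one objective.
def energy(s):
    J, h = 0.4, -0.2  # illustrative couplings
    return J * s[0] * s[1] + h * s[0]

def partial_energy(u, s):
    flipped = tuple(-x if i == u else x for i, x in enumerate(s))
    return (energy(s) - energy(flipped)) / 2

def combined_loss(samples):
    # sum over variables of the per-variable screening losses
    total = 0.0
    for u in range(2):
        total += sum(math.exp(-partial_energy(u, s)) for s in samples) / len(samples)
    return total

states = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
for s in states:
    # all energy terms containing s[0] sum to s0*(J*s1 + h)
    assert abs(partial_energy(0, s) - s[0] * (0.4 * s[1] - 0.2)) < 1e-12
print(combined_loss(states))
```

Replacing `energy` with a single neural net over all variables turns `combined_loss` into the summed objective used to learn the full energy function.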
References

[1] Barron, A. R. (1993) Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory 39 (3), pp. 930–945. Cited by: §1.
[2] Bresler, G. (2015) Efficiently learning Ising models on arbitrary graphs. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pp. 771–782. Cited by: §1.
[3] Brush, S. G. (1967) History of the Lenz-Ising model. Reviews of Modern Physics 39 (4), pp. 883. Cited by: §1.
[4] Carrasquilla, J., Torlai, G., Melko, R. G., and Aolita, L. (2019) Reconstructing quantum states with generative models. Nature Machine Intelligence 1 (3), pp. 155–161. Cited by: §3.2.2.
[5] Chow, C. K. and Liu, C. N. (1968) Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory 14 (3), pp. 462–467. Cited by: §1.
[6] Erdős, P. and Rényi, A. (1960) On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci. 5 (1), pp. 17–60. Cited by: §4.
[7] Geman, S. and Geman, D. (1984) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-6 (6), pp. 721–741. Cited by: §3.1.
[8] Hornik, K. (1991) Approximation capabilities of multilayer feedforward networks. Neural Networks 4 (2), pp. 251–257. Cited by: §1.
[9] Kappen, H. J. and Rodríguez, F. B. (1998) Efficient learning in Boltzmann machines using linear response theory. Neural Computation 10 (5), pp. 1137–1156. Cited by: §1.
[10] Kingma, D. P. and Ba, J. (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.2.
[11] Lokhov, A. Y., Vuffray, M., Misra, S., and Chertkov, M. (2018) Optimal structure and parameter learning of Ising models. Science Advances 4 (3), pp. e1700791. Cited by: §B.2, §1.
[12] Mora, T. and Bialek, W. (2011) Are biological systems poised at criticality? Journal of Statistical Physics 144 (2), pp. 268–302. Cited by: §1.
[13] O'Donnell, R. (2014) Analysis of Boolean Functions. Cambridge University Press. Cited by: §3.2.1.
[14] Ramachandran, P., Zoph, B., and Le, Q. V. (2017) Swish: a self-gated activation function. arXiv preprint arXiv:1710.05941. Cited by: §3.2.
[15] Ravikumar, P., Wainwright, M. J., and Lafferty, J. D. (2010) High-dimensional Ising model selection using ℓ1-regularized logistic regression. The Annals of Statistics 38 (3), pp. 1287–1319. Cited by: §1.
[16] Santhanam, N. P. and Wainwright, M. J. (2012) Information-theoretic limits of selecting binary graphical models in high dimensions. IEEE Transactions on Information Theory 58 (7), pp. 4117–4134. Cited by: §B.2, §1.
[17] Tanaka, T. (1998) Mean-field theory of Boltzmann machine learning. Physical Review E 58 (2), pp. 2302. Cited by: §1.
[18] Vuffray, M., Misra, S., Lokhov, A. Y., and Chertkov, M. (2016) Interaction screening: efficient and sample-optimal learning of Ising models. In Advances in Neural Information Processing Systems, pp. 2595–2603. Cited by: §1.
[19] Vuffray, M., Misra, S., and Lokhov, A. Y. (2019) Efficient learning of discrete graphical models. arXiv preprint arXiv:1902.00600. Cited by: §1, §2.1, §3.1.
[20] Wang, C., Komodakis, N., and Paragios, N. (2013) Markov random field modeling, inference & learning in computer vision & image understanding: a survey. Computer Vision and Image Understanding 117 (11), pp. 1610–1627. Cited by: §1.
Supplementary Material
This section contains supplementary material for the paper "Learning of Discrete Graphical Models with Neural Networks". Here we show the results of some additional experiments done with NNGRISE. Section A has results on learning Ising models. Section B has more results on structure learning, including learning hypergraphs. Section C has results on learning the full energy function using NNGRISE.
Appendix A Learning Ising models.
For this experiment we learn Ising models with two-body interactions. We take random graphs with an average degree of three and choose the interaction strengths uniformly at random from [-1, 1]. The Hamiltonian here has the form

$$H(\sigma) = \sum_{(i,j) \in E} J_{ij} \sigma_i \sigma_j. \qquad (21)$$

This is an adversarial experiment for NNGRISE when compared to GRISE. GRISE will learn this model in the second level of its hierarchy ($L = 2$) with only $p - 1$ parameters per optimization. The neural net will have to fit a linear function of its inputs, which it will not be able to do as well as low-degree GRISE. Despite this, NNGRISE does a good job of learning the true model, albeit with a larger number of free parameters than second-order GRISE.


Appendix B Structure learning with NNGRISE.
B.1 Learning hypergraphs
We show that NNGRISE can accurately reconstruct the neighbourhood of each variable for a general model with higher-order interactions. In Fig. 7 we learn the following model (code available at https://github.com/biryani/NN_GRISE_NeurIPS),

(22)

The parameters here are chosen uniformly at random. As seen in Fig. 6(b), NNGRISE can perfectly reconstruct the neighbourhood of each variable. The fifth-order term shows up as a clique connecting the corresponding variables. Once the neighbourhood reconstruction is done, we can use it as a prior in GRISE. This can reduce the number of free parameters in order-$L$ GRISE from $O(p^{L-1})$ to $O(d^{L-1})$ per variable, where $d$ is the size of the neighbourhood of the variable being learned.


B.2 Learning graphs when the number of samples is too low

The success of structure learning depends on the number of samples used in the algorithm. The number of samples required to perfectly learn the structure depends on the strength of interaction and the degree of the underlying model. This is reflected in the sample complexity lower bound, which is exponential in the product of the degree of the graph and the maximum strength of interaction [16]. In particular, learning models with higher degree from a limited number of samples makes distinguishing edges from non-edges more difficult. The histogram of trained input weights in this case will be more spread out, as seen in Fig. 7(a). Despite this, there are only a few misclassified edges in the reconstructed graph. In the histogram these edges usually lie close to the large cluster of weights near zero. If the threshold line is chosen right after this large cluster, most edges and non-edges are classified accurately. GRISE also exhibits a similar behaviour when the number of samples available is inadequate for perfect structure learning [11].

Appendix C Results of learning the energy function with NNGRISE.
In this section we look at the results of learning the complete energy function using NNGRISE by training with the loss function given in Eq. (20). We use this loss to learn the energy function in Eq. (16). Since we now have a neural net representation for the full energy function, we compute the average error in the energy function rather than in the conditionals. This comparison is given in Fig. 8(a). We also compute the TVD between sampled distributions in Fig. 8(b). The samples are now generated from the neural net using exact sampling rather than Gibbs sampling. This would have been intractable with neural net approximations of the partial energy functions.

GRISE directly learns in the monomial basis, so the total energy function can be approximated by appropriately averaging the terms in the partial energy functions. But this requires $p$ separate optimizations and increases the parameter count of learning the energy function. To make the comparison with NNGRISE more fair, we instead take the parameter count of order-$L$ GRISE to be the total number of independent parameters in an order-$L$ energy function with $p$ variables.

From Fig. 9, we see that NNGRISE learns the energy function well with fewer parameters. Notice that the neural nets used here are larger than those used in learning the partial energies. But here a single neural net learns the complete model, while previously we had $p$ separate nets learning the model.

