1 Introduction
Representing and understanding the structure of direct correlations between distinct random variables with graphical models is a fundamental task that is essential to scientific and engineering endeavors. It is the first step towards an understanding of interactions between the interleaved constituents of elaborate systems
[11]; it is key for developing causal theories [6]; and it is at the core of automated decision making [8] and cybersecurity [5, 20]. The problem of reconstructing graphical models from samples traces back to the seminal work of Chow and Liu [7]
for tree-structured graphical models, and to this day remains at the center of attention of the learning community. For factor models defined over general hypergraphs, the learning problem is particularly challenging for graphical models over discrete variables, for which the maximum likelihood estimator is in general computationally intractable. One of the earlier tractable algorithms suggested to provably reconstruct the structure of a subset of pairwise binary graphical models is based on inferring the sparsity pattern of the so-called regularized pseudo-likelihood estimator, equivalent to regularized logistic regression in the binary case
[16]. However, the additional assumptions required for this algorithm to succeed severely limit the set of models that can be learned [15]. After it was proven that reconstruction of arbitrary discrete graphical models with bounded degree can be done in time polynomial in the system size [4], Bresler showed that it is possible to bring the computational complexity down to quasi-quadratic in the number of variables for Ising models (pairwise graphical models over binary variables); however, the resulting algorithm has non-optimal sample requirements that are double-exponential in other model parameters
[3]. The first computationally efficient reconstruction algorithm for sparse pairwise binary graphical models with a near-optimal sample complexity with respect to the information-theoretic lower bound [17], called Rise, was designed and analyzed by Vuffray et al. [19]. The algorithm Rise suggested in that work is based on the minimization of a novel local convex loss function, called the
Interaction Screening objective, supplemented with an ℓ1 penalty to promote sparsity. Even though it was later shown by Lokhov et al. [13] that regularized pseudo-likelihood supplemented with a crucial post-processing step also leads to a structure estimator for pairwise binary models, the strong numerical and theoretical evidence provided in that work demonstrated that Rise is superior in terms of worst-case sample complexity. Algorithms for learning discrete graphical models beyond pairwise interactions and binary alphabets have been proposed only recently by Hamilton et al. [9], Suggala et al. [18] and Klivans et al. [12]. The method in [9] works for arbitrary models with bounded degrees, but being a generalization of Bresler's algorithm for Ising models [3], it suffers from similarly prohibitive sample requirements growing double-exponentially in the strength of model parameters. The learning algorithm in [18] has a low sample complexity and has the advantage of dealing with graphical models possessing an arbitrary parametric form, the so-called non-parametric Expxorcist family. However, being a generalization of the sparsity-based pseudo-likelihood estimator, it relies on restrictive assumptions similar to those used in [16] and in the earlier generalization to general discrete graphical models [10]. The so-called Sparsitron algorithm in [12] has the flavor of a stochastic first-order method with multiplicative updates. It has a low computational complexity and is sample-efficient for structure recovery of two subclasses of discrete graphical models: multi-wise graphical models over binary variables, or pairwise models with general alphabets. A recent follow-up work by Wu et al. [21] considered a constrained logistic regression and showed that it provides a slight improvement in sample complexity compared to [12] in the case of pairwise models over non-binary variables.
In this work, we propose a general framework for learning discrete factor models expressed in an arbitrary parametric form, based on a significant generalization of the Interaction Screening method of [19, 13], previously introduced for pairwise binary models. Our primary insight lies in the identification of a single general condition, related to model parameterization, that is sufficient to obtain bounds on sample complexity. This provides a general framework for analyzing the sample complexity of parameter estimation for any given class of discrete graphical models, and we show that for many important examples this condition can be readily verified. We use this to show that structure and parameter recovery results for all model classes previously considered in the literature [19, 9, 12, 21] can be obtained as special cases of our general framework. Notably, this includes the challenging case of multi-body interactions defined over general discrete alphabets. Hence, our work constitutes the first result providing an efficient learning algorithm, termed Grise, with theoretical guarantees that reconstructs any discrete graphical model with multi-body interactions and general node-specific alphabets, while keeping sample requirements low. In addition, our theoretical guarantees can be expressed in any error norm, and explicitly include the distinction between bounds on the parameters of the underlying model and the prior parameters used in the optimization; as a result, running Grise with prior information that is not tight has only a moderate effect on the sample complexity bounds. Finally, we also provide a fully parallelizable algorithmic formulation for the resulting Grise estimator, and show that it runs efficiently for models with multi-body interactions, matching the best-known scaling for pairwise models.
2 Problem Formulation
2.1 Parameterized family of models
We consider positive joint probability distributions over variables for . The set of variable indices is referred to as vertices . Nodedependent alphabets are assumed to be discrete and of size bounded by . Without loss of generality, the positive probability distribution over the
dimensional vector
can be expressed as(1) 
where is a set of basis functions acting upon subsets of variables that specify a family of distributions, and are parameters that specify a model within this family. The quantity denotes the partition function and serves as a normalization constant enforcing that (1) defines a probability distribution. For , let denote the set of factors corresponding to basis functions acting upon subsets that contain the variable and .
Given any set of basis functions, we can locally center them by first defining for a given , the local centering functions
(2) 
where denotes the vector without , and define the locally centered basis functions,
(3) 
As their name suggests, the locally centered basis functions sum to zero . To ensure the scales of the parameters are well defined, we assume that are chosen or rescaled such that all locally centered basis functions are normalized in the following sense:
(4) 
for all vertices and basis factors . This normalization can always be achieved by choosing bounded basis functions . An important special case is when the basis functions are already centered, i.e. . In this case the basis functions are directly normalized . Note that one of the reasons to define the normalization in (4) in terms of the centered functions instead of is to avoid spurious cases where the functions have inflated magnitudes due to the addition of constants.
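As an illustration of the centering step, the construction in Eqs. (2)-(3) can be sketched in Python; the function names, the two-variable example, and the binary alphabet below are illustrative assumptions rather than part of the formal development.

```python
def center_locally(g, u, alphabet):
    """Locally center basis function g with respect to variable u:
    subtract the uniform average of g over the alphabet of x_u, so that
    the result sums to zero along coordinate u (cf. Eqs. (2)-(3))."""
    def g_centered(x):
        avg = sum(g(x[:u] + (a,) + x[u + 1:]) for a in alphabet) / len(alphabet)
        return g(x) - avg
    return g_centered

# Example: a non-centered pairwise function over {-1,+1} variables;
# centering removes the constant offset and leaves the pure interaction.
g = lambda x: (x[0] + 1) * x[1]
gc = center_locally(g, u=0, alphabet=(-1, +1))

assert sum(gc((a, +1)) for a in (-1, +1)) == 0   # sums to zero over x_0
assert gc((+1, +1)) == 1 and gc((-1, +1)) == -1  # equals x_0 * x_1
```

Centering a function that already sums to zero over the alphabet leaves it unchanged, consistent with the special case discussed above.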
2.2 Model selection problem
For each , let denote the set of target factors that we aim to reconstruct accurately, and let be the set of residual factors for which we do not need learning guarantees. The target and residual parameters are defined similarly as and , respectively. Given independent samples from a model in the family in Section 2.1, the goal of the model selection problem is to reconstruct the target parameters of the model.
Definition 1 (Model Selection Problem).
Given i.i.d. samples drawn from some distribution in Eq. (1) defined by , and prior information on given in the form of an upper bound on the norm of the local subcomponents
(5) 
and a local constraint set for each such that
(6) 
compute estimates of such that the estimates of the target parameters satisfy
(7) 
where denotes some norm of interest with respect to which the error is measured.
The bound on the norm in (5) is a natural generalization of the sparse case where only has a small number of nonzero components; in the context of parameter estimation in graphical models, the setting of parameters bounded in the norm has been previously considered in [12]. The constraint sets are used to encode any other side information that may be known about the model parameters. Below, we show how constraints encoded in can be conveniently used to lift degeneracies due to specific model parametrizations.
2.3 Sufficient conditions for well-posedness
In this section, we describe some conditions on the model in (1) that make the model selection problem defined in Section 2.2 well-posed. We first state the conditions formally.
Condition 1.
The model from which the samples are drawn in the model selection problem in Definition 1 satisfies the following:

Constrained linear independence of target basis functions: There exists a constant such that for every vertex and any vector in the difference constraint set defined as
(8) the following holds:
(9) where denotes the components of , and denotes the norm used in Definition 1.

Finite maximum interaction strength: The quantity defined as
(10) is finite.
To understand why (C1) is required, consider a model that violates this condition and suppose that there exists a vertex and a vector for which . Since the probability distribution in Eq. (1) is positive, this implies that for all configurations we have the functional equality . This enables us to locally reparameterize the distribution:
(11) 
where is a sum of locally centered functions that does not involve the variable . We see from Eq. (11) that by changing the parameterization of the residual factors, one can arbitrarily change the parameterization of the target factors. As a result, unique determination of the parameters solely from samples and knowledge of and , as required by the model selection problem (Definition 1), is impossible. We also note that in (C1) the lower bound on the RHS only consists of terms in the target set . This is a weaker assumption than requiring that . However, as we will see in Theorem 1 below, any parameters in the target factor sets for which condition (C1) is verified for given can be recovered to the precision specified in (7).
The bound on the interaction strength in (C2) is required to ensure that the model can be recovered with finitely many samples. For many special cases, such as the Ising model, the minimum number of samples required to estimate the parameters must grow exponentially with the maximum interaction strength [17].
3 Generalized Interaction Screening
3.1 Generalized Regularized Interaction Screening Estimator
We propose a generalization of the estimator Rise, first introduced in [19] for pairwise binary graphical models, in order to reconstruct general discrete graphical models defined in (1). The generalized interaction screening objective (GISO) is defined for each vertex separately and is given by
(12) 
where are i.i.d. samples drawn from in Eq. (1), is the vector of parameters associated with the factors in , and the locally centered basis functions are defined as in Eq. (3). The GISO retains the main feature of the interaction screening objective (ISO) in [19]: it is proportional to the inverse of the factor in , except for the additional centering term . The GISO is a convex function of and retains the “screening” property of the original ISO. The GISO is used to define the generalized regularized interaction screening estimator (Grise) for the parameters, given by
(13) 
where and are the prior information available on as defined in (5) and (6).
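As an illustration, the GISO of Eq. (12) and its convexity can be sketched in Python for a single vertex of a toy two-spin model; the basis functions, samples, and parameter values below are illustrative assumptions, not taken from the formal development.

```python
import math

def giso(theta, basis, samples):
    """Sketch of the GISO in Eq. (12): the empirical average over samples of
    exp(-sum_k theta_k g_k(x)), where g_k are the locally centered basis
    functions associated with one vertex."""
    return sum(
        math.exp(-sum(t * g(x) for t, g in zip(theta, basis)))
        for x in samples
    ) / len(samples)

# Toy example: vertex u = 0 of a two-spin model with centered monomial basis
# (coupling term x_0 x_1 and magnetic-field term x_0).
basis = [lambda x: x[0] * x[1], lambda x: x[0]]
samples = [(+1, +1), (+1, -1), (-1, +1), (-1, -1)]

# Convexity along a segment: the value at the midpoint is at most the
# average of the values at the endpoints.
lo, hi = giso([0.0, 0.0], basis, samples), giso([1.0, 0.5], basis, samples)
mid = giso([0.5, 0.25], basis, samples)
assert mid <= 0.5 * (lo + hi) + 1e-12
```

Being a finite average of exponentials of linear functions of the parameters, the GISO is manifestly convex, which is what makes the estimator in (13) tractable.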
3.2 Error Bound on Parameter Estimation with Grise
We now state our main result regarding the theoretical guarantees on the parameters reconstructed by Grise. We call an optimal solution of (13) if
(14) 
Theorem 1 (Error Bound on Grise estimates).
Let be i.i.d. samples drawn according to in (1). Assume that the model satisfies Condition 1 for some norm and some constraint set , and let be the prescribed accuracy level. If the number of samples satisfies
(15) 
then, with probability at least , any estimate that is an approximate minimizer of Grise in the sense of (14), with , satisfies
(16) 
3.3 Computational Complexity of Grise for Parametrically Complete Constraint Sets
For a certain class of constraint sets , which we term parametrically complete, the problem can be solved in two steps: first finding a solution to an unconstrained problem, and then projecting it onto this set. Note, however, that in general the problem of finding optimal solutions to the constrained Grise can still be difficult, since the constraint set can be arbitrarily complicated.
Definition 2.
The constraint set is called a parametrically complete set if for all , there exists such that for all , we have
(17) 
Any satisfying (17) is called an equicost projection of onto and is denoted by
(18) 
When the constraint set is parametrically complete, an optimal solution to (13) can be found by first solving the unconstrained version and then performing an equicost projection onto . We define an optimal solution to the unconstrained Grise problem as
(19) 
Lemma 1.
Let be an optimal solution of the unconstrained Grise problem in Eq. (19). Then an equicost projection of is an optimal solution of the constrained Grise problem,
(20) 
Lemma 1 implies that the computational complexity of Grise in the parametrically complete case is the sum of the computational complexity of the unconstrained Grise and that of the projection step. The iterative Algorithm 1 takes as input a number of steps and outputs an optimal solution of Grise without constraints in Eq. (19). This algorithm is an application of the Entropic Descent Algorithm introduced by Beck and Teboulle [1] to a reformulation of Eq. (13) as a minimization over the probability simplex. Note that there exist other efficient iterative methods for minimizing the GISO, such as the mirror gradient descent of Ben-Tal et al. [2].
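For concreteness, one entropic descent pass on a toy linear objective over the simplex can be sketched as follows; the step-size schedule, the simplex radius, and the toy cost vector are illustrative assumptions, not the exact constants of Algorithm 1.

```python
import math

def entropic_descent(grad, dim, steps=200, radius=1.0):
    """Sketch of entropic (mirror) descent over the simplex of total mass
    `radius`: multiplicative update w <- w * exp(-eta * grad(w)), followed
    by renormalization back onto the simplex."""
    w = [radius / dim] * dim
    for t in range(1, steps + 1):
        eta = 1.0 / math.sqrt(t)  # illustrative diminishing step size
        g = grad(w)
        w = [wi * math.exp(-eta * gi) for wi, gi in zip(w, g)]
        s = sum(w)
        w = [radius * wi / s for wi in w]
    return w

# Minimizing the linear objective <c, w> over the unit simplex concentrates
# all mass on the coordinate with the smallest cost.
c = [0.9, 0.1, 0.5]
w = entropic_descent(lambda w: c, dim=3)
assert max(range(3), key=lambda i: w[i]) == 1
```

The multiplicative form of the update keeps the iterates strictly inside the simplex, which is what allows the ℓ1-constrained formulation of (13) to be handled after splitting parameters into positive and negative parts.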
The following proposition provides guarantees on the computational complexity of unconstrained Grise.
Proposition 1 (Computational complexity for unconstrained Grise).
Let be the optimality gap and be the maximum number of iterations. Then Algorithm 1 is guaranteed to produce an optimal solution of Grise without constraints with a number of operations less than
(21) 
where is a universal constant that is independent of all parameters of the problem.
We state the overall algorithm to compute Grise estimates in the parametrically complete constraint case below.
The following theorem regarding the computational complexity of Algorithm 2 follows easily by combining Lemma 1 and Proposition 1.
Theorem 2 (Computational complexity for Grise with parametrically complete constraints).
Let be a parametrically complete set and let be given. Then Algorithm 2 computes an optimal solution to Grise with a number of operations bounded by
(22) 
where denotes the computational complexity of the projection step.
As we will see in the next section, it is often possible to explicitly construct parametrically complete sets for which the associated computational complexity of the projection step is insignificant compared to the computational complexity of unconstrained Grise provided in Proposition 1.
4 Special Cases
As an application of Theorem 1, we show that with some well-suited basis functions we can recover the structure of arbitrary graphical models over arbitrary alphabets. We also demonstrate that our framework is able to recover parameter values and the structure of some well-known special cases, thus generalizing the existing results in the literature. For convenience, the results presented in this section are summarized in the following table:
Model name | Interaction order | Alphabet size | Recovery type | Sample complexity | Algorithmic complexity
Ising      | 2                 | 2             | structure     |                   |
Ising      | 2                 | 2             | parameter     |                   |
Binary     |                   | 2             | structure     |                   |
Pairwise   | 2                 |               | structure     |                   |
Pairwise   | 2                 |               | parameter     |                   |
General    |                   |               | structure     |                   |
In the table above, is related to the precision to which parameters are recovered in a certain norm, and is the chromatic number of the graph; see below for precise definitions. At this point, it is instructive to compare our sample complexity requirements to existing results. A direct application of the bounds of [12] and [21] to the case of pairwise multi-alphabet models that we consider below yields a dependence, whereas Grise has a complexity that scales as . In the case of binary multi-wise models, while [12] shows a scaling, Grise enjoys a sample complexity of . The algorithm of [9] recovers a subclass of general graphical models with bounded degree, but has a suboptimal double-exponential scaling in , while the Interaction Screening method leads to recovery of arbitrary discrete graphical models with a single-exponential dependence in and needs no bound on the degree. In terms of computational complexity, Grise achieves an efficient scaling for models with maximum interaction order , which matches the best-known scaling for pairwise models [12, 21].
4.1 Ising Models
For Ising models with maximal node degree , the alphabet is given by , and the basis functions are given by for and for . Observe that the basis functions are already centered, i.e., . Define the graph of an Ising model as the set of edges corresponding to nonzero couplings,
(23) 
The distribution specified by the Ising model is given by
(24) 
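For intuition, the distribution in Eq. (24) can be materialized by brute-force enumeration for a very small system; the coupling and field values below are illustrative assumptions.

```python
import itertools
import math

def ising_prob(p, J, h):
    """Sketch of the Ising distribution in Eq. (24): P(x) proportional to
    exp(sum_{ij} J_ij x_i x_j + sum_i h_i x_i) over x in {-1,+1}^p,
    normalized by brute-force enumeration (only viable for tiny p)."""
    def weight(x):
        return math.exp(
            sum(Jij * x[i] * x[j] for (i, j), Jij in J.items())
            + sum(hi * xi for hi, xi in zip(h, x))
        )
    states = list(itertools.product((-1, +1), repeat=p))
    Z = sum(weight(x) for x in states)  # partition function
    return {x: weight(x) / Z for x in states}

P = ising_prob(2, J={(0, 1): 0.8}, h=[0.0, 0.0])
assert abs(sum(P.values()) - 1.0) < 1e-12  # properly normalized
assert P[(+1, +1)] > P[(+1, -1)]           # positive coupling aligns spins
```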
For each node , we define the target set (see Section 2.2) as the set of potential edges associated with , given by
(25) 
The first corollary focuses on graph reconstruction of Ising models.
Corollary 1 (Graph reconstruction for Ising models).
Let be i.i.d. samples drawn according to in (24) and let denote the minimum edge strength. If the number of samples satisfies
(26) 
then, with probability at least , for all vertices the couplings are estimated with the error
(27) 
The computational complexity of obtaining these estimates is for fixed , , and . Moreover, thresholding the estimated couplings at recovers the structure, i.e., .
The thresholding operator in the above corollary is defined as
(28)
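The post-processing step of Corollary 1 can be sketched as follows; the dictionary layout and numeric values are illustrative assumptions.

```python
def threshold_edges(couplings, alpha):
    """Sketch of the thresholding in Eq. (28): declare an edge present when
    the estimated coupling magnitude is at least alpha / 2, i.e., half the
    minimum edge strength."""
    return {e for e, val in couplings.items() if abs(val) >= alpha / 2}

# If every estimate is within alpha / 2 of the truth, true edges (of
# strength at least alpha) survive and spurious ones are removed.
est = {(0, 1): 0.34, (0, 2): 0.05, (1, 2): -0.41}
assert threshold_edges(est, alpha=0.3) == {(0, 1), (1, 2)}
```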
Let be the chromatic number of , i.e., the smallest number of colors needed to obtain a proper vertex coloring of the graph. The second corollary focuses on parameter estimation of Ising models with chromatic number .
Corollary 2 (Parameter estimation for Ising models).
Let be i.i.d. samples drawn according to in (24). If
(29) 
then, with probability at least , for all vertices the couplings are estimated with the error
(30) 
The computational complexity of obtaining these estimates is for fixed , , and .
As graphs with bounded degree have chromatic number at most , Corollary 2 recovers the guarantees for sparse graph recovery of Vuffray et al. [19], albeit with a slightly worse dependence with respect to and . However, this is an artifact of the general analysis presented in this paper. For models over binary variables, one can improve the dependence to by using Bernstein's inequality in Proposition 2 instead of Hoeffding's inequality. Moreover, we hypothesize that for Ising models, a tight analysis of the tradeoff between the error on the estimated couplings and on the estimated magnetic fields in Proposition 3 can reduce the dependence to .
For graphs with unbounded vertex degree but low chromatic number, such as star graphs or bipartite graphs, Corollary 2 shows that the parameters of the corresponding Ising model can be fully recovered with a bounded error using a number of samples that is only logarithmic in the model size .
4.2 wise Binary Models on the Monomial Basis
In this section, we consider general models on the binary alphabet . Let the factors be all non-empty subsets of elements of size at most ,
(31) 
The set contains all potential hyperedges of size at most . The parameterization uses the monomial basis given by with . As for the Ising model, the monomial basis functions are already centered, . The probability distribution for this model is expressed as
(32) 
The factor graph associated with the binary graphical model is given by the hyperedges, or monomials, associated with nonzero coefficients in Eq. (32),
(33) 
For a given binary model as in (32), we call the subset of maximal hyperedges the structure of the model,
(34) 
where we define the maximal hyperedges of as
(35) 
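The monomial parameterization of this section can be sketched as a simple subset enumeration; the function names and the small sizes are illustrative assumptions.

```python
import math
from itertools import combinations

def monomial_basis(p, max_order):
    """Sketch of the monomial basis of Section 4.2: one basis function per
    non-empty subset S of at most `max_order` variables, each being the
    product of the corresponding +/-1 spins."""
    subsets = [S for k in range(1, max_order + 1)
               for S in combinations(range(p), k)]
    def make(S):
        return lambda x: math.prod(x[i] for i in S)
    return {S: make(S) for S in subsets}

basis = monomial_basis(p=3, max_order=2)
assert set(basis) == {(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)}
# Monomials are already centered: summing over x_0 at fixed x_1, x_2 is zero.
g = basis[(0, 1)]
assert g((+1, +1, +1)) + g((-1, +1, +1)) == 0
```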
Suppose that the minimal coupling is given by . Then Algorithm 3, implementing an iterative use of Grise, reconstructs the structure given independent samples from the distribution in (32).
Corollary 3 (Structure recovery for binary graphical models).
4.3 Pairwise Models with Arbitrary Alphabets
In this section, we consider pairwise models with an arbitrary alphabet of size , parametrized with indicator-type functions. The building block of the set of basis functions is the centered univariate indicator function, defined as
(37) 
where is a prescribed letter of the alphabet. The set of factors consists of pairs of vertices with alphabet values and single vertices with alphabet values . We slightly abuse notation by shortening the tuples and whenever there is no ambiguity with respect to the edge or vertex associated with the factor. Basis functions associated with edges are for and for . Note that basis functions over edges are centered with respect to both variables, i.e., . Similarly, basis functions associated with vertices are also centered. The probability distribution for this model is expressed as
(38) 
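The building block of Eq. (37) can be sketched directly; the encoding of the alphabet as integers 0, ..., q-1 is an illustrative choice.

```python
def centered_indicator(a, q):
    """Sketch of Eq. (37): the centered univariate indicator of letter a in
    a q-letter alphabet, x -> 1[x == a] - 1/q, which sums to zero over the
    alphabet by construction."""
    return lambda x: (1.0 if x == a else 0.0) - 1.0 / q

q = 3
phis = [centered_indicator(a, q) for a in range(q)]
# Each centered indicator sums to zero over the alphabet.
for phi in phis:
    assert abs(sum(phi(x) for x in range(q))) < 1e-12
```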
We seek to reconstruct the structure of pairwise models, which means that for every pair we want to be able to assess whether there exists at least one nonzero parameter, or to certify that these parameters are zero for all values of . Therefore, we define the set of target factors as and the set of residual factors as . We need to impose additional constraints on the target parameters, as the parametrization used in Eq. (38) is overcomplete. The reason is that the univariate indicator functions associated with a variable sum to zero independently of . This degeneracy is lifted by constraining the parameters in Eq. (38) in the following way,
(39) 
The set is also parametrically complete, as it follows from Lemma 8 (see Section 7.4) that the following projection onto is equicost in the sense of Definition 2,
(40)  
(41) 
Thanks to Lemma 1, this enables us to solve Grise with the constraints from Eq. (39) by first running the unconstrained version of Grise and then performing the projection.
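For a single edge, the projection in (40)-(41) amounts to a double-centering of the edge parameter matrix, so that every row and column sums to zero as required by Eq. (39); the matrix layout below is an illustrative assumption.

```python
def equicost_project(theta):
    """Sketch of the equicost projection for one edge: subtract row means
    and column means and add back the grand mean, so each row and each
    column of the projected matrix sums to zero (lifting the degeneracy of
    the overcomplete indicator parameterization)."""
    qa, qb = len(theta), len(theta[0])
    row = [sum(r) / qb for r in theta]
    col = [sum(theta[i][j] for i in range(qa)) / qa for j in range(qb)]
    tot = sum(row) / qa
    return [[theta[i][j] - row[i] - col[j] + tot for j in range(qb)]
            for i in range(qa)]

proj = equicost_project([[1.0, 2.0], [3.0, 5.0]])
assert all(abs(sum(r)) < 1e-9 for r in proj)              # rows sum to zero
assert all(abs(sum(proj[i][j] for i in range(2))) < 1e-9  # columns too
           for j in range(2))
```

The cost of this projection is linear in the number of parameters per edge, which is why it is insignificant compared to the unconstrained Grise step.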
Define the graph of the model by the edge set
(42) 
This corresponds to creating an edge between two nodes whenever there exists a nonzero parameter associated with these two variables. For the structure recovery, define the weighted norm by
(43) 
where is the maximum alphabet size in the model. Note that this weighted norm is always greater than the corresponding unweighted norm, and is significantly different from its unweighted counterpart for models with widely varying alphabet sizes across vertices. The following corollary provides guarantees for the structure recovery of pairwise models.
Corollary 4 (Structure recovery for pairwise models with arbitrary alphabets).
Let be i.i.d. samples drawn according to