Representing and understanding the structure of direct correlations between distinct random variables with graphical models is a fundamental task that is essential to scientific and engineering endeavors. It is the first step towards an understanding of interactions between interleaved constituents of elaborated systems; it is key for developing causal theories ; and it is at the core of automated decision making , cybersecurity 20].
The problem of reconstruction of graphical models from samples traces back to the seminal work of Chow and Liu 
for tree-structured graphical models, and as of today is still at the center of attention of the learning community. For factor models defined over general hypergraphs, the learning problem is particularly challenging in graphical models over discrete variables, for which the maximum likelihood estimator is in general computationally intractable. One of the earlier tractable algorithms that has been suggested to provably reconstruct the structure of a subset of pairwise binary graphical models is based on inferring the sparsity pattern of the so-called regularized pseudo-likelihood estimator, equivalent to regularized logistic regression in the binary case. However, additional assumptions required for this algorithm to succeed severely limit the set of models that can be learned . After it was proven that reconstruction of arbitrary discrete graphical models with bounded degree can be done in polynomial time in the system size 
, Bresler showed that it is possible to bring the computational complexity down to quasi-quadratic in the number of variables for Ising models (pairwise graphical models over binary variables); however, the resulting algorithm has non-optimal sample requirements that are double-exponential in other model parameters. The first computationally efficient reconstruction algorithm for sparse pairwise binary graphical models with a near-optimal sample complexity with respect to the information theoretic lower bound , called Rise, was designed and analyzed by Vuffray et al. . The algorithm Rise
suggested in this work is based on the minimization of a novel local convex loss function, called theInteraction Screening objective, supplemented with an penalty to promote sparsity. Even though it has been later shown by Lokhov et al.  that regularized pseudo-likelihood supplemented with a crucial post-processing step also leads to a structure estimator for pairwise binary models, strong numerical and theoretical evidences provided in that work demonstrated that Rise is superior in terms of worst-case sample complexity.
Algorithms for learning discrete graphical models beyond pairwise and binary alphabets have been proposed only recently by Hamilton et al. , Suggala et al.  and Klivans et al. . The method in  works for arbitrary models with bounded degrees, but being a generalization of Bresler’s algorithm for Ising models , it suffers from similar prohibitive sample requirements growing double-exponentially in the strength of model parameters. The learning algorithm in  has a low sample complexity and has the advantage of dealing with graphical models possessing an arbitrary parametric form, the so-called non-parametric expxorcist family. However, being a generalization of the sparsity-based pseudo-likelihood estimator, it relies on restrictive assumptions similar to those used in  and in the earlier generalization to general discrete graphical models . The so-called Sparsitron algorithm in  has the flavor of a stochastic first order method with multiplicative updates. It has a low computational complexity and is sample-efficient for structure recovery of two subclasses of discrete graphical models: multiwise graphical models over binary variables or pairwise models with general alphabets. A recent follow-up work by Wu et al.  considered an contrained logistic regression, and showed that it provides a slight improvement of the sample complexity compared to  in the case of pairwise models over non-binary variables.
In this work, we propose a general framework for learning general discrete factor models expressed in an arbitrary parametric form, based on a significant generalization of the Interaction Screening method of [19, 13], previously introduced for pairwise binary models. Our primary insight lies in the identification of a single general condition related to model parameterization that is sufficient to obtain bounds on sample complexity. This provides a general framework for analyzing the sample complexity of parameter estimation for any given class of discrete graphical models, and we show that for many important examples, this condition can be readily verified. We use this to show that structure and parameter recovery results for all model classes previously considered in the literature in [19, 9, 12, 21] can be obtained as special cases of our general framework. Noticably, this includes the challenging case of multi-body interactions defined over general discrete alphabets. Hence, our work constitutes the first result providing efficient learning algorithm, termed Grise, with theoretical guarantees that reconstructs any discrete graphical model with multi-body interactions and general node-specific alphabets, while keeping sample requirements low. In addition, our theoretical guarantees can be expressed in any error norm, and explicitly includes distinction between bounds on the parameters of the underlying model and the prior parameters used in the optimization; as a result running Grise with prior information that is not tight only has moderate effect on the sample complexity bounds. Finally, we also provide a fully parallelizable algorithmic formulation for the resulting Grise estimator, and show that it runs in efficient time for a model of size with -order interactions, including the best-known scaling for pairwise models.
2 Problem Formulation
2.1 Parameterized family of models
We consider positive joint probability distributions over variables for . The set of variable indices is referred to as vertices . Node-dependent alphabets are assumed to be discrete and of size bounded by . Without loss of generality, the positive probability distribution over the
-dimensional vectorcan be expressed as
where is a set of basis functions acting upon subsets of variables that specify a family of distributions and are parameters that specify a model within this family. The quantity denotes the partition function and serves as a normalization constant that enforces that the in (1) is a probability distribution. For , let denote the set of factors corresponding to basis functions acting upon subsets that contain the variable and .
Given any set of basis functions, we can locally center them by first defining for a given , the local centering functions
where denotes the vector without , and define the locally centered basis functions,
As their name suggests, the locally centered basis functions sum to zero . To ensure the scales of the parameters are well defined, we assume that are chosen or rescaled such that all locally centered basis functions are normalized in the following sense:
for all vertices and basis factor . This normalization can always been achieved by choosing bounded basis functions . An important special case is when the basis functions are already centered, i.e. . In this case the basis functions are directly normalized . Note that one of the reasons to define the normalization in (4) in terms of the centered functions instead of is to avoid spurious cases where the functions have inflated magnitudes due to addition of constants .
2.2 Model selection problem
For each , let denote the set of target factors that we aim at reconstructing accurately and let be the set of residual factors for which we do not need learning guarantees. The target and residual parameters are defined similarly as and respectively. Given independent samples from a model in the family in Section 2.1, the goal of the model selection problem is to reconstruct the target parameters of the model.
Definition 1 (Model Selection Problem).
Given i.i.d. samples drawn from some distribution in Eq. (1) defined by , and prior information on given in form of an upper bound on the -norm of the local sub-components
and a local constraint set for each such that
compute estimates of such that the estimates of the target parameters satisfy
where denotes some norm of interest with respect to which the error is measured.
The bound on the -norm in (5) is a natural generalization of the sparse case where only has a small number of non-zero components; in the context of parameter estimation in graphical models, the setting of parameters bounded in the -norm have been previously considered in . The constraint sets are used to encode any other side information that may be known about model parameters. Below, we show how constraints encoded in can be conveniently used to lift degeneracy due to specific model parametrizations.
2.3 Sufficient conditions for well-posedness
The model from which the samples are drawn in the model selection problem in Definition 1 satisfies the following:
Constrained linear independence of target basis functions: There exists a constant such that for every vertex and any vector in the difference constraint set defined as
the following holds:
where denotes the components of , and denotes the norm used in Definition 1.
Finite maximum interaction strength: The quantity defined as
To understand why (C1) is required, consider a model that violates this condition and suppose that there exist a vertex and a vector for which . Since the probability distribution in Eq. (1) is positive, it implies that for all configurations we have the functional equality . This enables us to locally reparameterize the distribution:
where is a sum of locally centered functions that does not involve the variable . We see from equation Eq. (11) that by changing the parameterization of the residual factors, one can arbitrarily change the parameterization of the target factors. As a result, unique determination of the parameters solely from samples and knowledge of and , as required by the model selection problem (Definition 1), is impossible. We also note that in (C1) the lower bound on the RHS only consists of terms in the taget set . This is a weaker assumption than requiring that . However, as we will see in Theorem 1 below, any parameters in the target factors sets for which condition (C1) is verified for given , can be recovered to the precision specified in (7).
The bound on the interaction strength in (C2) is required to ensure that the model can be recovered with finitely many samples. For many special cases, such as the Ising model, the minimum number of samples required to estimate the parameters must grow exponentially with the maximum interaction strength .
3 Generalized Interaction Screening
3.1 Generalized Regularized Interaction Screening Estimator
We propose a generalization of the estimator Rise, first introduced in  for pairwise binary graphical models, in order to reconstruct general discrete graphical models defined in (1). The generalized interaction screening objective (GISO) is defined for each vertex separately and is given by
where are i.i.d samples drawn from in Eq. (1), is the vector of parameters associated with the factors in and the locally centered basis functions are defined as in Eq. (3). The GISO retains the main feature of the interaction screening objective (ISO) in : it is proportional to the inverse of the factor in , except for the additional centering term . The GISO is a convex function of and retains the “screening” property of the original ISO. The GISO is used to define the generalized regularized interaction screening estimator (Grise) for the parameters given by
3.2 Error Bound on Parameter Estimation with Grise
We now state our main result regarding the theoretical guarantees on the parameters reconstructed by Grise. We call an -optimal solution of (13) if
Theorem 1 (Error Bound on Grise estimates).
Let be i.i.d. samples drawn according to in (1). Assume that the model satisfies Condition 1 for some norm and some constraint set , and let be the prescribed accuracy level. If the number of samples satisfies
then, with probability at least , any estimate that is an -minimizer of Grise, with , satisfies
3.3 Computational Complexity of Grise for parametrically complete Constraint Sets
For a certain class of constraint sets , which we term parametrically complete, the problem can be solved in two steps: first, finding a solution to an uncostrained problem, and then projecting onto this set. Note, however, that in general the problem of finding -optimal solutions to constrained Grise can still be difficult since the constraint set can be arbitrarily complicated.
The constraint set is called a parametrically complete set if for all , there exists such that for all , we have
Any satisfying (17) is called an equi-cost projection of onto and is denoted by
When the constraint set is parametrically complete, an -optimal solution to (13) can be found by first solving unconstrained version and then performing an equi-cost projection onto . We define an -optimal solution to the unconstrained Grise problem as
Let be an -optimal solution of the unconstrained Grise problem in Eq. (19). Then an equi-cost projection of is an -optimal solution of the constrained Grise problem,
Lemma 1 implies that the computational complexity of Grise for parametrically complete case is the sum of the computational complexity of the unconstrained Grise and the projection step. The iterative Algorithm 1 takes as input a number of steps and output an -optimal solution of Grise without constraints in Eq. (19). This algorithm is an application of the Entropic Descent Algorithm introduced by Beck and Teboulle  to reformulation of Eq. (13) as a minimization over the probability simplex. Note that there exist other efficient iterative methods for minimizing the GISO, such as the mirror gradient descent from Ben-Tal et al. in .
The following proposition provides guarantees on the computational complexity of unconstrained Grise.
Proposition 1 (Computational complexity for unconstrained Grise).
Let be the optimality gap and be the maximum number of iterations. Then Algorithm 1 is guaranteed to produce an -optimal solution of Grise without constraints with a number of operation less than
where is a universal constant that is independent of all parameters of the problem.
We state the overall algorithm to compute Grise estimates in the parametrically complete constraint case below.
Theorem 2 (Computational complexity for Grise with parametrically complete constraints).
Let be a parametrically complete set and let be given. Then Algorithm 2 computes an -optimal solution to Grise can be computed with a number of operations bounded by
where denotes the computational complexity of the projection step.
As we will see in the next section, it is often possible to explicitly construct parametrically complete sets for which the associated computational complexity of the projection step is insignificant compared to the computational complexity of unconstrained Grise provided in Proposition 1.
4 Special Cases
As an application of Theorem 1, we show that with some well-suited basis functions we can recover the structure of arbitrary graphical models over arbitrary alphabets. We also demonstrate that our framework is able to recover parameter values and the structure of some well-known special cases, thus generalizing the existing results in the literature. For convenience, the results presented in this section are summarized in the following table:
In the table above, is related to the precision to which parameters are recovered in a certain norm, and is the chromatic number of the graph, see below for precise definitions. At this point, it is instructive to compare our sample complexity requirements to existing results. A direct application of bounds of  and  to the case of pairwise multi-alphabet models that we consider below yields dependence, whereas Grise has a complexity that scales as . In the case of binary -wise models, while  shows the scaling, Grise enjoys a sample complexity . The algorithm of  recovers a subclass of general graphical models with bounded degree, but has a sub-optimal double-exponential scaling in , while the Interaction Screening method leads to recovery of arbitrary discrete graphical models with a single-exponential dependence in and needs no bound on the degree. In terms of the computational complexity, Grise achieves the efficient scaling for models with the maximum interaction order , which matches the best-known scaling for pairwise models [12, 21].
4.1 Ising Models
For Ising models with maximal node degree the alphabet is given by , and the basis functions are given by for and for . Observe that the basis functions are already centered, i.e., . Define the graph of an Ising model as being the set of edges corresponding to non-zero couplings,
The distribution specified by the ising model is given by
For each node , we define the target set (see Section 2.2) as the set of potential edges associated with given by
The first corollary focuses on graph reconstruction of Ising models.
Corollary 1 (Graph reconstruction for Ising models).
Let be i.i.d. samples drawn according to in (24) and let denote the mimnimum edge strength. If the number of samples
then, with probability at least , for all vertices the coupling are estimated with the error
The computational complexity of obtaining these estimates is for fixed , , and . Moreover, thresholding the esitmated couplings at recovers the structure i.e. .
The thresholding operator in the above theorem is defined as
Let be the chromatic number of i.e. the smallest number of colors for obtaining a proper vertex coloring of the graph. The second corollary focuses on -parameter estimation of Ising models with chromatic number .
Corollary 2 (-parameter estimation for Ising Models).
Let be i.i.d. samples drawn according to in (24). If
then, with probability at least , for all vertices the coupling are estimated with the error
The computational complexity of obtaining these estimates is for fixed , , and .
As graphs with bounded degree have a chromatic number at most , Corollary 2 recovers the -guarantees for sparse graphs recovery of Vuffray et al. in  albeit with slightly worse dependence with respect to and . However, this is an artifact of the general analysis presented in this paper. For models over binary variables one can improve the dependence to using Berstein’s inequality in Proposition 2 instead of Hoeffding’s inequality. Moreover, we hypothesize that for Ising models, a tight analysis of the trade-off between the error on the estimated couplings and on the estimated magnetic fields in Proposition 3 can reduce the dependence to .
For graphs with unbounded vertex degree but low chromatic number, such as star graphs or bipartite graphs, Corollary 2 shows that the parameters of the corresponding Ising model can be fully recovered with a bounded -error using a number of samples that is only logarithmic in the model size .
4.2 -wise Binary Models on the Monomial Basis
In this section, we consider general models on binary alphabet . Let the factors be all nonempty subsets of elements of size at most ,
The set contains all potential hyperedges of size at most . The parameterization uses the monomial basis given by with . As for the Ising model, the monomial basis functions are already centered . The probability distribution for this model is expressed as
The factor graph associated with the binary graphical model is given by hyperedges or monomials associated with non-zero coefficients in Eq. (32),
For a given binary model as in (32), we call the subset of maximal hyperedges the structure of the model,
where we define the maximal hyperedges of as
Corollary 3 (Structure recovery for binary graphical models).
4.3 Pairwise Models with Arbitrary Alphabets
In this section, we consider pairwise models with arbitrary alphabet of size , parametrized with indicator-type functions. The building block of the set of basis functions is the centered univariate indicator function defined as
where is a prescribed letter of the alphabet. The set of factors are pair of vertices with alphabet values and single vertex with alphabet value . We slightly abuse notation by shortening the tuples and whenever there is no ambiguity with respect to the edge or vertex associated with the factor. Basis functions associated with edges are for and for . Note that basis functions over edges are centered for both variables, i.e. . Similarly basis functions associated with vertices are also centered. The probability distribution for this model is expressed as
We seek to reconstruct the structure of pairwise models which means that for every pair we want to be able to assess if there exists at least one non-zero parameter or to certify that these parameters are zero for all values of . Therefore we define the set of target factors as and the set of residual factors as . We need to require additional constraints on the target parameters as the parametrization used in Eq. (38) is over-complete. The reason is that the univariate indicator functions associated with a variable sum to zero independently of . This degeneracy is lifted by constraining parameters in Eq. (38) in the following way,
Define the graph of the model by the edge set
This corresponds to create an edge between the two nodes whenever there exists a non-zero parameter associated with these the two variables ,. For the structure recovery, define the weighted -norm by
where is the maximum alphabet size in the model. Note that this weighted -norm is always greater than the corresponding unweighted -norm and is significantly different than its unweighted counterpart for models with widely varying alphabets accross vertices. The following corollary provides reconstruction guarantees for the structure recovery of pairwise models.
Corollary 4 (Structure recovery for pairwise models with arbitrary alphabets).
Let be i.i.d. samples drawn according to