
Entropy-based Characterization of Modeling Constraints

In most data-scientific approaches, the principle of Maximum Entropy (MaxEnt) is used to a posteriori justify some parametric model which has already been chosen based on experience, prior knowledge or computational simplicity. In a formulation perpendicular to conventional model building, we start from the linear system of phenomenological constraints and asymptotically derive the distribution over all viable distributions that satisfy the provided set of constraints. The MaxEnt distribution plays a special role, as it is the most typical among all phenomenologically viable distributions, representing a good expansion point for large-N techniques. This enables us to consistently formulate hypothesis testing in a fully data-driven manner. The appropriate parametric model which is supported by the data can always be deduced at the end of model selection. In the MaxEnt framework, we recover major scores and selection procedures used in multiple applications and assess their ability to capture associations in the data-generating process and to identify the most generalizable model. This data-driven counterpart of standard model selection demonstrates the unifying perspective of the deductive logic advocated by the MaxEnt principle, while potentially shedding new light on the inverse problem.


1 Introduction

Given phenomenological information in the form of marginals, averages or higher moments of features, modeling seeks to determine some class of statistical models (i.e. probability distributions) that best captures trends or laws in the underlying system that generated the observed data. Since phenomenological data is by definition a physical invariant of the problem at hand, it is most naturally expressed by the invariant form of the defining set of equations which constrain the space of model distributions to reproduce the observed moments. As long as the constraints remain of linear form, at least, this poses a well-defined problem, eliminating ambiguities related to reparametrization symmetries in the phenomenological information.

For a long time, it has been known that the distribution which satisfies the problem-defining constraints while maximizing the entropy functional is in a sense the least-biased distribution [17, 11] that incorporates the provided information and nothing more. Aside from this conceptually appealing interpretation, the entropy-based approach to modeling offers many benefits of practical significance. Given certain information from phenomenology, entropy maximization uniquely selects one model distribution as the most representative among all phenomenologically viable distributions, offering a clear deductive reasoning to modeling. At the same time, the maxent logic combinatorially assigns probabilities to any other distribution in the most intuitive way, namely according to its “distance” from the provided set of observations. As this point appears to be of fundamental importance in the maxent logic, we start our more formal parameter-agnostic exploration by investigating the distribution over phenomenologically viable distributions.

Since entropy measures the information flowing into our model, the maxent principle applied to the linear system of phenomenological constraints also turns classification tasks into clear-cut deductive procedures. In this work, we explain how to unambiguously define model architectures and directly count the physical degrees of freedom defining model dimensionality, in order to ultimately perform a selection of significant constraints that can be distinguished from sampling noise. Furthermore, we examine how large-order expansions of information-theoretic measures and benchmark scores can be meaningfully evaluated when performed around the distribution of maximum entropy.

More specifically, we rigorously show that expansions in the fluctuations of probabilities around their maxent estimate at large sample size result in leading Gaussian kernels with terms that are quadratic in the fluctuating degrees of freedom. Such a formal observation justifies rewriting discrete distributions over distributions as multivariate normal distributions in the fluctuations around the maxent saddle point. Asymptotically, this suggests the integral representation of expectation sums over viable distributions, which can in turn be evaluated using standard large-N localization techniques. Eventually, we arrive at hypothesis testing in the maxent world and suggest significance level(s) compatible with the large-N expansion. In addition, we review standard modeling scores, outlining their meaning in the present framework.

Under the unifying umbrella of maxent reasoning, we test the accuracy of (hyper)graph reconstruction and the generalizability of models (i.e. maxent distributions satisfying sets of constraints) trained on synthetic data from an effective Ising magnet. Taking N large verifies the theoretically anticipated estimates of error bounds. Most crucially, it reveals the universality of many measures which is expected to manifest at large order: information-theoretic “distances” between distributions asymptotically depend only on the number of physical degrees of freedom and not on further details of the phenomenological problem at hand. In particular, most model selection procedures seem to merge into a universal rule that accurately detects the asymptotic structure.

All in all, the constrained optimization of entropy gives a robust definition of the notion of a model, assigning to any distinct set of consistent constraints its maxent distribution. At an operational level, the simplicity and flexibility of directly working with physical degrees of freedom, i.e. probabilities, could help efficiently tackle inverse problems in data science for which parameters (lacking microscopic justification) often lead to distractingly cumbersome inference. In addition, it enables the large-N estimation of important information-theoretic metrics, such as the expected entropy, training and test errors. Under a unifying framework, the maxent principle seems to naturally provide the tools to consistently perform application-adaptive selection of the most suitable model reflecting underlying mechanisms in a fully data-driven fashion.

2 Phenomenological constraints

Any phenomenological problem is related to some probability distribution over physical realizations. In this work, we concretely investigate statistical systems of categorical features, each assuming a distinct number of states. All admissible realizations of these features comprise the space of microstates. A distribution over this space can be thought of as a vector living on the probability simplex which assigns a probability to each microstate. Modeling seeks to determine some optimal distribution(s) on the simplex that satisfy the phenomenological constraints (one of them being the normalization condition), which we denote by a vector.

The most obvious and easy to handle type of constraints is of linear form,

(1)

based on an architecture matrix with each row representing a constraint and columns corresponding to the microstates. In the following, we take the architecture matrix to be of full row rank, directly specifying the number of independent constraints. In realistic applications, however, a redundant set of linear constraints summarized by a coefficient matrix acting on the probability vector is often provided instead. Whenever the phenomenological input concretely consists of a set of marginal relative frequencies, the coefficient matrix becomes a binary matrix dictating which relative frequencies of microstates are to be summed over to obtain the marginal relative frequencies in the empirical set (i.e. each row of the coefficient matrix describes a marginal constraint sum). To obtain a robust model definition it is instructive to transform the coefficient matrix to its reduced row-echelon form. Dropping any all-zero row corresponding to redundant constraints, one then obtains the architecture matrix which uniquely characterizes the linear system at hand.

Noting in particular that there are by construction columns in the architecture matrix with all zeros but one (the leading) 1, it can easily be seen that normalization implies

(2)
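
As a concrete illustration of this reduction (our own sketch, not code from the paper; the matrix name C is a placeholder for the redundant coefficient matrix), a binary coefficient matrix over four microstates can be brought to reduced row-echelon form and stripped of all-zero rows as follows:

    import numpy as np
    from sympy import Matrix

    def architecture_matrix(C):
        """Bring a (possibly redundant) coefficient matrix C to reduced
        row-echelon form and drop all-zero rows, yielding a full-row-rank
        architecture matrix.  In an actual application the empirical vector
        of marginals has to be transformed by the same row operations."""
        R, pivots = Matrix(C.tolist()).rref()          # exact rref for integer entries
        R = np.array(R.tolist(), dtype=float)
        return R[:len(pivots), :]                      # nonzero rows come first in rref

    # toy example: two binary features with microstates ordered as (00, 01, 10, 11)
    C = np.array([[1, 1, 1, 1],    # normalization
                  [1, 1, 0, 0],    # marginal: first feature in state 0
                  [0, 0, 1, 1],    # marginal: first feature in state 1 (redundant)
                  [1, 0, 1, 0]])   # marginal: second feature in state 0
    print(architecture_matrix(C))  # 3 independent rows, i.e. rank(C) = 3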

The solution set of the linear system Eq. (1), which is unambiguously defined by the architecture matrix and the inhomogeneous vector, induces an equivalence class in the simplex. To see this, first note that the empirical distribution which summarizes the observed relative frequencies of all microstates always satisfies the coupled equations, thus being a member of the equivalence class. By construction, any distribution in the class shares the same statistics as the empirical distribution. Whenever only the normalization constraint is specified as an all-one row, the equivalence class covers the full simplex, whereas in the other extreme case where the architecture matrix becomes the identity, all frequencies are learned by heart and the class shrinks to the empirical distribution itself.

In a general setup, there might exist microstates whose probability is fixed in the problem statement to some (non)-finite empirical value. If some empirical marginal is known to be zero by logical deduction, for example, all the associated probabilities appearing in this marginal sum are necessarily set to zero due to the non-negativity on the simplex . In the following, we concentrate on all constraints and associated microstates that are not a priori determined by some empirical or logical argument removing any probabilities that are learned by heart from empirical data. The latter can be trivially reinstated at the end of model selection without modifying conclusions. In other words, we understand to only include those microstates with that retain some probabilistic interest for model building.

3 Entropy concentration

Mathematically, the notion of an optimal solution amounts to selecting a target functional to be extremized over the equivalence class associated to the system of phenomenological constraints. Choosing as a measure of uncertainty Shannon’s differential entropy [18, 15, 6],

(3)

we are naturally led to the principle of maximum entropy (maxent) under the linear constraints Eq. (1). In this way, we ensure that only the information specified by the linear system in Eq. (1) enters into the model distribution and nothing more. If no constraints were specified besides the dimensionality of the microstate space, then the distribution of maximum entropy would equal the uniform distribution, as a manifestation [12] of having the same degree of ignorance about any two microstates. In the other extreme scenario that all empirical frequencies were constrained, Eq. (1) would unambiguously fix the maxent distribution to the empirical distribution, as previously discussed. Most interestingly, we are concerned with all those cases where the architecture matrix is column-rank deficient, thus leading to a non-trivial equivalence class from which the maxent principle selects a representative distribution. As established in [14], the maxent distribution always exists and is unique for any consistent set of linear constraints.

Suppose we are now given a dataset with N records. From elementary combinatorics, we know that the multinomial coefficient (the r.h.s. gives the large-N expansion)

(4)

counts the possible ways to separate those N individuals into teams such that the observed counts are reproduced. As we have no way to distinguish among realizations of the counts, we should assign equal probabilities to them. Under this thermodynamic assumption, member distributions in the equivalence class formally distribute according to

(5)

at large sample sizes N. Incorporating model-dependent effects in the large-N expansion of Eq. (4) would result in subleading effects in any relevant expectation over the distribution of distributions in the equivalence class. At sufficiently large N we may thus neglect subleading corrections in Stirling’s formula used to expand the factorials and directly start from Eq. (5).

In particular, the ratio for any distribution measures the relative abundance of the maxent distribution over another member in the equivalence class. The large-N expansion of Eq. (4) leads to

(6)

Taking the large-N limit, we recognize [10] that the maxent distribution can be realized in more ways compared to any other distribution in the equivalence class, as the entropy difference will be by definition positive. Consequently, the maxent distribution is the one we should overwhelmingly anticipate given the empirical constraints, at the latest asymptotically, since any other distribution would be exponentially suppressed, justifying our choice of target functional. Intuitively, the dominating role of the maxent distribution over any other distribution in the equivalence class shows that it is in a sense typical of this set.
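
This exponential dominance is easy to verify numerically. The following sketch (our own illustration; it uses the standard Stirling-based identification of the log-multiplicity with N times the entropy) compares the maxent member of a small equivalence class, here defined by a single marginal constraint, with another member of the same class:

    import numpy as np
    from scipy.special import gammaln

    def log_multiplicity(counts):
        """log of the multinomial coefficient N! / prod_i n_i! via log-gamma."""
        counts = np.asarray(counts)
        return gammaln(counts.sum() + 1) - gammaln(counts + 1).sum()

    def entropy(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    N = 10_000
    # equivalence class over microstates (00, 01, 10, 11) fixing only the
    # first-feature marginal P(first = 0) = 0.6; the first member is its maxent
    p_maxent = np.array([0.30, 0.30, 0.20, 0.20])
    p_other  = np.array([0.45, 0.15, 0.05, 0.35])

    for p in (p_maxent, p_other):
        counts = np.round(N * p).astype(int)
        print(log_multiplicity(counts), N * entropy(p))   # agree up to O(log N)

    # log of the ratio of multiplicities: the maxent member is realized in
    # roughly exp(N * delta_S) more ways than the other member
    print(N * (entropy(p_maxent) - entropy(p_other)))     # about 2.0e3 here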

In order to prove the main and subsequent theorems in this paper we shall need a fairly general

Lemma 1

Let be the maxent distribution in equivalence class induced by architecture on data described by empirical distribution . For any distribution with linearized fluctuations

(7)

around the maxent distribution in the equivalence class, the large-N expansion of the entropy difference

(8)

results in a leading term which is quadratic in the fluctuations.

Proof

As long as the corresponding probabilities do not vanish (the case of a vanishing probability exhibits no probabilistic fluctuations and does not contribute to our formulas; formally, this is manifested by taking the appropriate limit in any information-theoretic measure), parametrizing fluctuations according to Eq. (7) does not degenerate. Generically, normalization on the simplex means

(9)

Furthermore, invoking a well-known fact [5, 2] from information theory about maxent distributions ,

(10)

we recognize that

(11)

whenever the maxent distribution in is chosen as expansion point. Combining Eq. eq:Normalization_Fluctuations with Eq. eq:BaseLemma_Fluctuations in the Taylor expansion of the entropy difference results into Eq. eq:EntropyDifference_LargeN. The factor of in parametrization of Eq. eq:LinearizedFluctuations is chosen [7] so that the quadratic term in Eq. eq:EntropyDifference_LargeN scales as in the large- expansion ensuring the proper asymptotic limit by suppressing higher-order corrections.

We are now in a position to state

Theorem 1

Asymptotically, the scaled difference of the entropy of any distribution from the maximum entropy in the equivalence class induced by the architecture on the data is distributed as chi-squared with the corresponding number of degrees of freedom. For any positive threshold, the likelihood to find a fraction of distributions in the equivalence class having entropies in the corresponding range is accordingly given by the cumulative distribution of the chi-squared distribution.

Proof

At sufficiently large N, we are interested in evaluating the fraction of distributions in the equivalence class having entropies in a range below the maximum entropy, i.e. in the region

(12)

Formally using the distribution over distributions introduced in Eq. (5), this fraction is written as

(13)

( denotes the indicator function), which is clearly governed by the entropy difference of each member distribution from maxent. For the purpose of evaluating as an expansion in powers of , we set

(14)

such that . Clearly, Lemma 1 applies, so that the governing entropy difference results in a quadratic term as in Eq. (8). On account of the linear system Eq. (1), fluctuations around any reference distribution in the equivalence class satisfy the homogeneous system

(15)

Due to these homogeneous constraints, any fluctuation lives in the cone spanned by the kernel of the architecture matrix after column-wise rescaling. At an operational level, the quadratic sum in Eq. (8), expressed in any valid coordinate system parameterizing the kernel of the rescaled architecture matrix, can be diagonalized with a unit-determinant Jacobian. To this end, we can write the fluctuations in Eq. (7) around the maxent solution as

(16)

where orthonormally encodes the kernel space, i.e. for and it is

(17)

Hence, the leading quadratic term in the entropy difference becomes

(18)

readily following a chi-squared distribution with degrees of freedom.

Exploiting the radial symmetry of the large-N expansion, we subsequently introduce spherical coordinates such that any distribution living on the sphere with given squared radius has leading entropy

(19)

Notice that the uniqueness of the optimal solution unambiguously fixes the origin of the chosen spherical coordinate system to correspond to the maxent distribution in the given equivalence class. Conveniently, the spherical parametrization allows us to write the integral representation of Eq. (13) after performing the angular integration,

(20)

as the cumulative distribution of the chi-squared distribution. This concludes the proof.
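
In symbols (our placeholder notation, since the paper's own symbols are fixed elsewhere: p* denotes the maxent distribution of the equivalence class, k the number of fluctuating directions, i.e. the kernel dimension of the rescaled architecture matrix; the 2N scaling is the familiar one of Jaynes' concentration theorem [9] and is quoted here as an assumption), the statement of Theorem 1 reads

    2N\,\bigl(S[p^{\ast}] - S[p]\bigr) \;\xrightarrow{\;N\to\infty\;}\; \chi^{2}_{k},
    \qquad
    \Pr\!\left[\, S[p^{\ast}] - S[p] \le \frac{x}{2N} \,\right] \;\longrightarrow\; F_{\chi^{2}_{k}}(x),

with F the chi-squared cumulative distribution function.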

Based on radial symmetry, Theorem 1 provides [9] a neat geometrization of the distribution over distributions Eq. (5) in the equivalence class. Around the maxent distribution in this space, any spherical shell of fixed radius traces over all those distributions whose entropy differs from the maximum by the same amount. Clearly, the interval within which the entropy of a specified portion of distributions is to be found becomes narrower with larger N, showing that the maxent distribution has an entropy more and more representative of the whole equivalence class. Hence, the allowed entropies concentrate around the maxent value in a geometric sense. A posteriori, this formally justifies our choice of entropy (which in the present setting emerges at large N rather than being postulated) as a target functional for selecting a model distribution.

3.1 Calculating the optimal solution

The preceding discussion about the concentration of entropies around the maxent and the application of chi-squared statistics presupposes that the distribution of maximum entropy has already been determined. Of course, one can parametrize the solution space of the linear system Eq. (1) and try to maximize the entropy analytically. For small dimensionality, the solution could have a neat closed form. In other setups, there might exist some perturbative method to proceed.

In a reverse approach, we directly incorporate the optimization objective into the phenomenological system to obtain a non-linear system of coupled equations, which can be iteratively solved via the Newton-Raphson method. Given the architecture matrix and the empirical inhomogeneous vector, we start from the uniform distribution. Upon using the exponential form [5] of the maxent distribution in the absence of zero-probability microstates,

(21)

for some vector of Lagrange multipliers, the update rule for the Newton estimate of the maxent solution to the system Eq. (1) at each iteration step reads

(22)

in terms of the previously estimated empirical moment

(23)

and Jacobian matrix

(24)

The latter matrix is invertible, as can easily be seen by noting that the vanishing of its quadratic form for any vector implies

(25)

since all probabilities remain positive (having excluded structural zeros, whose Newton estimate becomes trivially zero) while the rows of the architecture matrix are by assumption linearly independent. As is customary, Newton's method converges quickly to the unique maxent distribution, at the cost of a matrix inversion at each iteration step.
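
A minimal Newton sketch consistent with the description above (our own code, not the authors' script; it assumes that the normalization condition lies in the row space of the architecture matrix A, so that p_i = exp(sum_a lam_a A_ai) needs no explicit partition function and the Jacobian of the moment map is A diag(p) A^T):

    import numpy as np

    def maxent_newton(A, b, n_iter=50, tol=1e-12):
        """Newton iteration for the maxent distribution subject to A p = b,
        assuming A has full row rank, the normalization condition lies in its
        row space and no structural zeros are present (so p stays positive)."""
        n_constraints, n_states = A.shape
        # initialize the Lagrange multipliers so that p starts out uniform
        lam, *_ = np.linalg.lstsq(A.T, np.full(n_states, -np.log(n_states)),
                                  rcond=None)
        for _ in range(n_iter):
            p = np.exp(A.T @ lam)              # exponential form of the maxent ansatz
            residual = A @ p - b               # current moments minus empirical ones
            if np.max(np.abs(residual)) < tol:
                break
            J = (A * p) @ A.T                  # Jacobian A diag(p) A^T
            lam -= np.linalg.solve(J, residual)
        return p

    # toy run: normalization plus both single-feature marginals of a 2x2 system
    A = np.array([[1., 1., 1., 1.],
                  [1., 1., 0., 0.],
                  [1., 0., 1., 0.]])
    b = np.array([1.0, 0.6, 0.7])
    print(maxent_newton(A, b))   # factorizes here: [0.42, 0.18, 0.28, 0.12]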

There exist many variations of this main iterative scheme. Suppose that, instead of the architecture matrix, a binary matrix describing a possibly redundant set of marginal constraints is used. Cycling through each marginal constraint to iteratively update the probabilities via a one-dimensional Newton method (as the multidimensional Newton method would be plagued by the singularity caused by the redundancies) gives the rule

(26)

which is determined by the ratio of empirical marginals to the running estimate of model marginals

(27)

One can show [5, 8] that the Taylor-expanded r.h.s. of Eq. (26) always converges to the maxent solution. This approach leads to the algorithm of iterative proportional fitting [3].
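
For comparison, here is a minimal classical IPF sketch (again our own code; the one-dimensional Newton variant described above differs in the exact update but relies on the same empirical-to-model marginal ratios) that cycles through the raw binary constraint rows without any matrix inversion:

    import numpy as np

    def ipf(C, b, n_sweeps=200, tol=1e-12):
        """Iterative proportional fitting for binary marginal constraints C p = b:
        each step rescales the probabilities entering one marginal sum by the
        ratio of the empirical marginal to the running model marginal.
        Redundant rows are harmless, which is the practical advantage over the
        multidimensional Newton step."""
        n_constraints, n_states = C.shape
        p = np.full(n_states, 1.0 / n_states)      # start from the uniform distribution
        for _ in range(n_sweeps):
            for row, target in zip(C, b):
                p = np.where(row > 0, p * target / (row @ p), p)
            if np.max(np.abs(C @ p - b)) < tol:
                break
        return p

    # same toy system as before, now with a redundant marginal row included
    C = np.array([[1., 1., 1., 1.],
                  [1., 1., 0., 0.],
                  [0., 0., 1., 1.],    # redundant complement of the previous row
                  [1., 0., 1., 0.]])
    b = np.array([1.0, 0.6, 0.4, 0.7])
    print(ipf(C, b))                   # converges to the same maxent distribution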

4 Constraint selection

In this section, we exploit the logic advocated in Section 3 as our starting point in order to motivate a fully data-driven selection of phenomenological constraints. In most cases, we are given summary statistics of an empirical distribution, i.e. a set of phenomenological constraints summarized by an architecture whose invariant form induces an equivalence class, as outlined in Section 2. This robustly defines the notion of a model in a purely data-driven fashion. In addition, a chi-squared distribution asymptotically assigns probabilities to all member distributions in the equivalence class based on their entropy relative to maxent. With increasing sample size N, an increasing number of distributions has entropies concentrating around the maxent value, making the maxent distribution typical of the equivalence class and hence a good candidate for a model distribution.

4.1 Hypothesis testing

The hyper MaxEnt.

If the given architecture matrix defined all our knowledge about the system under investigation, we first ask how likely it would be to observe the empirical distribution whose entropy differs from the maxent by

(28)

On account of Theorem 1, the asymptotic likelihood to observe the empirical distribution – if the constraints in the architecture are all we know/trust from the data – is naturally given as the p-value under the corresponding cumulative distribution to record a value at least as extreme. If the empirical p-value lies above a pre-specified threshold, we could argue in favor of the maxent distribution as capturing all essential information, any discrepancy being attributed to mere statistical noise. In the opposite case of a p-value below the threshold, we would have to deduce that the provided constraints were not sufficient to capture the mechanisms of the data-generating process, systematically failing to adequately characterize the empirical distribution.

In principle, full access to the empirical distribution makes a classification of all possible architectures – given the microstate space – and corresponding generalized moments amenable. Any set of constraints which fails to capture the systematics of the data-generating process, so that the observed data becomes very atypical in the induced equivalence class, is immediately vetoed. In the spirit of Occam's razor, we select as our starting point the simplest model with the highest entropy which still makes the empirical dataset likely (empirical p-value above the threshold). We shall call the model incorporating all significant constraints – i.e. constraints whose removal leads to significantly low p-values – and nothing more, the hyper-maxent.
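
A sketch of the resulting test (our own code; it assumes the 2N scaling of Theorem 1 and, for the toy architecture of independent marginals, uses the fact that the corresponding maxent distribution is the product of the empirical marginals):

    import numpy as np
    from scipy.stats import chi2

    def entropy(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    def empirical_p_value(S_maxent, S_empirical, N, dof):
        """Probability to observe an entropy deficit at least as extreme as the
        empirical one within the equivalence class of a candidate architecture,
        under the chi-squared law with dof fluctuating directions."""
        return chi2.sf(2.0 * N * (S_maxent - S_empirical), dof)

    # toy usage: counts over the four microstates (00, 01, 10, 11)
    counts = np.array([430, 180, 270, 120])
    N, p_emp = counts.sum(), counts / counts.sum()
    # maxent under "normalization + single-feature marginals": product of marginals
    p_star = np.outer([p_emp[0] + p_emp[1], p_emp[2] + p_emp[3]],
                      [p_emp[0] + p_emp[2], p_emp[1] + p_emp[3]]).ravel()
    dof = 4 - 3                        # microstates minus independent constraints
    print(empirical_p_value(entropy(p_star), entropy(p_emp), N, dof))

A p-value above the chosen threshold would declare the residual dependence between the two features to be statistical noise; a value below it vetoes the candidate architecture.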

Likelihood ratio test.

For those constraints whose removal does not bring any significant result, as far as the empirical entropy difference is concerned, we could proceed by relative comparison of constraints according to

Theorem 2

Given are a base architecture and a more complex one which implies the former. Asymptotically, the scaled difference between the maximum entropies in the two induced equivalence classes is distributed as chi-squared with the corresponding number of degrees of freedom.

Evidently, Theorem 2 reduces to Theorem 1 upon using as the more complex architecture the one that learns the empirical distribution by heart.

Proof

This theorem describes a frequently encountered task in model building, namely to quantify the suitability of a simpler set of constraints over a more stringent one which implies the former. At the level of architecture matrices, such nested models are related via a linear map

(29)

for some transformation matrix of full row rank (otherwise the simpler architecture would be row-rank deficient). In our framework, both models are unambiguously characterized by their respective maxent distributions. Clearly, the exponential form Eq. (21) of the maxent distribution automatically accommodates the reference distribution of the simpler model, so that a compatible vector of multipliers always exists for the given empirical data.

In a designer setup, different datasets that respect the reference constraints are produced. By definition, a maxentist trusting the simpler architecture would always conclude from all those datasets that the same reference distribution is the appropriate one. On the other hand, a modeler trusting the more complex architecture would compute an a priori different maxent distribution depending on the received dataset, while still respecting the reference constraints. Writing this distribution as in Eq. (7), this translates into fluctuations around the reference maxent distribution satisfying the projection conditions

(30)

We are interested in all those cases where the latter modeler would select a different distribution whereas the first would insist on the reference maxent. To properly quantify this discrepancy we need to filter out redundancies whenever different datasets lead to the same model distributions, as they differ in constraints that neither of the two modelers can see, so that both insist on their respective maxent models. In other words, fluctuations around the reference maxent must be perpendicular to the kernel of the more complex architecture. Hence, the linearized fluctuations must look like

(31)

The reference projection Eq. (30) then constrains the vector via

(32)

upon using Eq. (29). Hence, the vector lies in the kernel of the transformation matrix, so that in total

(33)

where orthonormally spans the kernel of . Analogously to the proof of Theorem 1, this form of fluctuations with directly leads to the claimed chi-squared distribution by applying Lemma 1.

In the large-N expansion, the multinomial distribution for any two distributions,

(34)

is governed by the KL divergence of the model distribution from the empirical one. (The same applies to the multivariate hypergeometric distribution, as long as the population size and population counts are much larger than the size and frequencies of samples from this population.) In particular, the log-likelihood ratio to leading order, upon using Eq. (10), equals

(35)

i.e. the observed entropy difference between the maxent distributions of the more complex and the simpler model. Hence, we recover the likelihood ratio test [19] (see also the Lagrange multiplier test [4]) in the maxent framework, where the more complex model is preferred whenever it is sufficiently descriptive of all those distributions satisfying the additional constraints and nothing more. Conversely, we can formulate

Corollary 2.1

For any two sets of constraints described by architectures where the former is implied by the latter, the likelihood ratio of an observed dataset of sample size N is governed by the entropy difference between the maxents in the induced equivalence classes. The simpler maxent distribution is rejected as a good model distribution in favor of the more complex one at a given significance level whenever the corresponding p-value falls below that level.
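
The corollary translates into a few lines of code (our own sketch; it assumes the 2N scaling and takes the number of degrees of freedom to be the difference in the numbers of independent constraints of the two nested architectures):

    from scipy.stats import chi2

    def likelihood_ratio_test(S_simple, S_complex, k_simple, k_complex, N, alpha=0.05):
        """Maxent likelihood-ratio test for nested architectures: the simpler
        architecture is implied by the more complex one, so its maxent entropy
        S_simple is at least S_complex.  Returns the statistic, the p-value and
        whether the simpler maxent model is rejected at level alpha."""
        statistic = 2.0 * N * (S_simple - S_complex)
        p_value = chi2.sf(statistic, df=k_complex - k_simple)
        return statistic, p_value, p_value < alpha

    # e.g. comparing "independent marginals" (3 constraints) against the saturated
    # architecture (4 constraints) on the toy sample of the previous snippet.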

4.2 Information criteria

Maximum likelihood and BIC score.

In the equivalence class induced by architecture matrix on counts , the probability density of observing entropy difference is given according to

(36)

after redefining the entropic radius introduced in the proof of Theorem 1 to extract the large-order scaling which remains constant throughout the equivalence class. Taking the logarithm of Eq. (36) evaluated at the empirical entropy difference,

(37)

we recover from the model-dependent part the Bayesian Information Criterion (in short, the BIC score), first introduced in [16]. In the spirit of maximum likelihood estimation, the set of constraints must be selected which makes the data most likely to be observed. One thus needs to choose the architecture which assigns the highest weight to the observed dataset or, in other words, whose penalized empirical entropy difference maximizes the density in the induced equivalence class, conversely minimizing the BIC score.
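
Up to architecture-independent constants, such a score can be sketched as follows (our own construction, not the paper's exact expression; it uses the standard fact that for a maxent distribution reproducing the empirical moments the log-likelihood of the data equals minus N times its entropy, cf. Eq. (10), and whether the normalization row is counted among the constraints only shifts every architecture by the same amount):

    import numpy as np

    def bic_like_score(S_maxent, k_constraints, N):
        """BIC-style score of an architecture in the maxent framework:
        -2 log L + k log N = 2 N S_maxent + k log N up to constants that are
        identical for every architecture on a given sample.  Lower is better;
        differences between architectures reduce to the entropy differences
        discussed in the main text plus the dimensional penalty."""
        return 2.0 * N * S_maxent + k_constraints * np.log(N)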

The expectation of entropy and AIC score.

Another way to characterize an equivalence class and assess the quality of modeling is via expectations over member distributions . For instance, the expected entropy in equals

(38)

using the entropic radius introduced in the proof of Theorem 1 and taking the expectation over the chi-squared density. Simplicity (lower model dimension) seems to dictate that the expected entropy lies further away from the maxent, cf. the chi-squared probability density spreading out with an increasing number of degrees of freedom. On the other hand, goodness of fit translates into the expected entropy approaching the empirical one. Combining both heuristic objectives, we require the expected entropy to be closer to the empirical entropy than to the maxent. In other words, we could look for the minimum of the difference of entropy differences

(39)

In the spirit of Eq. (6), minimization of the AIC score (Akaike information criterion [1]) can be understood as selecting the set of constraints whose maxent-to-expected-entropy ratio of combinatorial possibilities is larger than the expected-to-empirical-entropy ratio.

5 Training and test metrics

For the purposes of this section, we suppose that some form of “truth” asymptotically exists. Let the asymptotic model be described as some maxent distribution of an asymptotic architecture. To assess the quality of modeling at a given sample size N, we first define the expected training error of an architecture by

(40)

where the model is understood to be the maxent distribution in the equivalence class induced by the training architecture on samples drawn from the “true” distribution. This error is governed by the expected Kullback-Leibler (KL) divergence [13] of the model from the sample distribution for samples drawn from the “true” distribution. Intuitively, the training error quantifies the average proximity to the asymptotic model achieved by models that were trained on a sample of size N. It depends on N, the training architecture and the unknown asymptotic model. To evaluate Eq. (40) we consider two distinct cases.

If model architecture implies all asymptotic constraints , then there must exist parameter vector such that

(41)

As data varies, linearized fluctuations of model distribution around ,

(42)

(column-wise multiplication by is a rank-preserving operation) induce a parametrization of some generic distribution around the asymptotic distribution ,

(43)

where the matrix orthogonally encodes the kernel, similarly to Eqs. (16)–(17). Note in particular that the normalization condition on the simplex, realized by the corresponding row in the coefficient matrix, automatically ensures

(44)

where represents the Gauss–Jordan transformation matrix from to . Enforcing thus requires that the components of are linearly dependent,

(45)

by vanishing projection on the moments of induced on the “true” distribution. Hence, there remain directions along which model distribution can vary.

At large N, the expansion Eq. (34) states that the multinomial distribution is governed by the “distance” of the asymptotic to the empirical distribution. Upon using Eq. (43), the information-theoretic approximation to the multinomial distribution gives a leading quadratic (in the fluctuations) term,

(46)

As argued around Eq. (25), precision matrices of the above form are invertible, as long as the matrices involved are of full row rank. Any linear term in the fluctuations vanishes identically due to the normalization condition Eq. (44) or drops after imposing Eq. (45), whereas mixed terms in the two sets of fluctuations vanish due to orthogonality. This large-N expansion of the multinomial distribution shows that it asymptotically goes over into a multivariate normal distribution, thus enabling us to nicely write the integral representation of Eq. (40), which takes the form of a multi-dimensional integral in the two sets of fluctuations. In particular, the Gaussian kernel associated to Eq. (46) can always be diagonalized by an orthogonal transformation in the fluctuations. At the same time, the integrand becomes

(47)

rendering straightforward the computation of

(48)

Whenever our model fully captures the asymptotic architecture, the expected error converges as . It is optimal precisely when .

On the other hand, if the model architecture misses any of the constraints from the asymptotic architecture, the large-N localization of the integral representation of Eq. (40) at the saddle point would give

(49)

where the saddle-point value is the model distribution induced by the architecture on the asymptotic counts. Since the KL divergence would be finite – the model architecture lacking some constraints – the training error diverges linearly in N in this case.

Similarly, one can estimate the expected test (or generalization) error of a model architecture ,

(50)

where the model distribution is understood as before. Parameterizing fluctuations around the asymptotic distribution along the lines of Eq. (42) and Eq. (43), we obtain

(51)

if the test model implies the asymptotic architecture, and an estimate that diverges with N whenever any of the asymptotic constraints is missed.
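
The shrinkage in the well-specified case can be checked by brute force. The following Monte Carlo sketch (entirely our own construction: a "true" model of two independent binary features, a training architecture of single-feature marginals whose maxent fit is simply the product of empirical marginals, and one common choice of KL ordering – the paper's precise definitions in Eq. (40) and Eq. (50) may differ in convention) estimates the expected divergence of trained models from the asymptotic distribution:

    import numpy as np

    rng = np.random.default_rng(0)

    def kl(p, q):
        """Kullback-Leibler divergence D(p || q) for strictly positive q."""
        mask = p > 0
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

    def fit_marginals(counts):
        """Maxent model of the 'single-feature marginals' architecture on the
        2x2 microstate space: the product of the two empirical marginals."""
        p = counts / counts.sum()
        f1 = np.array([p[0] + p[1], p[2] + p[3]])
        f2 = np.array([p[0] + p[2], p[1] + p[3]])
        return np.outer(f1, f2).ravel()

    # "true" asymptotic model: two independent features, so the training
    # architecture contains all asymptotic constraints
    p_true = np.outer([0.6, 0.4], [0.7, 0.3]).ravel()

    for N in (100, 1000, 10_000):
        errors = [kl(p_true, fit_marginals(rng.multinomial(N, p_true)))
                  for _ in range(2000)]
        print(N, np.mean(errors), 2 * N * np.mean(errors))  # last column roughly constant

Misspecifying the architecture instead (e.g. dropping one of the marginal constraints while the truth keeps a biased marginal) leaves a finite divergence that does not shrink with N, in line with the divergent estimates above.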

6 Model selection in the inverse Ising problem

A classical paradigm used both in benchmark tests and to illustrate various properties of magnetic materials is the Ising model. Suppose that a group of Ising spins in the lattice-gas gauge, labeled by a site index, which could have been extracted e.g. from a much bigger Ising crystal, has energy (Hamiltonian)

(52)

As we marginalize over lattice regions, there is nothing that prevents us from writing effective associations of higher orders among the remaining spins, but for concreteness we consider interactions up to cubic order. Furthermore, we take a small number of spins to speed up the evaluation of multiple instances tested in parallel. For our benchmark tests below, we shall randomly generate different values for the bias vector, the symmetric coupling matrix and the cubic tensor (excluding self-interactions, which are presently meaningless) based on the hypergraph

(53)

For simple Ising systems, any -point feature association implies all lower terms. For example, a trilinear association automatically induces all pairwise associations among spins , and as well as the one-site biases for , and . Hence, the Ising system has non-vanishing biases, couplings and 2 cubic interactions. The Boltzmann distribution

(54)

of the statistical system Eq. (52) defines an asymptotic model in a spin space of microstates. (In the parametric formulation of thermodynamics, the normalization condition is realized by the partition function.)

From this Ising system, a modeler in the inverse problem obtains samples at different sizes N. Knowing only that this is a system of potentially interacting spins leaves many candidate hypergraphs with distinct architectures reflecting different interaction structures. For each architecture, IPF or any Newton-based method can quickly deduce the corresponding maxent distribution on a given training sample, as outlined in Section 3.1. In such a setting, applications are usually concerned with two intimately related aspects: (i) the reconstruction of the original hypergraph from a sample of size N and (ii) the generalization error of a trained model on different samples drawn from the same system.
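
To make the setup tangible, here is a sketch (our own code, not the published benchmarking script; the number of spins, the chosen pairwise graph, the two cubic terms and the sign conventions of the Hamiltonian are illustrative assumptions, with five lattice-gas spins being consistent with the 31 candidate interactions quoted below) that builds one random realization, computes its Boltzmann distribution exactly and draws a training sample:

    import itertools
    import numpy as np

    rng = np.random.default_rng(1)
    n_spins = 5                                   # assumption: five 0/1 (lattice-gas) spins
    spins = np.array(list(itertools.product([0, 1], repeat=n_spins)))   # 32 microstates

    # one random realization: biases h_i, a few pairwise couplings J_ij, two cubic terms
    h = rng.normal(size=n_spins)
    J = {(i, j): rng.normal() for (i, j) in [(0, 1), (1, 2), (2, 3), (3, 4)]}
    K = {(i, j, k): rng.normal() for (i, j, k) in [(0, 1, 2), (2, 3, 4)]}

    def energy(s):
        """Hamiltonian with bias, pairwise and cubic interactions (signs are ours)."""
        e = -np.dot(h, s)
        e -= sum(Jij * s[i] * s[j] for (i, j), Jij in J.items())
        e -= sum(Kijk * s[i] * s[j] * s[k] for (i, j, k), Kijk in K.items())
        return e

    # exact Boltzmann distribution over the 32 microstates (unit temperature)
    weights = np.exp([-energy(s) for s in spins])
    p_true = weights / weights.sum()

    # one training sample of size N, recorded as counts per microstate
    N = 5000
    counts = rng.multinomial(N, p_true)

Feeding such counts to the IPF or Newton routines of Section 3.1 for each candidate architecture, and scoring the resulting maxent distributions as in Section 4, reproduces the benchmark loop described next.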

To answer those questions by better understanding the various information-theoretic metrics discussed in Section 4, we have independently generated 10 000 samples from different realizations of the Ising system Eq. (52) with interaction structure Eq. (53) at various sample sizes. By a realization, we simply mean a set of fields and couplings parametrizing the Ising Hamiltonian, which were independently drawn from a normal distribution of zero mean and fixed variance. On each training dataset of size N the maxent distribution associated to each of the architectures is deduced. Subsequently, the architecture is recorded which exhibits (a) the lowest BIC, (b) the lowest AIC, (c) the lowest model dimension with an empirical p-value above threshold (hyper-maxent), or (d) the lowest model dimension that remains typical with respect to all more complex models implying it (hyper-maxent + LRT).

In order to determine a threshold for (c) and (d) below which a p-value is considered to be significant (see Section 4.1), we exploit two facts. According to Eq. (48) and Eq. (51), expected errors tend to decrease for simpler models, hence the cutoff should increase for higher-dimensional kernels Eq. (16), for which we anticipate on average smaller atypical fluctuations. On the other hand, larger datasets tend to eliminate sampling noise, sharpening the distribution of p-values. The empirical p-value associated to any architecture that misses some asymptotic constraint goes faster to zero, while any slightly atypical dataset sampled from the asymptotic model could lead to smaller p-values under that model. The acceptance threshold should thus scale with the inverse sample size to ensure a proper asymptotic limit. Combining these observations, we conclude that the fraction

(55)

has a scaling which seems reasonable for a cut-off in maxent hypothesis testing. In a similar spirit, we use in the LRT part of (d) as significance level

(56)

to decide about the p-value when comparing two nested model architectures.

Figure 1: Left: Average “distance” from the asymptotic distribution of the model distribution selected by the various scores, as a function of sample size N. Below: Average reconstruction accuracy determined by the number of correctly identified hypergraphs in the generated samples. Right: Average generalization error (and its dispersion) of the model distribution on unseen data sampled from the asymptotic system, as a function of N.
Figure 2: N-dependent average rates of true/false positive associations against the asymptotic graph, characterizing the various selection procedures.

In Figure 1, we plot the training error Eq. (40) averaged over all sampled datasets of various sizes N. To get a better idea of the training quality, the average graph-reconstruction accuracy achieved at each sample size is comparatively plotted. This shows how often the model architecture selected by the various metrics precisely captures the underlying Ising model associated to the graph Eq. (53), but nothing more. In the right plot, the test error Eq. (50) on unseen samples is given, again averaged over the training datasets at each sample size. To obtain a feeling for the dispersion of the generalization error, we additionally plot the difference of the test error of the other scores from the consistently best-scoring LRT model within each sample.

We observe that the hyper-maxent and LRT procedures with the acceptance thresholds introduced in Eq. (55) and Eq. (56), as well as the BIC score, are all scoring metrics that are consistent with the large-N expansion(s). As N increases, they clearly exhibit the correct asymptotic behavior, eventually achieving full reconstruction accuracy. At the same time, the generalization error saturates the bound in Eq. (51) at large N. It is important to stress that modifying the significance level does not change the asymptotic behavior, as long as the scaling goes (at least) as the inverse sample size. This demonstrates the universality of the large-N expansion, which renders all compatible selection algorithms indistinguishable as N tends to infinity. On the other hand, AIC consistently appears sub-optimal, failing to recover the underlying model. Instead, the reasoning of maximizing the expected entropy Eq. (38) seems to capture models that are generically more complex than necessary and are thus plagued by false-positive (FP) associations.

To get a handle on true/false positive (TP/FP) rates, we plot the percentages of correctly identified interactions achieved by the various selection procedures, averaged over all sampled datasets at each size N. (A Python script to randomly generate Ising systems of a given architecture, sample from them and benchmark the presented model-selection procedures can be found under github.com/imbbLab/ModelSelection.) As a base reference, we use the probability that a randomly selected constraint out of the 31 possible ones is a true (false) positive, i.e. falls (does not fall) within the set of the 14 asymptotic constraints. This ROC-curve-inspired plot verifies our previous finding on the large-N compatibility of the other three selection procedures, the AIC score being the exception.

However, it is noteworthy that AIC manages to quickly reach a higher TP rate at the expense of keeping on average a finite FP rate, which in the present setting remains lower than the base rate (dotted gray line). By modifying the acceptance threshold Eq. (56), one can adjust the FP tolerance of the LRT selection in the small-N regime. Most interestingly, we recognize that BIC and the hyper-maxent consistently have (almost) vanishing FP rates, even at moderate sample sizes. In particular, this gives a strong hint in favor of the scaling of the acceptance threshold Eq. (55). To further support the suggested scaling for the threshold, we mention the observation that the architecture corresponding to the asymptotic graph Eq. (53) always has an empirical p-value (cf. Section 4.1) above the threshold at any sample size.

References

  • [1] H. Akaike (1974) A new look at the statistical model identification. IEEE transactions on automatic control 19 (6), pp. 716–723. Cited by: §4.2.
  • [2] S. Amari and H. Nagaoka (2000) Methods of information geometry. Vol. 191, American Mathematical Soc.. Cited by: §3.
  • [3] Y. M. Bishop, S. E. Fienberg, and P. W. Holland (2007) Discrete multivariate analysis: theory and practice. Springer Science & Business Media. Cited by: §3.1.
  • [4] A. Buse (1982) The likelihood ratio, Wald, and Lagrange multiplier tests: an expository note. The American Statistician 36 (3a), pp. 153–157. https://doi.org/10.1080/00031305.1982.10482817. Cited by: §4.1.
  • [5] I. Csiszár (1975) I-divergence geometry of probability distributions and minimization problems. The annals of probability, pp. 146–158. Cited by: §3, §3.1, §3.1.
  • [6] I. Csiszár (1996) Maxent, mathematics, and information theory. In Maximum entropy and Bayesian methods, pp. 35–50. Cited by: §3.
  • [7] C. Geyer and G. Meeden (2013) Asymptotics for constrained dirichlet distributions. Bayesian Analysis 8 (1), pp. 89–110. Cited by: §3.
  • [8] C. T. Ireland and S. Kullback (1968) Contingency tables with given marginals. Biometrika 55 (1), pp. 179–188. Cited by: §3.1.
  • [9] E. T. Jaynes (1989) Concentration of distributions at entropy maxima (1979). In E. T. Jaynes: Papers on Probability, Statistics and Statistical Physics, R. D. Rosenkrantz (Ed.), pp. 315–336. ISBN 978-94-009-6581-2. Cited by: §3.
  • [10] E. T. Jaynes (1968) Prior probabilities. IEEE Transactions on systems science and cybernetics 4 (3), pp. 227–241. Cited by: §3.
  • [11] E. T. Jaynes (2003) Probability theory: the logic of science. Cambridge university press. Cited by: §1.
  • [12] J. M. Keynes (1921) Chapter IV: The principle of indifference. A treatise on probability 4, pp. 41–64. Cited by: §3.
  • [13] S. Kullback (1997) Information theory and statistics. Courier Corporation. Cited by: §5.
  • [14] O. Loukas and H. R. Chung (2022) Categorical distributions of maximum entropy under marginal constraints. arXiv preprint arXiv:2204.03406. Cited by: §3.
  • [15] J. B. Paris and A. Vencovská (1990) A note on the inevitability of maximum entropy. International Journal of Approximate Reasoning 4 (3), pp. 183–223. Cited by: §3.
  • [16] G. Schwarz (1978) Estimating the dimension of a model. The Annals of Statistics 6 (2), pp. 461–464. Cited by: §4.2.
  • [17] C. E. Shannon (1948) A mathematical theory of communication. The Bell System Technical Journal 27 (3), pp. 379–423. External Links: Document Cited by: §1.
  • [18] J. Shore and R. Johnson (1980) Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Transactions on Information Theory 26 (1), pp. 26–37. Cited by: §3.
  • [19] S. D. Silvey (1975) Statistical inference. Chapman & Hall/CRC Monographs on Statistics & Applied Probability, Taylor & Francis. ISBN 9780412138201. Cited by: §4.1.