I Introduction
Over the last couple of decades, the growth of network science has impacted several disciplines by establishing new empirical facts about many real-world systems, in terms of the structural network properties that are typically found in them. In the context of trade economics, the growing availability of data about import/export relationships among world countries has prompted researchers to explore and model the architecture of the international trade network, or World Trade Web (WTW) [1, 2, 3, 4, 5, 6, 7, 12, 8, 9, 10, 11].
This approach has complemented, and in many ways enriched, the traditional econometric exercise of modelling individual trade flows, i.e. relating the volume of individual trade exchanges to the most relevant covariates (generally, macroeconomic factors) they may depend on. The earliest example of an econometric model for international trade is the celebrated Gravity Model (GM) [13], which predicts that the expected value of the trade volume from country $i$ to country $j$ can be expressed via the econometric function

$\langle w_{ij}\rangle = \beta_0\,(GDP_i)^{\beta_1}(GDP_j)^{\beta_2}\,D_{ij}^{\beta_3}$  (1)

where $GDP_i$ is the GDP of country $i$ divided by the arithmetic mean of the GDPs of all countries, $D_{ij}$ is the geographic distance between (generally the capitals of) countries $i$ and $j$, and $\vec{\beta}=(\beta_0,\beta_1,\beta_2,\beta_3)$ is a vector of parameters (notice that $\beta_1=\beta_2$ when the direction of the exchanges is disregarded, as we will do throughout the paper, and that $\beta_0$ takes care of dimensional units). Equation 1 has a long tradition of successfully explaining the existing (i.e. positive) trade volumes between pairs of countries. Outside economics, the gravity equation has also been extensively employed in studies concerning transportation, migration [14] and the maximization of utility functions constrained to satisfy requirements on the rate of information acquisition [15].

Although accurate in reproducing the positive trade volumes, the traditional GM cannot replicate structural network properties, unless the topology is completely fixed via a separate approach [16]. Indeed, if $W$ denotes the weighted adjacency matrix of the WTW, whose entry $w_{ij}$ represents the trade volume from country $i$ to country $j$, Eq. 1 predicts that the expected matrix $\langle W\rangle$ has no off-diagonal zeroes, i.e. that an expected positive trade relationship exists between all pairs of countries. This means that, when interpreted as the expected value of a regression with small, symmetric, zero-mean (e.g. Gaussian) noise, Eq. 1 predicts a fully connected network: if $a_{ij}$ denotes the generic entry of the binary adjacency matrix $A$ (equal to $1$ if a positive trade volume from country $i$ to country $j$ is present, i.e. $w_{ij}>0$, and equal to $0$ if a zero trade is, instead, observed, i.e. $w_{ij}=0$), then Eq. 1 predicts $a_{ij}=1$ almost surely, $\forall\,i\neq j$. This result is in obvious contrast with empirical data, which show that the WTW has a rich topological architecture, characterized by a broad degree distribution, (dis)assortative and clustering patterns and other properties [1, 2, 3, 4, 5, 6, 7, 12, 8, 9, 10, 11].
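As a concrete illustration, the prediction of Eq. 1 can be sketched in a few lines of Python; the GDP values, distances and parameter values below are hypothetical placeholders, not fitted estimates.

```python
def gravity_expected_weight(gdp_i, gdp_j, dist_ij, b0=1.0, b1=0.5, b2=1.0):
    """Gravity-model expected trade volume between two countries.

    Undirected specification (same exponent on both GDPs); the parameter
    values are illustrative placeholders, not fitted ones.
    """
    return b0 * (gdp_i * gdp_j) ** b1 * dist_ij ** (-b2)

# Toy system: rescaled GDPs and pairwise distances for three countries.
gdp = [2.0, 0.5, 1.5]
dist = [[0.0, 600.0, 1200.0], [600.0, 0.0, 900.0], [1200.0, 900.0, 0.0]]

expected = [[0.0] * 3 for _ in range(3)]
for i in range(3):
    for j in range(3):
        if i != j:
            expected[i][j] = gravity_expected_weight(gdp[i], gdp[j], dist[i][j])

# Every off-diagonal expectation is strictly positive: read as the mean of
# a regression with symmetric noise, the GM predicts a complete graph.
assert all(expected[i][j] > 0 for i in range(3) for j in range(3) if i != j)
```

Whatever (positive) parameter values are plugged in, the off-diagonal entries can never vanish, which is precisely the limitation discussed above.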
In order to overcome such a limitation, the plain Gravity Model needs to be ‘dressed’ with a probability distribution that produces $\langle w_{ij}\rangle$ as the expected value while, at the same time, accounting for null outcomes as well (i.e. those entries reading $w_{ij}=0$ and representing missing links in the network) [17]. Clearly, the support of the probability distribution should not include matrices with negative entries. From the Sixties on, the GM has indeed been interpreted as the expected value of a probability distribution whose functional form needs to be determined.
Trade econometrics models the tendency of countries to establish trade relationships by relating it to accepted macroeconomic determinants (the so-called ‘factors’), such as GDP and geographic distance, as in the expression of Eq. 1. Econometricians have considered increasingly flexible distributions, the most recent versions of which are capable of disentangling the estimation of the presence of a trade exchange from the estimation of the traded amount. This has led to the definition of two distinct classes of models, i.e. zero-inflated models [19] and hurdle models [20]. Zero-inflated (ZI) models have been introduced to account for the following two scenarios: the possibility of exchanging a zero amount of goods even after having established a trade partnership (e.g. because of a limited trading capacity); and the possibility of establishing a very small amount of trade (in fact, so small as to be compatible with statistical noise and, as such, removed). A general drawback of employing ZI models is that of predicting sparser-than-observed network structures [18]; moreover, only discrete distributions (specifically, either the Poisson or the negative-binomial one [19]) have been considered, so far, to carry out the proper weight-estimation step. Hurdle models, introduced to overcome the limitations affecting zero-inflated models, can predict zeros only at the first step [20]: in any case, the presence of links is established by employing either a logit or a probit estimation step.
Network science has tackled the aforementioned inference problem using techniques rooted in statistical physics. The most prominent examples descend from the Maximum-Entropy Principle (MEP) [37, 38, 39] applied to network ensembles [21, 22], prescribing the constrained maximization of Shannon entropy [23, 24] to obtain the maximally unbiased distribution of networks compatible with a chosen set of structural constraints. This approach is formally equivalent to the construction of so-called Exponential Random Graphs (ERGs) for social network analysis [25] but differs in the typical choice of the constraints: in particular, when the enforced constraints are local, such as the degree (number of links) and the strength (total link weight) of each node, maximum-entropy network models have been shown to successfully replicate both the topology and the weights of many economic and financial networks, including the WTW [5, 6, 26, 27, 28, 29, 30, 31, 32, 18]. The entire framework can also accommodate possibly degenerate, discrete-valued, single or multiple edges [33].
Although maximum-entropy models have also been studied from an economic perspective (see [34] for a discussion of the economic relevance of the constraints defining the Poisson and the geometric network models), it is only recently that progress has been made in reconciling the above two approaches, either by letting economic factors parametrize the maximum-entropy probability distribution producing links and weights [5, 6, 29, 30, 31, 32, 18] or by introducing network-related statistics into otherwise purely econometric models [35]. On the one hand, this novel framework enriches the methods developed by network scientists with an econometric interpretation; on the other, it enlarges the list of candidate distributions usable for econometric purposes.
With this contribution, we refine the theoretical picture provided in a companion paper [18], introducing models to infer the topology and the weights of undirected networks defined by continuous-valued data. In order to do so, we present a theoretical, physics-inspired framework capable of accommodating both integrated and conditional continuous models, our goal being threefold: 1) testing the performance of both classes of models on the WTW, in order to understand which one is best suited for the task; 2) offering a principled derivation of currently available, conditional, econometric models; 3) enlarging the list of continuous-valued distributions to be used for econometric purposes. From an econometric point of view, our work moves along the methodological guidelines defining the class of Generalized Linear Models (GLMs) [36], while enriching it with distributions defined by both econometric and structural parameters. From a statistical physics point of view, our work expands the class of maximum-entropy network models [21], or weighted ERGs [25], and endows them with macroeconomic factors replacing certain model parameters.
The rest of the paper is organized as follows: in Sec. II, after introducing the basic quantities, we derive the class of conditional models; in Sec. III we derive the class of integrated models; in Sec. IV we apply all models to the analysis of WTW data; in Sec. V we discuss the results and provide our concluding remarks.
II Conditional Models
Discrete maximum-entropy models can be derived by performing a constrained maximization of Shannon entropy [37, 38, 39]. However, unlike the companion paper [18], our focus, here, is on continuous probability distributions. In such a case, mathematical problems are known to affect the definition of Shannon entropy and the resulting inference procedure. To restore the framework, one has to introduce the Kullback-Leibler (KL) divergence of a distribution $P$ from a prior distribution $Q$ and reinterpret the maximization of the entropy of $P$ as the minimization of the divergence of $P$ from the given prior $Q$. In formulas, the KL divergence is defined as

$D_{KL}(P\|Q) = \int_{\mathbb{W}} P(W)\,\ln\frac{P(W)}{Q(W)}\,dW$  (2)

where $W$ is one of the possible values of a continuous random variable (in our setting, an entire network with continuous-valued link weights), $\mathbb{W}$ is the set of possible values that $W$ can take, $P(W)$ is the (multivariate) probability density function to be estimated and $Q(W)$ plays the role of the prior distribution, the divergence of $P$ from which must be minimized. Such an optimization scheme embodies the so-called Minimum Discrimination Information Principle (MDIP), originally proposed by Kullback and Leibler [40] and implementing the idea that, given a prior distribution $Q$ and new information that becomes available, an updated distribution $P$ should be chosen in order to make its discrimination from $Q$ as hard as possible. In other words, the MDIP demands that new data produce an information gain that is as small as possible. The use of the KL divergence is widespread in the fields of information theory [23] and machine learning [41], e.g. as a loss function within the Generative Adversarial Network (GAN) scheme (the aim of the ‘generating’ neural network being that of producing samples that cannot be distinguished, by the ‘discriminating’ neural network, from those constituting the training set).
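For intuition, the divergence in Eq. 2 can be approximated numerically; the sketch below uses two exponential densities standing in for $P$ and $Q$ and a midpoint-rule sum over a truncated support.

```python
import math

def kl_divergence(p, q, xs, dx):
    """Discretized D_KL(P||Q) = integral of p(x) ln[p(x)/q(x)] dx
    (midpoint rule on the grid xs with spacing dx)."""
    total = 0.0
    for x in xs:
        px = p(x)
        if px > 0.0:
            total += px * math.log(px / q(x)) * dx
    return total

# Two exponential densities standing in for P (posterior) and Q (prior);
# truncating the support at x = 50 leaves a negligible tail contribution.
p = lambda x: 2.0 * math.exp(-2.0 * x)
q = lambda x: 1.0 * math.exp(-1.0 * x)

dx = 1e-3
xs = [(k + 0.5) * dx for k in range(50_000)]
d_kl = kl_divergence(p, q, xs, dx)

# Closed form for two exponentials with rates a, b: ln(a/b) + b/a - 1.
assert abs(d_kl - (math.log(2.0) + 0.5 - 1.0)) < 1e-4
```

Note that the divergence is non-negative and vanishes only when $P$ and $Q$ coincide, which is what makes it a sensible discrimination measure.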
In order to introduce the class of conditional models, we write the posterior distribution as

$P(W) = P(A)\,P(W|A)$  (3)

where $A \equiv A(W)$ denotes the adjacency matrix of the binary projection of the weighted network $W$. The above equation allows us to split the KL divergence into the following sum of three terms

$D_{KL}(P\|Q) = -S[P(A)] - S[P(W|A)] + S_{\times}[P,Q]$  (4)

where

$S[P(A)] = -\sum_{A\in\mathbb{A}} P(A)\ln P(A)$  (5)

is the Shannon entropy of the probability distribution describing the binary projection of the network structure,

$S[P(W|A)] = -\sum_{A\in\mathbb{A}} P(A)\int_{\mathbb{W}_A} P(W|A)\ln P(W|A)\,dW$  (6)

is the conditional Shannon entropy of the probability distribution of the weighted network structure given the binary projection and

$S_{\times}[P,Q] = -\sum_{A\in\mathbb{A}} P(A)\int_{\mathbb{W}_A} P(W|A)\ln Q(W)\,dW$  (7)

is the cross entropy, quantifying the amount of information required to identify a weighted network sampled from the distribution $P$ by employing the distribution $Q$. When continuous models are considered, each of these functionals is defined by a first sum running over all the binary configurations $A\in\mathbb{A}$ within the ensemble and an integral over all the weighted configurations $W\in\mathbb{W}_A$ that are compatible with each specific binary structure embodied by the adjacency matrix $A$, i.e. such that $A(W)=A$.
The expression for $S_{\times}[P,Q]$ can be further manipulated as follows. Upon separating the prior distribution itself into a purely binary part and a conditional, weighted one, one can write

$Q(W) = Q(A)\,Q(W|A)$  (8)

an expression that allows us to write $S_{\times}[P,Q]$ as

$S_{\times}[P,Q] = -\sum_{A\in\mathbb{A}} P(A)\ln Q(A) - \sum_{A\in\mathbb{A}} P(A)\int_{\mathbb{W}_A} P(W|A)\ln Q(W|A)\,dW$  (9)

which, in turn, allows the KL divergence to be rewritten as

$D_{KL}(P\|Q) = D_{KL}[P(A)\|Q(A)] + D_{KL}[P(W|A)\|Q(W|A)]$  (10)

i.e. as a sum of two terms, one of which involves conditional distributions; specifically,

$D_{KL}[P(A)\|Q(A)] = \sum_{A\in\mathbb{A}} P(A)\ln\frac{P(A)}{Q(A)}$  (11)

$D_{KL}[P(W|A)\|Q(W|A)] = \sum_{A\in\mathbb{A}} P(A)\int_{\mathbb{W}_A} P(W|A)\ln\frac{P(W|A)}{Q(W|A)}\,dW$  (12)

with $Q(A)$ representing the binary prior and $Q(W|A)$ representing the conditional, weighted one. In what follows, we will deal with completely uninformative priors: this amounts to considering the somehow ‘simplified’ expression

$D_{KL}(P\|Q) = -S[P(A)] - S[P(W|A)] + \text{const}$  (13)

with

$S[P(A)] = -\sum_{A\in\mathbb{A}} P(A)\ln P(A)$  (14)

$S[P(W|A)] = -\sum_{A\in\mathbb{A}} P(A)\int_{\mathbb{W}_A} P(W|A)\ln P(W|A)\,dW$  (15)

The (independent) constrained optimization of $S[P(A)]$ and $S[P(W|A)]$ represents the starting point for deriving the members of the class of conditional models.
II.1 Choosing the binary constraints
The functional form controlling for the binary part of conditional models can be derived by carrying out a constrained maximization of the binary Shannon entropy

$S[P(A)] = -\sum_{A\in\mathbb{A}} P(A)\ln P(A)$  (16)

leading to a probability mass function reading

$P(A) = \frac{e^{-H(A)}}{\sum_{A'\in\mathbb{A}} e^{-H(A')}}$  (17)

where the functional form of the Hamiltonian reads $H(A)=\sum_{i<j}\alpha_{ij}a_{ij}$. This choice induces a factorization of the probability mass function $P(A)$, which becomes

$P(A) = \prod_{i<j} p_{ij}^{a_{ij}}(1-p_{ij})^{1-a_{ij}}$  (18)

i.e. a product of Bernoulli-like probability mass functions with $p_{ij} = \frac{e^{-\alpha_{ij}}}{1+e^{-\alpha_{ij}}}$, $\forall\,i<j$, where $\alpha_{ij}$ is the Lagrange multiplier controlling for the generic entry $a_{ij}$ of the adjacency matrix $A$.
In what follows, we will consider the specification of the tensor-like Hamiltonian introduced above reading $\alpha_{ij}=\alpha_i+\alpha_j$, $\forall\,i<j$, a choice inducing the Undirected Binary Configuration Model (UBCM), characterized by the following pair-specific probability coefficient

$p_{ij}^{\text{UBCM}} = \frac{x_i x_j}{1+x_i x_j}$  (19)

with $x_i \equiv e^{-\alpha_i}$, and ensuring that the entire degree sequence of the network at hand is reproduced.
The econometric reparametrization of the UBCM can be achieved by posing $x_i \equiv \sqrt{z}\,GDP_i$, $\forall\,i$, a choice inducing the so-called fitness model (FM), characterized by the pair-specific probability coefficient

$p_{ij}^{\text{FM}} = \frac{z\,GDP_i\,GDP_j}{1+z\,GDP_i\,GDP_j}$  (20)

and requiring the estimation of a single, global parameter only, i.e. $z$. The FM represents a particular case of the logit model [42], being defined by a vector of external properties (the ‘fitnesses’) that replace the information provided by some kind of (otherwise) purely structural properties [43, 5]: in fact, Eq. 20 can be equivalently rewritten in logit form, $\ln\frac{p_{ij}}{1-p_{ij}} = \ln z + \ln GDP_i + \ln GDP_j$. The global constant $z$ can be determined by imposing the total number of links $L$ as the only constraint. Remarkably, the fitness model has been proven to reproduce the (binary) properties of a wide spectrum of real-world systems [26, 6] as accurately as the UBCM, although requiring much less information.
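As a sketch of how the single global parameter of the FM can be estimated, the snippet below matches the expected number of links to an observed value by bisection; the GDP values and the target link count are hypothetical.

```python
import math

def fm_link_probability(z, f_i, f_j):
    """Fitness-model connection probability p_ij = z f_i f_j / (1 + z f_i f_j)."""
    x = z * f_i * f_j
    return x / (1.0 + x)

def calibrate_z(fitnesses, target_links, lo=1e-12, hi=1e12, iters=200):
    """Find the global parameter z such that the expected number of links
    sum_{i<j} p_ij matches the observed one; the expectation is monotonically
    increasing in z, so bisection (here on a log scale) converges."""
    n = len(fitnesses)

    def expected_links(z):
        return sum(fm_link_probability(z, fitnesses[i], fitnesses[j])
                   for i in range(n) for j in range(i + 1, n))

    for _ in range(iters):
        mid = math.sqrt(lo * hi)          # geometric midpoint
        if expected_links(mid) < target_links:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

# Toy example: five countries with heterogeneous (rescaled) GDPs,
# calibrated so that the expected number of links equals 6.
gdp = [5.0, 2.0, 1.0, 0.5, 0.1]
z = calibrate_z(gdp, target_links=6)
L_exp = sum(fm_link_probability(z, gdp[i], gdp[j])
            for i in range(5) for j in range(i + 1, 5))
assert abs(L_exp - 6) < 1e-6
```

The same one-dimensional root-finding logic applies to any global-constraint model; constraining the full degree sequence instead requires solving a system of coupled equations, one per node.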
In what follows, we will consider both the UBCM and the FM specifications.
II.2 Choosing the weighted constraints
The constrained maximization of $S[P(W|A)]$ proceeds by specifying the following set of weighted constraints

$\int_{\mathbb{W}_A} Q(W|A)\,dW = 1$  (21)

$\langle \vec{C}\rangle = \sum_{A\in\mathbb{A}} P(A)\int_{\mathbb{W}_A} Q(W|A)\,\vec{C}(W)\,dW = \vec{C}^*$  (22)

the first condition ensuring the normalization of the probability distribution and the vector $\vec{C}$ representing the ‘proper’ set of weighted constraints (weights are, now, treated as continuous random variables, i.e. $w_{ij}\in\mathbb{R}_{\geq 0}$, $\forall\,i<j$). They induce the distribution reading

$Q(W|A) = \frac{e^{-H(W)}}{Z_A}$  (23)

where $H(W)$ is the so-called Hamiltonian, listing the constrained quantities, and $Z_A$ is the partition function, conditional on the ‘fixed topology’ $A$.
The explicit functional form of $Q(W|A)$ can be obtained only once the functional form of the constraints has been specified. In what follows, we will deal with the Hamiltonian reading

$H(W) = \sum_{i<j}\left[\beta_{ij}\,w_{ij} + \alpha_{ij}\ln w_{ij} + \gamma\,(\ln w_{ij})^2\right]$  (24)

with the Lagrange multipliers satisfying the following requirements:

- $\beta_{ij} \equiv \beta\,z_{ij}^{-1}$, where $\beta$ is the Lagrange multiplier associated with the total weight and $z_{ij}$ encodes the dependence on purely econometric quantities;

- $\alpha_{ij}$ will be kept either in its dyadic form, to constrain the logarithm of each weight, or in its global form, $\alpha$, to constrain the sum of the logarithms of the weights, i.e. $\sum_{i<j} a_{ij}\ln w_{ij}$;

- $\gamma$ plays the role of the Lagrange multiplier associated with (a function of) the total variance of the logarithms of the weights, i.e. $\sum_{i<j} a_{ij}(\ln w_{ij})^2$.
II.2.1 Conditional exponential model.
Let us start by considering the simplest conditional model, defined by the positions $\alpha_{ij}=0$ and $\gamma=0$, inducing the Hamiltonian

$H(W) = \sum_{i<j}\beta_{ij}\,w_{ij}$  (25)

inserting the expression above into Eq. 23 leads to the distribution

$Q(W|A) = \prod_{i<j:\,a_{ij}=1}\beta_{ij}\,e^{-\beta_{ij}w_{ij}}$  (26)

and each node pair-specific distribution induces a (conditional) expected weight reading

$\langle w_{ij}\rangle = \frac{a_{ij}}{\beta_{ij}}$  (27)

From a purely topological point of view, constraining each weight and their total sum is redundant. However, this is no longer true when turning the conditional exponential model into a proper econometric one. Its econometric reparametrization should be consistent with the literature on trade, stating that the weights are monotonically increasing functions of the gravity specification $z_{ij}$; for this reason, the link function usually associated with the exponential distribution prescribes to identify the linear predictor with the inverse of the purely econometric parameter of the model, i.e.

$\beta_{ij} = \beta\,z_{ij}^{-1}$  (28)

a position that turns Eq. 27 into

$\langle w_{ij}\rangle = \frac{a_{ij}\,z_{ij}}{\beta}$  (29)

notice that the only structural constraint is, now, represented by the total weight (see also Appendix A).
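The two-step logic of the conditional exponential model can be sketched as follows: given a fixed topology, each existing link carries an exponentially distributed weight whose mean grows with its gravity term. The topology, the gravity terms and the global parameter below are illustrative assumptions, not fitted values.

```python
import random
import statistics

random.seed(42)

# Hypothetical fixed topology (four countries) and illustrative
# gravity terms z_ij, one per existing link.
links = [(0, 1), (0, 2), (1, 3), (2, 3)]
z = {(0, 1): 2.0, (0, 2): 0.5, (1, 3): 1.0, (2, 3): 4.0}
beta = 1.0   # global parameter tied to the total-weight constraint

def sample_weights():
    """Conditional exponential step: each existing link carries an
    exponentially distributed weight with mean z_ij / beta."""
    return {e: random.expovariate(beta / z[e]) for e in links}

# The ensemble average of a link's weight approaches its expected value.
draws = [sample_weights()[(2, 3)] for _ in range(20_000)]
assert abs(statistics.fmean(draws) - z[(2, 3)] / beta) < 0.15
```

Missing links keep a weight of exactly zero: the weighted step never has to ‘explain’ the zeros, which is the defining feature of the conditional class.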
II.2.2 Conditional gamma model.
Let us, now, consider a different Hamiltonian, constraining each weight, their total sum and the sum of their logarithms, i.e.

$H(W) = \sum_{i<j}\left[\beta_{ij}\,w_{ij} + \alpha\ln w_{ij}\right]$  (30)

it induces the distribution reading

$Q(W|A) = \prod_{i<j:\,a_{ij}=1}\frac{\beta_{ij}^{1-\alpha}}{\Gamma(1-\alpha)}\,w_{ij}^{-\alpha}\,e^{-\beta_{ij}w_{ij}}$  (31)

each node pair-specific distribution is characterized by a (conditional) expected weight reading

$\langle w_{ij}\rangle = \frac{a_{ij}(1-\alpha)}{\beta_{ij}}$  (32)

and by a (conditional) expected logarithmic weight reading

$\langle \ln w_{ij}\rangle = a_{ij}\left[\psi(1-\alpha) - \ln\beta_{ij}\right]$  (33)

where the function $\psi(x)\equiv\frac{d\ln\Gamma(x)}{dx}$ is the so-called digamma function.

Such a model can be turned into a proper econometric one by considering the inference scheme of the gamma model with inverse response, which allows us to identify the linear predictor with the inverse of the purely econometric parameter of the model, i.e.

$\beta_{ij} = \beta\,z_{ij}^{-1}$  (34)

a position that, in turn, leads to the expressions

$\langle w_{ij}\rangle = \frac{a_{ij}(1-\alpha)\,z_{ij}}{\beta}$  (35)

(allowing the conditional exponential model to be recovered in case $\alpha=0$, i.e. when the constraint on the sum of the logarithms of the weights is switched off) and

$\langle \ln w_{ij}\rangle = a_{ij}\left[\psi(1-\alpha) - \ln\beta + \ln z_{ij}\right]$  (36)

(see also Appendix A).
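A minimal sketch of the weighted step of the conditional gamma model: compared with the exponential case, the extra shape parameter lets the model tune the fluctuations of the weights around the same gravity-driven mean. The shape and mean values below are illustrative, not fitted.

```python
import random
import statistics

random.seed(7)

# Illustrative parameters: shape k (tied to the log-weight constraint) and
# a pair-specific mean, as in the exponential case; k = 1 would recover
# the conditional exponential model.
k = 2.5
mean_w = 3.0
theta = mean_w / k                  # Gamma(k, theta) has mean k * theta

draws = [random.gammavariate(k, theta) for _ in range(30_000)]
assert abs(statistics.fmean(draws) - mean_w) < 0.1

# Unlike the exponential case, the variance is controlled separately:
# Var[w] = k * theta^2, so heavier or lighter fluctuations around the
# same mean can be accommodated.
var = statistics.pvariance(draws)
assert abs(var - k * theta * theta) < 0.3
```

This decoupling of mean and variance is precisely what the additional constraint on the logarithms of the weights buys.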
II.2.3 Conditional Pareto model.
Constraining a slightly more complex function of the weights, i.e. their logarithm, leads to the Hamiltonian

$H(W) = \sum_{i<j}\alpha_{ij}\ln w_{ij}$  (37)

which, in turn, induces the distribution

$Q(W|A) = \prod_{i<j:\,a_{ij}=1}\frac{(\alpha_{ij}-1)\,m_{ij}^{\alpha_{ij}-1}}{w_{ij}^{\alpha_{ij}}}$  (38)

where $m_{ij}$ is the minimum, node pair-specific weight allowed by the model. Each node pair-specific distribution is characterized by a (conditional) expected weight reading

$\langle w_{ij}\rangle = a_{ij}\,\frac{(\alpha_{ij}-1)\,m_{ij}}{\alpha_{ij}-2}$  (39)

Such a model can be turned into a proper econometric one by considering the positions

(40)

ensuring that the expected weights are monotonically increasing functions of the gravity specification and leading to the expression

(41)

where $m$ is the empirical minimum weight (see also Appendix A).
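The conditional Pareto step can be sampled by inverse transform; the minimum weight and tail exponent below are hypothetical stand-ins for the fitted quantities.

```python
import random
import statistics

random.seed(11)

# Inverse transform sampling of Pareto weights above a minimum m: the
# survival function (m/w)**g inverts to w = m * u**(-1/g), u ~ U(0, 1].
m = 0.5      # hypothetical minimum observed weight
g = 3.0      # hypothetical tail exponent (g > 1 gives a finite mean)

def sample_pareto():
    u = 1.0 - random.random()        # u in (0, 1], avoids division by zero
    return m * u ** (-1.0 / g)

draws = [sample_pareto() for _ in range(50_000)]
assert min(draws) >= m                               # support starts at m
assert abs(statistics.fmean(draws) - g * m / (g - 1.0)) < 0.02
```

The heavy right tail is the point of this specification: a small fraction of links can carry weights far above the mean, mimicking the skewness of empirical trade volumes.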
Let us explicitly notice that the derivation of the gamma and Pareto distributions within the maximum-entropy framework has already been studied in [44]; here, however, we aim at making a step further, by identifying a suitable redefinition of these models’ parameters capable of turning them into proper econometric ones.
II.2.4 Conditional lognormal model.
Adding a global constraint on (a function of) the total variance of the logarithms of the weights to the Hamiltonian defining the Pareto model leads to the expression

$H(W) = \sum_{i<j}\left[\alpha_{ij}\ln w_{ij} + \gamma\,(\ln w_{ij})^2\right]$  (42)

the Hamiltonian above induces a distribution reading

$Q(W|A) = \prod_{i<j:\,a_{ij}=1}\frac{1}{w_{ij}\sqrt{2\pi\sigma^2}}\,e^{-(\ln w_{ij}-\mu_{ij})^2/2\sigma^2}$  (43)

i.e. a product of lognormal densities with $\sigma^2 \equiv (2\gamma)^{-1}$ and $\mu_{ij} \equiv \frac{1-\alpha_{ij}}{2\gamma}$. Each node pair-specific distribution is characterized by a (conditional) expected weight reading

$\langle w_{ij}\rangle = a_{ij}\,e^{\mu_{ij}+\sigma^2/2}$  (44)

by a (conditional) expected logarithmic weight reading

$\langle \ln w_{ij}\rangle = a_{ij}\,\mu_{ij}$  (45)

and by a (conditional) logarithmic weight whose squared expectation reads

$\langle (\ln w_{ij})^2\rangle = a_{ij}\left[\mu_{ij}^2+\sigma^2\right]$  (46)

Such a model can be turned into a proper econometric one by considering the position

(47)

ensuring that the expected weights are monotonically increasing functions of the gravity specification and leading to the expressions

(48)

(49)

(50)

(see also Appendix A).
III Integrated Models
The MDIP can also be implemented in a more direct way, by carrying out a constrained optimization of $D_{KL}(P\|Q)$ itself. In this second case, the following set of constraints

$\int_{\mathbb{W}} P(W)\,dW = 1$  (51)

$\langle \vec{C}\rangle = \int_{\mathbb{W}} P(W)\,\vec{C}(W)\,dW = \vec{C}^*$  (52)

can be specified, with obvious meaning of the symbols. Differentiating the corresponding Lagrangian functional with respect to $P(W)$ and equating the result to zero leads to

$P(W) = \frac{Q(W)\,e^{-H(W)}}{Z}$  (53)

where $H(W)$ is, again, the Hamiltonian and $Z$ is the ‘integrated’ partition function.

The explicit functional form of $P(W)$ can be obtained only once the functional forms of both the prior distribution and the constraints have been specified. In what follows, we will deal with completely uninformative priors, a choice that amounts to considering the simplified expression

$P(W) = \frac{e^{-H(W)}}{Z}$  (54)

notice that the result above could also have been derived by carrying out a constrained minimization of

$-S[P] = \int_{\mathbb{W}} P(W)\ln P(W)\,dW$  (55)

i.e. of (minus) the functional named differential entropy, into which the KL divergence ‘degenerates’ in case completely uninformative priors are considered.
III.1 Choosing the constraints
Let us, now, specify the functional form of the constraints. In what follows, we will deal with a specific instance of the generic Hamiltonian

$H(W) = \sum_{i<j}\left[\alpha_{ij}\,a_{ij} + \beta_{ij}\,w_{ij}\right]$  (56)

in particular, one could pose $\alpha_{ij}=\alpha$, a choice that would lead to constraining the total number of links, or $\alpha_{ij}=\alpha_i+\alpha_j$, a choice that would lead to constraining the whole degree sequence. If not specified otherwise, in what follows we will employ the second functional form and pose $\beta_{ij}=\beta\,z_{ij}^{-1}$, where $\beta$ is the Lagrange multiplier associated with the total weight and $z_{ij}$ encodes the dependence on purely econometric quantities. Our choices induce the Hamiltonian of the so-called integrated exponential model, i.e. $H(W)=\sum_{i<j}\left[(\alpha_i+\alpha_j)\,a_{ij}+\beta_{ij}\,w_{ij}\right]$, that leads to the distribution

$P(W) = \prod_{i<j} q_{ij}(w_{ij}), \qquad q_{ij}(w_{ij}) = \frac{e^{-(\alpha_i+\alpha_j)a_{ij}-\beta_{ij}w_{ij}}}{\zeta_{ij}}, \qquad \zeta_{ij} = 1+\frac{x_i x_j}{\beta_{ij}}$  (57)

where $x_i \equiv e^{-\alpha_i}$. The generic node pair-specific distribution induces a probability for nodes $i$ and $j$ to be connected reading

$p_{ij} = \frac{x_i x_j}{x_i x_j + \beta_{ij}}$  (58)

besides, the corresponding expected weight reads

$\langle w_{ij}\rangle = \frac{p_{ij}}{\beta_{ij}}$  (59)

Equation 58 clarifies why the models considered in the present section are classified as ‘integrated’: each node pair-specific probability of connection is a function of the parameters controlling for both topological and weighted properties. Models of this kind are, thus, capable of ‘integrating’ information concerning a network structure with information concerning its weights, hence employing them in a joint fashion to define both inference steps. The recipe for the econometric reparametrization of the integrated exponential model reads as the one of its conditional counterpart, i.e.

$\beta_{ij} = \beta\,z_{ij}^{-1}$  (60)

a position that turns Eq. 59 into

$\langle w_{ij}\rangle = \frac{p_{ij}\,z_{ij}}{\beta}$  (61)

where

$p_{ij} = \frac{x_i x_j\,z_{ij}}{x_i x_j\,z_{ij} + \beta}$  (62)

(see also Appendix B).
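The joint character of the integrated class can be sketched as follows; the functional forms mirror a pairwise-factorized model in which node fitnesses and a gravity term enter both the connection probability and the expected weight, and all parameter values are illustrative.

```python
def integrated_exp_pair(x_i, x_j, z_ij, beta):
    """Sketch of one node pair under an integrated exponential model:
    the same parameters fix both the probability p_ij that the link
    exists and its expected weight (zeros included)."""
    odds = x_i * x_j * z_ij / beta
    p_ij = odds / (1.0 + odds)
    w_ij = p_ij * z_ij / beta
    return p_ij, w_ij

# Raising the gravity term makes the link both likelier and heavier:
# topological and weighted information are 'integrated'.
p_low, w_low = integrated_exp_pair(1.0, 1.0, z_ij=1.0, beta=2.0)
p_high, w_high = integrated_exp_pair(1.0, 1.0, z_ij=5.0, beta=2.0)
assert 0.0 < p_low < p_high < 1.0
assert w_low < w_high
```

By contrast, in a conditional model $p_{ij}$ would depend on the binary parameters only, and the gravity term would affect nothing but the weights.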
IV Results
The effectiveness of the two classes of models considered in the present paper in reproducing the topological properties of the World Trade Web has been tested on two different datasets, i.e. the Gleditsch one (covering 11 years, from 1990 to 2000 [45]) and the BACI one (covering 11 years, from 2007 to 2017 [46]). To carry out our analyses, we have built the ensemble induced by each model as follows. First, the presence of a link connecting any two nodes $i$ and $j$ is established with probability $p_{ij}$: as usual, a real number $u_{ij}$ is drawn from the uniform distribution whose support coincides with the unit interval, i.e. $u_{ij}\sim U(0,1)$, and compared with $p_{ij}$; if $u_{ij}\leq p_{ij}$, nodes $i$ and $j$ are linked, otherwise they are not. Once the presence of a link is established, it is loaded with a weight by employing the inverse transform sampling technique. According to it, it is possible to generate random variables by considering the complementary cumulative distribution

$F_{ij}(w) \equiv Q(w_{ij}\geq w) = \int_w^{+\infty} q_{ij}(w')\,dw'$  (63)

since $F_{ij}(w_{ij})$ behaves like a random variable uniformly distributed between 0 and 1, say $u_{ij}$, the equation can be inverted to obtain $w_{ij}=F_{ij}^{-1}(u_{ij})$, i.e. the weight to be assigned to the pair $i$-$j$. Each ensemble consists of a fixed number of sampled configurations; the error accompanying the estimate of any quantity of interest is quantified via the confidence intervals (CIs) induced by its ensemble distribution.
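The ensemble-construction recipe above can be sketched end-to-end for the simplest weight distribution: for an exponential weight with a given mean, inverting the complementary CDF gives $w = -\text{mean}\cdot\ln u$. The link probabilities and mean weights below are illustrative.

```python
import math
import random

random.seed(3)

def sample_network(p, mean_w):
    """One undirected configuration: (i) Bernoulli draw of each link
    a_ij with probability p[i][j]; (ii) inverse transform sampling of
    the weight of each existing link, w_ij = -mean_w[i][j] * ln(u)."""
    n = len(p)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if random.random() < p[i][j]:
                u = 1.0 - random.random()        # u in (0, 1]
                w[i][j] = w[j][i] = -mean_w[i][j] * math.log(u)
    return w

# Toy 3-node system with hypothetical link probabilities and mean weights.
p = [[0.0, 0.8, 0.3], [0.8, 0.0, 0.5], [0.3, 0.5, 0.0]]
mw = [[0.0, 2.0, 1.0], [2.0, 0.0, 4.0], [1.0, 4.0, 0.0]]

ensemble = [sample_network(p, mw) for _ in range(10_000)]
freq_01 = sum(1 for w in ensemble if w[0][1] > 0) / len(ensemble)
assert abs(freq_01 - p[0][1]) < 0.02     # sampled link frequency ≈ p_01
```

Percentile-based CIs for any statistic then follow by evaluating it on every sampled configuration and reading off the desired quantiles of the resulting distribution.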


IV.1 Model selection via statistical indicators
Let us consider two measures of goodness-of-fit, i.e. the reconstruction accuracy and the Kolmogorov-Smirnov (KS) compatibility frequency: the former is defined as the percentage of node-specific values of a given statistic falling within the CIs induced by model $M$, at the chosen significance level; the latter measures the percentage of years in the dataset for which the distribution of a given expected statistic, under model $M$, is found to be compatible with the empirical one, according to the two-sample Kolmogorov-Smirnov test.
The network statistics for which the two indicators have been computed are the degree sequence

$k_i = \sum_{j\neq i} a_{ij}$  (64)

which gives information about the tendency of single nodes to link with other trade partners, the average nearest neighbors degree

$k_i^{nn} = \frac{\sum_{j\neq i} a_{ij}\,k_j}{k_i}$  (65)

which gives information about the degree correlations, and the clustering coefficient

$c_i = \frac{\sum_{j\neq i}\sum_{k\neq i,j} a_{ij}a_{jk}a_{ki}}{k_i(k_i-1)}$  (66)

which measures the percentage of a node’s partners that are also partners themselves. For what concerns the weighted statistics, we have considered the strength sequence

$s_i = \sum_{j\neq i} w_{ij}$  (67)

which gives information about the trade flow of a country, the average nearest neighbors strength

$s_i^{nn} = \frac{\sum_{j\neq i} a_{ij}\,s_j}{k_i}$  (68)

which gives information about the strength correlations, and the weighted clustering coefficient

$c_i^w = \frac{\sum_{j\neq i}\sum_{k\neq i,j}\left(w_{ij}w_{jk}w_{ki}\right)^{1/3}}{k_i(k_i-1)}$  (69)

which weighs the closed triangular patterns node $i$ establishes with its trade partners.
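The binary statistics above can be computed directly from an adjacency matrix; a minimal sketch (with the clustering coefficient in one common normalization) follows.

```python
def degree(a, i):
    """Degree k_i: number of trade partners of node i."""
    return sum(a[i][j] for j in range(len(a)) if j != i)

def annd(a, i):
    """Average nearest-neighbour degree of node i."""
    return sum(a[i][j] * degree(a, j)
               for j in range(len(a)) if j != i) / degree(a, i)

def clustering(a, i):
    """Fraction of (ordered) pairs of i's partners that are connected."""
    n = len(a)
    tri = sum(a[i][j] * a[j][k] * a[k][i]
              for j in range(n) for k in range(n)
              if j != i and k != i and k != j)
    k_i = degree(a, i)
    return tri / (k_i * (k_i - 1))

# A triangle among nodes 0, 1, 2 plus a pendant node 3 attached to 0.
a = [[0, 1, 1, 1],
     [1, 0, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 0, 0]]
assert degree(a, 0) == 3
assert clustering(a, 1) == 1.0           # both partners of node 1 are linked
assert abs(clustering(a, 0) - 1 / 3) < 1e-12
assert abs(annd(a, 0) - 5 / 3) < 1e-12   # partners' degrees: 2, 2, 1
```

The weighted analogues follow the same pattern, with $w_{ij}$ replacing $a_{ij}$ in the appropriate sums.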
IV.1.1 KS compatibility frequency
Table 1 lists the values of the KS compatibility frequency for both binary and weighted network statistics. For what concerns the binary statistics, we report the performance of three different models, i.e. the UBCM, the FM and the integrated exponential model (denoted as IExp). For the Gleditsch dataset, compatibility is observed for every year; for the BACI dataset, instead, this is no longer true: in fact, the FM outputs predictions that are not compatible with the empirical values for a large number of years, irrespective of the considered quantity; the UBCM and the IExp (i.e. the models constraining the degrees), instead, output predictions whose compatibility depends on the considered quantity: higher-order statistics are the ones for which the two aforementioned models ‘fail’ to the largest extent. Overall, these results lead us to prefer the UBCM as the ‘first-step algorithm’ of our conditional models.

Let us, now, comment on the performance of our models in reproducing the weighted statistics. As can be appreciated upon looking at Table 1, the only models outputting predictions whose distributions are compatible with the empirical analogues are the integrated exponential one, the conditional exponential one and the conditional gamma one. On the other hand, employing only logarithmic constraints (as for the conditional Pareto model and the conditional lognormal model) does not help improve the accuracy of the description of the system at hand.
IV.1.2 Reconstruction accuracy
So far, we have inspected the compatibility of the distributions of the empirical values of each network statistic with those of their expected values under each of our models. Let us, now, quantify the extent to which each model is able to recover node-wise information by computing the reconstruction accuracies. Figure 1 shows their temporal averages (i.e. across the years covered by our datasets), with the whiskers representing their variation, i.e. an indication of the stability of each model’s performance.

For what concerns the binary statistics (see Fig. 1a), both the UBCM and the IExp perform quite well in reproducing them; on the other hand, the performance of the FM is much poorer. For what concerns the weighted statistics (see Fig. 1b), only the average nearest neighbors strength is satisfactorily recovered by the (integrated and conditional) exponential models and the conditional gamma one. Still, they are found to perform poorly on the other statistics, i.e. the strength, which is only recovered in distribution on both datasets, and the weighted clustering coefficient, which is only recovered in distribution on the BACI dataset. For what concerns the lognormal model, it performs better than its competitors in reproducing the strength and the weighted clustering coefficient on the Gleditsch dataset but worse on the BACI dataset, making its behavior dataset-dependent.

Finally, let us inspect the reconstruction accuracy of our models for what concerns the link weights of our networks. Specifically, let us consider the percentage of empirical weights falling within the CIs induced by model $M$ [48]. Here, we have proceeded numerically, i.e. by considering the percentiles induced by the ensemble distribution of each node pair-specific weight. As Fig. 2 shows, all models perform quite well in reproducing the weights (across all years, on both datasets), with the only exception of the conditional Pareto model. Overall, the best-performing model on the Gleditsch dataset is the integrated exponential one, while the best-performing model on the BACI dataset is the conditional gamma one.
IV.1.3 Confusion matrix
The UBCM and the integrated exponential model perform similarly in reproducing the binary statistics, on both datasets. Let us, now, compare them in reproducing the four indicators composing the so-called confusion matrix, i.e. the true positive rate (measuring the percentage of links correctly recovered by a given reconstruction method), the specificity (measuring the percentage of zeros correctly recovered by a given reconstruction method), the positive predictive value (measuring the percentage of links correctly recovered by a given reconstruction method with respect to the total number of links predicted by it) and the accuracy (measuring the overall performance of a given reconstruction method in correctly placing both links and zeros). The results are reported in Table 2, which shows the increments of the four indicators under the integrated exponential model with respect to the UBCM. Notice that each entry of the table is positive, a result signalling that the integrated exponential model steadily performs better than the UBCM. This is further confirmed by the (non-parametric) Wilcoxon rank-sum test on the ensemble distributions of the statistics to be compared: all increments are statistically significant.
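The four indicators can be sketched as follows, comparing an empirical adjacency matrix with a reconstructed one (both matrices below are toy examples).

```python
def confusion_rates(a_true, a_pred):
    """True positive rate, specificity, positive predictive value and
    accuracy, computed over the off-diagonal entries (i < j) of two
    binary adjacency matrices: empirical vs reconstructed."""
    n = len(a_true)
    tp = fp = tn = fn = 0
    for i in range(n):
        for j in range(i + 1, n):
            t, p = a_true[i][j], a_pred[i][j]
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    tpr = tp / (tp + fn)                  # links correctly recovered
    spc = tn / (tn + fp)                  # zeros correctly recovered
    ppv = tp / (tp + fp)                  # precision of predicted links
    acc = (tp + tn) / (tp + fp + tn + fn) # overall placement accuracy
    return tpr, spc, ppv, acc

a_true = [[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]]
a_pred = [[0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 0, 1], [0, 1, 1, 0]]
tpr, spc, ppv, acc = confusion_rates(a_true, a_pred)
assert (tpr, spc, ppv, acc) == (2 / 3, 2 / 3, 2 / 3, 2 / 3)
```

In practice the reconstructed matrix is an ensemble of sampled configurations, so each indicator has a full distribution rather than a single value, which is what the rank-sum test operates on.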
IV.2 Model selection via statistical tests
Let us now rigorously test whether constraining the entire degree sequence leads to a significantly better description of our data than that obtainable by just constraining the total number of links $L$. Upon solving the model constraining the entire degree sequence and the one constraining the total number of links, we are able to construct, for each statistic, a vector of paired indicator values (either the reconstruction accuracy or the KS compatibility frequency), the first element of each pair referring to the former model and the second to the latter; naturally, both values are computed for the same year, keeping the same set of weighted constraints. Pairing statistics as described above allows us to employ the (non-parametric) Wilcoxon signed-rank test for testing the hypothesis that the models just constraining $L$ perform better, in reproducing a given statistic, than those constraining the entire degree sequence.

Our results let us conclude that, for both datasets, constraining the degree sequence leads to a significant improvement of the reconstruction accuracy of the average nearest neighbors degree, the clustering coefficient, the strengths and the average nearest neighbors strength; on the other hand, constraining the degree sequence does not lead to any significant improvement of the reconstruction accuracy of the weighted clustering coefficient. For what concerns the KS compatibility frequency, a significant improvement is observed in the description accuracy of the average nearest neighbors degree, the clustering coefficient and the average nearest neighbors strength.
IV.3 Model selection via information criteria
Let us, now, compare the performance of our models in a more general fashion. To this aim, let us consider the Akaike Information Criterion (AIC) [49], reading

$\text{AIC} = 2k - 2\ln\mathcal{L}$  (70)

where $k$ is the number of free parameters of the model and $\ln\mathcal{L}$ is its log-likelihood, evaluated at its maximum.
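The trade-off encoded by Eq. 70 is easy to illustrate: a model with more parameters (e.g. a degree-constrained one) is preferred only if its likelihood gain beats the 2-per-parameter penalty. The numbers below are illustrative, not values from our datasets.

```python
def aic(k, log_likelihood):
    """Akaike Information Criterion: AIC = 2k - 2 ln L (lower is better)."""
    return 2 * k - 2 * log_likelihood

# Illustrative comparison: a one-parameter model (FM-like) against a model
# with one parameter per node (UBCM-like) on a hypothetical 180-node network.
aic_global = aic(k=1, log_likelihood=-1200.0)
aic_local = aic(k=180, log_likelihood=-1000.0)
assert aic_local < aic_global    # here the 179 extra parameters pay off
assert aic(k=180, log_likelihood=-1200.0) > aic_global   # ...but not always
```

Comparisons are meaningful only between models fitted on the same data; the absolute AIC value carries no information by itself.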
(Tables: per-year model comparison for the Gleditsch dataset, years 1990-2000, and the BACI dataset, years 2007-2017; the numerical entries are not reproduced here.)
The purely binary log-likelihood induced by model $M$ is readily obtained from Eq. (18) and reads

$\ln\mathcal{L}_M = \sum_{i<j}\left[a^*_{ij}\ln p_{ij}^{(M)} + \left(1-a^*_{ij}\right)\ln\left(1-p_{ij}^{(M)}\right)\right]$  (71)

where $a^*_{ij}$ is the generic entry of the empirical adjacency matrix $A^*$ and $p_{ij}^{(M)}$ is the model-dependent probability that nodes $i$ and $j$ are connected.
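The binary log-likelihood can be sketched as follows for any model providing independent, pair-specific connection probabilities; the matrices below are toy examples.

```python
import math

def binary_log_likelihood(a_emp, p):
    """Log-likelihood of an empirical adjacency matrix under a model whose
    link probabilities p[i][j] are independent across node pairs."""
    n = len(a_emp)
    ll = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if a_emp[i][j]:
                ll += math.log(p[i][j])
            else:
                ll += math.log(1.0 - p[i][j])
    return ll

# Toy empirical network and hypothetical model probabilities.
a = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
p = [[0.0, 0.9, 0.2], [0.9, 0.0, 0.7], [0.2, 0.7, 0.0]]
ll = binary_log_likelihood(a, p)
assert abs(ll - (math.log(0.9) + math.log(0.8) + math.log(0.7))) < 1e-12
```

Plugging this value, together with the model's parameter count, into the AIC of Eq. 70 is all that is needed to rank the binary models.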