# Variational Bayesian Complex Network Reconstruction

Complex network reconstruction has attracted wide attention across many fields. A popular data-driven reconstruction framework is based on lasso. However, in the presence of noise, lasso can be inefficient at determining the network topology. This paper builds a new framework to cope with this problem. The key idea is to model the relationships between network nodes through a series of linear regression problems, and then to infer the unknown coefficients with an efficient variational Bayesian method. Based on the inferred quantities, the network is finally reconstructed by determining whether each pair of nodes is connected. Numerical experiments conducted with both synthetic and real data demonstrate that the new method outperforms lasso with regard to both reconstruction accuracy and running speed.


## 1 Introduction

Networked systems are ubiquitous in many fields, including social-technical science [1, 2], bioinformatics [3, 4, 5, 6], epidemic dynamics [7, 8, 9] and power grids [10, 11]. However, as is often the case, the topology of a network cannot be observed directly, while data generated by the network are available. Therefore, in interdisciplinary science, one of the most important but challenging problems is to reconstruct a complex network from observed data or time series [12]. This problem has been widely investigated over the past three decades. The classical approach is the delay-coordinate embedding method proposed by Takens [13], which, nevertheless, is only suitable for small-scale networks. Nowadays, with the advent of the big data era [14], it is of great urgency to solve this problem for large-scale complex networks.

Suppose that a complex network consists of $N$ nodes; in practice we are often given the time series of the states of these nodes. Generally speaking, the core idea of many data-driven network reconstruction studies is to first calculate the correlation between every two nodes. Then, a threshold can be set manually or automatically to make the network binary. Eventually, a binary adjacency matrix can be obtained, where the $(i,j)$th element $a_{ij}=1$ indicates that nodes $i$ and $j$ are connected whereas $a_{ij}=0$ means they are not. Here, the correlation can be measured in many ways, such as Pearson's, Kendall's and Hoeffding's coefficients as well as the mutual information of information theory. However, this type of method is not reliable, because the correlation cannot be precisely extracted in many cases [12].

Thereafter, the above framework was gradually replaced by causality analysis (CA) [15], which infers the causal influence between two variables by building two linear regression models. In the context of complex networks, one can apply CA to each pair of nodes to detect whether the two nodes are connected [16]. Guo et al. [17] proposed partial CA by means of radial basis functions so as to recover the interactions among elements in a non-linear network. Then, Wu et al. [18, 19] proposed piecewise CA, a simpler and more straightforward method for both linear and non-linear cases. Now, with the rapid development of variable selection and feature learning [20, 21, 22], the framework of CA has been progressively abandoned, and the more acclaimed approach is lasso [23] (also called compressive sensing in the signal processing field). Wang et al. [24] made the first attempt to apply the lasso framework to network reconstruction based on game-theoretic data. Thereafter, Wang's group investigated the application of lasso to various network reconstruction problems involving, e.g., epidemic data [25], geospatial data [26] and electrical power data [27]. Meanwhile, they also studied its application to related problems such as the prediction of catastrophes in nonlinear dynamical systems [28]. Recently, Wu's group has extended this framework to multilayer networks [29], networks with time-varying nodal parameters [30] as well as weighted networks with continuous time-varying topologies [31].

To facilitate later discussions, Table 1 lists the main notations and symbols used in the paper. Essentially, lasso [23] solves an $\ell_1$-norm penalized least squares problem, i.e., $\min_{\mathbf{w}} \|\mathbf{y}-\mathbf{X}\mathbf{w}\|_2^2 + \lambda\|\mathbf{w}\|_1$. Owing to the property of the $\ell_1$-norm penalty [32], the solution $\hat{\mathbf{w}}$ is sparse, that is, many entries of $\hat{\mathbf{w}}$ are zero. The parameter $\lambda$ controls the sparsity: a larger $\lambda$ makes $\hat{\mathbf{w}}$ sparser. For network reconstruction, $\mathbf{y}$ and $\mathbf{X}$ are observed data, while the regression coefficient vector $\mathbf{w}$ embodies the key information indicating whether a node connects to the others. In other words, $\mathbf{w}$ corresponds to a row of the adjacency matrix of a network with $N$ nodes. Therefore, the network topology can be recovered by applying lasso $N$ times.
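For concreteness, the penalized problem above can be solved with a simple proximal-gradient (ISTA) loop. The following is a minimal NumPy sketch of the lasso baseline; the function name and settings are ours, not the paper's Matlab implementation.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize ||y - X w||_2^2 + lam * ||w||_1 by iterative soft-thresholding."""
    # Step size 1/L, where L is the Lipschitz constant of the quadratic term's gradient.
    L = 2.0 * np.linalg.eigvalsh(X.T @ X).max()
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = w - 2.0 * X.T @ (X @ w - y) / L                    # gradient step
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold (prox of l1)
    return w
```

The soft-thresholding step is what drives many entries of the solution exactly to zero; the sparsity pattern then serves as an edge indicator.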

Although the above-mentioned lasso framework is widely used, it has defects which limit its application in complex network reconstruction. For instance, consider an unweighted network, namely, the $(i,j)$th entry $a_{ij}=1$ if node $i$ points to node $j$ and $a_{ij}=0$ otherwise. Generally speaking, however, the result $\hat{\mathbf{w}}$ produced by lasso is not binary. Thus, there is still a gap between $\hat{\mathbf{w}}$ and the reconstructed network topology. On the one hand, some studies reported that lasso's AUC score is very high [27, 12], even equal to one in some scenarios. Therefore, it is reasonable to set a threshold to make the estimated adjacency matrix (also called the reconstructed network) binary, namely, let $\hat{a}_{ij}=1$ if the corresponding entry of $\hat{\mathbf{w}}$ exceeds the threshold and 0 otherwise. Intuitively, if we have no prior knowledge, the threshold should be set as the midpoint of the range of all the values in $\hat{\mathbf{w}}$. Nevertheless, this strategy is ineffective under some circumstances. To illustrate this phenomenon, Fig. 1 reports a case where lasso fails; its caption describes the detailed experimental settings. Due to the presence of noise, lasso works badly, and only 2 entries exceed the midpoint threshold, so most true edges are missed. Certainly, we can also utilize prior knowledge from a given field. For example, it is known that the average degree of most gene regulation networks is around three, so the threshold can be determined according to the average degree [33]. Nonetheless, this requires expert experience and is not a general strategy. As an alternative, we can obtain a binary result by simply setting $\hat{a}_{ij}=1$ if the corresponding entry of $\hat{\mathbf{w}}$ is nonzero and $\hat{a}_{ij}=0$ otherwise, since the solution of lasso is sparse. But this strategy requires $\lambda$ to be carefully tuned, otherwise it may generate many false positive edges. As reported in Fig. 1, if we use cross validation (CV) [34] to tune $\lambda$, this strategy is much more effective than the former one, but it also takes more time.
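The two binarization strategies discussed above can be contrasted on a hypothetical lasso output (the numbers below are illustrative, not taken from Fig. 1):

```python
import numpy as np

# Hypothetical lasso estimate for one node: two true edges (entries 0 and 3)
# plus small spurious weights caused by noise and an under-tuned lambda.
w_hat = np.array([0.9, 0.0, 0.03, 1.4, 0.0, 0.05])

# Strategy 1: threshold at the midpoint of the range of the estimates.
delta = 0.5 * (w_hat.max() + w_hat.min())
a_midpoint = (w_hat > delta).astype(int)

# Strategy 2: declare an edge wherever the estimate is exactly nonzero.
a_support = (w_hat != 0).astype(int)

print(a_midpoint.tolist(), a_support.tolist())
# prints: [1, 0, 0, 1, 0, 0] [1, 0, 1, 1, 0, 1]
```

Here the support strategy commits two false positives, which is exactly the failure mode that motivates tuning $\lambda$ by CV.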

In this paper, we propose an elegant framework for data-driven complex network reconstruction, especially for large-scale networks. To overcome the shortcoming of lasso, we consider a special regression problem with multiple regression coefficients, that is, $\mathbf{y}=\mathbf{X}(\mathbf{a}\odot\mathbf{w})+\boldsymbol{\epsilon}$, where $\odot$ denotes the element-wise product, $\mathbf{w}$ plays the same role as the traditional regression coefficient vector, and the binary vector $\mathbf{a}$ embodies the information about which variables are active; that is, each $a_j$ equals either 0 or 1. In later discussions, it will be seen that this formulation is very suitable for handling large-scale network reconstruction problems. Under our framework, it is very easy to determine whether two nodes are connected according to $\mathbf{a}$. To estimate the unknown parameters $\mathbf{a}$ and $\mathbf{w}$, a full hierarchical Bayesian inference process is implemented. By this virtue, and unlike lasso, there is no parameter to be tuned in our model. To speed up the inference, a variational Bayesian technique rather than MCMC sampling is employed to approximate the posterior distributions of the unknown variables [35, 36, 37]. To the best of our knowledge, this is the first attempt to utilize the variational Bayes technique to reconstruct networks. The experiments conducted on Barabasi-Albert (BA) networks with electrical current transportation dynamics and on a real-world dataset show that our framework significantly outperforms lasso with respect to reconstruction accuracy and running speed.

The rest of this paper is organized as follows. In section 2, we formulate the new framework and show how to infer the network topology with it. Related works are introduced in section 3. Then, experiments are conducted in section 4. Finally, some discussions and conclusions are presented in section 5. In the appendix, more details on the variational inference of the (hyper)parameters are provided.

## 2 Bayesian complex network reconstruction

### 2.1 Problem statement

In this paper, we consider an unweighted complex network without self-loops. Its adjacency matrix is $\mathbf{A}=(a_{ij})$, where $a_{ij}=1$ if there is an edge from node $i$ to node $j$, and $a_{ij}=0$ otherwise. Let $N$ denote the number of nodes in the network. Moreover, we assume the nodes are governed by a specific dynamics. Here, for ease of illustration, we take the electrical current transportation in a power network consisting of resistors as a working example. The resistance of the resistor between nodes $i$ and $j$ is denoted by $r_{ij}$. Based on Kirchhoff's law, there is

$$\sum_{j=1}^{N} \frac{a_{ij}}{r_{ij}}\,(V_i - V_j) = I_i,\qquad i=1,2,\cdots,N, \tag{1}$$

where $V_i$ and $I_i$ denote the voltage and the electrical current of node $i$. Generally speaking, in the real world, we are able to observe the nodes' voltages and electrical currents, while the network structure is invisible. Therefore, the goal of network reconstruction is to infer the adjacency matrix $\mathbf{A}$ from the observed data, i.e., the $V_i$'s and $I_i$'s.

In what follows, we define $w_{ij}=1/r_{ij}$ and use $\mathbf{a}_i$ and $\mathbf{w}_i$ to refer to the $i$th rows of the corresponding matrices. Since the network simulates a dynamical system, it is reasonable to assume that data are collected at different time points. For ease of description, let $\{1,2,\cdots,M\}$ denote the time index set. Under the above assumptions, the variables should satisfy the equation

$$\sum_{j=1}^{N} a_{ij}\,w_{ij}\,[V_i(t)-V_j(t)] = I_i(t),\qquad t=1,2,\cdots,M, \tag{2}$$

according to Eq. (1). It is noteworthy that $I_i(t)$ represents the electrical current of node $i$ and $V_i(t)-V_j(t)$ denotes the voltage difference between nodes $i$ and $j$ at time $t$. In matrix form, there is

$$\mathbf{y}^{(i)}=\mathbf{X}^{(i)}D(\mathbf{a}_i)\mathbf{w}_i, \tag{3}$$

where $D(\mathbf{a}_i)$ is a diagonal matrix whose main diagonal is $\mathbf{a}_i$, the $i$th row of the matrix $\mathbf{A}$. As stated above, the node states are observable. In other words, in Eq. (3), $\mathbf{y}^{(i)}$ and $\mathbf{X}^{(i)}$ are observed data, while $\mathbf{a}_i$ and $\mathbf{w}_i$ are unknown variables. Therefore, in this manner, network reconstruction can be cast into a series of regression problems. That is to say, the whole network topology can be recovered by solving $N$ problems of the form of Eq. (3), each corresponding to one node.
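Under the convention of Eqs. (1)-(3), the per-node regression data can be assembled as follows; the helper name is ours, and we assume currents are driven by conductances $a_{ij}/r_{ij}$ as in Eq. (1).

```python
import numpy as np

def node_regression_data(V, I, i):
    """Build y^(i) and X^(i) of Eq. (3) from M snapshots of voltages V and currents I.

    V, I: (M, N) arrays. Column j of X^(i) holds the differences V_i(t) - V_j(t),
    so the unknown coefficient vector is a_i * w_i with w_ij = 1 / r_ij.
    """
    y = I[:, i]
    X = V[:, [i]] - V    # X[t, j] = V_i(t) - V_j(t); column i is identically zero
    return y, X
```

A quick consistency check: generating currents from Kirchhoff's law and confirming $\mathbf{y}^{(i)} = \mathbf{X}^{(i)}(\mathbf{a}_i \odot \mathbf{w}_i)$ holds exactly on noiseless data.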

### 2.2 Model formulation

Considering that network reconstruction actually corresponds to solving some special regression problems, in this subsection we cast it into a new framework based on Bayesian statistics.

In the following discussions, we reformulate the estimation problem in Eq. (3) as the linear regression model

$$\mathbf{y}=\mathbf{X}D(\mathbf{a})\mathbf{w}+\boldsymbol{\epsilon}, \tag{4}$$

where $\boldsymbol{\epsilon}$ denotes the noise term. The response vector $\mathbf{y}$ and the design matrix $\mathbf{X}$ are observable, while the binary coefficient vector $\mathbf{a}$ and the continuous coefficient vector $\mathbf{w}$ are to be estimated. It is assumed that the entries of $\boldsymbol{\epsilon}$ are independently and identically distributed (i.i.d.) Gaussian, namely, $\epsilon_t\sim\mathcal{N}(0,\sigma^2)$, where $\sigma^2$ denotes the variance. Note that the popular reconstruction method lasso also falls into this category. The significant advantage of lasso over the simple least-squares method lies in that it provides a sparse solution. However, its performance highly depends on the tuning of its parameter $\lambda$. In contrast, Bayesian methods can provide satisfactory estimates of $\mathbf{a}$ and $\mathbf{w}$ while avoiding tedious parameter adjustment. The core idea of Bayesian methods is to impose a prior distribution on each unknown variable (i.e., $\mathbf{a}$ and $\mathbf{w}$), after which MCMC sampling or a variational technique is employed to approximate the posterior distribution according to the famous Bayes theorem. Sometimes there are hyper-parameters in the prior distributions, for which proper hyper-prior distributions can be hypothesized. In what follows, we adopt a full hierarchical Bayesian inference process to infer the quantities of interest in Eq. (4).

For ease of illustration, we let $\tau=\sigma^{-2}$, which is often called the precision. Then the conditional distribution of $\mathbf{y}$ is given by

$$\mathbf{y}\mid\mathbf{a},\mathbf{w},\tau\sim\mathcal{N}\big(\mathbf{y}\mid\mathbf{X}D(\mathbf{a})\mathbf{w},\,\tau^{-1}\mathbf{I}_M\big), \tag{5}$$

where $\mathbf{I}_M$ denotes the $M$-dimensional identity matrix. For each regression coefficient $w_j$, we place the following Gaussian prior:

$$w_j\sim\mathcal{N}(w_j\mid 0,\lambda_j^{-1}),\qquad j=1,2,\cdots,N, \tag{6}$$

where $\lambda_j^{-1}$ is the variance. Since the adjacency matrix only takes binary values (the network is unweighted), it is natural to consider a Bernoulli prior for each $a_j$, that is,

$$a_j\mid\rho\sim\mathrm{Bernoulli}(a_j\mid\rho),\qquad j=1,\cdots,N, \tag{7}$$

where $\rho$ denotes the probability of $a_j$ taking the value 1. At last, to make a full Bayesian inference, we impose conjugate priors on the hyper-parameters $\tau$, $\rho$ and $\lambda_j$, namely,

$$\tau\sim\mathrm{Gamma}(\tau\mid c_0,d_0),\quad \rho\sim\mathrm{Beta}(\rho\mid e_0,f_0),\quad \lambda_j\sim\mathrm{Gamma}(\lambda_j\mid g_0,h_0), \tag{8}$$

where $c_0$, $d_0$, $e_0$, $f_0$, $g_0$ and $h_0$ are hyper-parameters. To facilitate the understanding of the Bayesian framework, Fig. 2 shows the hierarchical Bayesian graph and the probability density functions of the related random variables.
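The hierarchy of Eqs. (5)-(8) can be sanity-checked by ancestral sampling. The sketch below draws one synthetic dataset from the model; the concrete values are our illustrative choices, and the nearly improper Gamma hyper-priors are replaced by Gamma(1, 1) stand-in draws so the sample stays finite and well-scaled.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 40, 8
e0, f0 = 1.0, 1.0                        # uniform Beta prior on rho (our choice)

tau = rng.gamma(1.0, 1.0)                # noise precision, stand-in for Gamma(c0, d0)
lam = rng.gamma(1.0, 1.0, size=N)        # coefficient precisions, stand-in for Gamma(g0, h0)
rho = rng.beta(e0, f0)                   # edge probability
a = rng.binomial(1, rho, size=N)         # binary connection indicators, Eq. (7)
w = rng.normal(0.0, 1.0 / np.sqrt(lam))  # regression coefficients, Eq. (6)
X = rng.standard_normal((M, N))
y = X @ (a * w) + rng.normal(0.0, 1.0 / np.sqrt(tau), size=M)  # likelihood, Eq. (5)
```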

Subsequently, we can write the joint distribution of all variables, viz.,

$$p(\mathbf{y},\mathbf{a},\mathbf{w},\boldsymbol{\lambda},\tau,\rho)=p(\mathbf{y}\mid\mathbf{a},\mathbf{w},\tau)\,p(\tau)\,p(\rho)\prod_{j=1}^{N}p(a_j\mid\rho)\,p(w_j\mid\lambda_j)\,p(\lambda_j). \tag{9}$$

Our aim is to obtain the posterior distribution of $\mathbf{a}$ and $\mathbf{w}$. Based on the Bayes theorem, we have

$$p(\mathbf{a},\mathbf{w},\boldsymbol{\lambda},\tau,\rho\mid\mathbf{y})=\frac{p(\mathbf{y},\mathbf{a},\mathbf{w},\boldsymbol{\lambda},\tau,\rho)}{p(\mathbf{y})}. \tag{10}$$

However, the marginal distribution $p(\mathbf{y})$ is computationally infeasible, because our model is too complicated. We have to seek an alternative so as to dispense with the computation of $p(\mathbf{y})$. In the literature on Bayesian methods [21, 22, 38], there are mainly two types of algorithms to deal with the inference of a complex posterior distribution. One is MCMC sampling, which approximates the posterior distribution by iteratively drawing samples from the full conditional distribution of each variable. Although MCMC sampling behaves very well in many cases, it is time-consuming when the number of unknown variables is large (i.e., large $N$). The other consists of approximation-based techniques such as variational Bayes or expectation propagation (EP), which work by directly using another, easily estimated distribution to approximate the desired posterior distribution. The prominent advantage of these methods is their good performance at a low computational cost.

Here, we employ a variational Bayesian method to infer this model. The basic idea is to use a variational distribution to approximate the posterior one, with the Kullback-Leibler (KL) divergence measuring the difference between the two distributions [35]. Hence, the original problem is converted into the following optimization problem, namely,

$$q^{*}=\mathop{\arg\min}_{q(\mathbf{a},\mathbf{w},\boldsymbol{\lambda},\tau,\rho)}\ \mathrm{KL}\big(q(\mathbf{a},\mathbf{w},\boldsymbol{\lambda},\tau,\rho)\,\|\,p(\mathbf{a},\mathbf{w},\boldsymbol{\lambda},\tau,\rho\mid\mathbf{y})\big). \tag{11}$$

Specifically, we hypothesize that the variational distributions of the individual items are mutually independent, that is,

$$q(\mathbf{a},\mathbf{w},\boldsymbol{\lambda},\tau,\rho)=q(\mathbf{w})\,q(\rho)\,q(\tau)\prod_{j=1}^{N}q(\lambda_j)\,q(a_j). \tag{12}$$

Then, the optimal variational posterior distributions can be obtained as stated in the following Theorem 1.

###### Theorem 1

The optimal variational posterior distributions of model (9) are

$$\begin{aligned} q(\mathbf{w})&=\mathcal{N}(\mathbf{w}\mid\boldsymbol{\mu},\boldsymbol{\Sigma}),\\ q(\tau)&=\mathrm{Gamma}(\tau\mid c,d),\\ q(\rho)&=\mathrm{Beta}(\rho\mid e,f),\\ q(a_j)&=\mathrm{Bernoulli}(a_j\mid\theta_j),\qquad j=1,\cdots,N,\\ q(\lambda_j)&=\mathrm{Gamma}(\lambda_j\mid g_j,h_j),\qquad j=1,\cdots,N, \end{aligned} \tag{13}$$

where

$$\begin{aligned} \boldsymbol{\Omega}&=\boldsymbol{\theta}\boldsymbol{\theta}^{\mathsf T}+D(\boldsymbol{\theta})\odot\big(\mathbf{I}_N-D(\boldsymbol{\theta})\big),\\ \boldsymbol{\Sigma}&=\Big[\tfrac{c}{d}\,(\mathbf{X}^{\mathsf T}\mathbf{X})\odot\boldsymbol{\Omega}+D(\mathbf{g}/\mathbf{h})\Big]^{-1},\qquad \boldsymbol{\mu}=\tfrac{c}{d}\,\boldsymbol{\Sigma}\,D(\boldsymbol{\theta})\,\mathbf{X}^{\mathsf T}\mathbf{y},\\ g_j&=g_0+\tfrac12,\qquad h_j=h_0+\tfrac12\big(\Sigma_{jj}+\mu_j^2\big),\\ c&=c_0+\tfrac{M}{2},\\ d&=d_0+\tfrac12\Big\{\|\mathbf{y}\|^2-2\mathbf{y}^{\mathsf T}\mathbf{X}D(\boldsymbol{\theta})\boldsymbol{\mu}+\mathrm{tr}\big[\big((\mathbf{X}^{\mathsf T}\mathbf{X})\odot\boldsymbol{\Omega}\big)\big(\boldsymbol{\Sigma}+\boldsymbol{\mu}\boldsymbol{\mu}^{\mathsf T}\big)\big]\Big\},\\ \theta_j&=\frac{1}{\exp(-u_j)+1},\\ u_j&=\psi(e)-\psi(f)+\tfrac{c}{d}\Big\{\mathbf{X}_j^{\mathsf T}\mathbf{X}_j\big[\mu_j^2\theta_j-\tfrac12(\Sigma_{jj}+\mu_j^2)\big]+\mu_j\mathbf{X}_j^{\mathsf T}\big(\mathbf{y}-\mathbf{X}D(\boldsymbol{\theta})\boldsymbol{\mu}\big)\Big\},\\ e&=e_0+\sum_{j=1}^{N}\theta_j,\qquad f=f_0+\sum_{j=1}^{N}(1-\theta_j). \end{aligned} \tag{14}$$
###### Proof 1

The proof is deferred to the appendix.

The following remarks give more insights into our model.

###### Remark 1

It is not surprising that $\rho$ controls the sparsity of our model. A larger $\rho$ leads to larger $\theta_j$'s, which means that more $a_j$'s are not zero. In this respect, it plays the same role as $\lambda$ in lasso. However, $\rho$ in our model is not a constant (or, say, a hyper-parameter): it is governed by a distribution. Hence, $\rho$ can be automatically estimated without resorting to CV.

###### Remark 2

Generally speaking, in the uninformative fashion, the hyper-parameters of the Gamma distributions (i.e., $c_0$, $d_0$, $g_0$ and $h_0$) can be set to very small values. For the Beta distribution, we find that the model performs well with a symmetric choice of $e_0$ and $f_0$. Under some circumstances, if it is known that the target network is very sparse, we can also make $e_0$ smaller and $f_0$ larger so as to obtain a smaller $\rho$.

### 2.3 Algorithm and implementation details

In this part, we describe how to apply our model to network reconstruction tasks. Algorithm 1 lists the main steps of the inference process, with the hyper-parameters initialized as shown in its line 1. The algorithmic iteration stops once the relative change of the variational parameters falls below a small tolerance. Algorithm 2 summarizes the workflow of Bayesian complex network reconstruction: repeatedly compute the response vector $\mathbf{y}^{(i)}$ and the design matrix $\mathbf{X}^{(i)}$ and carry out Algorithm 1 until the structure of every node has been recovered. The initial output of Algorithm 2 is a soft version of the reconstructed complex network, whose $(i,j)$th entry denotes the probability that node $i$ points to node $j$. In addition, we can set a threshold and let $\hat{a}_{ij}=1$ if the corresponding probability exceeds it and 0 otherwise; the resulting matrix $\hat{\mathbf{A}}$ is the final reconstructed network. In what follows, we abbreviate our method (Variational Bayesian Reconstruction) as VBR.
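The per-node inference of Algorithm 1 can be sketched by iterating the updates of Theorem 1 directly. The NumPy transcription below is ours: the names, the initializations, the fixed sweep count (instead of the relative-change stopping rule) and the finite-difference digamma are all our choices, not the authors' Matlab code.

```python
import numpy as np
from math import lgamma

def digamma(x, h=1e-5):
    """Crude numerical psi(x) from central differences of log-Gamma (stdlib only)."""
    return (lgamma(x + h) - lgamma(x - h)) / (2.0 * h)

def vbr_node(X, y, n_iter=100, c0=1e-6, d0=1e-6, e0=1.0, f0=1.0, g0=1e-6, h0=1e-6):
    """Variational updates of Theorem 1 (our transcription) for one regression y = X D(a) w."""
    M, N = X.shape
    G = X.T @ X
    theta = np.full(N, 0.5)        # q(a_j) = Bernoulli(theta_j)
    h = np.full(N, h0 + 0.5)       # q(lambda_j) = Gamma(g, h_j)
    g = g0 + 0.5
    e, f, E_tau = e0, f0, 1.0
    for _ in range(n_iter):
        Omega = np.outer(theta, theta) + np.diag(theta * (1.0 - theta))
        Sigma = np.linalg.inv(E_tau * G * Omega + np.diag(g / h))
        mu = E_tau * Sigma @ (theta * (X.T @ y))
        h = h0 + 0.5 * (np.diag(Sigma) + mu ** 2)
        c = c0 + 0.5 * M
        d = d0 + 0.5 * (y @ y - 2.0 * y @ (X @ (theta * mu))
                        + np.trace((G * Omega) @ (Sigma + np.outer(mu, mu))))
        E_tau = c / d
        resid = y - X @ (theta * mu)
        for j in range(N):
            u = digamma(e) - digamma(f) + E_tau * (
                G[j, j] * (theta[j] * mu[j] ** 2 - 0.5 * (Sigma[j, j] + mu[j] ** 2))
                + mu[j] * (X[:, j] @ resid))
            theta[j] = 1.0 / (1.0 + np.exp(-np.clip(u, -500.0, 500.0)))
        e = e0 + theta.sum()
        f = f0 + (1.0 - theta).sum()
    return theta, mu
```

Repeating this routine for every node and stacking the resulting $\boldsymbol{\theta}$ vectors yields the soft reconstruction; thresholding it gives the binary network.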

We emphasize that, although Algorithm 2 is designed for electrical current transportation dynamics, it can be directly adapted to many other dynamical processes. Essentially, VBR is applicable to all problems which can be solved by lasso.

## 3 Related Work

Besides the simple methods [12] that compute correlation coefficients between two nodes, another approach widely used to infer the structure of small networks is causality analysis (CA), proposed by Granger [15]. Specifically, given the time series of two variables $X$ and $Y$, the autoregressive prediction of $X$ is defined by

$$X_t=\sum_{i=1}^{\infty}b_{1i}X_{t-i}+\epsilon_{1t},$$

where $X_t$ is the observation of $X$ at time $t$, and $\epsilon_{1t}$ is the noise term. Likewise, one can utilize the information of both $X$ and $Y$ to predict $X$, viz.

$$X_t=\sum_{i=1}^{\infty}b_{2i}X_{t-i}+\sum_{i=1}^{\infty}c_{2i}Y_{t-i}+\epsilon_{2t}.$$

If $\mathrm{var}(\epsilon_{2t})<\mathrm{var}(\epsilon_{1t})$, then $Y$ influences $X$, because $Y$ accounts for part of the variation of $X$. In practice, the causal influence from $Y$ to $X$ can be measured by the log ratio of the two residual variances [19], and in the context of network reconstruction we can believe that there is an edge from $Y$ to $X$ if this measure is positive. Because it needs to run regressions for every ordered pair of nodes, CA is computationally infeasible if $N$ is large.
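A one-lag version of this test is just a pair of least-squares fits. The helper below is our sketch, with $F_{y\to x}$ taken as the log variance ratio; it flags an edge when adding the lag of $y$ shrinks the residual variance.

```python
import numpy as np

def granger_f(x, y):
    """F_{y->x} with one lag: log of var(restricted residual) / var(full residual)."""
    xt, xl, yl = x[1:], x[:-1], y[:-1]
    # Restricted model: x_t regressed on its own lag only.
    b1, *_ = np.linalg.lstsq(xl[:, None], xt, rcond=None)
    e1 = xt - xl * b1[0]
    # Full model: x_t regressed on lags of both x and y.
    A = np.column_stack([xl, yl])
    b2, *_ = np.linalg.lstsq(A, xt, rcond=None)
    e2 = xt - A @ b2
    return float(np.log(e1.var() / e2.var()))
```

On a toy pair where $x$ is driven by the lag of $y$, `granger_f(x, y)` comes out clearly positive while `granger_f(y, x)` stays near zero.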

Nowadays, lasso has become one of the most popular procedures in the literature on complex network systems. In [12], the authors review recent advances in data-based network reconstruction with a focus on lasso. To some extent, lasso is similar to VBR. We again take electrical current transportation as an example to explain how it reconstructs the whole network. Letting $\tilde{\mathbf{w}}_i=D(\mathbf{a}_i)\mathbf{w}_i$, Eq. (3) can be rewritten as

$$\mathbf{y}^{(i)}=\mathbf{X}^{(i)}\tilde{\mathbf{w}}_i. \tag{15}$$

Lasso directly estimates $\tilde{\mathbf{w}}_i$, the product of the two coefficient vectors $\mathbf{a}_i$ and $\mathbf{w}_i$, by solving the following optimization problem:

$$\min_{\tilde{\mathbf{w}}_i}\ \|\mathbf{y}^{(i)}-\mathbf{X}^{(i)}\tilde{\mathbf{w}}_i\|_2^2+\lambda_{\mathrm{lasso}}\|\tilde{\mathbf{w}}_i\|_1. \tag{16}$$

From Eq. (16), it can be observed that lasso recovers the relationship between node $i$ and the others by solving one optimization problem; in total, lasso needs to be run $N$ times to identify the network topology. As stated in Section 1, a strategy must then be designed to make the result binary. However, to the best of our knowledge, there is no general guidance on how to choose a proper threshold. In [24], Wang et al. offer an empirical strategy, namely, let $\hat{a}_{ij}=1$ if the corresponding entry of $\tilde{\mathbf{w}}_i$ is nonzero and $\hat{a}_{ij}=0$ otherwise. Due to its specificity, this strategy loses efficiency in the presence of noise or if the parameter $\lambda_{\mathrm{lasso}}$ is not well selected.

Beyond the above-mentioned techniques, for economic and financial systems there is a popular method called maximum entropy (MaxEnt) [39]. It is assumed that the network is not directly observable, while marginal information on the network is accessible; namely, the in-degree and out-degree of each node are known. The MaxEnt algorithm then solves the following constrained optimization problem, viz.

$$\begin{cases}\max_{a_{ij}}\ -\sum_{i=1}^{N}\sum_{j=1}^{N}a_{ij}\log a_{ij},\\ d^{\mathrm{in}}_{j}=\sum_{i=1}^{N}a_{ij},\quad\forall j\in V,\\ d^{\mathrm{out}}_{j}=\sum_{i=1}^{N}a_{ji},\quad\forall j\in V.\end{cases} \tag{17}$$

Here, $d^{\mathrm{in}}_j$ and $d^{\mathrm{out}}_j$ denote the in-degree and out-degree, respectively. However, the solution of MaxEnt is dense, while real-world economic networks are sparse. Many variants have therefore been proposed to improve the performance of MaxEnt; more details can be found in a very recent review [40]. Unfortunately, this class of algorithms is not suitable for our problem, because the reconstruction problem setting of economic systems is totally different from our focus.
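Problems of the form of Eq. (17) can be attacked by matrix scaling: starting from a uniform positive matrix, iterative proportional fitting alternately rescales rows and columns until both degree constraints hold. A small sketch under our naming (self-loops are not excluded, matching the unconstrained form above):

```python
import numpy as np

def maxent_margins(d_out, d_in, n_iter=200):
    """Dense maximum-entropy matrix with prescribed row sums (out-degrees)
    and column sums (in-degrees), via iterative proportional fitting."""
    d_out, d_in = np.asarray(d_out, float), np.asarray(d_in, float)
    assert np.isclose(d_out.sum(), d_in.sum()), "margins must be consistent"
    A = np.ones((len(d_out), len(d_in)))
    for _ in range(n_iter):
        A *= (d_out / A.sum(axis=1))[:, None]   # match row sums
        A *= d_in / A.sum(axis=0)               # match column sums
    return A
```

The fixed point has a dense rank-one form $a_{ij}=x_i y_j$, which illustrates the drawback noted above: every entry is strictly positive, unlike a sparse real network.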

## 4 Experiments

In this section, we carry out simulated experiments to study the behavior of our model. The Matlab code of VBR is available at https://github.com/xsxjtu/VBR. Lasso with 10-fold CV is used as the benchmark algorithm; it is implemented by the built-in function lasso in the Statistics and Machine Learning Toolbox of Matlab. All experiments are conducted with Matlab R2017a and run on a computer with an Intel Core CPU at 3.60 GHz, 8.00 GB RAM and the Windows 10 (64-bit) system.

To evaluate the reconstruction accuracy of a method, two metrics, TPR (true positive rate) and TNR (true negative rate), are employed. They are defined as

$$\mathrm{TPR}=\frac{\mathrm{TP}}{\mathrm{P}}=\frac{\sum_{i=1}^{N}\sum_{j=1}^{N}a_{ij}\hat{a}_{ij}}{\sum_{i=1}^{N}\sum_{j=1}^{N}a_{ij}}, \tag{18}$$

and

$$\mathrm{TNR}=\frac{\mathrm{TN}}{\mathrm{N}}=\frac{\sum_{i=1}^{N}\sum_{j=1}^{N}(1-a_{ij})(1-\hat{a}_{ij})}{\sum_{i=1}^{N}\sum_{j=1}^{N}(1-a_{ij})}, \tag{19}$$

respectively. Here, $a_{ij}$ and $\hat{a}_{ij}$ are the $(i,j)$th entries of the target network $\mathbf{A}$ and the reconstructed network $\hat{\mathbf{A}}$, respectively. In fact, TPR is the proportion of existing edges that are correctly identified, while TNR is the proportion of non-existing edges that are correctly excluded. Meanwhile, we also use the computational time (in seconds) to compare the efficiency of the algorithms.
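The two metrics of Eqs. (18)-(19) translate directly into code; a small helper (our naming):

```python
import numpy as np

def tpr_tnr(A, A_hat):
    """TPR and TNR of Eqs. (18)-(19) for binary adjacency matrices."""
    A, A_hat = np.asarray(A), np.asarray(A_hat)
    tpr = (A * A_hat).sum() / A.sum()
    tnr = ((1 - A) * (1 - A_hat)).sum() / (1 - A).sum()
    return tpr, tnr
```

For instance, a reconstruction that finds one of two true edges and introduces no spurious ones scores TPR = 0.5 and TNR = 1.0.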

### 4.1 Experiment 1: Comparison with lasso

In this first case, we mainly compare the performance of VBR and lasso with regard to reconstruction accuracy and running speed.

#### 4.1.1 Experimental settings

We mainly consider BA scale-free networks, where a newly added node connects to existing nodes with probability proportional to their degrees, so that adding more links per new node yields a larger average degree. We mimic the electrical current transportation on the network described by Eq. (1), where the nodes' voltages are generated by an alternating current source, with the source parameters and the resistances uniformly sampled from [0, 20] and [0, 10], respectively. Several choices of the network size, the number of observations and the noise variance are considered. For each case, the experiment was conducted 100 times by generating different data.

#### 4.1.2 Experimental results

The results are shown in Figs. 3 and 4. In particular, the mean of TPR (likewise TNR and the time consumed by each algorithm) is plotted as a function of the noise scale, and in each subplot the bar indicates the mean plus or minus one standard deviation. It is not surprising that the performance of both VBR and lasso weakens as the noise scale increases. Specifically, compared with VBR, lasso is more sensitive to the scale of noise. The reason is that VBR uses full Bayesian inference, where the noise variance is directly modeled by the latent variable $\tau$, while lasso does not consider this factor. Furthermore, in most cases, the standard deviation of lasso is higher than that of VBR, which means that lasso may be unstable. Overall, VBR does better than lasso with regard to reconstruction accuracy. As for computational time, VBR holds a great advantage over lasso, and its speed is very robust to the noise scale as well as the network density: it consumes around 0.55 s and 4.6 s in the smaller and larger settings, respectively. The computational time of lasso, by contrast, is also affected by other factors and dramatically increases with the network scale: lasso takes 3-4.5 s in the smaller cases and 15-30 s in the larger ones. In conclusion, in the context of BA scale-free networks, VBR outperforms lasso in both reconstruction accuracy and running speed.

### 4.2 Experiment 2: Running speed versus network scale

In the second case, we conduct experiments on relatively large-scale complex networks. Because lasso with CV is extremely slow in this case, we only implemented VBR, and the aim is to study how the running speed of VBR varies as the network scale grows.

#### 4.2.1 Experimental settings

Here, the experimental settings were the same as those in Experiment 1, except for the values of the experimental parameters.

#### 4.2.2 Experimental results

Fig. 5 depicts the execution time of VBR versus $N$, an index indicating the network scale. It can be found in Fig. 5 that the running time of VBR grows faster than the network scale. The main crux lies in the update of $\boldsymbol{\Sigma}$, which requires computing the inverse of an $N$-dimensional matrix; in general, its computational complexity is $O(N^3)$. It is worth pointing out that the ADMM algorithm for lasso suffers from the same problem.

In VBR, the key reason is that we do not assume the entries of $\mathbf{w}$ to be independent in Eq. (12). Therefore, when the target network is high-dimensional, we had better assume the $w_j$'s to be independent, in which case the inversion of an $N$-dimensional matrix can be avoided. However, we find in our experiments that this strategy is very likely to be trapped in a local minimum, and the resulting TPR and TNR values are small. Unfortunately, there is still no efficient approach to this issue, and it is extremely interesting to study it in the future.

### 4.3 Experiment 3: Hyper-parameter analysis

In this part, we study whether the choice of hyper-parameters affects the behavior of VBR. The following experiments reuse the settings of Experiment 1. For ease of comparison, we adopt the harmonic mean of TPR and TNR (HMTT) to measure the performance, namely,

$$\mathrm{HMTT}=\frac{2}{\frac{1}{\mathrm{TPR}}+\frac{1}{\mathrm{TNR}}}. \tag{20}$$
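HMTT is simply the harmonic mean of the two rates, which penalizes an imbalance between them; as a one-liner (our naming):

```python
def hmtt(tpr, tnr):
    """Harmonic mean of TPR and TNR, Eq. (20)."""
    return 2.0 / (1.0 / tpr + 1.0 / tnr)
```

For example, a method with TPR = 0.5 and TNR = 1.0 scores HMTT = 2/3, well below the arithmetic mean of 0.75.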

At first, we investigate the influence of the hyper-parameters $g_0$ and $h_0$, since they control the variable $\lambda_j$. Here, we fix one of them and let the other vary over several orders of magnitude. Fig. 6(a) shows that this pair can significantly affect VBR's performance, because the value of HMTT drops as the prior variance of $\mathbf{w}$ grows. This result is reasonable: $\lambda_j$ denotes the inverse of the variance of $w_j$, and its prior expectation is $g_0/h_0$, so the variance of $w_j$ gets larger as this ratio becomes smaller. In general, a larger prior variance does not encourage sparsity of the regression coefficients and weakens the reconstruction performance. Fig. 6(a) implies that VBR begins to break down only for extreme settings, so the choice used in the previous experiments is very safe.

Next, we pay attention to the hyper-parameters $c_0$ and $d_0$, which are related to the variable $\tau$ controlling the noise variance. Here, we fix one of them and vary the other. Since $\tau$ is tied to the amount of noise in the data, we conducted three sets of experiments with different noise scales. The results reported in Fig. 6(b)-(d) show that this pair indeed affects the performance of VBR. Specifically, the value of HMTT stays almost unchanged over a wide range of settings, but tends to drop for extreme values. Based on this fact, the choice used in the previous experiments guarantees good performance.

### 4.4 Experiment 4: Real-world data

Here, we conduct an experiment on the real-world network USAir, which is available at http://vlado.fmf.uni-lj.si/pub/networks/data/. There are 332 nodes and 2126 edges in the network; its degree distribution is shown in Fig. 7, confirming that USAir is a scale-free network. The experimental settings are the same as those in Experiment 1, and the time series data were generated by the electrical current transportation dynamics. In this experiment, we did not include the noise term.

The results reported in Table 2 show that VBR always achieves higher TPR values than lasso. Even though lasso has a slight advantage over VBR in terms of TNR, the HMTT values manifest the superiority of VBR for reconstruction. Meanwhile, lasso takes about 15-35 minutes to accomplish the reconstruction task, whereas VBR consumes only about 2 minutes to recover the whole network.

## 5 Conclusion and Discussions

In this paper, we propose a general framework based on Bayesian statistics to reconstruct complex networks. By reformulating the task as a series of regression problems, prior distributions are assigned to the unknown parameters, and a variational Bayesian method is employed to infer their posterior distributions efficiently. Compared with lasso, the new method does not need fine tuning of associated parameters. The experiments conducted with both synthetic and real data show that, in the presence of noise, our method VBR outperforms lasso with regard to both reconstruction accuracy and running speed.

Moreover, we would like to further discuss the potential improvements of our model.

##### Reconstruction of heterogeneous and homogeneous networks

We claim that our algorithm can be directly applied to reconstruct both heterogeneous and homogeneous networks, but it may be more suitable for heterogeneous ones. The main reason lies in that the nodes in a heterogeneous network have different degrees, while the nodes in a homogeneous network have similar degrees. In our proposed method VBR, it should be noticed that the variable $\rho$ controls the degree: a large $\rho$ makes a node connect to more nodes and leads to a large degree. When VBR is used to reconstruct a network, it estimates a separate $\rho$ for each node, so different nodes have different $\rho$'s. Based on this fact, the algorithm is very likely to be improved if we take into account the prior knowledge of homogeneous networks and make all nodes share the same $\rho$. It is interesting to investigate whether this strategy is efficient for homogeneous networks in the future.

##### Inference of ρ

Note that in our model a Beta distribution is imposed on $\rho$, whose value can then be automatically determined by variational inference. As a side effect, however, more training samples need to be utilized to ensure the approximation accuracy, otherwise the model may perform badly. One solution is to treat $\rho$ as a hyper-parameter (similar to $\lambda_{\mathrm{lasso}}$) and to employ CV to select its optimal value.

## Acknowledgement

The research of C.X. Zhang is supported by the National Natural Science Foundation of China under grant 11671317. The research of J.S. Zhang is supported by the National Key Research Development Program of China under grant 2018YFC0809001, and the National Natural Science Foundation of China under grant 61572393. The research of P. Wang is supported by the National Natural Science Foundation of China under grant 61773153, the Key Scientific Research Projects in Colleges and Universities of Henan Province under grant 17A120002, and the Basal Research Fund of Henan University under grant yqpy20140049.

## Appendix A

In this part, we provide more details about the variational inference (VI), including a brief introduction to VI and the derivation of the variational distributions of the items shown in Theorem 1.

### A.1 Variational inference

At first, we show how to use VI to infer the optimal solution for a general model. Let $\mathbf{z}=(z_1,z_2,\cdots)$ denote all the variables to be inferred, where $z_j$ is the $j$th variable, and let $\mathbf{y}$ represent the observed data. According to the main principle of VI, the optimal variational distribution is given by

$$q^{*}(\mathbf{z})=\mathop{\arg\min}_{q(\mathbf{z})}\ \mathrm{KL}\big(q(\mathbf{z})\,\|\,p(\mathbf{z}\mid\mathbf{y})\big). \tag{21}$$

According to the definition of KL divergence, we have

$$\mathrm{KL}\big(q(\mathbf{z})\,\|\,p(\mathbf{z}\mid\mathbf{y})\big)=\log p(\mathbf{y})+\mathbb{E}_{q(\mathbf{z})}\big[\log q(\mathbf{z})-\log p(\mathbf{z},\mathbf{y})\big]. \tag{22}$$

We define the evidence lower bound (ELBO) as $\mathbb{E}_{q(\mathbf{z})}[\log p(\mathbf{z},\mathbf{y})-\log q(\mathbf{z})]$, where $\mathbb{E}_{q(\mathbf{z})}$ represents the expectation operator with regard to (w.r.t.) the variational distribution $q(\mathbf{z})$. Note that Eq. (22) can be reformulated as

$$\log p(\mathbf{y})=\mathrm{KL}\big(q(\mathbf{z})\,\|\,p(\mathbf{z}\mid\mathbf{y})\big)+\mathrm{ELBO}. \tag{23}$$

Because $\log p(\mathbf{y})$ is a constant which does not depend on the variational distribution $q(\mathbf{z})$, minimizing the KL divergence is equivalent to maximizing the ELBO. Because the variational distributions are assumed to be independent, the ELBO can be rewritten as

$$\begin{aligned}\mathrm{ELBO}&=-\int\prod_j q(z_j)\,\log\frac{\prod_j q(z_j)}{p(\mathbf{z},\mathbf{y})}\,d\mathbf{z}\\&=-\int q(z_j)\Big\{\log q(z_j)-\mathbb{E}_{-q(z_j)}\log p(\mathbf{z},\mathbf{y})\Big\}\,dz_j+\mathrm{const.}, \end{aligned} \tag{24}$$

where $\mathbb{E}_{-q(z_j)}$ represents the expectation operator w.r.t. all variational distributions but $q(z_j)$, and const. refers to all terms that do not depend on $z_j$. Hence, the optimal variational distribution of $z_j$ should satisfy

$$\log q^{*}(z_j)=\mathbb{E}_{-q(z_j)}\big[\log p(\mathbf{z},\mathbf{y})\big]. \tag{25}$$

Because of this property, the variational distribution can be efficiently attained by updating the $q(z_j)$'s coordinate-wise. Remark that $p(\mathbf{z},\mathbf{y})$ is the joint distribution (for our model, it refers to Eq. (9)). In the following subsections, we show how to acquire the optimal solution of our model.
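As a self-contained illustration of Eq. (25), consider the textbook mean-field example: a Gaussian with unknown mean and precision under a conjugate Normal-Gamma prior, solved by alternating the two closed-form coordinate updates. This is the classic model, not the paper's; the hyper-parameter values are our choices.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(5.0, 1.0, size=200)             # data from N(mean=5, precision=1)
n, xbar = len(x), x.mean()
mu0, kappa0, a0, b0 = 0.0, 0.01, 1e-3, 1e-3    # vague Normal-Gamma prior (our choice)

E_tau = 1.0
for _ in range(50):
    # q(mu) = N(m, 1/lam): Eq. (25) applied to the mean, with E[tau] plugged in.
    m = (kappa0 * mu0 + n * xbar) / (kappa0 + n)
    lam = (kappa0 + n) * E_tau
    # q(tau) = Gamma(a, b): Eq. (25) applied to the precision, expectations over q(mu).
    a = a0 + 0.5 * (n + 1)
    b = b0 + 0.5 * (((x - m) ** 2).sum() + n / lam
                    + kappa0 * ((m - mu0) ** 2 + 1.0 / lam))
    E_tau = a / b
```

After a few sweeps, `m` sits at the (slightly shrunk) sample mean and `a / b` approximates the data precision; the same coordinate-ascent pattern drives the updates of Theorem 1.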

### A.2 Inference of w

According to Eq. (25), we have

$$\begin{aligned}\log q^{*}(\mathbf{w})&=\mathbb{E}_{-q(\mathbf{w})}\{\log p(\mathbf{y},\mathbf{a},\mathbf{w},\boldsymbol{\lambda},\tau,\rho)\}\\&=\mathbb{E}_{-q(\mathbf{w})}\{\log p(\mathbf{y}\mid\mathbf{a},\mathbf{w},\tau)+\log p(\mathbf{w}\mid\boldsymbol{\lambda})\}+\mathrm{const.}\\&=\mathbb{E}_{-q(\mathbf{w})}\Big\{-\frac{\tau}{2}\|\mathbf{y}-\mathbf{X}D(\mathbf{a})\mathbf{w}\|^2-\frac{1}{2}\mathbf{w}^{\mathsf T}D(\boldsymbol{\lambda})\mathbf{w}\Big\}+\mathrm{const.}\\&=\mathbb{E}_{-q(\mathbf{w})}\Big\{-\frac{\tau}{2}\big[\mathbf{w}^{\mathsf T}D(\mathbf{a})\mathbf{X}^{\mathsf T}\mathbf{X}D(\mathbf{a})\mathbf{w}-2\mathbf{y}^{\mathsf T}\mathbf{X}D(\mathbf{a})\mathbf{w}\big]-\frac{1}{2}\mathbf{w}^{\mathsf T}D(\boldsymbol{\lambda})\mathbf{w}\Big\}+\mathrm{const.}\\&=\mathbb{E}_{-q(\mathbf{w})}\Big\{-\frac{1}{2}\mathbf{w}^{\mathsf T}\big[\tau D(\mathbf{a})\mathbf{X}^{\mathsf T}\mathbf{X}D(\mathbf{a})+D(\boldsymbol{\lambda})\big]\mathbf{w}+\tau\mathbf{y}^{\mathsf T}\mathbf{X}D(\mathbf{a})\mathbf{w}\Big\}+\mathrm{const.}\end{aligned} \tag{26}$$

From Eq. (26), it can be seen that $q^{*}(\mathbf{w})$ is still a Gaussian distribution. In what follows, we denote it by $\mathcal{N}(\mathbf{w}\mid\boldsymbol{\mu},\boldsymbol{\Sigma})$ with

$$\begin{aligned}\boldsymbol{\Sigma}&=\big(\mathbb{E}_q[\tau]\,\mathbb{E}_q[D(\mathbf{a})\mathbf{X}^{\mathsf T}\mathbf{X}D(\mathbf{a})]+\mathbb{E}_q[D(\boldsymbol{\lambda})]\big)^{-1},\\ \boldsymbol{\mu}&=\mathbb{E}_q[\tau]\,\boldsymbol{\Sigma}\,\mathbb{E}_q[D(\mathbf{a})]\,\mathbf{X}^{\mathsf T}\mathbf{y}.\end{aligned} \tag{27}$$

It should be mentioned that the expectations appearing in $\boldsymbol{\Sigma}$ and $\boldsymbol{\mu}$, taken w.r.t. the corresponding variational distributions, are computed later in this appendix.

### A.3 Inference of λj

According to Eq. (25), there is

$$\begin{aligned}\log q^{*}(\lambda_j)&=\mathbb{E}_{-q(\lambda_j)}\{\log p(\mathbf{y},\mathbf{a},\mathbf{w},\boldsymbol{\lambda},\tau,\rho)\}\\&=\mathbb{E}_{-q(\lambda_j)}\{\log p(w_j\mid\lambda_j)+\log p(\lambda_j)\}+\mathrm{const.}\\&=\mathbb{E}_{-q(\lambda_j)}\Big\{\frac{1}{2}\log\lambda_j-\frac{\lambda_j}{2}w_j^2+(g_0-1)\log\lambda_j-h_0\lambda_j\Big\}+\mathrm{const.}\end{aligned} \tag{28}$$

Therefore, $q^{*}(\lambda_j)$ is still a Gamma distribution. In what follows, we denote it by $\mathrm{Gamma}(\lambda_j\mid g_j,h_j)$, where

$$g_j=g_0+\frac{1}{2},\qquad h_j=h_0+\frac{1}{2}\mathbb{E}_q(w_j^2). \tag{29}$$

### A.4 Inference of τ

As for $\tau$, we have

$$\begin{aligned}\log q^{*}(\tau)&=\mathbb{E}_{-q(\tau)}\{\log p(\mathbf{y},\mathbf{a},\mathbf{w},\boldsymbol{\lambda},\tau,\rho)\}\\&=\mathbb{E}_{-q(\tau)}\{\log p(\mathbf{y}\mid\mathbf{a},\mathbf{w},\tau)+\log p(\tau)\}+\mathrm{const.}\\&=\frac{M}{2}\log\tau-\frac{\tau}{2}\mathbb{E}_{-q(\tau)}\big[\|\mathbf{y}-\mathbf{X}D(\mathbf{a})\mathbf{w}\|^2\big]+(c_0-1)\log\tau-d_0\tau+\mathrm{const.}\\&=\Big(c_0+\frac{M}{2}-1\Big)\log\tau-\Big(d_0+\frac{1}{2}\mathbb{E}_{-q(\tau)}\big[\|\mathbf{y}-\mathbf{X}D(\mathbf{a})\mathbf{w}\|^2\big]\Big)\tau+\mathrm{const.}\end{aligned} \tag{30}$$

according to Eq. (25). Hence, $q^{*}(\tau)$ is still a Gamma distribution. In what follows, we denote it by $\mathrm{Gamma}(\tau\mid c,d)$, where

$$c=c_0+\frac{M}{2},\qquad d=d_0+\frac{1}{2}\mathbb{E}_q\big[\|\mathbf{y}-\mathbf{X}D(\mathbf{a})\mathbf{w}\|^2\big]. \tag{31}$$

### A.5 Inference of aj

To facilitate the derivation, it is worth mentioning the fact that, if $a_j\in\{0,1\}$, then $a_j^2=a_j$; that is, $a_j$ and $a_j^2$ have the identical distribution. According to Eq. (25), the optimal variational distribution of $a_j$ should satisfy

$$\begin{aligned}\log q^{*}(a_j)&=\mathbb{E}_{-q(a_j)}\{\log p(\mathbf{y},\mathbf{a},\mathbf{w},\boldsymbol{\lambda},\tau,\rho)\}\\&=\mathbb{E}_{-q(a_j)}\Big\{-\frac{\tau}{2}\|\mathbf{y}-\mathbf{X}D(\mathbf{a})\mathbf{w}\|^2+a_j\log\rho+(1-a_j)\log(1-\rho)\Big\}+\mathrm{const.}\\&=\mathbb{E}_{-q(a_j)}\Big\{-\frac{\tau}{2}\Big\|\mathbf{y}-\sum_{n\neq j}\mathbf{X}_n a_n w_n-\mathbf{X}_j a_j w_j\Big\|^2+a_j\log\frac{\rho}{1-\rho}\Big\}+\mathrm{const.}\\&=\mathbb{E}_{-q(a_j)}\Big\{-\frac{\tau}{2}\|\mathbf{r}_j-\mathbf{X}_j a_j w_j\|^2+a_j\log\frac{\rho}{1-\rho}\Big\}+\mathrm{const.}\\&=\mathbb{E}_{-q(a_j)}\Big\{-\frac{\tau}{2}\big(-2\mathbf{r}_j^{\mathsf T}\mathbf{X}_j a_j w_j+\mathbf{X}_j^{\mathsf T}\mathbf{X}_j a_j^2 w_j^2\big)+a_j\log\frac{\rho}{1-\rho}\Big\}+\mathrm{const.}\\&=a_j\,\mathbb{E}_{-q(a_j)}\Big\{-\frac{\tau}{2}\mathbf{X}_j^{\mathsf T}\big(w_j^2\mathbf{X}_j-2w_j\mathbf{r}_j\big)+\log\frac{\rho}{1-\rho}\Big\}+\mathrm{const.}\end{aligned} \tag{32}$$

Here, $\mathbf{X}_j$ represents the $j$th column of the matrix $\mathbf{X}$ and $\mathbf{r}_j=\mathbf{y}-\sum_{n\neq j}\mathbf{X}_n a_n w_n$. Obviously, $q^{*}(a_j)$ is still a Bernoulli distribution. In what follows, we denote it by $\mathrm{Bernoulli}(a_j\mid\theta_j)$ with

$$\theta_j=\frac{1}{\exp(-u_j)+1}, \tag{33}$$

and

$$u_j=\mathbb{E}_q\Big\{\log\frac{\rho}{1-\rho}-\frac{\tau}{2}\mathbf{X}_j^{\mathsf T}\big(w_j^2\mathbf{X}_j-2w_j\mathbf{r}_j\big)\Big\}.$$