 # Empirical Bayes Method for Boltzmann Machines

In this study, we consider an empirical Bayes method for Boltzmann machines and propose an algorithm for it. The empirical Bayes method allows estimation of the values of the hyperparameters of the Boltzmann machine by maximizing a specific likelihood function referred to as the empirical Bayes likelihood function in this study. However, the maximization is computationally hard because the empirical Bayes likelihood function involves intractable integrations of the partition function. The proposed algorithm avoids this computational problem by using the replica method and the Plefka expansion. Our method does not require any iterative procedures and is quite simple and fast, though it introduces a bias to the estimate, which exhibits an unnatural behavior with respect to the size of the dataset. This peculiar behavior is supposed to be due to the approximate treatment by the Plefka expansion. A possible extension to overcome this behavior is also discussed.


## I Introduction

Boltzmann machine learning (BML) Ackley et al. (1985) has been actively studied in the field of machine learning and also in statistical mechanics. In statistical mechanics, the problem of BML is sometimes referred to as the inverse Ising problem, because a Boltzmann machine is the same as an Ising model, and BML can be regarded as an inverse problem for the Ising model. The framework of the usual BML is as follows. Given a set of observed data points (e.g., spin snapshots), we estimate appropriate values of the parameters, the external field and couplings, of our Boltzmann machine through maximum likelihood (ML) estimation (cf. Sec. II.1). Because BML involves intractable multiple summations (i.e., evaluation of the partition function), many approximations for it have been proposed from the viewpoint of statistical mechanics Roudi et al. (2009): for example, methods based on mean-field approximations (such as the Plefka expansion Plefka (1982) and the cluster variation method Pelizzola (2005)) Kappen and Rodríguez (1998); Tanaka (1998); Yasuda and Horiguchi (2006); Sessak and Monasson (2009); Yasuda and Tanaka (2009); Ricci-Tersenghi (2012); Furtlehner (2013) and methods based on other approximations Sohl-Dickstein et al. (2011); Yasuda (2015).

In this study, we focus on another type of learning problem. We consider prior distributions of the parameters of the Boltzmann machine and assume that the prior distributions are governed by some hyperparameters. The introduction of the prior distributions is strongly connected with the regularized ML estimation (cf. Sec. II.1). As mentioned above, the aim of the usual BML is to optimize the values of the parameters of the Boltzmann machine by using a set of observed data points. Meanwhile, the aim of the problem investigated in this study is the estimation of appropriate values of the hyperparameters from the dataset without estimating specific values of the parameters. One way to accomplish this from the Bayesian point of view is the empirical Bayes method (also called type-II ML estimation or evidence approximation) MacKay (1992); Bishop (2006) (cf. Sec. II.2). The schemes of the usual BML and of our problem are illustrated in Fig. 1.

Figure 1: Illustration of the scheme of the empirical Bayes method considered in this study.

However, the evaluation of the likelihood function in the empirical Bayes method is again intractable, because it involves intractable multiple integrations of the partition function. In this study, we analyze the empirical Bayes method for fully-connected Boltzmann machines, using statistical mechanical techniques based on the replica method Mezard et al. (1987); Nishimori (2001) and the Plefka expansion, to derive an algorithm for it. We consider two types of the prior distribution of $J_{ij}$: the cases of the Gaussian and Laplace priors.

The rest of this paper is organized as follows. The formulations of the usual BML and the empirical Bayes method are presented in Sec. II. In Sec. III, we describe our statistical mechanical analysis for the empirical Bayes method. The proposed inference algorithm obtained from our analysis is shown in Sec. III.3 with its pseudocode. In Sec. IV, we examine our proposed method through numerical experiments. Finally, the summary and some discussions are presented in Sec. V.

## II Boltzmann Machine and Empirical Bayes Method

### II.1 Boltzmann machine and prior distributions

Consider a fully-connected Boltzmann machine with $n$ Ising variables $\boldsymbol{S} := \{S_i \in \{-1,+1\} \mid i = 1,2,\ldots,n\}$ Ackley et al. (1985):

$$P(\boldsymbol{S}\mid h, J) := \frac{1}{Z(h,J)}\exp\Big(h\sum_{i=1}^{n}S_i + \sum_{i<j}J_{ij}S_iS_j\Big), \tag{1}$$

where $\sum_{i<j}$ is the sum over all the distinct pairs of variables, i.e., $\sum_{i<j} = \sum_{i=1}^{n-1}\sum_{j=i+1}^{n}$. $Z(h,J)$ is the partition function defined by

$$Z(h,J) := \sum_{\boldsymbol{S}}\exp\Big(h\sum_{i=1}^{n}S_i + \sum_{i<j}J_{ij}S_iS_j\Big),$$

where $\sum_{\boldsymbol{S}}$ is the sum over all the possible configurations of $\boldsymbol{S}$, i.e., $\sum_{\boldsymbol{S}} = \sum_{S_1=\pm 1}\sum_{S_2=\pm 1}\cdots\sum_{S_n=\pm 1}$. The parameters $h$ and $J = \{J_{ij}\}$ denote the external field and the couplings, respectively.
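As a deliberately tiny illustration of Eq. (1), the distribution and its partition function can be enumerated exactly when $n$ is small (the sum runs over $2^n$ states). The helper names and parameter values below are ours, not from the paper:

```python
# Exact enumeration of a small fully-connected Boltzmann machine, Eq. (1).
# Tractable only for small n, since the partition function sums 2^n states.
import itertools
import math

import numpy as np

def exponent(S, h, J):
    # h * sum_i S_i + sum_{i<j} J_ij S_i S_j (J symmetric, zero diagonal)
    return h * S.sum() + 0.5 * float(S @ J @ S)

def partition_function(h, J):
    n = J.shape[0]
    return sum(math.exp(exponent(np.array(S), h, J))
               for S in itertools.product((-1, 1), repeat=n))

rng = np.random.default_rng(0)
n = 6
J = np.triu(rng.normal(0.0, n ** -0.5, size=(n, n)), k=1)
J = J + J.T                      # symmetric couplings, zero diagonal
h = 0.3

Z = partition_function(h, J)
probs = [math.exp(exponent(np.array(S), h, J)) / Z
         for S in itertools.product((-1, 1), repeat=n)]
total = sum(probs)               # normalization check: should be 1
```

Summing the resulting probabilities over all $2^n$ configurations recovers 1, which is exactly the role of $Z(h,J)$ as a normalizer.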

Given $N$ observed data points, $\mathcal{D} := \{\boldsymbol{S}^{(\mu)} \mid \mu = 1,2,\ldots,N\}$, we define the log-likelihood function:

$$\mathcal{L}_{\mathrm{ML}}(h,J) := \frac{1}{nN}\sum_{\mu=1}^{N}\ln P(\boldsymbol{S}^{(\mu)}\mid h,J). \tag{2}$$

Maximizing the log-likelihood function with respect to $h$ and $J$ (i.e., the ML estimation) just corresponds to the BML (or the inverse Ising problem), i.e.,

$$\{\hat{h}_{\mathrm{ML}}, \hat{J}_{\mathrm{ML}}\} = \mathop{\mathrm{arg\,max}}_{h,J}\,\mathcal{L}_{\mathrm{ML}}(h,J). \tag{3}$$

Now, we introduce prior distributions for the parameters $h$ and $J$ as $P_{\mathrm{prior}}(h\mid H)$ and

$$P_{\mathrm{prior}}(J\mid\gamma) := \prod_{i<j}P_{\mathrm{prior}}(J_{ij}\mid\gamma), \tag{4}$$

respectively. $H$ and $\gamma$ are the hyperparameters of these prior distributions. One of the most important motivations for introducing the prior distributions is a Bayesian interpretation of the regularized ML estimation Bishop (2006). Given the observed dataset $\mathcal{D}$, by using the prior distributions, the posterior distribution of $h$ and $J$ is expressed as

$$P_{\mathrm{post}}(h,J\mid\mathcal{D},H,\gamma) = \frac{P(\mathcal{D}\mid h,J)\,P_{\mathrm{prior}}(h\mid H)\,P_{\mathrm{prior}}(J\mid\gamma)}{P(\mathcal{D}\mid H,\gamma)}, \tag{5}$$

where

$$P(\mathcal{D}\mid h,J) := \prod_{\mu=1}^{N}P(\boldsymbol{S}^{(\mu)}\mid h,J).$$

The distribution in the denominator in Eq. (5), $P(\mathcal{D}\mid H,\gamma)$, is sometimes referred to as the evidence. By using the posterior distribution, the maximum a posteriori (MAP) estimation of the parameters is obtained as

$$\{\hat{h}_{\mathrm{MAP}}, \hat{J}_{\mathrm{MAP}}\} = \mathop{\mathrm{arg\,max}}_{h,J}\,\mathcal{L}_{\mathrm{MAP}}(h,J), \tag{6}$$

where

$$\mathcal{L}_{\mathrm{MAP}}(h,J) := \frac{1}{nN}\ln P_{\mathrm{post}}(h,J\mid\mathcal{D},H,\gamma) = \mathcal{L}_{\mathrm{ML}}(h,J) + \frac{1}{nN}R_0(h) + \frac{1}{nN}R_1(J) + \mathrm{constant}, \tag{7}$$

where $R_0(h) := \ln P_{\mathrm{prior}}(h\mid H)$ and $R_1(J) := \ln P_{\mathrm{prior}}(J\mid\gamma)$.

The MAP estimation in Eq. (6) corresponds to the regularized ML estimation, in which $R_0(h)$ and $R_1(J)$ work as penalty terms. For example, (i) when the prior distribution of $J_{ij}$ is the Gaussian prior,

$$P_{\mathrm{prior}}(J_{ij}\mid\gamma) = \sqrt{\frac{n}{2\pi\gamma}}\exp\Big(-\frac{nJ_{ij}^{2}}{2\gamma}\Big), \quad \gamma>0, \tag{8}$$

$R_1(J)$ corresponds to the $\ell_2$-regularization term, and $\gamma$ corresponds to its coefficient; (ii) when the prior distribution of $J_{ij}$ is the Laplace prior,

$$P_{\mathrm{prior}}(J_{ij}\mid\gamma) = \sqrt{\frac{n}{2\gamma}}\exp\Big(-\sqrt{\frac{2n}{\gamma}}\,|J_{ij}|\Big), \quad \gamma>0, \tag{9}$$

$R_1(J)$ corresponds to the $\ell_1$-regularization term, and $\gamma$ again corresponds to its coefficient. The variances of these two prior distributions are identical: both are $\gamma/n$. In this study, as a simple test case, we use these two prior distributions for $J$ and

$$P_{\mathrm{prior}}(h\mid H) = \delta(h-H), \tag{10}$$

where $\delta(\cdot)$ is the Dirac delta function, for $h$.
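Both priors on $J_{ij}$ can be sampled directly with NumPy, which gives a quick Monte Carlo check of the statement above that their variances agree at $\gamma/n$; the values of $n$ and $\gamma$ are illustrative:

```python
# Monte Carlo check that the Gaussian prior in Eq. (8) and the Laplace prior
# in Eq. (9) share the variance gamma/n. Parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(42)
n, gamma = 100, 2.0
num = 1_000_000

# Eq. (8): Gaussian with standard deviation sqrt(gamma/n).
gauss = rng.normal(0.0, np.sqrt(gamma / n), size=num)
# Eq. (9): Laplace with scale b = sqrt(gamma/(2n)), so that 2*b**2 = gamma/n.
laplace = rng.laplace(0.0, np.sqrt(gamma / (2 * n)), size=num)

var_target = gamma / n
gauss_var = gauss.var()
laplace_var = laplace.var()
```

The two empirical variances agree with $\gamma/n$ up to Monte Carlo error, consistent with the scaling that keeps the couplings of order $1/\sqrt{n}$ in the fully-connected model.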

### II.2 Framework of the empirical Bayes method

Using the empirical Bayes method, we can infer the values of the hyperparameters, $H$ and $\gamma$, from the observed dataset $\mathcal{D}$. We define a marginal log-likelihood function as

$$\mathcal{L}_{\mathrm{EB}}(H,\gamma) := \frac{1}{nN}\ln\big[P(\mathcal{D}\mid h,J)\big]_{h,J}, \tag{11}$$

where $[\cdots]_{h,J}$ is the average over the prior distributions; i.e.,

$$[\cdots]_{h,J} := \int dJ\int dh\,(\cdots)\,P_{\mathrm{prior}}(h\mid H)\,P_{\mathrm{prior}}(J\mid\gamma).$$

We refer to the marginal log-likelihood function as the empirical Bayes likelihood function in this study. From the perspective of the empirical Bayes method, the optimal values of the hyperparameters, $H$ and $\gamma$, are obtained by maximizing the empirical Bayes likelihood function; i.e.,

$$\{\hat{H},\hat{\gamma}\} = \mathop{\mathrm{arg\,max}}_{H,\gamma}\,\mathcal{L}_{\mathrm{EB}}(H,\gamma). \tag{12}$$

It is noteworthy that $[P(\mathcal{D}\mid h,J)]_{h,J}$ in Eq. (11) is identified with the evidence $P(\mathcal{D}\mid H,\gamma)$ appearing in Eq. (5).

The marginal log-likelihood function can be rewritten as

$$\mathcal{L}_{\mathrm{EB}}(H,\gamma) = \frac{1}{nN}\ln\big[\exp\big(nN\mathcal{L}_{\mathrm{ML}}(h,J)\big)\big]_{h,J}. \tag{13}$$

Consider the case $N\to\infty$. In this case, by using the saddle-point evaluation, Eq. (13) is reduced to

$$\mathcal{L}_{\mathrm{EB}}(H,\gamma) \approx \frac{1}{nN}\ln P_{\mathrm{prior}}(\hat{h}_{\mathrm{ML}}\mid H) + \frac{1}{nN}\ln P_{\mathrm{prior}}(\hat{J}_{\mathrm{ML}}\mid\gamma) + \mathrm{constant}.$$

In this case, the empirical Bayes’ estimates thus converge to the maximum likelihood estimates of the hyperparameters in the prior distributions in which the maximum likelihood estimates of the parameters (i.e., the solution to the BML) are inserted. This indicates that the parameter estimations can be conducted independently of the hyperparameter estimation. In this study, we do not concern ourselves with this trivial case.

## III Statistical Mechanical Analysis

The empirical Bayes likelihood function in Eq. (11) involves intractable multiple integrations. In this section, we evaluate the empirical Bayes likelihood function using a statistical mechanical analysis. We consider the two types of the prior distribution of $J_{ij}$: one is the Gaussian prior in Eq. (8), and the other is the Laplace prior in Eq. (9).

First, we evaluate the empirical Bayes likelihood function on the basis of the Gaussian prior in Secs. III.1–III.3, after which we describe the evaluation based on the Laplace prior in Sec. III.4.

### III.1 Replica method

The empirical Bayes likelihood function in Eq. (11) can be represented as

$$\mathcal{L}_{\mathrm{EB}}(H,\gamma) = \frac{1}{nN}\ln\lim_{x\to-1}\Psi_x(H,\gamma), \tag{14}$$

where

$$\Psi_x(H,\gamma) := \Big[Z(h,J)^{xN}\exp N\Big(h\sum_{i=1}^{n}d_i + \sum_{i<j}J_{ij}d_{ij}\Big)\Big]_{h,J}, \tag{15}$$

and

$$d_i := \frac{1}{N}\sum_{\mu=1}^{N}S_i^{(\mu)}, \qquad d_{ij} := \frac{1}{N}\sum_{\mu=1}^{N}S_i^{(\mu)}S_j^{(\mu)}$$

are the sample averages of the observed data points. We assume that $\tau_x := xN$ is a natural number, and therefore Eq. (15) can be expressed as

$$\Psi_x(H,\gamma) = \Big[\sum_{\mathcal{S}_x}\exp\Big\{h\sum_{i=1}^{n}\Big(\sum_{a=1}^{\tau_x}S_i^{\{a\}} + Nd_i\Big) + \sum_{i<j}J_{ij}\Big(\sum_{a=1}^{\tau_x}S_i^{\{a\}}S_j^{\{a\}} + Nd_{ij}\Big)\Big\}\Big]_{h,J}, \tag{16}$$

where $a = 1,2,\ldots,\tau_x$ are replica indices, and $S_i^{\{a\}}$ is the Ising variable on site $i$ in the $a$th replica. $\mathcal{S}_x := \{S_i^{\{a\}}\}$ is the set of all the Ising variables in the replicated system, and $\sum_{\mathcal{S}_x}$ is the sum over all the possible configurations of $\mathcal{S}_x$. We evaluate $\Psi_x(H,\gamma)$ under the assumption that $\tau_x$ is a natural number, after which we take the limit $x\to-1$ of the evaluation result to obtain the empirical Bayes likelihood function (this is the so-called replica trick).

By employing the Gaussian prior in Eq. (8), Eq. (16) becomes

$$\Psi_x^{\mathrm{Gauss}}(H,\gamma) = \exp\Big\{nNHM + \frac{\gamma(n-1)N^{2}}{4}\Big(C_2+\frac{x}{N}\Big) - F_x(H,\gamma)\Big\}, \tag{17}$$

where

$$M := \frac{1}{n}\sum_{i=1}^{n}d_i, \qquad C_k := \frac{2}{n(n-1)}\sum_{i<j}d_{ij}^{\,k}, \tag{18}$$

and

$$F_x(H,\gamma) := -\ln\sum_{\mathcal{S}_x}\exp\big(-E_x(\mathcal{S}_x;H,\gamma)\big) \tag{19}$$

is the replicated (Helmholtz) free energy Rizzo et al. (2010); Yasuda et al. (2012); Lage-Castellanos et al. (2013); Yasuda et al. (2015); here,

$$E_x(\mathcal{S}_x;H,\gamma) := -H\sum_{i=1}^{n}\sum_{a=1}^{\tau_x}S_i^{\{a\}} - \frac{\gamma N}{n}\sum_{i<j}d_{ij}\sum_{a=1}^{\tau_x}S_i^{\{a\}}S_j^{\{a\}} - \frac{\gamma}{n}\sum_{i<j}\sum_{a<b}S_i^{\{a\}}S_j^{\{a\}}S_i^{\{b\}}S_j^{\{b\}} \tag{20}$$

is the Hamiltonian of the replicated system, where $\sum_{a<b}$ is the sum over all the distinct pairs of replicas; i.e., $\sum_{a<b} = \sum_{a=1}^{\tau_x-1}\sum_{b=a+1}^{\tau_x}$.

### III.2 Plefka expansion

Because the replicated free energy in Eq. (19) includes intractable multiple summations, an approximation is needed to proceed with our evaluation. In this section, we approximate the replicated free energy using the Plefka expansion Plefka (1982). In brief, the Plefka expansion is a perturbative expansion of a Gibbs free energy that is the dual form of the corresponding Helmholtz free energy.

The Gibbs free energy is obtained as

$$G_x(m,H,\gamma) = -n\tau_x Hm + \mathop{\mathrm{extr}}_{\lambda}\Big\{\lambda n\tau_x m - \ln\sum_{\mathcal{S}_x}\exp\big(-E_x(\mathcal{S}_x;\lambda,\gamma)\big)\Big\}. \tag{21}$$

The derivation of this Gibbs free energy is described in Appendix A. It is noteworthy that this type of expression of the Gibbs free energy implies the replica-symmetric (RS) assumption. To take replica-symmetry breaking (RSB) into account, explicit treatments of overlaps between different replicas are needed Yasuda et al. (2012). By expanding $G_x(m,H,\gamma)$ around $\gamma=0$, we obtain

$$\frac{G_x(m,H,\gamma)}{nN} = -xHm + xe(m) + \phi_x^{(1)}(m)\gamma + \phi_x^{(2)}(m)\gamma^{2} + O(\gamma^{3}), \tag{22}$$

where $e(m)$ is the negative mean-field entropy defined by

$$e(m) := \frac{1+m}{2}\ln\frac{1+m}{2} + \frac{1-m}{2}\ln\frac{1-m}{2}, \tag{23}$$

and the coefficients $\phi_x^{(1)}(m)$ and $\phi_x^{(2)}(m)$ are expressed as Eqs. (41) and (46), respectively. The detailed derivation of these coefficients is presented in Appendix B.

From Eqs. (14), (17), (22), and (37), we obtain the empirical Bayes likelihood function as

$$\mathcal{L}_{\mathrm{EB}}(H,\gamma) \approx HM - \mathop{\mathrm{extr}}_{m}\Big[Hm - e(m) + \Phi(m)\gamma + \phi_{-1}^{(2)}(m)\gamma^{2}\Big], \tag{24}$$

where

$$\Phi(m) := \phi_{-1}^{(1)}(m) - \frac{(n-1)N}{4n}\Big(C_2-\frac{1}{N}\Big).$$

From Eqs. (41) and (46), $\Phi(m)$ and $\phi_{-1}^{(2)}(m)$ are

$$\Phi(m) = \frac{(n-1)NC_1}{2n}m^{2} - \frac{(n-1)N}{4n}\Big\{C_2 + \frac{N+1}{N}\Big(m^{4} - \frac{1}{N+1}\Big)\Big\} \tag{25}$$

and

$$\phi_{-1}^{(2)}(m) = \frac{(n-1)^{2}N^{2}\Omega}{2n^{2}}m^{2}(1-m^{2}) + \frac{(n-1)N^{2}C_2}{4n^{2}}(1-m^{2})^{2} - \frac{(n-1)N(N+1)C_1}{2n^{2}}m^{2}(1-m^{2})^{2} - \frac{(n-1)^{2}(N+1)}{4n^{2}}\Big(1-\frac{N+1}{n-1}\Big)m^{4}(1-m^{2})^{2} - \frac{(n-1)(N+1)}{8n^{2}}(1-m^{4})^{2}, \tag{26}$$

respectively. The coefficient $\Omega$ appearing in the above equation is defined by

$$\Omega := \frac{1}{n}\sum_{i=1}^{n}\omega_i^{2}, \tag{27}$$

where

$$\omega_i := \frac{1}{n-1}\sum_{j\in\partial(i)}d_{ij} - C_1; \tag{28}$$

here, $\partial(i) := \{1,2,\ldots,n\}\setminus\{i\}$ is the set of sites connected to site $i$.
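The quantities above enter the estimator only through the statistics $M$, $C_1$, $C_2$, and $\Omega$. A short sketch of how they can be computed from a sample matrix; the $(N, n)$ array layout is our assumption, the paper only defines the statistics themselves:

```python
# Sufficient statistics M, C_1, C_2 (Eq. (18)) and Omega (Eqs. (27)-(28))
# from an (N, n) array of +/-1 samples.
import numpy as np

def statistics(samples):
    N, n = samples.shape
    d = samples.mean(axis=0)               # d_i, Eq. (18) ingredients
    D = (samples.T @ samples) / N          # d_ij matrix (diagonal entries = 1)
    M = d.mean()                           # M = (1/n) sum_i d_i
    iu = np.triu_indices(n, k=1)           # distinct pairs i < j
    C1 = D[iu].mean()                      # C_1
    C2 = (D[iu] ** 2).mean()               # C_2
    # omega_i = (1/(n-1)) sum_{j != i} d_ij - C_1, Eq. (28)
    omega = (D.sum(axis=1) - np.diag(D)) / (n - 1) - C1
    Omega = np.mean(omega ** 2)            # Eq. (27)
    return M, C1, C2, Omega

# All-up configurations: every statistic can be verified by hand.
M, C1, C2, Omega = statistics(np.ones((10, 5)))
```

For the all-up dataset every $d_{ij}$ equals 1, so $M = C_1 = C_2 = 1$ and all $\omega_i$ vanish, giving $\Omega = 0$.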

### III.3 Inference algorithm

As mentioned in Sec. II.2, the empirical Bayes inference is achieved by maximizing $\mathcal{L}_{\mathrm{EB}}(H,\gamma)$ with respect to $H$ and $\gamma$ (cf. Eq. (12)). From the extremum condition of Eq. (24) with respect to $H$, we obtain

$$\hat{m} = M, \tag{29}$$

where $\hat{m}$ is the value of $m$ that satisfies the extremum condition in Eq. (24). From the extremum condition of Eq. (24) with respect to $m$ and Eq. (29), we obtain

$$\hat{H} = \tanh^{-1}M - \frac{\partial}{\partial m}\Big[\Phi(m)\hat{\gamma} + \phi_{-1}^{(2)}(m)\hat{\gamma}^{2}\Big]\Big|_{m=M}. \tag{30}$$

From Eqs. (24) and (29), the optimal value of $\gamma$ is obtained by

$$\hat{\gamma} = \mathop{\mathrm{arg\,max}}_{\gamma}\Big[-\Phi(M)\gamma - \phi_{-1}^{(2)}(M)\gamma^{2}\Big]. \tag{31}$$

From Eq. (31), $\hat{\gamma}$ is immediately obtained as follows: (i) when $\Phi(M)<0$ and $\phi_{-1}^{(2)}(M)>0$ or when $\Phi(M)>0$ and $\phi_{-1}^{(2)}(M)<0$, $\hat{\gamma} = -\Phi(M)/\big(2\phi_{-1}^{(2)}(M)\big)$, (ii) when $\Phi(M)<0$ and $\phi_{-1}^{(2)}(M)<0$, $\hat{\gamma}=\infty$, and (iii) $\hat{\gamma}=0$ elsewhere. Here, we ignore the case $\phi_{-1}^{(2)}(M)=0$, because it hardly occurs in realistic settings. By using Eqs. (30) and (31), we can obtain the solution to the empirical Bayes inference without any iterative processes. The pseudocode of the proposed procedure is shown in Algorithm 1.
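The whole procedure is closed-form. The following sketch implements Eqs. (25), (26), (30), and (31); the function names are ours, the derivative in Eq. (30) is taken numerically for brevity, and the case analysis follows the reading given in the text:

```python
# Sketch of the non-iterative estimator: gamma-hat from Eq. (31) and
# H-hat from Eq. (30), with Phi and phi2 from Eqs. (25)-(26).
import numpy as np

def Phi(m, n, N, C1, C2):  # Eq. (25)
    return ((n - 1) * N * C1 / (2 * n)) * m**2 \
        - ((n - 1) * N / (4 * n)) * (C2 + (N + 1) / N * (m**4 - 1 / (N + 1)))

def phi2(m, n, N, C1, C2, Omega):  # Eq. (26)
    u = 1 - m**2
    return ((n - 1)**2 * N**2 * Omega / (2 * n**2)) * m**2 * u \
        + ((n - 1) * N**2 * C2 / (4 * n**2)) * u**2 \
        - ((n - 1) * N * (N + 1) * C1 / (2 * n**2)) * m**2 * u**2 \
        - ((n - 1)**2 * (N + 1) / (4 * n**2)) * (1 - (N + 1) / (n - 1)) * m**4 * u**2 \
        - ((n - 1) * (N + 1) / (8 * n**2)) * (1 - m**4)**2

def estimate(M, C1, C2, Omega, n, N, eps=1e-5):
    P = Phi(M, n, N, C1, C2)
    Q = phi2(M, n, N, C1, C2, Omega)
    if P * Q < 0:              # positive stationary point of Eq. (31)
        gamma = -P / (2 * Q)
    elif P < 0 and Q < 0:      # objective increases without bound
        gamma = np.inf
    else:
        gamma = 0.0
    if not np.isfinite(gamma):
        return np.nan, gamma
    # Eq. (30): central differences for the derivative at m = M.
    dPhi = (Phi(M + eps, n, N, C1, C2) - Phi(M - eps, n, N, C1, C2)) / (2 * eps)
    dphi2 = (phi2(M + eps, n, N, C1, C2, Omega)
             - phi2(M - eps, n, N, C1, C2, Omega)) / (2 * eps)
    H = np.arctanh(M) - gamma * dPhi - gamma**2 * dphi2
    return H, gamma

# Illustrative statistics (not real data). With M = 0 both coefficient
# functions are even in m, so H-hat = 0 by symmetry.
H_hat, gamma_hat = estimate(M=0.0, C1=0.0, C2=0.05, Omega=0.0, n=100, N=40)
```

For these illustrative inputs, $\Phi(M)<0$ and $\phi_{-1}^{(2)}(M)>0$, so case (i) applies and $\hat{\gamma}=-\Phi(M)/(2\phi_{-1}^{(2)}(M))$ is a positive finite value.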

In the proposed method, the value of $\hat{H}$ does not affect the determination of $\hat{\gamma}$. Many mean-field-based methods for BML (e.g., those listed in Sec. I) have similar procedures, in which the couplings are determined separately from the external fields. This can be seen as one of the common properties of the mean-field-based methods for BML, including the current empirical Bayes problem.

### III.4 Evaluation based on Laplace prior

The above evaluation was for the Gaussian prior in Eq. (8). Here, we explain the evaluation for the Laplace prior in Eq. (9). By employing the Laplace prior in Eq. (9), Eq. (16) becomes

$$\Psi_x^{\mathrm{Laplace}}(H,\gamma) = e^{nNHM}\sum_{\mathcal{S}_x}\exp\Big[H\sum_{i=1}^{n}\sum_{a=1}^{\tau_x}S_i^{\{a\}} + \sum_{i<j}\Big(\ln\xi^{2} - \ln\Big\{\xi^{2} - \Big(\sum_{a=1}^{\tau_x}S_i^{\{a\}}S_j^{\{a\}} + Nd_{ij}\Big)^{2}\Big\}\Big)\Big], \tag{32}$$

where $\xi := \sqrt{2n/\gamma}$. Here, we assume

$$\xi > \max_{i<j}\Big|\sum_{a=1}^{\tau_x}S_i^{\{a\}}S_j^{\{a\}} + Nd_{ij}\Big|. \tag{33}$$

By using the perturbative approximation,

$$\ln\Big\{\xi^{2} - \Big(\sum_{a=1}^{\tau_x}S_i^{\{a\}}S_j^{\{a\}} + Nd_{ij}\Big)^{2}\Big\} = \ln\xi^{2} + \ln\Big\{1 - \xi^{-2}\Big(\sum_{a=1}^{\tau_x}S_i^{\{a\}}S_j^{\{a\}} + Nd_{ij}\Big)^{2}\Big\} \approx \ln\xi^{2} - \xi^{-2}\Big(\sum_{a=1}^{\tau_x}S_i^{\{a\}}S_j^{\{a\}} + Nd_{ij}\Big)^{2},$$

we obtain the approximation of Eq. (32) as

$$\Psi_x^{\mathrm{Laplace}}(H,\gamma) \approx e^{nNHM}\sum_{\mathcal{S}_x}\exp\Big[H\sum_{i=1}^{n}\sum_{a=1}^{\tau_x}S_i^{\{a\}} + \xi^{-2}\sum_{i<j}\Big(\sum_{a=1}^{\tau_x}S_i^{\{a\}}S_j^{\{a\}} + Nd_{ij}\Big)^{2}\Big].$$

The right-hand side of this equation coincides with $\Psi_x^{\mathrm{Gauss}}(H,\gamma)$ in Eq. (17), because $\xi^{-2} = \gamma/(2n)$. This means that the empirical Bayes inference based on the Laplace prior in Eq. (9) is (approximately) equivalent to that based on the Gaussian prior in Eq. (8) with the same value of $\gamma$ when the assumption of Eq. (33) is justified. Thus, we can also use the algorithm presented in Sec. III.3 for the case of the Laplace prior.
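The perturbative step above replaces $\ln(\xi^2 - A^2)$ by its first-order expansion in $A^2/\xi^2$, which is accurate precisely when the assumption of Eq. (33) holds with room to spare. A quick numerical sanity check with illustrative values:

```python
# Check of the expansion ln(xi^2 - A^2) ~ ln(xi^2) - A^2/xi^2 used above,
# valid when |A| << xi (cf. the assumption in Eq. (33)). Values illustrative.
import math

xi, A = 50.0, 3.0                  # |A| << xi
exact = math.log(xi**2 - A**2)
approx = math.log(xi**2) - (A / xi)**2
error = abs(exact - approx)        # remainder is O((A/xi)^4)
```

Here $(A/\xi)^2 = 0.0036$, and the discrepancy between the exact and approximate logarithms is of order $(A/\xi)^4/2 \approx 10^{-5}$, illustrating why the approximation is harmless far from the boundary of Eq. (33).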

## IV Numerical Experiments

In this section, we describe the results of our numerical experiments. In these experiments, the observed dataset $\mathcal{D}$ is generated from the generative Boltzmann machine, which has the same form as Eq. (1), by using annealed importance sampling Neal (2001). The parameters of the generative Boltzmann machine are drawn from the prior distributions in Eqs. (4) and (10). That is, we consider the model-matched case (i.e., the generative and learning models are identical).

In the following, we use the notations $\alpha := N/n$ and $J := \sqrt{\gamma/n}$. The standard deviations of the Gaussian prior in Eq. (8) and of the Laplace prior in Eq. (9) are then both $J$. We express the hyperparameters for the generative Boltzmann machine by $H_{\mathrm{true}}$ and $J_{\mathrm{true}}$, and the corresponding estimates by $\hat{H}$ and $\hat{J} := \sqrt{\hat{\gamma}/n}$.

### IV.1 Gaussian prior case

Here, we consider the case in which the prior distribution of $J_{ij}$ is the Gaussian prior in Eq. (8). In this case, the Boltzmann machine corresponds to the Sherrington–Kirkpatrick (SK) model Sherrington and Kirkpatrick (1975), and therefore it shows the spin-glass transition at $J_{\mathrm{true}} = 1$ when $H_{\mathrm{true}} = 0$ (i.e., when $\gamma_{\mathrm{true}} = n$).

First, we consider the case $H_{\mathrm{true}} = 0$. We show the scatter plots for the estimation of $J_{\mathrm{true}}$ for various $n$ when $H_{\mathrm{true}} = 0$ and $\alpha = 0.4$ in Fig. 2.

Figure 2: Scatter plots of $J_{\mathrm{true}}$ (horizontal axis) versus $\hat{J}$ (vertical axis) when $H_{\mathrm{true}}=0$ and $\alpha=0.4$: (a) $n=300$ and (b) $n=500$. Plots are the average values over 300 experiments.

The detailed values of the plots for some values of $J_{\mathrm{true}}$ are shown in Tab. 1.

When $J_{\mathrm{true}} < 1$, our estimates $\hat{J}$ are in good agreement with $J_{\mathrm{true}}$, whereas they deviate from $J_{\mathrm{true}}$ when $J_{\mathrm{true}} > 1$. This implies that the validity of our perturbative approximation is lost in the spin-glass phase, as is often the case with many mean-field approximations. Fig. 3 shows the scatter plots for various $\alpha$.

Figure 3: Scatter plots of $J_{\mathrm{true}}$ (horizontal axis) versus $\hat{J}$ (vertical axis) for various $\alpha=N/n$ when $H_{\mathrm{true}}=0$: (a) $n=300$ and (b) $n=500$. Plots are the average values over 300 experiments.

A smaller $\alpha$ causes $\hat{J}$ to be overestimated, and a larger $\alpha$ causes it to be underestimated. At least in our experiments, the optimal value of $\alpha$ seems to be $\alpha \approx 0.4$ when $H_{\mathrm{true}} = 0$. Our method can estimate $\hat{H}$ together with $\hat{J}$. The results for the estimation of $\hat{H}$ when $H_{\mathrm{true}} = 0$ and $\alpha = 0.4$ are shown in Fig. 4.

Figure 4: Results of estimation of $\hat{H}$ against $J_{\mathrm{true}}$ when $H_{\mathrm{true}}=0$ and $\alpha=0.4$: (a) the mean absolute error and (b) standard deviation. Plots are the average values over 300 experiments.

Figs. 4(a) and (b) show the average of $|\hat{H} - H_{\mathrm{true}}|$ (i.e., the mean absolute error (MAE)) and the standard deviation of $\hat{H}$ over 300 experiments, respectively. The MAE and standard deviation drastically increase in the region $J_{\mathrm{true}} \gtrsim 1$.

Next, we consider the cases $H_{\mathrm{true}} \neq 0$. The scatter plots for the estimation of $J_{\mathrm{true}}$ for various $\alpha$ values when $H_{\mathrm{true}} = 0.2$ and $0.4$ are shown in Fig. 5.

Figure 5: Scatter plots of $J_{\mathrm{true}}$ (horizontal axis) versus $\hat{J}$ (vertical axis) for various $\alpha=N/n$ for $n=300$ and $500$: (a) $H_{\mathrm{true}}=0.2$ and (b) $H_{\mathrm{true}}=0.4$. Plots are the average values over 300 experiments. The notation in the legend means $(n,N)$.

The appropriate values of $\alpha$ when $H_{\mathrm{true}} = 0.2$ and $0.4$ "approximately" seem to be $\alpha = 30/n$ and $\alpha = 5/n$, respectively. The detailed values of these plots for some values of $J_{\mathrm{true}}$ are shown in Tabs. 2 and 3. The results for the estimation of $\hat{H}$ when $H_{\mathrm{true}} = 0.2$ and $\alpha = 30/n$ and when $H_{\mathrm{true}} = 0.4$ and $\alpha = 5/n$ are shown in Figs. 6 and 7, respectively.

Figure 6: Results of estimation of $\hat{H}$ against $J_{\mathrm{true}}$ when $H_{\mathrm{true}}=0.2$ and $\alpha=30/n$: (a) the mean absolute error and (b) standard deviation. Plots are the average values over 300 experiments.

Figure 7: Results of estimation of $\hat{H}$ against $J_{\mathrm{true}}$ when $H_{\mathrm{true}}=0.4$ and $\alpha=5/n$: (a) the mean absolute error and (b) standard deviation. Plots are the average values over 300 experiments.

The increases in the MAE and standard deviations occur earlier than for the case $H_{\mathrm{true}} = 0$ in Fig. 4.

One of the largest qualitative differences between the cases $H_{\mathrm{true}} = 0$ and $H_{\mathrm{true}} \neq 0$ is the scale of the optimal $N$. In the case $H_{\mathrm{true}} = 0$, the optimal $N$ was scaled by $O(n)$ with respect to $n$ (i.e., $\alpha = O(1)$). Meanwhile, in the case $H_{\mathrm{true}} \neq 0$, the optimal $N$ is scaled by $O(1)$ with respect to $n$ (i.e., $\alpha = O(n^{-1})$). This change of scale can be understood from a scale evaluation of the terms in the empirical Bayes likelihood function in Eq. (24). The detailed reasoning is given in Appendix C.

### IV.2 Laplace prior case

Here, we consider the case in which the prior distribution of $J_{ij}$ is the Laplace prior in Eq. (9). The scatter plots for the estimation of $J_{\mathrm{true}}$ for various $\alpha$ values when $H_{\mathrm{true}} = 0$ are shown in Fig. 8.

Figure 8: Scatter plots of $J_{\mathrm{true}}$ (horizontal axis) versus $\hat{J}$ (vertical axis) for various $\alpha=N/n$, when $H_{\mathrm{true}}=0$, in the case of the Laplace prior: (a) $n=300$ and (b) $n=500$. Plots are the average values over 300 experiments.

The plots shown in Fig. 8 almost completely overlap with those in Fig. 3. Furthermore, all the numerical results in the case $H_{\mathrm{true}} \neq 0$ also almost completely overlap with the corresponding results obtained in the above Gaussian prior case, and therefore we do not show those results.

## V Summary and Discussions

In this study, we proposed a hyperparameter inference algorithm obtained by analyzing the empirical Bayes likelihood function in Eq. (11) using the replica method and the Plefka expansion. The validity of our method was examined in numerical experiments for the Gaussian and Laplace priors, which demonstrated the existence of an appropriate scale of the size of the dataset at which the values of the hyperparameters can be accurately recovered.

However, some problems remain. The first one is the scale of $N$. In our experiments, we found that an appropriate $N$ is scaled by $O(n)$ when $H_{\mathrm{true}} = 0$ or by $O(1)$ when $H_{\mathrm{true}} \neq 0$. However, such scales seem to be unnatural, because they should not appear in the original framework of the empirical Bayes method. As discussed in Sec. II.2, when $N \to \infty$, maximizing the empirical Bayes likelihood function is reduced to the maximum likelihood estimation of the prior distributions in which the maximum likelihood estimates of the parameters (i.e., the solution to the BML) are inserted. This must lead to the correct $\hat{H}$ and $\hat{\gamma}$, because the solution to BML is perfect when $N \to \infty$. Therefore, such unnatural scales appear due to our approximation, which is also supported by the scale analysis given in Appendix C. An improvement of the approximation (e.g., by evaluating higher-order terms in the Plefka expansion or using some other approximations) might reduce these unnatural behaviors.

The second problem is the optimal setting of $\alpha$. Empirically, we found that $\alpha \approx 0.4$ when $H_{\mathrm{true}} = 0$ and that the optimal $\alpha$ decreases as $H_{\mathrm{true}}$ increases (e.g., $\alpha = 30/n$ when $H_{\mathrm{true}} = 0.2$ and $\alpha = 5/n$ when $H_{\mathrm{true}} = 0.4$). As can be seen in the results of our experiments, the solution of our method is robust to the choice of $\alpha$ when $H_{\mathrm{true}}$ is small and is sensitive to it when $H_{\mathrm{true}}$ is large. The estimation of the optimal $\alpha$ is very important for our method, and it will make our method more practical. This problem would be strongly related to the first problem.

The third problem is the degradation of the estimation accuracy in the spin-glass phase. In our experiments, the estimation accuracies of and were obviously degraded in the spin-glass phase. This means that our Plefka expansion based on the RS assumption loses its validity in the spin-glass phase. In Ref. Yasuda et al. (2012), a Plefka expansion for the one-step RSB was proposed. Employing this expansion instead of the current expansion could reduce the degradation in the spin-glass phase. These three problems should be addressed in our future studies.

In this study, we used fully-connected Boltzmann machines whose variables are all visible. We are also interested in an extension of our method to other types of Boltzmann machines, such as Boltzmann machines having specific structures or hidden variables. Furthermore, we considered the model-matched case (i.e., the case in which the generative model and the learning model are the same model) in the current study, but model-mismatched cases are more practical and important.

## Appendix A Gibbs Free Energy

In this appendix, we derive the Gibbs free energy for the replicated (Helmholtz) free energy in Eq. (19).

The replicated free energy is obtained by minimizing the variational free energy, defined by

$$f[Q] := \sum_{\mathcal{S}_x}E_x(\mathcal{S}_x;H,\gamma)Q(\mathcal{S}_x) + \sum_{\mathcal{S}_x}Q(\mathcal{S}_x)\ln Q(\mathcal{S}_x), \tag{34}$$

under the normalization constraint, i.e., $\sum_{\mathcal{S}_x}Q(\mathcal{S}_x)=1$, where $Q(\mathcal{S}_x)$ is a test distribution over $\mathcal{S}_x$, and $E_x(\mathcal{S}_x;H,\gamma)$ is the Hamiltonian for the replicated system defined in Eq. (20).

The Gibbs free energy is obtained by adding new constraints to the minimization of $f[Q]$. Here, we add the relation $\sum_{i=1}^{n}\sum_{a=1}^{\tau_x}\sum_{\mathcal{S}_x}S_i^{\{a\}}Q(\mathcal{S}_x) = n\tau_x m$ as the constraint. By using Lagrange multipliers, the Gibbs free energy is obtained as

$$G_x(m,H,\gamma) := \mathop{\mathrm{extr}}_{Q,\lambda,r}\Big\{f[Q] - r\Big(\sum_{\mathcal{S}_x}Q(\mathcal{S}_x)-1\Big) - \lambda\Big(\sum_{i=1}^{n}\sum_{a=1}^{\tau_x}\sum_{\mathcal{S}_x}S_i^{\{a\}}Q(\mathcal{S}_x) - n\tau_x m\Big)\Big\}, \tag{35}$$

where "$\mathrm{extr}$" denotes the extremum with respect to the assigned parameters. By performing the extremum operation with respect to $Q$ and $r$ in Eq. (35), we obtain

$$G_x(m,H,\gamma) = \mathop{\mathrm{extr}}_{\lambda}\Big\{\lambda n\tau_x m - \ln\sum_{\mathcal{S}_x}\exp\big(-E_x(\mathcal{S}_x;H+\lambda,\gamma)\big)\Big\}. \tag{36}$$

The replicated free energy in Eq. (19) coincides with the extremum of this Gibbs free energy with respect to $m$; i.e.,

$$F_x(H,\gamma) = \mathop{\mathrm{extr}}_{m}\,G_x(m,H,\gamma). \tag{37}$$

By performing the shift $\lambda \to \lambda - H$ in Eq. (36), we obtain Eq. (21).

## Appendix B Derivation of Coefficients of Plefka Expansion

The Plefka expansion considered in this study can be obtained by expanding the Gibbs free energy in Eq. (21) around $\gamma = 0$.

When $\gamma = 0$, we have

$$G_x(m,H,0) = -n\tau_x Hm + n\tau_x\mathop{\mathrm{extr}}_{\lambda}\big(\lambda m - \ln 2\cosh\lambda\big) = -n\tau_x Hm + n\tau_x e(m), \tag{38}$$

where $e(m)$ is defined in Eq. (23).

For the derivations of the coefficients $\phi_x^{(1)}(m)$ and $\phi_x^{(2)}(m)$, we decompose $E_x(\mathcal{S}_x;\lambda,\gamma)$ in Eq. (21) into two parts:

$$E_x(\mathcal{S}_x;\lambda,\gamma) = -\lambda\sum_{i=1}^{n}\sum_{a=1}^{\tau_x}S_i^{\{a\}} + \gamma E_x^{\mathrm{int}}(\mathcal{S}_x),$$

where

$$E_x^{\mathrm{int}}(\mathcal{S}_x) := -\frac{N}{n}\sum_{i<j}d_{ij}\sum_{a=1}^{\tau_x}S_i^{\{a\}}S_j^{\{a\}} - \frac{1}{n}\sum_{i<j}\sum_{a<b}S_i^{\{a\}}S_j^{\{a\}}S_i^{\{b\}}S_j^{\{b\}}.$$

Coefficient $\phi_x^{(1)}(m)$ is defined by

$$\phi_x^{(1)}(m) := \frac{1}{nN}\frac{\partial G_x(m,H,\gamma)}{\partial\gamma}\Big|_{\gamma=0}.$$