The generator model assumes that the observed example is generated by a low-dimensional latent vector via a top-down network, and the latent vector follows a simple and known prior distribution, such as uniform or Gaussian white noise distribution. While we can learn an expressive top-down network to map the prior distribution to the data distribution, we can also learn an expressive prior model instead of assuming a given prior distribution. This follows the philosophy of empirical Bayes where the prior model is learned from the observed data. We propose to learn an energy-based prior model for the latent vector, where the energy function is parametrized by a very simple multi-layer perceptron. Due to the low-dimensionality of the latent space, learning a latent space energy-based prior model proves to be both feasible and desirable. In this paper, we develop the maximum likelihood learning algorithm and its variation based on short-run Markov chain Monte Carlo sampling from the prior and the posterior distributions of the latent vector, and we show that the learned model exhibits strong performance in terms of image and text generation and anomaly detection.
In recent years, deep generative models have achieved impressive successes in image and text generation. A particularly simple and powerful model is the generator model Kingma and Welling (2014); Goodfellow et al. (2014), which maps a low-dimensional latent vector to image or text via a top-down network. The generator model was proposed in the contexts of variational auto-encoder (VAE) Kingma and Welling (2014); Rezende et al. (2014) and generative adversarial networks (GAN) Goodfellow et al. (2014); Radford et al. (2016). In both frameworks, the generator model is jointly learned with a complementary model, such as the inference model in VAE and the discriminator model in GAN. More recently in Han et al. (2017); Nijkamp et al. (2019b), the generator model has also been learned by maximum likelihood without resorting to a complementary model, where the inference is carried out by Markov chain Monte Carlo (MCMC) such as the Langevin dynamics. In this paper, we shall adopt the framework of maximum likelihood estimation (MLE), instead of GAN or VAE, so that the learning is simpler in the sense that we do not need to train a complementary network.
The expressive power of the generator network for image and text generation comes from the top-down network that maps a simple prior distribution to be close to the data distribution. Most of the existing papers Makhzani et al. (2015); Tolstikhin et al. (2017); Arjovsky et al. (2017); Dai et al. (2017); Turner et al. (2019); Kumar et al. (2019) assume that the latent vector follows a given simple prior distribution, such as isotropic Gaussian white noise distribution or uniform distribution. However, such an assumption may cause ineffective generator learning, as observed in Dai and Wipf (2019b); Tomczak and Welling (2018a). While we can increase the complexity of the top-down network to enhance the expressive power of the model, in this paper, we shall pursue a different strategy by following the philosophy of empirical Bayes, that is, instead of assuming a given prior distribution for the latent vector, we learn a prior model from empirical observations.
Specifically, we assume the latent vector follows an energy-based model (EBM), or more specifically, an energy-based correction of the isotropic Gaussian white noise prior distribution. We call this model the latent space energy-based prior model. Such a prior model adds more expressive power to the generator model.
The MLE learning of the generator model with a latent space EBM prior involves MCMC sampling of latent vector from both the prior and posterior distributions. Parameters of the prior model can then be updated based on the statistical difference between samples from the two distributions. Parameters of the top-down network can be updated based on the samples from the posterior distribution as well as the observed data.
Compared to GAN that involves delicate dueling between two networks, MLE learning is simpler, and does not suffer from issues such as instability or mode collapsing. As to VAE, for generator model with a latent space EBM prior, VAE is not easily applicable because of the intractability of the normalizing constant of the latent space EBM.
Although MLE learning does not require training a complementary model, it requires MCMC sampling from prior and posterior distributions of the learned model. However, because MCMC sampling is carried out in the low-dimensional latent space instead of the high-dimensional data space, it is easily affordable on modern computing platforms. Compared to EBM built directly on image or text, the latent space EBM prior can be much less multi-modal, because it can rely on the top-down network to map the prior distribution to the highly multi-modal data distribution. A less multi-modal EBM is more amenable to MCMC sampling.
Furthermore, in this paper, we propose to use short-run MCMC sampling Nijkamp et al. (2019a, 2020, c), i.e., we always initialize MCMC from the fixed Gaussian white noise distribution, and we always run a fixed and small number of steps, in both training and testing stages. Such a learning algorithm is simple and efficient. We formulate this learning algorithm as a perturbation of MLE learning in terms of both objective function and estimating equation, so that the learning algorithm has a solid theoretical foundation.
We test the proposed modeling, learning and computing method on tasks such as image synthesis, text generation, as well as anomaly detection. We show that our method is competitive with prior art.
Contributions. (1) We propose a generator model with a latent space energy-based prior model by following the empirical Bayes philosophy. (2) We develop the maximum likelihood learning algorithm based on MCMC sampling of the latent vector from the prior and posterior distributions. (3) We further develop an efficient modification of MLE learning based on short-run MCMC sampling Nijkamp et al. (2019a, 2020, c). (4) We provide theoretical foundation for learning driven by short-run MCMC. (5) We provide strong empirical results to corroborate the proposed method.
We now put our work within the bigger picture of modeling and learning, and discuss related work.
Energy-based model and top-down generation model. A top-down model or a directed acyclic graphical (DAG) model is of a simple factorized form that is capable of ancestral sampling. The prototype of such a model is factor analysis Rubin and Thayer (1982), which has been generalized to independent component analysis Hyvärinen et al. (2004), sparse coding Olshausen and Field (1997), non-negative matrix factorization Lee and Seung (2001), etc. An early example of a multi-layer top-down model is the generation model of Helmholtz machine Hinton et al. (1995). An energy-based model defines an unnormalized density or a Gibbs distribution. Prototypes of such a model are the exponential family distribution, the Boltzmann machine Ackley et al. (1985), and the FRAME (Filters, Random field, And Maximum Entropy) model Zhu et al. (1998). Zhu (2003) contrasted these two classes of models, calling the top-down latent variable model the generative model, and the energy-based model the descriptive model. Guo et al. (2003) proposed to integrate the two models, where the top-down generation model generates textons, while the EBM prior accounts for the spatial placement and arrangement of textons. Our model follows such a scheme.
The energy function in the EBM can be viewed as the objective function, the cost function, the constraints, or a critic Sutton and Barto (2018). It is easy to specify, although optimizing or sampling the energy function can be hard, and may require an iterative algorithm such as MCMC. The maximum likelihood learning of EBM can be interpreted as an adversarial scheme Wu et al. (2019); Han et al. (2020); Lazarow et al. (2017); Xie et al. (2017), where the MCMC serves as the generator and the energy function serves as an evaluator. However, unlike GAN, the maximum likelihood learning of EBM does not suffer from issues such as mode collapsing.
The top-down generation model can be viewed as an actor Sutton and Barto (2018) that directly generates the samples. It is easy to sample from, although one may need a complex top-down model to generate high quality samples. Comparing the two models, the EBM can be more expressive than a top-down model of the same complexity, while a top-down model is much easier to sample from. Therefore, it is desirable to let EBM take over the top layers of the top-down model to make the model more expressive, while EBM learning is still feasible.
Energy-based correction of top-down model. The top-down model usually assumes independent nodes at the top layer and conditional independent nodes at subsequent layers. We can introduce energy terms at multiple layers to correct for the independence or conditional independence assumptions. This leads to a latent energy-based model. However, unlike undirected latent EBM, the energy-based correction is learned on top of a directed top-down model, and this can be easier than learning an undirected latent EBM from scratch. Our work is a simple example of this scheme where we correct the prior distribution. We can also correct the generation model.
From data space EBM to latent space EBM. EBM learned in data space such as image space Xie et al. (2016); Lu et al. (2016); Han et al. (2019); Nijkamp et al. (2019a); Du and Mordatch (2019) can be highly multi-modal, and MCMC sampling can be difficult. In that case, we can introduce latent variables and learn an EBM in latent space, while also learning a mapping from the latent space to the data space. Our work follows such a strategy. Earlier papers on this strategy are Zhu (2003); Guo et al. (2003); Bengio et al. (2013); Brock et al. (2018); Kumar et al. (2019). Learning EBM in latent space can be much more feasible than learning EBM in data space in terms of MCMC sampling, and much of past work on EBM can be recast in the latent space.
Short-run MCMC. Recently, Nijkamp et al. (2019a) proposed to use short-run MCMC to sample from the EBM in data space. Nijkamp et al. (2019c) proposed to use short-run MCMC to sample the latent variables of a top-down generation model from their posterior distribution. Our work adopts short-run MCMC to sample from both the prior and the posterior of the latent variables. We also provide theoretical foundation for the learning algorithm with short-run MCMC sampling.
Generator model with flexible prior. A few variants of VAE attempt to address the mismatch between the prior and the aggregate posterior. VampPrior Tomczak and Welling (2018b) parameterizes the prior based on the posterior inference model, while Bauer and Mnih (2019) propose to construct rich priors using rejection sampling with a learned acceptance function, both yielding improved performance on grey scale images. ARAE Zhao et al. (2018) learns an implicit prior distribution in the latent space with adversarial training and demonstrates superior performance on text generation. Recently, some papers resort to a two-stage approach Dai and Wipf (2019a); Ghosh et al. (2020). They first train a VAE or deterministic autoencoder (with some form of regularization) in the data space. To enable generation from the model, they then fit a VAE or Gaussian mixture to the posterior samples inferred by the first stage model. An earlier model related to the two-stage approach is GLO Bojanowski et al. (2017), where a generator is trained paired with inference conducted by gradient descent on the latent variables instead of a separate inference network. All of these prior models by and large follow the empirical Bayes philosophy, which is also one motivation of our work.
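Let $x$ be an observed example such as an image or a piece of text, and let $z$ be the latent variables, where $z \in \mathbb{R}^d$. The joint distribution of $(x, z)$ is
$$p_\theta(x, z) = p_\alpha(z)\, p_\beta(x|z), \qquad (1)$$
where $p_\alpha(z)$ is the prior model with parameters $\alpha$, $p_\beta(x|z)$ is the top-down generation model with parameters $\beta$, and $\theta = (\alpha, \beta)$.
The prior model $p_\alpha(z)$ is formulated as an energy-based model,
$$p_\alpha(z) = \frac{1}{Z(\alpha)} \exp(f_\alpha(z))\, p_0(z), \qquad (2)$$
where $p_0(z)$ is a reference distribution, assumed to be isotropic Gaussian in this paper. $f_\alpha(z)$ is the negative energy and is parameterized by a small multi-layer perceptron with parameters $\alpha$. $Z(\alpha) = \int \exp(f_\alpha(z))\, p_0(z)\, dz = E_{p_0}[\exp(f_\alpha(z))]$ is the normalizing constant or partition function.
The prior model (2) can be interpreted as an energy-based correction or exponential tilting of the original prior distribution $p_0$, which is the prior distribution in the generator model in VAE.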
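As a minimal sketch of the prior in (2), the unnormalized log-density $f_\alpha(z) + \log p_0(z)$ can be evaluated as below. The one-hidden-layer tanh network, the layer sizes, and the names `init_mlp`, `f_alpha`, `log_p0` and `unnormalized_log_prior` are illustrative assumptions, not the architecture used in the paper. The intractable term $\log Z(\alpha)$ is deliberately left out.

```python
import math
import random

def init_mlp(d, hidden, seed=0):
    """Randomly initialise a one-hidden-layer MLP f_alpha: R^d -> R (toy sizes)."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, 0.1) for _ in range(d)] for _ in range(hidden)]
    w2 = [rng.gauss(0.0, 0.1) for _ in range(hidden)]
    return {"w1": w1, "b1": [0.0] * hidden, "w2": w2, "b2": 0.0}

def f_alpha(alpha, z):
    """Negative energy f_alpha(z): scalar output of the small MLP."""
    hidden = [math.tanh(sum(w * zi for w, zi in zip(row, z)) + b)
              for row, b in zip(alpha["w1"], alpha["b1"])]
    return sum(w * h for w, h in zip(alpha["w2"], hidden)) + alpha["b2"]

def log_p0(z):
    """Log-density of the isotropic Gaussian reference N(0, I_d)."""
    d = len(z)
    return -0.5 * sum(zi * zi for zi in z) - 0.5 * d * math.log(2.0 * math.pi)

def unnormalized_log_prior(alpha, z):
    """log p_alpha(z) + log Z(alpha) = f_alpha(z) + log p_0(z)."""
    return f_alpha(alpha, z) + log_p0(z)

alpha = init_mlp(d=4, hidden=8)
z = [0.5, -1.0, 0.0, 2.0]
print(unnormalized_log_prior(alpha, z))
```

With all weights set to zero, $f_\alpha \equiv 0$ and the prior reduces to the reference distribution $p_0$, which is the sense in which (2) is a correction of the Gaussian prior.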
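The generation model is the same as the top-down network in VAE. For image modeling,
$$x = g_\beta(z) + \epsilon, \qquad (3)$$
where $\epsilon \sim N(0, \sigma^2 I_D)$, so that $p_\beta(x|z) \sim N(g_\beta(z), \sigma^2 I_D)$. As in VAE, $\sigma^2$ takes an assumed value. For text modeling, let $x = (x^{(t)}, t = 1, \ldots, T)$ where each $x^{(t)}$ is a token. Following previous text VAE model Bowman et al. (2016), we define $p_\beta(x|z)$ as a conditional autoregressive model,
$$p_\beta(x|z) = \prod_{t=1}^{T} p_\beta(x^{(t)} | x^{(1)}, \ldots, x^{(t-1)}, z), \qquad (4)$$
which is often parameterized by a recurrent network with parameters $\beta$.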
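The two generation models lead to two forms of $\log p_\beta(x|z)$. The sketch below evaluates both for already-computed network outputs. The function names and the representation of the autoregressive model as a list of per-step probability tables are assumptions made for illustration.

```python
import math

def gaussian_log_lik(x, g_z, sigma):
    """log p_beta(x|z) for x = g_beta(z) + eps, eps ~ N(0, sigma^2 I_D).

    g_z is the generator output g_beta(z), given here as a plain list.
    """
    D = len(x)
    sq_err = sum((xi - gi) ** 2 for xi, gi in zip(x, g_z))
    return -sq_err / (2.0 * sigma ** 2) - D * math.log(sigma * math.sqrt(2.0 * math.pi))

def autoregressive_log_lik(step_probs, tokens):
    """log p_beta(x|z) = sum_t log p_beta(x_t | x_<t, z).

    step_probs[t] maps each token to its probability at step t, assumed to be
    already conditioned on z and on the preceding tokens.
    """
    return sum(math.log(step_probs[t][tok]) for t, tok in enumerate(tokens))

print(gaussian_log_lik([1.0, 2.0], [0.9, 2.2], sigma=0.3))
print(autoregressive_log_lik([{"a": 0.5, "b": 0.5}, {"a": 0.25, "b": 0.75}], ["a", "b"]))
```

In the image case the log-likelihood is, up to a constant, the negative squared reconstruction error scaled by $1/(2\sigma^2)$.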
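In the original generator model, the top-down network $g_\beta$ maps the unimodal prior distribution $p_0$ to be close to the usually highly multi-modal data distribution. The prior model in (2) refines $p_0$ so that $g_\beta$ maps the prior model $p_\alpha$ to be closer to the data distribution. The prior model $p_\alpha$ does not need to be highly multi-modal because of the expressiveness of $g_\beta$.
The marginal distribution is $p_\theta(x) = \int p_\theta(x, z)\, dz = \int p_\alpha(z)\, p_\beta(x|z)\, dz$. The posterior distribution is $p_\theta(z|x) = p_\theta(x, z)/p_\theta(x) = p_\alpha(z)\, p_\beta(x|z)/p_\theta(x)$.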
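In the above model, we exponentially tilt $p_0(z)$. We can also exponentially tilt $p_0(x, z) = p_0(z)\, p_\beta(x|z)$ to $p_\theta(x, z) = \frac{1}{Z(\theta)} \exp(f_\alpha(x, z))\, p_0(x, z)$. Equivalently, we may also exponentially tilt $p_0(z, \epsilon) = p_0(z)\, p(\epsilon)$, as the mapping from $(z, \epsilon)$ to $(z, x)$ is a change of variable. This leads to an EBM in both the latent space and data space, which makes learning and sampling more complex. Therefore, we choose to only tilt $p_0(z)$ and leave $p_\beta(x|z)$ as a directed top-down generation model.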
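Suppose we observe training examples $(x_i, i = 1, \ldots, n)$. The log-likelihood function is
$$L(\theta) = \sum_{i=1}^{n} \log p_\theta(x_i). \qquad (5)$$
The learning gradient can be calculated according to
$$\nabla_\theta \log p_\theta(x) = E_{p_\theta(z|x)}[\nabla_\theta \log p_\theta(x, z)] = E_{p_\theta(z|x)}[\nabla_\theta (\log p_\alpha(z) + \log p_\beta(x|z))]. \qquad (6)$$
For the prior model, $\nabla_\alpha \log p_\alpha(z) = \nabla_\alpha f_\alpha(z) - E_{p_\alpha(z)}[\nabla_\alpha f_\alpha(z)]$. Thus the learning gradient for an example $x$ is
$$\delta_\alpha(x) = \nabla_\alpha \log p_\theta(x) = E_{p_\theta(z|x)}[\nabla_\alpha f_\alpha(z)] - E_{p_\alpha(z)}[\nabla_\alpha f_\alpha(z)]. \qquad (7)$$
The above equation has an empirical Bayes nature. $p_\theta(z|x)$ is based on the empirical observation $x$, while $p_\alpha(z)$ is the prior model. $\alpha$ is updated based on the difference between $z$ inferred from empirical observation $x$, and $z$ sampled from the current prior.
For the generation model,
$$\delta_\beta(x) = \nabla_\beta \log p_\theta(x) = E_{p_\theta(z|x)}[\nabla_\beta \log p_\beta(x|z)], \qquad (8)$$
where $\log p_\beta(x|z) = -\|x - g_\beta(z)\|^2/(2\sigma^2) + \text{constant}$ or $\sum_{t=1}^{T} \log p_\beta(x^{(t)} | x^{(1)}, \ldots, x^{(t-1)}, z)$ for image and text modeling respectively, which is about the reconstruction error.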
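The two learning gradients can be checked in closed form on a one-dimensional toy model, chosen here purely for illustration and not taken from the paper: $f_\alpha(z) = \alpha z$, $p_0 = N(0, 1)$ and $g_\beta(z) = \beta z$. Under these assumptions the prior is $N(\alpha, 1)$, the posterior is Gaussian, and the marginal is $N(\beta\alpha, \beta^2 + \sigma^2)$, so the posterior-minus-prior expectations can be compared against the directly differentiated marginal log-likelihood.

```python
def posterior_moments(x, alpha, beta, sigma):
    """Mean and variance of p(z|x) when p_alpha = N(alpha, 1) and x = beta*z + eps."""
    precision = 1.0 + beta ** 2 / sigma ** 2
    mean = (alpha + beta * x / sigma ** 2) / precision
    return mean, 1.0 / precision

def delta_alpha(x, alpha, beta, sigma):
    """E_posterior[z] - E_prior[z], since grad_alpha f_alpha(z) = z and E_prior[z] = alpha."""
    mean, _ = posterior_moments(x, alpha, beta, sigma)
    return mean - alpha

def delta_beta(x, alpha, beta, sigma):
    """E_posterior[(x - beta*z) * z] / sigma^2, the expected gradient of log p_beta(x|z)."""
    mean, var = posterior_moments(x, alpha, beta, sigma)
    return (x * mean - beta * (mean ** 2 + var)) / sigma ** 2

def marginal_score_alpha(x, alpha, beta, sigma):
    """d/d alpha of log N(x; beta*alpha, beta^2 + sigma^2)."""
    return beta * (x - beta * alpha) / (beta ** 2 + sigma ** 2)

def marginal_score_beta(x, alpha, beta, sigma):
    """d/d beta of log N(x; beta*alpha, beta^2 + sigma^2)."""
    s2 = beta ** 2 + sigma ** 2
    r = x - beta * alpha
    return -beta / s2 + r * alpha / s2 + r ** 2 * beta / s2 ** 2

args = (1.0, 0.5, 2.0, 1.0)  # x, alpha, beta, sigma
print(delta_alpha(*args), marginal_score_alpha(*args))
print(delta_beta(*args), marginal_score_beta(*args))
```

Each printed pair agrees, which is the content of equation (6) specialised to this toy model.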
Expectations in (7) and (8) require MCMC sampling of the prior model $p_\alpha(z)$ and the posterior distribution $p_\theta(z|x)$. We can use Langevin dynamics Langevin (1908); Zhu and Mumford (1998). For a target distribution $\pi(z)$, the dynamics iterates
$$z_{k+1} = z_k + s \nabla_z \log \pi(z_k) + \sqrt{2s}\, \epsilon_k, \qquad (9)$$
where $k$ indexes the time step of the Langevin dynamics, $s$ is a small step size, and $\epsilon_k \sim N(0, I_d)$ is the Gaussian white noise. $\pi(z)$ can be either $p_\alpha(z)$ or $p_\theta(z|x)$. In either case, $\nabla_z \log \pi(z)$ can be efficiently computed by back-propagation.
It is worth noting that VAE is not conveniently applicable here. Even if we have a tractable approximation to $p_\theta(z|x)$ in the form of an inference network, we still need to compute $E_{p_\alpha(z)}[\nabla_\alpha f_\alpha(z)]$, which requires MCMC.
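The Langevin update can be sketched in a few lines. In the model above the gradient of the log target would come from back-propagation through $f_\alpha$ and the generator; here, as a simplifying assumption, the target is a two-dimensional Gaussian with a hand-written gradient, and the names `langevin` and `grad_log_pi` are hypothetical.

```python
import math
import random

def langevin(z0, grad_log_pi, step_size, n_steps, rng):
    """Iterate z <- z + s * grad log pi(z) + sqrt(2s) * eps, with eps ~ N(0, I)."""
    z = list(z0)
    noise_scale = math.sqrt(2.0 * step_size)
    for _ in range(n_steps):
        grad = grad_log_pi(z)
        z = [zi + step_size * gi + noise_scale * rng.gauss(0.0, 1.0)
             for zi, gi in zip(z, grad)]
    return z

# Toy target pi = N(mu, I) in two dimensions, so grad log pi(z) = mu - z.
mu = [3.0, -1.0]

def grad_log_pi(z):
    return [m - zi for m, zi in zip(mu, z)]

rng = random.Random(0)
samples = [langevin([0.0, 0.0], grad_log_pi, 0.05, 200, rng) for _ in range(500)]
mean = [sum(s[i] for s in samples) / len(samples) for i in range(2)]
print(mean)
```

With enough steps the sample mean settles near `mu`; with a finite step size the sample variance stays slightly above the target variance, which is the usual discretization bias of unadjusted Langevin dynamics.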
Convergence of Langevin dynamics to the target distribution requires infinite steps with infinitesimal step size, which is impractical. We thus propose to use short-run MCMC Nijkamp et al. (2019a, 2020, c) for approximate sampling. This is in agreement with the philosophy of variational inference, which accepts the intractability of the target distribution and seeks to approximate it by a simpler distribution. The difference is that we adopt short-run Langevin dynamics instead of learning a separate network for approximation.
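The short-run Langevin dynamics is always initialized from the fixed initial distribution $p_0$, and only runs a fixed number $K$ of steps,
$$z_0 \sim p_0(z), \quad z_{k+1} = z_k + s \nabla_z \log \pi(z_k) + \sqrt{2s}\, \epsilon_k, \quad k = 0, \ldots, K-1. \qquad (10)$$
Denote the distribution of $z_K$ to be $\tilde{\pi}(z)$. Because of fixed $p_0(z)$ and fixed $K$ and $s$, the distribution $\tilde{\pi}$ is well defined. In this paper, we put the $\sim$ sign on top of the symbols to denote distributions or quantities produced by short-run MCMC, and for simplicity, we omit the dependence on $K$ and $s$ in notation. As shown in Cover and Thomas (2006), $\mathrm{KL}(\tilde{\pi} \| \pi)$ decreases to zero monotonically as $K \to \infty$.
Specifically, denote the distribution of $z_K$ to be $\tilde{p}_\alpha(z)$ if the target $\pi(z) = p_\alpha(z)$, and denote the distribution of $z_K$ to be $\tilde{p}_\theta(z|x)$ if $\pi(z) = p_\theta(z|x)$. We can then replace $p_\alpha(z)$ by $\tilde{p}_\alpha(z)$ and replace $p_\theta(z|x)$ by $\tilde{p}_\theta(z|x)$ in equations (7) and (8), so that the learning gradients in equations (7) and (8) are modified to
$$\tilde{\delta}_\alpha(x) = E_{\tilde{p}_\theta(z|x)}[\nabla_\alpha f_\alpha(z)] - E_{\tilde{p}_\alpha(z)}[\nabla_\alpha f_\alpha(z)], \qquad (11)$$
$$\tilde{\delta}_\beta(x) = E_{\tilde{p}_\theta(z|x)}[\nabla_\beta \log p_\beta(x|z)]. \qquad (12)$$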
The short-run MCMC sampling is always initialized from the same initial distribution $p_0(z)$, and always runs a fixed number $K$ of steps. This is the case for both training and testing stages, which share the same short-run MCMC sampling.
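How far the short-run distribution is from its target can be worked out exactly in a simple case. The sketch below assumes a one-dimensional target $N(\mu, 1)$ and initialization from $N(0, 1)$ (an illustrative choice, not the learned model), for which $z_K$ stays Gaussian and its mean and variance follow linear recursions. The function names are hypothetical.

```python
import math

def short_run_moments(mu, step_size, n_steps):
    """Mean and variance of z_K for target N(mu, 1), starting from z_0 ~ N(0, 1).

    Each update z <- z + s*(mu - z) + sqrt(2s)*eps is linear in z, so the
    distribution of z_K is Gaussian and its moments obey simple recursions.
    """
    mean, var = 0.0, 1.0
    for _ in range(n_steps):
        mean = mean + step_size * (mu - mean)
        var = (1.0 - step_size) ** 2 * var + 2.0 * step_size
    return mean, var

def kl_to_target(mean, var, mu):
    """KL( N(mean, var) || N(mu, 1) )."""
    return 0.5 * (var + (mean - mu) ** 2 - 1.0 - math.log(var))

for K in (0, 5, 20, 80):
    m, v = short_run_moments(mu=3.0, step_size=0.1, n_steps=K)
    print(K, round(m, 3), round(v, 3), round(kl_to_target(m, v, 3.0), 4))
```

In this toy case the divergence shrinks quickly as $K$ grows. Because the step size is finite, a small residual remains even for large $K$, which is one reason to analyze the learning algorithm at finite $K$ and non-zero $s$ rather than only in the limit.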
The learning and sampling algorithm is described in Algorithm 1.
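As a minimal sketch of the learning loop described above, consider a one-dimensional toy model chosen purely for illustration: $f_\alpha(z) = \alpha z$, $g_\beta(z) = \beta z$, and known $\sigma = 1$. Both the prior and the posterior are sampled by short-run Langevin dynamics started from $N(0, 1)$. The learning rate, step counts, plain gradient ascent (rather than any particular optimizer) and all names are assumptions, and one Monte Carlo sample per example is used for each expectation.

```python
import math
import random

SIGMA = 1.0  # assumed known noise level of the toy generator

def short_run(grad_log_pi, n_steps, step_size, rng):
    """Short-run Langevin: start at z_0 ~ N(0, 1), run a fixed number of steps."""
    z = rng.gauss(0.0, 1.0)
    noise_scale = math.sqrt(2.0 * step_size)
    for _ in range(n_steps):
        z += step_size * grad_log_pi(z) + noise_scale * rng.gauss(0.0, 1.0)
    return z

def train(data, n_iters=200, n_steps=30, step_size=0.1, lr=0.05, seed=0):
    rng = random.Random(seed)
    alpha, beta = 0.0, 1.0
    for _ in range(n_iters):
        d_alpha, d_beta = 0.0, 0.0
        for x in data:
            # Prior sampling: log p_alpha(z) = alpha*z - z^2/2 + const.
            z_prior = short_run(lambda z: alpha - z, n_steps, step_size, rng)
            # Posterior sampling: add the gradient of log p_beta(x|z) w.r.t. z.
            z_post = short_run(
                lambda z: alpha - z + beta * (x - beta * z) / SIGMA ** 2,
                n_steps, step_size, rng)
            # grad_alpha f_alpha(z) = z, so the prior update is a difference of samples.
            d_alpha += z_post - z_prior
            # grad_beta log p_beta(x|z) = (x - beta*z) * z / sigma^2.
            d_beta += (x - beta * z_post) * z_post / SIGMA ** 2
        alpha += lr * d_alpha / len(data)
        beta += lr * d_beta / len(data)
    return alpha, beta

# Synthetic data from the same toy family with alpha = 2 and beta = 1.5.
data_rng = random.Random(1)
data = [1.5 * data_rng.gauss(2.0, 1.0) + data_rng.gauss(0.0, SIGMA) for _ in range(50)]
alpha, beta = train(data)
print(alpha, beta, alpha * beta, sum(data) / len(data))
```

In this toy family the marginal mean of $x$ is $\alpha\beta$, so comparing `alpha * beta` with the data mean gives a rough sanity check. The learned values are not expected to match the generating parameters exactly, since short-run sampling perturbs the maximum likelihood solution.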
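The learning algorithm based on short-run MCMC sampling in Algorithm 1 is a modification or perturbation of maximum likelihood learning, where we replace $p_\alpha(z)$ and $p_\theta(z|x)$ by $\tilde{p}_\alpha(z)$ and $\tilde{p}_\theta(z|x)$ respectively. For theoretical underpinning, we should also understand this perturbation in terms of objective function and estimating equation.
In terms of objective function, define the Kullback-Leibler divergence $\mathrm{KL}(p(x) \| q(x)) = E_p[\log(p(x)/q(x))]$. At iteration $t$, with fixed $\theta_t = (\alpha_t, \beta_t)$, consider the following perturbation of the log-likelihood function of $\theta$ for an observation $x$,
$$\tilde{l}_\theta(x) = \log p_\theta(x) - \mathrm{KL}(\tilde{p}_{\theta_t}(z|x) \| p_\theta(z|x)) + \mathrm{KL}(\tilde{p}_{\alpha_t}(z) \| p_\alpha(z)). \qquad (13)$$
The above is a function of $\theta$, while $\theta_t$ is fixed. Then
$$\tilde{\delta}_\alpha(x) = \frac{\partial}{\partial \alpha} \tilde{l}_\theta(x), \quad \tilde{\delta}_\beta(x) = \frac{\partial}{\partial \beta} \tilde{l}_\theta(x), \qquad (14)$$
where the derivative is taken at $\theta_t$. See Appendix 6.8 for details. Thus the updating rule of Algorithm 1 follows the stochastic gradient (i.e., Monte Carlo approximation of the gradient) of a perturbation of the log-likelihood ($\exp(\tilde{l}_\theta(x))$ above is not necessarily a normalized density function any more). Equivalently, because $\theta_t$ is fixed, we can drop the entropies of $\tilde{p}_{\theta_t}(z|x)$ and $\tilde{p}_{\alpha_t}(z)$ in the above Kullback-Leibler divergences, hence the updating rule follows the stochastic gradient of
$$Q(\theta) = L(\theta) + \sum_{i=1}^{n} \left[ E_{\tilde{p}_{\theta_t}(z_i|x_i)}[\log p_\theta(z_i|x_i)] - E_{\tilde{p}_{\alpha_t}(z)}[\log p_\alpha(z)] \right], \qquad (15)$$
where $L(\theta)$ is the total log-likelihood defined in equation (5), and the gradient is taken at $\theta_t$.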
In equation (13), the first Kullback-Leibler term is related to variational inference, although we do not learn a separate inference model. The second Kullback-Leibler term is related to contrastive divergence Tieleman (2008), except that $\tilde{p}_{\alpha_t}(z)$ is initialized from $p_0(z)$. It serves to cancel the intractable $\log Z(\alpha)$ term.
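In terms of estimating equation, the stochastic gradient descent in Algorithm 1 is a Robbins-Monro stochastic approximation algorithm Robbins and Monro (1951) that solves the following estimating equation:
$$\frac{1}{n} \sum_{i=1}^{n} \tilde{\delta}_\alpha(x_i) = \frac{1}{n} \sum_{i=1}^{n} E_{\tilde{p}_\theta(z_i|x_i)}[\nabla_\alpha f_\alpha(z_i)] - E_{\tilde{p}_\alpha(z)}[\nabla_\alpha f_\alpha(z)] = 0, \qquad (16)$$
$$\frac{1}{n} \sum_{i=1}^{n} \tilde{\delta}_\beta(x_i) = \frac{1}{n} \sum_{i=1}^{n} E_{\tilde{p}_\theta(z_i|x_i)}[\nabla_\beta \log p_\beta(x_i|z_i)] = 0. \qquad (17)$$
The solution to the above estimating equation defines an estimator of the parameters. Algorithm 1 converges to this estimator under the usual regularity conditions of Robbins-Monro Robbins and Monro (1951). If we replace $\tilde{p}_\alpha(z)$ by $p_\alpha(z)$, and $\tilde{p}_\theta(z|x)$ by $p_\theta(z|x)$, then the above estimating equation is the maximum likelihood estimating equation.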
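The Robbins-Monro scheme itself can be illustrated on a deliberately simple estimating equation, $E[\xi - \theta] = 0$ with $\xi \sim N(2, 1)$, whose solution is the mean of $\xi$. This toy equation, the step sizes $\gamma_t = 1/t$, and the names below are assumptions for illustration and say nothing about the step-size schedule used to train the model.

```python
import random

def robbins_monro(noisy_h, theta0, n_iters):
    """Solve E[h(theta, xi)] = 0 by theta <- theta + gamma_t * h(theta, xi_t), gamma_t = 1/t."""
    theta = theta0
    for t in range(1, n_iters + 1):
        theta = theta + (1.0 / t) * noisy_h(theta)
    return theta

rng = random.Random(0)
draws = []

def noisy_h(theta):
    """One noisy evaluation of h(theta) = E[xi] - theta."""
    xi = rng.gauss(2.0, 1.0)
    draws.append(xi)
    return xi - theta

theta_hat = robbins_monro(noisy_h, theta0=10.0, n_iters=1000)
print(theta_hat, sum(draws) / len(draws))
```

With $\gamma_t = 1/t$ the iterate coincides with the running average of the draws, so the starting value is forgotten after the first step and the estimate converges to the root of the estimating equation.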
The above theoretical understanding in terms of objective function and estimating equation is more general than that of maximum likelihood, which is a special case where the number of steps $K \to \infty$ and the step size $s \to 0$ in the Langevin dynamics in equation (10). Our theoretical understanding is clearly more relevant in practice where we can only afford finite $K$ with non-zero $s$.
As to the step size $s$, we currently treat it as a tuning parameter. $s$ can be more formally optimized by maximizing the objective in equation (15) or maximizing the average of the perturbed log-likelihood defined by equation (13). We may also allow different step sizes for different steps in the short-run Langevin dynamics. We leave this issue to future investigations.
We present a set of experiments which highlight the effectiveness of our proposed model with (1) excellent synthesis for both visual and textual data outperforming state-of-the-art baselines, (2) high expressiveness of the learned prior model for both data modalities, and (3) strong performance in anomaly detection.
For image data, we include SVHN Netzer et al. (2011), CelebA Liu et al. (2015), and CIFAR-10 Krizhevsky et al. For text data, we include PTB Marcus et al. (1993), Yahoo Yang et al. (2017), and SNLI Bowman et al. (2015). We refer to Appendix 7.1 for details.
We evaluate the quality of the generated and reconstructed images. If the model is well-learned, the latent EBM will fit the generator posterior which in turn renders realistic generated samples as well as faithful reconstructions. We compare our model with VAE Kingma and Welling (2014) and SRI Nijkamp et al. (2019b) which assume a fixed Gaussian prior distribution for the latent vector and two recent strong VAE variants, 2sVAE Dai and Wipf (2019a) and RAE Ghosh et al. (2020), whose prior distributions are learned with posterior samples in a second stage. We also compare with multi-layer generator (i.e., 5 layers of latent vectors) model Nijkamp et al. (2019b) which admits a powerful learned prior on the bottom layer of latent vector. We follow the protocol as in Nijkamp et al. (2019b).
Generation. The generator network in our framework is well-learned to generate samples that are realistic and share visual similarities with the training data. The qualitative results are shown in Figure 2. We further evaluate our model quantitatively by using Fréchet Inception Distance (FID) Lucic et al. (2017) in Table 1. It can be seen that our model achieves superior generation performance compared to the listed baseline models.
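For reference, the Fréchet distance underlying FID compares Gaussian fits to real and generated features, $\|\mu_r - \mu_g\|^2 + \mathrm{Tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2})$. The sketch below is a simplified scalar-feature version, for which the formula reduces to $(\mu_r - \mu_g)^2 + (\sigma_r - \sigma_g)^2$. It is an illustration only: the actual FID uses high-dimensional Inception features and a matrix square root, and the function name is hypothetical.

```python
import math

def frechet_distance_1d(real_feats, gen_feats):
    """Squared Fréchet distance between Gaussian fits to two lists of scalar features."""
    def mean_std(values):
        m = sum(values) / len(values)
        var = sum((v - m) ** 2 for v in values) / len(values)
        return m, math.sqrt(var)

    m_r, s_r = mean_std(real_feats)
    m_g, s_g = mean_std(gen_feats)
    return (m_r - m_g) ** 2 + (s_r - s_g) ** 2

print(frechet_distance_1d([0.0, 2.0], [1.0, 3.0]))
```

Lower values indicate that the generated feature distribution is closer to the real one in both location and spread.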
Reconstruction. We then evaluate the accuracy of the posterior Langevin process by testing image reconstruction. A well-formed posterior Langevin process should not only help to learn the latent EBM model but also match the true posterior of the generator model. We quantitatively compare reconstructions of test images with the above baseline models in terms of mean square error (MSE). As shown in Table 1, our proposed model achieves not only high generation quality but also accurate reconstructions.
We compare our model to related baselines, SA-VAE Kim et al. (2018), FB-VAE Li et al. (2019), and ARAE Zhao et al. (2018). SA-VAE involves optimizing posterior samples with gradient descent guided by ELBO, resembling the short-run dynamics in our model. FB-VAE is the SOTA VAE for text modeling. While SA-VAE and FB-VAE assume a fixed Gaussian prior, ARAE learns a latent sample generator as an implicit prior distribution, which is adversarially trained together with a discriminator. To evaluate the quality of the generated samples, we follow Zhao et al. (2018); Cífka et al. (2018) and use Forward Perplexity (FPPL) and Reverse Perplexity (RPPL). FPPL is the perplexity of the generated samples evaluated under a language model trained with real data and measures the fluency of the synthesized sentences. RPPL is the perplexity of real data (the test data partition) computed under a language model trained with the model-generated samples. Prior work employs it to measure the distributional coverage of a learned model, $p_\theta(x)$ in our case, since a model with a mode-collapsing issue results in a high RPPL. FPPL and RPPL are displayed in Table 2. Our model outperforms all the baselines on the two metrics, demonstrating the high fluency and diversity of the samples from our model. We also evaluate the reconstruction of our model against the baselines using negative log-likelihood (NLL). Our model performs similarly to FB-VAE and ARAE, while all three outperform SA-VAE.
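Both FPPL and RPPL are ordinary perplexities and differ only in which corpus trains the language model and which corpus is scored. A minimal sketch of the perplexity computation, assuming per-token natural-log probabilities are already available from some language model (the function name and input layout are hypothetical):

```python
import math

def perplexity(token_log_probs):
    """exp of the negative average per-token log-probability over a corpus.

    token_log_probs is a list of sentences, each a list of natural-log
    probabilities assigned by a language model to the tokens of that sentence.
    """
    n_tokens = sum(len(sentence) for sentence in token_log_probs)
    total = sum(sum(sentence) for sentence in token_log_probs)
    return math.exp(-total / n_tokens)

# FPPL: score generated sentences under a model trained on real data.
# RPPL: score real test sentences under a model trained on generated data.
uniform = [[math.log(0.25)] * 3, [math.log(0.25)] * 5]
print(perplexity(uniform))
```

A model that assigns every token probability 1/4 has perplexity 4, which matches the reading of perplexity as an effective branching factor.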
Short-run chains. We examine the exponential tilting of the reference prior $p_0(z)$ through Langevin samples initialized from $p_0(z)$ with target distribution $p_\alpha(z)$. As the reference distribution $p_0(z)$ is in the form of an isotropic Gaussian, we expect the energy-based correction $f_\alpha$ to tilt $p_0$ into an irregular shape. In particular, learning equation 62 may form shallow local modes for $p_\alpha(z)$. Therefore, the trajectory of a Markov chain initialized from the reference distribution $p_0(z)$ with well-learned target $p_\alpha(z)$ should depict the transition towards synthesized examples of high quality while the energy fluctuates around some constant. Figure 3 and Table 3 depict such transitions for image and textual data, respectively, which are both based on models trained with a fixed number of short-run steps. For image data, the quality of synthesis improves significantly with an increasing number of steps. For textual data, there is an enhancement in semantics and syntax along the chain, which is especially clear from step 0 to 40 (see Table 3).
Long-run chains. While the learning Algorithm 1 uses short-run MCMC with $K$ steps to sample from the target distribution $p_\alpha(z)$, a well-learned $p_\alpha(z)$ should allow for Markov chains with realistic synthesis for $K' \gg K$ steps. We demonstrate such a long-run Markov chain in Figure 4. The long-run chain samples in the data space are reasonable and do not exhibit the oversaturating issue of the long-run chain samples of recent EBM in the data space (see oversaturating examples in Figure 3 in Nijkamp et al. (2020)).
We evaluate our model through the lens of anomaly detection. If the generator and EBM are well learned, then the posterior $p_\theta(z|x)$ would form a discriminative latent space that has separated probability density for normal and anomalous data, respectively. Samples from such latent space can then be used as discriminative features to detect anomalies. We perform posterior sampling on the learned model to obtain the latent samples, and use the unnormalized log-posterior $\log p_\theta(x, z)$ as our decision function.
Following the protocol as in Kumar et al. (2019); Zenati et al. (2018), we make each digit class an anomaly and consider the remaining 9 digits as normal examples. Our model is trained with only normal data and tested with both normal and anomalous data. We compare with the BiGAN-based anomaly detection Zenati et al. (2018), MEG Kumar et al. (2019) and VAE using area under the precision-recall curve (AUPRC) as in Zenati et al. (2018). Table 4 shows the results.
| MEG | 0.281 ± 0.035 | 0.401 ± 0.061 | 0.402 ± 0.062 | 0.290 ± 0.040 | 0.342 ± 0.034 |
| BiGAN- | 0.287 ± 0.023 | 0.443 ± 0.029 | 0.514 ± 0.029 | 0.347 ± 0.017 | 0.307 ± 0.028 |
| Ours | 0.336 ± 0.008 | 0.630 ± 0.017 | 0.619 ± 0.013 | 0.463 ± 0.009 | 0.413 ± 0.010 |
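The AUPRC metric used in this protocol can be approximated by average precision, which sums the precision at every rank where an anomaly is retrieved. The sketch below assumes label 1 marks an anomaly and that a higher score means more anomalous (for instance, the negated decision function). These conventions and the function name are assumptions for illustration, and ties between scores are not treated specially.

```python
def average_precision(scores, labels):
    """Step-wise area under the precision-recall curve (average precision)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(labels)
    true_pos, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            true_pos += 1
            ap += true_pos / rank  # precision at this recall step
    return ap / n_pos

scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
print(average_precision(scores, labels))
```

A perfect ranking, with every anomaly scored above every normal example, gives a value of 1, and the value drops as normal examples are ranked ahead of anomalies.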
is the FID score reported in the main text and compared to other baseline models. It is obtained from the model with the architecture and hyperparameters specified in Table 8 and Table 9 which serve as the reference configuration for the ablation study.
Fixed prior. We examine the expressivity endowed with the EBM prior by comparing it to models with a fixed isotropic Gaussian prior. The results are displayed in Table 5. The model with an EBM prior clearly outperforms the model with a fixed Gaussian prior and the same generator as the reference model. The fixed Gaussian models exhibit an enhancement in performance as the generator complexity increases. They however still have an inferior performance compared to the model with an EBM prior even when the fixed Gaussian prior model has a generator with four times more parameters than that of the reference model.
| Latent EBM Prior | 29.44 |
| generator with 2 times as many parameters | 41.10 |
| generator with 4 times as many parameters | 39.50 |
MCMC steps. We also study how the number of short-run MCMC steps for prior inference and for posterior inference affects the quality of synthesis. The left panel of Table 6 shows the results for prior inference and the right panel for posterior inference. As the number of MCMC steps increases, we observe improved quality of synthesis in terms of FID.
Prior EBM and generator complexity. Table 7 displays the FID scores as a function of the number of hidden features of the prior EBM (nef) and the factor of the number of channels of the generator (ngf, also see Table 9). In general, enhanced model complexity leads to improved generation.
We note that our method involving MCMC sampling is more computationally costly than methods with amortized inference such as VAE, which however suffer from issues like inaccurate inference and the mismatch between the prior and aggregate posterior. Several works involve MCMC sampling attempting to improve VAE by either enhancing the posterior inference (SA-VAE Kim et al. (2018)) or constructing a flexible prior with rejection sampling (LARS Bauer and Mnih (2019)) within the original VAE framework. In contrast, we adopt a maximum likelihood learning approach with short-run MCMC sampling and follow the philosophy of empirical Bayes. Our approach trades feasible computational cost for expressive prior and simple and accurate inference. Consider SVHN as an example. On a single NVIDIA 1080Ti, training our model to convergence takes about 4 times as long as VAE training. At this feasible cost, our method improves over strong baselines such as 2sVAE Dai and Wipf (2019b) and RAE Ghosh et al. (2020) on image and text modeling and anomaly detection.
We have also explored avenues to improve training speed and found that a PyTorch extension, NVIDIA Apex (https://github.com/NVIDIA/apex), is able to improve the training time of our model by a factor of 2.5. We test our method with Apex training on a larger scale dataset, CelebA. The learned model is able to synthesize examples with high fidelity (see Figure 1 for examples).
This paper proposes a generalization of the generator model, where the latent vector follows a latent space EBM, which is a refinement or correction of the independent Gaussian or uniform noise prior in the original generator model. We adopt a simple maximum likelihood framework for learning, and develop a practical modification of the maximum likelihood learning algorithm based on short-run MCMC sampling from the prior and posterior distributions of the latent vector. We also provide a theoretical underpinning of the resulting algorithm as a perturbation of the maximum likelihood learning in terms of objective function and estimating equation. Our method combines the best of both top-down generative model and undirected EBM.
EBM has many applications. However, its soundness and its power are limited by the difficulty with MCMC sampling. By moving from data space to latent space, MCMC-based learning of EBM becomes sound and feasible, and we may unleash the power of EBM in the latent space for many applications.
In this section, we shall derive most of the equations in the main text. We take a step by step approach, starting from simple identities or results, and gradually reaching the main results. Our derivations are unconventional, but they pertain more to our model and learning method.
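Let $x \sim p_\theta(x)$. A useful identity is
$$E_\theta[\nabla_\theta \log p_\theta(x)] = 0, \qquad (18)$$
where $E_\theta$ (or $E_{p_\theta}$) is the expectation with respect to $p_\theta$.
The proof is one liner:
$$E_\theta[\nabla_\theta \log p_\theta(x)] = \int [\nabla_\theta \log p_\theta(x)]\, p_\theta(x)\, dx = \int \nabla_\theta p_\theta(x)\, dx = \nabla_\theta \int p_\theta(x)\, dx = \nabla_\theta 1 = 0. \qquad (19)$$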
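The simple identity (18) also underlies the consistency of MLE. Suppose we observe $(x_i, i = 1, \ldots, n) \sim p_{\theta_{\mathrm{true}}}(x)$ independently, where $\theta_{\mathrm{true}}$ is the true value of $\theta$. The log-likelihood is
$$L(\theta) = \sum_{i=1}^{n} \log p_\theta(x_i). \qquad (20)$$
The maximum likelihood estimating equation is
$$\frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \log p_\theta(x_i) = 0. \qquad (21)$$
According to the law of large numbers, as $n \to \infty$, the above estimating equation converges to
$$E_{\theta_{\mathrm{true}}}[\nabla_\theta \log p_\theta(x)] = 0, \qquad (22)$$
where $\theta$ is the unknown value to be solved, while $\theta_{\mathrm{true}}$ is fixed. According to the simple identity (18), $\theta = \theta_{\mathrm{true}}$ is the solution to the above estimating equation (22), no matter what $\theta_{\mathrm{true}}$ is. Thus with regularity conditions, such as identifiability of the model, the MLE converges to $\theta_{\mathrm{true}}$ in probability.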
The optimality of the maximum likelihood estimating equation among all the asymptotically unbiased estimating equations can be established based on a further generalization of the simple identity (18).
We shall justify our learning method with short run MCMC in terms of an estimating equation, which is a perturbation of the maximum likelihood estimating equation.
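Recall that $p_\theta(x, z) = p_\alpha(z)\, p_\beta(x|z)$, where $\theta = (\alpha, \beta)$. The learning gradient for an observation $x$ is as follows:
$$\nabla_\theta \log p_\theta(x) = E_{p_\theta(z|x)}[\nabla_\theta \log p_\theta(x, z)] = E_{p_\theta(z|x)}[\nabla_\theta (\log p_\alpha(z) + \log p_\beta(x|z))].$$
The above identity is a simple consequence of the simple identity (18).
$$E_{p_\theta(z|x)}[\nabla_\theta \log p_\theta(x, z)] = E_{p_\theta(z|x)}[\nabla_\theta \log p_\theta(z|x)] + E_{p_\theta(z|x)}[\nabla_\theta \log p_\theta(x)] = 0 + \nabla_\theta \log p_\theta(x),$$
because of the fact that $E_{p_\theta(z|x)}[\nabla_\theta \log p_\theta(z|x)] = 0$ according to the simple identity (18), while $E_{p_\theta(z|x)}[\nabla_\theta \log p_\theta(x)] = \nabla_\theta \log p_\theta(x)$ because what is inside the expectation only depends on $x$, but does not depend on $z$.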
We shall provide a theoretical understanding of the learning method with short run MCMC in terms of Kullback-Leibler divergences. We start from some simple results.
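The simple identity (18) also follows from Kullback-Leibler divergence. Consider
$$D(\theta) = \mathrm{KL}(p_{\theta_*}(x) \| p_\theta(x)) = E_{\theta_*}[\log p_{\theta_*}(x)] - E_{\theta_*}[\log p_\theta(x)]$$
as a function of $\theta$ with $\theta_*$ fixed. Suppose the model $p_\theta$ is identifiable, then $D(\theta)$ achieves its minimum 0 at $\theta = \theta_*$, thus $\nabla_\theta D(\theta) = 0$ at $\theta = \theta_*$. Meanwhile,
$$\nabla_\theta D(\theta) = -E_{\theta_*}[\nabla_\theta \log p_\theta(x)], \quad \text{so that} \quad E_{\theta_*}[\nabla_\theta \log p_{\theta_*}(x)] = 0.$$
Since $\theta_*$ is arbitrary in the above derivation, we can replace it by a generic $\theta$, i.e.,
$$E_\theta[\nabla_\theta \log p_\theta(x)] = 0,$$
which is the simple identity (18).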
As a notational convention, for a function $f(\theta)$, we write $f'(\theta_t) = \nabla_\theta f(\theta)|_{\theta = \theta_t}$, i.e., the derivative of $f(\theta)$ at $\theta_t$.
We now re-derive MLE learning gradient in terms of perturbation of log-likelihood by Kullback-Leibler divergence terms. Then the learning method with short run MCMC can be easily understood.
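At iteration $t$, fixing $\theta_t = (\alpha_t, \beta_t)$, we want to calculate the gradient of the log-likelihood function for an observation $x$, $\log p_\theta(x)$, at $\theta = \theta_t$. Consider the following perturbation of the log-likelihood
$$l_\theta(x) = \log p_\theta(x) - \mathrm{KL}(p_{\theta_t}(z|x) \| p_\theta(z|x)) + \mathrm{KL}(p_{\alpha_t}(z) \| p_\alpha(z)).$$
In the above, as a function of $\theta$, with $\theta_t$ fixed, $\mathrm{KL}(p_{\theta_t}(z|x) \| p_\theta(z|x))$ is minimized at $\theta = \theta_t$, thus its derivative at $\theta_t$ is 0. As a function of $\alpha$, with $\alpha_t$ fixed, $\mathrm{KL}(p_{\alpha_t}(z) \| p_\alpha(z))$ is minimized at $\alpha = \alpha_t$, thus its derivative at $\alpha_t$ is 0. Thus
$$\nabla_\theta \log p_\theta(x)\big|_{\theta = \theta_t} = \nabla_\theta l_\theta(x)\big|_{\theta = \theta_t}.$$
We now unpack $l_\theta(x)$ so that we can obtain its derivative at $\theta_t$.
$$l_\theta(x) = \log p_\theta(x) + E_{p_{\theta_t}(z|x)}[\log p_\theta(z|x)] - E_{p_{\alpha_t}(z)}[\log p_\alpha(z)] + c$$
$$= E_{p_{\theta_t}(z|x)}[\log p_\alpha(z) + \log p_\beta(x|z)] - E_{p_{\alpha_t}(z)}[\log p_\alpha(z)] + c$$
$$= E_{p_{\theta_t}(z|x)}[f_\alpha(z)] - E_{p_{\alpha_t}(z)}[f_\alpha(z)] + E_{p_{\theta_t}(z|x)}[\log p_\beta(x|z)] + c + c',$$
where the $\log Z(\alpha)$ term gets canceled, and
$$c = -E_{p_{\theta_t}(z|x)}[\log p_{\theta_t}(z|x)] + E_{p_{\alpha_t}(z)}[\log p_{\alpha_t}(z)], \quad c' = E_{p_{\theta_t}(z|x)}[\log p_0(z)] - E_{p_{\alpha_t}(z)}[\log p_0(z)]$$
do not depend on $\theta$. $c$ consists of two entropy terms. Now taking derivative at $\theta_t$, we have
$$\nabla_\alpha l_\theta(x)\big|_{\theta_t} = E_{p_{\theta_t}(z|x)}[\nabla_\alpha f_{\alpha_t}(z)] - E_{p_{\alpha_t}(z)}[\nabla_\alpha f_{\alpha_t}(z)], \quad \nabla_\beta l_\theta(x)\big|_{\theta_t} = E_{p_{\theta_t}(z|x)}[\nabla_\beta \log p_{\beta_t}(x|z)].$$
Averaging over the observed examples $(x_i, i = 1, \ldots, n)$ leads to MLE learning gradient.
In the above, we calculate the gradient of $\log p_\theta(x)$ at $\theta_t$. Since $\theta_t$ is arbitrary in the above derivation, if we replace $\theta_t$ by a generic $\theta$, we get the gradient of $\log p_\theta(x)$ at a generic $\theta$, i.e.,
$$\delta_\alpha(x) = E_{p_\theta(z|x)}[\nabla_\alpha f_\alpha(z)] - E_{p_\alpha(z)}[\nabla_\alpha f_\alpha(z)], \quad \delta_\beta(x) = E_{p_\theta(z|x)}[\nabla_\beta \log p_\beta(x|z)].$$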
The above calculations are related to the EM algorithm and the learning of energy-based model.
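In the EM algorithm Dempster et al. (1977), the complete-data log-likelihood $Q(\theta|\theta_t)$ serves as a surrogate for the observed-data log-likelihood $\log p_\theta(x)$, where
$$Q(\theta|\theta_t) = E_{p_{\theta_t}(z|x)}[\log p_\theta(x, z)],$$
and $\log p_\theta(x) - \mathrm{KL}(p_{\theta_t}(z|x) \| p_\theta(z|x))$ equals $Q(\theta|\theta_t)$ plus the entropy of $p_{\theta_t}(z|x)$, where this quantity is a lower-bound of $\log p_\theta(x)$ or minorizes the latter. The lower bound and $\log p_\theta(x)$ touch each other at $\theta_t$, and they are co-tangent at $\theta_t$. Thus the derivative of $\log p_\theta(x)$ at $\theta_t$ is the same as the derivative of $Q(\theta|\theta_t)$ at $\theta = \theta_t$.
In EBM, $\mathrm{KL}(p_{\alpha_t}(z) \| p_\alpha(z))$ serves to cancel the $\log Z(\alpha)$ term in the EBM prior, and is related to the second divergence term in contrastive divergence.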