1 Introduction
One of the core research problems in machine learning is learning with probabilistic models, which can be broadly classified into two classes: directed and undirected graphical models [1, 2]. Apart from the topological difference, an easy way to tell an undirected model from a directed model is that an undirected model involves the normalizing constant (also called the partition function in physics), while the directed model is self-normalized. Recently, significant progress has been made on learning with deep generative models (DGMs), which generally refer to probabilistic models with multiple layers of stochastic or deterministic variables. A number of deep directed models have emerged, such as variational auto-encoders (VAEs)
[3], generative adversarial networks (GANs) [4], and so on. In contrast, undirected models (also known as random fields [2] or energy-based models [5]) have received less attention, with slower progress. This is presumably because fitting undirected models is more challenging than fitting directed models: in general, calculating the log-likelihood and its gradient is analytically intractable, because it involves evaluating the normalizing constant and, respectively, the expectation with respect to (w.r.t.) the model distribution.

In this paper, we are interested in developing deep undirected models. Generally, an undirected model, exchangeably termed a random field (RF), defines a probability distribution of the form $p_\theta(x) = \frac{1}{Z(\theta)} \exp\left(u_\theta(x)\right)$, where $u_\theta(x)$ is usually called the potential function over observation $x$ with parameter $\theta$, and $Z(\theta) = \int \exp\left(u_\theta(x)\right) dx$ is the normalizing constant. In most existing random fields, the potential function is defined as a linear function, e.g. $u_\theta(x) = \theta^T f(x)$, where $f(x)$ is a vector of features (usually handcrafted) and $\theta$ is the corresponding parameter vector. Such RFs are known as log-linear models [2] or exponential families [6]. Note that an attractive property of RF modeling is that one is free to define the potential function in any sensible way, giving it much flexibility. In this paper, we aim to advance the learning of neural random fields, which use neural networks with multiple deterministic layers to define the potential function $u_\theta(x)$. With the potential benefit of exploiting the expressive power of deep neural networks, this type of RF has appeared several times in different contexts with different model definitions, called deep energy models (DEMs) [7, 8], descriptive models [9], generative ConvNets [10], and neural random field language models [11]. For convenience, we refer to such models as neural random fields (NRFs) in general. Conceptually, compared to traditional log-linear RFs, if we could successfully train such NRFs, we could jointly learn the features and the feature weights, which is highly desirable. However, learning NRFs presents a much greater challenge.
An important method for maximum likelihood (ML) learning of random fields is stochastic maximum likelihood (SML) [12], which approximates the model expectations by Monte Carlo sampling when calculating the gradient. A recent line of progress in learning NRFs, as studied in [8, 9, 11, 13], is to pair the target random field $p_\theta$ with an auxiliary directed generative model (often called a generator) $q_\phi$, parameterized by $\phi$, which approximates sampling from the target random field. Learning is performed by maximizing the log-likelihood of the training data under $p_\theta$, or some bound of the log-likelihood, and simultaneously minimizing some divergence between the target random field $p_\theta$ and the auxiliary generator $q_\phi$. Different learning algorithms differ in the objective functions used in the joint training of $\theta$ and $\phi$, and thus have different computational and statistical properties (partly illustrated in Figure 3). For example, minimizing the exclusive divergence $KL(q_\phi \| p_\theta)$ w.r.t. $\phi$, as employed in [8], involves the intractable entropy term of $q_\phi$ and tends to force the generator to seek modes, yielding missing modes. There are also other factors, e.g. modeling discrete or continuous data, and different model choices for the target RF and the generator, which lead to different learning algorithms. We leave detailed comparison and connection of our approach with existing studies to Section 4 (Related work).
In this paper, we propose to use inclusive-divergence minimized auxiliary generators (Section 3.2). Particularly for continuous data (e.g. images), we propose to use stochastic gradient samplers, including but not limited to SGLD (stochastic gradient Langevin dynamics) and SGHMC (stochastic gradient Hamiltonian Monte Carlo), to exploit noisy gradients in NRF model sampling (Section 3.3). Within our formulation, SGHMC is successfully developed, which improves over SGLD in learning NRFs. Notably, this SGLD/SGHMC development is unlike previous applications ([14, 15]), which mainly simulate Bayesian posterior samples in large-scale Bayesian inference, though we use the same terms.
Featuring the development of inclusive auxiliary generators and stochastic gradient sampling, this new approach to learning NRFs for continuous data, abbreviated as the inclusive-NRF approach, is the main contribution of this paper. Conceptually, minimizing the inclusive divergence $KL(p_\theta \| q_\phi)$ w.r.t. $\phi$ avoids the annoying entropy term and tends to drive the generator to cover the modes of the target density $p_\theta$. The SGLD/SGHMC sampling further pushes the samples towards the modes of $p_\theta$. Presumably, this helps to produce Markov chains that mix fast between modes and facilitates model learning.
As demonstrations of how the new approach can be flexibly and effectively used, specific inclusive-NRF models are developed and thoroughly evaluated on a number of tasks: unsupervised/supervised image generation, semi-supervised classification, and anomaly detection. The proposed models consistently achieve strong experimental results on all these tasks compared to state-of-the-art methods, summarized as follows:

Inclusive-NRFs achieve state-of-the-art sample generation quality, measured by both Inception Score (IS) and Fréchet Inception Distance (FID). On CIFAR-10, we obtain unsupervised IS 8.28 (FID 20.9) and supervised IS 9.06 (FID 18.1), both using unconditional generation.

Semi-supervised inclusive-NRFs show strong classification results on par with state-of-the-art DGM-based semi-supervised learning (SSL) methods, and simultaneously achieve superior generation, on the widely benchmarked datasets MNIST, SVHN and CIFAR-10.

By directly using the potential function for sample evaluation, inclusive-NRFs achieve state-of-the-art performance in anomaly detection on the widely benchmarked datasets KDDCUP, MNIST, and CIFAR-10. This shows that, unlike GANs, the new approach can provide informative density estimates, in addition to superior sample generation.
The remainder of this paper is organized as follows. After presenting some background on random fields in Section 2, we introduce the inclusive-NRF approach in Section 3. In Section 4, we discuss related work. Extensive experimental evaluations are given in Section 5. We conclude the paper with a discussion in Section 6.
2 Background on Random Fields
Undirected models, exchangeably termed random fields, form one of the two main classes of probabilistic graphical models [1, 2]. In defining the joint distribution, directed models use conditional probability functions, with the directionality given by the conditioning relationship, whereas undirected models use unnormalized potential functions and are more suitable for capturing interactions among variables, especially when the directionality of a relationship cannot be clearly defined (e.g. between neighboring image pixels).
A random field (RF) defines a probability distribution for a collection of random variables $x$ with parameter $\theta$ in the form:

$$p_\theta(x) = \frac{1}{Z(\theta)} \exp\left(u_\theta(x)\right), \quad (1)$$

where $Z(\theta) = \int \exp\left(u_\theta(x)\right) dx$ is the normalizing constant and $u_\theta(x)$ is called the potential function^1 (^1 Negating the potential function defines the energy function.), which assigns a scalar value to each configuration of $x$. High-probability configurations correspond to high-potential/low-energy configurations.
There is an extensive literature devoted to maximum likelihood (ML) learning of random fields, as briefly reviewed in [16]. It is usually intractable to maximize the data log-likelihood $\log p_\theta(\tilde{x})$ for an observation $\tilde{x}$, since the gradient involves an expectation w.r.t. the model distribution, as shown below:

$$\nabla_\theta \log p_\theta(\tilde{x}) = \nabla_\theta u_\theta(\tilde{x}) - E_{p_\theta(x)}\left[\nabla_\theta u_\theta(x)\right]. \quad (2)$$
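As a concreteness check (not from the paper), the gradient identity of Eq. (2) can be verified numerically on a toy log-linear RF over binary vectors, where $Z(\theta)$ is small enough to enumerate exactly; the feature choice $f(x) = x$ and all parameter values below are illustrative:

```python
import numpy as np

# Toy log-linear RF over binary vectors x in {0,1}^3:
#   u_theta(x) = theta . f(x), with illustrative features f(x) = x.
# Because the space is tiny we can enumerate Z(theta) exactly and verify
# Eq. (2): grad log p_theta(x) = f(x) - E_{p_theta}[f(x)].

configs = np.array([[(b >> i) & 1 for i in range(3)] for b in range(8)], float)

def log_p(theta, x):
    pots = configs @ theta                  # u_theta for every configuration
    logZ = np.log(np.exp(pots).sum())       # exact normalizing constant
    return x @ theta - logZ

def grad_log_p(theta, x):
    pots = configs @ theta
    probs = np.exp(pots - np.log(np.exp(pots).sum()))
    return x - probs @ configs              # f(x) minus model expectation of f

theta = np.array([0.5, -1.0, 0.3])
x = np.array([1.0, 0.0, 1.0])

# Finite-difference check of the analytic gradient of Eq. (2)
eps = 1e-6
fd = np.array([(log_p(theta + eps * np.eye(3)[i], x)
                - log_p(theta - eps * np.eye(3)[i], x)) / (2 * eps)
               for i in range(3)])
print(np.abs(fd - grad_log_p(theta, x)).max())  # near zero
```

For large or continuous $x$ the enumeration of $Z(\theta)$ above is exactly what becomes intractable, which is the motivation for the sampling-based estimators discussed next.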
In graphical modeling terminology, without loss of generality, let each component of $x$ be indexed by a node in a graph. The whole potential function is defined to decompose over the cliques of the graph (i.e., fully-connected subsets of nodes):

$$u_\theta(x) = \sum_{C \in \mathcal{C}} u_C(x_C), \quad (3)$$

where $\mathcal{C}$ denotes the collection of cliques in the graph, and $u_C(x_C)$ denotes the clique potential defined over the vector of variables $x_C$ indexed by the clique $C$. Such decomposition reduces the complexity of model representation, but possibly at the sacrifice of model capacity, and should respect the inherent modularity in the interactions among variables. In previous studies, particularly in computer vision tasks, grid-like RFs are mostly used; higher-order RFs have been pursued, but most are in fact conditional random fields (CRFs) [17]. Notably, CRFs can only be used for discriminative tasks, e.g. segmenting and labeling natural language sentences or images, and usually have a reduced sample space of labels. Thus, the learning algorithms developed for CRFs are usually not applicable to unconditional RFs.

3 The inclusive-NRF approach
A high-level overview of our inclusive-NRF approach is shown in Figure 1. In the following, after introducing the NRF model (Section 3.1), the two new designs, namely introducing the inclusive-divergence minimized auxiliary generator and developing stochastic gradient sampling, are elaborated in Section 3.2 and Section 3.3 respectively.
3.1 The NRF model
The general idea of neural random fields (NRFs) is to implement the potential $u_\theta(x)$ by a neural network, which takes the multi-dimensional $x$ as input and outputs the scalar $u_\theta(x)$. In this manner, we can take advantage of the representation power of neural networks for RF modeling. Such an RF essentially becomes defined over a fully-connected undirected graph and captures interactions in observations up to the largest order, since the neural potential function involves all the components of $x$. Remarkably, the NRFs used in our experiments are different from similar models in previous studies [8, 11, 9], as detailed in Section 4.
3.2 Introducing inclusive-divergence minimized auxiliary generators
As shown in Eq. (2), the bottleneck in learning NRFs is that Monte Carlo sampling from the RF model is needed to approximate the model expectation when calculating the gradient. A recent idea is to introduce an auxiliary generator to approximate sampling from the target RF. In this paper, we are mainly concerned with modeling fixed-dimensional continuous observations (e.g. images). For reasons to become clear in the following, we choose a directed generative model $q_\phi(x, h) = q(h)\, q_\phi(x|h)$ for the auxiliary generator, which is specifically defined as follows^2 (^2 Note that during training, $\sigma^2$ is absorbed into the learning rates and does not need to be estimated.):

$$q(h) = \mathcal{N}(h; 0, I_h), \qquad q_\phi(x|h) = \mathcal{N}(x; g_\phi(h), \sigma^2 I_x), \quad (4)$$

where $g_\phi(h)$ is implemented as a neural network with parameter $\phi$, which maps the latent code $h$ to the observation space. $I_h$ and $I_x$ denote identity matrices, with dimensionality implied by $h$ and $x$ respectively. Drawing samples from the generator is simple, as it is just ancestral sampling from a two-variable directed graphical model. This is one reason for choosing Eq. (4) as the generator.
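For illustration, ancestral sampling from the generator of Eq. (4) can be sketched as follows; the one-layer network standing in for $g_\phi$, and all dimensions and noise scales, are hypothetical choices rather than the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ancestral sampling from the latent-variable generator of Eq. (4):
#   h ~ N(0, I_h),  x | h ~ N(g_phi(h), sigma^2 I_x).
H_DIM, X_DIM, SIGMA = 4, 8, 0.1
W = rng.standard_normal((X_DIM, H_DIM))

def g_phi(h):
    # Toy stand-in for the generator network (in the paper, a learned deep net)
    return np.tanh(W @ h)

def sample_generator(n):
    xs = np.empty((n, X_DIM))
    for i in range(n):
        h = rng.standard_normal(H_DIM)                         # h ~ q(h)
        xs[i] = g_phi(h) + SIGMA * rng.standard_normal(X_DIM)  # x ~ q(x|h)
    return xs

samples = sample_generator(100)
print(samples.shape)  # (100, 8)
```

Each sample costs one forward pass through $g_\phi$ plus Gaussian noise, which is why the generator is a cheap proposal compared to MCMC on $p_\theta$ itself.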
For a dataset consisting of observations, let $\tilde{p}(x)$ denote the empirical data distribution. A new design in this paper is that we perform maximum likelihood learning of $\theta$ and simultaneously minimize the inclusive divergence between the target random field $p_\theta$ and the auxiliary generator $q_\phi$:^3 (^3 Such optimization using two objectives is employed in a number of familiar learning methods, such as GAN with the log trick [4] and the wake-sleep algorithm [18].)

$$\begin{cases} \max_\theta \; E_{\tilde{p}(x)}\left[\log p_\theta(x)\right], \\ \min_\phi \; KL\left(p_\theta(x) \,\|\, q_\phi(x)\right). \end{cases} \quad (5)$$
The first line of Eq. (5) is equivalent to maximum likelihood training of the target RF under the empirical data $\tilde{p}$, which requires sampling from $p_\theta$. Simultaneously, the second line optimizes the generator $q_\phi$ to be close to $p_\theta$, so that $q_\phi$ becomes a good proposal for sampling from $p_\theta$. By Proposition 1, we can derive the gradients w.r.t. $\theta$ and $\phi$ (to be ascended) as follows:

$$\begin{cases} \nabla_\theta = E_{\tilde{p}(x)}\left[\nabla_\theta u_\theta(x)\right] - E_{p_\theta(x)}\left[\nabla_\theta u_\theta(x)\right], \\ \nabla_\phi = E_{p_\theta(x) q_\phi(h|x)}\left[\nabla_\phi \log q_\phi(x|h)\right]. \end{cases} \quad (6)$$
In practice, we apply minibatch-based stochastic gradient descent (SGD) to solve the optimization problem of Eq. (5), as shown in Algorithm 1. Notably, choosing Eq. (4) as the generator yields tractable gradients for the inclusive-divergence minimization.

Proposition 1. Both lines of Eq. (6) for the gradient calculations hold.
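To make the two coupled updates concrete, here is a self-contained toy sketch, not the paper's Algorithm 1: both the RF and the generator are one-parameter Gaussian families, so every gradient in Eq. (6) has a closed form, and model samples are revised by a few Langevin steps; all numeric settings are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance of the joint updates of Eq. (5)/(6):
#   RF:        u_theta(x) = -(x - theta)^2 / 2   (so p_theta = N(theta, 1))
#   generator: q_phi(x)   = N(phi, 1)
# These model choices are illustrative, not the paper's neural models.
data = rng.normal(2.0, 0.5, size=1000)   # training data with mean 2
theta, phi, lr, delta = 0.0, 0.0, 0.05, 0.1

for _ in range(2000):
    x_pos = rng.choice(data, 64)              # minibatch of data
    x_neg = phi + rng.standard_normal(64)     # proposal from q_phi
    for _ in range(10):                       # revise with Langevin steps on p_theta
        x_neg += 0.5 * delta * (theta - x_neg) \
                 + np.sqrt(delta) * rng.standard_normal(64)
    # Eq. (6), first line: data-term minus model-term of grad u_theta
    theta += lr * (np.mean(x_pos - theta) - np.mean(x_neg - theta))
    # Eq. (6), second line: ascend grad of log q_phi at model samples
    phi += lr * np.mean(x_neg - phi)

print(round(theta, 1), round(phi, 1))  # both approach the data mean, about 2.0
```

At the fixed point the model mean matches the data mean (moment matching of SML), and the generator mean chases the model samples, so $\theta$ and $\phi$ both settle near 2.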
3.3 Developing stochastic gradient samplers for NRF model sampling
In Algorithm 1, we need to draw samples from our target distribution $p_\theta(x)\, q_\phi(h|x)$ given the current $\theta$ and $\phi$. For such a continuous distribution, samplers leveraging continuous dynamics (namely continuous-time Markov processes described by stochastic differential equations), such as Langevin dynamics (LD) and Hamiltonian Monte Carlo (HMC) [19], are known to be efficient in exploring the continuous state space. Simulating the continuous dynamics leads to the target distribution as the stationary distribution. The Markov transition kernel defined by the continuous dynamical system usually involves the gradients of the target distribution, which in our case are as follows:

$$\begin{cases} \nabla_x \log\left[p_\theta(x)\, q_\phi(h|x)\right] = \nabla_x u_\theta(x) + \nabla_x \log q_\phi(x|h) - \nabla_x \log q_\phi(x), \\ \nabla_h \log\left[p_\theta(x)\, q_\phi(h|x)\right] = \nabla_h \log q(h) + \nabla_h \log q_\phi(x|h). \end{cases} \quad (7)$$

It can be seen that while it is straightforward to calculate the gradient w.r.t. $h$ and the first two terms^4 in the gradient w.r.t. $x$ (^4 Notably, $\nabla_x u_\theta(x)$ does not require calculation of the normalizing constant.), the gradient w.r.t. $x$ contains an intractable term, $\nabla_x \log q_\phi(x)$. Therefore we are interested in developing stochastic gradient variants of continuous-dynamics samplers, which rely on noisy estimates of $\nabla_x \log q_\phi(x)$.
Recently, stochastic gradient samplers have emerged for simulating posterior samples in large-scale Bayesian inference, such as SGLD (stochastic gradient Langevin dynamics) [14] and SGHMC (stochastic gradient Hamiltonian Monte Carlo) [15]. To illustrate, consider, with abuse of notation, the posterior of model parameters $\theta$ given an observed dataset $\mathcal{D} = \{z_1, \dots, z_n\}$: $p(\theta|\mathcal{D}) \propto p(\theta) \prod_{i=1}^{n} p(z_i|\theta)$, which is taken as the target distribution. Instead of using full-data gradients, which need a sweep over the entire dataset, these samplers subsample the dataset and use stochastic gradients $\nabla_\theta \left[\log p(\theta) + \frac{n}{|S|} \sum_{z_i \in S} \log p(z_i|\theta)\right]$ in the dynamics simulation, where $S$ is a subsampled data subset. In this manner, the computational cost is significantly reduced in each iteration, and such Bayesian inference methods scale to large datasets.
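The key property exploited by such samplers is that the rescaled minibatch gradient is an unbiased estimate of the full-data gradient. A minimal numerical illustration, with a hypothetical Gaussian model and prior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Minibatch stochastic gradient of a log-posterior, as used by SGLD/SGHMC.
# Illustrative model: z_i ~ N(theta, 1), prior theta ~ N(0, 10).
data = rng.normal(1.5, 1.0, size=10_000)
n = len(data)

def full_grad(theta):
    # grad log p(theta | D): prior term plus a sweep over all data
    return -theta / 10.0 + np.sum(data - theta)

def stoch_grad(theta, batch):
    # subsampled gradient, rescaled by n / |S| to stay unbiased
    return -theta / 10.0 + (n / len(batch)) * np.sum(batch - theta)

# Averaging many minibatch gradients recovers the full-data gradient
est = np.mean([stoch_grad(0.5, rng.choice(data, 100)) for _ in range(2000)])
print(est, full_grad(0.5))  # close to each other
```

The $n/|S|$ rescaling is what makes the minibatch term match the full sum in expectation; dropping it would silently change the target distribution.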
In practice, sampling is based on a discretization of the continuous dynamics. Despite the discretization error and the noise introduced by the stochastic gradients, it can be shown that simulating the discretized dynamics with stochastic gradients also leads to the target distribution as the stationary distribution, when the step sizes are annealed to zero at a certain rate. The convergence of SGLD/SGHMC is given in Theorem 1, summarized from [20, 15, 21].
Theorem 1. Denote the target density as $\pi(x) \propto \exp\left(\psi(x)\right)$ with $\psi$ given. Assume that one can compute a noisy, unbiased estimate $\widetilde{\nabla}_x \psi(x)$ (a stochastic gradient) of the gradient $\nabla_x \psi(x)$. For a sequence of asymptotically vanishing time-steps $\{\delta_t\}$ (satisfying $\sum_t \delta_t = \infty$ and $\sum_t \delta_t^2 < \infty$) and an i.i.d. standard Gaussian noise sequence $\{\eta_t\}$, the SGLD iterates as follows, starting from $x_0$:

$$x_{t+1} = x_t + \frac{\delta_t}{2}\, \widetilde{\nabla}_x \psi(x_t) + \sqrt{\delta_t}\, \eta_t. \quad (8)$$

Starting from $x_0$ and $v_0$, and with friction coefficient $\alpha$, the SGHMC iterates as follows:

$$\begin{cases} v_{t+1} = (1 - \alpha)\, v_t + \delta_t\, \widetilde{\nabla}_x \psi(x_t) + \sqrt{2\alpha\delta_t}\, \eta_t, \\ x_{t+1} = x_t + v_{t+1}. \end{cases} \quad (9)$$

In both cases, the distribution of $x_t$ converges to $\pi(x)$ as $t \to \infty$.
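For illustration, the SGLD recursion of Eq. (8) can be sketched on a toy one-dimensional target, a standard Gaussian, for which the exact gradient $\nabla_x \psi(x) = -x$ is available; a fixed small step size is used instead of an annealed schedule, so the chain targets only a close approximation of $\pi$:

```python
import numpy as np

rng = np.random.default_rng(0)

# SGLD recursion of Eq. (8) on a toy target pi(x) = N(0, 1).
def sgld(grad_psi, x0, step, n_steps):
    x, out = x0, []
    for _ in range(n_steps):
        # x_{t+1} = x_t + (step/2) grad + sqrt(step) * standard normal noise
        x = x + 0.5 * step * grad_psi(x) + np.sqrt(step) * rng.standard_normal()
        out.append(x)
    return np.array(out)

chain = sgld(lambda x: -x, x0=5.0, step=0.1, n_steps=20_000)
samples = chain[2000:]                  # discard burn-in from the far start
print(samples.mean(), samples.var())   # near 0 and 1
```

With a fixed step size the stationary variance is slightly inflated relative to the target, which is exactly the discretization bias that the annealing condition on $\{\delta_t\}$ in Theorem 1 removes.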
By considering the augmented variable $(x, h)$, the target $\pi(x, h) \propto p_\theta(x)\, q_\phi(h|x)$, i.e. $\psi = \log\left[p_\theta(x)\, q_\phi(h|x)\right]$ up to a constant, and Eq. (7), we can use Theorem 1 to develop the sampling step for Algorithm 1, as presented in Algorithm 2. For the gradient w.r.t. $x$, the intractable term $\nabla_x \log q_\phi(x)$ can be estimated by a stochastic gradient, motivated by observing

$$\nabla_x \log q_\phi(x) = E_{q_\phi(h|x)}\left[\nabla_x \log q_\phi(x|h)\right], \quad (10)$$

as proved in Proposition 2. Ideally, we would draw $h' \sim q_\phi(h|x)$ and then use $\nabla_x \log q_\phi(x|h')$ as an unbiased estimator of $\nabla_x \log q_\phi(x)$. In practice, at step $t$, given $x_t$ and starting from $h_t$, we run one step of LD sampling over $h$ targeting $q_\phi(h|x_t)$ to obtain $h'$, and calculate $\nabla_x \log q_\phi(x_t|h')$. This gives a biased but tractable estimator of $\nabla_x \log q_\phi(x_t)$. It is empirically found in our experiments that more steps of this inner LD sampling do not significantly improve the performance for NRF learning.
Proposition 2.
Eq. (10) holds.
Proof. By direct calculation, $\nabla_x \log q_\phi(x) = \frac{\nabla_x q_\phi(x)}{q_\phi(x)} = \frac{1}{q_\phi(x)} \int \nabla_x q_\phi(x, h)\, dh = \int \frac{q_\phi(x, h)}{q_\phi(x)}\, \nabla_x \log q_\phi(x, h)\, dh = E_{q_\phi(h|x)}\left[\nabla_x \log q_\phi(x|h)\right]$, where the last equality uses $\nabla_x \log q(h) = 0$. ∎

Specializing the recursions of Theorem 1 to the target $p_\theta(x)\, q_\phi(h|x)$ over $(x, h)$ gives, for SGLD,

$$x_{t+1} = x_t + \frac{\delta_t}{2}\, \widetilde{\nabla}_x \log\left[p_\theta(x_t)\, q_\phi(h_t|x_t)\right] + \sqrt{\delta_t}\, \eta_t^x, \quad (11)$$

$$h_{t+1} = h_t + \frac{\delta_t}{2}\, \nabla_h \log\left[p_\theta(x_t)\, q_\phi(h_t|x_t)\right] + \sqrt{\delta_t}\, \eta_t^h, \quad (12)$$

and, for SGHMC, the analogous momentum-based recursion over $(x, h)$:

$$\begin{cases} v_{t+1} = (1 - \alpha)\, v_t + \delta_t\, \widetilde{\nabla}_{(x,h)} \log\left[p_\theta(x_t)\, q_\phi(h_t|x_t)\right] + \sqrt{2\alpha\delta_t}\, \eta_t, \\ (x_{t+1}, h_{t+1}) = (x_t, h_t) + v_{t+1}. \end{cases} \quad (13)$$
So instead of using the exact gradients shown in Eq. (7), we develop tractable, biased stochastic gradients as follows:

$$\begin{cases} \widetilde{\nabla}_x \log\left[p_\theta(x)\, q_\phi(h|x)\right] = \nabla_x u_\theta(x) + \nabla_x \log q_\phi(x|h) - \nabla_x \log q_\phi(x|h'), \\ \nabla_h \log\left[p_\theta(x)\, q_\phi(h|x)\right] = \nabla_h \log q(h) + \nabla_h \log q_\phi(x|h), \end{cases} \quad (14)$$

where $h'$ is an approximate sample from $q_\phi(h|x)$, obtained by one or more steps of LD from $h$. Remarkably, as we show in Algorithm 2, the starting point $(x_0, h_0)$ for the SGLD/SGHMC recursions is obtained by ancestral sampling from $q_\phi(x, h)$. Thus at step $t = 0$, $h_0$ is already a sample from $q_\phi(h|x_0)$ given $x_0$, and we can directly use $h_0$ as $h'$ without running the inner LD sampling. Afterwards, for $t \geq 1$, the conditional distribution of $h_t$ given $x_t$ is close to $q_\phi(h_t|x_t)$, though strictly speaking not equal to it. We can run one or more steps of LD to obtain $h'$ and thus reduce the bias in the stochastic gradient estimator.
With the above stochastic gradients in Eq. (14), the sampling step in Algorithm 1 can be performed by running parallel chains, each executed by running finite steps of SGLD/SGHMC with tractable gradients w.r.t. both $x$ and $h$, as shown in Algorithm 2. Intuitively, the generator first gives a proposal $(x_0, h_0)$, and then the system follows the gradients w.r.t. $x$ and $h$ to revise the sample. The gradient terms pull samples towards low-energy regions of the random field and adjust the latent code of the generator, while the noise term brings randomness. In this manner, we obtain Markov chain samples from $p_\theta(x)\, q_\phi(h|x)$.
To examine the sampling performance of the developed SGLD/SGHMC samplers, we conduct a synthetic experiment; the results are shown in Figure 2. The target marginal and the joint over $(x, h)$ are 50-dimensional and 100-dimensional Gaussians respectively, with randomly generated covariance matrices (i.e. both $x$ and $h$ are 50-dimensional). For evaluation, we simulate parallel chains and follow [21] to measure the sampler's performance by the KL divergence from the empirical Gaussian (estimated from the samples) to the ground truth. We use the decaying step-size schedule of [21] for all methods, with an additional momentum-related hyperparameter for SGHMC, and we find that these hyperparameters perform well for each method in the experiment. The main observations are as follows. First, SGLD and SGHMC converge, though worse than their counterparts using exact gradients. Second, HMC samplers, whether using exact or stochastic gradients, outperform the corresponding LD samplers, since HMC dynamics, also referred to as second-order Langevin dynamics, exploit an additional momentum term. Third, interestingly, the SGHMC sampler outperforms the LD sampler with exact gradients. This reveals the benefit of our systematic development of stochastic gradient samplers, including but not limited to SGLD and SGHMC. Although the CoopNet sampler of [9] (described in related work) performs close to our SGLD sampler, its performance is much worse than our SGHMC sampler; SGHMC is a new development, which cannot be obtained by simply extending the CoopNet sampler.

Finally note that, as discussed before, running finite steps of Eq. (11)-(12) and of Eq. (13) when applying SGLD/SGHMC sampling produces biased estimates of the gradients in Eq. (6) for NRF learning. We did not find this to pose problems for the SGD optimization in practice, as similarly found in [22] and [13], which work with biased gradient estimators.
3.4 Semi-supervised learning with inclusive-NRFs
In the following, we apply our inclusive-NRF approach in the SSL setting to show its flexibility. Note that different models are needed for unsupervised and semi-supervised learning, because SSL needs to additionally consider labels apart from observations.
Model definition. In semi-supervised tasks, we consider the following RF for the joint modeling of observation $x$ and class label $y$:

$$p_\theta(x, y) = \frac{1}{Z(\theta)} \exp\left(u_\theta(x, y)\right). \quad (15)$$
This is different from Eq. (1) for unsupervised learning, which only models $x$ without labels. To implement the potential function $u_\theta(x, y)$, we consider a neural network $\Phi_\theta(x)$, with $x$ as the input and the output size equal to the number of class labels, $K$. Then we define $u_\theta(x, y) = \mathrm{onehot}(y)^T \Phi_\theta(x)$, where $\mathrm{onehot}(y)$ represents the one-hot encoding vector of the label $y$. In this manner, the conditional density $p_\theta(y|x)$ is the classifier, defined as follows:

$$p_\theta(y|x) = \frac{\exp\left(u_\theta(x, y)\right)}{\sum_{y'=1}^{K} \exp\left(u_\theta(x, y')\right)}, \quad (16)$$

which acts like multi-class logistic regression using the $K$ logits calculated from $x$ by the neural network $\Phi_\theta$, and we do not need to calculate $Z(\theta)$ for classification. The auxiliary generator is implemented the same as in Eq. (4), i.e. as an unconditional generator. With the definition of the joint density in Eq. (15), it can be shown that, with abuse of notation, the marginal density is $p_\theta(x) = \frac{1}{Z(\theta)} \exp\left(u_\theta(x)\right)$, where $u_\theta(x) = \log \sum_{y=1}^{K} \exp\left(u_\theta(x, y)\right)$.
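For illustration, the classifier of Eq. (16) and the log-sum-exp marginal potential can be sketched as follows; the random linear map standing in for the network, and the dimensions, are purely hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# The label-conditional classifier of Eq. (16) and the marginal potential
# u_theta(x) = logsumexp_y u_theta(x, y).
K, X_DIM = 10, 32
W = rng.standard_normal((K, X_DIM)) * 0.1   # toy stand-in for the network

def logits(x):
    # One logit per class label; u_theta(x, y) is the y-th entry
    return W @ x

def p_y_given_x(x):
    # Eq. (16): softmax over the K logits; Z(theta) never appears
    z = logits(x)
    z = z - z.max()           # numerical stabilization
    e = np.exp(z)
    return e / e.sum()

def u_marginal(x):
    # Marginal potential: log-sum-exp over the K per-label potentials
    z = logits(x)
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

x = rng.standard_normal(X_DIM)
print(p_y_given_x(x).sum())   # 1.0: a proper conditional distribution
```

Note that classification only needs the logits, while the unnormalized marginal potential $u_\theta(x)$ is what the unsupervised part of the objective and sample evaluation operate on.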
Model learning. Suppose that among the data, only a small subset of the observations, say the first $m$ observations, have class labels. Denote these labeled data as $\{(\tilde{x}_i, \tilde{y}_i)\}_{i=1}^{m}$. Then we can formulate semi-supervised learning as jointly optimizing

$$\begin{cases} \max_\theta \; E_{\tilde{p}(x)}\left[\log p_\theta(x)\right] + \gamma\, E_{\tilde{p}(x, y)}\left[\log p_\theta(y|x)\right], \\ \min_\phi \; KL\left(p_\theta(x) \,\|\, q_\phi(x)\right), \end{cases} \quad (17)$$

which are defined by hybrids of generative and discriminative criteria, similar to [23, 24, 25]. The hyperparameter $\gamma$ controls the relative weight between the generative and discriminative criteria. Similar to deriving Eq. (6), it can easily be seen that the gradients w.r.t. $\theta$ and $\phi$ (to be ascended) are as follows:

$$\begin{cases} \nabla_\theta = E_{\tilde{p}(x)}\left[\nabla_\theta u_\theta(x)\right] - E_{p_\theta(x)}\left[\nabla_\theta u_\theta(x)\right] + \gamma\, E_{\tilde{p}(x, y)}\left[\nabla_\theta \log p_\theta(y|x)\right], \\ \nabla_\phi = E_{p_\theta(x) q_\phi(h|x)}\left[\nabla_\phi \log q_\phi(x|h)\right]. \end{cases} \quad (18)$$
In practice, we calculate noisy gradient estimators and apply minibatch-based stochastic gradient descent (SGD) to solve the optimization problem of Eq. (17), as shown in Algorithm 3 in Appendix A. Apart from the basic losses shown in Eq. (17), there are some regularization losses that we found helpful in guiding SSL; they are presented in Appendix B.
To conclude, we have shown that inclusive-NRFs can be easily applied to SSL. To the best of our knowledge, there are no prior studies applying random fields to SSL. The semi-supervised inclusive-NRF model defined above is itself novel for SSL.
4 Related work
Comparisons and connections of our inclusive-NRF approach with existing studies are provided below from three perspectives.
Learning NRFs. These studies, which aim to learn NRFs, are the most relevant to this work. The classic method for learning RFs is the SML method [12], which works with the single target model $p_\theta$. Compared to learning traditional RFs, which mainly use linear potential functions, learning NRFs, which use NN-based nonlinear potential functions, is more challenging. A recent line of progress in learning NRFs, as studied in [8, 9, 11, 13], is to jointly train the target random field $p_\theta$ and an auxiliary generator $q_\phi$. Different studies differ in the objective functions used in the joint training, and thus have different computational and statistical properties.
(1) It is shown in Proposition 3 in Appendix C that learning in [8] minimizes the exclusive divergence $KL(q_\phi \| p_\theta)$ w.r.t. $\phi$, which involves the intractable entropy term of $q_\phi$ and tends to force the generator to seek modes, yielding missing modes. We refer to this approach as exclusive-NRF.
(2) Learning in [11] and in this paper minimizes the inclusive divergence $KL(p_\theta \| q_\phi)$ w.r.t. $\phi$. Noticeably, this paper presents our innovation in developing NRFs for continuous data, which is fundamentally different from [11], which targets discrete data; the target NRF model, the generator and the sampler all require new designs. [11] mainly studies random field language models, using LSTM generators (autoregressive, with no latent variables) and employing the Metropolis independence sampler (MIS), which is applicable to discrete data (natural sentences). In this paper, we design random field models for continuous data (e.g. images), choosing latent-variable generators and developing SGLD/SGHMC to exploit noisy gradients in the continuous space.
(3) In [9] (CoopNet), motivated by interweaving maximum likelihood training of the random field $p_\theta$ and the latent-variable generator $q_\phi$, a joint training method is introduced to train NRFs. Our inclusive-NRF approach differs from [9] in two key aspects. First, [9] uses LD sampling to generate samples, but its two LD sampling steps are intuitively interleaved according to $p_\theta(x)$ and $q_\phi(h|x)$ separately, without aiming to draw samples from $p_\theta(x)\, q_\phi(h|x)$ and without awareness of stochastic gradients. Specifically, starting from the ancestral sample $(x_0, h_0)$, a first LD sampling with finite steps is conducted to obtain $x$ according to $p_\theta(x)$, and then a second LD sampling with finite steps is used to obtain $h$ according to $q_\phi(h|x)$ with $x$ fixed. Algorithmically, this is different from our sampling step, which moves $(x, h)$ jointly, as systematically developed in Section 3.3: starting from $(x_0, h_0)$, our SGLD recursions shown in Eq. (11)-(12) are conducted to obtain $(x, h)$. Notably, SGHMC is a new development, which cannot be obtained by simply extending the CoopNet sampler. Moreover, our development allows easy incorporation of more advanced elements of stochastic gradient samplers, such as utilizing the Riemannian geometry of the target distribution via preconditioning [21].
Second, let $\pi$ denote the distribution of the revised samples resulting from the interleaved Langevin transitions. The interpretation presented in [9] relates their method to a joint optimization problem involving $\pi$, which is different from our learning objectives shown in Eq. (5). Thus, learning in [9] does not aim to minimize the inclusive divergence $KL(p_\theta \| q_\phi)$ w.r.t. $\phi$.
Empirically, as shown in Figure 2, the CoopNet sampler performs much worse than our SGHMC sampler. It is further shown in Table II that inclusive-NRF with SGLD outperforms CoopNet in image generation, and in Table IV that utilizing SGHMC in learning inclusive-NRFs, to exploit gradient information with momentum, yields better performance than using SGLD.
(4) Learning in [13] minimizes a divergence w.r.t. $\phi$ that also tends to drive the generator to cover modes. But this approach is severely limited by the high variance of its gradient estimator w.r.t. $\phi$, and is only tested on the simpler MNIST and Omniglot datasets.

Additionally, different NRF studies also differ in the models used in the joint training. The target NRF used in this work differs from those in previous studies [8, 11, 9]: [8] includes additional linear and squared terms in the potential, [11] defines the potential over discrete-valued sequences, and [9] defines the model in the form of an exponential tilting of a reference distribution (Gaussian white noise). There also exist different choices for the generator, such as GAN-style generators in [8], LSTMs in [11], or latent-variable models in [9] and this work.

To sum up, our contributions in introducing inclusive-divergence minimized auxiliary generators and developing stochastic gradient sampling enable the solid development of the new inclusive-NRF approach. Moreover, all previous NRF studies examine unsupervised learning only, and none shows application or extension of their methods or models to semi-supervised learning.
Monte Carlo sampling. One step in our inclusive-NRF approach is to apply SGLD/SGHMC to draw samples from the target density, starting from the proposal sample given by the generator. Theoretically, improvements in NRF sampling methods could potentially be integrated into NRF learning algorithms. For example, it has recently been studied in [26] how to learn MCMC transition kernels, also parameterized by neural networks, to improve HMC sampling from a given target distribution. Integration into learning NRFs is interesting but outside the scope of this paper.
Comparison and connection with GANs. On the one hand, there have been some efforts aiming to address the inability of GANs to provide sensible energy estimates for samples. The energy-based GAN (EBGAN) [27] proposes to view the discriminator as an energy function by designing an auto-encoder discriminator. The recent work in [28] connects [27] and [8], and shows another two approximations for the entropy term. However, it is known that as the generator converges to the true data distribution, the GAN discriminator converges to a degenerate uniform solution. This basically prevents the GAN discriminator from providing density information, though some modifications exist. In contrast, our inclusive-NRFs, unlike GANs, naturally provide (unnormalized) density estimates, which we examine in GMM synthetic experiments and anomaly detection benchmarking experiments. Moreover, none of the above energy-related GAN studies examine their methods or models for SSL, except EBGAN, which performs moderately.
On the other hand, there are interesting connections between inclusive-NRFs and GANs, as elaborated in Appendix D. When interpreting the potential function $u_\theta(x)$ as the critic in Wasserstein GANs [29], inclusive-NRFs seem similar to Wasserstein GANs. A difference is that in optimizing $\theta$ in inclusive-NRFs, the generated samples are further revised by taking finite-step gradients of the potential w.r.t. $x$. However, the critic in Wasserstein GANs can hardly be interpreted as an unnormalized log-density. Thus, strictly speaking, inclusive-NRFs are not GAN-like.
5 Experiments
As demonstrations of how the new inclusive-NRF approach can be flexibly and effectively used, specific inclusive-NRF models are developed and thoroughly evaluated on a number of tasks: unsupervised/supervised image generation, semi-supervised classification and anomaly detection.
First, we report experiments on synthetic datasets, which help to illustrate the different models and learning methods. Then, extensive experiments are conducted to evaluate the performance of our approach (inclusive-NRFs) and various existing methods on real-world datasets. We refer to Appendix E for experimental details and additional results.
5.1 GMM synthetic experiment for unsupervised learning
Table I compares the "covered modes" and "realistic ratio" metrics of GAN with logD trick [4], WGAN-GP [30], exclusive-NRF [8], inclusive-NRF generation, and inclusive-NRF revision.
The synthetic data consist of 1,600 training examples generated from a 2D Gaussian mixture model (GMM) with 32 equally-weighted, low-variance Gaussian components, uniformly laid out on four concentric circles, as in Figure 3(a). The data distribution exhibits many modes separated by large low-probability regions, which makes it suitable for examining how well different learning methods deal with multiple modes. For comparison, we experiment with GAN with logD trick [4] and WGAN-GP [30] as directed generative models, and with exclusive-NRF [8] and inclusive-NRF as undirected generative models. The network architectures and hyperparameters are the same for all methods, as listed in Table VII in the Appendix. We use SGLD [14] for inclusive-NRFs on this synthetic dataset, with empirically chosen revision hyperparameters.
Figure 3 visually shows the generated samples from the models trained using different methods. Table I reports the "covered modes" and "realistic ratio" as numerical measures of how well the multi-modal data are fitted, similarly to [31]. We use the following procedure to estimate the metrics "covered modes" and "realistic ratio" for each trained model:

1. Stochastically generate 100 samples.
2. A mode is defined to be covered (not missed) if there exist generated samples located close to the mode (within a threshold on squared distance); those samples are said to be realistic.
3. Count how many modes are covered and calculate the proportion of realistic samples.
4. Repeat the above steps 100 times and average the results.
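A minimal sketch of these two metrics (with an illustrative mode layout and squared-distance threshold, since the paper's exact threshold value is not recovered here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of the "covered modes" / "realistic ratio" metrics described above.
# The mode layout and the squared-distance threshold 0.05 are illustrative.
def metrics(samples, modes, sq_thresh=0.05):
    # squared distance from every sample to every mode
    d2 = ((samples[:, None, :] - modes[None, :, :]) ** 2).sum(-1)
    realistic = d2.min(axis=1) < sq_thresh      # sample lies close to some mode
    covered = (d2 < sq_thresh).any(axis=0)      # mode is hit by some sample
    return covered.sum(), realistic.mean()

# 8 modes on a unit circle; generated samples hug only half of the modes
angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)
modes = np.stack([np.cos(angles), np.sin(angles)], axis=1)
samples = modes[:4].repeat(25, axis=0) + 0.01 * rng.standard_normal((100, 2))

n_covered, ratio = metrics(samples, modes)
print(n_covered, ratio)  # 4 modes covered, realistic ratio near 1.0
```

In this synthetic situation the sampler is "realistic but not diverse": every sample sits near a mode, yet half the modes are missed, which is exactly the mode-missing pattern the table quantifies.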
For each method, we independently train 10 models and calculate the mean and standard deviation (SD) across the 10 independent runs. The main observations are as follows:

GAN suffers from mode missing, generating realistic but not diverse samples. WGAN-GP increases "covered modes" but decreases the "realistic ratio". Inclusive-NRF performs much better than both GAN and WGAN-GP in sample generation.

Inclusive-NRF outperforms exclusive-NRF in both sample generation and density estimation.

After revision, samples from inclusive-NRF become more like real samples, achieving the best results in both the "covered modes" and "realistic ratio" metrics.
5.2 GMM synthetic experiment for semi-supervised learning
Table II compares inception score (IS) and FID, in both the unsupervised and supervised settings, for DCGAN [35], ImprovedGAN [33], WGAN-GP [30], SGAN [36], DFM [37], CT-GAN [38], Fisher GAN [39], CoopNet [9], BWGAN [40], SN-GAN [41], and inclusive-NRF generation.
In this experiment, we present the performance of semi-supervised inclusive-NRFs for SSL on a synthetic dataset. In addition to illustrating how semi-supervised inclusive-NRF works, this experiment further emphasizes that the inclusive-NRF approach can provide (unnormalized) density estimates for the marginal, joint, and conditional densities. In contrast, the use of GANs as general-purpose probabilistic generative models has been limited by the difficulty of using them to provide density estimates, or even unnormalized potential values, for sample evaluation.
The dataset is a 2D GMM with 16 Gaussian components, uniformly laid out on two concentric circles. The two circles represent two different classes. There are only 4 labeled points per class and a total of 400 unlabeled points. The network architectures are the same as in Section 5.1, except that the neural network that implements the potential function for SSL now has two output units. As shown in Figure 4, the semi-supervised trained inclusive-NRF not only captures the marginal potential effectively, but also learns the class-conditional potentials successfully. This behavior agrees with our design idea of blending unsupervised and supervised learning in the semi-supervised setting, as expressed in Eq. (18).
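For concreteness, data of this shape can be generated along the following lines (a sketch only; `make_two_circle_gmm`, the circle radii, and the component variance are illustrative assumptions, not the paper's exact settings):

```python
import numpy as np

def make_two_circle_gmm(n_per_class=200, radii=(1.0, 2.0), k=8, sigma=0.05, seed=0):
    """2-D GMM with 16 components: 8 per circle, each circle is one class."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for label, r in enumerate(radii):
        angles = 2 * np.pi * np.arange(k) / k                # k equally spaced modes
        centers = r * np.stack([np.cos(angles), np.sin(angles)], axis=1)
        idx = rng.integers(k, size=n_per_class)              # pick a component per point
        X.append(centers[idx] + sigma * rng.standard_normal((n_per_class, 2)))
        y.append(np.full(n_per_class, label))
    return np.concatenate(X), np.concatenate(y)
```

With the defaults above this yields 400 points in total, matching the number of unlabeled points used in the experiment; the few labeled points per class would be subsampled from these.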
5.3 Image generation on CIFAR-10
In this experiment, we examine both unsupervised and supervised learning on the widely used real-world dataset CIFAR-10 [32] for image generation. To evaluate generation quality quantitatively, we use the inception score (IS) [33] (the larger the better) and the Fréchet inception distance (FID) [34] (the smaller the better). Table II reports the inception score and FID for state-of-the-art methods, in both unsupervised and supervised settings. The supervised learning of inclusive-NRF is conducted as a special case of semi-supervised learning with all images labeled, using unconditional generation. We use ResNet architectures in this experiment; see Appendix E.1 for experimental details.
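As a reminder, the inception score computes exp(E_x[KL(p(y|x) || p(y))]) over the Inception classifier's class posteriors; a minimal sketch from precomputed softmax probabilities (the classifier itself is omitted, and `inception_score` is an illustrative name):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (n, c) softmax outputs p(y|x) from the Inception classifier."""
    p_y = probs.mean(axis=0, keepdims=True)          # marginal p(y) over the sample set
    # per-sample KL(p(y|x) || p(y)); eps guards the log at zero entries
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```

The score is 1 when every sample gets the uniform posterior and approaches the number of classes when samples are confidently and evenly spread over all classes, which is why larger is better.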
From the comparison results in Table II, it can be seen that the proposed inclusive-NRF model achieves, to the best of our knowledge, the best inception score on CIFAR-10 in both the unsupervised and supervised settings. Some generated samples are shown in Figure 7(c)(d) in the Appendix for the unsupervised and supervised settings, respectively. We also show the capability of inclusive-NRFs in latent-space interpolation (Appendix F) and conditional generation (Appendix G).
5.4 Semi-supervised learning on MNIST, SVHN and CIFAR-10
Methods | MNIST error (%) | SVHN error (%) | CIFAR-10 error (%) | CIFAR-10 IS
CatGAN [42] | | | |
SDGM [43] | | | |
Ladder network [44] | | | | /
ADGM [43] | | | |
ImprovedGAN [33] | | | |
EBGAN [27] | | | |
ALI [31] | | | |
TripleGAN [45] | | | |
TriangleGAN [46] | | | |
BadGAN [47] | | | |
SobolevGAN [48] | | | |
Semi-supervised inclusive-NRF | | | |
Results below this line cannot be directly compared to those above.
VAT small [49] | | 6.83 | 14.87 | /
Π model [50] | | | | /
Temporal Ensembling [50] | | | | /
Mean Teacher [51] | | | | /
VAT+EntMin [49] | | | | /
CTGAN [38] | | | | /
For semi-supervised learning, we consider three widely used benchmark datasets, namely MNIST [52], SVHN [53], and CIFAR-10 [32]. As in previous work, we randomly sample 100, 1,000, and 4,000 labeled examples from MNIST, SVHN, and CIFAR-10 respectively during training, and use the standard data split for testing. See Appendix E.2 for experimental details.
It can be seen from Table III that semi-supervised inclusive-NRFs produce strong classification results, on par with state-of-the-art DGM-based SSL methods. See Figure 7(a)(b) in the Appendix for generated samples. BadGANs achieve better classification results, but, as indicated by the low inception score, their generation is much worse than that of semi-supervised inclusive-NRFs. In fact, among DGM-based SSL methods, inclusive-NRFs achieve the best performance in sample generation. This is in contrast to the conflict between good classification and good generation observed in GAN-based SSL [33, 47]. It is analyzed in [47] that good GAN-based SSL requires a bad generator.^5 This conflict is awkward and in fact undermines the original idea of generative SSL: successful generative training, which indicates good generation, provides regularization for finding good classifiers [23, 24]. In this sense, BadGANs can hardly be classified as a generative SSL method.
^5 This analysis is based on using the (K+1)-class GAN-like discriminator objective for SSL. To the best of our knowledge, the conflict has not been reported in previous generative SSL methods [23, 24], which use a K-class classifier as in semi-supervised inclusive-NRFs.
Finally, note that some discriminative SSL methods, as listed in the lower block of Table III, also produce superior performance by utilizing data augmentation and consistency regularization. However, these methods are unable to generate (realistic) samples. Discriminative SSL methods thus exploit different regularization from generative SSL methods and cannot be directly compared to them. Combining the two, as an interesting direction for future work, could yield further performance improvements.
Training setting | Unsupervised generation IS | Unsupervised revision IS | Semi-supervised error (%)
SGLD | | |
SGLD | | |
SGLD | | |
SGHMC | | |
5.5 Ablation study
We report the results of an ablation study of our inclusive-NRF method on CIFAR-10 in Table IV. In this experiment, we use the standard CNN [41] for unsupervised learning and the same networks as in Table III for semi-supervised learning. See Appendix E.3 for experimental details. We analyze the effects of different settings in model training, such as using SGLD or SGHMC and the number of revision steps used. For each training setting, we also compare two ways of generating samples from a trained NRF (with and without sample revision), as previously illustrated in Figure 3 on the synthetic GMM data. The main observations are as follows.
First, given a trained NRF, revision (i.e., following the gradient of the RF's potential w.r.t. the sample) always improves sample quality (IS), as shown by the consistent IS improvement from the second column (generation) to the third (revision). This is in accordance with the results of the GMM synthetic experiments in Section 5.1. Moreover, note that in revision it is the estimated density that guides the samples toward low-energy regions of the random field. This also demonstrates one benefit of random field modeling: unlike GANs, it learns a density estimate over the data manifold.
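The revision step just described amounts to a few gradient-ascent steps on the learned potential with respect to the sample; a toy NumPy sketch, with a hypothetical quadratic potential standing in for the trained network:

```python
import numpy as np

def revise(x, grad_u, n_steps=10, step_size=0.1):
    """Move samples toward low-energy (high-potential) regions of the RF."""
    for _ in range(n_steps):
        x = x + step_size * grad_u(x)    # ascend the potential u_theta(x)
    return x

# toy stand-in for a trained potential: u(x) = -0.5 * ||x - mu||^2,
# so grad_u(x) = mu - x and the mode mu plays the role of the data manifold
mu = np.array([1.0, -2.0])
grad_u = lambda x: mu - x
x0 = np.zeros((5, 2))
x_rev = revise(x0, grad_u)
```

Each step moves the samples closer to the mode of the stand-in potential, mirroring how revised samples move toward high-density regions of the learned random field.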
Second, a row-wise reading of Table IV reveals that with more revision steps and with SGHMC in training, the SSL classification performance improves. Utilizing SGHMC in inclusive-NRFs to exploit gradient information with momentum yields better performance than simple SGLD. It is also found that more revision steps in model training do not significantly improve the unsupervised IS, so a small number of revision steps suffices in unsupervised learning for generation, which reduces the computational cost.
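Schematically, one update of each sampler looks as follows (a sketch only; the step size `eta` and friction `alpha` are illustrative values, not the settings used in the experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgld_step(x, grad_u, eta=0.01):
    """SGLD: a half-step on the potential gradient plus Gaussian noise."""
    return x + 0.5 * eta * grad_u(x) + np.sqrt(eta) * rng.standard_normal(x.shape)

def sghmc_step(x, v, grad_u, eta=0.01, alpha=0.1):
    """SGHMC: adds a momentum v, exploiting gradient information across steps."""
    v = (1 - alpha) * v + eta * grad_u(x) \
        + np.sqrt(2 * alpha * eta) * rng.standard_normal(x.shape)
    return x + v, v
```

The momentum term is what lets SGHMC accumulate gradient information across steps, which is consistent with its better SSL performance over plain SGLD in Table IV.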
Model | Precision | Recall | F1
OC-SVM [54] | 0.7457 | 0.8523 | 0.7954
DSEBM-e [57] | 0.7369 | 0.7477 | 0.7423
DAGMM [58] | 0.9297 | 0.9442 | 0.9369
ALAD [59] | 0.9427±0.0018 | 0.9577±0.0018 | 0.9501±0.0018
Inclusive-NRF | 0.9452±0.0105 | 0.9600±0.0113 | 0.9525±0.0108
Normal class | OC-SVM | KDE | IF | DCAE | AnoGAN | DSVDD | Inclusive-NRF
Digit 0 | 98.6±0.0 | 97.1±0.0 | 98.0±0.3 | 97.6±0.7 | 96.6±1.3 | 98.0±0.7 | 98.9±0.6
Digit 1 | 99.5±0.0 | 98.9±0.0 | 97.3±0.4 | 98.3±0.6 | 99.2±0.6 | 99.7±0.1 | 99.8±0.1
Digit 2 | 82.5±0.1 | 79.0±0.0 | 88.6±0.5 | 85.4±2.4 | 85.0±2.9 | 91.7±0.8 | 91.8±4.0
Digit 3 | 88.1±0.0 | 86.2±0.0 | 89.9±0.4 | 86.7±0.9 | 88.7±2.1 | 91.9±1.5 | 93.8±2.6
Digit 4 | 94.9±0.0 | 87.9±0.0 | 92.7±0.6 | 86.5±2.0 | 89.4±1.3 | 94.9±0.8 | 95.6±1.9
Digit 5 | 77.1±0.0 | 73.8±0.0 | 85.5±0.8 | 78.2±2.7 | 88.3±2.9 | 88.5±0.9 | 94.9±1.4
Digit 6 | 96.5±0.0 | 87.6±0.0 | 95.6±0.3 | 94.6±0.5 | 94.7±2.7 | 98.3±0.5 | 97.5±2.7
Digit 7 | 93.7±0.0 | 91.4±0.0 | 92.0±0.4 | 92.3±1.0 | 93.5±1.8 | 94.6±0.9 | 96.4±1.0
Digit 8 | 88.9±0.0 | 79.2±0.0 | 89.9±0.4 | 86.5±1.6 | 84.9±2.1 | 93.9±1.6 | 88.9±3.3
Digit 9 | 93.1±0.0 | 88.2±0.0 | 93.5±0.3 | 90.4±1.8 | 92.4±1.1 | 96.5±0.3 | 94.9±1.0
Mean | 91.29 | 86.93 | 92.30 | 89.65 | 91.27 | 94.80 | 95.26
AIRPLANE | 61.6±0.9 | 61.2±0.0 | 60.1±0.7 | 59.1±5.1 | 67.1±2.5 | 61.7±4.1 | 78.1±2.1
AUTOMOBILE | 63.8±0.6 | 64.0±0.0 | 50.8±0.6 | 57.4±2.9 | 54.7±3.4 | 65.9±2.1 | 71.6±2.1
BIRD | 50.0±0.5 | 50.1±0.0 | 49.2±0.4 | 48.9±2.4 | 52.9±3.0 | 50.8±0.8 | 65.4±1.8
CAT | 55.9±1.3 | 56.4±0.0 | 55.1±0.4 | 58.4±1.2 | 54.5±1.9 | 59.1±1.4 | 63.3±1.9