Learning Neural Random Fields with Inclusive Auxiliary Generators

06/01/2018 · by Yunfu Song, et al.

In this paper we develop Neural Random Field learning with Inclusive-divergence minimized Auxiliary Generators (NRF-IAG), which is under-appreciated in the literature. The contributions are two-fold. First, we rigorously apply the stochastic approximation algorithm to solve the joint optimization and provide theoretical justification. The new approach of learning NRF-IAG achieves superior unsupervised learning performance, competitive with state-of-the-art deep generative models (DGMs) in terms of sample generation quality. Second, semi-supervised learning (SSL) with NRF-IAG gives rise to strong classification results comparable to state-of-the-art DGM-based SSL methods, and simultaneously achieves superior generation. This is in contrast to the conflict between good classification and good generation observed in GAN-based SSL.


1 Introduction

One of the core research problems in machine learning is learning with probabilistic models, which can be broadly classified into two classes - directed and undirected graphical models [1, 2]. Apart from the topology difference, an easy way to tell an undirected model from a directed model is that an undirected model involves the normalizing constant (also called the partition function in physics), while the directed model is self-normalized. Recently, significant progress has been made on learning with deep generative models (DGMs), which generally refer to probabilistic models with multiple layers of stochastic or deterministic variables. A number of deep directed models have emerged, such as variational autoencoders (VAEs) [3], generative adversarial networks (GANs) [4], and so on. In contrast, undirected models (also known as random fields [2] or energy-based models [5]) have received less attention and seen slower progress. This is presumably because fitting undirected models is more challenging than fitting directed models. In general, calculating the log-likelihood and its gradient is analytically intractable, because it involves evaluating the normalizing constant and, respectively, the expectation with respect to (w.r.t.) the model distribution.

In this paper, we are interested in developing deep undirected models. Generally, an undirected model, exchangeably termed a random field (RF), defines a probability distribution of the form $p_\theta(x) = \frac{1}{Z(\theta)} \exp(u_\theta(x))$, where $u_\theta(x)$ is usually called the potential function over observation $x$ with parameter $\theta$, and $Z(\theta)$ is the normalizing constant. In most existing random fields, the potential function is defined as a linear function, e.g. $u_\theta(x) = \theta^T f(x)$, where $f(x)$ is a vector of features (usually hand-crafted) and $\theta$ is the corresponding parameter vector. Such RFs are known as log-linear models [2] or exponential families [6].

Note that an attractive property of RF modeling is that one is free to define the potential function in any sensible way, which gives much flexibility. In this paper, we aim to advance the learning of neural random fields, which use neural networks with multiple deterministic layers to define the potential function $u_\theta(x)$. With the potential benefit of exploiting the expressive power of deep neural networks, this type of RF has appeared several times in different contexts with different model definitions, called deep energy models (DEMs) [7, 8], descriptive models [9], generative ConvNet [10], and neural random field language models [11]. For convenience, we refer to such models as neural random fields (NRFs) in general. Conceptually, compared to traditional log-linear RFs, if we could successfully train such NRFs, we could jointly learn the features and the feature weights, which is highly desirable. However, learning NRFs presents a much greater challenge.

An important method for maximum likelihood (ML) learning of random fields is stochastic maximum likelihood (SML) [12], which approximates the model expectations by Monte Carlo sampling for calculating the gradient. A recent line of progress in learning NRFs, as studied in [8, 9, 11, 13], is to pair the target random field $p_\theta$ with an auxiliary directed generative model $q_\phi$ (often called the generator), which approximates sampling from the target random field. Learning is performed by maximizing the log-likelihood of training data under $p_\theta$, or some bound of the log-likelihood, and simultaneously minimizing some divergence between the target random field $p_\theta$ and the auxiliary generator $q_\phi$. Different learning algorithms differ in the objective functions used in the joint training of $p_\theta$ and $q_\phi$, and thus have different computational and statistical properties (partly illustrated in Figure 3). For example, minimizing the exclusive divergence $KL(q_\phi \| p_\theta)$ w.r.t. $\phi$, as employed in [8], involves the intractable entropy term and tends to force the generator to seek modes, yielding missing modes. There are also other factors, e.g. modeling discrete or continuous data and different model choices for the target RF and the generator, which lead to different learning algorithms. We leave a detailed comparison and connection of our approach with existing studies to Section 4 (Related work).

In this paper, we propose to use inclusive-divergence minimized auxiliary generators (Section 3.2). Particularly for continuous data (e.g. images), we propose to use stochastic gradient samplers, including but not limited to SGLD (stochastic gradient Langevin dynamics) and SGHMC (stochastic gradient Hamiltonian Monte Carlo), to exploit noisy gradients in NRF model sampling (Section 3.3). Within our formulation, SGHMC is successfully developed and improves over SGLD in learning NRFs. Notably, this SGLD/SGHMC development is unlike previous applications ([14, 15]), which mainly simulate Bayesian posterior samples in large-scale Bayesian inference, though we use the same terms.

Featuring inclusive auxiliary generators and stochastic gradient sampling, this new approach to learning NRFs for continuous data, abbreviated as the inclusive-NRF approach, is the main contribution of this paper. Conceptually, minimizing the inclusive divergence $KL(p_\theta \| q_\phi)$ w.r.t. $\phi$ avoids the troublesome entropy term and tends to drive the generator to cover the modes of the target density $p_\theta$. The SGLD/SGHMC sampling further pushes the samples towards the modes of $p_\theta$. Presumably, this helps to produce Markov chains that mix fast between modes and facilitates model learning.

As demonstrations of how the new approach can be flexibly and effectively used, specific inclusive-NRF models are developed and thoroughly evaluated on a number of tasks - unsupervised/supervised image generation, semi-supervised classification, and anomaly detection. The proposed models consistently achieve strong experimental results on all these tasks compared to state-of-the-art methods; the results are summarized as follows:

  • Inclusive-NRFs achieve state-of-the-art sample generation quality, measured by both Inception Score (IS) and Frechet Inception Distance (FID). On CIFAR-10, we obtain unsupervised IS 8.28 (FID 20.9) and supervised IS 9.06 (FID 18.1), both using unconditional generation.

  • Semi-supervised inclusive-NRFs show strong classification results on par with state-of-the-art DGM-based semi-supervised learning (SSL) methods, and simultaneously achieve superior generation, on the widely benchmarked datasets - MNIST, SVHN and CIFAR-10.

  • By directly using the potential function $u_\theta(x)$ for sample evaluation, inclusive-NRFs achieve state-of-the-art performance in anomaly detection on the widely benchmarked datasets - KDDCUP, MNIST, and CIFAR-10. This shows that, unlike GANs, the new approach can provide informative density estimates, besides superior sample generation.

The remainder of this paper is organized as follows. After presenting some background on random fields in Section 2, we introduce the inclusive-NRF approach in Section 3. In Section 4, we discuss related work. The extensive experimental evaluations are given in Section 5. We conclude the paper with a discussion in Section 6.

Fig. 1: Overview of the inclusive-NRF approach. Two neural networks are used to define the NRF's potential function $u_\theta(x)$ and the auxiliary generator $g_\phi(h)$ respectively. Both network parameters, $\theta$ and $\phi$, are updated by using the revised samples, which are obtained by revising the samples proposed by the auxiliary generator according to the RF's stochastic gradients.

2 Background on Random Fields

Undirected models, exchangeably termed random fields, form one of the two main classes of probabilistic graphical models [1, 2]. In defining the joint distribution, directed models use conditional probability functions, with the directionality given by the conditioning relationship, whereas undirected models use unnormalized potential functions and are more suitable for capturing interactions among variables, especially when the directionality of a relationship cannot be clearly defined (e.g. between neighboring image pixels).

A random field (RF) defines a probability distribution for a collection of random variables $x$ with parameter $\theta$ in the form:

$p_\theta(x) = \frac{1}{Z(\theta)} \exp(u_\theta(x)), \quad Z(\theta) = \int \exp(u_\theta(x))\, dx,$   (1)

where $Z(\theta)$ is the normalizing constant and $u_\theta(x)$ is called the potential function (negating the potential function defines the energy function), which assigns a scalar value to each configuration of $x$. High probability configurations correspond to high potential/low energy configurations.

There is an extensive literature devoted to maximum likelihood (ML) learning of random fields, as briefly reviewed in [16]. It is usually intractable to maximize the data log-likelihood for observed $x$, since the gradient involves an expectation w.r.t. the model distribution, as shown below:

$\nabla_\theta \log p_\theta(x) = \nabla_\theta u_\theta(x) - E_{p_\theta(\tilde{x})}\left[\nabla_\theta u_\theta(\tilde{x})\right].$   (2)
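To make Eq. (2) concrete, the following sketch (our illustration, not the authors' code) builds a surrogate loss whose gradient w.r.t. the parameters equals the two-term Monte Carlo estimate; it assumes a PyTorch module `u_theta` mapping a batch of observations to scalar potentials, and `model_x` holding approximate samples from $p_\theta$.

```python
import torch

def rf_loglik_surrogate(u_theta, data_x, model_x):
    """Surrogate whose gradient w.r.t. theta is the Monte Carlo estimate of Eq. (2):
    E_data[grad_theta u_theta(x)] - E_model[grad_theta u_theta(x)].
    `model_x` must hold (approximate) samples from p_theta, e.g. from a sampler."""
    return u_theta(data_x).mean() - u_theta(model_x.detach()).mean()

# Usage sketch: ascend the surrogate to ascend the data log-likelihood.
# loss = -rf_loglik_surrogate(u_theta, data_batch, model_batch)
# loss.backward(); optimizer.step()
```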

In graphical modeling terminology, without loss of generality, let each component of $x$ be indexed by a node in a graph. The whole potential function is defined to decompose over cliques of the graph (i.e., fully connected subsets of nodes):

$u_\theta(x) = \sum_{c \in \mathcal{C}} u_\theta^{(c)}(x_c),$   (3)

where $\mathcal{C}$ denotes the collection of cliques in the graph, and $u_\theta^{(c)}(x_c)$ denotes the clique potential defined over the vector of variables $x_c$ indexed by the clique $c$. Such decomposition reduces the complexity of model representation, but possibly at the sacrifice of model expressive capacity, and should respect the inherent modularity in the interactions among variables. In previous studies, particularly in computer vision tasks, grid-like RFs are mostly used; higher-order RFs have been pursued, but most are in fact conditional random fields (CRFs) [17]. Notably, CRFs can only be used for discriminative tasks, e.g. segmenting and labeling natural language sentences or images, and usually have a reduced sample space of labels. Thus, the learning algorithms developed for CRFs are usually not applicable to unconditional RFs.

3 The inclusive-NRF approach

A high-level overview of our inclusive-NRF approach is shown in Figure 1. In the following, after introducing the NRF model (Section 3.1), the two new designs, introducing the inclusive-divergence minimized auxiliary generator and developing stochastic gradient sampling, are elaborated in Section 3.2 and Section 3.3 respectively.

3.1 The NRF model

The general idea of neural random fields (NRFs) is to implement the potential $u_\theta(x)$ by a neural network, which takes the multi-dimensional $x$ as input and outputs the scalar $u_\theta(x)$. In this manner, we can take advantage of the representation power of neural networks for RF modeling. Such an RF is then essentially defined over a fully-connected undirected graph and captures interactions in observations to the largest order, since the neural potential function involves all the components of $x$. Remarkably, the NRFs used in our experiments are different from similar models in previous studies [8, 11, 9], as detailed in Section 4.
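As a concrete illustration (our sketch; the experiments in the paper use CNN/ResNet potentials, an MLP is shown here only for brevity), the neural potential $u_\theta(x)$ is simply any network mapping an observation to one scalar:

```python
import torch
import torch.nn as nn

class PotentialNet(nn.Module):
    """Neural potential u_theta(x): maps an observation to a single scalar."""
    def __init__(self, x_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),            # one scalar potential per example
        )

    def forward(self, x):
        return self.net(x.flatten(1)).squeeze(-1)   # shape (batch,)
```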

3.2 Introducing inclusive-divergence minimized auxiliary generators

As shown in Eq. (2), the bottleneck in learning NRFs is that Monte Carlo sampling from the RF model is needed to approximate the model expectation for calculating the gradient. A recent idea is to introduce an auxiliary generator to approximate sampling from the target RF. In this paper, we are mainly concerned with modeling fixed-dimensional continuous observations (e.g. images). For reasons to become clear in the following, we choose a directed latent-variable generative model $q_\phi(x, h)$ for the auxiliary generator, which is specifically defined as follows (note that during training, $\sigma^2$ is absorbed into the learning rates and does not need to be estimated):

$q_\phi(x, h) = q(h)\, q_\phi(x|h), \quad h \sim \mathcal{N}(0, I_h), \quad x|h \sim \mathcal{N}\left(g_\phi(h), \sigma^2 I_x\right),$   (4)

where $g_\phi(h)$ is implemented as a neural network with parameter $\phi$, which maps the latent code $h$ to the observation space; $I_h$ and $I_x$ denote the identity matrices, with dimensionality implied by $h$ and $x$ respectively. Drawing samples from the generator is simple, as it is just ancestral sampling from a two-variable directed graphical model. This is one reason for choosing Eq. (4) as the generator.
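Drawing from Eq. (4) is plain ancestral sampling, and the joint log-density $\log q_\phi(x, h)$ needed later is available in closed form up to additive constants. A minimal sketch under these definitions (the MLP form of $g_\phi$ and the value of sigma are our placeholders, not the paper's architecture):

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Auxiliary generator of Eq. (4): q_phi(x, h) = N(h; 0, I) N(x; g_phi(h), sigma^2 I)."""
    def __init__(self, h_dim, x_dim, hidden=512, sigma=0.1):
        super().__init__()
        self.h_dim, self.x_dim, self.sigma = h_dim, x_dim, sigma
        self.g = nn.Sequential(
            nn.Linear(h_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, x_dim),
        )

    def ancestral_sample(self, n):
        h = torch.randn(n, self.h_dim)                           # h ~ N(0, I_h)
        x = self.g(h) + self.sigma * torch.randn(n, self.x_dim)  # x | h ~ N(g_phi(h), sigma^2 I_x)
        return x, h

    def log_joint(self, x, h):
        # log q_phi(x, h) up to additive constants; sufficient for all gradients used here
        log_prior = -0.5 * (h ** 2).sum(dim=1)
        log_cond = -0.5 * ((x.flatten(1) - self.g(h)) ** 2).sum(dim=1) / self.sigma ** 2
        return log_prior + log_cond
```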

For a dataset $D$ consisting of $n$ observations, let $\tilde{p}(x)$ denote the empirical data distribution. A new design in this paper is that we perform maximum likelihood learning of $p_\theta$ and simultaneously minimize the inclusive divergence $KL(p_\theta \| q_\phi)$ between the target random field and the auxiliary generator (such optimization with two objectives is employed in a number of familiar learning methods, such as GAN with the log D trick [4] and the wake-sleep algorithm [18]):

$\min_\theta KL\left(\tilde{p}(x) \,\|\, p_\theta(x)\right), \qquad \min_\phi KL\left(p_\theta(x) \,\|\, q_\phi(x)\right).$   (5)

The first line of Eq. (5) is equivalent to maximum likelihood training of the target RF $p_\theta$ under the empirical data $\tilde{p}$, which requires sampling from $p_\theta$. Simultaneously, the second line optimizes the generator $q_\phi$ to be close to $p_\theta$, so that $q_\phi$ becomes a good proposal for sampling from $p_\theta$. By Proposition 1, we can derive the gradients w.r.t. $\theta$ and $\phi$ (to be ascended) as follows:

$\nabla_\theta:\; E_{\tilde{p}(x)}\left[\nabla_\theta u_\theta(x)\right] - E_{p_\theta(x)}\left[\nabla_\theta u_\theta(x)\right], \qquad \nabla_\phi:\; E_{p_\theta(x) q_\phi(h|x)}\left[\nabla_\phi \log q_\phi(x, h)\right].$   (6)

In practice, we apply minibatch-based stochastic gradient descent (SGD) to solve the optimization problem in Eq. (5), as shown in Algorithm 1. Notably, choosing Eq. (4) as the generator yields tractable gradients for the inclusive-divergence minimization.

Proposition 1.

Both lines of Eq. (6) for gradient calculations hold.

Proof.

The first line of Eq. (6) can be obtained by directly taking the derivative of $KL(\tilde{p}(x) \| p_\theta(x))$ w.r.t. $\theta$, as shown below,

$\nabla_\theta KL\left(\tilde{p}(x) \,\|\, p_\theta(x)\right) = -E_{\tilde{p}(x)}\left[\nabla_\theta \log p_\theta(x)\right],$

and then applying the basic formula of Eq. (2).

For the second line, by direct calculation, we first have

$\nabla_\phi KL\left(p_\theta(x) \,\|\, q_\phi(x)\right) = -E_{p_\theta(x)}\left[\nabla_\phi \log q_\phi(x)\right].$

Then combining

$\nabla_\phi \log q_\phi(x) = \frac{1}{q_\phi(x)} \int \nabla_\phi q_\phi(x, h)\, dh$

and

$\nabla_\phi q_\phi(x, h) = q_\phi(x, h)\, \nabla_\phi \log q_\phi(x, h)$

gives $\nabla_\phi \log q_\phi(x) = E_{q_\phi(h|x)}\left[\nabla_\phi \log q_\phi(x, h)\right]$, i.e. the second line of Eq. (6). ∎

  repeat
     Sampling: Draw a minibatch of observations from $\tilde{p}(x)$, and draw samples $(x, h)$ from $p_\theta(x) q_\phi(h|x)$ (see Algorithm 2);
     Updating: Update $\theta$ by ascending a stochastic version of the first line of Eq. (6); update $\phi$ by ascending a stochastic version of the second line of Eq. (6);
  until convergence
Algorithm 1 Learning NRFs with inclusive auxiliary generators
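Algorithm 1 can be condensed into the following sketch (our reading, assuming the PotentialNet/Generator sketches above and a `sample_revision` routine implementing Algorithm 2; all names are hypothetical, not the authors' code):

```python
import torch

def train_step(u_theta, gen, data_batch, opt_theta, opt_phi, sample_revision):
    # Sampling: propose from the generator, then revise towards p_theta (Algorithm 2)
    x0, h0 = gen.ancestral_sample(data_batch.size(0))
    x, h = sample_revision(u_theta, gen, x0, h0)   # approximate samples from p_theta(x) q_phi(h|x)

    # theta-update: ascend E_data[u_theta] - E_model[u_theta]   (first line of Eq. (6))
    loss_theta = -(u_theta(data_batch).mean() - u_theta(x.detach()).mean())
    opt_theta.zero_grad(); loss_theta.backward(); opt_theta.step()

    # phi-update: ascend E[log q_phi(x, h)] over the revised samples   (second line of Eq. (6))
    loss_phi = -gen.log_joint(x.detach(), h.detach()).mean()
    opt_phi.zero_grad(); loss_phi.backward(); opt_phi.step()
```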

3.3 Developing stochastic gradient samplers for NRF model sampling

In Algorithm 1, we need to draw samples from the target distribution $p_\theta(x) q_\phi(h|x)$ given the current $\theta$ and $\phi$. For such continuous distributions, samplers leveraging continuous dynamics (namely continuous-time Markov processes described by stochastic differential equations), such as Langevin dynamics (LD) and Hamiltonian Monte Carlo (HMC) [19], are known to be efficient in exploring the continuous state space. Simulating the continuous dynamics leads to the target distribution as the stationary distribution. The Markov transition kernel defined by the continuous dynamical system usually involves the gradients of the target distribution, which in our case are as follows:

$\nabla_x \log\left[p_\theta(x) q_\phi(h|x)\right] = \nabla_x u_\theta(x) + \nabla_x \log q_\phi(x, h) - \nabla_x \log q_\phi(x), \qquad \nabla_h \log\left[p_\theta(x) q_\phi(h|x)\right] = \nabla_h \log q_\phi(x, h).$   (7)

It can be seen that while it is straightforward to calculate the gradient w.r.t. $h$ and the first two terms in the gradient w.r.t. $x$ (notably, $\nabla_x u_\theta(x)$ does not require the calculation of the normalizing constant), the gradient w.r.t. $x$ contains an intractable term $\nabla_x \log q_\phi(x)$. Therefore we are interested in developing stochastic gradient variants of continuous-dynamics samplers, which rely on a noisy estimate of this gradient.

Recently, stochastic gradient samplers have emerged for simulating posterior samples in large-scale Bayesian inference, such as SGLD (stochastic gradient Langevin dynamics) [14] and SGHMC (stochastic gradient Hamiltonian Monte Carlo) [15]. To illustrate, consider the posterior of model parameters $\theta$ given the observed dataset $D$, with abuse of notation. We have $p(\theta|D) \propto p(\theta) p(D|\theta)$, which is taken as the target distribution. Instead of using full-data gradients $\nabla_\theta \log p(\theta|D)$, which need a sweep over the entire dataset, these samplers subsample the dataset and use stochastic gradients computed on a subsampled data subset in the dynamics simulation. In this manner, the computation cost is significantly reduced in each iteration, and such Bayesian inference methods scale to large datasets.
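In that Bayesian setting, the noisy gradient is simply a minibatch estimate of the full-data gradient of the log posterior, with the likelihood part rescaled by the ratio of dataset size to minibatch size; a generic sketch (`log_prior` and `log_lik` are placeholder callables, not a specific library API):

```python
import torch

def stochastic_grad_log_posterior(theta, data, batch_idx, log_prior, log_lik):
    """Noisy estimate of grad_theta log p(theta | D): gradient of the log prior plus the
    minibatch log-likelihood gradient rescaled by N / |minibatch| (unbiased)."""
    n, b = len(data), len(batch_idx)
    theta = theta.detach().requires_grad_(True)
    obj = log_prior(theta) + (n / b) * log_lik(theta, data[batch_idx]).sum()
    grad, = torch.autograd.grad(obj, theta)
    return grad
```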

In practice, sampling is based on a discretization of the continuous dynamics. Despite the discretization error and the noise introduced by the stochastic gradients, it can be shown that simulating the discretized dynamics with stochastic gradients also leads to the target distribution as the stationary distribution, when the step sizes are annealed to zero at a certain rate. The convergence of SGLD/SGHMC is given in Theorem 1, which is summarized from [20, 15, 21].

Theorem 1.

Denote the target density as $p(z)$ over a continuous variable $z$. Assume that one can compute a noisy, unbiased estimate $\hat{g}(z)$ (a stochastic gradient) of the gradient $\nabla_z \log p(z)$. For a sequence of asymptotically vanishing time-steps $\{\delta_t\}$ (satisfying $\sum_t \delta_t = \infty$ and $\sum_t \delta_t^2 < \infty$) and an i.i.d. noise sequence $\eta_t \sim \mathcal{N}(0, I)$, the SGLD sampler iterates as follows, starting from $z_0$:

$z_{t+1} = z_t + \frac{\delta_t}{2} \hat{g}(z_t) + \sqrt{\delta_t}\, \eta_t.$   (8)

Starting from $z_0$ and $v_0$, the SGHMC sampler iterates as follows:

$v_{t+1} = (1 - \alpha) v_t + \delta_t \hat{g}(z_t) + \sqrt{2\alpha\delta_t}\, \eta_t, \qquad z_{t+1} = z_t + v_{t+1}.$   (9)

The iterations of Eq. (8) and (9) lead to the target distribution as the stationary distribution.
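Written out, one discretized update of each sampler looks as follows (a generic sketch; `grad_log_p` is any noisy estimate of $\nabla_z \log p(z)$, and the friction parameterization of SGHMC shown here is one common choice, assumed rather than taken verbatim from the paper):

```python
import torch

def sgld_step(z, grad_log_p, step):
    """One SGLD iterate, Eq. (8): z <- z + (step/2) * g_hat(z) + sqrt(step) * N(0, I)."""
    return z + 0.5 * step * grad_log_p(z) + step ** 0.5 * torch.randn_like(z)

def sghmc_step(z, v, grad_log_p, step, friction=0.1):
    """One SGHMC iterate, Eq. (9), in momentum/friction form: update v, then move z."""
    v = (1.0 - friction) * v + step * grad_log_p(z) \
        + (2.0 * friction * step) ** 0.5 * torch.randn_like(z)
    return z + v, v
```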

By considering $z = (x, h)$ with the target density $p_\theta(x) q_\phi(h|x)$ and Eq. (7), we can use Theorem 1 to develop the sampling step for Algorithm 1, as presented in Algorithm 2. For the gradient w.r.t. $x$, the intractable term $\nabla_x \log q_\phi(x)$ can be estimated by a stochastic gradient. Motivated by the observation

$\nabla_x \log q_\phi(x) = E_{q_\phi(h'|x)}\left[\nabla_x \log q_\phi(x, h')\right],$   (10)

as proved in Proposition 2, ideally we would draw $h' \sim q_\phi(h'|x)$ and then use $\nabla_x \log q_\phi(x, h')$ as an unbiased estimator of $\nabla_x \log q_\phi(x)$. In practice, at step $t$, given $x_t$ and starting from $h_t$, we run one step of LD sampling over $h'$ targeting $q_\phi(h'|x_t)$ to obtain $h'_t$, and calculate $\nabla_x \log q_\phi(x_t, h'_t)$. This gives a biased but tractable estimator of $\nabla_x \log q_\phi(x_t)$. It is empirically found in our experiments that more steps of this inner LD sampling do not significantly improve the performance of NRF learning.

Proposition 2.

Eq. (10) holds.

Proof.

By direct calculation, $\nabla_x \log q_\phi(x) = \frac{1}{q_\phi(x)} \int \nabla_x q_\phi(x, h')\, dh' = \int \frac{q_\phi(x, h')}{q_\phi(x)} \nabla_x \log q_\phi(x, h')\, dh' = E_{q_\phi(h'|x)}\left[\nabla_x \log q_\phi(x, h')\right]$. ∎

  1. Do ancestral sampling by the generator, namely first drawing $h_0 \sim q(h)$, and then drawing $x_0 \sim q_\phi(x|h_0)$;
  2. Starting from $(x_0, h_0)$, run a finite number $K$ of steps of SGLD/SGHMC to obtain $(x_K, h_K)$, which we call sample revision, according to Eq. (8) or (9).
  In particular, the SGLD recursions are conducted as follows:
$x_{t+1} = x_t + \frac{\delta}{2} \left[ \nabla_x u_\theta(x_t) + \nabla_x \log q_\phi(x_t, h_t) - \nabla_x \log q_\phi(x_t, h'_t) \right] + \sqrt{\delta}\, \eta^x_t, \quad h_{t+1} = h_t + \frac{\delta}{2} \nabla_h \log q_\phi(x_t, h_t) + \sqrt{\delta}\, \eta^h_t.$   (11)
  The SGHMC recursions are conducted analogously, by applying Eq. (9) jointly to $(x, h)$ with the same stochastic gradients as in Eq. (11).   (12)
  Given $x_t$ and starting from $h_t$, we run one step (or more steps) of LD to obtain $h'_t$, which can be regarded as an approximate sample from $q_\phi(h'|x_t)$:
$h'_t = h_t + \frac{\delta_{LD}}{2} \nabla_h \log q_\phi(x_t, h_t) + \sqrt{\delta_{LD}}\, \eta'_t.$   (13)
  Return $(x_K, h_K)$.
Algorithm 2 Sampling from $p_\theta(x) q_\phi(h|x)$

So instead of using the exact gradients shown in Eq. (7), we develop tractable, biased stochastic gradients as follows:

$\hat{g}_x(x, h) = \nabla_x u_\theta(x) + \nabla_x \log q_\phi(x, h) - \nabla_x \log q_\phi(x, h'), \qquad \hat{g}_h(x, h) = \nabla_h \log q_\phi(x, h),$   (14)

where $h'$ is an approximate sample from $q_\phi(h'|x)$, obtained by one or more steps of LD from $h$. Remarkably, as we show in Algorithm 2, the starting point $(x_0, h_0)$ for the SGLD/SGHMC recursions is obtained by ancestral sampling from $q_\phi(x, h)$. Thus at step $t = 0$, $h_0$ is already a sample from $q_\phi(h|x_0)$ given $x_0$, and we can directly use $h_0$ as $h'_0$ without running the inner LD sampling. Afterwards, for $t \geq 1$, the conditional distribution of $h_t$ given $x_t$ is close to $q_\phi(h_t|x_t)$, though not strictly equal. We can run one or more steps of LD to obtain $h'_t$ to reduce the bias in the stochastic gradient estimator.

With the above stochastic gradients in Eq. (14), the sampling step in Algorithm 1 can be performed by running multiple parallel chains, each executed by running a finite number of SGLD/SGHMC steps with tractable gradients w.r.t. both $x$ and $h$, as shown in Algorithm 2. Intuitively, the generator first gives a proposal $(x_0, h_0)$, and then the system follows the gradients w.r.t. $x$ and $h$ to revise $(x_0, h_0)$ into $(x_K, h_K)$. The gradient terms pull samples towards the low-energy region of the random field and adjust the latent code of the generator, while the noise terms bring randomness. In this manner, we obtain Markov chain samples from $p_\theta(x) q_\phi(h|x)$.
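A minimal sketch of the revision step, combining the SGLD form of Eq. (11), the inner LD step of Eq. (13) and the stochastic gradients of Eq. (14); this is our reading of Algorithm 2 (reusing the `Generator.log_joint` helper sketched earlier), not the authors' implementation:

```python
import torch

def sample_revision(u_theta, gen, x0, h0, steps=10, step_size=1e-2, ld_step=1e-2):
    """Revise an ancestral proposal (x0, h0) towards p_theta(x) q_phi(h|x) by SGLD (Eq. (11))."""
    x = x0.detach().requires_grad_(True)
    h = h0.detach().requires_grad_(True)
    for t in range(steps):
        # h'_t: approximate sample from q_phi(h'|x_t). At t = 0, h_0 already qualifies;
        # afterwards, take one inner LD step over h' starting from h_t (Eq. (13)).
        if t == 0:
            h_prime = h.detach()
        else:
            hp = h.detach().clone().requires_grad_(True)
            gp, = torch.autograd.grad(gen.log_joint(x.detach(), hp).sum(), hp)
            h_prime = (hp + 0.5 * ld_step * gp
                       + ld_step ** 0.5 * torch.randn_like(hp)).detach()
        # Stochastic gradients of Eq. (14):
        #   w.r.t. x: grad u_theta(x) + grad_x log q_phi(x, h) - grad_x log q_phi(x, h')
        #   w.r.t. h: grad_h log q_phi(x, h)
        obj_x = u_theta(x).sum() + gen.log_joint(x, h.detach()).sum() \
                - gen.log_joint(x, h_prime).sum()
        gx, = torch.autograd.grad(obj_x, x)
        gh, = torch.autograd.grad(gen.log_joint(x.detach(), h).sum(), h)
        # SGLD update over (x, h), Eq. (11)
        with torch.no_grad():
            x = x + 0.5 * step_size * gx + step_size ** 0.5 * torch.randn_like(x)
            h = h + 0.5 * step_size * gh + step_size ** 0.5 * torch.randn_like(h)
        x.requires_grad_(True)
        h.requires_grad_(True)
    return x.detach(), h.detach()
```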

Fig. 2: Sampler performance measured by the KL divergence, with 10 independent runs to obtain standard deviations. "CoopNet" denotes the sampling method in [9]. "LD" and "HMC" denote Langevin dynamics and Hamiltonian Monte Carlo sampling of the target distribution with exact gradients. "SGLD" and "SGHMC" are our developed samplers with stochastic gradients (Algorithm 2). We fix the total number of iterations over $x$ and $h$ to be the same for each sampling method; thus one iteration of CoopNet is counted as 20 iterations of the other methods in the figure.

To examine the sampling performance of the developed SGLD/SGHMC samplers, we conduct a synthetic experiment; the results are shown in Figure 2. The target $p_\theta(x)$ and the joint $q_\phi(x, h)$ are 50-dimensional and 100-dimensional Gaussians respectively, with randomly generated covariance matrices (i.e. both $x$ and $h$ are 50-dimensional). For evaluation, we simulate multiple parallel chains for a fixed number of steps. We follow [21] in evaluating the sampler's performance by the KL divergence from the empirical Gaussian (estimated from the samples) to the ground truth $p_\theta(x)$. We use the decaying stepsize schedule of [21] for all methods, with an additional momentum hyperparameter for SGHMC, and we find that these hyperparameters perform well for each method in the experiment. The main observations are as follows. First, SGLD and SGHMC converge, though somewhat worse than their counterparts using exact gradients. Second, the HMC samplers, whether using exact or stochastic gradients, outperform the corresponding LD samplers, since HMC dynamics, also referred to as second-order Langevin dynamics, exploit an additional momentum term. Third, interestingly, the SGHMC sampler outperforms the LD sampler with exact gradients. This reveals the benefit of our systematic development of stochastic gradient samplers, including but not limited to SGLD and SGHMC. Although the CoopNet sampler of [9] (described in related work) performs close to our SGLD sampler, it performs much worse than our SGHMC sampler. SGHMC is a new development, which cannot be obtained by simply extending the CoopNet sampler.

Finally, note that, as discussed before, using a finite number of steps in Eqs. (11)(12) and in Eq. (13) when applying SGLD/SGHMC sampling from $p_\theta(x) q_\phi(h|x)$ will produce biased estimates of the gradients w.r.t. $\theta$ and $\phi$ in Eq. (6) for NRF learning. We did not find this to pose problems for the SGD optimization in practice, as similarly found in [22] and [13], which also work with biased gradient estimators.

3.4 Semi-supervised learning with inclusive-NRFs

In the following, we apply our inclusive-NRF approach in the SSL setting to show its flexibility. Note that different models are needed in unsupervised and semi-supervised learning, because SSL needs to additionally consider labels apart from observations.

Model definition. In semi-supervised tasks, we consider the following RF for joint modeling of observation $x$ and class label $y$:

$p_\theta(x, y) = \frac{1}{Z(\theta)} \exp\left(u_\theta(x, y)\right).$   (15)

This is different from Eq. (1) for unsupervised learning, which only models $x$ without labels. To implement the potential function $u_\theta(x, y)$, we consider a neural network $\Phi_\theta(x)$, with $x$ as the input and the output size equal to the number of class labels $K$. Then we define $u_\theta(x, y) = \mathbb{1}(y)^T \Phi_\theta(x)$, where $\mathbb{1}(y)$ represents the one-hot encoding vector for the label $y$; i.e., $u_\theta(x, y)$ is the $y$-th output (logit) of $\Phi_\theta(x)$. In this manner, the conditional density $p_\theta(y|x)$ is the classifier, defined as follows:

$p_\theta(y|x) = \frac{\exp\left(u_\theta(x, y)\right)}{\sum_{y'} \exp\left(u_\theta(x, y')\right)},$   (16)

which acts like multi-class logistic regression using the $K$ logits calculated from $x$ by the neural network $\Phi_\theta$. Note that we do not need to calculate $Z(\theta)$ for classification. The auxiliary generator is implemented the same as in Eq. (4), i.e. an unconditional generator.

With the joint density defined in Eq. (15), it can be shown that, with abuse of notation, the marginal density is $p_\theta(x) = \frac{1}{Z(\theta)} \exp\left(u_\theta(x)\right)$, where $u_\theta(x) = \log \sum_y \exp\left(u_\theta(x, y)\right)$.
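Under this definition, both the classifier of Eq. (16) and the marginal potential come directly from the $K$ logits, and $Z(\theta)$ is never needed; a short sketch (`phi_net` stands for any network with $K$ outputs, a name of our choosing):

```python
import torch
import torch.nn.functional as F

def conditional_and_marginal(phi_net, x):
    """Given the K logits Phi_theta(x), return log p_theta(y|x) (Eq. (16)) and the
    marginal potential u_theta(x) = logsumexp_y u_theta(x, y); Z(theta) is not required."""
    logits = phi_net(x)                        # u_theta(x, y) for all y, shape (batch, K)
    log_p_y_given_x = F.log_softmax(logits, dim=1)
    u_x = torch.logsumexp(logits, dim=1)
    return log_p_y_given_x, u_x
```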

Model learning. Suppose that among the data $D$, only a small subset of the observations, for example the first $m$ observations, have class labels. Denote these labeled data as $L = \{(x_i, y_i)\}_{i=1}^m$, with empirical distribution $\tilde{p}_L(x, y)$. Then we can formulate semi-supervised learning as jointly optimizing

$\min_\theta \left\{ KL\left(\tilde{p}(x) \,\|\, p_\theta(x)\right) - \alpha\, E_{\tilde{p}_L(x, y)}\left[\log p_\theta(y|x)\right] \right\}, \qquad \min_\phi KL\left(p_\theta(x) \,\|\, q_\phi(x)\right),$   (17)

which are hybrids of generative and discriminative criteria, similar to [23, 24, 25]. The hyper-parameter $\alpha$ controls the relative weight between the generative and discriminative criteria. Similar to the derivation of Eq. (6), it can easily be seen that the gradients w.r.t. $\theta$ and $\phi$ (to be ascended) are as follows:

$\nabla_\theta:\; E_{\tilde{p}(x)}\left[\nabla_\theta u_\theta(x)\right] - E_{p_\theta(x)}\left[\nabla_\theta u_\theta(x)\right] + \alpha\, E_{\tilde{p}_L(x, y)}\left[\nabla_\theta \log p_\theta(y|x)\right], \qquad \nabla_\phi:\; E_{p_\theta(x) q_\phi(h|x)}\left[\nabla_\phi \log q_\phi(x, h)\right].$   (18)

In practice, we calculate noisy gradient estimators and apply minibatch-based stochastic gradient descent (SGD) to solve the optimization problem in Eq. (17), as shown in Algorithm 3 in Appendix A. Apart from the basic losses shown in Eq. (17), there are some regularization losses that are found to be helpful in guiding SSL training; they are presented in Appendix B.
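A compact sketch of the resulting objective for $\theta$ (our reading of Eqs. (17)-(18), ignoring the additional regularization losses of Appendix B; `alpha` is the trade-off weight and `sample_revision` is the routine sketched earlier):

```python
import torch
import torch.nn.functional as F

def ssl_theta_loss(phi_net, gen, unlabeled_x, labeled_x, labels, sample_revision, alpha=1.0):
    """Hybrid SSL criterion for theta: generative term of Eq. (6), with
    u_theta(x) = logsumexp of the logits, plus alpha * supervised log p(y|x) (Eq. (17))."""
    u = lambda x: torch.logsumexp(phi_net(x), dim=1)         # marginal potential u_theta(x)
    x0, h0 = gen.ancestral_sample(unlabeled_x.size(0))
    x_model, _ = sample_revision(u, gen, x0, h0)             # approximate samples from p_theta(x)
    gen_term = u(unlabeled_x).mean() - u(x_model.detach()).mean()
    sup_term = -F.cross_entropy(phi_net(labeled_x), labels)  # = mean log p_theta(y|x)
    return -(gen_term + alpha * sup_term)                    # minimize negative = ascend Eq. (18)
```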

To conclude, we show that inclusive-NRFs can be easily applied to SSL. To the best of our knowledge, there are no prior studies applying random fields to SSL in this manner, and the semi-supervised inclusive-NRF model defined above is itself novel for SSL.

4 Related work

Comparisons and connections of our inclusive-NRF approach with existing studies are provided below, from three perspectives.

Learning NRFs.  These studies are most relevant to this work, which aims to learn NRFs. The classic method for learning RFs is the SML method [12], which works with the single target model $p_\theta(x)$. Compared to learning traditional RFs, which mainly use linear potential functions, learning NRFs, which use NN-based nonlinear potential functions, is more challenging. A recent line of progress in learning NRFs, as studied in [8, 9, 11, 13], is to jointly train the target random field $p_\theta$ and an auxiliary generator $q_\phi$. Different studies differ in the objective functions used in the joint training, and thus have different computational and statistical properties.

(1)  It is shown in Proposition 3 in Appendix C that learning in [8] minimizes the exclusive divergence $KL(q_\phi \| p_\theta)$ w.r.t. $\phi$, which involves the intractable entropy term and tends to force the generator to seek modes, yielding missing modes. We refer to this approach as exclusive-NRF.

(2)  Learning in [11] and in this paper minimizes the inclusive divergence $KL(p_\theta \| q_\phi)$ w.r.t. $\phi$. But noticeably, this paper presents our development of NRFs for continuous data, which is fundamentally different from [11], which addresses discrete data. The target NRF model, the generator and the sampler all require new designs. [11] mainly studies random field language models, using LSTM generators (autoregressive, with no latent variables) and employing the Metropolis independence sampler (MIS), which is applicable to discrete data (natural sentences). In this paper, we design random field models for continuous data (e.g. images), choosing latent-variable generators and developing SGLD/SGHMC to exploit noisy gradients in the continuous space.

(3)  In [9] (CoopNet), motivated by interweaving maximum likelihood training of the random field $p_\theta$ and the latent-variable generator $q_\phi$, a joint training method is introduced to train NRFs. Our inclusive-NRF approach differs from [9] in two key aspects. First, the CoopNet method uses LD sampling to generate samples, but the two LD sampling steps are intuitively interleaved according to $p_\theta$ and $q_\phi$ separately, without aiming to draw samples from $p_\theta(x) q_\phi(h|x)$ and without awareness of stochastic gradients. Specifically, starting from the ancestral sample $(x_0, h_0)$, a first LD run with a finite number of steps is conducted to obtain a revised $x$ according to $p_\theta(x)$, and then a second LD run with a finite number of steps is used to obtain a revised $h$ according to the generator's posterior, with the revised $x$ fixed. Algorithmically, this is different from our sampling step, which moves $(x, h)$ jointly, as systematically developed in Section 3.3. Specifically, starting from $(x_0, h_0)$, our SGLD recursions shown in Eq. (11) are conducted to obtain $(x_K, h_K)$. Notably, SGHMC is a new development, which cannot be obtained by simply extending the CoopNet sampler. Moreover, our development allows easy incorporation of more advanced elements in stochastic gradient samplers, such as utilizing the Riemannian geometry of the target distribution via preconditioning [21].

Second, let $\pi$ denote the distribution of the revised samples resulting from the interleaved Langevin transitions. The interpretation presented in [9] relates their method to a joint optimization problem defined through $\pi$, which is also different from our learning objectives as shown in Eq. (5). Thus, learning in [9] does not aim to minimize the inclusive divergence $KL(p_\theta \| q_\phi)$ w.r.t. $\phi$.

Empirically, as shown in Figure 2, the CoopNet sampler performs much worse than our SGHMC sampler. It is further shown in Table II that inclusive-NRF with SGLD outperforms CoopNet in image generation, and in Table IV that utilizing SGHMC in learning inclusive-NRFs to exploit gradient information with momentum yields better performance than using SGLD.

(4)  Learning in [13] minimizes a divergence w.r.t. $\phi$ that also tends to drive the generator to cover modes. But this approach is severely limited by the high variance of its gradient estimators, and is only tested on the simpler MNIST and Omniglot datasets.

Additionally, different NRF studies also differ in the models used in the joint training. The target NRF used in this work is different from those in previous studies [8, 11, 9]. The differences are: [8] includes additional linear and squared terms in $u_\theta(x)$, [11] defines $p_\theta$ over discrete-valued sequences, and [9] defines $p_\theta(x)$ in the form of exponential tilting of a reference distribution (Gaussian white noise). There also exist different choices for the generator, such as GAN generators in [8], LSTMs in [11], or latent-variable models in [9] and this work.

To sum up, our contributions in introducing inclusive-divergence minimized auxiliary generators and developing stochastic gradient sampling enable the solid development of the new inclusive-NRF approach. Moreover, all previous NRF studies examine unsupervised learning only, and none shows application or extension of their methods or models to semi-supervised learning.

Monte Carlo sampling.  One step in our inclusive-NRF approach is to apply SGLD/SGHMC to draw samples from the target density $p_\theta(x) q_\phi(h|x)$, starting from the proposal sample given by the generator. Theoretically, improvements in sampling methods could potentially be integrated into NRF learning algorithms. For example, it has recently been studied in [26] how to learn MCMC transition kernels, also parameterized by neural networks, to improve HMC sampling from a given target distribution. Integration into learning NRFs is interesting but outside the scope of this paper.

Comparison and connection with GANs.  On the one hand, there are some efforts that aim to address the inability of GANs to provide sensible energy estimates for samples. The energy-based GAN (EBGAN) [27] proposes to view the discriminator as an energy function by designing an auto-encoder discriminator. The recent work in [28] connects [27] and [8], and shows another two approximations for the entropy term. However, it is known that as the generator converges to the true data distribution, the GAN discriminator converges to a degenerate uniform solution. This fundamentally limits the GAN discriminator's ability to provide density information, though there are some modifications. In contrast, our inclusive-NRFs, unlike GANs, naturally provide (unnormalized) density estimates, which is examined with GMM synthetic experiments and anomaly detection benchmarking experiments. Moreover, none of the above energy-related GAN studies examine their methods or models for SSL, except EBGAN, which performs moderately.

On the other hand, there are interesting connections between inclusive-NRFs and GANs, as elaborated in Appendix D. When interpreting the potential function $u_\theta(x)$ as the critic in Wasserstein GANs [29], inclusive-NRFs appear similar to Wasserstein GANs. A difference is that in optimizing $\theta$ in inclusive-NRFs, the generated samples are further revised by taking finite gradient steps of $u_\theta(x)$ w.r.t. $x$. However, the critic in Wasserstein GANs can hardly be interpreted as an unnormalized log-density. Thus, strictly speaking, inclusive-NRFs are not GAN-like.

5 Experiments

As demonstrations of how the new inclusive-NRF approach can be flexibly and effectively used, specific inclusive-NRF models are developed and thoroughly evaluated for a number of tasks - unsupervised/supervised image generation, semi-supervised classification and anomaly detection.

First, we report experiments on synthetic datasets, which helps to illustrate different models and learning methods. Then, extensive experiments are conducted to evaluate the performances of our approach (inclusive-NRFs) and various existing methods on real-world datasets. We refer to Appendix E for experimental details and additional results.

5.1 GMM synthetic experiment for unsupervised learning

Fig. 3: Comparison of different methods over GMM synthetic data. Panels: (a) training data, (b) GAN generation, (c) WGAN-GP generation, (d) exclusive-NRF generation, (e) inclusive-NRF generation, (f) inclusive-NRF revision, (g) exclusive-NRF potential, (h) inclusive-NRF potential. Stochastic generations from GAN with the logD trick, WGAN-GP, exclusive-NRF, inclusive-NRF generation (i.e. sampling from the auxiliary generator) and inclusive-NRF revision (i.e. after sample revision) are shown in (b)-(f) respectively. Each generation contains 1,000 samples. The learned potentials from exclusive and inclusive NRFs are shown in (g) and (h) respectively, where the red dots indicate the means of the Gaussian components. Inclusive-NRFs are clearly superior in learning the data density and in sample generation.
TABLE I: Numerical evaluations over the GMM (32 components) synthetic data, comparing GAN with logD trick [4], WGAN-GP [30], Exclusive-NRF [8], Inclusive-NRF generation, and Inclusive-NRF revision on the "covered modes" and "realistic ratio" metrics. The "covered modes" metric is defined as the number of modes covered by a set of generated samples. The "realistic ratio" metric is defined as the proportion of generated samples which are close to a mode. The measurement details are presented in the text. Mean and SD are from 10 independent runs.

The synthetic data consist of 1,600 training examples generated from a 2D Gaussian mixture model (GMM) with 32 equally-weighted, low-variance Gaussian components, uniformly laid out on four concentric circles as in Figure 3(a). The data distribution exhibits many modes separated by large low-probability regions, which makes it suitable for examining how well different learning methods deal with multiple modes. For comparison, we experiment with GAN with the logD trick [4] and WGAN-GP [30] as directed generative models, and exclusive-NRF [8] and inclusive-NRF as undirected generative models.

The network architectures and hyperparameters are the same for all methods, as listed in Table VII in the Appendix. We use SGLD [14] for inclusive-NRFs on this synthetic dataset, with empirically chosen revision hyperparameters.

Figure 3 visually shows the generated samples from the models trained using the different methods. Table I reports the "covered modes" and "realistic ratio" as numerical measures of how well the multi-modal data are fitted, similarly to [31]. We use the following procedure to estimate the metrics "covered modes" and "realistic ratio" for each trained model (a code sketch is given after the list):

  1. Stochastically generate 100 samples.

  2. A mode is defined to be covered (not missed) if there exist generated samples located close to the mode (with squared distance below a threshold), and those samples are said to be realistic.

  3. Count how many modes are covered and calculate the proportion of realistic samples.

  4. Repeat the above steps 100 times and perform averaging.
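For concreteness, the two metrics for one batch of generated samples can be computed as follows (a sketch; the squared-distance threshold is a placeholder, since the value used in the paper is not recoverable here):

```python
import numpy as np

def coverage_metrics(samples, modes, sq_dist_threshold):
    """samples: (N, 2) generated points; modes: (M, 2) GMM component means.
    Returns (#covered modes, realistic ratio) for one batch of generated samples."""
    d2 = ((samples[:, None, :] - modes[None, :, :]) ** 2).sum(axis=-1)   # (N, M) squared distances
    realistic = d2.min(axis=1) < sq_dist_threshold   # sample lies close to some mode
    covered = d2.min(axis=0) < sq_dist_threshold     # mode has some nearby sample
    return int(covered.sum()), float(realistic.mean())
```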

For each method, we independently train 10 models and calculate the mean and standard deviation (SD) across the 10 independent runs. The main observations are as follows:

  • GAN suffers from mode missing, generating realistic but not diverse samples. WGAN-GP increases “covered modes” but decreases “realistic ratio”. Inclusive-NRF performs much better than both GAN and WGAN-GP in sample generation.

  • Inclusive-NRF outperforms exclusive-NRF in both sample generation and density estimation.

  • After revision, samples from inclusive-NRF become more like real samples, achieving the best results in both the "covered modes" and "realistic ratio" metrics.

5.2 GMM synthetic experiment for semi-supervised learning

Fig. 4: SSL toy experiment based on semi-supervised inclusive-NRFs. Panels: (a) training data, (b)-(d) learned potentials. Each class has 4 labeled points, red dots for class 1 and blue for class 2. The learned potentials for the marginal and the two class-conditional densities are shown in (b), (c) and (d) respectively.
TABLE II: Inception score (IS) and FID on CIFAR-10 for unsupervised and supervised learning. Methods compared: DCGAN [35], Improved-GAN [33], WGAN-GP [30], SGAN [36], DFM [37], CT-GAN [38], Fisher-GAN [39], CoopNet [9], BWGAN [40], SNGAN [41], and Inclusive-NRF generation, each reported under the unsupervised and supervised settings where available. "-" means the result is not reported in the original work.

In this experiment, we present the performance of semi-supervised inclusive-NRFs for SSL on a synthetic dataset. In addition to illustrating how semi-supervised inclusive-NRFs work, this experiment further emphasizes that the inclusive-NRF approach can provide (unnormalized) density estimates for the marginal and the class-conditional densities. In contrast, the use of GANs as general-purpose probabilistic generative models has been limited by the difficulty of using them to provide density estimates, or even unnormalized potential values, for sample evaluation.

The dataset is a 2D GMM with 16 Gaussian components, uniformly laid out on two concentric circles. The two circles represent two different classes. There are only 4 labeled points per class and a total of 400 unlabeled points. The network architectures are the same as in Section 5.1, except that the neural network implementing the potential function for SSL now has two output units. As shown in Figure 4, the semi-supervised trained inclusive-NRFs not only capture the marginal potential effectively, but also learn the class-conditional potentials successfully. This behavior agrees with our design idea of blending unsupervised and supervised learning for the semi-supervised setting, as shown in Eq. (18).

5.3 Image generation on CIFAR-10

In this experiment, we examine both unsupervised and supervised learning on the widely used real-world dataset CIFAR-10 [32] for image generation. To evaluate generation quality quantitatively, we use the inception score (IS) [33] (the larger the better) and the Frechet inception distance (FID) [34] (the smaller the better). Table II reports the inception score and FID for state-of-the-art methods, in both the unsupervised and supervised settings. The supervised learning of inclusive-NRF is conducted as a special case of semi-supervised learning with all images labeled, which uses unconditional generation. We use a ResNet in this experiment; see Appendix E.1 for experimental details.

From the comparison results in Table II, it can be seen that the proposed inclusive-NRF model achieves the best inception score on CIFAR-10, to the best of our knowledge, in both the unsupervised and supervised settings. Some generated samples are shown in Figure 7(c)(d) in the Appendix for the unsupervised and supervised settings respectively. We also show the capability of inclusive-NRFs in latent space interpolation (Appendix F) and conditional generation (Appendix G).

5.4 Semi-supervised learning on MNIST, SVHN and CIFAR-10

Methods | MNIST error (%) | SVHN error (%) | CIFAR-10 error (%) | CIFAR-10 IS
Generative SSL methods (upper block): CatGAN [42], SDGM [43], Ladder network [44], ADGM [43], Improved-GAN [33], EBGAN [27], ALI [31], Triple-GAN [45], Triangle-GAN [46], BadGAN [47], Sobolev-GAN [48], Semi-supervised inclusive-NRF.
Discriminative SSL methods (lower block; results cannot be directly compared to those above): VAT small [49] (6.83 on SVHN, 14.87 on CIFAR-10), Π model [50], Temporal Ensembling [50], Mean Teacher [51], VAT+EntMin [49], CT-GAN [38].
TABLE III: Comparison with state-of-the-art methods on three benchmark datasets. "CIFAR-10 IS" is the inception score for samples generated by SSL models trained on CIFAR-10. Some results are obtained by running the released code accompanying the corresponding papers. "-" means the result is not reported in the original work and no code is released. "/" means not applicable, e.g. the model cannot generate samples stochastically. Some methods use image data augmentation, which significantly helps classification performance. The upper/lower blocks show generative/discriminative SSL methods respectively.

For semi-supervised learning, we consider the three widely used benchmark datasets, namely MNIST [52], SVHN [53], and CIFAR-10 [32]. As in previous work, we randomly sample 100, 1,000, and 4,000 labeled samples from MNIST, SVHN, and CIFAR-10 respectively during training, and use the standard data split for testing. See Appendix E.2 for experimental details.

It can be seen from Table III that semi-supervised inclusive-NRFs produce strong classification results on par with state-of-the-art DGM-based SSL methods. See Figure 7(a)(b) in the Appendix for generated samples. BadGAN achieves better classification results, but as indicated by the low inception score, its generation is much worse than that of semi-supervised inclusive-NRFs. In fact, among DGM-based SSL methods, inclusive-NRFs achieve the best performance in sample generation. This is in contrast to the conflict between good classification and good generation observed in GAN-based SSL [33, 47]. It is analyzed in [47] that good GAN-based SSL requires a bad generator (this analysis is based on using the (K+1)-class GAN-like discriminator objective for SSL; to the best of our knowledge, the conflict does not seem to be reported in previous generative SSL methods [23, 24], which use a K-class classifier as in semi-supervised inclusive-NRFs). This is awkward and in fact runs counter to the original idea of generative SSL, that successful generative training, which indicates good generation, provides regularization for finding good classifiers [23, 24]. In this sense, BadGAN can hardly be classified as a generative SSL method.

Finally, note that some discriminative SSL methods, as listed in the lower block of Table III, also produce superior performance by utilizing data augmentation and consistency regularization. However, these methods are unable to generate (realistic) samples. Discriminative SSL methods utilize different regularization from generative SSL methods and cannot be directly compared to them. Their combination, as an interesting direction for future work, could yield further performance improvements.

Training settings compared: SGLD (three settings) and SGHMC; for each setting we report the unsupervised generation IS, the unsupervised revision IS, and the semi-supervised classification error (%).
TABLE IV: Ablation study of our inclusive-NRF method on CIFAR-10, regarding the effects of using SGLD or SGHMC in training and of applying sample revision in inference (generating samples). Mean and SD are from 5 independent runs for each training setting. In each training setting, for unsupervised learning, two manners of generating samples given a trained NRF are compared, as previously illustrated in Figure 3 over synthetic GMM data: generated samples (i.e. directly from the generator) and revised samples (i.e. after sample revision), evaluated in terms of inception score (IS). For semi-supervised learning, we examine the classification error rates.

5.5 Ablation study

We report the results of the ablation study of our inclusive-NRF method on CIFAR-10 in Table IV. In this experiment, we use the standard CNN [41] for unsupervised learning and the same networks as those used in Table III for semi-supervised learning. See Appendix E.3 for experimental details. We analyze the effects of different settings in model training, such as using SGLD or SGHMC and the number of revision steps used. For each training setting, we also compare the two manners of generating samples, i.e. whether or not sample revision is applied in inference (generating samples) given a trained NRF, as previously illustrated in Figure 3 over synthetic GMM data. The main observations are as follows.

First, given a trained NRF, after revision (i.e. following the gradient of the RF's potential w.r.t. $x$), the quality (IS) of samples is always improved, as shown by the consistent IS improvement from the second column (generation) to the third (revision). This is in accordance with the results of the GMM synthetic experiments in Section 5.1. Moreover, note that in revision, it is the estimated density that guides the samples towards the low-energy region of the random field. This also demonstrates one benefit of random field modeling, which, unlike GANs, can learn a density estimate of the data manifold.

Second, a row-wise reading of Table IV reveals that with more revision steps and with SGHMC in training, the SSL classification performance is improved. Utilizing SGHMC in inclusive-NRFs to exploit gradient information with momentum yields better performance than simple SGLD. It is also found that more revision steps in model training do not significantly improve the unsupervised IS, so we can use a small number of revision steps in unsupervised learning for generation, which reduces the computational cost.

Model Precision Recall F1
OC-SVM [54] 0.7457 0.8523 0.7954
DSEBM-e [57] 0.7369 0.7477 0.7423
DAGMM [58] 0.9297 0.9442 0.9369
ALAD [59] 0.9427±0.0018 0.9577±0.0018 0.9501±0.0018
Inclusive-NRF 0.9452±0.0105 0.9600±0.0113 0.9525±0.0108
TABLE V: Anomaly detection results on the KDDCUP dataset. Results for OC-SVM, DSEBM-e, and DAGMM are taken from [58]. Inclusive-NRF results are obtained from 20 runs, each with a random split of training and test sets, as in [58]. The ALAD result uses a fixed split with random parameter initializations, and thus has smaller standard deviations.
Normal Class OC-SVM KDE IF DCAE AnoGAN DSVDD Inclusive-NRF
Digit 0 98.6±0.0 97.1±0.0 98.0±0.3 97.6±0.7 96.6±1.3 98.0±0.7 98.9±0.6
Digit 1 99.5±0.0 98.9±0.0 97.3±0.4 98.3±0.6 99.2±0.6 99.7±0.1 99.8±0.1
Digit 2 82.5±0.1 79.0±0.0 88.6±0.5 85.4±2.4 85.0±2.9 91.7±0.8 91.8±4.0
Digit 3 88.1±0.0 86.2±0.0 89.9±0.4 86.7±0.9 88.7±2.1 91.9±1.5 93.8±2.6
Digit 4 94.9±0.0 87.9±0.0 92.7±0.6 86.5±2.0 89.4±1.3 94.9±0.8 95.6±1.9
Digit 5 77.1±0.0 73.8±0.0 85.5±0.8 78.2±2.7 88.3±2.9 88.5±0.9 94.9±1.4
Digit 6 96.5±0.0 87.6±0.0 95.6±0.3 94.6±0.5 94.7±2.7 98.3±0.5 97.5±2.7
Digit 7 93.7±0.0 91.4±0.0 92.0±0.4 92.3±1.0 93.5±1.8 94.6±0.9 96.4±1.0
Digit 8 88.9±0.0 79.2±0.0 89.9±0.4 86.5±1.6 84.9±2.1 93.9±1.6 88.9±3.3
Digit 9 93.1±0.0 88.2±0.0 93.5±0.3 90.4±1.8 92.4±1.1 96.5±0.3 94.9±1.0
Mean 91.29 86.93 92.30 89.65 91.27 94.80 95.26
AIRPLANE 61.6±0.9 61.2±0.0 60.1±0.7 59.1±5.1 67.1±2.5 61.7±4.1 78.1±2.1
AUTOMOBILE 63.8±0.6 64.0±0.0 50.8±0.6 57.4±2.9 54.7±3.4 65.9±2.1 71.6±2.1
BIRD 50.0±0.5 50.1±0.0 49.2±0.4 48.9±2.4 52.9±3.0 50.8±0.8 65.4±1.8
CAT 55.9±1.3 56.4±0.0 55.1±0.4 58.4±1.2 54.5±1.9 59.1±1.4 63.3±1.9