 # A Tale of Three Probabilistic Families: Discriminative, Descriptive and Generative Models

The pattern theory of Grenander is a mathematical framework where the patterns are represented by probability models on random variables of algebraic structures. In this paper, we review three families of probability models, namely, the discriminative models, the descriptive models, and the generative models. A discriminative model is in the form of a classifier. It specifies the conditional probability of the class label given the input signal. The descriptive model specifies the probability distribution of the signal, based on an energy function defined on the signal. A generative model assumes that the signal is generated by some latent variables via a transformation. We shall review these models within a common framework and explore their connections. We shall also review the recent developments that take advantage of the high approximation capacities of deep neural networks.


## 1 Introduction

Initially developed by Grenander in the 1970s, pattern theory [30, 31] is a unified mathematical framework for representing, learning and recognizing the patterns that arise in science and engineering. The objects in pattern theory are usually of high complexity or dimensionality, defined in terms of their constituent elements and the bonds between them. The patterns of these objects are characterized both by algebraic structures governed by local and global rules and by the probability distributions of the associated random variables. Such a framework encompasses most of the probability models in various disciplines. In the 1990s, Mumford advocated the pattern theoretical framework for computer vision, so that learning and inference can be based on probability models.

Despite its generality, developing probability models in the pattern theoretical framework remains a challenging task. In this article, we shall review three families of models, which we call the discriminative models, the descriptive models, and the generative models, following the terminology of . A discriminative model is in the form of a classifier. It specifies the conditional probability of the output class label given the input signal. Such a model can be learned in the supervised setting where a training dataset of input signals and the corresponding output labels is provided. A descriptive model specifies the probability distribution of the signal, based on an energy function defined on the signal through some descriptive feature statistics extracted from the signal. Such models originated from statistical physics, where they are commonly called the Gibbs distributions. The descriptive models belong to the broader class of energy-based models that include non-probabilistic models as well as models with latent variables. A generative model assumes that the signal is generated by some latent variables via a deterministic transformation. A prototype example is factor analysis, where the signal is generated by some latent factors via a linear transformation. Both the descriptive models and generative models can be learned in the unsupervised setting where the training dataset only consists of input signals without the corresponding output labels.

In this paper, we shall review these three families of models within a common framework and explore their connections. We shall start from the flat linear forms of these models. Then we shall present the hierarchical non-linear models, where the non-linear mappings in these models are parametrized by neural networks [59, 58] that have proved exceedingly effective in approximating non-linear relationships.

Currently the most successful family of models is the discriminative models. A discriminative model is in the form of the conditional distribution of the class label given the input signal. The normalizing constant of such a probability model is a summation over the finite number of class labels or categories. It is readily available, so that the model can be easily learned from big datasets. The learning of the descriptive models and the generative models can be much more challenging. A descriptive model is defined as a probability distribution of the signal, which is usually of a high dimensionality. The normalizing constant of such a model is an integral over the high dimensional signal and is analytically intractable. A generative model involves latent variables that follow some prior distribution, so that the marginal distribution of the observed signal is obtained by integrating out the latent variables, and this integral is also analytically intractable. Due to the intractabilities of the integrals in the descriptive and generative models, the learning of such models usually requires Markov chain Monte Carlo (MCMC) sampling [25, 65]. Specifically, the learning of the descriptive models requires MCMC sampling of the synthesized signals, while the learning of the generative models requires MCMC sampling of the latent variables. Nonetheless, we shall show that such learning methods work reasonably well: the gradient-based Langevin dynamics can be employed conveniently for MCMC sampling, which serves as an inner loop within the gradient-based learning of the model parameters.

Because of the high capacity of the neural networks in approximating highly non-linear mappings, the boundary between representation and computation is blurred in neural networks. A deep neural network can be used to represent how the signal is generated or how the features are defined. It can also be used to approximate the solution of a computational problem such as optimization or sampling. For example, the iterative sampling of the latent variables of a generative model can be approximated by an inference model that provides the posterior samples directly, as is the case with the wake-sleep algorithm  and the variational auto-encoder (VAE) [55, 78, 68]. As another example, the iterative sampling of a descriptive model can be approximated by a generative model that can generate the signal directly . In general, the solutions to the on-line computational problems can be encoded by high capacity neural networks, so that iterative computations only occur in the off-line learning of the model parameters.

The three families of models do not exist in isolation. There are intimate connections between them. [33, 34] proposed to integrate the descriptive and generative models into a hierarchical model. [96, 98] proposed data-driven MCMC where the MCMC is to fit the generative models, but the proposal distributions for MCMC transitions are provided by discriminative models. The discriminative model and the descriptive model can be translated into each other via the Bayes rule. Tu  exploited this relationship to learn the descriptive model via discriminative training, thus unifying the two models. Similarly, the discriminative model can be paired with the generative model in the generative adversarial networks (GAN) , and the adversarial learning has become an alternative framework to likelihood-based learning. The descriptive model and the generative model can also be paired up so that they can jump-start each other’s MCMC sampling. Moreover, the family of descriptive models and the family of generative models overlap in terms of undirected latent energy-based models.

## 2 Non-hierarchical linear forms of the three families

We shall first review the non-hierarchical linear forms of the three families of models within a common framework.

### 2.1 Discriminative models

This subsection reviews the linear form of the discriminative models.

The table below displays the dataset for training the discriminative models.

There are $n$ training examples. For the $i$-th example, let $X_i$ be the $p$-dimensional input signal (the $(n, p)$ notation is commonly used in statistics to denote the number of observations and the number of predictors). Let $Y_i$ be the outcome label. In the case of classification, $Y_i$ is categorical or binary. $h_i$ is the $d$-dimensional vector of features or hidden variables.

The discriminative models can be represented by the diagram below,

$$
\begin{array}{c}
\text{output}: Y_i \\
\uparrow \\
\text{features}: h_i \\
\uparrow \\
\text{input}: X_i
\end{array}
\tag{1}
$$

where the vector of features $h_i$ is computed from $X_i$ via $h_i = h(X_i)$. In a non-hierarchical or flat model, the feature vector $h_i$ is designed, not learned, i.e., $h(\cdot)$ is a pre-specified non-linear transformation.

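For the case of binary classification where $Y_i \in \{+1, -1\}$ (abbreviated as $+$ and $-$), the logistic regression model is

$$
\log \frac{\Pr(Y_i = + \mid X_i)}{\Pr(Y_i = - \mid X_i)} = h_i^\top \theta + b, \tag{2}
$$

where $\theta$ is the $d$-dimensional vector of weight or coefficient parameters, and $b$ is the bias or intercept parameter. The classification can also be based on the perceptron model

$$
\hat{Y}_i = \operatorname{sign}(h_i^\top \theta + b), \tag{3}
$$

where $\operatorname{sign}(r) = +1$ if $r \ge 0$, and $\operatorname{sign}(r) = -1$ otherwise. Both the logistic regression and the perceptron can be generalized to the multi-category case. The bias term $b$ can be absorbed into the weight parameters $\theta$ if we fix $h_{i1} = 1$.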
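The two decision rules above can be sketched in a few lines of NumPy. This is only an illustration: the feature map `features`, the parameter values, and the convention that a score of exactly zero is classified as $+1$ are assumptions made for the example.

```python
import numpy as np

def logistic_prob(H, theta, b):
    """Pr(Y = +1 | X) under logistic regression: log-odds = h^T theta + b."""
    score = H @ theta + b
    return 1.0 / (1.0 + np.exp(-score))

def perceptron_predict(H, theta, b):
    """Perceptron rule: +1 when h^T theta + b >= 0, otherwise -1."""
    score = H @ theta + b
    return np.where(score >= 0, 1, -1)

def features(X):
    """Hypothetical hand-designed feature map h(X) = (x1, x2, x1 * x2)."""
    return np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])

X = np.array([[1.0, 2.0], [-1.0, 0.5]])   # two input signals
theta = np.array([0.5, -0.25, 1.0])       # illustrative weights
b = 0.1                                   # illustrative bias

H = features(X)                           # designed, not learned
probs = logistic_prob(H, theta, b)
labels = perceptron_predict(H, theta, b)
```

Both rules share the same linear score in the features; they differ only in whether the score is turned into a probability or thresholded directly.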

Let $f(X) = h(X)^\top \theta$. The function $f(X)$ captures the relationship between $X$ and $Y$. Because $h(X)$ is non-linear, $f(X)$ is also non-linear. We say the model is in the linear form because it is linear in $\theta$, or $f(X)$ is a linear combination of the features in $h(X)$. The following are the choices of $h(\cdot)$ in various discriminative models.

Kernel machine: $h_i = h(X_i)$ is implicit, and the dimension of $h_i$ can potentially be infinite. The implementation of this method is based on the kernel trick $\langle h(X), h(X') \rangle = K(X, X')$, where $K$ is a kernel that is explicitly used by the classifier such as the support vector machine. $f(X) = h(X)^\top \theta$ belongs to the reproducing kernel Hilbert space where the norm of $f$ can be defined as the Euclidean norm of $\theta$, and the norm is used to regularize the model. A Bayesian treatment leads to the Gaussian process, where $\theta$ is assumed to follow $\mathrm{N}(0, \sigma^2 I_d)$, and $I_d$ is the identity matrix of dimension $d$. $f(X)$ is a Gaussian process with $\mathrm{Cov}(f(X), f(X')) = \sigma^2 K(X, X')$.

Boosting machine: For $h_i = (h_{ik}, k = 1, \dots, d)^\top$, each $h_{ik} \in \{+1, -1\}$ is a weak classifier or a binary feature extracted from $X_i$, and $f(X) = h(X)^\top \theta$ is a committee of weak classifiers.

CART: In the classification and regression trees, there are $d$ rectangular regions $\{R_k, k = 1, \dots, d\}$ resulting from recursive binary partition of the space of $X$, and each $h_{ik} = 1(X_i \in R_k)$ is the binary indicator such that $h_{ik} = 1$ if $X_i \in R_k$ and $h_{ik} = 0$ otherwise. $f(X) = h(X)^\top \theta$ is a piecewise constant function.

MARS: In the multivariate adaptive regression splines, the components of $h(X)$ are hinge functions such as $\max(0, x_j - t)$ (where $x_j$ is the $j$-th component of $X$, $j = 1, \dots, p$, and $t$ is a threshold) and their products. It can be considered a continuous version of CART.

Encoder and decoder: In the diagram in (1), the transformation $X_i \rightarrow h_i$ is called an encoder, and the transformation $h_i \rightarrow Y_i$ is called a decoder. In the non-hierarchical model, the encoder is designed, and only the decoder is learned. This is different from the auto-encoder in unsupervised learning.

The outcome $Y_i$ can also be continuous or a high-dimensional vector. The learning then becomes a regression problem. Both classification and regression are forms of supervised learning because for each input $X_i$, an output $Y_i$ is provided as supervision. Reinforcement learning is similar to supervised learning except that the guidance is in the form of a reward function.

### 2.2 Descriptive model

This subsection describes the linear form of the descriptive models and the maximum likelihood learning algorithm.

The descriptive models can be learned in the unsupervised setting, where the labels $Y_i$ are not observed, as illustrated by the table below.

The linear form of the descriptive model is an exponential family model. It specifies a probability distribution on the signal $X$ via an energy function that is a linear combination of the features,

$$
p_\theta(X) = \frac{1}{Z(\theta)} \exp\left[h(X)^\top \theta\right] p_0(X), \tag{4}
$$

where $h(X)$ is the $d$-dimensional feature vector extracted from $X$, and $\theta$ is the $d$-dimensional vector of weight parameters. $p_0(X)$ is a known reference distribution such as the white noise model $X \sim \mathrm{N}(0, \sigma^2 I_p)$, or the uniform distribution within a bounded range.

$$
Z(\theta) = \int \exp\left[h(X)^\top \theta\right] p_0(X)\, dX = \mathrm{E}_{p_0}\left\{\exp\left[h(X)^\top \theta\right]\right\} \tag{5}
$$

is the normalizing constant ($\mathrm{E}_p$ denotes the expectation with respect to $p$). It is analytically intractable.

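The descriptive model (4) has the following information theoretical property [16, 116, 2]. Let $P_{\rm data}$ be the distribution that generates the training data $\{X_i\}$. Let $\Theta = \{p_\theta, \forall \theta\}$ be the family of distributions defined by the descriptive model. Let $\Omega = \{p : \mathrm{E}_p[h(X)] = \hat{h}\}$, where $\hat{h} = \mathrm{E}_{P_{\rm data}}[h(X)]$. $\hat{h}$ can be estimated from the observed data by the sample average $\frac{1}{n}\sum_{i=1}^{n} h(X_i)$. $\Omega$ is the family of distributions that reproduce the observed $\hat{h}$.

Figure 1: The two curves illustrate $\Theta$ and $\Omega$ respectively, where each point is a probability distribution.

Let $\hat{p} = p_{\hat{\theta}} \in \Theta \cap \Omega$ be the intersection between $\Theta$ and $\Omega$. Then for any $p_\theta \in \Theta$ and any $p \in \Omega$, we have $\mathrm{KL}(p \| p_\theta) = \mathrm{KL}(p \| \hat{p}) + \mathrm{KL}(\hat{p} \| p_\theta)$, which can be interpreted as a Pythagorean property that defines orthogonality. $\mathrm{KL}(p \| q) = \mathrm{E}_p[\log(p(X)/q(X))]$ denotes the Kullback-Leibler divergence from $p$ to $q$. Thus $\Theta$ and $\Omega$ are orthogonal to each other, $\Theta \perp \Omega$, as illustrated by Figure 1.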

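This leads to the following dual properties of $\hat{p}$, which can be considered the learned model:

(1) Maximum likelihood: $\hat{p} = \arg\min_{\Theta} \mathrm{KL}(P_{\rm data} \| p_\theta)$. That is, $\hat{p}$ is the projection of $P_{\rm data}$ on $\Theta$. $\mathrm{KL}(P_{\rm data} \| p_\theta) = \mathrm{E}_{P_{\rm data}}[\log P_{\rm data}(X)] - \mathrm{E}_{P_{\rm data}}[\log p_\theta(X)]$. The second term is the population version of the log-likelihood. Thus minimizing $\mathrm{KL}(P_{\rm data} \| p_\theta)$ is equivalent to maximizing the likelihood.

(2) Maximum entropy: $\hat{p} = \arg\min_{\Omega} \mathrm{KL}(p \| p_0)$. That is, $\hat{p}$ is the minimal modification of $p_0$ to reproduce the observed feature statistics $\hat{h}$. $\mathrm{KL}(p \| p_0) = \mathrm{E}_p[\log p(X)] - \mathrm{E}_p[\log p_0(X)]$. If $p_0$ is the uniform distribution, then the second term is a constant, and the first term is the negative entropy. In that case, minimizing $\mathrm{KL}(p \| p_0)$ is equivalent to maximizing the entropy over $\Omega$.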
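The Pythagorean property can be checked directly from the exponential family form (4). The following is a short derivation sketch. The notation is assumed here: $\hat{p} = p_{\hat{\theta}}$ is the member of the model family that reproduces the observed feature statistics $\hat{h}$, $p$ is any distribution with $\mathrm{E}_p[h(X)] = \hat{h}$, and $p_\theta$ is any member of the model family.

$$
\begin{aligned}
\mathrm{KL}(p \| p_\theta) - \mathrm{KL}(p \| \hat{p})
&= \mathrm{E}_p\left[\log \hat{p}(X) - \log p_\theta(X)\right] \\
&= \mathrm{E}_p[h(X)]^\top (\hat{\theta} - \theta) - \log Z(\hat{\theta}) + \log Z(\theta) \\
&= \hat{h}^\top (\hat{\theta} - \theta) - \log Z(\hat{\theta}) + \log Z(\theta) \\
&= \mathrm{E}_{\hat{p}}[h(X)]^\top (\hat{\theta} - \theta) - \log Z(\hat{\theta}) + \log Z(\theta) \\
&= \mathrm{E}_{\hat{p}}\left[\log \hat{p}(X) - \log p_\theta(X)\right]
= \mathrm{KL}(\hat{p} \| p_\theta).
\end{aligned}
$$

The reference distribution $p_0$ cancels in the log-ratio, and the step from the third to the fourth line uses the fact that $\hat{p}$ itself reproduces $\hat{h}$. Taking $p = P_{\rm data}$ gives the maximum likelihood property, and taking $p_\theta = p_0$ (that is, $\theta = 0$) gives the maximum entropy property.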

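Given the training data $\{X_i, i = 1, \dots, n\}$, let $L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \log p_\theta(X_i)$ be the log-likelihood. The gradient of $L(\theta)$ is

$$
L'(\theta) = \frac{1}{n}\sum_{i=1}^{n} h(X_i) - \mathrm{E}_\theta[h(X)], \tag{6}
$$

because $\partial \log Z(\theta)/\partial \theta = \mathrm{E}_\theta[h(X)]$, where $\mathrm{E}_\theta$ denotes the expectation with respect to $p_\theta$. This leads to a stochastic gradient ascent algorithm for maximizing $L(\theta)$,

$$
\theta_{t+1} = \theta_t + \eta_t \left[\frac{1}{n}\sum_{i=1}^{n} h(X_i) - \frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}} h(\tilde{X}_i)\right], \tag{7}
$$

where $\{\tilde{X}_i, i = 1, \dots, \tilde{n}\}$ are random samples from $p_{\theta_t}$, and $\eta_t$ is the learning rate. The learning algorithm has an “analysis by synthesis” interpretation. The $\{\tilde{X}_i\}$ are the synthesized data generated by the current model. The learning algorithm updates the parameters in order to make the synthesized data similar to the observed data in terms of the feature statistics. At the maximum likelihood estimate $\hat{\theta}$, the model matches the data: $\mathrm{E}_{\hat{\theta}}[h(X)] = \frac{1}{n}\sum_{i=1}^{n} h(X_i)$.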
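The "analysis by synthesis" update can be sketched on a toy problem. The setup below is an assumption chosen so that the example runs exactly: the signal lives on a small one-dimensional grid, the reference distribution is uniform, the features are $h(X) = (x, x^2)$, and synthesized examples are drawn by exact sampling in place of MCMC. The names `theta_true`, `grid`, and the learning-rate schedule are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy state space: X takes values on a grid, p0 is uniform on the grid.
grid = np.linspace(-2.0, 2.0, 41)
H = np.column_stack([grid, grid ** 2])         # h(X) for every grid point

def model_probs(theta):
    """p_theta on the grid: exp(h(X)^T theta) p0(X) / Z(theta), p0 uniform."""
    logits = H @ theta
    logits = logits - logits.max()             # numerical stability
    w = np.exp(logits)
    return w / w.sum()

# "Observed" data generated from a hypothetical true parameter.
theta_true = np.array([1.0, -1.5])
obs_idx = rng.choice(len(grid), size=5000, p=model_probs(theta_true))
h_obs = H[obs_idx].mean(axis=0)                # observed feature statistics

theta = np.zeros(2)
for t in range(2000):
    eta = 0.5 / (1.0 + 0.01 * t)               # decaying learning rate
    # Synthesis: sample from the current model (exact sampling, not MCMC).
    syn_idx = rng.choice(len(grid), size=500, p=model_probs(theta))
    h_syn = H[syn_idx].mean(axis=0)            # synthesized feature statistics
    # Analysis: move theta so the synthesized statistics approach the observed.
    theta = theta + eta * (h_obs - h_syn)

h_model = model_probs(theta) @ H               # exact E_theta[h(X)] on the grid
```

After training, `h_model` should be close to `h_obs`, which is the moment-matching condition satisfied at the maximum likelihood estimate. In realistic high-dimensional settings the exact sampling step would be replaced by MCMC such as Langevin dynamics.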

One important class of descriptive models is the Markov random field models [5, 26], such as the Ising model in statistical physics. Such models play an important role in pattern theory.

Figure 2: Two types of potential functions learned from natural images. The function on the left encourages big filter responses and creates patterns via reaction, while the function on the right prefers small filter responses and smoothes the synthesized image via diffusion.

One example of the descriptive model (4) is the FRAME (Filters, Random field, And Maximum Entropy) model [116, 103], where $h(X)$ consists of histograms of responses from a bank of filters. In a simplified non-convolutional version, $h(X)^\top \theta = f(WX) = \sum_{k=1}^{d} f_k(W_k X)$, where $W$ is a $d \times p$ matrix, and $W_k$ is the $k$-th row of $W$. $WX$ consists of the $d$ filter responses with each row of $W$ being a linear filter. $(f_k, k = 1, \dots, d)$ are $d$ one-dimensional potential functions applied respectively to the $d$ elements of $WX$. In the FRAME model, the rows of $W$ are a bank of Gabor wavelets or filters. Given the filters, the potential functions were learned from natural images. There are two types of potential functions, as shown in Figure 2. The function on the left encourages big filter responses while the function on the right prefers small filter responses. The Langevin dynamics was used to sample from the learned model. The gradient descent component of the dynamics is interpreted as the Gibbs Reaction And Diffusion Equations (GRADE), where the function on the left of Figure 2 is for reaction to create patterns, while the function on the right is for diffusion to smooth out the synthesized image.

Figure 3: Learning a two dimensional FRAME model by sequentially adding rows to $W$. Each row of $W$ corresponds to a projection of the data. Each step finds the projection that reveals the maximum difference between the observed data and the synthesized data generated by the current model.

The idea of learning can be illustrated by a two-dimensional example (see Figure 3). Each step of the learning algorithm adds a row $W_k$ to the current $W$. Each row corresponds to a projection of $X$. Each step finds a direction of the projection that reveals the maximum difference between the data points sampled from the current model and the observed data points. The learning algorithm then updates the model to match the marginal distributions of the model and the data in that direction. After a few steps, the distribution of the learned model is almost the same as the distribution of the observed data. By assuming a parametric differentiable form for $f_k(\cdot)$, $W$ can be learned by gradient descent. Such models are called product of experts [42, 93] or field of experts.

Figure 4: Under the uniform distribution of images defined on a large lattice (that goes to $\mathbb{Z}^2$) where the images share the same marginal histograms of filter responses, the conditional distribution of the local image patch given its boundary (in blue color) follows the FRAME model.

The FRAME model is convolutional, where the rows of $W$ can be partitioned into different groups, and the rows in the same group are spatially translated versions of each other, like wavelets. They are called filters or kernels. The model can be justified by a uniform distribution over the images defined on a large lattice that goes to $\mathbb{Z}^2$, where all the images share the same marginal histograms of filter responses. Under such a uniform distribution, the distribution of the local image patch defined on a local lattice $\Lambda$ conditional on its boundary (illustrated by the blue color, including all the pixels outside $\Lambda$ that can be covered by the same filters as the pixels within $\Lambda$) follows the FRAME model. See Figure 4 for an illustration.

### 2.3 Generative models

This subsection reviews various versions of the linear generative models. These models share the same linear form, but they differ in terms of the prior assumptions of the latent factors or coefficients.

Like the descriptive models, the generative models can be learned in the unsupervised setting, where the labels $Y_i$ are not observed, as illustrated below:

In a generative model, the vector $h_i$ is not a vector of features extracted from the signal $X_i$. Instead, $h_i$ is a vector of hidden variables that is used to generate $X_i$, as illustrated by the following diagram:

$$
\begin{array}{c}
\text{hidden}: h_i \\
\downarrow \\
\text{input}: X_i
\end{array}
\tag{8}
$$

The components of the $d$-dimensional $h_i$ are variably called factors, sources, components or causes.

Auto-encoder: $h_i$ is also called a code in the auto-encoder illustrated by the following diagram:

$$
\begin{array}{c}
\text{code}: h_i \\
\uparrow \downarrow \\
\text{input}: X_i
\end{array}
\tag{9}
$$

The direction from $h_i$ to $X_i$ is called the decoder, and the direction from $X_i$ to $h_i$ is called the encoder. The decoder corresponds to the generative model in (8), while the encoder can be considered the inference model.

Distributed representation and disentanglement: $h_i = (h_{ik}, k = 1, \dots, d)$ is called a distributed representation of $X_i$. Usually the components of $h_i$, $(h_{ik}, k = 1, \dots, d)$, are assumed to be independent, and $(h_{ik})$ are said to disentangle the variations in $X_i$.

Embedding: $h_i$ can also be considered the coordinates of $X_i$, if we embed $X_i$ into a low-dimensional space, as illustrated by the following diagram:

$$
\leftarrow h_i \rightarrow \;\mid\; \leftarrow X_i \rightarrow \tag{10}
$$

In the training data, we find an $h_i$ for each $X_i$, so that $\{h_i, i = 1, \dots, n\}$ preserve the relative relations between $\{X_i, i = 1, \dots, n\}$. The prototype example of embedding is multi-dimensional scaling, where we want to preserve the Euclidean distances between the examples. A more recent example of embedding is local linear embedding. In the embedding framework, there are no explicit encoder and decoder.

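Linear generative model: The linear form of the generative model is as follows:

$$
X_i = W h_i + \epsilon_i, \tag{11}
$$

for $i = 1, \dots, n$, where $W$ is a $p \times d$ matrix ($p$ is the dimensionality of $X_i$ and $d$ is the dimensionality of $h_i$), and $\epsilon_i$ is a $p$-dimensional residual vector. The following are the interpretations of $W$:

(1) Loading matrix: Let $W = (w_{jk})_{p \times d}$. Then $x_{ij} \approx \sum_{k=1}^{d} w_{jk} h_{ik}$, i.e., each component of $X_i$, $x_{ij}$, is a linear combination of the latent factors. $w_{jk}$ is the loading weight of factor $k$ on variable $j$.

(2) Basis vectors: Let $W = (W_k, k = 1, \dots, d)$, where $W_k$ is the $k$-th column of $W$. Then $X_i \approx \sum_{k=1}^{d} h_{ik} W_k$, i.e., $X_i$ is a linear superposition of the basis vectors $(W_k)$, where $h_{ik}$ are the coefficients.

(3) Matrix factorization: $(X_1, \dots, X_n) \approx W (h_1, \dots, h_n)$, where the $p \times n$ matrix $(X_1, \dots, X_n)$ is factorized into the $p \times d$ matrix $W$ and the $d \times n$ matrix $(h_1, \dots, h_n)$.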

The following are some of the commonly assumed prior distributions or constraints on $h_i$.

Factor analysis: $h_i \sim \mathrm{N}(0, I_d)$, $X_i = W h_i + \epsilon_i$, $\epsilon_i \sim \mathrm{N}(0, \sigma^2 I_p)$, and $\epsilon_i$ is independent of $h_i$. The dimensionality of $h_i$, which is $d$, is smaller than the dimensionality of $X_i$, which is $p$. Factor analysis is very similar to principal component analysis (PCA), which is a popular tool for dimension reduction. The difference is that in factor analysis, the column vectors of $W$ do not need to be orthogonal to each other.

The factor analysis model originated from psychology, where $X_i$ consists of the test scores of student $i$ on $p$ subjects. $h_i$ consists of the verbal intelligence and the analytical intelligence of student $i$ ($d = 2$). Another example is the decathlon competition, where $X_i$ consists of the scores of athlete $i$ on $p = 10$ sports, and $h_i$ consists of athlete $i$'s speed, strength and endurance ($d = 3$).

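Independent component analysis: In ICA, for $h_i = (h_{ik}, k = 1, \dots, d)$, $h_{ik} \sim P_k$ independently, and $P_k$ are assumed to be heavy-tailed distributions. For analytical tractability, ICA assumes that $d = p$, and $\epsilon_i = 0$. Hence $X_i = W h_i$, where $W$ is a square matrix assumed to be invertible. $h_i = A X_i$, where $A = W^{-1}$. Let $p_k$ denote the density of $P_k$ and $A_k$ the $k$-th row of $A$. The marginal distribution of $X_i$ has a closed form $p(X_i) = \prod_{k=1}^{d} p_k(A_k X_i)\, |\det(A)|$. The ICA model is both a generative model and a descriptive model.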

Sparse coding: In the sparse coding model, the dimensionality of $h_i$, which is $d$, is bigger than the dimensionality of $X_i$, which is $p$. However, $h_i = (h_{ik}, k = 1, \dots, d)$ is a sparse vector, meaning that only a small number of $h_{ik}$ are non-zero, although for different examples $i$, the non-zero elements in $h_i$ can be different. Thus unlike PCA, sparse coding provides adaptive dimension reduction. $W = (W_k, k = 1, \dots, d)$ is called a redundant dictionary because $d > p$, and each $W_k$ is a basis vector or a “word” in the dictionary. Each $X_i \approx \sum_{k} h_{ik} W_k$ is explained by a small number of $W_k$ selected from the dictionary, depending on which $h_{ik}$ are non-zero. The inference of the sparse vector $h_i$ can be accomplished by Lasso or basis pursuit [94, 8] that minimizes $\|X_i - W h_i\|^2 + \lambda \|h_i\|_{\ell_1}$, which imposes the sparsity inducing $\ell_1$ regularization on $h_i$ with a regularization parameter $\lambda$.

Figure 5: Sparse coding: learned basis vectors from natural image patches. Each image patch in the picture is a column vector of $W$.

A Bayesian probabilistic formulation is to assume a spike-slab prior: $h_{ik} \sim \rho\, \delta_0 + (1 - \rho)\, \mathrm{N}(0, \tau^2)$ with a small $1 - \rho$, which is the probability that $h_{ik}$ is non-zero.

Figure 5 displays a sparse code learned from a training set of natural image patches of size . Each column of , , is a basis vector that can be made into an image patch as shown in the figure.
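The Lasso inference step for sparse coding can be sketched with the iterative shrinkage-thresholding algorithm (ISTA). ISTA is one standard solver chosen here for illustration, not necessarily the solver used in the cited work. The objective is assumed to be $\|X - W h\|^2 + \lambda \|h\|_{\ell_1}$, and the dictionary `W`, the sparse code `h_true`, and all sizes are synthetic, hypothetical values.

```python
import numpy as np

def soft_threshold(v, t):
    """Element-wise soft thresholding, the proximal map of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def objective(X, W, h, lam):
    """Lasso objective ||X - W h||^2 + lam * ||h||_1 (assumed form)."""
    return np.sum((X - W @ h) ** 2) + lam * np.sum(np.abs(h))

def ista(X, W, lam, n_iter=500):
    """Minimize the Lasso objective over h by proximal gradient steps."""
    # Step size 1/L, where L = 2 * ||W||_2^2 bounds the gradient's Lipschitz constant.
    step = 1.0 / (2.0 * np.linalg.norm(W, 2) ** 2)
    h = np.zeros(W.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * W.T @ (W @ h - X)           # gradient of the squared error
        h = soft_threshold(h - step * grad, step * lam)
    return h

rng = np.random.default_rng(0)
p, d = 20, 50                                     # redundant dictionary: d > p
W = rng.standard_normal((p, d))
W /= np.linalg.norm(W, axis=0)                    # unit-norm basis vectors
h_true = np.zeros(d)
h_true[[3, 17, 41]] = [2.0, -1.5, 1.0]            # only three active "words"
X = W @ h_true + 0.01 * rng.standard_normal(p)

lam = 0.1
h_hat = ista(X, W, lam)
```

The soft-thresholding step sets small coefficients exactly to zero, which is how the $\ell_1$ penalty selects a small number of basis vectors for each signal.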

Non-negative matrix factorization: In NMF, $h_i$ is constrained to have non-negative components, i.e., $h_{ik} \ge 0$ for all $k$. It is also called positive factor analysis. The rationale for NMF is that the parts of a pattern should be additive and the parts should contribute positively.

Matrix factorization for recommender system: In a recommender system, $X_i = (x_{ij}, j = 1, \dots, p)$ are the ratings of user $i$ on the $p$ items. For instance, in the Netflix example, there are $n$ users and $p$ movies, and $x_{ij}$ is user $i$'s rating of movie $j$. Let $w_j$ be the $j$-th row of matrix $W$, then $x_{ij} = \langle w_j, h_i \rangle + \epsilon_{ij}$, where $h_i$ characterizes the desires of user $i$ in $d$ aspects, and $w_j$ characterizes the desirabilities of item $j$ in the corresponding aspects. The rating matrix thus admits a rank-$d$ factorization. The rating matrix is in general incomplete. However, we can still estimate $(h_i)$ and $(w_j)$ from the observed ratings and use them to complete the rating matrix for the purpose of recommendation.

Probabilistic formulation: In the above models, there is a prior model $h_i \sim p(h)$ or a prior constraint such as $h_i$ being sparse or non-negative. Then there is a linear generative model $X_i = W h_i + \epsilon_i$, with $\epsilon_i \sim \mathrm{N}(0, \sigma^2 I_p)$, for $i = 1, \dots, n$. This defines the conditional distribution $p(X \mid h; W)$. The joint distribution is $p(h, X \mid W) = p(h)\, p(X \mid h; W)$. The marginal distribution is obtained by integrating out $h$:

$$
p(X \mid W) = \int p(h)\, p(X \mid h; W)\, dh = \int p(h, X \mid W)\, dh. \tag{12}
$$

This integral is in general analytically intractable. According to the Bayes rule, $h$ can be inferred from $X$ based on the posterior distribution, $p(h \mid X; W) = p(h, X \mid W) / p(X \mid W)$, which is proportional to $p(h, X \mid W)$ as a function of $h$. We call $p(h \mid X; W)$ the inference model.

In the auto-encoder terminology, $p(h)$ and $p(X \mid h; W)$ define the decoder, while $p(h \mid X; W)$ defines the encoder. In factor analysis and independent component analysis, $h_i$ can be inferred in closed form. For other models, however, $h_i$ needs to be inferred by an iterative algorithm.
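As a worked instance of closed-form inference, consider factor analysis under the assumptions (stated here for the sketch) that $h \sim \mathrm{N}(0, I_d)$ and $X = W h + \epsilon$ with $\epsilon \sim \mathrm{N}(0, \sigma^2 I_p)$ independent of $h$. Completing the square in $h$ in $\log p(h) + \log p(X \mid h; W)$ gives a Gaussian posterior:

$$
\begin{aligned}
\log p(h) + \log p(X \mid h; W)
&= -\frac{1}{2} h^\top h - \frac{1}{2\sigma^2} \|X - W h\|^2 + \text{const} \\
&= -\frac{1}{2} h^\top \left(I_d + \frac{W^\top W}{\sigma^2}\right) h + \frac{1}{\sigma^2} h^\top W^\top X + \text{const}, \\
p(h \mid X; W) &= \mathrm{N}(\mu_X, \Sigma), \quad
\Sigma = \sigma^2 \left(W^\top W + \sigma^2 I_d\right)^{-1}, \quad
\mu_X = \left(W^\top W + \sigma^2 I_d\right)^{-1} W^\top X.
\end{aligned}
$$

The posterior mean is a linear, ridge-regression-like function of $X$, so the encoder of this model is linear, and the posterior covariance does not depend on $X$.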

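Restricted Boltzmann machine: In RBM, unlike the above models, there is no explicit prior $p(h)$. The model is defined by the joint distribution

$$
\begin{aligned}
(h_i, X_i) \sim p(h, X \mid W) &= \frac{1}{Z(W)} \exp\Big[\sum_{j,k} w_{jk} x_j h_k\Big] && (13) \\
&= \frac{1}{Z(W)} \exp\left[X^\top W h\right]. && (14)
\end{aligned}
$$

The above model assumes that both $h_i$ and $X_i$ are binary. Under the above model, both the generative distribution $p(X \mid h; W)$ and the inference distribution $p(h \mid X; W)$ are independent logistic regressions. We may modify the model slightly to make $X$ continuous, so that in the modified model, the generative distribution $p(X \mid h; W)$ is normal linear regression: $X = W h + \epsilon$, with $\epsilon \sim \mathrm{N}(0, I_p)$. The inference model, $p(h \mid X; W)$, is logistic regression, $h \sim \mathrm{logistic}(W^\top X)$, i.e., $\Pr(h_k = 1 \mid X; W) = \mathrm{sigmoid}(\sum_{j} w_{jk} x_j)$, where $\mathrm{sigmoid}(r) = 1/(1 + e^{-r})$.

If we sum out $h$, the marginal distribution $p(X \mid W) = \sum_h p(h, X \mid W)$ can be obtained in closed form, and $p(X \mid W)$ is a descriptive model.

RBM-like auto-encoder [100, 4]: The RBM leads to the following auto-encoder: the encoder is $h_k = \mathrm{sigmoid}(\sum_{j} w_{jk} x_j)$, i.e., $h = \mathrm{sigmoid}(W^\top X)$; the decoder is $X = W h$.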
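A minimal sketch of the RBM-like auto-encoder with tied weights follows. It assumes the encoder takes the form $h = \mathrm{sigmoid}(W^\top X)$ and the decoder the form $X = W h$, as implied by the conditional distributions of the joint model (13)-(14); the sizes and the random weight matrix are hypothetical.

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def encode(X, W):
    """Encoder: h_k = sigmoid(sum_j w_jk x_j), i.e. h = sigmoid(W^T X)."""
    return sigmoid(W.T @ X)

def decode(h, W):
    """Decoder: reconstruct the signal as X = W h, using the same matrix W."""
    return W @ h

rng = np.random.default_rng(0)
p, d = 6, 3                          # signal and code dimensions (illustrative)
W = 0.1 * rng.standard_normal((p, d))
X = rng.standard_normal(p)

h = encode(X, W)                     # code with entries in (0, 1)
X_rec = decode(h, W)                 # reconstruction of the signal
```

The encoder and decoder share the single weight matrix `W`, reflecting the symmetric role of `W` in the joint energy.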

Like the descriptive model, the generative model can also be learned by maximum likelihood. However, unlike the “analysis by synthesis” scheme for learning the descriptive model, the learning algorithm for the generative model follows an “analysis by inference” scheme. Within each iteration of the learning algorithm, there is an inner loop for inferring $h_i$ for each $X_i$. The most rigorous inference method is to sample $h_i$ from the posterior distribution or the inference distribution $p(h_i \mid X_i; W)$. After inferring $h_i$ for each $X_i$, we can then update the model parameters by analyzing the “imputed” dataset $\{(h_i, X_i)\}$, by fitting the generative distribution $p(X \mid h; W)$. The EM algorithm is an example of this learning scheme, where the inference step is to compute expectation with respect to $p(h_i \mid X_i; W)$. From a Monte Carlo perspective, it means we make multiple imputations or make multiple guesses of $h_i$ to account for the uncertainties in $p(h_i \mid X_i; W)$. Then we analyze the multiply imputed dataset to update the model parameters.
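The "analysis by inference" scheme can be sketched with the EM algorithm for a factor analysis model. The sketch assumes standard normal factors and isotropic Gaussian residuals with variance `sigma2`, for which the posterior of each latent vector is Gaussian in closed form; the function name `em_factor_analysis`, the matrix `W_true`, and the data sizes are hypothetical.

```python
import numpy as np

def em_factor_analysis(X, d, n_iter=200, seed=0):
    """EM for X_i = W h_i + eps_i, h_i ~ N(0, I_d), eps_i ~ N(0, sigma2 * I_p)."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((p, d))
    sigma2 = 1.0
    for _ in range(n_iter):
        # Inference step: posterior moments of h_i given X_i (Gaussian).
        M_inv = np.linalg.inv(W.T @ W + sigma2 * np.eye(d))
        Eh = X @ W @ M_inv                        # rows are E[h_i | X_i]
        Shh = n * sigma2 * M_inv + Eh.T @ Eh      # sum_i E[h_i h_i^T | X_i]
        # Learning step: regress X on the imputed latent vectors.
        W = (X.T @ Eh) @ np.linalg.inv(Shh)
        sigma2 = (np.sum(X ** 2) - 2.0 * np.sum((X @ W) * Eh)
                  + np.trace(Shh @ W.T @ W)) / (n * p)
    return W, sigma2

rng = np.random.default_rng(0)
W_true = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0], [2.0, 0.5]])
n, sigma_true = 5000, 0.5
H_true = rng.standard_normal((n, 2))              # latent factors
X = H_true @ W_true.T + sigma_true * rng.standard_normal((n, 5))

W_hat, sigma2_hat = em_factor_analysis(X, d=2)
```

The loading matrix is identifiable only up to a rotation of the factors, so a sensible check compares `W_hat @ W_hat.T` with `W_true @ W_true.T` and `sigma2_hat` with `sigma_true ** 2` instead of comparing the matrices entry by entry.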

## 3 Interactions between different families

### 3.1 Discriminative learning of descriptive model

This subsection shows that the descriptive model can be learned discriminatively.

The descriptive model (4) can be connected to the discriminative model (2) if we treat $p_0(X)$ as the distribution of the negative examples, and $p_\theta(X)$ as the distribution of the positive examples. Suppose we generate the data as follows: $Y_i \sim \mathrm{Bernoulli}(\rho)$, i.e., $\Pr(Y_i = 1) = \rho$, which is the prior probability of positive examples. $[X_i \mid Y_i = 1] \sim p_\theta(X)$, and $[X_i \mid Y_i = 0] \sim p_0(X)$. According to the Bayes rule

$$
\log \frac{\Pr(Y_i = 1 \mid X_i)}{\Pr(Y_i = 0 \mid X_i)} = h(X_i)^\top \theta - \log Z(\theta) + \log[\rho/(1 - \rho)], \tag{15}
$$

which corresponds to (2) with $b = -\log Z(\theta) + \log[\rho/(1 - \rho)]$.

Figure 6: Discriminative learning of the descriptive model. By fitting a logistic regression to discriminate between the observed examples and the synthesized examples generated by the current model, we can modify the current model according to the fitted logistic regression, so that the modified model gets closer to the distribution of the observed data.

Tu made use of this fact to estimate $p_\theta$ discriminatively. The learning algorithm starts from $p_0$. At step $t$, we let the current $p_t$ serve as the negative distribution, and generate synthesized examples from $p_t$. Then we fit a logistic regression by treating the examples generated by $p_t$ as the negative examples, and the observed examples as the positive examples. Let $\theta$ be the estimated parameter of this logistic regression. We then let $p_{t+1}(X) \propto \exp(h(X)^\top \theta)\, p_t(X)$. The convergence of this learning algorithm has been analyzed in the same work.

Figure 6 illustrates the learning process by starting from the uniform $p_0$. By iteratively fitting the logistic regression and modifying the distribution, the learned distribution converges to the true distribution.
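The iterative scheme can be sketched on a toy problem where every distribution is stored explicitly on a grid. Everything specific below is an assumption for illustration: the one-dimensional grid, the features $h(X) = (x, x^2)$, the hypothetical true distribution `p_data`, the equal numbers of positive and negative examples, and the Newton solver with a small ridge term.

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(-1.0, 1.0, 41)
H = np.column_stack([grid, grid ** 2])            # h(X) on the grid

def normalize(logw):
    """Turn unnormalized log-weights on the grid into probabilities."""
    w = np.exp(logw - logw.max())
    return w / w.sum()

def fit_logistic(F, y, n_newton=25, ridge=1e-3):
    """Logistic regression with intercept, fitted by Newton's method."""
    A = np.column_stack([np.ones(len(F)), F])
    beta = np.zeros(A.shape[1])
    for _ in range(n_newton):
        prob = 1.0 / (1.0 + np.exp(-(A @ beta)))
        grad = A.T @ (y - prob) - ridge * beta
        hess = (A * (prob * (1.0 - prob))[:, None]).T @ A + ridge * np.eye(len(beta))
        beta = beta + np.linalg.solve(hess, grad)
    return beta[0], beta[1:]                      # intercept, feature weights

def kl(p, q):
    return float(np.sum(p * (np.log(p) - np.log(q))))

p_data = normalize(H @ np.array([2.0, -4.0]))     # hypothetical true distribution
obs = rng.choice(len(grid), size=4000, p=p_data)  # observed (positive) examples

p_t = np.full(len(grid), 1.0 / len(grid))         # start from the uniform p_0
kl_start = kl(p_data, p_t)
for t in range(5):
    syn = rng.choice(len(grid), size=4000, p=p_t) # synthesized (negative) examples
    F = np.vstack([H[obs], H[syn]])
    y = np.concatenate([np.ones(len(obs)), np.zeros(len(syn))])
    _, theta = fit_logistic(F, y)
    # Tilt the current model by the fitted classifier and renormalize.
    p_t = normalize(np.log(p_t) + H @ theta)
kl_end = kl(p_data, p_t)
```

The intercept of the fitted classifier absorbs the normalizing constant and the class prior, so only the feature weights are used to tilt the current distribution. On this toy problem `kl_end` should be much smaller than `kl_start`.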

### 3.2 Integration of descriptive and generative models

Natural images contain both stochastic textures and geometric objects (as well as their parts). The stochastic textures can be described by some feature statistics pooled over the spatial domain, while the geometric objects can be represented by image primitives or textons.

Figure 7: Pre-attentive vision is sensitive to local patterns called textons.

The psychophysicist Julesz studied both texture statistics and textons. He conjectured that pre-attentive human vision is sensitive to local patterns called textons. Figure 7 illustrates the basic idea.

Figure 8: A model of textons, where each texton is a composition of a small number of wavelets.

Inspired by Julesz’s work, a generative model for textons was proposed, where each texton is a composition of a small number of wavelets, as illustrated by Figure 8. The model is a generalization of the sparse coding model.

Figure 9: Active basis model: each active basis template is a composition of wavelets selected from a dictionary, and the wavelets are allowed to shift their locations and orientations to account for shape deformation. Here each wavelet is illustrated by a bar. The templates are learned at two different scales. The observed images can be reconstructed by the wavelets of the deformed templates.

Building on the texton model, [102, 46] proposed an active basis model, where each model is a composition of wavelets selected from a dictionary, and the wavelets are allowed to shift their locations and orientations to account for shape deformation. See Figure 9 for an illustration.

Figure 10: Hybrid image template: integrating the generative model for the shape template and the descriptive model for texture.

The texton model and the active basis model are generative models. However, they do not account for stochastic texture patterns. A hybrid model has been proposed that integrates the generative model for shape templates and the descriptive model for stochastic textures, as illustrated by Figure 10. A similar model has been developed to model the geometric structures and the stochastic textures by generative models and descriptive models respectively.

Another integration of the generative model and the descriptive model has also been proposed, where the lowest layer is a generative model such as the wavelet sparse coding model, but the spatial distribution of the wavelets is governed by a descriptive model.

### 3.3 DDMCMC: integration of discriminative and generative models

Figure 11: Data-driven MCMC: when fitting the generative models and descriptive models using MCMC, the discriminative models can be employed to provide proposals for MCMC transitions.

In [96, 98] the authors proposed a data-driven MCMC method for fitting the generative models as well as the descriptive models to the data. Fitting such models usually requires time-consuming MCMC. The method speeds up the MCMC by using the discriminative models to provide the proposals for the Metropolis-Hastings algorithm. See Figure 11 for an illustration.

## 4 Hierarchical forms of the three families

This section presents the hierarchical non-linear forms of the three families of models, where the non-linear mappings are parametrized by neural networks, in particular, the convolutional neural networks.

### 4.1 Recent developments

During the past few years, deep convolutional neural networks (CNNs or ConvNets) [59, 58] and recurrent neural networks (RNNs) have transformed the fields of computer vision, speech recognition, natural language processing, and other fields in artificial intelligence (AI). Even though these neural networks were invented decades ago, their potential was realized only recently, mainly because of the following two factors: (1) the availability of big training datasets such as ImageNet, and (2) the improvement in computing power, mainly brought by graphics processing units (GPUs). These two factors, together with some recent clever tweaks and inventions such as rectified linear units, residual networks, etc., enable the training of very deep networks (e.g., 152 layers with 60 million parameters in a residual network for object recognition) that achieve impressive performances on many tasks in AI (a recent example being AlphaGo Zero).

One key reason for the successes of deep neural networks is that they are universal and flexible function approximators. For instance, a feedforward neural network with rectified linear units is a piecewise linear function with recursively partitioned linear pieces that can approximate any continuous non-linear mapping. However, this does not fully explain the “unreasonable effectiveness” of deep neural networks. The stochastic gradient descent algorithm that is commonly employed to train the neural networks is expected to approach only a local minimum of the highly non-convex objective function. However, for large and deep networks, it appears that most of the local modes are equally good in terms of training and testing errors, and the apparent vices of local modes and stochasticity in the mini-batch on-line training algorithm actually turn out to be big virtues in that they seem to prevent overfitting and lead to good generalization.

The approximation capacities of the deep neural networks have been extensively exploited in supervised learning (such as classification networks and regression networks) and reinforcement learning (such as policy networks and value networks). They have also proven to be useful for unsupervised learning and generative modeling, where the goal is to learn features or hidden variables from the observed signals without external guidance such as class labels or rewards. The unsupervised learning is often accomplished in the context of a generative model (or an auto-encoder), which explains or characterizes the observed examples.

### 4.2 Discriminative models by convolutional neural networks

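The neural networks in general and the convolutional neural networks (ConvNet or CNN) in particular were initially designed for discriminative models. Let $X$ be the $p$-dimensional input vector, and $Y$ be the output. We want to predict $Y$ by $\hat{Y}$, which is a non-linear transformation of $X$: $\hat{Y} = f_\theta(X)$, where $f_\theta$ is parametrized by parameters $\theta$. In a feedforward neural network, $f_\theta$ is a composition of $L$ layers of linear mappings followed by element-wise non-linear rectifications, as illustrated by the following diagram:

$$X \to h^{(1)} \to \cdots \to h^{(l-1)} \to h^{(l)} \to \cdots \to h^{(L)} \to \hat{Y}, \tag{16}$$

where $h^{(l)}$ is a $d^{(l)}$-dimensional vector which is defined recursively by

$$h^{(l)} = f^{(l)}(W^{(l)} h^{(l-1)} + b^{(l)}), \tag{17}$$

for $l = 1, \ldots, L$. We may treat $X$ as $h^{(0)}$, and $\hat{Y}$ as $h^{(L+1)}$, with $\theta = (W^{(l)}, b^{(l)}, l = 1, \ldots, L+1)$. $W^{(l)}$ is the weight matrix and $b^{(l)}$ is the bias or intercept vector at layer $l$. $f^{(l)}$ is an element-wise transformation, i.e., for $v = (v_1, \ldots, v_d)^\top$, $f^{(l)}(v) = (f^{(l)}(v_1), \ldots, f^{(l)}(v_d))^\top$.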

Compared to the discriminative models in the previous section, we now have multiple layers of features $(h^{(l)}, l = 1, \ldots, L)$. They are recursively defined via (17), and they are to be learned from the training data instead of being designed.
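The recursion (17) can be sketched in a few lines of NumPy. This is a minimal illustration, assuming ReLU as the element-wise non-linearity at every layer; the layer widths in `dims` and the random weights are hypothetical.

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def forward(x, weights, biases):
    """Return [h0, h1, ..., hL] with h0 = x, following recursion (17)."""
    hs = [x]
    for W, b in zip(weights, biases):
        hs.append(relu(W @ hs[-1] + b))
    return hs

rng = np.random.default_rng(0)
dims = [8, 16, 16, 4]  # hypothetical layer widths
weights = [rng.normal(size=(dims[l + 1], dims[l])) for l in range(len(dims) - 1)]
biases = [rng.normal(size=dims[l + 1]) for l in range(len(dims) - 1)]
features = forward(rng.normal(size=dims[0]), weights, biases)
```

Each entry of `features` is one layer of learned features; only the weights and biases change during training, not the recursion itself.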

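For classification, suppose there are $K$ categories. The conditional probability of category $k$ given input $X$ is given by the following soft-max probability:

$$\Pr(Y = k \mid X) = \frac{\exp(f_{\theta_k}(X))}{\sum_{k'=1}^{K} \exp(f_{\theta_{k'}}(X))}, \tag{18}$$

where $f_{\theta_k}(X)$ is the score for category $k$. We may take $f_{\theta_k}(X)$ to be a linear function of the top-layer features $h^{(L)}$, as in multinomial logistic regression. This final classification layer is usually called the soft-max layer.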
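A minimal sketch of the soft-max layer, assuming the scores are exponentiated and then normalized over the categories. The three score values are made up for illustration, and subtracting the maximum is a standard numerical-stability step that leaves the probabilities unchanged.

```python
import numpy as np

def softmax(scores):
    """Map a vector of K category scores to K probabilities."""
    shifted = scores - np.max(scores)  # stability: does not change the result
    e = np.exp(shifted)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.0])  # hypothetical scores for K = 3 categories
probs = softmax(scores)
```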

The most commonly used non-linear rectification in modern neural nets is the Rectified Linear Unit (ReLU): $f(a) = \max(0, a)$. The resulting function $f_\theta(X)$ can be considered a multi-dimensional linear spline, i.e., a piecewise linear function. Recall a one-dimensional linear spline is of the form $y = \beta_0 + \sum_{j=1}^{d} \beta_j \max(0, x - a_j)$, where $a_j$ are the knots. At each knot $a_j$, the linear spline takes a turn and changes its slope by $\beta_j$. With enough knots, $y$ can approximate any non-linear continuous function. We can view this as a simplified two-layer network, with $h_j = \max(0, x - a_j)$. The basis function $\max(0, x - a_j)$ is a two-piece linear function with a bending at $a_j$. For multi-dimensional input $X$, a two-layer network with one-dimensional output is of the following form: $y = \beta_0 + \sum_{j=1}^{d} \beta_j h_j$, where $h_j = \max(0, W_j X + b_j)$, and $W_j$ is the $j$-th row of $W$. The basis function $\max(0, W_j X + b_j)$ is again a two-piece linear function with a bending along the line $W_j X + b_j = 0$. The dividing lines $W_j X + b_j = 0$, $j = 1, \ldots, d$, partition the domain of $X$ into up to $2^d$ pieces, and $y$ is a continuous piecewise linear function over these pieces.
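The one-dimensional case can be made concrete with a short sketch that evaluates a linear spline as a two-layer ReLU network. The knots and slope changes below are arbitrary values chosen for illustration.

```python
import numpy as np

def linear_spline(x, beta0, betas, knots):
    """Evaluate beta0 + sum_j beta_j * max(0, x - a_j) at an array of x."""
    x = np.asarray(x, dtype=float)
    hidden = np.maximum(0.0, x[:, None] - knots[None, :])  # ReLU hidden units
    return beta0 + hidden @ betas

knots = np.array([0.0, 1.0, 2.0])
betas = np.array([1.0, -2.0, 1.0])   # slope change at each knot
xs = np.array([-1.0, 0.5, 1.5, 3.0])
ys = linear_spline(xs, 0.0, betas, knots)
# ys is [0.0, 0.5, 0.5, 0.0]: a tent that rises after 0, falls after 1, flattens after 2
```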

In the multi-layer network, the hierarchical layers of $h^{(l)}$ partition the domain of $X$ recursively, creating a piecewise linear function with exponentially many pieces. Such reasoning also applies to other forms of rectification functions $f^{(l)}$, as long as they are non-linear and create bending. This makes the neural network an extremely powerful machine for function approximation and interpolation. The recursive partition in neural nets is similar to CART and MARS, but is more flexible.

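Back-propagation. Both $\partial f_\theta(X)/\partial \theta$ and $\partial f_\theta(X)/\partial X$ can be computed by the chain-rule back-propagation, and they share the computation of $\partial h^{(l)}/\partial h^{(l-1)} = f^{(l)\prime}(W^{(l)} h^{(l-1)} + b^{(l)})\, W^{(l)}$ in the chain rule. Because $f^{(l)}$ is element-wise, $f^{(l)\prime}$ is a diagonal matrix.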
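The shared computation can be seen in a small sketch where one backward sweep yields both the weight gradients and the input gradient. It assumes ReLU hidden layers and a linear scalar top layer; the layer sizes and function names are hypothetical, and bias gradients are omitted for brevity.

```python
import numpy as np

def forward(x, Ws, bs):
    """Scalar-output network: ReLU hidden layers, then a linear top layer."""
    hs, pre = [x], []
    for W, b in zip(Ws[:-1], bs[:-1]):
        pre.append(W @ hs[-1] + b)
        hs.append(np.maximum(0.0, pre[-1]))
    out = (Ws[-1] @ hs[-1] + bs[-1])[0]
    return out, hs, pre

def backward(Ws, hs, pre):
    """Return df/dW for every layer and df/dx from a single backward sweep."""
    grad_h = Ws[-1][0].copy()              # df/dh(L) for the linear top layer
    dWs = [hs[-1][None, :].copy()]         # df/dW of the top layer
    for l in range(len(pre) - 1, -1, -1):
        delta = grad_h * (pre[l] > 0)      # diagonal ReLU derivative
        dWs.insert(0, np.outer(delta, hs[l]))
        grad_h = Ws[l].T @ delta           # shared step: pass gradient to h(l-1)
    return dWs, grad_h

rng = np.random.default_rng(0)
dims = [5, 7, 6, 1]  # hypothetical layer widths
Ws = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(3)]
bs = [rng.normal(size=dims[i + 1]) for i in range(3)]
x = rng.normal(size=5)
out, hs, pre = forward(x, Ws, bs)
dWs, dx = backward(Ws, hs, pre)
```

The vector `grad_h` is propagated once through the layers; the weight gradients are read off along the way and the input gradient is what remains at the bottom.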

A recent invention is to reparametrize the mapping (17) by $h^{(l)} = h^{(l-1)} + f^{(l)}(W^{(l)} h^{(l-1)} + b^{(l)})$, where $f^{(l)}(W^{(l)} h^{(l-1)} + b^{(l)})$ is used to model the residual term. This enables the learning of very deep networks. One may think of it as modeling an iterative algorithm where the layers can be interpreted as time steps of the iterative algorithm.

Figure 12: Filtering or convolution: applying a filter of the size 3×3×3 on an image of the size 6×6×3 to get a filtered image or feature map of 6×6 (with proper boundary handling). Each pixel of the filtered image is computed by the weighted sum of the 3×3×3 pixels of the input image centered at this pixel. There are 3 color channels (R, G, B), so both the input image and the filter are three-dimensional.

Figure 13: Convolutional neural networks consist of multiple layers of filtering and sub-sampling operations for bottom-up feature extraction, resulting in multiple layers of feature maps and their sub-sampled versions. The top layer features are used for classification via multinomial logistic regression. The discriminative direction is from image to category, whereas the generative direction is from category to image.

Convolution. The signal $X$ can be an image, and the linear transformations at each layer may be convolutions with localized kernel functions (i.e., filters). That is, the row vectors of $W^{(l)}$ (as well as the elements of $b^{(l)}$) form different groups, and the vectors in the same group are localized and translation invariant versions of each other, like wavelets. Each group of vectors corresponds to a filter or a kernel or a channel. See Figures 12 and 13 for illustrations. Recent networks mostly use small filters of the size $3 \times 3$ [90, 92]. The minimal size $1 \times 1$ is also a popular choice [63, 92]. Such a filter fuses the features of different channels at the same location, and is often used for reducing or increasing the number of channels. When computing the filtered image, we can also sub-sample it by, e.g., taking one filter response every two pixels. The filter is then said to have stride 2.
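The filtering operation of Figure 12 can be sketched directly with loops. This is an illustrative implementation, assuming zero padding as the boundary handling and no kernel flip (the usual ConvNet convention); the function name `conv2d_same` is hypothetical.

```python
import numpy as np

def conv2d_same(image, kernel, stride=1):
    """Filter an H x W x C image with a k x k x C kernel using zero padding."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))
    H, W, _ = image.shape
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            # Weighted sum over the k x k x C patch centred at pixel (i, j).
            out[i, j] = np.sum(padded[i:i + k, j:j + k, :] * kernel)
    return out[::stride, ::stride]  # keep one response every `stride` pixels

rng = np.random.default_rng(0)
image = rng.normal(size=(6, 6, 3))
kernel = rng.normal(size=(3, 3, 3))
feature_map = conv2d_same(image, kernel)           # 6 x 6, as in Figure 12
subsampled = conv2d_same(image, kernel, stride=2)  # 3 x 3 with stride 2
```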

### 4.3 Descriptive models

This subsection describes the hierarchical form of the descriptive model and the maximum likelihood learning algorithm.

We can generalize the descriptive model in the previous sections to a hierarchical form with multiple layers of features [72, 13, 105, 106],

$$X \to h^{(1)} \to \cdots \to h^{(L)} \to f_\theta(X), \tag{19}$$

which is a bottom-up process for computing $f_\theta(X)$, and $\theta$ collects all the weight and bias parameters at all the layers. The probability distribution is

$$p_\theta(X) = \frac{1}{Z(\theta)} \exp[f_\theta(X)]\, p_0(X), \tag{20}$$

where again $p_0(X)$ is the reference distribution such as the Gaussian white noise model $p_0(X) \propto \exp(-\|X\|^2/2\sigma^2)$. Again the normalizing constant is $Z(\theta) = \int \exp[f_\theta(X)]\, p_0(X)\, dX$. The energy function is

$$U_\theta(X) = \|X\|^2/2\sigma^2 - f_\theta(X). \tag{21}$$

$p_0(X)$ can also be a uniform distribution within a bounded range, in which case $U_\theta(X) = -f_\theta(X)$.

The model (20) can be considered a hierarchical generalization of the FRAME model. While the energy function of the FRAME model is defined in terms of element-wise non-linear functions of filter responses, model (20) involves recursions of this structure at multiple layers according to the ConvNet.

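Suppose we observe training examples $\{X_i, i = 1, \ldots, n\}$. The maximum likelihood learning seeks to maximize $L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log p_\theta(X_i)$. The gradient of $L(\theta)$ is

$$L'(\theta) = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial \theta} f_\theta(X_i) - E_\theta\left[\frac{\partial}{\partial \theta} f_\theta(X)\right], \tag{22}$$

where $E_\theta$ denotes the expectation with respect to $p_\theta(X)$. The key identity underlying equation (22) is $\frac{\partial}{\partial \theta} \log Z(\theta) = E_\theta\left[\frac{\partial}{\partial \theta} f_\theta(X)\right]$.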
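The identity follows in one line from the definition of the normalizing constant, assuming differentiation and integration can be interchanged:

$$
\frac{\partial}{\partial \theta} \log Z(\theta)
= \frac{1}{Z(\theta)} \int \frac{\partial}{\partial \theta} \exp[f_\theta(X)]\, p_0(X)\, dX
= \int \frac{\partial f_\theta(X)}{\partial \theta}\, \frac{1}{Z(\theta)} \exp[f_\theta(X)]\, p_0(X)\, dX
= E_\theta\left[\frac{\partial}{\partial \theta} f_\theta(X)\right].
$$

Since $\log p_\theta(X_i) = f_\theta(X_i) - \log Z(\theta) + \log p_0(X_i)$ and $p_0$ does not depend on $\theta$, differentiating and averaging over the $n$ examples gives (22).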

The expectation in equation (22) is analytically intractable and has to be approximated by MCMC, such as the Langevin dynamics, which samples from $p_\theta(X)$ by iterating the following step:

$$X_{\tau+1} = X_\tau - \frac{s^2}{2}\left[\frac{X_\tau}{\sigma^2} - \frac{\partial}{\partial X} f_\theta(X_\tau)\right] + s E_\tau, \tag{23}$$

where $\tau$ indexes the time steps of the Langevin dynamics, $s$ is the step size, and $E_\tau \sim \mathrm{N}(0, I)$ is the Gaussian white noise term. A Metropolis-Hastings step can be added to correct for the finiteness of $s$. The Langevin dynamics has been used for sampling from the linear form of the descriptive model such as the FRAME model.
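A minimal sketch of the update (23) on a toy model where the answer is known. It assumes a linear $f_\theta(X) = \theta^\top X$, chosen only because the resulting $p_\theta$ is then Gaussian with mean $\sigma^2\theta$ and variance $\sigma^2$, so the samples can be checked; the step size, chain count, and number of steps are arbitrary, and no Metropolis-Hastings correction is applied.

```python
import numpy as np

def langevin(x0, grad_f, sigma, step, n_steps, rng):
    """Iterate update (23) on an array of chains x0 (one chain per row)."""
    x = x0.copy()
    for _ in range(n_steps):
        noise = rng.normal(size=x.shape)
        x = x - 0.5 * step**2 * (x / sigma**2 - grad_f(x)) + step * noise
    return x

rng = np.random.default_rng(0)
theta = np.array([1.0, -2.0])   # toy f_theta(X) = theta . X, so df/dX = theta
sigma = 1.0
chains = langevin(rng.normal(size=(5000, 2)), lambda x: theta,
                  sigma, step=0.3, n_steps=200, rng=rng)
sample_mean = chains.mean(axis=0)  # should be close to sigma**2 * theta
```

With a ConvNet in place of the toy model, `grad_f` would be computed by back-propagation, and the finite step size introduces a small bias that the Metropolis-Hastings step would remove.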

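We can run $\tilde{n}$ parallel chains of Langevin dynamics according to (23) to obtain the synthesized examples $\{\tilde{X}_i, i = 1, \ldots, \tilde{n}\}$. The Monte Carlo approximation to $L'(\theta)$ is

$$L'(\theta) \approx \frac{\partial}{\partial \theta}\left[\frac{1}{n} \sum_{i=1}^{n} f_\theta(X_i) - \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} f_\theta(\tilde{X}_i)\right], \tag{24}$$

which is the difference between the observed examples and the synthesized examples. We can then update $\theta^{(t+1)} = \theta^{(t)} + \eta_t L'(\theta^{(t)})$, with $L'(\theta^{(t)})$ computed according to (24). $\eta_t$ is the learning rate. The convergence of this algorithm has been studied by [79, 107].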
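The alternation between sampling and learning can be sketched on the same kind of toy model. The example assumes $f_\theta(X) = \theta^\top X$ with $\sigma = 1$, for which the maximum likelihood estimate of $\theta$ equals the mean of the observed data; the data, learning rate, and the choice to keep the chains persistent across iterations are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, step, lr = 1.0, 0.3, 0.1
observed = rng.normal(loc=[2.0, -1.0], scale=1.0, size=(2000, 2))  # toy data
theta = np.zeros(2)                  # toy model f_theta(X) = theta . X
synth = rng.normal(size=(2000, 2))   # parallel chains, kept across iterations

for _ in range(300):
    # Sampling step: a few Langevin updates (23); df/dX = theta for this model.
    for _ in range(10):
        noise = rng.normal(size=synth.shape)
        synth = synth - 0.5 * step**2 * (synth / sigma**2 - theta) + step * noise
    # Learning step: gradient (24); df/dtheta = X, so it is a difference of means.
    grad = observed.mean(axis=0) - synth.mean(axis=0)
    theta = theta + lr * grad
```

The update stops moving once the synthesized examples match the observed examples in their average statistics, which here means matching the sample mean.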

Alternating back-propagation: The learning and sampling algorithm is again an “analysis by synthesis” scheme. The sampling step runs the Langevin dynamics by computing $\partial f_\theta(X)/\partial X$, and the learning step updates $\theta$ by computing $\partial f_\theta(X)/\partial \theta$. Both derivatives can be computed by back-propagation, and they share the same computations of $\partial h^{(l)}/\partial h^{(l-1)}$.

Mode shifting interpretation: The data distribution is likely to have many local modes. The $f_\theta$ parametrized by the ConvNet can be flexible enough to create many local modes to fit it. We should learn $f_\theta$, or equivalently the energy function $U_\theta$, so that the energy function puts lower values on the observed examples than the unobserved examples. This is achieved by the learning and sampling algorithm, which can be interpreted as density shifting or mode shifting. In the sampling step, the Langevin dynamics settles the synthesized examples $\{\tilde{X}_i\}$ at the low energy regions or high density regions, or major modes (or basins) of $U_\theta$, i.e., modes with low energies or high probabilities, so that $\frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} U_\theta(\tilde{X}_i)$ tends to be low. The learning step seeks to change the energy function $U_\theta$ by changing $\theta$ in order to increase $\frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} U_\theta(\tilde{X}_i) - \frac{1}{n} \sum_{i=1}^{n} U_\theta(X_i)$. This has the effect of shifting the low energy or high density regions from the synthesized examples $\{\tilde{X}_i\}$ toward the observed examples $\{X_i\}$, or shifting the major modes of the energy function from the synthesized examples toward the observed examples, until the observed examples reside in the major modes of the model. If the major modes are too diffused around the observed examples, the learning step will sharpen them to focus on the observed examples. This mode shifting interpretation is related to the Hopfield network and the attractor network, with the Langevin dynamics serving as the attractor dynamics.

The energy landscape may have numerous major modes that are not occupied by the observed examples, and these modes imagine examples that are considered similar to the observed examples. Even though the maximum likelihood learning matches the average statistical properties between model and data, the ConvNet is expressive enough to create modes to encode the highly varied patterns. We still lack an in-depth understanding of the energy landscape.

Adversarial interpretation: The learning and sampling algorithm also has an adversarial interpretation where the learning and sampling steps play a minimax game. Let the value function be defined as

$$V = \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} U_\theta(\tilde{X}_i) - \frac{1}{n} \sum_{i=1}^{n} U_\theta(X_i). \tag{25}$$

The learning step updates $\theta$ to increase $V$, while the Langevin sampling step tends to relax $\{\tilde{X}_i\}$ to decrease $V$. The zero temperature limit of the Langevin sampling is gradient descent that decreases $V$, and the resulting learning and sampling algorithm is a generalized version of herding. This is related to Wasserstein GAN, but the critic and the actor are the same descriptive model, i.e., the model itself is its own generator and critic.

Multi-grid sampling and learning: The MCMC in general and the Langevin dynamics in particular may have difficulty traversing different modes and may take a long time to converge. A simple and popular modification of the maximum likelihood learning is the contrastive divergence (CD) learning, where for each observed training example, we obtain a corresponding synthesized example by initializing a finite-step MCMC from the observed example. The CD learning is related to the score matching estimator [48, 49] and the auto-encoder [99, 91, 1]. Such a method can be scaled up to large training datasets using mini-batch training. However, the synthesized examples may be far from fair samples of the current model, thus resulting in bias of the learned model parameters. A modification of CD is persistent CD, where the MCMC is still initialized from the observed example at the initial learning epoch. However, in each subsequent learning epoch, the finite-step MCMC is initialized from the synthesized example of the previous epoch. Running persistent chains may make the synthesized examples less biased by the observed examples, although the persistent chains may still have difficulty traversing different modes of the learned model.

Figure 14: Synthesized images at multi-grids. From left to right: 4×4 grid, 16×16 grid and 64×64 grid. Synthesized image at each grid is obtained by 30 step Langevin sampling initialized from the synthesized image at the previous coarser grid, beginning with the 1×1 grid.

Figure 15: Synthesized images from models learned by multi-grid method from 4 categories of MIT places205 datasets.

To address the above challenges under the constraint of finite budget MCMC, we develop a multi-grid sampling and learning method in our recent work. Specifically, for each training image, we obtain its multi-grid versions by repeated down-scaling. Our method learns a separate descriptive model at each grid. Within each iteration of our learning algorithm, for each observed training image, we generate the corresponding synthesized images at multiple grids. Specifically, we initialize the finite-step MCMC sampling from the minimal $1 \times 1$ version of the training image, and the synthesized image at each grid serves to initialize the finite-step MCMC that samples from the model of the subsequent finer grid. See Figure 14 for an illustration, where we sample images sequentially at 3 grids, with 30 steps of Langevin dynamics at each grid. After obtaining the synthesized images at the multiple grids, the models at the multiple grids are updated separately and simultaneously based on the differences between the synthesized images and the observed training images at different grids.
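The multi-grid versions of an image and the coarse-to-fine initialization can be sketched as follows. The text does not specify how the down-scaling or the up-scaling between grids is done, so average pooling and nearest-neighbour up-sampling are assumptions here, as are the function names; the grid sizes follow Figure 14, and a single-channel image is used for brevity.

```python
import numpy as np

def downscale(image, factor):
    """Average-pool a square H x H image by `factor` (factor must divide H)."""
    H = image.shape[0]
    return image.reshape(H // factor, factor, H // factor, factor).mean(axis=(1, 3))

def upscale(image, factor):
    """Nearest-neighbour up-sampling to initialise sampling at a finer grid."""
    return np.repeat(np.repeat(image, factor, axis=0), factor, axis=1)

rng = np.random.default_rng(0)
image = rng.uniform(size=(64, 64))   # stand-in for one training image
grids = {size: downscale(image, 64 // size) for size in (1, 4, 16, 64)}
init_4 = upscale(grids[1], 4)        # 1 x 1 version used to start the 4 x 4 chain
```

In the full method, `init_4` would be refined by finite-step Langevin sampling under the 4×4 model, and the result would in turn be up-scaled to start the chain at the 16×16 grid.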

Unlike original CD or persistent CD, the learned models are equipped with a fixed budget MCMC to generate new synthesized images from scratch, because we only need to initialize the MCMC by sampling from the one-dimensional histogram of the $1 \times 1$ versions of the training images.

In our experiments, the training images are resized to . Since the models of the three grids act on images of different scales, we design a specific ConvNet structure per grid: grid1 has a 3-layer network with stride filters at the first layer and stride filters at the next two layers; grid2 has a 4-layer network with stride filters at the first layer and stride filters at the next three layers; grid3 has a 3-layer network with stride filters at the first layer, stride filters at the second layer, and stride filters at the third layer. Numbers of channels are at grid1 and grid3, and at grid2. A fully-connected layer with channel output is added on top of every grid to get the value of the function