An Interpretable Generative Model for Handwritten Digit Image Synthesis

11/11/2018 ∙ by Yao Zhu, et al.

An interpretable generative model for handwritten digit synthesis is proposed in this work. Modern image generative models, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), are trained by backpropagation (BP). The training process is complex and the underlying mechanism is difficult to explain. We propose an interpretable multi-stage PCA method to achieve the same goal and use handwritten digit image synthesis as an illustrative example. First, we derive principal-component-analysis-based (PCA-based) transform kernels at each stage based on the covariance of its inputs. This results in a sequence of transforms that convert input images of correlated pixels to spectral vectors of uncorrelated components. In other words, it is a whitening process. Then, we can synthesize an image based on random vectors and multi-stage transform kernels through a coloring process. The generative model is a feedforward (FF) design since no BP is used in model parameter determination. Its design complexity is significantly lower, and the whole design process is explainable. Finally, we design an FF generative model using the MNIST dataset, compare synthesis results with those obtained by state-of-the-art GAN and VAE methods, and show that the proposed generative model achieves comparable performance.




1 Introduction

Data-driven image synthesis is a popular topic for a variety of vision-based applications nowadays. A great majority of modern image synthesis tasks are accomplished by Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). The GAN approach trains two networks at the same time, namely, a generator and a discriminator. They are trained against each other adversarially. The generator takes a random vector as the input and generates an image by adapting network parameters to make synthesized and real images as close as possible. Their difference is detected by the discriminator. The whole training process is cast as a multi-layer non-convex optimization problem and solved by backpropagation (BP). Although the BP approach provides reasonable results, the whole process is complex and mathematically intractable.

To offer an interpretable image generative model, one idea is to adopt a one-stage whitening and coloring process using principal component analysis (PCA), which is also known as the Karhunen-Loève transform (KLT). However, the one-stage transform cannot generate images of high resolution. To overcome this difficulty, we derive a multi-stage PCA by cascading multiple transform stages. It relates input images of correlated pixels to output spectral vectors of uncorrelated components through a multi-stage whitening process. To accomplish the synthesis task, we conduct the multi-stage transforms in the reversed order, which corresponds to a multi-stage coloring process. The resulting generative model is called a feedforward one since no backpropagation is used in determining the model parameters. The complexity of the feedforward model is low and the whole design process is explainable.

The rest of the paper is organized as follows. Related previous work is reviewed in Sec. 2. One-stage and multi-stage generative models are presented in Secs. 3 and 4, respectively. Experimental results are shown in Sec. 5. Some discussion is made in Sec. 6. Finally, concluding remarks are given in Sec. 7.

2 Review of related work

Generative models are an important topic in computer vision and machine learning. Among many state-of-the-art generative models for image synthesis dosovitskiy2015learning ; reed2015deep ; yang2015weakly , there are two main stochastic models: VAEs and GANs. The VAE kingma2013auto ; rezende2014stochastic ; oord2016pixel ; zhao2017learning is a probabilistic graphical model that optimizes the variational lower bound on data likelihood. The GAN system goodfellow2014generative consists of a generator and a discriminator. The generator is optimized to fool the discriminator, and the discriminator is optimized to distinguish real and fake images. Several GAN variants improve stability and efficiency arjovsky2017towards ; arjovsky2017wasserstein ; gulrajani2017improved ; mao2017least ; radford2015unsupervised ; salimans2016improved ; zhao2016energy . Examples include WGAN, informative GAN, etc. Kaneko et al. kaneko2018generative proposed a decision tree controller GAN, which can learn hierarchically interpretable representations without relying on detailed supervision. Huang et al. huang2018introvae proposed an introspective variational autoencoder (IntroVAE) model to synthesize high-resolution photographic images. It can evaluate the quality of its own generated images and improve them accordingly.

Research on interpretable generative methods for image synthesis is much scarcer. One idea is to examine hierarchical representations for image synthesis. Examples include zhang2017stackgan ; wang2016generative ; vondrick2016generating ; huang2017stacked . Another idea is to adopt a recursive structure eslami2016attend ; gregor2015draw ; im2016generating ; kwak2016generating ; yang2017lr . Recently, Kuo et al. kuo2016understanding ; kuo2017cnn ; kuo2018data ; kuo2018interpretable studied interpretable convolutional neural networks (CNNs), and proposed an image synthesis solution based on the Saak transform in kuo2018data . They successfully reconstructed images using Saak coefficients. Here, we conduct image synthesis from random vectors rather than image reconstruction, and present a multi-stage PCA-based generative model for high-resolution image synthesis.

3 Generative Model with One-Stage PCA

The block diagram of a generative model using the one-stage PCA is shown in Fig. 1. It contains a forward transform and an inverse transform. They correspond to the whitening and the coloring processes, respectively. The training data are used to determine the transform kernels and the distribution of transformed coefficients. Then, these two pieces of information are used for automatic image synthesis.

Figure 1: 3D representation of single stage PCA transform. Blue flow represents forward transform, orange flow represents inverse transform.

The forward PCA transform consists of the following two steps.

  1. Compute transform kernels from input vectors.
    By following kuo2018interpretable , we decompose input vectors into two components: DC (direct current) and AC (alternating current). For input vectors of dimension $N$, the correlation matrix of the AC input vectors is denoted by $\mathbf{R} \in \mathbb{R}^{N \times N}$. It has rank $N-1$. That is, its first $N-1$ eigenvalues are positive and the last one is zero. The eigenvectors corresponding to positive eigenvalues form an orthonormal basis of the AC subspace. We order the eigenvectors by decreasing eigenvalue and select the first $K$ (with $K \le N-1$) eigenvectors, whose eigenvalues capture most of the energy of the input vector space. The first $K$ basis functions, denoted by $\mathbf{a}_k$, $k = 1, \dots, K$, and called transform kernels, provide the optimal subspace approximation to $\mathbb{R}^N$.

  2. Project AC input to kernels $\mathbf{a}_k$.
    We have projection coefficients shaped in a 3D cuboid, as shown in Fig. 1. The projection coefficients are defined by

    $$c_k = \mathbf{a}_k^T \mathbf{x}, \quad k = 1, \dots, K,$$

    where $\mathbf{a}_k$, $\mathbf{x}$ and $K$ denote the $k$th transform kernel, the AC input vector and the number of kernels, respectively.

The inverse PCA transform aims to reconstruct the AC input, $\mathbf{x}$, as closely as possible. We use the best linear subspace approximation to reconstruct the input vector using transform kernels $\mathbf{a}_k$, $k = 1, \dots, K$, and projection coefficient vector $\mathbf{c} = (c_1, \dots, c_K)^T$. That is, we have

$$\hat{\mathbf{x}} = \sum_{k=1}^{K} c_k \mathbf{a}_k.$$

The inverse PCA transform is a mapping from projection vector $\mathbf{c}$ to an approximation of $\mathbf{x}$.
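
The forward and inverse transforms above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the DC component is assumed to be the per-vector mean, so the AC part is what remains after subtracting it.

```python
import numpy as np

def fit_pca_kernels(X, K):
    """X: (num_samples, N) input vectors.
    Returns the K leading AC kernels, shape (K, N), and all eigenvalues
    sorted in descending order."""
    # Assumption: the DC part is the per-vector mean; subtract it to get the AC part.
    X_ac = X - X.mean(axis=1, keepdims=True)
    # Correlation matrix of the AC vectors; its rank is at most N - 1.
    R = X_ac.T @ X_ac / X_ac.shape[0]
    eigvals, eigvecs = np.linalg.eigh(R)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]      # reorder to descending
    return eigvecs[:, order[:K]].T, eigvals[order]

def forward(x_ac, kernels):
    # Projection coefficients: one inner product per kernel.
    return kernels @ x_ac

def inverse(c, kernels):
    # Linear combination of kernels weighted by the coefficients.
    return kernels.T @ c
```

With $K = N - 1$ the kernels span the whole AC subspace, so `inverse(forward(x_ac, kernels), kernels)` recovers `x_ac` up to floating-point error; with smaller $K$ it returns the best approximation within the kept subspace.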

It is worthwhile to highlight two points. First, image synthesis is different from image reconstruction. Image reconstruction attempts to relate one input image with its transformed coefficients. Image synthesis intends to relate one class of images with their transform coefficients. Thus, we should study the distribution of the projection coefficient vector $\mathbf{c}$. It is well known that the PCA transform is a whitening process, so the elements of $\mathbf{c}$ are uncorrelated. Thus, each element can be treated independently. Second, the single-stage PCA cannot generate high-quality images of high resolution since many PCA components contain only high-frequency content. To address this problem, we propose to cascade multiple PCA transforms to provide a multi-stage generative model.
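
The decorrelation property, and the per-coefficient sampling it permits, can be checked numerically. The sketch below uses toy correlated data in place of real AC input vectors, and it assumes each coefficient is drawn from an independent Gaussian with the empirical mean and standard deviation; the text states only that the elements can be treated independently.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy correlated data standing in for AC input vectors (hypothetical).
mixing = rng.normal(size=(8, 8))
X = rng.normal(size=(5000, 8)) @ mixing

# Eigenvectors of the correlation matrix serve as transform kernels.
R = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(R)
C = X @ eigvecs                          # projection coefficients, one row per sample

# The correlation matrix of the coefficients is diagonal:
# its off-diagonal entries vanish up to rounding error.
corr_c = C.T @ C / len(C)
off_diag = corr_c - np.diag(np.diag(corr_c))

# Synthesis: draw every coefficient independently, then apply the inverse transform.
mean_c, std_c = C.mean(axis=0), C.std(axis=0)
c_new = rng.normal(mean_c, std_c)        # assumed per-coefficient Gaussian
x_new = eigvecs @ c_new
```

Uncorrelated does not mean independent in general, so sampling each coefficient on its own is a modelling choice rather than a consequence of PCA.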

4 Generative Model with Multi-Stage PCA

In this section, we propose a generative model based on multi-stage PCA transforms. The block-diagram of the proposed multi-stage generative model is shown in Fig. 2. Since the difference between the original digit images of size

and the interpolated digit image from size

to is very small, we do an average pooling to reduce all images to size as a pre-processing step to save computational complexity. Note that this pre-processing step can be removed to allow a more general processing. We will illustrate this point for more complicated images such as facial images as future extension.
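
Average pooling of this kind can be written as a reshape followed by a mean. The sketch below is illustrative only: the pooling factor of 2 and the function name are assumptions, since the image sizes are not recoverable from the text here.

```python
import numpy as np

def average_pool(img, factor=2):
    """Downsample a 2D image by averaging non-overlapping factor x factor blocks."""
    h, w = img.shape
    if h % factor or w % factor:
        raise ValueError("image sides must be divisible by the pooling factor")
    # Split each axis into (block index, offset within block) and average the offsets.
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
```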

Figure 2: Illustration of the proposed multi-stage generative model for handwritten digits with the MNIST dataset as the training samples, where blue arrows represent the training workflow and orange arrows represent the synthesis workflow.

4.1 Multi-Stage Generative Model

Our proposed system is a two-stage generative model that consists of two single-stage PCA transforms, where the spatial resolution at each stage is . The model parameters are determined in a FF one-pass manner as follows.

  • Stage 1:
    Conduct the position-dependent single-stage PCA transform on non-overlapping batches of size . Each position yields PCA coefficients of size , where is the spectral component and is the spatial resolution. In other words, the 3D cuboid output has the spatial dimension of each position and a spectral dimension of 16. Among 16 spectral coefficients ,

    is the DC projection while others are AC projections. We only pass the DC projection to the second stage and use AC projections of training samples to train a random forest regressor. Then, the random forest regressor will be used to predict AC projections in the synthesis process.

  • Stage 2:
    Conduct single-stage PCA transform of dimension , where is the spectral dimension and is the spatial dimension. Thus, the whole input image is transformed to a spectral vector of dimension .

The synthesis process is formed by cascading the two inverse transforms. Its input is a random vector drawn to match the mean and variance of the PCA coefficients of the last stage, which introduces randomness into the synthesis process.
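
A simplified sketch of the two-stage pipeline follows. It makes several assumptions that the text does not confirm: images are 16×16, Stage 1 uses 4×4 non-overlapping patches, the DC kernel is the constant unit vector, Stage 2 is an ordinary mean-centered PCA on the DC map, and the last-stage coefficients are sampled from independent Gaussians. AC projections are set to zero here, and all function names are hypothetical.

```python
import numpy as np

P = 4  # assumed patch side: 4x4 patches on 16x16 images

def stage1_dc(images):
    """images: (n, H, W). Returns the DC projection of every PxP patch, shape (n, H/P, W/P)."""
    n, h, w = images.shape
    patches = images.reshape(n, h // P, P, w // P, P)
    # With DC kernel ones(P*P) / P, the DC projection is the patch sum divided by P.
    return patches.sum(axis=(2, 4)) / P

def fit_stage2(dc_maps):
    """PCA on flattened DC maps; also stores per-coefficient mean and standard deviation."""
    D = dc_maps.reshape(len(dc_maps), -1)
    mean = D.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(D - mean, rowvar=False))
    kernels = eigvecs[:, np.argsort(eigvals)[::-1]].T
    coeffs = (D - mean) @ kernels.T
    return {"mean": mean, "kernels": kernels, "shape": dc_maps.shape[1:],
            "c_mean": coeffs.mean(axis=0), "c_std": coeffs.std(axis=0)}

def synthesize_dc(model, rng):
    """Draw a random last-stage vector and invert Stage 2 to obtain a DC map."""
    c = rng.normal(model["c_mean"], model["c_std"])
    dc = model["mean"] + model["kernels"].T @ c
    return dc.reshape(model["shape"])

def dc_to_image(dc_map):
    """Invert Stage 1 with all AC projections set to zero: each patch is constant."""
    return np.kron(dc_map / P, np.ones((P, P)))
```

An image built from DC projections alone is blocky; in the full model the AC projections of each patch would be supplied before the Stage 1 inverse transform.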

4.2 Outlier Detection

In the synthesis procedure, a random vector is used as the starting point. Some generated vectors fall in the tail region of the Gaussian distribution. These are outliers and lead to poor generated samples. We use two methods to detect and remove them. One is k-means clustering combined with the mean squared error (MSE). The other is the Z-score method.

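  • K-means clustering and MSE.
    We apply K-means clustering to the coefficients of the second-stage PCA. Each cluster has a centroid and a mean squared error. The latter indicates the distance between samples and their centroid. We use $\mathbf{c}_{i,j}$ to denote the coefficient vector of the $i$th input image in the $j$th cluster, which contains $n_j$ samples. The centroid of cluster $j$ is $\boldsymbol{\mu}_j$. Then, the mean squared error of cluster $j$ is

    $$\mathrm{MSE}_j = \frac{1}{n_j} \sum_{i=1}^{n_j} \| \mathbf{c}_{i,j} - \boldsymbol{\mu}_j \|^2.$$

    For each generated vector $\hat{\mathbf{c}}$, we check which cluster it belongs to. If the distance between $\hat{\mathbf{c}}$ and its centroid exceeds a threshold based on $\mathrm{MSE}_j$, we view it as an outlier; otherwise, we view it as an inlier.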

  • Z score.
    The Z score measures how many standard deviations a data point lies from its mean. Here, we use it to detect outliers. Since a generated vector lies in a high-dimensional space, we obtain the mean and standard deviation of each dimension. We consider the vector an inlier only if every component lies within a fixed number of standard deviations of the mean of its dimension.

We discard outliers in the image synthesis process.
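
Both tests can be sketched with NumPy alone. The thresholds are assumptions, since the exact values are not recoverable from the text: `scale` multiplies the cluster MSE and `z_max` bounds the per-dimension Z score. The small k-means routine is a plain Lloyd iteration written for illustration, and all names are hypothetical.

```python
import numpy as np

def assign(C, centroids):
    """Nearest-centroid label and squared distance for each row of C."""
    d2 = ((C[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1), d2.min(axis=1)

def kmeans(C, k, rng, iters=20):
    """Plain Lloyd iterations, initialized from k distinct samples."""
    centroids = C[rng.choice(len(C), size=k, replace=False)]
    for _ in range(iters):
        labels, _ = assign(C, centroids)
        for j in range(k):
            if np.any(labels == j):          # keep the old centroid if a cluster empties
                centroids[j] = C[labels == j].mean(axis=0)
    return centroids

def cluster_mse(C, centroids):
    """Mean squared distance of the members of each cluster to their centroid."""
    labels, d2 = assign(C, centroids)
    return np.array([d2[labels == j].mean() if np.any(labels == j) else 0.0
                     for j in range(len(centroids))])

def is_outlier_kmeans(c, centroids, mse, scale=1.0):
    """Outlier if the squared distance to the nearest centroid exceeds scale * MSE (assumed rule)."""
    labels, d2 = assign(c[None, :], centroids)
    return bool(d2[0] > scale * mse[labels[0]])

def is_inlier_zscore(c, mean, std, z_max=3.0):
    """Inlier only if every component is within z_max standard deviations (z_max assumed)."""
    return bool(np.all(np.abs((c - mean) / std) <= z_max))
```

A generated vector would be kept only if it passes whichever test is in use; larger values of `scale` or `z_max` admit more diverse but riskier samples.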

4.3 AC Prediction

For the model in Fig. 2, the conditional probability of each AC projection given the DC projection, i.e., $P(\mathrm{AC} \mid \mathrm{DC})$, is learned in the training procedure. Thus, in the synthesis flow, once the inverse transform at the second stage yields the generated DC projection, that projection is used to predict the AC projections. The random forest regressor suits this prediction task for two reasons. First, it works well on data whose inputs and outputs are both high-dimensional. Second, it meets our need directly: given a DC projection, it returns the corresponding AC projections.
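
A multi-output random forest of this kind can be sketched with scikit-learn's `RandomForestRegressor`, which accepts a 2D target array. The dimensions (16 DC values in, 240 AC values out), the synthetic training data and the hyperparameters below are all assumptions chosen for illustration, not values taken from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical training set: one DC map (16 values) and the matching
# AC projections (16 positions x 15 AC coefficients = 240 values) per image.
n_train, n_dc, n_ac = 300, 16, 240
dc_train = rng.normal(size=(n_train, n_dc))
weights = rng.normal(size=(n_dc, n_ac))
ac_train = dc_train @ weights + 0.1 * rng.normal(size=(n_train, n_ac))

# Fit a single forest that maps a DC vector to all AC projections at once.
forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(dc_train, ac_train)

# Synthesis time: predict AC projections for newly generated DC projections.
dc_generated = rng.normal(size=(3, n_dc))
ac_predicted = forest.predict(dc_generated)   # shape (3, 240)
```

A forest predicts an average over training targets in each leaf, so the predicted AC projections are a point estimate rather than a sample from the conditional distribution.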

5 Experimental Results

5.1 Number of Principal Components

For input vectors of dimension $N$, the total number of principal components is $N$. We choose the first $K$ (with $K < N$) eigenvectors of the correlation matrix to provide the optimal linear subspace approximation to the original space, $\mathbb{R}^N$. One key question is the selection of parameter $K$. Here, we use the first-stage PCA for the MNIST dataset as an example.

We show the semi-log energy plot of the principal components of input images in Fig. 3, where the y-axis indicates the ratio of an eigenvalue to the sum of all eigenvalues, $\lambda_k / \sum_j \lambda_j$, and the x-axis is the index $k$ of the principal component. As shown in Fig. 3, the whole energy curve can be decomposed into five linear sectors with four turning points as separators. The four turning points are given in Table 1. To derive these turning points automatically, we fit a least-squares line to the leading data points of a sector and choose the point at which the curve starts to deviate from the straight line as the turning point.
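
The turning-point search can be sketched as follows. The number of points used for the line fit (`n_fit`) and the deviation tolerance (`tol`) are assumed parameters, the function names are hypothetical, and indices are 0-based.

```python
import numpy as np

def energy_ratio(eigvals):
    """Each eigenvalue divided by the sum of all eigenvalues, in descending order."""
    eigvals = np.sort(eigvals)[::-1]
    return eigvals / eigvals.sum()

def first_turning_point(log_energy, start, n_fit=5, tol=0.1):
    """Fit a least-squares line to n_fit points beginning at `start` and return
    the first later index whose value deviates from the line by more than tol."""
    idx = np.arange(start, start + n_fit)
    slope, intercept = np.polyfit(idx, log_energy[idx], 1)
    for i in range(start + n_fit, len(log_energy)):
        if abs(log_energy[i] - (slope * i + intercept)) > tol:
            return i
    return None   # the curve stays on the fitted line
```

Calling `first_turning_point` repeatedly, each time starting from the previously returned index, would walk through successive sectors of the log-energy curve.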

Figure 3: The energy curve as a function of principal component indices for the MNIST dataset.

Turning point   1st   2nd   3rd   4th
Index             8   120   180   220
Table 1: The indices of turning points.

Figure 4: Generated images using principal components in different sections: (a) Section 1, (b) Section 2, (c) Section 3.

Figure 5: A horizontal slice of generated image intensity using principal components from different sections, (a) Section 1, (b) Section 2, (c) Section 3, where the four dotted lines in each subfigure correspond to the same spatial locations.

In order to understand the role of principal components in each section, we generate images using principal components in each section only and show the result in Fig. 4, where sections 1, 2 and 3 contain principal component indices 1-8, 9-120 and 121-180, respectively. Furthermore, for generated images in Fig. 4, we plot the intensity of a horizontal slice for the corresponding images in Fig. 5, where the horizontal slice is chosen to cross the upper circle of digit 8. We observe a different number of peaks in these three figures. For example, there are only two peaks in Fig. 5(a) representing two peaks in Fig. 4(a). Figs. 5(b) and 5(c) have four peaks, representing four cross points of images in Figs. 4(b) and 4(c), respectively.

Based on the above analysis, we can explain the role of principal components in each section. The first section shapes the main structure of a digit image. The second section enhances the boundary region of the main structure. The third and fourth sections focus on the background and can be discarded safely. There is a trade-off in the choice of principal components in the second section. If we want simpler and clearer stroke images, it is desirable to drop components in the second section. However, the variation of generated images will then be more limited. It is a trade-off between image quality and image diversity.
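
Restricting generation to one section amounts to projecting onto a slice of the kernel matrix. In the sketch below, the function name and the 0-based half-open index convention are assumptions; under that convention the paper's sections 1 to 3 would correspond to the ranges `[0, 8)`, `[8, 120)` and `[120, 180)`.

```python
import numpy as np

def reconstruct_from_section(x, kernels, lo, hi):
    """Keep only the components with indices in [lo, hi).
    kernels: (K, N) orthonormal rows sorted by decreasing eigenvalue."""
    section = kernels[lo:hi]
    # Project onto the selected kernels, then map back to pixel space.
    return section.T @ (section @ x)
```

Because the kernels are orthonormal, the outputs of disjoint sections add up to the reconstruction obtained from their union, which is what allows the contribution of each section to be inspected separately.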

5.2 Comparison with GAN and VAE

We validate the proposed image synthesis method using the MNIST handwritten digit dataset. The MNIST dataset is a collection of handwritten digits from 0 to 9. We compare the synthesis results using the proposed method, the VAE and the GAN in Fig. 6. As shown in the figure, our method can generate images that differ from the training data and show sufficient variation. There is no obvious visual difference among images synthesized by the three methods. This suggests that our method performs comparably to the VAE and the GAN, but at a lower training cost and with a transparent design methodology.

Figure 6: Comparison of synthesized digits using (a) our method, (b) the VAE and (c) the GAN.

6 Discussion

There are similarities between the proposed generative model and the GAN. First, they both use random vectors for image synthesis. Second, they both use convolutional operations to generate responses and feed them to the next stage. On the other hand, there exist major differences between them as discussed below.

  • Theoretical Foundation. It is difficult to explain GAN’s underlying operational mechanism soltanolkotabi2018theoretical ; wiatowski2018mathematical . As compared with the GAN, our proposed method is totally transparent.

  • Model Comparison. There are a generator and a discriminator in a GAN. They function as adversaries in the training process. Besides, the BP process is used to determine the model parameters (or filter weights) of both networks. In contrast, our solution does not have the BP dataflow in the training. The training process is FF and one pass.

  • Kernel Determination. The filter weights of GANs and VAEs are equivalent to the transform kernels in our proposed system. These kernels are determined by the PCA of the corresponding inputs. Our solution is data-centric (rather than system-centric). No optimization framework is used in our solution.

7 Conclusion and Future Work

An interpretable generative model was proposed to synthesize handwritten digits in this work. The multi-stage PCA system was adopted to generate images of high resolution and overcome the drawback of the single-stage PCA system. We discussed how to determine the number of principal components in constructing the best linear approximation subspace to the original input space. Also, we presented two methods to detect outliers in the synthesis procedure to improve overall image quality. Experimental results demonstrated that our method can offer high-quality images comparable to those obtained by the GAN and the VAE at a much lower training complexity since no BP is adopted.

There are several possible directions for further exploration. First, we may design and interpret the VAE based on the proposed framework. The VAE has an encoder and a decoder whose structures are similar to ours. They correspond to the forward transform and the inverse transform, respectively. Second, it would be interesting to apply our proposed method to face image synthesis, which is more challenging than handwritten digit synthesis. Third, it is worthwhile to study an automatic and effective way to differentiate synthesized and real images. This is critical to image forensic applications.