1 Introduction and Motivation
Generative models are statistical models that attempt to explain observed data by some underlying hidden (i.e., latent) causes Hyvärinen et al. (2009). Building good generative models for images is appealing for many computer vision and image processing tasks. Although much previous effort has been spent on this problem and has resulted in many models, generating images that match the qualities of natural scenes remains a daunting task.
There are two major schemes for the design of image generative models. The first is based on the known regularities of natural images and aims at satisfying the observed statistics of natural images. Examples include the Gaussian MRF model Mumford and Gidas (2001), for the power law, and the dead leaves model Zhu (2003), for the scale-invariant property of natural images. These models are able to reproduce the empirical statistics of natural images well Lee et al. (2001), but images generated by them do not seem very realistic. The second scheme is data-driven. It assumes a flexible model governed by several parameters, and then learns the parameters from training data. Thanks to large image datasets and powerful deep learning architectures, the second scheme has been adopted in most recent image generation models. Typical examples include variational autoencoders (VAE) Kingma and Welling (2013) and generative adversarial networks (GAN) Goodfellow et al. (2014). Utilizing convolutional neural networks Krizhevsky et al. (2012) as building blocks, and training on tens of thousands of images, deep generative models are able to generate plausible images, as shown in the second row of Figure 1.

On the one hand, despite the promise of deep generative models to recover the true distribution of images, formulating these models usually involves some sort of approximation. For instance, the variational autoencoder (VAE) Kingma and Welling (2013) aims at estimating an explicit probability distribution through maximum likelihood, but the likelihood function is intractable. So a tractable lower bound on the log-likelihood of the distribution is defined and maximized. The generative adversarial network (GAN) Goodfellow et al. (2014) can recover the training data distribution when optimized in the space of arbitrary functions, but in practice, it is always optimized in the space of the model parameters. Therefore, there is basically no theoretical guarantee that the distribution of images produced by generative models is identical to that of natural images. On the other hand, images generated by deep generative models, hereinafter referred to as deep generated images, indeed seem different from natural images, such that it is easy for humans to distinguish them from natural images Denton et al. (2015). Please see the first and second rows of Figure 1. It remains unclear whether deep generative models can reproduce the empirical statistics of natural images.

Driven by this motivation, we take generative adversarial networks and variational autoencoders as examples to explore the statistics of deep generated images with respect to natural images in terms of scale invariance, non-Gaussianity, and Weibull contrast distribution. These comparisons can reveal the degree to which deep generative models capture the essence of natural scenes and guide the community to build more efficient generative models. In addition, the current way of assessing image generative models is often based on the visual fidelity of generated samples judged by human inspection Theis et al. (2015). As far as we know, there is still no clear way to evaluate image generative models Im et al. (2016). We believe that our work provides a new dimension along which to evaluate image generative models.
Specifically, we first train a Wasserstein generative adversarial network (WGAN Arjovsky et al. (2017)), a deep convolutional generative adversarial network (DCGAN Radford et al. (2015)), and a variational autoencoder (VAE Kingma and Welling (2013)) on the ImageNet dataset. The reason for choosing the ImageNet dataset is that it contains a large number of photos from different object categories. We also collect the same amount of cartoon images to compute their statistics and to train the models on them, in order to: 1) compare statistics of natural images and cartoons, 2) compare statistics of generated images and cartoons, and 3) check whether the generative models work better on cartoons, since cartoons have less texture than natural images. As far as we know, we are the first to investigate the statistics of cartoons and deep generated images. Statistics including the luminance distribution, contrast distribution, mean power spectrum, the number of connected components with a given area, and the distribution of random filter responses will be computed.
Our analyses on training natural images confirm existing findings of scale invariance, non-Gaussianity, and Weibull contrast distribution in natural image statistics. We also find non-Gaussianity and Weibull contrast distributions in VAE, DCGAN and WGAN's generated natural images. However, unlike real natural images, none of the generated image sets has a scale-invariant mean power spectrum magnitude. Instead, the deep generative models seem to prefer certain frequency points, at which the power magnitude is significantly larger than in their neighborhoods. We show that this phenomenon is caused by the deconvolution operations in the deep generative models. Replacing the deconvolution layers in the deep generative models with subpixel convolution enables them to generate images with a mean power spectrum more similar to that of natural images. The spiky power spectrum is related to the checkerboard patterns reported in Odena et al. (2016). However, Odena et al. only give a qualitative discussion of individual images in the spatial domain. We are the first to find the spiky power spectra of the generated images and to provide a quantifiable measure of them.
2 Related Work
In this section, we briefly describe recent work that is closely related to this paper, including important findings in the area of natural image statistics and recent developments on deep image generative models.
2.1 Natural Image Statistics
Research on natural image statistics has been growing rapidly since the mid-1990s Hyvärinen et al. (2009). The earliest studies showed that the statistics of natural images remain the same when the images are scaled (i.e., scale invariance) Srivastava et al. (2003); Zhu (2003). For instance, it is observed that the average power spectrum magnitude over natural images has the form of $S(f) = A/f^{\eta}$, with $\eta$ close to 2 (see for example Deriugin (1956); Cohen et al. (1975); Burton and Moorhead (1987); Field (1987)). It can be derived using the scaling theorem of the Fourier transformation that the power spectrum magnitude will stay the same if natural images are scaled by a factor $\sigma$ Zoran (2013). Several other natural image statistics have also been found to be scale invariant, such as the histogram of log contrasts Ruderman and Bialek (1994), the number of gray levels in small patches of images Geman and Koloydenko (1999), the number of connected components in natural images Alvarez et al. (1999), histograms of filter responses, full co-occurrence statistics of two pixels, as well as joint statistics of Haar wavelet coefficients.

Another important property of natural image statistics is non-Gaussianity Srivastava et al. (2003); Zhu (2003); Wainwright and Simoncelli (1999). This means that the marginal distribution of almost any zero-mean linear filter response on virtually any dataset of images is sharply peaked at zero, with heavy tails and high kurtosis (greater than the value of 3 of Gaussian distributions) Lee et al. (2001).

In addition to the two well-known properties of natural image statistics mentioned above, recent studies have shown that the contrast statistics of the majority of natural images follow a Weibull distribution Ghebreab et al. (2009). Although less explored compared to the scale invariance and non-Gaussianity of natural image statistics, the validity of the Weibull contrast distribution has been confirmed in several studies. For instance, Geusebroek and Smeulders Geusebroek and Smeulders (2005)
show that the variance and kurtosis of the contrast distribution of the majority of natural images can be adequately captured by a two-parameter Weibull distribution. It is shown in
Scholte et al. (2009) that the two parameters of the Weibull contrast distribution cover the space of all possible natural scenes in a perceptually meaningful manner. The Weibull contrast distribution has also been applied to a wide range of computer vision and image processing tasks. Ghebreab et al. Ghebreab et al. (2009) propose a biologically plausible model based on the Weibull contrast distribution for rapid natural image identification, and Yanulevskaya et al. Yanulevskaya et al. (2011) exploit this property to predict eye fixation locations in images, to name a few.

2.2 Deep Generative Models
Several deep image generative models have been proposed in a relatively short period of time since 2013. As of this writing, variational autoencoders (VAE) and generative adversarial networks (GAN) constitute two popular categories of these models. VAE aims at estimating an explicit probability distribution through maximum likelihood, but the likelihood function is intractable. So a tractable lower bound on the log-likelihood of the distribution is defined and maximized. For many families of functions, defining such a bound is possible even though the actual log-likelihood is intractable. In contrast, GANs implicitly estimate a probability distribution by only providing samples from it. Training GANs can be described as a game between a generative model $G$ trying to estimate the data distribution and a discriminative model $D$ trying to distinguish between the examples generated by $G$ and the ones coming from the actual data distribution. In each iteration of training, the generative model learns to produce better fake samples while the discriminative model improves its ability to distinguish real samples.
It is shown that a unique solution for $G$ and $D$ exists in the space of arbitrary functions, with $G$ recovering the training data distribution and $D$ equal to $1/2$ everywhere Goodfellow et al. (2014). In practice, $G$ and $D$ are usually defined by multilayer perceptrons (MLPs) or convolutional neural networks (CNNs), and can be trained with backpropagation through gradient-based optimization methods. However, in this case, the optimum is approximated in the parameter space instead of the space of arbitrary functions. Correspondingly, there is no theoretical guarantee that the model's distribution is identical to the data generating process
Goodfellow et al. (2014).

Generally speaking, image samples generated by GANs and VAEs look quite similar to the real ones, but there are indeed some differences. Figure 1 shows samples of training images from ImageNet, and images generated by a popular implementation of GANs, termed DCGAN Radford et al. (2015). As humans, we can easily distinguish the fake images from the real ones. However, it is not so easy to tell how different deep generated images are from the real ones, and whether deep generative models, trained on a large number of images, capture the essence of natural scenes. We believe that answering how well the statistics of deep generated images match the known statistical properties of natural images reveals the degree to which deep generative models capture the essence of natural scenes. Insights can be gained from this work regarding possible improvements of image generative models.
3 Data and Definitions
In this section, we introduce data, definitions, and symbols that will be used throughout the paper.
3.1 Natural Images, Cartoons and Generated Images
We choose 517,400 out of 1,300,000 pictures of the ImageNet Deng et al. (2009) dataset as our natural image training set. These images cover 398 classes of objects, and each class contains 1,300 images. The cartoon training images include 511,460 frames extracted from 303 videos of 73 cartoon movies (i.e., multiple videos per movie). These two sets are used to train the deep generative models to generate natural images and cartoons. All training images are cropped around the image center. Each image has $128 \times 128$ pixels. Figure 2 shows some examples of the natural and cartoon training images.
Several variants of deep generative models have been proposed. Since it is nearly impossible to consider all models, here we focus on three leading models, VAE, DCGAN and WGAN, for our analysis. DCGAN refers to a certain type of generative adversarial network with the architecture proposed by Radford et al. Radford et al. (2015) and the cost function proposed by Goodfellow et al. Goodfellow et al. (2014). WGAN refers to the model with the architecture proposed by Radford et al. Radford et al. (2015) and the cost function proposed by Arjovsky et al. Arjovsky et al. (2017). The VAE approach, proposed by Kingma and Welling Kingma and Welling (2013), consists of fully connected layers, which are not efficient in generating large images. Therefore, we replace the architecture of the original VAE with the convolutional architecture proposed by Radford et al. Radford et al. (2015). In short, the DCGAN, WGAN and VAE models used in this paper have the same architecture. Their difference lies in their loss functions. The generated images considered in this work have a size of $128 \times 128$ pixels. Examples of images generated by VAE, DCGAN and WGAN are shown in Figures 3, 4 and 5, respectively.
3.2 Kurtosis and Skewness
Kurtosis is a measure of the heaviness of the tail of a probability distribution. A large kurtosis indicates that the distribution has a sharp peak and a heavy tail. Skewness measures the asymmetry of a probability distribution with respect to the mean. A positive skewness indicates that the mass of the distribution is concentrated on values less than the mean, while a negative skewness indicates the opposite. The kurtosis and skewness of a random variable $X$ are defined as:

$$\mathrm{kurtosis}(X) = \frac{E\left[(X-\mu)^{4}\right]}{\sigma^{4}}, \qquad (1)$$

$$\mathrm{skewness}(X) = \frac{E\left[(X-\mu)^{3}\right]}{\sigma^{3}}, \qquad (2)$$

where $\mu$ is the mean, $\sigma$ is the standard deviation, and $E$ denotes the mathematical expectation.

3.3 Luminance
Since the training and deep generated images are RGB color images, we first convert them to grayscale using the formula of CCIR Rec. 601, a standard for digital video, as follows:

$$L(i, j) = 0.299\,R(i, j) + 0.587\,G(i, j) + 0.114\,B(i, j). \qquad (3)$$

It is a weighted average of R, G, and B chosen to tally with human perception. Green is weighted most heavily since humans are more sensitive to green than to other colors Kanan and Cottrell (2012). The grayscale value of the pixel at position $(i, j)$ is taken as its luminance $L(i, j)$. Following Geisler (2008), in this work we deal with the normalized luminance within a given image, which is defined by dividing the luminance at each pixel by the average luminance over the whole image:

$$\tilde{L}(i, j) = \frac{L(i, j)}{\frac{1}{HW}\sum_{i'=1}^{H}\sum_{j'=1}^{W} L(i', j')}, \qquad (4)$$

where $H$ and $W$ are the height and width of the image, respectively. Averaging the luminance histograms across images gives the distribution of luminance.
As a fundamental feature encoded by biological visual systems, luminance distribution within natural images has been studied in many works. It has been observed that this distribution is approximately symmetric on a logarithmic axis and hence positively skewed on a linear scale Geisler (2008). In other words, relative to the mean luminance, there are many more dark pixels than light pixels. One reason is the presence of the sky in many images, which always has high luminance, causing the mean luminance to be greater than the luminance of the majority of pixels.
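The normalized luminance and its skewness are straightforward to compute; the following is a minimal NumPy sketch (not the paper's code) using the standard Rec. 601 weights and a synthetic image whose bright top band stands in for the sky:

```python
import numpy as np
from scipy import stats

def normalized_luminance(rgb):
    """Rec. 601 grayscale divided by its mean over the image (cf. eqns. 3-4)."""
    lum = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    return lum / lum.mean()

# Synthetic stand-in for a natural image: mostly darker pixels with a bright
# top band playing the role of the sky.
rng = np.random.default_rng(0)
img = rng.uniform(0, 100, size=(64, 64, 3))
img[:16] += 150
lum = normalized_luminance(img)
skew = stats.skew(lum.ravel())  # positive: more dark pixels than bright ones
```

On such sky-like images the skewness comes out positive, in line with the observation above; averaging the per-image histograms over a dataset gives the luminance distribution.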
3.4 Contrast Distribution
The distribution of local contrast within images has been measured using various definitions of contrast. In this work, we use the gradient magnitude calculated by Gaussian derivative filters to define the local contrast of an image, as in Scholte et al. (2009); Ghebreab et al. (2009); Yanulevskaya et al. (2011). These contrast values have been shown to follow a Weibull distribution Ghebreab et al. (2009):

$$p(c) = \frac{\gamma}{\beta}\left(\frac{c}{\beta}\right)^{\gamma-1} e^{-\left(c/\beta\right)^{\gamma}}, \qquad (5)$$

where $c$ is the contrast value, $\beta$ is the scale parameter, and $\gamma$ is the shape parameter. Images are first converted to a color space that is optimized to match the color representation of the human visual system Yanulevskaya et al. (2011):

$$\begin{pmatrix} E_1 \\ E_2 \\ E_3 \end{pmatrix} = \begin{pmatrix} 0.06 & 0.63 & 0.27 \\ 0.30 & 0.04 & -0.35 \\ 0.34 & -0.60 & 0.17 \end{pmatrix} \begin{pmatrix} R \\ G \\ B \end{pmatrix}, \qquad (6)$$

where $R$, $G$ and $B$ are the intensities of a pixel in the red, green and blue channels, respectively. The gradient magnitude is then obtained by

$$C = \sqrt{\sum_{i=1}^{3}\left(E_{i,x}^{2} + E_{i,y}^{2}\right)}, \qquad (7)$$

where $E_{i,x}$ and $E_{i,y}$ are the responses of the $i$-th channel to Gaussian derivative filters in the $x$ and $y$ directions, given by the following impulse responses:

$$G_{x}(x, y) = -\frac{x}{2\pi\sigma^{4}}\, e^{-\frac{x^{2}+y^{2}}{2\sigma^{2}}}, \qquad (8)$$

$$G_{y}(x, y) = -\frac{y}{2\pi\sigma^{4}}\, e^{-\frac{x^{2}+y^{2}}{2\sigma^{2}}}. \qquad (9)$$
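The contrast pipeline can be sketched in a few lines; this is an illustration rather than the authors' implementation. The opponent-color coefficients follow the Gaussian color model of the cited work (taken here as an assumption), the derivative responses come from `scipy.ndimage`, and the two-parameter Weibull fit uses `scipy.stats.weibull_min` with the location pinned to zero:

```python
import numpy as np
from scipy import ndimage, stats

# Opponent color transform; coefficients follow the Gaussian color model
# used in the cited work (an assumption in this sketch).
OPP = np.array([[0.06, 0.63, 0.27],
                [0.30, 0.04, -0.35],
                [0.34, -0.60, 0.17]])

def contrast_magnitude(rgb, sigma=1.0):
    """Gradient magnitude summed over the three opponent channels (cf. eqn. 7)."""
    opp = rgb @ OPP.T  # (H, W, 3) RGB -> (H, W, 3) opponent channels
    sq = np.zeros(opp.shape[:2])
    for i in range(3):
        ex = ndimage.gaussian_filter(opp[..., i], sigma, order=(0, 1))  # d/dx
        ey = ndimage.gaussian_filter(opp[..., i], sigma, order=(1, 0))  # d/dy
        sq += ex**2 + ey**2
    return np.sqrt(sq)

rng = np.random.default_rng(0)
img = rng.uniform(0, 255, size=(64, 64, 3))  # synthetic stand-in for an image
c = contrast_magnitude(img).ravel()
c = c[c > 0]  # drop exact zeros before the maximum-likelihood fit
# Two-parameter Weibull fit: shape gamma and scale beta, location fixed at 0.
gamma, _, beta = stats.weibull_min.fit(c, floc=0)
```

Fitting this distribution per image, as in Section 4.2, yields one $(\gamma, \beta)$ pair per image, which can then be compared across image sets.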
3.5 Filter Responses
It has been observed that convolving natural images with almost any zero-mean linear filter results in a histogram of a similar shape, with a heavy tail, a sharp peak and high kurtosis Zoran (2013) (higher than the kurtosis of the Gaussian distribution, which is 3). This is called the non-Gaussian property of natural images.

Since it is impossible in this work to consider all such filters, we avoid inspecting responses to any specific filter. Instead, without loss of generality, we apply random zero-mean filters to images, as introduced in Huang and Mumford (1999), to measure properties of the images themselves. A random zero-mean filter $F$ is generated by normalizing a random matrix $R$ with independent elements sampled uniformly from $[0, 1]$:

$$F = R - \bar{R}, \qquad (10)$$

where $\bar{R}$ denotes the mean of the elements of $R$.
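A minimal sketch of this procedure (assuming $U[0,1]$ entries and mean subtraction as the normalization, which is a simplification) applied to a toy piecewise-constant image. A zero-mean filter responds with exactly zero inside uniform patches and spikes only near patch borders, which produces the peaked, heavy-tailed response histogram described above:

```python
import numpy as np
from scipy import ndimage, stats

def random_zero_mean_filter(n, rng):
    """n x n filter with i.i.d. U[0,1] entries, shifted to have zero mean."""
    f = rng.uniform(0.0, 1.0, size=(n, n))
    return f - f.mean()

rng = np.random.default_rng(0)
filt = random_zero_mean_filter(3, rng)

# Toy piecewise-constant "image" (8x8 grid of 16x16 constant patches):
# the response is ~0 inside patches and non-zero only at patch borders.
img = np.kron(rng.uniform(0, 255, size=(8, 8)), np.ones((16, 16)))
resp = ndimage.convolve(img, filt).ravel()
kurt = stats.kurtosis(resp, fisher=False)  # Gaussian reference value is 3
```

The kurtosis of the response histogram comes out well above 3 on such edge-dominated images, mimicking the non-Gaussianity measured on real images in Section 4.3.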
3.6 Homogeneous Regions
Homogeneous regions in an image are the connected components where the contrast does not exceed a certain threshold. Consider an image of size $H \times W$ with $L$ gray levels. We generate a series of thresholds $t_1 < t_2 < \dots < t_{n-1}$, in which $t_k$ is the least integer such that more than $\frac{k}{n} H W$ pixels have a gray value less than $t_k$. Using these thresholds to segment an image results in $n$ sets of homogeneous regions. Figure 7 illustrates an example image and its homogeneous regions.
Alvarez et al. Alvarez et al. (1999) show that the number of homogeneous regions in natural images, denoted as $N(s)$, as a function of their size $s$, obeys the following law:

$$N(s) = K s^{-\alpha}, \qquad (11)$$

where $K$ is an image-dependent constant, $s$ denotes the area, and $\alpha$ is close to 2. Suppose image $I$ is scaled by a factor $\sigma$ into $I_\sigma$, such that $I_\sigma(x, y) = I(\sigma x, \sigma y)$. Let $N_\sigma(s)$ denote the number of homogeneous regions of area $s$ in $I_\sigma$. Then, for $\alpha = 2$, $N_\sigma(s)$ obeys the same law as $N(s)$, so the number of homogeneous regions in natural images is a scale-invariant statistic.
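A minimal sketch of counting homogeneous regions and fitting eqn. 11; the quantile-based thresholds and the least-squares fit in log-log space are simplifications of the procedure described above, not the paper's exact code:

```python
import numpy as np
from scipy import ndimage

def homogeneous_region_areas(gray, n=16):
    """Split gray levels into n quantile bands and return connected-component areas."""
    edges = np.quantile(gray, np.linspace(0.0, 1.0, n + 1))
    band = np.clip(np.searchsorted(edges, gray, side='right') - 1, 0, n - 1)
    areas = []
    for b in range(n):
        labels, num = ndimage.label(band == b)  # connected components per band
        if num > 0:
            areas.extend(np.bincount(labels.ravel())[1:])  # drop background count
    return np.asarray(areas)

rng = np.random.default_rng(0)
gray = ndimage.gaussian_filter(rng.uniform(0, 255, size=(128, 128)), 2)
areas = homogeneous_region_areas(gray)

# Least-squares fit of log N(s) = log K - alpha * log s (cf. eqn. 11).
s, counts = np.unique(areas, return_counts=True)
keep = s >= 2
slope, log_k = np.polyfit(np.log(s[keep]), np.log(counts[keep]), 1)
alpha = -slope
```

Each pixel falls into exactly one band, so the region areas partition the image; repeating the fit per image gives the per-image parameters averaged in Table 4.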
3.7 Power Spectrum
We adopt the most commonly used definition of power spectrum in the image statistics literature: "the power of different frequency components". Formally, the power spectrum of an image is defined as the square of the magnitude of the image FFT. Prior studies Deriugin (1956); Burton and Moorhead (1987); Field (1987) have shown that the mean power spectrum of natural images, denoted as $S(f)$, where $f$ is frequency, is scale invariant. It has the form of:

$$S(f) = \frac{A}{f^{\eta}}, \qquad (12)$$

where $A$ is a constant and $\eta$ is close to 2.
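To make eqn. 12 concrete, here is a small sketch that estimates $\eta$ by a log-log line fit on a row-averaged power spectrum. The synthetic $1/f$ rows are an assumption standing in for real images, and should recover $\eta \approx 2$:

```python
import numpy as np

def row_mean_power_spectrum(gray):
    """Power |FFT|^2 of each row, averaged over rows (horizontal frequencies)."""
    return (np.abs(np.fft.rfft(gray, axis=1)) ** 2).mean(axis=0)

# Synthetic image whose rows have Fourier amplitudes ~ 1/f, hence power ~ 1/f^2.
rng = np.random.default_rng(0)
h = w = 128
f = np.abs(np.fft.fftfreq(w))
f[0] = 1.0  # avoid dividing by zero at DC
spec = (rng.standard_normal((h, w)) + 1j * rng.standard_normal((h, w))) / f
img = np.fft.ifft(spec, axis=1).real

ps = row_mean_power_spectrum(img)
freqs = np.fft.rfftfreq(w)[1:]  # drop the DC component
slope, log_a = np.polyfit(np.log(freqs), np.log(ps[1:]), 1)
eta = -slope  # close to 2 for 1/f images, matching the natural-image law
```

The same fit applied to generated images is what exposes the spikes discussed in Sections 4.5 and 5: spiky spectra leave large residuals around the fitted line.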
4 Experiments and Results
In this section, we report the experimental results of luminance distribution, contrast distribution, random filter response, distribution of homogeneous regions, and the mean power spectrum of the training images and deep generated images. We use Welch's t-test to test whether the statistics are significantly different between generated images and training images (ImageNet1 and Cartoon1 in the tables). The larger the $p$-value is, the more similar the generated images are to the training images. Therefore, the models can be ranked according to the t-test results. To make sure that our results are not specific to the choice of training images, we sampled another set of training images (ImageNet2 and Cartoon2 in the tables), and use the t-test to measure the difference between the two sets of training images. All experiments are performed using Python 2.7 and OpenCV 2.0 on a PC with an Intel i7 CPU and 32GB of RAM. The deep generative models used in this work are implemented in PyTorch (https://github.com/pytorch).

4.1 Luminance
Luminance distributions of training images and deep generated images are shown in Figure 8. Average skewness values are shown in Table 1.
Results in Figure 8 show that the luminance distributions of training and generated images have similar shapes, while those of cartoons are markedly different from natural images. From Table 1, we can see that the luminance distributions over natural images, cartoons, generated natural images and generated cartoons all have positive skewness values. However, the difference in skewness values between training and generated images is statistically significant (over both natural images and cartoons). The difference between skewness values over each image type (i.e., ImageNet1 vs. ImageNet2 or Cartoon1 vs. Cartoon2) is not significant, indicating that our findings are general and the image sets are good representatives of the natural or synthetic scenes. According to Table 1, we rank the models in terms of luminance distribution as follows. For natural images, WGAN > DCGAN > VAE, and for cartoons, VAE > DCGAN > WGAN.
  I1  I2  IDCGAN  IWGAN  IVAE  C1  C2  CDCGAN  CWGAN  CVAE 

Skew  0.11  0.11  0.08  0.14  0.15  0.29  0.30  0.23  0.49  0.25 
tstat    0.30  3.45  2.75  4.25    0.51  3.65  10.78  3.26 
pvalue    0.76  0.00  0.00  0.00    0.60  0.00  0.00  0.00 
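The significance tests behind Table 1 (and the later tables) can be reproduced with SciPy; this sketch uses made-up per-image skewness samples, since the point is only the `equal_var=False` flag that selects Welch's t-test rather than Student's:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical per-image skewness values for two image sets (one value per image);
# the means loosely echo Table 1, the spreads are invented for illustration.
skew_real = rng.normal(loc=0.11, scale=0.30, size=5000)
skew_fake = rng.normal(loc=0.15, scale=0.35, size=5000)

# equal_var=False selects Welch's t-test, which does not assume equal variances.
t_stat, p_value = stats.ttest_ind(skew_real, skew_fake, equal_var=False)
```

A small $p$-value means the two sets differ significantly in this statistic; comparing two disjoint samples of the same kind of images (ImageNet1 vs. ImageNet2) should instead give a large $p$-value, as in the second column of the tables.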
4.2 Contrast
It has been reported that the contrast distribution of natural images follows a Weibull distribution Geusebroek and Smeulders (2005). To test this on our data, we first fit a Weibull distribution (eqn. 5) to the contrast histogram of each of the generated images and training images. Then, we use the KL divergence to examine whether the contrast distribution of deep generated images can be as well modeled by a Weibull distribution as in the case of natural images. If this is true, the fitted distributions will be close to the histograms, as for training images, and thus the KL divergence will be small.
Figure 9 shows that the contrast distributions of training and generated images have similar shapes, while those of cartoons are markedly different from natural images. The parameters of the fitted Weibull distribution and its KL divergence to the histogram, as well as the corresponding t-test results, are shown in Table 2. We find that the contrast distributions of generated natural images are also Weibull distributions. However, the difference in parameters between training and generated images, for both natural images and cartoons, is statistically significant. We also observe that the KL divergence between the contrast distribution and its Weibull fit is small for natural images and generated natural images, while it is larger for cartoons and generated cartoons. According to Table 2, WGAN gets the largest $p$-value for both natural images and cartoons. DCGAN and VAE have equally small $p$-values. Therefore, for both natural images and cartoons, WGAN > DCGAN > VAE in terms of contrast distribution.
  I1  I2  IDCGAN  IWGAN  IVAE  C1  C2  CDCGAN  CWGAN  CVAE 

KLD  1.68  1.67  1.50  1.63  1.49  2.54  2.49  2.29  2.13  1.63 
tstat    1.62  26.66  6.86  31.90    1.67  8.35  13.97  31.71 
pval    0.10  0.00  0.00  0.00    0.09  0.00  0.00  0.00 
γ  1.15  1.16  1.23  1.16  1.17  1.01  1.01  1.02  1.00  1.13 
tstat    1.70  31.14  2.64  10.57    0.23  11.21  9.74  91.82 
pval    0.08  0.00  0.00  0.00    0.81  0.00  0.00  0.00 
β  1251.86  1257.05  1055.32  1241.86  518.40  1262.24  1270.16  1198.82  1247.14  588.88 
tstat    1.11  46.98  2.03  216.72    1.40  11.69  2.83  162.88 
pval    0.26  0.00  0.04  0.00    0.16  0.00  0.00  0.00 
4.3 Filter Responses
We generate three zero-mean random filters as in Huang and Mumford (1999), and apply them to the ImageNet training images, VAE generated images, DCGAN generated images and WGAN generated images. Averaging the response histograms over the training images, VAE images, DCGAN images and WGAN images gives the distributions shown in Figure 10 (in that order). The distributions of responses to the different random filters have similar shapes, with a sharp peak and a heavy tail, which is in agreement with Huang and Mumford's results Huang and Mumford (1999). The average kurtosis of the filter response distributions over the training images and deep generated images is shown in Table 3.
Figure 10 shows that generated images have filter response distributions similar to training images, while those of cartoons look different from natural images. Table 3 shows that the average response kurtosis of generated natural images and real natural images is in all cases greater than the Gaussian value of 3. As a result, we draw the conclusion that the generated natural images have non-Gaussianity similar to that of natural images. However, there is a statistically significant difference in filter response kurtosis between deep generated images and training images, for both natural images and cartoons (except ImageNet-WGAN and Cartoon-DCGAN). For natural images, WGAN gets the largest $p$-value for filters 1, 2 and 3. DCGAN and VAE have similar $p$-values. Therefore, for natural images, WGAN > DCGAN > VAE. For cartoons, WGAN gets the largest $p$-value for filter 3, and DCGAN gets the largest $p$-values for filters 1 and 2. Therefore, for cartoons, DCGAN > WGAN > VAE.
  I1  I2  IDCGAN  IWGAN  IVAE  C1  C2  CDCGAN  CWGAN  CVAE 

filter 1  5.93  5.91  6.70  5.87  11.36  7.97  7.97  7.97  8.21  11.32 
tstat    1.42  16.43  1.26  101.32    0.05  0.00  2.16  31.14 
pval    0.67  0.00  0.20  0.00    0.95  0.99  0.03  0.00 
filter 2  5.77  5.67  7.11  5.95  15.77  7.39  7.46  7.43  7.17  12.01 
tstat    1.08  29.00  3.94  175.08    0.64  0.33  2.72  56.42 
pval    0.03  0.00  0.00  0.00    0.51  0.73  0.00  0.00 
filter 3  5.79  5.74  6.17  5.77  11.98  5.12  5.19  5.33  5.19  8.43 
tstat    1.05  8.45  0.33  110.72    0.48  1.73  0.69  35.58 
pval    0.29  0.00  0.73  0.00    0.62  0.08  0.48  0.00 
4.4 Homogeneous Regions
We compute the distribution of homogeneous regions as stated in Section 3.6. The number of region sets ($n$ in Section 3.6) is set to 16. Figure 11 shows the distribution of the number of homogeneous regions of area $s$ in the training images and deep generated images. We use eqn. 11 to fit the distribution of homogeneous regions of each image. Table 4 shows the average parameters $K$ and $\alpha$ of eqn. 11 evaluated through maximum likelihood (only regions above a minimum area, in pixels, are considered in the evaluation).
Over real natural images and generated natural images, the relationship between the number and the area of regions is linear in log-log plots, thus supporting the scale-invariance property observed by Alvarez et al. Alvarez et al. (1999). This is also reflected in the small fitting residual to eqn. 11 shown in the first column of Table 4. We also find that this property holds over cartoons and generated cartoons. However, the differences between the parameters of deep generated images and training images, for both natural images and cartoons, are statistically significant (except ImageNet-WGAN). For natural images, WGAN has the largest $p$-value. DCGAN and VAE have equally small $p$-values. Therefore, we rank the models for natural images as follows: WGAN > DCGAN > VAE. For cartoons, the three models have similar $p$-values, therefore, WGAN ≈ DCGAN ≈ VAE.
  I1  I2  IDCGAN  IWGAN  IVAE  C1  C2  CDCGAN  CWGAN  CVAE 

α  1.54  1.55  1.60  1.54  0.65  1.25  1.25  1.37  1.44  0.73 
tstatistic    2.26  23.16  0.96  387.72    0.49  52.89  82.67  220.13 
pvalue    0.02  0.00  0.33  0.00    0.62  0.00  0.00  0.00 
log K  2.91  2.92  3.02  2.91  1.21  2.36  2.36  2.91  2.75  1.36 
tstatistic    2.03  25.21  0.89  411.37    0.62  51.35  83.14  212.51 
pvalue    0.04  0.00  0.36  0.00    0.53  0.00  0.00  0.00 
residual  3.87  3.88  3.82  3.89  3.98  4.24  4.22  4.39  4.11  3.85 
tstatistic    0.41  4.49  1.55  8.94    1.24  13.29  11.60  34.12 
pvalue    0.68  0.00  0.11  0.00    0.21  0.00  0.00  0.00 
4.5 Power Spectrum
Figures 12(a) and 13(a) show the mean power spectrum of the training images. We use eqn. 12 to fit a power spectrum to each image. The evaluated parameters of eqn. 12 and the corresponding fitting residual, averaged over all images, are shown in Table 5. The results of the t-test show that the differences between the parameters of deep generated images and training images, for both natural images and cartoons, are statistically significant. For natural images, WGAN has the largest $p$-value. DCGAN and VAE have similar $p$-values; therefore, for natural images, WGAN > DCGAN > VAE. For cartoons, WGAN has the largest $p$-values for $A$ and $\eta$. VAE has the largest $p$-value for the vertical residual, and DCGAN has the largest $p$-value for the horizontal residual. Therefore, for cartoons, WGAN > DCGAN > VAE.
The scale-invariant mean power spectrum of natural images, with the form of eqn. 12, is the earliest and most robust discovery in natural image statistics. Our experiments on training images confirm this discovery and align with the prior results in Deriugin (1956); Cohen et al. (1975); Burton and Moorhead (1987); Field (1987). This can be seen from the linear relationship between log frequency and log power magnitude shown in Figure 13(a), and the small fitting residual to eqn. 12 in the first column of Table 5. We also observe a similar pattern over cartoons.

However, unlike the training images, the generated images have a spiky mean power spectrum. See Figure 12(b)(c) and Figure 13(b)(c). It can be seen from the figures that there are several local maxima of energy at certain frequency points. Without the frequency axis being taken in logarithm, it can be read from Figure 12(b)(c) that the position of each spike is an integer multiple of 4/128 cycle/pixel.
  I1  I2  IDCGAN  IWGAN  IVAE  C1  C2  CDCGAN  CWGAN  CVAE 

log A (h)  9.23  9.24  8.97  9.21  8.83  9.16  9.16  9.07  9.20  8.94 
tstatistic    0.75  57.96  6.01  105.71    0.61  14.17  6.17  43.96 
pvalue    0.45  0.00  0.00  0.00    0.53  0.00  0.00  0.00 
η (h)  1.97  1.97  1.88  1.95  2.33  1.98  1.97  1.94  1.98  2.32 
tstatistic    0.02  26.64  5.71  111.16    1.22  9.81  1.68  102.43 
pvalue    0.97  0.00  0.00  0.00    0.22  0.00  0.09  0.00 
residualh  1.53  1.55  2.57  1.41  2.18  2.00  2.02  2.06  1.73  1.81 
tstatistic    1.27  61.28  10.67  52.32    0.56  2.96  18.14  12.78 
pvalue    0.20  0.00  0.00  0.00    0.57  0.00  0.00  0.00 
log A (v)  9.29  9.28  9.00  9.28  8.97  9.28  9.27  9.18  9.27  9.13 
tstatistic    0.57  67.42  0.54  86.95    0.16  16.27  0.22  29.44 
pvalue    0.56  0.00  0.58  0.00    0.87  0.00  0.82  0.00 
η (v)  1.95  1.95  1.81  1.97  2.33  2.03  2.02  1.98  2.01  2.42 
tstatistic    0.88  45.61  5.03  105.75    0.91  14.52  4.21  96.76 
pvalue    0.37  0.00  0.00  0.00    0.36  0.00  0.00  0.00 
residualv  1.45  1.46  1.74  1.48  2.13  1.98  1.98  1.84  1.89  1.99 
tstatistic    0.63  21.24  2.00  49.10    0.12  8.01  5.46  0.66 
pvalue    0.52  0.00  0.04  0.00    0.89  0.00  0.00  0.50 
5 Discussion
It is surprising to see that, unlike natural images, deep generated images do not meet the well-established scale-invariance property of the mean power spectrum. These models, however, reproduce other statistical properties well, such as the Weibull contrast distribution and non-Gaussianity. Specifically, the mean power magnitude of natural images falls smoothly with frequency in the form of $A/f^{\eta}$, but the mean power spectra of the deep generated images turn out to have local energy maxima at integer multiples of the frequency 4/128 (i.e., 4/128, 8/128, etc.). In the spatial domain, this indicates that there are some periodic patterns, with a period of 32 pixels, superimposed on the generated images. Averaging the deep generated images gives an intuitive visualization of these periodic patterns. Please see Figure 14.
One reason for the occurrence of the periodic patterns might be the deconvolution operation (a.k.a. transposed or fractionally strided convolution). Deep image generative models usually consist of multiple layers. The generated images are progressively built from low to high resolution, layer by layer. The processing of each layer includes a deconvolution operation to transform a smaller input into a larger one.
Figure 15 provides an intuitive explanation of deconvolution operations. In a deconvolution operation using strides,
zeros are inserted between input units, which makes the output tensors
times of the length of the size tensors Dumoulin and Visin (2016).Inserting zeros between input units will have the effect of superposing an impulse sequence with period of pixels on the output map. Meanwhile, if the input itself is an impulse sequence, the period will be enlarged times. Figure 16 shows a demonstration of how the generation process with deconvolution operations gives rise to the periodic patterns. Consider a deep generative model with deconvolution layers with strides, similar to the models used in this work. When this model builds images from tensors, the first layer outputs the weighted sum of its kernels. The output maps of the second layer are superposed on an impulse sequence with a period of 2 pixels. Each of the subsequent layers doubles the period of the input, meanwhile superposes a new impulse sequence with a period of 2 on the output. Finally, there will be a periodic pattern consisting of impulse sequences of different intensities and periods of pixels, which corresponds to spikes at positions in the frequency domain. Figure 17 shows this periodic pattern when
Figure 17 shows this periodic pattern for the configuration used in this work, together with its power spectrum (averaged over the horizontal direction). As can be seen, the spikes in the power spectrum of this periodic pattern exactly match those in the power spectra of the deep generated images shown in Figure 12, which experimentally confirms that the spiky power spectra of deep generated images are caused by the deconvolution operation. Apart from causing the spikes in the power spectrum of the generated images, the deconvolution operation has other observed drawbacks. For instance, the zero values added by deconvolution operations carry no gradient information that can be backpropagated through and have to be later filled with meaningful values Tetra (2017). In order to overcome the shortcomings of deconvolution operations, recently, a new upscaling operation known as sub-pixel convolution Shi et al. (2016)
has been proposed for image super-resolution. Instead of inserting zeros to upscale the input, sub-pixel convolution performs more convolutions at the lower resolution and reshapes the resulting maps into a larger output Tetra (2017). Since no zeros are inserted, replacing deconvolution operations with sub-pixel convolutions is expected to remove the periodic patterns from the output, so that the power spectra of the generated images become more similar to those of natural images, without the spikes shown in Figure 17 and Figure 12. To confirm this, we trained a WGAN model with all of its deconvolution layers replaced by sub-pixel convolutions. Examples of images generated by this model are shown in Figure 18, and the corresponding mean power spectrum in Figure 19. As the results show, replacing deconvolution operations with sub-pixel convolutions removes the periodic patterns and yields images whose mean power spectrum is more similar to that of natural images.
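The reshaping step of sub-pixel convolution (often called pixel shuffle) can be sketched in NumPy. This is an illustrative implementation under our own conventions, not the code used in the experiments; it shows that upscaling by interleaving r×r low-resolution channels involves no zero insertion at all:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) tensor into (C, H*r, W*r): the r*r
    output channels of a low-resolution convolution are interleaved
    into an upscaled map, with no zeros ever inserted."""
    c_rr, h, w = x.shape
    c = c_rr // (r * r)
    x = x.reshape(c, r, r, h, w)          # split channels into (c, i, j)
    x = x.transpose(0, 3, 1, 4, 2)        # -> (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)     # interleave into the output grid

# One output channel, upscaling factor r=2: four 2x2 maps -> one 4x4 map.
low_res = np.arange(4 * 2 * 2, dtype=float).reshape(4, 2, 2)
high_res = pixel_shuffle(low_res, r=2)    # shape (1, 4, 4)
```

Because every output pixel comes from a learned convolution rather than from a zero placeholder, no impulse train is superposed on the output and the spectrum gains no artificial spikes.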
6 Summary and Conclusion
We explore the statistics of images generated by state-of-the-art deep generative models (VAE, DCGAN and WGAN), as well as cartoon images, with respect to natural image statistics. Our analyses of the natural training images corroborate existing findings of scale invariance, non-Gaussianity, and the Weibull contrast distribution in natural image statistics. We also find non-Gaussianity and Weibull contrast distributions for images generated with VAE, DCGAN and WGAN. These statistics, however, still differ significantly from those of natural images. Unlike natural images, none of the generated images has a scale invariant mean power spectrum, which indicates extra structures in the generated images. We show that these extra structures are caused by the deconvolution operations. Replacing deconvolution layers in the deep generative models with sub-pixel convolutions helps them generate images with mean power spectra closer to those of natural images.
Inspecting how well the statistics of the generated images match natural scenes can a) reveal the degree to which deep learning models capture the essence of natural scenes, b) provide a new dimension for evaluating models, and c) suggest possible directions for improving image generation models. Correspondingly, two possible future works include:

a) Building a new metric for evaluating deep image generative models based on image statistics, and

b) Designing deep image generative models that better capture the statistics of natural scenes (e.g., through designing new loss functions).
To encourage future explorations in this area and to allow assessing the quality of images produced by other image generation models, we share our cartoon dataset and the code for computing the statistics of images at: https://github.com/zengxianyu/generate
References
 Alvarez et al. (1999) Alvarez, L., Gousseau, Y., Morel, J.M., 1999. The size of objects in natural and artificial images. Advances in Imaging and Electron Physics 111, 167–242.
 Arjovsky et al. (2017) Arjovsky, M., Chintala, S., Bottou, L., 2017. Wasserstein GAN. arXiv preprint arXiv:1701.07875 .
 Burton and Moorhead (1987) Burton, G., Moorhead, I.R., 1987. Color and spatial structure in natural scenes. Applied Optics 26, 157–170.
 Cohen et al. (1975) Cohen, R.W., Gorog, I., Carlson, C.R., 1975. Image descriptors for displays. Technical Report. DTIC Document.

 Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L., 2009. Imagenet: A large-scale hierarchical image database, in: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, IEEE. pp. 248–255.
 Denton et al. (2015) Denton, E.L., Chintala, S., Fergus, R., et al., 2015. Deep generative image models using a laplacian pyramid of adversarial networks, in: Advances in neural information processing systems, pp. 1486–1494.
 Deriugin (1956) Deriugin, N., 1956. The power spectrum and the correlation function of the television signal. Telecommunications 1, 1–12.
 Dumoulin and Visin (2016) Dumoulin, V., Visin, F., 2016. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285 .
 Field (1987) Field, D.J., 1987. Relations between the statistics of natural images and the response properties of cortical cells. JOSA A 4, 2379–2394.
 Geisler (2008) Geisler, W.S., 2008. Visual perception and the statistical properties of natural scenes. Annu. Rev. Psychol. 59, 167–192.
 Geman and Koloydenko (1999) Geman, D., Koloydenko, A., 1999. Invariant statistics and coding of natural microimages, in: IEEE Workshop on Statistical and Computational Theories of Vision.
 Geusebroek and Smeulders (2005) Geusebroek, J.M., Smeulders, A.W., 2005. A six-stimulus theory for stochastic texture. International Journal of Computer Vision 62, 7–16.
 Ghebreab et al. (2009) Ghebreab, S., Scholte, S., Lamme, V., Smeulders, A., 2009. A biologically plausible model for rapid natural scene identification, in: Advances in Neural Information Processing Systems, pp. 629–637.
 Goodfellow et al. (2014) Goodfellow, I., PougetAbadie, J., Mirza, M., Xu, B., WardeFarley, D., Ozair, S., Courville, A., Bengio, Y., 2014. Generative adversarial nets, in: Advances in neural information processing systems, pp. 2672–2680.
 Huang and Mumford (1999) Huang, J., Mumford, D., 1999. Statistics of natural images and models, in: Computer Vision and Pattern Recognition, 1999. IEEE Computer Society Conference On., IEEE. pp. 541–547.
 Hyvärinen et al. (2009) Hyvärinen, A., Hurri, J., Hoyer, P.O., 2009. Natural Image Statistics: A Probabilistic Approach to Early Computational Vision.. volume 39. Springer Science & Business Media.
 Im et al. (2016) Im, D.J., Kim, C.D., Jiang, H., Memisevic, R., 2016. Generating images with recurrent adversarial networks. arXiv preprint arXiv:1602.05110 .
 Kanan and Cottrell (2012) Kanan, C., Cottrell, G.W., 2012. Color-to-grayscale: does the method matter in image recognition? PloS one 7, e29740.
 Kingma and Welling (2013) Kingma, D.P., Welling, M., 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 .
 Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks, in: Advances in neural information processing systems, pp. 1097–1105.
 Lee et al. (2001) Lee, A.B., Mumford, D., Huang, J., 2001. Occlusion models for natural images: A statistical study of a scale-invariant dead leaves model. International Journal of Computer Vision 41, 35–59.
 Mumford and Gidas (2001) Mumford, D., Gidas, B., 2001. Stochastic models for generic images. Quarterly of applied mathematics 59, 85–111.
 Odena et al. (2016) Odena, A., Dumoulin, V., Olah, C., 2016. Deconvolution and checkerboard artifacts. Distill URL: http://distill.pub/2016/deconv-checkerboard, doi:10.23915/distill.00003.
 Radford et al. (2015) Radford, A., Metz, L., Chintala, S., 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 .
 Ruderman and Bialek (1994) Ruderman, D.L., Bialek, W., 1994. Statistics of natural images: Scaling in the woods. Physical review letters 73, 814–817.
 Scholte et al. (2009) Scholte, H.S., Ghebreab, S., Waldorp, L., Smeulders, A.W., Lamme, V.A., 2009. Brain responses strongly correlate with weibull image statistics when processing natural images. Journal of Vision 9, 29–29.
 Shi et al. (2016) Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A.P., Bishop, R., Rueckert, D., Wang, Z., 2016. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1874–1883.
 Srivastava et al. (2003) Srivastava, A., Lee, A.B., Simoncelli, E.P., Zhu, S.C., 2003. On advances in statistical modeling of natural images. Journal of mathematical imaging and vision 18, 17–33.

 Tetra (2017) Tetra, 2017. subpixel: A subpixel convnet for super resolution with tensorflow. URL: https://github.com/tetrachrome/subpixel.
 Theis et al. (2015) Theis, L., Oord, A.v.d., Bethge, M., 2015. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844 .
 Wainwright and Simoncelli (1999) Wainwright, M.J., Simoncelli, E.P., 1999. Scale mixtures of gaussians and the statistics of natural images., in: Nips, pp. 855–861.
 Yanulevskaya et al. (2011) Yanulevskaya, V., Marsman, J.B., Cornelissen, F., Geusebroek, J.M., 2011. An image statistics–based model for fixation prediction. Cognitive computation 3, 94–104.
 Zhu (2003) Zhu, S.C., 2003. Statistical modeling and conceptualization of visual patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 691–712.
 Zoran (2013) Zoran, D., 2013. Natural Image Statistics for Human and Computer Vision. Ph.D. thesis. Hebrew University of Jerusalem.