Statistics of Deep Generated Images

08/09/2017, by Yu Zeng et al., Dalian University of Technology

Here, we explore the low-level statistics of images generated by state-of-the-art deep generative models. First, Wasserstein generative adversarial network (WGAN) and deep convolutional generative adversarial network (DCGAN) are trained on the ImageNet dataset and a large set of cartoon frames from animations. Then, for images generated by these models as well as natural scenes and cartoons, statistics including mean power spectrum, the number of connected components in a given image area, distribution of random filter responses, and contrast distribution are computed. Our analyses on training images support current findings on scale invariance, non-Gaussianity, and Weibull contrast distribution of natural scenes. We find that although similar results hold over cartoon images, there is still a significant difference between the statistics of natural scenes and images generated by both DCGAN and WGAN models. In particular, generated images do not have a scale invariant mean power spectrum magnitude, which indicates the existence of extra structures in these images caused by deconvolution operations. We also find that replacing deconvolution layers in the deep generative models by sub-pixel convolution helps them generate images with a mean power spectrum more similar to the mean power spectrum of natural images. Inspecting how well the statistics of deep generated images match the known statistical properties of natural images, such as scale invariance, non-Gaussianity, and Weibull contrast distribution, can a) reveal the degree to which deep learning models capture the essence of the natural scenes, b) provide a new dimension to evaluate models, and c) allow possible improvement of image generative models (e.g., via defining new loss functions).


1 Introduction and Motivation

Generative models are statistical models that attempt to explain observed data by some underlying hidden (i.e., latent) causes Hyvärinen et al. (2009). Building good generative models for images is very appealing for many computer vision and image processing tasks. Although a lot of previous effort has been spent on this problem and has resulted in many models, generating images that match the qualities of natural scenes remains a daunting task.

There are two major schemes for the design of image generative models. The first one is based on the known regularities of natural images and aims at satisfying the observed statistics of natural images. Examples include the Gaussian MRF model Mumford and Gidas (2001) for the $1/f^2$ power law, and the Dead leaves model Zhu (2003) for the scale invariant property of natural images. These models are able to reproduce the empirical statistics of natural images well Lee et al. (2001), but images generated by them do not seem very realistic. The second scheme is data-driven. It assumes a flexible model governed by several parameters, and then learns the parameters from training data. Thanks to large image datasets and powerful deep learning architectures, the second scheme has been adopted in most of the recent image generation models. Typical examples include variational autoencoders (VAE) Kingma and Welling (2013) and generative adversarial networks (GAN) Goodfellow et al. (2014). Utilizing convolutional neural networks Krizhevsky et al. (2012) as building blocks, and training on tens of thousands of images, deep generative models are able to generate plausible images, as shown in the second row of Figure 1.

Figure 1: Top: training images from ImageNet dataset Deng et al. (2009). Bottom: images generated by DCGAN Radford et al. (2015).

On the one hand, despite the promise of deep generative models to recover the true distribution of images, formulating these models usually involves some sort of approximation. For instance, the variational auto-encoder (VAE) Kingma and Welling (2013) aims at estimating an explicit probability distribution through maximum likelihood, but the likelihood function is intractable. So a tractable lower bound on the log-likelihood of the distribution is defined and maximized. The generative adversarial network (GAN) Goodfellow et al. (2014) can recover the training data distribution when optimized in the space of arbitrary functions, but in practice, it is always optimized in the space of the model parameters. Therefore, there is basically no theoretical guarantee that the distribution of images produced by generative models is identical to that of natural images. On the other hand, images generated by deep generative models, hereinafter referred to as deep generated images, indeed seem different from natural images, so much so that it is easy for humans to distinguish them from natural images Denton et al. (2015); see the first and the second rows of Figure 1. It remains unclear whether deep generative models can reproduce the empirical statistics of natural images.

Driven by this motivation, we take generative adversarial networks and variational auto-encoders as examples to explore the statistics of deep generated images with respect to natural images in terms of scale invariance, non-Gaussianity, and Weibull contrast distribution. These comparisons can reveal the degree to which deep generative models capture the essence of natural scenes and guide the community to build more efficient generative models. In addition, the current way of assessing image generative models is often based on the visual fidelity of generated samples as judged by human inspection Theis et al. (2015). As far as we know, there is still no clear way to evaluate image generative models Im et al. (2016). We believe that our work will provide a new dimension to evaluate image generative models.

Specifically, we first train a Wasserstein generative adversarial network (WGAN Arjovsky et al. (2017)), a deep convolutional generative adversarial network (DCGAN Radford et al. (2015)), and a variational auto-encoder (VAE Kingma and Welling (2013)) on the ImageNet dataset. The reason for choosing the ImageNet dataset is that it contains a large number of photos from different object categories. We also collect the same amount of cartoon images to compute their statistics and to train the models on them, in order to: 1) compare statistics of natural images and cartoons, 2) compare statistics of generated images and cartoons, and 3) check whether the generative models work better on cartoons, since cartoons have less texture than natural images. As far as we know, we are the first to investigate statistics of cartoons and deep generated images. Statistics including the luminance distribution, contrast distribution, mean power spectrum, the number of connected components with a given area, and the distribution of random filter responses will be computed.

Our analyses on training natural images confirm existing findings of scale invariance, non-Gaussianity, and Weibull contrast distribution in natural image statistics. We also find non-Gaussianity and Weibull contrast distributions in the natural images generated by VAE, DCGAN and WGAN. However, unlike real natural images, none of the generated image sets has a scale invariant mean power spectrum magnitude. Instead, the deep generative models seem to prefer certain frequency points, at which the power magnitude is significantly larger than in their neighborhood. We show that this phenomenon is caused by the deconvolution operations in the deep generative models. Replacing deconvolution layers in the deep generative models by sub-pixel convolution enables them to generate images with a mean power spectrum more similar to the mean power spectrum of natural images. The spiky power spectrum is related to the checkerboard patterns reported in Odena et al. (2016). However, Odena et al. only give a qualitative discussion of individual images in the spatial domain. We are the first to find the spiky power spectra of the generated images and to provide a quantifiable measure of them.

2 Related Work

In this section, we briefly describe recent work that is closely related to this paper, including important findings in the area of natural image statistics and recent developments on deep image generative models.

2.1 Natural Image Statistics

Research on natural image statistics has been growing rapidly since the mid-1990's Hyvärinen et al. (2009). The earliest studies showed that the statistics of natural images remain the same when the images are scaled (i.e., scale invariance) Srivastava et al. (2003); Zhu (2003). For instance, it is observed that the average power spectrum magnitude over natural images falls approximately as $1/f^2$ (see for example Deriugin (1956); Cohen et al. (1975); Burton and Moorhead (1987); Field (1987)). It can be derived using the scaling theorem of the Fourier transformation that the power spectrum magnitude will stay the same if natural images are scaled by a constant factor Zoran (2013). Several other natural image statistics have also been found to be scale invariant, such as the histogram of log contrasts Ruderman and Bialek (1994), the number of gray levels in small patches of images Geman and Koloydenko (1999), the number of connected components in natural images Alvarez et al. (1999), histograms of filter responses, full co-occurrence statistics of two pixels, as well as joint statistics of Haar wavelet coefficients.

Another important property of natural image statistics is non-Gaussianity Srivastava et al. (2003); Zhu (2003); Wainwright and Simoncelli (1999). This means that the marginal distribution of almost any zero mean linear filter response on virtually any dataset of images is sharply peaked at zero, with heavy tails and high kurtosis (greater than the value of 3 of Gaussian distributions) Lee et al. (2001).

In addition to the two well-known properties of natural image statistics mentioned above, recent studies have shown that the contrast statistics of the majority of natural images follow a Weibull distribution Ghebreab et al. (2009). Although less explored compared to the scale invariance and non-Gaussianity of natural image statistics, the validity of the Weibull contrast distribution has been confirmed in several studies. For instance, Geusebroek et al. Geusebroek and Smeulders (2005) show that the variance and kurtosis of the contrast distribution of the majority of natural images can be adequately captured by a two-parameter Weibull distribution. It is shown in Scholte et al. (2009) that the two parameters of the Weibull contrast distribution cover the space of all possible natural scenes in a perceptually meaningful manner. The Weibull contrast distribution has also been applied to a wide range of computer vision and image processing tasks. Ghebreab et al. Ghebreab et al. (2009) propose a biologically plausible model based on the Weibull contrast distribution for rapid natural image identification, and Yanulevskaya et al. Yanulevskaya et al. (2011) exploit this property to predict eye fixation locations in images, to name a few.

2.2 Deep Generative Models

Several deep image generative models have been proposed in a relatively short period of time since 2013. As of this writing, variational autoencoders (VAE) and generative adversarial networks (GAN) constitute two popular categories of these models. VAE aims at estimating an explicit probability distribution through maximum likelihood, but the likelihood function is intractable. So a tractable lower bound on the log-likelihood of the distribution is defined and maximized. For many families of functions, defining such a bound is possible even though the actual log-likelihood is intractable. In contrast, GANs implicitly estimate a probability distribution by only providing samples from it. Training GANs can be described as a game between a generative model $G$ trying to estimate the data distribution and a discriminative model $D$ trying to distinguish between the examples generated by $G$ and the ones coming from the actual data distribution. In each iteration of training, the generative model learns to produce better fake samples while the discriminative model improves its ability to distinguish real samples.

It is shown that a unique solution for $G$ and $D$ exists in the space of arbitrary functions, with $G$ recovering the training data distribution and $D$ equal to $\frac{1}{2}$ everywhere Goodfellow et al. (2014). In practice, $G$ and $D$ are usually defined by multi-layer perceptrons (MLPs) or convolutional neural networks (CNNs), and can be trained with backpropagation through gradient-based optimization methods. However, in this case, the optimum is approximated in the parameter space instead of the space of arbitrary functions. Correspondingly, there is no theoretical guarantee that the model's distribution is identical to the data generating process Goodfellow et al. (2014).
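To make the adversarial game above concrete, the following is a minimal PyTorch sketch of one GAN training iteration. The two small fully connected networks, the latent size of 100, and the 28*28 image size are placeholders for illustration, not the architectures or settings used in this paper.

```python
import torch
import torch.nn as nn

# Toy generator G and discriminator D; the sizes below are placeholders.
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 28 * 28), nn.Tanh())
D = nn.Sequential(nn.Linear(28 * 28, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real):
    """One iteration of the adversarial game; `real` is a (batch, 28*28) tensor in [-1, 1]."""
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # D step: push D(real) towards 1 and D(fake) towards 0.
    fake = G(torch.randn(batch, 100)).detach()
    loss_d = bce(D(real), ones) + bce(D(fake), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # G step: produce samples that D classifies as real.
    loss_g = bce(D(G(torch.randn(batch, 100))), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```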

Generally speaking, image samples generated by GANs and VAEs look quite similar to real ones, but there are indeed some differences. Figure 1 shows samples of training images from ImageNet, and images generated by a popular implementation of GANs, termed DCGAN Radford et al. (2015). As humans, we can easily distinguish fake images from the real ones. However, it is not so easy to tell how different deep generated images are from the real ones, and whether deep generative models, trained on a large number of images, capture the essence of the natural scenes. We believe that answering how well the statistics of the deep generated images match the known statistical properties of natural images reveals the degree to which deep generative models capture the essence of the natural scenes. Insights can be gained from this work regarding possible improvements of image generative models.

3 Data and Definitions

In this section, we introduce data, definitions, and symbols that will be used throughout the paper.

3.1 Natural Images, Cartoons and Generated Images

We choose 517,400 out of 1,300,000 pictures of the ImageNet Deng et al. (2009) dataset as our natural image training set. These images cover 398 classes of objects, and each class contains 1,300 images. The cartoon training images include 511,460 frames extracted from 303 videos of 73 cartoon movies (i.e., multiple videos per movie). These two sets are used to train the deep generative models to generate natural images and cartoons. All training images are cropped around the image center. Each image has 128 × 128 pixels. Figure 2 shows some examples of the natural and cartoon training images.

Figure 2: Examples of natural images from the ImageNet dataset Deng et al. (2009) (top row), and our collected cartoon images (bottom row).

Several variants of deep generative models have been proposed. Since it is nearly impossible to consider all models, here we focus on three leading models, VAE, DCGAN and WGAN, for our analysis. DCGAN refers to a certain type of generative adversarial network with the architecture proposed by Radford et al. Radford et al. (2015) and the cost function proposed by Goodfellow et al. Goodfellow et al. (2014). WGAN refers to the model with the architecture proposed by Radford et al. Radford et al. (2015) and the cost function proposed by Arjovsky et al. Arjovsky et al. (2017). The VAE approach proposed by Kingma and Welling Kingma and Welling (2013) originally consists of fully connected layers, which are not efficient for generating large images. Therefore, we replace the architecture of the original VAE with the convolutional architecture proposed by Radford et al. Radford et al. (2015). In short, the DCGAN, WGAN and VAE models used in this paper have the same architecture. Their difference lies in their loss functions. The generated images considered in this work have a size of 128 × 128 pixels. Examples of images generated by VAE, DCGAN and WGAN are shown in Figures 3, 4 and 5, respectively.

Figure 3: Examples of natural (top) and cartoon images (bottom) generated by the VAE model Kingma and Welling (2013).
Figure 4: Examples of natural (top) and cartoon images (bottom) generated by the DCGAN model Radford et al. (2015).
Figure 5: Examples of natural (top) and cartoon images (bottom) generated by the WGAN model Arjovsky et al. (2017).

3.2 Kurtosis and Skewness

Kurtosis is a measure of the heaviness of the tail of a probability distribution. A large kurtosis indicates that the distribution has a sharp peak and a heavy tail. Skewness measures the asymmetry of a probability distribution with respect to the mean. A positive skewness indicates that the mass of the distribution is concentrated on values less than the mean, while a negative skewness indicates the opposite. The kurtosis and skewness of a random variable $X$ are defined as:

$$\mathrm{kurt}(X) = \frac{E\left[(X - \mu)^{4}\right]}{\sigma^{4}}, \qquad (1)$$

$$\mathrm{skew}(X) = \frac{E\left[(X - \mu)^{3}\right]}{\sigma^{3}}, \qquad (2)$$

where $\mu$ is the mean, $\sigma$ is the standard deviation, and $E[\cdot]$ denotes the mathematical expectation.
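As a concrete illustration, a minimal NumPy sketch of eqns. 1 and 2 (the sanity check at the end is illustrative only; equivalent routines exist in scipy.stats):

```python
import numpy as np

def kurtosis(x):
    """Kurtosis as in eqn. 1: E[(X - mu)^4] / sigma^4 (equals 3 for a Gaussian)."""
    x = np.asarray(x, dtype=np.float64)
    mu, sigma = x.mean(), x.std()
    return np.mean((x - mu) ** 4) / sigma ** 4

def skewness(x):
    """Skewness as in eqn. 2: E[(X - mu)^3] / sigma^3."""
    x = np.asarray(x, dtype=np.float64)
    mu, sigma = x.mean(), x.std()
    return np.mean((x - mu) ** 3) / sigma ** 3

# Sanity check: Gaussian samples give kurtosis close to 3 and skewness close to 0.
samples = np.random.randn(100000)
print(kurtosis(samples), skewness(samples))
```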

3.3 Luminance

Since training and deep generated images are RGB color images, we first convert them to grayscale using the CCIR Rec. 601 formula, a standard for digital video:

$$L(x, y) = 0.299\,R(x, y) + 0.587\,G(x, y) + 0.114\,B(x, y). \qquad (3)$$

It is a weighted average of R, G, and B chosen to tally with human perception. Green is weighted most heavily since humans are more sensitive to green than to other colors Kanan and Cottrell (2012). The grayscale value of the pixel at position $(x, y)$ is taken as its luminance. Following Geisler (2008), in this work we deal with the normalized luminance within a given image, which is defined by dividing the luminance at each pixel by the average luminance over the whole image:

$$\tilde{L}(x, y) = \frac{L(x, y)}{\frac{1}{WH}\sum_{x'=1}^{W}\sum_{y'=1}^{H} L(x', y')}, \qquad (4)$$

where $H$ and $W$ are the height and width of the image, respectively. Averaging the luminance histograms across images gives the distribution of luminance.
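A short sketch of eqns. 3 and 4, assuming OpenCV and NumPy are available; the histogram range and binning are arbitrary choices for illustration:

```python
import numpy as np
import cv2

def normalized_luminance(path):
    """Per-pixel luminance (eqn. 3) divided by the mean luminance of the image (eqn. 4)."""
    bgr = cv2.imread(path).astype(np.float64)      # OpenCV loads channels in B, G, R order
    b, g, r = bgr[..., 0], bgr[..., 1], bgr[..., 2]
    lum = 0.299 * r + 0.587 * g + 0.114 * b        # CCIR Rec. 601 grayscale
    return lum / lum.mean()

def luminance_distribution(paths, bins=np.linspace(0, 4, 81)):
    """Average the per-image histograms of normalized luminance over a set of images."""
    hists = [np.histogram(normalized_luminance(p).ravel(), bins=bins, density=True)[0]
             for p in paths]
    return bins, np.mean(hists, axis=0)
```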

As a fundamental feature encoded by biological visual systems, luminance distribution within natural images has been studied in many works. It has been observed that this distribution is approximately symmetric on a logarithmic axis and hence positively skewed on a linear scale Geisler (2008). In other words, relative to the mean luminance, there are many more dark pixels than light pixels. One reason is the presence of the sky in many images, which always has high luminance, causing the mean luminance to be greater than the luminance of the majority of pixels.

3.4 Contrast Distribution

Distribution of local contrast within images has been measured using various definitions of contrast. In this work, we use the gradient magnitude calculated by Gaussian derivative filters to define the local contrast of an image, as in Scholte et al. (2009); Ghebreab et al. (2009); Yanulevskaya et al. (2011). These contrast values have been shown to follow a Weibull distribution Ghebreab et al. (2009):

$$p(x) = \frac{\gamma}{\beta}\left(\frac{x}{\beta}\right)^{\gamma - 1} \exp\left(-\left(\frac{x}{\beta}\right)^{\gamma}\right), \qquad (5)$$

where $\beta$ is the scale parameter and $\gamma$ is the shape parameter.

Images are first converted to a color space that is optimized to match the human visual system color representation Yanulevskaya et al. (2011):

$$\begin{pmatrix} E \\ E_{\lambda} \\ E_{\lambda\lambda} \end{pmatrix} = \begin{pmatrix} 0.06 & 0.63 & 0.27 \\ 0.30 & 0.04 & -0.35 \\ 0.34 & -0.60 & 0.17 \end{pmatrix} \begin{pmatrix} R \\ G \\ B \end{pmatrix}, \qquad (6)$$

where $R$, $G$ and $B$ are the intensities of a pixel in the red, green and blue channels, respectively. The gradient magnitude is then obtained by

$$C(x, y) = \sqrt{\sum_{c} \left( E^{c}_{x}(x, y) \right)^{2} + \left( E^{c}_{y}(x, y) \right)^{2}}, \qquad (7)$$

where $E^{c}_{x}$ and $E^{c}_{y}$ are the responses of the $c$-th channel to Gaussian derivative filters in the $x$ and $y$ directions, given by the following impulse responses:

$$g_{x}(x, y) = -\frac{x}{2\pi\sigma^{4}} \exp\left(-\frac{x^{2} + y^{2}}{2\sigma^{2}}\right), \qquad (8)$$

$$g_{y}(x, y) = -\frac{y}{2\pi\sigma^{4}} \exp\left(-\frac{x^{2} + y^{2}}{2\sigma^{2}}\right). \qquad (9)$$

The resulting gradient magnitude in eqn. 7 is considered as the local contrast of an image. Figure 6 shows several examples of local contrast maps of training images and deep generated images.


Figure 6: Local contrast maps of (a) natural images, (b) natural images generated by DCGAN, (c) natural images generated by WGAN, (d) natural images generated by VAE, (e) cartoon images, (f) cartoon images generated by DCGAN, (g) cartoon images generated by WGAN, and (h) cartoon images generated by VAE.
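The sketch below illustrates eqns. 5–9 using SciPy's Gaussian derivative filters and Weibull fit. The color transform matrix is the one we assume in eqn. 6 (Gaussian color model coefficients), and sigma = 1 pixel is an illustrative choice, not necessarily the setting used in this paper.

```python
import numpy as np
from scipy import ndimage, stats

# Opponent color transform assumed for eqn. 6; treat the exact values as illustrative.
COLOR_MAT = np.array([[0.06, 0.63, 0.27],
                      [0.30, 0.04, -0.35],
                      [0.34, -0.60, 0.17]])

def local_contrast(rgb, sigma=1.0):
    """Gradient magnitude over the opponent channels (eqns. 6-9)."""
    h, w, _ = rgb.shape
    opp = (rgb.reshape(-1, 3).astype(np.float64) @ COLOR_MAT.T).reshape(h, w, 3)
    grad_sq = np.zeros((h, w))
    for c in range(3):
        gx = ndimage.gaussian_filter(opp[..., c], sigma, order=(0, 1))  # d/dx
        gy = ndimage.gaussian_filter(opp[..., c], sigma, order=(1, 0))  # d/dy
        grad_sq += gx ** 2 + gy ** 2
    return np.sqrt(grad_sq)

def weibull_fit(contrast):
    """Fit the two-parameter Weibull of eqn. 5 to the contrast values of one image."""
    data = contrast.ravel()
    data = data[data > 0]
    gamma, _, beta = stats.weibull_min.fit(data, floc=0)  # shape, loc (fixed at 0), scale
    return gamma, beta
```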

3.5 Filter Responses

It has been observed that convolving natural images with almost any zero mean linear filter results in a histogram of a similar shape, with a heavy tail, a sharp peak, and high kurtosis Zoran (2013) (higher than the kurtosis of a Gaussian distribution, which is 3). This is called the non-Gaussian property of natural images.

Since it is impossible in this work to consider all such filters, we avoid inspecting responses to any specific filter. Instead, without loss of generality, we apply random zero mean filters to images, as introduced in Huang and Mumford (1999), to measure properties of the images themselves. A random zero mean filter $F$ is generated by normalizing a random matrix $F_{0}$ with independent, uniformly sampled elements:

$$F = \frac{F_{0} - \overline{F_{0}}}{\left\| F_{0} - \overline{F_{0}} \right\|}, \qquad (10)$$

where $\overline{F_{0}}$ denotes the mean of the elements of $F_{0}$.
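A sketch of eqn. 10 and of the kurtosis measurement on filter responses; the 8×8 filter size and the unit-norm scaling are assumptions for illustration:

```python
import numpy as np
from scipy import ndimage

def random_zero_mean_filter(size=8, rng=np.random):
    """Random zero mean filter as in eqn. 10: shift a uniform random matrix to zero
    mean and scale it to unit norm (the 8x8 size is an illustrative choice)."""
    f = rng.uniform(0.0, 1.0, (size, size))
    f -= f.mean()
    return f / np.linalg.norm(f)

def response_kurtosis(gray, filt):
    """Kurtosis of the filter response distribution of one grayscale image."""
    resp = ndimage.convolve(gray.astype(np.float64), filt, mode='reflect').ravel()
    mu, sigma = resp.mean(), resp.std()
    return np.mean((resp - mu) ** 4) / sigma ** 4   # > 3 indicates non-Gaussianity
```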

3.6 Homogeneous Regions

Homogeneous regions in an image are the connected components where contrast does not exceed a certain threshold. Consider an image of size $W \times H$ with $G$ gray levels. We generate a series of thresholds $t_{1}, \ldots, t_{n-1}$, in which $t_{i}$ is the least integer such that more than $\frac{i}{n}$ of the pixels have a gray value less than $t_{i}$. Using these thresholds to segment an image results in homogeneous regions. Figure 7 illustrates an example image and its homogeneous regions.

Figure 7: An example image and its four homogeneous regions.

Alvarez et al. Alvarez et al. (1999) show that the number of homogeneous regions in natural images, denoted as $N(s)$, as a function of their size $s$, obeys the following law:

$$N(s) = K\, s^{\alpha}, \qquad (11)$$

where $K$ is an image dependent constant, $s$ denotes the area, and $\alpha$ is close to $-2$. Suppose image $I$ is scaled by a factor $\sigma$ into $I_{\sigma}$, so that a region of area $s$ in $I_{\sigma}$ corresponds to a region of area $\sigma^{2} s$ in $I$. Let $N_{\sigma}(s)$ denote the number of homogeneous regions of area $s$ in $I_{\sigma}$. Then $N_{\sigma}(s) = K (\sigma^{2} s)^{\alpha}$ follows the same power law in $s$, so the number of homogeneous regions in natural images is a scale-invariant statistic.
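A sketch of the homogeneous-region statistic, assuming the thresholds are chosen so that each gray-level band contains roughly the same number of pixels; the exact thresholding rule of the original implementation may differ:

```python
import numpy as np
from scipy import ndimage

def homogeneous_region_areas(gray, n=16):
    """Areas of connected components obtained by segmenting a grayscale image into
    n gray-level bands (Section 3.6); equal-population thresholds are an assumption."""
    gray = gray.astype(np.float64)
    thresholds = np.quantile(gray, np.linspace(0, 1, n + 1))
    bands = np.digitize(gray, thresholds[1:-1])          # band index (0..n-1) per pixel
    areas = []
    for b in range(n):
        labeled, count = ndimage.label(bands == b)       # connected components of one band
        areas.extend(np.bincount(labeled.ravel())[1:])   # component sizes (skip background)
    return np.asarray(areas)

# A log-log histogram of these areas should be roughly linear with slope close to -2
# for natural images (eqn. 11).
areas = homogeneous_region_areas(np.random.randint(0, 256, (128, 128)))
print(len(areas), areas.max())
```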

3.7 Power Spectrum

We adopt the most commonly used definition of power spectrum in the image statistics literature: "the power of different frequency components". Formally, the power spectrum of an image is defined as the square of the magnitude of the image FFT. Prior studies Deriugin (1956); Burton and Moorhead (1987); Field (1987) have shown that the mean power spectrum of natural images, denoted as $S(f)$, where $f$ is frequency, is scale invariant. It has the form of:

$$S(f) = A\, f^{\alpha}, \qquad (12)$$

where $A$ is a constant and the exponent $\alpha$ is close to $-2$, i.e., the power falls approximately as $1/f^{2}$.
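A sketch of the mean power spectrum computation and a log-log fit of eqn. 12. Taking a 1-D slice along the horizontal frequency axis is one simple way to obtain a "horizontally averaged" spectrum and may differ from the exact averaging used here:

```python
import numpy as np

def power_spectrum(gray):
    """Squared magnitude of the centered 2-D FFT of one grayscale image."""
    return np.abs(np.fft.fftshift(np.fft.fft2(gray.astype(np.float64)))) ** 2

def mean_power_spectrum(images):
    """Average the power spectra of many same-sized images."""
    return np.mean([power_spectrum(im) for im in images], axis=0)

def fit_power_law(spectrum):
    """Fit log S(f) = log A + alpha * log f along the horizontal frequency axis (eqn. 12)."""
    cy, cx = spectrum.shape[0] // 2, spectrum.shape[1] // 2
    s = spectrum[cy, cx + 1:]                          # positive horizontal frequencies
    f = np.arange(1, s.size + 1) / spectrum.shape[1]   # in cycles per pixel
    alpha, log_a = np.polyfit(np.log(f), np.log(s), 1)
    return log_a, alpha                                # alpha close to -2 for natural images
```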

4 Experiments and Results

In this section, we report the experimental results of luminance distribution, contrast distribution, random filter responses, distribution of homogeneous regions, and the mean power spectrum of the training images and deep generated images. We use Welch's t-test to test whether the statistics are significantly different between generated images and training images (ImageNet-1 and Cartoon-1 in the tables). The larger the $p$-value is, the more similar the generated images are to the training images. Therefore, the models can be ranked according to the t-test results. To make sure that our results are not specific to the choice of training images, we sampled another set of training images (ImageNet-2 and Cartoon-2 in the tables), and use the t-test to measure the difference between the two sets of training images. All experiments are performed using Python 2.7 and OpenCV 2.0 on a PC with an Intel i7 CPU and 32GB RAM. The deep generative models used in this work are implemented in PyTorch (https://github.com/pytorch).
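For reference, Welch's t-test is a single SciPy call (equal variances not assumed); the arrays below are placeholder data, not the measurements reported in the tables:

```python
import numpy as np
from scipy import stats

# Welch's t-test between a per-image statistic of generated images and the same
# statistic of training images (placeholder values for illustration only).
stat_generated = np.random.randn(12800) * 0.1 + 0.08
stat_training = np.random.randn(12800) * 0.1 + 0.11

t_stat, p_value = stats.ttest_ind(stat_generated, stat_training, equal_var=False)
print(t_stat, p_value)   # a larger p-value means the generated set is more similar
```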

4.1 Luminance

Luminance distributions of training images and deep generated images are shown in Figure 8. Average skewness values are shown in Table 1.

Results in Figure 8 show that the luminance distributions of training and generated images have similar shapes, while those of cartoons are markedly different from natural images. From Table 1, we can see that the luminance distributions over natural images, cartoons, generated natural images and generated cartoons all have positive skewness values. However, the difference of skewness values between training and generated images is statistically significant (over both natural images and cartoons). The difference between skewness values over each image type (i.e., ImageNet-1 vs. ImageNet-2 or Cartoon-1 vs. Cartoon-2) is not significant, indicating that our findings are general and the image sets are good representatives of the natural or synthetic scenes. According to Table 1, we rank the models in terms of luminance distribution as follows. For natural images, WGAN > DCGAN > VAE, and for cartoons, VAE > DCGAN > WGAN.


Figure 8: Luminance distribution. The distributions are all averaged over 12,800 images. (a) natural images and generated natural images, (b) cartoons and generated cartoon images.
- I-1 I-2 I-DCGAN I-WGAN I-VAE C-1 C-2 C-DCGAN C-WGAN C-VAE
Skew 0.11 0.11 0.08 0.14 0.15 0.29 0.30 0.23 0.49 0.25
t-stat - 0.30 3.45 -2.75 -4.25 - -0.51 3.65 -10.78 3.26
p-value - 0.76 0.00 0.00 0.00 - 0.60 0.00 0.00 0.00
Table 1: Skewness of the luminance distributions of the deep generated images and natural images. All values are averaged over 12,800 images. I: ImageNet, C: Cartoons.

4.2 Contrast

It has been reported that the contrast distribution in natural images follows a Weibull distribution Geusebroek and Smeulders (2005). To test this on our data, we first fit a Weibull distribution (eqn. 5) to the contrast histogram of each of the generated images and training images. Then, we use the KL divergence to examine whether the contrast distribution in deep generated images can be modeled by a Weibull distribution as well as in the case of natural images. If this is true, the fitted distributions will be as close to the histograms as they are for training images, and thus the KL divergence will be small.

Figure 9 shows that the contrast distributions of training and generated images have similar shapes, while those of cartoons are markedly different from natural images. Parameters of the fitted Weibull distribution and its KL divergence to the histogram, as well as the corresponding t-test results, are shown in Table 2. We find that the contrast distributions of generated natural images are also Weibull distributions. However, the difference of parameters between training and generated images, in both cases of natural images and cartoons, is statistically significant. We also observe that the KL divergence between the contrast distribution and its Weibull fit is small for natural images and generated natural images, while it is larger for cartoons and generated cartoons. According to Table 2, WGAN gets the largest $p$-value for both natural images and cartoons. DCGAN and VAE have equally small $p$-values. Therefore, for both natural images and cartoons, WGAN > DCGAN ≈ VAE in terms of contrast distribution.

- I-1 I-2 I-DCGAN I-WGAN I-VAE C-1 C-2 C-DCGAN C-WGAN C-VAE
KLD 1.68 1.67 1.50 1.63 1.49 2.54 2.49 2.29 2.13 1.63
t-stat - 1.62 26.66 6.86 31.90 - 1.67 8.35 13.97 31.71
p-val - 0.10 0.00 0.00 0.00 - 0.09 0.00 0.00 0.00
γ 1.15 1.16 1.23 1.16 1.17 1.01 1.01 1.02 1.00 1.13
t-stat - -1.70 -31.14 -2.64 -10.57 - -0.23 -11.21 9.74 -91.82
p-val - 0.08 0.00 0.00 0.00 - 0.81 0.00 0.00 0.00
β 1251.86 1257.05 1055.32 1241.86 518.40 1262.24 1270.16 1198.82 1247.14 588.88
t-stat - -1.11 46.98 2.03 216.72 - -1.40 11.69 2.83 162.88
p-val - 0.26 0.00 0.04 0.00 - 0.16 0.00 0.00 0.00
Table 2: Average Weibull parameters and KL divergence of training images and generated images. γ (shape) and β (scale) are the parameters in eqn. 5. All values are averaged over 12,800 images. I: ImageNet, C: Cartoons.
Figure 9: Contrast distributions of training images and generated images. The plots are all averaged over 12,800 images. Top: natural images, bottom: cartoons. (a) training images, (b) images generated by DCGAN, (c) images generated by WGAN, (d) images generated by VAE.

4.3 Filter Responses

We generate three zero mean random filters as in Huang and Mumford (1999), and apply them to ImageNet training images, VAE generated images, DCGAN generated images and WGAN generated images. Averaging the response histograms over training images, VAE images, DCGAN images and WGAN images gives the distributions shown in Figure 10 (in order). The distributions of responses to different random filters have similar shapes with a sharp peak and a heavy tail, which is in agreement with Huang et al.’s results Huang and Mumford (1999). Average kurtosis of the filter response distributions over the training images and deep generated images are shown in Table 3.

Figure 10 shows that generated images have filter response distributions similar to those of training images, while those of cartoons look different from natural images. Table 3 shows that the average response kurtosis of generated natural images and real natural images is always greater than the Gaussian value of 3. As a result, we conclude that the generated natural images have a similar non-Gaussianity as natural images. However, there is a statistically significant difference in the filter response kurtosis between deep generated images and training images, in both cases of natural images and cartoons (except ImageNet-WGAN and Cartoon-DCGAN). For natural images, WGAN gets the largest $p$-value for filters 1, 2 and 3. DCGAN and VAE have similar $p$-values. Therefore, for natural images, WGAN > DCGAN ≈ VAE. For cartoons, WGAN gets the largest $p$-value for filter 3, and DCGAN gets the largest $p$-value for filters 1 and 2. Therefore, for cartoons, DCGAN > WGAN > VAE.

Figure 10: Distribution of zero mean random filters responses, averaged over 12,800 images. Top: natural images, bottom: cartoons. (a) training images, (b) images generated by DCGAN, (c) images generated by WGAN, and (d) images generated by VAE.
- I-1 I-2 I-DCGAN I-WGAN I-VAE C-1 C-2 C-DCGAN C-WGAN C-VAE
filter 1 5.93 5.91 6.70 5.87 11.36 7.97 7.97 7.97 8.21 11.32
t-stat - 1.42 -16.43 1.26 -101.32 - 0.05 0.00 -2.16 -31.14
p-val - 0.67 0.00 0.20 0.00 - 0.95 0.99 0.03 0.00
filter 2 5.77 5.67 7.11 5.95 15.77 7.39 7.46 7.43 7.17 12.01
t-stat - 1.08 -29.00 -3.94 -175.08 - -0.64 -0.33 2.72 -56.42
p-val - 0.03 0.00 0.00 0.00 - 0.51 0.73 0.00 0.00
filter 3 5.79 5.74 6.17 5.77 11.98 5.12 5.19 5.33 5.19 8.43
t-stat - 1.05 -8.45 0.33 -110.72 - -0.48 -1.73 -0.69 -35.58
p-val - 0.29 0.00 0.73 0.00 - 0.62 0.08 0.48 0.00
Table 3: Kurtosis of the distributions of responses to three zero mean random filters. I: ImageNet, C: Cartoons.

4.4 Homogeneous Regions

We compute the distribution of homogeneous regions as stated in Section 3.6. The number of gray-level segmentation levels ($n$ in Section 3.6) is set to 16. Figure 11 shows the distribution of the number of homogeneous regions of area $s$ in the training images and deep generated images. We use eqn. 11 to fit the distribution of homogeneous regions of each image. Table 4 shows the average parameters $\alpha$ and $K$ in eqn. 11 evaluated through maximum likelihood (only regions above a minimum area are considered in the evaluation).

Over real natural images and generated natural images, the relationship between the number and the area of regions is linear in log-log plots, thus supporting the scale invariance property observed by Alvarez et al. Alvarez et al. (1999). This is also reflected in the small fitting residual of eqn. 11 shown in the first column of Table 4. We also find that this property holds over cartoons and generated cartoons. However, the differences between the parameters of deep generated images and training images, in both cases of natural images and cartoons, are statistically significant (except ImageNet-WGAN). For natural images, WGAN has the largest $p$-value. DCGAN and VAE have equally small $p$-values. Therefore, we rank the models for natural images as follows: WGAN > DCGAN ≈ VAE. For cartoons, the three models have similar $p$-values, therefore WGAN ≈ DCGAN ≈ VAE.

Figure 11: The number of homogeneous regions of area in training and generated images (both axes are in log units). The plots are all averaged over 12,800 images. Top: natural images, bottom: cartoons. (a) training images, (b) images generated by DCGAN, (c) images generated by WGAN, and (d) images generated by VAE.
- I-1 I-2 I-DCGAN I-WGAN I-VAE C-1 C-2 C-DCGAN C-WGAN C-VAE
α -1.54 -1.55 -1.60 -1.54 -0.65 -1.25 -1.25 -1.37 -1.44 -0.73
t-statistic - 2.26 23.16 0.96 -387.72 - -0.49 52.89 82.67 -220.13
p-value - 0.02 0.00 0.33 0.00 - 0.62 0.00 0.00 0.00
K 2.91 2.92 3.02 2.91 1.21 2.36 2.36 2.91 2.75 1.36
t-statistic - -2.03 -25.21 -0.89 411.37 - 0.62 -51.35 -83.14 212.51
p-value - 0.04 0.00 0.36 0.00 - 0.53 0.00 0.00 0.00
residual 3.87 3.88 3.82 3.89 3.98 4.24 4.22 4.39 4.11 3.85
t-statistic - -0.41 4.49 -1.55 -8.94 - 1.24 -13.29 11.60 34.12
p-value - 0.68 0.00 0.11 0.00 - 0.21 0.00 0.00 0.00
Table 4: Parameters α and K in eqn. 11 computed using maximum likelihood. I: ImageNet, C: Cartoons.

4.5 Power Spectrum

Figures 12(a) and 13(a) show the mean power spectrum of training images. We use eqn. 12 to fit a power spectrum to each image. The estimated parameters of eqn. 12 and the corresponding fitting residuals, averaged over all images, are shown in Table 5. The results of the t-test show that the differences between the parameters of deep generated images and training images, in both cases of natural images and cartoons, are statistically significant. For natural images, WGAN has the largest $p$-value. DCGAN and VAE have similar $p$-values; therefore, for natural images, WGAN > DCGAN ≈ VAE. For cartoons, WGAN has the largest $p$-value for $A$ and $\alpha$. VAE has the largest $p$-value for residual-v, and DCGAN has the largest $p$-value for residual-h. Therefore, for cartoons, WGAN > DCGAN > VAE.

The scale invariant mean power spectrum of natural images, of the form of eqn. 12, is the earliest and most robust discovery in natural image statistics. Our experiments on training images confirm this discovery and align with the prior results in Deriugin (1956); Cohen et al. (1975); Burton and Moorhead (1987); Field (1987). This can be seen from the linear relationship between log frequency and log power magnitude shown in Figure 13(a), and the small fitting residual of eqn. 12 in the first column of Table 5. We also observe a similar pattern over cartoons.

However, unlike training images, the generated images have a spiky mean power spectrum; see Figures 12(b)(c) and 13(b)(c). It can be seen from the figures that there are several local maxima of energy at certain frequency points. With the frequency axis not in logarithmic units, it can be read from Figure 12(b)(c) that the position of each spike is an integer multiple of 4/128 cycles/pixel.

Figure 12: Mean power spectrum averaged over 12800 images. Magnitude is in logarithm unit. Top: natural images, bottom: cartoons. (a) training images, (b) images generated by DCGAN, (c) images generated by WGAN, and (d) images generated by VAE.
Figure 13: Mean power spectrum averaged over 12800 images in horizontal and vertical directions. Both magnitude and frequency are in logarithm unit. Top: natural images, bottom: cartoons. (a) training images, (b) images generated by DCGAN, (c) images generated by WGAN, and (d) images generated by VAE.
- I-1 I-2 I-DCGAN I-WGAN I-VAE C-1 C-2 C-DCGAN C-WGAN C-VAE
A 9.23 9.24 8.97 9.21 8.83 9.16 9.16 9.07 9.20 8.94
t-statistic - -0.75 57.96 6.01 105.71 - 0.61 14.17 -6.17 43.96
p-value - 0.45 0.00 0.00 0.00 - 0.53 0.00 0.00 0.00
α -1.97 -1.97 -1.88 -1.95 -2.33 -1.98 -1.97 -1.94 -1.98 -2.32
t-statistic - 0.02 -26.64 -5.71 111.16 - -1.22 -9.81 1.68 102.43
p-value - 0.97 0.00 0.00 0.00 - 0.22 0.00 0.09 0.00
residual-h 1.53 1.55 2.57 1.41 2.18 2.00 2.02 2.06 1.73 1.81
t-statistic - -1.27 -61.28 10.67 -52.32 - -0.56 -2.96 18.14 12.78
p-value - 0.20 0.00 0.00 0.00 - 0.57 0.00 0.00 0.00
A 9.29 9.28 9.00 9.28 8.97 9.28 9.27 9.18 9.27 9.13
t-statistic - 0.57 67.42 0.54 86.95 - 0.16 16.27 0.22 29.44
p-value - 0.56 0.00 0.58 0.00 - 0.87 0.00 0.82 0.00
α -1.95 -1.95 -1.81 -1.97 -2.33 -2.03 -2.02 -1.98 -2.01 -2.42
t-statistic - -0.88 -45.61 5.03 105.75 - -0.91 -14.52 -4.21 96.76
p-value - 0.37 0.00 0.00 0.00 - 0.36 0.00 0.00 0.00
residual-v 1.45 1.46 1.74 1.48 2.13 1.98 1.98 1.84 1.89 1.99
t-statistic - -0.63 -21.24 -2.00 -49.10 - 0.12 8.01 5.46 -0.66
p-value - 0.52 0.00 0.04 0.00 - 0.89 0.00 0.00 0.50
Table 5: Fitted parameters of eqn. 12 from the mean power spectra (averaged over 12,800 images in horizontal and vertical directions). A, α, residual-h: the parameters and fitting residual of the horizontally averaged power spectrum; A, α, residual-v: the parameters and fitting residual of the vertically averaged power spectrum. I: ImageNet, C: Cartoon.

5 Discussion

It is surprising to see that, unlike natural images, deep generated images do not exhibit the well-established scale invariance property of the mean power spectrum. These models, however, reproduce other statistical properties such as the Weibull contrast distribution and non-Gaussianity well. Specifically, the mean power magnitude of natural images falls smoothly with frequency, approximately as $1/f^2$, but the mean power spectra of the deep generated images turn out to have local energy maxima at integer multiples of the frequency 4/128 cycles/pixel (i.e., 4/128, 8/128, etc.). In the spatial domain, this indicates that there are periodic patterns with a period of 32 pixels superimposed on the generated images. Averaging the deep generated images gives an intuitive visualization of these periodic patterns; please see Figure 14.


Figure 14: Average images of training and generated images (each of them are averaged over 12,800 images). The average generated images show periodic patterns. (a) training images, (b) images generated by DCGAN, (c) images generated by WGAN, and (d) images generated by VAE.

One reason for the occurrence of the periodic patterns might be the deconvolution operation (a.k.a. transposed or fractionally strided convolution). Deep image generative models usually consist of multiple layers. The generated images are progressively built from low to high resolution, layer by layer. The processing of each layer includes a deconvolution operation to transform a smaller input into a larger one.

Figure 15 provides an intuitive explanation of deconvolution operations. In a deconvolution operation with stride $s$, $s-1$ zeros are inserted between input units, which makes the output roughly $s$ times the size of the input tensor Dumoulin and Visin (2016).

Figure 15: Deconvolving a kernel over an input padded with zeros using a stride of 2 is equivalent to convolving the kernel over the same input with 1 zero inserted between input units and a border of zero padding, using unit strides. This figure is borrowed from Dumoulin and Visin (2016) with permission.

Inserting zeros between input units has the effect of superposing an impulse sequence with a period of $s$ pixels on the output map, where $s$ is the stride. Meanwhile, if the input itself is an impulse sequence, its period is enlarged $s$ times. Figure 16 shows a demonstration of how the generation process with deconvolution operations gives rise to the periodic patterns. Consider a deep generative model with $n$ deconvolution layers, similar to the models used in this work. When this model builds images from the input tensors, the first layer outputs the weighted sum of its kernels. The output maps of the second layer, which uses stride 2, are superposed on an impulse sequence with a period of 2 pixels. Each of the subsequent stride-2 layers doubles the period of the impulse sequences already present in its input and meanwhile superposes a new impulse sequence with a period of 2 on its output. Finally, there will be a periodic pattern consisting of impulse sequences of different intensities and periods of $2, 4, \ldots, 2^{n-1}$ pixels, which corresponds to spikes at integer multiples of $1/2^{n-1}$ cycles/pixel in the frequency domain. Figure 17 shows this periodic pattern for $n = 6$ (as in the models used in this work) and its power spectrum (averaged over the horizontal direction). As can be seen, the spikes in the power spectrum of this periodic pattern match exactly those in the power spectrum of the deep generated images shown in Figure 12, which experimentally shows that the spiky power spectra of deep generated images are caused by the deconvolution operation.
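This argument can also be checked numerically with a few lines of PyTorch: a stack of stride-2 transposed convolutions with random, fixed weights maps even a constant input to a periodic output whose spectrum is spiky. The channel count, kernel size, and depth below are illustrative, not the exact generator used in this paper.

```python
import torch
import torch.nn as nn
import numpy as np

torch.manual_seed(0)
k = 5   # five stride-2 deconvolution layers: 4x4 -> 128x128 (sizes illustrative)
net = nn.Sequential(*[nn.ConvTranspose2d(8, 8, kernel_size=4, stride=2, padding=1)
                      for _ in range(k)])

with torch.no_grad():
    out = net(torch.ones(1, 8, 4, 4))          # a constant input, no noise at all
row = out[0, 0, 64, :].numpy()                 # one row of the 128x128 output map

# The strongest frequencies are expected at (near) multiples of 4 in the 128-point FFT,
# i.e. at multiples of 4/128 cycles/pixel, matching the spikes in Figure 12.
spectrum = np.abs(np.fft.rfft(row - row.mean())) ** 2
print(np.argsort(spectrum)[::-1][:5])
```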

Figure 16: A demonstration of how the generation process with deconvolution operations gives rise to the periodic patterns. The second deconvolution layer using stride 2 superposes an impulse sequence with a period of 2 (the white dashed squares) on the output. The third deconvolution layer using stride 2 doubles the period of the existing impulse sequence (the gray solid squares), meanwhile superposes a new impulse sequence with period of 2 (the white dashed squares) on the output.


Figure 17: (a) A periodic pattern consisting of impulse sequences of different intensities and periods of $2, 4, \ldots, 2^{n-1}$ pixels, where $n$ is the number of deconvolution layers, and $n = 6$ in this case. (b) The power spectrum of this pattern (averaged over the horizontal direction) matches exactly the power spectrum of the deep generated images shown in Figure 12, which experimentally shows that the spiky power spectra of deep generated images are caused by the deconvolution operation.

Apart from causing the spikes in the power spectrum of the generated images as stated above, other drawbacks of the deconvolution operation have also been observed. For instance, the zero values added by the deconvolution operations carry no gradient information that can be backpropagated through and have to be later filled with meaningful values Tetra (2017). In order to overcome the shortcomings of deconvolution operations, a new upscaling operation known as sub-pixel convolution Shi et al. (2016) has recently been proposed for image super-resolution. Instead of filling in zeros to upscale the input, sub-pixel convolution performs more convolutions in the lower resolution and reshapes the resulting map into a larger output Tetra (2017).
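A minimal sketch of one such sub-pixel (pixel-shuffle) upscaling step in PyTorch; the channel counts and kernel size are illustrative choices:

```python
import torch
import torch.nn as nn

class SubPixelUp(nn.Module):
    """One sub-pixel (pixel-shuffle) upscaling step: convolve at low resolution into
    r*r times more channels, then rearrange them into an r-times larger map."""
    def __init__(self, in_ch, out_ch, r=2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch * r * r, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(r)      # (B, C*r^2, H, W) -> (B, C, H*r, W*r)

    def forward(self, x):
        return self.shuffle(self.conv(x))

x = torch.randn(1, 64, 16, 16)
print(SubPixelUp(64, 32, r=2)(x).shape)        # torch.Size([1, 32, 32, 32])
```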

Figure 18: Examples of images generated by WGAN with all deconvolution layers replaced by sub-pixel convolutions.


Figure 19: Mean power spectrum of images generated by WGAN with all deconvolution layers replaced by sub-pixel convolutions. Compared with Figures 12 and 13, the mean power spectrum of these images is less spiky. (a) mean power spectrum averaged over 12,800 images; (b) mean power spectrum averaged over 12,800 images as well as over the horizontal or vertical direction, with the magnitude axis in log units; (c) mean power spectrum averaged over 12,800 images as well as over the horizontal or vertical direction, with both the magnitude axis and the frequency axis in log units.

Since filling in zeros is not needed in sub-pixel convolution, it is expected that replacing deconvolution operations by sub-pixel convolutions will remove the periodic patterns from the output. Thus, the power spectrum of the generated images will be more similar to that of natural images, without the spikes shown in Figure 17 and Figure 12. To confirm this, we trained a WGAN model with all of its deconvolution layers replaced by sub-pixel convolutions. Examples of images generated by this model are shown in Figure 18. The corresponding mean power spectrum is shown in Figure 19. As the results show, replacing deconvolution operations by sub-pixel convolutions removes the periodic patterns caused by deconvolution operations and results in images with a mean power spectrum more similar to that of natural images.

6 Summary and Conclusion

We explore the statistics of images generated by state-of-the-art deep generative models (VAE, DCGAN and WGAN) and of cartoon images with respect to natural image statistics. Our analyses on training natural images corroborate existing findings of scale invariance, non-Gaussianity, and Weibull contrast distribution in natural image statistics. We also find non-Gaussianity and Weibull contrast distributions for images generated with VAE, DCGAN and WGAN. These statistics, however, are still significantly different from those of training images. Unlike natural images, none of the generated image sets has a scale invariant mean power spectrum magnitude, which indicates extra structures in the generated images. We show that these extra structures are caused by the deconvolution operations. Replacing deconvolution layers in the deep generative models by sub-pixel convolution helps them generate images with a mean power spectrum closer to the mean power spectrum of natural images.

Inspecting how well the statistics of the generated images match those of natural scenes can a) reveal the degree to which deep learning models capture the essence of the natural scenes, b) provide a new dimension to evaluate models, and c) suggest possible directions to improve image generation models. Correspondingly, two possible future works include:

  1. Building a new metric for evaluating deep image generative models based on image statistics, and

  2. Designing deep image generative models that better capture statistics of the natural scenes (e.g., through designing new loss functions).

To encourage future explorations in this area and assess the quality of images by other image generation models, we share our cartoon dataset and code for computing the statistics of images at: https://github.com/zengxianyu/generate

References

  • Alvarez et al. (1999) Alvarez, L., Gousseau, Y., Morel, J.M., 1999. The size of objects in natural and artificial images. Advances in Imaging and Electron Physics 111, 167–242.
  • Arjovsky et al. (2017) Arjovsky, M., Chintala, S., Bottou, L., 2017. Wasserstein GAN. arXiv preprint arXiv:1701.07875.
  • Burton and Moorhead (1987) Burton, G., Moorhead, I.R., 1987. Color and spatial structure in natural scenes. Applied Optics 26, 157–170.
  • Cohen et al. (1975) Cohen, R.W., Gorog, I., Carlson, C.R., 1975. Image descriptors for displays. Technical Report. DTIC Document.
  • Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L., 2009. ImageNet: A large-scale hierarchical image database, in: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, IEEE. pp. 248–255.

  • Denton et al. (2015) Denton, E.L., Chintala, S., Fergus, R., et al., 2015. Deep generative image models using a laplacian pyramid of adversarial networks, in: Advances in neural information processing systems, pp. 1486–1494.
  • Deriugin (1956) Deriugin, N., 1956. The power spectrum and the correlation function of the television signal. Telecommunications 1, 1–12.
  • Dumoulin and Visin (2016) Dumoulin, V., Visin, F., 2016. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285 .
  • Field (1987) Field, D.J., 1987. Relations between the statistics of natural images and the response properties of cortical cells. JOSA A 4, 2379–2394.
  • Geisler (2008) Geisler, W.S., 2008. Visual perception and the statistical properties of natural scenes. Annu. Rev. Psychol. 59, 167–192.
  • Geman and Koloydenko (1999) Geman, D., Koloydenko, A., 1999. Invariant statistics and coding of natural microimages, in: IEEE Workshop on Statistical and Computational Theories of Vision.
  • Geusebroek and Smeulders (2005) Geusebroek, J.M., Smeulders, A.W., 2005. A six-stimulus theory for stochastic texture. International Journal of Computer Vision 62, 7–16.
  • Ghebreab et al. (2009) Ghebreab, S., Scholte, S., Lamme, V., Smeulders, A., 2009. A biologically plausible model for rapid natural scene identification, in: Advances in Neural Information Processing Systems, pp. 629–637.
  • Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014. Generative adversarial nets, in: Advances in neural information processing systems, pp. 2672–2680.
  • Huang and Mumford (1999) Huang, J., Mumford, D., 1999. Statistics of natural images and models, in: Computer Vision and Pattern Recognition, 1999. IEEE Computer Society Conference On., IEEE. pp. 541–547.
  • Hyvärinen et al. (2009) Hyvärinen, A., Hurri, J., Hoyer, P.O., 2009. Natural Image Statistics: A Probabilistic Approach to Early Computational Vision. volume 39. Springer Science & Business Media.
  • Im et al. (2016) Im, D.J., Kim, C.D., Jiang, H., Memisevic, R., 2016. Generating images with recurrent adversarial networks. arXiv preprint arXiv:1602.05110 .
  • Kanan and Cottrell (2012) Kanan, C., Cottrell, G.W., 2012. Color-to-grayscale: does the method matter in image recognition? PloS one 7, e29740.
  • Kingma and Welling (2013) Kingma, D.P., Welling, M., 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
  • Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks, in: Advances in neural information processing systems, pp. 1097–1105.
  • Lee et al. (2001) Lee, A.B., Mumford, D., Huang, J., 2001. Occlusion models for natural images: A statistical study of a scale-invariant dead leaves model. International Journal of Computer Vision 41, 35–59.
  • Mumford and Gidas (2001) Mumford, D., Gidas, B., 2001. Stochastic models for generic images. Quarterly of applied mathematics 59, 85–111.
  • Odena et al. (2016) Odena, A., Dumoulin, V., Olah, C., 2016. Deconvolution and checkerboard artifacts. Distill URL: http://distill.pub/2016/deconv-checkerboard, doi:10.23915/distill.00003.
  • Radford et al. (2015) Radford, A., Metz, L., Chintala, S., 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 .
  • Ruderman and Bialek (1994) Ruderman, D.L., Bialek, W., 1994. Statistics of natural images: Scaling in the woods. Physical review letters 73, 814–817.
  • Scholte et al. (2009) Scholte, H.S., Ghebreab, S., Waldorp, L., Smeulders, A.W., Lamme, V.A., 2009. Brain responses strongly correlate with weibull image statistics when processing natural images. Journal of Vision 9, 29–29.
  • Shi et al. (2016) Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A.P., Bishop, R., Rueckert, D., Wang, Z., 2016. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1874–1883.
  • Srivastava et al. (2003) Srivastava, A., Lee, A.B., Simoncelli, E.P., Zhu, S.C., 2003. On advances in statistical modeling of natural images. Journal of mathematical imaging and vision 18, 17–33.
  • Tetra (2017) Tetra, 2017. subpixel: A subpixel convnet for super resolution with tensorflow. URL: https://github.com/tetrachrome/subpixel.
  • Theis et al. (2015) Theis, L., Oord, A.v.d., Bethge, M., 2015. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844 .
  • Wainwright and Simoncelli (1999) Wainwright, M.J., Simoncelli, E.P., 1999. Scale mixtures of gaussians and the statistics of natural images., in: Nips, pp. 855–861.
  • Yanulevskaya et al. (2011) Yanulevskaya, V., Marsman, J.B., Cornelissen, F., Geusebroek, J.M., 2011. An image statistics–based model for fixation prediction. Cognitive computation 3, 94–104.
  • Zhu (2003) Zhu, S.C., 2003. Statistical modeling and conceptualization of visual patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 691–712.
  • Zoran (2013) Zoran, D., 2013. Natural Image Statistics for Human and Computer Vision. Ph.D. thesis. Hebrew University of Jerusalem.