Generative Adversarial Networks with Inverse Transformation Unit
In this paper we introduce a new structure for Generative Adversarial Networks by adding an inverse transformation unit behind the generator. We present two theorems establishing the convergence of the model, and two conjectures for non-ideal situations in which the transformation is not a bijection. A general survey of models with different transformations was carried out on the MNIST dataset and the Fashion-MNIST dataset, and it shows that the transformation does not necessarily need to be a bijection. Also, with certain transformations that blur an image, our model successfully learned to sharpen images and recover blurred images, which was additionally verified by our measurement of sharpness.
In the past two years, generative adversarial networks (GAN) have attracted increasing attention. GAN introduces two perceptrons that compete against each other: the generator learns the probability distribution of the training data, while the discriminator learns to tell generated samples from real ones. The conciseness of GAN makes it possible to amend the structure, either to improve its performance or to make it achieve additional desired effects. While many works focus on the first point (see related works), this paper focuses on the second. We add an inverse transformation unit behind the generator, which makes it possible to generate data with the "inverse" effect of the input transformation function. Our architecture is useful when we want to generate samples with some additional effect that is hard to implement directly but whose inverse is easy to achieve. This need is natural and common in certain situations. For instance, we want to generate clear images, but we only know how to implement the inverse: how to blur them.
In this paper, we make the following contributions:
We presented a new architecture for generative adversarial networks by adding an inverse transformation unit behind the generator.
We carried out a rigorous theoretical analysis of our structure: we found the optimal discriminator for a fixed generator when the transformation is a continuous bijection. We also showed the convergence of the algorithm in that situation.
We made two conjectures for cases when the transformation is not a bijection.
We applied our method to the MNIST dataset and the Fashion-MNIST dataset with different transformation functions. A general survey of various transformation functions was carried out; with some special transformation functions, the model showed its ability to sharpen images and recover blurred images.
In the past two years, many works on generative adversarial networks (GAN) have appeared. They have studied various aspects of GAN, from theory to applications, and have greatly improved the original method. Many works put their emphasis on improving the performance of GAN, by introducing new loss functions [5, 6, 7], integrating it with other deep learning architectures [4, 8, 9, 13], or amending the original GAN with strong theoretical analysis [10, 11, 12]. A number of works also apply GAN to practical issues and solve problems in those domains [13, 14]. The purpose of our paper is to survey a new architecture of GAN that makes it possible to learn samples with a certain desired effect.
Image generation has been a popular topic for years, and people have tried different methods to generate images with their desired effects. Convolutional Neural Networks (CNN) and GAN are two popular methods in this domain. In one work, researchers successfully train a model to transfer an image's texture style to another image using CNN. In another, CNN and GAN are integrated, and the resulting model generates better images than the original GAN. In [16, 17, 18], three methods including CNN, GAN and Variational Auto Encoders (VAE) are used to learn typographical style and generate images of letters with new styles.
In order to learn graphical samples with a certain desired effect, we add an inverse transformation unit $f$ to the generator, based on the intuition that the generator will learn some additional effect, such as $f^{-1}$ if $f$ is invertible, to offset the effect of $f$. This intuition also appears in two earlier lines of work. In the first, the mapping from the data distribution to the latent distribution is learned. The function needs to be invertible and stable, and its inverse maps samples from the latent distribution to the data distribution. With this approach, an unsupervised learning algorithm with exact log-likelihood computation, sampling, inference of latent variables, and an interpretable latent space is developed to model natural images. In the second, a pair of transformation functions is introduced to be the bridges between the source domain and the target domain. Both functions are unknown, and they are learned so that translated source samples are indistinguishable from the target domain and so that applying the two functions in sequence approximately returns the original sample. This pair (cycle) demonstrates great ability to transfer and enhance photo style. In our paper, the transformation $f$ is not required to be invertible, though the theoretical analysis only applies to invertible $f$'s. Also, when we learn the inverse effect of $f$, $f$ is given explicitly.
GAN is an excellent architecture for training generative models. It includes two networks, each "fighting with" the other, and both of them are improved during the process. Specifically, the generator $G$ captures the distribution of the training data while the discriminator $D$ distinguishes between samples from $G$ and the training data. In our approach, we add an Inverse Transformation Unit $f$, or a "filter", after $G$ generates a sample distribution. Figure 1 demonstrates our model compared to the original GAN architecture. $V(D, G)$ in equation (1) is our value function; we maximize it over $D$ and minimize it over $G$:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(f(G(z))))]. \quad (1)$$
Here is an intuitive explanation for the name of $f$, the inverse transformation unit. If we treat $f \circ G$ as the generator being trained, this is exactly the original GAN, and $f \circ G$ will learn the probability distribution of the training data. In this sense, the generator $G$ is creating samples that contain information about the "inverse of $f$", if it exists. For example, the generator learns how to generate dogs, while the discriminator learns to judge whether its input is a true image of a dog. Suppose $f$ makes the image blurred. Since $f \circ G$ learns the distribution of true dogs, $G$ will generate samples that are clear enough to offset the blurring effect.
However, things are complicated when $f^{-1}$ does not exist. It might be the case that $G$ learns a mapping whose composition with $f$ is almost the identity mapping; however, $G$ may also fail to learn it. In the rest of the paper, both theoretical analysis and experiments are used to investigate such situations.
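The role of the transformation unit in the value function can be illustrated numerically. The following is a minimal sketch, assuming the value function is the original GAN objective with the generator output passed through the transformation before it reaches the discriminator; the function `value_function` and the toy `f`, `G` and `D` are hypothetical and chosen only for illustration.

```python
import numpy as np

def value_function(D, G, f, x_real, z, eps=1e-12):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(f(G(z))))]."""
    real_term = np.mean(np.log(D(x_real) + eps))
    # The discriminator never sees G(z) directly, only the transformed f(G(z)).
    fake_term = np.mean(np.log(1.0 - D(f(G(z))) + eps))
    return real_term + fake_term

rng = np.random.default_rng(0)
x_real = rng.normal(loc=2.0, scale=1.0, size=1000)  # toy "training data"
z = rng.normal(size=1000)                            # latent noise

f = lambda y: 0.5 * y               # hypothetical transformation unit
G = lambda z: 2.0 * z + 4.0         # toy generator: f(G(z)) = z + 2 matches the data
D = lambda x: np.full_like(x, 0.5)  # discriminator that cannot tell the two apart

v = value_function(D, G, f, x_real, z)
print(v)  # close to -log 4
```

In this toy setting the generator has to produce samples twice as spread out as the data, so that they match the data only after `f` is applied. This is the sense in which the generator learns the inverse effect of the transformation.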
In this section, we show that when $f$ is a bijection with an invertible Jacobian matrix, the generator does create samples similar to $f^{-1}$ applied to the data. The optimal discriminator is given explicitly, and the convergence is analyzed. However, when $f$ is not a bijection, the optimal discriminator either does not exist or cannot be written explicitly. Two conjectures are posed about the optimal discriminator for $G$ fixed, one for the case when $f$ is not a surjection and one for the case when $f$ is not an injection.
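Theorem 1. Suppose the transformation function $f$ is a bijection from $\mathbb{R}^n$ to $\mathbb{R}^n$. If $f$ has an invertible Jacobian matrix $J_f$, then for $G$ fixed, the optimal discriminator is given by
$$D^*_G(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(f^{-1}(x))\,|\det J_f(f^{-1}(x))|^{-1}}, \quad (2)$$
where $p_g$ is the density of $G(z)$ for $z \sim p_z$.
Proof. For $G$ fixed, the discriminator is trained to maximize
$$V(D) = \int p_{data}(x) \log D(x)\,dx + \int p_z(z) \log(1 - D(f(G(z))))\,dz. \quad (3)$$
The variational method is used to solve the problem. For any $\eta$ with compact support, the function $\Phi$ is defined by $\Phi(\epsilon) = V(D + \epsilon\eta)$, $\epsilon \in \mathbb{R}$. Then the optimal discriminator can be found by solving the equation $\Phi'(0) = 0$ with constraint $\Phi''(0) < 0$.
The function $\Phi$ can be expanded as
$$\Phi(\epsilon) = \int p_{data}(x) \log(D(x) + \epsilon\eta(x))\,dx + \int p_z(z) \log(1 - D(f(G(z))) - \epsilon\eta(f(G(z))))\,dz. \quad (4)$$
Thus, $\Phi'(\epsilon)$ can be calculated as
$$\Phi'(\epsilon) = \int \frac{p_{data}(x)\,\eta(x)}{D(x) + \epsilon\eta(x)}\,dx - \int \frac{p_z(z)\,\eta(f(G(z)))}{1 - D(f(G(z))) - \epsilon\eta(f(G(z)))}\,dz. \quad (5)$$
Let $\epsilon = 0$, we have
$$\Phi'(0) = \int \frac{p_{data}(x)\,\eta(x)}{D(x)}\,dx - \int \frac{p_z(z)\,\eta(f(G(z)))}{1 - D(f(G(z)))}\,dz. \quad (6)$$
Through substitutions $y = G(z)$ and $x = f(y)$, we have
$$\int \frac{p_z(z)\,\eta(f(G(z)))}{1 - D(f(G(z)))}\,dz = \int \frac{p_g(y)\,\eta(f(y))}{1 - D(f(y))}\,dy = \int \frac{p_g(f^{-1}(x))\,|\det J_f(f^{-1}(x))|^{-1}\,\eta(x)}{1 - D(x)}\,dx. \quad (7)$$
As a result,
$$\Phi'(0) = \int \left[\frac{p_{data}(x)}{D(x)} - \frac{p_g(f^{-1}(x))\,|\det J_f(f^{-1}(x))|^{-1}}{1 - D(x)}\right] \eta(x)\,dx = 0. \quad (8)$$
Since $\eta$ is arbitrary, it follows that
$$\frac{p_{data}(x)}{D(x)} = \frac{p_g(f^{-1}(x))\,|\det J_f(f^{-1}(x))|^{-1}}{1 - D(x)}, \quad (9)$$
which leads to the result that the optimal discriminator is given by
$$D^*_G(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(f^{-1}(x))\,|\det J_f(f^{-1}(x))|^{-1}}. \quad (10)$$
Additionally, $\Phi''(0) < 0$ is trivial.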
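The bijection case can be checked numerically in one dimension. The sketch below assumes that the optimal discriminator has the form of the original GAN result with the generator density replaced by the density of the transformed samples, obtained through the change-of-variables formula. The affine map, the Gaussian densities and all names are hypothetical choices for illustration.

```python
import numpy as np

def normal_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Hypothetical 1-D setting: G(z) ~ N(0, 1), data ~ N(0, 1), f(y) = 2y + 1.
f_inv = lambda x: (x - 1.0) / 2.0
jac_det = 2.0  # |det J_f| is constant for an affine map

x = np.linspace(-10.0, 12.0, 22001)
dx = x[1] - x[0]
p_data = normal_pdf(x, 0.0, 1.0)
# Density of f(G(z)) by change of variables: p_g(f^{-1}(x)) / |det J_f|.
p_fg = normal_pdf(f_inv(x), 0.0, 1.0) / jac_det

# Candidate optimal discriminator for this fixed generator.
d_star = p_data / (p_data + p_fg)

def value(d):
    """Riemann-sum approximation of the discriminator objective."""
    return np.sum(p_data * np.log(d) + p_fg * np.log(1.0 - d)) * dx

mass = np.sum(p_fg) * dx  # should be close to 1 for a valid density
d_other = np.clip(d_star + 0.05 * np.sin(x), 1e-6, 1.0 - 1e-6)
v_star, v_other = value(d_star), value(d_other)
print(mass, v_star, v_other)
```

The transformed density integrates to one, and perturbing the candidate discriminator lowers the objective, which is what the pointwise maximization argument predicts.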
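Theorem 2. Let $C(G) = \max_D V(D, G)$. The global minimum of $C(G)$ is achieved if and only if $p_{f \circ G} = p_{data}$, where $p_{f \circ G}$ is the density of $f(G(z))$; at that point, $C(G)$ achieves the value $-\log 4$.
Proof. Let $f \circ G$ be the transformed generator, and we see that $p_{f \circ G}$ and $p_g$, the density of $G(z)$, have the following relationship according to Theorem 1:
$$p_{f \circ G}(x) = p_g(f^{-1}(x))\,|\det J_f(f^{-1}(x))|^{-1}. \quad (11)$$
Then, the rest of the proof is exactly the same as the proof of Theorem 1 by Goodfellow et al. (their Theorem 1, not Theorem 1 in this paper).
Furthermore, if we change the view using equation 11, the convergence is guaranteed under the same conditions as Proposition 2 by Goodfellow et al.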
When $f$ is not a bijection, things are much more complicated. Usually, we cannot find the optimal discriminator for $G$ fixed. What follows is some analysis as well as two conjectures. When $f$ is not a surjection, the right side of equation 7 becomes an integral over the range of $f$ only, which changes equation 8 accordingly.
However, the data term usually cannot be made to vanish at points outside the range of $f$.
Conjecture 1. When $f$ is not a surjection, there is no explicit optimal discriminator for $G$ fixed. However, the following condition should be satisfied by a good discriminator (i.e., if a discriminator does not satisfy the condition, it is always better to change it so that it does):
When $f$ is not an injection, consider, for each $x$, the set of points that $f$ maps to $x$. Then, in equation 7 the expression $f^{-1}(x)$ no longer exists as a single point. Instead, it is replaced by some point in that set, but we do not know which one. This makes the system hard to analyze. Another point of view is that when two different points are mapped by $f$ to the same value, the following two discriminators perform exactly the same:
Conjecture 2. When $f$ is not an injection, there is no explicit optimal discriminator for $G$ fixed. However, the following discriminator is good enough (i.e., there might be better solutions in particular situations, but it is the best one that can be written explicitly):
where refers to the average value of for .
The inverse transformation unit makes it possible for generative adversarial networks to generate a wider range of samples with additional desired effects. The bijection case is proved to work well, but the other cases still need deeper theoretical analysis before we can determine when our architecture is effective.
In this section, we run several experiments to show under which conditions our architecture works and which effects it can produce. First, to find out whether our architecture can be successfully trained for transformation functions with different properties, we made a general survey of $f$'s with different properties. Then, we showed that our architecture is able to learn the opposite effect of "blurring" with certain transformations. Specifically, it is able to generate sharpened images and to recover images that were initially blurred. Additionally, we introduced a measurement of sharpness and used it to verify the effect of our architecture. Our architecture is realized by adding an inverse transformation unit to DCGAN, and all experiments are done on the MNIST dataset and the Fashion-MNIST dataset.
The intuition is that if $f$ is a bijection, the training is likely to be successful. This intuition is also supported by Theorem 2, in that the global minimum is achieved under an explicitly given condition. But what if $f$ is not a bijection? This question leads to a survey on the effect of various $f$'s with the set of properties shown in Table 1, where if and otherwise 0, and . Since the images are grayscale and each pixel value can be compressed into a fixed interval, all $f$'s are mappings from that interval to itself. The plots of these functions are shown in Fig 2.
Among the nine transformation functions selected in Table 1, four demonstrate good effects (i.e., the corresponding models are able to generate digits with the desired effect). With a bijection that maps an image to its mirror image, the model indeed generates the mirror image of the original digits. With two injections that compress the interval, the models are able to generate digits with stronger contrast: more white and black pixels but fewer gray ones. With a surjection that is not one-to-one, the model can also generate images of digits. Fig 5 shows the images these four models generate and the images after the transformation functions are applied. As we can see, simpler transformation functions are more likely to take effect, while complicated ones cause more problems during training, leading to bad local optima or failure to converge. Specifically, one function fails because it cannot reach negative values, and for another, a large part of the range is reached by at least two inputs through the transformation, which confuses the model.
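The injection and surjection properties discussed above can be checked on a grid for a candidate pixel-wise transformation. This is a rough sketch under stated assumptions: the pixel interval is taken to be $[-1, 1]$, the functions are assumed continuous (so injectivity is equivalent to strict monotonicity), and the four example functions are hypothetical stand-ins, not the functions of Table 1.

```python
import numpy as np

def classify(f, lo=-1.0, hi=1.0, n=2001, tol=1e-9):
    """Return (injective, surjective) for a continuous map on [lo, hi], judged on a grid."""
    x = np.linspace(lo, hi, n)
    y = f(x)
    steps = np.diff(y)
    # A continuous function on an interval is injective iff it is strictly monotone.
    injective = bool(np.all(steps > 0) or np.all(steps < 0))
    # It is onto [lo, hi] iff its values reach both end points.
    surjective = bool(y.min() <= lo + tol and y.max() >= hi - tol)
    return injective, surjective

# Hypothetical examples (assumptions for illustration only).
negate = lambda x: -x                       # bijection
compress = lambda x: 0.5 * x                # injection, not a surjection
fold = lambda x: 2.0 * np.abs(x) - 1.0      # surjection, not an injection
rectify = lambda x: np.abs(x)               # neither; never reaches negative values

for name, fn in [("negate", negate), ("compress", compress),
                 ("fold", fold), ("rectify", rectify)]:
    print(name, classify(fn))
```

A grid check of this kind only screens candidate functions; it does not replace the analytical properties listed in the table.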
One easy way to blur an image is to replace each pixel with a weighted average of itself and its neighboring pixels. This can be achieved with a convolution kernel. Consider an image as an $n \times n$ matrix. To deal with the edges well, we first design a method of extension. The first step is extending a row on the top and the bottom respectively, with the values of the rows next to the original edges. This yields an $(n+2) \times n$ matrix. The second step is extending a column on the left and the right, using the values of the columns next to them. This yields an $(n+2) \times (n+2)$ matrix. After the extension, we perform a convolution using a $3 \times 3$ convolution kernel, which yields an $n \times n$ matrix, representing the blurred image. The whole process is demonstrated in Fig 3.
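The extension and convolution steps can be sketched as follows. The sketch assumes that the added rows and columns copy the adjacent edge rows and columns (replicate padding, `mode="edge"` in NumPy) and that the kernel is $3 \times 3$. The kernel values below are a hypothetical example, not one of the kernels used in the experiments.

```python
import numpy as np

def blur(image, kernel):
    """Blur a 2-D image with a 3x3 kernel after extending it by one pixel on every side."""
    image = np.asarray(image, dtype=float)
    # Extension step: copy the edge rows and columns outward by one pixel.
    padded = np.pad(image, 1, mode="edge")
    n_rows, n_cols = image.shape
    out = np.zeros_like(image)
    # Weighted sum of the nine shifted copies of the image. This is a
    # cross-correlation, which equals a convolution for a symmetric kernel.
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * padded[di:di + n_rows, dj:dj + n_cols]
    return out

# Hypothetical symmetric kernel whose weights sum to 1.
kernel = np.array([[1.0, 2.0, 1.0],
                   [2.0, 4.0, 2.0],
                   [1.0, 2.0, 1.0]]) / 16.0

image = np.zeros((28, 28))
image[10:18, 10:18] = 1.0          # a white square on a black background
blurred = blur(image, kernel)
print(blurred.shape)               # same size as the input
```

Because the weights sum to one, flat regions keep their value and only pixels near intensity changes are altered, producing the gray transition pixels associated with blur.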
If we take this function as $f$, the generator is expected to learn the inverse effect of this blurring. In other words, the generator may learn to sharpen the images so that after $f$ is applied, the generated images are similar to the images from the dataset. Fig 6 and Fig 7 show several samples generated by the model with one such convolution kernel. As expected, there are far fewer of the gray pixels that make an image look smooth. Although the effect of our model is not as pronounced as that of standard image processing techniques, the results show the ability of our architecture to deal with such tasks.
Furthermore, if the images have already been blurred by the previous method with some convolution kernel, we can recover the blurred images using our model with a convolution kernel of its own. Essentially, the two kernels do not need to be exactly the same. Fig 8 and Fig 9 demonstrate the blurred and recovered images under several different pairs of kernels. The results show that images can indeed be recovered even when the two kernels differ, which implies that our model can be used in practical situations when the blurring kernel is unknown but can be roughly estimated. The selected convolution kernels are
In order to examine the effect of the sharpening and recovery, a measurement of sharpness is introduced in this section. It is a function that maps an image (essentially a matrix with bounded elements) into a bounded interval. For an image, the larger the sharpness value is, the sharper the image is. Now we give the definition.
Let a matrix represent an image. The absolute average difference operator produces a matrix of exactly the same size, in which each entry is the average value of the absolute differences between the corresponding pixel and its neighbours. Pixels on the edge or corner of the image are treated similarly, using the neighbours that exist.
Then, the sharpness of the image is defined as the average value of the second-order absolute average difference of the image, that is, the average of the entries obtained by applying the operator twice.
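The sharpness measurement described above can be sketched in a few lines. The sketch assumes that the neighbours of a pixel are its four horizontally and vertically adjacent pixels; the source text does not make the neighbourhood explicit here, so this is an assumption. The function names are hypothetical.

```python
import numpy as np

def abs_avg_diff(a):
    """Average absolute difference between each entry and its existing 4-neighbours."""
    a = np.asarray(a, dtype=float)
    total = np.zeros_like(a)
    count = np.zeros_like(a)
    # Vertical neighbours: each difference counts for both pixels of the pair.
    d = np.abs(a[1:, :] - a[:-1, :])
    total[1:, :] += d
    count[1:, :] += 1
    total[:-1, :] += d
    count[:-1, :] += 1
    # Horizontal neighbours.
    d = np.abs(a[:, 1:] - a[:, :-1])
    total[:, 1:] += d
    count[:, 1:] += 1
    total[:, :-1] += d
    count[:, :-1] += 1
    # Edge and corner pixels simply have fewer neighbours to average over.
    return total / np.maximum(count, 1)

def sharpness(a):
    """Mean of the second-order absolute average difference."""
    return float(np.mean(abs_avg_diff(abs_avg_diff(a))))

sharp_edge = np.array([[0.0, 0.0, 1.0, 1.0]])
soft_edge = np.array([[0.0, 0.25, 0.75, 1.0]])
print(sharpness(sharp_edge), sharpness(soft_edge))  # 0.375 0.09375
```

On this toy example the abrupt edge scores higher than the gradual one, which matches the intended reading that larger values mean sharper images.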
Using this measurement, we examined the sharpness of images from six different groups for each of the two datasets. For each dataset, the six groups include the original MNIST (or Fashion-MNIST) images, sharpened images, blurred images, and recovered images with various convolution kernels. For each group, 108 samples were extracted randomly, and the distributions of the sharpness values are shown in Figure 4. The results largely conform with our theory. For the MNIST dataset, the sharpened images have a higher sharpness; although the blurred images have a much lower sharpness, the recovered ones with all three recovery kernels tend to have almost the same sharpness as the original images. For the Fashion-MNIST dataset, although the recovered images obtained with the convolution kernel farthest from the blurring kernel have a higher sharpness, the other five groups of images still yield good results similar to those of the MNIST dataset.
Figure 4: Distributions of the sharpness values of the groups of images (from left to right). The minimum values, first quartiles, medians, third quartiles, and maximum values are illustrated in this figure.
In this paper, we presented a new architecture for Generative Adversarial Networks by adding an inverse transformation unit behind the generator. We carried out a rigorous theoretical analysis of our model: we explicitly solved for the optimal discriminator given the generator, and proved the convergence of the algorithm under certain conditions. We also ran several experiments. The first experiment was a general survey of models with various transformation functions. The survey illustrates that when the transformation is not a bijection, the model may still work. In the second experiment, we demonstrated that our architecture is able to generate sharpened images and to recover blurred images without using the same convolution kernel. In the third experiment, we defined a measurement of sharpness and compared its value across different groups of images; the results also imply that our model works well both for generating sharpened images and for recovering blurred ones. In the future, we plan to apply our model to a wider range of transformation functions in computer vision, such as various filters, and to survey their inverse effects.
Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu: SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. In: Thirty-First AAAI Conference on Artificial Intelligence (2017)
Leon A. Gatys, Alexander S. Ecker, Matthias Bethge: Image Style Transfer Using Convolutional Neural Networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2414-2423. (2016)
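Table 1: Properties of the nine transformation functions, one row per function.
| Injective / Surjective | | | Trained successfully |
|---|---|---|---|
| Yes / Yes | Yes | Yes | Yes |
| Yes / No | Yes | Yes | No |
| Yes / No | Yes | Yes | Yes |
| Yes / No | Yes | Yes | Yes |
| No / Yes | Yes | Yes | No |
| No / Yes | Yes | Yes | Yes |
| No / No | Yes | Yes | No |
| No / No | No when | Not uniform | No |
| No / No | No when | Not uniform | No |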