PBGAN: Partial Binarization of Deconvolution Based Generators

02/26/2018 ∙ by Jinglan Liu, et al. ∙ USTC

The generator is quite different from the discriminator in a generative adversarial network (GAN). Compression techniques for the latter have been studied widely, while those for the former stay untouched so far. This work explores the binarization of the deconvolution based generator in a GAN for memory saving and speedup. We show that some layers of the generator may need to be kept in floating point representation to preserve performance, though conventional convolutional neural networks can be completely binarized. As such, only partial binarization may be possible for the generator. To quickly decide whether a layer can be binarized, supported by theoretical analysis and verified by experiments, a simple metric based on the dimension of deconvolution operations is established. Moreover, our results indicate that both generator and discriminator should be binarized at the same time for balanced competition and better performance. Compared with the floating-point version, experimental results based on CelebA suggest that our partial binarization on the generator of the deep convolutional generative adversarial network can yield up to 25.81× saving in memory consumption, and 1.96× and 1.32× speedup in inference and training respectively with little performance loss measured by sliced Wasserstein distance.


1 Introduction

In recent years, generative adversarial networks (GANs), which are spin-offs from conventional convolutional neural networks (CNNs), have attracted much attention in the fields of reinforcement learning, unsupervised learning and semi-supervised learning [1, 2, 3]. Some promising applications based on GANs include image reconstruction with super-resolution, art creation and image-to-image translation [4], many of which can run on mobile devices (edge computing). For example, one potential application of GANs allows videos to be broadcast in low resolution and then reconstructed to ultra-high resolution by end users [5], as shown in Fig. 1.

Figure 1: Low resolution broadcast based on GAN

However, the resources required by GANs to perform computations in real time may not be easily accommodated by mobile devices. For example, constructing an image of 64×64 resolution with the deep convolutional generative adversarial network (DCGAN) [6] requires 86.6 MB of memory, most of which is used for the generator. The memory goes up to 620.8 MB for 1024×1024 resolution [7], and up to about 800 MB for the popular 4K video with a resolution of 3840×2160. On the other hand, one of the state-of-the-art mobile processors, the A11 in the newest iPhone X [8], is paired with only 3 GB of RAM, much of which is occupied by the operating system and other system services. As a result, developers must restrict neural network models to just a few megabytes to avoid crashes [9]. The memory budget gets even tighter for mobile devices of smaller form factor such as the Apple Watch Series 3, which only has 768 MB of RAM.

The same problem is well known for conventional CNNs, and various solutions have been proposed by redesigning the algorithms and/or computation structures [10, 11, 12]. Among them, quantization down to binary values is one of the most popular techniques, as it fits hardware implementation well with high efficiency [9, 13]. Its success on CNNs has been demonstrated by multiple works [14, 15, 16], where memory consumption is greatly reduced, although sometimes the performance cannot be preserved.

Compression techniques can be readily applied to discriminator networks in GANs, which are no different from conventional CNNs. It may be tempting to also apply them to binarize generators, especially the deconvolution-based [17] ones, as the computation process looks similar. However, instead of distilling local information from a global map as in convolution operations, deconvolution attempts to construct the global map from local information. This difference can lead to significantly different binarization results, as will be discussed in Section 3. Accordingly, a scheme tailored to deconvolution-based generators is warranted.

In this paper, we show through theoretical analysis that under certain conditions, binarizing a deconvolution layer may cause significant performance loss, as also happens in the compression of CNNs. Since, to the best of the authors’ knowledge, there is no existing explanation for this phenomenon, an intuitive conjecture is that not all layers can be binarized together while preserving performance. Thus, some layers need to stay in floating-point format for performance, while others can be binarized without affecting performance. To quickly decide whether a layer can be binarized, a simple yet effective metric based on the dimension of deconvolution operations is established, supported by theoretical analysis and verified by experiments. Based on this metric, we can make use of existing compression techniques to binarize the generator of GANs with little performance loss. We then propose the scheme of partial binarization of deconvolution-based generators (PBGen) under the guidance of this metric.

Furthermore, we find that binarizing only the generator while leaving the discriminator network unchanged introduces unbalanced competition and performance degradation. Thus, both networks should be binarized at the same time. Experimental results based on CelebA suggest that directly applying state-of-the-art binarization techniques to all the layers of the generator leads to a 2.83× performance loss measured by sliced Wasserstein distance compared with the original generator, while applying them to selected layers only can yield up to 25.81× saving in memory consumption, and 1.96× and 1.32× speedup in inference and training respectively with little performance loss.

2 Related Works and Background

2.1 CNN Compression

Compression techniques for CNNs mainly consist of pruning, quantization, re-structuring and other approximations based on mathematical matrix manipulations [10, 18, 19]. The main idea of the pruning method in [14] is to “prune” away connections with smaller weights so that both synapses and neurons can be removed from the original structure. This works well with traditional CNNs and reduces the number of parameters of AlexNet by a factor of nine [14]. Re-structuring methods modify network structures for compression, such as changing functions or block order in layers [19, 20].

In this work, we focus on the quantization technique. Quantization aims to use fewer bits to represent the values of weights or even inputs. It has been used to accelerate CNNs in various works and at different levels [21, 22], including ternary quantization [11, 23] and iterative quantization [24], with small loss. In [10], the authors proposed to determine weight sharing after a network is fully trained, so that the shared weights approximate the original network. Starting from a fully trained model, weights are clustered and replaced by the centroids of their clusters. During retraining, the summed gradients within each group are used to fine-tune the centroids. Through such quantization, AlexNet is reported to be compressible by around 8× before significant accuracy loss occurs. If the compression rate goes beyond that, the accuracy deteriorates rapidly.
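As a concrete illustration of the weight-sharing idea above, the following is a minimal NumPy sketch (not the implementation of [10]): the weights of a layer are clustered with a simple one-dimensional k-means and each weight is replaced by its cluster centroid. The layer shape, the number of clusters, and the linear centroid initialization are illustrative assumptions.

```python
import numpy as np

def quantize_by_weight_sharing(weights, n_clusters=16, n_iters=20):
    """Cluster weights with a simple 1-D k-means and replace each weight by
    its cluster centroid (weight sharing after training)."""
    flat = weights.ravel()
    # Initialize centroids linearly over the weight range.
    centroids = np.linspace(flat.min(), flat.max(), n_clusters)
    for _ in range(n_iters):
        # Assign every weight to its nearest centroid.
        assign = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        # Update each centroid as the mean of the weights assigned to it.
        for k in range(n_clusters):
            members = flat[assign == k]
            if members.size:
                centroids[k] = members.mean()
    shared = centroids[assign].reshape(weights.shape)
    return shared, centroids, assign.reshape(weights.shape)

# Illustrative fully-connected layer: 16 centroids (4-bit indices) instead of 32-bit floats.
w = np.random.default_rng(1).normal(size=(256, 128)).astype(np.float32)
w_q, centers, idx = quantize_by_weight_sharing(w, n_clusters=16)
print("mean absolute quantization error:", np.abs(w - w_q).mean())
```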

A number of recent works [13, 16, 20, 25, 26, 27] pushed this further by using binarization to compress CNNs, where only a single bit is used to represent values. Training networks with weights and activations constrained to +1 and −1 was first proposed in [26]. By transforming 32-bit floating-point weight values to a binary representation, CNNs with binary weights and activations are about 32× smaller. In addition, when weight values are binary, convolutions can be estimated using only addition and subtraction without multiplication, which can achieve around 2.0× speedup. However, the method introduces significant performance loss. To alleviate the problem, [20] proposed the Binary-Weight-Network, where all weight values are binarized with an additional continuous scaling factor for each output channel. We base our discussion on this weight binarization, which is one of the state-of-the-art binarization methods.
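The following is a minimal NumPy sketch of this kind of per-output-channel weight binarization, in the spirit of the Binary-Weight-Network: each filter is reduced to its signs plus one continuous scaling factor, here taken as the mean absolute weight of the filter. Training details such as straight-through gradient estimation are omitted, and the tensor shape is illustrative.

```python
import numpy as np

def binarize_weights(w):
    """Binary-Weight-Network style binarization: each output channel keeps one
    continuous scaling factor alpha, and the weights are reduced to their signs.
    w has shape (out_channels, in_channels, k_h, k_w)."""
    signs = np.where(w >= 0, 1.0, -1.0)
    # One scaling factor per output channel: alpha = mean(|w|) over that filter.
    alpha = np.abs(w).reshape(w.shape[0], -1).mean(axis=1)
    w_bin = alpha[:, None, None, None] * signs
    return w_bin, signs, alpha

w = np.random.default_rng(0).normal(size=(64, 128, 5, 5)).astype(np.float32)
w_bin, signs, alpha = binarize_weights(w)
# Storage: 1 bit per weight plus one float per output channel,
# versus 32 bits per weight for the original tensor.
print("approximation error:", np.abs(w - w_bin).mean())
```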

However, none of the existing works has explored the compression of generators in GANs, where deconvolution replaces convolution as the major operation. Note that while a recent work uses the term “binary generative adversarial networks” [28], it is not about the binarization of GANs: only the inputs of the generator are restricted to binary codes to meet a specific application requirement, while all parameters inside the networks and the training images are not quantized.

2.2 GAN

GAN was developed by [29] as a framework to train a generative model by an adversarial process. In a GAN, a discriminative network (discriminator) learns to distinguish whether a given instance is real or fake, and a generative network (generator) learns to generate realistic instances to confuse the discriminator.

Originally, the discriminator and the generator of a GAN were both multilayer perceptrons. Researchers have since proposed many variants. For example, DCGAN transformed the multilayer perceptrons into deep convolutional networks for better performance; specifically, its generator is composed of four deconvolutional layers. GANs with such a convolutional/deconvolutional structure have also been successfully used to synthesize plausible visual interpretations of given text [30] and to learn interpretable and disentangled representations from images in an unsupervised way [31]. Wasserstein generative adversarial networks (WGAN) [32] and least squares generative adversarial networks (LSGAN) [33] have been proposed with different loss functions to achieve more stable performance, yet they both employ deconvolution operations too.

3 Analysis on Power of Representation

In this section, to decide whether a layer can be binarized, we analyze the power of a deconvolution layer to represent any given mapping between the input and the output, and how such power will affect the performance after binarization. We will show that the performance loss of a layer is related to the dimension of the deconvolution, and develop a metric called the degree of redundancy to indicate the loss. Finally, based on the analysis, several inferences are deduced at the end of this section, which should lead to effective and efficient binarization.

In the discussion below, we ignore batch normalization as well as activation operations and focus on the deconvolution operation in a layer, as only the weights in that operation are binarized. The deconvolution process can be transformed into an equivalent matrix multiplication. Let $X \in \mathbb{R}^{c_i \times h_i \times w_i}$ (where $c_i$, $h_i$ and $w_i$ are the number of channels, height and width of the input respectively) be the input matrix, and $Y \in \mathbb{R}^{c_o \times h_o \times w_o}$ (where $c_o$, $h_o$ and $w_o$ are the number of channels, height and width of the output respectively) be the output matrix. Denote $W \in \mathbb{R}^{c_i \times c_o \times h_k \times w_k}$ (where $h_k$ and $w_k$ are the height and width of a kernel in the weight matrix) as the weight matrix to be deconvoluted with $X$. Padding is ignored in the discussion, since it does not affect the results.

For the deconvolution operation, the local regions in the output can be stretched out into columns, by which we can cast $Y$ to $\hat{Y} \in \mathbb{R}^{m \times N}$, where $N$ is the number of local regions and $m$ is the length of each stretched region, both determined by the channel and kernel dimensions above. Similarly, $\hat{W} \in \mathbb{R}^{m \times n}$ can be restructured from $W$, and $\hat{X} \in \mathbb{R}^{n \times N}$ can be restructured from $X$. Please refer to [34] for details about the transform. Then, the deconvolution operation can be compactly written as

$$\hat{Y} = \hat{W} \otimes \hat{X}, \qquad (1)$$

where $\otimes$ denotes matrix multiplication. $\hat{X}$ and $\hat{Y}$ are the matrices containing the pixels of an image or an intermediate feature map. During the training process, we adjust the values of $\hat{W}$ to construct a desired mapping between $\hat{X}$ and $\hat{Y}$.
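To make the linearity of deconvolution explicit, the sketch below builds the dense matrix of a small transposed convolution by feeding one-hot inputs through a naive NumPy implementation, and verifies that the whole operation equals a single matrix multiplication. The flattening used here (whole input and output vectors rather than the patch-wise columns of (1)) is a simplification for illustration; the shapes, stride and kernel size are arbitrary.

```python
import numpy as np

def deconv2d(x, w, stride=2):
    """Naive transposed convolution (no padding).
    x: (C_in, H, W), w: (C_in, C_out, K, K)."""
    c_in, h, wd = x.shape
    _, c_out, k, _ = w.shape
    h_out, w_out = (h - 1) * stride + k, (wd - 1) * stride + k
    y = np.zeros((c_out, h_out, w_out))
    for ci in range(c_in):
        for i in range(h):
            for j in range(wd):
                # Each input pixel contributes one scaled kernel patch to the output.
                y[:, i*stride:i*stride+k, j*stride:j*stride+k] += x[ci, i, j] * w[ci]
    return y

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4, 4))
w = rng.normal(size=(3, 2, 5, 5))

# Build the equivalent dense operator column by column: the response to each
# one-hot input is one column of the linear map.
cols = []
for idx in range(x.size):
    e = np.zeros(x.size)
    e[idx] = 1.0
    cols.append(deconv2d(e.reshape(x.shape), w).ravel())
m = np.stack(cols, axis=1)          # shape: (output size, input size)

# The deconvolution is exactly this matrix multiplication.
assert np.allclose(m @ x.ravel(), deconv2d(x, w).ravel())
print("operator matrix shape:", m.shape)
```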

We use a subscript $i$ to denote the $i$-th column of a matrix. Then (1) can be decomposed column-wise as

$$\hat{Y}_i = \hat{W} \hat{X}_i, \qquad (2)$$

where $\hat{Y}_i \in \mathbb{R}^{m}$ and $\hat{X}_i \in \mathbb{R}^{n}$.

Now we analyze a mapping between an arbitrary input $\hat{X}_i$ and an arbitrary output $\hat{Y}_i$. From (2), when the weights are continuously selected, the set of all vectors that can be expressed by the right-hand side is a subspace $V$ spanned by the columns of $\hat{W}$, the dimension of which is $n$. Here we have assumed without loss of generality that $\hat{W}$ has full rank. When $n < m$, where $m$ is the dimension of the output space $\mathbb{R}^{m}$ in which $\hat{Y}_i$ lies, $V$ is of lower dimension than $\mathbb{R}^{m}$, and accordingly, $\hat{Y}_i$ can either be uniquely expressed as a linear combination of the columns in $\hat{W}$ if it lies in $V$ (i.e. a unique $\hat{X}_i$ exists), or cannot be expressed at all if it does not (i.e. no such $\hat{X}_i$ exists). When $n = m$, $V$ and $\mathbb{R}^{m}$ are equivalent, and any $\hat{Y}_i$ can be uniquely expressed as a linear combination of the columns in $\hat{W}$. When $n > m$, $V$ and $\mathbb{R}^{m}$ are still equivalent, but any $\hat{Y}_i$ can be expressed as an infinite number of different linear combinations of the columns in $\hat{W}$. In fact, the coefficients of these combinations lie in an $(n-m)$-dimensional (affine) subspace $S$.

The binarization imposes a constraint on the possible values of the elements in $\hat{W}$: only a finite number of column combinations are possible. If $n \le m$, then at least one of these combinations has to be proportional to the unique combination that yields the desired $\hat{Y}_i$ in order to preserve performance. If $n > m$, then one of these combinations only needs to lie in the subspace $S$ to preserve performance. Apparently, the larger the dimension of $S$ is, the more likely this is to happen, and the smaller the performance loss. A detailed mathematical analysis is straightforward and is omitted here in the interest of space. Accordingly, we define the dimension of $S$, i.e., $n - m$, as the degree of redundancy in the rest of the paper. Note that when this metric is negative, it reflects that $V$ is of lower dimension than $\mathbb{R}^{m}$, and thus the deconvolution layer is more vulnerable to binarization errors. In general, a higher degree of redundancy should give a lower binarization error.

We will use a small numerical example to partially validate the above discussion. We construct a deconvolution layer and vary its degree of redundancy by adjusting its dimensions. For each degree of redundancy, we calculate the minimum average Euclidean distance between the original output and the output produced by binarized weights, which reflects the error introduced by the binarization process. The minimum average Euclidean distance is obtained by enumerating all possible combinations of the binary weights. The results are depicted in Fig. 2. From the figure we can see that the error decreases as the degree of redundancy increases, which matches our conjecture.
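The spirit of this numerical example can be reproduced with the toy NumPy sketch below, which is a simplification under stated assumptions rather than the paper's exact setup: for a small random weight matrix $\hat{W} \in \mathbb{R}^{m \times n}$, it enumerates every sign pattern with a single least-squares scaling factor and reports the minimum average Euclidean distance to the original outputs as $n - m$ varies.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

def min_binarization_error(m, n, n_inputs=200):
    """Minimum average Euclidean distance between W @ x and (alpha*B) @ x over
    all sign patterns B in {-1,+1}^{m x n}, with alpha chosen by least squares."""
    w = rng.normal(size=(m, n))
    x = rng.normal(size=(n, n_inputs))
    target = w @ x
    best = np.inf
    for signs in product([-1.0, 1.0], repeat=m * n):
        b = np.array(signs).reshape(m, n)
        bx = b @ x
        # Optimal single scaling factor for this sign pattern (least squares).
        alpha = (target * bx).sum() / max((bx * bx).sum(), 1e-12)
        err = np.linalg.norm(target - alpha * bx, axis=0).mean()
        best = min(best, err)
    return best

m = 3
for n in range(1, 5):   # degree of redundancy n - m goes from -2 to +1
    print(f"n - m = {n - m:+d}  min avg error = {min_binarization_error(m, n):.3f}")
```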

Figure 2: Binarization error v.s. degree of redundancy for a deconvolution layer
Figure 3: Degree of redundancy v.s. layer number for DCGAN. The intermediate feature maps at the output of each layer as well as the final output are also presented

For the generators in most state-of-the-art GAN models [6, 33], we find that the degree of redundancy decreases with increasing depth, eventually dropping below zero. Such a decrease reflects the fact that more details are generated at the output of a layer as its depth grows, as can also be seen in Fig. 3. These details are highly correlated, and reduce the subspace needed to cover them.

Based on our analysis, several inferences can be deduced to guide the binarization:

  • With the degree of redundancy, taking advantage of existing binarization methods becomes reasonable and feasible. Binarizing layers with a higher degree of redundancy leads to lower performance loss, while layers with a negative degree of redundancy should be kept un-binarized to avoid excessive performance loss.

  • According to the chain rule of probability in directed graphs, the output of every layer depends only on its direct input. Therefore, the binarizability of each layer can be superposed: if a layer can be binarized alone, it can be binarized together with other such layers.

  • When binarizing several deconvolution layers together, the layer with the least degree of redundancy may be the bottleneck of the generator’s performance.

As a result, only the shallower layer(s) of a generator can be binarized together to preserve its performance, because of the trend of the degree of redundancy in it. This leads to PBGen. Besides, such analysis may also explain why binarization can be applied to almost all convolution layers: distilling local information from a global map leads to a positive degree of redundancy.
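A minimal sketch of the layer-selection rule implied by these inferences follows; the redundancy values are the DCGAN numbers reported later in Table 1 and would be recomputed for any other generator.

```python
# Select layers to binarize: keep every deconvolution layer whose degree of
# redundancy is non-negative in binary form, and leave the rest in floating point.
redundancy = {"CONV 1": 1008, "CONV 2": 448, "CONV 3": 0, "CONV 4": -896}

layers_to_binarize = [name for name, d in redundancy.items() if d >= 0]
print("binarize:", layers_to_binarize)            # ['CONV 1', 'CONV 2', 'CONV 3']
print("keep floating point:",
      [name for name in redundancy if name not in layers_to_binarize])
```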

Figure 4: Structure of the generators in DCGAN with dimension of each layer labeled. Deconvolutional layers are denoted as “CONV” (figure credit: [6])

4 Experiments

4.1 DCGAN and Different Binarization Settings

DCGAN will serve as a vehicle to verify the inferences deduced from the theoretical analysis in Section 3. We will explore how to best binarize it while preserving performance. Specifically, we use the TensorFlow [35] implementation of DCGAN on GitHub [36]. The structure of its generator is illustrated in Fig. 4. The computed degree of redundancy of each layer in the generator is shown in Fig. 3 and summarized in Table 1. The degree of redundancy in the last layer drops to -896. According to the inferences above, since the degree of redundancy decreases as the depth increases, we can expect that binarizing the shallower layers and keeping the deeper layers in floating-point format will help preserve the performance.

Layer number Label in Fig. 4 Degree of redundancy
1 CONV 1 1008
2 CONV 2 448
3 CONV 3 0
4 CONV 4 -896
Table 1: Degree of redundancy in each deconvolution layer of the generator in DCGAN

Since there are four deconvolution layers in the generator, each layer can either be binarized or not. For verification, we have conducted experiments on all the different settings, but only eight representative ones are discussed here for clarity and space; the others lead to the same conclusion. These eight settings are summarized in Table 2. In this table, the “Setting” column labels each setting. “Layer(s) binarized” indicates which layer(s) of the generator are binarized. The “Discriminator binarized” column tells whether the discriminator is binarized (“Y”) or not (“N”); it is introduced to verify an observation in our experiments to be discussed later.

Setting   Layer(s) binarized   Discriminator binarized
A None N
B 1 N
C 2 N
D 3 N
E 4 N
F 1,2,3 N
G 1,2,3,4 N
H 1,2,3 Y
Table 2: Settings of different partial binarization of generator in DCGAN

Setting G serves as the baseline model for performance after binarization, because it directly adopts the CNN-based compression techniques without considering the degree of redundancy. The binarization method we adopt is the Binary-Weight-Network from [20], one of the state-of-the-art compression techniques [37]. On the other hand, Setting A serves as the baseline model when considering memory saving, speedup and the performance difference before and after binarization, because it represents the original DCGAN in floating-point representation, a common GAN structure known to provide good performance.

4.2 Dataset and Metrics

CelebA [38] is used as the dataset for our experiments, because it is a popular and well-validated dataset for different GAN structures; DCGAN, WGAN, LSGAN and many other GAN structures have been tested on it [39]. As every image in CelebA contains only one face, it is much easier to judge the quality of the generated images.

Traditionally, the quality of generated images is judged by observation. However, qualitative evaluation is always a hard problem. According to the in-depth analysis of commonly used criteria in [40], good performance on metrics such as average log-likelihood, Parzen window estimates, or visual fidelity of samples, or on extrapolations of them, does not directly translate to good performance of a GAN. On the other hand, the log-likelihood score proposed in [41] only estimates a lower bound instead of the actual performance.

Very recently, [7] proposed an efficient metric, which we will use in our experiments, and showed that it is superior to MS-SSIM [42], a commonly used metric. It calculates the sliced Wasserstein distance (SWD) between the training samples and the generated images at different resolutions. In particular, the SWD of lower-resolution patches indicates similarity in holistic image structures, while the finer-level patches encode information about pixel-level attributes. In this work, the maximum resolution is 64×64. Thus, according to [7], we use three different resolutions to evaluate the performance: 16×16, 32×32, and 64×64. For all resolutions, a small SWD indicates that the distributions of the patches are similar, which means that a generator with smaller SWD is expected to produce images more similar to the training samples in both appearance and variation.
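For reference, the core of the SWD computation can be sketched as follows. This is a simplified illustration assuming the patch descriptors have already been extracted (the protocol of [7] additionally builds a Laplacian pyramid and normalizes the patches), and the patch counts and dimensions here are placeholders.

```python
import numpy as np

def sliced_wasserstein_distance(a, b, n_projections=256, seed=0):
    """Approximate sliced Wasserstein distance between two sets of descriptors
    (rows of a and b, same dimensionality). Each random 1-D projection gives a
    pair of empirical distributions whose Wasserstein distance is the mean
    absolute difference of the sorted projections; SWD averages over projections."""
    rng = np.random.default_rng(seed)
    n = min(len(a), len(b))
    dirs = rng.normal(size=(a.shape[1], n_projections))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)   # unit-length directions
    pa = np.sort(a[:n] @ dirs, axis=0)
    pb = np.sort(b[:n] @ dirs, axis=0)
    return np.abs(pa - pb).mean()

# Illustrative use on flattened 7x7x3 patches from real and generated images.
real_patches = np.random.default_rng(1).normal(size=(2048, 147))
fake_patches = np.random.default_rng(2).normal(size=(2048, 147)) + 0.1
print("SWD:", sliced_wasserstein_distance(real_patches, fake_patches))
```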

4.3 Experimental Results

In this section, we present experimental results that verify our inferences in Section 3, along with some additional observations about the competition between the generator and the discriminator. The images generated by the original GAN (Setting A), in which all weights of each deconvolution layer are kept in floating point, are displayed in Fig. 5(a). The images generated by the DCGAN binarized without considering the degree of redundancy (Setting G) are displayed in Fig. 5(g). These are our two baseline models.

(a) Setting A
(b) Setting B
(c) Setting C
(d) Setting D
(e) Setting E
(f) Setting F
(g) Setting G
(h) Setting H
Figure 5: Images generated under different settings

4.3.1 Qualitative Comparison of Single-Layer Binarization

We start our experiments by comparing the images generated when a single layer of the DCGAN generator is binarized. The results are shown in Figs. 5(b)-5(e), which are generated by PBGen under settings B through E respectively; that is, binary weights are applied to the first, the second, the third, and the last deconvolution layer respectively. The degree of redundancy of each layer is shown in Fig. 3. From the generated images we can see that Fig. 5(b) shows the highest image quality, similar to the original images in Fig. 5(a). Images in Fig. 5(c) are slightly inferior to those in Fig. 5(b), but better than those in Fig. 5(d). Fig. 5(e) contains no meaningful images at all. These observations are in accordance with our inferences in Section 3: the performance loss when binarizing a layer is decided by its degree of redundancy, and a layer with a negative degree of redundancy should not be binarized.

4.3.2 Quantitative Comparison of Single-Layer Binarization

We further quantitatively compute the SWD values at 16×16, 32×32 and 64×64 resolutions for the four settings B, C, D, and E. Their relationship with the degree of redundancy of the binarized layer is plotted in Fig. 6. From the figure, two things are clear: first, regardless of resolution, a negative degree of redundancy (setting E) results in a more than 5× increase in SWD compared with the settings with non-negative degree of redundancy (settings B, C, and D). Second, for all three resolutions, SWD decreases almost linearly with the increase of the degree of redundancy when it is non-negative. This confirms that the degree of redundancy captures the impact of binarization not only on the holistic structure but also on the pixel-level fine details, and as such, is indeed a good indicator to quickly judge whether a layer can be binarized.

Figure 6: SWD v.s degree of redundancy of the binarized layer in different settings
Figure 7: SWD v.s. resolution for SWD score calculation under Setting A and Setting B

We also report the SWD averaged over the different resolutions (16×16, 32×32, 64×64) in Table 3, where the result for the original GAN (setting A) is also included. From the table we can draw similar conclusions: binarizing the second layer (setting C) increases the average SWD by 2.3% compared with the original GAN (setting A), while binarizing the third and the fourth layer (settings D and E) increases it by 52.3% and 913.6%, respectively.

Setting A B C D E
 Average SWD (×10⁻³)  44  38  45  67  449
Table 3: Average SWD under different settings

It is interesting to note that the average SWD achieved by binarizing the first layer (setting B) is 13.6% smaller than that of the original DCGAN (setting A). To further check this, we plot the SWD v.s. resolution for these two settings in Fig. 7. From the figure we can see that the SWD from setting B is always smaller than that from setting A across all three resolutions. This shows that setting B achieves better holistic similarity as well as better detailed attributes. Such an improvement is probably due to a regularization effect; a similar effect has been observed in the compression of CNNs [25].

4.3.3 Validation of Superposition of Binarizability

We now turn to experiments that verify our inference that all layers that can be binarized alone can be binarized together. The images generated under setting F in Fig. 5(f), where the first three layers of the generator are binarized together, show no significant difference from those in Figs. 5(a)-5(d). Binarizing any two of the first three layers (not shown here) leads to the same result. On the other hand, setting G does not generate any meaningful output (Fig. 5(g)), as the last layer, which cannot be binarized alone, is binarized together with the first three layers. Binarizing any of the first three layers together with the last layer (not shown here) also produces meaningless results. Setting G follows the state-of-the-art binarization for CNNs directly without considering the degree of redundancy. That is, with the assistance of the degree of redundancy, we can determine that at most the first three deconvolution layers of the generator can be binarized with small performance loss (Setting F), whereas directly adopting the existing binarization method leads to excessive degradation in performance and provides no hint for improvement (Setting G).

Figure 8: SWD v.s. resolutions under Setting D and Setting F

Moreover, the average SWD for setting F is 0.067, the same as setting D. Further looking at the SWD values under different resolutions for the two different settings as shown in Fig. 8, it is clear that the two curves are very close. This validates our last inference, that when multiple layers are binarized together, the layer with least degree of redundancy is the bottleneck, which decides the overall performance of the network.

4.3.4 Compression Saving

Table 4 summarizes the speedup during training and inference as well as the memory reduction for PBGen compared with the original generator in DCGAN, which is the baseline model when considering memory saving and speedup. PBGen under Setting F can achieve 25.81× memory saving as well as 1.96× and 1.32× speedup during inference and training respectively, with little performance loss. For both the original generator and PBGen, the floating-point representation of all weights needs to be kept during training for backward propagation and weight updates [20]. As such, the speedup mainly comes from faster forward propagation with binarized weights. Compression savings are also shown in Table 4; they are practical savings calculated according to our strategy.

Generator model                              Inference speedup   Training speedup   Memory cost
Original generator from DCGAN (Setting A)    1.0                 1.0                1.0
PBGen (Setting F)                            1.96                1.32               1/25.81
Table 4: Training and inference speedup as well as memory reduction for PBGen
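A rough, back-of-the-envelope sketch of where the memory saving comes from is given below; the per-layer parameter counts are placeholders rather than the actual DCGAN numbers, so the printed ratio is illustrative only and will not match the measured 25.81× reported above.

```python
# Binarized layers store 1 bit per weight plus one 32-bit scaling factor per
# output channel; un-binarized layers keep 32 bits per weight.
def memory_bits(layers, binarized):
    total = 0
    for name, (n_params, n_out_channels) in layers.items():
        if name in binarized:
            total += n_params * 1 + n_out_channels * 32   # signs + scaling factors
        else:
            total += n_params * 32                        # full floating point
    return total

# Hypothetical per-layer (parameter count, output channels) pairs.
layers = {"deconv1": (3_000_000, 256), "deconv2": (800_000, 128),
          "deconv3": (200_000, 64),    "deconv4": (5_000, 3)}
full = memory_bits(layers, binarized=set())
pbgen = memory_bits(layers, binarized={"deconv1", "deconv2", "deconv3"})
print(f"illustrative memory reduction: {full / pbgen:.2f}x")
```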

4.3.5 Unbalanced Competition

So far, our discussion has focused on the binarization of the generator in a GAN only, as the discriminator takes the same form as conventional CNNs. However, since the competition between generator and discriminator is the key to GANs, would a binarized generator still compete well with a full-precision discriminator?

The loss values of the discriminator network and of PBGen under Setting F are depicted in Fig. 9, where the x-axis indicates the number of epochs and the y-axis is the loss value. The images generated at different numbers of epochs are also exhibited alongside. From the figure we can see that during the initial stage, distorted faces are generated. As the competition is initiated, image quality improves. But very quickly, the competition vanishes, and the generated images stop improving. However, when we binarize the discriminator at the same time (Setting H), the competition continues to improve image quality, as can be seen in Fig. 10.

Figure 9: Loss values of the original discriminator and PBGen under Setting F along epochs
Figure 10: Loss values of binarized discriminator and PBGen under Setting H along epochs
Figure 11: Loss values of the discriminator and the generator in original DCGAN under Setting A along epochs

We further plot the loss values of the discriminator and the generator of the original DCGAN (Setting A), and the results are shown in Fig. 11. It is very similar to Fig. 10, except that the competition is initiated earlier, which is due to the stronger representation power of both the generator and the discriminator before binarization. These figures confirm that the quick disappearance of competition is mainly due to the unbalanced generator and discriminator, which should be avoided.

We now examine the quality of the images generated from the balanced competition under Setting H. The generated images are shown in Fig. 5(h), and their quality is apparently better than the rest of Fig. 5. To confirm this quantitatively, we compute the average SWD of those images, which is 0.034. This is even smaller than any average SWD value listed in Table 3, showing that the images are of better quality, even compared with the original DCGAN.

4.3.6 Summary

To summarize the discussion and comparisons in this section, we plot the SWD v.s. resolution curves for all eight settings in Fig. 12. This allows a complete view of how the different settings compare in terms of both holistic similarity and fine details. From the figure we can see that setting H gives the best holistic similarity, while setting C yields the finest detailed attributes.

Figure 12: SWD v.s. resolutions under all different settings

Consequently, utilizing the degree of redundancy as a tool, we can efficiently find the eligible layers that can be binarized, and based on their superposition a final binarization strategy can be decided. This cannot guarantee an optimal result, but it does decrease the search space for the final solution from $2^L$ to $L$ or less, where $L$ is the number of layers, because testing all combinations of binarization strategies is not necessary: we only need to test binarizing each single layer with a high degree of redundancy to decide the final strategy. Since our theoretical analysis and experiments are based on deconvolutional layers, we believe this method can work for other deconvolution-based generators beyond DCGAN.

5 Conclusion

Compression techniques have been widely studied for convolutional neural networks, but our observations show that directly applying them to all layers fails for deconvolution-based generators in generative adversarial networks. We propose and validate that the performance of a deconvolution-based generator can be preserved when binarization is applied to carefully selected layers (PBGen). To accelerate the decision on whether a layer can be binarized, the degree of redundancy is proposed based on theoretical analysis and further verified by experiments. Guided by this metric, the search space for the optimal binarization strategy is decreased from $2^L$ to $L$, where $L$ is the number of layers in the generator. PBGen for DCGAN can yield up to 25.81× saving in memory consumption with 1.96× and 1.32× speedup in inference and training respectively, with little performance loss measured by the sliced Wasserstein distance. Besides, we also demonstrate that both the generator and the discriminator should be binarized at the same time for balanced competition and better performance.

References