A Survey on GAN Acceleration Using Memory Compression Technique

08/14/2021 · by Dina Tantawy et al. · Cairo University and NYU

Since their invention, generative adversarial networks (GANs) have shown outstanding results in many applications. GANs are powerful, yet resource-hungry, deep learning models. Their main difference from ordinary deep learning models lies in the nature of their output: a GAN's output can be a whole image, whereas other models detect objects or classify images. Thus, the architecture and numeric precision of the network affect the quality and speed of the solution. Hence, accelerating GANs is pivotal. Accelerating GANs can be classified into three main tracks: (1) memory compression, (2) computation optimization, and (3) data-flow optimization. Because data transfer is the main source of energy usage, memory compression leads to the most savings. Thus, in this paper, we survey memory compression techniques for CNN-based GANs. Additionally, the paper summarizes opportunities and challenges in GAN acceleration and suggests open research problems to be further investigated.


1 Introduction

Recently, deep learning (DL) applications have gained unprecedented fame. Many machine (deep) learning applications are used daily, such as face recognition, voice recognition, weather prediction, and image super-resolution. Companies and researchers alike compete to present more applications each day and to enhance existing ones. They also compete to make them affordable and usable by everyone.

Deep generative models have been on the rise as well. The generative adversarial network (GAN) is one of the most famous generative models goodfellow2014generative. It consists of at least two networks competing against each other. It has been used in many applications such as speech synthesis speech_bollepalli2019generative, text-to-image translation txttoimg_zhang2017stackgan; txttoimg_zhang2018photographic, image-to-image translation imgtoimg_choi2018stargan, image super-resolution highres_wang2018high, music generation music_gansynth, and video synthesis video_gan_clark2019efficient.

Running one deep network is already very computationally intensive; running two or more is even worse. Additionally, GAN training is susceptible to divergence or mode collapse, and this training instability adds an extra layer of complexity.

Additionally, there is an increasing need for running GAN generators on embedded devices. For example, style transfer can be used for virtual clothes fitting in stores app_GAN_tryon_Fwgan. Another usage of GANs on hand-held devices is video super-resolution to save bandwidth while downloading videos app_GAN_superresolution. Thus, current techniques and accelerators need to be revisited and adapted to serve the urgent demand of GANs.

The goals of accelerating GANs are power efficiency, speed, and solution quality. In the real world, it is hard to improve everything at once; a price must be paid, according to the no-free-lunch theorem. Depending on the application, the importance of one goal over another will vary, and thus the optimization technique will as well. Acceleration (optimization) techniques target three main categories:

  • Memory compression: this category uses compression techniques to minimize memory requirements while preserving solution quality, which in turn saves energy.

  • Computation optimization: this category uses mathematical transformations and circuit optimizations to decrease the number of mathematical operations or cycles needed, while reducing their power consumption and increasing their speed.

  • Dataflow optimization: this category uses mapping, scheduling, and reordering of data and/or operations to maximize data reuse and minimize ineffectual operations, which in turn saves energy and enhances throughput.

One bottleneck of running GANs is the huge cost of data transfer to/from the accelerating chip's memory, followed by the computation cost and, finally, the overhead of data transfer between different units. Thus, our approach is to investigate optimization techniques tackling those areas one by one, starting with memory compression because of its huge cost. Accordingly, this paper presents a survey of recent efforts to accelerate GANs using memory compression. Our work focuses on GANs generating images, although it applies to other types of GAN tasks as well. Computation optimization and dataflow optimization are left for future work.

Our work is complementary to the other survey papers mentioned in Section 3, highlighting issues specific to GANs and the different efforts made to resolve them. This paper has the following contributions:

  • To the best of our knowledge, this is the first paper to survey GAN compression.

  • Providing a taxonomy for GAN optimization.

  • Summarizing recent research work in accelerating GANs.

  • Providing open research questions for accelerating GANs.

The remainder of this paper is organized as follows: Section 2 presents a brief background on how GANs work, and Section 2.1 reviews the different efforts to compress GANs. Section 3 highlights related work and the main differences. Finally, Section 4 concludes the work and presents open research questions that need to be further studied.

2 Background

A generative adversarial network (GAN) is a type of generative model that uses deep learning (DL) techniques to generate data. As mentioned earlier, a GAN is more a way of training smaller models to compete against each other than a newly devised model.

CNN-based GANs can have many different architectures, such as DCGAN by DCGAN and Pix2Pix by imgtoimg_isola2017image. Although different CNN-based GAN models have different architectures, most of them share the same underlying concept of competing networks and the use of transposed convolution or upsampling layers. In this section, we use DCGAN as a representative model that shares a common base with others to explain the idea of GANs in general.

Structure: A GAN model consists of at least two networks (some applications, like style transfer, require more than one GAN, i.e., more than two networks): a generator and a discriminator, in an organization like fig. 1. In training, the generator takes an input noise vector and generates data that looks as much as possible like real data, while the discriminator tries to get better at discriminating the generated data from the real data. Thus, the goal of the discriminator is opposite to that of the generator; that is why the model is called "adversarial", as the two networks compete against each other. The training ends when the discriminator cannot improve its accuracy anymore and the evaluation metrics, described later in this section, are satisfactory. Sometimes the training does not converge, and measures and limits are used to halt the training process and restart it.

The optimization problem can be represented as the following equation:

(1)
$$\min_{G}\max_{D} V(D,G) = \mathbb{E}_{x\sim p_{data}(x)}\big[\log D(x)\big] + \mathbb{E}_{z\sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]$$

where $G$ represents the generator, $D$ represents the discriminator, $p_{data}$ represents the original data distribution, and $p_{z}$ represents the noise input distribution. The first part of the equation, $\mathbb{E}_{x\sim p_{data}(x)}[\log D(x)]$, represents the probability that the discriminator classifies real data as real, while the second part, $\mathbb{E}_{z\sim p_{z}(z)}[\log(1 - D(G(z)))]$, represents the ability of the discriminator to classify fake data as fake. The discriminator tries to maximize both parts of the equation, which is why we need $\max_{D}$. On the other hand, the generator tries to fool the discriminator into classifying fake images as real, so it tries to minimize the second part of the equation, adding the $\min_{G}$. Combining the two network optimizations, we get the above equation.

Figure 1: Example of GAN general organization

The discriminator network is an ordinary CNN, LSTM, or any other deep-learning classification/regression model. In contrast to the discriminator, which outputs only a decision or a prediction, the generator generates the data itself, be it an image, music notes, an animated character, etc. Thus, the output of the generator network is larger than its input. Fig. 2 shows the generator network in DCGAN. Like most generator networks, the input is a noise vector or an initial image that gets expanded and reshaped to a bigger size. This expansion and reshaping are performed to make the convolution act as a transposed convolution.
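As an illustration of such an architecture, the sketch below builds a DCGAN-style generator in PyTorch; the channel counts are indicative rather than the exact DCGAN configuration:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """DCGAN-style generator sketch: a 100-d noise vector is reshaped to a 1x1
    feature map and expanded by transposed convolutions up to a 64x64 RGB image."""
    def __init__(self, z_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim, 512, 4, 1, 0), nn.BatchNorm2d(512), nn.ReLU(True),  # 4x4
            nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(True),    # 8x8
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),    # 16x16
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),      # 32x32
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),                                # 64x64
        )

    def forward(self, z):
        return self.net(z.view(z.size(0), -1, 1, 1))   # (batch, z_dim) -> (batch, 3, 64, 64)

# Usage: img = Generator()(torch.randn(16, 100))   # -> torch.Size([16, 3, 64, 64])
```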

Figure 2: Generator Model in DCGAN DCGAN

Evaluation Metrics:

GANs have many evaluation metrics, which we can split into two types: 1) functional metrics and 2) performance metrics. Functional metrics measure how good the result of a GAN is with respect to the target functionality; we can also call them "quality" metrics or scores. The most commonly used functional metrics are the Inception Score (IS), Fréchet Inception Distance (FID), Peak Signal-to-Noise Ratio (PSNR), mean Pixel Accuracy (mPA), and Intersection over Union (IoU). Both IS and FID measure the quality of the generated outputs and their diversity; more information about how they are calculated can be found in InceptionScore; FID. The higher the IS, the better the diversity of the generated data, while the lower the FID, the better the image quality. PSNR is also used, especially in blending images or creating super-resolution images (the higher the better). mPA measures the mean percentage difference between the generated image and the ground truth. IoU measures how close the generated image is to the "real image" (the higher the better, maximum = 1). Functional metrics are summarized in Table 1.

Name | Abb. | Enhancement Direction
Inception Score | IS | higher is better
Fréchet Inception Distance | FID | lower is better
Peak Signal-to-Noise Ratio | PSNR | higher is better
Mean Pixel Accuracy | mPA | higher is better
Intersection over Union | IoU | higher is better*
* maximum value is 1 and requires ground truth
Table 1: Functional Metrics (Scores) Summary
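For reference, the snippet below computes two of the simpler metrics, PSNR and IoU, with NumPy (IS and FID need a pretrained Inception network and are omitted); the function names are ours:

```python
import numpy as np

def psnr(generated, reference, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB (higher is better)."""
    mse = np.mean((generated.astype(np.float64) - reference.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def iou(pred_mask, gt_mask):
    """Intersection over Union of two boolean masks (higher is better, maximum 1)."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union > 0 else 1.0

# Usage: psnr(np.zeros((64, 64, 3), np.uint8), np.full((64, 64, 3), 8, np.uint8))
#        iou(np.ones((32, 32), bool), np.ones((32, 32), bool))
```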

Performance metrics measure the efficiency of the model optimization. The efficiency of a model refers to its speed, power, and area. These metrics depend on the software model, the hardware architecture, and the technology used. Focusing on compression techniques, our main metrics are the compression ratio, the number of MACs (multiply-accumulate operations), and throughput, as listed in Table 2.

Name | Abb. or Unit | Enhancement Direction
Compression Ratio | CR | higher is better
No. of multiply-accumulate ops | #MACs | lower is better
Throughput | output/s | higher is better
Table 2: Performance Metrics Summary

2.1 Memory Compression

What is memory compression? Any DL model needs three items to reside in memory: 1) the model architecture (control flow or computational graph), 2) the model parameters (weights and biases), and 3) the inputs (activations). Compressing memory means minimizing any or all of these three components.

What are the advantages of using memory compression techniques? Memory compression reduces the storage space required for the model, enabling cross-layer optimization by allowing several layers to fit into on-chip memory at once, or even storing the whole model on an embedded system. It also decreases off-chip data transfer, leading to higher throughput and lower power usage.

Figure 3: Classification of Memory Compression Techniques

What is the classification of memory compression techniques? Memory compression can be done using several techniques, as shown in fig. 3. Compression techniques are classified into two main categories: a) lossless techniques and b) lossy ones. As the name states, lossy compression introduces some loss; that is why finetuning or retraining the compressed model is advised with those techniques. On the other hand, lossless compression techniques require extra software and/or hardware support to revert the compression or to apply computations on sparse matrices. We also classify the techniques according to granularity: a coarse granularity considers compressing the model architecture, while a finer granularity considers compressing the data used by the model (both weights and activations). Combining several techniques is always an option, but a careful eye needs to watch the quality of the solution. The rest of this section focuses on lossy compression techniques: it explains each technique and shows the latest work using it on GAN generators. Lossless techniques are general ones that can be used with different applications and are orthogonal to lossy techniques, which is why we choose to focus on the lossy ones.

2.1.1 Pruning

Pruning is the process of eliminating parts of the network model to make it smaller without much loss of result quality. Pruning is usually done after the model is trained, and the model is then finetuned to adjust the remaining weights, as seen in fig. 4.

Figure 4: Pruning Process Flowchart

Pruning is defined by four main decisions, as seen in fig. 5. The first decision is the pruning criterion, or in other words, "how to choose the part to be pruned". The criterion could be totally random, using trial and error, or based on the norm of the selected element with a certain threshold (a.k.a. norm-threshold). A similarity threshold can also be used: when two activations (feature maps) are very similar to each other, one of them can be removed, as it introduces no new information. Finally, evolutionary algorithms such as genetic algorithms can be used to choose the elements to be pruned.

The second decision is the pruning granularity. The pruning could be unstructured, which means eliminating individual elements from the weight/bias matrices. This kind of pruning leaves the network structure as it is, but it converts its weight matrices into sparse matrices, which can be further compressed. On the other hand, structured pruning eliminates components from the network, leaving it slimmer (fewer channels or filters) or shorter (fewer layers).

The third decision is the application time, or "when to apply pruning". As mentioned previously, most works apply pruning as a post-processing step; however, several techniques have lately been introduced to apply pruning while training, as will be explained later in this subsection. Post-processing pruning can further be classified into gradual (iterative) pruning, where pruning starts with a small threshold that is increased gradually to avoid accuracy loss, and one-shot pruning, where the network is pruned once and then finetuned according to some criterion. (Note that "one-shot learning" usually has a different meaning; it will be explored in the knowledge distillation section in the context of network architecture search.)

The final decision is the module, or "which module should be pruned". The generator part of the network is usually the target; however, due to the instability of training and hence finetuning, some works suggest pruning the discriminator as well, to avoid it overpowering the pruned generator during finetuning.

Figure 5: Pruning decision: Four decisions to define any pruning technique
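As a concrete example of one point in this design space (norm-threshold criterion, structured filter granularity, applied after training, on the generator), the following PyTorch sketch slims a convolution layer; it is illustrative only, since a real pipeline must also adjust the following layer and finetune the model:

```python
import torch
import torch.nn as nn

def prune_conv_filters(conv: nn.Conv2d, threshold: float) -> nn.Conv2d:
    """Structured norm-threshold pruning sketch: drop output filters whose L1 norm
    falls below `threshold` and return a slimmer Conv2d layer."""
    with torch.no_grad():
        norms = conv.weight.abs().sum(dim=(1, 2, 3))           # one L1 norm per output filter
        keep = (norms >= threshold).nonzero(as_tuple=True)[0]  # indices of surviving filters
        slim = nn.Conv2d(conv.in_channels, len(keep), conv.kernel_size,
                         conv.stride, conv.padding, bias=conv.bias is not None)
        slim.weight.copy_(conv.weight[keep])
        if conv.bias is not None:
            slim.bias.copy_(conv.bias[keep])
    return slim

# Usage: slim = prune_conv_filters(nn.Conv2d(64, 128, 3, padding=1), threshold=5.0)
```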

In a study by Chong & Jeff GANCompression_whyothersFail, the authors showed that the quality of results degrades significantly with thresholding pruning, as seen in fig. 6. Thresholding pruning eliminates the element under test if it is below a certain threshold; the element could be a single weight, a filter, or a channel. They implemented several pruning techniques on StarGAN imgtoimg_choi2018stargan: iterative pruning after training GANCompression_whyothersFail(e), iterative pruning during training GANCompression_whyothersFail(f), pruning both generator and discriminator GANCompression_whyothersFail(n), and one-shot pruning after training GANCompression_whyothersFail(d). It is worth noting that pruning during training resulted in lower quality than pruning after training, which indicates that the model fails to converge. They also presented other non-pruning techniques, which we will explore later.

Figure 6: FID scores of different pruning techniques for StarGAN, showing the failure of pruning techniques to preserve the required quality compared to the original non-pruned version of the algorithm.

Instead of using thresholding for pruning, Shu et al. proposed using an evolutionary algorithm, namely a genetic algorithm (GA), to compress cyclic networks like CycleGAN prune_GAN_shu2019co. The idea is to represent the generator as a bitstream where each bit corresponds to a filter; if a bit is 0, the corresponding filter is pruned. The GA fitness function combines three criteria: a) the size of the network, b) the compression distance, and c) the cycle loss. The compression distance is the mean square error between the discriminator outputs for the compressed and uncompressed generators, whereas the cycle loss is a special loss used when training on unpaired images, as explained in imgtoimg_zhu2017unpaired. The GA achieved compression ratios between 3.54x and 5.7x on CycleGAN. GA pruning achieved 0.542 mean pixel accuracy compared to 0.218 using thresholding pruning. It also achieved a better FID score, by 30 points on average, compared to thresholding pruning, while degrading FID by only 8.5 points compared to the original non-pruned network (calculated using 4 datasets on CycleGAN).
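A hedged sketch of how such a GA fitness could look is shown below; the helper `G_pruned_factory` and the loss weights are hypothetical, and the cycle term is a rough stand-in for the full CycleGAN cycle loss:

```python
import torch

def ga_fitness(bits, G_full, G_pruned_factory, D, x, weights=(1.0, 1.0, 1.0)):
    """Hedged sketch of a GA fitness for filter-level pruning.
    `bits`: 0/1 tensor, one entry per generator filter (0 = pruned).
    `G_pruned_factory(bits)`: hypothetical helper that builds the pruned generator."""
    G_small = G_pruned_factory(bits)
    w_size, w_dist, w_cycle = weights
    size_term = bits.float().mean()                                  # fraction of filters kept
    with torch.no_grad():
        # "Compression distance": MSE between D's outputs for the full and pruned generators.
        dist_term = torch.mean((D(G_full(x)) - D(G_small(x))) ** 2)
        # Rough stand-in for the cycle loss; a real CycleGAN setup uses the
        # reverse-direction generator to map the output back to the input domain.
        cycle_term = torch.mean(torch.abs(G_full(G_small(x)) - x))
    # The GA maximizes fitness, so smaller size/distance/cycle terms score higher.
    return -(w_size * size_term + w_dist * dist_term + w_cycle * cycle_term)
```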

Song et al. presented overlapping pruning with training instead of the ordinary train-prune-finetune approach SPGAN_song2020sp. In this work, the authors adopted a train-expand-prune approach: they started by training a small network (called the seed network), then progressively expanded it by adding more width (filters). Similar filters are then pruned and the whole network is finetuned. They achieved 1.25x fewer FLOPs than the baseline GAN.

Work/Aspect | Criteria | Granularity | Application | Module
GANCompression_whyothersFail(e) | threshold | Unstructured | After training | Generator
GANCompression_whyothersFail(f) | threshold | Unstructured | During training | Generator
GANCompression_whyothersFail(n) | threshold | Unstructured | After training | Both
GANCompression_whyothersFail(d) | threshold | Unstructured | After training | Generator
prune_GAN_shu2019co | evolutionary | Structured | After training | Generator
SPGAN_song2020sp | similarity | Structured | During training | Generator
Table 3: Summary of GAN pruning work

Although many combinations of pruning techniques have not been explored, as seen in the summary in Table 3, the current results indicate that pruning alone is not enough: there is a significant loss in quality, as shown in fig. 6. This failure is attributed to the following reasons: 1) the high resolution of the generator output, compared to discriminator models, makes it more sensitive to noise; 2) the generator evaluation metrics are more subjective than objective; and 3) the training of GANs is unstable, and care should be taken to avoid the discriminator overpowering the generator. To overcome those challenges, a more general approach called knowledge distillation is used, in which pruning is usually one ingredient.

2.1.2 Knowledge Distillation

Knowledge distillation is the transfer of the knowledge acquired by the uncompressed generator (called the teacher model) to a smaller model (called the student model). To apply knowledge distillation, we need to define four main components, as explained in fig. 7.

Figure 7: Knowledge Distillation Components

The first component is the teacher model. While the straightforward approach suggests using the pretrained model that we are trying to compress, some works train an overly large model from scratch to gain more flexibility in finding the most optimal student model.

The second component is the construction of the student model. In its simplest form, pruning is used to generate the student model from the teacher model. Student models can also be constructed using network architecture search (NAS), or even built progressively using sub-constructs from the teacher model.

The third component is the training architecture, or building blocks. This component is concerned with which components of the teacher model will be included in the training and whether to construct a complete student GAN (both generator and discriminator) or just a student generator.

The last component is the loss function. The loss function determines how well and how fast the student will learn from the teacher. Many loss functions have been introduced; they are explained below.

In GAN_COMPRESSION_DISTILLATION, Aguinaldo et al. used knowledge distillation to train a student generator that can even beat the teacher generator. They used a pretrained teacher generator and discriminator, as seen in fig. 8, and devised a reconstruction loss (eq. 3) as a joint function between the adversarial GAN loss and the per-pixel loss (eq. 2). First, the same input is fed to both the student and teacher generators. Second, the output of both generators is fed into the teacher discriminator. Third, the reconstruction loss of eq. 3 is calculated to train the compressed generator.

(2)
$$\mathcal{L}_{pixel} = \mathbb{E}_{z\sim p_z}\big[\lVert G_T(z) - G_S(z)\rVert_2^2\big], \qquad \mathcal{L}_{adv} = \mathbb{E}_{z\sim p_z}\big[\log\big(1 - D_T(G_S(z))\big)\big]$$

(3)
$$\mathcal{L}_{recon} = \mathcal{L}_{pixel} + \alpha\,\mathcal{L}_{adv}$$

where $z$ is the noise used as input to the generators, $\mathcal{L}_{adv}$ is the adversarial GAN loss from eq. 2, and $\alpha$ is a weighting parameter between both losses. The authors used a very large teacher to guide the small network, leading to a 1669x compression ratio while retaining 83% of the teacher's Inception Score on MNIST. However, the produced images were very blurred at this compression rate. What makes this work stand out is that the distilled networks using this method always beat trained-from-scratch networks of the same size.

Figure 8: Training Architecture for work GAN_COMPRESSION_DISTILLATION showing Per-pixel loss & gan-loss (a.k.a. adversarial loss)
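The sketch below illustrates this kind of joint distillation loss in PyTorch; the exact weighting and loss forms in the original work may differ, and we assume the teacher discriminator ends with a sigmoid:

```python
import torch
import torch.nn.functional as F

def distillation_loss(G_teacher, G_student, D_teacher, z, alpha=0.5):
    """Sketch of the joint reconstruction loss of eq. 2-3: per-pixel MSE between
    teacher and student outputs plus an adversarial term from the frozen teacher
    discriminator. `alpha` weights the two terms."""
    with torch.no_grad():
        fake_t = G_teacher(z)                 # teacher output, no gradients needed
    fake_s = G_student(z)                     # student output
    pixel_loss = F.mse_loss(fake_s, fake_t)   # per-pixel loss
    d_out = D_teacher(fake_s)                 # teacher discriminator's opinion of the student
    adv_loss = F.binary_cross_entropy(d_out, torch.ones_like(d_out))  # push D_T towards "real"
    return pixel_loss + alpha * adv_loss
```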

In GANCompression_whyothersFail, Chong & Jeff also used a pretrained generator and discriminator to train a student generator, as shown in fig. 9. They measured the loss as follows: first, they measured the per-pixel loss between the outputs of the two generators as in eq. 2 (usually the mean square error, but it could be any other distance). Additionally, they measured a class-loss as the loss between the teacher discriminator's outputs for the student and teacher generators, as seen in eq. 4. The total loss is the weighted sum of the previous two losses, as shown in eq. 5.

(4)
$$\mathcal{L}_{class} = \mathbb{E}_{z\sim p_z}\Big[\,d\big(D_T(G_S(z)),\, D_T(G_T(z))\big)\Big]$$

(5)
$$\mathcal{L}_{total} = \lambda_1\,\mathcal{L}_{pixel} + \lambda_2\,\mathcal{L}_{class}$$

where $d(\cdot,\cdot)$ is a distance measure (e.g., the mean square error).
Figure 9: Training Architecture for work GANCompression_whyothersFail showing Per-pixel loss & Class-loss

Similarly, in KD_GAN_COMPRESSION_NAS, they used a GAN (adversarial) loss function computed with the teacher discriminator. Additionally, they added another loss term representing an intermediate distillation loss, which is the loss between the outputs of two corresponding inner layers, as in eq. 6. If the data is paired, they calculate the per-pixel loss between the student-generated image and the paired image; otherwise, they use the output of the teacher generator to calculate it. The training architecture is shown in fig. 10, while the final loss function is given in eq. 7. To construct the student model, they used neural architecture search (NAS). To avoid the long running time of NAS, they used one-shot learning to train a variable number of networks at once; the idea of one-for-all (OFA) network training, also called one-shot learning in training, is best explained in OFA_NAS. They reached compression ratios between 4x and 33x on various datasets and networks.

(6)
$$\mathcal{L}_{inter} = \sum_{l}\big\lVert G_T^{(l)}(z) - f_l\big(G_S^{(l)}(z)\big)\big\rVert_2^2$$

where $f_l$ is the mapping function between the teacher inner layer and the corresponding student layer to adjust the size, and $l$ is the layer number.

(7)
$$\mathcal{L}_{total} = \lambda_1\,\mathcal{L}_{pixel} + \lambda_2\,\mathcal{L}_{adv} + \lambda_3\,\mathcal{L}_{inter}$$

where the different $\lambda$ are used to weight the different loss functions.

Figure 10: Training Architecture for work KD_GAN_COMPRESSION_NAS; KD_GAN_fu2020autogan showing intermediate distillation loss, Per-pixel loss & gan-loss.
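A minimal sketch of such an intermediate distillation loss is shown below; it assumes the intermediate feature maps have already been collected (e.g., with forward hooks) and uses learnable 1x1 convolutions as the mapping functions $f_l$, which is a common but not necessarily identical choice to the cited works:

```python
import torch.nn as nn
import torch.nn.functional as F

def intermediate_distillation_loss(student_feats, teacher_feats, mappers):
    """Sketch of eq. 6: each student feature map is projected by a learnable
    mapping layer f_l (here a 1x1 convolution) to the teacher's channel count
    and compared with the corresponding teacher feature map."""
    loss = 0.0
    for f_s, f_t, f_l in zip(student_feats, teacher_feats, mappers):
        loss = loss + F.mse_loss(f_l(f_s), f_t)
    return loss

# The mapping layers could be created once per chosen layer pair, e.g. for
# hypothetical channel counts (student 16 -> teacher 64, student 32 -> teacher 128):
# mappers = nn.ModuleList([nn.Conv2d(16, 64, kernel_size=1), nn.Conv2d(32, 128, kernel_size=1)])
```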

In contrast to the previous work, the work in KD_GAN_fu2020autogan suggested a more constrained search for the student model instead of unconstrained NAS. The authors built DART, an AutoML-like framework for GANs, to perform a differential search liu2018darts for the operators at each layer and the layer width. They constrained the first and last layers to match famous model architectures, such as CycleGAN for style-transfer tasks and ESRGAN wang2018esrgan for super-resolution tasks.

In the work by Qing et al., the teacher network is reconstructed as a large supernet for image-to-image translation, so that pruning and distilling such a large network leads to a more efficient student network KD_GAN_CATjin2021teachers. The student network is made by pruning channels of the teacher generator. They aggregated four loss functions; the training architecture is shown in fig. 11. First, they used intermediate distillation with a kernel-alignment function to map the corresponding intermediate layer sizes. Second, they used a perceptual loss, which is the loss between intermediate features of the discriminator for real (or teacher) and student images, as seen in eq. 8. Third, they used the GAN adversarial loss, as all other GANs do. Finally, they used the cycle loss, since this is an image-to-image translation task: the cycle loss measures the difference between an image A and its reconstruction A' after converting from domain A to B and back again. This reconstruction method reduced the time required to construct the student network by 10,000x compared to unconstrained NAS, with better FID scores.

(8)
$$\mathcal{L}_{perc} = \sum_{l=1}^{L}\frac{1}{N_l}\big\lVert D^{(l)}(x) - D^{(l)}(G_S(z))\big\rVert_1$$

where $x$ is the real or teacher-generated image, $L$ is the number of layers, and $N_l$ is the total number of elements in each layer.

Figure 11: Training Architecture for work KD_GAN_CATjin2021teachers showing intermediate distillation loss, perceptual-loss & gan-loss

Another work, by Chen et al., used intermediate distillation but on the discriminator instead of the generator, as shown in fig. 12. They used two discriminators instead of one KD_GAN_chen2020distilling, so they distill both the generator and the discriminator. The distillation of the generator uses three loss functions: perceptual loss, per-pixel loss, and adversarial loss. To distill the discriminator, they used the adversarial loss and introduced another loss term, a triplet loss. The idea of this loss term is that the distance between teacher-generated and student-generated images should be smaller than the distance between real images and student-generated images; thus, they added an extra parameter to weight the two differences in the loss function.

Figure 12: Training Architecture for work KD_GAN_chen2020distilling showing both generator and discriminator distillation

In KD_GAN_liu2021content, the authors combined the intermediate distillation loss, the perceptual loss, and the per-pixel loss. They also used a content-aware approach to enhance distillation: areas of interest are detected using an auxiliary segmentation network (content-aware network) and the corresponding images are masked before calculating the loss, leading to 10x to 11.5x compression ratios compared to several original GANs, with FIDs around 7.5 on the FFHQ dataset (original FID 2.7 to 4.5). Zhang et al. also combined intermediate distillation losses from the discriminator and the generator in their work PKDGAN KD_GAN_PKDGAN. However, it is only applied to novelty detection, so further work is needed to compare it with other GAN methods.

Figure 13: Training Architecture for work KD_GAN_liu2021content showing both generator and discriminator distillation and using segmentation network to allow content-aware distillation

In a work done by Haotao et al., the authors presented a unified framework called GAN Slimming to stack several memory compression techniques together GAN_Slimming. They used distillation to compress the generator: the student generator is automatically and adaptively generated from the teacher generator by channel pruning and quantization. They used the normalization layer's scale parameter to guide the pruning during training, and adopted the adversarial loss, the per-pixel loss, and the added normalization scale parameter as the combined loss function.

Work/Component | Teacher Model | Student Recons. | Training Architecture | Loss Function
GANCompression_whyothersFail | pretrained | pruning | 2G + Teacher D | per-pixel, per-class
GAN_COMPRESSION_DISTILLATION | pretrained | pruning | 2G + Teacher D | per-pixel, adversarial loss
KD_GAN_COMPRESSION_NAS | pretrained | NAS | 2G + Teacher D | per-pixel, adversarial loss, intermediate loss
KD_GAN_fu2020autogan | pretrained | NAS (DART) | 2G + Teacher D | per-pixel, adversarial loss, intermediate loss
KD_GAN_CATjin2021teachers | super-large | pruning | 2G + Teacher D | perceptual loss, adversarial loss, intermediate loss, cycle-loss imgtoimg_zhu2017unpaired
KD_GAN_chen2020distilling | pretrained | pruning | 2G + 2D | per-pixel, perceptual loss, adversarial loss, triplet-loss
KD_GAN_liu2021content | pretrained | pruning | 2G + D + Segmentation Network | per-pixel, perceptual loss, intermediate loss, segmentation-loss
GAN_Slimming | pretrained | pruning | 2G + Teacher D | per-pixel, adversarial loss, normalization-loss liu2017learning
Table 4: Summary of GAN Distillation works

Table 4 summarizes the above-mentioned works and how their four main components are selected, while fig. 14 shows the performance of these techniques with respect to each other and to a sample of the pruning techniques as well. For a fair comparison, we only used techniques evaluated with CycleGAN on the horse2zebra dataset. The axis labels a, b, c, d, e, f in the figure represent the works KD_GAN_COMPRESSION_NAS, KD_GAN_fu2020autogan, KD_GAN_CATjin2021teachers, KD_GAN_chen2020distilling, GAN_Slimming, and prune_GAN_shu2019co respectively. KD_GAN_chen2020distilling did not report the FID score in the paper, which is why it is omitted from the graph. In KD_GAN_COMPRESSION_NAS and KD_GAN_CATjin2021teachers, the teacher model's FID score was better than the others, which justifies why they have much better FID scores: they had a better teacher. However, Wang et al. in GAN_Slimming managed to score very close to them despite starting from a weaker teacher, although their compression ratio is one of the lowest. It is worth noting that the pruning technique of prune_GAN_shu2019co has the worst compression ratio and the worst FID score, which is consistent with our conclusion in the pruning section that pruning alone is not enough.

Figure 14: Comparison between different distillation and pruning techniques using the CycleGAN model and the horse2zebra dataset. (a), (b), (c), (d), (e), (f) represent the works in KD_GAN_COMPRESSION_NAS, KD_GAN_fu2020autogan, KD_GAN_CATjin2021teachers, KD_GAN_chen2020distilling, GAN_Slimming, and prune_GAN_shu2019co respectively. (f) prune_GAN_shu2019co is a pruning technique, while all the others are distillation. For (e), we only report the distillation results and omit the quantization effect for a fair comparison. Works (a) and (c) had a stronger teacher than the rest. For (c), we estimated the compression ratio as the ratio between the numbers of MAC operations.

2.1.3 Lowering numeric precision

As the name states, lowering numeric precision means using fewer bits to represent a number; it is not unique to machine learning. As the precision increases, accuracy and numerical stability increase. On the other hand, as the precision decreases, the speed, memory footprint, and hardware area improve, as illustrated in fig. 15. Despite that, the relation is not linear, and it differs according to the application.

As mentioned earlier, GAN generators are more sensitive to precision due to the resolution of the output. Thus, in this section, we show the impact of such optimization on generators and explore different numeric formats. Changing numeric precision can be done using standard formats like single precision (float32), half precision (float16), or fixed-point representation, which can take many forms depending on the position of the fixed point. These standard formats have broader hardware adoption since they are "standard". On the contrary, non-standard formats use out-of-the-box ideas like Bfloat (although Bfloat is not IEEE-standardized, Google TPU support makes it commonly used and supported by different optimized deep-learning libraries), Flexpoint, etc. These out-of-the-box formats need dedicated hardware support to be used.

Figure 15: Impact of precision on performance metrics

Lowering numeric precision to a fixed-point integer is usually called quantization in the literature whenever the quantization follows the affine mapping in eq. 9. The advantage of using an affine transformation is that multiplication and addition can be carried out without the need to revert the mapping gogo_quantization. Other types of quantization that require reverting the conversion before calculations are considered a type of encoding.

(9)
$$r = S\,(q - Z)$$

where $r$ is the real (full-precision) value, $q$ is the quantized integer, $S$ is the scale factor, and $Z$ is the zero-point.
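The sketch below shows a uniform affine quantizer/dequantizer following eq. 9 (our own illustration; it assumes the tensor has a non-zero range):

```python
import numpy as np

def affine_quantize(x, num_bits=8):
    """Uniform affine quantization following eq. 9 (r = S * (q - Z)).
    Indices are stored as uint8, which is sufficient for num_bits <= 8."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)             # S
    zero_point = int(round(qmin - float(x.min()) / scale))  # Z
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def affine_dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

# Usage: q, S, Z = affine_quantize(np.random.randn(4, 4).astype(np.float32))
#        x_hat = affine_dequantize(q, S, Z)
```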

To perform quantization, we need to make the decisions shown in fig. 16. The first decision is the general numeric format: whether the number format is float-like or integer-like. A float-like format is a (sign, exponent, mantissa) format, while an integer-like format, such as fixed point, requires an (integer, scale factor, zero-point) representation.

Figure 16: Quantization Decisions

The second decision is the quantization function, which determines how the full-precision number is converted to the quantized one, as shown in eq. 9. The quantization could be uniform, or it could give higher priority to a certain range depending on the data distribution, using a log or tanh function.

The third to fifth decisions concern which part of the network to quantize. The third is the granularity: whether the same quantization should be applied to all layers or each layer should have its own. The quantization could be per tensor, per channel, per layer, or per network (G having a different quantization than D). The fourth decision is whether to quantize weights, activations, or both; if only one type of data is quantized, then the saving is in storage only, not in computation, as the data must be cast back to the larger of the two formats. The fifth is whether to quantize both networks or the generator only.

The last decision is when to apply the quantization. Post-training quantization requires finetuning, while quantization during training, also called "training-aware quantization", does not need any additional finetuning.

Wang introduced a method to quantize GANs called QGAN QGAN. This study showed that normal quantization methods (uniform, log, tanh) are not sufficient for a stable and convergent GAN under very low-bit quantization. Moreover, the generator and the discriminator have different sensitivities to quantization. Although only the generator is eventually needed, finetuning it requires a balanced discriminator; thus, quantizing both the discriminator and the generator leads to convergent finetuning. This led to proposing multi-precision quantization for the generator and the discriminator separately. The EM algorithm is then used to find the optimal parameters of eq. 9, and the quantized weight is calculated from eq. 10; the EM algorithm tries to minimize the mean square error between the non-quantized weight and the quantized one.

(10)
$$\hat{w} = S\,(q_w - Z), \qquad q_w = \mathrm{round}\big(\tfrac{w}{S}\big) + Z$$

where $S$ and $Z$ are the same parameters from eq. 9, $\hat{w}$ is the quantized weight, and $q_w$ is the integer part of the fixed-point number that will be used in the calculation.

QGAN has been applied to several GANs like DCGAN DCGAN, WGAN WGAN-GP, and LSGAN LSGAN. With quantization to 1-4 bits, QGAN achieved compression ratios from 8x up to 32x with a small loss in inception score.

A study by Deng et al., using a PatchGAN-like generator to reconstruct face images, showed that as the number of bits decreases, the Peak Signal-to-Noise Ratio (PSNR) gets worse while the memory footprint improves, which is not surprising QMGAN. In their study of Quantized GAN for Mobiles (QMGAN), they found that the 32-bit representation (single precision) has a 2.5x better PSNR than the 1-bit quantized (binary) network, while the 1-bit network has a 35x smaller memory size than single precision. They tried different quantization levels; it is worth noting that the PSNR of 32 bits is almost the same as the PSNR of 6 bits, while the 6-bit version has around a 5.4x smaller memory size.

Haotao et al. continued their work on GAN Slimming GAN_Slimming by applying both quantization and knowledge distillation. They used uniform quantization on both activations and weights of the generator model, and unified the quantization across all layers so that it would be hardware friendly. They achieved 4x to 8x compression ratios on style-transfer problems with very competitive results.

In ApGAN_COM_PIM_DF_MEM, the authors of ApGAN used memory compression on a ReRAM accelerator (a ReRAM accelerator performs processing in memory using an analog crossbar circuit). They quantized all weights and activations to 1 bit (the sign bit), which simplified MACs to ANDing followed by ORing. This technique decreases the weight storage size; however, it comes at a cost in accuracy and speed of convergence. An experiment in ApGAN_COM_PIM_DF_MEM compared a fully binarized DCGAN to a full-precision one: after 20 epochs, the loss of the binarized version was 3x the loss of the full-precision version. For that reason, instead of binarizing all layers, they used variable per-layer quantization based on a data redundancy measure. The data redundancy measure of layer $l$ is defined as a function of $C_l$, the number of channels of the layer input, and $H_l$ and $W_l$, the input height and width of the layer, respectively. A negative redundancy measure indicates high sensitivity to quantization error; thus, it is not recommended to quantize such layers. Other layers, with a high redundancy measure, are binarized by taking the sign bit of the weight, with the average weight used as the scale factor. Multiplication computations then turn into a sign-manipulation operation followed by scaling.
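A minimal sketch of this sign-bit binarization with a per-layer scale factor could look as follows (we use the mean absolute weight as the scale, which is our reading of "average weight"):

```python
import torch

def binarize_layer_weights(w: torch.Tensor):
    """Sketch of sign-bit binarization with a per-layer scale factor, so that
    w is approximated by alpha * sign(w)."""
    alpha = w.abs().mean()      # per-layer scale factor (mean absolute weight)
    w_bin = torch.sign(w)       # +1 / -1 sign bits (exact zeros map to 0, negligible in practice)
    return w_bin, alpha

# Usage: w_bin, alpha = binarize_layer_weights(torch.randn(64, 32, 3, 3))
#        w_approx = alpha * w_bin   # dequantized approximation used for the simplified MACs
```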

In PIMTGAN, Rakin et al. proposed TGAN, a GAN that ternarizes weights to {-1, 0, 1}. The ternarization depends on the sign of the weight, like ApGAN. They applied the quantization scheme to both the generator and the discriminator in the forward path and used the backward path to update the scale factor. On average, they achieved an IS comparable to that of the full-precision network.

A work by Köster et al. proposed using a non-standard numeric format called Flexpoint FlexPoint. The format uses one shared exponent for each tensor; thus, tensor operations are handled as if they were fixed-point operations, and only an extra circuit is needed to manage the exponent, which is faster than floating-point operations with a different exponent for each number. The Flexpoint format stores a tensor as a 16-bit mantissa for each element plus one shared 5-bit exponent for the whole tensor. In contrast to floating point, the exponent is shared across tensor elements, and unlike fixed point, the exponent is updated automatically every time a tensor is written. For several networks implemented with the Flexpoint format, the FID score was comparable to the same networks implemented in float32 and better than those implemented with fixed point or float16. Flexpoint has the advantage of supporting training and all floating-point techniques with even faster calculation.
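The following rough sketch mimics the idea of a shared per-tensor exponent with 16-bit integer mantissas; it is a simplification and does not implement Flexpoint's automatic exponent management:

```python
import numpy as np

def to_flex_like(x, mantissa_bits=16):
    """Rough sketch of a Flexpoint-like tensor: int16 mantissas plus one shared
    power-of-two exponent chosen so that the largest magnitude fits."""
    max_mag = float(np.abs(x).max())
    if max_mag == 0.0:
        return np.zeros(x.shape, dtype=np.int16), 0
    exp = int(np.floor(np.log2(max_mag))) + 1 - (mantissa_bits - 1)   # shared exponent
    lo, hi = -(2 ** (mantissa_bits - 1)), 2 ** (mantissa_bits - 1) - 1
    mantissa = np.clip(np.round(x / 2.0 ** exp), lo, hi).astype(np.int16)
    return mantissa, exp

def from_flex_like(mantissa, exp):
    return mantissa.astype(np.float32) * 2.0 ** exp

# Usage: m, e = to_flex_like(np.random.randn(8, 8).astype(np.float32))
#        x_hat = from_flex_like(m, e)
```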

Work \ Decision | Rep. | Func. | Granularity | Target | Module | App.
QGAN QGAN | Int | EM | Network | W | G & D | Post
QMGAN QMGAN | Int | Uni | Network | W | G | Post
ApGAN ApGAN_COM_PIM_DF_MEM | Int | Uni | Layer | W | G & D | During
TGAN PIMTGAN | Int | Uni | Layer | W | G & D | During
Flexpoint FlexPoint | Float | Cust. | Tensor | A & W | G & D | During
GANslim GAN_Slimming | Int | Uni | Network | A & W | G | Post
Int = Integer-like, Uni = Uniform, Cust. = Customized (customized accelerator), A = Activations, W = Weights
Table 5: Summary of Quantization works

Table 5 summarizes the design decisions taken by the different works. Because each work uses a different network, it is hard to make a fair comparison using metrics like IS or FID. Fig. 17 shows the number of bits that each quantization method reported for its best results.

Figure 17: Number of quantized bits that resulted in the best performance for the GAN under test; each technique is tested on a different GAN. The works QGAN QGAN, QMGAN QMGAN, GANslim GAN_Slimming, Flexpoint FlexPoint, and TGAN PIMTGAN are fully quantized, while ApGAN ApGAN_COM_PIM_DF_MEM has mixed quantized and non-quantized layers. Also, Flexpoint, ApGAN, and TGAN apply quantization during training.

2.1.4 Encoding

Encoding is a form of lossy compression used to minimize data transfer by using fewer bits. The data transferred to the chip is not the real data, but an index or key used to get the real data from a preloaded codebook or through a predefined hash function.

This technique has been used extensively in deep learning, as in CNN_compression_Clustering; CNN_compression_resharing; CNN_compression_hashing; CNN_compression_hashingBlock; CNN_compression_structuredhashing. However, to the best of our knowledge, no work has applied it to GANs, and it is one of the opportunities to pursue.
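To illustrate what such an encoding could look like, the sketch below clusters weights with k-means and stores only small indices plus a codebook (our own example using scikit-learn, not taken from any GAN work):

```python
import numpy as np
from sklearn.cluster import KMeans

def codebook_encode(weights, num_clusters=16):
    """Sketch of codebook (cluster-based) encoding: each weight is replaced by a
    small index into a shared codebook of centroids, so only the indices (4 bits
    each for 16 clusters) and the codebook need to be stored or transferred."""
    flat = weights.reshape(-1, 1)
    km = KMeans(n_clusters=num_clusters, n_init=10).fit(flat)
    codebook = km.cluster_centers_.ravel()      # num_clusters representative values
    indices = km.labels_.astype(np.uint8)       # one small index per weight
    return indices, codebook

def codebook_decode(indices, codebook, shape):
    return codebook[indices].reshape(shape)

# Usage: idx, cb = codebook_encode(np.random.randn(128, 64).astype(np.float32))
#        w_hat = codebook_decode(idx, cb, (128, 64))
```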

3 Related Work

Several survey papers exist on deep learning compression, such as RELATED_NOGAN_cheng2017survey; RELATED_NOGAN_cheng2018model; RELATED_NOGAN_Cheng2018RecentAI; RELATED_NOGAN_choudhary2020comprehensive. However, those papers discuss general deep learning algorithms and do not target issues specific to GANs. While some techniques mentioned in previous surveys might apply to GANs, GANs pose additional challenges and opportunities. First, generative networks are sensitive to number representation; in other words, not all quantization techniques used for general DL models are efficient for GANs. Second, GAN training suffers from instability, which makes normal compression and pruning techniques inefficient. Finally, GAN output is large and correlated, as opposed to normal classification or regression problems whose outputs are very small.

Another related work is that of Gou et al. RELATED_NOGAN_distill_usedGAN_gou2021knowledge. Although this work "uses" GANs to perform distillation, it does not consider GANs themselves for compression. The work by Wang and Yoon RELATED_distill_usedGAN_Wang2021KnowledgeDA briefly mentions the impact of distillation on image-translation tasks using GANs, but its focus is on using GANs to distill other models.

4 Conclusion & Future work

In this paper, we surveyed the lossy compression techniques used to optimize GANs. As mentioned earlier, GANs differ from other DL models because of the instability of their training and the resolution of their output.

There is no doubt that minimizing the memory footprint enhances the storage, speed, and power efficiency of running GANs. But some areas, like encoding, are not explored at all in the GAN domain. As seen from Tables 3, 4, and 5, many combinations can still be explored, such as using similarity pruning with knowledge distillation. Also, combining the losses introduced in knowledge distillation with quantization needs more exploration.

A challenge that exists with all the optimization techniques mentioned is how to unify measures (both qualitative and quantitative) and benchmarks across methods. With no unified measures, benchmarks, and platforms, we can hardly evaluate techniques against each other. Another opportunity lies in filling the gaps and mixing different optimization techniques.

Future work for this survey includes: 1) unifying the design metrics between different designs and providing an evaluation study using several datasets, 2) exploring computation optimization, 3) exploring dataflow optimization, 4) studying systems that combine optimization techniques, and 5) studying the impact of optimization on different platforms (FPGA, ASIC, ReRAM, GPU, etc.); while compression is very plausible, some techniques are not hardware friendly even though they give a high compression ratio with low accuracy loss. An opportunity exists in enhancing indexing methods in accelerators or building a cache-like system to support clustering- or hashing-based quantization. Support for processing compressed or quantized elements without decompressing them should also be considered.

References