Recently, Generative Adversarial Networks (GANs) have achieved prominent results in diverse visual applications, such as image synthesis [39, 40, 34, 64, 4] and image-to-image translation [23, 66, 10, 26, 11, 47]. Despite this progress, most recent successes [23, 66, 47, 64, 10, 4] come with huge resource demands. Models with such high computational costs are hard to popularize, and the cost becomes a critical bottleneck when they are deployed on resource-constrained mobile phones or other lightweight IoT devices [21, 50, 30, 9]. To alleviate these costs, GAN compression has emerged as a new and crucial task. Many mainstream model compression techniques [29, 28, 57, 38, 56, 33] have been employed to learn efficient GANs, including knowledge distillation [30, 1, 49, 8, 13, 15, 31, 20, 9, 24], channel pruning [30, 31, 49] and neural architecture search [15, 30, 31].
However, the above compression algorithms suffer from three main issues. Firstly, they tend to directly apply mature model compression techniques [7, 62, 19], which are not customized for GANs and do not explore the complex characteristics and structures of GANs. Secondly, they usually formulate GAN compression as a multi-stage task. For example, one existing pipeline needs to pre-train, distill, evolve, and fine-tune sequentially. The distillation-based methods [30, 31, 13, 9, 8, 49, 24] must pre-train a teacher generator and then distill the student. An end-to-end approach is essential to reduce the time and computational resources consumed by the multi-stage setting. Thirdly, the current state-of-the-art methods are still burdened with high computational costs. For instance, the best existing model requires 3G MACs, which is relatively high for deployment on lightweight edge devices.
To overcome the above issues, we propose a novel Online Multi-Granularity Distillation (OMGD) framework for learning efficient GANs. We abandon the complex multi-stage compression process and design a GAN-oriented online distillation strategy to obtain the compressed model in one step. We mine complementary information from multiple levels and granularities to assist the optimization of compressed models. These concepts can be regarded as auxiliary supervision cues, which are critical for breaking through the capacity bottleneck of models with low computational costs. The contributions of OMGD can be summarized as follows:
To the best of our knowledge, we make the first attempt to bring distillation into an online scheme in the field of GAN compression and optimize the student generator in a discriminator-free and ground-truth-free setting. This scheme trains the teacher and student alternately, promoting the two generators iteratively and progressively. The progressively optimized teacher generator helps to warm up the student and guide the optimization direction step by step.
We further extend the online distillation strategy into a multi-granularity scheme from two perspectives. On the one hand, we employ teacher generators with different structures to capture more complementary cues and enhance visual fidelity along more diverse dimensions. On the other hand, besides the concepts of the output layer, we also transfer channel-wise granularity information from intermediate layers as additional supervisory signals.
Extensive experiments on four benchmark datasets (horse→zebra, summer→winter, edges→shoes and cityscapes) demonstrate that OMGD can reduce the computation of two essential conditional GAN models, Pix2Pix and CycleGAN, by 40× in terms of MACs without loss of visual fidelity in the generated images. This reveals that OMGD is efficient and robust across various benchmark datasets, diverse conditional GANs, network architectures and problem settings (paired or unpaired). Compared with existing competitive approaches, OMGD obtains better image quality with much lower computational costs (see Figures 1 and 2). Furthermore, OMGD 0.5× (requiring only 0.333G MACs) achieves impressive results, which provides a feasible solution for deployment on resource-constrained devices and even breaks the barriers to real-time image translation on mobile devices.
2 Related Work
2.1 GANs and GAN Compression
Generative Adversarial Networks (GANs) have obtained impressive results on a series of computer vision tasks, such as image-to-image translation [66, 23, 10, 26, 11, 47, 35], image generation [39, 40, 34, 64, 4, 5] and image inpainting [36, 58, 48, 63]. Specifically, Pix2Pix conducted a min-max game between a generator and a discriminator and employed paired training data for image-to-image translation. CycleGAN further extended the capacity of GANs to image translation in a weakly-supervised setting, where no paired data are leveraged in the training stage. Although various GAN methods have achieved impressive successes recently, they tend to consume a growing amount of memory and computation [10, 35, 4] to support their powerful performance, which conflicts with deployment on resource-constrained devices.
Recently, GAN-oriented compression has become an essential task due to its potential applications in lightweight device deployment. Shu et al. presented the first preliminary study, introducing a co-evolutionary algorithm that removes redundant filters to compress CycleGAN. Fu et al. employed AutoML approaches and searched for an efficient generator under the guidance of target computational resource constraints. Wang et al. proposed a unified GAN compression optimization framework including model distillation, channel pruning and quantization. Li et al. designed a “once-for-all” generator which decouples model training and architecture search via weight sharing. Li et al. proposed a differentiable mask and a co-attention distillation algorithm to learn effective GANs. Jin et al. proposed a one-step pruning algorithm to search for a student model from the teacher model. In this work, we design an Online Multi-Granularity Distillation (OMGD) scheme. By introducing multi-granularity knowledge guidance, the student generator is enhanced by the complementary concepts from diverse teachers and layers, which intrinsically improves the capacity of the compressed model.
2.2 Knowledge Distillation
Knowledge Distillation (KD) is a fundamental compression technique, where a smaller student model is optimized under the effective information transfer and supervision of a larger teacher model or ensemble. Hinton et al. performed knowledge distillation by minimizing the distance between the output distributions of the student and teacher networks. In this way, the student network attempts to learn dark knowledge that contains the similarities between different classes, which cannot be provided by the ground truth labels. Romero et al. further took advantage of feature maps from the intermediate layers to enhance the performance of the student network. Zhou et al. showed that each channel of a feature map corresponds to a visual pattern, so they focus on transferring the attention concepts [53, 54, 55] of each feature-map channel in intermediate layers.
Moreover, You et al. revealed that multiple teacher networks can provide more comprehensive knowledge for learning a more effective student network. MEAL compressed large and complex trained ensembles into a single network, employing an adversarial learning strategy to guide the pre-defined student network to transfer knowledge from the teacher models. Offline knowledge distillation requires a pre-trained teacher model during optimization, while online KD simultaneously optimizes the teacher and student networks or just a group of student peers. Anil et al. trained two networks with identical architectures in parallel, and these two networks play the roles of student and teacher iteratively. In this paper, we employ a multi-granularity online distillation scheme, which aims to learn an effective student model from the complementary structures of the teacher generators and the knowledge from diverse layers.
In this section, we first introduce the proposed online GAN distillation framework, where the student generator is not bound to the discriminator and learns concepts directly from the teacher models. The multi-granularity distillation scheme is presented in Section 3.2. The multi-granularity concepts are captured via complementary teacher generators and knowledge from diverse layers. We illustrate the whole pipeline of the OMGD framework in Figure 3.
3.1 Online GAN Distillation
Recently, a series of distillation-based GAN compression methods [30, 31, 13, 9, 8, 49] employ an offline distillation scheme that leverages a pre-trained teacher generator to optimize the student generator. In this paper, we propose a GAN-oriented online distillation algorithm to address three critical issues of offline distillation. Firstly, the student generator in conventional offline distillation must maintain a certain capacity to keep a dynamic balance with the discriminator and avoid mode collapse [37, 43] and vanishing gradients. In contrast, our student generator is no longer tightly bound to the discriminator, so it can be trained more flexibly and compressed further. Secondly, a pre-trained teacher generator fails to guide the student on how to learn information progressively and easily causes over-fitting in the training stage [17, 27]. In contrast, our teacher generator helps to warm up the student generator and guides the direction of optimization step by step. Thirdly, it is not easy to select a suitable pre-trained teacher generator because the evaluation metrics are subjective. Our method does not require a pre-trained model, so this selection problem disappears.
Teacher Generator. We follow the loss functions and training settings in [23, 66] to train the teacher generator and the discriminator. The teacher generator aims to learn a function that maps data from the source domain to a target domain. Taking Pix2Pix as an example, it uses paired data, each pair consisting of a source-domain image and its target-domain counterpart, to optimize the networks. The generator is trained to map each source image to its paired target image, while the discriminator is trained to distinguish the fake images produced by the generator from the real images. The objective is formalized as:
Moreover, a reconstruction loss is introduced to push the output of the teacher generator close to the ground truth:
The whole objective in the GAN setting is defined as:
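As a sketch of the three objectives named above, the standard Pix2Pix formulation can be written as follows. The notation is assumed here for illustration: $G_T$ is the teacher generator, $D$ the discriminator, $(x, y)$ a paired source and target image, and $\lambda$ a hypothetical weight that balances the reconstruction term (taken as the L1 distance used by Pix2Pix).

```latex
\begin{aligned}
\mathcal{L}_{GAN}(G_T, D) &= \mathbb{E}_{x,y}\big[\log D(x, y)\big]
  + \mathbb{E}_{x}\big[\log\big(1 - D(x, G_T(x))\big)\big] \\
\mathcal{L}_{Recon}(G_T) &= \mathbb{E}_{x,y}\big[\lVert y - G_T(x) \rVert_1\big] \\
G_T^{*} &= \arg\min_{G_T}\max_{D}\; \mathcal{L}_{GAN}(G_T, D) + \lambda\,\mathcal{L}_{Recon}(G_T)
\end{aligned}
```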
Student Generator. In the proposed GAN-oriented online distillation scheme, the student generator leverages only the teacher network for optimization and can be trained in a discriminator-free setting. Its optimization does not require the ground-truth labels either. Namely, the student merely learns the output of a larger-capacity generator with a similar structure, which greatly reduces the difficulty of fitting the ground truth directly. Specifically, we back-propagate the distillation loss between the teacher and the student in every iteration step. In this way, the student can mimic the training process of the teacher and learn progressively.
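The alternating update can be illustrated with a deliberately tiny toy, which is a sketch under stated assumptions rather than the paper's training code. Both "generators" are scalar linear models, a mean squared error stands in for the teacher's GAN objective and for the distillation loss, and the function name `online_distillation` is hypothetical. The point is only the data flow: the teacher sees the ground truth, while the student sees nothing but the teacher's current output at every step.

```python
import numpy as np

def online_distillation(x, y, steps=500, lr=0.1):
    """Toy online distillation with scalar linear 'generators'.

    The teacher weight w_t is fitted to the ground truth y. The student
    weight w_s only ever sees the teacher's current output, never y and
    never a discriminator. MSE is an illustrative stand-in loss.
    """
    w_t, w_s = 0.0, 0.0
    for _ in range(steps):
        # Teacher step: supervised loss against the ground truth.
        grad_t = np.mean(2.0 * (w_t * x - y) * x)
        w_t -= lr * grad_t
        # Student step: distillation loss against the teacher output,
        # which is treated as a fixed target for this step.
        teacher_out = w_t * x
        grad_s = np.mean(2.0 * (w_s * x - teacher_out) * x)
        w_s -= lr * grad_s
    return w_t, w_s

rng = np.random.default_rng(0)
x = rng.normal(size=256)
y = 3.0 * x                      # hypothetical ground-truth mapping
w_t, w_s = online_distillation(x, y)
```

Because the student target moves together with the teacher, the student follows the teacher's trajectory from the first iteration instead of starting from a finished, fixed teacher.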
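Denote the outputs of the teacher and student generators as $p_t$ and $p_s$. We use the Structural Similarity (SSIM) loss and perceptual losses to measure the difference between $p_t$ and $p_s$. The SSIM loss is sensitive to local structural changes, similar to the human visual system (HVS). Given $p_t$ and $p_s$, the SSIM loss calculates the similarity of the two images by:
$$\mathrm{SSIM}(p_t, p_s) = \frac{(2\mu_t\mu_s + C_1)(2\sigma_{ts} + C_2)}{(\mu_t^2 + \mu_s^2 + C_1)(\sigma_t^2 + \sigma_s^2 + C_2)}$$
where $\mu_t, \mu_s$ are mean values for luminance estimation, $\sigma_t, \sigma_s$ are standard deviations for contrast, $\sigma_{ts}$ is the covariance for structural similarity estimation, and $C_1, C_2$ are constants that avoid a zero denominator.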
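A minimal numerical sketch of the SSIM computation follows. It makes two simplifying assumptions that are not taken from the paper: the statistics are computed once over the whole image instead of over local sliding windows, and the constants use the common defaults $C_1 = (0.01L)^2$ and $C_2 = (0.03L)^2$ for dynamic range $L$. The names `global_ssim` and `ssim_loss` are hypothetical, and turning the similarity into a loss as one minus SSIM is one common choice.

```python
import numpy as np

def global_ssim(a, b, data_range=1.0):
    """SSIM from whole-image statistics (no sliding window)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    c1 = (0.01 * data_range) ** 2        # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2        # stabilizes the contrast term
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov_ab = ((a - mu_a) * (b - mu_b)).mean()
    num = (2.0 * mu_a * mu_b + c1) * (2.0 * cov_ab + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den

def ssim_loss(a, b, data_range=1.0):
    """Loss that is zero for identical images and grows as they diverge."""
    return 1.0 - global_ssim(a, b, data_range)
```

Windowed implementations average this quantity over many local patches, which is what makes the measure sensitive to local structural changes.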
The perceptual loss consists of a feature reconstruction loss and a style reconstruction loss. The feature reconstruction loss encourages the teacher and student outputs to have similar feature representations, which are measured by a pre-trained VGG network. It is formalized as:
Here the loss compares the activations of a given layer of the VGG network for the two inputs and is normalized by the dimensions of that activation.
The style reconstruction loss is introduced to penalize differences in style characteristics, such as colors, textures and common patterns. It can be calculated as:
Here the loss compares the Gram matrices of the activations of a given layer in the VGG network.
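The two perceptual terms can be sketched on plain arrays. This is an illustration under assumptions: the inputs stand in for activations of one VGG layer with shape (C, H, W), the squared Euclidean distance and the Gram normalization follow the perceptual-loss formulation of [25] rather than anything confirmed by this section, and the function names are hypothetical.

```python
import numpy as np

def feature_loss(feat_t, feat_s):
    """Squared L2 distance between activations, normalized by C*H*W."""
    return float(np.sum((feat_t - feat_s) ** 2) / feat_t.size)

def gram_matrix(feat):
    """Channel-by-channel correlation of a (C, H, W) activation."""
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)

def style_loss(feat_t, feat_s):
    """Squared Frobenius norm of the difference between Gram matrices."""
    diff = gram_matrix(feat_t) - gram_matrix(feat_s)
    return float(np.sum(diff ** 2))
```

The Gram matrix discards spatial layout and keeps only which channels fire together, which is why it captures texture and color statistics rather than content.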
3.2 Multi Granularity Distillation Scheme
Based on the novel online GAN distillation technique, we further extend our method into a multi-granularity scheme from two perspectives: the complementary structures of the teacher generators, and the knowledge from diverse layers. The whole pipeline of the online multi-granularity distillation (OMGD) framework is depicted in Figure 3. We use a wider teacher generator and a deeper teacher generator to formulate a multi-objective optimization task for the student generator. In addition to the output layer of the teacher generators, we also mine knowledge concepts from the intermediate layers via a channel distillation loss.
Multiple Teachers Distillation. Teacher generators with different structures help to capture more complementary image cues from the ground truth labels and enhance image translation performance from different perspectives. Moreover, the multiple-teacher distillation setting further relieves over-fitting. We expand the student model into teacher models along two complementary dimensions, i.e., depth and width. Given a student generator, we expand its channels to obtain a wider teacher generator. Specifically, the channel number of each convolution layer is multiplied by a channel expand factor. On the other hand, we insert several ResNet blocks after every downsampling and upsampling layer of the student generator to build a deeper teacher generator, which has a capacity comparable to that of the wider teacher.
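The two expansion rules can be sketched as operations on a configuration. Everything concrete here is hypothetical: the student channel widths, the layer names, the number of inserted blocks and the function names are illustrative only. The expand factor of 4 matches the value reported in the implementation details.

```python
def widen_channels(student_channels, expand_factor=4):
    """Widths of the wider teacher: each student width times the factor."""
    return [c * expand_factor for c in student_channels]

def deepen_layers(student_layers, blocks_per_stage=2):
    """Layer list of the deeper teacher: residual blocks are inserted
    after every downsample or upsample layer of the student."""
    deeper = []
    for name in student_layers:
        deeper.append(name)
        if name in ("downsample", "upsample"):
            deeper.extend(["resblock"] * blocks_per_stage)
    return deeper

# Hypothetical student configuration, for illustration only.
student_channels = [16, 32, 64]
student_layers = ["conv", "downsample", "downsample",
                  "upsample", "upsample", "conv"]

wide_channels = widen_channels(student_channels)
deep_layers = deepen_layers(student_layers)
```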
As illustrated in Figure 3, a partially shared discriminator is designed to share the first several layers and then split into two branches that produce the discriminator outputs for the wider and deeper teacher generators, respectively. This shared design not only offers high flexibility for the discriminators but also leverages the similar characteristics of the input image to improve the training of the generators. We directly combine the two distillation losses provided by the complementary teacher generators as the KD loss in the multiple-teacher setting:
where the two teacher terms are the activations of the output layers of the wider and deeper teacher generators, respectively.
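The combination described above amounts to an unweighted sum of two per-teacher distillation terms. In this sketch the per-teacher term `kd_loss` is a stand-in (mean absolute error) chosen only to keep the example self-contained; it is not the paper's distillation loss, and both function names are hypothetical.

```python
import numpy as np

def kd_loss(teacher_out, student_out):
    """Stand-in distillation term: mean absolute error (an assumption)."""
    return float(np.mean(np.abs(teacher_out - student_out)))

def multi_teacher_kd_loss(wide_out, deep_out, student_out):
    """Direct sum of the losses against the wider and deeper teachers."""
    return kd_loss(wide_out, student_out) + kd_loss(deep_out, student_out)
```

With a direct sum, the student is pulled toward both teacher outputs at once, so disagreements between the two teachers act as a mild regularizer.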
Intermediate Layers Distillation. The concepts of the output layer fail to take into account the intermediate details of the teacher network, so we further transfer channel-wise granularity information as an additional supervisory signal for the student generator. Specifically, we compute the channel-wise attention weight [22, 65] to measure the importance of each channel in a feature map. The attention weight is defined as:
Here the weight is computed per channel of the feature map. A convolution layer is then appended to the intermediate layers of the student generator to expand its number of channels to match the teacher, and the channel distillation (CD) loss is calculated as:
where the quantities involved are the number of sampled feature maps, the channel number of the feature maps, and the attention weight of each channel of each sampled feature map.
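A sketch of channel distillation on plain arrays follows. It rests on assumptions that this section does not confirm: the attention weight is taken to be the spatial mean of each channel (the squeeze step of [22]), the loss is the squared difference of those weights averaged over all channels of all sampled feature maps, and the channel-expanding convolution is modeled as a 1x1 convolution, i.e. a matrix applied at every pixel. All names are hypothetical.

```python
import numpy as np

def channel_attention(feat):
    """Per-channel weight: spatial mean of a (C, H, W) feature map."""
    return feat.mean(axis=(1, 2))

def expand_channels(feat_s, proj):
    """1x1 convolution as a matrix product: (C_t, C_s) applied to (C_s, H, W)."""
    return np.einsum("tc,chw->thw", proj, feat_s)

def channel_distillation_loss(teacher_feats, student_feats, projs):
    """Mean squared gap between teacher and student channel weights,
    averaged over every channel of every sampled feature map."""
    total, count = 0.0, 0
    for f_t, f_s, proj in zip(teacher_feats, student_feats, projs):
        w_t = channel_attention(f_t)
        w_s = channel_attention(expand_channels(f_s, proj))
        total += float(np.sum((w_t - w_s) ** 2))
        count += w_t.size
    return total / count
```

Matching only one scalar per channel is much cheaper than matching full feature maps, and it tolerates spatial differences between the teacher and the student.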
To sum up, the whole online multi-granularity distillation objective is formalized as:
4.1 Experimental Settings
|Dataset||Generator Style||Method||MACs||#Parameters||FID (↓)|
|edges→shoes||Res-Net||Original||56.80G (1.0×)||11.30M (1.0×)||24.18|
|edges→shoes||Res-Net||GAN-Compression||4.81G (11.8×)||0.70M (16.3×)||26.60|
|edges→shoes||Res-Net||DMAD||4.30G (13.2×)||0.54M (20.9×)||24.08|
|edges→shoes||Res-Net||OMGD 1.0×||1.408G (40.3×)||0.137M (82.5×)||25.88|
|edges→shoes||Res-Net||OMGD 1.5×||2.904G (19.6×)||0.296M (38.2×)||21.41|
|edges→shoes||U-Net||Original||18.60G (1.0×)||54.40M (1.0×)||34.31|
|edges→shoes||U-Net||DMAD||2.99G (6.2×)||2.13M (25.5×)||46.95|
|edges→shoes||U-Net||OMGD 0.5×||0.333G (55.9×)||0.852M (63.8×)||37.34|
|edges→shoes||U-Net||OMGD 0.75×||0.707G (26.3×)||1.916M (28.4×)||32.19|
|edges→shoes||U-Net||OMGD 1.0×||1.219G (15.3×)||3.404M (16.0×)||25.00|
|Dataset||Generator Style||Method||MACs||#Parameters||mIoU (↑)|
|cityscapes||Res-Net||Original||56.80G (1.0×)||11.30M (1.0×)||44.32|
|cityscapes||Res-Net||GAN-Compression||5.66G (10×)||0.71M (15.9×)||40.77|
|cityscapes||Res-Net||DMAD||4.39G (12.9×)||0.55M (20.5×)||41.47|
|cityscapes||Res-Net||CAT||5.57G (10.2×)||-||42.53|
|cityscapes||Res-Net||OMGD 1.0×||1.408G (40.3×)||0.137M (82.5×)||45.21|
|cityscapes||Res-Net||OMGD 1.5×||2.904G (19.6×)||0.296M (38.2×)||45.89|
|cityscapes||U-Net||Original||18.60G (1.0×)||54.40M (1.0×)||42.71|
|cityscapes||U-Net||DMAD||3.96G (4.7×)||1.73M (31.4×)||40.53|
|cityscapes||U-Net||OMGD 0.5×||0.333G (55.9×)||0.852M (63.8×)||41.54|
|cityscapes||U-Net||OMGD 0.75×||0.707G (26.3×)||1.916M (28.4×)||45.52|
|cityscapes||U-Net||OMGD 1.0×||1.219G (15.3×)||3.404M (16.0×)||48.91|
Models. We conduct the experiments on Pix2Pix and CycleGAN. Specifically, we adopt the original U-Net style generator and a Res-Net style generator for the Pix2Pix model. The Res-Net style generator employs depthwise and pointwise convolutions to achieve a better performance-computation trade-off. We use only the Res-Net style generator for the experiments on the CycleGAN model.
Datasets and Evaluation Metrics. We evaluate the Pix2Pix model on the edges→shoes and cityscapes datasets. The CycleGAN model is evaluated on horse→zebra and summer→winter. On cityscapes, we use DRN-D-105 to segment the generated images and calculate the mIoU (mean Intersection over Union) as the evaluation metric. A higher mIoU implies that the generated images are more realistic. We adopt FID (Frechet Inception Distance) to evaluate the images on the other datasets, where a smaller FID means more convincing generation performance.
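For readers unfamiliar with the metric, mIoU can be sketched as below. This covers only the metric itself, computed from two integer label maps; the segmentation of generated images by DRN-D-105 is outside the sketch. The function name and the choice to skip classes absent from both maps are assumptions.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union between two integer label maps."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (target == c))
        union = np.sum((pred == c) | (target == c))
        if union > 0:                 # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```

Because every class contributes equally to the mean, small classes weigh as much as large ones, which makes the metric stricter than plain pixel accuracy.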
Implementation details. The channel expand factor is set to 4 in this paper. We set the initial learning rate to 0.0002 and decay it linearly to zero during the experiments. For the Res-Net style generator, the batch size is set to 4 on edges→shoes and 1 on the other datasets. The batch size is fixed to 4 in all experiments for the U-Net generator. The update intervals on edges→shoes, cityscapes, horse→zebra and summer→winter are 1, 3, 4 and 4, respectively.
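The learning-rate schedule can be sketched as a single function. The text gives only the starting value and the linear decay to zero, so the assumption here is that the decay starts at step zero and spans the whole run; a schedule that holds the rate constant before decaying would need an extra parameter. The function name is hypothetical.

```python
def linear_decay_lr(step, total_steps, base_lr=2e-4):
    """Learning rate that starts at base_lr and reaches zero at total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)   # clamp progress to [0, 1]
    return base_lr * (1.0 - frac)
```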
|Dataset||Method||MACs||#Parameters||FID (↓)|
|horse→zebra||Original||56.80G (1.0×)||11.30M (1.0×)||61.53|
|horse→zebra||Co-Evolution||13.40G (4.2×)||-||96.15|
|horse→zebra||GAN-Slimming||11.25G (23.6×)||-||86.09|
|horse→zebra||Auto-GAN-Distiller||6.39G (8.9×)||-||83.60|
|horse→zebra||GAN-Compression||2.67G (21.3×)||0.34M (33.2×)||64.95|
|horse→zebra||DMAD||2.41G (23.6×)||0.28M (40.0×)||62.96|
|horse→zebra||CAT||2.55G (22.3×)||-||60.18|
|horse→zebra||OMGD (Ours)||1.408G (40.3×)||0.137M (82.5×)||51.92|
|summer→winter||Original||56.80G (1.0×)||11.30M (1.0×)||79.12|
|summer→winter||Co-Evolution||11.10G (5.1×)||-||78.58|
|summer→winter||Auto-GAN-Distiller||4.34G (13.1×)||-||78.33|
|summer→winter||DMAD||3.18G (17.9×)||0.30M (37.7×)||78.24|
|summer→winter||OMGD (Ours)||1.408G (40.3×)||0.137M (82.5×)||73.79|
For CycleGAN, we evaluate the teacher generator at a fixed interval of epochs and keep the best-performing teacher generator so far as the one used to optimize the student generator. In this way, we avoid the notorious instability of CycleGAN training and encourage the student to learn from the best teacher model. This interval is set to 10 and 6 epochs for horse→zebra and summer→winter, respectively.
4.2 Experimental Results
4.2.1 Comparison with state-of-the-art methods
In this section, we compare OMGD with several state-of-the-art methods in terms of computation cost, model size and generation quality. We compare the performance of Pix2Pix and CycleGAN respectively.
Pix2Pix. The experimental results of the Pix2Pix model are shown in Table 1 and can be summarized in the following observations: 1) OMGD is robust for both generator styles and significantly outperforms the state-of-the-art methods with much lower computational costs. 2) OMGD with the Res-Net style generator (dubbed OMGD(R)) achieves performance comparable to the original model when the MACs are compressed by 40.3× and the parameters by 82.5×. Compared with the current best method, i.e., CAT, OMGD(R) 1.0× boosts the mIoU from 42.53 to 45.21 (a 6.3% improvement) with only a quarter of the computational costs on cityscapes. Furthermore, although OMGD(R) 1.5× is compressed by 19.6× in MACs and 38.2× in parameters, it establishes state-of-the-art performance. 3) It is arduous to compress the U-Net style generator due to its U-shaped architecture and concatenation operations. OMGD with the U-Net style generator (dubbed OMGD(U)) compresses the MACs of the original model by 15.3× and reduces the FID by 9.31 on edges→shoes. With less than half of the MACs of DMAD, OMGD(U) 1.0× decreases the FID from 46.95 to 25.00 on edges→shoes and obtains a 19.3% improvement in mIoU on cityscapes. Moreover, OMGD(U) 0.5× and 0.75× also achieve impressive results, and OMGD(U) 0.75× obtains state-of-the-art compression performance with merely 0.707G MACs.
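The compression ratios and the relative mIoU gain quoted for the Res-Net style generator follow directly from the Table 1 entries. The short check below recomputes them; the helper names are hypothetical and the inputs are the table values (MACs in G, parameters in M).

```python
def compression_ratio(original, compressed):
    """How many times smaller the compressed quantity is."""
    return original / compressed

def relative_gain_percent(baseline, value):
    """Relative improvement of value over baseline, in percent."""
    return (value - baseline) / baseline * 100.0

# Res-Net style generator, Table 1 values.
macs_ratio = compression_ratio(56.80, 1.408)      # original vs OMGD 1.0x MACs
param_ratio = compression_ratio(11.30, 0.137)     # original vs OMGD 1.0x parameters
miou_gain = relative_gain_percent(42.53, 45.21)   # CAT vs OMGD(R) 1.0x on cityscapes
cost_vs_cat = 1.408 / 5.57                        # OMGD(R) MACs relative to CAT
```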
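|Dataset (metric)||Method||Score|
|edges→shoes (FID ↓)||Ours w/o OD||26.19|
|edges→shoes (FID ↓)||Ours w/o DT||33.88|
|edges→shoes (FID ↓)||Ours w/o CD||26.62|
|cityscapes (mIoU ↑)||Ours w/o OD||45.76|
|cityscapes (mIoU ↑)||Ours w/o DT||44.04|
|cityscapes (mIoU ↑)||Ours w/o CD||48.12|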
CycleGAN. We follow previous works [30, 13, 24, 15, 49] and use the Res-Net style generator for the experiments on CycleGAN; the results are shown in Table 2. On the one hand, OMGD(R) outperforms the original model by a large margin despite 40.3× MACs compression and 82.5× parameter compression. For example, OMGD(R) reduces the FID from 61.53 to 51.92 on horse→zebra and from 79.12 to 73.79 on summer→winter. On the other hand, OMGD(R) significantly surpasses the competitive methods in terms of both performance (FID) and computational costs (MACs), and establishes new state-of-the-art performance on both datasets.
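|Dataset (metric)||Method||Score|
|horse→zebra (FID ↓)||Ours w/o OD||77.09|
|horse→zebra (FID ↓)||Ours w/o CD||61.21|
|summer→winter (FID ↓)||Ours w/o OD||76.48|
|summer→winter (FID ↓)||Ours w/o CD||75.47|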
4.2.2 Ablation Study
We directly train the student generator with the conventional GAN loss and report its results as the “Baseline” in Table 3 and Table 4. As can be observed, our method surpasses the “Baseline” by a large margin. For example, it reduces the FID from 77.07 to 25.00 on edges→shoes and increases the mIoU from 33.90 to 48.91 on cityscapes. To further demonstrate the effectiveness of the essential components in OMGD, we perform extensive ablation studies. The ablation experiments are conducted on the U-Net style generator for Pix2Pix and the Res-Net style generator for CycleGAN.
Analysis of online distillation stage. To evaluate the significance of the online distillation scheme, we design a variant (abbreviated as “Ours w/o OD”) that optimizes the model in the offline two-stage distillation setting. As shown in Tables 3 and 4, removing the online distillation stage leads to a noticeable drop in performance. For example, “Ours w/o OD” lowers the mIoU to 45.76 on cityscapes, a decrease of 6.4% compared with our approach. This indicates that the online training scheme helps to guide the optimization process toward better results.
Analysis of complementary teachers setting. To investigate the effectiveness of the complementary teachers setting, we design a variant “Ours w/o DT” that removes the deeper teacher generator and employs only the wider one for optimization. As summarized in Table 3, our method obtains better results than “Ours w/o DT” on both benchmarks. This indicates that the complementary teacher setting significantly improves the capacity of the student generator. It is worth noting that the unstable training process of CycleGAN disturbs the deeper teacher generator, hence we leverage only the wider teacher generator on CycleGAN.
Analysis of multiple distillation layers. To examine the significance of multiple distillation layers, we design a variant (denoted as “Ours w/o CD”) that removes the channel-wise distillation. As shown in Table 3 and Table 4, “Ours w/o CD” performs worse, which indicates that concepts from intermediate layers serve as auxiliary supervision that assists training. By introducing multiple distillation layers, our method obtains 6.5%, 1.6%, 15.1% and 1.5% performance improvements on the four datasets, respectively.
|Device||Original latency||OMGD(U) 1.0× latency (speedup)||MACs compression|
|Huawei P20||416.73ms||43.00ms (9.7×)||15.3×|
|Mi 10||140.80ms||14.01ms (10.0×)||15.3×|
4.2.3 Latency Speedup
We report the CPU latency results on two mobile phones (i.e., Huawei P20 and Mi 10) measured with the TFLite toolkit. As shown in Table 5, our framework obtains significant acceleration in the inference procedure. For example, OMGD(U) 1.0× reduces the latency on the Mi 10 from 140.8ms to 14.01ms, a 90% reduction. This demonstrates that our framework provides a solution for real-time image translation.
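The speedups in Table 5 and the quoted latency reduction can be recomputed from the measured latencies. The sketch below also converts the compressed latency into a frame rate, which is one way to read the real-time claim; the helper names are hypothetical and the inputs are the table values in milliseconds.

```python
def speedup(original_ms, compressed_ms):
    """Ratio of original to compressed latency."""
    return original_ms / compressed_ms

def latency_reduction_percent(original_ms, compressed_ms):
    """Share of the original latency that was removed, in percent."""
    return (1.0 - compressed_ms / original_ms) * 100.0

def frames_per_second(latency_ms):
    """Frame rate if every frame takes latency_ms to process."""
    return 1000.0 / latency_ms

p20_speedup = speedup(416.73, 43.00)
mi10_speedup = speedup(140.80, 14.01)
mi10_reduction = latency_reduction_percent(140.80, 14.01)
mi10_fps = frames_per_second(14.01)
```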
4.2.4 Qualitative Results
We depict the visualization results of OMGD and the state-of-the-art methods in Figure 4, which demonstrate the effectiveness of OMGD. As shown, our method obtains 40.3×-46.6× MACs reductions with nearly no loss of visual fidelity. For example, our compressed model can generate natural zebra stripes on the horse→zebra dataset, while a competing method and the original model still retain the color of the input horse. OMGD transfers the background style smoothly while preserving the essential elements in the foreground on summer→winter. For Pix2Pix, OMGD captures the textural details of cloth fabric and the shine of leather on edges→shoes. Furthermore, OMGD shows superiority in rendering pavement features, such as roughness and lane lines.
In this paper, we propose an online multi-granularity distillation (OMGD) technique for learning lightweight GANs. The GAN-oriented online scheme alternately promotes the teacher and student generators, and the teacher helps to warm up the student and guide the optimization direction step by step. OMGD further takes advantage of multi-granularity concepts from complementary teacher generators and auxiliary supervision signals from different layers. Extensive experiments demonstrate that OMGD compresses Pix2Pix and CycleGAN to extremely low computational costs without obvious loss of visual fidelity, which provides a feasible solution for GAN deployment on resource-constrained devices.
-  (2019) Compressing gans using knowledge distillation. ArXiv abs/1902.00159. Cited by: §1.
Large scale distributed neural network training through online distillation. ArXiv abs/1804.03235. Cited by: §2.2.
-  (2017) Wasserstein gan. ArXiv abs/1701.07875. Cited by: §3.1.
-  (2019) Large scale gan training for high fidelity natural image synthesis. ArXiv abs/1809.11096. Cited by: §1, §2.1.
-  (2018) Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096. Cited by: §2.1.
-  (2006) Model compression. In KDD ’06, Cited by: §2.2.
-  (2020) Once-for-all: train one network and specialize it for efficient deployment. In International Conference on Learning Representations, External Links: Cited by: §1.
-  (2020) TinyGAN: distilling biggan for conditional image generation. ArXiv abs/2009.13829. Cited by: §1, §1, §3.1.
Distilling portable generative adversarial networks for image translation.
Proceedings of the AAAI Conference on Artificial Intelligence34 (04), pp. 3585–3592. External Links: Cited by: §1, §1, §3.1.
CartoonGAN: generative adversarial networks for photo cartoonization. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.1.
-  (2018-06) StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.1.
The cityscapes dataset for semantic urban scene understanding. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3213–3223. Cited by: 3rd item, §4.1.
-  (2020) AutoGAN-distiller: searching to compress generative adversarial networks. In ICML, Cited by: Figure 2, §1, §1, §2.1, §3.1, §4.1, §4.2.1, Table 2.
-  (2015) A neural algorithm of artistic style. ArXiv abs/1508.06576. Cited by: §3.1.
-  (2019) AutoGAN: neural architecture search for generative adversarial networks. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3223–3233. Cited by: §1, §4.2.1.
-  (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger (Eds.), Vol. 27, pp. . External Links: Cited by: §1, §2.1.
-  (2020-06) Online knowledge distillation via collaborative learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.1.
-  (2017) GANs trained by a two time-scale update rule converge to a local nash equilibrium. In NIPS, Cited by: §4.1.
Distilling the knowledge in a neural network.
NIPS Deep Learning and Representation Learning Workshop, External Links: Cited by: §1, §2.2.
-  (2020) Slimmable generative adversarial networks. ArXiv abs/2012.05660. Cited by: §1, §3.2.
MobileNets: efficient convolutional neural networks for mobile vision applications. ArXiv abs/1704.04861. Cited by: §1, §4.1.
-  (2020) Squeeze-and-excitation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 42, pp. 2011–2023. Cited by: §3.2.
-  (2017) Image-to-image translation with conditional adversarial networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 5967–5976. External Links: Cited by: 3rd item, §1, §2.1, §3.1, §4.1, Table 1, §6.1.
-  (2021) Teachers do more than teach: compressing image-to-image models. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: Figure 2, §1, §1, §2.1, §4.2.1, Table 1, Table 2.
Perceptual losses for real-time style transfer and super-resolution. In ECCV, Cited by: §3.1, §3.1.
-  (2019-06) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.1.
-  (2018) Knowledge distillation by on-the-fly native ensemble. In NeurIPS, Cited by: §3.1.
-  (2020) Pams: quantized super-resolution via parameterized max scale. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 564–580. Cited by: §1.
-  (2019) OICSR: out-in-channel sparsity regularization for compact deep neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7046–7055. Cited by: §1.
-  (2020) GAN compression: efficient architectures for interactive conditional gans. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5283–5293. Cited by: Figure 2, §1, §1, §2.1, §3.1, §4.1, §4.1, §4.2.1, §4.2.4, Table 1, Table 2, §6.1.
-  (2020) Learning efficient gans using differentiable masks and co-attention distillation. ArXiv abs/2011.08382. Cited by: Figure 2, §1, §1, §2.1, §3.1, §4.1, §4.2.1, Table 1, Table 2.
Multi-granularity tracking with modularlized components for unsupervised vehicles anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 586–587. Cited by: §3.
-  (2021) Network pruning using adaptive exemplar filters. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §1.
-  (2018) Spectral normalization for generative adversarial networks. ArXiv abs/1802.05957. Cited by: §1, §2.1.
-  (2019-06) Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
-  (2016-06) Context encoders: feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
Variational discriminator bottleneck: improving imitation learning, inverse RL, and GANs by constraining information flow. In International Conference on Learning Representations, External Links: Cited by: §3.1.
-  (2021) Learning low resource consumption cnn through pruning and quantization. IEEE Transactions on Emerging Topics in Computing. Cited by: §1.
-  (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR abs/1511.06434. Cited by: §1, §2.1.
-  (2016) Generative adversarial text to image synthesis. In ICML, Cited by: §1, §2.1.
-  (2015) FitNets: hints for thin deep nets. CoRR abs/1412.6550. Cited by: §2.2.
-  (1992) Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena 60, pp. 259–268. Cited by: §3.1.
-  (2016) Improved techniques for training gans. In NIPS, Cited by: §3.1.
-  (2019) MEAL: multi-model ensemble via adversarial learning. In AAAI, Cited by: §2.2.
-  (2019) Co-evolutionary compression for unpaired image translation. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3234–3243. Cited by: Figure 2, §2.1, §4.1, Table 2.
-  (2015) Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556. Cited by: §3.1.
-  (2019-06) Multi-channel attention selection GAN with cascaded semantic guidance for cross-view image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §2.1.
-  (2020) Bringing old photos back to life. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2744–2754. Cited by: §2.1.
-  (2020) GAN slimming: all-in-one GAN compression by a unified optimization framework. ArXiv abs/2008.11062. Cited by: Figure 2, §1, §1, §2.1, §3.1, §4.2.1, Table 2.
-  (2019-06) HAQ: hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1.
-  (2021) Knowledge distillation and student-teacher learning for visual intelligence: a review and new outlooks. IEEE transactions on pattern analysis and machine intelligence PP. Cited by: §2.2.
-  (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13, pp. 600–612. Cited by: §3.1.
-  (2018) Image captioning via semantic guidance attention and consensus selection strategy. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 14 (4), pp. 1–19. Cited by: §2.2.
-  (2019) Pseudo-3d attention transfer network with content-aware strategy for image captioning. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 15 (3), pp. 1–19. Cited by: §2.2.
-  (2017) Global-local feature attention network with reranking strategy for image caption generation. In CCF Chinese Conference on Computer Vision, pp. 157–167. Cited by: §2.2.
-  (2017) Building fast and compact convolutional neural networks for offline handwritten Chinese character recognition. Pattern Recognition 72, pp. 72–81. Cited by: §1.
- Design of a very compact CNN classifier for online handwritten Chinese character recognition using dropweight and global pooling. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Vol. 1, pp. 891–895. Cited by: §1.
-  (2017-07) High-resolution image inpainting using multi-scale neural patch synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.1.
-  (2017) Learning from multiple teacher networks. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Cited by: §2.2, §3.2.
-  (2014) Fine-grained visual comparisons with local learning. 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 192–199. Cited by: 3rd item, §4.1.
-  (2017) Dilated residual networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 636–644. Cited by: §4.1.
-  (2019) Slimmable neural networks. ArXiv abs/1812.08928. Cited by: §1.
-  (2018) Generative image inpainting with contextual attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5505–5514. Cited by: §2.1.
-  (2019) Self-attention generative adversarial networks. In ICML. Cited by: §1, §2.1, §6.1.
-  (2020) Channel distillation: channel-wise attention for knowledge distillation. ArXiv abs/2006.01683. Cited by: §2.2, §3.2, §3.2.
-  (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2242–2251. Cited by: 3rd item, §1, §2.1, §3.1, §4.1, §4.1, Table 2, §6.1.
6.1 Additional Implementation Details
Learn-best strategy for CycleGAN. To relieve the distillation difficulties caused by the notoriously unstable training process of CycleGAN, we develop a simple yet effective strategy for the training stage. Specifically, we evaluate the symmetrical generators in CycleGAN at a fixed epoch interval and then employ the best one as the teacher generator to refine the student generator. The whole distillation process of CycleGAN with the learn-best strategy is illustrated in Algorithm 1.
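The selection step of the learn-best strategy can be sketched as follows. This is a minimal illustration under stated assumptions, not Algorithm 1 itself: it reads "the best one" as the best-scoring generator snapshot seen so far, assumes a lower-is-better score such as FID, and uses hypothetical names (`learn_best_schedule`, `evaluate`, `G_A`, `G_B`) and toy scores that do not come from the paper.

```python
from typing import Callable, Dict, List, Optional, Tuple


def learn_best_schedule(
    num_epochs: int,
    eval_interval: int,
    evaluate: Callable[[int], Dict[str, float]],
) -> List[Tuple[int, Optional[str]]]:
    """Return, per epoch, the snapshot that would serve as the teacher.

    `evaluate(epoch)` is a hypothetical callback that scores each candidate
    generator at that epoch; lower is better (as with FID).
    """
    best_name: Optional[str] = None
    best_score = float("inf")
    history: List[Tuple[int, Optional[str]]] = []
    for epoch in range(1, num_epochs + 1):
        # (train the teacher generators for one epoch here)
        if epoch % eval_interval == 0:
            for name, score in evaluate(epoch).items():
                if score < best_score:
                    best_name, best_score = f"{name}@epoch{epoch}", score
        # (distill the student from `best_name`; while it is still None,
        #  the current teacher weights would be used instead)
        history.append((epoch, best_name))
    return history


# Toy scores: generator "G_A" improves until epoch 4 and then degrades.
toy_scores = {
    2: {"G_A": 80.0, "G_B": 90.0},
    4: {"G_A": 60.0, "G_B": 85.0},
    6: {"G_A": 70.0, "G_B": 88.0},
}
history = learn_best_schedule(6, 2, lambda epoch: toy_scores[epoch])
```

In this toy run the teacher stays fixed at the epoch-4 snapshot even though training continues, which is the kind of protection against a degrading teacher that the strategy is meant to provide.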
Deeper teacher generator. Figure 5 illustrates the construction of the resnet block used in the deeper teacher generator. For the U-Net style generator, we insert two resnet blocks after every downsample and upsample layer to obtain the deeper teacher generator. For the Res-Net style generator, we insert a resnet block and an additional 3×3 convolution layer after every downsample and upsample layer. Furthermore, we add two resnet blocks in the middle of the student generator’s backbone to construct the deeper teacher generator.
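The insertion rule for the deeper teacher can be illustrated on a list of layer names. This is only a schematic sketch: the layer labels, the toy student backbones, and the helper `deepen` are hypothetical, and the two extra blocks in the middle of the backbone are left out because their exact position is not specified here.

```python
from typing import List

SAMPLING_LAYERS = ("downsample", "upsample")


def deepen(layers: List[str], extra: List[str]) -> List[str]:
    """Insert the `extra` layers after every downsample/upsample layer."""
    deeper: List[str] = []
    for layer in layers:
        deeper.append(layer)
        if layer in SAMPLING_LAYERS:
            deeper.extend(extra)
    return deeper


# Hypothetical, heavily simplified student backbones.
unet_student = ["downsample", "downsample", "upsample", "upsample"]
resnet_student = ["downsample", "resnet_block", "upsample"]

# U-Net style: two resnet blocks after every sampling layer.
unet_teacher = deepen(unet_student, ["resnet_block", "resnet_block"])
# Res-Net style: one resnet block plus a 3x3 convolution after every sampling layer.
resnet_teacher = deepen(resnet_student, ["resnet_block", "conv3x3"])
```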
|Dataset||Generator Style||Training Epochs||ngf||ndf||D|
Hyper-parameter settings. The four loss weights are set to 1e1, 1e4, 1e1, and 1e-5, respectively. Following previous works [66, 23, 30], we apply the Adam optimizer [Adam] in all experiments. More details of the hyper-parameter settings are shown in Table 6. “Training Epochs” is the number of training epochs for the student generator, “Const” is the number of epochs that keep the fixed initial learning rate, and “Decay” is the number of epochs over which the learning rate decays linearly. A further hyper-parameter denotes the update interval for the teacher generator. For example, we train the student generator for 750 epochs on cityscapes and set the update interval to 3, which means that we train the teacher generator for 250 epochs. Another hyper-parameter is the evaluation interval in our learn-best strategy. “ngf” and “ndf” denote the base number of filters in the generator and discriminator, respectively, which control the model size. The remaining two hyper-parameters are the weight of the channel distillation loss and the number of shared layers in our partially shared discriminator. In our experiments, we found that the channel distillation loss does not show a significant effect for the Res-Net style generator on the cityscapes dataset, so we set its weight to zero. For the teacher networks, we use the same setting as in . The hinge GAN loss [GeometricGAN, 64] is applied to Pix2Pix and the LSGAN loss [LSGAN] is applied to CycleGAN.
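The epoch bookkeeping described above can be checked with a few lines of Python. This is a sketch under assumptions: the function names are hypothetical, the decay is assumed to go linearly to zero over the “Decay” epochs, and the base learning rate in the usage lines is a placeholder rather than a value taken from Table 6.

```python
def teacher_epochs(student_epochs: int, update_interval: int) -> int:
    """Number of epochs in which the teacher is trained when it is updated
    only every `update_interval`-th student epoch."""
    return student_epochs // update_interval


def learning_rate(epoch: int, base_lr: float,
                  const_epochs: int, decay_epochs: int) -> float:
    """Constant for `const_epochs`, then linear decay to zero (assumed form)."""
    if epoch < const_epochs or decay_epochs <= 0:
        return base_lr
    progress = (epoch - const_epochs) / decay_epochs
    return base_lr * max(0.0, 1.0 - progress)


# The cityscapes example from the text: 750 student epochs, update interval 3.
cityscapes_teacher_epochs = teacher_epochs(750, 3)

# Placeholder schedule: 100 constant epochs, 100 decay epochs, base_lr 2e-4.
lr_start = learning_rate(0, 2e-4, 100, 100)
lr_mid_decay = learning_rate(150, 2e-4, 100, 100)
lr_end = learning_rate(200, 2e-4, 100, 100)
```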
6.2 Additional Ablation Study
In this section, we further investigate several important components of our method. Experiments are conducted on U-Net style generator for Pix2Pix and Res-Net style generator for CycleGAN.
|Dataset||Method||FID (↓)|
|horse→zebra||Ours w/o LB||63.15|
|summer→winter||Ours w/o LB||75.70|
Analysis of the discriminator-free setting. To measure the significance of the discriminator-free setting for the student generator, we design a variant that introduces a discriminator and employs the GAN loss for training. Table 7 shows that this variant gets worse results, which reveals that the unstable optimization process and the invalid concepts from the discriminator degrade the performance of the student generator. For example, adding the discriminator increases FID from 25.00 to 73.44 on edges→shoes and decreases mIoU from 48.91 to 31.79 on cityscapes for the Pix2Pix model.
Analysis of the learn-best strategy. To verify the effectiveness of our learn-best strategy on CycleGAN, we introduce a variant (abbreviated as “Ours w/o LB”) that removes the learn-best strategy and distills the student under the basic OMGD scheme. Table 7 shows that the learn-best strategy reduces FID from 63.15 to 52.00 on horse→zebra and from 75.70 to 74.36 on summer→winter. This further indicates that the learn-best strategy can alleviate the impact of the teacher’s training instability.
|Teacher 1||Teacher 2|
|Dataset||Method||MACs||#Parameters||FID (↓)||mIoU (↑)|
|edges→shoes||OMGD (U) 0.5||0.333G||0.852M||37.34||-|
|edges→shoes||OMGD (U) 0.75||0.707G||1.916M||32.30||-|
|edges→shoes||OMGD (U) 1.0||1.219G||3.404M||25.00||-|
|edges→shoes||OMGD (R) 0.5||0.446G||0.039M||38.06||-|
|edges→shoes||OMGD (R) 0.75||0.867G||0.081M||34.48||-|
|edges→shoes||OMGD (R) 1.0||1.421G||0.137M||25.88||-|
|cityscapes||OMGD (U) 0.5||0.333G||0.852M||-||41.54|
|cityscapes||OMGD (U) 0.75||0.707G||1.916M||-||45.52|
|cityscapes||OMGD (U) 1.0||1.219G||3.404M||-||48.91|
|cityscapes||OMGD (R) 0.5||0.446G||0.039M||-||37.65|
|cityscapes||OMGD (R) 0.75||0.867G||0.081M||-||42.15|
|cityscapes||OMGD (R) 1.0||1.421G||0.137M||-||45.21|
|horse→zebra||OMGD (R) 0.5||0.446G||0.039M||71.27||-|
|horse→zebra||OMGD (R) 0.75||0.867G||0.081M||64.25||-|
|horse→zebra||OMGD (R) 1.0||1.421G||0.137M||52.00||-|
|summer→winter||OMGD (R) 0.5||0.446G||0.039M||75.46||-|
|summer→winter||OMGD (R) 0.75||0.867G||0.081M||74.95||-|
|summer→winter||OMGD (R) 1.0||1.421G||0.137M||74.36||-|
Analysis of the complementarity of multiple teachers. A wider teacher generator and a deeper teacher generator help to maintain complementarity in the structure dimension, which is critical for breaking through the capacity bottleneck of models with low computational costs. To further verify this motivation, we use two identical teacher generators to distill the student generator. As shown in Table 8 (“W” denotes the wider teacher generator, “D” denotes the deeper teacher generator), the complementary structure reduces FID from 34.28 to 25.00 on edges→shoes and increases mIoU from 46.39 to 48.91 on cityscapes.
Analysis of further compression. To comprehensively show the capability of OMGD, we set “ngf” to 8 and 12 to obtain further compressed networks of both Pix2Pix and CycleGAN on four datasets (i.e., edges→shoes, cityscapes, horse→zebra, summer→winter). MACs, parameter counts, and quantitative image-fidelity metrics are listed in Table 9. OMGD (R) denotes the Res-Net style generator and OMGD (U) denotes the U-Net style generator. Table 9 reveals that there is still room for further compression with OMGD. Furthermore, OMGD (U) 0.5, which requires only 0.333G MACs, still achieves impressive results, which provides a feasible solution for deployment on resource-constrained devices and even breaks the barriers to real-time image translation on mobile devices.
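To make the savings in Table 9 concrete, the short script below computes how much smaller the compressed U-Net style generators are than the 1.0 setting. The numbers are copied from the OMGD (U) rows of Table 9; treating 0.5, 0.75, and 1.0 as relative width settings is an assumption, and the names `OMGD_U` and `reduction` are hypothetical.

```python
from typing import Tuple

# (MACs in G, parameters in M) for OMGD (U), copied from Table 9.
OMGD_U = {0.5: (0.333, 0.852), 0.75: (0.707, 1.916), 1.0: (1.219, 3.404)}


def reduction(width: float) -> Tuple[float, float]:
    """Return (MACs reduction, parameter reduction) relative to the 1.0 setting."""
    macs, params = OMGD_U[width]
    full_macs, full_params = OMGD_U[1.0]
    return full_macs / macs, full_params / params


for width in sorted(OMGD_U):
    macs_x, params_x = reduction(width)
    print(f"OMGD (U) {width}: {macs_x:.2f}x fewer MACs, "
          f"{params_x:.2f}x fewer parameters")
```

At the 0.5 setting the parameter count drops by roughly 4x, which is what one would expect if parameters scale with the square of the channel width, while MACs drop somewhat less (about 3.7x).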