P-KDGAN: Progressive Knowledge Distillation with GANs for One-class Novelty Detection

07/14/2020 ∙ by Zhiwei Zhang, et al. ∙ Beijing Institute of Technology

One-class novelty detection aims to identify anomalous instances that do not conform to the expected normal instances. In this paper, Generative Adversarial Networks (GANs) based on the encoder-decoder-encoder pipeline are used for detection and achieve state-of-the-art performance. However, deep neural networks are too over-parameterized to be deployed on resource-limited devices. Therefore, Progressive Knowledge Distillation with GANs (P-KDGAN) is proposed to learn compact and fast novelty detection networks. P-KDGAN is a novel attempt to connect two standard GANs through a designed distillation loss that transfers knowledge from the teacher to the student. The progressive learning of knowledge distillation is a two-step approach that continuously improves the performance of the student GAN and achieves better performance than single-step methods. In the first step, the student GAN learns the basic knowledge entirely from the teacher, guided by the pre-trained teacher GAN with fixed weights. In the second step, joint fine-training is adopted for the knowledgeable teacher and student GANs to further improve the performance and stability. Experimental results on CIFAR-10, MNIST, and FMNIST show that our method improves the performance of the student GAN by 2.44%, 1.77%, and 1.73%, respectively, while compressing the computation at ratios of 24.45:1, 311.11:1, and 700:1.


1 Introduction

One-class novelty detection aims to identify patterns that do not belong to the normal data distribution [Chandola et al.2009]. Unlike traditional classification problems, novelty detection is usually trained in an unsupervised setting where novelty data are absent. Novelty detection has a wide variety of applications such as network intrusion [García-Teodoro et al.2009], credit card fraud [Srivastava et al.2008], medical diagnoses [Schlegl et al.2017] and many more. With the advances of deep learning, novelty detection based on generative adversarial networks (GANs) has shown state-of-the-art performance by learning a representative latent space of high-dimensional data [Schlegl et al.2017, Zenati et al.2018, Perera et al.2019]. However, deep neural networks with high computational costs and large storage requirements are prohibitive to deploy on systems with limited computation and memory resources.

To tackle this issue, neural network compression has been widely studied in recent years [Cheng et al.2017]. As one of the mainstream compression methods, Knowledge Distillation (KD) follows a teacher-student paradigm and transfers knowledge from a high-performing teacher network to a student network. Early contributions used the outputs of the softmax layers or intermediate layers of teacher networks to improve the performance of student networks [Hinton et al.2015, Romero et al.2015]. In later research, discriminator losses were proposed to evaluate the distinction between the distribution spaces of teacher and student networks [Wang et al.2018a, Wang et al.2018b, Liu et al.2018]. To our knowledge, there are no related works that connect two standard GANs [Goodfellow et al.2014], i.e., two generators and two discriminators, through a designed distillation loss for knowledge distillation. Additionally, few works investigate the initialization of student networks; random initialization is almost always used. Our experiments demonstrate that student networks without a "knowledge" reserve (i.e., with random initialization) do not mimic the outputs of teacher networks well.

Figure 1: The flowchart of Knowledge Distillation with GANs for one-class novelty detection. (a) Knowledge Distillation with Generative Adversarial Networks (KDGAN), in which the distillation loss L_KD (the weighted sum of L_lat1, L_lat2 and L_rec, Eq. 6) is designed for training the student GAN. (b) The two-step progressive learning of KDGAN is used to continuously improve the performance of the student GAN. KDGAN-1⃝, KDGAN-2⃝, KDGAN-3⃝ and KDGAN-4⃝ are four different distillation structures.

In this paper, we apply GANs with the encoder-decoder-encoder structure [Akçay et al.2018] to one-class novelty detection, which outperforms the state-of-the-art approaches. In order to deploy deep neural networks on mobile devices with limited computation resources, we propose the Progressive Knowledge Distillation with GANs (P-KDGAN) method to train a lightweight student network. The P-KDGAN approach improves the performance of the student GAN by solving the following three problems. 1) How to design a distillation loss that measures the similarity of intermediate representations learned by the teacher GAN and the student GAN? As shown in the student GAN of Figure 1(a), the generator based on the encoder-decoder-encoder pipeline produces two latent vectors z and ẑ, and a reconstructed image x̂. The generator is trained by minimizing the weighted sum of L_con, L_lat and L_adv, which are defined in Eqs. 2-4 [Akçay et al.2018]. Therefore, the distillation loss described in Eq. 6 is designed as the weighted sum of the losses L_lat1, L_lat2 and L_rec, where L_lat1 and L_lat2 measure the differences between the corresponding latent vectors (z_T and z_S, ẑ_T and ẑ_S) of the teacher and student GANs, and L_rec measures the difference between the two reconstructed images (x̂_T, x̂_S). 2) How to combine the distillation loss with the existing generator and discriminator losses of the student and teacher GANs to improve the performance of the student GAN? As illustrated in Figure 1(b), we design four distillation structures (KDGAN-1⃝, KDGAN-2⃝, KDGAN-3⃝ and KDGAN-4⃝) based on different combinations of the above five losses. They differ in two aspects: whether the weights of the teacher GAN are fixed, and whether the distillation loss is combined with the losses of the student GAN for knowledge transfer. 3) Can the four designed distillation structures make the performance of the student GAN comparable to that of the teacher GAN, and if not, how can this be fixed? Our experimental results demonstrate that student GANs trained from scratch (i.e., with random initialization) by the above four distillation structures cannot match the teacher GAN. Just as learning is a gradual, cumulative process, a two-step progressive learning of KDGAN is proposed to continuously improve the performance of the student GAN. In the first step, the student GAN imitates the representations of the pre-trained teacher GAN with fixed weights to build up a certain "knowledge" reserve. Such a "teaching by the teacher" step makes the student learn the basic knowledge entirely from the teacher. In the second step, the student GAN with this basic "knowledge" reserve is fine-trained together with the teacher GAN. This "fine-learning with the teacher" step further improves performance and stability by jointly utilizing the basic knowledge of the teacher and student.

The performance of the proposed progressive knowledge distillation with GANs for one-class novelty detection is evaluated on the CIFAR-10 [Krizhevsky2009], MNIST [LeCun and Cortes2005] and FMNIST [Xiao et al.2017] datasets. Our contributions are summarized as follows.

  • We utilize the encoder-decoder-encoder based GAN for one-class novelty detection, which outperforms all the state-of-the-art methods.

  • We propose new distillation losses on latent vectors and reconstructed images of GANs that allow the student to better learn from the teacher.

  • We regard the distillation process as a knowledgeable teacher to improve the performance and stability of student networks through two-step progressive learning, which includes basic knowledge learning and fine-learning.

  • Progressive Knowledge Distillation with GANs is proposed for one-class novelty detection. Our experiments demonstrate that P-KDGAN improves the performance of the student GAN on the CIFAR-10, MNIST and FMNIST datasets by 2.44%, 1.77%, and 1.73%, respectively.

2 Related Work

We briefly review the related works in terms of one-class novelty detection and knowledge distillation, as well as the architecture of Ganomaly [Akçay et al.2018].

2.1 One-class Novelty Detection

In unsupervised one-class novelty detection, only normal samples from a single class are used to train the model. Conventionally, novelty detection methods can be divided into two categories [Chandola et al.2009]. One category comprises traditional methods, such as One-Class SVM (OC-SVM) [Schölkopf et al.2001], Kernel Density Estimation (KDE) [Parzen1962] and Principal Component Analysis (PCA) [Wold et al.1987]. The disadvantage of such approaches is that they are not suitable for high-dimensional image data. The other category is based on deep learning and includes Deep Belief Networks (DBN) [Erfani et al.2016], Autoencoders (AE) [Vincent et al.2008] and generative adversarial networks (GANs) [Schlegl et al.2017, Zenati et al.2018, Perera et al.2019].

GANs have shown state-of-the-art performance in modeling complex high-dimensional image distributions [Goodfellow et al.2014]. Therefore, many GAN-based methods have been used for novelty detection [Schlegl et al.2017, Zenati et al.2018, Perera et al.2019]. The reconstruction errors of images or latent vectors are utilized as the novelty score, meaning that the learned model reconstructs only normal samples well and shows very low tolerance for novel samples. Schlegl et al. [Schlegl et al.2017] proposed the first GAN-based work, AnoGAN, for novelty detection. In training, the combination of a residual loss on images and a discrimination loss on feature maps is minimized to iteratively search for the best latent vector. The Efficient GAN [Zenati et al.2018], based on the BiGAN [Donahue et al.2017] network, was proposed to jointly learn the mapping from the image space to the latent space. Perera et al. [Perera et al.2019] proposed OCGAN, in which two discriminators are used in the latent space and the input space so that the learned network better models the input images. Recently, Ganomaly [Akçay et al.2018], shown in Figure 1(a), constructs a novel architecture for multi-class anomaly detection. In our method, the Ganomaly framework is used for one-class novelty detection.

2.2 Knowledge Distillation

To reduce the large computation and storage costs of deep convolutional neural networks, knowledge distillation transfers the generalization ability of a large network (or an ensemble of networks) to a lightweight network. Hinton et al. [Hinton et al.2015] used the outputs of the softmax layer of a teacher network as the target function to train the student network. Romero et al. [Romero et al.2015] proposed that a randomly initialized student network can imitate the intermediate representations of the teacher network to improve its own performance. To ensure that the student network learns the true data distribution from the teacher network, knowledge distillation with a discriminator has been used to distinguish features extracted from the teacher and student networks [Wang et al.2018a, Wang et al.2018b, Liu et al.2018]. In our method, knowledge distillation is considered a progressive learning process, which can continuously improve the performance of student networks.

GANs [Goodfellow et al.2014] have been applied to many real-world applications such as domain transfer, image generation, and novelty detection. However, to our knowledge, there are no related works that apply knowledge distillation to two standard GANs. Therefore, this paper designs a distillation loss to transfer knowledge from the teacher GAN to the student GAN.

2.3 The Architecture of Ganomaly

Akcay et al. [Akçay et al.2018] proposed Ganomaly for multi-class anomaly detection, in which samples from multiple classes are regarded as normal data and samples from one class as abnormal data. In this paper, we utilize the Ganomaly architecture for one-class novelty detection. One-class means that only the instances of one category are regarded as normal data, and the remaining categories are abnormal data.

A GAN consists of two adversarial modules, a generator G and a discriminator D. As shown in the student GAN of Figure 1(a), the Ganomaly [Akçay et al.2018] framework is composed of two modules: 1) a generator G based on an encoder-decoder-encoder (G_E1-G_D-G_E2) pipeline that learns the distribution of the input image x from the latent vectors z = G_E1(x) and ẑ = G_E2(x̂), where x̂ = G_D(z) is the reconstructed image; 2) a discriminator D that decides whether the reconstructed image x̂ is real or fake. G and D are simultaneously optimized by playing the following minmax game:

min_G max_D  E_{x∼p_X}[log D(x)] + E_{x∼p_X}[log(1 − D(G(x)))]    (1)

where the training dataset X comprises only normal images, and E_{x∼p_X} denotes the expected value over x drawn from the distribution p_X of normal images.

During training, the generator loss L_G and the cross-entropy loss L_D are minimized to train the student generator G_S and the student discriminator D_S, respectively. The model is trained only on normal samples, so the reconstruction error is large on abnormal samples. In previous methods [Schlegl et al.2017, Zenati et al.2018, Perera et al.2019], the reconstruction errors of the images or latent vectors are used for anomaly detection. In Ganomaly [Akçay et al.2018], the difference between the two latent vectors z and ẑ is used as the novelty score, which is defined in Eq. 3.
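
As a concrete illustration, such a latent-distance novelty score can be computed as in the following sketch; the generator interface (returning the reconstruction together with both latent codes) and the use of an L1 distance are assumptions made for illustration, not the paper's exact implementation.

```python
import torch

def novelty_score(generator, x):
    """Score a batch of images by the distance between the two latent codes.

    Assumes `generator(x)` returns (x_hat, z, z_hat): the reconstruction, the
    latent code of the first encoder, and the latent code of the second encoder.
    A larger distance between z and z_hat indicates a more novel (abnormal) input.
    """
    with torch.no_grad():
        x_hat, z, z_hat = generator(x)
        score = torch.abs(z - z_hat).flatten(start_dim=1).mean(dim=1)
    return score
```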

3 Our Method

In this work, we adopt the Ganomaly [Akçay et al.2018] framework for one-class novelty detection and achieve state-of-the-art performance. To compress deep neural networks and deploy them on embedded devices with limited resources, we propose Progressive Knowledge Distillation with GANs (P-KDGAN) to learn a lightweight student GAN from a pre-trained teacher GAN. The P-KDGAN method is composed of three parts: 1) knowledge distillation with GANs (KDGAN), in which distillation losses based on the Ganomaly framework are proposed for transferring knowledge from the teacher GAN to the student GAN; 2) four distillation structures designed for KDGAN; 3) the two-step progressive learning of KDGAN, which continuously improves the performance of the student GAN.

3.1 KDGAN

In KDGAN, both the teacher GAN and the student GAN follow the network architecture of Ganomaly [Akçay et al.2018] and have the same network layers; the difference between them is the number of channels in each layer. Therefore, the generator loss L_G^T and discriminator loss L_D^T of the teacher GAN have the same form as L_G^S and L_D^S of the student GAN. The generator loss of the student GAN includes the reconstructed-image loss L_con, the latent-space loss L_lat and the adversarial loss L_adv:

L_con = ||x − x̂_S||_1    (2)

L_lat = ||z_S − ẑ_S||_2    (3)

L_adv = ||f(x) − f(x̂_S)||_2    (4)

L_G^S = w_adv·L_adv + w_con·L_con + w_lat·L_lat    (5)

where f(·) outputs the intermediate representations of the discriminator D_S. L_con, L_lat and L_adv denote the reconstruction errors of the images, latent vectors and feature maps, respectively. Their weighted sum in Eq. 5 constitutes the generator loss L_G^S and is minimized to train the student generator G_S. The discriminator loss L_D^S is the binary cross-entropy loss between the predictions for real and reconstructed images.
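
A minimal PyTorch sketch of how these losses could be assembled is given below. The L1/L2 choices and the unit default weights are assumptions made for illustration (the paper tunes the three weights of Eq. 5, see Section 4.3); only the overall structure of Eqs. 2-5 and the cross-entropy discriminator loss is taken from the text.

```python
import torch
import torch.nn.functional as F

def student_generator_loss(x, x_hat, z, z_hat, feat_real, feat_fake,
                           w_adv=1.0, w_con=1.0, w_lat=1.0):
    """Weighted sum of the three generator terms (Eqs. 2-5), sketched."""
    l_con = F.l1_loss(x_hat, x)               # Eq. 2: image reconstruction error
    l_lat = F.mse_loss(z_hat, z)              # Eq. 3: latent reconstruction error
    l_adv = F.mse_loss(feat_fake, feat_real)  # Eq. 4: feature matching on f(.)
    return w_adv * l_adv + w_con * l_con + w_lat * l_lat  # Eq. 5

def student_discriminator_loss(pred_real, pred_fake):
    """Cross-entropy loss: real images labelled 1, reconstructions labelled 0."""
    loss_real = F.binary_cross_entropy(pred_real, torch.ones_like(pred_real))
    loss_fake = F.binary_cross_entropy(pred_fake, torch.zeros_like(pred_fake))
    return 0.5 * (loss_real + loss_fake)
```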

The designed distillation loss is a novel attempt at knowledge distillation between two standard GANs. As shown in Figure 1(a), the teacher GAN and the student GAN transfer knowledge through the intermediate outputs of the generators, which include two latent vectors and one reconstructed image. In KDGAN, we design three losses, L_lat1, L_lat2 and L_rec, to measure the similarity of these intermediate outputs. L_lat1 and L_lat2 are the distances between the corresponding latent vectors (z_T and z_S, ẑ_T and ẑ_S) of the teacher and student GANs, and L_rec is the distance between the reconstructed images (x̂_T, x̂_S). Based on these three losses, we propose the distillation loss L_KD as the objective function for knowledge distillation, defined as their weighted sum:

L_KD = λ1·L_lat1 + λ2·L_lat2 + λ3·L_rec    (6)
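
A sketch of the distillation loss of Eq. 6 is shown below, assuming the teacher and student generators expose the same (x̂, z, ẑ) outputs and that their latent vectors share the same dimensionality; the concrete distance functions are illustrative assumptions.

```python
import torch.nn.functional as F

def distillation_loss(teacher_out, student_out, lam1=1.0, lam2=1.0, lam3=1.0):
    """Eq. 6: weighted sum of the three distillation terms.

    `teacher_out` and `student_out` are (x_hat, z, z_hat) tuples produced by the
    teacher and student generators on the same input batch.
    """
    x_hat_t, z_t, zh_t = teacher_out
    x_hat_s, z_s, zh_s = student_out
    l_lat1 = F.mse_loss(z_s, z_t)         # distance between the first latent vectors
    l_lat2 = F.mse_loss(zh_s, zh_t)       # distance between the second latent vectors
    l_rec = F.l1_loss(x_hat_s, x_hat_t)   # distance between the reconstructed images
    return lam1 * l_lat1 + lam2 * l_lat2 + lam3 * l_rec
```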

3.2 Distillation Structures

As shown in Figure 1(a), the designed distillation loss builds a "bridge" between the teacher and student GANs for knowledge transfer. The losses in KDGAN consist of three parts: the teacher GAN losses L_G^T and L_D^T, the student GAN losses L_G^S and L_D^S, and the distillation loss L_KD. We define these five loss functions as the elements of the set S:

S = {L_G^T, L_D^T, L_G^S, L_D^S, L_KD}    (7)

where selecting an element of S indicates that the corresponding loss is used to train the networks.

The elements of S can be combined into four subsets (S_1, S_2, S_3, S_4) that form different distillation structures according to the following two rules. The first rule is whether the teacher GAN has fixed weights; the second rule is whether the distillation loss is combined with the student GAN losses L_G^S and L_D^S to train the student GAN. Before KDGAN training, a teacher GAN is pre-trained with its own generator loss L_G^T and discriminator loss L_D^T. The four designed distillation structures are introduced as follows, and a compact loss-selection sketch is given after the list:

  • KDGAN-1⃝: S_1 = {L_KD}. Without the use of real labels, the training of the student network depends only on the distillation loss L_KD, which results in poor detection performance. There is no adversarial training and the teacher network is not updated, so its training speed is the fastest.

  • KDGAN-2⃝: S_2 = {L_G^S, L_D^S, L_KD}. The student GAN is trained by minimizing its own losses L_G^S, L_D^S and the distillation loss L_KD, while the teacher GAN is not updated. The adversarial training in the student GAN makes it slightly slower than KDGAN-1⃝.

  • KDGAN-3⃝: S_3 = {L_G^T, L_D^T, L_KD}. The teacher GAN is trained with its own losses L_G^T, L_D^T to maintain its performance, while the training of the student GAN follows KDGAN-1⃝. Its training speed is almost the same as that of KDGAN-2⃝.

  • KDGAN-4⃝: S_4 = {L_G^T, L_D^T, L_G^S, L_D^S, L_KD}. The training of the teacher and student GANs follows KDGAN-3⃝ and KDGAN-2⃝, respectively. Two adversarial networks need to be trained simultaneously, so the training speed is the slowest.
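
To summarize the list above, the four subsets can be encoded as simple on/off flags over the five losses; the key names in the dictionary below are our own shorthand for the loss symbols of Eq. 7.

```python
# Loss selection for the four distillation structures (S_1 ... S_4).
# Keys: teacher generator/discriminator losses, student generator/discriminator
# losses, and the distillation loss of Eq. 6.
STRUCTURES = {
    "KDGAN-1": {"L_G_T": False, "L_D_T": False, "L_G_S": False, "L_D_S": False, "L_KD": True},
    "KDGAN-2": {"L_G_T": False, "L_D_T": False, "L_G_S": True,  "L_D_S": True,  "L_KD": True},
    "KDGAN-3": {"L_G_T": True,  "L_D_T": True,  "L_G_S": False, "L_D_S": False, "L_KD": True},
    "KDGAN-4": {"L_G_T": True,  "L_D_T": True,  "L_G_S": True,  "L_D_S": True,  "L_KD": True},
}

def enabled_losses(structure):
    """Return the names of the losses optimized under a given structure."""
    return [name for name, used in STRUCTURES[structure].items() if used]
```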

Input: Pre-trained teacher G_T and D_T, training dataset X with N normal instances, number of epochs E

Output: Improved student G_S

1:  First step
2:  Initialize student G_S and D_S with random weights
3:  for e = 1 to E do
4:     for i = 1 to N do
5:        Update the student GAN while the teacher GAN keeps fixed weights
6:     end for
7:  end for
8:  Second step
9:  Load the weights of the teacher and student GANs from the previous step
10:  for e = 1 to E do
11:     for i = 1 to N do
12:        Update the student GAN and the teacher GAN together
13:     end for
14:  end for
Algorithm 1 Progressive Knowledge Distillation with GANs

3.3 Progressive Learning of KDGAN

The progressive learning of KDGAN, shown in Figure 1(b), is a two-step approach that continuously improves the performance of the student GAN and achieves better performance than single-step methods. The two-step P-KDGAN is described as follows.

P-KDGAN-I.

In the first step, the four distillation structures are used to train the student network. The experimental results in Section 4.4 demonstrate that the performance of a randomly initialized student network has a large gap compared with the teacher network. Therefore, considering the detection accuracy and training time of the four distillation structures, KDGAN-2⃝ is used as the first step of P-KDGAN to enable the student network to learn the basic knowledge from the teacher network. In KDGAN-2⃝, the pre-trained teacher has already converged, so the teacher network with fixed weights is used to train the student network, relying on real labels and distilled knowledge.

P-KDGAN-II.

In the second step, KDGAN-3⃝ and KDGAN-4⃝ continue to train the teacher network, while the student network with basic knowledge relies on the distilled knowledge for fine-training, thereby further improving accuracy and stability. The fine-learning processes in this step are denoted P-KDGAN-II-2⃝3⃝ and P-KDGAN-II-2⃝4⃝. The experimental results show that the performance of the student network even exceeds that of the teacher network in some categories of one-class novelty detection.

The above process is illustrated in Algorithm 1.
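
The following PyTorch-style skeleton restates Algorithm 1; `train_student_step` and `train_teacher_step` are hypothetical helpers standing for one optimization step built from the loss sketches above (student losses plus Eq. 6, and teacher losses, respectively), and the optimizer settings follow the paper.

```python
import torch

def p_kdgan(teacher_g, teacher_d, student_g, student_d, loader,
            train_student_step, train_teacher_step,
            epochs_step1=500, epochs_step2=500, lr=0.002):
    """Two-step progressive distillation (Algorithm 1), sketched."""
    opt_sg = torch.optim.Adam(student_g.parameters(), lr=lr)
    opt_sd = torch.optim.Adam(student_d.parameters(), lr=lr)

    # Step 1 (P-KDGAN-I, i.e. KDGAN-2): the pre-trained teacher is frozen and the
    # randomly initialized student learns from its own losses plus the distillation loss.
    for p in list(teacher_g.parameters()) + list(teacher_d.parameters()):
        p.requires_grad_(False)
    for _ in range(epochs_step1):
        for x in loader:
            train_student_step(x, teacher_g, student_g, student_d, opt_sg, opt_sd)

    # Step 2 (P-KDGAN-II): unfreeze the teacher and fine-train both GANs together,
    # starting from the weights obtained in the previous step.
    for p in list(teacher_g.parameters()) + list(teacher_d.parameters()):
        p.requires_grad_(True)
    opt_tg = torch.optim.Adam(teacher_g.parameters(), lr=lr)
    opt_td = torch.optim.Adam(teacher_d.parameters(), lr=lr)
    for _ in range(epochs_step2):
        for x in loader:
            train_teacher_step(x, teacher_g, teacher_d, opt_tg, opt_td)
            train_student_step(x, teacher_g, student_g, student_d, opt_sg, opt_sd)
    return student_g
```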

4 Experiments

In this section, the proposed P-KDGAN is evaluated on the well-known CIFAR-10 [Krizhevsky2009], MNIST [LeCun and Cortes2005] and FMNIST [Xiao et al.2017] datasets. Following previous work [Perera et al.2019], we quantify the performance of our method using the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC). The results are analyzed in detail and compared with state-of-the-art techniques.
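
For reference, the AUC can be computed directly from the novelty scores and the binary normal/novel labels; the minimal sketch below uses scikit-learn, and the min-max normalization is an optional assumption that does not change the AUC value.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_auc(scores, labels):
    """AUC of the ROC for one-class novelty detection.

    `scores`: higher means more novel (e.g. the latent-distance score);
    `labels`: 1 for novel test samples, 0 for normal test samples.
    """
    scores = np.asarray(scores, dtype=np.float64)
    scores = (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)
    return roc_auc_score(np.asarray(labels), scores)
```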

All reported results are obtained with the PyTorch framework [Paszke et al.2017] on an NVIDIA TITAN 2080Ti. In the experiments, the batch size and the number of epochs are set to 1 and 500, respectively. Adam [Kingma and Ba2015] is used for training with a learning rate of 0.002.

Figure 2: Training curves of the AUC on the three datasets. The normal classes are: (a) Automobile on CIFAR-10, (b) 5 on MNIST, (c) Bag on FMNIST. P-KDGAN-I represents KDGAN-2⃝, P-KDGAN-II represents P-KDGAN-II-2⃝3⃝, and KDGAN-d represents KDGAN-4⃝.

4.1 Datasets

For the three experimental datasets, the default training and testing partitions are retained. In the setup, one of the classes from the training dataset is considered normal and used for training; during testing, the remaining classes represent novelty samples. For example, every experiment on the CIFAR-10 dataset is trained with 5,000 samples and tested with 10,000 samples, and this experiment is repeated for all ten categories. In addition, to be compatible with the network architectures, all images are resized to 32×32 by bilinear interpolation.
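
A sketch of this one-class protocol for CIFAR-10 with torchvision is shown below; the helper name and the transform pipeline are our own, and the same construction applies to MNIST and FMNIST (whose 28×28 images are the ones that actually need the 32×32 resize).

```python
import numpy as np
from torch.utils.data import Subset
from torchvision import datasets, transforms

def one_class_cifar10(normal_class, root="./data"):
    """Train on the 5,000 images of `normal_class`; test on the full 10,000-image
    test set, where every other class is treated as novel."""
    tfm = transforms.Compose([
        transforms.Resize((32, 32), interpolation=transforms.InterpolationMode.BILINEAR),
        transforms.ToTensor(),
    ])
    train = datasets.CIFAR10(root, train=True, download=True, transform=tfm)
    test = datasets.CIFAR10(root, train=False, download=True, transform=tfm)

    normal_idx = np.where(np.array(train.targets) == normal_class)[0]
    train_normal = Subset(train, normal_idx.tolist())                    # normal samples only
    test_labels = (np.array(test.targets) != normal_class).astype(int)   # 1 = novel, 0 = normal
    return train_normal, test, test_labels
```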

Layer Units BN Activation Kernel
Conv2D 64 LeakyReLU
Conv2D 128 LeakyReLU
Conv2D 256 LeakyReLU
Conv2D 256
ConvTrans2D 256 ReLU
ConvTrans2D 128 ReLU
ConvTrans2D 64 ReLU
ConvTrans2D 3 Tanh
Table 1: The encoder and decoder architecture of our teacher GAN, layer by layer. Units refers to the number of filters for the convolution layers, and BN abbreviates Batch Normalization.

4.2 Network Architectures

The Ganomaly [Akçay et al.2018] framework based on the encoder-decoder-encoder pipeline is used in our method: the two sub-networks of the generator that map images to latent vectors and the feature extractor of the discriminator are encoders, and the middle sub-network of the generator is the decoder. The encoders and decoder follow the DCGAN [Radford et al.2016] architecture and have three basic layers in our model. As shown in Table 1, the basic layers consist of a convolutional layer (or deconvolutional layer), batch normalization and an activation. LeakyReLU activations are used in the encoders and ReLU in the decoder, except for the last layer of the decoder, which uses Tanh. All the convolution filters are set to the same size.

The difference between a teacher network and a student network is the number of channels in the intermediate representations. For the three experimental datasets, the intermediate layers of the teacher networks are set to 64-128-256 channels following OCGAN [Perera et al.2019]. The student networks for the three datasets use intermediate representations with 8-16-64, 2-4-8 and 1-2-4 channels, respectively. The encoder and decoder architecture of the teacher GAN is illustrated in Table 1.
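
The channel scaling between teacher and student can be expressed with a single encoder/decoder builder, as in the sketch below; the 4×4 stride-2 kernels, the BN placement and the latent size follow the DCGAN convention and are assumptions rather than the paper's exact settings.

```python
import torch.nn as nn

def make_encoder(channels=(64, 128, 256), in_ch=3, latent=100):
    """DCGAN-style encoder: Conv-BN-LeakyReLU blocks ending in a 1x1 latent map.
    Use (64, 128, 256) for the teacher or e.g. (8, 16, 64) for a student."""
    layers, prev = [], in_ch
    for i, ch in enumerate(channels):
        layers.append(nn.Conv2d(prev, ch, 4, stride=2, padding=1, bias=False))
        if i > 0:                      # no BN on the first layer (DCGAN convention)
            layers.append(nn.BatchNorm2d(ch))
        layers.append(nn.LeakyReLU(0.2, inplace=True))
        prev = ch
    layers.append(nn.Conv2d(prev, latent, 4, stride=1, padding=0, bias=False))  # 4x4 -> 1x1
    return nn.Sequential(*layers)

def make_decoder(channels=(64, 128, 256), out_ch=3, latent=100):
    """Mirror of the encoder: ConvTranspose-BN-ReLU blocks with a final Tanh."""
    layers, prev = [], latent
    for i, ch in enumerate(reversed(channels)):
        stride, pad = (1, 0) if i == 0 else (2, 1)
        layers += [nn.ConvTranspose2d(prev, ch, 4, stride=stride, padding=pad, bias=False),
                   nn.BatchNorm2d(ch), nn.ReLU(inplace=True)]
        prev = ch
    layers += [nn.ConvTranspose2d(prev, out_ch, 4, stride=2, padding=1, bias=False),
               nn.Tanh()]
    return nn.Sequential(*layers)
```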

NORMAL CLASS OCSVM KDE VAE AND AnoGAN DSVDD OCGAN Ours
AIRPLANE 0.630 0.658 0.700 0.717 0.671 0.617 0.757 0.825
AUTOMOBILE 0.440 0.520 0.386 0.494 0.547 0.659 0.531 0.744
BIRD 0.649 0.657 0.679 0.662 0.529 0.508 0.640 0.703
CAT 0.487 0.497 0.535 0.527 0.545 0.591 0.620 0.605
DEER 0.735 0.727 0.748 0.736 0.651 0.609 0.723 0.765
DOG 0.500 0.496 0.523 0.504 0.603 0.657 0.620 0.652
FROG 0.725 0.758 0.687 0.726 0.585 0.677 0.723 0.797
HORSE 0.533 0.564 0.493 0.560 0.625 0.673 0.575 0.723
SHIP 0.649 0.680 0.696 0.680 0.758 0.759 0.820 0.827
TRUCK 0.508 0.540 0.386 0.566 0.665 0.731 0.554 0.735
MEAN 0.5856 0.6097 0.5833 0.6172 0.6179 0.6481 0.6566 0.7376
NORMAL CLASS OCSVM KDE VAE AND AnoGAN DSVDD OCGAN Ours
0 0.988 0.885 0.997 0.984 0.966 0.980 0.998 0.996
1 0.999 0.996 0.999 0.995 0.992 0.997 0.999 0.999
2 0.902 0.710 0.936 0.947 0.850 0.917 0.942 0.969
3 0.950 0.693 0.959 0.952 0.887 0.919 0.963 0.969
4 0.955 0.844 0.973 0.960 0.894 0.949 0.975 0.970
5 0.968 0.776 0.964 0.971 0.883 0.885 0.980 0.951
6 0.978 0.861 0.993 0.991 0.947 0.983 0.991 0.992
7 0.965 0.884 0.976 0.970 0.935 0.946 0.981 0.982
8 0.853 0.669 0.923 0.922 0.849 0.939 0.939 0.965
9 0.955 0.825 0.976 0.979 0.924 0.965 0.981 0.987
MEAN 0.9513 0.8143 0.9696 0.9671 0.9127 0.9480 0.9750 0.9780
Table 2: One-class novelty detection results (AUC) on the CIFAR-10 (upper block) and MNIST (lower block) datasets. The average AUC of three repeated experiments is reported as the detection performance.

4.3 Results on One-class Novelty Detection

In this section, we compare our Ganomaly-based [Akçay et al.2018] teacher GAN with several traditional and deep-learning-based methods on the CIFAR-10 and MNIST datasets, including one-class SVM (OC-SVM) [Schölkopf et al.2001], kernel density estimation (KDE) [Parzen1962], deep variational autoencoder (VAE) [Kingma and Welling2014], AND [Abati et al.2019], AnoGAN [Schlegl et al.2017], DSVDD [Ruff et al.2018] and OCGAN [Perera et al.2019]. Based on extensive experiments, the three weights in Eq. 5 are manually configured as 10, 1 and 1, and the weights λ1, λ2 and λ3 in Eq. 6 are all set to 1. We report the average AUC of the last epoch over multiple trials, rather than a manually selected result, as the detection performance, which is more convincing.

Comparisons on CIFAR-10 and MNIST.

As shown in Table 2, our method achieves a mean AUC of 73.76% for one-class novelty detection on the CIFAR-10 dataset, which is about 8% higher than the best previous method, OCGAN [Perera et al.2019]. On the MNIST dataset, our method achieves 97.80%, an improvement of about 0.3% over the state-of-the-art method.

4.4 Evaluation of P-KDGAN Method

In this section, the progressive knowledge distillation with GANs is evaluated on the CIFAR-10, MNIST and FMNIST datasets. In each experiment, the weights of the last epoch serve as the teacher network.

Method CIFAR-10 MNIST FMNIST
KDGAN-1⃝ 71.59% 96.7% 92.48%
KDGAN-2⃝ 72.43% 96.75% 92.41%
KDGAN-3⃝ 70.94% 97.18% 92.90%
KDGAN-4⃝ 72.58% 96.87% 92.42%
P-KDGAN-II-2⃝3⃝ 73.05% 97.25% 92.93%
P-KDGAN-II-2⃝4⃝ 72.52% 96.67% 92.57%
Table 3: Comparison of the performance of KDGAN and P-KDGAN (mean AUC over all classes of each dataset).

KDGAN vs. P-KDGAN.

As shown in Table 3, P-KDGAN-II-2⃝3⃝ achieves the best performance on all three datasets, which illustrates the effectiveness of our progressive learning of KDGAN. Although KDGAN-3⃝ achieves the second-best results on MNIST and FMNIST, it shows the worst performance on the CIFAR-10 dataset. KDGAN-4⃝ obtains the second-best result on CIFAR-10, but it is about 0.5% lower than the best result. In addition, KDGAN-d (KDGAN-4⃝), illustrated in Figure 2(c), is inferior in accuracy and training stability compared to P-KDGAN-II. The training curves of the AUC in Figure 2 clearly show that the proposed P-KDGAN-II improves the accuracy of the student network, which even surpasses the teacher network, and reduces oscillation. Therefore, we conclude that student networks with random initialization can only learn the basic knowledge of the teacher networks, and the fine-training in the second step of P-KDGAN further improves performance.

Dataset Method AUC #Param. #FLOPs
CIFAR-10 Teacher 73.76% 5.12M 56M
CIFAR-10 Student −3.15% 6.22× 24.45×
CIFAR-10 P-KDGAN −0.71% 6.22× 24.45×
MNIST Teacher 97.80% 5.12M 56M
MNIST Student −2.32% 52.22× 311.11×
MNIST P-KDGAN −0.55% 52.22× 311.11×
FMNIST Teacher 93.11% 5.12M 56M
FMNIST Student −1.91% 105.45× 700×
FMNIST P-KDGAN −0.18% 105.45× 700×
Table 4: Evaluation of our P-KDGAN method on the CIFAR-10, MNIST and FMNIST datasets. For the Student and P-KDGAN rows, the AUC column reports the drop relative to the teacher, and #Param. and #FLOPs report the compression ratios of parameter numbers and FLOPs compared to the teacher GAN (M means million, × denotes the compression ratio).

Results on P-KDGAN.

As is illustrated in Table 4, the performance of the student GAN obtained by two-step P-KDGAN is only 0.71%, 0.55% and 0.18% lower than that of the teacher GAN when compressing the computation at ratios of 24.45:1, 311.11:1, and 700:1, respectively.
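
The parameter-compression ratios in Table 4 can be reproduced by simply counting the parameters of the two networks; FLOPs counting requires a profiler and is omitted here. A minimal sketch:

```python
def param_compression_ratio(teacher, student):
    """Ratio of parameter counts between a teacher and a student network."""
    n_teacher = sum(p.numel() for p in teacher.parameters())
    n_student = sum(p.numel() for p in student.parameters())
    return n_teacher / n_student
```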

5 Conclusion

In this paper, we use encoder-decoder-encoder pipeline based GANs for one-class novelty detection and achieve state-of-the-art performance. To compress the model, progressive knowledge distillation with GANs is proposed, which is a novel exploration of applying knowledge distillation to two standard GANs. The two-step progressive learning continuously improves the performance and reduces the oscillation of the student network, in which the designed distillation loss plays an important role. Experiments on three datasets validate the effectiveness of the proposed method. Moreover, our method can be used to compress other GAN-based applications, such as image generation.

Acknowledgments

This work is supported by Key-Area Research and Development Program of Guangdong Province (2019B010155003), National Natural Science Foundation of China (U1713203), Shenzhen Science and Technology Innovation Commission (Project KQJSCX20180330170238897), and the Scientific Instrument Developing Project of the Chinese Academy of Sciences (Grant No. YJKYYQ20190028).

References