1 Introduction
Over the past few years, variational autoencoders (VAEs) have demonstrated their effectiveness in many areas, such as unsupervised learning kingma2013auto and supervised learning doersch2016tutorial ; kingma2014semi . The theory of the variational autoencoder takes the perspective of Bayes' theorem: the posterior distribution of the latent variables z conditioned on the data x is approximated by a normal distribution whose mean and variance are the output of a neural network. In order to make the generated sample very similar to the data x, VAE adds the Kullback-Leibler divergence to the loss function. The mapping from the data x to the latent variables z corresponds to the encoder, and the generation of a sample from the distribution of the latent variables z corresponds to the decoder. These complex mapping functions are learned by neural networks, since a neural network can in theory approximate any function.
Taking classification as an example, we want to use hidden variables with more dimensions to represent the input data. In the high-dimensional space, the ultimate objective of classification is to make the inter-class distance as large as possible, so that the samples are more separable, and the inner-class distance as small as possible, so that the distribution of each class is more aggregated. For the distribution of the original data, in the case of MNIST deng2012mnist , the distribution of the digit "9" is close to those of the digits "4" and "7" doersch2016tutorial ; kangcontinual , which is why CVAE sometimes confuses them.
A lot of work has been done to prove the effectiveness of the codec structure. Based on the basic structure of the autoencoder and predefined evenly-distributed class centroids (PEDCC), a new supervised learning method for autoencoders is proposed in this paper. Using PEDCC, which meets the criterion of maximized inter-class distance (the classes are as far apart as possible), we map the label of the input data x to the corresponding predefined class centroid, and then let the encoder network learn the mapping function. Through joint training, the network maps the latent features of samples of the same class as close as possible to their predefined class centroid, which finally yields a good classification. As far as we know, prior to this article there was no method that used predefined class centroids to train autoencoders.
We use the output of the encoder directly as the input to the decoder, and adopt the mean square error (MSE) loss function to make the autoencoder input and output as close as possible. Because of resampling, the images generated by VAE generally suffer from blurred edges. To solve this problem, we add a Sobel loss: the Sobel operator is applied to the input image and to the output of the autoencoder respectively, and the difference of their mean powers is taken as a new loss term to improve the edge quality of the generated image. This is an additional constraint on the edge difference between the input and the generated image, which helps to improve the subjective quality of the image. To further improve the subjective quality of the generated image, we draw on the idea of reparameterization in VAE and add Gaussian noise to the latent features to form the input of the decoder in the training phase. The experimental results show that this trick is very effective.
Below, we first introduce some previous work in Section 2. In Section 3 our approach is described in detail. In Section 4 we verify the validity of our method through experimental results on the MNIST and Fashion-MNIST xiao2017fashion datasets. Finally, in Section 5, we discuss some of the issues that still exist and what will be done in the future.
2 Related Work
The autoencoder is an unsupervised learning algorithm that is mainly used for dimension reduction or feature extraction. It can also be used in deep learning to initialize weights before the training phase. Depending on the application scenario, autoencoders can be divided into the sparse autoencoder olshausen1996emergence ; lee2007efficient, which adds L1 regularization to the basic autoencoder to make the features sparse; the denoising autoencoder vincent2008extracting ; bengio2013generalized , which adds noise to the input data to prevent overfitting and enhance the generalization ability of the model; and the variational autoencoder kingma2013auto ; rezende2014stochastic , which learns the distribution of the raw data by setting the distribution of the latent variables to N(0, I), and can in turn produce data similar to the original data. Normally, the variational autoencoder is used in unsupervised learning, so we cannot control what the decoder generates. The conditional variational autoencoder (CVAE) sohn2015learning ; walker2016uncertain combines the variational autoencoder with supervised information, which allows us to control the generation of the decoder. Ref. doersch2016tutorial assumes that class labels are independent of latent features, so that they are concatenated directly to generate data. Ref. pandey2016variational controls the generation of latent variables through the labels of face attributes, and then generates data by sampling directly from the distribution of the latent variables. Ref. sohn2015learning considers the prediction problem directly, with the label y as the data to be generated and the data x as the condition, in order to predict the label y of the data x. Refs. bojanowski2017optimizing ; hoshen2018non train the generator directly, and the latent variables z are also trained as network parameters. Therefore, they can only generate images, without feature extraction or classification.
The above works only learn the data distribution and perform encoding and decoding, without a classification function. In recent years, some scholars have used VAE in the field of incremental learning, focusing mainly on how to alleviate catastrophic forgetting mccloskey1989catastrophic .
Ref. lavda2018continual adds a classification layer to the VAE structure and uses a dual network structure, composed of a "teacher-student" model, to mitigate the problem of catastrophic forgetting. In order to achieve classification, Ref. kangcontinual adds an additional classification network to CVAE for joint training. Usually, adding a classification function to an autoencoder requires an additional network structure.
Through predefined evenly-distributed class centroids, the CSAE proposed in this paper maps training labels to these class centers and can use the latent variables to classify directly. For this new framework, a new loss function is also proposed for training.
3 Method
In this section we introduce the components of CSAE and show how to combine them into an end-to-end learning system. Section 3.1 describes PEDCC, Section 3.2 describes the loss function, and Section 3.3 discusses the network structure.
3.1 Predefined Evenly-Distributed Class Centroids
From the traditional view of statistical pattern recognition, the main objective of dimension reduction is to generate low-dimensional representations with maximum inter-class distance and minimum inner-class distance, as in the LDA algorithm. In a deep learning classification model, the final Softmax is a linear classifier, while the preceding multilayer network is a dimension-reducing mapping that generates low-dimensional latent variables. If the latent variables of the samples have small inner-class distance and large inter-class distance, the neural network will learn better features. Refs. liu2016large and wang2018additive modify the Softmax function of the cross-entropy loss in convolutional neural networks (CNNs) to improve classification performance. Because of their strong learning ability, it is not difficult for neural networks to obtain good aggregation within the same class. However, maximizing the inter-class distance is a difficult problem that varies across classification tasks. If the variance within a class is large and the inter-class distance is small, different classes will overlap, which can lead to wrong classification. There is no good way to avoid this problem. In this work, the class centers of the latent variables are set artificially by the PEDCC method to ensure that these clustering centers are as far apart as possible.
Thus, the encoder of the autoencoder learns a mapping function that maps the samples of each class to the corresponding predefined class center, so that the strong fitting ability of deep learning separates the different classes. The validity of PEDCC is demonstrated by the experiments in this paper.
First, we assume that all clustering centers lie on a hypersphere. The goal is to generate points that are evenly distributed on the surface of the d-dimensional hypersphere to serve as clustering centers, with one point per class. There is no analytical solution for generating evenly-distributed points on a hyperspherical surface, and even when such points can be generated, there are infinitely many solutions. Generally, numerical solutions are obtained by iterative methods. For our purposes, as long as these center points are evenly distributed, deep learning has sufficient fitting capability to achieve the mapping.
In this paper, the PEDCC method is based on the physical model of like charges on a hyperspherical surface reaching their lowest energy. That is, the charge points on the hyperspherical surface repel each other, and the repulsive forces push the charges to move. When the motion finally balances and the points stop moving, they are evenly distributed on the hyperspherical surface. In this equilibrium state, the points are as far apart as possible. We give each point a variable that describes its state of motion, and each step of the iteration updates the motion state and position of all points.
The algorithm is parameterized by the size of the input images, the dimension d of the encoder output feature of the autoencoder, and the number of classes. To ensure that the predefined center points do not come too close to each other during the iteration, the PEDCC algorithm uses a distance threshold: when the distance between two points is less than the threshold, it is set to the threshold, and the iteration continues. Here, we set the threshold to 0.01.
The detailed algorithm is shown in Algorithm 1. The algorithm outputs the predefined class centroids, one d-dimensional point per class. First, the initial values of the predefined centers are obtained by randomly sampling vectors from the d-dimensional normal distribution, and these vectors are then normalized to length 1. Each point has a velocity state that describes its motion. Then we enter the iterations: the distance matrix between the vectors is calculated first, and the resultant force f exerted on each class center by the other points is then calculated from the distance matrix. To simplify the calculation, the force between two points is taken as a function of l, where l is the distance vector between the two points. The resultant force vector for each point is:
(1) 
From the resultant force, we obtain its tangent vector, i.e. the component of the resultant force that is tangent to the hypersphere at the point:
(2) 
where the product of two vectors denotes the dot product. Finally, the position of each point is updated by its current velocity vector, and the velocity is updated according to the tangent vector of the resultant force:
(3) 
(4) 
where the index counts the iterations and the step constant plays a role similar to a learning rate. Subsequently, u is normalized. The algorithm terminates when the maximum number of iterations is reached.
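The iteration described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1: the inverse-square repulsion, the step size `step`, the iteration count `iters`, and in particular the velocity `damping` factor (added here so that the sketch settles numerically) are all choices of this example. Only the 0.01 distance threshold and the overall structure (Gaussian initialization, tangential force, velocity and position updates, renormalization) come from the text.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in vec))
    return [c / norm for c in vec]


def pedcc(num_classes, dim, iters=3000, step=0.01, damping=0.9,
          min_dist=0.01, seed=0):
    """Spread num_classes points over the unit hypersphere in R^dim."""
    rng = random.Random(seed)
    # Gaussian initial vectors normalized to length 1; velocities start at zero.
    u = [normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])
         for _ in range(num_classes)]
    v = [[0.0] * dim for _ in range(num_classes)]
    for _ in range(iters):
        tangents = []
        for i in range(num_classes):
            # Resultant repulsive force on point i (assumed inverse-square law).
            force = [0.0] * dim
            for j in range(num_classes):
                if j == i:
                    continue
                diff = [a - b for a, b in zip(u[i], u[j])]
                # Distances below the threshold are clamped to the threshold.
                dist = max(math.sqrt(sum(c * c for c in diff)), min_dist)
                force = [f + c / dist ** 3 for f, c in zip(force, diff)]
            # Remove the radial part: f_t = f - (f . u) u.
            radial = sum(f * c for f, c in zip(force, u[i]))
            tangents.append([f - radial * c for f, c in zip(force, u[i])])
        # Move each point by its current velocity and renormalize it,
        # then update the (damped) velocity from the tangential force.
        u = [normalize([a + b for a, b in zip(u[i], v[i])])
             for i in range(num_classes)]
        v = [[damping * b + step * t for b, t in zip(v[i], tangents[i])]
             for i in range(num_classes)]
    return u


def min_pairwise_distance(points):
    """Smallest Euclidean distance between any two points."""
    return min(math.dist(p, q)
               for i, p in enumerate(points) for q in points[i + 1:])


centroids = pedcc(num_classes=4, dim=3)
```

For four points in three dimensions the even configuration is a regular tetrahedron, so `min_pairwise_distance(centroids)` should approach sqrt(8/3), roughly 1.63.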
In order to verify the validity of Algorithm 1, we set the dimension of the latent variables to 3 for easy display. Figure 2 shows the distributions of 2, 4, 10 and 20 points respectively. The distribution of the predefined centers on the spherical surface can be seen, and the results are very close to evenly distributed. The result of Algorithm 1 is the optimal solution, which lays an important foundation for the subsequent classification and generation of random samples.
The initial values of the predefined centers are randomly generated, and the converged positions are affected by the initial values, but the final distribution is always even. Once determined, the centers are no longer changed in the later network training and test phases. Since the values of a point on the high-dimensional sphere are very small in each dimension, we multiply them by a constant coefficient alpha to enlarge the range of values and facilitate the subsequent work. Here we take:
(5) 
3.2 Design of Loss Function
3.2.1 Loss function of classification
In order to use the cross-entropy function as the final classification loss, the outputs of neural networks for classification howard2017mobilenets ; zhang2018shufflenet ; he2016deep are usually mapped to one-hot form. The biggest difference in our framework is that the optimization goal is for the centroid of each class to be as close as possible to the corresponding predefined class centroid, which is a multidimensional vector whose dimension matches that of the encoder output, i.e. the latent variables. The number of latent variables is normally greater than the number of classes, so the cross-entropy loss function can no longer be used. In this paper, the mean square error function is used as the optimization target of our classification task. Let the encoder learn a mapping function from the input image to the latent features. Then, the loss function of classification is:
(6) 
where the target vector is the predefined centroid of the class to which the input sample belongs.
3.2.2 Loss function of Image Codec Error
Let the decoder learn a mapping function from the latent features to the output image. In order to make the decoder's output as similar as possible to the input image, the second term of our loss function calculates the pixel-wise mean square error between the input image and the output image.
(7) 
where the average is taken over the image dimension, i.e. the number of pixels.
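The two mean-square-error terms of Sections 3.2.1 and 3.2.2 can be illustrated as follows. This is a sketch under assumptions: the function names are hypothetical, and averaging over both the vector dimension and the batch is a normalization chosen for this example rather than taken from the paper.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def classification_loss(latents, labels, centroids):
    """Loss1 sketch: mean MSE between each latent vector and the
    predefined centroid of its class."""
    return sum(mse(z, centroids[y])
               for z, y in zip(latents, labels)) / len(latents)


def reconstruction_loss(inputs, outputs):
    """Loss2 sketch: mean pixel-wise MSE between flattened input images
    and the corresponding decoder outputs."""
    return sum(mse(x, x_hat)
               for x, x_hat in zip(inputs, outputs)) / len(inputs)


# Toy data: two classes with 2-dimensional predefined centroids.
centroids = [[1.0, 0.0], [-1.0, 0.0]]
latents = [[0.8, 0.2], [-1.0, 0.0]]
labels = [0, 1]
loss1 = classification_loss(latents, labels, centroids)

# One flattened 4-pixel image and its reconstruction.
loss2 = reconstruction_loss([[0.0, 1.0, 1.0, 0.0]], [[0.0, 0.5, 1.0, 0.0]])
```

Because the targets are fixed vectors rather than one-hot labels, the same squared-error form serves both the classification and the reconstruction terms.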
3.2.3 Loss Function for Enhancing Subjective Quality of Decoded Image
Because the VAE generation model relies on resampling and uses the mean square error as its measure of image similarity, the images generated by VAE models are noticeably blurred. To improve image quality, this paper adds a term based on the Sobel operator. Ideally, the generated image should be consistent with the input image, and this consistency should include clear edges.
In order to generate better quality images, a Laplacian pyramid loss is applied in bojanowski2017optimizing . Because we add noise to the latent variables in the training phase, and the Laplacian operator is very sensitive to noise, it is no longer applicable here. We choose the Sobel operator, which is inherently a band-pass filter, and apply it to the input image and to the decoder output respectively. If we applied an L1 or L2 penalty directly after the Sobel operation, it would be equivalent to reducing the pixel-to-pixel error between the two gradient images. However, adding noise to the latent variables in the training phase generates slightly different samples, so the two operations contradict each other. Experiments show that this leads to broken strokes in the generated samples for the MNIST dataset. Therefore, we first obtain the gradient intensity of each image after the Sobel operator, and then apply the L1 penalty to the difference of their mean powers. That is, the edge positions do not need to correspond exactly, but the mean power of the gradient images should be the same, which preserves edge sharpness.
(8) 
(9) 
where the mean power is the mean square of the gradient intensity. The final loss function is then:
(10) 
where the two weighting coefficients are constants used to balance the three loss terms. Both coefficients are held fixed in the training phase.
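The Sobel term compares the mean gradient power of two images rather than their gradient maps pixel by pixel. The sketch below illustrates that idea in plain Python on small 2-D lists. It is an assumption-laden illustration: the function names, the restriction to interior pixels (no padding), the placeholder weights `w2` and `w3`, and the form of the combined loss (Loss1 plus weighted Loss2 and Loss3) are choices of this example.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def gradient_power(image):
    """Mean squared Sobel gradient intensity over the interior pixels."""
    rows, cols = len(image), len(image[0])
    total, count = 0.0, 0
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = sum(SOBEL_X[i][j] * image[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3))
            gy = sum(SOBEL_Y[i][j] * image[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3))
            total += gx * gx + gy * gy  # squared gradient intensity
            count += 1
    return total / count


def sobel_loss(image, reconstruction):
    """Loss3 sketch: absolute difference of the two mean gradient powers."""
    return abs(gradient_power(image) - gradient_power(reconstruction))


def total_loss(loss1, loss2, loss3, w2=1.0, w3=1.0):
    """Weighted sum of the three terms; w2 and w3 are placeholder weights."""
    return loss1 + w2 * loss2 + w3 * loss3


# A sharp vertical edge versus a completely flat image.
sharp = [[0, 0, 1, 1] for _ in range(4)]
flat = [[0, 0, 0, 0] for _ in range(4)]
edge_gap = sobel_loss(sharp, flat)
same = sobel_loss(sharp, sharp)
```

Since only a single scalar per image is compared, two reconstructions whose edges are equally sharp but slightly displaced incur a similar penalty, which is the property the text relies on when noise perturbs the latent variables.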
The loss function of CSAE no longer contains a Kullback-Leibler divergence term, which reflects that this article attempts to train the autoencoder from a new perspective. That is, unlike the VAE model based on Bayes' theorem, the encoder and the decoder here are simply complex nonlinear mappings to be learned, and they are statistically independent of each other.
3.3 Architecture
In recent years, with the rapid development of convolutional neural networks, many excellent networks have been proposed, such as simonyan2014very ; hu2018squeeze , and a great deal of work has proved their superiority. In order to extract more effective features, unlike doersch2016tutorial , this paper uses the shortcut-connection convolutional network structure proposed by he2016deep in place of conventional fully connected layers to build the autoencoder. That is, convolution blocks followed by a linear layer form the encoder, and a linear layer followed by deconvolution layers forms the decoder. The specific structure is shown in Figure 1, and the output of the encoder is the input of the decoder. Here, each convolution block is a residual block. CSAE is not limited to this structure.
Because the MNIST dataset is rather simple, once the input image has been reduced to a low spatial resolution, only one fully connected layer is used to map the features directly to the d-dimensional latent variables. The decoder likewise consists of a fully connected layer followed by several deconvolution layers. Its last layer is a convolution layer that directly sets the number of output channels, without a Sigmoid layer, which is another difference between this paper and the CVAE structure. Table 1 gives the network structure of CSAEC used in this paper.
Network  Operation  Input dims  Output dims
Encoder  Convolution
Encoder  Residual block
Encoder  Residual block
Encoder  Residual block
Encoder  Fully-connected
Decoder  Fully-connected
Decoder  Deconvolution block
Decoder  Deconvolution block
Decoder  Deconvolution block
Decoder  Convolution
3.4 Adding Noise to Latent Variables in Training Phase
In VAE, the input of the decoder is sampled from the distribution of the latent variables. This sampling is not a continuous operation and has no gradient, so back-propagation cannot pass through it during training with stochastic gradient descent. To solve this problem, VAE introduced the reparameterization trick: a sample is drawn from the normal distribution N(0, I), multiplied by the standard deviation, and added to the mean. CSAE uses the output of the encoder directly as the input of the decoder, and no longer requires reparameterization as in kingma2013auto ; doersch2016tutorial. However, we are inspired by the reparameterization process, in which the mean can be regarded as the output of the encoder and the term scaled by the standard deviation is equivalent to noise that we add. Adding noise in the training phase makes the decoder less sensitive to changes in the input features, so that its generalization ability (including interpolation and extrapolation) improves and its decoded results are more stable. Based on this understanding, before feeding the latent features to the decoder, CSAE randomly generates d-dimensional noise from N(0, I) and adds it to the latent variables to form the decoder input. As mentioned in Section 3.1, after normalizing the predefined centers onto the surface of the hypersphere, we multiply them all by a constant coefficient to expand the value space of the latent variables. Similarly, to adapt to this enlarged value space, we also amplify the noise, which follows the standard normal distribution. The amplified noise is:
(11) 
where the constant coefficient alpha is the same as in Section 3.1, and the noise factor is adjustable within the range [0,1]. Its optimal value is determined by experiments, which are discussed in detail later. The reconstructed decoder input is:
(12) 
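The training-time noise injection can be sketched as below. This is an illustration only: `add_latent_noise`, `alpha` and `noise_factor` are names introduced for this example, the numeric values are placeholders, and the sketch assumes that the amplified noise is the product of the noise factor, the coefficient alpha and a standard normal sample.

```python
import random


def add_latent_noise(latent, alpha, noise_factor, rng):
    """Return latent + noise_factor * alpha * eps with eps ~ N(0, I)."""
    scale = noise_factor * alpha
    return [z + scale * rng.gauss(0.0, 1.0) for z in latent]


rng = random.Random(0)
z = [0.5, -1.0, 2.0]  # encoder output for one sample (toy values)

# Training phase: the decoder receives a perturbed latent vector.
z_train = add_latent_noise(z, alpha=10.0, noise_factor=0.04, rng=rng)

# A noise factor of zero leaves the encoder output unchanged.
z_clean = add_latent_noise(z, alpha=10.0, noise_factor=0.0, rng=rng)
```

Scaling the noise by the same alpha that enlarges the centroids keeps the perturbation proportionate to the spacing of the class centers, so the noise factor alone controls how strongly the decoder is regularized.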
4 Experiments
In this section, we conduct several comparative experiments on the MNIST and Fashion-MNIST datasets to demonstrate the effectiveness of CSAE. The experiments are implemented in the PyTorch paszke2017pytorch framework. Implementations of CSAE can be found online at: https://github.com/anlongstory/CSAE
To facilitate comparison, this paper designs two CSAE structures: CSAE with linear layers (CSAEL), mainly for comparison with the fully connected construction of CVAE, and CSAE with convolution (CSAEC), as proposed in Section 3. We first design CSAEL and CVAE with the same autoencoder structure, in which the encoder and the decoder each contain only two fully connected layers and the dimension of the latent variables is 40. The node counts of the network are 784-400-40-400-784. The Adam kingma2014adam optimizer is used with a constant learning rate of 1e-3. In this way, the models are trained for different numbers of epochs and their results are compared.
4.1 Reconstruction performance
We trained the models for different numbers of epochs. CVAE converged quickly, so it was trained for at most 100 epochs. Because the learning rate was small and CSAE uses a redesigned loss function, we increased the number of CSAE training epochs to 200 (experiments showed that Loss2 and Loss3 were still declining after 200 epochs). The images of the MNIST test set are then input to the models for reconstruction, as shown in Figure 3. The CVAE results are blurred. As the number of iterations increases, the CSAE reconstruction results become clearer. The Sobel loss is added to make the results clearer. At the same time, adding noise is designed to make the network more robust to noise, which blurs the results to a certain extent. We keep the Loss3 coefficient constant and vary the noise factor. As the noise factor increases, the CSAE reconstructions gradually become blurred. Overall, however, the CSAE reconstruction is still better than that of CVAE under the same conditions.
We did the same experiment with CSAEC, which is described in Section 3. In terms of training hyperparameters, thanks to batch normalization ioffe2015batch, we set the initial learning rate to 0.1, train for 120 epochs, and reduce the learning rate by a factor of 10 every 30 epochs. The SGD optimizer is used with a momentum of 0.9 and a weight decay of 0.0005. In order to match the input and output dimensions, we padded the MNIST training images. In view of the above results, we ran a comparative experiment on Fashion-MNIST as well. For comparison, we directly give the results after 120 training epochs, using the same setting for CSAEL and CSAEC. As shown in Figure 4 and Figure 7, for both reconstruction and random sample generation, CSAEL and CSAEC are clearly superior to CVAE.
4.2 Generating Random Samples
In CVAE, the mean and variance of the latent variables are obtained through network learning, and a random sample is generated from a point drawn from the standard normal distribution, which is passed through the reparameterization and then input into the decoding network. We repeated this 10 times, each time adding the one-hot conditions 0-9, so that 100 randomly generated samples were obtained. CSAE computes the 40-dimensional feature vectors of all training images and groups the feature vectors of each class together to calculate their mean and covariance matrix. According to the mean and covariance of each class, 10 random points are sampled and input into the decoding network, so that 100 random samples are again obtained.
As the noise factor gradually increases, the strokes of the generated samples improve, indicating that the robustness of the model is improved: the generated digits become sharper and the generation quality becomes better. However, when the noise factor is too large, the model is more robust to noise in the input latent vectors, but the decoded images also tend toward a more standard form, which reduces the diversity of the samples. There is a trade-off between diversity and robustness: when one rises, the other declines. In CVAE, once the training process is fixed, it is difficult to control the balance between diversity and robustness of the generated samples. CSAE is different: we can change the coefficients in Loss2 and Loss3 to trade off diversity against robustness, so that training is more flexible. As the number of CSAE training iterations increases, the sharpness and completeness of the generated samples improve, which also shows that Loss2 and Loss3 are continually optimizing the model. The sharpness is better than that of CVAE under the same conditions.
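The per-class sampling step that CSAE uses to generate random samples (estimate the mean and covariance of one class's latent features, then draw new latent points) can be sketched in plain Python. The helper names, the small diagonal `jitter`, the unbiased covariance estimate and the toy 2-dimensional features are assumptions of this example; the paper works with 40-dimensional features and feeds the sampled points to the decoder.

```python
import math
import random


def mean_and_cov(vectors):
    """Sample mean and unbiased covariance matrix of a list of vectors."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[k] for v in vectors) / n for k in range(d)]
    cov = [[sum((v[a] - mean[a]) * (v[b] - mean[b]) for v in vectors) / (n - 1)
            for b in range(d)] for a in range(d)]
    return mean, cov


def cholesky(matrix, jitter=1e-9):
    """Lower-triangular L with L L^T = matrix + jitter * I."""
    d = len(matrix)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            acc = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(matrix[i][i] + jitter - acc)
            else:
                low[i][j] = (matrix[i][j] - acc) / low[j][j]
    return low


def sample_latents(vectors, count, rng):
    """Draw `count` points from the Gaussian fitted to one class."""
    mean, cov = mean_and_cov(vectors)
    low = cholesky(cov)
    d = len(mean)
    samples = []
    for _ in range(count):
        eps = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # mean + L @ eps has the fitted mean and covariance.
        samples.append([mean[i] + sum(low[i][k] * eps[k] for k in range(i + 1))
                        for i in range(d)])
    return samples


# Toy latent features of a single class.
features = [[1.0, 0.1], [1.2, -0.1], [0.9, 0.0], [1.1, 0.2]]
mean, cov = mean_and_cov(features)
low = cholesky(cov)
samples = sample_latents(features, count=10, rng=random.Random(0))
```

Repeating this for each of the 10 classes with 10 draws per class gives the 100 latent points mentioned in the text.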
4.3 Classification Performance
CSAE links the latent variables directly to predefined class centroids, so that they can be used not only to generate samples but also to classify patterns directly, without adding a network structure. Figure 8 shows the Euclidean distances between the cluster centers of the latent variables (because this distance matrix is symmetric, only the upper part is shown to make it easier to read). The cluster centers for the digits "9" and "7" would otherwise be relatively close doersch2016tutorial ; kangcontinual , but because of the predefined evenly-distributed class centroids, the minimum distances between cluster centers are almost equal. This provides the best inter-class separation and improves the ability to classify patterns and extract features.
The following is an analysis of network recognition performance on the test dataset. CSAE uses the nearest class mean (NCM) for pattern classification, with the predefined class center serving as the mean of each class. Table 2 shows the classification accuracy on MNIST. With a structure of only two fully connected layers, a recognition rate of nearly 98% can be achieved, and the CSAE loss keeps decreasing as the number of iterations increases. These results illustrate the feasibility of combining predefined class centers with latent variables for classification. To further compare the classification performance of CSAE, we design a comparative experiment with two additional models. The first contains only the encoder part of CSAE with the nearest class mean classifier. The second is a convolutional neural network whose convolutional part is consistent with the CSAE encoder, with only the last fully connected layer replaced by an average-pooling layer and a Softmax layer; it is trained with the original labels and the cross-entropy loss function.
Recognition rate  

Model  With linear  With convolution  
98.08%  99.43%  
97.96%  99.42%  
97.88%  99.40%  
97.86%  99.40%  
—  99.39%  
—  99.24% 
As can be seen from Table 2, networks with the CSAE structure, whether the full CSAE or its encoder-only variant, can be slightly better than convolutional neural networks trained with cross-entropy. Compared with the encoder-only variant, the classification performance of CSAE is slightly improved, which shows that the decoder actually improves classification performance. In addition, with the convolutional neural network, the number of network parameters is clearly reduced and performance is clearly improved, which shows that the network structure has a great impact on the performance of the autoencoder. Table 3 shows the classification results on the Fashion-MNIST dataset. CSAEC is slightly worse than the best entry in the table (92.41% versus 92.55%), but the difference is very small, which further supports the validity of PEDCC classification.
Model  Accuracy 

CSAEL  89.31% 
CSAEC  92.41% 
92.26%  
92.55% 
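The nearest class mean rule used in this section, with the predefined centroids playing the role of class means, can be sketched as follows. The function names and the toy 2-dimensional centroids and latent vectors are illustrative assumptions.

```python
import math


def nearest_class_mean(latent, centroids):
    """Index of the centroid closest to `latent` in Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda c: math.dist(latent, centroids[c]))


def accuracy(latents, labels, centroids):
    """Fraction of latent vectors assigned to their true class."""
    hits = sum(nearest_class_mean(z, centroids) == y
               for z, y in zip(latents, labels))
    return hits / len(labels)


# Three evenly spaced centroids on the unit circle (toy example).
centroids = [[1.0, 0.0], [-0.5, 0.866], [-0.5, -0.866]]
latents = [[0.9, 0.1], [-0.4, 0.7], [-0.6, -0.9], [0.2, 0.5]]
labels = [0, 1, 2, 0]

predictions = [nearest_class_mean(z, centroids) for z in latents]
acc = accuracy(latents, labels, centroids)
```

No extra classification layer is needed: the encoder output is compared against the fixed centroids directly, which is why the classifier adds no parameters to the network.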
4.4 Generalization Performance
In order to verify the generalization ability of the model, we test its extrapolation ability in experiments, while its interpolation ability is reflected in the generated sample quality and classification performance discussed earlier. For the model trained on the MNIST dataset, we randomly picked 10 classes from the letter subset of the EMNIST cohen2017emnist dataset. The model has never seen any sample of these characters, and the samples are input directly into the model for reconstruction. The letters reconstructed by CVAE belong only to the MNIST classes: it cannot rebuild a class that it has never seen. For CSAE, we test different values of the noise factor, including zero, i.e. no random noise added. When no random noise is added, the characters can be reconstructed. As the noise factor increases, the reconstruction of the characters becomes more and more blurred, and at the largest value it is also difficult for CSAE to reconstruct letters that it has not seen. Because CSAE changes the combination of the loss function, we can artificially adjust the generalization ability of the model by changing the noise factor, depending on the usage scenario (obtaining a more robust model or a more generalized model). To sum up, through these adjustable coefficients we can tune the performance of CSAE in interpolation and extrapolation, which makes it more flexible than CVAE.
4.5 Loss Function
In the training process, the values of Loss1, Loss2 and Loss3 each have their own significance. Loss1 is an indicator of classification, whose value represents the degree of aggregation within each class, while Loss2 and Loss3 can be regarded as reflecting the quality of generation. The validity of CSAE has been verified by the classification accuracy and the quality of the generated data. Here we give a further numerical analysis. For CSAEC, we further verify the rationality of this loss function by comparing the values obtained with and without the Sobel loss, with the noise factor set to 0.04. After training for 100 epochs, the accuracy and losses are shown in Table 4:
CSAEC  Train Loss1  Train Loss2  Train Loss3  Test Loss1  Test Loss2  Test Loss3  Accuracy 

Without Loss3  0.0271  0.1029  0.0883  0.0368  0.0789  0.0910  99.36% 
With Loss3  0.0329  0.1112  0.0006  0.0422  0.0802  0.0076  99.42% 
In Table 4, we marked the smaller value of the two cases. We can see that adding the Sobel loss has an impact on the network. With the Sobel loss, Loss1 becomes larger, which means that the aggregation within each class becomes slightly more dispersed under the noise and the Sobel constraint. This is more favorable to the generalization ability of the model, and the direct evidence is that the accuracy also increases. Loss3 is significantly reduced, indicating that the difference between the generated image and the input image is further reduced by the Sobel constraint, which is consistent with the results shown in Section 4.1.
5 Conclusions
This paper introduces a new autoencoder structure named CSAE. By setting predefined evenly-distributed class centroids (PEDCC), the autoencoder gains a classification function at the same time: the intermediate latent variables obtained by the encoder are used not only for decoding, but also directly for classification. PEDCC guarantees the largest distance between the centers of different classes, which makes it easier to classify and to generate samples. In theory, the model can generate any number of samples by sampling within the distribution of each class. In our work, a new loss function is constructed that is more flexible in parameter selection. We can change the coefficients used in training to control the final reconstruction results and the quality of the generated samples, and we validate this claim through visual results and numerical analysis. Future work will explore CSAE on more complex datasets, such as complicated faces and natural images, and on broader tasks such as incremental learning, semantic segmentation and style transfer.
References
 (1) D. P. Kingma, M. Welling, Autoencoding variational bayes, arXiv preprint arXiv:1312.6114.
 (2) C. Doersch, Tutorial on variational autoencoders, arXiv preprint arXiv:1606.05908.
 (3) D. P. Kingma, S. Mohamed, D. J. Rezende, M. Welling, Semisupervised learning with deep generative models, in: Advances in neural information processing systems, 2014, pp. 3581–3589.

(4)
L. Deng, The mnist database of handwritten digit images for machine learning research [best of the web], IEEE Signal Processing Magazine 29 (6) (2012) 141–142.
 (5) W.Y. Kang, B. Zhang, Continual learning with generative replay via discriminative variational autoencoder.
 (6) H. Xiao, K. Rasul, R. Vollgraf, Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, arXiv preprint arXiv:1708.07747.
 (7) B. A. Olshausen, D. J. Field, Emergence of simple-cell receptive field properties by learning a sparse code for natural images, Nature 381 (6583) (1996) 607.
 (8) H. Lee, A. Battle, R. Raina, A. Y. Ng, Efficient sparse coding algorithms, in: Advances in neural information processing systems, 2007, pp. 801–808.
 (9) P. Vincent, H. Larochelle, Y. Bengio, P.-A. Manzagol, Extracting and composing robust features with denoising autoencoders, in: Proceedings of the 25th international conference on Machine learning, ACM, 2008, pp. 1096–1103.
 (10) Y. Bengio, L. Yao, G. Alain, P. Vincent, Generalized denoising autoencoders as generative models, in: Advances in Neural Information Processing Systems, 2013, pp. 899–907.
 (11) D. J. Rezende, S. Mohamed, D. Wierstra, Stochastic backpropagation and approximate inference in deep generative models, arXiv preprint arXiv:1401.4082.
 (12) K. Sohn, H. Lee, X. Yan, Learning structured output representation using deep conditional generative models, in: Advances in neural information processing systems, 2015, pp. 3483–3491.

 (13) J. Walker, C. Doersch, A. Gupta, M. Hebert, An uncertain future: Forecasting from static images using variational autoencoders, in: European Conference on Computer Vision, Springer, 2016, pp. 835–851.
 (14) G. Pandey, A. Dukkipati, Variational methods for conditional multimodal learning: Generating human faces from attributes, arXiv preprint arXiv 1603.
 (15) P. Bojanowski, A. Joulin, D. Lopez-Paz, A. Szlam, Optimizing the latent space of generative networks, arXiv preprint arXiv:1707.05776.
 (16) Y. Hoshen, J. Malik, Non-adversarial image synthesis with generative latent nearest neighbors, arXiv preprint arXiv:1812.08985.
 (17) M. McCloskey, N. J. Cohen, Catastrophic interference in connectionist networks: The sequential learning problem, in: Psychology of learning and motivation, Vol. 24, Elsevier, 1989, pp. 109–165.
 (18) F. Lavda, J. Ramapuram, M. Gregorova, A. Kalousis, Continual classification learning using generative models, arXiv preprint arXiv:1810.10612.
 (19) W. Liu, Y. Wen, Z. Yu, M. Yang, Large-margin softmax loss for convolutional neural networks, in: ICML, Vol. 2, 2016, p. 7.
 (20) F. Wang, J. Cheng, W. Liu, H. Liu, Additive margin softmax for face verification, IEEE Signal Processing Letters 25 (7) (2018) 926–930.
 (21) A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, Mobilenets: Efficient convolutional neural networks for mobile vision applications, arXiv preprint arXiv:1704.04861.
 (22) X. Zhang, X. Zhou, M. Lin, J. Sun, Shufflenet: An extremely efficient convolutional neural network for mobile devices, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6848–6856.
 (23) K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
 (24) K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556.
 (25) J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.

 (26) A. Paszke, S. Gross, S. Chintala, G. Chanan, PyTorch: Tensors and dynamic neural networks in Python with strong GPU acceleration.
 (27) D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.
 (28) S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: International Conference on Machine Learning, 2015, pp. 448–456.
 (29) G. Cohen, S. Afshar, J. Tapson, A. van Schaik, Emnist: an extension of mnist to handwritten letters, arXiv preprint arXiv:1702.05373.