Deep Neural Networks (DNNs) have achieved state-of-the-art performance in many machine learning tasks. Nevertheless, recent advances of DNNs in image classification, with accuracy close to the human level, have raised a variety of questions. Do DNNs develop an understanding of objects based on the training data and recognize them semantically? Or are they only very good mappers from the input to the label data? How would DNNs classify unseen or unrecognizable objects?
Despite the high performance of DNNs, there is doubt regarding their semantic generalization. Many studies have been conducted to validate or reject the hypothesis that DNNs develop an understanding of objects based on training data. Hosseini et al. trained different architectures on regular images and tested their performance on negative examples. As negative examples have the same structure as the regular ones, humans can easily recognize them. However, in their experiments, the performance of DNNs dropped significantly when tested on negative images, and they concluded that current methods of training DNNs fail to semantically recognize objects.
Moreover, recent DNN architectures are usually trained end-to-end, with the entire architecture optimized according to a specific loss function. This causes a scalability problem, as the entire network must be re-trained whenever new classes are added. In real-world applications of machine learning, new classes may need to be added on a regular basis, and with the current end-to-end training approach each addition requires re-training the network.
Furthermore, in the classical machine learning scheme, it is assumed that the probe always belongs to one of the known classes. As a result, any unseen object would be classified as one of these classes. This formulation leads to many issues in practice. For example, a self-driving car might have to classify a road sign that it has not seen before, which could cause a dangerous situation. One might claim that this issue could be resolved simply by adding unseen examples to the training data. However, in a real-world application, it is infeasible to add all unseen examples to the training data. Moreover, training DNNs on both correct and incorrect classes does not bring a reasonable enhancement.
In this paper, we aim at testing the hypothesis that a good compressor or generator trained per class can also be a good classifier. This would lead to an idea of meaningful semantic training, which is not the case for existing DNN architectures so far. Therefore, we introduce the concept of “classification by re-generation”. For this purpose, we train a Variational Auto-Encoder (VAE) for each class of data and classify the probe based on the reconstructed output images. Moreover, with this pipeline, there is no need to re-train the whole network when new classes are added; a new class can be added easily on the fly.
Furthermore, in our pipeline, we exploit a classification metric based on the Kullback-Leibler divergence, which allows us to reject doubtful cases. This rejection option might be of importance in many physical or medical experiments, where trust in the obtained results is crucial. Additionally, such a metric is based on a complete distribution, whereas the output of a soft-max classifier represents just a point-wise estimate. Last but not least, the ability of the network to reject is useful in making the system robust against adversarial examples. Thus, the main motivations behind the rejection option are threefold:
reliable rejection of doubtful examples (automatically without human intervention);
trust in obtained results;
robustness to semantic adversarial examples and unseen objects.
In this study, we do not compete for an improvement in the classification accuracy with respect to the state-of-the-art end-to-end trained classification. Instead, we aim at introducing a new principle of classification based on re-generation towards scalability, interpretability, rejection, and trust in results.
I-A Related Work
There have been numerous studies regarding the use of generative models for classification. In 2014, Kingma et al. proposed a model for semi-supervised classification based on VAE. Their proposal achieved good performance, but it lacks interpretability and acts as a black-box discriminator.
Along the same line of research, Siddharth et al. proposed a framework to learn disentangled representations of data in the context of VAE. Their experiments showed promising results on the classification task. Gordon et al. explored the work detailed by Kingma et al. and introduced a slightly different inference network structure. The authors used a Bayesian neural network for label prediction.
Despite their good performance in the context of classification based on VAE, previous works lack interpretability of the trained features. Moreover, due to the end-to-end training process, the whole network must be re-trained when new classes are added. System scalability is important for large-scale datasets and real-world applications. Furthermore, in our proposed model, we exploit the primary goal of the VAE, namely generation, for the purpose of classification.
The rest of this paper is organized as follows. In section II, we briefly introduce the Variational Autoencoder framework. We then present our proposed model in section III. The experimental results are reported in section IV. Finally, section V concludes the paper.
II Variational Autoencoder
Generative models aim at learning the true distribution of data, $p(\mathbf{x})$, in order to generate new samples. To do so, they attempt to model the complex data by using latent variables. Two of the most commonly used and efficient approaches are the VAE (Kingma and Welling) and Generative Adversarial Networks (GANs) (Goodfellow et al.).
The VAE was first introduced by Kingma and Welling in 2014. The model consists of two networks: an encoder and a decoder. The input data is encoded to a latent representation, and samples are then generated by the decoder from the latent space.
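Assume we are given a dataset,
$$\mathbf{X} = \{\mathbf{x}^{(i)}\}_{i=1}^{N},$$
consisting of $N$ i.i.d. samples. In an unsupervised learning scheme, the log-likelihood of the observations is maximized under a probabilistic model:
$$\log p_{\theta}(\mathbf{X}) = \sum_{i=1}^{N} \log p_{\theta}(\mathbf{x}^{(i)}).$$
In VAE, one assumes that the data were generated from low-dimensional latent variables $\mathbf{z}$. The probability distribution of the latent variables is denoted by $p_{\theta}(\mathbf{z})$ and the marginal likelihood can be written as:
$$p_{\theta}(\mathbf{x}) = \int p_{\theta}(\mathbf{z})\, p_{\theta}(\mathbf{x}|\mathbf{z})\, d\mathbf{z}.$$
Due to the difficulty of working directly with the marginal likelihood, a parametric inference model $q_{\phi}(\mathbf{z}|\mathbf{x})$ is used. Thus, the marginal log-likelihood can be formulated as:
$$\log p_{\theta}(\mathbf{x}) = D_{KL}\big(q_{\phi}(\mathbf{z}|\mathbf{x}) \,\|\, p_{\theta}(\mathbf{z}|\mathbf{x})\big) + \mathcal{L}(\theta, \phi; \mathbf{x}),$$
where $\theta$ and $\phi$ indicate the generative and variational parameters and $D_{KL}$ is the Kullback-Leibler divergence. As the Kullback-Leibler divergence is non-negative, the term $\mathcal{L}(\theta, \phi; \mathbf{x})$ is a lower bound on the marginal log-likelihood. Therefore:
$$\log p_{\theta}(\mathbf{x}) \geq \mathcal{L}(\theta, \phi; \mathbf{x}) = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\big[\log p_{\theta}(\mathbf{x}|\mathbf{z})\big] - D_{KL}\big(q_{\phi}(\mathbf{z}|\mathbf{x}) \,\|\, p_{\theta}(\mathbf{z})\big).$$
The first term of the loss function corresponds to the reconstruction error of the decoder and the second term is the Kullback-Leibler divergence between the prior distribution $p_{\theta}(\mathbf{z})$ and the learned latent posterior $q_{\phi}(\mathbf{z}|\mathbf{x})$. The prior distribution is usually chosen to be a centered isotropic multivariate Gaussian, $\mathcal{N}(\mathbf{0}, \mathbf{I})$. Using the re-parameterization trick, VAE optimizes the lower bound (Kingma and Welling; Rezende et al.).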
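As a rough illustration of the two terms of the lower bound, the following sketch evaluates a single-sample estimate for one data point. It assumes a diagonal Gaussian posterior, a standard normal prior, and a Bernoulli decoder; the function names and the `toy_decode` stand-in are assumptions made for this example and are not the implementation used in the paper.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (re-parameterization trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def bernoulli_log_likelihood(x, x_hat, eps=1e-7):
    """log p(x|z) for a Bernoulli decoder; its negative is the cross-entropy."""
    total = 0.0
    for xi, pi in zip(x, x_hat):
        pi = min(max(pi, eps), 1.0 - eps)  # keep the logarithms finite
        total += xi * math.log(pi) + (1.0 - xi) * math.log(1.0 - pi)
    return total


def elbo_estimate(x, mu, log_var, decode, rng):
    """Single-sample Monte Carlo estimate of the lower bound for one data point."""
    z = reparameterize(mu, log_var, rng)
    return bernoulli_log_likelihood(x, decode(z)) - kl_to_standard_normal(mu, log_var)


def toy_decode(z):
    """Hypothetical stand-in decoder: a sigmoid of each latent coordinate."""
    return [1.0 / (1.0 + math.exp(-zi)) for zi in z]


rng = random.Random(0)
x = [1.0, 0.0, 1.0]
print(elbo_estimate(x, [0.5, -0.5, 0.5], [-1.0, -1.0, -1.0], toy_decode, rng))
```

In a trained model, `mu` and `log_var` would be produced by the encoder for the given input, and the negative of this estimate would serve as the training loss.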
III Proposed Model
VAE has already been proposed for semi-supervised classification settings. In this paper, we investigate the potential of VAE from another perspective. The main idea is to exploit the primary objective of the VAE, namely generation, for the purpose of classification, while the VAE for each class is trained in an unsupervised way.
The main principle is to train a separate VAE for each class so that it captures the statistical distribution of that class, as shown in Fig. 1. Given a probe image, we pass it through the encoder-decoder pair of each VAE, and we argue that the best reconstruction is achieved by the VAE of the class to which the probe belongs. An example of recognition is shown in Fig. 2. The main arguments behind this architecture are the interpretability of the learned features and scalability.
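The decision rule described above can be sketched as follows. The per-class models here are hypothetical stand-ins (each simply pulls its input toward a fixed class prototype) rather than trained VAEs, and all names are assumptions made for illustration.

```python
import math


def cross_entropy(x, x_hat, eps=1e-7):
    """Binary cross-entropy between a probe x and its reconstruction x_hat."""
    total = 0.0
    for xi, pi in zip(x, x_hat):
        pi = min(max(pi, eps), 1.0 - eps)
        total -= xi * math.log(pi) + (1.0 - xi) * math.log(1.0 - pi)
    return total


def make_prototype_model(prototype):
    """Hypothetical stand-in for a trained per-class VAE: it reconstructs
    every input toward the class prototype."""
    def reconstruct(x):
        return [0.8 * p + 0.2 * xi for p, xi in zip(prototype, x)]
    return reconstruct


def classify(probe, models):
    """Return the label whose model reconstructs the probe with the least distortion."""
    scores = {label: cross_entropy(probe, model(probe)) for label, model in models.items()}
    return min(scores, key=scores.get), scores


models = {
    "vertical": make_prototype_model([0.9, 0.1, 0.9, 0.1]),
    "horizontal": make_prototype_model([0.9, 0.9, 0.1, 0.1]),
}
first_label, _ = classify([1.0, 0.0, 1.0, 0.0], models)
print(first_label)

# A new class is added by registering one more model; existing models are untouched.
models["diagonal"] = make_prototype_model([0.9, 0.1, 0.1, 0.9])
second_label, _ = classify([1.0, 0.0, 0.0, 1.0], models)
print(second_label)
```

In this toy setting the first probe is assigned to `vertical` and the second to `diagonal`, which illustrates how a class can be added without re-training the other models.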
A variety of metrics can be applied to measure the similarity between the input and the reconstructed outputs. In this study, we focus on cross-entropy as the measure of similarity. Because of the hidden random state in the VAE, a point-wise estimate of the cross-entropy has many drawbacks. To address this issue, we re-generate the output 300 times for the same probe and make the decision based on an estimate of the PDF of the cross-entropies. This can be extended to any other cost function. Moreover, the PDF estimate of the cross-entropies enables us to define a criterion for rejecting doubtful examples based on the Kullback-Leibler divergence.
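A minimal sketch of this repeated re-generation is given below. The stochastic per-class models are hypothetical stand-ins for trained VAEs, the PDF is estimated with a plain normalized histogram, and ranking the classes by the sample mean of the cross-entropies is an assumption made for this example.

```python
import math
import random


def cross_entropy(x, x_hat, eps=1e-7):
    """Binary cross-entropy between a probe x and its reconstruction x_hat."""
    total = 0.0
    for xi, pi in zip(x, x_hat):
        pi = min(max(pi, eps), 1.0 - eps)
        total -= xi * math.log(pi) + (1.0 - xi) * math.log(1.0 - pi)
    return total


def make_stochastic_model(prototype, noise_std, rng):
    """Hypothetical stand-in for a per-class VAE with a random latent state:
    each call returns a different noisy reconstruction."""
    def reconstruct(x):
        out = []
        for p, xi in zip(prototype, x):
            value = 0.8 * p + 0.2 * xi + rng.gauss(0.0, noise_std)
            out.append(min(max(value, 0.0), 1.0))
        return out
    return reconstruct


def cross_entropy_samples(probe, model, n_draws=300):
    """Re-generate the output n_draws times and collect the cross-entropies."""
    return [cross_entropy(probe, model(probe)) for _ in range(n_draws)]


def histogram(samples, lo, hi, n_bins=30):
    """Normalized histogram as a simple estimate of the PDF of the samples."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins or 1.0
    for s in samples:
        idx = min(max(int((s - lo) / width), 0), n_bins - 1)
        counts[idx] += 1
    return [c / len(samples) for c in counts]


rng = random.Random(0)
models = {
    "vertical": make_stochastic_model([0.9, 0.1, 0.9, 0.1], 0.05, rng),
    "horizontal": make_stochastic_model([0.9, 0.9, 0.1, 0.1], 0.05, rng),
}
probe = [1.0, 0.0, 1.0, 0.0]
samples = {label: cross_entropy_samples(probe, m) for label, m in models.items()}
lo = min(min(s) for s in samples.values())
hi = max(max(s) for s in samples.values())
pdfs = {label: histogram(s, lo, hi) for label, s in samples.items()}

# Rank classes by mean cross-entropy (smallest distortion first).
ranking = sorted(samples, key=lambda label: sum(samples[label]) / len(samples[label]))
print(ranking)
```

All histograms share the same bin edges so that they can later be compared bin by bin.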
The rejection criterion is defined as the Kullback-Leibler divergence between the PDF of the second highest score and the PDF of the highest one. The highest score is assigned to the class with the smallest reconstruction distortion. If this divergence is below a certain threshold, the classifier rejects the probe as doubtful. The distributions of the Kullback-Leibler divergence for correct and incorrect classifications on the MNIST dataset are shown in Fig. 3. As one can see, we reject the doubtful probes whose divergences are below the threshold, at the cost of missing some correct classifications, which is reported as the probability of miss in Section IV-A.
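The rejection rule can be sketched as follows, assuming the two distributions are estimated by histograms over shared bins. The additive smoothing, the bin count, the threshold value, and the synthetic Gaussian samples are all assumptions made for this example; they are not taken from the experiments.

```python
import math
import random


def smoothed_histogram(samples, lo, hi, n_bins=20, alpha=1.0):
    """Histogram estimate of a PDF with additive smoothing so that no bin is empty."""
    counts = [alpha] * n_bins
    width = (hi - lo) / n_bins or 1.0
    for s in samples:
        idx = min(max(int((s - lo) / width), 0), n_bins - 1)
        counts[idx] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Discrete Kullback-Leibler divergence D_KL(p || q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def should_reject(best_samples, second_samples, threshold, n_bins=20):
    """Reject when the second-best class is too close to the best one."""
    lo = min(min(best_samples), min(second_samples))
    hi = max(max(best_samples), max(second_samples))
    p_best = smoothed_histogram(best_samples, lo, hi, n_bins)
    p_second = smoothed_histogram(second_samples, lo, hi, n_bins)
    divergence = kl_divergence(p_second, p_best)
    return divergence < threshold, divergence


# Synthetic cross-entropy samples (illustrative values only).
rng = random.Random(1)
best = [rng.gauss(80.0, 5.0) for _ in range(300)]
well_separated = [rng.gauss(160.0, 5.0) for _ in range(300)]
overlapping = [rng.gauss(82.0, 5.0) for _ in range(300)]

reject_far, d_far = should_reject(best, well_separated, threshold=1.0)
reject_close, d_close = should_reject(best, overlapping, threshold=1.0)
print(reject_far, reject_close)
```

With well-separated distributions the divergence is large and the probe is accepted; with overlapping distributions the divergence is small and the probe is rejected as doubtful.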
To clarify the need for a rejection criterion, examples of a correct classification and a probable misclassification on the MNIST dataset are shown in Fig. 4. In this figure, the histograms of cross-entropy distances are shown together with the corresponding images of the input, the highest score, and the second highest score. In the case of correct classification, the histogram of the correct class is well separated from the others. However, in the second case, there is an overlap between the histograms of the classes. More importantly, the correct class “2” is not recognizable even by humans.
As shown in Fig. 4, the classes are clearly separated in the correct case and overlap in the probably incorrect case.
Additionally, this pipeline can be extended to large-scale datasets. In the current architecture, one encoder-decoder pair is required per class, so the number of pairs grows with the number of classes. However, assuming a tree-based structure, this number can be reduced. In a tree-based structure, the classes are divided into groups of varying size, organized as a tree according to a similarity measure. Therefore, instead of training a VAE for each class, the VAEs in each layer of the tree are trained on groups of classes of varying size. To classify a probe image, the distances between the input and the reconstructed outputs are computed in each layer. Given these distances, the group to test in the next layer is selected. This process is repeated until the probe image is finally classified in the last layer.
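The layer-by-layer decision can be sketched as a simple tree descent. In this toy example each group is summarized by a one-dimensional prototype instead of a trained VAE, and the `Node` class, the squared distance, and the grouping of nine classes into three groups are assumptions made purely for illustration.

```python
def squared_distance(x, y):
    """Stand-in for the distance between a probe and a group's reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


class Node:
    """A group of classes with a hypothetical model summarized by a prototype."""

    def __init__(self, label, prototype, children=()):
        self.label = label
        self.prototype = prototype
        self.children = list(children)


def classify_with_tree(probe, root):
    """Descend the tree, testing only the children of the chosen group in each layer."""
    node, evaluations = root, 0
    while node.children:
        evaluations += len(node.children)
        node = min(node.children, key=lambda child: squared_distance(probe, child.prototype))
    return node.label, evaluations


# Nine toy classes at positions 0..8, grouped three by three.
groups = []
for start in (0, 3, 6):
    leaves = [Node("class_%d" % c, [float(c)]) for c in range(start, start + 3)]
    groups.append(Node("group_%d" % start, [float(start + 1)], leaves))
root = Node("root", [4.0], groups)

label, evaluations = classify_with_tree([5.2], root)
print(label, evaluations)
```

In this toy tree a probe is compared with 3 groups and then 3 classes, so 6 models are evaluated instead of 9.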
IV Experimental results
For our experiments, the MNIST and Fashion-MNIST datasets were used, as they are commonly known and tested for classification and generation. We used TensorFlow to implement the standard VAE with two-layer MLPs of 500 hidden units as encoder and decoder models. The default dimensionality of the latent variables was set to 15. For learning, we used Adam with a learning rate of 0.001.
The purpose of these experiments is to validate or reject the hypothesis that a well-trained generator can be used as a good classifier. We do not aim at competing with state-of-the-art end-to-end trained classifiers; instead, we want to evaluate a new principle of classification for better scalability and interpretability.
In all the following experiments, probability of miss and accuracy are defined as:
Additionally, for the outputs of the VAEs, the corresponding probability distributions of distances were obtained and ranked based on the minimum distance. Hence, a probe is rejected if:
$$D_{KL}(p_2 \,\|\, p_1) < \tau,$$
where $p_1$ and $p_2$ are the probability distributions of the first and second class in the ranking, respectively, and $\tau$ is the rejection threshold.
IV-A Impact of VAE parameters
In this section, we investigate the impact of different parameters of VAE, namely, dimensionality of latent variables and variance of random state, on the classification accuracy for MNIST and Fashion-MNIST datasets.
As shown in Table I for the MNIST dataset, the accuracy of the system increases as we increase the dimensionality of the latent variables. However, at a dimensionality of 20, we noticed over-fitting and a drop in the performance of the system. The results for the Fashion-MNIST dataset are reported in Table III, where the best performance is achieved for a dimensionality of 20.
We also investigated the effect of randomness in the decoder on the system accuracy; the results are reported in Tables II and IV. The randomness in the decoder is defined by the variance of the noise in the re-parameterization trick. We noticed that decreasing this variance actually improves the performance.
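To make the role of the noise variance concrete, the sketch below draws latent samples with the re-parameterization trick for two noise levels. The `noise_std` parameter and the helper names are assumptions made for this example; how the variance was varied in the experiments is not specified beyond the description above.

```python
import math
import random
import statistics


def reparameterize(mu, log_var, rng, noise_std=1.0):
    """z = mu + sigma * eps, with eps ~ N(0, noise_std^2).
    noise_std is a hypothetical knob for the variance of the random state."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, noise_std) for m, lv in zip(mu, log_var)]


def latent_spread(noise_std, n_draws=1000, seed=0):
    """Empirical standard deviation of one latent coordinate over repeated draws."""
    rng = random.Random(seed)
    draws = [reparameterize([0.0], [0.0], rng, noise_std)[0] for _ in range(n_draws)]
    return statistics.pstdev(draws)


for noise_std in (1.0, 0.1):
    print(noise_std, latent_spread(noise_std))
```

A smaller noise level concentrates the latent samples around the posterior mean, so repeated reconstructions of the same probe vary less.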
IV-B Impact of limited label data
Furthermore, several experiments were conducted to test the model performance in a semi-supervised setting. In these experiments, a limited number of labeled training samples, N, was used to train the network on the MNIST dataset.
There are various approaches to semi-supervised classification, including Transductive SVMs (TSVM), an extension of SVMs to limited labeled data. Rifai et al. proposed two approaches based on neural networks, namely Contractive Auto-Encoders (CAE) and the Manifold Tangent Classifier (MTC), to achieve high performance in semi-supervised classification. In CAE, the authors trained a two-layer deep network with the CAE objective function, whereas MTC is trained with tangent propagation.
In Table V, we compared our result with the state-of-the-art in semi-supervised classification setting. Although the performance of the proposed model is not better than the state-of-the-art, it is still competitive.
Considering the fact that the implementation was based on vanilla VAE without any pre-processing or techniques such as normalizing flows, the obtained results validate our hypothesis that classification based on re-generation and more specifically VAE has the potential to be further investigated.
IV-C Rejecting the not-seen objects
In order to evaluate our rejection criterion, we designed an experiment in which we trained our network on the MNIST dataset and tested it on the Fashion-MNIST dataset. The percentages of rejection for different values of the threshold are reported in Table VI. The accuracy on the original MNIST dataset for the same thresholds can be found in Table I.
|Threshold||Percentage of rejection|
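The quantity reported in this table can be computed as sketched below. The divergence values are hypothetical placeholders chosen for illustration and are not results from the experiments.

```python
def rejection_percentage(divergences, threshold):
    """Percentage of probes whose divergence falls below the threshold."""
    rejected = sum(1 for d in divergences if d < threshold)
    return 100.0 * rejected / len(divergences)


# Hypothetical divergences for probes from an unseen dataset (illustrative values only).
unseen_divergences = [0.1, 0.3, 0.4, 0.9, 1.5, 2.2, 0.2, 0.05]
for threshold in (0.5, 1.0, 2.0):
    print(threshold, rejection_percentage(unseen_divergences, threshold))
```

Raising the threshold rejects more unseen probes, but, as noted for the rejection criterion, it also increases the number of correct classifications that are missed on the original dataset.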
IV-D Interpretability of learned features
In order to investigate the interpretability of the learned features, we visualized the filters in the last layer of the decoder for different classes and compared them with those of one common VAE trained on all classes. As shown in Fig. 5, in the case of the class-specific VAEs, the filters clearly follow the structure and shapes of the corresponding class, whereas in the common VAE no specific pattern or structure is observed.
V Conclusion
Although deep neural networks have shown great performance in a variety of machine learning tasks, they do not semantically generalize well. Attempting to address this problem, we proposed the idea of “classification by re-generation”. In our proposed architecture, an encoder-decoder pair of a VAE is trained for each class of data, and the classification decision for the probe image is based on the reconstructed outputs of each class. Moreover, this architecture resolves the scalability issue of existing end-to-end trained classifiers, as it does not need to re-train the whole network when new classes are added.
Furthermore, the classification decision is based on a complete PDF of cross-entropy distances. Given the PDF of distances, we introduced a rejection criterion to avoid doubtful probes. In this way, we can ensure a certain level of trust in the produced result that can be semantically and visually validated.
The experimental results validated our idea of classification by re-generation and demonstrated that the proposed model has the potential for further investigation.
VI Future work
Future work includes implementing the tree-based structure to avoid the scalability issues in large-scale datasets. In addition to that, we intend to apply our approach to other datasets such as CIFAR-10 and boost our encoders and decoders with more recent techniques such as normalizing flows. We will also look into a successive VAE based on residuals to overcome the current problem of VAE related to the blurred nature of generated images.
- H. Hosseini, B. Xiao, M. Jaiswal, and R. Poovendran, “On the limitation of convolutional neural networks in recognizing negative images,” 16th IEEE International Conference on Machine Learning and Applications (ICMLA), 2017.
- A. Nguyen, J. Yosinski, and J. Clune, “Deep neural networks are easily fooled: High confidence predictions for unrecognizable images,” CoRR, vol. abs/1412.1897, 2014.
- D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Proceedings of the International Conference on Learning Representations, 2014.
- D. P. Kingma, D. J. Rezende, S. Mohamed, and M. Welling, “Semi-supervised learning with deep generative models,” in Advances in Neural Information Processing Systems 27, pp. 3581–3589, 2014.
- N. Siddharth, B. Paige, J. Van de Meent, A. Desmaison, N. D. Goodman, P. Kohli, F. Wood, and P. Torr, “Learning disentangled representations with semi-supervised deep generative models,” in Advances in Neural Information Processing Systems 30, pp. 5925–5935. Curran Associates, Inc., 2017.
- J. Gordon and J. Hernández-Lobato, “Bayesian semisupervised learning with deep generative models,” ICML Workshop on Principled Approaches to Deep Learning, 2017.
- I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, 2014, pp. 2672–2680.
- D. J. Rezende, S. Mohamed, and D. Wierstra, “Stochastic backpropagation and approximate inference in deep generative models,” in Proceedings of the 31st International Conference on Machine Learning, 2014, vol. 32, pp. 1278–1286.
- M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, Software available from tensorflow.org.
- D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 3rd International Conference for Learning Representations, 2014.
- T. Joachims, “Transductive inference for text classification using support vector machines,” in Proceedings of the International Conference on Machine Learning (ICML), 1999, vol. 99, pp. 200–209.
- S. Rifai, Y. N. Dauphin, P. Vincent, Y. Bengio, and X. Muller, “The manifold tangent classifier,” in Advances in Neural Information Processing Systems 24, pp. 2294–2302. Curran Associates, Inc., 2011.