1 Introduction
Image Classification refers to the task of categorizing images into one of several predefined classes. This task is one of the fundamental problems in computer vision and forms the basis of other computer vision tasks, such as segmentation and detection
[rawat2017deep]. Traditionally, the procedure of image classification consisted of two steps. First, handcrafted image features were extracted via feature descriptors, such as SIFT [lowe2004distinctive] and SURF [bay2008speeded], and then these features were used as input to a trainable classifier. The main limitation of this approach was that the accuracy depended heavily on the design of the feature extraction step, which was a time-consuming and labor-intensive task
[rawat2017deep]. Thankfully, Deep Learning models, and especially Convolutional Neural Networks (CNNs), have been shown to overcome this limitation and have become the state of the art for image recognition, classification, and detection tasks
[heras2020supervised, rawat2017deep, mallat2016understanding]. In other cases, the images are directly treated as high-dimensional vectors, where each variable corresponds to an image pixel. Unfortunately, the resulting ultra high-dimensional data have some non-intuitive characteristics. As presented in
[aggarwal2001surprising], the ratio of the distances of a data point to its nearest and furthest neighbors tends to 1 as the dimensionality grows, which not only makes the calculation of distances extremely expensive computationally but also negatively affects the performance of classification methods. Dimensionality Reduction methods have proven to be very effective in retaining the structure of the data, making them a useful tool.

Dimensionality Reduction is a widely used preprocessing step that facilitates classification, visualization and the storage of high-dimensional data [hinton2006reducing]. Especially for classification, it is utilized to increase the learning speed of the classifier, improve its performance and mitigate the effect of overfitting on small datasets through its noise reduction property [wang2014role]. The majority of Supervised Dimensionality Reduction techniques exploit the data and label pairs contained in the training dataset in order to learn the best dimensionality reduction mapping, and then use the resulting representations as input to a standard classification algorithm. These methodologies, which are the most common ones, encourage the dimensionality reduction mapping to separate the inputs or manifolds that have different labels from each other, which can sometimes be effective [wang2014role].
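The distance-concentration phenomenon described above is easy to reproduce numerically. The following sketch (the sample sizes and dimensionalities are arbitrary choices for illustration) compares the nearest-to-furthest distance ratio for uniformly random points in 2 and in 10,000 dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

def near_far_ratio(n_points, dim):
    """Ratio of the nearest to the furthest neighbour distance of one query
    point among uniformly random points in [0, 1]^dim."""
    X = rng.random((n_points, dim))
    d = np.linalg.norm(X[1:] - X[0], axis=1)  # distances to the first point
    return d.min() / d.max()

low = near_far_ratio(1000, 2)        # low-dimensional: ratio far from 1
high = near_far_ratio(1000, 10_000)  # high-dimensional: ratio close to 1
```

In low dimensions the nearest neighbour is much closer than the furthest one, while in 10,000 dimensions the two distances become nearly indistinguishable, which is exactly the behaviour reported in [aggarwal2001surprising].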
Artificial Neural Networks have been widely used for dimensionality reduction as well. Autoencoder Networks, which are a non-linear generalization of PCA [hinton2006reducing], have shown widespread success in producing powerful feature representations [duan2019improving]. This type of network allows the process of feature extraction to be learnable and automatic, providing a flexible and scalable solution to the problem of dimensionality reduction and feature extraction [duan2019improving]. Most importantly, in contrast to traditional dimensionality reduction methodologies, Autoencoders allow the utilization of deep learning architectures, such as convolutional networks, taking the local structure of images into consideration during feature extraction [guo2017deep].

In this work, we propose a methodology for supervised dimensionality reduction and classification based on a neural network architecture that optimizes the classification loss jointly with the reconstruction error, aiming to improve both classification performance and model explainability. The most well-established example of this approach in statistical machine learning is Linear Discriminant Analysis (LDA), which finds the best linear mapping, in terms of between-class against within-class scatter, that can also be used for classification tasks. However, according to
[wang2014role], LDA solves a difficult non-convex problem, especially for a non-linear dimensionality reduction mapping. To overcome this obstacle, we propose a novel optimization strategy that exploits a Convolutional Autoencoder for dimensionality reduction together with a neural network classifier, entitled the Convolutional Supervised Autoencoder (CSAE). Motivated by LDA, through this architecture we also focus on explainability, providing visualizations of the generated Latent Space. Furthermore, we show that the aforementioned Latent Space can greatly enhance the classification performance of traditional algorithms.

The major contributions of this study are summarized below:

A novel approach for supervised non-linear dimensionality reduction and classification of image data, focusing on explainability through the generated Latent Space.

The utilization of the classification-optimized low-dimensional representations by traditional classification algorithms to improve their performance.

An extensive study of the resulting classification boundaries and their properties, through the generated low-dimensional representations.

An extensive experimental analysis on real-world benchmark and biomedical image datasets that justifies the paper's assumptions.
2 Related Work
Images are characterized by high dimensionality, even when their size is relatively small, presenting a common case of the widely known curse of dimensionality. As described in
[aggarwal2001surprising], the ratio of the distances of a data point to its nearest and furthest neighbors tends to 1 as the dimensionality grows. This behaviour negatively affects the performance of machine learning methods. A solution to this problem came from dimensionality reduction methods, which have proven to be effective at retaining the data structure, making them a fruitful tool for the classification of high-dimensional data [wang2014role].

The goal of dimensionality reduction is to retain as much of the significant structure of the high-dimensional data as possible in the low-dimensional representation. Principal Component Analysis (PCA)
[pearson1901liii], projects the original data onto the directions of maximal variance in an unsupervised way. Linear Discriminant Analysis (LDA), as described in
[tharwat2017linear, wang2014generalized], is a supervised dimensionality reduction method that aims to find a linear subspace maximizing the between-class to within-class variance ratio, thus guaranteeing maximum class separability.

However, the manifold structure of real-world data types, such as images, is complicated. As argued in [wang2014generalized]
, dimensionality reduction methods that utilize a simple parametric model, such as Principal Component Analysis, or that learn the manifold by exploiting fixed, predefined data relations in the original high-dimensional space, which may not be valid on the manifold, e.g. ISOMAP [tenenbaum2000global], are not sufficient to capture such complicated structures. Thankfully, Neural Networks, and especially Autoencoders, have shown widespread success in producing powerful feature representations [duan2019improving], mitigating the previously presented limitation.

Autoencoder Neural Networks [rumelhart1985learning] are a non-linear generalization of Principal Component Analysis [hinton2006reducing]. This type of Neural Network has shown wide success as a tool for non-linear dimensionality reduction and feature extraction for clustering [duan2019improving, guo2017improved, xie2016unsupervised, nellas2021convolutional, makhzani2015adversarial, MRABAH2020206]
[gogna2016semi, rasmus2015semi] and classification [le2018supervised, nousi2020self, rolfe2013discriminative, gao2015single] tasks. Of particular interest, the Supervised Autoencoder (SAE), proposed in [le2018supervised], is a neural network that jointly predicts the input and the classification result. Moreover, in the aforementioned study, a proof of the uniform stability of the SAE with one hidden layer (linear SAE) is provided, and thus a bound on the generalization error is obtained. Finally, the authors empirically show that the addition of the reconstruction loss never harms performance when compared with the corresponding neural network.

Another recent methodology that utilizes Autoencoders for classification is presented in [nousi2020self]. Therein, the Latent Space of the Autoencoder is exploited to perform classification, while a fine-tuning of the learned representation is performed in a self-supervised fashion, forcing the Autoencoder to learn better separated low-dimensional representations. In an earlier study [rolfe2013discriminative]
, the discriminative recurrent sparse autoencoder model is proposed, which is composed of a recurrent encoder with Rectified Linear Units connected to two linear decoders that not only reconstruct the input but also predict the classification result. Simultaneously, the label information is embedded into the training of the Autoencoder by extending the error function to include the classification error. Moreover, supervised deep autoencoders were used for Face Recognition
[gao2015single]. Finally, in the study of [heras2020supervised], image features extracted from pretrained Convolutional Neural Networks were provided as input to Linear Discriminant Analysis in order to perform Supervised Dimensionality Reduction and Classification.

In this work, Supervised Dimensionality Reduction and Classification is studied similarly to [le2018supervised]. However, motivated by the aforementioned approaches, more emphasis is placed on deep learning architectures for the Image Classification task. In contrast to [le2018supervised], a novel optimization strategy is proposed, while extensive visualizations and explorations of the generated Latent Space are provided in order to deeply understand the behaviour of the proposed methodology in terms of performance and explainability, concerning its decision making, structure preservation and information capture. Additionally, the proposed methodology is characterized by lower complexity than the supervised Autoencoder presented in [nousi2020self], while, as opposed to [rolfe2013discriminative], pretraining is not required. In addition, we study the exploitation of the classification-optimized Latent Space of the Convolutional Supervised Autoencoder in order to improve existing classification algorithms. Most importantly, the Latent Space, the classification behavior and the explainability of the proposed methodology are extensively studied.
3 Proposed Methodology
In what follows, we present a novel optimization strategy for the classification and reconstruction errors. We exploit a Convolutional Autoencoder for dimensionality reduction, which preserves the local structure of the data-generating distribution, as presented in [guo2017improved], and a classifier, in the form of a Fully Connected Neural Network, in order to achieve the desired task. This methodology provides a framework for supervised non-linear dimensionality reduction and classification in an end-to-end manner, and is entitled the Convolutional Supervised Autoencoder (CSAE). Subsequently, we utilize the powerful latent representations of images that lie in the generated Latent Space as inputs to traditional classification algorithms, such as the k-Nearest Neighbors method, in order to improve their performance.
3.1 Convolutional Supervised Autoencoder
The primary tasks of this methodology are Supervised Dimensionality Reduction and Image Classification. The Convolutional Autoencoder, and thus the reconstruction error, is employed as an auxiliary task, in order not only to preserve the local structure of the data-generating distribution, as presented in [guo2017improved], but also to act as a regularizer for the solution. This promotes stability and achieves better generalization [le2018supervised]. In order to examine the non-linear dimensionality reduction and classification capabilities of the proposed methodology, the following hypothesis is formulated and studied: a non-linear data transformation generates a space on top of which the data are linearly separable.
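A minimal numerical illustration of this hypothesis uses the classic XOR labelling, which is not linearly separable in the input plane; the transformation below is hand-crafted purely for illustration, whereas in CSAE the analogous mapping is learned by the Encoder:

```python
import numpy as np

# XOR-labelled points: no single line in the input plane separates the classes.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# A simple non-linear transformation produces a 1-D space in which a
# linear rule (a threshold) separates the classes perfectly.
z = (X[:, 0] - X[:, 1]) ** 2
pred = (z > 0.5).astype(int)
```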
A Graphical Representation of the proposed methodology is presented in Figure 1, while the proposed procedure for supervised dimensionality reduction and classification is presented in Algorithm 1. In what follows, we describe one iteration, for a given batch of images and their labels:

Initially, a forward pass of the batch of training images through the Convolutional Autoencoder is performed.

The Loss Function of the Convolutional Autoencoder is evaluated, and its weights are updated through backpropagation.

Then, a forward pass of the batch of training images through the Classifier network is performed.

Finally, the Loss Function of the Classifier is evaluated, using the image labels contained in the batch, and its weights are updated through backpropagation.
This procedure is repeated until convergence or for a specified number of epochs. Subsequently, the Classifier Network is detached to be used as a standalone classifier, significantly reducing the number of parameters required for the classification task. Finally, as Loss Functions of the Convolutional Autoencoder and the Classifier Network, the Mean Squared Error (MSE) and the Categorical Cross-entropy are utilized, respectively.
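The alternating updates above can be made concrete with a small numpy sketch. For brevity, dense layers stand in for the convolutional Encoder and Decoder (an assumption of this illustration, not the proposed architecture), and the classification update is also propagated into the shared encoder weights, reflecting the view that the Latent Space is optimized for classification:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

class TinySupervisedAE:
    """Dense stand-in for the CSAE: one shared encoder feeding a decoder
    (reconstruction) and a classifier (prediction), trained alternately."""

    def __init__(self, d_in, d_latent, n_classes, lr=0.05):
        self.We = rng.normal(0.0, 0.1, (d_in, d_latent))       # encoder
        self.Wd = rng.normal(0.0, 0.1, (d_latent, d_in))       # decoder
        self.Wc = rng.normal(0.0, 0.1, (d_latent, n_classes))  # classifier
        self.lr = lr

    def encode(self, X):
        return relu(X @ self.We)

    def step(self, X, Y):
        """One iteration of Algorithm 1 on a batch (X, one-hot labels Y)."""
        n = len(X)
        # Steps 1-2: forward pass through the autoencoder, MSE update.
        Z = self.encode(X)
        Xh = Z @ self.Wd
        err = (Xh - X) / n
        dZ = (err @ self.Wd.T) * (Z > 0)
        self.Wd -= self.lr * (Z.T @ err)
        self.We -= self.lr * (X.T @ dZ)
        mse = float(np.mean((Xh - X) ** 2))
        # Steps 3-4: forward pass through the classifier, cross-entropy update.
        Z = self.encode(X)
        P = softmax(Z @ self.Wc)
        dlog = (P - Y) / n
        dZ = (dlog @ self.Wc.T) * (Z > 0)
        self.Wc -= self.lr * (Z.T @ dlog)
        self.We -= self.lr * (X.T @ dZ)
        ce = float(-np.mean(np.sum(Y * np.log(P + 1e-12), axis=1)))
        return mse, ce

# Demo on synthetic data whose label is the (margin-widened) sign of the
# first feature, so the classes are linearly separable after encoding.
t_rng = np.random.default_rng(1)
X = t_rng.normal(size=(64, 20))
y = (X[:, 0] > 0).astype(int)
X[:, 0] += np.where(y == 1, 1.0, -1.0)  # widen the class margin
Y = np.eye(2)[y]

model = TinySupervisedAE(d_in=20, d_latent=4, n_classes=2, lr=0.05)
history = [model.step(X, Y) for _ in range(500)]
preds = softmax(model.encode(X) @ model.Wc).argmax(axis=1)
```

Both the reconstruction and the classification losses decrease under the alternating scheme, illustrating that the two objectives can be optimized jointly without one destroying the other.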
3.2 Improving Classification Methods with CSAE
One of the basic components of CSAE is the Convolutional Autoencoder. We consider the generated Latent Space to be optimized for the classification task, since the minimization of the classification error is the main training objective. Additionally, this Latent Space is constrained by the reconstruction error of the Convolutional Autoencoder, and thus the local structure of the data-generating distribution is preserved. Therefore, the feature space corruption phenomenon is mitigated, as described in [guo2017improved]. Finally, it can be concluded that if the formulated hypothesis holds, then a linear classifier applied to the Latent Space of CSAE should perform adequately.
A Schematic Representation of the described methodology is presented in Figure 2, while the complete algorithmic procedure is presented in Algorithm 2. In detail, CSAE is initially trained following the procedure described in Algorithm 1. Then, the Encoder Network of the Convolutional Autoencoder is detached, and the images contained in the train and test sets are passed through the Encoder Network in order to acquire their low-dimensional representations. Afterwards, a traditional classifier is trained on the latent representations of the training set. The classification result for the images contained in the test set is acquired by providing their latent representations as input to the trained traditional classifier.
The advantages of this methodology are twofold. First, the only component of CSAE that is required for prediction is the Encoder Network, which means that the necessary number of parameters is further reduced. Second, the original images can be deleted after the computation of their latent representations, thus decreasing the memory requirements of the dataset while concurrently reducing the execution time of the utilized traditional classification method. The images can still be reconstructed by providing their latent representations as input to the Decoder Network. In conclusion, from the previously described advantages, it is realized that this methodology offers an efficient solution to the classification problem.
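The pipeline of Algorithm 2 can be sketched with scikit-learn as follows. Since a trained CSAE is not available in this illustration, PCA stands in for the detached Encoder Network (a deliberate simplification), with k-Nearest Neighbors (k = 3, as used later in the experiments) as the traditional classifier:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X = X / 16.0  # normalise pixel intensities to [0, 1]
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0)

# "Detached encoder": map train and test images to low-dimensional codes.
# In Algorithm 2 this would be the Encoder Network of the trained CSAE.
encoder = PCA(n_components=10).fit(X_tr)
Z_tr, Z_te = encoder.transform(X_tr), encoder.transform(X_te)

# Train the traditional classifier on the latent representations only.
knn = KNeighborsClassifier(n_neighbors=3).fit(Z_tr, y_tr)
acc = knn.score(Z_te, y_te)
```

After the codes are computed, the original images are no longer needed for prediction, which is the source of the memory and runtime savings described above.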
4 Experimental Analysis
This Section is devoted to the experimental evaluation of the proposed methodologies. For this purpose, we employ two widely used benchmark datasets and two recent real-world biomedical image datasets. In what follows, we provide a brief overview of the datasets, the preprocessing procedure and the evaluation metrics. In addition, we present the algorithms used for comparison and the experimental procedure. Finally, the experimental results are presented and interpreted through a thorough discussion.
4.1 Datasets
Selecting widely used datasets for our experiments allows us to provide direct comparisons with recent methodologies found in the literature. For this purpose, we utilized the MNIST and Fashion MNIST datasets. Nevertheless, we also utilize two recent biomedical image datasets to expose the true potential of the proposed methods, both in terms of classification performance and generalization capability. In detail, the four employed datasets are the following:

MNIST [lecun2010mnist]: a dataset of 70,000 grayscale images of handwritten digits 0 to 9. Each image contained in this set of data has a size of 28×28 pixels.

Fashion MNIST [xiao2017fashion]: consists of 70,000 grayscale images, where each one is associated with a label from 10 classes. Each image has a size of 28×28 pixels.

Brain Tumor Image Dataset [Cheng2017]: this dataset contains 3064 T1-weighted contrast-enhanced images from 233 patients with three kinds of brain tumor: meningioma (708 slices), glioma (1426 slices), and pituitary tumor (930 slices). This dataset is publicly available on Kaggle (see https://www.kaggle.com/denizkavi1/braintumor).

SARS-COV-2 CT-Scan dataset [soares2020sars]: this dataset contains 2482 CT scans in total, of which 1252 are positive for SARS-CoV-2 infection (COVID-19) and 1230 are from patients not infected by SARS-CoV-2. The dataset was collected from real patients hospitalized in Sao Paulo, Brazil, and it is publicly available on Kaggle (see https://www.kaggle.com/plameneduardo/sarscov2ctscandataset?select=nonCOVID).
4.2 Data Preprocessing and Evaluation Metrics
The images contained in the Brain Tumor and SARS-COV-2 CT-Scan datasets were resized to a common size using the Nearest Neighbor Interpolation method. In addition, they were flattened and standardized for the application of traditional classification algorithms, and normalized to the [0, 1] range for the remaining methodologies. For the MNIST and Fashion MNIST datasets the provided train-test splits were used, while for the Brain Tumor and SARS-COV-2 CT-Scan datasets, the train and test splits were obtained by random sampling with an 80% to 20% ratio, respectively. The validation set for each dataset was created by randomly sampling 10% of the samples of the training set. Finally, for the evaluation of the performance of the classification algorithms, two standard metrics were used: the Accuracy [grandini2020metrics] and the support-weighted F1-Score (as described in https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html).
4.3 Algorithms used for comparison
Aiming at the evaluation of the classification performance of CSAE, a wide variety of algorithms was used in an extensive comparison (the first set of comparisons). We initially compare against Linear Discriminant Analysis (denoted as LDA), the most well-established methodology for Supervised Dimensionality Reduction and Classification. Subsequently, our aim is to illustrate the impact of each individual component of the proposed methodology on the classification result. Specifically, CSAE is compared with a CNN classifier of the same architecture (denoted as CNN classifier), as well as against the independent use of a Convolutional Autoencoder for dimensionality reduction followed by a Classifier (denoted as AE + Class. Net.), where both parts have an architecture similar to the corresponding component of CSAE. Finally, CSAE is compared against several state-of-the-art methodologies proposed in [kabir2020spinalnet, nokland2019training, 8451379, 8683759, wang2020contrastive, Jaiswal2021Class].
To investigate the methodology that exploits the Latent Space of CSAE in order to improve the performance of traditional classification methodologies (the second set of comparisons), three traditional classification algorithms were used: k-Nearest Neighbors (denoted as kNN), Support Vector Machines with a Radial Basis Function kernel (denoted as SVM) and the Gaussian Naive Bayes Classifier (denoted as GNB). The comparison includes the execution of these methodologies on both the original flattened images and the latent representations of the images created by CSAE, denoted as kNN/SVM/GNB and CSAE L.S. + kNN/SVM/GNB, respectively.
4.4 Implementation
The implementation of the whole experimental process was accomplished in the Python programming language and is available under an open source licence through a GitHub page. The deep learning models were implemented using the Keras [chollet2015keras] application programming interface (API), while for the traditional Machine Learning classification algorithms and the Evaluation Metrics, the implementations contained in Scikit-learn [scikitlearn] were utilized. The Convolutional Networks are constructed similarly to [guo2017deep]. More precisely, for the MNIST and Fashion MNIST datasets, the Convolutional Network of the Encoder is composed of 2 convolutional layers, where the numbers of filters were assigned to 32 and 64, respectively. Also, for the Brain Tumor and SARS-COV-2 CT-Scan datasets, 4 convolutional layers were exploited, where the numbers of filters were assigned to 32, 64, 128 and 256, respectively. The convolutional layers of the Convolutional Network contained in the Decoder are identical to the Encoder's, but in reverse order. Additionally, the stride parameter for all the convolutional layers is set to two because, as described in
[guo2017deep], this setting allows the convolutional network of the Encoder and its transpose counterpart contained in the Decoder to learn spatial subsampling and upsampling, respectively, thus leading to a higher transformation capability.

The Fully Connected Network of the Encoder is composed of three fully connected layers: the first two were assigned 128 units each, and the final one as many units as the specified number of dimensions of the Latent Space. The Fully Connected Network of the Decoder is identical to that of the Encoder, but in reverse order. Additionally, for the Fully Connected Network of the Classifier, three fully connected layers were utilized, where the first two were assigned 128 units each and the final one as many units as the number of classes of the corresponding dataset (the Classification Layer). Finally, regarding the Activation Functions, Rectified Linear Units are utilized for all layers except the output layers of the Encoder, Classifier and Decoder Networks, for which Linear, Softmax and Sigmoid Activation Functions are employed, respectively.
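The architecture described above for the MNIST-sized datasets can be sketched in Keras as follows. The 3×3 kernel size is an assumption of this sketch, since the exact kernel dimensions are not restated here; the filter counts, stride, unit counts and activations follow the description in the text:

```python
from tensorflow.keras import layers, models

def build_csae(input_shape=(28, 28, 1), d=2, n_classes=10):
    """Sketch of the CSAE components for MNIST-sized inputs.

    The 3x3 kernels are an assumption; filters, strides, unit counts and
    activations follow the textual description.
    """
    inp = layers.Input(shape=input_shape)

    # Convolutional part of the Encoder: 32 and 64 filters, stride 2.
    x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inp)
    x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
    conv_shape = tuple(x.shape[1:])  # e.g. (7, 7, 64) for 28x28 inputs

    # Fully connected part of the Encoder: 128, 128, then d linear units.
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dense(128, activation="relu")(x)
    z = layers.Dense(d, activation="linear", name="latent")(x)

    # Classifier Network: 128, 128, then a softmax Classification Layer.
    c = layers.Dense(128, activation="relu")(z)
    c = layers.Dense(128, activation="relu")(c)
    out_cls = layers.Dense(n_classes, activation="softmax")(c)

    # Decoder: the Encoder mirrored, with a sigmoid output layer.
    y = layers.Dense(128, activation="relu")(z)
    y = layers.Dense(128, activation="relu")(y)
    y = layers.Dense(conv_shape[0] * conv_shape[1] * conv_shape[2],
                     activation="relu")(y)
    y = layers.Reshape(conv_shape)(y)
    y = layers.Conv2DTranspose(32, 3, strides=2, padding="same",
                               activation="relu")(y)
    out_rec = layers.Conv2DTranspose(input_shape[-1], 3, strides=2,
                                     padding="same", activation="sigmoid")(y)

    autoencoder = models.Model(inp, out_rec)  # trained with MSE
    classifier = models.Model(inp, out_cls)   # trained with cross-entropy
    return autoencoder, classifier
```

The two returned models share the encoder graph, so alternating updates on the reconstruction and classification losses both reach the encoder weights.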
Each deep learning model was trained for 200 epochs, and the model with the highest validation accuracy during training was preserved. The mini-batch size was set to 128 and the Adam optimizer [kingma2014adam] was used, with the learning rate decreased by a constant factor every 50 epochs. The remaining parameters of the classification algorithms were kept at their default values, except for the number of neighbors of the k-Nearest Neighbors Classifier, which was set to 3.
The proposed methodologies were applied across all the datasets for different Latent Space dimensionalities, with minor performance variations, confirming previous observations [MaiNgocHwang2020]. We choose to report results for a relatively small dimensionality for which high classification accuracy can be obtained. The Keras implementation on the MNIST Dataset and detailed experimental results of the proposed methodologies for different Latent Space dimensionalities, along with additional visualizations, can be found at the GitHub repository (see https://github.com/JohnNellas/CSAE). All the experiments were conducted on a server PC with an Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz, an NVIDIA TITAN Xp 12 GB GPU, and 130 GiB of RAM.
MNIST  Fashion MNIST  Brain Tumor Dataset  SARS-COV-2 CT-Scans  

Accuracy  Weighted F1Score  Accuracy  Weighted F1Score  Accuracy  Weighted F1Score  Accuracy  Weighted F1Score 
LDA  0.8730  0.8726  0.8151  0.8159  0.9119  0.9126  0.8008  0.8007 
CNN Classifier  0.9869  0.9869  0.9135  0.9132  0.9543  0.9540  0.9396  0.9396 
AE + Class. Net. ()  0.6968  0.6759  0.5805  0.5362  0.4649  0.2951  0.4969  0.3520 
CSAE ()  0.9751  0.9750  0.8959  0.8961  0.9575  0.9574  0.9436  0.9436 
CSAE ()  0.9871  0.9871  0.9117  0.9114  0.9510  0.9510  0.9436  0.9436 
VGG5 (Spinal FC) [kabir2020spinalnet]  0.9972    0.9468           
VGG8B [nokland2019training]  0.9974    0.9547           
CapsNet [8451379]          0.8656       
CapsNet [8683759]          0.9089       
Contrastive Learning [wang2020contrastive]              0.9083   
DenseNet201 [Jaiswal2021Class]              0.9574   
Table 1: Best performance per dataset is highlighted using boldface text. The most efficient solution per dataset is denoted using boldface and italic text.
4.5 Experimental Results
MNIST  Fashion MNIST  Brain Tumor Dataset  SARS-COV-2 CT-Scans  

Accuracy  Weighted F1Score  Accuracy  Weighted F1Score  Accuracy  Weighted F1Score  Accuracy  Weighted F1Score 
kNN  0.9580  0.9579  0.8915  0.8911  0.8874  0.8836  0.8651  0.8647 
CSAE L.S. + kNN ()  0.9785  0.9784  0.9245  0.9241  0.9624  0.9625  0.9657  0.9657 
CSAE L.S. + kNN ()  0.9926  0.9925  0.9309  0.9303  0.9657  0.9656  0.9657  0.9657 
SVM  0.9660  0.9660  0.8836  0.8828  0.9135  0.9120  0.9336  0.9335 
CSAE L.S. + SVM ()  0.9758  0.9758  0.8959  0.8964  0.9608  0.9607  0.9436  0.9436 
CSAE L.S. + SVM ()  0.9872  0.9871  0.9157  0.9152  0.9510  0.9510  0.9436  0.9436 
GNB  0.5240  0.4772  0.5706  0.5398  0.7406  0.7329  0.7364  0.7285 
CSAE L.S. + GNB ()  0.9161  0.9167  0.8244  0.8264  0.9200  0.9196  0.9456  0.9456 
CSAE L.S. + GNB ()  0.9742  0.9742  0.8743  0.8724  0.9396  0.9396  0.9456  0.9456 
Table 2: Best performance per dataset and classification method is highlighted using boldface text. Best performance per dataset across classification methods is denoted using boldface and italic text.
The experimental results regarding the first set of comparisons are reported in Table 1. We observe that the performance of CSAE surpasses the methods used for comparison, while also achieving competitive results against well-established methods from the literature. In more detail, for the Fashion MNIST dataset, even though CSAE achieved inferior performance compared to the methods presented in [kabir2020spinalnet, nokland2019training], it constitutes a significantly smaller model, with 4.03 and 8.12 times fewer parameters, respectively. Regarding the Brain Tumor Dataset, CSAE outperformed every other method, including those presented in [8451379, 8683759]. In addition, for the SARS-COV-2 CT-Scans dataset, CSAE achieved only slightly worse performance than the methodology presented in [Jaiswal2021Class]
, but still, the proposed methodology reached this performance with 3.85 times fewer parameters, allowing us to dismiss any need for transfer learning. Most importantly, it can be observed that the proposed joint optimization of the reconstruction and classification errors leads to an improvement over the scheme where the two procedures are performed independently, or where only a single CNN classifier is used.
The second set of experimental results is reported in Table 2. It can be observed that the performance of the traditional classification methodologies was significantly improved when they were applied to the classification-optimized latent representations produced by CSAE. The best performance across methods and metrics was achieved by the k-Nearest Neighbors classifier. Interestingly, it even surpasses that of CSAE in Table 1. The aforementioned results can be visually justified by investigating two-dimensional representations of the Latent Spaces constructed by CSAE, retrieved by the t-SNE algorithm [van2008visualizing]. As shown in Figure 3, we observe that points of the same class form dense neighborhoods.
4.5.1 Visualization
The proposed Convolutional Supervised Autoencoder allows the visualization of both the generated Latent Space and the decision boundary of the Classifier Network. Specifically, we retrieve the two-dimensional Latent Space that is subsequently provided as input to the fully connected network of the classifier. We visualize decision regions and boundaries by generating a coloured scatter grid of points, where each colour corresponds to a predicted class. A scatter plot of the Latent Representations of the images contained in the test sets of the MNIST, Fashion MNIST, Brain Tumor and SARS-COV-2 CT-Scan datasets, together with the decision boundary of the Classifier drawn on the corresponding Latent Space, is presented in Figures 4, 5, 6 and 7, respectively. A bold coloured point corresponds to the ground truth class of the corresponding latent representation, while a point with lower opacity corresponds to the class predicted by the network. We observe that the decision regions in the Latent Space created by CSAE are almost linearly separable, and that the classifier converged to a linear decision boundary. This observation confirms the objective of most data transformation methodologies, namely that a non-linear data transformation creates a space in which the data are linearly separable. Most importantly, in this case the visualization offers the much requested explainability, since it reveals the decisions the network makes in order to classify the input points.
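The decision-region grid described above can be sketched as follows; a hand-written linear scorer stands in for the trained Classifier Network so that the snippet is self-contained:

```python
import numpy as np

def predict_class(Z):
    """Stand-in for the trained Classifier Network: a linear scorer over
    the 2-D latent coordinates (argmax of two logits)."""
    W = np.array([[1.0, -1.0],
                  [-1.0, 1.0]])
    return np.argmax(Z @ W, axis=1)

# Lattice of 2-D latent points covering the region of interest.
xs = np.linspace(-3.0, 3.0, 200)
grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)

# One predicted class per grid point; plotting these as a coloured scatter
# (one colour per class) reveals the decision regions and their boundary,
# which for this linear scorer is the line z1 = z2.
labels = predict_class(grid)
```

With a real CSAE, `predict_class` would be replaced by the detached Classifier Network evaluated on the same lattice of latent points.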
An example of the explainability provided by the aforementioned scatter plots can be illustrated by randomly replacing data points with the corresponding original images. Then, we can simultaneously visually examine their pairwise distances and their distance from the decision boundary with respect to their visual characteristics. Evidently, images that look alike tend to lie closer to each other, confirming the data structure preservation capability of the proposed methodology. Visual investigation of the images found across the decision boundary can be extremely beneficial for real-world biomedicine applications, where class membership is often not easily distinguishable [cheng2016retrieval, hani2020covid, JIA2021104425].
Finally, we further examine the information that the network has captured in the Latent Space of CSAE by providing a grid of points from the Latent Space as input to the Decoder Network. A decoded grid of points from the two-dimensional Latent Space of CSAE, for the MNIST and Fashion MNIST datasets, is presented in Figure 8. We observe that, apart from the label information, the network has also captured the rotation and intensity information of the MNIST and Fashion MNIST datasets, respectively.
5 Conclusions
In this study, a novel supervised dimensionality reduction and classification methodology is proposed, which consists of a Convolutional Autoencoder for dimensionality reduction and a classifier for the classification of the latent representations. Its main characteristic is that it concurrently optimizes not only the reconstruction but also the classification error. This method is entitled the Convolutional Supervised Autoencoder (CSAE). In addition, we consider the produced Latent Space of the proposed methodology to be optimized for classification, and thus we argue that its latent representations can be provided as inputs to traditional classification algorithms to significantly improve their performance. To support the aforementioned claims, a thorough study of the Latent Space and the classification behaviour of CSAE is performed. The experimental results on two benchmark and two biomedical image datasets showed that CSAE achieved competitive classification performance against the state of the art, while surpassing alternative methodological scenarios. Simultaneously, it offers a much more efficient solution in terms of parameter count. It is also observed that the performance of traditional classification algorithms was indeed improved when they were applied to the latent representations of CSAE. Most importantly, motivated by the visualization capabilities of CSAE, we investigated the explainability perspective, which adds greater value to the proposed methodology. We specifically observed that the resulting decision boundaries of the classifier converged to linear hyperplanes. To that end, we highlight our interest in further investigating similar architectures that enable visualization and wider explainability in time series classification tasks.
Acknowledgements
This project has received funding from the Hellenic Foundation for Research and Innovation (HFRI), under grant agreement No 1901.