However, having a large-scale annotated dataset like ImageNet is expensive. Therefore, many research techniques such as active learning, semi-supervised learning, or self-supervised learning focus on utilizing small sets of annotated data.
The discriminative power of latent features is an important factor that affects classification performance. In this work, we propose an optimization technique to strengthen the discriminative potential of latent spaces in classification DCNNs. Our approach is based on the objectives of Linear Discriminant Analysis (LDA), which are to minimize the intra-class variance and maximize the inter-class variance. We incorporate these criteria into DCNN training via different losses that can be viewed as constraints for optimizing discriminant features. In other words, our goal is to optimize discriminant analysis for latent features in DCNNs, leading to better classification results, hence the name Neural Discriminant Analysis (NDA). This optimization focuses on three criteria: (i) reducing intra-class variance by minimizing the total distance between features of objects in the same class and their means, (ii) increasing inter-class variance by transforming the features into a feature space in which the feature distances between two classes are pushed further apart, and (iii) satisfying the previous two criteria while also improving classification accuracy.
The relevance of our approach is shown through the performance improvements of our proposed optimization in various fields such as general classification, semi-supervised learning, out of distribution detection and fine-grained classification.
Semi-supervised learning (SSL), as the name suggests, is a learning technique that combines both labeled and unlabeled data in the training process. The majority of work in this field Sohn et al. (2020); Lee (2013); Xie et al. (2019a, b) uses pseudo labeling and consistency regularization for the unsupervised training part. Consistency regularization on unlabeled data is applied to different augmentations of the same input image. One state-of-the-art example of unsupervised data augmentation is proposed in Xie et al. (2019a) (UDA). Using a similar approach to UDA, we add our NDA losses to improve the discriminability of both labeled and pseudo-labeled data. Our experiments show that NDA improves the final classification result.
One important aspect of a real-world classification system is the ability to identify anomalous data or samples that are significantly different from the rest of the training data. These samples should be detected as out of distribution (OOD) data. This problem has been studied in various papers Blundell et al. (2015); Gal and Ghahramani (2016); Maddox et al. (2019); Franchi et al. (2019b); Lakshminarayanan et al. (2017); Liang et al. (2017). It turns out that having a more discriminative latent space can also help to improve OOD data detection, and our NDA technique has proved to be useful in this field.
Another specific field in classification is Fine-Grained Visual Classification (FGVC). The task of FGVC is to classify subordinate classes under one common super-class. Examples are recognizing different breeds of dogs and cats Khosla et al. (2011); Parkhi et al. (2012), sub-species of birds Wah et al. (2011); Van Horn et al. (2015), different models and manufacturers of cars and airplanes Krause et al. (2013); Yang et al. (2015); Maji et al. (2013), sub-types of flowers Nilsback and Zisserman (2008), natural species Horn et al. (2018), and so on. On the one hand, the task is challenging because the subordinate classes share the same visual structures and appearances; the differences are very subtle. In many cases, it requires domain experts to distinguish and label these classes by recognizing their discriminative features on specific parts of the objects. Therefore, it is also a great challenge to obtain large-scale datasets for FGVC. On the other hand, the intra-class variance can be visually higher than the inter-class variance; such cases arise from different colors and poses of objects in the same class. Our proposed NDA directly addresses these challenges of FGVC, and our experiments show improvements on various FGVC datasets using NDA optimization.
What makes our method interesting is not only the improvement in performance, but also the fact that NDA can easily be deployed as a component of an existing network for different tasks. We can connect NDA to any layer of any existing DCNN and train it end-to-end. However, in this paper, we introduce NDA only on the pre-logit layer, which contains high-level information that is useful for classification. Our contribution is a method to learn a discriminant latent space for DCNNs. We conduct various experiments to show the applicability of NDA to different research topics. Our proposed optimization helps to improve fine-grained classification and semi-supervised learning performance. In addition, in OOD data detection, our algorithm helps in obtaining a confidence score that is better calibrated and more effective for detecting OOD data.
2 Related Work
Discriminant Analysis with DCNNs: To improve classification performance, one option is to incorporate discriminant analysis into DCNNs. Mao et al. Mao and Jain (1993) proposed a nonlinear discriminant analysis network, while others use linear discriminant analysis Wang et al. (2017); Dorfer et al. (2016); Zhong et al. (2018); Li et al. (2019). The common idea of these methods is to implement the discriminative principles of LDA in the training of DCNNs, with different implementations. Zhong et al. Zhong et al. (2018) proposed a method to optimize feature vectors with respect to the centers of all the classes; the training is done in a single branch of a DCNN. Dorfer et al. Dorfer et al. (2016) compute eigenvalues on each training batch to optimize the between-class variance; however, this method is slow due to the eigenvalue computations. Our proposed method is based on a Siamese training without the need to solve the eigenvalue problem. A preliminary work building on this general strategy and addressing only FGVC has been accepted at ICIP 2020 Anonymous (2020). Here, we generalize and extend the idea of NDA with a combined loss function and a joint training procedure, and we demonstrate the general usefulness of this new framework in a wider range of applications.
Semi-supervised learning (SSL): Training DCNNs often needs a large amount of data, which can be expensive to annotate. Semi-supervised learning (SSL) techniques overcome this challenge by using an auxiliary dataset together with a small annotated target dataset. Auxiliary data can come from separate datasets Dvornik et al. (2019); Gidaris et al. (2019), or be formed by a subset of the main training dataset whose annotations are omitted Sohn et al. (2020); Lee (2013); Xie et al. (2019a, b). We consider the latter case, since the unlabeled data is easily obtained from the main labeled dataset. A set of methods Sohn et al. (2020); Lee (2013); Xie et al. (2019a) train DCNNs on the unlabeled dataset by applying two random data transformations to the unlabeled data and forcing the DCNNs to be consistent between the predictions on the two transformed images.
Out of distribution (OOD) data detection:
OOD detection is challenging since it involves understanding the limitations of a trained DCNN. Relying on an uncertainty measure is one solution for this task, since data with high uncertainty can be considered OOD. Uncertainty in deep learning was initiated with Bayesian DCNNs, which assume a posterior distribution on the weights given the dataset. It can be estimated directly Blundell et al. (2015) or indirectly via dropout Gal and Ghahramani (2016). Thanks to the marginalization principle, the results are integrated over the posterior distribution. Among Bayesian DCNNs, MC Dropout Gal and Ghahramani (2016) allows a fast estimation of the uncertainty, while SWAG Maddox et al. (2019) and TRADI Franchi et al. (2019b) estimate the posterior distribution by studying the training process of the DCNN. Uncertainty has also been studied via ensemble methods such as Deep Ensembles Lakshminarayanan et al. (2017) and OVNNI Franchi et al. (2019a), which present state-of-the-art results. Finally, other methods try to detect OOD data by learning the loss of the DCNN, such as ConfidNet Corbière et al. (2019), or by proposing algorithms specific to this task, such as ODIN Liang et al. (2017), which is tuned with OOD data.
Fine-grained visual classification (FGVC): The principles of LDA directly address the challenges that FGVC faces. In the field of FGVC, the inter-class differences are often subtle; experts distinguish subordinate classes based on specific parts of the objects. Therefore, a straightforward approach is to learn features of object parts Farrell et al. (2011); Khosla et al. (2011); Parkhi et al. (2011); Liu et al. (2012); Branson et al. (2014); Zhang et al. (2014a, b); Krause et al. (2014); Lin et al. (2015); Zhang et al. (2015); Huang et al. (2016); Zhang et al. (2016a). This approach requires heavy part annotations from domain experts and is therefore difficult to extend to larger-scale datasets. Some other works rely on attribute annotations and text descriptors Vedaldi et al. (2014); Reed et al. (2016); He and Peng (2017); Xu et al. (2018). Stepping away from those types of annotations, another set of works focuses on learning and using visual attention on discriminative regions Xiao et al. (2015); Liu et al. (2016); Zheng et al. (2017); Zhao et al. (2017); Fu et al. (2017); Peng et al. (2018); Yang et al. (2018); Sun et al. (2018). Analyzing the filter responses of DCNNs has also led to good part descriptors and localization Wang et al. (2015); Zhang et al. (2016b). Utilizing the internal responses of DCNNs, different pooling techniques have also been developed, such as bilinear pooling, to study the interactions of sets of local features Lin et al. (2015); Gao et al. (2016); Lin and Maji (2017); Cui et al. (2017); Cai et al. (2017); Yu et al. (2018); Wei et al. (2018). Besides the challenge of high intra-class and low inter-class variance, FGVC also suffers from small-scale datasets. To address this issue, researchers work on strategies to collect more relevant images to enrich the datasets Xie et al. (2015); Krause et al. (2016); Gebru et al. (2017); Zhang et al. (2018); Xu et al. (2018); Cui (2018), or employ humans in the loop and human interaction to bootstrap datasets Branson et al. (2010); Cui et al. (2016); Deng et al. (2016).
3 Neural Discriminant Analysis (NDA)
Features for image classification can be obtained from various types of network models. A straightforward approach is to extract features from models pre-trained on the 1,000-class ImageNet dataset Deng et al. (2009), such as Inception Szegedy et al. (2016) or ResNet He et al. (2016). However, these are not the most discriminative features for tasks such as fine-grained visual classification (FGVC), semi-supervised learning (SSL), or out-of-distribution (OOD) detection. Hence, we opt to improve their discriminative potential by learning a discriminant analysis on these deep feature spaces.
3.1 Linear Discriminant Analysis (LDA)
Let $f$ be a Deep Convolutional Neural Network (DCNN) with a set of weights $\omega$. An input image is denoted as $x$. Hence the classification result of the DCNN applied on the input image is $f(x; \omega)$. In addition, we denote by $z$ the latent representation of the DCNN corresponding to the features extracted at layer $l$, which is usually located before the classification layer.
Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ be a training set composed of $N$ data samples, where $x_i$ is an input that can be a signal or an image, and $y_i$ is its corresponding class. We assume that the dataset contains $K$ classes. Let $\mu_k$ be the empirical mean of each class $k$ that has $N_k$ data samples, and $\mu$ the empirical global mean. The within scatter matrix $S_W$ and the between scatter matrix $S_B$ are computed as follows:
$$S_W = \sum_{k=1}^{K} \sum_{i=1}^{N} \delta_k(y_i)\,(z_i - \mu_k)(z_i - \mu_k)^\top, \qquad S_B = \sum_{k=1}^{K} N_k\,(\mu_k - \mu)(\mu_k - \mu)^\top,$$
where $z_i$ is the latent feature of $x_i$, and $\delta_k(y_i)$ is the Dirac function equal to $0$ when the class of $x_i$ is different from $k$ and $1$ otherwise.
The objective of LDA is to learn a projection $U \in \mathbb{R}^{d \times d'}$ that maps the data from the initial dimension $d$ to a smaller dimension $d' < d$ such that the inter-class variance is maximized and the intra-class variance is minimized. The projection matrix $U$ needs to satisfy Fisher's condition:
$$U^{*} = \arg\max_{U} \frac{\lvert U^\top S_B\, U \rvert}{\lvert U^\top S_W\, U \rvert}.$$
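For concreteness, the scatter matrices and Fisher's criterion above can be sketched in NumPy. This is an illustrative implementation of classical LDA, not part of the NDA training itself; the function names and the small regularizer are ours:

```python
import numpy as np

def lda_scatter(Z, y):
    """Compute the within-class (S_W) and between-class (S_B) scatter matrices.

    Z: (N, d) latent features; y: (N,) integer class labels.
    """
    mu = Z.mean(axis=0)                       # empirical global mean
    d = Z.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for k in np.unique(y):
        Zk = Z[y == k]
        mu_k = Zk.mean(axis=0)                # empirical class mean
        diff = Zk - mu_k
        S_W += diff.T @ diff                  # sum of (z - mu_k)(z - mu_k)^T
        dk = (mu_k - mu).reshape(-1, 1)
        S_B += len(Zk) * (dk @ dk.T)          # N_k (mu_k - mu)(mu_k - mu)^T
    return S_W, S_B

def fisher_projection(Z, y, n_components):
    """Maximize Fisher's ratio by solving S_W^{-1} S_B u = lambda u
    and keeping the eigenvectors of the largest eigenvalues."""
    S_W, S_B = lda_scatter(Z, y)
    # Slight regularization of S_W for numerical stability.
    M = np.linalg.solve(S_W + 1e-6 * np.eye(S_W.shape[0]), S_B)
    evals, evecs = np.linalg.eig(M)
    order = np.argsort(evals.real)[::-1]
    return evecs.real[:, order[:n_components]]
```

A useful sanity check on this implementation is the classical identity $S_W + S_B = S_T$, where $S_T$ is the total scatter matrix computed around the global mean.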
3.2 From Linear Discriminant Analysis to Neural Discriminant Analysis
Inspired by the objectives of LDA, we propose an optimization to learn a discriminative latent space: it minimizes the intra-class variance and maximizes the inter-class variance of the latent space. The optimization's objectives are the following:
Maximizing the classification results: Let the classification result be $\hat{y} = f(x; \omega)$. Maximizing the results is equivalent to minimizing the categorical cross-entropy loss:
$$\mathcal{L}_{\mathrm{CCE}} = -\sum_{k=1}^{K} y_k \log \hat{y}_k,$$
where $y$ is the one-hot encoded classification ground truth of the target.
Minimizing intra-class variance: This loss minimizes the total distance of all the feature points to their respective class mean feature, and is calculated as:
$$\mathcal{L}_{\mathrm{mean}} = \sum_{i=1}^{N} d(z_i, \mu_{y_i}),$$
where $\mu_{y_i}$ is the mean of the latent features of class $y_i$, which is calculated at the beginning of each epoch, and $d$ is a distance function or a dissimilarity measure. The Mean loss can either be the exact Equation 4, where $d$ is the L2 norm, or we can use the prototypical loss Snell et al. (2017), where the prototypes are the mean features: instead of just evaluating the mean distance, we compute a softmax over the distances to all class means, followed by the cross-entropy.
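The two Mean-loss variants can be sketched in NumPy as follows. This is a minimal illustration; the function names, shapes, and the batch-level reduction are our assumptions:

```python
import numpy as np

def mean_loss_l2(Z, y, class_means):
    """L_mean with d = L2 norm: total distance from each latent feature
    to the mean of its class (computed at the start of the epoch)."""
    mus = class_means[y]                      # (N, d) mean for each sample's class
    return np.linalg.norm(Z - mus, axis=1).sum()

def mean_loss_prototypical(Z, y, class_means):
    """Prototypical variant: softmax over negative squared distances to all
    class means (the prototypes), followed by cross-entropy on the true class."""
    d2 = ((Z[:, None, :] - class_means[None, :, :]) ** 2).sum(-1)  # (N, K)
    logits = -d2
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(y)), y].mean()
```

Both variants vanish (or approach zero) when every feature coincides with its class mean, which is exactly the intra-class collapse the loss encourages.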
Maximizing inter-class variance: With this optimization, we use pairs of images $(x_1, x_2)$. If the pair belongs to the same class, we want to reduce the distance between their latent features; otherwise, we want to increase it. The effect is to push features from different classes apart while keeping features within the same class close to each other. Let $\delta = 1$ if the two images $x_1$ and $x_2$ are from the same class and $\delta = -1$ if they are from different classes. The optimization for inter-class variance minimizes the following function:
$$\mathcal{L}_{\mathrm{siam}}(x_1, x_2) = \delta\, d(z_1, z_2),$$
where $d$ is a distance function, which is the L2 norm in all our experiments.
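A per-pair sketch of this Siamese term follows. The optional margin is our addition, not part of Equation 5: without it the repulsive term is unbounded, and a margin turns it into the usual contrastive form max(0, m − d):

```python
import numpy as np

def siamese_loss(z1, z2, same_class, margin=0.0):
    """Siamese term for one pair of latent features: delta * d(z1, z2)
    with delta = +1 for a same-class pair and -1 otherwise."""
    d = np.linalg.norm(z1 - z2)               # L2 distance in latent space
    if same_class:
        return d                              # pull same-class features together
    # Push different-class features apart; a positive margin bounds the
    # repulsive term as in a standard contrastive loss (our assumption).
    return max(0.0, margin - d) if margin > 0 else -d
```

Minimizing this value pulls same-class pairs together (loss is their distance) and pushes different-class pairs apart (loss decreases as their distance grows).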
Total loss: The core structure combining the losses for NDA optimization is a Siamese architecture. A pair of input images is passed through a shared-weight Siamese network. The Siamese network is trained with five losses: two Classification losses and two Mean losses (one of each per image), and a Siamese loss (Fig. 1). The Classification loss is the categorical cross-entropy loss defined in Equation 3. The Mean loss is defined in Equation 4 and the Siamese loss in Equation 5. The combined loss is the weighted sum of all the losses:
$$\mathcal{L} = \alpha\,(\mathcal{L}_{\mathrm{CCE}}^{1} + \mathcal{L}_{\mathrm{CCE}}^{2}) + \beta\,(\mathcal{L}_{\mathrm{mean}}^{1} + \mathcal{L}_{\mathrm{mean}}^{2}) + \gamma\,\mathcal{L}_{\mathrm{siam}},$$
where $\mathcal{L}_{\mathrm{CCE}}^{1}$ and $\mathcal{L}_{\mathrm{mean}}^{1}$ are the Classification loss and Mean loss for the first input image, and $\mathcal{L}_{\mathrm{CCE}}^{2}$ and $\mathcal{L}_{\mathrm{mean}}^{2}$ those for the second. The coefficients $\alpha$, $\beta$, and $\gamma$ are fine-tuned hyper-parameters.
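Putting the pieces together, the weighted combination of the five losses for a pair of batches might look like the following self-contained sketch. The logit-based cross-entropy and the per-batch averaging are our implementation choices, not specified by the paper:

```python
import numpy as np

def softmax_cce(logits, y):
    """Categorical cross-entropy from raw logits and integer labels."""
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(y)), y].mean()

def nda_loss(z1, z2, logits1, logits2, y1, y2, class_means,
             alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of the five NDA losses for a pair of batches:
    two classification losses, two mean losses, and one Siamese loss."""
    cce = softmax_cce(logits1, y1) + softmax_cce(logits2, y2)
    mean = (np.linalg.norm(z1 - class_means[y1], axis=1).mean()
            + np.linalg.norm(z2 - class_means[y2], axis=1).mean())
    delta = np.where(y1 == y2, 1.0, -1.0)     # +1 same-class pair, -1 otherwise
    siam = (delta * np.linalg.norm(z1 - z2, axis=1)).mean()
    return alpha * cce + beta * mean + gamma * siam
```

Moving features away from their class means increases the Mean-loss terms and hence the total loss, which is the gradient signal that shapes the latent space.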
The NDA model (Figure 1) is designed as follows: the images of a batch are provided to a DCNN that produces the latent space representation of these images and the classification output target. To optimize jointly the classification and inter-class and intra-class variance of the latent space, we first calculate the mean feature for every class at the beginning of an epoch. Then, we provide two batches of images to the DCNN. Each batch is used for the Classification loss and the Mean loss. The combination of two batches is used for the Siamese loss. We then combine all the losses as in Eq 6.
We run the experiments with different values for the hyper-parameters $\alpha$, $\beta$, and $\gamma$. While we cannot extensively search over many configurations of the hyper-parameters due to limited resources, we find one configuration that provides slightly better results. For the experiments in SSL and OOD, we implement the prototypical loss Snell et al. (2017) as the Mean loss: instead of just using the L2 distance to the mean class vector, we compute the distances to all class means and normalize them with a softmax function. We briefly describe the experiments for the different applications and their results in this section.
4.1 General Supervised Classification
We train end-to-end networks for classification on the CIFAR-10 dataset Krizhevsky et al. (2009). We use the official (train, test) split of the dataset. However, we further split the training set into 80% for training and 20% for validation. We only validate our training on the validation set and test on the test set. We use two different network architectures for the Siamese branches: AlexNet Krizhevsky et al. (2012) (Kaggle implementation) and ResNet50 He et al. (2016). The results for AlexNet Kaggle are shown in Table 1 and those for ResNet50 in Table 2.
| Method | Baseline | Optimization | Improvement over baseline |
| --- | --- | --- | --- |
| CDA Zhong et al. (2018) | 60.2% | 62.5% | 2.3% |
We compare with a competing method, Convolutional Discriminant Analysis (CDA), proposed by Zhong et al. Zhong et al. (2018), using the AlexNet Kaggle base network. Different from our Siamese architecture, CDA uses a single-branch CNN with a single input image. With the Siamese network, we can specify the Mean loss and the Siamese loss explicitly. In contrast, CDA's objective is that an input image is trained to be close to its class center and further away from the other classes' centers. NDA achieves 70.9% accuracy whereas CDA only reaches 62.5%. Compared to the baseline, CDA shows a 2.3% increase while NDA shows a 6.2% improvement. We re-implement AlexNet Kaggle using Keras-TensorFlow; due to differences in framework support, our baseline has higher accuracy than CDA's. The numerical results show that NDA achieves significantly better results than CDA and a much higher improvement over the baseline.
We also test our proposed NDA using ResNet50 He et al. (2016). We use the model code provided by Keras (https://keras.io/examples/cifar10_resnet/), as well as the pre-defined learning schedule, with 200 epochs for each training. The reported accuracy of ResNet50 on CIFAR-10 is 93.0%; however, it is important to note that this accuracy is obtained by validating on the test set. In contrast, we split the default training data into 80% for training and 20% for validation to avoid over-fitting on the test data, which also leaves less data for training. We achieve a baseline accuracy of 91.8%, while NDA improves the accuracy to 93.0%. We also experiment with the prototypical loss Snell et al. (2017) for the Mean loss; this setup improves the result to 93.3%, surpassing the state-of-the-art results. The NDA results are averaged over five runs.
| NetCCE Dorfer et al. (2016) | SOTA (*) | Baseline | NDA | NDA + prototypical loss Snell et al. (2017) |
We also evaluate NDA optimization on the CIFAR-100 dataset using the WideResNet 28x10 network Zagoruyko and Komodakis (2016). The accuracy of the network is 73.9% without NDA and 76.4% with NDA. Our optimization increases the baseline result by 2.5%.
4.2 Semi-Supervised Learning (SSL) Classification
In this section, we apply NDA optimization to a semi-supervised learning (SSL) task. We first randomly select 250 labeled images from the CIFAR-10 dataset: 25 images per class for 10 classes. These 250 images are used in supervised training for classification. We also select 2,000 labeled images from the official training set as validation data. The rest of the training set is used as unlabeled data. We use the WideResNet 28x10 architecture Zagoruyko and Komodakis (2016) and a consistency loss, similar to those used in UDA Xie et al. (2019a), for SSL training.
The training is divided into two phases. In the first phase, we have two losses: one for supervised training and one for unsupervised training. The first loss is the categorical cross-entropy used for supervised training on the 250 labeled images. We also use the unlabeled set to train the DCNN with a consistency loss: the Kullback–Leibler (KL) divergence between the class probability outputs of two transformations of an input image. The first transformation is a random horizontal flip with random cropping; the second is the more complex transformation proposed by UDA Xie et al. (2019a) that uses RandAugment Cubuk et al. (2019). The consistency loss ensures that the class probability distributions of two transformations of the same image are similar.
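The consistency term can be sketched as a batched KL divergence between the two predicted class distributions. This is a minimal illustration; the clipping constant is ours, added for numerical safety:

```python
import numpy as np

def kl_consistency(p_weak, p_strong, eps=1e-8):
    """Mean KL(p_weak || p_strong) over the batch: penalizes disagreement
    between the class distributions predicted for two augmentations
    of the same image."""
    p = np.clip(p_weak, eps, 1.0)
    q = np.clip(p_strong, eps, 1.0)
    return (p * (np.log(p) - np.log(q))).sum(axis=1).mean()
```

The loss is zero exactly when the two predicted distributions agree, and grows as they diverge, which is what drives the unlabeled images toward augmentation-invariant predictions.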
In the second phase, we use a deep ensemble of three models trained in the first phase to pseudo-label the unlabeled set. The pseudo labels are created by averaging the predictions of all three models. We also use the deep ensemble to compute the confidence score of an unlabeled image, which is the maximum class probability (MCP) of the averaged prediction. If the confidence score is higher than a threshold, we select the image and its pseudo label for training. We combine the pseudo-labeled data and the 250 labeled images to continue training the three ensemble models for classification with the cross-entropy loss. After every training epoch, if the performance of an ensemble model improves in the second phase, we use it to update its predecessor network in the first phase. Even though the technique is similar to UDA Xie et al. (2019a), the deep ensemble pseudo-labeling is a new contribution.
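The ensemble pseudo-labeling step can be sketched as follows. The threshold value in the signature is a placeholder, since the value used in our experiments is not restated here:

```python
import numpy as np

def ensemble_pseudo_labels(prob_list, threshold=0.9):
    """Average the softmax outputs of an ensemble, use the maximum class
    probability (MCP) of the average as a confidence score, and keep only
    the samples whose confidence exceeds the threshold."""
    avg = np.mean(prob_list, axis=0)          # (N, K) averaged predictions
    conf = avg.max(axis=1)                    # MCP confidence per sample
    labels = avg.argmax(axis=1)               # pseudo labels
    keep = conf > threshold
    return labels[keep], np.flatnonzero(keep)
```

Samples on which the ensemble members disagree get a low averaged MCP and are filtered out, so only confidently pseudo-labeled images are added to the supervised training set.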
The result of our SSL technique on CIFAR-10 is 90.0%. When incorporating NDA optimization into the loss in the second phase, we achieve an accuracy of 90.5%, an increase of 0.5% over the baseline.
4.3 Out Of Distribution Detection (OOD)
In this subsection, we show the performance of NDA for detecting OOD data. We train a ResNet50 architecture on CIFAR-10 and test it on a combination of images from CIFAR-10 and SVHN, a commonly used OOD dataset. The goal of this section is to show that, without any major extension, NDA can reach performance close to the state of the art.
For ResNet50 trained on CIFAR-10, the Maximum Class Probability (MCP), which serves as the confidence score Hendrycks and Gimpel (2016) for a classical training, seems to be close to 0.9 for most of the OOD data. This behaviour is undesirable and could lead to misuse of the DCNN. For the NDA training, we also use the MCP as a confidence score, yet in this case the OOD data have a lower score.
For a better quantitative evaluation, we compare NDA with the most competitive algorithms using the criteria proposed in Hendrycks and Gimpel (2016): the AUC, AUPR and FPR-95%-TPR metrics. We also evaluate the Expected Calibration Error (ECE) Guo et al. (2017), which checks whether the confidence scores are reliable. NDA has better results than the Bayesian DCNNs and the learning-loss techniques. The only method that surpasses NDA is Deep Ensembles, which needs to train several DCNNs and thus requires more training time. We also compare NDA to ODIN, which shows that NDA is comparable to strategies that fine-tune the DCNN using OOD data.
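The ECE metric we report can be sketched as follows, using equal-width confidence bins as in Guo et al. (2017); the bin count is a conventional default, not a value fixed by the paper:

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average the absolute gap
    between mean confidence and accuracy per bin, weighted by bin size."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece
```

A perfectly calibrated model (confidence matches accuracy in every bin) has an ECE of zero; an overconfident model, like a classically trained DCNN on OOD data, has a large ECE.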
4.4 Fine-grained Visual Classification
We evaluate our proposed method on the following FGVC datasets: Stanford-Dogs Khosla et al. (2011), which contains 120 breeds of dogs; CUB-200-2011 Wah et al. (2011), which has 200 types of birds; Flower-102 Nilsback and Zisserman (2008), with 102 types of flowers; Stanford-Cars Yang et al. (2015), which has 196 types of cars; and NABirds Van Horn et al. (2015), with 555 classes of birds. We use the transfer-learning Inception-ResNet-V2, Inception-ResNet-V2-SE and Inception-V3-iNat models from Cui et al. Cui (2018) as base networks to compare across all datasets. In this experiment, we alternate the Mean loss and the Siamese loss in the optimization.
| Inception-V3 from Cui (2018) | 82.8 - 89.3 | 78.5 - 85.2 | 88.3 - 91.4 | 96.3 - 97.7 | 82.0 - 87.9 |
| Best of Transfer Learning Cui (2018) | 89.6 | 88.0 | 93.5 | 97.7 | 87.9 |
| NDA (Inception-V3-iNat) (ours) | 87.4 | 89.1 | 99.9 | 95.5 | 83.9 |
| NDA (Inc-Res-V2) (ours) | 90.1 | 95.3 | 97.4 | 97.7 | 88.4 |
| NDA (Inc-Res-V2-SE) (ours) | 89.7 | 95.5 | 99.9 | 97.7 | 89.5 |
We use features extracted from the Inception-ResNet-V2, Inception-ResNet-V2-SE and Inception-V3-iNat networks from Cui (2018). It is worth noting that the best performances of the transfer-learning networks in Cui (2018) are from Inception-V3 and Inception-ResNet-V2-SE. Even though the features from Inception-ResNet-V2 do not produce the best results, we are still able to top the state of the art on the Stanford-Cars, Flower-102 and NABirds datasets. The authors in Cui (2018) trained Inception-V3 using different data-sampling strategies, but no single strategy consistently produces the best transfer-learning result on all the datasets. It is unclear which of the strategies was applied in the publicly available network that we use; thus, for the Inception-V3 results from Cui (2018), we report the range of accuracies over all the data-sampling strategies. Each training is repeated 10 times with different random initializations, and the reported results are the average performance over the 10 runs.
With NDA optimization on the Stanford-Cars dataset using features extracted from Inception-ResNet-V2-SE and Inception-V3-iNat, we raise the accuracy to 99.9% consistently across all 10 runs (the standard deviation is therefore 0). Without the NDA optimization, the average accuracy of Inception-ResNet-V2-SE is 97.4%, with results fluctuating from 89.1% to 99.9% (standard deviation 3.88). This shows the consistency of the NDA optimization.
5.1 Learning a Discriminant Latent Space
We choose to integrate the discriminant-analysis losses of Section 3.2 into Deep Convolutional Neural Networks, forming Neural Discriminant Analysis (NDA) networks, for several reasons. First, we can easily add an NDA component to an existing classification model between the final feature layer and the prediction layer; the NDA approach is flexible in design and transforms the latent space to be more discriminative. Secondly, when we train a DCNN to classify images, the first part of the DCNN (the convolutional layers) learns the features while the last part specializes in classification. Incorporating NDA into DCNNs and training end-to-end helps to improve the discriminability of the features both in the latent space and in the other convolutional layers. Thus, our technique is useful in cases where we have relatively little training data.
The loss that we propose yields a latent space that is more discriminative. Moreover, if we take $d$ to be the squared L2 norm, then for the data of a batch that belong to the same class $k$ the Siamese loss turns out to be $\sum_{i,j} \lVert z_i - z_j \rVert^2 = 2\, n_k \sum_{i} \lVert z_i - \hat{\mu}_k \rVert^2$, with $\hat{\mu}_k$ the empirical mean of the latent features of class $k$ on the batch and $n_k$ the number of samples of class $k$ in the batch. While the Mean loss drives the latent features toward an empirical mean estimated over all $N_k$ samples of the class, the intra-class part of the Siamese loss drives them toward an empirical mean estimated over only the $n_k \leq N_k$ samples of the batch. Hence the empirical mean used by the Mean loss is more accurate. In addition, the Mean loss acts like an anchor that stabilizes the DCNN.
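The relationship above between the pairwise Siamese term and the distance to the batch mean (assuming $d$ is the squared L2 norm) is easy to verify numerically with a quick sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 4))                   # latent features of one class in a batch
mu_hat = Z.mean(axis=0)                       # empirical batch mean of the class
n = len(Z)

# Sum of squared L2 distances over all ordered pairs in the batch ...
pair_sum = sum(np.sum((zi - zj) ** 2) for zi in Z for zj in Z)
# ... equals 2 n times the sum of squared distances to the batch mean.
mean_sum = 2 * n * np.sum((Z - mu_hat) ** 2)
assert np.isclose(pair_sum, mean_sum)
```

So, up to a constant factor, pulling same-class pairs together is equivalent to pulling each feature toward the batch-level class mean, which is why the Mean loss (using the more accurate epoch-level mean) complements the Siamese term.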
We compute and report the standard deviations of the accuracy across the 10 runs per dataset and per network on the FGVC task in Table 5. The standard deviations of all NDA optimization results are consistently lower than those of transfer learning, which shows that the NDA optimization transforms the features such that the classification becomes more stable and reliable. A high average with a small standard deviation is much more desirable than a lower average with a high standard deviation.
| Inc-ResNet-V2 | TL / NDA | 0.74 / 0.28 | 0.96 / 0.81 | 2.53 / 1.80 | 0.22 / 0.10 | 0.73 / 0.08 |
| Inc-ResNet-V2-SE | TL / NDA | 0.69 / 0.17 | 0.37 / 0.11 | 3.88 / 0.00 | 0.13 / 0.13 | 0.06 / 0.06 |
| Inception-V3-iNat | TL / NDA | 3.96 / 0.27 | 0.86 / 0.40 | 8.32 / 0.00 | 0.50 / 0.23 | 0.09 / 0.10 |
5.3 NDA versus Siamese
| NDA (Inc-Res-V2) (ours) | 90.1 | 95.3 | 97.4 | 97.7 | 88.4 |
When we experiment with an optimization that uses the Siamese loss alone on the FGVC task, the performance drops (Table 6). Without the Mean loss, there is no strong, explicit constraint for intra-class optimization; without the Siamese loss, there is no inter-class optimization.
Inspired by the objectives of Linear Discriminant Analysis (LDA) and making use of the power of DCNNs, we propose a Neural Discriminant Analysis (NDA) optimization that is useful for many research fields. Our proposed NDA consists of Mean losses, Siamese losses and Classification losses. The combination of all the losses minimizes the intra-class variance and maximizes the inter-class variance in the deep feature domain. We validate the NDA optimization in four different topics: general supervised classification, semi-supervised learning, out of distribution detection and fine-grained classification. The experiments show that NDA always improves the classification results over the baseline. We also obtain state-of-the-art accuracy on CIFAR-10 and several popular FGVC datasets. NDA results on OOD indicate that it can help to detect OOD data with competitive results to most of the methods, except for Deep Ensembles which requires much more training time. The analysis for FGVC shows that our optimization provides more stable and reliable results.
7 Broader Impact
The discriminant analysis on the latent space of DCNN features improves the discriminative ability of the networks, especially in cases with little training data such as semi-supervised learning and fine-grained classification. It is therefore economically beneficial, helping algorithms achieve good performance with less expensively annotated data. Our work pulls the focus from data-driven approaches that purely depend on millions of annotated images toward incorporating classical machine-learning principles into deep learning, using less training data while still achieving excellent results.
We also want to improve the performance in out of distribution detection using NDA optimization. This is an essential feature for real-world applications, especially in medical imaging, to identify anomalous data that the networks have not been trained on or seen before. In those systems, it is crucial not to assign a trained label to anomalous data blindly.
Our proposed method is easier to apply to existing network architectures than Anonymous (2020), trains faster than Dorfer et al. (2016), provides higher accuracy than Zhong et al. (2018), and is more general than Li et al. (2019). Aspects less often mentioned in published papers are the stability of the training and the reliability of the resulting networks. In this paper, we study the variance of the classification results and show that NDA optimization stabilizes the training, reducing the fluctuations in the networks' performance. We would like to emphasize that reliability in network training is an important aspect besides the performance itself.
-  (2020) Neural discriminant analysis for fine-grained classification. In ICIP. Note: accepted; see supplemental material. Cited by: §2, §7.
-  (2015) Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424. Cited by: §1, §2.
-  (2014) Bird species categorization using pose normalized deep convolutional nets. In British Machine Vision Conference (BMVC), Cited by: §2.
-  (2010) Visual recognition with humans in the loop. In European Conference on Computer Vision (ECCV), pp. 438–451. Cited by: §2.
-  (2017) Higher-order integration of hierarchical convolutional activations for fine-grained visual categorization. In IEEE International Conference on Computer Vision (ICCV), pp. 511–520. Cited by: §2.
-  (2019) Addressing failure prediction by learning model confidence. In Advances in Neural Information Processing Systems, pp. 2898–2909. Cited by: §2.
-  (2019) RandAugment: practical automated data augmentation with a reduced search space. arXiv preprint arXiv:1909.13719. Cited by: §4.2.
Kernel pooling for convolutional neural networks.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 3049–3058. External Links: Cited by: §2.
-  (2016-06-27) Fine-grained categorization and dataset bootstrapping using deep metric learning with humans in the loop. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2018) Large scale fine-grained categorization and domain-specific transfer learning. In CVPR, Cited by: §2, §4.4, §4.4, Table 4.
-  (2009) ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, Cited by: §1, §3.
-  (2016-04) Leveraging the wisdom of the crowd for fine-grained recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 38 (4), pp. 666–676. External Links: Cited by: §2.
-  (2016) Deep linear discriminant analysis. In International Conference on Learning Representations (ICLR), pp. 1–11. Cited by: §2, Table 2, §7.
-  (2019) Diversity with cooperation: ensemble methods for few-shot classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3723–3731. Cited by: §2.
-  (2011-11) Birdlets: subordinate categorization using volumetric primitives and pose-normalized appearance. In International Conference on Computer Vision (ICCV), pp. 161–168. External Links: Cited by: §2.
-  (2019) One versus all for deep neural network incertitude (OVNNI) quantification. arXiv preprint arXiv:2006.00954. Cited by: §2.
- (2019) TRADI: tracking deep neural network weight distributions. arXiv preprint arXiv:1912.11316.
- (2017) Look closer to see better: recurrent attention convolutional neural network for fine-grained image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4476–4484.
- (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059.
- (2016) Compact bilinear pooling. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 317–326.
- (2017) Fine-grained recognition in the wild: a multi-task domain adaptation approach. In IEEE International Conference on Computer Vision (ICCV), pp. 1358–1367.
- (2019) Boosting few-shot visual learning with self-supervision. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8059–8068.
- (2017) On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 1321–1330.
- (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
- (2016) Identity mappings in deep residual networks. In ECCV.
- (2017) Fine-grained image classification via combining vision and language. In CVPR, pp. 7332–7340.
- (2016) A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136.
- (2018) The iNaturalist species classification and detection dataset. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8769–8778.
- (2016) Part-stacked CNN for fine-grained visual categorization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2011) Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2014) Learning features and parts for fine-grained recognition. In International Conference on Pattern Recognition (ICPR), pp. 26–33.
- (2016) The unreasonable effectiveness of noisy data for fine-grained recognition. In ECCV, Vol. 9907, pp. 301–320.
- (2013) 3D object representations for fine-grained categorization. In International IEEE Workshop on 3D Representation and Recognition (3dRR-13).
- (2009) Learning multiple layers of features from tiny images. Technical report.
- (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
- (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413.
- (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, Vol. 3, pp. 2.
- (2019) Discriminant analysis deep neural networks. In Conference on Information Sciences and Systems (CISS), pp. 1–6.
- (2017) Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690.
- (2015) Deep LAC: deep localization, alignment and classification for fine-grained recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1666–1674.
- (2015) Bilinear CNN models for fine-grained visual recognition. In IEEE International Conference on Computer Vision (ICCV), pp. 1449–1457.
- (2017) Improved bilinear pooling with CNNs. In Proceedings of the British Machine Vision Conference (BMVC), pp. 117.1–117.12.
- (2012) Dog breed classification using part localization. In European Conference on Computer Vision (ECCV), pp. 172–185.
- (2016) Fully convolutional attention localization networks: efficient attention localization for fine-grained recognition. CoRR abs/1603.06765.
- (2019) A simple baseline for Bayesian uncertainty in deep learning. arXiv preprint arXiv:1902.02476.
- (2013) Fine-grained visual classification of aircraft. Technical report arXiv:1306.5151.
- (1993) Discriminant analysis neural networks. In International Conference on Neural Networks, Vol. 1, pp. 300–305.
- (2008) Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics and Image Processing, pp. 722–729.
- (2011) The truth about cats and dogs. In International Conference on Computer Vision (ICCV), pp. 1427–1434.
- (2012) Cats and dogs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2018) Object-part attention model for fine-grained image classification. IEEE Transactions on Image Processing (TIP) 27 (3), pp. 1487–1500.
- (2016) Learning deep representations of fine-grained visual descriptions. In CVPR, pp. 49–58.
- (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087.
- (2020) FixMatch: simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685.
- (2018) Multi-attention multi-class constraint for fine-grained image recognition. In European Conference on Computer Vision (ECCV).
- (2016) Rethinking the inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826.
- (2015) Building a bird recognition app and large scale dataset with citizen scientists: the fine print in fine-grained dataset collection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 595–604.
- (2014) Understanding objects in detail with fine-grained attributes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3622–3629.
- (2011) The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology.
- (2015) Multiple granularity descriptors for fine-grained categorization. In IEEE International Conference on Computer Vision (ICCV), pp. 2399–2406.
- (2017) Convolutional 2D LDA for nonlinear dimensionality reduction. In International Joint Conference on Artificial Intelligence (IJCAI), pp. 2929–2935.
- (2018) Grassmann pooling as compact homogeneous bilinear pooling for fine-grained visual classification. In European Conference on Computer Vision (ECCV).
- (2015) The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 842–850.
- (2019) Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848.
- (2019) Self-training with noisy student improves ImageNet classification. arXiv preprint arXiv:1911.04252.
- (2015) Hyper-class augmented and regularized deep learning for fine-grained image classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2645–2654.
- (2018) Fine-grained image classification by visual-semantic embedding. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1043–1049.
- (2018) Webly-supervised fine-grained visual categorization via deep domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 40 (5), pp. 1100–1113.
- (2015) A large-scale car dataset for fine-grained categorization and verification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3973–3981.
- (2018) Learning to navigate for fine-grained classification. In European Conference on Computer Vision (ECCV).
- (2018) Hierarchical bilinear pooling for fine-grained visual recognition. In European Conference on Computer Vision (ECCV).
- (2016) Wide residual networks. arXiv preprint arXiv:1605.07146.
- (2016) SPDA-CNN: unifying semantic part detection and abstraction for fine-grained recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1143–1152.
- (2014) Part-based R-CNN for fine-grained category detection. In European Conference on Computer Vision (ECCV).
- (2014) PANDA: pose aligned networks for deep attribute modeling. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2015) Fine-grained pose prediction, normalization, and recognition. CoRR abs/1511.07063.
- (2016) Picking deep filter responses for fine-grained image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1134–1142.
- (2018) Fine-grained visual categorization using meta-learning optimization with sample selection of auxiliary data. In European Conference on Computer Vision (ECCV).
- (2017) Diversified visual attention networks for fine-grained object classification. IEEE Transactions on Multimedia 19 (6), pp. 1245–1256.
- (2017) Learning multi-attention convolutional neural network for fine-grained image recognition. In IEEE International Conference on Computer Vision (ICCV), pp. 5219–5227.
- (2018) Convolutional discriminant analysis. In International Conference on Pattern Recognition (ICPR), pp. 1456–1461.