Neural Network Representation Control: Gaussian Isolation Machines and CVC Regularization

02/06/2020, by Guy Amit, et al.

In many cases, neural network classifiers are likely to be exposed to input data that is outside of their training distribution. Samples from outside the distribution may be classified as an existing class with high probability by softmax-based classifiers; such incorrect classifications affect the performance of the classifiers and of the applications/systems that depend on them. Previous research aimed at distinguishing training distribution data from out-of-distribution (OOD) data has proposed detectors that are external to the classification method. We present the Gaussian isolation machine (GIM), a novel hybrid (generative-discriminative) classifier aimed at solving the problems that arise when OOD data is encountered. The GIM is based on a neural network and utilizes a new loss function that imposes a distribution on each of the trained classes in the neural network's output space, which can be approximated by a Gaussian. The proposed GIM's novelty lies in its discriminative performance and generative capabilities, a combination of characteristics not usually seen in a single classifier. The GIM achieves state-of-the-art classification results on image recognition and sentiment analysis benchmark datasets and can also deal with OOD inputs. We also demonstrate the benefits of incorporating part of the GIM's loss function into standard neural networks as a regularization method.


Related work

Generative Classifiers vs Discriminative Classifiers

Machine learning classifiers are often divided into two families: generative and discriminative. The difference between the two is the information produced when calculating the prediction. Formally, let $x$ be a sample and $y$ be a label. Generative classifiers learn a model of the joint probability $p(x, y)$ and make a prediction by calculating $p(y \mid x)$ with Bayes' rule; this enables the likelihood $p(x \mid y)$ to be calculated as well. Discriminative classifiers do not calculate $p(x, y)$; instead, they predict the posterior probability $p(y \mid x)$ directly [ng2002discriminative]. The term $p(y \mid x)$ can be interpreted as a confidence rate for the prediction, i.e., the probability of a sample $x$ from the input space being labeled as a specific $y$, which is used as a measure of $y$ being the correct prediction for $x$. Although the confidence rate can be useful in various applications, in practice generative classifiers are not usually used, because they are outperformed by discriminative classifiers.
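To make the distinction concrete, here is a small scikit-learn sketch (not from the paper; the dataset and model choices are illustrative) that fits one classifier from each family and queries both for posterior probabilities:

```python
# GaussianNB is generative (models p(x | y) and p(y), combines them with
# Bayes' rule); LogisticRegression is discriminative (models p(y | x)
# directly). Synthetic blobs stand in for real data.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_blobs(n_samples=500, centers=3, random_state=0)

gen = GaussianNB().fit(X, y)           # learns p(x | y) and p(y)
disc = LogisticRegression().fit(X, y)  # learns p(y | x) directly

x_new = X[:1]
print(gen.predict_proba(x_new))   # posterior obtained via Bayes' rule
print(disc.predict_proba(x_new))  # posterior modeled directly
```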

Identification of Out-of-Distribution Data with Neural Networks

A classifier that can identify whether a sample comes from the same distribution as the training data is capable of handling unpredictable inputs, whose presence can be intentional or accidental. Technically, identifying out-of-distribution (OOD) data means that the model labels the input as OOD instead of classifying it to a specific class. Classifiers based exclusively on softmax do not inherently have this capability, as softmax assigns every input to some class. Some research has been performed on unsupervised means of OOD detection, such as [chalapathy2018anomaly, an2015variational, chalapathy2017robust]; however, because the proposed methods use components that are external to the classifier, they require the training of an additional component/model for each class. Hendrycks et al. [hendrycks2016baseline] established a baseline method based on softmax. Later, Liang et al. [liang2017enhancing] introduced ODIN, which uses a distillation-like softmax [hinton2015distilling] combined with adversarial-like perturbations [goodfellow2014explaining] to the input in order to predict whether the input is in or out of the distribution. This method does not require additional training, but it does require two feedforward passes and one backpropagation pass, which makes it impractical for real-time use. The most recent research on OOD detection was conducted by DeVries et al. [devries2018learning]. Their method adds external functionality to the existing neural network design: it is based on a neural network with softmax output, with the addition of an output neuron that serves as an OOD detector; the added neuron must be trained using a novel method described in their paper. In contrast, we introduce a method inherently capable of detecting OOD inputs. We compare our results with the baseline proposed by Hendrycks et al. [hendrycks2016baseline], which is the most closely related study on the subject, as it deduces whether an input is OOD relying solely on representations learned by the classifier.

Multivariate Gaussians in Neural Networks

Gaussians and neural networks are both known for their ability to approximate functions and data distributions, and the combination of the two has been proposed in many domains. Some articles use Gaussians as part of the neural network itself, such as [watanabe1996fuzzy], where Gaussians are used as activation functions, and [viroli2019deep], where the layers of the neural network follow a Gaussian mixture model (GMM). In general, a notable trait of a multivariate Gaussian is that knowing its parameters (mean vector and covariance matrix) allows easy generation of new samples. For example, in variational autoencoders [doersch2016tutorial], neural networks are used to estimate the parameters of a multivariate Gaussian; the parameters are then used to produce samples from the distribution of the data in the input space. Another issue related to the combination of multivariate Gaussians and neural networks is how to integrate GMMs into neural networks in order to perform classification. In a recent study, Tüske et al. [tuske2015integrating] took an approach similar to ours, proposing the integration of latent variables into the last layer of the neural network. In their work, the last layer represents the parameters of the GMM; a GMM for the sample is calculated, and the prediction is made accordingly. In our method, we approximate each class representation using a multivariate Gaussian distribution (a special case of a GMM), but in contrast to [tuske2015integrating], we do not incorporate the parameters of the Gaussians in the neural network, meaning that the parameters do not need to be learned explicitly, which makes training easier. Another important difference is that Tüske et al. [tuske2015integrating] use the same covariance matrix for all of the classes, while we approximate a different covariance matrix for each class.

Large Margin Algorithms in Neural Networks

As shown by Boser et al. [boser1992training], one of the desired qualities of a classifier is large margins between the representations of the classes. Liang et al. [liang2017soft] explored the effects of large margins between classes in neural networks. Cross-entropy is currently the most frequently used loss function in neural network-based classifiers, yet Sun et al. [sun2016depth] empirically showed that the cross-entropy loss does not encourage a large margin. Neural networks contain many representations of the data, and hence a reasonable approach to achieving the large margin effect is to form large margins in some of the network's inner representations [liu2016large, sokolic2017robust, liang2017soft]. In [elsayed2018large], Elsayed et al. proposed a new optimization target aimed at replacing the softmax/cross-entropy combination completely. In our method, we also replace the softmax/cross-entropy combination with a new optimization target, but our method enforces a more general margin definition. The algorithms proposed by Elsayed et al. aim to maximize the distance between the closest points of different classes, while our method tries to maximize the distance between the means of the classes, which represents a relaxation of the formal definition of a large margin between classes.

Methodology

Figure 1: Gaussian isolation machine trained on three classes predicting the fourth class.

In this paper, we propose a means of improving the design of neural network classifiers so that they can cope with input that does not belong to one of the classes in the training set. The proposed classifier takes advantage of the fact that neural networks can model complex probability distributions, transforming the input of the neural network into a vector space in which each class has an approximately known probability distribution - a Gaussian. More specifically, we train the neural network to produce an output space in which the model's output has a simple distribution, thus forcing the samples from each class to behave like dense, isolated clusters. Forcing each class to be dense improves its approximation by a multivariate Gaussian with a diagonal covariance matrix, and separating the clusters from one another creates large distances between the classes, which is similar to creating large margins. In contrast to softmax-based neural networks, which take a discriminative approach and directly model $p(y \mid x)$, the Gaussian isolation machine models the posterior as follows:

$$p(c_i \mid f(x)) \propto p(f(x) \mid c_i)\, p(c_i) \tag{1}$$

where $f(x)$ is the network output and $c_i$ is the class label. We use the likelihood component $p(f(x) \mid c_i)$ as our confidence rate, and this allows us to deal with data from untrained distributions. The modeling technique is similar to that of a generative model, but we consider the GIM a hybrid model, because it is a discriminative model that is generative with respect to the output of the neural network rather than the input space. We formulated two ways of controlling the distribution of each class representation: the first controls the class representation density (CTV loss), and the second controls the class representation spread (CH loss). Consider a classification problem with $k$ classes $C = \{c_1, \dots, c_k\}$, where $c_i$ is the $i$-th class, and a function $f : \mathbb{R}^n \to \mathbb{R}^d$ (where $d$ is not necessarily equal to $n$) that represents a neural network. Let $X_i$ denote the training samples of class $c_i$. We define the following metrics:

  1. Class Mean Vector:

     $$\mu_i = \frac{1}{|X_i|} \sum_{x \in X_i} f(x) \tag{2}$$

     The mean vector of class $c_i$ in the neural network output space.

  2. Class Neighborhood Probability:

     $$P(c_i, c_j) = \exp\left(-\frac{\|\mu_i - \mu_j\|^2}{2\sigma^2}\right) \tag{3}$$

     where $c_i$ and $c_j$ are classes, and $\mu_i$ and $\mu_j$ are their mean vectors (Equation (2)). This metric corresponds to the unnormalized probability of class $c_i$ being near class $c_j$, assuming a Gaussian distribution on class $c_j$ with a diagonal covariance matrix whose diagonal elements are all equal to $\sigma^2$. For optimization purposes, Equation (3) is insufficient for separating the classes and achieving the large margin effect, because once the probability becomes low, the network is no longer driven to separate the classes further. Therefore, for optimization, we use the following modified version of the equation to ensure class separation and the desired large margin effect:

     $$\tilde{P}(c_i, c_j) = \exp\left(-\frac{\|\mu_i - \mu_j\|^2}{2T}\right) \tag{4}$$

     where $T$ is a large constant compared to the actual class covariance matrix diagonal elements. This is similar to the temperature used by Hinton et al. [hinton2015distilling], but it serves a different purpose: during optimization, this constant forces the assumed Gaussians to (1) cover more space, and (2) always include the other classes.

  3. Center Distance:

     $$d_i(x) = \|f(x) - \mu_i\|^2 \tag{5}$$

     where $\mu_i$ is the class mean vector (Equation (2)). This metric is the squared distance of a sample from the class mean.

  4. Class Total Variance:

     The class total variance (CTV) is the first moment of the center distance (Equation (5)) over the samples of class $c_i$:

     $$CTV_i = \frac{1}{|X_i|} \sum_{x \in X_i} d_i(x) \tag{6}$$

     The class total variance is equal to the sum of the diagonal elements of the covariance matrix of class $c_i$.

  5. Class Homogeneity:

     $$CH_i = \frac{1}{|X_i|} \sum_{x \in X_i} \left(d_i(x) - CTV_i\right)^2 \tag{7}$$

     The class homogeneity (CH) is the second moment of the center distance (Equation (5)) over the samples of class $c_i$, and it defines the variance of the (squared) distances from the center of the class.

The minimization of Equation (6) will effectively shrink the diagonal elements of the covariance matrix of the class in the output space, making its representation small and dense. This minimization allows us to assume a multivariate Gaussian distribution with a diagonal covariance matrix for each class; those diagonal covariance matrices are employed when making predictions. The first of the two optional loss functions for the GIM, the CTV loss, combines Equations (6) and (4). It both controls the class representation size in the output space and ensures that the representations of the classes are far apart from one another:

$$\mathcal{L}_{CTV} = \sum_{i=1}^{k} CTV_i + \sum_{i \neq j} \tilde{P}(c_i, c_j) \tag{8}$$

Combining the class neighborhood probability (Equation (4)) with the class homogeneity (Equation (7)) yields the second loss function for the GIM, the CH loss. It controls the class spread by ensuring that the variance of the distances from the class mean vector is small:

$$\mathcal{L}_{CH} = \sum_{i=1}^{k} CH_i + \sum_{i \neq j} \tilde{P}(c_i, c_j) \tag{9}$$
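The following is a minimal TensorFlow sketch of the CTV loss (Equation (8)) under our reading of the definitions above. Using batch-level class means in place of the true class means, the unweighted sum of the two terms, and the value of `T` are assumptions, not details taken from the paper:

```python
import tensorflow as tf

def ctv_loss(y_true, f_x, num_classes, T=100.0):
    """y_true: (batch,) int labels; f_x: (batch, d) network outputs."""
    masks = tf.one_hot(y_true, num_classes, dtype=f_x.dtype)          # (batch, k)
    counts = tf.reduce_sum(masks, axis=0) + 1e-8                      # samples per class
    # Per-class mean vectors mu_i estimated on the batch (Eq. 2).
    mus = tf.matmul(masks, f_x, transpose_a=True) / counts[:, None]   # (k, d)
    # Class total variance: mean squared distance to own class mean (Eq. 6).
    sq_dist = tf.reduce_sum((f_x - tf.gather(mus, y_true)) ** 2, axis=1)
    ctv = tf.reduce_mean(sq_dist)
    # Modified class neighborhood probability over all class pairs (Eq. 4).
    pair_sq = tf.reduce_sum((mus[:, None, :] - mus[None, :, :]) ** 2, axis=-1)
    near = tf.exp(-pair_sq / (2.0 * T))
    off_diag = tf.reduce_sum(near) - tf.reduce_sum(tf.linalg.diag_part(near))
    return ctv + off_diag
```

The CH loss would replace `ctv` with the batch variance of `sq_dist`, following Equation (7).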

We trained individual neural networks to minimize Equations (8) and (9), using different neural network architectures in which the last layer has an arbitrary size (e.g., 24, 32, 64). As we hoped, the distribution of the network's output is similar to that of a GMM, with large Euclidean distances between the clusters (classes). Knowing that the output space is approximately distributed as a multivariate Gaussian gives us access to the likelihood term $p(f(x) \mid c_i)$, which can be thought of as a confidence metric for the neural network's predictions. The likelihood is the probability of a sample belonging to a certain class $c_i$ in the neural network's output space, and it is defined by the multivariate Gaussian probability density function:

$$p(f(x) \mid c_i) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_i|}} \exp\left(-\frac{1}{2}\,(f(x) - \mu_i)^\top \Sigma_i^{-1} (f(x) - \mu_i)\right) \tag{10}$$

In practice, we use the log of the probability. We use the log-likelihood as a confidence metric that allows us to differentiate between in-distribution and out-of-distribution data by setting a threshold on its value, so that inputs that result in a value lower than this confidence threshold are labeled as out-of-distribution. To make a prediction, the GIM follows an approach similar to that of many generative classifiers: it uses a term that includes both the likelihood and the prior over the labels. We combine the confidence (log-likelihood) term with a prior over the class labels and utilize Bayes' rule to approximate $p(c_i \mid f(x))$ as follows:

Prior over the labels:

$$p(c_i) = \frac{|X_i|}{\sum_{j=1}^{k} |X_j|} \tag{11}$$

Posterior probability for classification:

$$p(c_i \mid f(x)) = \frac{p(f(x) \mid c_i)\, p(c_i)}{\sum_{j=1}^{k} p(f(x) \mid c_j)\, p(c_j)} \tag{12}$$

In practice, we use the log of the whole expression in order to avoid numerical errors. As can be seen in Figure 1, our method produces a differently shaped decision boundary: rather than splitting the input space into three areas, as a softmax classifier would, a heat map is created for each class's probability distribution.
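A hedged NumPy sketch of GIM inference as described above: fit one diagonal-covariance Gaussian per class on the training outputs $f(x)$, classify with the log of Equation (12), and flag inputs whose best log-likelihood falls below a confidence threshold as OOD. Variable names and the small constant added to the variances are illustrative:

```python
import numpy as np

def fit_class_gaussians(outputs, labels, num_classes):
    """Estimate (mu_i, diag Sigma_i, prior_i) per class from training outputs f(x)."""
    params = []
    for i in range(num_classes):
        z = outputs[labels == i]
        params.append((z.mean(axis=0),
                       z.var(axis=0) + 1e-6,       # diagonal covariance, stabilized
                       len(z) / len(outputs)))     # empirical prior (Eq. 11)
    return params

def diag_gauss_loglik(z, mu, var):
    # Log of Eq. 10 for a diagonal covariance matrix.
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (z - mu) ** 2 / var, axis=-1)

def gim_predict(z, params, threshold):
    loglik = np.stack([diag_gauss_loglik(z, mu, var) for mu, var, _ in params], -1)
    posterior = loglik + np.log([p for _, _, p in params])   # Bayes' rule in log space
    pred = posterior.argmax(axis=-1)
    is_ood = loglik.max(axis=-1) < threshold                 # low confidence -> OOD
    return np.where(is_ood, -1, pred)                        # -1 marks OOD inputs
```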

Evaluation

Experimental Settings

| Layer | MNIST | FASHION-MNIST | IMDB |
|-------|-------|---------------|------|
| 1 | Conv2D 3X3X32 | Conv2D 3X3X32 | Embedding |
| 2 | Conv2D 3X3X64 | Conv2D 3X3X32 | Dropout 0.25 |
| 3 | MaxPool 2X2 | MaxPool 2X2, strides 2 | Conv1D 250X3 |
| 4 | Dropout 0.25 | Dropout 0.3 | GlobalMaxPool1D |
| 5 | Flatten | Conv2D 3X3X64 | Dense 250 |
| 6 | Dense 32 | Conv2D 3X3X64 | |
| 7 | | MaxPool 2X2, strides 2 | |
| 8 | | Dropout 0.4 | |
| 9 | | Flatten | |
| 10 | | Dense 24 | |

Table 1: Neural Network Architectures

In this section, we evaluate the performance of the GIM on three tasks and compare it to that of standard neural networks. Our evaluation shows that the Gaussian isolation machine achieves classification results and a convergence speed similar to those of standard neural networks, while possessing the inherent ability to detect OOD data with high accuracy. We evaluate the GIM on two classic classification tasks and three out-of-distribution data detection tasks. For classification, we chose three standard image recognition benchmark datasets and one sentiment analysis benchmark dataset. In all of the classification experiments, we compare our method to state-of-the-art neural network classifiers with the same architecture; however, because we removed the last layer (weights and softmax activation) from the GIMs, they have fewer parameters. We created two scenarios for the identification of untrained distribution (OOD) data. In these experiments, we trained a GIM on several classes of a dataset and determined whether data from the remaining classes is classified by the GIM as one of the trained classes. In addition, we performed an experiment similar to the one presented in Liang et al. [liang2017enhancing] to compare the GIM's detection abilities to the baseline detector [hendrycks2016baseline]. To measure classification accuracy and convergence speed, we used the architectures presented in Table 1. For the CIFAR 10 dataset, we trained a ResNet20v1, like the one presented in [he2016deep], which is a very compact ResNet with under 0.3 million trainable parameters. We used Keras [chollet2015keras] to implement the neural networks. Because our implementation of ResNet20v1 uses random data augmentation, results vary between runs; after several trials, we achieved 91.2% accuracy, matching the results reported in [he2016deep]. In addition to the ResNet20v1, we also trained a VGG16 [simonyan2014very] without its final layer.
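As a concrete reading of Table 1, the MNIST column might be assembled in Keras as follows. The input shape and the absence of a softmax head follow the text (the GIM ends in a plain Dense embedding layer), while the activation choices are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

gim_mnist = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, activation="relu"),  # layer 1
    layers.Conv2D(64, 3, activation="relu"),  # layer 2
    layers.MaxPooling2D(pool_size=2),         # layer 3
    layers.Dropout(0.25),                     # layer 4
    layers.Flatten(),                         # layer 5
    layers.Dense(32),                         # layer 6: output space for the Gaussians
])
```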

Classification Accuracy

| Dataset | GIM-CTV (Eq. 8) | GIM-CH (Eq. 9) | Standard |
|---------|-----------------|----------------|----------|
| MNIST | 98.6 | 99.25 | 99.08 |
| FASHION | 92.22 | 93.05 | 93.55 |
| IMDB | 89.0 | 89.4 | 89.5 |
| CIFAR 10-ResNet | 89.4 | — | — |
| CIFAR 10-VGG | 93.5 | 93.4 | 93.5 |

Table 2: Gaussian Isolation Machine vs Standard Neural Networks

In this section, we determine the accuracy of the GIM and compare it to that of standard neural networks. All of the neural networks were created using the architectures presented in Table 1 and were initialized using the same random seed. We test our method on the MNIST character recognition dataset [lecun-mnisthandwrittendigit-2010], the FASHION-MNIST clothing recognition dataset [xiao2017fashion], the CIFAR 10 object recognition dataset [krizhevsky2009learning], and the IMDB sentiment analysis dataset [maas-EtAl:2011:ACL-HLT2011]. In Table 2, we provide a comparison of the results, presenting the average accuracy achieved by each method on each dataset. It is clear from the results presented in Table 2 that our method does not compromise classification accuracy and in some cases even improves it. For other generative and hybrid models, such as Bayesian neural networks [gal2015bayesian], KNN, Naïve Bayes, and ClassRBM [larochelle2012learning], the accuracy level is usually low when the datasets contain high-dimensional data, as is the case with the MNIST and CIFAR10 images. The novelty of our work is that we were able to retain the classification accuracy of a fully discriminative neural network while creating a hybrid model.

Convergence Speed

Figure 2: The Gaussian isolation machine vs a standard neural network on the CIFAR 10 dataset.

Figure 2 compares the convergence speed of the GIM to that of a standard neural network, where both methods use the ResNet20v1 architecture. All training was performed on a single NVIDIA 2080 Ti GPU. Note that both formulations of the Gaussian isolation machine converge more slowly than a standard neural network, although they achieve nearly the same final accuracy (see Table 2). We hypothesize that the slower convergence is due to the fact that the GIM must separate the representations of the classes, as well as isolate them and force them into a Gaussian form.

Identifying Out-of-Distribution Data

Figure 3: Caltech101 dataset confidence metric for the trained and untrained class data.

Figure 4: Hand-Gesture dataset confidence metric for the trained and untrained class data.

Anomalous data and data from outside the trained distribution can appear in a variety of applications. The proposed method can detect data from other distributions, i.e., data from classes that the model did not train on. In this section, we empirically evaluate the proposed method's ability to distinguish between data from the trained distribution and data from outside it. To accomplish this, we designed two experiments in which we trained a GIM on a portion of the classes in a dataset and evaluated its detection ability on the remainder of the classes. In the first experiment, we used the Caltech101 dataset [fei2004learning], and in the second, the Hand-Gestures dataset [mantecon2016hand]. In both experiments, we also trained standard neural networks with the same architecture and compared our performance to theirs. The difference between the confidence values of the GIM's predictions for Caltech101 data from the trained distribution (the first 10 classes) and from the untrained distribution (classes 10-40) can be seen in Figure 3. There is a marked difference between the two graphs presented in the figure: data from classes that our model trained on receives much higher confidence than data from classes it didn't train on. This observation makes it possible to set a threshold on the confidence values and, given an input, determine whether it belongs to the trained distribution or not. In this case, setting such a threshold created an almost perfect separation between out-of-distribution data and the trained distribution. In the second experiment, we loaded the Hand-Gestures dataset and resized its images. The GIM trained on the first three classes; the confidence values can be seen in Figure 4. Here, too, a threshold on the confidence values yielded an almost perfect detection rate. To perform a fair comparison with other OOD detectors, we implemented the baseline method introduced by Hendrycks & Gimpel [hendrycks2016baseline], comparing it to the GIM with the threshold values for the softmax (baseline) and the log-likelihood (GIM) set such that 97% of the neural networks' predictions on the training set would be above the thresholds (a sketch of this threshold rule follows Table 3). Table 3 presents an evaluation similar to that presented by Liang et al. [liang2017enhancing]: a comparison between two VGG13 neural networks trained on the CIFAR 10 dataset to comparable test accuracy. The CIFAR 10 test set serves as the in-distribution data, and the out-of-distribution data comes from the following datasets: Tiny ImageNet, LSUN, and iSUN. The results of this comparison appear in Table 3.

Each cell reads Baseline (Hendrycks & Gimpel, 2017) / GIM:

| OOD Dataset | FPR | Detection Error | AUROC | AUPR In | AUPR Out |
|-------------|-----|-----------------|-------|---------|----------|
| ImageNet (resize) | 0.70/0.21 | 0.36/0.16 | 0.85/0.86 | 0.88/0.80 | 0.82/0.86 |
| ImageNet (crop) | 0.57/0.13 | 0.33/0.12 | 0.90/0.92 | 0.92/0.92 | 0.88/0.89 |
| LSUN (resize) | 0.54/0.17 | 0.29/0.14 | 0.81/0.87 | 0.85/0.79 | 0.76/0.87 |
| LSUN (crop) | 0.68/0.18 | 0.34/0.15 | 0.89/0.91 | 0.92/0.92 | 0.85/0.88 |
| iSUN | 0.80/0.15 | 0.42/0.14 | 0.76/0.82 | 0.84/0.79 | 0.74/0.87 |
| Gaussian | 1.0/0.0 | 0.52/0.05 | 0.84/0.99 | 0.97/0.99 | 0.83/0.94 |
| Uniform | 0.0/0.0 | 0.02/0.05 | 0.91/0.99 | 0.97/0.99 | 0.83/0.94 |

Table 3: Baseline Method vs Gaussian Isolation Machine Detection of OOD Data (the CIFAR 10 test set serves as the in-distribution data)
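The threshold rule used in the comparison above can be sketched in a few lines; `train_scores` (a 1-D array of in-distribution confidence scores, softmax for the baseline or log-likelihood for the GIM) is an assumed input:

```python
import numpy as np

def confidence_threshold(train_scores, tpr=0.97):
    # The (1 - tpr) quantile leaves `tpr` of training scores above it,
    # so 97% of in-distribution predictions clear the threshold.
    return np.quantile(train_scores, 1.0 - tpr)
```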

Evaluation Metrics

  • TPR and FPR: Measure the true positive rate and false positive rate. Let TP, FP, TN, and FN respectively represent the numbers of true positives, false positives, true negatives, and false negatives. The true positive rate is calculated as TPR = TP / (TP + FN), and the false positive rate is calculated as FPR = FP / (FP + TN).

  • AUROC: Measures the area under the ROC curve. The receiver operating characteristic (ROC) curve plots the relationship between the TPR and FPR. The area under the ROC curve can be interpreted as the probability that a positive example (in-distribution) will have a higher detection score than a negative example (out-of-distribution).

  • AUPR: Measures the area under the precision-recall (PR) curve. The PR curve is created by plotting precision = TP / (TP + FP) against recall = TP / (TP + FN). In our tests, AUPR In denotes results where in-distribution data is used as the positive class, and AUPR Out denotes results where out-of-distribution examples are used as the positive class (a scikit-learn sketch of these metrics follows this list).
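For reference, the threshold-free metrics above can be computed with scikit-learn as in the following sketch; the score arrays and the convention that higher scores mean in-distribution are assumptions:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def ood_metrics(in_scores, out_scores):
    """in_scores/out_scores: detector confidences for in-/out-of-distribution data."""
    y = np.r_[np.ones_like(in_scores), np.zeros_like(out_scores)]
    s = np.r_[in_scores, out_scores]
    return {
        "AUROC": roc_auc_score(y, s),
        "AUPR In": average_precision_score(y, s),        # in-distribution positive
        "AUPR Out": average_precision_score(1 - y, -s),  # out-of-distribution positive
    }
```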

Test Sets for OOD Detection

  • Tiny ImageNet: A subset of the original ImageNet dataset containing 200 classes. For testing purposes, we used two datasets created from the Tiny ImageNet test set, which contains 10,000 images: ImageNet (resize) and ImageNet (crop).

  • LSUN: The Large-scale Scene Understanding dataset contains 10,000 test images, which were used to create two datasets: LSUN (resize) and LSUN (crop).

  • iSUN: The iSUN dataset is a subset of the SUN dataset, and it contains 8,925 images. All images in this dataset were used, resized to 32 × 32 pixels.

  • Gaussian and Uniform Noise: The Gaussian and Uniform Noise datasets were created by sampling 10,000 images from a uniform distribution and 10,000 images from a multivariate Gaussian distribution with a mean of 0.5 and a standard deviation of one.

Regularization Case Study: CVC Regularization

In this section, we examine the effectiveness of using the CTV and CH losses (Equations (8) and (9)) as regularization terms alongside the cross-entropy loss function. The purpose of these regularization terms is to limit, to some extent, the hypothesis space. The regularization terms are applied to the logits, denoted by $z$, i.e., the input to the softmax activation. The class mean vectors and center distances are likewise calculated using the logits rather than the output of the neural network.

Figure 5: MNIST classification: class 1 vs class 2 logits space.

The regularization limits the logits of each class to a cluster form when Equation (8) is used, or to a sphere when Equation (9) is used. Figure 5 shows a visualization that conveys the intuition behind CVC regularization. The regularization forces a large margin relative to the scatter's scale in the logits space. We hypothesize that this large margin effect, together with the narrowing of the hypothesis space, is what improves the neural network's performance. We reformulate the relevant equations using the logits of the neural network and define the regularization terms:

Class mean vector and center distance computed on the logits:

$$\mu_i^z = \frac{1}{|X_i|} \sum_{x \in X_i} z(x) \tag{13}$$

$$d_i^z(x) = \|z(x) - \mu_i^z\|^2 \tag{14}$$

First moment regularization:

$$\mathcal{L} = \mathcal{L}_{CE} + \lambda \left( \sum_{i=1}^{k} CTV_i^z + \sum_{i \neq j} \tilde{P}(c_i, c_j) \right) \tag{15}$$

Second moment regularization:

$$\mathcal{L} = \mathcal{L}_{CE} + \lambda \left( \sum_{i=1}^{k} CH_i^z + \sum_{i \neq j} \tilde{P}(c_i, c_j) \right) \tag{16}$$

where $\mathcal{L}_{CE}$ is the cross-entropy loss, $\lambda$ is a weighting coefficient, and $CTV_i^z$ and $CH_i^z$ are the class total variance (Equation (6)) and class homogeneity (Equation (7)) computed on the logits.
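A minimal sketch of the first moment regularization (Equation (15)) under the reading above: standard cross-entropy on the logits plus a weighted CTV-style term computed on those same logits, reusing the `ctv_loss` sketch from the Methodology section. The weight `lam` is an assumption:

```python
import tensorflow as tf

def cvc_ce_loss(y_true, logits, num_classes, lam=0.1, T=100.0):
    """Cross-entropy plus first moment CVC regularization on the logits."""
    ce = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_true, logits=logits))
    # ctv_loss is the earlier sketch, here applied to logits instead of f(x).
    return ce + lam * ctv_loss(y_true, logits, num_classes, T)
```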

Because both of these terms control the variance within each class, we refer to them as Class Variance Control (CVC) regularization. We evaluated our regularization methods and compared them to networks with dropout [hinton2012improving] and/or weight regularization. In the experiments, we used the architectures presented in Table 1, and for CIFAR10 we used ResNet20v1. For the networks with CVC regularization, we removed all other types of regularization. In Table 4, we present the results, which are averaged over five runs. The results show that CVC regularization improves the classification ability of neural networks. Both CVC regularization techniques are relatively easy to implement and do not increase training time compared to other forms of regularization. In the version presented, CVC is comparable to other regularization techniques that have been shown to be beneficial [hinton2012improving, Tripathi2014ASO]. This demonstrates that regularization applied to the representation can be as effective as regularizing the weights directly. We expect that with further research this approach can be improved even further.

| Dataset | First Moment Reg | Second Moment Reg | Standard |
|---------|------------------|-------------------|----------|
| MNIST | 99.15 | 99.18 | 99.08 |
| FASHION | 93.55 | 93.15 | 93.5 |
| IMDB | 89.90 | 89.59 | 89.5 |
| CIFAR10 | 91.73 | 91.83 | 91.2 |

Table 4: Class Variance Control Regularization vs Standard Neural Networks with Regularization

Summary

In this paper, we presented the Gaussian isolation machine, a new neural network-based classification method. The GIM is based on a neural network trained to transfer its inputs to a vector space in which the data distribution can be approximated using multivariate Gaussians. The approach integrates principles from generative and discriminative models to form a hybrid classification method that can classify data with high accuracy, as well as identify data from untrained distributions. In the process of creating the Gaussian isolation machine, we also experimented with new regularization terms that improve the classification ability of cross-entropy/softmax-based neural networks. The main contribution of this paper is the GIM's ability to identify whether the input data is from the training set distribution or not, without the need for any preprocessing or external detection measures. In future work, we intend to add a sampling capability to the GIM (i.e., the ability to produce new samples) and to modify the loss function to enable multi-label classification. In our experiments, we also tried using the full covariance matrix of each class and found that although the classification results were better, the run-time was much longer. We believe that additional research in this area will lead to better classification results.