The Softmax (Eqn 1) is a standard final layer in classifiers such as Neural Nets (NNs) and Support Vector Machines (SVMs). It interprets the outputs of the NN to deliver an estimated class for test samples. This is a logical choice because NNs are trained on a loss function with an embedded Softmax. Softmax also scales to large numbers of classes, and returns values that can be interpreted as probabilities. In a highly-trained NN, Softmax will ideally return a value of 1 for the predicted (and hopefully correct) class and 0 for all other classes.
But not all situations allow for such a highly-trained NN. For example, there may be insufficient training data available, as is often the case for medical, scientific, or field-collected datasets (Koller & Bengio, 2018). However, a trained NN model that falls short of this “0-1” ideal may still contain much class-specific information encoded in the responses of the penultimate, pre-Softmax layer of neurons. Each of these neurons (hereafter Response neurons, Rs) develops a characteristic response to each class. Let $N$ be the number of classes. These class-R response distributions form an $N \times N$ array $\mathcal{D}$ of probability distributions, where $\mathcal{D}_{ij}$ is the distribution of the responses of the $j$th R to the $i$th class. An example of these distributions is seen in Figure 2.
Softmax systematically under-utilizes this array $\mathcal{D}$ of class-R response distributions. For a given test sample of true class $i$, Softmax uses only the $i$th row in this array, by comparing the various Rs' responses to samples of class $i$. If the NN has not been trained sufficiently to have encoded class separations for class $i$ in this row, Softmax will make mistakes. The core of our proposed approach is to scan the entire array $\mathcal{D}$ for clues to the test sample’s class.
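Softmax, e.g. (Murphy, 2012), is calculated using the values of the Response neurons (the penultimate layer) in response to the sample $s$:

$$\hat{s} = \arg\max_{i} \frac{e^{R_i(s)}}{\sum_{j} e^{R_j(s)}} \qquad (1)$$

where
$\hat{s}$ = predicted class of sample $s$,
$R_i(s)$ = response of the $i$th R to $s$,
$i, j \in \{0, \dots, 9\}$ are the classes (0-9).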
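As a concrete illustration of Eqn 1, the following minimal NumPy sketch computes Softmax scores and the resulting prediction from one sample's R responses. The function names and the example response values are hypothetical, chosen only to show how two strongly responding Rs produce a low top score.

```python
import numpy as np

def softmax(responses):
    """Softmax over the last axis; subtracting the max keeps exp() numerically stable."""
    r = np.asarray(responses, dtype=float)
    e = np.exp(r - r.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def softmax_predict(responses):
    """Return (predicted class, top Softmax score) for one sample's R responses."""
    p = softmax(responses)
    return int(np.argmax(p)), float(np.max(p))

# Hypothetical responses of 10 Rs to one sample: R_1 and R_9 both respond strongly.
example = np.array([0.1, 2.0, 0.3, 0.0, -1.0, 0.2, 0.1, 0.4, 0.0, 1.8])
pred, score = softmax_predict(example)
print(pred, round(score, 3))
```

Here the prediction is class 1, but the top score is well below 1 because class 9 responds almost as strongly.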
During training, a softmax-based optimizer will seek, for the $i$th class, to maximize the Fisher distance between the response of the $i$th neuron ($R_i$) to the $i$th class ($\mathcal{D}_{ii}$) and the responses of the other Rs to class $i$ ($\mathcal{D}_{ij}$, $j \neq i$), by boosting the mean of $\mathcal{D}_{ii}$ and depressing the means of $\mathcal{D}_{ij}$, $j \neq i$, and reducing the standard deviations for all $\mathcal{D}_{ij}$. When training is “sufficient” for Softmax, these Fisher distances become large. Then, given a test sample with class $i$, Softmax yields a very high value in the $i$th readout, which distinguishes the correct ($i$th) class using only the $i$th row of $\mathcal{D}$. But if training is not “sufficient”, then the Fisher distances between $\mathcal{D}_{ii}$ and $\mathcal{D}_{ij}$ for some $j$ will be small. In this case, responses will be relatively strong in both $R_i$ and $R_j$, giving a low Softmax score and possibly an incorrect estimated class ($j$ instead of $i$). An example, in this case confusion between classes 1 and 9, is seen in row 1 of Figures 1 and 3.
However, given uncertainty between classes $i$ and $j$, we can examine the responses of all the Rs, relative to the expected statistical behaviors of both the $i$th and $j$th classes. That is, we can use more than one row of the array $\mathcal{D}$ of class-R response distributions, and choose between class $i$ and $j$ by assessing the likelihoods of the responses given rows $i$ and $j$ of $\mathcal{D}$. If several classes are confused in the responses of the $i$th row, we can examine the rows of all of those classes.
In a case observed by (Delahunt & Kutz, 2018), involving a model not trained using a Softmax-based loss function, such a “full $\mathcal{D}$” classifier was in some cases much more accurate than Softmax (and in other cases worse). In this work we apply a “full $\mathcal{D}$” classifier to NN models. Our experiments suggest that a trained NN model is in some cases far more capable than a final Softmax layer makes evident: Full use of the response distribution array $\mathcal{D}$ gives much better test set classification on more difficult test samples.
In this paper we describe a Softmax/Pooled-likelihood Hybrid classifier (hereafter SPH) that (partially) replaces the Softmax layer at testing, in any model for which a validation set is available. Training occurs as usual, e.g. with a Softmax-based loss function. To apply the SPH classifier, we first characterize the array $\mathcal{D}$ of class-R response distributions using the validation set. We also generate a weight matrix $W$, based on the Fisher distances between various classes for each R. Given a test sample $s$, if its top Softmax score is high, then we trust the Softmax prediction, since a high Softmax score typically indicates that the Softmax estimate is correct. If the top Softmax score is low, we route the sample to the Pooled-likelihood classifier, which leverages the information in $\mathcal{D}$ to make a prediction. Thus, SPH takes advantage of the fact that Softmax and Pooling each work best on different types of samples.
Our experiments with vectorized MNIST show that SPH delivers meaningful gains in accuracy over Softmax, using the same trained models, regardless of the Softmax test accuracy of the model. This indicates that the class-R response distribution array $\mathcal{D}$ contains a wealth of information untapped by Softmax. Conversely, it indicates that a “sufficient” amount of training data, as measured by Softmax performance, actually represents an excess, the surplus training data being required to make up for that part of the encoded class information left on the table by Softmax.
The core message of this paper is to highlight the valuable, but currently under-utilized, information content of the class-R response distribution array $\mathcal{D}$. An NN whose final classifier layer leverages this extra information encoded in $\mathcal{D}$ may be trained on less data yet perform as well as an NN that is trained on more data but uses a Softmax final layer. This can potentially ease the training data bottleneck so common in NNs, while still allowing the classifier to hit a given task’s performance specs. In this paper we describe one possible algorithm, SPH. The paper is organized as follows: We first give a brief overview of the SPH algorithm; we next report results of experiments on a vectorized MNIST dataset; we then give some concluding comments. An Appendix contains the full details of the algorithm.
2 Overview of the SPH algorithm
SPH combines Softmax and Pooling, a classifier designed to more fully use the array $\mathcal{D}$ of class-R response distributions. Pooling is basically a Naïve Bayes (Hand & Yu, 2001; Ng & Jordan, 2002), log-likelihood classifier, with the R responses as input features. A prior on the response distributions defines them as asymmetrical Gaussians. Softmax and Pooling are each applied to different categories of samples: Besides leveraging the diversity of class-R responses of the trained NN, SPH takes advantage of two kinds of diversity in sample responses. First, it distinguishes between samples with high vs low Softmax scores. Second, it distinguishes between samples that are amenable to the Naïve Bayes approach, and those for which Softmax is a more reliable predictor.
For samples with high softmax scores, we use the Softmax prediction. Pooling is applied to samples that have low Softmax scores. If the prediction returned by the Pooling classifier is among a set of “trusted” classes, (as determined on the validation set), then we trust the prediction. If it is in the set of “untrusted” classes, then we revert to the Softmax prediction. A schematic is given in Figure 4.
The model is trained as usual, e.g. with Softmax embedded in the loss function. We note that training with Softmax is not required: Any model, regardless of optimization method, that has a layer of $N$ Rs (where $N$ = number of classes) can use SPH as the final layer (here we only examined Softmax-optimized NNs). In the context of Softmax optimizers, we define the Rs as the pre-Softmax units, with $R_i$ corresponding to the unit that targets the $i$th class.
After training is done, we use the validation set to define the array of class-R response distributions $\mathcal{D}$, in particular the means and asymmetrical standard deviations for each $\mathcal{D}_{ij}$. We then optimize hyperparameters for SPH using the validation set. This includes creating a weight matrix $W$ and a class masking vector. $W$ indicates how informative the entries of $\mathcal{D}$ are, and the mask indicates which Pooled class predictions we should trust and which we should ignore.
At test time, samples are run through the NN as usual. Given a sample $s$, if the Softmax score is high (above some threshold) the Softmax prediction is accepted. If the Softmax score is low, the sample is sent to the Pooling classifier.
The Pooling classifier
The Pooling classifier has two stages: a veto stage, and a pooled likelihood predictor. In the veto stage, the vector of the Rs' responses to sample $s$ is compared to the distributions in the array $\mathcal{D}$. If $R_j(s)$ falls far outside the expected behavior of class $i$, for enough $j$, then class $i$ is vetoed, i.e. it is removed from consideration as a possible prediction for $s$. In the pooled likelihood stage, likelihoods for each class are calculated using an asymmetric Mahalanobis distance measure (over all Rs) and a sample-specific weight matrix $\hat{W}$. The predicted (most likely) class is returned.
If this predicted class is trustworthy (according to the class mask), we keep this Pooled prediction. Otherwise we revert to the Softmax prediction.
Thus, SPH uses the Pooled classifier on harder (more uncertain) samples, but only when Pooling returns a class for which we expect Pooling to succeed, based on results on the validation set. SPH uses the Softmax classifier on easier samples, and on harder samples when Pooling returns a class for which we expect it to do poorly.
Full details are given in the Appendix. A full codebase will be found at https://github.com/charlesDelahunt/MoneyOnTheTable. This includes Python/Keras code for the SPH classifier and for a hyperparameter sweep.
3 Experiments and results
We ran experiments on vectorized MNIST (hereafter “MNIST” to make the vectorization constraint explicit), and on the Cifar datasets: Cifar 10; Cifar 20 (which is Cifar 100 with 20 meta-classes); and Cifar 100 (LeCun & Cortes, 2010; Krizhevsky, 2009). Results of MNIST experiments indicate the benefits of the method, while results of Cifar experiments show its limitations.
3.1 Comparison of SPH and Softmax on MNIST
In order to see effects of SPH vs Softmax over a range of training data loads and baseline model accuracies, we used the MNIST dataset. We used a simple NN (in Keras and Tensorflow) with two dense layers, and controlled trained model accuracy by varying the number of training samples, from 100 to 50,000 (e.g. 200 training samples gave 78% test accuracy, 10k total gave 96% test accuracy). The trained models had mean Softmax accuracies ranging from 70% to 98%. We then compared the test set accuracy of SPH and Softmax across this range.
For each choice of number of training samples we trained and ran 9 models, each with random draws for Train, Test, and Validation sets. We used 4000 validation and 4000 test samples in all cases. For each model, we (i) trained the model with a Softmax-based loss function, as usual; (ii) randomly chose non-overlapping validation and test sets from the Test data; (iii) ran a parameter sweep over SPH parameters using the validation set as described in the Appendix; and (iv) recorded test set accuracies using the resulting calculated parameter sets.
As two Figures-of-Merit we measured: Raw percentage gain in test set accuracy; and relative reduction in test set error (as a percentage), due to SPH versus the Softmax baseline. The second metric allows easier comparison of results on models with widely different Softmax accuracies. For these experiments, we reported the optimal test set results, i.e. we did not use cross-validation on the validation set to fix hyperparameters before proceeding to the test set. That is, we set aside the issue of choosing exactly optimal hyperparameters, in order to see what gains were possible (see Appendix for details).
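The two Figures-of-Merit can be written out as a short Python sketch. The function names and the example accuracies below are hypothetical illustrations, not results from the experiments.

```python
def raw_gain(acc_softmax, acc_sph):
    """Raw gain in test accuracy, in percentage points."""
    return acc_sph - acc_softmax

def relative_error_reduction(acc_softmax, acc_sph):
    """Relative reduction in test error (as a percentage) versus the Softmax baseline."""
    err_softmax = 100.0 - acc_softmax
    err_sph = 100.0 - acc_sph
    return 100.0 * (err_softmax - err_sph) / err_softmax

# Hypothetical model: Softmax accuracy 96.0%, hybrid accuracy 96.4%.
print(raw_gain(96.0, 96.4))                  # about 0.4 percentage points
print(relative_error_reduction(96.0, 96.4))  # about 10% of the baseline error removed
```

The example shows why the second metric is easier to compare across models: a small raw gain at high baseline accuracy can still remove a sizeable fraction of the remaining error.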
Our core finding is that all MNIST models benefited from SPH at test time, even when the baseline model already had high accuracy: Test set error was reduced by a mean of 6 to 20%, with higher gains in models that were trained on fewer samples (and therefore with lower baseline Softmax accuracy). The mean raw percent gains in accuracy are shown as vertical red bars in Figure 5. As baseline (Softmax) accuracy increased, raw gains from SPH over Softmax decreased, but the relative reduction in error remained stable even at very high baseline accuracies. See Figure 6.
The gains from SPH are picked up at virtually no cost (except for accumulation of the class-R response distributions in the array $\mathcal{D}$), since it is bolted on after the model is trained as usual, simply by accessing information encoded in the trained model but unseen by Softmax.
3.1.1 Gains from SPH measured as reduced training data loads
Deep NNs (DNNs) typically require large amounts of training data, which can be time-consuming and costly to collect, annotate, and manage. In some situations (e.g. medical work, field tests, and scientific experiments) data is not only expensive to collect but also constrained in quantity due to real-world exigencies. In this context, tools that reduce the training data required to hit a given performance spec are valuable.
SPH increased the test set accuracy of a given model on MNIST, allowing it to match the performance of another model, trained with more data but using Softmax. Thus, the gain from using SPH can be measured in “virtual training samples”, i.e. the number of extra samples that would be needed to attain equivalent test accuracy using only Softmax. SPH yielded a gain of roughly 1.2x to 1.6x on MNIST, i.e. when using Softmax alone, 20% to 40% more training data is needed to attain accuracy equivalent to SPH. This is plotted in Figure 7 as “wasted” training samples. Thus, use of SPH directly cut training data costs.
3.2 Results on Cifar
To see whether the method worked with deeper NN architectures, we applied it to the Cifar dataset with 10, 20, and 100 classes. In each case we tested 3 or 4 models of varied trained accuracy (Cifar 10: 59 to 89%; Cifar 20: 51 to 71%; Cifar 100: 40 to 60%). Models were built from templates at (Keras-team, 2018; Kumar, 2018).
We found that SPH yielded either only small benefits, or sometimes none at all. On Cifar 10, relative reductions in test error were 2.0, 2.4, 2.6, and 1.0% for models with Softmax test accuracies 59, 61, 73, and 89%. On Cifar 20, reductions were 1.4, 1.0, 0.2, and 0.0% for models with Softmax test accuracies 51, 57, 60, and 71%. On Cifar 100, reductions were 0.13, 0.0, and 0.0% for models with Softmax test accuracies 40, 46, and 60%.
We see three trends here. First, the DNNs were much less responsive to SPH than the shallow NNs used on MNIST. Second, datasets with larger numbers of classes were less responsive. Third, there was some (small) benefit to this approach. Whether this indicates that better algorithms than SPH might yield useful benefits, or whether DNNs are intrinsically not amenable to this approach, is unknown.
Neural Nets are more capable than we realize, and contain reserves of class-specific information untapped by Softmax. This is particularly true when there is not an abundance of training data to push the pre-softmax response neurons to their ideal [0, 1] extremes. Limited data, even scarcity of data, is the norm in many important ML use cases, including medicine, scientific experiments, and field-collected data. In these cases, it is important to leverage as much class-specific information encoded in the trained model as possible. Softmax may be a non-optimal tool for summarizing the model’s output.
In this work, we built a hybrid classifier, SPH, that combines Softmax and a Naïve Bayes-like pooled-likelihoods method. SPH takes advantage of class-specific behaviors encoded in the trained model which are not used by Softmax. In our experiments on NNs trained with Softmax-based loss functions, SPH delivered improvements to performance, using the same trained model, through simple substitution for Softmax at testing. Improvements were substantial on shallow NNs trained on the MNIST dataset, but were minor or non-existent on DNNs trained on the Cifar dataset.
Our method focused on two forms of inter-class diversity:
(i) The class-R response distributions (where Rs are the pre-Softmax neurons) contain a wealth of class-specific information, encoded by the NN during training, which can be tapped to improve model performance. We leverage this information by defining a pooled likelihood classifier over the array $\mathcal{D}$ of the class-R response distributions.
(ii) Classes display diverse behaviors as they pass through the model. In particular, Softmax may tend to fail on certain classes more than others. We leverage this diversity by defining a mask that determines which Pooling predictions are trustworthy, and which should be ignored, relying instead on the Softmax predictions.
We note there is nothing at all magical or optimal about SPH as a means to leverage information encoded in $\mathcal{D}$. We fully expect there exist other more effective approaches to the problem. Possible avenues include: (i) find algorithms that better utilize $\mathcal{D}$; (ii) dig deeper, into the NN’s inner layers, to find salient class-specific behaviors; (iii) find new loss functions that directly utilize $\mathcal{D}$ to guide training, in order to accentuate class-specific contrasts in the final trained $\mathcal{D}$.
The “full $\mathcal{D}$” approach proposed here may be useful for any trained model that contains accessible class-response distributions such as the array $\mathcal{D}$. We experimented here only on models (NNs) trained with a Softmax loss function, which would tend to maximize the Softmax-accessible information encoded by training. It is an open question whether models trained with other loss functions might prove more amenable to this approach. It is also unknown whether other Softmax-based models (such as SVM) respond differently than NNs to methods such as SPH, and whether DNNs are a viable target for these alternatives to Softmax.
This appendix gives details about the Softmax-Pooling Hybrid (SPH) algorithm. It has two main parts: (i) Determining hyperparameters and the resources built from them; and (ii) applying SPH to a test set. Further details can be found in the codebase.
5.1 Determining hyperparameters
This section discusses how to prepare resources for . We need the following: (i) the array of class- response distributions ; (ii) a weight matrix ; (iii) a class mask ; (iv) assorted other hyperparameters. and both depend on hyperparameters, while does not.
5.1.1 Characterize $\mathcal{D}$, the array of class-R response distributions
During the training phase, as internal NN weights update, each R develops distinct responses to classes. Let $\mathcal{D}_{ij}$ equal the response distribution of $R_j$ to class $i$. To characterize $\mathcal{D}$, we first run a Validation set through the trained NN. (We cannot use the Training set for this, since samples that directly affect internal model weights have distinct behavior when passed through the trained model.)
Of these validation samples, we select only those within a certain range of Softmax scores (high and low limits are hyperparameters) to characterize $\mathcal{D}$. An optimal range will likely not include high-scoring samples, because the distributions in $\mathcal{D}$ are only relevant to samples that will be passed to Pooling, which are low-scoring by design. We wish the array $\mathcal{D}$ to target this population.
We define $\mathcal{D}_{ij}$ as an asymmetrical Gaussian, in terms of a mean $\mu_{ij}$ (or median if wished) and separate left and right standard deviations (i.e. $\sigma^L_{ij}$ and $\sigma^R_{ij}$) to better characterize asymmetrical response distributions.
$\mu_{ij}$ is simply the mean (or median) of $\mathcal{D}_{ij}$.
We generate $\sigma^L_{ij}$ via mirror images: Let $X$ = the set of responses $\{x : x \in \mathcal{D}_{ij} \wedge x \le \mu_{ij}\}$, where $\wedge$ = logical AND. Then subtract $\mu_{ij}$, mirror this set about 0, and calculate the std dev of the combined set: $\sigma^L_{ij}$ = std dev$(\{x - \mu_{ij}\} \cup \{\mu_{ij} - x\})$ for $x \in X$. A similar calculation on the responses at or above $\mu_{ij}$ gives $\sigma^R_{ij}$.
Doing this for each class-R pair characterizes $\mathcal{D}$ as three $N \times N$ arrays: $\mu$, $\sigma^L$, and $\sigma^R$.
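The mirror-image construction can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the code from the paper's repository: the function names are hypothetical, the input `responses` is assumed to hold the pre-Softmax outputs of validation samples already filtered by Softmax score, and every class is assumed to appear at least once.

```python
import numpy as np

def mirrored_std(deviations):
    """Std dev of a one-sided set of deviations after mirroring it about 0."""
    if deviations.size == 0:
        return 0.0
    return float(np.std(np.concatenate([deviations, -deviations])))

def characterize_distributions(responses, labels, n_classes, use_median=False):
    """Return (mu, sigma_left, sigma_right), each N x N, with entry [i, j]
    describing the responses of R j to validation samples of class i."""
    responses = np.asarray(responses, dtype=float)   # shape (n_samples, N)
    labels = np.asarray(labels)
    mu = np.zeros((n_classes, n_classes))
    sigma_left = np.zeros_like(mu)
    sigma_right = np.zeros_like(mu)
    for i in range(n_classes):
        class_responses = responses[labels == i]
        for j in range(n_classes):
            r = class_responses[:, j]
            center = np.median(r) if use_median else np.mean(r)
            mu[i, j] = center
            # One-sided deviations, mirrored so each side acts like a full Gaussian.
            sigma_left[i, j] = mirrored_std(r[r <= center] - center)
            sigma_right[i, j] = mirrored_std(r[r >= center] - center)
    return mu, sigma_left, sigma_right
```

A distribution with a long right tail will come out with a right standard deviation larger than the left one, which is the asymmetry the three arrays are meant to capture.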
5.1.2 Weight matrix $W$
Not all $\mathcal{D}_{ij}$ are created equal.
In some cases, $\mathcal{D}_{ij}$ contains valuable class-separating information, while in other cases, an R’s responses to different classes overlap, so that $R_j$ is noisy for some (not all) classes $i$.
We encode these differences in a weight matrix $W$.
When assessing the likelihoods of the various classes at testing, $W$ emphasizes some Rs over others, different for each class.
The process has three parts: (i) calculate a variant of Fisher distance for each $\mathcal{D}_{ij}$; (ii) set $W_{ij} = 0$ for noisy $\mathcal{D}_{ij}$; (iii) assign positive values to the remaining $W_{ij}$.
(i) We wish a variant of Fisher distance that reflects whether $R_j$ distinguishes class $i$ from the other classes. Define
(assuming . A similar formulation works for the opposite case).
This is done for each $\mathcal{D}_{ij}$ because a given $R_j$ may separate some classes from the other classes well, and some classes badly (e.g. in a highly-trained Softmax NN, class $i$’s “home R”, $R_i$, is optimized to distinguish class $i$ best of all). We note that this is a compromise solution that loses some class-specific information.
(ii) Sparsify $W$: For some class-R pairs the class distributions are just too overlapped. We set $W_{ij} = 0$ wherever the Fisher distance falls below some threshold. For example, if we wish to only consider classes at least 2 std dev apart, we set this threshold = 2.
(iii) We assign the remaining $W_{ij}$ positive values based on their Fisher distances. Then we normalize the rows of $W$, raising the entries to a power that acts as a sharpener; each new $W_{ij}$ is calculated using old values, not the new values. We normalize by row (i.e. for each class, over all Rs), since for a test sample $s$, the vector of pre-Softmax NN outputs is what will be visible.
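Steps (ii) and (iii) can be sketched as below. The exact formulas are not recoverable from the text, so this is one plausible reading rather than the paper's definition: it assumes the Fisher-distance matrix `fisher` (entry [i, j] for class i and R j) has already been computed, that surviving weights are set equal to their Fisher distances, and that the sharpener is applied before row normalization. All names are hypothetical.

```python
import numpy as np

def build_weight_matrix(fisher, min_fisher=2.0, sharpener=1.0):
    """Sparsify, sharpen, and row-normalize a matrix of Fisher distances.

    Assumed reading of the text: entries below `min_fisher` are zeroed,
    the rest are raised to `sharpener` (> 0), and each row (one class,
    over all Rs) is normalized to sum to 1.
    """
    f = np.asarray(fisher, dtype=float)
    w = np.where(f >= min_fisher, f, 0.0) ** sharpener
    row_sums = w.sum(axis=1, keepdims=True)
    # Rows with no informative R stay all-zero instead of dividing by zero.
    return np.divide(w, row_sums, out=np.zeros_like(w), where=row_sums > 0)
```

With this reading, a larger sharpener concentrates each class's weight on its most discriminating Rs, and a larger threshold makes the matrix sparser.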
5.1.3 Class mask
Classes display diverse behaviors when passed through a trained NN, and they will benefit (or suffer) more or less from a Pooled-likelihood classifier versus Softmax. For a sample $s$, let $R(s)$ be the vector of pre-Softmax NN outputs, and let $\hat{s}_{Pool}$ and $\hat{s}_{SM}$ be the predicted classes (via Pooling or Softmax). If a class is poorly classified by Pooling, we wish to distrust and ignore the output $\hat{s}_{Pool}$, and revert to the Softmax prediction $\hat{s}_{SM}$ instead.
To estimate which classes respond well to Pooling, we run the low-scoring samples (those whose top Softmax score falls below the certainty threshold) from the Validation set through the Pooling classifier, and calculate the accuracy on each class. Let the class mask be a $1 \times N$ vector of zeros. Let $V_i$ be the set of validation samples from class $i$.
If the Pooled accuracy on $V_i$ exceeds some threshold, we set the $i$th mask entry to 1, to indicate that Pooling will (hopefully) give better results than Softmax on these low-scoring samples at testing.
The mask is used by SPH as follows: Suppose the class predicted for a sample by Pooling is $i$; then if the $i$th mask entry is 1 we trust the result; else we use the Softmax result, $\hat{s}_{SM}$.
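A possible way to build the class mask from validation results is sketched below. It assumes the threshold is the class's Softmax accuracy plus an additive margin (`min_gain`); the text does not say whether the margin is additive or multiplicative, so that choice and all names here are hypothetical.

```python
import numpy as np

def build_class_mask(y_true, pooled_pred, softmax_pred, n_classes, min_gain=0.0):
    """Return a length-N 0/1 vector; entry i is 1 if Pooling is trusted for class i.

    All inputs describe low-scoring validation samples only. A class is trusted
    when its Pooled accuracy beats its Softmax accuracy by more than `min_gain`
    (an assumed, additive margin).
    """
    y_true = np.asarray(y_true)
    pooled_pred = np.asarray(pooled_pred)
    softmax_pred = np.asarray(softmax_pred)
    mask = np.zeros(n_classes, dtype=int)
    for i in range(n_classes):
        in_class = y_true == i
        if not in_class.any():
            continue  # no evidence for this class: leave it untrusted
        pooled_acc = np.mean(pooled_pred[in_class] == i)
        softmax_acc = np.mean(softmax_pred[in_class] == i)
        if pooled_acc > softmax_acc + min_gain:
            mask[i] = 1
    return mask
```

Classes with no low-scoring validation samples default to untrusted, so the hybrid falls back to Softmax for them.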
5.1.4 Hyperparameter optimization
Important hyperparameters include (organized by purpose):
1. To determine which samples are routed to Pooling:
Certainty threshold: For a sample $s$, if its top Softmax score is below this threshold, then $s$ gets re-routed to the Pooling classifier.
2. To determine which samples to use when characterizing $\mathcal{D}$, by keeping top Softmax scores within a relevant range:
A lower threshold and an upper threshold on the top Softmax score.
3. To determine the weight matrix $W$:
Minimum Fisher distance threshold, used to cull noisy class-R distributions. A high value makes $W$ sparser.
An exponent to sharpen the contrasts between different $W_{ij}$.
4. To control the Veto stage:
Outlier threshold: For sample $s$, if the class-R Mahalanobis distance $M_{ij}(s)$ exceeds this threshold, then $R_j$ is suggesting that class $i$ is improbable.
Veto count: For sample $s$ and class $i$, if at least this many Rs trigger the improbability flag, then (for $s$ only) class $i$ is vetoed as a possible prediction.
5. To determine which classes have trustworthy Pooled results (i.e. the class mask):
Expected gain of Pooling over Softmax: If a class’s Validation set Pooled accuracy does not exceed its Softmax accuracy by this gain (on low-scoring samples only), Pooling results will be ignored and Softmax used instead.
6. For calculating Mahalanobis distances in the pooled likelihood measure:
An exponent to sharpen Mahalanobis distances (an exponent of 2 makes the measure similar to log-likelihood).
To optimize hyperparameters via a parameter sweep, we (i) choose a set of hyperparameters; (ii) create $W$ and the class mask on the Validation set; then (iii) re-apply SPH to the Validation set. This is fast in practice, since the NN model only needs to run on the validation and test sets once, and the process of assessing hyperparameter sets admits of various shortcuts.
In this work we elided the question of how to choose the hyperparameter set based solely on validation set outcomes. In some cases there is a clean correlation between gains on validation and test sets. In other cases, selecting the hyperparameters which give maximum validation gains yields sub-optimal test set gains. Figure 8 shows a range of scenarios. We expect that in general some kind of cross-validation is required to optimally select hyperparameters for generalization to test sets.
5.2 Applying SPH to a Test set
To apply SPH to a test sample $s$, we have the following order of events:
Direct $s$ to either Softmax or Pooling, according to its Softmax score.
If $s$ is sent to Softmax, the Softmax prediction is returned, and we are done. If $s$ is sent to Pooling, there are three steps:
(i) The veto step removes certain classes via a new weight matrix $\hat{W}$.
(ii) The pooled likelihood measure, using $\mathcal{D}$ and $\hat{W}$, returns a Pooled prediction.
(iii) The class mask is applied: If the Pooled prediction is a trusted class for Pooling, we keep the Pooling prediction, and we are done. Else we revert to the Softmax prediction.
5.2.1 Gate samples using Softmax scores
For a sample $s$ let $R(s)$ = the vector of pre-Softmax readouts. Softmax tends to be reliable when its top score is high, and unreliable when this top score is low.
We apply a gating threshold: If the top Softmax score of $s$ exceeds the threshold, we accept the Softmax prediction and we are done. Otherwise, we route $s$ to the Pooling branch.
5.2.2 Asymmetric Mahalanobis distance (definition)
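Mahalanobis distance $M$ measures how far a sample $x$ is from the mean $\mu$ of a distribution, using 1 standard deviation $\sigma$ as the unit of distance: $M = |x - \mu| / \sigma$. For a Gaussian, $-M^2/2$ is the log likelihood, up to an additive constant.
The Pooling branch uses an asymmetrical Mahalanobis distance of $R_j(s)$ from the center $\mu_{ij}$ of each $\mathcal{D}_{ij}$, where the standard deviations $\sigma^L_{ij}$ and $\sigma^R_{ij}$ are different on either side of the center (median may be used instead of mean, to downplay outliers).
Let $M_{ij}(s)$ = the asymmetrical Mahalanobis distance of $R_j(s)$ from $\mu_{ij}$:
$M_{ij}(s) = (R_j(s) - \mu_{ij}) / \sigma^R_{ij}$ if $R_j(s) \ge \mu_{ij}$, or $M_{ij}(s) = (\mu_{ij} - R_j(s)) / \sigma^L_{ij}$ if $R_j(s) < \mu_{ij}$.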
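The asymmetric distance can be computed for all class-R pairs at once. The sketch below is a minimal NumPy version with hypothetical names; the small `eps` floor on the standard deviations is an added safeguard against division by zero, not something specified in the text.

```python
import numpy as np

def asymmetric_mahalanobis(r, mu, sigma_left, sigma_right, eps=1e-8):
    """Return the N x N matrix M with M[i, j] = distance of response r[j]
    from the center mu[i, j], in units of the std dev on the side where r[j] falls."""
    r = np.asarray(r, dtype=float)
    diff = r[np.newaxis, :] - mu                    # entry [i, j] = r_j - mu_ij
    sigma = np.where(diff < 0, sigma_left, sigma_right)
    return np.abs(diff) / np.maximum(sigma, eps)
```

Each row of the result describes how well the sample's full response vector fits one class.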
5.2.3 Veto stage
We use the distributions array $\mathcal{D}$ to rule out “impossible” class predictions. Roughly speaking, if $R_j(s)$ is many standard deviations ($\sigma^L_{ij}$ or $\sigma^R_{ij}$) from $\mu_{ij}$ for some $j$, then class $i$ is very unlikely to be correct: The behavior of sample $s$ does not fit the distributions of class $i$. On the other hand, sometimes samples happen to fall in the outer reaches of their class distributions, especially when $N$ is high (e.g. in Cifar 100). So we do not want to veto a class due to just one outlandish response.
We use two parameters. The first is the number of standard deviations that triggers an outlier status. The second is the number of Rs that must be triggered to veto a class. For sample $s$, let the flag $v_{ij} = 1$ if $M_{ij}(s)$ exceeds the first parameter, 0 otherwise. We wish to veto all classes $i$ for which $\sum_j v_{ij}$ reaches the second parameter.
We create a new sample-specific weight matrix $\hat{W}$ from $W$ by zeroing out the $i$th row of $W$, for each such class $i$. This removes these classes from consideration. Columns of this $\hat{W}$ are then renormalized to sum to 1.
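The veto stage can be sketched as follows, with hypothetical names. It assumes a class is vetoed when the number of flagged Rs is at least the veto count (the text does not state whether the comparison is strict), and it leaves all-zero columns untouched when every remaining weight in a column is zero.

```python
import numpy as np

def apply_veto(weights, distances, outlier_sigmas, min_flags):
    """Return (w_hat, vetoed): the sample-specific weight matrix and a boolean
    vector marking vetoed classes. `distances` is the N x N matrix of
    asymmetric Mahalanobis distances for one sample."""
    distances = np.asarray(distances, dtype=float)
    flags = distances > outlier_sigmas             # R j says class i is improbable
    vetoed = flags.sum(axis=1) >= min_flags        # assumed: "at least" this many flags
    w_hat = np.array(weights, dtype=float)         # copy, so W itself is unchanged
    w_hat[vetoed, :] = 0.0
    col_sums = w_hat.sum(axis=0, keepdims=True)
    w_hat = np.divide(w_hat, col_sums, out=np.zeros_like(w_hat), where=col_sums > 0)
    return w_hat, vetoed
```

If every class is vetoed the returned matrix is all zeros, so a caller would need a fallback (for example, reverting to the Softmax prediction).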
5.2.4 Pooled-likelihood classifier
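The Pooling branch uses a weighted asymmetric Mahalanobis distance to predict the class of $s$:
$\hat{s}_{Pool} = \arg\min_i \sum_j \hat{W}_{ij} \, M_{ij}(s)^{\gamma}$, where $\hat{W}$ is the sample-specific weight matrix, $M_{ij}(s)$ is the asymmetric Mahalanobis distance of the $j$th R's response from the center of its class-$i$ distribution, and $\gamma$ is a sharpener.
This is the weighted sum of each row of $M$, a quantity similar to a (negative) log likelihood for each class.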
Another approach to this classifier step might be to treat it as a weighted Naive Bayes model, without the asymmetrical Gaussian as a regularizing prior.
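The pooled-likelihood step described in this section might be sketched as below. It assumes the predicted class is the one with the smallest weighted sum of sharpened distances, and it explicitly excludes classes whose weight row is all zero (vetoed or uninformative), since their weighted sum would otherwise be a spurious minimum of 0. Names are hypothetical.

```python
import numpy as np

def pooled_predict(w_hat, distances, sharpener=2.0):
    """Return (predicted class, per-class scores) from a weighted sum over Rs.

    Lower score = closer fit to the class's response distributions.
    Classes with an all-zero weight row are given an infinite score.
    """
    w_hat = np.asarray(w_hat, dtype=float)
    distances = np.asarray(distances, dtype=float)
    scores = (w_hat * distances ** sharpener).sum(axis=1)
    candidates = w_hat.sum(axis=1) > 0
    scores = np.where(candidates, scores, np.inf)
    return int(np.argmin(scores)), scores
```

With a sharpener of 2 the score resembles a weighted negative Gaussian log likelihood, which is the sense in which the classifier is Naïve Bayes-like.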
5.2.5 Enforce the Pooling class mask
Based on the diverse behaviors of classes in the Validation set, we expect Pooling to do relatively better than Softmax on some classes and worse on others. This is encoded in the masking vector, whose $i$th entry is 1 if Pooling is good at classifying class $i$, and 0 otherwise.
For a sample $s$ routed through the Pooling branch, suppose the Pooling branch prediction is class $i$. If the $i$th mask entry is 1, we accept the Pooled prediction. If it is 0, we ignore the Pooled result and revert to the Softmax prediction.
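Putting the gate and the mask together, the final decision for one test sample can be sketched as a short function. The name `sph_decide` and its arguments are hypothetical; in practice the Pooled prediction would only need to be computed for samples that fail the gate.

```python
def sph_decide(softmax_pred, softmax_score, pooled_pred, class_mask, certainty_threshold):
    """Return the hybrid prediction for one sample.

    softmax_score is the top Softmax score; class_mask[i] == 1 means a Pooled
    prediction of class i is trusted.
    """
    if softmax_score > certainty_threshold:
        return softmax_pred          # confident Softmax: accept it
    if class_mask[pooled_pred] == 1:
        return pooled_pred           # low score, trusted Pooled class
    return softmax_pred              # low score, untrusted Pooled class: revert

# Hypothetical mask: Pooling is trusted for class 9 but not for class 1.
mask = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(sph_decide(1, 0.95, 9, mask, 0.6))  # 1: Softmax is confident
print(sph_decide(1, 0.40, 9, mask, 0.6))  # 9: routed to Pooling, class trusted
print(sph_decide(9, 0.40, 1, mask, 0.6))  # 9: Pooled class untrusted, revert
```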
CBD gratefully acknowledges partial funding from the Swartz Foundation.
- Delahunt & Kutz (2018) Delahunt, C. B. and Kutz, J. N. Putting a bug in ml: The moth olfactory network learns to read mnist. arXiv, 2018. URL https://arxiv.org/abs/1802.05405. In review.
- Hand & Yu (2001) Hand, D. J. and Yu, K. Idiot’s bayes — not so stupid after all? International Statistical Review, 69(3):385–398, 2001. doi: 10.1111/j.1751-5823.2001.tb00465.x. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1751-5823.2001.tb00465.x.
- Keras-team (2018) Keras-team. 2018. URL https://github.com/keras-team/.
- Koller & Bengio (2018) Koller, D. and Bengio, Y. A fireside chat with daphne koller. ICLR, 2018. URL https://www.youtube.com/watch?v=N4mdV1CIpvI.
- Krizhevsky (2009) Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, 2009.
- Kumar (2018) Kumar, A. Achieving 90% accuracy in object recognition task on cifar with keras. 2018. URL https://appliedmachinelearning.blog.
- LeCun & Cortes (2010) LeCun, Y. and Cortes, C. MNIST handwritten digit database. Website, 2010. URL http://yann.lecun.com/exdb/mnist/.
- Murphy (2012) Murphy, K. P. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012. ISBN 0262018020, 9780262018029.
- Ng & Jordan (2002) Ng, A. Y. and Jordan, M. I. On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. In Dietterich, T. G., Becker, S., and Ghahramani, Z. (eds.), Advances in Neural Information Processing Systems 14, pp. 841–848. MIT Press, 2002.