
Maximum-Entropy Fine-Grained Classification

by Abhimanyu Dubey, et al.

Fine-Grained Visual Classification (FGVC) is an important computer vision problem that involves small diversity within the different classes, and often requires expert annotators to collect data. Utilizing this notion of small visual diversity, we revisit Maximum-Entropy learning in the context of fine-grained classification, and provide a training routine that maximizes the entropy of the output probability distribution for training convolutional neural networks on FGVC tasks. We provide a theoretical as well as empirical justification of our approach, and achieve state-of-the-art performance across a variety of classification tasks in FGVC; the approach can potentially be extended to any fine-tuning task. Our method is robust to different hyperparameter values, the amount of training data, and the amount of training label noise, and can hence be a valuable tool in many similar problems.



1 Introduction

For ImageNet imagenet_cvpr09 classification and similar large-scale classification tasks that span numerous diverse classes and millions of images, strongly discriminative learning by minimizing the cross-entropy from the labels improves performance for convolutional neural networks (CNNs). Fine-grained visual classification problems differ from such large-scale classification in two ways: (i) the classes are visually very similar to each other and are harder to distinguish (see Figure 1(a)), and (ii) there are fewer training samples, and therefore the training dataset might not be representative of the application scenario. Consider a technique that penalizes strongly discriminative learning, by preventing a CNN from learning a model that memorizes specific artifacts present in training images in order to minimize the cross-entropy loss on the training set. This is helpful in fine-grained classification: for instance, if a certain species of bird is mostly photographed against a different background compared to other species, memorizing the background will lower training cross-entropy error but hurt generalization, since the CNN will associate the background with the species itself.

In this paper, we formalize this intuition and revisit the classical Maximum-Entropy regime, based on the following underlying idea: the entropy of the output probability distribution produced by the CNN is a measure of the “peakiness” or “confidence” of the CNN. Learning CNN models that have a higher value of output entropy will reduce the “confidence” of the classifier, leading to better generalization when training with limited, fine-grained training data. Our contributions can be listed as follows: (i) we formalize the notion of “fine-grained” vs. “large-scale” image classification based on a measure of diversity of the features, (ii) we derive bounds on the $\ell_2$ regularization of classifier weights based on this diversity and the entropy of the classifier, (iii) we provide uniform convergence bounds on estimating entropy from samples in terms of feature diversity, (iv) we formulate a fine-tuning objective function that obtains state-of-the-art performance on the five most commonly used FGVC datasets across six widely used CNN architectures, and (v) we analyze the effect of Maximum-Entropy training over different hyperparameter values, amounts of training data, and amounts of training label noise, demonstrating that our method is consistently robust to all of the above.

Figure 1: (a) Samples from the CUB-200-2011 FGVC (top) and ImageNet (bottom) datasets. (b) Plot of the top 2 principal components (obtained from the ILSVRC training set on GoogLeNet pool5 features) on the ImageNet (red) and CUB-200-2011 (blue) validation sets. CUB-200-2011 data is concentrated with less diversity, as hypothesized.

2 Related Work

Maximum-Entropy Learning:

The principle of Maximum-Entropy, proposed by Jaynes, is a classic idea in Bayesian statistics, and states that the probability distribution best representing the current state of knowledge is the one with the largest entropy, in the context of testable information (such as accuracy). This idea has been explored in different domains of science, from statistical mechanics and Bayesian inference to unsupervised learning and reinforcement learning mnih2016asynchronous ; luo2016learning . Regularization methods that penalize minimum-entropy predictions have been explored in the context of semi-supervised learning grandvalet2006entropy , and deterministic entropy annealing rose1998deterministic has been used for vector quantization. In machine learning, the regularization of the entropy of classifier weights has been used empirically chen2009similarity ; szummer2002partially and studied theoretically shawe2009pac ; zhu2009maximum .

In most treatments of the Maximum-Entropy principle in classification, emphasis has been given to the entropy of the weights of the classifiers themselves shawe2009pac . In our formulation, we focus instead on the Maximum-Entropy principle applied to the prediction vectors. This formulation has been explored experimentally in the work of Pereyra et al. pereyra2017regularizing for generic image classification. Our work builds on their analysis by providing a theoretical treatment of fine-grained classification problems, and justifies the application of Maximum-Entropy to target scenarios with limited diversity between classes and limited training data. Additionally, we obtain large improvements in fine-grained classification, which motivates the usage of the Maximum-Entropy training principle in the fine-tuning setting, opening up this idea to a much broader range of applied computer vision problems. We also note the related idea of label smoothing regularization szegedy2016rethinking , which tries to prevent the largest logit from becoming much larger than the rest and shows improved generalization in large-scale image classification problems.

Fine-Grained Classification: Fine-Grained Visual Classification (FGVC) has been an active area of interest in the computer vision community. Typical fine-grained problems involve differentiating between closely related categories, such as animal and plant species or types of food. Since background context can act as a distraction in most cases of FGVC, there has been research on improving the attentional and localization capabilities of CNN-based algorithms. Bilinear pooling lin2015bilinear is an instrumental method that combines pairwise local features to improve spatial invariance. This has been extended by Kernel Pooling cui2017kernel , which uses higher-order interactions instead of the originally proposed dot products, and by Compact Bilinear Pooling gao2016compact , which speeds up the bilinear pooling operation. Another approach to localization is the prediction of an affine transformation of the original image, as proposed by Spatial Transformer Networks jaderberg2015spatial . Part-based Region CNNs ren2015faster use region-wise attention to improve local features. Leveraging additional information such as pose and regions has also been explored branson2014bird ; zhang2012pose , along with robust image representations such as CNN filter banks cimpoi2015deep , VLAD jegou2012aggregating and Fisher vectors perronnin2010improving . Supplementing training data krause2016unreasonable and model averaging Moghimi2016 have also yielded significant improvements.

The central theme among current approaches is to increase the diversity of relevant features that are used in classification, either by removing irrelevant information (such as background) through better localization or pooling, or by supplementing features with part and pose information, or with more training data. Our method focuses on the classification task after obtaining features (and is hence compatible with existing approaches), by selecting the classifier that assumes the minimum information about the task, following the principle of Maximum-Entropy. This approach is very useful in the context of fine-grained tasks, especially when fine-tuning from ImageNet CNN models that are already over-parameterized.

3 Method

In the case of Maximum-Entropy fine-tuning, we optimize the following objective:

$\theta^{*} = \arg\min_{\theta}\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\mathcal{L}_{\mathrm{CE}}(x, y; \theta)\big] \;-\; \gamma\, \mathbb{E}_{x\sim\mathcal{D}}\big[H[p(\cdot \mid x; \theta)]\big]$

where $\theta$ represents the model parameters (initialized using a model pretrained on a dataset such as ImageNet imagenet_cvpr09 ) and $\gamma > 0$ is a hyperparameter. The entropy $H[\cdot]$ can be understood as a measure of the “peakiness” or “indecisiveness” of the classifier in its prediction for the given input. For instance, if the classifier is strongly confident in its belief of a particular class $c$, then all the mass will be concentrated at class $c$, giving us an entropy of 0. Conversely, if a classifier is equally confused between all $C$ classes, the entropy attains its maximum value of $\log C$. In problems such as fine-grained classification, where samples that belong to different classes can be visually very similar, it is reasonable to prevent the classifier from being too confident in its outputs (that is, from having low entropy), since the classes themselves are so similar.
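As a concrete illustration, this objective can be sketched in a few lines of NumPy (the function name `maxent_loss` and the toy values in the usage below are our own, not from the paper's released code): the per-batch loss is the mean cross-entropy minus $\gamma$ times the mean Shannon entropy of the predicted distributions.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def maxent_loss(logits, labels, gamma=1.0):
    """Cross-entropy minus gamma * entropy of the predicted distribution.

    Subtracting the entropy term rewards flatter (less confident)
    predictions, which is the Maximum-Entropy fine-tuning idea.
    """
    p = softmax(logits)                      # (N, C) predicted distributions
    n = logits.shape[0]
    ce = -np.log(p[np.arange(n), labels] + 1e-12).mean()
    entropy = -(p * np.log(p + 1e-12)).sum(axis=-1).mean()
    return ce - gamma * entropy
```

With `gamma=0.0` this reduces to the ordinary cross-entropy objective; larger values of `gamma` increasingly favor smoother output distributions.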

3.1 Preliminaries

Consider the multi-class classification problem over $C$ classes. The input domain is given by $\mathcal{X}$, with an accompanying probability measure $\mathcal{D}$ defined over $\mathcal{X}$. The training data is given by $N$ i.i.d. samples $\{x_1, \dots, x_N\}$ drawn from $\mathcal{D}$. Each point $x_i$ has an associated label $y_i \in \{1, \dots, C\}$. We learn a CNN such that for each point $x$ in $\mathcal{X}$, the CNN induces a conditional probability distribution over the $C$ classes whose mode matches the label $y$.

A CNN architecture consists of a series of convolutional and subsampling layers that culminate in an activation $\Phi(x) \in \mathbb{R}^{D}$, which is fed to a $C$-way softmax classifier with weights $w = \{w_1, \dots, w_C\}$ such that:

$p(y = i \mid x; w, \Phi) = \dfrac{\exp\big(w_i^{\top}\Phi(x)\big)}{\sum_{j=1}^{C}\exp\big(w_j^{\top}\Phi(x)\big)}$


During training, we learn the classifier weights $w$ and the feature extractor $\Phi$ (collectively referred to as $\theta$) by minimizing the expected KL (Kullback-Leibler) divergence of the CNN conditional probability distribution from the true label distribution over the training set, which is equivalent to minimizing the cross-entropy loss:

$\theta^{*} = \arg\min_{\theta}\; \frac{1}{N}\sum_{n=1}^{N} \mathrm{KL}\big(\bar{y}_n \,\big\|\, p(\cdot \mid x_n; \theta)\big)$

where $\bar{y}_n$ denotes the one-hot label vector of sample $x_n$. During fine-tuning, we learn a feature map $\Phi$ from a large training set (such as ImageNet), discard the original classifier, and learn new weights $w$ on the smaller dataset (note that the number of classes $C$, and hence the shape of $w$, may also change for the new task). The entropy of the conditional probability distribution above is given by:

$H\big[p(\cdot \mid x; \theta)\big] = -\sum_{i=1}^{C} p(i \mid x; \theta)\,\log p(i \mid x; \theta)$
To minimize the overall entropy of the classifier over a data distribution $\mathcal{D}$, we would be interested in the expected value of the entropy over the distribution:

$H_{\mathcal{D}} = \mathbb{E}_{x\sim\mathcal{D}}\Big[H\big[p(\cdot \mid x; \theta)\big]\Big]$

Similarly, the empirical average of the conditional entropy over the training set is:

$\hat{H} = \frac{1}{N}\sum_{n=1}^{N} H\big[p(\cdot \mid x_n; \theta)\big]$
To have high training accuracy, we do not need to learn a model that gives zero cross-entropy loss. Instead, we only require a classifier to output a conditional probability distribution whose mode coincides with the correct class. Next, we show that for problems with low diversity, higher validation accuracy can be obtained with a higher entropy (and a higher training cross-entropy). We now formalize the notion of diversity in feature vectors over a data distribution.

3.2 Diversity and Fine-Grained Visual Classification

We assume the pretrained $D$-dimensional feature map $\Phi(\cdot)$ to be distributed as a multivariate mixture of $K$ Gaussians, where $K$ is unknown (and may be very large). Using an overall mean subtraction, we can re-center the distribution to be zero-mean. The distribution of $\Phi(x)$ for $x \sim \mathcal{D}$ is then given by:

$\Phi(x) \sim \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mu_k, \Sigma_k)$

where the $\Sigma_k$ are $D \times D$ covariance matrices for each class $k$, $\mu_k$ is the mean feature vector for class $k$, and the mixture weights $\pi_k$ sum to 1. The zero mean implies that $\sum_{k=1}^{K}\pi_k \mu_k = 0$. For this distribution, the equivalent overall covariance matrix can be given by:

$\Sigma = \sum_{k=1}^{K} \pi_k\big(\Sigma_k + \mu_k\mu_k^{\top}\big)$
Now, the eigenvalues $\{\lambda_1, \dots, \lambda_D\}$ of the overall covariance matrix $\Sigma$ characterize the variance of the distribution across the $D$ dimensions. Since $\Sigma$ is positive-definite, all eigenvalues are positive (this can be shown using the fact that each covariance matrix $\Sigma_k$ is itself positive-definite and each $\pi_k > 0$). Thus, to describe the variance of the feature distribution we define Diversity.

Definition 1.

Let the data distribution be $\mathcal{D}$ over space $\mathcal{X}$, and the feature extractor be given by $\Phi(\cdot)$. Then, the Diversity of the features is defined as:

$\mathbb{D}(\Phi; \mathcal{D}) \triangleq \sum_{i=1}^{D} \lambda_i = \mathrm{tr}(\Sigma)$

This definition of diversity is consistent with multivariate analysis, and is a common measure of the total variance of a data distribution jonsson1982some . Now, let $\mathcal{D}_{L}$ denote the data distribution under a large-scale image classification task such as ImageNet, and let $\mathcal{D}_{F}$ denote the data distribution under a fine-grained image classification task. We can then characterize fine-grained problems as data distributions that, for any feature extractor $\Phi$, have the property:

$\mathbb{D}(\Phi; \mathcal{D}_{F}) \ll \mathbb{D}(\Phi; \mathcal{D}_{L})$
On plotting pretrained features for both the ImageNet validation set and the validation set of CUB-200-2011 (a fine-grained dataset), we see that the CUB-200-2011 features are concentrated with a lower variance compared to the ImageNet features (see Figure 1(b)), consistent with the characterization above. In the next section, we describe the connections of Maximum-Entropy with model selection in fine-grained classification.
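Definition 1 can be estimated from a sample of features as the trace of the empirical covariance matrix, which equals the sum of its eigenvalues. A NumPy sketch (the function name `diversity` is ours):

```python
import numpy as np

def diversity(features):
    """Estimate the Diversity of a feature set: the trace of the
    mean-centered covariance matrix, i.e. the sum of its eigenvalues.

    `features` is an (N, D) array of N feature vectors.
    """
    x = features - features.mean(axis=0, keepdims=True)
    cov = (x.T @ x) / (len(x) - 1)   # unbiased empirical covariance
    return np.trace(cov)
```

Comparing `diversity(...)` on features extracted from a large-scale dataset versus a fine-grained one would be the empirical counterpart of the inequality above.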

3.3 Maximum-Entropy and Model Selection

By the Tikhonov regularization of a linear classifier golub1999tikhonov , we would want to select weights $w$ such that $\|w\|_2$ is small ($\ell_2$ regularization), to get higher generalization performance. This technique is also implemented in neural networks trained using stochastic gradient descent (SGD) through “weight decay”. The next result provides some insight into how fine-grained problems can potentially limit model selection. We use the following result to lower-bound the norm of the weights $w$ in terms of the expected entropy and the feature diversity:

Theorem 1.

Let the final-layer weights be denoted by $w$, the data distribution be $\mathcal{D}$ over $\mathcal{X}$, and the feature extractor be given by $\Phi(\cdot)$. For the expected conditional entropy, the following holds true:

A full proof of Theorem 1 is included in the supplement. Let us consider the case when the Diversity $\mathbb{D}(\Phi; \mathcal{D})$ is large (ImageNet classification). In this case, this lower bound is very weak and inconsequential. However, in the case of small Diversity (fine-grained classification), the denominator is small, and this lower bound can subsequently limit the space of model selection by only allowing models with large weight norms, leading to potential overfitting. We see that if the numerator is small, the diversity of the features has a smaller impact on limiting model selection; hence, it can be advantageous to maximize prediction entropy. We note that since this is a lower bound, the proof is primarily expository.

More intuitively, it can be understood that fine-grained problems will often require more information to distinguish between classes, and regularizing the prediction entropy prevents models from memorizing a lot of information about the training data, which can potentially benefit generalization. Now, Theorem 1 involves the expected conditional entropy over the data distribution. However, during training we only have sample access to the data distribution, which we use as a surrogate. It is then essential to ensure that the empirical estimate of the conditional entropy (from training samples) is an accurate estimate of the true expected conditional entropy. The next result ensures that for large $N$, in a fine-grained classification problem, the sample estimate of the average conditional entropy is close to the expected conditional entropy.

Theorem 2.

Let the final-layer weights be denoted by $w$, the data distribution be $\mathcal{D}$ over $\mathcal{X}$, and the feature extractor be given by $\Phi(\cdot)$. Then with probability at least $1 - \delta$, we have:

A full proof of Theorem 2 is included in the supplement. We see that as long as the diversity of the features is small and $N$ is large, our estimate of the entropy will be close to its expected value. Using this result, we can express Theorem 1 in terms of the empirical mean conditional entropy.

Corollary 1.

With probability at least $1 - \delta$, the empirical mean conditional entropy follows:

A full proof of Corollary 1 is included in the supplement. We see that we recover the result from Theorem 1 as $N \to \infty$. Corollary 1 shows that as long as the diversity of the features is small and $N$ is large, the conclusions drawn from Theorem 1 apply to the empirical mean entropy as well. We now proceed to describe the results obtained with Maximum-Entropy fine-grained classification.
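Theorem 2 states that the empirical mean entropy concentrates around its expectation as $N$ grows. A toy NumPy simulation can illustrate this (the random-projection "classifier" below is our own construction for illustration, not the paper's setup): the spread of the empirical estimate across repeated draws shrinks as the sample size increases.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_entropy(logits):
    # Empirical average of the per-example prediction entropy.
    p = softmax(logits)
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
# A fixed "classifier": random projection of 16-d Gaussian features to 10 classes.
W = rng.normal(size=(16, 10))

def sample_mean_entropy(n):
    feats = rng.normal(size=(n, 16))
    return mean_entropy(feats @ W)

# Repeat the estimate many times at two sample sizes.
small = [sample_mean_entropy(20) for _ in range(50)]
large = [sample_mean_entropy(2000) for _ in range(50)]
```

With 100x more samples per estimate, the standard deviation of the estimates drops by roughly a factor of 10, as a Hoeffding-style bound predicts.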

4 Experiments

(A) CUB-200-2011 wah2011caltech
Method Top-1
Prior Work
STNjaderberg2015spatial 84.10 -
Zhang et al. zhang2016picking 84.50 -
Lin et al. lin2017improved 85.80 -
Cui et al. cui2017kernel 86.20 -
Our Results
GoogLeNet 68.19 (6.18)
MaxEnt-GoogLeNet 74.37
ResNet-50 75.15 (5.22)
MaxEnt-ResNet-50 80.37
VGGNet16 73.28 (3.74)
MaxEnt-VGGNet16 77.02
Bilinear CNN lin2015bilinear 84.10 (1.17)
MaxEnt-BilinearCNN 85.27
DenseNet-161 84.21 (2.33)
MaxEnt-DenseNet-161 86.54
(B) Cars krause20133d
Method Top-1
Prior Work
Wang et al. Wang_2016_CVPR 85.70 -
Liu et al. liu2016hierarchical 86.80 -
Lin et al. lin2017improved 92.00 -
Cui et al. cui2017kernel 92.40 -
Our Results
GoogLeNet 84.85 (2.17)
MaxEnt-GoogLeNet 87.02
ResNet-50 91.52 (2.33)
MaxEnt-ResNet-50 93.85
VGGNet16 80.60 (3.28)
MaxEnt-VGGNet16 83.88
Bilinear CNN lin2015bilinear 91.20 (1.61)
MaxEnt-Bilinear CNN 92.81
DenseNet-161 91.83 (1.18)
MaxEnt-DenseNet-161 93.01
(C) Aircrafts maji2013fine
Method Top-1
Prior Work
Simon et al. simon2017generalized 85.50 -
Cui et al. cui2017kernel 86.90 -
LRBP kong2016low 87.30 -
Lin et al. lin2017improved 88.50 -
Our Results
GoogLeNet 74.04 (5.12)
MaxEnt-GoogLeNet 79.16
ResNet-50 81.19 (2.67)
MaxEnt-ResNet-50 83.86
VGGNet16 74.17 (3.91)
MaxEnt-VGGNet16 78.08
BilinearCNN lin2015bilinear 84.10 (2.02)
MaxEnt-BilinearCNN 86.12
DenseNet-161 86.30 (3.46)
MaxEnt-DenseNet-161 89.76
(D) NABirds van2015building
Method Top-1
Prior Work
Branson et al. branson2014bird 35.70 -
Van et al. van2015building 75.00 -
Our Results
GoogLeNet 70.66 (2.38)
MaxEnt-GoogLeNet 73.04
ResNet-50 63.55 (5.66)
MaxEnt-ResNet-50 69.21
VGGNet16 68.34 (4.28)
MaxEnt-VGGNet16 72.62
BilinearCNN lin2015bilinear 80.90 (1.76)
MaxEnt-BilinearCNN 82.66
DenseNet-161 79.35 (3.67)
MaxEnt-DenseNet-161 83.02
(E) Stanford Dogs khosla2011novel
Method Top-1
Prior Work
Zhang et al. zhang2016weakly 80.43 -
Krause et al. krause2016unreasonable 80.60 -
Our Results
GoogLeNet 55.76 (6.25)
MaxEnt-GoogLeNet 62.01
ResNet-50 69.92 (3.64)
MaxEnt-ResNet-50 73.56
VGGNet16 61.92 (3.52)
MaxEnt-VGGNet16 65.44
BilinearCNN lin2015bilinear 82.13 (1.05)
MaxEnt-BilinearCNN 83.18
DenseNet-161 81.18 (2.45)
MaxEnt-DenseNet-161 83.63
Table 1: Maximum-Entropy training (MaxEnt) obtains state-of-the-art performance on five widely-used fine-grained visual classification datasets (A-E). The improvement over the baseline model is reported as the value in parentheses. All results are averaged over 6 trials.

We perform all experiments using the PyTorch pytorch framework on a cluster of NVIDIA Titan X GPUs. We now describe our results on benchmark datasets in fine-grained recognition, followed by several ablation studies.

4.1 Fine-Grained Visual Classification

Maximum-Entropy training improves performance across five standard fine-grained datasets, with substantial gains in low-performing models. We obtain state-of-the-art results on all five datasets (Table 1-(A-E)). Since all these datasets are small, we report numbers averaged over 6 trials.

Classification Accuracy: First, we observe that Maximum-Entropy training obtains significant performance gains when fine-tuning from models trained on the ImageNet dataset (e.g., GoogLeNet szegedy2015going , ResNet-50 he2016deep ). For example, on the CUB-200-2011 dataset, standard fine-tuning of GoogLeNet gives an accuracy of 68.19%, while fine-tuning with Maximum-Entropy gives an accuracy of 74.37%, a large improvement that persists across datasets. Since many fine-tuning tasks use general-purpose base models such as GoogLeNet and ResNet, this result is relevant to the large number of applications that involve fine-tuning on specialized datasets.

Maximum-Entropy classification also improves prediction performance for CNN architectures specifically designed for fine-grained visual classification. For instance, it improves the performance of the Bilinear CNN lin2015bilinear on all 5 datasets and obtains state-of-the-art results, to the best of our knowledge. The gains are smaller, since these architectures improve diversity in the features by localization, and hence maximizing entropy is less crucial in this case. However, it is important to note that most pooling architectures lin2015bilinear use a large model as a base-model (such as VGGNet simonyan2014very ) and have an expensive pooling operation. Thus they are computationally very expensive, and infeasible for tasks that have resource constraints in terms of data and computation time.

Increase in Generality of Features: We hypothesize that Maximum-Entropy training will encourage the classifier to reduce the specificity of the features. To evaluate this hypothesis, we perform an eigendecomposition of the covariance matrix of the pool5-layer features of GoogLeNet trained on CUB-200-2011, and analyze the trend of the sorted eigenvalues (Figure 2(a)). We examine the features from CNNs with (i) no fine-tuning (“Basic”), (ii) regular fine-tuning, and (iii) fine-tuning with Maximum-Entropy.

For a feature matrix with large covariance between the features of different classes, we would expect the first few eigenvalues to be large and the rest to diminish quickly, since fewer orthogonal components suffice to summarize the data. Conversely, for a completely uncorrelated feature matrix, we would see a longer tail in the decreasing magnitudes of the eigenvalues. Figure 2(a) shows that for the Basic features (with no fine-tuning), there is a fat tail on both the training and test sets due to the presence of a large number of uncorrelated features. After fine-tuning on the training data, we observe a reduction in the tail of the curve, implying that some generality in the features has been introduced through fine-tuning. The test curve follows a similar decrease, consistent with the increase in test accuracy. Finally, for Maximum-Entropy, we observe a substantial decrease in the width of the tail of eigenvalue magnitudes, suggesting a larger increase in the generality of features on both training and test sets, which confirms our hypothesis.
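The eigen-analysis above can be sketched as follows (a NumPy sketch on synthetic features; the summary statistic `tail_mass` is our own shorthand for "width of the eigenvalue tail", not a quantity defined in the paper):

```python
import numpy as np

def sorted_eigenvalues(features):
    """Eigenvalues of the empirical feature covariance matrix, largest first."""
    x = features - features.mean(axis=0, keepdims=True)
    cov = (x.T @ x) / (len(x) - 1)
    return np.sort(np.linalg.eigvalsh(cov))[::-1]

def tail_mass(eigvals, k=5):
    """Fraction of total variance outside the top-k components.

    A smaller value means a shorter eigenvalue tail, i.e. the features
    are more correlated / more general.
    """
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negatives
    return 1.0 - eigvals[:k].sum() / eigvals.sum()
```

Running `tail_mass(sorted_eigenvalues(feats))` on Basic, fine-tuned, and Maximum-Entropy features would quantify the tail-shrinking trend described above.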

Effect on Prediction Probabilities: With Maximum-Entropy training, the predicted probability vector is smoother, leading to a higher cross-entropy during both training and validation. We observe that the average probability of the top predicted class decreases significantly with Maximum-Entropy, as predicted by the mathematical formulation (for $\gamma > 0$). On the CUB-200-2011 dataset with the GoogLeNet architecture, the mean probability of the top class is 0.34 with Maximum-Entropy, compared to 0.77 without it. Moreover, the tail of the probability values is fatter with Maximum-Entropy, as depicted in Figure 2(b).
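The statistic reported above, the mean probability assigned to the top predicted class, is computed directly from a batch of logits; a minimal NumPy sketch (toy values in the test are ours):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_top_probability(logits):
    """Average probability mass the classifier places on its argmax class.

    Lower values correspond to flatter, higher-entropy predictions,
    which is the behavior Maximum-Entropy training encourages.
    """
    return float(softmax(logits).max(axis=-1).mean())
```

Sharp logits give a value near 1; uniform logits over C classes give exactly 1/C.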

Figure 2: (a) Maximum-Entropy training encourages the network to reduce the specificity of the features, which is reflected in the shorter tail of eigenvalues of the covariance matrix of pool5 GoogLeNet features on both the training and test sets of CUB-200-2011. We plot the magnitude of the i-th eigenvalue obtained after decomposition on the test set (dashed) and training set (solid). (b) With Maximum-Entropy training, the predicted probability vector is smoother, with a fatter tail (GoogLeNet on CUB-200-2011).
Method CIFAR-10 CIFAR-100
GoogLeNet 84.16 (-0.06) 70.24 (3.26)
MaxEnt + GoogLeNet 84.10 73.50
DenseNet-121 92.19 (0.03) 75.01 (1.21)
MaxEnt + DenseNet-121 92.22 76.22
Table 2: Maximum-Entropy obtains larger gains on the finer CIFAR-100 dataset as compared to CIFAR-10. The improvement over the baseline model is reported as the value in parentheses.
Method Random-ImageNet Dogs-ImageNet
GoogLeNet 71.85 (0.35) 62.28 (2.63)
MaxEnt + GoogLeNet 72.20 64.91
ResNet-50 82.01 (0.28) 73.81 (1.86)
MaxEnt + ResNet-50 82.29 75.66
Table 3: Maximum-Entropy obtains larger gains on a subset of ImageNet containing dog sub-classes than on a randomly chosen subset of the same size with higher visual diversity. The improvement over the baseline model (in cross-validation) is reported as the value in parentheses.
Method CUB-200-2011 Cars Aircrafts NABirds Stanford Dogs
VGG-Net16 MaxEnt 77.02 83.88 78.08 72.62 65.44
LSR 70.03 81.45 75.06 69.28 63.06
ResNet-50 MaxEnt 80.37 93.85 83.86 69.21 73.56
LSR 78.20 92.04 81.26 64.02 70.03
DenseNet-161 MaxEnt 86.54 93.01 89.76 83.02 83.63
LSR 84.86 91.96 87.05 80.11 82.98
Table 4: Maximum-Entropy training obtains much larger gains on fine-grained visual classification than Label Smoothing Regularization (LSR) szegedy2016rethinking .

4.2 Ablation Studies

CIFAR-10 and CIFAR-100: We evaluate Maximum-Entropy on the CIFAR-10 and CIFAR-100 datasets krizhevsky2014cifar . CIFAR-100 contains images of the same kind as CIFAR-10, but with finer category distinction in the labels: each of 20 “superclasses” contains five finer divisions, for 100 categories in total. Therefore, we expect (and observe) that Maximum-Entropy training provides stronger gains on CIFAR-100 as compared to CIFAR-10 across models (Table 2).

ImageNet Ablation Experiment: To understand the effect of Maximum-Entropy training on datasets with more samples than the small fine-grained datasets, we create two synthetic datasets: (i) Random-ImageNet, which is formed by selecting 116K images from a random subset of 117 classes of ImageNet imagenet_cvpr09 , and (ii) Dogs-ImageNet, which is formed by selecting all classes from ImageNet that have dogs as labels, and which has the same number of images and classes as Random-ImageNet. Dogs-ImageNet has less diversity than Random-ImageNet, and thus we expect the gains from Maximum-Entropy to be higher. Under 5-way cross-validation on both datasets, we observe higher gains on the Dogs-ImageNet dataset for two CNN models (Table 3).

Choice of Hyperparameter $\gamma$: An integral component of regularization is the choice of the weighting parameter $\gamma$. We find that performance is fairly robust to the choice of $\gamma$ (Figure 3(a)). Please see the supplement for experiment-wise details.

Figure 3: (a) Classification performance is robust to the choice of $\gamma$ over a large region, as shown here for CUB-200-2011 with the VGGNet-16 and Bilinear CNN models. (b) Maximum-Entropy is more robust to increasing amounts of label noise (CUB-200-2011 on GoogLeNet). (c) Maximum-Entropy obtains higher validation performance despite a higher training cross-entropy loss.

Robustness to Label Noise: In this experiment, we gradually introduce label noise by randomly permuting the labels of increasing fractions of the training data. We follow an evaluation protocol identical to the previous experiment, and observe that Maximum-Entropy is more robust to label noise (Figure 3(b)).
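The noise protocol described above (randomly permuting a fraction of the labels) can be sketched as follows; the helper name `corrupt_labels` is ours:

```python
import numpy as np

def corrupt_labels(labels, fraction, rng=None):
    """Randomly permute `fraction` of the labels among themselves,
    leaving the rest untouched."""
    rng = np.random.default_rng(rng)
    labels = np.asarray(labels).copy()
    n = len(labels)
    # Choose which positions to corrupt, then shuffle their labels.
    idx = rng.choice(n, size=int(round(fraction * n)), replace=False)
    labels[idx] = labels[rng.permutation(idx)]
    return labels
```

Permuting (rather than resampling) labels keeps the class priors unchanged, so any accuracy drop is attributable to the noise itself.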

Training Cross-Entropy and Validation Accuracy: We expect Maximum-Entropy training to provide higher accuracy at the cost of a higher training cross-entropy. In Figure 3(c), we show that we achieve a higher validation accuracy when training with Maximum-Entropy despite the training cross-entropy loss converging to a higher value.

Comparison with Label-Smoothing Regularization: Label-Smoothing Regularization szegedy2016rethinking penalizes the KL-divergence of the predicted distribution from the uniform distribution, and is also a method to prevent peaky output distributions. On comparing performance with Label-Smoothing Regularization, we found that Maximum-Entropy provides much larger gains on fine-grained recognition (see Table 4).

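For reference, the two regularizers being compared can be sketched side by side in NumPy (simplified versions of ours, not the exact implementations used for Table 4): label smoothing mixes the one-hot target with the uniform distribution, while Maximum-Entropy subtracts $\gamma$ times the prediction entropy from the plain cross-entropy.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def label_smoothing_loss(logits, labels, eps=0.1):
    """Cross-entropy against targets mixed with the uniform distribution."""
    n, c = logits.shape
    logp = np.log(softmax(logits) + 1e-12)
    target = np.full((n, c), eps / c)             # uniform mass eps
    target[np.arange(n), labels] += 1.0 - eps     # remaining mass on the label
    return float(-(target * logp).sum(axis=-1).mean())

def maxent_loss(logits, labels, gamma=0.1):
    """Plain cross-entropy minus gamma times the prediction entropy."""
    p = softmax(logits)
    n = logits.shape[0]
    ce = -np.log(p[np.arange(n), labels] + 1e-12).mean()
    h = -(p * np.log(p + 1e-12)).sum(axis=-1).mean()
    return float(ce - gamma * h)
```

Both discourage over-confident outputs, but label smoothing modifies the target while Maximum-Entropy directly rewards the entropy of the prediction.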

5 Discussion and Conclusion

Many real-world applications of computer vision models involve extensive fine-tuning on small, relatively imbalanced datasets with much smaller diversity in the training set compared to the large-scale datasets their base models were trained on; a notable example is fine-grained recognition. In this domain, Maximum-Entropy training provides an easy-to-implement and simple-to-understand training schedule that consistently improves performance. Several extensions, however, remain to be explored: explicitly enforcing a large diversity in the features through a different regularizer would be an interesting extension of this study, as would extensions to large-scale problems that tackle clusters of diverse objects separately. We leave these to future work, with our results as a starting point.


We thank Ryan Farrell, Pei Guo, Xavier Boix, Dhaval Adjodah, Spandan Madan, and Ishaan Grover for their feedback on the project and Google’s TensorFlow Research Cloud Program for providing TPU computing resources.

Appendix 1: Preliminaries

Probabilistic Tail Bounds

Theorem 3 (Hoeffding’s Inequality (Theorem 2.8 of Blp12 )).

Let $X_1, \dots, X_n$ be independent random variables such that $X_i$ takes its values in $[a_i, b_i]$ almost surely for all $i \le n$. Let $S = \sum_{i=1}^{n}\big(X_i - \mathbb{E}[X_i]\big)$; then for every $t > 0$:

$\Pr\big(S \ge t\big) \le \exp\!\left(-\frac{2t^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\right)$

Theorem 4 (Cantelli’s Inequality (Equation 7 of mallows1969inequalities )).

The inequality states that

$\Pr\big(X - \mathbb{E}[X] \ge t\big) \le \frac{\sigma^2}{\sigma^2 + t^2} \quad \text{for } t > 0,$

where $X$ is a real-valued random variable, $\Pr$ is the probability measure, $\mathbb{E}[X]$ is the expected value of $X$, and $\sigma^2$ is the variance of $X$.

Basic Derivations for Multivariate Gaussian Mixtures

Lemma 1.

For vectors , constants and target vector ,


For each vector , we know by the Cauchy-Schwarz Inequality that:




Combining the above, we have:


Lemma 2.

For vectors , and target vector ,


For each vector , we know by the Cauchy-Schwarz Inequality that:

Multiplying the above equation by , we have:


Multiplying the above equation by , we have:

Combining the above, we have:


Lemma 3.

For a $D$-dimensional multivariate normal distribution $\mathcal{N}(\mu, \Sigma)$, we have:


Lemma 4.

For a random variable that is distributed as a $D$-dimensional mixture of Gaussians, that is, drawn from $\sum_{k=1}^{K}\pi_k\,\mathcal{N}(\mu_k, \Sigma_k)$ with $\sum_{k=1}^{K}\pi_k = 1$:


By law of conditional expectation:

Since the conditional distribution given the mixture component $k$ is the $D$-dimensional Gaussian $\mathcal{N}(\mu_k, \Sigma_k)$, from Lemma 3, we have:

Lemma 5.

For a $D$-dimensional multivariate normal distribution $\mathcal{N}(\mu, \Sigma)$, we have:


Lemma 6.

For a random variable that is distributed as a $D$-dimensional mixture of Gaussians, that is, drawn from $\sum_{k=1}^{K}\pi_k\,\mathcal{N}(\mu_k, \Sigma_k)$ with $\sum_{k=1}^{K}\pi_k = 1$:


By law of conditional expectation:

Since the conditional distribution given the mixture component $k$ is the $D$-dimensional Gaussian $\mathcal{N}(\mu_k, \Sigma_k)$, from Lemma 5, we have:

Lemma 7.

For a random variable that is distributed as a $D$-dimensional mixture of Gaussians, that is, drawn from $\sum_{k=1}^{K}\pi_k\,\mathcal{N}(\mu_k, \Sigma_k)$ with $\sum_{k=1}^{K}\pi_k = 1$:

Using results from Lemma 6 and Lemma 4, we have:

Classification Preliminaries

Consider the multi-class classification problem over $C$ classes. The input domain is given by $\mathcal{X}$, with an accompanying probability measure $\mathcal{D}$ defined over $\mathcal{X}$. The training data is given by $N$ i.i.d. samples $\{x_1, \dots, x_N\}$ drawn from $\mathcal{D}$. Each point $x_i$ has an associated label $y_i \in \{1, \dots, C\}$. We learn a CNN such that for each point $x$ in $\mathcal{X}$, the CNN induces a conditional probability distribution over the $C$ classes whose mode matches the label $y$.

A CNN architecture consists of a series of convolutional and subsampling layers that culminate in an activation $\Phi(x) \in \mathbb{R}^{D}$, which is fed to a $C$-way softmax classifier with weights $w = \{w_1, \dots, w_C\}$ such that:

$p(y = i \mid x; w, \Phi) = \dfrac{\exp\big(w_i^{\top}\Phi(x)\big)}{\sum_{j=1}^{C}\exp\big(w_j^{\top}\Phi(x)\big)}$

The entropy of this conditional probability distribution is given by:

$H\big[p(\cdot \mid x; \theta)\big] = -\sum_{i=1}^{C} p(i \mid x; \theta)\,\log p(i \mid x; \theta)$

The expected entropy over the distribution is given by:

$H_{\mathcal{D}} = \mathbb{E}_{x\sim\mathcal{D}}\Big[H\big[p(\cdot \mid x; \theta)\big]\Big]$

The empirical average of the conditional entropy over the training set is:

$\hat{H} = \frac{1}{N}\sum_{n=1}^{N} H\big[p(\cdot \mid x_n; \theta)\big]$

The Diversity of the features is given by:

$\mathbb{D}(\Phi; \mathcal{D}) = \sum_{i=1}^{D} \lambda_i = \mathrm{tr}(\Sigma)$

Appendix 2: Theoretical Results

Lemma 8.

For the above classification setup, the following holds:


For an input $x$, the conditional probability distribution over the $C$ classes for a statistical model with feature map $\Phi$ and weights $w$ can be given by:


We can thus write the conditional entropy for the above sample as:

Since is a concave function:
By Lemma 1, we have:
By Lemma 2, we have:

Now we are ready to prove Theorem 1 from the main paper.

Theorem 5 (Theorem 1 from the text: Lower Bound on the $\ell_2$-norm of the Classifier).

The expected conditional entropy follows:


From Lemma 8, we have:

Since , we have:
Taking expectation over , we have:
By Cauchy-Schwarz Inequality, . Using this:
By Lemma 4, we have:
Rearranging and using the definition of Diversity we have:

Lemma 9.

With probability at least $1 - \delta$,


Since the training set has $N$ i.i.d. samples drawn from $\mathcal{D}$, we have:

From Lemma 8, we know that for sample :
Thus, by applying Hoeffding’s Inequality we get:
Setting the RHS to $\delta$, we have with probability at least $1 - \delta$:

Lemma 10.

With probability at least $1 - \delta$, we have:


Since the training set has $N$ i.i.d. samples drawn from $\mathcal{D}$:

By Lemma 7, we have: