Amended Cross Entropy Cost: Framework For Explicit Diversity Encouragement

07/16/2020 ∙ by Ron Shoham, et al. ∙ Ben-Gurion University of the Negev

Cross Entropy (CE) plays an important role in machine learning and, in particular, in neural networks, where it is commonly used as the cost between the known distribution of the label and the Softmax/Sigmoid output. In this paper we present a new cost function called the Amended Cross Entropy (ACE). Its novelty lies in affording the capability to train multiple classifiers while explicitly controlling the diversity between them. We derived the new cost by mathematical analysis and "reverse engineering" of the way we wish the gradients to behave, and produced a tailor-made, elegant and intuitive cost function that achieves the desired result. This process is similar to the way the CE cost is picked as a cost function for Softmax/Sigmoid classifiers in order to obtain linear derivatives. By choosing the optimal diversity factor we produce an ensemble which yields better results than the vanilla one. We demonstrate two potential usages of this outcome, and present empirical results. Our method works for classification problems analogously to Negative Correlation Learning (NCL) for regression problems.




1 Introduction and motivation

It has been shown in several studies, both theoretically and empirically, that training an ensemble of models, i.e. aggregating predictions from multiple models, is superior to training a single model Brown et al. (2005); Feng et al. (2018); Fernandez-Delgado et al. (2014); Liu et al. (2019); Liu and Yao (1999); Ren et al. (2016); Sollich and Krogh (1996); Wang et al. (2019); Zhang and Suganthan (2017); Freund and Schapire (1997). Many works point out that one of the keys for an ensemble to perform well is to encourage diversity among the models Adeva et al. (2005); Carreira-Perpinan and Raziperchikolaei (2016); Lee et al. (2016); Liu and Yao (1999); Shi et al. (2018); Zhou et al. (2018); Freund and Schapire (1997). This property is the main motivation for our work.

Sigmoid and Softmax are both well-known functions used for classification (the former for binary and the latter for multi-class classification). Both are used to generate a distribution vector $\hat{p}(y|x)$ over the labels $y$, where $x$ is a given input. For Deep Neural Networks (DNNs) the framework of applying a Sigmoid/Softmax on top of the network is very popular, where the goal is to estimate the real distribution $p(y|x)$, which might be a 1-hot vector for a hard label. Henceforth, we omit $x$ unless it is crucial for some definition or proof. We denote $p \triangleq p(y|x)$ and $\hat{p} \triangleq \hat{p}(y|x)$. We optimize $\hat{p}$ by minimizing the CE cost function

$$H(p,\hat{p}) = -\sum_{y} p(y)\log \hat{p}(y). \tag{1}$$
The optimization is usually gradient based Kingma and Ba (2014); Tieleman and Hinton (2012). Hence, one of the main motivations for using the CE cost function over Sigmoid/Softmax outputs is the linear structure of the gradient, which is similar to that obtained by applying the Mean Squared Error (MSE) method over a linear regression estimator. Studies show that this property is important for preventing the vanishing gradient phenomenon Goodfellow et al. (2016); Nielsen (2015).

Now let us define the setting of the ensemble problem. We train $K$ classifiers, with distribution functions $\hat{p}_1,\dots,\hat{p}_K$, to generate the ensemble $\hat{p} = \frac{1}{K}\sum_{k=1}^{K}\hat{p}_k$, which estimates the real distribution $p$. This setting is very common, and the straightforward way to tackle it is by training each model independently using the CE cost function $H(p,\hat{p}_k)$. Encouraging diversity is manifested by using different training samples or different seeds for weight initialization. However, to the best of our knowledge, there is no explicit way to control the "amount" of diversity between the classifiers.

In this work we present a novel framework, called Amended Cross Entropy (ACE), which makes it possible to train each model and, simultaneously, to achieve diversity between the classifiers. Our main result in this work is the introduction of a new cost function

$$\tilde{H}(p,\hat{p}_k) = H(p,\hat{p}_k) - \frac{\gamma}{K-1}\sum_{j\neq k} H(\hat{p}_j,\hat{p}_k), \tag{2}$$

which is applied to the $k$-th classifier and is not independent of the other classifiers. We see that ACE is built from the vanilla CE between $p$ and $\hat{p}_k$, minus the average of the CE between $\hat{p}_k$ and the other estimators, factored by $\gamma$. This result is very intuitive, since we wish to minimize the CE of the estimated distribution with the real one, while enlarging the CE of the estimator with the others, i.e. encouraging diversity. The hyper-parameter $\gamma$ explicitly controls the diversity, and is fine-tuned in order to achieve optimal results. The development of ACE starts from an assumption about the structure we wish the gradient to have. As we show in this paper, a similar assumption lies at the base of applying CE over Softmax. We develop a variant especially for DNNs, which can be stacked on top of the network instead of the vanilla Softmax layer, and makes it possible to yield superior results without significantly increasing the number of parameters or the computational resources.
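The cost above translates directly into code. Below is a minimal NumPy sketch of the ACE cost (the helper names `cross_entropy` and `ace_loss` are ours, not from the paper):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_y p(y) log q(y)."""
    return -np.sum(p * np.log(q + eps))

def ace_loss(p, p_hats, k, gamma):
    """Amended Cross Entropy for the k-th classifier.

    p      : true distribution over labels
    p_hats : list of K predicted distributions
    k      : index of the classifier being trained
    gamma  : diversity hyper-parameter (gamma = 0 recovers vanilla CE)
    """
    K = len(p_hats)
    vanilla = cross_entropy(p, p_hats[k])
    # average CE between the k-th classifier and the other K-1 estimators
    diversity = sum(cross_entropy(p_hats[j], p_hats[k])
                    for j in range(K) if j != k) / (K - 1)
    return vanilla - gamma * diversity
```

Setting `gamma=0` makes `ace_loss` coincide with the vanilla CE, matching the independent-training baseline.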

This work has been inspired by the Negative Correlation Learning (NCL) Brown et al. (2005); Liu and Yao (1999); Shi et al. (2018) framework, which is used for regression ensembles. In the next section we will present the NCL framework, its development and its results, in order to explain the analogous approach we used in our work.

2 Related work: Negative Correlation Learning (NCL)

Liu and Yao (1999) and Brown et al. (2005) presented the NCL framework as a solution for the diversity issue for ensembles of regressors. Let us denote $X$ as the vector of features and $Y$ as the target. The goal is to find $\hat{f}(X;\theta)$ which yields as low an error as possible w.r.t. the MSE criterion, i.e. to minimize

$$\mathbb{E}\big[(Y - \hat{f}(X;\theta))^2\big]. \tag{3}$$

Here, $\theta$ stands for the parameters of $\hat{f}$. In practice, the distribution is unknown, so we use realizations (a training set) $\{(x_i, y_i)\}_{i=1}^{N}$ to estimate (3) with an empirical MSE using $\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{f}(x_i;\theta))^2$. Under the assumption that the samples are i.i.d., or at least stationary and ergodic, the empirical MSE converges to (3). We use the short notation $\hat{f}$ to denote $\hat{f}(X;\theta)$. Instead of (3) we can use the expectation operator and decompose the error into the known structure of bias and variance

$$\mathbb{E}\big[(Y-\hat{f})^2\big] = \big(\mathbb{E}[Y-\hat{f}]\big)^2 + \mathrm{var}\big(Y-\hat{f}\big). \tag{4}$$
A common way to apply an ensemble of models is to average $K$ trained estimators

$$\bar{f} = \frac{1}{K}\sum_{k=1}^{K}\hat{f}_k. \tag{5}$$
By checking the decomposition of the ensemble expected error it is straightforward to show that

$$\mathbb{E}\big[(Y-\bar{f})^2\big] = \overline{\mathrm{bias}}^2 + \frac{1}{K}\,\overline{\mathrm{var}} + \Big(1-\frac{1}{K}\Big)\,\overline{\mathrm{covar}}, \tag{6}$$

where $\overline{\mathrm{bias}}$, $\overline{\mathrm{var}}$ and $\overline{\mathrm{covar}}$ denote the averaged bias, variance and covariance of the individual estimators. This outcome is called the bias-variance-covariance decomposition, and is the main motivation for NCL. We notice that by reducing the correlation between the estimators of an ensemble, the ensemble might yield a lower error. Based on this, Liu and Yao (1999) proposed a regularization factor that is added to the cost function of each of the single estimators during the training phase. This factor is an estimation of the sum of covariances between the trained estimator and the others. The factor is multiplied by a hyper-parameter $\gamma$, which explicitly controls the "amount" of diversity between the single estimator and the other estimators in the ensemble

$$e_k = \frac{1}{2}\big(\hat{f}_k - y\big)^2 + \gamma\,\big(\hat{f}_k - \bar{f}\big)\sum_{j\neq k}\big(\hat{f}_j - \bar{f}\big). \tag{7}$$

Notice that in order to avoid a factor of $2$ in the gradient analysis, we multiply the MSE by a factor of $\frac{1}{2}$. By setting $\gamma = 0$ we get the conventional MSE cost function, i.e. each model is optimized independently.

Gradient analysis

Gradient-wise optimization Kingma and Ba (2014); Tieleman and Hinton (2012) is a very popular method for optimizing a model. Therefore, conducting analysis of the gradient behaviour of a cost function is advisable. Let us check the gradient of the cost function with respect to $\hat{f}_k$

$$\frac{\partial e_k}{\partial \hat{f}_k} = \big(\hat{f}_k - y\big) - \frac{2(K-1)}{K}\,\gamma\,\big(\hat{f}_k - \bar{f}\big). \tag{8}$$

By defining $\tilde{\gamma} \triangleq \frac{2(K-1)}{K}\gamma$, we get

$$\frac{\partial e_k}{\partial \hat{f}_k} = \big(\hat{f}_k - y\big) - \tilde{\gamma}\big(\hat{f}_k - \bar{f}\big). \tag{9}$$

We notice again that by setting $\gamma = 0$ we get the same gradient as with independent training.
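The NCL cost (7) and its analytic gradient (9) can be checked against finite differences. The sketch below is our own illustration, not the authors' code:

```python
import numpy as np

def ncl_cost(f, y, k, gamma):
    """NCL cost (7): 0.5*(f_k - y)^2 + gamma*(f_k - fbar)*sum_{j!=k}(f_j - fbar)."""
    fbar = f.mean()
    others = np.sum(np.delete(f, k) - fbar)
    return 0.5 * (f[k] - y) ** 2 + gamma * (f[k] - fbar) * others

def ncl_grad(f, y, k, gamma):
    """Analytic gradient (9): (f_k - y) - gamma_tilde*(f_k - fbar)."""
    K = len(f)
    gamma_tilde = 2.0 * (K - 1) / K * gamma
    return (f[k] - y) - gamma_tilde * (f[k] - f.mean())

# central finite differences over the k-th estimator's output
f = np.array([0.8, 1.2, 0.5, 1.0])
y, k, gamma, eps = 1.0, 0, 0.3, 1e-6
fp, fm = f.copy(), f.copy()
fp[k] += eps
fm[k] -= eps
numeric = (ncl_cost(fp, y, k, gamma) - ncl_cost(fm, y, k, gamma)) / (2 * eps)
```

Note that the derivative accounts for the dependence of $\bar{f}$ on $\hat{f}_k$, which is where the $\frac{2(K-1)}{K}$ factor comes from.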

2.1 Usage of NCL

Liu and Yao (1999) and Brown et al. (2005) suggested a vanilla approach for optimizing multiple regressors. They suggested training multiple regression models that do not have to be of the same architecture, but are trained simultaneously in order to reduce the correlation between the models. The architecture is presented in Fig. 1. However, applying this approach, the computational power and the number of parameters used increase significantly. For example, if we use the same architecture for all of the models, we use $K$ times the number of parameters used by a single model. If we train a DNN with millions of parameters, this might result in a non-scalable training scheme.

Figure 1: NCL. A sketch of a training phase of the $k$-th model. First, the input is processed by the $K$ models, which yields the predictions $\hat{f}_1,\dots,\hat{f}_K$. Using these, the cost function $e_k$ is calculated. Finally, the gradient of $e_k$ is calculated and model $k$ is updated accordingly.

In order to handle this, Shi et al. (2018) suggested a new approach: stacking an ensemble of regressors as a layer on top of a DNN instead of the vanilla regression layer. In this way, they claimed, they got the benefit of NCL while not increasing the number of parameters and computational power significantly. This architecture, called D-ConvNet, yields state-of-the-art results on a crowd counting task. The work, as well as a sketch of the architecture, can be seen in Shi et al. (2018).

3 Amended Cross Entropy (ACE)

In this section we first show the main motivation for using the CE cost function for a Softmax classifier. Like many other cost functions (MSE, Mean Absolute Error (MAE), etc.), CE achieves its minimum when the two distribution vectors are equal. However, CE is the only cost function which yields a linear gradient for a distribution generated by Softmax, similarly to the gradient of the MSE cost function over a linear regressor. We show this for the single-classifier case first, and later use this approach analogously for multiple classifiers, where we wish to obtain the same gradient structure as in NCL, in order to analytically develop the ACE framework.

CE cost function for Softmax classifier

Let us denote $M$ as the size of the set of events (labels), and $p \in \mathbb{R}^M$ as the real distribution vector for a given input (which is a 1-hot vector for a hard label). We wish to train an estimator for the real distribution. We denote the estimator parameters as $\theta$. The estimator generates a raw vector $z \in \mathbb{R}^M$, which is a function of the input, and applies Softmax over it in order to yield the estimator $\hat{p}$, i.e.

$$\hat{p}_i = \frac{e^{z_i}}{\sum_{j=1}^{M} e^{z_j}}, \quad i = 1,\dots,M. \tag{10}$$
Later, a CE cost function is applied to measure the error between the estimator and the real distribution (1). In order to optimize the estimator's parameters $\theta$, gradient based methods are applied Kingma and Ba (2014); Tieleman and Hinton (2012). The gradient is calculated using the chain rule

$$\frac{\partial H(p,\hat{p})}{\partial \theta} = \frac{\partial H(p,\hat{p})}{\partial z}\cdot\frac{\partial z}{\partial \theta}. \tag{11}$$

Now, let us calculate $\frac{\partial H(p,\hat{p})}{\partial z}$ explicitly

$$\frac{\partial H(p,\hat{p})}{\partial z_i} = \sum_{j=1}^{M} \frac{\partial H(p,\hat{p})}{\partial \hat{p}_j}\,\frac{\partial \hat{p}_j}{\partial z_i} = \hat{p}_i - p_i. \tag{12}$$
We see that a linear structure of the gradient is obtained when applying CE over a Softmax classifier. This structure is similar to that of the MSE cost function over a linear regression estimator Goodfellow et al. (2016); Nielsen (2015).
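The linear gradient (12) is easy to confirm numerically. A small self-contained sketch (our own helper names):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

def ce(p, q):
    return -np.sum(p * np.log(q))

# finite differences confirm dH/dz = softmax(z) - p
rng = np.random.default_rng(0)
z = rng.normal(size=5)
p = np.zeros(5); p[2] = 1.0          # hard (1-hot) label
analytic = softmax(z) - p

numeric, eps = np.empty_like(z), 1e-6
for i in range(5):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps; zm[i] -= eps
    numeric[i] = (ce(p, softmax(zp)) - ce(p, softmax(zm))) / (2 * eps)
```

The check works for soft labels as well, since (12) holds for any valid distribution $p$.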

3.1 ACE

Inspired by the NCL result and by our belief that an important consideration in the choice of a cost function is the gradient behaviour (as long as it is a valid cost function), we wish to find a cost function that yields the same properties. Therefore, we first assume the gradient structure, and later integrate it in order to find the appropriate cost function. Let us denote $K$ as the number of classifiers in the ensemble, $\tilde{H}_k$ as the $k$-th model cost function, $z_k$ as the raw output vector of the $k$-th model, $\hat{p}_k$ as the estimated distribution of the $k$-th model, and $\theta_k$ as the parameters of the $k$-th model. We would like to train an ensemble of $K$ models to estimate $p$. Since the gradient structure might be one of the most important considerations for choosing and constructing a cost function, by combining the results of (9) and (12) we assume a gradient

$$\frac{\partial \tilde{H}_k}{\partial z_k} = \big(\hat{p}_k - p\big) - \frac{\gamma}{K-1}\sum_{j\neq k}\big(\hat{p}_k - \hat{p}_j\big). \tag{13}$$
This assumption is the foundation of our proposed method and is the basis for developing the ACE framework. In order to find $\tilde{H}_k$ we need to integrate the above with respect to $z_k$

$$\tilde{H}_k = \int \frac{\partial \tilde{H}_k}{\partial z_k}\, dz_k. \tag{14}$$

By reverse engineering (12), and using the fact that $p$ and $\hat{p}_{j\neq k}$ are independent of $z_k$, we get

$$\tilde{H}_k = H(p,\hat{p}_k) - \frac{\gamma}{K-1}\sum_{j\neq k} H(\hat{p}_j,\hat{p}_k) + c, \tag{15}$$

where $c$ is a constant independent of $z_k$. We set $c = 0$. We can also denote $\bar{H}_{-k} \triangleq \frac{1}{K-1}\sum_{j\neq k} H(\hat{p}_j,\hat{p}_k)$ in order to get

$$\tilde{H}_k = H(p,\hat{p}_k) - \gamma\,\bar{H}_{-k}, \tag{16}$$

i.e. the average of the CE between the $k$-th classifier and the others. Notice that by setting $\gamma = 0$ we get the regular CE cost function.
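As a sanity check, the gradient of (15) with respect to the $k$-th logits should match the assumed structure (13). A NumPy sketch of that check (our own illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce(p, q):
    return -np.sum(p * np.log(q))

def ace(p, zs, k, gamma):
    """ACE (15) for classifier k, given the raw logit vectors zs[0..K-1]."""
    K = len(zs)
    ps = [softmax(z) for z in zs]
    return ce(p, ps[k]) - gamma / (K - 1) * sum(ce(ps[j], ps[k])
                                                for j in range(K) if j != k)

rng = np.random.default_rng(1)
K, M, k, gamma = 3, 4, 0, 0.5
zs = [rng.normal(size=M) for _ in range(K)]
p = np.zeros(M); p[1] = 1.0
ps = [softmax(z) for z in zs]

# assumed gradient structure (13)
analytic = (ps[k] - p) - gamma / (K - 1) * sum(ps[k] - ps[j]
                                               for j in range(K) if j != k)

# central finite differences over the k-th logits
eps, numeric = 1e-6, np.empty(M)
for i in range(M):
    zp = [z.copy() for z in zs]; zm = [z.copy() for z in zs]
    zp[k][i] += eps; zm[k][i] -= eps
    numeric[i] = (ace(p, zp, k, gamma) - ace(p, zm, k, gamma)) / (2 * eps)
```

The other models' distributions $\hat{p}_{j\neq k}$ are held fixed here, mirroring the independence assumption used in the integration.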

Alternative formulation and analogy to NCL

Using algebraic manipulations, one can show that ACE (15) has a similar structure to that of NCL (7). Let us check the result in (15)

$$\tilde{H}_k = H(p,\hat{p}_k) - \frac{\gamma K}{K-1}\,H(\hat{p},\hat{p}_k) + \frac{\gamma}{K-1}\,H(\hat{p}_k,\hat{p}_k), \tag{17}$$

where $\hat{p} = \frac{1}{K}\sum_{j=1}^{K}\hat{p}_j$ is the ensemble prediction. Note that $H(\hat{p}_k,\hat{p}_k) = H(\hat{p}_k)$, i.e. the entropy of $\hat{p}_k$. Now let us check the result in (7), using the fact that $\sum_{j\neq k}(\hat{f}_j - \bar{f}) = -(\hat{f}_k - \bar{f})$

$$e_k = \frac{1}{2}\big(\hat{f}_k - y\big)^2 - \gamma\,\big(\hat{f}_k - \bar{f}\big)^2. \tag{18}$$

If we refer to the MSE and CE as divergence operators $D_{\mathrm{MSE}}$ and $D_{\mathrm{CE}}$, respectively, we can observe that both of the cost functions have the same structure

$$e_k = \frac{1}{2}\,D_{\mathrm{MSE}}(y,\hat{f}_k) - a_1\,D_{\mathrm{MSE}}(\bar{f},\hat{f}_k) + b_1, \tag{19}$$

$$\tilde{H}_k = D_{\mathrm{CE}}(p,\hat{p}_k) - a_2\,D_{\mathrm{CE}}(\hat{p},\hat{p}_k) + b_2, \tag{20}$$

where $a_1, a_2, b_1, b_2$ are constants. The first component of both expressions in (19) and (20) is the divergence between the real value and the estimator's prediction, i.e. the vanilla error. The second component is a negative divergence between the estimator's prediction and the ensemble prediction. Minimizing it (maximizing the divergence) encourages diversity between the estimator and the ensemble. The last component is the minimum of the divergence, which for MSE is zero and for CE is the entropy.

Non-uniform weights

Let us check the case where our ensemble is aggregated using non-uniform weights, i.e. $\hat{p} = \sum_{k=1}^{K}\alpha_k\hat{p}_k$, where $\alpha_k \geq 0$, $\sum_{k=1}^{K}\alpha_k = 1$, and $\alpha_k$ is the weight of the $k$-th classifier. Instead of (13) we get

$$\frac{\partial \tilde{H}_k}{\partial z_k} = \big(\hat{p}_k - p\big) - \frac{\gamma}{1-\alpha_k}\sum_{j\neq k}\alpha_j\big(\hat{p}_k - \hat{p}_j\big). \tag{21}$$

Hence, for weights which are independent of $z_k$, instead of (15) we obtain

$$\tilde{H}_k = H(p,\hat{p}_k) - \frac{\gamma}{1-\alpha_k}\sum_{j\neq k}\alpha_j\,H(\hat{p}_j,\hat{p}_k). \tag{22}$$
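With uniform weights $\alpha_k = 1/K$, the weighted cost (22) falls back to the uniform ACE (15), since $\sum_{j\neq k}\alpha_j = 1-\alpha_k$. A brief sketch of this consistency check (our own helper names):

```python
import numpy as np

def ce(p, q):
    return -np.sum(p * np.log(q))

def ace_weighted(p, ps, k, gamma, alphas):
    """Weighted ACE (22); with alphas = 1/K it coincides with the uniform ACE (15)."""
    div = sum(alphas[j] * ce(ps[j], ps[k]) for j in range(len(ps)) if j != k)
    return ce(p, ps[k]) - gamma / (1.0 - alphas[k]) * div

ps = [np.array([0.7, 0.2, 0.1]),
      np.array([0.4, 0.4, 0.2]),
      np.array([0.1, 0.6, 0.3])]
p = np.array([0.0, 1.0, 0.0])
uniform = np.full(3, 1.0 / 3.0)
```

The diversity term is now a weighted average over the other classifiers, normalized by the total weight $1-\alpha_k$ they carry.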
4 Implementation

In this section we examine two alternative implementations of the result obtained above.

4.1 ACE for multiple models

The straightforward vanilla implementation of our result is training multiple models simultaneously using ACE. In this approach we train $K$ models and fine-tune $\gamma$ to yield the best ensemble result.

for $k$ in $1,\dots,K$ do
       calculate prediction $\hat{p}_k$;
end for
for $k$ in $1,\dots,K$ do
       calculate loss $\tilde{H}_k$ (17);
       calculate gradient $g_k = \frac{\partial \tilde{H}_k}{\partial \theta_k}$;
       apply optimization step over $\theta_k$ using $g_k$;
end for
Algorithm 1 Training step of ACE for $K$ models with respect to a single input $x$ with probability vector $p$

The models do not have to be of the same architecture. Let us denote $\theta_1,\dots,\theta_K$ as the parameters of the models $1,\dots,K$, respectively. The loss functions $\tilde{H}_1,\dots,\tilde{H}_K$ are calculated as in (17). We calculate the gradient for each parameter set with respect to the corresponding loss function (Algorithm 1). This can also be used over a batch of samples while averaging the gradients. A sketch of this architecture can be viewed in Fig. 2. In the inference phase, we calculate the outputs of all of the models, and average them to yield a prediction.

Figure 2: ACE for multiple models. A sketch of a training phase of the $k$-th model. First, the input is processed by the $K$ models, which yields the distribution vectors $\hat{p}_1,\dots,\hat{p}_K$. Later, the cost function $\tilde{H}_k$ is calculated. Finally, the gradient of $\tilde{H}_k$ is calculated and model $k$ is updated accordingly.
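Algorithm 1 can be sketched end-to-end on toy data. In the sketch below each model is a plain linear-Softmax classifier, so the update can use the linear ACE gradient (13) directly; the synthetic blobs, learning rate and all names are our own illustration, not the authors' experimental setup:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
K, M, d, gamma, lr = 3, 3, 5, 0.3, 0.1
Ws = [rng.normal(scale=0.1, size=(M, d)) for _ in range(K)]  # theta_k per model

# toy dataset: 3 gaussian blobs with 1-hot targets
xs, ps = [], []
for label in range(M):
    for _ in range(30):
        x = rng.normal(size=d); x[label] += 2.0
        t = np.zeros(M); t[label] = 1.0
        xs.append(x); ps.append(t)

for _ in range(100):                                  # epochs
    for x, p in zip(xs, ps):
        preds = [softmax(W @ x) for W in Ws]          # forward pass of all K models
        for k in range(K):
            # assumed ACE gradient (13) w.r.t. the logits z_k
            gz = (preds[k] - p) - gamma / (K - 1) * sum(preds[k] - preds[j]
                                                        for j in range(K) if j != k)
            Ws[k] -= lr * np.outer(gz, x)             # chain rule: dz_k/dW_k via x

# inference: average the K predictions
def ensemble_predict(x):
    return np.mean([softmax(W @ x) for W in Ws], axis=0)

acc = np.mean([np.argmax(ensemble_predict(x)) == np.argmax(p)
               for x, p in zip(xs, ps)])
```

Each model is updated against the current predictions of the others, so the diversity penalty is coordinated across the ensemble, as described in the experiments section.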

4.2 Stacked Mixture Of Classifiers

A drawback of the above usage is that it takes $K$ times the computational power and memory compared to training a single vanilla model. In order to avoid this overhead and still gain the advantages of training multiple classifiers using ACE, we developed a new architecture called Stacked Mixture Of Classifiers (SMOC). This implementation is an ad-hoc variant for DNNs. Let us denote $L$ as the depth of a DNN, and $h \in \mathbb{R}^d$ as the output vector of the first $L-1$ layers of the net. Usually, we stack a fully-connected layer and Softmax activation on top of $h$ such that $\hat{p} = \mathrm{Softmax}(Wh + b)$, where $W$ and $b$ are the matrix and the bias of the last fully-connected layer, respectively, and $\hat{p}$ is the output of the DNN. Instead, we stack a mixture of $K$ fully-connected+Softmax classifiers, and train them with respect to different loss functions. The output of each classifier is $\hat{p}_k = \mathrm{Softmax}(W_k h + b_k)$, where $W_k$ and $b_k$ are the matrix and the bias of the $k$-th fully-connected final layer. For optimization we use the ACE loss (17). In the inference phase we use an average of the classifiers $\hat{p} = \frac{1}{K}\sum_{k=1}^{K}\hat{p}_k$. A sketch of SMOC can be seen in Fig. 3. The parameters vector $\theta_k$ is the set of parameters of the $k$-th final layer, i.e. $\theta_k = \{W_k, b_k\}$. As we can see, the number of parameters is increased only by the $K-1$ extra final layers compared to a similar DNN with a vanilla final layer. Using this approach, we can gain a highly diversified ensemble without having to train multiple models or increase the number of parameters significantly. Instead, we use a regular single DNN of $L-1$ layers, and create an ensemble by training multiple fully-connected+Softmax layers over its output.
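The SMOC head itself is only a few lines. A NumPy sketch under our own naming ($d$-dimensional feature $h$, $M$ labels, $K$ stacked classifiers):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class SMOCHead:
    """K fully-connected+Softmax classifiers stacked over a shared feature h."""
    def __init__(self, K, d, M, seed=0):
        rng = np.random.default_rng(seed)
        self.Ws = [rng.normal(scale=0.1, size=(M, d)) for _ in range(K)]
        self.bs = [np.zeros(M) for _ in range(K)]

    def forward(self, h):
        """Per-classifier distributions p_hat_k = Softmax(W_k h + b_k)."""
        return [softmax(W @ h + b) for W, b in zip(self.Ws, self.bs)]

    def predict(self, h):
        """Inference: average of the K classifiers."""
        return np.mean(self.forward(h), axis=0)

head = SMOCHead(K=10, d=64, M=10)
p_hat = head.predict(np.ones(64))
```

Each extra classifier costs only $dM + M$ parameters, which is why the head scales mildly compared to duplicating the whole network.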

SMOC gradient calculation optimization

We can think about this architecture as training $K$ DNNs which share the parameters of the first $L-1$ layers. Let us denote the shared parameters as $\theta_s$.
calculate $h$;
for $k$ in $1,\dots,K$ do
       calculate prediction $\hat{p}_k$;
end for
for $k$ in $1,\dots,K$ do
       calculate loss $\tilde{H}_k$ (17);
       calculate gradient $g_k = \frac{\partial \tilde{H}_k}{\partial \theta_k}$;
       calculate gradient $\frac{\partial \tilde{H}_k}{\partial h}$;
end for
calculate $\bar{g}$ (25);
apply optimization step for $\theta_1,\dots,\theta_K,\theta_s$ using $g_1,\dots,g_K,\bar{g}$, respectively;
Algorithm 2 Training step of SMOC with K stacked classifiers w.r.t. a single input $x$ with probability vector $p$

Similar to ACE for multiple models, we need to calculate $K$ losses and the gradients with respect to them. A naive way to do so would be to calculate the gradients separately for each cost function and to average them over the shared parameters $\theta_s$. However, this computation has the same complexity as training $K$ different models. Since the gradients are calculated using the chain rule (back-propagation), we can use it to tackle this issue. Let us denote $\bar{g}$ as the average of the gradients over the shared parameters

$$\bar{g} = \frac{1}{K}\sum_{k=1}^{K}\frac{\partial \tilde{H}_k}{\partial \theta_s}. \tag{23}$$
By using the chain rule we get

$$\frac{\partial \tilde{H}_k}{\partial \theta_s} = \frac{\partial \tilde{H}_k}{\partial h}\cdot\frac{\partial h}{\partial \theta_s}. \tag{24}$$

By combining (23) and (24), and due to the linearity of the gradient, we get

$$\bar{g} = \Bigg(\frac{1}{K}\sum_{k=1}^{K}\frac{\partial \tilde{H}_k}{\partial h}\Bigg)\cdot\frac{\partial h}{\partial \theta_s}. \tag{25}$$
Therefore, we can apply the averaging on $\frac{\partial \tilde{H}_k}{\partial h}$, and calculate the gradient $\frac{\partial h}{\partial \theta_s}$ only once. The gradients for each $\theta_k$ must still be calculated separately with respect to $\tilde{H}_k$ (Algorithm 2).
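The saving in (25) rests only on the linearity of the gradient: averaging the $K$ upstream gradients $\frac{\partial \tilde{H}_k}{\partial h}$ and back-propagating through the shared layers once gives the same $\bar{g}$ as $K$ separate backward passes. A sketch with a single shared linear layer standing in for $\theta_s$ (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
K, d_in, d = 4, 6, 5
W_shared = rng.normal(size=(d, d_in))    # theta_s: one shared linear layer
x = rng.normal(size=d_in)
h = W_shared @ x                         # shared feature vector

# stand-ins for the upstream gradients dH_k/dh coming from the K classifier heads
up = [rng.normal(size=d) for _ in range(K)]

# naive: K separate backward passes through the shared layer, then average
naive = np.mean([np.outer(g, x) for g in up], axis=0)

# SMOC (25): average dH_k/dh first, back-propagate once
once = np.outer(np.mean(up, axis=0), x)
```

For a linear layer, the backward pass w.r.t. its weights is the outer product of the upstream gradient with the input, so the two orders of operations coincide exactly.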

Figure 3: SMOC. A sketch of a training phase of the $k$-th classifier. First, the input is processed by a DNN, which generates $h$. Second, $h$ is processed by a pool of $K$ classifiers, which yields the distribution vectors $\hat{p}_1,\dots,\hat{p}_K$. Each classifier is optimized by its corresponding ACE cost function $\tilde{H}_k$. The gradient w.r.t. $\theta_k$ is calculated and classifier $k$ is updated accordingly. The gradient w.r.t. $h$ is also calculated, and later these gradients are averaged and used to calculate the gradient w.r.t. $\theta_s$ (25).

5 Experiments

5.1 ACE for multiple models

For the vanilla version we conducted an experiment over the MNIST dataset. MNIST is a standard toy dataset, where the task is to classify images into 10 digit classes. For the ensemble,

	Ensemble scores		Averaged single NN score
$\gamma$	Accuracy	CE	Accuracy	CE
0	0.9790	0.0669	0.9767	0.0810
0.05	0.9798	0.0663	0.9770	0.0809
0.1	0.9799	0.0664	0.9768	0.0802
0.3	0.9797	0.0658	0.9767	0.0806
0.5	0.9802	0.0649	0.9764	0.0842
0.7	0.9800	0.0659	0.9760	0.0866
Table 1: ACE for multiple models - MNIST dataset

we used 5 models of the same architecture: a DNN with a single hidden layer and ReLU activation. The results include both the accuracy and the CE of the predictions over the test set. We ran over multiple values of $\gamma$, where for $\gamma = 0$, i.e. vanilla CE, we trained the models independently (with different training batches). The results in Table 1 show that we succeeded in reducing the error of the ensemble and increasing its accuracy by applying ACE instead of the vanilla CE (i.e. $\gamma = 0$). We also added the averaged accuracy and CE of a single DNN. An interesting thing to notice is that even though the result of a single DNN deteriorates when using the optimal $\gamma$, the ensemble result is superior. The reason for this is that we add a penalty to each DNN during the training phase that causes it to perform worse; however, the penalty is coordinated with the other DNNs so that the ensemble performs better. The results were averaged over 5 experiments.

5.2 Stacked Mixture Of Classifiers

We conducted studies of the SMOC architecture over the CIFAR-10 dataset Krizhevsky (2012). We used the architecture and code of ResNet 110 He et al. (2015) and stacked on top of it an ensemble of 10 fully-connected+Softmax layers instead of the single one that was used, which enlarges the model only marginally. The results are shown in Table 2. In the table we also show the results for a single classifier with a vanilla single Softmax layer (K=1). The results have been averaged over 5 experiments with different seeds. We notice that the optimal $\gamma$ reduces the accuracy error compared to the vanilla baseline, with almost no cost in the number of parameters and computational power. We also notice that the CE is reduced significantly.

error (%)	6.43	6.2	6.14	6.12	5.98	6.09	6.13	6.31
CE	0.3056	0.3102	0.3041	0.3048	0.2968	0.2918	0.3137	0.4957
Table 2: Stacked Mixture Of Classifiers - CIFAR-10 dataset (each column corresponds to a different value of $\gamma$)

6 Conclusion and future work

In this paper we developed a novel framework for explicitly encouraging diversity between ensemble models in classification tasks. First, we introduced the idea of using an amended cost function for multiple classifiers based on the NCL results. Later, we showed two usages - a vanilla one and the SMOC. We performed experiments to validate our analytical results for both of the architectures. For SMOC, we showed that with a small change and a negligible addition of parameters we achieve superior results compared to the vanilla implementation. In future work, we would like to seek a way of using ACE with non-uniform and, possibly, trainable weights (22). Also, in the case of a large number of labels, using SMOC results in a large number of added parameters. We would like to research implementation solutions where this can be avoided.


  • J. Adeva, U. B., and R. Calvo (2005) Accuracy and diversity in ensembles of text categorisers. CLEI Electron. J. 8. Cited by: §1.
  • G. Brown, J.L. Wyatt, and P. Tiňo (2005) Managing diversity in regression ensembles. Journal of machine learning research 6 (Sep), pp. 1621–1650. Cited by: §1, §1, §2.1, §2.
  • M.A. Carreira-Perpinan and R. Raziperchikolaei (2016) An ensemble diversity approach to supervised binary hashing. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 757–765. Cited by: §1.
  • J. Feng, Y. Yu, and Z.H. Zhou (2018) Multi-layered gradient boosting decision trees. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 3551–3561. Cited by: §1.
  • M. Fernandez-Delgado, E. Cernadas, S. Barro, and D. Amorim (2014) Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research 15 (1), pp. 3133–3181. Cited by: §1.
  • Y. Freund and R.E. Schapire (1997) A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, pp. 119–139. Cited by: §1.
  • I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press. Cited by: §1, §3.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. CoRR abs/1512.03385. Cited by: §5.2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. External Links: 1412.6980 Cited by: §1, §2, §3.
  • A. Krizhevsky (2012) Learning multiple layers of features from tiny images. University of Toronto. Cited by: §5.2.
  • S. Lee, S. Purushwalkam, M. Cogswell, V. Ranjan, D. Crandall, and D. Batra (2016) Stochastic multiple choice learning for training diverse deep ensembles. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 2119–2127. Cited by: §1.
  • J. Liu, J. Paisley, M.A. Kioumourtzoglou, and B. Coull (2019) Accurate uncertainty estimation and decomposition in ensemble learning. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8952–8963. Cited by: §1.
  • Y. Liu and X. Yao (1999) Ensemble learning via negative correlation. Neural networks 12 (10), pp. 1399–1404. Cited by: §1, §1, §2.1, §2.
  • M.A. Nielsen (2015) Neural networks and deep learning. Determination Press. Cited by: §1, §3.
  • Y. Ren, L. Zhang, and P.N. Suganthan (2016) Ensemble classification and regression-recent developments, applications and future directions. IEEE Computational Intelligence Magazine 11 (1), pp. 41–53. Cited by: §1.
  • Z. Shi, L. Zhang, Y. Liu, X. Cao, Y. Ye, M.M. Cheng, and G. Zheng (2018) Crowd counting with deep negative correlation learning. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §2.1.
  • P. Sollich and A. Krogh (1996) Learning with ensembles: how overfitting can be useful. In Advances in Neural Information Processing Systems 8, D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo (Eds.), pp. 190–196. Cited by: §1.
  • T. Tieleman and G. Hinton (2012) Neural networks for machine learning. Lecture 6.5 - RMSProp, COURSERA. Cited by: §1, §2, §3.
  • B. Wang, Z. Shi, and S. Osher (2019) ResNets ensemble via the feynman-kac formalism to improve natural and robust accuracies. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 1657–1667. Cited by: §1.
  • L. Zhang and P.N. Suganthan (2017) Benchmarking ensemble classifiers with novel co-trained kernel ridge regression and random vector functional link ensembles [research frontier]. IEEE Computational Intelligence Magazine 12 (4), pp. 61–72. Cited by: §1.
  • T. Zhou, S. Wang, and J.A. Bilmes (2018) Diverse ensemble evolution: curriculum data-model marriage. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 5909–5920. Cited by: §1.