1 Introduction and motivation
Several studies have shown, both theoretically and empirically, that training an ensemble of models, i.e. aggregating predictions from multiple models, is superior to training a single model Brown et al. (2005); Feng et al. (2018); Fernandez-Delgado et al. (2014); Liu et al. (2019); Liu and Yao (1999); Ren et al. (2016); Sollich and Krogh (1996); Wang et al. (2019); Zhang and Suganthan (2017); Freund and Schapire (1997). Many works point out that one of the keys for an ensemble to perform well is to encourage diversity among the models Adeva et al. (2005); Carreira-Perpinan and Raziperchikolaei (2016); Lee et al. (2016); Liu and Yao (1999); Shi et al. (2018); Zhou et al. (2018); Freund and Schapire (1997). This property is the main motivation for our work.
Sigmoid and Softmax are both well-known functions used for classification (the former for binary and the latter for multi-class classification). Both are used to generate distribution vectors $q(y|x)$ over the labels $y$, where $x$ is a given input. For Deep Neural Networks (DNNs), the framework of applying a Sigmoid/Softmax on top of the network is very popular, where the goal is to estimate the real distribution $p(y|x)$, which might be a 1-hot vector for a hard label. Henceforth, we omit $x$ unless it is crucial for some definition or proof. We denote $p = p(y|x)$ and $q = q(y|x)$. We optimize $q$ by minimizing the CE cost function
$$H(p, q) = -\sum_{y} p(y) \log q(y). \tag{1}$$
The optimization is usually gradient based Kingma and Ba (2014); Tieleman and Hinton (2012). Hence, one of the main motivations for using the CE cost function over Sigmoid/Softmax outputs is the linear structure of the gradient, which is similar to that obtained by applying the Mean Squared Error (MSE) method over a linear regression estimator. Studies show that this property is important for preventing the vanishing gradient phenomenon Goodfellow et al. (2016); Nielsen (2015).
Now let us define the setting of the ensemble problem. We train $K$ classifiers, with distribution functions $q_1, \dots, q_K$, to generate an ensemble $\bar{q} = \frac{1}{K}\sum_{k=1}^{K} q_k$, which estimates the real distribution $p$. This setting is very common, and the straightforward way to tackle it is by training each model independently using the CE cost function $H(p, q_k)$. Diversity is encouraged by using different training samples or different seeds for weight initialization. However, to the best of our knowledge, there is no explicit way to control the “amount” of diversity between the classifiers.
In this work we present a novel framework, called Amended Cross Entropy (ACE), which makes it possible to train each model and, simultaneously, to achieve diversity between the classifiers. Our main result in this work is the introduction of a new cost function
$$L_k = H(p, q_k) - \frac{\gamma}{K-1} \sum_{j \neq k} H(q_j, q_k), \tag{2}$$
which is applied to the $k$th classifier and is not independent of the other classifiers. We see that ACE is built from the vanilla CE between $p$ and $q_k$, minus the average of the CE between $q_k$ and the other estimators, scaled by $\gamma$. This result is very intuitive, since we wish to minimize the CE of the estimated distribution with the real one while enlarging the CE of the estimator with the others, i.e. to encourage diversity. The hyperparameter $\gamma$ explicitly controls the diversity, and is fine-tuned in order to achieve optimal results. The development of ACE starts from an assumption about the structure we wish the gradient to have. As we show in this paper, a similar assumption lies at the base of applying CE over Softmax. We develop a variant especially for DNNs, which can be stacked on top of the network instead of the vanilla Softmax layer, and makes it possible to yield superior results without significantly increasing the number of parameters or the computational resources.
This work has been inspired by the Negative Correlation Learning (NCL) Brown et al. (2005); Liu and Yao (1999); Shi et al. (2018) framework, which is used for regression ensembles. In the next section we will present the NCL framework, its development and its results, in order to explain the analogous approach we used in our work.
2 Related work: Negative Correlation Learning (NCL)
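Liu and Yao (1999) and Brown et al. (2005) presented the NCL framework as a solution to the diversity issue for regression ensembles. Let us denote $x$ as the vector of features and $y$ as the target. The goal is to find $f(x;\theta)$ which yields as low an error as possible w.r.t. the MSE criterion, i.e. to minimize
$$e(f) = \int \left(f(x;\theta) - y\right)^2 p(x, y)\, dx\, dy. \tag{3}$$
Here, $\theta$ stands for the parameters of $f$. In practice, the distribution $p(x, y)$ is unknown, so we use $N$ realizations $\{(x_n, y_n)\}_{n=1}^{N}$ (training set) to estimate (3) with an empirical MSE using $\hat{e}(f) = \frac{1}{N}\sum_{n=1}^{N} \left(f(x_n;\theta) - y_n\right)^2$. Under the assumption that $(x_n, y_n)$ are i.i.d., or at least stationary and ergodic, $\hat{e}(f)$ converges to $e(f)$. We use the short notation $f$ to denote $f(x;\theta)$. Instead of (3) we can use the expectation operator $E[\cdot]$ and decompose the error into the known structure of bias and variance
$$E\left[(f - y)^2\right] = \left(E[f] - y\right)^2 + E\left[\left(f - E[f]\right)^2\right]. \tag{4}$$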
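A common way to apply an ensemble of models is to average multiple trained estimators
$$\bar{f} = \frac{1}{K}\sum_{k=1}^{K} f_k. \tag{5}$$
By checking the decomposition of the ensemble expected error it is straightforward to show that
$$E\left[(\bar{f} - y)^2\right] = \overline{\mathrm{bias}}^2 + \frac{1}{K}\,\overline{\mathrm{var}} + \left(1 - \frac{1}{K}\right)\overline{\mathrm{covar}}, \tag{6}$$
where $\overline{\mathrm{bias}} = \frac{1}{K}\sum_{k}\left(E[f_k] - y\right)$, $\overline{\mathrm{var}} = \frac{1}{K}\sum_{k} E\left[(f_k - E[f_k])^2\right]$, and $\overline{\mathrm{covar}} = \frac{1}{K(K-1)}\sum_{k}\sum_{j \neq k} E\left[(f_k - E[f_k])(f_j - E[f_j])\right]$.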
This outcome is called the bias-variance-covariance decomposition, and is the main motivation for NCL. We notice that by reducing the correlation between the estimators of an ensemble, the ensemble might yield a lower error. Based on this, Liu and Yao (1999) proposed a regularization factor that is added to the cost function of each single estimator during the training phase. This factor is an estimate of the sum of covariances between the trained estimator and the others. The factor is multiplied by a hyperparameter $\gamma$, which explicitly controls the “amount” of diversity between the single estimator and the other estimators in the ensemble
$$L_k = \frac{1}{2}\left(f_k - y\right)^2 + \gamma \left(f_k - \bar{f}\right)\sum_{j \neq k}\left(f_j - \bar{f}\right). \tag{7}$$
Notice that in order to avoid a factor of $2$ in the gradient analysis, we multiply the MSE by a factor of $\frac{1}{2}$. By setting $\gamma = 0$ we get the conventional MSE cost function, i.e. each model is optimized independently.
Gradient analysis
Gradient-based optimization Kingma and Ba (2014); Tieleman and Hinton (2012) is a very popular method for optimizing a model. Therefore, analyzing the gradient behaviour of a cost function is advisable. Let us check the gradient of the cost function of the $k$th estimator with respect to its output $f_k$
(8) 
By defining , we get
(9) 
We notice again that by setting $\gamma = 0$ we get the same gradient as with independent training.
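The gradient analysis above can be checked numerically. The following is a minimal sketch, assuming the NCL cost takes the form used by Brown et al. (2005), $\frac{1}{2}(f_k - y)^2 - \gamma (f_k - \bar{f})^2$, and that the ensemble output is the plain average of the regressors. The function names and the sample values are hypothetical and chosen only for illustration.

```python
def ncl_loss(f, k, y, gamma):
    """Assumed NCL cost of regressor k: 0.5*(f_k - y)^2 - gamma*(f_k - f_bar)^2."""
    f_bar = sum(f) / len(f)
    return 0.5 * (f[k] - y) ** 2 - gamma * (f[k] - f_bar) ** 2


def ncl_grad(f, k, y, gamma):
    """Analytic derivative w.r.t. f_k, including the dependence of f_bar on f_k."""
    K = len(f)
    f_bar = sum(f) / K
    lam = 2.0 * gamma * (1.0 - 1.0 / K)  # effective diversity weight (assumed name)
    return (f[k] - y) - lam * (f[k] - f_bar)


def numeric_grad(f, k, y, gamma, eps=1e-6):
    """Central finite difference of ncl_loss with respect to f_k."""
    up = list(f)
    down = list(f)
    up[k] += eps
    down[k] -= eps
    return (ncl_loss(up, k, y, gamma) - ncl_loss(down, k, y, gamma)) / (2 * eps)


# Hypothetical outputs of three regressors for one sample, and its target.
outputs = [0.9, 1.4, 0.2]
target = 1.0

for k in range(len(outputs)):
    analytic = ncl_grad(outputs, k, target, gamma=0.5)
    numeric = numeric_grad(outputs, k, target, gamma=0.5)
    print(k, analytic, numeric)
```

With `gamma=0.0` the analytic gradient reduces to `f_k - y`, which is the gradient of independent MSE training.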
2.1 Usage of NCL
Liu and Yao (1999) and Brown et al. (2005) suggested a vanilla approach for optimizing multiple regressors. They suggested training multiple regression models that do not have to be of the same architecture, but are trained simultaneously in order to reduce the correlation between the models. The architecture is presented in Fig. 1. However, with this approach the computational power and the number of parameters used increase significantly. For example, if we use the same architecture for all of the models, we use $K$ times the number of parameters used by a single model. If we train a DNN with millions of parameters, this might result in a non-scalable training scheme.
In order to handle this, Shi et al. (2018) suggested a new approach. They suggested stacking a layer of a regressors ensemble on top of a DNN instead of the vanilla regression layer. In this way, they claimed that they got the benefit of NCL while not increasing the number of parameters and computational power significantly. This architecture, called DConvNet, yields state of the art results in a Crowd Counting task. The work, as well as a sketch of the architecture can be seen in their paper Shi et al. (2018).
3 Amended Cross Entropy (ACE)
In this section we first show the main motivation for using the CE cost function for a Softmax classifier. Like many other cost functions (MSE, Mean Absolute Error (MAE), etc.), CE achieves its minimum when both of the distribution vectors are equal. However, CE is the only cost function which yields a linear gradient for a distribution generated by Softmax, similarly to the gradient of the MSE cost function over a linear regressor. We show this for the single classifier case first, and later we use this approach analogously for multiple classifiers, where we wish to obtain the same gradient structure as in NCL, in order to analytically develop the ACE framework.
CE cost function for Softmax classifier
Let us denote $M$ as the size of the set of events (labels), and $p$ as the real distribution vector for a given input $x$ (which is a 1-hot vector for a hard label). We wish to train an estimator $q$ for the real distribution. We denote the estimator parameters as $\theta$. The estimator generates a raw vector $z \in \mathbb{R}^M$, which is a function of the input, and applies Softmax over it in order to yield the estimator $q$, i.e.
$$q(y) = \frac{e^{z(y)}}{\sum_{y'} e^{z(y')}}. \tag{10}$$
Later, a CE cost function $L$ is applied to measure the error between the estimator and the real distribution (1). In order to optimize the estimator’s parameters $\theta$, gradient-based methods are applied Kingma and Ba (2014); Tieleman and Hinton (2012). The gradient is calculated using the chain rule
$$\nabla_{\theta} L = \sum_{y} \frac{\partial L}{\partial z(y)}\, \nabla_{\theta} z(y). \tag{11}$$
Now, let us calculate $\frac{\partial L}{\partial z(y)}$ explicitly
$$\frac{\partial L}{\partial z(y)} = -\sum_{y'} p(y') \frac{\partial \log q(y')}{\partial z(y)} = -\sum_{y'} p(y')\left(\mathbb{1}[y' = y] - q(y)\right) = q(y) - p(y). \tag{12}$$
We see that a linear structure of the gradient is obtained when applying CE over a Softmax classifier. This structure is similar to that of the MSE cost function over a linear regression estimator Goodfellow et al. (2016); Nielsen (2015).
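The linear gradient structure of CE over Softmax can be verified with a short finite-difference check. This is an illustrative sketch only: the helper names and the example logits are hypothetical, and the check uses the standard definitions of Softmax and CE.

```python
import math


def softmax(z):
    """Numerically stable Softmax over a list of raw scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(p, q):
    """CE between a target distribution p and an estimate q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def ce_grad_wrt_logits(p, z):
    """Analytic gradient of CE(p, softmax(z)) w.r.t. z, i.e. q - p."""
    q = softmax(z)
    return [qi - pi for qi, pi in zip(q, p)]


def numeric_grad(p, z, eps=1e-6):
    """Central finite differences of CE(p, softmax(z)) w.r.t. each raw score."""
    grad = []
    for i in range(len(z)):
        up = list(z)
        down = list(z)
        up[i] += eps
        down[i] -= eps
        diff = cross_entropy(p, softmax(up)) - cross_entropy(p, softmax(down))
        grad.append(diff / (2 * eps))
    return grad


# Hypothetical 3-class example: a hard (1-hot) label and arbitrary raw scores.
p_hard = [0.0, 1.0, 0.0]
logits = [0.5, -1.2, 2.0]

print(ce_grad_wrt_logits(p_hard, logits))
print(numeric_grad(p_hard, logits))
```

The same check holds for a soft target such as `[0.2, 0.5, 0.3]`, since the derivation only requires that the target sums to one.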
3.1 ACE
Inspired by the NCL result and by our belief that an important consideration for the choice of a cost function is the gradient behaviour (as long as it is a valid cost function), we wish to find a cost function that would yield the same properties. Therefore, we first assume the gradient structure, and later integrate it in order to find the appropriate cost function. Let us denote $K$ as the number of classifiers in the ensemble, $L_k$ as the $k$th model cost function, $z_k$ as the raw output vector of the $k$th model, $q_k$ as the estimated distribution of the $k$th model, and $\theta_k$ as the parameters of the $k$th model. We would like to train an ensemble of $K$ models to estimate $p$. Since the gradient structure might be one of the most important considerations for choosing and constructing a cost function, by combining the results of (9) and (12) we assume a gradient
(13) 
This assumption is the foundation of our proposed method and is the basis for developing the ACE framework. In order to find $L_k$ we need to integrate the above with respect to $z_k$
(14) 
By reverse engineering (12), and using the fact that $p$ and $q_j$ ($j \neq k$) are independent of $z_k$, we get
(15) 
where is a constant independent of . We set . We can also set in order to get
(16) 
i.e. the average of the CE between the $k$th classifier and the others. Notice that by setting $\gamma = 0$ we get the regular CE cost function.
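One way to reconstruct the integration step is sketched here. It assumes that the target gradient combines the Softmax result with the NCL structure as $\nabla_{z_k} L_k = (q_k - p) - \lambda (q_k - \bar{q})$, with $\bar{q} = \frac{1}{K}\sum_{j} q_j$ and with the other classifiers $q_j$, $j \neq k$, treated as constants. The symbol $\lambda$ and this exact form of the gradient are assumptions made for illustration, not a transcription of the original equations.

```latex
\begin{align*}
q_k - \bar{q}
  &= \frac{1}{K}\sum_{j \neq k} \left(q_k - q_j\right), \\
\nabla_{z_k} H(q_j, q_k)
  &= q_k - q_j \qquad (j \neq k,\ q_j \text{ held fixed}), \\
\nabla_{z_k} L_k
  &= \nabla_{z_k} \Big[ H(p, q_k) - \frac{\lambda}{K} \sum_{j \neq k} H(q_j, q_k) \Big], \\
L_k
  &= H(p, q_k) - \frac{\lambda}{K} \sum_{j \neq k} H(q_j, q_k) + C .
\end{align*}
```

Under these assumptions, choosing $C = 0$ and $\lambda = \gamma \frac{K}{K-1}$ turns the second term into $\gamma$ times the average CE between the $k$th classifier and the others, which matches the verbal description of the ACE cost.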
Alternative formulation and analogy to NCL
Using algebraic manipulations, one can show that ACE (15) has a similar structure to the one of NCL (7). Let us check the result in (15)
(17) 
Note that $H(q_k, q_k) = H(q_k)$, i.e. the entropy of $q_k$. Now let us check the result in (7)
(18) 
If we refer to the MSE and CE as divergence operators and , respectively, we can observe that both of the cost functions have the same structure
(19)  
(20) 
where are constants. The first component of both expressions in (19) and (20) is the divergence between the real value and the estimator’s prediction, i.e. the vanilla error. The second component is a negative divergence between estimator’s prediction and the ensemble prediction. Minimizing it (maximizing the divergence) encourages diversity between the estimator and the ensemble. The last component is the minimum of the divergence, where for MSE it is zero and for CE it is the entropy.
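The relation between the pairwise form of ACE and its ensemble form can be illustrated numerically. The sketch below assumes the ACE cost is the CE to the target minus $\gamma$ times the average CE between the $k$th classifier and the others, as described in the introduction, and that the ensemble is the uniform average of the classifiers. The function names and the example distributions are hypothetical.

```python
import math


def cross_entropy(p, q):
    """CE between distributions p and q; cross_entropy(q, q) is the entropy of q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def ace_pairwise(p, qs, k, gamma):
    """Assumed ACE cost: CE to the target minus gamma times the mean CE to the others."""
    K = len(qs)
    others = sum(cross_entropy(qs[j], qs[k]) for j in range(K) if j != k)
    return cross_entropy(p, qs[k]) - gamma * others / (K - 1)


def ace_ensemble_form(p, qs, k, gamma):
    """Same cost rewritten with the ensemble average q_bar and the entropy of q_k."""
    K = len(qs)
    q_bar = [sum(q[i] for q in qs) / K for i in range(len(p))]
    to_ensemble = cross_entropy(q_bar, qs[k])  # CE is linear in its first argument
    entropy = cross_entropy(qs[k], qs[k])
    return cross_entropy(p, qs[k]) - gamma * (K * to_ensemble - entropy) / (K - 1)


# Hypothetical target and three classifier outputs over three labels.
p = [0.0, 1.0, 0.0]
qs = [[0.2, 0.7, 0.1], [0.3, 0.4, 0.3], [0.1, 0.8, 0.1]]

for k in range(len(qs)):
    print(k, ace_pairwise(p, qs, k, gamma=0.3), ace_ensemble_form(p, qs, k, gamma=0.3))
```

Both functions return the same value for every classifier, which reflects the three-component structure discussed above: the vanilla error, a negative divergence to the ensemble, and an entropy term.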
Nonuniform weights
4 Implementation
In this section we examine two alternative implementations for the result we got above.
4.1 ACE for multiple models
The straightforward vanilla implementation of our result is training multiple models simultaneously using ACE. In this approach we train $K$ models and fine-tune $\gamma$ to yield the best ensemble result. The models do not have to be of the same architecture. Let us denote $\theta_1, \dots, \theta_K$ as the parameters of the models $q_1, \dots, q_K$, respectively. The loss functions $L_1, \dots, L_K$ are calculated as in (17). We calculate the gradient for each parameter set $\theta_k$ with respect to the corresponding loss function $L_k$ (Algorithm 1). This can also be used over a batch of samples while averaging the gradients. A sketch of this architecture can be viewed in Fig. 2. In the inference phase, we calculate the outputs of all of the models, and average them to yield a prediction.
4.2 Stacked Mixture Of Classifiers
A drawback of the above usage is that it takes $K$ times the computational power and memory compared to training a single vanilla model. In order to avoid this overhead and still gain the advantages of training multiple classifiers using ACE, we developed a new architecture called Stacked Mixture Of Classifiers (SMOC). This implementation is an ad hoc variant for DNNs. Let us denote $D$ as the depth of a DNN, and $h$ as the output vector of the first $D-1$ layers of the net. Usually, we stack a fully-connected layer and a Softmax activation on top of $h$ such that $q = \mathrm{softmax}(Wh + b)$, where $W$ and $b$ are the matrix and the bias of the last fully-connected layer, respectively, and $q$ is the output of the DNN. Instead, we stack a mixture of $K$ fully-connected+Softmax classifiers, and train them with respect to $K$ different loss functions. The output of each classifier is $q_k = \mathrm{softmax}(W_k h + b_k)$, where $W_k$ and $b_k$ are the matrix and the bias of the $k$th fully-connected final layer. For optimization we use the ACE loss (17). In the inference phase we use an average of the classifiers $\bar{q} = \frac{1}{K}\sum_{k=1}^{K} q_k$. A sketch of SMOC can be seen in Fig. 3. The parameters vector $\theta_k$ is the set of parameters of the $k$th final layer, i.e. $\theta_k = \{W_k, b_k\}$. As we can see, the number of parameters is increased by $(K-1)(n+1)M$ compared to a similar DNN with a vanilla final layer, where $n$ is the dimension of $h$ and $M$ is the number of labels. Using this approach, we can gain a highly diversified ensemble without having to train multiple models or increase the number of parameters significantly. Instead, we use a regular single DNN of $D$ layers, and create an ensemble by training multiple fully-connected+Softmax layers over its output.
SMOC gradient calculation optimization
We can think about this architecture as training $K$ DNNs which share the parameters of the first $D-1$ layers, i.e. all layers except the final one. Let us denote the shared parameters as $\theta_s$ and the output of the shared layers as $h$. Similar to ACE for multiple models, we need to calculate $K$ losses and the gradients with respect to them. A naive way to do so would be to calculate the gradients separately for each cost function and to average them over the shared parameters $\theta_s$. However, this computation has the same complexity as training $K$ different models. Since the gradients are calculated using the chain rule (backpropagation), we can use it to tackle this issue. Let us denote $g_s$ as the average of the gradients over the shared parameters
$$g_s = \frac{1}{K}\sum_{k=1}^{K} \nabla_{\theta_s} L_k. \tag{23}$$
By using the chain rule we get
$$\nabla_{\theta_s} L_k = \left(\frac{\partial h}{\partial \theta_s}\right)^{\top} \nabla_{h} L_k. \tag{24}$$
By combining (23) and (24), and due to the linearity of the gradient, we get
$$g_s = \left(\frac{\partial h}{\partial \theta_s}\right)^{\top} \left(\frac{1}{K}\sum_{k=1}^{K} \nabla_{h} L_k\right). \tag{25}$$
Therefore, we can apply averaging on $\nabla_{h} L_k$, and calculate the gradient for $\theta_s$ once. The gradients for each $\theta_k$ must still be calculated separately with respect to $L_k$ (Algorithm 2).
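The saving described above can be illustrated with a toy shared layer. The sketch below assumes a linear shared layer $h = Ax$ and takes the per-classifier gradients with respect to $h$ as given numbers, so it demonstrates only the averaging trick and not the full Algorithm 2. All names and values are hypothetical.

```python
def backprop_shared(grad_h, x):
    """Gradient w.r.t. A of a linear shared layer h = A x, given dL/dh."""
    return [[g * xi for xi in x] for g in grad_h]


def mean_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def mean_matrices(matrices):
    """Element-wise mean of equally sized matrices."""
    n = len(matrices)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[r][c] for m in matrices) / n for c in range(cols)]
            for r in range(rows)]


# Hypothetical input to the shared layer and dL_k/dh for K = 3 classifiers.
x = [1.0, -2.0, 0.5]
head_grads = [[0.3, -0.1], [0.0, 0.4], [-0.6, 0.2]]

# Naive scheme: K backward passes through the shared layer, then average.
naive = mean_matrices([backprop_shared(g, x) for g in head_grads])

# Proposed scheme: average the gradients at h, then one backward pass.
fast = backprop_shared(mean_vectors(head_grads), x)

print(naive)
print(fast)
```

Because backpropagation through the shared layers is linear in the incoming gradient, both schemes give the same result up to floating-point rounding, while the second one needs a single backward pass through the shared parameters.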
5 Experiments
5.1 ACE for multiple models
For the vanilla version we conducted an experiment over the MNIST dataset. MNIST is a standard toy dataset, where the task is to classify the images into 10 digit classes. For the ensemble, we used 5 models of the same architecture. The architecture was a DNN with a single hidden layer and ReLU activation. The results include both the accuracy and the CE of the predictions over the test set. We ran over multiple values of $\gamma$, where for $\gamma = 0$, i.e. vanilla CE, we trained the models independently (different training batches). The results in Table 1 show that we succeeded in reducing the error of the ensemble and increasing its accuracy by applying ACE instead of the vanilla CE (i.e. $\gamma = 0$). We also added the averaged accuracy and CE of a single DNN. An interesting thing to notice is that even though the result of a single DNN deteriorates when using the optimal $\gamma$, the ensemble result is superior. The reason for this is that we add a penalty for each DNN during the training phase that causes it to perform worse; however, the penalty is coordinated with the other DNNs so that the ensemble performs better. The results were averaged over 5 experiments.
Table 1: MNIST test-set results for the ensemble and for the averaged single NN.
$\gamma$  Ensemble accuracy  Ensemble CE  Averaged single NN accuracy  Averaged single NN CE
0     0.9790  0.0669  0.9767  0.0810
0.05  0.9798  0.0663  0.9770  0.0809
0.1   0.9799  0.0664  0.9768  0.0802
0.3   0.9797  0.0658  0.9767  0.0806
0.5   0.9802  0.0649  0.9764  0.0842
0.7   0.9800  0.0659  0.9760  0.0866
5.2 Stacked Mixture Of Classifiers
We conducted studies of the SMOC architecture over the CIFAR10 dataset Krizhevsky (2012). We used the architecture and code of ResNet 110 He et al. (2015) and stacked on top of it an ensemble of 10 fullyconnected+Softmax layers instead of the single one that was used. This resulted in adding parameters to a model with an original size of , i.e. enlarging the model by . The results are shown in Table 2. In the table we also show the results for a single classifier with a vanilla single Softmax layer (K=1). The results have been averaged over 5 experiments with different seeds. We notice that the optimal reduces the accuracy error by compared to with almost no cost in the number of parameters and computational power. We also notice that the CE reduces significantly.
error(%)  6.43  6.2  6.14  6.12  5.98  6.09  6.13  6.31 
CE  0.3056  0.3102  0.3041  0.3048  0.2968  0.2918  0.3137  0.4957 
6 Conclusion and future work
In this paper we developed a novel framework for explicitly encouraging diversity between ensemble models in classification tasks. First, we introduced the idea of using an amended cost function for multiple classifiers based on NCL results. Later, we showed two usages: a vanilla one and SMOC. We performed experiments to validate our analytical results for both of the architectures. For SMOC, we showed that with a small change and a negligible addition of parameters we achieve superior results compared to the vanilla implementation. In future work, we would like to seek a way of using ACE with non-uniform and, possibly, trainable weights (22). Also, in the case of a large number of labels, using SMOC results in a large number of added parameters. We would like to research implementation solutions where this can be avoided.
References
Adeva et al. (2005) Accuracy and diversity in ensembles of text categorisers. CLEI Electron. J. 8.
Brown et al. (2005) Managing diversity in regression ensembles. Journal of Machine Learning Research 6 (Sep), pp. 1621–1650.
Carreira-Perpinan and Raziperchikolaei (2016) An ensemble diversity approach to supervised binary hashing. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 757–765.
Feng et al. (2018) Multi-layered gradient boosting decision trees. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 3551–3561.
Fernandez-Delgado et al. (2014) Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research 15 (1), pp. 3133–3181.
Freund and Schapire (1997) A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, pp. 119–139.
Goodfellow et al. (2016) Deep learning. MIT Press.
He et al. (2015) Deep residual learning for image recognition. CoRR abs/1512.03385.
Kingma and Ba (2014) Adam: a method for stochastic optimization. arXiv:1412.6980.
Krizhevsky (2012) Learning multiple layers of features from tiny images. University of Toronto.
Lee et al. (2016) Stochastic multiple choice learning for training diverse deep ensembles. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 2119–2127.
Liu et al. (2019) Accurate uncertainty estimation and decomposition in ensemble learning. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8952–8963.
Liu and Yao (1999) Ensemble learning via negative correlation. Neural Networks 12 (10), pp. 1399–1404.
Nielsen (2015) Neural networks and deep learning. Determination Press.
Ren et al. (2016) Ensemble classification and regression-recent developments, applications and future directions. IEEE Computational Intelligence Magazine 11 (1), pp. 41–53.
Shi et al. (2018) Crowd counting with deep negative correlation learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Sollich and Krogh (1996) Learning with ensembles: how overfitting can be useful. In Advances in Neural Information Processing Systems 8, D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo (Eds.), pp. 190–196.
Tieleman and Hinton (2012) Neural networks for machine learning. Lecture 6.5 - RMSProp, COURSERA.
Wang et al. (2019) ResNets ensemble via the Feynman-Kac formalism to improve natural and robust accuracies. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 1657–1667.
Zhang and Suganthan (2017) Benchmarking ensemble classifiers with novel co-trained kernal ridge regression and random vector functional link ensembles [research frontier]. IEEE Computational Intelligence Magazine 12 (4), pp. 61–72.
Zhou et al. (2018) Diverse ensemble evolution: curriculum data-model marriage. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 5909–5920.