1 Introduction
A common practice in most deep convolutional neural networks is to employ fully-connected layers followed by a Softmax activation to minimize a cross-entropy loss. Recent studies have shown that substituting the Softmax objective with SVM or LDA cost functions is highly effective at improving the classification performance of deep neural networks. This paper proposes a novel paradigm that links the optimization of several objectives through a unified backpropagation scheme. This alleviates the burden of extensive boosting for each independent objective function and avoids complex formulations of multi-objective gradients. Here, several loss functions are linked through basic probability assignment (BPA) at the time of backpropagation.
Deep learning has proven to be extremely successful in several applications, and combining classical machine learning methods with deep neural networks achieves better performance. Deep versions of CCA (andrew2013deep), FA (clevert2015rectified), PCA (chan2015pcanet), SVM (vinyals2012learning), and LDA (stuhlsatz2012feature) have been introduced in the literature. There are two schools of thought on how to replace the Softmax layer so as to achieve better performance.
The first strategy trains a deep architecture to produce high-order features and feeds them to a separate classifier (coates2010analysis). For example, replacing the Softmax with an SVM as the top layer and minimizing a standard hinge loss produces better results in some deep architectures (tang2013deep). Another successful practice is LDA, which maximizes an objective derived from a generalized eigenvalue problem (dorfer2015deep). The drawback is that the features in the bottom layers are not further fine-tuned with the new objective.

The second strategy trains a combination of objectives by backpropagating the gradients of their loss functions. It is possible to optimize these objectives with either boosting methods or multi-objective evolutionary algorithms (gong2015multiobjective), but the former needs extensive training of different networks, and the latter requires very complex formulations for the gradients.

2 Method
A novel unified backpropagation scheme is proposed for deep multi-objective learning, based on the basic probability assignment (BPA) of evidence theory and applied to a network that includes different loss functions such as Softmax, SVM, and LDA. In contrast to boosting, which trains each loss function independently to make an ensemble of models, the proposed backpropagation approach unifies the gradients of all objective functions.
The advantages of unified backpropagation can be outlined as follows. First, this scheme optimizes all the objective functions jointly: the contribution of each objective function to the overall classification performance is managed by sharing the basic probability masses among the gradients. Second, this unification is less computationally expensive than ensemble learning. Third, it avoids additional complexity in the formulation of gradients for each of the combined loss functions. Experiments over a variety of scenarios and standard datasets confirm the advantage of the proposed approach, which delivers consistent improvements to the performance of deep convolutional neural networks.
3 Formulation
A deep convolutional architecture can boost the typical Softmax layer with other classifiers using a multi-objective optimization regime. Two widely-used classifiers are SVM and LDA, employed as the top layer of deep neural networks.
Employing an SVM, a deep convolutional network optimizes the primal problem of the SVM and backpropagates the gradients of the top-layer SVM to learn the bottom layers. This is in contrast with fine-tuning, where low-order features are usually trained by Softmax and the top layers are tuned by the SVM. It has been shown that the performance on standard benchmarks is much better than that of networks with a Softmax top layer. The optimization is performed using stochastic gradient descent (chan2015pcanet).
For LDA as the top layer, the deep architecture is almost the same. The objective of a deep neural network is reformulated to learn linearly separable features by backpropagation, because LDA makes it possible to define optimal linear decision boundaries in the latent layers. It finds linear combinations of low-level features that maximize the scattering between classes of data, whilst minimizing the discrepancy within individual classes. The top-layer LDA tries to produce high separation between deep features rather than minimizing the norm of a prediction error (dorfer2015deep).

Classification of some multiclass datasets is challenging due to the non-uniform distribution of data (kocco2013multi). Accuracy does not seem to be a suitable objective to optimize, because a classifier can be highly accurate while strongly biased towards some classes (fawcett2006introduction). Although there are many algorithms for dealing with imbalanced data in binary classification (he2009learning), the multiclass problem has usually been addressed by generalizing the binary solutions with a one-versus-all strategy (abe2004iterative). For some learning tasks, optimizing measures that are relevant under an imbalanced data distribution provides alternatives to accuracy (wang2012multiclass).

The emergence of cost-sensitive methods for dealing with imbalanced multiclass data (elkan2001foundations) has enabled the embedding of misclassification costs into a cost matrix. These methods usually measure the error based on the misclassification costs of individual classes in the confusion matrix. This matrix is the most informative contingency table in multiclass learning problems, because it gives the success rate of a classifier on a particular class and the failure rate on distinguishing that class from the other classes. The confusion matrix has also proven to be a great regularizer, smoothing the accuracy among classes (ralaivola2012confusion).

Deriving a probabilistic distribution from the confusion matrix is highly effective at producing a probability assignment that copes with imbalanced distributions. Such probability assignments can be constructed from recognition, substitution, and rejection rates (xu1992methods), or from both precision and recall rates (deng2016improved). The key point is to harvest the maximum prior knowledge provided by the confusion matrix to overcome the imbalanced classification challenge.

3.1 Basic Probability Assignment
A confusion matrix is generally represented as class-based predictions against actual labels, in the form of a square matrix. Inspired by Dempster-Shafer theory (sentz2002combination), the construction of a BPA gives a vector which is independent of the number of samples in each class and sums up to one for each label. BPA provides the ability to reflect the different contributions of a classifier, or to combine the outcomes of multiple weak classifiers. A raw two-dimensional confusion matrix, indexed by predicted classes and actual labels, provides some common measures of classification performance: accuracy (the proportion of the total number of predictions that are correct), precision (a measure of accuracy, provided that a specific class has been predicted), recall (a measure of the ability of a prediction model to select instances of a certain class from a dataset), and F-score (the harmonic mean of precision and recall) (sammut2011encyclopedia).

Suppose that a set of training samples $X$ from $C$ different classes is assigned to a label set $L$ using a classifier $\theta$. Each element $n_{ij}$ of the confusion matrix is the number of samples belonging to class $i$ which are assigned to label $j$. The recall ($R_i$) and precision ($P_j$) ratios for all $i, j \in \{1, \dots, C\}$ can be defined as follows (deng2016improved),
(1)  $R_i = \frac{n_{ii}}{\sum_{j=1}^{C} n_{ij}}, \qquad P_j = \frac{n_{jj}}{\sum_{i=1}^{C} n_{ij}}$
It can be seen that the recall ratio is summed over the predicted classes (rows), whilst the precision ratio is accumulated over the actual labels (columns) of the confusion matrix. The probability elements of recall ($\tilde{R}_i$) and precision ($\tilde{P}_i$) for each individual class are,
(2)  $\tilde{R}_i = \frac{R_i}{\sum_{k=1}^{C} R_k}, \qquad \tilde{P}_i = \frac{P_i}{\sum_{k=1}^{C} P_k}$
These elements are synthesized into the final probability assignments by the Dempster-Shafer rule of combination (sentz2002combination), representing the recognition ability of the classifier $\theta$ for each class $i$ of the set $L$ as,
(3)  $m_i = \tilde{R}_i \oplus \tilde{P}_i = \frac{\tilde{R}_i\, \tilde{P}_i}{\sum_{k=1}^{C} \tilde{R}_k\, \tilde{P}_k}$
where the operator $\oplus$ is an orthogonal sum. The overall contribution of the classifier can be presented as a probability assignment vector,
(4)  $m = \left[ m_1, \dots, m_C \right], \qquad \sum_{i=1}^{C} m_i = 1$
It is worth mentioning that $m$ should be computed on the training set, because it is assumed that no actual labels are available at test time.
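The construction above can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' implementation: the function name is illustrative, and the combination of Equation 3 is taken as a normalized product of the probability elements.

```python
import numpy as np

def bpa_from_confusion(conf):
    """Sketch of a basic probability assignment from a confusion matrix.

    conf[i, j] counts samples of actual class i predicted as label j.
    Recall and precision ratios (Eq. 1) are normalized per class (Eq. 2)
    and fused with a normalized product, a simplified form of the
    Dempster-Shafer orthogonal sum (Eq. 3).
    """
    conf = np.asarray(conf, dtype=float)
    recall = np.diag(conf) / conf.sum(axis=1)     # row-wise ratios
    precision = np.diag(conf) / conf.sum(axis=0)  # column-wise ratios
    r = recall / recall.sum()
    p = precision / precision.sum()
    m = r * p
    return m / m.sum()  # probability assignment vector (Eq. 4)
```

On an imbalanced two-class matrix such as `[[90, 10], [5, 5]]`, the assignment places more mass on the class the classifier recognizes well, independently of the raw sample counts.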
3.2 Unified Backpropagation
Suppose that $O = \{o_1, \dots, o_K\}$ is a set of $K$ objectives in a multi-objective learning regime, presented in Figure 2. To apply the unified backpropagation, Algorithm 1 is deployed for each of the loss functions, to come up with a set of normalized probability assignments as,
(5)  $\mathcal{M} = \{\tilde{m}^{1}, \dots, \tilde{m}^{K}\}, \qquad \tilde{m}^{k} = \frac{m^{k}}{\sum_{j=1}^{K} m^{j}}$
where each $m^{k}$ follows the same definition as Equation 4.
In each layer $l$ of the $k$th objective, feedforward propagation is calculated as follows,
(6)  $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \qquad a^{(l)} = f(z^{(l)})$
where $W^{(l)}$ and $b^{(l)}$ are the weights and biases, $a^{(l)}$ is an activation and $f$ is a rectification function. Considering $\mathcal{L}$ as the loss function, the error at the output layer $L$ holds,
(7)  $\delta^{(L)} = \nabla_{a} \mathcal{L} \odot f'(z^{(L)})$
The backpropagation error can be stated as,
(8)  $\delta^{(l)} = \left( (W^{(l+1)})^{\top} \delta^{(l+1)} \right) \odot f'(z^{(l)})$
For the sake of gradient descent, the weights and biases are updated via,
(9)  $W^{(l)} \leftarrow W^{(l)} - \eta\, \tilde{m}^{k}\, \delta^{(l)} (a^{(l-1)})^{\top}, \qquad b^{(l)} \leftarrow b^{(l)} - \eta\, \tilde{m}^{k}\, \delta^{(l)}$
With the unified backpropagation, a larger $\tilde{m}^{k}$ for the $k$th objective generates bigger updates of the weights and biases than employing a fixed learning rate $\eta$ alone. This helps to emphasize the loss functions which largely affect the overall classification performance, and properly connects the objectives in the backpropagation process. It also implies that, in the course of forward-backward propagation, the overall contribution of each objective function is taken into account. Algorithm 2 wraps up the proposed unification strategy.
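The scaled update of Equation 9 can be sketched as follows. For brevity this sketch treats each objective's assignment as a single scalar weight; the helper name `unified_update` and that scalar simplification are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def unified_update(params, grads_per_objective, m_tilde, lr=0.1):
    """One unified gradient step: the k-th objective's gradients are
    scaled by its normalized probability assignment m_tilde[k] on top
    of the fixed learning rate lr, as in Equation 9."""
    for k, grads in enumerate(grads_per_objective):
        for name, g in grads.items():
            params[name] = params[name] - lr * m_tilde[k] * g
    return params
```

An objective with a small assignment thus barely moves the shared parameters, while an objective with a large assignment dominates the step.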
3.3 Objective Functions
Following successful practices in the literature, three types of widely-used loss functions, i.e. Softmax, SVM (tang2013deep), and LDA (dorfer2015deep), are further investigated. Suppose that $\{1, \dots, C\}$ is the set of $C$ different classes in the dataset at hand and the discrete probability distribution $p(c \mid x_i)$ denotes to what extent each sample $x_i$ belongs to class $c$. Taking $z$ as the output of the last fully-connected layer for the $k$th loss function (Equation 6), the closed-form gradients of the above objective functions can be worked out as follows.

3.3.1 Softmax
For a conventional Softmax activation, $p(c \mid x_i)$ can be defined as follows,
(10)  $p(c \mid x_i) = \frac{e^{z_c}}{\sum_{j=1}^{C} e^{z_j}}$
such that $\sum_{c=1}^{C} p(c \mid x_i) = 1$, and the predicted class $\hat{c}$ is yielded by,
(11)  $\hat{c} = \operatorname*{arg\,max}_{c} \; p(c \mid x_i)$
The Softmax loss function takes the cross-entropy form,
(12)  $\mathcal{L} = - \sum_{i} \sum_{c=1}^{C} y_i(c) \log p(c \mid x_i)$
where $y_i$ denotes the one-hot label vector of sample $x_i$. Its gradient with respect to $z_c$ holds,
(13)  $\frac{\partial \mathcal{L}}{\partial z_c} = p(c \mid x_i) - y_i(c)$
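Equations 10-13 can be sketched for a single sample as follows; the max-shift inside the exponent is a standard numerical-stability trick, not part of the formulation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()       # Eq. 10: probabilities sum to one

def softmax_loss_and_grad(z, y):
    """Cross-entropy loss (Eq. 12) and its gradient w.r.t. the
    logits (Eq. 13) for one sample; y is a one-hot label vector."""
    p = softmax(z)
    return -np.sum(y * np.log(p)), p - y
```

Note that the gradient components sum to zero and the entry of the true class is negative, pushing its logit up.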
3.3.2 Support Vector Machine
The squared hinge loss for the $\ell_2$-norm binary SVM ($y_i \in \{-1, +1\}$) is defined as,
(14)  $\mathcal{L}(w) = \frac{1}{2} \lVert w \rVert^{2} + \lambda \sum_{i} \max\left(0,\; 1 - y_i\, w^{\top} x_i \right)^{2}$
and its gradient can be derived as follows,
(15)  $\frac{\partial \mathcal{L}}{\partial w} = w - 2\lambda \sum_{i} y_i\, x_i \max\left(0,\; 1 - y_i\, w^{\top} x_i \right)$
The multiclass scenario is an extension of the binary objective using a one-vs-rest approach. Minimizing Equation 14 for each class weight vector $w_c$ gives the predicted class as,
(16)  $\hat{c} = \operatorname*{arg\,max}_{c} \; w_c^{\top} x$
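A minimal sketch of Equations 14-15 for the binary case follows; the function name and the default $\lambda$ are illustrative, and the multiclass predictor of Equation 16 would simply evaluate one such $w$ per class and take the arg max.

```python
import numpy as np

def squared_hinge(w, X, y, lam=1.0):
    """L2-regularized squared hinge loss (Eq. 14) and its gradient
    (Eq. 15) for labels y in {-1, +1}."""
    margin = np.maximum(0.0, 1.0 - y * (X @ w))
    loss = 0.5 * w @ w + lam * np.sum(margin ** 2)
    grad = w - 2.0 * lam * (X.T @ (y * margin))
    return loss, grad
```

Only samples inside the margin contribute to the gradient, which is what makes the squared hinge differentiable everywhere.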
3.3.3 Linear Discriminant Analysis
The focus is on maximizing the smallest eigenvalues of the generalized LDA eigenvalue problem,
(17)  $S_b\, v_i = \lambda_i\, S_w\, v_i$
such that $S_b$ is the between-class scattering matrix, $S_w$ is the within-class scattering matrix, $v_i$ corresponds to the eigenvectors and $\lambda_i$ represents the eigenvalues. This leads to a maximization of the discriminant power of any deep architecture. Hence, the objective can be stated as the mean of the $k$ smallest eigenvalues,

(18)  $\mathcal{L} = \frac{1}{k} \sum_{i=1}^{k} \lambda_i$
which, for eigenvectors normalized such that $v_i^{\top} S_w v_i = 1$, holds the following gradient,
(19)  $\frac{\partial \lambda_i}{\partial x} = v_i^{\top} \left( \frac{\partial S_b}{\partial x} - \lambda_i \frac{\partial S_w}{\partial x} \right) v_i$
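The scatter matrices and the eigenvalue objective of Equations 17-18 can be sketched as follows. This is a plain NumPy sketch under stated assumptions: the small ridge `eps` on $S_w$ is added for numerical stability, and the generalized problem is solved through $S_w^{-1} S_b$ for brevity rather than a dedicated generalized eigensolver.

```python
import numpy as np

def lda_eigen_objective(X, labels, k=2, eps=1e-3):
    """Mean of the k smallest generalized eigenvalues of
    S_b v = lambda S_w v (Eq. 17), i.e. the objective of Eq. 18."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw = eps * np.eye(d)   # within-class scatter (plus ridge)
    Sb = np.zeros((d, d))  # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
    evals = np.real(np.linalg.eigvals(np.linalg.solve(Sw, Sb)))
    return np.sort(evals)[:k].mean()
```

Pushing this quantity up during training spreads the class means apart relative to the within-class spread, which is exactly the separation the top-layer LDA rewards.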
4 Experiments
The experiments are conducted for two different scenarios, using the MNIST (lecun1998gradient), CIFAR (krizhevsky2009learning), and SVHN (netzer2011reading) datasets. In the first scenario, the unified backpropagation is applied to single-objective learning and its performance is compared to the baselines of Softmax, SVM, and LDA. In the second scenario, multi-objective learning is considered and multiple loss functions are combined via the unified backpropagation paradigm. We report results on standard architectures, implemented in the deep learning library of the Oxford Visual Geometry Group (vedaldi08vlfeat).
4.1 SingleObjective Learning
In this scenario, the advantage of the unified backpropagation (Unified) is validated for each of the individual objective functions under examination. We provide the outcomes for Softmax, SVM, and LDA as common loss functions in deep convolutional networks.
Tables 1, 2, and 3 show the outcomes of this scenario. It can be seen that the unified backpropagation consistently improves the classification performance of deep convolutional networks and provides smaller training-test errors. According to Table 1, the unified backpropagation outperforms all the Softmax baselines. The greatest improvement belongs to CIFAR10, where the test error drops from 22.72% to 18.77%. The smallest improvement on the training error comes from the same dataset. It seems that, in spite of a larger training error, better generalization leads to a considerable enhancement in the overall performance.
In Table 2, the best improvement in test error goes to CIFAR100, which decreases from 49.11% to 39.76%. Although the number of classes is higher than in the other datasets, the unified backpropagation successfully avoids biases for SVM, and hence the overall performance is considerably improved. Looking at Table 3, the best result is recorded for MNIST. The unification paradigm outperforms the baselines in both training and test errors on all the experimental datasets. This is the result of the distinction imposed by LDA: since LDA pushes the separation among classes, rather than the likelihood of predictions matching labels (as Softmax and SVM do), the training-test errors reduce accordingly in all the experiments. This confirms that the proposed scheme provides better learning than the typical methods.
Figures 3 and 4 present the comparative plots for MNIST and CIFAR10, respectively. It is obvious that the unified backpropagation provides better generalization and improves the training-test errors of the classification task. In Figure 3, the gap between training and validation errors is hugely reduced by the proposed unification method, which means that it provides better generalization for the trained model. The energy of the loss function for Softmax is lower than for the proposed method; although this might suggest better performance for the former, the latter outperforms it in terms of test errors. The reason lies in the capability of this method to deal with the non-smooth decision boundaries of non-convex objectives in deep neural networks. This is a critical point, especially when the classes are highly correlated in the datasets, as they are for MNIST or CIFAR10.
Dataset  Softmax Train (%)  Softmax Test (%)  Unified Train (%)  Unified Test (%)
MNIST  0.09  0.65  0.04  0.32 
CIFAR10  1.32  22.72  1.56  18.77 
CIFAR100  0.17  50.90  0.21  48.01 
SVHN  0.13  3.81  0.07  2.59 
Dataset  SVM Train (%)  SVM Test (%)  Unified Train (%)  Unified Test (%)
MNIST  0.07  0.53  0.08  0.51 
CIFAR10  1.27  19.87  1.47  17.64 
CIFAR100  0.13  49.11  0.17  39.76 
SVHN  0.12  3.46  0.09  2.38 
Dataset  LDA Train (%)  LDA Test (%)  Unified Train (%)  Unified Test (%)
MNIST  0.07  0.42  0.05  0.38 
CIFAR10  0.94  7.39  0.87  6.95 
CIFAR100  0.15  19.63  0.13  18.46 
SVHN  0.06  2.78  0.06  2.27 
Figure 4 shows that the generalization gain for CIFAR10, imposed by the unified backpropagation, is much higher than for MNIST. As mentioned before, less correlation between classes in the CIFAR10 dataset results in a better job of tuning the model parameters by this backpropagation strategy. Another observation for CIFAR10 is that the pattern of variations in the baseline and in the outcomes is not highly aligned with MNIST. It seems that in some epochs, the unified backpropagation forces the learning process towards particular classes which do not contribute to the overall precision of classification.
4.2 MultiObjective Learning
This scenario applies the unified backpropagation to combine the Softmax, SVM, and LDA objective functions. The baselines are produced by ensemble learning via the Adaboost algorithm. Tables 4, 5, and 6 summarize the training-test errors on all the experimental datasets for Softmax + SVM, Softmax + LDA, and Softmax + SVM + LDA, powered by the proposed backpropagation paradigm. It can be seen that, almost everywhere, the proposed unification improves the classification performance. The only exception is Softmax + LDA on the CIFAR100 dataset, where this method is not able to outperform the baseline.
Table 4 shows the outcomes of the unification on the Softmax and SVM combination. It is obvious that the unified backpropagation improves the test errors for all the experiments. The training errors are all improved, except on the CIFAR100 dataset. In Table 5, we report the errors for the joint Softmax and LDA. Here, almost all the improvements come from the unified backpropagation: on CIFAR10 the training error increases from 0.90% to 1.50%, and on CIFAR100 the test error jumps from 18.27% to 35.28%. Since these degradations belong to the CIFAR datasets, it can be concluded that LDA is not that successful at separating their highly-correlated classes.
Finally, Table 6 gathers the results of the experiments for the composition of Softmax, SVM, and LDA. It is clear that the unification scheme outperforms the baseline on all training-test errors, except for training on CIFAR10 and the test on CIFAR100. It seems that SVM makes a significant contribution towards compensating LDA's disadvantage on the CIFAR datasets, but that is not enough for the unified backpropagation to take an edge over the baseline.
Dataset  Softmax+SVM Train (%)  Softmax+SVM Test (%)  Unified Train (%)  Unified Test (%)
MNIST  0.08  0.57  0.07  0.45 
CIFAR10  1.54  18.72  1.15  15.82 
CIFAR100  0.23  48.85  0.91  38.58 
SVHN  0.11  3.27  0.08  2.48 
Dataset  Softmax+LDA Train (%)  Softmax+LDA Test (%)  Unified Train (%)  Unified Test (%)
MNIST  0.07  0.44  0.05  0.41 
CIFAR10  0.90  7.51  1.50  6.81 
CIFAR100  0.18  18.27  0.71  35.28 
SVHN  0.09  3.64  0.06  2.41 
Dataset  Softmax+SVM+LDA Train (%)  Softmax+SVM+LDA Test (%)  Unified Train (%)  Unified Test (%)
MNIST  0.08  0.38  0.05  0.30 
CIFAR10  0.78  5.96  1.83  5.44 
CIFAR100  1.35  22.49  0.68  35.13 
SVHN  0.08  3.01  0.07  2.34 
All in all, LDA does a better job than SVM in both the baseline and the unified backpropagation, and the proposed method performs better when all the objectives come together. The best improvement goes to the CIFAR100 dataset, which reduces the test error from 38.58% for Softmax + SVM to 35.28% for Softmax + LDA, followed by 35.13% for Softmax + SVM + LDA. For CIFAR10, Softmax + LDA improves the performance quite well in comparison with Softmax + SVM. Although the joint venture of all classifiers generates higher precision in test, the lowest training errors vary between CIFAR10 for Softmax + SVM, MNIST and SVHN for Softmax + LDA, and CIFAR100 for Softmax + SVM + LDA.
4.3 Discussion
Dataset  Baseline Train (%)  Baseline Test (%)  Unified Train (%)  Unified Test (%)
MNIST  0.08  0.38  0.05  0.30 
CIFAR10  0.94  7.39  1.83  5.44 
CIFAR100  0.18  18.27  0.13  18.46 
SVHN  0.07  2.78  0.06  2.27 
Considering both experimental scenarios, Table 7 summarizes the minimum test errors and their corresponding training errors for each of the datasets under examination. The unified backpropagation either outperforms the baselines by high margins (5.44% vs 7.39% for CIFAR10) or follows them at close rates (18.46% vs 18.27% for CIFAR100). This confirms the advantage of the proposed backpropagation for classification.
On the other hand, the best results for the unification method go to multi-objective learning on the MNIST & CIFAR10 datasets and single-objective learning on the CIFAR100 & SVHN datasets. Although further investigations remain, the initial results indicate that multi-objective learning performs better with a small or medium number of samples and classes, while single-objective learning performs best with a large number of samples and classes. This is due to the fact that the multi-objective regime is not able to cope with either complex data distributions or highly-correlated classes, when several objectives contradict each other.
5 Conclusion
Typical classification architectures in deep neural networks employ Softmax, support vector machines or linear discriminant analysis as the top layer and backpropagate the error via the gradients of their specific loss functions. We propose a novel paradigm to learn hybrid multi-objective networks with a unified backpropagation scheme. Using the basic probability assignment from evidence theory, we link the gradients of the hybrid loss functions and update the network parameters by backpropagation. This also avoids biases under imbalanced data distributions and improves the classification performance of single-objective or hybrid models. Our extensive experiments on standard datasets show that the proposed unification scheme contributes to the overall precision of deep convolutional neural networks.
References
 (1) N. Abe, B. Zadrozny, and J. Langford. An iterative method for multi-class cost-sensitive learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 3–11. ACM, 2004.
 (2) G. Andrew, R. Arora, J. A. Bilmes, and K. Livescu. Deep canonical correlation analysis. In ICML (3), pages 1247–1255, 2013.
 (3) T.-H. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, and Y. Ma. PCANet: A simple deep learning baseline for image classification? IEEE Transactions on Image Processing, 24(12):5017–5032, 2015.
 (4) D.A. Clevert, A. Mayr, T. Unterthiner, and S. Hochreiter. Rectified factor networks. In Advances in neural information processing systems, pages 1855–1863, 2015.
 (5) A. Coates, H. Lee, and A. Y. Ng. An analysis of single-layer networks in unsupervised feature learning. Ann Arbor, 1001(48109):2, 2010.
 (6) X. Deng, Q. Liu, Y. Deng, and S. Mahadevan. An improved method to construct basic probability assignment based on the confusion matrix for classification problem. Information Sciences, 340:250–261, 2016.
 (7) M. Dorfer, R. Kelz, and G. Widmer. Deep linear discriminant analysis. arXiv preprint arXiv:1511.04707, 2015.

 (8) C. Elkan. The foundations of cost-sensitive learning. In International joint conference on artificial intelligence, volume 17, pages 973–978. Lawrence Erlbaum Associates Ltd, 2001.
 (9) T. Fawcett. An introduction to ROC analysis. Pattern recognition letters, 27(8):861–874, 2006.
 (10) M. Gong, J. Liu, H. Li, Q. Cai, and L. Su. A multiobjective sparse feature learning model for deep neural networks. IEEE transactions on neural networks and learning systems, 26(12):3263–3277, 2015.
 (11) H. He and E. A. Garcia. Learning from imbalanced data. IEEE Transactions on knowledge and data engineering, 21(9):1263–1284, 2009.
 (12) S. Koço and C. Capponi. On multiclass classification through the minimization of the confusion matrix norm. In ACML, pages 277–292, 2013.
 (13) A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
 (14) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 (15) Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011.
 (16) L. Ralaivola. Confusion-based online learning and a passive-aggressive scheme. In Advances in Neural Information Processing Systems, pages 3284–3292, 2012.
 (17) C. Sammut and G. I. Webb. Encyclopedia of machine learning. Springer Science & Business Media, 2011.
 (18) K. Sentz and S. Ferson. Combination of evidence in Dempster-Shafer theory, volume 4015. Citeseer, 2002.
 (19) A. Stuhlsatz, J. Lippel, and T. Zielke. Feature extraction with deep neural networks by a generalized discriminant analysis. IEEE transactions on neural networks and learning systems, 23(4):596–608, 2012.
 (20) Y. Tang. Deep learning using linear support vector machines. In ICML. Citeseer, 2013.

 (21) A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/, 2008.
 (22) O. Vinyals, Y. Jia, L. Deng, and T. Darrell. Learning with recursive perceptual representations. In Advances in Neural Information Processing Systems, pages 2825–2833, 2012.
 (23) S. Wang and X. Yao. Multiclass imbalance problems: Analysis and potential solutions. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42(4):1119–1130, 2012.
 (24) L. Xu, A. Krzyzak, and C. Y. Suen. Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE transactions on systems, man, and cybernetics, 22(3):418–435, 1992.