1 Introduction
Image classification is a fundamental yet still challenging task in machine learning and computer vision. Over the past years, deep Convolutional Neural Networks (CNNs) have greatly boosted the performance of a series of image classification tasks, such as object classification [Krizhevsky et al.2012, He et al.2016, Liu et al.2016], face verification [Wen et al.2016, Zhang et al.2016, Liu et al.2017a, Wang et al.2017a] and handwritten digit recognition [Goodfellow et al.2013, Lin et al.2013]. Deep networks naturally integrate low/mid/high-level features and classifiers in an end-to-end multi-layer fashion, where each layer mainly consists of convolution, pooling and nonlinear activation operations. This gives CNNs their strong visual representation ability and their current prominent position.

To train a deep model, a loss function, such as the (squared) hinge loss, contrastive loss, triplet loss or softmax loss, is usually employed. Among them, the softmax loss is arguably the most popular one [Liu et al.2016]. It consists of three components: the last fully connected layer, the softmax function, and the cross-entropy loss (the details of each component are described in Section 2.1). It is widely adopted by many CNNs [Krizhevsky et al.2012, Simonyan and Andrew2014, He et al.2016] due to its simplicity and clear probabilistic interpretation. However, the works [Liu et al.2016, Wen et al.2016, Zhang et al.2016] have shown that the original softmax loss is inadequate because it does not explicitly encourage the discriminability of CNN features. A recent trend is therefore to design more effective losses to enhance the performance. This is non-trivial, because a newly designed loss usually needs to be easily optimizable by Stochastic Gradient Descent (SGD) [LeCun et al.1998a].

Existing works that improve the softmax loss fall mainly into two groups. The first group refines the cross-entropy loss component. Sun et al. [Sun et al.2014] trained CNNs with a combination of the softmax loss and a contrastive loss, but the pairs of training samples are difficult to select. Schroff et al. [Schroff et al.2015] used the triplet loss to minimize the distance between an anchor sample and a positive sample (of the same identity), while maximizing the distance between the anchor and a negative sample (of a different identity). However, requiring triplets of training samples as input makes it inefficient. Tang et al. [Tang2013] replaced the cross-entropy loss with the hinge loss, and Liu et al. [Liu et al.2017b] employed a congenerous cosine loss to enlarge the inter-class distinction and alleviate the intra-class variance; the drawback is that these two losses are frequently unstable. Recently, Wen et al. [Wen et al.2016] introduced a center loss together with the softmax loss, and Zhang et al. [Zhang et al.2016] proposed a range loss to handle long-tailed data distributions. Both achieved promising results on the face verification task. However, the objective of open-set face verification (i.e., mainly learning discriminative features) differs from that of closed-set image classification (i.e., simultaneously learning discriminative features and a strong classifier). The second group reformulates the softmax function component. Liu et al. [Liu et al.2016, Liu et al.2017a] enlarged the margin of the softmax function to encourage feature discriminability and further extended it to the face verification task. Wang et al. [Wang et al.2017a] developed a normalized softmax function to learn discriminative features. However, few works have considered the last fully connected layer of the softmax loss (for convenience, we denote it as the softmax classifier). Fully convolutional networks [Li et al.2016] and global average pooling [Lin et al.2013, Zhou et al.2016] modify the fully connected layers of DNNs, but they are not applicable to the softmax classifier. In fact, for deep image classification, the softmax classifier is of utmost importance.

Since feature extraction and classifier learning in CNNs form an end-to-end framework, we argue in this paper that the weakness of the softmax loss comes from two aspects. One is that the extracted features are not discriminative. The other is that the learned classifier is not strong. To address these issues, we introduce a simple yet effective soft-margin softmax function to explicitly emphasize feature discriminability, and adopt a novel ensemble strategy to learn a strong softmax classifier. For clarity, our main contributions are summarized as follows:

We cast a new viewpoint on the weakness of the original softmax loss, i.e., the extracted CNN features are insufficiently discriminative and the learned classifier is weak for deep image classification.

We design a soft-margin softmax function to encourage feature discriminability and attempt to assemble the weak classifiers of the softmax loss by employing the Hilbert-Schmidt Independence Criterion (HSIC).

We conduct experiments on the datasets of MNIST, CIFAR10/CIFAR10+, CIFAR100/CIFAR100+, and ImageNet32 [Chrabaszcz et al.2017], which reveal the effectiveness of the proposed method.
2 Preliminary Knowledge
2.1 Softmax Loss
Assume that the output of a single image through a deep convolutional neural network is the feature x ∈ R^d (i.e., the CNN feature), where d is the feature dimension. Given a mini-batch of n labeled images, their outputs are {x_1, x_2, ..., x_n}. The corresponding labels are {y_1, y_2, ..., y_n}, where y_i ∈ {1, 2, ..., K} is the class indicator and K is the number of classes. Similar to the work [Liu et al.2016], we define the complete softmax loss as the pipeline combination of the last fully connected layer, the softmax function and the cross-entropy loss. The last fully connected layer transforms the feature x into a primary score f_k through the parameters W = [w_1, w_2, ..., w_K] ∈ R^{d×K}, formulated as f_k = w_k^T x. Generally speaking, the parameter w_k can be regarded as the linear classifier of class k. Then, the softmax function transforms the primary scores into new predicted class scores p_k = e^{f_k} / Σ_{j=1}^{K} e^{f_j}. Finally, the cross-entropy loss −log(p_{y_i}) is employed.
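As an illustration, the three components of the softmax loss can be chained in a few lines of NumPy. This is a minimal sketch, not the paper's implementation; the function and variable names are our own:

```python
import numpy as np

def softmax_loss(X, y, W):
    """Complete softmax loss: last fully connected layer ->
    softmax function -> cross-entropy loss.
    X: (n, d) mini-batch of CNN features
    y: (n,)   integer labels in [0, K)
    W: (d, K) last fully connected layer; column k acts as the
              linear classifier of class k."""
    f = X @ W                                  # primary scores f_k = w_k^T x
    f = f - f.max(axis=1, keepdims=True)       # shift for numerical stability
    p = np.exp(f) / np.exp(f).sum(axis=1, keepdims=True)  # softmax probabilities
    return -np.mean(np.log(p[np.arange(len(y)), y]))      # cross-entropy
```

With all-zero features and weights the class probabilities are uniform, so the loss equals log K.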
2.2 Hilbert-Schmidt Independence Criterion
The Hilbert-Schmidt Independence Criterion (HSIC) was proposed in [Gretton et al.2005] to measure the (in)dependence of two random variables X and Y.

Definition 1 (Empirical HSIC). Consider n independent observations drawn from p_{xy}, Z = {(x_1, y_1), ..., (x_n, y_n)}; an empirical estimator of HSIC(Z) is given by:

HSIC(Z) = (n − 1)^{−2} tr(K_1 H K_2 H),  (1)

where K_1 and K_2 are the Gram matrices with k_1^{ij} = k_1(x_i, x_j) and k_2^{ij} = k_2(y_i, y_j); k_1 and k_2 are the kernel functions defined in the spaces of X and Y, respectively. H = I − (1/n)11^T centers the Gram matrices to have zero mean.

Note that, according to Eq. (1), to maximize the independence between two random variables X and Y, the empirical estimate of HSIC, i.e., tr(K_1 H K_2 H), should be minimized.
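The empirical estimator of Eq. (1) is straightforward to compute. The following sketch is our own illustrative code, not from the paper, and uses inner-product (linear) kernels for simplicity:

```python
import numpy as np

def empirical_hsic(X, Y):
    """Empirical HSIC of Eq. (1): (n-1)^(-2) tr(K1 H K2 H),
    here with inner-product (linear) kernels.
    X: (n, d1) and Y: (n, d2) hold n paired observations."""
    n = X.shape[0]
    K1 = X @ X.T                          # Gram matrix of X
    K2 = Y @ Y.T                          # Gram matrix of Y
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return np.trace(K1 @ H @ K2 @ H) / (n - 1) ** 2
```

A larger value indicates stronger dependence; for a constant Y the estimate vanishes.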
3 Problem Formulation
Recent works [Liu et al.2016, Wen et al.2016, Zhang et al.2016] argue that the original softmax loss is inadequate because the features it learns are not discriminative. They either reformulate the softmax function into a new desired one (e.g., L-Softmax [Liu et al.2016]) or add additional constraints to refine the original softmax loss (e.g., contrastive loss [Sun et al.2014] and center loss [Wen et al.2016]). Here, we follow this argument but cast a new viewpoint on the weakness: the extracted features are not discriminative and, at the same time, the learned classifier is not strong.
3.1 Soft-Margin Softmax Function
To enhance the discriminability of CNN features, we design a new soft-margin softmax function to enlarge the margin between different classes. We first give a simple example to describe our intuition. Consider binary classification and a sample x from class 1. The original softmax loss enforces w_1^T x > w_2^T x (i.e., ||w_1|| ||x|| cos(θ_1) > ||w_2|| ||x|| cos(θ_2)) to classify x correctly. To make this objective more rigorous, L-Softmax [Liu et al.2016] introduced an angular margin m:

||w_1|| ||x|| cos(mθ_1) > ||w_2|| ||x|| cos(θ_2),  (2)

and used the intermediate value ||w_1|| ||x|| cos(mθ_1) to replace the original w_1^T x during training. In that way, the discrimination between class 1 and class 2 is explicitly emphasized. However, to make cos(mθ_1) derivable, m should be a positive integer. In other words, the angular margin cannot go through all possible angles; it is a hard margin. Moreover, the forward and backward computations are complex due to the angular margin involved. To address these issues, inspired by the works [Sun et al.2014, Liang et al.2017, Bell and Bala.2015], we here introduce a soft distance margin and simply let

w_1^T x − m > w_2^T x,  (3)

where m is a non-negative real number serving as a distance margin. In training, we employ w_{y_i}^T x_i − m to replace w_{y_i}^T x_i; thus our multi-class soft-margin softmax function can be defined as p_{y_i} = e^{w_{y_i}^T x_i − m} / (e^{w_{y_i}^T x_i − m} + Σ_{k≠y_i} e^{w_k^T x_i}). Consequently, the soft-Margin Softmax (M-Softmax) loss is formulated as:

L_M = −(1/n) Σ_{i=1}^{n} log [ e^{w_{y_i}^T x_i − m} / ( e^{w_{y_i}^T x_i − m} + Σ_{k≠y_i} e^{w_k^T x_i} ) ].  (4)

Obviously, when m is set to zero, the designed M-Softmax loss of Eq. (4) becomes identical to the original softmax loss.
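The soft distance margin of Eq. (3) amounts to a one-line change of the softmax loss: subtract m from the target-class score before the softmax. A minimal NumPy sketch (illustrative, not the authors' code):

```python
import numpy as np

def m_softmax_loss(X, y, W, m=0.5):
    """Soft-margin softmax (M-Softmax) loss of Eq. (4): the target
    score w_y^T x is replaced by w_y^T x - m, so each sample must
    beat every other class by the margin m. m = 0 recovers the
    original softmax loss."""
    f = X @ W                             # (n, K) primary scores
    f[np.arange(len(y)), y] -= m          # soft distance margin on target class
    f = f - f.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(f) / np.exp(f).sum(axis=1, keepdims=True)
    return -np.mean(np.log(p[np.arange(len(y)), y]))
```

Since the margin only lowers the target-class score, the loss is non-decreasing in m.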
3.2 Diversity Regularized Ensemble Strategy
Though learning discriminative features may result in a better classifier, since these two components highly depend on each other, the classifier may not be strong enough without explicit encouragement. To learn a strong one, as indicated in [Guo et al.2017], a combination of various classifiers can improve predictions. Thus we adopt an ensemble strategy. Before formulating it, we note that the most popular way to train an ensemble in deep learning is arguably dropout [Hinton et al.2012]. The idea behind dropout is to train an ensemble of DNNs by randomly dropping activations, and to average the results of the whole ensemble instead of training a single DNN. However, in the last fully connected layer of the softmax loss, dropout is usually not permitted, because it would lose useful label information, especially with limited training samples. Therefore, we need a new way to assemble weak classifiers.

Without loss of generality, we take two weak softmax classifiers W_1 and W_2 as an example to illustrate the main idea. It has been well recognized that the diversity of weak classifiers is of utmost importance for a good ensemble [Guo et al.2017, Li et al.2012]. Here, we exploit the diverse/complementary information across different weak classifiers by enforcing them to be independent: high independence of two weak classifiers W_1 and W_2 means high diversity. Classical independence criteria such as Spearman's rho and Kendall's tau [Fredricks and Nelsen.2007] can only capture linear dependence. The recent exclusivity regularized term [Guo et al.2017, Wang et al.2017b] and ensemble pruning [Li et al.2012] may be good candidates for classifier ensembles, but both are difficult to differentiate. Therefore, these methods are not suitable for assembling the weak softmax classifiers.
In this paper, we employ the Hilbert-Schmidt Independence Criterion (HSIC) to measure the independence (i.e., diversity) of the weak classifiers, mainly for two reasons. One is that HSIC measures dependence by mapping the variables into a Reproducing Kernel Hilbert Space (RKHS), so that nonlinear dependence can be captured. The other is that HSIC is computationally efficient: the empirical HSIC in Eq. (1) turns out to be the trace of a product of Gram matrices of the weak classifiers, which can be easily optimized by standard SGD. Based on the above analysis, we naturally minimize the following constraint according to Eq. (1):
min HSIC(W_1, W_2).  (5)

For simplicity, we adopt the inner-product kernel for the proposed HSIC, i.e., k(w_i, w_j) = w_i^T w_j for both W_1 and W_2. Considering the multiple-ensemble setting and ignoring the scaling factor of HSIC for notational convenience leads to the following diversity term:

Ω(W) = Σ_{v=1}^{C} Σ_{u≠v} tr(W_v^T W_v H W_u^T W_u H),  (6)

where C is the number of weak classifiers, H is the centering matrix defined in Section 2.2, and W_v ∈ R^{d×K} denotes the v-th weak softmax classifier. However, according to the formulation of Eq. (6), the HSIC constraint is value-aware: the diversity is determined by the values of the weak classifiers. If the magnitudes of different weak classifiers differ greatly, the diversity may not be handled well. To avoid this scale issue, we use the normalized weak classifiers to compute the diversity. In other words, unless otherwise specified, the weak classifiers W_v in Eq. (6) are normalized (each column divided by its norm). Merging the diversity constraint into the softmax loss leads to the Ensemble Softmax (E-Softmax) loss:
L_E = L_S + λ Ω(W),  (7)

where L_S denotes the softmax loss over the weak classifiers and λ is a hyper-parameter balancing the importance of the diversity. The backward propagation of the diversity term with respect to each weak classifier is computed as ∂Ω/∂W_v = 2 Σ_{u≠v} W_v H W_u^T W_u H. Clearly, the update of each weak classifier W_v is co-determined by its initialization and the other weak classifiers (i.e., ∂Ω/∂W_v is computed based on the other classifiers). This means that the diversity of the different weak classifiers is explicitly enhanced.

3.3 Optimization
In this part, we show that the proposed EM-Softmax loss, i.e., the combination of the soft-margin softmax loss Eq. (4) and the diversity constraint Eq. (6),

L_{EM} = L_M + λ Ω(W),  (8)

is trainable and can be easily optimized by standard SGD. Specifically, we implement the CNNs using the well-known Caffe [Jia et al.2014] library and use the chain rule to compute the partial derivatives with respect to each weak classifier W_v and the feature x_i:

∂L_{EM}/∂W_v = Σ_i (∂L_M/∂f_i)(∂f_i/∂W_v) + λ ∂Ω/∂W_v,  ∂L_{EM}/∂x_i = Σ_v (∂L_M/∂f_i)(∂f_i/∂x_i),  (9)

where the computation forms of ∂L_M/∂f_i, ∂f_i/∂W_v and ∂f_i/∂x_i are the same as for the original softmax loss.
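Under the inner-product kernel and the normalization described above, the diversity term of Eq. (6) can be sketched as follows. This is our own illustrative code, under the assumption that the Gram matrices are built over the K class vectors of each weak classifier:

```python
import numpy as np

def diversity_penalty(Ws):
    """HSIC-based diversity term in the spirit of Eq. (6) over a
    list of weak softmax classifiers Ws, each of shape (d, K),
    with the inner-product kernel and the HSIC scaling dropped.
    Columns are normalized first to avoid the scale issue."""
    K = Ws[0].shape[1]
    H = np.eye(K) - np.ones((K, K)) / K   # centering matrix
    Ns = [W / np.linalg.norm(W, axis=0, keepdims=True) for W in Ws]
    total = 0.0
    for u in range(len(Ns)):
        for v in range(len(Ns)):
            if u != v:                    # sum HSIC over distinct pairs
                total += np.trace(Ns[u].T @ Ns[u] @ H @ Ns[v].T @ Ns[v] @ H)
    return total
```

Minimizing this penalty during training pushes the weak classifiers toward mutual independence, i.e., diversity.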
4 Experiments
4.1 Dataset Description
MNIST [LeCun et al.1998b]: MNIST is a dataset of handwritten digits (from 0 to 9) composed of 28×28 gray-scale images. There are 60,000 training images and 10,000 test images. We scale the pixel values to a fixed range before inputting them to our neural network.
CIFAR10/CIFAR10+ [Krizhevsky and Hinton2009]: The CIFAR10 dataset contains 10 classes, each with 5,000 training samples and 1,000 test samples. We first compare the EM-Softmax loss with the others under the no-data-augmentation setup. For data augmentation (CIFAR10+), we follow the standard technique in [Lee et al.2015, Liu et al.2016] for training: 4 pixels are padded on each side, and a 32×32 crop is randomly sampled from the padded image or its horizontal flip. At test time, we only evaluate the single view of the original image. In addition, before inputting the images to the network, we subtract the per-pixel mean computed over the training set from each image.

CIFAR100/CIFAR100+ [Krizhevsky and Hinton2009]: We also evaluate the performance of the proposed EM-Softmax loss on the CIFAR100 dataset. CIFAR100 has the same size and format as CIFAR10, except that it has 100 classes containing 600 images each; there are 500 training images and 100 test images per class. For the data-augmented setting CIFAR100+, we follow the same technique as in [Lee et al.2015, Liu et al.2016].
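The augmentation pipeline for CIFAR10+/CIFAR100+ described above (4-pixel padding, random crop, random horizontal flip) can be sketched as follows; the function is our own illustration, not the paper's code:

```python
import numpy as np

def augment(img, pad=4, crop=32):
    """Standard CIFAR training augmentation: pad `pad` pixels on
    each side, take a random crop x crop patch of the padded
    image, and flip it horizontally with probability 0.5.
    img: (H, W, C) array."""
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode='constant')
    top = np.random.randint(0, padded.shape[0] - crop + 1)
    left = np.random.randint(0, padded.shape[1] - crop + 1)
    out = padded[top:top + crop, left:left + crop]
    if np.random.rand() < 0.5:
        out = out[:, ::-1]                # random horizontal flip
    return out
```

At test time no augmentation is applied; only the single original view is evaluated.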
ImageNet32 [Chrabaszcz et al.2017]: ImageNet32 is a downsampled version of the ImageNet 2012 challenge dataset, which contains exactly the same number of images as the original ImageNet, i.e., 1,281,167 training images and 50,000 validation images for 1,000 classes. All images are downsampled to 32×32. Similarly, we subtract the per-pixel mean computed over the downsampled training set from each image before feeding it into the network.

Table 1: Recognition accuracy (%) on MNIST, CIFAR10/CIFAR10+ and CIFAR100/CIFAR100+. * indicates results quoted from [Liu et al.2016].

Method | MNIST | CIFAR10 | CIFAR10+ | CIFAR100 | CIFAR100+
HingeLoss [Tang2013] | 99.53* | 90.09* | 93.31* | 67.10* | 68.48
CenterLoss [Wen et al.2016] | 99.41 | 91.65 | 93.82 | 69.23 | 70.97
A-Softmax [Liu et al.2017a] | 99.66 | 91.72 | 93.98 | 70.87 | 72.23
N-Softmax [Wang et al.2017a] | 99.48 | 91.46 | 93.90 | 70.49 | 71.85
L-Softmax [Liu et al.2016] | 99.69* | 92.42* | 94.08* | 70.47* | 71.96
Softmax (baseline) | 99.60* | 90.95* | 93.50* | 67.26* | 69.15
M-Softmax (ours) | 99.70 | 92.50 | 94.27 | 70.72 | 72.54
E-Softmax (ours) | 99.69 | 92.38 | 93.92 | 70.34 | 71.33
EM-Softmax (ours) | 99.73 | 93.31 | 95.02 | 72.21 | 75.69
4.2 Compared Methods
We compare our EM-Softmax loss with recently proposed state-of-the-art alternatives, including the baseline softmax loss (Softmax), the margin-based hinge loss (HingeLoss [Tang2013]), the combination of softmax loss and center loss (CenterLoss [Wen et al.2016]), the large-margin softmax loss (L-Softmax [Liu et al.2016]), the angular-margin softmax loss (A-Softmax [Liu et al.2017a]) and the normalized-features softmax loss (N-Softmax [Wang et al.2017a]). The source code of Softmax and HingeLoss is provided in the Caffe community; for the other compared methods, the source code can be downloaded from GitHub or the authors' webpages. For a fair comparison, the experimental results are either quoted from the paper [Liu et al.2016] (indicated by *) or obtained by tuning the corresponding hyper-parameters to our best effort. Moreover, to verify the gain of our soft margin and ensemble strategy, we also report the results of the M-Softmax loss Eq. (4) and the E-Softmax loss Eq. (7).
4.3 Implementation Details
In this section, we give the major implementation details of the baseline works and the training/testing settings, as follows:
Baseline works. To verify the universality of EM-Softmax, we choose the work [Liu et al.2016] as the baseline. We strictly follow all experimental settings in [Liu et al.2016], including the CNN architecture (LiuNet; the detailed CNNs for each dataset can be found at https://github.com/wy1iu/LargeMargin_Softmax_Loss), the datasets, the preprocessing methods and the evaluation criteria.
Training. The proposed EM-Softmax loss is appended after the feature layer, i.e., the second-last inner-product layer. We start with a learning rate of 0.1, and use a weight decay of 0.0005 and a momentum of 0.9. For MNIST and CIFAR10/CIFAR10+, the learning rate is divided by 10 at 8k and 14k iterations. For CIFAR100/CIFAR100+, it is divided by 10 at 12k and 15k iterations. For all three of these datasets, training terminates at 20k iterations. For ImageNet32, the learning rate is divided by 10 at 15k, 25k and 35k iterations, and the maximal iteration is 40k; the accuracy on the validation set is reported.
Testing. At the testing stage, we simply construct the final ensemble classifier by averaging the weak classifiers: W = (1/C) Σ_{v=1}^{C} W_v. W is the learned strong classifier, and we use it with the discriminative feature x to predict labels.
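The test-time procedure above can be sketched in a couple of lines (illustrative code with hypothetical names):

```python
import numpy as np

def ensemble_predict(x, Ws):
    """Average the C weak classifiers into the strong classifier
    W = (1/C) * sum_v W_v, then predict the label from the
    averaged scores for a single feature x of shape (d,)."""
    W = sum(Ws) / len(Ws)          # averaged strong classifier, (d, K)
    return int(np.argmax(x @ W))   # predicted class label
```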
4.4 Accuracy vs. Hyper-Parameters
The soft-margin softmax function (4) involves one parameter, the margin m. Inspired by [Bell and Bala.2015], we try a few different values of m and select the one that performs best. The diversity regularization (6) involves the trade-off parameter λ and the ensemble number C. In this part, we mainly report the sensitivity to these two variables on CIFAR10 and CIFAR100. The subfigures of Figure 1 display the test accuracy vs. λ of the EM-Softmax loss. We fix the ensemble number C and vary λ from 0 to 30 to learn different models. From the curves, we observe that, as λ grows, the accuracy improves gradually at the very beginning and then changes only slightly over a relatively large range; the results are insensitive to λ. A too large λ may hinder the performance, because it weakens the focus on the classification part of Eq. (8). Moreover, the curves also reveal the effectiveness of the diversity regularization (λ > 0 vs. λ = 0). The subfigures of Figure 2 display the test accuracy vs. the ensemble number C of the EM-Softmax loss. We fix λ and vary C from 1 to 10. From the curves, we can see that a single classifier (C = 1) is weak for classification. Our EM-Softmax (C > 1) benefits from assembling several weak classifiers, but the improvement becomes slight once the ensemble number is large enough. The reason may be two-fold. One is that assembling too many classifiers brings much redundant information, so the improvement is limited. The other is that the discriminative features already help to promote the weak classifiers, without the need to assemble many of them. Based on the above observations, we empirically suggest a small C in practice to avoid a parameter explosion of weak classifiers.
4.5 Classification Results on MNIST and CIFAR
Table 1 provides the quantitative comparison among all the competitors on MNIST and CIFAR datasets. The bold number in each column represents the best performance.
On the MNIST dataset, it is well known that this dataset is typical and easy for deep image classification, and all the competitors achieve over 99% accuracy, so the improvement of our EM-Softmax is not large. From the experimental results, we observe that A-Softmax [Liu et al.2017a], L-Softmax [Liu et al.2016], the proposed EM-Softmax and its degenerate variants M-Softmax and E-Softmax outperform the other compared methods. Moreover, we achieve a high accuracy of 99.73% on this dataset.

On the CIFAR10/CIFAR10+ dataset, our EM-Softmax significantly boosts the performance, achieving at least 2% improvement over the baseline Softmax. Considering that all the competitors achieve over 90% accuracy on this dataset, the improvement is significant. The soft distance margin of M-Softmax is slightly better than the hard angular margins of L-Softmax [Liu et al.2016] and A-Softmax [Liu et al.2017a], because the soft margin can go through all possible desired values; it is much better than Softmax because of the learned discriminative features. The ensemble E-Softmax is about 1% higher than the baseline Softmax because of the assembled strong classifier. Our EM-Softmax absorbs the complementary merits of these two aspects (i.e., discriminative features and a strong classifier).

On the CIFAR100/CIFAR100+ dataset, most competitors achieve relatively low performance. The major reason is the large variation of subjects, colors and textures, and the fine-grained categories involved in this dataset. Even so, our EM-Softmax still reaches significant improvements, at least 5% higher than the baseline Softmax. Compared with the recent L-Softmax [Liu et al.2016] and A-Softmax [Liu et al.2017a], our EM-Softmax achieves about 3% improvement. Moreover, we see a similar trend as on CIFAR10/CIFAR10+: the EM-Softmax loss is generally better than its degenerate variants M-Softmax and E-Softmax.
Table 2: Top-1 and top-5 validation accuracy (%) on ImageNet32.

Method | Top-1 | Top-5
HingeLoss [Tang2013] | 46.52 | 71.56
CenterLoss [Wen et al.2016] | 47.43 | 71.98
A-Softmax [Liu et al.2017a] | 48.12 | 72.51
N-Softmax [Wang et al.2017a] | 47.52 | 72.06
L-Softmax [Liu et al.2016] | 47.85 | 72.63
Softmax (baseline) | 46.89 | 71.94
M-Softmax (ours) | 48.21 | 72.77
E-Softmax (ours) | 48.16 | 72.99
EM-Softmax (ours) | 49.22 | 74.22
4.6 Classification Results on ImageNet
We report the top-1 and top-5 validation accuracy rates on ImageNet32 in Table 2. The results exhibit the same phenomena as on the CIFAR datasets. In particular, the proposed EM-Softmax achieves a higher top-1 accuracy by 2.4% and a higher top-5 accuracy by 2.2% in comparison with the baseline Softmax. The improvements are significant, as ImageNet is very large and difficult for image classification, especially at such a small downsampled size (32×32). Compared with the other competitors, our EM-Softmax achieves at least 1% improvement. The results in Table 2 also reveal that our EM-Softmax benefits from both the discriminative features (M-Softmax) and the strong classifier (E-Softmax).
4.7 EM-Softmax vs. Model Averaging
To validate the superiority of our weak-classifier ensemble strategy (i.e., EM-Softmax) over simple model averaging, we conduct two kinds of model-averaging experiments on both the CIFAR10 and CIFAR100 datasets. The first averages models with the same architecture but different numbers of filters (i.e., 48/48/96/192, 64/64/128/256 and 96/96/192/382, where 64/64/128/256 denotes the number of filters in conv0.x, conv1.x, conv2.x and conv3.x, respectively). For convenience, we denote them CNN(48), CNN(64) and CNN(96). The second averages models with different CNN architectures; as examples we use the AlexNet [Krizhevsky et al.2012] (much larger than LiuNet [Liu et al.2016]) and CIFAR10 Full (much smaller than LiuNet [Liu et al.2016]) architectures, both provided in the standard Caffe [Jia et al.2014] library (https://github.com/BVLC/caffe). For comparison, all the architectures in these two kinds of model-averaging strategies are equipped with the original softmax loss. Table 3 provides the experimental results of model averaging on CIFAR10 and CIFAR100, from which we can see that model averaging is beneficial for boosting classification performance. However, the training is time-consuming and the model size is large. Comparing our weak-classifier ensemble (EM-Softmax) with these two kinds of model averaging, the accuracy of our EM-Softmax is generally higher and our model size is much smaller.
Table 3: Comparison with model averaging on CIFAR10 and CIFAR100 (model size and accuracy %).

Method | CIFAR10 Size | CIFAR10 Acc. | CIFAR100 Size | CIFAR100 Acc.
LiuNet | 15.1 | 90.95 | 15.7 | 67.26
Full | 0.35 | 83.09 | 0.71 | 63.32
AlexNet | 113.9 | 88.76 | 115.3 | 64.56
LiuNet+Full+AlexNet | 129.35 | 90.97 | 131.71 | 68.05
LiuNet(48) | 9.3 | 90.63 | 9.4 | 66.08
LiuNet(64) | 15.1 | 90.95 | 15.3 | 66.70
LiuNet(96) | 30.9 | 91.16 | 31.0 | 67.26
LiuNet(48+64+96) | 55.4 | 91.99 | 55.7 | 70.01
EM-Softmax | 15.2 | 93.31 | 31.1 | 72.74
4.8 EM-Softmax vs. Dropout
Dropout is a popular way to train an ensemble and has been widely adopted in many works. The idea behind it is to train an ensemble of DNNs by randomly dropping activations (thus it cannot be applied to the softmax classifier) and averaging the results of the whole ensemble. The adopted architecture LiuNet [Liu et al.2016] contains the second-last FC layer and is without dropout. To validate the gain of our weak-classifier ensemble, we add the dropout technique to the second-last fully connected layer and conduct experiments with Softmax, Softmax+Dropout and EM-Softmax+Dropout. The dropout ratio is tuned, and the diversity hyper-parameter λ of our EM-Softmax is set to 0.1. Table 4 gives the experimental results with dropout on the CIFAR10 and CIFAR100 datasets. From the numbers, we can see that the accuracy of our EM-Softmax is much higher than that of Softmax+Dropout, which shows the superiority of our weak-classifier ensemble over the dropout strategy. Moreover, we empirically find that the improvement from dropout on both the Softmax and EM-Softmax losses is small with the adopted CNN architecture. To sum up, our weak-classifier ensemble is superior to the simple dropout strategy and can be seamlessly combined with it.
Table 4: Comparison with dropout on CIFAR10 and CIFAR100 (accuracy %).

Method | CIFAR10 | CIFAR100
Softmax | 90.95 | 67.26
Softmax+Dropout | 91.06 | 68.01
EM-Softmax | 93.31 | 72.74
EM-Softmax+Dropout | 93.49 | 72.85
4.9 Running Time
We report the time cost vs. accuracy of EM-Softmax, Softmax and the two kinds of model averaging on CIFAR10. From Table 5, the training time on two Titan X GPUs is about 1.01h, 0.99h, 4.82h and 3.02h, respectively, and the test time on CPU (Intel Xeon E5-2660 v0 @ 2.20GHz) is about 3.1m, 2.5m, 8.1m and 10m, respectively, while the corresponding accuracy is 93.31%, 90.90%, 90.97% and 91.99%. Considering time cost, model size and accuracy together, our weak-classifier ensemble EM-Softmax is a good candidate.
Table 5: Time cost vs. accuracy on CIFAR10.

Method | Training (GPU) | Test (CPU) | Accuracy (%)
Softmax | 0.99h | 2.5m | 90.90
Softmax+Dropout | 0.99h | 2.5m | 90.96
LiuNet+Full+AlexNet | 4.82h | 8.1m | 90.97
LiuNet(48+64+96) | 3.02h | 10m | 91.99
EM-Softmax | 1.01h | 3.1m | 93.31
5 Conclusion
This paper has proposed a novel ensemble soft-margin softmax loss (i.e., EM-Softmax) for deep image classification. The proposed EM-Softmax loss benefits from two aspects: the designed soft-margin softmax function makes the learned CNN features discriminative, and the ensemble of weak classifiers yields a strong classifier. Both boost the performance. Extensive experiments on several benchmark datasets have demonstrated the advantages of our EM-Softmax loss over the baseline softmax loss and state-of-the-art alternatives. The experiments have also shown that the proposed weak-classifier ensemble is generally better than model-ensemble strategies such as model averaging and dropout.
Acknowledgments
This work was supported by the National Key Research and Development Plan (Grant No. 2016YFC0801002), the Chinese National Natural Science Foundation Projects, the Science and Technology Development Fund of Macau (No. 151/2017/A, 152/2017/A), and AuthenMetric R&D Funds.
References

[Bell and Bala.2015] S. Bell and K. Bala. Learning visual similarity for product design with convolutional neural networks. In TOG, 2015.
[Chrabaszcz et al.2017] P. Chrabaszcz, L. Ilya, and H. Frank. A downsampled variant of ImageNet as an alternative to the CIFAR datasets. arXiv preprint arXiv:1707.08819, 2017.
[Fredricks and Nelsen.2007] G. Fredricks and R. Nelsen. On the relationship between Spearman's rho and Kendall's tau for pairs of continuous random variables. Journal of Statistical Planning and Inference, 2007.
[Goodfellow et al.2013] I. Goodfellow, D. Warde-Farley, and M. Mirza. Maxout networks. In ICML, 2013.
[Gretton et al.2005] A. Gretton, O. Bousquet, and A. Smola. Measuring statistical dependence with Hilbert-Schmidt norms. In ALT, 2005.
[Guo et al.2017] X. Guo, X. Wang, and H. Ling. Exclusivity regularized machine. In IJCAI, 2017.
[He et al.2016] K. He, X. Zhang, and S. Ren. Deep residual learning for image recognition. In CVPR, 2016.
[Hinton et al.2012] G. Hinton, N. Srivastava, and A. Krizhevsky. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[Jia et al.2014] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[Krizhevsky and Hinton2009] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
[Krizhevsky et al.2012] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[LeCun et al.1998a] Y. LeCun, L. Bottou, and Y. Bengio. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
[LeCun et al.1998b] Y. LeCun, C. Cortes, and C. Burges. The MNIST database of handwritten digits. 1998.
[Lee et al.2015] Y. Lee, S. Xie, and P. Gallagher. Deeply-supervised nets. In AISTATS, 2015.
[Li et al.2012] N. Li, Y. Yu, and Z. Zhou. Diversity regularized ensemble pruning. In ECML, 2012.
[Li et al.2016] Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In NIPS, 2016.
[Liang et al.2017] X. Liang, X. Wang, Z. Lei, S. Liao, and S. Li. Soft-margin softmax for deep classification. In ICONIP, 2017.
[Lin et al.2013] M. Lin, Q. Chen, and Y. Yan. Network in network. In ICLR, 2013.
[Liu et al.2016] W. Liu, Y. Wen, and Z. Yu. Large-margin softmax loss for convolutional neural networks. In ICML, 2016.
[Liu et al.2017a] W. Liu, Y. Wen, Z. Yu, M. Li, and L. Song. SphereFace: Deep hypersphere embedding for face recognition. In CVPR, 2017.
[Liu et al.2017b] Y. Liu, H. Li, and X. Wang. Learning deep features via congenerous cosine loss for person recognition. arXiv preprint arXiv:1702.06890, 2017.
[Martins and Astudillo.2016] A. Martins and R. Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label classification. In ICML, 2016.
[Schroff et al.2015] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, 2015.
[Simonyan and Andrew2014] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[Sun et al.2014] Y. Sun, Y. Chen, and X. Wang. Deep learning face representation by joint identification-verification. In NIPS, 2014.
[Tang2013] Y. Tang. Deep learning using linear support vector machines. arXiv preprint arXiv:1306.0239, 2013.
[Wang et al.2015] X. Wang, X. Guo, and S. Li. Adaptively unified semi-supervised dictionary learning with active points. In CVPR, 2015.
[Wang et al.2017a] F. Wang, X. Xiang, J. Chen, and A. Yuille. NormFace: Hypersphere embedding for face verification. In ACM MM, 2017.
[Wang et al.2017b] X. Wang, X. Guo, Z. Lei, C. Zhang, and S. Li. Exclusivity-consistency regularized multi-view subspace clustering. In CVPR, 2017.
[Wen et al.2016] Y. Wen, K. Zhang, and Z. Li. A discriminative feature learning approach for deep face recognition. In ECCV, 2016.
[Zhang et al.2016] X. Zhang, Z. Fang, and Y. Wen. Range loss for deep face recognition with long-tail. arXiv preprint arXiv:1611.08976, 2016.
[Zhou et al.2016] B. Zhou, A. Khosla, and A. Lapedriza. Learning deep features for discriminative localization. In CVPR, 2016.