I Introduction
Machine learning is an important and fundamental component in visual understanding tasks. The core idea of supervised learning is to learn a model that explores the causal relationship between the dependent variables and the predictor variables. To quantify this relationship, the conventional approach is to make a hypothesis on the model, and feed the observed pairs of dependent variables and predictor parameters to the model for predicting future cases. For most learning problems, it is infeasible to make a perfect hypothesis that matches the underlying pattern, whereas a badly designed hypothesis often leads to a model that is more complicated than necessary and violates the principle of parsimony. Therefore, when designing or evaluating a model, the core objective is to seek a balance between two conflicting goals: how complicated a model should be to achieve accurate predictions, and how to design a model as simple as possible, but not simpler.
In the past decade, deep learning methods have significantly accelerated the development of machine learning research, and Convolutional Networks (ConvNets) have achieved superior performance in numerous real-world visual understanding tasks [12, 22, 40, 11, 42, 16, 69, 57, 1, 17, 62, 46]. Although their architectures differ, the softmax function is widely used together with the cross-entropy loss in the training phase [14, 22, 24, 48, 51]. The softmax function may not take the distribution pattern of previously observed samples into account to boost classification accuracy. In this work, we design a statistically driven extension of the softmax function that fits into the Stochastic Gradient Descent (SGD) scheme for end-to-end learning. Furthermore, because the softmax function sits at the final layer and connects directly to the predictions, the extension preserves generality across ConvNets, i.e., it avoids complex modification of existing network architectures.
Features are key for prediction in ConvNet learning. According to the central limit theorem [20], the arithmetic mean of a sufficiently large number of i.i.d. random variables, each with a finite expected value and variance, is approximately normally distributed even if the original variables are not. This makes the Gaussian distribution generally valid in a great variety of contexts. Following this line of thought, online learning methods [7, 8, 54] assume that the weights follow a Gaussian distribution and make use of its distribution pattern for classification. Given large-scale training data [45], the underlying distributions of the discriminative features generated by ConvNets can be modeled. This distribution pattern has not been fully explored in the existing literature.
Intra-class compactness and inter-class separability of features are generally correlated with the quality of the learned features. If both are maximized simultaneously, the learned features are more discriminative [36]. We introduce a variant of the softmax function, named the Gaussian-based softmax (G-softmax) function, which aims to improve intra-class compactness and inter-class separability, as shown in Figure 1. Under the typical assumption that features follow Gaussian distributions, the Gaussian cumulative distribution function (CDF) is used in prediction and normalization to generate the final confidence in a soft form.
Figure 2 illustrates the role and position of the proposed G-softmax function in a supervised learning framework. Given the training samples, the feature extractor extracts the features and passes them to the predictor for inference. In this work, we follow the mainstream deep learning framework in which the feature extractor is modeled with a ConvNet. The proposed G-softmax function can replace the softmax function. The contributions can be summarized as follows:

Under the general assumption that the features with respect to a class follow a Gaussian distribution, we propose the G-softmax function, which models the distributions of features for better prediction. Experiments on CIFAR-10, CIFAR-100 [21], and Tiny ImageNet (https://tinyimagenet.herokuapp.com/) show that the proposed G-softmax function consistently outperforms the softmax and L-softmax functions on various state-of-the-art models. We also apply the proposed G-softmax function to the multi-label classification problem, where it yields better performance than the softmax function on MS COCO [31] and NUS-WIDE [3]. The source code is available at https://gitlab.com/luoyan/gsoftmax and is easy to use.

The proposed G-softmax function can quantify compactness and separability. Specifically, for each learned Gaussian distribution, the corresponding mean and variance indicate the center and compactness of the predictor.

In our analysis of the correlation between intra-class compactness (or inter-class separability) and average precision, we observe that intra-class compactness and inter-class separability are linearly correlated with average precision on MS COCO and NUS-WIDE. This implies that improving intra-class compactness and inter-class separability would lead to improved average precision.
II Related Works
Gaussian-based Online Learning. We first review the Gaussian-based online learning methods. In the online learning context, the training data are provided in sequential order to learn a predictor for unobserved data. These methods usually make some assumptions in order to minimize the cumulative disparity between the ground truth and the predictions over the entire sequence of instances [6, 7, 8, 44, 54]. In this sense, these works offer guidance and inspiration for designing a flexible mapping function.
Building on the Passive-Aggressive model [6], Dredze et al. [8] made an explicit assumption on the weights $w$: $w \sim \mathcal{N}(\mu, \Sigma)$, where $\mu$ is the mean of the weights and $\Sigma$ is a covariance matrix for the underlying Gaussian distribution. Given an input instance $x$ with the corresponding label $y$, the multivariate Gaussian distribution over weight vectors induces a univariate Gaussian distribution over the margin: $M \sim \mathcal{N}(y(\mu \cdot x), x^{\top} \Sigma x)$, where $\cdot$ is the inner product operation. Hence, the probability of a correct prediction is $\Pr[M \geq 0]$. The objective is to minimize the Kullback-Leibler divergence between the current distribution and the ideal distribution under the constraint that the probability of a correct prediction is not smaller than the confidence hyperparameter $\eta$, i.e., $\Pr[M \geq 0] \geq \eta$. With the mean of the margin $\mu_M = y(\mu \cdot x)$ and the variance $\sigma_M^2 = x^{\top} \Sigma x$, the constraint leads to $\mu_M \geq \Phi^{-1}(\eta)\,\sigma_M$, where $\Phi$ is the cumulative function of the Gaussian distribution. This inequality is used as a constraint in optimization in practice. However, it is not convex with respect to $\Sigma$, and Dredze et al. [8] linearized it by omitting the square root: $\mu_M \geq \Phi^{-1}(\eta)\,\sigma_M^2$. To solve this non-convex problem, Crammer et al. [7] discovered that a change of variable helps to maintain convexity, i.e., when $\Sigma = \Upsilon^2$, the constraint becomes $y(\mu \cdot x) \geq \Phi^{-1}(\eta)\,\|\Upsilon x\|$. The confidence-weighted method [7] employs an aggressive updating strategy, changing the distribution to satisfy the constraint imposed by the current instance, which may incorrectly update the parameters of the distribution when handling a mislabeled instance. Therefore, Wang et al. [54] introduced a trade-off parameter to balance passiveness and aggressiveness.
The aforementioned online learning methods [7, 8, 54] hypothesize that the weights follow a multivariate Gaussian distribution and predefine a confidence hyperparameter to formalize a constraint for optimization. However, the weights are learned from the training data, so placing a hypothesis on the weights could be akin to putting the cart before the horse. Moreover, such a confidence hyperparameter may not be flexible or adaptive across datasets. In this work, we instead hypothesize that the features follow Gaussian distributions, and there is no confidence hyperparameter. To update the weights, [7, 8, 54] apply the Lagrangian method to compute the optimal weights. This mechanism does not straightforwardly fit into the SGD scheme. Along the same line, this work investigates how to incorporate the Gaussian assumption into SGD.
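The confidence constraint reviewed above can be checked numerically. The following is a minimal sketch, assuming a Gaussian over a two-dimensional weight vector with illustrative (hypothetical) values for the mean, the covariance, and the instance; it is not code from [7, 8, 54]. It computes the induced margin distribution and verifies that testing $\Pr[M \geq 0] \geq \eta$ is equivalent to testing that the margin mean is at least $\Phi^{-1}(\eta)$ times the margin standard deviation.

```python
import math
from statistics import NormalDist


def margin_distribution(mu, cov, x, y):
    """Mean and standard deviation of the margin M = y * (w . x) for w ~ N(mu, cov)."""
    mean = y * sum(m * xi for m, xi in zip(mu, x))
    var = sum(x[a] * cov[a][b] * x[b] for a in range(len(x)) for b in range(len(x)))
    return mean, math.sqrt(var)


def prob_correct(mu, cov, x, y):
    """Pr[M >= 0], the probability that a sampled weight vector classifies x correctly."""
    mean, std = margin_distribution(mu, cov, x, y)
    return 1.0 - NormalDist(mean, std).cdf(0.0)


def satisfies_constraint(mu, cov, x, y, eta):
    """Equivalent form of Pr[M >= 0] >= eta: mean >= Phi^{-1}(eta) * std."""
    mean, std = margin_distribution(mu, cov, x, y)
    return mean >= NormalDist().inv_cdf(eta) * std


# Illustrative values only (assumptions, not taken from the paper).
mu = [1.0, -0.5]
cov = [[0.5, 0.1], [0.1, 0.3]]
x, y = [2.0, 1.0], 1

p = prob_correct(mu, cov, x, y)
print(p, satisfies_constraint(mu, cov, x, y, 0.8), satisfies_constraint(mu, cov, x, y, 0.9))
```

With these values the margin has mean 1.5 and variance 2.7, so the constraint holds for $\eta = 0.8$ but not for $\eta = 0.9$.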
Softmax Function in ConvNet Learning. The success of ConvNets is largely attributed to the layer-stacking mechanism. Despite its effectiveness in complex real-world visual classification, this mechanism can result in co-adaptation and overfitting. To prevent the co-adaptation problem, Hinton et al. [15] proposed a method that randomly omits a portion of the neurons in a feedforward network. Srivastava et al. [50] then introduced the dropout unit to minimize overfitting and presented a comprehensive investigation of its effect in ConvNets. Similar regularization methods are also proposed in [13] and [52]. Instead of modifying the connections between layers, [64] replaced deterministic pooling with stochastic pooling to regularize ConvNets. The proposed G-softmax function can be used together with these models to offer better generalization ability. We posit a general assumption and establish Gaussian distributions over the feature space at the final layer, i.e., the softmax module. In other words, the proposed G-softmax function is general for most ConvNets and does not require much modification of the network structure.
ConvNets [19, 63, 14, 24, 22, 48, 51, 61] have strong representational ability in learning invariant features. Although their architectures differ, the softmax function is widely used together with the cross-entropy loss in the training phase. Hence, the softmax module is important and general for ConvNets. Liu et al. [36] introduced a large-margin softmax (L-softmax) function to enhance compactness and separability from a geometric perspective. In essence, the large-margin softmax function is similar to the softmax function: both use the exponential function, but with different inputs. In contrast, we model the mappings between features and ground-truth labels with the Gaussian CDF. As with the softmax function, we use normalization to identify the maximal element but not its exact value.
Multi-label Classification. Multi-label classification is a special case of multi-output learning tasks. Read et al. [43] proposed the classifier chain model to capture label correlations. Label order is particularly important for chain classification models, and a dynamic-programming-based classifier chain algorithm [32] was proposed to find the globally optimal label order. Shen et al. [47] introduced the Co-Embedding and Co-Hashing methods, which explore label correlations from the perspective of cross-view learning to improve prediction accuracy and efficiency. On the other hand, the classifier chain model does not take the order of difficulty of the labels into account. Therefore, the easy-to-hard learning paradigm [35] was proposed to make good use of the predictions from simple labels to improve the predictions from hard labels. Liu et al. [34] presented a comprehensive theoretical analysis of the curse of dimensionality of decision tree models and introduced a sparse coding tree framework for multi-label annotation problems. In multi-label prediction, a large margin metric learning paradigm [33] was introduced to reduce the complexity of the decoding procedure in canonical correlation analysis and maximum margin output coding methods. Liu et al. [37] introduced a large margin metric learning method to efficiently learn an appropriate distance metric for multi-output problems with theoretical guarantees.
Recently, there have been attempts to apply deep networks, especially ConvNets and Recurrent Neural Networks (RNNs), to multi-label classification because of their promising performance in various vision tasks. In [53], a ConvNet and an RNN are used together to explicitly exploit label dependencies. In contrast to [53], [66] proposed a regional latent semantic dependencies model that predicts small-size objects and visual concepts by exploiting label dependencies at the regional level. Similarly, [10] automatically selected relevant image regions from global image labels using weakly supervised learning. Zhao et al. [68] reduced irrelevant and noisy regions with the help of a region gating module. These region-proposal-based methods usually suffer from redundant computation and suboptimal performance. Wang et al. [55] addressed these problems by developing a recurrent memorized-attention module, which locates attentional regions from the ConvNet's feature maps. Instead of utilizing label dependencies, [28] proposed a novel loss function for pairwise ranking that is smooth everywhere and therefore easy to optimize within ConvNets. In addition, two works focus on improving network architectures for multi-label classification [70, 9]. In this work, we adopt a common baseline, i.e., ResNet-101 [14], which is widely used in state-of-the-art models [70, 9].
III Methodology
III-A G-Softmax Function
The logistic function (i.e., the sigmoid function) and the hyperbolic tangent function are widely used in deep learning, and their graphs are "S-shaped" curves. These curves imply a graceful balance between linearity and non-linearity [39]. The Gaussian CDF has the same monotonicity as the logistic and hyperbolic tangent functions and a similar shape. This makes the Gaussian CDF a potential substitute, with the capability to model the distribution pattern through class-dependent $\mu$ and $\sigma$. Fundamentally, the softmax function in mainstream deep learning models is the normalized exponential function, which is a generalization of the logistic function. In this work, the proposed G-softmax function uses the Gaussian CDF to substitute the exponential function.
Similar to the softmax loss, we use cross entropy as the loss function, i.e.,
$$L = -\sum_{i=1}^{m} y_i \log p_i \tag{1}$$
where $L$ is the loss, $y_i$ is the label with respect to the $i$-th category, $p_i$ is the prediction confidence with respect to the $i$-th category, and $m$ is the number of categories. Conventionally, given features $x_i$ with respect to the various labels, $p_i$ is given by the softmax function
$$p_i = \frac{\exp(x_i)}{\sum_{j=1}^{m} \exp(x_j)} \tag{2}$$
The softmax function can be considered to represent a categorical distribution. By normalizing the exponential function, the largest value is highlighted and the other values are suppressed significantly. As discussed in Section II, [7, 8, 54] hypothesized that the classification margin follows a Gaussian distribution. Slightly differently, we assume that the deep feature $x_i$ with respect to the $i$-th category follows a Gaussian distribution, i.e., $x_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$. In this work, we define the proposed G-softmax function as
$$p_i = \frac{\exp\left(x_i + \alpha \Phi(x_i; \mu_i, \sigma_i)\right)}{\sum_{j=1}^{m} \exp\left(x_j + \alpha \Phi(x_j; \mu_j, \sigma_j)\right)} \tag{3}$$
where $\alpha$ is a parameter controlling the width of the CDF along the y-axis. We can see that if $\alpha = 0$, Equation (3) becomes the conventional softmax function. $\Phi$ is the CDF of a Gaussian distribution, that is
$$\Phi(x_i; \mu_i, \sigma_i) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x_i - \mu_i}{\sigma_i \sqrt{2}}\right)\right] \tag{4}$$
where $\mu_i$ and $\sigma_i$ are the mean and standard deviation, respectively. For simplicity, we denote $\Phi(x_i; \mu_i, \sigma_i)$ as $\Phi_i$ in the following paragraphs.
Compared with the softmax function (2), the proposed G-softmax function takes the feature distribution into account, i.e., the distribution term $\Phi$ in (3). This formulation has two advantages. First, it can approximate a large variety of distributions with respect to every class on the training samples, whereas the softmax function only learns from the currently observed sample. Second, with the distribution parameters $\mu$ and $\sigma$, it is straightforward to quantify intra-class compactness and inter-class separability. In other words, the proposed G-softmax function is more analytical than the softmax function.
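To make the forward computation concrete, here is a minimal pure-Python sketch. It assumes the G-softmax takes the additive form $\exp(x_i + \alpha \Phi_i)$ before normalization, which is the form consistent with the statement that $\alpha = 0$ recovers the conventional softmax; the function names and the example values are illustrative, not taken from the authors' released code.

```python
import math


def gaussian_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) evaluated at x, written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def g_softmax(x, mu, sigma, alpha):
    """Normalize exp(x_i + alpha * Phi(x_i; mu_i, sigma_i)) over all classes (assumed form)."""
    z = [xi + alpha * gaussian_cdf(xi, m, s) for xi, m, s in zip(x, mu, sigma)]
    z_max = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(zi - z_max) for zi in z]
    total = sum(e)
    return [ei / total for ei in e]


def cross_entropy(p, y):
    """Cross entropy between confidences p and one-hot labels y."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p) if yi > 0)


# Illustrative features for three classes; mu and sigma start at the standard Gaussian.
x = [2.0, 0.5, -1.0]
mu = [0.0, 0.0, 0.0]
sigma = [1.0, 1.0, 1.0]
y = [1, 0, 0]

p_soft = g_softmax(x, mu, sigma, alpha=0.0)  # reduces to the conventional softmax
p_g = g_softmax(x, mu, sigma, alpha=1.0)
print(p_soft, p_g, cross_entropy(p_g, y))
```

Because the CDF is monotonically increasing, a positive $\alpha$ widens the gaps between the shifted features when all classes share the same $\mu$ and $\sigma$, so the largest confidence grows relative to the plain softmax in this example.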
The proposed G-softmax function can work with any ConvNet, such as VGG [48] and ResNet [14]. In this work, we make , and is not an arbitrary layer but the fully-connected layer. When , is prone to shift towards the positive axis direction because . The curve of $\Phi$ has a similar shape to that of the logistic function and the hyperbolic tangent function, and can accurately capture the distribution of . As discussed in Section II, the online learning methods [7, 8, 54] considered the weights to follow a Gaussian distribution and used the Kullback-Leibler divergence (KLD) between the estimated distribution and the optimal distribution. Since their formulations involve the unknown optimal Gaussian distribution, they had to apply the Lagrangian to optimize and approximate and . This may not fit backpropagation in modern ConvNets, which commonly use SGD as a solver.
Usually, equals 1 due to the normalization. Similarly, we can obtain the partial derivatives with respect to ,
(6) 
According to the CDF, i.e., Equation (3), the derivatives with respect to and are
(7)  
(8) 
(9)  
(10) 
In the backpropagation of ConvNets, the chain rule requires the derivatives of upper layers to compute the weight derivatives of lower layers. Therefore, is needed to pass backwards the lower layers. Because has the same form as in Equation (5) and we know
(11) 
Then, is obtained
(12) 
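As a cross-check on the backward pass, the gradients can be derived directly under the assumption that the G-softmax has the additive form $p_i \propto \exp(x_i + \alpha \Phi_i)$ and that the labels sum to 1. The sketch below is a derivation under those stated assumptions and is not a transcription of Equations (5) to (12). With $\phi_i$ the Gaussian density, the chain rule gives $\partial L / \partial x_i = (p_i - y_i)(1 + \alpha \phi_i)$, $\partial L / \partial \mu_i = -(p_i - y_i)\,\alpha \phi_i$, and $\partial L / \partial \sigma_i = -(p_i - y_i)\,\alpha \phi_i (x_i - \mu_i) / \sigma_i$.

```python
import math


def cdf(x, mu, sigma):
    """Gaussian CDF Phi(x; mu, sigma)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def pdf(x, mu, sigma):
    """Gaussian density, i.e., the derivative of the CDF with respect to x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def shifted(x, mu, sigma, alpha):
    """z_i = x_i + alpha * Phi_i, the assumed input to the exponential."""
    return [xi + alpha * cdf(xi, m, s) for xi, m, s in zip(x, mu, sigma)]


def loss(x, mu, sigma, alpha, y):
    """Cross entropy of the assumed G-softmax, computed with a stable log-sum-exp."""
    z = shifted(x, mu, sigma, alpha)
    z_max = max(z)
    log_norm = z_max + math.log(sum(math.exp(zi - z_max) for zi in z))
    return -sum(yi * (zi - log_norm) for yi, zi in zip(y, z))


def grads(x, mu, sigma, alpha, y):
    """Analytic gradients of the loss w.r.t. x, mu and sigma (assumes sum(y) == 1)."""
    z = shifted(x, mu, sigma, alpha)
    z_max = max(z)
    e = [math.exp(zi - z_max) for zi in z]
    p = [ei / sum(e) for ei in e]
    d_x, d_mu, d_sigma = [], [], []
    for xi, m, s, pi, yi in zip(x, mu, sigma, p, y):
        delta = pi - yi              # dL/dz_i
        phi = pdf(xi, m, s)          # dPhi_i/dx_i
        d_x.append(delta * (1.0 + alpha * phi))
        d_mu.append(-delta * alpha * phi)
        d_sigma.append(-delta * alpha * phi * (xi - m) / s)
    return d_x, d_mu, d_sigma


# Illustrative values (assumptions for the sketch).
x, mu, sigma = [2.0, 0.5, -1.0], [0.3, -0.2, 0.1], [1.0, 0.8, 1.5]
y, alpha = [0, 1, 0], 1.5
d_x, d_mu, d_sigma = grads(x, mu, sigma, alpha, y)
print(d_x, d_mu, d_sigma)
```

Comparing these values against central finite differences of `loss` is a quick way to validate any implementation of the backward pass before plugging it into SGD.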
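III-B G-Softmax in Multi-label Classification
The preceding sections are based on single-label classification problems. Here, we apply the proposed G-softmax function to the multi-label classification problem. In single-label classification problems, the softmax loss and its G-softmax variant are defined as
$$L_{\text{softmax}} = -\sum_{i=1}^{m} y_i \log \frac{\exp(x_i)}{\sum_{j=1}^{m} \exp(x_j)}, \qquad L_{\text{G-softmax}} = -\sum_{i=1}^{m} y_i \log \frac{\exp(x_i + \alpha \Phi_i)}{\sum_{j=1}^{m} \exp(x_j + \alpha \Phi_j)} \tag{13}$$
with $\Phi_i = \Phi(x_i; \mu_i, \sigma_i)$ the Gaussian CDF of Equation (4).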
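For multi-label classification, the multi-label soft margin loss (MSML) is widely used [9, 70], as defined by Equation (14).
$$L = -\sum_{i=1}^{m} \left[ y_i \log \frac{1}{1 + \exp(-x_i)} + (1 - y_i) \log \frac{\exp(-x_i)}{1 + \exp(-x_i)} \right] \tag{14}$$
In contrast with MSML, there is a variant that takes $x_i^{+}$ and $x_i^{-}$ as inputs, instead of taking only $x_i$ as input as in MSML. $x_i^{+}$ is the positive feature, which is used to compute the probability that the input image is classified into the $i$-th category, while $x_i^{-}$ is the negative feature, which is used to compute the probability that the input image is not classified into the $i$-th category. The variant is used in multi-label classification problems [29]. It is defined by Equation (15).
$$L = -\sum_{i=1}^{m} \left[ y_i \log \frac{\exp(x_i^{+})}{\exp(x_i^{+}) + \exp(x_i^{-})} + (1 - y_i) \log \frac{\exp(x_i^{-})}{\exp(x_i^{+}) + \exp(x_i^{-})} \right] \tag{15}$$
The terms $\frac{1}{1 + \exp(-x_i)}$ and $\frac{\exp(-x_i)}{1 + \exp(-x_i)}$ in MSML (14) are both determined by $x_i$. To make the learning process consistent with the loss function used in single-label classification, we use the variant, i.e., Equation (15), for multi-label classification in this work and denote it as the softmax loss function for consistency. Correspondingly, the G-softmax loss function is defined as
$$L = -\sum_{i=1}^{m} \left[ y_i \log \frac{\exp(x_i^{+} + \alpha \Phi_i^{+})}{\exp(x_i^{+} + \alpha \Phi_i^{+}) + \exp(x_i^{-} + \alpha \Phi_i^{-})} + (1 - y_i) \log \frac{\exp(x_i^{-} + \alpha \Phi_i^{-})}{\exp(x_i^{+} + \alpha \Phi_i^{+}) + \exp(x_i^{-} + \alpha \Phi_i^{-})} \right] \tag{16}$$
where $\Phi_i^{\pm} = \Phi(x_i^{\pm}; \mu_i^{\pm}, \sigma_i^{\pm})$. In this way, we can model the distributions of $x_i^{+}$ and $x_i^{-}$ by $\mathcal{N}(\mu_i^{+}, (\sigma_i^{+})^2)$ and $\mathcal{N}(\mu_i^{-}, (\sigma_i^{-})^2)$, respectively.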
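The two-input loss described above can be sketched as follows. This is an illustrative implementation under explicit assumptions: each label has a positive and a negative feature, each feature is shifted by $\alpha$ times its own Gaussian CDF, the two shifted features are normalized with a two-way softmax, and the per-label terms are summed rather than averaged. The function and parameter names are hypothetical.

```python
import math


def gaussian_cdf(x, mu, sigma):
    """Gaussian CDF Phi(x; mu, sigma)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def log_two_way_softmax(z_pos, z_neg):
    """Return (log p_pos, log p_neg) of a two-way softmax, computed stably."""
    top = max(z_pos, z_neg)
    log_norm = top + math.log(math.exp(z_pos - top) + math.exp(z_neg - top))
    return z_pos - log_norm, z_neg - log_norm


def pairwise_g_softmax_loss(x_pos, x_neg, y, params_pos, params_neg, alpha):
    """Multi-label loss over (positive, negative) feature pairs.

    params_pos and params_neg hold one (mu, sigma) pair per label (assumed layout).
    """
    total = 0.0
    for xp, xn, yi, (mp, sp), (mn, sn) in zip(x_pos, x_neg, y, params_pos, params_neg):
        zp = xp + alpha * gaussian_cdf(xp, mp, sp)
        zn = xn + alpha * gaussian_cdf(xn, mn, sn)
        log_p, log_q = log_two_way_softmax(zp, zn)
        total -= yi * log_p + (1 - yi) * log_q
    return total


# Illustrative inputs for three labels.
x_pos = [1.2, -0.7, 0.3]
x_neg = [0.0, 0.0, 0.0]
y = [1, 0, 1]
standard = [(0.0, 1.0)] * 3

loss_alpha0 = pairwise_g_softmax_loss(x_pos, x_neg, y, standard, standard, alpha=0.0)
loss_alpha1 = pairwise_g_softmax_loss(x_pos, x_neg, y, standard, standard, alpha=1.0)
print(loss_alpha0, loss_alpha1)
```

With $\alpha = 0$ and the negative feature fixed at zero, the two-way softmax collapses to the sigmoid, so the loss coincides with the (summed) multi-label soft margin loss; this is a convenient sanity check.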
We can see that both the softmax function and the proposed G-softmax function are straightforward to extend to multi-label classification. In contrast, the L-softmax function may not be easy to adapt to multi-label classification. This is because the L-softmax function needs to be aware of the feature related to the ground-truth label so that it can impose a margin constraint on that feature, i.e., , where is an integer representing the margin, indicates the th label is the ground truth label of , is the th column of , and is the angle between and . When , the exponential term is the same as in the softmax function. However, when , is used to guarantee the margin between and . As a consequence, it is hard to use in MSML because the L-softmax function would treat the terms in Equation (14) differently.
III-C Malleable Learning Rates
Training a model usually requires a series of predefined learning rates. The learning rate is a real value and a function of the current epoch, with given starting and final values. There are several popular types of learning rate schedules, e.g., linspace, logspace, and staircase. Usually, the number of epochs used with these schedules is no more than 300. Although Huang et al. [18] use many more epochs with annealing learning rates, their learning rate is designed as a function of the iteration number instead of the epoch number. Therefore, it may not generalize to distributed or parallel processing, because the iterations are not processed sequentially. We would like to test the proposed G-softmax function under an extreme condition, i.e., more epochs, to investigate its stability. In the following, we first describe the three learning rate schedules and then show how they relate to the proposed malleable learning rate. The proposed malleable learning rates can control the curvature of the scheduled learning rates to boost the convergence of the learning process.
The linspace learning rates are generated with a simple linear function, where learning rate at epoch, , is denoted as . Here, is the maximum epoch number, while and are the starting and final value of the learning sequences, respectively. is the initial learning rate. Because of linearity, the change in the learning rate is constant across all epochs. As the learning rates become smaller when the epoch number increases, it is expected that the training process can converge stably. Logspace learning rates meet this requirement by a log function .
The logspace learning rate has a gradual descent trace that rapidly becomes stable. On the other hand, the staircase learning rate remains constant for a large number of epochs. As the learning rate is not frequently adjusted, the model learning process may not converge. These problems undermine the sustainable convergence ability of deep learning models. Therefore, we integrate the advantages of these learning rates and propose a malleable learning rate, that is,
(17) 
where is the end epoch of the th piece of learning rates and . As shown in Equation (17), the proposed learning rate is able to separate piecewise learning rates (i.e., staircase learning rates), yet is also able to control the shape of each piece (e.g., its curvature or degree of bend) by configuring and .
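For reference, the three conventional schedules discussed in this subsection can be sketched as below. These follow the common definitions (linear interpolation, interpolation in log10 space, and piecewise-constant values); they are assumptions for illustration and do not reproduce the malleable schedule of Equation (17). The learning rate values in the example are placeholders.

```python
import math


def linspace_lr(start, end, num_epochs):
    """Learning rates that change by a constant amount each epoch."""
    step = (end - start) / (num_epochs - 1)
    return [start + step * t for t in range(num_epochs)]


def logspace_lr(start, end, num_epochs):
    """Learning rates that change by a constant ratio each epoch."""
    log_start, log_end = math.log10(start), math.log10(end)
    step = (log_end - log_start) / (num_epochs - 1)
    return [10.0 ** (log_start + step * t) for t in range(num_epochs)]


def staircase_lr(epoch, boundaries, values):
    """Piecewise-constant rate: values[k] is used while epoch < boundaries[k]."""
    for boundary, value in zip(boundaries, values):
        if epoch < boundary:
            return value
    return values[-1]


# Placeholder values chosen only to illustrate the shapes of the schedules.
lin = linspace_lr(0.1, 0.001, 5)
log = logspace_lr(0.1, 0.001, 5)
stairs = [staircase_lr(e, [150, 225], [0.1, 0.01, 0.001]) for e in (1, 149, 150, 224, 225, 300)]
print(lin, log, stairs)
```

The linspace schedule spends most epochs at relatively large rates, the logspace schedule drops quickly and then flattens, and the staircase schedule stays constant between boundaries, which matches the qualitative behavior described above.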
For the experiments using models pretrained on the ImageNet dataset [45], the initialization contains well-learned knowledge for Tiny ImageNet, MS COCO, and NUS-WIDE, which are similar to ImageNet in terms of visual content and concept labels. Hence, the training process on these datasets does not need a large number of epochs [70, 9]. In this work, we instead apply malleable learning rates on CIFAR to train the models from scratch.
III-D Compactness & Separability
As commonly studied in machine learning [36, 60, 67], intra-class compactness and inter-class separability are important characteristics that can reveal some intuition about the learning ability and efficacy of a model. Due to the underlying Gaussian nature of the proposed G-softmax function, the intra-class compactness for a given class $i$ is characterized by the respective standard deviation $\sigma_i$, where a smaller $\sigma_i$ indicates that the learned model is more compact. Mathematically, the compactness of a given class can be represented by $1/\sigma_i$.
The inter-class separability can be measured by computing the disparity of two models, i.e., the divergence between two Gaussian distributions. In the probability and information theory literature, KLD is commonly used to measure the difference between two probability distributions. In the following, we denote a learned Gaussian distribution $\mathcal{N}(\mu_i, \sigma_i^2)$ as $\mathcal{N}_i$. Specifically, given two learned Gaussian distributions $\mathcal{N}_i$ and $\mathcal{N}_j$, the divergence between the two distributions is
$$D_{\mathrm{KL}}(\mathcal{N}_i \,\|\, \mathcal{N}_j) = \int_{-\infty}^{\infty} f_i(x) \log \frac{f_i(x)}{f_j(x)}\, dx = \log \frac{\sigma_j}{\sigma_i} + \frac{\sigma_i^2 + (\mu_i - \mu_j)^2}{2 \sigma_j^2} - \frac{1}{2} \tag{18}$$
where $f_i$ and $f_j$ are the probability density functions of the respective classes. KLD is always non-negative. As proven by Gibbs' inequality, KLD is zero if and only if the two distributions are equivalent almost everywhere. To quantify the divergence between the distribution of the $i$-th category and the distributions of the remaining categories, we use the mean of the KLDs,
$$S_i = \frac{1}{m - 1} \sum_{j \neq i} \frac{1}{2} \left[ D_{\mathrm{KL}}(\mathcal{N}_i \,\|\, \mathcal{N}_j) + D_{\mathrm{KL}}(\mathcal{N}_j \,\|\, \mathcal{N}_i) \right] \tag{19}$$
Because KLD is asymmetric, we compute the mean of $D_{\mathrm{KL}}(\mathcal{N}_i \,\|\, \mathcal{N}_j)$ and $D_{\mathrm{KL}}(\mathcal{N}_j \,\|\, \mathcal{N}_i)$ for a fair measurement.
Since compactness indicates the intra-class correlations and separability indicates the inter-class correlations, we multiply intra-class compactness by inter-class separability to quantify overall how discriminative the features with the same label are. Hence, we define the separability ratio $r_i$ with respect to the $i$-th class as follows
$$r_i = \frac{S_i}{\sigma_i} \tag{20}$$
Since $\sigma_i$ of a distribution is inversely proportional to compactness, $r_i$ is also inversely proportional to $\sigma_i$. Ideally, we hope a model's $r_i$ is as large as possible, which requires the separability $S_i$ to be as large as possible and $\sigma_i$ to be as small as possible at the same time.
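The quantities in this subsection are simple to compute from the learned means and standard deviations. The sketch below uses the closed-form KL divergence between two univariate Gaussians and assumes that separability is the average symmetrized KLD to all other classes and that the separability ratio is separability divided by the class standard deviation; the function names and example parameters are illustrative.

```python
import math


def kl_gauss(mu_i, sigma_i, mu_j, sigma_j):
    """Closed-form KL(N(mu_i, sigma_i^2) || N(mu_j, sigma_j^2))."""
    return (math.log(sigma_j / sigma_i)
            + (sigma_i ** 2 + (mu_i - mu_j) ** 2) / (2.0 * sigma_j ** 2)
            - 0.5)


def separability(i, mus, sigmas):
    """Mean symmetrized KLD between class i and every other class (assumed form)."""
    others = [j for j in range(len(mus)) if j != i]
    total = 0.0
    for j in others:
        forward = kl_gauss(mus[i], sigmas[i], mus[j], sigmas[j])
        backward = kl_gauss(mus[j], sigmas[j], mus[i], sigmas[i])
        total += 0.5 * (forward + backward)
    return total / len(others)


def separability_ratio(i, mus, sigmas):
    """Separability multiplied by compactness 1 / sigma_i."""
    return separability(i, mus, sigmas) / sigmas[i]


# Illustrative learned parameters for three classes.
mus = [0.0, 1.0, 3.0]
sigmas = [1.0, 1.0, 0.5]
ratios = [separability_ratio(i, mus, sigmas) for i in range(3)]
print(ratios)
```

In this example the third class is both far from the others and tightly concentrated, so it receives the largest ratio, which is the behavior the measure is meant to reward.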
IV Empirical Evaluation
In this section, we provide a comprehensive comparison between the softmax function and the proposed G-softmax function for single-label and multi-label classification. Specifically, we evaluate three baseline ConvNets (i.e., VGG, DenseNet, and wide ResNet) on the CIFAR-10 and CIFAR-100 datasets for single-label classification. For multi-label classification, we conduct the experiments with ResNet on the MS COCO dataset.
Datasets & Evaluation Metrics. To evaluate the proposed G-softmax function for single-label classification, we use the CIFAR-10 [21] and CIFAR-100 datasets, which are widely used in the machine learning literature [30, 26, 49, 36, 4, 25, 63, 19]. CIFAR-10 consists of 60,000 color images of 32×32 pixels in 10 classes. Each class has 6,000 images, including 5,000 training images and 1,000 test images. CIFAR-100 has 100 classes, and its image resolution is the same as that of CIFAR-10. It has 600 images per class, including 500 training images and 100 test images. We also use Tiny ImageNet in this work. It is a variant of ImageNet with 200 classes, each of which has 500 training images and 50 validation images.
For the multi-label classification task, we adopt two widely used datasets, i.e., MS COCO [31] and NUS-WIDE [3]. The MS COCO dataset is primarily designed for object detection in context, and it is also widely used for multi-label recognition. It comprises a training set of 82,081 images and a validation set of 40,137 images. The dataset covers 80 common object categories, with about 3.5 object labels per image. In this work, we follow the original split for training and testing. Following [28, 70, 55, 9], we only use the image labels for training and evaluation. NUS-WIDE consists of 269,648 images with 81 concept labels. We use the official train/test split, i.e., 161,789 images for training and 107,859 images for evaluation.
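We use the same evaluation metrics as [70, 55], namely mean average precision (mAP), per-class precision, recall, and F1 score (denoted as CP, CR, and CF1), and overall precision, recall, and F1 score (denoted as OP, OR, and OF1). More concretely, average precision is defined as follows
$$AP_i = \frac{\sum_{k=1}^{n} P_i(k)\, \mathrm{rel}_i(k)}{\sum_{k=1}^{n} \mathrm{rel}_i(k)} \tag{21}$$
where $\mathrm{rel}_i(k)$ is a relevance function that returns 1 if the item at rank $k$ is relevant to the $i$-th class and returns 0 otherwise. To compute mAP, we collect the predicted probabilities for each class over all the images. The corresponding predicted $i$-th labels over all images are sorted in descending order. The average precision of the $i$-th class is the average of the precisions at the correctly predicted $i$-th labels. $P_i(k)$ is the precision at rank $k$ over all predicted $i$-th labels. $n$ denotes the number of predicted $i$-th labels. Finally, the mAP is obtained by averaging AP over all classes. The other metrics are defined as follows
$$\begin{aligned} OP &= \frac{\sum_i N_i^{c}}{\sum_i N_i^{p}}, & OR &= \frac{\sum_i N_i^{c}}{\sum_i N_i^{g}}, & OF1 &= \frac{2 \cdot OP \cdot OR}{OP + OR}, \\ CP &= \frac{1}{C} \sum_i \frac{N_i^{c}}{N_i^{p}}, & CR &= \frac{1}{C} \sum_i \frac{N_i^{c}}{N_i^{g}}, & CF1 &= \frac{2 \cdot CP \cdot CR}{CP + CR} \end{aligned} \tag{22}$$
where $N_i^{c}$ is the number of images correctly predicted for the $i$-th class, $N_i^{p}$ is the number of predicted images for the $i$-th label, and $N_i^{g}$ is the number of ground-truth images for the $i$-th label. For CP, CR, and CF1, $C$ is the number of labels.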
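The metrics above can be computed with a few lines of code. The sketch below assumes the standard definitions: average precision is the mean of the precisions at the ranks of relevant items, the overall metrics pool counts over all labels, and the per-class metrics average the per-label ratios. The helper names and the toy counts are illustrative.

```python
def average_precision(scores, relevant):
    """AP for one class: mean of precision@k over the ranks k of relevant items."""
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, idx in enumerate(order, start=1):
        if relevant[idx]:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall) if precision + recall else 0.0


def multilabel_metrics(n_correct, n_predicted, n_ground_truth):
    """Overall (OP, OR, OF1) and per-class (CP, CR, CF1) metrics from per-label counts."""
    num_labels = len(n_correct)
    op = sum(n_correct) / sum(n_predicted)
    o_r = sum(n_correct) / sum(n_ground_truth)
    cp = sum(c / p for c, p in zip(n_correct, n_predicted)) / num_labels
    cr = sum(c / g for c, g in zip(n_correct, n_ground_truth)) / num_labels
    return {"OP": op, "OR": o_r, "OF1": f1(op, o_r), "CP": cp, "CR": cr, "CF1": f1(cp, cr)}


# Toy example: four images scored for one class, and counts for two labels.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
metrics = multilabel_metrics(n_correct=[8, 3], n_predicted=[10, 6], n_ground_truth=[16, 4])
print(ap, metrics)
```

In the toy example the relevant items sit at ranks 1 and 3, so the average precision is the mean of 1 and 2/3; mAP would then be the mean of such values over all classes.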
Baselines & Experiment Configurations. For the single-label classification task, we adopt softmax and L-softmax [36] as baseline methods for comparison. For multi-label classification, due to the limitations of L-softmax discussed in Subsection III-B, we only use softmax as the baseline method.
There are a number of ConvNets, such as AlexNet [22], GoogLeNet [51], VGG [48], ResNet [14], wide ResNet [63], and DenseNet [19]. For the experiments on CIFAR-10 and CIFAR-100, we adopt the state-of-the-art wide ResNet and DenseNet as baseline models. In addition, since the network structures of wide ResNet and DenseNet are quite different from those of conventional networks such as AlexNet and VGG, we also include VGG. Specifically, we use VGG-16 (a 16-layer model), wide ResNet with 40 convolutional layers and a widening factor of 14, and DenseNet with 100 convolutional layers and a growth rate of 24. Our experiments focus on comparing the conventional softmax function with the proposed G-softmax function; the softmax and L-softmax functions are the baselines. For fair comparison, the experiments are strictly conducted under the same conditions. For all comparisons, we only replace the softmax function in the final layer with the proposed G-softmax function and preserve the other parts of the network. In the training stage, we keep most of the training hyperparameters, e.g., weight decay and momentum, the same as AlexNet [23]. Both the baseline and the proposed G-softmax function are trained from scratch under the same conditions. In the wide ResNet experiments, the batch size for both CIFAR-10 and CIFAR-100 is 128, which is the number used in the original work [63]. In the DenseNet experiments, since the graphics memory usage is considerably higher than that of wide ResNet, we use a batch size of 50, which fully uses the graphics memory of three GPUs. The hardware used in this work comprises an Intel Xeon E5-2660 CPU and GeForce GTX 1080 Ti GPUs. All models are implemented with Torch [5].
We follow the original experimental settings of the baseline models for the training and evaluation of the softmax function and the G-softmax function. For example, in DenseNet [19], Huang et al. train their model for 300 epochs with staircase learning rates. From the 1st epoch to the 149th epoch, the learning rate is set to . From the 150th epoch to the 224th epoch, it is , and the learning rate for the remaining epochs is . The wide ResNet model is trained for 200 epochs [63]. The learning rate is initialized to , and at the 60th, 120th, and 160th epochs it decreases to , , , respectively. To make it comparable to DenseNet, we extend the training from 200 to 300 epochs and decrease the learning rate at the 220th and 260th epochs by multiplying . To avoid ad hoc hyperparameter settings, we set the weight decay and momentum to the default values in the baselines [63, 19] (i.e., ) for both the softmax function and the proposed G-softmax function.
For the experiments on Tiny ImageNet, we adopt wide ResNet [63] with 40 convolutional layers and width 14 as the baseline model. The initial learning rate is 0.001 and the weight decay is 1e-4. The training process consists of 30 epochs with a batch size of 80, and the learning rate is decreased to one tenth of its value every 10 epochs. Following [18, 58], we use the ImageNet-pretrained weights as the initialization, and the input images are resized to 224×224 before being fed to the wide ResNet.
Method | Top-1 error (%)
--- | ---
WideResNet [63] | 39.63
WideResNet SE [18] | 32.90
DenseNet [19] | 39.09
IGCV2 [56] | 39.02
PyramidNet ShakeDrop [58] | 31.15
ResNet-101 (Input size: 64×64) | 31.66
ResNet-101 (Input size: 224×224) | 18.36
ResNet-101 L-softmax | 17.57
ResNet-101 G-softmax () | 16.86
ResNet-101 G-softmax () | 16.95
ResNet-101 G-softmax () | 17.04
ResNet-101 G-softmax () | 17.29
ResNet-101 G-softmax () | 16.96
In the experiments with malleable learning rates, 1100 epochs are used in training. There are only two phases throughout the whole training, i.e., and , where and .
Unlike the softmax function, the proposed G-softmax function has two learnable parameters (i.e., $\mu$ and $\sigma$) and one hyperparameter (i.e., $\alpha$). Without loss of generality, $\mu$ and $\sigma$ are initialized according to the standard Gaussian distribution (i.e., $\mu$ to 0 and $\sigma$ to 1). These two parameters are learned through training by Equations (9) and (10). To determine $\alpha$, we follow the similar rule where we start from and try the value between . As mentioned in Section III, the G-softmax function is equivalent to the softmax function if $\alpha = 0$. In our experiments, is initialized to 1 for the CIFAR-10 and CIFAR-100 experiments with DenseNet. In wide ResNet, is initialized as for the CIFAR-100 experiments and for the CIFAR-10 experiments.
Method | CP | CR | CF1 | OP | OR | OF1 | mAP
--- | --- | --- | --- | --- | --- | --- | ---
VGG MCE [29] | - | - | - | - | - | - | 70.2
Weak sup (GMP) [41] | - | - | - | - | - | - | 62.8
CNN-RNN [53] | 66.0 | 55.6 | 60.4 | 69.2 | 66.4 | 67.8 | -
RGNN [68] | - | - | - | - | - | - | 73.0
WELDON [10] | - | - | - | - | - | - | 68.8
Multi-CNN [66] | 54.8 | 51.4 | 53.1 | 56.7 | 58.6 | 57.6 | 60.4
CNN+LSTM [66] | 62.1 | 51.2 | 56.1 | 68.1 | 56.6 | 61.8 | 61.8
MCG-CNN+LSTM [66] | 64.2 | 53.1 | 58.1 | 61.3 | 59.3 | 61.3 | 64.4
RLSD [66] | 67.6 | 57.2 | 62.0 | 70.1 | 63.4 | 66.5 | 68.2
Pairwise ranking [28] | 73.5 | 56.4 | - | 76.3 | 61.8 | - | -
MIML-FCN [59] | - | - | - | - | - | - | 66.2
RDAR [55] | 79.1 | 58.7 | 67.4 | 84.0 | 63.0 | 72.0 | 72.2
ResNet-101 (GAP, ImgSize: , bz=96, lr=1e-3) [70] | 80.8 | 63.4 | 69.5 | 82.2 | 68.0 | 74.4 | 75.2
SRN [70] | 81.6 | 65.4 | 71.2 | 82.7 | 69.9 | 75.8 | 77.1
ResNet-101 (GAP, ImgSize: , unknown bz and lr) [9] | - | - | - | - | - | - | 72.5
WILDCAT [9] | - | - | - | - | - | - | 80.7
ResNet-101 (GMP, ImgSize: , bz=16, lr=1e-5) | 81.3 | 70.2 | 74.1 | 81.9 | 74.3 | 77.9 | 80.6
ResNet-101 G-softmax w/ (, ) | 82.4 | 69.3 | 74.1 | 83.2 | 73.3 | 77.9 | 80.8
ResNet-101 G-softmax w/ (, ) | 82.7 | 69.3 | 74.0 | 83.5 | 72.9 | 77.8 | 81.1
ResNet-101 G-softmax w/ (, ) | 80.6 | 71.0 | 74.4 | 81.3 | 74.7 | 77.8 | 80.9
ResNet-101 G-softmax w/ (, ) | 81.3 | 74.5 | 74.5 | 83.4 | 73.8 | 78.3 | 81.3
ResNet-101 G-softmax w/ (, ) | 83.2 | 68.6 | 73.7 | 84.3 | 72.3 | 77.9 | 81.1
GAP indicates that global average pooling is used in the last pooling layer, while GMP indicates that global max pooling is used in the last pooling layer. ResNet101 with the proposed softmax function uses the same hyperparameters and architecture as ResNet101 with the softmax function. For simplicity, we denote ResNet101 with the softmax function as ResNet101, ImgSize as image size, bz as batch size, and lr as learning rate. Notably, mAP, CF1, and OF1 are the more important measures of performance [70, 55].
Method  CP  CR  CF1  OP  OR  OF1  mAP
LSEP [28]  66.7  45.9  54.4  76.8  65.7  70.8  -
Orderfree RNN [2]  59.4  50.7  54.7  69.0  71.4  70.2  -
ResNet101 [70]  65.8  51.9  55.7  75.9  69.5  72.5  59.8 
ResNet101  62.0  56.3  56.9  74.7  71.4  73.0  59.9 
ResNet101 softmax (, )  62.5  56.5  57.8  74.7  71.7  73.2  60.3 
ResNet101 softmax (, )  62.3  56.0  57.2  74.9  71.3  73.0  60.3 
ResNet101 softmax (, )  62.9  55.9  57.1  74.9  71.2  73.0  60.1 
ResNet101 softmax (, )  63.2  55.8  57.1  74.9  71.3  73.1  60.0 
ResNet101 softmax (, )  62.0  55.9  57.3  74.9  71.4  73.1  60.4 
Specifically, F1 score is more comprehensive than precision and recall because it takes both precision and recall into account for evaluation. Although LSEP [28] achieves better CP and OP scores, the CR, OR, and CF1 scores of LSEP are the lowest in the table. The proposed softmax function achieves better performance in five metrics including mAP, CF1, and OF1.
For the experiments on MS COCO, we refer to the state-of-the-art works [14, 9] to set the weight decay and momentum to and , respectively. The model is trained with the learning rate for 8 epochs on the MS COCO validation set. In the experiments, we try various initializations of and to observe how these factors influence the learning. is initialized as . Since we follow the convention of multilabel classification [70, 9], we use pretrained weights to initialize the ConvNet, which differs from the initializations in the experiments on CIFAR10 and CIFAR100. This difference enables the model to determine and in a data-driven way, that is, by empirically computing and from data with the pretrained weights. The image size used in this work is the same as the one used in [9], i.e., , while the minibatch size is 16, which is limited by the number of GPUs.
For the experiments on NUSWIDE, we use the same experimental setting as the one on MS COCO.
Notations. We denote a model with the proposed softmax function as model softmax, e.g., ResNet101 softmax. To simplify notation, we omit softmax after the model name when a model works with the softmax function, which we assume by default. For example, ResNet implies that the ResNet model works with the softmax function. In the following tables and figures, RSN, DSN, and WRN stand for ResNet, DenseNet, and wide ResNet, respectively.
Evaluations on CIFAR. The performance of the softmax function and the proposed softmax function is listed in Table LABEL:tbl:perf_train in terms of top-1 error rate. For convenience, DenseNet and wide ResNet are denoted as DSN and WRN, respectively. The proposed softmax function outperforms the softmax and Lsoftmax functions in all evaluated scenarios.
On CIFAR10, VGG with the proposed softmax function achieves error rate, while the error rates of the softmax and Lsoftmax functions are and , respectively. Consistently, VGG with the proposed softmax function achieves a similar improvement on CIFAR100. DenseNet reports its best error rates on CIFAR10 and CIFAR100 with 190 convolutional layers and a growth rate of 40 (denoted as DSNBC19040) [19]. However, DenseNet with this configuration consumes a large amount of graphics memory due to its depth: it occupies about 30 GB of graphics memory to process a batch of 10 images on 3 GPUs. Therefore, we adopt a moderate setting, i.e., DSN10024, in our experiments so that the batch size can be as large as possible, i.e., 50 on CIFAR10 and 32 on CIFAR100. Under this configuration, the proposed softmax function achieves error rate on CIFAR10, which is better than the error rate of the softmax function and the error rate of the Lsoftmax function. Also, on CIFAR100 the error rate of the proposed softmax function decreases to , compared to the error rate of the softmax function and the error rate of the Lsoftmax function. In the wide ResNet experiments, the baseline consistently performs better than the DenseNet baseline on both CIFAR10 and CIFAR100, and the proposed softmax function further improves the performance to error rate on CIFAR10 and on CIFAR100. As shown in Table LABEL:tbl:perf_train, although the structures of the three models are distinct from each other, the proposed softmax function generalizes to these models and improves their respective performance. Applying malleable learning rates with wide ResNet softmax further improves the performance, i.e., on CIFAR10 and on CIFAR100.
Evaluations on Tiny ImageNet. Table LABEL:tbl:timg_perf reports the error rates of the softmax, Lsoftmax, and proposed softmax functions on Tiny ImageNet. We present the error rates of ResNet with input image sizes 64×64 and 224×224, where 224×224 is used in the setting of training on ImageNet, and training the initialized ResNet with this image size leads to a lower error rate (18.36). The proposed softmax function with various leads to overall lower error rates than the softmax and Lsoftmax functions. In particular, achieves the lowest error rate (16.86).
Evaluations on MS COCO. As shown in Table LABEL:tbl:mscoco_perf, ResNet101 softmax with an initialization of Gaussian distributions for achieves the best performance over three metrics (i.e., CF1, OF1, and mAP). The proposed softmax functions are initialized in two straightforward ways. One is to set to the standard Gaussian distribution parameters , while the other is to empirically compute from the data. Both approaches achieve better mAPs ( and ) than the state-of-the-art model [9] (). To comprehensively understand the effects of , we also initialize them with other values, i.e., , , and . By comparing with the performance of ResNet101 softmax with , we can see the respective influences of . Overall, the four initializations lead to better performance than the initialization of , and the initialization of yields the best performance over CF1, OF1, and mAP. One observation on is that smaller leads to higher precision but lower recall. For example, the OP of is whereas that of is . In contrast, the OR of is whereas that of is . According to the metrics in (22), we can infer that small yields fewer and than large . The change in is relatively smaller than that in , and these effects of decreasing lead to higher precision but lower recall.
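The per-class and overall metrics reported in the multilabel tables can be illustrated with the following sketch. The metric definitions are given in (22), which is not reproduced in this section, so the code uses commonly used definitions as an assumption: CP and CR average the per-class precision and recall, OP and OR pool the counts over all classes, OF1 is the harmonic mean of OP and OR, and CF1 is taken here as the mean of per-class F1 (some works instead use the harmonic mean of CP and CR). The function name and the toy labels are hypothetical.

```python
def multilabel_metrics(y_true, y_pred):
    """Compute CP, CR, CF1, OP, OR, OF1 from binary label matrices.

    y_true, y_pred: lists of per-sample lists of 0/1 values,
    one entry per class.
    """
    n_classes = len(y_true[0])
    tp = [0] * n_classes       # correctly predicted positives per class
    n_pred = [0] * n_classes   # predicted positives per class
    n_true = [0] * n_classes   # ground truth positives per class
    for t_row, p_row in zip(y_true, y_pred):
        for c, (t, p) in enumerate(zip(t_row, p_row)):
            tp[c] += t * p
            n_pred[c] += p
            n_true[c] += t

    def ratio(a, b):
        return a / b if b else 0.0

    def f1(p, r):
        return ratio(2.0 * p * r, p + r)

    prec = [ratio(tp[c], n_pred[c]) for c in range(n_classes)]
    rec = [ratio(tp[c], n_true[c]) for c in range(n_classes)]
    op = ratio(sum(tp), sum(n_pred))
    o_r = ratio(sum(tp), sum(n_true))
    return {
        "CP": sum(prec) / n_classes,
        "CR": sum(rec) / n_classes,
        # Assumption: CF1 is the mean of per-class F1 scores.
        "CF1": sum(f1(p, r) for p, r in zip(prec, rec)) / n_classes,
        "OP": op,
        "OR": o_r,
        "OF1": f1(op, o_r),
    }


# Toy example: three images, two classes (hypothetical labels).
y_true = [[1, 0], [1, 1], [0, 1]]
y_pred = [[1, 1], [0, 1], [1, 1]]
metrics = multilabel_metrics(y_true, y_pred)
```

In this toy example the second class is predicted for every image, so its recall is perfect while its precision drops, which mirrors the precision and recall trade-off discussed above.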
Evaluations on NUSWIDE. The experimental results of NUSWIDE are consistent with the experimental results of MS COCO, as shown in Table LABEL:tbl:nus_perf. The proposed softmax function overall outperforms the softmax function over all metrics. Specifically, the setting achieves the best mAP .
V Analysis
In this section, we discuss the influence of the proposed softmax function on prediction by presenting a visual comparison with the softmax and Lsoftmax function. Then, we further quantify the influences caused by the softmax, Lsoftmax, and the proposed softmax function in terms of intraclass compactness and interclass separability. Moreover, the analysis of significance of the AP differences between the softmax function and the proposed softmax function on MS COCO and NUSWIDE is provided. Last but not least, we analyze the correlations between compactness (separability and ratio) and AP on MS COCO and NUSWIDE.
Influence of the softmax function on ConvNets. In the literature, many works [65, 38, 27] analyze ConvNets using visualization. In this work, our hypothesis is related to the distributions of the activations of deep layers. Therefore, we analyze the proposed softmax function from the aspect of the mapping between and . Given images with a certain label out of labels, ConvNets generate the final feature preceding the softmax function. Each in represents the corresponding confidence for the predicted label . By the winner-takes-all idea in the softmax function, the label that has the highest value of the softmax function is marked as the prediction. We hope that the predicted label is the ground truth label, i.e., , and name imposter labels. Ideally, the imposter feature is expected to be lower than and far away from the ground truth feature so as to increase the probability of a correct prediction.
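The winner-takes-all rule described above can be sketched as follows for the ordinary softmax function. The feature values and function names are hypothetical and serve only to illustrate how a higher ground truth feature and lower imposter features translate into the prediction.

```python
import math


def softmax(features):
    """Numerically stable softmax over the final features."""
    shift = max(features)
    exps = [math.exp(f - shift) for f in features]
    total = sum(exps)
    return [e / total for e in exps]


def predict(features):
    """Return the winning label index and the softmax probabilities."""
    probs = softmax(features)
    label = max(range(len(probs)), key=lambda i: probs[i])
    return label, probs


# Hypothetical final features for three labels; index 1 is the ground truth.
features = [1.0, 3.0, 0.5]
label, probs = predict(features)
```

The further the imposter features lie below the ground truth feature, the larger the probability assigned to the correct label.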
To investigate the influence of the trained softmax function on the training set and test set, we inspect the relationship between features and predictions on CIFAR10 and CIFAR100, as shown in Figure 3. To remove unnecessary interference from patterns of other classes, we fix the prediction of a subset of the training set and the test set of CIFAR10 from a single class. For example, given all images with the ground truth class label “airplane”, the ConvNet would generate the deep features , in CIFAR10, and pass them to the predictor for computing the predictions . Note that here we denote as the feature of the class “airplane” and all are considered the features w.r.t. “nonairplane”. Similarly, we also plot the scattered points w.r.t. the images with label “aquarium fish” on CIFAR100.
As shown in Figure 3, the range of of the proposed softmax function is different from the range of of the softmax and Lsoftmax functions. Most of the imposter features of the proposed softmax function are distributed in the range , whereas of the softmax and Lsoftmax functions spread out. In the test set of CIFAR10, the range of of the proposed softmax function approximately spans from 0 to 9, whereas the range of the softmax function is and the range of the Lsoftmax function is . In the test set of CIFAR100, the range of of the proposed softmax function approximately spans from 0 to 11, whereas the range of the softmax function is and the range of the Lsoftmax function is .
Figure 5, w.r.t. two categories on MS COCO, shows a consistent pattern. In the categories “cow” and “baseball bat”, the positive features of ResNet101 softmax, i.e., the features related to “cow” and “baseball bat”, are closer to each other than those of ResNet101 with the softmax function.
To quantitatively understand the distributions of the scattered points in Figure 3, we empirically compute and of the points w.r.t. the softmax function, the Lsoftmax function, and the proposed softmax function. With these distribution parameters, we further compute the compactness, separability, and ratio, as shown in Figure 4.
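As a small illustration of fitting distribution parameters to a set of features, the sketch below assumes that the two empirically computed quantities are the mean and the standard deviation of a Gaussian. It does not compute the compactness, separability, or ratio, whose definitions are not reproduced in this section; the sample values and function names are hypothetical.

```python
import math
import statistics


def fit_gaussian(samples):
    """Empirical mean and population standard deviation of the samples."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    return mu, sigma


def gaussian_peak(sigma):
    """Height of the Gaussian density at its mean: 1 / (sigma * sqrt(2*pi))."""
    return 1.0 / (sigma * math.sqrt(2.0 * math.pi))


# Two hypothetical sets of features with the same mean but different spread.
wide = [1.0, 2.0, 3.0, 4.0, 5.0]
tight = [2.5, 3.0, 3.5]
mu_wide, sigma_wide = fit_gaussian(wide)
mu_tight, sigma_tight = fit_gaussian(tight)
```

A smaller standard deviation gives a narrower and taller fitted curve, which is the visual pattern used to compare the functions in the figures.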
The proposed softmax function influences the kurtosis of the Gaussian distributions of of class ‘airplane’ (CIFAR10) or ‘aquarium fish’ (CIFAR100), compared to the softmax function. In other words, the curves of the distributions w.r.t. the proposed softmax function are narrower and taller than those w.r.t. the softmax function on both CIFAR10 and CIFAR100. In particular, the distributions w.r.t. the Lsoftmax function yield flatter and wider curves than those of the softmax function and the proposed softmax function on CIFAR10 and CIFAR100. With the distribution parameters, the intraclass compactness, interclass separability, and separability ratio can be computed and visualized in the bar plots in Figure 4. Overall, the proposed softmax function achieves better intraclass compactness, interclass separability, and separability ratio than the softmax function and the Lsoftmax function on both CIFAR10 and CIFAR100.
Figure 6 shows a more comprehensive analysis of intraclass compactness, interclass separability, and separability ratio for each class on CIFAR10. We can see that the proposed softmax function improves intraclass compactness, interclass separability, and separability ratio in most classes over the softmax function and the Lsoftmax function. Due to limited space, we perform a similar analysis on the first 10 classes of CIFAR100, as shown in Figure 7. In contrast to Figure 6, where the Lsoftmax function yields the lowest intraclass compactness, interclass separability, and separability ratio on both the training and test sets of CIFAR10, the Lsoftmax function yields the highest intraclass compactness, interclass separability, and separability ratio in most classes on the training set but still yields the lowest in most classes on the test set. This implies that it may overfit the training data. Again, the proposed softmax function consistently yields better intraclass compactness, interclass separability, and separability ratio in most classes.
We also analyze intraclass compactness, interclass separability, and separability ratio for multilabel classification on MS COCO. The experimental results of MS COCO show a consistent pattern with the ones of CIFAR. For example, the vs. plots of category “baseball bat” in Figure 5 show that of ResNet softmax are more compact than of ResNet. Consistently, the Gaussian distribution of ResNet softmax w.r.t. the positive is taller and narrower than the one of ResNet. The compactness of ResNet w.r.t. class “baseball bat” is 2.1, while the one of ResNet softmax is 2.3. Figure 8 shows the average compactnesses of ResNet and ResNet softmax over all 80 categories on MS COCO validation set. The average compactness of ResNet is 2.6, while the one of ResNet softmax is 2.8. The separability of the proposed softmax function between category “noncow” and “cow” is 4.3, which is significantly greater than 1.8 (i.e., the separability of the softmax function). The average separability over all 80 categories on MS COCO is shown in Figure 8. The average separability (4.5) of the proposed softmax function is greater than the average separability (4.2) of the softmax function. Similar to intraclass compactness and interclass separability, the average ratio of the proposed softmax function is higher than the one of the softmax function.
MS COCO  

softmax(0,1)  0.9060 
softmax(0,0.5)  0.0359 
softmax(0,5)  0.0548 
softmax(0.1,1)  0.0764 
softmax(0.1,1)  0.3160 
NUSWIDE  

softmax(0,1)  0.0049 
softmax(0,2)  0.0001 
softmax(0,3)  0.0098 
softmax(0.05,1)  0.6773 
softmax(0.05,1)  0.0131 
softmax function is computed by paired-sample t-test. The resulting val reported in the table is the probability of observing a difference at least as extreme, assuming the null hypothesis is true. If val is equal to or less than 0.05, it implies that there is a significant difference between the softmax function and the proposed softmax function in compactness (separability or ratio). In the experiments on MS COCO, the difference of APs between the softmax function and the proposed softmax function with is statistically significant (). In the experiments on NUSWIDE, except for the proposed softmax function with , the differences of APs between the softmax function and the proposed softmax functions are statistically significant.
Significance of Difference between softmax and softmax. Following the above discussion of the influence of the proposed softmax function, we further quantify the difference in prediction performance caused by this influence. Specifically, we study the difference of average precision between the softmax function and the proposed softmax function on MS COCO and NUSWIDE, which are richer in visual content and visual semantics than CIFAR and Tiny ImageNet. First, the APs of the softmax function and the proposed softmax function are computed for each class. In particular, the proposed softmax functions with each pair of and in Table LABEL:tbl:mscoco_perf and LABEL:tbl:nus_perf are used for analysis. Given the APs of the softmax function and the APs of the proposed softmax function with a specific and , a paired-sample t-test is used to compute the p-value, denoted as val, which indicates the probability of observing a difference at least as extreme, assuming the null hypothesis is true. When , the two series of APs are significantly different. Table IV shows this analysis on MS COCO and NUSWIDE. We can see that the val of the softmax function and the proposed softmax function with is less than 0.05 in the experiments on MS COCO. This implies that the resulting APs of the proposed softmax function are significantly different from those of the softmax function. In contrast, in the experiments on NUSWIDE, the proposed softmax functions in Table LABEL:tbl:nus_perf are significantly different from the softmax function in terms of APs, except for the proposed softmax function with .
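The paired-sample t-test on two series of per-class APs can be sketched as below. This is a self-contained illustration, not the implementation used in the experiments: the two-sided p-value is obtained by numerically integrating the Student t density with Simpson's rule, and the AP values and function names are hypothetical.

```python
import math


def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2.0) / (
        math.sqrt(df * math.pi) * math.gamma(df / 2.0)
    )
    return coef * (1.0 + x * x / df) ** (-(df + 1) / 2.0)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson integration of the density on [0, |t|].

    `steps` must be even for Simpson's rule.
    """
    a = abs(t)
    if a == 0.0:
        return 1.0
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * t_pdf(i * h, df)
    area = total * h / 3.0  # probability mass between 0 and |t|
    return max(0.0, 1.0 - 2.0 * area)


def paired_t_test(aps_a, aps_b):
    """Return (t statistic, two-sided p-value) for paired samples."""
    diffs = [a - b for a, b in zip(aps_a, aps_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0.0:
        raise ValueError("all paired differences are identical")
    t = mean / math.sqrt(var / n)
    return t, two_sided_p(t, n - 1)


# Hypothetical per-class APs of two models on four classes.
aps_proposed = [0.80, 0.75, 0.90, 0.60]
aps_baseline = [0.78, 0.70, 0.91, 0.55]
t_stat, p_val = paired_t_test(aps_proposed, aps_baseline)
```

With only four hypothetical classes the difference is not significant at the 0.05 level; the analysis in the text pairs the APs over all classes of each dataset.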
Correlations between Compactness/Separability/ratio and APs. In this work, we study intraclass compactness and interclass separability for each class in the datasets. A question arises: how are intraclass compactness and interclass separability correlated with APs for the proposed softmax function? Note that intraclass compactness and interclass separability may not be influential when their values are low. Hence, we only inspect the classes with the best average intraclass compactness, interclass separability, or separability ratio across the various softmax functions. On the one hand, we have the intraclass compactnesses (interclass separabilities or separability ratios) of these classes w.r.t. each softmax function in Table LABEL:tbl:mscoco_perf and LABEL:tbl:nus_perf. On the other hand, we have the APs yielded by each softmax function in Table LABEL:tbl:mscoco_perf and LABEL:tbl:nus_perf. With the compactnesses/separabilities/ratios and corresponding APs of a certain class yielded by the various softmax functions, we use the Pearson correlation method to quantify the correlation between the three factors and AP, and report the Pearson correlation coefficients and corresponding vals in Table LABEL:tbl:corr. We can observe that, overall, intraclass compactness, interclass separability, and separability ratio are linearly correlated with AP at a significance level of 0.05. This implies that improving intraclass compactness and interclass separability will lead to improvement of AP.
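The Pearson correlation between a factor (e.g., compactness) and AP can be sketched as follows. The compactness and AP values are hypothetical; the t statistic shown is the standard transformation of the coefficient whose Student t distribution with n - 2 degrees of freedom gives the significance of the correlation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom.

    Undefined when |r| equals 1 exactly.
    """
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical compactness values and APs of one class under five functions.
compactness = [2.1, 2.3, 2.6, 2.8, 3.0]
aps = [0.60, 0.62, 0.66, 0.67, 0.70]
r = pearson_r(compactness, aps)
t_stat = correlation_t_statistic(r, len(aps))
```

A coefficient close to 1 together with a large t statistic indicates a strong positive linear relationship between the factor and AP.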
MS COCO  NUSWIDE  

Correlation(compactness, AP)  (0.9472, 0.0144)  (0.9635, 0.0083) 
Correlation(separability, AP)  (0.9791, 0.0036)  (0.9045, 0.0349) 
Correlation(ratio, AP)  (0.9636, 0.0083)  (0.9702, 0.0062) 
VI Conclusion
In this work, we propose a Gaussian-based softmax function, namely softmax, which uses the cumulative probability function to improve the intraclass compactness and interclass separability of features. The proposed softmax function is simple to implement and can easily replace the softmax function. For evaluation purposes, classification datasets (i.e., CIFAR10, CIFAR100, and Tiny ImageNet) and multilabel classification datasets (i.e., MS COCO and NUSWIDE) are used in this work. The experimental results show that the proposed softmax function improves the state-of-the-art ConvNet models. Moreover, our analysis shows that high intraclass compactness and interclass separability are linearly correlated with average precision on MS COCO and NUSWIDE.
Acknowledgment
This research was funded in part by the NSF under Grant 1849107, in part by the University of Minnesota Department of Computer Science and Engineering Startup Fund (QZ), and in part by the National Research Foundation, Prime Minister’s Office, Singapore under its Strategic Capability Research Centres Funding Initiative.
References
 [1] X. Chang, Y. L. Yu, Y. Yang, and E. P. Xing. Semantic pooling for complex event analysis in untrimmed videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(8):1617–1632, 2017.
 [2] S.F. Chen, Y.C. Chen, C.K. Yeh, and Y.C. F. Wang. Orderfree rnn with visual attention for multilabel classification. arXiv preprint arXiv:1707.05495, 2017.
 [3] T.S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y.T. Zheng. NUSWIDE: A realworld web image database from National University of Singapore. In CIVR, 2009.
 [4] D. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). In ICLR, 2016.
 [5] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A matlablike environment for machine learning. In BigLearn, NIPS Workshop, 2011.
 [6] K. Crammer, O. Dekel, J. Keshet, S. ShalevShwartz, and Y. Singer. Online passiveaggressive algorithms. Journal of Machine Learning Research, 7(Mar):551–585, 2006.
 [7] K. Crammer, M. Dredze, and F. Pereira. Exact convex confidenceweighted learning. In Advances in Neural Information Processing Systems, pages 345–352, 2009.
 [8] M. Dredze, K. Crammer, and F. Pereira. Confidenceweighted linear classification. In ICML, pages 264–271, 2008.
 [9] T. Durand, T. Mordan, N. Thome, and M. Cord. WILDCAT: weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation. In CVPR, pages 5957–5966, 2017.

 [10] T. Durand, N. Thome, and M. Cord. WELDON: Weakly supervised learning of deep convolutional neural networks. In CVPR, pages 4743–4752, 2016.
 [11] K. Fu, J. Jin, R. Cui, F. Sha, and C. Zhang. Aligning where to see and what to tell: Image captioning with regionbased attention and scenespecific contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2321–2334, 2017.
 [12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
 [13] I. J. Goodfellow, D. WardeFarley, M. Mirza, A. C. Courville, and Y. Bengio. Maxout networks. ICML, pages 1319–1327, 2013.
 [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
 [15] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing coadaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
 [16] W. Hong, J. Yuan, and S. Das Bhattacharjee. Fried binary embedding for highdimensional visual features. In CVPR, pages 2749–2757, 2017.
 [17] W. Hou, X. Gao, D. Tao, and X. Li. Blind image quality assessment via deep learning. IEEE transactions on neural networks and learning systems, 26(6):1275–1286, 2015.
 [18] G. Huang, Y. Li, G. Pleiss, Z. Liu, J. E. Hopcroft, and K. Q. Weinberger. Snapshot Ensembles: Train 1, get M for free. In ICLR, 2017.
 [19] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, pages 4700–4708, 2017.
 [20] O. Kallenberg. Foundations of Modern Probability (Probability and Its Applications). SpringerVerlag, 1997.
 [21] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
 [22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
 [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
 [24] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [25] C.Y. Lee, P. W. Gallagher, and Z. Tu. Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In AISTATS, 2016.
 [26] C.Y. Lee, S. Xie, P. W. Gallagher, Z. Zhang, and Z. Tu. Deeplysupervised nets. In AISTATS, 2015.
 [27] K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In CVPR, pages 991–999, 2015.
 [28] Y. Li, Y. Song, and J. Luo. Improving pairwise ranking for multilabel image classification. In CVPR, pages 1837–1845, 2017.
 [29] Z. Li, W. Lu, Z. Sun, and W. Xing. Improving multilabel classification using scene cues. Multimedia Tools Appl., 77(5):6079–6094, 2018.
 [30] M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014.
 [31] T.Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, volume 8693 of Lecture Notes in Computer Science, pages 740–755, 2014.
 [32] W. Liu and I. Tsang. On the optimality of classifier chain for multilabel classification. In Advances in Neural Information Processing Systems, pages 712–720, 2015.

 [33] W. Liu and I. W. Tsang. Large margin metric learning for multilabel prediction. In TwentyNinth AAAI Conference on Artificial Intelligence, 2015.
 [34] W. Liu and I. W. Tsang. Making decision trees feasible in ultrahigh feature and label dimensions. Journal of Machine Learning Research, 18(81):1–36, 2017.
 [35] W. Liu, I. W. Tsang, and K.R. Müller. An easytohard learning paradigm for multiple classes and multiple labels. Journal of Machine Learning Research, 18(94):1–38, 2017.
 [36] W. Liu, Y. Wen, Z. Yu, and M. Yang. Largemargin softmax loss for convolutional neural networks. In ICML, pages 507–516, 2016.
 [37] W. Liu, D. Xu, I. Tsang, and W. Zhang. Metric learning for multioutput tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2018.
 [38] A. Mahendran and A. Vedaldi. Understanding deep image representations by inverting them. In CVPR, pages 5188–5196, 2015.
 [39] A. Menon, K. Mehrotra, C. K. Mohan, and S. Ranka. Characterization of a class of sigmoid functions with applications to neural networks. Neural Networks, 9(5):819–835, 1996.
 [40] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In ICCV, pages 1520–1528, 2015.
 [41] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free?  weaklysupervised learning with convolutional neural networks. In CVPR, pages 685–694, 2015.

 [42] C. Potthast, A. Breitenmoser, F. Sha, and G. S. Sukhatme. Active multiview object recognition: A unifying view on online feature selection and view planning. Robotics and Autonomous Systems, 84:31–47, 2016.
 [43] J. Read, B. Pfahringer, G. Holmes, and E. Frank. Classifier chains for multilabel classification. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 254–269, 2009.

 [44] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386, 1958.
 [45] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 [46] L. Shao, D. Wu, and X. Li. Learning deep and wide: A spectral method for learning deep networks. IEEE Transactions on Neural Networks and Learning Systems, 25(12):2303–2308, 2014.
 [47] X. Shen, W. Liu, I. W. Tsang, Q.S. Sun, and Y.S. Ong. Multilabel prediction via crossview search. IEEE transactions on neural networks and learning systems, 29(9):4324–4338, 2018.
 [48] K. Simonyan and A. Zisserman. Very deep convolutional networks for largescale image recognition. In ICLR, 2015.
 [49] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity. In ICLR, 2015.
 [50] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 [51] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015.
 [52] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus. Regularization of neural networks using dropconnect. In ICML, pages 1058–1066, 2013.
 [53] J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang, and W. Xu. CNNRNN: A unified framework for multilabel image classification. In CVPR, pages 2285–2294, 2016.
 [54] J. Wang, P. Zhao, and S. C. Hoi. Exact soft confidenceweighted learning. In ICML, pages 121–128, 2012.
 [55] Z. Wang, T. Chen, G. Li, R. Xu, and L. Lin. Multilabel image recognition by recurrently discovering attentional regions. In ICCV, pages 464–472, 2017.

 [56] G. Xie, J. Wang, T. Zhang, J. Lai, R. Hong, and G.J. Qi. Interleaved structured sparse convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 8847–8856, 2018.
 [57] H. Xu, Y. Gao, F. Yu, and T. Darrell. Endtoend learning of driving models from largescale video datasets. In CVPR, pages 2174–2182, 2017.
 [58] Y. Yamada, M. Iwamura, and K. Kise. Shakedrop regularization. In International Conference on Learning Representations, 2018.
 [59] H. Yang, J. T. Zhou, J. Cai, and Y. Ong. MIMLFCN+: multiinstance multilabel learning via fully convolutional networks with privileged information. In CVPR, pages 5996–6004, 2017.
 [60] L. Yang, R. Jin, R. Sukthankar, and Y. Liu. An efficient algorithm for local distance metric learning. In AAAI, pages 543–548, 2006.
 [61] F. Yu, V. Koltun, and T. Funkhouser. Dilated residual networks. In CVPR, pages 472–480, 2017.
 [62] Y. Yuan, L. Mou, and X. Lu. Scene recognition by manifold regularized deep learning architecture. IEEE Transactions on Neural Networks and Learning Systems, 26(10):2222–2233, 2015.
 [63] S. Zagoruyko and N. Komodakis. Wide residual networks. In BMVC, 2016.
 [64] M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. In ICLR, 2013.
 [65] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, volume 8689 of Lecture Notes in Computer Science, pages 818–833, 2014.
 [66] J. Zhang, Q. Wu, C. Shen, J. Zhang, and J. Lu. Multilabel image classification with regional latent semantic dependencies. IEEE Transactions on Multimedia, 2018. Early Access.
 [67] W. Zhang, X. Xue, Z. Sun, Y.F. Guo, and H. Lu. Optimal dimensionality of metric space for classification. In ICML, pages 1135–1142, 2007.

 [68] R. Zhao, J. Li, Y. Chen, J. Liu, Y. Jiang, and X. Xue. Regional gating neural networks for multilabel image classification. In BMVC, 2016.
 [69] C. Zhou and J. Yuan. Multilabel learning of part detectors for heavily occluded pedestrian detection. In ICCV, pages 3486–3495, 2017.
 [70] F. Zhu, H. Li, W. Ouyang, N. Yu, and X. Wang. Learning spatial regularization with imagelevel supervisions for multilabel image classification. In CVPR, pages 2027–2036, 2017.