Learning from an imbalanced data set is one of the most important practical problems in machine learning and has been actively studied [ReviewIBD2017, He2009]. This study focuses on neural-network-based classification problems with imbalanced data sets. Several methods have been proposed for these problems [He2009], such as under- and over-sampling, SMOTE [SMOTE2002], and cost-sensitive (CS) methods. We consider a CS-based method in this study. A simple type of CS method is achieved by assigning a weight to the loss of each data point in the loss function [He2009, CSM2010]. The weights in the CS method can be viewed as the importances of the corresponding data points. In an imbalanced data set, the weights of data points belonging to majority classes are set smaller than those of data points belonging to minority classes. As discussed in Sec. 2, the weighting changes the interpretation of the effective size of the data set in the resulting weighted loss function.
Batch normalization (BN) is a powerful regularization method for neural networks [BN2015] and has become one of the most important techniques in deep learning [DL2016]. In the BN of a specific mini-batch, the affine signals from units in lower layers are normalized over the data points in the mini-batch. A simple combination of the weighted loss function and BN causes a size-mismatch problem. As mentioned above, the weighted loss function changes the interpretation of the effective size of the data set. Therefore, the effective size of each mini-batch should be changed accordingly. In this study, we propose a simple modification to BN that corrects this size mismatch.
The remainder of this paper is organized as follows. In Sec. 2 and 3, we briefly explain the weighted loss function and BN, respectively. In Sec. 4, we present the proposed modification to BN and show the results of experiments using MNIST, a database of labeled images of handwritten digits. The results show that our method is effective in data-imbalanced environments. Finally, the conclusion is presented in Sec. 5.
2 Weighted Loss Function
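Let us consider a classification model that classifies an $n$-dimensional input vector $\mathbf{x} = (x_1, x_2, \ldots, x_n)^{\mathrm{T}}$ into $K$ different classes $C_1, C_2, \ldots, C_K$. It is convenient to use the 1-of-$K$ representation (or 1-of-$K$ coding) to identify each class [Bishop2006]. In the 1-of-$K$ representation, each class corresponds to a $K$-dimensional vector $\mathbf{t} = (t_1, t_2, \ldots, t_K)^{\mathrm{T}}$ whose elements satisfy $t_k \in \{0, 1\}$ and $\sum_{k=1}^{K} t_k = 1$, i.e., a vector in which only one element is one and the remaining elements are zero. When $t_k = 1$, $\mathbf{t}$ indicates class $C_k$. For simplicity of notation, we denote the 1-of-$K$ vector whose $k$th element is one by $\mathbf{1}_k$, so that $\mathbf{t} \in \{\mathbf{1}_1, \mathbf{1}_2, \ldots, \mathbf{1}_K\}$.
Suppose that a training data set comprising $N$ data points, $D = \{(\mathbf{x}^{(\mu)}, \mathbf{t}^{(\mu)}) \mid \mu = 1, 2, \ldots, N\}$, is given, where $\mathbf{x}^{(\mu)}$ and $\mathbf{t}^{(\mu)}$ are the $\mu$th input and the corresponding (1-of-$K$ represented) target-class label, respectively. Let us consider the minimization of the loss function given by:
$$L(\theta) := \frac{1}{N} \sum_{\mu=1}^{N} f\big(\mathbf{x}^{(\mu)}, \mathbf{t}^{(\mu)}; \theta\big), \tag{1}$$
where $f(\mathbf{x}^{(\mu)}, \mathbf{t}^{(\mu)}; \theta)$ expresses the loss (e.g., the cross-entropy loss) for the $\mu$th data point and $\theta$ denotes the set of learning parameters in the classification model.
A weighted loss function is introduced as:
$$L_w(\theta) := \frac{1}{Z} \sum_{\mu=1}^{N} w_\mu f\big(\mathbf{x}^{(\mu)}, \mathbf{t}^{(\mu)}; \theta\big), \tag{2}$$
where $w_\mu > 0$ is the weight of the $\mu$th data point and $Z := \sum_{\mu=1}^{N} w_\mu$. It can be viewed as a simple type of CS approach [He2009, CSM2010]. For an imbalanced data set, the weights are often set to the inverse class frequency [CSM2016, CSM2017, CSM2018]. Here, we assume that $D$ is an imbalanced data set, implying that the sizes $N_k$ are imbalanced (some $N_k$ are very large (majority classes) and some are very small (minority classes)), where $N_k$ is the number of data points belonging to class $C_k$, i.e., $N_k := \sum_{\mu=1}^{N} \delta(\mathbf{1}_k, \mathbf{t}^{(\mu)})$, where $\delta(\cdot, \cdot)$ is the Kronecker delta function. In this case, the weights are set to
$$w_\mu = \sum_{k=1}^{K} \frac{\delta(\mathbf{1}_k, \mathbf{t}^{(\mu)})}{N_k}. \tag{3}$$
The weights in Eq. (3) effectively correct the imbalance in the loss function caused by the imbalanced data set. In the weighted loss function with Eq. (3), the $\mu$th data point that belongs to class $C_k$ is effectively counted as $1/N_k$ data points, so that the effective size of $N_k$ becomes
$$N_k^{\mathrm{eff}} := \sum_{\mu=1}^{N} w_\mu \, \delta(\mathbf{1}_k, \mathbf{t}^{(\mu)}) = N_k \times \frac{1}{N_k} = 1. \tag{4}$$
This means that the effective sizes of $N_k$ (i.e., $N_k^{\mathrm{eff}}$) are balanced in the weighted loss function.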
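The inverse-class-frequency weighting of Eqs. (2)–(4) can be illustrated with a minimal NumPy sketch. The function names, the toy label vector, and the per-point loss values below are hypothetical and chosen only for illustration; integer class labels are assumed in place of the 1-of-$K$ vectors.

```python
import numpy as np

def inverse_class_frequency_weights(labels, num_classes):
    """Return one weight per data point: 1 / N_k for a point of class k (cf. Eq. (3))."""
    counts = np.bincount(labels, minlength=num_classes)  # class sizes N_k
    return 1.0 / counts[labels]

def weighted_loss(per_point_loss, weights):
    """Weighted average of per-point losses, normalized by the sum of weights (cf. Eq. (2))."""
    return np.sum(weights * per_point_loss) / np.sum(weights)

# Toy imbalanced labels (assumption): 8 points in class 0, 2 points in class 1.
labels = np.array([0] * 8 + [1] * 2)
weights = inverse_class_frequency_weights(labels, num_classes=2)

# Effective class sizes: the sum of the weights within each class (cf. Eq. (4)).
effective_sizes = np.bincount(labels, weights=weights, minlength=2)

# Hypothetical per-point losses: 0.1 on the majority class, 1.0 on the minority class.
losses = np.where(labels == 0, 0.1, 1.0)
plain = losses.mean()                      # unweighted loss, Eq. (1)
weighted = weighted_loss(losses, weights)  # weighted loss, Eq. (2)
```

In this toy case both classes have an effective size of one, so the minority-class loss contributes as much to the weighted loss as the majority-class loss, whereas the unweighted mean is dominated by the majority class.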
3 Batch Normalization
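Consider a standard neural network whose $\ell$th layer consists of $H_\ell$ units. In the standard feed-forward propagation for an input $\mathbf{x}$, the $j$th unit in the $\ell$th layer receives an affine signal from the units in the $(\ell-1)$th layer as:
$$\lambda_j^{(\ell)} = \sum_{i=1}^{H_{\ell-1}} W_{j,i}^{(\ell)} z_i^{(\ell-1)}, \tag{5}$$
where $W_{j,i}^{(\ell)}$ is the connection-weight parameter from the $i$th unit in the $(\ell-1)$th layer to the $j$th unit in the $\ell$th layer and $z_i^{(\ell-1)}$ is the output signal of the $i$th unit in the $(\ell-1)$th layer. After receiving the signal, the $j$th unit in the $\ell$th layer outputs:
$$z_j^{(\ell)} = a_\ell\big(b_j^{(\ell)} + \lambda_j^{(\ell)}\big), \tag{6}$$
where $b_j^{(\ell)}$ is the bias parameter of the unit and $a_\ell$ is the activation function of the $\ell$th layer.
For BN, the training data set is divided into $R$ mini-batches: $B_1, B_2, \ldots, B_R$. BN is based on mini-batch-wise stochastic gradient descent (SGD). During training with BN, in the feed-forward propagation for input $\mathbf{x}^{(\mu)}$ in mini-batch $B_r$, Eqs. (5) and (6) are replaced by:
$$u_j^{(\ell)} = \gamma_j^{(\ell)} \, \frac{\lambda_j^{(\ell)} - m_j^{(\ell)}}{\sqrt{v_j^{(\ell)} + \varepsilon}}, \qquad \lambda_j^{(\ell)} = \sum_{i=1}^{H_{\ell-1}} W_{j,i}^{(\ell)} z_i^{(\ell-1)}, \tag{7}$$
$$z_j^{(\ell)} = a_\ell\big(b_j^{(\ell)} + u_j^{(\ell)}\big), \tag{8}$$
where
$$m_j^{(\ell)} := \frac{1}{|B_r|} \sum_{\mu \in B_r} \lambda_{j,\mu}^{(\ell)}, \tag{9}$$
$$v_j^{(\ell)} := \frac{1}{|B_r| - 1} \sum_{\mu \in B_r} \big(\lambda_{j,\mu}^{(\ell)} - m_j^{(\ell)}\big)^2 \tag{10}$$
are the expectation and (unbiased) variance of the affine signals over the training inputs in $B_r$, $\lambda_{j,\mu}^{(\ell)}$ denotes the affine signal for the $\mu$th input, and $|B_r|$ denotes the size of $B_r$. Note that $\varepsilon$ in Eq. (7) is a small positive value to avoid division by zero and it is usually set to . At each layer, the distribution of the affine signals, $\lambda_j^{(\ell)}$, is standardized by Eq. (7). In training with BN, $\gamma_j^{(\ell)}$ in Eq. (7) are also learning parameters, together with $W_{j,i}^{(\ell)}$ and $b_j^{(\ell)}$, and are determined by an appropriate back-propagation. In the inference stage for a new input, we use $\bar{m}_j^{(\ell)}$ and $\bar{v}_j^{(\ell)}$ (i.e., the values of $m_j^{(\ell)}$ and $v_j^{(\ell)}$ averaged over the mini-batches) instead of Eqs. (9) and (10), respectively.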
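The training-time and inference-time normalization described above can be sketched for a single layer as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the value `eps=1e-5`, and the random toy signals are assumptions, and the bias and activation of Eq. (8) are omitted.

```python
import numpy as np

def batch_norm_train(signals, gamma, eps=1e-5):
    """Standardize affine signals over one mini-batch (cf. Eq. (7)).

    signals: array of shape (batch_size, num_units).
    eps: small positive constant; 1e-5 is an assumed value.
    """
    mean = signals.mean(axis=0)        # mini-batch mean, cf. Eq. (9)
    var = signals.var(axis=0, ddof=1)  # unbiased mini-batch variance, cf. Eq. (10)
    normalized = gamma * (signals - mean) / np.sqrt(var + eps)
    return normalized, mean, var

def batch_norm_infer(signals, gamma, mean_avg, var_avg, eps=1e-5):
    """Normalize new inputs with statistics averaged over the mini-batches."""
    return gamma * (signals - mean_avg) / np.sqrt(var_avg + eps)

rng = np.random.default_rng(0)
signals = rng.normal(loc=3.0, scale=2.0, size=(64, 5))  # toy affine signals
gamma = np.ones(5)

# Training: normalize a full batch of 64 points.
out, mean, var = batch_norm_train(signals, gamma)

# Inference: average the statistics of four mini-batches of 16 points each.
batches = np.split(signals, 4)
stats = [batch_norm_train(b, gamma)[1:] for b in batches]
mean_avg = np.mean([m for m, _ in stats], axis=0)
var_avg = np.mean([v for _, v in stats], axis=0)
test_out = batch_norm_infer(signals[:3], gamma, mean_avg, var_avg)
```

After training-time normalization with `gamma` equal to one, each unit's signals have approximately zero mean and unit variance within the mini-batch.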
4 Proposed Method and Numerical Experiment
In this section, we consider BN when the weighted loss function in Eq. (2) is employed. The idea of our method is simple. In BN, the affine signals are normalized on the basis of the sizes of the mini-batches. As mentioned in Sec. 2, in the weighted loss function, the interpretation of the effective size of the data set changes in accordance with the corresponding weights. Thus, to remain consistent with this reinterpretation, the sizes of the mini-batches used in BN should be revised in the same way. This implies that the mini-batch expectation and variance in Eqs. (9) and (10) should be computed on the basis of the effective mini-batch size determined by the weights, which leads to Eqs. (11) and (12).
Because our proposed method is realized by replacing Eqs. (9) and (10) with Eqs. (11) and (12), respectively, it does not introduce any additional hyperparameters. This replacement slightly changes the form of back-propagation. However, these details are omitted owing to space limitations.
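One way to read the idea above is that each data point enters the mini-batch statistics in proportion to its weight. The sketch below is an assumption-laden illustration of that reading, not a reproduction of Eqs. (11) and (12): it uses a weighted mean and a weighted variance normalized by the sum of the weights in the mini-batch, and the exact normalization in the paper may differ. The function name and toy values are hypothetical.

```python
import numpy as np

def weighted_batch_stats(signals, weights):
    """Weighted mean and variance of affine signals over one mini-batch.

    signals: array of shape (batch_size, num_units).
    weights: array of shape (batch_size,), the loss weights of the data points.
    Assumption: statistics are normalized by the sum of weights in the mini-batch.
    """
    w = weights / weights.sum()       # normalized weights within the mini-batch
    mean = w @ signals                # weighted mean per unit
    var = w @ (signals - mean) ** 2   # weighted variance per unit
    return mean, var

# Toy mini-batch (assumption): three data points, two units, integer weights.
signals = np.array([[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]])
weights = np.array([1.0, 2.0, 3.0])
mean_w, var_w = weighted_batch_stats(signals, weights)

# Integer weights behave like replicating each data point that many times.
replicated = np.repeat(signals, [1, 2, 3], axis=0)
mean_rep = replicated.mean(axis=0)
var_rep = replicated.var(axis=0)
```

With integer weights, the weighted statistics coincide with the ordinary statistics of a mini-batch in which each point is replicated according to its weight, which matches the replication interpretation of the weights given in Sec. 2. With uniform weights, the sketch reduces to the ordinary mean and (biased) variance.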
Next, we present the results of our numerical experiments. In the experiments, we used MNIST, which is a database of handwritten digits, $\{0, 1, \ldots, 9\}$. Each data point includes the input data, a digit image, and the corresponding target digit label. Using MNIST, we executed three types of two-class classification problems with strongly imbalanced data: (i) the classification of “zero” and “one”, (ii) the classification of “one” and “eight”, and (iii) the classification of “four” and “seven”. The numbers of training and test data points used in these experiments, which were randomly selected from MNIST, are shown in Tab. 1. For the experiments, we used a three-layered neural network with 784 input units, 200 hidden units with rectified linear unit (ReLU) activation [ReLU2011], and two softmax output units. We adopted BN for the hidden and output layers. In training the neural network, we used Xavier initialization [Xavier2010], the Adam optimizer [Adam2015], and set .
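For orientation, the forward pass of a network with the stated layer sizes (784 inputs, 200 ReLU hidden units, two softmax outputs, BN on the hidden and output layers) might look like the following NumPy sketch. It is not the authors' code: the uniform variant of Xavier initialization, the value of `eps`, the parameter names, and the random input batch are assumptions, and training with Adam is not shown.

```python
import numpy as np

def xavier_uniform(rng, fan_in, fan_out):
    """Xavier (Glorot) uniform initialization; the uniform variant is an assumption."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def batch_norm(signals, gamma, eps=1e-5):
    """Standard BN over the mini-batch with unbiased variance; eps is an assumed value."""
    mean = signals.mean(axis=0)
    var = signals.var(axis=0, ddof=1)
    return gamma * (signals - mean) / np.sqrt(var + eps)

def forward(x, params):
    """Forward pass: BN + ReLU hidden layer, then BN + softmax output layer."""
    hidden = batch_norm(x @ params["W1"], params["gamma1"]) + params["b1"]
    hidden = np.maximum(hidden, 0.0)                  # ReLU activation
    out = batch_norm(hidden @ params["W2"], params["gamma2"]) + params["b2"]
    out = out - out.max(axis=1, keepdims=True)        # stabilize the softmax
    exp_out = np.exp(out)
    return exp_out / exp_out.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
params = {
    "W1": xavier_uniform(rng, 784, 200), "b1": np.zeros(200), "gamma1": np.ones(200),
    "W2": xavier_uniform(rng, 200, 2), "b2": np.zeros(2), "gamma2": np.ones(2),
}
x = rng.random((32, 784))  # hypothetical mini-batch standing in for MNIST images
probs = forward(x, params)
```

Each row of `probs` holds the two class probabilities for one input in the mini-batch.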
The results of the three experiments, (i), (ii), and (iii), are shown in Tab. 2. For each experiment, we used three different methods: (a) the standard loss function in Eq. (1) with the standard BN explained in Sec. 3 (LF+sBN), (b) the weighted loss function in Eq. (2) with the standard BN (WLF+sBN), and (c) the weighted loss function in Eq. (2) with the proposed BN (WLF+pBN). The weights in the weighted loss function were set according to Eq. (3). The proposed method (WLF+pBN) performed best in all experiments.
|(c) WLF+pBN (proposed)||99.8%||87.5%||93.2%||83%||100%||92.7%||49.3%||99.4%||74.9%|
5 Concluding Remarks
In this paper, we proposed a modification to BN for the weighted loss function and applied our method to imbalanced data sets. The idea of the proposed method is simple but essential. Our method improved the classification accuracies in the experiments.
Recently, a new type of weighted loss function (referred to as the class-balanced loss function), which is effective for imbalanced data, was proposed [CBL2019]. In this function, the weights are set to:
$$w_\mu = \frac{1 - \beta}{1 - \beta^{N_k}}$$
for $\mathbf{t}^{(\mu)} = \mathbf{1}_k$, where $\beta \in [0, 1)$. The class-balanced loss function is equivalent to Eq. (1) when $\beta = 0$, and to the weighted loss function with the inverse-class-frequency weights used in this study when $\beta \to 1$, except for a difference in the constant factor. The form of the weights $w_\mu$ is not restricted in our method, which can therefore also be applied to the class-balanced loss function. We will address this in future studies.
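The two limiting cases of the class-balanced weights can be checked numerically with a short sketch. The function name, the class sizes, and the value of $\beta$ close to one are hypothetical choices for illustration.

```python
import numpy as np

def class_balanced_weights(class_counts, beta):
    """Class-balanced weight (1 - beta) / (1 - beta**N_k) for each class size N_k."""
    counts = np.asarray(class_counts, dtype=float)
    return (1.0 - beta) / (1.0 - beta ** counts)

# Hypothetical class sizes: a majority class of 5000 and a minority class of 50.
counts = [5000, 50]

# beta = 0: every class receives weight one, as in the unweighted loss.
w_uniform = class_balanced_weights(counts, beta=0.0)

# beta close to one: the weights approach the inverse class frequencies 1 / N_k,
# so the minority-to-majority weight ratio approaches 5000 / 50 = 100.
w_near_one = class_balanced_weights(counts, beta=1.0 - 1e-9)
ratio = w_near_one[1] / w_near_one[0]
```

Intermediate values of $\beta$ interpolate between these two cases, giving minority classes a weight ratio between 1 and $N_{\mathrm{majority}} / N_{\mathrm{minority}}$.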
This work was partially supported by JSPS KAKENHI (Grant Numbers: 15H03699, 18K11459, and 18H03303), JST CREST (Grant Number: JPMJCR1402), and the COI Program from the JST (Grant Number JPMJCE1312).