1 Introduction
Learning from an imbalanced dataset is one of the most important and practical problems in machine learning, and it has been actively studied [8, 9]. This study focuses on neural-network-based classification problems for imbalanced datasets. Several methods have been proposed for these problems [9], for example, under/oversampling, SMOTE [2], and cost-sensitive (CS) methods. We consider a CS-based method in this study. A simple type of CS method assigns a weight to the data loss of each data point in the loss function [9, 4]. The weights in the CS method can be viewed as the importances of the corresponding data points. As discussed in Sec. 2, the weighting changes the interpretation of the effective size of each data point in the resultant weighted loss function (WLF). In an imbalanced dataset, the weights of data points that belong to the majority classes are often set to be smaller than those of data points that belong to the minority classes. For imbalanced datasets, two different effective settings of the weights are known: inverse class frequency (ICF) [11, 15, 12] and class-balanced loss (CBL) [3].
Batch normalization (BN) is a powerful regularization method for neural networks [13], and it has been one of the most important techniques in deep learning [7]. In BN, the affine signals from the lower layer are normalized over a specific minibatch. As discussed in this paper, a simple combination of WLF and BN causes a size-inconsistency problem, which degrades the classification performance. As mentioned above, the interpretation of the effective size of data points is changed in WLF. However, in the standard scenario of BN, one data point is counted as just one data point. This inconsistency gives rise to the size-inconsistency problem.
The aim of this study is not to find a better learning method in a data-imbalanced environment, but to resolve the size-inconsistency problem in the simple combination of WLF and BN. In this study, we propose a consistent BN to resolve the size-inconsistency problem and demonstrate, using numerical experiments, that the proposed consistent BN improves the classification performance. This paper is an extension of our previous study [16], and it includes some new improvements: (i) the theory is improved (i.e., the definition of signal variance in the proposed method is improved) and (ii) more experiments are conducted (i.e., for a new dataset and for a new WLF setting).
The remainder of this paper is organized as follows. In Sec. 2, we briefly explain WLF and the two different settings of the weights, i.e., ICF and CBL. In Sec. 3, we give a brief explanation of BN and show, using numerical experiments on the Iris and MNIST databases, the performance degradation in a simple combination of WLF and BN (caused by the size-inconsistency problem). In Sec. 4, we propose a consistent BN, which can resolve the size-inconsistency problem, and justify the proposed method by using numerical experiments. As shown in the numerical experiments in Sec. 4.2, our method improves the classification performance. Finally, the summary and some future works are presented in Sec. 5. In the following sections, the definition sign “:=” is used only when the expression does not change throughout the paper. In the definitions of parameters that are redefined somewhere in the paper, we do not use the definition sign.
2 Weighted Loss Function in Imbalanced Data Environment
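Let us consider a classification model that classifies an $n$-dimensional input vector $\mathbf{x}$ into $K$ different classes $C_1, C_2, \ldots, C_K$. It is convenient to use a 1-of-$K$ vector (or one-hot vector) to identify each class [1]. Here, each class corresponds to the $K$-dimensional vector $\mathbf{t} = (t_1, t_2, \ldots, t_K)^{\mathrm{T}}$ whose elements satisfy $t_k \in \{0, 1\}$ and $\sum_{k=1}^{K} t_k = 1$, i.e., a vector in which only one element is one and the remaining elements are zero. When $t_k = 1$, $\mathbf{t}$ indicates class $C_k$. For simplicity of notation, we denote the 1-of-$K$ vector whose $k$th element is one by $\mathbf{1}_k$, so that $\mathbf{t} \in \{\mathbf{1}_1, \mathbf{1}_2, \ldots, \mathbf{1}_K\}$. Thus, $\mathbf{1}_k$ corresponds to class $C_k$.
Suppose that a training dataset comprising $N$ data points, $D := \{(\mathbf{x}_\mu, \mathbf{t}_\mu) \mid \mu = 1, 2, \ldots, N\}$, is given, where $\mathbf{x}_\mu$ and $\mathbf{t}_\mu$ are the $\mu$th input and the corresponding target-class label represented by the 1-of-$K$ vector, respectively. We denote the number of data points belonging to class $C_k$ by $N_k$, which is defined as $N_k := \sum_{\mu=1}^{N} \delta(\mathbf{1}_k, \mathbf{t}_\mu)$, where $\delta(a, b)$ is the Kronecker delta. By the definition, $N = \sum_{k=1}^{K} N_k$. In the standard machine-learning scenario, we minimize a loss function given by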
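$$L(\theta) := \frac{1}{N} \sum_{\mu=1}^{N} f(\mathbf{x}_\mu, \mathbf{t}_\mu; \theta) \tag{1}$$
with respect to $\theta$ using an appropriate backpropagation algorithm, where $f(\mathbf{x}_\mu, \mathbf{t}_\mu; \theta)$ denotes the data loss (e.g., the cross-entropy loss) for the $\mu$th data point and $\theta$ denotes the set of learning parameters in the classification model.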
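WLF is introduced as
$$L_w(\theta) := \frac{1}{Z} \sum_{\mu=1}^{N} w_\mu f(\mathbf{x}_\mu, \mathbf{t}_\mu; \theta), \tag{2}$$
where $w_\mu > 0$ is the weight of the $\mu$th data point and $Z := \sum_{\mu=1}^{N} w_\mu$ is the normalization factor. It can be viewed as a simple type of CS approach [9, 4]. In WLF, the $\mu$th data point is effectively replicated to $w_\mu$ data points. Hence, the relative occupancy (or the effective size) of the data points in each class is changed according to the weights. In WLF, the “effective” size of $N_k$ is then expressed by
$$N_k^{\mathrm{eff}} := \sum_{\mu=1}^{N} w_\mu \, \delta(\mathbf{1}_k, \mathbf{t}_\mu). \tag{3}$$
In general,
$$Z = \sum_{k=1}^{K} N_k^{\mathrm{eff}} \tag{4}$$
holds.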
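The relation between the weights, the weighted loss, and the effective class sizes can be checked on a toy example. The following is a minimal sketch, not the authors' implementation: the function names, labels, losses, and weights are all hypothetical, and the per-point losses are simply given as numbers.

```python
from collections import Counter

def weighted_loss(losses, weights):
    """Weighted loss: weighted sum of per-point losses divided by the sum of weights."""
    z = sum(weights)  # normalization factor
    return sum(w * f for w, f in zip(weights, losses)) / z

def effective_sizes(labels, weights):
    """Effective size of each class: sum of the weights of its data points."""
    sizes = Counter()
    for k, w in zip(labels, weights):
        sizes[k] += w
    return dict(sizes)

# Toy data (hypothetical): class 0 is the majority, class 1 the minority.
labels = [0, 0, 0, 0, 1]
losses = [0.2, 0.1, 0.3, 0.2, 1.5]
weights = [0.5, 0.5, 0.5, 0.5, 2.0]

n_eff = effective_sizes(labels, weights)   # both classes end up with effective size 2.0
loss_w = weighted_loss(losses, weights)

# With uniform weights the weighted loss reduces to the plain average loss.
loss_plain = weighted_loss(losses, [1.0] * len(losses))
```

In this example the four majority points together count as much as the single minority point, and the effective sizes sum to the normalization factor.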
The weights should be set individually according to the task. For an imbalanced dataset (i.e., some $N_k$ are very large (majority classes) and some are very small (minority classes)), there are two known settings: ICF [11, 15, 12] and CBL [3]. In the following sections, we briefly explain the two settings.
2.1 Inverse class frequency
Here, we assume that $D$ is an imbalanced dataset, implying that the class sizes $N_k$ are imbalanced. In ICF, the weights are set to
(5) 
where is assumed. The ICF weights effectively correct the imbalance in the loss function due to the imbalanced dataset. In WLF with ICF, the th data point that belongs to class is replicated to data points. Therefore, in this setting, the effective sizes of are equal to for any :
where and . This means that the effective sizes of (i.e., ) are completely balanced in WLF (see Fig. 1).
From Eq. (4), in ICF.
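The balancing effect of ICF can be illustrated with a short sketch. It assumes only that each weight is inversely proportional to the size of the class the data point belongs to; the overall constant `c` is a hypothetical placeholder, since any positive constant balances the effective sizes in the same way.

```python
from collections import Counter

def icf_weights(labels, c=1.0):
    """Weight each point by c / N_k, where N_k is the size of its class.

    The constant c is a hypothetical placeholder for the overall scale.
    """
    counts = Counter(labels)
    return [c / counts[k] for k in labels]

labels = [0] * 6 + [1] * 2          # toy imbalanced labels: N_0 = 6, N_1 = 2
weights = icf_weights(labels)

# Effective size of each class: sum of the weights of its points.
n_eff = Counter()
for k, w in zip(labels, weights):
    n_eff[k] += w
```

Both classes obtain the same effective size `c`, so the sum of all weights equals `c` times the number of classes.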
2.2 Class-balanced loss
Recently, a new type of WLF (referred to as class-balanced loss (CBL)), which is effective for imbalanced datasets, was proposed [3]. In CBL, the weights in Eq. (2) are set to
$$w_\mu = \frac{1 - \beta}{1 - \beta^{N_k}} \tag{6}$$
for $\mathbf{t}_\mu = \mathbf{1}_k$, where $\beta \in [0, 1)$. In CBL, $N_k > 0$ is also assumed. The CBL weights correct the effective sizes of data points by taking the overlaps of data points into account. Here, $\beta$ is a hyperparameter that smoothly connects the loss function in Eq. (1) and WLF with ICF. WLF with CBL is equivalent to Eq. (1) when $\beta = 0$, and it is equivalent to WLF with ICF when $\beta \to 1$, except for the difference in the constant factor:
$$\lim_{\beta \to 1} \frac{1 - \beta}{1 - \beta^{N_k}} = \frac{1}{N_k}.$$
Hence, CBL can be viewed as an extension of ICF. In Ref. [3], it is recommended that $\beta$ has a value close to one. In CBL, the effective sizes of $N_k$ are given by
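$$N_k^{\mathrm{eff}} = N_k \, \frac{1 - \beta}{1 - \beta^{N_k}}. \tag{7}$$
As mentioned, $w_\mu \to 1/N_k$ in the limit of $\beta \to 1$; therefore, $N_k^{\mathrm{eff}}$ are completely balanced in this limit. In fact, $N_k^{\mathrm{eff}} \to 1$ when $\beta \to 1$. The complete balance gradually begins to break as $\beta$ decreases. When $1 - \beta$ is very small (i.e., $\beta$ is very close to one), Eq. (7) is expanded as
$$N_k^{\mathrm{eff}} = 1 + \frac{N_k - 1}{2}(1 - \beta) + O\big((1 - \beta)^2\big).$$
Therefore, when $N_k (1 - \beta) \ll 1$ and $N_l (1 - \beta) \ll 1$,
$$N_k^{\mathrm{eff}} - N_l^{\mathrm{eff}} \approx \frac{1 - \beta}{2}\,(N_k - N_l).$$
This means that $N_k^{\mathrm{eff}}$ and $N_l^{\mathrm{eff}}$ are almost balanced as compared with $N_k$ and $N_l$ when $1 - \beta$ is very small.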
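The two limits of CBL can be checked numerically. The sketch below uses the class-balanced weight of Ref. [3], $(1-\beta)/(1-\beta^{N_k})$; the class sizes mirror the training set of MNIST problem (I), while the value $\beta = 0.9999$ and the function names are illustrative assumptions rather than settings taken from the experiments.

```python
def cbl_weight(n_k, beta):
    """Class-balanced weight (1 - beta) / (1 - beta**n_k) for a class of size n_k."""
    return (1.0 - beta) / (1.0 - beta ** n_k)

def cbl_effective_size(n_k, beta):
    """Effective class size: n_k points, each carrying the class-balanced weight."""
    return n_k * cbl_weight(n_k, beta)

n_major, n_minor = 5923, 45   # majority and minority class sizes
beta = 0.9999                 # hypothetical value close to one

# beta = 0: every weight is one, so the effective sizes equal the raw sizes.
at_zero = [cbl_effective_size(n, 0.0) for n in (n_major, n_minor)]

# beta close to one: both effective sizes are close to one, i.e., nearly balanced.
near_one = [cbl_effective_size(n, beta) for n in (n_major, n_minor)]

# First-order expansion in (1 - beta), accurate when n_k * (1 - beta) is small.
first_order = 1.0 + 0.5 * (n_minor - 1) * (1.0 - beta)
```

The raw sizes differ by a factor of more than one hundred, whereas the effective sizes near $\beta = 1$ differ by well under a factor of two; the first-order expansion is accurate for the minority class but only rough for the majority class, for which $N_k(1-\beta)$ is not small.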
3 Batch Normalization
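Consider a standard neural network for classification whose $\ell$th layer consists of $H_\ell$ units ($\ell = 0, 1, \ldots, L$). The zeroth layer ($\ell = 0$) is the input layer, i.e., $H_0 = n$, and the network output, $\mathbf{y}$, is determined from the output signals of the $L$th layer. In the standard scenario for the feedforward propagation of input $\mathbf{x}$ in the neural network, for $\ell = 1, 2, \ldots, L$, the $j$th unit in the $\ell$th layer receives an affine signal from the $(\ell-1)$th layer as
$$\lambda_j^{(\ell)} = \sum_{i=1}^{H_{\ell-1}} W_{j,i}^{(\ell)} z_i^{(\ell-1)}, \tag{8}$$
where $W_{j,i}^{(\ell)}$ is the directed-connection parameter from the $i$th unit in the $(\ell-1)$th layer to the $j$th unit in the $\ell$th layer, and $z_i^{(\ell-1)}$ is the output signal of the $i$th unit in the $(\ell-1)$th layer, where $z_i^{(0)}$ is identified as $x_i$. After receiving the affine signal, the $j$th unit in the $\ell$th layer outputs
$$z_j^{(\ell)} := a_\ell\big(b_j^{(\ell)} + \lambda_j^{(\ell)}\big) \tag{9}$$
to the upper layer, where $b_j^{(\ell)}$ is the bias parameter of the unit and $a_\ell$ is the specific activation function of the $\ell$th layer. In the classification, for input $\mathbf{x}$, the network output (the 1-of-$K$ vector) is usually determined through the class probabilities, $p_1, p_2, \ldots, p_K$, which are obtained via the softmax operation for $z_k^{(L)}$ ($k = 1, 2, \ldots, K$):
$$p_k := \frac{\exp\big(z_k^{(L)}\big)}{\sum_{k'=1}^{K} \exp\big(z_{k'}^{(L)}\big)}. \tag{10}$$
The network output is the 1-of-$K$ vector whose $k^*$th element is one, where $k^* = \arg\max_k p_k$. The feedforward propagation is illustrated in Fig. 2.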
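The feedforward propagation described above can be sketched in a few lines. This is an illustrative toy, not the network used in the experiments: the layer sizes, parameter values, and the identity activation in the last layer are assumptions, and the bias is added to the affine signal just before the activation, as in the text.

```python
import math

def softmax(z):
    """Softmax of a list of real numbers (shifted by the maximum for stability)."""
    z_max = max(z)
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def relu(x):
    return max(0.0, x)

def layer_forward(z_prev, W, b, activation):
    """One layer: affine signal from the lower layer, then bias and activation."""
    out = []
    for W_j, b_j in zip(W, b):
        lam_j = sum(w * z for w, z in zip(W_j, z_prev))  # affine signal (no bias)
        out.append(activation(b_j + lam_j))              # bias enters before the activation
    return out

def predict(x, layers):
    """Propagate x through all layers; return class probabilities and the one-hot output."""
    z = x
    for W, b, activation in layers:
        z = layer_forward(z, W, b, activation)
    p = softmax(z)
    k_star = max(range(len(p)), key=p.__getitem__)
    one_hot = [1 if k == k_star else 0 for k in range(len(p))]
    return p, one_hot

# Hypothetical toy network: 2 inputs -> 3 ReLU units -> 2 linear units.
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]], [0.0, 0.1, -0.2], relu),
    ([[1.0, 0.0, -1.0], [-1.0, 1.0, 1.0]], [0.0, 0.0], lambda v: v),
]
probs, output = predict([1.0, 2.0], layers)
```

For this input the second class receives almost all of the probability mass, so the one-hot output selects it.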
Learning with BN is based on minibatch-wise stochastic gradient descent (SGD) [13]. In BN, the training dataset, $D$, is divided into $R$ minibatches: $B_1, B_2, \ldots, B_R$. In the following sections, we briefly explain the feedforward and back propagation during training in the $\ell$th layer for minibatch $B_r$. Although all the signals appearing in the explanation depend on the minibatch index $r$, we omit the explicit description of the dependence on $r$ unless there is a particular reason.
3.1 Feedforward propagation
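In the feedforward propagation for input $\mathbf{x}_\mu$ ($\mu \in B$) of the $\ell$th layer with BN, the form of $\lambda_j^{(\ell)}$ in Eq. (8) is replaced by
$$\lambda_{j,\mu}^{(\ell)} = \gamma_j^{(\ell)} \, \frac{u_{j,\mu}^{(\ell)} - m_j^{(\ell)}}{\sqrt{v_j^{(\ell)} + \varepsilon}}, \tag{11}$$
where $u_{j,\mu}^{(\ell)} := \sum_{i=1}^{H_{\ell-1}} W_{j,i}^{(\ell)} z_{i,\mu}^{(\ell-1)}$ is the affine signal of Eq. (8) for the $\mu$th data point, and
$$m_j^{(\ell)} = \frac{1}{M} \sum_{\mu \in B} u_{j,\mu}^{(\ell)} \tag{12}$$
and
$$v_j^{(\ell)} = \frac{1}{M - 1} \sum_{\mu \in B} \big(u_{j,\mu}^{(\ell)} - m_j^{(\ell)}\big)^2 \tag{13}$$
are the mean and (unbiased) variance of the affine signals over $B$, respectively. Here, $M$ denotes the size of $B$ and $M \geq 2$ is assumed. The constant $\varepsilon$ in Eq. (11) is set to a negligibly small positive value to avoid zero division. In BN, the distribution of the affine signals over $B$ is standardized. This means that the mean and (unbiased) variance of the normalized signals $\hat{u}_{j,\mu}^{(\ell)} := \big(u_{j,\mu}^{(\ell)} - m_j^{(\ell)}\big) / \sqrt{v_j^{(\ell)} + \varepsilon}$ over $B$ are always zero and one (when $\varepsilon \to 0$), respectively:
$$\frac{1}{M} \sum_{\mu \in B} \hat{u}_{j,\mu}^{(\ell)} = 0, \qquad \frac{1}{M - 1} \sum_{\mu \in B} \big(\hat{u}_{j,\mu}^{(\ell)}\big)^2 = 1. \tag{14}$$
The factors $\gamma_j^{(\ell)}$ in Eq. (11) are also learning parameters, along with $W_{j,i}^{(\ell)}$ and $b_j^{(\ell)}$, which are determined via an appropriate backpropagation algorithm.
It is noteworthy that in the inference stage (after training), we use $\bar{m}_j^{(\ell)} := R^{-1} \sum_{r=1}^{R} m_{j,r}^{(\ell)}$ and $\bar{v}_j^{(\ell)} := R^{-1} \sum_{r=1}^{R} v_{j,r}^{(\ell)}$ (i.e., the sample average values over all the $R$ minibatches) instead of $m_j^{(\ell)}$ and $v_j^{(\ell)}$, respectively, in Eq. (11), where $m_{j,r}^{(\ell)}$ and $v_{j,r}^{(\ell)}$ are the mean and variance, respectively, computed for minibatch $B_r$, namely, they are identified as Eqs. (12) and (13), respectively.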
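The difference between the training and inference stages of BN can be sketched for a single unit as follows. The minibatch values, the scale `gamma`, and the constant `EPS` are hypothetical; the statistics are the minibatch mean and unbiased variance during training and their averages over all minibatches at inference, as described above.

```python
import math

EPS = 1e-8  # hypothetical value; the text only requires a negligibly small positive constant

def batch_stats(u):
    """Mean and unbiased variance of the affine signals of one unit over a minibatch."""
    m = sum(u) / len(u)
    v = sum((x - m) ** 2 for x in u) / (len(u) - 1)
    return m, v

def bn_transform(u, m, v, gamma=1.0, eps=EPS):
    """Standardize the signals with the given statistics and rescale by gamma."""
    return [gamma * (x - m) / math.sqrt(v + eps) for x in u]

# Training: each minibatch is normalized with its own statistics.
minibatches = [[1.0, 2.0, 4.0, 5.0], [0.0, 2.0, 2.0, 4.0]]
stats = [batch_stats(u) for u in minibatches]
train_out = [bn_transform(u, m, v) for u, (m, v) in zip(minibatches, stats)]

# Inference: average the per-minibatch statistics and reuse them for any input.
m_avg = sum(m for m, _ in stats) / len(stats)
v_avg = sum(v for _, v in stats) / len(stats)
test_out = bn_transform([3.0], m_avg, v_avg)
```

Within each training minibatch the normalized signals have mean zero and unbiased variance one (up to the effect of `EPS`), whereas a test input is normalized with fixed statistics that do not depend on the other test inputs.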
3.2 Back propagation and gradients of loss
In this section, we briefly explain the backpropagation rule with or without BN. For expressing the backpropagation rule, we define a backpropagating signal of the th unit in the th layer for the th data point () as
(15) 
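where $L_B$ is the loss over minibatch $B$ and $f_\mu$ is the loss for the $\mu$th data point, i.e., $f_\mu = f(\mathbf{x}_\mu, \mathbf{t}_\mu; \theta)$ for Eq. (1) or $f_\mu = w_\mu f(\mathbf{x}_\mu, \mathbf{t}_\mu; \theta)$ for Eq. (2). Here, the normalization factor of $L_B$ is the minibatch size $M$ for Eq. (1) or $Z_B$ for Eq. (2), where
$$Z_B := \sum_{\mu \in B} w_\mu. \tag{16}$$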
First, we show the backpropagation rule for the $\ell$th layer “without” BN, i.e., the case in which the form of the affine signal is defined by Eq. (8). In this case, the backpropagation rule is obtained by using the chain rule and Eqs. (8) and (9):
(17) 
for , where . Note that is obtained by directly differentiating by . By using the backpropagating signals, the gradients of with respect to and are expressed as
(18)  
(19) 
respectively, for .
Next, we show the expression of the backpropagation rule for the $\ell$th layer “with” BN, i.e., the case in which the form of the affine signal is defined by Eq. (11). Using a manipulation similar to that for Eq. (17) (i.e., using the chain rule and Eqs. (11) and (9)), we obtain
(20) 
for , where
In this case, the gradients of with respect to , , and are as follows: The gradient has the same expression as Eq. (18). The gradients with respect to and are
(21)  
(22) 
respectively, for .
3.3 Numerical experiment
In the experiments in this section, we used two different datasets, namely Iris and MNIST. Iris is a database for the classification of three types of irises, in which each data point consists of four-dimensional input data (sepal length, sepal width, petal length, and petal width) and the corresponding target iris label (setosa, versicolor, or virginica). MNIST is a database of handwritten digits, 0 to 9. Each data point in MNIST includes the input data, a digit image, and the corresponding target digit label.
For the two different datasets, we considered some imbalanced two-class classification problems. For Iris, (i) the classification problem of “versicolor” and “virginica” and (ii) the classification problem of “versicolor” and “setosa” were considered. For MNIST, (I) the classification problem of “one” and “zero”, (II) the classification problem of “one” and “eight”, and (III) the classification problem of “four” and “seven” were considered. The numbers of training and test data points used in these experiments, which were randomly picked from Iris and MNIST, are shown in Tabs. 1 and 2, respectively. In the experiments for MNIST, the minority data account for less than 1% of all data. All the input data were standardized in the preprocessing. For the experiments, we used a three-layered neural network ($L = 2$): $n$ input units, $H_1$ hidden units with rectified linear unit (ReLU) activation ($a_1(x) = \max(0, x)$) [6] (i.e., first hidden layer), and $H_2$ hidden units (i.e., second hidden layer). The network output is computed from the output signals of the second hidden layer via the softmax operation as explained in the first part of this section. We adopted BN for the first and second hidden layers. In the training of the neural network, we used the He initialization [10] for the first hidden layer and the Xavier initialization [5] for the second hidden layer, and used the Adam optimizer [14]. The cross-entropy loss, $f(\mathbf{x}, \mathbf{t}; \theta) = -\sum_{k=1}^{K} t_k \ln p_k$, was used, where $p_k$ is the class probability defined in Eq. (10). We set and for Iris and and for MNIST. The minibatch sizes were and for Iris and MNIST, respectively.
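Table 1: Numbers of training and test data points for the Iris problems (i) and (ii).

| | (i) versicolor (minority) | (i) virginica (majority) | (ii) versicolor (minority) | (ii) setosa (majority) |
|---|---|---|---|---|
| train | 6 | 50 | 5 | 50 |
| test | 50 | 50 | 50 | 50 |

Table 2: Numbers of training and test data points for the MNIST problems (I), (II), and (III).

| | (I) 1 (minority) | (I) 0 (majority) | (II) 1 (minority) | (II) 8 (majority) | (III) 4 (minority) | (III) 7 (majority) |
|---|---|---|---|---|---|---|
| train | 45 | 5923 | 45 | 5851 | 39 | 6265 |
| test | 1135 | 980 | 1135 | 974 | 982 | 1028 |
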
The results of the experiments for the Iris dataset ((i) and (ii)) and for the MNIST dataset ((I), (II), and (III)) are shown in Tabs. 3 and 4, respectively. For each dataset, we used three different methods: (a) the standard loss function, $L(\theta)$ in Eq. (1), combined with BN (LF+BN), (b) WLF, $L_w(\theta)$ in Eq. (2), with ICF combined with BN (WLF(ICF)+BN), and (c) WLF with CBL combined with BN (WLF(CBL)+BN). The accuracies shown in Tabs. 3 and 4 are the average values over 10 experiments (in all the experiments, the training and test datasets were fixed). In each experiment, we chose the best model, from the perspective of the classification accuracy for the test set, obtained during 100 epochs of training. As expected, the classification accuracies for the majority classes are very good and those for the minority classes are poor. However, most of the results of WLF are worse than those of the standard loss function (7 of the 10 minority-class accuracies in Tabs. 3 and 4). This means that the correction for the effective data size in WLF does not work well in imbalanced datasets.
The results of (a) and (b) in Tab. 4 are slightly worse than those in the previous study [16] because some experimental settings were different. For example, in the previous study, we constructed the minibatches in such a way that every minibatch was guaranteed to include data points from the minority class, but we did not do so in the current experiments. Furthermore, the results shown in the previous study were not average values but the best ones obtained from several experiments. However, this performance difference is not essential for our main claim in this paper. Because our aim is to resolve the size-inconsistency problem and thereby improve the classification performance, the difference in the baseline is not important.
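Table 3: Classification accuracies (per class and overall) for the Iris problems (i) and (ii), averaged over 10 experiments.

| | (i) versicolor | (i) virginica | (i) overall | (ii) versicolor | (ii) setosa | (ii) overall |
|---|---|---|---|---|---|---|
| (a) LF+BN | 39.4% | 100% | 69.7% | 61.4% | 100% | 80.7% |
| (b) WLF(ICF)+BN | 30.8% | 100% | 65.4% | 47.4% | 100% | 73.7% |
| (c) WLF(CBL)+BN | 40.2% | 100% | 70.1% | 56.8% | 100% | 78.4% |

Table 4: Classification accuracies (per class and overall) for the MNIST problems (I), (II), and (III), averaged over 10 experiments.

| | (I) 1 | (I) 0 | (I) overall | (II) 1 | (II) 8 | (II) overall | (III) 4 | (III) 7 | (III) overall |
|---|---|---|---|---|---|---|---|---|---|
| (a) LF+BN | 66.9% | 100% | 82.2% | 69.7% | 100% | 83.7% | 34.4% | 100% | 68.0% |
| (b) WLF(ICF)+BN | 63.8% | 100% | 80.6% | 71.2% | 100% | 84.5% | 27.6% | 100% | 64.7% |
| (c) WLF(CBL)+BN | 64.8% | 100% | 81.1% | 73.1% | 100% | 85.5% | 28.1% | 100% | 64.8% |
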
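The “overall” columns in Tabs. 3 and 4 appear to be pooled accuracies, i.e., the per-class accuracies weighted by the numbers of test data points in Tabs. 1 and 2. This reading is an inference from the reported numbers, not a definition given in the text; the following sketch (with a hypothetical helper name) checks it for method (a).

```python
def overall_accuracy(class_acc, class_sizes):
    """Pooled accuracy: per-class accuracies weighted by the number of test points."""
    correct = sum(a * n for a, n in zip(class_acc, class_sizes))
    return correct / sum(class_sizes)

# MNIST problem (I), method (a): minority "1" (1135 test points), majority "0" (980).
mnist_overall = overall_accuracy([0.669, 1.0], [1135, 980])

# Iris problem (i), method (a): 50 test points per class, so this is a plain average.
iris_overall = overall_accuracy([0.394, 1.0], [50, 50])
```

Both values agree with the tabulated overall accuracies (82.2% and 69.7%) up to rounding, which shows that a high overall accuracy can coexist with a poor minority-class accuracy.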
4 Weighted Batch Normalization
As demonstrated in Sec. 3.3, the simple combination of WLF and BN does not perform well. This is considered to be due to the inconsistency of the effective data size in WLF and BN. As mentioned in Sec. 2, in WLF, the interpretation of the effective sizes of data points is changed in accordance with the corresponding weights. However, in BN, one data point is treated as just one data point in the computation of the mean and variance of the affine signals (cf. Eqs. (12) and (13)). This fact causes a size-inconsistency problem, which results in the degradation of the classification performance as shown in Sec. 3.3. To resolve this problem, we propose a modified BN (referred to as weighted batch normalization (WBN)) for WLF.
4.1 Feedforward and back propagations for WBN
The idea of our method is simple and natural. To maintain consistency with the interpretation of data size, the sizes of the minibatches should be reviewed according to the corresponding weights in BN. This implies that
$$m_j^{(\ell)} = \frac{1}{Z_B} \sum_{\mu \in B} w_\mu u_{j,\mu}^{(\ell)} \tag{23}$$
and
$$v_j^{(\ell)} = \frac{1}{V_B} \sum_{\mu \in B} w_\mu \big(u_{j,\mu}^{(\ell)} - m_j^{(\ell)}\big)^2 \tag{24}$$
should be used in Eq. (11) instead of Eqs. (12) and (13), where $u_{j,\mu}^{(\ell)}$ denotes the affine signal of the $\mu$th data point and
$$V_B := Z_B - \frac{1}{Z_B} \sum_{\mu \in B} w_\mu^2 \tag{25}$$
is the normalization factor for the variance. The normalization factor $Z_B$ is already defined in Eq. (16). Eqs. (23) and (24) are the modified versions of Eqs. (12) and (13), weighted in the same manner as WLF. The normalization factors in Eqs. (23) and (24) ensure that they are unbiased; see Appendix A. In the previous study [16], the normalization factor $V_B$ was defined differently. However, that definition is not appropriate from the perspective of the unbiasedness of $v_j^{(\ell)}$. When $w_\mu$ is a constant, WBN is equivalent to BN, because Eqs. (23) and (24) are reduced to Eqs. (12) and (13) in this case.
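In BN, the distribution of the normalized signals over $B$ is standard as shown in Eq. (14). On the other hand, in WBN, it is not standard but its weighted distribution is standard:
$$\frac{1}{Z_B} \sum_{\mu \in B} w_\mu \hat{u}_{j,\mu}^{(\ell)} = 0, \qquad \frac{1}{V_B} \sum_{\mu \in B} w_\mu \big(\hat{u}_{j,\mu}^{(\ell)}\big)^2 = 1$$
when $\varepsilon \to 0$, where $\hat{u}_{j,\mu}^{(\ell)} := \big(u_{j,\mu}^{(\ell)} - m_j^{(\ell)}\big) / \sqrt{v_j^{(\ell)} + \varepsilon}$ is the signal normalized with the weighted mean and variance in Eqs. (23) and (24). This standardization property reflects the effective data size in WLF. In WLF, the $\mu$th data point is replicated according to the corresponding weight as mentioned in Sec. 2. Therefore, the signal $u_{j,\mu}^{(\ell)}$, which corresponds to the $\mu$th data point, should also be replicated in the same manner. The above standardization property implies this notion.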
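The weighted normalization can be sketched for a single unit as follows. The variance normalizer used here, the sum of the weights minus the sum of squared weights divided by the sum of the weights, is the standard choice that makes a weighted variance unbiased for fixed weights; identifying it with Eq. (25) is an assumption of this sketch, as are the toy signals, the weights, and the function names.

```python
import math

def weighted_stats(u, w):
    """Weighted mean and unbiased weighted variance of the signals u with weights w."""
    z = sum(w)                                   # sum of the weights in the minibatch
    m = sum(wi * ui for wi, ui in zip(w, u)) / z
    v_norm = z - sum(wi ** 2 for wi in w) / z    # normalizer that makes the variance unbiased
    v = sum(wi * (ui - m) ** 2 for wi, ui in zip(w, u)) / v_norm
    return m, v

def wbn_transform(u, w, gamma=1.0, eps=0.0):
    """Normalize the signals with the weighted statistics and rescale by gamma.

    eps defaults to zero here only to make the standardization check exact.
    """
    m, v = weighted_stats(u, w)
    return [gamma * (ui - m) / math.sqrt(v + eps) for ui in u]

u = [1.0, 2.0, 4.0, 7.0]   # affine signals of one unit over a minibatch (toy values)
w = [0.5, 0.5, 0.5, 3.0]   # the last point belongs to a minority class

# With constant weights the statistics coincide with those of standard BN.
m_const, v_const = weighted_stats(u, [2.0] * 4)

# After WBN, the weighted statistics of the normalized signals are zero and one.
m_after, v_after = weighted_stats(wbn_transform(u, w), w)
```

With constant weights the result is the ordinary mean (3.5) and unbiased variance (7.0) of the toy signals, which illustrates that the weighted statistics reduce to those of BN in that case.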
4.2 Numerical experiment
In this section, we show the experimental results of the proposed method for the imbalanced classification problems ((i) and (ii) for Iris and (I)–(III) for MNIST) described in Sec. 3.3. The detailed settings of the experiments were basically the same as those in Sec. 3.3. For each experiment, we used two different methods: (d) WLF with ICF combined with WBN (WLF(ICF)+WBN), and (e) WLF with CBL combined with WBN (WLF(CBL)+WBN). The results of the experiments for Iris ((i) and (ii)) and for MNIST ((I)–(III)) are shown in Tabs. 5 and 6, respectively. The results in the tables are the average values obtained over 10 experiments. We observe that the classification accuracies for the minority classes in all the experiments are largely improved (cf. Tabs. 3 and 4). This means that the correction for the effective data size in WLF works well, because the size-inconsistency problem is resolved by WBN. The results of CBL are better than those of ICF in all the experiments.
We executed the same experiments using the WBN proposed in the previous study [16], in which the normalization factor for the variance was defined differently. The results obtained from these experiments were worse than those shown in Tabs. 5 and 6. This means that the definition of the normalization factor in Eq. (25) is better from the perspective not only of the unbiasedness of the variance but also of the classification performance.
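Table 5: Classification accuracies (per class and overall) of the proposed method for the Iris problems (i) and (ii), averaged over 10 experiments.

| | (i) versicolor | (i) virginica | (i) overall | (ii) versicolor | (ii) setosa | (ii) overall |
|---|---|---|---|---|---|---|
| (d) WLF(ICF)+WBN | 59.6% | 100% | 79.8% | 79.6% | 100% | 89.8% |
| (e) WLF(CBL)+WBN | 77.2% | 100% | 88.6% | 84.4% | 100% | 92.2% |

Table 6: Classification accuracies (per class and overall) of the proposed method for the MNIST problems (I), (II), and (III), averaged over 10 experiments.

| | (I) 1 | (I) 0 | (I) overall | (II) 1 | (II) 8 | (II) overall | (III) 4 | (III) 7 | (III) overall |
|---|---|---|---|---|---|---|---|---|---|
| (d) WLF(ICF)+WBN | 85.7% | 100% | 92.3% | 79.6% | 100% | 89.0% | 48.7% | 100% | 75.0% |
| (e) WLF(CBL)+WBN | 87.6% | 100% | 93.4% | 80.6% | 100% | 89.6% | 49.9% | 100% | 75.5% |
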
5 Summary and Future Works
In this paper, we proposed a new BN method for learning based on WLF. The idea of the proposed method is simple but essential. The proposed BN (i.e., WBN) can resolve the size-inconsistency problem in the combination of WLF and BN, and it improves the classification performance in data-imbalanced environments, as demonstrated in the numerical experiments. We verified the validity of WBN for two different databases, Iris and MNIST, and for two different weight settings, ICF and CBL, through numerical experiments. However, it is important to check the validity of WBN for other databases and other weight settings. Furthermore, deepening the mathematical understanding of WBN, such as the internal covariate shift in WBN, is also important. We will address these issues in our future studies.
We have considered only the classification problem in this study. The size-inconsistency in the combination of WLF and BN will also arise in other types of problems, e.g., regression problems. Our idea is presumably applicable to those cases, because the idea of our method is independent of the style of output. We will also address the application of WBN to problems other than classification in our future studies.
Acknowledgments
This work was partially supported by JSPS KAKENHI (Grant Numbers: 15H03699, 18K11459, and 18H03303), JST CREST (Grant Number: JPMJCR1402), and the COI Program from the JST (Grant Number JPMJCE1312).
Appendix A Unbiasedness of Estimators in Equations (23) and (24)
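In this appendix, we show that the mean and variance in Eqs. (23) and (24) are unbiased. By omitting indices ($j$ and $\ell$) unrelated to this analysis, they are expressed as
$$m = \frac{1}{Z_B} \sum_{\mu \in B} w_\mu u_\mu, \qquad v = \frac{1}{V_B} \sum_{\mu \in B} w_\mu (u_\mu - m)^2. \tag{28}$$
Here, we assume that $u_\mu$ are i.i.d. samples drawn from a distribution $p(u)$. The mean and variance of $p(u)$ are denoted by $\eta$ and $\sigma^2$, respectively. The expectation of $m$ in Eq. (28) over $p(u)$ is
$$\mathbb{E}[m] = \frac{1}{Z_B} \sum_{\mu \in B} w_\mu \mathbb{E}[u_\mu] = \eta.$$
Therefore, $m$ is an unbiased estimator. Similarly, the expectation of $v$ in Eq. (28) is
$$\mathbb{E}[v] = \frac{1}{V_B} \Big( \sum_{\mu \in B} w_\mu \mathbb{E}[u_\mu^2] - Z_B \, \mathbb{E}[m^2] \Big) = \frac{1}{V_B} \Big( Z_B (\sigma^2 + \eta^2) - Z_B \eta^2 - \frac{\sigma^2}{Z_B} \sum_{\mu \in B} w_\mu^2 \Big) = \sigma^2.$$
Here, we used
$$\mathbb{E}[u_\mu u_\nu] = \eta^2 + \sigma^2 \delta(\mu, \nu) \qquad \text{and} \qquad V_B = Z_B - \frac{1}{Z_B} \sum_{\mu \in B} w_\mu^2.$$
Therefore, $v$ is also an unbiased estimator.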
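The unbiasedness can also be verified exactly on a small example by enumerating every possible sample. The sketch below assumes the weighted mean and the weighted variance with normalizer “sum of weights minus sum of squared weights over sum of weights”; the discrete distribution and the weights are arbitrary illustrative choices, and exact rational arithmetic avoids rounding error.

```python
from fractions import Fraction
from itertools import product

def weighted_stats(u, w):
    """Weighted mean and weighted variance with the unbiasing normalizer."""
    z = sum(w)
    m = sum(wi * ui for wi, ui in zip(w, u)) / z
    v_norm = z - sum(wi ** 2 for wi in w) / z
    v = sum(wi * (ui - m) ** 2 for wi, ui in zip(w, u)) / v_norm
    return m, v

# Toy distribution (hypothetical): u takes the values 0, 1, 3 with equal probability.
values = [Fraction(0), Fraction(1), Fraction(3)]
true_mean = sum(values) / len(values)
true_var = sum((x - true_mean) ** 2 for x in values) / len(values)

w = [Fraction(1, 2), Fraction(1), Fraction(3)]   # arbitrary positive weights

# Enumerate every possible i.i.d. sample of size len(w); all outcomes are equally likely.
outcomes = list(product(values, repeat=len(w)))
exp_m = sum(weighted_stats(u, w)[0] for u in outcomes) / len(outcomes)
exp_v = sum(weighted_stats(u, w)[1] for u in outcomes) / len(outcomes)
```

The expectations of both estimators coincide exactly with the mean and variance of the underlying distribution, for non-uniform weights as well.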
References
[1] (2006) Pattern recognition and machine learning. Springer-Verlag New York. ISBN 978-0-387-31073-2.
[2] (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, pp. 321–357.
[3] (2019) Class-balanced loss based on effective number of samples. arXiv preprint arXiv:1901.05555.
[4] (2010) Maximum likelihood in cost-sensitive learning: model specification, approximations, and upper bounds. The Journal of Machine Learning Research 3, pp. 3313–3332.
[5] (2010) Understanding the difficulty of training deep feedforward neural networks. In Proc. of the 13th International Conference on Artificial Intelligence and Statistics 9, pp. 249–256.
[6] (11–13 Apr 2011) Deep sparse rectifier neural networks. In Proc. of the 14th International Conference on Artificial Intelligence and Statistics 15, pp. 315–323.
[7] (2016) Deep learning. MIT Press.
[8] (2017) Learning from class-imbalanced data: review of methods and applications. Expert Systems with Applications 73, pp. 220–239. ISSN 0957-4174.
[9] (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21 (9), pp. 1263–1284.
[10] (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proc. of the 2015 IEEE International Conference on Computer Vision, pp. 1026–1034.
[11] (2016) Learning deep representation for imbalanced classification. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5375–5384.
[12] (2018) Deep imbalanced learning for face recognition and attribute prediction. arXiv preprint arXiv:1806.00194.
[13] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proc. of the 32nd International Conference on Machine Learning 37, pp. 448–456.
[14] (2015) Adam: a method for stochastic optimization. In Proc. of the 3rd International Conference on Learning Representations, pp. 1–13.
[15] (2017) Learning to model the tail. In Proc. of the Advances in Neural Information Processing Systems 30, pp. 7029–7039.
[16] (2019) Improvement of batch normalization in imbalanced data. In Proc. of the 2019 International Symposium on Nonlinear Theory and its Applications, pp. 146–149.