## 1 Introduction

In visual recognition with neural networks, such as image classification [lecun1998gradient; he2016deep] and face recognition [wen2016discriminative; liu2017sphereface], a network can be thought of as a mapping function from high-dimensional data to a low-dimensional space. Common approaches to improving this mapping and the resulting recognition accuracy are modifying network connections [he2016deep; huang2017densely] or adjusting loss functions [liu2017sphereface; wen2016discriminative]. These methods perform very well when a well-classified and balanced dataset is provided.

A model may nevertheless have poor mapping capability because of extrinsic factors related to the uncertainty of a given dataset, e.g., class imbalance and noisy labels. These issues can sometimes be addressed by handling hard samples. A sample is considered hard when it lies on the wrong side of the correct decision boundary [viola2001robust] or inside the margin of the classification hyperplanes [dollar2014fast]. Hard samples are frequently observed when a dataset is unbalanced and not well classified, because existing learning methods usually have an inductive bias toward the dominant classes when the training data are unbalanced, resulting in poor recognition performance on minority classes [dong2018imbalanced]. Most approaches to this issue either resample the dataset [chawla2002smote], balancing the proportion of labels and data used to train a network model, or apply weights determined by the training loss through hard sample mining [kumar2010self; chang2017active; jiang2017mentornet]. Recently, hard sample generation methods based on deep neural networks have been proposed to tackle the hard sample problem caused by dataset imbalance [schroff2015facenet; wu2017sampling].

However, without an appropriate definition of dataset bias, handling that bias is inherently ill-defined [ren2018learning]. Moreover, using these methods without a proper definition of dataset bias may lead to poorly optimized mapping results when training networks. Hard samples can also appear even when neural networks are trained with a well-balanced and correctly classified dataset. Consequently, it is necessary to develop a method for boosting the mapping functionality of neural networks that is invariant to dataset bias and does not require meta-information such as a definition of dataset bias or the sample proportions between classes.

In this work, we present reversible network learning (RNL), which gives neural networks a degree of reversibility, and we propose generating and learning latent features that correspond to hard samples based on RNL. The reversible network (RN) is inspired by the auto-encoder (AE). However, it differs from an AE in that it focuses on reconstructing, generating, and learning latent features to improve supervised learning performance. While learning, an RN can generate latent features regarded as samples with a lower likelihood and use these features for network training. We demonstrate the effectiveness of the proposed method on image classification problems. The experimental results show that networks trained in the proposed manner outperform the others.

The key contributions of this paper are summarized in three points. First, we propose an RN for generating and learning latent features related to hard samples. Second, the proposed method is easily applied to various network structures and loss functions, and in our experiments the resulting models perform better than the same models trained normally. Third, we provide extensive experimental results, including a comparison of recognition performance between the proposed method and other models, and an analysis of the relation between latent features and hard samples.

This paper is organized as follows. Section 2 describes the RN and how to generate and learn latent features corresponding to hard samples using the network. Section 3 provides experimental results that demonstrate the effectiveness of the proposed method. Section 4 concludes the paper.

## 2 Reversible Network Learning

### 2.1 Reversibility of Neural Network

In supervised learning, a neural network performs a feed-forward process. Given an input sample $x$, the network maps it into a latent space, extracts a latent feature $z$, and computes a confidence value for each class. The primary goal of training is to minimize an error such as the classification error. The reversibility of a neural network is usually not considered a significant issue.

In a $K$-class classification setting, suppose the network consists of two functions: an encoding function $f: x \mapsto z$, where $x$ and $z$ are an input sample and the corresponding latent feature, and a probabilistic model $p(c \mid z; \theta)$ that computes the likelihood of $z$ corresponding to class $c$ for each input sample, using the given $z$ and the model parameter $\theta$. The entire process can be represented by $F(x) = p(\cdot \mid f(x); \theta)$, and the predicted class is decided as follows: $\hat{c} = \arg\max_{c} p(c \mid f(x); \theta)$.

Under the above setting, the neural network is reversible if it satisfies the following condition for a given sample $x$:

$$F^{-1}\big(F(x)\big) = x \tag{1}$$

However, since network components such as activation functions and connectivity are sometimes non-invertible, it is difficult to build a mathematically exact RN in practice [xu2014deep]. For instance, the softmax function is non-linear, and its inverse can be written as $z_i = \log \hat{y}_i + C$, where $\hat{y}_i$ is the softmax output for class $i$ corresponding to the latent feature $z_i$, and $C$ denotes a constant. Since $C$ originally represents $\log \sum_j \exp(z_j)$, which cannot be recovered from the output alone, this replacement makes it difficult for a neural network to be exactly reversible. Although the neural network is theoretically irreversible, we can develop a learning-based approach that improves its reversibility using the condition in Eq. 1 and the reconstruction scheme of an AE.
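The loss of the additive constant can be checked numerically. The following is a minimal sketch in plain Python, not part of the original implementation; the function names and the example logits are assumptions chosen for illustration. It shows that inverting the softmax output recovers the logits only up to the log-sum-exp term, while the recovered logits still map to the same softmax output.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def inverse_softmax(y_hat, constant=0.0):
    """Recover logits from softmax outputs, up to an unknown additive constant."""
    return [math.log(p) + constant for p in y_hat]

z = [2.0, -1.0, 0.5]                 # illustrative logits (assumed values)
y_hat = softmax(z)
z_rec = inverse_softmax(y_hat)       # constant assumed to be 0

# Every recovered logit is shifted by the same amount: log(sum(exp(z))).
offsets = [a - b for a, b in zip(z, z_rec)]
log_sum_exp = math.log(sum(math.exp(v) for v in z))
```

Because the shift is identical for all classes, the class ranking is preserved, but the original latent feature is not uniquely determined by the output.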

To improve the reversibility of neural networks in supervised learning, an RN performs two processes: a feed-forward process and a feed-backward process. The feed-forward process is the usual process of supervised learning. In a classification setting, it generates the network output $\hat{y}$ used for classification; its goal is to compute the likelihood of each class, and optimization minimizes the classification error. Cross-entropy with the softmax function is commonly used as the cost function in this setting and is defined as follows:

$$L_{c} = -\sum_{i=1}^{D} y_i \log \hat{y}_i \tag{2}$$

where $D$ is the dimensionality of the final layer, which is usually equal to the number of classes, and $\hat{y}$ and $y$ are the output of the feed-forward process of the network model and the corresponding label. The network output $\hat{y}$ is obtained by applying the softmax function to the latent feature $z$, and it can be interpreted as the likelihood of each class for a given sample $x$.
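As a small worked instance of the softmax cross-entropy described above, the sketch below computes the loss for a one-hot label in plain Python. The helper names and the `eps` guard against `log(0)` are assumptions for illustration and are not taken from the original code.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_hat, y_onehot, eps=1e-12):
    """Cross-entropy between predicted likelihoods and a one-hot label."""
    return -sum(t * math.log(p + eps) for t, p in zip(y_onehot, y_hat))

# Uniform logits over four classes: every class gets likelihood 0.25,
# so the loss equals log(4) regardless of which class is the label.
uniform_loss = cross_entropy(softmax([0.0, 0.0, 0.0, 0.0]), [0, 1, 0, 0])

# A confident, correct prediction gives a much smaller loss.
confident_loss = cross_entropy(softmax([0.0, 8.0, 0.0, 0.0]), [0, 1, 0, 0])
```

The uniform case gives a convenient reference value: a network that has learned nothing about a $D$-class problem sits near $\log D$.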

On the other hand, the feed-backward process reconstructs the input sample through the reverse process represented by:

$$\hat{x} = F^{-1}(\hat{y}) \tag{3}$$

where $F^{-1}$ denotes the reverse pass of the network and $\hat{x}$ is the reconstruction result obtained from a given value $\hat{y}$. Unfortunately, as mentioned above, because of network components such as non-linear activation functions and irreversible network connections, it is difficult to build a mathematically exact inverse network.

In an RN, the feed-backward process performs a reverse process inspired by the AE. The reverse process for a fully connected layer is defined by:

$$\hat{h}^{(l-1)} = \sigma\big(W_l^{\top} h^{(l)} + b_{l-1}\big) \tag{4}$$

where $W_l^{\top}$ is the transpose of the weight matrix $W_l$ of the $l$-th fully connected layer, and $b_{l-1}$ is the bias of the layer preceding the $l$-th layer. $\sigma$ is an activation function such as the rectified linear unit (ReLU), the softmax function, or the hyperbolic tangent function. In Eq. 4, $W_l^{\top} h^{(l)}$ can be represented as follows:

$$\big[W_l^{\top} h^{(l)}\big]_i = w_i^{\top} h^{(l)} = \sum_{j} w_{ji}\, h^{(l)}_j \tag{5}$$

where $w_{ji}$ is the element at coordinate $(j, i)$ of the weight matrix $W_l$, and $w_i$ is the $i$-th column vector of $W_l$. $h^{(l)}$ and $h^{(l)}_j$ are the layer output and the $j$-th element of that output. Additionally, the reverse process for a convolutional layer is replaced by a deconvolutional layer [xu2014deep].

The above feed-backward process produces the reconstruction result $\hat{x}$, which is used to maximize the reversibility of the neural network by minimizing a reconstruction error based on the mean squared error:

$$L_{r} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{x}_i)^2 \tag{6}$$

where $x_i$ and $\hat{x}_i$ are the elements of the input $x$ and of the feed-backward result $\hat{x}$, and $N$ is the number of elements. Minimizing the reconstruction error to maximize reversibility is inspired by the AE.
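The reverse pass through a fully connected layer and the resulting reconstruction error can be illustrated with a small sketch. This is plain Python under stated assumptions, not the original implementation: the layer sizes, weights, zero biases, and the choice of ReLU are made up for the example, and the helper names are hypothetical.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def relu(v):
    return [max(0.0, a) for a in v]

def forward_layer(W, b, h_prev):
    """Feed-forward step: h = relu(W h_prev + b)."""
    return relu([a + bi for a, bi in zip(matvec(W, h_prev), b)])

def reverse_layer(W, b_prev, h):
    """Feed-backward step with the transposed weights: relu(W^T h + b_prev)."""
    return relu([a + bi for a, bi in zip(matvec(transpose(W), h), b_prev)])

def mse(x, x_hat):
    """Mean squared reconstruction error."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

x = [0.5, 2.0]

# Case 1: columns of W are orthonormal, so W^T W = I and the input is recovered.
W_orth = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
err_orth = mse(x, reverse_layer(W_orth, [0.0, 0.0],
                                forward_layer(W_orth, [0.0, 0.0, 0.0], x)))

# Case 2: a scaled weight breaks W^T W = I, so the reverse pass is inexact.
W_scaled = [[2.0, 0.0], [0.0, 1.0]]
err_scaled = mse(x, reverse_layer(W_scaled, [0.0, 0.0],
                                  forward_layer(W_scaled, [0.0, 0.0], x)))
```

The second case shows why the reconstruction error is a useful training signal: the transposed weights invert the layer only when the weights have suitable structure, and minimizing the error pushes the network toward that structure.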

However, the RN and the AE have different objectives. The AE is an unsupervised learning method whose goal is to minimize the reconstruction error between an input sample and its reconstruction, without considering the classes of the samples. This helps the AE learn significant representations from the given samples. On the other hand, the goal of training an RN is to minimize both the classification error and the reconstruction error. This can be regarded as a class-wise embedding of latent features, since RN learning clusters the latent features according to their likelihoods by minimizing the classification error and the reconstruction error simultaneously. This property of the RN may help to reconstruct latent features corresponding to hard samples. Figure 1 illustrates the workflows of RN learning and an AE and the methodological difference between them. To apply these two objectives when training the network, we use a straightforward aggregation to compute the total loss. We aggregate the classification loss of the feed-forward process and the reconstruction loss of the feed-backward process as follows:

$$L = L_{c} + L_{r} \tag{7}$$

where $L_{c}$ is the classification loss of Eq. 2 and $L_{r}$ is the reconstruction loss of Eq. 6.

In our experiments, other non-differentiable operations such as pooling or other downsampling are replaced in the reverse pass with an upsampling function based on simple image transformation methods. Figure 2 shows the reconstruction results from latent features using a normal classification network (NN), an AE, and the RN. As shown in Figure 2, the network trained with the classification loss alone produces poor reconstructions compared with the others.

### 2.2 Latent Feature Generation and Learning

The simplest approach to boosting the mapping functionality of neural networks is to provide a large-scale, well-categorized dataset that covers the variations within each class. However, such a dataset is difficult to construct in practice. When a neural network is trained on a biased dataset, the learned features are biased toward the dominant samples, and samples outside the dominant set are treated as hard samples that can be misclassified in the test phase.

The solution to the above issue using RNL is surprisingly simple: we only need one step of the feed-backward and feed-forward processes of Eq. 3 and Eq. 1 in the RN. The reversibility of the RN can be used to generate latent features by providing a reverse mapping from the class likelihoods to a latent feature. A hard sample is a sample that the model cannot recognize under the closed-set condition, and it is represented as follows: $c \neq \arg\max_{k} p(k \mid f(x_h); \theta_k)$, where $x_h$ and $c$ denote a hard sample and its annotated class, $f$ is the encoding function, $\theta_k$ represents the parameters of the probability model of the $k$-th class, and $p(k \mid f(x_h); \theta_k)$ is the likelihood corresponding to class $k$.

In the feed-backward process of the RN, generating a latent feature can be written as $\tilde{z} = p^{-1}(\tilde{y})$, where $p^{-1}$ denotes the reverse mapping from likelihoods to the latent space, $\tilde{y}$ is the likelihood data given for generating a corresponding latent feature, and $\tilde{z}$ is the generated latent feature. This process can be applied to generate latent features corresponding to hard samples, and it can be represented with Eq. 4 as follows:

$$\tilde{z} = \sigma\big(W^{\top}\, \mathrm{tr}(\hat{y}) + b\big) \tag{8}$$

where $W$ and $b$ are the weight matrix and bias used in Eq. 4, and $\mathrm{tr}$ is a transformation function that modifies the likelihood values of an output $\hat{y}$. In this work, we randomly select some elements of the output vector and assign them a value similar to the maximum likelihood. The detailed method for modifying $\hat{y}$ and applying it to network training is described in Algorithm 1. The remaining process is straightforward: the generated latent features are fed directly into the feed-forward process, which is equivalent to the general process for image classification. Figure 2(a) illustrates the process of latent feature generation in the RN, together with generation and visualization results for the latent features.
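The transformation of the output vector can be sketched as follows. Algorithm 1 is not reproduced in this text, so the details here are assumptions made only for illustration: the function name `transform_output`, the number of selected elements, the `ratio` that sets how close the assigned value is to the maximum, the exclusion of the top class from the random selection, and the final renormalization are all hypothetical choices.

```python
import random

def transform_output(y_hat, num_selected=1, ratio=0.9, rng=None):
    """Raise randomly chosen entries of y_hat close to its maximum value.

    Assumptions (not from the original algorithm): the top class is never
    selected, selected entries are set to ratio * max, and the result is
    renormalized so that it still sums to one.
    """
    rng = rng or random.Random()
    top = max(y_hat)
    top_idx = y_hat.index(top)
    candidates = [i for i in range(len(y_hat)) if i != top_idx]
    chosen = rng.sample(candidates, num_selected)
    y_mod = list(y_hat)
    for i in chosen:
        y_mod[i] = ratio * top          # value similar to the maximum likelihood
    total = sum(y_mod)
    return [v / total for v in y_mod], chosen

y_hat = [0.7, 0.2, 0.1]                 # illustrative network output
y_mod, chosen = transform_output(y_hat, num_selected=1, ratio=0.9,
                                 rng=random.Random(0))
```

The modified vector describes a sample that sits close to a decision boundary between the top class and the selected class. Passing it through the reverse mapping of the final layer would give a latent feature of the kind the text associates with hard samples.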

## 3 Experiment

### 3.1 Experimental setting and datasets

We compare models trained in the reversible manner with normally trained models. We implemented a baseline neural network, a very deep neural network (VGGNet) [simonyan2014very], a residual network (ResNet) [he2016deep], and a densely connected convolutional network (DenseNet) [huang2017densely]. The structural details of the baseline neural network are shown in Table 2. For the others, we employed the VGG-19, ResNet-18, and DenseNet-40 structures from their respective studies. Our work concentrates on demonstrating the effectiveness of RNL, not on achieving state-of-the-art performance. Therefore, the experiments are intentionally based on several baseline models and focus on the comparison between a normally trained model and the same model trained in the reversible learning manner. In our experiments, all networks are trained using stochastic gradient descent (SGD) with a learning rate decay of 0.0001 and a momentum of 0.9. The learning rate is initially set to 0.1 and divided by 10 at epochs 20, 40, and 60. We apply a simple data augmentation by cropping and flipping the given images. Training and evaluation on each dataset are performed 10 times, and the average over all runs is taken as the final quantitative result for each model. All experiments are conducted using an Nvidia Titan Xp GPU and a 3.20 GHz CPU. The source code for these experiments is implemented with the PyTorch library.
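The step schedule for the learning rate can be written as a small function. This is a sketch in plain Python rather than the original training code; it assumes zero-indexed epochs and that each drop takes effect from the milestone epoch onward, neither of which is stated in the text.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(20, 40, 60), factor=0.1):
    """Step schedule: multiply the rate by `factor` at every milestone reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops

# Learning rate at a few representative epochs.
schedule = {epoch: learning_rate(epoch) for epoch in (0, 19, 20, 40, 60)}
```

In PyTorch the same behaviour is typically obtained with a multi-step scheduler attached to the SGD optimizer, but the exact configuration used by the authors is not given here.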

We demonstrate the effectiveness of RNL in the image classification setting. First, we evaluate the models using the Cifar-10 and Cifar-100 datasets [krizhevsky2009learning]. The Cifar-10 dataset is composed of 50,000 training images and 10,000 test images, which are classified into 10 categories. Each category contains 6,000 images. The Cifar-100 dataset consists of 100 image categories, and each category has 500 training images and 100 test images. The resolution of each image in both datasets is 32 × 32. When we train the models mentioned above on Cifar-10 and Cifar-100, we use a batch size of 128 in the training stage and a batch size of 100 in the test stage. All images in the Cifar-10 and Cifar-100 datasets are normalized by dividing by the channel-wise expectation values before they are fed to the networks.
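One reading of this normalization is to compute the mean of each colour channel over the dataset and divide every pixel by the mean of its channel. The sketch below follows that reading in plain Python; the data layout (a list of images, each a list of channels, each a flat list of pixel values), the helper names, and the toy values are assumptions for illustration.

```python
def channel_means(images):
    """Mean pixel value of each channel over a whole dataset."""
    num_channels = len(images[0])
    sums = [0.0] * num_channels
    counts = [0] * num_channels
    for img in images:
        for c, channel in enumerate(img):
            sums[c] += sum(channel)
            counts[c] += len(channel)
    return [s / n for s, n in zip(sums, counts)]

def normalize(img, means):
    """Divide every pixel by the mean of its channel."""
    return [[p / m for p in channel] for channel, m in zip(img, means)]

# Two toy "images" with three channels and two pixels per channel.
images = [
    [[1.0, 3.0], [2.0, 2.0], [4.0, 4.0]],
    [[2.0, 2.0], [2.0, 2.0], [0.0, 8.0]],
]
means = channel_means(images)
normalized = [normalize(img, means) for img in images]
```

After this step each channel has a dataset-wide mean of one, which puts the three channels on a comparable scale.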

In addition to the experiments on the Cifar-10 and Cifar-100 datasets, we carried out experiments with an Extremely Biased and Poorly Categorized (EBPC) dataset, which is available at https://github.com/andreYoo/CED-algorithm. We propose the EBPC dataset for observing network performance when networks are trained with a highly unbalanced and poorly classified dataset. The EBPC dataset is constructed by roughly combining several public datasets; it has 3,470 classes and consists of 271,516 training images and 82,771 test images. The source datasets were originally proposed for image classification, face recognition, and person re-identification. They are as follows: 1) the MNIST dataset [lecun1998gradient], 2) the Cifar-10 and Cifar-100 datasets [krizhevsky2009learning], 3) the Stanford dog dataset [khosla2011novel], 4) the Flowers dataset with 101 categories [nilsback2008automated], 5) the LFW face dataset [learned2016labeled], and 6) the CUHK03 dataset [li2014deepreid]. Even when several labels are semantically equivalent, they are treated as different classes in the EBPC dataset. For example, the automobile class in Cifar-10 and the vehicles class in Cifar-100 are considered different classes. We did not artificially increase the number of samples in any class; we only normalized the image size across the datasets. Each class has a minimum of 2 and a maximum of 10,000 samples. The details of each source dataset and the quantitative properties of the EBPC dataset are shown in Table 4. As in the Cifar-10 and Cifar-100 experiments, we use a batch size of 128 for training and 100 for testing.

### 3.2 Quantitative comparison

Table 2 contains the classification errors of the listed network models on the Cifar-10 and Cifar-100 datasets. The lowest Cifar-10 classification error is achieved by the reversible DenseNet with simple data augmentation, which shows a 5.17% classification error on Cifar-10 and 20.94% on Cifar-100. Among the results using ResNet, the lowest errors on Cifar-10 and Cifar-100 are 5.94% and 20.03% respectively. For VGG-19, the lowest classification errors on Cifar-10 and Cifar-100 are 6.94% and 24.91%. Across the compared architectures, models trained with the proposed reversible learning achieve lower classification errors than their normally trained counterparts. For the baseline model, the network trained with the proposed method shows at least 8% lower classification error whether or not the simple data augmentation is applied. The evaluation results for the other network models show a similar trend.

On the EBPC dataset, the lowest classification error is 28.74%, achieved by the DenseNet trained in the reversible learning manner with data augmentation. The ResNet model, which achieved 5.94% and 20.03% classification errors on Cifar-10 and Cifar-100 respectively, recorded a 40.08% error on the EBPC dataset, and VGG-19 achieved a 41.59% classification error. Overall, classification performance on the EBPC dataset is lower than on the Cifar-10 and Cifar-100 datasets. The normally trained DenseNet achieves a 31.88% classification error, which is 3.14 percentage points higher than that of the reversibly trained model. The results for VGG-19 and ResNet show a similar trend: the VGG-19 trained with RNL shows about 3% lower classification error than the others, and the ResNet trained with RNL also achieves better performance than the normally trained one. The classification accuracies on the EBPC dataset are presented in Table 4.
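The DenseNet comparison can be checked with a few lines of arithmetic, which also separates the absolute gap from the relative one. The snippet is a plain-Python sketch; the function name is hypothetical, and only the two error values come from the text.

```python
def error_gap(baseline_error, improved_error):
    """Return the absolute gap (percentage points) and the relative reduction."""
    absolute = baseline_error - improved_error
    relative = absolute / baseline_error
    return absolute, relative

# Errors reported for DenseNet on the EBPC dataset (in percent).
abs_gap, rel_gap = error_gap(31.88, 28.74)
summary = f"{abs_gap:.2f} percentage points, {100 * rel_gap:.1f}% relative"
```

The absolute gap is 3.14 percentage points, which corresponds to a relative error reduction of a little under 10%.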

### 3.3 Analysis

The experimental results show that network models trained with RNL outperform the normally trained models. The most noticeable point is that models trained to improve network reversibility consistently achieve better performance, whether the performance difference is small or large.

Our interpretation of these performance improvements is as follows. As mentioned in Section 2, the latent feature generation method based on RNL can influence the recognition performance of a neural network model. We tried to improve the mapping functionality using RNL. RNL encourages the reversibility of neural networks, allowing them to reconstruct input data in a supervised learning setting. In the learning procedure, the proposed RNL plays a critical role in explicitly improving network reversibility. The results on Cifar-10 and Cifar-100 show that models trained with RNL achieve lower classification errors than the others. Beyond the classification errors, the downward trends of the loss also show that models using RNL perform better. Figures 3(a) and 3(c) show the loss curves of the baseline network models trained with RNL and with the general training process on Cifar-10 and Cifar-100. The models are trained on the training sets and tested on the test sets of the Cifar-10 and Cifar-100 datasets. These graphs show that the networks using RNL reach a lower loss than the others. Additionally, the classification accuracy curves during training, shown in Figures 3(b) and 3(d), show a similar pattern.

The results on the EBPC dataset likewise show that models with improved reversibility achieve better classification accuracy than normally trained ones. Figure 5 illustrates the loss and accuracy trends of the baseline network model under each training manner. Both the cross-entropy curves and the classification accuracy curves indicate that networks trained with RNL provide better classification performance than the others. Interestingly, while the cross-entropy of the RN on the EBPC test set gradually decreases during training, the cross-entropy of the NN increases during training. This may mean that RNL with latent feature generation is a stable learning method when a biased and poorly classified dataset is given.

## 4 Conclusion

In this paper, we propose a reversible learning method to boost the mapping capability of neural networks. The proposed method automatically generates and learns latent features regarded as samples with a lower likelihood. Thus, it can improve the mapping capability of neural networks without additional data augmentation or a complementary process for resampling a given dataset. It can also be memory- and cost-effective, since it does not augment or generate samples for the dataset itself but generates latent features, which have lower dimensionality than the given samples. Additionally, the proposed method does not require modifying network structures or loss functions, so it may be easily applied to various recognition methods using neural networks, including not only visual recognition but also speech recognition. The experimental results show that network models trained with the proposed method can outperform the same models trained normally.
