Boosting Mapping Functionality of Neural Networks via Latent Feature Generation based on Reversible Learning

10/21/2019 ∙ by Jongmin Yu, et al.

This paper addresses a method for boosting the mapping functionality of neural networks in visual recognition tasks such as image classification and face recognition. We present reversible learning, which generates and learns latent features using the network itself. By generating latent features corresponding to hard samples and applying the generated features in the training stage, reversible learning can improve mapping functionality without additional data augmentation or handling of dataset bias. We demonstrate the efficiency of the proposed method on MNIST, Cifar-10/100, and the Extremely Biased and Poorly Categorized (EBPC) dataset. The experimental results show that the proposed method can outperform existing state-of-the-art methods in visual recognition. Extensive analysis shows that our method can efficiently improve the mapping capability of a network.




1 Introduction

In visual recognition studies using neural networks, such as image classification lecun1998gradient; he2016deep and face recognition wen2016discriminative; liu2017sphereface, a network can be thought of as a mapping function between high-dimensional data and a low-dimensional space. Common approaches to improving mapping functionality and recognition accuracy in such studies are modifying network connections he2016deep; huang2017densely or adjusting loss functions liu2017sphereface; wen2016discriminative. These methods show outstanding performance when well-classified and balanced datasets are provided.

A model may have poor mapping capability due to extrinsic factors related to the uncertainty of a given dataset, e.g., class-wise imbalance and noisy labels. These issues can sometimes be addressed by handling hard samples. A sample is considered a hard sample when it is on the wrong side of the correct decision boundary viola2001robust or in the margin of the hyperplanes for classification dollar2014fast. Hard samples are frequently observed when an unbalanced and poorly classified dataset is given, because existing learning methods usually have an inductive bias toward the dominant classes if the training data are unbalanced, resulting in poor recognition performance on minority classes dong2018imbalanced. Most approaches to this issue are dataset resampling chawla2002smote, which sets a balanced proportion of labels and data for training a network model, and hard sample mining, which applies weights decided by the training loss kumar2010self; chang2017active; jiang2017mentornet. Recently, hard sample generation methods based on deep neural networks have been proposed to tackle the hard sample problem caused by dataset imbalance schroff2015facenet; wu2017sampling.

However, without an appropriate definition of dataset bias, handling the bias is inherently ill-defined ren2018learning. Moreover, using these methods without a proper definition of dataset bias may lead to poorly optimized mapping results when training networks. Hard samples can also appear even if neural networks are trained with a well-balanced and correctly classified dataset. Consequently, it is necessary to develop a method for boosting the mapping functionality of neural networks that is invariant to dataset bias and does not require meta information such as the definition of dataset bias or the sample proportions between classes.

In this work, we present reversible network learning (RNL), which gives neural networks reversibility, and we propose the generation and learning of latent features corresponding to hard samples based on RNL. The reversible network (RN) is inspired by the autoencoder (AE). However, it differs from an AE in that it focuses on reconstructing, generating, and learning latent features to improve supervised learning performance. While an RN is being trained, the network can generate latent features regarded as samples that have a lower likelihood and apply these features to network learning. We demonstrate the efficiency of the proposed method on image classification problems. The experimental results show that networks trained in the proposed manner outperform the others.

The key contributions of this paper are summarized in three points. First, we propose an RN for the generation and learning of latent features related to hard samples. Second, the proposed method is easily applied to various network structures and loss functions, and the resulting models perform substantially better than existing ones. Third, we provide extensive experimental results, including a comparison of recognition performance between the proposed method and other models, and an analysis of the relation between latent features and hard samples.

This paper is organized as follows. Section 2 describes the RN and how to generate and learn the latent features corresponding to hard samples using the network. Section 3 provides experimental results that demonstrate the efficiency of the proposed method. Section 4 concludes the paper.

2 Reversible Network Learning

2.1 Reversibility of Neural Network

In supervised learning, a neural network conducts a feed-forward process. Given an input sample $x$, the network maps the sample into a latent space, extracts a latent feature $z$, and computes the confidence value related to each class. The primary goal of network learning is to minimize errors such as the classification error. The reversibility of a neural network is usually not considered a significant issue.

In a $K$-class classification problem setting using a neural network, suppose the network consists of two functions: an encoding function $f: x \mapsto z$, where $x$ and $z$ are an input sample and the corresponding latent feature, and a probabilistic model $p(y = k \mid z; \theta)$ that computes the likelihood of class $k$ for each input sample using the given $z$ and the model parameter $\theta$. The entire process can be represented by $p(y \mid f(x); \theta)$, and the predicted class is decided as follows: $\hat{k} = \arg\max_{k} p(y = k \mid f(x); \theta)$.

Under the above setting, the neural network is reversible if it satisfies the following condition for a given sample $x$:
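
One way to write such a condition is sketched below. The notation is assumed for illustration: $f$ is the encoding function, $p$ stands for the probabilistic model viewed as a mapping from a latent feature to class likelihoods, and $f^{-1}$, $p^{-1}$ denote their inverses. Under these assumptions, the feed-forward output must map back to the input exactly:

```latex
\hat{y} = p\bigl(f(x)\bigr), \qquad x = f^{-1}\bigl(p^{-1}(\hat{y})\bigr)
```

This is a sketch of the idea rather than a definitive statement of Eq. 1.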


However, since network components such as activation functions and connectivity are sometimes non-invertible, it is difficult to build a mathematically exact RN in practice xu2014deep. For instance, the softmax function is non-linear, and its inverse is regarded as $z_k = \log \hat{y}_k + c$, where $\hat{y}_k$ is the softmax output for class $k$ corresponding to the latent feature $z$, and $c$ denotes a constant. The constant $c$ originally represents $\log \sum_{j} \exp(z_j)$, which cannot be recovered from $\hat{y}$ alone, and this replacement makes it difficult for neural networks to be reversible.
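
The non-uniqueness of the softmax inverse can be checked numerically. The sketch below is an illustration only; the helper names `softmax` and `softmax_inverse` are hypothetical and not taken from the paper. Any constant `c` yields a pre-image that maps to the same likelihoods, so the original latent feature cannot be recovered from the output alone.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_inverse(y, c=0.0):
    """Return one pre-image of softmax: z_k = log(y_k) + c."""
    return [math.log(p) + c for p in y]

z = [2.0, -1.0, 0.5]               # original latent feature
y = softmax(z)                     # class likelihoods
z0 = softmax_inverse(y, c=0.0)     # one candidate pre-image
z1 = softmax_inverse(y, c=3.0)     # another candidate, shifted by a constant

# Both candidates reproduce y, yet neither equals z unless c happens to be
# log(sum(exp(z))), a value that is not available from y alone.
print(softmax(z0), softmax(z1))
```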

Figure 1: (a) denotes the workflow of an autoencoder (AE). (b) represents the workflow of the reversible network (RN). $x$ is a given sample, and $\hat{x}$ denotes the reconstruction of $x$ by the AE and the RN. $z$ in (a) represents the latent features of the AE. $\hat{y}$ and $y$ denote the network outputs and the corresponding labels. Black solid lines show the working process of each model. Red dotted lines denote the loss functions and the features assigned to them.

Although a neural network is theoretically irreversible, we can develop a learning-based approach to improve its reversibility using the condition in Eq. 1 and the reconstruction scheme of the AE.

To improve the reversibility of neural networks in supervised learning, the RN conducts two processes: a feed-forward process and a feed-backward process. The feed-forward process is the general process of supervised learning. In a classification problem setting, it generates the network output for classification, $\hat{y} = p(f(x))$. Its goal is to compute the likelihood related to each class, and the optimization minimizes the classification error. The cross-entropy with the softmax function is commonly used as the cost function in classification, defined as follows:

$$\mathcal{L}_{ce}(\hat{y}, y) = -\sum_{i=1}^{d} y_i \log \hat{y}_i \qquad (2)$$

where $d$ is the dimensionality of the final layer, which is usually equal to the number of classes. $\hat{y}$ and $y$ are the output of the feed-forward process of a network model and the corresponding label. The network output $\hat{y}$ is computed by applying the softmax function to the latent feature $z$, and it can be interpreted as the likelihood of each class for a given sample $x$.
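
As a small worked example of the cross-entropy cost, the sketch below evaluates it for a fixed likelihood vector against two one-hot labels. The function name `cross_entropy` and the clamp value are illustrative assumptions.

```python
import math

def cross_entropy(y_hat, y):
    """Cross-entropy between predicted likelihoods y_hat and a one-hot label y."""
    if len(y_hat) != len(y):
        raise ValueError("y_hat and y must have the same length")
    # Clamp to avoid log(0) for zero likelihoods (the clamp is an assumption).
    return -sum(t * math.log(max(p, 1e-12)) for t, p in zip(y, y_hat))

y_hat = [0.7, 0.2, 0.1]                             # likelihoods for 3 classes
loss_true = cross_entropy(y_hat, [1.0, 0.0, 0.0])   # label is the most likely class
loss_other = cross_entropy(y_hat, [0.0, 0.0, 1.0])  # label is the least likely class
print(loss_true, loss_other)
```

The loss is small when the annotated class already receives a high likelihood and large otherwise, which is why samples with a low likelihood for their annotated class dominate the classification loss.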

On the other hand, the feed-backward process reconstructs the input sample through the reverse process represented by:

$$\hat{x} = f^{-1}(p^{-1}(\hat{y})) \qquad (3)$$

where $\hat{x}$ is the reconstruction result from a given value $\hat{y}$. Unfortunately, as mentioned above, because of network components such as non-linear activation functions and irreversible network connections, it is difficult to build a mathematically exact inverse network.

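In the RN, the feed-backward process conducts a reverse process inspired by the AE. The reverse process for a fully connected layer is defined by:

$$\hat{z}_{l-1} = \sigma(W_l^{\top} \hat{z}_l + b_{l-1}) \qquad (4)$$

where $W_l^{\top}$ is the transpose of the weight matrix $W_l$ in the $l$-th fully connected layer, and $b_{l-1}$ is the bias of the layer preceding the $l$-th layer. $\sigma$ is an activation function such as the rectified linear unit (ReLU), the softmax function, or the hyperbolic tangent function. In Eq. 4, $W_l^{\top} \hat{z}_l$ can be represented as follows:

$$W_l^{\top} \hat{z}_l = \sum_{j} w_j \hat{z}_{l,j}, \qquad w_j = (w_{j1}, w_{j2}, \dots)^{\top} \qquad (5)$$

where $w_{ji}$ is the element at coordinate $(j, i)$ of the weight matrix $W_l$, and $w_j$ is the $j$-th column vector of $W_l^{\top}$. $\hat{z}_l$ and $\hat{z}_{l,j}$ are the output and the $j$-th element of the output. Additionally, the reverse process for a convolutional layer is replaced by a deconvolutional layer xu2014deep.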
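
The reverse process for a fully connected layer (transposed weights, the bias of the preceding layer, then an activation) can be sketched in a few lines. Everything below is an illustrative assumption: the function names, the choice of ReLU, and the toy weights are not from the paper.

```python
def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]

def forward_fc(W, b, x):
    """Feed-forward: z = relu(W x + b), with W stored as a list of rows."""
    return relu([sum(w * a for w, a in zip(row, x)) + bj for row, bj in zip(W, b)])

def reverse_fc(W, b_prev, z):
    """Feed-backward: x_hat = relu(W^T z + b_prev), reusing the same weights."""
    n_in = len(W[0])
    pre = [sum(W[j][i] * z[j] for j in range(len(W))) + b_prev[i] for i in range(n_in)]
    return relu(pre)

# Toy layer mapping 3 inputs to 2 outputs; it simply drops the third input.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
b = [0.0, 0.0]
b_prev = [0.0, 0.0, 0.0]

x = [0.5, 2.0, 3.0]
z = forward_fc(W, b, x)
x_hat = reverse_fc(W, b_prev, z)
print(z, x_hat)
```

In this toy case the first two inputs are recovered, but the third is lost in the forward pass and comes back as zero. This illustrates why the transposed-weight pass is only an approximate inverse and why a reconstruction loss is needed to train it.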


Figure 2: Example reconstruction results of a general neural network (NN), an autoencoder (AE), and the reversible network (RN). The visualization results show that an ordinarily trained network cannot ensure reversibility.

The above feed-backward process produces the reconstruction result $\hat{x}$, which is used to maximize the reversibility of the neural network by minimizing a reconstruction error based on the mean square error as follows:

$$\mathcal{L}_{rec}(x, \hat{x}) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{x}_i)^2 \qquad (6)$$

where $x_i$ and $\hat{x}_i$ are the elements of the input $x$ and the feed-backward result $\hat{x}$, and $n$ is the number of elements. Maximizing reversibility by minimizing the reconstruction error is inspired by the AE.
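
A minimal sketch of a mean-square reconstruction error is given below. The function name `reconstruction_loss` and the toy vectors are assumptions for illustration; averaging over the elements is one common convention for the mean square error.

```python
def reconstruction_loss(x, x_hat):
    """Mean square error between an input x and its reconstruction x_hat."""
    if len(x) != len(x_hat):
        raise ValueError("x and x_hat must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

x = [0.5, 2.0, 3.0]
x_hat = [0.5, 2.0, 0.0]          # reconstruction that lost the third element
loss = reconstruction_loss(x, x_hat)
print(loss)
```

A perfectly reversible network would drive this value to zero for every training sample.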

However, the RN and the AE have methodologically different objectives. The AE is an unsupervised learning method whose goal is to minimize the reconstruction error between an input sample and its reconstruction, and this process does not consider the classes of samples. This helps the AE learn significant representations from given samples. On the other hand, the goal of training the RN is to minimize both the classification error and the reconstruction error. This can be regarded as a class-wise embedding of latent features, since learning the RN includes a clustering of the latent features based on their likelihoods, driven by the simultaneous minimization of the two errors. This property of the RN may help to reconstruct latent features corresponding to hard samples. Figure 1 illustrates the workflows and the methodological difference between RN learning and an AE. To apply the above two objectives to network training, we use a straightforward aggregation to compute the total loss function. We aggregate the classification loss of the feed-forward process and the reconstruction loss of the feed-backward process as follows:
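
Reading "straightforward aggregation" as an unweighted sum, the total loss can be sketched as follows. The symbols $\mathcal{L}_{total}$, $\mathcal{L}_{ce}$, and $\mathcal{L}_{rec}$ are assumed names for the total, classification, and reconstruction losses, and the absence of weighting coefficients is an assumption as well:

```latex
\mathcal{L}_{total} = \mathcal{L}_{ce}(\hat{y}, y) + \mathcal{L}_{rec}(x, \hat{x})
```

If the two terms differ strongly in scale, a weighting coefficient on one of them would be a natural extension, but no such coefficient is specified here.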


In our experiments, other non-differentiable operations, including pooling and other downsampling operations, are replaced in the reverse process by an upsampling function based on simple image transformation methods. Figure 2 shows the reconstruction results of latent features using a normal classification network (i.e., a neural network), an AE, and the RN. As shown in Figure 2, the network trained with the classification loss only shows poor reconstruction results compared to the others.

Figure 3: (a) illustrates the methodology for generating the latent features corresponding to hard samples. (b) shows visualization results on the Cifar-10 dataset. The columns represent input samples, original likelihoods (i.e., network outputs), transformed likelihoods, generated latent features, and the reconstruction results of the features, respectively.

2.2 Latent Feature Generation and Learning

Inherently, the simplest approach to boosting the mapping functionality of neural networks is to provide a large-scale and well-categorized dataset that covers the variations of each class. However, it is difficult to construct such a dataset in practice. When a neural network is trained on a biased dataset, the learned features are biased toward the dominant samples, and the samples not included in the dominant sample set are considered hard samples, which can be classified into the wrong class in the test phase.

Input: Input sample $(x, y)$, where $x$ and $y$ are the input data and the corresponding label.
Result: The optimized network parameters $\theta = \{W, b\}$, where $W$ and $b$ are the sets of weight and bias parameters of the network model.
for each sample in a batch do
       The feed-forward
      Compute the network output: $\hat{y} = p(f(x))$
       The feed-backward
      Reconstruct the input sample using the network output $\hat{y}$: $\hat{x} = f^{-1}(p^{-1}(\hat{y}))$
       The latent feature generation
      Generate the latent feature corresponding to hard samples: $\tilde{z} = p^{-1}(\mathrm{tr}(\hat{y}))$
       One-step feed-forward
      Compute the output using the generated features: $\tilde{y} = p(\tilde{z})$
       Loss computing
       Update parameters
      $\theta \leftarrow \theta - \eta \nabla_{\theta} \mathcal{L}$, where $\eta$ is a learning rate.
end for
Algorithm 1: Reversible network learning with latent feature generation for a single batch.

The solution to the above issue using RNL is surprisingly simple: we only need one step each of the feed-backward and feed-forward processes in Eq. 3 and Eq. 1 of the RN. The reversibility of the RN can be applied to generate latent features by providing a reverse mapping from the likelihood of a class to a latent feature. A hard sample is a sample that the model cannot recognize under the closed-set condition, and it is represented as follows: $\arg\max_{k} p(y = k \mid f(x_h); \theta_k) \neq y_h$, where $x_h$ and $y_h$ denote a hard sample and its annotated class, $\theta_k$ represents the parameters of the probability model of the $k$-th class, and $p(y = k \mid f(x_h); \theta_k)$ is the likelihood corresponding to class $k$.

In the feed-backward process of the RN, generating a latent feature can be interpreted as $\tilde{z} = p^{-1}(\tilde{y})$, where $\tilde{y}$ is the given likelihood data for generating a corresponding latent feature, and $\tilde{z}$ is the generated latent feature. This process can be applied to generate the latent features corresponding to hard samples, and it can be represented with Eq. 4 as follows:

$$\tilde{z} = \sigma(W_l^{\top}\, \mathrm{tr}(\hat{y}) + b_{l-1}) \qquad (8)$$

where $\mathrm{tr}$ is a transformation function that modifies the likelihood values of an output $\hat{y}$. In this work, we randomly select some elements of the output vector and assign them a value similar to the maximum likelihood. The detailed method for modifying $\hat{y}$ and applying the result to network training is described in Algorithm 1. The further process is straightforward: the generated latent features are directly applied to the feed-forward process, which is equivalent to the general process for image classification. Figure 3(a) illustrates the process of latent feature generation on the RN, and Figure 3(b) shows generation and visualization results of the latent features.
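
The steps of Algorithm 1 and the likelihood transformation can be sketched for a toy two-layer network as follows. This is a sketch under stated assumptions, not the authors' implementation: the layer sizes, the identity activation, the `ratio` used to make an entry "similar to the maximum likelihood", the reuse of the annotated label for the generated feature, and all function names are hypothetical. The parameter update is omitted; in practice an automatic differentiation framework would compute the gradient of the summed loss.

```python
import math
import random

def matvec(W, v):
    """Return W v for a matrix stored as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def matvec_t(W, v):
    """Return W^T v, the tied-weight reverse mapping."""
    return [sum(W[j][i] * v[j] for j in range(len(W))) for i in range(len(W[0]))]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def softmax(z):
    m = max(z)
    exps = [math.exp(a - m) for a in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_hat, label):
    """Cross-entropy against a one-hot label given as a class index."""
    return -math.log(max(y_hat[label], 1e-12))

def mse(x, x_hat):
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def transform_likelihood(y_hat, rng, n_select=1, ratio=0.9):
    """tr(.): lift randomly chosen non-maximal entries close to the maximum."""
    top = max(y_hat)
    top_idx = y_hat.index(top)
    others = [i for i in range(len(y_hat)) if i != top_idx]
    y_tilde = list(y_hat)
    for i in rng.sample(others, min(n_select, len(others))):
        y_tilde[i] = ratio * top  # ratio is a hypothetical choice
    return y_tilde

def rnl_losses(params, x, label, rng):
    """Loss terms of one Algorithm 1 iteration for a single sample."""
    W1, b1, W2, b2 = params
    # Feed-forward: latent feature z, then class likelihoods y_hat.
    z = add(matvec(W1, x), b1)
    y_hat = softmax(add(matvec(W2, z), b2))
    # Feed-backward: reconstruct the latent feature and the input.
    z_rec = add(matvec_t(W2, y_hat), b1)
    x_hat = matvec_t(W1, z_rec)
    # Latent feature generation from the transformed likelihood.
    z_gen = add(matvec_t(W2, transform_likelihood(y_hat, rng)), b1)
    # One-step feed-forward of the generated feature.
    y_gen = softmax(add(matvec(W2, z_gen), b2))
    return cross_entropy(y_hat, label), mse(x, x_hat), cross_entropy(y_gen, label)

rng = random.Random(0)
D, H, K = 4, 3, 3  # toy input, latent, and class dimensions
W1 = [[rng.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(H)]
W2 = [[rng.uniform(-0.5, 0.5) for _ in range(H)] for _ in range(K)]
params = (W1, [0.0] * H, W2, [0.0] * K)
losses = rnl_losses(params, [0.2, 0.8, 0.1, 0.5], 1, rng)
total = sum(losses)
print(losses, total)
```

The three returned terms correspond to the classification loss on the real sample, the reconstruction loss, and the classification loss on the generated latent feature. Summing them without weights is again an assumption.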

3 Experiment

3.1 Experimental setting and datasets

We compare models trained in the reversible manner with normally trained models. We have implemented a baseline neural network, a very deep neural network (VGGNet) simonyan2014very, a residual network (ResNet) he2016deep, and a densely connected convolutional neural network (DenseNet) huang2017densely. The structural details of the baseline neural network are shown in Table 1. For the other models, we employ the VGG-19, ResNet-18, and DenseNet-40 structures from their respective studies. Our work concentrates on demonstrating the efficiency of RNL, not on achieving state-of-the-art performance. Therefore, the experiments are intentionally based on several baseline models and focus on the comparison between normally trained models and models trained with reversible learning. In our experiments, all networks are trained using stochastic gradient descent (SGD). We employ a learning rate decay of 0.0001 and a momentum of 0.9. The learning rate is initially set to 0.1 and divided by 10 at epochs 20, 40, and 60. We conduct a simple data augmentation by cropping and flipping the given images. Training and evaluation on each dataset are performed 10 times, and the average values over all runs are taken as the final quantitative results for each model. All experiments are conducted using an Nvidia Titan Xp GPU and a 3.20 GHz CPU. The source code for these experiments is implemented with the PyTorch library.
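
The step schedule for the learning rate can be written as a small function. The sketch below assumes zero-based epoch indices and that each drop takes effect at the start of the listed epoch; the function name and signature are hypothetical.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(20, 40, 60), factor=0.1):
    """Step schedule: start at base_lr and multiply by factor at each milestone."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= factor
    return lr

for epoch in (0, 19, 20, 40, 60):
    print(epoch, learning_rate(epoch))
```

In PyTorch, a schedule of this shape is typically expressed with a multi-step scheduler attached to the SGD optimizer rather than computed by hand.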

We demonstrate the efficiency of RNL in the image classification setting. At first, we evaluate the models using the Cifar-10 and Cifar-100 datasets krizhevsky2009learning. The Cifar-10 dataset is composed of 50000 training images and 10000 test images, which are classified into 10 categories. Each category contains 6000 images. The Cifar-100 dataset consists of 100 image categories, and each category has 500 training images and 100 test images. The resolution of each image in both datasets is 32×32. When we train the models mentioned above on the Cifar-10 and Cifar-100 datasets, we use a batch size of 128 in the training stage and a batch size of 100 in the test stage. All images in the Cifar-10 and Cifar-100 datasets are normalized by dividing by the channel-wise expectation values before they are input to the networks.
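
Taking the normalization step literally (each pixel divided by the mean of its channel over the dataset), a minimal sketch looks as follows. The data layout, function names, and toy values are assumptions for illustration.

```python
def channel_means(images):
    """Mean pixel value per channel over a list of images.

    Each image is a list of channels; each channel is a flat list of pixels.
    """
    n_channels = len(images[0])
    means = []
    for c in range(n_channels):
        values = [p for img in images for p in img[c]]
        means.append(sum(values) / len(values))
    return means

def normalize(image, means):
    """Divide every pixel by the expectation value of its channel."""
    return [[p / m for p in channel] for channel, m in zip(image, means)]

# Two toy images with two channels of two pixels each.
images = [
    [[2.0, 4.0], [10.0, 30.0]],
    [[6.0, 8.0], [20.0, 20.0]],
]
means = channel_means(images)
normalized = normalize(images[0], means)
print(means, normalized)
```

After this step, each channel has a mean of 1 over the dataset, which keeps the input scale comparable across channels.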

Figure 4: (a) and (b) show the trends of loss and accuracy, respectively, on the training and test sets of the Cifar-10 dataset. (c) and (d) show the trends of loss and accuracy, respectively, on the training and test sets of the Cifar-100 dataset. The baseline neural network (NN) and the reversible network (RN) are used for this experiment. Solid lines denote evaluation on the training set, and dotted lines denote evaluation on the test set.
Layer      Kernel         Act
Conv       5×5×3×32
Conv       5×5×32×32      L-Relu
Max-pool   -              -
Conv       5×5×32×64      L-Relu
Conv       5×5×64×64      L-Relu
Max-pool   -              -
Conv       5×5×64×128     L-Relu
Conv       5×5×128×128    L-Relu
Fc1        2048×256       L-Relu
Fc2        256×$K$        Softmax
Table 1: Structural details of the baseline neural network used for the baseline neural network (NN) and the reversible network (RN) in our experiments. $K$ is the number of classes corresponding to a given dataset.
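
As a consistency check, the kernel shapes in Table 1 can be multiplied out to estimate the parameter count of the baseline network. The sketch assumes that each convolution kernel is listed as height × width × input channels × output channels, that every layer has one bias per output unit, and that the number of classes is 10 (Cifar-10). These readings are assumptions.

```python
conv_kernels = [
    (5, 5, 3, 32), (5, 5, 32, 32),
    (5, 5, 32, 64), (5, 5, 64, 64),
    (5, 5, 64, 128), (5, 5, 128, 128),
]
fc_shapes = [(2048, 256), (256, 10)]  # 10 output classes assumed (Cifar-10)

def count_parameters(conv_kernels, fc_shapes, with_bias=True):
    """Sum weights (and optionally biases) over all listed layers."""
    total = 0
    for kh, kw, c_in, c_out in conv_kernels:
        total += kh * kw * c_in * c_out + (c_out if with_bias else 0)
    for n_in, n_out in fc_shapes:
        total += n_in * n_out + (n_out if with_bias else 0)
    return total

weights_only = count_parameters(conv_kernels, fc_shapes, with_bias=False)
with_biases = count_parameters(conv_kernels, fc_shapes, with_bias=True)
print(weights_only, with_biases)
```

Both counts are roughly 1.3 million, which agrees with the 1.3M parameters reported for the baseline models in Table 2.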
Method                       Params   C10     C10+    C100    C100+
Baseline-NN                  1.3M     20.18   17.17   49.72   40.16
Baseline-RN                  1.3M     13.62   9.17    45.19   34.61
VGG-19 simonyan2014very      13.4M    8.48    7.82    43.80   28.96
Reversible-VGG-19            13.4M    7.12    6.94    37.56   24.91
ResNet he2016deep            1.7M     7.92    6.53    33.41   23.24
Reversible-ResNet            1.7M     6.01    5.94    27.72   20.03
DenseNet huang2017densely    1.0M     8.01    6.47    28.15   23.24
Reversible-DenseNet          1.0M     5.84    5.17    22.17   20.94
Table 2: Error rates (%) on Cifar-10 and Cifar-100. 'Reversible' denotes that the model is trained with the proposed reconstruction error. '+' indicates that data augmentation based on simple image transformation is used. The bolded value is the best performance in our experiments.

In addition to the experiments on the Cifar-10 and Cifar-100 datasets, we have carried out additional experiments with an Extremely Biased and Poorly Categorized (EBPC) dataset, which we make available. We propose the EBPC dataset for observing network performance when networks are trained with a highly unbalanced and poorly classified dataset. The EBPC dataset is constructed by roughly combining several public datasets; it has 3,470 classes and consists of 271,516 images for training and 82,771 images for testing. The source datasets were originally proposed for image classification, face recognition, and person re-identification. They are as follows: 1) the MNIST dataset lecun1998gradient, 2) the Cifar-10 and Cifar-100 datasets krizhevsky2009learning, 3) the Stanford Dogs dataset khosla2011novel, 4) the Flowers dataset with 102 categories nilsback2008automated, 5) the LFW face dataset learned2016labeled, 6) the CUHK03 dataset li2014deepreid, and 7) the SVHN dataset netzer2011reading. Even if several labels are semantically similar, they are treated as different classes in the EBPC dataset. For example, the automobile class in the Cifar-10 dataset and the vehicles class in the Cifar-100 dataset are considered different classes. We did not artificially increase the number of samples in each class; we only normalized the image size across the datasets. Each class has a minimum of 2 and a maximum of 10000 samples. The details of each source dataset and the quantitative properties of the EBPC dataset are shown in Table 3. As in the experiments on the Cifar-10 and Cifar-100 datasets, we use a batch size of 128 for training and a batch size of 100 for testing the network models.

Figure 5: (a) shows the trend of loss on the training and test sets of the EBPC dataset. (b) shows the trend of accuracy on the training and test sets of the EBPC dataset. Solid lines denote evaluation on the training set, and dotted lines denote evaluation on the test set.
Dataset                             Subject   Class#   Train#   Test#   Samples/class
Cifar-10 krizhevsky2009learning     IC        10       50000    10000   5000
Cifar-100 krizhevsky2009learning    IC        100      50000    10000   500
Flower102 nilsback2008automated     IC        102      1020     6149    102
CUHK03 li2014deepreid               PRI       1467     19574    8619    13.3
LFW huang2008labeled                FR        1650     5665     3391    3.4
MNIST lecun1998gradient             IC        10       60000    10000   6000
Stanford khosla2011novel            IC        120      12000    8580    1000
SVHN netzer2011reading              IC        10       73257    26032   7325.7
Total                               -         3470     271516   82771   -
Table 3: Composition of the extremely biased and poorly categorized (EBPC) dataset. 'Samples/class' denotes the number of samples per class on the train set of each dataset. 'IC', 'PRI', and 'FR' denote 'Image classification', 'Person re-identification', and 'Face recognition'.
Method                              EBPC    EBPC+
Baseline-NN                         50.27   45.83
Baseline-RN                         44.19   36.88
VGG-16 simonyan2014very             39.97   36.71
Reversible-VGG-16                   34.18   31.59
ResNet-18 he2016deep                39.74   34.64
Reversible-ResNet-18                32.36   30.86
DenseNet-32 huang2017densely        37.51   31.95
Reversible-DenseNet-32              31.88   28.74
Table 4: Error rates (%) on the EBPC dataset. 'Reversible' denotes that the model is trained with the proposed reconstruction error. '+' indicates that simple data augmentation is used. The bolded value is the best performance in our experiments.

3.2 Quantitative comparison

Table 2 lists the classification errors of the network models on the Cifar-10 and Cifar-100 datasets. The lowest classification error on Cifar-10 is achieved by the reversible DenseNet with simple data augmentation, which shows a 5.17% classification error on Cifar-10 and a 20.94% classification error on Cifar-100. Among the results using ResNet, the lowest errors on Cifar-10 and Cifar-100 are 5.94% and 20.03%, respectively; the latter is the lowest Cifar-100 error in Table 2. Among the results using VGG-19, 6.94% and 24.91% are the lowest classification errors on Cifar-10 and Cifar-100. In every comparison, the model trained with the proposed reversible learning achieves a lower classification error than its normally trained counterpart. For the baseline models, the network trained with the proposed method lowers the classification error by between 4.53 and 8.00 percentage points, whether or not the simple data augmentation is applied. The evaluation results using the other network models show a similar trend.
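
The size of the improvement for the baseline models can be recomputed from the Baseline-NN and Baseline-RN rows of Table 2. The sketch below reports both the absolute reduction in percentage points and the relative reduction; the variable and function names are illustrative.

```python
# Error rates (%) copied from the Baseline-NN and Baseline-RN rows of Table 2.
baseline_nn = {"C10": 20.18, "C10+": 17.17, "C100": 49.72, "C100+": 40.16}
baseline_rn = {"C10": 13.62, "C10+": 9.17, "C100": 45.19, "C100+": 34.61}

def error_reduction(before, after):
    """Return (absolute reduction in points, relative reduction in percent)."""
    absolute = before - after
    relative = 100.0 * absolute / before
    return absolute, relative

reductions = {k: error_reduction(baseline_nn[k], baseline_rn[k]) for k in baseline_nn}
for setting, (absolute, relative) in reductions.items():
    print(f"{setting}: {absolute:.2f} points, {relative:.1f}% relative")
```

The absolute and relative views rank the settings differently, so it helps to state which one is meant when quoting a single improvement figure.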

In the experimental results on the EBPC dataset, the lowest classification error is 28.74%, achieved by the DenseNet trained with reversible learning and data augmentation. The ResNet model, which achieved 5.94% and 20.03% classification errors on the Cifar-10 and Cifar-100 datasets, respectively, recorded a 40.08% error on the EBPC dataset. VGG-19 also achieves a 41.59% classification error in this experiment. The overall classification performance on the EBPC dataset is lower than on the Cifar-10 and Cifar-100 datasets. The typically trained DenseNet achieves a 31.88% classification error, which is 3.14 percentage points higher than that of the reversibly trained model. The experimental results using VGG-19 and ResNet show a similar trend to those using DenseNet. The VGG-19 trained with RNL shows a 3% lower classification error than its normally trained counterpart, and the ResNet trained with RNL also achieves better performance than its counterpart. The classification errors on the EBPC dataset are presented in Table 4.

3.3 Analysis

The experimental results show that the network models trained with RNL outperform the normally trained models. The most noticeable point in our experiments is that the models trained to improve network reversibility consistently achieve better performance, whether the performance difference is small or large.

Our interpretation of these performance improvements is as follows. As mentioned in Section 2, the latent feature generation method based on RNL can influence the recognition performance of a neural network model. We tried to improve the mapping functionality using RNL. RNL encourages the reversibility of a neural network, so that the network can reconstruct input data in a supervised learning setting. In the learning procedure, the proposed RNL plays a critical role in explicitly improving network reversibility. The experimental results on Cifar-10 and Cifar-100 show that the models trained with RNL achieve lower classification errors than the others. In addition to the classification errors, the descending trends of the loss also show that the models applying RNL achieve better performance. Figure 4(a) and Figure 4(c) show the loss trends of the baseline network models, trained with RNL and with the general training process, on the Cifar-10 and Cifar-100 datasets. The models are trained and tested on the training and test sets of the Cifar-10 and Cifar-100 datasets. These graphs show that the networks applying RNL reach a lower loss than the others. Additionally, the classification accuracy trends during network training, shown in Figure 4(b) and Figure 4(d), show a similar pattern.

Like the results on the Cifar-10 and Cifar-100 datasets, the experimental results on the EBPC dataset show that the models with improved reversibility achieve better classification accuracy than the normally trained models. Figure 5 illustrates the trends of loss and accuracy of the baseline network model depending on the training manner. Both the cross-entropy graphs and the classification accuracy graphs show that the networks trained with RNL provide better classification performance than the others. Interestingly, whereas the cross-entropy curve of the RN on the test set of the EBPC dataset gradually decreases during training, the cross-entropy curve of the NN increases during training. This may mean that RNL with latent feature generation can be considered a stable learning method when a biased and poorly classified dataset is given.

4 Conclusion

In this paper, we propose a reversible learning method to boost the mapping capability of neural networks. The proposed method automatically generates and learns latent features regarded as samples that have a lower likelihood. Thus, it can improve the mapping capability of neural networks without additional data augmentation or a complementary process for resampling a given dataset. It can also be a memory- and cost-effective approach, since it does not augment or generate samples for the dataset itself; instead, it generates latent features, which have a lower dimensionality than the given samples. Additionally, the proposed method does not require modification of network structures or loss functions, and it may be easily applied to various recognition methods using neural networks, not only for visual recognition but also for speech recognition. The experimental results show that the network models trained with the proposed method can outperform existing models.