## 1 Introduction

Convolutional Neural Networks (CNNs) [CNN] are attracting a great deal of attention because they show remarkable performance in general object recognition tasks. Various methods have been proposed so far for improving the performance of CNNs: pre-processing [preprocess1, preprocess2, preprocess3], dropout [dropout], batch normalization [batchnorm], ensemble learning [Ensenble1, Ensemble2], and so on.

In this paper, we propose a new model based on CNNs to further improve performance in image recognition tasks. Our model consists of one base CNN and multiple Fully Connected SubNetworks (FCSNs). The base CNN generates a set of multi-channel feature-maps after each convolutional layer. The set of feature-maps generated by the last convolutional layer is divided along channels into disjoint subsets, and each subset is assigned to one of the FCSNs, which is trained independently of the others to predict the class label from the subset of feature-maps assigned to it. The output of the overall model is determined by a majority vote of the base CNN and the FCSNs. In other words, the proposed method performs ensemble learning, and we therefore call the model EnsNet. For ensemble learning to be effective, the base learners must exhibit a certain degree of diversity. The FCSNs are expected to have this property because different subnetworks are trained on different training data, namely different subsets of the feature-maps.

In what follows, we first explain the architecture of the EnsNet and how to train it. We then report experiments on the MNIST [MNIST], Fashion-MNIST [Fashion-MNIST], and CIFAR-10 [Cifar-10] datasets, which show that the proposed approach improves the performance of CNNs. In particular, an EnsNet achieves a state-of-the-art error rate of 0.16% on MNIST.

## 2 EnsNet

### 2.1 Architecture

The proposed model, called EnsNet, consists of one base CNN and multiple subnetworks, as shown in Fig. 1. The structure of the base CNN varies depending on the image recognition task. Table 1 shows the two structures used in the experiments in Section 3: one for MNIST and Fashion-MNIST, and the other for CIFAR-10. The ReLU activation function is used in the proposed model, though this is not explicitly shown in Table 1. The set of feature-maps generated by the last convolutional layer of the base CNN is divided along channels into disjoint subsets, and each subset is fed into one of the subnetworks. Each subnetwork is a fully connected neural network consisting of multiple weight layers. Table 2 shows the structures of the subnetworks used in the experiments: one for MNIST and Fashion-MNIST, and the other for CIFAR-10. The output of the overall model is determined by a majority vote of the base CNN and the subnetworks.

Table 1: Structures of the base CNN (left: MNIST and Fashion-MNIST; right: CIFAR-10).

| Input: MNIST or Fashion-MNIST image | Input: CIFAR-10 image |
|---|---|
| Conv3-64 (zero padding) | Conv3-64 (zero padding) |
| BatchNormalization | BatchNormalization |
| Dropout(0.35) | Dropout(0.25) |
| Conv3-128 | Conv3-128 |
| BatchNormalization | BatchNormalization |
| Dropout(0.35) | Dropout(0.25) |
| Conv3-256 (zero padding) | Conv3-256 (zero padding) |
| BatchNormalization | BatchNormalization |
| maxpool() | maxpool() |
| Dropout(0.35) | Dropout(0.25) |
| Conv3-512 (zero padding) | Conv3-512 (zero padding) |
| BatchNormalization | BatchNormalization |
| Dropout(0.35) | Dropout(0.25) |
| Conv3-1024 | Conv3-1024 |
| BatchNormalization | BatchNormalization |
| Dropout(0.35) | Dropout(0.25) |
| Conv3-2000 (zero padding) | Conv3-2048 (zero padding) |
| BatchNormalization | BatchNormalization |
| maxpool() | maxpool() |
| Dropout(0.35) | Dropout(0.25) |
| | Conv3-3000 (zero padding) |
| | BatchNormalization |
| | Dropout(0.25) |
| | Conv3-3500 (zero padding) |
| | BatchNormalization |
| | Dropout(0.25) |
| | Conv3-4000 (zero padding) |
| | BatchNormalization |
| | Dropout(0.25) |
| Dividing feature-maps (10 divisions) | Dividing feature-maps (10 divisions) |
| FC-512 | FC-512 |
| BatchNormalization | BatchNormalization |
| Dropout(0.5) | Dropout(0.3) |
| Dropconnect(0.5) [Dropconnect, ChainerDropconnect] | Dropconnect(0.3) |
| FC-512 | FC-512 |
| FC-10 | FC-10 |
| soft-max | soft-max |

Table 2: Structures of the subnetworks (left: MNIST and Fashion-MNIST; right: CIFAR-10).

| Input: feature-maps (MNIST or Fashion-MNIST) | Input: feature-maps (CIFAR-10) |
|---|---|
| FC-512 | FC-512 |
| BatchNormalization | BatchNormalization |
| Dropout(0.5) | Dropout(0.3) |
| Dropconnect(0.5) | Dropconnect(0.3) |
| FC-512 | FC-512 |
| FC-10 | FC-10 |
| soft-max | soft-max |
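The channel-wise division and the majority vote described in this section can be illustrated with a minimal NumPy sketch. It is not the authors' Chainer implementation: the function names `split_channels` and `majority_vote`, the 6x6 spatial size, and the tie-breaking rule (lowest class index wins) are assumptions introduced here for illustration only.

```python
import numpy as np

def split_channels(feature_maps, num_subsets=10):
    """Split (batch, channels, height, width) feature-maps into disjoint channel groups."""
    # np.split requires the channel count to be divisible by num_subsets.
    return np.split(feature_maps, num_subsets, axis=1)

def majority_vote(predictions, num_classes=10):
    """Combine predicted labels of shape (num_voters, batch) into one label per sample."""
    predictions = np.asarray(predictions)
    # counts[i, c] = number of voters that predicted class c for sample i
    counts = np.stack(
        [(predictions == c).sum(axis=0) for c in range(num_classes)], axis=1
    )
    # Assumption: ties are broken in favour of the smallest class index,
    # because argmax returns the first maximum. The paper does not specify this.
    return counts.argmax(axis=1)

# 2000 channels as in the last MNIST convolutional layer; the 6x6 size is hypothetical.
features = np.zeros((4, 2000, 6, 6), dtype=np.float32)
subsets = split_channels(features)   # 10 arrays with 200 channels each

# Three hypothetical voters (e.g. the base CNN and two subnetworks), two samples.
votes = [[1, 2],
         [1, 3],
         [0, 3]]
final_labels = majority_vote(votes)
```

With ten subnetworks and the base CNN, the vote is taken over eleven predictions per sample, so `predictions` would have shape `(11, batch)`.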

### 2.2 Training

The EnsNet is trained by alternating two steps: the base CNN training step and the subnetworks training step. In the base CNN training step, the parameters of the convolutional layers and the fully connected layers of the base CNN are updated by an optimization algorithm, while the parameters of the subnetworks are fixed. In the experiments in Section 3, the Adam optimizer [Adam] is used. In the subnetworks training step, the parameters of the base CNN are fixed, and each subnetwork is trained independently of the other subnetworks, using as training data the corresponding subset of the feature-maps generated by the last convolutional layer of the base CNN together with the target class labels. The parameters of the fully connected layers of each subnetwork are updated by the same optimization algorithm as the base CNN.

## 3 Classification Experiments

### 3.1 Setup

In order to evaluate the effectiveness of the EnsNet, we conducted classification experiments using the MNIST, Fashion-MNIST, and CIFAR-10 datasets. The models used in the experiments were implemented in the Chainer framework [chainer] and trained with the Adam optimizer [Adam]. The parameters of the Adam optimizer were set as follows: , , , , and a weight decay was set to .

The MNIST dataset is a collection of grayscale images of handwritten digits from 0 to 9. The training and test sets consist of 60,000 and 10,000 images, respectively. Before training, we augmented the data by rotating images by various angles between and , scaling images by various factors between and , shifting images along the width or height direction by a fraction between and of the total width or the total height, and shearing images with various angles between and . We also set the batch size and the number of epochs to 100 and 1,300, respectively.

The Fashion-MNIST dataset is a collection of gray scale images in 10 classes. The training and test sets consist of 60,000 and 10,000 images, respectively. Before training, we augmented data by rotating images by various angles between and . We also set the batch size and the number of epochs to 100 and 600, respectively.

The CIFAR-10 dataset is a collection of color images in 10 classes. The training and test sets consist of 50,000 and 10,000 images, respectively. Before training, we augmented the data by rotating images by various angles between and , scaling images by various factors between and , shifting images along the width or height direction by a fraction between and of the total width or the total height, and shearing images with various angles between and . We also set the batch size and the number of epochs to 100 and 200, respectively. Furthermore, we decayed the learning rate by every 100 epochs.

### 3.2 Effect of Fully Connected Subnetworks

In the first experiment, in order to evaluate the effectiveness of the fully connected subnetworks and the majority vote, we trained two models on the MNIST dataset: the EnsNet with the structure shown in the left columns of Tables 1 and 2, and the base CNN in the EnsNet. We then measured how the test set accuracies of these two models change as the number of epochs increases. Fig. 2 shows the results of this experiment. The test set accuracy of the EnsNet is higher than that of the base CNN, which indicates that the fully connected subnetworks and the majority vote can improve the performance of CNNs.

### 3.3 Comparison with Other Models

In the second experiment, we trained the EnsNet and some conventional models, including the base CNN, using the MNIST, Fashion-MNIST, and CIFAR-10 datasets, and measured the error rates on the test sets.

Table 3 shows the error rates of six models for the MNIST dataset: Random Multimodel Deep Learning for Classification (RMDL) [RMDL], Dropconnect [Dropconnect], Multi-Column Deep Neural Network (MCDNN) [MCDNN], Augmented PAttern Classification (APAC) [APAC], EnsNet, and the base CNN in the EnsNet. Here, EnsNet refers to the model with the structure shown in the left columns of Tables 1 and 2. The EnsNet achieved an error rate of 0.16%, which is lower than that of RMDL, one of the state-of-the-art models for the MNIST classification task.

Table 4 shows the error rates of four models for the Fashion-MNIST dataset: Random Erasing [RandomErasing], VGG8B(2x)+LocalLearning+CO [VGG8LCO], EnsNet, and the base CNN in the EnsNet. Here again, EnsNet refers to the model with the structure shown in the left columns of Tables 1 and 2. The EnsNet is not the best of the four, but it outperforms the base CNN.

Table 5 shows the error rates for the CIFAR-10 dataset of the EnsNet with the structure shown in the right columns of Tables 1 and 2, and of the base CNN in the EnsNet. The EnsNet achieved a lower error rate than the base CNN. These results indicate that the fully connected subnetworks and the majority vote improved the performance of CNNs on all three datasets.

Table 3: Error rates on the MNIST test set.

| Model | Error rate |
|---|---|
| RMDL [RMDL] | 0.18% |
| Dropconnect [Dropconnect] | 0.21% |
| MCDNN [MCDNN] | 0.23% |
| APAC [APAC] | 0.23% |
| EnsNet (Proposed) | 0.16% |
| Base CNN in EnsNet | 0.21% |

Table 4: Error rates on the Fashion-MNIST test set.

| Model | Error rate |
|---|---|
| Random Erasing [RandomErasing] | 3.65% |
| VGG8B(2x)+LocalLearning+CO [VGG8LCO] | 4.14% |
| EnsNet (Proposed) | 4.70% |
| Base CNN in EnsNet | 5.00% |

Table 5: Error rates on the CIFAR-10 test set.

| Model | Error rate |
|---|---|
| EnsNet (Proposed) | 23.75% |
| Base CNN in EnsNet | 23.90% |
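To put the reported error rates in perspective, the short calculation below converts them into counts of misclassified test images and into relative error reductions over the base CNN. It uses only the error rates of EnsNet and the base CNN reported above and the 10,000-image test-set size given in Section 3.1; the helper names `misclassified` and `relative_reduction` are introduced here for illustration.

```python
TEST_SET_SIZE = 10_000  # test-set size stated in Section 3.1 for all three datasets

# (EnsNet error %, base CNN error %) as reported for each dataset
error_rates = {
    "MNIST": (0.16, 0.21),
    "Fashion-MNIST": (4.70, 5.00),
    "CIFAR-10": (23.75, 23.90),
}

def misclassified(error_percent, n=TEST_SET_SIZE):
    """Number of misclassified test images implied by an error rate in percent."""
    return round(error_percent / 100 * n)

def relative_reduction(ens_error, base_error):
    """Fraction by which EnsNet reduces the base CNN's error rate."""
    return (base_error - ens_error) / base_error

summary = {}
for name, (ens, base) in error_rates.items():
    summary[name] = (misclassified(ens), misclassified(base), relative_reduction(ens, base))
    print(f"{name}: {summary[name][0]} vs {summary[name][1]} errors, "
          f"relative reduction {summary[name][2]:.1%}")
```

On MNIST, for example, the gap between 0.16% and 0.21% corresponds to 16 versus 21 misclassified images out of 10,000, so differences of this size rest on a handful of test samples.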

## 4 Conclusion

We proposed a new CNN model called EnsNet, which is composed of one base CNN and multiple fully connected subnetworks. In this model, the set of feature-maps generated by the last convolutional layer of the base CNN is divided into disjoint subsets, and each subset is fed into one of the subnetworks as its input. The EnsNet is trained by alternately updating the parameters of the base CNN and those of the subnetworks, and its prediction is made by a majority vote of the base CNN and the subnetworks. Experimental results on the MNIST, Fashion-MNIST, and CIFAR-10 datasets show that the EnsNet outperforms the base CNN. In particular, on MNIST the EnsNet achieves an error rate of 0.16%, the lowest among the state-of-the-art models compared. Future work includes evaluating the effectiveness of our approach on other CNN models such as ResNet [ResNet].
