The CIFAR-10 dataset includes 10 classes, with 5 thousand training images and 1 thousand testing images per class [cifar]. The images are 32x32 pixels with three color channels [Fig 1]. State-of-the-art methods with the highest classification accuracies rely on very large neural networks and custom architectures that approach the limits of performance (greater than 98% accuracy [table 1]). These methods require significant customization and long training times, and may not generalize to image classification outside of the CIFAR-10 labels. CIFAR-VGG is a modified VGG16 architecture with additional dropout and weight decay to reduce overfitting. These changes lead to a competitive published classification accuracy of 93.43% on the CIFAR-10 testing dataset. We used this network as our gold standard because it is simpler than modern, highly specific, manually optimized architectures.
Given the wide variety of available solutions to image classification problems, we were curious whether different convolutional neural network architectures learn unique features, and whether combining these features with histogram of oriented gradients (HOG) features and pixel intensity values would improve predictive power. Rather than designing a single architecture to maximize classification accuracy, we chose to integrate feature sets obtained from HOG, pixel intensities, VGG16 and Inception ResNet v2 with transfer learning (TL-VGG and TL-Inception, respectively), and CIFAR-VGG, with the aim of obtaining image classification accuracies that surpass CIFAR-VGG alone.
Fig. 1: Images from the CIFAR-10 dataset. The dataset contains images grouped into 10 unique classes (rows). Each class contains images belonging to the training dataset (5 thousand images) and the testing/validation dataset (1 thousand images). Images are low-resolution (32x32) RGB color images.
Convolutional Neural Networks (CNNs) optimize convolution kernels that map image characteristics to a lower dimensionality, allowing for improved memory management while maintaining relevant image information. Fully connected layers within CNNs allow optimized weighting of convolutional features and their eventual mapping onto discrete categories.
Data augmentation increases training dataset size by adding images derived from the original dataset via simple transformations, such as horizontal and vertical flipping, small angle rotation, and translation. These new images should retain their original class membership without loss of meaningful group information. These methods may also improve model generalizability.
Ensemble methods are an approach to algorithm improvement that combines the results of relatively weak classifiers, with the assumption that significant learned features and transformations are retained, whereas inconsistent and unhelpful perspectives are reduced in the final model. This requires that the original weak classifiers are sufficiently independent that their combination results in novel strategies. We developed and trained fully-connected neural networks (FCNNs), which receive pre-computed features from each image and assign labels according to internal weights that minimize error across the training dataset. The features used include: HOG, pixel intensities, and TL-VGG, TL-Inception, and CIFAR-VGG final dense layer outputs [Fig. 2]. The training dataset was augmented with horizontally flipped training images to increase its size from 5K to 10K images per class. All Deep Learning (DL) was implemented with TensorFlow version 2 and Keras in Python. The testing dataset was used to calculate validation accuracy throughout training. Final model accuracy was taken to be the validation accuracy upon completion of model training. All code is available on GitHub (https://github.com/jvizcar/feature_ensembles_clf).
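The horizontal-flip augmentation described above can be sketched as follows. This is a minimal illustration on synthetic arrays, not the authors' code; the function name `augment_with_horizontal_flips` and the `(N, H, W, C)` array layout are assumptions made for the example.

```python
import numpy as np

def augment_with_horizontal_flips(images, labels):
    """Return the originals plus one horizontally flipped copy of each image.

    images: array of shape (N, H, W, C); labels: array of shape (N,).
    The flipped copies keep the labels of their source images.
    """
    flipped = images[:, :, ::-1, :]  # reverse the width axis
    aug_images = np.concatenate([images, flipped], axis=0)
    aug_labels = np.concatenate([labels, labels], axis=0)
    return aug_images, aug_labels

# Synthetic stand-in for CIFAR-10 images (random pixels, not real data).
rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=(8, 32, 32, 3), dtype=np.uint8)
y = rng.integers(0, 10, size=8)

x_aug, y_aug = augment_with_horizontal_flips(x, y)
print(x_aug.shape, y_aug.shape)
```

Applied to the full training set, this doubles the number of images per class, which matches the 5K to 10K increase stated above.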
II-A HOG Feature Generation
Table 1: Published state-of-the-art CIFAR-10 classification accuracies.
|Method|Accuracy|
|---|---|
|AutoAugment: Learning Augmentation Policies from Data|98.52%|
|XNAS: Neural Architecture Search with Expert Advice|98.40%|
HOG is a common image feature extraction method that partitions the image into equally sized sections and calculates the intensity gradient direction within each section. A histogram of these orientations is then created and the counts are used as the feature values. The "hog" function in Python's scikit-image package was used to generate HOG features. We chose this feature set generation method because it provides meaningful information about edge content.
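As a rough illustration of the HOG step, the sketch below calls scikit-image's `hog` function on a single 32x32 image. The parameter values (9 orientations, 8x8-pixel cells, 2x2-cell blocks) and the grayscale conversion by channel averaging are assumptions chosen for the example; the text does not state which settings were used.

```python
import numpy as np
from skimage.feature import hog

def hog_features(image_rgb, orientations=9, pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2)):
    """Compute a flat HOG feature vector for one RGB image.

    The image is first averaged over its color channels (an assumption;
    other grayscale conversions are equally plausible).
    """
    gray = image_rgb.astype(np.float64).mean(axis=2)
    return hog(
        gray,
        orientations=orientations,
        pixels_per_cell=pixels_per_cell,
        cells_per_block=cells_per_block,
    )

# Synthetic 32x32 RGB image (random values, not a real CIFAR-10 image).
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))

features = hog_features(image)
print(features.shape)
```

With these assumed settings a 32x32 image has 4x4 cells and 3x3 overlapping blocks, each contributing 2 x 2 x 9 values, so the feature vector has 324 entries.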
II-B Image Pixel Values as Features
The 32x32, 3-channel images were flattened into an array of 3072 intensity values (32 x 32 pixels x 3 channels). These features represent the entire image, and can therefore be used as raw representations of the original image when input to the FCNN.
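The flattening step amounts to a single reshape. In the short sketch below, the placeholder array and the scaling of intensities to [0, 1] are assumptions for illustration; the text does not say whether pixel values were rescaled.

```python
import numpy as np

# Placeholder batch of five 32x32 RGB images (zeros, not real data).
images = np.zeros((5, 32, 32, 3), dtype=np.uint8)

# Flatten each image to one row of 32 * 32 * 3 = 3072 values.
# Dividing by 255 is an assumed normalization, not stated in the text.
pixel_features = images.reshape(len(images), -1).astype(np.float32) / 255.0
print(pixel_features.shape)
```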
II-C TL-VGG Network Optimization & Features
VGG16 initialized with ImageNet weights was trained on the augmented CIFAR-10 training set to optimize network weights. During training, early stopping was implemented using a "patience" parameter of 10 epochs, meaning that training stopped after 10 consecutive epochs without an increase in validation accuracy. The parameter "min_delta" was kept at its default value of 0, meaning that any increase in validation accuracy would reset the early stopping epoch counter. This limits the effects of overfitting by halting training once validation accuracy stops improving. The final softmax layer, mapping to 1000 output classes, was replaced with a dense layer, a 50% dropout layer, and a softmax layer mapping to the CIFAR-10 labels [Fig. 3]. These layers were introduced to maximize TL-VGG classification accuracy during the transfer learning process. To extract model features, the top (softmax) dense layer was removed, along with the dropout layer, and the output of the preceding dense layer was used as the feature vector, resulting in a 512-length labeled feature vector per image.
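The patience and min_delta behavior described above can be made concrete with a small, framework-free sketch. It mirrors the stopping rule as described in the text rather than reproducing Keras internals; the function name `early_stopping_epoch` and the example accuracy history are hypothetical.

```python
def early_stopping_epoch(val_accuracies, patience=10, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None.

    An epoch counts as an improvement only if validation accuracy exceeds
    the best value seen so far by more than min_delta. Training stops once
    `patience` consecutive epochs pass without an improvement.
    """
    best = float("-inf")
    wait = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc - best > min_delta:
            best = acc
            wait = 0  # any improvement resets the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical history: three improving epochs, then a plateau.
history = [0.50, 0.60, 0.65] + [0.65] * 10
stop_epoch = early_stopping_epoch(history)
print(stop_epoch)  # 13
```

In Keras, the corresponding callback would presumably be configured as `tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=10, min_delta=0)`; the monitored metric name is an assumption, since the text only says that validation accuracy was tracked.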
II-D TL-Inception Network Optimization & Features
Inception ResNet v2 was retrained on the augmented training set using ImageNet starting weights, after replacement of its top softmax layer with a softmax layer mapping to the CIFAR labels, to generate the TL-Inception model. Features were generated by removing the top softmax and dropout layers of the network. The final feature set included 1024 features representing each image.
II-E CIFAR-VGG Features
CIFAR-VGG features were obtained by removing the top softmax and dropout layers to generate 512 image features for every CIFAR-10 image [Fig. 4]. The trained model weights were loaded directly from provided source files (https://github.com/geifmany/cifar-vgg).
Fig. 2: FCNN single feature set performance. Two methods were used to generate features without deep learning models and four deep learning models were used to extract features. These features were each used to train an FCNN model and predict on the testing data. Top boxes represent the individual feature sets; the bottom polygon shows the accuracy when using the trained FCNN to predict on the testing dataset. TL: transfer learning, HOG: histogram of oriented gradients.
Fig. 3: Modification of VGG16 used with transfer learning. The standard VGG16 model was modified by adding a flatten layer, followed by a dense layer with dropout and a softmax layer for 10 classes. Transfer learning was used during training of the model with early stopping.
Fig. 4: CIFAR-VGG model. Architecture of the CIFAR-VGG model adapted from (https://github.com/geifmany/cifar-vgg). Features were extracted from this previously trained model.
II-F FCNN Generation and Training
To test the ability of each feature set to categorize test set images, an FCNN with a simple architecture was developed. Features are connected to a dense layer of 300 neurons, followed by a 50% dropout layer and a 100-neuron dense layer fully connected to a softmax layer with 10 output classes. Dropout and early stopping (with a patience of 10 epochs) were included to prevent overfitting of the network. This was of particular concern because the CNN features were already the result of significant training.
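The FCNN layout described above could be written in Keras roughly as follows. The layer sizes and dropout rate come from the text; the ReLU activations, the Adam optimizer, the loss function, and the helper name `build_fcnn` are assumptions for this sketch and may differ from the authors' implementation.

```python
import tensorflow as tf

def build_fcnn(num_features, num_classes=10):
    """Build a small fully connected classifier over pre-computed features."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(num_features,)),
        tf.keras.layers.Dense(300, activation="relu"),   # activation assumed
        tf.keras.layers.Dropout(0.5),                    # 50% dropout
        tf.keras.layers.Dense(100, activation="relu"),   # activation assumed
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    # Optimizer and loss are assumptions; integer class labels are expected.
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

# Example: a classifier over 512-length feature vectors.
model = build_fcnn(num_features=512)
model.summary()
```

The same builder can be reused for each feature set by changing `num_features` (for example 3072 for raw pixel intensities).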
III-A Baseline Model Accuracy
Baseline CNN test-image classification accuracy was measured for VGG16 (ImageNet weights), TL-VGG, CIFAR-VGG, Inception ResNet v2 (ImageNet weights), and TL-Inception. Inception ResNet v2 performed at chance (10% accuracy) on the 10-class problem, potentially signifying a lack of generalizability of the pretrained model or a lack of similarity between the CIFAR-10 dataset and ImageNet. We suspect both are true, given the relatively large network size of Inception ResNet v2 and the significant image dimension differences between ImageNet (128x128, 3 channels) and CIFAR-10 (32x32, 3 channels).
The two models retrained with transfer learning, TL-VGG and TL-Inception, substantially outperformed their original networks: accuracy rose from 60% to 85% for the VGG architecture, and from 10% to 85% for Inception ResNet v2 (a gain of 75 percentage points). CIFAR-VGG outperformed all other CNN models with a 93.43% test-set classification accuracy.
III-B Individual Feature Set Classification Success
To compare feature set classification potential, the HOG, pixel intensity, TL-VGG, TL-Inception, and CIFAR-VGG feature sets were each used to train an FCNN independently, and the resulting network was used to calculate testing accuracy [Fig. 2]. We observed modest performance with the HOG and pixel intensity feature sets (59% and 53% accuracy, respectively), good performance with the TL-VGG and TL-Inception feature sets (85% and 90.74% accuracy, respectively), and excellent performance with the CIFAR-VGG feature set (93.43%). Although the CIFAR-VGG features did not improve with further training on an FCNN, they still outperformed the others.
Fig. 5: Ensemble feature results. (a) Combining features from the VGG16 TL model, HOG, and pixel intensities and taking the top 500 principal components resulted in 85% accuracy when using the features to train a new FCNN. (b) Best performance was obtained when combining the 5 feature sets, taking the top 1000 principal components, and training a new FCNN with these features (94.6% accuracy).
Fig. 6: (A) TL-VGG and (B) CIFAR-VGG features from the final convolutional layer. These images are representative of their corresponding layers.
III-C Feature Set Ensembles
To test our hypothesis that combining disparate feature sets would lead to improved results, we trained 3 separate FCNNs. The first FCNN was trained on the top 500 principal components (obtained by principal component analysis (PCA)) of the TL-VGG, HOG, and pixel intensity feature sets (95% explained variance) and achieved 85% test-set accuracy (no improvement over TL-VGG alone). The second FCNN was trained on the TL-VGG and TL-Inception features directly (512+1024 features) and achieved 91.12% accuracy, an improvement over each individual model (85% and 90.74%, respectively). The final model was trained on the top 1000 principal components of the TL-VGG, HOG, pixel intensity, CIFAR-VGG, and TL-Inception feature sets. This final model obtained the best performance, with 94.6% classification accuracy on the testing dataset, surpassing the benchmark established by CIFAR-VGG (93.43%) by a small margin [Fig. 5].
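The combine-then-reduce step can be sketched with scikit-learn's `PCA`. The example uses small synthetic feature matrices, and several details are assumptions not stated in the text: the helper name `combine_and_reduce`, fitting PCA on the training features only, and the absence of any feature scaling before PCA.

```python
import numpy as np
from sklearn.decomposition import PCA

def combine_and_reduce(train_sets, test_sets, n_components):
    """Concatenate feature sets column-wise and keep the top principal components.

    PCA is fitted on the training features only (an assumption) and the same
    projection is then applied to the test features.
    """
    train = np.concatenate(train_sets, axis=1)
    test = np.concatenate(test_sets, axis=1)
    pca = PCA(n_components=n_components)
    train_reduced = pca.fit_transform(train)
    test_reduced = pca.transform(test)
    explained = float(pca.explained_variance_ratio_.sum())
    return train_reduced, test_reduced, explained

# Synthetic stand-ins for three feature sets (random values, not real features).
rng = np.random.default_rng(0)
widths = [40, 30, 60]
train_sets = [rng.normal(size=(200, w)) for w in widths]
test_sets = [rng.normal(size=(50, w)) for w in widths]

train_pc, test_pc, explained = combine_and_reduce(train_sets, test_sets, n_components=20)
print(train_pc.shape, test_pc.shape, round(explained, 3))
```

The reduced matrices would then serve as the FCNN inputs, with `n_components` set to 500 or 1000 as in the experiments above.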
We obtained a notable performance improvement with an ensemble of feature sets compared to individual classification models. Success was measured by CIFAR-10 test set classification accuracy (%). We believe that this lends some support to the hypothesis that each feature set represents unique facets of variability between class labels, and that by merging a diverse array of successful feature sets, a more complete description of relevant differences emerges to aid in image classification. This hypothesis is further supported by the drastic differences between CNN model features upon visual inspection [Fig. 6]. It appears that TL-VGG features detect larger image patterns than CIFAR-VGG features do. The differences in CNN features suggest that there are many ways of extracting meaningful features from these images, and that combining them may improve classification. Non-CNN features are known to be very different from CNN features, as they describe specific, likely meaningful patterns found in the images (e.g. intensity gradients in HOG). The addition of non-CNN features further enhances the diversity of image descriptions available for classification by the FCNN, allowing for more robust image classification relative to training on a single feature set.
Error analysis shows that although the 5-Set PCA FCNN model exceeded the others in overall image classification accuracy, it had a more difficult time differentiating between cats and dogs relative to CIFAR-VGG [Fig. 7]. The confusion matrix also showed a decrease in airplane misclassification, which suggests that the inclusion of additional independent feature sets in the final model improves the classification of (relatively speaking) high-error labels. Correctly classified images had high confidence for the correct label, and misclassified images generally contained the correct label within the top three class results [Fig. 8]. The incorrectly labeled images we inspected were difficult to make out due to high background noise, low light conditions, and ambiguous class membership (automobile vs. truck). The low resolution increased the ambiguity of many images, yet the final model, which utilized a feature ensemble, was robust enough to consistently distinguish between distinct labels. In general, the inclusion of multiple feature sets improved classification accuracy, and the remaining failures occurred in cases that would also be difficult for humans to classify.
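The two quantities used in this error analysis, a confusion matrix and the rate at which the true label appears among the top three predictions, can be computed as in the sketch below. The toy labels and probabilities are invented for illustration and do not correspond to the reported results; the function names are hypothetical.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Count predictions: rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

def top_k_hit_rate(y_true, probs, k=3):
    """Fraction of samples whose true label is among the k most probable classes."""
    top_k = np.argsort(probs, axis=1)[:, -k:]
    hits = np.any(top_k == y_true[:, None], axis=1)
    return float(np.mean(hits))

# Toy example with 4 classes and 4 samples (made-up probabilities).
y_true = np.array([0, 1, 2, 2])
probs = np.array([
    [0.70, 0.20, 0.10, 0.00],
    [0.10, 0.60, 0.20, 0.10],
    [0.50, 0.30, 0.15, 0.05],
    [0.40, 0.30, 0.05, 0.25],
])
y_pred = probs.argmax(axis=1)

cm = confusion_matrix(y_true, y_pred, num_classes=4)
top3 = top_k_hit_rate(y_true, probs, k=3)
print(cm)
print(top3)  # 0.75
```

Off-diagonal entries of the matrix show which class pairs are confused most often, which is how a pattern such as cat versus dog confusion would surface.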
Although generating multiple feature sets takes additional time and computational power, the potential performance increase may justify these costs in some scenarios. These results may also be applicable to other classification tasks, especially those with a small margin for error (e.g. clinical decision support or automated driving). This work emphasizes the need for continued exploration of improved feature generation methods to maximize the utility of current deep learning architectures.