1 Introduction
Deep convolutional neural networks (CNNs) have recently gained immense attention in the computer vision and machine learning communities, mainly because of their superior performance in various applications including image classification [15, 19, 21], object detection [14, 17, 28], face detection/recognition [37, 22, 27, 38] and many others. These networks usually consist of a stack of convolution layers and fully connected layers with pooling and nonlinearity in between. By stacking multiple layers, a deep network can extract complex features which are more discriminative than features extracted by traditional machine learning algorithms [35, 36, 40, 42]. Krizhevsky et al. [19] proposed a deep CNN architecture (dubbed AlexNet) which performed exceptionally well on the ImageNet image classification dataset. The tremendous success of AlexNet led to a flurry of research activity in the community, resulting in a variety of deep CNN architectures for face recognition, action recognition, etc.
As there are no specific guidelines regarding the choice of the depth and width of the network, a significant amount of research has focused on finding heuristics to determine these parameters to obtain the “optimal” network for the target application. This resulted in very deep networks like DenseNet201 (of depth 201) [17], ResNet50 (of depth 168) [15], InceptionResnetv2 (of depth 572) [39], Xception (of depth 126) [5] and others. Though these very deep networks perform well on large datasets like ImageNet [10], the JFT dataset [16] and others, retraining these networks for small datasets or different target applications is difficult due to their enormous size (in terms of number of parameters). This raises the question: is it possible to combine multiple “weak” networks (of smaller depth and hence lower accuracy) and boost the performance significantly over each individual network in the combination?
In response to the above question, several researchers have recently proposed algorithms that construct a combination of different networks to achieve improved performance. The basic idea of these methods has been borrowed from traditional ML algorithms like bagging [12] and boosting [31]. Some of these methods rely on a weighted combination of different networks [33, 34, 29], while boosting methods like the Diabolo classifier [32] and the multicolumn deep network [2, 7] focus on retraining networks based on the previously misclassified samples. In multicolumn CNN, the authors train multiple CNNs simultaneously so that a linear combination of these CNNs boosts the performance and serves as the final predictor. Recently, the authors in [25] proposed a boosting technique named BoostCNN where, similar to Adaboost [33, 34], they learn CNNs sequentially on the mistakes of the earlier networks in the sequence. Essentially, they built a deep CNN where the final network output is aligned with the boosting weights. Though this sequential approach is less expensive than the multicolumn deep network, it still requires training the CNNs, which is time consuming. Very recently, several significantly deeper networks have been proposed in the literature [39, 15]. Though these networks perform very well, training takes a significant amount of time and hence retraining is not computationally feasible. Even with transfer learning [26], it is sometimes not computationally viable to train or fine-tune these networks.
In this work, we propose a novel framework which takes multiple pretrained “weak” CNNs as input and outputs a probabilistic model which is an aggregation of the pretrained CNNs. We formulate the problem of combining weak CNNs as a mixture model of the distributions learned from the output of the deep networks. Our formulation can also deal with features of different dimensions and provide a boosted performance. Hence, we have two sets of experiments: one to show the performance boost on multiple weak deep networks and the other to show the performance boost on multiple popular hand crafted features. In practice, our method takes seconds of additional time to achieve the boosted performance. One of the key advantages of our proposed framework is that, unlike previous boosting techniques, it does not require any retraining of CNNs. We show that our model requires a simple optimization on a hypersphere, which is solved using a Riemannian gradient descent based approach. We have incorporated both parametric and nonparametric models for representing the combination of networks and have shown that both models achieve boosted performance when compared to each of the weak network classifiers. Through experiments, we show that on CIFAR10 data [18], using weak ResNet classifiers with 20 weight layers, our parametric model improves on the average accuracy of the weak classifiers from 67.02% to 75.99%, and that on MNIST data [20, 11], using weak classifiers with one convolution layer and one fully connected layer, it improves on the average accuracy from 90.89% to 93.55% (Table 1).
2 An aggregation of multiple weak networks
In this section, we propose both parametric and nonparametric models to combine multiple “weak” networks in order to boost the overall performance. In any deep network used for classification, the output is a probability vector whose entries are the probabilities of the given test data belonging to each of the classes under consideration. In this paper, we propose to exploit the geometry of the space of probability densities. This space is a statistical manifold and the natural metric on it is the well-known Fisher-Rao metric [3], which is difficult to compute. Hence, a square-root parameterization of the density is used to map the density onto a unit Hilbert sphere whose geometry is fully known. Further, the natural metric on the sphere can be used in all computations, as it is in closed form and is computationally efficient. We now present the relevant basic concepts of differential geometry as applied to the sphere that are needed in this work.
2.1 Review of Basic Riemannian Geometry of $\mathbf{S}^N$
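The $N$-dimensional sphere, $\mathbf{S}^N$, is a Riemannian manifold with constant positive curvature and is one of the simplest and most widely encountered manifolds in many application domains. In the following paragraphs, we present a very brief review of the relevant differential geometry concepts of $\mathbf{S}^N$.
Geodesic distance: We will use the arc length distance as the geodesic distance on $\mathbf{S}^N$. The arc length distance, $d$, is defined as follows:
$$d(\mathbf{x}, \mathbf{y}) = \arccos\left(\mathbf{x}^T \mathbf{y}\right),$$
where $\mathbf{x}, \mathbf{y} \in \mathbf{S}^N$.
Exponential map: Let $\mathbf{x} \in \mathbf{S}^N$. Let $\mathcal{B}_r(\mathbf{0}) \subset T_{\mathbf{x}}\mathbf{S}^N$ be an open ball centered at the origin in the tangent space at $\mathbf{x}$, where $r$ is the injectivity radius of $\mathbf{S}^N$ [24]. Then, we can define the Exponential map, $\mathrm{Exp}_{\mathbf{x}} : \mathcal{B}_r(\mathbf{0}) \to \mathbf{S}^N$, as:
$$\mathrm{Exp}_{\mathbf{x}}(\mathbf{v}) = \cos(\theta)\,\mathbf{x} + \frac{\sin(\theta)}{\theta}\,\mathbf{v},$$
where $\theta = \|\mathbf{v}\|$. The Exponential map maps a tangent vector $\mathbf{v}$ to a point on the great circle along the direction $\mathbf{v}$ and at distance $\|\mathbf{v}\|$ from $\mathbf{x}$. Note that on $\mathbf{S}^N$, $r = \pi$.
Inverse Exponential map: Inside $\mathcal{B}_r(\mathbf{0})$, $\mathrm{Exp}_{\mathbf{x}}$ is a diffeomorphism; hence, the inverse exists and we can define the inverse of the Exponential map by $\mathrm{Exp}^{-1}_{\mathbf{x}} : \mathcal{U}_{\mathbf{x}} \to \mathcal{B}_r(\mathbf{0})$, with $\mathcal{U}_{\mathbf{x}} = \mathrm{Exp}_{\mathbf{x}}\left(\mathcal{B}_r(\mathbf{0})\right)$, which is given by
$$\mathrm{Exp}^{-1}_{\mathbf{x}}(\mathbf{y}) = \frac{\theta}{\sin(\theta)}\left(\mathbf{y} - \cos(\theta)\,\mathbf{x}\right),$$
where $\mathbf{y} \in \mathcal{U}_{\mathbf{x}}$ and $\theta = d(\mathbf{x}, \mathbf{y})$.
Shortest Geodesic curve: Let $\mathbf{x} \in \mathbf{S}^N$ and $\mathbf{y} \in \mathcal{U}_{\mathbf{x}}$. Then, the shortest geodesic curve between $\mathbf{x}$ and $\mathbf{y}$ is a function $\Gamma_{\mathbf{x}}^{\mathbf{y}} : [0, 1] \to \mathbf{S}^N$ given by:
$$\Gamma_{\mathbf{x}}^{\mathbf{y}}(t) = \mathrm{Exp}_{\mathbf{x}}\left(t\,\mathrm{Exp}^{-1}_{\mathbf{x}}(\mathbf{y})\right).$$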
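The four operations reviewed in this subsection translate almost directly into code. The following is a minimal NumPy sketch, assuming points are stored as unit-norm vectors; the function names are hypothetical and it is not necessarily how the experiments were implemented. The inverse Exponential map is numerically unstable near antipodal points, where $\sin(\theta)$ approaches zero.

```python
import numpy as np

def geodesic_distance(x, y):
    """Arc-length distance d(x, y) = arccos(x^T y) between unit vectors."""
    return float(np.arccos(np.clip(np.dot(x, y), -1.0, 1.0)))

def exp_map(x, v, eps=1e-12):
    """Exponential map at x applied to a tangent vector v (v orthogonal to x)."""
    norm_v = np.linalg.norm(v)
    if norm_v < eps:                      # zero tangent vector: stay at x
        return x.copy()
    return np.cos(norm_v) * x + np.sin(norm_v) * v / norm_v

def log_map(x, y, eps=1e-12):
    """Inverse Exponential map: tangent vector at x pointing towards y."""
    theta = geodesic_distance(x, y)
    if theta < eps:                       # x and y coincide
        return np.zeros_like(x)
    return theta / np.sin(theta) * (y - np.cos(theta) * x)

def geodesic(x, y, t):
    """Point at parameter t in [0, 1] on the shortest geodesic from x to y."""
    return exp_map(x, t * log_map(x, y))

# Two orthogonal points on S^2 and the midpoint of the arc joining them.
x = np.array([1.0, 0.0, 0.0])
y = np.array([0.0, 1.0, 0.0])
midpoint = geodesic(x, y, 0.5)
```

Here `geodesic(x, y, 1.0)` recovers `y`, which is a quick sanity check that the two maps are mutually inverse away from antipodal points.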
2.2 A parametric model for the aggregation of networks
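Let $\{N_i\}_{i=1}^m$ be the “weak” networks that we want to combine to achieve an improved performance. Let $\mathbf{x} \in \mathcal{I}$ be an input image, where $\mathcal{I}$ is the given set of image data. Let $\{\mathbf{y}_i\}_{i=1}^m$ be the output of the networks, where $\mathbf{y}_i$ is the output of $N_i$, i.e., $\mathbf{y}_i = N_i(\mathbf{x}) \in \mathbf{R}^c$, and $c$ is the number of classes. Here $\mathbf{y}_i$ can be viewed as the probability vector of size $c$, containing the probabilities of an image belonging to each of the $c$ classes. We use the square-root parametrization to map $\mathbf{y}_i$ onto the hypersphere $\mathbf{S}^{c-1}$. To make the notation more concise, for network $N_i$, we define a map $F_i : \mathcal{I} \to \mathbf{S}^{c-1}$ as
$$F_i(\mathbf{x}) = \sqrt{N_i(\mathbf{x})},$$
where the square root is taken elementwise.
Let $\{\mathcal{I}_j\}_{j=1}^c$ be the partition of the data $\mathcal{I}$ into classes. We assume that for the $i$-th network and for the $j$-th class, the features $\{F_i(\mathbf{x}) : \mathbf{x} \in \mathcal{I}_j\}$ are independent and identically distributed with a Gaussian distribution on $\mathbf{S}^{c-1}$ with location parameter $\mu_{ij} \in \mathbf{S}^{c-1}$ and scale parameter $\sigma_{ij} > 0$, i.e., for each $\mathbf{x} \in \mathcal{I}_j$,
$$F_i(\mathbf{x}) \sim \mathcal{N}(\mu_{ij}, \sigma_{ij}). \tag{1}$$
On $\mathbf{S}^{c-1}$, we will use the Gaussian distribution, $\mathcal{N}(\mu, \sigma)$, as defined in [4]. Let $X$ be an $\mathbf{S}^{c-1}$-valued random variable; then the p.d.f. is given by:
$$f_X(\mathbf{s}; \mu, \sigma) = \frac{1}{C(\sigma)} \exp\left(-\frac{d^2(\mathbf{s}, \mu)}{2\sigma^2}\right), \tag{2}$$
where $d$ is the geodesic distance on $\mathbf{S}^{c-1}$ and $C(\sigma)$ is the normalizing constant. This distribution, $\mathcal{N}(\mu_{ij}, \sigma_{ij})$, gives the probability of a feature coming from the $i$-th network and belonging to the $j$-th class.
Let $\{w_i\}_{i=1}^m$ be the weights associated with the networks such that they satisfy the affine constraint, i.e.,
$$w_i \geq 0 \ \text{ for all } i, \qquad \sum_{i=1}^m w_i = 1.$$
Now, we will use these weights to define a mixture to model the combination of these networks. For each class $j$, we define the probability density, $P_j$, by $P_j = \sum_{i=1}^m w_i\, \mathcal{N}(\mu_{ij}, \sigma_{ij})$. Hence, for all $\mathbf{x} \in \mathcal{I}$,
$$P_j(\mathbf{x}) = \sum_{i=1}^m w_i\, f_X\left(F_i(\mathbf{x}); \mu_{ij}, \sigma_{ij}\right).$$
Clearly, $P_j(\mathbf{x}) \geq 0$ for all $\mathbf{x}$. And because of the affine constraint on $\{w_i\}$, $P_j$ is a valid probability density, for all $j$. Each $P_j$ represents an ensemble of the learned models for all the networks. Now, in the prediction phase, we will assign the test image to the class which maximizes this probability value.
We define the prediction by our ensemble classifier by $\hat{\mathbf{y}}(\mathbf{x}) = \left(P_1(\mathbf{x}), \ldots, P_c(\mathbf{x})\right)^T / \sum_{k=1}^c P_k(\mathbf{x})$. It is easy to see that, given the image $\mathbf{x}$, this is a probability vector since
$$\hat{y}_j(\mathbf{x}) \geq 0 \ \text{ for all } j, \qquad \sum_{j=1}^c \hat{y}_j(\mathbf{x}) = 1.$$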
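Training the model: Now we have the training data, denoted by $\mathcal{I}^{\mathrm{train}}$, which is used to learn the unknown parameters $\{w_i\}$, $\{\mu_{ij}\}$ and $\{\sigma_{ij}\}$, and the test data, denoted by $\mathcal{I}^{\mathrm{test}}$. Though it is possible to learn $\{\mu_{ij}, \sigma_{ij}\}$ directly, we instead use the Fréchet mean (FM) [13] of $\{F_i(\mathbf{x}) : \mathbf{x} \in \mathcal{I}_j^{\mathrm{train}}\}$ to get the estimate $\hat{\mu}_{ij}$ and use the sample standard deviation within $\{F_i(\mathbf{x}) : \mathbf{x} \in \mathcal{I}_j^{\mathrm{train}}\}$ to get the estimate $\hat{\sigma}_{ij}$, i.e.,
\begin{align}
\hat{\mu}_{ij} &= \mathrm{FM}\left(\{F_i(\mathbf{x})\}_{\mathbf{x} \in \mathcal{I}_j^{\mathrm{train}}}\right) \tag{3} \\
&= \operatorname*{arg\,min}_{\mu \in \mathbf{S}^{c-1}} \sum_{\mathbf{x} \in \mathcal{I}_j^{\mathrm{train}}} d^2\left(F_i(\mathbf{x}), \mu\right) \tag{4} \\
\hat{\sigma}_{ij}^2 &= \frac{1}{|\mathcal{I}_j^{\mathrm{train}}|} \sum_{\mathbf{x} \in \mathcal{I}_j^{\mathrm{train}}} d^2\left(F_i(\mathbf{x}), \hat{\mu}_{ij}\right) \tag{5}
\end{align}
In this work, rather than solving the minimization problem to get the FM, we will use the incremental FM estimator on $\mathbf{S}^{c-1}$ presented in [30]. For completeness, we give the formulation of the FM estimator here. Given $\{\mathbf{s}_k\}_{k=1}^n$ on $\mathbf{S}^{c-1}$, the FM of these samples can be estimated by $\mathbf{m}_n$, where $\mathbf{m}_k$ is defined recursively as follows:
$$\mathbf{m}_1 = \mathbf{s}_1, \qquad \mathbf{m}_{k+1} = \Gamma_{\mathbf{m}_k}^{\mathbf{s}_{k+1}}\left(\frac{1}{k+1}\right),$$
where $\Gamma$ is the shortest geodesic curve defined in Section 2.1. In [30], the authors provide a proof of weak consistency of this estimator.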
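The recursion of the incremental FM estimator can be sketched in a few lines of NumPy. This is an illustrative sketch with hypothetical function names, assuming the samples are unit vectors that are not antipodal to the running estimate; the geodesic is evaluated with the closed-form spherical interpolation, which coincides with moving along the shortest great-circle arc.

```python
import numpy as np

def geodesic(x, y, t):
    """Point at parameter t on the shortest great-circle arc from x to y."""
    theta = np.arccos(np.clip(np.dot(x, y), -1.0, 1.0))
    if theta < 1e-12:                     # coincident points: nothing to move
        return x.copy()
    # Closed form of moving a fraction t of the way from x to y on the sphere.
    return (np.sin((1.0 - t) * theta) * x + np.sin(t * theta) * y) / np.sin(theta)

def incremental_frechet_mean(points):
    """Recursive FM estimate: m_1 = s_1, m_{k+1} = geodesic(m_k, s_{k+1}, 1/(k+1))."""
    mean = points[0].copy()
    for k, s in enumerate(points[1:], start=1):
        mean = geodesic(mean, s, 1.0 / (k + 1))
        mean /= np.linalg.norm(mean)      # guard against round-off drift
    return mean

# Three samples on one great circle at angles 0, 0.2 and 0.4 radians.
angles = np.array([0.0, 0.2, 0.4])
samples = np.stack([np.cos(angles), np.sin(angles), np.zeros(3)], axis=1)
fm = incremental_frechet_mean(samples)
```

Each sample is visited once, so the estimate costs one geodesic evaluation per sample and needs no iterative minimization. For samples on a single great circle, as in the toy data above, the estimate sits at the average angle.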
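Note that in our case, all entries of $F_i(\mathbf{x})$ are positive, so the features lie in the positive orthant of the hypersphere. Hence the existence and uniqueness of the FM are guaranteed [1]. Given $\{\hat{\mu}_{ij}\}$ and $\{\hat{\sigma}_{ij}\}$, we will learn $\{w_i\}$ by minimizing the following objective function,
$$E\left(\{w_i\}\right) = \sum_{\mathbf{x} \in \mathcal{I}^{\mathrm{train}}} d^2\left(\hat{\mathbf{y}}(\mathbf{x}), l(\mathbf{x})\right), \tag{6}$$
where $\hat{\mathbf{y}}(\mathbf{x})$ is the prediction of the ensemble classifier, $l(\mathbf{x})$ is the label of $\mathbf{x}$, and $d$ is the arc-length distance on the hypersphere between their square-root representations.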
Training of $\{w_i\}$: $w_i$ is the weight on network $N_i$. Since $w_i \geq 0$ and $\sum_{i=1}^m w_i = 1$, we identify $\left(\sqrt{w_1}, \ldots, \sqrt{w_m}\right)^T$ with a point on the hypersphere of dimension $m-1$, i.e., on $\mathbf{S}^{m-1}$, and then perform Riemannian gradient descent on the hypersphere. The algorithm to solve for $\{w_i\}$ by minimizing $E$ is given in Algo. 1.
In Algo. 1, Exp is the Riemannian Exponential map on the hypersphere. This algorithm ensures that $\{w_i\}$ satisfy the affine constraint.
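To make the weight-learning step concrete, here is a hedged sketch of Riemannian gradient descent on the hypersphere. It is not a transcription of Algo. 1: the finite-difference gradient, the step-halving rule, the function names and the synthetic input are all assumptions made for illustration. The array `dens[k, i, j]` is assumed to hold the density of the feature of sample `k` from network `i` under the fitted model of class `j`, and the weights are parametrized as $w_i = u_i^2$ with $\mathbf{u}$ on the sphere.

```python
import numpy as np

def objective(u, dens, labels):
    """Sum over samples of arccos(sqrt(y_hat_true))^2, with weights w = u**2."""
    w = u ** 2
    mix = np.einsum("i,kij->kj", w, dens)            # P_j(x_k) for every sample
    y_hat = mix / mix.sum(axis=1, keepdims=True)     # normalized predictions
    p_true = y_hat[np.arange(len(labels)), labels]   # probability of true class
    return float(np.sum(np.arccos(np.clip(np.sqrt(p_true), 0.0, 1.0)) ** 2))

def numerical_gradient(fun, u, h=1e-6):
    """Central finite differences (an illustrative stand-in for a closed form)."""
    grad = np.zeros_like(u)
    for i in range(u.size):
        step = np.zeros_like(u)
        step[i] = h
        grad[i] = (fun(u + step) - fun(u - step)) / (2.0 * h)
    return grad

def sphere_exp(u, v):
    """Riemannian Exponential map on the unit sphere."""
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:
        return u.copy()
    return np.cos(norm_v) * u + np.sin(norm_v) * v / norm_v

def fit_weights(dens, labels, n_iter=200, step=0.1):
    m = dens.shape[1]
    u = np.full(m, 1.0 / np.sqrt(m))                 # start from uniform weights
    fun = lambda z: objective(z, dens, labels)
    value = fun(u)
    for _ in range(n_iter):
        grad = numerical_gradient(fun, u)
        tangent = grad - np.dot(grad, u) * u         # project onto tangent space
        candidate = sphere_exp(u, -step * tangent)
        cand_value = fun(candidate)
        if cand_value < value:                       # accept only descent steps
            u, value = candidate, cand_value
        else:
            step *= 0.5                              # otherwise shrink the step
    return u ** 2, value

# Synthetic densities: 3 hypothetical networks, 4 classes, network 0 informative.
rng = np.random.default_rng(0)
n, m, c = 60, 3, 4
labels = rng.integers(0, c, size=n)
dens = rng.uniform(0.1, 1.0, size=(n, m, c))
dens[np.arange(n), 0, labels] += 5.0
initial_value = objective(np.full(m, 1.0 / np.sqrt(m)), dens, labels)
weights, final_value = fit_weights(dens, labels)
```

Because every iterate stays on the unit sphere, the squared coordinates are nonnegative and sum to one, so the affine constraint holds by construction rather than through a projection step.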
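Since labeled images are given, without loss of generality, we can assume that the label is of the form $l(\mathbf{x}) = \mathbf{e}_j$, where $\mathbf{x}$ is from the $j$-th class and $\mathbf{e}_j$ is the $j$-th standard basis vector of $\mathbf{R}^c$, and then we can view $l(\mathbf{x})$ as a degenerate distribution. To be consistent, we identify these two distributions, $\hat{\mathbf{y}}(\mathbf{x})$ and $l(\mathbf{x})$, with points on the hypersphere and use the arc-length distance as the distance between $\hat{\mathbf{y}}(\mathbf{x})$ and $l(\mathbf{x})$, i.e.,
$$d\left(\hat{\mathbf{y}}(\mathbf{x}), l(\mathbf{x})\right) = \arccos\left(\sqrt{\hat{\mathbf{y}}(\mathbf{x})}^T \sqrt{l(\mathbf{x})}\right).$$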
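Prediction of the class for a new sample $\mathbf{x}$: Given $\{w_i\}$, $\{\hat{\mu}_{ij}\}$, $\{\hat{\sigma}_{ij}\}$, the predicted class probability is given by,
$$\hat{y}_j(\mathbf{x}) = \frac{\sum_{i=1}^m w_i\, f_X\left(F_i(\mathbf{x}); \hat{\mu}_{ij}, \hat{\sigma}_{ij}\right)}{\sum_{k=1}^c \sum_{i=1}^m w_i\, f_X\left(F_i(\mathbf{x}); \hat{\mu}_{ik}, \hat{\sigma}_{ik}\right)}, \qquad j = 1, \ldots, c.$$
When a test image is given, we will assign it to the class for which the prediction probability is maximized, i.e.,
$$\hat{j}(\mathbf{x}) = \operatorname*{arg\,max}_{j} \ \hat{y}_j(\mathbf{x}).$$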
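The prediction rule can be sketched as follows. This is an illustration under stated assumptions, not the reference implementation: the function names, the toy parameters and the way the normalizing constant is evaluated are all hypothetical choices. The constant is computed here by integrating the unnormalized density in geodesic polar coordinates around the location parameter with a trapezoidal rule.

```python
import math
import numpy as np

def normalizing_constant(sigma, c, n_grid=2000):
    """Integral of exp(-d^2 / (2 sigma^2)) over S^{c-1}, via 1-D quadrature."""
    theta = np.linspace(0.0, np.pi, n_grid)
    vals = np.exp(-theta ** 2 / (2.0 * sigma ** 2)) * np.sin(theta) ** (c - 2)
    radial = np.sum((vals[1:] + vals[:-1]) * np.diff(theta)) / 2.0
    area = 2.0 * math.pi ** ((c - 1) / 2.0) / math.gamma((c - 1) / 2.0)
    return area * radial                  # area of S^{c-2} times radial integral

def sphere_gaussian_pdf(s, mu, sigma):
    """Density of the form (1 / C(sigma)) exp(-d(s, mu)^2 / (2 sigma^2))."""
    d = np.arccos(np.clip(np.dot(s, mu), -1.0, 1.0))
    return np.exp(-d ** 2 / (2.0 * sigma ** 2)) / normalizing_constant(sigma, s.size)

def predict(outputs, weights, mus, sigmas):
    """outputs: (m, c) network outputs; mus: (m, c, c) locations; sigmas: (m, c)."""
    m, c = outputs.shape
    feats = np.sqrt(outputs)              # square-root map onto S^{c-1}
    mix = np.zeros(c)
    for j in range(c):
        for i in range(m):
            mix[j] += weights[i] * sphere_gaussian_pdf(feats[i], mus[i, j], sigmas[i, j])
    y_hat = mix / mix.sum()               # normalized class probabilities
    return y_hat, int(np.argmax(y_hat))

# Toy setup: 2 hypothetical networks, 3 classes, class-j location peaked at class j.
c, m = 3, 2
proto = np.full((c, c), 0.1) + 0.7 * np.eye(c)       # rows are probability vectors
mus = np.sqrt(np.stack([proto, proto]))
sigmas = np.full((m, c), 0.3)
weights = np.array([0.5, 0.5])
outputs = np.array([[0.1, 0.8, 0.1], [0.2, 0.6, 0.2]])
y_hat, label = predict(outputs, weights, mus, sigmas)
```

In this toy input both networks favor the second class, so the ensemble assigns the sample to class index 1. The normalizing constant matters only when the scale parameters differ across networks or classes; with a common scale it cancels in the ratio.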
Now that we have a model and an algorithm to learn the model, we present a framework that can combine features extracted from different algorithms (deep networks or hand crafted) and hence of different feature dimensions.
$F_i$ as the output from the fully connected layer (or as hand crafted features): Note that $F_i(\mathbf{x})$ is now the output of the network $N_i$ from an intermediate fully connected layer (or a hand crafted feature), and $n_i$ denotes its dimension. Let $F_i(\mathbf{x}) \in \mathbf{R}^{n_i \times p}$, for all $i$. We want the features to be affine invariant, but as none of the networks output affine invariant features, we quotient out the group of affine transformations from the features to map each feature onto the Grassmannian. We want affine invariance in the extracted features so that if two networks (or algorithms to compute hand crafted features) output features which are related by an affine transformation, we will not consider these two networks to be different.
We will use $\mathcal{X}_i$ to denote the point on the Grassmannian corresponding to $F_i(\mathbf{x})$, i.e., $\mathcal{X}_i \in \mathrm{Gr}(p, n_i)$. Observe that each $\mathcal{X}_i$ may lie on a Grassmannian of different dimension (as $n_i$ may be different for different networks). Let $\mathcal{N}(\mathcal{M}_{ij}, \sigma_{ij})$ be the Gaussian distribution which has been fitted to $\{\mathcal{X}_i\}$ corresponding to $\mathcal{I}_j$, i.e., $\mathcal{X}_i \sim \mathcal{N}(\mathcal{M}_{ij}, \sigma_{ij})$, where $\mathcal{M}_{ij} \in \mathrm{Gr}(p, n_i)$, $\sigma_{ij} > 0$.
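On $\mathrm{Gr}(p, n)$, we will use the Gaussian distribution, $\mathcal{N}(\mathcal{M}, \sigma)$, as defined in [4]. Let $Z$ be a $\mathrm{Gr}(p, n)$-valued random variable; then the p.d.f. is written as:
$$f_Z(\mathcal{X}; \mathcal{M}, \sigma) = \frac{1}{C(\sigma)} \exp\left(-\frac{d^2(\mathcal{X}, \mathcal{M})}{2\sigma^2}\right), \tag{7}$$
where $d$ is the canonical geodesic distance on $\mathrm{Gr}(p, n)$ and $C(\sigma)$ is the normalizing constant. The canonical distance on $\mathrm{Gr}(p, n)$ is defined as follows. Let $\mathcal{X}, \mathcal{Y} \in \mathrm{Gr}(p, n)$ with the respective orthonormal bases $X$ and $Y$. Then, the geodesic distance is defined by:
$$d(\mathcal{X}, \mathcal{Y}) = \sqrt{\sum_{k=1}^p \arccos^2(s_k)},$$
where $U\,\mathrm{diag}(s_1, \ldots, s_p)\,V^T = X^T Y$ is the singular value decomposition.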
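The canonical distance can be computed from the singular values of the product of two orthonormal bases. The snippet below is a small NumPy sketch with hypothetical function names; it represents a subspace by the column space of a full-rank matrix, which is one common way (assumed here) to discard an invertible change of basis.

```python
import numpy as np

def orthonormal_basis(a):
    """Orthonormal basis (n x p) of the column space of a full-rank n x p matrix."""
    q, _ = np.linalg.qr(a)
    return q

def grassmann_distance(x_basis, y_basis):
    """Square root of the sum of squared principal angles between two subspaces."""
    sing_vals = np.linalg.svd(x_basis.T @ y_basis, compute_uv=False)
    angles = np.arccos(np.clip(sing_vals, -1.0, 1.0))   # principal angles
    return float(np.sqrt(np.sum(angles ** 2)))

# The same column space expressed in two different bases has (near) zero distance.
rng = np.random.default_rng(0)
a = rng.standard_normal((5, 2))
x_basis = orthonormal_basis(a)
mixing = np.array([[2.0, 1.0], [0.0, 3.0]])             # invertible 2 x 2 matrix
same_span = orthonormal_basis(a @ mixing)
d_same = grassmann_distance(x_basis, same_span)

# Two mutually orthogonal planes in R^4: both principal angles equal pi / 2.
e = np.eye(4)
d_orth = grassmann_distance(e[:, :2], e[:, 2:])
```

The first example shows the invariance that motivates the construction: right-multiplying a feature matrix by an invertible matrix leaves the subspace, and hence the distance, unchanged up to round-off.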
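Note that, though $f_{ij} := f_Z(\cdot\,; \mathcal{M}_{ij}, \sigma_{ij})$ is defined on $\mathrm{Gr}(p, n_i)$, we will use the support of $f_{ij}$ as $\prod_{k=1}^m \mathrm{Gr}(p, n_k)$, i.e.,
$$f_{ij}(\mathcal{X}_1, \ldots, \mathcal{X}_m) := f_Z(\mathcal{X}_i; \mathcal{M}_{ij}, \sigma_{ij}). \tag{8}$$
The support of $f_{ij}$ over $\prod_{k=1}^m \mathrm{Gr}(p, n_k)$ is needed to define a mixture of $\{f_{ij}\}_{i=1}^m$ for each $j$.
We define the mixture of $\{f_{ij}\}_{i=1}^m$ as $P_j = \sum_{i=1}^m w_i f_{ij}$ for each class.
Theorem 1.
For all $j$, $P_j$ is a probability density on $\prod_{k=1}^m \mathrm{Gr}(p, n_k)$.
Proof.
For each $j$, with every factor $\mathrm{Gr}(p, n_k)$ carrying its normalized invariant measure,
$$\int P_j = \sum_{i=1}^m w_i \int f_{ij} = \sum_{i=1}^m w_i = 1.$$
As $w_i \geq 0$ and $f_{ij} \geq 0$, for all $i$, $P_j \geq 0$. This completes the proof. ∎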
The above definition of the mixture has components defined on spaces of different dimensions, but because of the definition in Eq. 8, the mixture is a valid probability density on $\prod_{k=1}^m \mathrm{Gr}(p, n_k)$ for each $j$. This is a more general framework, as it allows us to combine outputs of intermediate layers of deep networks. In our experiments, we have found that the choice of layer for $F_i$ is crucial; a detailed study in this more general direction is needed, is beyond the scope of this paper, and is left as future work. However, in this work we show the performance gain of our proposed framework on hand crafted features such as Histogram of Oriented Gradients (HOG) [9], SIFT [23], etc.
2.3 Nonparametric model
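In the previous subsection, we have assumed a Gaussian distribution on $\mathbf{S}^{c-1}$ for the $i$-th network and $j$-th class. Though this parametric assumption is simple, it is not very realistic, since the features of the samples being classified correctly and of those being misclassified do not come from a single Gaussian distribution but possibly from a multimodal distribution. Hence, in this section, we will estimate the density $f_{ij}$ of the features of the $i$-th network on the $j$-th class using kernel density estimation. We will assume a Gaussian kernel and write the estimate $\hat{f}_{ij}$ as follows. Let $\{\mathbf{s}_k\}_{k=1}^n = \{F_i(\mathbf{x}) : \mathbf{x} \in \mathcal{I}_j\}$ be the set of outputs of $F_i$ on $\mathcal{I}_j$. Then
$$\hat{f}_{ij}(\mathbf{s}) = \frac{1}{n} \sum_{k=1}^n \frac{1}{C(h)} \exp\left(-\frac{d^2(\mathbf{s}, \mathbf{s}_k)}{2h^2}\right)$$
for $\mathbf{s} \in \mathbf{S}^{c-1}$. Here, $h$ is the bandwidth of the kernel, which we have selected based on Silverman’s rule of thumb, i.e., $h = \left(\frac{4\hat{\sigma}^5}{3n}\right)^{1/5} \approx 1.06\,\hat{\sigma}\,n^{-1/5}$, where $\hat{\sigma}$ is the sample standard deviation from Eqs. 3-5 and $n = |\mathcal{I}_j|$.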
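A kernel density estimate of this kind can be sketched as below. This is a hedged illustration: the function names are hypothetical, the toy samples and the value of the sample standard deviation are made up, and the kernel normalizing constant is evaluated by one-dimensional quadrature in geodesic polar coordinates, which is an assumption about how one might compute it.

```python
import math
import numpy as np

def silverman_bandwidth(sigma, n):
    """Silverman's rule of thumb, h = (4 sigma^5 / (3 n))^(1/5)."""
    return (4.0 * sigma ** 5 / (3.0 * n)) ** 0.2

def normalizing_constant(h, c, n_grid=2000):
    """Integral of exp(-d^2 / (2 h^2)) over S^{c-1}, via 1-D quadrature."""
    theta = np.linspace(0.0, np.pi, n_grid)
    vals = np.exp(-theta ** 2 / (2.0 * h ** 2)) * np.sin(theta) ** (c - 2)
    radial = np.sum((vals[1:] + vals[:-1]) * np.diff(theta)) / 2.0
    area = 2.0 * math.pi ** ((c - 1) / 2.0) / math.gamma((c - 1) / 2.0)
    return area * radial

def kde_on_sphere(s, samples, h):
    """Average of Gaussian kernels centered at the rows of `samples`."""
    d = np.arccos(np.clip(samples @ s, -1.0, 1.0))      # geodesic distances
    kernel = np.exp(-d ** 2 / (2.0 * h ** 2))
    return float(np.mean(kernel) / normalizing_constant(h, s.size))

# Bimodal toy data: two clusters of square-rooted probability vectors.
a = np.sqrt(np.array([0.8, 0.1, 0.1]))
b = np.sqrt(np.array([0.1, 0.8, 0.1]))
far = np.sqrt(np.array([0.1, 0.1, 0.8]))
samples = np.stack([a, a, b, b])
h = silverman_bandwidth(sigma=0.3, n=len(samples))      # sigma chosen arbitrarily
density_a = kde_on_sphere(a, samples, h)
density_b = kde_on_sphere(b, samples, h)
density_far = kde_on_sphere(far, samples, h)
```

The estimate places comparable mass at both clusters and very little away from them, which a single Gaussian cannot do. The cost of one evaluation grows linearly with the number of stored samples, unlike the parametric density.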
The rest of the algorithm is the same as in the previous subsection. We define the mixture of networks model and then solve for the weights $\{w_i\}$ in order to minimize the objective function in Eq. 6.
The entire procedure of our ensemble method is shown in Figure 1.
3 Experiments
In this section, we present experiments for both the parametric and the nonparametric model on five publicly available datasets: CIFAR10, CIFAR100, MNIST, EMNISTletters (with English alphabet only) and EMNISTbalanced [8]. A brief description of each dataset is given below.

The CIFAR10 dataset consists of 60,000 color images from 10 classes, of which 50,000 are used for training and the rest are used for testing.

The CIFAR100 dataset consists of 60,000 color images from 100 classes, of which 50,000 are used for training and the rest are used as test data.

The MNIST dataset consists of 70,000 grey images of handwritten digits 0-9, of which 60,000 are used for training and the rest are used as test data.

The EMNISTletters dataset consists of 145,600 grey images of handwritten English letters (26 classes), of which 124,800 are used for training and the rest for testing.

The EMNISTbalanced dataset consists of 131,600 grey images of handwritten letters and digits in 47 classes (merging those letters whose uppercase and lowercase forms are similar, e.g. C, O), of which 112,800 are used for training and the rest for testing.
An outline of the entire procedure used in the experiments is presented below:

Train 20 CNNs $\{N_i\}_{i=1}^{20}$ for each dataset. The choice of CNN can be arbitrary; in order to show the power of our proposed ensemble technique, we trained the networks for only a few epochs to yield “weak” networks. Here, for the sake of convenience, we chose the following architectures (all the models used in this experiment are based on the models provided by Keras [6] and modified slightly to meet our needs):
CIFAR10: We chose ResNet [15] with 20 weight layers and trained these networks for only 3 epochs. The classification accuracies of these networks range from 61.6% to 72.8% and the average accuracy is 67.02%.

CIFAR100: We chose ResNet with 56 weight layers and trained these networks for 50 epochs. The classification accuracies of these networks range from 59.1% to 63.5% and the average accuracy is 61.71%.

MNIST: We chose a very simple CNN with only one convolution layer and one fully connected layer and trained these networks for only 1 epoch. The classification accuracies of these networks range from 89.8% to 93.2% and the average accuracy is 90.89%.

EMNISTletters: We chose a CNN with 2 convolution layers and 2 fully connected layers and trained these networks for only 1 epoch. The classification accuracies of these networks range from 89.8% to 93.2% and the average accuracy is 90.24%.

EMNISTbalanced: We chose a CNN with 2 convolution layers and 2 fully connected layers and trained these networks for only 1 epoch. The classification accuracies of these networks range from 82.1% to 83.7% and the average accuracy is 82.94%.


Compute the estimated weights $\{\hat{w}_i\}$ using Algorithm 1.

Combine these networks and compute the classification accuracy on the test data.
The results are shown in Table 1.
| Dataset | Ave. Acc. | Param. | Nonparam. |
|---|---|---|---|
| CIFAR10 | 67.02% | 75.99% | 79.5% |
| CIFAR100 | 61.71% | 65.71% | 73.14% |
| MNIST | 90.89% | 93.55% | 93.58% |
| EMNISTletters | 90.24% | 91.52% | 91.61% |
| EMNISTbalanced | 82.94% | 84.27% | 85.66% |
The results clearly show that the proposed method works well and, as expected, when the networks are strong there is not much room for improvement. In contrast, when the networks are weak, the improvement is significant. We can also see that the difference between the parametric and nonparametric models decreases as the networks get stronger. Since the features of the samples being classified correctly and of those being classified incorrectly do not come from the same distribution, a single Gaussian is not appropriate in such cases. When the networks are stronger, the difference between a single Gaussian distribution and the kernel density estimate is smaller. The motivation to use the parametric model when it performs almost as well as the nonparametric model is clear: the nonparametric model takes 2 to 5 times longer than the parametric model.
In practice, we would like to know whether this ensemble technique reduces the time needed to achieve a certain accuracy. To answer this question, we ran an experiment based on CIFAR10 and the parametric ensemble model. The experiment goes as follows:

We trained 5 networks on CIFAR10 using the same architecture as in the previous experiment.

We ensembled the intermediate models after different numbers of epochs.
The result is shown in Figure 2. As we can see, the ensemble network performed consistently better. Since our ensemble method requires multiple networks, when comparing the efficiency of our method and the traditional CNN, it is better to consider the effective number of epochs, e.g., if we combine 5 networks and each of them is trained for 10 epochs, then the effective number of epochs would be $5 \times 10 = 50$. Table 2 shows the result of this experiment in terms of the effective number of epochs. The table is to be interpreted as follows: on CIFAR10, training a network for 50 epochs gives a classification accuracy of 76.66%, while training 5 networks, each for 10 epochs, and building the ensemble classifier based on these five networks gives a classification accuracy of 80.06%. The message is that, for the same effective number of epochs, training multiple networks and building the ensemble network gives a better performance.
Another advantage of our ensemble method is that we can run multiple networks on different machines in parallel and then combine them without any retraining. The extra optimization step for finding the weights takes less than a few minutes in all our experiments.
| Effective epochs | 20 | 50 | 100 |
|---|---|---|---|
| Ave. | 77.86% | 76.66% | 77.13% |
| Ensemble | 78.46% | 80.6% | 79.91% |
The third experiment builds ensemble classifiers using intermediate features instead of the final outputs. The experiment is performed on MNIST, using weak classifiers based on two HOG features [9] (with two different configurations) and the Daisy feature [41]. Each weak classifier is built using the mixture model described in Section 2, i.e., the special case when there is only one network. The average accuracy of these three weak classifiers is 85.16% and the accuracy of the ensemble classifier is 88.6%. The result again shows the capability of our ensemble method to boost the performance without retraining.
4 Conclusions
In this paper we presented a novel aggregation technique to combine
“weak” networks/algorithms in order to boost the classification
accuracy over each constituent of the aggregate. Traditional boosting
requires retraining of every constituent of the aggregate and in
contrast, our aggregation model does not require any retraining. This
makes our aggregation model quite attractive from a computational cost
perspective. We presented both parametric and nonparametric
aggregation techniques and demonstrated via experiments the efficiency of the proposed methods. Another key advantage
of our technique stems from the fact that it can cope with aggregation
of features of distinct dimensions that are likely to result from
using either different networks or even handcrafted features that are
extracted from the data. These salient features make our aggregation
model unique. We presented several experiments demonstrating the
performance of our proposed aggregation technique on widely used
image databases in computer vision literature.
Acknowledgements: This research was funded in part by the NSF grants IIS-1525431 and IIS-1724174 to BCV.
References
 [1] Bijan Afsari. Riemannian lp center of mass: existence, uniqueness, and convexity. Proceedings of the American Mathematical Society, 139(2):655–673, 2011.
 [2] Forest Agostinelli, Michael R Anderson, and Honglak Lee. Adaptive multicolumn deep neural networks with application to robust image denoising. In Advances in Neural Information Processing Systems, pages 1493–1501, 2013.
 [3] Shunichi Amari. Information geometry and its applications. Springer, 2016.
 [4] Rudrasis Chakraborty and Baba Vemuri. Statistics on the (compact) stiefel manifold: Theory and applications. arXiv preprint arXiv:1708.00045, 2017.

 [5] François Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint, 2016.
 [6] François Chollet et al. Keras, 2015.

 [7] Dan Ciregan, Ueli Meier, and Jürgen Schmidhuber. Multicolumn deep neural networks for image classification. In Computer vision and pattern recognition (CVPR), 2012 IEEE conference on, pages 3642–3649. IEEE, 2012.
 [8] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and André van Schaik. Emnist: an extension of mnist to handwritten letters. arXiv preprint arXiv:1702.05373, 2017.
 [9] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005.
 [10] Jia Deng, Wei Dong, Richard Socher, LiJia Li, Kai Li, and Li FeiFei. Imagenet: A largescale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.

 [11] Li Deng. The mnist database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
 [12] Thomas G Dietterich. Ensemble methods in machine learning. In International workshop on multiple classifier systems, pages 1–15. Springer, 2000.
 [13] Maurice Fréchet. Les éléments aléatoires de nature quelconque dans un espace distancié. In Annales de l’institut Henri Poincaré, volume 10, pages 215–310. Presses universitaires de France, 1948.
 [14] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
 [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [16] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 [17] Forrest Iandola, Matt Moskewicz, Sergey Karayev, Ross Girshick, Trevor Darrell, and Kurt Keutzer. Densenet: Implementing efficient convnet descriptor pyramids. arXiv preprint arXiv:1404.1869, 2014.
 [18] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The cifar10 dataset. online: http://www. cs. toronto. edu/kriz/cifar. html, 2014.
 [19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [20] Yann LeCun. The mnist database of handwritten digits. nec research institute, 1998.
 [21] Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995.
 [22] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436, 2015.
 [23] David G Lowe. Object recognition from local scaleinvariant features. In Computer vision, 1999. The proceedings of the seventh IEEE international conference on, volume 2, pages 1150–1157. Ieee, 1999.
 [24] Berger Marcel. A Panoramic View of Riemannian Geometry. Springer, 2003.
 [25] Mohammad Moghimi, Serge J Belongie, Mohammad J Saberian, Jian Yang, Nuno Vasconcelos, and LiJia Li. Boosted convolutional neural networks. In BMVC, 2016.
 [26] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
 [27] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, et al. Deep face recognition. In BMVC, volume 1, page 6, 2015.
 [28] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster rcnn: Towards realtime object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
 [29] Mohammad J Saberian and Nuno Vasconcelos. Multiclass boosting: Theory and algorithms. In Advances in Neural Information Processing Systems, pages 2124–2132, 2011.
 [30] Hesamoddin Salehian, Rudrasis Chakraborty, Edward Ofori, David Vaillancourt, and Baba C Vemuri. An efficient recursive estimator of the fréchet mean on a hypersphere with applications to medical image analysis. Mathematical Foundations of Computational Anatomy, 2015.
 [31] Robert E Schapire. The boosting approach to machine learning: An overview. In Nonlinear estimation and classification, pages 149–171. Springer, 2003.
 [32] Holger Schwenk. The diabolo classifier. Neural Computation, 10(8):2175–2200, 1998.
 [33] Holger Schwenk and Yoshua Bengio. Adaboosting neural networks: Application to online character recognition. In International Conference on Artificial Neural Networks, pages 967–972. Springer, 1997.
 [34] Holger Schwenk and Yoshua Bengio. Boosting neural networks. Neural computation, 12(8):1869–1887, 2000.
 [35] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
 [36] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [37] Yi Sun, Yuheng Chen, Xiaogang Wang, and Xiaoou Tang. Deep learning face representation by joint identificationverification. In Advances in neural information processing systems, pages 1988–1996, 2014.
 [38] Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deep learning face representation from predicting 10,000 classes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1891–1898, 2014.

 [39] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inceptionv4, inceptionresnet and the impact of residual connections on learning. In AAAI, pages 4278–4284, 2017.
 [40] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
 [41] Engin Tola, Vincent Lepetit, and Pascal Fua. Daisy: An efficient dense descriptor applied to widebaseline stereo. IEEE transactions on pattern analysis and machine intelligence, 32(5):815–830, 2010.
 [42] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.