Despite their admitted success, such boosting methods suffer from two essential problems. First, the weak classifier selected at each boosting step is limited by its own discriminative ability when faced with complex classification problems. In order to decrease the training error, the final classifier must be a linear combination of a large number of weak classifiers obtained through boosting. Second, a sufficiently effective learning procedure always drives the training error toward zero; however, when the decision boundary is unknown, how to decrease the test error once the training error approaches zero remains an open issue.
Hierarchical models have played an irreplaceable role in the multimedia and computer vision literature. Generally, such hierarchical architectures represent different layers of vision primitives such as pixels, edges, object parts, and so on. The basic principles of hierarchical models are twofold: (1) the layerwise learning philosophy, whose goal is to learn each single layer of the model individually and stack the layers to form the final architecture; (2) feature combination rules, which aim at utilizing combinations of features detected in lower layers to construct expressive higher-layer features by introducing an activation function. In this paper, this line of research inspires us to employ such compositional representations to construct expressive features with more discriminative power. Different from previous works [8, 9, 10] that apply hierarchical generative models, we address the problem of general image classification directly and design the final classifier to leverage both generalization and discrimination abilities.
This paper proposes a novel feature mining framework, namely deep boosting, which aims to construct effective discriminative features for the image classification task. Compared with the concept of 'mining' proposed in [2], whose goal is picking a subset of features as well as modeling the entire feature space, we use the word to describe the process of feature selection and combination, which is more closely related to [6]. In each layer, following the well-known boosting method [7], our deep model sequentially selects visual features to learn a classifier that reduces the training error. In order to construct high-level discriminative representations, we composite the features selected in the same layer and feed them into the higher layer to build a multilayer architecture. Another key to our approach is introducing spatial information when combining individual features, which makes the upper-layer representation more structured at the local scale. The experiments show that our method achieves excellent performance on image classification tasks.
2 Related Work
In the past few decades, many works have focused on designing different types of features to capture the characteristics of images, such as color, SIFT, and HoG [11]. Based on these feature descriptors, the Bag-of-Features (BoF) model is perhaps the most classical image representation method in computer vision and related multimedia applications. Several promising studies [12, 13, 14] have been published to improve this traditional approach in different aspects. Among these extensions, a class of sparse coding based methods [13, 14], which employ the spatial pyramid matching (SPM) kernel proposed by Lazebnik et al. [12], has achieved great success in image classification. Although more and more effective representation methods are being developed, the lack of high-level image expression still hinders the construction of an ideal vision system.
On the other hand, learning hierarchical models that simultaneously construct multiple levels of visual representation has received much attention recently. Our deep boosting method is partially motivated by recently developed deep learning techniques [8, 9, 16]. Different from previous hand-crafted feature design methods, deep models learn feature representations from raw data and effectively generate high-level semantic representations. However, as shown in recent studies, these network-based hierarchical models often contain thousands of nodes in a single layer and are too complex to control in real multimedia applications. In contrast, an obvious characteristic of our study is that we build up the deep architecture to generate expressive image representations simply, obtaining a near-optimal classification rate in each layer.
3 Deep Boosting for Image Recognition
3.1 Background: Gentle Adaboost
We start with a brief review of the Gentle Adaboost algorithm [7]. Without loss of generality, consider the two-class classification problem: let $\{(x_i, y_i)\}_{i=1}^{N}$ be the training samples, where $x_i$ is a feature representation of the $i$-th sample and $y_i \in \{-1, +1\}$; $w_i$ is the sample weight related to $x_i$. Gentle Adaboost [7, 17] provides a simple additive model with the form
$$F(x) = \sum_{m=1}^{M} f_m(x),$$
where $f_m(x)$ is called a weak classifier in the machine learning literature. It is often defined as the regression stump $f_m(x) = a\,\delta(x^{(k)} > \theta) + b$, where $\delta(\cdot)$ denotes the indicator function, $x^{(k)}$ is the $k$-th dimension of the feature vector $x$, $\theta$ is a threshold, and $(a, b)$ are two parameters contributing to the linear regression function. In iteration $m$, the algorithm learns the parameters of $f_m$ by weighted least-squares of $y_i$ to $x_i$ with weights $w_i$,
$$f_m = \arg\min_{k, \theta, a, b} \sum_{i=1}^{N} w_i \left( y_i - f(x_i) \right)^2, \quad k \in \{1, \ldots, d\},$$
where $d$ is the dimension of the feature space. In order to give more attention to the cases that are misclassified in each round, Gentle Adaboost adjusts the sample weights for the next iteration as $w_i \leftarrow w_i e^{-y_i f_m(x_i)}$ and updates $F(x) \leftarrow F(x) + f_m(x)$. At last, the algorithm outputs the result of the strong classifier in the form of a sign function, $\mathrm{sign}(F(x))$. Please refer to [7, 17] for more technical details.
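To make the procedure concrete, the round-by-round recipe above can be sketched in code. This is a minimal illustration with hypothetical function names rather than the authors' implementation: each round fits a regression stump by weighted least squares, adds it to the additive model, and re-weights the samples as described.

```python
import numpy as np

def fit_stump(X, y, w):
    """Fit a regression stump f(x) = a * 1[x_k > theta] + b by weighted
    least squares, searching all dimensions k and observed thresholds."""
    best = None
    for k in range(X.shape[1]):
        for theta in np.unique(X[:, k]):
            m = X[:, k] > theta                      # indicator 1[x_k > theta]
            w1, w0 = w[m].sum(), w[~m].sum()
            # closed-form weighted means on each side of the split
            b = (w[~m] * y[~m]).sum() / w0 if w0 > 0 else 0.0
            a = ((w[m] * y[m]).sum() / w1 if w1 > 0 else 0.0) - b
            err = (w * (y - (a * m + b)) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, k, theta, a, b)
    return best[1:]

def gentle_adaboost(X, y, rounds=10):
    """Additive model F(x) = sum_m f_m(x), labels y in {-1, +1}."""
    w = np.full(len(y), 1.0 / len(y))                # uniform initial weights
    F = np.zeros(len(y))
    for _ in range(rounds):
        k, theta, a, b = fit_stump(X, y, w)
        f = a * (X[:, k] > theta) + b                # selected weak classifier
        F += f                                       # update the additive model
        w *= np.exp(-y * f)                          # emphasize misclassified cases
        w /= w.sum()
    return np.sign(F)
```

On well-separated data a handful of rounds already drives the training error close to zero, which is exactly the behavior the introduction cautions about.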
The basic units in the Gentle Adaboost algorithm are individual features, also known as weak classifiers. Unlike the rectangle features used in [5] for face detection, we employ Gabor wavelet responses as the image feature representation. Let $I$ be an image defined on the image lattice domain $D$ and $B_{x, s, \alpha}$ be the Gabor wavelet elements with parameters $(x, s, \alpha)$, where $x$ is the central position belonging to the lattice domain, and $\alpha$ and $s$ denote the orientation and scale parameters. Following [18], we utilize a normalization term to make the Gabor responses comparable between different training images:
$$\sigma^2(s) = \frac{1}{|D|\,K} \sum_{x \in D} \sum_{\alpha} \left| \langle I, B_{x, s, \alpha} \rangle \right|^2,$$
where $|D|$ is the total number of pixels in image $I$, $K$ is the number of orientations, and $\langle I, B_{x, s, \alpha} \rangle$ denotes the convolution of the image with the wavelet element. For each image $I$, we normalize the local energy as $|\langle I, B_{x, s, \alpha} \rangle|^2 / \sigma^2(s)$ and define the positive square root of this normalized result as the feature response. In practice, we resize each image to a fixed size and apply one scale and eight orientations in our implementation, so there are $8|D|$ filter responses in total for each grayscale image.
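As a rough sketch of this normalization step, assuming the squared Gabor magnitudes have already been computed (e.g. by FFT-based convolution); the array layout and function name are our own:

```python
import numpy as np

def normalized_gabor_features(energy):
    """energy: array of shape (K, H, W) holding the squared magnitudes
    |<I, B_{x,s,a}>|^2 of K oriented Gabor responses over the H x W
    image lattice (assumed precomputed)."""
    K, H, W = energy.shape
    sigma2 = energy.sum() / (H * W * K)   # average energy over positions and orientations
    return np.sqrt(energy / sigma2)       # positive square root of the normalized energy
```

By construction the mean squared feature response equals one for every image, which is what makes responses comparable across training images.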
3.3 Discriminative Feature Selection
In this subsection, we set up the relationship between the weak classifiers and the Gabor wavelet representation. After the Gabor responses are computed, we learn the classification function utilizing the given feature set and a training set including both positive and negative images. Suppose the size of the training set is $N$. In our deep boosting system, the weak learning method is to select the single feature (i.e., weak classifier) that best divides the positive and negative samples. To fix the notation, let $x_i \in \mathbb{R}^d$ be the feature representation of image $I_i$, where $d$ is the dimension of the feature space. In the first layer, each dimension corresponds to one of the Gabor wavelet responses in Sec.(3.2). Specifically, each element of $x_i$ is a specific Gabor response of image $I_i$ (in the first layer) or a composition of such responses (in the other layers). Note that in the rest of the paper, we apply $x_i^{(k)}$ to denote the value of $x_i$ in the $k$-th dimension. In each round of the feature selection procedure, instead of using the indicator function in Eq.(2), we introduce the sigmoid function defined by the formula:
$$\phi(t) = \frac{1}{1 + e^{-t}}.$$
In this way, we consider a collection of regression functions, where each is a candidate weak classifier whose definition is given in Definition 1.
Definition 1 (Discriminative Feature Selection)
In each round, the algorithm retrieves all of the candidate regression functions, each of which is formulated as:
$$h_{k, \theta}(x) = a\, \phi\!\left( x^{(k)} - \theta \right) + b,$$
where $\phi$ is the sigmoid function defined in Eq.(4). The candidate function with the current minimum training error is selected as the current weak classifier $f_m$, such that
$$f_m = \arg\min_{h_{k, \theta}} \sum_{i=1}^{N} w_i \left( y_i - h_{k, \theta}(x_i) \right)^2,$$
where $h_{k, \theta}$ is associated with the $k$-th element of the feature vector and the function parameters $(\theta, a, b)$.
According to the above discussion, we have built a bridge between the weak classifiers and specific Gabor wavelets (or their compositions); thus, weak classifier learning can be viewed as the feature selection procedure in our deep boosting model.
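The selection rule in Definition 1 can be sketched as follows. The closed-form weighted least-squares fit of $(a, b)$ and the candidate threshold grid are our assumptions, since the paper does not specify how these parameters are estimated:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def select_weak_classifier(X, y, w, thetas):
    """Scan every dimension k and threshold theta; fit h(x) = a * sigmoid(x_k - theta) + b
    by weighted least squares and keep the candidate with minimum weighted error."""
    best = None
    for k in range(X.shape[1]):
        for theta in thetas:
            s = sigmoid(X[:, k] - theta)
            # closed-form weighted least squares for y ~ a * s + b
            sm = (w * s).sum() / w.sum()
            ym = (w * y).sum() / w.sum()
            var = (w * (s - sm) ** 2).sum()
            if var < 1e-12:                      # degenerate (constant) candidate
                continue
            a = (w * (s - sm) * (y - ym)).sum() / var
            b = ym - a * sm
            err = (w * (y - (a * s + b)) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, k, theta, a, b)
    return best                                  # (error, k, theta, a, b)
```

The smooth sigmoid plays the same role as the indicator in the regression stump while keeping the response differentiable near the threshold.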
3.4 Composite Feature Construction
The classification accuracy based on an individual feature, or single weak classifier, is usually low, and the strong classifier, which is a weighted linear combination of weak classifiers, can hardly decrease the test error once the training error approaches zero. It is therefore of interest to improve the discriminative ability of the features and to learn high-level representations as well.
In order to achieve the goal above, we introduce the feature combination strategy in Definition 2. All features selected in the feature selection stage are combined in a pair-wise manner with spatial constraints, and the output composite features of each layer are treated as the base components for constructing the next layer.
Definition 2 (Feature Combination Rule)
For each image $I$, whose feature representation is denoted by $x$, we combine two selected features in a local area as
$$c_{ij}^{(l+1)} = w_i\, r_i^{(l)} + w_j\, r_j^{(l)}, \quad (i, j) \in \Omega,$$
where $r_i^{(l)}$ and $r_j^{(l)}$ indicate the $i$-th and $j$-th feature responses corresponding to the image $I$ in layer $l$.
As illustrated in Fig.(1), $r_i^{(l)}$ and $r_j^{(l)}$ are the response values of selected features, indicated by the red circles in each layer. $w_i$ and $w_j$ are the combination weights, proportional to the training error rates of the $i$-th and $j$-th weak classifiers calculated over the training set. $\Omega$ is the local area determined by the projected coordinates of the composite feature on the normalized image (i.e., the resized image in practice). In the higher layers, the feature selection process is the same as in the lower layer, and the combination can be formulated as Eq.(6). Please refer to Fig.(2) for more details about feature combination.
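A minimal sketch of the combination rule follows. The paper states only that the weights are proportional to the weak classifiers' training error rates; normalizing each pair's weights to sum to one is our assumption:

```python
import numpy as np

def combine_features(responses, errors, pairs):
    """Pair-wise composition: for a local pair (i, j), the composite response
    is w_i * r_i + w_j * r_j, with weights derived from the weak classifiers'
    training error rates (normalized within the pair -- an assumption)."""
    out = []
    for i, j in pairs:                            # pairs restricted to a local spatial area
        wi, wj = errors[i], errors[j]
        total = wi + wj
        out.append((wi / total) * responses[i] + (wj / total) * responses[j])
    return np.array(out)
```

The composite responses then serve as the input feature vector for selection in the next layer.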
3.5 Multi-class Decision
We employ the naive one-against-all strategy to handle the multi-class classification task in this paper. Given the training data $\{(x_i, y_i)\}$ with $y_i \in \{1, \ldots, C\}$, we train $C$ binary strong classifiers, each of which returns a classification score for a given test image. In the testing phase, we predict the label of an image by referring to the classifier with the maximum score.
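The one-against-all scheme is straightforward to sketch; `train_binary` below stands in for any binary boosting routine returning a scoring function and is an assumed interface, not the paper's code:

```python
import numpy as np

def train_one_vs_all(X, labels, classes, train_binary):
    """Train one binary strong classifier per class: samples of class c are
    positives, all others negatives."""
    return {c: train_binary(X, np.where(labels == c, 1.0, -1.0)) for c in classes}

def predict(models, x):
    """Predicted label: the classifier with the maximum score."""
    return max(models, key=lambda c: models[c](x))
```

This keeps training embarrassingly parallel across classes, since each binary model is learned independently.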
4.1 Dataset and Experiment Setting
We apply the proposed method to the general classification task, using the Caltech 256 Dataset [19] and the 15 Scenes Dataset for validation. For both datasets, we split the data into training and test sets, utilize the training set to discover the discriminative features and learn the strong classifiers, and apply the test set to evaluate classification performance.
As mentioned in Sec.(3.2), for both datasets we resize each image to a fixed size and simply set the Gabor wavelets with one scale and eight orientations. In each layer, the strong classifier training is performed in a supervised manner, and the numbers of selected features are set to 1000, 800, and 500, respectively. We combine the selected features densely within each local block and obtain composite features at every layer. According to the experiments, the number of composite features in each layer depends heavily on the complexity of the image content. The visualization of the feature map in each layer is shown in Fig.(3).
We carry out the experiments on a PC with a Core i7-3960X 3.30 GHz CPU and 24 GB of memory. On average, training a single category model takes several hours, depending on the number of training examples and the complexity of the image content. The time cost for recognizing an image is on the order of seconds.
4.2 Experiment I: Caltech 256 Dataset
We evaluate the performance of our deep boosting algorithm on the Caltech 256 Dataset [19], which is widely used as a benchmark for the general image classification task [13, 14]. The Caltech 256 Dataset contains 30607 images in 256 categories. We consider the image classification problem on the Easy10 and Var10 image sets. We evaluate classification results over 10 random splits of the training and testing data (i.e., 60 training images and the rest as testing images) and report the performance using the mean per-class classification rate. Besides our own implementations, we also make use of released Matlab code from previously published literature [13, 14] in our experiments. As the Easy10 and Var10 result tables report, our method outperforms the other approaches [11, 13, 14] on both image sets.
4.3 Experiment II: 15 Scenes Dataset
We also test our method on the 15 Scenes Dataset. This dataset includes 4485 images in total, collected from 15 representative scene categories. Each category contains at least 200 images, and the categories vary from mountain and forest to office and living room. Following the standard benchmark procedure in [12, 13], we select 100 images per class for training and use the others for testing. The performance is evaluated by randomly drawing the training and testing images 10 times; the mean and standard deviation of the recognition rates are shown in the corresponding table. In this experiment, our deep boosting method achieves better performance than previous works [21, 13] as well. Note that, instead of HoG+SVM, we compare our approach with the GIST+SVM method in this experiment, due to the effectiveness of GIST [21] in the scene classification task. Considering the subtle engineering details involved, we could hardly achieve the desired results when applying some of the compared methods in our own implementations; in such cases we quote the reported results directly from the original publications or abandon the method as a point of comparison. We also compare the recognition rates utilizing different layers' strong classifiers; the results of the top five outstanding categories on the 15 Scenes Dataset are reported in Fig.(4). It is obvious that our proposed feature combination strategy improves the performance effectively.
This paper has studied a novel layered feature mining framework named deep boosting. Following the well-known boosting algorithm, the model sequentially selects visual features in each layer and composites the features selected in the same layer as the input of the upper layer, thereby constructing a hierarchical architecture. Our approach achieves excellent results on several image classification tasks. Moreover, the philosophy of such a deep model is very general and can be applied to other multimedia applications.
-  Isabelle Guyon and André Elisseeff, “An introduction to variable and feature selection,” Journal of Machine Learning Research, 2003.
-  Piotr Dollár, Zhuowen Tu, Hai Tao, and Serge Belongie, “Feature mining for image classification,” in CVPR, 2007.
-  Liang Lin, Ping Luo, Xiaowu Chen, and Kun Zeng, “Representing and recognizing objects with massive local image patches,” Pattern Recognition, vol. 45, no. 1, pp. 231–240, 2012.
-  Gavin Brown, Adam Pocock, Ming-Jie Zhao, and Mikel Luján, “Conditional likelihood maximisation: A unifying framework for information theoretic feature selection,” Journal of Machine Learning Research, 2012.
-  Paul A. Viola and Michael J. Jones, “Robust real-time face detection,” in ICCV, 2001.
-  Junsong Yuan, Jiebo Luo, and Ying Wu, “Mining compositional features for boosting,” in CVPR, 2008.
-  Jerome Friedman, Trevor Hastie, and Robert Tibshirani, “Additive logistic regression: a statistical view of boosting,” Annals of Statistics, 1998.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, 1998.
-  Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, 2006.
-  Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng, “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations,” in ICML, 2009.
-  Navneet Dalal and Bill Triggs, “Histograms of oriented gradients for human detection,” in CVPR, 2005.
-  Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in CVPR, 2006.
-  Jianchao Yang, Kai Yu, Yihong Gong, and Thomas S. Huang, “Linear spatial pyramid matching using sparse coding for image classification,” in CVPR, 2009.
-  Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas S. Huang, and Yihong Gong, “Locality-constrained linear coding for image classification,” in CVPR, 2010.
-  Liang Lin, Tianfu Wu, Jake Porway, and Zijian Xu, “A stochastic graph grammar for compositional object representation and recognition,” Pattern Recognition, vol. 42, no. 7, pp. 1297–1307, 2009.
-  Ping Luo, Xiaogang Wang, and Xiaoou Tang, “A deep sum-product architecture for robust facial attributes analysis,” in ICCV, 2013.
-  Antonio Torralba, Kevin P. Murphy, and William T. Freeman, “Sharing features: Efficient boosting procedures for multiclass object detection,” in CVPR, 2004.
-  Ying Nian Wu, Zhangzhang Si, Haifeng Gong, and Song Chun Zhu, “Learning active basis model for object detection and recognition,” International Journal of Computer Vision, 2010.
-  G. Griffin, A. Holub, and P. Perona, “Caltech-256 object category dataset,” Caltech Technical Report 7694, 2007.
-  Gal Chechik, Varun Sharma, Uri Shalit, and Samy Bengio, “Large scale online learning of image similarity through ranking,” Journal of Machine Learning Research, 2010.
-  Aude Oliva and Antonio Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” International Journal of Computer Vision, 2001.